Anonymizing logs for GDPR using Logback pattern layout

In the post-GDPR world, logging deserves special attention. The regulation obliges companies not only to respect users’ right to be forgotten but also to make every effort to protect their privacy.

Until now, a common practice in many applications could be described as “extensive logging”. That is arguably pragmatic: disk space is cheaper than programmers’ time. If logging more makes it easier and quicker for a developer to debug potential problems, it seems like a no-brainer. At least it did until May 25th, 2018, when GDPR came into force.

GDPR reminded many of us how important users’ privacy is and that it is the responsibility of every application to respect that. Therefore it is essential for the logs to be GDPR-compliant. Here are some recommendations to achieve this:

Do not log users’ sensitive (private) data unless it is necessary

This states the obvious, but in real-world scenarios it is not always easy to achieve. Nevertheless, you should strive to build your applications in a GDPR-driven fashion, and one way to do that is to pay special attention to your logs.

Set up reasonable log retention

In most cases you do not need logs after a specific period of time (depending on your business). Ideally you should set up the retention policy for all your logs in one place. If you use any log management service, it should be fairly easy.

Structure your logs

If, for some reason, you have to log private data, do it in a structured way. JSON, XML or any other machine-friendly format is a good choice. Try not to log sensitive information in a “random” fashion; think about how easy it would be to find that piece of data using a regular expression. Structured logs are easier to anonymize.
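To illustrate why structure helps, here is a minimal sketch, assuming a JSON-style entry with a field called `ssn`. The class and method names (`StructuredLogDemo`, `findSsn`) are made up for this example. The field name gives the regular expression something to anchor on, which a free-form sentence does not offer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StructuredLogDemo {

    // Field-targeted pattern: matches any JSON-style entry that names the field "ssn".
    static final Pattern SSN_FIELD = Pattern.compile("\"ssn\"\\s*:\\s*\"(.*?)\"");

    /** Returns the value of the "ssn" field, or null if the line has no such field. */
    static String findSsn(String logLine) {
        Matcher m = SSN_FIELD.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String structured = "{ \"user_id\" : \"1234\", \"ssn\" : \"3310104322\" }";
        String freeForm = "User 1234 registered, social security no. is 3310104322";
        System.out.println(findSsn(structured)); // prints 3310104322
        System.out.println(findSsn(freeForm));   // prints null: nothing to anchor on
    }
}
```

Catching the free-form variant would take a separate pattern for every phrasing that developers happen to use.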

Anonymize/mask sensitive data

If you find yourself in a situation in which you already log something that may contain private data, you should consider implementing anonymization mechanisms. Solutions may differ depending on your needs – encryption, masking or complete removal are some of the possible options.
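As a rough sketch of two of these options, the snippet below masks or removes the value of a hypothetical `ssn` field. The class name, the method names and the `[REMOVED]` placeholder are assumptions made for illustration. Encryption is left out because it requires key management that goes beyond a short example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnonymizationOptions {

    // Group 1: field name and opening quote, group 2: the value, group 3: closing quote.
    private static final Pattern SSN = Pattern.compile("(\"ssn\"\\s*:\\s*\")(.*?)(\")");

    /** Masking: keep the field and the value length, replace every character with '*'. */
    static String mask(String line) {
        StringBuilder sb = new StringBuilder(line);
        Matcher m = SSN.matcher(line);
        while (m.find()) {
            for (int i = m.start(2); i < m.end(2); i++) {
                sb.setCharAt(i, '*');
            }
        }
        return sb.toString();
    }

    /** Removal: drop the value entirely and leave a fixed placeholder behind. */
    static String remove(String line) {
        return SSN.matcher(line).replaceAll("$1[REMOVED]$3");
    }

    public static void main(String[] args) {
        String line = "{ \"user_id\" : \"1234\", \"ssn\" : \"3310104322\" }";
        System.out.println(mask(line));
        System.out.println(remove(line));
    }
}
```

Masking preserves the length of the value, which can itself leak a little information. Removal avoids that at the cost of a log line that no longer mirrors the original payload.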

Masking Logs using Logback pattern layout

Logging request/response payloads is generally considered good practice, but in light of GDPR you should expect them to contain private data that you are supposed to anonymize. Let’s assume your application received and logged the following request:

INFO [2018-07-24 12:41:31,681] [qtp1777178337-48] com.schibsted.payment.wire: Container in-bound request
-->> POST http://localhost:8077/api/mask
-->> Cookie: JSESSIONID=node01gyp0jf2b114884a0ki1qm4bh0.node0
-->> Cache-Control: no-cache
-->> Accept: */*
-->> Connection: keep-alive
-->> Host: localhost:8077
-->> Accept-Encoding: gzip, deflate
-->> Content-Length: 205
-->> Content-Type: application/json
{ "user_id" : "1234", "ssn" : "3310104322", "favourite_team" : "Juventus", "address" : "Wiejska 4, Warszawa", "additional_info_1" : "192.168.1.1", "additional_info_2" : "bianconeri36@gmail.com" }

As you can see, structuring your logs lets you identify where the private data may be. At first glance, `ssn` (Social Security Number) and `address` are the fields that you should mask. Luckily, you can easily parse the log with regular expressions to find them. Moreover, the `Cookie` header is something you should not log. You know the drill already. But wait! A client that uses your API has put some other personal information in two free-text fields: `additional_info_1` and `additional_info_2`. Fortunately, an IP address and an e-mail address are also quite easy to catch with a regex. These are two good examples of sensitive information to look out for in your logs to make your masking mechanism more bulletproof.
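The following sketch shows how such values can be found in free-text fields. It uses the same deliberately simple IPv4 and e-mail expressions that appear in the Logback configuration further down; the class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FreeTextScan {

    // Simple on purpose: these catch anything shaped like a dotted quad or name@host.tld.
    static final Pattern IPV4 = Pattern.compile("\\d+\\.\\d+\\.\\d+\\.\\d+");
    static final Pattern EMAIL = Pattern.compile("\\w+@\\w+\\.\\w+");

    /** Collects every substring of the text that the pattern matches. */
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String payload = "{ \"additional_info_1\" : \"192.168.1.1\", "
                + "\"additional_info_2\" : \"bianconeri36@gmail.com\" }";
        System.out.println(findAll(IPV4, payload));  // prints [192.168.1.1]
        System.out.println(findAll(EMAIL, payload)); // prints [bianconeri36@gmail.com]
    }
}
```

Keep in mind that `\w` does not match dots or hyphens, so the e-mail pattern only covers part of an address such as `john.doe@example.co.uk`. If your users have addresses like that, the expression needs to be widened.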

How to make sure all this information is always masked in your logs?

There are, of course, many ways. One solution is to implement a custom servlet request/response filter. Our team tried this by implementing custom JAX-RS `ClientRequestFilter` and `ClientResponseFilter` classes, but the biggest downside of the approach was that it only filters request/response payloads and not all logs.

Thus we needed to dig “deeper” and decided to mask the logs centrally, by configuring masking rules for all log entries produced by Logback. To do that, we had to implement a custom `ch.qos.logback.classic.PatternLayout`.

Configuration:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
...

   <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
       <encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
           <layout class="com.schibsted.payment.logback.PatternMaskingLayout">
               <maskPattern>\"ssn\"\s*:\s*\"(.*?)\"</maskPattern> <!-- SSN JSON pattern -->
               <maskPattern>\"address\"\s*:\s*\"(.*?)\"</maskPattern> <!-- address JSON pattern -->
               <maskPattern>(\d+\.\d+\.\d+\.\d+)</maskPattern> <!-- simple IPv4 pattern -->
               <maskPattern>(\w+@\w+\.\w+)</maskPattern> <!-- simple email pattern -->
               <maskPattern>Cookie:\s*(.*?)\s</maskPattern> <!-- Cookie header pattern -->
               <pattern>%-5p [%d{ISO8601,UTC}] [%thread] %c: %m%n%rootException</pattern>
           </layout>
       </encoder>
   </appender>

...
</configuration>

The idea behind the configuration is to give every Logback appender you need a custom layout. In our case it is `PatternMaskingLayout`, a subclass of `ch.qos.logback.classic.PatternLayout`. Each mask pattern is a regular expression whose capturing group matches one piece of sensitive data.

PatternMaskingLayout:

package com.schibsted.payment.logback;

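import ch.qos.logback.classic.spi.ILoggingEvent;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;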

public class PatternMaskingLayout extends ch.qos.logback.classic.PatternLayout {

   private Pattern multilinePattern;
   private List<String> maskPatterns = new ArrayList<>();

   public void addMaskPattern(String maskPattern) { // invoked for every single entry in the xml
       maskPatterns.add(maskPattern);
       multilinePattern = Pattern.compile(
               maskPatterns.stream()
                       .collect(Collectors.joining("|")), // build pattern using logical OR
               Pattern.MULTILINE
       );
   }

   @Override
   public String doLayout(ILoggingEvent event) {
       return maskMessage(super.doLayout(event)); // calling superclass method is required
   }

   private String maskMessage(String message) {
       if (multilinePattern == null) {
           return message;
       }
       StringBuilder sb = new StringBuilder(message);
       Matcher matcher = multilinePattern.matcher(sb);
       while (matcher.find()) {
           IntStream.rangeClosed(1, matcher.groupCount()).forEach(group -> {
               if (matcher.group(group) != null) {
                   IntStream.range(matcher.start(group), matcher.end(group))
                           .forEach(i -> sb.setCharAt(i, '*')); // replace each character with asterisk
               }
           });
       }
       return sb.toString();
   }

}

The overridden `doLayout` method masks the data matched by any of the configured patterns in every log message of your application. A single multiline pattern is built from the `maskPatterns` list read from `logback.xml`. Unfortunately, Logback does not seem to support constructor injection for a list of properties (or does it? I couldn’t find a better solution). Instead, `addMaskPattern` is invoked once for every config entry (for a single property, a setter method would be invoked), so we have to recompile the pattern every time a new regex is added to the list.
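If you want to try out mask patterns without starting Logback, the masking logic can be lifted into a plain class. The sketch below is a hypothetical stand-alone variant (the name `StandaloneMasker` is made up), not part of the layout itself. Because it receives all patterns at once, it compiles the combined expression a single time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StandaloneMasker {

    private final Pattern combined;

    StandaloneMasker(List<String> maskPatterns) {
        // Same idea as in the layout: OR the patterns together, one capturing group each.
        this.combined = Pattern.compile(String.join("|", maskPatterns), Pattern.MULTILINE);
    }

    /** Replaces every character captured by any group with an asterisk. */
    String mask(String message) {
        StringBuilder sb = new StringBuilder(message);
        Matcher matcher = combined.matcher(message);
        while (matcher.find()) {
            for (int group = 1; group <= matcher.groupCount(); group++) {
                if (matcher.group(group) != null) { // only one alternative matches at a time
                    for (int i = matcher.start(group); i < matcher.end(group); i++) {
                        sb.setCharAt(i, '*');
                    }
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StandaloneMasker masker = new StandaloneMasker(Arrays.asList(
                "\"ssn\"\\s*:\\s*\"(.*?)\"",
                "(\\w+@\\w+\\.\\w+)"));
        System.out.println(masker.mask(
                "{ \"ssn\" : \"3310104322\", \"contact\" : \"bianconeri36@gmail.com\" }"));
    }
}
```

A class like this is also easy to cover with unit tests, which helps catch a broken expression before it reaches production logs.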

After that has been applied, when we run the application our example log entry looks like this:

INFO [2018-07-24 12:43:44,585] [qtp1777178337-48] com.schibsted.payment.wire: Container in-bound request
-->> POST http://localhost:8077/api/mask
-->> Cookie: ************************************************
-->> Cache-Control: no-cache
-->> Accept: */*
-->> Connection: keep-alive
-->> Host: localhost:8077
-->> Accept-Encoding: gzip, deflate
-->> Content-Length: 205
-->> Content-Type: application/json
{ "user_id" : "1234", "ssn" : "**********", "favourite_team" : "Juventus", "address" : "*******************", "additional_info_1" : "***********", "additional_info_2" : "**********************" }

Useful tips

Naturally, you always have to be careful with the matchers. If the regex is broken, you may end up masking too much. Notice that I used reluctant quantifiers in my regular expressions. It wasn’t a coincidence. Let’s change the `ssn` regex to use a greedy quantifier to illustrate that:

\"ssn\"\s*:\s*\"(.*)\"

Output log message this time looks like this:

{ "user_id" : "1234", "ssn"  : "*****************************************************************************************************************************************************************" }

Bummer! Not exactly what we wanted to accomplish.
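The difference is easy to reproduce outside Logback. The sketch below prints what the capturing group matches for the greedy and the reluctant variant on a shortened payload. The third expression, a negated character class, is an alternative that is not part of the original configuration; it stops at the first closing quote regardless of greediness. The helper name `captured` is made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    /** Returns what capturing group 1 of the regex matches in the text, or null. */
    static String captured(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String json = "{ \"ssn\" : \"3310104322\", \"favourite_team\" : \"Juventus\" }";

        // Greedy: runs to the last quote on the line.
        // prints 3310104322", "favourite_team" : "Juventus
        System.out.println(captured("\"ssn\"\\s*:\\s*\"(.*)\"", json));

        // Reluctant: stops at the first closing quote. prints 3310104322
        System.out.println(captured("\"ssn\"\\s*:\\s*\"(.*?)\"", json));

        // Negated class: cannot cross a quote at all. prints 3310104322
        System.out.println(captured("\"ssn\"\\s*:\\s*\"([^\"]*)\"", json));
    }
}
```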

Final thoughts

To wrap up: even if you are very careful with logging, it is inevitable that some day you will discover a piece of information there that should never have been logged. Whether the cause is a new teammate who hasn’t fully absorbed the project’s common practices, a misbehaving third-party library or an unwary API consumer, you should be prepared for the worst and build as many protection layers as you can. One of them could be masking sensitive data in your application logs.