Log Analysis Part 2: Using Logstash’s Grok Filter to Parse Docker Nginx Logs

2020-05-11

The Logging Destination is Reachable... Now What?

In a previous blog post, I explained how Fast Reverse Proxy (FRP) is used to expose an ELK stack within my home network so the blakejarvis.com server can send logs to it. Once this connection was available, I explored options for shipping logs, including sending them directly to Elasticsearch from the Docker container with Filebeat, reading the files located at /var/lib/docker/containers/*.json, and using one of the logging drivers native to Docker.

I quickly realized that, regardless of how I sent the data, it arrived in varying formats with extra fields that needed to be removed. This made it impractical to send the data directly to Elasticsearch, which can only guess at how to index it.

Sending to Logstash

Logstash is the component of the ELK stack used to transform data before it is sent on to another destination. After determining I could not send data directly to Elasticsearch, I re-configured my FRP client to forward the Logstash port, rather than the Elasticsearch port, to the blakejarvis.com server, and updated my docker-compose.yml file on blakejarvis.com to use the gelf logging driver. This change told Docker to stop writing log files to the host and instead send them over UDP to localhost on port 5000, where the Logstash port is forwarded via FRP.
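For context, the FRP client change itself is small. Below is a minimal sketch of what the relevant frpc.ini stanza might look like on the ELK host; the section name, server address, and control port are illustrative and should match the setup from the previous post.

frpc.ini
[common]
; frps server running on blakejarvis.com (address and control port are illustrative)
server_addr = blakejarvis.com
server_port = 7000

; expose the local Logstash GELF input as UDP port 5000 on the blakejarvis.com server
[logstash-gelf]
type = udp
local_ip = 127.0.0.1
local_port = 5000
remote_port = 5000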

New blakejarvis.com docker-compose file:

docker-compose.yml
[snip]
  nginx-proxy:
    image: nginx-proxy-blake:test1
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "/etc/nginx/vhost.d"
      - "/usr/share/nginx/html"
      - "/var/run/docker.sock:/tmp/docker.sock:ro"
      - "/etc/nginx/certs"
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://localhost:5000"
[snip]

Logstash Pipeline: On the ELK host within the home network, ELK is deployed using the docker-elk Docker Compose project. For testing, the Logstash pipeline located at docker-elk/logstash/pipeline/logstash.conf has the following basic configuration:

logstash.conf
input {
    gelf {
        port => 5000
    }
}
output {
    elasticsearch {
        hosts => "localhost:9200"
        user => "elastic"
        password => "[password]"
        index => "logstash-%{+YYYY.MM.dd}"
        ilm_enabled => "false"
    }
}
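Before pointing the production containers at this pipeline, it can be sanity-checked from the sending host by pushing a one-off test message through the gelf driver (assuming Docker is available there and the FRP tunnel is up):

# Send a single test log line through the gelf driver to the forwarded port
docker run --rm \
  --log-driver gelf \
  --log-opt gelf-address=udp://localhost:5000 \
  alpine echo "hello from the gelf driver"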

This setup is enough to begin logging NGINX HTTP requests/responses and to identify what parsing is needed. The unparsed logs can be viewed in Kibana:

Example Web Requests

Because the logs from the NGINX container arrive in various forms and contain bad bits, I needed to understand the message fields before any parsing could occur. After pulling a large enough sample from Kibana, I performed the parsing exercise by hand to identify where data could be consistently extracted.

Example 1: Standard Web Request

  • Raw Message: nginx.1 | blakejarvis.com 1.2.3.4 - - [28/Apr/2020:23:45:00 +0000] "GET / HTTP/2.0" 200 4131 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"

  • Bad bits: 

  • Docker Container name: nginx.1

  • Bad bits: | 

  • Host requested: blakejarvis.com (taken from the Host header of the HTTP request; other common values for this field were the server's IP address or "_")

  • Source IP: 1.2.3.4

  • Unknown Fields: - - (in NGINX's combined log format these positions hold the remote logname and remote user, unused here)

  • Date: [28/Apr/2020:23:45:00 +0000]

  • HTTP verb: GET

  • Uniform Resource Identifier (URI) requested: /

  • HTTP version: HTTP/2.0

  • Response code: 200

  • Bytes transferred: 4131

  • Referrer: "-"

  • User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36

Example 2: A Port Scan

  • Raw Message: nginx.1 | blakejarvis.com 1.2.3.4 - - [28/Apr/2020:23:29:29 +0000] "\x5Cx16\x5Cx03\x5Cx01\x5Cx00\x5CxD2\x5Cx01\x5Cx00\x5Cx00\x5CxCE\x5Cx03\x5Cx03\x5CxE8\x5Cx10\x5Cx80\x5CxC1\x5Cx0E\x5CxF9J\x5Cx06R\x5Cx8E/t\x5CxC90\x5Cx85\x5CxEC\x5Cx04t#\x5CxEB^\x5CxD8\x5Cx18\x5Cx01Q\x5Cx02\x5Cx8B\x5CxCA\x5CxDB\x5CxB3\x5CxB3\x5CxD3\x5Cx00\x5Cx00b\x5CxC00\x5CxC0,\x5CxC0/\x5CxC0+\x5Cx00\x5Cx9F\x5Cx00\x5Cx9E\x5CxC02\x5CxC0.\x5CxC01\x5CxC0-\x5Cx00\x5CxA5\x5Cx00\x5CxA1\x5Cx00\x5CxA4\x5Cx00\x5CxA0\x5CxC0(\x5CxC0$\x5CxC0\x5Cx14\x5CxC0" 400 150 "-" "-"

  • Bad bits: 

  • Docker Container name: nginx.1

  • Bad bits: | 

  • Host requested: blakejarvis.com

  • Source IP: 1.2.3.4

  • Unknown Fields: - -

  • Date: [28/Apr/2020:23:29:29 +0000]

  • Malformed Request: \x5Cx16\x5Cx03\x5Cx01\x5Cx00\x5CxD2\x5Cx01\x5Cx00\x5Cx00\x5CxCE\x5Cx03\x5Cx03\x5CxE8\x5Cx10\x5Cx80\x5CxC1\x5Cx0E\x5CxF9J\x5Cx06R\x5Cx8E/t\x5CxC90\x5Cx85\x5CxEC\x5Cx04t#\x5CxEB^\x5CxD8\x5Cx18\x5Cx01Q\x5Cx02\x5Cx8B\x5CxCA\x5CxDB\x5CxB3\x5CxB3\x5CxD3\x5Cx00\x5Cx00b\x5CxC00\x5CxC0,\x5CxC0/\x5CxC0+\x5Cx00\x5Cx9F\x5Cx00\x5Cx9E\x5CxC02\x5CxC0.\x5CxC01\x5CxC0-\x5Cx00\x5CxA5\x5Cx00\x5CxA1\x5Cx00\x5CxA4\x5Cx00\x5CxA0\x5CxC0(\x5CxC0$\x5CxC0\x5Cx14\x5CxC0

  • Response Code: 400

  • Bytes transferred: 150

  • Referrer: "-"

  • User Agent: "-"

The Grok Parser

Grok is a Logstash filter that parses and transforms data using regular expressions, and it can match multiple different log patterns. Grok parsing produces key-value pairs, which are easily consumable by Elasticsearch. The built-in Grok patterns can be found in the logstash-patterns-core repository on GitHub.

After much trial and error in the Grok Debugger, the patterns below finally matched the two most common types of web request. In both, the leading %{GREEDYDATA}\|%{GREEDYDATA}%{INT}m portion consumes the container name, the pipe separator, and what appears to be a terminal color escape sequence (the "bad bits") in front of the requested hostname.

Standard Web Request Grok Pattern:

%{GREEDYDATA}\|%{GREEDYDATA}%{INT}m%{GREEDYDATA:host_requested}%{SPACE}%{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|-)\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}

Port Scan Grok Pattern:

%{GREEDYDATA}\|%{GREEDYDATA:junk}%{INT}m%{GREEDYDATA:host_requested}%{SPACE} %{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\] \"%{GREEDYDATA:payload}\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}
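As a sanity check, pasting the Example 1 message and the standard pattern into the Grok Debugger should produce key-value pairs along these lines (an illustrative sketch based on the field breakdown above, not a verbatim debugger capture):

{
  "host_requested": "blakejarvis.com",
  "client_ip": "1.2.3.4",
  "timestamp": "28/Apr/2020:23:45:00 +0000",
  "verb": "GET",
  "request": "/",
  "httpversion": "2.0",
  "response": "200",
  "bytes": "4131",
  "referrer": "-",
  "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"
}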

Implementing the Grok Parser in the Logstash Pipeline

Below is the full logstash.conf configuration file. Since every log entry contains NGINX metadata regardless of what the HTTP request is, that metadata is matched in its own grok {} block and the remainder of the message is labeled body. The contents of body are then passed to a second grok {} block, where they are matched as either a standard HTTP request or a port scan.

In addition to Grok parsing, the following Logstash filters are applied:

  • Remove the original message field if the tags do not contain _grokparsefailure (the tag that indicates a failed Grok parse)

  • Set the date using a date {} filter

  • Remove the body field, since its contents have already been parsed into key-value pairs

  • Use Logstash's GeoIP filter to obtain the geolocation of the client IP address

logstash.conf
input {
    gelf {
        port => 5000
    }
}

filter {
    # Match the NGINX metadata present in every log entry;
    # everything after the timestamp is captured as "body"
    grok {
        match => [ "message", "%{GREEDYDATA}\|%{GREEDYDATA}%{INT}m%{GREEDYDATA:host_requested}%{SPACE} %{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\]%{GREEDYDATA:body}"]
    }

    grok {
        # Match a standard HTTP request
        match => ["body", " \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|-)\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}"]
        # Match port scans
        match => ["body", " \"%{GREEDYDATA:payload}\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}"]
    }

    # If the message is parsed correctly, remove the original message field
    if "_grokparsefailure" not in [tags] {
        mutate {
            remove_field => ["message"]
        }
    }

    # Set the event timestamp from the parsed NGINX timestamp
    date {
        locale => "en"
        timezone => "America/New_York"
        match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }

    # The body field has been parsed into key-value pairs, so drop it
    mutate {
        remove_field => ["body"]
    }

    # Geolocate the client IP address
    geoip {
        source => "client_ip"
    }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
        user => "elastic"
        password => "[password]"
        index => "logstash-%{+YYYY.MM.dd}"
        ilm_enabled => "false"
    }
}

The Logstash Docker container needs to be restarted for the updated pipeline to take effect.
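From the docker-elk directory, something along these lines should do it; the logstash service name matches the docker-elk compose file, and the credentials are placeholders:

# Restart only the Logstash service so it reloads the pipeline
docker-compose restart logstash

# Confirm documents are landing in the daily logstash-* indices
curl -u elastic:[password] "localhost:9200/_cat/indices/logstash-*?v"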

Performing Queries Using Parsed Data

With the log data parsed, queries can be performed to answer questions such as: what are the 10 most common user agents over the last 30 days? The table below was exported from a Kibana Data Table visualization.

Most Common User Agents | Count
"-" | 998
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36" | 955
"AWS Security Scanner" | 895
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36" | 546
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/78.0.3904.70 Safari/537.36" | 396
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0" | 374
"Mozilla/5.0 (iPhone; CPU iPhone OS 13_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1" | 354
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0" | 208
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36" | 186
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" | 186
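The same question can also be answered directly against Elasticsearch with a terms aggregation. Below is a sketch that assumes the default dynamic mapping created a user_agent.keyword subfield; adjust the field name and credentials to match the actual mapping:

curl -s -u elastic:[password] "localhost:9200/logstash-*/_search" \
  -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-30d" } } },
  "aggs": {
    "top_user_agents": {
      "terms": { "field": "user_agent.keyword", "size": 10 }
    }
  }
}'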

An interesting finding is that the empty user agent "-" is the most common one, indicating the web server likely sees a disproportionately high amount of illegitimate traffic. In the next and final blog post of this log analysis series, the NGINX logs will be enriched with GreyNoise, a service that analyzes internet scan data and background noise, to identify the common attack trends the blakejarvis.com blog faces.
