# Log Analysis Part 2: Using Logstash’s Grok Filter to Parse Docker Nginx Logs

### The Logging Destination is Reachable... Now What?

In a previous blog [post](https://blakejarvis.com/using-fast-reverse-proxy-frp-to-expose-logging-infrastructure/), I explained how Fast Reverse Proxy (FRP) is used to expose an ELK stack within a home network so the blakejarvis.com server has somewhere to send logs. Once this connection was available, I explored options for shipping logs, including sending directly to Elasticsearch from the Docker container using [Filebeat](https://www.elastic.co/beats/filebeat), reading the JSON log files under `/var/lib/docker/containers/`, and using the various [logging drivers](https://docs.docker.com/config/containers/logging/configure/) native to Docker.

I quickly realized that, regardless of how I sent the data, it arrived in various forms with extra fields that needed to be removed. This ruled out sending the data directly to Elasticsearch, which would only be able to guess at how to index it.

### Sending to Logstash

[Logstash](https://www.elastic.co/logstash) is the component of the ELK stack used to transform data before it is sent on to another destination. After determining I could not send data directly to Elasticsearch, I re-configured my FRP client to forward the Logstash port, rather than the Elasticsearch port, to the blakejarvis.com server, and updated my `docker-compose.yml` file on blakejarvis.com to use the [gelf logging driver](https://docs.docker.com/config/containers/logging/gelf/). This change tells Docker to stop writing log files to the host and instead send them to localhost on port 5000 over UDP, where the Logstash port is exposed through FRP.

**New blakejarvis.com docker-compose file:**

{% code title="docker-compose.yml" %}

```yaml
[snip]
  nginx-proxy:
    image: nginx-proxy-blake:test1
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - "/etc/nginx/vhost.d"
      - "/usr/share/nginx/html"
      - "/var/run/docker.sock:/tmp/docker.sock:ro"
      - "/etc/nginx/certs"
    logging:
      driver: "gelf"
      options:
        gelf-address: "udp://localhost:5000"
[snip]
```

{% endcode %}

**Logstash Pipeline:**\
On the ELK host within the home network, ELK is deployed with the [docker-elk](https://github.com/deviantony/docker-elk) Docker Compose project. For testing, the Logstash pipeline at `docker-elk/logstash/pipeline/logstash.conf` has this basic configuration:

{% code title="logstash.conf" %}

```
input {
    gelf {
        port => 5000
    }
}
output {
    elasticsearch {
        hosts => "localhost:9200"
        user => "elastic"
        password => "[password]"
        index => "logstash-%{+YYYY.MM.dd}"
        ilm_enabled => "false"
    }
}
```

{% endcode %}

This setup is enough to begin logging NGINX HTTP requests and responses, and to identify the parsing that will be needed. The unparsed logs can be viewed in Kibana:

![](https://3603814209-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-MQ8cxwnVFEmRLJ-r5WP%2F-MQ8l2lQ1UUY7PEtROsI%2F-MQ8l_xHmlM_GLUjcenc%2Flog-2.11.png?alt=media\&token=8501d279-6e29-4712-81a8-bb3b3128ed77)

### Example Web Requests

With logs from the NGINX container arriving in various forms and carrying bad bits (stray ANSI terminal color codes, such as `[0;33;1m`, left over from the container's colorized output), I needed to understand the message fields before any parsing could occur. After pulling a large enough sample from Kibana, I performed this parsing exercise by hand to identify where data could be consistently extracted.

#### Example 1: Standard Web Request

* Raw Message: `[0;33;1mnginx.1    | [0mblakejarvis.com 1.2.3.4 - - [28/Apr/2020:23:45:00 +0000] "GET / HTTP/2.0" 200 4131 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"`
* Bad bits: `[0;33;1m`
* Docker Container name: `nginx.1`
* Bad bits: `| [0m`
* Host requested: `blakejarvis.com` (taken from the Host header of the HTTP request; other common values here were the server's IP address or `_`)
* Source IP: `1.2.3.4`
* Unknown Fields: `- -` (likely the unused ident and remote-user fields of NGINX's combined log format)
* Date: `[28/Apr/2020:23:45:00 +0000]`
* HTTP verb: `GET`
* Uniform Resource Identifier (URI) requested: `/`
* HTTP version: `HTTP/2.0`
* Response code: `200`
* Bytes transferred: `4131`
* Referrer: `"-"`
* User Agent: `Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36`

#### Example 2: A Port Scan

* Raw Message: `[0;33;1mnginx.1 | [0mblakejarvis.com 1.2.3.4 - - [28/Apr/2020:23:29:29 +0000] "\x5Cx16\x5Cx03\x5Cx01\x5Cx00\x5CxD2\x5Cx01\x5Cx00\x5Cx00\x5CxCE\x5Cx03\x5Cx03\x5CxE8\x5Cx10\x5Cx80\x5CxC1\x5Cx0E\x5CxF9J\x5Cx06R\x5Cx8E/t\x5CxC90\x5Cx85\x5CxEC\x5Cx04t#\x5CxEB^\x5CxD8\x5Cx18\x5Cx01Q\x5Cx02\x5Cx8B\x5CxCA\x5CxDB\x5CxB3\x5CxB3\x5CxD3\x5Cx00\x5Cx00b\x5CxC00\x5CxC0,\x5CxC0/\x5CxC0+\x5Cx00\x5Cx9F\x5Cx00\x5Cx9E\x5CxC02\x5CxC0.\x5CxC01\x5CxC0-\x5Cx00\x5CxA5\x5Cx00\x5CxA1\x5Cx00\x5CxA4\x5Cx00\x5CxA0\x5CxC0(\x5CxC0$\x5CxC0\x5Cx14\x5CxC0" 400 150 "-" "-"`
* Bad bits: `[0;33;1m`
* Docker Container name: `nginx.1`
* Bad bits: `| [0m`
* Host requested: `blakejarvis.com`
* Source IP: `1.2.3.4`
* Unknown Fields: `- -`
* Date: `[28/Apr/2020:23:29:29 +0000]`
* Malformed Request: `\x5Cx16\x5Cx03\x5Cx01\x5Cx00\x5CxD2\x5Cx01\x5Cx00\x5Cx00\x5CxCE\x5Cx03\x5Cx03\x5CxE8\x5Cx10\x5Cx80\x5CxC1\x5Cx0E\x5CxF9J\x5Cx06R\x5Cx8E/t\x5CxC90\x5Cx85\x5CxEC\x5Cx04t#\x5CxEB^\x5CxD8\x5Cx18\x5Cx01Q\x5Cx02\x5Cx8B\x5CxCA\x5CxDB\x5CxB3\x5CxB3\x5CxD3\x5Cx00\x5Cx00b\x5CxC00\x5CxC0,\x5CxC0/\x5CxC0+\x5Cx00\x5Cx9F\x5Cx00\x5Cx9E\x5CxC02\x5CxC0.\x5CxC01\x5CxC0-\x5Cx00\x5CxA5\x5Cx00\x5CxA1\x5Cx00\x5CxA4\x5Cx00\x5CxA0\x5CxC0(\x5CxC0$\x5CxC0\x5Cx14\x5CxC0` (after un-escaping, the leading `\x16\x03\x01` bytes suggest a raw TLS ClientHello sent to the plaintext HTTP port, typical of scanners)
* Response Code: `400`
* Bytes transferred: `150`
* Referrer: `"-"`
* User Agent: `"-"`

### The Grok Parser

[Grok](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html) is a Logstash filter that parses and transforms data using regular expressions, and a single filter can match multiple different log patterns. Grok parsing produces key-value pairs, which are easily consumed by Elasticsearch. Ready-made Grok patterns can be found in the logstash-patterns-core [GitHub repository](https://github.com/logstash-plugins/logstash-patterns-core/blob/master/patterns/grok-patterns).
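
As a toy illustration, here is a minimal pipeline showing how Grok turns a raw line into key-value pairs. This sketch is not part of the final setup; the `stdin` input, `stdout` output, and sample line are assumptions used purely for local experimentation:

```
# Toy pipeline: type a line such as "55.3.244.1 GET /index.html" on stdin
# and the grok filter emits client_ip, verb, and request as separate fields.
input {
    stdin {}
}

filter {
    grok {
        match => [ "message", "%{IP:client_ip} %{WORD:verb} %{NOTSPACE:request}" ]
    }
}

output {
    stdout { codec => rubydebug }
}
```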

After much trial and error in the [Grok Debugger](https://grokdebug.herokuapp.com/), the patterns below finally matched the two most common kinds of web request.

#### Standard Web Request Grok Pattern:

```
%{GREEDYDATA}\|%{GREEDYDATA}%{INT}m%{GREEDYDATA:host_requested}%{SPACE}%{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|-)\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}
```

#### Port Scan Grok Pattern:

```
%{GREEDYDATA}\|%{GREEDYDATA:junk}%{INT}m%{GREEDYDATA:host_requested}%{SPACE} %{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\] \"%{GREEDYDATA:payload}\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}
```
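
Either pattern can be sanity-checked outside the full stack with a throwaway pipeline. The sketch below is one assumed way to test locally: the `generator` input replays Example 1's raw message once, and the `rubydebug` codec prints the parsed event to the console. Run it with something like `logstash -f test.conf` (the file name is just an example):

```
# Throwaway test pipeline: replay one sample log line through the
# standard web request pattern and print the parsed event.
input {
    generator {
        # Example 1's raw message, single-quoted so the embedded double quotes survive
        lines => [ '[0;33;1mnginx.1    | [0mblakejarvis.com 1.2.3.4 - - [28/Apr/2020:23:45:00 +0000] "GET / HTTP/2.0" 200 4131 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"' ]
        count => 1
    }
}

filter {
    grok {
        # The standard web request pattern from above
        match => [ "message", "%{GREEDYDATA}\|%{GREEDYDATA}%{INT}m%{GREEDYDATA:host_requested}%{SPACE}%{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\] \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|-)\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}" ]
    }
}

output {
    stdout { codec => rubydebug }
}
```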

### Implementing the Grok Parser in the Logstash Pipeline

Below is the full `logstash.conf` configuration file. Since every log entry contains the NGINX metadata regardless of what the HTTP request is, the metadata is matched in its own `grok {}` block and the remainder of the message is captured as `body`. For Example 1, `body` holds everything after the timestamp: `"GET / HTTP/2.0" 200 4131 "-" "Mozilla/5.0 ..."` (with a leading space). The contents of `body` are then passed to a second `grok {}` block, where they are matched as either a standard HTTP request or a port scan.

In addition to Grok parsing, the following Logstash filters are applied:

* Remove the original `message` field if `_grokparsefailure`, the tag that indicates a failed Grok parse, is not present in the tags
* Set the date using a `date {}` filter
* Remove the `body` field, since its contents have already been parsed into key-value pairs
* Use Logstash's `geoip` filter to geolocate the client IP address

{% code title="logstash.conf" %}

```
input {
    gelf {
        port => 5000
	}
}

filter {
    grok {
    	# Match nginx headers
        match => [ "message", "%{GREEDYDATA}\|%{GREEDYDATA}%{INT}m%{GREEDYDATA:host_requested}%{SPACE} %{IP:client_ip}%{SPACE}\-%{SPACE}\-%{SPACE}\[%{HTTPDATE:timestamp}\]%{GREEDYDATA:body}"]
	}

    grok {
	# Match a standard HTTP request
	match => ["body", " \"(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|-)\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}"]
	# Match port scans
	match => ["body", " \"%{GREEDYDATA:payload}\" %{NUMBER:response} (?:-|%{NUMBER:bytes})%{SPACE}\"%{NOTSPACE:referrer}\"%{SPACE}%{GREEDYDATA:user_agent}"]
	}

    # If the message is parsed correctly, remove it
    if "_grokparsefailure" not in [tags] {
        mutate {        
            remove_field => ["message"]	
        }
    }
    date {
        locale => "en"
        timezone => "America/New_York"
        match => [ "timestamp" , "dd/MMM/yyyy:HH:mm:ss Z" ]
 	}
    mutate {	
        remove_field => ["body"]
	}
    geoip {
        source => "client_ip"
        }
}

output {
    elasticsearch {
        hosts => "localhost:9200"
	user => "elastic"
	password => "[password]"
	index => "logstash-%{+YYYY.MM.dd}"
	ilm_enabled => "false"
    }
}
```

{% endcode %}

The Logstash Docker container must be restarted (e.g. `docker-compose restart logstash` from the docker-elk directory) for the updated pipeline to take effect.

### Performing Queries Using Parsed Data

With the log data parsed, queries can be performed to answer questions such as: **What are the top 10 most common user agents over the last 30 days?** The table below was exported from a Kibana Data Table visualization.

| Most Common User Agents                                                                                                                          | Count |
| ------------------------------------------------------------------------------------------------------------------------------------------------ | ----- |
| "-"                                                                                                                                              | 998   |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_13\_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.106 Safari/537.36"                    | 955   |
| "AWS Security Scanner"                                                                                                                           | 895   |
| "Mozilla/5.0 (Macintosh; Intel Mac OS X 10\_13\_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"                    | 546   |
| "Mozilla/5.0 (X11; Linux x86\_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/78.0.3904.70 Safari/537.36"                              | 396   |
| "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"                                                                 | 374   |
| "Mozilla/5.0 (iPhone; CPU iPhone OS 13\_3\_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.0.5 Mobile/15E148 Safari/604.1"  | 354   |
| "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0"                                                                 | 208   |
| "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36"                            | 186   |
| "Mozilla/5.0 (compatible; Googlebot/2.1; +<http://www.google.com/bot.html>)"                                                                     | 186   |

An interesting finding is that "-" is the most common user agent, which indicates the web server likely sees a disproportionately large amount of illegitimate traffic. In the next and final post of this log analysis series, the NGINX logs will be enriched with GreyNoise, a service that analyzes internet scan data and background noise, to identify the common attack trends the blakejarvis.com blog faces.
