
I've set up a new proof-of-concept Logstash system.

CentOS 6.6 (on VMware 5.5) - single-CPU VM with 12 GB RAM allocated

Elasticsearch and Logstash installed from RPMs …

# rpm -q elasticsearch logstash
elasticsearch-1.7.1-1.noarch
logstash-1.5.3-1.noarch

JVM: 1.8.0_51

The data I'm feeding in is simple records of the form …

M1234 z123 2015-01-31 23:28:09.417 8.55373

(fields are machine name, userid, date, time, time taken to logon - everything is simple US-ASCII)

Logstash config is below (this data comes from an MSSQL database; for the moment I'm exporting to a text file and transferring the file to the Logstash server).

This worked fine for a day's worth of logs (11K records), but when I try to process the backlog from this calendar year it 'hangs'.

Symptoms of this are:

  • elasticsearch is still responsive - searches and access to configuration still fine
  • the number of documents in the indices stops going up
  • the system becomes essentially idle - only background disk activity and minimal CPU usage
  • if I try to stop the logstash process (which is still running) it will only die with kill -9.

This seems to happen at around 200K documents. It isn't affected by the number of indices - I started off with daily indices and then changed to weekly - it still stops around 200K docs.

Because this is a proof of concept running on a single machine I've turned the replica count down to 0 and shards to 1 - I don't think that should make any difference to this problem.
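
For anyone reproducing this: one way to set those defaults node-wide is in elasticsearch.yml (this is a sketch - the same settings can equally be applied per index or via a template):

# elasticsearch.yml - defaults applied to newly created indices
index.number_of_shards: 1
index.number_of_replicas: 0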

I don't see any errors in the logstash or elasticsearch logs despite turning verbosity up on both.

I don't think the system is running out of memory, disk space, or file descriptors.

I'm not sure what else to look at. This feels like a trivially sized problem (for ELK), and I don't see it on an existing ELK setup which handles our mail logs (though that is running earlier versions and has multiple Elasticsearch storage nodes).

Although I'm confident that there are no odd byte sequences in the input files, I've explicitly declared the input as US-ASCII with charset => "US-ASCII" in the file input plugin stanza. I don't expect this to make any difference (the test is still running).
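
For reference, in Logstash 1.5 the charset option belongs to the codec rather than to the file input itself, so the stanza looks roughly like this:

input {
        file {
                path => "/var/lib/clusters/*"
                type => "clusterF"
                start_position => "beginning"
                # charset is set on the plain codec (the file input's default codec)
                codec => plain {
                        charset => "US-ASCII"
                }
        }
}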

Update: Although there was nothing interesting in the logs when the import stalled, the lines logged when Logstash was asked to shut down were interesting …

{:timestamp=>"2015-08-03T10:17:39.104000+0100", :message=>["INFLIGHT_EVENTS_REPORT", "2015-08-03T10:17:39+01:00", {"input_to_filter"=>20, "filter_to_output"=>0, "outputs"=>[]}], :level=>:warn}

This implies to me that the problem is at the filtering stage and not the output to Elasticsearch. I've confirmed that by first getting rid of the elasticsearch output and just having stdout (shown below). That showed the same behaviour - the import stalled after a while.
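
The stdout-only output was just something like this (rubydebug pretty-prints each event, which makes it easy to see when events stop flowing):

output {
        stdout {
                codec => rubydebug
        }
}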

Putting the elasticsearch output back but clearing out everything in the filter section gave me a successful, complete data import.

I've now got a fix for this - details in the answer below.

input {
        file {
                path => "/var/lib/clusters/*"
                type => "clusterF"
                start_position => "beginning"
        }
}

filter {
        mutate {
                remove_field => [ "path", "host" ]
        }
        # 13COMP014   nabcteam    2015-07-29 11:09:21.353 153.493
        if [type] == "clusterF" {
                grok {
                        match => { "message" => "%{NOTSPACE:client} +%{WORD:userid} +%{TIMESTAMP_ISO8601:datestamp} +%{BASE10NUM:elapsed:float}" }
                }
        }
        if [elapsed] < 0 {
                drop {}
        }
        if [elapsed] > 1000.0 {
                drop {}
        }
        if [userid] =~ "[a-z][0-9]{7}" {
                mutate {
                        add_field => [ "userClass", "student" ]
                }
        } else if [userid] =~ "n[a-z].*" {
                mutate {
                        add_field => [ "userClass", "staff" ]
                }
        } else {
                mutate {
                        add_field => [ "userClass", "other" ]
                }
        }
        date {
                match => [ "datestamp", "ISO8601" ]
        }
        mutate {
                remove_field => [ "message" ]
        }
}

output {
        elasticsearch {
                bind_host => "clog01.ncl.ac.uk"
                protocol => "http"
                cluster => "elasticsearch"
                flush_size => 10
                index => "clusters-%{+xxxx.ww}"
        }
}
Paul Haldane
  • I haven't got access to "my" logstash/ELK (am on vacation), but I remember overburdening it as I turned volumes up when we got to staging/production. I recall there being memory consumption settings for each of the components in the ELK stack (plus Redis, which we use for buffering), which are non-obvious but only slightly tricky to find (in the config files of the respective components). Increasing the memory allowance, especially for logstash/redis, was I think our fix. It was a royal pain to diagnose, I have to add. – ErikE Aug 02 '15 at 21:37
  • Take a look into the Elasticsearch logs ... there might be something there to help. – Fox Aug 03 '15 at 06:42

1 Answer


Once I knew that the stall was happening around the filter stage rather than the output, this was much easier to find.

As noted in the update to the question, putting the elasticsearch output back but clearing out the filter section gave me a successful, complete data import - so the filters were the place to look.

I wrote a simple Perl script to validate the input lines against the grok pattern - this showed me that some userids contained hyphens (which I didn't expect). Replacing +%{WORD:userid} with +%{NOTSPACE:userid} in the original config gave me a working setup. I suspect that what I should have done in the first place was add a field on successful grok and only apply the other filter rules if that field was present (sketched at the end of this answer).
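
A minimal sketch of the validation idea (the regexes are my rough hand-translation of the grok patterns, not the original script):

#!/usr/bin/perl
# Check exported lines against approximate Perl equivalents of the
# grok patterns used in the config:
#   NOTSPACE          -> \S+
#   WORD              -> \w+   (letters/digits/underscore only - the bug)
#   TIMESTAMP_ISO8601 -> date, time, optional fractional seconds
#   BASE10NUM         -> optional sign, digits, optional fraction
use strict;
use warnings;

my $line_re = qr/^
    \S+ \s+                            # client
    \w+ \s+                            # userid (WORD is too strict)
    \d{4}-\d{2}-\d{2} \s+              # date
    \d{2}:\d{2}:\d{2}(?:\.\d+)? \s+    # time
    -?\d+(?:\.\d+)?                    # elapsed
    \s* $/x;

while (my $line = <>) {
    chomp $line;
    print "FAILED: $line\n" unless $line =~ $line_re;
}

Running this over the exported file prints every line grok would reject; a userid containing a hyphen fails at the \w+ step.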

The main moral I take from this is that it's important to simplify this sort of problem; otherwise the search space of potential causes is too large.
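
For completeness, the "add a field on successful grok" approach I mention above would look something like this (an untested sketch; grokked is just a marker name I've picked):

filter {
        if [type] == "clusterF" {
                grok {
                        match => { "message" => "%{NOTSPACE:client} +%{NOTSPACE:userid} +%{TIMESTAMP_ISO8601:datestamp} +%{BASE10NUM:elapsed:float}" }
                        # add_field only fires when the match succeeds
                        add_field => [ "grokked", "true" ]
                }
        }
        # only apply the remaining rules to events that parsed cleanly
        if [grokked] {
                if [elapsed] < 0 or [elapsed] > 1000.0 {
                        drop {}
                }
                date {
                        match => [ "datestamp", "ISO8601" ]
                }
        }
}

Grok also tags failed events with _grokparsefailure by default, so checking if "_grokparsefailure" not in [tags] would achieve the same thing without the extra field.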

Paul Haldane