
We are in the process of deploying an ELK stack and need advice and general recommendations regarding the performance of the cluster and, more specifically, of Logstash.

Our current setup is 1 Kibana node, 2 Logstash nodes and 4 Elasticsearch nodes. The Logstash nodes have 8 vCPUs and 32 GB RAM each and are fed syslog data through nginx acting as a load balancer. The Elasticsearch nodes have 8 vCPUs and 64 GB RAM each. The heap size has been set to ½ of RAM on all nodes.

We are currently processing about 4,000-5,000 events/second but plan to scale to considerably more. At the current volume both Logstash nodes sit at about 90% CPU. We do process the logs with a few filters before sending them to Elasticsearch. Here they are:

3000-filter-syslog.conf:

filter {
  if "syslog" in [tags] and "pre-processed" not in [tags] {
    if "%ASA-" in [message] {
      mutate {
        add_tag => [ "pre-processed", "Firewall", "ASA" ]
      }
      grok {
        match => ["message", "%{CISCO_TAGGED_SYSLOG} %{GREEDYDATA:cisco_message}"]
      }
      syslog_pri { }

        if "_grokparsefailure" not in [tags] {
          mutate {
          rename => ["cisco_message", "message"]
          remove_field => ["timestamp"]
          }
        }

 grok {
      match => [
        "message", "%{CISCOFW106001}",
        "message", "%{CISCOFW106006_106007_106010}",
        "message", "%{CISCOFW106014}",
        "message", "%{CISCOFW106015}",
        "message", "%{CISCOFW106021}",
        "message", "%{CISCOFW106023}",
        "message", "%{CISCOFW106100}",
        "message", "%{CISCOFW110002}",
        "message", "%{CISCOFW302010}",
        "message", "%{CISCOFW302013_302014_302015_302016}",
        "message", "%{CISCOFW302020_302021}",
        "message", "%{CISCOFW305011}",
        "message", "%{CISCOFW313001_313004_313008}",
        "message", "%{CISCOFW313005}",
        "message", "%{CISCOFW402117}",
        "message", "%{CISCOFW402119}",
        "message", "%{CISCOFW419001}",
        "message", "%{CISCOFW419002}",
        "message", "%{CISCOFW500004}",
        "message", "%{CISCOFW602303_602304}",
        "message", "%{CISCOFW710001_710002_710003_710005_710006}",
        "message", "%{CISCOFW713172}",
        "message", "%{CISCOFW733100}"
      ]
    }

    }
  }
}

3010-filter-jdbc.conf:

filter {
  if "syslog" in [tags] {
    jdbc_static {
      loaders => [
        {
          id => "elkDevIndexAssoc"
          query => "select * from elkDevIndexAssoc"
          local_table => "elkDevIndexAssoc"
        }
      ]
      local_db_objects => [
        {
          name => "elkDevIndexAssoc"
          index_columns => ["cenDevIP"]
          columns => [
            ["cenDevSID", "varchar(255)"],
            ["cenDevFQDN", "varchar(255)"],
            ["cenDevIP", "varchar(255)"],
            ["cenDevServiceName", "varchar(255)"]
          ]
        }
      ]
      local_lookups => [
        {
          id => "localObjects"
          query => "select * from elkDevIndexAssoc WHERE cenDevIP = :host"
          parameters => {host => "[host]"}
          target => "cendotEnhanced"
        }
      ]
      # using add_field here to add & rename values to the event root
      add_field => { cendotFQDN => "%{[cendotEnhanced][0][cendevfqdn]}" }
      add_field => { cendotSID => "%{[cendotEnhanced][0][cendevsid]}" }
      add_field => { cendotServiceName => "%{[cendotEnhanced][0][cendevservicename]}" }
      remove_field => ["cendotEnhanced"]
      jdbc_user => "user"
      jdbc_password => "password"
      jdbc_driver_class => "com.mysql.jdbc.Driver"
      jdbc_driver_library => "/usr/share/java/mysql-connector-java-8.0.11.jar"
      jdbc_connection_string => "jdbc:mysql://84.19.155.71:3306/logstash?serverTimezone=Europe/Stockholm"
      #jdbc_default_timezone => "Europe/Stockholm"
    }
  }  
}

Is there any way to debug what is taking so much CPU power? Does anyone have any recommendations on what to do, since we need to be able to process a much higher log volume?

Here is the output from jstat:

jstat -gc 56576
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT   
68096.0 68096.0  0.0   68096.0 545344.0 66712.9  30775744.0 10740782.3 113316.0 93805.9 16452.0 13229.3   1341  146.848   6      0.449  147.297

Thanks

nillenilsson

1 Answer


Here are some tips to help you along with your performance tuning mission.

Use multiple pipelines where possible

Logstash 6.0 introduced the possibility to easily run multiple pipelines. You can use this to split out your event processing logic where it makes sense, e.g. if you can distinguish two or more types of inputs/outputs and the filtering processes in between.

Have a read here and here for some tips on using multiple pipelines.
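For illustration, a pipelines.yml along these lines could split the ASA firewall traffic from everything else (the pipeline ids, config paths, worker counts and batch sizes below are placeholders, not measured recommendations):

# /etc/logstash/pipelines.yml -- minimal sketch, all values are placeholders
- pipeline.id: syslog-asa
  path.config: "/etc/logstash/conf.d/asa/*.conf"
  pipeline.workers: 4
  pipeline.batch.size: 250
- pipeline.id: syslog-other
  path.config: "/etc/logstash/conf.d/other/*.conf"
  pipeline.workers: 4
  pipeline.batch.size: 250

Each pipeline gets its own worker threads and queue, so a heavy filter in one pipeline no longer slows down events that don't need it.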

Conditional logic

Next up, see if you can reduce the conditional logic in your filters at all. The more if..else logic you have, the more CPU-intensive things get for Logstash.
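As a rough sketch of what that could look like here (this assumes the ASA firewalls can be pointed at their own syslog port, which may or may not fit your nginx setup): tagging once at the input stage avoids string-matching "%ASA-" against every single event in the filter stage.

input {
  syslog {
    port => 5145                          # hypothetical dedicated port for the ASAs
    tags => ["syslog", "ASA", "Firewall"]
  }
}
filter {
  # one tag check instead of nested ifs plus a substring match per event
  if "ASA" in [tags] {
    # ... ASA grok/mutate logic from 3000-filter-syslog.conf ...
  }
}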

Get hold of some valuable stats to see what is causing high CPU usage

You should definitely use the Node Stats API for Logstash to see what is going on inside your current event processing pipeline.

curl -XGET 'localhost:9600/_node/stats/process'

You can also look up other stats types (for example, try pipelines as well as process). Check out this page for more info on using the API to query your Logstash stats. This will more than likely tell you where the really intensive stuff is happening.
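For instance, the pipelines view returns per-plugin counters, and the duration_in_millis values reported for each filter give a good hint about whether the grok patterns or the jdbc_static lookup is eating the CPU:

curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'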

Good luck!

Shogan