I'm currently evaluating whether Logstash and Elasticsearch are useful for our use case. What I have is a log file containing multiple entries, of the form:

<root>
    <entry>
        <fieldx>...</fieldx>
        <fieldy>...</fieldy>
        <fieldz>...</fieldz>
        ...
        <fieldarray>
            <fielda>...</fielda>
            <fielda>...</fielda>
            ...
        </fieldarray>
    </entry>
    <entry>
    ...
    </entry>
    ...
</root>

Each entry element contains one log event. (If you are interested, the file is actually a Tempo Timesheets (an Atlassian JIRA plug-in) work-log export.)

Is it possible to transform such a file into multiple log events without writing my own codec?

dualed

2 Answers

Alright, I found a solution that works for me. The biggest problem with it is that the XML plugin is ... not quite unstable, but either poorly documented and buggy, or poorly and incorrectly documented.

TL;DR

Bash command line:

gzcat -d file.xml.gz | tr -d "\n\r" | xmllint --format - | logstash -f logstash-csv.conf

Logstash config:

input {
    stdin {}
}

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => "previous"
    }
    # The multiline filter adds the tag "multiline" only to events spanning multiple lines.
    # We _only_ want those here.
    if "multiline" in [tags] {
        # Add the encoding line here. Could in theory extract this from the
        # first line with a clever filter. Not worth the effort at the moment.
        mutate {
            replace => ["message",'<?xml version="1.0" encoding="UTF-8" ?>%{message}']
        }
        # This filter exports the hierarchy into the field "entry". This will
        # create a very deep structure that elasticsearch does not really like.
        # Which is why I used add_field to flatten it.
        xml {
            target => "entry"
            source => "message"
            add_field => {
                "fieldx"       => "%{[entry][fieldx]}"
                "fieldy"       => "%{[entry][fieldy]}"
                "fieldz"       => "%{[entry][fieldz]}"
                # With deeper nested fields, the xml converter actually creates
                # an array containing hashes, which is why you need the [0]
                # -- took me ages to find out.
                "fielda"       => "%{[entry][fieldarray][0][fielda]}"
                "fieldb"       => "%{[entry][fieldarray][0][fieldb]}"
                "fieldc"       => "%{[entry][fieldarray][0][fieldc]}"
            }
        }
        # Remove the intermediate fields before output. "message" contains the
        # original message (XML). You may or may not want to keep that.
        mutate {
            remove_field => ["message"]
            remove_field => ["entry"]
        }
    }
}

output {
    ...
}
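
For completeness, the output section could look like the sketch below; the Elasticsearch host and index name are assumptions for illustration, not part of my original setup:

output {
    # stdout with the rubydebug codec is handy while testing the filters
    stdout { codec => rubydebug }
    # hypothetical Elasticsearch target -- adjust hosts and index to taste
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "worklog"
    }
}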

Detailed

My solution works because, at least up to the entry level, my XML input is very uniform and can thus be handled by some kind of pattern matching.

Since the export is basically one really long line of XML, and the logstash xml plugin essentially works only with fields (read: columns in lines) that contain XML data, I had to change the data into a more useful format.

Shell: Preparing the file

  • gzcat -d file.xml.gz |: The file was gzipped because it was simply too much data; obviously you can skip this step if your file isn't compressed.
  • tr -d "\n\r" |: Remove line breaks inside XML elements: some of the elements can contain line breaks as character data. The next step requires that these are removed, or encoded in some way. Even though it is assumed that at this point you have all the XML in one massive line anyway, it does not matter if this command also removes any whitespace between elements (see the quick check after this list).

  • xmllint --format - |: Format the XML with xmllint (comes with libxml)

    Here the single huge spaghetti line of XML (<root><entry><fieldx>...</fieldx></entry></root>) is properly formatted:

    <root>
      <entry>
        <fieldx>...</fieldx>
        <fieldy>...</fieldy>
        <fieldz>...</fieldz>
        <fieldarray>
          <fielda>...</fielda>
          <fieldb>...</fieldb>
          ...
        </fieldarray>
      </entry>
      <entry>
        ...
      </entry>
      ...
    </root>
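
To convince yourself that the shell stage behaves as expected, you can feed it a tiny hand-made sample (hypothetical data, not the actual Tempo export) and watch the line break inside fieldx disappear before xmllint re-indents everything:

printf '<root><entry><fieldx>a\nb</fieldx></entry></root>' | tr -d "\n\r" | xmllint --format -

This prints:

<?xml version="1.0"?>
<root>
  <entry>
    <fieldx>ab</fieldx>
  </entry>
</root>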
    

Logstash

logstash -f logstash-csv.conf

(See full content of the .conf file in the TL;DR section.)

Here, the multiline filter does the trick. It can merge multiple lines into a single log message. And this is why the formatting with xmllint was necessary:

filter {
    # add all lines that have more indentation than double-space to the previous line
    multiline {
        pattern => "^\s\s(\s\s|\<\/entry\>)"
        what => "previous"
    }
}

This basically says that every line indented by more than two spaces (or that is </entry>; xmllint indents with two spaces by default) belongs to the previous line. It also means that character data must not contain newlines (stripped with tr in the shell) and that the XML must be normalised (with xmllint).
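
For illustration, after merging, each multiline-tagged event has a message field containing one complete entry (the lines are joined with newlines by default), roughly:

  <entry>
    <fieldx>...</fieldx>
    <fieldy>...</fieldy>
    <fieldz>...</fieldz>
    <fieldarray>
      <fielda>...</fielda>
      ...
    </fieldarray>
  </entry>

The <root> and </root> lines never match the pattern, so they end up as separate events without the "multiline" tag -- which is exactly why the filter section only processes tagged events.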

dualed
  • Hi, did you manage to make this work? I am curious since I have a similar need and the multiline solution along with the split did not work for me. Thanks for your feedback – viz Nov 06 '15 at 02:43
  • @viz This worked, but we never used it in production. Multiline only works if you have a very regular XML structure and have formatted it first with indentation (see answer, section "preparing the file") – dualed Nov 09 '15 at 15:01
I had a similar case. To parse this XML:

<ROOT number="34">
  <EVENTLIST>
    <EVENT name="hey"/>
    <EVENT name="you"/>
  </EVENTLIST>
</ROOT>

I use this configuration for Logstash:

input {
  file {
    path => "/path/events.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "<ROOT"
      negate => "true"
      what => "previous"
      auto_flush_interval => 1
    }
  }
}
filter {
  xml {
    source => "message"
    target => "xml_content"
  }
  split {
    field => "[xml_content][EVENTLIST]"
  }
  split {
    field => "[xml_content][EVENTLIST][EVENT]"
  }
  mutate {
    add_field => {
      "number" => "%{[xml_content][number]}"
      "name"   => "%{[xml_content][EVENTLIST][EVENT][name]}"
    }
    remove_field => ["xml_content", "message", "path"]
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
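
With this configuration the two EVENT elements end up as two separate events. The rubydebug output should look roughly like this (a sketch; timestamps and metadata omitted):

{
    "number" => "34",
      "name" => "hey"
}
{
    "number" => "34",
      "name" => "you"
}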

I hope this can help someone. It took me a long time to get it working.

rjurado01