
We're using Elasticsearch to store and inspect logs from our infrastructure. Some of those logs are required by law, so we can't afford to lose any.

We've been indexing logs for quite some time without any explicit mapping, which makes them mostly unusable for searching and graphing. For example, some integer fields have been automatically mapped as text, so we can't aggregate them in histograms.
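
For illustration (same index and field names as in the script below), an aggregation like this one fails, because histograms only work on numeric fields:

GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "values": {
      "histogram": {
        "field": "some_integer",
        "interval": 10
      }
    }
  }
}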

We want to introduce templates and mapping, which would solve the issue for new indices.

However, we've noticed that having a mapping also opens the door to parsing failures. If a field is defined as an integer but suddenly receives a non-integer value, parsing fails and the document is rejected.

Is there any place those documents go and/or any way to save them for inspection later?

The Python script below works against a local ES instance.

#!/usr/bin/env python3

import requests
from typing import Any, Dict


ES_HOST = "http://localhost:9200"


def es_request(method: str, path: str, data: Dict[str, Any]) -> None:
    response = requests.request(method, f"{ES_HOST}{path}", json=data)

    # Elasticsearch returns 200 for updates and 201 for newly created
    # documents, so only treat real error statuses as failures.
    if not response.ok:
        print(response.content)


es_request('put', '/_template/my_template', {
    "index_patterns": ["my_index"],
    "mappings": {
        "properties": {
            "some_integer": { "type": "integer" }
        }
    }
})

# This is fine
es_request('put', '/my_index/_doc/1', {
    'some_integer': 42
})

# This will be rejected by ES, as it doesn't match the mapping.
# But how can I save it?
es_request('put', '/my_index/_doc/2', {
    'some_integer': 'hello world'
})

Running the script gives the following error:

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason":"failed to parse field [some_integer] of type [integer] in document with id '2'. Preview of field's value: 'hello world'"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason":"failed to parse field [some_integer] of type [integer] in document with id '2'. Preview of field's value: 'hello world'",
        "caused_by": {
            "type": "number_format_exception",
            "reason": "For input string: \"hello world\""
        }
    },
    "status": 400
}

And then the document is lost, or so it seems. Can I set an option somewhere that would automagically save the document somewhere else, a sort of dead letter queue?

tl;dr: We need mappings, but can't afford to lose log lines due to parsing errors. Can we automatically save the documents that don't fit the mapping somewhere else?

aspyct

2 Answers


Turns out it is as simple as ignoring "malformed" values. There are two ways to do that. Either on the whole index:

PUT /_template/ignore_malformed_attributes
{
  "index_patterns": ["my_index"],
  "settings": {
      "index.mapping.ignore_malformed": true
  }
}

Or per field (see the example here: https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html):

PUT my_index
{
  "mappings": {
    "properties": {
      "number_one": {
        "type": "integer",
        "ignore_malformed": true
      },
      "number_two": {
        "type": "integer"
      }
    }
  }
}

# Will work
PUT my_index/_doc/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

# Will be rejected
PUT my_index/_doc/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}
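
With ignore_malformed enabled, the document itself is still indexed; only the offending value is skipped, and it remains available in the document's _source. Since Elasticsearch 6.4, the names of the skipped fields are also recorded in the _ignored meta-field, so you can find those documents for later inspection:

GET my_index/_search
{
  "query": {
    "exists": {
      "field": "_ignored"
    }
  }
}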

Note that you can also change the setting on an existing index, but you'll need to close it first:

POST my_existing_index/_close
PUT my_existing_index/_settings
{
  "index.mapping.ignore_malformed": false
}
POST my_existing_index/_open
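
You can check that the setting was applied with:

GET my_existing_index/_settings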

NOTE: The type change won't be visible in Kibana until you refresh the index pattern. You will then have a type conflict, which requires you to reindex your data before you can search through it again... What a pain.

POST _reindex
{
  "source": {
    "index": "my_index"
  },
  "dest": {
    "index": "my_new_index"
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
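
Once the reindex is done, one way to keep searching under the old name (a sketch; note that remove_index deletes the old index, so make sure the reindex succeeded first) is to replace it with an alias:

POST _aliases
{
  "actions": [
    { "remove_index": { "index": "my_index" } },
    { "add": { "index": "my_new_index", "alias": "my_index" } }
  ]
}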

aspyct

An alternative approach, which might be preferable for many use cases, is to put Logstash between the producers and Elasticsearch. Logstash can reformat and/or validate events and route them to specific indices.
Or of course, if you have native producers, let them validate and route.
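
For example, here is a minimal pipeline sketch (the input, field and index names are just assumptions for illustration, not from the original question) that converts the field when it looks like an integer and routes everything else to a separate dead-letter index:

input {
  beats { port => 5044 }
}

filter {
  if [some_integer] and [some_integer] !~ /^-?[0-9]+$/ {
    # Non-numeric value: tag the event so it can be routed aside.
    mutate { add_tag => ["malformed"] }
  } else {
    # Safe to convert; mutate simply skips the field if it's absent.
    mutate { convert => { "some_integer" => "integer" } }
  }
}

output {
  if "malformed" in [tags] {
    elasticsearch { hosts => ["localhost:9200"] index => "my_index-malformed" }
  } else {
    elasticsearch { hosts => ["localhost:9200"] index => "my_index" }
  }
}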

EOhm