
I'm working on a portal that hosts multiple types of documentation (HTML, PDF, PPTx, DocX) and makes them all searchable in one place.

We could achieve this using the "standard" out-of-the-box Elasticsearch with the Ingest Attachment plugin, but I'm no data scientist and know very little about writing Elasticsearch queries, so our search results are not great.

I've been advised to use Enterprise App Search instead of trying to tune my basic queries, but it seems I can't use the Ingest Attachment plugin with it. As a result, I can't simply send base64-encoded content to the documents API and expect it to be indexed.

Is there any way around this, or is App Search simply not an option for this use case?

RekaB

1 Answer


The way around this is to extract the attachment content before indexing, either by leveraging the attachment pipeline as suggested in this blog post or, if your backend is in Java like mine, by using Apache Tika to extract the content from the attachments yourself.

I used Tika to extract the text from the HTML content (it's actually very straightforward):

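// Extracts the plain text of an HTML document's body using Apache Tika's HtmlParser.
// Requires java.nio.charset.StandardCharsets in addition to the Tika, SAX and java.io imports.
static String getContent(String htmlContent) throws TikaException, SAXException, IOException {
    // Use an explicit charset; a bare getBytes() depends on the platform default.
    InputStream input = new ByteArrayInputStream(htmlContent.getBytes(StandardCharsets.UTF_8));
    // Note: the no-arg BodyContentHandler stops at 100,000 characters; pass -1 to remove the limit.
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new HtmlParser().parse(input, handler, metadata, new ParseContext());
    return handler.toString();
}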

For PDF files I was already using Apache PDFBox to extract some other properties, so the text came for "free". The same goes for Office files, but that requires Apache POI.
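
Once the text has been extracted, it can be sent to the App Search documents API as an ordinary string field instead of base64 content. Below is a minimal sketch of that step using only the JDK (Java 11+). The class name, the "title" and "body" field names, the engine name, the host and the key are all placeholders I made up for illustration; the endpoint path and Bearer authentication follow the App Search documents API as I understand it, so check them against your own deployment.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

/** Sketch: wrap already-extracted text in an App Search document and build the indexing request. */
final class AppSearchIndexer {

    /** Escapes a string so it can be placed inside a JSON string literal. */
    static String jsonEscape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    // Remaining control characters become six-character hex escapes.
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds a one-document JSON array; "title" and "body" are hypothetical field names. */
    static String toPayload(String id, String title, String extractedText) {
        return "[{\"id\":\"" + jsonEscape(id)
             + "\",\"title\":\"" + jsonEscape(title)
             + "\",\"body\":\"" + jsonEscape(extractedText) + "\"}]";
    }

    /** Builds (but does not send) the POST request for the engine's documents endpoint. */
    static HttpRequest buildRequest(String baseUrl, String engine, String privateKey, String payload) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/api/as/v1/engines/" + engine + "/documents"))
                .header("Content-Type", "application/json")
                .header("Authorization", "Bearer " + privateKey)
                .POST(HttpRequest.BodyPublishers.ofString(payload, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder values: replace with the text returned by Tika, PDFBox or POI.
        String payload = toPayload("doc-1", "User guide", "Line one\nsaid \"hello\"");
        HttpRequest request = buildRequest("http://localhost:3002", "docs-portal", "private-xxxx", payload);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(payload);
    }
}
```

To actually index the document, pass the request to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. In a real project a JSON library such as Jackson would be a safer choice than hand-built strings; the manual escaping here is only to keep the sketch dependency-free.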

RekaB