ElasticSearch Cookbook (Second Edition)

Mapping an attachment field

ElasticSearch allows you to extend its core types to cover new requirements with native plugins that provide new mapping types. The most-used custom field type is the attachment mapping type.

It allows you to index and search the contents of common document files, such as Microsoft Office formats, OpenDocument formats, PDF, EPUB, and many others.

Getting ready

You need a working ElasticSearch cluster with the attachment plugin (https://github.com/elasticsearch/elasticsearch-mapper-attachments) installed.

It can be installed from the command line with the following command:

 bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.9.0

The plugin version is related to the current ElasticSearch version; check the GitHub page for further details.

How to do it...

To map a field as an attachment, it's necessary to set the type field to attachment.

Internally, the attachment field defines its fields property as a multifield: it takes binary data (Base64-encoded) and extracts useful information such as the author, content, title, and date.

If you want to create a mapping for an e-mail that stores an attachment, it should be as follows:

{
  "email": {
    "properties": {
      "sender": {
        "type": "string",
        "store": "yes",
        "index": "not_analyzed"
      },
      "date": {
        "type": "date",
        "store": "no",
        "index": "not_analyzed"
      },
      "document": {
        "type": "attachment",
        "fields": {
          "file": {
            "store": "yes",
            "index": "analyzed"
          },
          "date": {
            "store": "yes"
          },
          "author": {
            "store": "yes"
          },
          "keywords": {
            "store": "yes"
          },
          "content_type": {
            "store": "yes"
          },
          "title": {
            "store": "yes"
          }
        }
      }
    }
  }
}
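
The same mapping can also be generated programmatically, which may help when several types share an attachment field. The following is a minimal Python sketch, not part of the plugin: the helper name attachment_mapping and the METADATA_FIELDS list are hypothetical, and the output is only the JSON body shown above, which still has to be sent to the cluster by whatever client you use.

```python
import json

# Metadata subfields used in the mapping above; each one is stored.
METADATA_FIELDS = ["date", "author", "keywords", "content_type", "title"]


def attachment_mapping(type_name, field_name):
    """Return a mapping dict with one attachment field (hypothetical helper)."""
    # The "file" subfield holds the extracted text, so it is analyzed as well.
    fields = {"file": {"store": "yes", "index": "analyzed"}}
    for name in METADATA_FIELDS:
        fields[name] = {"store": "yes"}
    return {
        type_name: {
            "properties": {
                field_name: {"type": "attachment", "fields": fields}
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body for the "email" type with a "document" attachment field.
    print(json.dumps(attachment_mapping("email", "document"), indent=2))
```

The sketch covers only the attachment field; the sender and date properties from the full mapping would be merged into the same properties object.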

How it works...

The attachment plugin uses Apache Tika internally, a library that specializes in text extraction from documents. The list of supported document types is available on the Apache Tika site (http://tika.apache.org/1.5/formats.html), but it covers all the common file types.

The attachment type field receives a Base64-encoded binary stream that is processed by Tika's metadata and text extractors. The field can be seen as a multifield that stores different contents in its subfields:

  • file: This stores the content of the file
  • date: This stores the file creation date extracted by Tika metadata
  • author: This stores the file's author extracted by Tika metadata
  • keywords: This stores the file's keywords extracted by Tika metadata
  • content_type: This stores the file's content type
  • title: This stores the file's title extracted by Tika metadata
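
Because the field expects Base64 text, the file bytes must be encoded before the document is sent. The following Python sketch shows one way to prepare such a document for the email mapping of this recipe. The helper name build_email_document and the sample values are hypothetical, and it assumes the attachment field accepts the Base64 string directly as its value.

```python
import base64
import json


def build_email_document(sender, date, file_bytes):
    """Build a JSON-serializable document for the email mapping (hypothetical helper)."""
    # b64encode returns bytes; decode to ASCII text so it can be placed in JSON.
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {"sender": sender, "date": date, "document": encoded}


if __name__ == "__main__":
    # In practice file_bytes would come from open(path, "rb").read().
    doc = build_email_document("alice@example.com", "2014-01-01", b"Hello attachment")
    print(json.dumps(doc, indent=2))
```

The resulting JSON is the body of an ordinary index request; Tika then populates the subfields listed above on the server side.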

By default, the attachment plugin extracts up to 100,000 characters from each file. This value can be changed globally by setting index.mapping.attachment.indexed_chars in the index settings, or per document by passing a value in the _indexed_chars property when indexing the element.
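
A per-document limit can be sketched as follows in Python. This is an illustration under assumptions: the _indexed_chars property comes from the text above, while the _content key for the Base64 payload is assumed from the object form described in the plugin's documentation, and the helper name attachment_with_limit is hypothetical. Check the plugin's GitHub page for the form supported by your version.

```python
import base64


def attachment_with_limit(file_bytes, indexed_chars):
    """Return an attachment value in object form with a per-document limit."""
    return {
        # Assumed key for the Base64 payload when the value is an object.
        "_content": base64.b64encode(file_bytes).decode("ascii"),
        # Per-document cap on the number of extracted characters.
        "_indexed_chars": indexed_chars,
    }


if __name__ == "__main__":
    # Ask the plugin to extract at most 5,000 characters from this file.
    document = {"document": attachment_with_limit(b"A very long file...", 5000)}
    print(document)
```

Setting the limit per document is useful when only a few large files need more (or less) extracted text than the index-wide default.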

There's more...

The attachment type is an example of how it's possible to extend ElasticSearch with custom types.

The attachment plugin is very useful for indexing documents, e-mails, and all types of unstructured documents. A good example of an application that uses this plugin is ScrutMyDocs (http://www.scrutmydocs.org/).

See also