Elasticsearch Analyzer (Sentence Analysis)

Analyzer

An analyzer splits text into words, also called terms.

This chapter explains the following topics.

  • Components of an analyzer
  • The Nori Korean morphological analyzer
  • Custom Analyzer

Why should an analyzer be used to split text into words?

The short answer is that, without suitable analysis, documents cannot be searched accurately. A concrete example makes this clearer.

Index a document to search

POST demo_standard_analyzer/_doc
{
  "text":"올해 서울시 예산이 결정되었다."
}

Search the document above with the word “서울”.

GET demo_standard_analyzer/_search
{
  "query" : {
    "match" : {
      "text" : "서울"
    }
  }
}

No document is found, as shown below.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

The document containing “서울시” is not found with “서울” because a match query compares whole terms, and the term stored in the index is “서울시”, not “서울”.

To investigate the cause, use the _analyze API, which shows how the analyzer breaks a string into the tokens that are stored in the inverted index.

Break down a sentence with the default Standard Analyzer

POST demo_standard_analyzer/_analyze
{
  "text": "올해 서울시 예산이 결정되었다."
}

Result of sentence decomposition with the default Standard Analyzer

{
  "tokens" : [
    {
      "token" : "올해",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "서울시",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<HANGUL>",
      "position" : 1
    },
    {
      "token" : "예산이",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<HANGUL>",
      "position" : 2
    },
    {
      "token" : "결정되었다",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<HANGUL>",
      "position" : 3
    }
  ]
}

As the token values show, the Standard Analyzer splits this sentence only at spaces and punctuation, so “서울시” is stored in the inverted index as a single term.

Therefore, to perform accurate searches, Korean text must be broken down into words with an analyzer that supports Korean.
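The mismatch can be reproduced with a toy inverted index. The following is a minimal Python sketch, not Elasticsearch code: whitespace_tokenize and build_inverted_index are hypothetical helpers, and splitting on whitespace only approximates what the Standard Analyzer does to this particular sentence.

```python
from collections import defaultdict


def whitespace_tokenize(text):
    """Split on whitespace and strip surrounding punctuation (a rough
    stand-in for the Standard Analyzer's behavior on this sentence)."""
    return [word.strip(".,!?") for word in text.split() if word.strip(".,!?")]


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in whitespace_tokenize(text):
            index[term].add(doc_id)
    return index


docs = {1: "올해 서울시 예산이 결정되었다."}
index = build_inverted_index(docs)

# Lookup is by exact term, so a substring of a stored term finds nothing.
print(index.get("서울", set()))    # empty set: "서울" is not a stored term
print(index.get("서울시", set()))  # {1}
```

Because the lookup key must equal a stored term exactly, the only way to make “서울” findable is to store it as its own term at index time.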

Components of an analyzer

An analyzer consists of the following three elements.

  • Char Filter: transforms the sentence
  • Tokenizer: splits the sentence into tokens, or words
  • Token filter: transforms tokens, or words

An analyzer applies these three elements in order, from top to bottom, to decompose text into the tokens that are stored in the inverted index.

Char Filter: example of sentence transformation

  • HTML Strip Character Filter
    • Removes HTML tags.
    • <p>foo</p> –(transformed)–> foo

Tokenizer: example of splitting a sentence into tokens

  • Standard Tokenizer
  • Nori Tokenizer

Token filter: example of token transformation

  • Lower case Token Filter
    • Converts all token characters to lowercase.
    • Used when you want to treat Google and google as the same token.
  • Stop Token Filter
    • Removes stop words, that is, tokens that are not useful for search.
    • Used for tasks such as removing particles.
  • Stemmer Token Filter
    • Performs stemming.
    • Converts forms such as “즐겁게” and “즐겁다” to “즐거움”.
  • Synonym Token Filter
    • Normalizes synonyms.
    • Converts words such as “구글” and “검색” to the same term, “검색”.
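The order of the three elements can be sketched in plain Python. This is a simplified illustration under stated assumptions, not how Elasticsearch implements them: every function name below is hypothetical, the regular expression is a crude substitute for the HTML Strip Character Filter, and a whitespace split stands in for a real tokenizer.

```python
import re


def html_strip(text):
    """Char filter: remove HTML tags from the raw text."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text into tokens."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords):
    """Token filter: drop tokens listed as stop words."""
    return [t for t in tokens if t not in stopwords]


def analyze(text):
    """Run char filter -> tokenizer -> token filters, in that order."""
    text = html_strip(text)
    tokens = whitespace_tokenizer(text)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens, stopwords={"the", "a"})


print(analyze("<p>Search the web with Google</p>"))
# ['search', 'web', 'with', 'google']
```

The char filter sees the whole string, while the token filters only ever see individual tokens, which is why tag removal belongs in the first stage and lowercasing in the last.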

Nori Korean morphological analyzer

Elasticsearch’s default Standard Analyzer does not perform Korean morphological analysis.

To perform Korean search correctly, use the nori Analysis Plugin, a plugin for Korean morphological analysis, to split sentences into words.

Install the nori Analysis Plugin

Command to install the nori Analysis Plugin

bin/elasticsearch-plugin install analysis-nori

The following is an execution example in Docker.

sh-4.4# bin/elasticsearch-plugin install analysis-nori
-> Installing analysis-nori
-> Downloading analysis-nori from elastic
[=================================================] 100%
-> Installed analysis-nori
-> Please restart Elasticsearch to activate any plugins installed
sh-4.4# 

For reference, the plugin can be removed as follows.

Command to remove the nori Analysis Plugin

bin/elasticsearch-plugin remove analysis-nori

After installation, restart Elasticsearch to use the nori plugin.

The following example uses the systemctl command.

systemctl restart elasticsearch

The restart method differs depending on the installation environment, so restart it according to your environment.

Nori Analyzer configuration

  • Char Filter
    • None
  • Tokenizer
    • nori_tokenizer
  • Token filter
    • nori_part_of_speech
    • nori_readingform
    • lowercase
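To make this composition concrete, the built-in nori analyzer can be written out as an equivalent custom analyzer definition. The sketch below only builds and prints the JSON settings body in Python and sends nothing to a cluster. The analyzer name my_nori is a hypothetical placeholder, and the filter list assumes the default composition of the nori analyzer (nori_part_of_speech, nori_readingform, lowercase).

```python
import json

# Hypothetical index settings that rebuild the built-in "nori" analyzer as a
# custom analyzer. "my_nori" is an illustrative name, not from the source.
settings_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_nori": {
                    "type": "custom",
                    "tokenizer": "nori_tokenizer",
                    "filter": [
                        "nori_part_of_speech",
                        "nori_readingform",
                        "lowercase",
                    ],
                }
            }
        }
    }
}

# Print the JSON body that could be sent when creating an index.
print(json.dumps(settings_body, ensure_ascii=False, indent=2))
```

Writing the analyzer out this way is mainly useful as a starting point when one of the stages needs to be changed or removed.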

How to use the Nori Tokenizer

Analyze a string with only the nori tokenizer.

Analyze a sentence with the Nori Tokenizer

GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": [
    "올해 서울시 예산이 결정되었다."
  ]
}

Result of analyzing a sentence with the Nori Tokenizer

{
  "tokens" : [
    {
      "token" : "올",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "해",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "서울",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "시",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "예산",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "이",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "결정",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "되",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "었",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "다",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "word",
      "position" : 9
    }
  ]
}

How to use the Nori Analyzer

Now configure the Nori Analyzer on the demo_analyzer index.

Specify nori as the analyzer

PUT demo_analyzer
{
  "mappings" : {
    "properties" : {
      "text" : {
        "type" : "text",
        "analyzer" : "nori"
      }
    }
  }
}

Next, create a document in the demo_analyzer index and run a Korean search.

Create a document

POST demo_analyzer/_doc
{
  "text": "올해 서울시 예산이 결정되었다."
}

Search for “울시”

GET demo_analyzer/_search
{
  "query" : {
    "match" : {
      "text" : "울시"
    }
  }
}

Search result for “울시”

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Searching for “울시” does not return the document containing “서울시”. The two appear to be treated as completely different words.

Check what kind of inverted index is created by the nori Analyzer.

Morphological analysis with the nori Analyzer

GET demo_analyzer/_analyze
{
  "analyzer" : "nori", 
  "text" : "올해 서울시 예산이 결정되었다."
}

Morphological analysis result with the nori Analyzer

{
  "tokens" : [
    {
      "token" : "올",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "해",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "서울",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "시",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "예산",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "결정",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 6
    }
  ]
}

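The nori Analyzer splits “서울시” into “서울” and “시”, so “울시” does not match “서울시”. Because “서울” is now stored as its own term, a search for “서울” should match this document, unlike with the Standard Analyzer. Compared with the tokenizer-only output, “이”, “되”, “었”, and “다” are also missing, which is consistent with the nori_part_of_speech filter removing particles and verb endings.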

Custom Analyzer

You can freely create a custom analyzer by combining the following three elements.

  • Char Filter
  • Tokenizer
  • Token filter

For example, create the following Custom Analyzer.

Create a Custom Analyzer

PUT custome_analyze
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "original_analyze" : {
          "char_filter" : [ "html_strip" ],
          "tokenizer" : "nori_tokenizer",
          "filter" : [ "my_stop" ]
        }
      },
      "filter" : {
        "my_stop" : {
          "type" : "stop",
          "stopwords" : [ "올", "해", "이" ]
        }
      }
    }
  }
}

Analyze a sentence with the analyzer you created.

Use the Custom Analyzer

POST custome_analyze/_analyze
{
  "analyzer" : "original_analyze",
  "text" : "<p>올해 서울시 예산이 결정되었다</p>"
}

Result of using the Custom Analyzer

{
  "tokens" : [
    {
      "token" : "서울",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "시",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "예산",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "결정",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "되",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "었",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "다",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "word",
      "position" : 9
    }
  ]
}

The output reflects the following processing defined in the custom analyzer original_analyze.

  • html_strip removes the HTML <p> tag.
  • nori_tokenizer converts the sentence into tokens.
  • my_stop removes "올", "해", and "이" as defined in stopwords.
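The position values in the result (2, 3, 4, 6, 7, 8, 9) show that removed tokens leave gaps rather than causing renumbering. The following minimal Python sketch reproduces that behavior. stop_filter is a hypothetical helper, not the Elasticsearch implementation, and the input tokens are copied from the nori_tokenizer output shown earlier in this chapter.

```python
# Tokens as (term, position) pairs, copied from the nori_tokenizer output
# shown earlier in this chapter.
tokenizer_output = [
    ("올", 0), ("해", 1), ("서울", 2), ("시", 3), ("예산", 4),
    ("이", 5), ("결정", 6), ("되", 7), ("었", 8), ("다", 9),
]

# Same stop words as the my_stop filter definition.
stopwords = {"올", "해", "이"}


def stop_filter(tokens, stopwords):
    """Drop stop words but keep the original position of every survivor."""
    return [(term, pos) for term, pos in tokens if term not in stopwords]


filtered = stop_filter(tokenizer_output, stopwords)
print(filtered)
```

Keeping the original positions means that queries sensitive to word order and distance still see where the removed tokens used to be.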
