Elasticsearch Analyzer (Sentence Analysis)
Analyzer
An analyzer splits a sentence into words, also called terms.
This chapter explains the following topics.
- Components of an analyzer
- The Nori Korean morphological analyzer
- Custom Analyzer
Why should an analyzer be used to split text into words?
The short answer is that, without one, documents cannot be searched accurately. Let us look at a more concrete example.
Index a document to search
POST demo_standard_analyzer/_doc
{
"text":"올해 서울시 예산이 결정되었다."
}
Search the document above with the word “서울”.
GET demo_standard_analyzer/_search
{
"query" : {
"match" : {
"text" : "서울"
}
}
}
No document is found, as shown below.
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
The query “서울” does not find the document even though the document contains “서울시”, because the “서울” inside “서울시” is not matched.
The analyze API shows how an analyzer breaks a string down into the terms stored in the inverted index, so let us use it to investigate the cause.
Break down a sentence with the default Standard Analyzer
POST demo_standard_analyzer/_analyze
{
"text": "올해 서울시 예산이 결정되었다."
}
Result of sentence decomposition with the default Standard Analyzer
{
"tokens" : [
{
"token" : "올해",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<HANGUL>",
"position" : 0
},
{
"token" : "서울시",
"start_offset" : 3,
"end_offset" : 6,
"type" : "<HANGUL>",
"position" : 1
},
{
"token" : "예산이",
"start_offset" : 7,
"end_offset" : 10,
"type" : "<HANGUL>",
"position" : 2
},
{
"token" : "결정되었다",
"start_offset" : 11,
"end_offset" : 16,
"type" : "<HANGUL>",
"position" : 3
}
]
}
As the token values show, the Standard Analyzer splits this sentence only at whitespace and punctuation and stores each chunk, such as “서울시”, in the inverted index as a single term. A search for “서울” therefore has no term to match.
Therefore, to perform accurate searches, Korean text must be broken down into words with an analyzer that supports Korean.
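The failed search above can be reproduced with a toy inverted index. The following is a minimal sketch, not how Elasticsearch is implemented: `whitespace_analyze`, `build_inverted_index`, and `match` are hypothetical helpers, and the whitespace split only imitates the Standard Analyzer output shown above for this one sentence.

```python
from collections import defaultdict


def whitespace_analyze(text):
    """Toy stand-in for the Standard Analyzer on this sentence:
    split on whitespace and strip surrounding punctuation."""
    words = (w.strip(".,!?") for w in text.split())
    return [w for w in words if w]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in whitespace_analyze(text):
            index[term].add(doc_id)
    return index


def match(index, query):
    """Return ids of documents that contain any analyzed query term exactly."""
    hits = set()
    for term in whitespace_analyze(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "올해 서울시 예산이 결정되었다."}
index = build_inverted_index(docs)

print(match(index, "서울"))    # no hit: the stored term is "서울시"
print(match(index, "서울시"))  # hit: the whole term matches
```

Because lookup is by whole term, “서울” finds nothing until the analyzer itself emits “서울” as a separate term.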
Components of an analyzer
An analyzer consists of the following three elements.
- Char Filter: transforms the sentence
- Tokenizer: splits the sentence into tokens, or words
- Token filter: transforms tokens, or words
An analyzer applies these three elements in the order listed above and decomposes a sentence into tokens to create an inverted index.
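The order of the three stages can be sketched as simple function composition. This is an illustrative model only: `ToyAnalyzer`, `strip_tags`, `whitespace_tokenizer`, and `lowercase` are hypothetical names, and the regular expression is a rough approximation of tag removal rather than the behavior of any Elasticsearch component.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToyAnalyzer:
    char_filters: List[Callable[[str], str]]                 # text -> text
    tokenizer: Callable[[str], List[str]]                    # text -> tokens
    token_filters: List[Callable[[List[str]], List[str]]]    # tokens -> tokens

    def analyze(self, text: str) -> List[str]:
        # 1. Char filters transform the raw text.
        for char_filter in self.char_filters:
            text = char_filter(text)
        # 2. The tokenizer splits the text into tokens.
        tokens = self.tokenizer(text)
        # 3. Token filters transform the token list.
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


def strip_tags(text: str) -> str:
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


analyzer = ToyAnalyzer([strip_tags], whitespace_tokenizer, [lowercase])
tokens = analyzer.analyze("<p>Hello Elasticsearch</p>")
print(tokens)  # ['hello', 'elasticsearch']
```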
Char Filter: example of sentence transformation
- HTML Strip Character Filter
- Removes HTML tags.
<p>foo</p> --(transformed)--> foo
Tokenizer: example of splitting a sentence into tokens
- Standard Tokenizer
- Nori Tokenizer
Token filter: example of token transformation
- Lower case Token Filter
- Converts all token characters to lowercase.
- Used when you want to treat Google and google as the same token.
- Stop Token Filter
- Removes stop words, that is, tokens that are not useful for search.
- Used for tasks such as deleting particles.
- Stemmer Token Filter
- Performs stemming.
- Converts forms such as “즐겁게” and “즐겁다” to “즐거움”.
- Synonym Token Filter
- Normalizes synonyms.
- Converts words such as “구글” and “검색” to the same term, “검색”.
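The token filters listed above can be pictured as functions that take a list of tokens and return a new list. The sketch below covers lowercase, stop, and synonym filtering. Stemming is omitted because it needs language-specific rules. All function names, the stop word set, and the synonym table are illustrative assumptions, not Elasticsearch settings.

```python
def lowercase_filter(tokens):
    """Treat "Google" and "google" as the same token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords):
    """Drop tokens that are listed as stop words."""
    return [t for t in tokens if t not in stopwords]


def synonym_filter(tokens, synonyms):
    """Replace each token with its normalized form, if one is defined."""
    return [synonyms.get(t, t) for t in tokens]


tokens = ["Google", "the", "구글", "검색"]
tokens = lowercase_filter(tokens)                 # ['google', 'the', '구글', '검색']
tokens = stop_filter(tokens, {"the"})             # ['google', '구글', '검색']
tokens = synonym_filter(tokens, {"구글": "검색"})  # ['google', '검색', '검색']
print(tokens)
```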
Nori Korean morphological analyzer
Elasticsearch’s default Standard Analyzer does not perform Korean morphological analysis.
To perform Korean search correctly, use the nori Analysis Plugin, a plugin for Korean morphological analysis, to split sentences into words.
Install the nori Analysis Plugin
Command to install the nori Analysis Plugin
bin/elasticsearch-plugin install analysis-nori
The following is an execution example in Docker.
sh-4.4# bin/elasticsearch-plugin install analysis-nori
-> Installing analysis-nori
-> Downloading analysis-nori from elastic
[=================================================] 100%
-> Installed analysis-nori
-> Please restart Elasticsearch to activate any plugins installed
sh-4.4#
For reference, the plugin can be removed as follows.
Command to remove the nori Analysis Plugin
bin/elasticsearch-plugin remove analysis-nori
After installation, restart Elasticsearch to use the nori plugin.
The following example uses the systemctl command.
systemctl restart elasticsearch
The restart method differs depending on the installation environment, so restart it according to your environment.
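After the restart, one way to confirm that the plugin is loaded is to ask the cluster for its plugin list through the `_cat/plugins` API. The sketch below rests on several assumptions: the node is reachable at `http://localhost:9200` without authentication or TLS, the API is called with `format=json`, and each returned row has a `component` key. The sample rows, including the node name and version, are hypothetical and serve only to exercise the check without a running cluster.

```python
import json
from urllib.request import urlopen


def fetch_plugins(base_url="http://localhost:9200"):
    """Fetch the plugin list as JSON. The base URL is an assumption
    and requires a running, unsecured node."""
    with urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.load(resp)


def plugin_installed(rows, component="analysis-nori"):
    """Return True if any row reports the given plugin component."""
    return any(row.get("component") == component for row in rows)


# Hypothetical rows in the assumed response shape, used instead of a live call.
sample_rows = [
    {"name": "node-1", "component": "analysis-nori", "version": "8.0.0"},
]
print(plugin_installed(sample_rows))
```

Against a live cluster, `plugin_installed(fetch_plugins())` would perform the same check.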
Nori Analyzer configuration
- Char Filter
- None
- Tokenizer
- nori_tokenizer
- Token filter
- nori_part_of_speech
- nori_readingform
- lowercase
How to use the Nori Tokenizer
Analyze a string with only the nori tokenizer.
Analyze a sentence with the Nori Tokenizer
GET _analyze
{
"tokenizer": "nori_tokenizer",
"text": [
"올해 서울시 예산이 결정되었다."
]
}
Result of analyzing a sentence with the Nori Tokenizer
{
"tokens" : [
{
"token" : "올",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "해",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "서울",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "시",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 3
},
{
"token" : "예산",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : "이",
"start_offset" : 9,
"end_offset" : 10,
"type" : "word",
"position" : 5
},
{
"token" : "결정",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 6
},
{
"token" : "되",
"start_offset" : 13,
"end_offset" : 14,
"type" : "word",
"position" : 7
},
{
"token" : "었",
"start_offset" : 14,
"end_offset" : 15,
"type" : "word",
"position" : 8
},
{
"token" : "다",
"start_offset" : 15,
"end_offset" : 16,
"type" : "word",
"position" : 9
}
]
}
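The `start_offset` and `end_offset` values in the result above are character positions in the original string, so each token can be recovered by slicing. The short check below uses the tokens copied from that result. It assumes the text is in precomposed (NFC) form, so that each Hangul syllable is a single character in Python.

```python
text = "올해 서울시 예산이 결정되었다."

# (token, start_offset, end_offset) copied from the nori_tokenizer result above.
tokens = [
    ("올", 0, 1), ("해", 1, 2), ("서울", 3, 5), ("시", 5, 6),
    ("예산", 7, 9), ("이", 9, 10), ("결정", 11, 13),
    ("되", 13, 14), ("었", 14, 15), ("다", 15, 16),
]

# Every token equals the slice of the original text at its offsets.
offsets_ok = all(text[start:end] == token for token, start, end in tokens)

# Characters that no token covers: the three spaces and the final period.
covered = {i for _, start, end in tokens for i in range(start, end)}
skipped = "".join(ch for i, ch in enumerate(text) if i not in covered)

print(offsets_ok)     # True
print(repr(skipped))  # '   .'
```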
How to use the Nori Analyzer
Now configure the Nori Analyzer on the demo_analyzer index.
Specify nori as the analyzer
PUT demo_analyzer
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "nori"
      }
    }
  }
}
Next, create a document in the demo_analyzer index and run a Korean search.
Create a document
POST demo_analyzer/_doc
{
"text": "올해 서울시 예산이 결정되었다."
}
Search for “울시”
GET demo_analyzer/_search
{
"query" : {
"match" : {
"text" : "울시"
}
}
}
Search result for “울시”
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
The query “울시” does not find the document containing “서울시”. It seems to be treated as a completely different word.
Let us check which terms the nori Analyzer puts into the inverted index.
Morphological analysis with the nori Analyzer
GET demo_analyzer/_analyze
{
"analyzer" : "nori",
"text" : "올해 서울시 예산이 결정되었다."
}
Morphological analysis result with the nori Analyzer
{
"tokens" : [
{
"token" : "올",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "해",
"start_offset" : 1,
"end_offset" : 2,
"type" : "word",
"position" : 1
},
{
"token" : "서울",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 2
},
{
"token" : "시",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 3
},
{
"token" : "예산",
"start_offset" : 7,
"end_offset" : 9,
"type" : "word",
"position" : 4
},
{
"token" : "결정",
"start_offset" : 11,
"end_offset" : 13,
"type" : "word",
"position" : 6
}
]
}
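The nori Analyzer splits “서울시” into “서울” and “시”, so the indexed terms do not include “울시” and the query does not match. Compared with the tokenizer-only result, the tokens “이”, “되”, “었”, and “다” are also missing, because the analyzer's token filters remove particles and verb endings.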
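Comparing the two results by position makes the effect of the analyzer's token filters visible. The sketch below uses only the tokens and positions shown in the two outputs above. The final lookup assumes, for illustration, that a query word is compared with the indexed terms as a single term.

```python
# position -> token, copied from the nori_tokenizer result.
tokenizer_out = {0: "올", 1: "해", 2: "서울", 3: "시", 4: "예산",
                 5: "이", 6: "결정", 7: "되", 8: "었", 9: "다"}

# position -> token, copied from the nori Analyzer result.
analyzer_out = {0: "올", 1: "해", 2: "서울", 3: "시", 4: "예산", 6: "결정"}

# Tokens the tokenizer produced but the full analyzer dropped.
removed = [tokenizer_out[p] for p in sorted(tokenizer_out) if p not in analyzer_out]
print(removed)  # ['이', '되', '었', '다']

# Terms that end up in the inverted index for this document.
indexed_terms = set(analyzer_out.values())
print("서울" in indexed_terms)  # True
print("울시" in indexed_terms)  # False
```

The gap at position 5 in the analyzer output is where “이” was removed. The positions of the remaining tokens are preserved.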
Custom Analyzer
You can freely create a custom analyzer by combining the following three elements.
- Char Filter
- Tokenizer
- Token filter
For example, create the following Custom Analyzer.
Create a Custom Analyzer
PUT custome_analyze
{
  "settings": {
    "analysis": {
      "analyzer": {
        "original_analyze": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "nori_tokenizer",
          "filter": [ "my_stop" ]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": [ "올", "해", "이" ]
        }
      }
    }
  }
}
Analyze a sentence with the analyzer you created.
Use the Custom Analyzer
POST custome_analyze/_analyze
{
"analyzer" : "original_analyze",
"text" : "<p>올해 서울시 예산이 결정되었다</p>"
}
Result of using the Custom Analyzer
{
"tokens" : [
{
"token" : "서울",
"start_offset" : 6,
"end_offset" : 8,
"type" : "word",
"position" : 2
},
{
"token" : "시",
"start_offset" : 8,
"end_offset" : 9,
"type" : "word",
"position" : 3
},
{
"token" : "예산",
"start_offset" : 10,
"end_offset" : 12,
"type" : "word",
"position" : 4
},
{
"token" : "결정",
"start_offset" : 14,
"end_offset" : 16,
"type" : "word",
"position" : 6
},
{
"token" : "되",
"start_offset" : 16,
"end_offset" : 17,
"type" : "word",
"position" : 7
},
{
"token" : "었",
"start_offset" : 17,
"end_offset" : 18,
"type" : "word",
"position" : 8
},
{
"token" : "다",
"start_offset" : 18,
"end_offset" : 19,
"type" : "word",
"position" : 9
}
]
}
The output reflects the following processing defined in the custom analyzer original_analyze.
- html_strip removes the HTML <p> tag.
- nori_tokenizer converts the sentence into tokens.
- my_stop removes "올", "해", and "이" as defined in stopwords.
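The same three steps can be replayed outside Elasticsearch to see where each token in the result comes from. This is a sketch under stated assumptions: `html_strip` here is a rough regular-expression stand-in for the real char filter, the tokenizer is not reimplemented (its output is copied from the nori_tokenizer result earlier in this chapter), and `my_stop` is a plain list filter.

```python
import re


def html_strip(text):
    """Rough stand-in for the html_strip char filter: drop anything tag-like."""
    return re.sub(r"<[^>]+>", "", text)


# (token, position) pairs copied from the nori_tokenizer result for this sentence.
NORI_TOKENS = [("올", 0), ("해", 1), ("서울", 2), ("시", 3), ("예산", 4),
               ("이", 5), ("결정", 6), ("되", 7), ("었", 8), ("다", 9)]


def my_stop(tokens, stopwords=("올", "해", "이")):
    """Remove stop words while keeping the original positions of the rest."""
    return [(token, pos) for token, pos in tokens if token not in stopwords]


stripped = html_strip("<p>올해 서울시 예산이 결정되었다</p>")
result = my_stop(NORI_TOKENS)

print(stripped)  # 올해 서울시 예산이 결정되었다
print(result)
```

The remaining positions are 2, 3, 4, 6, 7, 8, and 9, matching the `position` values in the result above. Unlike the nori Analyzer, this custom analyzer keeps “되”, “었”, and “다” because it has no part-of-speech filter.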