Elasticsearch Analyzer (Text Analysis)

Analyzer

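An Analyzer splits text into individual words (terms).

This chapter covers the following topics.

Components of an Analyzer
nori Korean morphological analyzer
Custom Analyzer (customizing an Analyzer)

Why text needs to be split into words with an Analyzer

The short answer is "because documents cannot be searched accurately otherwise," but let's confirm this with a concrete example.

Preparing a document to search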

POST demo_standard_analyzer/_doc
{
  "text":"올해 서울시 예산이 결정되었다."
}

Search the document above for the word 서울.

GET demo_standard_analyzer/_search
{
  "query" : {
     "match" : {
       "text" : "서울"
    }
  }
}

The search returns no hits, as shown below.

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

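The query 서울 does not hit the document containing 서울시. A match query looks up whole terms in the inverted index, so the 서울 part of 서울시 cannot be matched on its own.

The analyze API shows how an Analyzer breaks a string into the tokens that are stored in the inverted index, so let's use it to investigate the cause.

Analyzing the text with the default Standard Analyzer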

POST demo_standard_analyzer/_analyze
{
  "text": "올해 서울시 예산이 결정되었다."
}

Result of analyzing the text with the default Standard Analyzer

{
  "tokens" : [
    {
      "token" : "올해",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<HANGUL>",
      "position" : 0
    },
    {
      "token" : "서울시",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<HANGUL>",
      "position" : 1
    },
    {
      "token" : "예산이",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "<HANGUL>",
      "position" : 2
    },
    {
      "token" : "결정되었다",
      "start_offset" : 11,
      "end_offset" : 16,
      "type" : "<HANGUL>",
      "position" : 3
    }
  ]
}

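The token values show that, for this sentence, the Standard Analyzer splits the text only at spaces and punctuation and stores 서울시 in the inverted index as a single term. No term 서울 exists, which is why the search above returned nothing.

Therefore, to search correctly, the text must be broken into words by an Analyzer that supports Korean.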
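
The lookup failure can be reproduced with a toy inverted index. The following is only a minimal Python sketch, not how Elasticsearch is implemented: the function names build_inverted_index and lookup and the document id 1 are assumptions made for illustration, and the token list is copied from the Standard Analyzer result above.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids whose token list contains it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def lookup(index, term):
    """Return the ids of documents indexed under exactly this term."""
    return sorted(index.get(term, set()))


# Tokens copied from the Standard Analyzer output above; doc id 1 is arbitrary.
docs = {1: ["올해", "서울시", "예산이", "결정되었다"]}
index = build_inverted_index(docs)

print(lookup(index, "서울"))    # [] : no term "서울" was indexed
print(lookup(index, "서울시"))  # [1] : the whole term matches
```

Because the lookup is by whole term, the only way to make 서울 searchable is to index it as its own term, which is what a Korean-aware tokenizer does.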

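Components of an Analyzer

An Analyzer consists of the following three components.

Char Filter: transforms the text
Tokenizer: splits the text into Tokens (words)
Token filter: transforms the Tokens (words)

An Analyzer applies these three components in order from top to bottom, breaking the text into Tokens (words) from which the inverted index is built.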
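
The order of the three stages can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Elasticsearch code: html_strip, whitespace_tokenizer, lowercase_filter, and analyze are hypothetical stand-ins written for this example, and the regular expression is a crude approximation of real HTML stripping.

```python
import re


def html_strip(text):
    """Char filter stand-in: remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer stand-in: split the text on whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Token filter stand-in: lowercase every token."""
    return [token.lower() for token in tokens]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<p>Hello Elasticsearch</p>",
    char_filters=[html_strip],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter],
)
print(tokens)  # ['hello', 'elasticsearch']
```

Note that an analyzer has exactly one tokenizer but may have any number of char filters and token filters, which is why those two are passed as lists.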

Char Filter: example of text transformation

HTML Strip Character Filter
- Removes HTML tags.
- <p>foo</p> --(converted)--> foo

Tokenizer: examples of splitting text into Tokens (words)

Standard Tokenizer (the default tokenizer)
Nori Tokenizer (the nori tokenizer)

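Token filter: examples of transforming Tokens (words)

Lower case Token Filter
- Converts all characters in a token to lowercase.
- Use it when Google and google should be treated as the same token.
Stop Token Filter
- Removes tokens that should not be indexed.
- Used, for example, to remove particles.
Stemmer Token Filter
- Performs stemming.
- Converts 즐겁게 and 즐겁다 to 즐거움.
Synonym Token Filter
- Normalizes synonyms.
- Converts 구글 and 검색 to the same token, 검색.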
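
The effect of three of these filters can be imitated on a plain token list. The sketch below is illustrative only: lowercase, stop, and synonym are hypothetical helper functions, the stopword set and the synonym mapping are assumptions chosen to mirror the examples in the list, and stemming is left out because it needs a real language-specific stemmer.

```python
def lowercase(tokens):
    """Lower case filter: Google and google become the same token."""
    return [token.lower() for token in tokens]


def stop(tokens, stopwords):
    """Stop filter: drop tokens listed as stopwords."""
    return [token for token in tokens if token not in stopwords]


def synonym(tokens, mapping):
    """Synonym filter: replace a token with its normalized form when one is defined."""
    return [mapping.get(token, token) for token in tokens]


tokens = ["Google", "the", "구글"]
tokens = lowercase(tokens)                # ['google', 'the', '구글']
tokens = stop(tokens, {"the"})            # ['google', '구글']
tokens = synonym(tokens, {"구글": "검색"})  # ['google', '검색']
print(tokens)
```

The order matters: here the stop list is written in lowercase, so it only works because lowercase runs first.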

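nori Korean morphological analyzer

Elasticsearch's default Standard Analyzer does not perform Korean morphological analysis.

To search Korean text correctly, the text must be split into words with the nori Analysis Plugin, a plugin for Korean morphological analysis.

Installing the nori Analysis Plugin

Command to install the nori Analysis Plugin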

bin/elasticsearch-plugin install analysis-nori

The following is the output from running it in Docker.

sh-4.4# bin/elasticsearch-plugin install analysis-nori
-> Installing analysis-nori
-> Downloading analysis-nori from elastic
[=================================================] 100%
-> Installed analysis-nori
-> Please restart Elasticsearch to activate any plugins installed
sh-4.4#

For reference, the plugin can be removed as follows.

Command to remove the nori Analysis Plugin

bin/elasticsearch-plugin remove analysis-nori

After installation, Elasticsearch must be restarted before the nori plugin can be used.

With the systemctl command, restart as follows.

systemctl restart elasticsearch

The restart procedure depends on how Elasticsearch was installed, so restart it in the way that suits your environment.

Components of the Nori Analyzer

Char Filter
- None
Tokenizer
- nori_tokenizer
Token filter
- nori_part_of_speech
- nori_readingform
- lowercase

Using the Nori Tokenizer

Let's tokenize a plain string with the nori tokenizer alone.

Analyzing the text with the Nori Tokenizer

GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": [
    "올해 서울시 예산이 결정되었다."
  ]
}

Result of analyzing the text with the Nori Tokenizer

{
  "tokens" : [
    {
      "token" : "올",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "해",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "서울",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "시",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "예산",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "이",
      "start_offset" : 9,
      "end_offset" : 10,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "결정",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "되",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "었",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "다",
      "start_offset" : 15,
      "end_offset" : 16,
      "type" : "word",
      "position" : 9
    }
  ]
}
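
The start_offset and end_offset values in this result are character positions in the original text, with end_offset exclusive. A small Python check makes this concrete. The helper name slice_by_offsets is an assumption for this example, and the token data is copied from the result above.

```python
text = "올해 서울시 예산이 결정되었다."

# (token, start_offset, end_offset) copied from the Nori Tokenizer result above.
tokens = [
    ("올", 0, 1), ("해", 1, 2), ("서울", 3, 5), ("시", 5, 6),
    ("예산", 7, 9), ("이", 9, 10), ("결정", 11, 13),
    ("되", 13, 14), ("었", 14, 15), ("다", 15, 16),
]


def slice_by_offsets(source, start, end):
    """Return the substring that a token's offsets point at."""
    return source[start:end]


# Every token should equal the slice of the source text at its offsets.
mismatches = [
    (token, start, end)
    for token, start, end in tokens
    if slice_by_offsets(text, start, end) != token
]
print(mismatches)  # an empty list means all offsets line up
```

Offsets 2, 6, and 10 are the spaces and offset 16 is the final period, none of which appear as tokens.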

Using the Nori Analyzer

Let's actually configure the Nori Analyzer on the demo_analyzer index.

Specifying nori as the analyzer

PUT demo_analyzer
{
  "mappings" : {
     "properties" : {
       "text" :{
         "type" : "text" ,
         "analyzer" : "nori"
      }
    }
  }
}

Next, create a document in the demo_analyzer index and run a Korean search.

Creating the document

POST demo_analyzer/_doc
{
  "text": "올해 서울시 예산이 결정되었다."
}

Searching for 울시

GET demo_analyzer/_search
{
  "query" : {
     "match" : {
       "text" : "울시"
    }
  }
}

Search result for 울시

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

The query 울시 does not hit 서울시. It appears to be recognized as a completely different word.

Let's check what inverted index the nori Analyzer produces.

Morphological analysis with the nori Analyzer

GET demo_analyzer/_analyze
{
  "analyzer" : "nori", 
  "text" : "올해 서울시 예산이 결정되었다."
}

Result of morphological analysis with the nori Analyzer

{
  "tokens" : [
    {
      "token" : "올",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "해",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "서울",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "시",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "예산",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "결정",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "word",
      "position" : 6
    }
  ]
}

The nori Analyzer splits 서울시 into 서울 and 시, so 울시 does not hit 서울시.
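
Comparing this result with the earlier Nori Tokenizer result also shows which tokens the analyzer's token filters dropped. The sketch below simply diffs the two token lists, both copied from the outputs in this chapter. Attributing the removal to nori_part_of_speech is an assumption based on that filter's purpose of removing tokens by part of speech.

```python
# (token, position) pairs copied from the Nori Tokenizer result.
tokenizer_tokens = [
    ("올", 0), ("해", 1), ("서울", 2), ("시", 3), ("예산", 4),
    ("이", 5), ("결정", 6), ("되", 7), ("었", 8), ("다", 9),
]

# (token, position) pairs copied from the nori Analyzer result.
analyzer_tokens = [
    ("올", 0), ("해", 1), ("서울", 2), ("시", 3), ("예산", 4), ("결정", 6),
]

# Tokens produced by the tokenizer but missing from the analyzer output.
removed = [pair for pair in tokenizer_tokens if pair not in analyzer_tokens]
print(removed)  # [('이', 5), ('되', 7), ('었', 8), ('다', 9)]
```

The surviving tokens keep their original positions, which is why the analyzer output jumps from position 4 to position 6.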

Custom Analyzer (customizing an Analyzer)

An Analyzer can be freely customized by combining the following three components.

Char Filter
Tokenizer
Token filter

For example, create the following Custom Analyzer.

Creating the Custom Analyzer

PUT custome_analyze
{
  "settings" : {
     "analysis" : {
       "analyzer" : {
         "original_analyze" :{
           "char_filter" :[ "html_strip" ],
           "tokenizer" : "nori_tokenizer" ,
           "filter" :[ "my_stop" ]
        }
      },
      "filter" :{
         "my_stop" :{
           "type" : "stop" ,
           "stopwords" :[ "올", "해" , "이" ]  
        }
      }
    }
  }
}

Let's analyze text with the Analyzer we created.

Using the Custom Analyzer

POST custome_analyze/_analyze
{
  "analyzer" : "original_analyze",
  "text" : "<p>올해 서울시 예산이 결정되었다</p>"
}

Result of using the Custom Analyzer

{
  "tokens" : [
    {
      "token" : "서울",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "시",
      "start_offset" : 8,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "예산",
      "start_offset" : 10,
      "end_offset" : 12,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "결정",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "되",
      "start_offset" : 16,
      "end_offset" : 17,
      "type" : "word",
      "position" : 7
    },
    {
      "token" : "었",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "word",
      "position" : 8
    },
    {
      "token" : "다",
      "start_offset" : 18,
      "end_offset" : 19,
      "type" : "word",
      "position" : 9
    }
  ]
}

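The output reflects the following processing defined in the Custom Analyzer original_analyze.

html_strip removes the HTML <p> tags.
nori_tokenizer splits the text into Tokens (words).
The stopwords defined in my_stop remove "올", "해", and "이".

Unlike the nori Analyzer result, 되, 었, and 다 remain here, most likely because this analyzer does not include the nori_part_of_speech filter.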
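
Two details of the result above can be checked with a short Python sketch: the stop filter leaves gaps in the position numbers, and the offsets still refer to the original text including the <p> tag. The function stop_filter is a hypothetical stand-in for the my_stop filter. The input token list is assumed to be the same as the earlier Nori Tokenizer result, since the sentence is the same apart from the final period.

```python
def stop_filter(tokens, stopwords):
    """Drop stopword tokens but keep each surviving token's original position."""
    return [(term, pos) for term, pos in tokens if term not in stopwords]


# (token, position) pairs, assumed equal to the earlier Nori Tokenizer result.
tokenized = [
    ("올", 0), ("해", 1), ("서울", 2), ("시", 3), ("예산", 4),
    ("이", 5), ("결정", 6), ("되", 7), ("었", 8), ("다", 9),
]

kept = stop_filter(tokenized, {"올", "해", "이"})
positions = [pos for _, pos in kept]
print(positions)  # [2, 3, 4, 6, 7, 8, 9], the same positions as in the result above

# Offsets in the result point into the original text, tags included.
html = "<p>올해 서울시 예산이 결정되었다</p>"
print(html[6:8], html[14:16])  # 서울 결정
```

Keeping the original positions means a gap stays where a stopword was removed.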

References

6.7.2 nori Korean morphological analyzer