Yasuhiro Horimoto 2019-01-08 10:04:06 +0900 (Tue, 08 Jan 2019) Revision: 2bb5278c3af233a52c5835c03a440f12762757ae https://github.com/groonga/groonga/commit/2bb5278c3af233a52c5835c03a440f12762757ae Message: doc: add missing explain of options Added files: doc/source/example/reference/token_filters/nfkc100-unify-hyphen-and-prolonged-sound-mark.log doc/source/example/reference/token_filters/nfkc100-unify-hyphen.log doc/source/example/reference/token_filters/nfkc100-unify-kana-case-hiragana.log doc/source/example/reference/token_filters/nfkc100-unify-kana-case-katakana.log doc/source/example/reference/token_filters/nfkc100-unify-kana.log doc/source/example/reference/token_filters/nfkc100-unify-katakana-bu-sounds.log doc/source/example/reference/token_filters/nfkc100-unify-katakana-v-sounds.log doc/source/example/reference/token_filters/nfkc100-unify-middle-dot.log doc/source/example/reference/token_filters/nfkc100-unify-prolonged-sound-mark.log doc/source/example/reference/token_filters/nfkc100-unify-to-romaji.log doc/source/example/reference/token_filters/nfkc100-unify-voiced-sound-mark-hiragana.log doc/source/example/reference/token_filters/nfkc100-unify-voiced-sound-mark-katakana.log Modified files: doc/locale/ja/LC_MESSAGES/reference.po doc/source/example/reference/token_filters/nfkc100.log doc/source/reference/token_filters/token_filter_nfkc100.rst Modified: doc/locale/ja/LC_MESSAGES/reference.po (+93 -21) =================================================================== --- doc/locale/ja/LC_MESSAGES/reference.po 2019-01-07 18:24:08 +0900 (4373ae2cf) +++ doc/locale/ja/LC_MESSAGES/reference.po 2019-01-08 10:04:06 +0900 (5e0090f99) @@ -27011,30 +27011,108 @@ msgid "``TokenFilterNFKC100``" msgstr "" msgid "" -"This token filter can translate a token for katakana to hiragana with same " -"option of NormalizerNFKC100. This token filter convenient when you want to " -"get reading of token as hiragana." 
+"This token filter can use the same option by :ref:`normalizer-nfkc100`. This " +"token filter is used to normalize after tokenizing. Because, if you " +"normalize before tokenizing with ``TokenMecab`` , the meaning of a token may " +"be lost." msgstr "" -"このトークンフィルターは、 ``NormalizerNFKC100`` と同じオプションでカタカナの" -"トークンをひらがなに変換できます。このトークンフィルターは、トークンの読みが" -"なをひらがなとして取得したい時に便利です。" +"このトークンフィルターは、 :ref:`normalizer-nfkc100` と同じオプションを使えま" +"す。``TokenMecab`` を使ってトークナイズする前にノーマライズをすると、トークン" +"の意味が失われることがあるため、このトークンフィルターは、トークナイズ後に" +"ノーマライズするために使用します。" -msgid "``TokenFilterNFKC100`` has a parameter::" -msgstr "``TokenFilterNFKC100`` は引数が一つあります。" +msgid "``TokenFilterNFKC100`` has optional parameter::" +msgstr "``TokenFilterNFKC100`` は省略可能な引数があります。" msgid "" -"Here is an example of ``TokenFilterNFKC100``. ``リンゴ`` is translated to ``" -"りんご``." +"``TokenFilterNFKC100`` normalizes text by Unicode NFKC (Normalization Form " +"Compatibility Composition) for Unicode version 10.0." msgstr "" -"以下は、 ``TokenFilterNFKC100`` の例です。 ``リンゴ`` を ``りんご`` へ変換し" -"ています。" +"``TokenFilterNFKC100`` はUnicode 10.0用のUnicode NFKC(Normalization Form " +"Compatibility Composition)を使ってテキストを正規化します。" + +msgid "" +"Here is an example of ``TokenFilterNFKC100``. ``TokenFilterNFKC100`` " +"normalizes text by Unicode NFKC (Normalization Form Compatibility " +"Composition) for Unicode version 10.0." +msgstr "" +"以下は、``TokenFilterNFKC100`` の使用例です。 ``TokenFilterNFKC100`` は" +"Unicode 10.0用のUnicode NFKC(Normalization Form Compatibility Composition)" +"を使ってテキストを正規化します。" + +msgid "Here is an example of :ref:`token-filter-nfkc100-unify-kana` option." +msgstr "以下は :ref:`token-filter-nfkc100-unify-kana` オプションの使用例です。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-kana-case` option." +msgstr "" +"以下は :ref:`token-filter-nfkc100-unify-kana-case` オプションの使用例です。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-kana-voiced-sound-" +"mark` option." 
+msgstr "" +"以下は、 :ref:`token-filter-nfkc100-unify-kana-voiced-sound-mark` オプション" +"の使用例です。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-hyphen` option. This " +"option normalizes hyphens to \"-\" (U+002D HYPHEN-MINUS) as below." +msgstr "" +"以下は、 :ref:`token-filter-nfkc100-unify-hyphen` オプションの使用例です。こ" +"のオプションは、以下のように、ハイフンを\"-\" (U+002D HYPHEN-MINUS)に正規化し" +"ます。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-prolonged-sound-mark` " +"option. This option normalizes prolonged sound marks to \"ー\" (U+30FC " +"KATAKANA-HIRAGANA PROLONGED SOUND MARK) as below." +msgstr "" +"以下は、 :ref:`token-filter-nfkc100-unify-prolonged-sound-mark` オプションの" +"使用例です。このオプションは、以下のように長音記号を\"ー\" (U+30FC KATAKANA-" +"HIRAGANA PROLONGED SOUND MARK)に正規化します。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-hyphen-and-prolonged-" +"sound-mark` option. This option normalizes hyphens and prolonged sound marks " +"to \"-\" (U+002D HYPHEN-MINUS) as below." +msgstr "" +"以下は、:ref:`token-filter-nfkc100-unify-hyphen-and-prolonged-sound-mark` オ" +"プションの使用例です。このオプションは、以下のように、ハイフンと長音記号を\"-" +"\" (U+002D HYPHEN-MINUS)に正規化します。" msgid "" -"``TokenFilterNFKC100`` is not translate a token of hiragana and kanji as " +"Here is an example of :ref:`token-filter-nfkc100-unify-middle-dot` option. " +"This option normalizes middle dots to \"·\" (U+00B7 MIDDLE DOT) as " "below." msgstr "" -"以下のように、 ``TokenFilterNFKC100`` はひらがなと漢字のトークンは変換しませ" -"ん。" +"以下は、:ref:`token-filter-nfkc100-unify-middle-dot` オプションの使用例です。" +"このオプションは、中点を\"·\" (U+00B7 MIDDLE DOT)に正規化します。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-katakana-v-sounds` " +"option. This option normalizes \"ヴァヴィヴヴェヴォ\" to \"バビブベボ" +"\" as below." 
+msgstr "" +"以下は、:ref:`token-filter-nfkc100-unify-katakana-v-sounds` オプションの使用" +"例です。このオプションは、以下のように、\"ヴァヴィヴヴェヴォ\"を\"バビブベボ" +"\"に正規化します。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-katakana-bu-sounds` " +"option. This option normalizes \"ヴァヴィヴゥヴェヴォ\" to \"ブ\" as " +"below." +msgstr "" +"以下は、:ref:`token-filter-nfkc100-unify-katakana-bu-sounds` オプションの使用" +"例です。このオプションは、以下のように、\"ヴァヴィヴゥヴェヴォ\"を\"ブ\"に正" +"規化します。" + +msgid "" +"Here is an example of :ref:`token-filter-nfkc100-unify-to-romaji` option. " +"This option normalizes hiragana and katakana to romaji as below." +msgstr "" +"以下は、 :ref:`token-filter-nfkc100-unify-to-romaji` オプションの使用例です。" +"このオプションは、以下のように、ひらがなとカタカナをローマ字に正規化します。" msgid "" "You can output all input string as hiragana with combining " @@ -27044,12 +27122,6 @@ msgstr "" "``TokenFilterNFKC100`` と ``TokenMecab`` の ``use_reading`` オプションを組み" "合わせることで、入力文字列を全てひらがなとして出力できます。" -msgid "There are a required parameters ``unify_kana``." -msgstr "必須の引数 ``unify_kana`` があります。" - -msgid "Translate a token katakana to hiragana." 
-msgstr "カタカナのトークンをひらがなに変換します。" - msgid "``TokenFilterStem``" msgstr "" Added: doc/source/example/reference/token_filters/nfkc100-unify-hyphen-and-prolonged-sound-mark.log (+30 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-hyphen-and-prolonged-sound-mark.log 2019-01-08 10:04:06 +0900 (2a9fbfde6) @@ -0,0 +1,30 @@ +Execution example:: + + tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋− ﹣- ー—―─━ー" --token_filters 'TokenFilterNFKC100("unify_hyphen_and_prolonged_sound_mark", true)' + # [ + # [ + # 0, + # 1546907138.989727, + # 0.0003240108489990234 + # ], + # [ + # { + # "value": "-----------", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # }, + # { + # "value": "--", + # "position": 1, + # "force_prefix": false, + # "force_prefix_search": false + # }, + # { + # "value": "------", + # "position": 2, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-hyphen.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-hyphen.log 2019-01-08 10:04:06 +0900 (93fb81b76) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋−" --token_filters 'TokenFilterNFKC100("unify_hyphen", true)' + # [ + # [ + # 0, + # 1546907023.849045, + # 0.0003139972686767578 + # ], + # [ + # { + # "value": "-----------", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-kana-case-hiragana.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-kana-case-hiragana.log 2019-01-08 10:04:06 +0900 (52b686b3b) @@ -0,0 +1,18 @@ +Execution example:: + + 
tokenize TokenDelimit "ぁあぃいぅうぇえぉおゃやゅゆょよゎわゕかゖけ" --token_filters 'TokenFilterNFKC100("unify_kana_case", true)' + # [ + # [ + # 0, + # 1546906658.116119, + # 0.0003299713134765625 + # ], + # [ + # { + # "value": "ああいいううええおおややゆゆよよわわかかけけ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-kana-case-katakana.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-kana-case-katakana.log 2019-01-08 10:04:06 +0900 (20370d15c) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "ァアィイゥウェエォオャヤュユョヨヮワヵカヶケ" --token_filters 'TokenFilterNFKC100("unify_kana_case", true)' + # [ + # [ + # 0, + # 1546906730.305962, + # 0.0003023147583007812 + # ], + # [ + # { + # "value": "アアイイウウエエオオヤヤユユヨヨワワカカケケ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-kana.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-kana.log 2019-01-08 10:04:06 +0900 (ce2ead3ae) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "あイウェおヽヾ" --token_filters 'TokenFilterNFKC100("unify_kana", true)' + # [ + # [ + # 0, + # 1546906576.590515, + # 0.0003581047058105469 + # ], + # [ + # { + # "value": "あいうぇおゝゞ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-katakana-bu-sounds.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-katakana-bu-sounds.log 2019-01-08 10:04:06 +0900 (5f1864314) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize 
TokenDelimit "ヴァヴィヴヴェヴォヴ" --token_filters 'TokenFilterNFKC100("unify_katakana_bu_sound", true)' + # [ + # [ + # 0, + # 1546907361.518968, + # 0.0002958774566650391 + # ], + # [ + # { + # "value": "ブブブブブブ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-katakana-v-sounds.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-katakana-v-sounds.log 2019-01-08 10:04:06 +0900 (23cdf5bab) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "ヴァヴィヴヴェヴォヴ" --token_filters 'TokenFilterNFKC100("unify_katakana_v_sounds", true)' + # [ + # [ + # 0, + # 1546907295.776949, + # 0.0003447532653808594 + # ], + # [ + # { + # "value": "バビブベボブ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-middle-dot.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-middle-dot.log 2019-01-08 10:04:06 +0900 (caed36d0c) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "·ᐧ•∙⋅⸱・・" --token_filters 'TokenFilterNFKC100("unify_middle_dot", true)' + # [ + # [ + # 0, + # 1546907221.227195, + # 0.0003573894500732422 + # ], + # [ + # { + # "value": "········", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-prolonged-sound-mark.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-prolonged-sound-mark.log 2019-01-08 10:04:06 +0900 (d7a39bed3) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "ー—―─━ー" --token_filters 
'TokenFilterNFKC100("unify_prolonged_sound_mark", true)' + # [ + # [ + # 0, + # 1546907076.575454, + # 0.0003325939178466797 + # ], + # [ + # { + # "value": "ーーーーーー", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-to-romaji.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-to-romaji.log 2019-01-08 10:04:06 +0900 (e512e4a1e) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "アァイィウゥエェオォ" --token_filters 'TokenFilterNFKC100("unify_to_romaji", true)' + # [ + # [ + # 0, + # 1546907415.47742, + # 0.0003619194030761719 + # ], + # [ + # { + # "value": "axaixiuxuexeoxo", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-voiced-sound-mark-hiragana.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-voiced-sound-mark-hiragana.log 2019-01-08 10:04:06 +0900 (ede9108f4) @@ -0,0 +1,18 @@ +Execution example:: + + tokenize TokenDelimit "かがきぎくぐけげこごさざしじすずせぜそぞただちぢつづてでとどはばぱひびぴふぶぷへべぺほぼぽ" --token_filters 'TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)' + # [ + # [ + # 0, + # 1546906812.423493, + # 0.0003724098205566406 + # ], + # [ + # { + # "value": "かかききくくけけここささししすすせせそそたたちちつつててととはははひひひふふふへへへほほほ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Added: doc/source/example/reference/token_filters/nfkc100-unify-voiced-sound-mark-katakana.log (+18 -0) 100644 =================================================================== --- /dev/null +++ doc/source/example/reference/token_filters/nfkc100-unify-voiced-sound-mark-katakana.log 2019-01-08 10:04:06 +0900 (941082c8e) @@ -0,0 +1,18 
@@ +Execution example:: + + tokenize TokenDelimit "カガキギクグケゲコゴサザシジスズセゼソゾタダチヂツヅテデトドハバパヒビピフブプヘベペホボポ" --token_filters 'TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)' + # [ + # [ + # 0, + # 1546906950.51529, + # 0.0003533363342285156 + # ], + # [ + # { + # "value": "カカキキククケケココササシシススセセソソタタチチツツテテトトハハハヒヒヒフフフヘヘヘホホホ", + # "position": 0, + # "force_prefix": false, + # "force_prefix_search": false + # } + # ] + # ] Modified: doc/source/example/reference/token_filters/nfkc100.log (+4 -4) =================================================================== --- doc/source/example/reference/token_filters/nfkc100.log 2019-01-07 18:24:08 +0900 (22c6cd663) +++ doc/source/example/reference/token_filters/nfkc100.log 2019-01-08 10:04:06 +0900 (a346593fe) @@ -1,15 +1,15 @@ Execution example:: - tokenize TokenDelimit "リンゴ" --token_filters 'TokenFilterNFKC100("unify_kana", true)' + tokenize TokenDelimit "㎡" --token_filters 'TokenFilterNFKC100' # [ # [ # 0, - # 1545901643.191951, - # 0.0003898143768310547 + # 1546906509.304568, + # 0.0002825260162353516 # ], # [ # { - # "value": "りんご", + # "value": "m2", # "position": 0, # "force_prefix": false, # "force_prefix_search": false Modified: doc/source/reference/token_filters/token_filter_nfkc100.rst (+234 -12) =================================================================== --- doc/source/reference/token_filters/token_filter_nfkc100.rst 2019-01-07 18:24:08 +0900 (2be294b07) +++ doc/source/reference/token_filters/token_filter_nfkc100.rst 2019-01-08 10:04:06 +0900 (3cb536594) @@ -11,33 +11,139 @@ Summary ------- -This token filter can translate a token for katakana to hiragana with same option of NormalizerNFKC100. -This token filter convenient when you want to get reading of token as hiragana. +.. versionadded:: 8.0.9 + +This token filter can use the same option by :ref:`normalizer-nfkc100`. +This token filter is used to normalize after tokenizing. 
+Because, if you normalize before tokenizing with ``TokenMecab`` , the meaning of a token may be lost. Syntax ------ -``TokenFilterNFKC100`` has a parameter:: +``TokenFilterNFKC100`` has optional parameter:: + +No options:: + + TokenFilterNFKC100 + +``TokenFilterNFKC100`` normalizes text by Unicode NFKC (Normalization Form Compatibility Composition) +for Unicode version 10.0. + +Specify option:: TokenFilterNFKC100("unify_kana", true) + TokenFilterNFKC100("unify_kana_case", true) + + TokenFilterNFKC100("unify_kana_voiced_sound_mark", true) + + TokenFilterNFKC100("unify_hyphen", true) + + TokenFilterNFKC100("unify_prolonged_sound_mark", true) + + TokenFilterNFKC100("unify_hyphen_and_prolonged_sound_mark", true) + + TokenFilterNFKC100("unify_middle_dot", true) + + TokenFilterNFKC100("unify_katakana_v_sounds", true) + + TokenFilterNFKC100("unify_katakana_bu_sound", true) + + TokenFilterNFKC100("unify_to_romaji", true) + Usage ----- Simple usage ------------ -Here is an example of ``TokenFilterNFKC100``. ``リンゴ`` is translated to ``りんご``. +Here is an example of ``TokenFilterNFKC100``. ``TokenFilterNFKC100`` normalizes text by Unicode NFKC (Normalization Form Compatibility Composition) for Unicode version 10.0. .. groonga-command .. include:: ../../example/reference/token_filters/nfkc100.log -.. tokenize TokenDelimit "リンゴ" --token_filters 'TokenFilterNFKC100("unify_kana", true)' +.. tokenize TokenDelimit "㎡" --token_filters 'TokenFilterNFKC100' + +Here is an example of :ref:`token-filter-nfkc100-unify-kana` option. + +With this option, characters that have the same pronunciation in full-width Hiragana, full-width Katakana and half-width Katakana are regarded as the same character, as shown below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-kana.log +.. tokenize TokenDelimit "あイウェおヽヾ" --token_filters 'TokenFilterNFKC100("unify_kana", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-kana-case` option. 
+ +With this option, large and small versions of the same letter in full-width Hiragana, full-width Katakana and half-width Katakana are regarded as the same character, as shown below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-kana-case-hiragana.log +.. tokenize TokenDelimit "ぁあぃいぅうぇえぉおゃやゅゆょよゎわゕかゖけ" --token_filters 'TokenFilterNFKC100("unify_kana_case", true)' + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-kana-case-katakana.log +.. tokenize TokenDelimit "ァアィイゥウェエォオャヤュユョヨヮワヵカヶケ" --token_filters 'TokenFilterNFKC100("unify_kana_case", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-kana-voiced-sound-mark` option. + +With this option, letters with and without a voiced sound mark or semi-voiced sound mark in full-width Hiragana, full-width Katakana and half-width Katakana are regarded as the same character, as shown below. + + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-voiced-sound-mark-hiragana.log +.. tokenize TokenDelimit "かがきぎくぐけげこごさざしじすずせぜそぞただちぢつづてでとどはばぱひびぴふぶぷへべぺほぼぽ" --token_filters 'TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)' + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-voiced-sound-mark-katakana.log +.. tokenize TokenDelimit "カガキギクグケゲコゴサザシジスズセゼソゾタダチヂツヅテデトドハバパヒビピフブプヘベペホボポ" --token_filters 'TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-hyphen` option. +This option normalizes hyphens to "-" (U+002D HYPHEN-MINUS) as below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-hyphen.log +.. tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋−" --token_filters 'TokenFilterNFKC100("unify_hyphen", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-prolonged-sound-mark` option. 
+This option normalizes prolonged sound marks to "ー" (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK) as below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-prolonged-sound-mark.log +.. tokenize TokenDelimit "ー—―─━ー" --token_filters 'TokenFilterNFKC100("unify_prolonged_sound_mark", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-hyphen-and-prolonged-sound-mark` option. +This option normalizes hyphens and prolonged sound marks to "-" (U+002D HYPHEN-MINUS) as below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-hyphen-and-prolonged-sound-mark.log +.. tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋− ﹣- ー—―─━ー" --token_filters 'TokenFilterNFKC100("unify_hyphen_and_prolonged_sound_mark", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-middle-dot` option. +This option normalizes middle dots to "·" (U+00B7 MIDDLE DOT) as below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-middle-dot.log +.. tokenize TokenDelimit "·ᐧ•∙⋅⸱・・" --token_filters 'TokenFilterNFKC100("unify_middle_dot", true)' -``TokenFilterNFKC100`` is not translate a token of hiragana and kanji as below. +Here is an example of :ref:`token-filter-nfkc100-unify-katakana-v-sounds` option. +This option normalizes "ヴァヴィヴヴェヴォ" to "バビブベボ" as below. .. groonga-command -.. include:: ../../example/reference/token_filters/nfkc100-hiragana-and-kanji.log -.. tokenize TokenDelimit "りんご 林檎" --token_filters 'TokenFilterNFKC100("unify_kana", true)' +.. include:: ../../example/reference/token_filters/nfkc100-unify-katakana-v-sounds.log +.. tokenize TokenDelimit "ヴァヴィヴヴェヴォヴ" --token_filters 'TokenFilterNFKC100("unify_katakana_v_sounds", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-katakana-bu-sounds` option. +This option normalizes "ヴァヴィヴゥヴェヴォ" to "ブ" as below. + +.. groonga-command +.. 
include:: ../../example/reference/token_filters/nfkc100-unify-katakana-bu-sounds.log +.. tokenize TokenDelimit "ヴァヴィヴヴェヴォヴ" --token_filters 'TokenFilterNFKC100("unify_katakana_bu_sound", true)' + +Here is an example of :ref:`token-filter-nfkc100-unify-to-romaji` option. +This option normalizes hiragana and katakana to romaji as below. + +.. groonga-command +.. include:: ../../example/reference/token_filters/nfkc100-unify-to-romaji.log +.. tokenize TokenDelimit "アァイィウゥエェオォ" --token_filters 'TokenFilterNFKC100("unify_to_romaji", true)' Advanced usage -------------- @@ -51,12 +157,128 @@ You can output all input string as hiragana with combining ``TokenFilterNFKC100` Parameters ---------- -Required parameters -^^^^^^^^^^^^^^^^^^^ +Optional parameter +^^^^^^^^^^^^^^^^^^ -There are a required parameters ``unify_kana``. +There are optional parameters as below. + +.. _token-filter-nfkc100-unify-kana: ``unify_kana`` """""""""""""" -Translate a token katakana to hiragana. +With this option, characters that have the same pronunciation in full-width Hiragana, full-width Katakana and half-width Katakana are regarded as the same character. + +.. _token-filter-nfkc100-unify-kana-case: + +``unify_kana_case`` +""""""""""""""""""" + +With this option, large and small versions of the same letter in full-width Hiragana, full-width Katakana and half-width Katakana are regarded as the same character. + +.. _token-filter-nfkc100-unify-kana-voiced-sound-mark: + +``unify_kana_voiced_sound_mark`` +"""""""""""""""""""""""""""""""" + +With this option, letters with and without a voiced sound mark or semi-voiced sound mark in full-width Hiragana, full-width Katakana and half-width Katakana are regarded as the same character. + +.. _token-filter-nfkc100-unify-hyphen: + +``unify_hyphen`` +"""""""""""""""" + +This option normalizes hyphens to "-" (U+002D HYPHEN-MINUS). + +The following hyphens are normalized. 
+ +* "-" (U+002D HYPHEN-MINUS) +* "֊" (U+058A ARMENIAN HYPHEN) +* "˗" (U+02D7 MODIFIER LETTER MINUS SIGN) +* "‐" (U+2010 HYPHEN) +* "—" (U+2014 EM DASH) +* "⁃" (U+2043 HYPHEN BULLET) +* "⁻" (U+207B SUPERSCRIPT MINUS) +* "₋" (U+208B SUBSCRIPT MINUS) +* "−" (U+2212 MINUS SIGN) + +.. _token-filter-nfkc100-unify-prolonged-sound-mark: + +``unify_prolonged_sound_mark`` +"""""""""""""""""""""""""""""" + +This option normalizes prolonged sound marks to "ー" (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK). + +The following prolonged sound marks are normalized. + +* "—" (U+2014 EM DASH) +* "―" (U+2015 HORIZONTAL BAR) +* "─" (U+2500 BOX DRAWINGS LIGHT HORIZONTAL) +* "━" (U+2501 BOX DRAWINGS HEAVY HORIZONTAL) +* "ー" (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK) +* "ー" (U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK) + +.. _token-filter-nfkc100-unify-hyphen-and-prolonged-sound-mark: + +``unify_hyphen_and_prolonged_sound_mark`` +""""""""""""""""""""""""""""""""""""""""" + +This option normalizes hyphens and prolonged sound marks to "-" (U+002D HYPHEN-MINUS). + +The following hyphens and prolonged sound marks are normalized. + +* "-" (U+002D HYPHEN-MINUS) +* "֊" (U+058A ARMENIAN HYPHEN) +* "˗" (U+02D7 MODIFIER LETTER MINUS SIGN) +* "‐" (U+2010 HYPHEN) +* "—" (U+2014 EM DASH) +* "⁃" (U+2043 HYPHEN BULLET) +* "⁻" (U+207B SUPERSCRIPT MINUS) +* "₋" (U+208B SUBSCRIPT MINUS) +* "−" (U+2212 MINUS SIGN) + +* "—" (U+2014 EM DASH) +* "―" (U+2015 HORIZONTAL BAR) +* "─" (U+2500 BOX DRAWINGS LIGHT HORIZONTAL) +* "━" (U+2501 BOX DRAWINGS HEAVY HORIZONTAL) +* "ー" (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK) +* "ー" (U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK) + +.. _token-filter-nfkc100-unify-middle-dot: + +``unify_middle_dot`` +"""""""""""""""""""" + +This option normalizes middle dots to "·" (U+00B7 MIDDLE DOT). + +The following middle dots are normalized. 
+ +* "·" (U+00B7 MIDDLE DOT) +* "ᐧ" (U+1427 CANADIAN SYLLABICS FINAL MIDDLE DOT) +* "•" (U+2022 BULLET) +* "∙" (U+2219 BULLET OPERATOR) +* "⋅" (U+22C5 DOT OPERATOR) +* "⸱" (U+2E31 WORD SEPARATOR MIDDLE DOT) +* "・" (U+30FB KATAKANA MIDDLE DOT) +* "・" (U+FF65 HALFWIDTH KATAKANA MIDDLE DOT) + +.. _token-filter-nfkc100-unify-katakana-v-sounds: + +``unify_katakana_v_sounds`` +""""""""""""""""""""""""""" + +This option normalizes "ヴァヴィヴヴェヴォ" to "バビブベボ". + +.. _token-filter-nfkc100-unify-katakana-bu-sounds: + +``unify_katakana_bu_sound`` +""""""""""""""""""""""""""" + +This option normalizes "ヴァヴィヴゥヴェヴォ" to "ブ". + +.. _token-filter-nfkc100-unify-to-romaji: + +``unify_to_romaji`` +""""""""""""""""""" + +This option normalizes hiragana and katakana to romaji.
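For readers who want to try the documented mappings outside Groonga, here is a minimal Python sketch of the no-option behaviour and of two of the options. It is only an illustration, not Groonga's implementation. `unicodedata.normalize` uses the Unicode version bundled with the Python build, which is not necessarily 10.0. The names `nfkc`, `unify_hyphen`, and `unify_prolonged_sound_mark` are hypothetical helpers made up for this example. The character tables are copied from the lists in the Parameters section of the diff, so characters that appear only in the example logs (such as U+2011) are not covered.

```python
import unicodedata

# Characters listed under ``unify_hyphen`` in the Parameters section.
HYPHENS = "\u002D\u058A\u02D7\u2010\u2014\u2043\u207B\u208B\u2212"

# Characters listed under ``unify_prolonged_sound_mark``.
PROLONGED_SOUND_MARKS = "\u2014\u2015\u2500\u2501\u30FC\uFF70"


def nfkc(text: str) -> str:
    """Apply plain NFKC, roughly what the filter does with no options."""
    return unicodedata.normalize("NFKC", text)


def unify_hyphen(text: str) -> str:
    """Map every listed hyphen to U+002D HYPHEN-MINUS (assumed table)."""
    return text.translate({ord(c): "-" for c in HYPHENS})


def unify_prolonged_sound_mark(text: str) -> str:
    """Map every listed mark to U+30FC (assumed table)."""
    return text.translate({ord(c): "\u30FC" for c in PROLONGED_SOUND_MARKS})


# U+33A1 SQUARE M SQUARED is the input used in nfkc100.log.
print(nfkc("\u33A1"))  # m2
print(unify_hyphen(HYPHENS) == "-" * len(HYPHENS))
print(unify_prolonged_sound_mark(PROLONGED_SOUND_MARKS) == "\u30FC" * 6)
```

The sketch keeps each option as a separate lookup table and does not combine it with NFKC, so results can differ from the filter for inputs that NFKC itself rewrites.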