Yasuhiro Horimoto	2019-01-04 16:25:48 +0900 (Fri, 04 Jan 2019)

  Revision: 1dd039baf602fc2ac49a390ebcfb7e3a6d5dd59f
  https://github.com/groonga/groonga/commit/1dd039baf602fc2ac49a390ebcfb7e3a6d5dd59f

  Message:
    doc: use more meaningful example for TokenTrigram

  Added files:
    doc/source/example/reference/tokenizers/token-trigram-non-ascii.log
  Modified files:
    doc/locale/ja/LC_MESSAGES/reference.po
    doc/source/reference/tokenizers/token_trigram.rst

  Modified: doc/locale/ja/LC_MESSAGES/reference.po (+20 -4)
===================================================================
--- doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 16:18:19 +0900 (c3f79ee41)
+++ doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 16:25:48 +0900 (b5425c354)
@@ -27784,11 +27784,20 @@ msgid "``TokenTrigram`` hasn't parameter::"
 msgstr "``TokenTrigram`` には、引数がありません。"
 
 msgid ""
-":ref:`token-bigram` uses 2 characters per token. ``TokenTrigram`` uses 3 "
-"characters per token as below example."
+"If normalizer is used, ``TokenTrigram`` uses white-space-separate like "
+"tokenize method for ASCII characters. ``TokenTrigram`` uses trigram tokenize "
+"method for non-ASCII characters."
 msgstr ""
-":ref:`token-bigram` は各トークンが2文字ですが、以下の例のように "
-"``TokenTrigram`` は各トークンが3文字です。"
+"ノーマライザーを使っている場合は ``TokenTrigram`` はASCIIの文字には空白区切り"
+"のようなトークナイズ方法を使います。非ASCII文字にはトリグラムのトークナイズ方"
+"法を使います。"
+
+msgid ""
+"If ``TokenTrigram`` tokenize non-ASCII charactors, ``TokenTrigram`` uses 3 "
+"character per token as below example."
+msgstr ""
+"``TokenTrigram`` が非ASCII文字をトークナイズすると、以下の例のように "
+"``TokenTrigram`` は各トークンが3文字となります。"
 
 msgid "``TokenUnigram``"
 msgstr ""
@@ -28302,6 +28311,13 @@ msgid "``window_sum``"
 msgstr ""
 
 #~ msgid ""
+#~ ":ref:`token-bigram` uses 2 characters per token. ``TokenTrigram`` uses 3 "
+#~ "characters per token as below example."
+#~ msgstr ""
+#~ ":ref:`token-bigram` は各トークンが2文字ですが、以下の例のように "
+#~ "``TokenTrigram`` は各トークンが3文字です。"
+
+#~ msgid ""
 #~ "``TokenTrigram`` is similar to :ref:`token-bigram`. The differences "
 #~ "between them is token unit. :ref:`token-bigram` uses 2 characters per "
 #~ "token. ``TokenTrigram`` uses 3 characters per token."
  Added: doc/source/example/reference/tokenizers/token-trigram-non-ascii.log (+48 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-trigram-non-ascii.log    2019-01-04 16:25:48 +0900 (40b1f2b91)
@@ -0,0 +1,48 @@
+Execution example::
+
+  tokenize TokenTrigram "日本語の勉強" NormalizerAuto
+  # [
+  #   [
+  #     0,
+  #     1546586185.123834,
+  #     0.0003123283386230469
+  #   ],
+  #   [
+  #     {
+  #       "value": "日本語",
+  #       "position": 0,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "本語の",
+  #       "position": 1,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "語の勉",
+  #       "position": 2,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "の勉強",
+  #       "position": 3,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "勉強",
+  #       "position": 4,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "強",
+  #       "position": 5,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     }
+  #   ]
+  # ]

  Modified: doc/source/reference/tokenizers/token_trigram.rst (+8 -4)
===================================================================
--- doc/source/reference/tokenizers/token_trigram.rst    2019-01-04 16:18:19 +0900 (18a4545d0)
+++ doc/source/reference/tokenizers/token_trigram.rst    2019-01-04 16:25:48 +0900 (b1f89ad6a)
@@ -26,9 +26,13 @@ Syntax
 Usage
 -----
 
-:ref:`token-bigram` uses 2 characters per
-token. ``TokenTrigram`` uses 3 characters per token as below example.
+If normalizer is used, ``TokenTrigram`` uses white-space-separate like
+tokenize method for ASCII characters. ``TokenTrigram`` uses trigram
+tokenize method for non-ASCII characters.
+
+If ``TokenTrigram`` tokenize non-ASCII charactors, ``TokenTrigram`` uses
+3 character per token as below example.
 
 .. groonga-command
-.. include:: ../../example/reference/tokenizers/token-trigram.log
-.. tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
+.. include:: ../../example/reference/tokenizers/token-trigram-non-ascii.log
+.. tokenize TokenTrigram "日本語の勉強" NormalizerAuto
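For reference, the trigram rule shown in the new example log is easy to
reproduce. The following Python sketch is an illustration only, not
Groonga's implementation (``TokenTrigram`` is implemented in C inside
Groonga); the function name ``trigram_tokenize`` is invented for this
note. It assumes already-normalized, purely non-ASCII input, where every
character position starts a token of at most 3 characters::

  # Illustration only: mimics the documented TokenTrigram behavior for
  # non-ASCII text; NOT Groonga's actual implementation.
  def trigram_tokenize(text):
      """Yield (value, position) pairs like the tokenize command output."""
      for position in range(len(text)):
          # Slicing past the end of the string simply yields a shorter
          # token, which is why the tail tokens shrink to 2, then 1
          # character in the example log.
          yield text[position:position + 3], position

  for value, position in trigram_tokenize("日本語の勉強"):
      print(position, value)
  # 0 日本語
  # 1 本語の
  # 2 語の勉
  # 3 の勉強
  # 4 勉強
  # 5 強

The printed pairs match the committed token-trigram-non-ascii.log. For
ASCII input, the committed documentation instead describes a
white-space-separate-like tokenize method when a normalizer is used, so
that branch is deliberately left out of the sketch.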