Yasuhiro Horimoto	2019-01-04 16:02:16 +0900 (Fri, 04 Jan 2019)

  Revision: cea6796bca7e1a709af9e066e211c59ec55e7fd4
  https://github.com/groonga/groonga/commit/cea6796bca7e1a709af9e066e211c59ec55e7fd4

  Message:
    doc: use more meaningful example

  Added files:
    doc/source/example/reference/tokenizers/token-unigram-non-ascii.log
  Modified files:
    doc/locale/ja/LC_MESSAGES/reference.po
    doc/source/reference/tokenizers/token_unigram.rst

  Modified: doc/locale/ja/LC_MESSAGES/reference.po (+12 -3)
===================================================================
--- doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 15:26:38 +0900 (c3bf527d6)
+++ doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 16:02:16 +0900 (c3f79ee41)
@@ -27804,11 +27804,20 @@ msgid "``TokenUnigram`` hasn't parameter::"
 msgstr "``TokenUnigram`` には、引数がありません。"
 
 msgid ""
-":ref:`token-bigram` uses 2 characters per token. ``TokenUnigram`` uses 1 "
+"If normalizer is used, ``TokenUnigram`` uses white-space-separate like "
+"tokenize method for ASCII characters. ``TokenUnigram`` uses unigram tokenize "
+"method for non-ASCII characters."
+msgstr ""
+"ノーマライザーを使っている場合は ``TokenUnigram`` はASCIIの文字には空白区切り"
+"のようなトークナイズ方法を使います。非ASCII文字にはユニグラムのトークナイズ方"
+"法を使います。"
+
+msgid ""
+"If ``TokenUnigram`` tokenize non-ASCII charactors, ``TokenUnigram`` uses 1 "
 "character per token as below example."
 msgstr ""
-":ref:`token-bigram` は各トークンが2文字ですが、以下の例のように "
-"``TokenUnigram`` は各トークンが1文字です。"
+"``TokenUnigram`` が非ASCII文字をトークナイズすると、以下の例のように "
+"``TokenUnigram`` は各トークンが1文字となります。"
 
 msgid "Tuning"
 msgstr "チューニング"

  Added: doc/source/example/reference/tokenizers/token-unigram-non-ascii.log (+48 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/example/reference/tokenizers/token-unigram-non-ascii.log    2019-01-04 16:02:16 +0900 (6f51efe71)
@@ -0,0 +1,48 @@
+Execution example::
+
+  tokenize TokenUnigram "日本語の勉強" NormalizerAuto --output_pretty yes
+  # [
+  #   [
+  #     0,
+  #     1546584495.218799,
+  #     0.0002140998840332031
+  #   ],
+  #   [
+  #     {
+  #       "value": "日",
+  #       "position": 0,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "本",
+  #       "position": 1,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "語",
+  #       "position": 2,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "の",
+  #       "position": 3,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "勉",
+  #       "position": 4,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     },
+  #     {
+  #       "value": "強",
+  #       "position": 5,
+  #       "force_prefix": false,
+  #       "force_prefix_search": false
+  #     }
+  #   ]
+  # ]

  Modified: doc/source/reference/tokenizers/token_unigram.rst (+7 -3)
===================================================================
--- doc/source/reference/tokenizers/token_unigram.rst    2019-01-04 15:26:38 +0900 (ea91a094a)
+++ doc/source/reference/tokenizers/token_unigram.rst    2019-01-04 16:02:16 +0900 (8fc636610)
@@ -26,9 +26,13 @@ Syntax
 Usage
 -----
 
-:ref:`token-bigram` uses 2 characters per
-token. ``TokenUnigram`` uses 1 character per token as below example.
+If normalizer is used, ``TokenUnigram`` uses white-space-separate like
+tokenize method for ASCII characters. ``TokenUnigram`` uses unigram
+tokenize method for non-ASCII characters.
+
+If ``TokenUnigram`` tokenize non-ASCII charactors, ``TokenUnigram`` uses
+1 character per token as below example.
 
 .. groonga-command
-.. include:: ../../example/reference/tokenizers/token-unigram.log
+.. include:: ../../example/reference/tokenizers/token-unigram-non-ascii.log
 .. tokenize TokenUnigram "100cents!!!" NormalizerAuto
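For readers skimming the mail, the behavior the updated documentation describes can be sketched in a few lines of Python. This is a rough illustration only, not groonga's actual tokenizer code: it approximates "white-space-separate like" ASCII handling as runs of letters or digits kept whole, and unigram handling as one token per non-ASCII character. The function name and the regex are mine, not groonga's.

```python
import re

def unigram_tokenize_sketch(text: str) -> list[str]:
    """Hypothetical sketch of the documented TokenUnigram behavior
    (with a normalizer): ASCII letter runs and digit runs stay grouped
    like white-space-separated words, while every other non-space
    character (e.g. CJK) becomes a single-character token."""
    # [A-Za-z]+ and [0-9]+ keep ASCII word/number runs whole;
    # \S falls through to emit any remaining character on its own.
    return re.findall(r"[A-Za-z]+|[0-9]+|\S", text)

print(unigram_tokenize_sketch("日本語の勉強"))  # one token per character
print(unigram_tokenize_sketch("100cents"))      # ASCII runs stay grouped
```

This mirrors why "日本語の勉強" becomes six one-character tokens in the example log above while ASCII input like "100cents" is split into word-like pieces. groonga's real grouping is decided by the normalizer's character-type data, so the regex here is only an approximation.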