Yasuhiro Horimoto  2019-01-04 10:44:48 +0900 (Fri, 04 Jan 2019)

  Revision: 2a04577b0fe89221ecaad8c1c2002fd4c427b5bf
  https://github.com/groonga/groonga/commit/2a04577b0fe89221ecaad8c1c2002fd4c427b5bf

  Message:
    doc: Separate from tokenizers page

  Added files:
    doc/source/reference/tokenizers/token_bigram_split_symbol_alpha.rst
  Modified files:
    doc/locale/ja/LC_MESSAGES/reference.po
    doc/source/reference/tokenizers.rst

  Modified: doc/locale/ja/LC_MESSAGES/reference.po (+5 -8)
===================================================================
--- doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 10:35:39 +0900 (6b9b44c79)
+++ doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 10:44:48 +0900 (9c2c1353e)
@@ -27422,25 +27422,22 @@ msgid "``TokenBigramSplitSymbol`` tokenizes symbols by bigram tokenize method:"
 msgstr ""
 "``TokenBigramSplitSymbol`` は記号のトークナイズ方法にバイグラムを使います。"
 
-#, fuzzy
 msgid ""
 "``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The "
 "difference between them is symbol and alphabet handling."
 msgstr ""
-"``TokenBigramIgnoreBlankSplitSymbolAlpha`` は :ref:`token-bigram` と似ていま"
-"す。違いは次の通りです。"
+"``TokenBigramSplitSymbolAlpha`` は :ref:`token-bigram` と似ています。違いは記"
+"号とアルファベットの扱いです。"
 
-#, fuzzy
 msgid "``TokenBigramSplitSymbolAlpha`` hasn't parameter::"
-msgstr "``TokenBigram`` には、引数がありません。"
+msgstr "``TokenBigramSplitSymbolAlpha`` には、引数がありません。"
 
-#, fuzzy
 msgid ""
 "``TokenBigramSplitSymbolAlpha`` tokenizes symbols and alphabets by bigram "
 "tokenize method:"
 msgstr ""
-"``TokenBigramIgnoreBlankSplitSymbolAlpha`` は記号とアルファベットをバイグラム"
-"でトークナイズします。"
+"``TokenBigramSplitSymbolAlpha`` は記号とアルファベットのトークナイズ方法にバ"
+"イグラムを使います。"
 
 msgid ""
 "``TokenDelimit`` extracts token by splitting one or more space characters "

  Modified: doc/source/reference/tokenizers.rst (+0 -140)
===================================================================
--- doc/source/reference/tokenizers.rst    2019-01-04 10:35:39 +0900 (74666bebd)
+++ doc/source/reference/tokenizers.rst    2019-01-04 10:44:48 +0900 (e3d97ec2f)
@@ -128,146 +128,6 @@ Here is a list of built-in tokenizers:
    tokenizers/*
 
-.. _token-bigram:
-
-``TokenBigram``
-^^^^^^^^^^^^^^^
-
-``TokenBigram`` is a bigram based tokenizer. It's recommended to use
-this tokenizer for most cases.
-
-Bigram tokenize method tokenizes a text to two adjacent characters
-tokens. For example, ``Hello`` is tokenized to the following tokens:
-
- * ``He``
- * ``el``
- * ``ll``
- * ``lo``
-
-Bigram tokenize method is good for recall because you can find all
-texts by query consists of two or more characters.
-
-In general, you can't find all texts by query consists of one
-character because one character token doesn't exist. But you can find
-all texts by query consists of one character in Groonga. Because
-Groonga find tokens that start with query by predictive search. For
-example, Groonga can find ``ll`` and ``lo`` tokens by ``l`` query.
-
-Bigram tokenize method isn't good for precision because you can find
-texts that includes query in word. For example, you can find ``world``
-by ``or``. This is more sensitive for ASCII only languages rather than
-non-ASCII languages. ``TokenBigram`` has solution for this problem
-described in the below.
-
-``TokenBigram`` behavior is different when it's worked with any
-:doc:`/reference/normalizers`.
-
-If no normalizer is used, ``TokenBigram`` uses pure bigram (all tokens
-except the last token have two characters) tokenize method:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-no-normalizer.log
-.. tokenize TokenBigram "Hello World"
tokenize TokenBigram "Hello World" - -If normalizer is used, ``TokenBigram`` uses white-space-separate like -tokenize method for ASCII characters. ``TokenBigram`` uses bigram -tokenize method for non-ASCII characters. - -You may be confused with this combined behavior. But it's reasonable -for most use cases such as English text (only ASCII characters) and -Japanese text (ASCII and non-ASCII characters are mixed). - -Most languages consists of only ASCII characters use white-space for -word separator. White-space-separate tokenize method is suitable for -the case. - -Languages consists of non-ASCII characters don't use white-space for -word separator. Bigram tokenize method is suitable for the case. - -Mixed tokenize method is suitable for mixed language case. - -If you want to use bigram tokenize method for ASCII character, see -``TokenBigramSplitXXX`` type tokenizers such as -:ref:`token-bigram-split-symbol-alpha`. - -Let's confirm ``TokenBigram`` behavior by example. - -``TokenBigram`` uses one or more white-spaces as token delimiter for -ASCII characters: - -.. groonga-command -.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log -.. tokenize TokenBigram "Hello World" NormalizerAuto - -``TokenBigram`` uses character type change as token delimiter for -ASCII characters. Character type is one of them: - - * Alphabet - * Digit - * Symbol (such as ``(``, ``)`` and ``!``) - * Hiragana - * Katakana - * Kanji - * Others - -The following example shows two token delimiters: - - * at between ``100`` (digits) and ``cents`` (alphabets) - * at between ``cents`` (alphabets) and ``!!!`` (symbols) - -.. groonga-command -.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log -.. tokenize TokenBigram "100cents!!!" NormalizerAuto - -Here is an example that ``TokenBigram`` uses bigram tokenize method -for non-ASCII characters. - -.. groonga-command -.. include:: ../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log -.. tokenize TokenBigram "日本語の勉強" NormalizerAuto - -.. _token-bigram-split-symbol: - -``TokenBigramSplitSymbol`` -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -``TokenBigramSplitSymbol`` is similar to :ref:`token-bigram`. The -difference between them is symbol handling. ``TokenBigramSplitSymbol`` -tokenizes symbols by bigram tokenize method: - -.. groonga-command -.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log -.. tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto - -.. _token-bigram-split-symbol-alpha: - -``TokenBigramSplitSymbolAlpha`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The -difference between them is symbol and alphabet -handling. ``TokenBigramSplitSymbolAlpha`` tokenizes symbols and -alphabets by bigram tokenize method: - -.. groonga-command -.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log -.. tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto - -.. _token-bigram-split-symbol-alpha-digit: - -``TokenBigramSplitSymbolAlphaDigit`` -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -``TokenBigramSplitSymbolAlphaDigit`` is similar to -:ref:`token-bigram`. The difference between them is symbol, alphabet -and digit handling. ``TokenBigramSplitSymbolAlphaDigit`` tokenizes -symbols, alphabets and digits by bigram tokenize method. It means that -all characters are tokenized by bigram tokenize method: - -.. 
  Added: doc/source/reference/tokenizers/token_bigram_split_symbol_alpha.rst (+32 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/reference/tokenizers/token_bigram_split_symbol_alpha.rst    2019-01-04 10:44:48 +0900 (8c0fc3e30)
@@ -0,0 +1,32 @@
+.. -*- rst -*-
+
+.. highlightlang:: none
+
+.. groonga-command
+.. database: tokenizers
+
+``TokenBigramSplitSymbolAlpha``
+===============================
+
+Summary
+-------
+
+``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The
+difference between them is symbol and alphabet handling.
+
+Syntax
+------
+
+``TokenBigramSplitSymbolAlpha`` hasn't parameter::
+
+  TokenBigramSplitSymbolAlpha
+
+Usage
+-----
+
+``TokenBigramSplitSymbolAlpha`` tokenizes symbols and
+alphabets by bigram tokenize method:
+
+.. groonga-command
+.. include:: ../../example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
+.. tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
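Editor's note: the new page's Usage section pulls its example output from an included log, so the actual tokens are not visible in this diff. As a rough sketch of the difference the Summary describes — TokenBigram (with a normalizer) keeps each run of same-type ASCII characters as one token, while TokenBigramSplitSymbolAlpha additionally applies the bigram tokenize method to alphabet and symbol runs — here is an illustration in Python. This is an assumption-laden sketch, not Groonga's algorithm; in particular, the real tokenizer's handling of boundaries between character runs may differ:

    def char_type(c):
        # Three of the character types listed in the removed TokenBigram text.
        if c.isdigit():
            return "digit"
        if c.isalpha():
            return "alpha"
        return "symbol"

    def split_on_type_change(text):
        # TokenBigram-like grouping for ASCII: split where the type changes.
        runs = []
        for c in text:
            if runs and char_type(runs[-1][-1]) == char_type(c):
                runs[-1] += c
            else:
                runs.append(c)
        return runs

    def split_symbol_alpha(text):
        # SplitSymbolAlpha-like: digit runs stay whole, while alphabet and
        # symbol runs are tokenized by the bigram tokenize method.
        tokens = []
        for run in split_on_type_change(text):
            if char_type(run[0]) == "digit" or len(run) == 1:
                tokens.append(run)
            else:
                tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
        return tokens

    print(split_on_type_change("100cents!!!"))
    # ['100', 'cents', '!!!']
    print(split_symbol_alpha("100cents!!!"))
    # ['100', 'ce', 'en', 'nt', 'ts', '!!', '!!']

Running tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto against a real Groonga server, as the included log does, is the authoritative way to see the actual tokens.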