Kouhei Sutou
null+****@clear*****
Mon Mar 16 15:10:28 JST 2015
Kouhei Sutou	2015-03-16 15:10:28 +0900 (Mon, 16 Mar 2015)

  New Revision: 8feb31521b13bc345167904d919a38033260531d
  https://github.com/groonga/groonga/commit/8feb31521b13bc345167904d919a38033260531d

  Message:
    doc: describe more about TokenBigram

  Modified files:
    doc/source/reference/tokenizers.rst

  Modified: doc/source/reference/tokenizers.rst (+19 -1)
===================================================================
--- doc/source/reference/tokenizers.rst    2015-03-16 14:59:34 +0900 (f4c2b8d)
+++ doc/source/reference/tokenizers.rst    2015-03-16 15:10:28 +0900 (7e960f4)
@@ -144,6 +144,25 @@
 If normalizer is used, ``TokenBigram`` uses white-space-separate like
 tokenize method for ASCII characters. ``TokenBigram`` uses bigram
 tokenize method for non-ASCII characters.
 
+You may be confused by this combined behavior, but it's reasonable
+for most use cases such as English text (only ASCII characters) and
+Japanese text (ASCII and non-ASCII characters are mixed).
+
+Most languages that consist of only ASCII characters use white-space
+as the word separator. The white-space-separate tokenize method is
+suitable for this case.
+
+Languages that consist of non-ASCII characters don't use white-space as
+the word separator. The bigram tokenize method is suitable for this case.
+
+The mixed tokenize method is suitable for the mixed-language case.
+
+If you want to use the bigram tokenize method for ASCII characters,
+see ``TokenBigramSplitXXX`` type tokenizers such as
+:ref:`token-bigram-split-symbol-alpha`.
+
+Let's confirm ``TokenBigram`` behavior with examples.
+
 ``TokenBigram`` uses one or more white-spaces as token delimiter for
 ASCII characters:
 
@@ -178,7 +197,6 @@ for non-ASCII characters.
 .. include:: ../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
 .. tokenize TokenBigram "日本語の勉強" NormalizerAuto
 
-.. _token-bigram-split-symbol
 
 ``TokenBigramSplitSymbol``
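
As a side note, the combined behavior described in the added text can be
illustrated with a small conceptual sketch. The Python code below is NOT
Groonga's implementation, and the names is_ascii and bigram_like_tokenize
are invented for this sketch; it only mimics the idea of
white-space-separated tokens for ASCII runs and character bigrams for
non-ASCII runs. For the actual behavior (token positions, handling of the
end of text, and so on), the tokenize command examples referenced by the
.. include:: directives above are authoritative.

    # Conceptual sketch only -- not Groonga's implementation.
    # ASCII runs are split on white-space; non-ASCII runs are turned
    # into character bigrams, as the documentation text describes.

    def is_ascii(ch):
        return ord(ch) < 128

    def bigram_like_tokenize(text):
        tokens = []
        i = 0
        while i < len(text):
            ch = text[i]
            if ch.isspace():
                # White-space is only a delimiter, never a token.
                i += 1
            elif is_ascii(ch):
                # Collect a run of non-space ASCII characters as one token.
                j = i
                while j < len(text) and is_ascii(text[j]) and not text[j].isspace():
                    j += 1
                tokens.append(text[i:j])
                i = j
            else:
                # Collect a run of non-ASCII characters and emit bigrams.
                j = i
                while j < len(text) and not is_ascii(text[j]) and not text[j].isspace():
                    j += 1
                run = text[i:j]
                if len(run) == 1:
                    tokens.append(run)
                else:
                    tokens.extend(run[k:k + 2] for k in range(len(run) - 1))
                i = j
        return tokens

    print(bigram_like_tokenize("Hello World"))
    # -> ['Hello', 'World']
    print(bigram_like_tokenize("日本語の勉強"))
    # -> ['日本', '本語', '語の', 'の勉', '勉強']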