[Groonga-commit] groonga/groonga at 8feb315 [master] doc: describe more about TokenBigram


Kouhei Sutou null+****@clear*****
Mon Mar 16 15:10:28 JST 2015


Kouhei Sutou	2015-03-16 15:10:28 +0900 (Mon, 16 Mar 2015)

  New Revision: 8feb31521b13bc345167904d919a38033260531d
  https://github.com/groonga/groonga/commit/8feb31521b13bc345167904d919a38033260531d

  Message:
    doc: describe more about TokenBigram

  Modified files:
    doc/source/reference/tokenizers.rst

  Modified: doc/source/reference/tokenizers.rst (+19 -1)
===================================================================
--- doc/source/reference/tokenizers.rst    2015-03-16 14:59:34 +0900 (f4c2b8d)
+++ doc/source/reference/tokenizers.rst    2015-03-16 15:10:28 +0900 (7e960f4)
@@ -144,6 +144,25 @@ If normalizer is used, ``TokenBigram`` uses white-space-separate like
 tokenize method for ASCII characters. ``TokenBigram`` uses bigram
 tokenize method for non-ASCII characters.
 
+This combined behavior may look confusing, but it is reasonable for
+most use cases such as English text (ASCII characters only) and
+Japanese text (a mix of ASCII and non-ASCII characters).
+
+Most languages that consist of only ASCII characters use white-space
+as the word separator. The white-space-separate tokenize method is
+suitable for that case.
+
+Languages written in non-ASCII characters, such as Japanese, often do
+not separate words with white-space. The bigram tokenize method suits them.
+
+The mixed tokenize method is suitable for the mixed language case.
+
+If you want to use the bigram tokenize method for ASCII characters, see
+``TokenBigramSplitXXX`` type tokenizers such as
+:ref:`token-bigram-split-symbol-alpha`.
+
+Let's confirm the ``TokenBigram`` behavior with examples.
+
 ``TokenBigram`` uses one or more white-spaces as token delimiter for
 ASCII characters:
 
@@ -178,7 +197,6 @@ for non-ASCII characters.
 .. include:: ../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
 .. tokenize TokenBigram "日本語の勉強" NormalizerAuto
 
-
 .. _token-bigram-split-symbol
 
 ``TokenBigramSplitSymbol``
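
The combined behavior described in the added paragraphs can be illustrated with a small Python sketch. It is not Groonga's implementation and it does not claim to reproduce the exact output of the ``tokenize`` command. The names ``char_kind`` and ``sketch_token_bigram`` are hypothetical, NFKC plus lowercasing is only an assumed stand-in for a normalizer, and ASCII symbols and digits get no special handling here.

```python
from itertools import groupby
import unicodedata


def char_kind(ch):
    """Classify a character as white-space, ASCII, or non-ASCII."""
    if ch.isspace():
        return "space"
    return "ascii" if ord(ch) < 128 else "non-ascii"


def sketch_token_bigram(text):
    """Simplified illustration of the mixed tokenize method.

    Assumption: NFKC + lowercasing stands in for a normalizer.
    """
    normalized = unicodedata.normalize("NFKC", text).lower()
    tokens = []
    # Split the text into runs of characters of the same kind.
    for kind, group in groupby(normalized, key=char_kind):
        run = "".join(group)
        if kind == "space":
            # One or more white-spaces act only as a delimiter.
            continue
        if kind == "ascii" or len(run) == 1:
            # ASCII run: keep it whole (white-space-separate method).
            tokens.append(run)
        else:
            # Non-ASCII run: overlapping two-character tokens (bigram method).
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens
```

With this sketch, ``sketch_token_bigram("Hello   World")`` yields two whole-word tokens, while ``sketch_token_bigram("日本語の勉強")`` yields overlapping two-character tokens. A real tokenizer may handle details such as the trailing character differently.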