[Groonga-commit] groonga/groonga at 9a6123a [master] doc: Separate from tokenizers page

Yasuhiro Horimoto null+****@clear*****
Fri Jan 4 09:58:48 JST 2019


Yasuhiro Horimoto	2019-01-04 09:58:48 +0900 (Fri, 04 Jan 2019)

  Revision: 9a6123a672c428e596215320c47a312aecc9fb9d
  https://github.com/groonga/groonga/commit/9a6123a672c428e596215320c47a312aecc9fb9d

  Message:
    doc: Separate from tokenizers page

  Added files:
    doc/source/reference/tokenizers/token_bigram.rst
  Modified files:
    doc/locale/ja/LC_MESSAGES/reference.po

  Modified: doc/locale/ja/LC_MESSAGES/reference.po (+3 -0)
===================================================================
--- doc/locale/ja/LC_MESSAGES/reference.po    2018-12-28 12:45:36 +0900 (0c87d49f1)
+++ doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 09:58:48 +0900 (86e738871)
@@ -27405,6 +27405,9 @@ msgstr ""
 "入れ、テキストの最後にテキストの最後であるというマーク( ``U+FFF0`` )を入れ"
 "ます。"
 
+msgid "``TokenBigram`` hasn't parameter::"
+msgstr "``TokenBigram`` には、引数がありません。"
+
 msgid ""
 "``TokenDelimit`` extracts token by splitting one or more space characters "
 "(``U+0020``). For example, ``Hello World`` is tokenized to ``Hello`` and "

  Added: doc/source/reference/tokenizers/token_bigram.rst (+115 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/reference/tokenizers/token_bigram.rst    2019-01-04 09:58:48 +0900 (d81df8f29)
@@ -0,0 +1,115 @@
+.. -*- rst -*-
+
+.. highlightlang:: none
+
+.. groonga-command
+.. database: tokenizers
+
+``TokenBigram``
+===============
+
+Summary
+-------
+
+``TokenBigram`` is a bigram-based tokenizer. It's recommended to use
+this tokenizer for most cases.
+
+The bigram tokenize method tokenizes a text into tokens of two
+adjacent characters. For example, ``Hello`` is tokenized to the
+following tokens:
+
+  * ``He``
+  * ``el``
+  * ``ll``
+  * ``lo``
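+
+You can confirm this with the ``tokenize`` command. A minimal sketch
+(``--mode GET`` tokenizes the text as a search query; output
+elided)::
+
+  tokenize TokenBigram "Hello" --mode GET
+  # => He, el, ll, lo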
+
+The bigram tokenize method is good for recall because you can find
+all texts with a query that consists of two or more characters.
+
+In general, you can't find all texts with a query that consists of
+one character because no one-character token exists. But in Groonga
+you can find all texts even with a one-character query, because
+Groonga finds tokens that start with the query by predictive search.
+For example, Groonga can find the ``ll`` and ``lo`` tokens with the
+query ``l``.
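+
+Here is a minimal sketch of a one-character query. The ``Memos``
+table, its data and the ``Terms`` lexicon are hypothetical; no
+normalizer is used so that ASCII text is also tokenized by bigram::
+
+  table_create Memos TABLE_NO_KEY
+  column_create Memos content COLUMN_SCALAR ShortText
+  table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram
+  column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
+  load --table Memos
+  [
+  {"content": "Hello World"}
+  ]
+  # The one-character query "l" matches "Hello World" because Groonga
+  # finds the "ll", "lo" and "ld" tokens by predictive search on the
+  # patricia trie (TABLE_PAT_KEY) lexicon.
+  select Memos --match_columns content --query l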
+
+The bigram tokenize method isn't good for precision because you also
+find texts that include the query inside a word. For example, you
+can find ``world`` with the query ``or``. This affects ASCII-only
+languages more than non-ASCII languages. ``TokenBigram`` has a
+solution for this problem, described below.
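+
+The tokens generated for ``world`` show why this happens. A minimal
+sketch (output elided)::
+
+  tokenize TokenBigram "world" --mode GET
+  # => wo, or, rl, ld
+  # The "or" token is also indexed for "world", so searching for
+  # "or" finds "world" too.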
+
+Syntax
+------
+
+``TokenBigram`` has no parameters::
+
+  TokenBigram
+
+Usage
+-----
+
+``TokenBigram`` behaves differently depending on whether it is
+combined with a normalizer (see :doc:`/reference/normalizers`).
+
+If no normalizer is used, ``TokenBigram`` uses the pure bigram
+tokenize method (all tokens except the last one have two
+characters):
+
+.. groonga-command
+.. include:: ../../example/reference/tokenizers/token-bigram-no-normalizer.log
+.. tokenize TokenBigram "Hello World"
+
+If a normalizer is used, ``TokenBigram`` uses a white-space-separate
+like tokenize method for ASCII characters and the bigram tokenize
+method for non-ASCII characters.
+
+This combined behavior may be confusing, but it's reasonable for
+most use cases such as English text (ASCII characters only) and
+Japanese text (a mix of ASCII and non-ASCII characters).
+
+Most languages that consist of only ASCII characters use white-space
+as the word separator. The white-space-separate like tokenize method
+is suitable for that case.
+
+Languages that consist of non-ASCII characters don't use white-space
+as the word separator. The bigram tokenize method is suitable for
+that case.
+
+The mixed tokenize method is suitable for the mixed-language case.
+
+If you want to use the bigram tokenize method for ASCII characters
+too, see the ``TokenBigramSplitXXX`` family of tokenizers such as
+:ref:`token-bigram-split-symbol-alpha`.
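+
+For example, here is a minimal sketch with
+``TokenBigramSplitSymbolAlpha`` (output elided)::
+
+  tokenize TokenBigramSplitSymbolAlpha "Hello World" NormalizerAuto
+  # Alphabet characters are also tokenized by bigram even though a
+  # normalizer is used: he, el, ll, lo, ..., wo, or, rl, ld, ...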
+
+Let's confirm ``TokenBigram``'s behavior by example.
+
+``TokenBigram`` uses one or more white-spaces as a token delimiter
+for ASCII characters:
+
+.. groonga-command
+.. include:: ../../example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log
+.. tokenize TokenBigram "Hello World" NormalizerAuto
+
+``TokenBigram`` also uses a character type change as a token
+delimiter for ASCII characters. The character type is one of the
+following:
+
+  * Alphabet
+  * Digit
+  * Symbol (such as ``(``, ``)`` and ``!``)
+  * Hiragana
+  * Katakana
+  * Kanji
+  * Others
+
+The following example shows two token delimiters:
+
+  * between ``100`` (digits) and ``cents`` (alphabets)
+  * between ``cents`` (alphabets) and ``!!!`` (symbols)
+
+.. groonga-command
+.. include:: ../../example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log
+.. tokenize TokenBigram "100cents!!!" NormalizerAuto
+
+Here is an example of ``TokenBigram`` using the bigram tokenize
+method for non-ASCII characters:
+
+.. groonga-command
+.. include:: ../../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
+.. tokenize TokenBigram "日本語の勉強" NormalizerAuto

