[Groonga-commit] groonga/groonga at 2a04577 [master] doc: Separate from tokenizers page

Yasuhiro Horimoto null+****@clear*****
Fri Jan 4 10:44:48 JST 2019


Yasuhiro Horimoto	2019-01-04 10:44:48 +0900 (Fri, 04 Jan 2019)

  Revision: 2a04577b0fe89221ecaad8c1c2002fd4c427b5bf
  https://github.com/groonga/groonga/commit/2a04577b0fe89221ecaad8c1c2002fd4c427b5bf

  Message:
    doc: Separate from tokenizers page

  Added files:
    doc/source/reference/tokenizers/token_bigram_split_symbol_alpha.rst
  Modified files:
    doc/locale/ja/LC_MESSAGES/reference.po
    doc/source/reference/tokenizers.rst

  Modified: doc/locale/ja/LC_MESSAGES/reference.po (+5 -8)
===================================================================
--- doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 10:35:39 +0900 (6b9b44c79)
+++ doc/locale/ja/LC_MESSAGES/reference.po    2019-01-04 10:44:48 +0900 (9c2c1353e)
@@ -27422,25 +27422,22 @@ msgid "``TokenBigramSplitSymbol`` tokenizes symbols by bigram tokenize method:"
 msgstr ""
 "``TokenBigramSplitSymbol`` は記号のトークナイズ方法にバイグラムを使います。"
 
-#, fuzzy
 msgid ""
 "``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The "
 "difference between them is symbol and alphabet handling."
 msgstr ""
-"``TokenBigramIgnoreBlankSplitSymbolAlpha`` は :ref:`token-bigram` と似ていま"
-"す。違いは次の通りです。"
+"``TokenBigramSplitSymbolAlpha`` は :ref:`token-bigram` と似ています。違いは記"
+"号とアルファベットの扱いです。"
 
-#, fuzzy
 msgid "``TokenBigramSplitSymbolAlpha`` hasn't parameter::"
-msgstr "``TokenBigram`` には、引数がありません。"
+msgstr "``TokenBigramSplitSymbolAlpha`` には、引数がありません。"
 
-#, fuzzy
 msgid ""
 "``TokenBigramSplitSymbolAlpha`` tokenizes symbols and alphabets by bigram "
 "tokenize method:"
 msgstr ""
-"``TokenBigramIgnoreBlankSplitSymbolAlpha`` は記号とアルファベットをバイグラム"
-"でトークナイズします。"
+"``TokenBigramSplitSymbolAlpha`` は記号とアルファベットのトークナイズ方法にバ"
+"イグラムを使います。"
 
 msgid ""
 "``TokenDelimit`` extracts token by splitting one or more space characters "

  Modified: doc/source/reference/tokenizers.rst (+0 -140)
===================================================================
--- doc/source/reference/tokenizers.rst    2019-01-04 10:35:39 +0900 (74666bebd)
+++ doc/source/reference/tokenizers.rst    2019-01-04 10:44:48 +0900 (e3d97ec2f)
@@ -128,146 +128,6 @@ Here is a list of built-in tokenizers:
 
    tokenizers/*
 
-.. _token-bigram:
-
-``TokenBigram``
-^^^^^^^^^^^^^^^
-
-``TokenBigram`` is a bigram based tokenizer. It's recommended to use
-this tokenizer for most cases.
-
-Bigram tokenize method tokenizes a text to two adjacent characters
-tokens. For example, ``Hello`` is tokenized to the following tokens:
-
-  * ``He``
-  * ``el``
-  * ``ll``
-  * ``lo``
-
-Bigram tokenize method is good for recall because you can find all
-texts by query consists of two or more characters.
-
-In general, you can't find all texts by query consists of one
-character because one character token doesn't exist. But you can find
-all texts by query consists of one character in Groonga. Because
-Groonga find tokens that start with query by predictive search. For
-example, Groonga can find ``ll`` and ``lo`` tokens by ``l`` query.
-
-Bigram tokenize method isn't good for precision because you can find
-texts that includes query in word. For example, you can find ``world``
-by ``or``. This is more sensitive for ASCII only languages rather than
-non-ASCII languages. ``TokenBigram`` has solution for this problem
-described in the below.
-
-``TokenBigram`` behavior is different when it's worked with any
-:doc:`/reference/normalizers`.
-
-If no normalizer is used, ``TokenBigram`` uses pure bigram (all tokens
-except the last token have two characters) tokenize method:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-no-normalizer.log
-.. tokenize TokenBigram "Hello World"
-
-If normalizer is used, ``TokenBigram`` uses white-space-separate like
-tokenize method for ASCII characters. ``TokenBigram`` uses bigram
-tokenize method for non-ASCII characters.
-
-You may be confused with this combined behavior. But it's reasonable
-for most use cases such as English text (only ASCII characters) and
-Japanese text (ASCII and non-ASCII characters are mixed).
-
-Most languages consists of only ASCII characters use white-space for
-word separator. White-space-separate tokenize method is suitable for
-the case.
-
-Languages consists of non-ASCII characters don't use white-space for
-word separator. Bigram tokenize method is suitable for the case.
-
-Mixed tokenize method is suitable for mixed language case.
-
-If you want to use bigram tokenize method for ASCII character, see
-``TokenBigramSplitXXX`` type tokenizers such as
-:ref:`token-bigram-split-symbol-alpha`.
-
-Let's confirm ``TokenBigram`` behavior by example.
-
-``TokenBigram`` uses one or more white-spaces as token delimiter for
-ASCII characters:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-white-space-with-normalizer.log
-.. tokenize TokenBigram "Hello World" NormalizerAuto
-
-``TokenBigram`` uses character type change as token delimiter for
-ASCII characters. Character type is one of them:
-
-  * Alphabet
-  * Digit
-  * Symbol (such as ``(``, ``)`` and ``!``)
-  * Hiragana
-  * Katakana
-  * Kanji
-  * Others
-
-The following example shows two token delimiters:
-
-  * at between ``100`` (digits) and ``cents`` (alphabets)
-  * at between ``cents`` (alphabets) and ``!!!`` (symbols)
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-ascii-and-character-type-change-with-normalizer.log
-.. tokenize TokenBigram "100cents!!!" NormalizerAuto
-
-Here is an example that ``TokenBigram`` uses bigram tokenize method
-for non-ASCII characters.
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-non-ascii-with-normalizer.log
-.. tokenize TokenBigram "日本語の勉強" NormalizerAuto
-
-.. _token-bigram-split-symbol:
-
-``TokenBigramSplitSymbol``
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``TokenBigramSplitSymbol`` is similar to :ref:`token-bigram`. The
-difference between them is symbol handling. ``TokenBigramSplitSymbol``
-tokenizes symbols by bigram tokenize method:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-with-normalizer.log
-.. tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
-
-.. _token-bigram-split-symbol-alpha:
-
-``TokenBigramSplitSymbolAlpha``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The
-difference between them is symbol and alphabet
-handling. ``TokenBigramSplitSymbolAlpha`` tokenizes symbols and
-alphabets by bigram tokenize method:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
-.. tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
-
-.. _token-bigram-split-symbol-alpha-digit:
-
-``TokenBigramSplitSymbolAlphaDigit``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-``TokenBigramSplitSymbolAlphaDigit`` is similar to
-:ref:`token-bigram`. The difference between them is symbol, alphabet
-and digit handling. ``TokenBigramSplitSymbolAlphaDigit`` tokenizes
-symbols, alphabets and digits by bigram tokenize method. It means that
-all characters are tokenized by bigram tokenize method:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/token-bigram-split-symbol-alpha-digit-with-normalizer.log
-.. tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
-
 .. _token-bigram-ignore-blank:
 
 ``TokenBigramIgnoreBlank``

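The ``TokenBigram`` section removed above describes the pure bigram method: ``Hello`` becomes ``He``, ``el``, ``ll``, ``lo``. As an illustration only, that tokenize method can be sketched in Python (this is not Groonga's implementation, which also applies normalizers and switches to white-space separation for ASCII runs; the function name is hypothetical):

```python
def bigram_tokens(text):
    """Split text into overlapping two-character tokens (pure bigram).

    A minimal sketch of the bigram method described in the removed
    section above; the real TokenBigram tokenizer additionally handles
    normalization and treats ASCII runs differently.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(bigram_tokens("Hello"))  # ['He', 'el', 'll', 'lo']
```

Note how every query of two or more characters matches some token, which is why the removed section says the bigram method is good for recall.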
  Added: doc/source/reference/tokenizers/token_bigram_split_symbol_alpha.rst (+32 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/reference/tokenizers/token_bigram_split_symbol_alpha.rst    2019-01-04 10:44:48 +0900 (8c0fc3e30)
@@ -0,0 +1,32 @@
+.. -*- rst -*-
+
+.. highlightlang:: none
+
+.. groonga-command
+.. database: tokenizers
+
+``TokenBigramSplitSymbolAlpha``
+===============================
+
+Summary
+-------
+
+``TokenBigramSplitSymbolAlpha`` is similar to :ref:`token-bigram`. The
+difference between them is symbol and alphabet handling.
+
+Syntax
+------
+
+``TokenBigramSplitSymbolAlpha`` hasn't parameter::
+
+  TokenBigramSplitSymbolAlpha
+
+Usage
+-----
+
+``TokenBigramSplitSymbolAlpha`` tokenizes symbols and
+alphabets by bigram tokenize method:
+
+.. groonga-command
+.. include:: ../../example/reference/tokenizers/token-bigram-split-symbol-alpha-with-normalizer.log
+.. tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto

