[Groonga-commit] groonga/groonga at 828d362 [master] doc: modify to display tokenizers in just beneath "http://groonga.org/docs/"

Yasuhiro Horimoto null+****@clear*****
Mon Jan 7 18:13:12 JST 2019


Yasuhiro Horimoto	2019-01-07 18:13:12 +0900 (Mon, 07 Jan 2019)

  Revision: 828d36286d1eddd06fa5019558ce58519239d935
  https://github.com/groonga/groonga/commit/828d36286d1eddd06fa5019558ce58519239d935

  Message:
    doc: modify to display tokenizers in just beneath "http://groonga.org/docs/"

  Copied files:
    doc/source/reference/tokenizer/summary.rst
      (from doc/source/reference/tokenizers.rst)
  Modified files:
    doc/source/reference/tokenizers.rst

  Copied: doc/source/reference/tokenizer/summary.rst (+1 -21) 92%
===================================================================
--- doc/source/reference/tokenizers.rst    2019-01-07 17:38:33 +0900 (b3f281133)
+++ doc/source/reference/tokenizer/summary.rst    2019-01-07 18:13:12 +0900 (ba299fe9d)
@@ -2,14 +2,8 @@
 
 .. highlightlang:: none
 
-.. groonga-command
-.. database: tokenizers
-
-Tokenizers
-==========
-
 Summary
--------
+=======
 
 Groonga has tokenizer module that tokenizes text. It is used when
 the following cases:
@@ -48,9 +42,6 @@ try :ref:`token-bigram` tokenizer by
 .. include:: ../example/reference/tokenizers/tokenize-example.log
 .. tokenize TokenBigram "Hello World"
 
-What is "tokenize"?
--------------------
-
 "tokenize" is the process that extracts zero or more tokens from a
 text. There are some "tokenize" methods.
 
@@ -101,14 +92,3 @@ tokenize method. Because ``World`` is tokenized to one token ``World``
 with white-space-separate tokenize method. It means that precision is
 increased for people who wants to search "logical and". But recall is
 decreased because ``Hello World`` that contains ``or`` isn't found.
-
-Built-in tokenizsers
---------------------
-
-Here is a list of built-in tokenizers:
-
-.. toctree::
-   :maxdepth: 1
-   :glob:
-
-   tokenizers/*

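(Editor's note: the bigram tokenize method described in the relocated text above can be sketched in a few lines. This is a hypothetical illustration of the token list given in the document, not Groonga's actual `TokenBigram` implementation; the `bigram_tokens` helper and the `_`-for-space display convention follow the doc's own example.)

```python
def bigram_tokens(text):
    # Show white space as "_", as the document's example does.
    chars = text.replace(" ", "_")
    # Every pair of adjacent characters is one bigram token.
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

print(bigram_tokens("Hello World"))
# ['He', 'el', 'll', 'lo', 'o_', '_W', 'Wo', 'or', 'rl', 'ld']
```

This reproduces the 10 tokens listed in the document for the text ``Hello World``.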
  Modified: doc/source/reference/tokenizers.rst (+1 -102)
===================================================================
--- doc/source/reference/tokenizers.rst    2019-01-07 17:38:33 +0900 (b3f281133)
+++ doc/source/reference/tokenizers.rst    2019-01-07 18:13:12 +0900 (1c8489d3b)
@@ -2,113 +2,12 @@
 
 .. highlightlang:: none
 
-.. groonga-command
-.. database: tokenizers
-
 Tokenizers
 ==========
 
-Summary
--------
-
-Groonga has tokenizer module that tokenizes text. It is used when
-the following cases:
-
-  * Indexing text
-
-    .. figure:: /images/reference/tokenizers/used-when-indexing.png
-       :align: center
-       :width: 80%
-
-       Tokenizer is used when indexing text.
-
-  * Searching by query
-
-    .. figure:: /images/reference/tokenizers/used-when-searching.png
-       :align: center
-       :width: 80%
-
-       Tokenizer is used when searching by query.
-
-Tokenizer is an important module for full-text search. You can change
-trade-off between `precision and recall
-<http://en.wikipedia.org/wiki/Precision_and_recall>`_ by changing
-tokenizer.
-
-Normally, :ref:`token-bigram` is a suitable tokenizer. If you don't
-know much about tokenizer, it's recommended that you choose
-:ref:`token-bigram`.
-
-You can try a tokenizer by :doc:`/reference/commands/tokenize` and
-:doc:`/reference/commands/table_tokenize`. Here is an example to
-try :ref:`token-bigram` tokenizer by
-:doc:`/reference/commands/tokenize`:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/tokenize-example.log
-.. tokenize TokenBigram "Hello World"
-
-What is "tokenize"?
--------------------
-
-"tokenize" is the process that extracts zero or more tokens from a
-text. There are some "tokenize" methods.
-
-For example, ``Hello World`` is tokenized to the following tokens by
-bigram tokenize method:
-
-  * ``He``
-  * ``el``
-  * ``ll``
-  * ``lo``
-  * ``o_`` (``_`` means a white-space)
-  * ``_W`` (``_`` means a white-space)
-  * ``Wo``
-  * ``or``
-  * ``rl``
-  * ``ld``
-
-In the above example, 10 tokens are extracted from one text ``Hello
-World``.
-
-For example, ``Hello World`` is tokenized to the following tokens by
-white-space-separate tokenize method:
-
-  * ``Hello``
-  * ``World``
-
-In the above example, 2 tokens are extracted from one text ``Hello
-World``.
-
-Token is used as search key. You can find indexed documents only by
-tokens that are extracted by used tokenize method. For example, you
-can find ``Hello World`` by ``ll`` with bigram tokenize method but you
-can't find ``Hello World`` by ``ll`` with white-space-separate tokenize
-method. Because white-space-separate tokenize method doesn't extract
-``ll`` token. It just extracts ``Hello`` and ``World`` tokens.
-
-In general, tokenize method that generates small tokens increases
-recall but decreases precision. Tokenize method that generates large
-tokens increases precision but decreases recall.
-
-For example, we can find ``Hello World`` and ``A or B`` by ``or`` with
-bigram tokenize method. ``Hello World`` is a noise for people who
-wants to search "logical and". It means that precision is
-decreased. But recall is increased.
-
-We can find only ``A or B`` by ``or`` with white-space-separate
-tokenize method. Because ``World`` is tokenized to one token ``World``
-with white-space-separate tokenize method. It means that precision is
-increased for people who wants to search "logical and". But recall is
-decreased because ``Hello World`` that contains ``or`` isn't found.
-
-Built-in tokenizsers
---------------------
-
-Here is a list of built-in tokenizers:
-
 .. toctree::
    :maxdepth: 1
    :glob:
 
+   tokenizer/summary
    tokenizers/*
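(Editor's note: the precision/recall trade-off discussed in the moved text — searching ``or`` with bigram vs. white-space-separate tokenize methods — can be illustrated with a small sketch. The helpers below are hypothetical stand-ins, not Groonga code; they only model the token matching the document describes.)

```python
def bigram_tokens(text):
    # Adjacent character pairs; spaces shown as "_" per the doc's example.
    chars = text.replace(" ", "_")
    return [chars[i:i + 2] for i in range(len(chars) - 1)]

def whitespace_tokens(text):
    # White-space-separate tokenize method: split on spaces.
    return text.split()

docs = ["Hello World", "A or B"]

# Bigram tokens contain "or" for both texts ("World" yields "or"):
# higher recall, lower precision for a "logical and" search.
hits_bigram = [d for d in docs if "or" in bigram_tokens(d)]

# Whitespace tokens contain "or" only for "A or B":
# higher precision, lower recall.
hits_ws = [d for d in docs if "or" in whitespace_tokens(d)]

print(hits_bigram)  # ['Hello World', 'A or B']
print(hits_ws)      # ['A or B']
```

As the document says, ``Hello World`` is found by ``or`` only under the bigram method, because ``World`` tokenizes to the single token ``World`` under white-space separation.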

