Yasuhiro Horimoto	2019-01-07 18:13:12 +0900 (Mon, 07 Jan 2019)

  Revision: 828d36286d1eddd06fa5019558ce58519239d935
  https://github.com/groonga/groonga/commit/828d36286d1eddd06fa5019558ce58519239d935

  Message:
    doc: modify to display tokenizers just beneath "http://groonga.org/docs/"

  Copied files:
    doc/source/reference/tokenizer/summary.rst
      (from doc/source/reference/tokenizers.rst)
  Modified files:
    doc/source/reference/tokenizers.rst

  Copied: doc/source/reference/tokenizer/summary.rst (+1 -21) 92%
===================================================================
--- doc/source/reference/tokenizers.rst    2019-01-07 17:38:33 +0900 (b3f281133)
+++ doc/source/reference/tokenizer/summary.rst    2019-01-07 18:13:12 +0900 (ba299fe9d)
@@ -2,14 +2,8 @@
 
 .. highlightlang:: none
 
-.. groonga-command
-.. database: tokenizers
-
-Tokenizers
-==========
-
 Summary
--------
+=======
 
 Groonga has tokenizer module that tokenizes text. It is used when
 the following cases:
@@ -48,9 +42,6 @@ try :ref:`token-bigram` tokenizer by
 .. include:: ../example/reference/tokenizers/tokenize-example.log
 .. tokenize TokenBigram "Hello World"
 
-What is "tokenize"?
--------------------
-
 "tokenize" is the process that extracts zero or more tokens from a
 text. There are some "tokenize" methods.
 
@@ -101,14 +92,3 @@ tokenize method. Because ``World`` is tokenized to one token ``World``
 with white-space-separate tokenize method. It means that precision is
 increased for people who wants to search "logical and". But recall is
 decreased because ``Hello World`` that contains ``or`` isn't found.
-
-Built-in tokenizsers
---------------------
-
-Here is a list of built-in tokenizers:
-
-.. toctree::
-   :maxdepth: 1
-   :glob:
-
-   tokenizers/*

  Modified: doc/source/reference/tokenizers.rst (+1 -102)
===================================================================
--- doc/source/reference/tokenizers.rst    2019-01-07 17:38:33 +0900 (b3f281133)
+++ doc/source/reference/tokenizers.rst    2019-01-07 18:13:12 +0900 (1c8489d3b)
@@ -2,113 +2,12 @@
 
 .. highlightlang:: none
 
-.. groonga-command
-.. database: tokenizers
-
 Tokenizers
 ==========
 
-Summary
--------
-
-Groonga has tokenizer module that tokenizes text. It is used when
-the following cases:
-
- * Indexing text
-
-   .. figure:: /images/reference/tokenizers/used-when-indexing.png
-      :align: center
-      :width: 80%
-
-      Tokenizer is used when indexing text.
-
- * Searching by query
-
-   .. figure:: /images/reference/tokenizers/used-when-searching.png
-      :align: center
-      :width: 80%
-
-      Tokenizer is used when searching by query.
-
-Tokenizer is an important module for full-text search. You can change
-trade-off between `precision and recall
-<http://en.wikipedia.org/wiki/Precision_and_recall>`_ by changing
-tokenizer.
-
-Normally, :ref:`token-bigram` is a suitable tokenizer. If you don't
-know much about tokenizer, it's recommended that you choose
-:ref:`token-bigram`.
-
-You can try a tokenizer by :doc:`/reference/commands/tokenize` and
-:doc:`/reference/commands/table_tokenize`. Here is an example to
-try :ref:`token-bigram` tokenizer by
-:doc:`/reference/commands/tokenize`:
-
-.. groonga-command
-.. include:: ../example/reference/tokenizers/tokenize-example.log
-.. tokenize TokenBigram "Hello World"
-
-What is "tokenize"?
--------------------
-
-"tokenize" is the process that extracts zero or more tokens from a
-text. There are some "tokenize" methods.
-
-For example, ``Hello World`` is tokenized to the following tokens by
-bigram tokenize method:
-
- * ``He``
- * ``el``
- * ``ll``
- * ``lo``
- * ``o_`` (``_`` means a white-space)
- * ``_W`` (``_`` means a white-space)
- * ``Wo``
- * ``or``
- * ``rl``
- * ``ld``
-
-In the above example, 10 tokens are extracted from one text ``Hello
-World``.
-
-For example, ``Hello World`` is tokenized to the following tokens by
-white-space-separate tokenize method:
-
- * ``Hello``
- * ``World``
-
-In the above example, 2 tokens are extracted from one text ``Hello
-World``.
-
-Token is used as search key. You can find indexed documents only by
-tokens that are extracted by used tokenize method. For example, you
-can find ``Hello World`` by ``ll`` with bigram tokenize method but you
-can't find ``Hello World`` by ``ll`` with white-space-separate tokenize
-method. Because white-space-separate tokenize method doesn't extract
-``ll`` token. It just extracts ``Hello`` and ``World`` tokens.
-
-In general, tokenize method that generates small tokens increases
-recall but decreases precision. Tokenize method that generates large
-tokens increases precision but decreases recall.
-
-For example, we can find ``Hello World`` and ``A or B`` by ``or`` with
-bigram tokenize method. ``Hello World`` is a noise for people who
-wants to search "logical and". It means that precision is
-decreased. But recall is increased.
-
-We can find only ``A or B`` by ``or`` with white-space-separate
-tokenize method. Because ``World`` is tokenized to one token ``World``
-with white-space-separate tokenize method. It means that precision is
-increased for people who wants to search "logical and". But recall is
-decreased because ``Hello World`` that contains ``or`` isn't found.
-
-Built-in tokenizsers
---------------------
-
-Here is a list of built-in tokenizers:
-
 .. toctree::
    :maxdepth: 1
    :glob:
 
+   tokenizer/summary
    tokenizers/*
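
For reference, the moved text above describes two tokenize methods
(bigram and white-space-separate) and their precision/recall
trade-off. The following is a minimal sketch in Python of those two
methods, illustrative only and not groonga's actual tokenizer code;
the function names are made up for this example:

    def bigram_tokenize(text):
        # Extract every pair of adjacent characters, spaces included,
        # like the bigram tokenize method described above.
        return [text[i:i + 2] for i in range(len(text) - 1)]

    def whitespace_tokenize(text):
        # Split on runs of white space, like the
        # white-space-separate tokenize method.
        return text.split()

    print(bigram_tokenize("Hello World"))
    # ['He', 'el', 'll', 'lo', 'o ', ' W', 'Wo', 'or', 'rl', 'ld']
    print(whitespace_tokenize("Hello World"))
    # ['Hello', 'World']

    # The trade-off from the text: the token "or" matches both texts
    # under bigram tokenization (higher recall, lower precision) but
    # only "A or B" under white-space tokenization (higher precision,
    # lower recall).
    for text in ["Hello World", "A or B"]:
        print(text,
              "or" in bigram_tokenize(text),
              "or" in whitespace_tokenize(text))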