[Groonga-commit] groonga/groonga [master] [doc][suggest] complete document about correction.

Thu, 11 Aug 2011 00:22:54 JST


Kouhei Sutou	2011-08-10 15:22:54 +0000 (Wed, 10 Aug 2011)

  New Revision: d871e117bc2a72147cf9df3a34096db29a91b7b3

  Log:
    [doc][suggest] complete document about correction.

  Modified files:
    doc/source/suggest/completion.txt
    doc/source/suggest/correction.txt

  Modified: doc/source/suggest/completion.txt (+2 -2)
===================================================================
--- doc/source/suggest/completion.txt    2011-08-10 09:26:39 +0000 (263176f)
+++ doc/source/suggest/completion.txt    2011-08-10 15:22:54 +0000 (6389c7b)
@@ -120,8 +120,8 @@ an user submission". Groonga doesn't treat user inputs
 before a minute ago.
 
 If an user inputs "sea" and cooccurrence search returns
-"search" because "sea" is in input column and completed word
-column is "search".
+"search" because "sea" is in input column and corresponding
+completed word column value is "search".
 
 Prefix search
 ^^^^^^^^^^^^^

  Modified: doc/source/suggest/correction.txt (+105 -10)
===================================================================
--- doc/source/suggest/correction.txt    2011-08-10 09:26:39 +0000 (4a5002a)
+++ doc/source/suggest/correction.txt    2011-08-10 15:22:54 +0000 (5ed749e)
@@ -31,17 +31,18 @@ Cooccurrence search can find registered words from user's
 wrong input. It uses user submit sequences that will be
 learned from query logs, access logs and so on.
 
-For example, there is the following user input sequence:
+For example, suppose there are the following user submissions:
 
-+-----------------+
-|   submit        |
-+=================+
-| serach (typo!)  |
-+-----------------+
-| search (fixed!) |
-+-----------------+
++-------------------+---------------------------+
+|  query            |    time                   |
++===================+===========================+
+| serach (typo!)    | 2011-08-10T22:20:50+09:00 |
++-------------------+---------------------------+
+| search (fixed!)   | 2011-08-10T22:20:52+09:00 |
++-------------------+---------------------------+
 
-Groonga creates the following completion pair:
+Groonga creates the following correction pair from the above
+submissions:
 
 +----------+--------------------+
 |  input   |   corrected word   |
@@ -49,4 +50,98 @@ Groonga creates the following completion pair:
 |serach    |search              |
 +----------+--------------------+
 
-...
+Groonga treats consecutive submissions within a minute as an
+input correction by the user. Inputs that the user typed but
+did not submit between the two submissions aren't used as
+learned data for correction.
+
+If a user inputs "serach", cooccurrence search returns
+"search" because "serach" is in the input column and the
+corresponding corrected word column value is "search".
+
+Similar search
+^^^^^^^^^^^^^^
+
+Similar search can find registered words that have one or
+more of the same tokens as the user input. The TokenBigram
+tokenizer is used for tokenization because the suggest dataset
+schema created by :doc:`executables/groonga-suggest-create-dataset`
+uses TokenBigram as the default tokenizer.
+
+For example, suppose there is a registered query "search engine". A
+user can find "search engine" by "web search service",
+"sound engine" and so on, because "search engine" and "web
+search service" share the token "search", and "search
+engine" and "sound engine" share the token "engine".
+
+"search engine" is tokenized to the "search" and "engine"
+tokens. (Groonga's TokenBigram tokenizer doesn't split
+runs of alphabetic characters or runs of digits into
+two-character tokens, in order to reduce search
+noise. The TokenBigramSplitSymbolAlphaDigit tokenizer should be
+used to ensure tokenizing into two characters.) "web search
+service" is tokenized to "web", "search" and
+"service". "sound engine" is tokenized to "sound" and
+"engine".
+
+How to use
+----------
+
+.. groonga-command
+.. load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
+.. [
+.. {"sequence": "1", "time": 1312950803.86057, "item": "s"},
+.. {"sequence": "1", "time": 1312950803.96857, "item": "sa"},
+.. {"sequence": "1", "time": 1312950804.26057, "item": "sae"},
+.. {"sequence": "1", "time": 1312950804.56057, "item": "saer"},
+.. {"sequence": "1", "time": 1312950804.76057, "item": "saerc"},
+.. {"sequence": "1", "time": 1312950805.76057, "item": "saerch", "type": "submit"},
+.. {"sequence": "1", "time": 1312950809.76057, "item": "serch"},
+.. {"sequence": "1", "time": 1312950810.86057, "item": "search", "type": "submit"}
+.. ]
+
+Groonga provides the :doc:`commands/suggest` command to use
+correction. The `--types correction` option requests corrections.
+
+For example, here is a command to get correction results for
+"saerch":
+
+.. groonga-command
+.. include:: ../example/correction-1.log
+.. suggest --table item_query --column kana --types correction --threshold 1 --query saerch
+
+How it learns
+-------------
+
+Cooccurrence search uses learned data. The data is based on
+query logs, access logs and so on. To create learned data,
+Groonga needs the user's submitted inputs with time stamps.
+
+For example, a user wants to search for "search" but
+mistypes it as "saerch" before entering the correct query. The
+user inputs the query in the following sequence:
+
+  1. 2011-08-10T13:33:23+09:00: s
+  2. 2011-08-10T13:33:23+09:00: sa
+  3. 2011-08-10T13:33:24+09:00: sae
+  4. 2011-08-10T13:33:24+09:00: saer
+  5. 2011-08-10T13:33:24+09:00: saerc
+  6. 2011-08-10T13:33:25+09:00: saerch (submit!)
+  7. 2011-08-10T13:33:29+09:00: serch (correcting...)
+  8. 2011-08-10T13:33:30+09:00: search (submit!)
+
+Groonga can learn from the input sequence with the
+following command::
+
+  load --table event_query --each 'suggest_preparer(_id, type, item, sequence, time, pair_query)'
+  [
+  {"sequence": "1", "time": 1312950803.86057, "item": "s"},
+  {"sequence": "1", "time": 1312950803.96857, "item": "sa"},
+  {"sequence": "1", "time": 1312950804.26057, "item": "sae"},
+  {"sequence": "1", "time": 1312950804.56057, "item": "saer"},
+  {"sequence": "1", "time": 1312950804.76057, "item": "saerc"},
+  {"sequence": "1", "time": 1312950805.76057, "item": "saerch", "type": "submit"},
+  {"sequence": "1", "time": 1312950809.76057, "item": "serch"},
+  {"sequence": "1", "time": 1312950810.86057, "item": "search", "type": "submit"}
+  ]
+
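
The learning rule described in the patch (consecutive submissions within a minute become an input/corrected-word pair) can be illustrated with a minimal Python sketch. This is not Groonga's implementation: `correction_pairs` is a hypothetical helper, and treating "within a minute" as an inclusive 60-second window and skipping identical resubmissions are both assumptions made for the example.

```python
from datetime import datetime, timedelta


def correction_pairs(submissions, window=timedelta(minutes=1)):
    """Pair each submission with the next one if both fall within `window`.

    Hypothetical helper for illustration only; Groonga's real logic lives in
    its suggest plugin and may differ (for example at the window boundary).
    """
    pairs = []
    for (wrong, t1), (fixed, t2) in zip(submissions, submissions[1:]):
        # Assumption: a resubmission of the same text is not a correction.
        if wrong != fixed and t2 - t1 <= window:
            pairs.append((wrong, fixed))
    return pairs


# Only submitted queries are listed; unsubmitted keystrokes are not used.
submissions = [
    ("serach", datetime.fromisoformat("2011-08-10T22:20:50+09:00")),
    ("search", datetime.fromisoformat("2011-08-10T22:20:52+09:00")),
]
print(correction_pairs(submissions))
```

With the two submissions from the table in the patch, the sketch yields the single pair ("serach", "search"), matching the input and corrected word columns shown there.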

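
The token-overlap idea behind similar search can be sketched the same way. The sketch below assumes whitespace splitting as a stand-in for how TokenBigram keeps alphabetic words whole, as the patch describes; `tokens` and `similar` are hypothetical names, and real Groonga scoring and thresholds are not modelled.

```python
def tokens(text):
    # Assumption: split on whitespace, approximating TokenBigram's handling
    # of alphabetic words as described in the documentation above.
    return set(text.lower().split())


def similar(query, registered):
    """Return registered words sharing at least one token with the query."""
    query_tokens = tokens(query)
    return [word for word in registered if query_tokens & tokens(word)]


registered = ["search engine"]
for query in ["web search service", "sound engine", "database"]:
    print(query, "->", similar(query, registered))
```

Here "web search service" and "sound engine" both find "search engine" through the shared tokens "search" and "engine" respectively, while "database" shares no token and finds nothing.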


