Kouhei Sutou
null+****@clear*****
Thu Dec 27 17:12:05 JST 2012
Kouhei Sutou	2012-12-27 17:12:05 +0900 (Thu, 27 Dec 2012)

  New Revision: d6abcdee0892c4dcb0ff28ee4469a7c01cbcfe68
  https://github.com/groonga/groonga/commit/d6abcdee0892c4dcb0ff28ee4469a7c01cbcfe68

  Log:
    doc: add documentation about normalizers

  Added files:
    doc/source/reference/normalizers.txt
  Modified files:
    doc/source/reference.txt

  Modified: doc/source/reference.txt (+1 -0)
===================================================================
--- doc/source/reference.txt    2012-12-27 17:09:28 +0900 (613a1c1)
+++ doc/source/reference.txt    2012-12-27 17:12:05 +0900 (7773395)
@@ -13,6 +13,7 @@
    reference/command
    reference/type
    reference/tables
+   reference/normalizers
    reference/tokenizers
    reference/query_expanders
    reference/pseudo_column

  Added: doc/source/reference/normalizers.txt (+122 -0) 100644
===================================================================
--- /dev/null
+++ doc/source/reference/normalizers.txt    2012-12-27 17:12:05 +0900 (85584f0)
@@ -0,0 +1,122 @@
+.. -*- rst -*-
+
+.. highlightlang:: none
+
+.. groonga-command
+.. database: normalisers
+
+Normalizers
+===========
+
+Summary
+-------
+
+Groonga has a normalizer module. It is used when tokenizing text and
+storing table keys. For example, ``A`` and ``a`` are processed as the
+same character after normalization.
+
+A normalizer module can be added as a plugin. You can customize text
+normalization by registering your normalizer plugins to groonga.
+
+A normalizer module is attached to a table. A table can have zero or
+one normalizer module. You can attach a normalizer module to a table
+by the :ref:`table-create-normalizer` option in
+:doc:`/reference/commands/table_create`.
+
+Here is an example ``table_create`` that uses the ``NormalizerAuto``
+normalizer module:
+
+.. groonga-command
+.. include:: ../example/reference/normalizers/example-table-create.log
+.. table_create Dictionary TABLE_HASH_KEY ShortText --normalizer NormalizerAuto
+
+.. note::
+
+   Groonga 2.0.9 or earlier doesn't have the ``--normalizer`` option in
+   ``table_create``. The ``KEY_NORMALIZE`` flag was used instead.
+
+   You can open an old database with groonga 2.1.0 or later. An old
+   database is a database created by groonga 2.0.9 or earlier. But
+   you cannot open that database with groonga 2.0.9 or earlier again.
+   Once you open the old database with groonga 2.1.0 or later, the
+   ``KEY_NORMALIZE`` flag information in the old database is
+   converted to normalizer information. So groonga 2.0.9 or earlier
+   cannot find the ``KEY_NORMALIZE`` flag information in the opened
+   old database.
+
+Keys of a table that has a normalizer module are normalized:
+
+.. groonga-command
+.. include:: ../example/reference/normalizers/example-load.log
+.. load --table Dictionary
+.. [
+.. {"_key": "Apple"},
+.. {"_key": "black"},
+.. {"_key": "COLOR"}
+.. ]
+.. select Dictionary
+
+The ``NormalizerAuto`` normalizer normalizes text to lowercase.
+For example, ``"Apple"`` is normalized to ``"apple"``, ``"black"`` is
+normalized to ``"black"`` and ``"COLOR"`` is normalized to
+``"color"``.
+
+If a table is a lexicon for fulltext search, tokens are normalized
+because tokens are stored as table keys, and table keys are
+normalized as described above.
+
+Built-in normalizers
+--------------------
+
+Here is a list of built-in normalizers:
+
+  * ``NormalizerAuto``
+  * ``NormalizerNFKC51``
+
+``NormalizerAuto``
+^^^^^^^^^^^^^^^^^^
+
+Normally you should use the ``NormalizerAuto``
+normalizer. ``NormalizerAuto`` was the normalizer for groonga 2.0.9 or
+earlier. The ``KEY_NORMALIZE`` flag in ``table_create`` on groonga 2.0.9
+or earlier is equivalent to the ``--normalizer NormalizerAuto`` option in
+``table_create`` on groonga 2.1.0 or later.
+
+``NormalizerAuto`` supports all encodings. It uses Unicode NFKC
+(Normalization Form Compatibility Composition) for UTF-8 encoded
+text. It uses encoding-specific original normalization for other
+encodings. The results of those original normalizations are similar to
+NFKC.
+
+For example, half-width katakana (such as U+FF76 HALFWIDTH KATAKANA
+LETTER KA) + half-width katakana voiced sound mark (U+FF9E HALFWIDTH
+KATAKANA VOICED SOUND MARK) is normalized to full-width katakana with
+voiced sound mark (U+30AC KATAKANA LETTER GA). The former is two
+characters but the latter is one character.
+
+Here is an example that uses the ``NormalizerAuto`` normalizer:
+
+.. groonga-command
+.. include:: ../example/reference/normalizers/normalizer-auto.log
+.. table_create NormalLexicon TABLE_HASH_KEY ShortText --normalizer NormalizerAuto
+
+``NormalizerNFKC51``
+^^^^^^^^^^^^^^^^^^^^
+
+``NormalizerNFKC51`` normalizes text by Unicode NFKC (Normalization
+Form Compatibility Composition) for Unicode version 5.1. It supports
+only UTF-8 encoding.
+
+Normally you don't need to use ``NormalizerNFKC51`` explicitly. You can
+use ``NormalizerAuto`` instead.
+
+Here is an example that uses the ``NormalizerNFKC51`` normalizer:
+
+.. groonga-command
+.. include:: ../example/reference/normalizers/normalizer-nfkc51.log
+.. table_create NormalLexicon TABLE_HASH_KEY ShortText --normalizer NormalizerNFKC51
+
+See also
+--------
+
+* :doc:`/reference/commands/table_create`
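
The NFKC behaviour described in the added document can be checked outside groonga. The following is only a rough sketch in Python, not groonga's implementation: it assumes that NFKC followed by downcasing approximates what ``NormalizerAuto`` does for UTF-8 text, the helper name ``normalize_key`` is hypothetical, and Python's ``unicodedata`` uses the Unicode version bundled with the interpreter rather than Unicode 5.1.

```python
import unicodedata


def normalize_key(text: str) -> str:
    """Hypothetical helper: approximate NormalizerAuto as NFKC + downcase."""
    return unicodedata.normalize("NFKC", text).lower()


# Half-width KA (U+FF76) followed by the half-width voiced sound mark (U+FF9E).
half_width_ga = "\uff76\uff9e"
normalized = unicodedata.normalize("NFKC", half_width_ga)

# Two code points before normalization, one (U+30AC KATAKANA LETTER GA) after.
print(len(half_width_ga), len(normalized))
print(normalized == "\u30ac")

# The keys from the load example, downcased as the document describes.
print([normalize_key(key) for key in ["Apple", "black", "COLOR"]])
```

This only mirrors the two effects the document names (compatibility composition and downcasing). Any further processing that groonga's normalizers perform is not modelled here.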