[Groonga-commit] ranguba/chupa-text at c8734ff [master] Split documentation

Kouhei Sutou null+****@clear*****
Sun Jan 5 00:05:35 JST 2014


Kouhei Sutou	2014-01-05 00:05:35 +0900 (Sun, 05 Jan 2014)

  New Revision: c8734ff15ac87dd0d2d33e84325ed2f6e0e34fd9
  https://github.com/ranguba/chupa-text/commit/c8734ff15ac87dd0d2d33e84325ed2f6e0e34fd9

  Message:
    Split documentation

  Added files:
    doc/text/command-line.md
    doc/text/decomposer.md
    doc/text/library.md
    doc/text/news.md
  Modified files:
    README.md

  Modified: README.md (+11 -208)
===================================================================
--- README.md    2014-01-05 00:05:16 +0900 (eeb1290)
+++ README.md    2014-01-05 00:05:35 +0900 (c29da8d)
@@ -46,7 +46,7 @@ decomposed data will be text data.
 Decomposer module is a plugin. You can add supported data types by
 installing decomposer modules. Or you can create your custom
 decomposer. Decomposer is a simple Ruby object. So it is easy to
-write. It is described later.
+create. It is described later.
 
 ## Install
 
@@ -65,216 +65,19 @@ chupa-text 1.0.0
 
 ## How to use
 
-You can use ChupaText as command line tool or Ruby library.
+You can use ChupaText as a command line tool or a Ruby library. See the
+following documents for details:
 
-### How to use as command line tool
+  * [doc/text/command-line.md](http://rubydoc.info/gems/chupa-text/file/doc/text/command-line.md)
+    describes how to use ChupaText as a command line tool.
+  * [doc/text/library.md](http://rubydoc.info/gems/chupa-text/file/doc/text/library.md)
+    describes how to use ChupaText as a Ruby library.
 
-You can extract text and meta-data from an input by `chupa-text`
-command. `chupa-text` prints extracted text and meta-data as JSON.
+## How to create a decomposer
 
-#### Input
-
-`chupa-text` command accepts a local file path or a URI.
-
-Here is a local file path example:
-
-```
-% chupa-text hello.txt.gz
-```
-
-Here is a URI example:
-
-```
-% chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz
-```
-
-#### Output
-
-`chupa-text` command prints the extracted result as JSON:
-
-```
-% chupa-text hello.txt.gz
-{
-  "mime-type": "application/x-gzip",
-  "uri": "hello.txt.gz",
-  "size": 36,
-  "texts": [
-    {
-      "mime-type": "text/plain",
-      "uri": "hello.txt",
-      "size": 6,
-      "body": "Hello\n"
-    }
-  ]
-}
-```
-
-JSON uses the following data structure:
-
-```txt
-{
-  "mime-type":        "<MIME type of the input>",
-  "uri":              "<URI or path of the input>",
-  "size":             <Byte size of the input data>,
-  "other-meta-data1": <Other meta-data value1>,
-  "other-meta-data2": <Other meta-data value2>,
-  "...":              <...>,
-  "texts": [
-    {
-      "mime-type":        "<MIME type of the extracted data1>",
-      "uri":              "<URI or path of the extracted data1>",
-      "size":             <Byte size of the text of the extracted data1>,
-      "body":             "<The text of the extracted data1>",
-      "other-meta-data1": <Other meta-data value1 of the extracted data1>,
-      "other-meta-data2": <Other meta-data value2 of the extracted data1>,
-      "...":              <...>
-    },
-    {
-      <The information of the extracted data2>
-    },
-    {
-      <The information of the extracted data3>
-    },
-    <...>
-  ]
-}
-```
-
-You can find extracted texts in `texts[0].body`, `texts[1].body` and
-so on. You may extract one or more texts from one input because
-ChupaText supports archive file such as `tar`.
-
-#### Command line options
-
-You can customize `chupa-text` command behavior. Here are command line
-options:
-
-`--configuration=FILE`
-
-It reads configuration from `FILE`. See the next section for
-configuration file details.
-
-ChupaText provides the default configuration file. It has suitable
-configurations. Normally, you don't need to use your custom
-configuration file.
-
-`--help`
-
-It shows available command line options and exits.
-
-#### Configuration
-
-ChupaText configuration file is a Ruby script but it is easy to read
-and write ChupaText configuration file for users who don't know about
-Ruby.
-
-The basic syntax is the following:
-
-```
-category.name = value
-```
-
-Here is an example that sets `["tar", "gzip"]` as `value` to `names`
-name variable in `decomposer` category:
-
-```
-decomposer.names = ["tar", "gzip"]
-```
-
-Here are configuration parameters:
-
-`decomposer.names = ["<decomposer name1>", "<decomposer name2>", "..."]`
-
-It specifies an array of decomposer names to be used in `chupa-text`
-command. You can use glob pattern for decomposer name such as
-`"*zip"`. `"*zip"` matches `"zip"`, `"gzip"` and so on.
-
-The default is `["*"]`. It means that all installed decomposers are
-used.
-
-`mime_type["<extension>"] = "<MIME type>"`
-
-It specifies a map to a MIME type from path extension.
-
-Here is an example that maps `"html"` to `"text/html"`:
-
-```
-mime_type["html"] = "text/html"
-```
-
-The default configuration file registers popular MIME types.
-
-### How to use as Ruby library
-
-You can use ChupaText as a Ruby library. If you want to extract text
-data from many input data, `chupa-text` command may be
-inefficient. You need to execute `chupa-text` command to process one
-input file. You need to execute `chupa-text` command N times to
-process N input files. It means that you need to initialize ChupaText
-N times. It may be inefficient.
-
-You can reduce initializations of ChupaText by using ChupaText as a
-Ruby library.
-
-Here is a simple usage:
-
-```
-require "chupa-text"
-gem "chupa-text-decomposer-html"
-
-ChupaText::Decomposers.load
-
-extractor = ChupaText::Extractor.new
-extractor.apply_configuration(ChupaText::Configuration.default)
-
-extractor.extract("http://ranguba.org/") do |text_data|
-  puts(text_data.body)
-end
-extractor.extract("http://ranguba.org/ja/") do |text_data|
-  puts(text_data.body)
-end
-```
-
-It is better that you use Bundler to manage decomposer plugins:
-
-```
-# Gemfile
-source "https://rubygems.org"
-
-gem "chupa-text-decomposer-html"
-gem "chupa-text-decomposer-XXX"
-# ...
-```
-
-Here is a usage that uses the Gemfile:
-
-```
-require "bundler/setup"
-
-ChupaText::Decomposers.load
-
-extractor = ChupaText::Extractor.new
-extractor.apply_configuration(ChupaText::Configuration.default)
-
-extractor.extract("http://ranguba.org/") do |text_data|
-  puts(text_data.body)
-end
-extractor.extract("http://ranguba.org/ja/") do |text_data|
-  puts(text_data.body)
-end
-```
-
-Use {ChupaText::Data#[]} to get meta-data from extracted text
-data. For example, you can get title from input HTML:
-
-```
-extractor.extract("http://ranguba.org/") do |text_data|
-  puts(text_data["title"])
-end
-```
-
-Which meta-data is available depends on the decomposer. See each
-decomposer's documentation for details.
+See
+[doc/text/decomposer.md](http://rubydoc.info/gems/chupa-text/file/doc/text/decomposer.md)
+for how to write a decomposer.
 
 ## Author
 

  Added: doc/text/command-line.md (+136 -0) 100644
===================================================================
--- /dev/null
+++ doc/text/command-line.md    2014-01-05 00:05:35 +0900 (68a256d)
@@ -0,0 +1,136 @@
+# How to use ChupaText as a command line tool
+
+You can extract text and meta-data from an input with the `chupa-text`
+command. `chupa-text` prints the extracted text and meta-data as JSON.
+
+## Input
+
+The `chupa-text` command accepts a local file path or a URI.
+
+Here is a local file path example:
+
+```
+% chupa-text hello.txt.gz
+```
+
+Here is a URI example:
+
+```
+% chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz
+```
+
+## Output
+
+The `chupa-text` command prints the extracted result as JSON:
+
+```
+% chupa-text hello.txt.gz
+{
+  "mime-type": "application/x-gzip",
+  "uri": "hello.txt.gz",
+  "size": 36,
+  "texts": [
+    {
+      "mime-type": "text/plain",
+      "uri": "hello.txt",
+      "size": 6,
+      "body": "Hello\n"
+    }
+  ]
+}
+```
+
+The JSON uses the following data structure:
+
+```txt
+{
+  "mime-type":        "<MIME type of the input>",
+  "uri":              "<URI or path of the input>",
+  "size":             <Byte size of the input data>,
+  "other-meta-data1": <Other meta-data value1>,
+  "other-meta-data2": <Other meta-data value2>,
+  "...":              <...>,
+  "texts": [
+    {
+      "mime-type":        "<MIME type of the extracted data1>",
+      "uri":              "<URI or path of the extracted data1>",
+      "size":             <Byte size of the text of the extracted data1>,
+      "body":             "<The text of the extracted data1>",
+      "other-meta-data1": <Other meta-data value1 of the extracted data1>,
+      "other-meta-data2": <Other meta-data value2 of the extracted data1>,
+      "...":              <...>
+    },
+    {
+      <The information of the extracted data2>
+    },
+    {
+      <The information of the extracted data3>
+    },
+    <...>
+  ]
+}
+```
+
+You can find extracted texts in `texts[0].body`, `texts[1].body` and
+so on. One input may produce more than one text because ChupaText
+supports archive files such as `tar`.
+
+## Command line options
+
+You can customize the behavior of the `chupa-text` command. Here are
+the command line options:
+
+`--configuration=FILE`
+
+It reads configuration from `FILE`. See the next section for
+configuration file details.
+
+ChupaText provides the default configuration file. It has suitable
+configurations. Normally, you don't need to use your custom
+configuration file.
+
+`--help`
+
+It shows available command line options and exits.
+
+## Configuration
+
+A ChupaText configuration file is a Ruby script, but the syntax is
+simple enough to read and write even for users who don't know
+Ruby.
+
+The basic syntax is the following:
+
+```
+category.name = value
+```
+
+Here is an example that sets `["tar", "gzip"]` as the `value` of the
+`names` variable in the `decomposer` category:
+
+```
+decomposer.names = ["tar", "gzip"]
+```
+
+Here are configuration parameters:
+
+`decomposer.names = ["<decomposer name1>", "<decomposer name2>", "..."]`
+
+It specifies an array of decomposer names to be used by the `chupa-text`
+command. You can use a glob pattern for a decomposer name such as
+`"*zip"`. `"*zip"` matches `"zip"`, `"gzip"` and so on.
+
+The default is `["*"]`. It means that all installed decomposers are
+used.
+
+`mime_type["<extension>"] = "<MIME type>"`
+
+It maps a path extension to a MIME type.
+
+Here is an example that maps `"html"` to `"text/html"`:
+
+```
+mime_type["html"] = "text/html"
+```
+
+The default configuration file registers popular MIME types.
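
As a sketch of how a Ruby script might consume the JSON described under "Output" in doc/text/command-line.md, the following reads `texts[0].body`, `texts[1].body` and so on. The `extracted_bodies` helper is a hypothetical name, and the sample string is copied from the `hello.txt.gz` example rather than produced by running the command:

```ruby
require "json"

# Sample output copied from the `chupa-text hello.txt.gz` example.
# A real script would capture the command's standard output instead.
output = <<-'JSON'
{
  "mime-type": "application/x-gzip",
  "uri": "hello.txt.gz",
  "size": 36,
  "texts": [
    {
      "mime-type": "text/plain",
      "uri": "hello.txt",
      "size": 6,
      "body": "Hello\n"
    }
  ]
}
JSON

# Hypothetical helper: returns the body of every extracted text.
def extracted_bodies(json_text)
  result = JSON.parse(json_text)
  result.fetch("texts", []).collect { |text| text["body"] }
end

bodies = extracted_bodies(output)
bodies.each { |body| puts(body) }
```

The quoted heredoc (`<<-'JSON'`) keeps `\n` as a JSON escape, so the parser, not Ruby, turns it into a newline.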

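The glob pattern behavior described for `decomposer.names` can be illustrated with Ruby's `File.fnmatch`. This is only an illustration: whether ChupaText itself uses `File.fnmatch` is an assumption, and `select_decomposer_names` and the list of installed names are hypothetical:

```ruby
# Hypothetical helper: keeps the installed names that match any pattern.
def select_decomposer_names(patterns, installed_names)
  installed_names.select do |name|
    patterns.any? { |pattern| File.fnmatch(pattern, name) }
  end
end

# Hypothetical list of installed decomposers.
installed = ["tar", "gzip", "zip", "html"]

zip_like = select_decomposer_names(["*zip"], installed)
everything = select_decomposer_names(["*"], installed)

p(zip_like)
p(everything)
```

With these inputs, `"*zip"` selects only the names that end with `zip`, and `"*"` selects every name, which mirrors the documented default.
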
  Added: doc/text/decomposer.md (+343 -0) 100644
===================================================================
--- /dev/null
+++ doc/text/decomposer.md    2014-01-05 00:05:35 +0900 (d7142ba)
@@ -0,0 +1,343 @@
+# How to create a decomposer
+
+You can extend ChupaText with Ruby. You can add a supported input type
+by writing a decomposer module.
+
+## Overview
+
+A decomposer is a Ruby class. It needs the following two methods:
+
+  * `target?`
+  * `decompose`
+
+Both of them accept only one argument `data`. `data` is an input
+data.
+
+First, ChupaText calls `target?` method of your decomposer. If your
+decomposer can decompose the input data, your `target?` method should
+return `true`.
+
+If your decomposer's `target?` method returns `true`, ChupaText calls
+the `decompose` method of your decomposer. Your decomposer needs to
+decompose the input data and `yield` extracted text data or other
+format data that will be decomposed by other decomposers. Your
+decomposer can `yield` multiple times.
+
+If your decomposer decomposes an archive file such as tar and zip
+archives, your `decompose` method will `yield` other format data. If
+your decomposer extracts text and meta-data from an input such as
+HTML, your `decompose` method will `yield` text data.
+
+## Example
+
+Let's create a simple XML decomposer as an example. It extracts text
+data from input XML.
+
+For example, here is an input XML:
+
+```xml
+<root>
+  Hello <em>&amp;</em> World!
+</root>
+```
+
+The XML decomposer extracts the following text:
+
+```text
+Hello & World!
+```
+
+ChupaText provides the `chupa-text-generate-decomposer` command. It
+generates skeleton code for a new decomposer. Let's use it.
+
+`chupa-text-generate-decomposer` accepts the required information from
+command line options or from standard input. You can confirm
+the required information with the `--help` option:
+
+```text
+% chupa-text-generate-decomposer --help
+Usage: chupa-text-generate-decomposer [options]
+        --name=NAME                  Decomposer name
+                                     (e.g.: html)
+        --extensions=EXTENSION1,EXTENSION2,...
+                                     Target file extensions
+                                     (e.g.: htm,html,xhtml)
+        --mime-types=TYPE1,TYPE2,... Target MIME types
+                                     (e.g.: text/html,application/xhtml+xml)
+        --author=AUTHOR              Author
+                                     (e.g.: 'Your Name')
+                                     (default: Kouhei Sutou)
+        --email=EMAIL                Author E-mail
+                                     (e.g.: your@email.address)
+                                     (default: kou@clear-code.com)
+        --license=LICENSE            License
+                                     (e.g.: MIT)
+                                     (default: LGPLv2.1 or later)
+```
+
+Some pieces of information have default values. In the above case,
+`--author`, `--email` and `--license` have default values.
+
+XML decomposer uses the following information:
+
+  * `--name`: `xml`
+  * `--extensions`: `xml`
+  * `--mime-types`: `text/xml`
+
+Run with the above information:
+
+```text
+% chupa-text-generate-decomposer --name xml --extensions xml --mime-types text/xml
+Creating directory: chupa-text-decomposer-xml
+Creating file:      chupa-text-decomposer-xml/chupa-text-decomposer-xml.gemspec
+Creating file:      chupa-text-decomposer-xml/Gemfile
+Creating file:      chupa-text-decomposer-xml/Rakefile
+Creating file:      chupa-text-decomposer-xml/LICENSE.txt
+Creating directory: chupa-text-decomposer-xml/lib/chupa-text/decomposers
+Creating file:      chupa-text-decomposer-xml/lib/chupa-text/decomposers/xml.rb
+Creating directory: chupa-text-decomposer-xml/test
+Creating file:      chupa-text-decomposer-xml/test/test-xml.rb
+Creating file:      chupa-text-decomposer-xml/test/helper.rb
+Creating file:      chupa-text-decomposer-xml/test/run-test.rb
+```
+
+`chupa-text-generate-decomposer` generates a directory named
+`chupa-text-decomposer-#{name}/`.
+
+Look at `lib/chupa-text/decomposers/xml.rb`:
+
+```
+module ChupaText
+  module Decomposers
+    class Xml < Decomposer
+      def target?(data)
+        ["xml"].include?(data.extension) or
+          ["text/xml"].include?(data.mime_type)
+      end
+
+      def decompose(data)
+        raise NotImplementedError, "#{self.class}##{__method__} isn't implemented yet."
+        text = "IMPLEMENTED ME"
+        text_data = TextData.new(text)
+        yield(text_data)
+      end
+    end
+  end
+end
+```
+
+The generated code implements the `target?` method but doesn't fully
+implement the `decompose` method. Let's implement the `decompose` method:
+
+```
+require "cgi"
+
+# ...
+      def decompose(data)
+        text = CGI.unescapeHTML(untag(data.body).strip)
+        text_data = TextData.new(text)
+        yield(text_data)
+      end
+
+      private
+      def untag(xml)
+        xml.gsub(/<.+?>/m, "")
+      end
+# ...
+```
+
+`chupa-text-generate-decomposer` also generates a test. Run the test:
+
+```
+% bundle install
+% rake
+/usr/bin/ruby2.0 test/run-test.rb
+Loaded suite .
+Started
+F
+===============================================================================
+Failure:
+test_body(decompose)
+/tmp/chupa-text-decomposer-xml/test/test-xml.rb:24:in `test_body'
+     21:     def test_body
+     22:       input_body = "TODO (input)"
+     23:       expected_text = "TODO (extracted)"
+  => 24:       assert_equal([expected_text],
+     25:                    decompose(input_body).collect(&:body))
+     26:     end
+     27:   end
+<["TODO (extracted)"]> expected but was
+<["TODO (input)"]>
+
+diff:
+? ["TODO (ex  tracted)"]
+?         inpu          
+===============================================================================
+
+
+Finished in 0.013355116 seconds.
+
+1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
+0% passed
+
+74.88 tests/s, 74.88 assertions/s
+rake aborted!
+Command failed with status (1): [/usr/bin/ruby2.0 test/run-test.rb...]
+/tmp/chupa-text-decomposer-xml/Rakefile:9:in `block in <top (required)>'
+```
+
+The generated test fails because the test has placeholders. Look at the
+generated test:
+
+```
+class TestXml < Test::Unit::TestCase
+  include Helper
+
+  def setup
+    @decomposer = ChupaText::Decomposers::Xml.new({})
+  end
+
+  sub_test_case("decompose") do
+    def decompose(input_body)
+      data = ChupaText::Data.new
+      data.mime_type = "text/xml"
+      data.body = input_body
+
+      decomposed = []
+      @decomposer.decompose(data) do |decomposed_data|
+        decomposed << decomposed_data
+      end
+      decomposed
+    end
+
+    def test_body
+      input_body = "TODO (input)"
+      expected_text = "TODO (extracted)"
+      assert_equal([expected_text],
+                   decompose(input_body).collect(&:body))
+    end
+  end
+end
+```
+
+`test_body` has TODO code as placeholders:
+
+```
+# ...
+    def test_body
+      input_body = "TODO (input)"
+      expected_text = "TODO (extracted)"
+      assert_equal([expected_text],
+                   decompose(input_body).collect(&:body))
+    end
+# ...
+```
+
+Fill in the TODOs with test XML and the expected result:
+
+```
+# ...
+    def test_body
+      input_body = <<-XML
+<root>
+  Hello <em>&amp;</em> World!
+</root>
+      XML
+      expected_text = "Hello & World!"
+      assert_equal([expected_text],
+                   decompose(input_body).collect(&:body))
+    end
+# ...
+```
+
+Run the test again:
+
+```
+% rake
+/usr/bin/ruby2.0 test/run-test.rb
+Loaded suite .
+Started
+.
+
+Finished in 0.000915172 seconds.
+
+1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
+100% passed
+
+1092.69 tests/s, 1092.69 assertions/s
+```
+
+The test passes!
+
+You can release the decomposer with the following command. It requires an
+account on https://rubygems.org/.
+
+```
+% rake release
+```
+
+Do you see how to create a new decomposer now?
+
+## API reference
+
+### `data`
+
+Both `target?` and `decompose` receive one argument, `data`. It is a
+{ChupaText::Data} instance or an instance of one of its subclasses. You
+only need to read the API reference manual for {ChupaText::Data}. Don't
+use subclass-specific API because it is not portable.
+
+### `target?`
+
+`target?` should return `true` or `false`. The decomposer should
+return `true` if it can decompose the received `data`, `false`
+otherwise.
+
+### `decompose`
+
+`decompose` decomposes the input `data` and `yield`s extracted text data
+or decomposed data of another type. `decompose` can `yield` zero or more
+times.
+
+Here is template code to `yield` extracted text data:
+
+```
+def decompose(data)
+  text = extract_text(data)
+  text_data = ChupaText::TextData.new(text)
+  # text_data["meta-data1"] = meta_data_value1
+  # text_data["meta-data2"] = meta_data_value2
+  # ...
+  yield(text_data)
+end
+```
+
+See
+[lib/chupa-text/decomposers/csv.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/csv.rb)
+as an example of extracting text data.
+
+Here is template code to `yield` other type data:
+
+```
+def decompose(data)
+  entries = decompose_archive(data)
+  entries.each do |entry|
+    path = entry.path
+    if entry.respond_to?(:read)
+      # The input must have "read" method.
+      input = entry
+    else
+      # If the entry doesn't have "read" method, wrap String data
+      # by StringIO (this needs `require "stringio"`).
+      input = StringIO.new(entry.data)
+    end
+    decomposed_data = ChupaText::VirtualFileData.new(path, input)
+    decomposed_data.source = data
+    yield(decomposed_data)
+  end
+end
+```
+
+See
+[lib/chupa-text/decomposers/tar.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/tar.rb)
+as an example of decomposing to other type data.
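
The `target?` and `decompose` protocol from doc/text/decomposer.md can be tried without installing ChupaText. In this sketch `SketchData` and `XmlDecomposerSketch` are hypothetical stand-ins (the real decomposer inherits from `ChupaText::Decomposer` and yields `ChupaText::TextData`); only the `untag` and `CGI.unescapeHTML` logic is taken from the example:

```ruby
require "cgi"

# Hypothetical stand-in for ChupaText::Data with only the fields used here.
SketchData = Struct.new(:extension, :mime_type, :body)

class XmlDecomposerSketch
  # Returns true when this decomposer can handle the data.
  def target?(data)
    ["xml"].include?(data.extension) or
      ["text/xml"].include?(data.mime_type)
  end

  # Yields the extracted text; the real one yields a TextData object.
  def decompose(data)
    yield(CGI.unescapeHTML(untag(data.body).strip))
  end

  private
  # Naive tag removal, as in the example; not a real XML parser.
  def untag(xml)
    xml.gsub(/<.+?>/m, "")
  end
end

data = SketchData.new("xml", "text/xml", <<-XML)
<root>
  Hello <em>&amp;</em> World!
</root>
XML

decomposer = XmlDecomposerSketch.new
texts = []
decomposer.decompose(data) { |text| texts << text } if decomposer.target?(data)
p(texts)
```

Calling `target?` before `decompose` mirrors the order in which the overview says ChupaText calls the two methods.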

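The overview says that a decomposer may `yield` other format data that is then decomposed by other decomposers. The following sketch shows one way such a loop could work. Every name in it (`ToyData`, the two toy decomposers, `extract_texts` and the toy MIME types) is hypothetical and is not ChupaText's actual implementation:

```ruby
# Hypothetical stand-in for a data object: a MIME type and a body.
ToyData = Struct.new(:mime_type, :body)

# Treats the body as an Array of entries and yields each entry as other
# format data, like an archive decomposer would.
class ToyArchiveDecomposer
  def target?(data)
    data.mime_type == "application/x-toy-archive"
  end

  def decompose(data)
    data.body.each { |entry| yield(entry) }
  end
end

# Converts upper-case "shout" data to plain text data.
class ToyShoutDecomposer
  def target?(data)
    data.mime_type == "text/x-shout"
  end

  def decompose(data)
    yield(ToyData.new("text/plain", data.body.downcase))
  end
end

# Hypothetical dispatcher: plain text is passed to the caller's block;
# anything else goes to the first decomposer whose target? returns true,
# and every yielded data is fed back in.
def extract_texts(data, decomposers, &block)
  if data.mime_type == "text/plain"
    yield(data)
    return
  end
  decomposer = decomposers.find { |candidate| candidate.target?(data) }
  return if decomposer.nil? # No decomposer can handle this data.
  decomposer.decompose(data) do |decomposed|
    extract_texts(decomposed, decomposers, &block)
  end
end

archive = ToyData.new("application/x-toy-archive",
                      [ToyData.new("text/x-shout", "HELLO"),
                       ToyData.new("text/plain", "world")])
decomposers = [ToyArchiveDecomposer.new, ToyShoutDecomposer.new]
texts = []
extract_texts(archive, decomposers) { |text_data| texts << text_data.body }
p(texts)
```
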
  Added: doc/text/library.md (+72 -0) 100644
===================================================================
--- /dev/null
+++ doc/text/library.md    2014-01-05 00:05:35 +0900 (b7ca421)
@@ -0,0 +1,72 @@
+# How to use ChupaText as a Ruby library
+
+You can use ChupaText as a Ruby library. If you want to extract text
+data from many inputs, the `chupa-text` command may be
+inefficient. You need to execute the `chupa-text` command once per
+input file, so you need to execute it N times to
+process N input files. It means that you need to initialize ChupaText
+N times.
+
+You can reduce initializations of ChupaText by using ChupaText as a Ruby
+library.
+
+Here is a simple usage:
+
+```
+require "chupa-text"
+gem "chupa-text-decomposer-html"
+
+ChupaText::Decomposers.load
+
+extractor = ChupaText::Extractor.new
+extractor.apply_configuration(ChupaText::Configuration.default)
+
+extractor.extract("http://ranguba.org/") do |text_data|
+  puts(text_data.body)
+end
+extractor.extract("http://ranguba.org/ja/") do |text_data|
+  puts(text_data.body)
+end
+```
+
+It is better to use Bundler to manage decomposer plugins:
+
+```
+# Gemfile
+source "https://rubygems.org"
+
+gem "chupa-text-decomposer-html"
+gem "chupa-text-decomposer-XXX"
+# ...
+```
+
+Here is an example that uses the Gemfile:
+
+```
+require "bundler/setup"
+require "chupa-text"
+ChupaText::Decomposers.load
+
+extractor = ChupaText::Extractor.new
+extractor.apply_configuration(ChupaText::Configuration.default)
+
+extractor.extract("http://ranguba.org/") do |text_data|
+  puts(text_data.body)
+end
+extractor.extract("http://ranguba.org/ja/") do |text_data|
+  puts(text_data.body)
+end
+```
+
+Use {ChupaText::Data#[]} to get meta-data from extracted text
+data. For example, you can get the title from input HTML:
+
+```
+extractor.extract("http://ranguba.org/") do |text_data|
+  puts(text_data["title"])
+end
+```
+
+Which meta-data is available depends on the decomposer. See each
+decomposer's documentation for details.
+
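
The point of doc/text/library.md is that one extractor can serve many inputs. A minimal sketch of that pattern follows. `extract_all` is a hypothetical helper, and `FakeExtractor` and `FakeTextData` are stand-ins so that the sketch runs without ChupaText or network access; with the real library you would pass the `extractor` built in the examples instead:

```ruby
# Hypothetical helper: runs every input through one shared extractor and
# collects the body and "title" meta-data of each extracted text.
def extract_all(extractor, inputs)
  inputs.each_with_object({}) do |input, results|
    results[input] = []
    extractor.extract(input) do |text_data|
      results[input] << {body: text_data.body, title: text_data["title"]}
    end
  end
end

# Stand-in for extracted text data: a body plus meta-data read with [].
class FakeTextData
  attr_reader :body

  def initialize(body, meta_data)
    @body = body
    @meta_data = meta_data
  end

  def [](name)
    @meta_data[name]
  end
end

# Stand-in for the extractor: yields one fake text data per input.
class FakeExtractor
  def extract(input)
    yield(FakeTextData.new("body of #{input}",
                           {"title" => "title of #{input}"}))
  end
end

results = extract_all(FakeExtractor.new,
                      ["http://ranguba.org/", "http://ranguba.org/ja/"])
results.each do |input, texts|
  texts.each { |text| puts("#{input}: #{text[:title]}") }
end
```

Because the extractor is created once and passed in, the initialization cost is paid once however many inputs are processed.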

  Added: doc/text/news.md (+5 -0) 100644
===================================================================
--- /dev/null
+++ doc/text/news.md    2014-01-05 00:05:35 +0900 (1bfecab)
@@ -0,0 +1,5 @@
+# News
+
+## 1.0.0: 2014-01-05
+
+The first release!!!