Kouhei Sutou
null+****@clear*****
Sun Jan 5 00:05:35 JST 2014
Kouhei Sutou 2014-01-05 00:05:35 +0900 (Sun, 05 Jan 2014) New Revision: c8734ff15ac87dd0d2d33e84325ed2f6e0e34fd9 https://github.com/ranguba/chupa-text/commit/c8734ff15ac87dd0d2d33e84325ed2f6e0e34fd9 Message: Split documentation Added files: doc/text/command-line.md doc/text/decomposer.md doc/text/library.md doc/text/news.md Modified files: README.md Modified: README.md (+11 -208) =================================================================== --- README.md 2014-01-05 00:05:16 +0900 (eeb1290) +++ README.md 2014-01-05 00:05:35 +0900 (c29da8d) @@ -46,7 +46,7 @@ decomposed data will be text data. Decomposer module is a plugin. You can add supported data types by installing decomposer modules. Or you can create your custom decomposer. Decomposer is a simple Ruby object. So it is easy to -write. It is described later. +create. It is described later. ## Install @@ -65,216 +65,19 @@ chupa-text 1.0.0 ## How to use -You can use ChupaText as command line tool or Ruby library. +You can use ChupaText as command line tool or Ruby library. See the +following documentations for details: -### How to use as command line tool + * [doc/text/command-line.md](http://rubydoc.info/gems/chupa-text/file/doc/text/command-line.md) + describes how to use ChupaText as command line tool. + * [doc/text/library.md](http://rubydoc.info/gems/chupa-text/file/doc/text/library.md) + describes how to use ChupaText as a Ruby library. -You can extract text and meta-data from an input by `chupa-text` -command. `chupa-text` prints extracted text and meta-data as JSON. +## How to create a decomposer -#### Input - -`chupa-text` command accept a local file path or a URI. 
- -Here is a local file path example: - -``` -% chupa-text hello.txt.gz -``` - -Here is an URI example: - -``` -% chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz -``` - -#### Output - -`chupa-text` command prints the extracted result as JSON: - -``` -% chupa-text hello.txt.gz -{ - "mime-type": "application/x-gzip", - "uri": "hello.txt.gz", - "size": 36, - "texts": [ - { - "mime-type": "text/plain", - "uri": "hello.txt", - "size": 6, - "body": "Hello\n" - } - ] -} -``` - -JSON uses the following data structure: - -```txt -{ - "mime-type": "<MIME type of the input>", - "uri": "<URI or path of the input>", - "size": <Byte size of the input data>, - "other-meta-data1": <Other meta-data value1>, - "other-meta-data2": <Other meta-data value2>, - "...": <...>, - "texts": [ - { - "mime-type": "<MIME type of the extracted data1>", - "uri": "<URI or path of the extracted data1>", - "size": "<Byte size of the text of the extracted data1>", - "body": "<The text of the extracted data1>", - "other-meta-data1": <Other meta-data value1 of the extracted data1>, - "other-meta-data2": <Other meta-data value2 of the extracted data1>, - "...": <...> - }, - { - <The information of the extracted data2> - }, - { - <The information of the extracted data3> - }, - <...> - ] -} -``` - -You can find extracted texts in `texts[0].body`, `texts[1].body` and -so on. You may extract one or more texts from one input because -ChupaText supports archive file such as `tar`. - -#### Command line options - -You can custom `chupa-text` command behavior. Here are command line -options: - -`--configuration=FILE` - -It reads configuration from `FILE`. See the next section for -configuration file details. - -ChupaText provides the default configuration file. It has suitable -configurations. Normally, you don't need to use your custom -configuration file. - -`--help` - -It shows available command line options and exits. 
- -#### Configuration - -ChupaText configuration file is a Ruby script but it is easy to read -and write ChupaText configuration file for users who don't know about -Ruby. - -The basic syntax is the following: - -``` -category.name = value -``` - -Here is an example that sets `["tar", "gzip"]` as `value` to `names` -name variable in `decomposer` category: - -``` -decomposer.names = ["tar", "gzip"] -``` - -Here are configuration parameters: - -`decomposer.names = ["<decomposer name1>", "<decomposer name2>, "..."]` - -It specifies an array of decomposer name to be used in `chupa-text` -command. You can use glob pattern for decomposer name such as -`"*zip"`. `"*zip"` matches `"zip"`, `"gzip"` and so on. - -The default is `["*"]`. It means that all installed decomposers are -used. - -`mime_type["<extension>"] = "<MIME type>"` - -It specifies a map to a MIME type from path extension. - -Here is an example that maps `"html"` to `"text/html"`: - -``` -mime_type["html"] = "text/html" -``` - -Th default configuration file registers popular MIME types. - -### How to use as Ruby library - -You can use ChupaText as a Ruby library. If you want to extract text -data from many input data, `chupa-text` command may be -inefficient. You need to execute `chupa-text` command to process one -input file. You need to execute `chupa-text` command N times to -process N input files. It means that you need to initializes ChupaText -N times. It may be inefficient. - -You can reduce initializations of ChupaText by using ChupaText as a -Ruby library. 
- -Here is a simple usage: - -``` -require "chupa-text" -gem "chupa-text-decomposer-html" - -ChupaText::Decomposers.load - -extractor = ChupaText::Extractor.new -extractor.apply_configuration(ChupaText::Configuration.default) - -extractor.extract("http://ranguba.org/") do |text_data| - puts(text_data.body) -end -extractor.extract("http://ranguba.org/ja/") do |text_data| - puts(text_data.body) -end -``` - -It is better that you use Bundler to manager decomposer plugins: - -``` -# Gemfile -source "https://rubygems.org" - -gem "chupa-text-decomposer-html" -gem "chupa-text-decomposer-XXX" -# ... -``` - -Here is a usage that uses the Gemfile: - -``` -require "bundler/setup" - -ChupaText::Decomposers.load - -extractor = ChupaText::Extractor.new -extractor.apply_configuration(ChupaText::Configuration.default) - -extractor.extract("http://ranguba.org/") do |text_data| - puts(text_data.body) -end -extractor.extract("http://ranguba.org/ja/") do |text_data| - puts(text_data.body) -end -``` - -Use {ChupaText::Data#[]} to get meta-data from extracted text -data. For example, you can get title from input HTML: - -``` -extractor.extract("http://ranguba.org/") do |text_data| - puts(text_data["title"]) -end -``` - -It is depended on decomposer that what meta-data can be got. See -decomposer's documentation to know about it. +See +[doc/text/decomposer.md](http://rubydoc.info/gems/chupa-text/file/doc/text/decomposer.md) +how to write a decomposer. ## Author Added: doc/text/command-line.md (+136 -0) 100644 =================================================================== --- /dev/null +++ doc/text/command-line.md 2014-01-05 00:05:35 +0900 (68a256d) @@ -0,0 +1,136 @@ +# How to use ChupaText as command line tool + +You can extract text and meta-data from an input by `chupa-text` +command. `chupa-text` prints extracted text and meta-data as JSON. + +## Input + +`chupa-text` command accept a local file path or a URI. 
+ +Here is a local file path example: + +``` +% chupa-text hello.txt.gz +``` + +Here is an URI example: + +``` +% chupa-text https://github.com/ranguba/chupa-text/raw/master/test/fixture/gzip/hello.txt.gz +``` + +## Output + +`chupa-text` command prints the extracted result as JSON: + +``` +% chupa-text hello.txt.gz +{ + "mime-type": "application/x-gzip", + "uri": "hello.txt.gz", + "size": 36, + "texts": [ + { + "mime-type": "text/plain", + "uri": "hello.txt", + "size": 6, + "body": "Hello\n" + } + ] +} +``` + +JSON uses the following data structure: + +```txt +{ + "mime-type": "<MIME type of the input>", + "uri": "<URI or path of the input>", + "size": <Byte size of the input data>, + "other-meta-data1": <Other meta-data value1>, + "other-meta-data2": <Other meta-data value2>, + "...": <...>, + "texts": [ + { + "mime-type": "<MIME type of the extracted data1>", + "uri": "<URI or path of the extracted data1>", + "size": "<Byte size of the text of the extracted data1>", + "body": "<The text of the extracted data1>", + "other-meta-data1": <Other meta-data value1 of the extracted data1>, + "other-meta-data2": <Other meta-data value2 of the extracted data1>, + "...": <...> + }, + { + <The information of the extracted data2> + }, + { + <The information of the extracted data3> + }, + <...> + ] +} +``` + +You can find extracted texts in `texts[0].body`, `texts[1].body` and +so on. You may extract one or more texts from one input because +ChupaText supports archive file such as `tar`. + +## Command line options + +You can custom `chupa-text` command behavior. Here are command line +options: + +`--configuration=FILE` + +It reads configuration from `FILE`. See the next section for +configuration file details. + +ChupaText provides the default configuration file. It has suitable +configurations. Normally, you don't need to use your custom +configuration file. + +`--help` + +It shows available command line options and exits. 
+ +## Configuration + +ChupaText configuration file is a Ruby script but it is easy to read +and write ChupaText configuration file for users who don't know about +Ruby. + +The basic syntax is the following: + +``` +category.name = value +``` + +Here is an example that sets `["tar", "gzip"]` as `value` to `names` +name variable in `decomposer` category: + +``` +decomposer.names = ["tar", "gzip"] +``` + +Here are configuration parameters: + +`decomposer.names = ["<decomposer name1>", "<decomposer name2>, "..."]` + +It specifies an array of decomposer name to be used in `chupa-text` +command. You can use glob pattern for decomposer name such as +`"*zip"`. `"*zip"` matches `"zip"`, `"gzip"` and so on. + +The default is `["*"]`. It means that all installed decomposers are +used. + +`mime_type["<extension>"] = "<MIME type>"` + +It specifies a map to a MIME type from path extension. + +Here is an example that maps `"html"` to `"text/html"`: + +``` +mime_type["html"] = "text/html" +``` + +Th default configuration file registers popular MIME types. Added: doc/text/decomposer.md (+343 -0) 100644 =================================================================== --- /dev/null +++ doc/text/decomposer.md 2014-01-05 00:05:35 +0900 (d7142ba) @@ -0,0 +1,343 @@ +# How to create a decomposer + +You can extend ChupaText by Ruby. You can add supported input type by +writing a decomposer module. + +## Overview + +Decomposer is a Ruby class. It needs the following two API: + + * `target?` + * `decompose` + +Both of them accept only one argument `data`. `data` is an input +data. + +First, ChupaText calls `target?` method of your decomposer. If your +decomposer can decompose the input data, your `target?` method should +return `true`. + +If your decomposer's `target?` method returns `true`, ChupaText calls +`decomposer` method of your decomposer. 
Your decomposer needs to +decomposer the input data and `yield` extracted text data or other +format data that will be decomposed by other decomposers. Your +decomposer can `yield` multiple times. + +If your decomposer decomposes an archive file such as tar and zip +archives, your `decompose` method will `yield` other format data. If +your decomposer extracts text and meta-data from an input such as +HTML, your `decompose` method will `yield` text data. + +## Example + +Let's create a simple XML decomposer as an example. It extracts text +data from input XML. + +For example, here is an input XML: + +```xml +<root> + Hello <em>&</em> World! +</root> +``` + +The XML decomposer extracts the following text: + +```text +Hello & World! +``` + +ChupaText provides `chupa-text-genearte-decomposer` command. It +generates skeleton code for a new decomposer. Let's use it. + +`chupa-text-genearte-decomposer` accepts required information by +command line options or reading from standard input. You can confirm +the required information by `--help` option: + +```text +% chupa-text-generate-decomposer --help +Usage: chupa-text-generate-decomposer [options] + --name=NAME Decomposer name + (e.g.: html) + --extensions=EXTENSION1,EXTENSION2,... + Target file extensions + (e.g.: htm,html,xhtml) + --mime-types=TYPE1,TYPE2,... Target MIME types + (e.g.: text/html,application/xhtml+xml) + --author=AUTHOR Author + (e.g.: 'Your Name') + (default: Kouhei Sutou) + --email=EMAIL Author E-mail + (e.g.: your @ email.address) + (default: kou @ clear-code.com) + --license=LICENSE License + (e.g.: MIT) + (default: LGPLv2.1 or later) +``` + +Some pieces of information have the default values. In the above case, +`--author`, `--email` and `-license` have the default values.
+ +XML decomposer uses the following information: + + * `--name`: `xml` + * `--extensions`: `xml` + * `--mime-types`: `text/xml` + +Run with the above information: + +```text +% chupa-text-generate-decomposer --name xml --extensions xml --mime-types text/xml +Creating directory: chupa-text-decomposer-xml +Creating file: chupa-text-decomposer-xml/chupa-text-decomposer-xml.gemspec +Creating file: chupa-text-decomposer-xml/Gemfile +Creating file: chupa-text-decomposer-xml/Rakefile +Creating file: chupa-text-decomposer-xml/LICENSE.txt +Creating directory: chupa-text-decomposer-xml/lib/chupa-text/decomposers +Creating file: chupa-text-decomposer-xml/lib/chupa-text/decomposers/xml.rb +Creating directory: chupa-text-decomposer-xml/test +Creating file: chupa-text-decomposer-xml/test/test-xml.rb +Creating file: chupa-text-decomposer-xml/test/helper.rb +Creating file: chupa-text-decomposer-xml/test/run-test.rb +``` + +`chupa-text-generate-decomposer` generates a directory that is named +as `chupa-text-decomposer-#{name}/`. + +Look `lib/chupa-text/decomposers/xml.rb`: + +``` +module ChupaText + module Decomposers + class Xml < Decomposer + def target?(data) + ["xml"].include?(data.extension) or + ["text/xml"].include?(data.mime_type) + end + + def decompose(data) + raise NotImplementedError, "#{self.class}##{__method__} isn't implemented yet." + text = "IMPLEMENTED ME" + text_data = TextData.new(text) + yield(text_data) + end + end + end +end +``` + +The generated code implements `target?` method but doesn't implemented +`decompose` method completely. Let's implement `decompose` method: + +``` +require "cgi" + +# ... + def decompose(data) + text = CGI.unescapeHTML(untag(data.body).strip) + text_data = TextData.new(text) + yield(text_data) + end + + private + def untag(xml) + xml.gsub(/<.+?>/m, "") + end +# ... +``` + +`chupa-text-generate-decomposer` also generates a test. Run the test: + +``` +% bundle install +% rake +/usr/bin/ruby2.0 test/run-test.rb +Loaded suite . 
+Started +F +=============================================================================== +Failure: +test_body(decompose) +/tmp/chupa-text-decomposer-xml/test/test-xml.rb:24:in `test_body' + 21: def test_body + 22: input_body = "TODO (input)" + 23: expected_text = "TODO (extracted)" + => 24: assert_equal([expected_text], + 25: decompose(input_body).collect(&:body)) + 26: end + 27: end +<["TODO (extracted)"]> expected but was +<["TODO (input)"]> + +diff: +? ["TODO (ex tracted)"] +? inpu +=============================================================================== + + +Finished in 0.013355116 seconds. + +1 tests, 1 assertions, 1 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications +0% passed + +74.88 tests/s, 74.88 assertions/s +rake aborted! +Command failed with status (1): [/usr/bin/ruby2.0 test/run-test.rb...] +/tmp/chupa-text-decomposer-xml/Rakefile:9:in `block in <top (required)>' +``` + +The generated test fails because the test has place holders. Look the +generated test: + +``` +class TestXml < Test::Unit::TestCase + include Helper + + def setup + @decomposer = ChupaText::Decomposers::Xml.new({}) + end + + sub_test_case("decompose") do + def decompose(input_body) + data = ChupaText::Data.new + data.mime_type = "text/xml" + data.body = input_body + + decomposed = [] + @decomposer.decompose(data) do |decomposed_data| + decomposed << decomposed_data + end + decomposed + end + + def test_body + input_body = "TODO (input)" + expected_text = "TODO (extracted)" + assert_equal([expected_text], + decompose(input_body).collect(&:body)) + end + end +end +``` + +`test_body` has TODO codes as place holder: + +``` +# ... + def test_body + input_body = "TODO (input)" + expected_text = "TODO (extracted)" + assert_equal([expected_text], + decompose(input_body).collect(&:body)) + end +# ... +``` + +Fill the TODO by test XML and expected result: + +``` +# ... + def test_body + input_body = <<-XML +<root> + Hello <em>&</em> World! 
+</root> + XML + expected_text = "Hello & World!" + assert_equal([expected_text], + decompose(input_body).collect(&:body)) + end +# ... +``` + +Run test again: + +``` +% rake +/usr/bin/ruby2.0 test/run-test.rb +Loaded suite . +Started +. + +Finished in 0.000915172 seconds. + +1 tests, 1 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications +100% passed + +1092.69 tests/s, 1092.69 assertions/s +``` + +The test is passed! + +You can release the generator by the following command. It requires an +account on https://rubygems.org/. + +``` +% rake release +``` + +Can you understand how to create a new decomposer? + +## API reference + +### `data` + +Both of `target?` and `decompose` receives an argument `data`. It is a +{ChupaText::Data} instance or an instance of its sub class. You need +to see the API reference manual just for {ChupaText::Data}. You don't +use sub class specific API. It is not portable. + +### `target?` + +`target?` should return `true` or `false`. The decomposer should +return `true` if the decomposer can decompose received `data`, `false` +otherwise. + +### `decompose` + +`decompose` decomposes input `data` and `yield` extracted text data or +decomposed other type data. `decompose` can `yield` zero or more +times. + +Here is a template code to `yield` extracted text data: + +``` +def decompose(data) + text = extract_text(data) + text_data = ChupaText::TextData.new(text) + # text_data["meta-data1"] = meta_data_value1 + # text_data["meta-data2"] = meta_data_value2 + # ... + yield(text_data) +end +``` + +See +[lib/chupa-text/decomposers/csv.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/csv.rb) +as an example of extracting text data. + +Here is a template code to `yield` other type data: + +``` +def decompose(data) + entries = decompose_archive(data) + entries.each do |entry| + path = entry.path + if entry.respond_to?(:read) + # The input must have "read" method. 
+ input = entry + else + # If the entry doesn't have "read" method, wrap String data + # by StringIO. + input = StringIO.new(entry.data) + end + decomposed_data = ChupaText::VirtualFileData.new(path, input) + decomposed_data.source = data + yield(decomposed_data) + end +end +``` + +See +[lib/chupa-text/decomposers/tar.rb](https://github.com/ranguba/chupa-text/blob/master/lib/chupa-text/decomposers/tar.rb) +as an example of decomposing to other type data. Added: doc/text/library.md (+72 -0) 100644 =================================================================== --- /dev/null +++ doc/text/library.md 2014-01-05 00:05:35 +0900 (b7ca421) @@ -0,0 +1,72 @@ +# Hot to use ChupaText as Ruby library + +You can use ChupaText as Ruby library. If you want to extract text +data from many input data, `chupa-text` command may be +inefficient. You need to execute `chupa-text` command to process one +input file. You need to execute `chupa-text` command N times to +process N input files. It means that you need to initializes ChupaText +N times. It may be inefficient. + +You can reduce initializations of ChupaText by using ChupaText as Ruby +library. + +Here is a simple usage: + +``` +require "chupa-text" +gem "chupa-text-decomposer-html" + +ChupaText::Decomposers.load + +extractor = ChupaText::Extractor.new +extractor.apply_configuration(ChupaText::Configuration.default) + +extractor.extract("http://ranguba.org/") do |text_data| + puts(text_data.body) +end +extractor.extract("http://ranguba.org/ja/") do |text_data| + puts(text_data.body) +end +``` + +It is better that you use Bundler to manager decomposer plugins: + +``` +# Gemfile +source "https://rubygems.org" + +gem "chupa-text-decomposer-html" +gem "chupa-text-decomposer-XXX" +# ... 
+``` + +Here is a usage that uses the Gemfile: + +``` +require "bundler/setup" + +ChupaText::Decomposers.load + +extractor = ChupaText::Extractor.new +extractor.apply_configuration(ChupaText::Configuration.default) + +extractor.extract("http://ranguba.org/") do |text_data| + puts(text_data.body) +end +extractor.extract("http://ranguba.org/ja/") do |text_data| + puts(text_data.body) +end +``` + +Use {ChupaText::Data#[]} to get meta-data from extracted text +data. For example, you can get title from input HTML: + +``` +extractor.extract("http://ranguba.org/") do |text_data| + puts(text_data["title"]) +end +``` + +It is depended on decomposer that what meta-data can be got. See +decomposer's documentation to know about it. + Added: doc/text/news.md (+5 -0) 100644 =================================================================== --- /dev/null +++ doc/text/news.md 2014-01-05 00:05:35 +0900 (1bfecab) @@ -0,0 +1,5 @@ +# News + +## 1.0.0: 2014-01-05 + +The first release!!!