tsukurimashou: Commit


Commit MetaInfo

Revision: 295
Time: 2012-07-21 13:37:59
Author: mskala

Log Message

more CHISE IDS stuff for IDSgrep

Incremental Difference

--- trunk/idsgrep/idsgrep.tex (revision 294)
+++ trunk/idsgrep/idsgrep.tex (revision 295)
@@ -285,27 +285,34 @@
 
 A minimal default build and install could run something like this:
 \begin{verbatim}
-tar -xzvf idsgrep-0.2.tar.gz
-cd idsgrep-0.2
+tar -xzvf idsgrep-0.3.tar.gz
+cd idsgrep-0.3
 ./configure
 make
 su -c 'make install'
 \end{verbatim}
 
-IDSgrep as such does not include a dictionary, but it can build dictionaries
+IDSgrep can build dictionaries
 from the Tsukurimashou font package, which is available through the same
-Sourceforge.JP project as IDSgrep, from the KanjiVG database available at
-\url{http://kanjivg.tagaini.net/}~\cite{KanjiVG}, or (only if KanjiVG is
-also available) from the EDICT2 database available at
+Sourceforge.JP project as IDSgrep; from the KanjiVG database available at
+\url{http://kanjivg.tagaini.net/}~\cite{KanjiVG}; from the CHISE IDS
+database available at
+\url{http://chise.zinbun.kyoto-u.ac.jp/dist/ids/}~\cite{CHISE};
+or from the EDICT2 database available at
 \url{http://www.csse.monash.edu.au/~jwb/edict.html}~\cite{EDICT2}. For an
 ideal complete installation of IDSgrep, one would download all those
 packages, build Tsukurimashou first, and make it and the dictionaries
-available to the IDSgrep \texttt{configure} script. The \texttt{configure}
+available to the IDSgrep \texttt{configure} script.
+A precompiled version of the CHISE IDS-derived dictionary is bundled
+in the IDSgrep distribution tarball, so that one should be available (though
+not necessarily up-to-date) without any dependencies.
+
+The \texttt{configure}
 script will by default make a reasonable effort to find the dependencies; in
 many common cases it is not necessary to specify them on the command line.
 Here is a more complete installation process relying on \texttt{configure}
-to find KanjiVG and EDICT2 in the current directory and Tsukurimashou in a
-sibling directory:
+to find Tsukurimashou in a
+sibling directory and the others in the current directory:
 \begin{verbatim}
 unzip tsukurimashou-0.6.zip
 cd tsukurimashou-0.6
@@ -317,6 +324,7 @@
 cd idsgrep-0.2
 ln -s /some/where/else/kanjivg-20120219.xml.gz .
 ln -s /some/where/else/edict2.gz .
+ln -s /some/where/else/chise-ids-0.25 .
 ./configure
 make
 make check
@@ -323,16 +331,34 @@
 su -c 'make install'
 \end{verbatim}
 
+It is necessary to at least configure Tsukurimashou, if not fully build it,
+before building IDSgrep. The IDSgrep build will then invoke the
+Tsukurimashou build to create just the files needed by IDSgrep. It is not
+necessary to configure or build CHISE IDS (which would require first
+installing other parts of the larger CHISE system and probably XEmacs as
+well); IDSgrep only needs to look at the CHISE IDS data files.
+
 If the default search fails, the filenames of KanjiVG (\texttt{.xml} or
-\texttt{.xml.gz}), EDICT2 (\texttt{.gz}), and the top directory of
-Tsukurimashou can be specified on the \texttt{configure} command line with
+\texttt{.xml.gz}), EDICT2 (\texttt{.gz}), and the directories containing
+extracted distributions of Tsukurimashou and CHISE IDS can be
+specified on the \texttt{configure} command line with
 the \texttt{--with-kanjivg}, \texttt{--with-edict2},
-and \texttt{--with-tsuku-build} options. For
+\texttt{--with-tsuku-build}, and \texttt{--with-chise-ids} options. For
 other options, run \texttt{configure --help}. It's a reasonably standard
 GNU Autotools~\cite{Autotools} configuration script, albeit with a lot of
 options for inapplicable installation directories removed to simplify the
 help message.
 
+The EDICT2-based dictionary should preferably include
+character decompositions from some other dictionary; which one is
+selectable by the \texttt{--enable-edict-decomp} option. Allowed values
+include \texttt{chise}, \texttt{kanjivg}, \texttt{tsuku}, and \texttt{no};
+the default of \texttt{auto} will try all of those in that order and use the
+first that works. The value \texttt{no} corresponds to simply mapping every
+character to itself without further decomposition; that is obviously not as
+informative as might be desired, but it will still allow for regular
+expression searches.
+
 The ``\texttt{check}'' Makefile target runs the IDSgrep test suite. Some
 tests require the dictionary files and will be skipped if those are not
 present. There is also a test that will use Valgrind~\cite{Valgrind} if
@@ -343,10 +369,11 @@
 enable meta-testing of the test suite's coverage. This feature requires
 that the Gcov coverage analyser be installed. To do a coverage analysis,
 run \texttt{configure} with \texttt{--enable-gcov} and any other desired
-options, then do \texttt{make clean} (necessary to be sure all object
-files are rebuilt with the coverage instrumentation) followed by
-\texttt{make check}. Most people would not want to install an
-IDSgrep binary built under this option.
+options, then do \texttt{make clean} (necessary to be sure all object files
+are rebuilt with the coverage instrumentation) followed by \texttt{make
+check}. Full coverage can only be attained if the dictionary files are
+installed (not just built). Most people would not want to install the
+IDSgrep binary itself when built under this option.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
@@ -406,17 +433,21 @@
 current version, there is some value in being able to do sub-character
 structural searches on multi-character words.
 
-If both EDICT2 and KanjiVG are available to the IDSgrep build system, it
-will invoke the \texttt{ed22eids} script and generate and install a
+If another dictionary besides EDICT2 is available
+(subject to configuration by \texttt{--enable-edict-decomp}), then the
+build system will generate and install a
 dictionary file called \texttt{edict.eids}, which represents a database
-join of the two dictionaries. A sample entry might look like this:
+join of EDICT2 with the other dictionary. With no other dictionary,
+the file can still be generated but will contain no character decomposition
+information.
+A sample entry might look like this:
 \begin{verbatim}
 【明】,<明>⿰日月⦅[みん] (n) Ming (dynasty of China)⦆
 \end{verbatim}
 
 The head for the entire entry is the head from the EDICT2 entry. Then the
-tree is a binary tree with a comma as the functor and the first child
-being the entire \texttt{kanjivg.eids} entry for the first character. The
+tree is a binary tree with a comma as the functor and the first child being
+the entire decomposition dictionary entry for the first character. The
 second child represents the rest of the entry. With a two-character or
 longer head, this child would also be a binary comma with the second
 character of the entry head as its first child. In this way the characters
@@ -433,8 +464,8 @@
 commas can be imagined as separators (though their actual function is quite
 different) with a comma at the start and a question mark at the end, as in
 \texttt{idsgrep -ded ',教,育?'}. These queries can be combined with the
-sub-character breakdown queries already supported by the KanjiVG-based
-dictionary. For instance, \texttt{idsgrep -ded ',教,...|日月!,??'} will
+sub-character breakdown queries already supported by the decomposition
+dictionaries. For instance, \texttt{idsgrep -ded ',教,...|日月!,??'} will
 search for, and give definitions of, words of exactly two characters in
 which the first is \texttt{教} and the second character contains \texttt{日}
 or \texttt{月} anywhere. The restriction to exactly two characters is
@@ -441,9 +472,14 @@
 accomplished by the sub-query ``\texttt{!,??}'', which fails to match on the
 binary comma that would be present at that point in a longer word.
 
-Since both EDICT2 and KanjiVG are under the Creative Commons
-Attribution--ShareAlike license, that license presumably also applies to the
-combined dictionary made from them.
+EDICT2 is under the Creative Commons Attribution--ShareAlike license. Since
+KanjiVG is as well, that license would presumably also apply to a combined
+dictionary made from EDICT2 and KanjiVG. An EDICT2-only dictionary with no
+decompositions from other sources should similarly be under Creative Commons
+Attribution--ShareAlike. It might not be legal to distribute outside one's
+own organization a dictionary formed by joining EDICT2 with CHISE IDS or
+Tsukurimashou, because those sources are covered by versions of the GNU GPL,
+which is not compatible with the Creative Commons license.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
@@ -498,10 +534,9 @@
 
 In order for IDSgrep to work together with Tsukurimashou, it is necessary
 that the Tsukurimashou build be one that supports the \texttt{make eids}
-target in the first place. No released version contains such support yet,
-but it is planned for Tsukurimashou~0.6. Development versions of
-Tsukurimashou in the SVN repository have included EIDS support since early
-January 2012.
+target in the first place. Packaged versions of Tsukurimashou from 0.6
+onward include EIDS support, and development versions of Tsukurimashou in
+the SVN repository have included EIDS support since early January 2012.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
@@ -566,7 +601,7 @@
 that creates a unary node senior to the entire tree, so that the output
 remains in valid EIDS format, except in the case of filenames containing
 colons, which will be handled via backslash escapes in the future when those
-are fully implemented.
+are fully implemented for output.
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
@@ -632,9 +667,9 @@
 \noindent
 
 This section is intended to describe IDSgrep's syntax and matching procedure
-in complete detail; and those things are, in turn, designed to be powerful
-rather than easy. As a result, the description may be confusing for some
-users. See the examples in the ``Quick start'' section for a more
+in complete precise detail; and those things are, in turn, designed to be
+powerful rather than easy. As a result, the description may be confusing
+for some users. See the examples in the ``Quick start'' section for a more
 accessible introduction to how to use the utility.
 
 The system is best understood in terms of three interconnected major
@@ -1003,7 +1038,7 @@
 
 The value of $\textit{match}'(\texttt{...}x,y)$ is true if and only if there
 exists any subtree of $y$ (including the entirety of $y$) for which
-$\textit{match}'(x,y)$ is true. In other words, this will look for an
+$\textit{match}(x,y)$ is true. In other words, this will look for an
 instance of $x$ anywhere inside $y$ regardless of nesting level. Mnemonic:
 three dots suggest omitting a variable-length sequence, in this case the
 variable-length chain of ancestors above $x$.
@@ -1144,8 +1179,6 @@
 ``a'' for ``associative.''
 The verbose ASCII name for ``\texttt{.@.}'' is ``\texttt{.assoc.}.''
 
-This feature is not yet implemented in version 0.2.
-
 \subsection{Regular expression matching}
 
 It is planned that some future version (likely version 0.3) will support
--- trunk/idsgrep/idsgrep.1.in (revision 294)
+++ trunk/idsgrep/idsgrep.1.in (revision 295)
@@ -231,8 +231,46 @@
 The first opens the string, the second is the literal first character, and
 the third closes the string.
 .IP \(bu 4
-It is likely that some backslash escapes will be implemented in the future.
+Backslash can be used for escape sequences.
+The sequences \ea, \eb, \ee, \ef, \et, \en, and \er have the same meanings
+as in the C programming language under Unix, with \en always corresponding
+to <U+000A>, a single line feed, not a localized end-of-line sequence which
+might be different from that.
+The \ec sequence followed by a case-insensitive ASCII Latin letter (only)
+corresponds to an ASCII control character, equivalent to typing Ctrl plus
+that letter on a standard ASCII keyboard.
+There are also three hexadecimal escapes:
+.RI \ex HH
+(where
+.I HH
+is two hexadecimal digits);
+.RI \eX HHHH
+(where
+.I HHHH
+is four hexadecimal digits);
+and
+.RI \eX{ Hx }
+(where
+.I Hx
+is a variable-length sequence of hexadecimal digits).
+These allow entering arbitrary code points.
+In all backslash-letter escape sequences, the letter after the backslash is
+case-sensitive but the parameter, if applicable, is not.
+Backslash sequences may not be nested; parameters must be given as literal
+ASCII.
 .IP \(bu 4
+Backslash, if not used with a letter to form one of the sequences above,
+causes the next character (which could even be a second backslash) to be
+taken literally and lose any special meaning.
+Inside a bracketed string, it can be used for instance to escape a closing
+bracket that would otherwise end the string.
+Outside a bracketed string, a backslash-escaped character will always be
+taken as a head with a syrupy semicolon (as described below) instead of
+having sugary, opening-bracket, or skipped-whitespace behaviour.
+That also applies to characters created by backslash-letter sequences, for
+instance with \eX: after decoding the escape sequence, the result is always
+literal inside a bracketed string and syrupy outside a bracketed string.
+.IP \(bu 4
 ASCII control characters and whitespace characters, <U+0000> through
 <U+0020> (notably including <U+0000>), are ignored outside bracketed
 strings and taken
@@ -372,16 +410,44 @@
 .B configure
 may possibly have installed the files elsewhere.
 .PP
-.I @flat_dictdir@/tsukurimashou.eids
+.I @flat_dictdir@/chise.eids
 .RS
-Japanese kanji decompositions from the Tsukurimashou font project.
-These are relatively clean in terms of accurately reflecting the visual
-construction of each character, but they only cover the glyphs included in
-the fonts, and they are based on the visual appearance of the glyphs
-(and, specifically, their appearance
-.IR "in the Tsukurimashou fonts" )
-rather than traditional etymology.
+Character decompositions from the CHISE EIDS database.
+Coverage of approximately 130000 Han-script characters, spanning multiple
+languages.
+Non-Unicode characters are expressed using symbolic names apparently invented
+by the CHISE project, or possibly by the affiliated UTF-2000 initiative.
+This dictionary is generally of high quality because the original source
+provides it more or less in IDS format already (actually an extended
+IDS format of their own, distinct from IDSgrep's extended IDS); as a result
+there is very little guesswork involved in the conversion to IDSgrep's EIDS.
+Its broad coverage is hard to beat.
+However, about 6% of the entries in the CHISE IDS database are
+syntactically invalid, and are therefore excluded from IDSgrep's
+converted dictionary.
+What that implies about the quality of the remaining entries is an
+open question.
+Because this dictionary has GPL 2+ licensing, it can be bundled with the
+IDSgrep source package (the majority of which is GPL 3).
 .RE
+.I @flat_dictdir@/edict.eids
+.RS
+Japanese-English dictionary of words and their meanings, based on
+EDICT2 with character decompositions from one of the other dictionaries
+selected at build time, most likely CHISE IDS.
+This allows searching for multi-character words using partial descriptions
+of the individual characters.
+The EDICT2 entries are translated to EIDS format such that the EDICT2 head
+(typically the word being defined) is the head of the EIDS; below that there
+follows a chain of binary comma nodes each having a character of the word as
+its left child and the rest of the chain as the right child.
+These left children are decomposed according to whichever dictionary was used.
+The final right child is a nullary node containing the remainder of the
+EDICT2 entry, including pronunciation, part of speech, definition, and any
+other tags.
+For example, an entry that defined \(lqXYZ\(rq as \(lqdefinition\(rq might
+be encoded as \(lq<XYZ>,X,Y,Z(definition)\(rq.
+.RE
 .I @flat_dictdir@/kanjivg.eids
 .RS
 Japanese kanji decompositions from the KanjiVG database.
@@ -395,22 +461,15 @@
 As a result, decompositions in this database may be incomplete,
 idiosyncratic, or even flat-out wrong.
 .RE
-.I @flat_dictdir@/edict.eids
+.I @flat_dictdir@/tsukurimashou.eids
 .RS
-Japanese-English dictionary of words and their meanings, based on
-EDICT2, with character decompositions from KanjiVG.
-This allows searching for multi-character words using partial descriptions
-of the individual characters.
-The EDICT2 entries are translated to EIDS format such that the EDICT2 head
-(typically the word being defined) is the head of the EIDS; below that there
-follows a chain of binary comma nodes each having a character of the word as
-its left child and the rest of the chain as the right child.
-These left children are decomposed according to the KanjiVG database.
-The final right child is a nullary node containing the remainder of the
-EDICT2 entry, including pronunciation, part of speech, definition, and any
-other tags.
-For example, an entry that defined \(lqXYZ\(rq as \(lqdefinition\(rq might
-be encoded as \(lq<XYZ>,X,Y,Z(definition)\(rq.
+Japanese kanji decompositions from the Tsukurimashou font project.
+These are relatively clean in terms of accurately reflecting the visual
+construction of each character, but they only cover the glyphs included in
+the fonts, and they are based on the visual appearance of the glyphs
+(and, specifically, their appearance
+.IR "in the Tsukurimashou fonts" )
+rather than traditional etymology.
 .
 .SH ENVIRONMENT
 .IP IDSGREP_DICTDIR
@@ -458,6 +517,8 @@
 The input files for those are subject to the Creative Commons
 Attribution-Share Alike 3.0 Licence, and their authors might make a
 copyright claim on the resulting dictionaries.
+The input files for the CHISE IDS dictionary are subject to GPL 2 or any
+later version.
 The Tsukurimashou project also builds an EIDS-format dictionary for
 use with IDSgrep, but happens to use the same copyright and GPL 3 licensing
 terms as IDSgrep anyway.
--- trunk/idsgrep/configure.ac (revision 294)
+++ trunk/idsgrep/configure.ac (revision 295)
@@ -334,6 +334,32 @@
 AM_CONDITIONAL([COND_TSUKU_BUILD], [test '!' "$with_tsuku_build" = no])
 AC_SUBST([with_tsuku_build])
 #
+AC_ARG_ENABLE([edict-decomp],
+ [AS_HELP_STRING([--enable-edict-decomp[[=chise|kanjivg|tsuku|auto]]],
+ [dictionary for EDICT2 decomposition [auto]])],
+ [],
+ [enable_edict_decomp=auto])
+AS_IF([test '!' "$with_edict2" = "no"],[
+AC_MSG_CHECKING([EDICT2 decompositions])
+AS_IF(
+ [test "$enable_edict_decomp" = "auto"],
+ [AS_IF([test "$with_chise_ids" = "no"],
+ AS_IF([test "$with_kanjivg" = "no"],
+ AS_IF([test "$with_tsuku_build" = "no"],
+ [enable_edict_decomp=no],
+ [enable_edict_decomp=tsuku]),
+ [enable_edict_decomp=kanjivg]),
+ [enable_edict_decomp=chise])])
+AC_MSG_RESULT([$enable_edict_decomp])
+])
+AS_IF([test "$enable_edict_decomp" = "chise"],[edict_decomp=chise.eids])
+AS_IF([test "$enable_edict_decomp" = "kanjivg"],[edict_decomp=kanjivg.eids])
+AS_IF([test "$enable_edict_decomp" = "tsuku"],
+ [edict_decomp=tsukurimashou.eids])
+AS_IF([test "$enable_edict_decomp" = "no"],
+ [edict_decomp=])
+AC_SUBST([edict_decomp])
+#
 ############################################################################
 #
 # Arch packaging
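
The nested AS_IF chain added to configure.ac above resolves the `auto` setting by preferring CHISE IDS, then KanjiVG, then Tsukurimashou, and falling back to `no`. The following is a minimal plain-shell sketch of that same priority order, not the script Autoconf actually generates. The function name `pick_edict_decomp` and the sample paths are hypothetical; the three arguments stand in for the `$with_chise_ids`, `$with_kanjivg`, and `$with_tsuku_build` variables, where the literal value `no` means the source is unavailable.

```shell
#!/bin/sh
# Hypothetical helper (not part of IDSgrep): mirrors the priority order of
# the nested AS_IF chain for --enable-edict-decomp=auto.
# Each argument is a --with-* value; the literal "no" means "not available".
pick_edict_decomp() {
  with_chise_ids=$1
  with_kanjivg=$2
  with_tsuku_build=$3
  if test "$with_chise_ids" != "no"; then
    echo chise
  elif test "$with_kanjivg" != "no"; then
    echo kanjivg
  elif test "$with_tsuku_build" != "no"; then
    echo tsuku
  else
    echo no
  fi
}

# Example with made-up paths: CHISE IDS missing, KanjiVG present.
pick_edict_decomp no ./kanjivg-20120219.xml.gz ../tsukurimashou-0.6
```

With CHISE IDS absent and KanjiVG present, the sketch selects `kanjivg`, which the later AS_IF lines would map to `kanjivg.eids`.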
--- trunk/idsgrep/Makefile.am (revision 294)
+++ trunk/idsgrep/Makefile.am (revision 295)
@@ -32,9 +32,6 @@
 MAYBE_COVERAGE=--coverage --no-inline
 endif
 if COND_KANJIVG
-if COND_EDICT2
- MAYBE_EDICTDATA=edict.eids
-endif
 MAYBE_KVDATA=kanjivg.eids
 endif
 if COND_TSUKU_BUILD
@@ -41,6 +38,9 @@
 MAYBE_DOCS=idsgrep.pdf
 MAYBE_TSUKUDATA=tsukurimashou.eids
 endif
+if COND_EDICT2
+ MAYBE_EDICTDATA=edict.eids
+endif
 
 bin_PROGRAMS = idsgrep
 
@@ -48,11 +48,11 @@
 idsgrep.aux idsgrep.log idsgrep.blg idsgrep.bbl idsgrep.toc \
 *.gcda *.gcno *.gcov
 
-CLEANFILES = edict.eids kanjivg.eids tsukurimashou.eids
+CLEANFILES = chise.errs edict.eids kanjivg.eids tsukurimashou.eids
 
-DISTCLEANFILES = $(if $(VPATH),idsgrep.pdf,)
+DISTCLEANFILES = $(if $(VPATH),idsgrep.pdf chise.eids,)
 
-MAINTAINERCLEANFILES = idsgrep.pdf
+MAINTAINERCLEANFILES = idsgrep.pdf chise.eids
 
 dist_dict_DATA = $(MAYBE_CIDATA)
 
@@ -83,12 +83,19 @@
 chise.eids: $(wildcard @with_chise_ids@/IDS-*.txt) chise2eids
 $(PERL) -CDS $(mvp)/chise2eids \
 @with_chise_ids@ @with_chise_ids@/IDS-*.txt \
- > chise.eids
+ > chise.eids 2> chise.errs
+ echo `wc -l < chise.errs` errors detected in CHISE IDS
 
-edict.eids: @with_edict2@ kanjivg.eids ed22eids
- $(GZIP) -cd @with_edict2@ \
+# this if is for the case of chise.eids distributed and not locally built
+edict.eids: @with_edict2@ @edict_decomp@ ed22eids
+ if test -r @edict_decomp@ ; \
+ then $(GZIP) -cd @with_edict2@ \
 | $(ICONV) -feuc-jp -tutf-8 \
- | $(PERL) -CDS $(mvp)/ed22eids > edict.eids
+ | $(PERL) -CDS $(mvp)/ed22eids @edict_decomp@ > edict.eids ; \
+ else $(GZIP) -cd @with_edict2@ \
+ | $(ICONV) -feuc-jp -tutf-8 \
+ | $(PERL) -CDS $(mvp)/ed22eids $(mvp)/@edict_decomp@ \
+ > edict.eids ; fi
 
 kanjivg.eids: @with_kanjivg@ kvg2eids
 if $(PERL) \
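
The new `edict.eids` rule above uses `test -r` to choose between a decomposition dictionary in the current directory and one under `$(mvp)`, covering the case noted in the rule's comment where `chise.eids` ships with the distribution rather than being built locally. The sketch below isolates that path-selection step in plain shell. The function name `resolve_decomp` is hypothetical, and treating `$(mvp)` as a source-directory path is an assumption, since its definition is not shown in this diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of IDSgrep): prefer a readable dictionary in
# the current (build) directory, otherwise fall back to the same file name
# under a second directory. The second argument stands in for $(mvp), which
# is assumed here to be the source directory.
resolve_decomp() {
  name=$1     # for example chise.eids
  srcdir=$2   # assumed stand-in for $(mvp)
  if test -r "$name"; then
    printf '%s\n' "$name"
  else
    printf '%s\n' "$srcdir/$name"
  fi
}
```

A recipe written this way could pass the result of `resolve_decomp` to `ed22eids` once, instead of repeating the whole `gzip | iconv | perl` pipeline in both branches; the committed rule duplicates the pipeline instead.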