tsukurimashou: Commit

Commit MetaInfo

Revision: 238
Time: 2012-03-18 08:59:48

Log Message

IDSgrep 0.2 ready

Incremental Difference

--- trunk/idsgrep/idsgrep.tex (revision 237)
+++ trunk/idsgrep/idsgrep.tex (revision 238)
@@ -250,8 +250,7 @@
250250 only if the user is sure of the writing of the first few strokes in the
251251 character. Furthermore, these search schemes often are implemented only in
252252 heavy, non-portable, GUI software that cannot be automated and mixes poorly
253-with standard computing environments. IDSgrep, even in its current alpha
254-version with most features unimplemented, can answer the example query
253+with standard computing environments. IDSgrep can answer the example query
255254 correctly with a single, simple command line (\texttt{idsgrep -d
256255 '[tb][lr]??心'}). This software is intended to bring the user-friendliness
257256 of \texttt{grep} to Han character dictionaries.
@@ -258,14 +257,31 @@
259258 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
261-\section{Download, build, and install}
260+\section{What's new}
262+The main new features in version 0.2 are:
264+\item implementations of all the planned matching operators except
265+ \texttt{@} (associative) and \texttt{/} (regular expression);
266+\item a full test suite and some fixes for bugs found while creating it; and
267+\item the EDICT2-derived dictionary, and the binary comma sugar
268+ character to support it.
273+\section{Download, build, test, and install}
263275 IDSgrep is distributed under the umbrella of the Tsukurimashou project on
264276 Sourceforge.JP~\cite{Tsukurimashou},
265277 \url{http://en.sourceforge.jp/projects/tsukurimashou/}. Releases of IDSgrep
266278 will appear on the project download page; development versions are available
267279 by SVN checkout from the \texttt{trunk/idsgrep} subdirectory of the
280+repository. For the convenience of Github users, the Tsukurimashou (and
281+thus IDSgrep) repository is also mirrored into a Github
282+repository~\cite{TsukuGithub}, but the project on Sourceforge.JP and its SVN
283+repository remain the main public locations for IDSgrep development and all
284+bug-tracker items should be filed there.
270286 A minimal default build and install could run something like this:
271287 \begin{verbatim}
@@ -278,16 +294,21 @@
279295 IDSgrep as such does not include a dictionary, but it can build dictionaries
280296 from the Tsukurimashou font package, which is available through the same
281-Sourceforge.JP project as IDSgrep, or from the KanjiVG database available at
282-\url{http://kanjivg.tagaini.net/}~\cite{KanjiVG}. For an ideal complete
283-installation of IDSgrep, one would download both these packages, build
284-Tsukurimashou first, and make it and KanjiVG available to the IDSgrep
285-\texttt{configure} script. The \texttt{configure} script will by default
286-make a reasonable effort to find KanjiVG and Tsukurimashou; in many common
287-cases it is not necessary to specify them on the command line. Here is a
288-more complete installation process relying on \texttt{configure} to find
289-KanjiVG in the current directory and Tsukurimashou in a sibling directory:
290-\begin{verbatim} unzip tsukurimashou-0.6.zip cd tsukurimashou-0.6
297+Sourceforge.JP project as IDSgrep, from the KanjiVG database available at
298+\url{http://kanjivg.tagaini.net/}~\cite{KanjiVG}, or (only if KanjiVG is
299+also available) from the EDICT2 database available at
300+\url{http://www.csse.monash.edu.au/~jwb/edict.html}~\cite{EDICT2}. For an
301+ideal complete installation of IDSgrep, one would download all those
302+packages, build Tsukurimashou first, and make it and the dictionaries
303+available to the IDSgrep \texttt{configure} script. The \texttt{configure}
304+script will by default make a reasonable effort to find the dependencies; in
305+many common cases it is not necessary to specify them on the command line.
306+Here is a more complete installation process relying on \texttt{configure}
307+to find KanjiVG and EDICT2 in the current directory and Tsukurimashou in a
308+sibling directory:
310+unzip tsukurimashou-0.6.zip
311+cd tsukurimashou-0.6
291312 ./configure
292313 make
293314 # install of Tsukurimashou not needed by IDSgrep
@@ -294,21 +315,39 @@
294315 cd ..
295316 tar -xzvf idsgrep-0.2.tar.gz
296317 cd idsgrep-0.2
297-ln -s /some/where/else/kanjivg-20111029.xml.gz .
318+ln -s /some/where/else/kanjivg-20120219.xml.gz .
319+ln -s /some/where/else/edict2.gz .
298320 ./configure
299321 make
322+make check
300323 su -c 'make install'
301324 \end{verbatim}
303-If the default search fails, the filename of KanjiVG (\texttt{.xml} or
304-\texttt{.xml.gz}) and the top directory of Tsukurimashou can be specified on
305-the \texttt{configure} command line with the \texttt{--with-kanjivg} and
306-\texttt{--with-tsuku-build} options. For other options, run
307-\texttt{configure --help}. It's a reasonably standard GNU
308-Autotools~\cite{Autotools} configuration script, albeit with a lot of
326+If the default search fails, the filenames of KanjiVG (\texttt{.xml} or
327+\texttt{.xml.gz}), EDICT2 (\texttt{.gz}), and the top directory of
328+Tsukurimashou can be specified on the \texttt{configure} command line with
329+the \texttt{--with-kanjivg}, \texttt{--with-edict2},
330+and \texttt{--with-tsuku-build} options. For
331+other options, run \texttt{configure --help}. It's a reasonably standard
332+GNU Autotools~\cite{Autotools} configuration script, albeit with a lot of
309333 options for inapplicable installation directories removed to simplify the
310334 help message.
336+The ``\texttt{check}'' Makefile target runs the IDSgrep test suite. Some
337+tests require the dictionary files and will be skipped if those are not
338+present. There is also a test that will use Valgrind~\cite{Valgrind} if
339+available, to check for memory-related problems; if Valgrind is not found in
340+the \texttt{PATH}, this test will be skipped.
342+The \texttt{configure} script supports an \texttt{--enable-gcov} switch to
343+enable meta-testing of the test suite's coverage. This feature requires
344+that the Gcov coverage analyser be installed. To do a coverage analysis,
345+run \texttt{configure} with \texttt{--enable-gcov} and any other desired
346+options, then do \texttt{make clean} (necessary to be sure all object
347+files are rebuilt with the coverage instrumentation) followed by
348+\texttt{make check}. Most people would not want to install an
349+IDSgrep binary built under this option.
312351 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
314353 \section{Interface to KanjiVG}
@@ -342,6 +381,11 @@
342381 or not. As a result, not all entries in the dictionary will be right.
343382 However, only a few are affected by this issue.
384+As of March 2012, I (Matthew Skala, the author of IDSgrep) have become a
385+member of the KanjiVG project and there is some possibility that KanjiVG's
386+database design will change in a way that makes it easier to recover spatial
387+organization for searching with IDSgrep.
345389 With the current versions of IDSgrep and KanjiVG, the KanjiVG-derived
346390 dictionary contains 6660 entries covering all the popularly-used Japanese
347391 kanji. Note that the KanjiVG input file, and presumably the resulting
@@ -351,6 +395,58 @@
352396 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
398+\section{Interface to EDICT2}
400+Jim Breen's JMdict/EDICT project maintains a file called
401+EDICT2~\cite{EDICT2} which is more like a traditional dictionary, with words
402+and meanings, than a database of kanji. Such dictionaries are not the
403+primary target of IDSgrep and IDSgrep's query syntax is not perfectly suited
404+to them. However, planned future regular-expression matching features may
405+make it more practical to search EDICT2 with IDSgrep, and even in the
406+current version, there is some value in being able to do sub-character
407+structural searches on multi-character words.
409+If both EDICT2 and KanjiVG are available to the IDSgrep build system, it
410+will invoke the \texttt{ed22eids} script and generate and install a
411+dictionary file called \texttt{edict.eids}, which represents a database
412+join of the two dictionaries. A sample entry might look like this:
414+【明】,<明>⿰日月⦅[みん] (n) Ming (dynasty of China)⦆
417+The head for the entire entry is the head from the EDICT2 entry. Then the
418+tree is a binary tree with a comma as the functor and the first child
419+being the entire \texttt{kanjivg.eids} entry for the first character. The
420+second child represents the rest of the entry. With a two-character or
421+longer head, this child would also be a binary comma with the second
422+character of the entry head as its first child. In this way the characters
423+of the entry head are all represented as left children of commas, forming a
424+linked-list structure (much like a Prolog linked-list with commas instead of
425+dots as the functors). The final child at the bottom is a nullary node
426+containing as its functor simply the rest of the EDICT2 entry.
428+The rationale for this syntax is that it allows a relatively simple way of
429+querying multi-character words in EDICT2 using the existing IDSgrep query
430+types. To find an exact match, just query the head (which will require head
431+brackets and a semicolon if the query is more than one character long), as
432+in \texttt{idsgrep -ded '<教育>;'}. To search for the first few characters,
433+commas can be imagined as separators (though their actual function is quite
434+different) with a comma at the start and a question mark at the end, as in
435+\texttt{idsgrep -ded ',教,育?'}. These queries can be combined with the
436+sub-character breakdown queries already supported by the KanjiVG-based
437+dictionary. For instance, \texttt{idsgrep -ded ',教,...|日月!,??'} will
438+search for, and give definitions of, words of exactly two characters in
439+which the first is \texttt{教} and the second character contains \texttt{日}
440+or \texttt{月} anywhere. The restriction to exactly two characters is
441+accomplished by the sub-query ``\texttt{!,??}'', which fails to match on the
442+binary comma that would be present at that point in a longer word.
444+Since both EDICT2 and KanjiVG are under the Creative Commons
445+Attribution--ShareAlike license, that license presumably also applies to the
446+combined dictionary made from them.
354450 \section{Interface to Tsukurimashou}
356452 IDSgrep is closely connected with the Tsukuimashou font
@@ -730,8 +826,8 @@
730826 \end{itemize}
732828 Here are all the characters that have sugary implicit brackets, with the
733-brackets they imply: {\ttfamily (;) (?) .!. ./. .=. .*. .@. [\&] [|] [⿰]
734-[⿱] [⿴] [⿵] [⿶] [⿷] [⿸] [⿹] [⿺] [⿻] \{⿲\} \{⿳\}}
829+brackets they imply: {\ttfamily (;) (?) .!. ./. .=. .*. .@. [\&] [,]
830+[|] [⿰] [⿱] [⿴] [⿵] [⿶] [⿷] [⿸] [⿹] [⿺] [⿻] \{⿲\} \{⿳\}}
736832 Note that the sugary and syrupy implications of a character are only
737833 relevant when the character occurs where an opening bracket of some
@@ -992,7 +1088,7 @@
9921088 ``a'' for ``associative.''
9931089 The verbose ASCII name for ``\texttt{.@.}'' is ``\texttt{.assoc.}.''
995-This feature is not yet implemented in version 0.1.
1091+This feature is not yet implemented in version 0.2.
9971093 \subsection{Regular expression matching}
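
Note on the EDICT2 entry layout added in the idsgrep.tex hunks above: the comma-chain encoding can be illustrated with a short Python sketch. This is not the `ed22eids` script, whose source is not part of this diff. The function name `encode_entry` and the `decompositions` lookup are hypothetical, and the sketch ignores any escaping that real EIDS output may need. It only reproduces the shape of the man page example, `<XYZ>,X,Y,Z(definition)`.

```python
def encode_entry(head, rest, decompositions=None):
    """Build an EIDS-like string for one dictionary entry (illustrative only).

    head           -- the word being defined, e.g. "XYZ"
    rest           -- the remainder of the EDICT2 entry
    decompositions -- hypothetical mapping from a character to its
                      KanjiVG-derived entry; bare characters are used
                      when no decomposition is known
    """
    decompositions = decompositions or {}
    parts = ["<" + head + ">"]
    for ch in head:
        # Each character of the head is the left child of a binary comma,
        # so the commas form a linked list down the right-hand side.
        parts.append("," + decompositions.get(ch, ch))
    # The last right child is a nullary node holding the rest of the entry.
    parts.append("(" + rest + ")")
    return "".join(parts)


print(encode_entry("XYZ", "definition"))
```

Passing a mapping such as `{"X": "<X>..."}` would substitute a full per-character entry for the bare character, which is roughly what joining against `kanjivg.eids` is described as doing.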
--- trunk/idsgrep/idsgrep.1.in (revision 237)
+++ trunk/idsgrep/idsgrep.1.in (revision 238)
@@ -249,7 +249,7 @@
249249 parentheses, and will thus become the functor of a nullary node.
250250 The complete list of characters that have sugary implicit brackets, with
251251 the brackets they imply, is:
252-(;) (?) .!. ./. .=. .*. .@. [&] [|]
252+(;) (?) .!. ./. .=. .*. .@. [&] [,] [|]
253253 [<U+2FF0>] [<U+2FF1>] [<U+2FF4>] [<U+2FF5>] [<U+2FF6>] [<U+2FF7>]
254254 [<U+2FF8>] [<U+2FF9>] [<U+2FFA>] [<U+2FFB>]
255255 {<U+2FF2>} {<U+2FF3>}.
@@ -309,9 +309,58 @@
309309 .I x
310310 matches some subtree of the input.
311311 .IP \(bu 4
312+If the pattern is
313+.RI .*. "x" ,
314+then it matches if and only if some permutation of the children of
315+.I x
316+(at the root level only) will cause it to match the input; that is, the
317+children are allowed to match in any order.
318+.IP \(bu 4
319+If the pattern is
320+.RI .!. "x" ,
321+then it matches if and only if
322+.I x
324+.I not
325+match the input.
326+.IP \(bu 4
327+If the pattern is
328+.RI [&] "xy" ,
329+then it matches if and only if both
330+.I x
332+.I y
333+match the input.
334+.IP \(bu 4
335+If the pattern is
336+.RI [|] "xy" ,
337+then it matches if and only if
338+.I x
339+matches the input or
340+.I y
341+matches the input.
342+.IP \(bu 4
343+If the pattern is
344+.RI .=. "x"
345+and if
346+.I x
347+and the input both have heads, then it matches if and only if those heads
348+are identical.
350+.RI .=. "x"
351+matches if and only if
352+.I x
353+and the input have identical functors, identical arity, and all their
354+corresponding children match.
355+The effect of this operation is to ignore any
356+special matching semantics of
357+.IR x 's
358+functor, should it happen to be one of the special values mentioned in these
360+.IP \(bu 4
312361 Otherwise, the pattern matches the input if and only if its functor and
313-arity are the same as the input's, and all the children of the pattern match
314-the corresponding children of the input recursively.
362+arity are the same as the input's and all the children of the pattern match
363+the corresponding children of the input.
315364 .
316365 .SH FILES
317366 Individual sites may well have a different set of dictionaries installed,
@@ -345,6 +394,23 @@
345394 of works.
346395 As a result, decompositions in this database may be incomplete,
347396 idiosyncratic, or even flat-out wrong.
398+.I @flat_dictdir@/edict.eids
400+Japanese-English dictionary of words and their meanings, based on
401+EDICT2, with character decompositions from KanjiVG.
402+This allows searching for multi-character words using partial descriptions
403+of the individual characters.
404+The EDICT2 entries are translated to EIDS format such that the EDICT2 head
405+(typically the word being defined) is the head of the EIDS; below that there
406+follows a chain of binary comma nodes each having a character of the word as
407+its left child and the rest of the chain as the right child.
408+These left children are decomposed according to the KanjiVG database.
409+The final right child is a nullary node containing the remainder of the
410+EDICT2 entry, including pronunciation, part of speech, definition, and any
411+other tags.
412+For example, an entry that defined \(lqXYZ\(rq as \(lqdefinition\(rq might
413+be encoded as \(lq<XYZ>,X,Y,Z(definition)\(rq.
348414 .
@@ -387,10 +453,11 @@
387453 .PP
388454 Please note that dictionaries prepared for use with IDSgrep may be subject to
389455 their own copyright terms differing from those of IDSgrep itself.
390-In particular, the IDSgrep distribution contains code to build a dictionary
391-based on Ulrich Apel's KanjiVG project.
392-That dictionary would be subject to his copyright and the Creative Commons
393-Attribution-Share Alike 3.0 Licence.
456+In particular, the IDSgrep distribution contains code to build dictionaries
457+based on KanjiVG and EDICT2.
458+The input files for those are subject to the Creative Commons
459+Attribution-Share Alike 3.0 Licence, and their authors might make a
460+copyright claim on the resulting dictionaries.
394461 The Tsukurimashou project also builds an EIDS-format dictionary for
395462 use with IDSgrep, but happens to use the same copyright and GPL 3 licensing
396463 terms as IDSgrep anyway.
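
Note on the matching rules added to idsgrep.1.in above: the following Python sketch models only the rules visible in this hunk (unordered match, NOT, AND, OR, literal match, and the default case). It is an assumption-laden illustration, not IDSgrep's implementation. The `Node` class and the `matches` and `plain_match` names are hypothetical, and earlier rules from the man page that this diff does not show (such as head comparison in the default case, `?`, and `...`) are deliberately left out.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Node:
    functor: str
    children: tuple = ()
    head: Optional[str] = None


def plain_match(pattern, inp):
    # Default rule: same functor, same arity, children match pairwise.
    return (pattern.functor == inp.functor
            and len(pattern.children) == len(inp.children)
            and all(matches(p, c)
                    for p, c in zip(pattern.children, inp.children)))


def matches(pattern, inp):
    f, kids = pattern.functor, pattern.children
    if f == "*" and len(kids) == 1:      # .*.x : children in any order
        x = kids[0]
        return any(matches(Node(x.functor, tuple(p), x.head), inp)
                   for p in permutations(x.children))
    if f == "!" and len(kids) == 1:      # .!.x : x does not match
        return not matches(kids[0], inp)
    if f == "&" and len(kids) == 2:      # [&]xy : both match
        return matches(kids[0], inp) and matches(kids[1], inp)
    if f == "|" and len(kids) == 2:      # [|]xy : either matches
        return matches(kids[0], inp) or matches(kids[1], inp)
    if f == "=" and len(kids) == 1:      # .=.x : ignore special functors
        x = kids[0]
        if x.head is not None and inp.head is not None:
            return x.head == inp.head
        return plain_match(x, inp)
    return plain_match(pattern, inp)


ming = Node("⿰", (Node("日"), Node("月")))
swapped = Node("⿰", (Node("月"), Node("日")))
print(matches(swapped, ming))                 # children in the wrong order
print(matches(Node("*", (swapped,)), ming))   # order ignored at the root
```

In this model the special operators are ordinary nodes whose functor and arity select a rule, which is why the literal-match operator can switch the special meaning off simply by going straight to `plain_match`.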
--- trunk/idsgrep/configure.ac (revision 237)
+++ trunk/idsgrep/configure.ac (revision 238)
@@ -159,7 +159,7 @@
159159 #
160160 AC_PREREQ([2.63])
161161 AC_INIT([IDSgrep],
162- [0.1], [mskala@ansuz.sooke.bc.ca], [idsgrep],
162+ [0.2], [mskala@ansuz.sooke.bc.ca], [idsgrep],
163163 [[http://ansuz.sooke.bc.ca/]])
165165 AM_INIT_AUTOMAKE([foreign parallel-tests color-tests])
@@ -168,7 +168,7 @@
168168 AC_CONFIG_MACRO_DIR([m4])
169169 AC_REVISION([$Id: configure.ac 1015 2011-12-15 22:24:32Z mskala $])
170170 AC_COPYRIGHT([Copyright (C) 2012 Matthew Skala])
171-AC_SUBST([release_date],["January 26, 2012"])
171+AC_SUBST([release_date],["March 17, 2012"])
172172 #
173173 ############################################################################
174174 #
--- trunk/idsgrep/INSTALL (revision 237)
+++ trunk/idsgrep/INSTALL (revision 238)
@@ -7,4 +7,4 @@
77 ./configure --help
99 For more details, see idsgrep.pdf , or (after configuration)
10-the man page idsgrep.1 .
10+the man page idsgrep.1.
--- trunk/idsgrep/Makefile.am (revision 237)
+++ trunk/idsgrep/Makefile.am (revision 238)
@@ -53,7 +53,9 @@
56-dist_noinst_SCRIPTS = ed22eids kvg2eids
56+dist_noinst_SCRIPTS = \
57+ ed22eids kvg2eids \
58+ $(GCOV_TESTS) test/vgneko test/rmgcda test/gcov
5860 dist_pdf_DATA = $(MAYBE_DOCS)