• R/O
  • SSH
  • HTTPS

tsukurimashou:


File Info

Rev. 390
Size 100,713 bytes
Time 2013-03-07 10:40:49
Author mskala
Log Message

final (again) prep for 0.7

Content

\documentclass[twocolumn]{report}

\usepackage{fontspec}

\usepackage{calc}
\usepackage[rigidchapters]{titlesec}
\usepackage{tocloft}
\usepackage{url}

\title{IDSgrep, version \idsgrepversion}
\author{Matthew Skala}

\input{config.tex}

\setsansfont{Nimbus Sans L}

\setlength{\topmargin}{0.25in}
\setlength{\headheight}{0pt}
\setlength{\headsep}{0pt}

\setlength{\oddsidemargin}{\paperwidth/17-0.5in}
\setlength{\columnsep}{1em}
\setlength{\textwidth}{\paperwidth*15/17-1in}
\setlength{\textheight}{8.5in}

\setlength{\parindent}{1.5em}

\makeatletter
\dimen@=\f@size\p@\dimen@6\dimen@\divide\dimen@5\edef\l@rgesize{\the\dimen@}
\dimen@=\f@size\p@\dimen@2\dimen@\edef\@hugesize{\the\dimen@}

\renewcommand\maketitle{%
  \begin{titlepage}%
    \let\footnotesize\small
    \let\footnoterule\relax
    \let \footnote \thanks
    \vspace*{\fill}
    \vspace*{\fill}
    {\sffamily\bfseries\huge \hfill\@title\hfill\null\par}
    \vspace{\fill}
    {\sffamily\bfseries\Large \hfill\@author\hfill\null\par}
    \vspace{\fill}
    \vspace{\fill}
    \vspace{\fill}
    \vspace{\fill}
    \vspace{\fill}
    {\sffamily\bfseries\Large \hfill\@date\par}
    \vspace*{\fill}
    \@thanks\par
    \null
  \end{titlepage}%
  \setcounter{footnote}{0}%
  \global\let\thanks\relax
  \global\let\maketitle\relax
  \global\let\@thanks\@empty
  \global\let\@author\@empty
  \global\let\@date\@empty
  \global\let\@title\@empty
  \global\let\title\relax
  \global\let\author\relax
  \global\let\date\relax
  \global\let\and\relax
}

\def\@maketitle{%
  \newpage\noindent
  \null
  \begingroup
    \let\footnote\thanks
    {\fontsize{\@hugesize}{2\baselineskip}\sffamily\bfseries\selectfont
      \@title\,\leaders\hrule height 0.2ex\hfill\null}\par\noindent
    {\fontsize{\l@rgesize}{\baselineskip}\sffamily\bfseries
      \@author\hfill\@date}\par
  \endgroup
  \dimen@=0.5\baselineskip\relax\advance\dimen@-0.5\p@\relax
  \vspace{\dimen@}\noindent
}

\newcommand{\ch@pterform@t}[1]{%
  \begin{@twocolumnfalse}%
    \fontsize{\@hugesize}{3\baselineskip}\sffamily\bfseries\selectfont
    #1\,\leaders\hrule height 0.2ex\hfill\null
  \end{@twocolumnfalse}%
}%
\titlespacing*{\chapter}{0pt}{6\baselineskip}{2\baselineskip}
\titleformat{\chapter}[hang]{}{}{0pt}{\ch@pterform@t}

\titleformat{\section}[runin]%
   {\fontsize{\l@rgesize}{\baselineskip}\sffamily\bfseries}{}{0pt}{}%
   [\,\leaders\hrule height 0.16ex\hfill\null\\]
\titlespacing*{\section}{0pt}{\baselineskip}{0pt}

\titleformat{\subsection}[runin]{\sffamily\bfseries}{}{0pt}{}
\titlespacing*{\subsection}{0pt}{\baselineskip}{0.666em}

\titleformat{\subsubsection}[runin]{\rmfamily\scshape}{}{0pt}{}
\titlespacing*{\subsubsection}{0pt}{\baselineskip}{0.666em}

\titleformat{\paragraph}[runin]{\rmfamily\itshape}{}{0pt}{}
\titlespacing*{\paragraph}{\parindent}{0pt}{0.666em}

\raggedbottom
\makeatother

\renewcommand{\thefootnote}{\fnsymbol{footnote}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

\maketitle

\setcounter{page}{2}

\renewcommand{\cfttoctitlefont}{%
  \huge\sffamily\bfseries}
\renewcommand{\cftaftertoctitle}{%
  {\huge\,\leaders\hrule height 0.2ex\hfill\null\vspace*{-4ex}}}

\def\gobbtohfil#1{%
  \begingroup\if#1\hfil\else\aftergroup\gobbtohfil\fi\endgroup}

\renewcommand{\cftchappresnum}{\gobbtohfil}
\renewcommand{\cftchapnumwidth}{0pt}
\renewcommand{\cftchapfont}{\sffamily\bfseries}
\renewcommand{\cftchappagefont}{\sffamily\bfseries}

\renewcommand{\cftsecpresnum}{\gobbtohfil}
\renewcommand{\cftsecnumwidth}{0pt}

\renewcommand{\cftsubsecpresnum}{\gobbtohfil}
\renewcommand{\cftsubsecnumwidth}{0pt}

\tableofcontents

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Quick start}

\noindent
Use \texttt{idsgrep} much as you would use \texttt{grep}:

\begin{quotation}
  \texttt{idsgrep}
    \textit{$[\langle$options$\rangle]$}
    \textit{$\langle$pattern$\rangle$}
    \textit{$[\langle$file$\rangle\ldots ]$}
\end{quotation}

Its general function is to search one or more files for items matching a
pattern, like \texttt{grep}~\cite{grep} but with a different pattern
syntax.  Although potentially usable for an unlimited range of tasks,
\texttt{idsgrep}'s motivating application is to searching databases of Han
script (Chinese, Japanese, etc.)\ character descriptions.  It provides a
much more powerful replacement for the ``radical search'' feature of
dictionaries like Kiten~\cite{Kiten} and WWWJDIC~\cite{WWWJDIC}.

The syntax for matching patterns, and the range of command-line options
available, are complicated.  Later sections of this document explain those
things in detail; for now, here are some examples.

\begin{description}
\item[\texttt{idsgrep 萌 dictionary}]~\\
  A literal character searches for the decomposition of that character,
  exact match only.
\item[\texttt{idsgrep -d 萌}]~\\
  The \texttt{-d} option with empty argument searches a default collection
  of dictionaries.
\item[\texttt{idsgrep -dtsuku 萌}]~\\
  The \texttt{-d} option can be given an argument to choose a specific
  default dictionary.  Note the argument must be given in the same
  \texttt{argv}-element with the \texttt{-d}; the syntax \texttt{-d tsuku}
  with a space would mean ``Use the default dictionaries and search for the
  (syntactically invalid) pattern `\texttt{tsuku}.'\,''
\item[\texttt{othersoft | idsgrep 萌}]~\\
  Standard input will be used if no other input source is specified.
\item[\texttt{idsgrep -d ...日}]~\\
  Three dots match their argument anywhere, so this will match \texttt{日},
  \texttt{早}, and \texttt{萌}.
\item[\texttt{idsgrep -d '?'}]~\\
  A question mark, which will probably require shell quoting, matches
  anything.  This is most useful as part of a more complex pattern.
\item[\texttt{idsgrep -d '⿱?心'}]~\\
  Unicode Ideographic Description Characters can be used to build up
  sequences that also
  incorporate the wildcards; this example matches characters
  consisting of something above \texttt{心}, such as \texttt{忽} and
  \texttt{恋} but not \texttt{応}.
\item[\texttt{idsgrep -d '[tb](anything)心'}]~\\
  There are ASCII aliases
  for operators that may be inconvenient to type; this query is
  functionally the same as the previous one.
\item[\texttt{idsgrep -d '\&!萌...|日月'}]~\\
  Boolean prefix operators include \texttt{\&} (AND), \texttt{|} (OR), and
  \texttt \texttt{!} (NOT).  This example matches anything that contains
  \texttt{日} or \texttt{月} but is not \texttt{萌}.
\item[\texttt{idsgrep -d '*⿰日?'}]~\\
  Asterisk makes children match in any order; this example matches
  \texttt{日} at left or right.
\item[\texttt{idsgrep -d '@⿱⿱?日?'}]~\\
  At-sign treats an operator as associative; this example matches both
  \texttt{⿱⿱?日?} and \texttt{⿱?⿱日?}.
\item[\texttt{idsgrep -d '.../(femoral)'}]~\\
  Slash invokes Perl-compatible regular expression matching, which might be
  useful for the EDICT2-based meaning dictionary.
\item[\texttt{idsgrep -d '...=?'}]~\\
  Equals escapes matching operators; this example searches for a literal
  question mark anywhere in the tree.
\item[\texttt{idsgrep -d '\textbackslash X840C'}]~\\
  Several kinds of backslash escapes allow entering characters that might
  not otherwise be available.
\item[\texttt{idsgrep -d -c indent 萌}]~\\
  The \texttt{-c} option selects ``cooked'' or pretty-printed output modes.
\item[\texttt{idsgrep -d -f FontFile.otf '\#1'}]~\\
  The \texttt{-f} option reads the character set of an OpenType font
  and makes it available as a user-defined matching predicate accessed
  with the hash-mark; in the example, it looks up each character in the
  default dictionaries.
\item[\texttt{idsgrep -U '?'}]~\\
  The \texttt{-U} option generates a list of Unicode characters.
\item[\texttt{idsgrep -Uxdb '?'}]~\\
  An optional argument to \texttt{-U} specifies information to include
  in the generated list entries: \texttt{x} for hexadecimal
  value, \texttt{d} for decimal, \texttt{b} for block name.
\item[\texttt{idsgrep -U -f FontFile.otf '\#1'}]~\\
  Combine \texttt{-U} and \texttt{-f} to list the characters in a font.
\end{description}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Introduction}

\noindent
The Han character set is open-ended.  Although a few thousand characters
suffice to write the languages most commonly written in Han script languages
(namely Chinese and Japanese) most of the time, popular standards define
tens of thousands of less-popular characters, and there are at least
hundreds of thousands of rare characters known to occur in names, historical
contexts, and in languages like Korean and Vietnamese that may still use Han
script occasionally despite now being written primarily in other scripts.

Computer text processing systems that use fixed lists of characters will
inevitably find themselves unable to represent some text.  As a result,
there is a need to \emph{describe} characters in a standard way that may
have no standard code points of their own.  A similar need for descriptions
of characters arises when looking up characters in a dictionary; a user may
recognize some or all the visual features of a character (such as its parts
and the way they are laid out) without knowing how to enter the character as
a whole.

IDSgrep's main function is to query character description databases in a
flexible way.  This need was identified during development of the
Tsukurimashou font family~\cite{Tsukurimashou}; there, the visual appearance
of Han character glyphs corresponds directly to the MetaPost code
implementing them, and the desire for code re-use and consistency often
motivates a close examination of the existing work to answer questions like
``What other characters contain this shape, and how did we implement it last
time?'' Standard tools like \texttt{grep}~\cite{grep} can sometimes be
applied to answer such questions by searching for subroutine names in the
source code, but the related question of ``What other characters, not yet
implemented, could we build that would use this shape?''\ requires comparing
with some external database of the characters commonly used in the language. 
How can we run \texttt{grep} on the writing system itself?

Someone confronted with an unknown character and wanting to look it up in a
more ordinary dictionary to find the meaning may, similarly, want to search
for characters based on specific features while leaving others unspecified,
with questions like ``What characters exist that have the common \texttt{心}
shape at the bottom, with the upper part divided into two things side by
side?  The two things at the top are shapes I don't recognize, printed too
small for me to identify them more precisely.'' Existing dictionary-query
methods are not adequate for some reasonable queries of this nature.

For instance, the radical-and-stroke-count method of traditional character
dictionaries requires identifying the head radical and counting strokes,
both of which may be difficult; dictionary codes like SKIP and Four Corners
key on some layout attributes but not all; multi-radical search allows the
user to choose whichever radicals they recognize, but it ignores layout
entirely; and computer handwriting recognition generally works well if and
only if the user is sure of the writing of the first few strokes in the
character.  Furthermore, these search schemes often are implemented only in
heavy, non-portable, GUI software that cannot be automated and mixes poorly
with standard computing environments.  IDSgrep can answer the example query
correctly with a single, simple command line (\texttt{idsgrep -d
'[tb][lr]??心'}).  This software is intended to bring the user-friendliness
of \texttt{grep} to Han character dictionaries.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{What's new}

The main new features in version 0.4 are:
\begin{itemize}
  \item changes to the build system for better integration with
    Tsukurimashou;
  \item support for user-defined matching predicates, and in particular, the
    ability to match against the list of characters defined by a font file
    (``\texttt{-f}'' option);
  \item built-in generation of Unicode character lists (``\texttt{-U}''
    option).
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Download, build, test, and install}

IDSgrep is distributed under the umbrella of the Tsukurimashou project on
Sourceforge.JP~\cite{Tsukurimashou},
\url{http://tsukurimashou.sourceforge.jp/}.  Releases of IDSgrep
will appear on the project download page; development versions are available
by SVN checkout from the \texttt{trunk/idsgrep} subdirectory of the
repository.  For the convenience of Github users, the Tsukurimashou (and
thus IDSgrep) repository is also mirrored into a Github
repository~\cite{TsukuGithub}, but the project on Sourceforge.JP and its SVN
repository remain the main public locations for IDSgrep development and all
bug-tracker items should be filed there.

A minimal default build and install could run something like this:
\begin{verbatim}
tar -xzvf idsgrep-0.3.tar.gz
cd idsgrep-0.3
./configure
make
su -c 'make install'
\end{verbatim}

IDSgrep can build dictionaries
from the Tsukurimashou font package, which is available through the same
Sourceforge.JP project as IDSgrep; from the KanjiVG database available at
\url{http://kanjivg.tagaini.net/}~\cite{KanjiVG}; from the CHISE IDS
database available at
\url{http://chise.zinbun.kyoto-u.ac.jp/dist/ids/}~\cite{CHISE};
or from the EDICT2 database available at
\url{http://www.csse.monash.edu.au/~jwb/edict.html}~\cite{EDICT2}.  For an
ideal complete installation of IDSgrep, one would download all those
packages, build Tsukurimashou first, and make it and the dictionaries
available to the IDSgrep \texttt{configure} script.
A precompiled version of the CHISE IDS-derived dictionary is bundled
in the IDSgrep distribution tarball, so that one should be available (though
not necessarily up-to-date) without any dependencies.

The \texttt{configure}
script will by default make a reasonable effort to find the dependencies; in
many common cases it is not necessary to specify them on the command line. 
Here is a more complete installation process relying on \texttt{configure}
to find Tsukurimashou in a
sibling directory and the others in the current directory:
\begin{verbatim}
unzip tsukurimashou-0.6.zip
cd tsukurimashou-0.6
./configure
make
# install of Tsukurimashou not needed by IDSgrep
cd ..
tar -xzvf idsgrep-0.2.tar.gz
cd idsgrep-0.2
ln -s /some/where/else/kanjivg-20120219.xml.gz .
ln -s /some/where/else/edict2.gz .
ln -s /some/where/else/chise-ids-0.25 .
./configure
make
make check
su -c 'make install'
\end{verbatim}

It is necessary to at least configure Tsukurimashou, if not fully build it,
before building IDSgrep.  The IDSgrep build will then invoke the
Tsukurimashou build to create just the files needed by IDSgrep.  It is not
necessary to configure or build CHISE IDS (which would require first
installing other parts of the larger CHISE system and probably XEmacs as
well); IDSgrep only needs to look at the CHISE IDS data files.

If the default search fails, the filenames of KanjiVG (\texttt{.xml} or
\texttt{.xml.gz}), EDICT2 (\texttt{.gz}), and the directories containing
extracted distributions of Tsukurimashou and CHISE IDS can be
specified on the \texttt{configure} command line with
the \texttt{--with-kanjivg}, \texttt{--with-edict2},
\texttt{--with-tsuku-build}, and \texttt{--with-chise-ids} options.  For
other options, run \texttt{configure --help}.  It's a reasonably standard
GNU Autotools~\cite{Autotools} configuration script, albeit with a lot of
options for inapplicable installation directories removed to simplify the
help message.

The EDICT2-based dictionary should preferably include
character decompositions from some other dictionary; which one is
selectable by the \texttt{--enable-edict-decomp} option.  Allowed values
include \texttt{chise}, \texttt{kanjivg}, \texttt{tsuku}, and \texttt{no};
the default of \texttt{auto} will try all of those in that order and use the
first that works.  The value \texttt{no} corresponds to simply mapping every
character to itself without further decomposition; that is obviously not as
informative as might be desired, but it will still allow for regular
expression searches.

The ``\texttt{check}'' Makefile target runs the IDSgrep test suite.  Some
tests require the dictionary files and will be skipped if those are not
present.  There is also a test that will use Valgrind~\cite{Valgrind} if
available, to check for memory-related problems; if Valgrind is not found in
the \texttt{PATH}, this test will be skipped.

The \texttt{configure} script supports an \texttt{--enable-gcov} switch to
enable meta-testing of the test suite's coverage.  This feature requires
that the Gcov coverage analyser be installed.  To do a coverage analysis,
run \texttt{configure} with \texttt{--enable-gcov} and any other desired
options, then do \texttt{make clean} (necessary to be sure all object files
are rebuilt with the coverage instrumentation) followed by \texttt{make
check}.  Full coverage can only be attained if the dictionary files are
installed (not just built).  Most people would not want to install the
IDSgrep binary itself when built under this option.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Unicode IDSes}

Although IDSgrep uses a more elaborate syntax, it is well to know
about the Unicode Consortium's ``Ideographic Description Sequences''
(IDSes), which are a subset of IDSgrep's.  These are documented
more fully in the Unicode standard~\cite{Unicode:IDS}.

\begin{itemize}

\item A character from one of the Unified Han or CJK Radical ranges is a
complete IDS and simply represents itself.  For instance, ``\texttt{大}'' is
a complete IDS.

\item The Ideographic Description Characer (IDC) code points U+2FF0, U+2FF1,
and U+2FF4 through U+2FFB, whose graphic images look like
\texttt{⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻}, are prefix binary operators.
One of these characters followed by two complete IDSes
forms another complete IDS, representing a character formed by joining the
two smaller characters in a way suggested by the name and graphical image
of the IDC.  For instance, ``\texttt{⿰日月}'' describes the
character \texttt{明}.  These structures may be nested; for instance,
``\texttt{⿰言⿱五口}'' describes the character \texttt{語}.

\item The IDC code points U+2FF2 and U+2FF3, which look like \texttt{⿲⿳},
are prefix ternary operators. (Unicode uses the less-standard word
``trinary'' to describe them.) One of them can be followed by three complete
IDSes to form an IDS that describes a character made of three parts, much in
the same manner as the binary operators.  For instance,
``\texttt{⿱⿲糸言糸夂}'' describes the character \texttt{變}.

\item As of Unicode 6.1, IDS length is unlimited.  Earlier versions
specified that an IDS could not be more than 16 code points long overall
nor contain more than six consecutive non-operator characters.
This rule appears to be have been intended to make things easier for systems
that need to be able to jump into the middle of text and quickly find the
starts and ends of IDSes.

\item IDSes non-bindingly ``should'' be as short as possible and should
reflect ``the natural radical-phonetic division for an ideograph if it has
one.''
\end{itemize}

The Unicode standard does not mention variation selectors in any IDS-related
context, except that it offers the possibility of prefixing U+303E, the
``ideographic variation mark,'' to the entire sequence to indicate a
variation.  Such a prefix is explicitly defined not to be counted as part of
the IDS.

My opinion is that Unicode did not intend to permit variation selectors
inside IDS syntax.  Variation selectors arguably exist to patch over
encoding inadequacies resulting from Unicode's internal politics.  When a
code point is not really specific enough, because it refers to two or more
things which you think are not actually the same thing, then you can add a
variation selector to indicate which thing you really mean.  IDSes, on the
other hand, bearing in mind that they are imported from GBK, operate at a
lower level to specify characters in terms of parts that are assumed to be
adequately encoded.  If a code point to be used in an IDS is not specific
enough, then that element should be described with a smaller fragment of IDS
syntax instead of by using the ambiguous code point.  If the closest match
possible is still not perfect, then it is time to use U+303E.  The fact they
offer the U+303E mechanism for specifying variations offers further support
to the idea that they did not intend to allow variation selectors
\emph{inside} IDSes.

However, it's a difficult question because IDSes, by addressing the visual
appearance of characters instead of their semantics, fundamentally challenge
the basic Unicode principle that code points specify characters and not
glyphs.  The distinction between characters and glyphs simply cannot be made
perfectly in all cases.  For use in cases where variation selectors appear
to be appropriate, both CHISE IDS and IDSgrep extend IDS syntax in
such a way as to allow them in some way.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Interface to CHISE IDS}

The CHISE project~\cite{CHISE} maintains a database of Han characters
covering multiple languages as part of a larger processing environment that
also includes a version of XEmacs~\cite{XEmacs} modified to follow the
principles of the UTF-2000 initiative~\cite{UTF2000}.  It also has
connections to GlyphWiki~\cite{GlyphWiki}. 
These systems are documented primarily in Japanese; English-language
documentation is sparse and not necessarily up to date.

For IDSgrep's purposes, the most interesting part of CHISE is a module
called CHISE IDS, which includes a database of about 140000 characters
(exact count depending on the version), with decompositions in its own
extension of Unicode IDS syntax.  The main purpose of this IDS database is
to provide a search capability within the modified XEmacs; there is also
code for a Web search form.  From examination of the database files it
appears that the rules for CHISE IDS's extended IDS syntax are more or less
as follows.

\begin{itemize}
\item Generally, Unicode IDS rules apply.
\item An XML entity-like sequence of the form \texttt{\&NAME;} counts as a
single ideograph.  The field indicated by \texttt{NAME} is a symbolic
identifier or database key defined internally by the project.  Such
identifiers have been observed to contain uppercase ASCII letters, numerals,
hyphens, and plus signs; they usually consist of a short alphabetic
prefix, a hyphen, and a number.  These entity references are usually used to
refer to characters for which CHISE has an encoding and Unicode doesn't.
\item A Unicode variation sequence (an ideograph followed by a variation
selector) counts as a single character.
\end{itemize}

Although CHISE IDS's extensions to IDS permit strings that would not be
valid IDSgrep EIDS syntax, it is easy to convert them into EIDS format. 
IDSgrep includes a \texttt{chise2eids} Perl script for that purpose.  The
\texttt{configure} script will look for CHISE IDS in a directory named
\texttt{chise-ids-*} in a short list of likely places, or use the value of
the \texttt{--enable-chise-ids} command-line option if one is given.  This
directory should simply be an unpacked CHISE IDS distribution tarball, or a
checkout from the CHISE IDS Git repository.  It is not necessary to run
CHISE's Makefile, which would require also having and installing other parts
of the larger system.

As of this writing (March 6, 2013), the CHISE IDS distribution tarballs are
available from \url{http://chise.zinbun.kyoto-u.ac.jp/dist/ids/}, and there
is a Git repository at \url{http://git.chise.org/git/chise/ids.git}. 
However, on March 5, there was an announcement posted to the CHISE Project's
mailing lists in English and Japanese saying that on March 11 all the
servers at chise.org (which includes the mailing lists themselves) will
cease to operate.  It is not clear to me whether that is intended to be
something temporary, or the permanent end of the project, but given that
nothing seems to have happened in CHISE in years, I fear the worst.  I have
posted a snapshot of the current CHISE IDS Git repository in my own Github
account at \url{https://github.com/mskala/chise-ids}; that should be a
long-term stable source of the data used by IDSgrep.

The last distribution tarball of CHISE-IDS was version 0.25 dated June 2010. 
The Git versions are more recent and may be preferable.  The directory
created by checking out a Git version will probably not have a name
recognized automatically by the build system, so it should be given on the
\texttt{configure} command line with \texttt{--enable-chise-ids}.

Roughly 6\% of the entries in the CHISE IDS database include invalid
extended IDS syntax---most often in the form of too many children for the
operators used, or less often, too few.  Most but not all of the errors
occur in the \texttt{IDS-HZK??.txt} files, which are unmaintained.  It
appears that the native search tools for the database generally work on the
basis of pure substring searches, where the higher-level syntax errors that
would be detected by the IDSgrep parser can go unnoticed.  The
\texttt{chise2eids} program generates a \texttt{chise.errs} file during
build, listing all the syntax errors it finds (11748 of them in the current
Git version as of this writing); invalid entries are otherwise ignored and
will not appear in the main output file \texttt{chise.eids}.  Although 6\%
may sound like a lot of errors, the invalid entries are generally in
sufficiently obscure character components that it should have little
practical effect on the quality of dictionary lookups: at worst, some
character components may end up not broken down into pieces as small as
would otherwise be possible.

CHISE IDS refers to individual characters in a more general way than just by
single Unicode code point: sometimes it uses a variation sequence consisting
of a kanji code point followed by a variation selector in the U+FE00 to
U+FE0F or U+E0100 to U+E01EF ranges, and sometimes it uses a string that
looks like an XML character entity reference, along the lines of
``\texttt{\&NAME;}.'' Both of these map naturally to IDSgrep's concept of a
multi-character head.  The two-code-point sequence U+840C U+E0101 is
translated to the IDSgrep syntax ``\texttt{<\textbackslash%
X840c\textbackslash X\{E0101\}>;},'' and the XML-like syntax
``\texttt{\&FU-123;}'' is translated to ``\texttt{<FU-123>;}.'' CHISE IDS
does not seem to refer to the same character in different ways (for
instance, a code point with no variation selector somehow matching as a
default to the same code point with a variation selector, which might be
plausible under Unicode's definition of what variation selectors signify)
and \texttt{chise2eids} does not attempt to accomodate anything like that.

The CHISE IDS database is covered by the GNU GPL version 2 or later, which
is basically compatible with the GNU GPL version 3 used by IDSgrep.  So
distributions of IDSgrep can reasonably include a bundled copy of
\texttt{chise.eids} for the benefit of users who don't want to download the
separate package and generate their own.  Without taking a position on
whether the \texttt{chise.eids} file constitutes ``object code'' for the
purposes of the GNU GPL as opposed to being modified source code in itself,
I am willing to provide a copy of the CHISE IDS checkout I used to generate
my version of \texttt{chise.eids}, to anyone who contacts me about it at
\url{mskala@ansuz.sooke.bc.ca}.  Going directly to the original distribution
points at the URLs given above is probably a more convenient option for most
users.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Interface to KanjiVG}

The KanjiVG project~\cite{KanjiVG} maintains a database of kanji (Han
characters as used by Japanese) in an extended SVG format, which implies
that it is XML.  The \texttt{kvg2eids} Perl script, included as part of
IDSgrep, is capable of reading this database and converting it to Extended
Ideographic Description Sequences (EIDSes).  As described above, if a
reasonably recent version of KanjiVG's compressed XML file is available to
\texttt{configure}, then IDSgrep's build will create such a dictionary and
\texttt{make install} will install it.

KanjiVG describes characters primarily in terms of strokes, not multi-stroke
components, and it attempts to follow the official stroke order and
etymological component breakdown.  That approach results in some
peculiarities from the point of view of dictionary searching.  For instance,
in the kanji \texttt{園}, the official stroke order is to write two strokes
of the enclosing box, then the central glyph, then the bottom of the box. 
KanjiVG's XML file lists two ``elements'' identified with the kanji
\texttt{囗}, one for the first two strokes and one for the final stroke,
with additional attributes specifying that they are actually two parts of
the same element.  KanjiVG has changed its own standard for how to represent
this information in the recent past, and not all entries have been updated
to the latest standard yet.  The current version of \texttt{kvg2eids} does
not correctly process \texttt{園} nor some other characters with parts
written in nonsequential order.  On that particular one it generates a
special functor containing debugging information; for some others, it may
actually generate an EIDS with the same radical appearing multiple times,
following the structure described in KanjiVG whether it's what was intended
or not.  As a result, not all entries in the dictionary will be right. 
However, only a few are affected by this issue.

As of March 2012, I (Matthew Skala, the author of IDSgrep) have become a
member of the KanjiVG project and there is some possibility that KanjiVG's
database design will change in a way that makes it easier to recover spatial
organization for searching with IDSgrep.

With the current versions of IDSgrep and KanjiVG, the KanjiVG-derived
dictionary contains 6660 entries covering all the popularly-used Japanese
kanji.  Note that the KanjiVG input file, and presumably the resulting
format-converted dictionary, are covered by a Creative Commons
Attribution--ShareAlike license, distinct from the GNU GPL applicable to
IDSgrep itself.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Interface to EDICT2}

Jim Breen's JMdict/EDICT project maintains a file called
EDICT2~\cite{EDICT2} which is more like a traditional dictionary, with
words and meanings, than a database of kanji.  Such dictionaries are
not the primary target of IDSgrep and IDSgrep's query syntax is not
perfectly suited to them.  However, the regular-expression matching
features may make it practical to search EDICT2 with IDSgrep, and
there is some value in being able to do sub-character structural
searches on multi-character words.

If another dictionary besides EDICT2 is available
(subject to configuration by \texttt{--enable-edict-decomp}), then the
build system will generate and install a
dictionary file called \texttt{edict.eids} which represents a database
join of EDICT2 with the other dictionary.  With no other dictionary,
the file can still be generated but will contain no character decomposition
information.
A sample entry might look like this:
\begin{verbatim}
【明】,<明>⿰日月⦅[みん] (n) Ming (dynasty of China)⦆
\end{verbatim}

The head for the entire entry is the head from the EDICT2 entry.  Then the
tree is a binary tree with a comma as the functor and the first child being
the entire decomposition dictionary entry for the first character.  The
second child represents the rest of the entry.  With a two-character or
longer head, this child would also be a binary comma with the second
character of the entry head as its first child.  In this way the characters
of the entry head are all represented as left children of commas, forming a
linked-list structure (much like a Prolog linked-list with commas instead of
dots as the functors).  The final child at the bottom is a nullary node
containing as its functor simply the rest of the EDICT2 entry.

The rationale for this syntax is that it allows a relatively simple way of
querying multi-character words in EDICT2 using the existing IDSgrep query
types.  To find an exact match, just query the head (which will require head
brackets and a semicolon if the query is more than one character long), as
in \texttt{idsgrep -de '<教育>;'}.  To search for the first few characters,
commas can be imagined as separators (though their actual function is quite
different) with a comma at the start and a question mark at the end, as in
\texttt{idsgrep -de ',教,育?'}.  These queries can be combined with the
sub-character breakdown queries already supported by the decomposition
dictionaries.  For instance, \texttt{idsgrep -de ',教,...|日月!,??'} will
search for, and give definitions of, words of exactly two characters in
which the first is \texttt{教} and the second character contains \texttt{日}
or \texttt{月} anywhere.  The restriction to exactly two characters is
accomplished by the sub-query ``\texttt{!,??}'', which fails to match on the
binary comma that would be present at that point in a longer word.

EDICT2 is under the Creative Commons Attribution--ShareAlike license.  Since
KanjiVG is as well, that license would presumably also apply to a combined
dictionary made from EDICT2 and KanjiVG.  An EDICT2-only dictionary with no
decompositions from other sources should similarly be under Creative Commons
Attribution--ShareAlike.  It might not be legal to distribute outside one's
own organization a dictionary formed by joining EDICT2 with CHISE IDS or
Tsukurimashou, because those sources are covered by versions of the GNU GPL,
which is not compatible with the Creative Commons license.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Interface to Tsukurimashou}

IDSgrep is closely connected with the Tsukuimashou font
family~\cite{Tsukurimashou}.  They have the same author; it was largely for
use in Tsukurimashou development that IDSgrep was developed at all; and
IDSgrep's source control system is a subdirectory within Tsukurimashou's. 
Building IDSgrep in conjunction with Tsukurimashou allows IDSgrep to extract
from the Tsukurimashou build system a dictionary of character decompositions
as they appear in Tsukurimashou.  The Tsukurimashou fonts are also necessary
to build this IDSgrep user manual.  However, IDSgrep is also distributed as a
separate package, because it will be of use to non-users of Tsukurimashou,
and the Tsukurimashou build system will not recurse into IDSgrep's directory
and build IDSgrep by default; only if requested.

IDSgrep is one of several parasite packages of Tsukurimashou, using a
mechanism introduced in Tsukurimashou 0.7 and IDSgrep 0.4.  Previous
versions used a different interface.

To build Tsukurimashou with IDSgrep:  specify the
``\texttt{-{}-enable-parasites}'' option to Tsukurimashou's
\texttt{configure} script with an appropriate value, such as
``\texttt{-{}-enable-parasites=idsgrep}''.  See the Tsukurimashou
documentation for other possible values of this option.  Building
Tsukurimashou will then implicitly build IDSgrep.  It should be possible to
pass IDSgrep \texttt{configure} options to Tsukurimashou's
\texttt{configure} script and have them automatically passed down the chain
(in the standard Autotools sub-package fashion) but that is not well-tested.

To build Tsukurimashou without IDSgrep:  this is the default when you run
the Tsukurimashou build from the root of the Tsukurimashou distribution. 
The IDSgrep source is included as a subdirectory in distributions of
Tsukurimashou, but only built on request.

For a more customized build of IDSgrep, with or without Tsukurimashou:  you
can also run IDSgrep's \texttt{configure} in its own directory, and then do
\texttt{make} (and the usual targets) there.  It will look for Tsukurimashou
(specifically, a build directory in which Tsukurimashou's \texttt{configure}
has \emph{already been executed}) as the parent directory and in a few other
places, or you can specify the location of a Tsukurimashou build with the
``\texttt{--with-tsuku-build}'' option to IDSgrep's \texttt{configure}.  If
Tsukurimashou is not available, IDSgrep will build without creating
the Tsukurimashou-derived dictionary file.

During IDSgrep's build, if it can access a Tsukurimashou build directory, it
will recursively call \texttt{make eids} on Tsukurimashou's build system. 
That is a hook that causes Tsukurimashou's build system to generate the EIDS
decomposition dictionary, which is then copied or linked back into IDSgrep's
build directory and can be installed with IDSgrep's \texttt{make install}. 
IDSgrep's build will also look in Tsukurimashou's build directory for the
font ``Tsukurimashou Mincho'' which is needed to build this user manual, and
will make recursive calls to \texttt{make} for Tsukurimashou to build that
if necessary.  This kind of upward-callback \texttt{make} invocation is a
little inefficient (in particular, it does not handle jobserver mode well)
so it is better, if you want both packages, to use the centralized
Tsukurimashou build system, which will do its own thing first and then call
IDSgrep's build near the end in a better-integrated way.  If you want to
run ``\texttt{make install}'' just on IDSgrep and not on Tsukurimashou
(which might be a reasonable thing to want because of operating system font
installation issues), you should run just ``\texttt{make}'' in
Tsukurimashou's directory, then \texttt{cd} to IDSgrep's directory and run
``\texttt{make install}.''

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{A note on TrueType/OpenType}

This version of IDSgrep is designed to read TrueType or OpenType files
(the distinction between the two is not relevant at this level) for
character map information.  The specification for the
TrueType/OpenType file format reads like a parody.  I'd like to take a
moment to complain about a few things.

\begin{itemize}
  \item Although the format contains binary fields which must be read
    in a specific byte order, one of the two magic numbers that can
    identify the file format (for a single font as opposed to a
    ``collection'') is 0x4F54544F, which is a palindrome at the byte
    level and thus useless for detecting byte order problems.
  \item The other possible magic number for non-collection files
    is 0x00010000, which is quite likely to occur in files that are
    not TrueType/OpenType files, making it harder to detect when one
    may have been passed a bad file.
  \item Many decades of research on error detection codes were ignored
    in the design of the OpenType checksum algorithm, which (among
    other issues) cannot detect any reordering of 32-bit words unless
    it crosses a table boundary.  At least the algorithm produces its
    meaningless results fast; yay, efficiency!
  \item There are 32-bit byte offsets referenced to the start of the file.
    There are 32- and 16-bit byte offsets referenced to the start of
    the current table.  There are 32- and 16-bit byte offsets
    referenced to the \emph{locations of the offset fields themselves}, so a
    field at offset 0x1234 referring to another field at offset 0x5678
    will contain 0x4444.  There are also indices measured in units
    other than bytes.
  \item There are variable-length objects not tagged
    with their lengths except indirectly: they are presumably
    contained entirely within larger objects that are tagged with lengths.
  \item Consider the cmap format 4 subtable, which Microsoft
    says is their preferred format.  It includes four
    variable-length arrays each containing segCount
    number of two-byte entries.  The value of segCount is not directly
    recorded anywhere, but these values are all required
    in the header:
    \begin{itemize}
      \item[$\circ$] $2 \cdot \textrm{segCount}$;
      \item[$\circ$] $2 \cdot 2^{\lfloor \log_2 \textrm{segCount} \rfloor}$;
      \item[$\circ$] $\log_2 (2 \cdot 2^{\lfloor \log_2 \textrm{segCount}
        \rfloor}/2)$ (which is described like that in the spec);
        and of course
      \item[$\circ$] $2 \cdot \textrm{segCount}-2 \cdot 2^{\lfloor \log_2
        \textrm{segCount} \rfloor}$.
    \end{itemize}
  \item The bizarre length-derived values in the format 4 header (and
    other similar sets of table-size-logarithm numbers that occur
    elsewhere in the file format) appear to be designed to support
    someone's binary search code.  Instead of computing those numbers
    itself starting from the length, the search code can just use
    values straight from the table to initialize its variables.
    Consider what would happen if someone actually did that as the
    designers apparently intended, and the numbers happened to be incorrect
    in the file.  If, for instance, numbers in the file were swapped
    around on 32-bit boundaries, the checksums wouldn't detect a problem; and
    the speed demons who think they need precomputed logarithms
    probably aren't wasting time checking checksums anyway.
    The code isn't checking whether the numbers are consistent (because
    to do that you would have to calculate them fresh, and then why bother
    storing them in the first place?), so it will end up ``searching'' into
    random areas of the file, or into uninitialized memory beyond.  Now
    think about the relative costs of disk reads, network transfer, and
    arithmetic, and consider whether having those values precalculated
    and stored in the file would actually save any time even if they
    could be trusted.
  \item The cmap format 4 subtable consists of, in this order: fixed-length
    stuff totalling 14 bytes; one variable-length array of length $2
    \cdot \textrm{segCount}$ bytes; \emph{one more two-byte
    fixed-length field}; three more variable-length arrays each of
    length $2 \cdot \textrm{segCount}$ bytes; and finally, one more
    variable-length array whose length is not directly specified
    anywhere but could presumably be inferred by subtracting from the
    known size of the overall table. The four $2 \cdot
    \textrm{segCount}$-byte arrays are actually the rearranged slices
    of a single logical array whose elements are four-field
    structures; but the extra reserved two bytes
    stuck in the middle of the table make a straightforward transposition
    impossible.  Four-tuples of the same kind with the same four fields
    also occur in the format 2 subtable; but there, they occur as a single
    array with each record written in an 8-byte block.
  \item It is an intended, documented feature that some of the
    variable-length arrays in TrueType/OpenType may overlap with each other. 
    As a result, bounds-checking, in addition to being intrinsically
    difficult because of the lack of information, would cause the
    reader to reject some files that the specification claims are
    legitimate.
  \item Code-injection bugs allowing execution of arbitrary
    code in a privileged context have been reported in software
    that implemented this file format without bounds-checking.  This should
    surprise no one.
\end{itemize}

IDSgrep attempts to do all reasonable bounds-checking on the fields it
needs, and to ignore fields it does not need; given a bad TrueType/OpenType
file, it is intended that IDSgrep should be able to make the best of it and
at worst fail gracefully with an error message.  It should not be possible
to crash IDSgrep by giving it a bad font file to read.

However, the nature of the file format means that at least in the
current version, we can't be confident all possible problems have
been foreseen and excluded.  Let me know if you find a font file that
makes IDSgrep crash and I'll try to fix it. IDSgrep probably should not be
allowed to read font files supplied by untrusted sources such as Web
users.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Invoking \texttt{idsgrep}}

The command-line \texttt{idsgrep} utility works much like most other
command-line programs, and like \texttt{grep}~\cite{grep} in particular.  It
takes options and other arguments.  The first non-option argument is an EIDS
representing the matching pattern, and any remaining non-option arguments
are taken as filenames to read.  If there are no filenames, \texttt{idsgrep}
will read from standard input.  Output always goes to standard output.

When there is more than one file being read (either by direct specification
or indirectly with the \texttt{-d} dictionary option), \texttt{idsgrep} will
preface each EIDS in its output with
``\texttt{:}$\langle$\textit{filename}$\rangle$\texttt{:}'' to indicate
in which file the EIDS was found.  Note that under the EIDS syntax rules,
that creates a unary node senior to the entire tree, so that the output
remains in valid EIDS format, except in the case of filenames containing
colons, which will be handled via backslash escapes in the future when those
are fully implemented for output.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Command-line options}

\begin{description}

\item[\texttt{-d}, \texttt{--dictionary}]
Read a dictionary
from the standard location.  There is a pathname for dictionaries hardcoded
into the \texttt{idsgrep} binary, generally
\{\emph{prefix}\}\texttt{/share/dict}, and if this option is given, its
argument (which may be empty) will be appended to the dictionary directory
path, followed by ``\texttt{*.eids},'' and then treated as a shell glob
pattern.  Any matching files are then searched in addition to those
otherwise specified on the command line.  A small added wrinkle is that when
more than one file is searched (resulting in
\texttt{:}\textit{filename}\texttt{:} tags on the output lines), any of them
that came from the \texttt{-d} option will be abbreviated by omitting the
hardcoded path name.  The purpose of this option is to cover the common case
of searching the installed dictionaries.  Just specifying ``\texttt{-d}''
will search all the installed dictionaries; specifying an abbreviation of
the dictionary name, as ``\texttt{-dt}'' or ``\texttt{-dk},'' will search
just the matching one; and it remains possible to specify a file exactly or
use standard input in the usual \texttt{grep}-like way.

\item[\texttt{-c}, \texttt{--cooking}]
Select the output generation and input canonicalization mode.  Requires one
argument, which may be one of the keywords \texttt{raw}, \texttt{rawnc},
\texttt{ascii}, \texttt{cooked}, or \texttt{indent}, to specify a preset
mode; or a string of up to twelve decimal digits to control the output in
more detail.  The default mode is \texttt{raw}.  See the
section on ``cooked output'' in this manual for more details.

\item[\texttt{-f}, \texttt{--font-chars}]
Read a font file and make its character coverage available as a user-defined
matching predicate through the ``\texttt{\#}'' matching operator.  In the
current version, this feature can only read TrueType and OpenType files that
contain Unicode (or near equivalent) mappings described with cmap subtable
types 0, 2, 4, 12, or 13.  This option may be specified multiple times, with
successive invocations corresponding to user-defined predicates 1, 2, 3, and
so on.  The maximum number of user-defined predicates is limited to the
number of bits in the largest integer type available to the C compiler; 32
or 64 on many systems.

\item[\texttt{-U}, \texttt{--unicode-list}]
Generate a dictionary of Unicode code points, and read that before
reading any other dictionaries or input files that may be specified.
The generated dictionary consists of a single line for each of the
code points U+0000 through U+10FFFF in ascending order, excluding the
surrogates but not any other invalid or non-character code points; on
each line, there is a tree whose head is the character and whose body
is either a nullary semicolon or (if the optional argument to
\texttt{-U} was specified) a nullary functor containing
semicolon-separated pieces of information selected by the characters of the
optional argument.  Characters permitted in the argument
are ``\texttt{b}'' for the Unicode block name; ``\texttt{d}'' for the
decimal value of the code point; and ``\texttt{x}'' for the
hexadecimal value with ``U+.''  For example, specifying
``\texttt{-Uxdb}'' will generate and scan a dictionary that includes
the line ``\texttt{<A>(U+0041;65;Basic Latin)}.''
This option is intended to be used together with \texttt{-f} to produce
font coverage lists.

\item[\texttt{-V}, \texttt{--version}] Display the version and license
information for IDSgrep.

\item[\texttt{-h}, \texttt{--help}] Display a short summary of these
options.

\end{description}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Environment variables}

The \texttt{idsgrep} utility recognizes just one environment variable,
\texttt{IDSGREP\_DICTDIR}, which if present specifies a directory for the
\texttt{-d} option to search instead of its hardcoded default.

Note that \texttt{idsgrep} does not pay attention to any other environment
variables, and in particular, not \texttt{LC\_ALL} and company.  The input
and output of this program are always UTF-8 encoded Unicode \emph{regardless
of locale settings}.  Since the basic function of this program is closely
tied to the Unicode-specific ``ideographic description characters,'' it
would be difficult if not impossible for it to work in any non-Unicode
locale.  Predictability is also important because of the likely usefulness
of this software in automated contexts; if it followed locale environment
variables, many users would have to carefully override those all the time to
be sure of portability.  Instead of creating that situation,
\texttt{idsgrep} by design has a consistent input and output format on all
systems and users are welcome to pipe things through
a conversion program if necessary.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Technical details}

\noindent

This section is intended to describe IDSgrep's syntax and matching procedure
in complete precise detail; and those things are, in turn, designed to be
powerful rather than easy.  As a result, the description may be confusing
for some users.  See the examples in the ``Quick start'' section for a more
accessible introduction to how to use the utility.

The system is best understood in terms of three interconnected major
concepts:
\begin{itemize}
  \item an abstract data structure;
  \item a syntax for expressing instances of the data structure as
    ``Extended Ideographic Description Sequences'' (EIDSes);
  \item a function for determining whether two instances of the data
    structure ``match.''
\end{itemize}

Then the basic function of \texttt{idsgrep} is to take one EIDS as a
matching pattern, scan a file containing many more, and write out the ones
that match the matching pattern.  The three major concepts are described,
one each, in the following sections.  A final section describes options for
how the command-line \texttt{idsgrep} program
generates EIDS syntax on output.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{The data structure}

An \emph{EIDS tree} consists of the following:

\begin{itemize}
  \item An optional \emph{head}, which if present consists of a nonempty
    string of Unicode characters.
  \item A required \emph{functor}, which is a nonempty string of Unicode
    characters.
  \item A required \emph{arity}, which is an integer from 0 to 3 inclusive.
  \item A sequence of \emph{children}, of length equal to the arity (no
    children if arity is zero).  Each child is, recursively, an EIDS tree.
\end{itemize}

Trees with arity zero, one, two, and three are respectively called
nullary, unary, binary, and ternary.

Note that these ``nonempty strings of Unicode characters'' will very often
tend to be of length one (single characters) but that is not a requirement. 
They cannot be empty (length zero); the case of a tree without a head is
properly described by ``there is no head,'' not by ``the head is the empty
string.'' \emph{At present} no Unicode canonicalization is performed, that
being left to the user, but this may change in the future.  Zero bytes
(U+0000) are in principle permitted to occur in EIDS trees, but because
Unix passes command-line arguments as null-terminated C strings, they can
only be entered in matching patterns via backslash escape sequences.

Typically, these trees are used to describe kanji characters.  The literal
Unicode character being described will be the head, if there is a code point
for it; the functor will be either an ideographic description character like
\texttt{⿱} if the character can be subdivided, or else nullary \texttt{;}
if not.  Then the children will correspond to the parts into which it can be
decomposed.  Some parts of the character may also be available as
characters with Unicode code points in their own right; in that case, they
will have heads of their own.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{EIDS syntax}

Unicode's IDS syntax serves a similar purpose to IDSgrep's extended IDS
syntax, but it lacks sufficient expressive power to cover some of IDSgrep's
needs.  Nonetheless, EIDS syntax is noticeably derived from that of Unicode
IDSes.  Broadly speaking, EIDSes are IDSes extended to include heads (which
we need for partial-character lookup); bracketed strings as functors (which
we need for capturing arbitrary data); and with arbitrary limits on allowed
characters and length relaxed (needed for complex characters and so that
matching patterns can be expressed in the same syntax).

Here are some sample EIDSes:
\begin{verbatim}
大
⿱田⿰虫⿱土土
⿸厂⿱今止
【萌】⿱艹<明>⿰日月
【店】⿸广<占>⿱卜口 
⿱艹⿰日?
&...男...女
[tb]艹[or][lr]?日[lr]日?
\end{verbatim}

The first three of these examples are valid in the Unicode IDS syntax.  The
next two contain heads, and are typical of what might exist in a dictionary
designed to be searched by the \texttt{idsgrep} command-line utility.  The
last three might be matching patterns a user would enter.

EIDS trees are written in a simple prefix notation that could be called
``Polish notation'' inasmuch as it is the reverse of ``reverse Polish
notation.'' To write a tree, simply write the head if there is one, the
functor, and then if the tree is not nullary, write each of the children. 
Heads and the functors of trees of different arity are (unless otherwise
specified below) written enclosed in different kinds of brackets that
indicate the difference between heads and functors, and the arity of the
tree when writing a functor.

The basic ASCII brackets for heads and functors are as follows:

\hspace*{\fill}
\begin{tabular}{cccc}
  head & \texttt{<} & \texttt{>} & \texttt{<example>} \\
  nullary functor (0) & \texttt{(} & \texttt{)} & \texttt{(example)} \\
  unary functor (1) & \texttt{.} & \texttt{.} & \texttt{.example.} \\
  binary functor (2) & \texttt{[} & \texttt{]} & \texttt{[example]} \\
  ternary functor (3) & \texttt{\{} & \texttt{\}} & \texttt{\{example\}}
\end{tabular}
\hspace*{\fill}\par

Note that the opening and closing brackets for unary functors are both equal
to the ASCII period, U+002E.

Some sequences of Unicode characters beginning with
``\texttt{\textbackslash}'' (ASCII backslash, U+005C) are treated specially. 
Backslash followed by a character from a short list of ASCII Latin letters
introduces an escape sequence used to substitute for a character that would
otherwise be hard to type; backslash followed by any other character
(including a second backslash) is equivalent to the other character, but
without any special meaning it would otherwise have had.  Thus, backslash
can be used for instance to include literally in a bracketed string the
closing bracket that otherwise would mark the end of the string.

The backslash-letter escapes are listed below.  Note that the letters
identifying the type of escape sequence are case-sensitive, and all are
lower-case except ``\texttt{\textbackslash X}.''  However, for sequences
that take a parameter, the parameters are not case-sensitive.  Note that all
characters inside an escape sequence must be literal ASCII, except in the
``default'' case of a single backslash used to escape a single non-ASCII
character.  It is not permitted to use recursive backslash escapes to create
some of the characters that make up a multi-character escape sequence like
``\texttt{\textbackslash x\{\}}.''

\hspace*{\fill}
\begin{tabular}{ll}
  \texttt{\textbackslash a} & ASCII BEL (U+0007) \\
  \texttt{\textbackslash b} & ASCII BS (U+0008) \\
  \texttt{\textbackslash c}$X$ & ASCII control character $X$ \\
  \texttt{\textbackslash e} & ASCII ESC (U+001B) \\
  \texttt{\textbackslash f} & ASCII FF (U+000C) \\
  \texttt{\textbackslash t} & ASCII HT (U+0009) \\
  \texttt{\textbackslash n} & ASCII LF (U+000A) \\
  \texttt{\textbackslash r} & ASCII CR (U+000D) \\
  \texttt{\textbackslash x}$HH$ & two-digit Unicode hex \\
  \texttt{\textbackslash X}$HHHH$ & four-digit Unicode hex \\
  \texttt{\textbackslash x\{}$Hx$\texttt{\}}
    \texttt{\textbackslash X\{}$Hx$\texttt{\}} &
    variable-length Unicode hex
\end{tabular}
\hspace*{\fill}\par

The \texttt{\textbackslash c} escape takes a parameter consisting of a
single ASCII Latin letter character (only); it is equivalent to typing Ctrl
plus that letter (case insensitive) on a standard keyboard, that is the
ASCII control code in the range U+0001 to U+001A obtained by subtracting 64
from the uppercase letter's ASCII code or 96 from the lowercase letter's
ASCII code.

The hexadecimal escapes \texttt{\textbackslash x} and \texttt{\textbackslash
X} offer a choice of two-digit, four-digit, or variable-length (enclosed by
curly braces) hexadecimal specification of Unicode code points.  The hex
codes are case-insensitive.  Values greater than 10FFFF, and therefore
outside the Unicode range, will be replaced by the Unicode replacement
character U+FFFD.

Parsing of bracketed strings has a few features worth noting.  First, there
is no special treatment of nested brackets.  After the ``\texttt{<}'' that
begins a head, for instance, the next unescaped ``\texttt{>}'' will end the
head, regardless of how many other instances of ``\texttt{<}'' have been
seen.  However, because no head or functor can be less than one character
long, a closing bracket immediately after the opening bracket (which would
otherwise create an illegal empty string) is specially treated as the first
character of the string and \emph{not} as a closing bracket.  Thus,
``\texttt{())}'' is legal syntax for a functor equal to a closing
parenthesis, in a nullary tree; and ``\texttt{...}'' is a functor equal to a
single ASCII period in a unary tree, an important example because it is the
commonly-used match-anywhere operator.  A bracket character specified via a
backslash escape, whether by preceding the literal character with a
backslash or by giving its hexadecimal code in a ``\texttt{\textbackslash
x}'' or ``\texttt{\textbackslash X}'' construction, is never taken to start
or end a bracketed string.

Each pair of ASCII brackets also has two pairs of generally
non-ASCII synonyms, as
follows:

{\ttfamily\hspace*{\fill}
\begin{tabular}{cccccc}
  <&>&【&】&〖&〗\\
  (&)&(&)&⦅&⦆\\
  .&.&:&:&・&・\\\relax
  [&]&[&]&〚&〛\\
  \{&\}&〔&〕&〘&〙
\end{tabular}
\hspace*{\fill}\par}

The closing synonymous brackets for functors of unary trees are always
identical to the opening brackets.  A string may be opened by any of the
three opening bracket characters for its type of string; but then it must be
closed by the closing bracket character that goes with the opening bracket. 
Brackets from other pairs are taken literally and do not end the string.
For instance,
``\texttt{【<example>】}'' is a head whose value consists of
``\texttt{<example>}'' including the ASCII angle brackets.  There are
several reasons for the existence of the synonyms:

\begin{itemize}
  \item They look cool.
  \item There is an established tradition of using \texttt{【}lenticular
    brackets\texttt{】} for heads in printed dictionaries, which is exactly
    their meaning here.
  \item Allowing ASCII colons to bracket unary-node functors makes possible
    a more appealing and \texttt{grep}-like syntax for \texttt{idsgrep}'s
    output in the case of processing multiple input files.
  \item Allowing more than one way to bracket each kind of string makes it
    easier to express bracket characters that may occur literally in a string.
  \item The non-ASCII brackets may be easier to type without switching modes
    in some input methods.
  \item On the other hand, keeping an ASCII option for every bracket type
    allows matching patterns to be entered on ASCII-only terminals.
  \item Multiple bracket types allow for creating human-visible
    computer-invisible distinctions in dictionary files, for instance to
    flag pseudo-entries that contain metadata, without needing to create a
    special syntax for comments.
\end{itemize}

If a character other than an opening bracket occurs unescaped
in an EIDS where an opening bracket would be expected, it is treated
in one of three ways.

\begin{itemize}
  \item ASCII whitespace and control characters, U+0000 to U+0020 inclusive,
    are ignored.  In the future, this treatment might be extended to
    non-ASCII Unicode whitespace characters, which are best avoided because
    of the uncertainty.
  \item Some special characters, such as ``\texttt{⿰},'' have ``sugary
    implicit brackets.''  If one of these characters appears outside of
    brackets, it will be interpreted as a functor whose value is a
    single-character string equal to the literal character, and a fixed
    arity that depends on which character it is.  For instance,
    ``\texttt{⿰}'' and ``\texttt{[⿰]}'' will be parsed identically.
    A list of characters getting this treatment is below.
  \item Any other non-bracket character has a ``syrupy implicit semicolon.''
    That means it will be interpreted as a complete nullary tree with
    a single-character head equal to the literal character, and a
    single semicolon as the functor.  For instance, ``\texttt{x}'' and
    ``\texttt{<x>(;)}'' will be parsed identically.  Because semicolon
    itself has sugary implicit nullary brackets, we could also write
    ``\texttt{<x>;}'' for the same effect.
\end{itemize}

Here are all the characters that have sugary implicit brackets, with the
brackets they imply:  {\ttfamily (;) (?) .!. ./. .=. .*. .@. .\#. [\&] [,]
[|] [⿰] [⿱] [⿴] [⿵] [⿶] [⿷] [⿸] [⿹] [⿺] [⿻] \{⿲\} \{⿳\}}

Note that the sugary and syrupy implications of a character are only
relevant when the character occurs where an opening bracket of some type
would otherwise be required; inside a bracketed string, characters are taken
literally unless they end the string or make up escape sequences. 
Characters created by escape sequences are always syrupy outside a
string and always literal inside a string; they never start or end bracketed
strings nor have any special sugary meaning they would otherwise have.

Characters, for the purposes of EIDS parsing, are strictly single Unicode
code points.
Such things as combining accents and variation selectors are parsed as
separate characters from the bases to which they may be applied.
The sugary and syrupy parsing rules apply only to single characters.
Thus, appropriate brackets are necessary whenever a sequence containing
more than one code point is to be treated as a single head or functor.

It is an intentional consequence of these rules that all syntactically valid
Unicode IDSes are syntactically valid EIDSes, but the converse is not true. 
CHISE IDS extended IDSes can easily be converted to this syntax but in
general are not valid IDSgrep EIDSes in themselves.

Although it is technically not a parsing issue but rather a
transformation applied to the tree after parsing, there is one more
issue to mention: some functors have aliases.  If a functor and arity
matches one of the aliases on the following list, it will be replaced
with the indicated single-character functor.  The idea is to provide
verbose ASCII names for single-character functors of special
importance to the matching algorithm.  Note that the single-character
versions are always the canonical ones, and although the brackets are
shown explicitly for clarity, they are nearly all characters from the
``sugary implicit'' list.  This feature may be disabled or modified using
some settings of the ``\texttt{-c}'' command-line option; see the section
on output cooking for more information.

\texttt{\begin{tabular}{cccccc}
  (anything) & $\Rightarrow$ & (?) & .anywhere. & $\Rightarrow$ & ... \\
  .not. & $\Rightarrow$ & .!. & .regex. & $\Rightarrow$ & ./. \\
  .equal. & $\Rightarrow$ & .=. & .unord. & $\Rightarrow$ & .*. \\
  .assoc. & $\Rightarrow$ & .@. & .user. & $\Rightarrow$ & .\#. \\\relax
  [and] & $\Rightarrow$ & [\&] & [or] & $\Rightarrow$ & [|] \\\relax
  [lr] & $\Rightarrow$ & [⿰] & [tb] & $\Rightarrow$ & [⿱] \\\relax
  [enclose] & $\Rightarrow$ & [⿴] & [wrapu] & $\Rightarrow$ & [⿵] \\\relax
  [wrapd] & $\Rightarrow$ & [⿶] & [wrapl] & $\Rightarrow$ & [⿷] \\\relax
  [wrapul] & $\Rightarrow$ & [⿸] & [wrapur] & $\Rightarrow$ & [⿹] \\\relax
  [wrapll] & $\Rightarrow$ & [⿺] & [overlap] & $\Rightarrow$ & [⿻] \\
  \{lcr\} & $\Rightarrow$ & \{⿲\} & \{tcb\} & $\Rightarrow$ & \{⿳\}
\end{tabular}}

The \texttt{idsgrep} command-line utility attempts to follow Postel's Law
with respect to byte sequences that are not valid UTF-8: ``be conservative
in what you do, be liberal in what you accept from
others.''~\cite{Postel:TCP} Jesus of Nazareth stated a similar principle
somewhat earlier.\footnote{``There is nothing from without a man, that
entering into him can defile him: but the things which come out of him,
those are they that defile the man.''~(Mark 7:15, KJV)} Accordingly, invalid
UTF-8 on input is not in general treated as a fatal error.  Handling of
invalid UTF-8 represents a delicate balance of security issues: if invalid
UTF-8 is treated as completely fatal, that creates the possibility for
denial of service attacks, but if it is permitted to too great an extent, it
can create opportunities for things like buffer overflows.  In general, the
\texttt{idsgrep} utility will not itself break when given bad UTF-8, nor
will it make matters worse compared to a system that did not include
\texttt{idsgrep}, but \texttt{idsgrep} cannot be counted on to actively
protect some other piece of software that would otherwise be vulnerable to
bad UTF-8.\footnote{Genesis 4:9.}

The parser will skip over (as if they did not exist at all) byte sequences
that are not valid UTF-8, including the forbidden bytes 0xC0, 0xC1, and 0xF5
through 0xFF; continuation bytes outside valid multibyte sequences;
``overlong'' sequences (those that would otherwise be valid, but encode a
given code point other than in the shortest possible way); surrogates; and
sequences that encode code points outside the Unicode range.  Depending on
where they occur within a multibyte sequence, some of these things may
result in the whole sequence being skipped instead of just the bad bytes,
with the parser making its best guess as to what that means.  Be aware that
some other software may treat some of these things as valid.

When a code point outside the Unicode range, or a surrogate, is specified
using a backslash hexadecimal escape, the parser will interpret it as if the
substitute character U+FFFD had been specified instead.  All UTF-8 sequences
\emph{actually generated by} the \texttt{idsgrep} program are guaranteed to
be valid UTF-8, barring serious programming errors; and matching operations
including PCRE matches occur only on the parsed internal representation
which is valid UTF-8.  Note that PCRE, despite having a deprecated syntax
for sub-encoding byte matching, \emph{cannot} be used to detect invalid bytes
that the \texttt{idsgrep} parser skipped; it sees only what the parser
validly parsed.  However, since in its default mode the \texttt{idsgrep}
program will echo through to the output the exact input byte sequence that
was parsed to create a tree, not the internal representation, it is possible
that non-UTF-8 input could result in non-UTF-8 output.  Several cooked
output modes, in which \texttt{idsgrep} generates its own UTF-8 from the
internal representation and provides guarantees of valid UTF-8 or even valid
ASCII output, are available but non-default.

Some byte sequences that are valid UTF-8 but not valid Unicode, for instance
the sequence that encodes a reversed byte order mark, may possibly go
undetected in the input and be allowed in the output, even when cooked, by
the current version of \texttt{idsgrep}.  It is intended that
\texttt{idsgrep} should detect that kind of thing where it is reasonable to
do so, and future versions may do it better than this one does; but some
higher-level errors in Unicode usage, such as misuse of combining characters
or variation selectors, will probably never fall within the scope of
\texttt{idsgrep}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Matching}

The basic function of the \texttt{idsgrep} command-line utility is to
evaluate each item in the database against a matching pattern.  The matching
patterns are similar in spirit to the ``regular expressions'' common
throughout the Unix world; however, for theoretical and practical reasons
standard regular expressions would be unsuitable for the applications
considered by IDSgrep.

The main theoretical issue is that IDSes, whether IDSgrep-style ``extended''
or Unicode-style traditional ones, belong to the class of
\emph{context-free} languages.  They describe tree-like structures nested to
arbitrary depth, similar in nature to programming-language expressions
containing balanced parentheses although balanced parantheses as such are
not actually part of EIDS syntax.  The natural way to parse these involves
an abstract machine with a stack-like memory that can assume an infinite
number of different states.  Regular expressions can only be used to
recognize the smaller, simpler class of \emph{regular} languages, parsable
by an abstract machine with a finite-state memory.  It is not possible to
write a correct regular expression that will match balanced parentheses. 
Some advanced software implementations of so-called ``regular expressions''
(for instance, Perl's) contain special features that make them more powerful
than the standard theoretical model, so that they are capable of recognizing
some languages that are non-regular, including balanced parentheses.  It is
also possible to fake a stack with a finite depth limit by writing a
complicated regular expression, and that may be good enough in some
practical cases.  Some users may also settle for just doing a substring
query with \texttt{grep} and calling the result close enough.  But IDSgrep
tries to do it in a way that is really right, and that is described
precisely in this section.

We will define a function $\textit{match}(x,y)$ which takes two EIDS trees
as input and returns a Boolean value of true or false.  We
call $x$ the \emph{pattern} or \emph{needle} and $y$ the \emph{subject} or
\emph{haystack}.  The \texttt{idsgrep} command-line utility generally takes
$x$ from its command line and repeatedly evaluates this function for each
EIDS it reads from its input; it then writes out all the values of $y$ for
which $\textit{match}(x,y)$ is true.

The $\textit{match}(x,y)$ function is defined as follows:
\begin{itemize}
  \item If $x$ and $y$ both have heads, then $\textit{match}(x,y)$ is true
    if and only if their heads are identical.  Nothing else is examined (in
    particular, not the children).  Then the two cases below do not apply.
  \item If $x$ and $y$ do not both have heads, then
    $\textit{match}(x,y)=\textit{match}'(x,y)$, whose value
    generally depends on the functor and arity of $x$.  The
    $\textit{match}'$ function has many special cases described in the
    subsections below, expressing different kinds of special matching
    operations.  These operations roughly correspond to the ASCII
    characters with sugary implicit brackets in EIDS syntax.  They are
    shown with brackets for clarity in the discussion below, but users
    would generally type them without the brackets and depend on the
    sugar in actual use.
  \item If none of the subsections below applies, then
    $\textit{match}'(x,y)$ is true if and only if $x$ and $y$ have identical
    functors, identical arities, and $\textit{match}(x_i,y_i)$ is true
    recursively for all their corresponding children $x_i,y_i$.  Note that
    $\textit{match}'$ recurses to $\textit{match}$, not itself, so
    there is a chance for head matching on the children even if it was
    not relevant to the parent nodes.
\end{itemize}

\subsection{Match anything}

The value of $\textit{match}'(\texttt{(?)},y)$ is always true.  Thus,
\texttt{?}\ can be used as a wildcard in \texttt{idsgrep} patterns to match
an entire subtree regardless of its structure.  Mnemonic:  question mark
is a shell wildcard for matching a single character.
The verbose ASCII name for ``\texttt{(?)}'' is ``\texttt{(anything)}.''

\subsection{Match anywhere}

The value of $\textit{match}'(\texttt{...}x,y)$ is true if and only if there
exists any subtree of $y$ (including the entirety of $y$) for which
$\textit{match}(x,y)$ is true.  In other words, this will look for an
instance of $x$ anywhere inside $y$ regardless of nesting level.  Mnemonic:
three dots suggest omitting a variable-length sequence, in this case the
variable-length chain of ancestors above $x$.
The verbose ASCII name for ``\texttt{...}'' is ``\texttt{.anywhere.}.''

\subsection{Match children in any order}

The value of $\textit{match}'(\texttt{.*.}x,y)$ is true if and only if there
exists a permutation of the children of $y$ such that $\textit{match}(x,y')$
is true of the resulting modified $y'$.  For instance, \texttt{*[a]bc}
matches both \texttt{[a]bc} and \texttt{[a]cb}.  This is obviously a
no-operation (matches simply if $x$ matches $y$, as if the asterisk were not
applied) for trees of arity less than two.  Mnemonic: asterisk is a general
wildcard, and this is a general matching operation.
The verbose ASCII name for ``\texttt{.*.}'' is ``\texttt{.unord.}.''

\subsection{NOT}

The value of $\textit{match}'(\texttt{.!.}x,y)$ is true if and only if
$\textit{match}(x,y)$ is false.  It matches any tree \emph{not} matched by
$x$ alone.  Mnemonic:  prefix exclamation point is logical NOT in many
programming languages.
The verbose ASCII name for ``\texttt{.!.}'' is ``\texttt{.not.}.''

\subsection{AND}

The value of $\textit{match}'(\texttt{[\&]}xy,z)$ is true if and only if
$\textit{match}(x,z) \wedge \textit{match}(y,z)$.  In other words, it
matches all trees that are matched by both $x$ and $y$; the set of strings
matched by $\texttt{[\&]}xy$ is the intersection of the sets matched by $x$
and by $y$.  Mnemonic: ampersand is logical or bitwise AND in many
programming languages.
The verbose ASCII name for ``\texttt{[\&]}'' is ``\texttt{[and]}.''

\subsection{OR}

The value of $\textit{match}'(\texttt{[|]}xy,z)$ is true if and only if
$\textit{match}(x,z) \vee \textit{match}(y,z)$.  In other words, it matches
all trees that are matched by at least one of $x$ or $y$; the set of strings
matched by $\texttt{[|]}xy$ is the union of the sets matched by $x$ and by
$y$.  Mnemonic: ASCII vertical bar is logical or bitwise OR in many
programming languages.
The verbose ASCII name for ``\texttt{[|]}'' is ``\texttt{[or]}.''

\subsection{Literal tree matching}

If $x$ and $y$ both have heads, then the value of
$\textit{match}'(\texttt{.=.}x,y)$ is true if and only if those heads are
identical.  Otherwise, it is true if and only if $x$ and $y$ have identical
functors, identical arity, and $\textit{match}(x_i,y_i)$ is true for each of
their corresponding children.

The effect of this operation is to ignore any special $\textit{match}'()$
semantics of $x$'s functor; the trees are compared as if that functor were
just an ordinary string, regardless of whether it might normally be special. 
Note that the full $\textit{match}()$ is still done on the children with
only the root taken literally; to do a completely literal match of the
entire trees it is necessary to insert an additional copy of \texttt{.=.}
above every node in the matching pattern, or at least every node that would
otherwise have a special meaning for $\textit{match}'()$, and even then
heads will continue to have their usual effect of overriding
recursion.\footnote{It may be interesting to consider how one could write a
pattern to test absolute identity of trees, with each node matching if and
only if its head or lack thereof is identical to the desired target as well
as the functors and arities matching and the same being true of all
children.} Mnemonic: equals sign suggests the literal equality that is being
tested rather than the more complicated comparisons that might otherwise be
used.
The verbose ASCII name for ``\texttt{.=.}'' is ``\texttt{.equal.}.''

For instance, this feature could allow searching for a unary tree whose
functor actually is \texttt{!}, where just specifying such a tree directly
as the matching pattern would instead (under the rule for ``NOT'' above)
search for trees that do not match the only child of \texttt{!}.  In the
original application of searching kanji decomposition databases this
operation is unlikely to be used because the special functors do not occur
anyway, but it seems important for potential applications of IDSgrep to more
general tree-querying, because otherwise some reasonable things people might
want to look for could not be found at all.

\subsection{Associative matching}

The value of $\textit{match}'(\texttt{.@.}x,y)$ is calculated as follows. 
Create a new EIDS tree $x'$, initially equal to $x$, which has the property
that its root may be of unlimited arity.  Then for every child of $x'$ whose
functor and arity are identical to the functor and arity of $x$, replace
that child in $x'$ with its children, in order.  Repeat that operation until
no more children of $x'$ have functor and arity identical to the
functor and arity of $x$.  Compute $y'$ from $y$ by the same process.  Then
$\textit{match}'(\texttt{.@.}x,y)=\textit{match}(\texttt{.=.}x',y')$.

This matching operator is intended for the case of three or more
things combined using a binary operator that has, or can be said to sometimes
have, an associative law.  For instance, the kanji \texttt{怠} could be
described by ``\texttt{⿱⿱厶口心}'' (\texttt{⿱厶口} over \texttt{心}) or
by ``\texttt{⿱厶⿱口心}'' (\texttt{厶} over \texttt{⿱口心}).  Unicode
might encourage use of the ternary operator \texttt{⿳} for this particular
case instead, but that does not cover all reasonably-occurring cases, and
the default databases seldom if ever use the Unicode ternary operators.

The difference between the representations is sometimes useful information
that the database \emph{should} retain; for instance, in the case of
Tsukurimashou, ``\texttt{⿱⿱厶口心},'' ``\texttt{⿱厶⿱口心},'' and
``\texttt{⿳厶口心}'' would correspond to three very different
stanzas of MetaPost source code, and the user might want a query
that separates them.  On the other hand, the user might instead have a more
general query along the lines of ``find three things stacked vertically with
\texttt{心} at the bottom'' and intend that that should match both cases of
binary decomposition.  The at-sign matching operation is meant for queries
that don't care about the order of binary operators; without it, matching
will by default follow the tree structure strictly.

Note that even with \texttt{.@.}, IDSgrep will not consider binary operators
in any way interchangeable with ternary ones; users must still use
\texttt{.|.} to achieve such an effect if desired.  Although the at-sign is
fully defined for all arities, it is only intended for use with binary
trees.  Note also that \texttt{.@.} and \texttt{.*.} behave according to
their definitions.  Incautious attempts to use them together will often fail
to have the desired effects, because the definitions do not include special
exceptions that some users might intuitively expect for these two operators
happening to occur near each other.  In a pattern like
``\texttt{*@[a][a]bcd},'' \texttt{.*.} will recognize \texttt{.@.} as the
functor of a unary tree and expand the single permutation of its one child,
and so that pattern will match the same things as if the asterisk had not
been present, namely ``\texttt{[a][a]bcd}'' and ``\texttt{[a]b[a]cd]}'' but
not, for instance, ``\texttt{[a][a]dcb}.'' In a pattern like
``\texttt{@[a]b*[a]cd},'' \texttt{.@.} will recognize \texttt{.*.} as a
different arity and functor from \texttt{[a]} and choose not to expand it in
$x'$, with the result that that pattern matches the same things as if the
at-sign had not been present, namely ``\texttt{[a]b[a]cd}'' and
``\texttt{[a]b[a]dc}'' but not ``\texttt{[a][a]bcd}'' nor
``\texttt{[a][a]bdc}.''

When considered as an operation on trees, what \texttt{.@.} does is
fundamentally the same thing as the algebraic operation that considers
$(a+b)+c$ equivalent to $a+(b+c)$, and for that reason it is called
``associative'' matching.  The mnemonic for at-sign is that it is a fancy
``a'' for ``associative.''
The verbose ASCII name for ``\texttt{.@.}'' is ``\texttt{.assoc.}.''

\subsection{Regular expression matching}

If $x$ and $y$ both have heads, then $\textit{match}'(\texttt{./.}x,y)$
is true if and only if the head of $x$, considered as a regular
expression, matches the head of $y$. If $x$ and $y$ do not both have
heads, then $\textit{match}'(\texttt{./.}x,y)$ is true if and only if
$x$ and $y$ have the same arity, the functor of $x$ considered as a
regular expression matches the functor of $y$, and
$\textit{match}(x_i,y_i)$ is true for each of their corresponding
children.  This operation is basically the same as the default
matching operation, except that regular expression matching is used
instead of strict equality for testing the heads and functors.
Mnemonic: slash means regular expression matching in Perl.
Verbose ASCII name: ``\texttt{.regex.}.''

Regular expression matching for the purposes of this operator is as
defined by the Perl Compatible Regular Expressions library, in
whichever version was linked with the \texttt{idsgrep} utility.
Strings are passed into PCRE as UTF-8, and are guaranteed (because the
EIDS parser decodes and re-encodes \texttt{idsgrep}'s input for
internal use) to be valid UTF-8 when PCRE sees them regardless of user
input; as such, PCRE is given the option flags that make it read UTF-8
without doing its own validity check.  Use of the PCRE
``\texttt{\textbackslash C}'' syntax for matching individual octets
within UTF-8 is strongly not recommended. All other PCRE options are
left to the defaults chosen when PCRE was compiled, even if those are
silly.  The character tables are PCRE's ``C locale'' defaults, not
generated at runtime from the current locale. Things like case
sensitivity can be controlled within the pattern using PCRE's syntax
for doing so.  In the event that \texttt{idsgrep} was compiled without
the PCRE library (which is not recommended, but is possible), or that
PCRE was compiled without UTF-8 support, then an attempt to evaluate
the slash operator will trigger a fatal error.

A matching pattern given to PCRE will have already passed through the
EIDS parser, which removes one level of backslash escaping.  The
pattern may also have been passed as a command-line argument to
\texttt{idsgrep} by a shell, which may have undone another level of
backslash escaping.  Thus, it may be necessary to escape characters as
many as three times in order to match them literally with the slash
operator.  Each of these levels may differ from the others in terms of
the escape sequences it supports and their exact meanings.  In many
cases it doesn't really matter which level of processing
evaluates the escaping.  For instance, ``\texttt{idsgrep
"/(\textbackslash t)"},'' (shell
evaluates ``\texttt{\textbackslash t},'' EIDS and PCRE see a literal tab);
``\texttt{idsgrep "/(\textbackslash\textbackslash
t)"},'' (shell removes one backslash, EIDS evaluates
``\texttt{\textbackslash t},'' PCRE sees a literal tab); and ``\texttt{idsgrep
"/(\textbackslash\textbackslash\textbackslash\textbackslash
t)"},'' (shell removes two backslashes, EIDS removes
one, PCRE evaluates ``\texttt{\textbackslash t}'') will all match the same
things.  If it matters, however, then caution is necessary.

PCRE because of the limitations of its API effectively forbids zero bytes
(U+0000) in its matching patterns, whereas EIDS allows them to exist within
strings in general.  The complexities of PCRE pattern syntax make it
impractical for \texttt{idsgrep} to automatically escape zero bytes before
passing the strings to PCRE; there are too many different cases possible for
the context in which a zero byte might occur.  Since the \texttt{idsgrep}
utility takes its matching patterns from the Unix command line anyway, and
Unix itself forbids literal zero bytes in command-line arguments, the case
of literal zero bytes in a matching pattern can only occur when they are
created deliberately by escape sequences at the level of the EIDS parser;
and the simplest advice to users is ``don't do that!''

Python, which like EIDS allows strings to contain zero bytes but has PCRE
bindings and so faces the same issue, briefly attempted to work around this
PCRE API limitation by auto-escaping.  They eventually gave it up as too
complicated and confusing.  The consequence of PCRE's API design is that if
the string given as a matching pattern contains a literal zero byte then the
regular expression to be matched will consist of the prefix of the string up
to but not including the first zero byte; anything after that will be
ignored.  Zero bytes are, nonetheless, permitted in the matching subject,
and PCRE can search for them, but not by means of literal zero bytes in the
pattern.  For instance, the PCRE syntax ``\texttt{\textbackslash000}'' (or
just ``\texttt{\textbackslash0}'' if the next character will
not be an octal digit) matches a zero byte.  As discussed above,
additional escaping might be needed to ensure that PCRE, and not EIDS nor
the shell, interprets the backslash escape.

\subsection{User-defined matching predicates}

It is assumed that by some out-of-band means, we have defined a
family of functions $U_i()$ for $i$ from $1$ up to some $k$.  These
functions take EIDS trees as input and return Boolean values (hence
``predicates'').

Then the value of $\textit{match}'(\texttt{.\#.}x,y)$ is determined as
follows.  First, an integer $i$ is computed.  If $x$ has a head, its
initial characters will be parsed as an ASCII decimal number using the
C library's \texttt{atoi(3)} function; $i$ is the resulting value, if
it is positive.  If $x$ has no head, the head of $x$ cannot be parsed,
or the head of $x$ is parsed as zero or negative, then $i$ is
defined to be $1$.  Having defined $i$, if $U_i()$ exists then
$\textit{match}'=U_i(y)$.  If $U_i()$
does not exist then $\textit{match}'$ is false.  Mnemonic: hash-mark
is used for parameter substitution in languages such as \TeX, and this
matching operation causes the matching pattern to take something
external (the user-defined predicate) as a parameter.

In the current version, the functions $U_i()$ are always defined using
the ``\texttt{-f}'' command-line option (or its long-named equivalent)
and correspond to the character coverage of TrueType or OpenType
fonts.  The predicate returns true if and only if $y$ has a head
consisting of a single Unicode character that is covered by the font.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Cooked output}

The default mode of operation for the \texttt{idsgrep} command-line utility
is that whenever a matching tree is detected, the exact sequence of bytes
that were parsed to generate that tree (including no skipped whitespace
before it, and all skipped whitespace after it but before the next tree)
will be copied through to the output.  This mode of operation is called
``raw.'' Raw mode is easy to understand, efficient, preserves distinctions
like different kinds of brackets in the input, and is as analogous as
reasonably possible to the operation of \texttt{grep}.  However, preserving
the exact input bytes may preserve invalid UTF-8, valid but weird EIDS
syntax, or non-ASCII characters users may find difficult to type or
display, that may have existed in the input.  The ``\texttt{-c}''
(``\texttt{--cooking}'') command-line option provides a wide range of
ways for \texttt{idsgrep} to generate new EIDS syntax of its own, guaranteed
to be valid, from the internal representation generated by the parser.  The
cooked output modes force the output into a well-behaved format
independent of what the input looked like.
Input canonicalization (such as the translation from ``\texttt{[lr]}'' to
``\texttt{⿰}'') can also be controlled through this interface.

The ``\texttt{-c}'' option can be given a (lowercase ASCII Latin,
unabbreviated) keyword as
its argument, to select a preset output mode.  That is the only recommended
way to use this option.  The available preset modes are as follows:
\begin{description}
  \item[\texttt{raw}] Raw mode: write out the exact input byte sequence
    that was parsed to generate the matching tree, \emph{even if it is not
    valid UTF-8}.  This is the default.
  \item[\texttt{rawnc}] Raw with no canonicalization: raw mode output,
    but without the canonicalization transformation during input parsing.
  \item[\texttt{ascii}] ASCII-only: all non-ASCII characters and ASCII
    control characters are replaced by escape sequences or subjected to
    the reverse of the input canonicalization transformation, to produce a
    result that should pass through most limited-character-set channels. 
    Note that the plainest ASCII space (U+0020) is not escaped in this mode
    when EIDS syntax does not require it to be.  This mode generally uses a
    lot of hexadecimal escapes and, in a dictionary-lookup context, may be
    useful for finding the hexadecimal code point value of an unknown
    character.
  \item[\texttt{cooked}] Generic cooked mode: render trees as reasonably
    clean and appealing Unicode text similar but not necessarily identical
    to what appears in the pregenerated dictionary files.  This will escape
    characters outside the Basic Multicharacter Plane; characters in all
    Private Use Areas; and any other characters that EIDS syntax
    \emph{requires} must be escaped; but no others.  It will choose an
    appropriate escaping method depending on the type of character. 
    Generally, it will use black lenticular brackets for top-level heads,
    ASCII brackets elsewhere, and syntactic sugar and syrup to avoid
    brackets where possible (except for top-level heads).
  \item[\texttt{indent}] Write trees on multiple lines with two-space
    indentation to show their structure as clearly as possible.  One blank
    line (two newlines) between trees.  In other ways this is similar to
    ``\texttt{cooked}.''
\end{description}

If not given a preset keyword, ``\texttt{-c}'' can be given a string of
ASCII decimal digits.  The decimal-string interface allows precise control
of how output syntax will be generated, but it is somewhat experimental,
very complicated, and may change incompatibly in future versions of this
software.  Use of this feature is not recommended.  Nonetheless, the
remainder of this section will attempt to document it.

The format specifier may be up to twelve digits long.  If it is shorter than
that, it is taken as a prefix with unspecified digits copied from the
default specifier, which is ``\texttt{100000013250}'' and equivalent to the
``\texttt{cooked}'' preset.  The two raw presets are handled as special
cases; of the remaining cooked presets, ``\texttt{ascii}'' is equivalent to
``\texttt{000000013551}'' and ``\texttt{indent}'' is equivalent to
``\texttt{100000223250}.''

The first digit specifies the type of brackets to be used for the head of
the root of the tree: 0 for ``\texttt{<>},'' 1 for ``\texttt{【】},'' or 2
for ``\texttt{〖〗}.''  The second digit specifies the type of brackets for
the head of any non-root node, using the same code.

The third digit specifies the type of brackets for nullary functors:  0 for
``\texttt{()},'' 1 for ``\texttt{()},'' or 2 for ``\texttt{⦅⦆}.''
Similarly, the fourth digit specifies the brackets for unary functors:
0 for ``\texttt{..},'' 1 for ``\texttt{::},'' or 2 for ``\texttt{・・}'';
the fifth digit specifies the brackets for binary functors:
0 for ``\texttt{[]},'' 1 for ``\texttt{[]},'' or 2 for ``\texttt{〚〛}'';
and the sixth digit specifies the brackets for ternary functors:
0 for ``\texttt{\{\}},'' 1 for ``\texttt{〔〕},'' or 2 for
``\texttt{〘〙}''.

The seventh digit describes how to insert newlines and indentation to
pretty-print the tree structure.  If it is 0, that will not be done.  If it
is 8, trees will be pretty-printed using one tab character per level; the
number eight is a mnemonic for the fact that people generally expect those
to be equivalent to eight spaces each.  Any other decimal digit specifies
that many spaces per level.

The eighth digit specifies the separator printed between trees: 0 for a null
byte (U+0000), 1 for a newline, 2 for two newlines, or 3 for no separator at
all.

The ninth digit specifies the circumstances under which the sugary and
syrupy features of EIDS syntax should be used.  It is a sum of binary flags:
add 4 to use a syrupy semicolon when possible at the top level;
2 to use a syrupy semicolon when possible at other levels;
and 1 to use sugary implicit brackets wherever possible.

The tenth digit specifies which characters should be escaped.  Literal
backslashes, and (within a bracketed string) literal instances of the
close-bracket character that would otherwise end the string, must
always be escaped. 
When the tenth digit is 0, those are the only characters that will be escaped. 
Other values add escaping for the following categories of characters, and do
so cumulatively with each digit also escaping everything that would be
escaped by all lesser digits.
\begin{description}
  \item[1] Escape characters from the astral planes; that is,
    characters with code points greater than U+FFFF and thus outside the
    Basic Multilingual Plane.
  \item[2] Escape characters from the BMP Private Use Areas, U+E000 to
    U+F8FF.  The other Private Use Areas are already escaped at level 1
    by virtue of being outside the BMP.
  \item[3] Escape all non-ASCII characters (U+0080 and up) except the core
    Unified Han range (U+4E00 to U+9FFF).
  \item[4] Escape the core Unified Han range.
  \item[5] Escape the ASCII control characters (U+0000 to U+001F).
  \item[6] Escape closing brackets at the start of bracketed strings, which
    otherwise escape escaping because of a special case in
    the syntax definition.
  \item[7] Escape all characters.  Depending on the value of the next digit,
    however, the ASCII Latin alphabet still might not be escaped.
\end{description}

The eleventh digit specifies \emph{how} to escape whatever characters were
selected for escaping by the tenth digit.  The available values are as
follows.
\begin{description}
  \item[0] Use a single backslash followed by the literal character, only.
    The ASCII Latin alphabet
    cannot be escaped in this way and under this option, or options 1 or 5
    which fall through to this case, will not be escaped at all.  Since the
    literal characters remain in the text, this option is not
    suitable for sending output through any channel that is not clean
    for the full range of UTF-8 characters.  However, unlike raw mode, this
    and all other cooked modes do guarantee to produce valid UTF-8, not
    arbitrary byte sequences.
  \item[1] Use a backslash-letter sequence for ASCII control characters
    U+0001 to U+001B, and otherwise follow option 0.
  \item[2] Use variable-length hexadecimal ``\texttt{\textbackslash x\{\}}''
    sequences for all characters that are selected to escape.  This syntax
    can escape any character.
  \item[3] Use two-digit ``\texttt{\textbackslash x}$HH$'' sequences
    wherever possible (that is, for ASCII and ISO-8859-1 characters),
    four-digit ``\texttt{\textbackslash X}$HHHH$'' sequences for
    other characters on the Basic Multilingual Plane,
    and variable-length hexadecimal sequences otherwise.
  \item[4] Use four-digit ``\texttt{\textbackslash X}$HHHH$'' sequences
    wherever possible (that is, for all characters on the BMP), and
    variable-length hexadecimal sequences otherwise.
  \item[5] Attempt to choose the simplest type of escape
    for each character depending on its value, just like option 3 except
    with backslash-letter escapes where possible (U+0001 to U+001B) and
    backslash-literal escapes for ASCII non-control characters
    (U+0020 to U+007E excluding the Latin alphabet).  The ASCII Latin
    alphabet will not be escaped at all under this option.
\end{description}

The twelfth digit specifies canonicalization processing; that is, the
translations on both input and output between alphabetic functor aliases
like ``\texttt{(anything)}'' and their symbolic equivalents like
``\texttt{(?)}.'' Note that in all cases the symbolic versions are the
matching operators; if you disable input canonicalization and enter a
matching pattern of ``\texttt{(anything)}'' it will be matched as an ordinary
nullary functor containing a string of eight ASCII letters, not as the
match-anything operator which is always named ``\texttt{(?)}.'' The digit
value is a sum of binary flags: add 4 to \emph{disable} the default
transformation of alphabetic aliases to symbolic names on input; plus 2 to
enable a translation from alphabetic aliases to symbolic names on output,
which is generally only meaningful if 4 was selected; plus 1 to enable a
transformation from symbolic names back to alphabetic aliases on output.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\clearpage
\addcontentsline{toc}{chapter}{Bibliography}
\bibliographystyle{plain}
\bibliography{idsgrep}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\end{document}
Show on old repository browser