tsukurimashou: Commit


Commit MetaInfo

Revision: 382 (tree)
Time: 2013-02-26 12:26:36
Author: mskala

Log Message

format 12, Unicode list generator

Incremental Difference

--- trunk/idsgrep/idsgrep.h (revision 381)
+++ trunk/idsgrep/idsgrep.h (revision 382)
@@ -145,6 +145,12 @@
145145
146146 /**********************************************************************/
147147
148+/* unilist.c */
149+
150+void generate_unicode_list(NODE *,char *);
151+
152+/**********************************************************************/
153+
148154 /* userpred.c */
149155
150156 void font_file_userpred(char *);
--- trunk/idsgrep/idsgrep.tex (revision 381)
+++ trunk/idsgrep/idsgrep.tex (revision 382)
@@ -209,12 +209,25 @@
209209 useful for the EDICT2-based meaning dictionary.
210210 \item[\texttt{idsgrep -d '...=?'}]~\\
211211 Equals escapes matching operators; this example searches for a literal
212- question mark anywhere in the tree.
212+ question mark anywhere in the tree.
213213 \item[\texttt{idsgrep -d '\textbackslash X840C'}]~\\
214214 Several kinds of backslash escapes allow entering characters that might
215215 not otherwise be available.
216216 \item[\texttt{idsgrep -d -c indent 萌}]~\\
217217 The \texttt{-c} option selects ``cooked'' or pretty-printed output modes.
218+\item[\texttt{idsgrep -d -f FontFile.otf '\#1'}]~\\
219+ The \texttt{-f} option reads the character set of an OpenType font
220+ and makes it available as a user-defined matching predicate accessed
221+ with the hash-mark; in the example, it looks up each character in the
222+ default dictionaries.
223+\item[\texttt{idsgrep -U '?'}]~\\
224+ The \texttt{-U} option generates a list of Unicode characters.
225+\item[\texttt{idsgrep -Uxdb '?'}]~\\
226+ An optional argument to \texttt{-U} specifies information to include
227+ in the generated list entries: \texttt{x} for hexadecimal
228+ value, \texttt{d} for decimal, \texttt{b} for block name.
229+\item[\texttt{idsgrep -U -f FontFile.otf '\#1'}]~\\
230+ Combine \texttt{-U} and \texttt{-f} to list the characters in a font.
218231 \end{description}
219232
220233 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -282,12 +295,15 @@
282295
283296 \section{What's new}
284297
285-The main new features in version 0.3 are:
298+The main new features in version 0.4 are:
286299 \begin{itemize}
287-\item associative and regular-expression matching;
288-\item the CHISE IDS-derived dictionary, and related support such as
289-offering a choice of which dictionary to join with EDICT2; and
290-\item cooked output modes.
300+ \item changes to the build system for better integration with
301+ Tsukurimashou;
302+ \item support for user-defined matching predicates, and in particular, the
303+ ability to match against the list of characters defined by a font file
304+ (``\texttt{-f}'' option);
305+ \item built-in generation of Unicode character lists (``\texttt{-U}''
306+ option).
291307 \end{itemize}
292308
293309 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -621,13 +637,13 @@
621637 \section{Interface to EDICT2}
622638
623639 Jim Breen's JMdict/EDICT project maintains a file called
624-EDICT2~\cite{EDICT2} which is more like a traditional dictionary, with words
625-and meanings, than a database of kanji. Such dictionaries are not the
626-primary target of IDSgrep and IDSgrep's query syntax is not perfectly suited
627-to them. However, planned future regular-expression matching features may
628-make it more practical to search EDICT2 with IDSgrep, and even in the
629-current version, there is some value in being able to do sub-character
630-structural searches on multi-character words.
640+EDICT2~\cite{EDICT2} which is more like a traditional dictionary, with
641+words and meanings, than a database of kanji. Such dictionaries are
642+not the primary target of IDSgrep and IDSgrep's query syntax is not
643+perfectly suited to them. However, the regular-expression matching
644+features may make it practical to search EDICT2 with IDSgrep, and
645+there is some value in being able to do sub-character structural
646+searches on multi-character words.
631647
632648 If another dictionary besides EDICT2 is available
633649 (subject to configuration by \texttt{--enable-edict-decomp}), then the
@@ -688,55 +704,172 @@
688704 Building IDSgrep in conjunction with Tsukurimashou allows IDSgrep to extract
689705 from the Tsukurimashou build system a dictionary of character decompositions
690706 as they appear in Tsukurimashou. The Tsukurimashou fonts are also necessary
691-to build this IDSgrep user manual. However, IDSgrep and Tsukurimashou
692-are distributed as separate packages, because they have very different
693-audiences and build prerequisites. Many people who can use one will be
694-unable to use the other, so it seems inappropriate to force all users to
695-download both.
707+to build this IDSgrep user manual. However, IDSgrep is also distributed as a
708+separate package, because it will be of use to non-users of Tsukurimashou,
709+and the Tsukurimashou build system will recurse into IDSgrep's directory and
710+build IDSgrep only if requested, not by default.
696711
697-When IDSgrep's \texttt{configure} script runs, it looks for a valid
698-Tsukurimashou build directory. Ideally, that would be one in which
699-Tsukurimashou has actually been fully built; but a directory where the
700-Tsukurimashou \texttt{configure} script has been executed is enough. If a
701-valid Tsukurimashou build directory is found automatically or specified with
702-the \texttt{--with-tsuku-build} option to \texttt{configure}, then when
703-\texttt{make} is run on IDSgrep, it will recursively go call \texttt{make
704-eids} in the Tsukurimashou build. That is a hook that causes
705-Tsukurimashou's build system to generate the EIDS decomposition dictionary,
706-which is then copied or linked back into IDSgrep's build directory and can
707-be installed with IDSgrep's \texttt{make install}. IDSgrep's build will
708-also look in Tsukurimashou's build directory for the font ``Tsukurimashou
709-Mincho'' which is needed to build this user manual, and will make
710-recursive calls to \texttt{make} for Tsukurimashou to build that if
711-necessary.
712+IDSgrep is one of several parasite packages of Tsukurimashou, using a
713+mechanism introduced in Tsukurimashou 0.7 and IDSgrep 0.4. Previous
714+versions used a different interface.
712715
713-Note that neither Tsukurimashou nor IDSgrep is a true ``sub-package'' of the
714-other in the sense of Autotools~\cite{Autotools}, as mediated by the
715-\texttt{SUBDIRS} Automake variable and so on, notwithstanding that a
716-checked-out SVN working copy of Tsukurimashou will contain a working copy of
717-IDSgrep in a subdirectory. Running the Tsukurimashou build will not invoke
718-the IDSgrep build at all; and running the IDSgrep build is not a good way to
719-trigger a full Tsukurimashou build, because it won't use the preferred
720-\texttt{-j} option, track all dependencies in detail, nor generate anything
721-that doesn't happen to be a prerequisite for the files IDSgrep needs. If
722-you want to build both systems, it's best to build Tsukurimashou first and
723-then build IDSgrep pointing at Tsukurimashou. Also, these two packages do
724-not necessarily have the same portability considerations, and it's possible
725-that the link between them may fail even on systems where each package
726-builds correctly by itself (for instance, possibly on some systems where GNU
727-Make is installed but non-default). The link between Tsukurimashou and
728-IDSgrep provides some convenience for my own frequent case of making changes
729-to both packages at once.
716+To build Tsukurimashou with IDSgrep: specify the
717+``\texttt{-{}-enable-parasites}'' option to Tsukurimashou's
718+\texttt{configure} script with an appropriate value, such as
719+``\texttt{-{}-enable-parasites=idsgrep}''. See the Tsukurimashou
720+documentation for other possible values of this option. Building
721+Tsukurimashou will then implicitly build IDSgrep. It should be possible to
722+pass IDSgrep \texttt{configure} options to Tsukurimashou's
723+\texttt{configure} script and have them automatically passed down the chain
724+(in the standard Autotools sub-package fashion) but that is not well-tested.
730725
731-In order for IDSgrep to work together with Tsukurimashou, it is necessary
732-that the Tsukurimashou build be one that supports the \texttt{make eids}
733-target in the first place. Packaged versions of Tsukurimashou from 0.6
734-onward include EIDS support, and development versions of Tsukurimashou in
735-the SVN repository have included EIDS support since early January 2012.
726+To build Tsukurimashou without IDSgrep: this is the default when you run
727+the Tsukurimashou build from the root of the Tsukurimashou distribution.
728+The IDSgrep source is included as a subdirectory in distributions of
729+Tsukurimashou, but only built on request.
736730
731+For a more customized build of IDSgrep, with or without Tsukurimashou: you
732+can also run IDSgrep's \texttt{configure} in its own directory, and then do
733+\texttt{make} (and the usual targets) there. It will look for Tsukurimashou
734+(specifically, a build directory in which Tsukurimashou's \texttt{configure}
735+has \emph{already been executed}) as the parent directory and in a few other
736+places, or you can specify the location of a Tsukurimashou build with the
737+``\texttt{--with-tsuku-build}'' option to IDSgrep's \texttt{configure}. If
738+Tsukurimashou is not available, IDSgrep will build without creating
739+the Tsukurimashou-derived dictionary file.
740+
741+During IDSgrep's build, if it can access a Tsukurimashou build directory, it
742+will recursively call \texttt{make eids} on Tsukurimashou's build system.
743+That is a hook that causes Tsukurimashou's build system to generate the EIDS
744+decomposition dictionary, which is then copied or linked back into IDSgrep's
745+build directory and can be installed with IDSgrep's \texttt{make install}.
746+IDSgrep's build will also look in Tsukurimashou's build directory for the
747+font ``Tsukurimashou Mincho'' which is needed to build this user manual, and
748+will make recursive calls to \texttt{make} for Tsukurimashou to build that
749+if necessary. This kind of upward-callback \texttt{make} invocation is a
750+little inefficient (in particular, it does not handle jobserver mode well)
751+so it is better, if you want both packages, to use the centralized
752+Tsukurimashou build system, which will do its own thing first and then call
753+IDSgrep's build near the end in a better-integrated way. If you want to
754+run ``\texttt{make install}'' just on IDSgrep and not on Tsukurimashou
755+(which might be a reasonable thing to want because of operating system font
756+installation issues), you should run just ``\texttt{make}'' in
757+Tsukurimashou's directory, then \texttt{cd} to IDSgrep's directory and run
758+``\texttt{make install}.''
759+
737760 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
761+
762+\section{A note on TrueType/OpenType}
763+
764+This version of IDSgrep is designed to read TrueType or OpenType files
765+(the distinction between the two is not relevant at this level) for
766+character map information. The specification for the
767+TrueType/OpenType file format reads like a parody. I'd like to take a
768+moment to complain about a few things.
769+
770+\begin{itemize}
771+ \item Although the format contains binary fields which must be read
772+ in a specific byte order, one of the two magic numbers that can
773+ identify the file format (for a single font as opposed to a
774+ ``collection'') is 0x4F54544F, which is a palindrome at the byte
775+ level and thus useless for detecting byte order problems.
776+ \item The other possible magic number for non-collection files
777+ is 0x00010000, which is quite likely to occur in files that are
778+ not TrueType/OpenType files, making it harder to detect when one
779+ may have been passed a bad file.
780+ \item Many decades of research on error detection codes were ignored
781+ in the design of the OpenType checksum algorithm, which (among
782+ other issues) cannot detect any reordering of 32-bit words unless
783+ it crosses a table boundary. At least the algorithm produces its
784+ meaningless results fast; yay, efficiency!
785+ \item There are 32-bit byte offsets referenced to the start of the file.
786+ There are 32- and 16-bit byte offsets referenced to the start of
787+ the current table. There are 32- and 16-bit byte offsets
788+ referenced to the \emph{locations of the offset fields themselves}, so a
789+ field at offset 0x1234 referring to another field at offset 0x5678
790+ will contain 0x4444. There are also indices measured in units
791+ other than bytes.
792+ \item There are variable-length objects not tagged
793+ with their lengths except indirectly: they are presumably
794+ contained entirely within larger objects that are tagged with lengths.
795+ \item Consider the cmap format 4 subtable, which Microsoft
796+ says is their preferred format. It includes four
797+ variable-length arrays each containing segCount
798+ number of two-byte entries. The value of segCount is not directly
799+ recorded anywhere, but these values are all required
800+ in the header:
801+ \begin{itemize}
802+ \item[$\circ$] $2 \cdot \textrm{segCount}$;
803+ \item[$\circ$] $2 \cdot 2^{\lfloor \log_2 \textrm{segCount} \rfloor}$;
804+ \item[$\circ$] $\log_2 (2 \cdot 2^{\lfloor \log_2 \textrm{segCount}
805+ \rfloor}/2)$ (which is described like that in the spec);
806+ and of course
807+ \item[$\circ$] $2 \cdot \textrm{segCount}-2 \cdot 2^{\lfloor \log_2
808+ \textrm{segCount} \rfloor}$.
809+ \end{itemize}
810+ \item The bizarre length-derived values in the format 4 header (and
811+ other similar sets of table-size-logarithm numbers that occur
812+ elsewhere in the file format) appear to be designed to support
813+ someone's binary search code. Instead of computing those numbers
814+ itself starting from the length, the search code can just use
815+ values straight from the table to initialize its variables.
816+ Consider what would happen if someone actually did that as the
817+ designers apparently intended, and the numbers happened to be incorrect
818+ in the file. If, for instance, numbers in the file were swapped
819+ around on 32-bit boundaries, the checksums wouldn't detect a problem; and
820+ the speed demons who think they need precomputed logarithms
821+ probably aren't wasting time checking checksums anyway.
822+ The code isn't checking whether the numbers are consistent (because
823+ to do that you would have to calculate them fresh, and then why bother
824+ storing them in the first place?), so it will end up ``searching'' into
825+ random areas of the file, or into uninitialized memory beyond. Now
826+ think about the relative costs of disk reads, network transfer, and
827+ arithmetic, and consider whether having those values precalculated
828+ and stored in the file would actually save any time even if they
829+ could be trusted.
830+ \item The cmap format 4 subtable consists of, in this order: fixed-length
831+ stuff totalling 14 bytes; one variable-length array of length $2
832+ \cdot \textrm{segCount}$ bytes; \emph{one more two-byte
833+ fixed-length field}; three more variable-length arrays each of
834+ length $2 \cdot \textrm{segCount}$ bytes; and finally, one more
835+ variable-length array whose length is not directly specified
836+ anywhere but could presumably be inferred by subtracting from the
837+ known size of the overall table. The four $2 \cdot
838+ \textrm{segCount}$-byte arrays are actually the rearranged slices
839+ of a single logical array whose elements are four-field
840+ structures; but the extra reserved two bytes
841+ stuck in the middle of the table make a straightforward transposition
842+ impossible. Four-tuples of the same kind with the same four fields
843+ also occur in the format 2 subtable; but there, they occur as a single
844+ array with each record written in an 8-byte block.
845+ \item It is an intended, documented feature that some of the
846+ variable-length arrays in TrueType/OpenType may overlap with each other.
847+ As a result, bounds-checking, in addition to being intrinsically
848+ difficult because of the lack of information, would cause the
849+ reader to reject some files that the specification claims are
850+ legitimate.
851+ \item Code-injection bugs allowing execution of arbitrary
852+ code in a privileged context have been reported in software
853+ that implemented this file format without bounds-checking. This should
854+ surprise no one.
855+\end{itemize}
856+
857+IDSgrep attempts to do all reasonable bounds-checking on the fields it
858+needs, and to ignore fields it does not need; given a bad TrueType/OpenType
859+file, it is intended that IDSgrep should be able to make the best of it and
860+at worst fail gracefully with an error message. It should not be possible
861+to crash IDSgrep by giving it a bad font file to read.
862+
863+However, the nature of the file format means that at least in the
864+current version, we can't be confident all possible problems have
865+been foreseen and excluded. Let me know if you find a font file that
866+makes IDSgrep crash and I'll try to fix it. IDSgrep probably should not be
867+allowed to read font files supplied by untrusted sources such as Web
868+users.
869+
738870 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
739871 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
872+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
740873
741874 \chapter{Invoking \texttt{idsgrep}}
742875
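The length-derived values in the cmap format 4 header, as listed in the hunk above, can all be recomputed from segCount, so a reader can check them instead of trusting them. The following is a minimal sketch of such a check; it is not code from revision 382, and the names FMT4_SEARCH_FIELDS, fmt4_expected_fields, and fmt4_fields_consistent are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* The three search-helper fields of a cmap format 4 header
 * (hypothetical struct, not taken from the IDSgrep source). */
typedef struct {
  uint16_t search_range;    /* 2 * 2^floor(log2(segCount)) */
  uint16_t entry_selector;  /* floor(log2(segCount)) */
  uint16_t range_shift;     /* 2*segCount - searchRange */
} FMT4_SEARCH_FIELDS;

/* Recompute the fields from segCount; requires 1 <= segCount <= 32767,
 * because 2*segCount must fit in the header's 16-bit field. */
FMT4_SEARCH_FIELDS fmt4_expected_fields(uint16_t seg_count) {
  FMT4_SEARCH_FIELDS f;
  uint32_t pow2=1;      /* largest power of two not exceeding segCount */
  uint16_t exponent=0;  /* its base-2 logarithm */

  assert(seg_count>=1 && seg_count<=0x7FFF);
  while (pow2*2<=(uint32_t)seg_count) {
    pow2*=2;
    exponent++;
  }
  f.search_range=(uint16_t)(2*pow2);
  f.entry_selector=exponent;
  f.range_shift=(uint16_t)(2*(uint32_t)seg_count-2*pow2);
  return f;
}

/* Return 1 if the four header values read from a file agree with each
 * other, 0 otherwise.  The first argument is the stored 2*segCount. */
int fmt4_fields_consistent(uint16_t seg_count_x2,uint16_t search_range,
                           uint16_t entry_selector,uint16_t range_shift) {
  FMT4_SEARCH_FIELDS f;

  if (seg_count_x2==0 || (seg_count_x2&1)!=0) return 0;
  f=fmt4_expected_fields((uint16_t)(seg_count_x2/2));
  return search_range==f.search_range
    && entry_selector==f.entry_selector
    && range_shift==f.range_shift;
}
```

For a subtable with 39 segments, the header should hold 78, 64, 5, and 14, so fmt4_fields_consistent(78,64,5,14) returns 1. A reader that has done this arithmetic no longer needs the stored values for anything, which is the point made in the text.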
@@ -790,6 +923,34 @@
790923 more detail. The default mode is \texttt{raw}. See the
791924 section on ``cooked output'' in this manual for more details.
792925
926+\item[\texttt{-f}, \texttt{--font-chars}]
927+Read a font file and make its character coverage available as a
928+user-defined matching predicate through the ``\texttt{\#}'' matching
929+operator. In the current version, this feature can only read TrueType
930+and OpenType files that contain Unicode (or near equivalent) mappings
931+described with cmap subtable types 0, 2, 4, 12, or 13. This option may
932+be specified multiple times, with successive invocations corresponding
933+to user-defined predicates 1, 2, 3, and so on.
934+
935+\item[\texttt{-U}, \texttt{--unicode-list}]
936+Generate a dictionary of Unicode code points, and read that before
937+reading any other dictionaries or input files that may be specified.
938+The generated dictionary consists of a single line for each of the
939+code points U+0000 through U+10FFFF in ascending order, excluding the
940+surrogates but not any other invalid or non-character code points; on
941+each line, there is a tree whose head is the character and whose body
942+is either a nullary semicolon or (if the optional argument to
943+\texttt{-U} was specified) a nullary functor containing
944+semicolon-separated pieces of information selected by the characters of the
945+optional argument. Characters permitted in the argument
946+are ``\texttt{b}'' for the Unicode block name; ``\texttt{d}'' for the
947+decimal value of the code point; and ``\texttt{x}'' for the
948+hexadecimal value prefixed with ``U+.'' For example, specifying
949+``\texttt{-Uxdb}'' will generate and scan a dictionary that includes
950+the line ``\texttt{<A>(U+0041;65;Basic Latin)}.''
951+This option is intended to be used together with \texttt{-f} to produce
952+font coverage lists.
953+
793954 \item[\texttt{-V}, \texttt{--version}] Display the version and license
794955 information for IDSgrep.
795956
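The way the optional argument to -U selects the pieces of information in each entry, as described in the hunk above, can be sketched as follows. This is an illustration, not the code in unilist.c. The function name format_entry_info is hypothetical. Two behaviours are assumptions: rejecting characters other than b, d, and x, and padding the hexadecimal value to at least four digits (inferred from the U+0041 example).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the semicolon-separated information for
 * one code point into buf, in the order given by the characters of cfg.
 * Returns the number of characters written, or -1 on an unknown flag
 * character or if buf is too small. */
int format_entry_info(char *buf,size_t size,const char *cfg,
                      unsigned long cp,const char *block_name) {
  size_t used=0;
  const char *p;
  int n;

  assert(buf!=NULL && cfg!=NULL && block_name!=NULL);
  if (size==0) return -1;
  buf[0]='\0';
  for (p=cfg;*p!='\0';p++) {
    const char *sep=(p==cfg)?"":";";  /* no separator before first piece */

    if (strchr("bdx",*p)==NULL) return -1;  /* assumed: reject other flags */
    if (*p=='x')
      n=snprintf(buf+used,size-used,"%sU+%04lX",sep,cp);
    else if (*p=='d')
      n=snprintf(buf+used,size-used,"%s%lu",sep,cp);
    else
      n=snprintf(buf+used,size-used,"%s%s",sep,block_name);
    if (n<0 || (size_t)n>=size-used) return -1;  /* error or truncation */
    used+=(size_t)n;
  }
  return (int)used;
}
```

With cfg set to "xdb", code point 0x41, and block name "Basic Latin", the buffer receives the text between the parentheses in the manual's example line.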
@@ -1082,7 +1243,7 @@
10821243 \end{itemize}
10831244
10841245 Here are all the characters that have sugary implicit brackets, with the
1085-brackets they imply: {\ttfamily (;) (?) .!. ./. .=. .*. .@. [\&] [,]
1246+brackets they imply: {\ttfamily (;) (?) .!. ./. .=. .*. .@. .\#. [\&] [,]
10861247 [|] [⿰] [⿱] [⿴] [⿵] [⿶] [⿷] [⿸] [⿹] [⿺] [⿻] \{⿲\} \{⿳\}}
10871248
10881249 Note that the sugary and syrupy implications of a character are only
@@ -1123,14 +1284,14 @@
11231284 (anything) & $\Rightarrow$ & (?) & .anywhere. & $\Rightarrow$ & ... \\
11241285 .not. & $\Rightarrow$ & .!. & .regex. & $\Rightarrow$ & ./. \\
11251286 .equal. & $\Rightarrow$ & .=. & .unord. & $\Rightarrow$ & .*. \\
1126- .assoc. & $\Rightarrow$ & .@. & [and] & $\Rightarrow$ & [\&] \\\relax
1127- [or] & $\Rightarrow$ & [|] & [lr] & $\Rightarrow$ & [⿰] \\\relax
1128- [tb] & $\Rightarrow$ & [⿱] & [enclose] & $\Rightarrow$ & [⿴] \\\relax
1129- [wrapu] & $\Rightarrow$ & [⿵] & [wrapd] & $\Rightarrow$ & [⿶] \\\relax
1130- [wrapl] & $\Rightarrow$ & [⿷] & [wrapul] & $\Rightarrow$ & [⿸] \\\relax
1131- [wrapur] & $\Rightarrow$ & [⿹] & [wrapll] & $\Rightarrow$ & [⿺] \\\relax
1132- [overlap] & $\Rightarrow$ & [⿻] & \{lcr\} & $\Rightarrow$ & \{⿲\} \\\relax
1133- \{tcb\} & $\Rightarrow$ & \{⿳\}
1287+ .assoc. & $\Rightarrow$ & .@. & .user. & $\Rightarrow$ & .\#. \\\relax
1288+ [and] & $\Rightarrow$ & [\&] & [or] & $\Rightarrow$ & [|] \\\relax
1289+ [lr] & $\Rightarrow$ & [⿰] & [tb] & $\Rightarrow$ & [⿱] \\\relax
1290+ [enclose] & $\Rightarrow$ & [⿴] & [wrapu] & $\Rightarrow$ & [⿵] \\\relax
1291+ [wrapd] & $\Rightarrow$ & [⿶] & [wrapl] & $\Rightarrow$ & [⿷] \\\relax
1292+ [wrapul] & $\Rightarrow$ & [⿸] & [wrapur] & $\Rightarrow$ & [⿹] \\\relax
1293+ [wrapll] & $\Rightarrow$ & [⿺] & [overlap] & $\Rightarrow$ & [⿻] \\
1294+ \{lcr\} & $\Rightarrow$ & \{⿲\} & \{tcb\} & $\Rightarrow$ & \{⿳\}
11341295 \end{tabular}}
11351296
11361297 The \texttt{idsgrep} command-line utility attempts to follow Postel's Law
@@ -1488,6 +1649,32 @@
14881649 additional escaping might be needed to ensure that PCRE, and not EIDS nor
14891650 the shell, interprets the backslash escape.
14901651
1652+\subsection{User-defined matching predicates}
1653+
1654+It is assumed that by some out-of-band means, we have defined a
1655+family of functions $U_i()$ for $i$ from $1$ up to some $k$. These
1656+functions take EIDS trees as input and return Boolean values (hence
1657+``predicates'').
1658+
1659+Then the value of $\textit{match}'(\texttt{.\#.}x,y)$ is determined as
1660+follows. First, an integer $i$ is computed. If $x$ has a head, its
1661+initial characters will be parsed as an ASCII decimal number using the
1662+C library's \texttt{atoi(3)} function; $i$ is the resulting value, if
1663+it is positive. If $x$ has no head, the head of $x$ cannot be parsed,
1664+or the head of $x$ is parsed as zero or negative, then $i$ is
1665+defined to be $1$. Having defined $i$, if $U_i()$ exists then
1666+$\textit{match}'=U_i(y)$. If $U_i()$
1667+does not exist then $\textit{match}'$ is false. Mnemonic: hash-mark
1668+is used for parameter substitution in languages such as \TeX, and this
1669+matching operation causes the matching pattern to take something
1670+external (the user-defined predicate) as a parameter.
1671+
1672+In the current version, the functions $U_i()$ are always defined using
1673+the ``\texttt{-f}'' command-line option (or its long-named equivalent)
1674+and correspond to the character coverage of TrueType or OpenType
1675+fonts. The predicate returns true if and only if $y$ has a head
1676+consisting of a single Unicode character that is covered by the font.
1677+
14911678 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
14921679
14931680 \section{Cooked output}
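The rule for choosing which user-defined predicate to evaluate, given in the subsection added above, can be sketched in C as follows. This is an illustration under stated assumptions, not the matching code from this commit. The type USER_PRED and the functions user_predicate_index, match_user, and pred_tree_present are hypothetical, and the tree is represented by an opaque pointer rather than IDSgrep's NODE type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical predicate type: takes a tree, returns a Boolean. */
typedef int (*USER_PRED)(const void *tree);

/* Compute i from the head of x.  head is NULL when x has no head.
 * A head that does not parse, or parses as zero or negative, gives 1.
 * Like atoi(3) itself, this does not guard against overflow. */
int user_predicate_index(const char *head) {
  int i;

  if (head==NULL) return 1;
  i=atoi(head);
  return (i>0)?i:1;
}

/* Evaluate the match: preds[0..k-1] stand for U_1 through U_k.
 * The result is false when U_i does not exist. */
int match_user(const char *head,const USER_PRED *preds,int k,
               const void *y) {
  int i;

  assert(k>=0);
  i=user_predicate_index(head);
  if (i>k || preds[i-1]==NULL) return 0;
  return preds[i-1](y)!=0;
}

/* Stand-in predicate for demonstration only: true for any non-NULL tree.
 * The real predicates test font coverage of the head of y. */
int pred_tree_present(const void *tree) {
  return tree!=NULL;
}
```

With a single predicate defined, a head of "1", a missing head, and an unparseable head all select that predicate, while a head of "2" selects a predicate that does not exist and so never matches.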
--- trunk/idsgrep/unilist.c (nonexistent)
+++ trunk/idsgrep/unilist.c (revision 382)
@@ -0,0 +1,358 @@
1+/*
2+ * Internally-generated Unicode dictionary for IDSgrep
3+ * Copyright (C) 2013 Matthew Skala
4+ *
5+ * This program is free software: you can redistribute it and/or modify
6+ * it under the terms of the GNU General Public License as published by
7+ * the Free Software Foundation, version 3.
8+ *
9+ * This program is distributed in the hope that it will be useful,
10+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
11+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12+ * GNU General Public License for more details.
13+ *
14+ * You should have received a copy of the GNU General Public License
15+ * along with this program. If not, see <http://www.gnu.org/licenses/>.
16+ *
17+ * Matthew Skala
18+ * http://ansuz.sooke.bc.ca/
19+ * mskala@ansuz.sooke.bc.ca
20+ */
21+
22+#include <stdio.h>
23+#include <stdlib.h>
24+#include <string.h>
25+
26+#include "idsgrep.h"
27+
28+/**********************************************************************/
29+
30+typedef struct _UNICODE_BLOCK_DATA {
31+ int low,high;
32+ char *name;
33+} UNICODE_BLOCK_DATA;
34+
35+UNICODE_BLOCK_DATA unicode_blocks[]={
36+ {0x0000,0x007F,"Basic Latin"},
37+ {0x0080,0x00FF,"Latin-1 Supplement"},
38+ {0x0100,0x017F,"Latin Extended-A"},
39+ {0x0180,0x024F,"Latin Extended-B"},
40+ {0x0250,0x02AF,"IPA Extensions"},
41+ {0x02B0,0x02FF,"Spacing Modifier Letters"},
42+ {0x0300,0x036F,"Combining Diacritical Marks"},
43+ {0x0370,0x03FF,"Greek and Coptic"},
44+ {0x0400,0x04FF,"Cyrillic"},
45+ {0x0500,0x052F,"Cyrillic Supplement"},
46+ {0x0530,0x058F,"Armenian"},
47+ {0x0590,0x05FF,"Hebrew"},
48+ {0x0600,0x06FF,"Arabic"},
49+ {0x0700,0x074F,"Syriac"},
50+ {0x0750,0x077F,"Arabic Supplement"},
51+ {0x0780,0x07BF,"Thaana"},
52+ {0x07C0,0x07FF,"NKo"},
53+ {0x0800,0x083F,"Samaritan"},
54+ {0x0840,0x085F,"Mandaic"},
55+ {0x08A0,0x08FF,"Arabic Extended-A"},
56+ {0x0900,0x097F,"Devanagari"},
57+ {0x0980,0x09FF,"Bengali"},
58+ {0x0A00,0x0A7F,"Gurmukhi"},
59+ {0x0A80,0x0AFF,"Gujarati"},
60+ {0x0B00,0x0B7F,"Oriya"},
61+ {0x0B80,0x0BFF,"Tamil"},
62+ {0x0C00,0x0C7F,"Telugu"},
63+ {0x0C80,0x0CFF,"Kannada"},
64+ {0x0D00,0x0D7F,"Malayalam"},
65+ {0x0D80,0x0DFF,"Sinhala"},
66+ {0x0E00,0x0E7F,"Thai"},
67+ {0x0E80,0x0EFF,"Lao"},
68+ {0x0F00,0x0FFF,"Tibetan"},
69+ {0x1000,0x109F,"Myanmar"},
70+ {0x10A0,0x10FF,"Georgian"},
71+ {0x1100,0x11FF,"Hangul Jamo"},
72+ {0x1200,0x137F,"Ethiopic"},
73+ {0x1380,0x139F,"Ethiopic Supplement"},
74+ {0x13A0,0x13FF,"Cherokee"},
75+ {0x1400,0x167F,"Unified Canadian Aboriginal Syllabics"},
76+ {0x1680,0x169F,"Ogham"},
77+ {0x16A0,0x16FF,"Runic"},
78+ {0x1700,0x171F,"Tagalog"},
79+ {0x1720,0x173F,"Hanunoo"},
80+ {0x1740,0x175F,"Buhid"},
81+ {0x1760,0x177F,"Tagbanwa"},
82+ {0x1780,0x17FF,"Khmer"},
83+ {0x1800,0x18AF,"Mongolian"},
84+ {0x18B0,0x18FF,"Unified Canadian Aboriginal Syllabics Extended"},
85+ {0x1900,0x194F,"Limbu"},
86+ {0x1950,0x197F,"Tai Le"},
87+ {0x1980,0x19DF,"New Tai Lue"},
88+ {0x19E0,0x19FF,"Khmer Symbols"},
89+ {0x1A00,0x1A1F,"Buginese"},
90+ {0x1A20,0x1AAF,"Tai Tham"},
91+ {0x1B00,0x1B7F,"Balinese"},
92+ {0x1B80,0x1BBF,"Sundanese"},
93+ {0x1BC0,0x1BFF,"Batak"},
94+ {0x1C00,0x1C4F,"Lepcha"},
95+ {0x1C50,0x1C7F,"Ol Chiki"},
96+ {0x1CC0,0x1CCF,"Sundanese Supplement"},
97+ {0x1CD0,0x1CFF,"Vedic Extensions"},
98+ {0x1D00,0x1D7F,"Phonetic Extensions"},
99+ {0x1D80,0x1DBF,"Phonetic Extensions Supplement"},
100+ {0x1DC0,0x1DFF,"Combining Diacritical Marks Supplement"},
101+ {0x1E00,0x1EFF,"Latin Extended Additional"},
102+ {0x1F00,0x1FFF,"Greek Extended"},
103+ {0x2000,0x206F,"General Punctuation"},
104+ {0x2070,0x209F,"Superscripts and Subscripts"},
105+ {0x20A0,0x20CF,"Currency Symbols"},
106+ {0x20D0,0x20FF,"Combining Diacritical Marks for Symbols"},
107+ {0x2100,0x214F,"Letterlike Symbols"},
108+ {0x2150,0x218F,"Number Forms"},
109+ {0x2190,0x21FF,"Arrows"},
110+ {0x2200,0x22FF,"Mathematical Operators"},
111+ {0x2300,0x23FF,"Miscellaneous Technical"},
112+ {0x2400,0x243F,"Control Pictures"},
113+ {0x2440,0x245F,"Optical Character Recognition"},
114+ {0x2460,0x24FF,"Enclosed Alphanumerics"},
115+ {0x2500,0x257F,"Box Drawing"},
116+ {0x2580,0x259F,"Block Elements"},
117+ {0x25A0,0x25FF,"Geometric Shapes"},
118+ {0x2600,0x26FF,"Miscellaneous Symbols"},
119+ {0x2700,0x27BF,"Dingbats"},
120+ {0x27C0,0x27EF,"Miscellaneous Mathematical Symbols-A"},
121+ {0x27F0,0x27FF,"Supplemental Arrows-A"},
122+ {0x2800,0x28FF,"Braille Patterns"},
123+ {0x2900,0x297F,"Supplemental Arrows-B"},
124+ {0x2980,0x29FF,"Miscellaneous Mathematical Symbols-B"},
125+ {0x2A00,0x2AFF,"Supplemental Mathematical Operators"},
126+ {0x2B00,0x2BFF,"Miscellaneous Symbols and Arrows"},
127+ {0x2C00,0x2C5F,"Glagolitic"},
128+ {0x2C60,0x2C7F,"Latin Extended-C"},
129+ {0x2C80,0x2CFF,"Coptic"},
130+ {0x2D00,0x2D2F,"Georgian Supplement"},
131+ {0x2D30,0x2D7F,"Tifinagh"},
132+ {0x2D80,0x2DDF,"Ethiopic Extended"},
133+ {0x2DE0,0x2DFF,"Cyrillic Extended-A"},
134+ {0x2E00,0x2E7F,"Supplemental Punctuation"},
135+ {0x2E80,0x2EFF,"CJK Radicals Supplement"},
136+ {0x2F00,0x2FDF,"Kangxi Radicals"},
137+ {0x2FF0,0x2FFF,"Ideographic Description Characters"},
138+ {0x3000,0x303F,"CJK Symbols and Punctuation"},
139+ {0x3040,0x309F,"Hiragana"},
140+ {0x30A0,0x30FF,"Katakana"},
141+ {0x3100,0x312F,"Bopomofo"},
142+ {0x3130,0x318F,"Hangul Compatibility Jamo"},
143+ {0x3190,0x319F,"Kanbun"},
144+ {0x31A0,0x31BF,"Bopomofo Extended"},
145+ {0x31C0,0x31EF,"CJK Strokes"},
146+ {0x31F0,0x31FF,"Katakana Phonetic Extensions"},
147+ {0x3200,0x32FF,"Enclosed CJK Letters and Months"},
148+ {0x3300,0x33FF,"CJK Compatibility"},
149+ {0x3400,0x4DBF,"CJK Unified Ideographs Extension A"},
150+ {0x4DC0,0x4DFF,"Yijing Hexagram Symbols"},
151+ {0x4E00,0x9FFF,"CJK Unified Ideographs"},
152+ {0xA000,0xA48F,"Yi Syllables"},
153+ {0xA490,0xA4CF,"Yi Radicals"},
154+ {0xA4D0,0xA4FF,"Lisu"},
155+ {0xA500,0xA63F,"Vai"},
156+ {0xA640,0xA69F,"Cyrillic Extended-B"},
157+ {0xA6A0,0xA6FF,"Bamum"},
158+ {0xA700,0xA71F,"Modifier Tone Letters"},
159+ {0xA720,0xA7FF,"Latin Extended-D"},
160+ {0xA800,0xA82F,"Syloti Nagri"},
161+ {0xA830,0xA83F,"Common Indic Number Forms"},
162+ {0xA840,0xA87F,"Phags-pa"},
163+ {0xA880,0xA8DF,"Saurashtra"},
164+ {0xA8E0,0xA8FF,"Devanagari Extended"},
165+ {0xA900,0xA92F,"Kayah Li"},
166+ {0xA930,0xA95F,"Rejang"},
167+ {0xA960,0xA97F,"Hangul Jamo Extended-A"},
168+ {0xA980,0xA9DF,"Javanese"},
169+ {0xAA00,0xAA5F,"Cham"},
170+ {0xAA60,0xAA7F,"Myanmar Extended-A"},
171+ {0xAA80,0xAADF,"Tai Viet"},
172+ {0xAAE0,0xAAFF,"Meetei Mayek Extensions"},
173+ {0xAB00,0xAB2F,"Ethiopic Extended-A"},
174+ {0xABC0,0xABFF,"Meetei Mayek"},
175+ {0xAC00,0xD7AF,"Hangul Syllables"},
176+ {0xD7B0,0xD7FF,"Hangul Jamo Extended-B"},
177+ /* surrogates NOT included in this table because we skip them */
178+ {0xE000,0xF8FF,"Private Use Area"},
179+ {0xF900,0xFAFF,"CJK Compatibility Ideographs"},
180+ {0xFB00,0xFB4F,"Alphabetic Presentation Forms"},
181+ {0xFB50,0xFDFF,"Arabic Presentation Forms-A"},
182+ {0xFE00,0xFE0F,"Variation Selectors"},
183+ {0xFE10,0xFE1F,"Vertical Forms"},
184+ {0xFE20,0xFE2F,"Combining Half Marks"},
185+ {0xFE30,0xFE4F,"CJK Compatibility Forms"},
186+ {0xFE50,0xFE6F,"Small Form Variants"},
187+ {0xFE70,0xFEFF,"Arabic Presentation Forms-B"},
188+ {0xFF00,0xFFEF,"Halfwidth and Fullwidth Forms"},
189+ {0xFFF0,0xFFFF,"Specials"},
190+ {0x10000,0x1007F,"Linear B Syllabary"},
191+ {0x10080,0x100FF,"Linear B Ideograms"},
192+ {0x10100,0x1013F,"Aegean Numbers"},
193+ {0x10140,0x1018F,"Ancient Greek Numbers"},
194+ {0x10190,0x101CF,"Ancient Symbols"},
195+ {0x101D0,0x101FF,"Phaistos Disc"},
196+ {0x10280,0x1029F,"Lycian"},
197+ {0x102A0,0x102DF,"Carian"},
198+ {0x10300,0x1032F,"Old Italic"},
199+ {0x10330,0x1034F,"Gothic"},
200+ {0x10380,0x1039F,"Ugaritic"},
201+ {0x103A0,0x103DF,"Old Persian"},
202+ {0x10400,0x1044F,"Deseret"},
203+ {0x10450,0x1047F,"Shavian"},
204+ {0x10480,0x104AF,"Osmanya"},
205+ {0x10800,0x1083F,"Cypriot Syllabary"},
206+ {0x10840,0x1085F,"Imperial Aramaic"},
207+ {0x10900,0x1091F,"Phoenician"},
208+ {0x10920,0x1093F,"Lydian"},
209+ {0x10980,0x1099F,"Meroitic Hieroglyphs"},
210+ {0x109A0,0x109FF,"Meroitic Cursive"},
211+ {0x10A00,0x10A5F,"Kharoshthi"},
212+ {0x10A60,0x10A7F,"Old South Arabian"},
213+ {0x10B00,0x10B3F,"Avestan"},
214+ {0x10B40,0x10B5F,"Inscriptional Parthian"},
215+ {0x10B60,0x10B7F,"Inscriptional Pahlavi"},
216+ {0x10C00,0x10C4F,"Old Turkic"},
217+ {0x10E60,0x10E7F,"Rumi Numeral Symbols"},
218+ {0x11000,0x1107F,"Brahmi"},
219+ {0x11080,0x110CF,"Kaithi"},
220+ {0x110D0,0x110FF,"Sora Sompeng"},
221+ {0x11100,0x1114F,"Chakma"},
222+ {0x11180,0x111DF,"Sharada"},
223+ {0x11680,0x116CF,"Takri"},
224+ {0x12000,0x123FF,"Cuneiform"},
225+ {0x12400,0x1247F,"Cuneiform Numbers and Punctuation"},
226+ {0x13000,0x1342F,"Egyptian Hieroglyphs"},
227+ {0x16800,0x16A3F,"Bamum Supplement"},
228+ {0x16F00,0x16F9F,"Miao"},
229+ {0x1B000,0x1B0FF,"Kana Supplement"},
230+ {0x1D000,0x1D0FF,"Byzantine Musical Symbols"},
231+ {0x1D100,0x1D1FF,"Musical Symbols"},
232+ {0x1D200,0x1D24F,"Ancient Greek Musical Notation"},
233+ {0x1D300,0x1D35F,"Tai Xuan Jing Symbols"},
234+ {0x1D360,0x1D37F,"Counting Rod Numerals"},
235+ {0x1D400,0x1D7FF,"Mathematical Alphanumeric Symbols"},
236+ {0x1EE00,0x1EEFF,"Arabic Mathematical Alphabetic Symbols"},
237+ {0x1F000,0x1F02F,"Mahjong Tiles"},
238+ {0x1F030,0x1F09F,"Domino Tiles"},
239+ {0x1F0A0,0x1F0FF,"Playing Cards"},
240+ {0x1F100,0x1F1FF,"Enclosed Alphanumeric Supplement"},
241+ {0x1F200,0x1F2FF,"Enclosed Ideographic Supplement"},
242+ {0x1F300,0x1F5FF,"Miscellaneous Symbols And Pictographs"},
243+ {0x1F600,0x1F64F,"Emoticons"},
244+ {0x1F680,0x1F6FF,"Transport And Map Symbols"},
245+ {0x1F700,0x1F77F,"Alchemical Symbols"},
246+ {0x20000,0x2A6DF,"CJK Unified Ideographs Extension B"},
247+ {0x2A700,0x2B73F,"CJK Unified Ideographs Extension C"},
248+ {0x2B740,0x2B81F,"CJK Unified Ideographs Extension D"},
249+ {0x2F800,0x2FA1F,"CJK Compatibility Ideographs Supplement"},
250+ {0xE0000,0xE007F,"Tags"},
251+ {0xE0100,0xE01EF,"Variation Selectors Supplement"},
252+ {0xF0000,0xFFFFF,"Supplementary Private Use Area-A"},
253+ {0x100000,0x10FFFF,"Supplementary Private Use Area-B"},
254+};
255+
256+void generate_unicode_list(NODE *match_pattern,char *cfg) {
257+ int i,j,blen=64,elen,complained=0,cfgbits;
258+ char *ebuf,*cptr;
259+ int start,parsed;
260+ NODE *to_match;
261+
262+ /* start with a small buffer - usually enough */
263+ ebuf=(char *)malloc(blen);
264+
265+ /* loop over chars */
266+ for (i=0;i<0x110000;i++) {
267+
268+ /* skip surrogates */
269+ if (i==0xD800)
270+ i=0xE000;
271+
272+ /* generate dictionary entry */
273+ ebuf[0]='<';
274+ if (i==(int)'\\') {
275+ ebuf[1]='\\';
276+ ebuf[2]='\\';
277+ elen=3;
278+ } else
279+ elen=1+construct_utf8(i,ebuf+1);
280+ ebuf[elen++]='>';
281+ if ((cfg==NULL) || (*cfg=='\0'))
282+ ebuf[elen++]=';';
283+ else {
284+ j=0;
285+ cfgbits=0;
286+ for (cptr=cfg;*cptr;cptr++) {
287+ if (blen-elen<50) {
288+ blen*=2;
289+ ebuf=(char *)realloc(ebuf,blen);
290+ }
291+ switch (*cptr) {
292+ case 'b':
293+ for (;unicode_blocks[j].high<i;j++);
294+ if (unicode_blocks[j].low<=i) {
295+ ebuf[elen++]=cfgbits?';':'(';
296+ cfgbits++;
297+ strcpy(ebuf+elen,unicode_blocks[j].name);
298+ elen+=strlen(unicode_blocks[j].name);
299+ }
300+ break;
301+
302+ case 'd':
303+ ebuf[elen++]=cfgbits?';':'(';
304+ cfgbits++;
305+ elen+=sprintf(ebuf+elen,"%d",i);
306+ break;
307+
308+ case 'x':
309+ ebuf[elen++]=cfgbits?';':'(';
310+ cfgbits++;
311+ elen+=sprintf(ebuf+elen,"U+%04X",i);
312+ break;
313+
314+ default:
315+ if (!complained) {
316+ fprintf(stderr,
317+ "bad character %c in dictionary generator config\n",
318+ *cptr);
319+ complained=1;
320+ }
321+ break;
322+ }
323+ }
324+ ebuf[elen++]=cfgbits?')':';';
325+ }
326+ ebuf[elen++]='\n';
327+
328+ /* try to parse */
329+ for (start=0;start<elen;start+=parsed) {
330+ parsed=parse(elen-start,ebuf+start);
331+
332+ /* complain about errors */
333+ if (parse_state==PS_ERROR) {
334+ puts("can't parse internally generated " /* SNH */
335+ "dictionary entry"); /* SNH */
336+ fwrite(ebuf,1,elen,stdout); /* SNH */
337+ exit(1); /* SNH */
338+ }
339+
340+ /* deal with a complete tree if we have one */
341+ if (parse_state==PS_COMPLETE_TREE) {
342+ to_match=parse_stack[0];
343+ stack_ptr=0;
344+ if (tree_match(match_pattern,to_match)) {
345+ if (cook_output)
346+ write_cooked_tree(to_match);
347+ else {
348+ fwrite(ebuf,1,elen-1,stdout);
349+ echoing_whitespace=1;
350+ }
351+ }
352+ free_node(to_match);
353+ }
354+ }
355+ }
356+
357+ free(ebuf);
358+}
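
Note on the block-name lookup in unilist.c: the 'b' case walks the sorted range table until it finds the first block whose upper bound is not below the current code point, then emits a name only if the lower bound also fits. Code points in gaps between blocks (and the skipped surrogates) therefore get no block name. A minimal sketch of that lookup, assuming a cut-down table: the names BLOCK_RANGE, sample_blocks and block_name_for are hypothetical and do not appear in the commit, and the explicit end-of-table check is an addition (the original relies on the final table row ending at 0x10FFFF).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the unicode_blocks table in unilist.c;
 * only a few rows are reproduced, in ascending order. */
typedef struct {
  int low, high;
  const char *name;
} BLOCK_RANGE;

static const BLOCK_RANGE sample_blocks[] = {
  {0xAC00, 0xD7AF, "Hangul Syllables"},
  {0xD7B0, 0xD7FF, "Hangul Jamo Extended-B"},
  {0xE000, 0xF8FF, "Private Use Area"},
  {0xF900, 0xFAFF, "CJK Compatibility Ideographs"},
};

#define N_SAMPLE_BLOCKS (sizeof(sample_blocks) / sizeof(sample_blocks[0]))

/* Return the block name for code point cp, or NULL when cp lies in a
 * gap between blocks or beyond the end of the table. */
const char *block_name_for(int cp) {
  size_t j;

  assert(cp >= 0);

  /* advance to the first block whose upper bound is not below cp */
  for (j = 0; j < N_SAMPLE_BLOCKS && sample_blocks[j].high < cp; j++)
    ;

  /* that block contains cp only if its lower bound fits as well */
  if (j < N_SAMPLE_BLOCKS && sample_blocks[j].low <= cp)
    return sample_blocks[j].name;
  return NULL;
}
```

Calling block_name_for(0xAC00) yields the Hangul Syllables entry, while a surrogate such as 0xD800 falls between two rows and yields NULL.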
--- trunk/idsgrep/Makefile.am (revision 381)
+++ trunk/idsgrep/Makefile.am (revision 382)
@@ -172,7 +172,7 @@
172172 AM_CFLAGS := $(MAYBE_COVERAGE) $(PCRE_CFLAGS) $(AM_CFLAGS)
173173 idsgrep_SOURCES = \
174174 assoc.c cook.c hash.c idsgrep.c idsgrep.h match.c parse.c \
175- regex.c userpred.c
175+ regex.c unilist.c userpred.c
176176
177177 LDADD = @LIBOBJS@ $(PCRE_LIBS)
178178
--- trunk/idsgrep/userpred.c (revision 381)
+++ trunk/idsgrep/userpred.c (revision 382)
@@ -160,8 +160,10 @@
160160 }
161161
162162 /* do swapping up front - works because entire table is 16-bit entries */
163- for (i=2;i<(length/2);i++)
164- ((uint16_t *)format2_table)[i]=BSWAP16(((uint16_t *)format2_table)[i]);
163+ if (swap_votes>0)
164+ for (i=2;i<(length/2);i++)
165+ ((uint16_t *)format2_table)[i]
166+ =BSWAP16(((uint16_t *)format2_table)[i]);
165167
166168 /* scan through high bytes */
167169 for (i=0;i<256;i++)
@@ -249,8 +251,10 @@
249251 }
250252
251253 /* do swapping up front - works because entire table is 16-bit entries */
252- for (i=2;i<(length/2);i++)
253- ((uint16_t *)format4_table)[i]=BSWAP16(((uint16_t *)format4_table)[i]);
254+ if (swap_votes>0)
255+ for (i=2;i<(length/2);i++)
256+ ((uint16_t *)format4_table)[i]
257+ =BSWAP16(((uint16_t *)format4_table)[i]);
254258
255259 /* set up the pointers */
256260 start_count=&(format4_table->end_count[0])
@@ -304,6 +308,83 @@
304308
305309 /**********************************************************************/
306310
311+/* note that the code for format 12 also works for format 13; glyph
312+ * indices are calculated differently between the two, but here we
313+ * are only interested in the question of whether they are defined at
314+ * all, so the exact values don't matter. */
315+
316+typedef struct _FORMAT12_GROUP {
317+ uint32_t start_char_code PACKED;
318+ uint32_t end_char_code PACKED;
319+ uint32_t start_glyph_id PACKED;
320+} FORMAT12_GROUP;
321+
322+typedef struct _FORMAT12_TABLE {
323+ uint16_t format PACKED;
324+ uint16_t reserved PACKED;
325+ uint32_t length PACKED;
326+ uint32_t language PACKED;
327+ uint32_t n_groups PACKED;
328+ FORMAT12_GROUP groups[];
329+} FORMAT12_TABLE;
330+
331+void scan_format12_table(FILE *fontfile,int swap_votes,
332+ char *fn,int table_number) {
333+ FORMAT12_TABLE *format12_table;
334+ int i,j,k;
335+ uint16_t reserved;
336+ uint32_t length;
337+
338+ /* read the table */
339+ if (fread(&reserved,sizeof(reserved),1,fontfile)!=1) {
340+ fprintf(stderr,"error reading %s (format 12 cmap subtable %d "
341+ "reserved field)\n",fn,table_number);
342+ return;
343+ }
344+ if (fread(&length,sizeof(length),1,fontfile)!=1) {
345+ fprintf(stderr,"error reading %s (format 12 cmap subtable %d length)\n",
346+ fn,table_number);
347+ return;
348+ }
349+ if (swap_votes>0)
350+ length=BSWAP32(length);
351+ format12_table=malloc(length);
352+ format12_table->format=12;
353+ format12_table->length=length;
354+ if (fread(((uint8_t *)format12_table)+8,length-8,1,fontfile)!=1) {
355+ fprintf(stderr,"error reading %s (format 12 cmap subtable %d)\n",
356+ fn,table_number);
357+ free(format12_table);
358+ return;
359+ }
360+
361+ /* do swapping up front - works because entire table is 32-bit entries */
362+ if (swap_votes>0)
363+ for (i=2;i<(length/4);i++)
364+ ((uint32_t *)format12_table)[i]
365+ =BSWAP32(((uint32_t *)format12_table)[i]);
366+
367+ /* check that table is big enough */
368+ if (16+12*format12_table->n_groups>format12_table->length) {
369+ fprintf(stderr,"subtable too small in %s "
370+ "(format 12 cmap subtable %d)\n",fn,table_number);
371+ free(format12_table);
372+ return;
373+ }
374+
375+ /* scan through character codes */
376+ for (i=0;i<format12_table->n_groups;i++)
377+ if (format12_table->groups[i].start_glyph_id!=0)
378+ for (j=format12_table->groups[i].start_char_code;
379+ j<=format12_table->groups[i].end_char_code;
380+ j++)
381+ add_userpred_character(j);
382+
383+ free(format12_table);
384+}
385+
386+/**********************************************************************/
387+
307388 #define CHECKSUM_BUFFER 2048
308389
309390 uint32_t compute_opentype_checksum(FILE *fontfile,uint32_t length,
@@ -547,11 +628,10 @@
547628 break;
548629
549630 case 12: /* Microsoft segmented */
631+ case 13: /* many-to-one - can be handled by format 12 code */
632+ scan_format12_table(fontfile,swap_votes,fn,table_number);
550633 break;
551634
552- case 13: /* many-to-one */
553- break;
554-
555635 default:
556636 /* Subtable type 14, Unicode Variation Sequences, is
557637 * deliberately ignored because main characters in it are only
@@ -595,9 +675,9 @@
595675 i=atoi(ms->nc_needle->child[0]->head->data);
596676 else
597677 i=1;
598- if (i==0)
678+ if (i<=0)
599679 i=1;
600- ms->match_result=((i<=num_userpreds) && (i>0) &&
680+ ms->match_result=((i<=num_userpreds) &&
601681 ((ms->nc_haystack->head->userpreds&(1<<(i-1)))!=0))?
602682 MR_TRUE:MR_FALSE;
603683 return ms;
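
Note on the format 12 scan in userpred.c: the subtable is a 16-byte header followed by n_groups records of 12 bytes each (start code, end code, start glyph ID), all stored big-endian. The sketch below illustrates the same size check and group walk on an in-memory buffer. It is not the commit's code: it reads big-endian fields byte by byte rather than using BSWAP32, it counts covered code points instead of calling add_userpred_character, and the names read_be32 and count_format12_chars are hypothetical. Like the commit, it skips groups whose start glyph ID is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit big-endian value regardless of host byte order. */
static uint32_t read_be32(const uint8_t *p) {
  return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
       | ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Count the code points covered by a format 12 (or 13) cmap subtable
 * held in sub[0..len-1].  Returns -1 if the buffer is too small for
 * the header or for the number of groups it claims to contain. */
long long count_format12_chars(const uint8_t *sub, size_t len) {
  uint32_t n_groups, g;
  long long total = 0;

  assert(sub != NULL);

  /* header: format(2) reserved(2) length(4) language(4) n_groups(4) */
  if (len < 16)
    return -1;
  n_groups = read_be32(sub + 12);

  /* same idea as the "16+12*n_groups>length" check, written so that
   * the multiplication cannot overflow */
  if (n_groups > (len - 16) / 12)
    return -1;

  for (g = 0; g < n_groups; g++) {
    const uint8_t *grp = sub + 16 + 12 * (size_t)g;
    uint32_t start = read_be32(grp);
    uint32_t end = read_be32(grp + 4);
    uint32_t glyph = read_be32(grp + 8);

    /* mirror the commit: ignore groups that start at glyph 0 */
    if (glyph != 0 && start <= end)
      total += (long long)(end - start) + 1;
  }
  return total;
}
```

A single group covering U+0041 through U+005A gives a count of 26; truncating the same buffer to 20 bytes makes the size check fail and the function return -1.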
--- trunk/idsgrep/idsgrep.c (revision 381)
+++ trunk/idsgrep/idsgrep.c (revision 382)
@@ -1,6 +1,6 @@
11 /*
22 * Extended IDS matcher
3- * Copyright (C) 2012 Matthew Skala
3+ * Copyright (C) 2012, 2013 Matthew Skala
44 *
55 * This program is free software: you can redistribute it and/or modify
66 * it under the terms of the GNU General Public License as published by
@@ -85,8 +85,8 @@
8585 /* complain about errors */
8686 if (parse_state==PS_ERROR) {
8787 puts("can't parse input pattern");
88- fwrite(input_buffer,1,parse_ptr,stdout);
89- putchar('\n');
88+ fwrite(input_buffer,1,parse_ptr,stdout);
89+ putchar('\n');
9090 exit(1);
9191 }
9292
@@ -131,6 +131,7 @@
131131 {"dictionary",optional_argument,NULL,'d'},
132132 {"font-chars",required_argument,NULL,'f'},
133133 {"help",no_argument,NULL,'h'},
134+ {"unicode-list",optional_argument,NULL,'U'},
134135 {"version",no_argument,NULL,'V'},
135136 {0,0,0,0},
136137 };
@@ -145,9 +146,9 @@
145146 int main(int argc,char **argv) {
146147 NODE *match_pattern;
147148 int c,num_files=0;
148- char *dictdir,*dictname=NULL,*dictglob;
149+ char *dictdir,*dictname=NULL,*dictglob,*unilist_cfg=NULL;
149150 glob_t globres;
150- int show_version=0,show_help=0;
151+ int show_version=0,show_help=0,generate_list=0;
151152
152153 /* quick usage message */
153154 if (argc<2)
@@ -157,8 +158,13 @@
157158 register_syntax();
158159
159160 /* loop on command-line options */
160- while ((c=getopt_long(argc,argv,"Vc:d::f:h",long_opts,NULL))!=-1) {
161+ while ((c=getopt_long(argc,argv,"U::Vc:d::f:h",long_opts,NULL))!=-1) {
161162 switch (c) {
163+
164+ case 'U':
165+ generate_list=1;
166+ unilist_cfg=optarg;
167+ break;
162168
163169 case 'V':
164170 show_version=1;
@@ -200,6 +206,7 @@
200206 puts("Usage: " PACKAGE_TARNAME " [OPTION]... PATTERN [FILE]...\n"
201207 "PATTERN should be an Extended Ideographic Description Sequence\n\n"
202208 "Options:\n"
209+ " -U, --unicode-list=CFG generate Unicode list\n"
203210 " -V, --version display version and license\n"
204211 " -c, --cooking=FMT set input/output cooking\n"
205212 " -d, --dictionary=NAME search standard dictionary\n"
@@ -226,6 +233,10 @@
226233 /* count explicit filenames */
227234 num_files=argc-optind;
228235
236+ /* generate Unicode list if requested */
237+ if (generate_list)
238+ generate_unicode_list(match_pattern,unilist_cfg);
239+
229240 /* loop on default dictionaries */
230241 if (dictname!=NULL) {
231242 dictdir=getenv("IDSGREP_DICTDIR");
@@ -248,7 +259,7 @@
248259 process_file(match_pattern,argv[optind++],num_files>1?0:-1);
249260
250261 /* read stdin or complain */
251- if (num_files==0) {
262+ if ((num_files==0) && (generate_list==0)) {
252263 if (dictname==NULL)
253264 process_file(match_pattern,"-",-1);
254265 else
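
Note on the -U option parsing in idsgrep.c: the "U::" entry in the short option string marks the argument as optional, which in GNU getopt_long means the CFG string is only picked up when it is attached to the option (-Uxdb), not when it follows as a separate word (-U xdb). The sketch below shows that behaviour in isolation. The helper name find_unicode_list_option is hypothetical, and the long form is declared here with optional_argument, which is what the --unicode-list=CFG spelling in the help text would need; this is an assumption of the sketch, not a statement about the commit.

```c
#include <assert.h>
#include <getopt.h>
#include <stddef.h>

/* Hypothetical helper, not part of idsgrep: scan argv for -U or
 * --unicode-list and store the optional CFG string (or NULL) in *cfg.
 * Returns 1 when the option was given, 0 otherwise. */
int find_unicode_list_option(int argc, char **argv, const char **cfg) {
  static const struct option long_opts[] = {
    {"unicode-list", optional_argument, NULL, 'U'},
    {0, 0, 0, 0},
  };
  int c, seen = 0;

  assert(cfg != NULL);
  *cfg = NULL;
  optind = 1;  /* restart scanning from the first argument */
  opterr = 0;  /* keep getopt quiet about options we do not list */

  while ((c = getopt_long(argc, argv, "U::", long_opts, NULL)) != -1) {
    if (c == 'U') {
      seen = 1;
      *cfg = optarg;  /* NULL when no CFG was attached to the option */
    }
  }
  return seen;
}
```

With the arguments "-Uxdb ?" the helper reports the option and the configuration string "xdb"; with "-U ?" it reports the option with a NULL configuration, and "?" is left as the pattern.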
--- trunk/doc/usermanual.tex (revision 381)
+++ trunk/doc/usermanual.tex (revision 382)
@@ -2408,7 +2408,8 @@
24082408 The ``-{}-enable-parasites'' option allows this kind of include/exclude
24092409 selection for parasite packages (see next subsection). The supported tokens
24102410 are ``all,'' ``none,'' and the names of the parasites, currently
2411-``genjimon,'' ``idsgrep,'' and ``ocr.'' The default is ``none.''
2411+``beikaitoru,'' ``genjimon,'' ``idsgrep,'' and ``ocr.''
2412+The default is ``none.''
24122413
24132414 The ``make dist'' target defaults to building a ZIP file only, instead of
24142415 GNU's recommended tar-gzip. This decision was made in order to be