tsukurimashou: Commit


Commit MetaInfo

Revision: 457 (tree)
Time: 2013-08-21 08:42:51
Author: mskala

Log Message

final stuff for IDSgrep 0.4

Incremental Difference

--- trunk/idsgrep/idsgrep.tex (revision 456)
+++ trunk/idsgrep/idsgrep.tex (revision 457)
@@ -456,10 +456,11 @@
 installed. To do a coverage analysis, run \texttt{configure} with
 \texttt{--enable-gcov} and any other desired options, then do \texttt{make
 clean} (necessary to be sure all object files are rebuilt with the coverage
-instrumentation) followed by \texttt{make check}. Full coverage can only be
-attained if the dictionary files are installed (not just built). Most
-people would not want to install the IDSgrep binary itself when built under
-this option.
+instrumentation) followed by \texttt{make check}. Most people would not
+want to install the IDSgrep binary itself when built under this option. As
+of version 0.4, the current test suite is not expected to achieve full
+coverage on most installations (though it should come close), so do not
+report failure of this test as a bug nor get too concerned about it.

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@@ -817,6 +818,7 @@

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

+\break\vspace*{-1.82\baselineskip}%
 \section{A note on TrueType/OpenType}\DangerousSection

 This version of IDSgrep is designed to read TrueType or
@@ -1029,10 +1031,17 @@
 specifying ``\texttt{-Uxdb}'' will generate and scan a dictionary that
 includes the line ``\texttt{<A>(U+0041;65;Basic Latin)}.'' This option is
 intended to be used together with \texttt{-f} to produce font coverage
-lists. Bit vector indexing is of use for the internally-generated
-Unicode list, but when the query string has a head, IDSgrep will generate
+lists.
+
+Bit \DangerousBend vector indexing is of no use for the internally-generated
+Unicode list, but when the query tree has a head, IDSgrep will generate
 only the at most one dictionary entry that could match that query, giving
 something very much like the benefit of bit vector indexing.
+This option generates the entries as EIDS trees in an internal format,
+not as a byte stream, bypassing the input parser,
+so output from \texttt{-U}
+is always cooked even when a raw mode is selected with
+\texttt{-c} to be used for real input.

 \item[\texttt{-V}, \texttt{--version}] Display the version and license
 information for IDSgrep.
@@ -2009,12 +2018,12 @@
 \url{http://sourceforge.net/projects/buddy/}~\cite{BuDDy}. Without it, bit
 vectors will still provide some speed improvement, but not as much.

-Bit vector indices are expected to increase the speed of searching by about
-a factor of 15 in typical use. The improvement factor varies a lot
-depending on a number of issues, and could be a thousand or more under
-optimal conditions. It should never be significantly less than one; that
-is, searching with a bit vector index should never take significantly longer
-than searching without one.
+Bit vector indices properly used are expected to increase the speed of
+searching by about a factor of 15 in typical cases. The improvement factor
+varies a lot depending on a number of issues, and could be a thousand or
+more under optimal conditions. It should never be significantly less than
+one; that is, searching with a bit vector index should never take
+significantly longer than searching without one.

 Bit vectors provide the greatest benefit when the query is simple
 (exact-matching a single syrupy character is best); when the dictionary
@@ -2025,7 +2034,7 @@
 when the query does not include special matching operators such as regular
 expressions and user-defined predicates.

-Whenever the \texttt{texttt} utility reads a file whose pathname ends in
+Whenever the \texttt{idsgrep} utility reads a file whose pathname ends in
 ``\texttt{.eids}''---regardless of whether that file was specified
 explicitly on the command line or indirectly via the \texttt{-d} option---it
 will look for an index file whose pathname is the same except with the
@@ -2036,7 +2045,7 @@
 speed up the query process. Note that all those conditions must be met. If
 any of the conditions fail to be met, no error will be reported, but the
 scanner will be forced to read and parse the entire input file without using
-bit vector filtering. Once the \emph{idsgrep} utility commits to start
+bit vector filtering. Once the \texttt{idsgrep} utility commits to start
 reading the index file past the header, it cannot switch to index-free
 searching and errors after that point will abort the search, just like
 errors in the EIDS input file.
@@ -2065,7 +2074,8 @@
 \texttt{-I} option; that is unlikely to be useful except during speed tests,
 but one could maybe imagine a case where it's absolutely necessary to have a
 file named \texttt{*.bvec} which is not a bit vector index and must not be
-touched.
+touched, or where even looking for the index file incurs undesired traffic
+on a network filesystem.

 Using a (valid) bit vector index, or not using one, should only affect
 speed. It should never change which results are or are not returned from a
@@ -2586,6 +2596,32 @@
 the tree match memoization looks for the parser's flag to determine which
 nodes it is allowed to cache.

+The format of the statistics line generated by the \texttt{-{}-statistics}
+option is space-separated fields; the first is ``\texttt{STATS}'' and then
+the rest are mostly decimal numbers, in this order:
+\begin{itemize}
+ \item bit vector (lambda filter) checks;
+ \item lambda filter hits;
+ \item BDD hits (necessarily zero if BDDs not compiled in; the
+ number of BDD \emph{checks} when BDDs are used
+ is always exactly the value of the previous
+ field and thus not reported separately);
+ \item tree checks (may be greater than bit vector hits, because
+ of unindexed input which skips directly to the tree checking step);
+ \item tree hits (these result in output of matched trees);
+ \item memoization checks (may be much larger than number of tree checks,
+ because memoization happens inside the recursion of the tree check, but
+ only on sufficiently complicated needles);
+ \item memoization hits;
+ \item user CPU time (reported as seconds with a decimal fraction down to
+ microsecond precision as in the \texttt{struct rusage},
+ but your operating system probably rounds these numbers
+ to $1/100$ or $1/1000$ of a second);
+ \item node count in the BDD (zero if none was used or the feature is
+ absent); and
+ \item the query tree, in cooked EIDS format.
+\end{itemize}
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- trunk/idsgrep/bitvec.c (revision 456)
+++ trunk/idsgrep/bitvec.c (revision 457)
@@ -452,13 +452,13 @@
 z->lambda=c_min-1;
 break;

- case 7: /* nothing - should only happen with BDDs */
+ case 7: /* nothing */
 #ifdef HAVE_BUDDY
 z->bits[0]=UINT64_MAX;
 z->bits[1]=UINT64_MAX;
 z->lambda=-1;
 #else
- return bf_true(z); /* SNH */
+ return bf_true(z);
 #endif
 break;
 }
--- trunk/idsgrep/idsgrep.1.in (revision 456)
+++ trunk/idsgrep/idsgrep.1.in (revision 457)
@@ -161,12 +161,29 @@
 .BI \-U "CFG\fR,\fP " "\-\^\-unicode-list=" CFG
 Generate a list of Unicode characters and use that as a dictionary, before
 and in addition to any others that may have been specified.
-The generated dictionary contains 1112064 entries, one for every Unicode
+The generated dictionary theoretically contains 1112064 entries,
+one for every Unicode
 code point excluding surrogates; the head of the entry is the single
 character, and the tail is (by default) a nullary semicolon, or (if a
 .I CFG
 string has been specified) a nullary functor containing some information
 about the character.
+If the search query has a head, then because all the
+entries in the generated dictionary have unique heads, at most one of them
+can be a match under the EIDS matching rules.
+As an optimization, in such a case
+.B @PACKAGE@
+will only generate the one matching entry (if any), resulting in a
+significant speed increase.
+For any other kind of query it actually generates all the entries and
+tries to match them.
+This option generates the entries as EIDS trees in an internal format,
+not as a byte stream, bypassing the input parser,
+so output from
+.B \-U
+is always cooked even when a raw mode is selected with
+.B \-c
+to be used for real input.
 The
 .I CFG
 string should be some combination of the following characters, each of which
@@ -203,7 +220,8 @@
 The format of the line is:
 .IP "" 11
 .B STATS
-.I weight-checks weight-hits bdd-hits tree-checks tree-hits time query
+.I weight-checks weight-hits bdd-hits tree-checks tree-hits
+.I memo-checks memo-hits time query
 .IP "" 7
 The fields are:
 .IR weight-checks ,
@@ -218,6 +236,10 @@
 vector index is in use and so all trees go directly to the tree check;
 .IR tree-hits ,
 the number of trees passing the tree check and thus returned as results;
+.IR memo-checks ,
+the number of checks performed on the tree-match memoization table;
+.IR memo-hits ,
+the number of hits in tree-match memoization ;
 .IR time ,
 the number of seconds of user CPU time, as measured by
 .BR getrusage (2);
--- trunk/idsgrep/configure.ac (revision 456)
+++ trunk/idsgrep/configure.ac (revision 457)
@@ -159,7 +159,7 @@
 #
 AC_PREREQ([2.67])
 AC_INIT([IDSgrep],
- [0.4pre], [mskala@ansuz.sooke.bc.ca], [idsgrep],
+ [0.4], [mskala@ansuz.sooke.bc.ca], [idsgrep],
 [[http://tsukurimashou.sourceforge.jp/]])
 AC_PRESERVE_HELP_ORDER
 AC_CONFIG_AUX_DIR([.])
@@ -169,7 +169,7 @@
 AC_CONFIG_MACRO_DIR([m4])
 AC_REVISION([$Id: configure.ac 1015 2011-12-15 22:24:32Z mskala $])
 AC_COPYRIGHT([Copyright (C) 2012, 2013 Matthew Skala])
-AC_SUBST([release_date],["March 7, 2013"])
+AC_SUBST([release_date],["August 20, 2013"])
 AM_SILENT_RULES
 #
 ############################################################################
@@ -359,7 +359,7 @@
 tsukurimashou-0.6 tsukurimashou-0.7 tsukurimashou-0.8 dnl
 tsukurimashou-0.9 tsukurimashou-0.10 tsukurimashou-0.11]),[
 m4_foreach_w([tbcheckpath],m4_expand([$srcdir $srcdir/.. $srcdir/../.. dnl
-$prefix/src /src /usr/src /usr/local/src dnl
+$srcdir/../../.. $prefix/src /src /usr/src /usr/local/src dnl
 $HOME $HOME/src]),[
 AS_IF([test "$with_tsuku_build" = "auto"],
 [AS_IF([test -r "tbcheckpath/tbcheckname/Makefile"],[
--- trunk/idsgrep/Makefile.am (revision 456)
+++ trunk/idsgrep/Makefile.am (revision 457)
@@ -131,6 +131,35 @@

 ############################################################################

+# TESTING
+
+# must go before "filenames for autotools"
+
+GCOV_TESTS = \
+ test/andor test/anynot test/assoc test/basicmatch test/backslash \
+ test/bighash test/cooked test/demorgan test/equal test/genbv \
+ test/kvg-grone test/messages test/regex test/spacing test/speed \
+ test/tsu-grone test/unilist test/unord test/userpred test/utf8
+
+define GCDEP_RECIPE
+$1.log: test/rmgcda.log
+
+endef
+
+if COND_GCOV
+
+ TESTS = test/rmgcda $(GCOV_TESTS) test/gcov
+
+ $(foreach test,$(GCOV_TESTS),$(eval $(call GCDEP_RECIPE,$(test))))
+
+ test/gcov.log: $(foreach test,$(GCOV_TESTS),$(test).log)
+
+else
+ TESTS = $(GCOV_TESTS) test/vgneko
+endif
+
+############################################################################
+
 # FILENAMES FOR AUTOTOOLS

 # QVFG and FPEVCGF are DIST and SCRIPTS in ROT13, to keep Automake
@@ -171,11 +200,12 @@

 MOSTLYCLEANFILES = \
 idsgrep.aux idsgrep.log idsgrep.blg idsgrep.bbl idsgrep.toc \
- *.gcda *.gcno *.gcov
+ *.bvec *.gcda *.gcno *.gcov

 CLEANFILES = chise.errs edict.eids kanjivg.eids tsukurimashou.eids

-DISTCLEANFILES = $(if $(VPATH),idsgrep.pdf chise.eids,)
+DISTCLEANFILES = \
+ $(if $(VPATH),idsgrep.pdf chise.eids,) _stdint.h aminclude.am

 MAINTAINERCLEANFILES = idsgrep.pdf chise.eids

@@ -239,33 +269,6 @@

 ############################################################################

-# TESTING
-
-GCOV_TESTS = \
- test/andor test/anynot test/assoc test/basicmatch test/backslash \
- test/bighash test/cooked test/demorgan test/equal test/genbv \
- test/kvg-grone test/messages test/regex test/spacing test/speed \
- test/tsu-grone test/unilist test/unord test/userpred test/utf8
-
-define GCDEP_RECIPE
-$1.log: test/rmgcda.log
-
-endef
-
-if COND_GCOV
-
- TESTS = test/rmgcda $(GCOV_TESTS) test/gcov
-
- $(foreach test,$(GCOV_TESTS),$(eval $(call GCDEP_RECIPE,$(test))))
-
- test/gcov.log: $(foreach test,$(GCOV_TESTS),$(test).log)
-
-else
- TESTS = $(GCOV_TESTS) test/vgneko
-endif
-
-############################################################################
-
 # AUTOMAKE'S RULES WILL GO HERE

 automake_rules = here
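
Note on the statistics line documented in this commit: the field list added to idsgrep.tex can be read as a small parsing recipe. The following is a minimal sketch in Python, not part of the commit and not IDSgrep code. It assumes the field order given in the idsgrep.tex hunk, including the BDD node count between the time and the query; the man page synopsis changed in the same commit does not list that field, so the count of numeric fields is an assumption to check against real output. The names `StatsLine` and `parse_stats_line` and the sample values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StatsLine:
    """One parsed STATS line; field order follows the idsgrep.tex list."""
    lambda_checks: int       # bit vector (lambda filter) checks
    lambda_hits: int         # lambda filter hits
    bdd_hits: int            # zero when BDDs are not compiled in
    tree_checks: int
    tree_hits: int
    memo_checks: int
    memo_hits: int
    user_cpu_seconds: float  # seconds with a decimal fraction
    bdd_nodes: int           # assumed present, per idsgrep.tex only
    query: str               # cooked EIDS tree; may contain spaces


def parse_stats_line(line: str) -> StatsLine:
    # Split off the tag and nine numeric fields; whatever remains is the
    # query tree, kept intact because it may itself contain spaces.
    parts = line.strip().split(None, 10)
    if len(parts) != 11 or parts[0] != "STATS":
        raise ValueError("not a STATS line in the assumed format")
    counts = [int(p) for p in parts[1:8]]
    return StatsLine(*counts, float(parts[8]), int(parts[9]), parts[10])
```

For example, `parse_stats_line("STATS 100 20 15 15 3 40 10 0.012345 0 <A>(U+0041;65;Basic Latin)")` would yield a record whose `query` is the whole trailing tree, space included. If real output follows the man page synopsis instead, dropping `bdd_nodes` and splitting with a limit of 9 would be the corresponding adjustment.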