
tsukurimashou: Commit


Commit MetaInfo

Revision 456 (tree)
Time 2013-08-11 08:44:53
Author mskala

Log Message

bug fix, more docs on advanced features

Incremental Difference

--- trunk/idsgrep/idsgrep.tex (revision 455)
+++ trunk/idsgrep/idsgrep.tex (revision 456)
@@ -817,7 +817,7 @@
817817
818818 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
819819
820-\section{A note on TrueType/OpenType}\DangerousBend
820+\section{A note on TrueType/OpenType}\DangerousSection
821821
822822 This version of IDSgrep is designed to read TrueType or
823823 OpenType files (the distinction between the two is not relevant at this
@@ -2138,14 +2138,14 @@
21382138 This idea of doing an easy approximate check first, to save the cost of a
21392139 more difficult accurate evaluation, should be quite familiar. Imagine a
21402140 hiring committee quickly throwing out all the job applications from software
2141-engineers, then interviewing the computational linguists.\footnote{You may
2142-say I'm a dreamer.} Similar concepts appear throughout computer science:
2143-consider instruction and data caches; memoization for dynamic programming;
2144-solving relaxed versions of optimization problems; and especially Bloom
2145-filters~\cite{Bloom:Space}. The bit vector technique used in IDSgrep builds
2146-on Bloom filters and on the work of Skala and
2147-others~\cite{Skala:Generalized} and Skala and Penn~\cite{Skala:Approximate}
2148-on unification of types in logic programming systems.
2141+engineers, then interviewing the computational linguists. Similar concepts
2142+appear throughout computer science: consider instruction and data caches;
2143+memoization for dynamic programming; solving relaxed versions of
2144+optimization problems; and especially Bloom filters~\cite{Bloom:Space}. The
2145+bit vector technique used in IDSgrep builds on Bloom filters and on the work
2146+of Skala and others~\cite{Skala:Generalized} and Skala and
2147+Penn~\cite{Skala:Approximate} on unification of types in logic programming
2148+systems.
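As a minimal sketch of the mask-and-threshold ("lambda filter") test discussed above, the following C fragment assumes 128-bit haystack vectors stored as two uint64_t words; the type and function names here are illustrative only, not IDSgrep's actual API.

    #include <stdint.h>

    typedef struct {
        uint64_t mask[2];   /* which of the 128 vector bits the filter examines */
        int lambda;         /* minimum number of those bits that must be set */
    } LAMBDA_FILTER;

    static int popcount64(uint64_t x) {
        int c = 0;
        while (x) { x &= x - 1; c++; }   /* clear lowest set bit each pass */
        return c;
    }

    /* nonzero means the haystack cannot be ruled out and must be matched fully */
    static int lambda_filter_hits(const LAMBDA_FILTER *f, const uint64_t vec[2]) {
        int hits = popcount64(vec[0] & f->mask[0])
                 + popcount64(vec[1] & f->mask[1]);
        return hits >= f->lambda;
    }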
21492149
21502150 Functional programming weenies will note that the real point of an element
21512151 of $\mathbb{F}$ is that you can combine it with $\mathit{check}$ to make a
@@ -2257,7 +2257,9 @@
22572257 $m_3$, so $(m_3,\lambda_3)$ will also match. The same is true for
22582258 $(m_2,\lambda_2)$; so if either of the original filters matched, then the
22592259 combined filter must match. In this way we can find a filter for the OR of
2260-two existing filters. Note that this is not an if and only if: there could
2260+two existing filters.
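One natural reading of that construction, reusing the hypothetical LAMBDA_FILTER type from the earlier sketch (the actual IDSgrep routine may differ), is to take the union of the two masks and the smaller of the two thresholds.

    static void lambda_filter_or(LAMBDA_FILTER *out,
                                 const LAMBDA_FILTER *a,
                                 const LAMBDA_FILTER *b) {
        out->mask[0] = a->mask[0] | b->mask[0];
        out->mask[1] = a->mask[1] | b->mask[1];
        /* anything with at least a->lambda hits in a's mask, or b->lambda hits
           in b's mask, has at least min(a->lambda, b->lambda) hits in the union */
        out->lambda = (a->lambda < b->lambda) ? a->lambda : b->lambda;
    }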
2261+
2262+Note that the OR filter is not an if and only if: there could
22612263 be combinations of bits that hit part of $m_1$ and part of $m_2$, so that
22622264 neither $(m_1,\lambda_1)$ nor $(m_2,\lambda_2)$ would match but
22632265 $(m_3,\lambda_3)$ matches. We have lost some precision in the filter
@@ -2283,21 +2285,21 @@
22832285 lambda filters.
22842286
22852287 When we push a subtree further down in the EIDS tree, its filter changes.
2286-Suppose we have a filter for a given needle (assuming it is the root) but it
2287-actually appears as the right child of a binary root. Any bits in its mask
2288-that would have queried the first word (head, functor, and arity of root)
2289-must now query the third word (head, functor, and arity of last child). We
2290-can easily rearrange those bits in the mask. However, if the mask selects
2291-any bits from the second through fourth words when the needle is at the
2292-root, then when it is shifted down the tree all those bits must come from
2293-the last word of the haystack's vector. They could collide with each other.
2294-Then a haystack that formerly would have been hit by the mask in three
2295-places might only be hit once, because all three of those bits collided.
2296-The value of $\lambda$ might need to be reduced by as much as a factor of
2297-three. IDSgrep computes all possible collisions and tries to use as large a
2298-value of $\lambda$ for the modified filter as it can prove will still give
2299-correct results. Similar considerations apply to the operations of moving a
2300-needle's filter from the root to the left or middle children.
2288+Suppose we have a filter that would be correct for a given needle if it were
2289+the root, but it actually appears as the right child of a binary root. Any
2290+bits in its mask that would have queried the first word (head, functor, and
2291+arity of root) must now query the third word (head, functor, and arity of
2292+last child). We can easily rearrange those bits in the mask. However, if
2293+the mask selects any bits from the second through fourth words when the
2294+needle is at the root, then when it is shifted down the tree all those bits
2295+must come from the last word of the haystack's vector. They could collide
2296+with each other. Then a haystack that formerly would have been hit by the
2297+mask in three places might only be hit once, because all three of those bits
2298+collided. The value of $\lambda$ might need to be reduced by as much as a
2299+factor of three. IDSgrep computes all possible collisions and tries to use
2300+as large a value of $\lambda$ for the modified filter as it can prove will
2301+still give correct results. Similar considerations apply to the operations
2302+of moving a needle's filter from the root to the left or middle children.
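The worst-case reduction of $\lambda$ follows from a pigeonhole argument, and the following C sketch shows one way to compute a provably safe new threshold, assuming we already know how many original mask bits fold onto each destination bit. It illustrates the idea only; it is not IDSgrep's actual collision computation.

    #include <stdlib.h>

    static int cmp_desc(const void *a, const void *b) {
        return *(const int *)b - *(const int *)a;
    }

    /* group_size[i] = how many original mask bits landed on destination bit i;
       returns the largest threshold still guaranteed after the bits collide */
    static int reduced_lambda(int old_lambda, int *group_size, int ngroups) {
        int covered = 0, used = 0, i;
        qsort(group_size, ngroups, sizeof(int), cmp_desc);
        for (i = 0; i < ngroups && covered < old_lambda; i++) {
            covered += group_size[i];  /* adversary fills the biggest groups first */
            used++;
        }
        return used;   /* with groups of size up to 3, this can be old_lambda/3 */
    }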
23012303
23022304 Given these operations of OR, AND, and moving a needle that refers to the
23032305 root to refer to a child instead, IDSgrep can find a lambda filter for basic
@@ -2314,7 +2316,7 @@
23142316 match-everything filter. The IDSgrep implementation of NOT goes slightly
23152317 beyond pure filter calculus by looking at the actual needle instead of only
23162318 the needle's recursively-computed filter; so it is smart enough to recognize
2317-\texttt{!!} as a no-op, \texttt{!?} as match-nothing, and to apply De
2319+\texttt{!!} as a no-op, \texttt{!?} as match-nothing, and to apply de
23182320 Morgan's laws to AND and OR. The unordered match operator \texttt{*}
23192321 translates easily into the OR of all permutations of its child, and the
23202322 literal match operator \texttt{=} is straightforward. Other match
@@ -2395,6 +2397,89 @@
23952397 get better results from NOT. More complicated operations, like
23962398 match-anywhere, have reasonably straightforward implementations.
23972399
2400+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2401+
2402+\section{Memoization in the tree match}\DangerousSection
2403+
2404+This feature does not involve bit vectors at all, but is described in the bit
2405+vector chapter because, like bit vectors, it is a speed enhancement that
2406+doesn't change the basic matching algorithm. There is no ``r'' in
2407+``memoization.'' That word is a shibboleth\footnote{Judges 12:6.} for CS
2408+theorists.
2409+
2410+For the most part, when a match is not ruled out by the bit vector indices,
2411+IDSgrep uses a straightforward recursive-descent algorithm to implement the
2412+tree matching function $\mathit{match}$. That works well in practice for
2413+the kinds of queries typically encountered. However, it is possible to
2414+construct pathological queries on which the recursive matching algorithm
2415+will misbehave. For instance, a query with many nested instances of
2416+\texttt{.anywhere.} can bog down attempting to match against a deep tree, as
2417+it tries all possibilities for the intermediate nodes that anchor the
2418+different \texttt{.anywhere.} operators, even though the choices are all
2419+equivalent.
2420+Matching according to IDSgrep's matching rules should in fact be possible in
2421+polynomial time, because we can match each node in the needle tree once
2422+against each node in the haystack tree. Assuming the match function is
2423+well-defined we only need do that once per pair of nodes; so this suggests a
2424+dynamic programming algorithm. The dynamic programming algorithm is
2425+unlikely to be a good idea in the usual case, because of excessive overhead
2426+maintaining the table and the possibility of doing extra, unnecessary
2427+matches if we're not careful. Recursive descent, with short-circuiting of
2428+unnecessary subtree tests once the answer is known, seems to perform much
2429+better in practice in ordinary cases despite its exponential worst-case
2430+bound. Nonetheless, to cover bad cases that may occur in the input,
2431+IDSgrep implements memoization of tree matches when it appears likely to be
2432+helpful.
2433+
2434+The matching operators that could cause trouble are ``\texttt{...}''
2435+(\texttt{.anywhere.}) and ``\texttt{.*.}'' (\texttt{.unord.}). All other
2436+special matching operators, and the default matching rules, recurse at most
2437+once into each child; but match-anywhere attempts to match its needle tree
2438+once against every descendant of its haystack tree, and unordered match
2439+attempts to match its needle once against each of the up to six permutations
2440+of its haystack's children. The time to do matching by recursive descent
2441+ends up having an exponent determined by the number of uses of those
2442+operators in the matching pattern.
2443+
2444+So IDSgrep counts how many times either of those operators occurs in the
2445+matching pattern, and uses that count to determine both whether memoization
2446+would help, and how big a table size to use. The memoized match is
2447+straightforward. Before testing a needle tree against a haystack tree,
2448+IDSgrep checks whether that pair of needle and haystack is recorded in the
2449+memoization hash table. If it is, the answer is taken from the table
2450+instead of calculated fresh. Otherwise, once it has been calculated, the
2451+result goes in the table.
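A minimal sketch of that lookup follows. The record layout, hash function, and names are invented for illustration; IDSgrep's real MEMO_REC keys on node pointers and carries a generation number, as described later in this chapter.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct node NODE;            /* EIDS tree node; details not shown */

    typedef struct {
        const NODE *needle, *haystack;
        int have_result, result;
    } MEMO_ENTRY;

    #define MEMO_SIZE 2048
    static MEMO_ENTRY memo_table[MEMO_SIZE];

    static int tree_match_uncached(const NODE *needle, const NODE *haystack);

    static int tree_match(const NODE *needle, const NODE *haystack) {
        size_t h = (((uintptr_t)needle >> 4) * 31u
                    + ((uintptr_t)haystack >> 4)) % MEMO_SIZE;
        MEMO_ENTRY *e = &memo_table[h];
        if (e->have_result && e->needle == needle && e->haystack == haystack)
            return e->result;                 /* answer taken from the table */
        int r = tree_match_uncached(needle, haystack);
        e->needle = needle;                   /* a collision just overwrites */
        e->haystack = haystack;
        e->result = r;
        e->have_result = 1;
        return r;
    }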
2452+
2453+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2454+
2455+\section{Implementation details}\DangerousSection
2456+
2457+Storing bit vectors requires storing two \texttt{uint64\_t}s and an
2458+\texttt{off\_t}, with possible alignment padding, per entry. That works out
2459+to 24 bytes on typical newer 64-bit and 32-bit platforms. Some older 32-bit
2460+platforms with 32-bit \texttt{off\_t}s may only require 20 bytes per index
2461+entry, trading off the decreased space against likely problems should a
2462+dictionary ever grow larger than 4G. For comparison, the average length of
2463+a dictionary entry in the CHISE, KanjiVG, and Tsukurimashou dictionaries as
2464+of this writing is about 40 bytes; four times that for EDICT2. So storing
2465+the indices represents a significant, but not crippling, increase in storage
2466+requirements. The performance improvement numbers do account for the time
2467+taken to read the additional data from disk---it just isn't enough to be a
2468+problem compared to the savings in parsing and tree matching.
2469+
2470+The format of the bit vector index file is specific to the version of
2471+IDSgrep and the computer architecture including the C compiler. If you use
2472+a bit vector file with a mismatched \texttt{idsgrep} binary, it will almost
2473+certainly be detected as invalid and ignored. Building with or without
2474+BuDDy will not make a difference to the file format; the difference is a
2475+different search algorithm applied to the same data.
2476+
2477+It is possible that the magic numbers may not
2478+perfectly track incompatible changes among different development versions
2479+of IDSgrep that you might check out of SVN, but they should definitely work
2480+correctly to differentiate between incompatible versions that have been
2481+formally released.
2482+
23982483 One issue for BDD filters that did not occur with lambda filters is that
23992484 there is no practical limit to the size of a BDD on 128 bits, so it could
24002485 grow until it exhausts memory or consumes so much time for filter calculus
@@ -2456,36 +2541,51 @@
24562541 effort into automatic reordering of the variables to optimize the BDD
24572542 calculations.
24582543
2459-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2544+The particular case of many \texttt{.anywhere.} operators nested inside one
2545+another with no heads and no other nodes in between might be detected and
2546+automatically simplified. Any number of those will match the same set of
2547+trees as a single such operator, so the extras can be simply removed without
2547+affecting the semantics. But it is easy to construct more elaborate queries
2549+that still take exponential time without memoization and that the system
2550+cannot reasonably simplify. Users are unlikely to do that by accident, but
2551+now that I've mentioned the possibility, someone will try.
24602552
2461-\section{Implementation details}\DangerousSection
2553+Let $k$ represent the number of \texttt{.anywhere.} and \texttt{.unord.}
2554+operators in the matching pattern. It is actually determined
2555+by looking at the reference counts for the strings consisting of a single
2556+ASCII period and a single ASCII asterisk in an internal string table, so
2557+there may be an overcount if those strings happen to occur as heads or
2558+as non-unary functors. If $k$ is less than three, memoization is not
2559+expected to be helpful, and so is not done. Then assuming $k$ was at least
2560+three, if it is less than ten it is set to ten and if it is greater than 22
2561+it is set to 22, and then memoization will be done with a hash table of
2562+$2^{k+1}$ entries. The idea here is to make the hash table grow as the
2563+number of entries needed grows, but always at least 2048 entries because
2564+there is little benefit from making it very small, and never more than 8M
2565+entries because if very large, memory becomes a bigger problem than time.
2566+The constants here were chosen by informal experiment, to switch over to
2567+memoized matching at the level of query complexity where it starts to be
2568+useful, and keep the tables just large enough that making them significantly
2569+larger provides little benefit.
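That sizing rule reduces to a few lines of C; the function name below is invented for illustration.

    /* number of hash-table entries to use, given k occurrences of the
       .anywhere. and .unord. operators in the matching pattern */
    static long memo_table_entries(int k) {
        if (k < 3)  return 0;        /* too few to be worth memoizing */
        if (k < 10) k = 10;
        if (k > 22) k = 22;
        return 1L << (k + 1);        /* between 2048 and 8388608 entries */
    }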
24622570
2463-Storing bit vectors requires storing two \texttt{uint64\_t}s and an
2464-\texttt{off\_t}, with possible alignment padding, per entry. That works out
2465-to 24 bytes on typical newer 64-bit and 32-bit platforms. Some older 32-bit
2466-platforms with 32-bit \texttt{off\_t}s may only require 20 bytes per index
2467-entry, trading off the decreased space against likely problems should a
2468-dictionary ever grow larger than 4G. For comparison, the average length of
2469-a dictionary entry in the CHISE, KanjiVG, and Tsukurimashou dictionaries as
2470-of this writing is about 40 bytes; four times that for EDICT2. So storing
2471-the indices represents a significant, but not crippling, increase in storage
2472-requirements. The performance improvement numbers do account for the time
2473-taken to read the additional data from disk---it just isn't enough to be a
2474-problem compared to the savings in parsing and tree matching.
2571+Collisions are resolved by just overwriting the old entries; and the table
2572+gets emptied on each new top-level tree match. Table entries have
2573+generation numbers, so the emptying is done in constant time by
2574+incrementing the generation instead of really rewriting the table.
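A small sketch of the generation trick, loosely mirroring the change to match.c shown at the end of this commit; the names and types are illustrative.

    #include <string.h>

    #define MEMO_SIZE 2048
    typedef struct { unsigned short generation; int result; } MEMO_REC_SKETCH;

    static MEMO_REC_SKETCH table[MEMO_SIZE];   /* static storage starts zeroed */
    static unsigned short gen = 0;

    /* called at the start of each top-level tree match: "empties" the table in
       constant time by making every existing entry's generation stale */
    static void memo_reset(void) {
        gen++;
        if (gen == 0) {                        /* counter wrapped: really wipe */
            memset(table, 0, sizeof(table));
            gen = 1;
        }
    }

    /* an entry only counts if it was written during the current generation */
    static int memo_entry_live(const MEMO_REC_SKETCH *r) {
        return r->generation == gen;
    }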
24752575
2476-The format of the bit vector index file is specific to the version of
2477-IDSgrep and the computer architecture including the C compiler. If you use
2478-a bit vector file with a mismatched \texttt{idsgrep} binary, it will almost
2479-certainly be detected as invalid and ignored. Building with or without
2480-BuDDy will not make a difference to the file format; the difference is a
2481-different search algorithm applied to the same data.
2576+The keys used for the hash table are the actual pointers. Thus, it is
2577+possible for the performance of this feature to be nondeterministic on some
2578+platforms. There are some additional wrinkles in the code because of the
2579+life cycle of pointers to tree nodes: the nodes created during parsing of
2580+input live at least as long as hash table generations, but there can be a
2581+few nodes created and deleted on the fly during matching, and those cannot
2582+be saved in the hash table lest the pointers be reused and invalidated. It
2583+happens that the parser already set a flag in each node it created, for its
2584+own internal purposes in determining when the node was ready to come off the
2585+parsing stack. Nodes created on the fly during matching lack that flag. So
2586+the tree match memoization looks for the parser's flag to determine which
2587+nodes it is allowed to cache.
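A sketch of that eligibility test, with an invented flag name; the point is only that both nodes must have come from the parser before the pair may be stored in the table.

    #define NODE_FROM_PARSER 0x01u   /* hypothetical flag set by the parser */

    struct node_sketch {
        unsigned int flags;
        /* heads, functors, children, and so on omitted */
    };

    static int memoizable(const struct node_sketch *needle,
                          const struct node_sketch *haystack) {
        return (needle->flags & NODE_FROM_PARSER)
            && (haystack->flags & NODE_FROM_PARSER);
    }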
24822588
2483-It is possible that the magic numbers may not
2484-perfectly track incompatible changes among different development versions
2485-of IDSgrep that you might check out of SVN, but they should definitely work
2486-correctly to differentiate between incompatible versions that have been
2487-formally released.
2488-
24892589 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
24902590 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
24912591 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
--- trunk/idsgrep/bitvec.c (revision 455)
+++ trunk/idsgrep/bitvec.c (revision 456)
@@ -732,7 +732,7 @@
732732 bdd_delref(fa.decision_diagram);
733733 bdd_delref(fb.decision_diagram);
734734 #else
735- bf_and(f,f,bf_or(&fa,&fa,&fb));
735+ bf_or(f,&fa,bf_and(f,f,&fb));
736736 #endif
737737 }
738738
--- trunk/idsgrep/match.c (revision 455)
+++ trunk/idsgrep/match.c (revision 456)
@@ -431,8 +431,9 @@
431431 /* clear out the memoization table, first time and when counter wraps */
432432 if (memo_mask!=0) {
433433 memo_gen++;
434+ /* just setting the fields to zero isn't enough because of padding */
435+ memset(&memo_key,0,sizeof(MEMO_REC));
434436 memo_key.generation=memo_gen;
435- memo_key.match_result=MR_INITIAL;
436437 if (memo_gen==1)
437438 memset(memo_table,0,sizeof(MEMO_REC)*(memo_mask+1));
438439 }