Commit MetaInfo
Log Message
bug fix, more docs on advanced features
Change Summary
Incremental Difference
@@ -817,7 +817,7 @@
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
-\section{A note on TrueType/OpenType}\DangerousBend
+\section{A note on TrueType/OpenType}\DangerousSection
 
 This version of IDSgrep is designed to read TrueType or
 OpenType files (the distinction between the two is not relevant at this
@@ -2138,14 +2138,14 @@
 This idea of doing an easy approximate check first, to save the cost of a
 more difficult accurate evaluation, should be quite familiar. Imagine a
 hiring committee quickly throwing out all the job applications from software
-engineers, then interviewing the computational linguists.\footnote{You may
-say I'm a dreamer.} Similar concepts appear throughout computer science:
-consider instruction and data caches; memoization for dynamic programming;
-solving relaxed versions of optimization problems; and especially Bloom
-filters~\cite{Bloom:Space}. The bit vector technique used in IDSgrep builds
-on Bloom filters and on the work of Skala and
-others~\cite{Skala:Generalized} and Skala and Penn~\cite{Skala:Approximate}
-on unification of types in logic programming systems.
+engineers, then interviewing the computational linguists. Similar concepts
+appear throughout computer science: consider instruction and data caches;
+memoization for dynamic programming; solving relaxed versions of
+optimization problems; and especially Bloom filters~\cite{Bloom:Space}. The
+bit vector technique used in IDSgrep builds on Bloom filters and on the work
+of Skala and others~\cite{Skala:Generalized} and Skala and
+Penn~\cite{Skala:Approximate} on unification of types in logic programming
+systems.
 
 Functional programming weenies will note that the real point of an element
 of $\mathbb{F}$ is that you can combine it with $\mathit{check}$ to make a
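Since the passage leans on the Bloom filter analogy, a toy version may help readers unfamiliar with it. This is a minimal illustrative sketch, not IDSgrep's code; the 128-bit size matches the text, but the mixing constants and three-probe scheme are arbitrary choices:

```c
#include <stdint.h>

/* Toy Bloom filter over 128 bits, stored as two 64-bit words.
   Adding a key sets three pseudo-randomly chosen bits; a lookup
   can return a false positive, but never a false negative. */
typedef struct { uint64_t w[2]; } bloom128;

static void bloom_add(bloom128 *b, uint64_t key) {
  for (int k = 0; k < 3; k++) {
    key = key * 6364136223846793005ULL + 1442695040888963407ULL;
    unsigned bit = (unsigned)(key >> 57);          /* 0..127 */
    b->w[bit >> 6] |= 1ULL << (bit & 63);
  }
}

static int bloom_maybe_contains(const bloom128 *b, uint64_t key) {
  for (int k = 0; k < 3; k++) {
    key = key * 6364136223846793005ULL + 1442695040888963407ULL;
    unsigned bit = (unsigned)(key >> 57);
    if (!(b->w[bit >> 6] & (1ULL << (bit & 63))))
      return 0;   /* definitely absent */
  }
  return 1;       /* possibly present */
}
```

A hit only means the expensive accurate check is worth running; a miss rules the candidate out for certain, which is exactly the role the bit vector index plays for tree matching.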
@@ -2257,7 +2257,9 @@
 $m_3$, so $(m_3,\lambda_3)$ will also match. The same is true for
 $(m_2,\lambda_2)$; so if either of the original filters matched, then the
 combined filter must match. In this way we can find a filter for the OR of
-two existing filters. Note that this is not an if and only if: there could
+two existing filters.
+
+Note that the OR filter is not an if and only if: there could
 be combinations of bits that hit part of $m_1$ and part of $m_2$, so that
 neither $(m_1,\lambda_1)$ nor $(m_2,\lambda_2)$ would match but
 $(m_3,\lambda_3)$ matches. We have lost some precision in the filter
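The OR construction in this hunk can be sketched in code. This assumes, as a simplification, that a lambda filter $(m,\lambda)$ accepts a haystack vector $v$ exactly when at least $\lambda$ bits of $v \wedge m$ are set, and uses one 64-bit word instead of 128 bits; the names are hypothetical, not IDSgrep's:

```c
#include <stdint.h>

typedef struct { uint64_t mask; unsigned lambda; } lfilter;

static unsigned popcount64(uint64_t x) {
  unsigned n = 0;
  while (x) { x &= x - 1; n++; }
  return n;
}

/* (m,lambda) accepts v when at least lambda bits of v&m are set */
static int lf_accepts(lfilter f, uint64_t v) {
  return popcount64(v & f.mask) >= f.lambda;
}

/* OR of two filters: anything accepted by a or b is accepted by the
   result, but not conversely -- some precision is lost. */
static lfilter lf_or(lfilter a, lfilter b) {
  lfilter r;
  r.mask = a.mask | b.mask;
  r.lambda = a.lambda < b.lambda ? a.lambda : b.lambda;
  return r;
}
```

The vector 0x21 below exhibits the one-way implication: with masks 0x0F/0xF0, neither original filter accepts it, but the combined filter does, because one bit falls in each mask.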
@@ -2283,21 +2285,21 @@
 lambda filters.
 
 When we push a subtree further down in the EIDS tree, its filter changes.
-Suppose we have a filter for a given needle (assuming it is the root) but it
-actually appears as the right child of a binary root. Any bits in its mask
-that would have queried the first word (head, functor, and arity of root)
-must now query the third word (head, functor, and arity of last child). We
-can easily rearrange those bits in the mask. However, if the mask selects
-any bits from the second through fourth words when the needle is at the
-root, then when it is shifted down the tree all those bits must come from
-the last word of the haystack's vector. They could collide with each other.
-Then a haystack that formerly would have been hit by the mask in three
-places might only be hit once, because all three of those bits collided.
-The value of $\lambda$ might need to be reduced by as much as a factor of
-three. IDSgrep computes all possible collisions and tries to use as large a
-value of $\lambda$ for the modified filter as it can prove will still give
-correct results. Similar considerations apply to the operations of moving a
-needle's filter from the root to the left or middle children.
+Suppose we have a filter that would be correct for a given needle if it were
+the root, but it actually appears as the right child of a binary root. Any
+bits in its mask that would have queried the first word (head, functor, and
+arity of root) must now query the third word (head, functor, and arity of
+last child). We can easily rearrange those bits in the mask. However, if
+the mask selects any bits from the second through fourth words when the
+needle is at the root, then when it is shifted down the tree all those bits
+must come from the last word of the haystack's vector. They could collide
+with each other. Then a haystack that formerly would have been hit by the
+mask in three places might only be hit once, because all three of those bits
+collided. The value of $\lambda$ might need to be reduced by as much as a
+factor of three. IDSgrep computes all possible collisions and tries to use
+as large a value of $\lambda$ for the modified filter as it can prove will
+still give correct results. Similar considerations apply to the operations
+of moving a needle's filter from the root to the left or middle children.
 
 Given these operations of OR, AND, and moving a needle that refers to the
 root to refer to a child instead, IDSgrep can find a lambda filter for basic
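The shift-down operation in this hunk can be sketched under an assumed layout of four 32-bit words: word 0 covers the root, word 2 the last child, and everything deeper collapses onto word 3 (hypothetical indexing, though consistent with the wording above). Rather than IDSgrep's exact collision analysis, this uses a cruder but provable bound: if at most $c$ mask bits collide on any one target bit, $\lambda$ can safely be divided by $c$, rounding up:

```c
#include <stdint.h>

typedef struct { uint32_t w[4]; unsigned lambda; } lfilter4;

/* Move a filter built for the root so that it applies to the right
   child of a binary root: root-word bits now query the last-child
   word, and all deeper words collapse onto the final word, where
   they may collide with each other. */
static lfilter4 shift_to_right_child(lfilter4 f) {
  lfilter4 r = { {0, 0, 0, 0}, 0 };
  r.w[2] = f.w[0];                     /* root info -> last child */
  r.w[3] = f.w[1] | f.w[2] | f.w[3];   /* the rest collapse */
  unsigned cmax = 1;                   /* worst collision multiplicity */
  for (int b = 0; b < 32; b++) {
    unsigned c = ((f.w[1] >> b) & 1) + ((f.w[2] >> b) & 1)
               + ((f.w[3] >> b) & 1);
    if (c > cmax) cmax = c;
  }
  r.lambda = (f.lambda + cmax - 1) / cmax;   /* ceil(lambda / cmax) */
  return r;
}
```

When the three lower words all select the same bit position, three hits can merge into one, and lambda drops by the factor of three mentioned in the text; when no positions coincide, lambda survives unchanged.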
@@ -2314,7 +2316,7 @@
 match-everything filter. The IDSgrep implementation of NOT goes slightly
 beyond pure filter calculus by looking at the actual needle instead of only
 the needle's recursively-computed filter; so it is smart enough to recognize
-\texttt{!!} as a no-op, \texttt{!?} as match-nothing, and to apply De
+\texttt{!!} as a no-op, \texttt{!?} as match-nothing, and to apply de
 Morgan's laws to AND and OR. The unordered match operator \texttt{*}
 translates easily into the OR of all permutations of its child, and the
 literal match operator \texttt{=} is straightforward. Other match
@@ -2395,6 +2397,89 @@
 get better results from NOT. More complicated operations, like
 match-anywhere, have reasonably straightforward implementations.
 
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Memoization in the tree match}\DangerousSection
+
+This feature does not involve bit vectors at all, but is described in the bit
+vector chapter because, like bit vectors, it is a speed enhancement that
+doesn't change the basic matching algorithm. There is no ``r'' in
+``memoization.'' That word is a shibboleth\footnote{Judges 12:6.} for CS
+theorists.
+
+For the most part, when a match is not ruled out by the bit vector indices,
+IDSgrep uses a straightforward recursive-descent algorithm to implement the
+tree matching function $\mathit{match}$. That works well in practice for
+the kinds of queries typically encountered. However, it is possible to
+construct pathological queries on which the recursive matching algorithm
+will misbehave. For instance, a query with many nested instances of
+\texttt{.anywhere.} can bog down attempting to match against a deep tree, as
+it tries all possibilities for the intermediate nodes that anchor the
+different \texttt{.anywhere.} operators, even though the choices are all
+equivalent.
+Matching according to IDSgrep's matching rules should in fact be possible in
+polynomial time, because we can match each node in the needle tree once
+against each node in the haystack tree. Assuming the match function is
+well-defined we only need do that once per pair of nodes; so this suggests a
+dynamic programming algorithm. The dynamic programming algorithm is
+unlikely to be a good idea in the usual case, because of excessive overhead
+maintaining the table and the possibility of doing extra, unnecessary
+matches if we're not careful. Recursive descent, with short-circuiting of
+unnecessary subtree tests once the answer is known, seems to perform much
+better in practice in ordinary cases despite its exponential worst-case
+bound. Nonetheless, to cover bad cases that may occur in the input,
+IDSgrep implements memoization of tree matches when it appears likely to be
+helpful.
+
+The matching operators that could cause trouble are ``\texttt{...}''
+(\texttt{.anywhere.}) and ``\texttt{.*.}'' (\texttt{.unord.}). All other
+special matching operators, and the default matching rules, recurse at most
+once into each child; but match-anywhere attempts to match its needle tree
+once against every descendant of its haystack tree, and unordered match
+attempts to match its needle once against each of the up to six permutations
+of its haystack's children. The time to do matching by recursive descent
+ends up having an exponent determined by the number of uses of those
+operators in the matching pattern.
+
+So IDSgrep counts how many times either of those operators occurs in the
+matching pattern, and uses that count to determine both whether memoization
+would help, and how big a table size to use. The memoized match is
+straightforward. Before testing a needle tree against a haystack tree,
+IDSgrep checks whether that pair of needle and haystack is recorded in the
+memoization hash table. If it is, the answer is taken from the table
+instead of calculated fresh. Otherwise, once it has been calculated, the
+result goes in the table.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Implementation details}\DangerousSection
+
+Storing bit vectors requires storing two \texttt{uint64\_t}s and an
+\texttt{off\_t}, with possible alignment padding, per entry. That works out
+to 24 bytes on typical newer 64-bit and 32-bit platforms. Some older 32-bit
+platforms with 32-bit \texttt{off\_t}s may only require 20 bytes per index
+entry, trading off the decreased space against likely problems should a
+dictionary ever grow larger than 4G. For comparison, the average length of
+a dictionary entry in the CHISE, KanjiVG, and Tsukurimashou dictionaries as
+of this writing is about 40 bytes; four times that for EDICT2. So storing
+the indices represents a significant, but not crippling, increase in storage
+requirements. The performance improvement numbers do account for the time
+taken to read the additional data from disk---it just isn't enough to be a
+problem compared to the savings in parsing and tree matching.
+
+The format of the bit vector index file is specific to the version of
+IDSgrep and the computer architecture including the C compiler. If you use
+a bit vector file with a mismatched \texttt{idsgrep} binary, it will almost
+certainly be detected as invalid and ignored. Building with or without
+BuDDy will not make a difference to the file format; the difference is a
+different search algorithm applied to the same data.
+
+It is possible that the magic numbers may not
+perfectly track incompatible changes among different development versions
+of IDSgrep that you might check out of SVN, but they should definitely work
+correctly to differentiate between incompatible versions that have been
+formally released.
+
 One issue for BDD filters that did not occur with lambda filters is that
 there is no practical limit to the size of a BDD on 128 bits, so it could
 grow until it exhausts memory or consumes so much time for filter calculus
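The memoized lookup described in the new section can be sketched as follows; the record layout and names are hypothetical rather than IDSgrep's actual MEMO_REC, and the expensive recursive match is stubbed out with a counter so the cache behavior is observable:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical memo table keyed by the (needle, haystack) pointer
   pair.  Collisions simply overwrite the old entry. */
#define MEMO_SIZE 2048   /* must be a power of two */

typedef struct { const void *needle, *haystack; int result, valid; } memo_rec;
static memo_rec memo[MEMO_SIZE];
static int slow_calls;   /* counts real match evaluations, for the demo */

static int slow_match(const void *n, const void *h) {
  (void)n; (void)h;
  slow_calls++;
  return 1;              /* stand-in for the real recursive tree match */
}

static size_t memo_hash(const void *n, const void *h) {
  uintptr_t x = (uintptr_t)n * 31u ^ (uintptr_t)h * 131u;
  return (size_t)(x >> 4) & (MEMO_SIZE - 1);
}

static int memo_match(const void *n, const void *h) {
  size_t i = memo_hash(n, h);
  if (memo[i].valid && memo[i].needle == n && memo[i].haystack == h)
    return memo[i].result;                  /* answer from the table */
  int r = slow_match(n, h);
  memo[i].needle = n;  memo[i].haystack = h;
  memo[i].result = r;  memo[i].valid = 1;   /* record fresh result */
  return r;
}
```

Asking the same (needle, haystack) pair twice evaluates the slow match only once; with many nested \texttt{.anywhere.} operators the same pairs recur exponentially often, which is where the table pays off.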
@@ -2456,36 +2541,51 @@
 effort into automatic reordering of the variables to optimize the BDD
 calculations.
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+The particular case of many \texttt{.anywhere.} operators nested inside one
+another with no heads and no other nodes in between might be detected and
+automatically simplified. Any number of those will match the same set of
+trees as a single such operator, so the extras can be simply removed without
+affecting the semantics. But it is easy to construct more elaborate queries
+that still take exponential time without memoization and that the system
+cannot reasonably simplify. Users are unlikely to do that by accident, but
+now that I've mentioned the possibility, someone will try.
 
-\section{Implementation details}\DangerousSection
+Let $k$ represent the number of \texttt{.anywhere.} and \texttt{.unord.}
+operators in the matching pattern. It is actually determined
+by looking at the reference counts for the strings consisting of a single
+ASCII period and a single ASCII asterisk in an internal string table, so
+there may be an overcount if those strings happen to occur as heads or
+as non-unary functors. If $k$ is less than three, memoization is not
+expected to be helpful, and so is not done. Then assuming $k$ was at least
+three, if it is less than ten it is set to ten and if it is greater than 22
+it is set to 22, and then memoization will be done with a hash table of
+$2^{k+1}$ entries. The idea here is to make the hash table grow as the
+number of entries needed grows, but always at least 2048 entries because
+there is little benefit from making it very small, and never more than 8M
+entries because if very large, memory becomes a bigger problem than time.
+The constants here were chosen by informal experiment, to switch over to
+memoized matching at the level of query complexity where it starts to be
+useful, and keep the tables just large enough that making them significantly
+larger provides little benefit.
 
-Storing bit vectors requires storing two \texttt{uint64\_t}s and an
-\texttt{off\_t}, with possible alignment padding, per entry. That works out
-to 24 bytes on typical newer 64-bit and 32-bit platforms. Some older 32-bit
-platforms with 32-bit \texttt{off\_t}s may only require 20 bytes per index
-entry, trading off the decreased space against likely problems should a
-dictionary ever grow larger than 4G. For comparison, the average length of
-a dictionary entry in the CHISE, KanjiVG, and Tsukurimashou dictionaries as
-of this writing is about 40 bytes; four times that for EDICT2. So storing
-the indices represents a significant, but not crippling, increase in storage
-requirements. The performance improvement numbers do account for the time
-taken to read the additional data from disk---it just isn't enough to be a
-problem compared to the savings in parsing and tree matching.
+Collisions are resolved by just overwriting the old entries; and the table
+gets emptied on each new top-level tree match. Table entries have
+generation numbers, so the emptying is done in constant time by
+incrementing the generation instead of really rewriting the table.
 
-The format of the bit vector index file is specific to the version of
-IDSgrep and the computer architecture including the C compiler. If you use
-a bit vector file with a mismatched \texttt{idsgrep} binary, it will almost
-certainly be detected as invalid and ignored. Building with or without
-BuDDy will not make a difference to the file format; the difference is a
-different search algorithm applied to the same data.
-
-It is possible that the magic numbers may not
-perfectly track incompatible changes among different development versions
-of IDSgrep that you might check out of SVN, but they should definitely work
-correctly to differentiate between incompatible versions that have been
-formally released.
+The keys used for the hash table are the actual pointers. Thus, it is
+possible for the performance of this feature to be non-deterministic on some
+platforms. There are some additional wrinkles in the code because of the
+life cycle of pointers to tree nodes: the nodes created during parsing of
+input live at least as long as hash table generations, but there can be a
+few nodes created and deleted on the fly during matching, and those cannot
+be saved in the hash table lest the pointers be reused and invalidated. It
+happens that the parser already set a flag in each node it created, for its
+own internal purposes in determining when the node was ready to come off the
+parsing stack. Nodes created on the fly during matching lack that flag. So
+the tree match memoization looks for the parser's flag to determine which
+nodes it is allowed to cache.
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
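The generation-number trick for constant-time emptying, described in this hunk, can be illustrated in isolation (the names are illustrative, not IDSgrep's). An entry counts as present only if its stored generation equals the current one, so bumping the counter invalidates every entry at once; real zeroing happens only on first use and when the counter wraps:

```c
#include <string.h>

#define TBL_SIZE 8

typedef struct { unsigned generation; int key, value; } rec;
static rec table[TBL_SIZE];
static unsigned cur_gen = 0;

/* "empty" the table in constant time */
static void table_clear(void) {
  cur_gen++;
  if (cur_gen == 1)   /* first use, or the counter wrapped around */
    memset(table, 0, sizeof table);
}

static void table_put(int k, int v) {
  rec *r = &table[(unsigned)k % TBL_SIZE];
  r->generation = cur_gen;   /* stamping makes the entry live */
  r->key = k;
  r->value = v;
}

static int table_get(int k, int *v) {
  rec *r = &table[(unsigned)k % TBL_SIZE];
  if (r->generation != cur_gen || r->key != k)
    return 0;   /* stale or absent */
  *v = r->value;
  return 1;
}
```

After a clear, stale entries are still physically present but fail the generation check, so the cost of emptying no longer depends on the table size.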
@@ -732,7 +732,7 @@
 bdd_delref(fa.decision_diagram);
 bdd_delref(fb.decision_diagram);
 #else
-   bf_and(f,f,bf_or(&fa,&fa,&fb));
+   bf_or(f,&fa,bf_and(f,f,&fb));
 #endif
 }
 
@@ -431,8 +431,9 @@
 /* clear out the memoization table, first time and when counter wraps */
 if (memo_mask!=0) {
    memo_gen++;
+   /* just setting the fields to zero isn't enough because of padding */
+   memset(&memo_key,0,sizeof(MEMO_REC));
    memo_key.generation=memo_gen;
-   memo_key.match_result=MR_INITIAL;
    if (memo_gen==1)
       memset(memo_table,0,sizeof(MEMO_REC)*(memo_mask+1));
 }
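The padding comment added in this hunk deserves a word. If a key struct is hashed or compared over its raw bytes, compiler-inserted padding bytes participate too, and assigning only the named fields leaves that padding indeterminate; zeroing the whole struct first makes the byte image reproducible. A self-contained illustration (the struct and hash are made up, not IDSgrep's MEMO_REC):

```c
#include <stdint.h>
#include <string.h>

/* On typical platforms this struct has 7 padding bytes between
   'tag' and 'ptr'.  Any byte-wise hash or memcmp over it will see
   those bytes, so they must be given a known value. */
typedef struct { char tag; uint64_t ptr; } demo_key;

static unsigned hash_bytes(const void *p, size_t n) {
  const unsigned char *b = p;
  unsigned h = 2166136261u;           /* FNV-1a */
  while (n--) h = (h ^ *b++) * 16777619u;
  return h;
}

static unsigned demo_key_hash(char tag, uint64_t ptr) {
  demo_key k;
  memset(&k, 0, sizeof k);   /* zero the padding bytes, not just fields */
  k.tag = tag;
  k.ptr = ptr;
  return hash_bytes(&k, sizeof k);
}
```

Without the memset, two calls with identical arguments could hash differently depending on what garbage the stack happened to hold in the padding.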
