Ticket #32413

IDSgrep patterns as user-defined predicates

Open Date: 2013-11-08 00:46 Last Update: 2021-03-28 05:06

Open [Owner assigned]
5 - Medium
5 - Medium


While looking for similar software to IDSgrep, to make the academic paper about it look more scientific with "meaningful comparisons," I found that Stanford NLP's Tregex matcher can do something we can't: it can search for an arbitrarily deep descendant of a node while placing restrictions on what kinds of nodes may occur on the ancestry path. For instance (they're matching TreeBank parses) "find a V nested inside any number of VPs nested inside an S." We can't do that; IDSgrep's only way of matching arbitrarily deeply is through the ... operator, which places no restrictions on the nodes in between. It might be possible to fake some cases of it with a not-anywhere-not construction, but I don't think it'd be easy in general. I suspect Tregex of being an NP-hard decision problem anyway, because of its ability to backtrack over variable assignments, so it's reasonable that it could be more powerful than IDSgrep's polynomial-time matching. But it nonetheless ought to be possible to support at least some queries of this type without increasing IDSgrep's power unreasonably.

Specific proposal: create a new command-line option that takes an IDSgrep matching pattern as its argument, and causes "match against this pattern" to become a user-defined matching predicate. That sounds innocent enough. What makes it a big deal is allowing such patterns to recurse. Then by setting #1 equal to the pattern |木*⿰#1? (for instance) it's possible to match a 木 nested inside any number of ⿰.

I think allowing this doesn't hurt the asymptotic complexity of pattern matching as long as we require that matching one of these user-defined patterns against a given haystack node must not recurse to matching the same user-defined pattern against the same haystack node again, only strict subtrees of it. That rules out paradoxical things like setting #1 equal to ! #1. If this feature is used at all then we should probably turn on memoization, because otherwise it seems possible to demand exponential work even without the ... and * operators. It would be possible, but probably not worthwhile, to compute exact bit vector filters by recursing through the user-defined patterns just deeply enough that the filter doesn't change; probably a better plan is to simply assign the match-anything filter to any user-defined predicate and not recurse into them at all.

Ticket History (3/4 Histories)

2013-11-08 00:46 Updated by: mskala
  • New Ticket "IDSgrep patterns as user-defined predicates" created
2013-11-08 01:08 Updated by: mskala
  • Details Updated
2014-03-22 03:47 Updated by: mskala
2021-03-28 05:06 Updated by: mskala

Bumping this to 0.7. It would be cool, but will need new theory, and isn't appropriate for the current release.

Attachment File List

No attachments


You are not logged in. I you are not logged in, your comment will be treated as an anonymous post. » Login