Ticket #33629

Syntax errors in CHISE and CJKVI databases

Open Date: 2014-04-04 03:06 Last Update: 2014-04-04 03:07

Open [Owner assigned]
5 - Medium
5 - Medium


Despite our detection and filtering of some errors of this type, the CHISE and CJKVI databases compiled by the IDSgrep build process contain some "entries" that are not single syntactically valid EIDSes. This is caused by syntax errors in the original source databases, and it shows up in the output as a discrepancy between the number of lines in a result set and the count reported by --statistics. Those two numbers should differ when the multi-line headers from the dictionaries are included in the results, but only then; all actual dictionary entries should be single-line. Lines are not special to the EIDS parser, so what usually happens is that a partial entry on one line consumes one or two entries on the following lines to make up its missing children, and the tree count ends up smaller than the line count.
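
To illustrate the mechanism, here is a minimal Python sketch under a deliberately simplified model, not the real EIDS grammar and not IDSgrep's parser. It assumes only the Unicode ideographic description characters U+2FF0 to U+2FFB act as operators (U+2FF2 and U+2FF3 taking three children, the rest two) and that every other character is a leaf. The names arity and count_trees are hypothetical.

```python
# Simplified model only: real EIDS syntax (bracketed heads, functors,
# escapes) is richer than this and is not handled here.
TERNARY = {"\u2ff2", "\u2ff3"}


def arity(ch):
    """Number of children a character takes in the simplified model."""
    if ch in TERNARY:
        return 3
    if "\u2ff0" <= ch <= "\u2ffb":
        return 2
    return 0  # anything else is treated as a leaf


def count_trees(text):
    """Count complete trees, treating newlines as insignificant."""
    trees = 0
    pending = 0  # nodes still owed to the current tree; 0 means between trees
    for ch in text:
        if ch.isspace():
            continue  # line breaks are not special to this parser
        if pending == 0:
            pending = 1  # a new tree starts and owes its root node
        pending += arity(ch) - 1
        if pending == 0:
            trees += 1
    return trees


# The second line is a partial entry: it is missing one child.
entries = ["⿰日月", "⿱山", "⿰木木", "⿰女子"]
print(len(entries), count_trees("\n".join(entries)))
```

In this toy input the partial second line swallows the whole third line as its missing child, so four lines yield only three trees.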

This is properly an issue with input data we didn't write: IDSgrep is functioning correctly, given its specifications and the bad data, and there's no way to really fix it short of creating our own replacement dictionary entries for the bad ones. So it may not be top priority. It is a nuisance for speed tests, though, because we can't just count lines to count matches but must capture and sum the STATS lines. I'm filing it as a bug and not a hairy yak because we're already attempting to filter out bad data in input dictionaries, and that filtering has evidently failed in this case. One option to consider is a syntax-check feature that would *make* lines special to the EIDS parser and throw an error if a tree is incomplete at line end; errors of this type could then at least be detected during dictionary creation.
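
A rough sketch of what such a line-aware check could look like, in Python and under a simplified model, not the real EIDS grammar: it assumes only the Unicode ideographic description characters U+2FF0 to U+2FFB are operators (U+2FF2 and U+2FF3 ternary, the rest binary) and that all other characters are leaves. The function names and message strings are hypothetical; nothing like this exists in IDSgrep as far as this ticket is concerned.

```python
# Simplified model only: bracketed heads, functors, and escapes from the
# real EIDS syntax are not handled.
TERNARY = {"\u2ff2", "\u2ff3"}


def arity(ch):
    """Number of children a character takes in the simplified model."""
    if ch in TERNARY:
        return 3
    if "\u2ff0" <= ch <= "\u2ffb":
        return 2
    return 0


def check_entry_lines(lines):
    """Return (line_number, problem) for each line that is not exactly one tree."""
    problems = []
    for number, line in enumerate(lines, start=1):
        entry = line.strip()
        if not entry:
            continue  # blank lines are skipped, not reported
        pending = 1  # the root node of this line's tree is still owed
        complete_at = None
        for index, ch in enumerate(entry):
            pending += arity(ch) - 1
            if pending == 0:
                complete_at = index
                break
        if complete_at is None:
            problems.append((number, "tree incomplete at line end"))
        elif complete_at != len(entry) - 1:
            problems.append((number, "extra material after complete tree"))
    return problems


for number, problem in check_entry_lines(["⿰日月", "⿱山", "⿰木木"]):
    print(f"line {number}: {problem}")
```

Because the state is reset at every line, a bad entry is reported on its own line and cannot consume the entries that follow it.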

Ticket History (2/2 Histories)

2014-04-04 03:06 Updated by: mskala
  • New Ticket "Syntax errors in CHISE and CJKVI databases" created
2014-04-04 03:07 Updated by: mskala
  • Details Updated



