Ticket #40505

regex (creating language)

Open Date: 2020-06-25 05:17 Last Update: 2021-07-01 23:42

5 - Medium
1 - Lowest


How did you create your custom regex-like language for IDSgrep?

I would like to create an extended regex-like language myself, so I'm just wondering where you learned how to do it.

If you can send me books/videos/resources that you used to learn how to do it - it would be much appreciated.

Thank you

Ticket History (3/4 Histories)

2020-06-25 05:17 Updated by: ekaunt
  • New Ticket "regex" created
2020-06-25 07:02 Updated by: mskala
  • Summary Updated
  • Priority Update from 1 - Lowest to 5 - Medium
2020-06-25 07:36 Updated by: mskala

That's an interesting question and I'm not sure the best direction to take it.

I think probably the most important issue I faced was actually in the data representation, not the query language itself. I had to figure out how to represent the structure of a kanji character in a text-like format, the EIDS format discussed at some length in the IDSgrep documentation. Unicode IDS was only a starting point because of its limitations - for instance, it says almost nothing about how far to break down characters into their sub-parts, or not. I had to work out a way of describing the data in a basically text format that would contain everything the search program needed to be able to find. Once I had a clear idea of how to represent my data as text in the first place, the query language I needed for searching that data was relatively clear. So I'd certainly encourage you to put a lot of energy into making sure your data is in the best format for your application, before getting too far into the language for querying it.

On the query language, very much of what went into IDSgrep came from my general background in computer science. It's really useful in this kind of work to know about the different levels of expressive power in describing languages (sets of strings). I would have used plain regular expressions instead of defining something of my own, if I could have. There are plenty of libraries for searching with regexes and that would've been a lot easier than defining a new thing of my own. I should emphasize that: I *wouldn't* have defined a new query language if I could've used ordinary regexes. I only went to the more complicated thing because I was forced there by the application. This project ended up being research-level computer science and it wasn't something to attempt lightly. Using an existing library sure would have been more convenient.

Getting to the point that I knew I needed to define a new query language required some computer science theory. I figured out that I couldn't use plain regexes because those only recognize regular languages, and the language of EIDS strings is not a regular language, it's a more general context-free language. The theory of concepts like "regular" and "context-free" languages, organizing them into something called the Chomsky hierarchy, is definitely worth knowing if you want to do work in this area, but it won't come easily without the relevant math prerequisites. This is basically the subject of a one-semester second-year computer science course. When I took that course, now some years ago, the textbook we used was "Introduction to Languages and the Theory of Computation" by John C. Martin. I think that's still a decent book for this subject, and you can probably find a used copy cheaply. Depending on how much math background you have, you may or may not be able to just read the book and get it; as I say, this book would normally be the text for a course in second year computer science, for students who've already taken some college-level discrete math, and those students would normally be getting lectures and doing assignments too instead of just reading the book. There are other textbooks for the same sort of thing; I'm not sure what's popular today but it would be worth checking your local university or college to see what they're using for the languages-and-theory-of-computation class. You might even be able to formally or informally audit the course (sit in on lectures) without technically enrolling; or find some videos or other instructional material for it online.

I mentioned data representation. IDSgrep leans heavily on my knowledge of Prolog; that is the origin of some of the terminology I use for describing my data representation, such as "head," "functor," "arity." These words are used in other areas of computer science too but not necessarily in exactly the same senses and I basically use the Prolog senses because those were what I knew. Prolog-like representation made sense to me as a way of describing kanji characters and then Prolog-like unification made sense as a way of describing the search operations. I'm not going to say that you should necessarily learn Prolog; it was mostly by luck that I happened to have a lot of Prolog experience from other things I've done in the past. If you have background in something else that you find useful, maybe you end up using your own unique background for your application. For instance, I think people accustomed to LISP or Haskell would be able to find similar concepts in their own background for tackling these kinds of problems. Use what you have. But if you want to pick up Prolog in particular, look up SWI-Prolog and ECLiPSe-CLP, which are free software Prolog interpreters, and their documentation.

If you haven't already, you may want to read up my IJALP article on IDSgrep: http://www.colips.org/journals/volume25/25.2.4_Skala.pdf That's a bit terse but it's probably the best description of IDSgrep's representation and matching algorithm, beyond what's already in the manual. It may give you some ideas.

I hope that helps!

2021-07-01 23:42 Updated by: mskala
  • Status Update from Open to Closed
  • Resolution Update from None to Accepted

Closing after a year.

Attachment File List

No attachments


You are not logged in. I you are not logged in, your comment will be treated as an anonymous post. » Login