zenspider.com
(For all intents and purposes, this project is dead, but let us just call it "Deferred Indefinitely". Shall we?)

Problem

You need to tokenize a string into substrings based on a set of grammar rules.

Solution

TODO: SOLUTION

Discussion

There are many ways to do this, ranging from fast and simple (but not as powerful) to heavy-duty pull-out-the-textbook parsing engines. We will discuss a little of both, and a bit in between.

If all you need is a way to split up a string into substrings, use String#split or String#scan, depending on your needs. For example, if I need to tokenize a string on whitespace, I’d probably just use split:

tokens = str.split(/\s+/) # or, if $; is nil: tokens = str.split
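The two forms in that line are not quite interchangeable when the string starts with whitespace. A minimal sketch of the difference, assuming a made-up sample string and that $; has been left at its default of nil:

```ruby
str = "  foo bar   baz"

# An explicit regex keeps an empty leading field when the string
# starts with whitespace.
with_regex = str.split(/\s+/)  # ["", "foo", "bar", "baz"]

# With no argument (and $; left at nil), split ignores leading whitespace.
bare = str.split               # ["foo", "bar", "baz"]
```

If the input may have leading whitespace and you do not want the empty first token, the bare form (or a strip before splitting) is the safer choice.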

If, however, you are trying to do something a bit more complex, like FIX, then you’d probably want to use scan:

FIX: find some real world examples of the use of scan.
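As a stand-in illustration, here is a small sketch of scan on two made-up inputs (an arithmetic expression and a key=value string, neither taken from a real project). Without capture groups, scan returns the matched substrings. With groups, it returns one array of captures per match.

```ruby
# No capture groups: each match becomes one token.
expr = "3 + 4 * (2 - 1)"
tokens = expr.scan(/\d+|[-+*\/()]/)
# tokens is ["3", "+", "4", "*", "(", "2", "-", "1", ")"]

# Capture groups: each match becomes an array of its groups.
config = "host=example port=8080"
pairs = config.scan(/(\w+)=(\S+)/)
# pairs is [["host", "example"], ["port", "8080"]]
```

Unlike split, scan describes what a token looks like rather than what separates tokens, which is why it copes with input like "(2" where there is no delimiter between the pieces.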

FIX: should I write with “I”, “we”, “you”, or what?

If we get to the point where we are dealing with something more than simple tokens, we probably want to avoid writing our own parser by hand (unless you LIKE doing that sort of thing; I did, and now I like to get out once in a while). Instead, you’d probably just whip out your favorite parsing engine and use that. Unfortunately, at the time of this writing, I know of only one for Ruby: racc. racc is a yacc-like system for Ruby, which means that it is an LALR parser generator. I personally prefer LL parser generators, so I’ll keep looking. Here is a simple example from racc’s readme file:

class Calcparser
rule
  target: exp { print val[0] }

  exp: exp '+' exp
     | exp '*' exp
     | '(' exp ')'
     | NUMBER
end
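A grammar like this still needs something to feed it tokens such as NUMBER. The sketch below is one way that lexing step might look, using StringScanner from Ruby's standard library. The tokenize helper and the [type, value] pair layout are assumptions made for illustration, not something taken from racc's readme; check racc's documentation for the exact token format it expects.

```ruby
require 'strscan'

# Hypothetical helper: turns an expression into [type, value] pairs.
def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    if scanner.skip(/\s+/)
      next                                  # ignore whitespace
    elsif (text = scanner.scan(/\d+/))
      tokens << [:NUMBER, text.to_i]        # integer literal
    elsif (text = scanner.scan(/[+*()]/))
      tokens << [text, text]                # operators stand for themselves
    else
      raise ArgumentError,
            "unexpected character #{scanner.peek(1).inspect} at #{scanner.pos}"
    end
  end
  tokens
end

result = tokenize("1 + 2 * (3)")
```

Here result would hold a NUMBER pair for each integer and a self-named pair for each operator or parenthesis, in input order.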

Contrast

Perl and Python both have split.

Scan is easy enough to write in Perl or Python. In Perl, you’d just loop on a string with a regex.
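To make the "loop on a string with a regex" idea concrete, here is the same loop sketched in Ruby rather than Perl. The my_scan name is hypothetical, and the sketch only covers patterns without capture groups. It is meant to show the shape of the loop, not to replace String#scan.

```ruby
# Hypothetical hand-rolled scan: repeatedly match from the last position.
def my_scan(str, pattern)
  results = []
  pos = 0
  while (m = pattern.match(str, pos))
    results << m[0]
    # Step past the match; nudge forward by one on an empty match
    # so the loop cannot get stuck in place.
    pos = m.end(0) == m.begin(0) ? m.end(0) + 1 : m.end(0)
  end
  results
end

digits = my_scan("a1b22c333", /\d+/)
```

For a simple pattern like this one, digits should hold the same substrings that "a1b22c333".scan(/\d+/) returns.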

There are too many parsing engines to name here. The most popular are probably yacc and lex. There are also C/Java’s ANTLR (my favorite), Java’s JavaCC, C’s rdp, Perl’s Text::Parse (yuck), and many, many more.

TODO: LIST_OF_RELATED_ITEMS

Status: In Progress