peg-trees and you

Posted by s.f. on June 11, 2009

Recently at work, I needed to parse ugly data files from an ancient classified-ads database that a predecessor devoted an entire Ruby application to. While painstakingly building regexes, I remembered looking at Treetop a few months ago.

Treetop is a Ruby library for writing Parsing expression grammars. PEGs are another way of constructing grammars and could be thought of as super-regexes: they don’t allow left-lookup or ambiguity in the parse tree, making them not so useful for natural language but killer for computer languages. Around 40 lines of code and 7 rules took the place of what the original author devoted dedicated tempfiles and regex arrow code to.

That being said, PEGs are conceptually harder to get grips on, and Treetop’s documentation is not entirely clear on some hangups you might find. Most of which you can solve using the excellent mailing list, but I know I wished during the past few days that I could get it summed up for me.

  1. Have a base rule. Originally I tried parsing a list of categories with the “category” rule at the top. This causes accessor methods to be defined in the wrong modules. Like XML, you should have a “root” rule that only defines how to find the rest of the content in the file. I.e. mine finds the categories, and has a default method to compose them into a list based on responds_to? and find_all.
  2. Use the label accessor methods! While Treetop allows rule-substitution like variables, those don’t automatically get turned into variables. Think ‘beginning’ rule_value:rule ‘end’, and call rule_value.elements from the methods defined on the core rule.
  3. Use .inspect and tt when you get stuck! Calling .parse.inspect will produce a handy formatted string of the parse tree(if it succeeded, if it didn’t, call .failure_reason to find out why). ‘tt’ is a script included with Treetop that compiles your grammar out into a Ruby file, which is how you can find where methods are and aren’t being defined.

Since code speaks louder than words:

Trackbacks

Use this link to trackback from your own site.

Comments

Leave a response

Comments