40 The Rgx Regular Expression Library

This chapter describes the (lang librgx) module, which provides the Rgx regular expression library (originally named Rx) written by Tom Lord. Although Rgx regular expressions share many features and behaviors with the POSIX regular expressions normally used in Guile (see Regexp Functions), there are subtle differences, which are not yet documented here. See Lexing and Parsing, for related modules.

Unlike string-match, Rgx requires a two-step process: first compile a regular expression into an efficient structure, then use that structure in any number of matches against strings.

For example, given the regular expression ‘abc.’ (which matches any string containing ‘abc’ followed by any single character):

     guile> (define r (regcomp "abc."))
     guile> r
     #<rgx abc.>
     guile> (regexec r "abc")
     #f
     guile> (regexec r "abcd")
     #((0 . 4))
     guile>

The definitions of regcomp and regexec are as follows:

— Scheme Procedure: regcomp pat [cfl]

Compile the regular expression pat using POSIX rules. cfl is optional and should be specified using symbolic names:

— Variable: REG_EXTENDED

use extended POSIX syntax

— Variable: REG_ICASE

use case-insensitive matching

— Variable: REG_NEWLINE

allow anchors to match after newline characters in the string, and prevent . or [^...] from matching newlines

The logior procedure can be used to combine multiple flags. The default is to use POSIX basic syntax, in which + and ? are literals and \+ and \? are operators. Backslashes in pat must be escaped (doubled) when the pattern is written as a literal string, e.g., "\\(a\\)\\?".
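The following is a minimal sketch of the flag handling described above, not a tested transcript. It assumes the module is loaded with use-modules, as the module name (lang librgx) suggests; the patterns and variable names are illustrative only, and return values are deliberately not shown.

```scheme
;; Assumption: the module named in this chapter loads this way.
(use-modules (lang librgx))

;; POSIX basic syntax (the default): \+ is the repetition operator,
;; so the backslash is doubled inside the Scheme string literal.
(define basic (regcomp "ab\\+c"))

;; Extended syntax: + is an operator without a backslash.
(define extended (regcomp "ab+c" REG_EXTENDED))

;; Several flags combined with logior: extended syntax, ignoring case.
(define extended-nocase
  (regcomp "ab+c" (logior REG_EXTENDED REG_ICASE)))

;; Each compiled object can be reused for any number of regexec calls.
(regexec basic "xabbbcx")
(regexec extended "xabbbcx")
(regexec extended-nocase "xABBBCx")
```

Compiling once and keeping the result in a variable is the point of the two-step process: the cost of compilation is paid a single time, however many strings are matched afterwards.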

— Scheme Procedure: regexec rgx str [match-pick [eflags]]

Match string str against the compiled POSIX regular expression rgx. match-pick and eflags are optional. Possible flags (which can be combined using the logior procedure) are:

— Variable: REG_NOTBOL

The beginning-of-line operator won't match the beginning of str (presumably because it's not the beginning of a line).

— Variable: REG_NOTEOL

Similar to REG_NOTBOL, but prevents the end-of-line operator from matching the end of str.

If no match is possible, regexec returns #f. Otherwise match-pick determines the return value:

#t or unspecified: a newly-allocated vector is returned, containing pairs with the indices of the matched part of str and any substrings.

"": a list is returned: the first element contains a nested list with the matched part of str surrounded by the unmatched parts. Remaining elements are matched substrings (if any). All returned substrings share memory with str.

#f: regexec returns #t if a match is made, otherwise #f.

vector: the supplied vector is returned, with the first element replaced by a pair containing the indices of the matched portion of str and further elements replaced by pairs containing the indices of matched substrings (if any).

list: a list will be returned, with each member of the list specified by a code in the corresponding position of the supplied list:

a number: the numbered matching substring (0 for the entire match).

#\<: the beginning of str to the beginning of the part matched by rgx.

#\>: the end of the matched part of str to the end of str.

#\c: the "final tag", which appears to be associated with the "cut operator"; that operator does not appear to be available through the POSIX interface.

For example: (list #\< 0 1 #\>). The returned substrings share memory with str.
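The different match-pick forms can be sketched as follows. This is an illustration under the descriptions above rather than a verified session: the module is assumed to load with use-modules, the names r and s are arbitrary, and results are left unstated because they depend on the Rgx implementation.

```scheme
;; Assumption: the module named in this chapter loads this way.
(use-modules (lang librgx))

(define r (regcomp "abc."))
(define s "xxabcdyy")

;; match-pick omitted or #t: a new vector of index pairs.
(regexec r s)
(regexec r s #t)

;; match-pick #f: only a boolean answer is wanted.
(regexec r s #f)

;; match-pick "": a list of substrings sharing memory with s.
(regexec r s "")

;; match-pick as a list of codes: the text before the match,
;; the whole match (substring 0), and the text after the match.
(regexec r s (list #\< 0 #\>))

;; eflags is the fourth argument, so match-pick must be given
;; explicitly whenever eflags are supplied.
(regexec r s #t (logior REG_NOTBOL REG_NOTEOL))
```

Because the substrings returned for "" and for a list of codes share memory with the subject string, callers that go on to modify that string may prefer the index-pair forms.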

A compiled regexp, such as one returned by regcomp, is a type of deterministic finite automaton, or dfa for short; applying this dfa to a string is the essence of the regexec operation. Rgx provides some additional procedures that can be used to create, recognize, inspect and apply a dfa, with a level of control not available through regexec. These are described here.

— Scheme Procedure: compiled-regexp? obj

Return #t iff obj is a compiled regexp, i.e., an object returned by regcomp.
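As a small illustration of the predicate, under the assumption that the module loads with use-modules, the two calls below contrast a compiled object with the pattern string it was compiled from. By the definition above, only the first is an object returned by regcomp.

```scheme
;; Assumption: the module named in this chapter loads this way.
(use-modules (lang librgx))

(define r (regcomp "abc."))

;; r was returned by regcomp, so it satisfies the predicate.
(compiled-regexp? r)

;; A plain string is a pattern, not a compiled regexp.
(compiled-regexp? "abc.")
```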

— Scheme Procedure: regexp->dfa regexp [cfl]

Return a deterministic finite automaton compiled from regular expression regexp. The optional argument cfl specifies flags: REG_EXTENDED, REG_NEWLINE, or both (use logior to combine them).

— Scheme Procedure: dfa-fork dfa

[TODO: Write proper docs for dfa-fork.]

— Scheme Procedure: reset-dfa! dfa

Reset the deterministic finite automaton dfa.

— Scheme Procedure: dfa-final-tag dfa

Return the final tag of deterministic finite automaton dfa.

— Scheme Procedure: dfa-continuable? dfa

Return #t iff the deterministic finite automaton dfa is continuable.

— Scheme Procedure: advance-dfa! dfa s

Advance deterministic finite automaton dfa over string s.
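The procedures above might be combined along the following lines. This is a tentative sketch only: it assumes that regexp->dfa accepts a pattern string, that the module loads with use-modules, and it makes no claim about what each call returns, since those details are not documented here. The name d is illustrative.

```scheme
;; Assumption: the module named in this chapter loads this way.
(use-modules (lang librgx))

;; Assumption: regexp->dfa takes the pattern as a string.
(define d (regexp->dfa "abc" REG_EXTENDED))

;; Feed the automaton part of an input, then inspect its state.
(advance-dfa! d "ab")
(dfa-continuable? d)
(dfa-final-tag d)

;; Start over and feed a different input.
(reset-dfa! d)
(advance-dfa! d "abc")
(dfa-final-tag d)
```

Driving the automaton in pieces like this is the kind of control that regexec does not offer, since regexec always processes a whole string in one call.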