This chapter describes the (lang librgx) module that provides the
Rgx regular expression library (originally named Rx) written by
Tom Lord. Although Rgx regular expressions share many features and behaviors
with the POSIX regular expressions normally used in Guile (see Regexp Functions), there are subtle differences [that we would do well to document
but haven't yet –ttn]. See Lexing and Parsing, for related modules.
Unlike using string-match, using Rgx requires a two-step
process: compile a regular expression into an efficient structure,
then use the structure in any number of string comparisons.
For example, given the regular expression ‘abc.’ (which matches any string containing ‘abc’ followed by any single character):
guile> (define r (regcomp "abc."))
guile> r
#<rgx abc.>
guile> (regexec r "abc")
#f
guile> (regexec r "abcd")
#((0 . 4))
guile>
The definitions of regcomp and regexec are as follows:
— Scheme Procedure: regcomp pat [cfl]
Compile the regular expression pat using POSIX rules. cfl is optional and should be specified using symbolic names:
— Variable: REG_NEWLINE
Allow anchors to match after newline characters in the string, and prevent . or [^...] from matching newlines.
The logior procedure can be used to combine multiple flags. The default is to use POSIX basic syntax, which makes + and ? literals and \+ and \? operators. Backslashes in pat must be escaped if specified in a literal string, e.g., "\\(a\\)\\?".
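As a rough illustration of the escaping rule and the flag argument, the sketch below assumes that the (lang librgx) module is loaded with use-modules and that regcomp accepts cfl as its second argument, as the description above suggests. The variable names are invented for the example.

```scheme
(use-modules (lang librgx))

;; POSIX basic syntax is the default, so grouping and the optional
;; operator are written \( \) and \? in the pattern itself.
;; Inside a Scheme string literal each backslash must be doubled.
(define optional-a (regcomp "\\(a\\)\\?"))

;; Pass REG_NEWLINE so that anchors can match after newline
;; characters in the string being searched.
(define line-start-b (regcomp "^b" REG_NEWLINE))

;; Several flags would be combined with logior, for example
;; (logior REG_NEWLINE REG_EXTENDED), assuming both names are
;; exported by the module.
```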
— Scheme Procedure: regexec rgx str [match-pick [eflags]]
Match string str against the compiled POSIX regular expression rgx. match-pick and eflags are optional. Possible flags (which can be combined using the logior procedure) are:
— Variable: REG_NOTBOL
The beginning of line operator won't match the beginning of str (presumably because it's not the beginning of a line).
— Variable: REG_NOTEOL
Similar to REG_NOTBOL, but prevents the end of line operator from matching the end of str.
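The following sketch shows how the execution flags might be passed. It assumes the argument order rgx, str, match-pick, eflags, which is inferred from the description above rather than confirmed by it, so match-pick is given explicitly as #t in order to reach the eflags position.

```scheme
(use-modules (lang librgx))

(define r (regcomp "^abc"))

;; No flags: ^ is allowed to match at the very start of the string.
(regexec r "abcdef")

;; REG_NOTBOL: the start of the string is not treated as the
;; beginning of a line, so ^ should no longer match there.
(regexec r "abcdef" #t REG_NOTBOL)

;; Both flags at once, combined with logior.
(regexec r "abcdef" #t (logior REG_NOTBOL REG_NOTEOL))
```

This kind of call is typically useful when str is a fragment taken from the middle of a larger text.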
If no match is possible, regexec returns #f. Otherwise match-pick determines the return value:
#t or unspecified: a newly-allocated vector is returned, containing pairs with the indices of the matched part of str and any substrings.
"": a list is returned: the first element contains a nested list with the matched part of str surrounded by the unmatched parts. Remaining elements are matched substrings (if any). All returned substrings share memory with str.
#f: regexec returns #t if a match is made, otherwise #f.
vector: the supplied vector is returned, with the first element replaced by a pair containing the indices of the matched portion of str and further elements replaced by pairs containing the indices of matched substrings (if any).
list: a list will be returned, with each member of the list specified by a code in the corresponding position of the supplied list:
a number: the numbered matching substring (0 for the entire match).
#\<: the beginning of str to the beginning of the part matched by rgx.
#\>: the end of the matched part of str to the end of str.
#\c: the "final tag", which seems to be associated with the "cut operator", which doesn't seem to be available through the posix interface.
e.g., (list #\< 0 1 #\>). The returned substrings share memory with str.
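The different match-pick values can be compared side by side. This is only a sketch: it assumes match-pick is the third argument to regexec, and the names r and s are invented for the example.

```scheme
(use-modules (lang librgx))

(define r (regcomp "abc."))
(define s "xxabcdyy")

;; Default (match-pick unspecified): a vector of index pairs.
(regexec r s)

;; #f: only report whether a match exists.
(regexec r s #f)

;; "": a list whose first element holds the matched part together
;; with the unmatched parts around it.
(regexec r s "")

;; A list of codes: the text before the match, the whole match (0),
;; and the text after the match.
(regexec r s (list #\< 0 #\>))
```

Going by the descriptions above, the last call should return the three pieces "xx", "abcd" and "yy", each sharing memory with s.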
A compiled regexp, such as one returned by regcomp, is a type of
deterministic finite automaton, or dfa for short; application of
this dfa to a string is the essence of the regexec operation.
Rgx provides some additional procedures that can be used to create,
recognize, inspect and apply a dfa, with a level of control not
available through regexec. These are described here.
Return #t iff obj is a compiled regexp, i.e., an object returned by
regcomp.
Return a deterministic finite automaton compiled from regular expression regexp. Optional arg cfl is a flag, either REG_EXTENDED or REG_NEWLINE (use logior to combine them).