Next: lang grammar, Up: Lexing and Parsing
You can construct a lexical analyzer with the procedures in this module.
Return a procedure that returns a lexeme, according to the dfa. The returned proc, let's call it
read-lexeme, has signature:(read-lexeme [port])The rest of this description is about
read-lexeme.Read a lexeme, as defined by dfa, from port or the current input.
dfa should match an entire lexeme and nothing more.
read-lexemewill return:(<n> <token>)where
<token>is a string containing the characters matched, and<n>indicates the token type.A novel regexp operator is particularly important when using this procedure: the cut operator.
The cut operator, written ‘[[:cut <n>:]]’, causes immediate termination of a match.
<n>must be an integer. If it is 0, the match fails. Otherwise, the match succeeds, returning<n>in the match data. Using the Scheme procedureregexec, the match-data selector#\cretrieves the cut value. For example:(define r (regcomp "if[[:cut 1:]]\\|else[[:cut 2:]]")) (regexec r "if" '(#\c 0)) ⇒ (1 "if") (regexec r "else" '(#\c 0)) ⇒ (2 "else")The above examples illustrate how the return value of
read-lexemeis formed. Sufficient characters are read to form a match while leaving one character left over. The extra character is put back on the input stream and match-data for(#\c 0)is returned.If the end of file is reached before a match is found, then the lexer should call
eof-tokenwith the string of read-but-yet-unmatched characters, and return the resulting value.There are two exceptions to the rule that the lexer will read, then unread an "extra" character for each lexeme. The first exception is if EOF is reached after finding a match, but before reading an an extra character. In that case, the match is simply returned. The second exception is if the cut value of the match is less than 0, the lexeme is immediately returned without reading additional characters.
Using this interface is quite similar to using the lex language, except for the handling of illegal (non-lexable) input. The unix lexical analyzers are able to report an error as soon as characters are read which could not possibly begin a lexeme. This lexer, on the other hand, naively continues to read characters until eof is reached. This reflects a deficiency of the regexec interface (it has no way to report that not only was no match found, but no match could be formed by adding more characters). The work-around is to carefully write "regexp" so that it matches illegal input and returns such as a distinct token type. A more robust (lex-like) solution will eventually be implemented.
Return a regexp and a procedure suitable for lexing according to the analyzer specification spec. Call return when done, like so:
(RETURN regexp handler)See
make-lexerfor an example of how this is used.
Return a lexical analyzer built from specification.
A lexer is built from a specification that consists of regexps and actions. The regexps are listed in order of precedence, each matching a particular token type.
The actions are either a token-id (a symbol or #f), or a procedure.
If an action is a token-id
<i>then when a matching token is read, the list(<i> <lexeme>)is returned (where<lexeme>is a string consisting of the matched characters).If an action is a procedure, the procedure is called with one argument, the lexeme string. The return value of the procedure is returned from the lexer.
Here is a sample specification:
(define ctax-lexer-specification `(("[0-9]\\+\\.\\?[0-9]*" ,(lambda (token) (list 'number (number->string token)))) ("if" if) ("else" else) ("while" while) ("for" for) ("return" return) ("do" do) ("scm" scm) ("break" break) ("continue" continue) ("[a-zA-Z][a-z?!<>=0-9_]*" ,(lambda (token) (list 'identifier (string->symbol token)))) ("//[^\n]*" #f) ("[ \t\n]\\+" #f) ("[^ \t\n]\\+" error)))WARNING: all lexers should be total – that is, they must succeed at dividing all possible input streams into tokens. No provision is made to cleanly handle non-matching input. That's why the example lexer includes a token type "error" – to consume non-lexable input until syncronizing characters (whitespace in this case) are found.
Generally speaking, given a specification, the lexer will return the longest matching lexeme. If two cases both match, the lexer will use the one that occurs first in the specification.
The rule that longest matches may be overrided for a particular type of lexeme by putting the keyword
#:shortestafter the action in the lexer specification. If such a lexeme type is ever matched, it is returned immediately without consuming addtional characters to look for a longer match.