Next: , Up: Lexing and Parsing


41.1 Constructing a lexical analyzer

You can construct a lexical analyzer with the procedures in this module.

— Scheme Procedure: lexeme-reader dfa

Return a procedure that returns a lexeme, according to the dfa. The returned proc, let's call it read-lexeme, has signature:

          (read-lexeme [port])

The rest of this description is about read-lexeme.

Read a lexeme, as defined by dfa, from port or the current input.

dfa should match an entire lexeme and nothing more. read-lexeme will return:

          (<n> <token>)

where <token> is a string containing the characters matched, and <n> indicates the token type.

A novel regexp operator is particularly important when using this procedure: the cut operator.

The cut operator, written ‘[[:cut <n>:]]’, causes immediate termination of a match. <n> must be an integer. If it is 0, the match fails. Otherwise, the match succeeds, returning <n> in the match data. Using the Scheme procedure regexec, the match-data selector #\c retrieves the cut value. For example:

          (define r (regcomp "if[[:cut 1:]]\\|else[[:cut 2:]]"))
          
          (regexec r "if" '(#\c 0))
          ⇒ (1 "if")
          
          (regexec r "else" '(#\c 0))
          ⇒ (2 "else")

The above examples illustrate how the return value of read-lexeme is formed. Sufficient characters are read to form a match while leaving one character left over. The extra character is put back on the input stream and match-data for (#\c 0) is returned.

If the end of file is reached before a match is found, then the lexer should call eof-token with the string of read-but-yet-unmatched characters, and return the resulting value.

There are two exceptions to the rule that the lexer will read, then unread an "extra" character for each lexeme. The first exception is if EOF is reached after finding a match, but before reading an an extra character. In that case, the match is simply returned. The second exception is if the cut value of the match is less than 0, the lexeme is immediately returned without reading additional characters.

Using this interface is quite similar to using the lex language, except for the handling of illegal (non-lexable) input. The unix lexical analyzers are able to report an error as soon as characters are read which could not possibly begin a lexeme. This lexer, on the other hand, naively continues to read characters until eof is reached. This reflects a deficiency of the regexec interface (it has no way to report that not only was no match found, but no match could be formed by adding more characters). The work-around is to carefully write "regexp" so that it matches illegal input and returns such as a distinct token type. A more robust (lex-like) solution will eventually be implemented.

— Scheme Procedure: lexer-regexp-and-handler spec return

Return a regexp and a procedure suitable for lexing according to the analyzer specification spec. Call return when done, like so:

          (RETURN regexp handler)

See make-lexer for an example of how this is used.

— Scheme Procedure: make-lexer specification

Return a lexical analyzer built from specification.

A lexer is built from a specification that consists of regexps and actions. The regexps are listed in order of precedence, each matching a particular token type.

The actions are either a token-id (a symbol or #f), or a procedure.

If an action is a token-id <i> then when a matching token is read, the list (<i> <lexeme>) is returned (where <lexeme> is a string consisting of the matched characters).

If an action is a procedure, the procedure is called with one argument, the lexeme string. The return value of the procedure is returned from the lexer.

Here is a sample specification:

          (define ctax-lexer-specification
            `(("[0-9]\\+\\.\\?[0-9]*"          ,(lambda (token)
                                                  (list 'number
                                                       (number->string token))))
              ("if"                            if)
              ("else"                          else)
              ("while"                         while)
              ("for"                           for)
              ("return"                        return)
              ("do"                            do)
              ("scm"                           scm)
              ("break"                         break)
              ("continue"                      continue)
              ("[a-zA-Z][a-z?!<>=0-9_]*"       ,(lambda (token)
                                                  (list 'identifier
                                                       (string->symbol token))))
              ("//[^\n]*"                      #f)
              ("[ \t\n]\\+"                    #f)
              ("[^ \t\n]\\+"                   error)))

WARNING: all lexers should be total – that is, they must succeed at dividing all possible input streams into tokens. No provision is made to cleanly handle non-matching input. That's why the example lexer includes a token type "error" – to consume non-lexable input until syncronizing characters (whitespace in this case) are found.

Generally speaking, given a specification, the lexer will return the longest matching lexeme. If two cases both match, the lexer will use the one that occurs first in the specification.

The rule that longest matches may be overrided for a particular type of lexeme by putting the keyword #:shortest after the action in the lexer specification. If such a lexeme type is ever matched, it is returned immediately without consuming addtional characters to look for a longer match.

— Scheme Procedure: eof-token? obj

Return #t iff obj is an eof token.

— Scheme Procedure: eof-token [etc...]

Return a recognizable eof token with etc information. You can recognize it as such using procedure eof-token?. This procedure will probably be made private in the future. Do not use it.