"Maximal-Munch" Tokenization in Linear Time
Thomas Reps
The lexical-analysis (or scanning) phase of a compiler attempts to
partition an input string into a sequence of tokens. The convention
in most languages is that the input is scanned left to right, and each
token identified is a "maximal munch" of the remaining input---the
longest prefix of the remaining input that is a token of the language.
Although most of the standard compiler textbooks present a way to
perform maximal-munch tokenization, the algorithm they describe is one
that, for certain sets of token definitions, can cause the scanner to
exhibit quadratic behavior in the worst case. In this paper, we show
that maximal-munch tokenization can always be performed in time linear
in the size of the input.
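The linear-time result rests on tabulation: recording (state, input-position) configurations from which the scanner is already known to reach no accepting state, so no configuration is ever re-explored after backtracking. A minimal sketch of that idea, not the paper's exact algorithm, is shown below; the DFA encodes the illustrative token patterns "abc" and "(abc)*d", chosen here because a naive backtracking scanner is quadratic on inputs like "abcabc...abc".

```python
# Sketch of maximal-munch tokenization with tabulated failures.
# The DFA, states, and patterns below are illustrative assumptions,
# not taken from the paper.

def tokenize(dfa, start, accepting, text):
    """Maximal-munch scan of `text` using transition table `dfa`.

    `dead` holds (state, index) configurations known to lead to no
    accepting state.  Because the DFA is deterministic, a configuration
    that failed once fails in every later scan, so tabulating it bounds
    the total work by the number of distinct configurations: O(|states|
    * |text|), i.e. linear in the input for a fixed token set.
    """
    dead = set()
    tokens, pos = [], 0
    while pos < len(text):
        state, i = start, pos
        path = [(state, i)]              # configurations visited this scan
        last_accept = -1                 # rightmost accepting position seen
        while i < len(text) and (state, i) not in dead \
                and (state, text[i]) in dfa:
            state = dfa[(state, text[i])]
            i += 1
            path.append((state, i))
            if state in accepting:
                last_accept = i
        if last_accept < 0:
            raise ValueError(f"no token starting at position {pos}")
        # Everything explored beyond the last accept failed to extend the
        # match; tabulate it so later scans stop as soon as they hit it.
        for st, j in path:
            if j > last_accept:
                dead.add((st, j))
        tokens.append(text[pos:last_accept])
        pos = last_accept                # maximal munch: restart after token
    return tokens

# DFA for the two illustrative tokens "abc" and "(abc)*d":
# states 0 (start), 1-3 track the first "abc" (3 accepts "abc"),
# 5-7 track further "abc" repetitions, 4 accepts "...d".
dfa = {
    (0, 'a'): 1, (0, 'd'): 4,
    (1, 'b'): 2, (2, 'c'): 3,
    (3, 'a'): 5, (3, 'd'): 4,
    (5, 'b'): 6, (6, 'c'): 7,
    (7, 'a'): 5, (7, 'd'): 4,
}
accepting = {3, 4}
```

On "abcabcabc" a naive scanner re-reads the whole remaining input once per emitted token (still hoping for a final "d"); with the `dead` table, the second and later scans abandon the lookahead at the first tabulated configuration, giving a single linear pass overall.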
CR Categories and Subject Descriptors:
D.3.4 [Programming Languages]: Processors -- compilers;
F.1.1 [Computation by Abstract Devices]:
Models of Computation -- automata;
F.2.2 [Analysis of Algorithms and Problem Complexity]:
Nonnumerical Algorithms and Problems -- pattern matching;
I.5.4 [Pattern Recognition]: Applications -- text processing
General Terms: Algorithms, Theory
Additional Key Words and Phrases:
backtracking, dynamic programming, memoization, tabulation, tokenization
University of Wisconsin