"Maximal-Munch" Tokenization in Linear Time

Thomas Reps
University of Wisconsin

The lexical-analysis (or scanning) phase of a compiler attempts to partition an input string into a sequence of tokens. The convention in most languages is that the input is scanned left to right, and each token identified is a "maximal munch" of the remaining input---the longest prefix of the remaining input that is a token of the language. Although most of the standard compiler textbooks present a way to perform maximal-munch tokenization, the algorithm they describe is one that, for certain sets of token definitions, can cause the scanner to exhibit quadratic behavior in the worst case. In this paper, we show that maximal-munch tokenization can always be performed in time linear in the size of the input.

CR Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors -- compilers; F.1.1 [Computation by Abstract Devices]: Models of Computation -- automata; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems -- pattern matching; I.5.4 [Pattern Recognition]: Applications -- text processing

General Terms: Algorithms, Theory

Additional Key Words and Phrases: backtracking, dynamic programming, memoization, tabulation, tokenization

(Click here to access the paper: PostScript, PDF.)