An example implementation of an NFA-based regular expression engine.

Automata can be used to model and understand computation and logic. An automaton that has a finite number of states is called a finite automaton (FA) or finite state machine. Commonly, automata are restricted to occupying one state at a time. If we let go of this restriction, we get a nondeterministic finite automaton (NFA): a discrete system that can be in multiple states at once.
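As a toy illustration of that idea (separate from the engine we'll build below), here is a sketch that tracks the set of active states of a hand-written NFA. The transition table and state numbering are made up for this example:

```python
# A hand-written NFA for: binary strings whose second-to-last symbol is "1".
# The table maps (state, symbol) to a SET of next states -- that set having
# more than one element is exactly where the nondeterminism lives.
TRANSITIONS = {
    (0, "0"): {0},
    (0, "1"): {0, 1},  # on "1": stay, OR guess this is the second-to-last symbol
    (1, "0"): {2},
    (1, "1"): {2},
}
ACCEPTING = {2}

def accepts(s: str) -> bool:
    states = {0}  # an NFA occupies a *set* of states at once
    for ch in s:
        states = {nxt for st in states for nxt in TRANSITIONS.get((st, ch), set())}
    return bool(states & ACCEPTING)

print(accepts("0110"))  # True: the second-to-last symbol is "1"
print(accepts("1100"))  # False
```

The set comprehension advances every active state in lockstep; a dead branch simply contributes no successors.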

Every regular expression can be converted to an equivalent NFA. The initial algorithm was discovered by Kleene in the 1950s [1, 2]. The first proper regular expression search algorithm was published by Thompson in 1968 [3]; it goes by the name Thompson's construction algorithm.

This is a fun algorithm! It’s very fast and uses a fixed amount of memory. The major drawback is that there is no (known) way to implement backreferences with it.

NFA-based regex engines are used by awk, grep, and Rust's regex crate [4, 5]. It's worth mentioning that many regex engines use deterministic finite automata (DFA) instead of NFAs. Any NFA can be transformed into an equivalent DFA, and the two are equal in expressive power [6]. In practice, DFAs are faster to execute than NFAs, but require more memory.
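For intuition, that NFA-to-DFA conversion (the subset construction) can be sketched in a few lines. The transition-table format here is an assumption for illustration, not the representation we'll use later:

```python
def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.

    `nfa` maps (state, symbol) -> set of next states; the result maps
    (frozenset, symbol) -> frozenset. The number of DFA states can grow
    up to 2^n for n NFA states -- hence the memory trade-off.
    """
    dfa = {}
    todo = [frozenset({start})]
    seen = set(todo)
    while todo:
        current = todo.pop()
        for sym in alphabet:
            # the deterministic successor is the union of all NFA successors
            nxt = frozenset(n for st in current for n in nfa.get((st, sym), set()))
            dfa[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return dfa
```

Each reachable frozenset becomes one deterministic state, so matching needs only one table lookup per letter, at the cost of building and storing the subsets up front.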

In this article, we'll implement an NFA-based regex engine in Python.

It takes a few steps to convert an input expression, such as /ab*c/, into something that can match a string: parse the expression into an AST, compile the AST into an NFA, and run the NFA against the input.

Parsing an expression to an AST

We won't focus on parsing in this article, so we'll use Lark to generate a parser. The grammar is defined using EBNF notation. It defines what a valid regular expression consists of:

File: grammar.lark
regex:      "/" alt "/"
alt:        concat ("|" concat)*
concat:     item+
item:       atom QUANTIFIER?
atom:       dot | "(" alt ")" | literal
dot:        "."
literal:    /[^.()|*+]/
QUANTIFIER: "*" | "+"

Each row defines a rule followed by its definition. The root rule regex consists of an alternation surrounded by forward slashes. The alternation in turn consists of one or more concatenations. And so forth.

Don't be confused by our use of regexes in the grammar file. Lark uses them to define terminals; they have nothing to do with our regex implementation.

Our grammar supports the regex metacharacters + and *, the wildcard ., and alternation with |. Concatenation is supported implicitly.

With the grammar defined, we can harness the parser to produce an abstract syntax tree (AST).

It's a bit of boilerplate, but we can define the transformation from the parse tree to an abstract syntax tree by inheriting from lark.Transformer:

File: regex_ast.py
from dataclasses import dataclass

from lark import Lark, Transformer, v_args


@v_args(inline=True)
class AstBuilder(Transformer):
    def regex(self, child):
        return Ast.Regex(child)

    def alt(self, *children):
        return children[0] if len(children) == 1 else Ast.Alt(children)

    def concat(self, *children):
        return children[0] if len(children) == 1 else Ast.Concat(children)

    def item(self, atom, quantifier=None):
        if quantifier is None:
            return atom
        return Ast.Star(atom) if str(quantifier) == "*" else Ast.Plus(atom)

    def atom(self, child):
        return child

    def dot(self):
        return Ast.Dot()

    def literal(self, char):
        return Ast.Literal(str(char))


class Ast:
    @dataclass
    class Regex:
        child: "Ast.Node"

    @dataclass
    class Literal:
        char: str

    @dataclass
    class Dot:
        pass

    @dataclass
    class Star:
        child: "Ast.Node"

    @dataclass
    class Plus:
        child: "Ast.Node"

    @dataclass
    class Concat:
        children: tuple["Ast.Node", ...]

    @dataclass
    class Alt:
        children: tuple["Ast.Node", ...]

    type Node = Regex | Literal | Dot | Star | Plus | Concat | Alt

    @classmethod
    def len(cls, node) -> int:
        match node:
            case cls.Literal() | cls.Dot():
                return 1
            case cls.Star(child) | cls.Plus(child):
                return 1 + cls.len(child)
            case cls.Concat(children):
                return sum(cls.len(c) for c in children)
            case cls.Alt(children):
                return 1 + sum(cls.len(c) for c in children)
            case cls.Regex(child):
                return cls.len(child)
            case _:
                raise RuntimeError(f"Unhandled AST node: {node!r}")


builder = AstBuilder()
parser = Lark.open("grammar.lark", start="regex")

Using these, we can parse and transform the input expression /ab+/:

from regex_ast import builder, parser

ast = builder.transform(parser.parse("/ab+/"))
print(ast)

which results in the AST:

Ast.Regex(
  child=Ast.Concat(
    children=(
      Ast.Literal(char='a'),
      Ast.Plus(child=Ast.Literal(char='b'))
    )
  )
)

Compiling an AST into an automaton

The AST is a tree representation of the parsed input. It’s not well suited for executing a regex, so the next step is to transform it to a finite automaton.

Consider the regular expression /ab*/. Recall that finite automata and finite state machines are the same thing, so one natural way to represent the expression is a state chart.

A finite automaton represented as a state chart
Figure 1: The state chart for an NFA equivalent to the regular expression /ab*/.

Tracing the input abb through the chart in figure 1 letter by letter, you'll find that there are three ways to end up in the End state: the strings “a”, “ab”, and “abb” all match the expression /ab*/.

Thompson's construction algorithm composes these partial NFAs into one final automaton. His original approach literally compiled the expression to IBM 7094 machine code. The transform from the plain-text expression to an NFA is still commonly called compiling.

Modified Thompson construction rules for an NFA
Figure 2: Our modified Thompson construction rules for an NFA.

The rules for partial NFAs are shown in figure 2. Our approach deviates slightly from that of Thompson. We’ll unify the + and * metacharacters as a Quantifier, and treat the dot operator as a special case of a Literal. This simplifies NFA creation slightly, since all rules have their entry state as the left-most node.

The regular expression will be transformed into the following NFA states:

File: regex_nfa.py (excerpt)
from dataclasses import dataclass

from regex_ast import Ast


class State:
    @dataclass
    class Split:
        next: tuple[int, ...]

    @dataclass
    class Quantifier:
        min_visits: int
        next: int
        exit: int

    @dataclass
    class Literal:
        state_literal: str | None  # None means match anything
        next: int

    @dataclass
    class End:
        pass

    type Node = Split | Quantifier | Literal | End

Each state holds “pointers” to its successors in the form of their list indices. The outer State class is used as a namespace.

To compile the Ast to a list[State.Node], we use a helper build_states(). It walks the tree from the root node and recursively appends states to a flat list. Note that Ast.len() returns exactly the number of states a node compiles to, which is what lets build_states() pre-compute the exit index of each child.

File: regex_nfa.py (excerpt, continued)
def build_states(ast: Ast.Node, states: list[State.Node], *, exit: int) -> None:
    """Builds a list of states from an ast node.

    Note:
        Mutates the `states` argument.
    """
    match ast:
        case Ast.Regex(child):
            build_states(child, states, exit=exit)

        case Ast.Concat(children):
            next = len(states)
            last = len(children) - 1
            for i, child in enumerate(children):
                next += Ast.len(child)
                child_exit = exit if i == last else next
                build_states(child, states, exit=child_exit)

        case Ast.Literal(char):
            states.append(State.Literal(state_literal=char, next=exit))

        case Ast.Dot():
            states.append(State.Literal(state_literal=None, next=exit))

        case Ast.Alt(children):
            next = len(states) + 1
            offsets = []
            for alt in children:
                offsets.append(next)
                next += Ast.len(alt)
            states.append(State.Split(next=tuple(offsets)))
            for alt in children:
                build_states(alt, states, exit=exit)

        case Ast.Star(child) | Ast.Plus(child):
            this = len(states)
            count = 0 if isinstance(ast, Ast.Star) else 1
            states.append(State.Quantifier(min_visits=count, next=this + 1, exit=exit))
            build_states(child, states, exit=this)

class Nfa:
    def __init__(self, ast):
        exit = Ast.len(ast)
        self.states: list = []
        build_states(ast, self.states, exit=exit)
        self.states.append(State.End())
        self.reset()

    # ...

We can now compile the NFA for the expression /ab*/ with:

from regex_ast import builder, parser
from regex_nfa import Nfa

ast = builder.transform(parser.parse("/ab*/"))
nfa = Nfa(ast)
print(nfa.states)

which gives the NFA:

[
  State.Literal(state_literal='a', next=1),
  State.Quantifier(min_visits=0, next=2, exit=3),
  State.Literal(state_literal='b', next=1),
  State.End()
]

We have used State.Quantifier as a convenient abstraction for both the star (*) and plus (+) metacharacters. Their only difference is the min_visits required for passing.
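That gate logic can be isolated into a one-liner (a hypothetical helper for illustration, not part of the modules above):

```python
def may_exit(visits: int, min_visits: int) -> bool:
    """Is the Quantifier's exit arrow open after `visits` passes through the body?

    Star compiles with min_visits=0: the exit is open immediately, so zero
    occurrences match. Plus compiles with min_visits=1: the body must have
    matched at least once before the exit opens.
    """
    return visits >= min_visits

print(may_exit(0, 0))  # True:  '*' may match zero occurrences
print(may_exit(0, 1))  # False: '+' demands at least one
print(may_exit(1, 1))  # True
```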

The final piece in the Nfa class is how to implement matching by executing the NFA.

Running the NFA regex engine

The general idea is to read the input string letter by letter and check whether any of the current states match the letter. For example, the state State.Literal(state_literal='b', next=2) matches the letter ‘b’. When a state matches, the engine queues its successor states for processing.

The Thompson NFA relies on so-called epsilon transitions. Upon entering a node with epsilon transitions, we queue its successors immediately, without checking for a match.

In our implementation, the epsilon nodes are State.Quantifier and State.Split. Since an NFA can be in many states simultaneously, there may be more than one epsilon transition to follow.

To check for a match, we consume the whole input and iterate over the current states until there are no active states left, or we hit a State.End node.
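The eager expansion of epsilon successors can be sketched on its own with a worklist of the same shape as the loop in step(); the table format here is made up for illustration:

```python
def epsilon_closure(state_ids: set[int], epsilon: dict[int, tuple[int, ...]]) -> set[int]:
    """All states reachable from `state_ids` through epsilon arrows alone.

    `epsilon` maps a state index to the indices it can jump to without
    consuming any input letter.
    """
    closure = set(state_ids)
    worklist = list(state_ids)
    while worklist:
        st = worklist.pop()
        for nxt in epsilon.get(st, ()):
            if nxt not in closure:  # guard against epsilon cycles
                closure.add(nxt)
                worklist.append(nxt)
    return closure

# A Split at index 0 fanning out to 1 and 2, with 2 jumping on to 3:
print(epsilon_closure({0}, {0: (1, 2), 2: (3,)}))  # all of 0, 1, 2, 3 are active
```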

File: regex_nfa.py (excerpt, continued)
def matches_literal(state_literal: str | None, step_literal: str | None) -> bool:
    """Does a literal state accept the consumed letter?

    A state_literal of None is the wildcard dot. A step_literal of None is the
    final epsilon step, which no literal may consume.
    """
    return step_literal is not None and state_literal in (None, step_literal)


class Nfa:

    # ...

    def reset(self):
        self.next_states: list[int] = [0]
        self.visit_counts: list[int] = [0] * len(self.states)

    def step(self, step_literal: str | None, prefix_match: bool = False) -> bool:
        states = self.next_states
        self.next_states = []

        # process current round of states and epsilon-states
        while states:
            idx = states.pop()
            match self.states[idx]:
                case State.End():
                    # In prefix mode, accept as soon as `End` is reached
                    if prefix_match:
                        return True
                    # In fullmatch mode, only accept during final epsilon step
                    if step_literal is None:
                        return True
                case State.Split(next=nexts):
                    states.extend(nexts)
                case State.Quantifier(min_visits=min_v, next=next, exit=exit):
                    states.append(next)
                    if self.visit_counts[idx] >= min_v:
                        states.append(exit)
                    self.visit_counts[idx] += 1
                case State.Literal(state_literal=state_literal, next=next):
                    if matches_literal(state_literal, step_literal):
                        # push to next round
                        self.next_states.append(next)
        return False

The step() method consumes a single letter and advances the NFA state accordingly. It returns either True or False, depending on whether the NFA transitioned into an accepting End state.

That’s it, we have a working regex engine, and it’s surprisingly powerful!

The final touch is the Regex class and a wrapper for CLI:

File: regex.py
#!/usr/bin/env -S uv run --script
from regex_ast import builder, parser
from regex_nfa import Nfa


class Regex:
    def __init__(self, pattern: str):
        ast = builder.transform(parser.parse(pattern))
        self.nfa = Nfa(ast)

    def fullmatch(self, seq: str) -> bool:
        self.nfa.reset()
        # step through sequence with a final epsilon step to reach terminal
        for c in [*seq, None]:
            if self.nfa.step(c):
                return True
        return False


if __name__ == "__main__":
    import sys

    pattern, string = sys.argv[1], sys.argv[2]
    regex = Regex(pattern)

    lines = sys.stdin if string == "-" else [string]
    for line in lines:
        print(regex.fullmatch(line.rstrip("\n")))

With uv, we can directly invoke the regex CLI:

$ export EXPR="/ab+c/"
$ ./regex.py $EXPR abc
True

$ ./regex.py $EXPR abbc
True

$ ./regex.py $EXPR ac
False

It also handles more elaborate patterns:

$ export EXPR="/(a|b|c)(nt|at|lb|ross)+/"
$ ./regex.py $EXPR cat
True

$ ./regex.py $EXPR bat
True

$ ./regex.py $EXPR ant
True

$ ./regex.py $EXPR albatross
True

$ ./regex.py $EXPR horse
False

Performance

Our short implementation performs quite all right. In a quick hyperfine benchmark, it chugs through the default Unix dictionary of 235k words in about half a second.

Expand for benchmark code.
$ export EXPR='(a|b|c)(nt|at|lb|ross)+'

$ hyperfine --warmup 1 \
  "./regex.py '/$EXPR/' - < /usr/share/dict/words > /dev/null" \
  "grep -E '^${EXPR}\$' /usr/share/dict/words > /dev/null"

Benchmark 1: ./regex.py '/(a|b|c)(nt|at|lb|ross)+/' - < /usr/share/dict/words > /dev/null
  Time (mean ± σ):     587.1 ms ±  16.5 ms    [User: 560.8 ms, System: 18.5 ms]
  Range (min … max):   570.2 ms … 628.4 ms    10 runs

Benchmark 2: grep -E '^(a|b|c)(nt|at|lb|ross)+$' /usr/share/dict/words > /dev/null
  Time (mean ± σ):      39.5 ms ±   2.0 ms    [User: 34.9 ms, System: 1.5 ms]
  Range (min … max):    36.2 ms …  44.3 ms    69 runs

Summary
  grep -E '^(a|b|c)(nt|at|lb|ross)+$' /usr/share/dict/words > /dev/null ran
   14.87 ± 0.84 times faster than ./regex.py '/(a|b|c)(nt|at|lb|ross)+/' - < /usr/share/dict/words > /dev/null

Compared to grep, our unoptimized Python regex engine is merely 15 times slower. Not bad, considering we’re comparing to a heavily optimized program.

Thanks for reading!

  1. S. C. Kleene. Representation of Events in Nerve Nets and Finite Automata. RAND Corporation memorandum (1951).

  2. S. C. Kleene. Representation of Events in Nerve Nets and Finite Automata. In Automata Studies, Princeton University Press (1956).

  3. K. Thompson. Programming Techniques: Regular Expression Search Algorithm. Communications of the ACM, 11(6): 419–422 (1968).

  4. Regular Expression Matching Can Be Simple And Fast, on Russ Cox's website.

  5. Docs for the NFA module in regex-automata, the engine underlying Rust's regex crate.

  6. M. O. Rabin and D. Scott. Finite Automata and Their Decision Problems. IBM Journal of Research and Development (1959).