An LL Grammar For Regular Expression Parsing
2020-07-09 - By Robert Elder
This article is part of Series On Regular Expressions.
Below is an LL grammar for a fairly complete regular exprssion parser. This grammar is a documentation of the grammar supported by the Regular Expression Visualizer Tool demonstrated here. This grammar supports most basic regular expression operations such as character classes, alternation (|), greedy and lazy quantifiers (*, +, {n}, {m,n}, {n,}, {,n}, *?, +?, {n}?, {m,n}?, {n,}?, {,n}?). Sub-expressions are supported. Simple escaped ASCII characters such as '\n', '\t' or '\x97' are supported. Currently, the following features are not supported: Unicode, backreferences, word mark breaks, capture groups, look-aheads, look-behinds. Other less-universal regular expression features like nested character classes are not supported.
LL grammars like this one are an excellent resource for writing recursive descent parsers. The grammar described below was written in a format similar to Yacc grammars, although it has never been perfected enough to actually run through a parser-generator to generate source code. Instead this hand-written grammar was used as a guide when creating a hand-written parser.
Update History
- 2020-12-05: Added grammar rules to support non-capture groups, negative and positive lookbehind/lookaheads, atomic groups. This will support better error messages in the compiler until they are fully supported.
The Grammar
epsilon
: 'epsilon' just means it's ok to match an empty string when no better option is available.
decimal_digit
: '0' to '9'
decimal_number_rest
: decimal_digit decimal_number_rest
| epsilon
decimal_number
: decimal_digit decimal_number_rest
hex_digit
: '0' to '9'
| 'A' to 'F'
| 'a' to 'f'
escaped_hex_character
| 'x' hex_digit hex_digit
escaped_ascii_character
: any character not in 'x' or escaped_regex_class_character
escaped_regex_class_character
: 'd'
| 'D'
| 's'
| 'S'
| 'w'
| 'W'
| 'N' # Not currently supported: Provide error message in compiler.
| 'h' # Not currently supported: Provide error message in compiler.
| 'H' # Not currently supported: Provide error message in compiler.
| 'V' # Not currently supported: Provide error message in compiler.
| 'R' # Not currently supported: Provide error message in compiler.
| 'A' # Not currently supported: Provide error message in compiler.
| 'z' # Not currently supported: Provide error message in compiler.
| 'Z' # Not currently supported: Provide error message in compiler.
| 'G' # Not currently supported: Provide error message in compiler.
| 'b' # Not currently supported: Provide error message in compiler.
| 'B' # Not currently supported: Provide error message in compiler.
escaped_character
: '\' escaped_hex_character
| '\' escaped_ascii_character
| '\' escaped_regex_class_character
character_range
: character_range_endpoint '-' character_range_endpoint
# Any non-escaped character that can appear inside a character class.
# Can include '^' because this is picked up directly in the grammar rule
# and subsequent '^' characters are treated as literal. '^' is only
# special when it's first. Allow un-escaped '-' since that seems to
# be a commonly expected use case.
class_ascii_character
: \x00 to ','
| '.' to '['
# Avoid '[' and '\'
| '^' to \xFF
# Any non-escaped character that can appear outside a character class.
non_class_ascii_character
# Any character from 0x00 to 0xFF, except
# Avoid '^' '$'
# Avoid '(' ')' '*' '+'
# Avoid '.'
# Avoid '?'
# Avoid '{' '}'
# Avoid '[' '\'
# Don't avoid ']' since a lot of regex parsers allow it unescaped
# Avoid '|'
non_class_special_character
: '.'
anchor
: '^'
| '$'
non_class_character
: non_class_ascii_character
| non_class_special_character
| escaped_character
character_range_endpoint
: class_ascii_character
| escaped_character
character_or_range
: character_range
| character_range_endpoint
character_class_rest
: character_or_range character_class_rest
| ']'
character_class
: '[' '^' character_or_range character_class_rest
| '[' character_or_range character_class_rest
explicit_quantifier
: '{' decimal_number '}'
| '{' decimal_number ',' '}'
| '{' ',' decimal_number '}'
| '{' decimal_number ',' decimal_number '}'
implicit_quantifier
: '+'
| '?'
| '*'
quantifier_operator
: implicit_quantifier '?'
| implicit_quantifier
| explicit_quantifier '?'
| explicit_quantifier
character_or_class
: character_class
| non_class_character
sub_expression
: '(' regular_expression ')'
# Technically supported by doing nothing since back-references are not implemented yet.
| '(' '?' ':' regular_expression ')'
# None of these are currently supported, but they should be parsed to provide good error message:
| '(' '?' '!' regular_expression ')'
| '(' '?' '=' regular_expression ')'
| '(' '?' '<' '!' regular_expression ')'
| '(' '?' '<' '=' regular_expression ')'
| '(' '?' '>' regular_expression ')'
quantified_expression
: character_or_class quantifier_operator
| character_or_class
| sub_expression quantifier_operator
| sub_expression
| anchor
concatenation_expression_rest
: quantified_expression concatenation_expression_rest
| epsilon
concatenation_expression
: quantified_expression concatenation_expression_rest
alternation_expression_rest
: '|' concatenation_expression alternation_expression_rest
| '|' alternation_expression_rest
: concatenation_expression alternation_expression_rest
| epsilon
alternation_expression
: '|' alternation_expression_rest
| concatenation_expression alternation_expression_rest
regular_expression
: alternation_expression regular_expression
| epsilon
The Regular Expression Visualizer, Simulator & Cross-Compiler Tool
Published 2020-07-09 |
$20.00 CAD |
Interesting Regular Expression Test Cases
Published 2020-07-09 |
Regular Expression Character Escaping
Published 2020-11-20 |
How Do Regular Expression Quantifier Work?
Published 2020-08-18 |
How Regular Expression Alternation Works
Published 2020-08-18 |
Character Ranges & Class Negation in Regular Expressions
Published 2020-05-31 |
Guide To Regular Expressions
Published 2020-07-09 |
Join My Mailing List Privacy Policy |
Why Bother Subscribing?
|