Interesting Regular Expression Test Cases
2020-07-09 - By Robert Elder
This article is part of Series On Regular Expressions.
The purpose of this article is to document a list of test cases for the Regular Expression Visualizer tool which uses a custom-written regular expression engine. For each test case below, the relevant regular expression is listed along with an explanation about why it is useful. A direct link is also provided for each test case which allows you to see the result of inputting that regular expression into the visualizer tool. These test cases serve as a form of documentation for what cases are currently supported by the tool. Some of these test cases may also be interesting to anyone else writing their own regular expression engine.
Everyday Use Cases
^([a-z0-9_\.\-]+)@([\da-z\.\-]+)\.([a-z\.]{2,5})$
The regex '^([a-z0-9_\.\-]+)@([\da-z\.\-]+)\.([a-z\.]{2,5})$': An example of a commonly cited regular expression that claims to match email addresses (but doesn't do a very good job).
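To make the "doesn't do a very good job" remark concrete, here is a minimal sketch that runs the same pattern through Python's `re` module rather than the visualizer's own engine. The sample addresses and the helper name `looks_like_email` are made up for illustration.

```python
import re

# The pattern from the test case above, unchanged.
EMAIL_RE = re.compile(r"^([a-z0-9_\.\-]+)@([\da-z\.\-]+)\.([a-z\.]{2,5})$")

def looks_like_email(address: str) -> bool:
    """Return True if the cited pattern accepts the address (hypothetical helper)."""
    return EMAIL_RE.match(address) is not None

# Illustrative inputs only; none of them come from the original article.
samples = [
    "user@example.com",       # an ordinary lowercase address
    "User@example.com",       # uppercase letters are not in the first class
    "first+tag@example.com",  # '+' is not in the first class either
    "a@b.museum",             # the last group allows at most 5 characters
    "..@-...",                # punctuation only, yet every class is satisfied
]

for address in samples:
    print(f"{address!r}: {looks_like_email(address)}")
```

The pattern rejects some plausible addresses and accepts a string made only of punctuation, which is the kind of weakness the description hints at.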
employ(|er|ee|ment|ing|able)
The regex 'employ(|er|ee|ment|ing|able)': A regular expression that captures multiple suffixes of a word.
[a-f0-9]{32}
The regex '[a-f0-9]{32}': A regular expression to match an MD5 hash.
[A-Fa-f0-9]{64}
The regex '[A-Fa-f0-9]{64}': A regular expression to match a SHA256 hash.
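As a quick sanity check of the two hash patterns, the sketch below generates real digests with Python's `hashlib` and tests them with Python's `re` module. The input bytes are arbitrary, and the constant names are invented for this example.

```python
import hashlib
import re

MD5_RE = re.compile(r"[a-f0-9]{32}")
SHA256_RE = re.compile(r"[A-Fa-f0-9]{64}")

data = b"regular expressions"  # arbitrary sample input
md5_hex = hashlib.md5(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()

# Both digests are lowercase hex of the expected length.
print(MD5_RE.fullmatch(md5_hex) is not None)
print(SHA256_RE.fullmatch(sha256_hex) is not None)

# The MD5 pattern has no uppercase range, so an uppercase digest is rejected.
print(MD5_RE.fullmatch("A" * 32) is not None)

# Neither pattern is anchored, so a search finds 32 hex characters
# inside a longer SHA256 digest as well.
print(MD5_RE.search(sha256_hex) is not None)
```

The last line is worth keeping in mind: without anchors or word boundaries, the MD5 pattern also matches inside any longer run of lowercase hex characters.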
<tag>[^<]*</tag>
The regex '<tag>[^<]*</tag>': A regular expression to match an XML tag and its contents (only works for cases where 'tag' doesn't contain any nested XML and has no support for attributes).
<[\s]*tag[^>]*>[^<]*<[\s]*/[\s]*tag[\s]*>
The regex '<[\s]*tag[^>]*>[^<]*<[\s]*/[\s]*tag[\s]*>': A slightly better version of the previous XML tag matching regex that adds a bit of tolerance for spaces.
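The difference between the two XML patterns is easier to see on a few inputs. This is a small sketch using Python's `re` module, with sample strings invented for illustration.

```python
import re

STRICT = re.compile(r"<tag>[^<]*</tag>")
TOLERANT = re.compile(r"<[\s]*tag[^>]*>[^<]*<[\s]*/[\s]*tag[\s]*>")

samples = [
    "<tag>hello</tag>",        # plain element
    "< tag >hello< / tag >",   # extra spaces inside the tags
    '<tag id="1">hello</tag>', # attribute on the opening tag
    "<tag><b>x</b></tag>",     # nested element
    "<tagged>hello</tag>",     # a different element name that starts with 'tag'
]

for text in samples:
    strict = STRICT.fullmatch(text) is not None
    tolerant = TOLERANT.fullmatch(text) is not None
    print(f"{text!r}: strict={strict} tolerant={tolerant}")
```

Note that the `[^>]*` in the second pattern is what lets attributes through, and it also lets `<tagged>` through, so the extra tolerance comes at the cost of some false positives.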
^(https?:\/\/)?([\da-z.\-]+)\.([a-z.]{2,6})([\/\w \.\-]*)*\/?$
The regex '^(https?:\/\/)?([\da-z.\-]+)\.([a-z.]{2,6})([\/\w \.\-]*)*\/?$': A regex to match common URLs that only use common ASCII characters and TLDs of at most 6 characters.
Character Classes
[]
The regex '[]': This case represents an 'empty' character class. Many regular expression engines don't allow this since it's almost certainly a mistake to write this. The only meaningful interpretation would be 'match a character that belongs to this empty list of characters'. In other words, it represents the impossible constraint of being a character that isn't any character.
[^]
The regex '[^]': Similar to the empty character class '[]', but a bit more useful. This case would be interpreted as 'any character NOT in the following empty list'. Therefore, this would mean 'match any possible character'.
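Engines disagree on these two cases. JavaScript, for example, accepts both with the meanings described above, while Python's `re` module treats a `]` that directly follows `[` or `[^` as a literal character and therefore reports both patterns as unterminated. The sketch below checks the Python behaviour; the helper name `compiles` is made up for this example.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

for pattern in ("[]", "[^]", "[]]", "[^]]"):
    print(f"{pattern!r}: compiles={compiles(pattern)}")

# In Python, '[]]' is a class containing only ']' and '[^]]' is its negation.
print(re.fullmatch(r"[]]", "]") is not None)
print(re.fullmatch(r"[^]]", "a") is not None)
```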
[.]
The regex '[.]': Test to make sure that period inside a character class is interpreted as a literal period character.
[^.]
The regex '[^.]': Another test to make sure that period inside a character class is interpreted as a literal period character.
[b-a]
The regex '[b-a]': Range endpoints out of order. This case should cause an error.
[a-\w]
The regex '[a-\w]': Range endpoints should not be 'sets' of characters. This case should cause an error.
[a-\d]
The regex '[a-\d]': Range endpoints should not be 'sets' of characters. This case should cause an error.
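For comparison, Python's `re` module also rejects all three of these range cases at compile time. The following sketch prints the error each one produces; `rejection_reason` is a hypothetical helper, and the exact message text may differ between Python versions.

```python
import re
from typing import Optional

def rejection_reason(pattern: str) -> Optional[str]:
    """Return the compile error message, or None if the pattern is accepted."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

# '[a-b]' is included as a valid control case.
for pattern in (r"[b-a]", r"[a-\w]", r"[a-\d]", r"[a-b]"):
    print(f"{pattern}: {rejection_reason(pattern)}")
```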
[^\Wf]
The regex '[^\Wf]': An example of a slightly complicated character class aggregation: \W is a negative version of \w, but when combined with 'f', this character class produces a mix of single and double negative inclusions of characters.
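One way to reason about this class is that it should accept exactly the word characters other than 'f'. The sketch below checks that equivalence over the ASCII range using Python's `re` module, which is only a stand-in for the visualizer's engine.

```python
import re

NEGATED = re.compile(r"[^\Wf]")
WORD = re.compile(r"\w")

def in_negated_class(ch: str) -> bool:
    """True if the single character ch is matched by [^\\Wf]."""
    return NEGATED.fullmatch(ch) is not None

def is_word_but_not_f(ch: str) -> bool:
    """The expected meaning: a \\w character that is not 'f'."""
    return WORD.fullmatch(ch) is not None and ch != "f"

# Compare both definitions for every ASCII character.
mismatches = [chr(c) for c in range(128)
              if in_negated_class(chr(c)) != is_word_but_not_f(chr(c))]
print("mismatches:", mismatches)

for ch in ("a", "f", "F", "_", "7", "!"):
    print(repr(ch), in_negated_class(ch))
```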
[^^]
The regex '[^^]': The '^' is treated literally when it's not the first character in the class.
[日本国]
The regex '[日本国]': Unicode is not supported by the current version of the regular expression visualizer tool, but throw some in anyway to see what happens for character classes.
\d\D\s\S\w\W
The regex '\d\D\s\S\w\W': Escaped characters that denote character classes.
[\dabc][\D123][\sabc][\S\t][\w\x00][\Wabc]
The regex '[\dabc][\D123][\sabc][\S\t][\w\x00][\Wabc]': Escaped characters inside character classes should add to the coverage of the character class.
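The idea that each escape widens the class it appears in can be checked by feeding the pattern strings that use one side of each class or the other. This sketch uses Python's `re` module and sample strings chosen for illustration.

```python
import re

PATTERN = re.compile(r"[\dabc][\D123][\sabc][\S\t][\w\x00][\Wabc]")

def accepts(text: str) -> bool:
    """True if the six-character text matches the pattern exactly."""
    return PATTERN.fullmatch(text) is not None

# One character per class, taken from the escape part of each class:
# digit, non-digit, whitespace, tab, NUL, non-word.
print(accepts("7x \t\x00!"))

# One character per class, taken from the literal part where there is one.
print(accepts("a1a\tab"))

# 'd' is neither a digit nor one of a, b, c, so the first class fails.
print(accepts("d1a\tab"))
```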
Alternation
()
The regex '()': An 'empty string' sub-expression.
(|)
The regex '(|)': A choice between 'empty string' and 'empty string' inside a sub-expression.
(||)
The regex '(||)': More code coverage for alternation grammar rules and control flow graph generator.
(|||)
The regex '(|||)': More code coverage for alternation grammar rules and control flow graph generator.
(a|)
The regex '(a|)': An 'a' or empty string inside a sub-expression.
(|b)
The regex '(|b)': Empty string or 'b' inside a sub-expression.
(a|b)
The regex '(a|b)': Standard alternation example.
|
The regex '|': Alternation with both options as empty string.
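As a reference point for these alternation cases, the sketch below lists which of three candidate strings each pattern accepts as a full match in Python's `re` module. The helper name `accepted` and the candidate strings are assumptions made for this example.

```python
import re

CANDIDATES = ("", "a", "b")

def accepted(pattern: str) -> list:
    """Return the candidate strings that the pattern matches in full."""
    return [s for s in CANDIDATES if re.fullmatch(pattern, s) is not None]

for pattern in ("()", "(|)", "(||)", "(|||)", "(a|)", "(|b)", "(a|b)", "|"):
    print(f"{pattern!r}: {accepted(pattern)}")
```

Every pattern with an empty branch accepts the empty string, and only '(a|b)' rejects it.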
Quantifiers
a*
The regex 'a*': Zero or more greedy.
a+
The regex 'a+': One or more greedy.
a?
The regex 'a?': Zero or one greedy.
a*?
The regex 'a*?': Zero or more lazy.
a+?
The regex 'a+?': One or more lazy.
a??
The regex 'a??': Zero or one lazy.
a{5}
The regex 'a{5}': Explicit quantifier fixed value greedy.
a{5}?
The regex 'a{5}?': Explicit quantifier fixed value lazy.
a{,5}
The regex 'a{,5}': Explicit quantifier max value greedy.
a{,5}?
The regex 'a{,5}?': Explicit quantifier max value lazy.
a{5,}
The regex 'a{5,}': Explicit quantifier min value greedy.
a{5,}?
The regex 'a{5,}?': Explicit quantifier min value lazy.
a{5,7}
The regex 'a{5,7}': Explicit quantifier range of values greedy.
a{5,7}?
The regex 'a{5,7}?': Explicit quantifier range of values lazy.
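The greedy and lazy variants above differ only in how much they consume when more than one length is possible. The sketch below measures the length matched at the start of a run of nine 'a' characters using Python's `re` module. One assumption to be aware of: Python reads 'a{,5}' as 'between 0 and 5', whereas some other engines treat a quantifier with a missing minimum as literal text.

```python
import re

TEXT = "a" * 9

PATTERNS = [
    "a*", "a+", "a?", "a*?", "a+?", "a??",
    "a{5}", "a{5}?", "a{,5}", "a{,5}?",
    "a{5,}", "a{5,}?", "a{5,7}", "a{5,7}?",
]

def matched_length(pattern: str, text: str = TEXT):
    """Length of the match at the start of text, or None if there is no match."""
    match = re.match(pattern, text)
    return len(match.group(0)) if match else None

for pattern in PATTERNS:
    print(f"{pattern:8} matched {matched_length(pattern)} of {len(TEXT)} characters")
```

Greedy forms take the largest count they are allowed, lazy forms take the smallest, and for a fixed count such as 'a{5}' the two coincide.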
abc+|def+
The regex 'abc+|def+': Test operator precedence.
ab+c|de+f
The regex 'ab+c|de+f': Test operator precedence.
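The precedence being tested is that a quantifier binds to the single item before it, concatenation binds tighter than alternation, and so 'abc+|def+' reads as 'ab' followed by one or more 'c', or 'de' followed by one or more 'f'. A short check with Python's `re` module, using sample strings invented here:

```python
import re

def full(pattern: str, text: str) -> bool:
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# The quantifier applies to the last character only, not to 'abc' as a whole.
print(full("abc+|def+", "abccc"))
print(full("abc+|def+", "defff"))
print(full("abc+|def+", "abcabc"))

# Alternation splits the whole expression, not just the neighbouring characters.
print(full("abc+|def+", "abcef"))

# Same idea with the quantifier in the middle of each branch.
print(full("ab+c|de+f", "abbbc"))
print(full("ab+c|de+f", "deeef"))
print(full("ab+c|de+f", "abcbc"))
```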
a*{4}
The regex 'a*{4}': Quantified quantifier (Should result in error).
(a*){4}
The regex '(a*){4}': Quantified sub-expression.
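Python's `re` module agrees with the expectation for this pair: the quantified quantifier is a compile error, while wrapping the inner quantifier in a sub-expression makes it valid. A minimal check, with `compiles` as a hypothetical helper name:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print("a*{4}   compiles:", compiles("a*{4}"))
print("(a*){4} compiles:", compiles("(a*){4}"))

# The grouped form matches any run of 'a', including the empty string,
# because each of the four iterations may match zero characters.
print(re.fullmatch(r"(a*){4}", "aaa") is not None)
print(re.fullmatch(r"(a*){4}", "") is not None)
```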
(){0,1}
The regex '(){0,1}': Repetition of nothing.
(){1,2}
The regex '(){1,2}': Repetition of nothing.
()+
The regex '()+': Repetition of nothing.
(a*?)*
The regex '(a*?)*': This is a very interesting case that demonstrates the need for the 'progress' node in the control flow graph. In a backtracking regex engine, when the inner lazy quantifier, *?, is applied, it will first try to match with 0 iterations of 'a'. Before it tries the other branch, it will try to repeat the outer quantifier, *. Since '*' is greedy, it will return to the inner quantifier, which will again try to match 0 characters first. The result is an infinite loop that never makes progress because it never consumes any characters. The solution is to add a 'progress' node that verifies that progress is always being made when a quantifier is applied to a possibly zero-length match.
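The progress check can be illustrated with a toy backtracking matcher. The sketch below is not the visualizer's control flow graph implementation. It is a generator-based stand-in, hard-wired to '(a*?)*', in which every name (`lazy_star_a`, `greedy_star`, `NoProgress`, `first_match_end`) is invented for this example. A depth limit stands in for the infinite loop so that the failure can be observed.

```python
class NoProgress(Exception):
    """Raised when the sketch keeps iterating without consuming input."""

def lazy_star_a(text, pos):
    """Yield every end position for a*? starting at pos, shortest first."""
    yield pos
    while pos < len(text) and text[pos] == "a":
        pos += 1
        yield pos

def greedy_star(inner, text, pos, check_progress, depth=0, max_depth=100):
    """Yield end positions for (inner)* at pos, preferring more iterations."""
    if depth > max_depth:
        # Demo-only guard: a real engine would simply never return here.
        raise NoProgress(f"still at position {pos} after {depth} iterations")
    for end in inner(text, pos):
        if check_progress and end == pos:
            continue  # this iteration consumed nothing, so do not repeat it
        yield from greedy_star(inner, text, end, check_progress,
                               depth + 1, max_depth)
    yield pos  # stop iterating and hand over to the rest of the pattern

def first_match_end(text, check_progress):
    """End position of the first match the backtracker would report."""
    return next(greedy_star(lazy_star_a, text, 0, check_progress))

print(first_match_end("aaa", check_progress=True))
try:
    first_match_end("aaa", check_progress=False)
except NoProgress as exc:
    print("without the check:", exc)
```

With the check enabled, the zero-length iteration is skipped and the lazy inner quantifier is forced to consume a character before the outer quantifier repeats. Without it, the matcher keeps re-entering the loop at position 0.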
^(a*)*$
The regex '^(a*)*$': Another case to demonstrate that the 'infinite loop' problem can occur with only greedy operators too.
^+
The regex '^+': Attempting to apply a quantifier to an anchor is a case worth considering. Meaningful interpretations of what to do include: reporting an error, treating the quantifier as a literal, or actually checking for the anchor the quantified number of times (which could be a source of bugs due to the zero-length nature of anchors).
(a+a+)+b
Applied to the following string:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
The regex '(a+a+)+b' matched against the string 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa': An example of catastrophic backtracking with greedy quantifiers.
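To get a feel for how quickly the work grows, the sketch below simulates a naive backtracker for '(a+a+)+b' starting at position 0 and counts how many times it completes one iteration of the group. This is a simplified model written for this example, not the visualizer's engine, and `count_group_attempts` is a made-up name. Running a real backtracking engine on the 40-character input is avoided on purpose.

```python
def count_group_attempts(text: str) -> int:
    """Count completed (a+a+) iterations tried while matching (a+a+)+b at 0."""
    attempts = 0

    def run_length(pos: int) -> int:
        """Number of consecutive 'a' characters starting at pos."""
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        return end - pos

    def group_then_rest(pos: int) -> bool:
        nonlocal attempts
        available = run_length(pos)
        for first in range(available - 1, 0, -1):           # greedy first a+
            for second in range(available - first, 0, -1):  # greedy second a+
                attempts += 1
                end = pos + first + second
                if group_then_rest(end):       # outer +: try another iteration
                    return True
                if text[end:end + 1] == "b":   # otherwise try to match 'b'
                    return True
        return False

    group_then_rest(0)
    return attempts

for n in (4, 8, 12, 16):
    print(n, count_group_attempts("a" * n))
```

In this model each extra 'a' roughly doubles the count, so an input of 40 characters with no 'b' would need on the order of 2^39 attempts before the match could be declared a failure.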
(a+?a+?)+?b
Applied to the following string:
aaaaaaaaaaaaaaaa
The regex '(a+?a+?)+?b' matched against the string 'aaaaaaaaaaaaaaaa': The same catastrophic backtracking with 'lazy' quantifiers.
[bc]*(cd)+
Applied to the following string:
cbcdcd
The regex '[bc]*(cd)+' matched against the string 'cbcdcd': This case was added to verify a bug fix where 'progress node' values were not properly popped from the stack after a backtracking event, causing the match to fail when it should not have.
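For a reference result to compare against, here is what Python's `re` module reports for this case. The greedy '[bc]*' first consumes 'cbc', has to give one character back so that '(cd)+' can start, and the match then covers the whole string.

```python
import re

match = re.fullmatch(r"[bc]*(cd)+", "cbcdcd")

print(match is not None)
print(match.group(0))   # the full match
print(match.span(1))    # position of the last (cd) iteration
```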
Individual Characters
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
The regex '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz': All alpha-numeric ASCII characters.
!"#%&',-/0123456789:;<=>@ABCDEFGHIJKLMNOPQRSTUVWXYZ_`abcdefghijklmnopqrstuvwxyz~
The regex ' !"#%&',-/0123456789:;<=>@ABCDEFGHIJKLMNOPQRSTUVWXYZ_`abcdefghijklmnopqrstuvwxyz~': Other printable non-special characters.
\$\.\(\)\*\+\?\[\\]\^\{\|\}
The regex '\$\.\(\)\*\+\?\[\\]\^\{\|\}': Escaped special characters.
\0\t\n\r\v\f\\
The regex '\0\t\n\r\v\f\\': Escaped non-special ASCII characters.
\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2A\x2B\x2C\x2D\x2E\x2F
The regex '\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2A\x2B\x2C\x2D\x2E\x2F': Hex escaped characters from 0x00 to 0x2F
\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3A\x3B\x3C\x3D\x3E\x3F\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4A\x4B\x4C\x4D\x4E\x4F\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5A\x5B\x5C\x5D\x5E\x5F
The regex '\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3A\x3B\x3C\x3D\x3E\x3F\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4A\x4B\x4C\x4D\x4E\x4F\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5A\x5B\x5C\x5D\x5E\x5F': Hex escaped characters from 0x30 to 0x5F
\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6A\x6B\x6C\x6D\x6E\x6F\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7A\x7B\x7C\x7D\x7E\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F
The regex '\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6A\x6B\x6C\x6D\x6E\x6F\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7A\x7B\x7C\x7D\x7E\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F': Hex escaped characters from 0x60 to 0x8F
\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF
The regex '\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF': Hex escaped characters from 0x90 to 0xBF
\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF
The regex '\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF': Hex escaped characters from 0xC0 to 0xEF
\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF
The regex '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF': Hex escaped characters from 0xF0 to 0xFF
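Long escape lists like the ones above are tedious to write by hand, so it can help to generate them. The sketch below builds the pattern for any range of values and checks it against the corresponding literal characters with Python's `re` module. The helper names are invented, and note one assumption: with Python `str` patterns, values 0x80 to 0xFF denote the code points U+0080 to U+00FF rather than raw bytes.

```python
import re

def hex_escape_pattern(start: int, stop: int) -> str:
    r"""Build a pattern such as \x30\x31\x32 for the values start..stop inclusive."""
    return "".join(f"\\x{code:02X}" for code in range(start, stop + 1))

def literal_text(start: int, stop: int) -> str:
    """The characters that the generated pattern is expected to match."""
    return "".join(chr(code) for code in range(start, stop + 1))

# The same ranges as the test cases above.
RANGES = [(0x00, 0x2F), (0x30, 0x5F), (0x60, 0x8F),
          (0x90, 0xBF), (0xC0, 0xEF), (0xF0, 0xFF)]

for start, stop in RANGES:
    pattern = hex_escape_pattern(start, stop)
    ok = re.fullmatch(pattern, literal_text(start, stop)) is not None
    print(f"0x{start:02X}-0x{stop:02X}: {ok}")
```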
日本国
The regex '日本国': Unicode is not supported by the current version of the regular expression visualizer tool, but throw some in anyway to see what happens.
HTML/Javascript Rendering
<script>alert('XSS')</script>
Applied to the following string:
<script>alert('XSS')</script>
The regex '<script>alert('XSS')</script>' matched against the string '<script>alert('XSS')</script>': Trivial test for XSS.
\";alert('XSS');//
Applied to the following string:
\";alert('XSS');//
The regex '\";alert('XSS');//' matched against the string '\";alert('XSS');//': Trivial test for XSS.
<svg/onload=alert('XSS')>
Applied to the following string:
<svg/onload=alert('XSS')>
The regex '<svg/onload=alert('XSS')>' matched against the string '<svg/onload=alert('XSS')>': Trivial test for XSS.
"><img src="x:x" onerror="alert(XSS)">
Applied to the following string:
"><img src="x:x" onerror="alert(XSS)">
The regex '"><img src="x:x" onerror="alert(XSS)">' matched against the string '"><img src="x:x" onerror="alert(XSS)">': Trivial test for XSS.
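These cases exist because the tool has to display user-supplied regexes and subject strings inside a web page. How the visualizer itself handles this is not described here, so the following is only a generic sketch of the usual defence, escaping before insertion into HTML, using Python's standard `html.escape`. The `PAYLOADS` list simply repeats the four test inputs above.

```python
import html

PAYLOADS = [
    "<script>alert('XSS')</script>",
    r'''\";alert('XSS');//''',
    "<svg/onload=alert('XSS')>",
    '"><img src="x:x" onerror="alert(XSS)">',
]

def render_safely(text: str) -> str:
    """Escape &, <, >, and both quote characters for use in HTML (hypothetical helper)."""
    return html.escape(text, quote=True)

for payload in PAYLOADS:
    print(render_safely(payload))
```

After escaping, none of the payloads contain a raw angle bracket or quote character, so a browser would show them as text instead of interpreting them as markup.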