Interesting Regular Expression Test Cases

2020-07-09 - By Robert Elder

This article is part of Series On Regular Expressions.

The purpose of this article is to document a list of test cases for the Regular Expression Visualizer tool, which uses a custom-written regular expression engine. For each test case below, the relevant regular expression is listed along with an explanation of why it is useful. A direct link is also provided for each test case, which allows you to see the result of inputting that regular expression into the visualizer tool. These test cases serve as a form of documentation for which cases are currently supported by the tool. Some of these test cases may also be interesting to anyone else writing their own regular expression engine.

Everyday Use Cases

^([a-z0-9_\.\-]+)@([\da-z\.\-]+)\.([a-z\.]{2,5})$

The regex '^([a-z0-9_\.\-]+)@([\da-z\.\-]+)\.([a-z\.]{2,5})$': An example of a commonly cited regular expression that claims to be able to match email addresses (but doesn't do a very good job: it rejects many valid addresses and accepts some invalid ones).
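To make the shortcomings concrete, here is a small sketch that runs the pattern through Python's 're' module rather than the visualizer's own engine, so treat it as an illustration only. The sample addresses are made up for this example.

```python
import re

# The pattern exactly as listed above.
EMAIL = re.compile(r'^([a-z0-9_\.\-]+)@([\da-z\.\-]+)\.([a-z\.]{2,5})$')

samples = [
    "user@example.com",      # accepted
    "user+tag@example.com",  # rejected: '+' is not in the first character class
    "User@Example.com",      # rejected: uppercase is not covered without re.IGNORECASE
    "user@example.museum",   # rejected: the last group allows at most 5 characters
    "..@-.aa",               # accepted, although it is not a usable address
]

for address in samples:
    print(f"{address!r}: {bool(EMAIL.match(address))}")
```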

employ(|er|ee|ment|ing|able)

The regex 'employ(|er|ee|ment|ing|able)': A regular expression that captures multiple suffixes of a word.
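One detail worth checking with this pattern is that the empty alternative comes first. The sketch below uses Python's 're' module, a backtracking engine that tries alternatives from left to right. Other engines, including the visualizer's, may behave differently.

```python
import re

SUFFIXES = re.compile(r'employ(|er|ee|ment|ing|able)')

# An unanchored match stops at the first alternative that succeeds,
# which here is the empty string.
prefix_match = SUFFIXES.match("employee")
print(repr(prefix_match.group(0)), repr(prefix_match.group(1)))  # 'employ' ''

# Requiring the whole string to match forces the engine to backtrack
# into the later alternatives.
full_match = SUFFIXES.fullmatch("employee")
print(repr(full_match.group(0)), repr(full_match.group(1)))      # 'employee' 'ee'
```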

[a-f0-9]{32}

The regex '[a-f0-9]{32}': A regular expression to match an MD5 hash.

[A-Fa-f0-9]{64}

The regex '[A-Fa-f0-9]{64}': A regular expression to match a SHA-256 hash.

<tag>[^<]*</tag>

The regex '<tag>[^<]*</tag>': A regular expression to match an XML tag and its contents (only works for cases where 'tag' doesn't contain any nested XML and has no support for attributes).

<[\s]*tag[^>]*>[^<]*<[\s]*/[\s]*tag[\s]*>

The regex '<[\s]*tag[^>]*>[^<]*<[\s]*/[\s]*tag[\s]*>': A slightly better version of the previous XML tag matching regex that adds a bit of tolerance for spaces.

^(https?:\/\/)?([\da-z.\-]+)\.([a-z.]{2,6})([\/\w \.\-]*)*\/?$

The regex '^(https?:\/\/)?([\da-z.\-]+)\.([a-z.]{2,6})([\/\w \.\-]*)*\/?$': A regex to match common URLs that only use common ASCII characters and TLDs of 2 to 6 characters.

Character Classes

[]

The regex '[]': This case represents an 'empty' character class. Many regular expression engines don't allow this since it's almost certainly a mistake to write this. The only meaningful interpretation would be 'match a character that belongs to this empty list of characters'. In other words, it represents the impossible constraint of being a character that isn't any character.

[^]

The regex '[^]': Similar to the empty character class '[]', but a bit more useful. This case would be interpreted as 'any character NOT in the following empty list'. Therefore, this would mean 'match any possible character'.
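As a point of comparison, Python's 're' module is one of the engines that rejects both forms, because it treats a ']' directly after '[' or '[^' as a literal member of the class. Some other engines (JavaScript, for example) accept both. The helper name 'compiles' below is just for this sketch.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles("[]"))    # False: reported as an unterminated character set
print(compiles("[^]"))   # False: same reason

# With one more ']', the first ']' becomes a literal member of the class.
print(re.fullmatch("[]]", "]") is not None)   # True
print(re.fullmatch("[^]]", "x") is not None)  # True
```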

[.]

The regex '[.]': Test to make sure that period inside a character class is interpreted as a literal period character.

[^.]

The regex '[^.]': Another test to make sure that period inside a character class is interpreted as a literal period character.

[b-a]

The regex '[b-a]': Range endpoints out of order. This case should cause an error.

[a-\w]

The regex '[a-\w]': Range endpoints should not be 'sets' of characters. This case should cause an error.

[a-\d]

The regex '[a-\d]': Range endpoints should not be 'sets' of characters. This case should cause an error.
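For reference, the three invalid ranges above are also rejected by Python's 're' module. The sketch below simply records the error message for each pattern. The helper name 'compile_error' is an assumption of this example, and the exact message text may differ between Python versions.

```python
import re

def compile_error(pattern):
    """Return the error message if the pattern is rejected, else None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None

for pattern in (r'[b-a]', r'[a-\w]', r'[a-\d]', r'[a-b]'):
    print(pattern, '->', compile_error(pattern))
```

Only the last pattern, which has its endpoints in order, compiles without an error.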

[^\Wf]

The regex '[^\Wf]': An example of a slightly complicated character class aggregation: \W is the negated version of \w, but when combined with 'f' inside a negated class, this character class produces a mix of single and double negative inclusions of characters. The net effect is 'any word character except f'.
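The double negative can be checked by brute force. This sketch uses Python's 're' module with the re.ASCII flag (so that \w means [A-Za-z0-9_]) and compares the set of ASCII characters accepted by '[^\Wf]' against the word characters with 'f' removed.

```python
import re

negated_class = re.compile(r'[^\Wf]', re.ASCII)
word_char = re.compile(r'\w', re.ASCII)

ascii_chars = [chr(code) for code in range(128)]

# Characters accepted by the class under test.
matched = {c for c in ascii_chars if negated_class.fullmatch(c)}
# Characters we expect: every word character except 'f'.
expected = {c for c in ascii_chars if word_char.fullmatch(c)} - {"f"}

print(matched == expected)  # True
print(len(matched))         # 62
```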

[^^]

The regex '[^^]': The '^' is treated literally when it's not the first character in the class.

[日本国]

The regex '[日本国]': Unicode is not supported by the current version of the regular expression visualizer tool, but throw some in anyway to see what happens for character classes.

\d\D\s\S\w\W

The regex '\d\D\s\S\w\W': Escaped characters that denote character classes.

[\dabc][\D123][\sabc][\S\t][\w\x00][\Wabc]

The regex '[\dabc][\D123][\sabc][\S\t][\w\x00][\Wabc]': Escaped characters inside character classes should add to the coverage of the character class.

Alternation

()

The regex '()': An 'empty string' sub-expression.

(|)

The regex '(|)': A choice between 'empty string' and 'empty string' inside a sub-expression.

(||)

The regex '(||)': More code coverage for alternation grammar rules and control flow graph generator.

(|||)

The regex '(|||)': More code coverage for alternation grammar rules and control flow graph generator.

(a|)

The regex '(a|)': An 'a' or empty string inside a sub-expression.

(|b)

The regex '(|b)': Empty string or 'b' inside a sub-expression.

(a|b)

The regex '(a|b)': Standard alternation example.

|

The regex '|': Alternation with both options as empty string.

Quantifiers

a*

The regex 'a*': Zero or more greedy.

a+

The regex 'a+': One or more greedy.

a?

The regex 'a?': Zero or one greedy.

a*?

The regex 'a*?': Zero or more lazy.

a+?

The regex 'a+?': One or more lazy.

a??

The regex 'a??': Zero or one lazy.

a{5}

The regex 'a{5}': Explicit quantifier fixed value greedy.

a{5}?

The regex 'a{5}?': Explicit quantifier fixed value lazy.

a{,5}

The regex 'a{,5}': Explicit quantifier max value greedy.

a{,5}?

The regex 'a{,5}?': Explicit quantifier max value lazy.

a{5,}

The regex 'a{5,}': Explicit quantifier min value greedy.

a{5,}?

The regex 'a{5,}?': Explicit quantifier min value lazy.

a{5,7}

The regex 'a{5,7}': Explicit quantifier range of values greedy.

a{5,7}?

The regex 'a{5,7}?': Explicit quantifier range of values lazy.
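The difference between the greedy and lazy forms above is easiest to see by matching each one against the same run of characters. This sketch uses Python's 're' module on a string of eight 'a' characters and records how many characters each pattern consumes. Note that Python reads '{,5}' as 'zero to five'. Some other engines treat it as literal text instead.

```python
import re

text = "a" * 8

patterns = [
    r'a*', r'a*?',          # 8, 0
    r'a+', r'a+?',          # 8, 1
    r'a?', r'a??',          # 1, 0
    r'a{5}', r'a{5}?',      # 5, 5 (a fixed count leaves nothing to choose)
    r'a{,5}', r'a{,5}?',    # 5, 0
    r'a{5,}', r'a{5,}?',    # 8, 5
    r'a{5,7}', r'a{5,7}?',  # 7, 5
]

# Length of the match found at the start of the text for each pattern.
lengths = {p: len(re.match(p, text).group(0)) for p in patterns}

for pattern, length in lengths.items():
    print(f"{pattern:10} consumed {length} character(s)")
```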

abc+|def+

The regex 'abc+|def+': Test operator precedence.

ab+c|de+f

The regex 'ab+c|de+f': Test operator precedence.
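What these two precedence tests check is that a quantifier binds to the single item before it, and that alternation binds more loosely than concatenation. A quick sketch with Python's 're' module (test strings chosen for this example) shows which readings are ruled out.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# 'abc+|def+' should behave like '(abc+)|(def+)'.
print(full(r'abc+|def+', 'abccc'))   # True:  '+' repeats only the 'c'
print(full(r'abc+|def+', 'abcabc'))  # False: '+' does not repeat 'abc'
print(full(r'abc+|def+', 'abdef'))   # False: '|' does not split only 'c+' and 'd'

# 'ab+c|de+f' should behave like '(ab+c)|(de+f)'.
print(full(r'ab+c|de+f', 'abbbc'))   # True
print(full(r'ab+c|de+f', 'deeef'))   # True
print(full(r'ab+c|de+f', 'abbdeef')) # False: it is not 'ab+(c|d)e+f'
```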

a*{4}

The regex 'a*{4}': Quantified quantifier (Should result in error).

(a*){4}

The regex '(a*){4}': Quantified sub-expression.

(){0,1}

The regex '(){0,1}': Repetition of nothing.

(){1,2}

The regex '(){1,2}': Repetition of nothing.

()+

The regex '()+': Repetition of nothing.

(a*?)*

The regex '(a*?)*': This is a very interesting case that demonstrates the need for the 'progress' node in the control flow graph. In a backtracking regex engine, when the inner lazy quantifier, *?, is applied, it will first try the match with 0 iterations of 'a', but before it tries the other branch it will try to repeat the outer quantifier, *. Since '*' is greedy, it will return to try the inner quantifier again, which will again try to match 0 characters first. The result is an infinite loop that never makes progress because it never consumes any characters. The solution is to add a 'progress' node that verifies that progress is always being made when a quantifier is applied to a possibly zero-length match.

^(a*)*$

The regex '^(a*)*$': Another case to demonstrate that the 'infinite loop' problem can occur with only greedy operators too.
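The idea behind the progress check can be sketched with a toy backtracking matcher. This is not the visualizer's actual implementation (which works on a control flow graph). It is a minimal continuation-passing sketch, and the names 'lit', 'star', 'build' and 'check_progress' are assumptions made up for this example. With the check disabled, '(a*?)*' recurses forever, which Python reports as a RecursionError.

```python
def lit(ch):
    """Match one literal character, then run the continuation k."""
    def match(s, i, k):
        return i < len(s) and s[i] == ch and k(i + 1)
    return match


def star(inner, lazy=False, check_progress=True):
    """Zero-or-more repetition of `inner`, greedy unless lazy=True."""
    def match(s, i, k):
        def again(j):
            # The 'progress' check: refuse another iteration if the last
            # one ended where it started (a zero-length match).
            if check_progress and j == i:
                return False
            return match(s, j, k)
        if lazy:
            return k(i) or inner(s, i, again)
        return inner(s, i, again) or k(i)
    return match


def fullmatch(node, s):
    """True if the node matches the whole string s."""
    return bool(node(s, 0, lambda j: j == len(s)))


def build(check_progress):
    # (a*?)* : lazy inner star wrapped in a greedy outer star
    inner = star(lit("a"), lazy=True, check_progress=check_progress)
    return star(inner, check_progress=check_progress)


print(fullmatch(build(True), "aa"))    # True
print(fullmatch(build(True), "aab"))   # False
try:
    fullmatch(build(False), "aa")
except RecursionError:
    print("without the progress check the matcher never terminates")
```

The check only compares the position before and after one iteration, which is enough here because an iteration that consumes nothing can never change the outcome of the repetition.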

^+

The regex '^+': Attempting to apply a quantifier to an anchor is a case worth considering. Meaningful interpretations include: reporting an error, treating the quantifier as a literal, or actually checking for the anchor the quantified number of times (which could be a source of bugs due to the zero-length nature of anchors).
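As one data point, Python's 're' module takes the 'error' route both for a quantified anchor and for the quantified quantifier 'a*{4}' listed earlier, while it accepts the quantified (and possibly empty) sub-expressions. The helper name 'is_rejected' is an assumption of this sketch, and other engines make different choices.

```python
import re

def is_rejected(pattern):
    """Return True if Python's re module refuses to compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return True
    return False

for pattern in (r'^+', r'a*{4}', r'(a*){4}', r'(){0,1}', r'(){1,2}', r'()+'):
    verdict = "error" if is_rejected(pattern) else "accepted"
    print(f"{pattern:10} {verdict}")
```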

(a+a+)+b

Applied to the following string:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

The regex '(a+a+)+b' matched against the string 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa': An example of catastrophic backtracking with greedy quantifiers.

(a+?a+?)+?b

Applied to the following string:

aaaaaaaaaaaaaaaa

The regex '(a+?a+?)+?b' matched against the string 'aaaaaaaaaaaaaaaa': The same catastrophic backtracking with 'lazy' quantifiers.
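A rough way to see why these two cases blow up is to count how many different ways the group '(a+a+)+' (or its lazy variant) can consume a run of n 'a' characters. Each iteration of the group that consumes m characters can split them between its two 'a+' parts in m - 1 ways. Because the input contains no 'b', a backtracking engine without memoization has to reject every such decomposition before it can report failure. The count below is only a proxy for the work involved. The exact number of steps depends on the engine, and the function name 'ways' is made up for this sketch.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ways(n):
    """Number of ways one or more iterations of (a+a+) can consume n 'a's."""
    if n == 0:
        return 1  # nothing left to consume: one way to stop
    # The next iteration consumes m >= 2 characters, split in (m - 1) ways.
    return sum((m - 1) * ways(n - m) for m in range(2, n + 1))

for n in (2, 4, 8, 16, 40):
    print(n, ways(n))
```

The count doubles with every extra character (it equals 2 to the power n - 2 for n >= 2), so the 40-character string above allows 2**38 decompositions.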

[bc]*(cd)+

Applied to the following string:

cbcdcd

The regex '[bc]*(cd)+' matched against the string 'cbcdcd': This case was added to verify a bug fix where 'progress node' values were not properly popped from the stack after a backtracking event, causing the match to fail when it should not have.
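The expected result for this regression test can be cross-checked against an independent engine. In this sketch with Python's 're' module, the greedy '[bc]*' first takes 'cbc', fails to find 'cd', gives back one character, and then the group matches 'cd' twice.

```python
import re

match = re.fullmatch(r'[bc]*(cd)+', 'cbcdcd')

print(match is not None)  # True: the whole string matches
print(match.group(1))     # cd (the last iteration of the group)
print(match.span(1))      # (4, 6)
```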

Individual Characters

0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

The regex '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz': All alpha-numeric ASCII characters.

 !"#%&',-/0123456789:;<=>@ABCDEFGHIJKLMNOPQRSTUVWXYZ_`abcdefghijklmnopqrstuvwxyz~

The regex ' !"#%&',-/0123456789:;<=>@ABCDEFGHIJKLMNOPQRSTUVWXYZ_`abcdefghijklmnopqrstuvwxyz~': Other printable non-special characters.

\$\.\(\)\*\+\?\[\\]\^\{\|\}

The regex '\$\.\(\)\*\+\?\[\\]\^\{\|\}': Escaped special characters.

\0\t\n\r\v\f\\

The regex '\0\t\n\r\v\f\\': Escaped non-special ASCII characters.

\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2A\x2B\x2C\x2D\x2E\x2F

The regex '\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F\x20\x21\x22\x23\x24\x25\x26\x27\x28\x29\x2A\x2B\x2C\x2D\x2E\x2F': Hex escaped characters from 0x00 to 0x2F

\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3A\x3B\x3C\x3D\x3E\x3F\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4A\x4B\x4C\x4D\x4E\x4F\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5A\x5B\x5C\x5D\x5E\x5F

The regex '\x30\x31\x32\x33\x34\x35\x36\x37\x38\x39\x3A\x3B\x3C\x3D\x3E\x3F\x40\x41\x42\x43\x44\x45\x46\x47\x48\x49\x4A\x4B\x4C\x4D\x4E\x4F\x50\x51\x52\x53\x54\x55\x56\x57\x58\x59\x5A\x5B\x5C\x5D\x5E\x5F': Hex escaped characters from 0x30 to 0x5F

\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6A\x6B\x6C\x6D\x6E\x6F\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7A\x7B\x7C\x7D\x7E\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F

The regex '\x60\x61\x62\x63\x64\x65\x66\x67\x68\x69\x6A\x6B\x6C\x6D\x6E\x6F\x70\x71\x72\x73\x74\x75\x76\x77\x78\x79\x7A\x7B\x7C\x7D\x7E\x7F\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F': Hex escaped characters from 0x60 to 0x8F

\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF

The regex '\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9A\x9B\x9C\x9D\x9E\x9F\xA0\xA1\xA2\xA3\xA4\xA5\xA6\xA7\xA8\xA9\xAA\xAB\xAC\xAD\xAE\xAF\xB0\xB1\xB2\xB3\xB4\xB5\xB6\xB7\xB8\xB9\xBA\xBB\xBC\xBD\xBE\xBF': Hex escaped characters from 0x90 to 0xBF

\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF

The regex '\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC7\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1\xD2\xD3\xD4\xD5\xD6\xD7\xD8\xD9\xDA\xDB\xDC\xDD\xDE\xDF\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE7\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF': Hex escaped characters from 0xC0 to 0xEF

\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF

The regex '\xF0\xF1\xF2\xF3\xF4\xF5\xF6\xF7\xF8\xF9\xFA\xFB\xFC\xFD\xFE\xFF': Hex escaped characters from 0xF0 to 0xFF
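Test patterns like the hex-escape blocks above are tedious to type by hand, so one option is to generate them. The sketch below builds each block and checks it with Python's 're' module, where '\xNN' in a text pattern means the character with code point 0xNN. That is an assumption about the interpretation, since a byte-oriented engine would treat values above 0x7F differently. The helper names are made up for this example.

```python
import re

def hex_pattern(start, stop):
    """Build a pattern such as \\x00\\x01...\\x2F for an inclusive range."""
    return ''.join(f'\\x{code:02X}' for code in range(start, stop + 1))

def chars(start, stop):
    """The literal characters that the pattern is expected to match."""
    return ''.join(chr(code) for code in range(start, stop + 1))

blocks = [(0x00, 0x2F), (0x30, 0x5F), (0x60, 0x8F),
          (0x90, 0xBF), (0xC0, 0xEF), (0xF0, 0xFF)]

results = {}
for start, stop in blocks:
    matched = re.fullmatch(hex_pattern(start, stop), chars(start, stop))
    results[(start, stop)] = matched is not None
    print(f"0x{start:02X}-0x{stop:02X}: {results[(start, stop)]}")
```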

日本国

The regex '日本国': Unicode is not supported by the current version of the regular expression visualizer tool, but throw some in anyway to see what happens.

HTML/Javascript Rendering

<script>alert('XSS')</script>

Applied to the following string:

<script>alert('XSS')</script>

The regex '<script>alert('XSS')</script>' matched against the string '<script>alert('XSS')</script>': Trivial test for XSS.

\";alert('XSS');//

Applied to the following string:

\";alert('XSS');//

The regex '\";alert('XSS');//' matched against the string '\";alert('XSS');//': Trivial test for XSS.

<svg/onload=alert('XSS')>

Applied to the following string:

<svg/onload=alert('XSS')>

The regex '<svg/onload=alert('XSS')>' matched against the string '<svg/onload=alert('XSS')>': Trivial test for XSS.

"><img src="x:x" onerror="alert(XSS)">

Applied to the following string:

"><img src="x:x" onerror="alert(XSS)">

The regex '"><img src="x:x" onerror="alert(XSS)">' matched against the string '"><img src="x:x" onerror="alert(XSS)">': Trivial test for XSS.
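These payloads only matter if the tool echoes the regex or the test string back into a page without escaping it. As a minimal sketch of the usual defence for HTML contexts, the example below uses html.escape from Python's standard library. The visualizer's own rendering code is not shown in this article, so this is an illustration and not a description of how the tool works. Note also that HTML escaping alone does not make a value safe inside a JavaScript string literal, which is the context the second payload targets.

```python
import html

payloads = [
    "<script>alert('XSS')</script>",
    "\\\";alert('XSS');//",
    "<svg/onload=alert('XSS')>",
    "\"><img src=\"x:x\" onerror=\"alert(XSS)\">",
]

# html.escape replaces &, <, > and (by default) both quote characters.
escaped = [html.escape(payload) for payload in payloads]

for text in escaped:
    print(text)
```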
