Undefined Behaviour With Grep -E
2020-10-01 - By Robert Elder
This article will attempt to convince the reader that it is almost never a good idea to use the '-E' flag with grep (for Extended Regular Expressions), and that you should instead use the '-P' flag (when possible) for Perl-compatible regular expressions. Unfortunately, the -P flag is not supported by all implementations of grep, so this may not always be possible. When it isn't, it becomes even more important to be mindful of the behaviour described in this article.
Testing Quantifiers
I came to this conclusion while doing some research for my Guide To Regular Expressions. Specifically, my goal was to verify my understanding of regular expression quantifiers by testing out a few examples with grep. As a starting point, the following command will instruct grep to print out any lines that contain at least one sequence of 3 consecutive 'a' characters:
echo "aaaaaaaaaaaaaaaaa" | grep -P "a{3}"
which outputs the following:
aaaaaaaaaaaaaaaaa
To make the exact matches clearer, you can include the '-o' flag with grep, which will print only the matched text, with each match on a separate line:
echo "aaaaaaaaaaaaaaaaa" | grep -Po "a{3}"
which outputs the following:
aaa
aaa
aaa
aaa
aaa
The 'Lazy' Case
In my research, the behaviour I was testing was to see how fixed range quantifiers behave when you make them lazy or possessive (which is pointless from a practical perspective, but that's an entirely different conversation). Here's a lazy version of the above:
echo "aaaaaaaaaaaaaaaaa" | grep -Po "a{3}?"
which outputs the following:
aaa
aaa
aaa
aaa
aaa
and here is the same thing with the '-E' flag:
echo "aaaaaaaaaaaaaaaaa" | grep -Eo "a{3}?"
which outputs the following:
aaa
aaa
aaa
aaa
aaa
The 'Bug' Case
Now, the case that surprised me was when you compare the behaviour of making the '{3}' possessive instead of lazy:
echo "aaaaaaaaaaaaaaaaa" | grep -Po "a{3}+"
which outputs the following (as expected):
aaa
aaa
aaa
aaa
aaa
and here is the same thing with the '-E' flag:
echo "aaaaaaaaaaaaaaaaa" | grep -Eo "a{3}+"
which outputs the following (not expected):
aaaaaaaaaaaaaaaaa
The above result is not expected, since the regex that was specified should match exactly 3 'a' characters at a time, with each match printed on a separate line. In other words, most people would probably expect the output to look just like the output from using the '-P' flag.
ERE (Extended Regular Expressions) != Perl Compatible Regular Expressions
Naturally, I had to dig into what was going on here, so I reviewed the source code for grep to see if I had possibly found a bug. After reading through the source code and its comments such as the following:
/* In BRE consecutive duplications are not allowed. */
it became clear that what I really needed to do was consult the formal specifications for 'Basic Regular Expressions' and 'Extended Regular Expressions'. Therefore, I reviewed the document The Open Group Base Specifications Issue 7, 2018 edition (the POSIX standard), which formally specifies both BRE and ERE.
The following statement in section 9.4.6 EREs Matching Multiple Characters seems to suggest that the behaviour I saw was not actually a bug, but rather 'undefined behaviour':
The behavior of multiple adjacent duplication symbols ( '+', '*', '?', and intervals) produces undefined results.
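One way to stay out of the undefined case is to add parentheses so that no two duplication symbols sit next to each other. The following is only a sketch: it assumes a grep that supports the common (but non-POSIX) '-o' flag, and the 8-character input and variable names are made up for illustration.

```shell
#!/bin/sh
# Build an input of exactly 8 'a' characters (3 + 3 + 2).
input=$(printf '%s' aaa aaa aa)

# Explicit group: "one or more repetitions of exactly three a's".
# The '+' now follows ')' rather than '}', so no two duplication
# symbols are adjacent.
grouped=$(printf '%s\n' "$input" | grep -Eo '(a{3})+')

# Plain interval: each match is exactly three a's.
plain=$(printf '%s\n' "$input" | grep -Eo 'a{3}')

printf 'grouped:\n%s\n' "$grouped"
printf 'plain:\n%s\n' "$plain"
```

With this input, the grouped pattern prints a single six-character match and the plain interval prints two three-character matches. Both behaviours shown earlier are still available with '-E', but the pattern now states which one is wanted.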
The other thing I realized from reviewing this document was that both BRE and ERE support fewer features than I thought they did. The formal specification for 'Basic Regular Expressions' does not even include the '+' or '?' quantifiers or '|' for alternation! 'Extended Regular Expressions' does support '+', '?' and '|', but it has no concept of 'greedy', 'lazy' or 'possessive' quantifiers. For this reason, I have decided to avoid both whenever I can in the future and always use '-P' from now on.
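The difference between the two POSIX flavours can be seen with a small sketch. The two-line input is invented for this example, and the '-x' flag (match the whole line) is used only to make the results unambiguous. In BRE, '+' is an ordinary character, so the portable way to say "one or more" is the interval '\{1,\}'.

```shell
#!/bin/sh
# Two hypothetical input lines: a literal "a+" and a run of a's.
lines=$(printf '%s\n' 'a+' 'aaa')

# BRE (no flag): '+' is an ordinary character, so only the
# line that is literally "a+" matches.
bre_plus=$(printf '%s\n' "$lines" | grep -x 'a+')

# BRE interval: \{1,\} means "one or more", so only "aaa" matches.
bre_interval=$(printf '%s\n' "$lines" | grep -x 'a\{1,\}')

# ERE (-E): '+' is a quantifier, so only "aaa" matches.
ere_plus=$(printf '%s\n' "$lines" | grep -Ex 'a+')

printf 'BRE a+      : %s\n' "$bre_plus"
printf 'BRE a\\{1,\\} : %s\n' "$bre_interval"
printf 'ERE a+      : %s\n' "$ere_plus"
```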
If you check the man page for grep, you'll see that it says 'This is experimental and grep -P may warn of unimplemented features.' under the section for '-P'. That's not ideal, but I think explicit warnings are better than implicit undefined behaviour.
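Since '-P' is not available everywhere, a script may want to probe for it before relying on it. Here is a minimal sketch, assuming that a grep without Perl-compatible regex support exits with a non-zero status when given '-P'. The variable name has_pcre is made up for this example.

```shell
#!/bin/sh
# Probe: try a trivial match with -P and discard all output.
# A grep that rejects -P is assumed to exit with a non-zero status.
if printf 'a\n' | grep -P 'a' >/dev/null 2>&1; then
    has_pcre=yes
else
    has_pcre=no
fi

printf 'grep -P available: %s\n' "$has_pcre"
```

A script could then fall back to a carefully parenthesized '-E' pattern when has_pcre is 'no'.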
Conclusion
Don't use the '-E' flag with grep; use '-P' instead (when possible). The '-E' flag uses 'Extended Regular Expression Mode', which doesn't support some common 'modern' regular expression features. To make matters worse, using some of these 'modern' features with the -E flag can result in undefined behaviour that doesn't give you the answer you expect.