The Most Confusing Grep Mistakes I've Ever Made
2020-11-02 - By Robert Elder
Introduction
In this article, I'll discuss 5 very confusing mistakes that have caused me to waste significant amounts of time when using the command-line tool 'grep' to search for things. I have chosen to document these mistakes in detail because beginners are very likely to make them at some point, but are unlikely to be able to debug them on their own. The root causes of these mistakes are: not knowing which flavour of regular expression grep is currently using (and/or not understanding what features that flavour supports); not considering the escaping rules of your shell; and issues with character encodings.
0) Forgetting To Specify A Filename
(Added on 2020-11-07) Many people suggested that I add this item as #0 to this list. The grep command can accept input in two different ways. From one or more files, like this:
grep "test" file.txt
Or, directly from stdin like this when no files are specified:
echo "something" | grep "test"
A common mistake is to forget to specify a file name when there is also no input being piped in on stdin:
grep "test"
In this case, grep will just sit there and do nothing because it's waiting for you to type some input (until CTRL+d is pressed) since nothing was fed in from stdin. In other words, it's waiting for you to type in some text by hand so it can search it for your term. Therefore, if you try to do a grep search like this, you'll be waiting a long time (forever).
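To make the two input paths concrete, here is a small sketch. The file name 'sample_input.txt' and the variable names are just something I made up for this illustration. Redirecting stdin from /dev/null is one way to see what grep does when there is genuinely no input: it returns right away instead of waiting on the keyboard.

```shell
# Create a small sample file (hypothetical name) to search.
printf 'this is a test\nnothing here\n' > sample_input.txt

# 1) Input from a file named on the command line.
from_file=$(grep "test" sample_input.txt)

# 2) Input from stdin through a pipe; no file name is given.
from_pipe=$(cat sample_input.txt | grep "test")

# 3) With stdin redirected from /dev/null, grep sees end-of-input
#    immediately and exits with status 1 (no match found).
no_input_status=0
grep "test" < /dev/null || no_input_status=$?

echo "$from_file"
echo "$from_pipe"
echo "$no_input_status"
```

If a grep command seems to hang, checking whether you actually gave it a file name (or a pipe) is a good first step.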
1) Confusion With '*', '+', '\*', '\+', '\{' and '\}'
Here is a file containing a few lines of text that we'll place inside the file 'hello.txt':
Hello World.
Hello There World.
Hello Some World.
Hello The World.
HelloWorld.
Goodbye World.
Let's say that you wanted to use the 'grep' command to find all lines in this file that contain the word 'Hello' followed by the word 'World'. You could use a grep command like this:
grep "Hello.*World" hello.txt
and as you expect, it finds these matches:
Hello World.
Hello There World.
Hello Some World.
Hello The World.
HelloWorld.
But now you might consider adding the additional requirement that there be at least one character between 'Hello' and 'World' so that the line with 'HelloWorld' is not included in the matches. Since you know that '*' is a regex pattern for 'zero or more' and '+' is a regex pattern for 'one or more', you decide to try the following:
grep "Hello.+World" hello.txt
But this doesn't match anything at all! What's going on here? Isn't '+' a regular expression symbol for 'one or more'? The answer is related to the default regular expression mode that grep uses. If you don't specify any flags to grep, it will use 'BRE' or 'Basic Regular Expressions' which are very old and quite primitive. In fact, the official standard for BRE doesn't even support the '+' quantifier! This can lead to very confusing behaviour since you might just try escaping the '+' and find that it gives you the result you expect:
grep "Hello.\+World" hello.txt
gives the following:
Hello World.
Hello There World.
Hello Some World.
Hello The World.
But since we're still using 'BRE' regular expressions, the official standard says that this is actually undefined behaviour! You can learn more about this in Undefined Behaviour With Grep -E.
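If you want 'one or more' without relying on undefined behaviour, two approaches should work: switch to extended regular expressions with '-E', where '+' is a defined quantifier, or stay in BRE and use the interval expression '\{1,\}'. Here is a sketch that recreates 'hello.txt' from above and tries both (the variable names are only for illustration):

```shell
# Recreate the sample file from this section.
printf '%s\n' 'Hello World.' 'Hello There World.' 'Hello Some World.' \
    'Hello The World.' 'HelloWorld.' 'Goodbye World.' > hello.txt

# ERE: '+' means "one or more" when -E is given.
ere_matches=$(grep -E 'Hello.+World' hello.txt)

# BRE: '\{1,\}' is the interval form of "one or more".
bre_matches=$(grep 'Hello.\{1,\}World' hello.txt)

echo "$ere_matches"
echo "$bre_matches"
```

Both searches should print the four lines that have something between 'Hello' and 'World', and leave out 'HelloWorld.'.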
2) Unexpected Shell Interpolation/Expansion
Let's say we have a file called 'sometext.txt' with the following text in it:
Make sure you write out the `date`.
Today's date is Oct 21, 2020.
If you wanted to find all lines in this file that contain the word 'date', you could use this grep command:
grep date sometext.txt
and you'll get the following result which is what you expect:
Make sure you write out the `date`.
Today's date is Oct 21, 2020.
But now, let's assume you wanted to use grep to find only the line that contains the backtick characters around the word 'date'. You might try doing the following:
grep `date` sometext.txt
But this just generates a bunch of error messages in the shell:
grep: Oct: No such file or directory
grep: 21: No such file or directory
grep: 13:30:57: No such file or directory
grep: EST: No such file or directory
grep: 2020: No such file or directory
You might be thinking, "Oh, no problem, I'll just use double quotes" and try something like this:
grep "`date`" sometext.txt
But that still doesn't work (at least not in bash)! It doesn't find any matches at all! The problem in this case is related to the fact that the backtick character has a special meaning in our shell, even when used inside double quotes. To illustrate this point, we can run the following two echo commands:
echo "date"
echo "`date`"
and the output of these echo statements is:
date
Mon Oct 21 13:30:57 EST 2020
From the output above, you can see why the grep command we last used didn't find anything: we were literally searching for the current date instead of the word 'date' surrounded by backticks! The solution (in the bash shell) is to use single quotes instead:
grep '`date`' sometext.txt
which will match correctly as expected:
Make sure you write out the `date`.
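Here is a small sketch of the quoting options, assuming a POSIX-style shell such as bash. Besides single quotes, escaping each backtick with a backslash inside double quotes should also stop the shell from running the date command. The variable names are made up for this example.

```shell
# Recreate the sample file from this section.
printf '%s\n' 'Make sure you write out the `date`.' \
    "Today's date is Oct 21, 2020." > sometext.txt

# Single quotes: the shell passes the backticks to grep untouched.
single_quoted=$(grep '`date`' sometext.txt)

# Double quotes with each backtick escaped by a backslash.
escaped=$(grep "\`date\`" sometext.txt)

echo "$single_quoted"
echo "$escaped"
```

Single quotes are usually the easier habit, since nothing inside them is interpreted by the shell.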
This isn't the only issue that you can encounter where your shell could unexpectedly change the meaning of the search string that you pass into grep. You can also encounter an issue with unexpected 'globbing' when you attempt to use a regular expression containing the '*' character without using quotes. For example, consider this simple echo statement that just prints out 'asdf':
echo "asdf"
If you filter this echo statement through a grep search for the character 'a' like this:
echo "asdf" | grep a
the search will pass the line 'asdf' through as expected. And similarly, if you do a regex search with grep for an 'a' followed by any number of other characters like this:
echo "asdf" | grep a.*
this will also let the 'asdf' through. However, if you create a new file in the current directory called 'a.txt':
touch a.txt
the following search won't work anymore! It doesn't find anything:
echo "asdf" | grep a.*
What!? How can creating a new file change how our grep commands run in the shell??? This problem is explained in detail in this article on shell globbing.
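The short version, as a sketch: without quotes, the shell is allowed to replace 'a.*' with the names of matching files before grep ever runs. Using echo to print the command shows what grep actually receives. The temporary directory below is only there to keep the experiment isolated, and the variable names are made up for this example.

```shell
# Work in an empty scratch directory so no other files interfere.
scratch_dir=$(mktemp -d)
cd "$scratch_dir"
touch a.txt

# The shell expands the unquoted pattern to the matching file name.
received=$(echo grep a.*)
echo "$received"            # prints: grep a.txt

# So grep searches stdin for the regex 'a.txt' and finds nothing.
unquoted=$(echo "asdf" | grep a.* || true)

# Quoting the pattern stops the expansion, so grep sees 'a.*' itself.
quoted=$(echo "asdf" | grep 'a.*')
echo "$quoted"              # prints: asdf
```

Quoting every pattern you pass to grep, even the ones that look harmless, avoids this whole class of surprise.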
3) Confusing '.' with '\.'
This mistake isn't specific to grep since it's really about regular expressions in general, but it's common enough to include in this article. Consider a case where you're trying to use grep to extract all instances of numbers that include a decimal point. In your search, you're looking for one or more digits, followed by a period, followed by one or more digits. You might try writing a grep command like this:
echo "234.328" | grep -Eo "[0-9]+.[0-9]+"
which looks like it works just fine because it does match all of the things you do want. The problem is that it also matches things that you don't want:
echo "234A328" | grep -Eo "[0-9]+.[0-9]+"
In the above case, our regular expression will match the string '234A328', which isn't a decimal number. This case becomes obvious when you point it out, since the '.' character usually represents "any character except for newline" in most regular expression engines. In order to match a 'literal' period character in a regular expression, you need to escape it:
# Does not match
echo "234A328" | grep -Eo "[0-9]+\.[0-9]+"
# Does match
echo "234.328" | grep -Eo "[0-9]+\.[0-9]+"
The lesson is to be careful when using searches that include a '.' character, since it may not always literally mean a period character.
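Another way to write a literal period, sketched below, is to put it inside a bracket expression. Inside '[...]' the '.' loses its special meaning, and there is no backslash for the shell to interfere with. The variable names are only for illustration.

```shell
# '[.]' matches only a literal period.
with_period=$(echo "234.328" | grep -Eo '[0-9]+[.][0-9]+')

# The same pattern finds nothing when a letter sits between the digits.
with_letter=$(echo "234A328" | grep -Eo '[0-9]+[.][0-9]+' || true)

echo "$with_period"
echo "$with_letter"
```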
4) Confusion With \t and BRE/ERE
Here is some text that we'll place inside a file called 'animals.txt'. Take note that the two 'columns' in this file are separated with tab (\t) characters:
Person Favourite Animal\Pet
Robert Cat
Alexander Dog
Sam Monkey
Michael Snake
Let's say that we wanted to write a grep statement to extract the first column from this file. We could do this quickly and crudely by writing a regular expression that will extract anything up to and including the tab character. Here's an attempt to do this with the following grep command:
grep -o ".*\t" animals.txt
But if you run this, you'll get results that are completely wrong:
Person Favourite Animal\Pet
Robert Cat
You might suspect that the reason is, again, grep's default regular expression mode: BRE or 'Basic Regular Expressions'. However, using the -E flag for 'Extended Regular Expressions' doesn't fix the problem:
grep -Eo ".*\t" animals.txt
still gives:
Person Favourite Animal\Pet
Robert Cat
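In fact, if you check the official standard for BRE and ERE, you'll see that neither has any support for matching a 'tab' character with '\t'! In POSIX BRE or ERE, there are just a handful of characters that you can escape with a backslash, and they don't include 't'. In practice, GNU grep ends up treating '\t' as a plain letter 't', which is why only the lines containing a 't' were printed, each matched up to its last 't'.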
Confusingly, GNU grep does support things like '\s' in ERE even though it's not officially supported by the POSIX standard.
The solution in our case is to use the -P flag for 'Perl-Compatible Regular Expressions':
grep -Po ".*\t" animals.txt
which gives us the expected result:
Person
Robert
Alexander
Sam
Michael
Unfortunately, the '-P' flag is not supported by all versions of grep, so this solution isn't always available.
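When '-P' isn't available, one workaround is to have the shell put a real tab character into the pattern, so grep never needs to understand '\t' at all. The sketch below recreates 'animals.txt' with printf and also shows 'cut', which splits on tabs by default and may be the simpler tool for this particular job. The variable names are my own.

```shell
# Recreate the sample file with real tab characters between the columns.
printf '%s\t%s\n' 'Person' 'Favourite Animal\Pet' 'Robert' 'Cat' \
    'Alexander' 'Dog' 'Sam' 'Monkey' 'Michael' 'Snake' > animals.txt

# Let printf produce a literal tab and store it in a variable.
tab=$(printf '\t')

# Match everything that is not a tab, followed by one tab.
# tr then strips the trailing tab from each match.
grep_column=$(grep -o "^[^${tab}]*${tab}" animals.txt | tr -d '\t')

# cut uses tab as its default field delimiter.
cut_column=$(cut -f1 animals.txt)

echo "$grep_column"
echo "$cut_column"
```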
5) UTF-8 Vs. UTF-16 Vs. Other Encodings
This issue is one that you won't encounter every day, but when you do it can be extremely confusing to figure out what's going on. If you ever happen to work with files that are encoded in UTF-16, you'll have to be mindful of the fact that grep isn't aware of character encodings, so whatever you grep for will likely only be found if it's in a character encoding that matches the current encoding of the terminal where you type the grep command.
For example, imagine that you have two files: the first file encoded in UTF-8 contains this text:
Hello World 123!
and the second file encoded in UTF-16 contains this text:
Hello World 456!
On my machine, if I do a grep search over both of these files using this grep command:
grep World *
this will only match the statement in the first file!
This fact isn't too surprising when you know what's going on, but the difficult part is noticing that you've got a file that's encoded in a different format in the first place. If you take regular ASCII characters and re-encode them as UTF-16, the file you get will look like regular ASCII encoded text with nulls placed between the characters. Therefore, if you print the file onto the terminal, the nulls will be ignored and what you see printed will look indistinguishable from regular ASCII text (except for the byte order mark). Programs like vim will automatically recognize the encoding and display the file as normal text, so you likely won't notice the encoding.
One way to identify the encoding of files is to use the 'file' command:
file file1.txt file2.txt
which gives this output in our example:
file1.txt: ASCII text
file2.txt: Little-endian UTF-16 Unicode text, with no line terminators
Here is an example of hex dump from those two files:
xxd file1.txt
00000000: 4865 6c6c 6f20 576f 726c 6420 3132 3321  Hello World 123!
00000010: 0a                                       .
xxd file2.txt
00000000: fffe 4800 6500 6c00 6c00 6f00 2000 5700  ..H.e.l.l.o. .W.
00000010: 6f00 7200 6c00 6400 2000 3400 3500 3600  o.r.l.d. .4.5.6.
00000020: 2100 0a00                                !...
As you can see, the UTF-16 encoded file looks just like ASCII text with null characters between every character.
So, how do we find matches in a UTF-16 encoded file using grep? Well, this is one of the few situations where grep isn't the best tool for the job on its own. One option would be to normalize your files to UTF-8/ASCII encoding. You can convert files between different encodings using the 'iconv' command:
iconv -f UTF-16 file2.txt -t UTF-8 -o file3.txt
Since the file is now encoded as ASCII/UTF-8 in file3.txt, your original grep command should find the expected matches.
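If you don't want to keep a converted copy around, you can also pipe the output of iconv straight into grep. The sketch below builds both sample files itself (I'm assuming an iconv that knows the names 'UTF-8' and 'UTF-16'), then compares a direct search with a converted one. The variable names are made up for illustration.

```shell
# Build the two sample files: one UTF-8/ASCII, one UTF-16.
printf 'Hello World 123!\n' > file1.txt
printf 'Hello World 456!\n' | iconv -f UTF-8 -t UTF-16 > file2.txt

# Count matching lines when searching the UTF-16 file directly.
direct_count=$(grep -c 'World' file2.txt || true)

# Convert on the fly and search the UTF-8 result.
converted=$(iconv -f UTF-16 -t UTF-8 file2.txt | grep 'World')

echo "$direct_count"
echo "$converted"
```

The direct search should report zero matching lines, while the converted stream should give back the 'Hello World 456!' line.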
Another less ideal option is to use the '-P' flag with grep and explicitly include the null characters for the UTF-16 encoding in your grep command:
grep -Pa 'W\x00o\x00r\x00l\x00d\x00' *
This looks quite messy, and since '-P' is not supported by all versions of grep, you can't always use this option. It also requires you to do a separate search every time you suspect there might be a UTF-16 file present (or more if there are even more encodings present).
Another thing to note is that the '-a' flag in the command above is necessary, otherwise grep will treat the UTF-16 files as binary data and, instead of printing the matching lines, only report that a binary file matches.
Conclusion
Hopefully, you've learned a few things about grep and the shell environment in this article. I feel like I need to write a conclusion section to avoid ending the article too abruptly, but there's really nothing more to say at this point, and if I keep writing then I'll just be rambling. I guess we can talk about the weather if you want. How are things going with you?