Using The 'uniq' Command To Impress People At Parties
2021-02-20 - By Robert Elder
Introduction
In this article, I will attempt to convince you that the 'uniq' command is worth learning, since it can help you quickly solve a number of different command-line tasks that involve 'uniqueness' in some way. You'll also learn how this command can make you more popular at parties.
The 'uniq' command is the fastest way to find the set of unique lines that appear in a file. It can also answer other uniqueness-related questions, like listing duplicates and counting occurrences. Finally, we'll discuss a couple of ways that you can easily misuse this command, one of which could result in data loss.
The Simplest Example
Let's start with a simple example. The text file 'simple_example.txt' contains the following list of names:
Verity Jayda
Verity Jayda
Verity Jayda
Justy Kaiden
Christopher Rene
Christopher Rene
Christopher Rene
Branden McKenna
Branden McKenna
Branden McKenna
Branden McKenna
Branden McKenna
Branden McKenna
Triston Issy
To find the unique set of lines in this file, you can run the 'uniq' command like this:
uniq simple_example.txt
which produces the following result:
Verity Jayda
Justy Kaiden
Christopher Rene
Branden McKenna
Triston Issy
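Note that, like most standard command-line tools, 'uniq' reads from standard input when you don't give it a file name, so this pipeline produces exactly the same result:
cat simple_example.txt | uniq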
Be Careful! 'uniq' Expects Sorted Data
However, a common pitfall of the 'uniq' command is that it will only detect lines as duplicates if they are adjacent. For example, here is some similar text in a file called 'non_adjacent.txt':
Branden McKenna
Verity Jayda
Verity Jayda
Christopher Rene
Justy Kaiden
Christopher Rene
Branden McKenna
Verity Jayda
Branden McKenna
Christopher Rene
Branden McKenna
Branden McKenna
Triston Issy
Branden McKenna
If we run the 'uniq' command on this file:
uniq non_adjacent.txt
We get the following result:
Branden McKenna
Verity Jayda
Christopher Rene
Justy Kaiden
Christopher Rene
Branden McKenna
Verity Jayda
Branden McKenna
Christopher Rene
Branden McKenna
Triston Issy
Branden McKenna
which obviously still contains duplicate lines.
Therefore, it's common to use the 'sort' command in combination with the 'uniq' command, like this:
sort non_adjacent.txt | uniq
the above command will produce the following result:
Branden McKenna
Christopher Rene
Justy Kaiden
Triston Issy
Verity Jayda
and now the output has all duplicate lines removed.
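As a side note, if all you want is this de-duplicated output, the 'sort' command can do the whole job by itself with its '-u' flag:
sort -u non_adjacent.txt
which produces exactly the same result as the 'sort | uniq' pipeline above.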
Counting Occurrences
One of the most useful features of the 'uniq' command comes from the '-c' flag, which shows you a count of how many times each unique line appears in the file. For example, this command:
sort non_adjacent.txt | uniq -c
which produces the following output:
6 Branden McKenna
3 Christopher Rene
1 Justy Kaiden
1 Triston Issy
3 Verity Jayda
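Piping that result through 'sort' one more time gives you the classic command-line idiom for a frequency table, ordered from most common to least common:
sort non_adjacent.txt | uniq -c | sort -rn
which produces output like this:
6 Branden McKenna
3 Verity Jayda
3 Christopher Rene
1 Triston Issy
1 Justy Kaiden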
You can also use uniq's '-u' flag to print only the lines that occur exactly once in the input. Note that this is NOT the same as the de-duplicated set of lines that 'uniq' prints by default (or that 'sort -u' prints):
sort non_adjacent.txt | uniq -u
which produces this result:
Justy Kaiden
Triston Issy
The '-d' flag does the opposite: it prints one copy of each line that appears more than once in the input:
sort non_adjacent.txt | uniq -d
For this case, the output is:
Branden McKenna
Christopher Rene
Verity Jayda
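GNU 'uniq' also has a '-D' (or '--all-repeated') flag which, unlike '-d', prints every occurrence of each duplicated line rather than just one copy:
sort non_adjacent.txt | uniq -D
For this input, that means all six copies of 'Branden McKenna', all three copies of 'Christopher Rene', and all three copies of 'Verity Jayda'.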
Improving Your Social Life With The 'uniq' Command
The 'uniq' command isn't just for solving technical problems; you can also use it to improve your social life! Assume you're planning a Christmas party, and you have 3 lists of potential guests: 'family.txt', 'coworkers.txt' and 'friends.txt' (the full files are included at the end of this article). You'd like to find the unique list of guests so that you don't send anyone more than one invitation. The problem is, some of the potential guests are included in more than one list, and to make matters worse, the same name often appears with different capitalizations, like this:
RALPHIE BIRDIE
Ralphie Birdie
ralphie birdie
To find the unique set of lines from all three of these files, you can use this command:
cat friends.txt coworkers.txt family.txt | sort | uniq
which produces output that starts something like this:
Alec Jacqueline
ALISHIA MEG
alishia meg
ASH DALLAS
ash dallas
Avah Breann
avah breann
branden Mckenna
branden mckenna
...
But this result is case sensitive, so we end up with the same name multiple times in different capitalizations. To fix this, we can tell 'uniq' to ignore case with its '-i' flag. We also want 'sort' to fold case while sorting (its '-f' flag) so that lines differing only in capitalization end up adjacent; depending on your locale's collation rules, a plain 'sort' may already do this, but '-f' makes it explicit and portable:
cat friends.txt coworkers.txt family.txt | sort -f | uniq -i
which produces output that starts like this:
Alec Jacqueline
ALISHIA MEG
ASH DALLAS
Avah Breann
branden Mckenna
CASON MERYL
CHRISTOPHER RENE
courtney mick
...
The GNU implementation of 'uniq' also includes the '--group' flag, which prints every line of the input, with lines that compare equal grouped together and each group separated by a blank line. For example, this command:
cat friends.txt coworkers.txt family.txt | sort -f | uniq -i --group
produces output that starts like this:
Alec Jacqueline

ALISHIA MEG
alishia meg

ASH DALLAS
ash dallas

Avah Breann
Avah Breann
avah breann
avah breann

branden Mckenna
branden mckenna
branden mckenna

CASON MERYL
Cason Meryl
cason meryl
cason meryl
...trimmed for space...
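One caveat of the approach above is that whichever capitalization happens to sort first in each group is the one that ends up on the invitation list. If you'd rather normalize everything to one consistent capitalization, a common trick is to lower-case the input with 'tr' before de-duplicating (assuming, of course, that you're happy with all-lowercase names):
cat friends.txt coworkers.txt family.txt | tr '[:upper:]' '[:lower:]' | sort | uniq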
Collation Issues
You might assume that the 'uniqueness' of lines in a file depends only on the contents of the file itself. However, that assumption is wrong! It also depends on the 'collation algorithm' that your locale settings select!
Consider the following file 'unicode_example.txt' containing Unicode characters:
x ◌̛ ◌̣
x ◌̣ ◌̛
Here's a hex dump of this file:
00000000: 7820 e297 8ccc 9b20 e297 8ccc a30a 7820 x ..... ......x
00000010: e297 8ccc a320 e297 8ccc 9b0a ..... ......
On my machine, if I run the uniq command on this file:
uniq unicode_example.txt
I get the following result (only one line):
x ◌̛ ◌̣
But, if I set the environment variable 'LC_ALL' to the value 'C' and run the same command:
LC_ALL=C uniq unicode_example.txt
I get this result (which now has two lines!):
x ◌̛ ◌̣
x ◌̣ ◌̛
The difference comes down to the 'Unicode collation algorithm': in my default UTF-8 locale, the two lines (which contain the same combining characters in a different order) compare as equal, but in the 'C' locale lines are compared byte by byte, so they remain distinct. You can also read more relevant information on the man page for 'setlocale':
man setlocale
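For this reason, if you need byte-for-byte reproducible results across different machines (in a shell script, for example), it's common to pin the locale so that both 'sort' and 'uniq' compare raw bytes instead of using the locale's collation rules:
LC_ALL=C sort unicode_example.txt | LC_ALL=C uniq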
A Dangerous Gotcha With 'uniq'
There is one unfortunate inconsistency with the 'uniq' command that could easily catch you off guard. To illustrate the problem, let's do a quick example and make some temporary files:
echo -e "Hello World!" >> abc.txt
echo -e "Here is some important, un-backed up data." >> important_data.txt
echo -e "More data." >> important_data.txt
echo -e "Hello World!" >> important_data.txt
Now that we have our two text files, let's run some familiar commands on them. First, we'll start with grep:
grep Hello *.txt
which produces this result:
abc.txt:Hello World!
important_data.txt:Hello World!
Looks normal. Let's try running a sed command:
sed 's/Hello/Goodbye/g' *.txt
which produces this result:
Goodbye World!
Here is some important, un-backed up data.
More data.
Goodbye World!
Looks normal. Let's try running a sort command:
sort *.txt
which produces this result:
Hello World!
Hello World!
Here is some important, un-backed up data.
More data.
Looks normal. Let's try running a 'uniq' command:
uniq *.txt
which produces no output at all. That's different. Even worse, all of the data inside the file 'important_data.txt' has been overwritten! Oh no! We just lost data! What's going on?
What happened here was the command:
uniq *.txt
expands to this:
uniq abc.txt important_data.txt
and it just so happens that the 'uniq' command is inconsistent with other command-line tools in how it handles multiple file arguments. It treats the second file as an output file instead of an input file! The result is that the contents of 'important_data.txt' are overwritten with the set of unique lines found in 'abc.txt' (which is probably not what you wanted if you were using a wildcard like that).
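This behaviour is documented right in the synopsis of the man page, which reads 'uniq [OPTION]... [INPUT [OUTPUT]]'. If you really do want the unique lines from several files at once, concatenate them into a single stream first so that 'uniq' never sees a second file argument to clobber:
cat *.txt | sort | uniq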
A Couple Of Advanced, Esoteric Examples
The 'uniq' command has a couple of other esoteric command-line flags that you probably won't use every day, but I'll document them here for the sake of completeness. The first is the '-f' flag, which makes the 'uniqueness' comparison skip the first N fields of each line (where, quoting the POSIX standard, "A field is the maximal string matched by the basic regular expression: [[:blank:]]*[^[:blank:]]*"). Consider the following text file 'with_fields.txt':
one pear
two pear
one apple
two apple
two apple
one pear
If we run the following command on this file:
cat with_fields.txt | sort -k 2,2 | uniq -f 1 --group
the lines in the output will be grouped together according to the 'uniqueness' of the second column:
one apple
two apple
two apple

one pear
one pear
two pear
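You can also combine '-f' with the '-c' flag from earlier to get a count per second-column value. For example:
cat with_fields.txt | sort -k 2,2 | uniq -f 1 -c
produces output like this (note that 'uniq' prints the first line of each group, so the first column you see here is simply whichever line sorted first):
3 one apple
3 one pear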
The next interesting flag to consider is the '-s' flag, which will skip the first N characters when considering uniqueness. The '-f' flag can even be combined with the '-s' flag (fields are skipped first, then characters; see the sketch at the end of this section). For an example of using the '-s' flag on its own, consider the following file 'with_prefix.txt':
256-ABC
929-DEF
453-ABC
398-DEF
398-GHI
123-ABC
If we run the following command on this file:
cat with_prefix.txt | sort -k 1.4,1 | uniq -s 4 --group
this will be the result:
123-ABC
256-ABC
453-ABC

398-DEF
929-DEF

398-GHI
where the effect is to group together the lines that are 'unique' after ignoring the first 4 characters.
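To see '-f' and '-s' working together, here's a quick sketch using a hypothetical file 'with_both.txt' (this file isn't part of the examples above) where each line has a label field followed by a prefixed code:
one 256-ABC
two 453-ABC
one 929-DEF
If we run the following command on this file:
cat with_both.txt | sort -k 2.5,2 | uniq -f 1 -s 5 --group
the '-f 1' skips the first field, and '-s 5' then skips five more characters (the separating space plus a prefix like '256-'), so the lines are grouped by the trailing 'ABC'/'DEF' codes:
one 256-ABC
two 453-ABC

one 929-DEF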
Full Name Lists
Below, you'll find the full contents of the three files mentioned in the guest list example above.
Here are the full contents of the 'friends.txt' file mentioned above:
hayden monica
Lewis Mora
CHRISTOPHER RENE
SEQUOIA SYLVANUS
Ralphie Birdie
jaclyn suzanna
enola leroi
HAYDEN MONICA
ralphie birdie
Verity Jayda
alishia meg
Jaclyn Suzanna
WILLARD STARR
cason meryl
Meade Lane
ASH DALLAS
MEADE LANE
DUKE LARK
Avah Breann
Here are the full contents of the 'family.txt' file:
CASON MERYL
Justy Kaiden
meade lane
enola Leroi
Elwood nicky
courtney mick
VERITY JAYDA
enola leroi
Lewis mora
triston issy
RALPHIE BIRDIE
Enola Leroi
DUKE LARK
JACLYN SUZANNA
ALISHIA MEG
branden mckenna
avah breann
WILLARD STARR
Here are the full contents of the 'coworkers.txt' file:
avah breann
ELWOOD NICKY
hayden monica
Cason Meryl
sequoia sylvanus
branden mckenna
ralphie birdie
verity jayda
branden Mckenna
len xanthia
justy kaiden
Alec Jacqueline
cason meryl
wade Kori
Avah Breann
verity Jayda
ash dallas
meade lane
TRISTON ISSY
Enola Leroi
Closing Thoughts
As we've seen in this article, the 'uniq' command is a great tool to have at your disposal when you need to quickly ask questions about the uniqueness of lines in a file. Problems involving uniqueness come up surprisingly often! The 'uniq' command also comes pre-installed by default on almost all *nix distributions, so it'll always be there when you need it. As you've seen from the examples above, it can even make you more popular at parties by showing everyone how good your event-planning and organizational skills are!
And that's why the 'uniq' command is my favourite Linux command.