Why The 'sort' Command Is A Favourite For Librarians
2021-02-20 - By Robert Elder
Introduction
In this article, I will attempt to convince you that the 'sort' command is worth learning. The 'sort' command is no doubt the favourite Unix command of librarians, and by the end of this article, it should be clear why! There are a number of other Linux/Unix commands that only work correctly with input that has been pre-sorted, such as the 'uniq' command or the 'comm' command. Piping the output of 'sort' into these commands can make for quick and easy solutions to many of your text-processing problems.
The Simplest Sort Example
Let's do a quick example to see how useful the sort command can be. Here is a text file 'best-novels.txt' that contains some random books from this list of 100 best novels on Wikipedia:
Tropic of Cancer Henry Miller 1934
Housekeeping Marilynne Robinson 1981
Deliverance James Dickey 1970
The Sun Also Rises Ernest Hemingway 1926
The Great Gatsby F. Scott Fitzgerald 1925
The Corrections Jonathan Franzen 2001
The Berlin Stories Christopher Isherwood 1946
Call It Sleep Henry Roth 1935
Slaughterhouse-Five Kurt Vonnegut 1969
Light in August William Faulkner 1932
These books have been listed in the file 'best-novels.txt' in a random order, but we want them to be listed in alphabetical order. We can quickly see them sorted using this command:
sort best-novels.txt
And this will immediately sort all of the lines in the file, then print them to the terminal:
Call It Sleep Henry Roth 1935
Deliverance James Dickey 1970
Housekeeping Marilynne Robinson 1981
Light in August William Faulkner 1932
Slaughterhouse-Five Kurt Vonnegut 1969
The Berlin Stories Christopher Isherwood 1946
The Corrections Jonathan Franzen 2001
The Great Gatsby F. Scott Fitzgerald 1925
The Sun Also Rises Ernest Hemingway 1926
Tropic of Cancer Henry Miller 1934
Sort In Reverse
You can also sort the lines in reverse order with the following command:
sort -r best-novels.txt
which will output the following:
Tropic of Cancer Henry Miller 1934
The Sun Also Rises Ernest Hemingway 1926
The Great Gatsby F. Scott Fitzgerald 1925
The Corrections Jonathan Franzen 2001
The Berlin Stories Christopher Isherwood 1946
Slaughterhouse-Five Kurt Vonnegut 1969
Light in August William Faulkner 1932
Housekeeping Marilynne Robinson 1981
Deliverance James Dickey 1970
Call It Sleep Henry Roth 1935
Sorting Numbers - Lexical Vs. Numerical Ordering
You can also use the 'sort' command to sort numbers. Here's a file 'some-numbers.txt' that contains some sample numbers:
16454
6123
10538
9446
23666
21749
101
6812
If we use the following sort command on this file:
sort some-numbers.txt
we'll get the following result:
101
10538
16454
21749
23666
6123
6812
9446
which probably isn't what you were expecting, since the numbers aren't sorted in numerically ascending order. This is because the sort command is defaulting to lexical sorting (used by librarians) rather than numerical sorting. To force numerical sorting, you can use the '-n' flag:
sort -n some-numbers.txt
and now the result will be:
101
6123
6812
9446
10538
16454
21749
23666
which shows the numbers in numerically ascending order.
Sorting By Column Offset
Whenever we run the sort command like this:
sort best-novels.txt
this will perform the sort comparisons using the entire line, which mixes the author together with the book title. In our use case, we can explicitly sort on the text starting at the first column using the following command (this probably isn't what you want, see next section!):
sort -t $'\t' -k 1 best-novels.txt
In order to split up the columns, we need to use the '-t' flag to specify the column delimiter. In this case, $'\t' is a special syntax that is used in bash to specify a literal tab character. If you were delimiting columns with a space, you'd do this:
sort -t ' ' -k 1 best-novels.txt
or with a comma, you'd do this:
sort -t ',' -k 1 best-novels.txt
If we want to start the comparison at the second or third column (also probably not exactly what you really want) you'd do this:
sort -t $'\t' -k 2 best-novels.txt
sort -t $'\t' -k 3 best-novels.txt
Be Careful! Column Sorting Isn't Intuitive
BUT, the '-k' flag with the sort command is easy to misunderstand! The correct way to sort based on a specific individual column (and only that column) is to use the '-k' flag like this:
sort -t $'\t' -k 1,1 best-novels.txt
The -k flag can be easy to mis-use since it actually requires that you specify a starting and an ending column, not just a column number. If you only specify one column number, the 'ending' column is assumed to be the end of the line! This is a very poor choice of default IMHO, but it's standard behaviour now.
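To make the difference concrete, here is a minimal sketch that runs both forms of '-k' on a two-line sample. The sample data and the variable names are made up for illustration, the tab is produced with printf so the snippet does not depend on bash's $'\t' syntax, and LC_ALL=C is set only to make the result reproducible. When two lines tie on the chosen key, sort falls back to comparing the whole line, which is why the second command still produces a definite order.

```shell
# Build a tiny tab-separated sample (hypothetical data, not from best-novels.txt).
tab=$(printf '\t')
sample=$(printf 'x\tb\t2\ny\tb\t1\n')

# '-k 2' compares from column 2 to the end of the line,
# so the keys are "b<TAB>2" and "b<TAB>1".
open_ended=$(printf '%s\n' "$sample" | LC_ALL=C sort -t "$tab" -k 2)

# '-k 2,2' compares column 2 only; both keys are "b", so the tie is
# broken by comparing the whole line.
column_only=$(printf '%s\n' "$sample" | LC_ALL=C sort -t "$tab" -k 2,2)

printf 'with -k 2:\n%s\n\nwith -k 2,2:\n%s\n' "$open_ended" "$column_only"
```

With '-k 2' the line starting with 'y' comes first (because of the year column leaking into the key), while with '-k 2,2' the line starting with 'x' comes first.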
When it comes to sorting on columns it can get hard to understand what's going on, but fortunately, the GNU implementation of the sort command also includes the --debug flag to help you debug what the sorting process is actually looking at. It will underline the parts of the line that are actually considered in the comparison. Let's see what the --debug shows us with this basic sort command:
sort --debug best-novels.txt
and the result is:
Call It Sleep>Henry Roth>1935
_____________________________
Deliverance>James Dickey>1970
_____________________________
Housekeeping>Marilynne Robinson>1981
____________________________________
Light in August>William Faulkner>1932
_____________________________________
Slaughterhouse-Five>Kurt Vonnegut>1969
______________________________________
The Berlin Stories>Christopher Isherwood>1946
_____________________________________________
The Corrections>Jonathan Franzen>2001
_____________________________________
The Great Gatsby>F. Scott Fitzgerald>1925
_________________________________________
The Sun Also Rises>Ernest Hemingway>1926
________________________________________
Tropic of Cancer>Henry Miller>1934
__________________________________
As you can see, it's obviously underlining the entire part of every line. Let's see what happens when we try to sort based on the first column as shown above:
sort -t $'\t' -k 1 --debug best-novels.txt
and the result is:
Call It Sleep>Henry Roth>1935
_____________________________
_____________________________
Deliverance>James Dickey>1970
_____________________________
_____________________________
Housekeeping>Marilynne Robinson>1981
____________________________________
____________________________________
...trimmed for space...
Aha! You can see clearly now that it's underlining the entire part of every line, so clearly '-k 1' might not be doing what you want. There are also two underlines for each line, which is something that we'll come back to later. Let's see what happens if we use '-k 2':
sort -t $'\t' -k 2 --debug best-novels.txt
and the result is:
The Berlin Stories>Christopher Isherwood>1946
__________________________
_____________________________________________
The Sun Also Rises>Ernest Hemingway>1926
_____________________
________________________________________
The Great Gatsby>F. Scott Fitzgerald>1925
________________________
_________________________________________
...trimmed for space...
As you can see above, the underline now starts under the second column and then goes to the end of the line. That's because the number we specified after the -k flag is treated as the starting column to begin the comparison on.
Now, let's perform the sort on only the first column. To do this, we specify a starting column and an ending column like this:
sort -t $'\t' -k 1,1 --debug best-novels.txt
which produces this result:
Call It Sleep>Henry Roth>1935
_____________
_____________________________
Deliverance>James Dickey>1970
___________
_____________________________
Housekeeping>Marilynne Robinson>1981
____________
____________________________________
...trimmed for space...
As you can see, the first pass of the sorting comparison considers only the first column (column 1 to column 1). We can do the same thing with column two like this:
sort --debug -t $'\t' -k 2,2 best-novels.txt
which produces this result:
The Berlin Stories>Christopher Isherwood>1946
_____________________
_____________________________________________
The Sun Also Rises>Ernest Hemingway>1926
________________
________________________________________
The Great Gatsby>F. Scott Fitzgerald>1925
___________________
_________________________________________
...trimmed for space...
Above, we can see that the first pass of the comparison will now only consider the author. We can do the same thing with the year column:
sort --debug -t $'\t' -k 3,3 best-novels.txt
which produces this result:
The Great Gatsby>F. Scott Fitzgerald>1925
____
_________________________________________
The Sun Also Rises>Ernest Hemingway>1926
____
________________________________________
Light in August>William Faulkner>1932
____
_____________________________________
...trimmed for space...
But wait a minute, what's with that extra underline that appears on every single line? I'm glad you asked, because the answer has to do with sorting stability which we'll discuss in the next section.
Sorting Stability
Sorting stability is a statement about what a sorting algorithm does with items that are 'tied' as far as the sorting comparison goes. A 'stable' sort keeps tied items in the same relative order that they had in the input, while an unstable sort makes no such promise.
So, does the 'sort' command provide 'stable' sorting? Well, it's not included in the POSIX standard, but the standard does note that many implementations do include a '-s' flag that provides stable sorting. The documentation notes that the '-s' flag disables 'last-resort comparisons', which effectively makes the sorting 'stable'. It also gets rid of that extra mysterious underline that we saw in the last section:
sort --debug -t $'\t' -k 1,1 -s best-novels.txt
will now produce this output:
Call It Sleep>Henry Roth>1935
_____________
Deliverance>James Dickey>1970
___________
Housekeeping>Marilynne Robinson>1981
____________
Light in August>William Faulkner>1932
_______________
...trimmed for space...
sort --debug -t $'\t' -k 2,2 -s best-novels.txt
will now produce this output:
The Berlin Stories>Christopher Isherwood>1946
_____________________
The Sun Also Rises>Ernest Hemingway>1926
________________
The Great Gatsby>F. Scott Fitzgerald>1925
___________________
Tropic of Cancer>Henry Miller>1934
____________
...trimmed for space...
sort --debug -t $'\t' -k 3,3 -s best-novels.txt
will now produce this output:
The Great Gatsby>F. Scott Fitzgerald>1925
____
The Sun Also Rises>Ernest Hemingway>1926
____
Light in August>William Faulkner>1932
____
Tropic of Cancer>Henry Miller>1934
____
...trimmed for space...
Let's review an example where stable sorting makes a difference. Here are the contents of a file called 'stable-sort-example.csv':
abc,789,hello
abc,123,hello
def,123,hello
def,456,hello
abc,456,hello
If we run this command to sort the data based on the first column without '-s' for stable sorting:
sort -t ',' -k 1,1 stable-sort-example.csv
the result is the following:
abc,123,hello
abc,456,hello
abc,789,hello
def,123,hello
def,456,hello
But, if we run this command again with the '-s' flag:
sort -t ',' -k 1,1 -s stable-sort-example.csv
we get the following:
abc,789,hello
abc,123,hello
abc,456,hello
def,123,hello
def,456,hello
As you can see from above, these two results are different because the '-s' flag will disable the 'last resort comparison' which would have defaulted to sorting the entire lines any time two lines have an identical value in the first column.
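One place where stability is handy is sorting in several passes. The sketch below, which assumes a sort implementation that supports '-s' (GNU sort does) and uses illustrative variable names, sorts the same sample data by the second column first and then stably by the first column. Because the second pass keeps tied rows in the order it received them, the outcome should match a single invocation with two keys.

```shell
# The sample data from 'stable-sort-example.csv', inlined for convenience.
data=$(printf 'abc,789,hello\nabc,123,hello\ndef,123,hello\ndef,456,hello\nabc,456,hello\n')

# Pass 1 orders by the numeric second column. Pass 2 orders by the first
# column with -s, so rows that tie on column 1 keep their pass-1 order.
two_pass=$(printf '%s\n' "$data" \
  | LC_ALL=C sort -t ',' -k 2,2n \
  | LC_ALL=C sort -t ',' -k 1,1 -s)

# The same ordering requested in one invocation with two keys.
one_pass=$(printf '%s\n' "$data" | LC_ALL=C sort -t ',' -k 1,1 -k 2,2n)

printf 'two passes:\n%s\n\none pass:\n%s\n' "$two_pass" "$one_pass"
```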
Multiple Sort Columns At Once
You can explicitly define sort orders for all columns by using the -k flag multiple times. Here is an example that will sort our list of novels first according to the book year, then according to author, and finally by the book title:
sort -t $'\t' -k 3,3 -k 2,2 -k 1,1 -s best-novels.txt
However, the above command has an issue related to lexical vs. numerical sorting that we saw previously. The last column won't sort things in numerically ascending order unless we tell it to, so a book published in a year with only three digits that happens to start with a large number (like '823') would end up at the end of the list instead of at the start:
Some Really Old Book Really Old Guy 823
If we try our sort command on our novels list with this extra book, we'll get this:
The Great Gatsby F. Scott Fitzgerald 1925
The Sun Also Rises Ernest Hemingway 1926
Light in August William Faulkner 1932
Tropic of Cancer Henry Miller 1934
Call It Sleep Henry Roth 1935
The Berlin Stories Christopher Isherwood 1946
Slaughterhouse-Five Kurt Vonnegut 1969
Deliverance James Dickey 1970
Housekeeping Marilynne Robinson 1981
The Corrections Jonathan Franzen 2001
Some Really Old Book Really Old Guy 823
We can fix this by adding the 'n' flag, but only for the 3rd column:
sort -t $'\t' -k 3,3n -k 2,2 -k 1,1 -s best-novels.txt
and now the result will be this:
Some Really Old Book Really Old Guy 823
The Great Gatsby F. Scott Fitzgerald 1925
The Sun Also Rises Ernest Hemingway 1926
Light in August William Faulkner 1932
Tropic of Cancer Henry Miller 1934
Call It Sleep Henry Roth 1935
The Berlin Stories Christopher Isherwood 1946
Slaughterhouse-Five Kurt Vonnegut 1969
Deliverance James Dickey 1970
Housekeeping Marilynne Robinson 1981
The Corrections Jonathan Franzen 2001
The Dangers of Sorting Collations
It's common to assume that the final sort ordering of lines in a file would only depend on the actual content of the file itself. But is that assumption correct? Nope! Consider the file 'unicode-example.txt' containing the following text:
A B C
Abc
A b c
On my machine, if I run this sort command:
sort unicode-example.txt
I get the following result:
A B C
Abc
A b c
However, if I set the environment variable 'LC_ALL' to have the value 'C' when running the sort command, like this:
LC_ALL=C sort unicode-example.txt
then I get this result:
A B C
A b c
Abc
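which is obviously different. The difference comes down to the collation rules of the active locale: with 'LC_ALL=C' the lines are compared byte by byte, whereas other locales apply language-aware rules (in the spirit of the Unicode collation algorithm) that can treat spaces and letter case differently. You can also read more about the environment variables that affect sorting by checking the man page for 'setlocale':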
man setlocale
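If you want to check how your own machine behaves, something like the following sketch can help. It inlines the three lines from 'unicode-example.txt', prints which collation setting appears to be in effect, and then sorts once with the current locale and once with 'LC_ALL=C'. The variable names are illustrative, and the first result is expected to vary between systems.

```shell
# The three lines from 'unicode-example.txt', inlined for convenience.
lines=$(printf 'A B C\nAbc\nA b c\n')

# Show which collation setting is in effect (LC_ALL takes precedence over
# LC_COLLATE, which takes precedence over LANG).
echo "collation locale: ${LC_ALL:-${LC_COLLATE:-${LANG:-unset}}}"

# Locale-dependent result: may differ from machine to machine.
locale_sorted=$(printf '%s\n' "$lines" | sort)

# Byte-order result: pinning LC_ALL=C makes the output reproducible.
c_sorted=$(printf '%s\n' "$lines" | LC_ALL=C sort)

printf 'current locale:\n%s\n\nLC_ALL=C:\n%s\n' "$locale_sorted" "$c_sorted"
```

Pinning 'LC_ALL=C' in scripts is a common way to get the same ordering everywhere, at the cost of losing language-aware ordering.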
Sort By Random
Another useful feature of the sort command is the '-R' flag, which will sort the lines in the file by 'random':
sort -R best-novels.txt
Each time you run this command, it will output the lines in a different order. This can be very useful for generating test cases, or any situation where you need to purposefully mix up data, such as for producing an unbiased class list (although don't expect it to be cryptographically unbiased).
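One caveat worth knowing: GNU sort documents '-R' as sorting by a random hash of the keys, so identical lines always end up next to each other. The sketch below assumes GNU coreutils (for both 'sort -R' and 'shuf') and uses made-up input to show the difference from a plain random permutation.

```shell
# Four lines, but only two distinct values.
input=$(printf 'a\nb\na\nb\n')

# sort -R orders by a random hash of each line, so equal lines hash the same
# and always come out adjacent (either a,a,b,b or b,b,a,a).
by_hash=$(printf '%s\n' "$input" | sort -R)

# shuf does a plain random permutation, so an order like a,b,a,b is possible.
shuffled=$(printf '%s\n' "$input" | shuf)

printf 'sort -R:\n%s\n\nshuf:\n%s\n' "$by_hash" "$shuffled"
```

If your input has no duplicate lines, the two approaches behave much the same; if it does and you want a true shuffle, 'shuf' is probably the better fit.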
Find Unique Lines In File
The 'sort' command also supports the '-u' flag which will print out the unique set of lines with duplicates removed (which is, confusingly, a completely different behaviour from using the '-u' flag with the 'uniq' command!):
sort -u best-novels.txt
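Since best-novels.txt has no duplicate lines, the difference between the two '-u' flags is easier to see on a small made-up input. In this sketch (variable names are illustrative), 'sort -u' keeps one copy of every distinct line, while 'uniq -u' keeps only the lines that were never repeated:

```shell
# 'a' appears twice; 'b' and 'c' appear once.
input=$(printf 'a\nb\na\nc\n')

# sort -u: every distinct line once, with duplicates collapsed.
distinct=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

# uniq -u: only the lines that occur exactly once ('a' is dropped entirely).
never_repeated=$(printf '%s\n' "$input" | LC_ALL=C sort | uniq -u)

printf 'sort -u:\n%s\n\nsort | uniq -u:\n%s\n' "$distinct" "$never_repeated"
```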
Using Sort With Other Utilities
The sort command is also a prerequisite for several other common Unix utilities that require their input to be sorted first. It also combines well with the head or tail command when you want to extract a subset of items at the start or end of the sorted output. For example, if we want to find the 3 oldest books from our original novels list, we can use this command:
sort -t $'\t' -k 3,3n -k 2,2 -k 1,1 -s best-novels.txt | head -n 3
and the result will be:
The Great Gatsby F. Scott Fitzgerald 1925
The Sun Also Rises Ernest Hemingway 1926
Light in August William Faulkner 1932
To find the newest 3 books, you could use the tail command like this:
sort -t $'\t' -k 3,3n -k 2,2 -k 1,1 -s best-novels.txt | tail -n 3
and the result will be:
Deliverance James Dickey 1970
Housekeeping Marilynne Robinson 1981
The Corrections Jonathan Franzen 2001
But the ordering of the list doesn't have the newest ones at the top. You could fix this by using 'head' again and changing the sort order to reverse from newest to oldest:
sort -t $'\t' -k 3,3rn -k 2,2 -k 1,1 -s best-novels.txt | head -n 3
The Corrections Jonathan Franzen 2001
Housekeeping Marilynne Robinson 1981
Deliverance James Dickey 1970
Another common use case for the sort command is to use it in combination with the 'uniq' command, since the 'uniq' command expects its input to be sorted. There is some overlap between the features of the 'sort' command and the 'uniq' command (because of sort's '-u' flag), but 'uniq' also has a few extra useful features. The '-u' flag with sort works like this:
sort -u best-novels.txt
but you could do this to get the same result:
sort best-novels.txt | uniq
The 'uniq' command also supports a flag that will show you only the set of lines that were duplicated in the file (which is very useful in cases where you're merging data):
sort best-novels.txt | uniq -d
Another useful flag with 'uniq' is to find the counts for the number of times each line appears in a file:
sort best-novels.txt | uniq -c
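A common extension of this idea is to sort the counts themselves, which gives a quick frequency table. The following is a rough sketch with made-up input and illustrative variable names; it reuses the '-k 1,1nr' style of key from earlier to order by the count column, numerically and in reverse.

```shell
# Sample input: 'b' appears three times, 'a' twice, 'c' once.
words=$(printf 'b\na\nb\nc\nb\na\n')

# 1. sort      -> identical lines become adjacent
# 2. uniq -c   -> each run is collapsed and prefixed with its count
# 3. sort      -> order by the count column, largest first
frequency=$(printf '%s\n' "$words" \
  | LC_ALL=C sort \
  | uniq -c \
  | LC_ALL=C sort -k 1,1nr)

printf '%s\n' "$frequency"
```

The counts printed by 'uniq -c' are padded with leading spaces, which the numeric comparison ignores.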
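And finally, another useful command-line tool that expects pre-sorted data is the 'comm' command. You can use this command to find the set intersections, unions, and complements of the lines in files or streams. For example, given a second file 'best-novels2.txt', the following prints only the lines that appear in the second file but not in the first (the '-13' suppresses the 'only in the first file' and 'in both files' columns):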
comm -13 <(sort best-novels.txt) <(sort best-novels2.txt)
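As a small self-contained sketch of the column-suppressing flags, the snippet below builds two throwaway files in a temporary directory (the file names and contents are hypothetical) and uses plain files instead of bash's process substitution so that it also runs in other shells. 'LC_ALL=C' is applied to both 'sort' and 'comm' so that they agree on the ordering.

```shell
# Create two small unsorted files in a scratch directory.
dir=$(mktemp -d)
printf 'c\na\nb\n' > "$dir/first.txt"
printf 'd\nc\nb\n' > "$dir/second.txt"

# comm expects sorted input, so sort both files first.
LC_ALL=C sort "$dir/first.txt"  > "$dir/first.sorted"
LC_ALL=C sort "$dir/second.txt" > "$dir/second.sorted"

# -12 hides columns 1 and 2, leaving the lines common to both files.
in_both=$(LC_ALL=C comm -12 "$dir/first.sorted" "$dir/second.sorted")
# -23 leaves the lines found only in the first file.
only_first=$(LC_ALL=C comm -23 "$dir/first.sorted" "$dir/second.sorted")
# -13 leaves the lines found only in the second file.
only_second=$(LC_ALL=C comm -13 "$dir/first.sorted" "$dir/second.sorted")

rm -r "$dir"

printf 'both: %s\n' $in_both
printf 'only first: %s\nonly second: %s\n' "$only_first" "$only_second"
```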
Closing Thoughts
The 'sort' command is a great tool to have at your disposal. Since it comes built-in on most *nix distributions, it's always there when you need it, and it sure beats writing a from-scratch C program to do the sorting instead! Whether you're an accomplished sysadmin, or an aspiring librarian, the 'sort' command is sure to make you more productive at your job.
And that's why the 'sort' command is my favourite Linux command.