Intro To 'join' Command In Linux
2023-11-22 - By Robert Elder
- Example With Product Orders
- Specifying Join Column Index
- Using 'join' With Column Headers
- Using 'join' With Custom Field Separator
- Inner Join With 'join' Command
- Left Join With 'join' Command
- Right Join With 'join' Command
- Full Outer Join With 'join' Command
- Selecting Specific Columns
- The 'join' Command Requires Sorted Records
- SQL Schema Used In This Article
I use the 'join' command to perform the equivalent of a relational SQL join between CSV files. For example, here is a file called 'a.txt':
1 Robert
2 John
3 Bob
and here is a file called 'b.txt':
2 Item 2
3 Item 3
4 Item 4
If I run the 'join' command like this:
join <(sort -k 1 a.txt) <(sort -k 1 b.txt)
the output will be the following:
2 John Item 2
3 Bob Item 3
This could be compared to running a SQL query like the following:
SELECT a.id, a.name, b.item_description FROM a INNER JOIN b ON (a.id = b.id);
id | name | item_description
----+------+------------------
2 | John | Item 2
3 | Bob | Item 3
(2 rows)
Example With Product Orders
Here, I have a tab separated list of orders in the file 'orders.txt' that includes the product name and order date:
Butter 2023-10-27
Butter 2023-10-27
Caviar 2023-10-27
Eggs 2023-10-26
Eggs 2023-10-28
Flour 2023-10-28
Milk 2023-10-26
Milk 2023-10-28
Milk 2023-10-28
Rice 2023-10-29
Rice 2023-10-30
Steak 2023-10-29
and here, I have file called 'products.txt' that includes a list of products with each item's price:
Eggs $4.00
Flour $2.00
Milk $3.00
Pasta $2.00
Pizza $3.00
Rice $1.00
Steak $15.00
I can use this 'join' command to join these two sets of records together like this:
join <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
Now, the order date and item price is included together for each record:
Eggs 2023-10-26 $4.00
Eggs 2023-10-28 $4.00
Flour 2023-10-28 $2.00
Milk 2023-10-26 $3.00
Milk 2023-10-28 $3.00
Milk 2023-10-28 $3.00
Rice 2023-10-29 $1.00
Rice 2023-10-30 $1.00
Steak 2023-10-29 $15.00
Specifying Join Column Index
In general, the 'join column' (the column that will be used to match up each record) may be different in each file. You can explicitly specify the join column index in each file, with the '-1' and '-2' flags:
join -1 1 -2 1 <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
The above 'join' command will produce exactly the same result as the previous one. If the 'orders.txt' file was changed to look like this:
Something1 Something2 Something3 2023-10-27 Butter
Something1 Something2 Something3 2023-10-27 Butter
Something1 Something2 Something3 2023-10-27 Caviar
Something1 Something2 Something3 2023-10-26 Eggs
Something1 Something2 Something3 2023-10-28 Eggs
Something1 Something2 Something3 2023-10-28 Flour
Something1 Something2 Something3 2023-10-26 Milk
Something1 Something2 Something3 2023-10-28 Milk
Something1 Something2 Something3 2023-10-28 Milk
Something1 Something2 Something3 2023-10-29 Rice
Something1 Something2 Something3 2023-10-30 Rice
Something1 Something2 Something3 2023-10-29 Steak
so that the join column index was 5 instead of 1, and the 'products.txt' was changed to look like this:
Something4 Something5 Something6 Something7 Something8 $4.00 Eggs
Something4 Something5 Something6 Something7 Something8 $2.00 Flour
Something4 Something5 Something6 Something7 Something8 $3.00 Milk
Something4 Something5 Something6 Something7 Something8 $2.00 Pasta
Something4 Something5 Something6 Something7 Something8 $3.00 Pizza
Something4 Something5 Something6 Something7 Something8 $1.00 Rice
Something4 Something5 Something6 Something7 Something8 $15.00 Steak
so that its join column index was 7 instead of 1, the updated 'join' command would look like this:
join -1 5 -2 7 <(sort -k 5 orders.txt) <(sort -k 7 products.txt)
Eggs Something1 Something2 Something3 2023-10-26 Something4 Something5 Something6 Something7 Something8 $4.00
Eggs Something1 Something2 Something3 2023-10-28 Something4 Something5 Something6 Something7 Something8 $4.00
Flour Something1 Something2 Something3 2023-10-28 Something4 Something5 Something6 Something7 Something8 $2.00
Milk Something1 Something2 Something3 2023-10-26 Something4 Something5 Something6 Something7 Something8 $3.00
Milk Something1 Something2 Something3 2023-10-28 Something4 Something5 Something6 Something7 Something8 $3.00
Milk Something1 Something2 Something3 2023-10-28 Something4 Something5 Something6 Something7 Something8 $3.00
Rice Something1 Something2 Something3 2023-10-29 Something4 Something5 Something6 Something7 Something8 $1.00
Rice Something1 Something2 Something3 2023-10-30 Something4 Something5 Something6 Something7 Something8 $1.00
Steak Something1 Something2 Something3 2023-10-29 Something4 Something5 Something6 Something7 Something8 $15.00
Using 'join' With Column Headers
If the 'orders.txt' file contained column headers like this:
product order_date
Butter 2023-10-27
Butter 2023-10-27
Caviar 2023-10-27
Eggs 2023-10-26
Eggs 2023-10-28
Flour 2023-10-28
Milk 2023-10-26
Milk 2023-10-28
Milk 2023-10-28
Rice 2023-10-29
Rice 2023-10-30
Steak 2023-10-29
and the 'products.txt' file contained similar column headers:
product price
Eggs $4.00
Flour $2.00
Milk $3.00
Pasta $2.00
Pizza $3.00
Rice $1.00
Steak $15.00
The '--header' flag can be included to correctly consider the first row of each file as header information:
join --header orders.txt products.txt
and the output from the above command will be:
product order_date price
Eggs 2023-10-26 $4.00
Eggs 2023-10-28 $4.00
Flour 2023-10-28 $2.00
Milk 2023-10-26 $3.00
Milk 2023-10-28 $3.00
Milk 2023-10-28 $3.00
Rice 2023-10-29 $1.00
Rice 2023-10-30 $1.00
Steak 2023-10-29 $15.00
NOTE: You will still need to ensure that the records in the file are sorted even when column headers are present. Simply running 'sort' on a file that includes headers will probably mix the header record into the data somewhere, which is probably not what you want!
Using 'join' With Custom Field Separator
If the columns in 'orders.txt' were separated by a command like this:
Butter,2023-10-27
Butter,2023-10-27
Caviar,2023-10-27
Eggs,2023-10-26
Eggs,2023-10-28
Flour,2023-10-28
Milk,2023-10-26
Milk,2023-10-28
Milk,2023-10-28
Rice,2023-10-29
Rice,2023-10-30
Steak,2023-10-29
and the columns in 'products.txt' were similarly separated like this:
Eggs,$4.00
Flour,$2.00
Milk,$3.00
Pasta,$2.00
Pizza,$3.00
Rice,$1.00
Steak,$15.00
you can use the '-t' flag to specify a custom field separator:
join -t ',' <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the output from the above command will be:
Eggs,2023-10-26,$4.00
Eggs,2023-10-28,$4.00
Flour,2023-10-28,$2.00
Milk,2023-10-26,$3.00
Milk,2023-10-28,$3.00
Milk,2023-10-28,$3.00
Rice,2023-10-29,$1.00
Rice,2023-10-30,$1.00
Steak,2023-10-29,$15.00
Inner Join With 'join' Command
By default, the 'join' command performs an inner join:
# Inner join
join <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
Eggs 2023-10-26 $4.00
Eggs 2023-10-28 $4.00
Flour 2023-10-28 $2.00
Milk 2023-10-26 $3.00
Milk 2023-10-28 $3.00
Milk 2023-10-28 $3.00
Rice 2023-10-29 $1.00
Rice 2023-10-30 $1.00
Steak 2023-10-29 $15.00
this can be compared to the following SQL query:
SELECT product_order.product, product_order.order_date, product.price FROM
product_order INNER JOIN product ON ( product.product = product_order.product);
and the output of this SQL query is as follows:
product | order_date | price
---------+------------+--------
Eggs | 2023-10-26 | $4.00
Eggs | 2023-10-28 | $4.00
Flour | 2023-10-28 | $2.00
Milk | 2023-10-26 | $3.00
Milk | 2023-10-28 | $3.00
Milk | 2023-10-28 | $3.00
Rice | 2023-10-29 | $1.00
Rice | 2023-10-30 | $1.00
Steak | 2023-10-29 | $15.00
(9 rows)
Left Join With 'join' Command
Here's how to do a left join using the 'join' command:
# Left join
join -a 1 -e NULL -o auto <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
Butter 2023-10-27 NULL
Butter 2023-10-27 NULL
Caviar 2023-10-27 NULL
Eggs 2023-10-26 $4.00
Eggs 2023-10-28 $4.00
Flour 2023-10-28 $2.00
Milk 2023-10-26 $3.00
Milk 2023-10-28 $3.00
Milk 2023-10-28 $3.00
Rice 2023-10-29 $1.00
Rice 2023-10-30 $1.00
Steak 2023-10-29 $15.00
this can be compared to the following SQL query:
SELECT product_order.product, product_order.order_date, product.price FROM
product_order LEFT JOIN product ON ( product.product = product_order.product);
and the output of this SQL query is as follows:
product | order_date | price
---------+------------+--------
Butter | 2023-10-27 |
Butter | 2023-10-27 |
Caviar | 2023-10-27 |
Eggs | 2023-10-26 | $4.00
Eggs | 2023-10-28 | $4.00
Flour | 2023-10-28 | $2.00
Milk | 2023-10-26 | $3.00
Milk | 2023-10-28 | $3.00
Milk | 2023-10-28 | $3.00
Rice | 2023-10-29 | $1.00
Rice | 2023-10-30 | $1.00
Steak | 2023-10-29 | $15.00
(12 rows)
Right Join With 'join' Command
Here's how to do a right join using the 'join' command:
# Right join
join -a 2 -e NULL -o auto <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
Eggs 2023-10-26 $4.00
Eggs 2023-10-28 $4.00
Flour 2023-10-28 $2.00
Milk 2023-10-26 $3.00
Milk 2023-10-28 $3.00
Milk 2023-10-28 $3.00
Pasta NULL $2.00
Pizza NULL $3.00
Rice 2023-10-29 $1.00
Rice 2023-10-30 $1.00
Steak 2023-10-29 $15.00
this can be compared to the following SQL query:
SELECT product.product, product_order.order_date, product.price FROM
product_order RIGHT JOIN product ON ( product.product = product_order.product);
and the output of this SQL query is as follows:
product | order_date | price
---------+------------+--------
Eggs | 2023-10-28 | $4.00
Eggs | 2023-10-26 | $4.00
Flour | 2023-10-28 | $2.00
Milk | 2023-10-28 | $3.00
Milk | 2023-10-28 | $3.00
Milk | 2023-10-26 | $3.00
Pasta | | $2.00
Pizza | | $3.00
Rice | 2023-10-30 | $1.00
Rice | 2023-10-29 | $1.00
Steak | 2023-10-29 | $15.00
(11 rows)
Full Outer Join With 'join' Command
Here's how to do a full outer join using the 'join' command:
# Full Outer
join -a 1 -a 2 -e NULL -o auto <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
Butter 2023-10-27 NULL
Butter 2023-10-27 NULL
Caviar 2023-10-27 NULL
Eggs 2023-10-26 $4.00
Eggs 2023-10-28 $4.00
Flour 2023-10-28 $2.00
Milk 2023-10-26 $3.00
Milk 2023-10-28 $3.00
Milk 2023-10-28 $3.00
Pasta NULL $2.00
Pizza NULL $3.00
Rice 2023-10-29 $1.00
Rice 2023-10-30 $1.00
Steak 2023-10-29 $15.00
this can be compared to the following SQL query:
SELECT coalesce(product.product, product_order.product) as product,
product_order.order_date, product.price FROM
product_order FULL OUTER JOIN product ON (product.product = product_order.product)
ORDER BY product ASC;
and the output of this SQL query is as follows:
product | order_date | price
---------+------------+--------
Butter | 2023-10-27 |
Butter | 2023-10-27 |
Caviar | 2023-10-27 |
Eggs | 2023-10-26 | $4.00
Eggs | 2023-10-28 | $4.00
Flour | 2023-10-28 | $2.00
Milk | 2023-10-28 | $3.00
Milk | 2023-10-26 | $3.00
Milk | 2023-10-28 | $3.00
Pasta | | $2.00
Pizza | | $3.00
Rice | 2023-10-29 | $1.00
Rice | 2023-10-30 | $1.00
Steak | 2023-10-29 | $15.00
(14 rows)
Selecting Specific Columns
The -o flag allows you to specify which columns will be present in the output, just like the select list in a SQL query. For example, this use of the 'join' command will select 'column 2 from file 1' (the 'order_date' column identified as 1.2), followed by 'column 2 from file 2' (the 'price' column identified as 2.2):
join -a 1 -a 2 -e NULL -o 1.2,2.2 <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
2023-10-27 NULL
2023-10-27 NULL
2023-10-27 NULL
2023-10-26 $4.00
2023-10-28 $4.00
2023-10-28 $2.00
2023-10-26 $3.00
2023-10-28 $3.00
2023-10-28 $3.00
NULL $2.00
NULL $3.00
2023-10-29 $1.00
2023-10-30 $1.00
2023-10-29 $15.00
this can be compared to the following SQL query:
SELECT product_order.order_date, product.price FROM
product_order FULL OUTER JOIN product ON (
product.product = product_order.product
)
ORDER BY coalesce(product_order.product, product.product);
and the output of this SQL query is as follows:
order_date | price
------------+--------
2023-10-27 |
2023-10-27 |
2023-10-27 |
2023-10-26 | $4.00
2023-10-28 | $4.00
2023-10-28 | $2.00
2023-10-28 | $3.00
2023-10-26 | $3.00
2023-10-28 | $3.00
| $2.00
| $3.00
2023-10-29 | $1.00
2023-10-30 | $1.00
2023-10-29 | $15.00
(14 rows)
Here is an example that will only select the 'order_date' column (identified as 1.2):
join -a 1 -a 2 -e NULL -o 1.2 <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
2023-10-27
2023-10-27
2023-10-27
2023-10-26
2023-10-28
2023-10-28
2023-10-26
2023-10-28
2023-10-28
NULL
NULL
2023-10-29
2023-10-30
2023-10-29
this can be compared to the following SQL query:
SELECT product_order.order_date FROM
product_order FULL OUTER JOIN product ON (
product.product = product_order.product
)
ORDER BY coalesce(product_order.product, product.product);
and the output of this SQL query is as follows:
order_date
------------
2023-10-27
2023-10-27
2023-10-27
2023-10-26
2023-10-28
2023-10-28
2023-10-28
2023-10-26
2023-10-28
2023-10-29
2023-10-30
2023-10-29
(14 rows)
And finally, here is an example that will only select the 'order_date' column (identified as 2.2):
join -a 1 -a 2 -e NULL -o 2.2 <(sort -k 1 orders.txt) <(sort -k 1 products.txt)
and the result is as follows:
NULL
NULL
NULL
$4.00
$4.00
$2.00
$3.00
$3.00
$3.00
$2.00
$3.00
$1.00
$1.00
$15.00
this can be compared to the following SQL query:
SELECT product.price FROM
product_order FULL OUTER JOIN product ON (
product.product = product_order.product
)
ORDER BY coalesce(product_order.product, product.product);
and the output of this SQL query is as follows:
price
--------
$4.00
$4.00
$2.00
$3.00
$3.00
$3.00
$2.00
$3.00
$1.00
$1.00
$15.00
(14 rows)
The 'join' Command Requires Sorted Records
The 'join' command requires that both input files be sorted, otherwise, the output can be incorrect. To illustrate this fact, here is a file 'a.txt' with its records in reverse sorted order:
3 Three
2 Two
1 One
and here is a file 'b.txt' that we'll join it to:
1 Description For 'One'
2 Description For 'Two'
3 Description For 'Three'
After running the following join command on these files like this:
join a.txt b.txt
the output is this:
join: a.txt:2: is not sorted: 2 Two
3 Three Description For 'Three'
Which includes a warning message along with incomplete output. If the command is updated to sort the records first, like this:
join <(sort -k 1 a.txt) <(sort -k 1 b.txt)
the output will now be joined correctly:
1 One Description For 'One'
2 Two Description For 'Two'
3 Three Description For 'Three'
SQL Schema Used In This Article
For the sake of completeness, you can find the Posgres SQL schema and data used in the creation of this article:
CREATE TABLE product_order (
product VARCHAR(255),
order_date VARCHAR(255)
);
CREATE TABLE product (
product VARCHAR(255),
price VARCHAR(255)
);
INSERT INTO product_order (product, order_date) VALUES
('Butter', '2023-10-27'),
('Butter', '2023-10-27'),
('Caviar', '2023-10-27'),
('Eggs', '2023-10-26'),
('Eggs', '2023-10-28'),
('Flour', '2023-10-28'),
('Milk', '2023-10-26'),
('Milk', '2023-10-28'),
('Milk', '2023-10-28'),
('Rice', '2023-10-29'),
('Rice', '2023-10-30'),
('Steak', '2023-10-29');
INSERT INTO product (product, price) VALUES
('Eggs', '$4.00'),
('Flour', '$2.00'),
('Milk', '$3.00'),
('Pasta', '$2.00'),
('Pizza', '$3.00'),
('Rice', '$1.00'),
('Steak', '$15.00');
CREATE TABLE a (
id serial,
name VARCHAR(255)
);
CREATE TABLE b (
id serial,
item_description VARCHAR(255)
);
INSERT INTO a (id, name) VALUES
(1, 'Robert'),
(2, 'John'),
(3, 'Bob');
INSERT INTO b (id, item_description) VALUES
(2, 'Item 2'),
(3, 'Item 3'),
(4, 'Item 4');
And that's why the 'join' command is my favourite Linux command.
Intro To 'stty' Command In Linux
Published 2023-10-04 |
$1.00 CAD |
Intro To 'nproc' Command In Linux
Published 2023-07-15 |
Intro To 'comm' Command In Linux
Published 2023-09-06 |
How To Force The 'true' Command To Return 'false'
Published 2023-07-09 |
A Surprisingly Common Mistake Involving Wildcards & The Find Command
Published 2020-01-21 |
A Guide to Recording 660FPS Video On A $6 Raspberry Pi Camera
Published 2019-08-01 |
Intro To 'chroot' Command In Linux
Published 2023-06-23 |
Join My Mailing List Privacy Policy |
Why Bother Subscribing?
|