[Unix/Linux] Remove lines that exist in another file
Sometimes we need to make sure that a file does not contain any of the lines listed in another file. This is common when we need to remove some settings from config files. It is easy if there are just a few lines to remove, but it becomes tedious and error-prone when there are many lines to remove on multiple hosts. We should use an automated approach instead of removing the lines manually. (Of course, we should still verify that the final result is correct.)
We will use a simple example to illustrate how we can remove the lines. Let’s say we are organizing a party. We have 2 files: all.txt contains all the friends we have and remove.txt contains the friends that are not going to the party.
all.txt
John
Tom
Tony
Alex
Michael
Kalvin
remove.txt
Tony
Alex
Chris
Note that “Chris” is found in remove.txt but not in all.txt. In reality, it is not uncommon to be requested to remove a line that does not exist.
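To follow along, the two example files can be recreated with a couple of printf calls. This is only a minimal sketch: the scratch directory and the workdir variable are assumptions added here so that no existing files are overwritten.

```shell
# Work in a scratch directory (an assumption, so nothing existing is overwritten).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One name per line, matching the listings above.
printf '%s\n' John Tom Tony Alex Michael Kalvin > all.txt
printf '%s\n' Tony Alex Chris > remove.txt

# Show the line count of each file.
wc -l all.txt remove.txt
```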
If the files are NOT sorted
1. grep -Fvxf remove.txt all.txt
We can just use the grep
command to filter out all the lines we don’t need.
noob@learnfromnoobs:~$ grep -Fvxf remove.txt all.txt
John
Tom
Michael
Kalvin
Some notes for the command:
- -F is needed so that grep interprets the patterns (the lines in remove.txt) as a list of fixed strings instead of regular expressions. It does not have any effect in our example, but configuration files often contain special characters. We need to make sure that they are not interpreted as regular expressions. Otherwise, grep might remove more or fewer lines than expected.
- -v is used so that only non-matching lines (lines that exist in all.txt but not remove.txt) are selected.
- -x is used so that only matches that exactly match the whole line are selected.
- -f is used so that we can obtain patterns from a file.
For more information, please refer to man grep.
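The effect of -F and -x is easier to see with lines that contain special characters. The sketch below uses hypothetical file names (hosts.conf, drop.conf) and made-up config lines that are not from the example above. It compares the command with and without the two options.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical config lines, made up for illustration.
printf '%s\n' 'host=10.0.0.1' 'host=10a0b0c1' 'host=10.0.0.10' > hosts.conf
printf '%s\n' 'host=10.0.0.1' > drop.conf

# Without -F the dots act as regex wildcards, and without -x a match
# anywhere in the line counts, so every line is treated as a match.
# "|| true" only keeps the script going when grep selects nothing.
without_flags=$(grep -vf drop.conf hosts.conf || true)

# With -F and -x only the line that is exactly "host=10.0.0.1" is removed.
with_flags=$(grep -Fvxf drop.conf hosts.conf)

printf 'without -F/-x: [%s]\n' "$without_flags"
printf 'with -F/-x:\n%s\n' "$with_flags"
```

In this sketch the first command removes all three lines, while the second one keeps the two lines that merely resemble the pattern.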
2. awk 'NR==FNR{array[$0];next} !($0 in array)' remove.txt all.txt
awk comes in handy when dealing with this kind of problem.
noob@learnfromnoobs:~$ awk 'NR==FNR{array[$0];next} !($0 in array)' remove.txt all.txt
John
Tom
Michael
Kalvin
We can break this command into a few parts to understand it.
- NR==FNR is a common way to determine whether we are reading the first file, which is remove.txt in our case.
- NR==FNR{array[$0];next} means that if the current line is from the first file, we save the line as a key in an array and move on to the next line.
- When reading lines from the second file (all.txt), we use !($0 in array) to check whether the current line is in the array we created earlier (i.e. the lines in remove.txt). Note that the default action {print} is omitted from the command, which is common practice. So the actual command we are using here is !($0 in array){print}, which means we want to print out the line if the current line is NOT in the array (again, the lines in remove.txt).
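To see why NR==FNR identifies the first file, it can help to print both counters for every line. The sketch below recreates the two example files in a scratch directory (an assumption, to keep it self-contained) and prints the file name, NR and FNR for each line read.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' John Tom Tony Alex Michael Kalvin > all.txt
printf '%s\n' Tony Alex Chris > remove.txt

# NR counts lines across all input files; FNR restarts at 1 for each file.
counters=$(awk '{print FILENAME, NR, FNR}' remove.txt all.txt)
printf '%s\n' "$counters"
```

The two counters are equal only while remove.txt is being read. One consequence worth keeping in mind: if remove.txt is empty, NR==FNR stays true while all.txt is read, so the command prints nothing.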
3. diff --new-line-format="" --old-line-format="%L" --unchanged-line-format="" all.txt remove.txt
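We can also use diff to get the same result after setting the right line formats. Note that diff compares the two files in sequence, so this method is reliable only when the lines to remove appear in remove.txt in the same relative order as in all.txt, as they do in our example.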
noob@learnfromnoobs:~$ diff --new-line-format="" --unchanged-line-format="" --old-line-format="%L" all.txt remove.txt
John
Tom
Michael
Kalvin
If you find it difficult to understand this command, we can break it into a few parts.
- Let’s examine the original diff result. We can see that it shows us which lines are removed/added, but not those lines that exist in both all.txt and remove.txt (“Tony” and “Alex”).
noob@learnfromnoobs:~$ diff all.txt remove.txt
1,2d0
< John
< Tom
5,6c3
< Michael
< Kalvin
---
> Chris
- Now we add the option --new-line-format="" to our command to get rid of the lines added (“Chris”).
noob@learnfromnoobs:~$ diff --new-line-format="" all.txt remove.txt
John
Tom
Tony
Alex
Michael
Kalvin
- As you might notice, now the lines we want to remove show up again (“Tony” and “Alex”). We have already handled the format for the new lines in the previous step. We are also responsible for the format of the unchanged lines and the old lines. We can remove the unchanged lines by setting --unchanged-line-format="". For the old lines, we can set --old-line-format="%L" so that it will just print them as is.
noob@learnfromnoobs:~$ diff --new-line-format="" --unchanged-line-format="" --old-line-format="%L" all.txt remove.txt
John
Tom
Michael
Kalvin
Finally, only the old lines (i.e. the lines that are unique to all.txt) are shown.
For more information, please refer to GNU Diffutils - Line Formats.
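One thing that may be worth checking before relying on the diff method is the order of the lines, because diff aligns the two files in sequence. The sketch below uses hypothetical files (letters.txt, drop.txt) in which the lines to remove are listed in a different order than in the main file, and compares the result with the grep method. It assumes GNU diff, since the line-format options are GNU extensions.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical data: drop.txt lists C before A, the reverse of letters.txt.
printf '%s\n' A B C > letters.txt
printf '%s\n' C A > drop.txt

# diff exits with status 1 when the files differ; "|| true" only
# keeps the script going in that case.
diff_out=$(diff --new-line-format="" --unchanged-line-format="" --old-line-format="%L" letters.txt drop.txt || true)

# grep does not care about the order of the patterns.
grep_out=$(grep -Fvxf drop.txt letters.txt)

printf 'diff result:\n%s\n' "$diff_out"
printf 'grep result:\n%s\n' "$grep_out"
```

Here grep prints only B, while diff can match at most one of A and C in sequence and therefore prints an extra line.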
If the files are sorted
4. comm -23 all.txt remove.txt
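If the order of the lines doesn’t matter to you, you can consider using the comm command to remove the unwanted lines. The -2 and -3 options suppress the lines unique to the second file and the lines common to both files, leaving only the lines unique to the first file. We have to ensure the files are sorted if we use this method.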
noob@learnfromnoobs:~$ sort all.txt > all_sorted.txt
noob@learnfromnoobs:~$ sort remove.txt > remove_sorted.txt
noob@learnfromnoobs:~$ comm -23 all_sorted.txt remove_sorted.txt
John
Kalvin
Michael
Tom
We can also make use of Process Substitution to sort the files when running the comm command.
noob@learnfromnoobs:~$ comm -23 <(sort all.txt) <(sort remove.txt)
John
Kalvin
Michael
Tom
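To make the role of -23 more visible, the sketch below first runs comm without any options, which prints three tab-separated columns: lines only in the first file, lines only in the second file, and lines in both. It then keeps only the first column. The scratch directory and the LC_ALL=C setting are assumptions made here so that sort and comm agree on the ordering.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: same collation for sort and comm

printf '%s\n' John Tom Tony Alex Michael Kalvin | sort > all_sorted.txt
printf '%s\n' Tony Alex Chris | sort > remove_sorted.txt

# All three columns, indented by tabs.
comm all_sorted.txt remove_sorted.txt

# -2 drops the lines unique to remove_sorted.txt, -3 drops the common lines.
only_in_all=$(comm -23 all_sorted.txt remove_sorted.txt)
printf '%s\n' "$only_in_all"

# For comparison, -12 keeps only the lines found in both files.
in_both=$(comm -12 all_sorted.txt remove_sorted.txt)
```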
5. join -v 1 all.txt remove.txt
join -v FILENUM all.txt remove.txt can be used to remove the lines of one file that also appear in the other. The -v option prints the lines in file FILENUM that cannot be paired with a line in the other file. FILENUM is set to 1 so that it prints the lines of the first input file (all.txt in our case) that are not in the other file (remove.txt).
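Note that we also need to ensure the input files are sorted when using this method. Also, by default join compares only the first whitespace-separated field of each line, which is sufficient for the single-word lines in our example.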
noob@learnfromnoobs:~$ join -v 1 all_sorted.txt remove_sorted.txt
John
Kalvin
Michael
Tom
Again, we can also use Process Substitution to sort the files when running the join command.
noob@learnfromnoobs:~$ join -v 1 <(sort all.txt) <(sort remove.txt)
John
Kalvin
Michael
Tom
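When lines contain spaces, join behaves differently from the other methods because it pairs lines on a join field rather than on the whole line. The sketch below uses hypothetical two-word names (not from the example above). It shows the default behavior next to one possible workaround: setting the field separator to a tab with -t, on the assumption that no line contains a tab, so the whole line becomes the join field.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
export LC_ALL=C   # assumption: same collation for sort and join

# Hypothetical two-word lines, made up for illustration.
printf '%s\n' 'John Smith' 'John Doe' 'Tom Lee' | sort > people_sorted.txt
printf '%s\n' 'John Doe' | sort > drop_sorted.txt

# Default: only the first field ("John") is compared, so both Johns are dropped.
default_out=$(join -v 1 people_sorted.txt drop_sorted.txt)

# With a tab as separator, each whole line is a single field.
tab=$(printf '\t')
whole_line_out=$(join -t "$tab" -v 1 people_sorted.txt drop_sorted.txt)

printf 'default:\n%s\n' "$default_out"
printf 'with -t tab:\n%s\n' "$whole_line_out"
```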
Conclusion
In this article, we discussed five different ways to remove lines that exist in another file:
grep -Fvxf remove.txt all.txt
awk 'NR==FNR{array[$0];next} !($0 in array)' remove.txt all.txt
diff --new-line-format="" --old-line-format="%L" --unchanged-line-format="" all.txt remove.txt
comm -23 all.txt remove.txt
join -v 1 all.txt remove.txt
Whichever method you end up choosing, I hope you enjoyed this article.
Keep learning and have fun!