Journey of a noob

Learn From Noobs

[Unix/Linux] Remove lines that exist in another file

Sometimes we need to make sure a file does not contain any of the lines listed in another file. This is common when we have to remove some settings from configuration files. It is easy if there are just a few lines to remove, but it becomes tedious and error-prone when there are many lines to remove on multiple hosts. We should automate the removal instead of deleting the lines by hand. (Of course, we should still verify that the final result is correct.)

We will use a simple example to illustrate how we can remove the lines. Let’s say we are organizing a party. We have two files: all.txt lists all of our friends, and remove.txt lists the friends who are not going to the party.

all.txt

John
Tom
Tony
Alex
Michael
Kalvin

remove.txt

Tony
Alex
Chris

Note that “Chris” is found in remove.txt but not in all.txt. In reality, it is not uncommon to be asked to remove a line that does not exist.


If the files are NOT sorted

1. grep -Fvxf remove.txt all.txt

We can just use the grep command to filter out all the lines we don’t need.

noob@learnfromnoobs:~$ grep -Fvxf remove.txt all.txt
John
Tom
Michael
Kalvin

Some notes for the command:

  1. -F is needed so that grep interprets the patterns (the lines in remove.txt) as fixed strings instead of regular expressions. It has no effect in our example, but configuration files often contain special characters such as . or *. We need to make sure they are not interpreted as regular expressions; otherwise, grep might remove more or fewer lines than expected.
  2. -v is used so that only non-matching lines (lines that exist in all.txt but not remove.txt) are selected.
  3. -x is used so that a pattern matches only when it equals the whole line, not just a part of it.
  4. -f is used so that we can obtain patterns from a file.

For more information, please refer to man grep.
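To see why -F and -x matter, here is a small sketch using made-up configuration lines (the file names all.conf and remove.conf and their contents are assumptions for illustration, not part of the party example). It runs the same removal three times: with both options, without -F, and without -x.

```shell
# Hypothetical sample files, created in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'timeout=1.5' 'timeout=105' 'timeout=1.50' > "$tmp/all.conf"
printf '%s\n' 'timeout=1.5' > "$tmp/remove.conf"

# Fixed-string, whole-line match: only the exact line is removed.
exact=$(grep -Fvxf "$tmp/remove.conf" "$tmp/all.conf")

# Without -F, "." is a regex wildcard, so "timeout=105" is removed as well.
no_F=$(grep -vxf "$tmp/remove.conf" "$tmp/all.conf")

# Without -x, any line containing the string is removed, including "timeout=1.50".
no_x=$(grep -Fvf "$tmp/remove.conf" "$tmp/all.conf")

printf 'exact:\n%s\nno -F:\n%s\nno -x:\n%s\n' "$exact" "$no_F" "$no_x"
rm -rf "$tmp"
```

Only the first variant keeps both timeout=105 and timeout=1.50; each of the other two silently drops one extra line.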

2. awk 'NR==FNR{array[$0];next} !($0 in array)' remove.txt all.txt

awk comes in handy when dealing with this kind of problem.

noob@learnfromnoobs:~$ awk 'NR==FNR{array[$0];next} !($0 in array)' remove.txt all.txt
John
Tom
Michael
Kalvin

We can break this command into a few parts to understand it.

  1. NR==FNR is a common way to determine whether we are reading the first file, which is remove.txt in our case.
  2. NR==FNR{array[$0];next} means that if the current line is from the first file, we store the line as a key of the array and then skip straight to the next line, so the second rule never runs for lines of remove.txt.
  3. When reading lines from the second file (all.txt), we use !($0 in array) to check whether the current line is a key of the array we created earlier (i.e. one of the lines in remove.txt). Note that the default action {print} is omitted from the command, which is common practice. So the actual rule we are using here is !($0 in array){print}, which means we print the line if it is NOT in the array (again, the lines in remove.txt).
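The breakdown above can also be written out as a commented, multi-line awk program. This is only a sketch of the same one-liner: the array is renamed to unwanted for readability, and the sample files are recreated in a temporary directory so the snippet runs on its own.

```shell
tmp=$(mktemp -d)
printf '%s\n' John Tom Tony Alex Michael Kalvin > "$tmp/all.txt"
printf '%s\n' Tony Alex Chris > "$tmp/remove.txt"

kept=$(awk '
    # First file: NR (lines read overall) equals FNR (lines read in this file).
    NR == FNR {
        unwanted[$0]   # referencing the element creates the key
        next           # skip the remaining rule for this line
    }
    # Second file: print the line only if it was not stored as a key.
    !($0 in unwanted) { print }
' "$tmp/remove.txt" "$tmp/all.txt")

printf '%s\n' "$kept"
rm -rf "$tmp"
```

One edge case worth knowing: if remove.txt is empty, NR==FNR also holds while reading all.txt, so nothing is printed at all.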

3. diff --new-line-format="" --old-line-format="%L" --unchanged-line-format="" all.txt remove.txt

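We can also use diff to get the same result after setting the right line formats. Keep in mind that diff compares the two files in order, so this method gives the same result only when the lines shared by both files appear in the same relative order in each file (as “Tony” and “Alex” do here).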

noob@learnfromnoobs:~$ diff --new-line-format="" --unchanged-line-format="" --old-line-format="%L" all.txt remove.txt
John
Tom
Michael
Kalvin

If you find it difficult to understand this command, we can break it into a few parts.

  1. Let’s examine the original diff result. We can see that it shows us which lines are removed/added, but not those lines that exist in both all.txt and remove.txt (“Tony” and “Alex”).
noob@learnfromnoobs:~$ diff all.txt remove.txt
1,2d0
< John
< Tom
5,6c3
< Michael
< Kalvin
---
> Chris
  2. Now we add the option --new-line-format="" to our command to get rid of the lines added (“Chris”).
noob@learnfromnoobs:~$ diff --new-line-format="" all.txt remove.txt
John
Tom
Tony
Alex
Michael
Kalvin
  3. As you might notice, the lines we want to remove (“Tony” and “Alex”) now show up again. Once a line format is given, diff prints every line according to the line formats, and any format we have not set prints the line as is. We handled the new lines in the previous step, so we still have to deal with the unchanged lines and the old lines. We can hide the unchanged lines by setting --unchanged-line-format="". For the old lines, we can set --old-line-format="%L" so that they are printed as is.
noob@learnfromnoobs:~$ diff --new-line-format="" --unchanged-line-format="" --old-line-format="%L" all.txt remove.txt
John
Tom
Michael
Kalvin

Finally, only the old lines (i.e. the lines that are unique to all.txt) are shown.

For more information, please refer to GNU Diffutils - Line Formats.
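Because diff works on the order of the lines, it is worth checking what happens when remove.txt lists the same names in a different order. The sketch below assumes GNU diff (the line-format options are GNU extensions) and swaps “Tony” and “Alex” in remove.txt; the grep command from method 1 is used as the reference.

```shell
tmp=$(mktemp -d)
printf '%s\n' John Tom Tony Alex Michael Kalvin > "$tmp/all.txt"
# Same names as the original remove.txt, but "Alex" now comes before "Tony".
printf '%s\n' Alex Tony Chris > "$tmp/remove.txt"

# diff exits with status 1 when the files differ, which is expected here.
via_diff=$(diff --new-line-format="" --unchanged-line-format="" \
                --old-line-format="%L" "$tmp/all.txt" "$tmp/remove.txt" || true)
via_grep=$(grep -Fvxf "$tmp/remove.txt" "$tmp/all.txt")

# Count the lines each method kept.
diff_count=$(printf '%s\n' "$via_diff" | wc -l)
grep_count=$(printf '%s\n' "$via_grep" | wc -l)

printf 'diff kept %d lines, grep kept %d lines\n' "$diff_count" "$grep_count"
rm -rf "$tmp"
```

With the reordered remove.txt, diff can treat only one of the two shared names as unchanged, so it keeps one line more than grep does. If the order of the lines in remove.txt is not under our control, grep or awk is the safer choice.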


If the files are sorted

4. comm -23 all.txt remove.txt

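If the order of the lines doesn’t matter to you, you can consider using the comm command to remove the unwanted lines. By default, comm prints three columns: lines found only in the first file, lines found only in the second file, and lines found in both. The options -2 and -3 suppress the second and third columns, so only the lines unique to all.txt are left. We have to ensure the files are sorted if we use this method.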

noob@learnfromnoobs:~$ sort all.txt > all_sorted.txt
noob@learnfromnoobs:~$ sort remove.txt > remove_sorted.txt
noob@learnfromnoobs:~$ comm -23 all_sorted.txt remove_sorted.txt
John
Kalvin
Michael
Tom

We can also make use of process substitution (available in shells such as Bash) to sort the files when running the comm command.

noob@learnfromnoobs:~$ comm -23 <(sort all.txt) <(sort remove.txt)
John
Kalvin
Michael
Tom
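The other column options of comm are handy for checking the request itself. The sketch below recreates the sample files and, besides the lines to keep, also reports the names in remove.txt that never existed in all.txt (“Chris”). Setting LC_ALL=C for both sort and comm is an assumption made here so that the two tools agree on the sort order regardless of the system locale.

```shell
tmp=$(mktemp -d)
printf '%s\n' John Tom Tony Alex Michael Kalvin | LC_ALL=C sort > "$tmp/all_sorted.txt"
printf '%s\n' Tony Alex Chris | LC_ALL=C sort > "$tmp/remove_sorted.txt"

# Column 1 only: lines unique to all.txt (the lines we keep).
keep=$(LC_ALL=C comm -23 "$tmp/all_sorted.txt" "$tmp/remove_sorted.txt")

# Column 2 only: lines we were asked to remove that do not exist in all.txt.
missing=$(LC_ALL=C comm -13 "$tmp/all_sorted.txt" "$tmp/remove_sorted.txt")

# Column 3 only: lines that are actually removed.
removed=$(LC_ALL=C comm -12 "$tmp/all_sorted.txt" "$tmp/remove_sorted.txt")

printf 'keep:\n%s\nmissing:\n%s\nremoved:\n%s\n' "$keep" "$missing" "$removed"
rm -rf "$tmp"
```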

5. join -v 1 all.txt remove.txt

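join -v FILENUM all.txt remove.txt prints the lines of file FILENUM that cannot be paired with a line in the other file. We set FILENUM to 1 so that it prints the lines of the first input file (all.txt in our case) that have no match in the second file (remove.txt). Note that join compares only the join field, which by default is the first whitespace-delimited field, rather than the whole line. That makes no difference for our one-word lines, but it can remove too much when lines contain several fields.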

Note that we also need to ensure the input files are sorted when using this method.

noob@learnfromnoobs:~$ join -v 1 all_sorted.txt remove_sorted.txt
John
Kalvin
Michael
Tom

Again, we can also use process substitution to sort the files when running the join command.

noob@learnfromnoobs:~$ join -v 1 <(sort all.txt) <(sort remove.txt)
John
Kalvin
Michael
Tom
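Since join pairs lines on the first field only, it behaves differently from the whole-line methods as soon as a line has more than one word. The following sketch uses made-up two-word entries (hypothetical data, not part of the party example) and compares join with the grep command from method 1.

```shell
tmp=$(mktemp -d)
# Hypothetical two-word entries to show the first-field matching.
printf '%s\n' 'Tom Hanks' 'Tony Stark' | LC_ALL=C sort > "$tmp/all_sorted.txt"
printf '%s\n' 'Tony' > "$tmp/remove_sorted.txt"

# join pairs lines on the first field only, so "Tony" also knocks out "Tony Stark".
via_join=$(LC_ALL=C join -v 1 "$tmp/all_sorted.txt" "$tmp/remove_sorted.txt")

# grep -Fx compares whole lines, so both entries survive.
via_grep=$(grep -Fvxf "$tmp/remove_sorted.txt" "$tmp/all_sorted.txt")

printf 'join:\n%s\ngrep:\n%s\n' "$via_join" "$via_grep"
rm -rf "$tmp"
```

For files whose lines contain spaces, such as most configuration files, the whole-line methods (grep, awk, comm) are therefore the safer default.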

Conclusion

In this article, we discussed five different ways to remove lines that exist in another file:

  1. grep -Fvxf remove.txt all.txt
  2. awk 'NR==FNR{array[$0];next} !($0 in array)' remove.txt all.txt
  3. diff --new-line-format="" --old-line-format="%L" --unchanged-line-format="" all.txt remove.txt
  4. comm -23 all.txt remove.txt (both files must be sorted)
  5. join -v 1 all.txt remove.txt (both files must be sorted)
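Whichever command we pick, none of them edits all.txt in place, and we said at the beginning that the result should be verified. A minimal sketch of one way to do that, assuming we write the result to a new file (all.new is a name chosen here for illustration) and replace the original only after a check passes:

```shell
tmp=$(mktemp -d)
printf '%s\n' John Tom Tony Alex Michael Kalvin > "$tmp/all.txt"
printf '%s\n' Tony Alex Chris > "$tmp/remove.txt"

# Never redirect the output to the file being read; write to a new file first.
grep -Fvxf "$tmp/remove.txt" "$tmp/all.txt" > "$tmp/all.new"

# Check 1: no line from remove.txt may be left in the new file.
leftover=$(grep -Fxf "$tmp/remove.txt" "$tmp/all.new" | wc -l)

# Check 2: how many lines were actually removed.
before=$(wc -l < "$tmp/all.txt")
after=$(wc -l < "$tmp/all.new")
removed=$((before - after))

# Replace the original only when nothing unwanted is left.
if [ "$leftover" -eq 0 ]; then
    mv "$tmp/all.new" "$tmp/all.txt"
fi

final=$(cat "$tmp/all.txt")
printf 'removed %d line(s); result:\n%s\n' "$removed" "$final"
rm -rf "$tmp"
```

Here two lines are removed even though remove.txt has three entries, which is a quick hint that one of the requested lines (“Chris”) did not exist.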

Whichever method you choose, I hope you enjoyed this article.

Keep learning and have fun!