Fast method for finding differences between two unsorted files

At work I had two files: one with 13 000 records (filea) and one with 300 000 records (fileb), with one record per line. I was looking for a way to find which records from filea are not in fileb. Because of the huge number of records per file, “grep -v -f fileb filea” was taking forever, and sort-based tools such as comm would have needed both files to be sorted first.

After a few searches on the world wide web, I found a very quick way of doing this with awk:

awk 'NR == FNR { A[$0]=1; next } !A[$0]' fileb filea
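
To see how the one-liner works, here is a minimal sketch on two tiny made-up files. The sample records, the temporary directory and the "missing" variable are my own assumptions for illustration and are not part of the original data:

```shell
# Work in a throwaway directory so no real files are overwritten
# (assumes mktemp -d is available, as on most Linux and macOS systems).
cd "$(mktemp -d)" || exit 1

# Hypothetical sample data: one record per line, unsorted.
printf '%s\n' apple banana cherry > filea
printf '%s\n' cherry apple durian > fileb

# While reading the first file (fileb), NR == FNR is true,
# so every line is stored as a key in the array A and "next" skips the rest.
# While reading the second file (filea), NR is larger than FNR,
# so only the pattern !A[$0] runs: the line is printed if it is not a key in A.
missing=$(awk 'NR == FNR { A[$0] = 1; next } !A[$0]' fileb filea)

echo "$missing"   # prints: banana
```

The array lookup is what avoids the sorting step: each file is read only once and neither has to be in any particular order. The price is memory, because every record of the first file is kept as an array key. One caveat: if the first file is empty, NR == FNR stays true for the second file as well, so nothing is printed.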

If you want to find out more about this fast and easy solution (no need to presort the files), there is a similar example with an explanation here.