I am reading from two files, and some lines appear in both of them. I need to write a function that detects which lines are found in both files. My current code reads file 1 and puts its lines in an ArrayList, then reads file 2 and, for each line, checks whether it is in the ArrayList. If it is, I know it is a duplicate line. My problem is that I am storing full lines in the ArrayList. Is it possible to convert each line I read into a hashcode and store that in the ArrayList instead, then compare it against the hashcode of each line I read from file 2? Is this a better approach for saving memory?

Two completely different strings can have the same hashcode. It's not possible to ensure uniqueness of an unlimited amount of possible character sequences in just an int value.
– BalusC, Oct 6 '11 at 13:20

It is an approach that will save memory, but it won't guarantee a correct match. The contract of hashCode() does not require hash codes to be unique, so two different lines can share the same one. If you want to store a smaller version of the string, store a digest of it, such as MD5.

MD5 is 16 bytes long so this will only save you memory if your strings are significantly longer than 8 characters (with 2 bytes per character).
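A minimal sketch of the digest idea, assuming the lines are already available as `Iterable<String>` (the class and method names here are hypothetical, not from the question). Each line from file 1 is reduced to its 16-byte MD5 digest, and the digest is wrapped in a `ByteBuffer` because a raw `byte[]` does not compare by content in a `HashSet`:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: stores MD5 digests of file 1's lines instead of the lines themselves.
final class DigestLineMatcher {

    // Returns the 16-byte MD5 digest of a line, wrapped so it can be used as a set key.
    static ByteBuffer md5(String line) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return ByteBuffer.wrap(md.digest(line.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Returns the lines of file 2 whose digest matches the digest of some line in file 1.
    static List<String> commonLines(Iterable<String> file1Lines, Iterable<String> file2Lines) {
        Set<ByteBuffer> seen = new HashSet<>();
        for (String line : file1Lines) {
            seen.add(md5(line));
        }
        List<String> common = new ArrayList<>();
        for (String line : file2Lines) {
            if (seen.contains(md5(line))) {
                common.add(line);
            }
        }
        return common;
    }
}
```

Keep in mind that each `ByteBuffer` and set entry carries its own object overhead on top of the 16 digest bytes, so the saving is probably only worthwhile for fairly long lines.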

But unless your files are extremely large, you really don't need to worry about memory and the HashSet answers will give you better results.

Edit:

MD5 can produce collisions, but not under real-world conditions. It should not be used as a cryptographic hash, but it would work fine in this circumstance. Other digest functions such as SHA-256 have a lower chance of collision, but their digests are larger.

MD5 is a cryptographic hashcode but it is more a digest algorithm compared to int hashCode(). It does emit collisions but not under real world conditions. The only collisions have been engineered in the lab.
– Gray, Oct 6 '11 at 13:50

Read the sorted file, one line at a time, and compare each line to the line before it (only ever keeping two lines in memory - the current and the previous line). If the current line and the previous line are the same, then the line occurs in both original files.
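The comparison described above could be sketched roughly as follows, assuming the sorted file is already open as a `BufferedReader` (the class and method names are hypothetical). Only the current and previous lines are held while scanning. The result list is collected here just for illustration and could be replaced by printing each match as it is found:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports each line that occurs more than once in an already-sorted input.
final class SortedDuplicateScanner {

    static List<String> repeatedLines(BufferedReader sorted) {
        try {
            List<String> repeated = new ArrayList<>();
            String previous = null;
            String current;
            while ((current = sorted.readLine()) != null) {
                // A line equal to its predecessor is a repeat; report it only once per run.
                boolean alreadyReported = !repeated.isEmpty()
                        && repeated.get(repeated.size() - 1).equals(current);
                if (current.equals(previous) && !alreadyReported) {
                    repeated.add(current);
                }
                previous = current;
            }
            return repeated;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```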

The "external sort" was frequently necessary in the earlier days of computing, when much less memory was available. One way of doing it was (and is) the merge sort, which, when used with tapes (remember tapes?), was known as a "tape sort". Yes, I am old :-)

Concatenating the two files makes it impossible to differentiate between the two files. An alternative is to sort both files and continually compare a line from file1 with a line from file2.
– Sjoerd, Oct 6 '11 at 15:06
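A rough sketch of that alternative, assuming both files have already been sorted in the same order that `String.compareTo` uses and are open as `BufferedReader`s (the class and method names are hypothetical). The reader holding the smaller line is advanced, and a line is reported when the two are equal:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: walks two sorted inputs in step and collects the lines they share.
final class SortedFileIntersection {

    static List<String> commonLines(BufferedReader sorted1, BufferedReader sorted2) {
        try {
            List<String> common = new ArrayList<>();
            String a = sorted1.readLine();
            String b = sorted2.readLine();
            while (a != null && b != null) {
                int cmp = a.compareTo(b);
                if (cmp < 0) {
                    a = sorted1.readLine();      // line from file 1 is smaller, move file 1 forward
                } else if (cmp > 0) {
                    b = sorted2.readLine();      // line from file 2 is smaller, move file 2 forward
                } else {
                    common.add(a);               // same line in both files
                    a = sorted1.readLine();
                    b = sorted2.readLine();
                }
            }
            return common;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Only one line from each file is held at a time, and a line repeated inside a single file is not mistaken for a match across files.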

I don't see anything in the original question that would require differentiating lines from the two files. All we need to know is if the same line appears in both. So, if, after sorting, a line is repeated, then it's in both files (assuming that it can't occur twice in the same file).
– GreyBeardedGeek, Oct 18 '11 at 2:28

If you're concerned about space or memory, use the HashSet that several people have already suggested, but convert the strings to base36 before storing them in it. To standardize things, I suggest stripping all the white space and punctuation from the string and converting it to lower case before creating the base36 equivalent. You then end up with a HashSet<String> in which each String holds the base36 encoding of a line instead of the entire line.