We can see that 28053391,1,19,DAD,766709012,0,-70,01-AUG-14,01-AUG-14,1 is repeated in both files, but the latest occurrence should be taken, so the output contains the second file's version of that line.

The output should contain all of the first file's txns and all of the second file's txns; if a txn (keyed on the first column) repeats in the second file, the second file's data has to be taken.
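A minimal sketch of that merge, assuming the key is the first comma-separated column (file names and the helper name are placeholders, not from the thread):

```perl
use strict;
use warnings;

# Read both files into one hash keyed on the first CSV column.
# Because the second file is processed last, its lines overwrite
# any duplicate keys coming from the first file.
sub merge_txns {
    my ($file1, $file2, $out) = @_;
    my %txns;
    my @order;    # remember first-seen order of keys
    for my $file ($file1, $file2) {
        open my $fh, '<', $file or die "could not open $file <$!>";
        while (my $line = <$fh>) {
            chomp $line;
            my ($key) = split /,/, $line, 2;
            push @order, $key unless exists $txns{$key};
            $txns{$key} = $line;    # second file's line wins
        }
        close $fh;
    }
    open my $out_fh, '>', $out or die "could not open $out <$!>";
    print {$out_fh} "$txns{$_}\n" for @order;
    close $out_fh;
}
```

This is the in-memory hash approach discussed below; it is simple but needs enough RAM to hold all unique lines.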

open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $out_fh, '>', $final_output or die "could not open $final_output <$!>";

# Since I don't know what your data represents, I can't come up with a better name,
# so I kept your hash name
my %hash;

I have answered on your other post and see now that you posted the same question twice; Fishmonger has already given you a more detailed answer there. Posting twice leads to duplication of work, please don't do it.

1GB of RAM is considered very low these days, especially if you're using Windows.

The files in the 200-300MB range shouldn't be much of a problem, but the GB-sized files will be, due to your limited RAM.

My first recommendation is to add more RAM. IMO, 4GB should be the minimum when doing this kind of work on Windows.

If you can't upgrade the RAM, then you could filter the data through a database rather than storing everything in memory via a hash. You could parse each line as you're currently doing, but instead of assigning a hash value, you store that data in the DB. You could even access your csv files with SQL statements as if they were database tables. Once the 2 input files have been processed, you execute another query that dumps the data directly to a new csv file.

Going the DB route will mean slightly more complex coding, but it will also reduce the memory footprint and won't hang the system as in your previous experience.
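This is not the MySQL route described in the steps below, but the same idea can be sketched with only core Perl: tie the hash to a disk-backed DBM file (SDBM_File) so the merge no longer has to fit in RAM. Note that SDBM limits each key+value pair to roughly 1kB, which is fine for short CSV lines; file names here are placeholders.

```perl
use strict;
use warnings;
use Fcntl;
use SDBM_File;

# Same merge logic as the in-memory hash, but the hash lives on
# disk, so memory usage stays small regardless of file size.
sub merge_on_disk {
    my ($file1, $file2, $db) = @_;
    tie my %txns, 'SDBM_File', $db, O_RDWR | O_CREAT, 0666
        or die "could not tie $db <$!>";
    for my $file ($file1, $file2) {
        open my $fh, '<', $file or die "could not open $file <$!>";
        while (my $line = <$fh>) {
            chomp $line;
            my ($key) = split /,/, $line, 2;
            $txns{$key} = $line;    # file2 overwrites duplicates
        }
        close $fh;
    }
    return \%txns;    # still tied to the disk file
}
```

A real database (MySQL, SQLite) additionally gives you SQL queries over the data, which the DBM file does not.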

3) insert file2.txt via a slightly adjusted "LOAD DATA INFILE" statement, i.e., add the REPLACE keyword so that when a duplicate ID (primary key) is seen, the row from file1 is updated/replaced with the row from file2.

4) once both files are loaded, execute a "SELECT ... INTO OUTFILE" statement to dump the data to a new csv file.

Hi Tejas, is your data sorted? It seems to be, but I don't understand fully how.

Update: the reason I am asking is that if the data is sorted one way or another in accordance with the comparison key, then you could read the files in parallel and remove the duplicates as you go. The good thing about this approach is that it will work for files of just about any size, irrespective of RAM size, and it will be much faster than a database approach. The downside is that it requires a bit of cleverness, or rather care and attention, to get the algorithm really right.

Re: [Laurent_R] Merging the data in two files using a hash
[In reply to]


Quote

then you could read the files in parallel and remove the duplicates as you go

I can get the files sorted by the keys in Perl itself.

But do you want me to open both files at the same time and check for the keys?

Do you have a snippet for that?

The worst-case scenario would be that the two files have no matching keys at all, so that ultimately all the data in both files has to be stored in a separate file.

Quote

I said it already. You first need to sort the files on their comparison key, say in ascending order (using the Unix sort utility, for example).

Then you open both new sorted files, read the first line of each. If the keys compare equal, then you have a common record. Store the line in a file of common records (if you need one) and move to the next line for both files. And repeat the key comparison.

If they don't compare equal, then the smaller of the two corresponds to an "orphan", i.e. a record that is in the file where you found it and not in the other. Write that out to an orphan file. Then get the next line of the file where the orphan was found, keeping the line from the other file, and repeat the comparison.

And so on until the end of one file, at which point any remaining lines in the other file are also orphans.
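The steps above can be sketched as follows, assuming both files are already sorted on the key (first comma-separated field) and comparing keys as strings, so alphanumeric keys work too. All file and variable names are illustrative, not the module mentioned below:

```perl
use strict;
use warnings;

# Merge two key-sorted CSV files: lines whose keys match go to the
# common-records file (keeping the second file's version), unmatched
# lines go to per-file orphan files.
sub merge_sorted {
    my ($in1, $in2, $common, $orph1, $orph2) = @_;
    open my $fh1, '<', $in1    or die "could not open $in1 <$!>";
    open my $fh2, '<', $in2    or die "could not open $in2 <$!>";
    open my $c,   '>', $common or die "could not open $common <$!>";
    open my $o1,  '>', $orph1  or die "could not open $orph1 <$!>";
    open my $o2,  '>', $orph2  or die "could not open $orph2 <$!>";
    my $line1 = <$fh1>;
    my $line2 = <$fh2>;
    while (defined $line1 and defined $line2) {
        my ($k1) = split /,/, $line1, 2;
        my ($k2) = split /,/, $line2, 2;
        if ($k1 eq $k2) {          # common record: keep file2's line
            print {$c} $line2;
            $line1 = <$fh1>;
            $line2 = <$fh2>;
        }
        elsif ($k1 lt $k2) {       # orphan from file 1
            print {$o1} $line1;
            $line1 = <$fh1>;
        }
        else {                     # orphan from file 2
            print {$o2} $line2;
            $line2 = <$fh2>;
        }
    }
    # whatever remains in either file is an orphan
    while (defined $line1) { print {$o1} $line1; $line1 = <$fh1>; }
    while (defined $line2) { print {$o2} $line2; $line2 = <$fh2>; }
    close $_ for $fh1, $fh2, $c, $o1, $o2;
}
```

Each line of either file is read exactly once, so memory use is constant no matter how large the files are.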

I have written a generic module to do that (and a number of other things on large files), and I am using it regularly, but I have not uploaded it to CPAN so far, because uploading a module requires doing a few additional steps (preparing an install procedure, etc.) that I don't know (yet) how to do.

But if you are trying to do it and don't succeed (and show how you've tried), I would gladly post the core algorithm.

The file comparison is extremely fast, but the initial sorting of the files has an overhead, which is why I discouraged you from trying this approach, given that your hash approach is giving good results in view of the data size.

Here's what you suggested when I had a similar problem last time. And my keys are not just numbers; there are alphanumeric keys too.

open my $in_fh1, '<', $clr_txns or die "could not open $clr_txns <$!>";
open my $in_fh2, '<', $temp_file or die "could not open $temp_file <$!>";
open my $out_fh, '>', $final_output or die "could not open $final_output <$!>";

If you are under Linux or Unix, you can use the OS's sort utility, which can sort files much larger than RAM by using temporary files on disk, but I do not know if there is such utility on Windows.

In Reply To

But do you want me to open both files at the same time and check for the keys?

yes, that was the idea. It is detailed in the older post you quoted from me just above.

In Reply To

Do you have a snippet for that?

Yes, I could provide one, but please explain exactly what you are trying to do, as I am not entirely sure of the details. It seems to me that you are trying to remove from one file data items that also exist in the other file. Is this correct? Is there more to it?

But this snippet would only work for sorted data, so that it depends on whether you are really able to sort the files on their keys.

Re: [Laurent_R] Merging the data in two files using a hash
[In reply to]


Quote

Yes, I could provide one, but please explain exactly what you are trying to do, as I am not entirely sure of the details. It seems to me that you are trying to remove from one file data items that also exist in the other file. Is this correct? Is there more to it?

Yes, this is the task. But there will be more operations and changes to the mathematical operations; the comparisons, however, will definitely be there. And I have also specified in my earlier post that the keys will not just be numbers; there will be alphanumerics and alphabetic keys too (e.g. aHXPVTTRER). If your code can help me, I will definitely use it, as it works on sorted files and I assume the number of comparisons would be much smaller comparatively.

Finally, I didn't really get why you brought up a Windows sort utility; I never use Windows at all. My work is entirely on Linux, so I can use the command-line sort utility, and I will be glad to use your code snippet.

I must say that this part of your code surprised me a bit when I saw it (essentially: why do you delete from the hash?), but since I haven't understood in detail what you are trying to do, I do not know whether this is correct. That's the problem with this thread and the previous one on the same subject: you haven't defined precisely what you want to do, and I am not even sure you really know for sure yourself. When you want to write a program, you first need to clarify exactly what you want it to do (often by writing some specs or some business rules, or at the very least by having them very clear in your own mind). Unless I missed an important post, your description of what you want is far from precise enough.

Well, enough talking, I'll try to write up some code based on my best comprehension of what you need, you'll probably have to adapt it to fit your real needs. But at least you'll have a basic algorithm, hopefully well coded, to use, and hopefully you'll have only implementation details to change.

Re: [Laurent_R] Merging the data in two files using a hash
[In reply to]


Alright, a first very simple solution, which might work or might be too simple for your needs.

This assumes you just want to remove duplicates, or, in other words, retain only one line for every unique key. In this case, you can just use the sort utility to merge together and sort the data from both files and produce one file with unique values. Note that, from what you said previously, your sort should be alphanumerical, not numerical.

This function is receiving two lines from the calling function, splitting the lines to get the keys, and comparing the keys. It returns 1 if the keys are equal (duplicates) and 0 otherwise. In one of my real programs, this function would be much shorter (probably 2 or 3 lines) and would be most probably stored in a coderef rather than a regular function, but I tried to make it as simple as possible to help you understand the principle.
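A hedged reconstruction of that comparison function (the original code block did not survive in this copy of the thread; the key is assumed to be the first comma-separated field, compared as a string so that alphanumeric keys work):

```perl
use strict;
use warnings;

# Receives two lines, splits each to get its key (first CSV field)
# and returns 1 if the keys are equal (duplicates), 0 otherwise.
sub same_key {
    my ($line1, $line2) = @_;
    my ($key1) = split /,/, $line1, 2;
    my ($key2) = split /,/, $line2, 2;
    return $key1 eq $key2 ? 1 : 0;
}
```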

Please also note that it really makes sense to separate the functional rules (how to compare records, stored in this function) from the technical duplicate removing part (below). It means you can reuse the technical part and just change the functional part for another similar problem.

Now the duplicate removal. This assumes you have already opened three filehandles: $FH_IN for the input, $FH_DUPL for printing out the duplicates, and $FH_OUT for output of the unique lines.
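A hedged reconstruction of that duplicate-removal loop (again, the original code was lost in this copy), assuming $FH_IN delivers the merged, key-sorted lines:

```perl
use strict;
use warnings;

# Reads a key-sorted stream on $FH_IN; the first line of each key
# goes to $FH_OUT, any further lines with the same key go to $FH_DUPL.
sub remove_duplicates {
    my ($FH_IN, $FH_DUPL, $FH_OUT) = @_;
    my $prev = <$FH_IN>;
    return unless defined $prev;
    print {$FH_OUT} $prev;
    my ($prev_key) = split /,/, $prev, 2;
    while (my $line = <$FH_IN>) {
        my ($key) = split /,/, $line, 2;
        if ($key eq $prev_key) {
            print {$FH_DUPL} $line;    # same key as before: duplicate
        }
        else {
            print {$FH_OUT} $line;     # new key: keep it
            $prev_key = $key;
        }
    }
}
```

Because the input is sorted, each line only needs to be compared with its predecessor, so a single pass suffices.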

I haven't tested the above because I don't really have data to do it, but I believe this should work, because it is a simplified version of something that I have tested extensively. I might have goofed something when simplifying it, but if such is the case, it should be easy enough to fix it.

I'll post a bit later a more complex solution where the two files are read in parallel. But the one above might just be sufficient for your needs.

If you need a more detailed output than what I suggested above, you might try the following.

This assumes that 6 filehandles are open before we start:
- $IN1 and $IN2 for the two input files
- $ORPH1 and $ORPH2 for orphans (records in one file but not in the other)
- $OUT1 and $OUT2 for common lines (two files because the keys of the input files might be the same while the content is not necessarily identical)

Of course, you can simplify all this if some files are not needed.

The comparison function needs to be slightly different than before, because it needs to return three possible values:
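That three-way comparison might look like this (a reconstruction, since the original code did not survive the copy; Perl's string cmp already returns the three needed values):

```perl
use strict;
use warnings;

# Compare two CSV lines on their key (first field).
# Returns -1 if line1's key sorts first, 0 if the keys are equal
# (a common record), and 1 if line2's key sorts first.
sub compare_keys {
    my ($line1, $line2) = @_;
    my ($key1) = split /,/, $line1, 2;
    my ($key2) = split /,/, $line2, 2;
    return $key1 cmp $key2;
}
```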

Same comment as my previous post: I haven't tested on your data, because I don't have enough of your data, so I might have goofed one detail here or there, but my module from which I took the code has been thoroughly tested in real life applications and is believed to be bug free.

Re: [Laurent_R] Merging the data in two files using a hash
[In reply to]


Hi, I am sorry to say that our needs change day by day, and I am adding new stuff day by day.

The only reason I am deleting from the hash is that I do not need the entries whose total sum is 0. Otherwise I will end up printing txns with amount 0 as well as non-zero ones, and I am interested in non-zero txns only.

Basically I have two files:
1. A file with today's transaction report, which has updates of historical txns and current txns
2. A historical non-zero txns report
Generally, we check the latest sum of amounts.

First I check whether the historical txns are available in today's report.
If no --> they are still non-zero (this should be printed).
If yes --> then there are 2 cases:
1. They can be zero
2. They can be non-zero but with some modification (as they are available in today's report, there will definitely be a change in the amount)

That is the only reason why I am deleting the values with 0 from the hash, and at the end I will just print those values which are non-zero.

This means that a historical txn has an update today and the total amount is 0, so we don't need it to be printed, as it is happily balanced.

But the example below is an unbalanced case, where there is an update but the total is still non-zero. We have to print the latest data, as there is an update:
696795792443,6,308,APY,550,0,24-JUL-14,24-JUL-14,3 (in first file)
696795792443 6 14-308 APY 521806975 550 -1550 1000 24-JUL-14 15-AUG-14 4 (in second file, today's data)

Finally, all the unmatched txns should also be printed, as they are all non-zero.

The first file always has non-zero values. The second file has both zero and non-zero values. (There will be a lot of unmatched txns with 0, which also should not be printed; I am finding a way to do that. And matched txns with 0 are anyhow being deleted, so only the above case has to be dealt with.)

All that I am doing is printing all the non-zero values from both files.

Then compare: if a non-zero txn has an update in the latest file and the new amount is zero, ignore it; if it has an update and the amount is non-zero, print it; if it is not available in the current file, it means it is still non-zero (so print it).
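Putting those rules together as a sketch (the helper name and file names are placeholders; the amount is assumed to be the fifth comma-separated field, index 4, which is a guess -- adjust it to the real record layout):

```perl
use strict;
use warnings;

# Load the historical (non-zero) file, overwrite entries with
# today's lines, then drop every key whose latest amount is zero.
# What remains are exactly the txns that should be printed:
# untouched historical txns, updated-but-still-non-zero txns,
# and new non-zero txns from today.
sub nonzero_txns {
    my ($hist_file, $today_file) = @_;
    my %txns;
    for my $file ($hist_file, $today_file) {
        open my $fh, '<', $file or die "could not open $file <$!>";
        while (my $line = <$fh>) {
            chomp $line;
            my @fields = split /,/, $line;
            $txns{ $fields[0] } = $line;    # today's line wins
        }
        close $fh;
    }
    for my $key (keys %txns) {
        my @fields = split /,/, $txns{$key};
        delete $txns{$key} if $fields[4] == 0;    # balanced: drop it
    }
    return \%txns;
}
```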