Can you walk me through it if you have time? I am running this code on my data and it runs, but I get this warning for every line:

Quote

Use of uninitialized value in string eq at ./Kompare.pl line 58, <$in_fh2> line 261502. Use of uninitialized value in string eq at ./Kompare.pl line 58, <$in_fh2> line 261502. Use of uninitialized value in string eq at ...

I am totally confused about how the code works, but it is giving me the desired output in 5 minutes. I didn't really understand how.

Below is the changed code. I actually wanted all the matched and unmatched lines in one file.

The split splits each of the lines received as arguments to get the keys of those lines. Then we compare the keys. The cmp operator returns 0 if the keys compare equal, -1 if key2 is larger than key1, and +1 if key1 is larger than key2.
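As a hedged sketch (the sub name and key layout are assumptions; here the key is taken to be the first whitespace-separated field of each line), the comparison subroutine could look like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Comparison subroutine sketch: the key is assumed to be the first
# whitespace-separated field of each line; the sub name is illustrative.
sub compare_keys {
    my ($line1, $line2) = @_;
    my ($key1) = split /\s+/, $line1;   # key of line 1
    my ($key2) = split /\s+/, $line2;   # key of line 2
    return $key1 cmp $key2;             # -1, 0 or +1
}
```

Note that cmp compares as strings; if your keys are purely numeric and the file is sorted numerically, you would use the <=> operator instead.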

Now the main comparison part. I first read the first line of each file. Then I compare the lines, and based on the result of the comparison I fetch more lines. If the two keys compare equal (the subroutine returns 0 and we go into the else part), then I know that the lines are common to both files, so I print them to $OUT1 and $OUT2 and fetch a new line from both files:

If either of the two lines is not defined, then we have reached the end of the corresponding file, and the last exits the while loop. And we chomp the two new lines to prepare the comparison in the next iteration of the loop.

If the keys don't compare equal (the subroutine returns -1 or +1), then it means that we have an orphan. If the sub returns +1, it means that the key of line1 was larger than the key of line2, so file2 has a line that does not exist in file1. We print it as an orphan to $ORPH2. We keep line1 from the previous fetch and fetch a new line from file2. If line2 is not defined, it means that we are at the end of file2 and we are almost done; the "last" command exits the while loop. Otherwise we chomp the new line2 and go back to the beginning of the loop for the next comparison.

If the subroutine returned -1, it is just the opposite case: line1 is an orphan, we print it to $ORPH1 and fetch the next line from file1.

When we exit the loop, it means that at least one of the files has been exhausted. We still need to print as an orphan the last line we read from the other file, and to print as orphans all the lines remaining in that file (if any). That's what the four print statements after the end of the while loop do.
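Putting the steps above together, here is a minimal self-contained sketch of the loop (sub and variable names are illustrative, not the exact code from the thread; for simplicity, matched lines go to a single output filehandle, which is what you wanted anyway):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Key comparison: the key is assumed to be the first whitespace-separated
# field. Both input files must already be sorted on that key.
sub compare_keys {
    my ($l1, $l2) = @_;
    my ($k1) = split /\s+/, $l1;
    my ($k2) = split /\s+/, $l2;
    return $k1 cmp $k2;
}

# Parallel read of two sorted files: matches go to $out,
# orphans of file 1 to $orph1, orphans of file 2 to $orph2.
sub compare_files {
    my ($in1, $in2, $out, $orph1, $orph2) = @_;
    my $line1 = <$in1>;
    my $line2 = <$in2>;
    chomp $line1 if defined $line1;
    chomp $line2 if defined $line2;
    while (defined $line1 and defined $line2) {
        my $result = compare_keys($line1, $line2);
        if ($result == 0) {             # common line: print, advance both files
            print $out "$line1\n";
            $line1 = <$in1>;
            $line2 = <$in2>;
        }
        elsif ($result > 0) {           # line2 has no match in file 1
            print $orph2 "$line2\n";
            $line2 = <$in2>;
        }
        else {                          # line1 has no match in file 2
            print $orph1 "$line1\n";
            $line1 = <$in1>;
        }
        chomp $line1 if defined $line1;
        chomp $line2 if defined $line2;
    }
    # at least one file is exhausted: flush everything left as orphans
    print $orph1 "$line1\n" if defined $line1;
    print $orph2 "$line2\n" if defined $line2;
    while (my $l = <$in1>) { print $orph1 $l }
    while (my $l = <$in2>) { print $orph2 $l }
}
```

At any given time only two lines are in memory, which is why this approach scales to multi-gigabyte files.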

Is this clearer now?

As for the warning about an "uninitialized value", you would have to either post the full code or, at least, tell me what is on line 58 of your program.

That's also what I almost always do (except that I usually use a lexical filehandle rather than a bareword filehandle).

But here, since I need to read two files in parallel, with sometimes more lines from one source, sometimes more from the other source, it is just not possible. At least one of the files needs to be read line by line, on demand, depending on the conditions met. It would in theory be possible to read one file with a:

Code

while (<$IN>) { # ...

construct, and read the other one on demand as I do in my code. But this turns out to be a bit complicated, and I've found that finely controlling input on both sides makes things symmetrical and much simpler to code.

So, when I do something like:

Code

$ligne1 = <$IN1>;

I am just reading the next line from $IN1. The $IN1 filehandle is just an iterator that "remembers" where to read the next line. In the most common case, you put it in a while loop to read one line after the other; here, I am doing it "by hand", i.e. reading from whichever file I need input from at any particular time.

I am happy that this seems to solve your problem and does what you want (even if you had to change a couple of things). I am also glad that it does it quite fast; this I knew, since I have used this method many times with file volumes of usually several gigabytes, and it is about as fast as it can get, even though the initial sorting phase takes quite a bit of time. And available memory is simply not an issue: at any given time, we only have two lines in memory. Even with files 10 or 100 times larger, processing would obviously take longer, but memory usage would remain almost nothing (basically two lines of input, one for each file).

I am still a bit concerned about the "use of uninitialized value" warning that you get. It might be secondary or irrelevant (it seems to be, if you obtain the desired result), but I would personally never put into production a program displaying such warnings, because they indicate something is not really behaving as expected, even when the results are or seem to be good. Please post your full program so that I can investigate what is going on at line 58.

Yes, if you are looking for only one value in a sorted file of 10999999 values, then binary search will be much faster. But that is a very different problem from the one before. Besides, binary search in a file is not necessarily easy to implement in the general case; it is a bit easier if you know certain things about the file (such as a constant line length or other features that simplify the problem).
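For illustration, here is a hedged sketch of binary search in a file under the simplifying assumption of a constant record length (the sub name and layout are illustrative; $reclen includes the trailing newline, and the file must be sorted on the whole record):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary search in a sorted file of fixed-length records.
# Returns the record index of $target, or -1 if not found.
sub binsearch_file {
    my ($fh, $reclen, $target) = @_;
    seek $fh, 0, 2;                         # jump to end of file
    my $nrec = int(tell($fh) / $reclen);    # number of records
    my ($lo, $hi) = (0, $nrec - 1);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        seek $fh, $mid * $reclen, 0;        # jump straight to record $mid
        read $fh, my $rec, $reclen;
        chomp $rec;
        my $cmp = $rec cmp $target;
        return $mid if $cmp == 0;           # found
        if ($cmp < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return -1;                              # not found
}
```

With variable-length lines you would instead have to seek to a byte offset, skip forward to the next newline boundary, and read the following full line, which is exactly the extra complication mentioned above.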

The keys look very similar, except that you have an additional "12" at the beginning of each of the lines in file 2.

You'll probably need either to preprocess file 2 to remove the "12" at the beginning of each line, or to change the comparison subroutine slightly so that it does not use the first two characters of the lines in file 2. But do it in another copy of the program, to make sure the original program still works for your other files.
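A hedged sketch of the second option, assuming the key is the first whitespace-separated field and that file 2's keys carry a leading "12" (the sub name is illustrative, not the thread's actual code):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Variant comparison: strip the extra leading "12" from file 2's key
# before comparing. Assumes line 2 always comes from file 2.
sub compare_keys_f2_prefixed {
    my ($line1, $line2) = @_;
    my ($key1) = split /\s+/, $line1;
    my ($key2) = split /\s+/, $line2;
    $key2 =~ s/^12//;                  # drop the extra "12" prefix
    return $key1 cmp $key2;
}
```

Be aware that if the sort order of file 2 was produced with the "12" prefix in place, stripping the prefix does not change the relative order, so the sorted-input assumption still holds.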

The main part of the program expects the subroutine to return 0 if the keys are equal (you are currently doing just the opposite), and -1 or +1 if they are not, depending on which is higher. That's what the cmp operator does.

If file 2 has the exact same line twice, how can that be considered as a match?

Quote

Ex: File1:

1 ABC 45

File2:

1 ABC 45
1 ABC 45

Expected output:

1 ABC 45
1 ABC 45

Current output:

1 ABC 45

The matched output file has just one line, but 2 lines from file2 are matching. Since we move to the next line after a match, file1 and file2 end up on different data. Is there any approach for that, or should we call compare again in the match code?

what I am doing when I need to compare two files is to have a preprocessing of each file in order to remove duplicates. It is only then that I read the two files in parallel for finding "orphans". Sometimes I do both things in one go, but not when I use the module from which this code is taken. Note that this module has other functions, including one for finding and removing duplicates.

In your case, if a line appears twice in one file and only once in the other, then it can be argued that the second occurrence is an orphan. But that is not a technical question; it really depends on your business rules.

It would not be too difficult to change the code to take such cases into account (but the code would no longer be generic), but one would need to know the exact requirement.
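For example, one non-generic way to handle the duplicate case above, assuming duplicates sit on consecutive lines of the sorted file 2, would be to let the match branch consume every consecutive matching line of file 2 before advancing file 1. This is a sketch under those assumptions, not the module's actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Key comparison on the first whitespace-separated field (assumed layout).
sub compare_keys {
    my ($l1, $l2) = @_;
    my ($k1) = split /\s+/, $l1;
    my ($k2) = split /\s+/, $l2;
    return $k1 cmp $k2;
}

# Like the normal parallel read, but the match branch keeps reading
# file 2 while the keys stay equal, so duplicated matches are all printed.
sub match_with_duplicates {
    my ($in1, $in2, $out, $orph1, $orph2) = @_;
    my $line1 = <$in1>; chomp $line1 if defined $line1;
    my $line2 = <$in2>; chomp $line2 if defined $line2;
    while (defined $line1 and defined $line2) {
        my $r = compare_keys($line1, $line2);
        if ($r == 0) {
            # print every consecutive line of file 2 matching $line1
            while (defined $line2 and compare_keys($line1, $line2) == 0) {
                print $out "$line2\n";
                $line2 = <$in2>; chomp $line2 if defined $line2;
            }
            $line1 = <$in1>; chomp $line1 if defined $line1;
        }
        elsif ($r > 0) {
            print $orph2 "$line2\n";
            $line2 = <$in2>; chomp $line2 if defined $line2;
        }
        else {
            print $orph1 "$line1\n";
            $line1 = <$in1>; chomp $line1 if defined $line1;
        }
    }
    print $orph1 "$line1\n" if defined $line1;
    print $orph2 "$line2\n" if defined $line2;
    while (my $l = <$in1>) { print $orph1 $l }
    while (my $l = <$in2>) { print $orph2 $l }
}
```

Whether duplicates in file 1 should behave the same way is again a business-rules question, which is why the exact requirement matters.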

Re: [Laurent_R] Merging the data in two files using a hash
[In reply to]


Quote

what I am doing when I need to compare two files is to have a preprocessing of each file in order to remove duplicates. It is only then that I read the two files in parallel for finding "orphans". Sometimes I do both things in one go, but not when I use the module from which this code is taken. Note that this module has other functions, including one for finding and removing duplicates.

Can you tell me what other functions we have in the module? Yeah, this isn't a generic requirement. My approach here is to read the required values from the first file (which has unique lines) into a hash, read the second file into an array (or just the required values into variables after splitting), and compare those against the hash keys and values. But the real problem is when the file is too huge, e.g. 3 GB; then a hash wouldn't work well.

No problem, Tejas, I'll send you my full module, with its extensive documentation. I have made it free and open source anyway. I'll do it when I am in the office; I don't have it here on my mobile device.

Well, I developed this module (which I have sent to you, BTW) to read the two files in parallel precisely because the files I am reading are too large to fit into a hash in memory. It really depends on the size of your files and on your available resources.

Otherwise, you need only one hash for file 1; there is no need for an array for file 2. You can load one file into a hash and then read the other file line by line.
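A minimal sketch of that hash approach, assuming the key is the first whitespace-separated field (sub and variable names are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load file 1 into a hash keyed on the first field, then stream file 2
# line by line. Only file 1 has to fit in memory, and neither file
# needs to be sorted.
sub hash_compare {
    my ($in1, $in2, $out, $orph2) = @_;
    my %seen;
    while (my $line = <$in1>) {
        chomp $line;
        my ($key) = split /\s+/, $line;
        $seen{$key} = 1;                 # remember every key of file 1
    }
    while (my $line = <$in2>) {
        chomp $line;
        my ($key) = split /\s+/, $line;
        if ($seen{$key}) { print $out   "$line\n" }   # match
        else             { print $orph2 "$line\n" }   # orphan of file 2
    }
}
```

Note that this minimal version only reports the orphans of file 2; to also report the orphans of file 1, you would mark each hash key as you match it and print the unmarked keys at the end.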