I need to update the missing 462nd column data with its own column data, but only if the data in the 86th and 461st columns matches. In my case, 2008 and 1111 appear in two rows; the 462nd column data is missing from one of them, and I need to update that row with xyz, the same value as in row 1.

Do I have to remember the xyz from the first data line and put it on the 4th data line,
and while I am doing that,
also remember the zxc from the 3rd line and put it on both the 5th and 8th lines?
With a million lines, you really don't want a solution that depends on lines being adjacent
unless you can guarantee that condition absolutely.

Do I have to remember the xyz from the first data line and put it on the 4th data line? Yes.
Do I also have to remember the zxc from the 3rd line and put it on both the 5th and 8th lines? Yes.
Actually the requirement is something like this in Oracle: Table1
-----
86th     461st    462nd
column   column   column
-----
col1     col2     col3
-----
2008     1111     xyz
2008     1111     null
2009     2222     abc
2010     3333     zxc
2010     3333     null
2011     4444     pqr
2012     null     ijk
null     5555     dfg
-----

@shahid,
That's a very clear statement. But it raises another issue I had not considered.

(Please don't think I am picking on you - I do this for a living too.)

It is not clear that the contents of Table2 are unique.
That is, suppose your Table1 has a row with 2009 2222 abc
and somewhere else there is a row with 2009 2222 jkl.

Is that possible? If so, what is the desired result?

In Oracle, I think Table2 would contain both rows (unless it is defined to disallow duplicates).
But I'm not sure what the Update ... where would do about it.

The Where selects multiple rows and so the update has a choice of two different values of table2.col3.

The Create Table2 scans the whole file (unless you index on col3), and Table2 can have almost as many rows as Table1. And the Update is going to be hard to optimize on the join, so it will probably read both tables in full again.

I believe any scripted solution would need to read the input twice, and hold a big memory table corresponding to Table2.
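In awk, that shape would be roughly the following (a minimal sketch, assuming tab-separated fields, the literal string null marking the missing values, and the input file named twice so it is read twice):

gawk 'BEGIN { FS = OFS = "\t" }
NR == FNR { if ($462 != "null") t[$86, $461] = $462; next }
$462 == "null" && ($86, $461) in t { $462 = t[$86, $461] }
{ print }' sample.txt sample.txt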

I would use awk because I normally work on vanilla Unix systems and can't be sure of Perl, so I never learned it. If you need a Perl solution, we might wait for TigerPeng to come in.

If awk is OK with you, I can post something reasonable. Your input has at least 462 columns over a million-plus lines, so it is a pretty big file.

The content of Table2 is unique. In other words, Table1 will only have two rows for a given combination of the 86th and 461st columns: one with data and the other with null.

2009 2222 abc
2009 2222 null

I am looking for a solution that preserves the row count after updating Table1: say Table1 has 100 rows; after updating it with the missing records, it should still have 100 rows when validated.
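That check is easy from the shell; for example (the file names are placeholders):

wc -l table1.txt fixed.txt

The two counts should match, because the fix only rewrites existing rows and never adds or deletes any.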

I have derived Table2 from Table1 using IS NOT NULL.

If we can achieve this using awk, I am okay with it.

But the only thing is that my file, i.e. TABLE1, will have 480 columns, so at the end, after running awk, I need a file that still has all 480 columns, with the data updated for the missing records.

It works under several assumptions:
- You need an awk that can deal with your count of fields;
  gnu awk (gawk) is OK.
- You need to ensure that values sit in their proper fields;
  an empty field would not comply (default whitespace splitting cannot see it), so it is better to use, for example, the string "null".
- ... plus some assumptions about the logic in your data (already under discussion).
The above solution fits your example data. It has not been tested with a huge data file similar to yours; I am curious how fast or slow it will be.

The original awk program was written in 5 lines to avoid splitting one long line. If you see it on one line only, it is probably a browser issue (IE6?). To have it on one line, the semicolons are needed, as shown below:
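Something along these lines, as a sketch (assuming whitespace-separated fields, the string null marking the missing values, the first occurrence defining the value, and field numbers 86, 461, and 462; use 1, 2, and 3 for the three-column sample data):

gawk '{ k = $86 SUBSEP $461; if ($462 != "null" && !(k in v)) v[k] = $462; if ($462 == "null" && k in v) $462 = v[k]; print }' sample.txt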

IMHO the duplicates are not a problem, provided that a given combination of values in columns 86 and 461 is associated with only one value in column 462.
Then the first occurrence defines the proper value for column 462.

gawk is working fine now on the sample data after including the semicolons, but the thing is that I am loading the output data into a table, and when running gawk on the original file the output loses its tab delimiters, so the columns shift out of their tab positions.
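That is most likely the output field separator: whenever awk assigns to a field, it rebuilds the whole line using OFS, which defaults to a single space. Pinning both separators to a tab in a BEGIN block should keep the layout (the same sketch as above, with the separators set):

gawk 'BEGIN { FS = OFS = "\t" } { k = $86 SUBSEP $461; if ($462 != "null" && !(k in v)) v[k] = $462; if ($462 == "null" && k in v) $462 = v[k]; print }' sample.txt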

The logic you gave is working perfectly fine as per my requirement, but I have a doubt here: can we have the input data sorted so that rows with a non-blank col3 show up earlier in the file? I am not sure if sorting would make any difference. Please suggest.

Hi Shahid,
if the data order in your input file is not important for other reasons, then you could try sorting it. The sort command will depend on the data appearing in column 3 (or 462).
For example, supposing the field to be filled is really empty (not a string like 'null'), the command:

sort -rk1,2 -k3 sample.txt

will ensure that the rows with given values in col1 and col2 form a compact block, and that the row with a non-blank col3 shows up earlier in this block.
You can pipe the sorted output to gawk:

sort ... sample.txt | gawk '{ ... }'

I am not sure whether this can lead to a significant speed improvement, as I think awk's optimization is quite good. Just try it.

You can change the field numbers in the sort command as shown, but you have to change those numbers consistently in the awk part as well. Sort keeps the content of every line intact (it cannot reorder the fields); it only reorders the lines. So in the whole command line use either 1,2,3 or 86,461,462.
There are some limitations on how field numbers can be combined in the sort command, but it should work as you wrote it.
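With the real column numbers it might look like this (a sketch: -t pins sort's field separator to a tab so that empty fields still count, and the global -r keeps the rows with a non-blank 462nd field ahead of the blank ones within each block):

sort -t"$(printf '\t')" -r -k86,86 -k461,461 -k462,462 sample.txt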

Maybe Paul will come with his solution. It certainly would be more sophisticated.

There could be another easy solution if your data file is output from the database. Is it possible to sort the data during file creation? Then it could be sufficient to sort only by the column that is to be changed; that ensures that all rows with this field empty are at the end of the file.
It would be most efficient to sort the data at the source. Then you can omit the sorting in the command line.

Good, you have it! I got another idea: sort is doing too much. We just need a simple rearrangement, i.e. moving all rows with an empty field #462 to the end of the file. You can try whether the following will also run satisfactorily:
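A one-pass rearrangement of that kind might be (a sketch: complete rows are printed as they arrive; incomplete rows are buffered in memory and printed at the end):

gawk 'BEGIN { FS = "\t" } $462 != "" { print; next } { hold[++n] = $0 } END { for (i = 1; i <= n; i++) print hold[i] }' sample.txt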

I have a solution to your main problem which I will have to post from home tonight. On your sort issue, you don't need to extend /tmp: if you have some spare space in another partition, you can tell sort to use it with the -T option.
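For example (the directory is just a placeholder for wherever you have the space):

sort -T /some/big/partition/tmp -rk1,2 -k3 sample.txt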

In fact, I have a function for applemcg which is in my profile. It sets a good memory size and work space for my host, then adds any other options, like keys, on the end.
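Here is the shape of it, as a sketch (assumptions: tab-delimited data, an empty 462nd field marking the missing value, and the file names arriving on stdin; fixed.txt is just an example name):

echo sample.txt | gawk '
{
    file = $0                        # the pipe carries file names, not data

    # Pass 1: remember the last non-empty column 462 for each key pair.
    while ((getline line < file) > 0) {
        split(line, f, "\t")
        if (f[462] != "")
            fix[f[86], f[461]] = f[462]
    }
    close(file)

    # Pass 2: re-read the file, fill the blanks, and print full records.
    while ((getline line < file) > 0) {
        n = split(line, f, "\t")
        if (f[462] == "" && (f[86], f[461]) in fix)
            f[462] = fix[f[86], f[461]]
        out = f[1]
        for (i = 2; i <= n; i++)
            out = out "\t" f[i]
        print out
    }
    close(file)
}' > fixed.txt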

This has awk reading the file internally (twice), so what gets passed into awk through the pipe is just the file name, NOT the data.

A couple of things are not very obvious.

Fixes are made from the LAST occurrence of the key fields found on the scan. It would not be hard to use the first, or to report when there were multiple distinct versions of the fix.

Because the filenames go in through the pipe, you can pass in any number of names and process any number of different files, just as if they were shell arguments.

It would not be hard to have awk copy the file to a new filename constructed from the input filename either.

The table of fixes might have a million entries and make the awk process grow to 15 MB, which should be OK.

I've been looking for an awk task to hang a blog around, and this is a good example: non-trivial, but not too complex. Below is how I would write it more professionally. It has comments, it uses column names for the fields, and it does some diagnostic counts.
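In outline, something like this (a sketch: it assumes tab-delimited data, an empty 462nd field for the missing values, and the input file named twice on the command line so awk reads it twice; save it as, say, fix.awk):

#!/usr/bin/gawk -f
# Fill the missing 462nd column from rows with a matching (86th, 461st) pair.
BEGIN {
    FS = OFS = "\t"
    KEY1 = 86; KEY2 = 461; FIX = 462    # column "names" for the fields
}
# First pass over the file: build the table of fixes (last occurrence wins).
NR == FNR {
    if ($FIX != "")
        fixval[$KEY1, $KEY2] = $FIX
    next
}
# Second pass: apply the fixes and keep some diagnostic counts.
$FIX == "" {
    if (($KEY1, $KEY2) in fixval) {
        $FIX = fixval[$KEY1, $KEY2]
        ++nFixed
    } else
        ++nOrphan
}
{ print }
END {
    printf "rows: %d, fixed: %d, unfixable: %d\n",
           FNR, nFixed, nOrphan > "/dev/stderr"
}

Run it as: gawk -f fix.awk sample.txt sample.txt > fixed.txt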