The superlist_product and superlist_substrate will encompass all the possible substrates and products in LIST, i.e., substrate(LIST) is a subset of superlist_substrate, and similarly for product(LIST). Now I want to create a SUPERARRAY as superlist_substrate (rows) X superlist_product (columns), then parse the LIST: for each substrate id, one by one, insert a "1" for each of its product ids in the SUPERARRAY. For example, consider the first two lines of LIST:

substrates: 3649

products: 3419 3648

So for substrate id 3649, the row with id=3649 will be selected from SUPERARRAY and a "1" will be inserted at column ids 3419 and 3648. And so on for the entire LIST. Basically, SUPERARRAY will be a matrix.
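The parsing step described above might be sketched in Perl as follows. This is a minimal sketch: the LIST format (alternating "substrates:" and "products:" lines, as in the example) and the name %matrix are assumptions; the sample data is fed from an in-memory filehandle so the script is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical two-line sample of LIST, read via an in-memory filehandle.
my $list = "substrates: 3649\nproducts: 3419 3648\n";
open my $fh, '<', \$list or die $!;

my %matrix;        # $matrix{$substrate}{$product} = 1
my @substrates;    # ids from the most recent "substrates:" line

while (my $line = <$fh>) {
    if ($line =~ /^substrates:\s*(.+)/) {
        @substrates = split ' ', $1;
    }
    elsif ($line =~ /^products:\s*(.+)/) {
        my @products = split ' ', $1;
        for my $s (@substrates) {
            $matrix{$s}{$_} = 1 for @products;
        }
    }
}

print "3649 -> 3419 seen\n" if $matrix{3649}{3419};
```

Reading from a real file would only change the open() line; the while loop stays the same.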

If your list is smaller than the super-arrays, your matrix will be sparse. You probably want to use a hash of hashes and create only the entries that exist in the list (other entries will be undefined).

At the end of the while loop, your hash of hashes is populated with 1's for each existing combination of substrate and product. Non-existing combinations will be undefined. When using this data structure, you will need to check for the existence of a combination. For example, you may have later in your code:
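Such a lookup might look like the following sketch (the %matrix contents and the ids are illustrative):

```perl
use strict;
use warnings;

# Illustrative hash of hashes as built from the LIST example.
my %matrix = ( 3649 => { 3419 => 1, 3648 => 1 } );

my ($sub, $prod) = (3649, 3419);
if ( exists $matrix{$sub} && exists $matrix{$sub}{$prod} ) {
    print "substrate $sub / product $prod exists\n";
}
else {
    print "no such combination\n";
}
```

Note that guarding with `exists $matrix{$sub}` first matters: a bare `$matrix{$sub}{$prod}` lookup autovivifies an empty inner hash for a substrate that was never seen.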

Thanks a lot, Laurent. But the problem is that my list would at most be ~5 to 10% smaller than the superarrays. Also, I need the output in a format with blanks wherever there isn't a "1". Actually, the quality of the output depends on the number of blanks too, because this output will then be compared to 20 other such outputs. So the positions of the blanks and the "1"s are equally important.

Right, I originally thought to put the data in the program (in the DATA section at the end). I then changed my mind and offered the OP the possibility of reading from a file, because I thought it was more useful to give an example of opening a file. And I forgot to change it the second time the file handle is used.

Hmmm, you don't seem to realize that a Cartesian product is far more demanding in terms of space allocation than what you think.

If you have, say, 1,000 substrates in your list (and 1,000 products), you end up with at least one million possible substrate/product combinations (actually more if you can have several products for one substrate). Most of these combinations are probably useless. This is why I suggest a sparse matrix modeled with a hash of hashes.

Laurent, thanks a lot for all your valuable suggestions and time. In my case the size of the supermatrix is 762 X 680, and the LIST has 740 substrates and 600 products. Therefore a minimal representation would mean a considerable loss of information, and the representation is very important for me. Computational resources aren't an issue. In that respect, could you please suggest a suitable method (one which takes care of blanks as well)? [EDIT] I would like to save this matrix to a text file.

From what you described, you really don't need a complete matrix. In a complete matrix, perhaps 99% or more of the elements will be 0 and 1% or less will be 1. The 99% are just useless. You only need to know whether you have a match or not. For that, a sparse matrix is far better: for a specific substrate/product combination, you only need to know whether the element exists.

I'm sorry, Laurent, for any misinterpretation. But in my case ~15% of the elements will be zero, because the size of the supermatrix is 762 X 680 (518,160 elements) and the LIST has 740 substrates and 600 products (444,000 elements): % blanks = 74,160/518,160 ≈ 15%. So I can't afford a sparse representation. Moreover, I need to create this matrix representation for 10 more such cases (i.e., different LISTs) for the same supermatrix, and all these 10 LISTs have at most 16% blanks relative to the supermatrix. So please help me. Thanks again.

Not quite right, Rushadrena. From the data examples you gave, each substrate has 1 or 2 products associated with it, not more (at least, 1 more often than more than 2). So you don't get at all a Cartesian product between 740 substrates and 600 products (444,000 elements), but only the actually existing combinations, i.e., assuming two products per substrate, at most 1,480 elements. This is very, very far from the 518,160 elements of a full matrix: less than 0.3%. This means that more than 99.7% of the full matrix would be unused, or, I would rather say, totally useless and worthless.

The other point is that, anyway, from the way you described your problem, all you really need to know is whether a specific substrate/product combination exists (i.e., where to assign the value 1) or not. For that, the solution I suggested is entirely sufficient. The sparse matrix approach I suggested contains exactly as much useful information about your data as a full matrix taking more than 300 times as much space in memory (and far longer to load).

I'm ready to make one concession, though. You may want to have available a full list of all the possible substrates and a full list of all the possible products, not just those in the input list, so that you can say: although this specific substrate/product combination does not exist in the input list, it would still be a possible candidate, since both the substrate and the product exist. If you want that, then all you need is two other simple hashes, one with all the possible 740 substrates and one with all the possible 600 products. So you would end up with two simple hashes and one hash of hashes, keeping about 3,000 elements in memory, still very far less than 500 k elements. These two hashes give you a virtual Cartesian product of all possibilities, but you never have to compute the actual Cartesian product.
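The two-hashes-plus-hash-of-hashes setup described above could be sketched like this (the ids and list contents are made up for illustration, and the full lists are trimmed to a few entries):

```perl
use strict;
use warnings;

# Full lists of possible ids (trimmed for illustration).
my %all_substrates = map { $_ => 1 } (3649, 3650);
my %all_products   = map { $_ => 1 } (3419, 3648);

# Combinations actually present in the input list.
my %seen = ( 3649 => { 3419 => 1, 3648 => 1 } );

# Classify a substrate/product pair against the three hashes.
sub classify {
    my ($s, $p) = @_;
    return 'in list'  if exists $seen{$s} && exists $seen{$s}{$p};
    return 'possible' if $all_substrates{$s} && $all_products{$p};
    return 'invalid';
}

print classify(3650, 3419), "\n";   # prints "possible"
```

The pair (3650, 3419) never occurs in the input list, but both ids exist in the full lists, so it is a possible candidate; that is exactly the "virtual Cartesian product" idea, with no million-element matrix anywhere.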

But your description of the problem is an extremely strong indication that a sparse matrix is really exactly what you need. And a hash of hashes is the ideal data structure to store it, because you need just one (pretty fast) line of code to retrieve the information you need (i.e., whether a given substrate/product combination exists in the input data).

I hope I am being clear in my explanations. I work a lot on quite similar problems; the one thing you want to avoid, especially as the volume of data grows, is the quadratic burden of a full Cartesian product (or, even worse, an exponential or factorial explosion of possibilities). Some of the problems I work on at my job can be solved within a few hours of computation with various techniques similar to the sparse matrix approach described above, but would probably not have finished by the final explosion of the sun and the end of the solar system if we tried to compute all the possibilities in a super-matrix approach.

One last example. My company has a database with about 35 million customers and about a million possible products and services. What is stored in the database is the list of services (usually 5 to 20) actually subscribed to by each customer, not a "super-matrix" of all possible customer/service combinations with 0s and 1s recording whether each service has been subscribed to or not. This "supermatrix" would have 35,000 billion elements, would take ages to query, and would require disk space that I can't even imagine. What a standard business-oriented database (e.g., Oracle) does, in effect, is implement a slightly more complicated version of the sparse matrix approach I have described.

I've been following this thread and some questions occurred to me. This would be such a large matrix that you would lose the header information if you had to scroll down a number of rows. Likewise, if you scrolled over to read the columns, you would lose the substrate ids in the leftmost column.

Here is a sparse table created from the sample LIST.txt file. It lists only the combinations seen, not the 'super' matrix you could create from the SUPERSUB and SUPERPROD lists.

Would it be better to create a comma separated file, where it could be opened by a spreadsheet program like Excel? Those programs can 'freeze' the column/row headers so you can easily scroll and still keep them visible.

Yes, you could easily generate a CSV file from the hash of hashes for importing the data into a spreadsheet. Or, better yet, you could use a CPAN module to write a spreadsheet file directly, for example Spreadsheet::WriteExcel, Spreadsheet::Write, Spreadsheet::SimpleExcel, etc.
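Generating such a CSV from the hash of hashes could look like the sketch below. The ids, the file name matrix.csv, and the %seen structure are made up for illustration; blank cells mark the absent combinations, which is the format the OP asked for.

```perl
use strict;
use warnings;

# Hypothetical superlists and observed combinations.
my @superprod = (3419, 3648);
my @supersub  = (3649, 3650);
my %seen = ( 3649 => { 3419 => 1, 3648 => 1 } );

open my $out, '>', 'matrix.csv' or die "matrix.csv: $!";
print {$out} join(',', '', @superprod), "\n";    # header row of product ids
for my $s (@supersub) {
    # "1" where the combination exists, empty cell otherwise.
    my @row = map { exists $seen{$s} && $seen{$s}{$_} ? 1 : '' } @superprod;
    print {$out} join(',', $s, @row), "\n";
}
close $out;
```

Opened in Excel (or any spreadsheet), the first row and first column can then be frozen as headers, as Chris suggested.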

I am happy that you presented the data in such a tabular form, Chris, as it shows Rushadrena graphically how sparse the data actually is. This example has 210 element slots, and only 17 of them are really useful: already less than 10%. And the more data you add, the larger the ratio of empty places to actually useful elements becomes.

Laurent and Chris, thanks a ton for sharing practical views and extensively exploring other realms of the problem space. Yes, the supermatrix will be very, very sparse. But let me add the last element to the problem I posed here. I need to create 10 such supermatrices and concatenate (a logical OR operation) them two at a time. So far I'm able to create a text file for each of these 10 supermatrices. Now for the last piece of the puzzle: I have created 10 such matrices (with, obviously, the same number of rows and columns), and I have to combine two of them. INPUT: two matrices A, B (each saved in a separate text file) with the same rows and columns. OUTPUT: a single matrix C, where C[i][j] = A[i][j] OR B[i][j].

Is there a way I can read a matrix from a text file into Perl, so as to access each element one by one? This is what I have tried. Though it reads the matrix from the file and prints it as is, I'm not able to access the elements one by one.
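One possible way to do this is to read each whitespace-separated matrix file into an array of array refs, so every element is addressable as $m->[$i][$j], and then OR two such matrices element by element. This is a sketch: the sample matrices are fed in as in-memory strings (Perl's open() accepts a scalar ref as an in-memory file), where real code would pass file names like 'A.txt' instead.

```perl
use strict;
use warnings;

# Read a matrix (one row per line, whitespace-separated elements)
# into an array of array refs.
sub read_matrix {
    my ($file) = @_;
    open my $fh, '<', $file or die "cannot open matrix file: $!";
    my @m;
    while (<$fh>) {
        push @m, [ split ];    # one array ref per row
    }
    return \@m;
}

my $a = read_matrix(\"1 0\n0 1\n");   # stand-in for read_matrix('A.txt')
my $b = read_matrix(\"0 0\n1 1\n");   # stand-in for read_matrix('B.txt')

# C[i][j] = A[i][j] OR B[i][j]
my @c;
for my $i (0 .. $#$a) {
    for my $j (0 .. $#{ $a->[$i] }) {
        $c[$i][$j] = ( $a->[$i][$j] || $b->[$i][$j] ) ? 1 : 0;
    }
}
```

This assumes the files contain 0s and 1s separated by whitespace; if the blanks are literal empty CSV cells, the split would need to be adapted accordingly.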

You might not know it, but pairing every one of the 10 matrices with each other will create another 45 matrices. That is because the number of unique pairings is a combination of 10 items taken 2 at a time.

(10 * 9) / (2 * 1) = 45

So you will have a total of 55 matrices. And you will probably want each in its own file, (or perhaps not).

Here is some output from what I worked on. It's OK for small tables, but I don't think you will be able to view a 400-column matrix that way. You should probably create a comma-separated values file (.csv) to be read by a program like Excel, which handles those files.

Dear Chris, actually I don't need to visualize the matrix; I just need to process it as is, and for that purpose a text file is sufficient. Chris, could you pass on the code (t1.pl) you have written to achieve the OR of two matrices? That would be really helpful.

Your idea of storing the data in a file in tabular form is probably wrong.

It is far easier to store the hash of hashes structure directly using Data::Dumper. Then you only have to open the file, slurp its content, and use eval on it to recreate the hash.
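A sketch of that round trip follows. The %matrix content is illustrative, and the dump is kept in a variable here; in real code it would be written to a file and slurped back later. Setting $Data::Dumper::Terse makes Dumper() emit the bare structure rather than a "$VAR1 = ..." assignment, so the eval'd text simply returns the hash ref.

```perl
use strict;
use warnings;
use Data::Dumper;

local $Data::Dumper::Terse = 1;    # bare structure, no "$VAR1 ="

my %matrix = ( 3649 => { 3419 => 1, 3648 => 1 } );

# "Save": Dumper() produces Perl source text that rebuilds the structure.
my $dump = Dumper(\%matrix);

# "Load": eval the (slurped) text to get the hash of hashes back.
my $restored = eval $dump;
die "restore failed: $@" if $@;

print "round trip ok\n" if $restored->{3649}{3419};
```

(For larger structures, Storable's store/retrieve would be a common alternative, but the Data::Dumper text file has the advantage of being human-readable.)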

And combining two hashes the way you want to is very easy; it only takes three lines of code (as shown in Chris Charley's code suggestion).
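The merge itself could be sketched like this (the names %a, %b, %c and the sample data are made up; this is the sparse equivalent of C[i][j] = A[i][j] OR B[i][j]):

```perl
use strict;
use warnings;

# Two sparse matrices as hashes of hashes.
my %a = ( 3649 => { 3419 => 1 } );
my %b = ( 3649 => { 3648 => 1 }, 3650 => { 3419 => 1 } );

# OR-combine: a combination exists in %c if it exists in %a or in %b.
my %c;
for my $s (keys %a, keys %b) {
    $c{$s}{$_} = 1 for keys %{ $a{$s} || {} }, keys %{ $b{$s} || {} };
}
```

Substrates present in both inputs are simply visited twice, which is harmless since the inner assignment just rewrites the same 1s.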

One final point: I can see from your code that you're still trying to build the complete matrix instead of a sparse one. This is simply the wrong approach: it takes more space, more time, and more code.