
Manipulating very large arrays

Hello,

Background: Consider spreadsheet grids with related data in 5,000 to 50,000 cells. Now consider 1000 of these spreadsheet grids ('documents') and a GUI to manipulate and research their data.

Requirement: I need to be able to filter those documents 'on-the-fly' to show, for example, some data from "Column B" where "Column A" = some_var. This search is performed frequently depending on the user's activity.

Problem: The issue appears to be size and search performance: 40,000 cells add up to more than 40 MB (???) in PHP, according to memory_get_usage().

So far I have tried various implementations; I've noted the best of them below:

To minimize the size of the arrays, I have tried smaller 'chunks' -- but this requires iterating over the chunks which adds more processing time.
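For comparison, a chunked pass can be sketched with a generator, which caps peak memory at roughly one chunk while still paying the per-chunk iteration cost mentioned above (the file name and chunk size are illustrative):

```php
<?php
// Sketch only: stream a spreadsheet export in fixed-size chunks so the
// full grid is never materialised as one PHP array.
function csvChunks(string $file, int $size = 1000): Generator
{
    $fh = fopen($file, 'r');
    $chunk = [];
    while (($row = fgetcsv($fh)) !== false) {
        $chunk[] = $row;
        if (count($chunk) === $size) {
            yield $chunk;   // caller processes one chunk at a time
            $chunk = [];
        }
    }
    if ($chunk) {
        yield $chunk;       // trailing partial chunk
    }
    fclose($fh);
}

foreach (csvChunks('document.csv') as $chunk) {
    // filter $chunk here; memory stays at roughly one chunk's worth
}
```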

I have also considered storing/indexing results in another table, but the results for the same search are likely to change as documents are updated, and they are specific to each user, depending on his/her permissions within the system. So the storage space required, plus updating the table for every user whenever a document's information or permissions change, would be resource intensive.

To facilitate other required functionality, it would be awesome to be able to keep each spreadsheet grid in an array 'document' so it can be managed quickly and handled as if it were a spreadsheet file. We've tried creating separate tables in the database for each spreadsheet column but found it limiting and extremely slow for some of the more complex tasks we need to perform on the data.

Am I going about this the wrong way? Any suggestions or discussion on this would be appreciated. Thanks!

Yes, arrays are memory hogs to begin with in PHP, and even if they weren't, that much data would still require a lot of memory.

Not sure how feasible this is for your app, but I would consider a methodology that would load the spreadsheet data into a database (maybe using SQLite or MongoDB if you don't want to go with a full-bore relational database), and then do your data manipulation via DB operations -- something a database is optimized for.
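As a rough sketch of that idea, assuming PHP's PDO SQLite driver is available and using made-up table/column names ($grid, $docId, and $someVar stand in for the existing application data):

```php
<?php
// Sketch: load one spreadsheet grid into SQLite once, then filter
// with SQL instead of iterating over large PHP arrays.
$db = new PDO('sqlite:documents.db');
$db->exec('CREATE TABLE IF NOT EXISTS cells (
    documentID INTEGER,
    colA TEXT,
    colB TEXT
)');

// Import the existing PHP array ($grid) for one document...
$ins = $db->prepare('INSERT INTO cells (documentID, colA, colB) VALUES (?, ?, ?)');
foreach ($grid as $row) {
    $ins->execute([$docId, $row['A'], $row['B']]);
}

// ...then the "Column B where Column A = some_var" filter becomes a query.
$q = $db->prepare('SELECT colB FROM cells WHERE documentID = ? AND colA = ?');
$q->execute([$docId, $someVar]);
$matches = $q->fetchAll(PDO::FETCH_COLUMN);
```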

Another possible solution would be a "multi-threaded" request, similar to what has been dubbed "web workers" on the client side.
This method splits the load of a one-thread process across two or more threads; even on a single-core or single-CPU setup, it has benefits for keeping memory usage and CPU usage low.
You will need to split the script so that one part does all the reading and converting and the other does all the processing on the multi-dimensional array, and use the sockets, stream, or cURL extensions to "load" the "parser" script. "Processing" script example:

PHP Code:

<?php
// "Load" the web worker via HTTP; this way the script runs.
// parser.php should store the data in a file, so we don't need to do
// anything with the return value of file_get_contents().
if (file_get_contents('http://example.com/path/to/parser.php') !== false) {
    include 'path/to/ss.data.php';
}
// process $spreadsheet...
?>
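The "parser" half isn't shown above; assuming it writes the array out with var_export(), it might look something like this (build_spreadsheet_array() is a placeholder for the real conversion step):

```php
<?php
// Hypothetical parser.php: read and convert the source data, then
// persist the array so the "processing" script can include it cheaply.
$spreadsheet = build_spreadsheet_array(); // placeholder for the real step

file_put_contents(
    'path/to/ss.data.php',
    '<?php $spreadsheet = ' . var_export($spreadsheet, true) . ';'
);
```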

I experienced a similar problem in a graphing application I am just about to finish, where data was being read from very large tables across networks and servers. Some of the processing was complex and I didn't know how to do all of it in T-SQL alone. My solution was to use batch files, SQL, and Windows Task Scheduler to create smaller, more manageable tables (there are downloadable applications for this, depending on the DBMS you're using, if you don't want to create one yourself). Often the tables were created through a series of joins, which reduced the complexity needed, so graphs could be retrieved or generated on the fly a lot faster. I then created a small index of what each table contains so future users can understand it.

In your case, the only time I would ever use PHP to do all the processing is if the data set and arrays are very small; otherwise you're going to wait a while. The most straightforward approach, and the one I'd go with first, would be to use a DBMS; which one you choose is up to you, based on your needs and restrictions. The reason is that a DBMS is built to process large data sets and is accessible through built-in PHP functions. If you cannot use a DBMS, your next options would be parsing the data in chunks, cURL, or multi-threading.

So far, I have tried NogDog's approach. I have tested queries on 1.1M rows, which gives query times consistently around 0.8 s; that isn't fast enough, unfortunately. I have also tried creating a "view" in the database but, as expected, the query took the same amount of time.

ShrineDesigns' multi-threaded request proposal is interesting. I had also considered a temp table to hold query data. Is this a similar approach?

Error404, I appreciate the suggestions. I'm not familiar with the approach, so I'll need to research the method a bit more.

I've been trying different options and have run into a (basic) question...I feel like I'm overlooking something too easy...

When I use phpMyAdmin to query the database (SELECT * FROM `myInfo` WHERE `indexedFieldID2` IN ( 1,2,3 ) AND `someField` = 'X'), it says "Showing rows 0 - 29 ( 15,986 total, Query took 0.0039 sec)". Somehow it knows there were 15,986 rows (out of 2 million) that met my query criteria...and it figured that out in 0.0039 seconds. That's great!

The problem is that I need to query the 2 million rows again using an ID field in those 15,986 rows in the WHERE clause. If I return all 15,986 results from the query, the query takes far too long. Is there a way to use the info from the first query in a second query without returning it?

I thought the following INNER JOIN might work. But when I use the following, it takes 1.3 seconds which is still too long (since the query is run frequently):
SELECT *
FROM `myInfo` AS m
INNER JOIN `myInfo` AS e ON e.indexedFieldID1 = m.parentID
WHERE e.`indexedFieldID2` IN (1, 2, 3)
  AND e.`someField` = 'X'
  AND m.`someField` = 'Y'

The table has an index on indexedFieldID1 and indexedFieldID2.

Having given the other options a bit more thought, I think NogDog's idea would work best for my particular needs in the long-run with other peripheral activities required. Just need to find a way to improve the query times.

Is there another way (a better formed query, perhaps, or temporary table, or another aspect to databases) that would be better suited for this?

[maybe this question should be moved to the SQL section of the forum?]

- Do you need to retrieve all of the columns from the table? If not, specifying the exact ones may help out a bit.
- For your inner join, instead of joining with the entire table, have you tried using a sub-query that retrieves only the relevant part of that table?
- You may want to consider a temporary table that already has the result of the join(s), so you would only need to search for the appropriate ID.
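In MySQL terms, and reusing the table/column names from the query above, the second and third suggestions might look roughly like this (actual performance will depend on your data and indexes):

```sql
-- Join against a derived table (sub-query) instead of the whole table.
SELECT m.*
FROM myInfo AS m
INNER JOIN (
    SELECT indexedFieldID1
    FROM myInfo
    WHERE indexedFieldID2 IN (1, 2, 3)
      AND someField = 'X'
) AS e ON e.indexedFieldID1 = m.parentID
WHERE m.someField = 'Y';

-- Or materialise the matching IDs once, then search the small table.
CREATE TEMPORARY TABLE matchedIDs AS
    SELECT indexedFieldID1
    FROM myInfo
    WHERE indexedFieldID2 IN (1, 2, 3)
      AND someField = 'X';

SELECT m.*
FROM myInfo AS m
INNER JOIN matchedIDs AS t ON t.indexedFieldID1 = m.parentID
WHERE m.someField = 'Y';
```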

You can have the results of one query be used in another; however, it takes a bit more SQL code, and I'm not sure it would be as fast as you need.

With Microsoft SQL Server, you can view the full execution plan and find out which part of your query is taking the larger share of the time and processing. I've used phpMyAdmin, but not extensively; is there a similar feature?

Thank you, NogDog, for the idea to 'explain' the query. I did just that in phpMyAdmin and noticed that it was not using the index I had set up; in fact, the query 'type' was 'ALL' and no index was selected. I'm not sure if there's a better way of identifying the best index to use, but with multiple indexes and trial and error, I was able to reduce the query time to 0.11 s, which is sufficient for our needs.
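For anyone following along, the EXPLAIN check and the kind of index changes described above might look something like this in MySQL (the composite index definitions are guesses based on the WHERE clause; verify against EXPLAIN on your own data):

```sql
-- See which index (if any) MySQL picks; type 'ALL' means a full scan.
EXPLAIN SELECT m.*
FROM myInfo AS m
INNER JOIN myInfo AS e ON e.indexedFieldID1 = m.parentID
WHERE e.indexedFieldID2 IN (1, 2, 3)
  AND e.someField = 'X'
  AND m.someField = 'Y';

-- A composite index covering the filter columns can help the optimizer
-- more than separate single-column indexes (speculative; re-run EXPLAIN
-- after creating them to confirm they are actually used).
CREATE INDEX idx_field2_some ON myInfo (indexedFieldID2, someField);
CREATE INDEX idx_parent_some ON myInfo (parentID, someField);
```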

Since the query is used frequently, I'll use Error404's idea for a temporary table and store the results for future look-ups.
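A sketch of that temporary-table cache, assuming MySQL and the earlier query (the table and column names are illustrative):

```sql
-- Per-connection temporary table holding the frequent query's result;
-- it disappears automatically when the connection closes.
CREATE TEMPORARY TABLE searchCache AS
    SELECT m.parentID, m.someField
    FROM myInfo AS m
    INNER JOIN myInfo AS e ON e.indexedFieldID1 = m.parentID
    WHERE e.indexedFieldID2 IN (1, 2, 3)
      AND e.someField = 'X'
      AND m.someField = 'Y';

-- Subsequent look-ups hit the small cached table instead of 2M rows.
SELECT * FROM searchCache WHERE parentID = 123;
```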