
Unanswered: HELP - Trying to upload file contents to database on a nightly basis

Hi,

I have created a Java program which I call each night at approximately
7PM whereby I pick up all files residing in a local directory in a
pre-defined format. The format is basically part,quantity with the
file name being [id].txt.

The file contains approximately 110000 lines and it is assumed that
the file is a full file. I loop through the file calling the
PreparedStatement(2) and update each part with the new quantity. If I
hit an error whereby the part does not exist, I insert the part and
quantity via PreparedStatement(1).

The problem I have is that this program generally takes 8-9 hours to
complete which is quite long. We are looking to expand it to update
many files but at this rate we will have to wait weeks to complete all
files!!!

The format of the file cannot be changed as it is sent to us by
another company and they have defined the format.

I am not sure exactly how you are "preparing" the SQL statements. Do you repeat the prepare for each record in the input file and then perform the execute update/inserts? Or is there one prepare for the entire input file and then multiple executes, one for each record in the file? The first option (a prepare for each record in the input file) is very inefficient. Static SQL would be even better.
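For reference, here is a minimal sketch of the prepare-once pattern for the nightly loop described above. The table and column names (parts, site_id, part_number, quantity) are taken from later in this thread, and checking the update count instead of catching an insert error is my own assumption, not necessarily how the original program works:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.PreparedStatement;

public class NightlyLoad {

    // Split one "part,quantity" record; assumes well-formed lines.
    static String[] splitRecord(String line) {
        int comma = line.indexOf(',');
        return new String[] { line.substring(0, comma).trim(),
                              line.substring(comma + 1).trim() };
    }

    static void loadFile(Connection con, String fileName, String siteId) throws Exception {
        con.setAutoCommit(false);
        // Prepare ONCE, outside the loop.
        PreparedStatement upd = con.prepareStatement(
            "UPDATE parts SET quantity = ? WHERE site_id = ? AND part_number = ?");
        PreparedStatement ins = con.prepareStatement(
            "INSERT INTO parts (site_id, part_number, quantity) VALUES (?, ?, ?)");
        int sinceCommit = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            for (String line; (line = in.readLine()) != null; ) {
                String[] rec = splitRecord(line);
                upd.setInt(1, Integer.parseInt(rec[1]));
                upd.setString(2, siteId);
                upd.setString(3, rec[0]);
                if (upd.executeUpdate() == 0) {       // no existing row: insert instead
                    ins.setString(1, siteId);
                    ins.setString(2, rec[0]);
                    ins.setInt(3, Integer.parseInt(rec[1]));
                    ins.executeUpdate();
                }
                if (++sinceCommit == 100) {           // commit every 100 records
                    con.commit();
                    sinceCommit = 0;
                }
            }
        }
        con.commit();                                 // final partial batch
    }
}
```

Checking the update count avoids driving the insert path off a SQL error, and preparing outside the loop means each statement is compiled once per file rather than once per record.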

The other question I have is how many inserts are there per day, and are they inserted at the end of the table or in the middle of the table (this may depend on the definition of a clustering index and the new part number sequence that is being inserted). If you are doing lots of inserts in the middle of the table, and there is not sufficient free space, then performance will suffer.

Other things to consider:

- frequent commits, maybe every 100 input records processed
- significantly increase size of log buffer
- significantly increase size of database buffer pool (the total of all buffer pools that are active at any one time should usually be at least 1/4 - 1/2 of total system memory)
- using separate tablespaces for each table
- using separate tablespaces for indexes
- define tables and indexes with sufficient freespace to handle inserts in-between reorgs.

On second thought, if the input file contains the complete set of data that ends up in the DB2 table (i.e., every existing row in the DB2 table has an input record with updated quantity, and there may also be some new rows to be inserted), then you might do better with a load replace.

- The rows are updated wherever they exist within the table or inserted at the end of the table.
- The table originally had no indexes; indexes were added recently, but performance hasn't changed
- Autocommit = true, so very frequent commits. Would this slow things down?
- will look in to this
- will look in to this
- I am using a separate tablespace purely for this upload and have a view in another tablespace viewing it.
- indexes are created in separate INDX tablespace
- will look in to this as well

Can you please point me towards a good guide on the load replace? Also, can I specify an SQL statement in a load replace? E.g. can I add where id = ? to the end of the load?

Commits are very expensive if taken after each update/insert. Switch to every 100 updates. This change alone might significantly decrease elapsed time. A commit after each update/insert effectively prevents asynchronous writes (see my post in http://dbforums.com/t792798.html).

I am not familiar with how autocommit works in a java program, so I am not sure what your actual commit frequency is. But a single commit after all updates (100,000+) is not good either.
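In JDBC the commit-interval pattern is just a counter around an explicit commit. This is a schematic sketch (the interval of 100 is the suggestion above, and the Runnable stands in for whatever update/insert is done per record):

```java
import java.sql.Connection;

public class CommitPolicy {
    static final int COMMIT_INTERVAL = 100;

    // True after every COMMIT_INTERVAL-th record (records counted from 1).
    static boolean commitDue(int recordsProcessed) {
        return recordsProcessed > 0 && recordsProcessed % COMMIT_INTERVAL == 0;
    }

    // Schematic driver: turn autocommit off, commit on the interval,
    // and always commit the final partial batch.
    static void run(Connection con, Iterable<Runnable> recordActions) throws Exception {
        con.setAutoCommit(false);
        int n = 0;
        for (Runnable action : recordActions) {
            action.run();
            if (commitDue(++n)) con.commit();
        }
        con.commit();
    }
}
```

The key call is setAutoCommit(false); with autocommit on, a JDBC driver commits after every executeUpdate, which is exactly the per-record commit behaviour being warned about.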

The creation of an index should speed the update of the quantity on existing rows (if not, then something is wrong) but it will slow down the insert process. If you defined one of the indexes as a clustering index, then DB2 will try to insert the data row into the same data page as the index order. DB2 can spend a lot of time looking for free space on the correct data page (or on nearby pages) that does not exist, so if a clustering index is defined, then freespace should be defined on the table. Freespace is only materialized after a reorg or load replace.

Likewise, if index entries are not added at the end (as would occur if the new index entry is higher than all previous index entries), then DB2 can spend lots of time doing index page splits to keep the indexes in order. Indexes are always kept in exact order. If you have indexes during the insert process, they should probably have freespace defined (unless all new index entries are added at the end because they are higher than all previous index entries).

For best performance, the DB2 data rows should be in part number sequence (defined by making the index on part number the clustering index, which is enforced with reorg) and the input file should be sorted in the same part number sequence. This would cause each sequential quantity update to usually be made to the same DB2 physical 4K page as the previous update.

The load utility is discussed in the Command Reference Guide. The replace option would completely reload the table from a sequential input file and rebuild the indexes. But whether this is best for you, might depend on what percent of the records get inserted each night (as opposed to just an update of the quantity).

I turned off AutoCommit and committed every 500 records. This had a drastic effect, improving the upload time by a great deal: the run for 103826 records is now down to 3 hours 26 minutes, but I'm sure there is a lot more time to be saved somewhere else.

The table is a basic one with default allocation of freespace etc. and three columns (site_id, part_number, quantity). I have an index on site_id which is clustered and allows reverse scans, and an index on part_number. I placed the cluster on site_id because each part number will be selected based on the site_id. Should I be converting the cluster, as you mentioned, to the part_number field?

Realistically, normal files contain anywhere between 5000-10000 rows; it's just one that contains 100000, and a couple of others are up around the 30000 mark. There is a rough total of 200000 rows in the entire table, and I have separated the indexes to be stored in the INDX tablespace.

Roughly how much freespace should I be allocating to the parts table before performing a reorg? I'm an Oracle guy who's just recently converted to DB2 and I am still learning the ropes. Any advice on table set-up and allocation would be greatly appreciated.

I do not understand what "site_id" is and how many unique values it has. Not sure if an index on that column is useful. Is the combination of site_id and part_number unique, or is part_number unique by itself?

I would consider dropping the index on site_id and checking update and query results. You might be better off with no clustering index and DB2 will just do inserts at the end of the table.

It's hard for me to tell you because I don't really understand the data. But based on the following update statement, you definitely want an index on part_number:

It would help if the table and the input file were in the same physical sequence: part_number, or (site_id and part_number), etc. This is so that the part_number being updated is on the same 4K tablespace page as the previous update transaction. DB2 does not store one row at a time; it stores rows physically in 4K pages.

The amount of free space needed depends on how many inserts are done in-between reorgs. If you reorg after each update/insert, it might be better to not have clustering index or freespace.

I don't recall if this has been mentioned, but when you create an index, you need to reorg the table/indexes and then run runstats. This will help DB2 to decide whether to use any indexes you have created.

Therefore, I would suggest one composite index (defined as unique) on part_number and site_id together. I suspect that it would be better to have the part_number first, but not sure because I don't know how the data is accessed. If you have a query which shows all the site_ids and quantities for a particular part_number, then part number should be first in the index.

This may sound strange, but since part_number and site_id will be in the index, and the only other column is quantity, it would make sense to put the quantity as the last column in the index so that all queries will be "index only" access path. This will not speed up the updates (nor slow them down), but will speed up the queries.

So the unique index would be:

part_number,
site_id,
quantity

in that order. As previously mentioned, have the sequential input file sorted in the same order. Do a reorg and runstats after the index has been created.
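Since the site_id is constant within one file (it comes from the file name), sorting a file to match the index order reduces to sorting on part_number. A sketch, assuming plain text comparison matches the index collation for CHAR part numbers:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;

public class SortInput {

    // Sort "part,quantity" records by part number (compared as text,
    // which is how a CHAR index orders them) so that successive updates
    // land on the same or adjacent 4K pages of a table clustered the
    // same way.
    static List<String> sortByPart(List<String> lines) {
        lines.sort(Comparator.comparing(l -> l.substring(0, l.indexOf(','))));
        return lines;
    }

    static void sortFile(Path in, Path out) throws Exception {
        Files.write(out, sortByPart(Files.readAllLines(in)));
    }
}
```

For a 110000-line file this sort takes well under a second in memory, so it adds essentially nothing to the nightly run.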

- No, we will never display/search on all sites at one time; it will only be a part number on a per-site basis (each site is a retailer, each part is what they currently stock, and we will always display comparatively what each retailer stocks)

The String I use to access the data is exactly the same every time and is as follows:

Given what you said about the "like" statement (you threw a curve with that one), I would create one index with the following columns (in the following order):

site_id,
part_number,
quantity

If you feel uncomfortable about including quantity in the index, just create an index on:

site_id
part_number

Whichever index you create, it should be defined as unique. Make sure you reorg the table and execute runstats, and let us know how it performs. Sort the input file in the same sequence as the index.

If you could change the SQL which accesses the data to only have the % at the end (if the user knew the beginning of the part_number), and not put a % at the beginning, it would run a lot faster. Obviously, this can only be done depending on application requirements.
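A sketch of the prefix-only version; the toPrefixPattern helper and the ESCAPE clause are my additions (escaping % and _ in the user's input stops stray wildcards from defeating the index):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class PrefixSearch {

    // Escape LIKE metacharacters in the user's input, then append a
    // single trailing % so DB2 can use the index on part_number.
    static String toPrefixPattern(String userInput) {
        return userInput.replace("\\", "\\\\")
                        .replace("%", "\\%")
                        .replace("_", "\\_") + "%";
    }

    static PreparedStatement prefixQuery(Connection con, String siteId, String prefix)
            throws Exception {
        PreparedStatement q = con.prepareStatement(
            "SELECT part_number, quantity FROM parts " +
            "WHERE site_id = ? AND part_number LIKE ? ESCAPE '\\'");
        q.setString(1, siteId);
        q.setString(2, toPrefixPattern(prefix));
        return q;
    }
}
```

With only a trailing %, DB2 can turn the LIKE into an index range scan on part_number; a leading % forces it to examine every key.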

I'm going to re-create the index in the way you've explained but currently, there is no issue with speed of accessing the data, it is really only with the updating and inserting of the 100K+ record file.

I don't think you need reverse scans based on the SQL access you said that you use. I am not exactly sure what the performance penalty for allowing reverse scans is, but I am sure there is some penalty, otherwise it would be the default, so don't use it if you will never use it on queries.

The issue of clustering and percent free (on the table and index) depends on what percent of the data you insert each night (vs. just updating the quantity). If 10-12 percent of the data is inserted, then 15% free space would be about right. But this assumes that you reorg after each nightly update to recreate the free space for the next days updates/inserts.

So another option is to not define any clustering and let the data be inserted at the end of the table. However, indexes are always kept in exact order, so percent free should be used on indexes, and reorgs should be done nightly to improve the update/insert performance of the next day (it also helps queries somewhat).

I just reorg'ed a table with 16,000 rows and one index and it took about 5 seconds on a Pentium 3 running DB2 8.1. My database buffer pool size is 24 MB (defined as 6,000 4K pages). I would be interested in knowing how long it takes to reorg your 100,000 row table.

Re: HELP - Trying to upload file contents to database on a nightly basis

Are there any reasons why the load option was not considered?
If you want to use the load option, I can suggest a method to perform this operation.

Thanks
Sateesh

Originally posted by shannonw
Hi,

I have created a Java program which I call each night at approximately
7PM whereby I pick up all files residing in a local directory in a
pre-defined format. The format is basically part,quantity with the
file name being [id].txt.

[...]