Interview ques

I have been asked this question in a couple of interviews: "A java program reads records from the database and processes it. It takes about 6-7 hours to complete the process. How can you improve the performance and decrease the time to 10-15 minutes". any help would be a great help.

I think this is a trick question, assuming you didn't get any additional information. You don't know the cause of the performance problem, you don't even know whether there is a performance problem (you only know that someone claims that the process can be made 30 or 40 times faster). In real world, you need to find out what the bottleneck is and then offer solutions, if there are any at all.

If multithreading really was the correct answer, their box would have to be equipped with some 30 or 40 cores, probably more. This is still quite uncommon configuration. Moreover, the database or network might not be able to serve the records fast enough to keep forty CPUs busy all time.

I think Martin's right. I'm no expert on performance (or interviews ), but I suspect what a sensible interviewer would be looking for is an explanation of how you'd go about solving the problem - starting with profiling it to track down exactly where the bottleneck is.

After you've explained your approach, there wouldn't be any harm in following it with "here are some potential causes that I wouldn't be surprised to find, and if they turn out to be true, here's what might help" (e.g. finding the same query made many times with the same parameters - cache the results). Which might also help you out with a less well-informed interviewer who thinks there is a correct answer.

How about adding an index (or indexes) to the data you're reading - this would improve performance!

What's interesting about this question is that they give you the time frame they want to get the process to. If they were after a general 'how would you approach this problem' type of answer I'm not sure they would bound it with numbers like they did.

JDBC Performance :
One point many developers forgot while using plain jdbc or even hibernate (which internally uses JDBC API's) is properly setting fetchSize on JDBC Statement, improves throughput considerably (about 10x). Default fetch size is 20, which means in single network call, JDBC driver will fetch only 20 records and next 20 records in another network call. What if i am reading millions of records in those 4-5 hours ? Increasing fetch size from 20 to 500 or even 1000 will reduce IO/Network calls by 250 or 500 times. This will definitely need more memory for store those 1000 records in memory but i can ignore them for throughput.
Above mentioned point is definitely worth trying. Another interesting point is fetching CLOB/BLOB in resultSet. Using CLOB/BLOB in resultSet will force JDBC driver (alteast about oracle I know) to fetch only one record per network call. Hence fetching BLOB/CLOB will result in more network calls.If BLOB/CLOB is not used, remove them from resultSet and you will see performance improved.

This is an interesting idea, so I thought I'd check it out. A quick google search shows that the setFetchSize() parameter's default value is implementation specific, and is different for MySql, Oracle, SqlServer, etc.

It is also clear that one can't reliably make such broad generalizations, as a large value may result in the DBMS returning so much data that your program has to garbage collect constantly.

I wrote a simple test program to test a fairly large MySql database that I'm working on for my day job. A snippet of the code looks like:

As you can see, it loops, and calls a random number, and switched between using a setFetchSize() parameter of zero (use default), or 1000 or 10,000 a third of the time. I use a random selection, repeated, so we can fairly test any caching logic that the DBMS may have.

At least for my testing with MySql on a fast quad processor MacBook Pro, the answers are not obvious.

Opps, my earlier test had some bugs in the libraries that I was using. I have fixed that, and re-ran the code. The results are clear, there is no meaningful difference.

I wrote a simple test program to test a fairly large MySql database that I'm working on for my day job. The results of testing with a setFetchSize() of 0, 1000, and 40000 are:

Again, I picked a random number and did one of the three tests. You can see that I processed at least 28 trials yet no matter what value I used, the runtime was within 7 milliseconds with a small standard deviation. These values are so close together as to be within the measurement error.

So my tests with MySql show that there is no advantage, none at all, with settting the fetch size.

Pat Farrell wrote:So my tests with MySql show that there is no advantage, none at all, with settting the fetch size.

Yeah, there's a lot of "simple solutions" out there which people freely pass around to each other. Many of which are... um... of limited utility.

However would it be fair to guess that your tests were done in an environment with MySQL running locally? And that the results might have been different if the server was running elsewhere on the network? I wouldn't bet much money on the latter, frankly, but that's just guesswork, which isn't good practice in the performance-improvements world.

Paul Clapham wrote:However would it be fair to guess that your tests were done in an environment with MySQL running locally? And that the results might have been different if the server was running elsewhere on the network?

Yes, and yes, there is some chance that they would be different if there was real latency between the client and server.

My tests are good enough for me to say: (1) the proposed change is not always useful and (2) I don't bother to test/look at this further.

As always, when there is a real performance problem, I will profile and then see if this helps. But I'm not gonna hold my breath.

1. Use connection pooling
2. Define proper index on the table
3. Read only required columns (in most of the cases we no need to fetch all the column values)
4. Use the Producer - Consumer model for one set of threads reading it from db another set of threads processing it
5. Check CPU utilisation while running Java program, if the CPU utilisation is very less, increase the number of threads in the program. This should speed up the things. Even after increasing the threads if there is no improvement in the performance take heap dump of the application check if there is any memory leak.
6. Using profilers check which method/part of the program takes long time and address it first.

Lakshman Arun
Greenhorn

Joined: Sep 24, 2008
Posts: 19

posted Nov 30, 2011 02:51:42

0

7. In producer-consumer model, we can make the consumers running in distributed environment (using JMS worker threads as consumers). This way we can scale the application much better