How to speed up JDBC inserts?

One question I like to ask in interviews is: how would you speed up inserts when using JDBC?

This simple question usually shows me how knowledgeable a developer is about databases in general and JDBC in particular.

If you ever find yourself needing to insert data quickly into a SQL database (and not just answering the question in an interview), here are some options to consider.

1. Don’t use JDBC

Most databases have a native tool to handle bulk inserts. You don’t always have this option, but if your data is in a format the tool supports, you’re going to be hard-pressed to find a faster option.
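For example, MySQL has LOAD DATA INFILE (and the mysqlimport wrapper), and PostgreSQL has COPY. Here’s a hedged sketch for the MySQL case — the hospital table and file path are assumptions borrowed from the demo below, and it requires allowLoadLocalInfile=true on the Connector/J URL. Note that even though the statement is fired through a JDBC Statement, the heavy lifting stays inside the server:

```java
import java.sql.Connection;
import java.sql.Statement;

public class BulkLoadExample {

    // Builds the LOAD DATA statement; table and file names here are
    // placeholders for this sketch, not a fixed API.
    public static String buildLoadSql(String csvPath, String table) {
        return "LOAD DATA LOCAL INFILE '" + csvPath + "' "
                + "INTO TABLE " + table + " "
                + "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
                + "IGNORE 1 LINES"; // skip the CSV header row
    }

    // Assumes conn was opened with allowLoadLocalInfile=true and that the
    // CSV columns line up with the table's columns.
    public static void loadCsv(Connection conn, String csvPath, String table) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(buildLoadSql(csvPath, table));
        }
    }
}
```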

2. Use Prepared Statements

Prepared statements not only reduce insert time by compiling the SQL once, but they also help prevent SQL injection attacks. (I also like to ask interviewees about SQL injection attacks.)

Once a prepared statement is parsed by the database engine, the client only needs to send the parameter values for each execution instead of the entire SQL statement each time.
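Here’s a minimal sketch of the idea — the hospital table and its two columns are placeholders borrowed from the demo further down, not a fixed schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

public class PreparedInsertExample {

    // Builds a parameterized INSERT with one "?" per column.
    public static String buildInsertSql(String table, String... columns) {
        StringBuilder sql = new StringBuilder("INSERT INTO ").append(table).append(" (");
        StringBuilder params = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) { sql.append(", "); params.append(", "); }
            sql.append(columns[i]);
            params.append("?");
        }
        return sql.append(") VALUES (").append(params).append(")").toString();
    }

    // The SQL is parsed once; only the parameter values travel per row.
    // Binding parameters (instead of concatenating strings) is also what
    // blocks SQL injection.
    public static void insertAll(Connection conn, Object[][] rows) throws Exception {
        String sql = buildInsertSql("hospital", "hospital_id", "hospital_name");
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Object[] row : rows) {
                ps.setLong(1, (Long) row[0]);
                ps.setString(2, (String) row[1]);
                ps.executeUpdate();
            }
        }
    }
}
```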

3. Use Batch Inserts

Prepared statements are your friend when using JDBC, but batch statements are going to be your BFF when doing inserts and updates.

We have a demo in Data Pipeline that loads some hospital data into a MySQL database. Using batch statements of 100 records, insert time dropped from 3.5 minutes to under 1 second.

Two things you should know about the data: 1) The data isn’t perfect. It does have bad records and duplicate IDs causing some records to fail. 2) We need to do a little massaging to map the source and target formats.

Here’s the code from the demo.


DataWriter discardWriter = new StreamWriter(System.out);

DataReader reader = new CSVReader(new File("data/hospital.csv"))
    .setFieldNamesInFirstRow(true);

reader = new TransformingReader(reader, discardWriter, "error_message")
    .add(new RenameField("Provider Number", "hospital_id"))
    .add(new BasicFieldTransformer("hospital_id").stringToLong())
    .add(new RenameField("Hospital Name", "hospital_name"))
    .add(new RenameField("Address 1", "address1"))
    .add(new RenameField("Address 2", "address2"))
    .add(new RenameField("Address 3", "address3"))
    .add(new RenameField("City", "city"))
    .add(new RenameField("State", "state"))
    .add(new RenameField("ZIP Code", "zip_code"))
    .add(new RenameField("County", "county"))
    .add(new RenameField("Phone Number", "phone_number"))
    .add(new RenameField("Hospital Type", "hospital_type"))
    .add(new RenameField("Hospital Ownership", "hospital_ownership"))
    .add(new RenameField("Emergency Services", "emergency_services"))
    .add(new SetCalculatedField("emergency_services",
        "decode(emergency_services, 'Yes', true, 'No', false)"));

DataWriter writer = new JdbcWriter(getConnection(), "hospital")
    .setBatchSize(BATCH_SIZE);

Job job = Job.run(reader, writer);

System.out.println("Time: " + job.getRunningTimeAsString());

If you’d like to learn more about the above code, please have a look at our Java ETL framework.
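The demo above uses Data Pipeline, but if you’re on plain JDBC the same batching idea looks roughly like this — a sketch where the table, columns, and batch size of 100 are assumptions carried over from the demo:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsertExample {

    static final int BATCH_SIZE = 100;

    // Number of executeBatch() calls needed for n rows
    // (one final flush covers any partial batch).
    public static int flushCount(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    public static void insertAll(Connection conn, List<Object[]> rows) throws Exception {
        String sql = "INSERT INTO hospital (hospital_id, hospital_name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (Object[] row : rows) {
                ps.setLong(1, (Long) row[0]);
                ps.setString(2, (String) row[1]);
                ps.addBatch();             // queue the row client-side
                if (++pending == BATCH_SIZE) {
                    ps.executeBatch();     // one round trip for the whole batch
                    pending = 0;
                }
            }
            if (pending > 0) {
                ps.executeBatch();         // flush the final partial batch
            }
        }
    }
}
```

If you happen to be on MySQL Connector/J, it’s also worth looking at adding rewriteBatchedStatements=true to the JDBC URL, which rewrites a batch of single-row inserts into multi-row INSERT statements.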

4. Use Multiple Connections

Depending on your database and configuration, you could see an improvement using more than one connection to do inserts.

We saw improvements with up to 5 connections in a test with one of the big database vendors. Going above 5 connections showed no noticeable benefit.
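Here’s a runnable sketch of the multiple-connections idea, using one queue per worker thread. The actual JDBC write is replaced with a counter stand-in so the sketch runs standalone — in a real version each worker would open its own Connection and batch-insert the records it pulls from its queue:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.atomic.AtomicInteger;

public class MultiConnectionExample {

    private static final String POISON = "__DONE__"; // sentinel that stops a worker

    // Round-robins records across per-worker queues, waits for all workers,
    // and returns the number of records "written".
    public static int run(int workerCount, int recordCount) {
        List<LinkedBlockingDeque<String>> queues = new ArrayList<>();
        List<Thread> workers = new ArrayList<>();
        AtomicInteger written = new AtomicInteger();

        for (int i = 0; i < workerCount; i++) {
            LinkedBlockingDeque<String> queue = new LinkedBlockingDeque<>();
            queues.add(queue);
            Thread t = new Thread(() -> {
                try {
                    // Real setup: open one JDBC Connection here and batch-insert
                    // each record taken from this thread's own queue.
                    for (String record = queue.take(); !record.equals(POISON); record = queue.take()) {
                        written.incrementAndGet(); // stand-in for the actual insert
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }

        try {
            // Main thread plays the reader: one record into each queue in turn.
            for (int i = 0; i < recordCount; i++) {
                queues.get(i % workerCount).put("record-" + i);
            }
            for (LinkedBlockingDeque<String> queue : queues) {
                queue.put(POISON); // tell each worker it's done
            }
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }
}
```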

The simplest way to do this is to run the code that connects to the DB and inserts data in separate threads.

Our setup is a little less simple. Each thread has its own LinkedBlockingDeque that it pulls data from to write to its own connection. A main thread reads data from a source (like a CSV file) and puts one record into each thread’s LinkedBlockingDeque in turn.

5. Temporarily Disable Integrity Constraints

This is one of those tips you have to be very careful with. I don’t recommend it unless you absolutely know what you’re doing and, most importantly, have a rollback plan.

Disabling integrity checks means your database engine can insert data faster because it has less work to do for each record. However, if your data isn’t perfectly clean, you can end up with corrupted tables and broken client apps. You also don’t ever want to be the guy who forgot to re-enable constraints on a production database.
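For temporarily disabling integrity checks, the exact switches are vendor-specific. On MySQL, a hedged sketch might look like this — unique_checks and foreign_key_checks are real MySQL session variables, but treat the wrapper below as an illustration and try it on a throwaway schema first:

```java
import java.sql.Connection;
import java.sql.Statement;

public class ConstraintToggleExample {

    public static String[] disableSql() {
        return new String[] { "SET unique_checks = 0", "SET foreign_key_checks = 0" };
    }

    public static String[] enableSql() {
        return new String[] { "SET foreign_key_checks = 1", "SET unique_checks = 1" };
    }

    // Runs the bulk insert with MySQL's per-session integrity checks off,
    // and guarantees they are switched back on afterwards.
    public static void withChecksDisabled(Connection conn, Runnable bulkInsert) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            for (String sql : disableSql()) {
                stmt.execute(sql);
            }
            try {
                bulkInsert.run();
            } finally {
                // Re-enabling in finally is the rollback plan:
                // never leave checks off, even when the insert fails.
                for (String sql : enableSql()) {
                    stmt.execute(sql);
                }
            }
        }
    }
}
```

Because these are session-scoped SET statements, they only affect the connection doing the bulk insert, which limits the blast radius if something goes wrong.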