The problem with Java Serialization
Posted 2012-10-19 by laforge49 on JActor - High-Performance Java Actors

<div>There are a number of problems with Java serialization, and numerous alternatives have been developed. But my focus here is on a particular use case, databases, and a single issue: performance.</div><div> </div><div>Databases generally work with very large byte arrays. This is because seek time is very slow compared to the data transfer rate, so working with larger byte arrays often results in a performance gain. This is true for both hard disks and Solid State Disks (SSDs). On the other hand, deserializing very large byte arrays is very CPU intensive, so Java databases choose the size of the byte arrays they use to balance disk performance against CPU performance. Fortunately, the price of RAM has dropped significantly over the years, so large memory caches can be used to hold the deserialized data, which reduces the need to repeatedly read and deserialize the same data. Unfortunately, because of the transactional nature of many databases, there is still a need to frequently reserialize the updated data, which is also CPU intensive, and write it back to disk.</div><div> </div><div>Significant performance gains are achieved by not using Java serialization, but by working more closely with the data, reading and writing the binary form of integers, floats, strings and the like. This is not difficult to do, especially as most databases work only with well-defined tables where all the data in a given column is of the same type.
But still, when reading or writing the data to disk, the entire byte array must be deserialized and subsequently reserialized. Working directly with the binary data is much faster than using Java serialization, but it is still a CPU-intensive process. And the irony is that many database transactions access or update only a minuscule amount of data.</div><div> </div><div>In an ideal world, we would deserialize data only as needed, and then reserialize only the data that has changed. Doing this means we can work with much larger byte arrays, resulting in an overall improvement in Java database performance. The data structures for doing this efficiently may be somewhat complex, but that should not be an issue so long as the API is reasonable. The term I use for this technology is JID, or Java Incremental Deserialization/reserialization.</div>
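To make the "working more closely with the data" point concrete, here is a minimal sketch of encoding a fixed-schema row with `java.nio.ByteBuffer` instead of Java serialization. The row layout (an `int` id followed by a `double` price) and the class name `RowCodec` are illustrative assumptions, not anything from the post; the point is only that with a well-defined column schema, the binary form is trivial to read and write, and individual fields can be read at known offsets.

```java
import java.nio.ByteBuffer;

// Sketch: direct binary encoding of a fixed-schema row (int id, double price),
// avoiding Java serialization entirely. Schema and names are hypothetical.
public class RowCodec {
    public static final int ROW_SIZE = Integer.BYTES + Double.BYTES;

    // Serialize one row into a compact byte array.
    public static byte[] write(int id, double price) {
        ByteBuffer buf = ByteBuffer.allocate(ROW_SIZE);
        buf.putInt(id);
        buf.putDouble(price);
        return buf.array();
    }

    // Read a single field at its known offset, without decoding the whole row.
    public static int readId(byte[] row) {
        return ByteBuffer.wrap(row).getInt(0);
    }

    public static double readPrice(byte[] row) {
        return ByteBuffer.wrap(row).getDouble(Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] row = write(42, 9.99);
        System.out.println(readId(row));     // prints 42
        System.out.println(readPrice(row));  // prints 9.99
    }
}
```

Because every column has a fixed type and offset, no per-object metadata or reflection is needed, which is where the speedup over `ObjectOutputStream` comes from.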