News

Welcome to End Point's blog

The Real Cost of Data Roundtrip

Sometimes you need to perform some heavy database operations. I don't know why very often programmers are afraid of using databases for that. They usually have some fancy ORM which performs all the operations, and the only way to change the data is to make some SELECT * from a table, create a bunch of unneeded objects, change one field, convert those changed objects into queries and send that to the database.

Have you ever thought about the cost of the roundtrip of data? The cost of getting all the data from database just to send changed data into the database? Why do that if there would be much faster way of achieving the same results?

Imagine that you have quite a heavy operation. Let's make something which normally databases cannot do, some more complicated operation. Many programmers just don't know that there is any other way than writing this in the application code. Let's change all the HTML entities into real characters.

The HTML entities are a way of writing many different characters in HTML. This way you can write for instance the Euro currency sign "€" in HTML even if you don't have it on your keyboard. You just have to write &euro; or &#8364; instead. I don't have to, as when I use UTF-8 encoding and write this character directly, it should be showed normally. What's more I have this character on my keyboard.

I will convert the text stored in database changing all the htmlentities into real unicode characters. I will do it using three different methods.

The first will be a simple query run inside PostgreSQL

The second will be an external program which downloads the text column from database, changes it externally and loads into database.

The third method will be almost the same as the second, however it will download whole rows.

Generate Data

So, for this test I need to have some data. Let's write a simple data generator.

First, a simple function for returning a random number within the given range.

The Tests

SQL

Many programmers think that such operations are not normally available inside a database. However PostgreSQL has quite a nice feature, it can execute functions written in many different languages. For the purpose of this test I will use the language pl/perlu which allows me to use external libraries. I will also use HTML::Entities package for the conversion.

6 comments:

One counter-argument I've often heard and that is hard to counter (it may sometimes be valid, and it is hard to measure) is that you end up using the db server's CPU when it would be more efficient to let the app server do that work. Especially since app servers tend to scale horizontally but db server vertically.

There's another argument against "SELECT *" in embedded Perl code. (Admittedly, this one gets solved if you use an interface like Rose, but if you are scripting down at the DBI level, this can be a pitfall.)

Using "SELECT *" means your code is vulnerable to column name-changes with very little warning. For instance,

SELECT * FROM emp

and code that references $row->{fname} will break (sometimes silently) when column "fname" becomes "first_name". (Yes, this is a somewhat contrived example, but it happens.)

As to why coders instinctively reach for java/etc.? Simple. Ever since programming became synonymous with "the web", coders have reveled in the reactionary paradigm of COBOL/VSAM of their granddaddies: dumb data, smart code. Few, if any, have taken a serious database course while in school. Lots of languages, of course, but little or nothing of databases. I don't count xml/NoSql as databases.

As to PG's support for many languages, not so much as one might think. Most (all?) that aren't C are themselves implemented in C, so the engine actually implements the underlying C. Not saying this is a bad thing, only that non-C languages aren't natively supported.

Should we ever get to the point that 5NF databases are the norm rather than the exception, then the argument that the DB is just a "file store" and data management *belongs* in the client (as it did/does in the COBOL/VSAM world) will fade.

@Robert Young, non-C languages are supported in much the same way SQL itself is -- it's all interpreted somewhere inside the server. For most of those languages there will be some extra overhead of converting the data from types PostgreSQL understands to types the embedded interpreter understands, and depending on the task at hand, that can be significant. But the same sort of conversion is necessary to get the data ready to send across a network, and then again in application code.