Let's say I want to store songs in my database. Instead of having just one Song table, I have ten structurally identical Song tables, each with a foreign key to the Artist table. When an artist is added to the database, we count how many songs each of the ten Song tables contains and assign the artist the table with the fewest songs. All of that artist's songs will be stored in that Song table.

I don't want to store 1M objects in one table, but would rather divide them into ten pieces of around 100k objects each and store those pieces in ten different, but structurally similar, tables. Now, provided that the song table references in the artist objects never change, will my overall system be faster and have better performance?

I realize one big issue would be finding individual songs, but please answer this question under the assumption that songs can only be retrieved from the database by providing two parameters:

- artist_id
- song_id

If I have the artist_id, I can use it to get my artist object, which contains a reference to the Song table that holds the song with the given song_id. So I never have to query ten different tables to find a song, because I will always have the artist_id.
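To make the two-step lookup concrete, here is a minimal sketch using SQLite; the schema, table names (`artist`, `song_3`), and column names are my own assumptions for illustration, not part of any real system:

```python
import sqlite3

# Hypothetical schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artist (artist_id INTEGER PRIMARY KEY, song_table TEXT);
    CREATE TABLE song_3 (song_id INTEGER PRIMARY KEY, artist_id INTEGER, title TEXT);
    INSERT INTO artist VALUES (42, 'song_3');
    INSERT INTO song_3 VALUES (7, 42, 'Example Track');
""")

def find_song(artist_id, song_id):
    # Step 1: look up which Song table this artist was assigned to.
    (table,) = conn.execute(
        "SELECT song_table FROM artist WHERE artist_id = ?", (artist_id,)
    ).fetchone()
    # Step 2: query only that table. A table name cannot be a bound
    # parameter, so it is interpolated here (whitelist it in real code).
    return conn.execute(
        f"SELECT title FROM {table} WHERE song_id = ?", (song_id,)
    ).fetchone()

print(find_song(42, 7))  # ('Example Track',)
```

Note that step 1 adds an extra query (or join) to every song lookup, which is part of the cost being weighed here.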

Is this going to be completely useless? Or will it have a positive impact on my system's performance?

Note: I realize songs should never be stored this way, since you would want to query songs without knowing the artist, but this is only for an example, albeit a poor one. Also, please ignore the fact that these would be a mess to code and manage when answering. I only want to know about the performance impacts.

Check the database's specifications first to see if you even need to do this. Most people new to databases vastly underestimate their ability to scale. Also, consider storing the songs on the file system and storing links to the songs in the database instead.
– Robert Harvey, Nov 4 at 15:48

2 Answers

Dividing one logical table into multiple tables within the same database has essentially no benefits. It complicates queries and may in fact hurt performance, because finding elements becomes more difficult: instead of a simple query, you would have to repeat the query for each table and then take the UNION of the results.
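A quick sketch of that fan-out cost, again with hypothetical split tables `song_1`, `song_2`, … in SQLite: as soon as a lookup is not covered by the routing key, every table must be searched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two of the hypothetical split tables, identical in structure.
conn.executescript("""
    CREATE TABLE song_1 (song_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE song_2 (song_id INTEGER PRIMARY KEY, title TEXT);
    INSERT INTO song_1 VALUES (1, 'First');
    INSERT INTO song_2 VALUES (2, 'Second');
""")

# Without knowing which table holds the row, every split table must be
# queried and the results combined -- with ten tables, ten SELECTs
# instead of one.
tables = ["song_1", "song_2"]  # ... up to song_10 in the proposed design
query = " UNION ALL ".join(
    f"SELECT title FROM {t} WHERE song_id = ?" for t in tables
)
rows = conn.execute(query, [2] * len(tables)).fetchall()
print(rows)  # [('Second',)]
```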

In a well-administered database, having millions or billions of elements within a table is no problem at all. You will need suitable indices to get bearable performance for your queries, but you should be doing that anyway.
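For example, a composite index matching the question's access pattern (`artist_id`, `song_id`) turns a full-table scan into an index seek on a single large table; the schema below is a made-up illustration, and SQLite's query planner stands in for whatever database you actually use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song (song_id INTEGER, artist_id INTEGER, title TEXT)")
conn.executemany(
    "INSERT INTO song VALUES (?, ?, ?)",
    ((i, i % 1000, f"track {i}") for i in range(100_000)),
)

# One index on the single table covers the (artist_id, song_id) lookup,
# regardless of whether the table holds 100k or 100M rows.
conn.execute("CREATE INDEX idx_song_artist ON song (artist_id, song_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT title FROM song "
    "WHERE artist_id = ? AND song_id = ?",
    (42, 42),
).fetchone()
print(plan[-1])  # e.g. 'SEARCH song USING INDEX idx_song_artist (...)'
```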

Sometimes, a “table” is indeed split up so that it can be distributed across multiple databases, or multiple nodes of a distributed database. This is called sharding and is useful if a single database is not sufficient to provide the required read/write performance due to hardware limitations. However, there are some drawbacks.

We can only shard by one key, e.g. the song ID.

In a simplistic sharding approach where we shard by ID ranges, one node might hold the old elements and another node the newer ones, which causes uneven load across nodes. The database should therefore prefer GUIDs over sequential IDs, or route by a hash of the key.
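A minimal sketch of such hash-based routing (the shard count and function names are invented for illustration): a stable hash spreads even sequential IDs across nodes, avoiding the hot spot that range-based sharding creates.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(song_id: int) -> int:
    # Use a stable hash (not Python's per-process randomized hash())
    # so every client routes the same key to the same shard.
    digest = hashlib.sha256(str(song_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Sequential IDs scatter across shards instead of all landing on one node:
print([shard_for(i) for i in range(8)])
```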

Adding or removing nodes from a sharded database cluster is also difficult.

If a database query cannot be resolved by the sharding key, the query must be repeated for each shard and the results combined later, map-reduce style. This can amplify loads instead of reducing them.

And the most important drawback: a sharded database typically cannot perform transactional updates that touch multiple entries, though this is highly software-dependent.

Many databases have built-in support for sharding. A SQL database might transparently divide a table by its primary key – without having to modify any queries (but see the database's manual for caveats, e.g. whether this will relax some ACID guarantees). This clear separation between the logical table structure (exposed via SQL) and the physical table structure (e.g. the storage engine and index data structures) is the major feature of SQL databases!

Where possible, using a read replica database can be preferable over sharding. All writes go to the primary database, but the load of reads can be distributed across replicas. Transactional updates are still possible, although reads from replicas might be out of date.

So databases have many techniques for improving performance, such as sharding across multiple nodes. But in many cases this can be done transparently, so you should not modify your table structure in anticipation of it. Quite likely you do not need any scaling techniques at all: a single well-designed database (proper ER modelling, suitable indices, no excessive normalization, …) can deliver sufficient performance.

To extend the previous comment – relational database systems are routinely capable of storing billions(!) of entries if necessary, and they contain specific mechanisms – such as "Tablespaces" – to manage the back-end logistics of actually storing all that data. Balancing-decisions such as the ones you speak of can be taken care of, transparently, by the database system itself.

Most companies wind up having to store their data, "essentially, forever." They might do this by creating separate "archive" tables, but modern databases offer other alternatives for segregating less-used or archival data to other places while keeping it fully accessible as a single image.