Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. It only takes a minute to sign up.

Review table acts as the the primary table. Each review can have 0 or more (we have a limit of 18 at the moment in application level) rating values. A rating value is something like "Food", "Location", etc for a restaurant review.

In this table, type is a machine name of the "food", "location", etc I mentioned above. There is a foreign key to the base table that stores these rating type information.
In our application, each rating can have multiple rating values that count towards the main rating. Say, there is "food", and there can multiple sub questions such as "food - appearance", "food - taste". These values can be 0 or more. If there are more than 1 sub question, this score field contains the average of the sub questions (please see the table below).

We will be generating reports with COUNT and AVG on rating_value tables. However, there will be always a JOIN with review table, and the filter would be review.nid, which is indexed.

When accessing reviews, application will almost every time use the review_id, which are either primary or foreign keys in all tables.

Sometimes we need to CAST the score field to nearest integer.

Admittedly I'm not an expert in databases, but I spent a lot of time normalizing the above, and I ended up with the above database after all three steps. I'm eager to learn, so if you have suggestions to above table structure, please don't hesitate to mention them. Note that all review and rating tables will contain some extra data that I didn't mention here. the rating table has a text field to allow users to write a text review. That's why rating values and rating are separated.

The reports we generate are never going to run on all rows. Just a small set of rows filtered by the nid (plus additional filters).

My questions are:

Is it common to spread data of a single user operation across multiple rows? I expect the rating_value table to grow almost 10 times (in record count) than the review table.

While (inner) JOINing tables, and calculating averages, can database engines filter down the eligible rows using the indexes first?

If you have any thoughts about the above schema, please don't hesitate to comment on.

1 Answer
1

Is it common to spread data of a single user operation across multiple rows?

Think of the "entities": User, Review, Rating. A "user" may go to the restaurant many times, and give a "review" each time. The "review" may give ratings for multiple things, such as 'food' and 'location'.

These are "Many-to-one" relationships, so Ratings has a review_id and Reviews has a user_id. ("Many-to-many" relationships need an extra table.)

Yes, database optimizer can usually use indexes as needed. We need to look at the specific SELECTs to say further.

It looks like you did a shotgun approach to adding INDEXes. Do EXPLAIN SELECT ... on your various SELECTs -- This is to see which indexes are actually used. Then DROP the unused queries.

Every table should (must?) have a PRIMARY KEY. For Ratings, it should probably be PRIMARY KEY(review_id, type).

Your "reports" are likely to do JOINs and may end up COUNTing and SUMming things extra times, so beware. Do a sanity check on the results. The overages can be fixed, but I'll let you get to that point before trying to explain the problem.