I know this question has probably been asked several times before, but my question is specific to my situation.

I am implementing a notification micro service. It has an email table and a queue table. The service is exposed using a REST API. Each email can have multiple to addresses and cc addresses. The queue has an email ID. Once the email is sent, the row is removed from the queue.

At this time, there is no requirement to query on who sent what emails or whether an email for a particular user with a given email address has been sent. There are multiple ideas on how the to/cc email addresses can be stored.

Store them in JSON format in an address field of the email table:

{
  "toList": ["h@h", "g@g"],
  "ccList": ["a@b", "c@d"]
}

I considered using the JSON data type of MySQL 5.7, but it's not supported in H2, which we use for integration tests.

Store them as comma-separated values in toList and ccList columns in the email table.

Store them in relational tables such as email, emailaddress, and emailaddresstype, where the emailaddress table would contain the emailid, the email address, and the emailaddresstypeid. This results in several records in the emailaddress table and also in multiple joins.
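For concreteness, option 3 might look like the following sketch. Table and column names come from the question or are illustrative, and SQLite stands in for MySQL/H2 for brevity:

```python
import sqlite3

# Hypothetical sketch of option 3: one row per recipient in an
# emailaddress table, instead of a delimited list in the email row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email (
    emailid INTEGER PRIMARY KEY,
    subject TEXT,
    body    TEXT
);
CREATE TABLE emailaddresstype (
    emailaddresstypeid INTEGER PRIMARY KEY,
    name               TEXT NOT NULL        -- 'to' or 'cc'
);
CREATE TABLE emailaddress (
    emailid            INTEGER NOT NULL REFERENCES email(emailid),
    emailaddresstypeid INTEGER NOT NULL
        REFERENCES emailaddresstype(emailaddresstypeid),
    address            TEXT NOT NULL
);

INSERT INTO emailaddresstype VALUES (1, 'to'), (2, 'cc');
INSERT INTO email VALUES (1, 'Hello', '...');
INSERT INTO emailaddress VALUES (1, 1, 'h@h'), (1, 1, 'g@g'),
                                (1, 2, 'a@b'), (1, 2, 'c@d');
""")

# All cc addresses of email 1 -- a plain join, no string parsing.
cc = [row[0] for row in conn.execute("""
    SELECT ea.address
    FROM emailaddress ea
    JOIN emailaddresstype t
      ON t.emailaddresstypeid = ea.emailaddresstypeid
    WHERE ea.emailid = 1 AND t.name = 'cc'
    ORDER BY ea.address
""")]
print(cc)  # ['a@b', 'c@d']
```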

What would be the best solution to implement in this case? I am generally against violating first normal form (FNF), but I'm unsure about the implications of having so many records in the emailaddress table, such as two or three records for every email sent. There is no user table in this system, if that matters.

What's your argument for not normalizing your data? You seem to imply it may be less efficient, but it might be worth reminding you that it's the job of an RDBMS to manage relational data. I don't see any strong argument other than that you think it may be more efficient to do it that way. The canonical answer to "is it good to store comma-separated values in a single column" is no, unless you know why you need to do so.
– Vincent Savard, Mar 22 '17 at 19:27

Honestly, that was an argument from a team member, and I am against it. I like to follow FNF.
– TechCrunch, Mar 22 '17 at 19:28


By the way, "I considered using the JSON data type of MySQL 5.7, but it's not supported in h2db which we use for integration tests." is a bit concerning. Why would you not use the same RDBMS in your integration tests?
– Vincent Savard, Mar 22 '17 at 19:30

Because the tests can be run faster on any machine without dependencies. There are e2e tests run against the SUT that interact with a real db.
– TechCrunch, Mar 22 '17 at 19:34


Your situation is in no way different from those of the many previous questions. Simulating collections by storing serialized lists in single columns is a terrible idea in general, and it's a terrible idea in this example as well.
– Kilian Foth, Mar 22 '17 at 20:37

4 Answers

I suppose it really depends on what you plan to do with the data. If email addresses are nothing more than lists of strings that you will retrieve and manipulate somewhere else, then comma-separated strings in separate columns ("to", "cc", "bcc") is probably fine.

If you think you might want to query at some point and ask "who was CC'd on Email #1234 but was NOT CC'd on the reply message (#5678)?" it might be difficult to do that with simple comma-separated lists. In that case, you should go with separate tables to properly store the email addresses.
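That kind of query becomes a straightforward anti-join once recipients live in their own table. A minimal sketch, using SQLite and illustrative schema/column names (a simplified, flattened version of the question's emailaddress table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emailaddress (
    emailid INTEGER NOT NULL,
    type    TEXT    NOT NULL,   -- 'to' or 'cc'
    address TEXT    NOT NULL
);
INSERT INTO emailaddress VALUES
    (1234, 'cc', 'a@b'), (1234, 'cc', 'c@d'),
    (5678, 'cc', 'c@d');
""")

# Who was CC'd on email 1234 but NOT CC'd on the reply 5678?
rows = [r[0] for r in conn.execute("""
    SELECT address FROM emailaddress
    WHERE emailid = 1234 AND type = 'cc'
      AND address NOT IN (
          SELECT address FROM emailaddress
          WHERE emailid = 5678 AND type = 'cc')
""")]
print(rows)  # ['a@b']
```

With comma-separated columns, the same question requires string-splitting in application code or error-prone LIKE gymnastics in SQL.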

A frequent problem with persisting data in a serialized format is deserialization. If for any reason the format changes, the code will have to remain backward compatible, so you could end up with more code to maintain.
– Laiv, Mar 22 '17 at 20:51


Personally, I prefer to future-proof as much as possible, so having a separate table detailing the lists of recipients is a FAR preferable solution, as it allows far more complex queries to be used in the future when requirements change and the business decides they NEED to be able to query "who was CC'd on Email #1234 but was NOT CC'd on the reply message (#5678)?"
– Maybe_Factor, Mar 23 '17 at 2:33

Accessing delimited data values embedded within a larger field generally results in [very] poor application performance. It almost always results in "Table Scans" because indexes can't be meaningfully used.

At this time, there is no requirement to query on who sent what emails or whether an email for a particular user with a given email address has been sent.

Lucky you.

Wait until you get into a dispute with a user who insists that they're not getting the emails that you [claim to] have sent them, at which point you'll have to start querying it. OK, if those queries table-scan it's not the end of the world if you only have to run a few of them, but as soon as this sort of thing becomes commonplace, you'll need to move to a more sensible data structure.

OK, preparing for this sort of usage flies in the face of the Developer's "YAGNI" mantra but, all too often, DBAs find themselves having to take a far longer-term view and making sure the data structures that they implement are as flexible as possible; sensible Normalisation is a big step in that direction.
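The index point above can be demonstrated: a leading-wildcard LIKE over a delimited column cannot use a B-tree index, while an equality match on a normalized column can. A minimal sketch using SQLite's EXPLAIN QUERY PLAN (MySQL's EXPLAIN behaves analogously; all schema names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE email (emailid INTEGER PRIMARY KEY, cclist TEXT);
CREATE TABLE emailaddress (emailid INTEGER, address TEXT);
CREATE INDEX idx_addr ON emailaddress(address);
""")

# Delimited column: a leading-wildcard LIKE forces a full table scan.
plan_delimited = conn.execute(
    "EXPLAIN QUERY PLAN SELECT emailid FROM email "
    "WHERE cclist LIKE '%a@b%'").fetchall()

# Normalized column: an equality predicate can use the index.
plan_normalized = conn.execute(
    "EXPLAIN QUERY PLAN SELECT emailid FROM emailaddress "
    "WHERE address = 'a@b'").fetchall()

print(plan_delimited)   # plan detail reports a SCAN of email
print(plan_normalized)  # plan detail reports SEARCH ... USING INDEX idx_addr
```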

You should store the data in normalized form (option 3) unless you have a compelling reason not to. Non-normalized data, as in options 1 and 2, has a number of drawbacks: risk of inconsistencies, difficulty in parsing and querying, and so on.

You are making the system more complex and brittle, so you need a good reason to do so. Some legitimate reasons might be caching complex data structures for performance reasons or storing arbitrary data which is opaque to the application.

But you don't mention any particular reasons justifying non-normalized data, so go with 3.

(You suggest that a drawback of option 3 is more tables and more joins. But this is a red herring: you have the same amount of data and the same underlying structure with the other options; it is just stored in a less convenient, less queryable way.)

The first antipattern that Bill Karwin discusses in his book "SQL Antipatterns" is Jaywalking (so named because an intersection is avoided). The antipattern specifically lists the problems associated with storing comma-separated values when you need to use the values separately (for instance, querying the table becomes complicated). The solution offered in the chapter is straightforward: create an intersection table.