
itwbennett writes "Sauce Labs had outgrown CouchDB and too much unplanned downtime made them switch to MySQL. With 20-20 hindsight they wrote about their CouchDB experience. But Sauce certainly isn't the first organization to switch databases. Back in 2009, Till Klampaeckel wrote a series of blog posts about moving in the opposite direction — from MySQL to CouchDB. Klampaeckel said the decision was about 'using the right tool for the job.' But the real story may be that programmers are never satisfied with the tool they have."
Of course, then they say things like: "We have a TEXT column on all our tables that holds JSON, which our model layer silently treats the same as real columns for most purposes. The idea is the same as Rails' ActiveRecord::Store. It’s not super well integrated with MySQL's feature set — MySQL can’t really operate on those JSON fields at all — but it’s still a great idea that gets us close to the joy of schemaless DBs."
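For illustration, the pattern that quote describes — a TEXT column of JSON that the model layer silently treats like real columns, à la ActiveRecord::Store — might be sketched like this. The table and field names below are made up for the example, not Sauce's actual schema:

```python
import json
import sqlite3

# Hypothetical table: real columns plus an "extra" TEXT column holding JSON.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, extra TEXT)")

class Job:
    def __init__(self, row):
        self.id, self.name = row[0], row[1]
        self._extra = json.loads(row[2] or "{}")

    def __getattr__(self, key):
        # Fall through to the JSON blob for "schemaless" attributes.
        try:
            return self._extra[key]
        except KeyError:
            raise AttributeError(key)

conn.execute("INSERT INTO jobs (name, extra) VALUES (?, ?)",
             ("build-42", json.dumps({"browser": "firefox", "os": "linux"})))
job = Job(conn.execute("SELECT id, name, extra FROM jobs").fetchone())
print(job.name, job.browser)  # the JSON field reads like a real column
```

As the quote concedes, MySQL can't operate on those JSON fields at all — every query touching them has to pull the blob into the application first.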

Why are we still querying our databases by constructing strings of code in a language most closely related to freaking COBOL, which after being constructed have to be parsed for every single query?

Apart from the fact that query caches and stored procedures exist: so what if the language is related to COBOL? JavaScript is closely related to C, which is almost as old, and C in turn has plenty of relations to ALGOL, which is even older.

So yes, it sounds like they haven't really got a clue. Great advert for their business!

"Why are we still querying our databases by constructing strings of code in a language most closely related to freaking COBOL, which after being constructed have to be parsed for every single query?"

I couldn't agree with you more; this quote makes me want to vomit. Is this really how low the average competence of today's web developer has sunk? Between PHP developers not getting why PHP is a pretty shoddily designed and developed language and stuff like this, I barely get how the web even runs anymore.

To answer the original quote: the reason we're "still querying our databases by constructing strings of code in a language most closely related to freaking COBOL, which after being constructed have to be parsed for every single query" is that SQL is a language based on mathematically sound principles, is widely supported and widely known, and is processed by database engines across the globe that have literally decades of stability behind them, data in them, and so forth.

There's absolutely no reason to change SQL, because if you build a new query language that is based on the same mathematically sound principles of relational algebra then it will er... look just like SQL. The fact the kiddie (I can only assume he's a kiddie due to his blatant lack of knowledge and/or experience in the field) who wrote that blog post doesn't get this suggests he should absolutely not be trusted with your data as he'll only lose it.

This is a classic example of someone bitching about something not because it's bad, but because they simply don't understand it and believe that rather than learn about it properly, it's better to bitch and hope you can somehow effect change by bitching.

The advantage of most SQL RDBMSs is that they do adhere to the ACID principles, and for people who want to be able to have some degree of trust in their data source, that's pretty fucking important. It's no surprise that they've moved over to MySQL, though, as it's one of the few RDBMSs that is completely shit at adhering to the ACID principles and at keeping up to date with solid, stable implementations of modern database functionality.

There's absolutely no reason to change SQL, because if you build a new query language that is based on the same mathematically sound principles of relational algebra then it will er... look just like SQL.

False. First of all, SQL is NOT based on the mathematically sound principles of relational algebra. SQL took the mathematically sound principles of relational algebra and fucked them up. There should be no NULLs, there should be no natural ordering of "columns", there should be no possibility of having duplicate rows, there should be no possibility of inconsistent intermediate states in transactions (no deferred checking), etc. SQL has them all, and then some. Why? Because SQL simply ignores the relational model and "does what IBM and Oracle always did". That's not the same thing as "implementing the relational model".

Second, there is a separation between the surface structures of a language and its foundations. I really don't think that a language based on relational algebra has to look like SQL. That's like saying that a language with nouns having singular and plural and verbs having tenses has to look like English. Nope, it doesn't have to at all. Just look at VB.NET and C#: basically two front-ends to virtually identical language semantics, only one of them does not avoid non-alphabetic structural delimiters like the plague (and is so much more pleasant for it).

There should be no NULLs

Then how do I, say, indicate the date of death for someone who hasn't died? An IsDead field? Really? (Yes, a NULL in a field is a shortcut for a proper relationship, but a lack of relationship when using a linking table will still be represented by NULL.)

there should be no natural ordering of "columns"

Does it really matter? The natural ordering of columns is the order in which you added them to the table. Ignore it. It isn't important, and not in need of a "solution".

there should be no possibility of having duplicate rows

Firstly, get to know your DISTINCT SQL keyword. Secondly, data in real life sometimes IS duplicate. What the hell should people do? Have a DuplicatedThisManyTimes field? Ugh.

possibility of inconsistent intermediate states in transactions

That is a property of the database engine, not SQL.

Because SQL simply ignores the relational model and "does what IBM and Oracle always did". That's not the same thing as "implementing the relational model".

Where do you get this shit? Are you telling me the function of foreign key constraints and referential integrity, and the good ol' INNER/RIGHT/LEFT join keywords, are just smoke and mirrors, and everything is really just a chaotic bowl of soup? References, please.

GP is correct, and your understanding of the relational model appears to be - no offense - a bit lacking. To address your first example: people and deaths are different, though related, concepts. Ideally, they should have separate tables, plus a view. If someone died, he or she has a row in a Deaths table, which joins to the People table; otherwise, not; no NULLS necessary. When interacting with the data from outside the database, you use a view, which can be engineered to appear to contain NULLs, dupli

From a purely pragmatic point of view, it may not seem unreasonable to model it that way. But you should be aware that you are trading one form of complexity for another, probably bigger one. For instance, now, if you want to know who was alive on some specific date, you have to write something like "WHERE DateOfDeath IS NULL OR DateOfDeath > @date." You also will not know for certain whether a NULL means "person is still alive" versus "person is dead but we do not know his or her date of death." When you try to compare different people's death dates, any comparison to NULL will yield NULL and you will need special case logic in every such comparison. You will need tristate logic throughout any part of your application that does logical tests based on the date of death. Nullable values will sometimes require special treatment in your code, depending on the language (e.g., whether date/time values are considered to be nullable in that language). I could go on.

I also could build you both tables, an updateable view, and a set of SPs to do your basic CRUD stuff on both tables plus "show me living people" and "show me dead people", in a LOT less time than it would take to handle all the code problems that would result from breaking 1NF.

I am not an extremist on this subject, but I wear both DBA and developer hats, and when I'm acting as a DBA or in any other situation where I have control over the DB, I do try to get into 3NF, and then denormalize only if there are demonstrated reasons to do so. As a developer, I will sometimes take shortcuts if it's genuinely necessary, but, more often than not, I end up regretting them.

That's not how debate works. If you can't take a position and defend it against questioning, without resorting to "go away and learn more", then you have no position and shouldn't have posted in the first place.

"False. First of all, SQL is NOT based on mathematically sound principles of relational algebra."

No, you've completely missed the point - I'm not saying SQL is an implementation of, and only of, the relational model, nothing more and nothing less, merely that those are its foundations. SQL absolutely IS based on the principles of relational algebra - it's still ultimately based on much of the important set theory that underlies that when it comes down to it. The point being that sure, whilst SQL is far

I've worked on quite a few large-ish database applications (e.g. 800-2,000 tables, some with millions of rows), and I'd say I'm fluent in SQL. But the thing that annoys me most about SQL, from a maintenance perspective, is how much of the database structure ends up strewn around your code base. SQL is *not* good at encapsulation.

When a new requirement comes in that should cause you to change some of the primary relationships in your database, you have a look at how much code you'd need to change to d

And yet, these are the same developers that are being *highly* paid in these Web 2.0 times.

Seriously. I was one of them - but got kicked out because I made the huge mistake of pointing out the obvious: you must be a skilled programmer to write programs right. Ruby on Rails will not make a good coder out of a dumbass.

That is a common reason for firing. A couple of years ago some programmers wanted me to back them up with the boss on switching a project written in Python to Java. Their justification? The Python programmer called them a bunch of monkeys. No technical arguments at all.

Unfortunately the boss sided with the monkeys, and I was next on the chopping block for pointing out that a 200-player cap for a Bingo system running on 3 machines (1 web, 1 db, 1 backup db) was a design flaw.

If all your application is ever going to do is read and write fixed-size, record-structured data with few relational (or any) attributes, then COBOL will suit you fine, as that's what it was designed for. Unfortunately those sorts of apps are few and far between these days, but in its ever-decreasing niche COBOL is still good.

I think the main problem is application developers not understanding anything about database theory. The vast majority of databases I encounter are not normalized at all, and it's almost always because they were designed by a developer with no database background.

Granted, I didn't come into this field with that background, either, but I made a point to learn it, and now I'm very cognizant of implementing sound database designs. This whole idea of throwing random strings of structured text into a database column, and then relying entirely on the program code to parse and use it... well, why the hell even use a relational database, then?

Relational databases aren't suitable for every application, nor are "bigtable" and other NoSQL implementations. The problem is that developers use a particular kind of database without really understanding how to use it properly. If they can get data in, and get data out, that's basically all they care about. Never mind if they make it a maintenance nightmare in the process.

Yes, it makes sense up to a point, but it starts to suffer from the law of diminishing returns, and at some point having to do complicated multi-table joins actually slows down your queries so much that it becomes simpler and faster to suffer duplicate data than to normalise to the Nth degree.

It depends on the task, though. I'd wager 90% of the SQL work that is done by developers day to day isn't in such a performance-sensitive environment that it needs to favour performance over normalisation, and I agree with the GP: there are far too many developers out there that just don't do it and hence simply don't have the performance excuse. It really is just bad database design as a result of incompetence most of the time.

I can definitely see the value in making an informed tradeoff, but like you said, a lot of the time it's not an informed decision--they just do it to make it work and don't really have the expertise to know which is the right way to go. I've definitely seen enough bad database designs to know that most developers just have no clue how to design them. The worst I've seen had bad designs and poor performance, and were built in a completely ad hoc manner without any eye toward maintainability, performance, or

And in many databases, there'd be more performance gains from proper normalization than from premature optimization. I'm working with a legacy database that has this problem. Proper normalization would probably make it lightning fast, but instead it's slow as fuck because too many concerns are put in one table when they should be put in several tables. Also, it uses functions to retrieve values, which is just... so wrong.

Yeah, it really depends on what you are doing. But any time you break normalization there should be a good reason. Performance is certainly a valid reason. "I'm too lazy to make a well-designed database," however, is not.

If you find yourself breaking normalization all the time, then you've probably found a use case where a relational database isn't the best tool for the job.

While there is a "right" way to use a given tool, there is no one tool that is right for every situation. People who get this backwards are zealots and will often make poor decisions.

Yes, it makes sense up to a point, but it starts to suffer from the law of diminishing returns, and at some point having to do complicated multi-table joins actually slows down your queries so much that it becomes simpler and faster to suffer duplicate data than to normalise to the Nth degree.

The question is whether this should be solved at the conceptual model level. As a developer, I don't care whether the database cheats and duplicates something to speed things up, as long as I don't have to do it in the data model and as long as the implementation is correct. The same logic applies to CPU caches and compiler optimizations. The computer is allowed to "cheat" if it can prove that the shortcut is correct. But you shouldn't be forced to do it manually, since it only makes your code (and data str

I completely agree. A lot of non-DB centric people think that they can do more in the app tier, effectively using their databases as glorified file stores. Why even have a database server in those instances? I'm not saying that everything should be done in the database, either, but take advantage of every tool you have.

NoSQL has a place, so does relational. Learn their strengths and determine which is the best fit for your project. Then, learn how to use the tool to its fullest.

Unfortunately the developers of these "NoSQL" databases seem to have the same idea. I'm working with one that shall remain nameless but sounds oddly like a piece of fruit right now. The generally accepted best practice for scaling is to pull as much of the logic as possible out of the database layer. While there are fancy aggregation pieces, they're all impossibly slow (and hamper concurrency). Argh.

A lot of non-DB centric people think that they can do more in the app tier, effectively using their databases as glorified file stores. Why even have a database server in those instances?

This is pretty easy to answer, I think: because databases offer ACID attributes. Reimplementing those on your own is a big project and likely to create bugs; it's a lot easier to just grab an existing database and use it.

For instance, what if you need a "glorified file store" that multiple processes on multiple systems can

I think the main problem is application developers not understanding anything about database theory. The vast majority of databases I encounter are not normalized at all, and it's almost always because they were designed by a developer with no database background.

Or a developer who is experienced enough to know how bad an idea an overly normalized database is for most applications.

You've got it backwards. The highly normalized database is connected to transaction processing. Highly normalized databases have few lock issues and are optimized for transaction processing. Also, the transaction-processing surface is narrow, so you have good coders dealing with the relatively little code that bangs on it hard.

The read-only database denormalized for simplicity and query performance is the data warehouse. That's where the report monkeys work.

The reality is there are only two SQL databases in the entire universe: MySQL and Oracle. You might have been told others exist; hell, you might even have worked on something called "SQL Server" in your .NET shop, but in reality: they don't. They're all figments of your imagination. Your imagination is SO determined to find better, more robust, faster, more powerful alternatives to MySQL and Oracle that an entire fantasy world comprising "a successor to Ingres that makes MySQL look like a piece of crap" and "a Microsoft product that doesn't feel like a thirty-year-old mainframe product hacked onto a modern platform" develops in your head.

I'm not generally a Microsoft fan, but I love SQL Server. However, I haven't started a new project with it in years, I guess since pricing for SQL Server 2008 was announced. I've not been in a situation where I could justify the costs as the project (hopefully) was successful and scaled up. I also don't like being forced to run my database server on Windows. For these reasons, I just don't use it any more except in projects where it was selected years ago. I know you have to look at TCO, but I still ca

MySQL's MyISAM is much faster at reads than PostgreSQL. I think for the things people use NoSQL for, MyISAM is perfect. And when you want better ACID support, you can effortlessly switch to InnoDB.

But the real story may be that programmers are never satisfied with the tool they have.

Ah typo

But the real story may be that programmers don't know how to store data

They may not know because no one knows the business needs, but more often because they have no idea what they're doing WRT data storage.

IT training tends to cover data manipulation pretty well: "how to add two numbers."

IT training gets shaky on data structures: "So, in a junior-level class we will talk about data structures, which is too bad because you've already developed at least two years of bad habits first."

IT training tends to pretty much skip data storage: "In a senior-level class, you might talk about scalability, maybe in an optional class. Or maybe you'll take a semester of COBOL instead."

In some industries you can pretty well predict the future. In others.. no.

One app I built years ago would literally have required geographic changes to expand. Then, "surprise", it gets rolled out to 5 additional, bigger cities. Well, that was unexpected... I had an O(n**2) algorithm in there that did pretty well for values of N around 7, where N could never increase beyond 7, but not so good for values of N around 57. Whoops.

It seems to be a knee-jerk reaction amongst a lot of developers and designers that as soon as your app starts requiring persistent data beyond ini values, a database is needed. Why? For large but simply structured data, something like JSON or XML or even a flat CSV file is perfectly adequate. Performance can be an issue during searches, but if, for example, you have a fixed record size with key-sorted data, then finding a given key is simple (binary chop or similar).
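For the record, the "binary chop over a flat file of fixed-size, key-sorted records" idea takes only a few lines. The 16-byte record layout here (a 4-byte big-endian integer key plus a 12-byte payload) is purely illustrative:

```python
import struct

# Each record: 4-byte big-endian int key + 12 bytes of payload.
RECORD = struct.Struct(">i12s")

def find(buf, key):
    """Binary search over a buffer of fixed-size, key-sorted records."""
    lo, hi = 0, len(buf) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        k, payload = RECORD.unpack_from(buf, mid * RECORD.size)
        if k == key:
            return payload.rstrip(b"\x00")
        if k < key:
            lo = mid + 1
        else:
            hi = mid
    return None

# Build a sorted "file" in memory; in practice you would mmap the real file.
records = b"".join(RECORD.pack(k, name) for k, name in
                   [(3, b"carol"), (7, b"dave"), (42, b"erin")])
print(find(records, 7))   # b'dave'
print(find(records, 9))   # None
```

No database server, no query parser, and still O(log n) lookups — as long as the data really is fixed-size and sorted.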

The big benefit to a relational DB with lots of enforcement at the data layer is that you can have one or more applications reading/writing to it with minimal concern of data corruption.

What isn't obvious is that second application is often aggregate reporting for management. "How many customers are using $foo and where do they live geographically". With a relational DB, I might knock that query out in a few minutes across millions of customers.

With a flat XML file per customer spread across a number of servers, this could take days to assemble, particularly if $foo is nested deep in the structure.
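The kind of one-liner management report being described might look like this against a hypothetical customers table (schema invented for illustration); with a sensible index on the filtered columns, this stays fast at millions of rows in any relational engine:

```python
import sqlite3

# Hypothetical schema: one row per customer, a region, and a flag for $foo.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, uses_foo INTEGER);
INSERT INTO customers (region, uses_foo) VALUES
  ('EU', 1), ('EU', 0), ('US', 1), ('US', 1), ('APAC', 0);
""")

# "How many customers use $foo, and where do they live geographically?"
report = db.execute("""
  SELECT region, COUNT(*) FROM customers
  WHERE uses_foo = 1
  GROUP BY region ORDER BY region
""").fetchall()
print(report)  # [('EU', 1), ('US', 2)]
```

Reproducing that GROUP BY by walking per-customer XML files on multiple servers is the days-long version of the same question.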

Having spent far too much time writing one-off scripts to gather customer data because the middleware didn't support that type of query, I've actually gone the other way and started shoving some business logic into the DB.

Functions such as isCustomerPaymentOverdue are now in the relational DB with a very thin model in the middleware to allow for much easier and faster reporting.

Hop into the wayback machine and fire up any flavor of PICK. The database where schema is applied on use, not on storage. No length limits on fields and very fast on old hardware (really fast on new). Storing bits of xml and code are no problem. And for those users who simply must have SQL, many versions will support that too (UniData and UniVerse are two examples). It's not cool, not new, but it does work.

I know database concepts are difficult for some people, but it's by no means magic.

Sorry, I beg to differ. You select a DB. Turns out that's just the interface, and you have to *then* select the actual DB engine. Some engines/databases allow checking for and repair of corruption online, some don't. There's locking: row level, table level, database level. Oh, wait, you didn't know about tables vs. databases? What do you do when your query takes too long? Didn't you know about connecting before making a query, persistent connections, and how to interpret obscure error messages?

Nowhere on the CouchDB home page [apache.org] is reliability even mentioned.
And that's the real issue. Developing a reliable database system is a difficult design and programming task. It requires real software engineering. The hacks who write PHP and use JSON aren't up to a job like that. The "aw, we'll fix it in the next release" attitude doesn't cut it in databases.

If your application fits well with the methodologies of a traditional RDBMS, use a traditional RDBMS, and hire people who are trained and experienced in using those methodologies to their full potential.

If you're dealing with the latest Big Data paradigms and designs, where you can sacrifice some of the rigidity of a RDBMS to gain some flexibility and cheaper scalability, use a NoSQL database, and hire people who aren't stuck in their old RDBMS ways.

If your application fits well with the methodologies of a traditional RDBMS, use a traditional RDBMS, and hire people who are trained and experienced in using those methodologies to their full potential.

If you're dealing with the latest Big Data paradigms and designs, where you can sacrifice some of the rigidity of a RDBMS to gain some flexibility and cheaper scalability, use a NoSQL database, and hire people who aren't stuck in their old RDBMS ways.

The real key is for the person doing the hiring to understand which of those methodologies fits their application.

The real key is for the person doing the hiring to understand which of those methodologies fits their application.

This is insightful. I've worked extensively with RDBMS solutions and now quite a bit with NoSQL technologies. They each have their place. An entire article could be written on where each fits most naturally, but in general, if you don't need to join between tables, need to throw data at your store at high velocity (e.g. logging), and/or need a loose schema, a NoSQL solution works best. If what you're doing can be naturally modeled (i.e. users HAVE AND BELONG TO stations, stations HAVE MANY playlists, etc.), use an RDBMS.

One can see in the subtext of the GP that they may not get this, with their comment that people using RDBMS solutions are "stuck in old ways". It seems like they are saying that NoSQL is effectively always best. I'm curious why they think that. Nail, hammer, etc...

people using RDBMS solutions are "stuck in old ways". It seems like they are saying that NoSQL is effectively always best.

No, no, no, no, no, no, no, no, no, and hell no.

I'm referring somewhat-sarcastically to the RDBMS proponents who reject NoSQL out of hand. The ones who see "database" and think it must have a rigid structure, where all connections are made with JOINs. The ones who don't accept that NoSQL databases are inherently different and must be designed differently. If a programmer is actually stuck thinking in terms of an RDBMS, they should not be working in a NoSQL database. If the programmer is flexible enough to d

The ones who see "database" and think it must have a rigid structure, where all connections are made with JOINs

So, I'm actually curious about this part.

I've worked in RDB's, and I've worked in things that are more based on Berkeley DB... but I am actually having a hard time thinking of specific examples of where I'd want something database-ish and not have the need for JOINs.

Berkeley gives you key value pairs, but the product I worked on which was based on it allowed us to do searching on multiple of tho

Because I should think someone who thinks you should ditch your RDBMS when it's the thing you need to keep using is going to cause you more problems than they're worth. Of course, the opposite is true... I remember someone who insisted in writing ER diagrams to describe our system, despite it not being an RDB, and not being accurately described by ER diagrams -- but to him everything was an ER diagram.

Of course, the opposite is true... I remember someone who insisted in writing ER diagrams to describe our system, despite it not being an RDB, and not being accurately described by ER diagrams -- but to him everything was an ER diagram.

I can't say whether entity relationship diagrams were appropriate in the situation you describe but there is nothing wrong in principle in using ER diagrams to describe non-RDB systems. ER diagrams describe the logical or semantic model, not the physical implementation, and are therefore DB agnostic. Yes, they are often used to help design an RDB schema but their real value is to understand your data at the semantic level.

Unfortunately, many don't grasp this distinction and you'll see many RDB systems where

If you're dealing with the latest Big Data paradigms and designs, where you can sacrifice some of the rigidity of a RDBMS to gain some flexibility and cheaper scalability, use a NoSQL database, and hire people who aren't stuck in their old RDBMS ways.

Well, no. If you're dealing with "big data", you still need to evaluate which tool is appropriate for the task. If you're calling it "NoSQL", you're probably referring to a rather immature set of products designed to pander to people looking for teh new hotness. If you're looking for a key-value store, mature solutions like Berkeley DB have been around for ages.

It's not that key-value stores don't have their place, it's that people running around chanting the NoSQL mantra are really just reinventing the

Then how does a non-webscale database power popular web sites such as Wikipedia and Slashdot? If you don't do joins in the database, you'll probably end up doing the equivalent of joins (using one value as the key in another table) in your application.

In NoSQL systems such as MongoDB and CouchDB, what do you call the operation where you retrieve one document, pull an identifier out of that document, and use that identifier as the key to retrieve another document?
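Whatever you call that operation, it is effectively a join done by hand in the application, round trip by round trip. A sketch, with plain dicts standing in for the two document collections (all names invented for the example):

```python
# Two "collections", as a document store might hold them.
comments = {
    "c1": {"body": "first post", "author_id": "u7"},
}
users = {
    "u7": {"name": "CmdrTaco"},
}

def comment_with_author(comment_id):
    comment = comments[comment_id]          # first fetch
    author = users[comment["author_id"]]    # second fetch, keyed on the first
    return {**comment, "author": author["name"]}

print(comment_with_author("c1")["author"])  # CmdrTaco
```

That is exactly what `SELECT ... FROM comments JOIN users ON ...` does, except the database can plan, index, and batch it instead of paying one round trip per lookup.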

And someone with experience in the entity-relationship modeling that underlies a relational database schema is likely to see everything as "a lot of related data". For example, a Slashdot comment is related to a parent (a story or another comment), the user who posted it, and the moderations done to that comment.

So the thing is, traditional joins (on, say, Postgres or MySQL) aren't blocking operations. You can run more than one at a time. MapReduce (as well as writes, any aggregation, and any use of JavaScript) are blocking operations on Mongo. They block the entire mongo process. The MapReduce case gets around this with a bit of cooperative multitasking (yielding every few hundred or thousand rows), but writes, aggregation, and other use of javascript do not. So there's already a much bigger need to distribut

do they have an XML field type? MS SQL Server does [...] which allows you to essentially keep the table schema-less but still allows you to perform complex queries on the contained data.

But how does it index the data in the XML or JSON fields? How does it, say, tell an element containing a number from an element containing text? Does it act like SQLite, which is dynamically typed (and thus can store text in any field) but can be told to prefer to compare and index certain columns as numbers, dates, text with Unicode collation, or binary data?

Typing is just another constraint, like foreign keys and various other domain constraints. I cannot see any valid argument for having foreign keys but not type constraints. It just seems bizarre to me. It's not like they could be optional or anything.

They are optional. It appears you can enforce static typing for a column with constraints like CHECK(typeof(x)='integer'). I'd give more details, but the document that Wikipedia cites about such constraints is a printed publication of which I happen not to own a copy.
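For what it's worth, the CHECK(typeof(...)) trick is easy to demonstrate from Python's sqlite3 module:

```python
import sqlite3

# SQLite is dynamically typed, but a CHECK constraint can enforce the type.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INTEGER CHECK (typeof(x) = 'integer'))")
db.execute("INSERT INTO t VALUES (1)")          # passes the type check
try:
    db.execute("INSERT INTO t VALUES ('one')")  # stored as text, so CHECK fails
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Note that `'1'` (a numeric string) would still be accepted, because SQLite's INTEGER column affinity converts it to an integer before the CHECK runs.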

It's more than a little annoying that when you specify the type, it doesn't check it; rather, you have to specify it twice. Anyhow, if it can be beaten into shape, then that makes it more acceptable.

Either constraints are good or they are bad. Typing is just another constraint, like foreign keys and various other domain constraints. I cannot see any valid argument for having foreign keys but not type constraints. It just seems bizarre to me. It's not like they could be optional or anything.

Foreign keys are good, but type-independent. You just want to check that some foreign key actually references valid data in another table. If they match, they match.

However, typing in SQL is almost certainly a case of implementation details leaking through the abstraction layer. The data type is defined by the data and how it is used. Traditional SQL databases require that type information up front so they can organise the data on disk. But you shouldn't care how data is organised on disk.