
MyNoSQL has up an interview with Ryan King on how Twitter is transitioning to the Cassandra database. Here's some detailed background on Cassandra, which aims to "bring together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model." Before settling on Cassandra, the Twitter team looked into: "...HBase, Voldemort, MongoDB, MemcacheDB, Redis, Cassandra, HyperTable, and probably some others I'm forgetting. ... We're currently moving our largest (and most painful to maintain) table — the statuses table, which contains all tweets and retweets. ... Some side notes here about importing. We were originally trying to use the BinaryMemtable interface, but we actually found it to be too fast — it would saturate the backplane of our network. We've switched back to using the Thrift interface for bulk loading (and we still have to throttle it). The whole process takes about a week now. With infinite network bandwidth we could do it in about 7 hours on our current cluster." Relatedly, an anonymous reader notes that the upcoming NoSQL Live conference, which will take place in Boston March 11th, has announced their lineup of speakers and panelists including Ryan King and folks from LinkedIn, StumbleUpon, and Rackspace.
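The throttled Thrift bulk load described in the quote is easy to make concrete. Below is a minimal, hypothetical sketch (not Twitter's actual loader) of a token-bucket throttle that caps sustained bytes per second while feeding rows to a writer:

```python
import time

class Throttle:
    """Token bucket: allows short bursts but caps sustained bytes/sec."""
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def consume(self, nbytes):
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            # Sleep just long enough for the bucket to refill.
            time.sleep((nbytes - self.tokens) / self.rate)

def bulk_load(rows, write, throttle):
    """Feed each serialized row to `write` (imagine a Thrift batch call),
    pausing whenever the byte budget is exhausted."""
    for row in rows:
        throttle.consume(len(row))
        write(row)
```

With the rate set well below the network backplane's capacity, the import takes longer (a week, in Twitter's telling) but leaves headroom for live traffic.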

They keep saying that the Cassandra database is better, but somehow I don't believe them. I can't imagine they know what they're talking about. Maybe in the long-term they'll be proven right but I really don't think they are. I don't know why, though...

Could someone explain to me why this kind of speed would be a problem? It seems to me that if BinaryMemtable is so incredibly fast that other things become a bottleneck, then you're in a great position. You have something very fast for storing and retrieving data - you just need to get bigger, faster pipes.

I know next to nothing about NoSQL, but what they're talking about there seems to be using BinaryMemtable for the one-time move of data. You can see that you wouldn't want to "saturate the backplane of our network" for several days while that completes, so they're using a slower method & throttling it. It will take a week to do the move, but everything else will keep working.

If we're going to have to slow the system down, we'd rather use the standard interface, because that means the bulk loading doubles as a load test and the tools we build for it can be re-used for normal operations.

Yes and no. They are specifically talking about importing their data into Cassandra, which will be a one-time event, not worth upgrading the network bandwidth for. They need to throttle it so that more time-sensitive traffic can use the bandwidth. The bandwidth to the database in normal use will be much, much less than the import bandwidth.

No way. Their architecture is about as "best guess" engineering as Facebook. I don't think that's actually what engineering is. "Maybe this one will work?"

In the meantime, I have not been able to update my avatar image on Twitter, and a TwitPic-like feature is still a faint glimmer in Twitter's amateur eyes. Speaking of missed opportunities, why drive so much traffic to Twitter parasites like Bit.ly, TwitPic, TinyURL, Twitition, and TwitLonger?

The real question should be: what in the world are Twitter's engineers actually DOING?

I like to think of Twitter's technology experiments as a lesson in how not to build and run a high-performance web application. Hell, they brought us confirmation that Ruby on Rails wasn't exactly ready for prime time in terms of high-performance work, for example.

We have a lot to thank them for, but you're right, one of those things is not how to run a stable, secure, scalable web site; it is the opposite -- how not to. I suspect before long we'll be able to see for ourselves how...

So instead of just seeing what an actual robust and fast site uses, we should instead follow Twitter, which switches on a whim to whatever technology is du jour while still being unstable and slow?

Or is this possibly a case where people are attempting to use technology as a silver bullet against bad design? Your point cannot be made unless we know that they have a good design to begin with and that the reason for the outages lies specifically with the database technology. Sometimes the overlooked problem is with bad or dogmatic coding (i.e. too concerned with good form or over-using patterns, and not enough with performance or the 'KISS' principle), and not with the server or hardware technology. Chances...

Does Twitter really have loads which are more difficult to manage than, say, the BBC, CNN, Google, or Wikipedia? I would have thought serving up a fairly straightforward page, a stylesheet, a background image and the tweets or twits or whatever they're called can't be that difficult compared to, say, Facebook.

Does Twitter really have loads which are more difficult to manage than, say, the BBC, CNN, Google, or Wikipedia?

(1) In some measures, probably. (2) When Google or Wikipedia makes announcements about the technology they use in their backend (whether it's a "change" or not), that's often a front-page story on Slashdot, too. The BBC and CNN don't, AFAIK, tend to make big public announcements about back-end technology.

I would have thought serving up a fairly straightforward page, a stylesheet, a background image...

Why is it that whenever twitter makes any random change to some part of its infrastructure that we need a front page story about it?

Because in some areas Twitter is at an extreme of scale, so what they are doing to deal with that extreme of scale (even if it isn't necessarily always the ideal choice) is usually interesting since, if you are looking for things that have been done in production to deal with the kind of scaling they experience, there aren't a lot of other data points to find.

The more interesting aspect of all of this 'NoSQL' movement is the belief that achieving some speed improvement over some relational databases makes them so much better.

If you don't really need a database to run your 'website', then who cares if you use flat files or an in-memory hashmap for all your data needs? Databases are not being replaced by NoSQL in projects that need databases. The projects that may not have ever needed databases may be...

Your question is answered in my post: Google does not need a database for ACID properties.

Can you complain much if Google gives you very different results for the same search query in one location than in another at the same time? Well, if you do complain, you can ask Google for your money back.

Regional data has nothing to do with BigTable or RDBMS. Have you read the white papers on BigTable? If Google leverages any IP-isolated network solutions, then it's at the networking/application level ABOVE BigTable. BigTable itself leverages map-reduce to cascade the query to potentially thousands of machines, reducing their results back to a SINGLE requesting node.

Geo-location would pick one of several data centers which house an isolated effective database. The upper layered code would act identically...
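The cascade described above (fan a query out to many machines, then reduce the partial results back to the one requesting node) can be sketched in miniature. This is a toy illustration, not BigTable's actual interface; the shard functions and merge step are invented for the example:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(query, shards, merge):
    """Fan `query` out to every shard in parallel, then reduce the
    partial results back to the single requesting node."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard(query), shards))
    return merge(partials)

# Toy shards: each holds its own slice of a word-count mapping.
shard_a = lambda q: {"cat": 2} if q == "counts" else {}
shard_b = lambda q: {"cat": 3, "dog": 1} if q == "counts" else {}

def merge_counts(parts):
    """Reduce step: sum per-key counts across all shard results."""
    total = {}
    for part in parts:
        for key, count in part.items():
            total[key] = total.get(key, 0) + count
    return total
```

Here `scatter_gather("counts", [shard_a, shard_b], merge_counts)` combines both shards' partial counts into `{"cat": 5, "dog": 1}`.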

Yeah, no kidding, it's called a filesystem. But when was the last time you heard announced a mainstream, high-performance, non-relational data store that was intended to be an alternative to an RDBMS (BTW, I'm intentionally discounting OODBMSes, as I think they and RDBMSes are intended to target largely the same application space)? I know I haven't. People simply rolled their own and moved on. But times are changing and that niche is finally being filled (in part because that niche isn't so niche anymore).

But when was the last time you heard announced a mainstream, high-performance, non-relational data store that was intended to be an alternative to an RDBMS (BTW, I'm intentionally discounting OODBMSes, as I think they and RDBMSes are intended to target largely the same application space)? I know I haven't. People simply rolled their own and moved on. But times are changing and that niche is finally being filled (in part because that niche isn't so niche anymore).

One of the most recent, well-known major successes before the recent "NoSQL" movement, in terms of a product that sacrificed ACID for performance as an alternative to databases providing ACID guarantees, was MySQL.

I said nothing about ACID compliance. I specifically mentioned non-relational datastores, and clearly MySQL isn't that. As such, it still forces the developer to work with a relational data model, and one of the main things these so-called "NoSQL" projects do is lift that requirement.

Um, the reason MySQL with MyISAM doesn't provide ACID guarantees (particularly, its deficiencies with regard to consistency) is related to the ways in which MySQL with MyISAM fails to implement the relational model. Merely using a dialect of SQL as a query language doesn't make a database relational.

Well bully for you, having a chance to show off your obscure knowledge of non-relational data stores...

You know, the truth is, most data is still stored in individual files, not in databases. So RDBMSs were always a fairly niche thing, used for projects because they are well understood and easier to develop for when you really have massive data requirements.

Files - that's what many projects even today use, not databases. This is basically what they are going back to: files with whatever window dressing on top, a facade of hashes; it's all key/value pairs. It is, my friends, the old, old idea of property files...
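The "files with a facade of hashes" idea is simple to demonstrate. Here is a minimal, hypothetical flat-file key/value store (one JSON file, a dict in memory, atomic rewrite on every mutation) -- a sketch, not any real NoSQL product:

```python
import json, os, tempfile

class FileKV:
    """Flat-file key/value store: the whole 'database' is one JSON file
    loaded into a dict, atomically rewritten on every mutation."""
    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)

    def _flush(self):
        # Write to a temp file, then rename: readers never see a torn file.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(self.data, f)
        os.replace(tmp, self.path)

    def put(self, key, value):
        self.data[key] = value
        self._flush()

    def get(self, key, default=None):
        return self.data.get(key, default)

    def delete(self, key):
        self.data.pop(key, None)
        self._flush()
```

For small datasets this is exactly the "property file as database" pattern described above; the obvious cost is that every write rewrites the whole file, which is why it stops scaling long before a real store does.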

I think their point is not everything needs an RDBMS, whereas before it was the 'go to' method of storing data.

Except, of course, that it never was the "go to" method of storing data. There was no point in history where RDBMS's were anywhere close to the exclusive method of persisting data. Non-relational document-oriented storage has pretty much always dominated in the era in which relational databases existed, whether it was proprietary binary document formats, fairly direct text-based document formats, or...

I think you're missing the point here. The problem with RDBMSs isn't that they are "slow" per se, which would imply that they just need some good ol' fashioned optimization. The problem is that there is a cost associated with the data integrity guarantees they make (usually it appears in scalability bottlenecks rather than in pure computational inefficiencies), regardless of how good the implementation is, and if you don't need some of those guarantees, you can dispense with them and end up with better performance (again, this typically means better scalability). Additionally, this is the kind of bottleneck that you just can't throw more resources at. Sure you can find the bottleneck and beef up that particular component to do more transactions/second, but at a certain point you've isolated the bottleneck on a world-class server that is doing nothing but that, and it's still a bottleneck. At that point (preferably long before you reach that point) you have to look at transitioning to an infrastructure that makes some kind of tradeoff that allows the removal of the bottleneck, which is what NoSQL does.

I doubt Twitter wants very many RDBMS-type data coherency guarantees at all. 140-character text strings with a similarly-sized amount of metadata, and no real-time delivery guarantees? Sounds like their database can get pretty inconsistent without messing things up badly. It seems to me they would be well served by using a database that offers just what they want/need in that area and better performance.

As I said, there are projects and then there are projects. Tweater is not the kind of project that requires any real database in the first place; who cares if a commit is transactional there?

As for your last comment: problems with database performance are all about design. You think NoSQL will not hit the same roadblocks in projects that don't do design right? What are they going to move to when that one fails? NoNoSQL++?

I am going to add a few other things here. The first is that "not possible to scale" is not really accurate. I believe there are ways to design structures so that write capacity on an RDBMS can scale upward with the nodes on the network. Of course this only works for some types of applications (the approach I have in mind would work with Twitter, for example). And even with Amazon, you would CERTAINLY want RI (referential integrity) on purchases even if you don't care about reviews.

Speed is latency (how long it takes). Scalability is throughput (how many concurrent). Or put another way: speed is the quality, throughput is the width.
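That latency/throughput split can be made concrete with Little's law (concurrency = throughput x latency), rearranged here as a tiny worked example; the numbers are invented for illustration:

```python
def throughput(concurrency, latency_s):
    """Little's law rearranged: requests/sec a system sustains when
    `concurrency` requests are in flight, each taking `latency_s` seconds."""
    return concurrency / latency_s

# Identical 10 ms latency ("speed"), but 100x the concurrent "width":
slow_narrow = throughput(concurrency=1,   latency_s=0.010)
fast_wide   = throughput(concurrency=100, latency_s=0.010)
```

Both systems answer any single request equally fast; only the wide one scales, which is exactly the distinction the comment is drawing.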

who cares what twuufter is running off.

Well, developers, and their managers do. They're nothing if not fashion victims.

RDBMSs aren't the be-all and end-all of scalability (or speed; they perform a shitload of management functions you may or may not need). While attempting to scale a conventional RDBMS you run into write consistency problems, and lookup performance problems unless you specifically design...

I don't think you understand the niche that NoSQL databases are trying to fill.

The more interesting aspect of all of this 'NoSQL' movement is the belief that achieving some speed improvement over some relational databases makes them so much better.

It's not a black-and-white, panacea-type situation. Relational databases are good at some things, non-relational databases are good at others. Where non-relational databases are better is at solving very specific problems, many of which...

Freedom of choice, definitely. I had projects just recently where I used property files as a database - inserts, deletes, updates, all in a property file. Easy enough, because it is just a hash map. You don't impress me with any of it; it's not in any way new, first of all, but it does not replace any RDBMS where an RDBMS is needed.

My entire point is that Twooter never needed an RDBMS in the first place. They should be just fine without any database usage on the front end, and forget about JSON. The problem with...

Is there really a huge issue with rdbms speeds? Well if there is something there, that's what needs to be looked at. If RDBMSs are not fast enough, that's just an opportunity to work more on them to speed them up.

What makes you think this has not been done? Sometimes the combination of arrogance and ignorance here is amazing. Very bright minds are working on all kinds of approaches; and of course Oracle (et al.) are working on their set of tools to improve them as well.

RDBMSs are optimized for READS, not writes. You can produce a 1,000-machine MySQL-InnoDB cluster that will be faster than memcached and be fully ACID compliant. But you'll only ever have 1 write node. You CAN do sharded masters with interleaved auto-incremented values, but then your foreign keys are totally out the window - as is your ACIDity. Oracle has clustered lock managers, but very quickly is going to max out its scalability - especially if it's limited to a single SAN. Relatively expensive 15,00...
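The "sharded masters with interleaved auto-incremented values" trick mentioned above corresponds to MySQL's auto_increment_offset / auto_increment_increment settings. A small sketch of the allocation pattern (illustrative only, not MySQL internals):

```python
class ShardCounter:
    """Interleaved auto-increment across sharded masters: shard i of n
    hands out i+1, i+1+n, i+1+2n, ... so no two masters ever collide,
    at the cost of the cross-shard guarantees (FKs, ACIDity) noted above."""
    def __init__(self, shard_index, num_shards):
        assert 0 <= shard_index < num_shards
        self.next_id = shard_index + 1   # offsets are 1-based, as in MySQL
        self.step = num_shards           # auto_increment_increment analogue

    def allocate(self):
        allocated = self.next_id
        self.next_id += self.step
        return allocated

# Two masters: shard 0 issues 1, 3, 5, ...; shard 1 issues 2, 4, 6, ...
a, b = ShardCounter(0, 2), ShardCounter(1, 2)
ids = [a.allocate(), b.allocate(), a.allocate(), b.allocate()]
# -> [1, 2, 3, 4]: interleaved, collision-free across both masters
```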

Thank you for the informative and thought-provoking post. It's certainly refreshing to see discourse on a level above "I hate MySQL, therefore SQL sucks." You make some good points.

Nevertheless, an RDBMS is still the way to go. You hint at the reason in your last paragraph, actually. The entire NoSQL "movement" is predicated on a confusion of implementation and interface. You describe various problems with the way conventional RDBMSes employ the disk: who said RDBMSes had to use those approaches?

The problem is that with this kind of backing storage, you can't implement most of SQL effectively. So you might end up with a 'SQL' database where you can't use joins in production due to performance. So you end up with the worst of both worlds: a 'relational' database where you can't use most of the relational operations due to performance issues. And you still have a relational interface, so you can't do the kind of magic optimizations you can do with a simple key/value store.

I totally love RDBMSs, don't get me wrong. You can manipulate schema on the fly (more or less). Introduce optimizations independently of the source code. You don't have to think about fringe cases or data integrity. But when a project grows to a certain point, you have two choices: go to a hyper-expensive RDBMS solution ($100k.. $500k) (for a project that may only be worth $100k), or identify the key bottlenecks and try to re-architect the tables of interest. I've often found that DRBD+NFS flat file...

If you don't really need a database to run your 'website', then who cares if you use flat files or an in memory hashmap for all your data needs?

There is a difference between needing a structured storage mechanism (database) and needing a database that implements the relational model and provides ACID guarantees. Further, many non-relational databases provide specific, weaker forms of ACID guarantees that are better than (say) naive flat file storage would, while providing better scalability in certain applications...

http://slashdot.org/~roman_mir/comments - I imagine the twater storm of moderation points was spent well this time; every single post I had on this issue was above 3 points, and now, within 1 hour, all comments have been moderated down. To me that's just funny - someone does not like the truth.

I just wonder: is it the twater birds, or does it have something to do with the NoSQL ideologists?

you really should have left that comment to the 'bad analogy guy', he could make it sound good.

If other tools are faster and better than an RDBMS, then why should people waste their time with the slower option?

- faster and better, ha? So you don't really mind if your bank switches its datastore to a 'faster and better' NoSQL system, whatever the latest fad name is? I mean, what's a few dollars not rolled back in a failed transaction when your employment check is deposited?

The propeller would have been a file in a makeshift file system; we use jet engines for large commercial aircraft now. You don't see the r...

goodness, that is a terrible article you are referring to, just awful.

Additionally, managing relational databases in a production environment can become labor intensive and error-prone. Each database package comes with its own world of configuration options, performance sensitivities, bugs, and tools...

- just like any other software. What, the NoSQL software will not have its own 'labor intensive, error-prone, configuration options, performance sensitive, bugs and tools' features? What, they invented a way to write software in the NoSQL camp that can avoid being complex to maintain in real production environments, while being totally bug-free, having no complex configuration options, and not being performance-sensitive?

While these issues usually start small, they can become a drain on developers' time and resources as the product matures and its needs become more complex. This complexity of management arises from the complexity of the database packages themselves; it is their very breadth of capabilities which makes them difficult to manage.

They should move to Intersystems [intersystems.com] Caché [intersystems.com]. SQL, objects, XML, and even MUMPS. It will make SQL and NoSQL fans equally happy. And it's damn fast. Much leaner than Oracle, DB2 or Informix, too. Excellent support, extremely good. Not cheap, though.

Excellent. That's, in fact, our use case. We have a very specialized application where every client is also server (multi-master). We even wrote our own database replication software, which is better than Intersystems' (the "shadow replication" Caché is able to do does not allow multi-master replication).

I love how ass-backwards Twitter has always been about learning how to scale their 90s infrastructure up. I remember when they called out the Ruby community because they didn't understand MySQL replication and memcached.

I guess without a profit model they couldn't use a real RDBMS like Oracle. EFD (Enterprise Flash Drive) support anyone? 11g supports EFD on native SSD block-levels. Write scale? How about 1+ million transactions/sec on a single node Oracle DB using <$100K worth of equipment and licenses? Anyway, I've built HUGE databases for a long time, odds are most of you have interfaced with them. Just because it's free and open-source doesn't make it cheap.

I love FOSS don't get me wrong, but best-in-class is best-in-class. I only use FOSS when it happens to be best-in-class. I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter anyone?

Holy fuck, the right tool for the right job, please? Oracle does some things for some markets really well, but for the rest of us who don't need such a high degree of transactional safety, that $90k-plus two-node RAC price tag might just end up taking your great web 3.0 business through development, maybe early beta, before you begin liquidating assets. That's per-processor licensing, too, on a database that scales vertically well (very well, really) but not horizontally well (sharding a...

You're either intentionally missing my point or have simply missed it.

Sure, you could run Oracle on x86 desktop hardware. You could even go so far as buying two cheap e-machines and run them both as Linux-based RAC nodes. Granted, those shitty nodes with the fibre-channel attached disk array would still have set you back probably over $90k with the additional hardware and licenses.

They are seeing about 1/2 million transactions per second with this setup based on the information given, but no word of what their cluster consists of. If it is just a handful of generic PCs, $100,000 for your setup looks pretty expensive.

You're right, I failed to mention disaster recovery -- it was something we looked at; it's just been a while since we went through the evaluation process, so I've forgotten a few things.
We actually liked Cassandra for DR scenarios -- the snapshot functionality makes backups relatively straightforward, plus multi-DC support will make operational continuity in the case of losing a whole DC a possibility.

I love Oracle, it is a fine database. Would I personally buy it? Nope, but as long as it is OPM (Other People's Money) I am perfectly fine with it. Now say I was designing something like a medical records system: Oracle would be a no-brainer. Missing a couple of tweets here and there, who is really going to care?

I am curious what someone with your experience thinks of PostgreSQL. Would you say that it can scale as well as Oracle does?

This is a genuine question, as I am pondering between the two for my startup. Even though I've already done my investigations, one more opinion cannot hurt :) Assuming my current DB design holds, it will have about 50 tables, most having fewer than 10,000 records and some having a few million records (they will be partitioned). The volume of reads will be much higher than writes. Write quer...

I laugh at how none of the requirements included disaster recovery. No single point of failure does not preclude failing at every point simultaneously. EMP bomb at your primary datacenter anyone?

1) They never said they didn't plan for disaster recovery. It's silly to deride them for not discussing the entirety of their backups and disaster recovery efforts when the whole topic of the article was their move to Cassandra as a primary data store.

I've never dealt with an EMP, but a more realistic threat with similar effects would be a hurricane or earthquake. I used to work at an international bank and we had to deal with both (offices in FL and CA). For the most part the best solution was to have an identical setup at another office, with all applications available via VPN and/or web access. We had a separate pipe that was used only for backup data transfers. The DB transaction logs were written both locally and remotely. All u...

It's fascinating how, after initially being a poster boy for the post-Java revolution, Twitter is gradually moving their architecture to the JVM, piece by piece. I think it's actually a credit to them that they seem to have level heads and are evaluating technology on its merits (whereas if you talk to most of the Ruby/Python crowd, they would rather stick toothpicks in their eyes than endorse a solution that involves Java).

Until recently I thought the same way; I would never endorse a solution that involves Java. However, I recently came to the same realization that Sun did when they created it: Java is a fantastic way to oversell gobs of expensive hardware. I am a system administrator, so the more hardware it takes to run a solution, the better off I am: more machines, more money, and better job security. So I have now fully jumped on the Java bandwagon. Java makes me smile.

Sure - but I think the whole point is that you'd be smiling even more if they were using one of the modern and trendy dynamic languages, because you'd likely have 2-3 times the amount of hardware to look after. I'm not sure what alternative you would propose that uses less hardware, but there actually aren't many that are better than the JVM these days.

It's fascinating how, after initially being a poster boy for the post-Java revolution, Twitter is gradually moving their architecture to the JVM, piece by piece.

I think it's fascinating, too -- but probably in a very different way than you do. You seem to think that it is a repudiation of some mythical "post-Java revolution", when in many ways I think it is a validation of exactly the approach that was common to pushing Ruby, Python, and similar languages as more agile alternatives to Java. The appeal of tools n...

A lot of the complaints from NoSQL seem to be about DBMSes being too slow and SQL being too hard. And yet a lot of them invent query languages, or query languages similar to SQL. Supposedly Oracle scales up really well. There is a paper that compares MapReduce to parallel databases, and Hadoop takes a huge beating from the RDBMSes in performance. Now the funny thing is that Oracle was not included, yet most contend that if you pay enough, Oracle scales really well. DB2 also scales, because in 1999 I work...

But most open source databases seem unable to compete with the likes of the commercial parallel databases. It seems like an open source parallel database would do a lot to silence many NoSQL critics. There is still the complaint about needing to define a schema; however, if you are not exploring the data and are processing the same data over and over again, it seems like a good idea to define a schema anyway -- that way you can better detect files that don't conform.
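The point about schemas catching non-conforming records is cheap to implement even outside an RDBMS. A minimal, hypothetical validator (the field names and types are invented for the example):

```python
# Hypothetical schema: required fields and their expected Python types.
SCHEMA = {"user_id": int, "tweet": str, "retweets": int}

def violations(record, schema=SCHEMA):
    """Return a list of ways `record` fails to conform to `schema`."""
    errs = [f"missing field: {k}" for k in schema if k not in record]
    errs += [f"bad type for {k}: {type(record[k]).__name__}"
             for k, t in schema.items()
             if k in record and not isinstance(record[k], t)]
    errs += [f"unknown field: {k}" for k in record if k not in schema]
    return errs
```

A conforming record yields an empty list; running this over a whole file is exactly the "detect files that don't conform" check the comment describes, without requiring a full database.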

[Complaining that] "SQL is too hard"? Well, one can assume you can ignore this class of amateurs -- there's no lack of free learning tools for SQL, and it's dirt simple. "And yet a lot of them invent query languages/query languages similar to SQL." See, I think you're magically conflating two classes of programmers. There are people, like myself, that love the expressiveness of SQL over virtually any other language for data-set manipulation. Thus we would like the option to utilize SQL on even a...

"But it seems like an open source parallel database would do a lot to silence many nosql critics" - you're not going to silence people that think of data as simple key-value pairs, or highly specialized full-text-searching (which is related to but independent of RDBMS activity).

Simple key/value pairs work for some things. Most data cannot be managed reasonably as key/value pairs. And full-text searches are entirely orthogonal; that involves searching through text rather than questions of semantic informa...

I'm not sure this is true -- there are plenty of NoSQL alternatives written in other languages (Erlang (CouchDB), C++ (MongoDB), ...).
I think the choice has to do with the somewhat more powerful model Cassandra has (compared to other choices that are simpler), or its good distribution (others only support limited sharding, not true distribution; AFAIK CouchDB has this issue).