Waffle Grid: Remote Buffer Cache -VS- SSD Grudge Match

As one of the co-founders of the Waffle Grid project, I beam with pride every time I get a stellar benchmark or find a new use for the Waffle. But as a professional I still have to be critical of any solution I would recommend or deploy. One of the big goals of Waffle Grid is to replace disk IO, which should be slow, with remote memory, which should be much faster. But what happens when the disk is no longer slow? This leads me to ask myself: is Waffle Grid only good for servers with slower disk? Or is this a solution that can also help systems with fast disk? So which should you deploy, SSD or Waffle? Are they competitors? Or are they complementary technologies?

I am going to say this: in these tests latency is king. The faster the drives can deliver data, the higher the benchmarks should be. Basically, if my interconnect can deliver data faster than the drive can serve it up, I should still see Waffle Grid perform better than SSD. A note: all previous tests were done against 2 striped 10K RPM disks. So from a latency perspective, how does the Intel do?

So the SSD drive starts really fast with 1 thread, about a quarter of a ms per request, before rising to just over 2 ms per request with 10 threads.

Over 1Gb Ethernet I get about 0.15-0.20 ms latency with a single thread. I have been looking for a good tool to scale TCP threads up, but have not found one; I would like to produce a graph similar to the one above. I have looked at netperf and ttcp, and got the above 0.15-0.20 ms number using ttcp. I assume that the added latency from multiple threads with such a small amount of data (16K) being transferred should be small, but I prefer more concrete numbers. When I spawn off 10 ping tests (not TCP, so it is going to be different) I get consistent numbers up to 10 “threads” at once. So I can only assume that 10 threads requesting 16K blocks over the network should not experience the same latency increase the disk does (but I want to know for sure). Maybe one of our network gurus out there can point out another way to verify this… I am listening.
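For lack of a better tool, here is a rough sketch of the test I have in mind: N client threads each request 16K blocks over TCP and time the round trip. Everything here is hypothetical plumbing, not a real benchmark tool; the embedded server just keeps the sketch self-contained (point the clients at a remote box to measure an actual network):

```python
# Sketch: measure per-request latency for N concurrent TCP clients,
# each fetching 16K blocks (roughly an InnoDB page transfer).
import socket, statistics, threading, time

BLOCK = 16 * 1024     # 16K block per request
REQUESTS = 200        # requests per client thread

def serve(sock):
    """Trivial server: answer every 1-byte request with a 16K block."""
    payload = b"x" * BLOCK
    while True:
        conn, _ = sock.accept()
        threading.Thread(target=handle, args=(conn, payload), daemon=True).start()

def handle(conn, payload):
    with conn:
        while conn.recv(1):
            conn.sendall(payload)

def client(addr, results, idx):
    """One client thread: time REQUESTS round trips, record the mean (ms)."""
    latencies = []
    with socket.create_connection(addr) as s:
        for _ in range(REQUESTS):
            start = time.perf_counter()
            s.sendall(b"?")
            got = 0
            while got < BLOCK:
                got += len(s.recv(BLOCK - got))
            latencies.append((time.perf_counter() - start) * 1000.0)
    results[idx] = statistics.mean(latencies)

def run(threads, addr):
    """Mean latency in ms across `threads` concurrent clients."""
    results = [0.0] * threads
    workers = [threading.Thread(target=client, args=(addr, results, i))
               for i in range(threads)]
    for w in workers: w.start()
    for w in workers: w.join()
    return statistics.mean(results)

if __name__ == "__main__":
    srv = socket.create_server(("127.0.0.1", 0))
    threading.Thread(target=serve, args=(srv,), daemon=True).start()
    addr = srv.getsockname()
    for n in (1, 2, 5, 10):
        print(f"{n:2d} threads: {run(n, addr):.3f} ms per 16K request")
```

Run against localhost this only measures the loopback, of course; the interesting run would replace the embedded server's address with a memcached-class box on the other end of the 1Gbe link.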

As I mentioned before, I got better DBT2 numbers from a single Intel SSD drive than I got from an 8-disk (10K RPM) RAID 10 system. Now I am not sure how that’s going to translate if you had 8 SSDs. This type of information is critical because, let’s face it, no one is going to put a single disk in their system… and if they do they deserve everything they get:) I would love to get another 7 drives or get access to a fully decked out server with 8 Intel SSDs ***hint, hint for all those looking to buy me a gift***… I not only want to see just how fast a large array of these can be, but I want to see if a remote buffer pool via Waffle Grid is still a viable, cost effective solution.

But for now let’s test with what we have, I am not greedy! A single SSD should at least give us some idea of whether faster disk will start to make Waffle obsolete.

So for Intel’s SSD we have latency somewhere between 0.24 ms and 2 ms (depending on load) versus 0.2 ms for the network… Just based on this I would expect to see some benefit from a Waffle Grid deployment. But there is going to be additional overhead on the Waffle side (code, memcached overhead, etc.), so these may wash.

Previously we had gotten 3218 TPM on our standard (768 buffer pool, 20 warehouses, 16 threads) DBT2 test with 2 mirrored 10K RPM drives. By merely switching over to SSD that number jumped to 8158 TPM! That’s a huge jump for 1 piece of hardware, but it shows you how disk constrained we actually were. That number alone is just shy of our Waffle Grid test earlier, which delivered 9121 TPM.
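To put that jump in perspective, the quick math on the TPM numbers above:

```python
# DBT2 numbers from the post
disks_10k  = 3218   # TPM, 2 mirrored 10K RPM disks
ssd        = 8158   # TPM, single Intel SSD
waffle_10k = 9121   # TPM, Waffle Grid over the 10K disks

jump = (ssd - disks_10k) / disks_10k * 100
print(f"SSD over 10K disks: +{jump:.0f}%")                       # ~ +154%
print(f"SSD alone reaches {ssd / waffle_10k:.0%} of Waffle+10K")  # ~ 89%
```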

What is interesting about these numbers is that the Waffle enabled database running on 10K disks is only about 11% faster than just SSD. But what happens when you run Waffle and SSD together?

Wait a second, that’s a very disappointing bump in performance… I mean, we are about 20% better than just SSD, but would you really want to deploy a second server just for 20%? I wouldn’t. Now, as the workload gets busier I think this number will grow. You are alleviating disk contention with remote memory calls, so Waffle is going to help there, but how much will require further testing. Before you get too discouraged, there is one more test. The above 9825 TPM was achieved while running over 1Gb ethernet, so what about a faster interconnect? I know our previous tests showed little difference between 1Gbe and using the localhost (an attempt at simulating fast interconnects), but we need to try for the sake of completeness. Let’s look:

Hold the presses! Running memcached on the localhost is 62.5% faster than SSD alone, a full 42.5 points higher than running over 1Gb ethernet. Why? My theory is that in the old setup the 10K disks were too slow, dragging down the benefit of running Waffle Grid. The SSD drive shifts the bottleneck: we are no longer constrained by slow disk, rather the network is the bottleneck. This is something we see time and time again in performance tuning: remove one bottleneck and another one appears. It really is like peeling back the layers of an onion. I do not think a potential 50-60% increase in performance can be easily dismissed.
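For those following the math, a quick back-of-the-envelope, assuming both percentages are measured against the SSD-only baseline of 8158 TPM:

```python
# Numbers from the post
ssd         = 8158   # TPM, SSD alone
waffle_1gbe = 9825   # TPM, Waffle + SSD over 1Gb ethernet

gbe_gain = (waffle_1gbe - ssd) / ssd * 100      # gain over SSD alone
implied_localhost_tpm = ssd * (1 + 62.5 / 100)  # 62.5% faster than SSD alone

print(f"1GbE Waffle gain over SSD alone: {gbe_gain:.1f}%")    # ~20%
print(f"implied localhost TPM: {implied_localhost_tpm:.0f}")  # ~13257
```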

Conclusion?

Waffle Grid was conceived to help boost system performance by eliminating disk IO. As I have pointed out before, platter based disk systems can realize huge performance benefits from both SSD and from Waffle Grid. Waffle Grid on its own can be faster in some cases than SSD, but this will more than likely shift as you add more SSD drives or controller cache to your disk subsystem. For those using legacy servers, or those who do not want to plunge into SSD, Waffle Grid may offer an interesting alternative. It can be deployed using existing servers, can scale horizontally (add more memcached nodes), and produces very good performance over current systems. If you are deploying SSDs and looking to keep costs low (1Gbe), I doubt a 20% improvement in performance would be enough to deploy both Waffle and SSDs. However, if you really are looking for the peak performance out of a system, deploying Waffle with a fast interconnect and using SSDs may yield top notch performance that is unmatched elsewhere.

Of course these are generic benchmarks: a controlled set of tests simulating where we think we can have the most benefit. Other systems may see more or less of a gain. It all depends on the system. As with any solution, tip, or trick, benchmark your load with it before moving it into production.

9 Responses to Waffle Grid: Remote Buffer Cache -VS- SSD Grudge Match

One thing that’s worth mentioning – SSD drives are, currently, small and their platter spinning relatives don’t really seem to be slowing down in terms of their ability to deliver even more space. So SSDs might not be viable for huge databases, due to cost or even hardware constraints. I think WaffleGrid could potentially do some good here.

True, but you may find people starting to go back in time and splitting files between slower and faster disk. Using InnoDB file-per-table you could put high usage data on SSDs, and less frequently used data on slow disks.

Also there is the cost factor you mention, which will get better as time goes on. But for 10 SSD disks you are going to set yourself back $7-8K for consumer drives, and probably double that for the enterprise drives when they come out. If you have lots of memcached machines already, you may be able to instantly leverage Waffle Grid without spending on additional infrastructure.

Well, usually if you hit RAM constraints for your buffer pool you will have a lot of data (or no money), which means you will also require a large number of SSDs, which can get extremely expensive if you want to store a lot of data.

What could also be interesting is to use the SSD as another storage layer, so it goes Buffer Pool (-> Waffle?) -> SSD -> Disk. Your code already allows putting memcached before disk, maybe you can do the same for SSD? One could even think of using SATA disks for storing data then, if most of the working set is in RAM or SSD.

It would also be interesting to see 10GigE; it’s faster than SATA at least.

10GBE or even a Dolphin interconnect would be very interesting to look at.

About the SSD as an L3 cache: it’s an interesting idea. I see a few technical issues that may need to be overcome, so I would have to chew on it. I mean, this may be somewhere memcachedb (mentioned in a previous comment) may be useful.

A couple of things I need to think about: how to add a third read from L3 (first internal, second memcached, third SSD) without slowing down the process too much. Also, because the third level cache is data LRU’d from memcached and not from InnoDB, this becomes a challenge (we have to rely on the memcached LRU to push it to disk)… this means the data is once removed from the database, which may lead to stale data in some edge cases, but I need to think on it. One possible solution would be to dual write to disk and memory, but the performance hit could be huge. Maybe we could implement an async write for the disk piece. Good idea, just not sure how this would turn out.
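To make the read path concrete, here is a rough sketch of how the three levels could chain together. Everything in it is a stand-in: plain dicts play the roles of the buffer pool, memcached, and the SSD store, and none of this is actual Waffle Grid code.

```python
# Sketch of a three-level read path: buffer pool -> memcached -> SSD -> disk.
# Dicts stand in for the real tiers; a real implementation hooks the
# InnoDB buffer pool miss path.
from collections import OrderedDict

class TieredCache:
    def __init__(self, l1_capacity):
        self.l1 = OrderedDict()   # local buffer pool (LRU order)
        self.l2 = {}              # memcached stand-in
        self.l3 = {}              # SSD store stand-in
        self.l1_capacity = l1_capacity

    def read(self, page_id, read_from_disk):
        # 1st read: local buffer pool
        if page_id in self.l1:
            self.l1.move_to_end(page_id)
            return self.l1[page_id]
        # 2nd read: memcached
        if page_id in self.l2:
            page = self.l2.pop(page_id)
        # 3rd read: SSD tier, holding pages LRU'd out of memcached
        elif page_id in self.l3:
            page = self.l3.pop(page_id)
        else:
            page = read_from_disk(page_id)   # final fallback
        self._install(page_id, page)
        return page

    def _install(self, page_id, page):
        self.l1[page_id] = page
        if len(self.l1) > self.l1_capacity:
            evicted_id, evicted = self.l1.popitem(last=False)
            self.l2[evicted_id] = evicted    # demote to memcached
            # in a real system memcached's own LRU would later push this
            # entry on to the SSD tier, ideally via an async write
```

This also shows exactly the problem above: the SSD tier is fed by memcached’s LRU rather than InnoDB’s, so its contents are once removed from the database and writes would have to invalidate all three tiers to avoid stale reads.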

Flash on PCIe can be awesome, no doubt. But there are still some serious issues standing in the way of widespread adoption. First is cost: SSDs are priced around $650 for an Intel 80GB drive, while it runs (last time I checked) $2400 for an 80GB Fusion IO drive. Consumers can get 3 Intel drives (with money left over) for the price of 1 Fusion IO card, getting 240GB of storage vs 80GB.

The second thing to overcome is form factor. With 2.5-inch drives you can pack a lot of SSDs into a single server. Depending on the form factor, your PCIe slots are going to be extremely limited: with 1U/2U servers you may get 2 or 3 PCIe slots, and many clients I work with will take up 1 with an extra NIC, sometimes 2. Even your bigger boxes end up with 5-7 PCIe slots. Also, don’t forget about all the blades being deployed today.

While these may be viable solutions for the extreme performance edge cases, I still think clients will look to a solution like Violin Memory or Texas Memory Systems, who make appliances that can be integrated into their environments, before maxing out expansion cards. When you’re dealing with terabyte-size databases and you only have 5 PCIe slots, you’re not going to be able to pack enough density into your machine.

While I understand and agree the price/size will come down… the price of SSD drives will plummet (and is plummeting) more quickly… there are just too many vendors starting to play in this space. Competition will slowly bring the price/GB closer to disk.

Also, whether it’s SSD or Fusion IO, many people running MySQL are still bound by their hosting providers… if the hosting providers support it, great; if not, all bets are off.

Kevin, also note the flash chips for both Fusion IO and the Intel SSD are NAND flash. I am not sure where Fusion is getting their chips from, but there are only a handful of manufacturers (maybe they use the same ones; I could not find who they are getting theirs from with a quick search)… so the big difference is the PCIe interface (from what I read) and a few other tricks they pulled, i.e. using RAID across the chips.

Not dismissing it, in fact I would love to test one. But in my opinion these will represent a small % of the market, while true 2.5-inch form factor SSDs may overtake a large portion of standard rotational disks’ market share. Where these devices could make a huge splash is in specialized high speed files/operations: add two of these cards into a system and point swap, tmp, and a hot data directory at them, pushing log files (redo in Oracle, for instance, could see a huge benefit) along with high IO tables… or use them with smaller databases… If they ship the 640GB devices they mentioned last year, then all this could change quickly, however (depending on the price).

We need something to replace SATA:) Infiniband? Fibre channel? They all have a higher bandwidth.

I have read, though not tested, that the Intel SSD (enterprise model) can push 250MB/s (I really need to find an enterprise model to test)… 3Gbps SATA should be able to handle a little more than that, but that could be butting up against the max. SATA running at 6Gbps is due out soon, which should bump the bandwidth for faster SATA flash drives:)
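The back-of-the-envelope on the link itself: SATA uses 8b/10b encoding, so only 8 of every 10 bits on the wire are data, which puts a 3Gbps link at 300MB/s of usable bandwidth:

```python
# SATA line rate vs. usable bandwidth (8b/10b encoding overhead)
line_rate_bps = 3_000_000_000                        # 3Gbps SATA
usable_MB_s = line_rate_bps * 8 / 10 / 8 / 1_000_000  # 300 MB/s of data
drive_MB_s = 250                                     # quoted Intel read speed

print(f"usable 3Gbps SATA bandwidth: {usable_MB_s:.0f} MB/s")
print(f"the drive would use {drive_MB_s / usable_MB_s:.0%} of the link")
```

So a 250MB/s drive is already at roughly five sixths of what the link can carry, which is exactly the “butting up to the max” problem.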

Direct PCIe is bad because there are few slots. SAS/SATA is bad because it is just the wrong interface. You’re right, both are just intermediary products and we will get something else for really mainstream flash usage: something you can pack a lot of into a server, and something which does not have all this SAS/SATA overhead.