You program in a dynamic language, that runs on a JVM, that runs on an OS designed 40 years ago for a completely different purpose, that runs on virtualized hardware. Does this make sense? We've talked about this idea before in Machine VM + Cloud API - Rewriting The Cloud From Scratch, where the vision is to treat cloud virtual hardware as a compiler target and to convert high-level language source code directly into kernels that run on it.

As new technologies evolve, the friction created by our old tool chains and architecture models becomes ever more obvious. Take, for example, what a team at UCSD is releasing: a phase-change memory prototype - a solid state storage device that provides performance thousands of times faster than a conventional hard drive and up to seven times faster than current state-of-the-art solid-state drives (SSDs). However, PCM has access latencies several times slower than DRAM.

This technology has obvious, mind-blowing implications, but an interesting and less obvious one is what it says about our current standard datacenter stack. Gary Athens has written an excellent article, Revamping storage performance, spelling it all out in more detail:

Computer scientists at UCSD argue that new technologies such as PCM will hardly be worth developing for storage systems unless the hidden bottlenecks and faulty optimizations inherent in storage systems are eliminated.

Moneta bypasses a number of functions in the operating system (OS) that typically slow the flow of data to and from storage. These functions were developed years ago to organize data on disk and manage input and output (I/O). The overhead introduced by them was so overshadowed by the inherent latency in a rotating disk that they seemed not to matter much. But with new technologies such as PCM, which are expected to approach dynamic random-access memory (DRAM) in speed, the delays stand in the way of the technologies' reaching their full potential. Linux, for example, takes 20,000 instructions to perform a simple I/O request.

By redesigning the Linux I/O stack and by optimizing the hardware/software interface, researchers were able to reduce storage latency by 60% and increase bandwidth as much as 18 times.

The I/O scheduler in Linux performs various functions, such as assuring fair access to resources. Moneta bypasses the scheduler entirely, reducing overhead. Further gains come from removing all locks from the low-level driver, which block parallelism, by substituting more efficient mechanisms that do not.

Moneta performs I/O benchmarks 9.5 times faster than a RAID array of conventional disks, 2.8 times faster than a RAID array of flash-based solid-state drives (SSDs), and 2.2 times faster than fusion-io's high-end, flash-based SSD.
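To see why the 20,000-instruction figure quoted above stopped being ignorable, here is a rough back-of-the-envelope sketch. Only the instruction count comes from the article. The cost of 1 ns per instruction and the two device latencies (5 ms for a rotating disk, 10 µs for a fast non-volatile device) are illustrative assumptions, not measurements from Moneta.

```python
# Only this number comes from the article.
INSTRUCTIONS_PER_IO = 20_000

def software_share(device_latency_us, ns_per_instruction=1.0):
    """Fraction of total request time spent in the OS software path.

    ns_per_instruction is an assumed average cost, not a measured one.
    """
    software_us = INSTRUCTIONS_PER_IO * ns_per_instruction / 1000.0
    return software_us / (software_us + device_latency_us)

# Assumed, illustrative device latencies in microseconds.
ROTATING_DISK_US = 5000.0
FAST_NVM_US = 10.0

if __name__ == "__main__":
    print(f"disk:     {software_share(ROTATING_DISK_US):.1%} of the time is software")
    print(f"fast NVM: {software_share(FAST_NVM_US):.1%} of the time is software")
```

Under these assumptions the same software path goes from a rounding error on disk to the majority of the request time on a fast device, which is the argument the researchers are making.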

The next step in the evolution is to reduce latency by removing the standard I/O calls completely and:

Address non-volatile storage directly from my application, just like DRAM. That's the broader vision—a future in which the memory system and the storage system are integrated into one.

A great deal of the complexity in database management systems lies in the buffer management and query optimization to minimize I/O, and much of that might be eliminated.
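For a feel of what "address storage just like DRAM" looks like from application code, here is a minimal sketch using Python's standard mmap module. It is only an analogy: an ordinary temporary file stands in for the non-volatile device, the kernel and its page cache are still involved, and the function name store_and_load is made up for this example. It is not how Moneta works.

```python
import mmap
import os
import tempfile

def store_and_load(payload: bytes, offset: int = 128, size: int = 4096) -> bytes:
    """Store bytes through a memory mapping, then read them back via a new mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # give the backing file a fixed size
        with mmap.mmap(fd, size) as region:
            # A plain slice assignment: no explicit read()/write() call.
            region[offset:offset + len(payload)] = payload
            region.flush()  # push the modified pages back to the file
        # Map the file again to show the bytes outlived the first mapping.
        with mmap.mmap(fd, size) as region:
            return bytes(region[offset:offset + len(payload)])
    finally:
        os.close(fd)
        os.remove(path)

if __name__ == "__main__":
    print(store_and_load(b"no read() or write() here"))
```

The application code touches storage with loads and stores rather than I/O calls, which is the programming model the quote is pointing at, even though here an ordinary file system is doing the work underneath.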

But there's still a problem in the latency induced by the whole datacenter stack (paraphrased):

This change in storage performance is going to force us to look at all the different aspects of computer system design: low levels of the OS, through the application layers, and on up to the data center and network architectures. The idea is to attack all these layers at once.

Today 10-gigabit interfaces are used more and more in datacenters and servers. On these links, packets flow as fast as one every 67.2 nanoseconds, yet modern operating systems can take 10-20 times longer just to move one packet between the wire and the application. We can do much better, not with more powerful hardware but by revising architectural decisions made long ago regarding the design of device drivers and network stacks.

In current mainstream operating systems (Windows, Linux, BSD and its derivatives), the architecture of the networking code and device drivers is heavily influenced by design decisions made almost 30 years ago. At the time, memory was a scarce resource; links operated at low (by today's standards) speeds; parallel processing was an advanced research topic; and the ability to work at line rate in all possible conditions was compromised by hardware limitations in the NIC (network interface controller) even before the software was involved.
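The 67.2 nanosecond figure quoted above can be reproduced with a little arithmetic. The sketch below assumes minimum-size Ethernet frames and standard framing overhead (64-byte frame, 8-byte preamble, 12-byte inter-frame gap). Those constants are not stated in the quoted text, so treat them as assumptions.

```python
# Assumed Ethernet framing constants (not given in the quoted text).
MIN_FRAME_BYTES = 64        # minimum Ethernet frame, including CRC
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames
LINK_BITS_PER_SECOND = 10_000_000_000  # 10-gigabit link

def packet_interval_ns(frame_bytes=MIN_FRAME_BYTES, link_bps=LINK_BITS_PER_SECOND):
    """Time on the wire, in nanoseconds, occupied by one frame and its overhead."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return wire_bits * 1e9 / link_bps

def packets_per_second(frame_bytes=MIN_FRAME_BYTES, link_bps=LINK_BITS_PER_SECOND):
    """Maximum frame rate the link can carry at the given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits

if __name__ == "__main__":
    interval = packet_interval_ns()
    print(f"one packet every {interval:.1f} ns")
    print(f"{packets_per_second():,.0f} packets per second at line rate")
    # The quoted "10-20 times longer" per-packet software cost, in nanoseconds.
    print(f"software cost per packet: {10 * interval:.0f} to {20 * interval:.0f} ns")
```

At that rate an operating system that spends 10 to 20 packet times on each packet can only keep up with a small fraction of line rate for small packets, which is the gap the quoted passage is describing.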

There's a whole "get rid of the layers" meme here based on the idea that we are still using monolithic operating systems from a completely different age of assumptions. Operating systems aren't multi-user anymore, they aren't even generalized containers for running mixed workloads, they are specialized components in an overall distributed architecture running on VMs. And all that overhead is paid for by the hour to a cloud provider, by greater application latencies and by the means required to overcome them (caching, etc).

Scalability is often associated with specialization. We create something specialized in order to achieve the performance and scale that we can't get from standard tools. Perhaps it's time to see the cloud not as a hybrid of the past, but as something that should be specialized, something different by nature. We are already seeing networking transform away from former canonical hardware-driven models to embrace radical new ideas such as virtual networking.

Your mission, should you choose to accept it, is to rethink everything. Do we need a device driver layer? Do we need processes? Do we need virtual memory? Do we need a different security model? Do we need a kernel? Do we need libraries? Do we need installable packages? Do we need buffer management? Do we need file descriptors? We are stuck in the past. We can hear the creakiness of the edifice we've built layer by creaky layer all around us. How will we build applications in the future and what kind of stack will help us get there faster?

Reader Comments (17)

"You program in a dynamic language, that runs on a JVM, that runs on a OS designed 40 years ago for a completely different purpose, that runs on virtualized hardware. Does this make sense?"

Yes, it makes sense. Just because something is an old technology doesn't make it bad (even if not perfect).

Yes, operating systems need to evolve to better match the ecosystem of the cloud. Yes, some old legacy stuff should go away. And it will, eventually. But no, I don't believe in rewrites from scratch. It almost never works. You will never be able to build a new technology that's on par with existing operating systems in terms of features, stability, (backwards) compatibility and so on. The performance overhead? Not a big deal. Running more hardware is cheaper than hiring more developers.

If you compile an application into an appliance you can run 'in the cloud', you basically get a stripped down OS, because you still need lots of the subsystems/drivers that an OS provides out of the box, but you lose a lot of functionality in the process (no filesystems, logging, scheduled tasks, remote shells/editors, debugging tools, firewalls, support for 3rd party software for monitoring, backups, systems management). You don't think you need those features? Then you probably haven't done any serious sysadmin or devops work.

Dennis, I don't disagree with you for the practical purposes of making stuff work now. The question is should that be our future? And if it's not, what does that future look like? It's important to get people thinking along those lines now.

@Todd: I don't think any from-the-ground-up approach is needed. The Linux OS model in the cloud will evolve over the years. Technology is changing faster today than ever before. Eventually bad design decisions we've been stuck with for decades will vanish as new interfaces and subsystems are introduced to replace the old. You can see this happening already. SysV init is being replaced with upstart in Ubuntu and systemd in Fedora, for example. ext(2|3|4) will eventually be replaced with btrfs. When talented developers are fed up with the status quo and have the opportunity to change it, they will scratch that itch.

I guess what's not really clear to me is this: What problem are you trying to solve by getting rid of the OS? If you think technological change will happen faster because it is unhindered by legacy decisions, I think you're seeing things the wrong way. You can address any issues in the Linux kernel, the source code is there. Modifying it to suit your needs is most certainly faster than getting rid of the OS and trying to do everything yourself (or even as a community effort). There is a reason there haven't been any (successful) Linux forks in all these years: It takes a massive amount of resources to pull it off and there isn't much to gain by doing it. It will not happen unless the Linux kernel stops evolving (look at XFree86 vs X.org).

Sounds like you want a microkernel or, better yet, an exokernel OS. Back in the day people thought they could use partial evaluation to automatically specialize a generic software stack to a specific task, thus optimizing performance. It never worked because it relies on a mythical "sufficiently smart compiler" to optimize it perfectly. Instead, you could maybe port memcached to run on L4 to get a high performance virtual appliance.

When pushing the limits on performance the overhead the OS puts on moving packets from NIC to CPU (and back) is irritating, as is the overhead writing blocks to disks.

Hopefully some thinner libraries can be introduced into linux that take away all the layers but still give you the benefits of being on an OS (i.e. I enjoy being able to gcc compile my C-code, but I could also benefit from a very stripped down linux kernel {and an easy way to build/tweak/customize said kernel})

The question of how to standardize such "thin" libraries is daunting. I/O w/ a HDD & a SSD & a PCIe enabled SSD require different semantics to really push performance to its maximum reach. Then there's PCM, which has different characteristics [latency,thruput,block-size,etc...]. These differences have to be abstracted, so say a RDBMS can run on any of them, which gets us back to where we are now.

Maybe the question should be abstracted differently and defined by the desired behavior and tradeoffs of these characteristics. Then you have to get hardware vendors to play along w/ standardized firmware, etc... I guess it is just a very large & complicated problem that needs solutions on both the general and the specialised levels.

If we had perfect software APIs, we should be able to swap in specialised modules in place of our generalised modules when we want that performance boost, and then we still maintain the "general" ability to compile on many platforms.

Abstraction is a tricky beast, it has its cons, but as a whole, we wouldn't be anywhere near as far as we are now w/o heavy use of abstraction. I like the points in this article, and then I give my 2 cents and feel like I have added only more questions.

You can get rid of LINUX, you can get rid of some of the paradigms used in MODERN OSs, but you can't get rid of the OPERATING SYSTEM. May I remind you that UNIX was created because there were machines around without a convenient OS.

You can do a lot to improve performance. A long time ago, I was in a VAX/VMS class and there a guy asked a pertinent question: "What is the cost [in performance] for doing multi tasking?" At that time batch processing in job queues was still very prominent.

I wouldn't like to do without multitasking, but for sure there is a cost to that! User-friendliness and performance are always a trade-off relative to the technology of the day.

Innovation happens when someone succeeds where all others don't dare to move. So go on, prove that you are right and I am not! Write the cloud from scratch!

Sounds like the premise of Windows back in the old days. And we know it turned out a VERY good idea.

While I don't fundamentally disagree with you, every attempt that has been made to break the Unix stack since Epoch, and that has seen wide use, has been a complete failure and nightmare from a software developer's point of view. At least try to keep the Unix API in your future stack :)

also, sounds like what you want is already in (small) part available from specialized (micro)kernels, or linux (embedded) distributions. So maybe it's also time to use these/help them be developed.

finally, maybe (I don't know) people from the IT/users won't like a performance-optimized stack if it doesn't offer OS-grade maintenance tools? problem with server software is that it should 1) run well enough 2) be maintenance-friendly 3) be user-friendly 4) win/avoid the FUD game. a long way to go. people developing applications tend to see only step 1).

Years ago IBM ruled the world with its operating system OS/370 (later MVS, OS/390, Z/OS) running on big iron. Then in 1972 they came up with a new OS called VM/370. VM is short for virtual machine. VM was able to run multiple instances of OS/370 in parallel. Sounds familiar? Wait, it gets better.

As it turns out, running a full-fledged MVS has a lot of overhead in terms of memory consumption. If you give each of multiple users a virtual machine running MVS, you'll soon run out of mainframe. So they invented a couple of other operating systems: the single-tasking CMS (conversational monitor system) and the multi-tasking GCS (group control system). These operating systems could not run on bare iron. A lot of the services they needed (such as I/O) were abstracted by the hypervisor (VM itself), leaving a pretty lean operating system. [MF experts please don't nitpick. I know they added threads to CMS in the 90s]

Even today a typical VM installation has a CMS machine for each user, a few GCS or CMS machines for servers, a few Linux machines (for some services that haven't been implemented in GCS or for running weird applications), and sometimes one or more MVS machines for some heavy lifting OLTP (these usually get direct access to the disk bypassing the VM)

I believe that we can come up with a model like that - a simplified OS for our applications, with much of "traditional OS stuff" abstracted away by the hypervisor.

Let us not forget that a bit is not a bit is not a bit...no a bit is an IDEA! People act on ideas--not machines. Computer technology as an idea represents the accumulation of millions of individual hours of effort focused toward delivering value from bit-flipping.

At a basic level, accumulations of technology have names, OS, Language, MRP, etc., for convenience of communication and application. Computer science careers and industries build on these named foundations, ever improving the human condition.

Only a fool would suggest intentional elimination of entire technologies simply because of convenience. However, tectonic shifts in performance can dramatically impact economic factors and force changes through necessity.

Even if new technologies dictate new architectures for use, the "OS" will still be called "Linux" because it's not software--it's a philosophy which many people treasure.

Who is going to maintain these systems? How will you debug and tune them once they are running?

Just recompile them? How much can go wrong with that process? How likely is it you can stop a system from working, and cannot get in to do anything about it, besides reformatting it and trying again?

Do you know the time it takes to reformat a system, even slimmed down, just to check a single change because your changes keep not working?

Linux provides a way to manage the system itself, and to keep the single purpose functionality you want working even through changes.

Very special-purpose systems can evolve, like many network appliances and specifically the load balancer subset, such as F5, which was based on BSD but over time became more stripped down and eventually more like firmware over ASICs.

Maybe some database systems will go this way, but any time you need to be able to change how things work beyond some tuning and config parameters, these systems are inadequate. They work for their special purpose only, and general systems are needed to do all other work because of the initial questions I posed.

Pie in the sky is great and fine, but it's something that will happen naturally as systems get locked down due to fulfilling their primary purpose and needing streamlining. Your complaint is missing the point in the rest of the cases, in that the systems can't be locked down enough for long enough for this streamlining to happen. When it does, it will start to happen.

Like when people stuffed an HTTP server into the Linux kernel in the 1990s, because it was too slow to just fetch files from disk using Apache over the OS. They died shortly thereafter because people started doing more than just static file retrieval with HTTP, and so it was no longer useful.

Your use case must out-last the time it takes for the tech to develop, or it will never go anywhere.

What server technologies today that don't already have specialized appliances are ripe for this technology?

That would be a better place for this discussion, and you also have a potential market waiting for you if you can find them.

@synp: some attempts are on the way, see mirage for example: http://www.openmirage.org/ . It is currently able to run your application as a Virtual Machine on Xen. In fact, it has been attempted before, and the few exokernel-based systems do just that. They have failed to reach mainstream so far, but that doesn't really say anything about their potential.

Geoff, it's not that pie-in-the-sky at all. I've worked on a lot of embedded systems that have very stripped down operating systems, with direct DMA access to hardware and very little in the way of OS services. On some of these systems quite a bit of the code was auto-generated from DSLs that targeted an abstraction layer that was extensible and specializable through various interfaces. It was as efficient as possible while being easy to debug and program. In an embedded environment this strange architecture is easy to enforce because of the demands of performance and reliability. Nothing kills both faster than layers of subsystems and programmers writing pile on top of pile of low level code. In the cloud we are moving into this forcing condition as well for technological and cost reasons.

Aside from the fact that the lowest-hanging performance fruit, compared with changing the OS, is likely changing from a dynamic language to one designed for efficiency, I'm sure we'll eventually go in the direction the article is pointing.

Remember, as much of a speedup as Moneta achieved, it is a first attempt, ridiculously unoptimized compared to any modern OS. And it is being run on hardware specifically designed to run current stacks, not academic one-offs.

Just as it would be overkill for every little hardware device a contemporary server consists of to run its own full Linux, why would an app running on a machine that happens to consist of multiple cabinets require each cabinet to have one? I'm here assuming that "the cloud" will eventually specialize past the point of being built entirely out of one sort of standardized, interchangeable machine. After all, the big cloud providers are already loading different software on different machines (webserver, database), so complete machine interchangeability is out the door already. The next logical step, optimizing each function with specialized hardware, is really a difference in degree rather than kind.

A big reason for the generic OS's success is hardware related to begin with. Due to the sheer volume of standalone desktops sold, and the cost of putting a new chip design into production, the chips designed to run general purpose OSes evolved fast enough that they largely surpassed more specialized stuff even at the specialized stuff's own games (see x86-64 vs Itanium). But that age is now drawing to an end. When end user computers are iPads, and the apps are being run in the cloud, the generic desktop/server chip's volume dominance is drawing to an end, bifurcating the chip market and with it killing the generic, "everywhere" OS's raison d'être.