Posted
by
timothy
on Sunday March 21, 2010 @06:48PM
from the ok-let's-split-up dept.

alphadogg writes "With chip makers continuing to increase the number of cores they include on each new generation of their processors, perhaps it's time to rethink the basic architecture of today's operating systems, suggested Dave Probert, a kernel architect within the Windows core operating systems division at Microsoft. The current approach to harnessing the power of multicore processors is complicated and not entirely successful, he argued. The key may not be in throwing more energy into refining techniques such as parallel programming, but rather rethinking the basic abstractions that make up the operating systems model. Today's computers don't get enough performance out of their multicore chips, Probert said. 'Why should you ever, with all this parallel hardware, ever be waiting for your computer?' he asked. Probert made his presentation at the University of Illinois at Urbana-Champaign's Universal Parallel Computing Research Center."

For that matter, since when have software vendors been willing to pay architects, designers, engineers, etc. to optimise their software to milk every cycle from the available CPUs and provide useful output with the minimum of effort? They don't; they just wait for hardware to get faster to keep up with the code.

The only company that I have personally been exposed to that gives half a hoot about efficient performance is Google. It annoys me beyond belief that other companies think it's acceptable to make the user wait for minutes while the system recalculates data derived from a large data set, doing those calculations multiple times just because a binding gets invoked.

There are (in broad strokes, and excluding the embedded market) two basic axes on which you have to place a company, or a company's software offering, in order to predict its attitude with respect to efficiency.

One is problem scale. If a program is a one-off, or an obscure niche thing, or just isn't expected to have to cope with very large data sets, putting a lot of effort into making it efficient will likely not be a priority. If the program is extremely widely distributed, or is expected to cope with massive datasets, efficiency is much more likely to be considered important (if widely distributed, the cost of efficient engineering per unit falls dramatically; if expected to cope with massive datasets, the amount of hardware cost and energy cost avoided becomes significant). Tuning a process that eats 50% of a desktop CPU into one that eats 40% probably isn't worth it. Tuning a process that runs on 50,000 servers into one that runs on 40,000 easily could be.

The second is location. If a company is running their software on their own hardware, and selling access to whatever service it provides (search engine, webmail, whatever), their software's efficiency or inefficiency imposes a direct cost on them. Their customers are paying so much per mailbox, or so much per search query, so they have an incentive to use as little computing power as possible to deliver that product. If a company is selling boxed software, to be run on customer machines, their efficiency incentives are indirect. This doesn't mean "nonexistent" (a game that only runs on $2,000 enthusiast boxes is going to lose money; nobody would release such a thing, and among enthusiasts, browser JS benchmarks are a point of contention); but it generally does mean "secondary to other considerations". Customers, as a rule, are more likely to use slow software with the features they want, or slow software that released first and that they became accustomed to, than fast software that is missing features or requires substantial adjustment on their part. Shockingly enough, software developers act on this fact.

On these axes, you would strongly suspect that Google would be efficiency-oriented. Their software runs on a grand scale, and most of it runs on their own servers, with the rest competing against various desktop incumbents, or not actually all that dramatically efficient (nothing wrong with Google Earth or SketchUp; but nothing especially heroic, either). However, you would expect roughly the same of any entity similarly placed on those axes.

Google? I'm a big Google fan (and despite the rest of my comment, also a big Android fan who totally loves my Nexus One)... but if Google is so hardcore about efficiency, why the hell did they develop a new runtime for Android that's based on Java?

Google didn't seem like the best company to praise for efficiency. I would have picked some sort of video game company like id Software (yeah, I realize this is an apples-and-oranges comparison though).

Because Google ain't crunching data sets on fucking mobile phones. They're optimizing their servers and the applications that run on those servers, because Google is so damn big that a fraction of a percent increase in efficiency translates into huge amounts of money saved through less wasted CPU time. Mobile phones aren't part of that equation for Google.

If your phone runs a little less efficiently, then no one gives a damn. They want to make their phones easy to program for, which generally conflicts with efficiency.

Why Java for Android? This is a good question. There are several reasons (that the Android team have discussed).

One is that ARM native code is bigger, size-wise, than Dalvik VM bytecode. So it takes up more memory. Unlike the iPhone, Android was designed from the start to multi-task between lots of different (user-installed) apps. It's quite feasible to rapidly switch between apps with no delay on Android, and that means keeping multiple running programs in RAM simultaneously. So trading off some CPU time for memory is potentially a good design. Now, that said, Java has some design issues that make it more profligate with heap memory than it maybe needs to be (e.g. UTF-16 for strings), so I don't have a good feel for whether the savings are cancelled out or not, but it's a justification given by the Android team.

Another is that Java is dramatically easier to program in than a C-like language. I mean, incredibly, monstrously easier. One problem with languages like C++ or Objective-C is that lots of people think they understand them, but very few programmers really do. Case in point: I have an Apple-mad friend who, ironically, programs C# servers on Windows for his day job. But he figured he'd learn iPad development. I warned him that unmanaged development was a PITA, but he wasn't convinced, so I showed him a page that discussed reference counting in ObjC (retain/release). He read it and said "well, that seems simple enough" - doh. Another one bites the dust. I walked him through cycle leaks, ref leaks on error paths (no smart pointers in ObjC!), and some basic thread safety issues. By the end he realized that what looked simple really wasn't at all.

By going with Java, Android devs skip that pain. I'm fluent in C++ and Java, and have used both regularly in the past year. Java is reliably easier to write correct code in. I don't think it's unreasonable to base your OS on it. Microsoft has moved a lot of Windows development to .NET over the last few years for the same reasons.

Fortunately, being based on Java doesn't mean Android is inherently inefficient. Large parts of the runtime are written in C++, and you can write parts of your own app in native code too (e.g. for 3D graphics). You need to use Java to use most of the OS APIs, but you really shouldn't be experiencing perf problems with things like GUI layout - if you are, that's a hint you need to simplify your app rather than try to micro-optimize.

One is that ARM native code is bigger, size-wise, than Dalvik VM bytecode.

Citation needed. Dalvik is better than baseline Java bytecode, agreed. But so is ARM native code. [http://portal.acm.org/citation.cfm?id=377837&dl=GUIDE&coll=GUIDE&CFID=82959920&CFTOKEN=24064384 - "[...] the code efficiency of Java turns out to be inferior to that of ARM Thumb"]. I can find no direct comparison of ARM Thumb and Dalvik, so I can't tell you which produces the smaller code size.

So it takes up more memory.

Even if your first statement is true, this doesn't necessarily follow. VMs add overhead, usually using up somewhat more runtime memory to execute, particularly if a JIT is used (the current version of Dalvik doesn't have one, but the next one apparently will).

Since when have OS designers optimised their code to milk every cycle from the available CPUs?

This isn't just an OS-level problem. It's a failure among programmers of all sorts.

I've been involved in software development since the late 1970s, and from the start I've heard the argument "We don't have to worry about code speed or size, because today's machines are so fast and have so much memory." This was just as common back when machines were 1,000 times slower and had 10,000 times less memory than today.

It's the reason for Henry Petroski's famous remark that "The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry."

Programmers respond to faster cpu speed and more memory by making their software use more cpu cycles and more memory. They always have, and there's no sign that this is going to change. Being efficient is hard, and you don't get rewarded for it, because managers can't measure it. So it's better to add flashy eye candy and more features, which people can see.

If we want efficient code, we have to figure out ways to reward the programmers that write it. I don't see any sign that people anywhere are interested in doing this. Anyone have suggestions for how it might be done?

Maybe it's not a question of whether the code is efficient. Maybe it's a question of how much you're asking the code to do. It's no surprise that hardware struggles to make gains against performance demands when software developers are adding on nonsense like compositing window managers and sidebar widgets. I'm enjoying Moore's law without any cancellation... just run a sane environment. Qt or GTK, not both, if you're running an X desktop. Nothing other than IM in the system tray. No "upgrade fever" that makes people itch for Windows Media Player 14 when older versions work fine and mplayer and Winamp work better.

If we want efficient code, we have to figure out ways to reward the programmers that write it. I don't see any sign that people anywhere are interested in doing this. Anyone have suggestions for how it might be done?

It's happening, from a source people didn't expect: portable devices. Battery life is becoming a primary feature of portable devices, and a large fraction of that comes from software efficiency. Take your average cell phone: it's probably got a half dozen cores running in it. One in the wifi, one in the baseband, maybe one doing voice codec, another doing audio decode, one (or more) doing video decode and/or 3d, and some others hiding away doing odds and ends.

The portable devices industry has been doing multi-core for ages. It's how your average cell phone manages immense power savings: you can power on/off those cores as necessary, switch their frequencies, and so on. They have engineers who understand how to do this. They're rewarded for getting it right: the reward is it lives on battery longer, and it's measurable.

Yes, you can get lazy and say 'next generation CPUs will be more efficient', but you'll be beaten by your competitors for battery life. Or, you fit a bigger battery and you lose in form factor.

The world is going mobile, and that'll be the push we need to get software efficient again.

It's not a failure among programmers at all - it's a business decision. The main reason software is less efficient is the costs are so heavily tilted toward software development instead of hardware. For the vast majority of business applications companies are using generalized frameworks to trade CPU cycles and memory for development time.

Even in terms of development style, it just isn't worth it to optimize your code if it's going to substantially increase development time. People are expensive. Time

Hey, if you liked programming for a one-byte machine, maybe you should join the quantum computer research effort. They're just now looking forward to the creation of their first 8-bit "computer" in the very near future. ;-)

Of course, you can do a bit more computing with 8 qubits than you can with 8 of the more mundane bits that the rest of us are using.

I don't know if you noticed my sig, but I'm pretty familiar with what Apple have been up to these past few years ;-)

What I was getting at was that, in general, programmers simply don't have the time or money to really optimise their code, and computers are now, for all intents and purposes, fast enough that optimisations aren't really worth worrying about.

Apple are doing a lot of good, as you mention, with things like Grand Central Dispatch, but the multiprocessing features in earlier versions of OS X, and even more so in OS 9, were in no major way better than those offered by, say, Windows or other Unix-based OSes. In fact, in the Mac OS 9 days, the multiprocessing capabilities of Mac OS lagged quite far behind those of Windows NT at the time.

No glory in it either. Even when you're doing it for free, nobody seems to care if you produce an optimization.

Plus, there are many more coders who have limited depth of understanding of OS interfacing, than there are coders who would go in after them to optimize. Heck, forget multicore -- how many applications fail to use vector units?

Sometimes optimizations get dropped from code as too difficult to maintain. Rarely, enough of them get collected in one spot to make a library out of them. Even more rarel

The fact is, the vast majority of programmers (and their tools) are not going to change virtually everything they do in order to deal with multiple cores. And there's good reason for that: it hugely complicates what could otherwise be fairly simple tasks. As the number of cores expands, it gets worse, to the point of simply not being practical. This is a job that properly belongs in the OS or hardware layer.

Is it harder to design a system that decides for itself how to go about threading and multiprocessing, rather than relying on the programmer to know when it is best for that particular program? Yes! But that is irrelevant, because in the long run, that is the way it must be done. There is no other practical choice.

I had to laugh at Intel a few years ago when they called for end-product programmers to start programming for their multicore processors. I say, "No, Intel. It is you who must cater to the programmers. They are your customers, and essential suppliers of your other customers. It is your job to make sure that your processors do what the programmers want, not the other way around!"

Apple's decision to put provision for this in their Snow Leopard OS is a clear demonstration of their forward (and practical) thinking. Where are all the others?

I've always thought that both data flow languages and Fortran 95 had some innovations for multi-core programming worthy of being copied.

Data flow languages, such as the "G" language sold under National Instruments' "LabVIEW" brand, are intrinsically parallel at many levels. What they do is look at a function call as a list of unsatisfied inputs. These inputs are waiting for the data to arrive to make the variables valid. Then the subroutine fires. Thus every single function is potentially a parallel process; it's just waiting on its data. If you program in a serial fashion, then of course those functions get called serially. But with graphical programming in 2D, you almost never are programming serially. You are just wiring outputs of some functions to inputs of others. Serial dependencies do arise, but these are asynchronous and localized cliques; everything else is parallel. Yet you never ever actually write parallel code; it just happens automatically. Perl Data Language had a glimpse of this, but it's not the same thing, since the language is still Perl and thus not parallel.
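For the curious, that fire-when-inputs-are-satisfied model maps fairly naturally onto futures in mainstream languages. Here's a rough sketch using Java's standard CompletableFuture; the two "input" computations are made-up stand-ins for real work:

```java
import java.util.concurrent.CompletableFuture;

public class Dataflow {
    public static void main(String[] args) {
        // Each node fires as soon as its inputs are satisfied:
        // a and b have no dependencies, so they may run in parallel.
        CompletableFuture<Integer> a = CompletableFuture.supplyAsync(() -> 2 + 3);
        CompletableFuture<Integer> b = CompletableFuture.supplyAsync(() -> 4 * 5);

        // sum is "wired" to the outputs of a and b; it fires only
        // when both inputs have arrived. No explicit threads written.
        CompletableFuture<Integer> sum = a.thenCombine(b, Integer::sum);

        System.out.println(sum.join()); // 25
    }
}
```

The programmer only describes the wiring; the runtime decides what actually runs concurrently, which is the same division of labour the dataflow comment describes.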

Objective-C, with its "message passing" abstraction, is perhaps getting closer to the idea of data flow. One might complain that Objective-C message passing is just a different sugar coating of C, just like C++ is. This would be true from the user's point of view, but it's not as true from the operating system's point of view. In OS X, these messages pass more like actual socket programming at the kernel level. So there's more to Objective-C on Apple's platform than meets the eye. But I don't know how far you can push that abstraction.

In Fortran there are some rather simple but powerful multi-processor optimizations. First, there are loops like "forall" that designate that a loop can be done in any order of the loop index, and even in parallel. Then there are vectorized statements as part of the language, like matrix multiplies. Those are rather simple things, so they don't solve much, but they do show that you can put a lot of compiler hinting into the language itself without reinventing the language.
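Java doesn't have forall, but its parallel streams carry the same hint: the programmer declares that iterations are independent, and the runtime is free to run them in any order, possibly in parallel. A minimal sketch (the arrays are toy data):

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class ForAllSketch {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = new double[x.length];

        // Like Fortran's forall: each index is independent of the
        // others, so the runtime may split them across cores.
        IntStream.range(0, x.length)
                 .parallel()
                 .forEach(i -> y[i] = 2 * x[i]);

        System.out.println(Arrays.toString(y)); // [2.0, 4.0, 6.0, 8.0]
    }
}
```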

Apple's Grand Central Dispatch (GCD) solution is really primitive. It's just a simple thread pool, where the programmer breaks their program down into tasks that can be executed independently, then queues them for execution by the thread pool.

GCD is not in the slightest innovative, except for a hack that allows C programmers to write tasks with slightly more convenience, by adding limited "closure" support to the language.

Similar concepts can be found all over the place; just see the "see also" section on the Wikipedia article:
http://en.wikipedia.org/wiki/Grand_Central_Dispatch
Using any of the libs listed in that "see also" section, you can get GCD-equivalent behaviour on Unix/Windows, and have been able to for years.
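The thread-pool-plus-task-queue pattern described above maps almost directly onto Java's standard ExecutorService (a sketch of the pattern, not Apple's API; the squaring tasks are arbitrary stand-ins for independent work):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolSketch {
    public static void main(String[] args) throws Exception {
        // A fixed pool of workers, roughly one per core, draining a
        // task queue: the same shape as GCD's queues, minus C closures.
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int n = i;
            results.add(pool.submit(() -> n * n)); // independent tasks
        }

        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        pool.shutdown();

        System.out.println(total); // 140
    }
}
```

This has been available in the JDK since Java 5 (2004), which supports the comment's point that the concept predates GCD.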

There are also languages with far superior parallel-processing abilities, where the effort is done by the compiler/environment, not the programmer. See any functional language, e.g. Haskell or Erlang. Write a program in these languages, and the parallel processing happens just about automatically.

Adding parallelism to the *OS* is quite a different issue, and not one that Apple's GCD addresses.

An iPhone 3GS with a 600MHz CPU outperforms a Nexus One with a 1000MHz CPU.

The reason the 3GS "outperforms" the N1 is that the N1 has more than twice the pixels of a 3GS. If the N1 had to drive the iPhone's resolution, it would wipe the floor with the iPhone's ass, all while supporting user-app multitasking.

The reason the 3GS "outperforms" the N1 is that the N1 has more than twice the pixels of a 3GS. If the N1 had to drive the iPhone's resolution, it would wipe the floor with the iPhone's ass, all while supporting user-app multitasking.

What many people are forgetting is that the N1 has no GPU; it requires the CPU to do all the rendering, which makes the rendering a little slower.

We are better off comparing it to the Motorola Milestone (Droid in the US) which has a GPU.

The iPhone certainly doesn't outperform a Nexus One. If you compare browser rendering tests, the Nexus One consistently completes loading pages quite a bit faster than the iPhone. You are probably thinking of games performance, and while it's true that the iPhone gets better frame rates, you're forgetting that the Nexus One is pushing around 2.5 times more pixels, so that's not exactly an apples-to-apples comparison.

Then if you look at iPhone OS, that has been highly, highly optimized. An iPhone 3GS with a 600MHz CPU outperforms a Nexus One with a 1000MHz CPU. The iPhone 3G with a 400MHz CPU outperforms a Palm Pre with a 600MHz CPU.

Citation needed? I think you'll find that the iPhone only appears to outperform Android because Android is doing a lot more than the iPhone. Furthermore, many things that work on Android do not work on the iPhone; Slashdot, for instance, works fine on my HTC Dream or newer Motorola Milestone with the standard browser, and it works even better with the Dolphin browser.

This cannot be a fair comparison until the iPhone can do everything that Android phones can, unless you want to compare functionality, where the iPhone is an epic failure.

Those optimizations are part of the reason why Apple is currently undercutting both Android and Palm on price,

Now I can tell you're full of it. All prices are inclusive of local taxes, and UK VAT does not apply outside the EU for those in Australia, Canada and the US.

The cheapest iPhone 3GS available is A$100 more expensive than the newer Motorola Milestone (Droid for the Yanks) and the Google Nexus One. Not to mention that both the Milestone and Nexus One can do more, as well as lack the restrictions of the iPhone. But then again, I suspect you were merely looking to confirm your quite obvious bias rather than do an accurate comparison.

Apple's operating systems are not very well optimised, not even as much as Windows operating systems. Apple's OSes pretend to be optimised by giving the OS more hardware than it needs and limiting functionality to prevent any perceived loss of speed. Most people using a Mac or iPhone rarely use the full power of the hardware, so an un-optimised OS goes unnoticed by the user. Here is the core of the design (from an engineering perspective): a design does not have to work well, it just has to work. The vast majority of people will ignore tiny flaws if they can get the task done; OTOH, if a computer doesn't do the task, the user will get annoyed no matter how pretty the interface.

As a good developer friend of mine likes to say, "If given the choice, a user will press the 'I just want it to work today' button". OSX provides this very shiny button but only in a few select places, Windows provides this not so shiny button almost everywhere. This is why Windows is still the number one OS on the planet.

Ya, but those cases, as he reasonably explains, tend to get specialized development (say, scientific computing) or separate processes; or, while he doesn't explain it, a lot of server stuff is embarrassingly (or close to embarrassingly) parallel.

I can sort of see them not having a multi-processor OS just waiting for the consumer desktop - server processors are basically cache with some processor attached, whereas desktop processors are architected differently, and who knew for sure what the multicore world would look like in

It doesn't sound easily backwards compatible (but I might be wrong there), and there's a certain simplicity to 'reserve one core for the OS, application developers can manage the rest of them themselves' sort of model like consoles.

Those curious about what life would be like with application developers managing system resources should try firing up an old copy of Windows 3.1 or Mac OS and running 10 or so applications at the same time.

I can only assume TFA is an atrociously bad summary of what he's actua

Why has it taken so long for the OS designers to get with the program?

Coming up with a new OS paradigm is hard, but doable.

Coming up with a viable new OS that uses that paradigm is much harder; because even once the new OS is working perfectly, you still have to somehow make it compatible with the zillions of existing applications that people depend on. If you can't do that, your shiny new OS will be viewed as an interesting experiment for the propeller-head set, but it won't ever get the critical mass of users necessary to build up its own application base.

So far, I think Apple has had the most successful transition strategy: Come up with the great new OS, bundle the old OS with it, inside an emulator/sandbox, and after a few years, quietly deprecate (and then drop) the old OS. Repeat as necessary.

I don't know if you had to support Mac users during the years of transition, but it wasn't quite as easy as you made it sound. It was pretty smooth for such a drastic change, but I wouldn't want to repeat it any more than necessary.

Well it's my understanding that Carbon simply wasn't supposed to stick around this long. Cocoa was supposed to replace it, but there were some major developers (e.g. Adobe and Microsoft) who refused to transition.

There was even a dust up in the last year or so when 10.6 was released, and Apple made it clear that they weren't ever going to update Carbon to support 64-bit applications. Adobe pretty much flipped out, and is only now working on migrating over to Cocoa in CS5. Microsoft is finally releasing

Hang on, this "new" OS you're referring to is basically UNIX (BSD). It was invented before Windows. Sure, Apple has modified it and put a shiny new layer on top (one that works exceptionally smoothly, mind you), but if you wanna get technical, they didn't come up with a new OS; they improved an old one.

MS did the same during the transition to 32-bit. They included a 16-bit DOS emulator and had it run transparently. They did the same for the transition to 64-bit. It was so successful and so transparent a lot of IT professionals didn't even know it was even happening in the background.

Unlike Apple though, they never removed it. Sure, it resulted in a major security hole, but it also let legacy custom business apps run far longer than they otherwise would have been able to.

I don't think you understood the point he was trying to make. Windows has had threading since 1993 and a threadpool API [microsoft.com] since before OS X was released [microsoft.com]. The point he was making was not that Windows wasn't good enough for multiple cores, it was that the current paradigm about how OSes and apps relate wasn't good enough.

Back when you only had a single core CPU, the OS had to share the CPU with all the apps. Thus arose the kernel/user model where the OS ran in kernel mode and the apps ran in user mode. When an

Developing server apps to run in parallel is easy; client software is hard. Many times, the cost of syncing threads is greater than the work you get from them, so you leave it single-threaded. The question is: how do you design a framework/API that is very thread-friendly, while making sure everything runs in the order expected, all the while making it easy for bad programmers to take advantage of it?

The biggest issue with developing async-threaded programs is logical dependencies that don't allow one part to be loaded/processed before another. If, from square one, you develop an app to take advantage of extra threads, it may be less efficient, but more responsive. Most programmers I talk to have issues trying to understand the interweaving logic of multi-threaded programming.

I guess it's up to MS to make an easy-to-use, idiot-proof threaded framework for crappy programmers to use.
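To put a toy example behind the sync-cost point above: even the simplest shared state forces every update through a synchronized hardware operation, and when the work items are this tiny, that coordination can easily cost more than the work itself. A sketch using the JDK's AtomicInteger (thread and iteration counts are arbitrary):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SyncCost {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] threads = new Thread[4];

        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                // Each increment is an atomic (synchronized) operation.
                // With work this small, the coordination dominates;
                // drop the atomicity and you get a data race instead.
                for (int i = 0; i < 100_000; i++) counter.incrementAndGet();
            });
            threads[t].start();
        }
        for (Thread t : threads) t.join();

        System.out.println(counter.get()); // 400000
    }
}
```

Replace AtomicInteger with a plain int and the total comes out wrong under load, which is exactly the correct-ordering problem the comment says a framework would need to hide.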

Well - I can tell you that Dave Probert saw his first multi-processor about 28 years ago at Burroughs Corporation. It was a dual-processor B1855. I had the pleasure of working with the guy way back then. From what I recall, he then went on to work at FPS, whose systems were array processors that you could add onto other machines (I think VAXen... but I could be wrong there).

Well, with the rise of the SSD, that's no longer as much of a problem. Case in point - I built a system on the weekend with a 40GB Intel SSD. Pretty much the cheapest "known-good" SSD I could get my hands on (i.e. TRIM support, good controller) at AUD $172, roughly the price of a 1.5TB spinning rust store - and the system only needs 22GB including apps.

Windows boots from end of POST in about 5 seconds. 5 seconds is not even enough for the TV to turn on (it's a Media Center box). Logon is instant. App start is nigh-on instant (Explorer seems to appear before the Win+E key is even released). This is the fastest box I've ever seen, and it's the most basic "value" processor Intel offer - the i3-530, on a cheap ASRock board with cheap RAM (true, there's a slightly cheaper "bargain basement" CPU in the G6950 or something). The whole PC cost AUD800 from a reputable supplier, and I could have bought it for $650 if I'd wanted to wait in line for an hour or get abused at the cheaper places.

Now, Intel are aiming to saturate SATA-3 (600MBps) with the next generation(s) of SSD, or so I'm told. Based on what I've seen - it's achievable, at reasonable cost, and it's not only true for sequential read access. So if the IO bottleneck disappears - because the SSD can do 30K, 50K, 100K IO operations per second? Yeah, I think it's reasonable to ask why we wait for the computer.

Not that I think a redesign is necessary for the current architectures - Windows, BSD, Linux all scale nicely to at least 8 or 16 logical CPUs in the server world, so the 4, 6 or 8 on the desktop isn't a huge problem. But in 5 years when we have 32 CPUs on the desktop? Maybe. Or maybe we'll just be using the same apps that only need 1 CPU most of the time, and using the other 20 CPUs for real-time stuff (Real voice control? Motion control and recognition?)

Well, with the rise of the SSD, that's no longer as much of a problem.

ORLY!

Let's do some math, shall we? Take a simple 4-core Nehalem running at 2.66GHz. Let's conservatively assume that it can complete a mere *1* double-precision floating-point number per clock cycle, per core. So, how big is a double? 64 bits, or 8 bytes. Now, that's 2.66 billion * 4 = 10.64 BILLION doubles per second, which is 85 GB/s.
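The arithmetic above checks out; spelled out as a quick sanity check (the one-double-per-cycle-per-core rate is the parent's conservative assumption, not a measured figure):

```java
public class Throughput {
    public static void main(String[] args) {
        double clockHz = 2.66e9;      // per-core clock, as assumed above
        int cores = 4;                // 4-core Nehalem
        int bytesPerDouble = 8;       // 64-bit double

        double doublesPerSec = clockHz * cores;             // 1.064e10
        double bytesPerSec = doublesPerSec * bytesPerDouble;

        System.out.printf("%.2f GB/s%n", bytesPerSec / 1e9); // 85.12 GB/s
    }
}
```

So the CPU's appetite is roughly 85 GB/s, versus the few hundred MB/s an SSD delivers; that's the gap the rest of this comment is pointing at.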

The trick to understanding computing is that all computing really *is* at its heart a throughput problem.

The question wasn't "why should your CPU have to wait", it was "why should *you* have to wait". At speeds approaching 3Gb/s, I think it's fair to say, as the person you replied to actually did say, "well, with the rise of the SSD, that's no longer as much of a problem."

The trick to understanding computing is that all computing really *is* at its heart a throughput problem.

The trick to understanding computers is to realize that all computing really is, at its heart, a human problem. It really doesn't matter if the CPU has to wait a trillion cycles in between receiving each byte of data, if the computer respon

Yes, as does Windows. I think I should have been more clear - the scale curve is nice and flat up to 8, 16, maybe 32 logical CPUs. After that though, doubling CPUs doesn't necessarily double performance (even in heavy compute) - other bottlenecks start to impact, as does scheduler performance and architecture.

Nature abhors a vacuum. It seems that no matter how much compute power you have, something will always want to snaffle it up. I have a dual Pentium D at work running WinXP with 3GB of RAM. The proprietary 8051 compiler toolset is god-awful slow (and pegs one of the CPUs) compiling even just a few thousand lines of code (tens of seconds, with lots of GUI seizures), because I think for some reason the compiler and IDE are running a crapload of inefficient Python in the backend. Don't even get me started on ho

But that's exactly the way changing OS architectures and APIs can help. Right now the default behavior is to start a worker thread of some type that blocks on IO requests and then reports back. Most apps in the wild don't even bloody do this; they just have a few threads do everything, and some even have the main app loop block on IO. (Let's all pretend we don't see our app windows grey out several times a day!)
We've argued for decades that this was a programmer issue, but that sort of pedantic criticism has accomp
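The worker-thread-that-reports-back default mentioned above looks roughly like this with the JDK's ExecutorService (the slow read is simulated with a sleep; names are made up):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncIo {
    public static void main(String[] args) throws Exception {
        ExecutorService ioPool = Executors.newSingleThreadExecutor();

        // The blocking "IO request" runs off the main thread, so the
        // main loop (stand-in for a UI thread) never greys out waiting.
        Future<String> result = ioPool.submit(() -> {
            Thread.sleep(50); // stand-in for a slow disk/network read
            return "payload";
        });

        System.out.println("main loop stays responsive...");
        System.out.println(result.get()); // worker reports back
        ioPool.shutdown();
    }
}
```

Apps that instead call the blocking read directly on the main loop are the ones whose windows grey out, which is the failure mode the comment describes.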

The problem is that most (if not all) peripheral hardware is not parallel in many senses. Hardware in today's computers is serial: you access one device, then another, then another. There are some cases (such as a few good emulators) which use multi-threaded emulation (sound in one thread, graphics in another), but fundamentally the biggest performance killer is the final IRQs that get called to process data. The structure of modern-day computers must change to take advantage of multicore systems.

The computer industry will have to wake up to reality sooner or later. We must reinvent the computer; there is no getting around this. The old paradigms from the 20th century do not work anymore because they were not designed for parallel processing.

In the 1980s there was lots of academic interest in parallel computing. Unfortunately a lot of it seemed to be driven merely by the quest for speed; once single CPUs got fast enough in the early 90s and everyone went 'whee, C is good enough, also objects are neat!', a whole generation of parallel language work was lost to the new and shiny.

I'm thinking you don't have much experience with .NET. In my projects it has always performed comparably to natively compiled code when I write my code with the mindset of a C++ programmer and not a VB one.

.NET apps DO use a virtual machine, the Common Language Runtime, which executes .NET IL. However, the virtual machine DOES use just-in-time compilation and precompilation to translate the code to native code before running it.

The same as any halfway decent desktop Java Virtual Machine implementation does now (mobile JVMs usually use hardware features like ARM Jazelle to run Java code faster).

Why should you ever, with all this parallel hardware, ever be waiting for your computer?

I dunno - maybe because optimal multiprocessor scheduling is an NP-complete problem? Or because concurrent computations require coordination at certain points, which is an issue that doesn't exist with single-threaded systems, and it's therefore wishful thinking to assume you'll get linear scaling as you add more cores?

Why for example does Windows Explorer decide to freeze ALL network connections when a single URN isn't quickly resolved? Why is it that when my USB drive wakes up, all explorer windows freeze? If you are trying to tell me there's no way using the current abstractions to implement this I say you're mad. For that matter when a copy or move fails in Explorer, why can't I simply resume it once I've fixed whatever the problem is. You're left piecing together what has and hasn't been moved. File requests make up a good deal of what we're waiting for. It's not the bus or the drives that are usually the limitation. It's the shitty coding. I can live with a hit at startup. I can live with delays if I have to eat into swap. But I'm sick and tired of basic functionality being missing or broken.

Windows Explorer sucks. It always just abandons copies after a failure, even if you're moving thousands of files over a network, and yes, you're left wondering which files did or didn't make it. It's sometimes actually easier to copy all the files you want to shift locally, then move the copy, so that you can resume after a failure. It's laughable that you have to do this, however.

But it's not a concurrency issue, and neither, really, are the first two problems you mention. They're also down to Windows Explorer sucking.

I wish I could mod you higher than +5, you just summed up some of the things that bother me most about the OS that is somehow still the most popular desktop OS in the world.

To anyone using Windows (XP, Vista or 7) right now: go ahead and open up an Explorer window and type in ftp:// [ftp] followed by any URL. Even when it's a name that obviously won't resolve, or an IP on your very own local network of a machine that just doesn't exist, this'll hang your Explorer window for a couple of solid seconds. If you're a truly patient person, try doing that with a name that does resolve, like ftp://microsoft.com [microsoft.com]. Better yet, try stopping it... say goodbye to your explorer.exe.

This is one of the worst user experiences possible, all for a mundane task like using ftp. And this has been present in Windows for what, a decade?

There is an option, at least as far back as XP, that allows Explorer windows to run as their own tasks. Why it's not enabled by default I have no clue (except that I have seen some issues with custom icons when doing so).

For that matter when a copy or move fails in Explorer, why can't I simply resume it once I've fixed whatever the problem

Try TotalCopy [ranvik.net], which adds a copy/move to the right-click menu; or TeraCopy [codesector.com], a commercial (free version available, supports Win7) complete replacement for the sucky Windows copy system.

USB/network freezes and broken file copying aren't a fault of CPU cores like you say; Windows is just a sucky OS. Multicore stuff gets complicated, but this isn't going to be a panacea for Microsoft, it's another marketing opportunity.

You wait because some programmer thought it was more important to have animated menus than a fast algorithm. You wait because someone was told "computers have lots of disk space." You wait because the engineers never tested their database on a large enough scale. You wait because programmers today are taught to write everything themselves, and to simply expect new hardware to make their mistakes irrelevant.

Not true. You wait because management fast-tracks stuff out the door without giving developers enough time to code things properly, and ignores developer concerns in order to get something out there now that will make money at the expense of the end user. I have been coding a long time and have seen this over and over. Management doesn't care about customers or about letting developers code things correctly; they only care about $$$$$$$.

Microsoft should go back and read some of the literature on parallel computing from 20-30 years ago. Machines with many cores are nothing new. And Microsoft could have designed for it if they hadn't been busy re-implementing a bloated version of VMS.

This is a very weak talk to give at a university. Rather than talking about 'parallel programming' and adding an "It Sucks" button, I would expect a discussion of CSP http://en.wikipedia.org/wiki/Communicating_sequential_processes [wikipedia.org], or perhaps of hard real-time guarantees for responsiveness. This is the indoctrination you get when you work for Microsoft: you start spruiking low-level marketing mumbo-jumbo to a very technical audience.

... for NFS to give up on a disconnected server... By the original design and the continuing default settings, the stuck processes are neither killable nor interruptible. You can reboot the whole system, but you can't kill one process.

With computers past and present -- Atari 8-bit, Atari ST, iPhone -- with "instant on", why does Windows not have this yet? This goes back to the lost decade [slashdot.org]. What has Microsoft been doing since XP was released?

The iPhone isn't even slightly "instant on"; it takes at least a minute to boot an iPhone from off. What you're seeing most of the time is "screen off" mode. Unsurprisingly, switching the screen on and cranking up the CPU clock doesn't take much time. Likewise, waking my Windows box up from sleep doesn't take very long either. Comparing modern OS software running on modern hardware, I see little difference in boot times or wake times from sleep, which would indicate that if MS are being lazy then so are Apple.

A big problem is the event-driven model of most user interfaces. Almost anything that needs to be done is placed on a serial event queue, which is then processed one event at a time. This prevents race conditions within the GUI, but at a high cost. Both the Mac and Windows started that way, and to a considerable extent, they still work that way. So any event which takes more time than expected stalls the whole event queue. There are attempts to fix this by having "background" processing for events known to be slow, but you have to know which ones are going to be slow in advance.
Intermittently slow operations, like a DNS lookup or something that infrequently requires disk I/O, tend to be bottlenecks.

Most languages still handle concurrency very badly. C and C++ are clueless about concurrency. Java and C# know a little about it. Erlang and Go take it more seriously, but are intended for server-side processing. So GUI programmers don't get much help from the language.

In particular, in C and C++, there's locking, but there's no way within the language to even talk about which locks protect which data. Thus, concurrency can't be analyzed automatically. This has become a huge mess in C/C++, as more attributes ("mutable", "volatile", per-thread storage, etc.) have been bolted on to give some hints to the compiler. There's still race condition trouble between compilers and CPUs with long look-ahead and programs with heavy concurrency.

We need better hard-compiled languages that don't punt on concurrency issues. C++ could potentially have been fixed, but the C++ committee is in denial about the problem; they're still in template la-la land, adding features few need and fewer will use correctly, rather than trying to do something about reliability issues. C# is only slightly better; Microsoft Research did some work on "Polyphonic C#" [psu.edu], but nobody seems to use that. Yes, there are lots of obscure academic languages that address concurrency. Few are used in the real world.

Game programmers have more of a clue in this area. They're used to designing software that has to keep the GUI not only updated but visually consistent, even if there are delays in getting data from some external source. Game developers think a lot about systems which look consistent at all times, and come gracefully into synchronization with outside data sources as the data catches up. Modern MMORPGs do far better at handling lag than browsers do. Game developers, though, assume they own most of the available compute resources; they're not trying to minimize CPU consumption so that other work can run. (Nor do they worry too much about not running down the battery, the other big constraint today.)

Incidentally, modern tools for hardware design know far more about timing and concurrency than anything in the programming world. It's quite possible to deal with concurrency effectively. But you pay $100,000 per year per seat for the software tools used in modern CPU design.

This has become a huge mess in C/C++, as more attributes ("mutable", "volatile", per-thread storage, etc.) have been bolted on to give some hints to the compiler.

An interesting comment overall, but what relevance does "mutable" have to multi-threaded programming? It is just a way to say that a particular field in a class is never const, even when the object as a whole is. There are no optimizations the compiler could possibly derive from that (in fact, if anything, it might make some optimizations non-applicable).

Same goes for "volatile", actually. It forces the code generator to avoid caching values in registers etc, and always do direct memory reads & writes on every access to a given lvalue, but this won't prevent one core from not seeing a write done by another core - you need memory barriers for that, and ISO C++ "volatile" doesn't guarantee any (nor do any existing C++ implementations).

Microsoft Research did some work on "Polyphonic C#" [psu.edu], but nobody seems to use that.

It's a research language, not intended for production use. Microsoft Research does quite a few of those - e.g. Spec# [microsoft.com] (DbC), or C-omega [microsoft.com] (this is what Polyphonic C# evolved into), or Axum [microsoft.com] (the most recent take at concurrency, Erlang-style).

Those projects are used to "cook" an idea: to see if it's feasible, what approach is best, and how it is taken up by programmers. Eventually, features from those languages end up integrated into the mainstream ones, C# and VB. For example, X# became LINQ in .NET 3.5, and Spec# became Code Contracts in .NET 4.0. So, give it time.

An interesting comment overall, but what relevance does "mutable" have to multi-threaded programming?

A "const" object can be accessed simultaneously from multiple threads without locking, other than against deletion.
A "mutable const" object cannot; while it is "logically const", its internal representation may change (it might be cached or compressed) and thus requires locking.

Most languages still handle concurrency very badly. C and C++ are clueless about concurrency. Java and C# know a little about it. Erlang and Go take it more seriously, but are intended for server-side processing. So GUI programmers don't get much help from the language.

In particular, in C and C++, there's locking, but there's no way within the language to even talk about which locks protect which data. Thus, concurrency can't be analyzed automatically. This has become a huge mess in C/C++, as more attributes ("mutable", "volatile", per-thread storage, etc.) have been bolted on to give some hints to the compiler. There's still race condition trouble between compilers and CPUs with long look-ahead and programs with heavy concurrency.

We need better hard-compiled languages that don't punt on concurrency issues. C++ could potentially have been fixed, but the C++ committee is in denial about the problem; they're still in template la-la land, adding features few need and fewer will use correctly, rather than trying to do something about reliability issues. C# is only slightly better; Microsoft Research did some work on "Polyphonic C#" [psu.edu], but nobody seems to use that. Yes, there are lots of obscure academic languages that address concurrency. Few are used in the real world.

Ada 2005's task model is a real world, production quality approach to include concurrency in a hard-compiled language. Ada isn't exactly known for its GUI libraries (there is GtkAda), but it could be used as a foundation for an improved concurrent GUI paradigm.

I love how Microsoft can come along in 2010 and with a straight face say it's about time they took multiprocessing seriously. Or say it's about time we started putting HTML5 features into our browser. And we're finally going to support the ISO audio video standard from 2002. And by the way, it's about time we let you know that our answer to the 2007 iPhone will be shipping in 2011. And look how great it is that we just got 10% of our platform modernized off the 2001 XP version! And our office suite is just about ready to discover that the World Wide Web exists. It's like they are in a time warp.

I know they have product managers instead of product designers, and so have to crib design from the rest of the industry, which leaves them years behind; but on engineering stuff like multiprocessing, you'd expect them at least to have read the memo from Intel in 2005 about single cores not scaling and how the future was going to be 128-core chips before you know it.

I guess when you recognize that Windows Vista was really Windows 2003 and Windows 7 is really Windows 2005 then it makes some sense. It really is time for them to start taking multiprocessing seriously.

The way I understand it, the cache kernel in kernel mode doesn't really have built-in policy for traditional OS tasks like scheduling or resource management. It just serves as a cache for loading and unloading things like address spaces and threads and making them active. The policy for working with these things comes from separate application kernels in user mode and kernel objects that are loaded by the cache kernel.

There's also a 1997 MIT paper on exokernels (http://pdos.csail.mit.edu/papers/exo-sosp97/exo-sosp97.html). The idea is separating the responsibility of management from the responsibility of protection. The exokernel knows how to protect resources and the application knows how to make them sing. In the paper, they build a webserver on this architecture and it performs very well.

Both of these papers describe research operating systems that demonstrate specialized "native" applications running alongside unmodified UNIX applications on UNIX emulators. That suggests rebuilding an operating system in one of these styles wouldn't entail throwing out all the existing software or immediately forcing a new programming model on developers who aren't ready.

Microsoft used to talk about "personalities" in NT. It had subsystems for OS/2 1.x, Win16, and Win32 that allowed apps from OS/2 (character mode), Windows 3.1, and Windows NT to run as peers on top of the NT kernel. Perhaps someday the subsystems will come back, some as OS personalities running traditional apps, and some as whole applications with resource management policy in their own right. Notepad might just run on the Win32 subsystem, but Photoshop might be interested in managing its own memory as well as disk space.

What's wrong with at least some operating systems doesn't even have anything to do with multiple cores per se. They're simply designing the OS and its UI incorrectly, assigning the wrong priorities to events. No event should EVER supersede the ability of a user to interact and intercede with the operating system (and applications). Nothing should EVER happen to prevent a user being able to move the mouse, access the start menu, etc., yet this still happens in both Windows and Linux distributions. That's a fucked-up set of priorities, when the user sitting in front of the damned box - who probably paid for it - gets second billing when it comes to CPU cycles.

It doesn't matter if there's one CPU core or a hundred. It's the fundamental design priorities that are screwed up. Hell should freeze over before a user is denied the ability to interact, intercede, or override, regardless how many cores are present. Apparently hell has already frozen over and I just didn't get the memo?

The FS operations simply need to happen at a slowed pace that favors the user at all times; why should that cause loss of data integrity? A file system that demands a minimum effective data rate below which bits are lost or corrupted?! That would certainly be a poorly designed file system or interface to it, wouldn't it? No wonder we need redundant (R)AID and backup solutions!

First, the article in question talks about OS architecture, not Windows specifically. He specifically states that what he is speaking about is not something MS is working on; quite the opposite, many of his MS colleagues disagree with him.

Second, the fundamental problems with OS design are exactly that: fundamental problems with OS design. Nobody is making an OS that truly takes advantage of multiple cores; it's still single-processor thinking with the ability to use more than one processor, and this leads to a number of inherent problems.

The article talks about what an OS might look like if built from scratch specifically for multiple core processing power, and there is nothing on the market like it at the moment. It's basically a hypervisor-based OS, where instead of giving programs slices of CPU time, the OS gives programs actual CPUs and slices of memory to use.

Something like that would be extremely slick; we already do that for virtual machines and end up with 8+ full-fledged servers running on the same machine. Why can't you pull that back a little further, so individual programs are assigned to each CPU such that they don't have to interact with the OS at all once they're up and running? Can you imagine?

Among other things, I'm going with the name "Ironfluid" now, as I've finally deconflated the terms "cloud computing" and "fluid computing". Cloud really just means "run by somebody else", while "fluid computing" implies parallel processing and fault tolerance: decoupling the software completely from the hardware.

>>What we need is a "you don't want to use C: right now, trust me" signal. Ever tried to use Firefox while copying something big? Why does it take ages to display a webpage when it does not need to use the disk?

It only works like that on Windows. I think it's mostly about bad system design. I have no such issues on my Linux machine, but lots on my wife's Windows one. Both are the same Thinkpad laptops, so the fault can only be on the OS side.

Are you running a 9 year old version of OSX too, or are you comparing a two generation old Windows version to a nice new Mac version? It really sounds like you are comparing apples (snicker) to oranges. After all, both Vista and Windows 7 have no problem running for a long, long time between reboots and don't get slow during that time.

I noticed the same on my Mac. With a set of eight CPU graph meters in the menu bar, they're almost always evenly pitched anywhere from idle to 100%, with a few notable exceptions like Second Life, some Photoshop filters, and Firefox of all things.
When booted into Windows, more often than not I have two cores pegged high and the others idle. Getting even use out of all cores is the exception, not the rule.

This is pretty much completely down to the application mix. Windows has no trouble whatsoever scheduling processes and threads to max out 8 (or 16, or whatever) CPUs, but if the applications are only coded to have, say, 1 or 2 "processing" threads, then there's nothing the OS can do to change that.

I think GCD is a great idea, and a very useful tool, but it's not a magic bullet. GCD can schedule some things more effectively because it has a system-wide view. The closure extensions and the GCD interface make it reasonably easy for novice programmers to get things actually running in parallel.

Of the two, the latter has a MUCH bigger impact in terms of actually getting programs to take advantage of multiple cores.

You may not have to write your code around threading, but you then have to write it around Grand Central Dispatch. Having GCD available is going to do absolutely nothing for a program that was not written with GCD in mind. It's trading one set of problems/features for another. Writing multi-threaded software isn't exceptionally hard; I have done a lot of it. It may take a lot less code with GCD, but you also give up control. Even using GCD with code blocks, you still have to deal with the problems that concurrency can cause.

I am finding it very difficult to believe that you have actually used GCD. I have, and have read most of the code for the implementation. Creating threads is not hard - it is definitely not what makes parallel programming difficult. The difficult bit is splitting your code into parts that have no interdependencies and so can execute concurrently.

When you use libdispatch, you still have to do this. All that it does for you is implement an N:M threading model: it allocates a group of kernel threads and schedules your tasks onto them.

The trick with GCD is that it is somewhat more high-level than a simple thread pool: it operates in terms of tasks, not threads. The difference is that tasks have explicit dependencies on other tasks, which lets the scheduler be smarter about allocating cores.

It seems you are severely underestimating what GCD means to the application developer. I strongly suggest you read parts 12 and 13 of John Siracusa's excellent review [arstechnica.com] very carefully. As Siracusa says,

Those with some multithreaded programming experience may be unimpressed with the GCD. So Apple made a thread pool. Big deal. They've been around forever. But the angels are in the details. Yes, the implementation of queues and threads has an elegant simplicity, and baking it into the lowest levels of the OS really helps to lower the perceived barrier to entry, but it's the API built around blocks that makes Grand Central Dispatch so attractive to developers. Just as Time Machine was "the first backup system people will actually use," Grand Central Dispatch is poised to finally spread the heretofore dark art of asynchronous application design to all Mac OS X developers. I can't wait.