How has Windows evolved, as a general purpose operating system and at the lowest levels, in Windows 7? Who better to talk to than Technical Fellow and Windows Kernel guru Mark Russinovich? Here, Mark enlightens us on the new kernel constructs in Windows 7 (and, yeah, we do wander up into user mode, but only briefly). One very important change in the Windows 7 kernel is the dismantling of the dispatcher spin lock and redesign and implementation of its functionality. This great work was done by Arun Kishan (you've met him here on C9 last year). EDIT: You can learn exactly what Arun did in eliminating the dispatcher lock and replacing it with a set of synchronization primitives and a new "pre-wait" thread state, here. The direct result of the reworking of the dispatcher lock is that Windows 7 can scale to 256 processors. Further, this enabled the great Landy Wang to tune the Windows Memory Manager to be even more efficient than it already is. Mark also explains (again) what MinWin really is (heck, even I was confused. Not anymore...). MinWin is present in Windows 7. Native support for VHD (boot from VHD anyone?) is another very cool addition to our next general purpose OS. Yes, and there's more!

Tune in. This is a great conversation (if you're into operating systems). It's always great to chat with Mark.

The updated tools are only available through the Microsoft Desktop Optimization Pack (MDOP). MDOP is an add-on subscription to Windows Client Software Assurance. MDOP also contains a lot of other cool tools, like Application Virtualization, which used to be the Softricity SoftGrid product.

I'd like Windows to not do a context switch on a thread when it's doing disk I/O. I want it to hold onto the thread for, say, 500ms instead of 20ms, as this gives the disk more time to read/write.

If each thread is accessing a file, the whole thing slows to a crawl as the hard disc read head has to jump to each file every 20ms. It would be MUCH better if the operating system could allocate more time to read a file before it allowed a context switch.
Say 500ms. That would allow more data to be retrieved from the hard disc, less head thrash, less time waiting for the head to move, and performance would go up greatly.

Just try creating 2 or more zip archives at the same time, then time it again but only doing 1 at a time. Winrar has a feature where it will wait (probably using a global mutex) for other winrar windows to finish before the next one starts.
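The serialization trick described here can be sketched with an in-process lock; the real WinRAR presumably uses a named, cross-process Win32 mutex, and the job body below is a hypothetical stand-in:

```python
import threading
import time

# In-process sketch of the serialization trick: only one "archiver" runs
# at a time, so the disk streams one file instead of thrashing between
# two. The real WinRAR presumably uses a named, cross-process Win32 mutex.
archive_lock = threading.Lock()
results = []

def create_archive(name):
    with archive_lock:  # wait for any other archive job to finish first
        results.append(f"start {name}")
        time.sleep(0.01)  # stand-in for compress + write
        results.append(f"end {name}")

jobs = [threading.Thread(target=create_archive, args=(n,))
        for n in ("a.zip", "b.zip")]
for j in jobs:
    j.start()
for j in jobs:
    j.join()
print(results)  # the start/end pairs are never interleaved
```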

You can context switch CPU threads till the cows come home, but a physical device needs more time to read/write when the head arrives.

IO is extremely costly. In most cases, if a thread requests an IO operation, it's going to be hanging around for quite a while for the IO to complete, so it makes sense to curtail the thread's quantum and move on to the next thread awaiting CPU time (i.e. context switch).

When the IO operation returns (most likely a DMA operation these days), the CPU will be interrupted and the interrupt handler fires, unblocking the interrupt service thread (IST) and releasing the CPU. The CPU then works out which thread to run next. Because the IST is a high-priority thread, it'll most likely get the next quantum and complete the IO operation. Your IO-requesting thread will then be reactivated and return.

Forcing the IO requesting thread's quantum to extend (to HALF A SECOND???) will only slow down the machine as the CPU will be able to execute FEWER threads per second because of the largely dormant thread hogging the CPU's time.

The reason that creating two Zip archives simultaneously might be slower has many factors, including the rotational, seek and data transfer capabilities of the storage device itself, how fragmented your storage device is, whether your device implements some kind of write buffering, etc. And that's not to mention whether you're running single/multiple processors and what else is running on your box.

If it takes longer to create two zips at the same time vs. doing it serially, that indicates to me that you may be suffering from a slow disk and/or high disk fragmentation, forcing your Zip tool to create and extend its file in many small chunks, causing lots of disk seeking and therefore slowing you down.

The topic summary refers to "the Spin Lock Dispatcher" -- i.e., a component that dispatches spin locks. That is meaningless. The talk in fact refers correctly to "the Dispatcher Spin Lock" -- i.e., the spin lock that protects the dispatcher (or rather
its data).

There are definitely things that Windows could do better with disk I/O unless my experience is atypical and due to something wrong with my system.

Consider this example which I experienced with quite simple Win32 code on my Vista machine:

I had two uncompressed BMP files on the HDD, about 50MB each. I needed to read both files into memory and process them and they had to be loaded completely before processing could begin. There was plenty of memory and it was a Core2Duo system with 32-bit Vista.

If I used two threads to load the 50MB files in parallel (one file per thread) then, no matter how I wrote it, it consistently took *twice* as long as using a single thread and reading the files in series. Not the same length of time, but twice as long. This was true even when both threads allocated the full 50MB each and read their respective files in a single 50MB ReadFile call each. No data was being written to disk and no other processes were using significant resources.

That cannot be right. The OS is being asked by two threads in the same process to do two reads and it's allowing them to compete with each other to the extent that it takes twice as long. It makes no sense for those reads to be done in parallel as, even in
the impossible best case of zero seek times, the result would still be both threads waiting until the full 100MB of data was read. Better, when the OS knows both threads are reading 50MB of data in a single ReadFile call, to let one thread read 50MB and move
on, then let the other thread read its 50MB. That would mean one thread is ready after 50MB and the other after 100MB. (i.e. Compared to the impossible best case of the other method, one thread takes no longer to be ready while the other thread is ready twice
as quickly. Win.)

I realise that doing that could be complex given the way the system is layered. Some interleaving may be inevitable, but what happens now has a lot of room for improvement. Neither thread is ready until the amount of time it would take a single thread to read 200MB of data, yet only 100MB of data is being read.
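The serial-vs-parallel experiment described above can be sketched like this; file sizes are illustrative rather than the poster's 50MB, and actual timings depend entirely on the disk and the cache:

```python
import os
import tempfile
import threading
import time

# Sketch of the experiment described above: read two files serially,
# then in parallel (one thread per file), and compare wall-clock times.
# The 4 MB size is illustrative; the poster's test used ~50 MB files.
SIZE = 4 * 1024 * 1024

paths = []
for _ in range(2):
    fd, path = tempfile.mkstemp()
    os.write(fd, os.urandom(SIZE))
    os.close(fd)
    paths.append(path)

def read_whole(path):
    with open(path, "rb") as f:
        return len(f.read())  # one big read, like a single ReadFile call

# Serial: one thread reads both files back to back.
t0 = time.perf_counter()
serial_bytes = sum(read_whole(p) for p in paths)
serial_time = time.perf_counter() - t0

# Parallel: both threads issue their large read at the same time.
t0 = time.perf_counter()
workers = [threading.Thread(target=read_whole, args=(p,)) for p in paths]
for w in workers:
    w.start()
for w in workers:
    w.join()
parallel_time = time.perf_counter() - t0

print(f"serial {serial_time:.3f}s, parallel {parallel_time:.3f}s")
for p in paths:
    os.unlink(p)
```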

Even if you can refactor your own process to have a single "data loading" thread (which is very difficult with many 3rd party libraries and/or workloads which mix loading and processing), you have no way to synchronize your data loading with that of any other
process on the system.

I did a test and see no significant difference between serial and parallel reads on two 50 MB files. Parallel is even a bit faster when I don't use the FILE_FLAG_NO_BUFFERING flag, but that would be cheating since it probably comes from the disk cache (buffered vs unbuffered, test project).

I ran your test project and got similar results to what I described on my system, although there was some variance with just 2 files.

I then modified your project to be more like my real example (which, unlike what I described, loads more than 2 files) in case that helps magnify what's going on (and make it less likely that disk caching is skewing things). The program still uses 2 threads
but each thread reads 5 files, then the serial version reads all 10 files.

I got some interesting results, especially when buffered and unbuffered are compared.

The files were copies of the same 23meg file. They were read off a standard NTFS partition (my system drive). Real-time antivirus was disabled. (NOD32 installed but turned off for the tests.) Vista 32-bit. Core2Duo. NVidia motherboard and NVidia SATA drivers.
2gig of RAM with 40% free.

With FILE_FLAG_SEQUENTIAL_SCAN the parallel reads were consistently 2 to 3 times slower than the serial reads (the real exe reports more detail than pasted here):

Parallel 23484567924; Serial 10204840629; Parallel was worse. 230% as long as serial.
Parallel 32899454271; Serial 10073167110; Parallel was worse. 326% as long as serial.
Parallel 33913801872; Serial 10052993466; Parallel was worse. 337% as long as serial.

With FILE_FLAG_NO_BUFFERING things are even BUT both are as slow as the buffered parallel case above. i.e. Now everything is 2-3 times slower than the buffered serial reads. It seems like reading data in parallel disables or cancels out read buffering, on my
system at least:

Parallel 34752228822; Serial 34509786165; Parallel was worse. 100% as long as serial.
Parallel 33359695965; Serial 34333134759; Serial was worse. 102% as long as parallel.
Parallel 32994485361; Serial 33712713216; Serial was worse. 102% as long as parallel.

I also made versions which read the entire files in one go, instead of multiple small read operations. (There being a big difference in these cases is why I think there is a problem. The OS appears to be allowing two large reads to compete with each other with
the result that they both lose horribly.) These show more variable speed when buffering is enabled, I guess due to the files being cached in memory on subsequent reads (both in the same execution and between executions), but the parallel reads are still consistently
slower for me even with the variance. (Perhaps it would be worth trying the tests in reverse order but I've spent too long on this for now and everything so far has confirmed what I saw in the past in cases where there was too much data, and too much time
between tests of serial vs parallel builds, for disk caching to have been the only factor.)

Quick follow-up: I just shut down all my apps and freed up a bit more RAM, then tried again. Now it runs through the files very quickly after the first test, suggesting that everything is being done in memory. (The results are all over the place now, but it's going through the files so fast that nothing is really being tested.)

To properly test this stuff I think you need to make sure enough data is being read, or memory is low enough, that it isn't all being cached.

Or to clear the disk cache between each test (not just each run of the exe, of course, but each test within it). I don't know a way to do that, though.
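For what it's worth, there's no per-file cache-eviction call exposed in Win32 short of reopening with FILE_FLAG_NO_BUFFERING, but on POSIX systems posix_fadvise() can hint the kernel to drop a file's cached pages between test runs; a sketch:

```python
import os
import tempfile

# Hint the kernel to drop a file's cached pages between test runs.
# The data on disk is untouched either way; this only affects the cache.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 4096)
os.fsync(fd)  # pages must be clean before the kernel will drop them
if hasattr(os, "posix_fadvise"):  # Linux and friends only
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
os.close(fd)

with open(path, "rb") as f:
    data = f.read()  # this read should now have to go back to the disk
os.unlink(path)
print(len(data))
```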

It's not Mark's. He doesn't own an iPhone... Also, I'm not a reporter. I'm a
Technographer. For the record, I have multiple mobile devices in my gadget arsenal. The Samsung Epix Smartphone is my latest device. The iPhone is old news for me. I'm bored with it... At any rate, I'm thrilled to see how many people are watching this
interview!

Windows 7 promises to be a really solid general purpose OS. I'm running the pre-beta build released to PDC2008 attendees and it's impressive.

The VHD stuff sounds really awesome :O Being a guy who installs a lot of beta stuff, could I install Windows on a VHD and just have it as a backup, and if I want to reset, could I just point the boot manager at a different VHD and be done?

Also, I could do a lengthy OS install on a VHD image and then boot the image just to finish the installation. Really, really cool.

Blocking IO is blocking, sjh30. That means that the thread is blocked, i.e. suspended, i.e. not running. Its quantum is no use to it.

You are confusing real time passing with thread quantum time, which is local to the thread and only counting while the thread is making forward progress. Blocked threads are not making forward progress.
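The distinction can be seen directly: wall-clock time passes while a thread is blocked, but its CPU time barely moves. A Python sketch, with sleep standing in for a blocking read:

```python
import time

# A thread blocked on I/O (simulated with sleep) consumes essentially no
# CPU time: wall-clock time passes, but the thread's CPU time barely
# moves, which is why a longer quantum would buy a blocked thread nothing.
wall_start = time.perf_counter()
cpu_start = time.thread_time()

time.sleep(0.2)  # stand-in for a blocking ReadFile call

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.thread_time() - cpu_start

print(f"wall {wall_elapsed:.3f}s, cpu {cpu_elapsed:.4f}s")
```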

Well, it'll always take longer to create two zips at the same time than in serial, since they are definitely I/O-bound operations (assuming your CPU is fast enough) and the two sets of concurrent reads and writes force the disk to seek back and forth between the two files (this has nothing to do with the fragmentation of those files).

But you're right that his solution is not really the correct one, as it makes no sense for the OS to keep the CPU waiting on an I/O operation when it could be executing some other thread that will actually use the CPU. What might make sense (but you'd have to experiment to be sure, I think) would be to find a way to switch to another thread which wasn't disk-bound.

WinRAR is very smart to do this, and it makes sense for its scenario, but the OS can't enforce this generally, since users expect the system to be responsive, which means programs can't be forced to wait for other programs' disk-bound operations to complete before they can run (or even perform their own disk I/O).

Perhaps solid-state devices will make all of this moot, since they don't have seek time. It would be interesting to run the zip experiment against a flash drive.

Regarding Windows 7 improvements, I'd also like to point out a big issue in all previous versions of Windows: the system hangs when a CD/DVD disc is inserted into the drive, and stays hung until the disc TOC is read.

When discs with scratches or of poor quality are inserted into the drive, the system sometimes hangs completely, and the user has to perform a hard reset; even when the disc is taken out of the drive, Windows remains unresponsive! I cannot tell you how many times it has happened to me and my friends.

I'd love to see improvements in Windows 7 in how it handles attached devices; it should not get into a state where the user has to do a hard reset because of a device delay, specifically with CD/DVD drives.

BitCrazed is of course right in a theoretical world, but the reality of Windows systems is that fragmentation and seeking does indeed kill performance on the average system. It's certainly so for me on a quad-CPU system; heavy I/O brings the system to its knees responsiveness-wise (hangs of c. 30 secs are not unheard of).

chall3ng3r's entry is interesting, but I think it's much more of a generic issue than a Windows-specific one. Damaged CDs have the same effect on just about every OS I've tried, including OS/2, Linux and Windows. I rather suspect it's down to the drive's firmware getting stuck in a tight loop and becoming unresponsive and/or locking the bus (DMA?).

Must say I found this interview very interesting, and the explanation of MinWin definitely cleared up that issue. On the subject of MinWin, I think this is a great approach to the problem of a messy API, and I can see the parallels to the OSI model with the use of layering. Nice work!

I haven't had the chance to watch the whole video yet, but I'd like to ask if you have a schedule to release MinWin for mobile devices. I remember seeing something last year saying the goal was to get MinWin to 4MB. How big is it now? Does MS have a website where we could follow the progress on this product?

I've often been completely surprised at the behaviour of Windows in relation to how it handles removable devices. It got a little better over the years with the release of XP and eventually Vista, but I have seen on numerous occasions Windows completely lock up the entire system because it either couldn't read a CD or a user attempted to access a removable device that did not have any media in it.

I fail to see how, in this day and age, the entire OS needs to be affected, be it slowing to a crawl or completely locking up Explorer, just because it hasn't got a response from removable media. This also goes for accessing a network resource that is no longer available: the entire Explorer window will lock up and become unresponsive until either contact is restored or a network timeout occurs.

It's unforgivable that the user's GUI will just lock up in this way; surely the processing/waiting for this action could be kept invisible to the user, with all the interactive components staying responsive. Maybe I'm asking for something here that isn't possible, but I've never seen this behaviour on any Linux distro.
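The pattern being asked for (run the risky device access on a worker thread while the interactive thread stays free) can be sketched like this in Python; slow_read is a hypothetical stand-in for probing an empty drive or a dead network share:

```python
import concurrent.futures
import time

# Run the slow device/network access on a worker thread and poll it with
# a short timeout, so the "UI" thread never freezes. slow_read() is a
# hypothetical stand-in for probing an empty CD drive or a dead share.
def slow_read():
    time.sleep(0.5)  # device taking its time to respond
    return "TOC data"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_read)
    ticks = 0
    while True:
        try:
            result = future.result(timeout=0.05)  # brief wait, then yield
            break
        except concurrent.futures.TimeoutError:
            ticks += 1  # back to painting / pumping the message loop

print(f"stayed responsive for {ticks} ticks; got: {result}")
```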

Following up Dimebag's comments: although, as we can all see, Windows I/O handling is problematic, moving on to the more specific issue of the OS hanging while dealing with CDs/DVDs etc. (especially those with physical damage), I'm not convinced that the problem is Windows-specific. I'm rather more convinced that most/all drive firmware doesn't handle such media well.

I've seen similar lockups while trying to read damaged media in OS/2 and Linux. I wonder whether it's due to the DMA access to the damaged disc "locking the bus" and preventing other activity. I have a strong suspicion that this is the case, since when I've turned DMA off and used the same disc, the lockup doesn't occur (although obviously without DMA, for normal discs, you get very suboptimal performance).

I hope they fix this! Today's Core i7 965/975s don't use just the FSB/QPI to identify the clock of a processor; they also use the multiplier settings. It appears to be a kernel issue, and I've sent this to the team.

Perhaps this should be addressed to the team, but as computing is pushing forward into the x64 realm, there are issues that need to be addressed before launch, or Windows 7 will be behind the curve. One issue I've noted is that Windows still refers to the FSB to determine the clock speeds on most processors. This was fine a year ago, when processor speed calculations were dependent on the FSB. However, with the advent of the new Intel Core i7 processors, especially the Extreme Editions, the clock speed is determined by the multiplier settings as well, rather than the QPI/FSB setting alone. For example:

The multiplier on my Core i7 965 EE is 24x, with a base clock of 133MHz. That's seen in the BIOS, and in Windows, as 3.2GHz. However, many users, including myself, overclock our processors. So when I set my multiplier to 34x with the same base clock, I'm at 4.5GHz in the BIOS, but in Windows it remains 3.184GHz. This is because Windows is using only the new processor's QPI settings to determine the clock, when in reality it should also be using the multiplier setting to determine the clock speed of the processor. I bring this up because I sent feedback on this issue and have yet to see an update to address it. As you know, most new computers will have the Intel Core i7 processors in them. It would be nice to have Windows read them correctly, versus having to use third-party software to get correct processor speed ratings. Thanks!
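The arithmetic behind the complaint, as a trivial sketch: effective core clock is base clock times multiplier, so reading only the QPI/base-clock side misses a raised multiplier.

```python
# Effective core clock is the base clock (BCLK) times the multiplier,
# so reading only the QPI/base-clock side misses a raised multiplier.
def core_clock_mhz(bclk_mhz, multiplier):
    return bclk_mhz * multiplier

stock = core_clock_mhz(133, 24)        # ~3.2 GHz, as the BIOS reports
overclocked = core_clock_mhz(133, 34)  # ~4.5 GHz after raising the multi
print(stock, overclocked)  # 3192 4522
```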

Shame, Charles, as this is a hit people do happen to notice when they have an abundance of memory.

I understand when pages are being written for cause; however, NT writes pages proactively, long before they're even considered candidates for release.

As I mentioned with Landy (who did say this was going to be considered), this proactive writing to the pagefile when the system is not under pressure should enjoy a different policy than when the pages are being written actively.

I understand that only unique images in RAM not represented on disc or the network are written to the pagefile; however, if there is no memory pressure yet there is hard-drive use, I think it would smooth things out if there was a "no memory pressure" policy for pages that aren't even candidates for release.

Even if apparently there aren't many changes from Vista, at least major ones, the really cool stuff is that, because systems got bigger and bigger, eliminating the acquisition of a global lock, called the dispatcher lock, can be the solution for scaling to a lot of threads. I'll probably be switching to Windows 7 soon.

I think this video has helped me understand better how programs work. I would definitely recommend it to a beginner programmer, even though it isn't about that at all. It's just the way that he explains the tasks and how Win7 manages them now.
