Posted by ScuttleMonkey on Monday July 02, 2007 @01:41PM
from the dave-you-like-your-ncsa-friends-better-than-me-don't-you dept.

Peter Kern writes "The National Center for Supercomputing Applications (NCSA) is one of the great supercomputing facilities in the world and is home to 'Abe', one of the top 10 supercomputers on the current Top 500 list. TG Daily recently toured the facility and published a stunning report about their computing capabilities (more than 140 teraflops), power requirements (a sustained 1.7 megawatts), enormous 20-ft chillers in four cooling systems and other installations that keep the NCSA online."

I sometimes toy with the idea of going to the various used computer stores, buying a pallet of used computers, and making my very own Beowulf cluster. I've seen pallets of fast P3 and low-end P4 boxes at interesting prices. Boeing Surplus [boeing.com] has large numbers of essentially identical computers almost every time I go there. I remember once looking through a big bin for a particular size of wrench and grumbling to the salesperson, "Surely there is something bolted to a 747 with bolts this size!" They laughed...

I'd make sure I had ample experience in the systems and networking administration arenas. Know multiple flavors of UNIX, know Linux, and know multiple clustering technologies -- everything from shared-memory architectures to high performance clustering to grid computing to high availability systems. Know the systems available from multiple vendors -- IBM, HP, Sun, Red Hat, Veritas. Knowing storage area networking is pretty smart also. Know networks -- understand them at all levels in the OSI and TCP/IP models. Understand application and system-level debugging. Understand how to analyze the performance of a complete system, from the application level all the way to the lowest levels of an individual node.

Oh, and the ability to think on your feet, communicate with engineers and scientists, stay very organized, and work independently doesn't hurt either.

Certs aren't all they're cracked up to be. I've been working in this field for many years and I'm near the top of the payscale. Any idea how many certs I have? Zero. As in none. Experience and a degree are far more valuable than certs, IMHO.

Thanks, but I did see that. It only lists common majors and the fact that you need to list all relevant work experience + three references. No specifics on the types of people who are likely to be employed outside of some vague areas of study.

I worked for NCSA as a web developer/researcher. Not all of the jobs there involve complex supercomputing tasks, so you may find an opening doing web development, networking, basic tech support, etc. From that point if you are able to train yourself and network inside the organization, you could probably move towards working with the big servers in time.

In 10 years, this will be on the desktop, everyone will yawn because we will have been boiled frogs, and it won't impress us then. In 10 years, you'll look at someone's tie-clasp computer and say, "Wow, I remember when that took up an 8 by 18 block of my desk."

In 10 years, DARPA will announce the shutdown of the Quantum Computing Project because it will be discovered that every time Red Hat Mandriva Winux OS/Q green screens, a parallel universe winks out of existence.

My prediction is that in 10 years the place will be functionally obsolete as a result of processing advancements elsewhere.

NCSA has been around for a long time and will be around for a long time more. Your prediction is based on the assumption that the systems at NCSA are static, which is completely untrue. If the government decides to start up a mega-super-quantum-ultra-computing project, NCSA is pretty high on the list of places that are going to get the grant.

No, my prediction is based on the idea that processing power will have overtaken processing needs by so much in 10 years that it will be pointless to have a dedicated processing facility. Sure, maybe it'll take 20 years, but it's going to happen.

I did actually RTFA and see that they have several generations of hardware in use as they continually upgrade.

No, my prediction is based on the idea that processing power will have overtaken processing needs by so much in 10 years that it will be pointless to have a dedicated processing facility. Sure, maybe it'll take 20 years, but it's going to happen.

Piffle. There will be a new version of Windows by then. That will eat up at least 50% of this new processing power.

processing power will have overtaken processing needs by so much in 10 years that it will be pointless to have a dedicated processing facility.

Why would you expect this to ever happen? When it comes to modeling the behavior of physical systems, whether it's the weather and climate or molecular structure, I don't think there is a limit to processing "needs". More power just means you can run more models, or more accurate models, or bigger models. I'm not sure why you would expect that to change.

I'm an undergrad assistant sysadmin and a programmer for a department here at UIUC. As a sysadmin, one of my primary responsibilities is maintaining and running our 41-node Linux cluster and the associated mass storage system. As a programmer, I'm responsible for hacking on a climate model that will be a rather big deal once it works, because it runs a very fine-resolution model of the global system.

Something I've noticed is that once a professor has gotten done modeling something, the immediate respo

Is there some way to perform a graceful shutdown before the water gets pumped and released? If the supercomputers are still on when the water is discharged from the pipes, wouldn't that damage the systems?
If they don't want to use Halon, why not use a more computer-friendly compound like FM-200 (http://en.wikipedia.org/wiki/FM-200/ [wikipedia.org])?

The water isn't getting pumped all over the motherboards of these computers or something drastic like that. What they mean is that they keep super-chilled water on hand at all times. This way, should there be some catastrophic overheating event, they already have cold water on hand, not the stuff that most liquid cooling systems use (which is just room temperature).

Uhm, go back and RTFA. No Halon system is used; a standard water sprinkler system is used. However, the one failsafe noted was that a smoke detector also had to activate in addition to the heat-fusible link in the sprinkler heads before torrents of water were released. So you need both an indication of smoke AND excessive heat for things to start getting wet.

the water isn't getting pumped all over the motherboards of these computers or something drastic like that. What they mean is that they keep super-chilled water on hand at all times.

Nope, they mean that if there's a fire, they're dumping tonnes of water directly onto the computer cabinets that are burning... Once the computer's on fire, water can't hurt it much further...

This is a pre-action water system, and they're becoming more popular in computer rooms now that Halon is falling out of favor. They start out dry (uncharged); then, if smoke/heat is detected (or some other combination of pre-action triggers), they charge with water but still don't 'go off'. Then, when a fire below a sprinkler head melts that head's fusible link, only that head opens.

From the fine article, the sprinklers are kept dry, and two events must happen before they are activated: a smoke alarm must trip, and then each individual sprinkler head has a thermal link that must melt before activating.
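For the programmers in the room, the two-stage arrangement described in the article boils down to a simple AND interlock per head. A toy sketch (function names are mine; this is obviously not real fire-control logic, just the rule restated as code):

```python
def pipes_charged(smoke_alarm: bool) -> bool:
    # Stage 1: the pipes only charge with water after the room's
    # smoke alarm has tripped; until then the system is dry.
    return smoke_alarm

def head_discharges(smoke_alarm: bool, link_melted: bool) -> bool:
    # Stage 2: an individual head opens only if the pipes are charged
    # AND that head's own fusible link has melted -- so heads go off
    # one by one, only over actual heat.
    return pipes_charged(smoke_alarm) and link_melted

# A false smoke alarm with no heat keeps everything dry:
assert not head_discharges(smoke_alarm=True, link_melted=False)
# Heat at one head with no smoke alarm also stays dry:
assert not head_discharges(smoke_alarm=False, link_melted=True)
# Only smoke plus heat at a given head gets that head wet:
assert head_discharges(smoke_alarm=True, link_melted=True)
```

The point of the two independent conditions is that neither a false smoke alarm nor a single overheated head is enough on its own to soak the machine room.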

With a 24/7/365 control room, so yes, they have a chance to shut things down.

The idea that tripping a single sprinkler head will set them all off at once is Hollywood fiction. They are set off one by one, when a fusible link burns out at the sprinkler head. Fancier heads can shut off once the heat subsides.

The idea that tripping a single sprinkler head will set them all off at once is Hollywood fiction.

Just to be pedantic, such systems do exist. They're called "deluge sprinkler" systems. Like a pre-action system, the pipes are normally kept dry until some external event triggers them. However, unlike a pre-action system, every sprinkler head is open, so once the water valve is opened, it immediately starts raining everywhere. They're mainly used in places where any sign of fire warrants immediate drastic action.

No, you mentioned Bush in a negative context. To get modded up, you have to do it in the only positive way possible: Last night, I got bush. Or, I got bushwhacked at the casino. Or, wow, what a xxxx. Anything else, and you would have to be lying.

Oh. In that case, I'll wear the down-modding as a badge of honor. Hey, Bushies! Your guy pardons criminals!

As a prospective student of UIUC, who also has a good friend attending the school in a CS Ph.D. program, I get a bit giddy inside any time I see the university in the media. I've been to that building when my friend was giving me a mini tour of the facilities, although I didn't go see the supercomputers themselves.

I have actually been in the newer facility dozens of times when I worked as an intern for the architect on the building. I actually drafted the final drawings for this project. It is a VERY nice facility, with some pretty cool under-floor cooling systems and things like that. I am pretty sure I have 3D digital models of the facility somewhere in my work records. The lecture auditorium bites the big one, though; purple seats? Nasty. The Siebel Center across the mini-quad is a much more interesting building than this one.

Me too. As an undergrad at UIUC in the early '90s, I worked as a computer operator in the ACB (slide 4 in the FA). They had some cool hardware: two Crays, including the beautiful Cray 2 with the waterfall, two Connection Machines (a CM2 and a huge CM5), a Convex, a big SGI of some sort, and some ancillary systems. The CMs arrived when parallel supercomputing was just becoming popular. The CM5 was a work of art, but when you opened the side panels, it was actually just racks of Suns inside (if I remember right).

There were a number of Sun workstations inside the CM5 that served as partition managers, but the computational horsepower was custom TMC hardware based on the SPARC processor. The CM5 was much more than a rack of Sun workstations. At NCSA, two of the cabinets had the Sun (SPARCstation 10???) partition managers, but the other three cabinets had nothing but the TMC hardware in them.

I guess I saw inside the wrong cabinet! Nice to know -- I've always had this "pay no attention to the man behind the curtain" feeling about that system. Of course, many parallel supercomputers these days really are "just" racks of commodity CPUs. No shame in that. Anyway, TMC certainly knew how to make a good-looking box.

All that supercomputing power and they come up with "a one-building electricity bill of $3 per second - or about $1,500,000 per year". I'm sure they meant $3 per minute, which is much more in line with all the other figures they quote. Still, that puts their electricity rate in the 10-cents/kWh range - surprisingly high for a large industrial customer.

You might be forgetting the peak demand charge. If you work that into the equation - roughly $20 to $25 per kW of peak demand, or roughly $36,000 to $45,000 of their monthly bill - you get back down to about $0.08/kWh for energy costs. And don't forget this is a large non-interruptible customer, so they will pay premium rates because ComEd (or whoever serves East Central IL) can't take them offline on a hot day.
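A quick back-of-the-envelope check of the parent's "$3/minute, not $3/second" correction, assuming the 1.7 MW sustained draw quoted in the summary and a flat ~$0.10/kWh rate (the rate is back-solved from the article's yearly figure, not stated in TFA):

```python
# Sanity check on the article's electricity figures.
POWER_KW = 1700          # 1.7 MW sustained draw, from the summary
RATE_PER_KWH = 0.10      # assumed flat rate in $/kWh

cost_per_hour = POWER_KW * RATE_PER_KWH            # $170/hour
cost_per_minute = cost_per_hour / 60               # ~$2.83/minute
cost_per_second = cost_per_hour / 3600             # ~$0.047/second
cost_per_year = POWER_KW * 8760 * RATE_PER_KWH     # ~$1.49 million/year

print(f"${cost_per_minute:.2f}/min, ${cost_per_second:.3f}/sec, "
      f"${cost_per_year:,.0f}/yr")
```

At 1.7 MW sustained, $3/second would imply an absurd ~$6.35/kWh, while $3/minute lands right on the quoted ~$1.5M/year, so the article's "per second" is almost certainly a typo.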

Why are they a non-interruptible customer? Datacenters with backup generators should obviously be interruptible customers as long as the cost of peak power is more than the cost of running your own generators, which is probably the case when you have megawatts of local power generation capability.

From TFA: They also had two dedicated UPS boxes which stood six feet tall, three feet wide and 12 feet deep. *Yawn*. Only impressive to the slashgeek with no real experience with heavy iron (i.e., most of them). When I was in the Navy, serving at a training center, we also had two UPSes this size. For each trainer/lab. And we had four labs.

Just in the Weapons Training end of the building.

Cooling and power conditioning for the training facility was in a separate 15k sq ft building.

So why the moderation as 'troll'? I just get tired of the technogeek porn in these articles. ("Oohh baby, what a big UPS, and your cooling towers... so smooth and sexy!") Articles like TFA always love to point out these systems - and they sound impressive to someone whose only experience is the desktop PC or small datacenter... But the reality is, they aren't anything rare or special. If you have the money, you can order one of those huge UPSes just about as casually as you can pick up its smaller brethren.

The most surprising thing in the article was how inelegantly they've solved the problem of inevitable hardware failure: limiting runs to only six hours. It seems like there just HAS to be a better way to handle the problem than this! Virtualization sounds a bit tricky, so why not just write the software to handle hardware errors in the first place? I.e., produce results, check to see if there was a hardware failure, and if so, re-do.

Maybe they already do this, and the reporter didn't catch it. But it'd surprise me if they didn't have better solutions than just hoping nothing bad happens during a run.
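For what it's worth, the usual answer in HPC is checkpoint/restart: periodically save enough state to disk that a failed run resumes from the last checkpoint instead of starting over. A minimal single-process sketch (the file name and state layout are made up for illustration; real MPI codes have to coordinate this across all nodes):

```python
import os
import pickle

CHECKPOINT = "run.ckpt"  # hypothetical checkpoint file name

def load_state():
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

def save_state(state):
    # Write to a temp file, then atomically rename, so a crash
    # mid-write can't corrupt the previous good checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def run(n_steps=100, ckpt_every=10):
    state = load_state()
    while state["step"] < n_steps:
        state["total"] += state["step"]  # stand-in for real computation
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            save_state(state)
    return state["total"]
```

If the node dies mid-run, relaunching the job repeats at most `ckpt_every` steps rather than the whole run - which is the re-do-on-failure idea, just bounded. The six-hour run limit plays the same role at the scheduler level when the application doesn't checkpoint itself.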

Much of the software which is run at the NCSA is home-grown software written by computational scientists, not computer scientists. For many of these massively parallel codes, written on top of MPI, fault tolerance really isn't all that easy. For a commercial production code on the order of Gaussian, this may be doable, but for bleeding-edge research codes, it may be a better use of the (human) time to push the algorithms rather than worry about fault-tolerance. From the user's perspective, jobs that are

Much of the software which is run at the NCSA is home-grown software written by computational scientists, not computer scientists.

I've seen code written by computational guys before. While not really terrible, it's not terribly reusable or maintainable. Obviously these guys don't study computer science, but I truly think there are gains to be made if they understood the tools they were using better.
For many of these massively parallel codes, written on top of MPI, fault tolerance really isn't all that easy.