Cloud infrastructure provider SoftLayer has updated its offering with hybrid servers. The company now joins a small group of vendors offering GPU compute power on demand. HPC in the Cloud’s Tiffany Trader covered the development in a feature story yesterday.

According to Trader’s report, the SoftLayer portfolio includes both dedicated and virtualized environments, but the hybrid servers are strictly bare metal. The ordering process includes customizations for storage, processor, OS, memory and network connectivity. Now that GPU-equipped nodes with the latest NVIDIA Tesla parts are available, users can get access to an additional 665 gigaflops of performance per server.

The hybrid systems will sport Intel’s new Sandy Bridge processors. The entry-level server, which goes for $879 per month, comes with a pair of Xeon E5-2620s, 16GB of memory, one GPU and 500GB of storage. The host processor can be upgraded to the E5-2690, which pushes the rental to $1,179 per month.

SoftLayer owns more than 100,000 servers in Europe, Asia and North America. It’s likely the company has distributed a set number of GPUs to each datacenter and will upgrade systems as demand for hybrid computing grows. Support is divided between SoftLayer and the user: the company handles all the hardware issues, with software, including provisioning, left to the customer.

Some early adopters of the cloud offering come from the oil and natural gas industry, where the infrastructure is being used for seismic workloads. Media and entertainment companies, meanwhile, are taking advantage of the GPUs to offload graphics rendering. The company’s chief scientist, Nathan Day, expects the GPUs to attract research-based HPC users as well.

Of the cloud providers that offer GPU computing, Amazon is arguably the most popular player. Day drew a contrast between SoftLayer and Amazon’s offerings, explaining that SoftLayer makes its hardware available through a bare metal service, whereas Amazon only offers virtualized instances. When asked why his company offered bare metal services to users, Day responded, “So they don’t have to pay the hypervisor tax.”

To minimize latency, the company operates 13 datacenters and 16 points of presence in major cities across the globe. By spreading out its infrastructure, SoftLayer aims to reduce ping times, physically locating systems as close to the user as possible. While this may not be enough for extremely latency-sensitive applications like high-speed trading, it certainly doesn’t hurt.

Takeaway

There are few HPC cloud vendors and even fewer offering GPU-equipped infrastructure. Companies considering running HPC applications with limited or non-existent in-house infrastructure, or those looking to shift IT costs from capital to operational expenses, may find SoftLayer’s cloud a viable alternative.

NVIDIA Pokes Holes in Intel’s Manycore Story

As NVIDIA’s upcoming Kepler-grade Tesla GPU prepares to do battle with Intel’s Knights Corner, the companies are busy formulating their respective HPC accelerator stories. While NVIDIA has enjoyed the advantage of actually having products in the field to talk about, Intel has managed to capture the attention of some fence-sitters with assurances of high programmability, simple recompiles, and transparent scalability for its Many Integrated Core (MIC) coprocessors. But according to NVIDIA’s Steve Scott, such promises ignore certain hard truths about how accelerator-based computing really works.

Over the past couple of years, Intel has been telling would-be MIC users that its upcoming Knights Corner coprocessor will deliver the performance of a GPU without the challenges of having to adopt a new programming model — CUDA, OpenCL, or whatever. And since the MIC architecture is x86-based (essentially simple Pentium cores glued to extra-wide vector units), developing Knights Corner applications will not be much different from programming a multicore Xeon CPU.

Leveraging that commonality, Intel says its compiler will be able to generate MIC executables from legacy HPC source code. And it will do so for applications based on both MPI and OpenMP, the two most popular parallel programming frameworks used in high performance computing. Essentially Intel is promising a free port to MIC.
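To make the “free port” pitch concrete, here is a minimal sketch of the kind of legacy OpenMP loop Intel has in mind (the function and variable names are hypothetical). Intel’s claim is that such code simply recompiles for MIC; Scott’s counter-argument, below, is that recompiling is not the same as getting accelerator-class performance out of it.

```c
/* Hypothetical legacy OpenMP kernel of the kind Intel says will
 * recompile unchanged for Knights Corner: a flat, data-parallel loop
 * split across however many threads the runtime provides. */
#include <omp.h>

void daxpy(long n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```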

Not so fast, says Scott, the former Cray CTO who joined NVIDIA last year as chief technology officer of its Tesla business. According to him, porting applications for MIC, or even developing new ones, won’t be any easier than programming GPUs, or for that matter, any accelerator. In a blog post published on Tuesday, he described the problems with Intel’s manycore narrative and its claims of superiority over GPU computing.

Scott is not arguing against the MIC as an accelerator, per se. He and most of the community are convinced that HPC needs hybrid (or heterogeneous) computing to move performance forward without consuming unreasonable amounts of energy. Traditional CPUs, whose cores are optimized for single-threaded performance, are not designed for work requiring lots of throughput. For that type of computing, much better energy efficiency can be delivered using simpler, slower, but more numerous cores. Both GPUs and the MIC adhere to this paradigm; they just come at the problem from different architectural pedigrees.

The problem is that running throughput code on a serial processor sucks up too much energy, which is the situation many users are facing today with conventional CPUs. Conversely, running serial code on a throughput processor is just too slow, and defeats the purpose of having an accelerator in the first place.

Even if low single-threaded performance wasn’t an issue, today’s accelerators live on PCIe cards with limited amounts of memory (usually just a handful of gigabytes) sitting at the end of a PCIe bus. So if the entire application were to run on the accelerator, all its data and instructions would have to be shuttled in from main memory in chunks. Consider that today, with only a portion of the application living on the GPU, the PCIe bottleneck can still hinder performance. Stuffing the whole program on the accelerator would make it that much worse.

So the main thrust of Scott’s critique is that for hybrid computing to work, you have to split the application intelligently between the CPU host and the accelerator. That’s true, he says, whether you’re talking about an x86-based accelerator like MIC or a graphics-based one like Tesla. “The entire game now is how do we deliver performance as power efficiently as possible,” he told HPCwire.

Intel has revealed very little about application performance on the future MIC parts, and has not really addressed how that application split is going to work programmatically, or even that it’s necessary. To date, they and some of the early MIC adopters have mostly talked about recompiling existing codes, based on OpenMP and/or MPI, and running the resulting executable natively on MIC.

Running MPI codes on a manycore architecture is particularly problematic. First there’s the memory capacity problem mentioned above (each MPI process uses quite a bit of data). And then there’s the fact that once the number of MPI processes exceeds the accelerator core count — 50-plus for Knights Corner — the application would have to use the server node’s network card to communicate with MPI processes running on other nodes. As Scott points out in his blog, that’s far too many MPI processes for a typical network interface; all the contention would overwhelm the available bandwidth.

OpenMP has the opposite problem, since most programs using this model don’t scale beyond 4 to 8 tasks. As a result, there would be no way for most OpenMP applications to utilize the 50-plus cores expected on Knights Corner-equipped nodes. And once again, there’s the memory capacity problem. Like MPI, OpenMP expects to live in the relatively spacious accommodations of the CPU’s main memory.

Scott says if you’re just going to use a compiler to transform your existing application to run on the MIC, you’re not doing hybrid computing at all. More importantly, running the entire code on the accelerator does not take performance into account. After all, the idea is to speed up the application, not just recompile it so that it functionally works. “We don’t think it’s legitimate to talk about ease of programming without talking about performance,” he says.

Scott argues that for applications to take advantage of these new throughput processors, programmers will have to delve into some sort of hybrid programming model that splits off the parallel throughput code from the serial code. For NVIDIA GPUs, the parallelism can be exposed with CUDA or with the emerging set of OpenMP-like directives for accelerators, known as OpenACC. There is already an initial CUDA port for x86 developed by PGI, so that’s one option. But the OpenACC framework is likely to reach a larger audience of developers since it offers a higher level of abstraction than CUDA and it looks like it will eventually be folded into the industry-standard OpenMP API.

The idea is that programmers can use OpenACC today to develop GPU-accelerated applications with the anticipation they will be able to use the same code for other accelerator-based hardware platforms, like MIC and AMD’s Fusion or discrete GPU processors. Intel and AMD have not jumped on the OpenACC bandwagon as of yet, but were it to be adopted as a standard and demanded by their customers, they would certainly have to support it.

Even OpenACC is not a magic bullet, though. The programmer still has to dive into the source code and tell the compiler where and how to carve out parallel code for the accelerator. And as Scott admits, that can be a significant effort, especially for large legacy HPC applications that were written for homogeneous CPU-only machines.
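As a rough illustration of what that source-level work looks like (names hypothetical, clauses simplified), an OpenACC-annotated loop might look like the sketch below. The data clauses are where the programmer makes the CPU/accelerator split explicit: they tell the compiler which arrays to ship across the PCIe bus to the accelerator and which results to bring back.

```c
/* Illustrative OpenACC sketch: the pragma asks the compiler to offload
 * the loop to the accelerator, and the data clauses spell out what
 * crosses the PCIe bus in each direction. Real codes need far more
 * care about where data lives between kernels. */
void daxpy_acc(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The directive itself is the easy part; choosing which loops to annotate and restructuring data movement so transfers don’t swamp the speedup is the significant effort Scott concedes above.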

But, he maintains, if you’re interested in taking advantage of the performance offered by throughput processors like GPUs and MIC, the work has to be done. Processor clocks are not likely to get any faster than they are today. So the only way to increase performance is via parallelism. As Scott says, “Computers aren’t getting faster, they’re only getting wider.”

Convey Debuts Second-Generation Hybrid-Core Platform

In an HPC market that seems determined to go down the CPU-GPU path, upstart Convey Computer may yet offer a few surprises. The company today unveiled the sequel to the HC-1 platform it introduced in 2008. Called the HC-1ex, the new system adds a lot more performance and capability, but retains the original x86-FPGA co-processor design.

Convey’s first HC-1 design, unveiled at SC08, began production shipment in 2009. Although still in startup mode, Convey seems to be on sound financial footing. They collected their second round of funding last summer, bringing their total to $40 million. Since then the company has increased its head count from 25 to 55.

According to company president and CEO Bruce Toal, they now have roughly 30 customer deployments, ranging from single units up to 8-node clusters. The majority of the systems have been installed for bioinformatics, government and research applications, with financial services, energy and logic simulation also represented.

Because of the platform’s malleability, it can serve virtually any HPC application domain. The basic concept is to offer a standard x86 server platform, but accelerated by FPGAs in the guise of a co-processor. For a specific application domain (or even just a single application), the FPGAs are programmed to extend the x86 ISA with custom instructions intended to accelerate the target software. These instructions are then generated by the Convey tools during source compilation. It’s a nifty little design, and worlds away from the more typical FPGAs-as-an-afterthought HPC approach that has been used in the past.

The CPU and FPGAs are glued together via the shared memory subsystem, which blends the x86 memory with the customized high-performance memory on the co-processor side. This allows both of them to work within the same cache-coherent shared memory space. The approach is quite different from a conventional HPC accelerator, which typically treats the FPGA, GPGPU, or whatever as an I/O device, hanging off a PCI-Express slot. In Convey’s model, the FPGAs are virtualized and act as a true co-processor. “It enables you to build a completely integrated compiled environment, which we believe is a fundamental element for hybrid computing,” explains Toal.

The HC-1ex is the higher-end version of the HC-1 but, according to Toal, is not a replacement for the original. In the second-generation product, the company has upgraded the dual-core Xeon to a quad-core part, and increased CPU memory capacity from 64 GB to 128 GB. More importantly, though, the HC-1ex has moved up to the latest generation Xilinx Virtex-6 FPGA (the LX760) from the Virtex-5 part (the LX330) in the original HC-1. The newer 40nm FPGA offers more than three times the gates of its predecessor.

Assuming the application can take advantage of those additional gates, that translates to higher absolute performance, better price-performance and increased performance per watt. For example, using a Smith-Waterman search (a sequence alignment algorithm widely used in bioinformatics that scales extremely well on FPGAs), the HC-1ex performed 401 times faster than a single-core Intel CPU. That’s more than twice the performance of the HC-1. The general idea is to replace multiple racks of conventional servers with a single rack of Convey gear, so as to reduce floor space requirements, power usage and overall total cost of ownership (TCO).

The first HC-1ex was deployed at Georgia Tech in September. Rich Vuduc, assistant professor in the School of Computational Science and Engineering, is leading a research team applying heterogeneous computing systems to data analysis and data mining applications. With the HC-1ex, Vuduc is developing a custom FPGA personality for his particular data analytics domain. The work is being partly funded under a DARPA contract, so one could surmise it could end up in some interesting defense- or security-related applications.

Beyond the HC-1ex unveiling, Convey is also announcing some new partnerships this week. These include Panasas, AutoESL, Impulse, Jacquard Computing, and Voci Technologies. The Panasas collaboration will bring the company’s storage client software into the Convey OS and cluster framework software. The next three, AutoESL, Impulse and Jacquard, are providing higher level FPGA programming tools to help develop co-processor personalities.

The last-mentioned partner, Voci, is actually OEMing the Convey gear in the form of a speech recognition appliance. Called V-Blaze, the appliance can process a hundred phone conversations in real time and convert them to text, which can then be keyword-searched for further analysis. One application would be call center monitoring. Purportedly, the V-Blaze appliance delivers much better resolution and lower error rates than commercial voice recognition products, with throughput roughly 100 times what a single CPU could accomplish and perhaps 10 times that of a GPGPU implementation.

The Voci collaboration is a good example of how Convey can expand its market other than through direct end user sales. But Toal does expect to see sizable growth in such sales over the next year, thanks to a larger distribution channel and the additional technology partnerships, not to mention the new HC-1ex offering. Fighting the GPGPU juggernaut won’t be easy, but the true believers at Convey seem determined to do so.

It’s the age-old question: GPU or CPU? OK, so it’s not age-old, but it is a popular topic at present. GPU-pusher NVIDIA has done a good job getting out the word on the GPU’s incarnation as an application accelerator extraordinaire. 50-100x speedup anyone? So it’s a no brainer, right? Not so fast. Desktop Engineering‘s Contributing Editor Peter Varhol has an interesting writeup comparing the two architectures. The gist of the article is that we need both, but there are best use cases for each, and the field is definitely still evolving.

GPUs have traditionally been used for manipulating computer graphics…they are, after all, graphics processing units. But over the last decade, engineers and scientists have increasingly been using GPUs for non-graphical calculations. Benchmarks show GPUs performing very well on engineering applications, but there’s a downside. Because the GPU is a specialized processor, it has difficulty performing a host of general-purpose jobs — in other words, the GPU has serious limitations. And it’s difficult to get existing software to run on GPUs without vendor support.

Says Varhol:

So it turns out that you still need the traditional CPU after all. You need it because that is where the vast majority of engineering and office software runs, where the primary software development skill set resides, and whose all-around performance is at least good enough to remain in that role for the foreseeable future.

What GPU computing vendors have done is to pair the GPU with the CPU in one system. The GPU does the specialized computation and the CPU does what it has always done, the general-purpose computation. Right now, the hybrid model offers the impressive speed of the GPU and the general-purpose computing of the CPU. Eventually, these will likely come together on the same die. In fact, AMD has announced its Fusion integrated CPU/GPU architecture, which it calls an Accelerated Processing Unit (APU).

Varhol again:

Many systems using GPUs and CUDA have a single industry-standard processor, usually running Windows or Linux. An application written for a GPU typically has a front end running on one of these operating systems. When a computation is required, the relevant data is passed off to executable code loaded onto the GPUs. When execution is complete, the results are returned to the CPU and displayed.
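For readers who haven’t seen that round trip in code, the pattern Varhol describes maps onto a handful of CUDA runtime and cuBLAS calls. The sketch below is a minimal, hypothetical example (error handling omitted): stage the data in GPU memory, run the computation there, then copy the result back to the CPU.

```c
/* Minimal sketch of the CPU-to-GPU offload flow described above,
 * using the CUDA runtime and cuBLAS C APIs. Error checks omitted. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void scale_and_add_on_gpu(int n, float a, const float *x, float *y)
{
    float *d_x, *d_y;
    cublasHandle_t handle;

    /* 1. Stage the input data in the GPU's on-board memory. */
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMalloc((void **)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    /* 2. Run the computation on the GPU (y = a*x + y). */
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

    /* 3. Return the results to the CPU for display or further work. */
    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
}
```

The same traffic pattern is why, for small problems, the PCIe transfers rather than the arithmetic often end up as the bottleneck.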

If you’re in need of a GPU-based workhorse, vendors such as Appro, Microway, Supermicro and Tyan all offer systems with multiple processors and cores targeted to specific uses like engineering.

Summing up, Varhol states:

An ideal configuration is one with one or more CPUs and a set of GPUs that use CUDA or similar parallel computation architecture. All support applications, such as email, web browsing, and word processing use the CPU. And with tools such as Accelereyes Jacket … and NVIDIA Nexus, engineering software will eventually take advantage of both to speed up complex computations.

Keeping all of the above in mind, the real question becomes apparent: whether to use a traditional CPU-only architecture or a hybrid approach that takes advantage of the best each processor has to offer. As software tools evolve to streamline the process, the “best of both worlds” scenario becomes more and more appealing.

NVIDIA’s GPU computing ambitions got a major boost today with IBM’s announcement of the iDataPlex dx360 M3. The new HPC server pairs two Tesla GPUs with two CPUs inside the same server chassis. As such, IBM represents the first Tier 1 server vendor to bring CPU-GPU “hybrid” computing to the high performance computing market.

“This is the first time we’re in a mainstream server,” says NVIDIA’s Sumit Gupta, senior product manager for the Tesla GPU computing group. Last week, Appro, Supermicro, AMAX and Tyan announced integrated CPU-GPU server gear based on NVIDIA’s new Fermi architecture Tesla 20-series devices. What IBM provides is a broad global sales channel and unmatched brand recognition.

All these systems, including the new iDataPlex from IBM, make use of the latest Tesla M2050 computing modules that can be integrated into a CPU-based host system. Each M2050 delivers 515 gigaflops of raw double precision floating point performance (or 1,030 gigaflops single precision), and comes with 3 GB of GDDR5 memory. IBM customers can also opt for the M2070, which offers the same floating point performance, but with 6 GB of local GPU memory.

The base configuration on the new iDataPlex consists of a two-socket motherboard with the latest Intel Xeon CPUs. A riser card is used to hook in the Tesla modules. The configuration allows for relatively easy maintenance and replacement of the GPU components.

IBM’s move into the GPU computing space is a big win for NVIDIA and for GPU acceptance in HPC, in general. Over the past couple of years, the company had remained very quiet on the GPU computing front, and there were no indications it would be adding this capability to its HPC lineup. “I think what’s changed is that customers have been experimenting for a long time and now they’re getting ready to buy,” says Dave Turek, vice president of the deep computing group at IBM. “It’s as simple as that.”

According to Turek, IBM has been tracking customer demand for this capability for some time, and felt now was the time to jump onto the GPU computing train. From Turek’s point of view, this is less about the extra capabilities provided by NVIDIA’s new Fermi architecture (ECC memory, double precision, programmability) and more about the general increase in customer acceptance of the GPU computing paradigm. “If the marketplace hadn’t been ready at this time, we would have bypassed this for sure,” he admits. “It wasn’t the technology that drove us to do this. It was the maturation of the marketplace and the attitude toward using this technology.”

The company expects the new GPU-equipped iDataPlex to get the most traction in what have become the early adopter segments for GPU accelerated computing, namely the oil and gas industry, big science research at government labs and universities, and the biotech space (with perhaps some uptake by financial institutions). All of those segments have a few things in common that make them an especially attractive target for GPU acceleration: a nearly endless need for more vector math capability, in-house programming expertise to push their apps over the GPU programming hurdle, and a limited dependency on ISVs who may or may not be interested in GPU support.

IBM’s decision to pursue the HPC market with a CPU-GPU offering is particularly relevant in another sense. Over the past couple of years, the company had pinned much of its hybrid supercomputing hopes on its own HPC variant of the Cell processor: the PowerXCell 8i. That processor was used to power the Roadrunner supercomputer, the first general-purpose computing system to break the Linpack petaflop barrier back in 2008. IBM still offers the Cell-based QS22 blades based on the PowerXCell 8i, but has halted plans to forge a successor to that chip design.

In fact, from IBM’s point of view, the GPU-equipped iDataPlex is just another entry in its rather large portfolio of HPC hardware. Between the new Power7-based 755 servers, the Blue Gene/P, and its x86-based iDataPlex gear, IBM has probably the broadest HPC offerings in the industry. The hybrid computing iDataPlex is another way the company thinks it can cover what has become a fairly diverse HPC market.

Turek says IBM will be careful not to overhype its new GPU-accelerated boxes. Although coprocessor acceleration seems to be in vogue right now, not every application is going to be able to take advantage of it. Certainly most matrix math-intensive apps will be able to realize a several-fold performance boost compared to a CPU-only implementation, but it really depends on how much of the code is engaged in these types of operations and how much is just running sequentially.

If Linpack is a guide — and that’s really all it is — some apps will do very well indeed on the new Fermi GPUs. NVIDIA ran some benchmarks on its own CPU-GPU server, consisting of two Tesla C2050 cards (comparable to two M2050s) plus two Intel Xeon X5550 processors, with 48 GB memory. They found Linpack performance was eight times that of a comparable CPU-only server: 80.1 gigaflops for the CPU version versus 656.1 for the GPU-accelerated box. When they looked at price-performance and power usage, they found a five-fold advantage. So for $1 million worth of CPUs, you can get 10 teraflops of Linpack, while that same money spent on GPU-CPU gear will get you to 50 teraflops — and a certain spot on the TOP500 if you’re interested in HPC celebrity.

With IBM now in the GPU computing game, it’s almost a sure bet HP and Dell won’t be far behind. And with the tier 1 OEMs onboard, integrated CPU-GPU servers are likely to become standard equipment from most, if not all, HPC vendors over the next several months.

Some trends that have gained a strong foothold will persist in 2010. For example, continued growth in the sheer quantity of digital information will continue to propel data deduplication and SSDs. The green IT movement will push cloud services, and the customer will rule. Some new trends will also emerge: Spending will shift to pre-integrated systems, and new hybrid computing platforms will emerge.