Many Cores, Large Memory, Low Latency Memory Access – Numascale Shows a New Way (September 30, 2013)

Numascale offers a price breaker for shared memory systems by integrating a simple add-on card into commodity servers. The hardware is now deployed in systems with more than 1,700 cores, and the memory addressing capability is virtually unlimited. The technology has a set of interesting advantages that will catch the interest of innovative developers.

The big differentiator for Numascale's interconnect, NumaConnect, compared to other high-speed interconnect technologies is shared memory with cache coherency. These features allow programs to access any memory location and any memory-mapped I/O device in a multiprocessor system with a high degree of efficiency. The result is scalable systems with a unified programming model that stays the same from the small multi-core machines used in laptops and desktops to the largest imaginable single system image machines containing thousands of processors. The architecture is commonly classified as ccNUMA or NUMA, but the interconnect can alternatively be used as a low-latency clustering interconnect.

NumaConnect offers the largest shared memory available today. The hardware can address a 256 TB memory space, and no memory is lost to buffering copies of remote memory.

It is a common conception that the shared memory model makes parallel programming easier than the clustering or message passing model. Parallel programming will probably never become really easy, so any simplification should be welcome. Thousands of person-years of work have been put into parallelizing software for clusters, and those applications will probably keep running on clusters for a while. Applications that have not been moved to clusters, or that are difficult to parallelize for message passing, will benefit greatly from Numascale's offering.
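
To make the contrast concrete, here is a minimal sketch (illustrative C/OpenMP, not Numascale code) of the shared memory model: a single directive parallelizes a loop over a large array, and every thread reads and writes the array directly. On a message-passing cluster, the programmer would instead have to partition the array across nodes and exchange data explicitly.

```c
/* A minimal sketch (not Numascale code) of the shared memory programming model:
 * one OpenMP directive parallelizes the loop, and every thread can touch any
 * element of the array directly, with no explicit message passing. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t n = 100000000;                      /* 100 million doubles, ~800 MB */
    double *a = malloc(n * sizeof(double));
    if (a == NULL) return 1;

    #pragma omp parallel for                   /* the only parallel construct needed */
    for (long i = 0; i < (long)n; i++)
        a[i] = 2.0 * (double)i;

    printf("max threads: %d, a[n-1] = %g\n", omp_get_max_threads(), a[n - 1]);
    free(a);
    return 0;
}
```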

In the memory-mapped I/O scheme of x86 servers, all I/O devices are automatically shared, and all I/O devices in all boxes are directly available from any thread of the OS or application.

Numascale runs a plain standard OS with some minor add-ons to the standard Linux kernel, though some kernel parameters should be tuned for the OS to run well on a large number of cores.

With NumaConnect the system can run a single OS image, which greatly simplifies the tasks of system maintenance and operation. In a very large system, the number of OS instances can be reduced dramatically.

No virtualization software is needed for NumaConnect systems. Virtualization layers tend to be large, and large, complicated system software is never bug-free and introduces execution overhead.

Compared with software-based emulation of large shared memories, we think NumaConnect excels in several ways:

- Coherency at cache-line granularity, which software systems cannot exploit, gives a much lower probability of false sharing. A cache line is 64 bytes, whereas software systems must work at page granularity of 4 KB or 2 MB (see the sketch after this list).

- The system software already contains optimizations against cache-line false sharing, since it occurs on standard multi-socket servers as well.

- Buffer space for remote pages in software systems may consume as much as 25 percent of memory; NumaConnect does not use any buffers in main memory.

- NumaConnect shows very good performance for random access in large data areas.
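
The false-sharing point in the first bullet is easy to see in code. In the toy sketch below (illustrative only, not NumaConnect software), two threads increment two different counters: when the counters sit in the same 64-byte cache line, the line ping-pongs between the cores' caches, while padding each counter to its own line removes the contention. A page-granularity scheme would have to separate the counters by 4 KB or 2 MB to achieve the same effect.

```c
/* Toy false-sharing demonstration: two threads each increment their own counter.
 * In case 1 the counters share one 64-byte cache line; in case 2 each counter is
 * padded to its own line. Case 1 typically runs noticeably slower. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000L

struct padded { volatile long value; char pad[64 - sizeof(long)]; };

static volatile struct { long a; long b; } same_line;   /* a and b share a cache line */
static _Alignas(64) struct padded own_line[2];          /* one counter per cache line */

static void *bump_a(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) same_line.a++; return NULL; }
static void *bump_b(void *arg) { (void)arg; for (long i = 0; i < ITERS; i++) same_line.b++; return NULL; }
static void *bump_own(void *arg) { volatile long *c = arg; for (long i = 0; i < ITERS; i++) (*c)++; return NULL; }

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    pthread_t t1, t2;
    double t;

    t = seconds();                                   /* case 1: false sharing */
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL);
    printf("same cache line: %.2f s\n", seconds() - t);

    t = seconds();                                   /* case 2: padded counters */
    pthread_create(&t1, NULL, bump_own, (void *)&own_line[0].value);
    pthread_create(&t2, NULL, bump_own, (void *)&own_line[1].value);
    pthread_join(t1, NULL); pthread_join(t2, NULL);
    printf("separate lines:  %.2f s\n", seconds() - t);
    return 0;
}
```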

Integration in Commodity Servers

Numascale systems are deployed by installing a PCI form-factor card into a standard server. This approach makes it possible to benefit from the price break of mass-produced servers sold in volume for applications outside the segment that NumaConnect covers. Servers from IBM and Supermicro are favored today and provide excellent building blocks for large memory systems in combination with the NumaConnect cards.

Designed for Scalability and Robustness

The design is implemented in a chip, NumaChip, produced by IBM Microelectronics. The chip holds all functions of the interconnect except the NumaCache, a cache for accesses from one node to its companion nodes in the system, which uses standard external DRAM modules.

NumaChip implements 12 bits for the physical node address, limiting the number of nodes in a single-image system to 4,096; each node can have multiple processor cores. The AMD processors can address 256 TB of data, and this limits the total memory space of the system.
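
A quick back-of-the-envelope check of those limits (our arithmetic, not a Numascale specification): 12 node-address bits give 2^12 = 4,096 nodes, and 256 TB corresponds to a 48-bit physical address space.

```c
/* Back-of-the-envelope check of the node-count and address-space limits above. */
#include <stdio.h>

int main(void)
{
    unsigned long nodes = 1UL << 12;        /* 12-bit node ID -> 4,096 nodes         */
    unsigned long long bytes = 1ULL << 48;  /* 48-bit physical address -> 2^48 bytes */

    printf("max nodes: %lu\n", nodes);
    printf("max addressable memory: %llu TB\n", bytes >> 40);  /* 2^48 / 2^40 = 256 */
    return 0;
}
```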

Functionality is included to manage the robustness issues associated with high node counts and to meet extremely high requirements for data integrity, providing high availability for systems that manage critical data in transaction processing and real-time control.

A directory-based cache coherence protocol handles scaling when a significant number of nodes share data, avoiding the flood of coherency traffic between nodes that would otherwise overload the interconnect and seriously reduce real data throughput.

The basic ring topology with distributed switching allows a number of different interconnect configurations that are more scalable than most other interconnect switch fabrics. It also eliminates the need for a centralized switch and provides inherent redundancy in multidimensional topologies.

Integrated, distributed switching

The NumaChip contains an on-chip switch to connect to other nodes in a NumaChip-based system, eliminating the need for a centralized switch. The on-chip switch can connect systems in one, two or three dimensions: small systems can use one dimension, medium-sized systems two, and large systems all three, providing efficient and scalable connectivity between processors.

The two- and three-dimensional (torus) topologies have the advantage of built-in redundancy, as opposed to systems based on centralized switches, where the switch represents a single point of failure.
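
To illustrate what distributed switching over a torus means in practice, the sketch below (a generic example with a hypothetical 4 x 4 x 4 node grid, not NumaChip routing code) computes the six neighbors of a node in a 3D torus. Each node forwards traffic directly to its neighbors in each dimension, and the wrap-around links are what provide alternative paths when a link fails.

```c
/* Illustrative 3D torus neighbor calculation for a hypothetical 4 x 4 x 4 grid:
 * each node has two neighbors per dimension, and coordinates wrap around,
 * which is what gives the topology its built-in redundancy. */
#include <stdio.h>

#define DX 4
#define DY 4
#define DZ 4

static int node_id(int x, int y, int z) { return (z * DY + y) * DX + x; }

int main(void)
{
    int x = 0, y = 2, z = 3;   /* an example node on the edge of the grid */

    /* Modulo arithmetic implements the wrap-around (torus) links, so even an
     * edge node has six distinct neighbors and multiple routes to any target. */
    printf("+x: %2d  -x: %2d\n", node_id((x + 1) % DX, y, z), node_id((x + DX - 1) % DX, y, z));
    printf("+y: %2d  -y: %2d\n", node_id(x, (y + 1) % DY, z), node_id(x, (y + DY - 1) % DY, z));
    printf("+z: %2d  -z: %2d\n", node_id(x, y, (z + 1) % DZ), node_id(x, y, (z + DZ - 1) % DZ));
    return 0;
}
```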

The distributed switching reduces the cost of the system since there is no extra switch hardware to pay for. It also reduces the amount of rack space required to hold the system as well as the power consumption and heat dissipation from the switch hardware and the associated power supply energy loss.

Cray Snaps Together Shared Memory Story (September 17, 2013)

Large shared memory systems are something of a novelty, and some, including the well-known SGI Altix UV 1000 "Blacklight" system at PSC, have received a great deal of attention due to their ability to address specific high performance computing workloads.

While Blacklight and similar large coherent shared memory systems rely on hardware-based approaches to creating a unified memory space, Cray veered off at the software fork, deciding to create similarly focused systems at the software level. This morning the company announced two pre-configured setups of its Cray CS300 systems that make room for workloads that need larger memory within a single operating system instance.

By tapping its longtime partner, virtualization-based shared memory software vendor ScaleMP, the supercomputer maker says it is able to broaden its cluster architectures to support larger memory applications—all without the risk of going it alone with a more investment-heavy hardware-based approach to creating shared memory systems. ScaleMP's vSMP Foundation software snaps together commodity x86 servers to create a single virtual system, which provides an alternative to (what are usually more expensive) SMP systems.

More specifically, today Cray rolled out its CS300 SMP product, a shared memory parallel system that sports (upgrades aside) a base configuration of 360 Xeon cores, 4.75 TB of memory and the ability to tap single- or dual-rail FDR InfiniBand.

The other, the Cray CS300 LMS (large memory system), handles these workloads via direct memory access without harnessing high core counts, aimed at high-RAM-demand environments that chug along on simpler dual- and quad-socket systems. Cray says these stepped-down systems can scale from 4.75 TB to 8.75 TB of memory and harness 20 to 32 Xeon cores. Both are standard air-cooled CS300s bundled with ScaleMP's vSMP Foundation software, which is at the core of the HPC system virtualization vendor's business.

Cray's Barry Bolding admitted that while there are certainly some HPC applications that can't be broken up across conventional clusters, it's a small number—perhaps around 10 percent at most. Still, these workloads require large memory architectures, and the hardware-based approach that SGI, for example, takes can add significant expense and is not as simple to maintain (e.g., updates to the system are required with new processor generations).

It is interesting that a company known for its supercomputing hardware would turn from its roots to favor software. But without a sizable known market, Bolding says, the investments required to do what rival SGI has done with its NUMAlink technology are significant—and the ScaleMP approach offers lower cost on all ends and no real risk for Cray as it adds to the range of options for the CS300 line.

While Bolding said that creating their own hardware-based approach to large-memory systems isn't out of the question (and is an idea that has been bandied about for some time), this shouldn't be seen as a definitive first step in that direction. One can be certain Cray will assess the adoption and success of this addition to the CS300 line in any eventual evaluation of the hardware shared memory field, but Bolding says there are advantages to the software-based take on shared memory—most notably, dramatically lower costs and, as mentioned previously, fewer maintenance hassles.

On the cost front, Bolding notes that the addition of ScaleMP's shared memory software, which comes integrated and ready to roll from the factory, does not add significant cost. The systems start at around $200,000 for the large memory configuration and upwards of $300,000 for the SMP version. While Cray is not expecting this addition to shatter sales records, it does offer something to differentiate the CS300 portfolio—and to further test the shared memory waters.

In a conversation this morning with ScaleMP's founder, president and CEO, Shai Fultheim, we talked about the value of the software-based approach to building shared memory systems. As Fultheim told us, their virtualization approach reduces overall system costs (CAPEX) and management complexity (OPEX). Specifically, he says that vSMP Foundation aggregates up to 128 x86 systems to create a single system with up to 32,768 CPUs and up to 256 TB of shared memory.
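
Dividing those published maxima out (our arithmetic, not a ScaleMP datasheet) gives a rough sense of the building blocks: 128 aggregated servers supporting 32,768 CPUs and 256 TB works out to 256 CPUs and 2 TB of RAM per contributing node at the limit.

```c
/* Illustrative split of ScaleMP's published vSMP maxima across the maximum
 * of 128 aggregated x86 servers (arithmetic only, not a ScaleMP datasheet). */
#include <stdio.h>

int main(void)
{
    int nodes = 128;        /* max aggregated servers         */
    int cpus = 32768;       /* max CPUs in the virtual system */
    int mem_tb = 256;       /* max shared memory in TB        */

    printf("CPUs per node at the limit:   %d\n", cpus / nodes);      /* 256  */
    printf("memory per node at the limit: %d TB\n", mem_tb / nodes); /* 2 TB */
    return 0;
}
```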

Fultheim also noted that these approaches go beyond high performance computing environments. Big data, analytics and database-driven companies are looking to the software-based paradigm of aggregating common x86 systems into one virtualized x86 system to reach performance, management and efficiency targets.

ScaleMP has partnered with Cray in the past, beginning in 2009, via a joint solution for HPC customers to operate a shared-memory, deskside supercomputer that could scale up to 128 cores and 1TB of shared memory.

“Cray has always had a special relationship with the most demanding users, redefining the requirements for high-end systems. With this collaboration, Cray’s new large memory and shared memory systems will allow a broader technical computing audience to benefit from the ability to address larger workloads and get faster results,” said Fultheim.

PSC Tests New Shared Memory Waters (July 1, 2013)

The Pittsburgh Supercomputing Center (PSC), home to a number of data-intensive systems and projects, is testing new ropes with which to scale the memory wall.

More specifically, they’re interested in validating and testing technologies that offer possible improvements in I/O and memory. While they’ve experimented already with a number of technologies on both the hardware and software fronts, the center’s Scientific Director, Michael Levine, told us recently that approaches which pool together memory resources are of particular interest.

PSC has a number of research projects that require more directly addressable memory than what they are able to get with their current multi-socket, large memory servers. According to Michael Levine, Scientific Director at PSC, “having large amounts of data in directly-addressable memory avoids very time-consuming disk input/output and allows a much more productive computing paradigm.”

The center already has a stable of systems that attack the problems of "big data" and memory access in different ways. PSC houses the shared memory SGI Blacklight system (4,096 cores and 32 TB of shared memory) as well as Sherlock, the XMT-based graph analytics appliance from Cray (specifically, its YarcData division). There are also a number of smaller specialty clusters, including Salk, an SGI Altix SMP machine, and an innovative data management system called the Data Supercell.

PSC recently announced that it would be putting the Norwegian company, Numascale, to the test. Specifically, PSC is interested in the company’s NumaConnect interconnect technology, which would allow them to build a cache coherent shared memory using the company’s hardware-based approach.

Levine says that the experimentation, which has not yet been defined in great detail, will at the very least allow them to compare how Numascale's approach stacks up against the different routes SGI and Cray in particular have taken to the same goal–both companies PSC has worked hand-in-hand with during implementation of other data-focused systems.

SGI takes a different route to opening memory than NumaConnect, but Levine says that following application-relevant testing they’ll be able to report comparisons (and hopefully shed some light on price/performance).

In essence, Numascale's technology lets users turn a cluster into a shared memory machine by plugging cards into the processors' HyperTransport interface via HTX. Numascale is focused on AMD systems for now–they have yet to sync up with Intel–which limits the choices PSC has. Levine says that one of the attractive elements of their work with SGI is that a great deal of customization is possible.

According to Einar Rustad, CTO and co-founder of Numascale, "The huge and scalable memory capacity in systems with NumaConnect allows users to operate in the familiar programming and runtime environment they are used to." Further, he notes that part of the strength of their approach is that it "eliminates the need for explicit message passing and reduces the overall time to solution."

The key to Numascale’s attractiveness for a deep memory-focused center like PSC is the ability to tap the company’s interconnect to allow programs to access any memory location or memory-mapped IO device across the entirety of a system.

While the NUMA technology is not new by any means, Numascale’s technology is focused on using their interconnect system as a low-latency clustering interconnect–and, according to them, at a much lower price point than others although we’ve been unable to scare up information on the price range.

During our chat with Levine, he said they have also experimented with other technologies aimed at addressing memory problems outside of hardware. For instance, he pointed to a similar experiment with ScaleMP’s approach and although he didn’t offer details about how it stacked up, he did say it opened some important questions about solving memory problems in software versus hardware–although he noted as well that all assessments are application-dependent.

Twitter is a veritable gold mine for those looking to garner large-scale sentiment analysis. As a result, the rush to provide customers with that large-scale analysis is on. SGI, via its UV 2 "Big Brain" supercomputer, is showing off its growing Twitter proficiency.

SGI is using the UV 2 to demonstrate how such a system can decipher the Twitterverse by producing "heat maps" of various large events, including the days leading up to Hurricane Sandy's landfall and the day of the presidential election. SGI compiled these heat maps, with the help of the University of Illinois's Kalev H. Leetaru and Dr. Shaowen Wang of the university's CIGI lab at Urbana-Champaign, by taking one out of every ten tweets and determining, in the case of the hurricane, whether the tweet was positive or negative. In the case of the election, they distinguished between pro-Obama and pro-Romney tweets.

The resulting maps look kind of cool, especially when paired with epic music, as SGI does in its time-lapse YouTube videos. The result is impressive: it highlights the supercomputer's ability to gauge sentiment on an issue that much of a 300-million-person nation is tweeting about. Even when that issue is binary (good/bad, Obama/Romney), the process is not straightforward, requiring a fair amount of computing smarts. With regard to the hurricane, SGI had to factor in a) whether or not the tweet was indeed about Sandy, b) whether the tweet was positive or negative, and c) the location of the person tweeting.

Parts a and c are no small feat, especially considering the system had to process 50 million tweets a day (which actually represents only about 10 percent of the tweets). But the real challenge lies in part b, a task mostly foreign to computers. That requires a much deeper level of semantic analysis – something IBM’s Watson machine has become notably skilled at doing.

Thanks to the UV 2's coherent shared memory architecture, the system is well suited to these types of data-intensive problems. And because it can operate in a highly parallel manner, it has the ability to shuffle this data around at top speed. For example, according to SGI, a UV system can "ingest the entire Library of Congress print collection in less than three seconds."

Presumably, the company’s eventual goal is to translate this success to the business world, where companies can get a sense of, say, how a marketing campaign is going, or how to interpret customer feedback posted on the Web. Beyond that, the company hopes its UV 2 becomes a fixture in the scientific research arena, where problems like genomic sequencing and analysis of climate simulation results provide similar types of big data challenges.

Stephen Hawking Gets a New Supercomputer (July 23, 2012)

Theoretical physicist Stephen Hawking recently launched a new supercomputer at the University of Cambridge. Named COSMOS, the SGI Altix UV 2000 cluster is Europe's largest shared-memory system. An official statement from the university says that COSMOS will "open up new windows on our universe."

The new machine is officially part of the Science and Technology Facilities Council’s (STFC) DiRAC high performance computing facility. The national service assists astronomers, particle physicists, cosmologists and non-academics with their research. Including the new installation, DiRAC now has five HPC clusters in its infrastructure, two of which reside at the University of Cambridge.

This is the ninth supercomputer iteration deployed by the COSMOS project, which has been around since 1997. The new SGI machine currently houses 1,856 Xeon E5 (Sandy Bridge) cores and will eventually be upgraded with 31 Xeon Phi coprocessors based on the MIC architecture. The system also contains 14.5 TB of globally shared memory.

While the new computer has just been deployed, its predecessor (COSMOS VIII) has received an upgrade. That system is a first-generation SGI UV (UV1000) system, holding 2 TB of shared memory and powered by 768 Nehalem EX cores.

Explaining why these systems are important for the study of the universe, Hawking said: “We have made spectacular advances in cosmology and particle physics recently. Cosmology is now a precision science, so we need machines like COSMOS to reach out and touch the real universe, to investigate whether our mathematical models are correct.”

Last year, Intel studios filmed a promotional video about the COSMOS project featuring the iconic Hawking.

The COSMOS IX launch took place during the Numerical Cosmology 2012 workshop at the university’s Centre for Mathematical Sciences. Sponsored by Intel, the invitation-only event aimed to connect scientists and technologists of the various disciplines of numerical cosmology. Leaders in the fields of IT and cosmology typically do not cross paths, and bringing them together was one of the major goals for the workshop.

Though creating a gathering space for these professionals was deemed important, the workshop also works to realize Hawking’s goal of revealing an “ultimate theory”. The discovery would allow researchers to predict how everything in the universe will unfold. However, it would not spell the end of cosmological studies. Said Hawking: “Even if we do find the ultimate theory, we will still need supercomputers to describe how something as big and complex as the Universe evolves, let alone why humans behave the way they do!”

SGI Launches Second Generation UV Supercomputer (June 14, 2012)

The sequel to SGI's UV supercomputer has arrived. Dubbed UV 2, the new platform doubles the number of cores and quadruples the memory that can be supported under a single system. The product, which will be officially announced next week at the International Supercomputing Conference in Hamburg, represents the first major revision of SGI's original UV, which the company debuted in 2009.

The UV's claim to fame is its ability to support "big memory" applications, whose datasets can stretch into the multiple-terabyte realm. Since the architecture supports large amounts of global shared memory, applications don't have to slice their data into chunks to be distributed and processed across multiple server nodes, as would be the case for compute clusters. Thanks to SGI's NUMAlink interconnect, UV is able to glue together hundreds of CPUs and make them behave as a single manycore system with gobs of memory. Essentially, you can treat the machine as an ultra-scale Linux PC.

The new UV 2 takes this to another level. While the original UV could scale up to 2,048 cores and 16 TB of memory on a single system, UV 2 doubles the max core count to 4,096 and quadruples the memory capacity to 64 TB. Even in the era of big data, that encompasses a lot of applications, at least those that don’t rely on Web-sized datasets.

Even with the lesser memory limits of the first generation UV, the supercomputer has worked its way into application niches across the data-intensive spectrum, primarily in technical computing, but a few on the business side as well. UV has had particular success in areas like life sciences and manufacturing, where the HPC cluster/MPI application paradigm never became fully entrenched. A lot of these applications had their origins on PCs or workstations, so the step up to a single system image UV was a natural one once those users exhausted RAM on the desktop.

The platform has also found application uptake in chemistry, physics (especially astrophysics), defense and intelligence, and research areas like social media analytics. Even business analytics applications like fraud detection are fair game. An example of the latter is a world-wide courier service that is employing a UV machine to detect fraudulent activity in real-time.

To crank up the performance and scalability on this second-generation machine, a lot of the UV parts had to be upgraded, starting with a new CPU. On that front, the UV 2 engineers opted for the latest Intel “Sandy Bridge” Xeon E5-4600 family chips, which replace the Nehalem EX and Westmere EX CPUs offered in the first UV. A fully loaded UV 2 rack with 64 CPUs can now deliver 11 peak teraflops, which is nearly twice the flops of the original Nehalem-based machine.

Conveniently, the Sandy Bridge processor provides an extra couple of address bits, which is what makes the 64 TB memory reach possible. (ScaleMP’s virtual SMP technology also enables a 64 TB memory reach, in this case on Sandy Bridge-based clusters, but does so without the performance benefit of a custom interconnect.) The new CPU also incorporates native support for PCIe Gen 3, basically doubling I/O bandwidth to storage and other external devices.
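
The address-bit arithmetic behind those limits (our reading of the numbers quoted here and in the Altix UV article later in this collection, not an SGI statement) is simple: 44 physical address bits reach 2^44 bytes = 16 TB, and two more bits quadruple that to 2^46 bytes = 64 TB.

```c
/* The address-bit arithmetic behind the UV memory limits quoted above. */
#include <stdio.h>

int main(void)
{
    unsigned long long uv1 = 1ULL << 44;  /* 44 physical address bits (original UV) */
    unsigned long long uv2 = 1ULL << 46;  /* two extra bits on Sandy Bridge (UV 2)  */

    printf("UV 1 memory reach: %llu TB\n", uv1 >> 40);  /* 16 TB */
    printf("UV 2 memory reach: %llu TB\n", uv2 >> 40);  /* 64 TB */
    return 0;
}
```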

Speaking of which, UV is able to hook into multiple accelerators, both NVIDIA GPUs and Intel MIC, via a PCIe-based external chassis. Up to 8 GPUs and some unknown number of MIC coprocessors can be linked to a system in this way. At least one customer, the UK’s Computational Cosmology Consortium (COSMOS), is in line to get a MIC-accelerated UV 2.

Aside from the CPU, the other big UV 2 upgrade is NUMAlink 6, the next generation of SGI’s custom system interconnect. NUMAlink makes memory coherency across the UV blades possible; without this special chip, an E5-4600 system would max out at a mere 32 cores and 1.5 TB of memory. Besides adding support for the new E5 CPU, the interconnect also reduces the cabling requirements, while more than doubling the data rate of the previous generation NUMAlink 5, a pretty speedy interconnect in its own right.

“Even a nicely configured InfiniBand cluster really pales in comparison, in terms of system bandwidth that we can deliver,” says Jill Matzke, director of server marketing at SGI.

But according to her, it’s the improved memory capacity that is going to be the real draw here. “While the ability to scale more cores is interesting,” she says, “we think the ability to scale memory is going to be the most important driver for customer uptake and deployment of this technology.”

Product-wise, UV 2 will be offered in two incarnations, the UV 20 and the UV 2000. The former is a 4-way rackmount server that tops out at 32 cores and 1.5 TB — the same upper limit you would find in a standard server based on E5-4600 parts. The UV 2000 is the one that can scale all the way up.

Not that you need to buy thousands of cores and terabytes of RAM right off. UV 2000 customers can start with just 16 cores and 32 GB of memory and slip more blades into the enclosure as budget allows. With lower bin CPUs, that 16-core entry point system is just $30,000 and according to Matzke, the price increases more or less linearly as you fill the rack with additional CPUs and RAM. Once you get beyond a single rack, the cost of extra cabling and rack-top routers gets factored in.

But even just four racks can get you all the way to 64 terabytes, so there's not a lot of hardware infrastructure involved. Remember, this is not a machine built to max out flops. As with the original UV, the idea here is to offer lots of shared memory in an affordable package — at least relative to "big iron" mainframes. And while the UV may be more expensive than a flash-based system with a comparable memory footprint, SGI is claiming much better price-performance when data bandwidth and latency are taken into account.

If 64 TB of memory doesn’t quite do it for you, SGI lets you lash together multiple systems if you’re looking for a cluster of fat nodes. The maximum configuration in this case is 16K sockets and 8 petabytes of memory.

The UV 20 and UV 2000 are available for shipping now. And if you happen to be in Hamburg, Germany, next week, the technology will be on display in SGI's booth at the International Supercomputing Conference.

Nautilus Harnessed for Humanities Research, Future Prediction (September 9, 2011)

The observer influences the events he observes by the mere act of observing them or by being there to observe them.

–Isaac Asimov, Foundation’s Edge

Elements of science fiction have helped us venture guesses about what the future might look like—at least in terms of the technologies some suspect might be pervasive one day. Flying cars, robot housekeepers, and of course, supercomputers that can predict the future and answer humanity’s most pressing questions, are all staples.

This week news emerged that might bring the all-knowing “supercomputer as fortuneteller” trope into reality—or if nothing quite as dramatic, help us better understand the connections between the news and its tone in geographical context.

A recent project called “Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space” set about to find a way to use tone and geographical analyses methods to yield new insights about global society. If the lead researcher behind the project is correct, this could not only provide opportunities for societal research at global scale—but could also act as a warning bell before crises occur.

Kalev H. Leetaru, Assistant Director for Text and Digital Media Analytics at the Institute for Computing in the Humanities, Arts and Social Science at the University of Illinois and Center Affiliate at NCSA spearheaded the Culturomics 2.0 project. He claims that his analytics experiment has already allowed him to successfully forecast recent revolutions in Tunisia, Egypt, and Libya. Leetaru also says that he has been able to foresee stability in Saudi Arabia (at least through May 2011), and retroactively estimate Osama Bin Laden’s likely hiding place within a 200-kilometer radius.

Whereas initial Culturomics (1.0) studies focused on the frequency of a particular set of words from digitized books, he says that mere frequency isn’t enough to gain real-time, imminently useful information that reflects the modern world.

Shedding the word frequency element that defined version 1.0 of Culturomics, Leetaru set out to take deep analytics to a new level, moving past frequency altogether and sharpening the focus on tone, geography and the associations those two factors produce.

The project received funding from the National Science Foundation and was managed in part by the University of Tennessee's Remote Data Analysis and Visualization Center (RDAV) and the National Institute for Computational Sciences (NICS). Leetaru was granted time on the large shared memory supercomputer Nautilus as part of the Extreme Science and Engineering Discovery Environment (XSEDE) program.

Leetaru says using a large shared memory system like Nautilus was the key to achieving his research goals. The 1,024-core Intel Nehalem system delivers 8.2 teraflops and has 4 terabytes of memory available for big data workloads; it was manufactured by SGI as part of their UltraViolet product line. A system like this allows researchers more flexibility as they seek to take advantage of vast computing power to analyze "big data" in innovative ways.

Leetaru’s goals with this project represent a perfect example of a data-intensive problem in research. To arrive at his results, Leetaru needed to gather 100 million news articles stretching back half a century. From this point, the process required a staged approach, which began with a data mining algorithm that extracted important terms—people, places and events—to create a base network of 10 billion “nodes” in the network of news history.

With a mere 10 billion elements left following extraction, Leetaru next set about seeking out relationships that connected these nodes to begin building a second network. He said that when this was complete, he was left with a total of 100 trillion relationships, yielding a network that was about 2.4 petabytes in size.

Few machines have that kind of disk space, let alone memory, so to process the data he needed to break the project up into pieces. He would look carefully at key pieces and generate that part of the network on the fly, using the shared memory system to begin the process of refining—a task he said wouldn't be possible without Nautilus or another large shared memory system.

With the connections established, Leetaru then ran tools to seek out patterns and find interesting differences in tone in different countries or regions. Using 1,500 dimensions of analysis that fall under the banner of "tone mining," which assigns a positive or negative "score" based on dictionaries of words drawn from existing sources, Leetaru was able to build a profile of more profound connections.
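
As a rough illustration of dictionary-based tone scoring (a toy sketch with made-up word lists, not Leetaru's actual pipeline or lexicons), the snippet below counts hits against small positive and negative word lists and reports a net score for a piece of text; real tone mining works at far larger scale and across many more dimensions.

```c
/* Toy dictionary-based tone scoring: count positive and negative word hits.
 * The word lists here are invented for illustration; real tone mining uses
 * large existing lexicons and many dimensions of analysis. */
#include <stdio.h>
#include <string.h>

static const char *positive[] = { "calm", "safe", "recovery", "hope" };
static const char *negative[] = { "crisis", "panic", "damage", "fear" };

static int hits(const char *word, const char **list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(word, list[i]) == 0) return 1;
    return 0;
}

int main(void)
{
    char text[] = "fear and panic as the damage from the storm spreads";
    int score = 0;

    for (char *w = strtok(text, " "); w != NULL; w = strtok(NULL, " ")) {
        score += hits(w, positive, sizeof positive / sizeof positive[0]);
        score -= hits(w, negative, sizeof negative / sizeof negative[0]);
    }
    printf("net tone score: %d\n", score);   /* negative values read as negative tone */
    return 0;
}
```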

These variances in the tone of global news were matched with geographic mining efforts, which place the nodes and tones via an algorithm that seeks to determine which locations the news sources are talking about. Leetaru explains that this is not a simple algorithm, since there are many cities called "Cairo" in the world: the algorithm must mine for contextual references to nearby places or elements to assign the correct coordinates.

The final element is the network analysis, or modularity-finding, step. Leetaru takes his network and looks for nodes that are more tightly connected to each other than to the rest of the network to find out how nations are related—an analysis that yields a well-defined set of seven civilizations on Earth. Getting this kind of network requires taking every city and every article that has ever referenced it; each city then becomes a node with its own complex network of tones, meanings and potential for new findings.

With all of these stages in place, Leetaru says the possibilities are endless. One can watch change over time and create reproducible models—or even go back to look at past events to see how closely one can predict the end result. In the full paper, Leetaru hits on some of his successes showing how major crises have played out in a particular set of ways—offering a chance for researchers to predict the future.

Leetaru pointed to the benefits of using the shared memory system Nautilus with the example that has generated a lot of buzz this week—that his methods led to a retroactive map that pinpointed Bin Laden’s location within 200 km.

“One of the beauties of using a large shared memory machine is that for example I could see an interesting pattern (like the Bin Laden portion where I was assuming there was enough information to pinpoint where he was hiding) and then begin exploring different techniques, including writing quick little Perl scripts that would wrap a small network on the fly actually and process that material and basically make a quick chart or table or map.”

He went on to note:

“With a large shared memory machine, you don’t have to worry about memory—I never had to worry about writing MPI code to distribute memory across nodes; it’s like it was infinite–with a quick script I could grab all locations that mentioned “Bin Laden” since he first started to appear in the news around 10 years ago, and map it over time or in different ways. It boiled down to writing easy Perl scripts, running in a matter of minutes—if I didn’t have all that memory it would have taken weeks or months with each iteration so one benefit is that leveraging that much hardware allows you to do simple things.”

Leetaru says that even as an undergraduate at NCSA, working with some of the first iterations of web-scale web mining, he has been fascinated with the possibilities of deep analytics. While his goal with the Culturomics 2.0 project was to forecast large-scale human behavior using global news media tone in time and space, along the way he stumbled upon a few other unexpected findings, including the fact that the news is indeed becoming "more negative" in terms of general tone, and that the United States tends to favor itself in its own news filings.

In this era of deep analytics that harness real-time news and sentiment, the Foundation series from Isaac Asimov is never far from the mind. For those who haven’t read the books, in a very small nutshell, mathematical formulas allow civilization to predict the future course of history…and madness ensues.

All arguments about potential for chaos or leaps forward for civilization aside, advances in analytics and high-performance computing like those produced on the Nautilus supercomputer have brought this series of classic science fiction tales into the realm of possibility.

Intel Scales Up Cores and Memory with New Westmere EX CPUs (April 6, 2011)

This week Intel launched its new Westmere EX lineup, the latest Xeons aimed at large-memory, multi-socketed servers. The new chips come in 6-, 8- and 10-core flavors and will be sold under the name Xeon E7. According to Intel, these latest CPUs deliver 40 percent greater performance than the previous generation Nehalem EX (Xeon 7500 and 6500) processors while maintaining the same power draw.

Compared to the 45nm-based Nehalem EX line, the E7 silicon is on 32nm process technology, which allowed Intel to add a couple more cores and an additional 6 MB of L3 cache to the top-end chip. Despite that, Intel only grew the transistor count modestly, from 2.3 billion to 2.6 billion. The thrust was to make the cores smarter and more efficient at their job, not to rely on the brute force of Moore's Law.

The E7s are 42 percent quicker than their Nehalem ancestors, at least at integer throughput (using the SPECint_rate_base2006 benchmark). One might wonder how Intel accomplished this since they only increased the core count and L3 cache by 25 percent apiece. Apparently 11 percent of the performance increase is the result of optimizations in the latest Intel Compiler XE2011. The rest of the performance bump can probably be attributed to the faster clock for the E7. (Intel pitted a 2.4 GHz E7-4870 against a 2.26 GHz Nehalem X7560 in their benchmark tests.) Floating point throughput (SPECfp_rate_base2006) increased at a more modest 32 percent.

Using the OpenMP benchmark (SPEC OMP2001) for shared memory throughput, the E7-4870 only delivered an 18 percent boost compared to the Nehalem X7560. On some real-life memory-intensive HPC workloads, however, performance was on par with the integer and FP results. For example, Intel reported that throughput improved 21 to 37 percent when exercising the E7s on a number of EDA analysis tools. It remains to be seen how other big memory HPC codes fare on the new hardware.

Besides the core count bump, the other notable E7 feature is its support for larger memory capacity. For a four-socket server, the E7 will scale up to 2 TB of RAM and 102 GB/second of bandwidth, which is twice as good as Nehalem EX. Intel accomplished this by adding support for 32 GB DIMMs; the E7 still relies on the same 16 DIMM slots per socket, so four sockets of 16 slots filled with 32 GB DIMMs works out to 2 TB. These 32 GB DIMMs tend to be rather expensive, though, and so far the server OEMs are only offering E7 systems with 16 GB DIMMs. But 1 TB in a four-socket box is quite useful in its own right, and will be able to handle some rather large in-memory databases.

Perhaps more importantly, the E7 chips can be paired with low voltage memory modules (LV DIMMs) to help curb energy consumption, especially on terascale-sized DRAM configurations. Intel has also added integrated memory buffers to further reduce power draw.

Unlike the Nehalem EX line, the E7 family is divided into three different processor series according to chip socket support. The E7-2800 series is geared for two-socket systems, while the E7-4800 series is designed for machines with four CPUs. The quad-socket setup is probably the sweet spot for the E7 family, given that four CPUs in one server is apt to be less expensive than two dual-socket boxes; plus you get twice the memory headroom. The E7-8800 series is for eight-socket machines. These CPUs are priced at a premium, but if you're looking for an x86 SMP machine with up to 80 cores (160 threads) and multiple terabytes of memory, this is the CPU for you.

At launch, 19 server makers announced E7-based platforms, including the usual suspects like IBM, HP, Dell, Cisco, and Oracle. The principal destination for these chips will be "mission-critical" enterprise servers, the segment Intel first pursued in a major way with its Nehalem EX line. To chase that application space, Intel has incorporated a number of new security and RAS features which, according to the company, put its latest x86 offering on par with RISC CPUs and even its own Itanium chip. Mission-critical enterprise computing is estimated to be worth about $18 billion per year — about twice the size of the HPC server market.

But a number of vendors — SGI, Cray, Supermicro, and AMAX, thus far — are also using the E7s to build scaled-up HPC machinery. SGI for example, has latched onto the E7s to refresh their Altix UV shared memory products. The low-end Altix UV 10 and mid-range Altix UV 100 both benefitted from the extra cores and memory capacity.

For example, the UV 100 now scales to 960 cores and 12 TB of shared memory in just two racks. The top-of-the-line Altix UV 1000 can also use the new E7 CPUs, but for architectural reasons and OS limitations still tops out at 2,048 cores and 16 TB of memory. However, you can still take advantage of the more performant 8-core and 10-core E7s, so a UV 1000 can squeeze out more FLOPS per watt than before, and can scale past 20 teraflops of peak performance.

Finally, both Supermicro and AMAX have come up with four-socket and eight-socket E7-based servers (these might actually be the same hardware). The top-end offerings deliver up to 80 cores and 2 TB of memory in a 5U form factor, while the four-socket servers provide half that scalability, but in a 1U, 2U, or 4U package. The 8-way offerings can be outfitted with up to four NVIDIA GPUs if you want to pair the E7 parts with some extra vector acceleration. Although these Supermicro and AMAX systems are geared for HPC, at least the non-GPU versions are also being positioned for big memory enterprise workloads.

These high-end CPUs are priced accordingly. The top-end 130-watt E7-8870 runs over $4,600 in quantities of a thousand. More mid-range E7s will run half that, and even the 10-core chip for dual-socket systems runs over $2,500. Intel apparently believes they are worth the premium, and given that these chips are being paired with lots of expensive DRAM and software, the CPU itself is probably one of the best-valued components in these high-end shared memory servers.

Regardless, the E7 parts will be less expensive than RISC processors, the Itanium, or any proprietary CPU. At the other end of the price spectrum, Intel will have to contend with AMD, which is planning to launch its Bulldozer-class “Interlagos” CPU in Q3. Those chips come in 12-core and 16-core versions and can populate four-socket servers. So for users with SMP workloads that are chewing on terabytes of data, the x86 architecture is looking a bit more tempting.

With Windows Support, SGI Casts Altix UV in New Light (April 3, 2011)

SGI has been getting a lot of mileage out of its SGI UV shared memory platform, having delivered close to 500 systems since it started shipping them in June 2010. Now, with the recent addition of support for Microsoft's Windows Server, the company is looking to expand its customer base in a big way.

Altix UV, SGI’s latest generation shared-memory supercomputer, was introduced at the Supercomputing Conference in November 2009. It uses SGI’s fifth generation NUMAlink interconnect technology and Intel “Nehalem” Xeon processors to construct HPC-class SMP server nodes. The interconnect, along with the special UV hub chip, glue all the processors and memory together so that they can be operated as a monolithic resource. A fully tricked-out Altix UV 1000 will have 2,048 cores (4,096 threads via HyperThreading) and 16 TB of globally shared memory. A maximally configured machine represents 18.5 teraflops of peak performance.

Being able to command all that power within a single system image has a number of advantages, the main one being you can run standard (non-MPI) applications on a machine that for all intents and purposes behaves as an enormous PC with gobs of cores and memory at its disposal. And, by definition, such a system doesn’t require the complex set-up, software licensing, and maintenance of a distributed cluster platform — not an easy task as you approach the 1000-core realm.

Up until a few weeks ago, Altix UV came only with Linux, either Novell’s SUSE or Red Hat’s enterprise version. In early March, support was added for Microsoft Windows Server 2008 R2. The first iteration supported up to 128 cores and 1 TB of memory. On March 25, the company announced Windows Server was certified to the OS’s maximum reach: 256 cores and 2 TB of memory.

IBM and HP also have large shared memory x86-based servers with Windows Server support. But IBM's X3950 and HP's ProLiant DL980 G7 top out at 96 and 64 cores, respectively — well below the Windows Server limits. "Our engineering work finally brings Windows into true scalability," says SGI CEO Mark Barrenechea.

On the other hand, Itanium-based platforms running Windows can scale to 128 cores. But with the new UV-Windows set-up, those customers (principally HP Integrity users) can now migrate their codes to SGI UV gear and achieve even greater scalability, at least on the core-count side. Itanium still prevails in memory reach, being able to address up to 128 TB.

Barrenechea says they're targeting two major application areas with this system, the first being SGI's traditional technical computing market. The top five application suites they expect will take advantage of the Windows-UV combo are ANSYS FLUENT, MATLAB, Mathematica, LS-DYNA, and Accelrys. These run the gamut from CFD and FEA to computational chemistry and computational biology.

The idea here is to allow scientists to take their PC-based codes and easily slide them into these big memory UV machines with little if any porting work. In some cases, they won’t even need to perform a recompilation. A PC binary should be able to run unaltered on the Xeon-based machine (although maybe not optimally), and if the code was written correctly, will automagically take advantage of the larger memory. Of course, to utilize additional UV cores, the developer will have to parallelize the code via OpenMP threading or the equivalent.

But many of these applications are constrained only by available memory, requiring just one to four threads to do their job. Since a typical PC isn't going to have more than a few gigabytes of RAM, the data sizes are going to be rather limited when it comes to a traditional HPC simulation code. Even a relatively modest-sized four-dimensional array of 1000 x 1000 x 1000 x 1000 byte-sized elements (say, a 3D object moving through time) will occupy an entire terabyte.
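
That arithmetic is easy to reproduce (a sketch of the author's example, assuming one byte per element):

```c
/* Memory footprint of a 1000 x 1000 x 1000 x 1000 array of byte-sized elements. */
#include <stdio.h>

int main(void)
{
    unsigned long long n = 1000ULL;
    unsigned long long bytes = n * n * n * n;   /* one byte per element */

    printf("elements:  %llu\n", bytes);
    printf("footprint: %.2f TB (decimal) / %.2f TiB (binary)\n",
           bytes / 1e12, bytes / (double)(1ULL << 40));
    return 0;
}
```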

At the recent HPCC conference in Newport, Rhode Island, SGI CTO Dr. Eng Lim Goh demonstrated a simulation of the human heart developed at the University of Montreal. On a laptop, because of the limited memory, it could only be run with 60 million grid points. That delivered a rather poor resolution of the heart in action. Moving it to an Altix UV machine with 1.2 TB of memory, the model was expanded to 2 billion grid points, providing a much more realistic model.

At that scale, the simulation still took two weeks to compute a single heartbeat. Goh suggested that parallelizing the code to take advantage of the additional UV cores (768 in this case) might be able to speed up the model to something close to real time.

But big memory is not just for technical workloads. The second major application area for a Windows-capable Altix UV is on the enterprise side, in the realm of data-intensive applications. In particular, we’re talking about data warehousing, data mining, business intelligence and related types of tools. The driver behind these applications is Microsoft’s SQL Server, whose support was added in conjunction with the Windows Server OS.

This area represents a new market for SGI, although some of these customers have HPC leanings as well. In general, though, any informatics-type application that encapsulates terascale-sized structured databases is fair game for an Altix UV. The fact that many of these codes are developed in and for a Microsoft environment means there is now an easier path to greater scalability.

Barrenechea considers SGI's entry into Microsoft's software ecosystem a significant step for the company. "Sure, we've supported Windows and certified it," he says, "but it's a new focus for the company."

Of course, Linux will be the operating system of choice for most HPC users. And, in fact, Altix UV scalability is still better on that OS. Red Hat Enterprise Linux 6 reaches to 8 TB of memory, while SUSE Linux Enterprise Server 11 hits the full 16 TB. Conveniently, Linux also supports all 2,048 cores of a top-end UV, although it’s hard to imagine an SMP-based code scaled to that level.

It should be noted that the memory limit on the Altix UV is actually constrained by the current generation of Xeon chips, whose 44-bit addressing scheme maxes out at 16 TB. If your data outgrows that capacity, Intel's next-generation "Sandy Bridge" Xeons will add a couple more bits to quadruple the memory reach to 64 TB. According to SGI's Goh, the company plans to support the new chips in an upcoming version of the Altix UV, and it already has one order for such a system.

Core counts on the next-generation Altix UV may rise as well, although the most acute demand will remain on the memory capacity side. In any case, one or more of the supported OS’s will likely be tweaked to support any new limits SGI comes up with in future UV hardware.

Cray Pushes XMT Supercomputer Into the Limelight (January 26, 2011)

When announced in 2006, the Cray XMT supercomputer attracted little attention. The machine was originally targeted for high-end data mining and analysis for a particular set of government clients in the intelligence community. While the feds have given the XMT support over the past five years, Cray is now looking to move these machines into the commercial sphere. And with the next generation XMT-2 on the horizon, the company is gearing up to accelerate that strategy in 2011.

From the company-wide standpoint, XMT is to big data-intensive applications what the Cray XT and XE product lines are to big science. The machine is made to deal with really huge datasets — we’re talking terabytes — whether they be technical or non-technical in nature. But the XMT is actually designed for a specific flavor of data-intensive application: those that must deal with irregularly structured data at scale — what are sometimes referred to as graph analytics problems.

These can be broken down further into two general categories. The first is the finding-the-needle-in-a-haystack problem, which involves locating a particular piece of information inside a huge dataset. The other is the connecting-the-dots problem, where you want to establish complex relationships in a cloud of seemingly unrelated data.

The most natural computational model for these types of applications is one in which thousands of computational threads inhabit a large global memory space. To further maximize performance, fine-grained thread synchronization is required. Broadly speaking, this model is not supported by more mundane cluster computing platforms as you might find with a traditional Oracle or Netezza database appliance. Unless the application can be partitioned naturally across cluster nodes and data access patterns are fairly regular, performance will suffer.
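
As a rough analogy for that model (illustrative C11 atomics on an ordinary multicore machine, not the XMT's hardware full/empty bits), the sketch below has several threads updating per-vertex counters in one shared array, synchronizing at the granularity of a single word rather than taking a global lock.

```c
/* Illustrative fine-grained synchronization: many threads update per-vertex
 * counters in a shared array using word-level atomics instead of one big lock.
 * (The Cray XMT does this in hardware with full/empty bits; this is only an analogy.) */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define VERTICES 1024
#define THREADS  8
#define UPDATES  100000

static atomic_long degree[VERTICES];      /* shared, globally visible state */

static void *worker(void *arg)
{
    unsigned seed = (unsigned)(size_t)arg + 1;
    for (int i = 0; i < UPDATES; i++) {
        seed = seed * 1103515245u + 12345u;                       /* toy RNG: irregular access */
        atomic_fetch_add(&degree[(seed >> 16) % VERTICES], 1);    /* word-level synchronization */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (size_t i = 0; i < THREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (size_t i = 0; i < THREADS; i++) pthread_join(t[i], NULL);

    long total = 0;
    for (int v = 0; v < VERTICES; v++) total += degree[v];
    printf("total updates: %ld (expected %d)\n", total, THREADS * UPDATES);
    return 0;
}
```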

The encouraging news for XMT proponents is that over the last several years large-scale analytics applications using unstructured data have become much more mainstream. Areas such as intelligence/surveillance, protein folding, genomics, credit fraud detection, semantic searching, social networks analysis, computational geometry, scene recognition, and energy distribution all rely on large collections of unstructured data. As such the XMT is suitable for many high-end analytics applications in business intelligence, scientific research and Web search.

It’s no coincidence that companies like Google, Facebook, and Amazon that use data mining are attracting the same scrutiny from civil libertarians that used to be reserved for the three-letter government agencies. They are now both running essentially the same applications. Businesses and governments alike want to sift through enormous databases in order to extract real-time intelligence, and that is nowhere more apparent than in the rise of the semantic Web.

In fact, social network analysis is one of the big application areas Cray is targeting for its XMT product — that according to Shoaib Mufti, Cray’s director of Knowledge Management. Mufti says search engines are moving toward more complex analysis, especially in the area of natural language processing. The goal here is to interpret the search input more precisely in order to deliver more accurate results. All of this processing has to be done interactively, which puts an enormous strain on conventional hardware.

For example, instead of delivering 1,000 pages of search results to sift through, a semantic search engine will only deliver a handful of the most relevant sites, or perhaps even just one. This is not mainstream technology today, but with the spread of mobile platforms (whose natural interface just happens to be spoken input), there will be an enormous demand for semantic searching. “We see a huge potential for XMT in providing value there,” says Mufti.

There’s also a big demand for graph type problems in the financial industry, such as the aforementioned area of fraud detection. In this case banks need to search through thousands or even millions of credit transactions looking for evidence of bogus activity. The volume of transactions and the need for real-time response is pushing this application beyond the bounds of conventional computing systems.

Conventional the XMT is not. The supercomputer has some stand-out features not found in other highly-parallel platforms. The most obvious is that it marries an extreme multithreading CPU, Cray’s custom Threadstorm processor, with a high-capacity shared memory architecture. Many shared memory systems, such as SGI’s Altix UV, are based on conventional x86 technology. Although a UV machine can offer up to 64 threads per node (with four 8-core CPUs), one Threadstorm chip supports 128 threads. Better yet, each Threadstorm draws just 30 watts, or about a third that of a high-end x86 CPU. In addition, the XMT supports fine-grained synchronization in the hardware, in order to hide latencies across the threads.

The underlying architecture is based on Cray's mainstream XT platform, right down to the SeaStar2 interconnect and the AMD socket that Cray uses for the Threadstorm processors. In this way the company was able to reuse existing componentry, while at the same time providing a highly scalable platform for the Threadstorm technology. Today the system tops out at 8,024 processors, which can aggregate more than a million threads, and 64 terabytes of shared memory, the highest capacity of any such machine, says Mufti.

According to him, an XMT supercomputer can deliver 10 to 100 times better performance than conventional architectures on problems that exhibit irregular data access patterns. Making comparisons is somewhat problematic, though. There is as yet no widely accepted benchmark for graph problems. The new Graph 500 organization wants to fill that void, but that benchmark is still evolving. For the first Graph 500 results announced at SC10 last November, a 128-node XMT machine came in third place, beat out only by two much larger systems: an IBM Blue Gene/P (using 8,192 nodes) and a Cray XT4 (using 544 nodes).

Despite its computational muscle and its five-year history, the XMT business is still very much a work in progress. Mufti’s Knowledge Management team, which oversees the XMT product, is run out of Cray’s Custom Engineering division, a group that is focused on developing new business opportunities. Cray doesn’t break out how much revenue is generated from XMT sales, and you’d be hard-pressed to find a dollar figure associated with any current deployment at government agencies or research labs. Nevertheless, the company must be gleaning enough sales to warrant on-going development.

Sometime later this year, Cray intends to launch XMT-2, the first system upgrade in five years. As it targets the broader market, Cray is also looking to make the machine easier to use. A lot of this will come via partnerships with software firms like Cambridge Semantics and Clark & Parsia, LLC, who are developing semantics tools and middleware for large-scale analytics.

For the XMT-2 system itself, Cray is focusing on scalability and TCO. Although not ready to release details, Mufti says the next generation has scaled "significantly." This was done to accommodate ever-growing problem sizes, especially in regard to database memory requirements. While this is yet to be confirmed, it's logical to assume the new system will move up to the latest Gemini interconnect used in the XT and XE lines in order to take advantage of the increased performance. The next-generation Threadstorm processors will also likely benefit from smaller transistor geometries, allowing for better performance per watt, more threads, or a little of both. Overall, says Mufti, XMT-2 will be denser as well as more energy efficient, and the underlying technology "will be taken to the next level."