Although this approach has recently fallen out of favor, it is virtually impossible for other parallel processing methods to achieve the low cost and high performance possible by using a Linux system to host an attached parallel computing system. The problem is that very little software support is available; you are pretty much on your own.

In general, attached parallel processors tend to be specialized to perform specific types of functions.

Before becoming discouraged by the fact that you are somewhat on your own, it is useful to understand that, although it may be difficult to get a Linux PC to appropriately host a particular system, a Linux PC is one of the few platforms well suited to this type of use.

PCs make a good host for two primary reasons. The first is the cheap and easy expansion capability; resources such as more memory, disks, networks, etc., are trivially added to a PC. The second is the ease of interfacing. Not only are ISA and PCI bus prototyping cards widely available, but the parallel port offers reasonable performance in a completely non-invasive interface. The IA32 separate I/O space also facilitates interfacing by providing hardware I/O address protection at the level of individual I/O port addresses.

Linux also makes a good host OS. The free availability of full source code, and extensive "hacking" guides, obviously are a tremendous help. However, Linux also provides good near-real-time scheduling, and there is even a true real-time version of Linux at http://luz.cs.nmt.edu/~rtlinux/. Perhaps even more important is the fact that while providing a full UNIX environment, Linux can support development tools that were written to run under Microsoft DOS and/or Windows. MSDOS programs can execute within a Linux process using dosemu to provide a protected virtual machine that can literally run MSDOS. Linux support for Windows 3.xx programs is even more direct: free software such as wine, http://www.linpro.no/wine/, simulates Windows 3.11 well enough for most programs to execute correctly and efficiently within a UNIX/X environment.

The following two sections give examples of attached parallel systems that I'd like to see supported under Linux....

There is a thriving market for high-performance DSP (Digital Signal Processing) processors. Although these chips were generally designed to be embedded in application-specific systems, they also make great attached parallel computers. Why?

Many of them, such as the Texas Instruments ( http://www.ti.com/) TMS320 and the Analog Devices ( http://www.analog.com/) SHARC DSP families, are designed to construct parallel machines with little or no "glue" logic.

They are cheap, especially per MIP or MFLOP. Including the cost of basic support logic, it is not unheard of for a DSP processor to be one tenth the cost of a PC processor with comparable performance.

They do not use much power nor generate much heat. This means that it is possible to have a bunch of these chips powered by a conventional PC's power supply - and enclosing them in your PC's case will not turn it into an oven.

There are strange-looking things in most DSP instruction sets that high-level (e.g., C) compilers are unlikely to use well - for example, "Bit Reverse Addressing." Using an attached parallel system, it is possible to straightforwardly compile and run most code on the host, while running the most time-consuming few algorithms on the DSPs as carefully hand-tuned code.

These DSP processors are not really designed to run a UNIX-like OS, and generally are not very good as stand-alone general-purpose computer processors. For example, many do not have memory management hardware. In other words, they work best when hosted by a more general-purpose machine... such as a Linux PC.

Although some audio cards and modems include DSP processors that Linux drivers can access, the big payoff comes from using an attached parallel system that has four or more DSP processors.

Because the Texas Instruments TMS320 series, http://www.ti.com/sc/docs/dsps/dsphome.htm, has been very popular for a long time, and it is trivial to construct a TMS320-based parallel processor, there are quite a few such systems available. There are both integer-only and floating-point capable versions of the TMS320; older designs used a somewhat unusual single-precision floating-point format, but the new models support IEEE formats. The older TMS320C4x (aka, 'C4x) achieves up to 80 MFLOPS using the TI-specific single-precision floating-point format; in contrast, a single 'C67x will provide up to 1 GFLOPS single-precision or 420 MFLOPS double-precision for IEEE floating point calculations, using a VLIW-based chip architecture called VelociTI. Not only is it easy to configure a group of these chips as a multiprocessor, but in a single chip, the 'C8x multiprocessor will provide a 100 MFLOPS IEEE floating-point RISC master processor along with either two or four integer slave DSPs.

The other DSP processor family that has been used in more than a few attached parallel systems lately is the SHARC (aka, ADSP-2106x) from Analog Devices http://www.analog.com/. These chips can be configured as a 6-processor shared memory multiprocessor without external glue logic, and larger systems also can be configured using six 4-bit links/chip. Most of the larger systems seem targeted to military applications, and are a bit pricey. However, Integrated Computing Engines, Inc., http://www.iced.com/, makes an interesting little two-board PCI card set called GreenICE. This unit contains an array of 16 SHARC processors, and is capable of delivering a peak speed of about 1.9 GFLOPS using a single-precision IEEE format. GreenICE costs less than $5,000.

If parallel processing is all about getting the highest speedup, then why not build custom hardware? Well, we all know the answers; it costs too much, takes too long to develop, becomes useless when we change the algorithm even slightly, etc. However, recent advances in electrically reprogrammable FPGAs (Field Programmable Gate Arrays) have nullified most of those objections. Now, the gate density is high enough so that an entire simple processor can be built within a single FPGA, and the time to reconfigure (reprogram) an FPGA has also been dropping to a level where it is reasonable to reconfigure even when moving from one phase of an algorithm to the next.

This stuff is not for the weak of heart: you'll have to work with hardware description languages like VHDL for the FPGA configuration, as well as writing low-level code to interface to programs on the Linux host system. However, the cost of FPGAs is low, and especially for algorithms operating on low-precision integer data (actually, a small superset of the stuff SWAR is good at), FPGAs can perform complex operations just about as fast as you can feed them data. For example, simple FPGA-based systems have yielded better-than-supercomputer times for searching gene databases.

There are other companies making appropriate FPGA-based hardware, but the following two companies represent a good sample.

Virtual Computer Company offers a variety of products using dynamically reconfigurable SRAM-based Xilinx FPGAs. Their 8/16 bit "Virtual ISA Proto Board" http://www.vcc.com/products/isa.html is less than $2,000.

Many of the design tools, hardware description languages, compilers, routers, mappers, etc., come as object code only that runs under Windows and/or DOS. You could simply keep a disk partition with DOS/Windows on your host PC and reboot whenever you need to use them, however, many of these software packages may work under Linux using dosemu or Windows emulators like wine.