Patent US4101960 - Scientific processor

A single instruction multiple data (SIMD) processor particularly suited for scientific applications includes a high level language programmable front end processor, a parallel task processor having an array memory, a large very high speed secondary storage system having high speed I/O channels to the front end processor and the array memory, and a control unit directing the parallel task processor via a template control mechanism. In operation an entire task is transferred from the front end processor to the secondary storage system whereupon the task is executed on the parallel task processor under the control of the control unit thereby freeing the front end processor to perform general purpose I/O, and other tasks. Upon parallel task completion, the complete results thereof are transferred back to the front end processor from the secondary storage system. The array memory is associated with an alignment network for non-conflictingly storing and accessing linear vectors.

Claims

What is claimed is:

1. A single instruction multiple data processor comprising:

a large scale high level language programmable general purpose front end processor for user interfacing, archival storage and scalar task processing;

a parallel array processor having a parallel memory module array, a parallel array of arithmetic elements and an alignment network for aligning particular memory modules in said array thereof with particular arithmetic elements in said array thereof for parallel processing of linear vectors;

a large high speed secondary storage system having a high speed data channel connected to said front end processor and a high speed data channel connected to said parallel memory array; and

a control unit interconnected to said front end processor, said high speed secondary storage system and said parallel array processor for controlling said parallel array processor, said control unit comprising

a task memory for storing object program code for use in parallel processing;

a scalar processing unit for fetching object program code from said task memory and for issuing instructions in response thereto;

an array control unit for controlling said parallel task processor in response to instructions issued by said scalar processor; and

a control maintenance unit for providing communications between said front end processor and said control unit, for providing initialization and maintenance control for said control unit, and for gathering error and use data from said control unit, said secondary storage system, and said parallel array processor and for communicating gathered error and use data to said front end processor.

2. The single instruction multiple data processor according to claim 1 wherein

said parallel array of arithmetic elements comprises an array of sixteen identical arithmetic elements functioning in locked step operation; and

said alignment network comprises input alignment means for providing data communication paths from said memory module array to said parallel array of arithmetic elements; and

output alignment means for providing data communications paths from said parallel array of arithmetic elements to said memory module array.

3. The single instruction multiple data processor according to claim 1 wherein said large high speed secondary storage system includes:

a file storage unit for providing high performance dedicated secondary storage; and

a file memory controller for providing buffering and interfacing between said file storage unit and said front end processor, said parallel memory module array and said control unit.

4. The single instruction multiple data processor according to claim 1 wherein

said task memory includes a random access storage system.

5. The single instruction multiple data processor according to claim 4 where said scalar processing unit includes

an arithmetic unit for performing general scalar arithmetic functions.

6. The single instruction multiple data processor according to claim 1 wherein said array control unit includes:

means for addressing said parallel memory array;

means for directing said alignment network in its function of aligning particular memory modules in said array thereof with particular arithmetic elements in said array thereof; and

means for initiating particular arithmetic operations in said parallel array of arithmetic elements.

7. The single instruction multiple data processor according to claim 1 wherein

said task memory is a random access storage system;

said scalar processing unit includes

an arithmetic unit for performing general scalar arithmetic functions; and said array control unit includes

means for addressing said parallel memory array;

means for directing said alignment network in its function of aligning particular memory modules in said array thereof with particular arithmetic elements in said array thereof; and

means for initiating particular arithmetic operations in said parallel array of arithmetic elements.

8. The single instruction multiple data processor according to claim 1 wherein

said memory module array consists of a prime number of memory modules; and

said parallel array of arithmetic elements consists of a power of two number of arithmetic elements.

9. The single instruction multiple data processor according to claim 1 wherein said alignment network comprises:

input alignment means including a crossbar network for providing a data communications path between any particular memory module in said array thereof with any particular arithmetic element in said array thereof; and

output alignment means including a crossbar network for providing a data communications path between any particular arithmetic element in said array thereof with any particular memory module in said array thereof.

Description

CROSS REFERENCES RELATED TO APPLICATION

In copending application Ser. No. 682,526, for a "Multidimensional Parallel Access Computer Memory System", filed May 3, 1976, in the name of D. H. Lawrie et al and assigned to the assignee of the present invention, there is described and claimed a parallel memory array and parallel processor alignment system for storing and non-conflictingly accessing linear vectors. This application is hereby incorporated by reference.

BACKGROUND OF THE INVENTION

This invention relates generally to large scale data processing systems and more particularly, to the architecture of large single instruction multiple data (SIMD) type parallel processing arrays for scientific processing applications.

In the development of digital computers the most important design goal has always been to maximize their operating speed, i.e., the amount of data that can be processed in a unit of time. It has become increasingly apparent in recent times that two important limiting conditions exist within the present framework of computer design. These are the limits of component speed and of serial machine organization. To overstep these limitations high speed parallel processing systems have been developed providing an array of processing elements under the control of a single control unit.

As speed requirements of computation have continued to increase, systems employing greater numbers of parallel memory modules have been developed. One such system has in the order of 64 parallel memories, see U.S. Pat. No. 3,537,074, issued Oct. 27, 1970 to R. A. Stokes et al, and assigned to the assignee of the present invention. However, parallel processors have not been without their own problems. For example, a parallel array often has great capacity that is unusable because of limitations imposed by the I/O channels feeding data to it. Further, the parallel array being tailored to vector or parallel processing performs relatively slowly while handling scalar tasks.

Also, parallel processors being architecturally so far removed from scalar processors often are hard to program and have limited ability to function with standard high level languages such as Fortran.

Finally, prior art parallel processors often have difficulty handling matrix calculations which are often the heart of scientific problems. Unless each element of a matrix vector is stored in a different memory module in the array memory that vector cannot be accessed in parallel and a memory conflict occurs slowing and complicating matrix calculations.
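The memory conflict described above can be illustrated with a minimal sketch. It assumes a simple interleaved layout in which the word at linear address a is held in module a mod M; the 16 x 16 row-major matrix, the module counts of 16 and 17, and the function names are illustrative assumptions and are not taken from this specification (claim 8 states only that the module count is prime and the arithmetic element count is a power of two).

```python
def modules_touched(base, stride, length, num_modules):
    """Return the memory module hit by each element of a linear vector.

    Assumes a simple interleaved layout: address a lives in module a % num_modules.
    """
    return [(base + i * stride) % num_modules for i in range(length)]


def is_conflict_free(base, stride, length, num_modules):
    """A vector can be fetched in one parallel access only if no module repeats."""
    hits = modules_touched(base, stride, length, num_modules)
    return len(set(hits)) == len(hits)


N = 16  # assumed 16 x 16 matrix stored row-major, so a column has stride 16

# Row (stride 1) and column (stride N) accesses with 16 memory modules.
row_ok_16 = is_conflict_free(0, 1, N, 16)
col_ok_16 = is_conflict_free(0, N, N, 16)

# The same accesses with a prime number of modules (17 is an assumed value).
row_ok_17 = is_conflict_free(0, 1, N, 17)
col_ok_17 = is_conflict_free(0, N, N, 17)

print("16 modules: row", row_ok_16, "column", col_ok_16)
print("17 modules: row", row_ok_17, "column", col_ok_17)
```

With 16 modules every element of a column falls in the same module, whereas with 17 modules both rows and columns spread over distinct modules. In this simple layout a stride that is itself a multiple of the module count would still map every element to one module.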

OBJECTS AND SUMMARY OF THE INVENTION

It is therefore an object of this invention to improve single instruction multiple data (SIMD) computers.

It is a further object of this invention to provide a large scale parallel processing computer system which may be readily programmed in a high level language.

It is a further object of this invention to provide a parallel processing system which also efficiently processes scalar tasks.

It is yet a further object of the invention to provide a parallel processing system which minimizes processing efficiency deteriorations introduced by I/O limitations between the front end or management processor and the parallel task processor.

It is still a further object of this invention to provide an array processing system which is conflict free for processing multi-dimensional arrays and which operates in an efficient pipelined manner.

In carrying out these and other objects of this invention, there is provided a scalar front end processor, a parallel processing array, a control unit for controlling the parallel processing array and a large high speed secondary storage system having high speed I/O paths to the front end processor and to the memory modules of the parallel processing array.

In operation, the front end processor is programmed in a high level language and transfers complete parallel tasks to the secondary storage system whereupon complete control for the parallel processing operation is directed by the control unit thereby freeing the front end processor to perform general purpose or other tasks. Upon parallel task completion, complete files are transferred back to the front end processor from the secondary storage system.

The parallel processing array efficiently processes vector elements in a parallel locked-step fashion under template control provided by the control unit. The memory array of the parallel processor provides conflict-free access to multi-dimensional arrays stored therein.

Various other objects and advantages and features of this invention will become more fully apparent in the following specification with its appended claims and accompanying drawings wherein:

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing the scientific processing architecture of the present invention;

FIG. 2 depicts the operation and partitioning of the scientific processing architecture of FIG. 1 from a Master Control Program point of view;

FIG. 3 lists the steps required to transfer and complete a job from the front end processor to the parallel task processor of the scientific processing architecture of FIG. 1;

FIG. 4 is a detailed block diagram of a large high-speed secondary storage unit used in the parallel task processor illustrated in FIG. 1;

FIG. 5 is a block diagram illustrating the operating environment of the task memory of the control unit shown in FIG. 1;

FIG. 6 is a detailed block diagram depicting the features of the task memory of FIG. 5;

FIG. 7 is a timing diagram illustrating the procedure for fetching from the task memory of FIG. 6;

FIG. 8 is a timing diagram depicting the procedures for writing into the task memory of FIG. 6;

FIG. 9 is a block diagram depicting the scalar processing unit of the control unit shown in FIG. 1;

FIG. 10 is a diagram showing the array control unit of the control unit of FIG. 1;

FIG. 11 is a function flow diagram illustrating a job flow through the parallel array of FIG. 1;

FIG. 12 illustrates the various kinds of parameter groups that are transferred from the scalar processing unit to the array control unit and stored therein in the vector function initialization and validation unit shown in FIG. 10; and

FIG. 13 is a block diagram of the template control unit of the array control unit shown in FIG. 10.

DESCRIPTION OF THE PREFERRED EMBODIMENT

ARCHITECTURAL APPROACH

The scientific processing architecture of the present invention places a scientific job in a computational envelope which responds quickly and with high bandwidth to the executing program's instruction stream. In the preferred embodiment, see FIG. 1, the computational envelope includes a Control Unit 11, a File Memory 13, an Array Memory 15 and Arithmetic Elements 17. A Memory Interface 19, an Input Alignment Network 21, and an Output Alignment Network 23 are provided to channel data flow between the File Memory 13 and the Array Memory 15 and between the Array Memory 15 and the Arithmetic Elements 17. While a Front End Processor 25 is the system manager from an overall task management viewpoint, the Control Unit 11 is in direct and complete control of actual task execution and I/O operations inside the computational envelope and makes requests of the Front End Processor 25. To facilitate a more complete understanding of the function and architecture of the present invention, the above-mentioned elements within the computational envelope and the Front End Processor 25 will all be described briefly with continued reference to FIG. 1, followed by a more detailed explanation of the elements and the interfaces therebetween.

The Front End Processor 25 functions as a host processor in that it handles the true I/O, user interfacing, archival storage, and building of job queues for processing within the computational envelope. In the preferred embodiment a large systems processor, namely, a Burroughs B7800, is selected as the Front End Processor 25.

The Control Unit 11 comprises four main units: a Task Memory 27, a Scalar Processor Unit 29, an Array Control Unit 31 and a Control Maintenance Unit 33. The Control Unit 11 together with the File Memory 13 is capable of functioning independently of the Front End Processor 25 while performing scalar and vector tasks, thereby freeing the Front End Processor 25 and allowing it to perform other tasks for which it is required or best suited.

The Task Memory 27 stores the Master Control Program (MCP), parts of the object program code, scalars and descriptors. Preferably, a storage capability of 64K words is provided with expandability to 256K words.

The Scalar Processor Unit 29, which provides the system intelligence within the computational envelope, executes program code which is stored in bytes in the Task Memory 27. The Scalar Processor Unit 29 combines instruction buffering, variable length instructions, relative addressing of the Task Memory 27, use of an internal local memory, fast arithmetic synchronizers, maskable interrupts and other features which enhance Fortran program execution. Instruction processing is pipelined. Vector operations and parameters are assembled in an internal local memory before being sent to the Array Control Unit 31 for queuing.

The Array Control Unit 31 receives and queues vector operations and parameters from the Scalar Processor Unit 29 and generates the microsequence for their execution by the Arithmetic Elements 17. Memory Indexing parameters and tag parameters are generated and updated for each set of vector elements.

The Control Maintenance Unit 33 serves as an interface between the Front End Processor 25 and the rest of the Control Unit 11 for initialization, control data communication, and maintenance purposes. It receives commands from the Front End Processor 25 and monitors system error detection circuits (not shown) and reports detected errors to the Front End Processor 25. The Control Maintenance Unit 33 has access to critical data paths and controls in the Scalar Processor Unit 29, the Array Control Unit 31 and the Task Memory 27 for fault location purposes.

The File Memory 13 has a high speed I/O data path 35 to the Front End Processor 25 to facilitate fast data flow to and from the Array Memory 15. In operation, files of the program code are brought in from the Front End Processor 25 and temporarily stored in the File Memory 13 and the Task Memory 27. The large high speed File Memory 13 with its high speed data paths 35 and 37 is a most important element in the present invention and in the physical realization of the computational envelope approach.

The parallel processing array, comprising the Array Memory 15, the Memory Interface 19, the Input Alignment Network 21, the Arithmetic Elements 17, and the Output Alignment Network 23, receives data from the File Memory 13 and processes the data in a parallel lock-step operation in the Arithmetic Elements 17 under direction from the Array Control Unit 31. A more detailed explanation of the array operation for processing linear vectors is given in U.S. Pat. application, Ser. No. 682,526, filed May 3, 1976, for a "Multidimensional Parallel Access Computer Memory System", by D. H. Lawrie et al, and assigned to the assignee of the present invention, the application being incorporated herein by reference. Basically, data is read from the File Memory 13 through the Memory Interface 19 into the Array Memory 15. Thereafter, the data is fed through the Input Alignment Network 21 for proper aligning and is processed in parallel in lock-step fashion by the Arithmetic Elements 17. Thereafter, the data is realigned for storage or further processing by the Output Alignment Network 23.

The scientific processing architecture of the present invention having been briefly described above will now be detailed as implemented in the preferred embodiment thereof.

FRONT END PROCESSOR

The Front End Processor 25 functions as the user interface, the basic job compiler, and the interface to the rest of the system, hereinafter referred to as the Parallel Task Processor 41, which comprises all the functional elements of FIG. 1 except the Front End Processor 25. The I/O between the Parallel Task Processor 41 and the Front End Processor 25 is relatively simple due to dedicated storage in the form of the File Memory 13. As will be detailed hereinafter, the Front End Processor 25 gives parallel job tasks to the Parallel Task Processor 41 and is thereafter relatively isolated from the Parallel Task Processor 41 until task completion. Thus, the Front End Processor 25 is freed for a period of time to perform other functions, such as general purpose processing. In this manner a typical scientific problem comprising both general purpose and parallel tasks may be handled most efficiently by partitioning the scalar tasks to the Front End Processor 25 and the parallel tasks to the Parallel Task Processor 41.

In order to ease user interface problems and to simplify programming requirements, the Front End Processor 25 is implemented as a large scale Fortran programmable computer, preferably a Burroughs B7800. In the B7800 a Master Control Program (MCP) allows the user to gain access to the Parallel Task Processor 41 via a standard B6800/B7800 Work Flow Language. In alternative embodiments, a counterpart to the B6800/B7800 Work Flow Language is employed such as a Job Control Language. In either case, the Front End Processor 25 and the Parallel Task Processor 41 appear to be a single system from the point of view of job control. Under this arrangement, all Parallel Task Processor 41 capabilities are invoked as though they are extensions to the standard Front End Processor 25 capabilities. Thus, a single job can use all the facilities of the system.

With reference to FIG. 2 it is apparent that in one sense there are two Master Control Programs and yet in another sense, the MCP on the Parallel Task Processor 41 is merely an extension of the one on the Front End Processor 25. It is important to the present invention, however, that the Master Control Program in the Parallel Task Processor be in complete control of the Parallel Task Processor 41. The Front End Processor 25 must request the Parallel Task Processor 41 to perform certain functions. This is a major advantage over ILLIAC IV and other prior art systems where the management or Front End Processor always had full control.

In step 1, with reference to FIG. 3, the Front End Processor 25 interprets the Work Flow Language program for transferring communications with the Parallel Task Processor 41. The word "interpret" is used rather than "compile" because the Work Flow Language has a compiler associated with it. Thus, the Work Flow Language compiler merely compiles the Work Flow Language statement to a form which is later interpreted by the overall Scientific Processor. In a dynamic state, when the operating system encounters a Work Flow Language expression, for example, one to compile a Parallel Task Processor 41 FORTRAN program, the Front End Processor 25 calls the Parallel Task Processor 41 compiler, locates the input file for the compiler and tells the compiler to run. The input file will be the source program. When the computation is done there will be another work flow statement which tells it what to do with the results of the computation. Normally, the object code generated by the compilation is shifted over to the Parallel Task Processor 41 to run there. Therefore, there would be another work flow statement requesting that the result of the compilation be transferred to the Parallel Task Processor 41. At the same time, a list of statements would be encountered which would tell which files, needed by the object code to run, are to go along with it. The above procedure is typical of matters handled by the Work Flow Language. Work Flow Language is merely a set of instructions describing how a program is to flow through the system.

The Work Flow Language can also have one task running on the Parallel Task Processor 41 while another executes on the Front End Processor 25 and, when both are completed, compare the results and start up a third program based on the computation. The Work Flow Language thus can cause part of a job to be computed on one machine and part on another. Thus tasks may be partitioned and executed on the machine which is best optimized to perform that task.

In step 2, the job is placed in a queue depending on its priority. There are queues for long jobs and for short jobs, and for high priority and low priority jobs. Other characteristics also affect its position in the queue. The queues are inspected by the Master Control Program (MCP) of the operating system of the Front End Processor 25. When conditions allow, the next entry is taken from the queue and run. The queues ensure that job priorities are respected.
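The queuing behavior of step 2 can be sketched as a priority queue. This is a minimal illustration only: the numeric priority scheme (lower number runs first), the first-come tie-break, and the names `JobQueue`, `submit` and `next_job` are assumptions, since the specification does not state how the MCP orders entries.

```python
import heapq
import itertools


class JobQueue:
    """Toy job queue: lower priority number runs first, ties run in arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker preserving submission order

    def submit(self, name, priority):
        """Place a job in the queue according to its (assumed numeric) priority."""
        heapq.heappush(self._heap, (priority, next(self._arrival), name))

    def next_job(self):
        """Take the next entry from the queue, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name


queue = JobQueue()
queue.submit("long_low", priority=2)
queue.submit("short_high", priority=0)
queue.submit("short_low", priority=2)
queue.submit("long_high", priority=0)

order = [queue.next_job() for _ in range(4)]
print(order)  # ['short_high', 'long_high', 'long_low', 'short_low']
```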

In step 3, the File Memory 13 reservation for the job is made and the job is started. This is accomplished by both machines. The Front End Processor 25 performs an MCP-to-MCP communication with the Parallel Task Processor 41 and, in effect, asks if it is ready for the next job. When the Parallel Task Processor 41 responds affirmatively, memory space in the File Memory 13 is reserved for the job and the Parallel Task Processor 41 gives to the Front End Processor 25 the descriptors which indicate where the job is to be stored.

With reference again to FIG. 1, it is seen that a very high speed I/O path 35 connects the Front End Processor 25 and the Parallel Task Processor 41. Also, a path interconnects the Front End Processor 25 and the Parallel Task Processor 41 for the purpose of MCP conversations. This MCP path 45 actually comprises two unidirectional ports. Protocol along the communications path 45 is quite similar to that in data communication paths. It is in essence, a message-to-message protocol. A short message is decoded before the next is communicated. The buffers are small, and in essence a little packet is transferred at a time.

In step 4, again with reference to FIG. 3, after a File Memory 13 reservation is made, the Front End Processor 25 places the tasks in the Parallel Task Processor 41 task queue and the job is further decoded into tasks which are placed in a queue. A job comprises at least one task. The Parallel Task Processor 41 makes the File Memory 13 allocations.

In step 5, the Front End Processor 25 takes information from its own discs and files and transfers the necessary files to the File Memory 13 in the Parallel Task Processor 41.

In step 8, the Parallel Task Processor 41 informs the Front End Processor 25 that it is through with the job, lists the files which are to be transferred back, and erases others if there are any. The Front End Processor 25 actually performs the transfer back. The Parallel Task Processor 41, in essence, says "I am done and here are the descriptors, pick up the files". The Front End Processor 25 then takes care of the transfer and notifies the Parallel Task Processor 41 that the transfer is completed. The descriptors are the description of the file from a hardware point of view. The descriptors passed from the Parallel Task Processor 41 are the actual hardware descriptors which the Front End Processor 25 will give to the hardware to cause it to do what is requested. Each descriptor designates how large its associated file is and where the associated file is located. The job executes out of the Array Memory 15 and is then packed back into the File Memory 13 before the Front End Processor 25 is notified that the job is completed. Thus, the Front End Processor 25 extracts only from File Memory 13. The Parallel Task Processor 41 wraps the job up in a package in the File Memory 13 and notifies the Front End Processor 25.

In step 9, the actual transfer of the output files to permanent storage occurs.

In step 10, finally, having completed a job, the Front End Processor 25 goes to the next job as indicated by step 2.

The word "task" as used in the Parallel Task Processor 41 is in essence a complete job. The Work Flow Language concept is so powerful that several tasks may be performed under a single job. However, each task may be considered a complete job in itself. The task is brought into File Memory 13 and deposited there by the Front End Processor 25. The Front End Processor 25 may, if there is room in the File Memory 13, queue up several tasks. Then the Parallel Task Processor 41 executes the tasks in sequence by taking them out of the File Memory 13 and returning them to the File Memory 13. Following this, the Front End Processor 25 is notified that a task is completed and is requested to transfer it back. In the steady state one task is running, one task is being loaded into the Parallel Task Processor 41 and one task is being removed from the Parallel Task Processor 41. Thus, the loading and unloading of tasks are overlapped. As soon as the Parallel Task Processor 41 is finished with one task it is prepared to go on to the next. Thus, the I/O channel 35 is kept busy.
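The steady-state overlap of loading, running and unloading can be pictured as a three-stage pipeline. The sketch below is illustrative only and assumes that each stage takes one equal time step, which the specification does not state; the function name `pipeline_schedule` and the task labels are hypothetical.

```python
def pipeline_schedule(tasks):
    """Return, per time step, which task is loading, running, and unloading.

    Assumes each stage takes exactly one time step, purely for illustration.
    """
    def task_at(index):
        # Outside the task list there is nothing in that stage yet (or any more).
        return tasks[index] if 0 <= index < len(tasks) else None

    steps = []
    for t in range(len(tasks) + 2):
        steps.append({
            "load": task_at(t),        # being deposited into File Memory
            "run": task_at(t - 1),     # executing on the Parallel Task Processor
            "unload": task_at(t - 2),  # being transferred back to the front end
        })
    return steps


schedule = pipeline_schedule(["A", "B", "C", "D"])
for t, step in enumerate(schedule):
    print(t, step)

# Steps in which all three stages are occupied at once.
steady = [step for step in schedule if None not in step.values()]
```

For four tasks the schedule has six steps, and in the two middle steps all three stages are busy at once, which is the steady state the text describes.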

In the preferred embodiment, a Proxy Task is performed for coding convenience on the Front End Processor 25. The Proxy Task is in essence a dummy job which is employed to take advantage of the Work Flow Language capability of permitting the user to address both the Front End Processor 25 and the Parallel Task Processor 41 as though they were different aspects of the same overall machine. When the Front End Processor 25 starts up a Parallel Task Processor 41 task, it also starts up a Proxy Task at the same time, and when it finishes the Parallel Task Processor 41 task the Proxy Task is halted and destroyed. Some of the messages between the Front End Processor 25 and the Parallel Task Processor 41 go through the Proxy Task. That is its main function. The Front End Processor 25 MCP performs as though it is running a job on the Front End Processor 25 because of the guise of the Front End Processor 25 Proxy Task. The Proxy Task allows one to get to all of the resources of the Front End Processor 25 while using the standard operating system of the Front End Processor 25. The Proxy Task occupies only a few hundred words of memory and is active only during those periods when communications are occurring between the Front End Processor 25 and the Parallel Task Processor 41.

In the Front End Processor operating system, in order for the Front End Processor 25 to make use of the queues which it has, there have to be tasks associated with the queues. In that sense, the Proxy Task is the task which the queues are driving; it is the only task of which the queues, in a sense, are officially aware. In prior art machines, such as the above-mentioned ILLIAC IV, there is something called an independent runner which in essence is something like the Proxy Task described above.

File Memory

The main communications paths between the Front End Processor 25 and the Parallel Task Processor 41 involve the File Memory 13. Because of this, the interface procedures are kept relatively clear and simple as above-described. With continued reference to FIG. 1 in general, and in particular reference now to FIG. 4, it is appreciated that the two main functional sections of the File Memory 13 are the File Storage Unit 43 and the File Memory Controller 45.

File Storage Unit

The File Storage Unit 43 provides high performance dedicated secondary storage. In the preferred embodiment, the File Storage Unit 43 is implemented by charge coupled devices having a maximum capacity of 64 million words and partitioned into a maximum of eight distinct modules for reliability. In a typical systems application of the present invention, the File Storage Unit 43 contains eight to sixteen megawords of 56 bits. Since only one module is required to be in operation at a time, data refreshing will require less than 10% of the File Memory 13 time.

In alternate embodiments for satisfying differing cost-performance criteria, the File Storage Unit 43 may be implemented by either MOS RAM chips or disc technology.

File Memory Controller

The File Memory Controller 45 functions as the main buffer and interface between the File Storage Unit 43 and the Front End Processor 25, the Array Memory 15, and the Control Unit 11, see FIG. 4. Thus, the File Memory Controller 45 is in essence an I/O Controller.

The I/O data communications of the Front End Processor 25 are maintained preferably at a rate of 250 kilowords per second maximum under the control of the Front End Processor Interface Unit 47. The Front End Processor Interface Unit 47 feeds data over data path 49 to and from the File Memory Interface 51. The File Memory Interface 51 handles data communications with the File Storage Unit 43 at a rate in the order of 12.5 million words per second for a CCD implementation of the File Storage Unit 43 and at a rate of 100 kilowords per second for a disc implementation thereof. In like manner, the Array Memory Interface 53 handles data communications with the Array Memory 15 at a rate in the order of 12.5 million words per second maximum. A Data Bus 55 is provided between the Array Memory Interface 53 and the File Memory Interface 51.
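To give a feel for the difference between these bandwidths, the following sketch computes how long each path would take to move a file at its stated maximum rate. The one-million-word file size and the dictionary keys are assumptions chosen for illustration; only the rates come from the text above.

```python
# Maximum rates stated in the text, in words per second.
RATES_WORDS_PER_SECOND = {
    "front_end": 250_000,            # Front End Processor Interface Unit 47
    "file_storage_ccd": 12_500_000,  # File Storage Unit 43, CCD implementation
    "file_storage_disc": 100_000,    # File Storage Unit 43, disc implementation
    "array_memory": 12_500_000,      # Array Memory 15
}


def transfer_seconds(words, path):
    """Time to move `words` words over `path` at its maximum stated rate."""
    return words / RATES_WORDS_PER_SECOND[path]


FILE_WORDS = 1_000_000  # assumed file size, for illustration only

for path in RATES_WORDS_PER_SECOND:
    print(f"{path:18s} {transfer_seconds(FILE_WORDS, path):6.2f} s")

# Ratio between the array-side rate and the front-end rate.
speed_ratio = RATES_WORDS_PER_SECOND["array_memory"] / RATES_WORDS_PER_SECOND["front_end"]
print("array memory path is", speed_ratio, "times faster than the front end path")
```

At these rates the array-side path is 50 times faster than the front end path.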

In operation, descriptors are fed from the Control Unit 11 to a Descriptor Address Queue 57. The Descriptor Address Queue 57 may preferably accept up to 30 descriptors at a time. Thus, the File Memory Controller 45 is able to accept more than one address at a time. As the descriptors are queued up, they are peeled off one at a time under the control and management of the Descriptor Handling Logic 59. The Descriptor Handling Logic 59 generates the necessary synchronizing and timing controls associated with the descriptors to properly handle the differing bandwidths associated with the Front End Processor Interface Unit 47, the Array Memory Interface 53 and the File Memory Interface 51. To permit the File Storage Unit 43 to have virtual addresses, a Dynamic Address Translator 61 is provided. Thus all of the advantages of virtual addressing accrue to the File Storage Unit 43. For example, once a program is linked it can have all the proper addresses linked into it at that time. The addresses do not have to be modified. The operating system is thus allowed to move data around the File Storage Unit 43, repacking it in order to make maximum use of the space available. Instead of having to redo the addresses, the operating system only has to redo the pointers associated with the Dynamic Address Translator 61. Virtual addressing is common for main memories and has been incorporated into the File Memory Controller 45 to provide virtual I/O addresses. It is appreciated that the Descriptor Address Queue 57 may, in implementation, comprise a plurality of queues such as a low priority queue and a high priority queue.
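The benefit of virtual I/O addressing described above can be sketched as a pointer table. This is a minimal illustration under assumptions: the specification does not describe the internal structure of the Dynamic Address Translator 61, so the per-file base pointer, the class name and the methods `bind`, `translate` and `repack` are all hypothetical.

```python
class DynamicAddressTranslator:
    """Maps a virtual file identifier to a physical base address in storage."""

    def __init__(self):
        self._base = {}  # virtual id -> physical base address (the "pointers")

    def bind(self, virtual_id, physical_base):
        """Record where a file currently sits in physical storage."""
        self._base[virtual_id] = physical_base

    def translate(self, virtual_id, offset):
        """Turn a linked (virtual id, offset) reference into a physical address."""
        return self._base[virtual_id] + offset

    def repack(self, sizes):
        """Slide files down to close gaps; only the pointers are rewritten."""
        next_free = 0
        for virtual_id in sorted(self._base, key=self._base.get):
            self._base[virtual_id] = next_free
            next_free += sizes[virtual_id]


translator = DynamicAddressTranslator()
sizes = {"A": 100, "B": 50}       # assumed file sizes in words
translator.bind("A", 0)
translator.bind("B", 500)         # a gap of unused words lies between A and B

before = translator.translate("B", 10)
translator.repack(sizes)          # the operating system compacts storage
after = translator.translate("B", 10)
print(before, after)  # 510 110
```

The linked reference ("B", 10) is the same before and after repacking; only the translator's pointer for file B changes.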

Control Unit

The File Memory Controller 45 receives descriptors and other data from the Control Unit 11. The Control Unit 11 comprises the Task Memory 27, the Scalar Processor Unit 29, the Array Control Unit 31, and the Control Maintenance Unit 33, see FIG. 5.

Task Memory

The Task Memory 27 provides the storage of both code and data for tasks executing on the Control Unit 11 of the Parallel Task Processor 41. See FIG. 6. The storage supports scalar operations and control functions for both user tasks and the resident operating system. The Task Memory 27 is a linear list of full words, addressed by base relative, positive integers. Access is controlled by an ordered priority structure.

The Task Memory 27 provides random access storage. In the preferred embodiment, the memory size is from 65,536 words to 262,144 words in 65,536 word increments. The word size is 48 bits of data with 7 bits of Hamming Code plus an unused bit in each word, making the word 56 bits. It is structured such that four contiguous words at a 4-word boundary may be simultaneously accessed, provided that all accesses are for the same function (Read or Write).

Only one request is accepted for each access cycle. An access cycle may start on any minor clock and requires two minor clocks before another access cycle may be started. For any minor clock that an access may start, the highest priority request present at that time is accepted. There are five requestors with the following priority: (1) Control Maintenance Unit 33 (only used during diagnostics); (2) File Memory Controller 45 (I/O); (3) Array Control Unit 31 (Bit Vectors and Scalar Results); (4) Scalar Processor Unit 29 (IPU for instruction fetch); and (5) Scalar Processor Unit 29 (LM for operands). Note that the SPU 29 has two distinct requestors: the Instruction Processing Unit, hereinafter the IPU 67, and the Local Memory, hereinafter the LM 69. These two requestors will be discussed later in more detail along with a general description of the SPU 29.
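The priority resolution just described can be sketched as follows. This is an illustrative model only: the short requestor labels and the per-clock set representation are assumptions, and the sketch models nothing beyond the stated priority order and the two-minor-clock spacing between access cycles.

```python
# Requestors in descending priority, following the order given in the text.
# The string labels themselves are hypothetical shorthand.
PRIORITY = ["CMU", "FMC", "ACU", "SPU-IPU", "SPU-LM"]


def select_requestor(pending):
    """Return the highest-priority requestor in `pending`, or None."""
    for name in PRIORITY:
        if name in pending:
            return name
    return None


def arbitrate(requests_by_clock):
    """Simulate request acceptance over successive minor clocks.

    requests_by_clock: list of sets, the requests present at each minor clock.
    Returns a list of (clock, requestor) pairs for accepted requests.
    An accepted access blocks the start of another for two minor clocks.
    """
    accepted = []
    next_free_clock = 0
    for clock, pending in enumerate(requests_by_clock):
        if clock < next_free_clock:
            continue  # an access cycle is still in progress
        winner = select_requestor(pending)
        if winner is not None:
            accepted.append((clock, winner))
            next_free_clock = clock + 2
    return accepted


trace = arbitrate([{"SPU-LM", "ACU"}, {"FMC"}, {"SPU-LM"}])
```

In this trace the Array Control Unit wins at clock 0, the request present only at clock 1 finds the memory busy, and the Local Memory request is accepted at clock 2.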

The given address is relative to a register contained base value in the Task Memory 27, except in supervisor mode (zeros are substituted). All addresses are unsigned integers. The selected address (based on priority resolution) is added to the base for application to the memory units. A Memory Limit check is provided for the top of the memory. The same sub-address is provided for each of the four Memory Module Units 59, 61, 63 and 65 of the Task Memory 27, see FIG. 6.
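A minimal sketch of the base-relative addressing and Memory Limit check described above follows. The split of the absolute address into a module number (low two bits) and a shared sub-address is an assumption inferred from the four-word-boundary access rule; the function and exception names are hypothetical.

```python
class MemoryLimitError(Exception):
    """Raised when an access falls beyond the top of memory."""


def resolve(address, base, limit, supervisor=False):
    """Map a requestor's unsigned relative address to the memory modules.

    Returns (absolute address, sub-address, module number 0..3).
    In supervisor mode zeros are substituted for the base.
    """
    if address < 0:
        raise ValueError("addresses are unsigned integers")
    effective_base = 0 if supervisor else base
    absolute = effective_base + address
    if absolute >= limit:
        raise MemoryLimitError(f"address {absolute} exceeds limit {limit}")
    # Assumed interleaving: four contiguous words share one sub-address.
    sub_address, module = divmod(absolute, 4)
    return absolute, sub_address, module


user_access = resolve(5, base=1000, limit=65536)
supervisor_access = resolve(5, base=1000, limit=65536, supervisor=True)
```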

Data provided to the Memory Module Units 59, 61, 63 and 65 is aligned to the correct module for single word operations, see FIG. 6. Four word operations will take the data as presented. Data fetched from the Memory Module Units 59, 61, 63 and 65 is sent as it appears.

In the preferred embodiment, the Task Memory 27 includes such error detecting mechanisms as Hamming Code, bounds error, parity check on data being received, Memory Limit checks, and two hardware failure checks, the ACK Stack Check, and an address parity check on lines to the Memory Module Units 59, 61, 63, and 65. Information relating to detected errors is communicated to the Control Maintenance Unit 33 for logging of such information for corrective and diagnostic procedures. Error detection being well-known in the art, will be only briefly referred to in the following discussion so as not to unnecessarily complicate a utilitarian understanding of the present scientific processor architecture invention.

To fetch a word or four words from Task Memory 27, see FIG. 7, the requestor must put a true signal on the request line and put the address on his address lines. If the requestor is the highest priority at that time and the memory is not busy, the requestor's address will be added to a register contained base address (if in user mode) and stored. At the same time, the requestor's acknowledge (ACK) line will be driven true. In the next clock time, the memory will be busy while the address is sent to the memory unit. Then a memory cycle will be initiated and will take two minor clocks to finish. Finally in the fourth clock period the data will be sent to the requestor along with a strobe (STB) signal. The requestor then loads his data on the fifth clock edge after he received the ACK signal. The data will remain stable until the sixth clock edge. The requestor may change his address and request lines on the next clock edge after he receives the ACK signal.

To store a word or four words (for FMC) in Task Memory 27, see FIG. 8, the requestor does everything he did for a fetch operation, but instead puts a true on the read/write line and at the same time that the requestor puts the address on his address lines he also puts the data on the data lines. The store address will be handled in the same manner as a fetch address. The requestor will receive an ACK but not a STB signal. The requestor may change all lines to the memory on the next clock edge after he receives an ACK signal.

The Input Alignment Logic 71, see FIG. 6, selects the requestor's data and aligns it with the selected Memory Module Unit(s) 59, 61, 63 and 65. The logic will align a word to all four modules 59, 61, 63 and 65 for one word writes, or it will align four words to four modules as presented, for an FMC four word write. The data is aligned and saved in a first cycle and then sent to the Memory Module Units 59, 61, 63 and 65 in a subsequent cycle.

The Output Alignment Logic 73, see FIG. 6, selects the requested Memory Module Units 59, 61, 63 and 65 and presents it (them) to the requestor. The logic 73 will present four words directly to the requestor (for four word reads) or will present one word in the position it appears (for one word reads). At the same time that the data is made available to the requestor, a strobe (STB) signal from the Task Memory Controller, TMC 75 is sent to the requestor. The data is then held until the next clock edge.

The Control and Address Generator 75, see FIG. 6, provides the timing, control, and address generating signals for the Input Alignment Logic 71, the Output Alignment Logic 73, and the Memory Module Units 59, 61, 63 and 65. In operation, the Control and Address Generator 75 functions in six distinct phases. First, the requestor is selected according to priority and inputted address and data are stored while controls are set for later phases. Second, the received information is sent to the Memory Module Units 59, 61, 63 and 65. In the third phase, the TMC 75 sends write enable data to the Memory Module Units 59, 61, 63 and 65. In the fourth phase, error information is stored and data is outputted from the Memory Module Units 59, 61, 63 and 65. In phase five, data is sent to the requestor and in phase six, error messages are sent to the CMU 33 for logging and diagnostics.

SCALAR PROCESSOR UNIT

The Scalar Processor Unit 29 is the primary controlling element of the Parallel Task Processor 41, see FIG. 1. It is the implementation link between the compiled program and unit executions. It performs the functions of instruction stream control, scalar arithmetics, vector parameter preprocessing, input/output initiation, and resource allocation.

More specifically, the SPU 29 fetches all instructions used by the Parallel Task Processor 41, performs those destined for internal operations, and passes vector operations to the Array Control Unit 31. It also performs the arithmetic operations required for the pre-processing of vector parameters to be used in the Parallel Task Processor 41, and many of those operations that cannot efficiently be accomplished in parallel. Further, the SPU 29 performs those operations necessary to allocate the resources of the Parallel Task Processor 41 to the tasks in process. It enforces these both internally and over the units in its environment. Finally, the SPU 29 causes transfers between the Parallel Task Processor 41 elements via a descriptor sent to the File Memory Controller 45. Through the Control and Maintenance Unit 33, it requests the Front End Processor 25 to perform transfers external to the Parallel Task Processor 41. The Scalar Processor Unit 29 includes a Local Memory 69 which performs temporary storage (buffering) for both operands of scalar operations and vector parameters. In the preferred embodiment, the operands of the scalar operations are stored in a 16 word by 48 bit register file which is accessed for word operation only. Two words may be simultaneously read while only one is written. Also in the preferred embodiment, vector parameters are temporarily stored in a 16 word by 120 bit random access memory which is accessed in a four word operation only for transfer thereof to the Array Control Unit 31 for further processing.

A Processor Environment Unit 77 is provided for normal housekeeping operations such as maintaining the operational status of the Scalar Processor Unit 29 via interrupt, synchronization and other standard techniques. As can be appreciated with respect to FIG. 9, a primary function of the Processor Environment Unit 77 is to provide the control interface between the Local Memory 69 and the Instruction Processing Unit 67.

The Instruction Processing Unit 67 performs instruction preparation by fetching four words in parallel from the Task Memory 27. The fetched words are buffered to assure a steady flow of instructions for branch free operation. Instructions are selected coincident with instruction execution. Branch capability exists to the extent of the buffering. Branches beyond that are penalized by the required Task Memory 27 accesses. Instructions are preferably in multiples of eight bytes. The Instruction Processing Unit 67 also controls instruction execution, Local Memory 69 addressing, and Scalar Processor Unit 29 interfacing.

The Arithmetic Unit 79 of the Scalar Processor Unit 29 functions to implement the operand test and manipulation portions of the instruction set. Scalar operands are sourced from the Local Memory 69 and resultants are returned thereto. In addition to performing general arithmetic functions, the Arithmetic Unit 79 of the preferred embodiment is also structured to quickly perform other more specialized functions such as: address index arithmetic, operand comparison (integer, Boolean, normalized), addition and subtraction of integer operands, and Boolean operations.

A Bit Vector is an ordered set of data, each element of which is a bit.

A Bit Vector Descriptor is a collection of items that specify a bit vector.

A Superword is a vector whose elements are fetched in parallel to be used by the AEs. The length of a superword is equal to the number of AEs.

A Vector Form is the specification of a function. The operands for the function are vector sets, AE operators and a bit vector. Results are vector sets and a bit vector.

A Vector Function is the specification of a function. The operands for the function are vector sets, AE operators and a bit vector. Results are vector sets and a bit vector.

A template is a fixed pattern of controls for the array pipe. It consists of a sequence of template microwords. Each microword contains information to control various units of the array pipe. A template can execute one superword wide slice of a vector form.

Click: Central indexing on consecutive superwords is called a click operation and is performed by the CIU.

Superclick: Central indexing on the first superword of a vector in a set of parallel vectors is called a superclick operation and is performed by the CIU.

The Array Descriptor gives the base address and the number of elements in the array. Note that this array appears in the program data organization as a dimensioned variable.

An Incset contains the parameters required to specify the elements of the vector set relative to the base of the array.

Vector conflict occurs when all elements of the vector are located in one memory module. Note that the elements of a superword of a vector will either all be in separate memory modules or they will all be in one memory module.

Vector operation is the execution of a vector form.

The Array Control Unit 31 is partitioned into four subunits, see FIG. 10: the Vector Function Initialization and Validation Unit (VIV) 83, the Vector Function Parameter Queue (VPQ) 85, the Central Indexing Unit (CIU) 87 and the Template Control Unit (TCU) 89.

The VIV 83 accepts ACU 31 instructions from the SPU 29 and processes them for initialization and validation. A group of instructions describes a vector form (VF). Each instruction is validated to detect any inconsistent instruction in a group of instructions describing a VF. Processed parameters are put in the VPQ 85 and then move to the CIU 87 or the TCU 89. The CIU 87 performs indexing operations to calculate initial values required by the Array 81. The TCU 89 controls the CIU 87 and the Array 81 by means of templates and thus controls execution of the VF. Scalar results are collected by the TCU 89 and then stored in the Task Memory 27.

The ACU 31 also communicates with the Control and Maintenance Unit 33 for error logging, performance monitoring and diagnostics. I/O cycles are allocated on request from the File Memory Controller 45.

The ACU 31 controls the execution of vector forms on the Array 81, see FIG. 11 with reference to FIG. 1. The various stages in the Array 81 are the Central Indexing Unit 87, Memory Index Generator 91, Input Alignment Network Tag Generator 93, Array Memory 15, Input Alignment Network 21, Arithmetic Element 17 (AE), Output Alignment Network Tag Generator 95, and Output Alignment Network 23. The CIU 87 generates parameters required by the MIG 91, IANTG 93 and OANTG 95 for the index and tag computations. The CIU 87 also performs horizontal slicing of a Vector Form (VF) by performing clicking and superclicking operations. The MIG 91 generates the indexes required for the AM 15 fetch and store operations. The IANTG 93 generates the tags required for input alignment. The IAN 21 unscrambles the vector elements from the AM 15. The OAN 23 is the counterpart of the IAN 21 and transfers the elements of the result vector back to the AM 15.

Units in the Array 81 each take one major cycle for their operation and perform operations in this period. If an operation requires extra cycle(s), then the TCU 89 will allocate sufficient cycles one at a time. These units get operands from an input buffer (not shown) and deposit results into the input buffer (not shown) of the next unit in the Array 81. These units can be interrupted at the end of a cycle since the state of the Array 81 is saved in buffers. Extra paths for I/O do not change the state of the Array 81 except for the Array Memory 15 addressing. Thus I/O can steal cycles whenever the addressing can be restored. The FMC 45 generates requests for I/O cycles.

The ACU 31 accepts various kinds of parameter groups from the SPU 29. Each parameter group is stored as one entry in VIV 83. The entry preferably consists of 125 bits of information. Each of these, see FIG. 12, is described below in detail.

1. Setup Array region bounds: with this entry the Scalar Processor Unit provides a Base of Space (BOS) and Length of Space (LOS) values for subsequent vector set functions.

6. Vector Set descriptor updated by array descriptor: The SPU supplies the new array descriptor to be combined with the previous vector set incset. The array descriptor is two half words.

7. Vector Set descriptor updated by initial element index. The SPU supplies the index of the new initial element to be combined with the previous array descriptor and incset. The initial element index is one half word quantity. Other bits are unused.

10. Vector Set descriptor updated by incset: The SPU supplies the new incset to be combined with the previous array descriptor. The incset consists of 3 half words. Other bits are unused.

11. Vector Set Result descriptor: The SPU supplies array descriptor and incset of a vector set result. Array descriptor and incset requires five half words.

12. Vector Set Result descriptor updated by array descriptor: The SPU supplies the new array descriptor for use with the incset of the previous vector set descriptor. The array descriptor consists of two half words, other bits are unused.

13. Vector Set Result descriptor updated by initial element index: The SPU supplies the index of the new initial element for use with array descriptor and incset of the previous vector set descriptor. The initial element index consists of one half word. Other bits are unused.

14. Vector Set Result descriptor updated by incset: The SPU supplies the new incset for use with previous array descriptor of the previous vector set descriptor. The incset consists of 3 half words. Other bits are unused.

15. Scalar Result to task memory: The SPU supplies the task memory base address and initial element where scalar result is to be returned. Element displacement is d.

16. Scalar Result to array memory: The SPU supplies the array descriptor, initial element index (i) and element displacement (d), indicating the address where scalar result is to be returned. This consists of four half words. Other bits are unused.

17. Random access descriptor I: The SPU supplies the base and length fields to the VIV. It consists of two half words. Other bits are unused.

20. Random access descriptor II: The SPU supplies only VIV Tag to the VIV consisting of five bits. Other bits are unused.

24. Compressed Vector Operand: The SPU supplies the base (BC) (starting element of the vector set) and the length of the vector set (LC). The other three half words are unused.

24. Compressed Vector Result: The SPU supplies the base (BC) (starting element of the vector set) and the length of the vector set (LC). The other three half words are unused.

As the VIV 83 reads each entry it receives the information either for entry in the internal registers of the VIV 83 or to be transmitted to the VPQ 85. Values in internal registers will be used during subsequent VIV 83 processing of vector operators and operands. The processing consists of absolute address computation and relative address validation. Before vectors may be processed, the Vector Form (VF) parameters are validated. Any bit vectors associated to a vector function are checked for self consistency. Certain housekeeping computations and checks are performed with each individual type of VIV 83 entry.

The major function of the VIV 83 is to provide early detection of logical errors in the vector function as opposed to delayed detection by the Memory Indexing, Alignment 21 or Arithmetic 17 Units. The sequence of vector instructions fed into the VIV 83 is examined for correctness in ordering and association with individual vector functions. Each type of instruction has appropriate checks made to ensure the validity of the parameters supplied to describe the vector function. These checks are described in subsequent paragraphs.

Each instruction to the VIV 83 is processed by the VIV 83 in one major cycle. The VIV 83 contains local registers for storing parameters. An instruction may modify the values of some of the local registers. The contents of the local registers may be used to compute the fields to be transferred to the VPQ 85.

The Vector Parameter Queue (VPQ) 85 is a first-in-first-out queue. An entry for the VPQ 85 may be prepared by the VIV 83 every two or more minor cycles. An entry will be consumed at most every major cycle. The VPQ 85 is a passive unit in that it merely stores data presented to it but does not act on it.

The Central Indexing Unit (CIU) 87 stores vector set descriptors, scalar descriptors, bit vector descriptors and compressed vector descriptors supplied by the VPQ 85, performs the operations needed for clicking and superclicking, and produces initial memory addresses, alignment network tags and constants. It also produces some control information for the Template Control Unit 89. The CIU 87 is subdivided into two subunits:

1. Vector Set Descriptor Memory (VDM) 97 which is the descriptor buffer and working storage for the Central Index Arithmetic Unit 99.

In the preferred embodiment, the size of the VDM 97 is 16 words, each word consisting of 188 bits. Thus the VDM 97 holds up to 16 descriptors wherein each descriptor represents a complete vector set.

A vector set descriptor generally represents a vector set of Array Memory 15.

As shown in the figure, the scalar descriptor represents the vector result either to Array Memory 15 or Task Memory 27. In this case one value is generated every superclick.

The VDM 97 is used by two units of the ACU 31, namely, the Vector Parameter Queue (VPQ) 85 and the Central Index Arithmetic Unit (CIAU) 99. For the VPQ 85 the VDM 97 is a write only storage. The CIAU 99 reads data from the VDM 97 and, after manipulating certain fields, writes the data back to the VDM 97. In addition, the VDM 97 also supplies addresses to the TCU 89.

The Central Index Arithmetic Unit (CIAU) 99 performs the following three operations:

1. Clicking and Superclicking operations for descriptor: When the length of a vector is more than a superword, central indexing for consecutive superwords is performed by updating certain fields. This operation is called a click operation. It is simple to perform because the increment between successive elements (d) of the vector is the same, and the starting element of the next superword can be calculated from the starting element of the previous superword by adding d·N, where N is the length of a superword. The length of the vector is reduced by N elements every time a click is performed. In superclicking, parallel vectors of a vector set are indexed by hardware. This is possible because all have the same `d` and the distance between the starting elements of successive vectors is constant, denoted by D.

3. Generation of control information for the TCU 89: The CIAU 99 supplies a control bit to the TCU 89 indicating the type of descriptor being involved. A logical zero control bit indicates a scalar result to the Task Memory 27 whereas a logical one control bit indicates a scalar/vector result to the Array Memory 15.
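The click and superclick indexing described in item 1 above can be sketched as simple index arithmetic. The function names are hypothetical, and indices are expressed as element positions relative to the array base; the hardware field layout is not modeled.

```python
def click_starts(start, d, length, n):
    """Yield (starting element index, element count) for each superword.

    start:  index of the first element of the vector
    d:      increment between successive elements
    length: number of elements in the vector
    n:      superword length N (number of AEs)
    """
    while length > 0:
        yield start, min(length, n)
        start += d * n   # next superword begins d*N further on
        length -= n      # the remaining length drops by N per click


def superclick_starts(start, big_d, vectors):
    """Starting element index of each parallel vector in a vector set.

    big_d is the constant distance D between the starting elements
    of successive vectors.
    """
    return [start + v * big_d for v in range(vectors)]


# A 40-element vector with d = 3, sliced into superwords of N = 16.
slices = list(click_starts(start=0, d=3, length=40, n=16))
# Three parallel vectors whose starting elements are D = 100 apart.
vector_starts = superclick_starts(start=10, big_d=100, vectors=3)
```

The final slice covers only the 8 elements that remain after two full superwords.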

The Template Control Unit 89, see FIG. 13, functions to accept "vector form" requests from the Vector Parameter Queue 85 and to control the execution of this "form" on the Array 81. Vector forms are performed by execution of a sequence of templates. The TCU 89 specifies the sequence of templates, initiates them properly and controls their execution using microprogrammed code for each template. The TCU 89 also controls Array Memory 15 cycles for I/O.

Since one vector form may require more than one template execution, the TCU 89 may be controlling different superword slices of the same vector form at a time. These templates are interfaced by the TCU 89 such that no conflict occurs in allocating array 81 pipe units to different templates, as described below.

Vector descriptors are stored in VDM 97 in sequence at increasing VDM 97 addresses. The sequence is bit vector operand (if any), Bit vector result (if any), first VD, second VD, etc. This order allows TCU 89 to compute VDM 97 address by following equation:

where x is the VDM 97 address of the first descriptor of the VF, OBVPRES = 1 only if operand bit vector is present and RBVPRES = 1 only if result bit vector is present.
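The equation itself does not survive in this text. Under the stated ordering (operand bit vector if any, result bit vector if any, then the vector descriptors in sequence), one reading consistent with the definitions of x, OBVPRES and RBVPRES is sketched below. The formula, the 1-based descriptor index i and the function name are assumptions for illustration, not the patent's own expression.

```python
def vdm_address(x, i, obvpres, rbvpres):
    """Assumed VDM address of the i-th vector descriptor (i = 1 is the first VD).

    x:       VDM address of the first descriptor of the vector form
    obvpres: 1 only if an operand bit vector is present, else 0
    rbvpres: 1 only if a result bit vector is present, else 0
    """
    if obvpres not in (0, 1) or rbvpres not in (0, 1):
        raise ValueError("presence flags must be 0 or 1")
    # Bit vector descriptors, when present, occupy the slots before the VDs.
    return x + obvpres + rbvpres + (i - 1)


first_vd = vdm_address(x=4, i=1, obvpres=1, rbvpres=0)
second_vd = vdm_address(x=4, i=2, obvpres=1, rbvpres=1)
```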

The TCU 89 can produce a basic control word (BCW) every major cycle. This word is the logical OR of the microsequence words of up to 3 active templates. Certain special conditions modify the resulting control word. The resulting control word specifies the operations of the units in the Array 81 pipe.
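The formation of the basic control word can be sketched as a bitwise OR over the current microwords of the active templates. Microwords are modeled here as plain integers with arbitrary bit assignments; the actual microword format is not given in the text, and the special conditions that modify the result are not modeled.

```python
from functools import reduce
from operator import or_

MAX_ACTIVE_TEMPLATES = 3  # at most three templates execute in parallel


def basic_control_word(microwords):
    """Combine the current microword of each active template into one BCW."""
    if len(microwords) > MAX_ACTIVE_TEMPLATES:
        raise ValueError("no more than 3 templates may be active")
    # With no active template the control word is all zeros.
    return reduce(or_, microwords, 0)


# Hypothetical microwords, each driving a different pipe unit.
bcw = basic_control_word([0b0001, 0b0100, 0b1000])
```

Because the TCU interleaves templates so that no two claim the same pipe unit in the same cycle, the OR merges their control bits without one template overriding another.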

Vector form requests from the VPQ 85 are accepted by the TCU 89 and are buffered therein.

A VF request from VPQ 85 consists of one "Write VF1" request and then after a few cycles another "Write VF2" request. The second request signifies a valid VF as checked by VIV 83.

A VFRFULL signal is sent to VPQ 85 if TCU 89 is fully buffered. The VPQ 85 will not send any request if the TCU 89 is fully buffered.

A VF is a sequence of templates. Execution of a template is performed by serially executing template microsequence cycle by cycle. The TCU 89 fetches 3 microwords (maximum 3 templates may be executing in parallel) one major cycle before the units in pipe are to receive control signals.

An access to superword with all its elements in one memory 15 module requires one memory cycle for each element access. If any operand or result vector has a vector conflict (VC) then the VIV 83 detects it and sets a condition bit in the TCU 89. The TCU 89 while processing such a VF will force superword size to be 1 for CIU 87 indexing. Thus, only one element slice of VF is processed by each template. This makes the execution time to be about N times the execution time without a VC. N is the superword size of the template assuming no VC.
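The all-or-nothing character of a vector conflict can be illustrated under a simplifying assumption: that an element at address a resides in memory module a mod 17, with 17 being the module count of the preferred embodiment. The real index and tag computations are those of the referenced memory system and may differ; the names and the superword length of 16 used below are likewise assumptions.

```python
MODULES = 17  # prime number of memory modules in the preferred embodiment


def modules_touched(start, d, n):
    """Module number of each of the n elements of one superword."""
    return [(start + k * d) % MODULES for k in range(n)]


def has_vector_conflict(start, d, n):
    """True when every element of the superword falls in a single module."""
    return n > 1 and len(set(modules_touched(start, d, n))) == 1


def memory_cycles(start, d, n):
    """Memory cycles needed to access one superword of n elements."""
    return n if has_vector_conflict(start, d, n) else 1


# Assumed superword length of 16 elements.
no_conflict = modules_touched(5, 3, 16)    # stride 3: all modules distinct
conflict_cycles = memory_cycles(0, 34, 16)  # stride 34 is a multiple of 17
```

Because 17 is prime, a stride that is not a multiple of 17 places up to 17 consecutive elements in distinct modules, while a stride that is a multiple of 17 places them all in one module. This matches the earlier note that no intermediate case arises.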

Template microsequence code assumes that the AE 17 operation time is two major cycles, but for certain operations (e.g., 1/x, divide, etc.) the AE 17 requires a longer operation time. The AE 17 operation time is an integer multiple of the major cycle. The TCU 89 adjusts the generated microsequence to allow different AE 17 operation times. Long AE 17 operators have two control bits indicating the time to finish (TF) to be greater than 1, 2 major cycles. Such a condition inhibits incrementing of the template microsequence. The timing relationships are explicit in the templates. Certain AE 17 operations require only one major cycle, and the template will explicitly allocate only one cycle.

A scalar result is specified by a special bit in the vector descriptor (VD) in the VDM 97. The VD also specifies the destination to be the AM 15 or the TM 27. The CIU 87 sends a signal to the TCU 89 if the result destination is the TM 27. In this case, the CIU 87 deposits the destination address in the CIU 87 and modifies the VD in the VDM 97. The TCU 89, on receipt of this signal, inhibits the AM 15 store cycle for the scalar result. The data path from the OAN 23 is tapped by the TCU 89. Under TCU 89 control, the scalar result from the OAN 23 and the destination address from the CIU 87 are buffer loaded. Then, the TCU 89 sends the result to the specified address in the TM 27. A scalar result is always obtained from AE-O, but for diagnostic purposes any AE can be selected.

Memory cycles for I/O are allocated by the TCU 89. A free memory cycle is referred to as a "hole". If a hole is not found, operations in the CIU, MIG, IAN, AE, and OAN are stopped for a cycle and the status of these units is kept frozen. The vector operation continues after this freeze. This kind of cycle stealing is referred to as a "vertical freeze". Memory cycles for I/O are allocated only when demanded by the FMC 45. For I/O requests, holes are searched for 8 cycles and if no hole is found, a vertical freeze is used during the eighth cycle.
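The hole search and vertical freeze policy above can be sketched as follows. The list-of-booleans representation of memory activity and the function name are assumptions for illustration; only the 8-cycle search window and the freeze on the eighth cycle come from the text.

```python
HOLE_SEARCH_CYCLES = 8  # holes are searched for 8 cycles


def allocate_io_cycle(memory_busy):
    """Pick the major cycle in which a pending I/O request gets the memory.

    memory_busy: sequence of booleans, one per major cycle starting at the
                 request; True means the vector operation needs the memory.
    Returns (cycle offset, used_vertical_freeze).
    """
    window = list(memory_busy)[:HOLE_SEARCH_CYCLES]
    if len(window) < HOLE_SEARCH_CYCLES:
        raise ValueError("need the memory schedule for 8 cycles")
    for offset, busy in enumerate(window):
        if not busy:
            return offset, False  # a hole: the free memory cycle is used
    # No hole in the window: freeze the pipe units during the eighth cycle.
    return HOLE_SEARCH_CYCLES - 1, True


hole_found = allocate_io_cycle([True, True, True, False, True, True, True, True])
freeze_needed = allocate_io_cycle([True] * 8)
```

This bounds the wait of an I/O request to eight major cycles while costing the vector operation at most one frozen cycle per request.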

CONTROL AND MAINTENANCE UNIT

The fourth and final unit within the Control Unit 11 is the Control and Maintenance Unit 33. This unit monitors the Parallel Task Processor 41 in terms of hardware status and performance. Maintenance logs are kept which automatically log errors and the locations thereof. Eventually, the error data is transferred to the Front End Processor 25 for final storage or analysis. Also, for performance evaluation purposes, hardware use statistics are logged. Normally, the use statistics are transferred to the Front End Processor 25 at the end of each program, but the transfer may be preprogrammed to occur at intermediate program points for examining specific operating details of a given program.

Communications (both input and output) with the Front End Processor 25 are handled through appropriate communication buffering techniques within the CMU 33. Normally, the Scalar Processor Unit 29 provides the control intelligence for the Control Unit 11. However, in the preferred embodiment, the CMU 33 includes the capacity to execute a primitive set of instructions which allow it to perform its monitoring tasks and to seize control of the Scalar Processor Unit 29 for cold starting, for fatal error conditions, and for debug single stepping control purposes.

ARRAY

Parallel or vector operations occur in the Array 81.

A complete disclosure of the apparatus and operation of "Multidimensional Parallel Access Computer Memory System" suitable to implement the Array 81 is given in copending U.S. Patent Application, Ser. No. 682,526, filed May 3, 1976, by D. H. Lawrie and C. R. Vora and assigned to the assignee of the present invention. The above-cited Ser. No. 682,526 patent application is incorporated herein by reference.

With reference now to FIG. 1, vector elements are stored in the Array Memory 15 comprising in the preferred embodiment 17 memory modules each implemented as LSI bipolar devices. The prime number 17 preserves the desired conflict free access characteristic of Applicants' invention. Each Array Memory 15 word comprises 56 bits and includes Hamming code for one bit error correction. Preferably, the Array Memory 15 accommodates one megaword.

The vector elements stored in the Array Memory 15 are accessed in parallel via the Memory Index Generator 91 in the Memory Interface 19. The accessed vector elements are then aligned with the appropriate Arithmetic Element 17 module via the Input Alignment Network 21 as directed by the Input Alignment Network Tag Generator 93. The Input Alignment Network 21 is implemented in the form of a 56 bit crossbar.

Vector operations are organized as sequences called templates which are executed in lock-step fashion in the Arithmetic Element 17 under the control of a microsequence 101 functioning in response to the Template Control Unit 89 as above-described. Simple combinatorial logic sequences serve as an efficient approach to process a plurality of distinct instructions in each Arithmetic Element 17 module. Arithmetic operations such as floating point add, subtract, or multiply are rapidly performed on the vector elements.

Vector results are returned to the Array Memory 15 via an alignment process in the Output Alignment Network 23 corresponding to the above-described alignment process in the Input Alignment Network 21.

EPILOG

Although the present scientific parallel processing architectural invention has been described with a certain degree of particularity, it should be understood that the present disclosure has been made by way of example and that changes in the combination and arrangement of parts obvious to one skilled in the art, may be resorted to without departing from the scope and spirit of the invention.
