We begin with the intent to create a new programming paradigm that extends and improves on the Object Oriented Programming (OOP) paradigm and replaces the flawed Component Object Model (COM). We aim to permanently merge code and data into a new component structure that is stored in a single file format (called the Component Executable, or CXE, format).

Over time, programmers have evolved languages with more complex grammars and made increasingly flexible use of libraries of existing code to leverage each other's work. CBOOP is intended to step further in this same direction, making the component an easily constructed network of modular, interdependent structures. The division between library, executable and data file will be eliminated.

The trend toward modular, reusable, interoperable components has gained ground with the Component Object Model (COM). This allowed programmers to create software, then replace one piece at a time, without rebuilding the entire house. As COM evolved, programmers sought to make code which could run on multiple platforms, sharing resources. The result was the Distributed Component Object Model (DCOM), a new dimension in scalability. Unfortunately, COM and DCOM fail to consider the security risks associated with their usage. Further, they lack any native language support and are cumbersome at best.

Though COM and DCOM may be implemented in virtually any language, the hoops a programmer must jump through are unnecessarily extensive and increase the chance of error. Further, since DCOM allows programs on one machine to execute code on another machine via Remote Procedure Call (RPC), the lack of effective security has led to many serious, recurring security exploits over the years. Successive patches have failed to resolve these issues, due to fundamental flaws in the model itself.

[b][blue]INTRODUCTION: The New Paradigm[/b][black]

CBOOP aims to approach the same component-software end from a new angle. Component-Based Object-Oriented Programming (CBOOP) is a language-based paradigm that extends Object Oriented Programming (OOP) in a way which is more consistent and powerful. In fact, components are called "public classes," while internal OOP classes are "private classes."

Software developed with CBOOP will not be created as one executable, supported by libraries with separate data files. Rather, CBOOP software is a network of small, reusable, interdependent component-objects. Each is contained in a Component Executable (CXE) file maintaining its own data. Code and data are merged permanently to improve portability and security.

In theory, a software package could consist of peripheral tools and a core component. Each stores its own parameter information internally, where corruption and misuse are less likely. One such "peripheral" might store the software package's product data, the actual result of its operations. This is akin to a Microsoft Word document file containing both the document data and the routines necessary to store and retrieve that data. Other components then link to this component-document at run-time to provide other features--such as editing and printing.

CBOOP implements extensive security through an Object Communication Protocol (OCP) handled by the OSloader. This protocol maintains a working relationship between components, authenticating components by their Universal Identifiers (UIDs), an MD5 hash of a given component, interface or object.
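The UID mechanism described above can be sketched as follows. The function name and the choice of hashing the raw component image are assumptions for illustration; the text specifies only that the UID is an MD5 hash of a given component, interface or object.

```python
import hashlib

def component_uid(component_bytes: bytes) -> str:
    """Derive a Universal Identifier (UID) as the MD5 hash of a
    component's binary image, per the OCP description.  (The exact
    input to the hash is an assumption here.)"""
    return hashlib.md5(component_bytes).hexdigest()

# Identical images yield identical UIDs; any change to the image
# changes the UID, which is what makes the UID tamper-evident.
uid1 = component_uid(b"component image v1")
uid2 = component_uid(b"component image v1")
uid3 = component_uid(b"component image v2")
```

Because the UID is derived from the component itself, a modified component cannot impersonate the original: its hash, and therefore its identity, changes.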

Writing CBOOP software will be simpler than writing OOP software. OOP programmers have had to fit round pegs (OOP classes) into square holes (structured, linear executables). The CBOOP model eliminates the traditional main() routine. Instead, a CXE file can contain multiple components, the constructor for each being a separate launch point for the file. But doing this requires a new language.

[b][blue]Introduction: The New Language (X)[/b][black]

The language is based on C++. Originally dubbed "c-squared," the language has been streamlined and cut down, warranting a more appropriate name, "X," since the language has been reduced to a system of expressions. X is a very simple and very scalable language; its compiler is likewise to be extremely modular and scalable.

X is actually two languages: algorithmic X and directive X. The first produces executable code, i.e. algorithms, to solve problems. The second language directs the compiler on how to manipulate and process the algorithmic code, so as to create new levels of abstraction...such as templates.

For this discussion we are concerned with "algorithmic X," to which we refer simply as X. Under X, there are no statements. Every structure is an expression. Components are defined as "public classes," as opposed to private OOP classes which are internal to the components that use them.

As classes, components are similar to OOP classes. Both have constructors and destructors. Both have functions to avail their features. Both provide for encapsulation, inheritance and overloading through their respective mechanisms. Yet the operation and terminology are a bit different. Publicly exposed component functions are called "interfaces," whereas privately scoped functions are called "methods." Further, though private classes can expose data directly to other classes by declaring the data objects public "properties" of the class, components are not allowed to expose data directly. Instead, a component may only expose its data through interfaces (public functions).

An example of the X language component definition would appear as follows--

[code]public class somecomponent;[/code]

Consistently, an internal class would appear as--

[code]private class someOOPclass;[/code]

Both are complete declarations. somecomponent and someOOPclass are identifiers now known to the compiler as component and internal class, respectively. However both are unusable at this point, since they lack definition. We define the class using the link (<<) operator. This could occur during the declaration, as follows:

[code] public class somecomponent << {...[code block]...};[/code]

to fully define somecomponent. Or, the programmer could state a declaration first, as follows:

[code] public class somecomponent;[/code]

then later specify its structure:

[code] somecomponent << {...[code block]...};[/code]

Notice here that the link operator connects a code block and class identifier. Both means of doing this achieve the same end. Class and << are operators in these expressions. {...} is also an operator. All other identifiers in these expressions are considered to be operands (data manipulated by the operators).

One feature of X is its consistency. As everything is an expression, we know that the compiler will find only two types of tokens: delimiters and identifiers. Anything that is not a delimiter is an identifier. Delimiters are symbols which separate identifiers into discrete units. There are literal and contextual delimiters. Literal delimiters include line remarks ("//...[CRLF]"), block remarks ("/*...*/"), quotation marks, semicolons (";") and whitespace. Contextual delimiters are less clear. These are the instances where grammar causes the separation of two character sequences into separate tokens. For example, if "+" is an operator and the compiler sees "a+b", then "a" and "b" are separated from "+" into distinct tokens due to context.

Identifiers in X are either operators or operands. In the earlier example, "a+b" was delimited because "+" was defined as a known operator-identifier. Even if "a" and "b" are unknown, because "+" is known, the two unknowns can be parsed as the Left Hand and Right Hand sides (LHS and RHS, respectively).
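The contextual-delimiter rule above can be sketched with a minimal tokenizer. The operator set and function name are illustrative, not part of the X specification:

```python
OPERATORS = {"+", "=", "<<", ";"}  # illustrative subset of X operators

def tokenize(source: str) -> list[str]:
    """Split a character stream into tokens.  Whitespace is a literal
    delimiter; a known operator sequence acts as a contextual delimiter,
    splitting "a+b" into three tokens even with no whitespace present."""
    tokens, current, i = [], "", 0
    while i < len(source):
        # try the longest operator match first ("<<" before shorter ones)
        for op in sorted(OPERATORS, key=len, reverse=True):
            if source.startswith(op, i):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(op)
                i += len(op)
                break
        else:
            ch = source[i]
            if ch.isspace():            # literal delimiter: flush and drop
                if current:
                    tokens.append(current)
                    current = ""
            else:
                current += ch           # accumulate an identifier
            i += 1
    if current:
        tokens.append(current)
    return tokens
```

Even though "a" and "b" are unknown identifiers, the known operator "+" is enough to delimit them, exactly as the text describes.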

Operators are identifiers which manipulate values passed by operands. Operands are objects (public or private) which are acted upon by operators. Every operator then has two operands (explicit or implicit): the LHS and RHS. The result of an operation is the return value and is passed to the left by default, but does not by default change the LHS operand value. Accordingly, in X, in the expression

[code] x = a + b;[/code]

the addition of a and b does not alter a. Instead, the sum is the RHS operand for the assignment operator ("="); the assignment operator is coded to first assign the RHS value to its LHS operand, then return the RHS value to the left (if possible).

[code][NOTE: For the compiler design this means X source code can be parsed into a binary tree structure!

An identifier is then represented by a node in the tree, its LHS and RHS are the left and right downward branches and the return value is implied as being passed up the tree through the PREV link.][/code]
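The note above can be illustrated with a minimal node structure. The class and function names are illustrative; only the three-link shape (LHS, RHS, and a PREV link up which the return value passes) comes from the text.

```python
class Node:
    """One token in the parse tree: lhs and rhs are the downward
    branches; prev points back toward the root, the path along which
    the return value is implicitly passed up."""
    def __init__(self, token, lhs=None, rhs=None):
        self.token, self.lhs, self.rhs = token, lhs, rhs
        self.prev = None
        for child in (lhs, rhs):
            if child is not None:
                child.prev = self

def evaluate(node, env):
    """Post-order walk: operands return their values; operators combine
    the values returned by their LHS and RHS and pass the result up."""
    if node.token == "=":              # assignment: bind RHS value to LHS name
        value = evaluate(node.rhs, env)
        env[node.lhs.token] = value
        return value                   # return value travels up via prev
    if node.token == "+":
        return evaluate(node.lhs, env) + evaluate(node.rhs, env)
    return env[node.token]             # operand: look up its value

# tree for "x = a + b"
tree = Node("=", Node("x"), Node("+", Node("a"), Node("b")))
env = {"a": 2, "b": 3}
result = evaluate(tree, env)
```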

There is another neat aspect of X: objects are represented by identifiers, not bound by them. An object identifier (whether operator or operand) represents the object only so long as it is linked to the object. An object identifier tracks the ADDRESS in memory where the object is stored and the size of the object's footprint in memory. X is strongly typed but not strongly linked. Invoking an operator, then, causes the operator to take as input the references to its operands (LHS and RHS), NOT THE VALUES! The operator can then directly affect or protect the operands. The operator's return value is likewise a reference to some temporary memory location allocated for the purpose and later recovered through garbage collection.
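The identifier-object indirection described above can be sketched as follows. The memory model (a plain bytearray) and all names here are assumptions for illustration:

```python
# A toy memory pool and an identifier table.  An identifier records only
# the address and size of the object it currently represents; relinking
# the identifier to another address moves no data -- "strongly typed but
# not strongly linked."
memory = bytearray(64)
identifiers = {}          # name -> (address, size)

def link(name, address, size):
    """Bind an identifier to an object's location and footprint."""
    identifiers[name] = (address, size)

def read(name):
    """Dereference an identifier: fetch the bytes it currently tracks."""
    address, size = identifiers[name]
    return bytes(memory[address:address + size])

memory[0:4] = b"\x01\x00\x00\x00"     # an int stored at address 0
memory[8:12] = b"\x2a\x00\x00\x00"    # another int at address 8

link("n", 0, 4)
first = read("n")
link("n", 8, 4)                        # relink: same identifier, new object
second = read("n")
```

An operator in this model would receive `(address, size)` pairs, never copies of the values, so it can affect or protect the operands directly.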

Since object identifiers represent the memory location of their respective operators and operands, X permits class and operator definitions a new level of freedom. For instance, a class could be declared as follows:

[code]public class foo;[/code]

Then it could be defined--

[code]foo << {...[code block]...};[/code]

as we witnessed earlier. But X also allows the programmer to state, later in the code,

[code]foo << {...[more code]...};[/code]

and assuming foo was not deleted (i.e. the original definition was not erased), "more code" would add features to those given by "code block." Moreover, since {...} is a "group operator" (which we haven't discussed) that returns the memory location for the contained code block, we could also define another class--

[code] public class moo << {...[something else]...};[/code]

and link moo and foo together like this--

[code] moo << foo;[/code]

The result is a moo class which inherits the foo class at the point of its last definition before moo linked into its identity.

This brings up another interesting point about the X language: class definitions contain expressions. Therefore, they are algorithmic descriptions of the construction of a class instance. When a class is instantiated, its constructor is launched. Implicitly, the code produced by the class definition blocks is the constructor's preamble. Its new and delete operators are constructor and destructor, respectively.

Class instances in X are created using the following format--

[code]instance new classname;[/code]

A class may also have a special array operator that is invoked with new, as follows--

[code]instance new classname array somenumber;[/code]

Example:[code]n new int array 5[/code]

creates a 5-element integer array whose memory location is stored in some point represented by n. We can retrieve that location by invoking the ampersand ("&") operator as follows:

[code]p = &n;[/code]

where p represents some properly declared object.

[Note: p is declared as follows... array p; to be an unknown array object. ]

Inheritance of classes, demonstrated earlier with the link operator <<, also affects the definition of operators in X. Remember, X has NO FUNCTIONS OR PROCEDURES, only operators. An operator is defined as--

[code][returntype] operator ([LHStype][operatorid][RHStype]);[/code]

Example:[code]int operator (int + int);[/code]

where int (an integer data type) is the LHS and RHS of a new operator (+) having an integer return type. Here the parenthesis is a special operator in the context of the operator operator.

But the above example, like in the case of a class declaration, fails to provide any real definition for the new operation. This may be done concurrently, as below:

[code]int operator (int + int) << {...[code block]...};[/code]

However, reference to the + operator must include its type signature, since operators can be overloaded and code must be linked only to specific instances of the operator.

Linking code and class definitions in X also means that code blocks can be defined separately and linked later, during run-time. Thus, a programmer in X can state--

[code]a = &{...[Code A]...};[/code]

to store the address of code block A in the object a, presumably a properly typed object. Later the programmer could define--

[code]b = &{...[Code B]...};[/code]

to the same end. Then the programmer could merge the two into a single routine by stating--

[code]c = &{{&a}<<{&b}};[/code]

where {...} accepts the references to the code and << links the two code structures together before storing the address in c.

[Note: In this example a, b and c are first declared as unknown operators: operator a; operator b; operator c; But none of the three can be used as operators in this context; they are only references to objects of the type, since they have no type signature.]
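In a host language, the run-time linking of code blocks can be modeled by composing references to callables. This is a loose analogy for the mechanism described above, not the X mechanism itself; all names are illustrative:

```python
# a and b hold references to code blocks; neither runs when defined.
log = []
a = lambda: log.append("Code A")
b = lambda: log.append("Code B")

def link(first, second):
    """Return a new code block that runs first, then second --
    the effect of c = &{{&a}<<{&b}} in the text."""
    def chained():
        first()
        second()
    return chained

c = link(a, b)
c()          # invoking c runs both blocks, in order
```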

The X compiler is to be modular, capable of later revision into the CBOOP scheme as separate interoperating components. Each module of the compiler should provide either a service to the compiler or represent a phase of the compile process. We define the compiler using the following schematic:

Each component-role is defined as follows:

[red][b]"INPUT: SOURCE CODE"[/b][black] -- We assume here that this is an arbitrary stream of ASCII text. In the prototype this stream will naturally be a text file with the extension .xxx. However, since CBOOP merges code and data, the eventual source code storage point will be in the SourceStream component object itself.

[red][b]Source Stream[/b][black] -- This component represents the source input in its native form. This is the storage/retrieval mechanism for X source code and will be added onto considerably with time. As the schema above depicts, the SourceStream will eventually call on the services of the HTML stripper (see discussion of hypersource code elsewhere), which will filter out characters unnecessary to compilation of the code. SourceStream provides two basic interfaces to the compiler: SS.scan() and SS.lookahead(). These interfaces allow client modules to scan a single character from the buffer and look ahead at the next character which will be scanned. scan() and lookahead() will never allow filtered characters to be viewed.
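A sketch of the SourceStream contract follows. The filter predicate stands in for the HTML stripper described next; the class layout is an assumption, with only the scan()/lookahead() interfaces and the never-expose-filtered-characters rule taken from the text:

```python
class SourceStream:
    """Buffers source text and exposes scan() / lookahead().
    Characters rejected by the filter predicate are never visible
    to client modules, per the SS.scan()/SS.lookahead() contract."""
    def __init__(self, text, keep=lambda ch: True):
        # pre-filter so lookahead can never expose a filtered character
        self._chars = [ch for ch in text if keep(ch)]
        self._pos = 0

    def lookahead(self):
        """Peek at the next character without consuming it."""
        if self._pos < len(self._chars):
            return self._chars[self._pos]
        return None

    def scan(self):
        """Consume and return the next character."""
        ch = self.lookahead()
        if ch is not None:
            self._pos += 1
        return ch

# toy filter: drop a control character the compiler never needs to see
ss = SourceStream("a\x00+b", keep=lambda ch: ch != "\x00")
peeked = ss.lookahead()
scanned = ss.scan()
```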

[red][b]HTML Stripper[/b][black] -- This stripper will be used by the SourceStream to filter out characters unnecessary to compilation. The stripper module will analyze two characters at a time and return a pass-fail result for the first character in a given test. Its primary interface is Stripper.test().

[red][b]Tokenizer[/b][black] -- The tokenizer scans characters from the source stream to determine whether each character is a delimiter or part of a token. Its input is a single character and its output is a token object. This object is loaded with known information from the Symbol Table (which creates and issues new tokens) and the lexeme the Tokenizer recognizes from the code input. Tokenizer.scan() scans characters until it can produce a single token. Tokenizer.lookahead() returns the next token to be scanned (which is waiting in the scan buffer).

[red][b]Token[/b][black] -- The token object is the medium of exchange between compiler modules. It is the smallest logical unit in the compiler's view of a language. Tokens are either identifiers or delimiters. The token must be capable of fitting into the structure of the ParseTree and the SymbolTable.

[red][b]SymbolTable[/b][black] -- This is the mechanism which tracks identifiers as they are declared and allocates space therefor in the final product. The SymbolTable resolves and maintains the definition of identifiers representing objects (operators and operands) as well as types. Scope tracking is performed by the SymbolTable.

[red][b]LexicalAnalyzer[/b][black] -- This is the hub of the compiler. Here the two languages (algorithmic and directive) meet. The directive language of X, traditionally known as compiler directives, is passed to the preprocessor, while the algorithmic language of X is recognized and sent on to the parser.

[red][b]Preprocessor[/b][black] -- This is an interpreter which acts as a command shell for the compiler, accepting scripted commands in the source stream which can alter the source stream and the compiler's operation thereupon.

[red][b]Parser [/b][black]-- The Parser has three responsibilities: (1) Create a diagrammatic representation (ParseTree) of the source stream which is language neutral. This diagram is a binary tree representing each expression of the source code in a linear manner. (2) Verify the syntactic and semantic correctness of the diagrammed language. (3) Reduce the ParseTree to its smallest size, eliminating redundancy and waste.

[red][b]ParseTree[/b][black] -- The ParseTree is a binary tree with special properties. Each node of the tree is a token. Each token is first defined as legitimate in the symbol table. Each token has an assigned scope identifier which is recognized in the symbol table. The two child nodes of each token in the tree are the LHS and RHS, respectively. These are the source of operand values for a given node. Likewise each node is either an LHS or RHS to some root node, and accordingly, it will return its resulting value to the PREV node.

[red][b]Code Generator [/b][black] -- The code generator will accept a ParseTree as input and produce two output streams: (1) an assembly language version of the output and (2) a binary component executable (CXE) file.

[hr][blue][b]Author's note:[/b][black]

Please understand that this is a very brief outline of the compiler design. Some things have yet to be worked out. Others are pretty well set.

The primary weakness in this design is the lack of an error handling subsystem. I will add that with time. The second weakness is lack of a code generator plan. It is my intent that the code generator should be so modular that the same parsetree can be passed to multiple code generators to yield binaries for multiple machine language platforms.

[blue][b]Problem:[black][/b]

Establish parameters for a diagrammatic structure capable of representing the meaning of processed source code which is neutral to the source language, in preparation for conversion to an arbitrary object language.

[blue][b]Solution:[black][/b]

Implement ParseTrees using a modified binary tree structure in which each node is a single token having three links. The first link is the PREV pointer to some root node, the second link is the LHS pointer to some token which was detected to the left of the given token, and the third link is the RHS pointer to some token which was detected to the right of the given token.

A ParseTree is sorted by order of appearance and order of precedence. Order of appearance is the order in which a token appears in the source stream sequence. Order of precedence is the sequence in which a token should be evaluated (e.g. multiplication should precede addition).

Thus the basic tree, below,

[code]
     PREV
      |
      v
    TOKEN
    /    \
 LHS      RHS
[/code]

is implemented as

[code]
     ROOT
      |
      v
 ASSIGNMENT
   /      \
 op(a)    op(b)
[/code]

to represent the expression "a=b", where "ROOT" is some unknown preceding expression, = is the assignment operator and both a and b are the operands.

Expressions represented by ParseTrees are continually sifted for precedence. However, precedence is limited by the semicolon delimiter. Once a semicolon is encountered, evaluation of precedence stops, since all terms before the semicolon are considered to be a separate expression. Likewise, tokens contained in group operators (such as braces, brackets and parenthesis) are evaluated only within the boundaries of the group operator.
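The precedence-and-appearance ordering described above can be sketched with a precedence-climbing tree builder. This is a batch formulation for clarity; the actual ParseTree sifts incrementally via submit(), and the precedence table here is an illustrative subset:

```python
PRECEDENCE = {"=": 1, "+": 2, "*": 3}   # illustrative subset

class Node:
    """A parse-tree node: token plus LHS and RHS branches."""
    def __init__(self, token, lhs=None, rhs=None):
        self.token, self.lhs, self.rhs = token, lhs, rhs

def build(tokens):
    """Precedence climbing: lower-precedence operators end up nearer
    the root, so "a = b * c + d" roots at "=", with "+" above "*"."""
    tokens = list(tokens)

    def parse(min_prec):
        node = Node(tokens.pop(0))          # an operand
        while tokens and PRECEDENCE.get(tokens[0], 0) >= min_prec:
            op = tokens.pop(0)
            rhs = parse(PRECEDENCE[op] + 1) # bind tighter ops to the right
            node = Node(op, node, rhs)
        return node

    return parse(1)

tree = build(["a", "=", "b", "*", "c", "+", "d"])
```

Evaluation order then falls out of tree depth: the deepest operators (highest precedence) return their values first, up through PREV links toward the root.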

The ParseTree is a closed environment. Tokens can be passed into the ParseTree structure blindly using ParseTree->submit() and later extracted using ParseTree->extract(). However, all other data operations are internal other than navigation.

ParseTree methods
[hr]
Method -- Description
[hr]
submit() -- Adds a token to the tree and sifts the tree for proper positioning.

extract() -- Copy-returns the current token (eliminating all links to the tree).

[blue][b]Problem:[black][/b]

Develop a C++ class which analyzes a ParseTree produced from some arbitrary source code stream and outputs equivalent assembly language and machine language files.

Maintain the modularity of the code generator, such that its basic parameters and framework can easily be adapted to other CPU instruction sets and possibly a machine-independent pcode.

[blue][b]Solution:[black][/b]

[No solution developed at this time]

[blue][b]Parameters:[black][/b]

Code generator must produce Component Executable (CXE) files. These files are to replace DLL and EXE files and will further require a loader for the target OS.

The Code generator must provide basic error handling and reporting services specific to its own needs, using a separate error handler class. The error handler class should perform its basic I/O using a dummy class named "errorhandler," which will be replaced later in the project with a full-fledged error handling facility.

The new programming paradigm requires a format for the component image, both in memory and on disk.

[blue][b]Solution:[black][/b]

Discussion is limited to the component image's disk state. Its memory state will be taken up later. We have two priorities in defining this format: (1) maintain a small disk footprint and (2) establish a high degree of scalability. CXE files must be able to contain multiple interdependent components efficiently.

We illustrate the component CXE file, below, and note that it consists of a 160 byte header and variable-length body of up to 4GB.

The 160-byte header begins with the FileVersion field specifying the exact format which follows. Reading the file accurately depends on the value of this field, which is followed by a reserved FileAttributes field. At present this field is reserved for use under the AMI-OS project.

The header also specifies the number of components contained in the file. This is a time saver and a redundancy for integrity verification. Remember: a single CXE can contain multiple independent or interdependent components. (This allows enormous next-gen op/sys capabilities, as described in the AMI-OS design notes.)

A creation timestamp is also included as a security feature and a part of the AMI-OS design to be gradually introduced over the coming years.

Finally, the header references the major sections of the file body. These include the addresses of the first code page (Cptr), data page (Dptr) and Component Descriptor Table (CDTAddress). Each address (like all addresses in the CXE file) is measured relative to byte 160 of the CXE file.
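To make the 160-byte header concrete, here is a hedged sketch using struct.pack. The field widths, byte order, and padding layout are assumptions; the text fixes only the field names (FileVersion, FileAttributes, component count, creation timestamp, Cptr, Dptr, CDTAddress) and the 160-byte total:

```python
import struct
import time

# Assumed layout: little-endian, 4-byte FileVersion, 4-byte
# FileAttributes, 4-byte component count, 8-byte creation timestamp,
# three 4-byte section addresses (Cptr, Dptr, CDTAddress), then zero
# padding out to the fixed 160-byte header size.
HEADER_FMT = "<IIIQIII"
HEADER_SIZE = 160

def pack_header(version, attributes, n_components, ctime, cptr, dptr, cdt):
    fixed = struct.pack(HEADER_FMT, version, attributes,
                        n_components, ctime, cptr, dptr, cdt)
    # pad the remainder; reserved for fields to be specified later
    return fixed + b"\x00" * (HEADER_SIZE - len(fixed))

header = pack_header(version=1, attributes=0, n_components=2,
                     ctime=int(time.time()), cptr=0, dptr=4096, cdt=8192)
```

Reading the file would begin by unpacking FileVersion from byte 0, since (as the text notes) accurate interpretation of everything that follows depends on that field.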

Code pages, data pages and CDTs are all connected into separate linked lists. The header reference to the first node in these lists maintains the integrity of the file. These lists are doubly linked to minimize the chance of fragmentation. Where these pages and tables are maintained as linked lists by the CXE file for tracking allocated memory, however, they are also linked as parallel linked lists to build component assemblies which are defined by the Component Descriptor Table (CDT).

The CDT references data and code pages. These pages are blocks of disk memory linked together into scalable lists. Each data and code page is defined as either persistent or volatile. In the disk state, a persistent page is one which cannot be altered during run-time (i.e. read-only); whereas a volatile page is that which may be altered during run-time (i.e. read-write).

The CDT acts as a blueprint for the component. It references the code and data pages needed by that blueprint to assemble the required component. A single CDT defines one component. It may use any number of code and data pages to accomplish this mission. These same code and data pages may be cross-linked by several CDTs in separate linked lists defining separate components, where the contents of the code or data pages are identical. Yet this is only to compress the size of the CXE file and does not imply that the actual data and code are shared, or that the boundaries of a component are in any way compromised.

Below, we define the Component Descriptor Table (CDT). In doing so we reference byte positions from the start of the CDT block, rather than from the 160th byte of the file. This is done since the actual position would depend on the placement of the CDT in the file, a matter known at this time to be variable.

The CDT is a table of tables. We start with a CDTversion field which denotes the version of the Component Descriptor Table we are about to read. This will allow newer CDT formats to be stored in older CXE formats, and vice versa. CDTversion is followed by a UID, or universal identifier. The UID is an MD5 checksum and the core of CBOOP security. Each component and component instance (object) is uniquely identified by its UID.

The UID is followed by the CVSversion. CVSversion refers to the programmer's version number. A UID cannot accurately serve this function due to its higher sensitivity to a component's data. Rather, CVSversion allows the programmer to track the component by the source code from which it was produced.

Threadmodel and language are fields used by op/sys loaders to determine how a component will run. Will the component run within the same process space as the owner component, or separately? Is the binary language native to the CPU, or is an interpreter necessary? Likewise, component Attributes are a matter for OSloaders and are left for discussion elsewhere.

CTime and UTime maintain the creation and update timestamps for a component. These timestamps affect the UID and increase security around the system.

Thus far, the CDT has given general parameter information. What follows this is a set of tables which define the actual component. The first is the dependency table, which defines the components upon which the defined component depends. That is, if component "foo" uses the services of component "moo," then moo appears in the dependency table for foo.

Dependency Tables define the filename and UID for components used by a given component. However, CBOOP does allow for component inheritance. Inherited components are defined in the Base Component Tables (BCT), referenced by the BCTptr field in the CDT. This table is similar in format to the CDT Dependency Table, as illustrated below:

The BCT contains basic information about the component's storage location (filename), identity (UID) and inheritance scope (public, private, friend or protected). This table is of variable length, as determined by TableSize. It may range up to 4GB in size.

BCT and the Dependency Table allow the CDT to describe the foundation upon which a component will be assembled. However the "meat and potatoes" is the Instance Vector Table (IVT). This IVT serves two functions: (1) it defines the prototype component object and (2) it stores the persistent definition of all objects of a given component type.

IVTs are another level of detail for the component. Here the component data area is referenced, the component-object is identified by unique hash and interface and method vector tables are defined. But, further, the IVT is revealed to be part of a linked list of records. The table is scalable. As new instances are created the file can grow. Likewise, instances can be deleted from the table and the table can contract without much effort.

UIDs are digital signatures identifying each component-object. The component-prototype from which all other instances are created has the same UID as that stored elsewhere in the CDT. This is always the first record in the IVT, and though it is possible to change this prototype image under CBOOP, doing so changes the MD5 checksum for the component-- and thus the component identity itself. This is core to the security of the CBOOP architecture.

From the component-prototype, component objects are copy-created. That is, the prototype IVT is copied, as are the related tables and data area. The first to be copied is the data area with the default data image; this is followed by the method vtable, and the interface vtable last. The interface vtable then contains the component constructor, which is executed immediately following its copy creation to the new IVT.

The IVT-DAT format is straightforward. A CXE page number identifies each data and code page in the CXE. This number is used to address the proper data/code page in the file. From this number, the data area can be calculated as an offset within the page, having a given size.

The IVT Interface vTable is a linked list of descriptors which cover each interface exposed by a component. Each interface has its own UID, strongly typed parameter list and logical name. The interface itself is a chain of executable binary code represented by a linked list of code pages referenced by the Chainptr field.

Because methods are internal structures, they may be more rigidly defined by the compiler and require less information at run-time. Thus the method vtable only references the codechain where a method's code lies.

Code and data pages use the same format. They differ only in the type of information they contain. Even their attributes should be compatible. Primarily the attributes reflect persistence/volatility. Otherwise they are reserved for later specification under project AMI-OS.

In an earlier posting I defined the disk-state image of a component, i.e. the Component Executable (CXE) file format. Here I continue that discussion and define the memory-state image of the component. There are two priorities here: (1) maximize potential execution speeds and (2) reduce the memory overhead without sacrificing scalability and functionality.

[blue][b]Solution[/b][black]

The component-object in memory consists of two parts: the component-prototype and the component-object. Every component is prototyped initially. Components of the same class (type) share the prototype--a blueprint of the component object.

Component prototypes maintain all constant structure. That is, anything of the component which will not change is a part of the prototype. When new component instances are required, these areas are copy-created by the component constructor to new memory areas for the component-object. The component-prototype is similar to its disk-state image counterpart.

Components appear in memory as a collection of tables. The first table is the Component Descriptor Table (CDT).

Much of the general information contained in the CXE image CDTs is not present in the memory version. Rather this information is utilized and if needed tracked by the OSLoader. The CDT merely contains the CDTVersion, UID and a pointer to the Instance vTable.
