Saturday, August 22, 2009

In the previous two sections of my guide, I discussed basics of functions and then went into the specifics of types and names, as represented by the Elsa AST nodes. In this section, I will cover classes and some more errata on declarations. As I have mentioned earlier, the subsections of this third step will be visited out of the order of their numbering.

Step 3.1.9: MemberList (inside a class)

MemberList has only a single variable: ASTList<Member> list.

Step 3.1.10: Member (part of a class)

Member nodes represent a member of a class type. Like other nodes, this is mostly represented by its subclasses, MR_decl, MR_func, MR_access, MR_usingDecl, and MR_template. The node itself has two members, SourceLoc loc and SourceLoc endloc. End locations represent the location just after the semicolon or closing brace, such that the range of text matches [loc, endloc); if you recall, this is the same syntax that the patcher uses when working with ranges of location.

Each of the subclasses of Member only adds one variable, which is the type that member represents. MR_decl adds Declaration d, hence it is used for all declarations within the class, be it a variable declaration like int x, a type definition, or a function without a body. MR_func adds Function f, so it represents all functions that have their body in the class declaration (A(); is a declaration, but A() {} is a function). MR_usingDecl has as its member ND_usingDecl decl (which is covered under the section NamespaceDecl), and MR_template uses TemplateDeclaration d.

The final subclass of Member is MR_access, which has its member AccessKeyword k. This node represents all of the declarations like private:. Since the information keeping track of the access is stuffed in separate AST nodes, you may be wondering how to retrieve this information given only a specific member. The answer, naturally, lies in the auxiliary classes to the AST, something which I have avoided mentioning. Some nodes provide access to a Variable member, one of whose methods retrieves the access to the member. More information about this will be discussed when I talk about that class in detail.

A final thing to note is that Elsa will add some nodes into the AST by the time you use the visitor. These are the implicit methods dictated by the C++ standard. You can check if one of these members is implicit if the DeclFlags variable contains DF_IMPLICIT. Another flag that will also be set is the DF_MEMBER flag.

Step 3.1.7: BaseClassSpec (extending classes)

BaseClassSpec nodes represent a superclass for a class. It has three variables, bool isVirtual, AccessKeyword access, and PQName name, all of which are self-explanatory.

Step 3.4.1: AccessKeyword (controlling access)

AccessKeyword is an enumerated type with three important members. These are AK_PUBLIC, AK_PROTECTED, and AK_PRIVATE, all of which represent what you think they represent. There is also a AK_UNSPECIFIED member, but that should not be present by the time you get to the AST nodes. Naturally, there is also a const char *toString(AccessKeyword) method for converting these types into a string.

Step 3.4.7: OperatorName (operator overloading)

These nodes are informational nodes about operators. You should only find them within the PQ_operator type. The class OperatorName only has a single method const char *getOperatorName(). This is the basis for the PQ_operator name strings, so its results are as mentioned there.

The first subclass, ON_newDel, represents the memory operator overloads. It has two members, bool isNew and bool isArray. The first differentiates between operator new and operator delete, the latter differentiation between operator new and operator new[].

The second subclass, ON_operator, represents the standard operator overloads. It has only one member, OverloadableOp op, which represents the operator being overloaded. The names in the OverloadableOp enum all begin with OP_ and can be idiosyncratic. Example operators are OP_NOT, OP_BITNOT, OP_PLUSPLUS, OP_STAR, OP_AMPERSAND, OP_DIV, OP_LSHIFT, OP_ASSIGN, OP_MULTEQ, OP_GREATEREQ, OP_AND, OP_ARROW_STAR, OP_BRACKETS, and OP_PARENS. There are naturally more, but the other names should be derivable from this sample (pretty much all the idiosyncracies were added to the list); the full list is in cc_flags.h if you need to see it. There is also the standard toString(OverloadableOp) method if you are confused about a particular operator.

Note that some of the operators can be used in different ways. For example, OP_STAR both represents the multiplication operator and the pointer dereference operator. The way to differentiate between the two is via the number of arguments, although one must keep in mind that operators that are class members have one less argument. The postfix increment and decrement operators are differentiated from the prefix forms in that the postfix forms add a second int argument, which is incidentally always 0 if you don't explicitly call the function.

The final subclass is the type conversion operator, ON_conversion. This contains a single member, ASTTypeId type. The member type will have a terminal D_name in its declaration with a null name; the main purpose of the declaration under the ASTTypeId is to capture the pointers.

Step 3.4.2: ASTTypeId (a less powerful version of Declaration)

ASTTypeId is modelled after the Declaration node, but it's used in places where multiple declarations are not usable. Indeed, its most common usage is to represent a type (nominally TypeSpecifier) that can have pointers or references. It has two members, TypeSpecifier spec and Declarator decl, both of which act as their analogues in Declaration.

Step 3.1.4: MemberInit (simple constructors)

When working with constructors, the initialization of members is treated separately from the rest of the constructor. In Elsa, the nodes where this happens are the MemberInit nodes. These nodes contain a few members:

SourceLoc loc

SourceLoc endloc

PQName *name

FakeList<ArgExpression> *args

The source location and end locations have the standard meanings. The name attribute refers to the name of member being initialized. The arguments refer to the arguments of the function-like calls.

Step 3.1.14: Initializer (the last part of declarations)

The Initializer nodes represent various ways to initialize an object. The class itself has a single member, SourceLoc loc, but it has three subclasses, each representing some form of initialization.

IN_expr represents the standard forms of initialization people are used to seeing, something along the lines of int x = 3;. These nodes have a single member in addition, the Expression e member which represents the expression initializing the declaration.

IN_ctor represents the constructor-like initialization forms, such as int x(3);. These have a single member, FakeList<ArgExpression> *args, which represents the arguments within the parentheses.

IN_compound is the final form, which represents the array-like initialization for structs or arrays. For example, int x[1] = { 0 };. This has a single member as well, ASTList<Initializer> inits, which is a list of the initializers within the aggregate syntax. Some words of caution, though, is that aggregate initialization can have unexpected results: multidimensional arrays need not have nested braces, and, in C++0x (and gcc since a long time, though it gives you a warning), you can also omit the braces for nested structures. Bit-fields and statics are omitted from initializers, and, if you have less elements in the initializer than you need, the rest are "value-initialized" (i.e., the equivalent of 0). Elsa, unfortunately, does not aide you any further in deducing which element is actually initialized by any given initializer.

Step 3.1.13: ExceptionSpec (saying what you may throw)

ExceptionSpec nodes correspond to the throw declarations on function declarations. These nodes only contains one member, FakeList<ASTTypeId> *types, which represents the types that method is declared to be able to throw.

That is all I have for this part of the pork guide. Part 6 should finish up sections 3.1 and 3.4, so I look track to have part 7 start discussion statements and expressions. I will probably defer the auxilliary API until around part 9 or so, as I really need to play with it some more first.

Monday, August 17, 2009

The last portion of the guide started covering declarations. This week, I will be covering a lot more about declarations. In particular, types and names are covered in a lot more detail. I had intended to talk about classes in more detail as well, but the post was getting long enough as it was, so I'll save discussion for a fifth part.

What has been covered so far:Step 1: Building and running your toolStep 1.1: Running the patcherStep 2: Using the patcherStep 3: The structure of the Elsa ASTStep 3.1: Declarations and other top-level-esque funStep 3.1.3: FunctionStep 3.1.6: TypeSpecifierStep 3.1.11: DeclaratorStep 3.1.12: IDeclaratorStep 3.4: The AST objects that aren't classesStep 3.4.5: DeclFlags

Aside 1: An introduction to porky (continued)

Aside 2: Pork Web

In the course of writing this guide, I got the idea of writing a tool to display the Elsa AST nodes without having to constantly fidget around with dumpAST. The result is Pork Web, which is also a good expository of how much a little CSS will get you.

Step 3.1.6: TypeSpecifier (continued)

In the last article, I mentioned TypeSpecifier but elided details of its subclasses, who hold the interesting information, because I held a misunderstanding of key pieces of information.

For projects that are sufficiently large to be considered good candidates for automated rewriting, chances are that basic types like int are going to be rather rare, in favor of typedefs that give more precise storage sizes (such as mozilla's PRInt32). The parsing of the AST in Elsa and pork happens at a different stage from the type verification, which means that typedefs have an impact on the structure of nodes. That is not to say that you can't get type information; it just means that you want to use Elsa's type information (embodied in Variable) for more accuracy here. Naturally, #define has no impact on type information, because we are dealing with preprocessed files.

Which of the subclasses of TypeSpecifier is used depends on the format. If you are using a standard type keyword like int, you get the TS_simple flavor, which I covered last week. Structures parsed as classes in C++ (i.e., class, struct, and union) are all TS_classSpec nodes; enums form TS_enumSpec nodes. Class nodes, if you do not provide an actual definition, are classified as TS_elaborated nodes. If all you have is a simple name, then the node is a TS_name node, regardless if that type is a class, enum, or other such type. Names will never be null; for anonymous constructs like enum {a} x;, a unique string beginning with __ will be used instead.

TS_name has two variables: a PQName *name, and a bool typenameUsed. Both of these parameters are self-explanatory. For the curious, the latter comes about via an elabarator of typename, such as in the below:template<class T> class Y { T::A a; };

TS_elaborated again has two variables, the same PQName *name variable, as well as a TypeIntr keyword variable. The keyword variable is an explanation of which keyword was used as the elaboration.

TS_enumSpec has again two variables, this time a StringRef /*(const char *)*/ name, as well as a FakeList<Enumerator> elts, which contains the elements in the enumeration.

TS_classSpec is the most complex of the subclasses, as it represents the definition of a class. It contains the same name and keyword variables as TS_elaborated, but it also has the base classes in the form of a FakeList<BaseClassSpec> *bases and its members in a MemberList *members.

Step 3.4.9: SimpleTypeId (Primitives, if you come from Java)

The SimpleTypeId enum represents the primitive types defined by C++, namely char, bool, int, long, long long, short, wchar_t, float, double, and void, as well as their unsigned and signed counterparts (if they exist). The name of each of these follows the general scheme ST_UNSIGNED_INT, although short and long are ST_LONG_INT and ST_SHORT_INT, respectively (but not long long!).

That's not all, though. For simplicity, some places have fake type codes. The most common of these will be ST_ELLIPSIS, the varargs portion of functions; there is also ST_CDTOR, the return type for constructors and destructors. The source code also mentions GNU or C99 support for complex numbers, but I have not found the magic needed to get those to work.

Step 3.4.4: CVFlags (CV-qualified IDs)

Whenever something can be const or volatile, there is a CVFlags enum. It can either be CV_NONE (no qualifiers), CV_CONST, CV_VOLATILE, or both of the latter. There also exists a method sm::string toString(CVFlags cv) that will print a string representation of such a variable. Need I say more?

Step 3.1.5: Declaration (The outer part of declarations)

Any time a variable is declared, one of the wrappers is Declaration (which may itself be found in various places). This has just three members, a DeclFlags dflags variable that represents the flags on the declaration, a TypeSpecifier *spec that is the type of the declaration, and the FakeList<Declarator> *decllist that contains the rest of the declaration. All of these have been covered in more detail earlier.

TypeIntr is a little enum that has four members: TI_STRUCT, TI_CLASS, TI_UNION, and TI_ENUM. The descriptions of them are, I think, straightfoward. There is also a top-level method to convert to a string representation, char const *toString(TypeIntr tr), which will do what you think it does.

Step 3.1.8: Enumerator (The members of enumerations)

Within the definition of an enum is a FakeList of Enumerator nodes. These have a standard location and a StringRef name. The values can be represented in the potentially null Expression *expr variable, or the actual value in int enumValue.

Step 3.4.8: PQName (Everybody's name)

In declarations and other places, in lieu of a string representing name, you have the AST node PQName. The name stands for "possibly qualified." It comes about because there is the necessity of finding the different components of a name.

This class has four subclasses, PQ_qualifier, PQ_name, PQ_operator, and PQ_template. In addition to these, it has a plethora of functions intended to help you with the task of printing these names, as well as overloaded operators to aid in output (to std::ostream and stringBuilder). They are:

SourceLoc loc

bool hasQualifiers()

sm::string qualifierString()

sm::string toString()

sm::string toString_noTemplArgs()

StringRef /* const char * */ getName()

sm::string toComponentString()

PQName *getUnqualifiedName() /* (And a const version) */

bool templateUsed()

PQ_qualifier represents a namespace or similar component to a qualified name. This is handled in a right-associative manner, such that std::tr1::shared_ptr would be the qualifier std, which qualifies tr1, which qualifies shared_ptr. This class has three variables: StringRef qualifier (the name to the left of the double-colon), TemplateArgument *templArgs (which represents the template arguments for templated class qualifiers), and PQName *rest (the right of the double-colon).

PQ_operator represents a PQName that is actually an operator overload. It has just two variables: OperatorName *o, the operator in question, and StringRef fakeName, a string representation of the operator. The latter is essentially a space-less name of the function (except that operater new and friends are represented as such, as well as conversion operators having the poor name of conversion-operator).

PQ_template represents a templated argument name. It again has two variables: the StringRef name of the base type and the TemplateArgument *templArgs that contains the arguments to the templatization. Note that if you are getting a member of a templated class, the name tree will have the PQ_qualifier node instead.

PQ_name is the other of PQName (note the minor spelling difference). This has a single variable StringRef name which is the name. This is by far the most common name node, since everything that is not an instantiated template or an operator name will have this in the name somewhere.

For standard names, all of the various string output methods save qualifierString (which returns the empty string) will return the same thing, the name variable from PQ_name or the fakeName from PQ_operator. The differences arise when you have templates or qualified names.

If you have a qualified name, the methods change rather predictably. qualifierString returns the entire qualification string before the tail node (e.g., std::auto_ptr<T> becomes std::). The toString method and toString_noTemplArgs return the fully qualified names (optionally without template instantation). toComponentString becomes idempotent to the qualifier variable. getName will be essentially identical to getUnqualifedName()->getName(): it returns the name of the right most declarator.

Templated names also modify stuff predictably. toString_noTemplArgs and getName return the base name, without template arguments; toString and toComponentString return the name with template arguments.

The interesting stuff happens when you have a templated qualifier (e.g., std::set<int>::iterator). In that case, the toString_noTemplArgs will not strip the template args from the qualifier.

hasQualifiers() is identical to the isPQ_qualifier() method. templateUsed() is true if the qualifier or template used the template keyword. This is a feature that would be used for disambiguation purposes, such as this example (taken from the ISO C++ spec):

The last method to talk about is getUnqualifiedName. This method simply returns the PQName at the end of the name.

With all of the methods discussed, the most important question you're probably wondering about is the easiest way to get the name of a PQName. If you're trying to find a method or class name, getName is your safest option. If you need to know the type arguments as well (say you're looking for particular instantations), getUnqualifiedName()->toString() is a better option.

If you're looking at class members, you can probably use toString_templNoArgs successfully (when you're looking for a particular function), unless you're interested in the qualifierString() (when you're looking for any function of the class). For cases where the namespace information is necessary, you probably want to investigate the name with the parallel type APIs of elsa, unless you want to maintain your own state for using declarations and other shenanigans that make the original type non-obvious.

Unfortunately, while I earlier said that I intended to reference classes in this part of pork, it looks like I have not the time this week to cover them as well. At this point, it seems likely that part 5 will cover classes and part 6 will cover templates and errata. I cannot say what part 7 will touch: it will either be more errata or a start on the expression and statement code.

Tuesday, August 4, 2009

Taking a break from my ongoing series on pork, it's now time for an overdue update on jshydra.

First off, I have pushed Andrew Sutherland's changes (most of them, anyways) to my repo. These consist of the ability to push arguments, proper JS constructor handling (which I had locally but didn't commit for some reason I long forgot), and some minor comment handling fixes. I also never mentioned my JS inheritance divination work from way back in May. I also fixed a bug and finally got around to making TOK_* variables visible in JS, like the JSOP_* variables.

David Dahl has decided to use jshydra to find dead code in JS. That bug also has some discussion on how to run jshydra, as well as some work-in-progress files to grab JS information from the mozilla build system in a jshydra-friendly format.

Andrew Sutherland also used jshydra to generate documentation for TB. His blog posting does it more justice than I ever could.

Last night, I finally got around to creating an analogue of David Humphrey's Dehydra Web for JSHydra, predictably called JSHydra Web. As a note of caution, please don't upload your massive, 100+KB JS file for perusing: you'll find the output hard to read, and I don't want to overload the server.

Saturday, August 1, 2009

In the previous installment, I covered details on the patcher API and the basics of the Elsa AST API. If I were to write this as a reference manual, this is the point at which I would be spewing out a lot of verbose information on the C++ AST which would be hard to use for an introductory patch. One goal of this series of articles is to produce something close to a reference manual, so the parts will be numbered in that order. Instead, I will present the information in an order more likely to facilitate understanding.

In summary:Step 1: Building and running your toolStep 1.1: Running the patcherStep 2: Using the patcherStep 3: The structure of the Elsa AST

Aside 1: An introduction to porky

Having had more time to evaluate the new porky wrappers since my announcement of its commit last week, let me introduce you to it briefly. The tool takes in a list of rewrite specifications such as type PRCondVar** => mozilla::CondVar**, which will automatically change everything of the first type to the second type. It can also transform method calls into other method calls or even new/delete expressions. I would go into more detail, but that's for Chris Jones to discuss. If he ever blogs about it.

Step 3 (continued):

The following is an image representation of the C++ AST, excluding specific Expression and Statement subclasses:
To examine AST examples in detail, let us consider the example of a typical hello world program in C++:

After being preprocessed, this small program becomes 19,153 lines of C++ goodness, thanks to the many header files being recursively included. Although we only define a single function, g++ defines for us namespaces (with GCC-specific attributes), templates, template specialization, typedefs, structs, classes, unions, function declarations, and half of the other C++ language features.

Let us look at each of these in turn.

Step 3.1.3: Function (Function definitions)

The Function AST node represents the definition of a function. It contains these variables:

DeclFlags dflags

TypeSpecifier retspec

Declarator nameAndParams

S_compound /* (Statement) */ body

FakeList<MemberInit> inits

FakeList<Handler> handlers

Unlike most AST nodes, the Function node does not have a SourceLoc loc member; instead, it has a getLoc() method which returns the location of the nameAndParams member, which would represent the location of the name. In our example, this would be line 3, column 5 (the beginning of `main').

Most of the members are self-explanatory: they represent, in order, the declaration flags (such as static or inline), the type of the return, the name and parameters aglomeration, and the statements of the body. The last two will be less common and NULL in the example I have here. The inits member represents the intializers of a constructor, while the handlers represents try-catch blocks that are scoped for the entire method. An example of such a block:

Strictly speaking, the enclosing try-catch is not limited to constructors, although I doubt it would be used outside of them except for stress-testing compilers.

There is little that you will want to do with these objects that does not fall under the use of other objects. Probably the most burning question you have is how to get the name of the function. These are the shortest ways:
func->nameAndParams->var->fullyQualifiedName0().c_str() (going to a const char *) and func->nameAndParams->decl->getDeclaratorId()->toString() (going to an sm::string). The latter form will probably be more helpful if you are looking for specific methods, since it overrides the == operator.

Step 3.4.5: DeclFlags (declaration modifiers)

DeclFlags is an enum that specifiers certain flags about declarations. Each of the values is in the form DF_, always uppercase. The standard declarations--auto, register, static, extern, mutable, inline, virtual, explicit, friend, and typedef--have values in this manner. Other flags in the enum have uses for the Variable constructs, which I will not go into detail now.

Step 3.1.6: TypeSpecifier (first half of declarations)

A TypeSpecifier node represents a declaration of type, such as char. If you recall your C/C++ syntax, the declaration char *x, y; declares only one variable as a pointer--the * is matched with the variable name and not the type itself; in the AST therefore, TypeSpecifier does not receive these pointers (there come about in Declarator nodes).

By themselves, these nodes only have a CVFlags cv variable (representing the const- and volatile-ness of the type), as well as a SourceLoc loc location member. Instead, it has five subclasses with more specific attributes.

TS_name, TS_elaborated, TS_classSpec, and TS_enumSpec will be discussed in a future part.

TS_simple nodes represent the built-in simple types, like char. It has a single additional member, a SimpleTypeId id member.

Step 3.1.11: Declarator (the other half of declarations)

A Declarator node represents the non-type part of a declaration. It contains these variables and methods:

IDeclarator *decl

Initializer *init

PQName *getDeclaratorId() /*(And a const version)*/

Step 3.1.12: IDeclarator (the real declaration part)

The IDeclarator node does the heavy work in terms of annotating declarations. In some corner cases, the AST for a declarator can get very deep; the parallel type structure makes working with declarations much easier.

With the exception of the D_name and D_bitfield subclasses, all subclasses contain at least an IDeclarator *base member. In addition, IDeclarator holds these variables and methods:

SourceLoc loc

PQName *getDeclaratorId() /*(And a const version)*/

IDeclarator *getBase() /*(And a const version)*/

IDeclarator *skipGroups()

bool bottomIsDfunc()

D_func getD_func()

The skipGroups method skips through any excess groupings (i.e., parentheses layers). getBase returns either the base of the declaration or NULL if it is a leaf. getD_func returns the bottom-most D_func node, while bottomIsDfunc will tell you if the declaration is a function (but not a function pointer). getDeclaratorId obviously returns the name of the object at the very base of the declaration tree.

The first subclass is D_name, which is the typical leaf of the declaration. It has a single PQName *name attribute, which returns the name of the variable or function being declared. The other leaf class is D_bitfield, which has both the name attribute as well as an Expression *bits representing the number of bits in the declaration.

D_pointer and D_reference represent a pointer or reference indirection, respectively; D_pointer additionally has a CVFlags cv variable representing the qualifications of its pointer type.

D_ptrToMember is similar to D_pointer, but it adds another PQName *nestedName attribute to represent the construct whose member the pointer is pointing to. For those who don't recognize this feature, it can be demonstrated in this C++ snippet:

D_array represents an array declaration. In addition to its base, it also has the possibly null Expression *size member to represent the size of the array. Multidimensional arrays have these members arranged from the outside in.

D_grouping is a dummy node used mostly for the purposes of AST disambiguation during the parsing phase (it represents the use of parentheses in declarations). The skipGroups function can be used to pass these nodes.

D_func is necessarily the most complex declarator. Its base is typically the name of the function, although function pointers can have nested declarators. It has a FakeList<ASTTypeId> *params member for the parameters, CVFlags cv member for the const member functions, and ExceptionSpec *exnSpec for the exception specifiers.

Predicting declarator trees is not difficult. In general, you can apply standard rules to find the declarator: each * and & at the beginning creates the respective declarator indirection node; a [] or () at the end creates the array and function nodes, respectively; surrounding with paranthesis yields a grouping node. Pointers to members are created when the syntax is used (look for the ::*), and the choice between bitfield and name as the leaf comes from obvious decisions. One also needs to remember that the structures on the left are parsed before the ones on the right, unless overriden by parentheses, so the obscene int (*(*asdf)())[0] is yielded as array, grouping, pointer, function, grouping, pointer, name. If you're wondering, it's a zero-sized array named "asdf" that contains pointers to functions that return poiinters to integers.

There is much more to talk about in the world of ASTs. This will get you started on function bodies and declarations; the next part of the guide will cover more in-depth knowledge on declarations and begin to introduce classes.