Here's a quick recipe for Amazon Errands: Hook A9 Yellow Pages into the Amazon Marketplace,
bringing those thousands of individuals and companies selling virtually
through Amazon with their myriad products into real-space, mix in an
In-Store Pickup button for anyone with a local shingle, and expand ever
so slightly the functionality of wishlists to allow for lists of lists,
some of which are local. Stir and you're thinking global, shopping
local.

We are very interested in seeing applications which rip, mix, and burn (to borrow a phrase) data from multiple web services, including ECS, AWIS, and SQS.

Today at Amazon we have Russell Miles, author of the AspectJ Cookbook. Russ is the latest speaker in our Friday Learning Series, an internal continuing education program for our developers.

Russ will talk about why we should be doing Aspect-Oriented programming.

Agenda:

What is AOSD and why is it important?

Key features of AOSD

Advanced features of AspectJ

Types and scales of Aspect

Refactoring for AO

AO Challenges

Where is AspectJ and AO going?

What is AOSD? It is not just for logging and tracing; there are many scenarios where it can be used. He'll cover theory and then practice.

Let's start with software concerns. A concern is a goal, concept, or area of interest (Laddad). A system is made up of a collection of concerns. A concern is a collection of behavior and data. There are two types of concerns: core and cross-cutting. The architect's dilemma is to take into account future change while not creating an overdesigned system. Be sure to develop something; don't sit around and think too much.

What's wrong with object orientation? Some concerns don't work within current OO approaches. Problematic concerns cross-cut an application. Lack of modularization of cross-cutting results in code tangling, code scattering, and duplication. AO tries to manage and solve this problem.

Diagram of an analysis of an XML parser, indicates where a particular concern is used within the code. This code has good modularity. Related areas are well encapsulated. Changes have a minimal ripple effect. The graphs were drawn using the AspectJ toolkit component of Eclipse. Another diagram, this one of logging within Tomcat. It is all over the place, and not even in a small number of places. OO practices can lead to bad modularity for certain types of concerns.

Bad modularity leads to redundant code, difficult to reason about structure (because it is not explicit), difficult to change ("nightmare").

Example of some Java code. Class with a method. Add some logging of start and completion, adds code to beginning and end of method, and add an attribute to the class. This has nothing to do with the requirements for the class. Now add a new requirement, thread safety. Again, adds more code, lock/unlock. Then add constraint for programming by contract, then authorization, persistence, cache consistency, and so forth. Lots of stuff that is unrelated to, yet complicating, the class. Core operation is almost hidden.
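Roughly, in Python terms (a hypothetical Account class of my own; the talk's example was Java), the tangling looks like this -- the core operation is buried under unrelated concerns:

```python
import logging
import threading

logging.basicConfig()
logger = logging.getLogger("account")
_lock = threading.Lock()

class Account:
    def __init__(self, balance=0):
        self.balance = balance

    def deposit(self, amount):
        logger.info("deposit starting")                          # logging concern
        with _lock:                                              # thread-safety concern
            assert amount > 0, "contract: amount must be positive"  # contract concern
            self.balance += amount                               # the core operation, nearly hidden
        logger.info("deposit complete")
```

Every class that needs logging, locking, or contracts repeats the same scaffolding, which is exactly the scattering and tangling the talk is about.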

Aspect Orientation (AO) to the rescue. Modularize across concerns, rules for where it is going to be applied as well as what is to be done. Needs language or framework support in order to do it right.

Other languages: #define (C/C++, C#). Manually declared, not modular, not AO. Source-level metadata: annotations in Java, C#. Again, must be declared on a code block. Not modular and not AO.

Question: What about debugging? Answer: IDE and tool support is important.

Enter AspectJ. A small extension to Java, a first-class language extension. Supports three core AO constructs: Aspects, Pointcuts, and Advice.

More Java code: 1 class and 3 methods. In Eclipse, create an aspect for logging. aspect is a new keyword added to Java.

Join points - a discrete point in the application that can be captured. Method calls, exception handling, object creation.

Aspects - First class language construct. Can be declared abstract. Aspect is modularization for AO. Can contain traditional Java attributes and methods. There can be complex relationships between Aspects. Aspects have lifecycles and can have relationships-- inherit from other classes or aspects, implement interfaces, declare classes, declare aspects. Lifecycle is not managed by developer.

Aspect lifecycles can be singleton, perthis(Pointcut), pertarget(Pointcut), or percflow() (control flow), which lives for the duration of the control flow.

Pointcuts - encapsulate the 'where'. Based on pointcut logic. Can use wildcards, can be anonymous or named, and can be abstract. Many types: call, execution, handler, initialization, static initialization, get, set, within, adviceexecution. Putting them on accessors can be done, but appears to be bad practice. All pointcuts are scoped. Execution and call are good pointcuts to use.

Can combine pointcuts using logic: &&, ||, ! (not). Don't join on stuff you are not interested in.

Advice - Encapsulates the 'what'. Unnamed and local to an Aspect. Never abstract. Five types: before, after, around, after returning, and after throwing. Around lets you override a constructor and return a different object. Multiple advice can apply per join point.

Java example: Use methodCalled advice to implement logging. Goal is to make this all as transparent as possible. Advice can use wildcards on callers and types to choose what they apply to.
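The before/after idea can be mimicked in Python with a decorator (a rough analogy of my own, not how AspectJ actually works; AspectJ selects join points declaratively with pointcuts rather than annotating each function):

```python
import functools

def trace(func):
    """Wrap a function with before/after 'advice' -- a decorator-based
    analogy to AspectJ advice applied at a method-execution join point."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("before:", func.__name__)   # 'before' advice
        result = func(*args, **kwargs)    # the original join point
        print("after:", func.__name__)    # 'after' advice
        return result
    return wrapper

@trace
def place_order(item):
    return "ordered " + item
```

Unlike a real pointcut, the decorator must be attached at each function by hand, which is exactly the non-modularity the talk criticizes in annotation-based approaches.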

Recommendation: Optimize and tune as much as possible. Use AspectJ Visualizer to see coverage.

Any runtime or compile time info available at join point is available to the Advice code.

On to AspectJ. Heavyweight language, extension to Java. Power extensions; use them with care. Support for static cross-cutting to alter application architecture, changes how application is put together. Don't use it to fix bad code or bad design. Use when you need new values or behavior in an existing app, such as in the implementation of an Observer pattern. Advise and extend the compiler to declare warnings and errors, create your own rules. Could use it to deprecate things, "but very strongly."

Question: How does this work? It looks like magic! Answer: It uses another compiler, a pre-compiler to the regular Java compiler. There's also something called aspect weaving, which apparently happens at the byte code level.

Question: Does Reflection give AspectJ conniptions? Answer: Yes, it can. You can also implement a lot of what AspectJ does with reflection, but without the modularity.

Scales of Aspect:

Class and component - parameter validation, class persistence

Application - logging, lazy loading

Enterprise - transactions, security policy

Aspects can be declared public and static, within a class, to enforce invariants using an around() advice item. Must be public so that the AspectJ runtime can see the code.

Refactoring for AO, a relatively new area, a new book is on the way. Migrating to AO requires refactoring.

Impact of AO on OO. Change of perspective, help for the architect's dilemma. Increase controlled flexibility. Provides benefits to some OO design patterns.

There are some design patterns in AO, mostly just starting to emerge. Include Cuckoo's Egg, Director, and Worker Object. This is all work in progress.

Challenges: Learning curve, IDE support (Eclipse, JBuilder, NetBeans). Other languages still to come. Modeling and design are not yet supported in tools. Methods and practices are starting to appear. AOSD with Use Cases.

Q: Performance setbacks? A: Not supposed to be; it really depends on the implementation of the language. Should be the same as hand-coding what the aspects do. Before and after are better than around. Use it for the right reasons.

Q: One join point and several aspects, what happens? A: There's a precedence, set on the aspect.

Q: Assuming there will be libraries of aspects, what's the precedence? A: You can declare precedence outside of the aspect, to set this.

Q: Besides logging, what are other generic aspects? A: (reads from the book) Transactions, locking, exception handling -- any characteristic that doesn't seem to fit into the class shouldn't be there.

Asked about the Parrot project, Guido thought that it was "a long way from being useful," and said he didn't "know if they'll be able to fulfill their promise of running Python as well as the original implementation."

He's not "afraid of jumping onto that bandwagon, but it looks like it's still in the garage."

That said, Guido maintained that he wasn't stuck on the current Python VM, and mentioned the work on getting Python to work on the .NET and JVM platforms.

Fielding another question in his black PYTHONLABS tee, Guido talked a bit about the process of integrating new code or libraries into the Python core, noting that changes should be useful across many application domains, mature, and have an active maintainer. It also may not make sense to integrate code that changes much more quickly than the relatively slower Python release schedule.

Guido also talked a bit about the optional static typing debate that's been flying around the programming languages blogosphere, calling it an "incredibly contentious subject".

When asked about languages like Haskell or ML, Guido said that Python would never end up evolving in that direction, in order to maintain Python's dynamic flexibility.

Asked about Ruby, Guido thought that it "seems like an excellent attempt at cleaning up Perl". He disliked the perlesque parts of Ruby and claimed that the semantics of Ruby code blocks are not always clear.

One developer sitting on the floor made a feature request for continuations in Python, but Guido replied that "his head explodes" whenever he has tried to understand continuations. Instead, generators in Python, we were told, could do 90% of what you might want to do with continuations.

After that, Guido started his talk, titled "What's (Not) New in Python", his favorite talk. "What's New" talks, he told us, tend to degrade into too much "laundry-listness".

He breezed through a few slides describing his own Pythonic lifetime, from his "first girlfriend :-)" at age 21 to his recent role as a relatively new father. I expected him to say that he teaches his kid Python alongside Dutch and English, but he said that he tried to keep his kid away from computers, except for looking at firetruck images on Google.

As for teaching other kids programming languages, Guido told us that teachers of programming tell him that kids learn and "understand much better" if they see Python before Java.

Moving on to an older target audience, Guido asked how many of us were Python users. Maybe a quarter of the room raised their hands. And just as Brian Aker did in his MySQL talk, Guido asked first how many of us use his technology at work, and then asked second how many of those people told their boss :).

Guido seemed to think that a language like Python would have significant advantages at a place like Amazon, where you want to develop quickly and prototype software. At his workplace, he's doing enterprise security software these days.

But in his spare time, Guido acts as the benevolent dictator for life for Python, but provides a "PEP" mechanism for allowing everybody to give input on the Python language.

Guido went on to describe the evolution of iterators and generators, from simple loops to pseudo-sequences and iterators in Python 2.2. Along the way he took another friendly shot at the Perl community, for creating a whole new language for Perl 6 and calling it the same name.

He showed how Python 2.2's generators and iterators made it easier to write cleaner and even faster code, mentioning parsing/tokenizing, the standard library difflib module, and os.walk() as example uses.

[Insert picture of Guido's kid eating out of a big tub of cool whip.]

Guido next stepped through a few neat Python features such as generator expressions and told us about how methods can be added and called in a dynamic fashion.
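A quick sketch of the dynamic-method idea (a hypothetical class of my own, not from the talk):

```python
class Greeter:
    pass

def hello(self, name):
    return "Hello, " + name

# Attach the function to the class at runtime;
# existing and future instances pick it up as a bound method.
Greeter.hello = hello

g = Greeter()
g.hello("Amazon")
```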

In talking about function binding, lookup and currying, Guido had this to say. "In general, I don't like magic."

His next slide proclaimed "Magic Is Bad!" and subsequent slides described the "magic revealed: descriptors", to unify user-defined classes with built-in types.

Another slide told us that "If you take one thing away from this talk, it should be how to create simple properties"; another mentioned how to do schizophrenic things with classes and metaclasses; another showed his cute little kid; and finally "Question Time", one last simple slide.
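A minimal sketch of the "simple properties" idea, using a hypothetical Temperature class of my own (the decorator-based setter syntax shown here is newer than the Python 2.4 of this talk, which used the property() built-in directly):

```python
class Temperature:
    def __init__(self, celsius=0.0):
        self._celsius = celsius

    @property
    def celsius(self):
        """Reading t.celsius calls this getter transparently."""
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        """Assigning t.celsius = x runs validation first."""
        if value < -273.15:
            raise ValueError("below absolute zero")
        self._celsius = value
```

Callers use plain attribute syntax, so a class can start with a bare attribute and grow validation later without breaking its users -- which is presumably why Guido singled properties out.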

A: It may take another generation of programmers to get over the prejudice for static typing.

Q: Are there languages out there that you want to take features from?

A: All the time. But I have to be careful... It's a balance.

He mentions Boo, which is 95% Python, but statically typed and thus feels completely different. Guido does not want Python to become Boo.

Q: Incremental garbage collection?

A: Reference counting was good until you had cyclic references. Python still does reference counting, but there's an optional high-performance GC that kicks in when cycles exist -- it's not something you need to worry about. It'd be "incredibly tough" to move to something like an incremental garbage collector, but there are things like Jython with the standard JVM GC.
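The cycle situation he described can be demonstrated with the gc module (my example, not Guido's):

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

gc.collect()             # clear any pre-existing garbage first

a, b = Node(), Node()
a.ref, b.ref = b, a      # reference cycle: refcounts never drop to zero
del a, b                 # refcounting alone cannot reclaim the pair now

reclaimed = gc.collect() # the cycle detector finds and frees them
```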

Guido started by saying, "This is my favorite talk for technical audience. I used to always talk about What's new in 1.0? What's new in 2.0? They were too much like laundry lists. I think it's more interesting to talk in detail about one or two interesting features."

About Guido: "I've been a geek all my life." He works at a cool startup, Elemental Security.

Standard Library: The stuff you'd expect plus an enormous amount of third-party functions. Everything except a few applications is open source. 80-90% of third-party add-ons are open source. License is BSD-ish.

There are about 60 developers. Everyone is basically a volunteer; although some are paid by their employers to work on Python since they use it for their jobs. The Python development team spans the globe.

Process for introducing new features: new features voted in by consensus by the developers on python-dev. It's not a democracy. Guido is the "BDFL: Benevolent Dictator For Life." He says this is OK, because he's a fairly typical user, and he lets everyone give input before he makes the call.

Releases happen every 12-18 months. Minor releases are purely focused on stability and backwards compatibility. Code is compatible backward and forward between the different releases. Previous release is kept alive for most of the lifetime of the next release. Python-dev will introduce more backwards incompatibilities out of necessity at some point. But it won't be like, "Let's design a new language but give it the same name."

Case Study 1: Iterators and Generators

Iterators

Evolution of the for loop: Guido showed the evolution of "for" through Pascal, C, and then Python.

The iterator supports only one method: next(). It just loops through, and there's no index to increment. Using an iterator is actually faster (Guido recalls 40% faster) than loops with sequences. Other alternatives were more expensive too (e.g., creating a tuple). They did not introduce any backwards incompatibilities with iterators. Any sequence will continue to work.
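A minimal iterator sketch of my own (spelled __next__ as in modern Python; the Python 2 of this talk spelled it next()):

```python
class CountDown:
    """Implements the iterator protocol: __iter__ returns self,
    __next__ yields values until it raises StopIteration."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

list(CountDown(3))
```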

Dictionary Iterators:

Python 2.1:

for key in d.keys(): print key, "->", d[key]

Python 2.2

for key in d: print key, "->", d[key]

Savings: Python 2.1 copies the keys into a list; Python 2.2 doesn't.

Downside: With Python 2.2 option, dictionary has to remain unchanged during looping.

File Iterators:

Loop over all lines of a file in Python 2.1:

line = fp.readline()
while line:
    <statements>
    line = fp.readline()

In Python 2.2:

for line in fp: <statements>

This is also 40% faster and looks better in Python 2.2.

Generator functions

Remember coroutines? Or think of a parser and a tokenizer. A parser likes to sit in a loop and occasionally ask the tokenizer for the next token. The tokenizer would like to sit in a loop and occasionally give the parser the next token. How can both sides be happy?

Generators let you write both sides (consumer and producer) as a loop:
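A rough sketch of the pairing, with a hypothetical tokenize()/parse() pair of my own (not Guido's code):

```python
def tokenize(text):
    """Producer: a generator that yields one token at a time,
    suspending between tokens instead of building a full list."""
    for tok in text.split():
        yield tok

def parse(tokens):
    """Consumer: sits in a loop and pulls the next token
    whenever it needs one."""
    return [tok.upper() for tok in tokens]

parse(tokenize("select name from users"))
```

Each side is written as a simple loop; yield is what lets the producer suspend and resume, which is the 90%-of-continuations point from earlier.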

Generator expressions came along in Python 2.4. These can compute sums and other aggregate calculations faster than list comprehensions, without creating a list and using less memory.
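A small illustration of the difference (my example):

```python
nums = range(1, 1001)

# List comprehension: builds the full 1000-element list in memory first
total_from_list = sum([n * n for n in nums])

# Generator expression: feeds sum() one value at a time, no intermediate list
total_from_gen = sum(n * n for n in nums)
```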

Q&A

Q: Can you talk about the differences between Ruby and Python?

A: Ruby seems to me like an excellent attempt at cleaning up Perl. Ruby to me has a lot of Perlesque features.

Q: In a number of scripting languages, there's always a problem with debugging time. Any suggestions for that with Python?

A: Run your code through PyChecker http://pychecker.sourceforge.net/. It's very good. Unit testing is the other half of the picture, but you can't unit test everything perfectly. For example, I don't know how to force an I/O exception in all cases.

Q: What is stopping Python from being a premiere, first-class language like C++ or Java?

A: Purely a mind set thing. It will probably take another generation of programmers. There are some cases where it runs slower. For example, it can't count to one million as fast as Java. But it can scan a file for a particular string faster than Java.

Q: Do you ever see any languages out there with features that you want to get into Python?

A: All the time. I have to be careful, because people come to me all the time asking for new features and fixes. If the language changes too much, we'll lose our existing community. For the last 4-5 weeks, I've been looking at a way to add optional type declarations. That's a delicate subject, and there's a lot to learn from other languages. The right solution will end up being unique to Python though. It has to feel like Python.

Gavin King leads the Hibernate project at JBoss. Gavin said he planned to talk about EJB 3.0. It's conceptually similar to what has been in Hibernate 1.0 and 2.0 for quite a while. Gavin told us that he would segue from EJB 3.0 into talking about Hibernate 3.0.

EJB 3.0

Motivation for EJB 3.0: Simplified programming model (make middleware continue to be successful as a line of products -- simplified APIs or no APIs at all). EJB 2.1 is too "noisy" (too much code was required to make the container happy; slower to write, maintain, and understand). Metadata is XML hell (it was code itself and wasn't easily maintained). Simplify testability -- unit testing and automation, test suites that run frequently (hundreds of times a day). Simplified application architecture. Simplify modeling by giving you a richer language to work with (domain object modeling). Simplify querying by making the query language more powerful (EJBQL required many workarounds if you wanted to do anything interesting; you had to break out of the EJB model).

Entity beans in EJB 3

- Plain Java Objects: no required interfaces/inheritance, all public methods are non-final, all instance variables are private, non-abstract persistent properties, etc.
- Metadata is as important: mapped as annotations. Changes more often when you're maintaining a domain model. There are two levels of annotations: logical level and schema level.

Gavin showed some code for creating an entity bean. He said some people think annotations are really ugly and don't like seeing SQL stuff embedded in code. Gavin said annotations will soon be created automatically.

You can just instantiate entities since they're plain objects.

No more home interface ("detyped 'home' interface"), but home interfaces can be useful for running queries and creating objects. They had to replace it with something. The new operator instantiates an entity. Managed entities are associated with an EntityManager. New entities become managed by calling create(). Entities become detached when a persistence context ends. The state of detached entities may be reused in a new persistence context using merge(). This provides true support for optimistic transactions.

EntityManager API

Nice to make a typed interface that isolates business-specific stuff.

Query API

Queries can be expressed as EJBQL strings. Can be invoked via the Query API.

Query Language

EJBQL 3.0 is very similar to HQL (Hibernate Query Language). Aggregation, projection, fetching, subselects. Minimal changes to standard SQL to make it object-oriented. Definition of what data I'm interested in -- what data do I need in this transaction? This is one way the object-relational (O/R) mismatch is addressed.

What more is coming? Session beans, interceptors and callback listeners, native SQL query support, XML-based O/R mapping (this will be an alternative if you don't like annotations).

Hibernate 3.0

Hibernate has been around for three years. 3.0 is a major revision of Hibernate implementation and APIs. Migration from 2.X is straightforward, but it's not a drop-in replacement. 3.0 fixes several limitations.

Hibernate 3.0 forms the core implementation of JBoss EntityManager (EJB 3.0 - preview is available now). Going live in a month.

Coolest new feature is filters. The motivation behind filters is handling temporal, permissioned, and regional data: a global, parameterized restriction on the data. Doing this in SQL is difficult. Filter conditions may be defined as SQL fragments in a mapping document, and they may be applied cumulatively. Filters offer a simple declarative mechanism for handling messy queries.

Filters are defined in mapping metadata. You can define what the filter means for a particular entity. The filter is then injected into any join conditions and where clauses for that entity. Filters are applied transparently when query runs.

Handwritten SQL support. This is an improved feature in Hibernate 3.0. Motivation is that writing SQL is tedious and generated SQL is a good thing. You can bind SQL queries to managed persistent objects.

The title of Chris' talk is "Amazon Involvement with Mozilla and Open Source Projects."

A few Amazon employees are already involved with Open Source and all of the conference attendees use Open Source software. (Apparently, Firefox has a 90% market share -- well, at least in the conference room.)

Everyone's use of Open Source software is important to break into the mainstream. Chris told a story of a pastor who said people should use Firefox to rid their browsers of the evils of popups. Talk about spreading the religion!

The decision to move Netscape to Open Source came from the top levels of Netscape management, as another competitive response to Microsoft. It was announced in early January 1998 and released on 3/31/98. It was first embraced with enthusiasm by the engineering staff, who dreamed of outsourcing the programming to some developer on the Web. But then they realized that they had to cleanse 3 million lines of code.

There were all sorts of profanity and rants about Bill Gates in the code.

It took about 3 months and 60 engineers to work through the code. Check out the PBS documentary, "Code Rush", for more details.


There are some differences between the Mozilla Project and other open source efforts:

Cross-platform

Large scale effort

Highly visible

Commercial involvement

The differences drive changes in the approach and tactics for running the project. The Mozilla Public License (MPL) was created for the project. Here's how it works: If you make changes to the code, you must contribute them back to the code base. You can also make additions to the code and segregate those additions in separate files. This allows you to license your additions under different terms.

There are a lot of project issues related to Open Source projects. Joel Spolsky wrote about the things that can lead to great software. Eric Raymond wrote, "The Cathedral and the Bazaar." Feedback loops are a key part of the development effort. With lots of peer review of the code, you find bugs early in the development cycle, attention is paid to details that help in maintaining the code, and it encourages reuse where it makes technical sense.

Here are some Open Source principles:

Release early

Release often

Listen to your customers

Treat your beta-testers as if they're your most valuable resource--they will respond by becoming your most valuable resource.

Keep the software at "near shipping quality" all of the time

Use a modular vs. monolithic architecture

Create a meritocracy. You earn your voice and recognition on the project through good contributions. Open Source projects can resist commercial and business goals that conflict with things users really want -- pop-up blocking, for example.

Anecdote break: Netscape had been making money off of pop-up ads. The code for the pop-up blocker was accidentally integrated into the Netscape code base from the Mozilla code base. Needless to say, there were some meetings at Netscape about this. The pop-up feature won out.

Firefox attempts to avoid the features "arms race" like the one that went on between Mozilla and Internet Explorer.

A security system is only as secure as its secrets. Open Source software stays away from pseudo-secrets and back doors. Publishing the code encourages review and security analysis.

There are some business cases for Open Source:

When the maintainer loses interest, the last duty is to hand it off to a competent successor.

Lower development costs. The cost is shared by others.

There is constant incremental revision and refinement, rather than develop, ship, or cancel.

There's the ability to stand on the shoulders of others.

Who gets involved in Mozilla?

There are 12 engineers at the foundation. There are 60 full-time engineers from Sun, Oracle, IBM, Redhat, Novell, and others. Then there are 1,000 volunteers submitting patches. The project is translated into over 100 languages (all by volunteers). What about testers? There are 10,000 pre-alpha testers of daily builds who file 70-100 bugs a day. There are over 100,000 beta testers.

A question came up about Google and its involvement with the Mozilla project. Chris mentioned that Google has a strong interest in the technology and has been a big help in the past few months. Any announcements would come through Google, of course.

Another question was about the turnaround time for vulnerabilities. Chris mentioned that their track record is 24 hours, based on recent history. A vulnerability was found at 10:00 PM. They had a patch in an hour, but the remaining time was spent testing and creating the final package build.

Jeff Bezos mentioned in Wired magazine (January 13) that, "If today the successful recipe is to put 70 percent of your energy into shouting about your service and 30 percent into making it great, over the next 20 years, I think that's going to invert." Firefox and Open Source projects really have zero-based marketing budgets. The experiment with spreadfirefox.com is an open-source style of marketing. The community of 60,000 users even raised $250,000 for a two-page ad in the NY Times.

Introduction to Eclipse, Bjorn Freeman-Benson of Predictable Software. Member of the Eclipse board, and a Committer Representative.

Project has lots of users, some contributors, and a few who can make commits to the source code repository.

He worked at OTI (Object Technology International) putting together Eclipse, has written plug-ins, but has a different perspective because he is not a core author.

Quick outline: Intro, Neat Features, Futures.

Goal is for Eclipse to be a universal tool platform: "Anything at all, nothing in particular."

Deliberate attempt to make money, in an open source way. Eclipse Foundation is not-for-profit. Most people on the project are paid by IBM, HP, Actuate, or other members. There's a marketing side and an open source side to what they do.

Roots - Software is from people. Origins include ENVY/Smalltalk, Hoops, Visual Age for Java, Team/V (Smalltalk for Windows), Sniff, and Visual Age Micro Edition. IBM'ers talk about their plug-in architecture a lot, everyone has one, let's talk about features instead.

Architecture - Runtime based on the OSGi standard. UI (generic Workbench + JFace + SWT). Extensible tools for search, help, debug. Java IDE atop all of that. Resources on the side. All can be plugged together to build other applications. People started to build things that were not IDEs. Then it was labeled a rich client platform.

Goal - Completely open platform for application development tools. Rumor has it that it didn't succeed even within IBM until it was open sourced. Handle unrestricted content types, be language neutral. HTML, Java, C, JSP, EJB, XML, GIF, etc. Seamless tool integration, a dream for the last 1.5 decades.

3 releases to date, each with an accompanying perception change:

3 - Eclipse is a general application
2 - Eclipse is a general tooling platform
1 - Eclipse is a Java IDE

Lots of examples of using internal knowledge of the code. Show all references, uses, child classes. Find declarations, chase refs and defs. Type and call hierarchy; easy for the IDE to display. Use nice icons and colors to denote abstract, protected, private, etc. Static call hierarchy, method calls, count number of callers, mark all occurrences of an identifier. Examples include all return statements in a method, all uses of an exception handler, all overridden methods of a class. Use automatic refactoring to create new classes from the original source.

Code critique system (a very dynamic form of lint), from originator and other places. Show up as errors or warnings in the Eclipse TODO list.

Easy navigation between property files and source code. Example: Jump between source and resources for a localized application.

Some Neat Features

JUnit integration -- test driven development. Run tests, get errors, jump to source, run with debugger.

Ant debugger -- to debug the build process.

UML tools.

Debugging stuff -- run code with errors, code that doesn't actually compile. Popups to inspect values. Logical structure display, show vector as array, hide the implementation details. Debugging live web services (extra plugin). Put breakpoint in servlet, step through from client into server, other threads on same server keep going. Do hot code replacement (using the JVM and the Java Debug API) on a running system. Sun JVM supports replacement of any method that is not running, as long as the class's type signature is not changed.

Team development -- Eclipse team scattered around, started using CVS as backing store, other companies added plugins for other source repositories. Eclipse hides some of CVS's warts, still no directory-level management. Nice differencing browser, merging with conflict resolution. "Blame co-worker" feature, identifies who made the changes to any line of code.

Q: You talked about Java; what's the parity level with the C IDE? A: C is the second most widely used. Debugger works well, but doesn't know about refactoring. Completion and highlighting are good; overall not quite as complete. About 1/2 of the Java features are in the C version.

Milestones -- next release in June. Most of team is in Ottawa, they work hard in the winter, and take the summer off to go fishing. Team has fixed 1 bug/developer/day for 5 consecutive years, no exceptions. That's 50-100 fixes per working day over that time. All is done using a public instance of BugZilla. Milestone releases every 6 weeks, they are good at hitting them.

3.1 Themes and Priorities: Scaling up -- simplify getting started. The WebSphere project added 300 extensions; make sure Eclipse handles these well. IBM pays for about 1/3 of development; they know they are building tools that have to scale for large-scale development.

Very large code base requires a 64-bit processor; Sun JVM has a 1.7GB limit and all of the data is resident.

Brian Aker is the Director of Architecture at MySQL. Brian was invited to speak at Amazon DevCon and told that he could talk about anything. Brian chose Writing Storage Engines, because his favorite thing to talk about is the guts of MySQL and the storage engine. Brian is the author of lots of stuff. See Freshmeat...

A bit about MySQL: 5 million installations, 180 employees, 20+ countries. Most North American developers are in Seattle. MySQL is a high performance relational DB. Aiming to be SQL-99 compliant. Stable, scalable, usable, modular. This last characteristic makes MySQL different from most relational DBs, but similar to most open source technology. There are client support libraries in many languages. All MySQL code is written in simplified C++ -- don't look inside the code if you're a C++ purist.

Brian asked the crowd: How many people are using MySQL today? (quite a few hands went up) How many are using it but haven't told their bosses yet? (fewer hands, but still some... this got a good laugh)

Overview of architecture

Kernel: Parser, Optimizer (cost-based), Storage Engine

The storage engine is separate from the optimizer. You don't have to worry about what a join is, and you don't have to parse SQL syntax yourself. You get a basic set of primitive operations.

Dozens of storage engines have been written. Some examples are:

MyISAM

InnoDB - Typical transaction engine. Think of it as Oracle; easy for Oracle DBAs to get it.

NDB - Different, designed for telecom environment. In telecom environment, you pick up the phone and you dial - response has to be instant. High availability also required. Storage arena with storage nodes in clustered environment. Just keep throwing more boxes at it and you'll scale. Interconnects supported. All data and indexes in memory - it's a memory hog.

HEAP - Design in env where you don't care if DB goes down. Need fastest lookup possible. You throw data in it - if your data goes away, you know where else you can find it. Used in combination with MyISAM for Slashdot. Mixing and matching is useful.

Merge - Primitive partitioning engine.

The list of engines goes on and on. CSV engines, archiving engines, etc.

More on storage engines. We hand them data, and they decide how to store it on disk. Don't assume that there's one way to get data in and out. There are many ways, and you may need different storage engines based on your app. The operating environment dictates how the storage engine should behave.

Understanding storage engines and their architecture

Every row is made up of fields. Each field is an object. Optimize for the smallest amount of space possible. Integer, tiny integer, etc. Faster writes, faster reads, etc.

Knowledge of required methods for coding

Berkeley Engine - SQL interface to BDBs

Storage Engine Methods

There are four groups of methods in the storage engine interface.

1) Table control methods

::create() - whatever you need to do to create a table, i.e., write a description -- see ha_example

::open() - a little more difficult. The header code creates a share: for every table open, create a share to manage the state of the table (any level of lock - table, row, page, etc.). For the most part you don't need to touch the get_share code. Open any files you need to access.

::close()

::delete() - can skip depending on how you code - pass in name of table

2) Optimizer methods

::info() - Feed information to your engine so that it can decide whether to talk to indexes or scan all the data. It's a cost-based optimizer. info() asks basic questions: What table is being deleted? How big are the records? How much data? How big are the indexes?

::records_in_range() - You have enough rows, there's an index. Returns an estimate of how many rows you'll touch based on the index. Cost optimizer tuned to 30%. It does a complete scan if it is already touching more than 30%, rather than using indexes. Can tweak that 30% number too.

3) SQL modifier methods

::delete_row()

::insert_row() Pass in write_row record. Let engines decide how timestamp will be applied. Allow you to do basic increments and sequences. Write out data however you need to write it.

::update_row() Handle old data and new data. Some storage engines can fix rows in place (fixed row format) and avoid updating all rows - basically non-existent fragmentation. Otherwise, dynamic indexing.

4) SQL read methods - two types

Scan reads:

::rnd_init()

::rnd_next()

::position()

::rnd_pos()

Index reads - a little trickier:

::index_read()

::index_next()

Brian showed this table scan example, which shows the sequence of method calls:

ha_example::store_lock
ha_example::external_lock
ha_example::info
ha_example::rnd_init
ha_example::extra - Cache record in HA_rnd()
ha_example::rnd_next
ha_example::rnd_next
ha_example::rnd_next
ha_example::extra - End caching of records (def)
ha_example::external_lock
ha_example::extra - Reset database to after open

Brian said that to create a storage engine, you simply take the interface and write these very easy methods described above.

There are some more methods you can write:

Transaction methods

Bulk load methods

Defrag methods

And many more (read handler.h for more of the interface)

You'll need to touch a few more files to get your storage engine going:

autoconf files

sql/Makefile.am

sql/handler.h

sql/mysql_priv.h

sql/handler.cc

sql/mysqld.cc

sql/set_var.cc

You can also take advantage of the test case system -- open source and easy to use, highly recommended for testing. See mysql-test/.

Other things to consider: Read only? Do you need durable data? Do you have a network database? If so, you can do optimizing for a federated environment.

Brian recommended looking into sql/ha_example if you want to explore more.

Q&A

Q: Roadmap for full text support?

A: The goal is to get full-text search performance up in 5.1; full-text support is not performing well in 4.1. Timeline is end of 2005 or first quarter of 2006.

"Try to do less, but better," he told us, rather than trying to do everything, but poorly.

Margo went on to describe what it means for Berkeley DB to be an "enterprise data management" system, sketching out its high-level architecture and feature sets.

How do people use Berkeley DB? Margo listed a slide full of customers, from Google to Nokia, Sun to Cisco.

Margo was happy to be able to actually tell people what she does. Programmers often have this problem, in explaining nitty-gritty technical work to everyday people.

She just tells people that when they go to the Amazon web page and look at items, she helped to do that.

Diving into technical details, Margo told us about how Berkeley DB is neither a relational database nor an object store. It has no intrinsic schema.

It doesn't yet support partitioning -- often useful at the terabyte scale and beyond.

As for how data in such stores actually get accessed, almost nobody uses random access, Margo told us, even if developers claim otherwise. There's pretty much always locality.

During the rest of the talk, Margo fielded highly technical questions from a number of Amazon developers. They wanted to know about practical limitations of BDB, how garbage collection affects performance, and about other detailed issues.

Margo was proud that Berkeley DB developers live, eat, and breathe the mantra "separate mechanism from policy". That is, allow the application developer to specify policy, whether in concurrency or transactions, cache sizes or locking. Such flexibility allows a wide range of developers to apply the technology to their own niches.

"The good news is that it's flexible," Margo told us. "The bad news is that it's flexible."

Amazon developers cared a lot about replication and performance, and Margo tried to field all the questions that we threw at her.

Towards the end, she told us how to do DB replication. Do not build in replication from day one, she said. Get the transaction system working, add a communication infrastructure, then add the replication architecture.

She ended the talk with a cycle that we here at Amazon go through every single day: "Build, test, test, test, test and deploy."