Thursday, October 21, 2010

Here's what I love -- an argument that can only add cost and complexity to a project.

It sounds like this to me: "We need to fragment the code base into several different languages. Some of the application programming simply must be written in a language that's poorly-understood, with tools that are not widely available, and supported by a select few individuals that have exclusive access to this code. We haven't benchmarked the technical benefit."

Further, we'll create complex organizational roadblocks in every single project around this obscure, specialized, hard-to-support language.

Perhaps I'm wrong, but database triggers always seem to create more problems than they solve.

They Totally Solve a Problem

The most common argument boils down to application-specific cross-cutting concerns. The claim is that these concerns (logging, validation, data model integrity, whatever) can only be solved with triggers. For some reason, though, these cross-cutting concerns can't be solved through ordinary software design. I'm not sure why triggers are the only solution when simple OO design would be far simpler.

Some folks like to adopt the "multiple application programming languages" argument. That is, that ordinary OO design won't work because the code would have to be repeated in each language. This is largely bunk. It's mostly folks scent-marking territory and refusing to cooperate.

Step 1. Write a library and share it. It's hard to find a language that can't be used to write a sharable library. It's easy to find an organization where the Visual C# programmers are not on speaking terms with the Java programmers and the isolated Python folks are pariahs. This isn't technology. Any one of the languages can create the necessary shared library. A modicum of cooperation would be simpler than creating triggers.

Step 2. Get over it. "Duplicated" business logic is rampant in most organizations. Now that you know about, you can manage it. You don't need to add Yet Another Language to the problem. Just cooperate to propagate the changes.

They're Totally Essential To The Database

The silly argument is that some business rules are "closer to" or "essential to" the database. The reason I can call this silly is because when the data is converted to another database (or extracted to the data warehouse) the triggers aren't relevant or even needed. If the triggers aren't part of "interpreting" or "using" the data, they aren't essential. They're just convenient.

The data really is separate from the processing. And the data is far, far more valuable than the processing. The processing really is mostly application-specific. Any processing that isn't specific to the application really is a cross-cutting concern (see above). There is no "essential" processing that's magically part of the data.

What If...

Life is simpler if all application programming is done in application programming languages. And all triggers are just methods in classes. And everyone just uses the class library they're supposed to use.

"But what if someone doesn't use the proper library? A trigger would magically prevent problems."

If someone refuses to use the application libraries, they need career coaching. As in "find another job where breaking the rules is tolerated."

"There are many good reasons to NOT upgrade/modernize many applications, and I believe Gartner is out of line using words like 'debt' which have guilt associated with them,"

"Guilt"? That's a problem? Why are we pandering to an organization's (i.e., CIO's) emotional response?

I'm not sure that using a word like "debt" is a problem. Indeed, I think they should ramp up the threat level on this and add words like "short-sighted" and "daft" and perhaps even "idiotic".

Anyone who doesn't believe (or doesn't understand) technical debt needs only to review the Y2K plans and budgets. A bad technology decision lead to a mountain of rework. Yes, it was all successful, but it made IT budgeting difficult for years afterwords.

The rest of the organization was grumpy about having their projects were stalled until after Y2K. IT created it's own problems by letting the technology debt accumulate to a level where it was "fix or face an unacceptable risk of not being able to stay in business."

Wednesday, October 13, 2010

Lots of folks like to wring their hands over the Big Vague Concept (BVC™) labeled "security".

There's a lot of quibbling. Let's move beyond BVC to the interesting stuff.

I've wasted hours listening to people identify risks and costs of something that's not very complex. I've been plagued by folks throwing up the "We don't know what we don't know" objection to a web services interface. This objection amounts to "We don't know every possible vulnerability; therefore we don't know how to secure it; therefore all architectures are bad and we should stop development right now!" The OWASP top-ten list, for some reason, doesn't sway them into thinking that security is actually manageable.

What's more interesting than quibbling over BVC, is determining the authorization rules.

Basics

Two of the pillars of security are Authentication (who are you?) and Authorization (what are you allowed to do?)

Authentication is not something to be invented. It's something to be used. In our case, with an Apache/Django application, the Django authentication system works nicely for identity management. It supports a simple model of users, passwords and profiles.

We're moving to Open SSO. This takes identity management out of Django.

The point is that authentication is -- largely -- a solved problem. Don't invent. It's solved and it's easy to get wrong. Download or License an established product for identity management

and use it for all authentication.

Authorization

The Authorization problem is always more nuanced, and more interesting, than Authentication. Once we know who the user is, we still have to determine what they're really allowed to do. This varies a lot. A small change to the organization, or a business process, can have a ripple effect through the authorization rules.

In the case of Django, there is a "low-level" set of authorization tests that can be attached to each view function. Each model has an implicit set of three permissions (can_add, can_delete and can_change). Each view function can test to see if the current user has the required permission. This is done through a simple permission_required decorator on each view function.

The auth profile module can be used to provide additional authorization information. In our case, we just figured out that we have some "big picture" authorizations. For sales and marketing purposes, some clusters of features are identified as "products" (or "features" or "options" or something). They aren't smallish things like Django models. They aren't largish things like whole sites. They're intermediate things based on what customers like to pay for (and not pay for).

Some of these "features" map to Django applications. That's easy. The application view functions can all simply refuse to work if the user's contract doesn't include the option.

Sadly, however, some "features" are part of an application. Drat. We have two choices here.

Assure that there's a "default" option and configure the feature or the default at run time. For a simple class (or even a simple module) this isn't too hard. Picking a class to instantiate at run time is pretty standard OO programming.

Rewrite the application to refactor it into two applications: the standard version and the optional version. This can be hard when the feature shows up as one column in a displayed list of objects or one field in a form showing object details. However, it's very Django to have applications configured dynamically in the settings file.

Our current structure is simple: all customers get all applications. We have to move away from that to mix-and-match applications on a per-customer basis. And Django supports this elegantly.

Security In Depth

This leads us to the "Defense in Depth" buzzword bingo. We have SSL. We have SSO. We have high-level "product" authorizations. We have fine-grained Django model authorizations.

So far, all of this is done via Django group memberships, allowing us to tweak permissions through the auth module. Very handy. Very nice. And we didn't invent anything new.

All we invented was our high-level "product" authorization. This is a simple many-to-many relationship between the Django Profile model and a table of license terms and conditions with expiration dates.

Django rocks. The nuanced part is fine-tuning the available bits and pieces to match the marketing and sales pitch and the the legal terms and conditions in the contracts and statements of work.

Monday, October 4, 2010

For years, I've been using Apache POI in Java and XLRD in Python to read spreadsheets. Finally, now that .XLSX and .XLSM files are in more widespread use, we can move away from those packages and their reliance on successful reverse engineering of undocumented features.

Spreadsheets are -- BTW -- the universal user interface. Everyone likes them, they're almost inescapable. And they work. There's no reason to attempt to replace the spreadsheet with a web page or a form or a desktop application. It's easier to cope with spreadsheet vagaries than to replace them.

The downside is, of course, that users often tweak their spreadsheets, meaning that you never have a truly "stable" interface. However, transforming each row of data into a Python dictionary (or Java mapping) often works out reasonably well to make your application mostly immune to the common spreadsheet tweaks.

Most of the .XLSX and .XLSM spreadsheets we process can be trivially converted to CSV files. It's manual, yes, but a quick audit can check the counts and totals.

Yesterday we got an .XLSM with over 80,000 plus rows. It couldn't be trivially converted to CSV by my installation of Excel.

Step 2. It's a zip archive. So, to process the file, we need to locate the various bits inside the archive. In many cases, the zip members can be processed "in memory". In the case of our 80,000+ row spreadsheet, the archive is 34M. The sheet in question expands to a 215M beast. The shared strings are 3M. This doesn't easily fit into memory.

Further, a simple DOM parser, like Python's excellent ElementTree, won't work on files this huge.

This does two things. First, it locates the shared strings and the various sheets within the zip archive. Second, it expands the sheets and shared strings into the local working directory.

There are many other parts to the workbook archive. The good news is that we're not interesting in complex workbooks with lots of cool Excel features. We're interested in workbooks that are basically file-transfer containers. Usually a few sheets with a consistent format.

Once we have the raw files, we have to parse the shared strings first. Then we can parse the data. Both of these files are simple XML. However, they don't fit in memory. We're forced to use SAX.

We create the handler, create a parser, and process the shared strings portion of the workbook. When this is done, the handler has a dictionary of all strings. This is string_handler.string_dict. Note that a shelve database could be used if the string dictionary was so epic that it wouldn't fit in memory.

The Final Countdown

Once we have the shared strings, we can then parse each worksheet, using the share string data to reconstruct a simple CSV file (or JSON document or something more usable).

The Content Handler for the worksheet isn't too complex. We only want cell values, so there's little real subtlety. The biggest issue is coping with the fact that sometimes the content of a tag is reported in multiple parts.

We locate the four header rows and build labels from the the four rows of data. Given these big, complex headers, we can then build a dictionary from each data row. The resulting structure is exactly like the results of a csv.DictReader, and can be used to do the "real work" of the application.

Advertisers

About Me

Steven F. Lott is a consultant, teacher, author and software developer with over 35 years of experience building software of every kind, from specialized control systems for military hardware to large data warehouses to web service API's.