heavy industries.

consider this a replacement for all the notes text files in various directories on various machines, or in forgotten archives. any code found here can be presumed devoid of warranty, merchantability or fitness etc, unless explicitly stated otherwise.

Thursday, July 13, 2006

On CSS selectors, HTML ID and class attributes and reducing cost

Coming up with a suitable rule for the names of ID and class selectors that are both valid HTML and CSS — as well as being easy to remember and/or infer — is an oft-neglected chore in the realm of web development.

If an organization neglects to properly delegate this task, it typically ends up in the hands of those responsible for marking up and styling web content. These individuals tend to have little incentive to care about the consistency and semantic value of the names they choose as selectors. This has the potential to cause problems for those writing Javascript, as they use class and id attributes to select elements out of the DOM. Furthermore, it drives up the cost of adding microcontent later on, as the entire base of HTML templates, CSS stylesheets and Javascript is likely to have to be scrubbed.

It should be safe to assume the following:

Only marketing, search engine optimization (SEO) people and information architects should actually care about what these values say, as this is valuable microcontent.

Javascript programmers and markup/layout people should only care that the values are granular enough to "grab on to".

Back-end programmers should only care that documentation of these values exists, should they even care about any of this at all.

Despite their purported apathy, however, programmers could play an integral role in eliminating CSS as a source of frustration and loss of productivity. Coordinating with the information architect, programmers can use their object model to produce a scheme to generate CSS selector names (as well as CSS stubs) that will always be consistent. For example, the scheme could borrow Java class notation (e.g. com.foobar.myproject.MyClass) or XML Schema type names (e.g. xsd:dateTime).

CSS authors need never create selectors based on element IDs — please, let me know of a case (stone-age compatibility notwithstanding) in which it would be 100% necessary.

Produce consistent, unique element IDs for the Javascript authors — they're the ones who rely on them the most.

And now, fun facts about HTML class and id attributes and CSS selectors:

The HTML class attribute is of DTD type CDATA, meaning that literally any content can be placed into it, provided that content is escaped properly. The attribute value is split on whitespace and a CSS class selector matches any one of the resulting tokens, meaning any opaque token can be used as a class. An example of an opaque token could be a QName like an XML Schema datatype or perhaps an rdfs:Class URI. Both parent document formats of those datatypes (XML Schema and RDF Schema) have more than adequate capacity to document the nature of the object to be displayed on the screen.

The HTML id attribute is of DTD type ID, which, according to the spec, must start with an ASCII letter, followed by any number of ASCII letters, digits, hyphens, underscores, colons and periods. This means that it is possible to use an ASCII-conformant XML QName (except those in which the namespace prefix begins with "_"), and still maintain a valid document.
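The rule above is easy to check mechanically. Here is a minimal sketch, assuming the HTML 4 definition of ID tokens described in the paragraph; the helper name isValidHtml4Id is made up for illustration.

```javascript
// HTML 4 ID token, as described above: one ASCII letter, then any number
// of ASCII letters, digits, hyphens, underscores, colons and periods.
const HTML4_ID = /^[A-Za-z][A-Za-z0-9\-_:.]*$/;

// Hypothetical helper: true when `value` would be a valid HTML 4 id.
function isValidHtml4Id(value) {
  return HTML4_ID.test(value);
}

console.log(isValidHtml4Id("xsd:dateTime")); // true
console.log(isValidHtml4Id("_ns:thing"));    // false (prefix begins with "_")
```

A generator of element IDs could run every candidate through a check like this before it ever reaches a template.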

On the CSS side, characters in selectors that collide with the CSS grammar can be escaped with a backslash ("\"), followed by their Unicode codepoint (one to six hexadecimal digits), followed by optional whitespace if the code is shorter than the full six digits. Meaning that:

The preceding has been confirmed to work in recent versions of Gecko, MSIE, Opera, Konqueror and Safari.
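To make the escaping rule concrete, here is a small sketch that turns an opaque class token such as a QName into a CSS class selector. The function names are hypothetical, and it escapes more conservatively than the grammar strictly requires: anything other than ASCII letters, digits, hyphens and underscores is escaped, as is a leading digit or hyphen.

```javascript
// Hypothetical helper: escape one class token so it can follow "." in a selector.
function escapeCssToken(token) {
  let out = "";
  let first = true;
  for (const ch of token) {
    const safe = /^[A-Za-z_]$/.test(ch) || (!first && /^[0-9-]$/.test(ch));
    // Backslash, hexadecimal code point, then a space, so that the next
    // character (e.g. the "d" in "dateTime") is not read as a hex digit.
    out += safe ? ch : "\\" + ch.codePointAt(0).toString(16) + " ";
    first = false;
  }
  return out;
}

// Hypothetical helper: build a class selector from a raw token.
function classSelector(token) {
  return "." + escapeCssToken(token);
}

console.log(classSelector("foaf:Person"));  // .foaf\3a Person
console.log(classSelector("xsd:dateTime")); // .xsd\3a dateTime
```

Browsers that implement the CSSOM also offer CSS.escape(), which serializes a colon as "\:" rather than as a code point; both forms are valid CSS.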

To keep designers happy, I recommend devising a tool that will generate CSS stubs based on their object model hierarchy. Note that "inheritance" in the context of CSS is not the same as it is in a standard object hierarchy. Here is an example:

Now, this is an extreme example of a set of fully qualified class names and how to map them to CSS. SEO folk might argue that the token repetition is bad for Google scores and performance geeks might say that the length of the class names needlessly bloat the size of the document. I would recommend experimenting to find an optimal solution.
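The stub-generating tool suggested above could be sketched roughly as follows. Everything here is an assumption made for illustration: the object model is a made-up map from fully qualified class name to parent class, and the function names are hypothetical. Because CSS has no notion of class inheritance, the sketch has each element carry its whole lineage in its class attribute and emits one empty rule per class, ancestors first, so that later rules can override earlier ones.

```javascript
// Hypothetical object model: fully qualified class name -> parent (or null).
// Assumed to be free of cycles.
const model = {
  "com.foobar.myproject.Item": null,
  "com.foobar.myproject.Product": "com.foobar.myproject.Item",
  "com.foobar.myproject.Book": "com.foobar.myproject.Product",
};

// Escape anything outside [A-Za-z0-9_-] as backslash + hex code point + space.
function escapeCssToken(token) {
  return Array.from(token, (ch) =>
    /^[A-Za-z0-9_-]$/.test(ch) ? ch : "\\" + ch.codePointAt(0).toString(16) + " "
  ).join("");
}

// Ancestor chain of a class, root first.
function lineage(model, name) {
  const chain = [];
  for (let cur = name; cur != null; cur = model[cur]) chain.unshift(cur);
  return chain;
}

// Value for the HTML class attribute of an element displaying `name`.
function classAttribute(model, name) {
  return lineage(model, name).join(" ");
}

// One empty, commented rule per class, shallowest classes first.
function cssStubs(model) {
  return Object.keys(model)
    .sort((a, b) => lineage(model, a).length - lineage(model, b).length)
    .map((name) => {
      const parent = model[name] ? " extends " + model[name] : "";
      return `/* ${name}${parent} */\n.${escapeCssToken(name)} {\n}\n`;
    })
    .join("\n");
}

console.log(classAttribute(model, "com.foobar.myproject.Book"));
console.log(cssStubs(model));
```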

But, to satisfy the aforementioned people in the meantime, here is an example of ancestor-descendant selectors using XML Schema simple types to designate the datatypes and FOAF QName types to designate the relationships. Picture this markup:

/* foaf:name - A person's name. */
.foaf\3a Person .foaf\3a name { /* drop-cap the first initial or something cute like that */ }

/* foaf:weblog - The URL of a person's weblog. */
.foaf\3a Person .foaf\3a weblog { /* make it flash or something gaudy */ }

Upon implementing a system like this, I suggest that it is possible for those interested in the minutiae of the naming of CSS selectors (the marketing/SEO and the information architects) to employ those who benefit directly from the automation and documentation thereof (the programmers) to dictate values to those who would rather not have to think about choosing them anyway (the CSS authors). The result? Better cooperation, fewer bugs and quicker time to market.

Sunday, June 18, 2006

RDF Actually Isn't That Hard

That's right. Not hard. Easy, in fact. RDF just hangs with a bad crowd. I'll explain:

I suspect one of the major detriments to the Resource Description Framework and its proponents is the association with XML. I also posit that the second largest source of confusion is its use of URIs as something other than URLs. Now, the populace has just begun to understand the concept of a URL as a globally unique locator for porn, news, games, porn and shopping. The relative few tech-savvy types that actually understand XML do so in the capacity that it is a way to serialize an ordered, hierarchical data structure. RDF is actually a lot simpler.

That's right. The near-English from above has been mangled almost completely beyond recognition. Try it, it's valid. It's just not legible. The only thing RDF/XML and non-RDF XML have in common is the raw syntax.

"So Dorian, what about that funny namespace URI you used for 'foo'?"

That's the other part.

URIs don't actually have to be web pages.

Although in this case, I'd probably want the URI "http://doriantaylor.com/messing/with/your/head/with/RDF#" to point to an RDF schema that explained nicely what I understand the verbs "wears" and "recommends" to mean. But I digress. Just remember the following:

http://www.cnn.com/ is only ever going to point to CNN's homepage (unless, of course, they neglect to pay their domain bill).

urn:isbn:096139210X will only ever refer to a particular favourite book of mine.

tel:+1-900-HOT-CHIX is only ever going to be the destination of my date for Friday night.

Um, yeah.

The important part about URIs in RDF is that they represent globally unique resources — things, categories, ideas. Suppose I were to replace my original example with URIs:

When I swap URIs in, it becomes clear that those are the only things in the world I can possibly be talking about. The only loose thread is, collectively, what I consider "to wear", what a "manufacturer" is, and what it means "to recommend". This is where stuff like RDF Schema and OWL comes in, which I consider out of the scope of this post. I will say, however, that it's usually better to pick a lingua franca to describe certain things than to come up with your own, as I oh-so-naughtily did above.

One other item: that genid:GLASSES represents an item that is local to the set of statements, in order to tie a group of statements together. I can just as easily replace it with xxx:HGLAUGAHLGA or urn:x-foo:bizzle or http://doriantaylor.com/possessions/glasses where there might be a nice picture of me wearing my glasses. In fact, having all resources refer to something globally unique is preferred, but a generated ID can suffice in a pinch.
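To show how a local identifier ties statements together, here is a sketch that holds statements as plain subject, predicate, object triples. Only the namespace URI, genid:GLASSES and the glasses URL come from the text; the URIs for the person and the manufacturer, the use of my namespace for "manufacturer", and the renameNode helper are hypothetical stand-ins.

```javascript
// Namespace for the home-made verbs mentioned in the post.
const FOO = "http://doriantaylor.com/messing/with/your/head/with/RDF#";

// Hypothetical stand-in URIs, invented for this sketch.
const ME = "http://example.org/people/dorian";
const MAKER = "http://example.org/companies/some-optician";

// Each statement is [subject, predicate, object].
const statements = [
  [ME, FOO + "wears", "genid:GLASSES"],
  ["genid:GLASSES", FOO + "manufacturer", MAKER],
  [ME, FOO + "recommends", MAKER],
];

// Hypothetical helper: swap one identifier for another wherever it occurs,
// returning new triples and leaving the originals untouched.
function renameNode(triples, from, to) {
  return triples.map((t) => t.map((term) => (term === from ? to : term)));
}

const named = renameNode(
  statements,
  "genid:GLASSES",
  "http://doriantaylor.com/possessions/glasses"
);
console.log(named);
```

The statements say the same thing before and after the swap; the only difference is that the glasses now have a name that means something outside this set of statements.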

Why are we doing all of this?

For the computers, of course! The poor darlings work so hard but they're really not that bright. Especially when it comes to icky human things like semantics. The idea is, if we give them enough clues, they will work really hard to help us get a better picture of the world around us.

If people could advertise their work in a unified way, our computers could sort and filter this information based on what it actually is, rather than words it contains or what refers to it.

Wednesday, October 12, 2005

broad security considerations for web apps

Never reveal the engine's implementation.

Stack traces and internal errors echoed to the front-facing part of a web application can reveal vulnerabilities in underlying implementations. This is by no means advocacy of error suppression, as proper diagnostics are invaluable for debugging purposes. Ideally, error types should be mapped to a user-friendly dictionary that allows the user to describe his or her situation to support staff, while the stack trace slips quietly out the back into the error log where it belongs. Applications should be designed so that bubbling a stack trace to the user-facing side takes more effort than showing a friendly error message.
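A minimal sketch of that split, assuming errors carry a code property and that the caller supplies a logging function and a reference string the user can quote to support staff. The dictionary entries and the toUserError name are hypothetical.

```javascript
// Hypothetical dictionary: internal error code -> text safe to show a user.
const FRIENDLY = new Map([
  ["DB_UNAVAILABLE", "The service is temporarily unavailable. Please try again shortly."],
  ["NOT_FOUND", "We could not find what you were looking for."],
]);
const FALLBACK = "Something went wrong on our end.";

// Send full diagnostics to the log; hand back only a friendly message and
// a reference that support staff can use to find the matching log entry.
function toUserError(err, log, reference) {
  log(`[${reference}] ${err.stack || err}`);
  return { reference, message: FRIENDLY.get(err.code) ?? FALLBACK };
}

const failure = new Error("connect ECONNREFUSED 10.0.0.5:5432");
failure.code = "DB_UNAVAILABLE";
console.log(toUserError(failure, console.error, "ref-0001"));
```

If a helper like this is the only convenient route from an exception to the response, showing the friendly message becomes the path of least resistance.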

Limit the dataset exposed to the templating layer.

With modern web MVC frameworks, it is typically very easy to provide fairly large subsets of the data structures and objects from within the application's space to the templating mechanism. In some systems, it is possible to provide live executable objects to the templating layer, thereby allowing layout designers to easily muddle logic and presentation, as well as inadvertently (or possibly deliberately) yield aspects of the application's internal state.
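One way to limit the exposure is to hand the template a detached copy of only the fields it needs. The sketch below is illustrative: toViewModel is a hypothetical name, and the JSON round trip is a simplification that assumes the exposed data is JSON-serializable (dates, for instance, come out as strings).

```javascript
// Hypothetical helper: copy only the allowed own fields of `source`, then
// round-trip through JSON so the template receives plain, detached data.
// The round trip silently drops functions, so no live code gets through.
function toViewModel(source, allowedFields) {
  const picked = {};
  for (const field of allowedFields) {
    if (Object.prototype.hasOwnProperty.call(source, field)) {
      picked[field] = source[field];
    }
  }
  return JSON.parse(JSON.stringify(picked));
}

const user = {
  name: "Ada",
  passwordHash: "not for templates",
  profile: { city: "Paris" },
  save() { /* talks to the database */ },
};

console.log(toViewModel(user, ["name", "profile", "save"]));
// { name: 'Ada', profile: { city: 'Paris' } }
```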

Decouple the authentication mechanism from the primary application.

When a web application is decoupled from its authentication mechanism, it enjoys two major benefits: First, developers do not have to invoke authentication (however it may be done) on a per-resource basis, and can develop as if their application is already authenticated. Second, both frameworks can be tested independently of each other, allowing for more efficient and effective testing.

Organize authenticated resources by URI path segments.

Stand-alone authentication modules are easiest to organize by URI path. Choosing an initial path segment like /auth or /protected, underneath which all resources are covered by the authentication module, will emphasize the logical division between regular and logged-in resources. Authorization modules can then be bound to subpaths under these segments.
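The path test itself is tiny, but it is worth getting the segment boundary right. A sketch, assuming the prefix /protected and a path that has already been normalized (no dot segments, percent-encoding resolved); the function name is hypothetical.

```javascript
// Assumed prefix under which every resource requires authentication.
const PROTECTED_PREFIX = "/protected";

// True when `path` is the protected segment itself or lies beneath it.
// Comparing against the prefix plus "/" keeps "/protectedness" out.
function requiresAuthentication(path) {
  return path === PROTECTED_PREFIX || path.startsWith(PROTECTED_PREFIX + "/");
}

console.log(requiresAuthentication("/protected/account")); // true
console.log(requiresAuthentication("/protectedness"));     // false
```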

Never trust input.

Input traditionally can come from three different sources: query string, form data and cookies. A common practice is to merge these into a single dictionary of key-value pairs and pass them into the application as parameters. Under no circumstances should it be possible to access this dictionary without it having been sanitized. Each entry in the dictionary should be checked against an expected set of keys. Entries present that are not expected should be at least removed, or better, produce an error. This should occur before validation of the values. Following this, the values can be checked against their respective validating functions, as well as each other, in the case of interdependencies. It is only now that the application developer should be able to retrieve these parameters.
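The order of operations described above can be sketched as a single gatekeeper function. The names are hypothetical, and it makes two assumptions worth stating: unexpected keys produce an error rather than being dropped, and expected keys that are absent are treated as optional.

```javascript
// `spec` maps each expected key to a validating function.
// `crossChecks` are functions over the validated parameters as a whole.
function sanitize(raw, spec, crossChecks = []) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

  // Step 1: reject unexpected keys before looking at any value.
  for (const key of Object.keys(raw)) {
    if (!has(spec, key)) throw new Error(`Unexpected parameter: ${key}`);
  }

  // Step 2: check each value against its own validating function.
  const clean = {};
  for (const key of Object.keys(spec)) {
    if (!has(raw, key)) continue; // assumption: absent keys are optional
    if (!spec[key](raw[key])) throw new Error(`Invalid value for: ${key}`);
    clean[key] = raw[key];
  }

  // Step 3: check values against each other.
  for (const check of crossChecks) {
    if (!check(clean)) throw new Error("Inconsistent parameters");
  }

  // Only now does the application get to see the parameters.
  return Object.freeze(clean);
}

const spec = {
  min: (v) => /^[0-9]{1,3}$/.test(v),
  max: (v) => /^[0-9]{1,3}$/.test(v),
};
const rangeOk = (p) => Number(p.min) <= Number(p.max);

console.log(sanitize({ min: "1", max: "5" }, spec, [rangeOk]));
```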

Never trust input (Part II).

With the advent of web services, arbitrarily nested XML content (or any other content, for that matter) can come in via a POST or PUT request. XML content should be first inspected for encoding consistency. Ideally, the content's encoding should be converted into one's preferred Unicode format before it is parsed (suggest UTF-8). More ideally, this should happen in a separate process space, in case of an attempt to compromise. Any illegal character sequences should produce an error. Following this, the document should be parsed and checked for well-formedness. Lack of compliance should produce an error. Finally, the document should be checked against a DTD, XML Schema or RelaxNG content model. Anything not validating should produce an error. All of this should happen before the content is received by the application layer.
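The first of those steps, refusing illegal character sequences before any parsing, can be sketched with the standard TextDecoder in its strict mode. This assumes the incoming bytes are supposed to be UTF-8 and a runtime where TextDecoder supports the fatal option; the function name is hypothetical. The later steps (well-formedness, then validation against a content model) need a real XML parser and are not shown.

```javascript
// Decode request bytes as UTF-8. With `fatal: true` the decoder throws a
// TypeError on any illegal byte sequence instead of quietly substituting
// U+FFFD, so malformed input never reaches the parser.
function decodeStrictUtf8(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

console.log(decodeStrictUtf8(new Uint8Array([0x3c, 0x61, 0x2f, 0x3e]))); // <a/>

try {
  decodeStrictUtf8(new Uint8Array([0x3c, 0xff, 0x3e])); // 0xff is never valid UTF-8
} catch (err) {
  console.log("rejected before parsing");
}
```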

Never trust input (Part III).

It is extremely unwise to provide a user with data of any kind (cookie, hidden form field) with the expectation of retrieving it unmodified. Data should only be passed to the user when it is in the user's own interest to keep it intact (e.g. session, or target resource). Ideally, the capacity for a web application developer to produce a dependency on a piece of user data should require greater effort than interacting with a central state mechanism.

Conclusion

Developers will inherently follow the path of least resistance to complete a project. If not of their own volition, they will be pressed to do so. Creating an environment that requires less effort to do the right (read: secure) thing will yield an overall more stable product with few to no embarrassing compromises.

Saturday, September 10, 2005

Javascript DOM.Traversal

I'm looking into creating a bridge for the DOM level 2 Traversal API for MSIE and Safari. It would be significantly less painful than trying to navigate documents with just the DOM core API, and perhaps a good framework for a DOM 3 XPath bridge implementation as well, as Opera and Safari don't implement it.

I'm going to be publishing this on JSAN. I'll update when it's available.
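For a sense of what such a bridge has to do, here is a minimal sketch of a document-order walk built from DOM core properties alone (firstChild, nextSibling, parentNode). The function names are hypothetical and this is not the planned JSAN module; it leaves out whatToShow, filters that reject whole subtrees, and backwards traversal.

```javascript
// Return the node that follows `node` in document order, staying inside
// the subtree rooted at `root`, or null when the walk is finished.
function nextNode(node, root) {
  // Descend first.
  if (node.firstChild) return node.firstChild;
  // Otherwise climb until some ancestor (or the node itself) has a sibling.
  for (let cur = node; cur && cur !== root; cur = cur.parentNode) {
    if (cur.nextSibling) return cur.nextSibling;
  }
  return null;
}

// Collect every descendant of `root` accepted by `accept`, in document order.
function collect(root, accept) {
  const found = [];
  for (let n = nextNode(root, root); n; n = nextNode(n, root)) {
    if (accept(n)) found.push(n);
  }
  return found;
}
```

In a browser, collect(document.body, function (n) { return n.nodeType === 1; }) would gather every element under the body; the same code works on any objects that expose those three properties.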