I kind of love Passenger for my Rails deployments. It Just Works, it does exactly what it should do, no muss, no fuss. I use Passenger with apache.

I very occasionally have a problem that I am not able to reproduce in my dev environment, and only seems to reproduce on production using Passenger apache. Note well: In every case so far, the problem actually had nothing to do with passenger or apache, there were other differences in environment that were causing it.

But still, being able to drop into a debugger in the Rails actually running under apache Passenger would have helped me find it quicker.

Support for dropping into the debugger, remotely, when running under Apache is included only in Passenger Enterprise. I recommend considering purchasing Passenger to support the Passenger team, the price is reasonable… for one server or two. But I admit I have not yet purchased Enterprise, mainly because the number of dev/staging/production servers I would want it on to have it everywhere starts to make the cost substantial for my environment.

I haven’t tried it yet, but making this post as a note to myself and others who might want to give it a try.

The really exciting thing only in Passenger Enterprise, to me, is the way it can deploy with a hybrid multiple process+multi-threaded-request-dispatch setup. This is absolutely the best way to deploy under MRI, I have no doubts at all, it just is (and I’m surprised it’s not getting more attention). This lower-level feature is unlikely to come from a third-party open source, and I’m not sure I’d trust it if it did. The open source Puma, an alternative to Passenger, also offers this deploy model. I haven’t tried it in Puma myself beyond some toy testing like the benchmark mentioned above. But I know I absolutely trust Passenger to get it right with no fuss. If you need to maximize performance (or deal with avoiding end-user latency spikes in the presence of some longer-running requests), and deploy under MRI, you should definitely consider Passenger Enterprise just for this multi-process/multi-thread combo feature.

I received my author’s copy of More Library Mashups, edited by Nicole Engard. I notice the publisher’s site is still listing it as “pre-order”, but I think it’s probably available for purchase (in print or e).

I’m hoping it attracts some more attention and exposure for Umlaut, and maybe gets some more people trying it out.

Consider asking your employing library to purchase a copy of the book for the collection! It looks like it’s got a lot of interesting stuff in it, including a chapter by my colleague Sean Hannan on building a library website by aggregating content services.

Breaking new ground for the open-access movement, the Bill & Melinda Gates Foundation, a major funder of global health research, plans to require that the researchers it funds publish only in immediate open-access journals.

The policy doesn’t kick in until January 2017; until then, grantees can publish in subscription-based journals as long as their paper is freely available within 12 months. But after that, the journal must be open access, meaning papers are free for anyone to read immediately upon publication. Articles must also be published with a license that allows anyone to freely reuse and distribute the material. And the underlying data must be freely available.

Is this going to work? Will researchers be able to comply with these requirements without harm to their careers? Does the Gates Foundation fund enough research that new open access venues will open up to publish this research (and if so how will their operation be funded?), or do sufficient venues already exist? Will Gates Foundation grants include funding for “gold” open access fees?

I am interested to find out. I hope this article is accurate about what their doing, and am glad they are doing it if so.

The Gates Foundation’s own announcement appears to be here, and their policy, which doesn’t answer very many questions but does seem to be bold and without wiggle-room, is here.

I note that the policy mentions “including any underlying data sets.” Do they really mean to be saying that underlying data sets used for all publications “funded, in whole or in part, by the foundation” must be published? I hope so. Requiring “underlying data sets” to be available at all is in some ways just as big or bigger as requiring them to be available open access.

1. Regexp.union

Have a bunch of regex’s, and want to see if a string matches any of them, but don’t actually care which one it matches, just if it matches any one or more? Don’t loop through them, combine them with Regexp.union.

2. Regexp.escape

Have an arbitrary string that you want to embed in a regex, interpreted as a literal? Might it include regex special chars that you want interpreted as literals instead? Why even think about whether it might or not, just escape it, always.

A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?

Note that “linked data” is basically talking about the same technologies as “semantic web”, it’s sort of the new branding for “semantic web”, with some minor changes in focus.

The top-rated comment in the discussion says, in part:

A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them.

I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.

The semantic web as originally proposed (Berners-Lee, Hendler, Lassila) is as dead as last year’s roadkill, though there are plenty out there that pretend that’s not the case. There’s still plenty of groups trying to revive the original idea, or like most things in the KM field, they’ve simply changed the definition to encompass something else that looks like it might work instead.

The reasons are complex but it basically boils down to: going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.

The entire comment, and, really the entire thread, are worth a read. There seems to be a lot of energy in libraryland behind trying to produce “linked data”, and I think it’s important to pay attention to what’s going on in the larger world here.

Especially because much of the stated motivation for library “linked data” seems to have been: “Because that’s where non-library information management technology is headed, and for once let’s do what everyone else is doing and not create our own library-specific standards.” It turns out that may or may not be the case, if your motivation for library linked data was “so we can be like everyone else,” that simply may not be an accurate motivation, everyone else doesn’t seem to be heading there in the way people hoped a few years ago.

On the other hand, some of the reasons that semantic web/linked data have not caught on are commercial and have to do with business models.

One of the reasons that whole thing died was that existing business models simply couldn’t be reworked to make it make sense. If I’m running an ad driven site about Cat Breeds, simply giving you all my information in an easy to parse machine readable form so your site on General Pet Breeds can exist and make money is not something I’m particularly inclined to do. You’ll notice now that even some of the most permissive sites are rate limited through their API and almost all require some kind of API key authentication scheme to even get access to the data.

It may be that libraries and other civic organizations, without business models predicated on competition, may be a better fit for implementation of semantic web technologies. And the sorts of data that libraries deal with (bibliographic and scholarly) may be better suited for semantic data as well compared to general commercial business data. It may be that at the moment libraries, cultural heritage, and civic organizations are the majority of entities exploring linked data.

Still, the coarsely stated conclusion of that top-rated HN comment is worth repeating:

going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.

Putting data into linked data form simply because we’ve been told that “everyone is doing it” without carefully understanding the use cases such reformatting is supposed to benefit and making sure that it does — risks undergoing great expense for no payoff. Especially when everyone is not in fact doing it.

GIGO

Taking the same data you already have and reformatting as “linked data” does not neccesarily add much value. If it was poorly controlled, poorly modelled, or incomplete data before — it still is even in RDF. You can potentially add a lot more value and more additional uses of your data by improving the data quality than by working to reformat it as linked data/RDF. The idea that simply reformatting it as RDF would add significant value was predicated on the idea of an ecology of software and services built to use linked data, software and services exciting enough that making your data available to them would result in added value. That ecology has not really materialized, and it’s hardly clear that it will (and to the extent it does, it may only be if libraries and cultural heritage organizations create it; we are unlikely to get a free ride on more general tools from a wider community).

But please do share your data

To be clear, I still highly advocate taking the data you do have and making it freely available under open (or public domain) license terms. In whatever formats you’ve already got it in. If your data is valuable, developers will find a way to use it, and simply making the data you’ve already got available is much less expensive than trying to reformat it as linked data. And you can find out if anyone is interested in it. If nobody’s interested in your data as it is — I think it’s unlikely the amount of interest will be significantly greater after you model it as ‘linked data’. The ecology simply hasn’t arisen to make using linked data any easier or more valuable than using anything else (in many contexts and cases, it’s more troublesome and challenging than less abstract formats, in fact).

Following the bandwagon vs doing the work

Part of the problem is that modelling data is inherently a context-specific act. There is no universally applicable model — and I’m talking here about the ontological level of entities and relationships, what objects you represent in your data as distinct entities and how they are related. Whether you model it as RDF or just as custom XML, the way you model the world may or may not be useful or even usable by those in different contexts, domains and businesses. See “Schemas aren’t neutral” in the short essay by Cory Doctorow linked to from that HN comment. But some of the linked data promise is premised on the idea that your data will be both useful and integrate-able nearly universally with data from other contexts and domains.

These are not insoluble problems, they are interesting problems, and they are problems that libraries as professional information organizations rightly should be interested in working on. Semantic web/linked data technologies may very well play a role in the solutions (although it’s hardly clear that they are THE answer).

It’s great for libraries to be interested in working on these problems. But working on these problems means working on these problems, it means spending resources on investigation and R&D and staff with the right expertise and portfolio. It does not mean blindly following the linked data bandwagon because you (erroneously) believe it’s already been judged as the right way to go by people outside of (and with the implication ‘smarter than’) libraries. It has not been.

For individual linked data projects, it means being clear about what specific benefits they are supposed to bring to use cases you care about — short and long term — and what other outside dependencies may be necessary to make those benefits happen, and focusing on those too. It means understanding all your technical options and considering them in a cost/benefit/risk analysis, rather than automatically assuming RDF/semantic web/linked data and as much of it as possible.

It means being aware of the costs and the hoped for benefits, and making wise decisions about how best to allocate resources to maximize chances of success at those hoped for benefits. Blindly throwing resources into taking your same old data and sharing it as “linked data”, because you’ve heard it’s the thing to do, does not in fact help.

“Information had very strong geographical boundaries,” he says. “I come from a place where those boundaries are very, very apparent. They are in your face. To be able to make a dent in that is a very attractive proposition.”

Acharya’s continued leadership of a single, small team (now consisting of nine) is unusual at Google, and not necessarily seen as a smart thing by his peers. By concentrating on Scholar, Acharya in effect removed himself from the fast track at Google…. But he can’t bear to leave his creation, even as he realizes that at Google’s current scale, Scholar is a niche.

…But like it or not, the niche reality was reinforced after Larry Page took over as CEO in 2011, and adopted an approach of “more wood behind fewer arrows.” Scholar was not discarded — it still commands huge respect at Google which, after all, is largely populated by former academics—but clearly shunted to the back end of the quiver.

…Asked who informed him of what many referred to as Scholar’s “demotion,” Acharya says, “I don’t think they told me.” But he says that the lower profile isn’t a problem, because those who do use Scholar have no problem finding it. “If I had seen a drop in usage, I would worry tremendously,” he says. “There was no drop in usage. I also would have felt bad if I had been asked to give up resources, but we have always grown in both machine and people resources. I don’t feel demoted at all.”

I’ve been trying to take my Rails error logs more seriously to make sure I handle any bugs revealed. 404’s can indicate a problem, especially when the referrer is my app itself. So I wanted to get all of those 404’s for Apache’s internal dummy connection out of my log. (How I managed to fight with Rails logs enough to actually get useful contextual information on FATAL errors is an entirely different complicated story for another time).

How can I make a Rails app handle them?

Well, first, let’s do a standards check and see that RFC 2616 HTTP 1.1 Section 9 (I hope I have a current RFC that hasn’t been superseded) says:

If the Request-URI is an asterisk (“*”), the OPTIONS request is intended to apply to the server in general rather than to a specific resource. Since a server’s communication options typically depend on the resource, the “*” request is only useful as a “ping” or “no-op” type of method; it does nothing beyond allowing the client to test the capabilities of the server. For example, this can be used to test a proxy for HTTP/1.1 compliance (or lack thereof).

Okay, sounds like we can basically reply with whatever we want to this request, it’s a “ping or no-op”. How about a 200 text/plain with “OK\n”?

Here’s a line I added to my Rails routes.rb file that seems to catch the “*” requests and just respond with such a 200 OK.

Since “*” is a special glob character to Rails routing, looks like you have to do that weird constraints trick to actually match it. (Thanks to mbklein, this does not seem to be documented and I never would have figured it out on my own).

And then we can use a little “Rack app implemented in a lambda” trick to just return a 200 OK right from the routing file, without actually having to write a controller action somewhere else just to do this.

I have not yet tested this extensively, but I think it works? (Still worried if Apache is really requesting “OPTIONS *” instead of “OPTIONS /*” it might not be. Stay tuned.)

Rails 4 removes the ability to produce non-digest-named assets in addition to digest-named-assets. (ie ‘application.js’ in addition to ‘application-810e09b66b226e9982f63c48d8b7b366.css’).

There are a variety of ways to work around this by extending asset compilation. After researching and considering them all, I chose to use a custom Rake task that uses the sprockets manifest.json file. In this post, I’ll explain the situation and the options.

It produces assets to be delivered to the client that are fingerprinted with a digest hash based on the contents of the file — such as ‘application-810e09b66b226e9982f63c48d8b7b366.css’. People (and configuration) often refer to this filename-fingerprinting as “digested assets”.

In Rails3, a ‘straight’ named copy of the assets (eg `application.js`) were also produced, alongside the fingerprinted digest-named assets.

Rails4 stopped doing this by default, and also took away any ability to do this even as a configurable option. While I can’t find the thread now, I recall seeing discussion that in Rails3, the production of non-digest-named assets was accomplished through actually asking sprockets to compile everything twice, which made asset compilation take roughly twice as long as it should. Which is indeed a problem.

Rather than looking to fix Sprockets api to make it possible to compile the file once but simply write it twice, Rails devs decided there was no need for the straight-named files at all, and simply removed the feature.

Why would you need straight-named assets?

The title of this issue reveals one reason people wanted the non-digest-named assets: “breaks compatibility with bad gems”. This mainly applies to gems that supply javascript, which may need to generate links to assets, and not be produced to look up the current digest-named URLs. It’s really about javascript, not ‘gems’, it can apply to javascript you’ve included without gemifying it too.

The Rails devs expression opinion on this issue believed (at least initially) that these ‘bad gems’ should simply be fixed, accomodating them was the wrong thing to do, as it eliminates the ability to cache-forever assets they refer to.

I think they under-estimate the amount of work it can take to fix these ‘bad’ JS dependencies, which often are included through multi-level dependency trees (requiring getting patches accepted by multiple upstreams) — and also basically requires wrapping all JS assets in rubygems that apply sprockets/rails-specific patches on top, instead of, say, just using bower.

I think there’s a good argument for accommodating JS assets which the community has not yet had the time/resources to make respect the sprockets fingerprinting. Still, it is definitely preferable, and always at least theoretically possible, to make all your JS respect sprockets asset fingerprinting — and in most of my apps, I’ve done that.

But there’s other use cases: like mine!

I have an application that needs to offer a Javascript file at a particular stable URL, as part of it’s API — think JS “widgets”.

I want it to go through the asset pipeline, for source control, release management, aggregation, SASS, minimization, etc. The suggestion to just “put it in /public as a static asset” is no good at all. But I need the current version available at a persistent URL.

Rails 3, this Just Worked, since the asset pipeline created a non-digested name. In Rails 4, we need a workaround. I don’t need every asset to have a non-digest-named version, but I do need a whitelist of a few that are part of my public API.

I think this is a pretty legitimate use case, and not one that can be solved by ‘fixing bad gems’. I have no idea if Rails devs recognize it or not.

Possible Workaround Options

So that giant Github Issue thread? At first it looks like just one of those annoying ones with continual argument by uninformed people that will never die, and eventually @rafaelfranca locked it. But it’s also got a bunch of comments with people offering their solutions, and is the best aggregation of possible workarounds to consider — I’m glad it wasn’t locked sooner. Another example of how GitHub qualitatively improves open source development — finding this stuff on a listserv would have been a lot harder.

The problem is it won’t work. I’ve got nothing against a rake task solution. It’s easy to wire things up so your new rake task automatically gets called every time after `rake assets:precompile’, no problem!

The problem is that a deployed Rails app may have multiple fingerprinted versions of a particular asset file around, representing multiple releases. And really you should set things up this way — because right after you do a release, there may be cached copies of HTML (in browser caches, or proxying caches including a CDN) still around, still referencing the old version with the old digest fingerprint. You’ve got to keep it around for a while.

(How long? Depends on the cache headers on the HTML that might reference it. The fact that sprockets only supports keeping around a certain number of releases, and not releases made within a certain time window, is a different discussion. But, yeah, you need to keep around some old versions).

So it’s unpredictable which of the several versions you’ve got hanging around the rake task is going to copy to the non-digest-named version, there’s no guarantee it’ll be the latest one. (Maybe it depends on their lexographic sort?). That’s no good.

Enhance the core-team-suggested rake task?

Before I realized this problem, I had already spent some time trying to implement the basic rake task, add a whitelist parameter, etc. So I tried to keep going with it after realizing this problem.

I figured, okay, there are multiple versions of the asset around, but sprockets and rails have to know which one is the current one (to serve it to the current application), so I must be able to use sprockets ruby API in the rake task to figure it out and copy that one.

It was kind of challenging to figure out how to get sprockets to do this, but eventually it was sort of working.

Except i started to get worried that I might be triggering the double-compilation that Rails3 did, which I didn’t want to do, and got confused about even figuring out if I was doing it.

And I wasn’t really sure if I was using sprockets API meant to be public or internal. It didn’t seem to be clearly documented, and sprockets and sprockets-rails have been pretty churny, I thought I was taking a significant risk of it breaking in future sprockets/rails version(s) and needing continual maintenance.

Verdict: Nope, not so simple, even though it seems to be the rails-core-endorsed solution.

Monkey-patch sprockets: non-stupid-digest-assets

Okay, so maybe we need to monkey-patch sprockets I figured.

@alexspeller provides a gem to monkey-patch Sprockets to support non-digested-asset creation, the unfortunately combatively named non-stupid-digest-assets.

If someone else has already figured it out and packaged it in a gem, great! Maybe they’ll even take on the maintenance burden of keeping it working with churny sprockets updates!

But non-stupid-digest-assets just takes the same kind logic from that basic rake task, another pass through all the assets post-compilation, but implements it with a sprockets monkeypatch instead of a rake task. It does add a white list. I can’t quite figure out if it’s still subject to the same might-end-up-with-older-version-of-asset problem.

There’s really no benefit just to using a monkey patch instead of a rake task doing the same thing, and it has increased risk of breaking with new Rails releases. Some have already reported it not working with the Rails 4.2.betas — I haven’t investigated myself to see what’s up with that, and @alexspeller doesn’t seem to be in any hurry to either.

Verdict: Nope. non-stupid-digest-assets ain’t as smart as it thinks it is.

Monkey-patch sprockets: The right way?

If you’re going to monkey-patch sprockets and take on forwards-compat risk, why not actually do it right, and make sprockets simply write the compiled file to two different file locations (and/or use symlinks) at the point of compilation?

At this point, I was too scared of the forwards-compatibility-maintenance risks of monkey patching sprockets, and realized there was another solution I liked better…

Verdict: It’s the right way to do it, but carries some forwards-compat maintenance risk as an unsupported monkey patch

Use the Manifest, Luke, erm, Rake!

I had tried and given up on using the sprockets ruby api to determine ‘current digest-named asset’. But as I was going back and reading through the Monster Issue looking for ideas again, I noticed @drojas suggested using the manifest.json file that sprockets creates, in a rake task.

Yep, this is where sprockets actually stores info on the current digest-named-assets. Forget the sprockets ruby api, we can just get it from there, and make sure we’re making a copy (or symlinking) the current digested version to the non-digested name.

But are we still using private api that may carry maintenance risk with future sprockets versions? Hey, look, in a source code comment Sprockets tells us “The JSON is part of the public API and should be considered stable.” Sweet!

Now, even if sprockets devs remember one of them once said this was public API (I hope this blog post helps), and even if sprockets is committed to semantic versioning, that still doesn’t mean it can never change. In fact, the way some of rubydom treats semver, it doesn’t even mean it can’t change soon and frequently; it just means they’ve got to update the sprockets major version number when it changes. Hey, at least that’d be a clue.

I didn’t really notice this one until I had settled on The Manifest. It requires two HTTP requests every time a client wants the asset at the persistent URL though. The first one will touch your app and needs short cache time, that will then redirect to the digest-named asset that will be served directly by the web server and can be cached forever. I’m not really sure if the performance implications are significant, probably depends on your use cases and request volume. @will-r suggests it won’t work well with CDN’s though.

Verdict: Meh, maybe, I dunno, but it doesn’t feel right to introduce the extra latency

The Future

But what’s “this issue” exactly? I dunno, they are not sharing what they see as the legitimate use cases to handle, and requirements on legitimate ways to handle em.

I kinda suspect they might just be dealing with the “non-Rails JS that needs to know asset URLs” issue, and considering some complicated way to automatically make it use digest-named assets without having to repackage it for Rails. Which might be a useful feature, although also a complicated enough one to have some bug risks (ah, the story of the asset pipeline).

And it’s not what I need, anyway, there are other uses cases than the “non-Rails JS” one that need non-digest-named assets.

I just need sprockets to produce parallel non-digested asset filenames for certain whitelisted assets. That really is the right way to handle it for my use case. Yes, it means you need to know the implications and how to use cache headers responsibly. If you don’t give me enough rope to hang myself, I don’t have enough rope to climb the rock face either. I thought Rails target audience was people who know what they’re doing?

It doesn’t seem like this would be a difficult feature for sprockets to implement (without double compilation!). @ryana’s monkeypatch seems like pretty simple code that is most of the way there. It’s the feature what I need.

I considered making a pull request to sprockets (the first step, then probably sprockets-rails, needs to support passing on the config settings). But you know what, I don’t have the time or psychic energy to get in an argument about it in a PR; the Rails/sprockets devs seem opposed to this feature for some reason. Heck, I just spent hours figuring out how to make my app work now, and writing it all up for you instead!

But, yeah, just add that feature to sprockets, pretty please.

So, if you’re reading this post in the future, maybe things will have changed, I dunno.

Always, always, always, start with “thank you for reporting this problem.”

1. Because it’s true, we need the problem reports to know about problems, and too often people are scared to report problems because of past bad experiences, or don’t report problems because they figure someone else already has, or because they are busy and don’t have the time.

2. Because it gets the support interaction off to a good collegial cooperative start.

It works. Do it every time. Even when the problem being reported doesn’t make any sense and you’re sure (you think!) that it’s not a real problem.

If they give a good problem report with actual reproduction steps and a clear explanation of why the outcome is not what they expected, thank them extra special.