Ethical code for software engineering professionals

Do software engineers and others who write software have professional ethical duties?

Might one of them be to do one's best to create secure software (rather than intentionally releasing software with vulnerabilities, so that people in the know can exploit them), and to responsibly disclose any security vulnerabilities found in third-party software (rather than keeping them close so they can be used for exploits)?

Of course, the APA policy didn't keep the psychologists from doing what they did, and there is some suggestion that the APA even intentionally made sure to leave enough loopholes, which it may now regret. And there have been similar controversies within Anthropology. Simply writing rules is no magic bullet for ethical behavior, but I still think it's a useful point for inquiry: at least acknowledging that there is such a thing as professional ethics for the profession, and providing official recognition that these discussions are part of the profession.

Are there ethical duties of software engineers and others who create software? As software becomes more and more socially powerful, is it important to society that this be recognized? Are these discussions happening? What professional bodies might they take place in? (IEEE? ACM?) The ACM has a code of ethics, but it's pretty vague; it seems easy to justify just about any profit-making activity under it.

Are these discussions happening? Will the extensive Department of Defense funding of Computer Science (theoretical and applied) in the U.S. make it hard to have them? (When I googled, the discussion that came up of how DoD funding affects computer science research was from 1989; there may be self-interested reasons people aren't that interested in talking about this.)

Filed under: General

Be careful of regexes in a unicode world
https://bibwild.wordpress.com/2015/02/04/be-careful-of-regexes-in-a-unicode-world/ (4 Feb 2015)

Check out the following, which I wrote some time ago:
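The snippet was something like this minimal sketch (a reconstruction, not the original code):

# Reconstructed example: \w silently ignores non-ascii letters
"Libró".scan(/\w+/)  # => ["Libr"] -- the ó is left out!
"ó" =~ /\w/          # => nil, no match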

Oops. ó is not in the class [a-zA-Z0-9_]. \w doesn't actually mean "a word character" at all, unless your input is only ASCII. The docs probably really should warn you about this, describing the class as "an ASCII word character", and warning you to use other metacharacters if you aren't just dealing with ASCII.

Fortunately, ruby also provides some unicode-aware regex character classes, though they're a lot harder to remember and longer to type. Here it is done right; let's use a unicode-aware space class instead of `\s` too:
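For example (again a reconstructed sketch, not the original snippet):

# POSIX-style bracket classes are unicode-aware in ruby
"Libró".scan(/[[:alpha:]]+/)          # => ["Libró"]
"ó" =~ /[[:alpha:]]/                  # => 0, matches now
"foo\u00A0bar".split(/[[:space:]]/)   # => ["foo", "bar"] -- even a non-breaking space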

Yep, that's what we wanted. There are several other unicode-aware character classes, apparently defined by POSIX. The docs also say there are a couple of non-POSIX ones, including:

/[[:word:]]/ - A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation

I wasn't able to make that work; it didn't seem to be recognized in my ruby. I am not sure why, and didn't bother finding out. What works is good enough for me.

But in a non-ascii world, it turns out, you almost never actually want to use those traditional regex character class metacharacters that many of us have been using for decades. \w and \s, no way. \d is less risky, since you probably really do mean 0-9 and not digits from some other script, but that had better be what you mean.

Filed under: General

Ruby threads, gotcha with local vars and shared state
https://bibwild.wordpress.com/2015/01/15/ruby-threads-gotcha-with-local-vars-and-shared-state/ (15 Jan 2015)

I end up doing a fair amount of work with multi-threading in ruby. (There is some multi-threaded concurrency in Umlaut, bento_search, and traject.) Contrary to some belief, multi-threaded concurrency can be useful even in MRI ruby (which can't do true parallelism due to the GIL) for tasks that spend a lot of time waiting on I/O, which is the purpose in Umlaut and bento_search (in both cases waiting on external HTTP apis). Traject uses multi-threaded concurrency for true parallelism in jruby (or soon rbx) for high performance.

There’s a gotcha with ruby threads that I haven’t seen covered much. What do you think this code will output from the ‘puts’?
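The code was something like this minimal sketch (a reconstruction, not the original snippet):

value = "original"
t = Thread.new do
  sleep 1     # give the main thread a chance to keep going first
  puts value
end
value = "changed"
t.join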

It outputs "changed". The local var `value` is shared between both threads; changes made in the primary thread affect the value of `value` in the created thread too. This issue is not unique to threads, but is a result of how closures work in ruby — the local variables used in a closure don't capture the fixed value at the time of closure creation; they are pointers to the original local variables. (I'm not entirely sure if this is traditional for closures, or if some other languages do it differently, or what the correct CS terminology is for talking about this stuff.) It confuses people in other contexts too, but can especially lead to problems with threads.

Consider a loop which in each iteration prepares some work to be done, then dispatches to a thread to actually do the work. We’ll do a very simple fake version of that, watch:

threads = []
i = 0
10.times do
  # pretend to prepare a 'work order', which ends up in local
  # var i
  i += 1
  # now do some stuff with 'i' in the thread
  threads << Thread.new do
    sleep 1 # pretend this is a time consuming computation
    # now we do something else with our work order...
    puts i
  end
end
threads.each { |t| t.join }

Do you think you'll get "1", "2", … "10" printed out? You won't. You'll get ten 10's. (With newlines in random places because of interleaving of 'puts', but that's not what we're talking about here.) You thought you dispatched 10 threads, each with a different value for 'i', but the threads are actually all sharing the same 'i'; when it changes, it changes for all of them.

Oops.

Ruby stdlib Thread.new has a mechanism to deal with this, although like much in ruby stdlib (and much about multi-threaded concurrency in ruby), it’s under-documented. But you can pass args to Thread.new, which will be passed to the block too, and allow you to avoid this local var linkage:
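It looks something like this sketch (a reconstruction, not the original snippet):

value = "original"
t = Thread.new(value) do |t_value|
  sleep 1
  puts t_value   # the block arg was bound at Thread.new time
end
value = "changed"
t.join           # prints "original"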

While that will seem to work for this particular example, there can still be race conditions lurking whenever the thread block itself reads shared state: a value could change before a given line of the thread block is executed. Part of dealing with concurrency is giving up any expectations of what gets executed when, until you wait on a `join`.

So, yeah, the arguments to Thread.new. Which other libraries involving threading sometimes propagate. With a concurrent-ruby ThreadPoolExecutor:
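Something like this sketch (my reconstruction; the pool sizes are arbitrary):

require 'concurrent'

pool = Concurrent::ThreadPoolExecutor.new(min_threads: 1, max_threads: 4)
i = 0
10.times do
  i += 1
  # post passes its arguments through to the block, like Thread.new
  pool.post(i) do |local_i|
    sleep 1
    puts local_i
  end
end
pool.shutdown
pool.wait_for_termination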

I'm honestly not even sure how you get around this problem with Concurrent::Future; unlike Concurrent::ThreadPoolExecutor, it does not seem to copy stdlib Thread.new in its method of being able to pass block arguments. There might be something I'm missing (or a way to use Futures that avoids this problem?), or maybe the authors of concurrent-ruby haven't considered it yet either? I've asked the question of them. (PS: The concurrent-ruby package is super awesome; it's still building to 1.0 but usable now. I am hoping that its existence will do great things for practical use of multi-threaded concurrency in the ruby community.)

This is, for me, one of the biggest, most dangerous, most confusing gotchas with ruby concurrency. It can easily lead to hard-to-notice, hard-to-reproduce, and hard-to-debug race condition bugs.

Control of information is power

“There’s no such thing as a true map,” says Mark Graham, a senior research fellow at Oxford Internet Institute. “Every single map is a misrepresentation of the world, every single map is partial, every single map is selective. And every single map tells a particular story from a particular perspective.”

Because online maps are in constant flux, though, it’s hard to plumb the bias in the cartography. Graham has found that the language of a Google search shapes the results, producing different interpretations of Bangkok and Tel Aviv for different residents. “The biggest problem is that we don’t know,” he says. “Everything we’re getting is filtered through Google’s black box, and it’s having a huge impact not just on what we know, but where we go, and how we move through a city.”

As an example of the mapmaker’s authority, Matt Zook, a collaborator of Graham’s who teaches at the University of Kentucky, demonstrated what happens when you perform a Google search for abortion: you’re led not just to abortion clinics and services but to organisations that campaign against it. “There’s a huge power within Google Maps to just make some things visible and some things less visible,” he notes.

But the sign is both tempting and elusive. That’s why you’ll find so many tourists taking photos on dead-end streets at the base of the Hollywood Hills. For many years, the urban design of the neighbourhood actually served as the sign’s best protection: Due to the confusingly named, corkscrewing streets, it’s actually not that easy to tell someone how to get to the Hollywood Sign.

That all changed about five years ago, thanks to our suddenly sentient devices. Phones and GPS were now able to aid the tourists immensely in their quests to access the sign, sending them confidently through the neighbourhoods, all the way up to the access gate, where they’d park and wander along the narrow residential streets. This, the neighbours complained, created gridlock, but even worse, it represented a fire hazard in the dry hills — fire trucks would not be able to squeeze by the parked cars in case of an emergency.

…

Even though Google Maps clearly marks the actual location of the sign, something funny happens when you request driving directions from any place in the city. The directions lead you to Griffith Observatory, a beautiful 1920s building located one mountain east from the sign, then — in something I’ve never seen before, anywhere on Google Maps — a dashed grey line arcs from Griffith Observatory, over Mt. Lee, to the sign’s site. Walking directions show the same thing.

Even though you can very clearly walk to the sign via the extensive trail network in Griffith Park, the map won’t allow you to try.

When I tried to get walking directions to the sign from the small park I suggest parking at in my article, Google Maps does an even crazier thing. It tells you to walk an hour and a half out of the way, all the way to Griffith Observatory, and look at the sign from there.

No matter how you try to get directions — Google Maps, Apple Maps, Bing — they all tell you the same thing. Go to Griffith Observatory. Gaze in the direction of the dashed grey line. Do not proceed to the sign.

Don’t get me wrong, the view of the sign from Griffith Observatory is quite nice. And that sure does make it easier to explain to tourists. But how could the private interests of a handful of Angelenos have persuaded mapping services to make it the primary route?

(h/t Nate Larson)

Filed under: General

Fraud in scholarly publishing
https://bibwild.wordpress.com/2015/01/09/fraud-in-scholarly-publishing/ (9 Jan 2015)

Should librarianship be a field that studies academic publishing as an endeavor, and works to educate scholars and students to take a critical perspective? Some librarians are expected/required to publish for career promotion; are investigations in this area something anyone does?

Klaus Kayser has been publishing electronic journals for so long he can remember mailing them to subscribers on floppy disks. His 19 years of experience have made him keenly aware of the problem of scientific fraud. In his view, he takes extraordinary measures to protect the journal he currently edits, Diagnostic Pathology. For instance, to prevent authors from trying to pass off microscope images from the Internet as their own, he requires them to send along the original glass slides.

Despite his vigilance, however, signs of possible research misconduct have crept into some articles published in Diagnostic Pathology. Six of the 14 articles in the May 2014 issue, for instance, contain suspicious repetitions of phrases and other irregularities. When Scientific American informed Kayser, he was apparently unaware of the problem. “Nobody told this to me,” he says. “I’m very grateful to you.”

[…]

The dubious papers aren’t easy to spot. Taken individually each research article seems legitimate. But in an investigation by Scientific American that analyzed the language used in more than 100 scientific articles we found evidence of some worrisome patterns—signs of what appears to be an attempt to game the peer-review system on an industrial scale.

[…]

A quick Internet search uncovers outfits that offer to arrange, for a fee, authorship of papers to be published in peer-reviewed outlets. They seem to cater to researchers looking for a quick and dirty way of getting a publication in a prestigious international scientific journal.

This particular form of the for-pay mad-libs-style research paper appears to be prominent mainly among researchers in China. How can we talk about this without accidentally stooping to or encouraging anti-Chinese racism or xenophobia? There are other forms of research fraud and quality issues which are prominent in the U.S. and English-speaking research world too. If you follow this theme of scholarly quality issues, as I’ve been trying to do casually, you start to suspect the entire scholarly publishing system, really.

We know, for instance, that ghost-written scholarly pharmaceutical articles are not uncommon in the U.S. too. Perhaps in the U.S. scholarly fraud is more likely to come for ‘free’ from interested commercial entities than by researchers paying ‘paper salesmen’ for poor quality papers. To me, a paper written by a pharmaceutical company employee but published under the name of an ‘independent’ researcher is arguably a worse ethical violation, even if everyone involved can think “Well, the science is good anyway.” It also wouldn’t shock me if very similar systems to China’s paper-for-sale industry exist in the U.S., on a much smaller scale, but more adept at avoiding reuse of nonsense boilerplate, making them harder to detect. Presumably the Chinese industry will get better at avoiding detection too, or perhaps already is at a higher end of the market.

In both cases, the context is extreme career pressure to ‘publish or perish’, into a system that lacks the ability to actually ascertain research quality sufficiently, but which the scholarly community believes has that ability.

Problems with research quality don’t end here; they go on and on, and are starting to get more attention.

From the Economist, also from last year, “Trouble at the lab: Scientists like to think of science as self-correcting. To an alarming degree, it is not.”

From Nature, August 2013 (was 2013 the year of discovering scientific publishing ain’t what we thought?): “US behavioural research studies skew positive: Scientists speculate ‘US effect’ is a result of publish-or-perish mentality.”

There are also individual research papers investigating particular issues, especially statistical methodology problems, in scientific publishing. I’m not sure if there are any scholarly papers or monographs which take a big picture overview of the crisis in scientific publishing quality/reliability — anyone know of any?

To change the system, we need to understand the system — and start by lowering confidence in the capabilities of existing ‘gatekeeping’. And the ‘we’ is the entire cross-disciplinary community of scholars and researchers. We need an academic discipline and community devoted to a critical examination of scholarly research and publishing as a social and scientific phenomenon, using social science and history/philosophy of science research methods; a research community (of research on research) which is also devoted to education of all scholars, scientists, and students into a critical perspective. Librarians seem well situated to engage in this project in some ways, although in others it may be unrealistic to expect.

Filed under: General

Notes on oddities of Solr WordDelimiterFilter
https://bibwild.wordpress.com/2015/01/07/notes-on-oddities-of-solr-worddelimiterfilter/ (7 Jan 2015)

An edited version of a post I sent to the Blacklight listserv…

I have a WordDelimiterFilter configured in my analysis for the ‘text’ type. I thought I originally inherited that from Blacklight suggested configuration, although it doesn’t appear to be there at the moment if I’m looking at the right place:
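Mine looks roughly like this (a sketch of the kind of configuration, not my exact schema):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"
            splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>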

Note that it’s got `splitOnCaseChange=”1″`, for both index and query time (no separate index/query analysis). Mine has the same. Although stanford applies the ICUFoldingFilter (case-insensitivity) _before_ the WDF, which probably means splitOnCaseChange isn’t actually doing anything; by the time the filter gets the tokens, there are no more case changes. In mine, I do the ICUFoldingFilter _after_ the WDF, so the WDF can still do its thing.

Specifically, if the query includes a mixed-case term like “DuBois”, I expected this would match source term “dubois” OR source term “du bois”.

But it turns out it _only_ matches source term “du bois”. Which was unexpected for one user who noticed it, and who knew that our search was generally ‘case insensitive’ — a search for “dubois” would match source term “dubois”, but a search for “duBois” would not, violating their expectations. And I agree this is probably bad.

I thought the WDF could do what I wanted. But after spending a bunch of time with the docs, playing around with different configurations, and trying to get advice on the solr-user listserv — frankly, I’m still really confused about exactly what the WDF will do in various configurations. It’s a complicated thing.

But I think the WDF is not capable of doing quite what I expected.

I think what I need to do is split into separate index-time and query-time analysis, identical in all ways except that in query-time analysis splitOnCaseChange=0, while it remains on in index-time analysis, as sketched below.
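Something like this sketch (abbreviated; the other WDF attributes stay the same in both analyzers):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
            generateWordParts="1" catenateWords="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
            generateWordParts="1" catenateWords="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>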

The result of this seems to be that query time “DuBois” will only match source material single word “dubois” (in any case, it’ll also match source material “DuBois” still) — if it’s only going to match one of the choices, I think this is the right one.

Source material “DuBois” will still be indexed such that both queries “dubois” (or “DuBois”) and “du bois” will match it — source case changes will be expanded to two words in index, as an alternate, along with one word in index. But you can’t quite do the same thing at query time, to allow query with case change “DuBois” to match both variants in source.

I think this is probably the right thing to do — although in general, the WordDelimiterFilter is scaring me enough that if I had to do it over, I either wouldn’t use it at all, or would use it only with very specific configuration designed to support specific tested cases. As it is, I’m not quite sure what all it’s doing, and am scared to change it a lot. It’s odd to me that the example suggested analysis configuration given in the Solr wiki for the WordDelimiterFilter would seem to be subject to the same problem.

I am curious if anyone has dealt with this, and has any feedback. Especially Stanford, since I know they have a great test suite on their Solr configuration — although if that github represents current Stanford conf, the splitOnCaseChange=1 is probably having no effect at index OR query time, since there’s a case-normalization filter BEFORE it.

Filed under: General

debugging apache Passenger without enterprise
https://bibwild.wordpress.com/2014/12/10/debugging-apache-passenger-without-enterprise/ (10 Dec 2014)

I kind of love Passenger for my Rails deployments. It Just Works, it does exactly what it should do, no muss, no fuss. I use Passenger with apache.

I very occasionally have a problem that I can’t reproduce in my dev environment, one that only seems to reproduce in production under apache Passenger. Note well: In every case so far, the problem actually had nothing to do with passenger or apache; there were other differences in environment that were causing it.

But still, being able to drop into a debugger in the Rails app actually running under apache Passenger would have helped me find it quicker.

Support for dropping into the debugger remotely when running under Apache is included only in Passenger Enterprise. I recommend considering purchasing Enterprise to support the Passenger team; the price is reasonable… for one server or two. But I admit I have not yet purchased it, mainly because the number of dev/staging/production servers I would want it on, to have it everywhere, starts to make the cost substantial for my environment.

I haven’t tried it yet, but I’m making this post as a note to myself and others who might want to give it a try.

The really exciting thing only in Passenger Enterprise, to me, is the way it can deploy with a hybrid multiple-process plus multi-threaded-request-dispatch setup. This is absolutely the best way to deploy under MRI; I have no doubts at all, it just is (and I’m surprised it’s not getting more attention). This lower-level feature is unlikely to come from a third party as open source, and I’m not sure I’d trust it if it did. The open source Puma, an alternative to Passenger, also offers this deploy model. I haven’t tried it in Puma myself beyond some toy testing like the benchmark mentioned above. But I know I absolutely trust Passenger to get it right with no fuss. If you need to maximize performance (or avoid end-user latency spikes in the presence of some longer-running requests) and deploy under MRI, you should definitely consider Passenger Enterprise just for this multi-process/multi-thread combo feature.

Filed under: General

“More library mashups”, with Umlaut chapter
https://bibwild.wordpress.com/2014/12/01/more-library-mashups-with-umlaut-chapter/ (1 Dec 2014)

I received my author’s copy of More Library Mashups, edited by Nicole Engard. I notice the publisher’s site is still listing it as “pre-order”, but I think it’s probably available for purchase (in print or e).

I’m hoping it attracts some more attention and exposure for Umlaut, and maybe gets some more people trying it out.

Consider asking your employing library to purchase a copy of the book for the collection! It looks like it’s got a lot of interesting stuff in it, including a chapter by my colleague Sean Hannan on building a library website by aggregating content services.

Breaking new ground for the open-access movement, the Bill & Melinda Gates Foundation, a major funder of global health research, plans to require that the researchers it funds publish only in immediate open-access journals.

The policy doesn’t kick in until January 2017; until then, grantees can publish in subscription-based journals as long as their paper is freely available within 12 months. But after that, the journal must be open access, meaning papers are free for anyone to read immediately upon publication. Articles must also be published with a license that allows anyone to freely reuse and distribute the material. And the underlying data must be freely available.

Is this going to work? Will researchers be able to comply with these requirements without harm to their careers? Does the Gates Foundation fund enough research that new open access venues will open up to publish this research (and if so how will their operation be funded?), or do sufficient venues already exist? Will Gates Foundation grants include funding for “gold” open access fees?

I am interested to find out. I hope this article is accurate about what they’re doing, and am glad they are doing it if so.

The Gates Foundation’s own announcement appears to be here, and their policy, which doesn’t answer very many questions but does seem to be bold and without wiggle-room, is here.

I note that the policy mentions “including any underlying data sets.” Do they really mean that underlying data sets used for all publications “funded, in whole or in part, by the foundation” must be published? I hope so. Requiring “underlying data sets” to be available at all is in some ways just as big as, or bigger than, requiring them to be available open access.

1. Regexp.union

Have a bunch of regexes, and want to see if a string matches any of them, but don’t actually care which one it matches, just whether it matches any one or more? Don’t loop through them; combine them with Regexp.union.
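A quick sketch:

regexes = [/foo/, /\d+bar/, /baz/i]
combined = Regexp.union(regexes)
# one match attempt covers them all
"something 12bar something" =~ combined  # => 10 (truthy)
"nope" =~ combined                       # => nil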

2. Regexp.escape

Have an arbitrary string that you want to embed in a regex, interpreted as a literal? Might it include regex special chars that you want interpreted as literals instead? Why even think about whether it might or not: just escape it, always.
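For example:

user_input = "price (USD): $3.50"
regex = /\A#{Regexp.escape(user_input)}/
# parens, dollar sign, and dot are all treated as literals
"price (USD): $3.50 per pound" =~ regex  # => 0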

Is the semantic web still a thing?

A few years ago, it seemed as if everyone was talking about the semantic web as the next big thing. What happened? Are there still startups working in that space? Are people still interested?

Note that “linked data” is basically talking about the same technologies as “semantic web”, it’s sort of the new branding for “semantic web”, with some minor changes in focus.

The top-rated comment in the discussion says, in part:

A bit of background, I’ve been working in environments next to, and sometimes with, large scale Semantic Graph projects for much of my career — I usually try to avoid working near a semantic graph program due to my long histories of poor outcomes with them.

I’ve seen uncountably large chunks of money put into KM projects that go absolutely nowhere and I’ve come to understand and appreciate many of the foundational problems the field continues to suffer from. Despite a long period of time, progress in solving these fundamental problems seem hopelessly delayed.

The semantic web as originally proposed (Berners-Lee, Hendler, Lassila) is as dead as last year’s roadkill, though there are plenty out there that pretend that’s not the case. There’s still plenty of groups trying to revive the original idea, or like most things in the KM field, they’ve simply changed the definition to encompass something else that looks like it might work instead.

The reasons are complex but it basically boils down to: going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.

The entire comment, and, really the entire thread, are worth a read. There seems to be a lot of energy in libraryland behind trying to produce “linked data”, and I think it’s important to pay attention to what’s going on in the larger world here.

Especially because much of the stated motivation for library “linked data” seems to have been: “Because that’s where non-library information management technology is headed, and for once let’s do what everyone else is doing and not create our own library-specific standards.” It turns out that may or may not be the case. If your motivation for library linked data was “so we can be like everyone else,” that motivation simply may not be accurate; everyone else doesn’t seem to be heading there in the way people hoped a few years ago.

On the other hand, some of the reasons that semantic web/linked data have not caught on are commercial and have to do with business models.

One of the reasons that whole thing died was that existing business models simply couldn’t be reworked to make it make sense. If I’m running an ad driven site about Cat Breeds, simply giving you all my information in an easy to parse machine readable form so your site on General Pet Breeds can exist and make money is not something I’m particularly inclined to do. You’ll notice now that even some of the most permissive sites are rate limited through their API and almost all require some kind of API key authentication scheme to even get access to the data.

It may be that libraries and other civic organizations, without business models predicated on competition, are a better fit for implementation of semantic web technologies. And the sorts of data that libraries deal with (bibliographic and scholarly) may be better suited to semantic data than general commercial business data is. It may be that at the moment libraries, cultural heritage, and civic organizations are the majority of entities exploring linked data.

Still, the coarsely stated conclusion of that top-rated HN comment is worth repeating:

going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.

Putting data into linked data form simply because we’ve been told that “everyone is doing it” without carefully understanding the use cases such reformatting is supposed to benefit and making sure that it does — risks undergoing great expense for no payoff. Especially when everyone is not in fact doing it.

GIGO

Taking the same data you already have and reformatting it as “linked data” does not necessarily add much value. If it was poorly controlled, poorly modelled, or incomplete data before — it still is, even in RDF. You can potentially add a lot more value, and more additional uses of your data, by improving the data quality than by working to reformat it as linked data/RDF. The idea that simply reformatting it as RDF would add significant value was predicated on the idea of an ecology of software and services built to use linked data, software and services exciting enough that making your data available to them would result in added value. That ecology has not really materialized, and it’s hardly clear that it will (and to the extent it does, it may only be if libraries and cultural heritage organizations create it; we are unlikely to get a free ride on more general tools from a wider community).

But please do share your data

To be clear, I still highly advocate taking the data you do have and making it freely available under open (or public domain) license terms. In whatever formats you’ve already got it in. If your data is valuable, developers will find a way to use it, and simply making the data you’ve already got available is much less expensive than trying to reformat it as linked data. And you can find out if anyone is interested in it. If nobody’s interested in your data as it is — I think it’s unlikely the amount of interest will be significantly greater after you model it as ‘linked data’. The ecology simply hasn’t arisen to make using linked data any easier or more valuable than using anything else (in many contexts and cases, it’s more troublesome and challenging than less abstract formats, in fact).

Following the bandwagon vs doing the work

Part of the problem is that modelling data is inherently a context-specific act. There is no universally applicable model — and I’m talking here about the ontological level of entities and relationships, what objects you represent in your data as distinct entities and how they are related. Whether you model it as RDF or just as custom XML, the way you model the world may or may not be useful or even usable by those in different contexts, domains and businesses. See “Schemas aren’t neutral” in the short essay by Cory Doctorow linked to from that HN comment. But some of the linked data promise is premised on the idea that your data will be both useful and integrate-able nearly universally with data from other contexts and domains.

These are not insoluble problems, they are interesting problems, and they are problems that libraries as professional information organizations rightly should be interested in working on. Semantic web/linked data technologies may very well play a role in the solutions (although it’s hardly clear that they are THE answer).

It’s great for libraries to be interested in working on these problems. But working on these problems means working on these problems, it means spending resources on investigation and R&D and staff with the right expertise and portfolio. It does not mean blindly following the linked data bandwagon because you (erroneously) believe it’s already been judged as the right way to go by people outside of (and with the implication ‘smarter than’) libraries. It has not been.

For individual linked data projects, it means being clear about what specific benefits they are supposed to bring to use cases you care about — short and long term — and what other outside dependencies may be necessary to make those benefits happen, and focusing on those too. It means understanding all your technical options and considering them in a cost/benefit/risk analysis, rather than automatically assuming RDF/semantic web/linked data and as much of it as possible.

It means being aware of the costs and the hoped for benefits, and making wise decisions about how best to allocate resources to maximize chances of success at those hoped for benefits. Blindly throwing resources into taking your same old data and sharing it as “linked data”, because you’ve heard it’s the thing to do, does not in fact help.

Filed under: General

Google Scholar is 10 years old
https://bibwild.wordpress.com/2014/10/18/google-scholar-is-10-years-old/ (18 Oct 2014)

An article by Steven Levy about the guy who founded the service, and its history:

“Information had very strong geographical boundaries,” he says. “I come from a place where those boundaries are very, very apparent. They are in your face. To be able to make a dent in that is a very attractive proposition.”

Acharya’s continued leadership of a single, small team (now consisting of nine) is unusual at Google, and not necessarily seen as a smart thing by his peers. By concentrating on Scholar, Acharya in effect removed himself from the fast track at Google…. But he can’t bear to leave his creation, even as he realizes that at Google’s current scale, Scholar is a niche.

…But like it or not, the niche reality was reinforced after Larry Page took over as CEO in 2011, and adopted an approach of “more wood behind fewer arrows.” Scholar was not discarded — it still commands huge respect at Google which, after all, is largely populated by former academics—but clearly shunted to the back end of the quiver.

…Asked who informed him of what many referred to as Scholar’s “demotion,” Acharya says, “I don’t think they told me.” But he says that the lower profile isn’t a problem, because those who do use Scholar have no problem finding it. “If I had seen a drop in usage, I would worry tremendously,” he says. “There was no drop in usage. I also would have felt bad if I had been asked to give up resources, but we have always grown in both machine and people resources. I don’t feel demoted at all.”

Catching HTTP OPTIONS request in a Rails app

I’ve been trying to take my Rails error logs more seriously to make sure I handle any bugs revealed. 404’s can indicate a problem, especially when the referrer is my app itself. So I wanted to get all of those 404’s for Apache’s internal dummy connection out of my log. (How I managed to fight with Rails logs enough to actually get useful contextual information on FATAL errors is an entirely different complicated story for another time).

How can I make a Rails app handle them?

Well, first, let’s do a standards check and see that RFC 2616 HTTP 1.1 Section 9 (I hope I have a current RFC that hasn’t been superseded) says:

If the Request-URI is an asterisk (“*”), the OPTIONS request is intended to apply to the server in general rather than to a specific resource. Since a server’s communication options typically depend on the resource, the “*” request is only useful as a “ping” or “no-op” type of method; it does nothing beyond allowing the client to test the capabilities of the server. For example, this can be used to test a proxy for HTTP/1.1 compliance (or lack thereof).

Okay, sounds like we can basically reply with whatever we want to this request, it’s a “ping or no-op”. How about a 200 text/plain with “OK\n”?

Here’s a line I added to my Rails routes.rb file that seems to catch the “*” requests and just respond with such a 200 OK.
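It looked something like this (a sketch reconstructing the line; the lambda body is the whole “controller”):

match ':asterisk', via: [:options],
      constraints: { asterisk: /\*/ },
      to: lambda { |env| [200, { 'Content-Type' => 'text/plain' }, ["OK\n"]] }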

Since “*” is a special glob character to Rails routing, looks like you have to do that weird constraints trick to actually match it. (Thanks to mbklein, this does not seem to be documented and I never would have figured it out on my own).

And then we can use a little “Rack app implemented in a lambda” trick to just return a 200 OK right from the routing file, without actually having to write a controller action somewhere else just to do this.

I have not yet tested this extensively, but I think it works? (Still worried if Apache is really requesting “OPTIONS *” instead of “OPTIONS /*” it might not be. Stay tuned.)

Filed under: General

Umlaut News: 4.0 and two new installations
https://bibwild.wordpress.com/2014/10/06/umlaut-news-4-0-and-two-new-installations/ (6 Oct 2014)

Umlaut is, well, now I’m going to call it a known-item discovery layer, usually but not necessarily serving as a front-end to SFX.

Umlaut 4.0.0 has been released

This release is mostly back-end upgrades, including:

Support for Rails 4.x (Rails 3.2 included to make migration easier for existing installations, but recommend upgrading to Rails 4.1 asap, and starting with Rails 4.1 in new apps)

There are a variety of ways to work around this by extending asset compilation. After researching and considering them all, I chose to use a custom Rake task that uses the sprockets manifest.json file. In this post, I’ll explain the situation and the options.

The Rails asset pipeline produces assets to be delivered to the client that are fingerprinted with a digest hash based on the contents of the file — such as ‘application-810e09b66b226e9982f63c48d8b7b366.css’. People (and configuration) often refer to this filename-fingerprinting as “digested assets”.

In Rails3, a ‘straight’-named copy of each asset (eg `application.js`) was also produced, alongside the fingerprinted digest-named assets.

Rails4 stopped doing this by default, and also took away any ability to do this even as a configurable option. While I can’t find the thread now, I recall seeing discussion that in Rails3, the production of non-digest-named assets was accomplished through actually asking sprockets to compile everything twice, which made asset compilation take roughly twice as long as it should. Which is indeed a problem.

Rather than looking to fix Sprockets api to make it possible to compile the file once but simply write it twice, Rails devs decided there was no need for the straight-named files at all, and simply removed the feature.

Why would you need straight-named assets?

The title of this issue reveals one reason people wanted the non-digest-named assets: “breaks compatibility with bad gems”. This mainly applies to gems that supply javascript, which may need to generate links to assets, and weren’t built to look up the current digest-named URLs. It’s really about javascript, not ‘gems’; it can apply to javascript you’ve included without gemifying it too.

The Rails devs expressing opinions on this issue believed (at least initially) that these ‘bad gems’ should simply be fixed; accommodating them was the wrong thing to do, as it eliminates the ability to cache-forever the assets they refer to.

I think they under-estimate the amount of work it can take to fix these ‘bad’ JS dependencies, which often are included through multi-level dependency trees (requiring getting patches accepted by multiple upstreams) — and also basically requires wrapping all JS assets in rubygems that apply sprockets/rails-specific patches on top, instead of, say, just using bower.

I think there’s a good argument for accommodating JS assets which the community has not yet had the time/resources to make respect the sprockets fingerprinting. Still, it is definitely preferable, and always at least theoretically possible, to make all your JS respect sprockets asset fingerprinting — and in most of my apps, I’ve done that.

But there’s other use cases: like mine!

I have an application that needs to offer a Javascript file at a particular stable URL, as part of its API — think JS “widgets”.

I want it to go through the asset pipeline, for source control, release management, aggregation, SASS, minimization, etc. The suggestion to just “put it in /public as a static asset” is no good at all. But I need the current version available at a persistent URL.

In Rails 3, this Just Worked, since the asset pipeline created a non-digested name. In Rails 4, we need a workaround. I don’t need every asset to have a non-digest-named version, but I do need a whitelist of a few that are part of my public API.

I think this is a pretty legitimate use case, and not one that can be solved by ‘fixing bad gems’. I have no idea if Rails devs recognize it or not.

Possible Workaround Options

So that giant Github Issue thread? At first it looks like just one of those annoying ones with continual argument by uninformed people that will never die, and eventually @rafaelfranca locked it. But it’s also got a bunch of comments with people offering their solutions, and is the best aggregation of possible workarounds to consider — I’m glad it wasn’t locked sooner. Another example of how GitHub qualitatively improves open source development — finding this stuff on a listserv would have been a lot harder.

The problem is it won’t work. I’ve got nothing against a rake task solution. It’s easy to wire things up so your new rake task automatically gets called every time after `rake assets:precompile`, no problem!

The problem is that a deployed Rails app may have multiple fingerprinted versions of a particular asset file around, representing multiple releases. And really you should set things up this way — because right after you do a release, there may be cached copies of HTML (in browser caches, or proxying caches including a CDN) still around, still referencing the old version with the old digest fingerprint. You’ve got to keep it around for a while.

(How long? Depends on the cache headers on the HTML that might reference it. The fact that sprockets only supports keeping around a certain number of releases, and not releases made within a certain time window, is a different discussion. But, yeah, you need to keep around some old versions).

So it’s unpredictable which of the several versions you’ve got hanging around the rake task is going to copy to the non-digest-named version; there’s no guarantee it’ll be the latest one. (Maybe it depends on their lexicographic sort?) That’s no good.

Enhance the core-team-suggested rake task?

Before I realized this problem, I had already spent some time trying to implement the basic rake task, add a whitelist parameter, etc. So I tried to keep going with it after realizing this problem.

I figured, okay, there are multiple versions of the asset around, but sprockets and rails have to know which one is the current one (to serve it to the current application), so I must be able to use sprockets ruby API in the rake task to figure it out and copy that one.

It was kind of challenging to figure out how to get sprockets to do this, but eventually it was sort of working.

Except I started to get worried that I might be triggering the double-compilation that Rails3 did, which I didn’t want to do, and got confused about even figuring out if I was doing it.

And I wasn’t really sure if I was using sprockets API meant to be public or internal. It didn’t seem to be clearly documented, and sprockets and sprockets-rails have been pretty churny, I thought I was taking a significant risk of it breaking in future sprockets/rails version(s) and needing continual maintenance.

Verdict: Nope, not so simple, even though it seems to be the rails-core-endorsed solution.

Monkey-patch sprockets: non-stupid-digest-assets

Okay, so maybe we need to monkey-patch sprockets I figured.

@alexspeller provides a gem to monkey-patch Sprockets to support non-digested-asset creation, the unfortunately combatively named non-stupid-digest-assets.

If someone else has already figured it out and packaged it in a gem, great! Maybe they’ll even take on the maintenance burden of keeping it working with churny sprockets updates!

But non-stupid-digest-assets just takes the same kind of logic from that basic rake task, another pass through all the assets post-compilation, and implements it with a sprockets monkeypatch instead of a rake task. It does add a whitelist. I can’t quite figure out if it’s still subject to the same might-end-up-with-older-version-of-asset problem.

There’s really no benefit just to using a monkey patch instead of a rake task doing the same thing, and it has increased risk of breaking with new Rails releases. Some have already reported it not working with the Rails 4.2.betas — I haven’t investigated myself to see what’s up with that, and @alexspeller doesn’t seem to be in any hurry to either.

Verdict: Nope. non-stupid-digest-assets ain’t as smart as it thinks it is.

Monkey-patch sprockets: The right way?

If you’re going to monkey-patch sprockets and take on forwards-compat risk, why not actually do it right, and make sprockets simply write the compiled file to two different file locations (and/or use symlinks) at the point of compilation?

At this point, I was too scared of the forwards-compatibility-maintenance risks of monkey patching sprockets, and realized there was another solution I liked better…

Verdict: It’s the right way to do it, but carries some forwards-compat maintenance risk as an unsupported monkey patch

Use the Manifest, Luke, erm, Rake!

I had tried and given up on using the sprockets ruby api to determine ‘current digest-named asset’. But as I was going back and reading through the Monster Issue looking for ideas again, I noticed @drojas suggested using the manifest.json file that sprockets creates, in a rake task.

Yep, this is where sprockets actually stores info on the current digest-named assets. Forget the sprockets ruby api; we can just get it from there, and make sure we’re copying (or symlinking) the current digested version to the non-digested name.

But are we still using private api that may carry maintenance risk with future sprockets versions? Hey, look, in a source code comment Sprockets tells us “The JSON is part of the public API and should be considered stable.” Sweet!

Now, even if sprockets devs remember one of them once said this was public API (I hope this blog post helps), and even if sprockets is committed to semantic versioning, that still doesn’t mean it can never change. In fact, the way some of rubydom treats semver, it doesn’t even mean it can’t change soon and frequently; it just means they’ve got to update the sprockets major version number when it changes. Hey, at least that’d be a clue.
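Here’s roughly what I mean, as a sketch (the task name, the whitelist contents, and the manifest filename are placeholders; the manifest’s exact name and location vary across sprockets versions):

# lib/tasks/non_digested_assets.rake -- a sketch, names are my own.
# Copies whitelisted assets from their current digest-named files
# (as listed in the sprockets manifest) to stable non-digested names.
require 'fileutils'

namespace :assets do
  task non_digested: :environment do
    whitelist = ["my_widget.js"] # hypothetical whitelisted logical paths

    manifest_path = Dir.glob(File.join(Rails.root, "public/assets/manifest-*.json")).first
    manifest = JSON.parse(File.read(manifest_path))

    # manifest["assets"] maps logical paths to current digest-named paths
    manifest["assets"].each do |logical_path, digested_path|
      next unless whitelist.include?(logical_path)
      source = File.join(Rails.root, "public/assets", digested_path)
      target = File.join(Rails.root, "public/assets", logical_path)
      FileUtils.cp(source, target)
    end
  end
end

You can then wire it to run automatically, e.g. `Rake::Task["assets:precompile"].enhance { Rake::Task["assets:non_digested"].invoke }`.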

I didn’t really notice this one (serving the persistent URL from your app with a redirect to the current digest-named asset) until I had settled on The Manifest. It requires two HTTP requests every time a client wants the asset at the persistent URL though. The first one will touch your app and needs a short cache time; that will then redirect to the digest-named asset, which will be served directly by the web server and can be cached forever. I’m not really sure if the performance implications are significant; it probably depends on your use cases and request volume. @will-r suggests it won’t work well with CDN’s though.

Verdict: Meh, maybe, I dunno, but it doesn’t feel right to introduce the extra latency

The Future

But what’s “this issue” exactly? I dunno; they are not sharing what they see as the legitimate use cases to handle, or the requirements on legitimate ways to handle them.

I kinda suspect they might just be dealing with the “non-Rails JS that needs to know asset URLs” issue, and considering some complicated way to automatically make it use digest-named assets without having to repackage it for Rails. Which might be a useful feature, although also a complicated enough one to have some bug risks (ah, the story of the asset pipeline).

And it’s not what I need, anyway, there are other uses cases than the “non-Rails JS” one that need non-digest-named assets.

I just need sprockets to produce parallel non-digested asset filenames for certain whitelisted assets. That really is the right way to handle it for my use case. Yes, it means you need to know the implications and how to use cache headers responsibly. If you don’t give me enough rope to hang myself, I don’t have enough rope to climb the rock face either. I thought Rails’ target audience was people who know what they’re doing?

It doesn’t seem like this would be a difficult feature for sprockets to implement (without double compilation!). @ryana’s monkeypatch seems like pretty simple code that is most of the way there. It’s the feature I need.

I considered making a pull request to sprockets (the first step, then probably sprockets-rails, needs to support passing on the config settings). But you know what, I don’t have the time or psychic energy to get in an argument about it in a PR; the Rails/sprockets devs seem opposed to this feature for some reason. Heck, I just spent hours figuring out how to make my app work now, and writing it all up for you instead!

But, yeah, just add that feature to sprockets, pretty please.

So, if you’re reading this post in the future, maybe things will have changed, I dunno.

First rule of responding to support tickets

Always, always, always, start with “thank you for reporting this problem.”

1. Because it’s true, we need the problem reports to know about problems, and too often people are scared to report problems because of past bad experiences, or don’t report problems because they figure someone else already has, or because they are busy and don’t have the time.

2. Because it gets the support interaction off to a good collegial cooperative start.

It works. Do it every time. Even when the problem being reported doesn’t make any sense and you’re sure (you think!) that it’s not a real problem.

If they give a good problem report with actual reproduction steps and a clear explanation of why the outcome is not what they expected, thank them extra special.

Filed under: General

Rubyists, don’t forget about the dir glob!
https://bibwild.wordpress.com/2014/09/29/rubyists-dont-forget-about-the-dir-glob/ (29 Sep 2014)

If you are writing configuration to take a pattern to match against files in a file system…

You probably want Dir.glob patterns, not regexes. Dir.glob is in the stdlib. Its unix-shell-style patterns are less expressive than regexes, but probably expressive enough for anything you need here, and much simpler to deal with for the common patterns in this use case.

Dir.glob(“root/path/**/*.rb”)

vs.

%r{\Aroot/path/.*\.rb\Z}

Or

Dir.glob(“root/path/*.rb”)

vs.

…I don’t even feel like thinking about how to express, as a regexp, that you don’t want files in child directories, only ones directly there.

Dir.glob will find matches from within a directory on the local file system — but if you have a certain filepath in a string that you want to test for a match against a dirglob, you can easily do that too with Pathname#fnmatch, which doesn’t even require the file to exist on the local file system.
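A quick sketch:

require 'pathname'

path = Pathname.new("root/path/sub/dir/file.rb")
# '**/' needs the File::FNM_PATHNAME flag to get glob-style directory recursion
path.fnmatch?("root/path/**/*.rb", File::FNM_PATHNAME)  # => true
path.fnmatch?("other/**/*.rb", File::FNM_PATHNAME)      # => false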

Filed under: General

Umlaut 4.0 beta
https://bibwild.wordpress.com/2014/09/18/umlaut-4-0-beta/ (18 Sep 2014)

Umlaut is an open source specific-item discovery layer, often used on top of SFX, and based on Rails.

Umlaut 4.0.0.beta2 is out! (Yeah, don’t ask about beta1 :) ).

This release is mostly back-end upgrades, including:

Support for Rails 4.x (Rails 3.2 included to make migration easier for existing installations, but recommend starting with Rails 4.1 in new apps)

Based on Bootstrap 3 (the Rails 3 version used Bootstrap 2)

internationalization/localization support

A more streamlined installation process with a custom installer

Anyone interested in beta testing? Probably most interesting if you have an SFX to point it at, but you can take it for a spin either way.

Filed under: General

Cleaning up the Rails backtrace cleaner; Or, The Engine Stays in the Picture!
https://bibwild.wordpress.com/2014/09/10/cleaning-up-the-rails-backtrace-cleaner-or-the-engine-stays-in-the-picture/ (10 Sep 2014)

Rails has for a while included a BacktraceCleaner that removes some lines from backtraces, and reformats others to be more readable.

This is pretty crucial, especially since recent versions of Rails can have pretty HUGE call stacks, due to reliance on Rack middleware and other architectural choices.

I rely on clean stack traces in the standard Rails dev-mode error page and in my log files of fatal uncaught exceptions — but also in some log files I write myself, where I catch and recover from an exception but want to log where it came from anyway, ideally with a clean stacktrace: `Rails.backtrace_cleaner.clean(exception.backtrace)`

A few problems I had with it though:

Several of my apps are based on kind of ‘one big Rails engine’ (Blacklight, Umlaut). The default cleaner will strip out any lines that aren’t part of the local app, but I really want to leave the ‘main engine’ lines in. That was my main motivation to look into this, but as long as I was at it, a couple other inconveniences…

The default cleaner nicely reformats lines from gems to remove the filepath to the gem dir, replacing it with just the name of the gem. But this didn’t seem to work for gems listed in Bundler as :path (or, I think, :github?) that don’t live in the standard gem repo. And that ‘main engine gem’ would often be checked out thus, especially in development.

Stack trace lines that come from ERB templates include a dynamically generated internal method name, which is really long and makes the stack trace confusing — the line number in the ERB file is really all we need. (At first I thought the Rails ‘render template pattern filter’ was meant to deal with that, but I think it’s meant for something else.)

Fortunately, you can remove and/or add your own silencers (which remove lines from the stack trace) and filters (which reformat stack trace lines) on the ActiveSupport/Rails::BacktraceCleaner.

Here’s what I’ve done to make it the way I want. I wanted it built directly into Umlaut (a Rails Engine), so this is written to go in Umlaut’s `< Rails::Engine` class. But you could do something similar in a local app, probably in the `initializers/backtrace_silencers.rb` file that Rails has left as a stub for you already.

Note that all filters are executed before silencers, so your silencer has to be prepared to recognize already-filtered input.

module Umlaut
  class Engine < Rails::Engine
    engine_name "umlaut"

    #...
    initializer "#{engine_name}.backtrace_cleaner" do |app|
      engine_root_regex = Regexp.escape(self.root.to_s + File::SEPARATOR)

      # Clean those ERB lines, we don't need the internal autogenerated
      # ERB method, what we do need (line number in ERB file) is already there
      Rails.backtrace_cleaner.add_filter do |line|
        line.sub /(\.erb:\d+)\:in `__.*$/, "\\1"
      end

      # Remove our own engine's path prefix, even if it's
      # being used from a local path rather than the gem directory.
      Rails.backtrace_cleaner.add_filter do |line|
        line.sub(/^#{engine_root_regex}/, "#{engine_name} ")
      end

      # Keep Umlaut's own stacktrace in the backtrace -- we have to remove Rails
      # silencers and re-add them how we want.
      Rails.backtrace_cleaner.remove_silencers!

      # Silence what Rails silenced, UNLESS it looks like
      # it's from Umlaut engine
      Rails.backtrace_cleaner.add_silencer do |line|
        (line !~ Rails::BacktraceCleaner::APP_DIRS_PATTERN) &&
        (line !~ /^#{engine_root_regex}/) &&
        (line !~ /^#{engine_name} /)
      end
    end
    #...
  end
end

Filed under: General

Cardo is a really nice free webfont
https://bibwild.wordpress.com/2014/09/09/cardo-is-a-really-nice-free-webfont/ (9 Sep 2014)

Some of the fonts on google web fonts aren’t that great. And I’m not that good at picking the good ones from the not-so-good ones on first glance either.

Cardo is a really nice old-style serif font that I originally found recommended on some list of “the best of google fonts”.

It’s got a pretty good character repertoire for latin text (and I think Greek). The Google Fonts version doesn’t seem to include Hebrew, even though some other versions might? For library applications, the more characters the better, and it should have enough to deal stylishly with whatever letters and diacritics you throw at it in latin/germanic languages, plus all the usual symbols (currency, punctuation, etc).

I’ve used it in a project that my eyeballs have spent a lot of time looking at (not quite done yet), and been increasingly pleased by it; it’s nice to look at and to read, especially on a ‘retina’ display. (I wouldn’t use it for headlines though.)