Performance on a many-membered Sufia/Hyrax show page

jrochkind

5 months ago

Advertisements

We still run Sufia 7.3, haven’t yet upgraded/migrated to hyrax, in our digital repository. (These are digital repository/digital library frameworks, for those who arrived here and are not familiar; you may not find the rest of the very long blog post very interesting. :))

We have a variety of ‘manuscript’/’scanned 2d text’ objects, where each page is a sufia/hyrax “member” of the parent (modeled based on PCDM). Sufia was originally designed as a self-deposit institutional repository, and I didn’t quite realize this until recently, but is now known sufia/hyrax to still have a variety of especially performance-related problems with works with many members. But it mostly works out.

The default sufia/hyrax ‘show’ page displays a single list of all members on the show page, with no pagination. This is also where admins often find members to ‘edit’ or do other admin tasks on them.

For our current most-membered work, that’s 473 members, 196 of which are “child works” (each of which is only a single fileset–we use child works for individual “interesting” pages we’d like to describe more fully and have show up in search results independently). In stock sufia 7.3 on our actual servers, it could take 4-6 seconds to load this page (just to get response from server, not including client-side time). This is far from optimal (or even ‘acceptable’ in standard Rails-land), but… it works.

While I’m not happy with that performance, it was barely acceptable enough that before getting to worrying about that, our first priority was making the ‘show’ page look better to end-users. Incorporating a ‘viewer’, launched by clicks on page thumbs, more options in a download menu, , bigger images with an image-forward kind of design, etc. As we were mostly just changing sizes and layouts and adding a few more attributes and conditionals, I didn’t think this would effect performance much compared to the stock.

However, just as we were about to reach a deadline for a ‘soft’ mostly-internal release, we realized the show page times on that most-membered work had deteriorated drastically. To 12 seconds and up for a server response, no longer within the bounds of barely acceptable. (This shows why it’s good to have some performance monitoring on your app, like New Relic or Skylight, so you have a chance to notice performance degradation as a result of code changes as soon as it happens. Although we don’t actually have this at present.)

We thus embarked on a week+ of most of our team working together on performance profiling to figure out what was up and — I’m happy to say — fixing it, perhaps even getting slightly better perf than stock sufia in the end. Some of the things we found definitely apply to stock sufia and hyrax too, others may not, we haven’t spend the time to completely compare and contrast, but I’ll try to comment with my advice.

When I see a major perf degradation like this, my experience tells me it’s usually one thing that’s caused it. But that wasn’t really true in this case, we had to find and fix several issues. Here’s what we found, how we found it, and our local fixes:

N+1 Solr Queries

The N+1 query problem is one of the first and most basic performance problems many Rails devs learn about. Or really, many web devs (or those using SQL or similar stores) generally.

It’s when you are showing a parent and it’s children, and end up doing an individual db fetch for every child, one-per-child. Disastrous performance wise, you need to find a way to do a single db fetch that gets everything you want instead.

So this was our first guess. And indeed we found that stock sufia/hyrax did do n+1 queries to Solr on a ‘show’ page, where n is the number of members/children.

If you were just fetching with ordinary ActiveRecord, the solution to this would be trivial, adding something like .includes(:members) to your ActiveRecord query. But of course we aren’t, so the solution is a bit more involved, since we have to go through Solr, and actually traverse over at least one ‘join’ object in Solr too, because of how sufia/hyrax stores these things.

I’m not a huge fan of overriding that core member_presenters method, but it works and I can’t think of a better way to solve this.

We went and implemented this without even doing any profiling first, cause it was a low-hanging fruit. And were dismayed to see that while it did improve things measurably, performance was still disastrous.

Solrizer.solr_name turns out to be a performance bottleneck?(!)

I first assumed this was probably still making extra fetches to solr (or even fedora!), that’s my experience/intuition for most likely perf problem. But I couldn’t find any of those.

Okay, now we had to do some actual profiling. I created a test work in my dev instance that had 200 fileset members. Less than our slowest work in production, but should be enough to find some bottlenecks, I hoped. The way I usually start is by a really clumsy and manual deleting parts of my templates to see what things deleted makes things faster. I don’t know if this is really a technique I’d recommend, but it’s my habit.

This allowed me to identify that indeed the biggest perf problem at this time was not in fetching the member-presenters, and indeed was in the rendering of them. But as I deleted parts of the partial for rendering each member, I couldn’t find any part that speeded up things drastically, deleting any part just speeded things up proportional to how much I deleted. Weird. Time for profiling with ruby-prof.

I wrapped the profiling just around the portion of the template I had already identified as problem area. I like the RubyProf::GraphHtmlPrinter report from ruby-prof for this kind of work. (One of these days I’m going to experiment GraphViz or compatible, but haven’t yet).

Surprisingly, the top culprit for taking up time was — Solrizer.solr_name. (We use Solrizer 3.4.1; I don’t believe as of this date newer versions of solrizer or other dependencies would fix this).

It makes sense Solrizer.solr_name is called a lot. It’s called basically every time you ask for any attribute from your Solr “show” presenter. I also saw it being called when generating an internal app link to a show page for a member, perhaps because that requires attributes. Anything you have set up to delegate …, to: :solr_document probably also ends up calling Solrizer.solr_name in the SolrDocument.

While I think this would be a problem in even stock Sufia/Hyrax, it explains why it could be more of a problem in our customization — we were displaying more attributes and links, something I didn’t expect would be a performance concern; especially attributes for an already-fetched object oughta be quite cheap. Also explains why every part of my problem area seemed to contribute roughly equally to the perf problem, they were all displaying some attribute or link!

I don’t think this would be a great upstream PR, it’s a workaround. Would be better to figure out why Solrizer.solr_name is so slow, but my initial brief forays there didn’t reveal much, and I had to return to our app.

Because while this did speed up my test case by a few hundred ms, my test case was still significantly slower compared to an older branch of our local app with better performance.

Using QuestioningAuthority gem in ways other than intended

We use the gem commonly referred to as “Questioning Authority“, but actually released as a gem called qa for most of our controlled vocabularies, including “rights”. We wanted to expand the display of “rights” information beyond just a label, we wanted a nice graphic and user-facing shortened label ala rightstatements.org.

It worked great… except after taking care of Solrizer.solr_name, this was the next biggest timesink in our perf profile. Specifically it seemed to be calling slow YAML.load a lot. Was it reloading the YAML file from disk on every call? It was! And we were displaying licensing info for every member.

I spent some time investigating the qa gem. Was there a way to add caching and PR it upstream? A way that would be usable in an API that would give me what I wanted here? I couldn’t quite come up with anything without pretty major changes. The QA gem wasn’t really written for this use case, it is focused pretty laser-like on just providing auto-complete to terms, and I’ve found it difficult in the past to use it for anything else. Even in it’s use case, not caching YAML is a performance mistake, but since it would usually be done only once per request it wouldn’t be disastrous.

Sufia::SufiaHelperBehavior#application_name is also slow

We were calling that #application_name helper twice per member… just in a data-confirm attr on a delete link! `Deleting #{file_set}from #{application_name} is permanent. Click OK to delete this from #{application_name}, or Cancel to cancel this operation. `

If the original sufia code didn’t have this, or only had application_name once instead of twice, that could explain a perf regression in our local code, if application_name is slow. I’m not sure if it did or not, but this was the biggest bottleneck in our local code at this time either way.

Why is application_name so slow? This is another method I might expect would be fast enough to call thousands of times on a page, in the cost vicinity of a hash lookup. Is I18n.t just slow to begin with, such that you can’t call it 400 times on a page? I doubt it, but it’s possible. What’s hiding in that super call, that is called on every invocation even if no default is needed? Not sure.

Too Many Partials

I was somewhat cheered, several days in, to be into actual generic Rails issues, and not Samvera-stack-specific ones. Because after fixing above, the next most expensive thing identifiable in our perf profile was a Rails ‘lookup_template’ kind of method. (Sorry, I didn’t keep notes or the report on the exact method).

As our HTML for displaying “a member on a show page” got somewhat more complex (with a popup menu for downloads and a popup for admin functions), to keep the code more readable we had extracted parts to other partials. So the main “show a member thumb” type partial was calling out to three other partials. So for 200 members, that meant 600 partial lookups.

Seeing that line in the profile report reminded me, oh yeah, partial lookup is really slow in Rails. I remembered that from way back, and had sort of assumed they would have fixed it in Rails by now, but nope. In production configuration template compilation is compiled, but every render partial: is still a live slow lookup, that I think even needs to check the disk in it’s partial lookup (touching disk is expensive!).

This would be a great thing to fix in Rails, it inconveniences many people. Perhaps by applying some kind of lookup caching, perhaps similar to what Bootsnap does for $LOAD_PATH and require, but for template lookup paths. Or perhaps by enhancing the template compilation so the exact result of template lookups are compiled in and only need to be done on template compilation. If either of these were easy to do, someone would probably have done them already (but maybe not).

In any event, the local solution is simple, if a bit painful to code legibility. Remove those extra partials. The main “show a member” partial is invoked with render collection, so only gets looked-up once and is not a problem, but when it calls out to others, it’s one lookup per render every time. We inlined one of them, and turned two more into helper methods instead of partials.

At this point, I had my 200-fileset test case performing as well or better as our older-more-performant-branch, and I was convinced we had it! But we deployed to staging, and it was still significantly slower than our more-performant-branch for our most-membered work. Doh! What was the difference? Ah right, our most-membered work has 200 child works, my test case didn’t have child works.

Okay, new test case (it was kinda painful to figure out how to create a many-hundred-child-work test case in dev, and very slow with what I ended up with). And back to ruby-prof.

N+1 Solr queries again, for representative_presenter

Right before our internal/soft deadline, we had to at least temporarily bail out of using riiif for tiled image viewer and other derivatives too, for performance reasons. (We ultimately ended up not using riiif, you can read about that too).

In the meantime, we added a feature switch to our app so we could have the riiif-using code in there, but turn it on and off. So even though we weren’t really using riiif yet (or perf testing with riiif), there was some code in there preparing for riiif, that ended up being relevant to perf for works with child-works.

For riiif, we need to get a file_id to pass to riiif. And we also wanted the image height and width, so we could use lazysizes-aspect ratio so the image would be taking up the proper space on the screen even if waiting for a slow riiif server to deliver it. (lazysizes for lazy image loading, and lazysizes-aspectratio which can be used even without lazy loading — are highly recommended, they work great).

We used polymorphism, for a fileset member, the height, width and original_file_id were available directly on the solr object fetched corresponding to the member. But for a child work, it delegated to representative_presenter to get them. And representative_presenter, of course, triggered a solr fetch. Actually, it seemed to trigger three solr fetches, so you could actually call this a 3n+1 query!

If we were fetching from ActiveRecord, the solution to this would possibly be as simple as adding something like .includes("members", "members.representative") . Although you’d have to deal with some polymorphism there in some ways tricky for AR, so maybe that wouldn’t work out. But anyway, we aren’t.

At first I spent some time thinking through if there was a way to bulk-eager-load these representatives for child works similarly to what you might do with ActiveRecord. It was tricky, because the solr data model is tricky, the polymorphism, and solr doesn’t make “joins” quite as straighforward as SQL does. But then I figured, wait, use Solr like Solr. In Solr it’s typical to “de-normalize” your data so the data you want is there when you need it.

I implemented code to index a representative_file_id, representative_width, and representative_height directly on a work in Solr. At first it seemed pretty straightforward. Then we discovered it was missing some edge cases (a work that has as it’s representative a child work, that has nothing set as it’s representative?), and that there was an important omission — if a work has a child work as a representative, and that child work changes it’s representative (which now applies to the first work), the first work needs to be reindexed to have it. So changes to one work need to trigger a reindex of another. After around 10 more frustrating dev hours, some tricky code (which reduces indexing performance but better than bad end-user performance), some very-slow and obtuse specs, and a very weary brain, okay, got that taken care of too. (this commit may not be the last word, I think we had some more bugfixes after that).

After a bulk reindex to get all these new values — our code is even a little bit faster than our older-better-performing-branch. And, while I haven’t spent the time to compare it, I wouldn’t be shocked if it’s actually a bit faster than the Stock sufia. It’s not fast, still 4-5s for our most-membered-work, but back to ‘barely good enough for now’.

Future: Caching? Pagination?

My personal rules of thumb in Rails are that a response over 200ms is not ideal, over 500ms it’s time to start considering caching, and over 1s (uncached) I should really figure out why and make it faster even if there is caching. Other Rails devs would probably consider my rules of thumb to already be profligate!

So 4s is still pretty slow. Very slow responses like this not only make the user wait, but load down your Rails server filling up it’s processing queue and causing even worse problems under multi-user use. It’s not great.

Under a more standard Rails app, I’d definitely reach for caching immediately. View or HTTP caching is a pretty standard technique to make your Rails app as fast as possible, even when it doesn’t have pathological performance.

But the standard Rails html caching approaches use something they call ‘russian doll caching’, where the updated_at timestamp on the parent is touched when a child is updated. The issue is making sure the cache for the parent page is refreshed when a child displayed on that page changes.

classProduct < ApplicationRecord

has_many :games

end

classGame < ApplicationRecord

belongs_to :product, touch: true

end

With touch set to true, any action which changes updated_at for a game record will also change it for the associated product, thereby expiring the cache.

ActiveFedora tries to be like ActiveRecord, but it does not support that “touch: true” on associations used in the example for russian doll caching. It might be easy to simulate with an after_save hook or something — but updating records in Fedora is so slow. And worse, I think (?) there’s no way to atomically update just the updated_at in fedora, you’ve got to update the whole record, introducing concurrency problems. I think this could be a whole bunch of work.

jcoyne in slack suggested that instead of russian-doll-style with touching updated_at, you could assemble your cache key from the updated_at values from all children. But I started to worry about child works, this might have to be recursive, if a child is a child work, you need to include all it’s children as well. (And maybe File children of every FileSet? Or how do fedora ‘versions’ effect this?). It could start getting pretty tricky. This is the kind of thing the russian-doll approach is meant to make easier, but it relies on quick and atomic touching of updated_at.

We’ll probably still explore caching at some point, but I suspect it will be much less straightforward to work reliably than if this were a standard rails/AR app. And the cache failure mode of showing end-users old not-updated data is, I know from experience, really confusing for everyone.

Alternately or probably additionally, why are we displaying all 473 child images on the page at once in the first place? Even in a standard Rails app, this might be hard to do performantly (although I’d just solve it with cache there if it was the UX I wanted, no problem). Mostly we’re doing it just cause stock sufia did it and we got used to it. Admins use ctrl-f on a page to find a member they want to edit. I kind of like having thumbs for all pages right on the page, even if you have to scroll a lot to see them (was already using lazysizes to lazy load the images only when scrolled to). But some kind of pagination would probably be the logical next step, that we may get to eventually. One or more of:

Actual manual pagination. Would probably require a ‘search’ box on titles of members for admins, since they can’t use cntrl-f anymore.

Javascript-based “infinite scroll” (not really infinite) to load a batch at a time as user scrolls there.

Or using similar techniques, but actually load everything with JS immediately on page load, but a batch at a time. Still going to use the same CPU on the server, but quicker initial page load, and splitting up into multiple requests is better for server health and capacity.

Even if we get to caching or some of these, I don’t think any of our work above is wasted — you don’t want to use this technique to workaround performance bottlenecks on the server, in my opinion you want to fix easily-fixable (once you find them!) performance bottlenecks or performance bugs on the server first, as we have done.

And another approach some would be not rendering some/all of this HTML on the server at all, but switching to some kind of JS client-side rendering (react etc.). There are plusses and minuses to that approach, but it takes our team into kinds of development we are less familiar with, maybe we’ll experiment with it at some point.

Thoughts on the Hydra/Samvera stack

So. I find Sufia and the samvera stack quite challenging, expensive, and often frustrating to work with. Let’s get that out of the way. I know I’m not alone in this experience, even among experienced developers, although I couldn’t say if it’s universal.

I also enjoy and find it rewarding and valuable to think about why software is frustrating and time-consuming (expensive) to work with, what makes it this way, and how did it get this way, and (hardest of all), what can be done or done differently.

If you’re not into that sort of discussion, please feel free to drop out now. Myself, I think it’s an important discussion to have. Developing a successful collaborative open source shared codebase is hard, there are many things we (or nobody) has figured out, and I think it can take some big-picture discussion and building of shared understanding to get better at it.

I’ve been thinking about how to have that discussion in as productive a way as possible. I haven’t totally figured it out — wanting to add this piece in but not sure how to do it kept me from publishing this blog post for a couple months after the preceding sections were finished — but I think it is probably beneficial to ground and tie the big picture discussion in specific examples — like the elements and story above. So I’m adding it on.

I also think it’s important to tell beginning developers working with Samvera, if you are feeling frustrated and confused, it’s probably not you, it’s the stack. If you are thinking you must not be very good at programming or assuming you will have similar experiences with any development project — don’t assume that, and try to get some experience in other non-samvera projects as well.

So, anyhow, this experience of dealing with performance problems on a sufia ‘show’ page makes me think of a couple bigger-picture topics: 1) The continuing cost of using a less established/bespoke data store layer (in this case Fedora/ActiveFedora/LDP) over something popular with many many developer hours already put into it like ActiveRecord, and 2) The idea of software “maturity”.

In this post, I’m actually going to ignore the first other than that, and focus on the second “maturity”.

Software maturity: What is it, in general?

People talk about software being “mature” (or “immature”) a lot, but googling around I couldn’t actually find much in the way of a good working definition of what is meant by this. A lot of what you find googling is about the “Capability Maturity Model“. The CMM is about organizational processes rather than product, it’s came out of the context of defense department contractors (a very different context than collaborative open source), and I find it’s language somewhat bureaucratic. It also has plenty of critique. I think organizational process matters, and CMM may be useful to our context, but I haven’t figured out how to make use of CMM to speak to about software maturity in the way I want to here, so I won’t speak of it again here.

Other discussions I found also seemed to me kind of vague, hand-wavy, or self-referential, in ways I still didn’t know how to make use of to talk about what I wanted.

I actually found a random StackOverflow answer I happened across to be more useful than most, I found it’s focus on usage scenarios and shared understandingto be stimulating:

I would say, mature would add the following characteristic to a technology:

People know how to use it, know its possibilities and limitations

People know what the typical usage scenarios are, patterns, what are good usage scenarios for this technology so that it shows its best

People have found out how to deal with limitations/bugs, there is a community knowledge and help out there

The technology is trusted enough to be used not only by individuals but in productive commercial environment as well

In this way of thinking about it, mature software is software where there is shared understanding about what the software is for, what patterns of use it is best at and which are still more ‘unfinished’ and challenging; where you’re going to encounter those, and how to deal with them. There’s no assumption that it does everything under the sun awesomely, but that there’s a shared understanding about what it does do awesomely.

I think the unspoken assumption here is that for the patterns of use the software is best at, it does a good job of them, meaning it handles the common use cases robustly with few bugs or surprises. (If it doesn’t even do a good job of those, that doesn’t seem to match what we’d want to call ‘maturity’ in software, right? A certain kind of ‘ready for use’; a certain assumption you are not working on an untested experiment in progress, but on something that does what it does well.).

For software meant as a tool for developing other software (any library or framework; I think sufia qualifies), the usage scenarios are at least as much about developers (what they will use the software for and how) as they are about the end-users those developers are ultimately develop software for.

Unclear understanding about use cases is perhaps a large part of what happened to me/us above. We thought sufia would support ‘manuscript’ use cases (which means many members per work if a page image is a member, which seems the most natural way to set it up) just fine. It appears to have the right functionality. Nothing in it’s README or other ‘marketing’ tells you otherwise. At the time we began our implementation, it may very well be that nobody else thought differently either.

At some point though, a year+ after the org began implementing the technology stack believing it was mature for our use case, and months after I started working on it myself — understanding that this use case would have trouble in sufia/hyrax began to build, we started realizing, and realizing that maybe other developers had already realized, that it wasn’t really ready for prime time with many-membered works and would take lots of extra customization and workarounds to work out.

The understanding of what use cases the stack will work painlessly for, and how much pain you will have in what areas, can be something still being worked out in this community, and what understanding there is can be unevenly distributed, and hard to access for newcomers. The above description of software maturity as being about shared understanding of usage scenarios speaks to me; from this experience it makes sense to me that that is a big part of ‘software maturity’, and that the samvera stack still has challenges there.

Polish is what distinguishes good software from great software. When you use an app or code that clearly cares about the edge cases and how all the pieces work together, it feels right.…

…User frustration comes when things do not behave as you expect them to. You pull out your car key, stick it in the ignition, turn it…and nothing happens. While you might be upset that your car is dead (again), you’re also frustrated that what you predicted would happen didn’t. As humans we build up stories to simplify our lives, we don’t need to know the complex set of steps in a car’s ignition system so instead, “the key starts the car” is what we’ve come to expect. Software is no different. People develop mental models, for instance, “the port configuration in the file should win” and when it doesn’t happen or worse happens inconsistently it’s painful.

I’ve previously called these types of moments papercuts. They’re not life threatening and may not even be mission critical but they are much more painful than they should be. Often these issues force you to stop what you’re doing and either investigate the root cause of the rogue behavior or at bare minimum abandon your thought process and try something new.

When we say something is “polished” it means that it is free from sharp edges, even the small ones. I view polished software to be ones that are mostly free from frustration. They do what you expect them to and are consistent…

…In many ways I want my software to be boring. I want it to harbor few surprises. I want to feel like I understand and connect with it at a deep level and that I’m not constantly being caught off guard by frustrating, time stealing, papercuts.

This kind of “polish” isn’t the same thing as maturity — schneems even suggests that most software may not live up to his standards of “polish”.

However, this kind of polish is a continuum. On the dark opposite side, we’d have hypothetical software, where working with it is about near constant surprises, constantly “being caught off guard by frustrating, time-stealing papercuts”, software where users (including developer-users for tools) have trouble developing consistent mental models, perhaps because the software is not very consistent in it’s behavior or architecture, with lots of edge cases and pieces working together unexpectedly or roughly.

I think our idea of “maturity” in software does depend on being somewhere along this continuum toward the “polished” end. If we combine that with the idea about shared understanding of usage scenarios and maturity, we get something reasonable. Mature software has shared understanding about what usage scenarios it’s best at, generally accomplishing those usage scenarios painlessly and well. At least in those usage scenarios it is “polished”, people can develop mental models that let them correctly know what to expect, with frustrating “papercuts” few and far between.

Mature software also generally maintains backwards compatibility, with backwards breaking changes coming infrequently and in a well-managed way — but I think that’s a signal or effect of the software being mature, rather than a cause. You could take software low on the “maturity” scale, and simply stop development on it, and thereby have a high degree of backwards compat in the future, but that doesn’t make it mature. You can’t force maturity by focusing on backwards compatibility, it’s a product of maturity.

So, Sufia and Samvera?

When trying to figure out how mature software is, we are used to taking certain signals as sort of proxy evidence for it. There are about 4 years between the release of sufia 1.0 (April 2013) and Sufia 7.3 (March 2017; beyond this point the community’s attention turned from Sufia to Hyrax, which combined Sufia and CurationConcerns). Much of sufia is of course built upon components that are even older: ActiveFedora 1.0 was Feb 2009, and the hydra gem was first released in Jan 2010. This software stack has been under development for 7+ years, and is used by several dozens of institutions.

Normally, one might take these as signs predicting a certain level of maturity in the software. But my experience has been that it was not as mature as one might expect from this history or adoption rate.

From the usage scenario/shared understanding bucket, I have not found that there is as high degree as I might have expected of easily accessible shared understanding of “know how to use it, know its possibilities and limitations,” “know what the typical usage scenarios are, patterns, what are good usage scenarios for this technology so that it shows its best.” Some people have this understanding to some extent, but this knowledge is not always very clear to newcomers or outsiders — and not what they may have expected. As in this blog post, things I may assume are standard usage scenarios that will work smoothly may not be. Features I or my team assumed were long-standing, reliable, and finished sometimes are not.

On the “polish” front, I honestly do feel like I am regularly “being caught off guard by frustrating, time stealing, papercuts,” and finding inconsistent and unparallel architecture and behavior that makes it hard to predict how easy or successful it will be to implement something in sufia; past experience is no guarantee of future results, because similar parts often work very differently. It often feels to me like we are working on something at a more proof-of-concept or experimental level of maturity, where you should expect to run into issues frequently.

To be fair, I am using sufia 7, which has been superceded by hyrax (1.0 released May 2017, first 2.0 beta released Sep 2017, no 2.0 final release yet), which in some cases may limit me to older versions of other samvera stack dependencies too. Some of these rough edges may have been filed off in hyrax 1/2, one would expect/hope that every release is more mature than the last. But even with Sufia 7 — being based on technology with 4-7 years of development history and adopted by dozens of institutions, one might have expected more maturity. Hyrax 1.0 was only released a few months ago after all. My impression/understanding is that hyrax 1.0 by intention makes few architectural changes from sufia (although it may include some more bugfixes), and upcoming hyrax 2.0 is intended to have more improvements, but still most of the difficult architectural elements I run into in sufia 7 seem to be mostly the same when I look at hyrax master repo. My impression is that hyrax 2.0 (not quite released) certainly has improvements, but does not make huge maturity strides.

Does this mean you should not use sufia/hyrax/samvera? Certainly not (and if you’re reading this, you’ve probably already committed to it at least for now), but it means this is something you should take account of when evaluating whether to use it, what you will do with it, and how much time it will take to implement and maintain. I certainly don’t have anything universally ‘better’ to recommend for a digital repository implementation, open source or commercial. But I was very frustrated by assuming/expecting a level of maturity that I then personally did not find to be delivered. I think many organizations are also surprised to find sufia/hyrax/samvera implementation to be more time-consuming (which also means “expensive”, staff time is expensive) than expected, including by finding features they had assumed were done/ready to need more work than expected in their app; this is more of a problem for some organizations than others. But I think it pays to take this into account when making plans and timelines. Again, if you (individually or as an institution) are having more trouble setting up sufia/hyrax/samvera than you expected, it’s probably not just you.

Why and what next?

So why are sufia and other parts of the samvera stack at a fairly low level of software maturity (for those who agree they are not)? Honestly, I’m not sure. What can be done to get things more mature and reliable and efficient (low TCO)? I know even less. I do notthink it’s because any of the developers involved (including myself!) have anything but the best intentions and true commitment, or because they are “bad developers.” That’s not it.

Just some brainstorms about what might play into sufia/samvera’s maturity level. Other developers may disagree with some of these guesses, either because I misunderstand some things, or just due to different evaluations.

Digital repositories are just a very difficult or groundbreaking domain, and it just necessarily would take this number of years/developer-hours to get to this level of maturity. (I don’t personally subscribe to this really, but it could be)

Fedora and RDF are both (at least relatively) immature technologies themselves, that lack the established software infrastructure and best practices of more mature technologies (at the other extreme, SQL/rdbms, technology that is many decades old), and building something with these at the heart is going to be more challenging, time-consuming, and harder to get ‘right’.

I had gotten the feeling from working with the code and off-hand comments from developers who had longer that Sufia had actually taken a significant move backwards in maturity at some point in the past. At first I thought this was about the transition from fedora/fcrepo 3 to 4. But from talking to @mjgiarlo (thanks buddy!), I now believe it wasn’t so much about that, as about some significant rewriting that happened between Sufia 6 and 7 to: Take sufia from an app focused on self-deposit institutional repository with individual files, to a more generalized app involving ‘works’ with ‘members’ (based on the newly created PCDM model); that would use data in Fedora that would be compatible with other apps like Islandora (a goal that has not been achieved and looks to me increasingly unrealistic); and exploded into many more smaller purpose hypothetically decoupled component dependencies that could be recombined into different apps (an approach that, based on outcomes, was later reversed in some ways in Hyrax).

This took a very significant number of developer hours, literally over a year or two. These were hours that were not spent on making the existing stack more mature.

But so much was rewritten and reorganized that I think it may have actually been a step backward in maturity (both in terms of usage scenarios and polish), not only for the new usage scenarios, but even for what used to be the core usage scenario.

So much was re-written, and expected usage scenarios changed so much, that it was almost like creating an entirely new app (including entirely new parts of the dependency stack), so the ‘clock’ in judging how long Sufia (and some but not all other parts of the current dependency stack) has had to become mature really starts with Sufia 7 (first released 2016), rather than sufia 1.0.

But it wasn’t really a complete rewrite, “legacy” code still exists, some logic in the stack to this day is still based on assumptions about the old architecture that have become incorrect, leading to more inconsistency, and less robustness — less maturity.

The success of this process in terms of maturity and ‘total cost of ownership’ are, I think… mixed at best. And I think some developers are still dealing with some burnout as fallout from the effort.

Both sufia and the evolving stack as a whole have tried to do a lot of things and fit a lot of usage scenarios. Our reach may have exceeded our grasp. If an institution came with a new usage scenario (for end-users or for how they wanted to use the codebase), whether they come with a PR or just a desire, the community very rarely says no, and almost always then tries to make the codebase accommodate. Perhaps in retrospect without sufficient regard for the cost of added complexity. This comes out of a community-minded and helpful motivation to say ‘yes’. But it can lead to lack of clarity on usage scenarios the stack excels at, or even lack of any usage scenarios that are very polished in the face of ever-expanding ambition. Under the context of limited developer resources yes, but increased software complexity also has costs that can’t be handled easily or sometimes at all simply by adding developers either (see The Mythical Man-Month).

Related, I think, sufia/samvera developers have often aspired to make software that can be used and installed by institutions without Rails developers, without having to write much or any code. This has not really been accomplished, or if it has only in the sense that you need samvera developer(s) who are or become proficient in our bespoke stack, instead of just Rails developers. (Our small institution found we needed 1-2 developers plus 1 devops). While motivated by the best intentions — to reduce Total Cost of Ownership for small institutions — the added complexity in pursuit of this ambitious and still unrealized goal may have ironically led to less maturity and increased TCO for institutions of all sizes.

I think most successfully mature open source software probably have one (or a small team of) lead developer/architect(s) providing vision as to the usage scenarios that are in or out, and to a consistent architecture to accomplish them. And with the authority and willingness to sometimes say ‘no’ when they think code might be taking the project in the wrong direction on the maturity axis. Samvera, due to some combination of practical resource limitations and ideology, has often not.

ActiveRecord is enormously complex software which took many many developer-hours to get to it’s current level of success and maturity. (I actually like AR okay myself). The thought that it’s API could be copied and reimplemented as ActiveFedora, with much fewer developer-hour resources, without encountering a substantial and perhaps insurmountable “maturity gap” — may in retrospect have been mistaken. (See above about how basing app on Fedora has challenges to achieving maturity).

What to do next, or different, or instead? I’m not sure! On the plus side we have a great community of committed and passionate and developers, and institutions interested in cooperating to help each other.

I think improvements start with acknowledging the current level of maturity, collectively and in a public way that reaches non-developer stakeholders, decision-makers, and funders too.

We should be intentional about being transparent with the level of maturity and challenge the stack provides. Resisting any urge to “market” samvera or deemphasize the challenges, which is a disservice to people evaluating or making plans based on the stack, but also to the existing community too.We don’t all have to agree about this either; I know some developers and institutions do have similar analysis to me here (but surely with some differences), others may not. But we have to be transparent and public about our experiences, to all layers of our community as well as external to it. We all have to see clearly what is, in order to make decisions about what to do next.

Personally, I think we need to be much more modest about our goals and the usage scenarios (both developer and end-user) we can support. This is not necessarily something that will be welcome to decision-makers and funders, who have reasons to want to always add on more instead. But this is why we need to be transparent about where we truly currently are, so decision-makers can operate based on accurate understanding of our current challenges and problems as well as successes