If you run a constantly-growing Solr index (as many VuFind users do), chances are that sooner or later, you will need to do some Java tuning in order to solve performance problems. There are some good resources already on the web about this topic (for example, Sun’s Java Tuning White Paper), but they tend to be somewhat dense and technical. This article is intended to give a shorter introduction to the problem and the most basic strategies for solving it. If you need more details, by all means refer to more technical sources; I just wanted to offer an easier starting point.

Why Java Needs Tuning

The main reason Java needs tuning has to do with how it handles memory management. I was studying computer science when my university switched its curriculum from C++ to Java, so I’m very familiar with Java’s distinctive approach to this subject. In C++, the programmer is responsible for all the fine details of memory management — you have to request all of the memory that you plan to use, then return it to the operating system when you are done with it. Failing to do this properly leads to the dreaded “memory leak.” Java relieves this burden by taking an entirely different approach: the programmer uses memory without worrying about where it came from, and Java uses something called a “garbage collector” to figure out which pieces of memory are no longer needed and free them up for others to use. The C++ to Java transition caused many lessons to abruptly change from “memory management is of vital importance to all of your work” to “don’t worry about memory management; the magic box will do it for you.”

Usually, sparing the programmer from worrying about memory is a great improvement — it removes a lot of tedium from the work of writing code, and most of the time, the garbage collector just does its job, and nobody has to think about it. The problem is that for complex, memory-hungry applications like Solr, the garbage collector sometimes can’t keep up. The longer the program runs, the more time Java spends on garbage collection and the less time it spends on actually running the program. In extreme situations, a Java program can become completely unresponsive, devoting all of its effort to cleaning up after itself. If you run into problems with VuFind searches becoming extremely slow and find that the problem goes away after you restart VuFind, the cause is almost certainly the garbage collector. A restart frees up all memory and gives Java a clean start, so it’s usually an easy fix to performance problems… but it’s only a matter of time before garbage once again accumulates to a critical level and the problem returns!

Possible Solutions

There are basically three answers to the Java garbage collection problem, and you don’t have to pick just one. Using multiple strategies at the same time often makes sense.

• Regularly restart your Java application: call it cheating or postponing the inevitable if you like, but it’s a very simple approach. If it takes several days for your application to start performing poorly, just schedule it to restart automatically in the middle of the night, every night, to get a clean slate and consistent stability.

• Give Java more memory: intense garbage collection is triggered by high memory use, so the more memory you have available, the longer it will take for a program to fill it all. Adding memory reduces (or at least postpones) the need for garbage collection, and it’s as simple as changing a couple of parameters (see details in the VuFind wiki). Generally, more memory is better, but there is one important caveat: don’t give Java all of your system’s available memory, since that can crowd out your operating system and cause other performance problems. Always leave a bit of a buffer.

• Change garbage collector behavior: Java has several different garbage collection strategies available, and some of them have additional tunable parameters. This is where things start to get complicated, but Lucid Imagination’s Java Garbage Collection Boot Camp offers a good run-down of the available choices (not to mention providing some more detailed technical background). Even if you don’t understand all the gory details, knowing the available options means you can do some trial and error.

Testing Your Strategy

Trial and error is an inevitable part of solving Java tuning problems. The biggest shortcoming I found in other articles on the subject is that they don’t offer a simple strategy for doing this. Fortunately, it’s not too hard to test your progress using some simple tools.

Java is capable of recording a log of all of its garbage collection behavior, telling you how often it performs garbage collection and how long each collection takes to complete. While the exact parameter for generating a log may vary depending on the Java Virtual Machine that you are using, for VuFind’s preferred OpenJDK version, you can add something like this to your Java options:

-Xloggc:/tmp/garbage.log

As you can probably guess, this outputs the garbage collection data to a log file called /tmp/garbage.log. If you want something fancier, you could do this instead:

-Xloggc:$VUFIND_HOME/solr/jetty/logs/gc-`/bin/date +%F-%H-%M`.log

Through the magic of the Unix shell, this version stores logs inside VuFind’s solr/jetty/logs folder, naming each log file with the date and time that VuFind started up so that you can track behavior across multiple restarts.

So far, so good… except that these log files are really hard to read. Fortunately, an excellent tool exists to help visualize the data: gcviewer. With gcviewer, you can see a graph of your memory usage and the time spent on garbage collection, plus there are a number of handy statistics available (average collection time, total collection time, longest collection time, etc.). If gcviewer doesn’t meet your needs, there is also an IBM tool called PMAT which is slightly less convenient to download but which supports a broader range of log formats.

By logging data for several days between each tweak to your Java settings and using gcviewer or PMAT to analyze your logs, you can usually get a pretty good sense of whether you’ve made things better or worse… and how long it takes for your application to fall into the pit of inefficient garbage collection.
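If you just want a few headline numbers without opening a graphical tool, a short script can pull them out of the log. The sketch below is only an illustration, not a replacement for gcviewer or PMAT. It assumes the log contains lines in the classic HotSpot style, such as "12.345: [GC 65536K->12345K(251392K), 0.0123456 secs]"; other JVMs and newer logging options produce different formats, so the regular expression would need adjusting. The function name summarize_gc_log is my own invention.

```python
import re

# Assumed line shape (classic HotSpot -Xloggc output; other JVMs differ):
#   12.345: [GC 65536K->12345K(251392K), 0.0123456 secs]
#   20.100: [Full GC 120000K->30000K(251392K), 1.5000000 secs]
PAUSE_RE = re.compile(r"\[(Full GC|GC)\b.*?(\d+\.\d+) secs\]")


def summarize_gc_log(lines):
    """Summarize pause times (in seconds) from an iterable of log lines."""
    pauses = []
    full_collections = 0
    for line in lines:
        match = PAUSE_RE.search(line)
        if match is None:
            continue  # skip anything that does not look like a GC event
        if match.group(1) == "Full GC":
            full_collections += 1
        pauses.append(float(match.group(2)))
    total = sum(pauses)
    return {
        "collections": len(pauses),
        "full_collections": full_collections,
        "total_secs": total,
        "average_secs": total / len(pauses) if pauses else 0.0,
        "longest_secs": max(pauses, default=0.0),
    }
```

Because the function accepts any iterable of lines, you can pass it an open file object for /tmp/garbage.log and compare the resulting numbers before and after each settings change.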

Conclusion

Java tuning is never going to be an easy subject to understand deeply, but that doesn’t mean you need to be afraid of it. There are several simple strategies available to help solve your problems even if you don’t know all the details of what is going on under the hood, and there are readily available tools to help you support your inevitable trial and error with empirical data. In fact, even if you are experiencing perfect performance today, it might not be a bad idea to examine garbage collection logs occasionally to see if you can prevent future problems before they become noticeable! Magic problem-solving boxes are great most of the time, but a bit of knowledge is always helpful for those times when they let you down.

Many great new books have been added to the library’s collection this month. Take a look at the New Books related to communication studies. No time to make it over to the library? This month’s list includes many e-books, which can be read on your computer screen.

For an interesting look at nineteenth-century domestic life, be sure to check out The complete home: an encyclopedia of domestic life and affairs … by Mrs. Julia McNair Wright (Philadelphia: J. C. McCurdy & Co., 1879). These tips and tricks are narrated by “Aunt Sophronia” to her three young nieces. Be sure to check out the illustrations! (The Internet Archive scans do not do them justice, so we have included our own scans of the illustrations in our Image Collection.) In the years following the U.S. Civil War, domestic bliss was seen as the nation’s saving grace after the loss of so many lives. In addition, women’s active role in the abolitionist movement came to an end with the passing of the 15th Amendment and women themselves became the topic of debate, with some parties arguing that women belonged in traditional domestic roles and others arguing that women should be allowed to participate more freely in non-traditional arenas.

Modern Home, from "The Complete Home..."

On that note, the Proceedings of the Twenty-fifth Annual Convention of the National American Woman Suffrage Association, held in Washington, D.C., January 16, 17, 18, 19, 1893 edited by Harriet Taylor Upton (Washington, D.C.: The Association, 1893) gives a different perspective of nineteenth-century life, when women were still fighting for the right to vote. We often take our voting rights for granted these days, so it is important to look back at the history of women’s suffrage. The National American Woman Suffrage Association (NAWSA) was formed in 1890 as a result of the merger of the National Woman Suffrage Association and the American Woman Suffrage Association, both of which were founded in 1869. It took 51 years from the initial creation of those two groups until women’s suffrage was finally achieved when the 19th Amendment became law in August of 1920, less than 100 years ago. For more on the history of the NAWSA and the path to women’s suffrage, check out this site, part of an online exhibit from the Bryn Mawr College Library.

Speaking of Bryn Mawr, another interesting book is A Book of Bryn Mawr stories edited by Margaretta Morris and Louise Buffum Congdon (Philadelphia: George W. Jacobs and Company, 1901). I am a 2005 alumna of Bryn Mawr College, so I couldn’t resist giving them a mention during Women’s History Month. Founded in 1885, Bryn Mawr College was not the first women’s college in the United States, but it was the first to offer undergraduate education on par with that of the top men’s colleges. The College sought to provide women with intellectual challenges and give them the opportunity to conduct original research, “a European-style program that was then available only at a few elite institutions for men.” It was also the first institute of higher education to grant graduate degrees (including doctorates) to women. In 1892, Bryn Mawr founded the first self-government association, granting its students the right to make and enforce the rules governing their conduct. From its inception, Bryn Mawr College strove to overcome the nineteenth-century notion that women were not the intellectual equals of men. This collection of “Bryn Mawr stories” marked the first truly introspective look at the College. Although fictional, the stories provide early glimpses of the unique characteristics of Bryn Mawr.

Get political news online – 58% of online adults looked online for news about politics or the 2010 campaigns, and 32% of online adults got most of their 2010 campaign news from online sources.

Go online to take part in specific political activities, such as watch political videos, share election-related content or “fact check” political claims – 53% of adult internet users did at least one of the eleven online political activities we measured in 2010.

Use Twitter or social networking sites for political purposes – One in five online adults (22%) used Twitter or a social networking site for political purposes in 2010.

In the process of creating and implementing research technology to benefit students, faculty, and staff, Falvey Memorial Library’s Technology Development team has pioneered several exciting library projects. These include the Digital Library and the Community Bibliography, two distinctive examples of how the Library uses technology.

Have you ever wondered what those hearts in the library catalog are for? Did you notice that some catalog records are tagged? Favorites and Tags can be used in different ways to organize books into lists for personal use or to share with other students and colleagues. Jutta Seibert provides a short overview of how these catalog features function.

One of the perils of keyword-based searching is that sometimes it is not totally clear why certain results show up after performing a search. Fortunately, two common conventions help ease this problem: highlighting matching keywords and displaying snippets of text to show matches in context. The Solr index engine has supported both of these features for a long time, but VuFind has only provided robust support for them starting in version 1.1.

Activating Highlighting and Snippets in VuFind

As a VuFind administrator, if you want to take advantage of these new features, all you have to do is upgrade to VuFind 1.1 and they will be turned on by default. If you want to turn them off or adjust some of the behavior, you can make a few adjustments to your searches.ini file as described in the VuFind wiki. Unless you are interested in the technical workings behind the scenes, that is all you need to know. Have fun! Solr power users, VuFind developers and other interested techies, please read on….

Highlighting and Snippets at the Solr Level

Solr’s support for highlighting and snippets is straightforward. By means of some search parameters (set in the solrconfig.xml configuration file and/or as part of the search request), you tell Solr whether or not to apply highlighting, which fields to highlight, how to mark highlighted words, and so forth. When highlighting is requested, Solr adds a new section to its search response listing all of the highlighted phrases found in all of the documents in the search response. The highlighting information is completely separate from the main list of search results, so highlighting does not actually alter the main part of the Solr response — the details need to be merged in by the calling code.
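To make this concrete, here is a rough sketch (in Python rather than VuFind's PHP) of how a highlighting request might be assembled, along with the general shape of the response. The parameter names hl, hl.fl, hl.simple.pre, hl.simple.post and hl.fragsize are standard Solr parameters; the function name, the "[[" and "]]" markers and the sample data are made up purely for illustration.

```python
from urllib.parse import urlencode


def build_highlight_query(user_query, fields, pre, post, fragsize=100):
    """Build a Solr query string that asks for highlighting on some fields."""
    params = {
        "q": user_query,
        "wt": "json",
        "hl": "true",                 # turn highlighting on
        "hl.fl": ",".join(fields),    # which fields to highlight
        "hl.simple.pre": pre,         # text placed before each match
        "hl.simple.post": post,       # text placed after each match
        "hl.fragsize": str(fragsize), # approximate snippet length
    }
    return urlencode(params)


# Illustrative response shape (invented data): the highlighting section is
# keyed by document ID and sits next to, not inside, the list of documents.
SAMPLE_RESPONSE = {
    "response": {"docs": [{"id": "rec1", "title": "A History of Cats"}]},
    "highlighting": {"rec1": {"title": ["A History of [[Cats]]"]}},
}

query_string = build_highlight_query("cats", ["title", "author"], "[[", "]]")
```

Note that the document in "docs" is untouched; the marked-up title only appears under "highlighting", which is why the calling code has to do the merging.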

Problem #1: Marking Highlighted Text

One of the first problems that needs to be addressed is how to mark highlighted words in the Solr response. Solr provides hl.simple.pre and hl.simple.post parameters which can be used to specify text to mark the beginning and ending of highlighted words. The obvious first temptation is to simply stick some HTML in here — "<em>" and "</em>", for example. This can lead to pitfalls, however — if you are escaping your output, the HTML won’t make it through, and the end user will actually see the HTML code. If you are not escaping your output, then text between or around the emphasis tags may get misinterpreted as HTML, leading to garbled displays (never assume you won’t have angle brackets somewhere in your records!).

VuFind’s solution to this problem is fairly obvious — it uses markers that are extremely unlikely to show up in record text (“{{{{START_HILITE}}}}” and “{{{{END_HILITE}}}}”) and defines a special escaping routine used only for highlighted text. When displaying something that it knows has been highlighted, it first escapes any possible HTML entities, and THEN it replaces the highlighting markers with HTML code that achieves the actual highlighting logic. You can see the Smarty modifier that achieves this work here. Note that the Smarty code contains some extra logic for finding and highlighting words, since it is also designed for use by other modules of VuFind that are unable to rely on Solr’s highlighting capabilities — this logic is ignored when Solr results are being displayed.
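The escape-first, replace-second order is the important part, and it is easy to sketch. The following is a simplified Python illustration of the idea, not VuFind's actual Smarty modifier (which is written in PHP and does more). The function name and the "highlight" CSS class are assumptions of mine.

```python
import html

START_MARKER = "{{{{START_HILITE}}}}"
END_MARKER = "{{{{END_HILITE}}}}"


def render_highlighted(text):
    """Escape record text for HTML, then turn the markers into real tags."""
    # Step 1: escape everything, so stray angle brackets or ampersands in
    # the record cannot be interpreted as markup.
    escaped = html.escape(text)
    # Step 2: only now swap the (brace-only, unaffected by escaping) markers
    # for the HTML that does the highlighting. The class name is hypothetical.
    return (
        escaped.replace(START_MARKER, '<span class="highlight">')
        .replace(END_MARKER, "</span>")
    )
```

If the two steps were reversed, the highlighting tags themselves would be escaped and shown to the user as literal text.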

Problem #2: Merging Highlighting Data with Records

As mentioned above, Solr provides highlighting information completely separately from its search result list. This can be rather inconvenient since it requires code to look in two different places during record processing. The first temptation when encountering this problem is to write code that merges everything together, overwriting fields in the main response with highlighted versions found elsewhere in the response. However, as with many first temptations, that’s a bad idea. First of all, you will very likely lose data if you do this. In a multi-valued field, it is possible that only certain values will be highlighted and others omitted entirely. Also, unless the hl.fragsize parameter is set to 0, snippets will be truncated to only show a few words around the highlighted term. Additionally, data loss aside, it is often convenient to have both highlighted and non-highlighted versions of fields available; for example, if you want to create a link to a page about an author, you want to use the non-highlighted text for inclusion in the target URL, but you want to use the highlighted version to display the link text.

Again, VuFind works through these issues in a fairly straightforward way. For convenience, it does merge the highlighting data with the search results so that code doesn’t need to look in two completely separate arrays for information about each record. However, it doesn’t overwrite any fields; instead, it creates a fake “_highlighting” field within the body of the record and stores all of the highlighting details in there. Whenever VuFind displays a field that might be subject to highlighting, it looks in two places: first it checks the _highlighting array and displays properly processed, highlighted text if it finds any. If no highlighted version exists, it falls back to the standard, non-highlighted text. Admittedly, this adds a bit more complexity to the display templates, but it seems a reasonable price to pay to ensure data integrity. It also helps to remind template designers where they need to use the Smarty highlight modifier described above, greatly reducing the risk of any “{{{{START_HILITE}}}}” tags accidentally slipping through to the end user’s display.
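The merge-without-overwriting approach can be sketched in a few lines. This is a Python illustration of the idea only; VuFind's real code is PHP, and the function names here are hypothetical. It assumes the highlighting section of the Solr response is keyed by each document's "id" value.

```python
def attach_highlighting(docs, highlighting, id_field="id"):
    """Copy each record and tuck its highlighting data into _highlighting."""
    merged = []
    for doc in docs:
        record = dict(doc)  # shallow copy; original fields stay untouched
        record["_highlighting"] = highlighting.get(doc.get(id_field), {})
        merged.append(record)
    return merged


def display_value(record, field):
    """Return (text, is_highlighted), preferring the highlighted version."""
    highlighted = record.get("_highlighting", {}).get(field)
    if highlighted:
        return highlighted[0], True   # needs the special escaping routine
    return record.get(field, ""), False  # fall back to the plain field
```

Returning a flag alongside the text is one way to remind the display layer which values still contain highlighting markers and therefore need the special escaping step.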

Problem #3: Highlighted Text May Be Truncated

As discussed above, highlighted text may be truncated in some circumstances (by default, snippets are limited to about 100 characters). This is reasonable, since search results should be brief and easy to read. Indeed, even before it supported highlighting, VuFind already had code to trim down super-long titles in search results. The critical difference between the old title-trimming code and the new reliance on Solr snippets is that the old code always showed the beginning of a title, while Solr snippets occasionally come from the middle of a title, yielding strange-looking results. Setting the hl.fragsize parameter to 0 is an option, though that will lead to very long titles in search results. VuFind’s solution relies on another new Smarty modifier (modifier.addEllipsis.php) which compares highlighted text against non-highlighted text and adds ellipses on each end if truncation is detected. This may not be a perfect solution, but at least it adds a little more visual context to the truncated text.
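For the curious, the comparison can be approximated as follows. This Python sketch is my own simplification and not a port of modifier.addEllipsis.php; in particular, it assumes that an ellipsis is added only on the side where text is actually missing, and that a snippet which cannot be located in the full text is left alone.

```python
START_MARKER = "{{{{START_HILITE}}}}"
END_MARKER = "{{{{END_HILITE}}}}"


def add_ellipsis(highlighted, full_text):
    """Add "..." to whichever ends of a snippet were cut from full_text."""
    # Strip the markers so the snippet can be compared with the plain field.
    plain = highlighted.replace(START_MARKER, "").replace(END_MARKER, "")
    position = full_text.find(plain)
    if position == -1:
        return highlighted  # cannot line the two up; leave it unchanged
    prefix = "..." if position > 0 else ""
    suffix = "..." if position + len(plain) < len(full_text) else ""
    return prefix + highlighted + suffix
```

A snippet taken from the middle of a long title gains an ellipsis at both ends, while a snippet that covers the whole title comes back unchanged.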

There is one additional caveat that should be noted: multi-valued fields are still a problem. If a field contains five values and only two of them match search terms, then the highlighting data will only contain (at most) two values. VuFind does not currently contain any mechanisms for matching up partial highlighted results with longer lists of non-highlighted results. The problem is avoided in the simplest way possible: the highlighted fields currently used in VuFind’s search result templates (title and primary author) are single-valued. Multi-valued fields are only displayed as snippets (see below).

Problem #4: Displaying Snippets

As discussed above, there are certain Solr fields which VuFind will always display in search results: most importantly, title and author. However, keyword matches may fall outside of these displayed fields. For that reason, it is helpful to display snippets showing matches in other fields. Since there may be many snippets, and the search result listing should be kept reasonably brief, it makes sense to try to display just one snippet, preferably the most relevant one.

Snippet selection is handled by the IndexRecord record driver, the base class that handles display of all records retrieved from the Solr index. This class contains two arrays: $preferredSnippetFields, an array of fields that are very likely to have good snippet data and should be checked first, and $forbiddenSnippetFields, an array of fields with bad or redundant data that should never be considered for use as a snippet. By default, $preferredSnippetFields contains subject headings and table of contents entries, since these tend to offer valuable information, while $forbiddenSnippetFields contains author and title fields (unnecessary for snippets since they are always displayed elsewhere in the template), ID values (obviously uninformative) and the spelling field (a jumble of data duplicated from other fields, necessary for spell checking but misleading as a snippet). The getHighlightedSnippet method uses these arrays to pick a single best snippet, first checking the preferred fields and then taking the first available non-forbidden field if necessary. Since the method and its related arrays are all protected, it is possible to extend the IndexRecord class and create custom behavior as needed on a driver-by-driver basis.
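The selection logic described above boils down to two passes over the highlighting data. Here is a minimal Python sketch of that idea; the real getHighlightedSnippet method lives in VuFind's PHP record driver, and the field names in the two lists below are placeholders rather than VuFind's actual defaults.

```python
# Hypothetical field names, standing in for the driver's two arrays.
PREFERRED_SNIPPET_FIELDS = ["topic", "contents"]
FORBIDDEN_SNIPPET_FIELDS = ["author", "title", "id", "spelling"]


def pick_snippet(highlighting,
                 preferred=PREFERRED_SNIPPET_FIELDS,
                 forbidden=FORBIDDEN_SNIPPET_FIELDS):
    """Return (field, snippet) for the single best snippet, or None."""
    # Pass 1: fields that usually produce informative snippets.
    for field in preferred:
        values = highlighting.get(field)
        if values:
            return field, values[0]
    # Pass 2: the first remaining field that is not explicitly excluded.
    for field, values in highlighting.items():
        if field not in forbidden and values:
            return field, values[0]
    return None
```

Passing the two lists as arguments plays roughly the same role as subclassing the driver: a caller can swap in different preferences without touching the selection logic.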

One further detail helps make things more clear: some snippets make little sense out of context, so searches.ini contains a [Snippet_Captions] section where Solr fields can be assigned labels that will be used as captions in front of snippets. Snippets for fields not listed in this section will display as stand-alone, uncaptioned lines in the search results.
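The caption lookup is simple enough to show in a couple of lines. In this Python sketch, the dictionary stands in for the [Snippet_Captions] section of searches.ini; the function name, the sample field names and the "Caption: snippet" layout are assumptions for illustration only.

```python
def caption_snippet(field, snippet, captions):
    """Prefix a snippet with its configured caption, if one exists."""
    caption = captions.get(field)
    # Fields without a configured caption are shown as stand-alone lines.
    return f"{caption}: {snippet}" if caption else snippet


# Hypothetical configuration: Solr field name mapped to a display label.
example_captions = {"contents": "Table of Contents", "topic": "Subjects"}
```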

Conclusions

Highlighting and snippets really aren’t too difficult to work with, but as with almost anything, they turn out to be a little more complicated than expected once you look at all of the details. I hope this post has helped point out the most obvious pitfalls and explain the reasoning behind VuFind’s implementation. There is still plenty more that could be done — some of the behavior could be made even smarter, and more of Solr’s power could be exposed through VuFind configuration settings. If you have ideas or questions, please feel free to share them as comments on this post or via the vufind-tech mailing list.

The Digital Library is proud to announce a new partnership between Villanova University and the Sisters of St. Basil the Great.

After the legal agreement was signed at the beginning of 2011, I had the pleasure of presenting to the Sisters on the project: to scan documents and other materials from their history, including realia (three-dimensional objects from real life) for the purposes of scholarship and digital preservation. These items are vital to an understanding of a major aspect of the life of Ukrainian Catholics in the Philadelphia region.

We anticipate that this project will foster a greater understanding of this part of Church history and a wider appreciation of the contributions that this particular order of women religious, and the Eastern Catholic tradition generally, have made and continue to make to both Catholic heritage and local history. Documenting and disseminating primary source material from particular ethnic communities for future generations of scholars in Catholic studies and allied disciplines widens the scope of the Catholica which we have undertaken to preserve. This material matters not only because it is relatively unique, but also because it adds to the mosaic of materials from a variety of backgrounds which would otherwise remain in relative obscurity.

Since I often read and enjoy Jonathan Rochkind’s blog, where he goes into great detail about the complexities of life as a library programmer, I was pleased when he asked me to write a bit about some of the new features in VuFind 1.1. That post will be coming up shortly. In the meantime, thank you, Jonathan, for prompting the creation of this blog. I hope this will become a useful resource for keeping up with the latest developments from Villanova’s library technology team and that the information here will be interesting and informative whether or not you use our software. Stay tuned for periodic posts about how we have approached various problems during the course of our work on VuFind, the forthcoming VuDL digital library package, and other library-related technologies.