Search on djangosnippets.org

Users of djangosnippets.org may have noticed the
addition of a few search-related features over the past several months. I'd
like to highlight some of the additions that have been made and show how you
can implement similar functionality on your sites. All of djangosnippets'
search leans on Apache Solr, a powerful search
engine built on top of Apache Lucene. Haystack is the search
solution for Django apps - it provides a querying interface similar to Django's
ORM, handles indexing your models for you, and supports advanced features like
"more-like-this" and faceting.

Getting set up (angle brackets, anyone?)

I've actually written another post
on setting up multi-core Solr on Ubuntu 10.04. I got a bit of flak for using
tomcat6 as the server - you can definitely go with jetty
instead. Jetty is bundled with Solr; check out the examples/README.txt to get
started quickly. You might find the following links useful:

When setting up search with haystack, there are two important configuration
files to be aware of:

schema.xml

solrconfig.xml

schema.xml

The Solr schema is only superficially analogous to a database schema (if your
database were just one big freaking table). It does a whole lot more than a
database schema, allowing you to configure how individual fields are tokenized,
filtered, stored, and searched. There is a high degree of configurability, so
if your needs go beyond a basic "site search" I'd recommend the book Solr 1.4 Enterprise Search Server -
I'm only 5 chapters in and it's already pretty much blown my mind. Luckily,
haystack will generate this file automatically,
allowing you to get up and running quickly.

solrconfig.xml

I have not gone too deep into this file, but it is where you can configure
things like caching, more-like-this support, spell check, and highlighting. It
also gives you a whole bunch of knobs for configuring the inner workings of the
indexing and querying facilities.

A final word on getting Solr running

I am still very much a n00b when it comes to Search and am probably doing more
than a few things wrong. Any helpful suggestions would be appreciated!

(In fact, the search engine for djangosnippets is running on a 10-year-old
Pentium III laptop. The three hours last week where search was down? I was
rearranging my room.)

Site Search

The first search-related feature I'll discuss is the site search. The first
step was getting haystack installed and creating a SearchIndex for the snippets.
The SearchIndex usually mirrors the models.py to some extent, although if you
plan on indexing more than a couple models you may want to pick some field-naming
conventions to keep the number of different fields in your Solr index small.

There's a lot of stuff in there, but the two more interesting bits are the first
and last fields. The text field is the default search field and is generated
by rendering the search/indexes/cab/snippet_text.txt template. Peeking at
the schema.xml, this field is getting tokenized and filtered:

This field is the heart of the index and is queried whenever a field is not
explicitly specified. Check out Analyzers, tokenizers and filters
if you're interested in reading up on what these various bits do.
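To make "tokenized and filtered" a bit more concrete, here is a rough pure-Python sketch of an analysis chain. It is not Solr code, and the stop-word list and function names are made up for illustration; Solr's real tokenizers and filters are configured in schema.xml and do considerably more.

```python
import re

# Hypothetical stop-word list; Solr reads its own from a configured file.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokenize(text):
    """Split text into tokens on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]


def lowercase_filter(tokens):
    """Lowercase every token so that 'Django' and 'django' match."""
    return [tok.lower() for tok in tokens]


def stop_filter(tokens):
    """Drop very common words that add little to a search."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def analyze(text):
    """Run the tokenizer first, then each filter in order."""
    return stop_filter(lowercase_filter(tokenize(text)))


print(analyze("The Admin of a Django site"))
# prints ['admin', 'django', 'site']
```

The same chain is normally applied to the query as well as the indexed text, which is why a search for "django" can match a snippet titled "Django".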

The last field is unique because, as you can see in the field definition, it is
setting indexed=False. This yields the following line in the autogenerated
schema.xml:

Because the field is not indexed it cannot be queried directly, but it will be
returned as a part of the search results, effectively allowing me to save a database
query when generating a link.
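The difference between "stored" and "indexed" can be sketched with a toy in-memory index. This is only an illustration of the idea, not how Solr is implemented; the ToyIndex class, its field names, and the example URL are all hypothetical.

```python
from collections import defaultdict


class ToyIndex:
    """A tiny in-memory index with separate 'indexed' and 'stored' fields."""

    def __init__(self, indexed_fields, stored_fields):
        self.indexed_fields = set(indexed_fields)
        self.stored_fields = set(stored_fields)
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.documents = {}               # doc id -> stored field values

    def add(self, doc_id, doc):
        # Only indexed fields get entries in the postings table.
        for field, value in doc.items():
            if field in self.indexed_fields:
                for term in str(value).lower().split():
                    self.postings[(field, term)].add(doc_id)
        # Only stored fields are kept around to be returned with results.
        self.documents[doc_id] = {
            field: value for field, value in doc.items()
            if field in self.stored_fields
        }

    def search(self, field, term):
        ids = sorted(self.postings.get((field, term.lower()), set()))
        return [self.documents[doc_id] for doc_id in ids]


index = ToyIndex(indexed_fields=["title"], stored_fields=["title", "url"])
index.add(1, {"title": "Ajax form helper", "url": "/snippets/1/"})

print(index.search("title", "ajax"))        # the hit carries its stored url
print(index.search("url", "/snippets/1/"))  # prints [] because url is not indexed
```

The url comes back with every hit, so the template can build a link straight from the result, but a search against url itself finds nothing.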

Indexing and Searching

Once the schema and search_indexes.py are in place, I can index all the snippets
in the database by running django-admin.py rebuild_index. When I want to
update the index, I run django-admin.py update_index --age=[age in hours].
To get closer to real-time results try the Real-Time SearchIndex
that comes with haystack.

A basic search view can lean on haystack's default. Here is the line from my
urlconf:

url(r'^search/$', 'haystack.views.basic_search', name='cab_search'),

Advanced Search

To get advanced search going, I subclassed SearchForm, added the fields I needed
and then basically did a shitload of filtering.

    class AdvancedSearchForm(SearchForm):
        language = forms.ModelChoiceField(queryset=Language.objects.all(), required=False)
        django_version = forms.MultipleChoiceField(choices=DJANGO_VERSIONS, required=False)
        minimum_pub_date = forms.DateTimeField(widget=admin.widgets.AdminDateWidget, required=False)
        minimum_bookmark_count = forms.IntegerField(required=False)
        minimum_rating_score = forms.IntegerField(required=False)

        def search(self):
            # First, store the SearchQuerySet received from other processing.
            sqs = super(AdvancedSearchForm, self).search()

            if self.cleaned_data['language']:
                sqs = sqs.filter(language=self.cleaned_data['language'].name)

            if self.cleaned_data['django_version']:
                sqs = sqs.filter(django_version__in=self.cleaned_data['django_version'])

            if self.cleaned_data['minimum_pub_date']:
                sqs = sqs.filter(pub_date__gte=self.cleaned_data['minimum_pub_date'])

            if self.cleaned_data['minimum_bookmark_count']:
                sqs = sqs.filter(bookmark_count__gte=self.cleaned_data['minimum_bookmark_count'])

            if self.cleaned_data['minimum_rating_score']:
                sqs = sqs.filter(rating_score__gte=self.cleaned_data['minimum_rating_score'])

            return sqs

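More Like This

Haystack exposes more-like-this through SearchQuerySet.more_like_this(), and it
also ships with a more_like_this templatetag that offers a good deal more options.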

This definitely qualifies as low-hanging fruit once you've got the initial pieces
in place and can really add a lot of value to your site. One of the problems
I often have with djangosnippets is that I get a lot of old content that's been
upvoted to hell but there's actually a newer, cooler version out there.
More-like-this (MLT) is pretty good at finding these newer snippets.
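For a rough feel of what "more like this" means, here is a toy sketch that ranks documents by how many terms they share with a target document. It is a deliberate simplification and an assumption on my part: Solr's MoreLikeThis picks out the most interesting terms of a document and queries with them rather than comparing whole term sets, and the sample documents below are invented.

```python
def terms(text):
    """Very crude analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared terms divided by all distinct terms across both sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def more_like_this(target_id, docs, limit=3):
    """Return ids of the documents most similar to docs[target_id]."""
    target = terms(docs[target_id])
    scored = [
        (jaccard(target, terms(text)), doc_id)
        for doc_id, text in docs.items()
        if doc_id != target_id
    ]
    # Drop documents with nothing in common, then rank best first.
    scored = [(score, doc_id) for score, doc_id in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:limit]]


# Hypothetical snippet descriptions keyed by id.
docs = {
    1: "ajax form validation helper",
    2: "ajax form validation with newer api",
    3: "pagination template tag",
}

print(more_like_this(1, docs))
# prints [2]
```

Snippet 2 shares three terms with snippet 1, so it surfaces as related even though it has far fewer votes, while the unrelated snippet 3 drops out.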

Autocomplete

Arguably, the feature I'm most excited about is Solr's ability to do
autocompletion. Out of the box it's possible to do wildcard searches but this
approach does not scale. It's better to use the NGram filter, which I've
wrapped up as a custom fieldType in my schema.xml:
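To show why n-grams help here (this is not the Solr field type itself), the sketch below indexes every prefix of every word up front, so an autocomplete lookup becomes a plain exact match instead of a wildcard scan. It assumes edge n-grams, meaning prefixes only; the gram sizes, function names, and sample titles are all hypothetical.

```python
from collections import defaultdict


def edge_ngrams(token, min_size=2, max_size=15):
    """Return the prefixes of token from min_size up to max_size characters."""
    token = token.lower()
    upper = min(len(token), max_size)
    return [token[:n] for n in range(min_size, upper + 1)]


def build_index(titles):
    """Map every edge n-gram of every word to the titles containing it."""
    index = defaultdict(set)
    for title in titles:
        for word in title.split():
            for gram in edge_ngrams(word):
                index[gram].add(title)
    return index


def autocomplete(index, prefix):
    """Look up a typed prefix with a single exact-match dictionary access."""
    return sorted(index.get(prefix.lower(), set()))


titles = ["Django admin tricks", "Date widget", "Dynamic forms"]
index = build_index(titles)

print(edge_ngrams("Django"))
# prints ['dj', 'dja', 'djan', 'djang', 'django']
print(autocomplete(index, "dj"))
# prints ['Django admin tricks']
```

The cost moves to index time and index size, which is usually a fine trade for a lookup that has to answer on every keystroke.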

Conclusion

Hope you found this post informative! There are a ton of interesting things that
Solr can do and Haystack provides a nice wrapper around the most common features.
As always, any comments, feedback, suggestions, errata, etc. are appreciated.