Sematext

An Interview with Otis Gospodnetic

When I started my Beyond Search blog in January 2008, one of the first experts to comment and challenge my viewpoint was Otis Gospodnetic. Since that time, I have learned that Mr. Gospodnetic is one of the individuals who is willing to share his ideas about search in general and open source search technology in particular. I spoke with him briefly this summer about the outlook for Lucene/Solr. I followed up with him after Labor Day and continued our discussion about information retrieval. The full text of my interview with Mr. Gospodnetic appears below.

Where did the idea for Sematext originate?

On a small island in the Adriatic Sea, actually. Not terribly coincidentally, I happen to be on that same island as we do this interview. I published my first technical book, Lucene in Action, at the end of 2004. That summer in 2004 I was sitting outside on the terrace, sipping white wine, staring at the blue sea, and wondering about what publishing a book on a highly-visible, high-demand tool like Lucene might bring. Daydreaming...

The answer was obvious - it will bring more demand. And so I created a small business entity imaginatively called Lucene Consulting, which then morphed into a more cleverly named Sematext in 2007. Prior to that, several high-profile people contacted me with ideas about starting a “Redhat for Lucene”. While I always believed in that model, I didn’t like the idea of getting in bed with venture capitalists.

I had worked for enough VC-backed startups to learn what it takes to build a self-sustaining business without involvement of VCs. And so today, Sematext is a self-funded (or I should say customer-funded) and self-sustaining business with customers and employees on several continents.

When did you become interested in text and content processing?

There was always something about information gathering and processing, to use a general term, that attracted me. Back in college I remember building simple web crawlers, learning how full-text search engines work, and being curious about collaborative intelligent agents. Everything I’ve done since I left school involved some aspect of working with “information” and search.

Sure. There are additional free and non-free products in the works, too, plus our search expertise is expanding beyond core Lucene/Solr - for example, we recently finished a project that involved the use of Elastic Search (think of ES as a Solr-like piece of software or check our blog for more info) in an eDiscovery domain, involving one billion document indices. In addition, we also have a number of small internal reusable tools that we use in our engagements, but don’t necessarily advertise and sell directly.

Without divulging your firm's methods, will you characterize a typical use case for your firm's capabilities? Where does the open source technology come into play?

We are all about using open source technology to help organizations smart enough to see they no longer have to pay obscene amounts of licensing fees for products that are sometimes inferior and typically less flexible due to their closed nature. But it’s not only about money and the total cost of ownership. It’s also about having one’s hands untied, about being free to customize how something works. About adding a feature (or having someone add it on your behalf) this week and having it in production this month vs. waiting for the next release of the commercial product to see if maybe this new feature that you really need made it into the release. It’s about eliminating the need for expensive, months-long sales courting. You’ve heard all this before.

Each company puts a different emphasis on certain points. Please, continue.

Organizations around the world come to us because we have extensive experience working with search and data analytics technologies, such as Lucene, Solr, Hadoop, etc. Their needs fall into some broad categories. For example, there are organizations that need assistance creating the search infrastructure. Typically this means creating a horizontally scalable, high-performance, fault-tolerant backend. We do this literally daily. Another bucket contains those developers or engineers who require urgent assistance troubleshooting performance or other issues. Another group of clients wants to make use of one of our products.

Of course, some of our consulting customers then realize that getting the tech support subscription gives them an even better deal financially, especially since those tech support packages also happen to include discounts for our products, which just so happen to go hand in hand with core search functionality.

What are the benefits to a commercial organization or a government agency when working with your firm?

This is one of the things that I think makes Sematext an attractive alternative to (large) commercial search vendors. When we receive an inquiry, we don’t waste time - we typically respond the same day and engage immediately, thus allowing the inquiring organization to move as fast as they can, not as fast as we “let them”. Our terms of engagement are clear, our pricing model simple and affordable. Sales cycles as such don’t really exist. After the initial inquiry and our response, we quickly learn about the needs of the potential client through Q&A. We follow that with a phone call (we still use them!) to get all the nitty gritty details, which lets us move into the minimal paperwork, after which we are ready to start. Our presence in several key time zones gives us flexibility when working with customers around the world.

One challenge to those involved with squeezing useful elements from large volumes of content is the volume of content and the rate of change in existing content objects. What does your firm provide to customers to help them deal with the volume problem?

That's a good question. We primarily provide our knowledge and expertise in dealing with volume “problems”, be that data volume or request (search/query) volume. In addition, we have experience with tools that are designed to work well in high data (change) volume environments. For example, for our search customers we regularly design highly distributed search backends on top of Lucene or Solr or other search solutions that involve index sharding and distributed search, index replication, or both. While we focus on Lucene and Solr on the search side of our business, we are constantly looking at and evaluating new search technologies. In a recent engagement we looked beyond Lucene/Solr and, after evaluating several other solutions (although all based on Lucene!), decided to go with a solution that turned out to be more appropriate for the customer. As a matter of fact, I submitted a proposal for a talk on this topic for the Lucene Revolution conference, but ... it looks like this guy Stephen Arnold, the Content Chair of the conference, didn’t like that proposal enough. ;)
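As a rough illustration of the index sharding and distributed search mentioned above, here is a minimal Python sketch. It is not Sematext's design and not how Lucene or Solr implement it: the `Shard` and `ShardedIndex` classes, the CRC32 routing, and the term-frequency scoring are all assumptions made for the example. Replication is left out entirely.

```python
import heapq
import zlib
from collections import Counter
from typing import Dict, List, Tuple


class Shard:
    """One partition of the index (hypothetical; stands in for a search node)."""

    def __init__(self) -> None:
        self.docs: Dict[str, Counter] = {}

    def add(self, doc_id: str, text: str) -> None:
        # Store per-document term frequencies.
        self.docs[doc_id] = Counter(text.lower().split())

    def search(self, term: str, k: int) -> List[Tuple[int, str]]:
        # Score = raw term frequency (a deliberately naive assumption).
        term = term.lower()
        scored = [(tf[term], doc_id) for doc_id, tf in self.docs.items() if tf[term] > 0]
        return heapq.nlargest(k, scored)


class ShardedIndex:
    """Routes documents to shards by hash and merges per-shard results."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [Shard() for _ in range(num_shards)]

    def _route(self, doc_id: str) -> Shard:
        # CRC32 gives a stable hash, so a document always lands on the same shard.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def add(self, doc_id: str, text: str) -> None:
        self._route(doc_id).add(doc_id, text)

    def search(self, term: str, k: int = 10) -> List[Tuple[int, str]]:
        # Scatter: ask every shard for its local top k.
        partial: List[Tuple[int, str]] = []
        for shard in self.shards:
            partial.extend(shard.search(term, k))
        # Gather: merge the partial lists into a global top k.
        return heapq.nlargest(k, partial)


if __name__ == "__main__":
    index = ShardedIndex(num_shards=3)
    index.add("a", "solr solr lucene")
    index.add("b", "lucene hadoop")
    index.add("c", "solr")
    print(index.search("solr", k=2))
```

Each shard only needs to return its own top k, because the global top k is always contained in the union of the per-shard top k lists. In a real deployment the loop over shards would be network calls made in parallel.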

Well, following directions is one of my competencies. Let's talk about moving data from point A to point B; that is, information enters a system but it must be made available to an individual who needs that information or at least must know about the information. What does your firm offer licensees to address this issue?

In our consulting engagements we’ve made use of some of our internal pub-sub tools to solve this problem. These tools allowed us to build a system that lets an individual maintain multiple saved queries that, when new information enters the system (a searchable index in this case), act as filters against this new information. When the system catches a bit of information of interest with those stored filters, the filter owner(s) can be notified using whichever notification mechanism was plugged into the system.
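The saved-query idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sematext's internal tooling: the names `SavedQuery`, `AlertSystem`, and `tokenize` are hypothetical, queries use simple all-terms-must-match semantics, and the notification mechanism is a plain callback standing in for whatever would be plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase, split on whitespace, and strip simple punctuation."""
    return {tok.strip(".,;:!?\"'()").lower() for tok in text.split()} - {""}


@dataclass
class SavedQuery:
    owner: str
    terms: Set[str]  # assumption: every term must appear in the document

    def matches(self, doc_tokens: Set[str]) -> bool:
        return self.terms <= doc_tokens


@dataclass
class AlertSystem:
    # Pluggable notification mechanism: called with (owner, document text).
    notify: Callable[[str, str], None]
    queries: List[SavedQuery] = field(default_factory=list)

    def save_query(self, owner: str, query_text: str) -> None:
        self.queries.append(SavedQuery(owner, tokenize(query_text)))

    def on_new_document(self, doc_text: str) -> List[str]:
        """Run every saved query as a filter against the incoming document."""
        tokens = tokenize(doc_text)
        notified = []
        for query in self.queries:
            if query.matches(tokens):
                self.notify(query.owner, doc_text)
                notified.append(query.owner)
        return notified


if __name__ == "__main__":
    system = AlertSystem(notify=lambda owner, doc: print(f"notify {owner}: {doc}"))
    system.save_query("alice", "lucene sharding")
    system.save_query("bob", "hadoop")
    system.on_new_document("New post on Lucene index sharding.")
```

The point of the design is the inversion of ordinary search: the queries are stored and each new document is run against them, rather than storing documents and running each new query.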

Visualization has been a great addition to briefings. On the other hand, visualization and other graphic eye candy can be a problem for those in stressful operational situations. What's your firm's approach to presenting "outputs"?

Visualizations are great when used well, but are easy to misuse and overuse. Once in a while a customer will ask us about visualizations, alternative or additional representations of search results (often their clustering). Our typical response is the previous sentence, more or less. In addition, while a picture is said to be worth a thousand words, if that picture doesn’t fit on your screen, you might be better off sticking with plain text. Sometimes simple text has a better space-to-information ratio. Plus, think about people with mobile devices and their (so far) small screens.

I am on the fence about the merging of retrieval within other applications. What's your take on the "new" method which some people describe as "search enabled applications"?

I think it’s just a matter of getting used to the fact that search is just another facet of lots of applications today, just another way to get from the user interface to the desired information nugget. Just as you need an application to sit on top of some database if it needs access to lots of different stored records, you also need to put that application on top of some good (full-text) search servers or libraries if you want to be able to quickly search through lots of text content. If databases were good at providing full-text search functionality, then applications could keep riding just the database. But (relational) databases are typically bad at full-text search, and people are finally learning they shouldn’t use them for that (despite database vendors claiming otherwise). Of course, just as people are learning this, a new breed of non-relational databases has been gaining in popularity for the last few years. These databases tend to have better support for full-text search, though that support is commonly provided by simply integrating the database with an existing and proven search library like Lucene at the lower level, so that to the application using this database, the search functionality appears to come from the database itself.

There seems to be a popular perception that the world will be doing computing via iPad devices and mobile phones. My concern is that serious computing infrastructures are needed and that users are "cut off" from access to more robust systems. How does Sematext see the computing world over the next 12 to 18 months? What products / services will you be focusing on to deliver on your vision?

More and more data is coming our way. Some of it needs to be searched. Some of it needs to be used for improving search, say using machine learning algorithms and new relevance models. Some of it needs specific analysis. Without getting into details, I’ll say that Sematext is in a good position to play with very big data from both the search angle and the analytics angle.

Put on your wizard hat. What are the three most significant technologies that you see affecting your business? How will your company respond?

Large-scale data processing (think Lucene/Solr and Hadoop family of products), distributed everything (think Solr, Nutch, Hadoop, HBase...), learning from data (think machine learning, Mahout...). As I said earlier, Sematext is in a good position - we have experience in all of these areas.