httpd-docs mailing list archives

Thank you for the responses fellas.
Solr is very fast, extremely flexible, can be deployed in a highly
available manner, does not have complex requirements, is easy to
configure and maintain, and you guys own it. I am surprised that you
are not already using it for all of the ASF projects' documentation,
and I doubt Google's CSE would ever do the job as well as Solr could
with just a little elbow grease behind it. You can make it whatever
you want. It is, after all, open source, unlike the Google option.
Why tie yourself to a vendor when you can keep complete control
yourself, do a better job with very few resources, and never have to
worry about the issues that come with building a dependency on
something outside of your control?
You have requirements outlined below for "critical services"; some of
them would obviously not apply if the function were outsourced, but
some would, or at least should. Does Google provide an SLA for their
free CSE? Yes, I know how silly that sounds, but work with me here.
It's tough to compete with Google, so I'm throwing out everything I
have. ;)
I'll take a stab at approximate answers to the questions below. If it
ever comes to the point that you need the information with greater
precision, I will be glad to help with that.
Justin Erenkrantz wrote:
> On Oct 8, 2007 11:51 AM, Vincent Bray <noodlet@gmail.com> wrote:
>
>> I'm very much in favour of seeing how far we can take Solr as the
>> search mechanism for the httpd docs.
>>
>
> What are the production requirements for Solr? IOW, what do we need
> to run on www.apache.org to make this happen? How much disk space?
> How much RAM? We do not currently run Java on our main web servers,
> so running and maintaining it would have to be sorted out. I don't
> know if the Solr guys are even interested in helping us maintain a
> local search engine. (Previously, the Perl guys tried and gave up.)
>
>
Solr does not have to run on the web server hosting the search page.
It can live on any server reachable by your web server that meets the
requirements to run it. The import/transform scripts also do not have
to run on the same server as Solr, since they submit documents to be
indexed via a web request to Solr.
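To make the decoupling concrete, here is a minimal sketch of what the
import side does over HTTP. The host name, port, and file name are
assumptions for illustration; only the /solr/update endpoint and the
<add>/<commit> XML format come from Solr itself.

```shell
#!/bin/sh
# Hypothetical Solr update URL; adjust to wherever Solr is deployed.
SOLR_UPDATE="http://solr-host:8983/solr/update"

post_xml() {
    # POST a Solr XML body (an <add> document or a <commit/>).
    curl -s -H 'Content-Type: text/xml' --data-binary "$1" "$SOLR_UPDATE"
}

# Index one transformed doc (file name is made up), then commit so
# searches can see it. Failures are reported but do not abort.
post_xml @mod_rewrite.solr.xml || echo "add failed"
post_xml '<commit/>'           || echo "commit failed"
```

Because this is all plain HTTP, the transform box and the Solr box
only need network reachability, nothing more.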
Solr needs Java 1.5 and an application server that supports the Servlet
2.4 standard. I used Jetty for my demo.
The import/transform script needs Perl with XML::XPath and
XML::XPath::XMLParser, an XSLT tool such as Xalan or xsltproc, curl or
the curl Perl modules, and Subversion to check out a copy of
httpd-docs and the build stuff. There needs to be enough space to
check out the docs and build files, plus temporary space for the
transforms: ~80 meg.
The current Solr index with only the English version of the httpd
documents is 1.7 meg. Extrapolate that to account for the number of
supported languages (5 or 6?) and we will call it 15 meg.
The Solr application itself comes in under 12 meg, including the
source and Jetty. I am not sure what Tomcat or other options would
require, but I will find out.
Sum that up, and disk-space-wise we have around 30 meg for a
full-language Solr install plus an additional 80 meg, either local or
decoupled, for the documents, build files, and temporary space for the
transforms. 110 meg is not so bad. We'll say 200 meg just to be safe
and allow for some terse logging.
Currently running in Jetty and with a nice full query cache but idle,
looks like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8947 arreyder 25 0 830m 102m 18m S 0 5.1 0:07.11 java
1 gig of RAM should be comfortable, but the more the better for the
sake of query caching.
I have not loaded it up yet to see how it looks during concurrent
successive connections but am working on a test script to do just that.
The test script will also be a great tool for preloading the cache. I
will do this and report the results if anyone is still interested.
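The test script I have in mind is nothing fancy; a rough sketch is
below. The Solr URL and the handful of query terms are assumptions
(a real run would pull terms from a larger word list), and the same
loop doubles as a cache pre-warmer.

```shell
#!/bin/sh
# Hypothetical local Solr select URL.
SOLR="http://localhost:8983/solr/select"

build_query() {
    # Single-word terms only, so no URL encoding is needed here.
    echo "${SOLR}?q=$1&rows=10"
}

# Fire a query per term; ignore failures so the loop completes even
# if Solr is down, but note which requests did not get through.
for term in rewrite alias proxy ssl vhost; do
    url=$(build_query "$term")
    curl -s -o /dev/null "$url" || echo "request failed: $url"
done
```

Running several copies of this concurrently gives a crude concurrency
test; running it once after startup warms the query cache.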
> The ASF infrastructure team has a checklist of things that must be
> satisfied before adding any new 'critical services' (which this falls
> under). See below for the current list.
>
> So, I sort of think that just filling out a special account for a
> 'custom search engine' would be a *lot* less work. =) -- justin
>
>
You may be confusing work with fun and most of the fun has already been
had by me getting it this far. Perhaps this work that you were speaking
of is a hint that you would be the first to volunteer to help in getting
my Solr implementation formally going? ;) I doubt you got to where you
are by passing on good things because they required a little "work" and
aren't you guys under some sort of eat-your-own-dog-food directive? If
you do not use it, who will?
> ---
> This provides a list of requirements and doctrines for web applications
> that wish to be deployed on the Apache infrastructure. It is intended to
> help address many of the recurring issues we see with deployment and
> maintenance of applications.
>
> Definition of 'system': Any web application or site which will receive
> traffic from public users in any manner.
>
> Definition of 'critical systems': Any web application or site which runs
> under www.apache.org, or is expected to receive a significant portion of
> traffic.
>
> 1) All systems must be generally secure and robust. In cases of failure,
> they should not damage the entire machine.
>
>
Since Solr is a service that is typically only called by another
service, it enjoys the security advantage of being at least once
removed from the end user and never directly accessed by them. You
could certainly provide rate limiting and other methods to keep load
from ever reaching the point where it could impact other co-located
services. No real security or load-management challenges here.
> 2) All systems must provide reliable backups, at least once a day, with
> preference to incremental, real time or <1 hour snapshots.
>
>
Solr provides an easy method for off-site replication via snapshots,
which could be used for backups. It should also be mentioned that on
my low-end Core 2 Duo with 2 gig of RAM, it only takes around 70
seconds to transform and index the complete English httpd documents
from scratch. As long as you have the documents available for checkout
and the scripts to do it, you are never far from a freshly created
index.
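A backup could be as simple as a cron job around the snapshot script
that ships in Solr's script collection. The paths, host, and schedule
below are assumptions, just to show the shape of it:

```
# Hypothetical crontab fragment: take a snapshot nightly, then copy
# the newest snapshot directory off the box.
30 2 * * * /home/solr/bin/snapshooter
45 2 * * * rsync -a /home/solr/data/snapshot.* backup-host:/backups/solr/
```

Since snapshots are hard-link based, taking them is cheap, and the
rsync only moves changed index segments.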
> 3) All systems must be maintainable by multiple active members of the
> infrastructure team.
>
>
I am not a member, but I am still happy to help. Anyone else like to
give me a hand? :)
> 4) All systems must come with a 'runbook' describing what to do in event
> of failures, reboots, etc. (If someone who has root needs to reboot the
> box, what do they need to pay attention to?)
>
>
Again, no real challenge here; I'd be happy to throw this together.
> 5) All systems must provide at least minimal monitoring via Nagios.
>
I'll write a plugin for this, or we can just use the check_http one
already there. It depends on how deep you want the service check to go.
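For the shallow version, something like the following Nagios service
definition would do. The host name and the assumption that the stock
check_http command accepts arguments this way are mine; the ping URL
assumes a ping handler is configured in solrconfig.xml:

```
# Hypothetical Nagios service definition for a basic Solr liveness check.
define service {
    use                  generic-service
    host_name            solr-box
    service_description  Solr search
    check_command        check_http!-p 8983 -u /solr/admin/ping
}
```

A deeper check could run an actual query and assert on the response
body instead.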
> 6) All systems must be restorable and relocatable to other machines
> without significant pain.
>
>
Replicating this configuration and packaging it is trivial. As I said
before, even if we have to re-index the docs from scratch, it only
takes about a minute. I'll build a package and a deployment script.
> 7) All systems must have some kind of critical mass. In general we do
> not want to host one offs of any system.
>
>
"If you build it they will come." Did I mention I am from Iowa? We
have this baseball diamond in a cornfield that you really should come see.
> 8) All system configuration files must be checked into Subversion.
>
>
Delighted to check in all 5 configuration files/scripts.
> 9) All system source must either be checked into Subversion, be at a
> well-known public location, or is provided by the base OS. (Hosting
> binary-only webapps is a non-starter.)
>
>
Since Solr is an Apache project I am guessing you have this part already
under control.
> 10) All systems, prior to consideration of deployment, must provide a
> detailed performance impact analysis (bandwidth and CPU). How are
> techniques like HTTP caching used? Lack of HTTP caching was MoinMoin's
> initial PITA.
>
>
It does cache queries, and with mod_deflate out front, bandwidth
should be minimal; it's just text. I still need to get the details on
CPU load and see how well it scales on a single machine. I'm working
on it.
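The "out front" part would just be the existing httpd acting as a
reverse proxy with compression. A sketch of the relevant config, with
the backend host/port and URL path being my assumptions:

```
# Hypothetical httpd front-end fragment (needs mod_proxy, mod_deflate).
ProxyPass        /search http://solr-host:8983/solr/select
ProxyPassReverse /search http://solr-host:8983/solr/select
AddOutputFilterByType DEFLATE text/html text/plain text/xml
```

This also keeps Solr itself off the public network, which ties back
to the security point above.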
> 11) All systems must have clearly articulated, defined, and recorded
> dependencies.
>
>
This is a very short list that I have, for the most part, already
covered: Perl with XML::XPath and XML::XPath::XMLParser, an XSLT tool
such as Xalan or xsltproc, curl or the curl Perl modules, and
Subversion to check out a copy of httpd-docs and the build stuff. For
Solr itself: Java 1.5 and an application server that supports the
Servlet 2.4 standard.
> 12) All critical systems must be replicated across multiple machines,
> with preference to cross-atlantic replication.
>
>
Not a problem. Solr has a multi server replication method using
snapshots, rsync and hard links.
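Concretely, that means the snapshot scripts shipped with Solr. The
script names are Solr's; the paths, the slave-side schedule, and the
assumption that master host/port live in the scripts' config file are
mine:

```
# Hypothetical replication sketch using Solr's distribution scripts.
# On the master, after each commit/optimize:
#   bin/snapshooter
# On each slave, from cron, pull and install the latest snapshot:
*/15 * * * * /home/solr/bin/snappuller && /home/solr/bin/snapinstaller
```

Cross-Atlantic replication is just a matter of where the slaves sit;
rsync over the wire moves only changed index segments.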
> 13) All systems must have single command operations to start, restart
> and stop the system. Support for init scripts used by the base
> operating system is preferred.
>
>
You mean I need to do more than "nohup java -jar solr.jar &"? Sheesh!
Seriously, since you are probably not planning on running this in
Jetty, whatever it lands on (Tomcat or otherwise) probably already has
that requirement covered. If not, I'm on it.
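If we do end up needing our own, the wrapper is small. Everything
here (install path, jar name, pid file location) is an assumption;
it just shows the start/stop/restart shape:

```shell
#!/bin/sh
# Minimal init-style wrapper sketch for a standalone Jetty+Solr.
SOLR_DIR=/usr/local/solr/example
PIDFILE=/var/run/solr.pid

start() {
    # Launch detached and record the pid so stop() can find it.
    (cd "$SOLR_DIR" && nohup java -jar start.jar >/dev/null 2>&1 &
     echo $! > "$PIDFILE")
}

stop() {
    # Kill the recorded pid, if any, and clean up the pid file.
    if [ -f "$PIDFILE" ]; then
        kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
    fi
}

case "$1" in
    start)   start ;;
    stop)    stop ;;
    restart) stop; start ;;
    *)       echo "usage: $0 {start|stop|restart}" ;;
esac
```

Dropping something like this into the OS's init mechanism would
satisfy the single-command requirement.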
I left out a few requirements, as they are yet to be determined. I am
not sure what kind of web front end you guys might want for the query
and the results, so I cannot speak to the requirements on that end.
Updating documents in the Solr index is currently a manual process. It
could be adjusted to run at an interval via crontab, be triggered by
the formal document builds, or be configured to do svn diffs and
import when it sees a change in a document it has been told to index.
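The crontab variant is the simplest of the three; the script name,
path, and schedule here are placeholders:

```
# Hypothetical crontab fragment: re-run the checkout/transform/import
# pipeline hourly.
0 * * * * /home/solr/bin/update-httpd-docs-index >/dev/null 2>&1
```

Hooking the same script into the formal doc builds instead would keep
the index exactly in step with what is published.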
Lastly, I'm doing this Solr apache documents thing with or without you,
you may as well take advantage of it. :)
chris rhodes
chris@ia.gov
---------------------------------------------------------------------
To unsubscribe, e-mail: docs-unsubscribe@httpd.apache.org
For additional commands, e-mail: docs-help@httpd.apache.org