Building a Search Infrastructure with Apache Solr and SolrCloud

Apache Solr is the underlying technology that powers Mylife.com's people search capabilities. This talk introduces Apache Solr and SolrCloud and discusses how they apply to our current search infrastructure.

Transcript of Building a Search Infrastructure with Apache Solr and SolrCloud

Building a Search Infrastructure with Apache Solr and SolrCloud
Spencer Yuen, Mylife.com

Agenda
- Introduction to Solr
- SolrCloud

Solr Replicas
- When replica goes down
  - Leader stops sending updates
- During recovery
  - Simple sync if differences are small
    - Get missing update commands from leader
  - Otherwise replicate whole index
    - Get full index from leader

Solr and Lucene
- Built on Apache Lucene, a Java library for information retrieval
- Manages an inverted index
  - Regular index: "what terms in a document?"
  - Inverted index: "which documents contain a specific term?"

Challenges
- If master goes down, shard cannot be updated
- Document assignment performed internally by own hashing scheme
- No centralized management of schema or configuration
- Splitting shards requires reindexing
- Searches against the cluster must use a lengthy 'shards' parameter

Mylife Architecture

What is Solr?
- System built to search text
- A platform to build search applications on
- Customizable, open source software

SolrCloud

SolrCloud Indexing
- No more master/slave, just leaders/replicas
- Leader automatically elected
  - If leader goes down, a replica is automatically elected as new leader
  - Fixes problem of shard not being updatable when master goes down
- Shard selection for indexing
  - Document ID hashed
  - No need to do own hashing

Solr Searching
- Requests can be sent to any machine
  - No distinction between master and slave
- Searching is near real-time
  - Distributed indexing, soft commits to memory
- No need for shards parameter in query or solrconfig.xml

ZooKeeper
- Centralized configuration
  - Maintains schema and configuration
- Provides distributed synchronization

Embedded versus Ensemble
- Embedded
  - Run ZooKeeper as part of Solr application
  - If Solr app goes down, ZooKeeper also goes down
- Ensemble
  - Run ZooKeeper as stand-alone instances on separate boxes
  - Multiple ZooKeeper instances running in case any instance crashes

SolrCloud Leader
- When leader goes down
  - Only some replicas may have received updates
  - New leader chosen and sync processes run against other replicas
  - If replicas are too far out of sync, asks for full replication/replay-based recovery
- Solves challenge of failed updates when master is down

Why Solr?
- Solr performs better for text search than relational DBs
- Solr features specific to text search
  - highlighting, faceting, etc.

Sample Inverted Index

Solr Architecture

PRIZE GIVEAWAY!!!

Distributed capabilities of Solr
- Set up fault-tolerant, highly available cluster of Solr servers

Capacity
- Expanding
  - Install Solr
  - Start Solr up with -DzkHost parameter
  - Register them with load balancer
  - Magic!!
- Reducing
  - Shut down machine

Centralized management
- Currently, each Solr server has its own schema.xml, solrconfig.xml
  - Updates require copying to each server
  - Can mean redeployment of entire cluster
- SolrCloud allows uploading configuration to ZooKeeper
  - ZooKeeper sends updated configuration to entire cluster

SolrCloud
- Simple Clients
  - Manage own load balancing
  - If server fails, try another replica in shard
- Intelligent Clients
  - SolrJ
  - Connects to ZooKeeper to know which shards are up

Commit Strategies

When doc is sent to index
- For replica, forward request
  - to own shard leader if shard is correct
  - to leader of another shard that is correct
- For leader
  - forward doc to correct shard leader
  - index doc for itself and shard replicas

Transaction Log
- Records updates
- Allows replay of uncommitted updates if indexing is interrupted
- Allows replicas to sync

Mylife Pipeline

Searching in Solr

Mylife Solr Query
http://localhost:10018/solr/select/?fq=(source:cadillac+OR+source:reunion)&defType=dismax&rows=10&indent=true&qid=0e5e4a19ee&shards=localhost:10001/solr,localhost:10002/solr,localhost:10003/solr,localhost:10004/solr,localhost:10005/solr,localhost:10006/solr,localhost:10007/solr,localhost:10008/solr,localhost:10009/solr,localhost:10010/solr,localhost:10011/solr,localhost:10012/solr,localhost:10013/solr,localhost:10014/solr,localhost:10015/solr,localhost:10016/solr,localhost:10017/solr,localhost:10018/solr,localhost:10019/solr,localhost:10020/solr&start=0&wt=json&bq=has_profile_image:true^0.2&q.alt=(((family_name:(Swanson)+OR+maiden_name:(Swanson)^0.5)+AND++(given_name:(Mikayla)+OR+given_name_exact:(Mikayla)))+OR+(name:%22Mikayla+Swanson%22^3.0+OR+name:%22Mikayla+Swanson%22~2)+OR+((+name:Mikayla)+AND+(+name:Swanson)))

A Few Parameters
- fq: filter queries limit the responses to the main query
- defType: query parser that processes user input and can handle errors, e.g. Lucene, dismax, edismax
- shards: request is distributed across all shards in the list. We'll revisit for SolrCloud.
- wt: response writers format the output, including XML, JSON, etc.
- bq: boosts a particular field when determining which results go to the top
- q or q.alt: the actual main query

Request Handling
http://<host>:8983/solr/<core>/<request-handler>
- core: index with its configurations; what is actually being indexed and searched on
  - "/facet"
- request handler: plugin to Solr that processes incoming requests in a particular way
  - "/select"

Solr Schema
- schema.xml
- fields: what you are searching/indexing in your document
  - "name", "dob", "location"
  - a field can be "indexed" (searched on) and/or "stored" (displayed as result)
  - fields are denormalized or flat structure
- field types: data type of field
  - "string", "int", "date", "boolean"

SolrConfig
- Contains Solr configurations
  - Custom request handlers: browse, admin, data import
  - Supporting library paths
  - Data directory
  - Frequency of commit
  - Cache management configurations
  - ...and more

Sample Solr Schema
    <fields>
      <field name="name" type="string" indexed="true" stored="true"/>
    </fields>
    <types>
      <fieldType name="string" class="solr.StrField"/>
    </types>

Indexing
- Update Handler
  - Handles update requests
  - Processes commit to disk
  - Refreshes searcher or snapshot view of index
- Solr uses unique ID
  - marks old version as deleted
  - adds new version of document
- Analyzer
  - processes text for each field or applies transformations to make text easily searchable
  - character filter, e.g. ISO Latin: I LOVE This café -> I LOVE This cafe
  - tokenizer, e.g. whitespace: I LOVE This cafe
  - token filters, e.g. lowercase: i love this cafe

Summary
- Solr is a powerful, simple, easily configurable platform for search applications
- SolrCloud supports fault tolerance through ZooKeeper

Prize Question
- What is the name of an intelligent Solr client?

Solr Feature: Faceted Search
- Technique to access information according to a classification system by multiple dimensions

Sample faceted search

Faceting in Solr
- Faceting supported out-of-the-box in Solr
- Use facet query parameters
- Can facet on
  - field values
  - ranges
  - dates

Facet Parameters
- Specify faceted fields: facet.field
- Specify max number of values: facet.limit
- Specify sort order by count or alphabet: facet.sort
- Others
  - facet.offset for paging
  - facet.mincount
  - facet.prefix

Sample Mylife query
http://localhost:8983/solr/facet/mylife?indent=true&wt=json&q.alt=*:*&facet=true&facet.field=job_title&facet.field=date_of_birth&facet.field=source&facet.field=location&facet.field=gender&facet.field=company_name&facet.limit=10

Shard Splitting
- Initial collection
  - Need to select number of shards
  - May have chosen wrong number
  - Re-indexing data required to reallocate shards
- Shard splitting
  - Solr 4.3 feature
  - Pre-existing shards can be split without reindexing
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1

Solr Cloud Admin
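
The "Solr and Lucene" slide contrasts a regular index with an inverted index. A minimal Python sketch of the idea follows. The sample documents and the function name `build_inverted_index` are made up for illustration, and this is a toy model rather than how Lucene actually stores its index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Regular (forward) index: document ID -> the text/terms in that document.
docs = {
    1: "solr is built on lucene",
    2: "lucene manages an inverted index",
    3: "solr search",
}

# Inverted index: term -> the documents that contain it.
inverted = build_inverted_index(docs)
print(inverted["lucene"])
print(inverted["solr"])
```

Answering "which documents contain a specific term?" then becomes a single dictionary lookup instead of a scan over every document.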
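
The analyzer steps on the "Indexing" slide (character filter, tokenizer, token filter) can be sketched as a small Python pipeline. The function names are hypothetical, and the accent folding uses Unicode decomposition as a stand-in for Solr's own character filters, so treat it as an approximation of the behavior shown on the slide.

```python
import unicodedata

def fold_to_ascii(text):
    # Character filter (stand-in): decompose accented characters,
    # then drop the combining marks, so "café" becomes "cafe".
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def whitespace_tokenize(text):
    # Tokenizer: split the text on runs of whitespace.
    return text.split()

def lowercase_filter(tokens):
    # Token filter: lowercase every token.
    return [token.lower() for token in tokens]

def analyze(text):
    # Apply the three stages in the order given on the slide.
    return lowercase_filter(whitespace_tokenize(fold_to_ascii(text)))

tokens = analyze("I LOVE This café")
print(tokens)  # ['i', 'love', 'this', 'cafe']
```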
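
The facet parameters listed under "Facet Parameters" can be assembled into a request URL programmatically. The sketch below uses only the Python standard library. The helper name `build_facet_url` is an assumption, and the base URL and field names are taken from the sample Mylife query.

```python
from urllib.parse import urlencode

def build_facet_url(base_url, fields, limit=10):
    """Build a Solr facet query URL; a repeated facet.field selects several fields."""
    params = [
        ("indent", "true"),
        ("wt", "json"),       # response writer: JSON output
        ("q.alt", "*:*"),     # match all documents
        ("facet", "true"),    # turn faceting on
    ]
    params += [("facet.field", field) for field in fields]
    params.append(("facet.limit", str(limit)))
    # Keep '*' and ':' unescaped so the URL reads like the one on the slide.
    return base_url + "?" + urlencode(params, safe="*:")

url = build_facet_url(
    "http://localhost:8983/solr/facet/mylife",
    ["job_title", "source"],
)
print(url)
```

Passing the parameters as a list of pairs rather than a dict lets `facet.field` appear more than once, which is how the sample query facets on several fields at the same time.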
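
The "Challenges" and "Shard Splitting" slides note that with a home-grown hashing scheme, choosing a different number of shards means re-indexing. The sketch below illustrates why, assuming a simple hash-modulo router. The function `pick_shard`, the MD5-based hash and the document IDs are all hypothetical. Neither Mylife's actual scheme nor SolrCloud's internal routing is reproduced here.

```python
import hashlib

def pick_shard(doc_id, num_shards):
    """Hypothetical router: hash the document ID and map it to a shard index."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

# Made-up document IDs for illustration.
doc_ids = ["person-%d" % i for i in range(1000)]

before = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}
after = {doc_id: pick_shard(doc_id, 5) for doc_id in doc_ids}

# Count documents whose shard assignment changes when a fifth shard is added.
moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print("documents that change shard going from 4 to 5 shards:", moved)
```

Under this kind of scheme most documents land on a different shard once the shard count changes, which is the cost that splitting pre-existing shards is meant to avoid.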