Title: Powerful Full-Text Search with Solr
1. Powerful Full-Text Search with Solr
- Yonik Seeley
- yonik@apache.org
- Web 2.0 Expo, Berlin
- 8 November 2007
download at http://www.apache.org/yonik
2. What is Lucene
- High performance, scalable, full-text search library
- Focus: Indexing + Searching Documents
- "Document" is just a list of name+value pairs
- No crawlers or document parsing
- Flexible Text Analysis (tokenizers + token filters)
- 100% Java, no dependencies, no config files
3. What is Solr
- A full text search server based on Lucene
- XML/HTTP, JSON Interfaces
- Faceted Search (category counting)
- Flexible data schema to define types and fields
- Hit Highlighting
- Configurable Advanced Caching
- Index Replication
- Extensible Open Architecture, Plugins
- Web Administration Interface
- Written in Java 5, deployable as a WAR
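4. Basic App
[Diagram: an Indexer and a Webapp (serving HTML) talk to Solr, which runs in a Servlet Container]
- Indexer posts a Document to http://solr/update
  - super_name: Mr. Fantastic
  - name: Reed Richards
  - category: superhero
  - powers: elasticity
- Webapp sends a Query (powers:agility) to http://solr/select
- Solr returns the Query Response (matching docs)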
5. Indexing Data
- HTTP POST to http://localhost:8983/solr/update
<add><doc>
  <field name="id">05991</field>
  <field name="name">Peter Parker</field>
  <field name="supername">Spider-Man</field>
  <field name="category">superhero</field>
  <field name="powers">agility</field>
  <field name="powers">spider-sense</field>
</doc></add>
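The update message above can also be generated rather than written by hand. The following is a minimal sketch using only Python's standard library; the helper name `build_add_message` is hypothetical and not part of Solr. It only builds the XML body and leaves the HTTP POST to whatever client you prefer.

```python
import xml.etree.ElementTree as ET


def build_add_message(fields):
    """Build a Solr <add><doc>...</doc></add> update message.

    `fields` is a list of (name, value) pairs. Repeating a name
    produces a multi-valued field, as with "powers" on the slide.
    (Hypothetical helper for illustration only.)
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


xml_body = build_add_message([
    ("id", "05991"),
    ("name", "Peter Parker"),
    ("supername", "Spider-Man"),
    ("category", "superhero"),
    ("powers", "agility"),
    ("powers", "spider-sense"),
])
print(xml_body)
```

Using a list of pairs instead of a dict keeps repeated field names, which is what makes multi-valued fields possible.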
6. Indexing CSV data
Iron Man, Tony Stark, superhero, powered armor | flight
Sandman, William Baker|Flint Marko, supervillain, sand transform
Wolverine, James Howlett|Logan, superhero, healing|adamantium
Magneto, Erik Lehnsherr, supervillain, magnetism|electricity
http://localhost:8983/solr/update/csv?
  fieldnames=supername,name,category,powers
  &separator=,
  &f.name.split=true&f.name.separator=|
  &f.powers.split=true&f.powers.separator=|
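Separator characters need URL encoding when they go into a real request. The sketch below builds the CSV update URL with the standard library. It assumes the multi-value separator is `|`, which the transcript does not preserve, and it uses the localhost URL from the slides.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://localhost:8983/solr/update/csv"

# Per-field options use the f.<fieldname>.<option> naming shown on the slide.
params = [
    ("fieldnames", "supername,name,category,powers"),
    ("separator", ","),
    ("f.name.split", "true"),
    ("f.name.separator", "|"),    # assumed separator character
    ("f.powers.split", "true"),
    ("f.powers.separator", "|"),  # assumed separator character
]

csv_url = BASE + "?" + urlencode(params)
print(csv_url)

# Round-trip check: decoding the query string gives back the raw values.
decoded = parse_qs(urlsplit(csv_url).query)
```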
7. Data upload methods
- URL=http://localhost:8983/solr/update/csv
- HTTP POST body (curl, HttpClient, etc)
  - curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
- Multi-part file upload (browsers)
- Request parameter
  - ?stream.body=Cyclops, Scott Summers,
- Streaming from URL (must enable)
  - ?stream.url=file://data/info.csv
8. Indexing with SolrJ
// Solr's Java Client API: remote or embedded/local!
SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("supername", "Daredevil");
doc.addField("name", "Matt Murdock");
doc.addField("category", "superhero");
server.add(doc);
server.commit();
9. Deleting Documents
- Delete by Id, most efficient
  <delete>
    <id>05591</id>
    <id>32552</id>
  </delete>
- Delete by Query
  <delete>
    <query>category:supervillain</query>
  </delete>
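Both delete forms are small XML messages posted to the update URL. A minimal sketch of building them follows; the function names `delete_by_ids` and `delete_by_query` are hypothetical helpers, not Solr APIs.

```python
import xml.etree.ElementTree as ET


def delete_by_ids(ids):
    """Build a <delete> message with one <id> element per document id."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Build a <delete> message that removes every document matching `query`."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


print(delete_by_ids(["05591", "32552"]))
print(delete_by_query("category:supervillain"))
```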
10. Commit
- <commit/> makes changes visible
  - Triggers static cache warming in solrconfig.xml
  - Triggers autowarming from existing caches
- <optimize/> same as commit, merges all index segments for faster searching
[Diagram: Lucene Index Segments]
11. Searching
http://localhost:8983/solr/select?q=powers:agility&start=0&rows=2&fl=supername,category
<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>
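A client usually turns this XML response into native data structures. The sketch below parses a response shaped like the one on the slide with Python's standard library. The helper `parse_docs` is hypothetical and only handles `<str>` fields, which is all this example contains.

```python
import xml.etree.ElementTree as ET

# Sample response, shaped like the slide's output.
SAMPLE_RESPONSE = """
<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>
"""


def parse_docs(xml_text):
    """Return (numFound, list of {field name: value}) from a select response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [
        {field.get("name"): field.text for field in doc.findall("str")}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


num_found, docs = parse_docs(SAMPLE_RESPONSE)
print(num_found, docs)
```

Note that `numFound` is the total number of matches, while `rows` limits how many documents come back in a single response.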
12. Response Format
- Add &wt=json for JSON formatted response
{"response": {"numFound":427, "start":0,
  "docs": [
    {"supername":"Spider-Man", "category":"superhero"},
    {"supername":"Mystique", "category":"supervillain"}
  ]
}}
- Also Python, Ruby, PHP, SerializedPHP, XSLT
13. Scoring
- Query results are sorted by score descending
- VSM: Vector Space Model
- tf: term frequency, number of matching terms in field
- lengthNorm: number of tokens in field
- idf: inverse document frequency
- coord: coordination factor, number of matching terms
- document boost
- query clause boost
- http://lucene.apache.org/java/docs/scoring.html
14. Explain
http://solr/select?q=super fast&indent=on&debugQuery=on
<lst name="debug">
 <lst name="explain">
  <str name="id=Flash,internal_docid=6">
0.16389132 = (MATCH) product of:
  0.32778263 = (MATCH) sum of:
    0.32778263 = (MATCH) weight(text:fast in 6), product of:
      0.5012072 = queryWeight(text:fast), product of:
        2.466337 = idf(docFreq=5)
        0.20321926 = queryNorm
      0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
        1.4142135 = tf(termFreq(text:fast)=2)
        2.466337 = idf(docFreq=5)
        0.1875 = fieldNorm(field=text, doc=6)
  0.5 = coord(1/2)
  </str>
  <str name="id=Superman,internal_docid=7">
0.1365761 = (MATCH) product of:
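The explain tree for the Flash document is a nested product, so it can be checked by hand. The sketch below redoes the arithmetic from the numbers shown. It assumes tf is the square root of the term frequency, which matches the 1.4142135 reported for termFreq=2. The variable names are illustrative only.

```python
import math

# Values copied from the explain output for doc 6.
idf = 2.466337
query_norm = 0.20321926
field_norm = 0.1875
tf = math.sqrt(2)   # termFreq=2; assumed tf = sqrt(freq)
coord = 1 / 2       # one of the two query terms ("super", "fast") matched

query_weight = idf * query_norm        # queryWeight(text:fast)
field_weight = tf * idf * field_norm   # fieldWeight(text:fast in 6)
term_score = query_weight * field_weight
score = term_score * coord             # top-level "product of"

print(query_weight, field_weight, term_score, score)
```

The results agree with the explain output to about six decimal places. The remaining differences come from single-precision rounding in the reported numbers.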
15. Lucene Query Syntax
- justice league
  - Equiv: justice OR league
  - QueryParser default operator is "OR"/optional
- +justice +league -name:aquaman
  - Equiv: justice AND league NOT name:aquaman
- "justice league" -name:aquaman
- title:spiderman^10 description:spiderman
- description:"spiderman movie"~100
16. Lucene Query Examples 2
- releaseDate:[2000 TO 2007]
- Wildcard searches: sup?r, su*r, super*
- spider~
  - Fuzzy search: Levenshtein distance
  - Optional minimum similarity: spider~0.7
- *:*
- (Superman AND "Lex Luthor") OR (+Batman +Joker)
17. DisMax Query Syntax
- Good for handling raw user queries
  - Balanced quotes for phrase query
  - '+' for required, '-' for prohibited
  - Separates query terms from query structure
http://solr/select?qt=dismax
  &q=super man                // the user query
  &qf=title^3 subject^2 body  // fields to query
  &pf=title^2,body            // fields to do phrase queries
  &ps=100                     // slop for those phrase q's
  &tie=.1                     // multi-field match reward
  &mm=2                       // # of terms that should match
  &bf=popularity              // boost function
18. DisMax Query Form
- The expanded Lucene Query:
+( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
   DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man)
 )
 DisjunctionMaxQuery(title:"super man"~100^2 | body:"super man"~100)
 FunctionQuery(popularity)
- Tip: set up your own request handler with default parameters to avoid clients having to specify them
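The role of the tie parameter is easier to see with numbers. A DisjunctionMaxQuery scores a document by its best-matching field and adds the other field scores multiplied by the tie breaker. The sketch below shows that combination rule only; `dismax_score` is a hypothetical helper and the per-field scores are made-up inputs, not output from Solr.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores for one term the way a disjunction-max does:
    the best field wins, and the other fields contribute tie * their score.
    """
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


# Made-up scores for the term "super" in title, subject and body.
scores = [0.9, 0.4, 0.1]

print(dismax_score(scores, tie=0.0))  # pure max: only the best field counts
print(dismax_score(scores, tie=0.1))  # small reward for matching more fields
print(dismax_score(scores, tie=1.0))  # behaves like a plain sum
```

With tie=0 a document gains nothing from matching the same term in several fields. With tie=1 the query degenerates into an ordinary sum over fields. A small value such as the .1 on the previous slide sits in between.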
19. Function Query
- Allows adding function of field value to score
  - Boost recently added or popular documents
- Current parser only supports function notation
- Example: log(sum(popularity,1))
- sum, product, div, log, sqrt, abs, pow
- scale(x, target_min, target_max)
  - calculates min & max of x across all docs
- map(x, min, max, target)
  - useful for dealing with defaults
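The semantics of scale and map can be illustrated outside Solr. The sketch below is a plain-Python imitation of what the two functions compute, based on the descriptions above. `scale_values` and `map_value` are hypothetical names, and the handling of the case where all values are equal is an assumption of this sketch.

```python
def map_value(x, lo, hi, target):
    """map(x, min, max, target): values inside [lo, hi] become `target`,
    everything else passes through unchanged."""
    return target if lo <= x <= hi else x


def scale_values(values, target_min, target_max):
    """scale(x, target_min, target_max): rescale linearly so that the
    smallest value across all docs maps to target_min and the largest
    to target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread, everything maps to target_min.
        return [float(target_min)] * len(values)
    factor = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * factor for v in values]


popularity = [0, 5, 10]  # made-up field values for three documents

# Treat a default popularity of 0 as 1.
print([map_value(p, 0, 0, 1) for p in popularity])

# Squash popularity into the range 1..2.
print(scale_values(popularity, 1, 2))
```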
20. Boosted Query
- Score is multiplied instead of added
- New local params <!...> syntax added
  &q=<!boost b=sqrt(popularity)>super man
- Parameter dereferencing in local params
  &q=<!boost b=$boost v=$userq>
  &boost=sqrt(popularity)
  &userq=super man
21. Analysis & Search Relevancy
Document Indexing Analysis: "LexCorp BFG-9000"
- WhitespaceTokenizer: LexCorp, BFG-9000
- WordDelimiterFilter catenateWords=1: Lex, Corp, LexCorp, BFG, 9000
- LowercaseFilter: lex, corp, lexcorp, bfg, 9000
Query Analysis: "Lex corp bfg9000"
- WhitespaceTokenizer: Lex, corp, bfg9000
- WordDelimiterFilter catenateWords=0: Lex, corp, bfg, 9000
- LowercaseFilter: lex, corp, bfg, 9000
A Match!
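The two analysis chains can be imitated in a few lines to show why the query matches. This is a simplified sketch, not Solr's WordDelimiterFilter: it only splits on case changes, letter/digit boundaries and punctuation, and it optionally catenates adjacent word parts. The names `word_delimiter` and `analyze` are hypothetical.

```python
import re

# Sub-word patterns: Capitalized or lowercase word, ALL-CAPS run, digit run.
_PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def word_delimiter(token, catenate_words):
    """Split one token into sub-words; optionally add joined word runs."""
    out, run = [], []
    for part in _PARTS.findall(token) + [""]:  # "" flushes the final run
        if part.isalpha():
            run.append(part)
            out.append(part)
            continue
        if catenate_words and len(run) > 1:
            out.append("".join(run))  # e.g. Lex + Corp -> LexCorp
        run = []
        if part:
            out.append(part)
    return out


def analyze(text, catenate_words):
    """Whitespace tokenizer -> word delimiter -> lowercase."""
    terms = []
    for token in text.split():
        terms.extend(p.lower() for p in word_delimiter(token, catenate_words))
    return terms


index_terms = analyze("LexCorp BFG-9000", catenate_words=True)
query_terms = analyze("Lex corp bfg9000", catenate_words=False)
print(index_terms)
print(query_terms)
print(set(query_terms) <= set(index_terms))  # every query term was indexed
```

Catenating at index time but not at query time means the indexed document also contains "lexcorp", so a user typing the single word would match too.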
22. Configuring Relevancy
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory"
            words="stopwords.txt"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
23. Field Definitions
- Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
<field name="id" type="string" indexed="true" stored="true"/>
<field name="sku" type="textTight" indexed="true" stored="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="inStock" type="boolean" indexed="true" stored="false"/>
<field name="price" type="sfloat" indexed="true" stored="false"/>
<field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
- Dynamic Fields
<dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
24. copyField
- Copies one field to another at index time
- Usecase 1: Analyze same field different ways
  - copy into a field with a different analyzer
  - boost exact-case, exact-punctuation matches
  - language translations, thesaurus, soundex
<field name="title" type="text"/>
<field name="title_exact" type="text_exact" stored="false"/>
<copyField source="title" dest="title_exact"/>
- Usecase 2: Index multiple fields into single searchable field
28. Facet Query
http://solr/select?q=foo&wt=json&indent=on
  &facet=true&facet.field=cat
  &facet.query=price:[0 TO 100]
  &facet.query=manu:IBM
{"response":{"numFound":26,"start":0,"docs":[...]},
 "facet_counts":{
   "facet_queries":{
     "price:[0 TO 100]":6,
     "manu:IBM":2},
   "facet_fields":{
     "cat":[ "electronics",14, "memory",3,
             "card",2, "connector",2]
 }}}
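In the JSON output, each facet field comes back as a flat list that alternates value and count. A client typically pairs these up before display. The sketch below does that for a response shaped like the one on the slide. The `docs` list is left empty in the sample and `facet_field_counts` is a hypothetical helper.

```python
import json

# Sample response shaped like the slide's output (docs omitted).
SAMPLE = """
{"response":{"numFound":26,"start":0,"docs":[]},
 "facet_counts":{
   "facet_queries":{"price:[0 TO 100]":6, "manu:IBM":2},
   "facet_fields":{
     "cat":["electronics",14, "memory",3, "card",2, "connector",2]}}}
"""


def facet_field_counts(response, field):
    """Turn [value, count, value, count, ...] into [(value, count), ...],
    keeping the order Solr returned."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


response = json.loads(SAMPLE)
cat_counts = facet_field_counts(response, "cat")
query_counts = response["facet_counts"]["facet_queries"]

print(cat_counts)
print(query_counts)
```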
29. Filters
- Filters are restrictions in addition to the query
- Use in faceting to narrow the results
- Filters are cached separately for speed
1. User queries for memory, query sent to solr is
   &q=memory&fq=inStock:true&facet=true
2. User selects 1GB memory size
   &q=memory&fq=inStock:true&fq=size:1GB
3. User selects DDR2 memory type
   &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2
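The drill-down steps above amount to appending one fq parameter per user selection while the main query stays the same. A minimal sketch of building those query strings follows; `drilldown_query` is a hypothetical helper and the filter values are the ones from the slide.

```python
from urllib.parse import urlencode


def drilldown_query(q, filters):
    """Build a query string with one fq parameter per active filter.

    A list of pairs is used so that the repeated "fq" key is preserved.
    """
    params = [("q", q)] + [("fq", f) for f in filters] + [("facet", "true")]
    return urlencode(params)


filters = ["inStock:true"]
step1 = drilldown_query("memory", filters)

filters.append("size:1GB")    # user selects 1GB memory size
step2 = drilldown_query("memory", filters)

filters.append("type:DDR2")   # user selects DDR2 memory type
step3 = drilldown_query("memory", filters)

for step in (step1, step2, step3):
    print(step)
```

Because each filter is its own fq parameter, Solr can cache and reuse them independently across different user queries.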
30. Highlighting
http://solr/select?q=lcd&wt=json&indent=on
  &hl=true&hl.fl=features
{"response":{"numFound":5,"start":0,"docs":[
    {"id":"3007WFP", "price":899.95}, ...]},
 "highlighting":{
   "3007WFP":{"features":["30\" TFT active matrix <em>LCD</em>, 2560 x 1600"]},
   "VA902B":{"features":["19\" TFT active matrix <em>LCD</em>, 8ms response time, 1280 x 1024 native resolution"]}}}
31. MoreLikeThis
- Selects documents that are similar to the documents matching the main query.
  &q=id:6H500F0&mlt=true&mlt.fl=name,cat,features
"moreLikeThis":{ "6H500F0":{"numFound":5,"start":0,
    "docs":[
      {"name":"Apple 60 GB iPod with Video Playback Black", "price":399.0,
       "inStock":true, "popularity":10, ...},
      ...]}}
32. High Availability
[Diagram labels: HTTP search requests, Load Balancer, Appservers (Dynamic HTML Generation), Solr Searchers, Index Replication, Solr Master, DB, updates, admin queries, admin terminal]
33. Resources
- WWW
  - http://lucene.apache.org/solr
  - http://lucene.apache.org/solr/tutorial.html
  - http://wiki.apache.org/solr/
- Mailing Lists
  - solr-user-subscribe@lucene.apache.org
  - solr-dev-subscribe@lucene.apache.org