Solr Performance - PowerPoint PPT Presentation

Provided by: Growler
Learn more at: http://people.apache.org

1
Solr Performance Key Innovations
  • Yonik Seeley, Lucid Imagination,
    yonik_at_lucidimagination.com, May 26 2011

2
Solr 3.1 Highlights
  • Numeric range facets (similar to date faceting).
  • New spatial search, including spatial filtering,
    boosting and sorting capabilities.
  • Example Velocity-driven search UI at
    http://localhost:8983/solr/browse
  • A new faster termvector-based highlighter.
  • Extended dismax (edismax) query parser with
    support for fielded queries, enhanced relevancy,
    and full lucene syntax support.
  • Distributed search support for the Spell check
    and Terms components.

3
Solr 3.1 Highlights (continued)
  • Suggester, a fast trie-based autocomplete
    component.
  • Sort results by any function query.
  • JSON document indexing.
  • CSV response format
  • Apache UIMA integration for metadata extraction.
  • Tons of optimizations, bugfixes, and new analysis
    capabilities via Apache Lucene 3.1.

4
What's not in 3.1?
  • Result Grouping (AKA Field Collapsing)
  • Pivot Faceting
  • SolrCloud
  • Pseudo-fields
  • Pseudo-join
  • Relevancy function queries
  • Per-segment faceting
  • Tons of new Lucene performance/efficiency
    goodness

5
Recent Lucene Performance
  • TieredMergePolicy is the new default
  • Much better for incremental indexing / NRT
  • Ignores segment order when selecting best merge
  • Takes deletes into account
  • Does not over-merge (no cascading merges)
  • Finite State Transducer (FST) based terms index

6
DocumentsWriterPerThread (DWPT)
  • Flushing a new segment is now concurrent w/
    indexing
  • Use multiple indexing threads/connections
  • When max mem is hit, biggest DWPT is concurrently
    flushed

[Diagram: several indexing threads each write to their own in-memory DWPT inside the IndexWriter; each DWPT is flushed to disk as a separate segment (_1_0.tiv _1_0.prx _1_0.frq, _2_0.tiv _2_0.prx _2_0.frq, _3_0.tiv _3_0.prx _3_0.frq)]
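
The advice to use multiple indexing threads/connections could look roughly like this on the client side. This is only a sketch: `send_batch` is a hypothetical caller-supplied function that posts one batch of documents over its own connection, and the thread count and batch size are arbitrary illustration values, not recommendations from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(docs, size):
    """Split a list of documents into consecutive batches of at most `size`."""
    return [docs[i:i + size] for i in range(0, len(docs), size)]


def index_concurrently(docs, send_batch, threads=4, batch_size=100):
    """Send batches from several threads.

    `send_batch` is a hypothetical function (not a Solr API) that posts one
    batch and returns whatever the caller wants to collect.
    """
    batches = chunked(docs, batch_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() keeps batch order and re-raises any exception from a worker
        return list(pool.map(send_batch, batches))
```

For a dry run, any callable works as `send_batch`; for example, passing `len` returns the size of each batch.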
7
Solr Cloud
http://.../solr/collection1?distrib=true
[Diagram: a distributed request fans out as load-balanced sub-requests to one replica of each shard (shard1 and shard2, each shown with replica1, replica2, replica3). A ZooKeeper quorum of ZK nodes holds the cluster state below.]
/livenodes
    server1:8983/solr
    server2:8983/solr
    server2:8983/solr
/collections
    /collection1  configName=myconf
        /shards
            /shard1  server1:8983/solr  server2:8983/solr
            /shard2  server3:8983/solr  server4:8983/solr
/configs
    /myconf
        solrconfig.xml
        schema.xml
8
Solr Cloud: Getting Started
  • http://wiki.apache.org/solr/SolrCloud
  • java -Dbootstrap_confdir=./solr/conf
         -Dcollection.configName=myconf
         -DzkRun
         -jar start.jar

-Dbootstrap_confdir, -Dcollection.configName: upload ./solr/conf to ZK and call it myconf
-DzkRun: run an internal ZK server
http://localhost:8983/solr/collection1/admin/zookeeper.jsp
9
Distributed Requests
  • Explicitly specify node addresses to load-balance
    across
  • shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  • A list of equivalent nodes is separated by "|"
  • Different phases of the same distributed request
    use the same node
  • Specify logical shard ids to search across
  • shards=NY_shard,NJ_shard
  • Query across all shards in the collection
  • http://localhost:8983/solr/collection1/select?distrib=true
  • public CloudSolrServer(String zkHost)
  • SolrJ Java client that load-balances across all
    nodes in cluster
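
The explicit `shards` value above can be assembled programmatically. This is a minimal sketch, assuming "|" separates equivalent nodes (replicas) and "," separates distinct shards; the helper name `build_shards_param` is made up for illustration.

```python
def build_shards_param(shards):
    """Join replicas of one shard with '|' and distinct shards with ','."""
    return ",".join("|".join(replicas) for replicas in shards)


shards = build_shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],  # equivalent nodes, shard 1
    ["localhost:7574/solr", "localhost:7500/solr"],  # equivalent nodes, shard 2
])
print(shards)
```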

10
Extended Dismax Parser
  • Superset of dismax
  • Designed to directly handle user queries w/o
    exceptions
  • defType=edismax&q=foo&qf=body
  • Fixes edge cases where dismax could still throw
    exceptions
  • OR AND NOT -
  • Full lucene syntax support
  • Tries lucene syntax first
  • Smart escaping is done if there are syntax errors
  • Optionally supports treating "and"/"or" as
    "AND"/"OR" in lucene syntax
  • Fielded queries (e.g. myfield:foo) even in
    degraded mode
  • uf parameter controls what field names may be
    directly specified in q

11
Extended Dismax Parser (continued)
  • boost parameter for multiplicative
    boost-by-function
  • Pure negative query clauses
  • Example: solr OR (-solr)
  • Enhanced term proximity boosting
  • pf2=myfield results in term bigrams in sloppy
    phrase queries
  • myfield:"aa bb cc" -> myfield:"aa bb"
    myfield:"bb cc"
  • Enhanced stopword handling
  • stopwords omitted in main query, but added in
    optional proximity boosting part
  • Example: q=solr is awesome & qf=myfield &
    pf2=myfield ->
  • myfield:(solr awesome) (myfield:"solr is"
    myfield:"is awesome")
  • Currently controlled by the absence of
    StopWordFilter in index analyzer, and presence in
    query analyzer
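
The pf2 bigram expansion described on this slide can be illustrated in a few lines. This is a sketch of the idea only, not the parser's implementation: the function names are hypothetical, whitespace splitting stands in for real analysis, and slop and boost values are left out.

```python
def word_bigrams(text):
    """Return adjacent word pairs, e.g. 'aa bb cc' -> ['aa bb', 'bb cc']."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def pf2_clauses(field, text):
    """Format each bigram as a phrase clause on `field` (illustrative only)."""
    return ['{}:"{}"'.format(field, bigram) for bigram in word_bigrams(text)]


clauses = pf2_clauses("myfield", "solr is awesome")
print(" ".join(clauses))
```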

12
Faceting Performance Improvements
  • For facet.method=enum, speed up initial
    population of the filterCache (i.e. first time
    facet): from 30% to 32x improvement
  • Optimized facet.method=fc for multi-valued fields
    and large facet.limit: up to 3x faster
  • Optimized deep facet paging: up to 10x faster
    with really large facet.offsets
  • Less memory consumed by field cache entries
  • Per-segment faceting with facet.method=fcs
  • Only faster when re-opening index frequently
    (many times a second)
  • Only works for single-valued fields

13
Pivot Faceting
  • Other names that could have made sense:
  • Grid Faceting, Cross-Product Faceting, Matrix
    Faceting
  • Syntax: facet.pivot=field1,field2,field3,...

facet.pivot=cat,inStock
                     #docs   #docs w/ inStock:true   #docs w/ inStock:false
cat:electronics        14             10                       4
cat:memory              3              3                       0
cat:connector           2              0                       2
cat:graphics card       2              0                       2
cat:hard drive          2              2                       0
14
Pivot Faceting
http://...&facet=true&facet.pivot=cat,popularity

"facet_counts":{
  "facet_pivot":{
    "cat,popularity":[{
        "field":"cat",
        "value":"electronics",
        "count":14,
        "pivot":[{
            "field":"popularity",
            "value":"6",
            "count":5},
          {
            "field":"popularity",
            "value":"7",
            "count":4},
          ... (continued)
          {
            "field":"popularity",
            "value":"1",
            "count":2}]},
      {
        "field":"cat",
        "value":"memory",
        "count":3,
        "pivot":[...]},
      ...

Callouts: "count":14 means 14 docs w/ cat:electronics;
"count":5 means 5 docs w/ cat:electronics and popularity:6
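
A nested pivot response like the one on this slide can be walked recursively on the client. This is a sketch, assuming the response has already been parsed from JSON into Python lists and dicts with the "field", "value", "count" and optional "pivot" keys shown; `flatten_pivot` is an illustrative helper, and the sample is only the part of the response visible on the slide.

```python
def flatten_pivot(entries, path=()):
    """Yield (path, count) for every node of a facet_pivot tree.

    `path` is a tuple of (field, value) pairs from the top level down.
    """
    for entry in entries:
        here = path + ((entry["field"], entry["value"]),)
        yield here, entry["count"]
        # leaves may omit "pivot" or leave it empty
        yield from flatten_pivot(entry.get("pivot") or [], here)


sample = [{
    "field": "cat", "value": "electronics", "count": 14,
    "pivot": [
        {"field": "popularity", "value": "6", "count": 5},
        {"field": "popularity", "value": "7", "count": 4},
    ],
}]
rows = list(flatten_pivot(sample))
for path, count in rows:
    print(" / ".join("{}:{}".format(f, v) for f, v in path), count)
```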
15
Range Faceting
"facet_counts":{
  "facet_ranges":{
    "price":{
      "counts":{
        "0.0":5,
        "50.0":2,
        "100.0":0,
        "150.0":2,
        "200.0":0,
        "250.0":1,
        "300.0":2,
        "350.0":2,
        "400.0":0,
        "450.0":1},
      "gap":50.0,
      "start":0.0,
      "end":500.0}}}
  • Like Date faceting, but more generic
  • http://...&facet=true
      &facet.range=price
      &facet.range.start=0
      &facet.range.end=500
      &facet.range.gap=50
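
Each key under "counts" is the lower bound of a bucket whose width is "gap". A small sketch of turning that into explicit intervals follows, assuming the parsed response exposes "counts" as a mapping from lower bound to count (some response settings may return a flat alternating list instead); `range_buckets` is a hypothetical helper and the sample is shortened.

```python
def range_buckets(facet):
    """Turn one facet_ranges entry into (lower, upper, count) tuples.

    Assumes `counts` maps each bucket's lower bound (a string) to its count.
    """
    gap = facet["gap"]
    return [(float(lower), float(lower) + gap, count)
            for lower, count in facet["counts"].items()]


price = {
    "counts": {"0.0": 5, "50.0": 2, "100.0": 0},  # first three buckets only
    "gap": 50.0,
    "start": 0.0,
    "end": 500.0,
}
buckets = range_buckets(price)
```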

16
Spatial Search
Step 1: Index some locations!
    <field name="name">The Alpine Shop</field>
    <field name="store">44.013617,-73.168264</field>
Step 2: Decide where you are
    pt=44.0153371,-73.16734
    d=1
    sfield=store
Step 3: Profit!
    Spatial Filter:          fq={!geofilt}
    Bounding Box:            fq={!bbox}
    Distance Function:       sort=geodist() asc
    Returning the distance:  fl=geodist()   (Pseudo-fields!)
Note: You can now sort by any arbitrary function
query!
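
To see why the shop indexed in Step 1 passes a d=1 filter around the point from Step 2, the distance can be estimated by hand. This is a sketch under stated assumptions: it uses a plain haversine great-circle formula with a mean Earth radius of 6371 km and treats d as kilometres; it is not Solr's code and the constant is an assumption.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


store = (44.013617, -73.168264)  # The Alpine Shop, from Step 1
here = (44.0153371, -73.16734)   # pt, from Step 2
distance = haversine_km(*here, *store)
within_filter = distance <= 1.0  # d=1, assuming kilometres
```

The two points are a few hundred metres apart, so the document falls inside the filter radius.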
17
Pseudo-Fields
  • Returns other info along with document stored
    fields
  • Function queries
  • fl=name,location,geodist(),add(myfield,10)
  • Fieldname globs
  • fl=id,attr_*
  • Multiple fl (field list) values
  • fl=id,attr_*&fl=geodist()&fl=termfreq(text,solr)
  • Aliasing
  • fl=id,location:loc,_dist_:geodist()
  • Future: inlined highlighting, explain,
    sort-values, group-value
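
Repeating the fl parameter is easy to get wrong when building URLs by hand, because characters such as "*", "(" and "," need escaping. This sketch uses Python's standard `urllib.parse`; the q=*:* value is an assumption added for illustration and is not on the slide.

```python
from urllib.parse import parse_qs, urlencode

params = [
    ("q", "*:*"),                  # assumed match-all query, for illustration
    ("fl", "id,attr_*"),           # stored field plus a field-name glob
    ("fl", "geodist()"),           # function query returned as a pseudo-field
    ("fl", "termfreq(text,solr)"),
]
query_string = urlencode(params)  # percent-encodes each value
decoded = parse_qs(query_string)  # round-trip to check nothing was lost
print(query_string)
```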

18
Result Grouping / Field Collapsing
  • Goal:
  • Limit the number of results per "category"
  • "category" normally defined by unique values in a
    field
  • Uses:
  • Web Search: collapse by web site
  • Email threads: collapse by thread id
  • Ecommerce/retail:
  • Show the top 5 items for each store category
    (music, movies, etc)

19
Field Collapsing by Site
20
Field Collapse on Product Type
Result Grouping by Category
21
Group by Field
  • http://...&fl=id,name&q=ipod&group=true&group.field=manu_exact

"grouped":{
  "manu_exact":{
    "matches":3,
    "groups":[{
        "groupValue":"Belkin",
        "doclist":{"numFound":2,"start":0,"docs":[
            {
              "id":"IW-02",
              "name":"iPod & iPod Mini USB 2.0 Cable"}]
        }},
      {
        "groupValue":"Apple Computer Inc.",
        "doclist":{"numFound":1,"start":0,"docs":[
            {
              "id":"MA147LL/A",
              "name":"Apple 60 GB iPod with Video Playback Black"}]
        }}]}}
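
A client typically needs the group value, the total hits in the group, and the returned documents. This is a sketch of pulling those out of a parsed group-by-field response, assuming the "grouped" / field / "groups" / "doclist" layout shown on this slide; `summarize_groups` is a hypothetical helper.

```python
def summarize_groups(response, field):
    """Return (groupValue, numFound, [doc ids]) for each group of `field`."""
    grouped = response["grouped"][field]
    return [
        (group["groupValue"],
         group["doclist"]["numFound"],
         [doc["id"] for doc in group["doclist"]["docs"]])
        for group in grouped["groups"]
    ]


response = {"grouped": {"manu_exact": {"matches": 3, "groups": [
    {"groupValue": "Belkin",
     "doclist": {"numFound": 2, "start": 0, "docs": [{"id": "IW-02"}]}},
    {"groupValue": "Apple Computer Inc.",
     "doclist": {"numFound": 1, "start": 0, "docs": [{"id": "MA147LL/A"}]}},
]}}}
summary = summarize_groups(response, "manu_exact")
```

Note that numFound (2 for Belkin) can exceed the number of returned docs, since the number of docs shown per group is capped by group.limit.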

22
Group by Query
http://...&group=true&group.query=price:[0 TO 99.99]&group.query=price:[100 TO *]&group.limit=5

"grouped":{
  "price:[0 TO 99.99]":{
    "matches":3,
    "doclist":{"numFound":2,"start":0,"docs":[
        {
          "id":"IW-02",
          "name":"iPod & iPod Mini USB 2.0 Cable"},
        {
          "id":"F8V7067-APL-KIT",
          "name":"Belkin Mobile Power Cord for iPod"}]
    }},
  "price:[100 TO *]":{
    "matches":3,
    "doclist":{"numFound":1,"start":0,"docs":[
        {
          "id":"MA147LL/A",
          "name":"Apple 60 GB iPod with Video Playback Black"}]
    }}}
23
Grouping Params
parameter | meaning | default
group.field=<field> | Like facet.field: group by unique field values |
group.query=<query> | Like facet.query: top docs that also match |
group.function=<function query> | Group by unique values produced by the function query |
group.limit=<n> | How many docs per group | 1
group.sort=<sort spec> | How to sort documents within a group | Same as sort
rows=<n> | How many groups to return | 10
sort=<sort spec> | How to sort the groups relative to each other (based on top doc) |
group.format=<format> | grouped/simple: if simple, a single flat list is used and the units of rows are docs | grouped
group.main=true/false | If true, the first field grouping command is used as the main result set | false
24
Pseudo-Join
Example documents:
    id: post1 | blog_id: blog1 | author: Yonik Seeley | title: Solr relevancy function queries | body: Lucene's default ranking ...
    id: blog1 | name: Solr 'n Stuff | owner: Yonik Seeley | started: 2007-10-26
    id: post2 | blog_id: blog1 | author: Yonik Seeley | title: Solr result grouping | body: Result Grouping, also called ...
    id: blog2 | name: lifehacker | owner: Gawker Media | started: 2005-1-31
    id: post3 | blog_id: blog2 | author: Whitson Gordon | title: How to Install Netflix on Almost Any Android Device
Restrict to blogs mentioning netflix:
    fq={!join from=blog_id to=id}body:netflix
  • Finds all documents matching netflix
  • Maps to different docs by following blog_id to id
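
The two steps above (find matching docs, then follow the "from" field to the "to" field) can be mimicked over an in-memory list. This is a conceptual sketch only, not how Solr executes joins: it handles single-valued fields, `pseudo_join` is a hypothetical helper, and matching on the title stands in for body:netflix because the slide does not show post3's body.

```python
def pseudo_join(docs, matches, from_field, to_field):
    """Filter-style join over a list of dicts.

    Collect the `from_field` values of docs satisfying `matches`, then
    return the docs whose `to_field` holds one of those values.
    """
    keys = {doc[from_field] for doc in docs
            if from_field in doc and matches(doc)}
    return [doc for doc in docs if doc.get(to_field) in keys]


docs = [
    {"id": "post1", "blog_id": "blog1", "title": "Solr relevancy function queries"},
    {"id": "blog1", "name": "Solr 'n Stuff"},
    {"id": "post2", "blog_id": "blog1", "title": "Solr result grouping"},
    {"id": "blog2", "name": "lifehacker"},
    {"id": "post3", "blog_id": "blog2",
     "title": "How to Install Netflix on Almost Any Android Device"},
]


def mentions_netflix(doc):
    # stand-in for body:netflix (assumption: the title is searched instead)
    return "netflix" in doc.get("title", "").lower()


blogs = pseudo_join(docs, mentions_netflix, "blog_id", "id")
```

As with the real filter, only the joined-to documents come back; nothing from the matching posts is carried over.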

25
Pseudo-Join Examples
  • Only show posts from blogs started after 2010
  • q=foo&fq={!join from=id to=blog_id}started:[2010
    TO *]
  • If any post in a blog mentions "obama", then
    search all posts in that blog for "bomb"
    (self-join)
  • q=bomb&fq={!join from=blog_id to=blog_id}obama
  • If any blog post mentions "obama", then search
    all websites with the same blog owner for "bomb"
  • q=bomb&fq={!join from=owner to=website_owner}{!join from=blog_id to=id}obama

26
Cross-Core Join
  • http://localhost:8983/solr/collection1/select?q=foo&fq={!join fromIndex=sec1 from=security_groups to=security}user:john

27
Pseudo-Join vs Grouping
Pseudo-Join | Result Grouping / Field Collapsing
O(n_terms_in_join_fields) | O(n_docs_in_result)
Single or multi-valued fields | Single-valued fields only
Filters only (no info currently passed from the "from" docs to the "to" docs). | Can order docs within a group and groups by top doc within that group using normal sort criteria.
Chainable (one join can be the input to another) | Not currently chainable; can only group one field deep
Affects which documents match a request, so naturally affects facet numbers (e.g. you can search posts and get numbers of blogs) | Grouping does not currently affect the set of documents matching the query, so faceting is unaffected.

28
Auto-Suggest
  • Many people previously used terms component
  • Can be slow for a large corpus
  • New auto-suggest builds off SpellCheck component
  • TST implementation: compact memory-based trie
  • FST implementation: slower to build, but smaller,
    faster lookup
  • Based on a field in the main index, or on a
    dictionary file
  • http://localhost:8983/solr/suggest?wt=json&indent=true&q=ult

"spellcheck":{
  "suggestions":[
    "ult",{
      "numFound":1,
      "startOffset":0,
      "endOffset":3,
      "suggestion":["ultrasharp"]},
    "collation","ultrasharp"]}
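
The prefix lookup behind auto-suggest can be illustrated with a very small trie. This sketch is a plain dict-based trie, deliberately simpler than the TST and FST implementations named above, with no weighting or ranking of suggestions; the class and the sample terms other than "ultrasharp" are made up for illustration.

```python
class Trie:
    """Plain dict-based trie; a simpler stand-in for TST/FST lookups."""

    def __init__(self):
        self.children = {}
        self.is_term = False

    def insert(self, term):
        node = self
        for ch in term:
            node = node.children.setdefault(ch, Trie())
        node.is_term = True

    def suggest(self, prefix):
        """Return all stored terms starting with `prefix`, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_term:
                found.append(text)
            stack.extend((child, text + ch)
                         for ch, child in current.children.items())
        return sorted(found)


trie = Trie()
for term in ["ultrasharp", "ultra", "usb"]:
    trie.insert(term)
suggestions = trie.suggest("ult")
```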
29
Index with JSON
URL=http://localhost:8983/solr/update/json
curl $URL -H 'Content-type:application/json' -d '
[
  {
    "id" : "978-0641723445",
    "cat" : ["book","hardcover"],
    "title" : "The Lightning Thief",
    "author" : "Rick Riordan",
    "series_t" : "Percy Jackson and the Olympians",
    "sequence_i" : 1,
    "genre_s" : "fantasy",
    "inStock" : true,
    "price" : 12.50,
    "pages_i" : 384
  }
]'
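
The same request can be prepared from Python's standard library. This is a sketch that only builds the request and does not send it; it assumes the update handler accepts a JSON list of documents at the URL shown, and `build_update_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request

URL = "http://localhost:8983/solr/update/json"

doc = {
    "id": "978-0641723445",
    "cat": ["book", "hardcover"],   # multi-valued field
    "title": "The Lightning Thief",
    "author": "Rick Riordan",
    "series_t": "Percy Jackson and the Olympians",
    "sequence_i": 1,
    "genre_s": "fantasy",
    "inStock": True,
    "price": 12.50,
    "pages_i": 384,
}


def build_update_request(docs, url=URL):
    """Build (but do not send) a POST carrying `docs` as a JSON list."""
    body = json.dumps(docs).encode("utf-8")
    return Request(url, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


request = build_update_request([doc])
# To send it against a running server: urllib.request.urlopen(request)
```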
30
Query Results in CSV
  • http://localhost:8983/solr/select?q=ipod&fl=name,price,cat,popularity&wt=csv

name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Belkin Mobile Power Cord for iPod w/ Dock,19.95,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10

  • Can handle multi-valued fields (see "cat" field
    in example)
  • Completely compatible with the CSV update handler
    (can round-trip)
  • Results are streamed: good for dumping entire
    parts of the index
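
On the client side, the multi-valued "cat" column arrives as one quoted CSV field whose values are comma-separated. This is a sketch of reading it with Python's `csv` module, assuming the default separators shown in the example; `parse_solr_csv` is a hypothetical helper and does not handle escaped separators inside individual values.

```python
import csv
import io


def parse_solr_csv(text, multivalued=()):
    """Parse a CSV response; split the named multi-valued columns on ','."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for name in multivalued:
            row[name] = row[name].split(",") if row[name] else []
        rows.append(row)
    return rows


sample = '''name,price,cat,popularity
iPod & iPod Mini USB 2.0 Cable,11.5,"electronics,connector",1
Apple 60 GB iPod with Video Playback Black,399.0,"electronics,music",10
'''
rows = parse_solr_csv(sample, multivalued=("cat",))
```

All other values stay strings, so numeric columns such as price still need converting by the caller.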

31
http://localhost:8983/solr/browse
32
Q&A