Lucene - PowerPoint PPT Presentation

Provided by: ucs59
Transcript and Presenter's Notes

Title: Lucene


1
Lucene & Nutch
  • Lucene
  • Project name
  • Started as text index engine
  • Nutch
  • A complete web search engine, including
  • Crawling, indexing, searching
  • Index 100M pages, crawl >10M/day
  • Provide distributed architecture
  • Written in Java
  • Other language ports are work-in-progress

2
Lucene
  • Open source search project
  • http://lucene.apache.org
  • Index & search local files
  • Download lucene-2.2.0.tar.gz from
    http://www.apache.org/dyn/closer.cgi/lucene/java/
  • Extract files
  • Build an index for a directory
  • java org.apache.lucene.demo.IndexFiles dir_path
  • Try search at command line
  • java org.apache.lucene.demo.SearchFiles
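
The two demo commands above need the Lucene jars on the Java classpath, which the slide does not show. The following is a minimal sketch of that step, not the official procedure: the extraction directory lucene-2.2.0, the two jar file names, and the helper functions are assumptions to adjust to your download. By default it only prints the commands; set RUN=1 to execute them.

```shell
#!/bin/sh
# Sketch: index a directory and search it with the Lucene demo classes.
# ASSUMPTIONS: the tarball was extracted to ./lucene-2.2.0 and the jar names
# below match your download; adjust LUCENE_HOME and the jars if they differ.
LUCENE_HOME="${LUCENE_HOME:-lucene-2.2.0}"
LUCENE_CP="$LUCENE_HOME/lucene-core-2.2.0.jar:$LUCENE_HOME/lucene-demos-2.2.0.jar"

# Print the command instead of running it unless RUN=1 is set (dry run).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Build an index for the directory given as $1.
index_dir() {
    run java -cp "$LUCENE_CP" org.apache.lucene.demo.IndexFiles "$1"
}

# Start the command-line search over the index built above.
search_index() {
    run java -cp "$LUCENE_CP" org.apache.lucene.demo.SearchFiles
}

index_dir "${1:-docs}"
search_index
```

The directory name docs is only a placeholder default; pass the path you want indexed as the first argument.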

3
Deploy Lucene
  • Copy luceneweb.war to your tomcat-home/webapps
  • Browse to http://localhost:8080/luceneweb
  • Tomcat will deploy the web app.
  • Edit webapps/luceneweb/configuration.jsp
  • Point indexLocation to your indexes
  • Search at http://localhost:8080/luceneweb

4
Nutch
  • A complete search engine: http://lucene.apache.org/nutch/release/
  • Mode
  • Intranet/local search
  • Internet search
  • Usage
  • Crawl
  • Index
  • Search

5
Intranet Search
  • Configuration
  • Input URLs: create a directory and seed file
  • mkdir urls
  • echo http://www.cs.ucsb.edu > urls/ucsb
  • Edit conf/crawl-urlfilter.txt and replace
    MY.DOMAIN.NAME with cs.ucsb.edu
  • Edit conf/nutch-site.xml

6
Intranet: Running the Crawl
  • Crawl options include
  • -dir dir names the directory to put the crawl
    in.
  • -threads threads determines the number of
    threads that will fetch in parallel.
  • -depth depth indicates the link depth from the
    root page that should be crawled.
  • -topN N determines the maximum number of pages
    that will be retrieved at each level up to the
    depth.
  • E.g.
  • bin/nutch crawl urls -dir crawl -depth 3
    -topN 50
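
To make the effect of the options concrete, here is a small sketch that assembles the crawl command from named settings and computes the upper bound on fetched pages implied by -depth and -topN. The variable names and the default thread count are assumptions; only the flags themselves come from the slide. The script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: assemble the one-shot crawl command from named settings.
# ASSUMPTIONS: variable names and default values are illustrative; only the
# flags (-dir, -threads, -depth, -topN) come from the slide.
URL_DIR="${URL_DIR:-urls}"       # directory holding the seed file(s)
CRAWL_DIR="${CRAWL_DIR:-crawl}"  # -dir: where the crawl is written
THREADS="${THREADS:-10}"         # -threads: parallel fetcher threads
DEPTH="${DEPTH:-3}"              # -depth: link depth from the root page
TOPN="${TOPN:-50}"               # -topN: max pages fetched per level

# Print the full command line for the current settings.
crawl_cmd() {
    echo "bin/nutch crawl $URL_DIR -dir $CRAWL_DIR -threads $THREADS -depth $DEPTH -topN $TOPN"
}

# Upper bound on pages fetched: at most TOPN pages at each of DEPTH levels.
max_pages() {
    echo $((DEPTH * TOPN))
}

crawl_cmd
echo "at most $(max_pages) pages"
```

With the slide's example values (-depth 3 -topN 50) the bound is 3 x 50 = 150 pages, which is why small values are convenient for a first test crawl.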

7
Intranet Search
  • Deploy nutch war file
  • rm -rf TOMCAT_DIR/webapps/ROOT
  • cp nutch-0.9.war TOMCAT_DIR/webapps/ROOT.war
  • The webapp finds indexes in ./crawl, relative to
    where you start Tomcat
  • TOMCAT_DIR/bin/catalina.sh start
  • Search at http://localhost:8080/
  • CS.UCSB domain demo: http://hactar.cs.ucsb.edu:8080

8
Internet Crawling
  • Concept
  • crawldb: all URL info
  • linkdb: list of known links to each URL
  • segments: each is a set of URLs that are fetched
    as a unit
  • indexes: Lucene-format indexes

9
Internet Crawling Process
  1. Get seed URLs
  2. Fetch
  3. Update crawl DB
  4. Compute top URLs, goto 2
  5. Create Index
  6. Deploy
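
The loop between steps 2 and 4 can be sketched as a dry run. This is an illustration of the control flow only: ROUNDS, TOPN, the kids/ paths, the seed directory name, and the nutch() stub are assumptions, and the stub prints each command instead of invoking the real bin/nutch. For simplicity every round passes -topN.

```shell
#!/bin/sh
# Sketch of the crawl cycle as a dry run.
# ASSUMPTIONS: ROUNDS, TOPN, the paths and the nutch() stub are illustrative.
# The stub only prints each command; replace its body with a call to the
# real bin/nutch to run the crawl.
ROUNDS="${ROUNDS:-2}"
TOPN="${TOPN:-50000}"
CRAWLDB=kids/crawldb
SEGMENTS=kids/segments

nutch() { echo "bin/nutch $*"; }

crawl_rounds() {
    # 1. Get seed URLs into the crawl DB.
    nutch inject "$CRAWLDB" 67k-url/
    round=1
    while [ "$round" -le "$ROUNDS" ]; do
        # Compute the top-scoring URLs and write a new segment (fetchlist).
        nutch generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
        # A real run would use the newest directory under $SEGMENTS here;
        # the dry run uses a placeholder name instead.
        segment="$SEGMENTS/<newest-segment>"
        nutch fetch "$segment"                # 2. Fetch
        nutch updatedb "$CRAWLDB" "$segment"  # 3. Update crawl DB
        round=$((round + 1))                  # 4. Go to 2
    done
    # 5. Create the index once all rounds are done.
    nutch invertlinks kids/linkdb "$SEGMENTS/*"
    nutch index kids/indexes "$CRAWLDB" kids/linkdb "$SEGMENTS/*"
}

crawl_rounds
```

Each extra round reaches pages one link further from the seeds, so ROUNDS plays the role that -depth plays in the one-shot intranet crawl.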

10
Seed URL
  • URLs from the DMOZ Open Directory
  • wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
  • gunzip content.rdf.u8.gz
  • mkdir dmoz
  • bin/nutch org.apache.nutch.tools.DmozParser
    content.rdf.u8 -subset 5000 > dmoz/urls
  • Kids search URL from ask.com
  • Inject URLs
  • bin/nutch inject kids/crawldb 67k-url/
  • Edit conf/nutch-site.xml

11
Fetch
  • Generate a fetchlist from the database
  • bin/nutch generate kids/crawldb kids/segments
  • Save the name of fetchlist in variable s1
  • s1=`ls -d kids/segments/2* | tail -1`
  • Run the fetcher on this segment
  • bin/nutch fetch $s1
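
The segment-selection idiom (list the segment directories, keep the last one) works because segment directories are named with timestamps, so the lexically last name is also the newest. The sketch below demonstrates this on made-up directories; the scratch location and the three timestamp names are assumptions.

```shell
#!/bin/sh
# Sketch: why listing segments and taking the last line yields the segment
# that was just generated.
# ASSUMPTIONS: the scratch directory and the three timestamp names are made
# up; the point is that timestamp-style names sort chronologically.
set -e
DEMO_DIR="$(mktemp -d)"
mkdir -p "$DEMO_DIR/segments/20070801120000" \
         "$DEMO_DIR/segments/20070802093000" \
         "$DEMO_DIR/segments/20070803174500"

# ls sorts names lexically; for fixed-width timestamps that is also
# chronological order, so the last line is the newest segment.
s1=`ls -d "$DEMO_DIR"/segments/2* | tail -1`

basename "$s1"   # prints 20070803174500
```

The pattern 2* simply matches directory names that begin with the digit 2, that is, timestamps from the year 2000 onward.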

12
Update Crawl DB and Re-fetch
  • Update crawl db with the results of the fetch
  • bin/nutch updatedb kids/crawldb $s1
  • Generate top-scoring 50K pages
  • bin/nutch generate kids/crawldb kids/segments
    -topN 50000
  • Refetch
  • s1=`ls -d kids/segments/2* | tail -1`
  • bin/nutch fetch $s1

13
Index, Deploy, and Search
  • Create inverted index
  • bin/nutch invertlinks kids/linkdb kids/segments/*
  • Index the segments
  • bin/nutch index kids/indexes kids/crawldb
    kids/linkdb kids/segments/*
  • Deploy Search
  • Same as in Intranet search
  • Demo of 1M pages (570K 500K)?

14
Issues
  • The default re-crawl cycle is 30 days for all URLs
  • Duplicates are pages that have the same URL or the
    same MD5 hash of page content
  • The JavaScript parser uses regular expressions to
    extract URL literals from code.
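
Two of these points can be illustrated with standard tools. The sketch below is not Nutch's implementation: the function names, sample files, and regular expression are assumptions, and it relies on md5sum from GNU coreutils. The first function keeps one file per distinct content hash; the second pulls quoted URL literals out of JavaScript source.

```shell
#!/bin/sh
# Sketch of two ideas from the slide, using standard tools only.
# ASSUMPTIONS: file layout, function names and the regular expression are
# illustrative, not Nutch's actual code. Requires md5sum (GNU coreutils) and
# file paths without spaces.

# Print one file per distinct content hash: a file whose MD5 matches an
# earlier file is treated as a duplicate and skipped.
unique_by_md5() {
    md5sum "$@" | awk '!seen[$1]++ { print $2 }'
}

# Pull quoted http(s) URL literals out of JavaScript source on stdin.
extract_js_urls() {
    grep -oE "[\"']https?://[^\"']+[\"']" | tr -d "\"'"
}

DEMO_DIR="$(mktemp -d)"
printf 'hello\n' > "$DEMO_DIR/a.html"
printf 'hello\n' > "$DEMO_DIR/b.html"   # same content as a.html
printf 'world\n' > "$DEMO_DIR/c.html"

# Lists a.html and c.html; b.html is dropped as a content duplicate.
unique_by_md5 "$DEMO_DIR/a.html" "$DEMO_DIR/b.html" "$DEMO_DIR/c.html"

# Prints the URL literal found in the script text.
printf '%s\n' 'window.location = "http://www.cs.ucsb.edu/a"; var x = 1;' | extract_js_urls
```

A regular expression of this kind only finds URLs written out as string literals; URLs that a script builds at run time by concatenation are missed, which is the limitation the slide points at.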