Web Scraping Using Nutch and Solr 3/3

Description:

A short presentation (part 3 of 3) describing the use of the open source
tools Nutch and Solr to web crawl the internet and process the data.

Transcript and Presenter's Notes



1
Solr Extracting Data
  • Start this session with a fully indexed Solr
    repository
  • Movie cAiYBD4BQeE showed the installation
  • Movie Th5Scvlyt-E showed the Nutch web crawl
  • This movie will show how to
    • Extract data from Solr
    • Extract to XML or CSV
    • Show the aim of loading into a data warehouse
  • This movie assumes you know Linux

2
Solr Extracting Data
  • Progress so far; the greyed-out area has yet to be
    examined

3
Checking Solr Data
  • Data should have been indexed in Solr
  • In the Solr Admin window
    • Set 'Core Selector' to collection1
    • Click 'Query'
    • In the Query window, set the fl field to url
    • Click 'Execute Query'
  • The result (next slide) shows the filtered list of
    urls held in Solr (the same query can also be run
    over HTTP, as sketched after this list)
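
The same check can be made from the command line. A minimal sketch, assuming
Solr is listening on localhost:8983 as in the later slides; the match-all
query q=*:* is an assumption:

    # List only the url field of every indexed document, in CSV form
    curl 'http://localhost:8983/solr/select?q=*:*&fl=url&wt=csv'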

4
Checking Solr Data
  (Screenshot: the Solr Admin query result listing the indexed urls.)
5
How To Extract
  • How could we get at the Solr data?
    • In the Admin console via a query
    • Via an HTTP Solr select call
    • Via a curl -o call using the Solr HTTP select URL
  • What format of data suits this purpose? (both are
    sketched after this list)
    • XML
    • Comma separated values (CSV)
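
A minimal sketch of the two output formats, using the host, port and fields
from the following slides; only the wt response-writer parameter changes
(the match-all query q=*:* is an assumption):

    # Same select call, two response writers
    curl -o result.xml 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=xml'
    curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'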

6
How To Extract
  • We want to extract two columns from Solr
    • tstamp, url
  • We want to extract as CSV (csv in the call below
    could be xml)
  • We want to extract to a file
  • So we will use an http call
    • http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv
  • We will also use a curl call
    • curl -o <csv file> '<http call>'

7
How To Extract
  • Create a bash file in the Solr install directory
    • cd solr-4-2-1/extract
    • touch solr_url_extract.bash
    • chmod 755 solr_url_extract.bash
  • Add contents to the bash file (the full script is
    shown after this list)
    • #!/bin/bash
    • curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
    • mv result.csv result.csv.$(date +%Y%m%d.%H%M%S)
  • Now run the bash script
    • ./solr_url_extract.bash
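
Collected into one file, the script described above might look as follows
(a sketch; the match-all query q=*:* is an assumption, the rest is taken
from the slide):

    #!/bin/bash
    # solr_url_extract.bash - extract the tstamp and url fields from Solr as CSV
    curl -o result.csv 'http://localhost:8983/solr/select?q=*:*&fl=tstamp,url&wt=csv'
    # Timestamp the output so that repeated runs do not overwrite each other
    mv result.csv result.csv.$(date +%Y%m%d.%H%M%S)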

8
Check Output
  • Now we check whether we have data (the commands
    are collected after this list)
  • ls -l shows
    • result.csv.20130506.124857
  • Checking the content, wc -l shows 11 lines
  • Checking the content, head -2 shows
    • tstamp,url
    • 2013-05-04T01:56:58.157Z,http://www.mysite.co.nz/Search?DateRange=7 ...
  • Congratulations, you have extracted data from
    Solr
  • It's in CSV format, ready to be loaded into a
    data warehouse
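
The checks from this slide, collected as they might be run (the file name is
taken from the slide's example output):

    ls -l                                # shows result.csv.20130506.124857
    wc -l result.csv.20130506.124857     # 11 lines, including the header
    head -2 result.csv.20130506.124857   # header row plus the first data row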

9
Possible Next Steps
  • Choose more fields to extract from the data
  • Allow the Nutch crawl to go deeper
  • Allow the Nutch crawl to collect a lot more data
  • Look at facets in the Solr data
  • Load the CSV files into a data warehouse staging
    schema (a hypothetical load is sketched after this
    list)
  • The next movie will show the next step in progress
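
The slides do not name a particular warehouse, so purely as a hypothetical
illustration, loading one extract file into a PostgreSQL staging table could
look like this (the database and table names are invented):

    # Hypothetical staging table, then a psql \copy load of one extract file
    psql -d warehouse -c "CREATE TABLE IF NOT EXISTS staging_crawl (tstamp timestamptz, url text)"
    psql -d warehouse -c "\copy staging_crawl from 'result.csv.20130506.124857' csv header"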

10
Contact Us
  • Feel free to contact us at
    • www.semtech-solutions.co.nz
    • info@semtech-solutions.co.nz
  • We offer IT project consultancy
  • We are happy to hear about your problems
  • You can just pay for the hours that you need to
    solve your problems