The Article Search API - PowerPoint PPT Presentation

About This Presentation
Title:

The Article Search API

Description:

James Risen and Eric Lichtblau for national reporting, for their coverage of the ... 50,289 - Obituaries. 15,189 - Articles about the Yankees. What is possible? ... – PowerPoint PPT presentation

Slides: 34
Provided by: nyt5

Transcript and Presenter's Notes

Title: The Article Search API


1
  • The Article Search API

2
Articles
  • 28 years of collective work by the journalistic
    efforts of The New York Times.

3
History 1.0
  • The first version of history.

4
1987
  • The causes of the Challenger shuttle disaster.

5
1991
  • Natalie Angier, for coverage of molecular
    biology and animal behavior.

6
1999
  • A series of articles disclosing the corporate
    sale of American technology to China with the
    approval of the U.S. government despite national
    security risks.

7
2006
  • James Risen and Eric Lichtblau for national
    reporting, for their coverage of the United
    States government's secret eavesdropping program.

8
More than Pulitzers
  • 13,192 - Theater Reviews
  • 16,474 - Movie Reviews
  • 6,999 - Recipes
  • 50,289 - Obituaries
  • 15,189 - Articles about the Yankees

9
What is possible?
10
  • Most recent articles with 'France' in the
    headline.
  • France Announces 8.5 Billion Plan to Help
    Struggling Auto Industry

11
  • People most often mentioned with George W.
    Bush in 2008.
  • OBAMA, BARACK (189)
  • MCCAIN, JOHN (160)
  • CLINTON, HILLARY RODHAM (55)
  • PAULSON, HENRY M JR (55)
  • BERNANKE, BEN S (33)
  • CHENEY, DICK (33)
  • RICE, CONDOLEEZZA (32)
  • MUSHARRAF, PERVEZ (29)
  • PUTIN, VLADIMIR V (23)

12
  • Recipes with associated thumbnail images.

13
  • First occurrence of 'internet'.
  • title: Author of Computer 'Virus' Is Son Of
    N.S.A. Expert on Data Security
  • byline: By JOHN MARKOFF
  • date: 19881105

14
  • Use of 'unemployment' by month for 2008.

15
  • Front page articles that mention Twitter.
  • 11 Articles
  • title: As Web Traffic Grows, Crashes Take Bigger
    Toll

16
  • The hidden message that has been sprinkled
    through the paper for decades?

17
Raw Data
  • 2.8 million articles
  • 37 searchable fields
  • 23 navigational faceted fields

18
(No Transcript)
19
Quick Glance
  • A few of the standard fields you can search:
  • abstract, author, body, byline, title,
    nytd_title, nytd_lead_paragraph, lead_paragraph,
    nytd_byline, text, url

20
Super Sexy Cool - Facets
  • Controlled vocabulary / normalized
  • Applied by trained professionals
  • Excellent precision

21
Human-generated Facets
  • Descriptive terms - ADVERTISING AND MARKETING
  • Names of people - O'REILLY, TIM
  • Organizational names - AL QAEDA
  • Geographic names - SPAIN

22
Auto-generated Facets
  • Publication year, month, day - 2008, 08, 21
  • Material type - News, Letter, Review, Statistics,
    Biography
  • Desk - Metropolitan Desk, Financial Desk, Sports
    Desk, National Desk

23
Search It
24
Fielded Search
  • default: text (title, byline, body)
  • searching: fieldname:value
  • example: title:france body:awesome
  • example: title:economy -body:negative
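
A minimal sketch of how the two example queries on this slide could be assembled in code. It assumes the query syntax is `fieldname:value`, with a leading `-` to exclude a term. The helper names `fielded_term` and `build_query` are hypothetical and not part of the API; the sketch only builds the query string and does not send a request.

```python
def fielded_term(field, value, exclude=False):
    """Format one term as field:value, prefixed with '-' when excluded.

    Assumption: the search syntax is fieldname:value and '-' negates a term.
    """
    term = f"{field}:{value}"
    return f"-{term}" if exclude else term


def build_query(*terms):
    """Join already-formatted terms with single spaces into one query string."""
    return " ".join(terms)


# The two examples from the slide, rebuilt from their parts.
include_both = build_query(
    fielded_term("title", "france"),
    fielded_term("body", "awesome"),
)
exclude_body = build_query(
    fielded_term("title", "economy"),
    fielded_term("body", "negative", exclude=True),
)

print(include_both)
print(exclude_body)
```

Keeping term formatting separate from joining makes it straightforward to mix fielded terms with plain default-field words in the same query.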

25
Date Range
26
Facets
  • We read ALL the articles that match a search,
    sum the number of occurrences of a given facet,
    and return the most popular facet values.
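
The counting described on this slide can be sketched locally as follows. This is an illustration of the idea, not the service's implementation: the article records, the `publication_year` key, and the `top_facets` helper are all made up for the example.

```python
from collections import Counter


def top_facets(articles, facet, limit=10):
    """Count every value of one facet across the matching articles
    and return the most frequent (value, count) pairs."""
    counts = Counter()
    for article in articles:
        # An article may carry zero, one, or several values for a facet.
        counts.update(article.get(facet, []))
    return counts.most_common(limit)


# Hypothetical articles that matched some search.
matching_articles = [
    {"title": "Story A", "publication_year": ["2007"]},
    {"title": "Story B", "publication_year": ["2007"]},
    {"title": "Story C", "publication_year": ["2008"]},
    {"title": "Story D"},  # no value for this facet
]

print(top_facets(matching_articles, "publication_year"))
```

Because every matching article is read, the counts describe the whole result set rather than only the page of articles returned to the caller.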

27
Facet Example
  • For the query 'Google', return a set of
    publication-year facets.
  • 2007 (1060) 2008 (1017) 2006 (953)
    2005 (642) 2004 (453) 2003 (186)
    2009 (146) 2002 (106) 2001 (75)
    2000 (21)
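
The facet values above arrive ordered by count, most popular first. A short sketch of reordering them chronologically, for example to plot mentions per year, might look like this. The numbers are copied from the slide; the variable names are illustrative only.

```python
# (year, count) pairs in the order shown on the slide: most popular first.
year_counts = [
    ("2007", 1060), ("2008", 1017), ("2006", 953),
    ("2005", 642), ("2004", 453), ("2003", 186),
    ("2009", 146), ("2002", 106), ("2001", 75),
    ("2000", 21),
]

# Four-digit year strings sort correctly as text, so a plain sort
# puts the pairs in chronological order.
chronological = sorted(year_counts)

# Total across the facet values that were returned; this is not
# necessarily the number of matching articles.
total_returned = sum(count for _, count in year_counts)

for year, count in chronological:
    print(year, count)
print("total", total_returned)
```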

28
What are folks doing?
Jer Thorp blprnt.com
29
What are folks doing?
It is a tool for analyzing web pages for
keywords, displaying them as links to search
results from various services around the
web. http://semantalyzr.com/
30
What are folks doing?
http://nytexplorer.com/
31
What are folks doing?
Ruby Gem: NYTimes Articles
http://github.com/harrisj/nytimes-articles/tree/master
32
What are folks doing?
NYT Trendr 2008 http://tyn-search.appspot.com/
33
Future
  • Related documents
  • OpenCalais integration
  • More fields
  • More article data
  • OpenSearch RSS/JSON
  • You tell me