Title: ATLAS Web Crawling for Data
1. ATLAS Web Crawling for Data
- Ashwin Tengli
- Language Technologies Institute
- Carnegie Mellon University
2. Outline
- Event detection and tracking: finding news on an interesting topic
- How can ATLAS help?
- Collecting data and using it
- Future Work
3. Finding News on a Topic
- Manually monitor news sites
- Use search engines to find news
Drawbacks:
- Time-consuming; limited number of sites covered
- Need to know many languages
4. Using ATLAS to solve the problem
- ATLAS will monitor the web for documents on the
topic of interest
- Look for events in non-English documents too
- Provide event tracking support
[Diagram labels: "Select Event To Track", "Event in English Found", "Event Detection", "Arabic and French Event Found"]
5. Training Data for ATLAS
- Needs topic-specific training data, in addition to general data
- Needs data in multiple languages (parallel text)
6. Crawling for Data: Topic-focused data collection and filtering
- Finding on-topic web pages
- Filtering content
- Using Named Entities to find event descriptions
- Finding non-English documents
7. Topic-focused data collection and filtering: Finding web pages
- Using search engines
- Traversing the web as a graph (crawling)
- Follow links
- Filtering out non-relevant pages using
- Text Classification
- Named Entities
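The crawl-and-filter loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the ATLAS crawler: the in-memory `TOY_WEB`, the `TOPIC_TERMS` vocabulary, and the keyword-count `is_relevant` check are all assumptions standing in for real HTTP fetching and for a trained text classifier or named-entity filter.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over HTTP instead.
TOY_WEB = {
    "http://example.org/seed": ("anthrax outbreak reported in city",
                                ["http://example.org/a", "http://example.org/b"]),
    "http://example.org/a": ("officials confirm anthrax cases",
                             ["http://example.org/c"]),
    "http://example.org/b": ("football scores and transfer news",
                             ["http://example.org/d"]),
    "http://example.org/c": ("vaccine stockpile for smallpox and anthrax", []),
    "http://example.org/d": ("anthrax story linked only from an off-topic page", []),
}

# Assumed topic vocabulary; stands in for a trained classifier.
TOPIC_TERMS = {"anthrax", "smallpox", "outbreak", "vaccine"}


def is_relevant(text, min_hits=1):
    """Stand-in relevance filter: count distinct topic terms in the text."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) >= min_hits


def focused_crawl(seeds, fetch, max_pages=100):
    """Breadth-first traversal that follows links only from relevant pages."""
    queue = deque(seeds)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        page = fetch(url)
        fetched += 1
        if page is None:
            continue  # unknown URL in the toy web
        text, links = page
        if not is_relevant(text):
            continue  # drop the page and do not follow its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return relevant


result = focused_crawl(["http://example.org/seed"], TOY_WEB.get)
print(result)
```

One consequence of pruning at off-topic pages is visible in the toy data: page `d` is on topic but is never reached, because its only inbound link comes from a page that was filtered out. Seeding the queue from search-engine results, as the first bullet suggests, is one way to reduce that loss.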
8. Topic-focused data collection and filtering: Filtering Content
- Extracting relevant content from HTML
[Figure label: "Relevant Content"]
- Use maximum entropy models or Hidden Markov Models to pinpoint relevant content
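To make the Hidden Markov Model idea concrete, here is a small sketch that labels a page's text blocks as "content" or "boilerplate" with Viterbi decoding. Everything specific in it is an assumption for illustration: the two states, the single "short"/"long" block feature, the 8-word threshold, and the hand-set probabilities (a real model would learn these from labeled pages). It also assumes the page has already been split into text blocks.

```python
import math

STATES = ("boilerplate", "content")

# Hand-set, illustrative probabilities (assumptions, not learned values).
START = {"boilerplate": 0.7, "content": 0.3}
TRANS = {
    "boilerplate": {"boilerplate": 0.7, "content": 0.3},
    "content": {"boilerplate": 0.2, "content": 0.8},
}
EMIT = {
    "boilerplate": {"short": 0.9, "long": 0.1},
    "content": {"short": 0.3, "long": 0.7},
}


def observe(block, long_threshold=8):
    """Map a text block to a coarse observation symbol by word count."""
    return "long" if len(block.split()) >= long_threshold else "short"


def viterbi(observations):
    """Return the most likely state sequence for the observations."""
    if not observations:
        return []
    # best[s] = (log-probability of the best path ending in s, that path)
    best = {
        s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), [s])
        for s in STATES
    }
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (score + math.log(EMIT[s][obs]), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


blocks = [
    "Home | World | Sports",
    "Login or register",
    "Officials said on Tuesday that the outbreak had spread to three more districts.",
    "More details soon.",
    "Health workers have begun distributing vaccines in the affected areas of the city.",
    "Copyright 2003 Example News",
    "Contact us",
]
labels = viterbi([observe(b) for b in blocks])
for label, block in zip(labels, blocks):
    print(f"{label:12s} {block}")
```

The point of using a sequence model rather than a per-block rule shows up at the fourth block: it is short, but because it sits between two long paragraphs, the transition probabilities keep it in the "content" state.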
9. Topic-focused data collection and filtering: Using Named Entities
- Named entities signify relevance
- Help identify events
Example entities: Iraq; Biological Warfare Agents; Hussein Kamal; 1988; Smallpox; Saddam Hussein; 8,500 liters; Anthrax; 15,000; 24,000 liters; botulinum toxin
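As a rough illustration of using named entities as a relevance signal, the sketch below counts how many known entities appear in a passage. It uses a fixed gazetteer lookup, which is a simplification: an actual system would presumably run a trained named-entity tagger. The `GAZETTEER` entries are taken from the example on this slide, while the function names, the `min_entities` threshold, and the sample sentences are assumptions.

```python
import re

# Entity names drawn from the slide's example; the list itself is illustrative.
GAZETTEER = ["Iraq", "Saddam Hussein", "anthrax", "smallpox", "botulinum toxin"]


def find_entities(text, gazetteer=GAZETTEER):
    """Return gazetteer entries that occur in the text as whole words."""
    found = []
    for entity in gazetteer:
        pattern = r"\b" + re.escape(entity) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(entity)
    return found


def is_event_description(text, min_entities=2):
    """Treat a passage as a likely event description if enough entities co-occur."""
    return len(find_entities(text)) >= min_entities


# Hypothetical sample passages.
on_topic = "Iraq admitted producing 8,500 liters of anthrax and botulinum toxin."
off_topic = "The festival in Pittsburgh drew record crowds this weekend."

print(find_entities(on_topic), is_event_description(on_topic))
print(find_entities(off_topic), is_event_description(off_topic))
```

Requiring several entities to co-occur, rather than any single match, is one simple way to separate event descriptions from passing mentions.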
10. Topic-focused data collection and filtering: Finding non-English documents
- Non-English web pages carry relevant news
- They are often the first to break the news
- Need to identify language
- Automatically
- Using HTML meta-data
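The two identification routes above can be combined: trust the page's declared language when it exists, and fall back to an automatic guess from the text otherwise. The sketch below does this with Python's standard `html.parser`. The tiny `STOPWORDS` lists (English and French only) and the stopword-counting fallback are assumptions chosen to keep the example short; a production system would more likely use character n-gram models covering many languages.

```python
from html.parser import HTMLParser

# Tiny illustrative stopword lists; real systems use much richer models.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "fr": {"le", "la", "les", "et", "des", "est", "que", "dans"},
}


class LangParser(HTMLParser):
    """Collect the declared language (if any) and the visible text."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if self.lang:
            return
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.lang = attrs["lang"]
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            self.lang = attrs.get("content")

    def handle_data(self, data):
        self.text_parts.append(data)


def guess_from_text(text):
    """Pick the language whose stopwords occur most often, or None."""
    tokens = text.lower().split()
    scores = {lang: sum(tok in words for tok in tokens)
              for lang, words in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def detect_language(html):
    """Prefer HTML metadata; fall back to a guess from the page text."""
    parser = LangParser()
    parser.feed(html)
    if parser.lang:
        return parser.lang.split("-")[0].lower()
    return guess_from_text(" ".join(parser.text_parts))


declared = '<html lang="ar"><body><p>...</p></body></html>'
meta = ('<html><head><meta http-equiv="Content-Language" content="fr-FR"></head>'
        '<body><p>Bonjour</p></body></html>')
undeclared = ('<html><body><p>Le ministre est dans la ville et les '
              'journalistes attendent.</p></body></html>')

print(detect_language(declared), detect_language(meta), detect_language(undeclared))
```

Metadata is checked first because it is cheap, but declared languages on real pages are sometimes missing or wrong, which is why an automatic fallback is still needed.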
11. Crawling for Data: Parallel Text
- Parallel Text: Web pages in different languages whose content is a translation of each other
- Comparable Text: Web pages in different languages whose content is on the same topic
- This data is required for training ATLAS for event detection in non-English web pages
12. Parallel Text and Comparable Text
- Use heuristics to detect parallel and comparable text
- URL format
- HTML structure similarity
- Link structure of the website
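Two of these heuristics, URL format and HTML structure similarity, can be sketched in a few lines. The example below assumes that sites mark language with a short code in the URL (the `LANG_CODES` tuple is a made-up set) and compares pages by their sequence of opening tags using `difflib.SequenceMatcher`. The function names and sample pages are hypothetical, and the link-structure heuristic is not covered here.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

LANG_CODES = ("en", "fr", "ar")  # assumed language markers used in URLs


def url_key(url):
    """Replace a delimited language code in the URL with a placeholder."""
    pattern = r"(?<=[/._-])(" + "|".join(LANG_CODES) + r")(?=[/._-]|$)"
    return re.sub(pattern, "{lang}", url)


def candidate_groups(urls):
    """Group URLs that differ only in their language code."""
    groups = {}
    for url in urls:
        key = url_key(url)
        if key != url:  # only URLs that contain a language code
            groups.setdefault(key, []).append(url)
    return [group for group in groups.values() if len(group) > 1]


class TagSequenceParser(HTMLParser):
    """Record the sequence of opening tags, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    parser = TagSequenceParser()
    parser.feed(html)
    return parser.tags


def structure_similarity(html_a, html_b):
    """Similarity in [0, 1] between the tag sequences of two pages."""
    return SequenceMatcher(None, tag_sequence(html_a), tag_sequence(html_b)).ratio()


urls = [
    "http://example.org/en/news/story1.html",
    "http://example.org/fr/news/story1.html",
    "http://example.org/en/about.html",
]
en_page = "<html><body><h1>Title</h1><p>One</p><p>Two</p></body></html>"
fr_page = "<html><body><h1>Titre</h1><p>Un</p><p>Deux</p></body></html>"
other_page = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"

print(candidate_groups(urls))
print(structure_similarity(en_page, fr_page))
print(structure_similarity(en_page, other_page))
```

In this sketch the URL heuristic proposes candidate pairs cheaply, and the structure comparison is then used to confirm them: translated pages tend to share markup even though their text differs, while unrelated pages usually do not.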
13. Future Work
- Learn relations among topics on the web
- Use this to improve topic-focused data collection
[Figure: diagram of topics labeled Topic Q, Topic S, Topic D, Topic X, Topic J, Topic N, Topic A, Topic M, Topic G, Topic B]
14. Conclusions
- Training data is crucial
- We need to supplement general training data with topic-specific data
- The Web
- Good source of multilingual data
- More realistic data