Minerva: Web Mining for Multimedia QA
1
Minerva: Web Mining for Multimedia QA
  • Presented by Yiming Yang
  • Joint work by the Minerva Team
  • LTI & CSD, School of Computer Science
  • Carnegie Mellon University

2
Team
  • Yiming Yang (PI)
  • Jaime Carbonell (Co-PI)
  • Bryan Kisiel (Sr. programmer)
  • Nianli Ma (programmer & grad student)
  • Ashwin Tengli (graduate student)
  • Shinjae Yoo (graduate student)
  • Derek Leung, Jibran Rashid (undergraduates)

3
Emphases in AQUAINT
  • Phase I: Go beyond passage retrieval
  • Early TREC-style Q/A is not sufficient
  • Phase II: Go beyond single documents
  • Extract information from multiple documents and
    generate synthetic answers

4
Primary Claims in Minerva
  • Mining the web for statistical information
  • visiting hundreds of university web sites to get
    aggregate statistics for a synthetic answer
  • (e.g., the increase of international students
    over recent years)
  • Answering questions using tables, graphs and
    supporting materials (keywords, URLs, etc.)
  • Developing meaningful similarity metrics for
    comparing curves and temporal trends (see the
    sketch below)
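The slides do not spell out the similarity metric, so the following is only a minimal sketch of one plausible choice: comparing the year-over-year changes of two yearly statistics by Pearson correlation. The function names and example numbers are invented for illustration; this is not Minerva's actual metric.

    # Hedged sketch of a trend-similarity score; not Minerva's actual metric.
    from math import sqrt

    def pearson(xs, ys):
        """Pearson correlation between two equal-length numeric sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    def trend_similarity(a, b):
        """Correlate year-over-year changes, so curves of similar shape but
        different magnitude still score close to 1."""
        da = [y - x for x, y in zip(a, a[1:])]
        db = [y - x for x, y in zip(b, b[1:])]
        return pearson(da, db)

    # Invented example: international-student counts at two universities.
    print(trend_similarity([1200, 1350, 1500, 1700, 1850],
                           [400, 455, 510, 580, 630]))   # ~0.96, similar trends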

5
Minerva System Overview
[System architecture diagram: a Web Crawler and Document Filter pull on-topic documents from related web sites; a Table Extractor and an NE Extractor (IE, using training examples) populate an Internal DB; the Query Interpreter turns Q/A requests from the GUI into sub-tasks and internal queries for the Table, Text, and Graph Agents; the Answer Composer assembles the retrieved data into the answer.]
6
Statistic-oriented Questions
  • Template-based (see the sketch below)
  • Retrieval type
  • Aggregation type
  • Correlation type
  • Domain restrictions may be added for each query
    type
  • Example: Which kinds of crime show trends similar
    to the total number of Drug Trafficking crimes in
    the state of Ohio from 1982 to 2002?
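The template format itself is not shown in the slides; the sketch below is only one way the three query types and their domain restrictions might be written down. All field names and values here are assumptions, not the system's schema.

    # Illustrative query-template representation; field names are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class StatQuery:
        qtype: str                   # "retrieval", "aggregation", or "correlation"
        statistic: str               # the statistic the question is about
        restrictions: dict = field(default_factory=dict)  # optional domain restrictions

    # The example question above, expressed as a retrieval-type query:
    example = StatQuery(
        qtype="retrieval",
        statistic="total Drug Trafficking crimes",
        restrictions={"state": "Ohio", "years": (1982, 2002),
                      "target": "crime types with similar trends"},
    )
    print(example.qtype, example.restrictions["years"])   # retrieval (1982, 2002)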

7
Graphical Answer
8
GUI with Supporting Materials (URLs, Named
Entities, keywords, tables)
URLs of extracted information
Graph Results
Graph supporting tables
9
Accomplishments in Jan-Nov 2003
  • Data acquisition from the Web
  • Algorithm design and implementation
  • Corpus preparation & annotation
  • Mid-project evaluation

10
Statistical Data From the Web
  • University Data
  • From 157 (out of 445) university web sites, 170
    MB compressed
  • Pages containing Common Data Set (CDS), standard
    information provided (supposedly) by all
    universities in the USA
  • Criminal Data
  • From Ohio State web sites
  • Information about 315,000 criminals, one page per
    criminal, 5 GB compressed in total
  • Property Data
  • From Allegheny County (PA) and Allen County (OH)
  • Information about 300,000 properties in 1.4
    million web pages (13 GB compressed)

11
Data in Different Formats
  • Criminal records
  • Property tax records
  • University data

12
Data containing Named Entities (person names,
dates, places, etc.)
13
Data Exhibiting Trends
14
Data Exhibiting Correlations
  • Data in different domains can be correlated
    through shared Named Entities
  • e.g.
  • Property tax records and criminal records with the
    same person names (see the toy example below)
  • Interesting trends over the same period of time
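As a toy illustration of linking two domains through a shared Named Entity, the snippet below joins invented property-tax and criminal records on a person name; the records are made up, and the real data is far larger and messier.

    # Toy example: link domains through a shared Named Entity (a person name).
    property_tax = {"John Doe": 2400, "Jane Roe": 1800}           # name -> annual tax
    criminal_records = {"John Doe": ["theft"], "Ann Poe": ["fraud"]}

    for name in set(property_tax) & set(criminal_records):
        print(name, property_tax[name], criminal_records[name])
    # -> John Doe 2400 ['theft']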

15
Data on a Semantic Structure
  • Common Data Set (CDS)
  • Standard information provided by U.S.
    universities since 1999
  • Each university has its own format for presenting
    the tables
  • We collect relevant portions (shown) of the data
    for our Q/A system

16
Data-related Issues
  • Data may have common structures semantically
  • Data may differ syntactically (in table
    formatting and attribute names)
  • Data often contain Named Entities
  • Data (statistics) may exhibit trends over time
  • Data from different domains may be correlated
  • How do we take advantage of those
    characteristics in statistic-based QA?

17
Accomplishments in Jan-Nov 2003
  • Data acquisition from the Web
  • Algorithm design and implementation
  • Table detection
  • Table extraction
  • Template-based Q/A (graph/table/text agents)
  • Curve-similarity based retrieval (next focus)
  • Corpus preparation & annotation
  • Mid-project evaluation

18
Table Detection
  • Using <table> tags to identify a table
  • Not always a good indicator of a true table
  • This page, for example, contains 45 <table> tags

19
Table Detection: Our Approach
  • Running a decision tree algorithm to learn rules
    based on features such as the following (a hedged
    sketch follows this list)
  • Number of cells
  • Number of text cells
  • Number of rows
  • Number of columns
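The slides name only the learner (a decision tree) and the four features; the sketch below shows that idea with scikit-learn on a handful of invented, hand-labeled examples. Minerva's real training data, thresholds, and full feature set are not shown here.

    # Hedged sketch of decision-tree table detection over the four features above.
    # The labeled examples are invented; only the feature names come from the slide.
    from sklearn.tree import DecisionTreeClassifier

    # [number of cells, number of text cells, number of rows, number of columns]
    X = [
        [40, 38, 10, 4],   # dense grid of labels and numbers -> genuine data table
        [ 3,  3,  1, 3],   # navigation bar laid out with <table> -> not a table
        [24, 20,  8, 3],
        [ 2,  1,  2, 1],
    ]
    y = [1, 0, 1, 0]       # 1 = true table, 0 = layout-only markup

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict([[30, 28, 10, 3]]))   # likely a true table -> [1]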

20
Table Extraction
Input → Desired Output:
Undergraduate, Degree Seeking Freshmen, Full-Time Men > 784
Undergraduate, Other First-year Degree Seeking, Full-Time Men > 14
Undergraduate, All other degree seeking, Full-Time Men > 2365
...
21
Table Extraction: Our Approach
  • Use domain knowledge
  • predefined attribute names distinguishing labels
    (of rows and columns) from data
  • the CDS hierarchy, identifying the relationship
    between attributes (super-categories vs.
    sub-categories)
  • Use structural clues (see the sketch below)
  • spanning cells defining recursive structures of
    tables and concatenation of attribute names
  • Remove unrecognized rows and columns
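A minimal sketch of the structural-clue idea, assuming BeautifulSoup and an invented CDS-style HTML fragment: a spanning (colspan) cell is treated as a super-category whose label is concatenated onto the row labels beneath it. This is an illustration, not Minerva's extractor.

    # Sketch: expand a spanning header into concatenated attribute names.
    # The HTML fragment is invented in the style of the input/output slide above.
    from bs4 import BeautifulSoup

    html = """<table>
    <tr><td colspan="2">Undergraduate, Degree Seeking Freshmen</td></tr>
    <tr><td>Full-Time Men</td><td>784</td></tr>
    </table>"""

    context = ""                                   # label from the spanning row
    for row in BeautifulSoup(html, "html.parser").find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 1 and cells[0].get("colspan"):
            context = cells[0].get_text(strip=True)            # super-category
        elif len(cells) == 2 and cells[1].get_text(strip=True).isdigit():
            label = context + ", " + cells[0].get_text(strip=True)
            print(label, ">", cells[1].get_text(strip=True))
    # Undergraduate, Degree Seeking Freshmen, Full-Time Men > 784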

22
Question Types
  • Aggregation
  • Get aggregate statistics
  • Correlation
  • See stat-1 vs. stat-2
  • Retrieval
  • Find similar trends

23
Internal DB
  • Crawled data is stored in flat
    (non-hierarchical) SQL tables
  • Structural information is stored separately,
    defining taxonomy, aggregations, and links among
    domains (a minimal sketch follows below).

[Diagram: mapping and intersection between the internal tables.]
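The slides describe flat SQL tables plus separately stored structural information but show no schema; below is a minimal sqlite3 sketch of that split. All table and column names are assumptions, not Minerva's actual schema.

    # Minimal sketch of the described split: flat data tables plus a separate
    # structure table for taxonomy/aggregation links. Names are assumptions.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE crime (year INT, county TEXT, crime_type TEXT, n INT)")
    db.execute("CREATE TABLE structure (child TEXT, parent TEXT, relation TEXT)")

    db.executemany("INSERT INTO crime VALUES (?, ?, ?, ?)", [
        (2001, "Franklin", "Drug Trafficking", 410),
        (2002, "Franklin", "Drug Trafficking", 455),
    ])
    db.execute("INSERT INTO structure VALUES ('Drug Trafficking', 'Drug Crime', 'is-a')")

    # Aggregation-type query against the flat table:
    total = db.execute("SELECT SUM(n) FROM crime "
                       "WHERE crime_type = 'Drug Trafficking'").fetchone()[0]
    print(total)   # 865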
24
Further Augmentations (Next)
  • Outgrow the current three fixed query types
  • e.g., restrict the results of any query to the
    top-n most similar
  • e.g., link properties to criminals by location vs.
    by person
  • Query Relaxation (a hedged sketch follows this
    list)
  • Detects and reports missing or incomplete results
  • Suggests alternate queries that can be answered
    completely
  • Flexible graphical and tabular results
  • Manipulate axes, graph types, and filter results
    on the fly
  • Providing supporting information for user
    interaction
  • URLs of source documents, related Named Entities,
    keywords, detailed tables, etc.
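Query relaxation is only named above, not specified; the sketch below shows one simple reading of it: if a query has no answer, drop one restriction at a time and suggest the first relaxed query that does. The data, restriction keys, and behavior are all invented for illustration.

    # Hedged sketch of query relaxation; data and restriction keys are invented.
    def run_query(rows, restrictions):
        return [r for r in rows if all(r.get(k) == v for k, v in restrictions.items())]

    def relax(rows, restrictions):
        if run_query(rows, restrictions):
            return restrictions                     # answerable as asked
        for key in restrictions:                    # drop one restriction at a time
            relaxed = {k: v for k, v in restrictions.items() if k != key}
            if run_query(rows, relaxed):
                return relaxed                      # suggest this alternate query
        return None                                 # report: nothing can be answered

    rows = [{"state": "Ohio", "year": 2001, "count": 410}]
    print(relax(rows, {"state": "Ohio", "year": 1975}))   # -> {'state': 'Ohio'}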

25
Accomplishments in Jan-Nov 2003
  • Data acquisition from the Web
  • Algorithm design and implementation
  • Corpus preparation & annotation
  • Data set (completed) for mid-project evaluations
    (on table detection and table extraction)
  • Data set (next step) for end-of-project
    evaluations (on curve comparison and end-to-end
    Q/A)
  • Mid-project evaluation

26
Evaluation Data
  • Sampling
  • Randomly sampled 100 pages from the crawled
    university web pages (containing CDS data)
  • Human Annotation
  • For each page, identify whether or not a CDS
    table is contained; if yes, specify the CDS
    category of the table
  • Identify the cells in the table which need to be
    extracted
  • A user interface was provided to automatically
    save the human judgments in an XML format

27
Evaluation Metrics
  • Contingency Table (A = true positives, B = false
    positives, C = false negatives)
  • Performance Measures
  • Precision = A / (A + B)
  • Recall = A / (A + C)
  • F1 = 2 × recall × precision / (recall + precision)
    (checked numerically below)
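As a quick arithmetic check of the measures above, the snippet below reproduces the F1 values reported on the next slide from the stated precision and recall.

    # F1 as defined above, checked against the results on the next slide.
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(round(f1(82.5, 100.0), 1))   # 90.4 (table detection)
    print(round(f1(76.4, 71.0), 1))    # 73.6 (table extraction)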

28
Evaluation Results
  • Table Detection
  • Sampled Pages 100
  • Pages including CDS tables 76
  • True CDS tables 321
  • Included universities 44
  • Recall = 100%, Precision = 82.5%, F1 = 90.4
  • Table Extraction
  • Tables 47 (in CDS categories B1, B2, C and H1)
  • Recall = 71%, Precision = 76.4%, F1 = 73.6

29
Milestones for Month 12-18
  • Complete and integrate component systems,
    including graph/table/text agents, user
    interface, table detector, table extractor and NE
    extractor
  • Improve component systems based on mid-project
    evaluation results
  • Annotate data for evaluating curve-similarity
    based retrieval and end-to-end question answering
  • Conduct end-of-project evaluation, focusing on
    metrics for curve/trend comparison and different
    types of answers