Title: Minerva: Web Mining for Multimedia QA
1 Minerva: Web Mining for Multimedia QA
- Presented by Yiming Yang
- Joint work by the Minerva Team
- LTI and CSD, School of Computer Science
- Carnegie Mellon University
2 Team
- Yiming Yang (PI)
- Jaime Carbonell (Co-PI)
- Bryan Kisiel (Sr. programmer)
- Nianli Ma (programmer, graduate student)
- Ashwin Tengli (graduate student)
- Shinjae Yoo (graduate student)
- Derek Leung, Jibran Rashid (undergrad)
3 Emphases in AQUAINT
- Phase-I: Go beyond passage retrieval
  - Early TREC-style Q/A is not sufficient
- Phase-II: Go beyond single documents
  - Extract information from multiple documents and generate synthetic answers
4 Primary Claims in Minerva
- Mining the web for statistical information
  - Visiting hundreds of university web sites to get aggregate statistics for a synthetic answer (e.g., the increase of international students over recent years)
- Answering questions using tables, graphs and supporting materials (keywords, URLs, etc.)
- Developing meaningful similarity metrics for comparing curves and temporal trends (a sketch follows below)
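To illustrate the third claim: the slides do not specify Minerva's actual similarity metric, but one common choice for comparing temporal trends independently of absolute scale is the Pearson correlation between the two curves. A minimal sketch, with hypothetical yearly counts:

```python
# A minimal sketch of one plausible trend-similarity metric (Pearson
# correlation); the data and the choice of metric are illustrative
# assumptions, not Minerva's implementation.
from math import sqrt

def trend_similarity(a: list[float], b: list[float]) -> float:
    """Pearson correlation between two equal-length time series.

    Returns a value in [-1, 1]; 1 means the curves rise and fall
    together, regardless of their absolute scale.
    """
    n = len(a)
    assert n == len(b) and n > 1
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical yearly counts for two crime types over the same period.
drug_trafficking = [410, 455, 520, 580, 610]
burglary = [300, 330, 390, 440, 455]
print(trend_similarity(drug_trafficking, burglary))  # close to 1.0
```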
5 Minerva System Overview
[Architecture diagram: a Web Crawler and Document Filter pull on-topic documents from related web sites; a Table Extractor and an NE Extractor (trained with IE training examples) populate the Internal DB; the GUI sends Q/A sub-tasks to a Query Interpreter, which issues internal queries to the Table, Text, and Graph Agents; the Answer Composer assembles their retrieved data into the final answer.]
6 Statistic-oriented Questions
- Template based
  - Retrieval type
  - Aggregation type
  - Correlation type
- Domain restrictions may be added for each query type
- Example: Which kinds of crime show trends similar to that of the total number of Drug Trafficking crimes in Ohio state from 1982 to 2002?
7 Graphical Answer
8 GUI with Supporting Materials (URLs, Named Entities, keywords, tables)
[Screenshot callouts: URLs of extracted information; graph results; supporting tables for the graph.]
9 Accomplishments in Jan-Nov 2003
- Data acquisition from the Web
- Algorithm design and implementation
- Corpus preparation and annotation
- Mid-project evaluation
10 Statistical Data From the Web
- University Data
  - From 157 (out of 445) university web sites; 170 MB compressed
  - Pages containing the Common Data Set (CDS), standard information provided (supposedly) by all universities in the USA
- Criminal Data
  - From Ohio state web sites
  - Information about 315,000 criminals, one page per criminal; 5 GB compressed in total
- Property Data
  - From Allegheny County (PA) and Allen County (OH)
  - Information about 300,000 properties in 1.4 million web pages (13 GB compressed)
11 Data in Different Formats
12 Data Containing Named Entities (person names, dates, places, etc.)
13 Data Exhibiting Trends
14 Data Exhibiting Correlations
- Data in different domains can be correlated through shared Named Entities, e.g.:
  - Property taxes and criminals with the same names
  - Interesting trends over the same period of time
15 Data on a Semantic Structure
- Common Data Set (CDS)
  - Standard information provided by U.S. universities since 1999
  - Each university has its own format for presenting the tables
  - We collect the relevant portions (shown) of the data for our Q/A system
16 Data-related Issues
- Data may have common structures semantically
- Data may differ syntactically (in table formatting and attribute names)
- Data often contain Named Entities
- Data (statistics) may exhibit trends over time
- Data from different domains may be correlated
- How do we take advantage of those characteristics in statistic-based QA?
17 Accomplishments in Jan-Nov 2003
- Data acquisition from the Web
- Algorithm design and implementation
  - Table detection
  - Table extraction
  - Template-based Q/A (graph/table/text agents)
  - Curve-similarity based retrieval (next focus)
- Corpus preparation and annotation
- Mid-project evaluation
18 Table Detection
- Using <table> tags to identify a table
  - Not always a good indicator of a true table
  - This page, for example, contains 45 <table> tags
19 Table Detection: Our Approach
- Run a decision tree algorithm to learn rules based on features such as the following (a sketch follows below):
  - Number of cells
  - Number of text cells
  - Number of rows
  - Number of columns
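A minimal sketch of this idea using scikit-learn; the toy training set, labels, and classifier settings below are illustrative assumptions, not Minerva's actual training data or code:

```python
# Learn table-detection rules with a decision tree over the per-table
# features listed on the slide. The tiny hand-made data set is a
# hypothetical stand-in for real annotated pages.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [n_cells, n_text_cells, n_rows, n_cols]; label 1 = true data table.
X = [
    [4, 1, 2, 2],     # small layout table, mostly non-text cells
    [60, 55, 12, 5],  # dense grid of text cells: likely a real table
    [2, 0, 1, 2],     # navigation bar rendered with <table>
    [48, 40, 8, 6],
]
y = [0, 1, 0, 1]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Print the learned rules in human-readable form.
print(export_text(clf, feature_names=["cells", "text_cells", "rows", "cols"]))
```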
20 Table Extraction
[Example: input page → desired output, e.g.:
Undergraduate, Degree Seeking Freshmen, Full-Time Men → 784
Undergraduate, Other First-year Degree Seeking, Full-Time Men → 14
Undergraduate, All other degree seeking, Full-Time Men → 2365
...]
21 Table Extraction: Our Approach
- Use domain knowledge
  - Predefined attribute names distinguishing labels (of rows and columns) from data
  - The CDS hierarchy identifying the relationship between attributes (super-categories vs. sub-categories)
- Use structural clues (a sketch follows below)
  - Spanning cells defining recursive structures of tables and concatenation of attribute names
- Remove unrecognized rows and columns
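To illustrate the spanning-cell clue: a rowspan/colspan-aware expansion lets each spanning label contribute to the concatenated attribute name of every data cell beneath it. A minimal sketch with BeautifulSoup and a hypothetical CDS-style fragment (not Minerva's extractor):

```python
# Expand rowspan/colspan into a rectangular grid, then read each row as
# a sequence of attribute labels followed by a data value.
from bs4 import BeautifulSoup

HTML = """
<table>
  <tr><th rowspan="2">Undergraduate</th><th>Full-Time Men</th><td>784</td></tr>
  <tr><th>Full-Time Women</th><td>802</td></tr>
</table>
"""

def extract(table):
    grid, pending = [], {}  # pending[col] = (text, rows_left) for open rowspans
    for tr in table.find_all("tr"):
        row, col, cells = [], 0, iter(tr.find_all(["td", "th"]))
        while True:
            if col in pending:               # fill from a spanning cell above
                text, left = pending[col]
                row.append(text)
                pending[col] = (text, left - 1)
                if left - 1 == 0:
                    del pending[col]
                col += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text = cell.get_text(strip=True)
            for _ in range(int(cell.get("colspan", 1))):
                row.append(text)
                if int(cell.get("rowspan", 1)) > 1:
                    pending[col] = (text, int(cell.get("rowspan", 1)) - 1)
                col += 1
        grid.append(row)
    for row in grid:                         # labels precede the datum
        *labels, value = row
        print(", ".join(labels), "->", value)

extract(BeautifulSoup(HTML, "html.parser").table)
# Undergraduate, Full-Time Men -> 784
# Undergraduate, Full-Time Women -> 802
```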
22 Question Types
- Aggregation
  - Get aggregate statistics
- Correlation
  - See stat-1 vs. stat-2
- Retrieval
  - Find similar trends
23 Internal DB
- Crawled data is stored in flat (non-hierarchical) SQL tables (a sketch follows below)
- Structural information is stored separately, defining the taxonomy, aggregations, and links among domains
[Diagram: mapping and intersection between domain tables.]
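To make the flat-table idea concrete, here is a minimal sketch of an aggregation-type query over one such table; the schema, table name, and data are hypothetical stand-ins, since the slides do not specify Minerva's actual schema:

```python
# One flat (non-hierarchical) SQL table per domain; an aggregation-type
# question becomes a GROUP BY query against it.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE crime
              (crime_type TEXT, county TEXT, year INTEGER, count INTEGER)""")
db.executemany("INSERT INTO crime VALUES (?, ?, ?, ?)", [
    ("Drug Trafficking", "Franklin", 2001, 120),
    ("Drug Trafficking", "Cuyahoga", 2001, 210),
    ("Burglary",         "Franklin", 2001,  95),
])

# Aggregation: total Drug Trafficking crimes per year across all counties.
for year, total in db.execute(
        """SELECT year, SUM(count) FROM crime
           WHERE crime_type = 'Drug Trafficking' GROUP BY year"""):
    print(year, total)  # -> 2001 330
```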
24 Further Augmentations (Next)
- Outgrow the current three fixed query types
  - e.g., restrict the results of any query to the top-n most similar
  - e.g., link properties to criminals by location vs. by person
- Query relaxation
  - Detect and report missing or incomplete results
  - Suggest alternate queries that can be answered completely
- Flexible graphical and tabular results
  - Manipulate axes, graph types, and filter results on the fly
- Provide supporting information for user interaction
  - URLs of source documents, related Named Entities, keywords, detailed tables, etc.
25 Accomplishments in Jan-Nov 2003
- Data acquisition from the Web
- Algorithm design and implementation
- Corpus preparation and annotation
  - Data set (completed) for mid-project evaluations (on table detection and table extraction)
  - Data set (next step) for end-of-project evaluations (on curve comparison and end-to-end Q/A)
- Mid-project evaluation
26 Evaluation Data
- Sampling
  - Randomly sampled 100 pages from the crawled university web pages (containing CDS data)
- Human Annotation
  - For each page, identify whether or not a CDS table is contained; if yes, specify the CDS category of the table
  - Identify the cells in the table which need to be extracted
  - A user interface was provided to automatically save the human judgments in an XML format
27 Evaluation Metrics
- Contingency table: A = true positives (system yes, truth yes), B = false positives (system yes, truth no), C = false negatives (system no, truth yes)
- Performance Measures
  - Precision = A / (A + B)
  - Recall = A / (A + C)
  - F1 = 2 * precision * recall / (precision + recall)
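As a quick sanity check, the F1 values reported on the next slide follow directly from the reported precision and recall; a minimal sketch (plain arithmetic, not part of the Minerva system):

```python
# Reproduce the reported F1 scores from the reported precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.825, 1.00))  # table detection:  ~0.904 (slide reports F1 = 90.4)
print(f1(0.764, 0.71))  # table extraction: ~0.736 (slide reports F1 = 73.6)
```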
28 Evaluation Results
- Table Detection
  - Sampled pages: 100
  - Pages including CDS tables: 76
  - True CDS tables: 321
  - Included universities: 44
  - Recall: 100%, Precision: 82.5%, F1: 90.4
- Table Extraction
  - Tables: 47 (in CDS categories B1, B2, C and H1)
  - Recall: 71%, Precision: 76.4%, F1: 73.6
29 Milestones for Months 12-18
- Complete and integrate the component systems, including the graph/table/text agents, user interface, table detector, table extractor, and NE extractor
- Improve the component systems based on the mid-project evaluation results
- Annotate data for evaluating curve-similarity based retrieval and end-to-end question answering
- Conduct the end-of-project evaluation, focusing on metrics for curve/trend comparison and different types of answers