Minerva: Web Mining for Multimedia QA
1
Minerva: Web Mining for Multimedia QA
  • Presented by Yiming Yang
  • Joint work by the Minerva Team
  • LTI & CSD, School of Computer Science
  • Carnegie Mellon University

2
Team
  • Yiming Yang (PI)
  • Jaime Carbonell (Co-PI)
  • Bryan Kisiel (Sr. programmer)
  • Nianli Ma (programmer & grad student)
  • Ashwin Tengli (graduate student)
  • Shinjae Yoo (graduate student)
  • Derek Leung, Jibran Rashid (undergraduates)

3
Emphases in AQUAINT
  • Phase I: Go beyond passage retrieval
  • Early TREC-style Q/A is not sufficient
  • Phase II: Go beyond single documents
  • Extract information from multiple documents and
    generate synthetic answers

4
Primary Claims in Minerva
  • Mining the web for statistical information
  • visiting hundreds of university web sites to get
    aggregate statistics for a synthetic answer
  • (e.g., the increase of international students
    over recent years)
  • Answering questions using tables, graphs and
    supporting materials (keywords, URLs, etc.)
  • Developing meaningful similarity metrics for
    comparing curves and temporal trends (see the
    sketch below)
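The slides do not spell out the similarity metric, so the following is only a minimal sketch of one plausible choice: comparing the year-over-year changes of two yearly statistics by Pearson correlation. The function names and example numbers are invented for illustration; this is not Minerva's actual metric.

    # Hedged sketch of a trend-similarity score; not Minerva's actual metric.
    from math import sqrt

    def pearson(xs, ys):
        """Pearson correlation between two equal-length numeric sequences."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy) if sx and sy else 0.0

    def trend_similarity(a, b):
        """Correlate year-over-year changes, so curves of similar shape but
        different magnitude still score close to 1."""
        da = [y - x for x, y in zip(a, a[1:])]
        db = [y - x for x, y in zip(b, b[1:])]
        return pearson(da, db)

    # Invented example: international-student counts at two universities.
    print(trend_similarity([1200, 1350, 1500, 1700, 1850],
                           [400, 455, 510, 580, 630]))   # ~0.96, similar trends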

5
Minerva System Overview
[System architecture diagram: a Web Crawler and Document Filter pull on-topic documents from related web sites; a Table Extractor and an NE Extractor (IE, using training examples) populate an Internal DB; the Query Interpreter turns Q/A requests from the GUI into sub-tasks and internal queries for the Table, Text, and Graph Agents; the Answer Composer assembles the retrieved data into the answer.]
6
Statistic-oriented Questions
  • Template-based (see the sketch below)
  • Retrieval type
  • Aggregation type
  • Correlation type
  • Domain restrictions may be added for each query
    type
  • Example: Which kinds of crime show trends similar
    to the total number of Drug Trafficking crimes in
    the state of Ohio from 1982 to 2002?
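The template format itself is not shown in the slides; the sketch below is only one way the three query types and their domain restrictions might be written down. All field names and values here are assumptions, not the system's schema.

    # Illustrative query-template representation; field names are assumptions.
    from dataclasses import dataclass, field

    @dataclass
    class StatQuery:
        qtype: str                   # "retrieval", "aggregation", or "correlation"
        statistic: str               # the statistic the question is about
        restrictions: dict = field(default_factory=dict)  # optional domain restrictions

    # The example question above, expressed as a retrieval-type query:
    example = StatQuery(
        qtype="retrieval",
        statistic="total Drug Trafficking crimes",
        restrictions={"state": "Ohio", "years": (1982, 2002),
                      "target": "crime types with similar trends"},
    )
    print(example.qtype, example.restrictions["years"])   # retrieval (1982, 2002)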

7
Graphical Answer
8
GUI with Supporting Materials (URLs, Named
Entities, keywords, tables)
URLs of extracted information
Graph Results
Graph supporting tables
9
Accomplishments in Jan-Nov 2003
  • Data acquisition from the Web
  • Algorithm design and implementation
  • Corpus preparation & annotation
  • Mid-project evaluation

10
Statistical Data From the Web
  • University Data
  • From 157 (out of 445) university web sites, 170
    MB compressed
  • Pages containing Common Data Set (CDS), standard
    information provided (supposedly) by all
    universities in the USA
  • Criminal Data
  • From Ohio State web sites
  • Information about 315,000 criminals, one page per
    criminal, 5 GB compressed in total
  • Property Data
  • From Allegheny County (PA) and Allen County (OH)
  • Information about 300,000 properties in 1.4
    million web pages (13 GB compressed)

11
Data in Different Formats
  • Criminal records
  • Property tax records
  • University data

12
Data containing Named Entities (person names,
dates, places, etc.)
13
Data Exhibiting Trends
14
Data Exhibiting Correlations
  • Data in different domains can be correlated
    through shared Named Entities
  • e.g.
  • Property tax records and criminal records with the
    same person names (see the toy example below)
  • Interesting trends over the same period of time
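As a toy illustration of linking two domains through a shared Named Entity, the snippet below joins invented property-tax and criminal records on a person name; the records are made up, and the real data is far larger and messier.

    # Toy example: link domains through a shared Named Entity (a person name).
    property_tax = {"John Doe": 2400, "Jane Roe": 1800}           # name -> annual tax
    criminal_records = {"John Doe": ["theft"], "Ann Poe": ["fraud"]}

    for name in set(property_tax) & set(criminal_records):
        print(name, property_tax[name], criminal_records[name])
    # -> John Doe 2400 ['theft']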

15
Data on a Semantic Structure
  • Common Data Set (CDS)
  • Standard information provided by U.S.
    universities since 1999
  • Each university has its own format for presenting
    the tables
  • We collect relevant portions (shown) of the data
    for our Q/A system

16
Data-related Issues
  • Data may have common structures semantically
  • Data may differ syntactically (in table
    formatting and attribute names)
  • Data often contain Named Entities
  • Data (statistics) may exhibit trends over time
  • Data from different domains may be correlated
  • How do we take advantage of those
    characteristics in statistic-based QA?

17
Accomplishments in Jan-Nov 2003
  • Data acquisition from the Web
  • Algorithm design and implementation
  • Table detection
  • Table extraction
  • Template-based Q/A (graph/table/text agents)
  • Curve-similarity based retrieval (next focus)
  • Corpus preparation & annotation
  • Mid-project evaluation

18
Table Detection
  • Using <table> tags to identify a table
  • Not always a good indicator of a true table
  • This page, for example, contains 45 <table> tags

19
Table Detection: Our Approach
  • Running a decision tree algorithm to learn rules
    based on features such as the following (a hedged
    sketch follows this list)
  • Number of cells
  • Number of text cells
  • Number of rows
  • Number of columns
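The slides name only the learner (a decision tree) and the four features; the sketch below shows that idea with scikit-learn on a handful of invented, hand-labeled examples. Minerva's real training data, thresholds, and full feature set are not shown here.

    # Hedged sketch of decision-tree table detection over the four features above.
    # The labeled examples are invented; only the feature names come from the slide.
    from sklearn.tree import DecisionTreeClassifier

    # [number of cells, number of text cells, number of rows, number of columns]
    X = [
        [40, 38, 10, 4],   # dense grid of labels and numbers -> genuine data table
        [ 3,  3,  1, 3],   # navigation bar laid out with <table> -> not a table
        [24, 20,  8, 3],
        [ 2,  1,  2, 1],
    ]
    y = [1, 0, 1, 0]       # 1 = true table, 0 = layout-only markup

    clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(clf.predict([[30, 28, 10, 3]]))   # likely a true table -> [1]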

20
Table Extraction
Input → Desired Output:
Undergraduate, Degree Seeking Freshmen, Full-Time Men > 784
Undergraduate, Other First-year Degree Seeking, Full-Time Men > 14
Undergraduate, All other degree seeking, Full-Time Men > 2365
...
21
Table Extraction: Our Approach
  • Use domain knowledge
  • predefined attribute names distinguishing labels
    (of rows and columns) from data
  • the CDS hierarchy, identifying the relationship
    between attributes (super-categories vs.
    sub-categories)
  • Use structural clues (see the sketch below)
  • spanning cells defining recursive structures of
    tables and concatenation of attribute names
  • Remove unrecognized rows and columns
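A minimal sketch of the structural-clue idea, assuming BeautifulSoup and an invented CDS-style HTML fragment: a spanning (colspan) cell is treated as a super-category whose label is concatenated onto the row labels beneath it. This is an illustration, not Minerva's extractor.

    # Sketch: expand a spanning header into concatenated attribute names.
    # The HTML fragment is invented in the style of the input/output slide above.
    from bs4 import BeautifulSoup

    html = """<table>
    <tr><td colspan="2">Undergraduate, Degree Seeking Freshmen</td></tr>
    <tr><td>Full-Time Men</td><td>784</td></tr>
    </table>"""

    context = ""                                   # label from the spanning row
    for row in BeautifulSoup(html, "html.parser").find_all("tr"):
        cells = row.find_all("td")
        if len(cells) == 1 and cells[0].get("colspan"):
            context = cells[0].get_text(strip=True)            # super-category
        elif len(cells) == 2 and cells[1].get_text(strip=True).isdigit():
            label = context + ", " + cells[0].get_text(strip=True)
            print(label, ">", cells[1].get_text(strip=True))
    # Undergraduate, Degree Seeking Freshmen, Full-Time Men > 784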

22
Question Types
  • Aggregation
  • Get aggregate statistics
  • Correlation
  • See stat-1 vs. stat-2
  • Retrieval
  • Find similar trends

23
Internal DB
  • Crawled data is stored in flat
    (non-hierarchical) SQL tables
  • Structural information is stored separately,
    defining taxonomy, aggregations, and links among
    domains (a minimal sketch follows below).

[Diagram: mapping and intersection between the internal tables.]
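The slides describe flat SQL tables plus separately stored structural information but show no schema; below is a minimal sqlite3 sketch of that split. All table and column names are assumptions, not Minerva's actual schema.

    # Minimal sketch of the described split: flat data tables plus a separate
    # structure table for taxonomy/aggregation links. Names are assumptions.
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE crime (year INT, county TEXT, crime_type TEXT, n INT)")
    db.execute("CREATE TABLE structure (child TEXT, parent TEXT, relation TEXT)")

    db.executemany("INSERT INTO crime VALUES (?, ?, ?, ?)", [
        (2001, "Franklin", "Drug Trafficking", 410),
        (2002, "Franklin", "Drug Trafficking", 455),
    ])
    db.execute("INSERT INTO structure VALUES ('Drug Trafficking', 'Drug Crime', 'is-a')")

    # Aggregation-type query against the flat table:
    total = db.execute("SELECT SUM(n) FROM crime "
                       "WHERE crime_type = 'Drug Trafficking'").fetchone()[0]
    print(total)   # 865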
24
Further Augmentations (Next)
  • Outgrow the current three fixed query types
  • e.g., restrict the results of any query to the
    top-n most similar
  • e.g., link properties to criminals by location vs.
    by person
  • Query Relaxation (a hedged sketch follows this
    list)
  • Detects and reports missing or incomplete results
  • Suggests alternate queries that can be answered
    completely
  • Flexible graphical and tabular results
  • Manipulate axes, graph types, and filter results
    on the fly
  • Providing supporting information for user
    interaction
  • URLs of source documents, related Named Entities,
    keywords, detailed tables, etc.
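Query relaxation is only named above, not specified; the sketch below shows one simple reading of it: if a query has no answer, drop one restriction at a time and suggest the first relaxed query that does. The data, restriction keys, and behavior are all invented for illustration.

    # Hedged sketch of query relaxation; data and restriction keys are invented.
    def run_query(rows, restrictions):
        return [r for r in rows if all(r.get(k) == v for k, v in restrictions.items())]

    def relax(rows, restrictions):
        if run_query(rows, restrictions):
            return restrictions                     # answerable as asked
        for key in restrictions:                    # drop one restriction at a time
            relaxed = {k: v for k, v in restrictions.items() if k != key}
            if run_query(rows, relaxed):
                return relaxed                      # suggest this alternate query
        return None                                 # report: nothing can be answered

    rows = [{"state": "Ohio", "year": 2001, "count": 410}]
    print(relax(rows, {"state": "Ohio", "year": 1975}))   # -> {'state': 'Ohio'}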

25
Accomplishments in Jan-Nov 2003
  • Data acquisition from the Web
  • Algorithm design and implementation
  • Corpus preparation & annotation
  • Data set (completed) for mid-project evaluations
    (on table detection and table extraction)
  • Data set (next step) for end-of-project
    evaluations (on curve comparison and end-to-end
    Q/A)
  • Mid-project evaluation

26
Evaluation Data
  • Sampling
  • Randomly sampled 100 pages from the crawled
    university web pages (containing CDS data)
  • Human Annotation
  • For each page, identify whether or not a CDS
    table is contained; if yes, specify the CDS
    category of the table
  • Identify the cells in the table which need to be
    extracted
  • A user interface was provided to automatically
    save the human judgments in an XML format

27
Evaluation Metrics
  • Contingency Table (A = true positives, B = false
    positives, C = false negatives)
  • Performance Measures
  • Precision = A / (A + B)
  • Recall = A / (A + C)
  • F1 = 2 × recall × precision / (recall + precision)
    (checked numerically below)
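As a quick arithmetic check of the measures above, the snippet below reproduces the F1 values reported on the next slide from the stated precision and recall.

    # F1 as defined above, checked against the results on the next slide.
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall)

    print(round(f1(82.5, 100.0), 1))   # 90.4 (table detection)
    print(round(f1(76.4, 71.0), 1))    # 73.6 (table extraction)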

28
Evaluation Results
  • Table Detection
  • Sampled Pages 100
  • Pages including CDS tables 76
  • True CDS tables 321
  • Included universities 44
  • Recall = 100%, Precision = 82.5%, F1 = 90.4
  • Table Extraction
  • Tables 47 (in CDS categories B1, B2, C and H1)
  • Recall = 71%, Precision = 76.4%, F1 = 73.6

29
Milestones for Month 12-18
  • Complete and integrate component systems,
    including graph/table/text agents, user
    interface, table detector, table extractor and NE
    extractor
  • Improve component systems based on mid-project
    evaluation results
  • Annotate data for evaluating curve-similarity
    based retrieval and end-to-end question answering
  • Conduct end-of-project evaluation, focusing on
    metrics for curve/trend comparison and different
    types of answers