Title: Video Indexing and Retrieval using an MPEG7 Based Inference Network
Video Indexing and Retrieval using an MPEG7 Based Inference Network
- Andrew Graves (andrew@dcs.qmul.ac.uk)
Contents
- Introduction
- Aims
- Background, Positioning
- MPEG7 Collection
- The Model
- Experiments
- Concluding Remarks
Introduction
Project Aims
- Metadata based retrieval using MPEG7
- Assume we have the metadata
- Build a modular retrieval system: Video analysis -> MPEG7 -> Video retrieval
- Exploit MPEG7 structure, context and concepts
Background
- Information Retrieval
  - IR Models: the Inference Network model
  - Text retrieval: indexing, retrieval, term statistics
  - Structured Information Retrieval at QM, Dortmund
- Multimedia
  - MPEG 1/2/4/7
  - MPEG7, the Multimedia Content Description Interface
- Video Indexing and Retrieval
  - Annotation-, content- and metadata-based approaches
  - Assume we have the metadata
  - Shot/scene detection
  - Feature extraction / acquisition of semantics
MPEG7
- Description Definition Language (DDL), Descriptors (D) and Description Schemes (DS)
- Just another XML format
Inference Network Model
- Probabilistic framework for IR that uses a Bayesian Network (so based on proven statistical theory)
- Complete Network = Document Network + Query Network, joined by Attachment and then Evaluation
- The Complete Network is used to estimate the probability of relevance for each Document Node (a sketch of operator evaluation follows)
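As a rough illustration (not the project's own code), the closed-form operator evaluations used by INQUERY-style inference networks can be sketched in Python; the parent beliefs below are made-up values:

# Sketch of INQUERY-style closed forms for combining parent beliefs.
# The input beliefs are illustrative, not taken from the experiments.
from functools import reduce

def p_and(ps):        # every parent must hold
    return reduce(lambda a, b: a * b, ps, 1.0)

def p_or(ps):         # at least one parent holds
    return 1.0 - reduce(lambda a, b: a * (1.0 - b), ps, 1.0)

def p_sum(ps):        # unweighted average
    return sum(ps) / len(ps)

def p_wsum(ps, ws):   # weighted average
    return sum(w * p for w, p in zip(ws, ps)) / sum(ws)

beliefs = [0.7, 0.9]                 # beliefs of two parent nodes
print(p_and(beliefs))                # ~0.63
print(p_or(beliefs))                 # ~0.97
print(p_sum(beliefs))                # ~0.8
print(p_wsum(beliefs, [0.2, 0.8]))   # ~0.86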
Positioning
Model = Inference Network + MPEG7
- Inference Network
  - Allows a combination of evidence
  - Allows hierarchical document nodes (structure)
- MPEG7
  - Structural, conceptual and contextual info
- So, we process DSs and Ds to form the IN
In Other Words...
- Build a Document Network that represents all of the Ds (concepts) and DSs (structure)
- Attach a Query Network and evaluate
MPEG7 Collection
Collection
- 3 MPEG1 files generated
- 2 comedies (Fawlty Towers)
- 1 film (A Room With A View)
- MPEG7 files then generated
- Automatic shot detection
- Shots manually grouped to form scenes
Annotations
- Abstract taken from the box cover
- StructuredAnnotation for each scene to specify exactly the participants and location
- FreeTextAnnotation to describe the action
- FreeTextAnnotation with speech extracts
MPEG7 Excerpt 1
<AudioVisual id="Communication Problems">
  <MediaInformation/>
  <MediaProfile/>
  <CreationInformation>
    <Creation>
      <Title>Communication Problems</Title>
      <Abstract>
        <FreeTextAnnotation>
          It's not a wise man who entrusts his furtive winnings on the
          horses to a geriatric Major, but Basil was never known for that
          quality. Parting with those ill-gotten gains was Basil's first
          mistake; his second was to tangle with the intermittently deaf
          Mrs Richards.
        </FreeTextAnnotation>
      </Abstract>
      <Creator>BBC</Creator>
    </Creation>
    <Classification>
      <Genre>Comedy</Genre>
      <Language>English</Language>
    </Classification>
  </CreationInformation>
</AudioVisual>

The excerpt shows concepts pertaining to the whole video.
MPEG7 Excerpt 2
<SegmentDecomposition decompositionType="temporal" gap="true" id="TableOfContent" overlap="false">
  <Segment id="A satisfied customer" xsi:type="AudioVisualSegmentType">
    <TextAnnotation>
      <FreeTextAnnotation>
        Basil receives a tip on a horse from a customer. Sybil warns Basil
        not to bet. Basil says Sybil is a dragon to Polly.
      </FreeTextAnnotation>
      <StructuredAnnotation>
        <Who>Basil,Sybil,Major,Polly</Who>
        <Where>Lobby</Where>
      </StructuredAnnotation>
    </TextAnnotation>
    <SegmentDecomposition decompositionType="temporal" gap="true" overlap="false">
      <Segment id="Shot_1" xsi:type="AudioVisualSegmentType">
        <TextAnnotation>
          <FreeTextAnnotation>
            Glad you enjoyed it. Polly will you get Mr Firkins' bill please.
          </FreeTextAnnotation>
        </TextAnnotation>
        <MediaTime>
          <MediaIncrDuration timeUnit="PT1N25F">86</MediaIncrDuration>
        </MediaTime>
      </Segment>
    </SegmentDecomposition>
    <MediaTime>

The excerpt shows the video decomposition and the distribution of evidence (i.e. concepts) throughout the MPEG7 file.
Model
Model Overview
- Document Network (built during indexing)
  - Static, contains information about the collection
- Query Network (built during retrieval)
  - Query language based upon INQUERY
  - Statistical operators (and approximations of Boolean)
- Attachment process
  - Builds the Complete Network
  - Creates DN->QN links where the concepts are the same
- Evaluation process
  - Calculates the probability of relevance for each element
Document Network
Layers: Document Nodes, Context Nodes, Concept Nodes (see the sketch below)
- Document Node layer: created from MPEG7 structural aspects
- Context Node layer: provides contextual information
- Concept Node layer: contains all the concepts present in the collection
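For concreteness, a minimal sketch of deriving the three layers from an MPEG7-like fragment; the traversal and the choice of which tags become Document Nodes are my assumptions, not the deck's specification:

# Sketch: deriving the three DN layers from an MPEG7-like fragment.
import xml.etree.ElementTree as ET

DOCUMENT_TAGS = {"AudioVisual", "Segment"}   # structural aspects -> Document Nodes

def build_layers(elem, doc_nodes, ctx_nodes, concepts, context=()):
    path = context + (elem.tag,)
    if elem.tag in DOCUMENT_TAGS:
        doc_nodes.append(elem.get("id", elem.tag))      # Document Node layer
    else:
        ctx_nodes.append("/".join(path))                # Context Node layer
    if elem.text and elem.text.strip():
        for term in elem.text.split():                  # Concept Node layer
            concepts.setdefault(term.lower(), []).append(path)
    for child in elem:
        build_layers(child, doc_nodes, ctx_nodes, concepts, path)

xml = "<AudioVisual id='Ep1'><Creation><Title>Communication Problems</Title></Creation></AudioVisual>"
docs, ctxs, cons = [], [], {}
build_layers(ET.fromstring(xml), docs, ctxs, cons)
print(docs)        # ['Ep1']
print(ctxs)        # ['AudioVisual/Creation', 'AudioVisual/Creation/Title']
print(list(cons))  # ['communication', 'problems']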
Query Network 1
- Query text is parsed to produce a Query tree
- Inverted DAG with a single final node
- Terms and Operators
  - Boolean operators: and, or, not
  - Statistical operators: sum, wsum, max
  - Constraints: constraint, tree
- Examples (a parsing sketch follows):

  and("BBC" "Basil")
  constraint(Creation, "BBC")
  and(constraint(Creation, "BBC") "Basil")
Query Network 2
- Constraints may be simple or complex: constraint and tree can be nested

  and(tree(CreationInformation,
        constraint(Classification/Genre, "comedy"),
        constraint(Creation, "BBC")) basil)

- Trying to exploit the contextual information
Attachment
- Attachment creates DN->QN links (at the concept level)
- Find candidate links, then consider the constraints
- Strength of a link can be determined by the closeness of the match
- Perform Tree Matching to find the Edit Distance (ED), as sketched below
- Use the ED by a) testing against a threshold, or b) reducing the link weight
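One way to realise Tree Matching is an edit distance over context paths; the deck does not define the exact procedure, so the following is only a stand-in:

# Sketch: Levenshtein distance over path components, used two ways as in
# the slides: a) TMT threshold test, b) TMW weight degradation.
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1): d[i][0] = i
    for j in range(n + 1): d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

query_path = "CreationInformation/Creation".split("/")
dn_path = "AudioVisual/CreationInformation/Creation".split("/")
ed = edit_distance(query_path, dn_path)
attach = ed <= 1              # a) TMT: attach only if ED is under the threshold
weight = 1.0 / (1.0 + ed)     # b) TMW: degrade the DN->QN link weight with ED
print(ed, attach, weight)     # 1 True 0.5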
Evaluation
- After attachment we have formed the Complete Network
- This is evaluated for every Document Node, and the resulting probabilities are used for ranking
- All required nodes are evaluated using 1) the values of the parent nodes and 2) conditional probabilities (see the sketch below)
- Nodes may inherit parental contexts (Link Inheritance)
- Parents outside the constraint may be ignored (Path Cropping)
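A minimal propagation sketch using a noisy-OR closed form and link probabilities borrowed from Experiment 1; Link Inheritance and Path Cropping are not modelled, and the intermediate CreationInformation node is collapsed for brevity:

# Sketch: belief propagation from a Document Node down to a concept node.
from functools import lru_cache

# node -> list of (parent, P(node | parent)); values echo Experiment 1
PARENTS = {
    "Video1":   [],
    "Scene1":   [("Video1", 0.7)],
    "Shot2":    [("Scene1", 0.875)],
    "Creation": [("Video1", 0.75)],
    "banana":   [("Shot2", 0.7), ("Creation", 0.8)],   # the two influences
}
PRIOR = {"Video1": 1.0}   # root document node is instantiated

@lru_cache(maxsize=None)
def belief(node):
    parents = PARENTS[node]
    if not parents:
        return PRIOR.get(node, 0.0)
    prod = 1.0
    for parent, p in parents:   # noisy-OR: each parent contributes independently
        prod *= 1.0 - p * belief(parent)
    return 1.0 - prod

print(belief("banana"))   # ~0.77, combining Shot2 and Creation evidence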
Extraction
- Structural Extraction: about the hierarchical makeup
- Attribute Extraction: data about the structural elements
- Concept Extraction: obtain the concepts that appear (sketched after the examples below)
- Text preprocessing
  - Luhn's analysis, term statistics

<context attribute="value">
  free text
  <subcontext>more free text</subcontext>
</context>

<TextAnnotation>
  <FreeTextAnnotation>
    Basil attempts to fix the car
  </FreeTextAnnotation>
  <StructuredAnnotation>
    <Who>Basil</Who>
    <WhatObject>Car</WhatObject>
    <WhatAction>Fix</WhatAction>
    <Where>Carpark</Where>
  </StructuredAnnotation>
</TextAnnotation>
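Concept Extraction with Luhn-style preprocessing might look like this sketch (the stopword list and tokenisation are illustrative):

# Sketch: pull concepts out of a TextAnnotation and count term occurrences.
import xml.etree.ElementTree as ET
from collections import Counter

STOPWORDS = {"the", "to", "a", "an", "of", "and"}

xml = """<TextAnnotation>
  <FreeTextAnnotation>Basil attempts to fix the car</FreeTextAnnotation>
  <StructuredAnnotation>
    <Who>Basil</Who><WhatObject>Car</WhatObject>
    <WhatAction>Fix</WhatAction><Where>Carpark</Where>
  </StructuredAnnotation>
</TextAnnotation>"""

counts = Counter()
for elem in ET.fromstring(xml).iter():
    if elem.text and elem.text.strip():
        for term in elem.text.lower().split():
            if term not in STOPWORDS:       # crude Luhn-style filtering
                counts[term] += 1

print(counts)   # basil, car and fix each occur twice (free text + structured)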
Probability Estimation
- Probability that a document is relevant to the query
- Conditional probabilities between the nodes
  - Context->Context (e.g. Video->Scene)
  - Context->Concept
- We use (see the sketch below):
  - Term statistics
  - Number of siblings
  - Duration ratios
- Alternatives:
  - Size of context
  - Number of concepts
  - Number of occurrences
  - Perceived goodness
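The following estimators are hypothetical reconstructions that happen to be consistent with the weights shown later in Experiment 1; they are not formulas stated in the deck:

# Sketch: candidate conditional-probability estimators.
import math

def from_duration_ratio(child_dur, parent_dur, floor=0.5):
    # Context->Context: longer children receive more of the parent's belief.
    return floor + (1.0 - floor) * (child_dur / parent_dur)

def from_siblings(n_siblings, floor=0.5):
    # Context->Context: fewer siblings -> each child carries more weight.
    return floor + (1.0 - floor) / n_siblings

def from_term_statistics(tf, ctx_len, df, n_docs):
    # Context->Concept: a simple tf-idf style estimate, clamped to [0, 1].
    return min(1.0, (tf / ctx_len) * math.log(1 + n_docs / df))

print(from_duration_ratio(300, 400))   # 0.875, cf. Shot2 in Experiment 1
print(from_siblings(2))                # 0.75, cf. CreationInformation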
Experiments
Experiment Overview
- Software written in C, on NT
- Not using INQUERY
- 1. Basic: does the model work at all?
- 2. Real Data: does the model work with our real metadata collection?
- 3. Metrics: what are the precision/recall figures?
Remember...
- Link Inheritance (LI)
- Link Degradation (LID)
- Tree Matching (TM)
  - Threshold (TMT): the attachment is made only if the constraint is met and the Edit Distance is below the specified threshold
  - Weighted (TMW): the attachment is made if the constraint is met; the Edit Distance is used as a weight upon the DN->QN link
- Path Cropping (PC)
Representations
We use XML for DN and QN representation (an evaluation sketch follows the examples):

<Root>
  <Operator Type="WSUM">
    <Concept weight="0.2">breakfast</Concept>
    <Concept weight="0.8">view</Concept>
  </Operator>
</Root>

<Root>
  <Operator Type="AND">
    <Concept>
      <Text>BBC</Text>
      <Constraint>Creation</Constraint>
    </Concept>
    <Concept>Basil</Concept>
  </Operator>
</Root>
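Loading the WSUM query network above and evaluating it against assumed concept beliefs (the 0.4/0.7 values are illustrative, not experiment data):

# Sketch: evaluate the WSUM operator of the QN XML shown above.
import xml.etree.ElementTree as ET

qn = """<Root><Operator Type="WSUM">
  <Concept weight="0.2">breakfast</Concept>
  <Concept weight="0.8">view</Concept>
</Operator></Root>"""

beliefs = {"breakfast": 0.4, "view": 0.7}   # assumed beliefs for one document

op = ET.fromstring(qn).find("Operator")
if op.get("Type") == "WSUM":
    total = sum(float(c.get("weight")) * beliefs[c.text.strip()] for c in op)
    norm = sum(float(c.get("weight")) for c in op)
    print(total / norm)   # ~0.64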
Experiment 1
<Root>
  <Video id='Video1' Duration='1000' weight='1.000000'>
    <CreationInformation weight='0.750000'>
      <Creation weight='1.000000'>
        <Concept weight='0.8' cid='1'>banana</Concept>
      </Creation>
    </CreationInformation>
    <MediaInformation weight='0.750000'/>
    <Scene id='Scene1' KeyFrame='none.jpg' Duration='400' weight='0.700000'>
      <Shot id='Shot1' KeyFrame='none.jpg' Duration='100' weight='0.625000'/>
      <Shot id='Shot2' KeyFrame='none.jpg' Duration='300' weight='0.875000'>
        <Concept weight='0.7' cid='1'>banana</Concept>
      </Shot>
    </Scene>
    <Scene id='Scene2' Duration='600' weight='0.800000'/>
  </Video>
  <Video id='Video2' weight='1.000000'/>
  <Video id='Video3' weight='1.000000'/>
</Root>
Remember: the probabilities come from the duration ratio and the number of context siblings (checked in the sketch below).
Two influences: 1. Creation->banana, 2. Shot2->banana
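As a check (my reading of the numbers, not a formula stated in the deck), the weights in this Document Network are consistent with the floor-plus-ratio estimators sketched earlier:

# Sketch: the DN weights above match 0.5 + 0.5 * (duration or sibling ratio).
def from_duration_ratio(child, parent, floor=0.5):
    return floor + (1.0 - floor) * (child / parent)

def from_siblings(n, floor=0.5):
    return floor + (1.0 - floor) / n

checks = [
    (from_duration_ratio(400, 1000), 0.700),   # Scene1
    (from_duration_ratio(600, 1000), 0.800),   # Scene2
    (from_duration_ratio(100, 400), 0.625),    # Shot1
    (from_duration_ratio(300, 400), 0.875),    # Shot2
    (from_siblings(2), 0.750),                 # CreationInformation, MediaInformation
]
for got, want in checks:
    assert round(got, 6) == want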
Experiment 1
Run 1, No Parameters:
  001  0.364000  Video  Video1
  002  0.245000  Shot   Shot2
  003  0.227500  Scene  Scene1
  004  0.105000  Shot   Shot1
  004  0.105000  Scene  Scene2
  004  0.105000  Video  Video2
  004  0.105000  Video  Video3

Run 7, LI + LID + TMW:
  001  0.288750  Shot   Shot2
  002  0.280313  Scene  Scene1
  003  0.273000  Video  Video1
  004  0.129375  Scene  Scene2
  005  0.123750  Shot   Shot1
  006  0.078750  Video  Video2
  006  0.078750  Video  Video3
- Model works
- Different levels of document granularity (Video/Scene/Shot) retrieved in the same list
- Parameters work, but it is unclear whether they help
Experiment 2
or("Mushroom" "Mushrooms")
or(chips" and("salad" "cream"))
constraint(Classification, "Comedy")
- Model worked with real collection to produce real
results - Results were as expected given knowledge of
material
Experiment 3
- Recall/precision metrics calculated
- Rank in the results list (not result rank) used for analysis
- Ten best Video/Scene/Shot items chosen by the author
- Ranking seems good
  - 6/10 required items in the top 10
  - All 10 within the top 93 (out of 362 in total)
- Figures suggest that the model is working effectively, although this is not conclusive
Discussion
- Size of the collection is too small to produce significant results; no known MPEG7 collections exist
- No independent queries with relevance assessments exist (obviously)
- Software efficiency is crucial: simplifying assumptions can be made to ensure that the IN is computationally viable; the size of the computation is not proportional to the size of the collection
Concluding Remarks
- MPEG7 was found to contain useful tools
- Model for VIR developed
  - Based on an Inference Network, built from MPEG7 files
  - Indexing captures structure, context and concepts
  - Retrieval done using terms, operators and constraints
- Model parameters devised
- Results suggest that the approach taken is well founded, although the lack of data is problematic
Next...
- Build an independent MPEG7 collection with relevance assessments etc.!
- Automatic methods for generating metadata
  - Eliminate bias, increase consistency, improve quality
  - Feature extraction etc. to produce simple semantics
  - Solve the Semantic Gap issue
- Build metadata-based models that exploit contextual information
  - Assume contextual information can help retrieval
  - Assume we have good metadata
- Efficiency of the evaluation is vital
The End