Semi-Automatic Content Extraction from Specifications

Learn more at: http://cecs.wright.edu
1
Semi-Automatic Content Extraction from
Specifications
  • Krishnaprasad Thirunarayan
  • Department of Computer Science and Engineering
  • Wright State University
  • Aaron Berkovich and Dan Sokol
  • Cohesia Corporation

2
Extraction: Summarize in a prescribed vocabulary
(Diagram labels: Spec Text, Spec SDR, Domain Library)
3
Participants
  • Sponsor: National Science Foundation
    • SBIR Phase I and Phase II
  • Industry: Cohesia Corporation
    • Developer of (B2B) content and lower-level
      infrastructure
  • University: Wright State University
    • User-level tools conceptualization and design
  • Others: Geometric Software Solutions,
    • Tool/Product development and integration

4
Outline
  • Background and Goal (What?)
  • Motivation (Why?)
  • Details (How?)
  • Conclusions

5
Background and Goal
6
Manual Content Extraction
  • Input:
    • Paper-based specifications of a manufacturing
      task describing composition, processing, and
      testing of materials
    • Additional constraints imposed by customers and
      vendors
    • Appropriate Ontology and Domain Library defining
      standard vocabulary

7
  • Output:
    • An equivalent formalized description of specs
      in Specification Definition Representation (SDR)
  • Observation:
    • Specs originating from a common source (ASTM,
      SAE, GE) share vocabulary and structure.
    • Linguistic patterns found in specs are exploited
      by an experienced extractor to interpret them.

8
Assistance for Extraction
(Diagram labels: Document Paper, Document Text, Mark-Up Editor (Wizard),
Document SDR, Document Proofer, original)
9
Semi-automatic Content Extraction
  • Starting from an electronic version of a spec,
    develop a strategy for semantic markup, to assist
    in creating an equivalent SDR.
  • Semantic Markup: The task of overlaying an
    abstract syntax (the essence) on the
    free-form text.
    • Spec: Human-sensible
    • Mark-up: Computer-sensible
  • Automate routine mechanical tasks.

10

Semantic Mark-up
(Diagram labels: Spec Name, Spec Title, Revision, Revision Date,
Procedure, Qualifier, Characteristic, Values, Value)
11
Ontology
  • (Gruber)
  • An ontology is an explicit specification of a
    conceptualization, which is an abstract,
    simplified view of the world that we wish to
    represent for some purpose.

12
SDL Ontology
(Diagram: entities Document, Revision, Procedure, Layer, Characteristic,
and Value, with Reference (Ref) links into the Domain Library. Edges are
labeled "1 or many" or "0, 1 or many".)
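
One possible reading of the SDL Ontology slide, written as Python data classes. This is only a sketch: the containment order (Document, Revision, Procedure, Characteristic, Value), the field names, and the `dl_ref` attribute are assumptions made for illustration, and the Layer entity is left out because its place in the diagram cannot be recovered from the transcript. The spec name and procedure in the example come from the appendix slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Value:
    text: str
    dl_ref: Optional[str] = None  # assumed: optional reference into the Domain Library


@dataclass
class Characteristic:
    name: str
    dl_ref: Optional[str] = None
    values: List[Value] = field(default_factory=list)  # 0, 1 or many


@dataclass
class Procedure:
    name: str
    dl_ref: Optional[str] = None
    characteristics: List[Characteristic] = field(default_factory=list)  # 0, 1 or many


@dataclass
class Revision:
    label: str
    procedures: List[Procedure] = field(default_factory=list)  # 0, 1 or many


@dataclass
class Document:
    name: str
    revisions: List[Revision]  # 1 or many

    def __post_init__(self) -> None:
        # Enforce the "1 or many" cardinality assumed for revisions.
        if not self.revisions:
            raise ValueError("a Document needs at least one Revision")


# Illustrative instance only; the content of the real spec is not modeled here.
doc = Document(
    name="AMS 4928",
    revisions=[Revision(label="N", procedures=[Procedure(name="Tensile Test")])],
)
```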
13

Extraction: Spec to SDR
(Diagram labels: Spec Text, Spec SDR)
14
Fundamental Obstacles
  • The relation between the spec and its SDR
    rendition is not linear.
  • Same spec information duplicated in SDR in
    different contexts.
  • Contiguous block of information in SDR spread
    out in spec.
  • Equivalence of phrases hard to formalize.
  • Tables and footnotes abbreviate information in
    irregular and complicated ways.

15
Linearizing through Abstraction
Introducing Specification Definition Language
(Diagram: Original Spec, SDL, SDR; paths labeled Manual (original),
Manual (Ph-I), Compiled (Ph-I), and Literal, Integrated, Semi-automatic (Ph-II))
Original AMS-4976 spec is 8 pages. Its SDL equivalent is 15 pages.
Original AMS-5662J spec is 11 pages. Its SDR equivalent is 30 pages.
18
Introducing Extraction Wizard
19
Motivation (Why?)
20
Business Background (Supply Chain)
(Diagram labels: Engine, Forger, Metal)
21
Diverse and large number of specs and spec users
22
Quality Issues
  • Transcription Errors
  • From spec to hand-written sheet to computer
  • Completeness
  • Info in spec but missing in SDR
  • Soundness
  • Info in SDR but not in spec
  • Uniformity of Form
  • Uniformity in Interpretation
  • Different understanding of the meaning while
    mapping to SDR (Ambiguity/Inconsistency)

23
Efficiency Issues
  • Minimize time/effort required.
  • Automate routine mechanizable tasks
  • Eliminate cut-paste-modify cycle
  • Minimize duplication of information.
  • Concise representation
  • Size of translation = O(Size of spec).
  • Update consistency
  • Flexible rendition into various external forms.

24
Details (How?)
25
Essence of our Approach: Literal Translation
  • Conceptually, every piece of info in SDR owes its
    existence to phrases in spec.
  • Enable maintenance of correspondence between spec
    and its translation, and attempt to embed the
    translation into spec.
  • Requires compilation into SDL/SDR.
  • Cf. XML/XSL Technology

26
  • A semi-automatic approach is feasible only if the
    partially generated translations (annotations)
    are intelligible to an extractor in the context
    of the original spec, and are systematically
    extensible.
  • Note that current manual extractions into SDL are
    not literal even though SDL enables it to an
    extent.

27
SDL Studio and its Extension
  • SDL Studio enables creation and editing of SDL
    documents. It has facilities to search the domain
    library and compile SDL into an equivalent SDR.
  • This can be further enriched using:
    • Improved Domain Library Search
    • Extraction and composition of SDL fragments
    • Templates for commonly occurring procedures
    • Table processor
    • etc.

28
Domain Library Search Engine
29
Domain Library
  • Currently, it contains technical phrases
    pertinent to materials and processing
    requirements
  • Cohesia creates and maintains DLs for in-house
    use and for use by its clients such as GE, Alcoa,
    Allvac, etc.
  • Typical size: 10,000 phrases

31
Improving Domain Library Search
  • Goal: Map equivalent phrases to the same Domain
    Library Term (DLT)
  • Uses:
    • Techniques for prefix removal, stemming, and
      dealing with other variations for root
      recognition
    • Stop-word elimination
    • Abbreviation expansion and alias normalization

32
Algorithm Sketch
  • List<Phrase> dl
  • Phrase ip; Int mt
  • List<Word> dlwm, inwm (with back references)
  • List<Phrase> dlts
  • begin
  •   dl = readAndBuildDomainLibrary()
  •   dlwm = buildWordMapAndBackLinks(dl)
  •     (delete stop words, link words to DLTs)
  •   (ip, mt) = readInputPhraseAndMatchThreshold()
  •   inwm = buildWordMap(ip)
  •   dlts = buildDLTsListContainingMatchedWords(dlwm, inwm)
  •   dlts = evaluateAndFilterDLTs(dlts, mt)
  • end
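
The lookup sketched on this slide could look roughly like the following in Python. It is a minimal sketch under stated assumptions, not the tool's implementation: the stop-word list, the function names, and the scoring rule (percentage of a DLT's words that occur in the input phrase) are all illustrative, and words are compared exactly here instead of through a graded word matcher.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed stop-word list; the real list is not given in the slides.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "or", "the", "to"}


def words_of(phrase: str) -> List[str]:
    """Lower-case a phrase, keep alphanumeric words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", phrase.lower()) if w not in STOP_WORDS]


def build_word_map(domain_library: List[str]) -> Dict[str, Set[int]]:
    """Map each word to the indices of the DLTs containing it (the back links)."""
    word_map: Dict[str, Set[int]] = defaultdict(set)
    for index, term in enumerate(domain_library):
        for word in words_of(term):
            word_map[word].add(index)
    return word_map


def find_terms(
    phrase: str,
    domain_library: List[str],
    word_map: Dict[str, Set[int]],
    threshold: int,
) -> List[Tuple[str, int]]:
    """Return (DLT, score) pairs whose score meets the threshold, best first."""
    input_words = set(words_of(phrase))
    candidates: Set[int] = set()
    for word in input_words:
        candidates |= word_map.get(word, set())
    results = []
    for index in candidates:
        term_words = set(words_of(domain_library[index]))
        # Assumed scoring rule: share of the DLT's words present in the input.
        score = 100 * len(term_words & input_words) // len(term_words)
        if score >= threshold:
            results.append((domain_library[index], score))
    return sorted(results, key=lambda r: (-r[1], r[0]))


library = ["Tensile Test", "Heat Treatment", "Melt Method", "Surface Finish"]
word_map = build_word_map(library)
matches = find_terms("The melt method and heat treatment of the bar", library, word_map, 100)
```

Because candidates are collected word by word, one input phrase can yield several DLTs, and the words of a DLT need not be contiguous in the input.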

33
Matching words
  • Int wordMatch(w1, w2)
  • begin
  •   (normalized: vowels deleted, i.e., only
      consonants present)
  •   if caseUniformAndCleanedMatch(w1, w2)
  •     return 100
  •   if normalizedMatch(w1, w2)
  •     return 90
  •   if orderedNormalizedMatch(w1, w2)
  •     return 70
  •   (analyze for differences due to prefix and
      suffix)
  •   if normalizedDifferenceInPrefixSuffixTables(w1, w2)
  •     return 90
  • end
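
A hedged Python sketch of a graded word matcher in the spirit of wordMatch. Several details are assumptions: the prefix and suffix tables are tiny placeholders, "ordered normalized match" is read here as "same consonants after sorting" (which tolerates transposed letters), and a score of 0 is returned when nothing matches, a case the slide leaves open.

```python
import re

# Placeholder affix tables; the slides say real tables are built from practice.
PREFIXES = ("non", "pre", "re", "un")
SUFFIXES = ("ing", "ed", "es", "s")


def clean(word: str) -> str:
    """Lower-case the word and drop anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]", "", word.lower())


def consonants(word: str) -> str:
    """Normalize a word by deleting its vowels."""
    return re.sub(r"[aeiou]", "", clean(word))


def strip_affixes(word: str) -> str:
    """Remove at most one known prefix and one known suffix."""
    w = clean(word)
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) > len(prefix) + 2:
            w = w[len(prefix):]
            break
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            w = w[: -len(suffix)]
            break
    return w


def word_match(w1: str, w2: str) -> int:
    """Score the similarity of two words as 100, 90, 70, or 0."""
    if clean(w1) == clean(w2):
        return 100
    c1, c2 = consonants(w1), consonants(w2)
    if c1 and c1 == c2:
        return 90  # words differ only in vowels
    if c1 and sorted(c1) == sorted(c2):
        return 70  # same consonants in a different order
    s1, s2 = consonants(strip_affixes(w1)), consonants(strip_affixes(w2))
    if s1 and s1 == s2:
        return 90  # words differ only in a known prefix or suffix
    return 0
```

For instance, "aluminium" and "aluminum" share the consonant skeleton "lmnm" and score 90, while "forged" and "forging" only agree once their suffixes are removed.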

34
Design Rationale
  • Input phrase may contain multiple DLTs.
  • DLT words may not appear contiguous in input.
  • Consonants are significant, and "correct"
    spellings may differ in vowels.
  • Robustness with respect to spelling errors such
    as transposition of letters or missing vowels.
  • Stemmers do not work satisfactorily for words
    appearing in DLTs. Instead, create tables
    customized to deal with prefixes and suffixes
    that arise in practice, and normalize
    dynamically.
  • Err on the side of recall rather than precision.
  • Number of words < Number of DLTs

35
Extraction Tool
36
Overall Approach
  • Preprocessing: Obtain spec in plain text form
    (from MSWord format).
    • This is a practical alternative to scanning and
      OCR-ing a paper-based spec.
    • Saving it in HTML format has the benefit of
      isolating tables. On the downside, it retains
      formatting tags.
  • Semi-Automatic Extraction: Recognize phrases in
    spec text that are associated with a requirement
    and generate SDL fragments to assist an extractor.

37
Two Possible Avenues (From Document to SDL)
  • Iteratively annotate the document text with XML
    tags reflecting the SDL structure and ontology.
    • Generate various views of the document and SDL
      from this single XML Master.
  • Iteratively generate a sequence of progressively
    more detailed SDL documents from spec text.

38
First Avenue: Via XML
  • Semi-automatic extraction is accomplished in two
    phases:
    • Initial automatic markup phase: Systematically
      recognize domain library terms in spec text and
      add suitable XML annotations. Then generate a
      first-cut SDL fragment.
    • Subsequent manual conversion phase: Extractor
      organizes the information and completes the
      translation into an equivalent SDL.
  • Further steps: As the tool matures, automation
    can be attempted to produce more detailed
    extractions.
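
The automatic markup phase can be illustrated with a small Python sketch that wraps each recognized domain library term in an XML element. The element name `dlt`, the attribute `term`, and the function name are hypothetical; the slides do not give the tag vocabulary that the tool emits, and matching here is a plain case-insensitive phrase search.

```python
import re
from typing import List
from xml.sax.saxutils import escape, quoteattr


def tag_terms(text: str, terms: List[str]) -> str:
    """Wrap every occurrence of a known term in a <dlt> element (hypothetical tag)."""
    if not terms:
        return escape(text)
    # Try longer terms first so a long phrase wins over a shorter one inside it.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b", re.IGNORECASE
    )
    canonical = {t.lower(): t for t in terms}
    pieces = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text so the result stays well-formed XML.
        pieces.append(escape(text[last:match.start()]))
        term = canonical[match.group(0).lower()]
        pieces.append(f"<dlt term={quoteattr(term)}>{escape(match.group(0))}</dlt>")
        last = match.end()
    pieces.append(escape(text[last:]))
    return "".join(pieces)


# Made-up sentence, used only to exercise the function.
marked = tag_terms(
    "Melt method & heat treatment are required.",
    ["Melt Method", "Heat Treatment"],
)
```

Keeping the original wording inside the element and the canonical term in an attribute preserves the link between the spec text and the extraction, which is the point of the single XML Master.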

39
(contd)
  • Advantages
  • Focus is on a single persistent XML Master that
    tries to maintain a link between the spec and the
    extractions.
  • All the processing is orchestrated on this XML
    file.
  • Implements various views of the XML source using
    XSL-FO and various transformations on the XML
    source using XSLT.

40
(contd)
  • Disadvantages
  • There is a need to manage a separate SDL version
    to incorporate user inputs and corrections. This
    is because, even though it may be possible to
    represent SDL constructs using XML tags, it may
    not be possible to integrate user edits literally
    into the XML source.

41
Semantic-Markup Algorithm
Insert Structure Tags
Insert Ontology Tags
Infer Missing Char.
Group Char. Values
Group C-Vs into Procedures
42
Functional Components
Text file → Structure Tagger → XML file → DLT Tagger (with Domain Library)
→ XML file → Group Tagger → XML file → SDL Converter → SDL file
43
Tagging and Transforming
  • flex structTagger.flex
  • gcc lex.yy.c -lfl
  • a < GE.txt > GE.xml
  • java org.apache.xalan.xslt.Process -in GE.xml
    -xsl CSDLStylesheet.xsl -out GE.sdl
  • java org.apache.xalan.xslt.Process -in GE.xml
    -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
  • java org.apache.xalan.xslt.Process -in GE.xml
    -xsl OriginalStylesheet.xsl -out GE.org.txt

46
Second Avenue: SDL All Along
  • As there is no obvious way of incorporating SDL
    edits into the XML source in general, try to
    generate legal SDL at different levels of detail
    all along.
  • Advantage: Yields SDL documents that can be
    immediately used in Spec Studio and extended by
    an extractor.
  • Disadvantage: This form does not retain
    correspondence with the original document
    explicitly.

47
Extraction Tool Prototype Operation
49
  • Views: In the context of Spec
  • Plain text view
  • Text view with requirement phrases color coded
    and highlighted
  • View of domain library terms found in the spec
  • Views: In the context of SDL
  • Spec identity view Large Note Method D
    Extraction
  • Method C Extraction
  • Procedure view
  • Characteristic-value pair view

51
Additional Standalone Tools
  • Domain Library Browser
  • Given a word or a phrase, display all the domain
    library information related to it.
  • SDL Fragment Generator
  • Given a sentence, generate an SDL fragment that
    captures its essence.
  • These tools can assist an extractor in composing
    an SDL document incrementally.

52
Future Work
53
Longer-term Vision
  • Marketplace continues to confirm the need for
    tools to capture the semantic interpretation of
    document content
  • Cohesia plans to productize the results of the
    research into a viable commercial product

54
Example Engineering Tasks
  • How to express and represent templates for
    well-known procedures?
  • Alternative to cut-paste-modify cycle
  • Tensile Test
  • Heat Treatment
  • Melt Method
  • Chemistry
  • Packaging

55
  • How to express and represent heterogeneous tables
    and non-trivial footnotes in a spec in a
    convenient and uniform way?
  • How to create, manipulate, and store specs in SDR
    and SDL among other forms and maintain
    interoperability?

56
Example Research Questions
  • What are the forms of extraction rules?
  • Phrase pattern matching
  • Theory of equivalence/subsumption
  • Example: Aliases / Equivalent Phrases
    • Creep = Plastic Strain
    • Delivery Condition = Surface Finish
    • Cause for Rejection = Rejection Criteria
    • Imperfections detrimental to usage of product =
      Free of injurious defects

57
  • Rules for interpreting logic words
    • Connectives: and, or, ...
    • Quantifiers: all, every, each, ...
    • Modifiers: over, under, more, less, ...
    • Negation: not, no, unless, except, free of, ...
  • Mismatch?
    • A, B, and C => A, B, C (union/OR-logic)
  • Distributive Laws?
    • Lot and order number => lot number and order
      number
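
The distributive reading of "lot and order number" can be sketched in Python as follows. This is a heuristic with assumed names, not a rule set from the slides: it copies the final word of the last conjunct onto every single-word conjunct, so an extractor would still need to confirm the result (a phrase such as "chemistry and tensile test" would be expanded wrongly).

```python
import re
from typing import List


def split_conjuncts(phrase: str) -> List[str]:
    """Split 'A, B, and C' or 'A and B' into its conjuncts."""
    parts = re.split(
        r"\s*,\s*(?:and|or)\s+|\s*,\s*|\s+(?:and|or)\s+", phrase.strip()
    )
    return [p for p in parts if p]


def distribute_head(phrase: str) -> List[str]:
    """Copy the head word of the last conjunct onto single-word conjuncts."""
    conjuncts = split_conjuncts(phrase)
    if len(conjuncts) < 2:
        return conjuncts
    last_words = conjuncts[-1].split()
    if len(last_words) < 2:
        return conjuncts  # no shared head word to distribute
    head = last_words[-1]
    return [c if len(c.split()) > 1 else f"{c} {head}" for c in conjuncts]


expanded = distribute_head("Lot and order number")
```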

58
Another Example Scenario
  • Buyer's Purchase Order: Melt Atmosphere: Inert Gas;
    Sulphur < 2.0; Niobium < 0.5
  • Seller's Inventory: Melt Atmosphere: Argon;
    Sulphur < 1.7; Columbium < 0.2
  • Match?
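
A small Python sketch of how such a match might be decided. It assumes that the first set of requirements in the scenario is the buyer's and the second describes the seller's stock, that numeric entries are upper bounds, and that alias and subsumption tables exist; the two tables below hold only the facts this example needs (columbium is an older name for niobium, and argon is an inert gas). All function and table names are illustrative.

```python
from typing import Dict, Union

Requirement = Union[str, float]

ALIASES = {"columbium": "niobium"}  # older element name mapped to the current one
IS_A = {"argon": "inert gas"}       # assumed subsumption table: specific -> general


def canonical(name: str) -> str:
    """Normalize case and whitespace, then resolve aliases."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def subsumes(general: str, specific: str) -> bool:
    """True if the specific value is the general one or a kind of it."""
    g, s = canonical(general), canonical(specific)
    return g == s or IS_A.get(s) == g


def satisfies(offer: Dict[str, Requirement], requirement: Dict[str, Requirement]) -> bool:
    """Check that every required characteristic is met by the offer."""
    offered_by_name = {canonical(k): v for k, v in offer.items()}
    for name, required in requirement.items():
        offered = offered_by_name.get(canonical(name))
        if offered is None:
            return False
        if isinstance(required, (int, float)):
            # Numeric entries are upper bounds: the offer's bound must not be looser.
            if not isinstance(offered, (int, float)) or offered > required:
                return False
        elif not isinstance(offered, str) or not subsumes(required, offered):
            return False
    return True


buyer = {"Melt Atmosphere": "Inert Gas", "Sulphur": 2.0, "Niobium": 0.5}
seller = {"Melt Atmosphere": "Argon", "Sulphur": 1.7, "Columbium": 0.2}
is_match = satisfies(seller, buyer)
```

Under these assumptions the seller's material satisfies the buyer's order, while the reverse check fails because "Inert Gas" is not known to be argon.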
59
  • What are the strategies for searching and
    matching?
    • Top-down: Template-driven expectations
    • Bottom-up: Identifying requirements present
    • Closure: Manual addition / modification /
      disambiguation

60
  • Relevant Information Extraction Research and
    Technologies
  • References
  • Message Understanding Conferences.
  • Work on NLP and IE at UMass, NYU, SRI, etc.
  • Search and Filtering tools.

61
Conclusions
62
(Diagram of the extraction pipeline, with stages grouped under Before,
NSF SBIR Phase I, and NSF SBIR Phase II. Labels: Spec Text on Paper;
Optical Scanning; Spec Text as Electronic Image; Character Recognition;
Spec Text in HTML/XML; Extraction Wizard; SDL (XML); SDL Editor;
SDL Compiler; SDR; Read, Interpret, Type.)
63
Appendix
64
AMS 4928N (Ti Alloy)
65
Tensile Test