Title: Semi-Automatic Content Extraction from Specifications
1 Semi-Automatic Content Extraction from Specifications
- Krishnaprasad Thirunarayan
- Department of Computer Science and Engineering
- Wright State University
- Aaron Berkovich and Dan Sokol
- Cohesia Corporation
2 Extraction: Summarize in a prescribed vocabulary
(Diagram labels: Spec Text, Spec SDR, Domain Library)
3 Participants
- Sponsor: National Science Foundation
- SBIR Phase I and Phase II
- Industry: Cohesia Corporation
- Developer of (B2B) content and lower-level
infrastructure
- University: Wright State University
- User-level tools conceptualization and design
- Others: Geometric Software Solutions, ...
- Tool/Product development and integration
4 Outline
- Background and Goal (What?)
- Motivation (Why?)
- Details (How?)
- Conclusions
5 Background and Goal
6 Manual Content Extraction
- Input
- Paper-based specifications of a manufacturing
task describing composition, processing, and
testing of materials
- Additional constraints imposed by customers and
vendors
- Appropriate Ontology and Domain Library defining
standard vocabulary
7
- Output
- An equivalent formalized description of specs
in Specification Definition Representation (SDR)
- Observation
- Specs originating from a common source (ASTM,
SAE, GE) share vocabulary and structure.
- Linguistic patterns found in specs are exploited
by an experienced extractor to interpret them.
8 Assistance for Extraction
(Diagram labels: Document Paper, Document Text, Mark-Up Editor (Wizard), Document SDR, Document Proofer, original)
9 Semi-automatic Content Extraction
- Starting from an electronic version of a spec,
develop a strategy for semantic markup, to assist
in creating an equivalent SDR.
- Semantic Markup: The task of overlaying an
abstract syntax (the essence) on the
free-form text.
- Spec: Human-sensible
- Mark-up: Computer-sensible
- Automate routine mechanical tasks.
10 Semantic Mark-up
(Figure labels: Spec Name, Spec Title, Revision, Revision Date, Procedure, Qualifier, Characteristic, Values, Value)
11 Ontology
- (Gruber)
- An ontology is an explicit specification of a
conceptualization, which is an abstract,
simplified view of the world that we wish to
represent for some purpose.
12 SDL Ontology
(Diagram: entities Document, Domain Library, Revision, Reference, Procedure, Layer, Characteristic, Value; link labels "1 or many", "0, 1 or many", and "Ref 0, 1 or many")
13 Extraction: Spec to SDR
(Figure labels: Spec Text, Spec SDR)
14 Fundamental Obstacles
- The relation between the spec and its SDR
rendition is not linear.
- Same spec information duplicated in SDR in
different contexts.
- Contiguous block of information in SDR spread
out in spec.
- Equivalence of phrases hard to formalize.
- Tables and footnotes abbreviate information in
irregular and complicated ways.
15 Linearizing through Abstraction
Introducing Specification Definition Language
(Diagram labels: Original Spec, SDL, SDR; Manual (original); Manual (Ph-I); Compiled (Ph-I); Literal, Integrated, Semi-automatic (Ph-II))
Original AMS-4976 spec is 8 pages. Its SDL
equivalent is 15 pages.
Original AMS-5662J spec is 11 pages. Its SDR
equivalent is 30 pages.
18 Introducing Extraction Wizard
19 Motivation (Why?)
20 Business Background (Supply Chain)
(Diagram labels: Engine, Forger, Metal)
21 Diverse and Large number of specs and spec users
22 Quality Issues
- Transcription Errors
- From spec to hand-written sheet to computer
- Completeness
- Info in spec but missing in SDR
- Soundness
- Info in SDR but not in spec
- Uniformity of Form
- Uniformity in Interpretation
- Different understanding of the meaning while
mapping to SDR (Ambiguity/Inconsistency)
23 Efficiency Issues
- Minimize time/effort required.
- Automate routine mechanizable tasks
- Eliminate cut-paste-modify cycle
- Minimize duplication of information.
- Concise representation
- Size of translation is O(size of spec).
- Update consistency
- Flexible rendition into various external forms.
24 Details (How?)
25 Essence of our Approach: Literal Translation
- Conceptually, every piece of info in SDR owes its
existence to phrases in spec.
- Enable maintenance of correspondence between spec
and its translation, and attempt to embed the
translation into spec.
- Requires compilation into SDL/SDR.
- Cf. XML/XSL Technology
26
- Semi-automatic approach is feasible only if the
partially generated translations (annotations)
are intelligible to an extractor in the context
of the original spec, and are systematically
extensible.
- Note that current manual extractions into SDL are
not literal even though SDL enables it to an
extent.
27 SDL Studio and its Extension
- SDL Studio enables creation and editing of SDL
documents. It has facilities to search domain
library and compile SDL into an equivalent SDR.
- This can be further enriched using
- Improved Domain Library Search
- Extraction and composition of SDL fragments
- Providing templates for commonly occurring
procedures
- Table processor
- etc.
28 Domain Library Search Engine
29 Domain Library
- Currently, it contains technical phrases
pertinent to materials and processing
requirements
- Cohesia creates and maintains DLs for in-house
use and for use by its clients such as GE, Alcoa,
Allvac, etc.
- Typical size: 10,000 phrases
31 Improving Domain Library Search
- Goal: Mapping equivalent phrases to same Domain
Library Term
- Uses
- Techniques for prefix removal, stemming, and
dealing with other variations for root
recognition
- Stop words elimination
- Abbreviation expander and alias normalization
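The normalization steps listed above can be illustrated with a minimal Python sketch. The stop-word list, the abbreviation table, and the alias table here are illustrative assumptions, not the contents of the actual Domain Library, and `normalize_phrase` is a hypothetical helper name.

```python
import re

# Illustrative tables only; the real Domain Library tables are not given in the slides.
STOP_WORDS = {"a", "an", "for", "in", "of", "the"}
ABBREVIATIONS = {"uts": "ultimate tensile strength"}        # hypothetical entry
ALIASES = {"columbium": "niobium", "sulphur": "sulfur"}     # older/variant names

def normalize_phrase(phrase):
    """Lower-case, expand abbreviations, map aliases, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    normalized = []
    for token in tokens:
        expanded = ABBREVIATIONS.get(token, token)
        for word in expanded.split():
            word = ALIASES.get(word, word)
            if word not in STOP_WORDS:
                normalized.append(word)
    return " ".join(normalized)
```

Two phrases that normalize to the same string can then be treated as candidates for the same Domain Library Term; word order is kept as-is in this sketch.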
32 Algorithm Sketch
- List<Phrase> dl
- Phrase ip; Int mt
- List<Word> dlwm, inwm   // with back references
- List<Phrase> dlts
- begin
- dl = readAndBuildDomainLibrary()
- dlwm = buildWordMapAndBackLinks(dl)   // delete stop words, link words to DLTs
- (ip, mt) = readInputPhraseAndMatchThreshold()
- inwm = buildWordMap(ip)
- dlts = buildDLTsListContainingMatchedWords(dlwm, inwm)
- dlts = evaluateAndFilterDLTs(dlts, mt)
- end
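A minimal Python sketch of the search loop above, under stated assumptions: words are compared by exact equality after lower-casing (the slides use a graded word match instead), the score of a Domain Library Term (DLT) is taken to be the percentage of its words found in the input phrase, and the stop-word list is illustrative. The function names are hypothetical.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}  # illustrative

def content_words(phrase):
    """Lower-cased words of a phrase with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", phrase.lower())
            if w not in STOP_WORDS]

def build_word_map(domain_library):
    """Map each word to the set of DLTs containing it (the back links)."""
    word_map = defaultdict(set)
    for term in domain_library:
        for word in content_words(term):
            word_map[word].add(term)
    return word_map

def search_domain_library(domain_library, input_phrase, match_threshold):
    """Return (score, DLT) pairs whose score meets the threshold."""
    dl_word_map = build_word_map(domain_library)
    input_words = set(content_words(input_phrase))
    # Collect every DLT that shares at least one word with the input.
    candidates = set()
    for word in input_words:
        candidates |= dl_word_map.get(word, set())
    # Assumed scoring rule: percentage of the DLT's words present in the input.
    scored = []
    for term in candidates:
        term_words = content_words(term)
        hits = sum(1 for w in term_words if w in input_words)
        score = 100 * hits // len(term_words)
        if score >= match_threshold:
            scored.append((score, term))
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Because the lookup goes from input words to DLTs through the word map, the DLT words need not be contiguous in the input, and one input phrase can yield several DLTs.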
33 Matching words
- Int wordMatch(w1, w2)
- begin
- // normalized = vowels deleted, i.e., only consonants present
- if caseUniformAndCleanedMatch(w1, w2)
- return 100
- if normalizedMatch(w1, w2)
- return 90
- if orderedNormalizedMatch(w1, w2)
- return 70
- // analyze for differences due to prefix and suffix
- if normalizedDifferenceInPrefixSuffixTables(w1, w2)
- return 90
- end
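The graded match above might look as follows in Python. Several details are assumptions made for illustration: "ordered" matching is interpreted as comparing the sorted consonants (to tolerate transposed letters), the prefix and suffix tables are tiny hypothetical stand-ins for the customized tables, and a score of 0 is returned when nothing matches.

```python
import re

VOWELS = set("aeiou")
# Hypothetical affix tables; the customized tables are not given in the slides.
PREFIXES = ("non", "pre", "re", "un")
SUFFIXES = ("ment", "ing", "ion", "ed", "es", "s")

def clean(word):
    """Uniform case, with anything but letters and digits removed."""
    return re.sub(r"[^a-z0-9]", "", word.lower())

def normalize(word):
    """Delete vowels so that only consonants (and digits) remain."""
    return "".join(c for c in clean(word) if c not in VOWELS)

def strip_affixes(word):
    """Remove at most one known prefix and one known suffix."""
    w = clean(word)
    for prefix in PREFIXES:
        if w.startswith(prefix) and len(w) - len(prefix) >= 3:
            w = w[len(prefix):]
            break
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[:-len(suffix)]
            break
    return w

def word_match(w1, w2):
    if clean(w1) == clean(w2):
        return 100
    n1, n2 = normalize(w1), normalize(w2)
    if n1 and n1 == n2:
        return 90
    # Assumed reading of the "ordered" match: same consonants in any order.
    if n1 and sorted(n1) == sorted(n2):
        return 70
    s1, s2 = normalize(strip_affixes(w1)), normalize(strip_affixes(w2))
    if s1 and s1 == s2:
        return 90
    return 0  # assumed default; the slide does not state one
```

For example, under these tables `word_match("tensile", "tesnile")` takes the transposition branch, and `word_match("forging", "forged")` is resolved only by the suffix table.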
34 Design Rationale
- Input phrase may contain multiple DLTs.
- DLT words may not appear contiguous in input.
- Consonants are significant, and "correct"
spellings may differ in vowels.
- Robustness with respect to spelling errors such
as transposition of letters or missing vowels.
- Stemmers do not work satisfactorily for words
appearing in DLTs. Instead, create tables
customized to deal with prefixes and suffixes
that arise in practice, and normalize
dynamically.
- Err on the side of recall rather than precision.
- Number of words < Number of DLTs
35 Extraction Tool
36 Overall Approach
- Preprocessing: Obtain spec in plain text form
(from MSWord format).
- This is a practical alternative to scanning and
OCR-ing a paper-based spec.
- Saving it in HTML format has the benefit of
isolating tables. On the downside, it retains
formatting tags.
- Semi-Automatic Extraction: Recognize phrases in
spec text that are associated with a requirement
and generate SDL fragments to assist an extractor.
37 Two Possible Avenues (From Document to SDL)
- Iteratively annotate the document text with XML
tags reflecting the SDL structure and ontology.
- Generate various views of the document and SDL
from this single XML Master.
- Iteratively generate a sequence of progressively
detailed SDL documents from spec text.
38 First Avenue: Via XML
- Semi-automatic extraction is accomplished in two
phases
- Initial automatic markup phase: Systematically
recognize domain library terms in spec text and
add suitable XML annotations. Then generate a
first-cut SDL fragment.
- Subsequent manual conversion phase: Extractor
organizes the information and completes the
translation into an equivalent SDL.
- Further steps: As the tool matures, automation
can be attempted to produce more detailed
extractions.
39 (contd)
- Advantages
- Focus is on a single persistent XML Master that
tries to maintain a link between the spec and the
extractions.
- All the processing is orchestrated on this XML
file.
- Implements various views of the XML source using
XSL-FO and various transformations on the XML
source using XSLT.
40 (contd)
- Disadvantages
- There is a need to manage a separate SDL version
to incorporate user inputs and corrections. This
is because, even though it may be possible to
represent SDL constructs using XML tags, it may
not be possible to integrate user edits literally
into the XML source.
41 Semantic-Markup Algorithm
Insert Structure Tags
Insert Ontology Tags
Infer Missing Char.
Group Char. Values
Group C-Vs into Procedures
42 Functional Components
Text file -> Structure Tagger -> XML file -> DLT Tagger (with Domain Library) -> XML file -> Group Tagger -> XML file -> SDL Converter -> SDL file
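As an illustration of the DLT Tagger stage, the following Python sketch wraps each recognized domain library term in an XML element. The element name `dlt`, its `term` attribute, and the `spec` root element are hypothetical; the XML vocabulary actually used by the tool is not given in the slides. Matching here is exact (case-insensitive) rather than the graded word match.

```python
import re
from xml.sax.saxutils import escape, quoteattr

def tag_domain_terms(text, domain_terms):
    """Wrap each domain library term found in text in a <dlt> element."""
    if not domain_terms:
        return "<spec>" + escape(text) + "</spec>"
    # Try longer terms first so "Tensile Strength" wins over "Tensile".
    ordered = sorted(domain_terms, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(t) + r"\b" for t in ordered),
        re.IGNORECASE,
    )
    canonical = {t.lower(): t for t in domain_terms}
    parts, pos = [], 0
    for m in pattern.finditer(text):
        parts.append(escape(text[pos:m.start()]))      # untouched text
        term = canonical[m.group(0).lower()]
        parts.append("<dlt term=%s>%s</dlt>"
                     % (quoteattr(term), escape(m.group(0))))
        pos = m.end()
    parts.append(escape(text[pos:]))
    return "<spec>" + "".join(parts) + "</spec>"
```

The original wording stays inside the element and the canonical term goes in the attribute, so the plain text can still be recovered from the tagged file by dropping the tags.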
43 Tagging and Transforming
- flex structTagger.flex
- gcc lex.yy.c -lfl
- a < GE.txt > GE.xml
- java org.apache.xalan.xslt.Process -in GE.xml -xsl CSDLStylesheet.xsl -out GE.sdl
- java org.apache.xalan.xslt.Process -in GE.xml -xsl CExpSDLStylesheet.xsl -out GE.exp.sdl
- java org.apache.xalan.xslt.Process -in GE.xml -xsl OriginalStylesheet.xsl -out GE.org.txt
46 Second Avenue: SDL all along
- As there is no obvious way of incorporating SDL
edits into the XML source in general, try to
generate legal SDL at different levels of detail
all along.
- Advantage: Yields SDL documents that can be
immediately used in Spec Studio and extended by
an extractor.
- Disadvantage: This form does not retain
correspondence with the original document
explicitly.
47 Prototype Operation
Extraction Tool Prototype Operation
49
- Views: In the context of Spec
- Plain text view
- Text view with requirement phrases color coded
and highlighted
- View of domain library terms found in the spec
- Views: In the context of SDL
- Spec identity view
- Large Note Method D Extraction
- Method C Extraction
- Procedure view
- Characteristic-value pair view
51 Additional Standalone Tools
- Domain Library Browser
- Given a word or a phrase, display all the domain
library information related to it.
- SDL Fragment Generator
- Given a sentence, generate an SDL fragment that
captures its essence.
- These tools can assist an extractor in composing
an SDL document incrementally.
52 Future Work
53 Longer-term Vision
- Marketplace continues to confirm the need for
tools to capture the semantic interpretation of
document content
- Cohesia plans to productize the results of the
research into a viable commercial product
54 Example Engineering Tasks
- How to express and represent templates for
well-known procedures?
- Alternative to cut-paste-modify cycle
- Tensile Test
- Heat Treatment
- Melt Method
- Chemistry
- Packaging
55
- How to express and represent heterogeneous tables
and non-trivial footnotes in a spec in a
convenient and uniform way?
- How to create, manipulate, and store specs in SDR
and SDL among other forms and maintain
interoperability?
56 Example Research Questions
- What are the forms of extraction rules?
- Phrase pattern matching
- Theory of equivalence/subsumption
- Example: Aliases / Equivalent Phrases
- Creep <=> Plastic Strain
- Delivery Condition <=> Surface Finish
- Cause for Rejection <=> Rejection Criteria
- Imperfections detrimental to usage of product <=>
Free of injurious defects
57
- Rules for interpreting logic words
- Connectives: and, or, ...
- Quantifiers: all, every, each, ...
- Modifiers: over, under, more, less, ...
- Negation: not, no, unless, except, free of
...
- Mismatch?
- A, B, and C => A, B, C (union/OR-logic)
- Distributive Laws?
- Lot and order number => lot number and order
number
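The distributive reading of "Lot and order number" can be sketched as a simple heuristic in Python. This is only one possible rule, assumed for illustration: the last word of the final conjunct is treated as a shared head noun and is appended to any earlier conjunct that consists of a single word. The function name `distribute_head` is hypothetical.

```python
import re

def distribute_head(phrase):
    """Split a coordinated phrase and share the head noun of the last
    conjunct with earlier single-word conjuncts."""
    parts = [p.strip()
             for p in re.split(r",|\band\b|\bor\b", phrase, flags=re.IGNORECASE)
             if p.strip()]
    if len(parts) < 2:
        return parts
    last_words = parts[-1].split()
    if len(last_words) < 2:
        return parts  # e.g. "A, B, and C": nothing to distribute
    head = last_words[-1]
    # Multi-word conjuncts are left alone; they may already be complete.
    return [p if len(p.split()) > 1 else p + " " + head for p in parts]
```

The rule is deliberately conservative: "heat treatment and surface finish" is left as two phrases, which is why such cases would still need review by an extractor.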
58 Another Example Scenario
Buyer's Purchase Order
Seller's Inventory
Melt Atmosphere: Inert Gas; Sulphur < 2.0; Niobium < 0.5
Melt Atmosphere: Argon; Sulphur < 1.7; Columbium < 0.2
Match?
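A minimal Python sketch of how such a match could be decided, under several assumptions: the first requirement set is taken to be the buyer's purchase order and the second the seller's inventory, numeric values are read as upper bounds, and the alias and is-a tables are tiny hypothetical fragments (Columbium is an older name for Niobium; Argon is an inert gas).

```python
# Hypothetical fragments of an alias table and an is-a taxonomy.
ALIASES = {"columbium": "niobium"}
IS_A = {"argon": "inert gas"}  # assumed to contain no cycles

def canon(name):
    """Lower-case a name and replace it by its preferred alias."""
    key = name.strip().lower()
    return ALIASES.get(key, key)

def canonical_spec(spec):
    return {canon(k): v for k, v in spec.items()}

def value_satisfies(offered, required):
    """Numbers are upper bounds; strings are categories compared via IS_A."""
    numeric = (int, float)
    if isinstance(offered, numeric) and isinstance(required, numeric):
        return offered <= required   # "< 1.7" implies "< 2.0"
    if isinstance(offered, str) and isinstance(required, str):
        o, r = canon(offered), canon(required)
        while o != r and o in IS_A:  # climb the taxonomy
            o = IS_A[o]
        return o == r
    return False

def matches(inventory, order):
    """True if every requirement in the order is met by the inventory item."""
    inv, req = canonical_spec(inventory), canonical_spec(order)
    return all(name in inv and value_satisfies(inv[name], bound)
               for name, bound in req.items())
```

With these tables the seller's item satisfies the buyer's order, but not the other way round, since "inert gas" does not imply "argon"; the check is one of subsumption rather than equality.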
59
- What are the strategies for searching and
matching?
- Top-down: Template-driven expectations
- Bottom-up: Identifying requirements present
- Closure: Manual addition / modification /
disambiguation
60
- Relevant Information Extraction Research and
Technologies
- References
- Message Understanding Conferences.
- Work on NLP and IE at UMass, NYU, SRI, etc.
- Search and Filtering tools.
61 Conclusions
62
(Diagram labels: NSF SBIR Phase II; NSF SBIR Phase I; Before; Spec Text on Paper; Paper Scanning; Spec Text as Electronic Image; Optical Character Recognition; Spec Text in HTML/XML; Extraction Wizard; SDL (XML); SDL Editor; SDL Compiler; SDR; Read, Interpret, Type)
63 Appendix
64 AMS 4928N (Ti Alloy)
65 Tensile Test