Title: XML
1 XML
- Using Extensible Markup Language for Genealogy
April 19, 2003 Tim Costin For CAGGNI
2 At present, there can be little doubt that the
whole of mankind is in mortal danger, not
because we are short of scientific and
technological know-how, but because we tend to
use it destructively, without wisdom. E.F.
Schumacher, Small is Beautiful, 1973
Try to realize that your data is far more
important than the applications that access
it. Eric Miller, Information Week, Oct 14, 2002,
Semantic Web activity lead for the World Wide
Web Consortium
3 Too Many Forks
Henry Petroski, The Evolution of Useful Things, 1992
4 Data Format Tower of Babel
[Diagram labels: Data Loss; One format; Hard to use; Save as webpage]
5 Relative Portability
FTW
DOC
PDF
TXT
GED
HTML
XML
6 Round Peg in a Square Hole
7 What is the problem?
- Software vendors using proprietary data formats
- Operating system dependencies and differences
- Language and character set dependencies and differences
- In general, a lack of broad industry standards
- As a result:
- Data cannot be easily reused with other software
- Data cannot be easily searched
- Data and software quickly become obsolete
- Data requires manual effort or special conversion programs to achieve limited portability
8 First Family
We will use the Kennedy family as our sample
family.
9 Family Tree Format (447K)
FTM (Family Tree Maker) stores data in a bulky, inefficient,
unreadable, untransferable, proprietary format
10 What is a good data format?
- Is it a widely accepted standard?
- Is it reusable by other programs?
- Is it portable to other OSs, languages, character sets?
- Is it stored efficiently?
- Is it accessible and searchable?
If your genealogy data format has these features,
your genealogy data will have much less chance of
becoming obsolete and a much greater chance of
being readily available for your descendants.
12 HTML Sample (6K)
HTML uses format tags, without genealogical
meaning
<HTML>
<HEAD>
<TITLE>I1 John Fitzgerald KENNEDY (29 May 1917 - 22 Nov 1963)</TITLE>
</HEAD>
<BODY>
<H2><A NAME="I1"></A>John Fitzgerald KENNEDY</H2>
<A HREF="#2">2</A>
<H3>29 May 1917 - 22 Nov 1963</H3>
<UL>
<LI><EM>BIRTH</EM> 29 May 1917, Brookline, MA, USA
<LI><EM>DEATH</EM> 22 Nov 1963, Dallas, TX, USA <A HREF="#1">1</A>
<LI><EM>REFERENCE</EM> 1
</UL>
<B>Father </B><A HREF="../d0000/g0000037.html#I8">Joseph Patrick KENNEDY</A><BR>
<B>Mother </B><A HREF="../d0000/g0000038.html#I9">Rose Elizabeth FITZGERALD</A><BR>
<BR>
<B>Family 1</B> <A HREF="../d0000/g0000031.html#I2">Jaqueline Lee BOUVIER</A>
<UL>
<LI><EM>MARRIAGE</EM> 12 Sep 1953, Newport, RI, USA
</UL>
<OL>
<LI><TT>&nbsp;</TT><A HREF="../d0000/g0000034.html#I5">Caroline Bouvier KENNEDY</A>
<LI><TT>&nbsp;</TT><A HREF="../d0000/g0000035.html#I6">John Fitzgerald KENNEDY</A>
<LI><TT>&nbsp;</TT><A HREF="../d0000/g0000036.html#I7">Patrick Bouvier KENNEDY</A>
</OL>
..
13 What is GEDCOM
- GEDCOM is an acronym for "GEnealogical Data COMmunication".
- GEDCOM is a standard for transferring genealogy data from one genealogy program to another.
- Authored by the Church of Jesus Christ of Latter-day Saints (LDS or Mormon Church).
- The current version is 5.5, dated 1996
14 GEDCOM 5.5 sample (4K)
GEDCOM uses short TAGS of up to 4 characters (shown in upper
case) to label data
Family 1 consists of I1 John Kennedy, I2
Jacqueline Bouvier, I5 Caroline Kennedy, I6
John-John Kennedy, I7 Patrick Kennedy
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I5@
1 CHIL @I6@
1 CHIL @I7@
1 MARR
2 DATE 12 SEP 1953
2 PLAC Newport, RI, USA
0 @F2@ FAM
1 HUSB @I8@
1 WIFE @I9@
1 CHIL @I10@
1 CHIL @I1@
1 CHIL @I11@
1 CHIL @I12@
1 CHIL @I13@
1 MARR
2 DATE 07 OCT 1914
2 PLAC Boston, MA, USA
0 @I1@ INDI
1 REFN 1
1 NAME John Fitzgerald/Kennedy/
1 SEX M
1 CHAN
2 DATE 13 FEB 2000
1 BIRT
2 DATE 29 MAY 1917
2 PLAC Brookline, MA, USA
1 DEAT
2 DATE 22 NOV 1963
2 PLAC Dallas, TX, USA
2 NOTE Assassinated by Lee Harvey Oswald.
3 CONT
1 NOTE Educated at Harvard University. Elected Congressman in 1945
2 CONT aged 29 served three terms in the House of Representatives.
2 CONT Elected Senator in 1952. Elected President in 1960, the
2 CONT youngest ever President of the United States.
2 CONT
2 CONT
1 FAMS @F1@
1 FAMC @F2@
0 @I2@ INDI
1 REFN 2
1 NAME Jaqueline Lee/Bouvier/
..
Family 2 consists of I8 Joe Kennedy, I9 Rose
Kennedy, I10 Joe Kennedy, I1 John F. Kennedy, I11
Bobby Kennedy
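To make the GEDCOM line layout concrete, here is a minimal Python sketch that splits each line into its level number, optional @ID@ cross-reference, tag, and value. The function name parse_gedcom_line and the regular expression are illustrative assumptions, not part of the GEDCOM standard or of any genealogy program, and the sketch ignores CONT joining and character-set handling.

```python
import re

# One GEDCOM line: level, optional @XREF@ identifier, tag, optional value.
# This pattern is an illustrative assumption, not the official grammar.
LINE_RE = re.compile(r"^(\d+)\s+(?:(@[^@]+@)\s+)?(\w+)(?:\s(.*))?$")


def parse_gedcom_line(line):
    """Return (level, xref, tag, value) for a single GEDCOM line."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a GEDCOM line: {line!r}")
    level, xref, tag, value = match.groups()
    return int(level), xref, tag, value or ""


# A few lines taken from the Family 1 record on the slide.
SAMPLE = """\
0 @F1@ FAM
1 HUSB @I1@
1 WIFE @I2@
1 CHIL @I5@
1 MARR
2 DATE 12 SEP 1953
2 PLAC Newport, RI, USA
"""

records = [parse_gedcom_line(line) for line in SAMPLE.splitlines()]
for record in records:
    print(record)
```

The level number is what carries the nesting: DATE and PLAC at level 2 belong to the MARR line at level 1 above them, which in turn belongs to the FAM record at level 0.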
15 GEDCOM Testbook Project
- This was a project of the National Genealogical Society
- Volunteers typed the same genealogy into 8 commercial genealogy programs
- The genealogies were exported to and imported from GEDCOM
- Results were less than spectacular
- Your results may vary
16 GEDCOM import/export errors
- Custom tags unknown to GEDCOM standard
- Tags in the wrong position
- Ignored tags
- Tags converted to/from the wrong GEDCOM tag
- Incorrect links
- Tags in wrong format
- Lost or corrupted source information
- Losses affected less commonly used fields more
often
17 Master Genealogist
- TMG offers limited access to its data from another program
- TMG offers limited export to a spreadsheet
- The GenBridge feature of SuperTools improves on GEDCOM 5.5, and not just for Master Genealogist users
- GenBridge is not free, not an industry standard, and not general-purpose software.
18 GenBridge
Family Tree SuperTools brings advanced project
management, the industry's most flexible charting
tools, and many other exclusive features to users
of Family Tree Maker, Personal Ancestral
File, The Master Genealogist, Family Origins,
Ultimate Family Tree, Legacy, and others. By
reading data directly from these programs with
its built-in GenBridge technology, this new
companion product avoids the many problems
normally associated with GEDCOM transfers.
(GEDCOM imports are also supported, however, for
users of other programs.)
19 Computer trade mags are full of XML headlines
20 XML a weapon in Office Suite battle
21 There are even whole magazines on XML.
22 What is XML?
- XML stands for Extensible Markup Language
- XML is a new standard created by the World Wide Web Consortium (W3C) in the late 1990s for the exchange of annotated (tagged) text data between programs
- XML is a metalanguage: a language for defining other languages. Its tags are metadata (data about data), so XML is self-describing.
- XML is a grammar for constructing custom tag (label) languages for different applications.
23 XML Usage Diversity
- XML: text documents and databases
- MusicML: music notation
- GEDCOM 6.0: Genealogy
- ebXML: B2B E-Commerce
- MathML: Mathematics Data
- VoiceXML: Voice Applications
24 Taxonomy Application
25 Universal Language for Data
- XML is meant for storing or transporting data
between programs. This is ideal for genealogy
data and very diverse types of data
26 From Geography Markup to Rendering
<?xml version="1.0" encoding="iso-8859-1"?>
<rs>
<r><name>Horton Plaza</name><URL></URL><labelpos>41.46,77.51</labelpos><c>5076,1540 4986,1540
4895,1539 4803,1539 4715,1539 4622,1539 4534,1538
4534,1641 4534,1745 4534,1856 4622,1856 4711,1856
4800,1856 4893,1855 4984,1855 5075,1854 5075,1749
5076,1646 </c></r>
<r><name>Gaslamp</name><URL></URL><labelpos>44.60,83.00</labelpos><c>5162,1013
5084,1057 5083,1116 5081,1222 5079,1326 5079,1433
5076,1540 5076,1646 5075,1749 5075,1854 5167,1854
5257,1855 5257,1750 5259,1647 5260,1541 5262,1434
5262,1328 5263,1222 5263,1013 </c></r>
. . .
XML encoding of geographic features (such as GML)
27 Universal Computer Language
Java is a programming language that enables
portability of programs to different computers,
operating systems, languages and character
sets. Java was originally designed for small
appliances. Java is highly successful and is now
one of the dominant programming languages. Java works
very well with XML.
28 XML is a Standard file format
- All that is needed is to define a set of tags for each application
- XML is so extensible, it can replace many proprietary file formats
- Won't replace all file types
- Best suited to text formats but can LINK to non-text data
29 XML Solves Problems
- Common grammar allows transmission of data between programs instead of reentry
- Hardware and software independence instead of locking data into proprietary formats good only in one operating system
- Reuse of data in many formats instead of reentry
- Self-describing data allows targeted search instead of searching heterogeneous data.
- Unicode character set allows any language instead of just Western languages
30 XML is a Markup language
<NAME>John Fitzgerald <S>Kennedy</S></NAME>
<SEX>M</SEX>
<BIRT>
  <DATE>29 MAY 1917</DATE>
  <PLAC>Brookline, MA, USA</PLAC>
</BIRT>
31 XML is Extensible
You can make up your own XML tags. You cannot do
that with HTML.
(Tags were shown in red on the slide.)
XML FORMAT
<INDI ID="I1">
  <NAME>John Fitzgerald <S>Kennedy</S></NAME>
  <SEX>M</SEX>
  <BIRT>
    <DATE>29 MAY 1917</DATE>
    <PLAC>Brookline, MA, USA</PLAC>
  </BIRT>
  <DEAT>
    <DATE>22 NOV 1963</DATE>
    <PLAC>Dallas, TX, USA</PLAC>
    <NOTE>Assassinated by Lee Harvey Oswald.<BR/></NOTE>
  </DEAT>
</INDI>
Tags describe the meaning (semantics) of the data.
Formatting is done separately with style sheets.
HTML FORMAT
<P>John Fitzgerald Kennedy
<BR>M
<BR>BORN 29 MAY 1917 in Brookline, MA, USA
<BR>DIED 22 NOV 1963 in Dallas Texas
<P><B>NOTE </B> Assassinated By Lee Harvey Oswald.<BR/>
Tags describe the format of the data.
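Because the XML tags name the meaning of each value, a program can ask for a birth date or a surname directly instead of guessing from layout. The following is a minimal sketch using Python's standard xml.etree.ElementTree module on the record from this slide; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The INDI record from the slide (the NOTE element is omitted for brevity).
XML = """<INDI ID="I1">
  <NAME>John Fitzgerald <S>Kennedy</S></NAME>
  <SEX>M</SEX>
  <BIRT><DATE>29 MAY 1917</DATE><PLAC>Brookline, MA, USA</PLAC></BIRT>
  <DEAT><DATE>22 NOV 1963</DATE><PLAC>Dallas, TX, USA</PLAC></DEAT>
</INDI>"""

indi = ET.fromstring(XML)

# Each lookup names the tag path it wants; no string matching on layout.
surname = indi.findtext("NAME/S")
birth_date = indi.findtext("BIRT/DATE")
death_place = indi.findtext("DEAT/PLAC")

# The full name is the NAME text plus the text of the nested S (surname) tag.
full_name = "".join(indi.find("NAME").itertext()).strip()

print(indi.get("ID"), full_name, surname, birth_date, death_place)
```

The HTML version of the same record offers no equivalent: a program would have to hunt for the words BORN and DIED inside free text.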
32 XML sample (7K) from GEDCOM
XML for genealogy will be much like GEDCOM
<INDI ID="I1">
  <REFN>1</REFN>
  <NAME>John Fitzgerald <S>Kennedy</S></NAME>
  <SEX>M</SEX>
  <BIRT>
    <DATE>29 MAY 1917</DATE>
    <PLAC>Brookline, MA, USA</PLAC>
  </BIRT>
  <DEAT>
    <DATE>22 NOV 1963</DATE>
    <PLAC>Dallas, TX, USA</PLAC>
    <NOTE>Assassinated by Lee Harvey Oswald.<BR/></NOTE>
  </DEAT>
  <NOTE>Educated at Harvard University. Elected Congressman in 1945<BR/>
  aged 29 served three terms in the House of Representatives.<BR/>
  Elected Senator in 1952. Elected President in 1960, the<BR/>
  youngest ever President of the United States.<BR/>
  <BR/>
  </NOTE>
  <FAMS REF="F1"/>
  <FAMC REF="F2"/>
</INDI>
<INDI ID="I2">
  <REFN>2</REFN>
  <NAME>Jaqueline Lee <S>Bouvier</S></NAME>
..
<FAM ID="F1">
  <HUSB REF="I1"/>
  <WIFE REF="I2"/>
  <CHIL REF="I5"/>
  <CHIL REF="I6"/>
  <CHIL REF="I7"/>
  <MARR>
    <DATE>12 SEP 1953</DATE>
    <PLAC>Newport, RI, USA</PLAC>
  </MARR>
</FAM>
<FAM ID="F2">
  <HUSB REF="I8"/>
  <WIFE REF="I9"/>
  <CHIL REF="I10"/>
  <CHIL REF="I1"/>
  <CHIL REF="I11"/>
  <CHIL REF="I12"/>
  <CHIL REF="I13"/>
  <MARR>
    <DATE>07 OCT 1914</DATE>
    <PLAC>Boston, MA, USA</PLAC>
  </MARR>
</FAM>
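The FAM records point at people through REF attributes that match the ID attributes on INDI records. As a rough sketch of how a program might follow those links, the Python below builds an ID-to-name table and resolves one family. It assumes the records sit under a single GED root element (as on the XSL slide later in this deck), the I5 record is filled in from the names listed earlier since its XML is not shown, and family_summary is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; the GED wrapper and the I5 record are assumptions.
XML = """<GED>
  <INDI ID="I1"><NAME>John Fitzgerald <S>Kennedy</S></NAME></INDI>
  <INDI ID="I2"><NAME>Jacqueline Lee <S>Bouvier</S></NAME></INDI>
  <INDI ID="I5"><NAME>Caroline Bouvier <S>Kennedy</S></NAME></INDI>
  <FAM ID="F1">
    <HUSB REF="I1"/><WIFE REF="I2"/><CHIL REF="I5"/>
    <MARR><DATE>12 SEP 1953</DATE><PLAC>Newport, RI, USA</PLAC></MARR>
  </FAM>
</GED>"""

root = ET.fromstring(XML)

# Index every person by ID so REF attributes can be resolved to names.
people = {
    indi.get("ID"): "".join(indi.find("NAME").itertext()).strip()
    for indi in root.iter("INDI")
}


def family_summary(fam):
    """Resolve the HUSB, WIFE and CHIL references of one FAM element."""
    def name_of(tag):
        node = fam.find(tag)
        return people.get(node.get("REF")) if node is not None else None

    return {
        "husband": name_of("HUSB"),
        "wife": name_of("WIFE"),
        "children": [people.get(c.get("REF")) for c in fam.findall("CHIL")],
        "married": fam.findtext("MARR/DATE"),
    }


summary = family_summary(root.find("FAM"))
print(summary)
```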
33 Key Standards for Genealogy in the future
- XML standards (XML, XSLT, DTD)
- GEDCOM standard based on XML: GEDCOM 6.0 will be in XML format
- Browser standards (HTML, CSS, JavaScript)
- Server standards (Java Servlets)
- Related XML languages for public records and geography
34 Trends in the software industry
- Toward Open Source software (Linux, Apache, Java Servlets)
- Toward Freeware and Shareware (XML utilities, GEDCOM utilities, PAF)
- Toward Standards-based software (StarOffice, browser user interface)
- Away from proprietary data formats and toward reusable formats (Microsoft Office formats, all genealogy program formats)
35 GEDCOM 6.0
- Authored by the Family History Department of the Church of Jesus Christ of Latter-day Saints
- Beta released December 6, 2002
- Uses XML format and Unicode
- Includes a DTD (Document Type Definition) that defines the rules for a common vocabulary and grammar for genealogy data in XML files.
36 Presentation WEB Page
[Diagram labels: header (meta-description, meta-keywords, scripts, comments); body (text, links, photos, graphics, forms); format tags (<title>, <p> paragraph, <b> bold, <table>, <ol> ordered list, <h1> header, <button>, <color>, <font>, <align>, <size>); Formatting Stylesheet (CSS)]
- A Web page contains formatted text and images
- Looks good, is accessible, but not very searchable
37 Semantic WEB Page
[Diagram labels: XML data (Kennedy.xml); Transformation Stylesheet (XSL), which formats XML tags by rules (Family_sheet.xsl, Ancestor_chart.xsl, Descendant_chart.xsl); HTML pages (Kennedy_family.html, Kennedy_ancestor_chart.html, Kennedy_descendant_chart.html); Formatting Stylesheet (CSS); format tags (<header>, <title>, <p> paragraph, <b> bold, <table>, <ol> ordered list, <h1> header, <button>, <color>, <font>, <align>, <size>); meta-description, meta-keywords, scripts, comments; text, links, graphics, photos, media, forms]
- A Semantic Web page separates data from the presentation
- Tagged XML data is much more searchable
38 Servlet Examples
http://www.kennedy.org
39 Command Line Examples
40 XSL Stylesheets
XSL stylesheet to create a list of names
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="GED">
    <html>
      <head><title>list of names and birthdays</title></head>
      <body><xsl:apply-templates select="INDI"/></body>
    </html>
  </xsl:template>
  <xsl:template match="INDI">
    <p/>
    <b><xsl:value-of select="NAME"/></b>
    <br/>-----BORN ON <xsl:value-of select="BIRT"/>
  </xsl:template>
</xsl:transform>
Html file
<html>
<head><title>list of names and birthdays</title></head>
<body>
<p/><b>John Fitzgerald Kennedy</b>
<br>-----BORN ON 29 May 1917
<p/><b>Joseph Patrick Kennedy</b>
<br>-----BORN ON 6 SEP 1888
(next person)
</body>
</html>
XML data
<GED>
  <INDI>
    <NAME>John Fitzgerald <S>Kennedy</S></NAME>
    <BIRT>
      <DATE>29 MAY 1917</DATE>
    </BIRT>
  </INDI>
</GED>
- XSL stylesheets have templates to control how to format each XML tag
- XSL stylesheets control which tags to process and in what order
- You can have as many XSL stylesheets as there are ways to format the data
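Python's standard library has no XSLT processor, so the sketch below is not XSLT. It only imitates the same idea, one formatting rule per tag applied to the XML data, using xml.etree.ElementTree. The function names render_ged and render_indi are hypothetical stand-ins for the GED and INDI templates, and the second person's record is reconstructed from the output shown on the slide.

```python
import xml.etree.ElementTree as ET
from html import escape

XML = """<GED>
  <INDI><NAME>John Fitzgerald <S>Kennedy</S></NAME><BIRT><DATE>29 MAY 1917</DATE></BIRT></INDI>
  <INDI><NAME>Joseph Patrick <S>Kennedy</S></NAME><BIRT><DATE>6 SEP 1888</DATE></BIRT></INDI>
</GED>"""


def text_of(node):
    """All text inside a node, similar in spirit to xsl:value-of."""
    return "".join(node.itertext()).strip() if node is not None else ""


def render_indi(indi):
    """Rule for INDI: bold name, then the birth information."""
    name = escape(text_of(indi.find("NAME")))
    born = escape(text_of(indi.find("BIRT")))
    return f"<p/><b>{name}</b><br/>-----BORN ON {born}"


def render_ged(ged):
    """Rule for GED: page skeleton, with one INDI block per person."""
    body = "\n".join(render_indi(indi) for indi in ged.findall("INDI"))
    return (
        "<html>\n<head><title>list of names and birthdays</title></head>\n"
        f"<body>\n{body}\n</body>\n</html>"
    )


page = render_ged(ET.fromstring(XML))
print(page)
```

Writing a second pair of functions (say, for an ancestor chart) would give a different page from the same XML data, which is the point of keeping data and formatting separate.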
41 Searching the Presentation Web
42 Presentation Web Searching
- Search engines search through unstructured data
- Search engines can only try to match strings of characters against some of the data on some of the web pages, not everything
- Search engines do not know what you mean
- Search engines do not know what the data in any web page means
- Most of the intelligence must be supplied behind the eyes of the surfer
- Many of the search results are irrelevant and waste a lot of time
- Search engines like Google make the best of it and have ways to score hits and sort by relevance, or they specialize, like Ancestry.com
43 WGA HOMEPAGE
44 KEYWORD WEB SEARCHES
- Commonly used in business databases like Oracle, Sybase, and DB2
- Search for John Carpenter in a NAME field, not just for the words John Carpenter anywhere on a page
- Only search NAMEs, not everything else on the semantic web
- Won't get articles on carpentry or carpenter ants because of homonyms
- No need for HTML META tags; all XML data is already tagged
- Add GEDCOM to the search to search only genealogy pages
Hypothetical search: DOCTYPE=GEDCOM AND SURNAME=CARPENTER AND BIRTHDATE>1903
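The hypothetical search above can be sketched in a few lines of Python once the data is tagged. Everything here is an assumption made for illustration: the three sample people are invented, a GED root element stands in for DOCTYPE=GEDCOM, the surname is read from the S tag inside NAME, and the birth year is taken as the last word of BIRT/DATE.

```python
import xml.etree.ElementTree as ET

# Invented sample records, used only to exercise the search.
XML = """<GED>
  <INDI ID="I1"><NAME>John <S>Carpenter</S></NAME><BIRT><DATE>12 MAR 1910</DATE></BIRT></INDI>
  <INDI ID="I2"><NAME>Mary <S>Carpenter</S></NAME><BIRT><DATE>3 JUN 1899</DATE></BIRT></INDI>
  <INDI ID="I3"><NAME>Carl <S>Smith</S></NAME><BIRT><DATE>1 JAN 1920</DATE></BIRT>
    <NOTE>Worked as a carpenter.</NOTE></INDI>
</GED>"""


def birth_year(indi):
    """Year from a date such as '12 MAR 1910', or None if it is missing."""
    parts = indi.findtext("BIRT/DATE", default="").split()
    return int(parts[-1]) if parts and parts[-1].isdigit() else None


def search(root, surname, born_after):
    """IDs of people with the given surname born after the given year."""
    if root.tag != "GED":  # stand-in for DOCTYPE=GEDCOM
        return []
    hits = []
    for indi in root.iter("INDI"):
        found = (indi.findtext("NAME/S") or "").strip().upper()
        year = birth_year(indi)
        if found == surname.upper() and year is not None and year > born_after:
            hits.append(indi.get("ID"))
    return hits


hits = search(ET.fromstring(XML), "CARPENTER", 1903)
print(hits)
```

The person whose NOTE mentions a carpenter is not returned, because only the surname field is compared; a plain string search over the same text would have matched it.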
45 Searching the Semantic Web
46 SEMANTIC WEB SEARCHES
- Currently the stuff of science fiction, like HAL or R2D2
- W3C and academics are working on it
- Based on more detailed definitions of XML tags and how they relate to one another, e.g. it will know Illinois is in the USA
- Encodes the meaning of words in context, in phrases and sentences
- Understands words with multiple meanings (polysemy)
- Understands words with the same meaning (synonyms)
- Computers search for what you mean, not for arbitrary words
- Your personal web agent will search the web for you and will know the following is looking for a person in a place at a certain time
Hypothetical search: Find all Ian Kennedys in Waterford from 1800-1820
47 Only a Genealogist regards a step backwards as
progress --Unknown
48 Resources
GEDCOM 6.0: http://www.familysearch.org/GEDCOM/GedXML60.pdf
GEDCOM FAQ: http://www.familysearch.org/Eng/Home/FAQ/frameset_faq.asp?FAQ=faq_gedcom.asp
GEDCOM Testbook Project: http://www.gentech.org/ngsgentech/projects/TestBook2001/index.htm
XML and genealogy: http://www.oasis-open.org/cover/genealogy.html#gedML