Title: Markup Languages ML
1Markup Languages ML
[Title graphic: the letters "ML" surrounded by the prefixes SSS, A, legal-X, SG, VOX, CP, X, DHT, G, HT, math, W, DS, C]
- Yaakov J. Stein
- Chief Scientist, RAD Data Communications
2What do I do?
I digest, edit and produce documents
- business letters
- email
- meeting summaries
- proposals
- reports
- requirement specifications
- project plans
- web pages
- research articles
- review articles
- books
3What do others do?
- Pretty much the same
- US corporations produce >100 billion documents per year
- 90% of a modern institution's information is in documents
- >50% of a typical corporation's effort involves documents
- That's why word processing software
- was expected to bring efficiency increases
- But didn't!
4Word processing?
- PROs
- makes nicer looking documents
- expedites document sharing during creation
- CONs
- typically 30% of effort goes to formatting and reformatting
- doesn't increase information accessibility
- doesn't facilitate information mining
5Databases?
- The natural alternative to documents is databases
- PROs
- increase information accessibility
- facilitate information mining
- CONs
- not human readable
- format inflexible
6The solution
- What we really want is to write unconstrained text
- but to have information retrieval as well!
- Method 1: Automatic text analysis
- AI program analyzes text
- Recognizes document structure, sentence syntax
- Performs gisting, facilitates information mining
- Complete solution equivalent to passing the Turing test
- Method 2: Manual markup
- Document author responsible for marking
- Clarifies document structure
- Enables automated retrieval of selected information
- Suggests presentation format
7Why is text analysis hard?
The man cried FIRE the gun!
The man cried FIRE the gun maker!
8Are MLs computer languages?
- There are many different types of computer languages
- procedural languages
- for (n=0; n<10; n++)
- if (n>5) printf("markup languages are fun!\n");
- graphic languages
- newpath
- 0 0 moveto 0 1 lineto 1 1 lineto 1 0 lineto
- closepath fill
- database languages
- SELECT book FROM biblio WHERE subject='DSP' AND author='STEIN'
- logical languages
- useful(DSP), useful(hardware), fun(DSP), fun(web)
- interesting(X) if useful(X) and fun(X)
- ?-interesting(X)
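The logical-language bullets can be mimicked in a few lines of Python, as a rough sketch only: the facts become sets and the rule becomes a set intersection. The names `useful`, `fun` and `interesting` simply mirror the slide; a real logic language would answer the query by inference rather than by an explicit intersection.

```python
# Facts from the slide, stored as plain sets (an assumption of this sketch;
# a logic language would keep them as clauses).
useful = {"DSP", "hardware"}
fun = {"DSP", "web"}


def interesting():
    """Rule: interesting(X) if useful(X) and fun(X)."""
    return useful & fun


# Query: ?-interesting(X)
print(sorted(interesting()))  # prints ['DSP']
```

The point of the comparison is that the rule states what counts as interesting, not how to search for it.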
9They are!
- Markup languages do not directly instruct computers
- like procedural languages
- rather indirectly instruct computer
- like logical languages
- They do this by using
- elements
- attributes
- entities
- text
<BOOK SUBJECT="dsp">
<TITLE FORMAT="short">DSP-CSP</TITLE>
<AUTHOR>J. Stein</AUTHOR>
This is a great book! &standard-disclaimer;
</BOOK>
(tags)
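As a hedged illustration of the four building blocks, the BOOK example can be fed to Python's standard `xml.etree.ElementTree` parser. The DOCTYPE block and the replacement text of the `standard-disclaimer` entity are assumptions added here so that the fragment parses; they are not part of the slide.

```python
import xml.etree.ElementTree as ET

# The entity declaration is hypothetical: the slide only names the entity.
DOC = """<?xml version="1.0"?>
<!DOCTYPE BOOK [
  <!ENTITY standard-disclaimer "(hypothetical disclaimer text)">
]>
<BOOK SUBJECT="dsp"><TITLE FORMAT="short">DSP-CSP</TITLE><AUTHOR>J. Stein</AUTHOR> This is a great book! &standard-disclaimer; </BOOK>"""

book = ET.fromstring(DOC)        # element: BOOK
title = book.find("TITLE")       # child element
author = book.find("AUTHOR")     # child element

print(book.tag, book.attrib)     # element name and its attributes
print(title.text, title.attrib)  # text content and the FORMAT attribute
print(author.tail)               # free text after AUTHOR, entity expanded
```

The parser hands back elements, attributes and text separately, and the entity reference has already been replaced by its declared text.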
10Some markup element functions
- Structural
- Clarifies document structure
- Delineates document parts
- Descriptive (informative)
- Indicates
- Facilitates information retrieval
- Presentational (display)
- Presents information in nice format
- Helps human readability
- Referential (links, applications)
- Provide hypertext links
- Launch applications
11Structural Markup
- <HEADING>September 1, 2000</HEADING>
- <GREETING>Dear Prof. Stein,</GREETING>
- <BODY>
- I would like to tell you how much I enjoyed reading your new text
- Digital Signal Processing, A Computer Science Perspective.
- I hope we will be able to meet at the next conference.
- </BODY>
- <SIGNATURE>
- Sincerely,
- Dee Espy
- </SIGNATURE>
12Descriptive Markup
- <DATE>September 1, 2000</DATE>
- Dear <PERSON>Prof. Stein,</PERSON>
- I would like to tell you how much I enjoyed reading your new text
- <BOOK>
- Digital Signal Processing, A Computer Science Perspective.
- </BOOK>
- I hope we will be able to meet at the next <EVENT>conference.</EVENT>
- Sincerely,
- <PERSON>Dee Espy</PERSON>
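To suggest how descriptive markup facilitates information retrieval, here is a minimal sketch that pulls every person, book and event out of the letter. The enclosing `<LETTER>` element and the helper name `find_all` are assumptions made for this example (a parser needs a single root element); the inner tags are the ones on the slide.

```python
import xml.etree.ElementTree as ET

# <LETTER> is a hypothetical wrapper; the slide shows only the inner tags.
LETTER = """<LETTER>
<DATE>September 1, 2000</DATE>
Dear <PERSON>Prof. Stein,</PERSON>
I would like to tell you how much I enjoyed reading your new text
<BOOK>Digital Signal Processing, A Computer Science Perspective.</BOOK>
I hope we will be able to meet at the next <EVENT>conference.</EVENT>
Sincerely,
<PERSON>Dee Espy</PERSON>
</LETTER>"""


def find_all(xml_text, tag):
    """Return the text of every element named `tag`, minus edge punctuation."""
    root = ET.fromstring(xml_text)
    return [el.text.strip(" ,.") for el in root.iter(tag)]


print(find_all(LETTER, "PERSON"))
print(find_all(LETTER, "BOOK"))
```

No text analysis is involved: the author's markup already says which strings are people and which are books.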
13Presentational Markup
- <RIGHT-JUSTIFY>September 1, 2000</RIGHT-JUSTIFY>
- <BOLD>Dear Prof. Stein,</BOLD>
- I would like to tell you how much I enjoyed reading your new text
- <UNDERLINE>
- Digital Signal Processing, A Computer Science Perspective.
- </UNDERLINE>
- I hope we will be able to meet at the next
- <BLINK>conference.</BLINK>
- Sincerely,
- <IMAGE SRC="deesignature.jpg" ALIGN="left">
- <FONT FACE="Times-Roman">Dee Espy</FONT>
14Relational Markup
- <today xlink:form="simple" href="date" actuate="auto">
- Dear Prof. Stein,
- I would like to tell you how much I enjoyed reading your new text
- <A HREF="www.amazon.com/exec/obidos/ASIN/04712954">
- Digital Signal Processing, A Computer Science Perspective.
- </A>
- I hope we will be able to meet at the next
- <A HREF="conference">conference.</A>
- Sincerely,
- <IMAGE SRC="dee-signature.jpg" ALIGN="left">
- <A HREF="mailto:dee@dee-epsy.net">Dee Espy</A>
15Generalized Markup Language
- William Tunnicliffe, Stanley Rice (1960s)
- (independently) invent idea of structural markup language
- Problem: need different ML for each type of document
- (letter, report, article, book, etc.)
- Charles Goldfarb, Edward Mosher, Raymond Lorie (IBM), 1973
- invent Generalized Markup Language (GML)
- Solution: use metalanguage
- Document Type Definition (DTD) defines tags
- IBM marked up 90% of its documents with GML
16With GML structure is evident
- Library
- Novels
- Journals
- Textbooks
- Algebraic zoology
- Botanical history
- Computer poetry
- DSP
- DSP-CSP
- DSP just for fun
- Elementary QED
- Title
- Full: Digital Signal Processing, a Computer Science Perspective
- Short: DSP-CSP
- Author
- Name: Jonathan (Y) Stein
- Association: RAD Data Comm.
- Publication
- Publisher: John Wiley
- Year: 2000
- Location: New York
- ISBN: 04712954
17Standard Generalized Markup Language
- Problems with GML
- No validating parser
- Not portable (between computer systems)
- Solution
- SGML
- ANSI 1978
- ISO/IEC 8879, 1986 (Int'l Org. for Standardization / Int'l Electrotechnical Commission)
- JTC1/SC34/WG1 (WG 1 of SubCommittee 34 of Joint Technical Committee 1)
- For presentation
- Document Style Semantics and Specification Language (DSSSL)
18SGML - cont.
- If SGML is so good, why doesn't anyone use it?
- Complexity
- base standard >500 pages
- SGML is a metalanguage
- writing DTD is complex programming
- marked up text is hard to read
- DSSSL adds to complexity
- Inflexibility
- requires absolute conformity
- assumes only one correct way to mark up
- constrains author to dictated structure
- not good at capturing author's structure
19HyperText Markup Language
- CERN (particle physics institute in Switzerland) was an early Internet adopter
- Used extensively for collaboration (articles have long author lists)
- Major problems with format incompatibility
- only straight ASCII worked reliably
- Tim Berners-Lee (computer specialist) defined requirements
- simplicity (couldn't expect physicists to use SGML)
- freedom (didn't need validation, let browser ignore bad markup)
- needed hypertext links (including to documents over Internet)
- presentational markup (papers must look nice, authors used to TeX)
- Solution: HTML, a specific application of SGML (not metalanguage)
20HTML versions
- HTML 1.0 (1989): Berners-Lee original CERN version
- hypertext, images, head/body structure, presentational markup
- HTML 2.0 (1994): IETF standard, RFC 1866
- added lists, forms, etc.
- HTML 3.2 (1997): W3C recommendation (incorporates Netscape extensions)
- added tables, applets, super/sub-scripts
- HTML 4.0 (1997): W3C recommendation (and similar ISO/IEC 15445)
- minimizes presentational markup
- XHTML 1.0 (2000): present W3C recommendation
- reformulates HTML in XML
21 HTML document structure
- <HTML>
- <HEAD>
- global definitions such as
- <TITLE>Web page title</TITLE>
- </HEAD>
- <BODY>
- marked-up text
- </BODY>
- </HTML>
22Some HTML (body) elements
- <H1>Level 1 Heading</H1> (displayed: Level 1 Heading)
- <H2>Level 2 Heading</H2> (displayed: Level 2 Heading)
- <H3>Level 3 Heading</H3> (displayed: Level 3 Heading)
- <EM> emphasized </EM> (displayed: emphasized)
- <P> Paragraph </P> (displayed: Paragraph)
- <A HREF="url">link</A> (displayed: link)
- <UL>
- <LI> item 1 </LI> (displayed: • item 1)
- <LI> item 2 </LI> (displayed: • item 2)
- </UL>
- <OL>
- <LI> item 1 </LI> (displayed: 1. item 1)
- <LI> item 2 </LI> (displayed: 2. item 2)
- </OL>
- <IMG SRC="url">
23Problems with HTML
- Presentational aspects have predominated
- <B> bold text </B>
- <BLINK> blinking text </BLINK>
- <FONT COLOR="red"> red text </FONT>
- Practically no descriptive markup
- Search engines are reduced to flat text search
- Search by topic only through keywords or portals
- Not extensible
- Can't add new tags
- Unknown tags ignored
- Links are relatively simple
- Usually user action is required (except IMG)
- Only full document (with offset) linkable
- Link management is logistic nightmare
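The "flat text search" and "unknown tags ignored" points can be sketched with Python's standard `html.parser`. The class and function names (`FlatText`, `flat_text`) and the sample string are assumptions for illustration; the sketch only shows that, without descriptive markup, an indexer is left with undifferentiated words.

```python
from html.parser import HTMLParser


class FlatText(HTMLParser):
    """Collect character data only; every tag, known or unknown, is skipped."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def flat_text(html):
    """Return the document's text with all markup discarded."""
    parser = FlatText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so the result is one searchable string.
    return " ".join(" ".join(parser.chunks).split())


# <BOOK> is not an HTML tag; it is silently ignored like the others.
sample = "<B>bold text</B> <BOOK>DSP-CSP</BOOK> <BLINK>blinking text</BLINK>"
print(flat_text(sample))  # prints: bold text DSP-CSP blinking text
```

Nothing in the output records that "DSP-CSP" was a book, which is the limitation the slide describes.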
24Not everything is HTML
- Due to HTML limitations other tools are also used
- Multimedia extensions
- (dynamic) gif, jpg, ...
- streaming audio
- Common Gateway Interface
- generate HTML on-the-fly
- Perl, C, ...
- Server Push
- Server Pull
- Javascript
- Java
25eXtensible Markup Language
- Simplified (best parts of) SGML (subset of features)
- Flexible content management tool
- W3C recommendation(s)
- Extensible: can add new elements (even without DTD)
- Easy to create special purpose languages (with DTD/schema)
- Includes HTML-like hypertext links
- and extensions (XLink, XPointer)
- The future of the web!
26XML - an Example
- <?xml version="1.0" standalone="yes"?>
- <bibliography>
- <book isbn="04712954">
- <title>Digital Signal Processing a Computer Science Perspective</title>
- <author>Jonathan (Y) Stein</author>
- <publisher>John Wiley and Sons</publisher>
- </book>
- <article>
- <title>False Alarm Reduction for ASR and OCR</title>
- <author>Yaakov Stein</author>
- <proceedings>Tenth AICVNN Symposium</proceedings>
- <pages>195-200</pages>
- </article>
- ...
- </bibliography>
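A short, non-authoritative sketch of what a program can do with this bibliography, using Python's standard `xml.etree.ElementTree`. The function name `titles_by_kind` is an assumption for this example, and the elided entries ("...") are simply left out.

```python
import xml.etree.ElementTree as ET

BIBLIOGRAPHY = """<?xml version="1.0" standalone="yes"?>
<bibliography>
  <book isbn="04712954">
    <title>Digital Signal Processing a Computer Science Perspective</title>
    <author>Jonathan (Y) Stein</author>
    <publisher>John Wiley and Sons</publisher>
  </book>
  <article>
    <title>False Alarm Reduction for ASR and OCR</title>
    <author>Yaakov Stein</author>
    <proceedings>Tenth AICVNN Symposium</proceedings>
    <pages>195-200</pages>
  </article>
</bibliography>"""


def titles_by_kind(xml_text):
    """Return (entry type, title) pairs, one per child of <bibliography>."""
    root = ET.fromstring(xml_text)
    return [(entry.tag, entry.findtext("title")) for entry in root]


for kind, title in titles_by_kind(BIBLIOGRAPHY):
    print(kind, "-", title)
```

Because the element names were chosen by the author, the program can tell a book from an article without guessing from layout.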
27What can we do with an XML file?
- Check if well-formed
- Check if valid (against DTD or schema)
- Display as-is in browser
- Parse in special-purpose program (SAX, DOM)
- Process (XSL) to XML, HTML, etc.
- Display after processing
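Two of these options, the well-formedness check and event-style (SAX) parsing, can be sketched with Python's standard library. The names `is_well_formed`, `TagCounter` and `count_tags` are assumptions for this example. Validation against a DTD or schema is not shown, since the standard-library parsers used here are non-validating.

```python
import xml.sax
import xml.etree.ElementTree as ET
from collections import Counter


def is_well_formed(xml_text):
    """True if the text parses as XML (tags nest and close properly)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


class TagCounter(xml.sax.ContentHandler):
    """SAX handler: called once per start tag, keeps no document tree."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def startElement(self, name, attrs):
        self.counts[name] += 1


def count_tags(xml_text):
    """Count how often each element name occurs, using SAX events."""
    handler = TagCounter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.counts


print(is_well_formed("<a><b/></a>"))  # prints True
print(is_well_formed("<a><b></a>"))   # prints False
print(count_tags("<a><b/><b/></a>")["b"])  # prints 2
```

A DOM-style parser would instead build the whole tree in memory, as `ET.fromstring` does inside the first function.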
28Wireless Markup Language
- Markup language element of Wireless Application Protocol
- WAP Forum (1997)
- Ericsson, Motorola, Nokia, Unwired Planet (phone.com)
- bring Internet to cellular phone users
- re-use fundamental Internet concepts (TCP/IP, HTTP, HTML, JavaScript)
- but adapted to lower bandwidth
- smaller screen
- limited input facilities
- limited computational resources
- applications scale across transport options (GSM, TDMA, CDMA, 3G)
- and device types (mobile phones, personal assistants)
29WML Philosophy
- Defined using XML
- Transported in compressed binary (for bandwidth reduction)
- Applications are modeled as decks of cards
- Features
- Actions (OK, navigation, help) can be performed
- Hyperlinks (like in HTML)
- String variables
- Timers
- WBMP images (B&W)
- Select boxes, forms (for input)
- WMLScript (like JavaScript)
30WML structure
- <?xml version="1.0"?>
- <!DOCTYPE wml ...>
- <wml>
- <card>
- <p>
- text
- </p>
- <p>
- text
- </p>
- </card>
- <card>
- ...
- </card>
- </wml>
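Since WML is defined using XML, an ordinary XML parser can walk a deck card by card. The following is a minimal sketch under stated assumptions: the `id` values, the paragraph texts and the function name `cards_in_deck` are illustrative, and the DOCTYPE line is omitted because the slide elides its contents.

```python
import xml.etree.ElementTree as ET

# A two-card deck in the shape shown on the slide; ids and texts are made up.
DECK = """<?xml version="1.0"?>
<wml>
  <card id="first">
    <p>text</p>
    <p>more text</p>
  </card>
  <card id="second">
    <p>text</p>
  </card>
</wml>"""


def cards_in_deck(wml_text):
    """Map each card's id to the list of its paragraph texts."""
    root = ET.fromstring(wml_text)
    return {
        card.get("id"): [p.text for p in card.findall("p")]
        for card in root.findall("card")
    }


for card_id, paragraphs in cards_in_deck(DECK).items():
    print(card_id, paragraphs)
```

One deck arrives in a single transfer, and the device then moves between its cards locally.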
31Some WML elements
- <p> </p>: text
- <a href="..."> </a>: hyperlink (anchor)
- <do> </do>: action
- <go href="..."/>: go to WML page
- <timer>: trigger event (units: tenths of a second)
- <input/>: input user text
- <prev/>: return to previous page
- $(...): value of variable
- <img src="..."/>: display image
- <postfield name="..." value="..."/>: set variable
- <select> <option> <option> </select>: select box
32Some more markup languages
- VML: Vector (graphics) Markup Language
- VoiceXML
- SSML: Speech Synthesis Markup Language
- CPML: Call Policy Markup Language
- DSML: Directory Services Markup Language
- MathML: Mathematical Markup Language
- CML: Chemical Markup Language
- AML: Astronomical Markup Language
- LegalXML
- BSML: Bioinformatic Sequence Markup Language
- GedML: Genealogical Data Markup Language
- FinXML: Financial market Markup Language
- ChessML
- SDML: Signed Document Markup Language
- RELML: Real Estate Listing Markup Language
- etc. etc. etc. ...
33Examples
- HTML
- html examples
- XML
- xml-file xsl-file xml
- VML
- vml-file
- WML (get M3gate emulator)
- wml examples