Title: Chapter 6 Text and Multimedia Languages and Properties
1Chapter 6: Text and Multimedia Languages and Properties
- Hsin-Hsi Chen
- Department of Computer Science and Information Engineering, National Taiwan University
Part of the following material is selected from Dr. Kuang-hua Chen's talk on XML and RDF (Department of Library and Information Science, National Taiwan University).
2What is a document?
- document: a single unit of information
- a complete logical unit: research paper, book, manual
- part of a larger text: paragraph, passage, an entry in a dictionary
- a physical unit: file, email, Web page
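3Characteristics of a document
[Figure: a document consists of text, structure, and other media, and is characterized by its syntax, presentation style, and semantics.]
- Presentation style: how a document is displayed or printed
- Syntax: expresses structure, presentation style, or even external actions; it is implicit, or expressed in a language
- Semantics: given by the author (creator)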
4Metadata
- Definition
- Data about the data, e.g., the schema in a DBMS
- Describes other information based on some rules or policies
- Types
- Descriptive metadata: metadata that is external to the meaning of the document, e.g., Dublin Core
- Semantic metadata: metadata that can be found within the document's content, e.g., Library of Congress subject codes
5Dublin Core
- Metadata Element Set (15 elements)
- Subject: the topic of the resource, typically keywords or classification codes
- Title: the name given to the resource
- Creator: the entity primarily responsible for making the resource
- Description: an account of the content of the resource
- Publisher: the entity responsible for making the resource available
6Dublin Core (Continued)
- Contributor: an entity that has made a contribution to the resource
- Date: a date associated with the resource
- Type: the nature or genre of the resource
- Format: the data format of the resource, e.g., text/html, ASCII, or JPEG
- Identifier: an unambiguous reference to the resource, e.g., a URL, a URN, or an ISBN
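7Dublin Core (Continued)
- Relation: a reference to a related resource
- Source: a resource from which the present resource is derived
- Language: the language of the resource
- Coverage: the spatial or temporal extent of the resource
- Rights: information about rights held in and over the resource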
8Example: a Dublin Core record
<?xml version="1.0"?>
<dc-record>
  <type>??</type>
  <format>????</format>
  <format>??</format>
  <title>??????????</title>
  <title>cloisonné box with lotus-spray decoration</title>
  <description>1400/1500</description>
  <description>??,??????????????</description>
  <description>?63.cm ??12.4cm ?634.6?</description>
  <description>???,???????????????????,?88?2??</description>
9Example: a Dublin Core record (Continued)
  <subject>??????????</subject>
  <subject>????</subject>
  <subject>??</subject>
  <subject>????</subject>
  <subject>??</subject>
  <subject>??(??????)(r) place</subject>
  <date>1400/1500</date>
  <coverage>??(??????)(r) place</coverage>
  <rights>??,??</rights>
</dc-record>
10Example: a second Dublin Core record
<?xml version="1.0"?>
<dc-record>
  <type>????</type>
  <type>??</type>
  <title>????</title>
  <description>??</description>
  <description>????</description>
  <description>3048.7</description>
  <description>??????????????????????????????????????????????????????????????????????????????????</description>
11Example: a second Dublin Core record (Continued)
  <description>1127/1189</description>
  <description>????????????,????????????????????,?84?9??</description>
  <subject>??</subject>
  <creator>???</creator>
  <date>1127/1189</date>
  <language>zh</language>
  <rights>??,??</rights>
</dc-record>
12MARC
- Machine-Readable Cataloging Record
- The most widely used format for library records
- An example (NTU Library): a record for "Public art in Taiwan" (language eng, published 1999, ISBN 957-02-4468-2, price NT500)
13Web Metadata
- purposes
- cataloging (e.g., BibTeX)
- content rating: protecting children from reading certain types of documents
- intellectual property rights
- digital signatures (for authentication)
- privacy levels
- applications to electronic commerce
- RDF (Resource Description Framework)
14RDF
- a description of nodes and attached attribute/value pairs
- nodes: any Web resource
- attributes: properties of nodes
- values: text strings or other nodes (Web resources or metadata instances)
15RDF Model
[Figure: an RDF statement links a resource (subject) through a property (predicate) to a value (object).]
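The subject/predicate/object view of an RDF statement can be sketched in Python as a list of triples. This is only an illustrative in-memory representation: the `Statement` type and the `values_of` helper are hypothetical names, not part of any RDF library, and the sample triples are taken from the Dublin Core examples later in these slides.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property of that resource
    object: str     # the value: a text string or another resource


statements = [
    Statement("http://x.html", "dc:creator", "Kevin Chen"),
    Statement("http://www.lis.ntu.edu.tw/khchen/", "dc:title", "The Magic Shelter"),
    Statement("http://www.lis.ntu.edu.tw/khchen/", "dc:creator", "Kuang-hua Chen"),
]


def values_of(resource, prop, stmts):
    """Return every value attached to `resource` through property `prop`."""
    return [s.object for s in stmts if s.subject == resource and s.predicate == prop]
```

Calling `values_of("http://x.html", "dc:creator", statements)` looks up the creator of one resource; a property with no statement simply yields an empty list.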
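17RDF Model (Continued)
[Figure: a resource may have several properties; the value of a property is either a literal value or another resource, which in turn has its own properties and values.]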
18Example
[Figure: example RDF graph with an intermediate "Dummy" node whose properties include the e-mail address Tenzi_at_ac.jp.]
19Name Space
- Example:
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax/"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
- The first declaration is the default (RDF) name space; the prefix dc identifies the Dublin Core name space
20DC in RDF
[Figure: a Resource node with one arc per Dublin Core property: dc:title, dc:creator, dc:subject, dc:description, dc:publisher, dc:contributor, dc:date, dc:type, dc:format, dc:identifier, dc:source, dc:language, dc:relation, dc:coverage, dc:rights.]
21A DC Example in RDF
[Figure: the resource http://x.html has the property dc:creator with the value "Kevin Chen".]
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator>Kevin Chen</dc:creator>
  </Description>
</RDF>
22RDF Example
[Figure: the resource http://www.lis.ntu.edu.tw/khchen/ has the property dc:title with the value "The Magic Shelter" and the property dc:creator with the value "Kuang-hua Chen".]
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/khchen/">
    <dc:Title>The Magic Shelter</dc:Title>
    <dc:Creator>Kuang-hua Chen</dc:Creator>
  </Description>
</RDF>
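As a rough illustration of how such a description maps back to statements, the sketch below parses an RDF/XML fragment of this shape with Python's standard `xml.etree.ElementTree` and lists one (resource, property, value) triple per child element. The function name `extract_triples` is hypothetical, the name space URIs are assumed to be the ones shown on the slide with the punctuation restored, and the code handles only this flat `Description` form, not RDF syntax in general.

```python
import xml.etree.ElementTree as ET

# Name space URIs as they appear on the slide (assumed reconstruction).
RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax"
DC_NS = "http://purl.org/dc/elements/1.0/"

DOC = f"""<RDF xmlns="{RDF_NS}" xmlns:dc="{DC_NS}">
  <Description about="http://www.lis.ntu.edu.tw/khchen/">
    <dc:Title>The Magic Shelter</dc:Title>
    <dc:Creator>Kuang-hua Chen</dc:Creator>
  </Description>
</RDF>"""


def extract_triples(xml_text):
    """Return (resource, property, value) triples from flat Description elements."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        resource = desc.get("about")
        for prop in desc:
            # ElementTree reports tags as "{namespace-uri}LocalName".
            uri, _, local = prop.tag[1:].partition("}")
            prefix = "dc" if uri == DC_NS else uri
            triples.append((resource, f"{prefix}:{local}", (prop.text or "").strip()))
    return triples
```

`extract_triples(DOC)` yields one triple for the title and one for the creator, which is the same information the graph drawing conveys.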
23Text
- Formats
- Basic form: ASCII
- Document interchange
- Rich Text Format (RTF): used by word processors
- Portable Document Format (PDF) and PostScript: used for displaying or printing documents
- MIME (Multipurpose Internet Mail Extensions): supports multiple character sets, multiple languages, and multiple media
24Text (Continued)
- compression
- compress (Unix)
- ARJ (PCs)
- ZIP (gzip in Unix and WinZip in Windows)
25Information Theory
- entropy
- Measures information content or information uncertainty
\[ E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i \]
where \(\sigma\) is the number of symbols in the alphabet and \(p_i\) is the probability of symbol \(i\)
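A minimal sketch of the entropy computation, assuming the symbol probabilities are estimated from the character frequencies of the text itself (the slide does not say how the probabilities are obtained; the function name `entropy` is ours):

```python
import math
from collections import Counter


def entropy(text):
    """Entropy in bits per symbol, with p_i estimated as relative frequency."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A text that uses two symbols equally often, such as "aabb", carries one bit per symbol, while a text made of a single repeated symbol carries none.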
26Modeling Natural Language
- Issue 1: how a word is formed
- symbols: those that separate words and those that belong to words
- Symbols are not uniformly distributed: vowels are more frequent than most consonants
- Binomial model (0-order Markov model): each symbol is generated with a certain probability
- k-order Markov model: the probability of a symbol depends on the previous k symbols
- Extension: how a sentence is formed
- e.g., a 5-order Markov model built from the Bible
- finite-state model (regular languages)
- grammar model (context-free and other languages)
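The k-order Markov model can be sketched as a table that maps each context of k symbols to the counts of the symbols that follow it. The code below is a toy character-level version under that assumption; `train` and `generate` are hypothetical names, and with k = 0 it reduces to the 0-order model in which every symbol is drawn from one fixed distribution.

```python
import random
from collections import Counter, defaultdict


def train(text, k):
    """Count which symbol follows each context of k symbols."""
    model = defaultdict(Counter)
    for i in range(len(text) - k):
        model[text[i:i + k]][text[i + k]] += 1
    return model


def generate(model, seed, length, rng):
    """Extend `seed` (a string of exactly k symbols) up to `length` symbols."""
    k = len(seed)
    out = seed
    while len(out) < length:
        followers = model.get(out[len(out) - k:])
        if not followers:
            break  # this context never occurred in the training text
        symbols, weights = zip(*followers.items())
        out += rng.choices(symbols, weights=weights)[0]
    return out
```

For example, `generate(train(corpus, 5), corpus[:5], 200, random.Random(0))` produces 200 characters whose every 6-gram occurred in `corpus`, where `corpus` stands for any training text you supply.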
27Modeling Natural Language (Continued)
- Issue 2: how different words are distributed inside each document
- Zipf's law
- The frequency of the i-th most frequent word is \(1/i^{\theta}\) times that of the most frequent word
- In a text of n words with a vocabulary of V words, the i-th most frequent word appears \(n / (i^{\theta} H_V(\theta))\) times, where \(H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}\)
- \(\theta\) is between 1.5 and 2.0
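To make the Zipf formula concrete, the sketch below computes the expected number of occurrences of each rank. The function names are ours, and the normalizing term is assumed to be the harmonic number of order theta, which is what makes the expected counts add up to n.

```python
def harmonic(V, theta):
    """H_V(theta): the sum of 1/j**theta for j = 1..V."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_expected_counts(n, V, theta):
    """Expected occurrences of the 1st..V-th most frequent word in n words."""
    h = harmonic(V, theta)
    return [n / (i ** theta * h) for i in range(1, V + 1)]
```

With theta = 1 the second most frequent word is expected half as often as the first; with theta = 2 it is expected a quarter as often, so larger values of theta concentrate the text in fewer words.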
28A few hundred words account for 50% of the text. Words that are too frequent (stopwords) can be disregarded.
29Modeling Natural Language (Continued)
- Issue 3: the distribution of words in the documents of a collection
- Negative binomial distribution
- The fraction of documents containing a word k times is
\[ F(k) = \binom{\alpha + k - 1}{k} \, p^{k} (1 + p)^{-\alpha - k} \]
where \(p\) and \(\alpha\) depend on the word and the document collection
- e.g., \(p = 9.24\) and \(\alpha = 0.42\) for the word "said" in the Brown corpus
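The negative binomial model can be evaluated numerically as follows. This is a sketch that assumes the usual form of the distribution with parameters p and alpha, and it uses the log-gamma function to extend the binomial coefficient to a non-integer alpha; the function name is ours.

```python
import math


def fraction_of_documents(k, p, alpha):
    """F(k) = C(alpha + k - 1, k) * p**k * (1 + p)**(-alpha - k)."""
    log_f = (
        math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
        + k * math.log(p)
        - (alpha + k) * math.log(1 + p)
    )
    return math.exp(log_f)


# Parameters quoted on the slide for the word "said" in the Brown corpus.
P_SAID, ALPHA_SAID = 9.24, 0.42
```

`fraction_of_documents(0, P_SAID, ALPHA_SAID)` estimates the share of documents in which the word does not occur at all, and the values over all k sum to 1, as a distribution should.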
30Modeling Natural Language (Continued)
- Issue 4: the number of distinct words in a document (document vocabulary)
- Heaps' law
- The vocabulary of a text of n words has size \(V = K n^{\beta}\)
- \(K\) and \(\beta\) depend on the particular text: \(K\) is between 10 and 100, and \(\beta\) is a positive value less than 1 (e.g., \(0.4 < \beta < 0.6\))
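A small sketch of Heaps' law next to a direct measurement, so the prediction can be compared with real text. Both function names are hypothetical, and the tokens are assumed to be already split into words.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted number of distinct words in a text of n words: K * n**beta."""
    return K * n ** beta


def vocabulary_growth(tokens):
    """Observed vocabulary size after each successive token."""
    seen = set()
    sizes = []
    for token in tokens:
        seen.add(token)
        sizes.append(len(seen))
    return sizes
```

Plotting `vocabulary_growth(words)` against the text position on log-log axes should give roughly a straight line of slope beta if the text follows Heaps' law.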
31Modeling Natural Language (Continued)
- Issue 5: the average length of words
- Heaps' law implies that the length of the words in the vocabulary increases logarithmically with the text size
32Similarity Model
- distance function
- symmetric: distance(a,b) = distance(b,a)
- triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c)
- measures
- Edit distance: the minimum number of character insertions, deletions, and substitutions, e.g., edit-distance(color, colour) = 1 and edit-distance(survey, surgery) = 2
- Longest common subsequence: only deletion is allowed, e.g., LCS(survey, surgery) = surey (the non-common characters are deleted)
- Longest common sequence of lines between two files, e.g., the diff command in Unix
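Both measures can be computed with textbook dynamic programming. The sketch below is one straightforward way to do it, not necessarily the algorithm the lecture had in mind; it reproduces the examples given for color/colour and survey/surgery.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def lcs(a, b):
    """One longest common subsequence of a and b."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Applied to lists of lines instead of strings, the same table-filling idea underlies line-oriented comparison tools such as diff.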
33Markup Languages
- Definition
- Textual syntax that describes formatting actions, structure information, text semantics, attributes, etc.
- Types
- Procedural markup
- Descriptive markup
34Procedural Markup
35Descriptive Markup
37SGML (Standard Generalized Markup Language)
- An ISO standard since 1986: ISO 8879
- A meta-language
- HTML is an application of SGML
38Features of SGML
- Flexibility
- Non-proprietary, platform-independent, and system-independent
- Re-usability
39Components of an SGML Document
- SGML declaration
- DTD (Document Type Definition): declares the elements of a document type and how they may be arranged
- DI (Document Instance): an actual document marked up according to a DTD
40SGML Declaration
<!SGML "ISO 8879-1986" ...
41Example: Structure of an Email Document
[Figure: document structure tree. The root element Email contains From, Date, To, Subject, and Body.]
42An SGML DTD for Email
<!--     Elements  Min  Content                          -->
<!--     --------  ---  -------------------------------- -->
<!ELEMENT Email    - -  (From, Date, To+, Subject, Body?)>
<!ELEMENT From     - O  (#PCDATA)>
<!ELEMENT Date     - O  (#PCDATA)>
<!ELEMENT To       - -  (#PCDATA)>
<!ELEMENT Subject  - O  (#PCDATA)>
<!ELEMENT Body     - -  (#PCDATA)>
<!-- End of Email DTD -->
- <!-- ... --> is a comment
- Min: starting and ending tags are compulsory (-) or optional (O)
- Content model: "," concatenation; "|" logical or; "?" 0 or 1 occurrence; "*" 0 or more occurrences; "+" 1 or more occurrences
- #PCDATA: ASCII characters; NDATA: binary data; EMPTY: no content
43An SGML DI for Email DTD
<!DOCTYPE Email SYSTEM "c:\temp\email.dtd">
<Email>
<From>Joe
<Date>1999-7-14 AM 09:20
<To>Jay</To>
<To>Jennifer</To>
<Subject>Learning XML
<Body>XML ?? Web ?????,????!
</Body>
</Email>
- SYSTEM: a user-defined DTD (vs. PUBLIC)
- The ending tags of From, Date, and Subject are omitted because they are optional
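45SGML, DTDs, Document Instances, and Presentation Instances
[Figure: SGML is used to define many DTDs; each DTD governs many document instances (DIs); each DI can be rendered as several presentation instances, such as hypertext, by means of style specifications: DSSSL (Document Style Semantics and Specification Language) or FOSI (Formatted Output Specification Instance).]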
47HTML (Hypertext Markup Language)
- An application of SGML
- HTML 2.0 DTD
- HTML 3.2 DTD
- HTML 4.0 DTD
- Portable
- Supports hyperlinks
- Most HTML instances do not explicitly refer to the DTD
48HTML and SGML
- HTML uses the markup minimization feature of SGML
50XML (eXtensible Markup Language)
- W3C Recommendation, 10 February 1998: XML 1.0
- Backed by vendors such as Microsoft, Netscape, and Sun
- XML is "SGML--" rather than "HTML++"
- Users can define their own tags
- Designed for use on the Web
51W3C Data Format
http://www.w3c.org/
52Three Features of XML
- Extensibility: users can define their own tags
- Structure: XML can describe the structure of a document
- Validation: an XML document can be checked against a DTD
53XML Components
- XML-Language: SGML without tears
- Self-describing documents
- Well-formed and valid documents
- XML-Link: power linking
- simple and extended links
- XML-Style: separate style from content
- XSL (Extensible Stylesheet Language)
54XML-Related Specifications
- XML 1.0: W3C Recommendation, 10 February 1998
- XML Namespaces: W3C Recommendation, 14 January 1999
- XLink and XPointer: W3C Working Draft, 3 March 1998
- XSL: W3C Working Draft, 16 December 1998
55Well-formed XML Rules
- The document contains one or more elements
- There is exactly one root element
- Every element has a start-tag and an end-tag
- Tags must be properly nested (e.g., <B><I>bold and italic</B>italic</I> is not allowed)
- Empty tags must use the XML form (e.g., <img src=""/>)
- Every attribute value must be enclosed in quotation marks (e.g., <font size="2">)
56Writing Well-Formed XML
- Step 1: Make an XML declaration
- Step 2: Create a root element
- Step 3: Write the document in XML
- Step 4: Parse your document
57Step 1: Make an XML Declaration
- without a DTD:
<?xml version="1.0" standalone="yes"?>
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml version="1.0" encoding="big5" standalone="yes"?>
58Step 2: Creating a Root Element
<?xml version="1.0" standalone="yes"?>
<Email>
</Email>
59Step 3: Writing in XML
<?xml version="1.0" encoding="big5" standalone="yes"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML ?? Web ?????,????!
  </Body>
</Email>
- End tags cannot be omitted
60Step 4: Parsing Your Document
- Check whether your XML document conforms to the well-formedness rules
- Use a parser to check well-formedness, for example the XML parser embedded in IE5
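Any conforming XML parser can play the role of the checker. As an assumption-labeled sketch, the helper below (our own name, `is_well_formed`) uses the parser in Python's standard library in place of the one embedded in IE5: parsing succeeds only if the text is well-formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


GOOD = "<Email><From>Joe</From><To>Jay</To></Email>"
BAD_NESTING = "<B><I>bold and italic</B>italic</I>"
BAD_ATTRIBUTE = "<font size=2>text</font>"
MISSING_END_TAG = "<Email><From>Joe</Email>"
```

Only `GOOD` passes; the other three strings each break one of the well-formedness rules (nesting, quoted attribute values, and matching end-tags).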
61Internet Explorer 5.0 Displaying Well-formed XML
62Internet Explorer 5.0 Reporting an Error in an XML Document
63Writing Valid XML
- Step 1: Make an XML declaration
- Step 2: Design a DTD
- Step 3: Write valid XML
- Step 4: Parse your valid XML document
64Step 1: Make an XML Declaration
<?xml version="1.0" standalone="no"?>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml version="1.0" encoding="big5" standalone="no"?>
65Step 2: Designing a DTD
<!--     Elements  Content                          -->
<!--     --------  -------------------------------- -->
<!ELEMENT Email    (From, Date, To+, Subject, Body?)>
<!ELEMENT From     (#PCDATA)>
<!ELEMENT Date     (#PCDATA)>
<!ELEMENT To       (#PCDATA)>
<!ELEMENT Subject  (#PCDATA)>
<!ELEMENT Body     (#PCDATA)>
<!-- End of Email DTD -->
66Step 3: Writing Valid XML
<?xml version="1.0" encoding="big5" standalone="no"?>
<!DOCTYPE Email SYSTEM "email.dtd">
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML ?? Web ?????,????!
  </Body>
</Email>
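Python's standard library does not validate against a DTD, so the following is only a hand-written approximation of what a validating parser checks for this one document type. It assumes the Email content model is (From, Date, To+, Subject, Body?), that is, with one or more To elements, which is what the sample document with two To elements requires; the function name and the regular expression are ours.

```python
import re
import xml.etree.ElementTree as ET

# Child sequence allowed inside Email: From, Date, To+, Subject, Body?
CONTENT_MODEL = re.compile(r"From Date (?:To )+Subject(?: Body)?")


def conforms_to_email_model(xml_text):
    """Check the Email content model by hand (not a general DTD validator)."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "Email":
        return False
    # Every child is declared (#PCDATA), so none may contain sub-elements.
    if any(len(child) > 0 for child in root):
        return False
    child_tags = " ".join(child.tag for child in root)
    return CONTENT_MODEL.fullmatch(child_tags) is not None


EXAMPLE = (
    "<Email><From>Joe</From><Date>1999-7-14 AM 09:20</Date>"
    "<To>Jay</To><To>Jennifer</To><Subject>Learning XML</Subject>"
    "<Body>text</Body></Email>"
)
```

A document can be well-formed and still fail this check, for instance when Date is missing or Subject comes before To, which is exactly the difference between well-formed and valid XML.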
67XML Simple Link
68XML Extended Linking: Multiple Ends
69XML Extended Linking: Addressing by Structure
70XML Extended linking
71XSL: the XML Counterpart of CSS (Cascading Style Sheets)
- Sample: email.css
Email, From, Date, To, Subject, Body {
  display: block;
  margin-left: 5%;
  margin-right: 5%;
  border-style: groove
}
72XML Document with Style
<?xml version="1.0" encoding="big5" standalone="no"?>
<?xml-stylesheet href="email.css" type="text/css"?>
<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
  <Body>XML ?? Web ?????,????!
  </Body>
</Email>
73Internet Explorer 5.0 Displaying an XML Document Styled with CSS
74XML Applications
- Database interchange
- Client-side processing
- User views of the data
- Information filtering
75Multimedia
- media
- text, sound, images, video
- issues
- volume, format, processing requirements
76Formats
- image
- bit-mapped/pixel-based display: the simplest format (XBM, BMP, PCX); disadvantage: redundancy
- compression: CompuServe's Graphics Interchange Format (GIF)
- lossy compression: Joint Photographic Experts Group (JPEG)
- exchange: Tagged Image File Format (TIFF)
77Formats
- Audio
- AU, MIDI, WAVE
- Video
- MPEG, AVI, QuickTime
78Textual Images
- definition
- images of documents that contain mainly typed or typeset text
- obtained by OCR
- image retrieval
- Alternative 1
- At creation time, a set of keywords (called metadata) is associated with each image
- Conventional text retrieval techniques can be applied to the keywords
79Textual Images (Continued)
- Alternative 2
- Use OCR to extract the text of the image
- The resultant ASCII text can be used to extract keywords
- Alternative 3
- Use the symbols extracted from the images as basic units to combine image retrieval techniques with sequence retrieval techniques
80Taxonomy of Web languages
81Related Resources
- HTML 4: http://www.w3.org/TR/REC-html40
- W3C: http://www.w3c.org/
- OCLC: http://purl.oclc.org/
- XML: http://www.xml.org/
- XML Parser: http://xdev.datachannel.com/
- DDML (Document Definition Markup Language): http://www.w3.org/TR/NOTE-ddml
- XSchema: http://purl.oclc.org/NET/xschema
82References
- J. Kunze, "Encoding Dublin Core Metadata in HTML," <ftp://ftp.ietf.org/internet-drafts/draft-kunze-dchtl-00.txt>.
- E. Miller, P. Miller, and D. Brickley, "Guidance on Expressing the Dublin Core within the Resource Description Framework (RDF)," <http://www.ukoln.ac.uk/interop-focus/activites/dc/datamodel/WD-dc-rdf-19990423.htm>.