Chapter 6: Text and Multimedia Languages and Properties

Chapter 6 Text and Multimedia Languages and Properties Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University – PowerPoint PPT presentation

1
Chapter 6: Text and Multimedia Languages and Properties
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

Part of the following material is drawn from Dr. Kuang-hua Chen's talk on XML and
RDF (Department of Library and Information Science,
National Taiwan University)
2
What is a document?
  • Document: a single unit of information
  • A complete logical unit: research paper, book, manual
  • Part of a larger text: paragraph, passage, an entry in a dictionary
  • A physical unit: file, email, Web page

3
Characteristics of a document
  • Document: text + structure + other media
  • Syntax: implicit, or expressed in a language; it can express structure,
    presentation style, or even external actions
  • Semantics
  • Presentation style: how a document is displayed or printed
  • Roles shown in the figure: author, creator
4
Metadata
  • Definition
  • Data about the data, e.g., the schema in a DBMS
  • Describes other information according to some rules or policies
  • Types
  • Descriptive metadata: external to the meaning of the document, e.g., Dublin Core
  • Semantic metadata: found within the document's content, e.g., Library of
    Congress subject codes

5
Dublin Core
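  • Metadata Element Set (15 elements)
  • Subject: the topic of the resource, typically keywords or classification codes
  • Title: the name given to the resource
  • Creator: the person or organization primarily responsible for the content
  • Description: a textual account of the content, such as an abstract
  • Publisher: the entity responsible for making the resource available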

6
Dublin Core (Continued)
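  • Contributor: a person or organization, other than the creator, that contributed
    to the content
  • Date: a date associated with the resource
  • Type: the category or genre of the resource
  • Format: the data format of the resource, e.g., text/html, ASCII, JPEG
  • Identifier: a string or number that uniquely identifies the resource, e.g., a
    URL, URN, or ISBN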

7
Dublin Core (Continued)
  • Relation: a reference to a related resource
  • Source: the resource from which this one is derived
  • Language: the language of the content
  • Coverage: the spatial or temporal scope of the content
  • Rights: a rights-management statement for the resource

8
?????
  • <?xml version="1.0"?><dc-record>
  • <type>??</type>
  • <format>????</format>
  • <format>??</format>
  • <title>??????????</title>
  • <title>cloisonné box with lotus-spray decoration</title>
  • <description>1400/1500</description>
  • <description>??,??????????????</description>
  • <description>?63.cm ??12.4cm ?634.6?</description>
  • <description>???,???????????????????,?88?2??</description>

9
?????(?)
  • <subject>??????????</subject>
  • <subject>????</subject>
  • <subject>??</subject>
  • <subject>????</subject>
  • <subject>??</subject>
  • <subject>??(??????)(r) place</subject>
  • <date>1400/1500</date>
  • <coverage>??(??????)(r) place</coverage>
  • <rights>??,??</rights>
  • </dc-record>

10
???????
  • <?xml version="1.0"?><dc-record>
  • <type>????</type>
  • <type>??</type>
  • <title>????</title>
  • <description>??</description>
  • <description>????</description>
  • <description>3048.7</description>
  • <description>??????????????????????????????????????????????????????????????????????????????????</description>

11
???????(?)
  • <description>1127/1189</description>
  • <description>????????????,????????????????????,?84?9??</description>
  • <subject>??</subject>
  • <creator>???</creator>
  • <date>1127/1189</date>
  • <language>zh</language>
  • <rights>??,??</rights>
  • </dc-record>
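
A record like the ones above can also be produced programmatically. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree; the function name build_dc_record and the title value are hypothetical, and the flat dc-record layout is simply the one the slides show, not an official Dublin Core serialization.

```python
import xml.etree.ElementTree as ET


def build_dc_record(fields):
    """Serialize (element name, value) pairs as a flat <dc-record> element.

    Elements may repeat, as in the slides (several <description> or
    <subject> entries in one record).
    """
    record = ET.Element("dc-record")
    for name, value in fields:
        ET.SubElement(record, name).text = value
    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical values; only the date range and language code
    # are taken from the slide.
    print(build_dc_record([
        ("title", "Example painting"),
        ("date", "1127/1189"),
        ("language", "zh"),
    ]))
```

Keeping the fields as an ordered list of pairs, rather than a dictionary, preserves repeated elements and their order.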

12
MARC
  • Machine-Readable Cataloging record
  • The most widely used format for library records
  • An Example (NTU Lib)?? ?????? Public
    art in Taiwan eng ??? ??????
    ??? ?????????? ?88-??? 1999.???
    ? ?? 29???? ??87???????????
    csh ???? -- ?????? ? ?????
    100982322.??? 100982322.????? 957-02-4
    468-2 ?? NT500.????? cw 88008821.

13
Web Metadata
  • purposes
  • cataloging (e.g., BibTeX)
  • content rating (e.g., protecting children from certain types of documents)
  • intellectual property rights
  • digital signatures (for authentication)
  • privacy levels
  • applications to electronic commerce
  • RDF (Resource Description Framework)

14
RDF
  • A description of nodes and their attached attribute/value pairs
  • Nodes: any Web resource
  • Attributes: properties of nodes
  • Values: text strings or other nodes (Web resources or metadata instances)

15
RDF????
  • Statement = Resource (subject) + Property (predicate) + Value (object)
16
???
??? ???
17
RDF????
[Figure: RDF graph with two resources, three properties, and two values]
18
???
??? ???
Dummy
??
????
??
?????
Tenzi_at_ac.jp
19
Name Space
  • ???????????????
  • ????????????????
  • ??
  • <RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax/"
  •      xmlns:dc="http://purl.org/dc/elements/1.0/">

Name Space
Dublin Core
20
DC in RDF
  • A resource is described by the 15 Dublin Core properties: dc:title,
    dc:creator, dc:subject, dc:description, dc:publisher, dc:contributor,
    dc:date, dc:type, dc:format, dc:identifier, dc:source, dc:language,
    dc:relation, dc:coverage, dc:rights
21
A DC Example in RDF
http://x.html --dc:creator--> Kevin Chen
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://x.html">
    <dc:creator>Kevin Chen</dc:creator>
  </Description>
</RDF>
22
RDF??
http://www.lis.ntu.edu.tw/khchen/ --dc:title--> The Magic Shelter
http://www.lis.ntu.edu.tw/khchen/ --dc:creator--> Kuang-hua Chen
<RDF xmlns="http://www.w3.org/TR/WD-rdf-syntax"
     xmlns:dc="http://purl.org/dc/elements/1.0/">
  <Description about="http://www.lis.ntu.edu.tw/khchen/">
    <dc:Title>The Magic Shelter</dc:Title>
    <dc:Creator>Kuang-hua Chen</dc:Creator>
  </Description>
</RDF>
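
To make the resource/property/value reading concrete, here is a minimal Python sketch that parses the example above and lists its statements as (subject, predicate, object) triples. It assumes the namespace URIs are as printed on the slide with punctuation restored; the function name statements is hypothetical, and this handles only the flat Description form shown here, not RDF in general.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear on the slide (punctuation restored; assumption).
RDF_NS = "http://www.w3.org/TR/WD-rdf-syntax"
DC_NS = "http://purl.org/dc/elements/1.0/"

DOC = f"""<RDF xmlns="{RDF_NS}" xmlns:dc="{DC_NS}">
  <Description about="http://www.lis.ntu.edu.tw/khchen/">
    <dc:Title>The Magic Shelter</dc:Title>
    <dc:Creator>Kuang-hua Chen</dc:Creator>
  </Description>
</RDF>"""


def statements(rdf_xml):
    """Return one (subject, predicate, object) triple per property element."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get("about")  # the resource being described
        for prop in desc:
            # ElementTree expands prefixes to "{uri}name"; show the dc: prefix again.
            predicate = prop.tag.replace(f"{{{DC_NS}}}", "dc:")
            triples.append((subject, predicate, (prop.text or "").strip()))
    return triples


if __name__ == "__main__":
    for triple in statements(DOC):
        print(triple)
```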
23
Text
  • Formats
  • Basic form: ASCII
  • Document interchange
  • Rich Text Format (RTF): used by word processors
  • Portable Document Format (PDF) and PostScript: used for displaying or
    printing documents
  • MIME (Multipurpose Internet Mail Extensions): supports multiple character
    sets, multiple languages, and multiple media

24
Text (Continued)
  • Compression
  • Compress (Unix)
  • ARJ (PCs)
  • ZIP (gzip in Unix and Winzip in Windows)
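
As a small illustration of text compression, the sketch below uses Python's standard gzip module (the same format as the Unix gzip tool) to measure how much a piece of text shrinks. The function name compression_ratio and the sample text are hypothetical.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    sample = b"to be or not to be " * 200  # highly repetitive text
    print(f"ratio: {compression_ratio(sample):.3f}")
    # Compression is lossless: decompressing restores the original bytes.
    assert gzip.decompress(gzip.compress(sample)) == sample
```

Repetitive text compresses well because its information content per symbol is low, which is the quantity entropy measures.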

25
Information Theory
  • Entropy
  • Measures information content or information uncertainty

$$E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$$

where $\sigma$ is the number of symbols in the alphabet and
$p_i$ is the probability of symbol i
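
A minimal Python sketch of the entropy measure, assuming the symbol probabilities are estimated from character frequencies in the text itself (a zero-order model); the function name entropy is hypothetical.

```python
import math
from collections import Counter


def entropy(text: str) -> float:
    """Zero-order entropy in bits per symbol: sum of p_i * log2(1 / p_i)."""
    total = len(text)
    if total == 0:
        return 0.0
    counts = Counter(text)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    for sample in ("aaaa", "abab", "abcd"):
        print(sample, entropy(sample))
```

A text made of one repeated symbol carries no uncertainty, while a text whose symbols are equally likely reaches the maximum of log2 of the alphabet size.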
26
Modeling Natural Language
  • Issue 1: how a word is formed
  • Symbols: those that separate words and those that belong to words
  • Vowels are more frequent than most consonants
  • Binomial model (0-order Markov model): each symbol is generated with a
    certain probability
  • k-order Markov model
  • Extension: how a sentence is formed
  • 5-order Markov model (e.g., over the Bible)
  • Finite-state model (regular languages)
  • Grammar model (context-free and other languages)

27
Modeling Natural Language(Continued)
  • Issue 2: how different words are distributed inside each document
  • Zipf's law
  • The frequency of the i-th most frequent word is $1/i^{\theta}$ times that of
    the most frequent word
  • In a text of n words with a vocabulary of V words, the i-th most frequent
    word appears $n / (i^{\theta} H_V(\theta))$ times, where
    $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$

$\theta$ between 1.5 and 2.0
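
The Zipf estimate can be evaluated directly. This is a minimal Python sketch of the formula $n / (i^{\theta} H_V(\theta))$; the function names and the parameter values in the usage lines are hypothetical choices for illustration, not data from the slides.

```python
def harmonic(V: int, theta: float) -> float:
    """Harmonic number of order theta: H_V(theta) = sum_{j=1..V} 1 / j**theta."""
    return sum(1.0 / j ** theta for j in range(1, V + 1))


def zipf_expected_count(i: int, n: int, V: int, theta: float) -> float:
    """Expected occurrences of the i-th most frequent word under Zipf's law."""
    return n / (i ** theta * harmonic(V, theta))


if __name__ == "__main__":
    n, V, theta = 100_000, 1_000, 1.5  # illustrative values only
    for rank in (1, 2, 10, 100):
        print(rank, round(zipf_expected_count(rank, n, V, theta), 1))
```

Dividing by $H_V(\theta)$ is what makes the expected counts over all V ranks add up to n.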
28
A few hundred words account for 50% of the text.
Words that are too frequent (stopwords) can be
disregarded.
29
Modeling Natural Language(Continued)
  • Issue 3: the distribution of words in the documents of a collection
  • Negative binomial distribution
  • The fraction of documents containing a word k times is

$$F(k) = \binom{\alpha + k - 1}{k} p^{k} (1 + p)^{-\alpha - k}$$

where p and $\alpha$ depend on the word and the
document collection; for the word "said" in the Brown corpus,
$p = 9.24$ and $\alpha = 0.42$
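
A minimal Python sketch of the negative binomial model, assuming the usual form $F(k) = \binom{\alpha+k-1}{k} p^k (1+p)^{-\alpha-k}$ with the generalized binomial coefficient computed through the gamma function, since $\alpha$ need not be an integer. The function name fraction_of_documents is hypothetical; the parameter values are the Brown corpus ones quoted above.

```python
import math


def fraction_of_documents(k: int, p: float, alpha: float) -> float:
    """Fraction of documents in which the word occurs exactly k times."""
    # log of binom(alpha + k - 1, k) = lgamma(alpha + k) - lgamma(k + 1) - lgamma(alpha)
    log_binom = math.lgamma(alpha + k) - math.lgamma(k + 1) - math.lgamma(alpha)
    return math.exp(log_binom + k * math.log(p) - (alpha + k) * math.log(1 + p))


if __name__ == "__main__":
    p, alpha = 9.24, 0.42  # the word "said" in the Brown corpus
    for k in range(4):
        print(k, round(fraction_of_documents(k, p, alpha), 4))
```

Working in log space avoids overflow in the gamma function for large k.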
30
Modeling Natural Language(Continued)
  • Issue 4: number of distinct words in a document (document vocabulary)
  • Heaps' law
  • The vocabulary of a text of n words has size $V = K n^{\beta}$, where K and
    $\beta$ depend on the particular text: K is between 10 and 100, and $\beta$
    is a positive value less than 1 (e.g., $0.4 < \beta < 0.6$)
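
Heaps' law can be compared against a real text by tracking how the vocabulary grows as words are read. The sketch below is a minimal Python illustration; the function names and the sample parameter values are hypothetical, and splitting on whitespace is a simplifying assumption about what counts as a word.

```python
def heaps_vocabulary(n: int, K: float, beta: float) -> float:
    """Predicted vocabulary size V = K * n**beta for a text of n words."""
    return K * n ** beta


def vocabulary_growth(words):
    """Observed number of distinct words after each word of the text."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


if __name__ == "__main__":
    # Illustrative parameters inside the ranges quoted above.
    print(heaps_vocabulary(1_000_000, K=50, beta=0.5))
    print(vocabulary_growth("to be or not to be".split()))
```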

31
Modeling Natural Language(Continued)
  • Issue 5: average length of words
  • Heaps' law
  • The length of the words in the vocabulary
    increases logarithmically with the text size

32
Similarity Model
  • Distance function
  • Symmetric: distance(a,b) = distance(b,a)
  • Triangle inequality: distance(a,c) ≤ distance(a,b) + distance(b,c)
  • Measures
  • Edit distance: the minimum number of character insertions, deletions, and
    substitutions needed to make two strings equal, e.g.,
    Edit-distance(color, colour) = 1, Edit-distance(survey, surgery) = 2
  • Longest common subsequence: only deletion is allowed, e.g.,
    LCS(survey, surgery) = surey (non-common characters are deleted)
  • Longest common sequence of lines between two files, e.g., the diff command
    in Unix
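
Both measures can be computed with dynamic programming. The following Python sketch is one standard way to do it, not necessarily the formulation the lecture used; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def longest_common_subsequence(a: str, b: str) -> str:
    """One longest string obtainable from both a and b by deletions only."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(edit_distance("color", "colour"))
    print(edit_distance("survey", "surgery"))
    print(longest_common_subsequence("survey", "surgery"))
```

The edit distance keeps only two rows of the table, since each row depends only on the previous one; the LCS keeps the full table because it needs it to recover the subsequence itself.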

33
Markup Languages
  • Definition
  • Textual syntax that describes formatting actions,
    structure information, text semantics,
    attributes, etc.
  • Types
  • Procedural Markup
  • Descriptive Markup

34
Procedural Markup
35
Descriptive Markup
36
????????
  • ??????????????
  • ?????????????

37
SGML (Standard Generalized Markup Language)
  • Adopted by ISO as an international standard in 1986 (ISO 8879)
  • ????????
  • A meta-language
  • HTML is an application of SGML

38
SGML ???
  • Flexibility
  • ?????????????????
  • Non-proprietary, platform-independent, and system-independent
  • ?????????????
  • Re-usability

39
SGML?????
  • SGML declaration
  • ??????????,?????????
  • DTD (Document Type Definition)
  • ???????? elements?
  • ?? elements ???????
  • ...
  • DI (Document Instance)
  • ????????

40
SGML Declaration
  • ?? SGML ????????,?????????
  • ??????? SGML declaration,????? SGML ????????????
  • <!SGML "ISO 8879-1986" ...

41
Example: Email
  • An Email consists of From, Date, To, Subject, and Body
42
An SGML DTD for Email
Min column: start and end tags are compulsory (-) or
optional (O)
<!-- ... --> encloses a comment
  • <!-- Elements       Min  Content                           -->
  • <!-- -----------    ---  --------------------------------- -->
  • <!ELEMENT Email     - -  (From, Date, To+, Subject, Body?)>
  • <!ELEMENT From      - O  (#PCDATA)>
  • <!ELEMENT Date      - O  (#PCDATA)>
  • <!ELEMENT To        - -  (#PCDATA)>
  • <!ELEMENT Subject   - O  (#PCDATA)>
  • <!ELEMENT Body      - -  (#PCDATA)>
  • <!-- End of Email DTD -->

Connectors and occurrence indicators: , concatenation; | logical or;
? 0 or 1 occurrence; * 0 or more occurrences; + 1 or more
occurrences
Content types: #PCDATA ASCII characters; NDATA binary data; EMPTY
43
An SGML DI for Email DTD
  • <!DOCTYPE Email SYSTEM "c:\temp\email.dtd">
  • <Email>
  • <From>Joe
  • <Date>1999-7-14 AM 09:20
  • <To>Jay</To>
  • <To>Jennifer</To>
  • <Subject>Learning XML
  • <Body>XML ?? Web ?????,????!
  • </Body>
  • </Email>

SYSTEM: user defined (vs. PUBLIC)
The end tag is optional for From, Date, and Subject
45
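SGML, DTDs, Document Instances, and Presentation
Instances
  • SGML is used to define DTDs; each DTD has document instances (DIs)
  • Presentation instances (e.g., hypertext) are produced from DIs
  • Style languages: DSSSL (Document Style Semantics and Specification
    Language), FOSI (Formatted Output Specification Instance)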
46
SGML?????
  • SGML????????
  • SGML?????Web???
  • ???????

47
HTML (Hypertext Markup Language)
  • ? SGML ???
  • HTML 2.0 DTD
  • HTML 3.2 DTD
  • HTML 4.0 DTD
  • ?? Web ????????????
  • ????
  • ???? (portable)
  • ?????? (hyperlink) ????

Most HTML instances do not explicitly make
reference to the DTD
48
HTML???
  • HTML DTD ???????????????
  • HTML?????? (style)
  • HTML??SGML???????? (markup minimization feature)
  • HTML???? SGML ??????

49
HTML???
  • ??????
  • ????????
  • ???????
  • ?????????
  • ??????????
  • ??????? HTML Extension ???

50
XML (eXtensible Markup Language)
  • W3C Recommendation 10-February-1998
  • XML 1.0
  • ????Microsoft?Netscape?Sun ?...
  • XML is SGML-- rather than HTML++
  • ? SGML??,? HTML??
  • ?????????,???? tags
  • ?? Web ???

51
W3C Data Format
http://www.w3c.org/
52
XML??????
  • Extensibility
  • XML????????,???????
  • Structure
  • XML?????????????
  • Validation
  • XML???? DTD ??????????

53
XML??
  • XML-Language: SGML without tears
  • Self-describing documents
  • Well-formed and valid documents
  • XML-Link: power linking
  • Simple and extended links
  • XML-Style: separate style from content
  • XSL (Extensible Stylesheet Language)

54
XML??????
  • XML 1.0
  • W3C Recommendation 10-Feb-1998
  • XML Namespace
  • W3C Recommendation 14-Jan-1999
  • XLink and XPointer
  • W3C Working Draft 03-March-1998
  • XSL
  • W3C Working Draft 16-Dec-1998

55
Well-formed XML Rules
  • A document must contain one or more elements
  • There must be exactly one root element
  • Every start-tag must have a matching end-tag
  • Tags must nest properly and must not overlap
  • (e.g., <B><I>bold and italic</B>italic</I> is not allowed)
  • Empty tags must use the XML empty-element syntax
  • (e.g., <img src=""/>)
  • All attribute values must be quoted
  • (e.g., <font size="2">)
  • ??????????

56
Writing Well-Formed XML
  • Step 1: Make an XML declaration
  • Step 2: Create a root element
  • Step 3: Write in XML
  • Step 4: Parse your document

57
Step 1: Make an XML Declaration (without DTD)
  • <?xml version="1.0" standalone="yes"?>
  • <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
  • <?xml version="1.0" encoding="big5" standalone="yes"?>

58
Step 2: Creating a Root Element
  • <?xml version="1.0" standalone="yes"?>
  • <Email>
  • </Email>

59
Step 3: Writing in XML
  • <?xml version="1.0" encoding="big5" standalone="yes"?>
  • <Email>
  • <From>Joe</From>
  • <Date>1999-7-14 AM 09:20</Date>
  • <To>Jay</To>
  • <To>Jennifer</To>
  • <Subject>Learning XML</Subject>
  • <Body>XML ?? Web ?????,????!
  • </Body>
  • </Email>

End tags cannot be omitted
60
Step 4: Parsing your document
  • Check whether your XML document conforms to the well-formedness rules
  • Use a parser to check well-formedness
  • For example, the XML parser embedded in IE5
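
Any conforming XML parser can play this role. As a minimal sketch, the Python standard library's xml.etree.ElementTree (swapped in here for the IE5 parser the slide mentions) rejects documents that break the well-formedness rules; the function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    good = """<?xml version="1.0" standalone="yes"?>
<Email>
  <From>Joe</From>
  <To>Jay</To>
  <Subject>Learning XML</Subject>
</Email>"""
    overlapping = "<B><I>bold and italic</B>italic</I>"  # tags overlap
    unquoted = "<font size=2>text</font>"                # attribute value not quoted
    for doc in (good, overlapping, unquoted):
        print(is_well_formed(doc))
```

This checks well-formedness only; it says nothing about whether the document follows a particular DTD.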

61
Explorer 5.0 ??Well-formed XML
62
Explorer 5.0 ?????XML??
63
Writing Valid XML
  • Step 1: Make an XML declaration
  • Step 2: Design a DTD
  • Step 3: Write valid XML
  • Step 4: Parse your valid XML document

64
Step 1: Make an XML Declaration
  • <?xml version="1.0" standalone="no"?>
  • <?xml version="1.0" encoding="UTF-8" standalone="no"?>
  • <?xml version="1.0" encoding="big5" standalone="no"?>

65
Step 2: Designing a DTD
  • <!-- Elements       Content                           -->
  • <!-- -----------    --------------------------------- -->
  • <!ELEMENT Email     (From, Date, To+, Subject, Body?)>
  • <!ELEMENT From      (#PCDATA)>
  • <!ELEMENT Date      (#PCDATA)>
  • <!ELEMENT To        (#PCDATA)>
  • <!ELEMENT Subject   (#PCDATA)>
  • <!ELEMENT Body      (#PCDATA)>
  • <!-- End of Email DTD -->

66
Step 3: Writing Valid XML
  • <?xml version="1.0" encoding="big5" standalone="no"?>
  • <!DOCTYPE Email SYSTEM "email.dtd">
  • <Email>
  • <From>Joe</From>
  • <Date>1999-7-14 AM 09:20</Date>
  • <To>Jay</To>
  • <To>Jennifer</To>
  • <Subject>Learning XML</Subject>
  • <Body>XML ?? Web ?????,????!
  • </Body>
  • </Email>
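
Validity adds a second check on top of well-formedness: the element structure must match the DTD's content model. Python's standard library does not validate against DTDs, so the sketch below hand-codes the Email content model as a regular expression over child element names. It assumes the model is (From, Date, To+, Subject, Body?), where the "+" on To is an assumption (the example contains two To elements); the names EMAIL_MODEL and is_valid_email are hypothetical, and this is an illustration of the idea rather than a real DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Content model (From, Date, To+, Subject, Body?) written over
# space-terminated child element names. The "+" on To is an assumption.
EMAIL_MODEL = re.compile(r"From Date (To )+Subject (Body )?")


def is_valid_email(text: str) -> bool:
    """Check well-formedness, then the Email content model."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not even well-formed
    if root.tag != "Email":
        return False
    # (#PCDATA) children may hold text only, no nested elements.
    if any(len(child) for child in root):
        return False
    sequence = "".join(child.tag + " " for child in root)
    return EMAIL_MODEL.fullmatch(sequence) is not None


if __name__ == "__main__":
    doc = """<Email>
  <From>Joe</From>
  <Date>1999-7-14 AM 09:20</Date>
  <To>Jay</To>
  <To>Jennifer</To>
  <Subject>Learning XML</Subject>
</Email>"""
    print(is_valid_email(doc))
```

A document can be well-formed and still fail this check, for example when Subject comes before To.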

67
XML Simple Link
68
XML Extended linking: multiple ends
69
XML Extended linking: addressing by structure
70
XML Extended linking
71
XSL: the XML counterpart of CSS (Cascading Style
Sheets)
  • Sample: email.css
  • Email, From, Date, To, Subject, Body {
  • display: block; margin-left: 5%;
  • margin-right: 5%; border-style: groove }

72
XML document with Style
  • <?xml version="1.0" encoding="big5" standalone="no"?>
  • <?xml-stylesheet href="email.css" type="text/css"?>
  • <Email>
  • <From>Joe</From>
  • <Date>1999-7-14 AM 09:20</Date>
  • <To>Jay</To>
  • <To>Jennifer</To>
  • <Subject>Learning XML</Subject>
  • <Body>XML ?? Web ?????,????!
  • </Body>
  • </Email>

73
Explorer 5.0 ????CSS?XML??
74
XML???
  • Database interchange
  • Client-side processing
  • User views of the data
  • Information filtering

75
Multimedia
  • Media: text, sound, images, video
  • Issues: volume, format, processing requirements

76
Formats
  • Image
  • Bit-mapped (pixel-based) display: the simplest format, e.g., XBM, BMP, PCX;
    disadvantage: redundancy
  • Compression: CompuServe's Graphics Interchange Format (GIF)
  • Lossy compression: Joint Photographic Experts Group (JPEG)
  • Exchange: Tagged Image File Format (TIFF)

77
Formats
  • Audio
  • AU, MIDI, WAVE
  • Video
  • MPEG, AVI, QuickTime

78
Textual Images
  • definition
  • images of documents that contain mainly typed or
    typeset text
  • obtained by OCR
  • image retrieval
  • Alternative 1
  • At creation time, a set of keywords (called
    metadata) is associated with each image
  • Conventional text retrieval techniques can be
    applied to keywords

79
Textual Images (Continued)
  • Alternative 2
  • Use OCR to extract the text of the image
  • The resultant ASCII text can be used to extract
    keywords
  • Alternative 3
  • Use the symbols extracted from the images as
    basic units to combine image retrieval techniques
    with sequence retrieval techniques

80
Taxonomy of Web languages
81
Web Resources
  • HTML 4: http://www.w3.org/TR/REC-html40
  • W3C: http://www.w3c.org/
  • OCLC: http://purl.oclc.org/
  • XML: http://www.xml.org/
  • XML Parser: http://xdev.datachannel.com/
  • DDML (Document Definition Markup Language):
    http://www.w3.org/TR/NOTE-ddml
  • XSchema: http://purl.oclc.org/NET/xschema

82
References
  • J. Kunze, "Encoding Dublin Core Metadata in HTML,"
    <ftp://ftp.ietf.org/internet-drafts/draft-kunze-dchtl-00.txt>.
  • E. Miller, P. Miller, and D. Brickley, "Guidance on Expressing the Dublin
    Core within the Resource Description Framework (RDF),"
    <http://www.ukoln.ac.uk/interop-focus/activites/dc/datamodel/WD-dc-rdf-19990423.htm>.