Title: Text Annotation Techniques
1Text Annotation Techniques
2Annotated Text
Ordinary text (e.g.): This is an ordinary text document.
Annotated text (e.g.):
<html> <title>Sample Document</title> <body>This is an annotated text document.</body> </html>
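The difference between the two forms shows up once a program reads them: the tags let it tell the title apart from the body. A minimal sketch using Python's standard `html.parser` module; the class name `TextByTag` is made up for illustration and is not part of the slides.

```python
from html.parser import HTMLParser

# The annotated sample from the slide, written with real angle brackets.
ANNOTATED = (
    "<html><title>Sample Document</title>"
    "<body>This is an annotated text document.</body></html>"
)


class TextByTag(HTMLParser):
    """Hypothetical helper: collect character data keyed by the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.text = {}    # tag name -> text found directly inside it

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Ignore whitespace-only runs and text outside any tag.
        if self.stack and data.strip():
            key = self.stack[-1]
            self.text[key] = self.text.get(key, "") + data.strip()


collector = TextByTag()
collector.feed(ANNOTATED)
collector.close()
print(collector.text)
```

The ordinary text offers no such handle: a program would see one undifferentiated string.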
3Document Type Definition (DTD)
- It is a specification that accompanies an annotated document.
- It aids the parser in identifying what the codes (or markup) are that separate paragraphs, identify topic headings, and so on.
- It also indicates to the parser how each tag is to be processed.
- The DTD for a document is generally placed at the top of the document.
4DTD Example
- This is an XML document with a Document Type Definition:
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Joe</to>
  <from>Jane</from>
  <heading>Reminder</heading>
  <body>Don't forget the homework</body>
</note>
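To see how a parser reads this document, the sketch below loads it with Python's standard `xml.etree.ElementTree`. Note the assumptions: the brackets, `#` signs and `=` signs are restored by hand, and ElementTree reads past the internal DTD but does not validate against it.

```python
import xml.etree.ElementTree as ET

# The note document from the slide, with markup characters restored.
NOTE_XML = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to,from,heading,body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Joe</to>
  <from>Jane</from>
  <heading>Reminder</heading>
  <body>Don't forget the homework</body>
</note>"""

root = ET.fromstring(NOTE_XML)

# Each child element becomes one entry: tag name -> character data.
fields = {child.tag: child.text for child in root}
for tag, text in fields.items():
    print(f"{tag}: {text}")
```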
5Why use DTD?
- Each file can carry a description of its own format with it.
- Independent groups of people can agree to use a common DTD for interchanging data.
- Your application can use a standard DTD to verify that the data you receive from the outside world is valid.
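The idea of verifying incoming data can be illustrated with a deliberately tiny check. This is a sketch under strong assumptions, not a DTD validator: it understands only one plain comma-separated sequence such as `(to,from,heading,body)`, and the helper names `required_children` and `conforms` are invented for this example. A real application would rely on a validating parser.

```python
import re
import xml.etree.ElementTree as ET

# One element declaration taken from the note DTD.
DTD_RULE = "<!ELEMENT note (to,from,heading,body)>"


def required_children(element_decl):
    """Return (element name, ordered child names) for a simple sequence rule."""
    match = re.fullmatch(
        r"<!ELEMENT\s+(\w+)\s+\(([\w,\s]+)\)>", element_decl.strip()
    )
    if match is None:
        raise ValueError("only plain comma-separated sequences are supported")
    children = [part.strip() for part in match.group(2).split(",")]
    return match.group(1), children


def conforms(xml_text, element_decl):
    """True if the root has the declared name and exactly the declared children, in order."""
    name, expected = required_children(element_decl)
    root = ET.fromstring(xml_text)
    return root.tag == name and [child.tag for child in root] == expected


good = (
    "<note><to>Joe</to><from>Jane</from>"
    "<heading>Reminder</heading><body>Don't forget the homework</body></note>"
)
bad = "<note><from>Jane</from><to>Joe</to></note>"  # wrong order, missing children

print(conforms(good, DTD_RULE))
print(conforms(bad, DTD_RULE))
```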
6Standard Generalized Markup Language (SGML)
- SGML is a metalanguage: a language for writing other languages in.
- SGML is used to define the abstract structure of
a DTD
7SGML
- Each markup language defined in SGML is called an SGML application. An SGML application is generally characterized by:
- An SGML declaration.
- The SGML declaration specifies which characters and delimiters may appear in the application.
- A DTD.
- A specification that describes the semantics to be ascribed to the markup. This specification also imposes syntax restrictions that cannot be expressed within the DTD.
- Document instances containing data (content) and markup. Each instance contains a reference to the DTD to be used to interpret it.
9- HTML: HyperText Markup Language
- XML: eXtensible Markup Language
- WML: Wireless Markup Language
- XHTML: eXtensible HyperText Markup Language
- SVG: Scalable Vector Graphics
- SMIL: Synchronized Multimedia Integration Language
10- Seen from a DTD point of view, all XML documents (and HTML documents) are made up of the following simple building blocks:
- Elements
- Tags
- Attributes
- Entities
- PCDATA
- CDATA
11Elements
- Elements are the main building blocks of both XML and HTML documents.
- Examples of HTML elements are "body" and "table". Examples of XML elements could be "note" and "message". Elements can contain text, other elements, or be empty. Examples of empty HTML elements are "hr", "br" and "img".
12Tags
- Tags are used to mark up elements.
- A starting tag like <element_name> marks up the beginning of an element, and an ending tag like </element_name> marks up the end of an element.
- Examples:
- body element marked up with body tags:
- <body>body text in between</body>
- message element marked up with message tags:
- <message>some message in between</message>
13Attributes
- Attributes provide extra information about elements.
- Attributes are always placed inside the starting tag of an element. Attributes always come in name/value pairs. The following "img" element has additional information about a source file:
- <img src="computer.gif" />
- The name of the element is "img". The name of the attribute is "src". The value of the attribute is "computer.gif". Since the element itself is empty, it is closed by a " /".
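The element name, attribute name and attribute value described above can be read back from the parsed element. A minimal sketch with Python's standard `xml.etree.ElementTree`, assuming the slide's markup is written with its angle brackets and `=` sign restored:

```python
import xml.etree.ElementTree as ET

# The empty img element from the slide.
img = ET.fromstring('<img src="computer.gif" />')

print(img.tag)      # element name
print(img.attrib)   # attributes as name/value pairs
print(img.text)     # None: an empty element has no character data
print(len(img))     # number of child elements
```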
14Entities
- Entities are variables used to define common text.
- Entity references are references to entities.
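As a sketch of how an entity and its references behave, the document below declares a hypothetical entity named `course` in its internal DTD and refers to it as `&course;`. The entity name and the sentence are invented for illustration. Python's standard `xml.etree.ElementTree` expands internal entities like this one, along with the predefined `&lt;` and `&gt;`.

```python
import xml.etree.ElementTree as ET

# "course" is a made-up entity; &lt; and &gt; are predefined by XML itself.
DOC = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY course "Text Annotation Techniques">
]>
<note>Welcome to &course;. Markup such as &lt;note&gt; stays literal.</note>"""

root = ET.fromstring(DOC)

# The parser has replaced every entity reference with its text.
print(root.text)
```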
15PCDATA
- PCDATA means parsed character data.
- Think of character data as the text found between the start tag and the end tag of an XML element.
- PCDATA is text that will be parsed by a parser. Tags inside the text will be treated as markup and entities will be expanded.
16CDATA
- CDATA also means character data.
- CDATA is text that will NOT be parsed by a
parser. Tags inside the text will NOT be treated
as markup and entities will not be expanded.
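The contrast between PCDATA and CDATA can be observed by parsing the same characters twice, once as ordinary element content and once inside a CDATA section. A minimal sketch with Python's standard `xml.etree.ElementTree`; the `msg` element and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Parsed character data: <b> is treated as markup, &amp; is expanded.
parsed = ET.fromstring("<msg>Hello <b>world</b> &amp; all</msg>")
print(parsed.text)       # text before the child element
print(parsed[0].tag)     # the nested tag became a child element
print(parsed[0].tail)    # text after the child, with the entity expanded

# CDATA section: the same characters are passed through untouched.
literal = ET.fromstring("<msg><![CDATA[Hello <b>world</b> &amp; all]]></msg>")
print(literal.text)
print(len(literal))      # no child elements were created
```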
17What is HTML?
- HTML is a non-proprietary format based upon SGML, and can be created and processed by a wide range of tools, from simple plain text editors (you type it in from scratch) to sophisticated WYSIWYG authoring tools.
19- HTML uses tags such as <h1> and </h1> to structure text into headings, paragraphs, lists, hypertext links etc.
Heading 1 Heading 2 Heading 3 Heading 4
- <H1>Heading 1</H1> <H2>Heading 2</H2> <H3>Heading 3</H3> <H4>Heading 4</H4>
20What is XML?
- Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML.
- Originally designed to meet the challenges of large-scale electronic publishing, XML is also playing an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
21XML Sample
<?xml version="1.0"?>
<order orderid="THX1138" customerNumber="3263827">
  <lineitem itemid="C33">
    <quantity>36</quantity>
    <unitprice currency="dollars">.35</unitprice>
  </lineitem>
  <lineitem itemid="M48">
    <quantity>1</quantity>
    <unitprice currency="dollars">2200</unitprice>
  </lineitem>
</order>
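Because the sample describes the order as data, a program can compute with it directly. The sketch below totals the line items using Python's standard `xml.etree.ElementTree` and `decimal` modules. It assumes the attribute quoting shown in the string (in particular the closing quote after `THX1138`), and it treats every price as being in the same currency.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

# The order sample from the slide, with markup characters restored.
ORDER_XML = """<?xml version="1.0"?>
<order orderid="THX1138" customerNumber="3263827">
  <lineitem itemid="C33">
    <quantity>36</quantity>
    <unitprice currency="dollars">.35</unitprice>
  </lineitem>
  <lineitem itemid="M48">
    <quantity>1</quantity>
    <unitprice currency="dollars">2200</unitprice>
  </lineitem>
</order>"""

order = ET.fromstring(ORDER_XML)

# Sum quantity * unit price over all line items; Decimal avoids float rounding.
total = Decimal("0")
for item in order.findall("lineitem"):
    quantity = int(item.findtext("quantity"))
    unit_price = Decimal(item.findtext("unitprice"))
    total += quantity * unit_price

print(order.get("orderid"), total)
```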
22XML/HTML
- XML is not a replacement for HTML. XML and HTML were designed with different goals:
- XML was designed to describe data and to focus on what data is. HTML was designed to display data and to focus on how data looks.
- HTML is about displaying information; XML is about describing information.
23XML/HTML
- The tags used to mark up HTML documents and the structure of HTML documents are predefined. The author of HTML documents can only use tags that are defined in the HTML standard.
- XML allows authors to define their own tags and their own document structure.
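Defining one's own tags needs no registration step: the author simply starts using them. A minimal sketch with Python's standard `xml.etree.ElementTree`; the tag names `message`, `sender` and `text` and the `priority` attribute are invented for this example.

```python
import xml.etree.ElementTree as ET

# Build a document from tag names chosen by the author.
message = ET.Element("message", {"priority": "high"})
ET.SubElement(message, "sender").text = "Jane"
ET.SubElement(message, "text").text = "Don't forget the homework"

# Serialize to a string, then parse it back to confirm it is well-formed XML.
serialized = ET.tostring(message, encoding="unicode")
print(serialized)

reparsed = ET.fromstring(serialized)
print(reparsed.find("sender").text)
```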
24What is XHTML?
- The Extensible HyperText Markup Language (XHTML) is a family of current and future document types and modules that reproduce, subset, and extend HTML, reformulated in XML.
- XHTML Family document types are all XML-based, and ultimately are designed to work in conjunction with XML-based user agents.
- XHTML was developed as the successor of HTML, and a series of specifications has been developed for it.
25SMIL
- SMIL authoring offers a new way to assemble and deliver streaming multimedia presentations. Rather than the traditional way of creating a presentation by compiling a set of media into a single distributable file, SMIL lets authors choreograph separate media assets quickly and easily, with tools as simple as a text editor. Perhaps the best feature of SMIL is the ability to generate the code on the fly, as many Web pages are already created, and thereby offer personalized streaming multimedia.
- SMIL Demo
- SMIL Source
26SVG
- SVG is a language for describing two-dimensional graphics in XML.
- SVG allows for three types of graphic objects:
- vector graphic shapes (e.g., paths consisting of straight lines and curves)
- images
- text
27SVG Sample
<symbol id="whiteYellowBezier" overflow="visible">
  <path style="stroke:black;fill:none" d="M 0,0 C 0.25,-0.1 0.75,-0.1 1,0">
    <animate id="whiteYellowBezierAnim" attributeName="d" values="M 0,0 C 0.25,-0.1 0.75,-0.1 1,0; M 0,0 C 25,-10 75,-10 100,0" dur="5s" repeatCount="3"/>
    <animate attributeName="stroke-width" values="1;3" dur="5s" repeatCount="3" />
    <animate attributeName="stroke" values="white;yellow" dur="5s" repeatCount="3" />
  </path>
</symbol>
29WML
- An annotation technique that allows the text portions of Web pages to be presented on cellular telephones and personal digital assistants (PDAs) via wireless access.
- Though HTML could be used, WML is used instead because it demands less bandwidth.
- WML also takes less power to process than HTML.
30Future Direction
- TEI (Text Encoding Initiative)
- An international project to develop guidelines for the preparation and interchange of electronic texts for scholarly research.
- Supported and promoted the use of SGML.
31- W3C (World Wide Web Consortium)
- Vision: Contributions from several hundred dedicated researchers and engineers working for Member organizations, from the W3C Team, and from the entire Web community enable W3C to identify the technical requirements that must be satisfied if the Web is to be a truly universal information space.
- Design: W3C designs Web technologies to realize this vision, taking into account existing technologies as well as those of the future.
- Standardization: W3C contributes to efforts to standardize Web technologies by producing specifications (called "Recommendations") that describe the building blocks of the Web. W3C makes these Recommendations freely available to all.