XML Data Compression - PowerPoint PPT Presentation

Slides: 34
Provided by: MAlag

1
XML Data Compression
  • By
  • Alagappan Meyyappan
  • Jaganathan Jeyapaul

2
Definition of data object used here
  • It is data wrapped in the form of an object that has
    valued attributes only, for example:

    Data Class                  Data Object
    Hotel                       NewYork_beach_Hotel
    String Location             Location NewYork
    int Number_of_rooms         Number_of_rooms 125
    int Number_of_employee      Number_of_employee 35
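
As a minimal sketch of this distinction, the data class can be written as a Python dataclass and the data object as one instance of it. The slides do not prescribe a language, and the `name` field is an assumption introduced here to hold the object's name.

```python
from dataclasses import dataclass, asdict


@dataclass
class Hotel:
    # Data class: attribute names and types only.
    name: str  # assumed field holding the object's name
    Location: str
    Number_of_rooms: int
    Number_of_employee: int


# Data object: every attribute carries a value.
newyork_beach_hotel = Hotel("NewYork_beach_Hotel", "NewYork", 125, 35)
print(asdict(newyork_beach_hotel))
```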


3
What is S-S data transfer?
  • The process of exchanging huge amounts of data
    between servers is called S-S data transfer (a
    definition used here)

[Diagram: Server A, Server B, and Server C connected through a network]
4
Example of S-S data transfer
  • Consider, for example, a California tourist agency
    called BayTravelAgency that would like to form
    strategic partnership alliances with several
    hotels in New York
  • To that end, it wants to get information about
    all the hotels in New York from
    NewYorkYellowPages.com in order to do a survey
  • BayTravelAgency will request
    NewYorkYellowPages.com to send the information
    about all the hotels in New York for its
    research

5
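[Diagram: the BayTravelAgency application on the CA server (with CA db) sends a request to the NewYorkYellowPages.com NY server (with NY db), which sends back a response containing hotel data such as Hotel A]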
6
Why should XML be used here?
  • It is a globally accepted data format for
    information exchange
  • It is relatively simple compared to other
    protocols
  • There are many vendors in the market who provide
    all the required APIs
  • However, what happens in the real world if XML is
    used for S-S data transfer?

7
    <?xml version='1.0' encoding='ISO-8859-1'?>
    <Hotel value="NewYork_beach_hotel">
      <Location value="NewYork"/>
      <Number_of_rooms value="125"/>
      <Number_of_employee value="35"/>
      <zip value="56709"/>
    </Hotel>
    <Hotel value="Long_island_hotel">
      <Location value="NewYork"/>
      <Number_of_rooms value="150"/>
      <Number_of_employee value="38"/>
      <zip value="56709"/>
    </Hotel>
    <Hotel value="Manhattan_hotel">
      <Location value="NewYork"/>
      <Number_of_rooms value="345"/>
      <Number_of_employee value="72"/>
      <zip value="56710"/>
    </Hotel>

8
Problems with XML
  • The data exchanged between BayTravelAgency and
    NewYorkYellowPages.com will be huge (1000 of the
    simple hotel objects discussed earlier take
    0.127 MB). In reality it might take about 1 MB
  • Transmitting it takes a significant amount of
    time, especially when there are many such
    requests, as in a network like the following

9
[Diagram: Server1 through Server9 all connected through one network]
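
The size estimate for 1000 simple hotel objects can be checked roughly with a short script. This is only a sketch: the helper `hotel_xml` is a hypothetical name, and the indentation and line breaks it emits are assumptions, so the result lands in the same range as the 0.127 MB quoted earlier without matching it exactly.

```python
def hotel_xml(name, location, rooms, employees, zip_code):
    """Serialize one hotel in the attribute-per-element style of the slides."""
    return (
        f'<Hotel value="{name}">\n'
        f'  <Location value="{location}"/>\n'
        f'  <Number_of_rooms value="{rooms}"/>\n'
        f'  <Number_of_employee value="{employees}"/>\n'
        f'  <zip value="{zip_code}"/>\n'
        f'</Hotel>\n'
    )


record = hotel_xml("NewYork_beach_hotel", "NewYork", 125, 35, 56709)

# Assume all 1000 records are about as long as this one.
total_bytes = len(record.encode("iso-8859-1")) * 1000
print(f"{total_bytes / 1_000_000:.3f} MB for 1000 objects")
```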
10
How to overcome this? Use compression, but ...
  • Some compression algorithms may use a lot of CPU
    cycles, which will reduce the server's
    performance
  • If compression is used, then all processors in
    the network should understand it.

11
The first problem can be solved using simple delta compression
  • The Location attribute, with the same value NewYork,
    repeats in all the data objects
  • Sending it more than once is redundant. How
    to get rid of such repeated attribute values?
    Mention each value only once; if there is a difference,
    send the difference alone. For example:

12
    NewYork_beach_Hotel         Long_island_hotel
    Location NewYork            Location NewYork
    Number_of_rooms 125         Number_of_rooms 150
    Number_of_employee 35       Number_of_employee 38
    Zip 56709                   Zip 56709

  • Simplified

    NewYork_beach_Hotel         Long_island_hotel
    Location NewYork            Number_of_rooms 150
    Number_of_rooms 125         Number_of_employee 38
    Number_of_employee 35
    Zip 56709
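
The simplification above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenters' implementation: objects are plain dictionaries, the function name `delta_encode` is hypothetical, each object is compared with the one sent just before it, and attributes that disappear from one object to the next are not handled.

```python
def delta_encode(objects):
    """Keep the first object whole; for each later object keep only the
    attributes whose value differs from the previous object."""
    encoded, previous = [], {}
    for obj in objects:
        encoded.append({k: v for k, v in obj.items() if previous.get(k) != v})
        previous = obj
    return encoded


hotels = [
    {"Hotel": "NewYork_beach_Hotel", "Location": "NewYork",
     "Number_of_rooms": 125, "Number_of_employee": 35, "Zip": 56709},
    {"Hotel": "Long_island_hotel", "Location": "NewYork",
     "Number_of_rooms": 150, "Number_of_employee": 38, "Zip": 56709},
]

for delta in delta_encode(hotels):
    print(delta)
```

The second dictionary printed carries only the hotel name, the room count, and the employee count, which corresponds to the right-hand column of the simplified table.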


13
    <?xml version='1.0' encoding='ISO-8859-1'?>
    <Hotel value="NewYork_beach_hotel">
      <Location value="NewYork"/>
      <Number_of_rooms value="125"/>
      <Number_of_employee value="35"/>
      <zip value="56709"/>
    </Hotel>
    <Hotel value="Long_island_hotel">
      <Number_of_rooms value="150"/>
      <Number_of_employee value="38"/>
    </Hotel>
    <Hotel value="Manhattan_hotel">
      <Number_of_rooms value="345"/>
      <Number_of_employee value="72"/>
      <zip value="56710"/>
    </Hotel>
    ...
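
A receiver has to undo this step before it can use the data. The following sketch assumes the deltas arrive as dictionaries in document order and that a missing attribute means "same as the previous object"; `delta_decode` is a hypothetical helper name.

```python
def delta_decode(deltas):
    """Rebuild full objects: start from the previous object and overwrite
    whatever attributes the delta carries."""
    decoded, previous = [], {}
    for delta in deltas:
        current = {**previous, **delta}
        decoded.append(current)
        previous = current
    return decoded


deltas = [
    {"Hotel": "NewYork_beach_hotel", "Location": "NewYork",
     "Number_of_rooms": "125", "Number_of_employee": "35", "zip": "56709"},
    {"Hotel": "Long_island_hotel",
     "Number_of_rooms": "150", "Number_of_employee": "38"},
    {"Hotel": "Manhattan_hotel",
     "Number_of_rooms": "345", "Number_of_employee": "72", "zip": "56710"},
]

for hotel in delta_decode(deltas):
    print(hotel)
```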

14
When using delta compression ...
  • The order of the objects is very important for
    getting maximum compression; for an example, see
    the next 2 slides
  • To find an order that is likely to produce high
    compression, one can use the following steps
  • Create an attribute compression factor (ACF)
    table

15
    NewYork_beach_Hotel         Long_island_hotel           Abc_hotel
    Location NewYork            Location NewYork            Location NewYork
    Number_of_rooms 125         Number_of_rooms 150         Number_of_rooms 125
    Number_of_employee 35       Number_of_employee 38       Number_of_employee 32
    Zip 56709                   Zip 56709                   Zip 56709

  • After Compression

    NewYork_beach_Hotel         Long_island_hotel           Abc_hotel
    Location NewYork            Number_of_rooms 150         Number_of_rooms 125
    Number_of_rooms 125         Number_of_employee 38       Number_of_employee 32
    Number_of_employee 35
    Zip 56709

16
    NewYork_beach_Hotel         Abc_hotel                   Long_island_hotel
    Location NewYork            Location NewYork            Location NewYork
    Number_of_rooms 125         Number_of_rooms 125         Number_of_rooms 150
    Number_of_employee 35       Number_of_employee 32       Number_of_employee 38
    Zip 56709                   Zip 56709                   Zip 56709

  • After Compression

    NewYork_beach_Hotel         Abc_hotel                   Long_island_hotel
    Location NewYork            Number_of_employee 32       Number_of_rooms 150
    Number_of_rooms 125                                     Number_of_employee 38
    Number_of_employee 35
    Zip 56709
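
The effect of the ordering can be checked by counting how many attribute values each order has to transmit. This is a rough sketch that assumes each object is compared with the one sent just before it and that the hotel name is always sent and therefore not counted; `attributes_sent` is a hypothetical helper name.

```python
def attributes_sent(objects):
    """Count the attribute values transmitted when each object is sent as a
    delta against the previous one."""
    sent, previous = 0, {}
    for obj in objects:
        sent += sum(1 for k, v in obj.items() if previous.get(k) != v)
        previous = obj
    return sent


ny = {"Location": "NewYork", "Number_of_rooms": 125,
      "Number_of_employee": 35, "Zip": 56709}
li = {"Location": "NewYork", "Number_of_rooms": 150,
      "Number_of_employee": 38, "Zip": 56709}
abc = {"Location": "NewYork", "Number_of_rooms": 125,
       "Number_of_employee": 32, "Zip": 56709}

print(attributes_sent([ny, li, abc]))  # 4 + 2 + 2 = 8
print(attributes_sent([ny, abc, li]))  # 4 + 1 + 2 = 7
```

Placing Abc_hotel directly after NewYork_beach_Hotel saves one value because the two share the same number of rooms.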

17
    Object        location    room    employee    zip
    New_York      New York    125     35          56709
    Long_Island   New York    128     38          56709
    Abc           New York    125     32          56709

    Attribute     number of changes
    location      0 (ignore)
    room          1 (group)
    employee      3
    zip           0 (ignore)

18
    Object        room    employee
    New_York      125     35
    Abc           125     32

    Attribute     number of changes
    room          0 (ignore)
    employee      1 (group)

    Object        room    employee
    Long Island   128     38

    Attribute     number of changes
    room          0 (ignore)
    employee      0 (ignore)

[Diagram: New York, ABC, Long Island]
19
    Object        employee
    New_York      35

    Attribute     number of changes
    employee      0 (ignore)

    Object        employee
    Abc           32

    Attribute     number of changes
    employee      0 (ignore)

[Diagram: New York, ABC, Long Island]
20
  • This simple method does not guarantee the best
    data object order; however, it provides a good
    order quickly, with fewer CPU cycles
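
The grouping steps of the preceding slides can be sketched as a small recursive function. Several things here are assumptions rather than the presenters' method: the slides do not state how the number of changes is counted, so this sketch uses "distinct values minus one" (which may not reproduce every count in the ACF table), ties are broken by attribute order, and the names `change_counts` and `order_objects` are hypothetical.

```python
def change_counts(objects, attrs):
    # Assumed counting rule: number of distinct values minus one.
    return {a: len({obj[a] for obj in objects}) - 1 for a in attrs}


def order_objects(objects, attrs):
    """Recursively group objects by the attribute that changes least."""
    if len(objects) <= 1:
        return list(objects)
    counts = change_counts(objects, attrs)
    varying = [a for a in attrs if counts[a] > 0]  # 0 changes: ignore
    if not varying:
        return list(objects)
    key = min(varying, key=lambda a: counts[a])    # fewest changes: group
    groups = {}
    for obj in objects:
        groups.setdefault(obj[key], []).append(obj)
    remaining = [a for a in attrs if a != key]
    ordered = []
    for group in groups.values():
        ordered.extend(order_objects(group, remaining))
    return ordered


ATTRS = ["location", "room", "employee", "zip"]
hotels = [
    {"name": "New_York", "location": "New York", "room": 125, "employee": 35, "zip": 56709},
    {"name": "Long_Island", "location": "New York", "room": 128, "employee": 38, "zip": 56709},
    {"name": "Abc", "location": "New York", "room": 125, "employee": 32, "zip": 56709},
]

ordered_names = [h["name"] for h in order_objects(hotels, ATTRS)]
print(ordered_names)  # ['New_York', 'Abc', 'Long_Island']
```

Grouping first on the attribute with the fewest changes keeps objects that share a value next to each other, which is what makes the later delta step effective.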

21
Difference between the real one and the compressed one
    <?xml version='1.0' encoding='ISO-8859-1'?>
    <Hotel value="NewYork_beach_hotel">
      <Location value="NewYork"/>
      <Number_of_rooms value="125"/>
      <Number_of_employee value="35"/>
      <zip value="56709"/>
    </Hotel>
    <Hotel value="Long_island_hotel">
      <Number_of_rooms value="150"/>
      <Number_of_employee value="38"/>
    </Hotel>
    <Hotel value="Manhattan_hotel">
      <Number_of_rooms value="345"/>
      <Number_of_employee value="72"/>
      <zip value="56710"/>
    </Hotel>
    ...

22
The Challenge ...
  • Other XML parsers will see it as a normal XML
    document, which may lead to
    misunderstandings; therefore ...
  • we suggest adding a new tag to the XML
    specification in order to support this new
    document

23

CP-XML
24
What is CP-XML?
  • It is the same XML with a slight modification
  • It will have a template, new tags, and comments
    in the document

25
    <?xml version='1.0' encoding='ISO-8859-1' Compress='true'?>
    <Template>
      <1. Hotel, STRING, the name of the hotel>
        <2. Location, STRING/>
        <3. Number_of_rooms, INT/>
        <4. Number_of_employee, INT/>
        <5. zip, INT, zip code in 5 digits only/>
      </Hotel>
    </Template>
    <1. NewYork_beach_hotel>
      <2. NewYork/>
      <3. 125/>
      <4. 35/>
      <5. 56709/>
    </>
    <1. Abc_hotel>
      <4. 38/>
    </>
    ...
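
To make the numbering scheme concrete, here is a rough sketch that emits the record part of such a document from a list of objects. It only mimics the text layout shown above: CP-XML is the presenters' proposal rather than standard XML, and the field list `FIELDS`, the function name `cp_xml_body`, and the sample data are assumptions made for illustration.

```python
# Field order as numbered in the template; position 1 is the object name.
FIELDS = ["Hotel", "Location", "Number_of_rooms", "Number_of_employee", "zip"]


def cp_xml_body(objects, fields=FIELDS):
    """Emit numbered tags, skipping fields unchanged since the previous object."""
    lines, previous = [], {}
    for obj in objects:
        lines.append(f"<1. {obj[fields[0]]}>")
        for number, field in enumerate(fields[1:], start=2):
            if previous.get(field) != obj[field]:
                lines.append(f"  <{number}. {obj[field]}/>")
        lines.append("</>")
        previous = obj
    return lines


hotels = [
    {"Hotel": "NewYork_beach_hotel", "Location": "NewYork",
     "Number_of_rooms": 125, "Number_of_employee": 35, "zip": 56709},
    {"Hotel": "Abc_hotel", "Location": "NewYork",
     "Number_of_rooms": 125, "Number_of_employee": 38, "zip": 56709},
]

print("\n".join(cp_xml_body(hotels)))
```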

26
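XML compression process
[Diagram: a request reaches the server's query handler, which reads from the DB; the server then arranges the objects, forms the CP-XML document, and sends it as the response]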
27
Benchmarking
  • The above two documents are compressed with a
    ratio of
  • 625/869 ≈ 0.72

[Plot: compression ratio versus number of objects]
28
Sending Lists
  • How to send the different types of rooms available
    in the hotel? Use a LIST

29
    <?xml version='1.0' encoding='ISO-8859-1' Compress='true'?>
    <Template>
      <1. Hotel, STRING, the name of the hotel>
        <2. Location, STRING/>
        <3. Number_of_rooms, INT/>
        <4. Number_of_employee, INT/>
        <5. zip, INT, zip code in 5 digits only/>
        <6. type_of_room, LIST(deluxe suite single double)/>
      </Hotel>
    </Template>
    <1. NewYork_beach_hotel>
      <2. NewYork/>
      <3. 125/>
      <4. 35/>
      <5. 56709/>
      <6. 1,2,4/>
    </>
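
One reading of the LIST field is that each number is a 1-based position in the list declared in the template, so `1,2,4` would stand for deluxe, suite, and double. The slides do not state this explicitly, so the sketch below is an interpretation; `encode_list` and `decode_list` are hypothetical names.

```python
# Order taken from the template's LIST(...) declaration.
ROOM_TYPES = ["deluxe", "suite", "single", "double"]


def encode_list(values, choices=ROOM_TYPES):
    """Replace each value by its 1-based position in the declared list."""
    return ",".join(str(choices.index(v) + 1) for v in values)


def decode_list(text, choices=ROOM_TYPES):
    """Turn a comma-separated string of positions back into values."""
    return [choices[int(token) - 1] for token in text.split(",")]


print(decode_list("1,2,4"))                         # ['deluxe', 'suite', 'double']
print(encode_list(["deluxe", "suite", "double"]))   # 1,2,4
```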

30
Sending Trees
  • How to send file directories using this method?

    /root
      /dir1
        file1
        file2
      /dir2
        file3

31
Advanced template
    <?xml version='1.0' encoding='ISO-8859-1' Compress='true'?>
    <Template>
      <1. Node, STRING, the name of the node>
        <2. file_structure, TREE(root dir1 dir2 file1 file2 file3)/>
      </Node>
    </Template>
    <1. Node_A>
      <2. 1-2, 1-3, 2-4, 2-5, 3-6/>
    </>
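
The TREE field appears to list parent-child pairs, with each number being a 1-based position in the node list declared in the template. Under that assumption, the edge list can be turned back into directory paths as sketched below; `decode_tree` and `paths` are hypothetical helper names.

```python
# Order taken from the template's TREE(...) declaration.
NODES = ["root", "dir1", "dir2", "file1", "file2", "file3"]


def decode_tree(edge_text, nodes=NODES):
    """Map each node name to the list of its children."""
    children = {name: [] for name in nodes}
    for edge in edge_text.split(","):
        parent, child = (int(part) for part in edge.split("-"))
        children[nodes[parent - 1]].append(nodes[child - 1])
    return children


def paths(children, node="root", prefix=""):
    """List every path in the tree, parents before their children."""
    here = f"{prefix}/{node}"
    result = [here]
    for child in children[node]:
        result.extend(paths(children, child, here))
    return result


tree = decode_tree("1-2 , 1-3 , 2-4 , 2-5 , 3-6")
print("\n".join(paths(tree)))
```

For the edge list above this prints `/root`, `/root/dir1`, `/root/dir1/file1`, `/root/dir1/file2`, `/root/dir2`, and `/root/dir2/file3`, one per line.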

32
Other data types like
  • Objects within objects
  • Graphs
  • Boolean
  • Tables
  • Vectors
  • Other data types

33
Thank you