Title: Survey of Emerging IT Trends and Technologies
1. Survey of Emerging IT Trends and Technologies
- Chaitan Baru
- Monday, 10th Aug
2. OUTLINE
- Trends in data sharing, discovery, and search
- Trends in service-oriented architectures
- Trends in computing and data infrastructure
- The road ahead
3. Geoinformatics Use Cases
- A user has access from a terminal to vast stores of data of almost any kind, with the easy ability to visualize, analyze, and model those data.
- For a given region (i.e., lat/long extent, plus depth), return a 3D structural model with accompanying geophysical parameters and geologic information, at a specified resolution.
4. Implied IT Requirements
- Search and discovery of resources
- Integration of heterogeneous 3D/4D Earth Science data
- Integration of data with tools
- Analysis and visualization
- Ability to feed data to tools, and to analyze and visualize model outputs (a data-centric view)
5. Search and Discovery
- Searching structured data, i.e., metadata catalogs
[Diagram: search over structured metadata catalogs]
6. Search and Discovery
- Searching unstructured data, i.e., the Web
[Diagram: search over the Web]
- Structured databases are a major component of the Deep Web
7. Combined Search and Discovery
8. Advanced Search
- Proposed: a Geoscience Knowledge System, GeoKnowSys
- Built using the Yahoo! Build your Own Search Service (BOSS)
- E.g., see wolframalpha.com
9. Advanced Search: PaleoLit
- Research project at the Dept. of CS, CMU
- Dr. Judith Gelernter and Prof. Jaime Carbonell
- Uses ontologies to match search requests to related publications
- Demo
10. Informatics Issues: The Informatics Progression
Courtesy of Prof. Peter Fox, RPI, CSIG08
11. The Computer Science / Domain Science continuum
Computer Science → IT → Geoinformatics → Domain → Domain Science
- Computer Science topics: e.g., database systems, semistructured data
- IT standards: e.g., ODBC, XML
- Geoinformatics standards: e.g., ontologies, GeoSciML
- Domain standards: e.g., domain vocabularies and definitions (geologic time, rock description, ...)
- Domain Science topics: e.g., geology
12. The data interoperability onion
13. The software interoperability onion
14. Geologic Map Integration
15. Data Mediation
- Dealing with heterogeneities in (distributed) data sources
- Data may be in different administrative domains
- → must manage authentication
- Data schemas may be different among sources
- Terminologies may be different among sources
- Terminologies may be different between sources and the user
- Software infrastructure (stacks) may be different
- Solve the problem with middleware
- Layers of software between the original application and the end user
- Mediator: middleware that bridges across heterogeneities without requiring sources to change (see the sketch below)
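As a toy illustration of the mediator idea, the sketch below (Python; all names are hypothetical, not GEON software) rewrites a user-level term into each source's local vocabulary, leaving the sources themselves untouched.

```python
# Minimal sketch of a terminology mediator, assuming two hypothetical
# sources whose local vocabularies differ from the user's query terms.
# SOURCE_VOCAB and mediate_query are illustrative names, not real APIs.

SOURCE_VOCAB = {
    "state_survey_a": {"igneous": "Igneous", "age": "GeologicAge"},
    "state_survey_b": {"igneous": "IGN_ROCK", "age": "AGE_MA"},
}

def mediate_query(term: str, source: str) -> str:
    """Rewrite a user-level term into the source's local vocabulary."""
    return SOURCE_VOCAB[source].get(term, term)

# The mediator fans one logical query out to each source, rewritten:
for src in SOURCE_VOCAB:
    print(src, "->", mediate_query("igneous", src))
```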
16. A Data Integration Example: Geologic Maps
17. Adopting WMS/WFS Can Provide Syntactic Integration
- Integrated presentation
- Uniform syntactic structure
- Uniform spatial definition (see the WMS sketch below)
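A hedged sketch of what syntactic integration via WMS looks like in practice, using the open-source OWSLib client (pip install owslib); the service URL and layer name are placeholders, not endpoints from the talk.

```python
# Fetch a map tile from an OGC WMS endpoint with OWSLib.
from owslib.wms import WebMapService

wms = WebMapService("https://example.org/geoserver/wms", version="1.1.1")
img = wms.getmap(
    layers=["geology:units"],           # hypothetical layer name
    srs="EPSG:4326",                    # uniform spatial definition
    bbox=(-120.0, 32.0, -114.0, 37.0),  # lon/lat extent
    size=(512, 512),
    format="image/png",
)
with open("map.png", "wb") as f:
    f.write(img.read())
```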
18. GeoSciML Can Provide Schema Integration
[Map: integrated geologic maps spanning MT, WY, ID, NV, UT, AZ, CO, and NM]
19. Semantic Mediation with GeoSciML
- Mappings may also be needed between the data and the application ontology
- E.g., mapping "240 mya" to "Mesozoic" (see the sketch below)
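A minimal sketch of one such mapping, assuming hand-entered (approximate) era boundaries; none of this comes from GeoSciML itself.

```python
# Translate a numeric age (millions of years, Ma) into a geologic era
# so that a query for "240 mya" matches data tagged "Mesozoic".
ERAS = [
    ("Cenozoic", 0.0, 66.0),
    ("Mesozoic", 66.0, 252.0),
    ("Paleozoic", 252.0, 541.0),
]

def era_for_age(age_ma: float) -> str:
    for name, young, old in ERAS:
        if young <= age_ma < old:
            return name
    return "Precambrian"

assert era_for_age(240) == "Mesozoic"
```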
20. Query Rewriting Example: A Rock Classification Ontology
[Diagram: ontology facets of Genesis, Fabric, Composition, and Texture]
21. Query Concept Expansion
- Concept expansion: what else to look for when the user asks for "Mafic" (see the sketch below)
[Diagram: expansion within the Composition facet]
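A toy version of concept expansion over a hand-made ontology fragment (hypothetical classes, not the GEON rock classification):

```python
# Expand a query concept into all of its narrower concepts by walking
# a small subclass table recursively.
SUBCLASSES = {
    "Mafic": ["Basalt", "Gabbro"],
    "Basalt": [],
    "Gabbro": [],
}

def expand(concept: str) -> set[str]:
    found = {concept}
    for child in SUBCLASSES.get(concept, []):
        found |= expand(child)
    return found

print(expand("Mafic"))  # {'Mafic', 'Basalt', 'Gabbro'}
```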
22. Query Concept Generalization
- Generalization: finding data that are "like X and Y" (see the sketch below)
[Diagram: generalization within the Composition facet]
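A companion sketch for generalization: walk two concepts up a toy hierarchy (again, hypothetical class names) to their nearest common ancestor, then query for that broader concept instead.

```python
# Find the nearest common ancestor of two concepts in a parent table.
PARENT = {
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
    "Mafic": "IgneousComposition",
    "Felsic": "IgneousComposition",
}

def ancestors(c: str) -> list[str]:
    chain = [c]
    while c in PARENT:
        c = PARENT[c]
        chain.append(c)
    return chain

def generalize(x: str, y: str) -> str:
    ys = set(ancestors(y))
    return next(a for a in ancestors(x) if a in ys)

print(generalize("Basalt", "Gabbro"))  # Mafic
```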
23. Ontology-based Geologic Map Integration
Implemented in GEON
24. ODAL, SOQL, and Data Integration Carts
- ODAL: Ontological Database Annotation Language
- Creates a partial model of ontologies from a database
- Example annotation: "The values in the column ssID of the tables Samples, RockTexture, RockGeoChemistry, ModalData, MineralChemistry, and Images represent instances of RockSample" (restated in code below)
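ODAL's concrete syntax is not shown here, so the sketch below only restates the annotation above as a plain mapping structure in Python; it is an illustration of the idea (column values denote ontology instances), not actual ODAL.

```python
# A column-to-ontology-class annotation, expressed as data.
ANNOTATIONS = [
    {
        "tables": ["Samples", "RockTexture", "RockGeoChemistry",
                   "ModalData", "MineralChemistry", "Images"],
        "column": "ssID",
        "ontology_class": "RockSample",
    }
]

def classes_for(table: str, column: str) -> list[str]:
    """Look up which ontology classes a table column denotes."""
    return [a["ontology_class"] for a in ANNOTATIONS
            if table in a["tables"] and a["column"] == column]

print(classes_for("Samples", "ssID"))  # ['RockSample']
```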
25. SOQL: Simple Ontology Query Language
- Query one or many resources
- via ontologies (i.e., high-level logical views)
- independent of physical representation (i.e., schemas)
26. Issues in sharing data: Primary vs. secondary (derived)
Collect Data → Process and Visualize → Share Results
27. Sources of Data
- Distributed data collections
- By individual PIs
- Informal sharing, e.g., via social networks
- Formal sharing, e.g., via submission to community data archives / databases
- Centralized data collections
- E.g., via a large project (standardized protocols)
- By agencies (internal protocols)
- Metadata to the rescue
- Data description standards
- Process description standards (workflows)
- State Surveys and the USGS are major sources
28. Major Interoperability Efforts
- OneGeology.org
- An international initiative of geological surveys to make dynamic geological map data available via the web
- US Geoscience Information Network (US GIN)
- Led by Lee Allison, AZGS
29. Federating Metadata Catalogs
- Local vs. community view
- Individual data providers may choose to export a community view
- Direct access to the source may still provide richer access to data
- Federated catalogs
- The Geosciences Information Network (GIN) approach
- Adopt standards for catalog content (ISO) and implementation (CSW); see the sketch below
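A sketch of federated catalog search using OWSLib's CSW client; the two endpoint URLs are placeholders, and the merge logic is deliberately naive.

```python
# Query two CSW catalogs and merge their records by record id.
from owslib.csw import CatalogueServiceWeb
from owslib.fes import PropertyIsLike

endpoints = [
    "https://catalog-a.example.org/csw",   # placeholder endpoints
    "https://catalog-b.example.org/csw",
]
query = PropertyIsLike("csw:AnyText", "%geologic map%")

records = {}
for url in endpoints:
    csw = CatalogueServiceWeb(url)
    csw.getrecords2(constraints=[query], maxrecords=10)
    records.update(csw.records)            # record id -> metadata record

for rid, rec in records.items():
    print(rid, rec.title)
```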
30. Interoperation between GEON and GEO Grid
[Architecture diagram: the GEON catalog (ADN metadata) and the GEO Grid catalog (~600 scenes/day into SRB storage) are federated through Catalog Service for the Web (CSW) interfaces and a CSW adapter; responses return WMS URLs served by WMS servers on each side]
- Implement CSW interfaces
- Collaboration with the NSF PRAGMA project (Pacific Rim Applications and Grid Middleware Assembly)
31. Integration and Visualization of 3D/4D Data
For a given region (i.e., lat/long extent, plus depth), return a 3D structural model with accompanying physical parameters of density, seismic velocities, geochemistry, and geologic ages, using a cell size of 10 km
32. OpenEarth Framework Goals
- Geoscience integration
- Data types: topography, imagery, borehole samples, velocity models from seismic tomography, gravity measurements, simulation results
- Data coordinate spaces and dimensionality: 2D and 3D spatial representations, and 4D that covers the range of geologic processes (the earthquake cycle to deep time)
33. OpenEarth Framework Goals
- Structural integration
- Data formats: shapefiles, NetCDF, GeoTIFF, and other formal and de facto standards (see the sketch below)
- Data models: from 2D and 3D geometry to semantically richer models of features and the relationships between those features
- Data delivery methods / storage schemes: from local files to database queries, web services (WMS, WFS), and services for new data types (large tomographic volumes, etc.)
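As a small illustration of handling one of these formats, the sketch below reads a gridded dataset with the netCDF4 Python library (pip install netCDF4); the file and variable names are hypothetical stand-ins for a seismic velocity model.

```python
# Open a NetCDF file and read one gridded variable.
from netCDF4 import Dataset

ds = Dataset("velocity_model.nc")   # hypothetical file of Vp values
print(ds.dimensions.keys())         # e.g. lat, lon, depth
vp = ds.variables["vp"][:]          # read the whole grid into memory
print(vp.shape, float(vp.min()), float(vp.max()))
ds.close()
```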
34. OEF Philosophy
- OEF is focused on integrating data spanning the geosciences
- An open software architecture, and corresponding software that can properly access, manipulate, and visualize the integrated data
- Open source, to provide the necessary flexibility for academic research and a flexible test bed for new data models and visualization ideas
35. OEF Architecture
36. OEF Architecture
- Data integration services
- Designed to support rapid visualization of integrated datasets
- Operations to grid data, resample it at multiple resolutions, and subdivide it to better support progressive changes to the display as the user pans and zooms (see the sketch below)
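A minimal sketch of the resample-at-multiple-resolutions idea: build a pyramid of coarser grids by 2x2 block averaging, so coarse levels can be drawn first while the user pans and zooms. NumPy only; this is not OEF code.

```python
import numpy as np

def downsample(grid: np.ndarray) -> np.ndarray:
    """Halve resolution by averaging 2x2 blocks (trimming odd edges)."""
    h, w = grid.shape
    trimmed = grid[: h - h % 2, : w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def build_pyramid(grid: np.ndarray, levels: int) -> list[np.ndarray]:
    pyramid = [grid]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

full = np.random.rand(512, 512)      # stand-in for a gridded dataset
for level in build_pyramid(full, 3):
    print(level.shape)               # (512,512) (256,256) (128,128) (64,64)
```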
37. OEF Architecture
- Visualization tools
- Run on the user's computer and dynamically query spatial and temporal data from the OEF services
- Use 3D graphics hardware for fast display
- An open architecture supports multiple visualization tools authored throughout the community (e.g., GEON IDV)
- New visualization capabilities developed as necessary
38. OEF Visualization
39. The software services stack. Example: GEON
Pushing down the service interface
40. Software as a Service, at different levels of the software stack
- Software as a Service (SaaS)
- E.g., Google Apps, Salesforce.com, SAP, ...
- Infrastructure as a Service (IaaS)
- E.g., Amazon EC2, ...
- Platform as a Service (PaaS)
41. The evolving computational architecture
- Mainframe computers (institutional computing)
- Minicomputers (departmental computing)
- Workstations (laboratory computing)
- Laptops (personal computing)
- ... and back to the future?
42. Cloud Computing: A meeting of trends
43. Cloud Computing Origins
- Cloud computing has many definitions
- Here's one: the use of remote data centers to manage scalable, reliable, on-demand access to applications
- Origins
- Goes back to the need of Web search engines to inexpensively process all the pages on the Web
- Done by creating a grid of datacenters and processing data in parallel across them
- Development of a parallel data programming environment by Google: MapReduce
- Data cloud computing
- What about remote centers for scalable, reliable, on-demand access to data?
44. Cloud Computing
- A different pricing model
- No upfront cost of acquisition. Rent, don't buy.
- Can access 1000s of processors / disks
- Scalability
- Elastic computing
- A different model for dealing with system failures
- Retry, loose consistency, ...
45. Cloud computing for data
- Data as a service: what is the abstraction for storage?
- Table, Blob, Queue
- ??
- Describing characteristics of the data
- Metadata about storage, to specify the policies to be applied
- Security, reliability, performance, etc.
- Scaling to meet application needs
- Large configurations
- Dealing with virtualization
- New failure models
- Retry, loose consistency
46. Storage as a Service
- Amazon S3, an example (see the sketch below)
- Charges for storage, data transfer, and requests (e.g., PUT, COPY, POST, LIST, GET)
- Issues
- Bandwidth to storage
- Quality of service
- Storage elasticity
- Privacy / security
- Standardization efforts
- The Storage Networking Industry Association (SNIA) Technical Working Group (TWG) on Cloud Storage has just started
- Important issues
- Metadata for storage
- Scaling up to large dataset sizes
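A sketch of the S3 request model using boto3, a current AWS SDK that postdates this talk; the bucket and key names are placeholders, and each PUT/GET shown is a billable request of the kind listed above.

```python
import boto3

s3 = boto3.client("s3")
# PUT: upload a data file (hypothetical bucket/key names).
s3.put_object(Bucket="my-geodata-bucket",
              Key="lidar/san_andreas/tile_001.las",
              Body=open("tile_001.las", "rb"))
# GET: retrieve it again.
obj = s3.get_object(Bucket="my-geodata-bucket",
                    Key="lidar/san_andreas/tile_001.las")
print(obj["ContentLength"], "bytes retrieved")
```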
47. The two sides of Cloud Computing
- Large distributed infrastructure
- Everything is in the cloud
- Interesting as a proposition for the IT operations of an enterprise
- Cloud companies would like to reach deep into enterprise IT
- Our business is not the entrenched data centers in current large organizations, but the new companies
- Large-scale infrastructure in the datacenter
- Seeding the cloud
- Shared-nothing parallelism
- Data on the cheap, à la Google
48. The NSF Cluster Exploratory (CluE) Program
- Google-IBM-NSF cluster
- Well over a thousand processors
- When fully built out, will comprise approximately 1,600 processors
- Terabytes of memory
- Hundreds of terabytes of storage
- Open source software
- Linux and Apache Hadoop
- IBM Tivoli
- System management, monitoring, and dynamic resource provisioning
- A platform for apples-to-apples comparisons
- Can reserve time on nodes for exclusive access
49. Our CluE Project
- Project (PI: Baru; co-PI: Krishnan): "Performance Evaluation of On-Demand Provisioning Strategies for Data-Intensive Applications"
- Investigate a hybrid software model
- Database system / Hadoop system
- Some parts of the application require features provided by a DBMS
- Transactional capability, full SQL support
- Other parts of the application can exploit the Hadoop model
- Very large data sets
- Data-parallel processing
- Loose consistency models
- Price / performance is an issue
- Including energy costs
50. San Andreas Fault LiDAR Dataset: Data Access Patterns
51. Experiments
- On-demand database vs. Hadoop
- SQL vs. Hadoop
- Energy consumption as a factor in price/performance
- Platforms to be used
- Google-IBM cluster
- OpenCirrus testbed
- Triton resource
52. The Road Ahead
- Advanced search engines
- Search structured and unstructured data
- Deal with the display of heterogeneous results
- Show the provenance of data
- Sophisticated tools for 3D and 4D data integration
- A combination of server-side processing and caching, and client-side interaction and visualization
- Service-oriented architecture
- Applications and IT infrastructure available as services
- Perhaps some of them in the Cloud
54. Dealing with very large data
- Either the data can be partitioned into segments and processed in parallel
- Shared-nothing parallelism
- Or not
- Shared-memory systems
55. Parallel Processing of Large Data
[Diagram: nodes of processors (P), memory (M), and disks (D)]
56. Shared Nothing
57. Shared Nothing
58. Data partitioning strategies
- Round-robin
- Equal distribution across nodes by data volume
- Hash
- All data with the same key value go to the same node
- Range
- All data within a range of values go to the same node (see the sketch below)
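The three strategies as toy Python functions that assign a record to one of n nodes (illustrative only):

```python
from bisect import bisect_right

def round_robin(i: int, n: int) -> int:
    return i % n                      # i = record's arrival order

def hash_partition(key, n: int) -> int:
    return hash(key) % n              # same key -> same node

RANGES = [100, 200, 300]              # upper bounds for nodes 0..2; node 3 gets the rest

def range_partition(key: int) -> int:
    return bisect_right(RANGES, key)  # keys in [0,100) -> node 0, etc.

print(round_robin(7, 4), hash_partition("sample-42", 4), range_partition(250))
```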
59. MapReduce / Hadoop
- A programming environment for very large-scale data processing
- Manages task executions and data transfers in a shared-nothing environment
- MapReduce: infrastructure to support data scatter / gather (see the word-count sketch below)
- Distributed data repository (file system)
- Google File System (GFS)
- Hadoop Distributed File System (HDFS)
- Round-robin partitioning of data
- MapReduce
- Google's proprietary implementation
- Hadoop
- Apache's open-source implementation
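The classic word-count example, written for Hadoop Streaming (which lets any executable act as mapper or reducer over stdin/stdout); this sketches the programming model, not any specific deployment.

```python
# usage: python wc.py map < input.txt | sort | python wc.py reduce
import sys
from itertools import groupby

def mapper():
    # Emit a (word, 1) pair for every word seen.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so groupby works.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```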
60. MapReduce execution
61. MapReduce vs. Database
- Database
- Partition base tables into N partitions
- Intermediate data can be re-partitioned
- Intermediate data can be combined
- Well-defined algebra for data manipulation (SQL)
- MapReduce / Hadoop
- Partition the input data file into M splits
- Intermediate data are re-hashed
- Intermediate data can be combined
- Java programs
- Cost of dynamic vs. static partitioning
- Run-time costs
- Storage costs
- Optimal partitioning
- Query- and workload-dependent
- How to measure any deviations from the optimal?
- When to repartition?
62. USGS Role in Geoinformatics
- Fundamental: develop, maintain, and make accessible
- Long-term national and regional geologic, hydrologic, biologic, and geographic databases
- Earth and planetary imagery
- Open-source models of complex natural systems and human interaction with those systems
- Physical collections of earth materials, biologic materials, reference standards, geophysical recordings, and paper records
- National geologic, biologic, hydrologic, and geographic monitoring systems
- Standards of practice for the geologic, hydrologic, biologic, and geographic sciences
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
63. USGS Role in Geoinformatics
- All activities (data creation, modeling, monitoring, collections, standards, etc.) must be done in cooperation and collaboration with the public and with governmental, academic, and private sector partners and stakeholders
- A critical USGS role: facilitate bringing communities together!
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
64. Data Collections versus Communities of Practice
- Geoinformatics must evolve beyond the accumulation of data, models, and standards to become the framework for a community of practice in the natural sciences
- Etienne Wenger and Jean Lave coined the term and developed the learning theory of communities of practice: that we learn not only as individuals but as communities. By engaging in communities of practice we increase our capacity and innovation, as well as leverage our support for areas of interest.
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
65. Creativity, Learning, and Innovation
- A community of practice is not merely a community with a common interest; its members are practitioners who share experiences and learn from each other. They develop a shared repertoire of resources: experiences, stories, tools, vocabularies, ways of addressing recurring problems. This takes time and sustained interaction. Standards of practice and reference materials will grow out of it, but the critical benefits include creating and sustaining knowledge, leveraging of resources, and rapid learning and innovation.
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.
66. 1000s of National and Regional Databases
- The National Map: topographic, elevation, orthoimagery, transportation, hydrography, etc.
- Geospatial One Stop portal
- MRDATA: Mineral Resources and Related Data
- The National Geologic Map Database: a standardized community collection of geologic mapping
- National Water Information System (NWISWeb)
- National Geochemical Survey Database (PLUTO, NURE)
- National Geophysical Database (aeromag, gravity, aerorad)
- Earthquake catalogs
- North American Breeding Bird Survey
- National vegetation/speciation maps
- National Oil and Gas Assessment
- National Coal Quality Inventory
Source: Presentation by Dr. Linda Gundersen, USGS, at Geoinformatics 2007, San Diego, CA.