1
Survey of Emerging IT Trends and Technologies
  • Chaitan Baru
  • Monday, 10th Aug

2
OUTLINE
  • Trends in data sharing
  • And, Discovery/Search
  • Trends in service-oriented architectures
  • Trends in computing and data infrastructure
  • The road ahead

3
Geoinformatics Use Cases
  • A user has access from a terminal to vast stores
    of data of almost any kind, with the easy ability
    to visualize, analyze and model those data.
  • For a given region (i.e. lat/long extent, plus
    depth), return a 3D structural model with
    accompanying geophysical parameters and geologic
    information, at a specified resolution

4
Implied IT Requirements
  • Search and discovery of resources
  • Integration of heterogeneous 3D / 4D Earth
    Science data
  • Integration of data with tools
  • Analysis and Visualization
  • Ability to feed data to tools, and analyze and
    visualize model outputs
  • (data-centric view)

5
Search and Discovery
  • Searching structured data, i.e. metadata
    catalogs

Search
Structured metadata catalogs
6
Search and Discovery
  • Searching unstructured data, i.e. the Web

Search
The Web
  • Structured databases are a major component of the
    Deep Web

7
Combined Search and Discovery
8
Advanced Search
  • Proposed
  • Geoscience Knowledge System, GeoKnowSys
  • Built using the Yahoo! BOSS (Build your Own Search
    Service) service
  • E.g., see wolframalpha.com

9
Advanced Search: PaleoLit
  • Research project at Dept of CS, CMU
  • Dr. Judith Gelernter and Prof. Jaime Carbonell
  • Use ontologies to match search requests to
    related publications
  • Demo

10
Informatics Issues: The Informatics Progression
Courtesy: Prof. Peter Fox, RPI, CSIG08
11
The Computer Science / Domain Science continuum
Computer Science Topics ↔ IT Standards ↔ Geoinformatics Standards ↔ Domain Standards ↔ Domain Science Topics
  • Computer Science Topics: e.g. Database Systems,
    Semistructured data
  • IT Standards: e.g. ODBC, XML
  • Geoinformatics Standards: e.g. Ontologies, GeoSciML
  • Domain Standards: e.g. domain vocabularies,
    definitions (Geologic Time, rock description, ...)
  • Domain Science Topics: e.g. geology
12
The data interoperability onion
13
Software interoperability onion
14
Geologic Map Integration
15
Data Mediation
  • Dealing with heterogeneities in (distributed)
    data sources
  • Data may be in different administrative domains
  • Hence, manage authentication
  • Data schemas may be different among sources
  • Terminologies may be different among sources
  • Terminologies may also differ between the sources
    and the user
  • Software infrastructure (stack) may be
    different
  • Solve the problem with middleware
  • Layers of software between the original
    application and the end user
  • Mediator
  • Middleware that bridges across heterogeneities
    without requiring sources to change
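
The mediator idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the two sources (`survey_a`, `survey_b`), their field names, and their vocabulary terms are all hypothetical, and a real mediator would also handle authentication and remote access.

```python
# Hypothetical per-source mappings: source field name -> common field name.
FIELD_MAPS = {
    "survey_a": {"unit_name": "name", "lith": "rock_type"},
    "survey_b": {"UnitLabel": "name", "Lithology": "rock_type"},
}

# Hypothetical vocabulary mappings: source term -> common term.
TERM_MAPS = {
    "survey_a": {"bas": "basalt", "gran": "granite"},
    "survey_b": {"Basaltic rock": "basalt", "Granitic rock": "granite"},
}


def mediate(source, record):
    """Translate one source record into the common schema and vocabulary."""
    fields = FIELD_MAPS[source]
    terms = TERM_MAPS[source]
    out = {}
    for key, value in record.items():
        if key not in fields:
            continue  # drop fields the common view does not expose
        common_key = fields[key]
        if common_key == "rock_type":
            value = terms.get(value, value)  # leave unknown terms untouched
        out[common_key] = value
    return out


def query_all(sources, rock_type):
    """Answer one query over all sources without changing any source."""
    hits = []
    for source, records in sources.items():
        for record in records:
            unified = mediate(source, record)
            if unified.get("rock_type") == rock_type:
                hits.append(unified)
    return hits


# Illustrative data only.
SOURCES = {
    "survey_a": [{"unit_name": "Unit 1", "lith": "bas"}],
    "survey_b": [{"UnitLabel": "Unit 7", "Lithology": "Basaltic rock", "Notes": "x"}],
}
```

All translation lives in the mapping tables, so adding a source means adding two dictionaries rather than changing the source itself.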

16
A Data Integration Example: Geologic Maps
17
Adopting WMS/WFS Can Provide Syntactic Integration
  • Integrated presentation
  • Uniform syntactical structure
  • Uniform spatial definition
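
As an illustration of the uniform syntax WMS provides, the sketch below builds a WMS 1.1.1 GetMap request using the standard key-value parameters. The endpoint URL and layer names are hypothetical; any WMS-compliant server would accept the same request shape.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width, height,
                   srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap request URL (key-value-pair encoding)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value asks for each layer's default style
        "SRS": srs,
        # In WMS 1.1.1 the box is ordered minx,miny,maxx,maxy.
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": image_format,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and layer names, for illustration only.
url = wms_getmap_url("https://example.org/wms", ["geology", "faults"],
                     (-114.0, 37.0, -109.0, 42.0), 800, 800)
```

Because every server takes the same parameters, maps from different providers can be requested over the same extent and overlaid in one presentation.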

18
GeoSciML Can Provide Schema Integration
[Figure: map with state labels MT, WY, ID, NV, UT, AZ, CO, NM]
19
Semantic Mediation with GeoSciML
  • Mappings may also be needed between the data and
    the application ontology
  • E.g., mapping 240 mya to Mesozoic
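
A minimal sketch of such a mapping, assuming ages are stored as numbers in millions of years (Ma) while queries name an era. The boundary values are rounded from the international geologic time scale and should be treated as approximate; the function names are illustrative.

```python
# Era boundaries in millions of years ago (Ma), rounded; approximate values.
ERAS = [
    ("Cenozoic", 0.0, 66.0),
    ("Mesozoic", 66.0, 252.0),
    ("Paleozoic", 252.0, 539.0),
]


def era_for_age(age_ma):
    """Return the Phanerozoic era containing an age in Ma, or None."""
    for name, youngest, oldest in ERAS:
        if youngest <= age_ma < oldest:
            return name
    return None


def matches_era(age_ma, era):
    """True if a numeric age in the data satisfies a query phrased by era name."""
    return era_for_age(age_ma) == era
```

With this table, a record dated 240 Ma satisfies a query for "Mesozoic" even though the word never appears in the data.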

20
Query Rewriting Example: A Rock Classification Ontology
[Figure: ontology with labels Genesis, Fabric, Composition, Texture]
21
Query Concept Expansion
  • Concept expansion
  • What else to look for when the user asks for Mafic
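
Concept expansion can be sketched as a walk down the ontology. The hierarchy fragment below is a hypothetical stand-in for the Composition branch, not the ontology used in the talk.

```python
# Hypothetical fragment of a composition hierarchy: parent -> children.
CHILDREN = {
    "Composition": ["Felsic", "Mafic"],
    "Felsic": ["Granite", "Rhyolite"],
    "Mafic": ["Basalt", "Gabbro"],
}


def expand(concept):
    """Return the concept plus all of its descendants, depth first."""
    terms = [concept]
    for child in CHILDREN.get(concept, []):
        terms.extend(expand(child))
    return terms
```

A query for Mafic is then rewritten to match any of the expanded terms, so records labeled only as Basalt or Gabbro are also returned.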

Composition
22
Query Concept Generalization
  • Generalization
  • finding data that are like X and Y
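
One way to read "like X and Y" is to find the most specific concept that covers both, then search under it. The sketch below assumes a hypothetical child-to-parent table for the Composition branch; it is an illustration, not the method used in the talk.

```python
# Hypothetical fragment of a composition hierarchy: child -> parent.
PARENT = {
    "Felsic": "Composition",
    "Mafic": "Composition",
    "Granite": "Felsic",
    "Rhyolite": "Felsic",
    "Basalt": "Mafic",
    "Gabbro": "Mafic",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def generalize(x, y):
    """Return the most specific concept that covers both x and y, or None."""
    above_y = set(ancestors(y))
    for concept in ancestors(x):
        if concept in above_y:
            return concept
    return None
```

Generalizing Basalt and Gabbro yields Mafic, so other Mafic data become candidates; generalizing Basalt and Granite climbs all the way to Composition.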

Composition
23
Ontology-based Geologic Map Integration
Implemented in GEON
24
ODAL, SOQL, and Data Integration Carts
  • ODAL: Ontological Database Annotation Language
  • Create a partial model of ontologies from
    database

The values in the column ssID of the tables
Samples, RockTexture, RockGeoChemistry,
ModalData, MineralChemistry and Images represent
instances of RockSample
25
SOQL: Simple Ontology Query Language
  • Query single or many resources
  • via ontologies (i.e., high level logical views)
  • independent of physical representation (i.e.
    schemas)
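
The link between an annotation and an ontology-level query can be sketched as follows. This is not ODAL or SOQL syntax, which the slides do not show. It is a hypothetical Python stand-in that reuses the table and column names from the RockSample annotation example and rewrites "all instances of a class" into SQL.

```python
# Annotation taken from the slide text: the ssID column of each of these
# tables holds instances of the ontology class RockSample.
ANNOTATIONS = {
    "RockSample": [
        ("Samples", "ssID"),
        ("RockTexture", "ssID"),
        ("RockGeoChemistry", "ssID"),
        ("ModalData", "ssID"),
        ("MineralChemistry", "ssID"),
        ("Images", "ssID"),
    ],
}


def rewrite(ontology_class):
    """Rewrite 'all instances of a class' into one SQL query over annotated columns."""
    selects = [
        f"SELECT {column} AS instance FROM {table}"
        for table, column in ANNOTATIONS[ontology_class]
    ]
    return "\nUNION\n".join(selects)
```

The caller names only the ontology class; which tables and columns hold its instances stays inside the annotation, which is what makes the query independent of the physical schema.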

26
Issues in sharing data: Primary vs secondary
(derived)
Collect Data → Process and Visualize → Share
Results
27
Sources of Data
  • Distributed data collections
  • By individual PIs
  • Informal sharing, e.g. via social network
  • Formal sharing, e.g. via submission to
    community data archives / databases
  • Centralized data collections
  • E.g. via a large project (standardized protocols)
  • By agencies (internal protocols)
  • Metadata to the rescue
  • Data description standards
  • Process description standards (workflows)
  • State Surveys and USGS are major sources

28
Major Interoperability Efforts
  • OneGeology.org
  • International initiative of geological surveys to
    create dynamic geological map data available via
    the web.
  • US Geoscience Information Network (US GIN)
  • Led by Lee Allison, AZGS

29
Federating Metadata Catalogs
  • Local vs Community View
  • Individual data providers may choose to export
    a community view
  • Direct access to the source may still provide
    more rich access to data
  • Federated Catalogs
  • The Geosciences Information Network, GIN approach
  • Adopt standards for catalog content (ISO) and
    implementation (CSW)
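
A minimal sketch of the federated view, with in-memory lists standing in for remote catalogs. A real federation would query each provider through its CSW interface; the provider names, record fields, and identifiers here are hypothetical.

```python
def search_catalog(records, keyword):
    """Return records whose title mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [r for r in records if needle in r["title"].lower()]


def federated_search(catalogs, keyword):
    """Query every catalog and merge the hits, dropping duplicate identifiers."""
    seen = set()
    merged = []
    for catalog_name, records in catalogs.items():
        for record in search_catalog(records, keyword):
            if record["id"] in seen:
                continue  # the same dataset was registered in two catalogs
            seen.add(record["id"])
            merged.append({**record, "catalog": catalog_name})
    return merged


# Hypothetical community views exported by two providers.
CATALOGS = {
    "provider_a": [
        {"id": "ds-1", "title": "Geologic map of region X"},
        {"id": "ds-2", "title": "Gravity survey"},
    ],
    "provider_b": [
        {"id": "ds-1", "title": "Geologic map of region X"},
        {"id": "ds-3", "title": "Geologic cross sections"},
    ],
}
```

Each hit records which catalog it came from, so a user can still go to the source for richer access than the community view offers.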

30
Interoperation between GEON and GEO GRID
[Figure: GEON and GEO Grid interoperating through Catalog Service Web (CSW) and a CSW adapter. Diagram labels: ADN, GEON Catalog, Geogrid Catalog, Storage, SRB, WMS Server, WMS URL, RESPONSE, 600 scenes/day.]
  • Implement CSW interfaces
  • Collaboration with the NSF PRAGMA project
    (Pacific Rim Assembly for Grid Middleware
    Applications)

31
Integration and Visualization of 3D/4D data
For a given region (i.e. lat/long extent, plus
depth), return a 3D structural model with
accompanying physical parameters of density,
seismic velocities, geochemistry, and geologic
ages, using a cell size of 10km
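
To give a sense of the size of such a request, the sketch below estimates how many cells a region produces at a given cell size. It assumes a simple spherical approximation (about 111.32 km per degree of latitude, with longitude scaled by the cosine of the mid-latitude); the function name and arguments are illustrative.

```python
import math

KM_PER_DEGREE = 111.32  # approximate length of one degree of latitude


def grid_shape(south, north, west, east, depth_km, cell_km=10.0):
    """Return (columns, rows, layers) needed to cover a region with cubic cells."""
    mid_lat = math.radians((south + north) / 2.0)
    height_km = (north - south) * KM_PER_DEGREE
    # A degree of longitude shrinks with the cosine of the latitude.
    width_km = (east - west) * KM_PER_DEGREE * math.cos(mid_lat)
    return (
        math.ceil(width_km / cell_km),
        math.ceil(height_km / cell_km),
        math.ceil(depth_km / cell_km),
    )
```

Each cell would then carry the requested parameters (density, seismic velocities, geochemistry, geologic age), so the cell count multiplied by the number of parameters bounds the response size.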
32
OpenEarth Framework Goals
  • Geoscience Integration
  • Data types - topography, imagery, bore hole
    samples, velocity models from seismic tomography,
    gravity measurements, simulation results
  • Data coordinate spaces and dimensionality - 2D
    and 3D spatial representations and 4D that covers
    the range of geologic processes (EQ cycle to deep
    time).

33
OpenEarth Framework Goals
  • Structural Integration
  • Data formats - shapefiles, NetCDF, GeoTIFF, and
    other formal and de facto standards.
  • Data models - 2D and 3D geometry to semantically
    richer models of features and relationships
    between those features.
  • Data delivery methods / storage schemes - local
    files to database queries, web services (WMS,
    WFS) and services for new data types (large
    tomographic volumes, etc.).

34
OEF Philosophy
  • OEF is focused on integrating data spanning the
    geosciences.
  • Open software architecture and corresponding
    software that can properly access, manipulate and
    visualize the integrated data.
  • Open source to provide the necessary flexibility
    for academic research and to provide a flexible
    test bed for new data models and visualization
    ideas.

35
OEF Architecture
36
OEF Architecture
  • Data Integration Services
  • Designed to support rapid visualization of
    integrated datasets
  • operations to grid data, resample it at multiple
    resolutions and subdivide data to better support
    progressive changes to the display as the user
    pans and zooms
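
Resampling at multiple resolutions can be sketched as a simple image pyramid. This is an illustration only, not the OEF implementation: it assumes a 2D grid stored as a list of rows and halves the resolution by averaging 2x2 blocks.

```python
def downsample(grid):
    """Halve a 2D grid's resolution by averaging each 2x2 block of cells.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    return [
        [
            (grid[2 * r][2 * c] + grid[2 * r][2 * c + 1]
             + grid[2 * r + 1][2 * c] + grid[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(grid):
    """Return [full resolution, half, quarter, ...] until a side reaches one cell."""
    levels = [grid]
    while len(levels[-1]) >= 2 and len(levels[-1][0]) >= 2:
        levels.append(downsample(levels[-1]))
    return levels
```

A viewer can draw a coarse level immediately when the user zooms out and fetch finer levels only for the area in view, which is what makes progressive display possible.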

37
OEF Architecture
  • Visualization Tools
  • Run on the user's computer, dynamically query
    spatial and temporal data from the OEF services
  • Uses 3D graphics hardware for fast display
  • Open architecture supports multiple visualization
    tools authored throughout the community (e.g., GEON
    IDV)
  • New viz capabilities developed as necessary

38
OEF Visualization
39
The software services stack (Example: GEON)
Pushing down the service interface
40
Software as a Service: At different levels of
software
  • Software as a Service, SaaS
  • E.g., Google Apps, Salesforce.com, SAP, ...
  • Infrastructure as a Service, IaaS
  • E.g., Amazon EC2, ...
  • Platform as a Service, PaaS

41
The evolving computational architecture
  • Mainframe computers (institutional computing)
  • Minicomputers (departmental computing)
  • Workstations (laboratory computing)
  • Laptops (personal computing)
  • back to the future..??

42
Cloud Computing: A meeting of trends
43
Cloud Computing Origins
  • Cloud computing: Many definitions
  • Here's one: Use of remote data centers to manage
    scalable, reliable, on-demand access to
    applications
  • Origins
  • Goes back to the need by Web search engines to
    inexpensively process all the pages on the Web
  • Done by creating a grid of datacenters and
    processing data in parallel across them
  • Development of a parallel data programming
    environment by Google: MapReduce
  • Data cloud computing
  • what about remote centers for scalable, reliable,
    on-demand access to data?

44
Cloud Computing
  • A different pricing model
  • No upfront cost of acquisition. Rent, don't buy.
  • Can access 1000s of processors / disks
  • Scalability
  • Elastic computing
  • A different model for dealing with system
    failures
  • Retry, loose consistency, ...
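
The "retry" failure model can be sketched as a wrapper that re-issues a failed call with growing delays. This is a generic pattern, not any provider's API; the function name, the attempt count, and the delays are illustrative assumptions.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` argument is injectable so the backoff schedule can be checked without actually waiting. Retrying is only safe when the operation can be repeated without side effects, which is one reason this model goes hand in hand with looser consistency guarantees.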

45
Cloud computing for data
  • Data as a service: what is the abstraction for
    storage?
  • Table, Blob, Queue
  • ??
  • Describing characteristics of the data
  • Metadata about storage to specify policies to be
    applied
  • Security, reliability, performance, etc
  • Scaling to meet application needs
  • Large configurations
  • Dealing with virtualization
  • New failure models
  • Retry, loose consistency

46
Storage as a Service
  • Amazon S3: An example
  • Charges for Storage, Data Transfer, and Requests
    (e.g. PUT, COPY, POST, LIST, GET)
  • Issues
  • Bandwidth to storage
  • Quality of Service
  • Storage Elasticity
  • Privacy / security
  • Standardization efforts
  • Storage Networking Industry Association (SNIA)
    Technical Working Group (TWG) on Cloud Storage
    has just started
  • Important Issues
  • Metadata for storage
  • Scaling up to large dataset sizes
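
The three charging components listed above (storage, data transfer, requests) combine into a simple cost model. The sketch below takes all rates as arguments because the slides give no prices; the figures in the example call are placeholders, not Amazon's actual prices.

```python
def monthly_cost(stored_gb, transferred_gb, requests,
                 storage_rate, transfer_rate, request_rate_per_1000):
    """Estimate a monthly bill from the three charging components."""
    storage = stored_gb * storage_rate
    transfer = transferred_gb * transfer_rate
    request = requests / 1000.0 * request_rate_per_1000
    return {
        "storage": storage,
        "transfer": transfer,
        "requests": request,
        "total": storage + transfer + request,
    }


# Placeholder rates chosen only to make the arithmetic easy to follow.
bill = monthly_cost(stored_gb=1000, transferred_gb=200, requests=2_000_000,
                    storage_rate=0.5, transfer_rate=0.25,
                    request_rate_per_1000=0.5)
```

Separating the components shows which one dominates for a given workload: an archive of large, rarely read datasets is driven by the storage term, while a service handling many small reads is driven by the request term.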

47
The two sides of Cloud Computing
  • Large distributed infrastructure
  • Everything is in the cloud
  • Interesting as a proposition for the IT
    operations of an enterprise
  • Cloud companies would like to reach deep into
    enterprise IT
  • "Our business is not the entrenched data centers
    in current large organizations, but the new
    companies"
  • Large-scale infrastructure in the Datacenter
  • Seeding the cloud
  • Shared-nothing parallelism
  • Data on the cheap, à la Google

48
The NSF Cluster Exploratory (CluE) Program
  • Google-IBM-NSF Cluster
  • Well over a thousand processors
  • When fully built out, will comprise approximately
    1,600 processors
  • Terabytes of memory
  • Hundreds of terabytes of storage
  • Open source software
  • Linux and Apache Hadoop
  • IBM Tivoli
  • System management, monitoring and dynamic
    resource provisioning
  • A platform for apples-to-apples comparisons
  • Can reserve time on nodes for exclusive access

49
Our CluE Project
  • Project (PI: Baru; co-PI: Krishnan)
  • Performance Evaluation of On-Demand Provisioning
    Strategies for Data Intensive Applications
  • Investigate hybrid software model
  • Database system / Hadoop system
  • Some parts of the application require features
    provided by a DBMS
  • Transactional capability, full SQL support
  • Other parts of the application can exploit Hadoop
    model
  • Very large data sets
  • Data parallel processing
  • Loose consistency models
  • Price / performance is an issue
  • Including energy costs

50
San Andreas Fault LiDAR Dataset: Data Access
Patterns
  • B4 Dataset

51
Experiments
  • On-demand database vs Hadoop
  • SQL vs Hadoop
  • Energy consumption as a factor in
    price/performance
  • Platforms to be used
  • Google-IBM cluster
  • OpenCirrus testbed
  • Triton resource

52
The Road Ahead
  • Advanced search engines
  • Search structured and unstructured data
  • Deal with display of heterogeneous results
  • Show provenance of data
  • Sophisticated tools for 3D and 4D data
    integration
  • Combination of server-side processing and
    caching and client-side interaction and
    visualization
  • Service-oriented architecture
  • Applications and IT infrastructure available as
    services
  • Perhaps some of them in the Cloud

54
Dealing with very large data
  • Either the data can be partitioned into segments
    and processed in parallel
  • Shared-nothing parallelism
  • Or not
  • Shared memory systems

55
Parallel Processing of Large Data
[Figure: nodes labeled P, M, D (presumably processor, memory, disk)]
56
Shared Nothing
57
Shared Nothing
58
Data partitioning strategies
  • Round-robin
  • Equal distribution across nodes by data volume
  • Hash
  • all data with the same key value go to same node
  • Range
  • all data within a range of values go to the same
    node
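
The three strategies can be sketched side by side. This is a minimal in-memory illustration in which nodes are modeled as Python lists; the function names and the use of `zlib.crc32` as the hash are choices made for the example.

```python
import bisect
import zlib


def round_robin(records, nodes):
    """Deal records out in turn so every node gets an equal share."""
    parts = [[] for _ in range(nodes)]
    for i, record in enumerate(records):
        parts[i % nodes].append(record)
    return parts


def hash_partition(records, nodes, key):
    """Send all records with the same key value to the same node."""
    parts = [[] for _ in range(nodes)]
    for record in records:
        # crc32 is stable across runs, unlike Python's salted str hash.
        digest = zlib.crc32(str(key(record)).encode("utf-8"))
        parts[digest % nodes].append(record)
    return parts


def range_partition(records, boundaries, key):
    """Send each record to the node whose value range contains its key.

    boundaries is a sorted list of exclusive upper bounds; values at or
    above the last boundary go to the final node.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        parts[bisect.bisect_right(boundaries, key(record))].append(record)
    return parts
```

Round-robin balances volume but scatters related records. Hash partitioning keeps equal keys together, which helps joins and grouping. Range partitioning keeps nearby values together, which helps range queries but can become unbalanced when the data are skewed.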

59
MapReduce / Hadoop
  • Programming environment for very large scale data
    processing
  • Managing task executions and data transfers in a
    shared nothing environment
  • MapReduce: Infrastructure to support data scatter
    / gather
  • Distributed data repository (file system)
  • Google File System (GFS)
  • Hadoop Distributed File System (HDFS)
  • Round-robin partitioning of data
  • MapReduce
  • Google's proprietary implementation
  • Hadoop
  • Apache, open source implementation
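
The scatter / gather pattern can be illustrated with the usual word-count example, run here in a single Python process. This is a sketch of the programming model only: in Hadoop the map and reduce tasks would run on different nodes and the input splits would live in HDFS. The byte-sum routing function is a toy stand-in for a real hash partitioner.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(pairs, reducers):
    """Scatter: route each key to one reducer and group its values there."""
    buckets = [defaultdict(list) for _ in range(reducers)]
    for key, value in pairs:
        buckets[sum(key.encode("utf-8")) % reducers][key].append(value)
    return buckets


def reduce_phase(bucket):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def word_count(splits, reducers=2):
    """Run map over every split, shuffle, reduce, and gather the results."""
    pairs = [pair for split in splits for pair in map_phase(split)]
    counts = {}
    for bucket in shuffle(pairs, reducers):
        counts.update(reduce_phase(bucket))
    return counts
```

Because each key is routed to exactly one reducer, the reducers share nothing and their outputs can be merged without conflict.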

60
MapReduce execution
  • Hadoop vs database

61
MapReduce vs Database
  • Database
  • Partition base tables into N partitions
  • Intermediate data can be re-partitioned
  • Intermediate data can be combined
  • Well-defined algebra for data manipulation (SQL)
  • MapReduce / Hadoop
  • Partition input data file into M splits
  • Intermediate data are re-hashed
  • Intermediate data can be combined
  • Java programs
  • Cost of dynamic vs static partitioning
  • Run time costs
  • Storage costs
  • Optimal partitioning
  • Query and Workload dependent
  • How to measure any deviations from the optimal?
  • When to repartition?
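
One possible proxy for deviation from the optimal partitioning is skew: the ratio of the largest partition to the ideal, perfectly balanced size. The sketch below assumes that run time is dominated by the largest partition; the threshold is an arbitrary illustrative value, and the slides leave the actual measure open.

```python
def skew(partition_sizes):
    """Ratio of the largest partition to the ideal (mean) size; 1.0 is balanced."""
    ideal = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / ideal


def should_repartition(partition_sizes, threshold=1.5):
    """Suggest repartitioning once skew exceeds a chosen threshold."""
    return skew(partition_sizes) > threshold
```

A fuller answer would weigh the one-time cost of moving the data against the expected savings over the remaining workload, which is the run-time versus storage trade-off raised above.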

62
USGS Role in Geoinformatics
  • Fundamental: Develop, maintain, make accessible
  • Long-term national and regional geologic,
    hydrologic, biologic, and geographic databases
  • Earth and planetary imagery
  • Open-source models of the complex natural systems
    and human interaction with that system
  • Physical collections of earth materials, biologic
    materials, reference standards, geophysical
    recordings, paper records.
  • National geologic, biologic, hydrologic, and
    geographic monitoring systems
  • Standards of practice for the geologic,
    hydrologic, biologic, and geographic sciences

Source: Presentation by Dr. Linda Gundersen,
USGS, at Geoinformatics 2007, San Diego, CA.
63
USGS Role in Geoinformatics
  • All activities (data creation, modeling,
    monitoring, collections, standards, etc.) must be
    done in cooperation and collaboration with the
    public and with governmental, academic, and private
    sector partners and stakeholders.
  • A critical USGS role
  • facilitate bringing communities together!

Source: Presentation by Dr. Linda Gundersen,
USGS, at Geoinformatics 2007, San Diego, CA.
64
Data Collections versus Communities of Practice
  • Geoinformatics must evolve beyond the
    accumulation of data, models, and standards to
    become the framework for a community of practice
    in the natural sciences.
  • Etienne Wenger and Jean Lave coined the term and
    developed the learning theory of communities of
    practice: that we learn not only as individuals
    but as communities. By engaging in communities
    of practice we increase our capacity and
    innovation as well as leverage our support for
    areas of interest.

Source: Presentation by Dr. Linda Gundersen,
USGS, at Geoinformatics 2007, San Diego, CA.
65
Creativity, Learning, and Innovation
  • A community of practice is not merely a community
    with a common interest, but a group of practitioners
    who share experiences and learn from each other.
    They develop a shared repertoire of resources:
    experiences, stories, tools, vocabularies, ways
    of addressing recurring problems. This takes time
    and sustained interaction. Standards of practice
    and reference materials will grow out of this.
    But the critical benefits include creating and
    sustaining knowledge, leveraging of resources,
    and rapid learning and innovation.

Source: Presentation by Dr. Linda Gundersen,
USGS, at Geoinformatics 2007, San Diego, CA.
66
1000s of National and Regional Databases
  • The National Map: topographic, elevation,
    orthoimagery, transportation, hydrography, etc.
  • Geospatial One-Stop portal
  • MRDATA: Mineral Resources and Related Data
  • The National Geologic Map Database: standardized
    community collection of geologic mapping
  • National Water Information System - NWISWeb
  • National Geochemical Survey Database (PLUTO,
    NURE)
  • National Geophysical Database (aeromag, gravity,
    aerorad)
  • Earthquake Catalogs
  • North American Breeding Bird Survey
  • National Vegetation/speciation maps
  • National Oil and Gas Assessment
  • National Coal Quality Inventory

Source: Presentation by Dr. Linda Gundersen,
USGS, at Geoinformatics 2007, San Diego, CA.