Title: Natasha Balac and Roman Olschanowsky
1. Natasha Balac and Roman Olschanowsky
- Data Application Group
- User Services and Development Department
- San Diego Supercomputer Center
http://datacentral.sdsc.edu
2. What is Data Central?
- Data Central makes it possible to store, manage, analyze, share and publish data collections, thereby enabling access and collaboration in the broader scientific community
- Eligible researchers can request a data allocation from SDSC (with or without a compute allocation) that permits expanded access to SDSC's Data Central facilities and services for data collections management, data analysis and data mining
3. Why SDSC Data Central?
- Today's scientists and engineers are increasingly dependent on valued community data collections and databases
- SDSC has experienced increasing demand from the domain communities for collaborations on data management, including:
  - publishing data in digital libraries
  - sharing data through the Web and data grids
  - creating, optimizing, and porting large-scale databases
  - analyzing and data mining large-scale data
4. A Deluge of Data
- Today, data comes from everywhere
  - Scientific instruments
  - Experiments
  - Sensors and sensor nets
  - New devices
- And is used by everyone
  - Scientists
  - Consumers
  - Educators
  - General public
- IT environments must support unprecedented diversity, globalization, integration, scale, and use
[Images: Life Sciences; Preservation and Archiving; Astronomy]
5. What does SDSC Data Central offer?
- SDSC has been actively collaborating with many researchers and national-scale projects on their data management efforts
- We offer expertise and resources for:
  - Public data collections and database hosting
  - Long-term storage (tape and disk)
  - Remote data management and access (SRB)
  - Data analysis and data mining
  - Professional, qualified 24/7 support
6. SDSC Data Resources
- 540 TB Storage-area Network (SAN)
- 1 PB On-line disk
- 6 PB StorageTek tape library capacity
- DB2, Oracle, MySQL
- Storage Resource Broker
- GPFS-WAN with 226 TB
[Images: Petabyte-scale high-performance tape storage system; High-performance SATA SAN disk storage system]
7. Data Resources Available through DataCentral
- Disk
  - 400 terabytes of SATA SAN, Fibre Channel attached
  - Enables multiple high-end computers, using a range of operating systems, to share data rapidly and seamlessly
  - Growing data storage capabilities are integrated with high-end computational resources such as SDSC's 15.6-teraflop DataStar IBM supercomputer and parallel I/O
  - Accessible: Mounted, Web, SRB, GridFTP
- Tape
  - 6-petabyte capacity, high-speed robotic silos
  - Disk cache front end, transparently mounted via the Sun SAM-QFS file system
  - Accessible: Mounted, Web, SRB, GridFTP
8. Data Resources Available through DataCentral
- Databases
  - DB2, Oracle, MySQL servers
  - High availability, high performance
  - Accessible: standard RDBMS connectivity, client software installed on most systems
- Software
  - Storage Resource Broker (SRB): state-of-the-art data management and collaboration software for grid file access
  - Powerful software applications covering a range of disciplines including bioscience, geoscience, astronomy, chemistry, medicine, etc.
  - A wide array of data analysis, mining and visualization tools
9. Data Resources Available through DataCentral
- Expertise in
  - high-performance large data management
  - data migration
  - database application tuning, porting and optimization
  - SQL query tuning
  - portal creation and collection publication
  - schema design
  - database selection (Oracle, DB2, MySQL, PostgreSQL)
  - data migration, upload and sharing through the grid
  - data analysis and mining
10. Data Resources Available through DataCentral
Quality User Support
- Consulting
  - Phone, Web, e-mail
  - M-F, 9 a.m. - 5 p.m.
- 24x7 Help Desk/Operational Support
- Training
- Documentation
- User Portals
- Targeted Optimization and Porting (TOP)
- Strategic Applications Collaborations (SAC)
- Strategic Community Collaborations (SCC)
11. Strategic Collaborations
- Strategic Data Applications Collaborations (SDAC)
  - SDSC expert staff paired with domain scientists for projects lasting 3-12 months
- Strategic Community Collaborations (SCC)
12. Enabling Data Science
- Many users with large data needs
  - that extend above and beyond what their home environments can support
  - increasingly dependent on valued community data collections and databases used community-wide
- Experiencing increasing demand from the domain communities for collaborations on
  - publishing data in digital libraries
  - sharing data through the Web and data grids
  - creating, optimizing, and porting large-scale databases
  - analyzing and data mining large-scale data
- Comprehensive data environment that incorporates access to the full spectrum of data-enabling resources
13. SDSC Data Allocations Environment
Services
- Parallel File-system: High-speed, Temporary
- Data Parking (SAN): High-speed, Short-term
- Data Collections (SATA): Moderate-speed, Long-term
- Data Sharing (SATA): Moderate-speed, Medium-term
[Diagram labels: Disk; Local Back-up (e.g., Tape); HPSS/SAMFS; Offsite Back-up]
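As a rough illustration, the disk service classes listed on this slide can be written as a small lookup table. This is a sketch only: the `StorageService` type and the `services_for` helper are hypothetical names introduced here and are not part of any SDSC tool. The speed and retention labels are copied from the slide.

```python
from typing import NamedTuple


class StorageService(NamedTuple):
    """One disk service class as labeled on the slide (hypothetical type)."""
    name: str
    speed: str
    retention: str


# Labels taken verbatim from the slide; nothing here is queried from SDSC.
SERVICES = [
    StorageService("Parallel File-system", "High-speed", "Temporary"),
    StorageService("Data Parking (SAN)", "High-speed", "Short-term"),
    StorageService("Data Collections (SATA)", "Moderate-speed", "Long-term"),
    StorageService("Data Sharing (SATA)", "Moderate-speed", "Medium-term"),
]


def services_for(retention: str) -> list[str]:
    """Return the names of services whose retention label matches (case-insensitive)."""
    wanted = retention.strip().lower()
    return [s.name for s in SERVICES if s.retention.lower() == wanted]


if __name__ == "__main__":
    print(services_for("long-term"))
```

A lookup like this only restates the slide; which service suits a given project would still be settled with SDSC staff.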
14. Data Science Support Systems
[Diagram labels: Archival Systems; Blue Gene/L (Due 12/04); 6 PB; DataStar IBM Power4; Expertise, Networking, Visualization, Storage and Compute Resources; 2.8/5.7 TF; 10.4 TF]
15. Partial list of databases and data collections currently housed at SDSC
- Protein Data Bank (protein data)
- National Virtual Observatory (astronomical data)
- UCSD Libraries Image Collection (ArtStore)
- National Science Digital Library (education collection)
- SCEC (earthquake data)
- BIRN (neuroscience data)
- Encyclopedia of Life (genomic data)
- TreeBase (phylogeny and ontology information)
- Transport Classification Database (protein information)
- Library of Congress data
- CKAAPS (protein evolutionary information)
- AfCS Molecule Pages (protein information)
- SLACC-JCSG (structural genomics data)
- APOPTOSIS DB (data on proteins related to cell death)
- NAVDAT (geochemistry data)
- QRC (NSF data on Supercomputer Centers and PACI)
- Network Topology Data (Skitter project)
- UC Merced Library
- Biology Workbench Databases (mirrors and originals of over 80 biology databases)
- 2 Micron All Sky Survey (astronomy data)
- Digital Palomar Observatory Sky Survey Collection (astronomy data)
- Sloan Digital Sky Survey Collection (astronomy data)
- InterPro Mirror (protein data)
- HPWREN (wireless network analysis data)
- HPWREN (sensor network data)
- Security logs and archives (security information)
- EarthRef Digital Archive (earth science information)
- GERM (earth reservoir information)
- Braindata (Rutgers neuroscience collection)
- HyperLTER (hyperspectral images)
- SIO-Explorer (oceanographic voyages)
- Transana (classroom video)
- WebBase (web crawls)
- Alexandria Digital Library (photographs)
- Backscatter Data (from UCSD network telescope)
- Digital Earth Data Library (earth sciences related datasets)
- GEON (PaleoGeographic Atlas project)
- IMDC (Internet measurement data catalog)
- Seamount Catalogue (bathymetric seamount maps)
- Hayden Planetarium Collection (astronomical data)
- TeraGrid Data (science and engineering collections)
- BioCyc (collection of pathway/genome DBs)
- Digital Embryo (human embryology)
- National Archives (persistent archive)
- San Diego Conservation Resources Network (sensitive species map server)
- LDAS (land data assimilation system)
- ROADNET (sensor data)
- NPACI Data Grid (scientific simulation output)
- Salk (biology data archive)
- Backbone Packet Header Traces (OC48, OC12)
- TeraGrid (science and engineering collections)
- CHRONOS (analytical tools for chronostratigraphy)
- ERESE (educational Earth science portal)
- TeraBridge (sensor stream data)
- C5 Landscape (UCSD Art dept)
16. Getting an Allocation: It's Free!
- Who should apply?
  - Open to researchers affiliated with US educational institutions
  - Proposals merit-reviewed quarterly by Data Allocations Committee
- Types of Allocations
  - Expedited Allocations
    - 1 TB or less of disk and tape, 1st year
    - 5 GB database, 1st year
    - Yearly review
  - Medium Allocations
    - Under 30 TB
  - Large Allocations
    - Larger than 30 TB
- Data Allocations
  - Getting Started: http://datacentral.sdsc.edu
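The size thresholds above can be sketched as a small classification function. This is illustrative only: `AllocationTier` and `allocation_tier` are hypothetical names, not part of any SDSC system. The slide does not say how a request of exactly 30 TB is treated, so the code assumes it counts as Large. It also ignores the 5 GB database limit and the merit review by the Data Allocations Committee.

```python
from enum import Enum


class AllocationTier(Enum):
    """Allocation types named on the slide (hypothetical enum)."""
    EXPEDITED = "Expedited"
    MEDIUM = "Medium"
    LARGE = "Large"


def allocation_tier(requested_tb: float) -> AllocationTier:
    """Map a requested storage size in terabytes to an allocation type.

    Thresholds follow the slide: 1 TB or less is Expedited, under 30 TB is
    Medium, larger than 30 TB is Large. Exactly 30 TB is not covered by the
    slide; treating it as Large is an assumption of this sketch.
    """
    if requested_tb <= 0:
        raise ValueError("requested size must be positive")
    if requested_tb <= 1:
        return AllocationTier.EXPEDITED
    if requested_tb < 30:
        return AllocationTier.MEDIUM
    return AllocationTier.LARGE


if __name__ == "__main__":
    for size in (0.5, 12, 100):
        print(f"{size} TB: {allocation_tier(size).value}")
```

A request's size only suggests which allocation type to apply for; the proposal itself is still merit-reviewed.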
17. Thank You
- SDSC Data Resources and Allocations
- http://datacentral.sdsc.edu/