Title: Panel Summary


1
Panel Summary
  • Andrew Hanushevsky
  • Stanford Linear Accelerator Center
  • Stanford University
  • XLDB
  • 23-October-07

2
State in High Energy Physics
  • A lot of data
  • 15 PB/Year for LHC
  • Typically, write once data
  • Applications are CPU bound
  • A lot of institutes must be involved
  • Increase total resources
  • Necessity forces a Hybrid Model (RDBMS + Files)
  • Performance impact of consistency is high
  • Not required for LHC
  • Wide range of applications, DB expertise,
    environments
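
The hybrid model named on this slide can be illustrated with a small sketch: bulk write-once data lives in ordinary files, while a relational catalog records where each file is. This is only an illustration under stated assumptions, not the design used by any LHC experiment. The table name `event_files`, its columns, and the helper functions are all hypothetical, and SQLite stands in for a full RDBMS.

```python
import os
import sqlite3
import tempfile

def build_catalog(db_path):
    """Open (or create) a hypothetical file catalog in SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event_files "
        "(run INTEGER, path TEXT, n_bytes INTEGER)"
    )
    return conn

def register_file(conn, run, path, payload):
    """Write the bulk data to a plain file, then record it in the catalog."""
    with open(path, "wb") as f:
        f.write(payload)
    with conn:  # commits the insert as one transaction
        conn.execute(
            "INSERT INTO event_files VALUES (?, ?, ?)",
            (run, path, len(payload)),
        )

def files_for_run(conn, run):
    """Query the catalog only; the bulk data is never read here."""
    rows = conn.execute(
        "SELECT path FROM event_files WHERE run = ? ORDER BY path", (run,)
    )
    return [r[0] for r in rows]

if __name__ == "__main__":
    data_dir = tempfile.mkdtemp()
    catalog = build_catalog(":memory:")
    register_file(catalog, 1, os.path.join(data_dir, "run1.dat"), b"raw events")
    print(files_for_run(catalog, 1))
```

Only the small catalog updates go through the database, so its consistency cost is not paid on the bulk data path.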

3
LHC Issues
  • Power and Cooling
  • Cheap hardware for scaling
  • Reliability problems
  • Patching issues
  • Distributed Deployment Issues
  • Needed to develop in-house tools
  • Multi-dimensional search requirements
  • Usually reason for using files for data

4
LHC Questions
  • Database as a
  • Transactional system, efficient query engine,
    highly available storage?
  • Can one product do all of this?
  • Multi-Mode Storage
  • How do you measure scaling?
  • Size? Transactions/Second? Etc.
  • Shared everything or shared nothing
    architectures?

5
State in Astronomy (LSST)
  • A lot of data
  • Trillions of rows or more
  • 14 PB by 2024
  • Only data about the image
  • Actual images (write once) much larger!
  • Data is distributed
  • Telescope and archive physically separate
  • Time for database technology to catch up (12
    years)
  • Some proprietary systems handle even more data
    today
  • Reliability and security requirements are loose
  • Can absorb some data loss, 98% uptime,
    public data
  • However, must be able to ingest the data
  • Telescope keeps going

6
Issues in LSST
  • Easy Scaling
  • Add resources on the fly
  • Dependable software sources
  • This is a long term project
  • Data has some unique needs
  • Distributed mining capabilities
  • Varied database data types
  • Not available today except in OO databases
  • Relaxed consistency requirements
  • Fault tolerant software not hardware
  • Human scaling must be low

7
Scientific Panel I
  • 40% Pure Database
  • Otherwise 20-30% in DB, rest in files
  • Majority in the petabyte range
  • Everyone in the 10-100 TB range
  • Majority use commercial products
  • Though open source DBs rampant
  • Few (in XL scale today) use homegrown systems
  • Sometimes driven by need, sometimes by legacy

8
Scientific Panel II
  • Wide range of user analytic needs
  • DBs have limited expressiveness
  • Unlikely there is a common set of operators
  • Common Data Processing Model
  • Write once read many
  • But a lot of meta-data updates
  • Amenable to data parallelism
  • Approximate results are acceptable to first order
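
The two properties on this slide, data parallelism and tolerance for approximate results, can be sketched together. The example below is a minimal illustration, not something shown by the panel: each partition is reduced independently to a partial (sum, count), and an approximate answer is obtained by running the same reduction on a random sample of each partition. The function names, the `fraction` parameter, and the use of threads are assumptions made to keep the sketch self-contained; a real system would spread partitions across processes or nodes.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def partial_sum_count(partition):
    """Reduce one partition independently of all others."""
    return sum(partition), len(partition)

def parallel_mean(partitions, workers=4):
    """Exact mean: map over partitions, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

def approximate_mean(partitions, fraction=0.1, seed=0):
    """Approximate mean from a random sample of each non-empty partition."""
    rng = random.Random(seed)
    samples = []
    for part in partitions:
        k = max(1, int(len(part) * fraction))
        samples.append(rng.sample(part, k))
    return parallel_mean(samples)

if __name__ == "__main__":
    parts = [list(range(i * 1000, (i + 1) * 1000)) for i in range(10)]
    print(parallel_mean(parts))     # exact mean
    print(approximate_mean(parts))  # estimate from about 10% of the rows
```

Because the data is write once, partitions never change under the readers, which is what makes the independent per-partition reduction safe without locking.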

9
Scientific Panel III
  • Wish List
  • Approximate queries
  • Full spatial queries
  • Multiple availability levels
  • Mixture of real-time, interactive, background
    uses
  • The rest is yes
  • Scaling, performance, maintainability, etc.

10
Industry Panel I
  • Primarily traditional DB use
  • Standard scaling techniques
  • Disallow certain types of queries
  • Availability is a must
  • Money and survivability are the issue
  • 90% non-transactional queries
  • Wide range of sizes, several TB to several PB
  • 1 Billion rows/hour ingest peak
  • Trillions of rows
  • 25 TB/day is not unusual
  • Millions of queries a day
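
To put the ingest figures on this slide into per-second terms, the arithmetic can be written out as below. This is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^6 MB) and a rate sustained evenly over the period, which the slide does not state. The function names are illustrative.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units, an assumption

def rows_per_second(rows_per_hour):
    """Convert an hourly row-ingest rate to rows per second."""
    return rows_per_hour / SECONDS_PER_HOUR

def megabytes_per_second(terabytes_per_day):
    """Convert a daily volume in TB to an average rate in MB/s."""
    return terabytes_per_day * MB_PER_TB / SECONDS_PER_DAY

if __name__ == "__main__":
    print(round(rows_per_second(1_000_000_000)))  # 1 billion rows/hour
    print(round(megabytes_per_second(25), 1))     # 25 TB/day
```

Under these assumptions, 1 billion rows/hour is roughly 278,000 rows per second, and 25 TB/day averages roughly 289 MB/s.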

11
Industry Panel II
  • Some homegrown solutions
  • Depending on how it is used
  • Problem is I/O throughput
  • Minimize use of indexes
  • Some specialized systems used to increase
    performance
  • Dirty reads common
  • Transactional latency is a problem

12
Industry Panel III
  • Varied use patterns (business model driven)
  • Non-indexed data for mining purposes
  • Parallel Load and Query
  • Real time queries (currency is a must)
  • Designing for the unknown query
  • Customization motivation varies
  • Join inefficiency
  • Limited SQL expressiveness
  • Lack of sufficient parallelism

13
Common Industry/Science Issues
  • Performance issues
  • I/O throughput, transactional latency, etc.
  • Lack of effective parallelism
  • Usability
  • SQL expressiveness
  • Licensing
  • Industry more constrained but cost is an issue
  • Human power
  • Labor is the dominant cost
  • DBA costs are high and must be reduced

14
Final Perceptions
  • Science/Industry operate roughly on same scale
  • Size and throughput
  • Science and Industry business models differ
  • Drives each community in a different direction
  • Science is a long-term affair
  • Industry must be reactive

15
Discussion Points
  • What drives feature sets?
  • General feeling that scaling features are missing
  • Is it the architecture (e.g., Relational vs.
    other)?
  • Is it the business model?
  • Something else?
  • What feature sets do you think are important?
  • Performance, Scalability, Usability, Reliability?
  • Do you see it as a tradeoff?
  • Open Software Presence
  • A question of customization possibilities or
    simply cost?
  • Is it considered a threat to your business model?
  • Is it time to rethink the nature and placement of
    databases?