Title: Cooperative Computing for Data Intensive Science
1 Cooperative Computing for Data Intensive Science
- Douglas Thain
- University of Notre Dame
- NSF Bridges to Engineering 2020 Conference
- 12 March 2008
2 What is Cooperative Computing?
- By combining our computing and storage resources together, we can attack problems larger than we could alone.
- I can use your computer when it is idle, and vice versa. (Most computers are idle about 90 percent of the day.)
- Also known as:
  - Grid computing, distributed computing, metacomputing, volunteer computing, etc.
3 Who Needs Coop Computing?
- Many fields of study rely on simulation and data processing to conduct science.
  - Physics, chemistry, biology, engineering, finance, sociology, computer science.
- More Computing, Better Results
- NOT High Performance: Speed up one program.
- High Throughput: Produce as many results as possible over the next day / week / year.
4 Cooperative Computing Lab
- We design and build distributed systems that help people to attack BIG problems.
- Work directly with end users to make sure that our solutions affect the real world.
- Operate a modest computing system as both a production service and a research testbed.
  - Currently about 500 CPUs and 300 disks.
- CS research challenges: scalability, robustness, usability, debugging, and performance.
http://www.nd.edu/ccl
7 What Makes this Challenging?
- The Programming Model
  - I want to process 10 TB of data on 100 machines, then distribute it across 20 disks, then view the best results on my workstation.
- Fault Tolerance
  - Something is always broken!
- Performance Robustness
  - There is always one slowpoke.
- Debugging
  - My job runs correctly here but not there...!?
8 An Example Collaboration: Biometrics Research and Distributed Systems
9 A Common Pattern in Biometrics
1   .8  .1  0   0   .1
    1   0   .1  .1  0
        1   0   .1  .3
            1   0   0
                1   .1
                    1
Sample Workload: 4000 images, 256KB each, 1s per F, 185 CPU-days
Future Workload: 60000 images, 1MB each, 0.1s per F, 4166 CPU-days
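The CPU-day figures on this slide can be checked with a little arithmetic. The sketch below assumes that F is applied to every ordered pair of images (n x n calls), which is the counting that reproduces the slide's numbers; the function name is a hypothetical one chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def all_pairs_cpu_days(num_images: int, seconds_per_comparison: float) -> float:
    """Estimate CPU-days to run F on every ordered pair of images.

    Assumption: all n * n pairs are evaluated, as the slide's totals imply.
    """
    comparisons = num_images * num_images
    return comparisons * seconds_per_comparison / SECONDS_PER_DAY


# Sample workload: 4000 images at 1 s per comparison.
sample_days = all_pairs_cpu_days(4000, 1.0)
# Future workload: 60000 images at 0.1 s per comparison.
future_days = all_pairs_cpu_days(60000, 0.1)

print(f"sample: {sample_days:.1f} CPU-days")
print(f"future: {future_days:.1f} CPU-days")
```

Even though each comparison becomes ten times faster in the future workload, the quadratic growth in the number of pairs dominates, which is why the total rises from roughly 185 to roughly 4166 CPU-days.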
10 Non-Expert User Using 500 CPUs
11 All Pairs Production System
[Diagram: a Web Portal (labeled F, G, H and S, T) and an All-Pairs Engine in front of 300 active storage units with 500 CPUs and 40TB disk; F appears repeatedly on the units.]
1 - Upload F and S into web portal.
2 - AllPairs(F,S)
3 - O(log n) distribution by spanning tree.
4 - Choose optimal partitioning and submit batch jobs.
5 - Collect and assemble results.
6 - Return result matrix to user.
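The numbered steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the lab's implementation: jobs run sequentially in one process instead of as batch jobs, `block_size` is passed in by hand rather than chosen optimally, and all function names (`broadcast_rounds`, `make_jobs`, `run_job`, `all_pairs`) are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Job = Tuple[int, int, int, int]  # (row_start, row_end, col_start, col_end)


def broadcast_rounds(num_nodes: int) -> int:
    """Rounds to copy data to num_nodes nodes if every holder forwards one
    copy per round, so the number of holders doubles: O(log n) rounds."""
    rounds, holders = 0, 1
    while holders < num_nodes:
        holders *= 2
        rounds += 1
    return rounds


def make_jobs(n: int, block_size: int) -> List[Job]:
    """Split the n x n result matrix into square-ish blocks, one per job."""
    starts = range(0, n, block_size)
    return [(r, min(r + block_size, n), c, min(c + block_size, n))
            for r in starts for c in starts]


def run_job(F: Callable, S: Sequence, job: Job) -> Dict[Tuple[int, int], object]:
    """Evaluate F on every pair inside one block (stands in for a batch job)."""
    r0, r1, c0, c1 = job
    return {(i, j): F(S[i], S[j]) for i in range(r0, r1) for j in range(c0, c1)}


def all_pairs(F: Callable, S: Sequence, block_size: int) -> List[list]:
    """Run every block job, then collect and assemble the result matrix."""
    n = len(S)
    matrix = [[None] * n for _ in range(n)]
    for job in make_jobs(n, block_size):
        for (i, j), value in run_job(F, S, job).items():
            matrix[i][j] = value
    return matrix


# Toy usage: S is a list of numbers and F is their absolute difference.
S = [1, 4, 6, 9]
matrix = all_pairs(lambda a, b: abs(a - b), S, block_size=3)
for row in matrix:
    print(row)
print("rounds to reach 300 nodes:", broadcast_rounds(300))
```

Because each block depends only on F and the relevant slice of S, the jobs are independent, which is what makes it possible to hand them to a batch system once the data has been distributed.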
12 Some Results on Real Workload
13 Collaboration is Where the Interesting Problems Are!
(Cooperative Computing Provides the Resources)
14 What Makes a Collaboration Work?
- Like a marriage? (Old joke.)
- First, a show of commitment: go after some low-hanging fruit, and publish it.
- A proposal for funding only succeeds if you have already started working together.
- Need very concrete goals: your partner may not share your idea of an interesting tangent.
- Students sometimes need a big push to leave their comfort zone and work together.
15 For more information
- Douglas Thain
  - dthain_at_nd.edu
- Cooperative Computing Lab
  - http://www.nd.edu/ccl
- Apply for Summer 2008 REU
  - http://www.nd.edu/ccl/reu
Supported by NSF Grants CCF-0621434 and
CNS-0643229.