Title: NSF-Relevant Challenges in Computational Intelligence
1 NSF-Relevant Challenges in Computational Intelligence
- Jaime Carbonell (jgc_at_cs.cmu.edu)
- Tom Mitchell, Guy Blelloch, Randy Bryant, et al.
- School of Computer Science
- Carnegie Mellon University
- 26-April-2007
I) Major Computational Intelligence Research Areas
II) Next-Generation Infrastructure (DISC)
2 Computational Intelligence
- Machine Learning
- Inductive learning algorithms, active learning
- Data mining, novel pattern detection
- Language Technologies
- Multilingual next-generation search engines
- Machine translation (e.g. Arabic → English)
- Perception
- Computer vision, tactile sensing (e.g., in robotics)
- Planning and optimizing
- Reasoning and planning under uncertainty
- Non-linear optimization (beyond O.R.) w/uncertainty
- Key scientific applications
- Proteomics, genomics, computational biology
- Modeling human brain functions
3 Machine Learning
Speech Recognition
- Reinforcement learning
- Predictive modeling
- Pattern discovery
- Hidden Markov models
- Convex optimization
- Explanation-based learning
- ....
Automated Control learning
Extracting facts from text
4 Leveraging Existing Data Collecting Systems
Figure: 1999 influenza outbreak, data streams plotted by week (1999-2000) (Moore, 2002)
- Influenza cultures
- Sentinel physicians
- WebMD queries about cough etc.
- School absenteeism
- Sales of cough and cold meds
- Sales of cough syrup
- ER respiratory complaints
- ER viral complaints
- Influenza-related deaths
5 Cluster Evolution and Density Change Detection
$\frac{d^2 F(r(t))}{dt^2}$
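The second time derivative on this slide can be approximated numerically when the density measure is sampled at regular intervals. The following is a minimal sketch under that assumption, using central differences; the names `second_derivative` and `change_points`, the threshold, and the sample values are hypothetical and not taken from the slides.

```python
def second_derivative(values, dt=1.0):
    """Central-difference estimate of d^2F/dt^2 at the interior sample points."""
    return [
        (values[i + 1] - 2 * values[i] + values[i - 1]) / (dt * dt)
        for i in range(1, len(values) - 1)
    ]


def change_points(values, threshold, dt=1.0):
    """Indices into `values` where the estimated second derivative is large."""
    accel = second_derivative(values, dt)
    # accel[k] is the estimate at values[k + 1], hence the index shift.
    return [k + 1 for k, a in enumerate(accel) if abs(a) > threshold]


# Hypothetical density readings for one cluster over eight time steps.
density = [1.0, 1.0, 1.1, 1.0, 1.2, 2.5, 5.0, 5.1]
flagged = change_points(density, threshold=1.0)
```

A fixed threshold is the simplest possible detector; a real system would more likely calibrate it against the noise level of the density estimate.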
6 Classifier: Rocchio, Topic: Civil War (R76 in TREC10), Threshold: MLR
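A Rocchio classifier scores a document by its similarity to a prototype built from on-topic and off-topic training examples, then applies a threshold. The sketch below is an illustration only: it assumes small dense term vectors and cosine similarity, and it uses a fixed threshold as a stand-in for the MLR thresholding named on the slide. The names `rocchio_prototype` and `is_on_topic`, the weights `beta` and `gamma`, and the toy vectors are assumptions.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio_prototype(positives, negatives, beta=1.0, gamma=0.5):
    """Prototype = beta * centroid(on-topic) - gamma * centroid(off-topic)."""
    pos_c, neg_c = centroid(positives), centroid(negatives)
    return [beta * p - gamma * q for p, q in zip(pos_c, neg_c)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_on_topic(doc, prototype, threshold=0.5):
    """Accept the document when its similarity to the prototype clears the threshold."""
    return cosine(doc, prototype) >= threshold


# Toy term vectors (hypothetical three-word vocabulary).
on_topic = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
off_topic = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
prototype = rocchio_prototype(on_topic, off_topic)
```

Here `is_on_topic([1.0, 0.0, 0.0], prototype)` accepts the document, while `is_on_topic([0.0, 0.0, 1.0], prototype)` rejects it.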
7 Info-Age Bill of Rights
- Get the right information: Search Engines
- To the right people: Personalization
- At the right time: Anticipatory Analysis
- On the right medium: Speech Recognition
- In the right language: Machine Translation
- With the right level of detail: Summarization
8 MMR vs Current Search Engines
Figure labels: documents, query, MMR, IR
λ controls spiral curl
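Maximal Marginal Relevance (MMR) re-ranks results by trading relevance to the query against redundancy with documents already selected, with λ setting the balance. The following is a minimal sketch of greedy MMR selection, assuming cosine similarity over small dense vectors; the names `cosine` and `mmr_rank` and the toy vectors are illustrative and not from the slides.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_rank(query, docs, lam=0.7, k=None):
    """Greedy MMR: repeatedly pick the document that maximizes
    lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected)."""
    k = len(docs) if k is None else k
    remaining = list(range(len(docs)))
    selected = []
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(docs[i], query)
            redundancy = max(
                (cosine(docs[i], docs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0, 0.0]
docs = [
    [1.0, 0.9, 0.0],  # highly relevant
    [1.0, 0.9, 0.0],  # exact duplicate of document 0
    [0.0, 1.0, 1.0],  # less relevant, but novel
]
pure_relevance = mmr_rank(query, docs, lam=1.0)  # ordinary relevance ranking
diverse = mmr_rank(query, docs, lam=0.3)         # penalizes the duplicate
```

With `lam=1.0` the ranking reduces to plain relevance ordering, so the duplicate comes second; with a smaller `lam` the novel document is promoted ahead of it.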
9 Types of Machine Translation
Semantic Analysis
Sentence Planning
Syntactic Parsing
Text Generation
Transfer Rules
Source (Arabic)
Target (English)
Direct: SMT, EBMT
Requires Massive Data Resources
10 2005 NIST Arabic-English MT
- Interlingual MT
- Grammars, semantics
- Best for focused domains
- Corpus-Based MT
- Pre-translated text (10-200M words)
- Target language text (100M-1 trillion words)
- Best for general MT
- Context-Based MT
- Improved variant of corpus-based MT
- Perfect client for DISC
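Figure: BLEU scores of the participating systems (axis from 0.0 to 0.7)
- Quality bands, listed from the top of the axis down: Expert Human translator, Usable translation, Human Editable translation, Topic Identification, Useless Region
- Systems shown: Google, ISI, IBM CMU, UMD, JHU-CU, Edinburgh, Systran, Mitre, FSC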
11 Arabic Statistical-MT Output
Arabic source: [text lost to character encoding]
Beijing January 17 / Shinhua / the Chinese and
Russian officials urged all parties concerned to
" remain calm and exercise restraint " over the
nuclear issue of the Democratic People's Republic
of Korea. He met with vice Chinese foreign
minister Yang Chang won the deputy of the Russian
foreign minister Alexander Losyukov at a lunch
with invited interested parties to continue the
search for a peaceful solution through dialogue
under the current complicated situation.
BLEU .64
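BLEU measures how many of a candidate translation's n-grams also appear in a reference translation, with a penalty for candidates that are too short. The following is a simplified sketch, assuming a single reference, whitespace tokenization, and uniform weights over 1- to 4-grams; official evaluations compute the score at corpus level, typically against several references, so this function will not reproduce the number above. The function names and the toy sentences are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalized.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat today"
exact = bleu(reference, reference)
shorter = bleu("the cat sat on the mat", reference)
unrelated = bleu("completely different words here", reference)
```

Without smoothing, any candidate with no matching 4-gram scores zero, which is one reason the metric is normally reported over a whole test set rather than per sentence.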
12 What About Minor Languages or Dialects without Massive Data?
13 (Borrowed from Judith Klein-Seetharaman)
PROTEINS: Sequence → Structure → Function
Primary Sequence
MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML
AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA
VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG
GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT
WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN
ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES
ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG
SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT
LCCGKNPLGD DEASTTVSKT ETSQVAPA
Folding
3D Structure
Complex function within network of proteins
Normal
15 Predicting Protein Structures
- Protein structure is a key determinant of protein function
- Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins
- The gap between the known protein sequences and structures
- 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
- Therefore we need to predict structures in silico
16 Linked Segmentation CRF
- Node: secondary structure elements and/or simple fold
- Edges: local interactions and long-range inter-chain and intra-chain interactions
- L-SCRF: conditional probability of y given x is defined as
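For orientation, a conditional random field over a graph with node set V and edge set E is usually written in the general form below. This is the generic pairwise CRF, offered as a hedged sketch and not necessarily the exact L-SCRF definition; the feature functions f_k and g_l, the weights θ_k and μ_l, and the normalizer Z(x) are notation assumed here.

```latex
P(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\Bigg(
    \sum_{i \in V} \sum_{k} \theta_k \, f_k(\mathbf{x}, y_i)
    + \sum_{(i,j) \in E} \sum_{l} \mu_l \, g_l(\mathbf{x}, y_i, y_j)
  \Bigg),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'}
  \exp\Bigg(
    \sum_{i \in V} \sum_{k} \theta_k \, f_k(\mathbf{x}, y'_i)
    + \sum_{(i,j) \in E} \sum_{l} \mu_l \, g_l(\mathbf{x}, y'_i, y'_j)
  \Bigg)
```

In this reading, the node terms score individual structural segments and the edge terms score the local and long-range interactions between them.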
17 Fold Alignment Prediction: β-Helix
- Predicted alignment for known β-helices on cross-family validation
18 fMRI to observe human brain activity
Machine learning to discover patterns in complex data
New discoveries about human brain function
Our algorithms have learned to distinguish whether a human subject is reading a word (e.g. tools or buildings) with 90% accuracy
19 Requisite Infrastructure
- Data Intensive SuperComputing (DISC) for tera-scale and peta-scale data repositories
- Advanced algorithms research
- Massively-parallel decomposition
- Scalability in analytics and learning
- Extracting compact models for run-time
- Planning, reasoning, learning w/uncertainty
- Active Learning (maximally reducing uncertainty)
- Domain expertise (e.g. proteomics, neural sciences, astronomy, network security, ...)
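One common way to approximate "maximally reducing uncertainty" in active learning is uncertainty sampling: ask for labels on the examples the current model is least sure about. The sketch below assumes a binary classifier that already outputs a positive-class probability for each unlabeled example; the heuristic itself, the names `entropy` and `select_queries`, and the sample probabilities are assumptions for illustration, not the method described in the slides.

```python
import math


def entropy(p):
    """Binary entropy (in bits) of a predicted positive-class probability."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_queries(probabilities, budget):
    """Indices of the `budget` unlabeled examples with the highest entropy."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for four unlabeled examples.
predicted = [0.95, 0.52, 0.10, 0.40]
to_label = select_queries(predicted, budget=2)
```

The examples near 0.5 are selected first, since labeling them is expected to change the model the most under this heuristic.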
20 System Comparison: Data
DISC
- System collects and maintains data
- Shared, active data set
- Computation colocated with storage
- Faster access
Conventional Supercomputers
- Data stored in separate repository
- No support for collection or management
- Brought into system for computation
- Time consuming
- Limits interactivity
21 Program Model Comparison
DISC
- Layers: Application Programs, Machine-Independent Programming Model, Runtime System, Hardware
- Application programs written in terms of high-level operations on data
- Runtime system controls scheduling, load balancing, ...
Conventional Supercomputers
- Layers: Application Programs, Software Packages, Machine-Dependent Programming Model, Hardware
- Programs described at very low level
- Specify detailed control of processing and communications
- Rely on small number of software packages
- Written by specialists
- Limits classes of problems and solution methods
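The DISC side of this comparison, where the application states high-level operations on data and the runtime decides how to execute them, can be illustrated with a map/reduce-style job. The slides do not name a specific programming model, so the style is an assumption, and `run_job` below is a toy single-process stand-in for a runtime that would really handle partitioning, scheduling, and load balancing across machines. All names are hypothetical.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Toy runtime: apply map_fn to every record, group the emitted
    (key, value) pairs by key, then apply reduce_fn to each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


# Application code: only the per-record and per-key logic is specified.
def emit_words(line):
    """Emit (word, 1) for every word in a line of text."""
    for word in line.lower().split():
        yield word, 1


def total(key, values):
    """Sum the counts collected for one word."""
    return sum(values)


lines = ["the right information", "the right people"]
counts = run_job(lines, emit_words, total)
```

The application functions never mention processors, messages, or data placement, which is the contrast with the machine-dependent model on the conventional side.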
22 Final Thoughts
- Opportunities in Computational Intelligence
- Machine learning for tough problems: relevant novelty detection, structural learning, active learning
- Scientific applications: Computational X (X = biology, linguistics, astrophysics, chemistry, ...)
- Next generation computational infrastructure
- DISC principle (beyond HPC, beyond grid, ...)
- Algorithmic fundamentals
- International programs (on common problems)