Title: Information Management and Data mining
1 Information Management and Data mining
- Presented by Dr. Herna L Viktor
- Others: Dr. Iluju Kiringa, Dr. Thomas Tran, Dr. Liam Peyton
2
- Information overload: the amount of knowledge in the world has doubled in the past ten (10) years and is doubling every 18 months (American Society of Training and Documentation, ASTD)
- Massive data repositories on the petabyte (2^50 bytes) scale, e.g. it is estimated that Google maintains 4 petabytes of RAM
- E-Commerce and the Web: a digital marketplace; eHealth
- Data sharing: data must be available anywhere, any time, and in almost any form
- The Digital Rosetta Stone: our digital heritage is in danger of being lost due to the silent obsolescence of current technology
- OUR RESEARCH
- How do we share/store/preserve this data?
- What information can we use to improve our decision making?
- How do we obtain/extract and explore the hidden knowledge?
3 Information Management and Data Mining Research: Five Themes
- Data/Information Management
- (T1) Dr. Iluju Kiringa: Data sharing
- (T2) Dr. Herna L Viktor: Relational and multimedia data mining
- (T3) Dr. Thomas Tran: Software agents for e-Commerce
- (T4) Dr. Herna L Viktor: Long-term preservation of data
- (T5) Dr. Liam Peyton: Accessible data warehousing for e-health
4 (T1) Data Sharing (Dr. Iluju Kiringa)
- Data must be available anywhere, any time, and in almost any form; thus we must cope with
- very large networks of data sources
- complex heterogeneity among the sources
- inconsistent data across the sources
- data sharing and exchange between the sources
- etc.
- Several applications illustrate this need
- Genomic data
- E-health
- Enterprise alliances
5 Background and Goals (Dr. Iluju Kiringa)
- Background: data sharing on peer-to-peer (P2P) networks
- P2P networks are open-ended networks of distributed computational nodes (peers)
- Each peer can directly exchange data and/or services with a set of other peers
- Peers act autonomously, including for joining/leaving
- Peers are not subject to global control in the form of global registries, global services, global resource management, or a global schema and data repository
- Mostly used for sharing files (plain text, songs, movies, video, etc.); some examples are
- Napster, Gnutella, Kazaa: file sharing applications
- SETI@home: distributed computing application
- Research Goal
- Enhance data sharing on P2P networks to offer the same high-quality access to data that classical distributed relational DBMSs offer
6 Data Sharing Research Issues (Dr. Iluju Kiringa)
- Heterogeneity management
- Interoperability of peer databases
- Syntactic and semantic heterogeneity
- Dynamics and scale management
- Protocols for peer databases to join/leave networks
- Query processing via propagation
- Query propagation through the network
- Query optimization
- Data coordination using
- update propagation
- distributed triggers
- Transaction processing
- Design non-classical transaction models and correctness criteria
- Implement the models
- Service-oriented architecture
- Design and compare several possible architectures for a peer DBMS
- Implement some of these architectures
- Deploy a real network
- Applications
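The "query propagation through the network" item can be illustrated with a minimal sketch. This is not the project's peer DBMS; it is a generic hop-limited, breadth-first flood over acquaintance links, with no global registry. The `Peer` class, the `propagate_query` function, the `ttl` parameter, and the sample gene rows are all hypothetical names and data chosen for illustration.

```python
from collections import deque


class Peer:
    """Hypothetical peer: local rows plus directly known acquaintances."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows            # local data, a list of dicts
        self.acquaintances = []     # peers this peer can contact directly

    def local_answer(self, predicate):
        # Each peer answers only from its own data.
        return [row for row in self.rows if predicate(row)]


def propagate_query(origin, predicate, ttl=3):
    """Forward a query hop by hop, at most `ttl` hops from the origin."""
    seen = {origin.name}            # avoid asking the same peer twice
    queue = deque([(origin, ttl)])
    results = []
    while queue:
        peer, hops_left = queue.popleft()
        results.extend((peer.name, row) for row in peer.local_answer(predicate))
        if hops_left == 0:
            continue                # hop budget used up, do not forward
        for neighbour in peer.acquaintances:
            if neighbour.name not in seen:
                seen.add(neighbour.name)
                queue.append((neighbour, hops_left - 1))
    return results


# Toy network A - B - C (assumed data, for illustration only).
a = Peer("A", [{"gene": "BRCA1", "lab": "A"}])
b = Peer("B", [{"gene": "TP53", "lab": "B"}])
c = Peer("C", [{"gene": "BRCA1", "lab": "C"}])
a.acquaintances = [b]
b.acquaintances = [a, c]
c.acquaintances = [b]

hits = propagate_query(a, lambda row: row["gene"] == "BRCA1", ttl=2)
```

With `ttl=2` the query issued at A reaches C through B; with `ttl=1` it stops at B. A real peer DBMS would also have to translate the query between heterogeneous schemas at each hop, which this sketch leaves out.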
7 (T2) Data Mining (Dr. Herna L Viktor)
- Multi-relational data mining and link mining
- Aim to directly mine a relational database,
without extensive preprocessing or flattening
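One reason to avoid flattening can be shown with a toy example. The tables, column names, and values below are invented for illustration and are not from the research itself. Joining a one-to-many relation into a single table repeats the parent row, so a statistic computed on the flat table no longer describes the original entities.

```python
# Hypothetical toy tables: one patient has many prescriptions.
patients = [
    {"id": 1, "outcome": "adverse"},
    {"id": 2, "outcome": "ok"},
]
prescriptions = [
    {"patient_id": 1, "drug": "X"},
    {"patient_id": 1, "drug": "Y"},
    {"patient_id": 1, "drug": "Z"},
    {"patient_id": 2, "drug": "X"},
]

# Flattened view: one row per (patient, prescription) pair.
flat = [
    {**p, **rx}
    for p in patients
    for rx in prescriptions
    if rx["patient_id"] == p["id"]
]
flat_adverse_rate = sum(r["outcome"] == "adverse" for r in flat) / len(flat)


def drugs_of(patient_id):
    """Related rows kept as a set attached to the patient."""
    return {rx["drug"] for rx in prescriptions if rx["patient_id"] == patient_id}


# Relational view: one example per patient, links kept intact.
relational = [(p["outcome"], drugs_of(p["id"])) for p in patients]
true_adverse_rate = sum(o == "adverse" for o, _ in relational) / len(relational)

print(flat_adverse_rate, true_adverse_rate)  # 0.75 0.5
```

Patient 1 appears three times in the flat table, which inflates the adverse rate from 0.5 to 0.75. Mining the relational structure directly avoids this kind of distortion.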
8 Data Mining (Dr. Herna L Viktor)
- Multimedia (2D and 3D) data mining
- Searching for similarities in multimedia databases
- Locating clusters of images, 3D objects
- Classifying images, 3D objects within a cluster
- Application
- Anthropometry (poster)
- Health care
- Cultural Heritage
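Searching for similarities in a multimedia database is often done by comparing feature vectors extracted from each object. The sketch below assumes such vectors already exist and uses cosine similarity as the measure; both are illustrative choices, not necessarily the descriptors or measure used in this research. The names `cosine_similarity`, `most_similar`, and the `scan_*` entries are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, database, k=2):
    """Return the names of the k stored objects most similar to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Assumed 3-dimensional descriptors for four stored objects.
database = {
    "scan_01": [1.0, 0.0, 0.0],
    "scan_02": [0.9, 0.1, 0.0],
    "scan_03": [0.0, 1.0, 0.0],
    "scan_04": [0.0, 0.0, 1.0],
}

nearest = most_similar([1.0, 0.0, 0.1], database, k=2)
print(nearest)  # ['scan_01', 'scan_02']
```

The same similarity function could drive a clustering step, grouping objects whose pairwise similarity exceeds a chosen threshold.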
9 (T3) Software Agents in E-Commerce (Dr. Thomas Tran)
- The concept of an agent provides a convenient and powerful way to describe a complex software entity that is capable of acting with a certain degree of autonomy in order to accomplish tasks on behalf of its user.
- An agent is defined in terms of its behavior.
10 Supporting Decision Making (Dr. Thomas Tran)
- Designing Intelligent Business Software Agents for E-Commerce
- Modeling Trust and Reputation in E-Commerce
- Developing Agent-Based Frameworks for Mobile Business
- Designing Recommender Systems for E-Commerce
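As a rough illustration of what modeling reputation can look like, the sketch below uses a simple beta-style estimate: the expected probability of a good transaction is (good + 1) / (good + bad + 2). This is a generic, well-known scheme swapped in for illustration, not the trust model developed in this research; the `SellerReputation` class and its methods are hypothetical.

```python
class SellerReputation:
    """Hypothetical buyer-side record of experiences with one seller."""

    def __init__(self):
        self.good = 0
        self.bad = 0

    def record(self, satisfied):
        # Update the counts after each completed transaction.
        if satisfied:
            self.good += 1
        else:
            self.bad += 1

    def score(self):
        # Beta-style estimate: starts at 0.5 with no evidence.
        return (self.good + 1) / (self.good + self.bad + 2)


reputation = SellerReputation()
initial = reputation.score()

for satisfied in [True] * 8 + [False] * 2:
    reputation.record(satisfied)

print(initial, reputation.score())  # 0.5 0.75
```

An unknown seller starts at 0.5, and the score moves toward the observed success rate as evidence accumulates. A buying agent could then prefer sellers whose score exceeds a chosen threshold.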
11 (T4) Long-term preservation of data (Dr. Herna L Viktor)
- The Digital Rosetta Stone
- The lifetime of a digital file is only a few decades
- We might need the digital file in 50 years
- Our repositories may become data morgues, containing data in formats that cannot be interpreted by present and future generations.
- Towards a solution
12 Long-term preservation of data (Dr. Herna L Viktor)
- Research issues
- scalability of information and infrastructure
- managing heterogeneous data sources
- handling updating of hardware and software
- transparent storage, management and retrieval
- To investigate effective ways to store, maintain and analyze digital objects over a very long period of time (50 years)
- Approach
- Detachment from original media
- Transparent migration to new technologies
- Emulate old software on new technologies
13 Long-term preservation of data (Dr. Herna L Viktor)
14 (T5) Evolving E-Health Business Processes Around Accessible Data Warehouses (Dr. Liam Peyton)
- Goals
- Process improvement to take advantage of e-technologies and Data Warehouse (DW)
- Methodology to specify, automate, manage, and analyze DW-oriented e-health processes
- Addresses privacy, confidentiality, quality, and consent, as well as heavy legacy (and often manual) processes and regulatory environments
- Activities
- Simulation of Ottawa Hospital Data Warehouse and environment
- Business Intelligence prototype: Infection control data mart, Discharge process data mart
- Quality Assurance Framework and Portal
15 Assessment Framework Tied to Operational Systems, Performance Mgt Data Warehouse Strategy
[Figure: diagram with elements Stakeholders, Use Case Maps, Goals, Reports, PIQ, Tasks, Business Systems Processes, Performance Mgt Systems Processes, Data Warehouse]
- PIQ measures the effectiveness of Reports to measure the effectiveness of the Organization in meeting its goals.
16 In Summary: Vast, evolving repositories
17
- Google in 2003 had between 2 and 5 petabytes of hard-disk storage. A more recent calculation, dated June 27, 2006, suggests that the Google cluster may now have 4 petabytes of RAM, on the same order of magnitude as the quantity of hard disk space that was estimated only three years earlier.
- As of October 15, 2005, all the files being shared on Kazaa totaled around 54 petabytes.
- 15 petabytes of data will be generated each year in particle physics experiments using CERN's Large Hadron Collider, due to be launched in 2007.
- In 2007, NOAA maintains approximately 1 petabyte of climate data. NOAA expects that its Comprehensive Large Array-data Stewardship System (CLASS) library will hold 20 petabytes of data by 2011 and 140 petabytes by 2020.
18 In Summary: Vast, evolving repositories
- Our research aims to develop new, efficient ways
to manage, share and analyze such data
19 Graduate students (Dr. Thomas Tran)
- Grad Students
- Richong Zhang (PhD)
- Zhiyong Weng (MCS)
- Vikas Kumar (MCS)
- Xiaoguang Ma (MCS)
- Tapu Kumar Ghose (MCS)
- Catherine Cormier (MSc)
- Hong Chen (MSc)
- Bo Zhan (MCS, co-supervised with Prof. Liam Peyton)
- Yao Gu (MCS, part time)
20 Graduate students and their projects (Dr. Herna L Viktor)
- Hongyu Guo (PhD): Multi-view learning
- Rana Awada (PhD): XML database mining (prelim)
- Nadia Azam (M.Sc.): Link-based clustering
- Bo Wang (M.Sc.): A storage resource broker agent for long-term preservation
- Divine Muhivu (M.Sc.): Data integration through link mining
- Isis Pena Sanchez (M.Sc.): Interestingness measurements for data mining
- Minjie Shao (M.Sc.): Mining the adverse effects of medication
- Xiaomei Xia (M.Sc.): Distributed data warehouse query processing
- Joining us: Julie Doyle, PhD (long-term preservation of data)
- Collaborations: NRC, Faculty of Management
21 Graduate students (Dr. Liam Peyton)
- Masters Students
- Sepideh Ghanavati
- Pierre Seguin
- Bo Zhan
- Collaboration with
- Prof. Daniel Amyot (Ottawa)
- Prof. Greg Richards (Ottawa)
- Prof. Michael Weiss (Carleton)
- Dr. Alan Forster (Ottawa Hospital)
22 Graduate students and collaborations (Dr. Iluju Kiringa)
- Have implemented an experimental peer DBMS
- This is joint work with
- Renee Miller (Toronto)
- John Mylopoulos (Toronto, Trento)
- Vasiliki Kantere (Athens -- NTUA)
- Anastasios Kementsietsidis (Edinburgh)
- Several students in Toronto
- Lei Jiang
- Dan Zhao
- Patricia Rodriguez
- and Ottawa
- Mehedi Masud
- Anisur Rahman
- Irfan Maki
- Several alumni
- More (strong) students are needed !!!!!
- Here is a link to visit: http://www.cs.toronto.edu/db/hyperion