Data Mining on the Web via Cloud Computing

1
Data Mining on the Web via Cloud Computing
  • COMS E6125
  • Web Enhanced Information Management
  • Presented By
  • Hemanth Murthy

2
Data Mining on the Web via Cloud Computing
  • Introduction to
  • Web Mining
  • Cloud computing infrastructure
  • Apache Hadoop
  • Web Usage Mining using Hadoop HDFS and Map/Reduce
    technologies

3
What is Web Mining
  • What is Web Mining: data mining techniques
    applied to the Web to discover user patterns, such as
  • what users are looking for on the internet,
  • what type of information the users are
    looking for,
  • how to structure the data available on the web, etc.
  • Why Web Mining
  • amount of information available on the Web is
    enormous.
  • difficult for users to find and utilize
    information
  • not easy for content providers to classify and
    catalog documents

4
Types of Web Mining
  • Web mining types
  • Web usage mining.
  • Web content mining.
  • Web structure mining.
  • Web usage mining - applying data mining
    techniques to discover usage patterns from Web
    data, to understand and serve the needs of
    Web-based applications better.
  • Web content mining describes the automatic search
    of information available online, and involves
    mining web data content.
  • Web structure mining is concerned with the
    description/organization of the content.

5
More on Web Usage Mining
  • Preprocessing.
  • converts the usage, content, and structure
    information in the available data sources into the
    data abstractions needed for pattern discovery.
  • regarded as the most difficult task in Web Usage
    Mining.
  • Pattern Discovery.
  • uses the algorithms and techniques from data
    mining, machine learning, statistics and pattern
    recognition.
  • Pattern analysis.
  • many redundant rules or patterns are found during
    the discovery phase.
  • the main objective here is to filter these out so
    that only the useful patterns remain for analysis.
  • tools include SQL queries and visualization
    techniques such as graphing patterns.
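
The three phases above can be sketched in a few lines of Python. This is only an illustration, not the presenter's implementation. It assumes the input is in Common Log Format, uses a hypothetical 30-minute inactivity timeout to split sessions, and treats simple page-to-page transition counts as the "pattern". All function names and the min_support threshold are invented for this example.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumption: Common Log Format, e.g.
# host ident user [time] "METHOD path protocol" status bytes
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
SESSION_TIMEOUT = timedelta(minutes=30)  # hypothetical cutoff for illustration


def preprocess(lines):
    """Preprocessing: parse raw log lines into (host, timestamp, path) records."""
    records = []
    for line in lines:
        m = LOG_RE.match(line)
        if m is None or not m.group("status").startswith("2"):
            continue  # drop malformed lines and non-successful requests
        ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        records.append((m.group("host"), ts, m.group("path")))
    return records


def sessionize(records):
    """Group each host's requests into sessions split by inactivity gaps."""
    by_host = defaultdict(list)
    for host, ts, path in sorted(records, key=lambda r: (r[0], r[1])):
        by_host[host].append((ts, path))
    sessions = []
    for visits in by_host.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, path) in zip(visits, visits[1:]):
            if ts - prev_ts > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions


def discover_transitions(sessions):
    """Pattern discovery: count page-to-page transitions within sessions."""
    counts = Counter()
    for pages in sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


def analyze(counts, min_support=2):
    """Pattern analysis: keep only transitions seen at least min_support times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}
```

Chaining the calls as analyze(discover_transitions(sessionize(preprocess(lines)))) mirrors the slide's order: preprocessing, pattern discovery, then pattern analysis as a filter over the discovered patterns.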

6
Cloud Computing
  • Uses existing commodity resources.
  • reduce cost of the services.
  • helps in concentrating on deploying the services
    faster.
  • more flexibility.
  • Virtualization is used as the standard
    deployment object.
  • provides abstraction between hardware and
    computing software.
  • enables loose coupling of the resources.
  • Services are delivered over the network.

7
HDFS - Hadoop Distributed File System
  • Data-parallel, but each process runs sequentially.
  • Data is processed in a batch-oriented fashion.
  • Data is communicated through the distributed file
    system, so latency is an issue; HDFS is
    designed for high throughput rather than low
    latency.
  • At Facebook, jobs that took more than a day were
    cut down to less than a day by using Hadoop.

8
Important characteristics of HDFS
  • Hardware Failure.
  • Streaming Data Access.
  • Large Data Sets.
  • Moving Computation is Cheaper than Moving Data

9
Web Mining, HDFS and Map/Reduce
  • HDFS can be the storage backbone for Web Mining
    applications.
  • HDFS replicates data at several nodes in the
    cluster to ensure robustness, data recovery in
    case of failure etc.
  • Map/Reduce: a framework for realizing
    distributed computing (the Compute Cloud).
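
As a rough illustration of the Map/Reduce model applied to usage data, the sketch below counts page hits in a single Python process. It does not use Hadoop's API. The record layout ("host<TAB>page"), the function names, and the idea of passing splits as in-memory lists are all assumptions made for this example; in a real cluster each split would be a block stored (and replicated) in HDFS, and map tasks would run on the nodes holding those blocks.

```python
from collections import defaultdict
from itertools import chain


def map_hits(record):
    """Map: emit (page, 1) for one record of the assumed form 'host<TAB>page'."""
    _host, page = record.split("\t")
    yield page, 1


def reduce_hits(page, counts):
    """Reduce: sum all counts emitted for one page."""
    return page, sum(counts)


def run_job(splits, mapper, reducer):
    """Run the mapper over every split, group values by key, then reduce.

    The grouping step stands in for the framework's shuffle phase.
    """
    grouped = defaultdict(list)
    for record in chain.from_iterable(splits):
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Two hypothetical input splits of a usage log.
splits = [
    ["alice\t/home", "bob\t/home"],
    ["alice\t/cart", "carol\t/home"],
]
print(run_job(splits, map_hits, reduce_hits))  # {'/cart': 1, '/home': 3}
```

Because the mapper looks at one record at a time and the reducer at one key at a time, both steps can be spread across many machines without changing the logic.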

10
Web Mining: Hive
  • Developed by the Facebook Data Infrastructure
    Team in order to exploit the features of Hadoop
    HDFS and Map/Reduce.
  • Next-generation infrastructure designed to provide
    data processing systems that
  • enable easy data summarization
  • support ad-hoc querying and analysis of large volumes of
    data
  • Allows users to embed custom map/reduce functions

11
Web Usage Mining Architecture using HDFS,
Map/Reduce and HIVE
  • How Apache Hadoop can be used in Web Usage
    Mining.
  • The system consists of HDFS as the Storage Cloud.
  • Map/Reduce framework can be used as the Compute
    Cloud.
  • Hive can be used to format the data.

12
Web Usage Mining Architecture

13
References
  • HDFS: http://hadoop.apache.org/hdfs
  • Map/Reduce: http://hadoop.apache.org/mapreduce
  • Web Mining: Information and Pattern Discovery on
    the World Wide Web,
    http://maya.cs.depaul.edu/~mobasher/webminer/survey/survey.html
  • Ashish Thusoo, Hive: A Petabyte Scale Data
    Warehouse using Hadoop,
    http://www.facebook.com/note.php?note_id=89508453919

14
References
  • Dhruba Borthakur, Hadoop Introduction,
    http://hadoop.apache.org/common/docs/r0.18.3/hdfs_design.html#Introduction
  • Jaideep Srivastava, Robert Cooley, Mukund
    Deshpande, Pang-Ning Tan, Web Usage Mining:
    Discovery and Applications of Usage Patterns from
    Web Data

15
  • Thank You!