Title: Big Data Analytics
1Big Data Analytics
Ph.D. Research Scholar: Vikas Kumar (201651001), SEAS, Ahmedabad University, Ahmedabad, Vikas.tyagi_at_ahduni.edu.in
Supervisor: Prof. Sanjay Chaudhary, SEAS, Ahmedabad University, Ahmedabad, sanjay.chaudhary_at_ahduni.edu.in
2OUTLINE
- INTRODUCTION
- DATA ANALYTICS
- BIG DATA ANALYTICS
- OPEN RESEARCH ISSUES
- CONCLUSIONS
- REFERENCES
3INTRODUCTION
4Big Data Definition
- (Fisher et al.)
- Big data is data that cannot be handled and processed by most current information systems or methods.
- Most traditional data mining and data analytics methods were developed for a centralized analysis process and may not be directly applicable to big data.
5Big Data Definition (Cont)
- (Laney et al.)
- A well-known definition of Big Data is the 3Vs:
- Volume (the data is huge)
- Velocity (the data changes over time and arrives at speed)
- Variety (the data comes from multiple sources in multiple forms)
6Big Data Definition (Cont)
- (Latest Enhanced Definition)
- The 3Vs definition was considered incomplete, so the following dimensions were added to it:
- Veracity
- Validity
- Value
- Variability
- Venue
- Vocabulary
- Vagueness
- Data that satisfies all of these properties is known as Big Data.
7Sources of Big Data
8https//www.google.de/search?qevolutionofbusine
ssintelligencenewwindow1tbmischtbousource
univsaXeigEGoU5KXBuTb4QSGsoH4BQved0CDsQsAQb
iw1366bih64
10Application domains of Big Data
11http//www.meltinfo.com/ppt/ibm-big-data
12Big Data in Business Intelligence
13The Evolution of Business Intelligence
[Figure: business intelligence by scale across the 1990s, 2000s, and 2010s]
https//www.google.de/search?qevolutionofbusine
ssintelligencenewwindow1tbmischtbousource
univsaXeigEGoU5KXBuTb4QSGsoH4BQved0CDsQsAQb
iw1366bih64
14OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture and technology)
15Big data in design and engineering
- Engineering departments of manufacturing companies.
- Boeing's new 787 aircraft is perhaps the best example of Big Data, a plane designed and manufactured.
- Big Data needs to be transferred for conversion into machining-related information to allow the product to be manufactured.
16Reasons for the importance of Big Data
- Increase innovation and development of next-generation products
- Improve customer satisfaction
- Sharpen competitive advantages
- Create narrower customer segments
- Reduce downtime
17Cloud and big data
- From a cloud perspective, I believe that the transfer and archiving of Big Data will become a key capability of a manufacturing-focused cloud environment.
- Servers based on the Intel Xeon processor E5 and E7 families are at the heart of infrastructure that supports both cloud and big data environments.
- Ideal for storing and processing large volumes of data
- Web-based tools will allow you to upload your Big Data to the manufacturing cloud.
18Big data in Ecommerce
- Collect, store, and organize data from multiple data sources.
- Big Data helps track and better understand a variety of information from many different sources (e.g., inventory management system, CRM, AdWords/AdSense analytics, email service provider statistics).
19Big Data and HPC Software systems
20There are many Big Data and HPC software systems, organized in 17 (21) layers. Build on, do not compete with, the 293 HPC-ABDS systems.
21Functionality of 21 HPC-ABDS Layers
- Message Protocols
- Distributed Coordination
- Security and Privacy
- Monitoring
- IaaS Management from HPC to hypervisors
- DevOps
- Interoperability
- File systems
- Cluster Resource Management
- Data Transport
- A) File management B) NoSQL C) SQL
- In-memory databases/caches / Object-relational mapping / Extraction Tools
- Inter-process communication: collectives, point-to-point, publish-subscribe, MPI
- A) Basic programming model and runtime, SPMD, MapReduce B) Streaming
- A) High-level programming B) Frameworks
- Application and Analytics
- Workflow-Orchestration
These are the 21 functionalities (counting the subparts of items 11, 14, and 15). Let's discuss how they are used in particular applications: 4 cross-cutting at the top, 17 in order of the layered diagram starting at the bottom.
23Software for a Big Data Initiative
- Functionality of ABDS and Performance of HPC
- Workflow: Apache Crunch, Python or Kepler
- Data Analytics: Mahout, R, ImageJ, ScaLAPACK
- High-level Programming: Hive, Pig
- Batch Parallel Programming model: Hadoop, Spark, Giraph, Harp, MPI
- Streaming Programming model: Storm, Kafka or RabbitMQ
- In-memory: Memcached
24Software for a Big Data Initiative (Cont)
- Data Management: HBase, MongoDB, MySQL
- Distributed Coordination: ZooKeeper
- Cluster Management: YARN, Slurm
- File Systems: HDFS, Object store (Swift), Lustre
- DevOps: Cloudmesh, Chef, Puppet, Docker, Cobbler
- IaaS: Amazon, Azure, OpenStack, Docker, SR-IOV
- Monitoring: Inca, Ganglia, Nagios
25SIX Forms of MapReduce
- MR Basic Statistics
- PP Local Analytics
- Iterative
- Graph
- Streaming
- Shared Memory
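The first form listed, MR Basic Statistics, can be illustrated with a minimal sketch. It assumes that "MR" denotes a plain map step followed by a reduce step; the function names and the sample partitions are hypothetical and are not taken from any particular framework. Each partition is summarized independently (map), the summaries are merged pairwise (reduce), and the mean and population variance are derived from the merged summary.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Map every partition, reduce the summaries, then finalize."""
    summaries = map(map_partition, partitions)
    count, total, total_sq = reduce(combine, summaries, (0, 0.0, 0.0))
    if count == 0:
        raise ValueError("no data")
    mean = total / count
    variance = total_sq / count - mean * mean  # population variance
    return mean, variance


# Hypothetical data, split into three partitions as if held on three nodes.
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
mean, variance = mean_and_variance(partitions)
```

Because `combine` is associative and commutative, the partial summaries can be merged in any order, which is what lets the reduce step run across nodes. The sum-of-squares formula is kept for brevity; it can lose precision on large values, so a production system would likely use a more stable merge.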
26Big Data in Agriculture Recommendation System
27Required BDA of ARS
- Data Sources
- Geospatial Data Analytics (agro-eco zones and water resources)
- Price data from different APMCs
- Crop yield data from government agencies
- Knowledge bases (ontologies)
- Analytics Required
- Suitable crop pattern identification in a region
- Disease identification in crops
- Recommendations based on observations
28- Analytics Required (cont)
- Machine learning algorithm development for BDA
- Inferencing engine for recommendation generation
- Development of different analytics services for data integration and communication.
29Open Research Issues
- Service development and advanced machine learning for a Big Data Analytics system will be entirely different from the development of conventional information systems.
- Big Data cannot be handled on a centralized system; hence parallel algorithms should be designed to run in a BDA environment.
30Open Research Issues (Cont)
- Platform and framework perspective
- Input and output ratio of the platform: the assumption of infinite computing resources is thoroughly impractical.
- Communication between systems: a Big Data Analytics system should be able to integrate the data and analytics from different subsystems, and the communication cost needs to be optimized (a typical cost optimization problem).
- Bottleneck on data analytics systems: the data deluge of big data will fill up the input system of data analytics and increase the computation load of data analysis.
31Open Research Issues (Cont)
- Platform and framework perspective
- Bottleneck on data analytics systems: one current way to avoid bottlenecks in a data analytics system is to add more computing resources; another is to split the analysis work across different computation nodes. BDA needs a complete consideration of the whole data analytics process to avoid the bottleneck.
- Security issues
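The idea of splitting analysis work across computation nodes can be sketched as follows. This is a minimal illustration under stated assumptions: threads from Python's standard `concurrent.futures` module stand in for separate nodes, and `analyze_chunk` with its threshold of 50 is a hypothetical per-node analysis, not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Split records into n_parts nearly equal contiguous chunks."""
    size, extra = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def analyze_chunk(chunk):
    """Hypothetical per-node analysis: count records above a threshold."""
    return sum(1 for r in chunk if r > 50)


def analyze_in_parallel(records, n_workers=4):
    """Fan the chunks out to workers, then merge the partial counts."""
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(analyze_chunk, chunks))
    return sum(partial_counts)


records = list(range(100))  # hypothetical input data
total = analyze_in_parallel(records)
```

The split only pays off when the per-chunk work outweighs the cost of distributing the chunks and collecting the results, which is the communication cost the earlier slide asks to optimize.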
32Open Research Issues (Cont)
- Data Mining Perspective
- Data mining algorithms for MapReduce solutions: most traditional data mining algorithms are not designed for parallel computing; therefore, they are not particularly useful for Big Data mining. We need to design new algorithms or modify existing ones to be compatible with the MapReduce architecture.
- Noise, outliers, incomplete and inconsistent data: these problems, inherited from conventional systems, will be scaled up in BDA, and thus their effect needs to be controlled in a distributed environment.
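As one illustration of adapting a traditional data mining algorithm to the map-reduce style, the sketch below expresses a single k-means iteration as a map phase and a reduce phase. K-means is chosen here only as a familiar example; the slides do not name a specific algorithm, and the function names and sample points are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (centroid index, (point, 1)) for each point in a partition."""
    return [(nearest(p, centroids), (p, 1)) for p in points]


def reduce_phase(pairs, centroids):
    """Reduce: sum points and counts per key, then average."""
    sums, counts = {}, defaultdict(int)
    for key, (point, one) in pairs:
        prev = sums.get(key, (0.0,) * len(point))
        sums[key] = tuple(s + x for s, x in zip(prev, point))
        counts[key] += one
    # A centroid that received no points keeps its old position.
    return [
        tuple(s / counts[k] for s in sums[k]) if k in sums else centroids[k]
        for k in range(len(centroids))
    ]


def kmeans_step(partitions, centroids):
    """One iteration: map each partition independently, then reduce."""
    pairs = [pair for part in partitions for pair in map_phase(part, centroids)]
    return reduce_phase(pairs, centroids)


# Hypothetical 2-D points held in two partitions.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
```

The map phase needs only the current centroids and its own partition, so partitions can be processed on separate nodes. The centroids must still be redistributed after every iteration, which is one reason iterative algorithms are harder to fit into this model than single-pass statistics.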
33Open Research Issues (Cont)
- Data Mining Perspective
- Bottlenecks on data mining algorithms: synchronization issues arise between the speed and process completion time required by different processing nodes. The bottlenecks of data mining algorithms will become an open issue for BDA, which means we need to take this issue into account when developing a new data mining algorithm for BDA.
- Privacy issue
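The synchronization issue can be made concrete with a small worked calculation. It assumes a synchronous step in which every node must finish before the next step starts; the node timings are hypothetical. Under that assumption the step takes as long as the slowest node, and every faster node sits idle for the difference.

```python
def barrier_step_cost(node_times):
    """Return (step time, total idle time) for one synchronous step."""
    step_time = max(node_times)  # the slowest node sets the pace
    idle_time = sum(step_time - t for t in node_times)
    return step_time, idle_time


# Hypothetical per-node completion times in seconds; one node is a straggler.
node_times = [2.0, 2.5, 9.0, 2.0]
step_time, idle_time = barrier_step_cost(node_times)
```

With these numbers the step takes 9.0 seconds and the four nodes together spend 20.5 seconds waiting, so a single slow node dominates the cost of the whole step.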
34Conclusions
- While developing a BDA system, we need to take care of input data, analytics requirements, parallel processing, and distribution of computing tasks.
- BDA opens opportunities for developing scalable algorithms for machine learning and data mining.
- BDA has wide scope in the agriculture domain, and we found that the literature contains only a small contribution from big data analytics.
35References
- Tsai, C.-W.; Lai, C.-F.; Chao, H.-C.; Vasilakos, A. V. Big data analytics: a survey. Journal of Big Data, Springer Open, 2015
- Russom, P. and others. Big data analytics. TDWI Best Practices Report, Fourth Quarter, 2011, 1-35
- Assunção, M. D.; Calheiros, R. N.; Bianchi, S.; Netto, M. A.; Buyya, R. Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, Elsevier, 2015, 79, 3-15
- Chen, M.; Mao, S.; Liu, Y. Big data: a survey. Mobile Networks and Applications, Springer, 2014, 19, 171-209
- Witten, I.; Frank, E.; Hall, M. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Mateo, CA, 2011
36Thanks