Title: Pass4sure CCD-410 Cloudera Certified Developer
1Cloudera Certified Developer for Apache Hadoop
(CCDH)
2Who We Are
Mission: To help organizations profit from their data
- How We Do It
- We deliver relevant products and services.
- A distribution of Apache Hadoop that is tested, certified and supported
- Comprehensive support and professional service offerings
- A suite of management software for Hadoop operations
- Training and certification programs for developers, administrators, managers and data scientists
- Technical Team
- Unmatched knowledge and experience.
- Founders, committers and contributors to Hadoop
- A wealth of experience in the design and delivery
of production software
- Credentials
- The Apache Hadoop experts.
- Number 1 distribution of Apache Hadoop in the world
- Largest contributor to the open source Hadoop ecosystem
- More committers on staff than any other company
- More than 100 customers across a wide variety of industries
- Strong growth in revenue and new accounts
- Leadership
- Strong executive team with proven abilities.
- Mike Olson, CEO
- Kirk Dunn, COO
- Charles Zedlewski, VP, Product
- Mary Rorabaugh, CFO
- Jeff Hammerbacher, Chief Scientist
- Amr Awadallah, VP Engineering
- Doug Cutting, Chief Architect
- Omer Trajman, VP, Customer Solutions
3Users of Cloudera
Retail Consumer
Financial
Web
Media
Telecom
https://www.pass4sureexam.com/ccD-410.html
4What is Apache Hadoop?
CORE HADOOP COMPONENTS
- Hadoop is a platform for data storage and processing that is
- Scalable
- Fault tolerant
- Open source
- Scalability
- Scale-out architecture divides workloads across multiple nodes
- Flexible file system eliminates ETL bottlenecks
- Low Cost
- Can be deployed on commodity hardware
- Open source platform guards against vendor lock-in
- Flexibility
- A single repository for storing, processing and analyzing any type of data
- Not bound by a single schema
5What Makes Hadoop Different?
- Ability to scale out to petabytes in size using commodity hardware
- Processing (MapReduce) jobs are sent to the data, versus shipping the data to be processed
- Hadoop doesn't impose a single data format, so it can easily handle structured, semi-structured and unstructured data
- Manages fault tolerance and data replication automatically
6Why the Need for Hadoop?
1.8 trillion gigabytes of data was created in 2011
- More than 90% is unstructured data
- Approx. 500 quadrillion files
- Quantity doubles every 2 years
[Chart: GIGABYTES OF DATA CREATED (IN BILLIONS), 2005 to 2015, y-axis 0 to 10,000]
Source: IDC 2011
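The "doubles every 2 years" figure can be turned into a rough projection. The sketch below simply extrapolates from the slide's 2011 baseline of 1.8 trillion gigabytes; the function name and the assumption of smooth exponential growth are illustrative and are not IDC figures.

```python
# Baseline from the slide: 1.8 trillion GB created in 2011.
BASE_YEAR = 2011
BASE_VOLUME_TRILLION_GB = 1.8
DOUBLING_PERIOD_YEARS = 2


def projected_volume(year: int) -> float:
    """Return a projected data volume in trillion GB for the given year.

    Pure extrapolation of "quantity doubles every 2 years"; an assumption,
    not a published figure.
    """
    elapsed = year - BASE_YEAR
    return BASE_VOLUME_TRILLION_GB * 2 ** (elapsed / DOUBLING_PERIOD_YEARS)


for year in (2011, 2013, 2015):
    print(year, round(projected_volume(year), 1))
```

Under this assumption the volume quadruples between 2011 and 2015.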
7Hadoop Use Cases
Industry use cases by application type:
ADVANCED ANALYTICS                 DATA PROCESSING
Social Network Analysis            Clickstream Sessionization
Content Optimization               Clickstream Sessionization
Network Analytics                  Mediation
Loyalty and Promotions Analysis    Data Factory
Fraud Analysis                     Trade Reconciliation
Entity Analysis                    SIGINT
Sequencing Analysis                Genome Mapping
8Hadoop in the Enterprise
[Diagram labels]
- Users: ANALYSTS, BUSINESS USERS, OPERATORS, ENGINEERS, CUSTOMERS
- Tools: IDEs, BI / Analytics, Enterprise Reporting, Management Tools, Web Application
- Systems: Enterprise Data Warehouse
- Data sources: Logs, Files, Web Data, Relational Databases
9What is CDH?
- Cloudera's Distribution Including Apache Hadoop (CDH) is an enterprise-ready distribution of Hadoop that is
- 100% Apache open source
- Contains all components needed for deployment
- Fully documented and supported
- Released on a reliable schedule
- Stable and Reliable
- Extensive Cloudera QA systems, software processes
- Tested and run in production at scale
- Proven at scale in dozens of enterprise environments
- Community Driven
- Incorporates only main-line components from the Apache Hadoop ecosystem; no forks or proprietary underpinnings
- FREE
- Fastest Path to Success
- No need to write your own scripts or do integration testing on different components
- Works with a wide range of operating systems, hardware, databases and data warehouses
10Cloudera's Commitment to the Open Source Community
Component    Cloudera Committers    Cloudera Founder    2011 Commits
Common       6                      Yes                 1
HDFS         6                      Yes                 2
MapReduce    5                      Yes                 1
HBase        2                      No                  2
Zookeeper    1                      Yes                 2
Oozie        1                      Yes                 1
Pig          0                      No                  3
Hive         1                      No                  2
Sqoop        2                      Yes                 1
Flume        3                      Yes                 1
Hue          3                      Yes                 1
Snappy       2                      No                  1
Bigtop       8                      Yes                 1
Avro         4                      Yes                 1
Whirr        2                      Yes                 1
11Components of CDH
Cloudera Enterprise
- User Interface: HUE
- Workflow: APACHE OOZIE
- Scheduling: APACHE OOZIE
- File System Mount: FUSE-DFS
- Data Integration: APACHE FLUME, APACHE SQOOP
- Fast Read/Write Access: APACHE HBASE
- Languages / Compilers: APACHE PIG, APACHE HIVE
- Coordination: APACHE ZOOKEEPER
12Hadoop Distributed File System
Block Size: 64MB. Replication Factor: 3
[Diagram: file blocks 1 to 5 replicated across DataNodes]
Cost is 400-500/TB
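The block size and replication factor determine how many blocks a file occupies and how much raw disk it consumes. The following is a minimal sketch of that arithmetic using the slide's settings; the function name and the 300 MB example file are assumptions chosen for illustration.

```python
import math

BLOCK_SIZE_MB = 64        # block size from the slide
REPLICATION_FACTOR = 3    # replication factor from the slide


def hdfs_footprint(file_size_mb: float) -> dict:
    """Estimate how a single file is laid out under the settings above."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION_FACTOR,
        # A partially filled last block only uses the space it needs,
        # so raw usage is the file size times the replication factor.
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


print(hdfs_footprint(300))
```

A hypothetical 300 MB file splits into 5 blocks, which are stored as 15 block replicas occupying about 900 MB of raw disk.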
13Components of Hadoop
- NameNode: Holds all metadata for HDFS
- Needs to be a highly reliable machine
- RAID drives, typically RAID 10
- Dual power supplies
- Dual network cards, bonded
- The more memory the better; typically 36GB to 64GB
- Secondary NameNode: Provides checkpointing for the NameNode. The same hardware as the NameNode should be used
14Components of Hadoop
- DataNodes: Hardware will depend on the specific needs of the cluster
- No RAID needed; JBOD (just a bunch of disks) is used
- Typical ratio is
- 1 hard drive
- 2 cores
- 4GB of RAM
15Networking
- One of the most important things to consider when setting up a Hadoop cluster
- Typically a top-of-rack switch is used with Hadoop, together with a core switch
- Be careful not to oversubscribe the backplane of the switch!
16Map
- Records from the data source (lines out of files, rows of a database, etc.) are fed into the map function as key-value pairs, e.g., (filename, line).
- map() produces one or more intermediate values along with an output key from the input.
[Diagram: (key 1, values), (key 2, values), (key 3, values) -> Map Task -> Shuffle Phase: (key 1, int. values) x3 -> Reduce Task -> Final (key, values)]
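As an illustration of the map step, here is a minimal word-count sketch in plain Python. It does not use the Hadoop API; the function name `map_words` and the sample record are assumptions made for this example.

```python
from typing import Iterator, Tuple


def map_words(filename: str, line: str) -> Iterator[Tuple[str, int]]:
    """Map step: take one (filename, line) record and emit (word, 1) pairs.

    The word is the output key; 1 is the intermediate value.
    """
    for word in line.lower().split():
        yield (word, 1)


pairs = list(map_words("notes.txt", "the quick the lazy"))
print(pairs)
```

Each input record can yield several intermediate pairs, and the same output key ("the") may appear more than once.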
17Reduce
- After the map phase is over, all the intermediate values for a given output key are combined together into a list
- reduce() combines those intermediate values into one or more final values for that same output key
[Diagram: (key 1, values), (key 2, values), (key 3, values) -> Map Task -> Shuffle Phase: (key 1, int. values) x3 -> Reduce Task -> Final (key, values)]
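The grouping and reduce steps can be sketched in plain Python as well. This is not the Hadoop API; `shuffle`, `reduce_counts` and the sample intermediate pairs are hypothetical names and data used only to show the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values into one list per output key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all intermediate values for one key into a final value."""
    return (key, sum(values))


# Made-up intermediate (key, value) pairs, as a map phase might emit them.
intermediate = [("the", 1), ("quick", 1), ("the", 1), ("lazy", 1)]
final = [reduce_counts(k, v) for k, v in sorted(shuffle(intermediate).items())]
print(final)
```

The reduce function only ever sees one key together with the complete list of its intermediate values.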
18MapReduce Execution
19Sqoop
- SQL to Hadoop
- Tool to import/export any JDBC-supported database into Hadoop
- Transfer data between Hadoop and external databases or EDW
- High performance connectors for some RDBMS
- Developed at Cloudera
20Flume
- Distributed, reliable, available service for efficiently moving large amounts of data as it is produced
- Suited for gathering logs from multiple systems
- Inserting them into HDFS as they are generated
- Design goals
- Reliability, Scalability, Manageability, Extensibility
- Developed at Cloudera
21Flume high-level architecture
[Diagram: four Agents, two Processors, Collector(s) and a MASTER]
- Master: sends configuration to all Agents
- Configurable levels of reliability: guarantee delivery in event of failure
- Deployable, centrally administered
- Processors: optionally pre-process incoming data (perform transformations, suppressions, metadata enrichment)
- Collector(s): write to multiple HDFS file formats (text, sequence, JSON, Avro, others); parallelized writes across many collectors, as much write throughput as ...
- Decorators (batch, compress, encrypt): flexibly deploy decorators at any step to improve performance, reliability or security
22HBase
- Column-family store. Based on design of Google BigTable
- Provides interactive access to information
- Holds extremely large datasets (multi-TB)
- Constrained access model
- (key, value) lookup
- Limited transactions (only one row)
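To make the column-family and (key, value) lookup model concrete, here is a toy in-memory sketch in Python. It is not the HBase client API: the `put` and `get` helpers, the table name and the sample cells are hypothetical, and real HBase features such as cell versions and sorted row keys are left out.

```python
from typing import Dict, Optional

# row key -> column family -> column qualifier -> value
Table = Dict[str, Dict[str, Dict[str, bytes]]]


def put(table: Table, row: str, family: str, qualifier: str, value: bytes) -> None:
    """Write one cell; each call touches exactly one row."""
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value


def get(table: Table, row: str, family: str, qualifier: str) -> Optional[bytes]:
    """Look up one cell by its key; returns None when the cell does not exist."""
    return table.get(row, {}).get(family, {}).get(qualifier)


users: Table = {}
put(users, "user42", "info", "name", b"Ada")
put(users, "user42", "stats", "logins", b"7")
print(get(users, "user42", "info", "name"))
```

Access always starts from a row key, which is what the constrained access model on the slide refers to; there is no ad hoc query across rows in this sketch.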
23HBase
24Hive
- SQL-based data warehousing application
- Language is SQL-like
- Supports SELECT, JOIN, GROUP BY, etc.
- Features for analyzing very large data sets
- Partition columns, Sampling, Buckets
- Example
- SELECT s.word, s.freq, k.freq FROM shakespeare s
- JOIN ... k ON (s.word = k.word) WHERE s.freq > 5
25Pig
- Data-flow oriented language, "Pig Latin"
- Datatypes include sets, associative arrays, tuples
- High-level language for routing data, allows easy integration of Java for complex tasks
- Example
- emps = LOAD 'people.txt' AS (id, name, salary);
- rich = FILTER emps BY salary > 100000;
- srtd = ORDER rich BY salary DESC;
- STORE srtd INTO 'rich_people.txt';
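For readers new to Pig, the same data flow (load, filter on salary above 100000, order by salary descending, store) can be sketched in plain Python. The rows below are made up for illustration and the result is printed rather than written to a file.

```python
from typing import List, Tuple

Employee = Tuple[int, str, int]  # (id, name, salary)

# Stand-in for LOAD 'people.txt'; these rows are invented for the example.
emps: List[Employee] = [
    (1, "alice", 150000),
    (2, "bob", 90000),
    (3, "carol", 120000),
]

# FILTER emps BY salary > 100000
rich = [e for e in emps if e[2] > 100000]

# ORDER rich BY salary DESC
srtd = sorted(rich, key=lambda e: e[2], reverse=True)

# STORE srtd INTO ...; here the rows are just printed
for row in srtd:
    print(row)
```

Each statement consumes the relation produced by the previous one, which is what "data-flow oriented" means in practice.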
26Oozie
Oozie is a workflow/coordination service to manage data processing jobs for Hadoop
27Zookeeper
- Zookeeper is a distributed consensus engine
- Provides well-defined concurrent access semantics
- Leader election
- Service discovery
- Distributed locking / mutual exclusion
- Message board / mailboxes
28Pipes and Streaming
- Multi-language connector libraries for MapReduce
- Write native-code MapReduce in C++
- Write MapReduce passes in any scripting language, including
- Perl
- Python
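As a sketch of what a Streaming pass in Python might look like: Hadoop Streaming feeds records to a script as lines of text and, by default, treats everything up to the first tab in each output line as the key. The `mapper` and `reducer` functions below follow that convention but run locally on a made-up input; in a real job each would read `sys.stdin` and print to standard output, and the framework would do the sorting between the two steps.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one tab-separated 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation: sorted() stands in for the framework's shuffle and sort.
local_input = ["the quick the", "lazy the"]
mapped = sorted(mapper(local_input))
reduced = list(reducer(mapped))
print(reduced)
```

Because the reducer relies on sorted input, it only needs to keep one key's running total in memory at a time.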
29FUSE - DFS
- Allows mounting of HDFS volumes via Linux FUSE file system
- Does allow easy integration with other systems for data import/export
- Does not imply HDFS can be used as a general-purpose file system
30Hadoop Security
- Authentication is secured by Kerberos v5 and integrated with LDAP
- Hadoop server can ensure that users and groups are who they say they are
- Job Control includes Access Control Lists, which means Jobs can specify who can view logs, counters, configurations and who can modify a job
- Tasks now run as the user who launched the job
31Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop
enterprise-easy
CLOUDERA ENTERPRISE COMPONENTS
- Simplify and Accelerate Hadoop Deployment
- Reduce Adoption Costs and Risks
- Lower the Cost of Administration
- Increase the Transparency and Control of Hadoop
- Leverage the Experience of Our Experts
EFFECTIVENESS: Ensuring You Get Value From Your Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop in Production
32Cloudera Manager
The industry's first end-to-end management application for Apache Hadoop
Proactively manages the Apache Hadoop stack
Automates the full operational lifecycle of Apache Hadoop
33Cloudera Enterprise
34Cloudera Enterprise
Including Cloudera Support
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community
35Cloudera University
Public and Private Training to Enable Your Success
Class: Description
Developer Training and Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training and Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as "when is Hadoop appropriate?", "what are people using Hadoop for?" and "what do I need to know about choosing Hadoop?"
36Cloudera Consulting Services
Put Our Expertise To Work For You.
Cloudera's team of Solutions Architects provides guidance and hands-on expertise to address unique enterprise challenges.
Service: Description
Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot: Deploy your first production-level project using Hadoop
Process and Team Development: Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
37Journey of the Cloudera Customer
- Discover the Benefits of Apache Hadoop: Flexibility to store and mine all types of data
- Cloudera's Distribution: The fastest, surest path to success with Apache Hadoop
- Subscribe to Cloudera Enterprise: Simplify and accelerate Apache Hadoop deployment
38Cloudera in Production
Cloudera Services
- Consulting Services
- Cloudera University
Cloudera Enterprise
- Cloudera Management Suite
- Cloudera Support
Cloudera's Distribution Including Apache Hadoop (CDH), SCM Express
[Diagram labels]
- Users: ANALYSTS, BUSINESS USERS, CUSTOMERS, OPERATORS, ENGINEERS
- Tools: IDEs, BI / Analytics, Enterprise Reporting, Management Tools, Web Application
- Systems: Enterprise Data Warehouse, Operational Rules Engines
- Data sources: Logs, Files, Web Data, Relational Databases
39Cloudera helps you profit from all your data.
Get Hadoop
twitter.com/cloudera
facebook.com/cloudera
40Cloudera Manager
The first and only Hadoop management application that
1. Manages the full Hadoop lifecycle
2. Manages and monitors the complete Hadoop stack
3. Incorporates comprehensive log and event management
4. Has Technical Support integration built-in
41Cloudera Manager
Key Features and Functionality
ONLY CLOUDERA
Automated Deployment: Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
Centralized Management: Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface.
Service Configuration Management: Set server roles, configure services and manage security across the cluster. Gracefully start, stop and restart services as needed.
Audit Trails: Maintains a complete record of configuration changes for SOX compliance.
Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds.
Intelligent Log Management: Gather, view and search Hadoop logs collected from across the cluster. Scans Hadoop logs for irregularities and warns you before they impact the cluster.
42Cloudera Manager
Key Features and Functionality
ONLY CLOUDERA
Global Time Control: Establishes the time context globally for almost all views. Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis.
Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution.
Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching.
Alerting: Generates email alerts when certain events occur.
Operational Reports: Visualize current and historical disk usage by user, group and directory. Track MapReduce activity on the cluster by job or user.
Host Level Monitoring: View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles.
43Two Editions
Editions: FREE EDITION and ENTERPRISE EDITION
Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
- Automated Deployment
- Host-Level Monitoring
- Secure Communication Between Server and Agents
Configuration Management
- Manage HDFS, MapReduce, HBase, Hue, Oozie and Zookeeper
- Audit Trails
- Start/Stop/Restart Services
- Add/Restart/Decommission Role Instances
- Configuration Versioning and History
- Support for Kerberos
Service Monitoring
- Proactive Health Checks
- Status and Health Summary
- Intelligent Log Management
- Events Management and Alerts
- Activity Monitoring
- Operational Reporting
- Global Time Control
- Support Integration
Part of the Cloudera Enterprise subscription
44View Service Health and Performance
45Get Host-Level Snapshots
46Monitor and Diagnose Cluster Workloads
47Gather, View and Search Hadoop Logs
48Track Events From Across the Cluster
49Run Reports on System Performance and Usage
50New in Cloudera Manager 3.7
ONLY CLOUDERA
1. Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
2. Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
3. Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
4. Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
5. Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
6. Alerts: Generates email alerts when certain events occur
7. Audit Trails: Maintains a complete record of configuration changes for SOX compliance
8. Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
51Cloudera Support
Our team of experts on call to help you meet your SLAs
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
Proactive Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community
52Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop in Production.
Why Cloudera Enterprise?
- Apache Hadoop is a distributed system that presents unique operational challenges
- The fixed cost of managing an internal patch and release infrastructure is prohibitive
- Apache Hadoop skills and expertise are scarce
- It's challenging to track consistently to community development efforts
Only Cloudera Enterprise
- Has a management application that supports the full lifecycle of operationalizing Apache Hadoop
- Has production support backed by the Apache committers
- Has the depth of experience supporting hundreds of production Apache Hadoop clusters
53Hadoop Distributed File System
Block Size: 64MB. Replication Factor: 3
Cost is 400-500/TB
54MapReduce Distributed Processing
55Thank you.