Title: Hadoop Jobs and Tasks
Brief Overview
[Figure: how Hadoop runs a MapReduce job. 1: the MapReduce program calls runJob on the
JobClient (client JVM, client node). 2: the JobClient gets a new job ID from the
JobTracker. 3: it copies the job resources to the shared file system (HDFS). 4: it
submits the job to the JobTracker (JobTracker node). 5: the JobTracker initializes the
job. 6: it retrieves the input splits. 7: a TaskTracker sends a heartbeat and the
response returns a task. 8: the TaskTracker retrieves the job resources from HDFS.
9: it launches a child JVM. 10: the child runs the map task or reduce task
(TaskTracker node).]
Submit Job
- Asks the JobTracker for a new job ID
- Checks the output specification of the job: if the output directory already
  exists, an error is thrown and the job is not submitted
- Computes the input splits for the job: if the splits cannot be computed
  (e.g. the input does not exist), an error is thrown and the job is not
  submitted
- Copies the resources needed to run the job to the JobTracker's file system,
  in a directory named after the job ID
  - The job JAR file, copied with a high replication factor (10 by default,
    controlled by the mapred.submit.replication property)
  - The configuration file
  - The computed input splits
- Tells the JobTracker that the job is ready for execution (see the driver
  sketch below)
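A minimal driver sketch (old org.apache.hadoop.mapred API) illustrating the submission
path above; the class name, paths, and the use of the identity mapper/reducer are
placeholders, not part of the original slides.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class MyJobDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MyJobDriver.class);           // configuration copied at submission
        conf.setJobName("my-job");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1])); // must not already exist
        conf.setMapperClass(IdentityMapper.class);               // placeholder mapper
        conf.setReducerClass(IdentityReducer.class);             // placeholder reducer
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        // runJob performs the steps above: gets a job ID, checks the output spec,
        // computes splits, copies resources to HDFS, submits, then polls progress.
        JobClient.runJob(conf);
      }
    }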
Job Initialization
- Puts the job in an internal queue
- The job scheduler picks it up and initializes it
  - Creates an object to represent the job being run
  - Encapsulates its tasks
  - Keeps bookkeeping information to track task status and progress
- Creates the list of tasks to run
  - Retrieves the input splits computed by the JobClient from the shared file
    system
  - Creates one map task per split
  - The scheduler creates the reduce tasks; the number of reduce tasks is
    determined by the mapred.reduce.tasks property (see the snippet below)
  - Each task is given a task ID
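As a small illustration (an assumption layered on the driver sketch above, not from
the slides), the reduce task count is a per-job setting:

    // mapred.reduce.tasks controls how many reduce tasks the scheduler creates;
    // conf is the JobConf from the driver sketch above.
    conf.setNumReduceTasks(4);   // equivalent to conf.setInt("mapred.reduce.tasks", 4)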
Task Assignment
- TaskTrackers send heartbeats to the JobTracker
- A TaskTracker indicates its readiness to run a new task
- The JobTracker allocates a task
- The JobTracker communicates the task in the response to a heartbeat
- Choosing a task for a TaskTracker
  - The JobTracker must choose a task to give to the TaskTracker
  - It first uses the job scheduler to choose a job to take the task from
  - Job scheduling algorithms: the default one is based on a priority-ordered
    list of jobs
Task Assignment .. continued
- TaskTrackers have a fixed number of slots for map tasks and for reduce tasks
  - e.g. a TaskTracker may be able to run 2 map and 2 reduce tasks
    simultaneously (the default is fixed and is not derived automatically from
    the number of cores or the amount of memory on the TaskTracker)
- The scheduler fills the map task slots before filling the reduce task slots
- For a map task, the JobTracker takes the TaskTracker's network location into
  account and picks a task whose input split is as close as possible to the
  TaskTracker
  - The ideal case is a TaskTracker node on which the split resides, called
    data-local
  - Rack-local: the split is on the same rack, but not on the same node
  - Some tasks are neither data-local nor rack-local and retrieve their data
    from a different rack
  - Counters track how many tasks are data-local, rack-local, or non-local
- For a reduce task, the JobTracker simply picks the next in its list of
  yet-to-be-run reduce tasks, since there are no data locality considerations
Task Execution
- The TaskTracker has been assigned a task
- The next step is to run the task
- It localizes the job by copying the JAR file from the shared file system,
  along with any other files required
- Creates a local working directory for the task and un-jars the contents of
  the JAR into this directory
- Creates an instance of TaskRunner to run the task
- The TaskRunner launches a new JVM to run each task
  - This prevents the TaskTracker from failing because of bugs in the user's
    MapReduce tasks
  - Only the child JVM exits in case of a problem
Progress and Status Updates
- MapReduce jobs are long-running jobs
- The user needs feedback from time to time on the progress of the job
- Jobs and tasks each have a status
  - Running, successfully completed, failed
  - Progress of maps and reduces
  - Values of job counters
  - Status messages and descriptions
- Progress is estimated based on the phase the task is currently running
Progress Reporting
- Not 100% accurate
- Nevertheless important for seeing whether the job is making progress
- The following operations constitute progress
  - Reading an input record
  - Writing an output record
  - Setting the status description on a Reporter
  - Incrementing a counter
  - Calling the Reporter's progress method
- Tasks can also set counters (see the mapper sketch below)
  - Framework built-in ones
  - User-defined ones
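A sketch of an old-API mapper performing the progress-generating operations listed
above; the class name, the status text, and the "Stats"/"RECORDS_SEEN" dynamic counter
are illustrative assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ProgressAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        // Reading the input record and writing an output record already count as progress.
        output.collect(value, new LongWritable(1));
        // Explicit calls keep a long-running record from looking like a hung task.
        reporter.setStatus("processing offset " + key.get());   // status description
        reporter.incrCounter("Stats", "RECORDS_SEEN", 1);       // dynamic (user-defined) counter
        reporter.progress();                                    // sets the progress flag
      }
    }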
Progress Reporting .. continued
- Framework support
  - If the progress flag is set, it indicates that status should be sent to
    the TaskTracker
  - The flag is checked in a separate thread every 3 seconds, and the
    TaskTracker is notified of the status
  - The TaskTracker sends the status via heartbeats to the JobTracker every
    5 seconds
  - The status of all the tasks run by the TaskTracker is sent
  - Counter values are sent less frequently to avoid congestion
- The JobTracker combines these status reports
  - Gives a global view of all the jobs and their constituent tasks and
    statuses
- The JobClient receives the status by polling the JobTracker every second
- The client can also call getJobStatus to get the status information
Job Completion
- The JobTracker receives notification that the last task of the job is
  complete
- It changes the job status to successful
- The JobClient, which is polling for the status,
  - Prints a message to the user and
  - Returns from the runJob method
- The JobTracker can also send an HTTP job notification
  - Can be configured by clients wishing to be notified via callbacks
  - Clients set job.end.notification.url (see the snippet below)
- The JobTracker cleans up its working state for the job
  - It also instructs the TaskTrackers to do the same
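A hedged one-liner for the notification property mentioned above; the URL is a
placeholder, and the $jobId/$jobStatus tokens are substituted by the framework.

    // Using the JobConf (conf) from the driver sketch earlier:
    conf.set("job.end.notification.url",
             "http://example.com/jobdone?id=$jobId&status=$jobStatus");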
Task Failure
- Causes
  - User code is buggy
  - Processes crash
  - Machines fail
- Hadoop handles these failures quite smoothly
Task Failure .. continued
- The child JVM reports the error back to the TaskTracker before exiting
  - The error is logged into the user's logs
  - The TaskTracker marks the task as failed
  - Frees up the slot for another task
- Hanging tasks
  - The TaskTracker notices that it has not received any progress update
  - Proceeds to mark the task as failed
  - The child JVM process is killed after the timeout period, which is
    normally 10 minutes
  - The timeout can be configured on a per-job basis (see the snippet below)
  - Setting a timeout of zero never frees up the hanging slot; avoid this
  - Long-running tasks should at least send progress updates by setting the
    progress flag
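The per-job timeout override mentioned above could look like this (value in
milliseconds; 600000 is the 10-minute default):

    // Again on the driver's JobConf (conf); 0 would disable the timeout entirely.
    conf.setLong("mapred.task.timeout", 600000L);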
Task Failure .. continued
- A failed task
  - Is reported to the JobTracker
  - The JobTracker reschedules execution of the task
  - It avoids scheduling the task on a TaskTracker where it has failed before
  - It will try 4 times before giving up, controlled by
    - mapred.map.max.attempts for map tasks
    - mapred.reduce.max.attempts for reduce tasks
  - If any task fails more times than the maximum number of attempts, the
    whole job is marked as failed
  - This tolerance can be changed (see the snippet below) by setting
    - mapred.max.map.failures.percent
    - mapred.max.reduce.failures.percent
- A task can also be killed, for example a speculative duplicate
  - Killed tasks do not count towards the number of failed attempts
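A sketch of the retry and failure-tolerance knobs named above, set on the driver's
JobConf; the values are examples, not recommendations.

    conf.setInt("mapred.map.max.attempts", 4);        // per-map-task retry limit
    conf.setInt("mapred.reduce.max.attempts", 4);     // per-reduce-task retry limit
    // Let the job succeed even if a small percentage of tasks fail permanently.
    conf.setInt("mapred.max.map.failures.percent", 5);
    conf.setInt("mapred.max.reduce.failures.percent", 5);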
TaskTracker Failure
- Symptoms
  - Fails to send heartbeats
  - Might have crashed, or
  - Might be running very slowly
- The JobTracker marks it as failed and removes it from the pool of
  TaskTrackers to schedule tasks on
  - After heartbeats have been missed for 10 minutes, or
  - the interval set by mapred.tasktracker.expiry.interval
- The JobTracker arranges for map tasks that completed successfully on the
  failed TaskTracker to be rerun on a different TaskTracker if they belong to
  incomplete jobs
  - Any tasks in progress are also rescheduled
- The JobTracker can also blacklist a TaskTracker
  - If the number of tasks that failed on it is significantly higher than the
    average task failure rate on the cluster
  - Blacklisted TaskTrackers can be restarted to remove them from the
    JobTracker's blacklist
Job Scheduling
- Simple approach
  - Jobs are run in the order of submission using the FIFO scheduler
- Fair Scheduler
- Capacity Scheduler
Shuffle and Sort
- The MapReduce framework guarantees that the input to every reducer is sorted
  by key
- The process by which the system performs the sort is the sort phase
- Transferring the map outputs to the reducers as their input is the shuffle
  phase
- The shuffle code base keeps changing, and continuous improvements are made
- The shuffle is the heart of MapReduce
Shuffle and Sort .. continued
- Map side
  - Map outputs are written to a circular memory buffer
  - The map blocks writing if the buffer fills up
  - A background thread starts spilling to disk once the buffer reaches a
    threshold (80% by default); map outputs continue to be written to the
    buffer in the meantime
  - Before writing to disk, the thread partitions the data according to the
    reducer it has to go to
  - Within each partition, an in-memory sort by key is performed, and
  - A combiner function, if defined, is run on the output of the sort
  - Several spill files are created
  - The spills are merged into a single partitioned and sorted output file
  - The combiner may be run again before the output file is written
  - Data written to disk can be compressed (see the snippet below)
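An illustrative sketch of the map-side buffer, spill threshold, and compression
settings described above (the values are arbitrary examples, set on the driver's
JobConf):

    conf.setInt("io.sort.mb", 100);                 // size of the circular in-memory buffer, in MB
    conf.setFloat("io.sort.spill.percent", 0.80f);  // spill thread starts at 80% full
    conf.setCompressMapOutput(true);                // compress map output written to disk and shuffled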
Shuffle and Sort .. continued
- Reduce side
  - Needs the map output from several mappers
  - The copy phase copies the map outputs over to the reduce task
  - A small number of copier threads fetch map outputs in parallel
Shuffle and Sort .. continued
- How do the reducers know where to get the map outputs from?
  - A task notifies its TaskTracker when its map completes
  - The TaskTracker sends the update to the JobTracker
  - The JobTracker therefore knows, for a given job, which TaskTrackers hold
    which map outputs
  - A reducer asks the JobTracker for this information periodically until it
    has retrieved all the map output locations
- TaskTrackers do not delete map outputs from disk until the job has completed
  - The reduce task may fail and need them again
  - They wait until told to delete them by the JobTracker
Task Execution
- Speculative Execution
- Task JVM Reuse
- Skipping Bad Records
- Task Execution Environment
- Counters
- Sorting
- Secondary Sort
- Joins
- Side Data Distribution
Speculative Execution
- Tasks are run in parallel
- A single slow task can make the whole job take significantly longer
- Out of a few thousand tasks, some could be straggling
- Hadoop tries to detect slow-running tasks
- Hadoop creates a backup task when a task is running slower than expected
- After one copy of the task completes successfully, any duplicate copies are
  killed
- It is an optimization technique: if the task is inherently slow by design,
  speculation will not help
- It can be turned on or off (see the snippet below)
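Speculative execution can be toggled per job on the driver's JobConf, for example:

    conf.setMapSpeculativeExecution(true);      // allow backup map tasks
    conf.setReduceSpeculativeExecution(false);  // but not backup reduce tasks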
Task JVM Reuse
- Hadoop normally runs each task in its own JVM
- When JVM reuse is enabled
  - Tasks sharing a child JVM are run sequentially
  - The TaskTracker still runs tasks in parallel across child JVMs
  - Tasks from different jobs are always run in different child JVMs
- Controlled by mapred.job.reuse.jvm.num.tasks
  - -1 indicates no limit (see the snippet below)
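The JobConf shortcut for the property above:

    conf.setNumTasksToExecutePerJvm(-1);   // -1 = no limit: reuse the child JVM for all of a job's tasks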
Skipping Bad Records
- Large data sets can contain corrupt records
- They often have missing fields
- In practice, the code should tolerate and ignore such records
- Bad records have to be handled in the Mapper or Reducer
  - Ignore the records
- TextInputFormat has a feature to cap the length of a record
  - Corrupted records often show up as abnormally long lines (see the snippet
    below)
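A hedged sketch of two defences against bad records: capping the line length that
TextInputFormat will accept, and Hadoop's skipping mode (SkipBadRecords); the
threshold values are assumptions.

    // On the driver's JobConf (conf); requires org.apache.hadoop.mapred.SkipBadRecords.
    conf.setInt("mapred.linerecordreader.maxlength", 1024 * 1024);  // drop absurdly long "records"
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);                // let skip mode isolate single bad records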
Task Execution Environment
- Hadoop provides environment information to tasks
  - Several properties can be accessed from the job configuration
- Task side-effect files
  - Multiple instances of the same task should not write to the same file
  - If a task fails and is retried, the old output file would still be present
  - With speculative execution, two instances of the same task could write to
    the same file
- Solution
  - Hadoop writes the file to a temporary directory specific to the task
    attempt: ${mapred.output.dir}/_temporary/${mapred.task.id}/
  - On successful completion, the file is promoted to mapred.output.dir
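Rather than hard-coding that path, a task can ask for its attempt-specific work
directory; a sketch assuming the old API (requires org.apache.hadoop.fs.Path and
org.apache.hadoop.mapred.FileOutputFormat):

    // Inside a task, e.g. in configure(JobConf conf); output here is promoted on success.
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);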
Counters
- Counters are used to gather statistics about the job
  - Quality control (good vs. bad records)
  - Application-level statistics
  - Problem diagnosis
- Counters are easier to retrieve than log output
- Built-in counters
  - Input records and bytes
  - Output records and bytes, etc.
User Defined Counters
- Counters are grouped by enum name
  - The enum fields are the counter names
- Dynamic counters can also be created by name at runtime
- Readable counter names
  - Using a resource bundle
  - e.g. "Air Temperature Records" instead of Temperature.MISSING (see the
    sketch below)
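A minimal sketch of an enum-based counter following the Temperature.MISSING example
above; the mapper's logic and output are placeholders.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class TemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      enum Temperature { MISSING, MALFORMED }   // counter group = enum class, names = fields

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        String line = value.toString();
        if (line.trim().isEmpty()) {
          reporter.incrCounter(Temperature.MISSING, 1);   // user-defined enum counter
          return;
        }
        output.collect(new Text("records"), new IntWritable(1));  // placeholder output
      }
    }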
Retrieving Counters
- Counters can be retrieved once the job has completed, for example as in the
  sketch below
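The original code sample is not reproduced on this slide; a typical retrieval via the
old API could look like the following (the job ID argument and the
"Stats"/"RECORDS_SEEN" counter are assumptions tied to the earlier mapper sketch).

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterReader {
      public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(args[0]));   // a job_... ID string
        if (job == null || !job.isComplete()) {
          System.err.println("Job not found or not complete yet");
          return;
        }
        Counters counters = job.getCounters();
        long seen = counters.findCounter("Stats", "RECORDS_SEEN").getValue();
        System.out.println("Records seen: " + seen);
      }
    }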
Sorting
- By default, keys are sorted before being sent to the reduce task
- The sort order for keys is controlled by
  - the mapred.output.key.comparator.class property
  - Keys must be a subclass of WritableComparable
- Partitioned MapFile lookup
  - If MapFileOutputFormat is used, lookups by key can be done against the
    output
Secondary Sort
- MapReduce sorts records by key
- Values are not sorted
- Use the following strategy to get the values sorted (see the sketch below)
  - Use a composite key that includes the value portion
  - A KeyComparator orders by the composite key
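A condensed sketch of how the pieces are usually wired in the driver. The comparator
and partitioner class names are assumptions (they would order by the composite key and
partition/group by the natural key), and the grouping comparator is part of the
standard recipe even though the slide does not name it.

    // On the driver's JobConf (conf), assuming the three classes exist for the composite key:
    conf.setPartitionerClass(FirstPartitioner.class);             // partition on the natural key only
    conf.setOutputKeyComparatorClass(KeyComparator.class);        // sort on the full composite key
    conf.setOutputValueGroupingComparator(GroupComparator.class); // group reducer input by the natural key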
Joins
- MapReduce can perform joins of large data sets
- Frameworks such as Pig, Hive, or Cascading can also be used to achieve a
  join
- Map-side joins
  - Use CompositeInputFormat
  - Allows the join to be performed before the data is passed to the map
- Reduce-side joins
  - The key is used as the join mechanism
  - Multiple inputs
  - Use different mappers; the map outputs must have the same type
Side Data Distribution
- Extra read-only data needed by a MapReduce job
- The challenge is to make this data available to the map and reduce tasks
- Cache the side data in a static field
- Use the job configuration
  - Override the configure method to read it back
  - To pass objects, use DefaultStringifier (Hadoop serialization), as in the
    sketch below
  - Do not use it to transfer more than about 1 KB
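A hedged sketch of pushing a small object through the configuration with
DefaultStringifier; the key name and the Text payload are assumptions.

    // In the driver, on its JobConf (conf):
    DefaultStringifier.store(conf, new Text("lookup-table-v1"), "my.sidedata.key");

    // In the Mapper/Reducer, typically inside configure(JobConf job):
    Text sideData = DefaultStringifier.load(job, "my.sidedata.key", Text.class);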
Side Data Distribution .. continued
- Distributed cache
  - Copies files and archives once per job to the task nodes
  - Makes them available to the map and reduce functions
  - -files and -archives options
  - Files can be local or in HDFS
  - e.g. hadoop [other args] -files input/ncdc/metadata/stations-fixed-width.txt
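The programmatic equivalent of the -files option is a sketch like the following
(requires org.apache.hadoop.filecache.DistributedCache; unlike -files, addCacheFile
expects the file to already be in HDFS, and the URI fragment names the symlink):

    // On the driver's JobConf (conf):
    DistributedCache.addCacheFile(
        new java.net.URI("input/ncdc/metadata/stations-fixed-width.txt#stations-fixed-width.txt"),
        conf);
    DistributedCache.createSymlink(conf);   // link the cached file into each task's working directory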
Side Data Distribution - How It Works
- When the job is launched, Hadoop copies the files specified by the -files
  option to the JobTracker's file system, and from there to a local disk on
  each task node: the cache
- From the task's point of view, the files are just there
- A reference count of the number of tasks using each file is maintained;
  when it drops to zero, the file becomes eligible for deletion
- Files are deleted when the cache size exceeds 10 GB, making way for other
  jobs
- Files are localized under ${mapred.local.dir}/taskTracker/archive on the
  TaskTrackers
- Applications can use the files as-is; the files are symbolically linked
  into the task's working directory