Exam Logistics • One A4-size sheet in your own handwriting for memory recall • Portions – everything post T1, including videos (lecture 10 onwards) • Answer questions in the question paper itself (space provided) • 60-mark, 2-hour paper, scaled down to 30 marks • Remaining 30 marks from your final eval • Signup sheet is shared – pick a slot over the next two weeks • Need to submit a report; format is in the share • Format of the final eval presentation is also shared.
Unit 5: Resource allocation – YARN and MESOS Prof. Dinkar Sitaram Prof. K V Subramaniam Prof Sanchika Gupta Prof Prafullata K A Prof Mamatha Shetty
What have we learnt so far? • Hadoop – MapReduce • Spark • Storm • HBase • Pig • And many more…
Problem
• Rapid innovation in cluster computing frameworks
• Handling faults
• No single framework is optimal for all applications
Want to run multiple frameworks in a single cluster • …to maximize utilization • …to share data between frameworks
Where We Want to Go
• Today: static partitioning – separate clusters for Hadoop, Pregel, MPI, …
• Goal: dynamic sharing – a single shared cluster running all frameworks
What will we cover today?
• YARN – Hadoop v2
• Mesos
• YARN vs Mesos
Hadoop 2.0 (YARN)
YARN (MapReduce 2) • For clusters larger than about 4,000 nodes, the MapReduce process described earlier begins to suffer a severe bottleneck. • So, in 2010, Yahoo! began to design the next generation of MapReduce. • The result was YARN, short for Yet Another Resource Negotiator (or YARN Application Resource Negotiator, when defined recursively).
How YARN fits
YARN vs Classic MapReduce • YARN splits the responsibilities of the JobTracker into separate entities. • The JobTracker handled: • Resource allocation • Scheduling • Task progress monitoring/task bookkeeping
• YARN separates these roles into two independent daemons: • Resource Manager – manages the use of resources across the cluster (resource allocation)
• Application Master – manages the lifecycle of applications running on the cluster (scheduling/task progress)
YARN Components
• Resource Manager – arbitrates resources amongst all applications in the system
• Node Manager – per-machine slave; responsible for launching application containers and monitoring their resource usage
• Application Master – negotiates appropriate resource containers from the scheduler; tracks and monitors the progress of the containers
• Container – unit of allocation incorporating resources such as memory, CPU, and disk
YARN Architecture Overview
Application submission in YARN
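A minimal sketch of the client side of this submission flow, using the YarnClient API (org.apache.hadoop.yarn.client.api), is given below. The application name, the launch command /bin/my-app-master, and the 1024 MB / 1 vcore container size are placeholders, and exact method signatures may vary slightly between Hadoop versions.

```java
// Minimal sketch of submitting an application to YARN via the YarnClient API.
// The command, memory/vcore sizes, and application name are placeholders.
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitToYarn {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml
    yarnClient.start();

    // Ask the Resource Manager for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
    ctx.setApplicationName("demo-app");

    // Describe how to launch the Application Master container.
    ContainerLaunchContext amContainer =
        ContainerLaunchContext.newInstance(null, null,
            Collections.singletonList("/bin/my-app-master"), null, null, null);
    ctx.setAMContainerSpec(amContainer);

    // Container = unit of allocation (memory in MB, virtual cores).
    ctx.setResource(Resource.newInstance(1024, 1));

    // Hand the request to the Resource Manager's scheduler.
    yarnClient.submitApplication(ctx);
  }
}
```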
Class Exercise (10 mins) • How does YARN decide which machine to run the job on? • Outline a simple strategy to pick the machine(s) to run a job on • Remember that a job consists of tasks (map/reduce)
Job Scheduling • Early versions of Hadoop used a very simplistic form of scheduling: FIFO. • Typically, each job would use the whole cluster, so jobs had to wait their turn. • Although a shared cluster offers great potential for providing large resources to many users, sharing resources fairly between users requires a better scheduler. • Production jobs need to complete in a timely manner, while users making smaller ad hoc queries still need to get results back in a reasonable time.
Job Scheduling • Later on, the ability to set a job's priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient. • There are five levels: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW. • When the job scheduler is choosing the next job to run, it selects the one with the highest priority. • However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.
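A minimal sketch of setting this priority with the classic (MRv1) API is shown below; it assumes the mapred.job.priority property and the JobConf/JobPriority classes referenced above are available in the Hadoop version in use.

```java
// Sketch: two equivalent ways of setting job priority in the classic MRv1 API.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class PrioritizedJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf();

    // Property-based: the FIFO scheduler picks the highest-priority job next.
    conf.set("mapred.job.priority", "HIGH");

    // Equivalent typed setter (VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW).
    conf.setJobPriority(JobPriority.HIGH);

    System.out.println("Priority set to " + conf.getJobPriority());
  }
}
```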
Job Scheduling • MapReduce in Hadoop comes with a choice of schedulers. • The default in MapReduce 1 is the original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair Scheduler and the Capacity Scheduler. • MapReduce 2 comes with the Capacity Scheduler (the default), and the FIFO scheduler.
The Fair Scheduler • Aims to give every user (or group of jobs) a fair share of the cluster capacity over time.
• If a single job is running, it gets all of the cluster. • As more jobs are submitted, • free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.
• A short job belonging to one user will complete in a reasonable time even while another user’s long job is running, and the long job will still make progress.
The Fair Scheduler Pools • Jobs are placed in pools, and by default, each user gets their own pool. • A user who submits more jobs than a second user will not get any more cluster resources than the second, on average. • It is also possible to define custom pools with guaranteed minimum capacities defined in terms of the number of map and reduce slots, and to set weightings for each pool. • The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, then the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity.
The Capacity Scheduler • The Capacity Scheduler takes a slightly different approach to multiuser scheduling. • A cluster is made up of a number of queues (like the Fair Scheduler’s pools), which may be hierarchical (so a queue may be the child of another queue), and each queue has an allocated capacity. • This is like the Fair Scheduler, except that within each queue, jobs are scheduled using FIFO scheduling (with priorities). • In effect, the Capacity Scheduler allows users or organizations to simulate a separate MapReduce cluster with FIFO scheduling for each user or organization. • The Fair Scheduler, by contrast, enforces fair sharing within each pool, so running jobs share the pool’s resources.
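To make "FIFO with priorities within a queue" concrete, here is a small standalone sketch (not the actual Capacity Scheduler code) that orders a queue's waiting jobs by priority and breaks ties by submission time; the job names and timestamps are made up.

```java
// Illustrative only: orders a queue's waiting jobs by priority, then FIFO.
import java.util.Comparator;
import java.util.PriorityQueue;

public class QueueOrdering {
  enum Priority { VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH }

  record QueuedJob(String name, Priority priority, long submitTimeMs) {}

  public static void main(String[] args) {
    // Highest priority first; among equal priorities, earliest submission first.
    Comparator<QueuedJob> order =
        Comparator.comparing(QueuedJob::priority).reversed()
                  .thenComparingLong(QueuedJob::submitTimeMs);

    PriorityQueue<QueuedJob> queue = new PriorityQueue<>(order);
    queue.add(new QueuedJob("nightly-etl", Priority.NORMAL, 1_000));
    queue.add(new QueuedJob("adhoc-query", Priority.HIGH, 2_000));
    queue.add(new QueuedJob("report", Priority.NORMAL, 1_500));

    while (!queue.isEmpty()) {
      System.out.println("next job: " + queue.poll().name());
    }
    // Prints: adhoc-query, nightly-etl, report
  }
}
```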
Class Exercise (5 mins)
• How does YARN keep track of the progress made by jobs?
• Fill in, for each component (Job, Task), its progress indicator
Progress and Status Updates • Long-running batch jobs • Status – Job and Task • Job counters • Status – running, successfully completed, failed • Progress of map and reduce tasks • Status message
• Task • Map tasks – proportion of input processed • Reduce tasks – proportion of reduce input processed
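As an illustration of these progress indicators, the sketch below computes per-task and overall job progress using the simplified definition above (fraction of input processed); the real MapReduce accounting additionally splits the reduce phase into copy, sort, and reduce portions.

```java
// Illustrative progress calculation, in the simplified terms above.
import java.util.List;

public class JobProgress {
  /** Progress of one task in [0, 1]: bytes processed / total bytes. */
  static double taskProgress(long processedBytes, long totalBytes) {
    return totalBytes == 0 ? 1.0 : (double) processedBytes / totalBytes;
  }

  /** Overall job progress = average over all map and reduce tasks. */
  static double jobProgress(List<Double> mapProgress, List<Double> reduceProgress) {
    double sum = 0;
    for (double p : mapProgress) sum += p;
    for (double p : reduceProgress) sum += p;
    int tasks = mapProgress.size() + reduceProgress.size();
    return tasks == 0 ? 0.0 : sum / tasks;
  }

  public static void main(String[] args) {
    List<Double> maps = List.of(1.0, 1.0, taskProgress(64_000_000, 128_000_000));
    List<Double> reduces = List.of(taskProgress(0, 50_000_000));
    System.out.printf("job progress: %.0f%%%n", 100 * jobProgress(maps, reduces));
  }
}
```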
Job Completion • Completed → last task for a job is complete • Application Master and task containers clean up working state
Handling failures • What can fail?
• Task
• Application Master
• Resource Manager
• Node Manager
Task failure - types
• Runtime exceptions
  • JVM reports the error back to the parent Application Master
  • The error is logged in the error logs
  • The task is marked as failed
• Hanging tasks
  • No progress updates received for 10 minutes
  • The timeout value can be configured
• Killed tasks
  • Speculative duplicates can be killed: when progress is slow a speculative task is created, and the slower of the two is killed
Task failure - recovery • What happens on failure? • The Application Master attempts to reschedule the task • On a different server • Up to the max allowed failures (default = 4)
• Job Failure • Occurs when the max allowed failures for a task is exceeded • A max percentage of failed tasks can be configured so that a single task failure is not allowed to fail the job
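A hedged sketch of the retry/fail decision just described: the Application Master retries a task up to a maximum number of attempts (default 4), and the job only fails when the fraction of permanently failed tasks exceeds an allowed percentage. The thresholds here are illustrative values, not the actual Hadoop configuration keys.

```java
// Illustrative decision logic for task retries and job failure.
public class FailurePolicy {
  static final int MAX_TASK_ATTEMPTS = 4;             // default per the slide
  static final double MAX_FAILED_TASK_PERCENT = 0.0;  // 0 = any failed task fails the job

  /** Should the Application Master reschedule this task (on a different node)? */
  static boolean shouldRetry(int attemptsSoFar) {
    return attemptsSoFar < MAX_TASK_ATTEMPTS;
  }

  /** Should the whole job be marked failed? */
  static boolean shouldFailJob(int permanentlyFailedTasks, int totalTasks) {
    double failedPercent = 100.0 * permanentlyFailedTasks / totalTasks;
    return failedPercent > MAX_FAILED_TASK_PERCENT;
  }

  public static void main(String[] args) {
    System.out.println(shouldRetry(3));          // true  -> reschedule a 4th attempt
    System.out.println(shouldRetry(4));          // false -> task permanently failed
    System.out.println(shouldFailJob(1, 100));   // true with the default threshold
  }
}
```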
Application Master Failure When can failure occur?
• Due to hardware or network failures
How are failures detected?
• AM sends periodic heartbeats to Resource Manager
Restart
• Max-attempts to restart application • Default = 2
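A small sketch of where this restart limit would be configured; the property name yarn.resourcemanager.am.max-attempts and the per-application setMaxAppAttempts() setter are assumptions here and may differ across Hadoop versions.

```java
// Sketch: cluster-wide and per-application limits on Application Master restarts.
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmRestartLimit {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // Cluster-wide ceiling on AM attempts (default 2 per the slide).
    conf.setInt("yarn.resourcemanager.am.max-attempts", 2);

    YarnClient client = YarnClient.createYarnClient();
    client.init(conf);
    client.start();

    ApplicationSubmissionContext ctx =
        client.createApplication().getApplicationSubmissionContext();
    // Per-application limit; cannot exceed the cluster-wide ceiling.
    ctx.setMaxAppAttempts(2);
  }
}
```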
Node Manager Failure When can failure occur?
• Hardware, crashing, slow network
How are failures detected?
• Node Manager sends regular heartbeats to the Resource Manager • When none are received for 10 minutes, the node is considered failed
Restart
• Completed tasks of incomplete jobs will be rerun • Because their intermediate results are no longer available • They will run on a different node
Resource Manager Failure
How is failure handled?
• Active Standby configuration
Impact
• More serious as all tasks fail
Restart
• Handled by the failover controller • Clients and Node Managers are configured to try both Resource Managers until they find the active one
Case Study: YARN usage @ Yahoo • 40,000+ nodes running YARN over 365 PB of data • ~400,000 jobs per day, for about 10 million hours of compute time • Estimated node usage improvement per day – 60-150%
MESOS
Overview •Mesos is a common resource sharing layer over which diverse frameworks can run
• Diagram: frameworks such as Hadoop and Pregel, which previously ran on their own dedicated nodes, instead run on top of Mesos, which manages a single shared pool of nodes
Other Benefits of Mesos •Run multiple instances of the same framework • Isolate production and experimental jobs • Run multiple versions of the framework concurrently
•Build specialized frameworks targeting particular problem domains • Better performance than general-purpose abstractions
Mesos Goals • High utilization of resources • Support diverse frameworks (current & future) • Scalability to 10,000’s of nodes • Reliability in face of failures
Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks
Design Elements •Fine-grained sharing: • Allocation at the level of tasks within a job • Improves utilization, latency, and data locality
•Resource offers: • Simple, scalable application-controlled scheduling mechanism
Element 1: Fine-Grained Sharing
• Coarse-grained sharing (HPC): each framework gets a static partition of the cluster nodes for the duration of its job
• Fine-grained sharing (Mesos): frameworks share nodes at the level of individual tasks, interleaved across the cluster
• In both cases the frameworks sit above a shared storage system (e.g. HDFS)
+ Improved utilization, responsiveness, data locality
Element 2: Resource Offers •Option: Global scheduler • Frameworks express needs in a specification language, global scheduler matches them to resources
+ Can make optimal decisions
– Complex: the language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs
Element 2: Resource Offers •Mesos: Resource offers • Offer available resources to frameworks, let them pick which resources to use and which tasks to launch
+ Keeps Mesos simple, lets it support future frameworks - Decentralized decisions might not be optimal
Class Exercise (10 mins) • If you were the airline and customers are looking for seats, what would a resource offer from the airline look like for booking a seat? • Hint: think about seat variants
Class Exercise (Solution) • If you were the airline and customers are looking for seats, what would a resource offer from the airline look like for booking a seat? • Resource offer: • # seats with video screens • # seats near exits • # club seats
Mesos Architecture
• Frameworks (e.g. an MPI job and a Hadoop job) run their own schedulers, which register with the Mesos master
• The Mesos master's allocation module picks which framework to offer resources to
• Resource offer = list of (node, availableResources), e.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
• Framework-specific scheduling: the framework scheduler picks which offered resources to use and which tasks to launch on them
• Mesos slaves launch and isolate the framework executors (e.g. MPI executor, Hadoop executor), which run the tasks
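The following toy model (plain Java, not the real Mesos API) illustrates the offer protocol in the diagram: the master offers (node, availableResources) pairs, and the framework scheduler decides which offers to accept and which tasks to launch. The node names and resource sizes mirror the example above.

```java
// Toy model of the Mesos offer protocol (not the real Mesos API).
import java.util.ArrayList;
import java.util.List;

public class OfferModel {
  record Resources(int cpus, int memGb) {}
  record Offer(String node, Resources available) {}
  record TaskLaunch(String node, String taskId, Resources used) {}

  /** A framework scheduler: accepts any offer with >= 2 CPUs and launches a task on it. */
  static List<TaskLaunch> frameworkSchedule(List<Offer> offers) {
    List<TaskLaunch> launches = new ArrayList<>();
    int taskNum = 0;
    for (Offer offer : offers) {
      if (offer.available().cpus() >= 2) {
        launches.add(new TaskLaunch(offer.node(), "task-" + taskNum++,
                                    new Resources(2, 1)));
      }
      // Offers that are not used are implicitly declined and re-offered later.
    }
    return launches;
  }

  public static void main(String[] args) {
    // The master's allocation module decided to offer these to this framework.
    List<Offer> offers = List.of(
        new Offer("node1", new Resources(2, 4)),
        new Offer("node2", new Resources(3, 2)),
        new Offer("node3", new Resources(1, 8)));
    frameworkSchedule(offers).forEach(t ->
        System.out.println("launch " + t.taskId() + " on " + t.node()));
  }
}
```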
Mesos Scheduling •Fair share scheduler • If there are p slots and s_i is the resource required by framework i • Allocate resources to framework i as (p * s_i) / (Σ_k s_k) (see the sketch below)
•Priority scheduler • Allocate highest priority framework resources first
•Pre-emption • If framework does not use allocation • If framework over-uses allocation (greedy or buggy)
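A small sketch of the fair-share formula above, assuming p slots and per-framework demands s_i; the framework names and numbers are made up.

```java
// Illustrative fair-share allocation: framework i gets p * s_i / (sum of s_k) slots.
import java.util.LinkedHashMap;
import java.util.Map;

public class FairShare {
  static Map<String, Double> allocate(int totalSlots, Map<String, Double> demand) {
    double totalDemand = demand.values().stream().mapToDouble(Double::doubleValue).sum();
    Map<String, Double> shares = new LinkedHashMap<>();
    demand.forEach((framework, s) ->
        shares.put(framework, totalSlots * s / totalDemand));
    return shares;
  }

  public static void main(String[] args) {
    Map<String, Double> demand = new LinkedHashMap<>();
    demand.put("Hadoop", 30.0);   // s_i = resources required by framework i
    demand.put("MPI", 10.0);
    demand.put("Spark", 20.0);
    // 30 slots -> Hadoop 15, MPI 5, Spark 10 (proportional to demand).
    System.out.println(allocate(30, demand));
  }
}
```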
Optimization: Filters •Let frameworks short-circuit rejection by providing a predicate on resources to be offered • E.g. “nodes from list L” or “nodes with > 8 GB RAM” • Could generalize to other hints as well
•Ability to reject still ensures correctness when needs cannot be expressed using filters
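The sketch below models such filters as plain predicates over offered nodes, using the two examples from the slide; this is illustrative only, not the actual Mesos filter API.

```java
// Illustrative resource-offer filters as predicates over nodes (not the Mesos API).
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class OfferFilters {
  record Node(String name, int memGb) {}

  public static void main(String[] args) {
    // "nodes from list L"
    Set<String> preferred = Set.of("node1", "node4");
    Predicate<Node> fromListL = n -> preferred.contains(n.name());

    // "nodes with > 8 GB RAM"
    Predicate<Node> bigMemory = n -> n.memGb() > 8;

    List<Node> cluster = List.of(new Node("node1", 4),
                                 new Node("node2", 16),
                                 new Node("node3", 32));

    // The master only needs to make offers from nodes that pass a framework's filter.
    System.out.println("list filter:   " + cluster.stream().filter(fromListL).toList());
    System.out.println("memory filter: " + cluster.stream().filter(bigMemory).toList());
  }
}
```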
Implementation Stats •20,000 lines of C++ •Master failover using ZooKeeper •Frameworks ported: Hadoop, MPI, Torque •New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop) •Open source in Apache Incubator
Users • Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing) • Berkeley machine learning researchers are running several algorithms at scale on Spark • Conviva is using Spark for data analytics • UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps
Dynamic Resource Sharing
Analysis •Resource offers work well when: • Frameworks can scale up and down elastically • Task durations are homogeneous • Frameworks have many preferred nodes
•These conditions hold in current data analytics frameworks (MapReduce, Dryad, …) • Work divided into short tasks to facilitate load balancing and fault recovery • Data replicated across multiple nodes
Fault Tolerance • Mesos master has only soft state: list of currently running frameworks and tasks • Rebuild when frameworks and slaves re-register with new master after a failure • Result: fault detection and recovery in ~10 sec
Framework Isolation • Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects • Containers currently support CPU, memory, IO and network bandwidth isolation • Not perfect, but much better than no isolation
YARN vs MESOS
YARN vs MESOS • YARN flow • Application Master makes a resource request to the YARN scheduler • YARN scheduler returns resources
• YARN decides where to run the job • Similar to traditional schedulers • May give better overall cluster utilisation
• MESOS flow • Mesos makes a resource offer to the app scheduler • App scheduler accepts or rejects the offer
• App scheduler decides where to run the job • Selects based on application criteria • e.g. a particular CPU type
Acknowledgements (in no specific order)
• Nitin Mohan
• Deepthi Ravi
• Shreyas
• Tanya Mehrotra
• Anuj Hegde
• Aditi Srinivas
• Avinash Shenoy
A Big Thanks to them.
Feedback Highlights • Thanks for the feedback. Appreciate your taking time • ~70% want to cover topics (specific/all) in greater detail • Wish we had the time. Let us see.
• 64% would like more detail on slides • We will try and incorporate or at least point to appropriate reading material. • Unfortunately no good books at present.
• 62% would like more class assignments • Ok, we can try and increase the number.
• 70% appreciated the programming assignments (some would like it tougher) • Ok. That’s easy to do ☺
• Thanks for volunteering to be a mentor for next time; we will get in touch • More tutorials • By its very nature, open source software is poorly documented; the goal was to give you some exposure, as this will be relevant in the real world • We will try to incorporate more tutorials and work that into the course.