Exam Logistics
• One A4-size sheet in your own handwriting for memory recall
• Portions – everything post T1, including videos (lecture 10 onwards)
• Answer questions in the question paper itself (space provided)
• 60 marks, 2-hour paper, scaled down to 30
• Remaining 30 marks come from your final evaluation
• The sign-up sheet is shared; pick a slot over the next two weeks
• You need to submit a report; the format is in the share
• The format of the final evaluation presentation is also shared

Unit 5: Resource Allocation – YARN and Mesos
Prof. Dinkar Sitaram, Prof. K V Subramaniam, Prof. Sanchika Gupta, Prof. Prafullata K A, Prof. Mamatha Shetty

What have we learnt so far?
• Hadoop – MapReduce
• Spark
• Storm
• HBase
• Pig
And many more…

Problem

Rapid innovation in cluster computing frameworks

Handling faults
No single framework is optimal for all applications

Want to run multiple frameworks in a single cluster • …to maximize utilization • …to share data between frameworks

Where We Want to Go
• Today: static partitioning – each framework (Hadoop, Pregel, MPI) runs on its own statically partitioned set of nodes
• Goal: dynamic sharing of a single shared cluster across all frameworks

What will we cover today?
• YARN – Hadoop v2
• Mesos
• YARN vs Mesos

Hadoop 2.0 (YARN)

YARN (MapReduce 2)
• For a cluster larger than 4,000 nodes, the MapReduce process described above begins to suffer a severe bottleneck.
• So, in 2010, Yahoo! began to design the next generation of MapReduce.
• The result was YARN, short for Yet Another Resource Negotiator (or YARN Application Resource Negotiator, when defined recursively).

How YARN fits

YARN vs Classic MapReduce
• YARN splits the responsibilities of the JobTracker into separate entities.
• In classic MapReduce, the JobTracker handles:
  • Resource allocation
  • Scheduling
  • Task progress monitoring / task bookkeeping
• YARN separates these responsibilities into two independent daemons:
  • Resource Manager – manages the use of resources across the cluster (resource allocation)
  • Application Master – manages the lifecycle of an application running on the cluster (scheduling and task progress)

YARN Components
• Resource Manager
  • Arbitrates resources amongst all applications of the system
• Node Manager
  • Per-machine slave
  • Responsible for launching application containers
  • Monitors resource usage
• Application Master
  • Negotiates appropriate resource containers from the scheduler
  • Tracks and monitors the progress of the containers
• Container
  • Unit of allocation incorporating resources such as memory, CPU, and disk
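To make the division of labour concrete, here is a minimal, hypothetical sketch in plain Python (not the real YARN API; all class and method names are illustrative) of an Application Master asking the Resource Manager for containers and then launching tasks in them:

```python
from dataclasses import dataclass

@dataclass
class Container:
    """Unit of allocation: a slice of one node's resources."""
    node: str
    memory_mb: int
    vcores: int

class ResourceManager:
    """Arbitrates resources among all applications (toy model)."""
    def __init__(self, nodes):
        # Free capacity per node: {node: (memory_mb, vcores)}
        self.free = dict(nodes)

    def allocate(self, memory_mb, vcores, count):
        """Grant up to `count` containers that fit on some node."""
        granted = []
        for node, (mem, cpu) in self.free.items():
            while len(granted) < count and mem >= memory_mb and cpu >= vcores:
                mem, cpu = mem - memory_mb, cpu - vcores
                granted.append(Container(node, memory_mb, vcores))
            self.free[node] = (mem, cpu)
        return granted

class ApplicationMaster:
    """Manages one application's lifecycle: asks for containers, launches tasks."""
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, num_tasks):
        containers = self.rm.allocate(memory_mb=1024, vcores=1, count=num_tasks)
        for i, c in enumerate(containers):
            # In real YARN the Node Manager on c.node would launch this container.
            print(f"task {i} -> {c.node} ({c.memory_mb} MB, {c.vcores} vcore)")

rm = ResourceManager({"node1": (4096, 4), "node2": (2048, 2)})
ApplicationMaster(rm).run_job(num_tasks=5)
```

In real YARN the request/allocate exchange happens over heartbeats between the Application Master and the Resource Manager, and it is the Node Manager on each machine that actually launches and monitors the container.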

YARN Architecture Overview

Application submission in YARN

Class Exercise (10 mins) • How does YARN decide which machine to run the job on? • Outline a simple strategy to pick the machine(s) to run a job on • Remember that a job consists of tasks (map/reduce)

Job Scheduling
• Early versions of Hadoop used a very simplistic form of scheduling: FIFO.
• Typically, each job would use the whole cluster, so jobs had to wait their turn.
• Although a shared cluster offers great potential for providing large resources to many users, the problem of sharing resources fairly between users requires a better scheduler.
• Production jobs need to complete in a timely manner, while allowing users who are making smaller ad hoc queries to get results back in a reasonable time.

Job Scheduling
• Later on, the ability to set a job’s priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient.
• There are five levels: VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
• When the job scheduler is choosing the next job to run, it selects the one with the highest priority.
• However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running low-priority job that started before the high-priority job was scheduled.
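As a small illustration of that behaviour, here is a sketch in plain Python (not Hadoop code) of a non-preemptive, priority-ordered FIFO queue: the highest-priority waiting job is picked next, but a job that is already running is never interrupted.

```python
import heapq

# Priority levels as in mapred.job.priority (larger number = more urgent here).
PRIORITY = {"VERY_LOW": 0, "LOW": 1, "NORMAL": 2, "HIGH": 3, "VERY_HIGH": 4}

class FifoPriorityQueue:
    """Waiting jobs ordered by priority, then by submission order (FIFO per level)."""
    def __init__(self):
        self._heap = []
        self._counter = 0

    def submit(self, name, priority="NORMAL"):
        # Negate the priority so the most urgent job pops first from the min-heap.
        heapq.heappush(self._heap, (-PRIORITY[priority], self._counter, name))
        self._counter += 1

    def next_job(self):
        """Called only when the running job finishes: there is no preemption, so a
        long-running low-priority job still blocks a later high-priority one."""
        return heapq.heappop(self._heap)[2] if self._heap else None

q = FifoPriorityQueue()
q.submit("long_etl", "LOW")
running = q.next_job()                 # "long_etl" starts and holds the cluster
q.submit("urgent_report", "VERY_HIGH") # must wait until "long_etl" completes
print(running, q.next_job())           # long_etl urgent_report
```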

Job Scheduling • MapReduce in Hadoop comes with a choice of schedulers. • The default in MapReduce 1 is the original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair Scheduler and the Capacity Scheduler. • MapReduce 2 comes with the Capacity Scheduler (the default), and the FIFO scheduler.

The Fair Scheduler
• Aims to give every user a fair share of the cluster capacity over time; fair sharing also applies to groups of jobs.
• If a single job is running, it gets all of the cluster.
• As more jobs are submitted, free task slots are given to the jobs in such a way as to give each user a fair share of the cluster.
• A short job belonging to one user will complete in a reasonable time even while another user’s long job is running, and the long job will still make progress.

The Fair Scheduler Pools • Jobs are placed in pools, and by default, each user gets their own pool. • A user who submits more jobs than a second user will not get any more cluster resources than the second, on average. • It is also possible to define custom pools with guaranteed minimum capacities defined in terms of the number of map and reduce slots, and to set weightings for each pool. • The Fair Scheduler supports preemption, so if a pool has not received its fair share for a certain period of time, then the scheduler will kill tasks in pools running over capacity in order to give the slots to the pool running under capacity.

The Capacity Scheduler • The Capacity Scheduler takes a slightly different approach to multiuser scheduling. • A cluster is made up of a number of queues (like the Fair Scheduler’s pools), which may be hierarchical (so a queue may be the child of another queue), and each queue has an allocated capacity. • This is like the Fair Scheduler, except that within each queue, jobs are scheduled using FIFO scheduling (with priorities). • In effect, the Capacity Scheduler allows users or organizations to simulate a separate MapReduce cluster with FIFO scheduling for each user or organization. • The Fair Scheduler, by contrast, enforces fair sharing within each pool, so running jobs share the pool’s resources.

Class Exercise (5 mins)
• How does YARN keep track of the progress made by jobs and tasks?
• For each component (Job, Task), identify its progress indicator.

Progress and Status Updates
• MapReduce jobs are long-running batch jobs, so both job and task status are reported.
• Job status:
  • Job counters
  • State – running, successfully completed, failed
  • Progress of maps and reduces
  • Status message
• Task progress:
  • Map tasks – proportion of input processed
  • Reduce tasks – proportion of reduce input processed
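As a rough illustration only (not Hadoop's actual bookkeeping), progress can be pictured as the fraction of input consumed, aggregated across a job's tasks:

```python
def task_progress(bytes_processed: int, total_bytes: int) -> float:
    """Proportion of the task's input processed so far, clamped to [0, 1]."""
    if total_bytes <= 0:
        return 0.0
    return min(1.0, bytes_processed / total_bytes)

def job_progress(task_fractions: list[float]) -> float:
    """A simple job-level view: the average progress over all of its tasks."""
    return sum(task_fractions) / len(task_fractions) if task_fractions else 0.0

print(task_progress(512 * 1024, 2 * 1024 * 1024))  # 0.25
print(job_progress([1.0, 0.5, 0.25]))               # ~0.58
```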

Job Completion • Completed → last task for a job is complete • Application Master and task containers clean up working state

Handling failures
• What can fail?
  • Task
  • Application Master
  • Resource Manager
  • Node Manager

Task failure – types
• Runtime exceptions
  • The JVM reports the error back to the parent Application Master
  • The error is logged in the error logs
  • The task is marked as failed
• Hanging tasks
  • Progress updates have not happened for 10 minutes
  • The timeout value can be set
• Killed tasks
  • Speculative duplicates can be killed
  • When progress is slow, a speculative task is created
  • The slower task is killed
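A toy sketch of the two checks above (assuming the 10-minute default timeout from the slide and a simple "slower copy loses" rule; the real MapReduce heuristics are more involved):

```python
import time

TIMEOUT_SECS = 10 * 60  # default: no progress update for 10 minutes => hanging

def is_hanging(last_progress_update: float, now: float,
               timeout: float = TIMEOUT_SECS) -> bool:
    """A task is considered hung if it has not reported progress within the timeout."""
    return (now - last_progress_update) > timeout

def pick_task_to_kill(original_progress: float, speculative_progress: float) -> str:
    """When a speculative duplicate exists, kill whichever copy is further behind."""
    return "original" if original_progress < speculative_progress else "speculative"

now = time.time()
print(is_hanging(last_progress_update=now - 700, now=now))  # True (>10 min silent)
print(pick_task_to_kill(0.40, 0.75))                          # original
```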

Task failure – recovery
• What happens on failure?
  • The Application Master attempts to reschedule the task
  • On a different server
  • Up to the maximum allowed failures per task (default = 4)
• Job failure
  • Occurs when the maximum allowed failures for a task is exceeded
  • A maximum percentage of failed tasks can be tolerated, so a single failing task is not allowed to fail the whole job
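These rules can be summarised in a short sketch (the 4-attempt default comes from the slide; the percentage threshold is a configurable value shown here only as an illustrative parameter):

```python
def task_gives_up(attempts: int, max_attempts: int = 4) -> bool:
    """A task is declared failed only after exhausting its allowed attempts."""
    return attempts >= max_attempts

def job_fails(failed_tasks: int, total_tasks: int, max_failed_pct: float = 0.0) -> bool:
    """The job fails when the fraction of failed tasks exceeds the allowed percentage.
    With max_failed_pct > 0, a single failed task need not fail the whole job."""
    return (failed_tasks / total_tasks) * 100 > max_failed_pct

print(task_gives_up(attempts=4))                                         # True
print(job_fails(failed_tasks=1, total_tasks=100, max_failed_pct=5.0))    # False
print(job_fails(failed_tasks=10, total_tasks=100, max_failed_pct=5.0))   # True
```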

Application Master Failure
• When can failure occur?
  • Due to hardware or network failures
• How are failures detected?
  • The AM sends periodic heartbeats to the Resource Manager
• Restart
  • Max attempts to restart the application (default = 2)

Node Manager Failure
• When can failure occur?
  • Hardware failures, crashes, slow network
• How are failures detected?
  • The Node Manager sends regular heartbeats to the Resource Manager
  • It is marked as failed when none are received for 10 minutes
• Restart
  • Completed tasks of incomplete jobs will be rerun, as their intermediate results are not available
  • They will run on a different node

Resource Manager Failure
• How is failure handled?
  • Active–standby configuration
• Impact
  • More serious, as all tasks fail
• Restart
  • Handled by a failover controller
  • Clients and Node Managers must be configured for failover

Case Study: YARN usage @ Yahoo
• 40,000+ nodes running YARN over 365 PB of data
• ~400,000 jobs per day, totalling about 10 million compute-hours
• Estimated improvement in node usage per day: 60–150%

MESOS

Overview •Mesos is a common resource sharing layer over which diverse frameworks can run

[Figure: without Mesos, each framework (Hadoop, Pregel, …) runs on its own static set of nodes; with Mesos, Hadoop, Pregel, and other frameworks all run on top of a shared Mesos layer spanning every node.]

Other Benefits of Mesos •Run multiple instances of the same framework • Isolate production and experimental jobs • Run multiple versions of the framework concurrently

•Build specialized frameworks targeting particular problem domains • Better performance than general-purpose abstractions

Mesos Goals • High utilization of resources • Support diverse frameworks (current & future) • Scalability to 10,000’s of nodes • Reliability in face of failures

Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks

Design Elements •Fine-grained sharing: • Allocation at the level of tasks within a job • Improves utilization, latency, and data locality

•Resource offers: • Simple, scalable application-controlled scheduling mechanism

Element 1: Fine-Grained Sharing
• Coarse-grained sharing (HPC): each framework (Framework 1, 2, 3) holds its own static block of nodes on top of the storage system (e.g. HDFS).
• Fine-grained sharing (Mesos): tasks from different frameworks (Fw. 1, Fw. 2, Fw. 3) are interleaved across the same nodes, all over a shared storage system (e.g. HDFS).
+ Improved utilization, responsiveness, data locality

Element 2: Resource Offers
•Option: Global scheduler
  • Frameworks express their needs in a specification language; a global scheduler matches them to resources
  • + Can make optimal decisions
  • – Complex: the language must support all framework needs
  • – Difficult to scale and to make robust
  • – Future frameworks may have unanticipated needs

Element 2: Resource Offers •Mesos: Resource offers • Offer available resources to frameworks, let them pick which resources to use and which tasks to launch

+ Keeps Mesos simple, lets it support future frameworks - Decentralized decisions might not be optimal
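A minimal sketch in plain Python (not the Mesos API) of that offer cycle: the master offers whatever resources are currently free, and the framework replies with the tasks it wants to launch on them, implicitly declining the rest.

```python
# Resource offer = list of (node, available resources), as on the next slide,
# e.g. [("node1", {"cpus": 2, "mem_gb": 4}), ("node2", {"cpus": 3, "mem_gb": 2})]

def make_offer(free_resources):
    """Master side: offer every node's currently unused resources."""
    return [(node, dict(res)) for node, res in free_resources.items() if res["cpus"] > 0]

def framework_accepts(offer, cpus_per_task, mem_per_task, tasks_wanted):
    """Framework side: pick which offered resources to use and which tasks to launch."""
    launches = []
    for node, res in offer:
        while (tasks_wanted > 0 and res["cpus"] >= cpus_per_task
               and res["mem_gb"] >= mem_per_task):
            res["cpus"] -= cpus_per_task
            res["mem_gb"] -= mem_per_task
            launches.append((node, {"cpus": cpus_per_task, "mem_gb": mem_per_task}))
            tasks_wanted -= 1
    return launches  # anything not claimed is implicitly declined

free = {"node1": {"cpus": 2, "mem_gb": 4}, "node2": {"cpus": 3, "mem_gb": 2}}
offer = make_offer(free)
print(framework_accepts(offer, cpus_per_task=1, mem_per_task=1, tasks_wanted=3))
```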

Class Exercise (10 mins)
• If you were the airline and customers are looking for seats, what would a resource offer from the airline look like for booking a seat?
• Hint: look at variants

Class Exercise (Solution)
• A resource offer could list the available seats by variant:
  • # seats with a video screen
  • # seats near an exit
  • # club seats

Mesos Architecture
• Frameworks (e.g. an MPI job with its MPI scheduler, a Hadoop job with its Hadoop scheduler) register with the Mesos master.
• The allocation module in the Mesos master picks a framework to offer resources to.
• A resource offer is a list of (node, availableResources) pairs, e.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }.
• The framework scheduler performs framework-specific scheduling: it picks which offered resources to use and which tasks to launch on them.
• Mesos slaves launch and isolate the framework executors (e.g. MPI executor, Hadoop executor), which run the tasks.

Mesos Scheduling
•Fair share scheduler
  • If there are p slots and s_i is the resource required by framework i,
  • allocate framework i a share of (p · s_i) / (Σ_k s_k)

•Priority scheduler • Allocate highest priority framework resources first

•Pre-emption • If framework does not use allocation • If framework over-uses allocation (greedy or buggy)
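A direct transcription of the fair-share rule into a few lines of Python, purely to make the arithmetic concrete (framework names and demands are made up):

```python
def fair_share(p: int, demands: dict[str, float]) -> dict[str, float]:
    """Split p slots among frameworks in proportion to their demands s_i:
    framework i receives (p * s_i) / sum_k(s_k)."""
    total = sum(demands.values())
    return {fw: p * s_i / total for fw, s_i in demands.items()}

# 100 slots shared by three frameworks with demands 6, 3 and 1:
print(fair_share(100, {"hadoop": 6, "mpi": 3, "spark": 1}))
# {'hadoop': 60.0, 'mpi': 30.0, 'spark': 10.0}
```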

Optimization: Filters •Let frameworks short-circuit rejection by providing a predicate on resources to be offered • E.g. “nodes from list L” or “nodes with > 8 GB RAM” • Could generalize to other hints as well

•Ability to reject still ensures correctness when needs cannot be expressed using filters
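Filters can be pictured as predicates the framework registers with the master so that unsuitable nodes are never offered in the first place; the following is an illustrative sketch, not the actual Mesos filter API:

```python
# Example filters from the slide: "nodes from list L" and "nodes with > 8 GB RAM".
def nodes_from_list(allowed):
    return lambda node, res: node in allowed

def min_ram_gb(threshold):
    return lambda node, res: res["mem_gb"] > threshold

def apply_filters(offer, filters):
    """Drop offered nodes that fail any registered filter before sending the offer."""
    return [(node, res) for node, res in offer
            if all(f(node, res) for f in filters)]

offer = [("node1", {"mem_gb": 4}), ("node2", {"mem_gb": 16}), ("node3", {"mem_gb": 32})]
filters = [min_ram_gb(8), nodes_from_list({"node2", "node3"})]
print(apply_filters(offer, filters))  # only node2 and node3 survive
```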

Implementation Stats •20,000 lines of C++ •Master failover using ZooKeeper •Frameworks ported: Hadoop, MPI, Torque •New specialized framework: Spark, for iterative jobs (up to 20× faster than Hadoop) •Open source in Apache Incubator

Users • Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing) • Berkeley machine learning researchers are running several algorithms at scale on Spark • Conviva is using Spark for data analytics • UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

Dynamic Resource Sharing

Analysis •Resource offers work well when: • Frameworks can scale up and down elastically • Task durations are homogeneous • Frameworks have many preferred nodes

•These conditions hold in current data analytics frameworks (MapReduce, Dryad, …) • Work divided into short tasks to facilitate load balancing and fault recovery • Data replicated across multiple nodes

Fault Tolerance • Mesos master has only soft state: list of currently running frameworks and tasks • Rebuild when frameworks and slaves re-register with new master after a failure • Result: fault detection and recovery in ~10 sec

Framework Isolation • Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects • Containers currently support CPU, memory, IO and network bandwidth isolation • Not perfect, but much better than no isolation

YARN vs MESOS

• YARN flow
  • The Application Master makes a request to the YARN scheduler
  • The YARN scheduler returns resources
  • YARN makes the decision on where to run the job
  • Similar to traditional schedulers
  • May be more optimal for overall cluster utilisation
• Mesos flow
  • The Mesos scheduler makes a resource offer to the application scheduler
  • The application scheduler accepts or rejects the offer
  • The application scheduler makes the decision on where to run the job
  • Selection is based on application criteria, e.g. a particular CPU type

Acknowledgements (in no specific order)
• Nitin Mohan
• Deepthi Ravi
• Shreyas
• Tanya Mehrotra
• Anuj Hegde
• Aditi Srinivas
• Avinash Shenoy

A big thanks to them.

Feedback Highlights • Thanks for the feedback. Appreciate your taking time • ~70% want to cover topics (specific/all) in greater detail • Wish we had the time. Let us see.

• 64% would like more detail on slides • We will try and incorporate or at least point to appropriate reading material. • Unfortunately no good books at present.

• 62% would like more class assignments • Ok, we can try and increase the number.

• 70% appreciated the programming assignments (some would like it tougher) • Ok. That’s easy to do ☺

• Thanks for volunteering to be a mentor for next time; we will get in touch.
• More tutorials
  • By its very nature, open-source software is poorly documented; the goal was to give you some exposure, as this will be relevant in the real world.
  • We will try to incorporate more tutorials and work that into the course.
