15th ANNUAL WORKSHOP 2019
ACCELERATING TENSORFLOW WITH RDMA FOR HIGH-PERFORMANCE DEEP LEARNING
Xiaoyi Lu, Dhabaleswar K. (DK) Panda, The Ohio State University [ March 19, 2019 ]
E-mail: {luxi, panda}@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~luxi
http://www.cse.ohio-state.edu/~panda
OVERVIEW OF HIGH-PERFORMANCE DEEP LEARNING
§ Deep Learning is a subset of Machine Learning
• But, it is perhaps the most radical and revolutionary subset
§ Deep Learning is going through a resurgence
• Model: Excellent accuracy for deep/convolutional neural networks
• Data: Public availability of versatile datasets like MNIST, CIFAR, and ImageNet
• Capability: Unprecedented computing and communication capabilities: Multi-/Many-Core, GPGPUs, Xeon Phi, InfiniBand, RoCE, etc.
Courtesy: http://www.zdnet.com/article/caffe2-deep-learning-wide-ambitions-flexibility-scalability-and-advocacy/
§ Big Data has become one of the most important elements in business analytics
• Increasing demand for extracting Big Value from Big Data to keep revenue growing
(Figures: MNIST handwritten digits; a Deep Neural Network)
APPLICATION EXAMPLE: FLICKR’S MAGIC VIEW PHOTO FILTERING • Image recognition to divide pictures into surprisingly accurate categories • Magic of AI/DL: Generate accurate tags for billions of pictures
Courtesy: https://thenextweb.com/opinion/2015/05/22/flickrs-new-magic-view-photo-filtering-feature-works-so-well-it-convinced-me-to-ditch-iphoto/#.tnw_RaZEaD6g
EXAMPLES OF DEEP LEARNING STACKS
§ TensorFlow
§ Caffe/Caffe2
§ Torch
§ SparkNet
§ TensorFrame
§ DeepLearning4J
§ BigDL
§ CNTK
§ mmlspark
§ Many others…
TRENDS OF DEEP LEARNING STACKS § Google TensorFlow § Microsoft CNTK § Facebook Caffe2 and PyTorch
§ Google Search Trend (March, 2019)
INCREASING USAGE OF HPC, BIG DATA AND DEEP LEARNING
§ Big Data (Hadoop, Spark, HBase, Memcached, etc.)
§ HPC (MPI, RDMA, Lustre, etc.)
§ Deep Learning (Caffe, TensorFlow, BigDL, etc.)
Convergence of HPC, Big Data, and Deep Learning!!!
HIGHLY-OPTIMIZED UNDERLYING LIBRARIES WITH HPC TECHNOLOGIES
§ BLAS Libraries – the heart of math operations
• ATLAS/OpenBLAS
• NVIDIA cuBLAS
• Intel Math Kernel Library (MKL)
§ DNN Libraries – the heart of Convolutions!
• NVIDIA cuDNN (already in its 7th iteration – cuDNN v7)
• Intel MKL-DNN (MKL 2017) – recent but a very promising development
§ Communication Libraries – the heart of model parameter updating
• RDMA
• GPUDirect RDMA
Xiaoyi Lu, Haiyang Shi, Rajarshi Biswas, M. Haseeb Javed, and Dhabaleswar K. (DK) Panda. DLoBD: A Comprehensive Study of Deep Learning over Big Data Stacks on HPC Clusters, in IEEE Transactions on Multi-Scale Computing Systems (TMSCS), 2018
OUTLINE § Overview of TensorFlow and gRPC § Accelerating gRPC and TensorFlow with RDMA § Benchmarking gRPC and TensorFlow § Performance Evaluation § Conclusion
ARCHITECTURE OVERVIEW OF GOOGLE TENSORFLOW
§ Key Features:
• Widely used for Deep Learning
• Open-source software library for numerical computation using data-flow graphs
• Graph edges represent the multidimensional data arrays (tensors)
• Nodes in the graph represent mathematical operations
• Flexible architecture allows deploying computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API
• Used by Google, Airbnb, Dropbox, Snapchat, Twitter
• Communication and computation intensive (see the minimal graph sketch below)
Architecture of TensorFlow Source: https://www.tensorflow.org/
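To make the data-flow model concrete, here is a minimal TensorFlow 1.x sketch (not from the slides); the tensor shapes and variable names are illustrative only.

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the era of this talk

# Edges carry multidimensional arrays (tensors); nodes are operations.
x = tf.placeholder(tf.float32, shape=[None, 784], name="images")
w = tf.Variable(tf.random_normal([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="bias")
logits = tf.matmul(x, w) + b           # matmul and add nodes in the graph

with tf.Session() as sess:             # the same graph can target CPU or GPU
    sess.run(tf.global_variables_initializer())
    out = sess.run(logits, feed_dict={x: [[0.0] * 784]})
```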
ARCHITECTURE OVERVIEW OF GRPC
§ Key Features:
• Simple service definition
• Works across languages and platforms
• C++, Java, Python, Android Java, etc.
• Linux, Mac, Windows
• Start quickly and scale
• Bi-directional streaming and integrated authentication
• Used by Google (several of Google’s cloud products and externally facing APIs, TensorFlow), Netflix, Docker, Cisco, Juniper Networks, etc.
• Uses sockets for communication! (see the client sketch below)
Large-scale distributed systems composed of micro services Source: http://www.grpc.io/
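As a rough illustration of “simple service definition” (not part of the slides): a service is declared once in a .proto file and stubs are generated per language. The `helloworld_pb2*` modules below are assumed to be the stubs generated from gRPC’s stock HelloWorld example.

```python
import grpc
# Stubs generated from gRPC's example helloworld.proto
# (e.g., via: python -m grpc_tools.protoc ... helloworld.proto)
import helloworld_pb2
import helloworld_pb2_grpc

# Under the hood this call travels over an HTTP/2 connection on sockets --
# exactly the layer that the RDMA designs in this talk replace.
channel = grpc.insecure_channel("localhost:50051")
stub = helloworld_pb2_grpc.GreeterStub(channel)
reply = stub.SayHello(helloworld_pb2.HelloRequest(name="OFA"))
print(reply.message)
```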
DISTRIBUTED DEEP LEARNING WITH TENSORFLOW AND GRPC
(Figure: A client drives a master; a parameter-server task (/job:PS/task:0) and a worker task (/job:Worker/task:0) each run a gRPC server/client over their CPUs and GPUs.)
§ Worker services communicate among each other using gRPC, or gRPC+X! (a minimal cluster sketch follows)
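A minimal sketch (assumed, not from the slides) of the PS/worker layout above in TensorFlow 1.x; host names, ports, and shapes are placeholders, and each process would run this with its own job_name/task_index.

```python
import tensorflow as tf

# Hypothetical two-task cluster: one parameter server and one worker.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2223"],
})

# Each process starts its own in-process gRPC server for its task.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device("/job:ps/task:0"):          # parameters live on the PS
    w = tf.Variable(tf.zeros([784, 10]))

with tf.device("/job:worker/task:0"):      # compute runs on the worker
    x = tf.placeholder(tf.float32, [None, 784])
    y = tf.matmul(x, w)                    # w is fetched from the PS over gRPC

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
```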
THE HIGH-PERFORMANCE BIG DATA (HIBD) PROJECT
§ RDMA for Apache Spark
§ RDMA for Apache Hadoop 3.x (RDMA-Hadoop-3.x)
§ RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
• Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
§ RDMA for Apache Kafka
§ RDMA for Apache HBase
§ RDMA for Memcached (RDMA-Memcached)
§ RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
§ OSU HiBD-Benchmarks (OHB)
• HDFS, Memcached, HBase, and Spark Micro-benchmarks
§ Available for InfiniBand and RoCE; also runs on Ethernet
§ Available for x86 and OpenPOWER
§ Support for Singularity and Docker
§ http://hibd.cse.ohio-state.edu
§ Users Base: 300 organizations from 35 countries
§ More than 29,350 downloads from the project site
MOTIVATION § Can similar designs be done for gRPC and TensorFlow to achieve significant performance benefits by taking advantage of native RDMA support?
§ How do we benchmark gRPC and TensorFlow for both deep learning and system researchers?
§ What kind of performance benefits can we get through native RDMA-based designs in gRPC and TensorFlow?
OUTLINE § Overview of TensorFlow and gRPC § Accelerating gRPC and TensorFlow with RDMA § Benchmarking gRPC and TensorFlow § Performance Evaluation § Conclusion
TENSOR COMMUNICATION OVER GRPC CHANNEL
§ Rendezvous protocol
• The TensorFlow worker (tensor-receiving process) actively requests tensors from the parameter server (tensor-sending process)
§ Worker issues a Tensor RPC request to the Parameter Server (PS)
§ PS finds the requested tensor and responds to the worker
§ gRPC core uses recvmsg and sendmsg primitives for receiving and sending payloads
§ Tensor transmission uses iovec structures (a scatter-gather sketch follows)
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
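A rough Python analogue (not the gRPC C core itself) of how a payload made of several buffers is handed to sendmsg() as one scatter-gather list; the buffer sizes are illustrative and kept small enough for a self-contained demo.

```python
import socket

# Three buffers of very different sizes, mimicking an iovec for one RPC payload.
small  = b"\x01" * 64            # a few bytes (e.g., framing/metadata)
medium = b"\x02" * (4 * 1024)    # KBytes
large  = b"\x03" * (64 * 1024)   # a larger tensor chunk

sender, receiver = socket.socketpair()
sender.sendmsg([small, medium, large])   # one call, three scatter-gather entries

expected = len(small) + len(medium) + len(large)
received = b""
while len(received) < expected:          # recv may return partial reads
    received += receiver.recv(65536)
assert len(received) == expected
```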
HIGH PERFORMANCE TENSOR COMMUNICATION CHANNEL
§ gRPC + Verbs
• Dedicated Verbs channel for tensor communication
• gRPC channel for administrative task communication
§ gRPC + MPI
• Dedicated MPI channel for tensor communication
• gRPC channel for administrative task communication
§ Uber Horovod
• Uber’s approach to MPI-based distributed TensorFlow
§ Baidu TensorFlow-Allreduce
• Baidu’s approach to MPI-based distributed TensorFlow
(A protocol-selection sketch follows.)
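In stock TensorFlow 1.x, the gRPC+Verbs and gRPC+MPI channels are selected per task via the `protocol` argument of `tf.train.Server`; this typically assumes a TensorFlow build with Verbs/MPI support enabled (e.g., `--config=verbs` or `--config=mpi`). A sketch with placeholder hosts:

```python
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["host0:2222"],
    "worker": ["host1:2222"],
})

# "grpc" is the default; "grpc+verbs" / "grpc+mpi" move tensor traffic onto a
# dedicated Verbs or MPI channel while gRPC keeps handling administrative RPCs.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+verbs")
server.join()
```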
TENSORFLOW WORKLOAD VIA GRPC
§ Small, Medium, and Large indicate buffers of a few Bytes, KBytes, and MBytes in length
§ A gRPC payload may contain a uniform distribution of such Small buffers
§ Many Large buffers and a few Small buffers create a skewed distribution of buffers in one gRPC payload (a sketch of these schemes follows below)
iovec Buffer Distribution Observed for TensorFlow training over gRPC
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
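A hedged sketch of the three payload-generation schemes (Uniform, Random, Skew) used by the micro-benchmarks later in the talk; the buffer counts and size classes below are assumptions for illustration, not the suite’s exact parameters.

```python
import random

SMALL, MEDIUM, LARGE = 64, 8 * 1024, 2 * 1024 * 1024   # Bytes / KBytes / MBytes classes

def make_iovec(scheme, nbufs=32):
    """Return a list of buffers approximating one gRPC payload."""
    if scheme == "uniform":
        sizes = [SMALL] * nbufs                                   # all buffers alike
    elif scheme == "random":
        sizes = [random.choice((SMALL, MEDIUM, LARGE)) for _ in range(nbufs)]
    elif scheme == "skew":
        sizes = [LARGE] * (nbufs - 4) + [SMALL] * 4               # many large, few small
    else:
        raise ValueError(scheme)
    return [bytes(s) for s in sizes]

payload = make_iovec("skew")
print(len(payload), "buffers,", sum(len(b) for b in payload), "bytes total")
```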
OSU AR-GRPC AND AR-GRPC ENHANCED TENSORFLOW
(Figure: The TF sender and TF receiver exchange tensors through the AR-gRPC core. The receiver issues RecvTensorAsync and registers a receive callback; the sender finds the tensor, serializes it into gRPC byte buffers, chunks it adaptively (B0, B1, B2), and transfers it via RDMA-Endpoint-Write/Read through the RDMA communication library, RDMA-polling engine, and a global RDMA buffer pool over InfiniBand/RoCE; the receiver deserializes and invokes the callback.)
§ Adaptive RDMA gRPC (AR-gRPC)
§ Features
• Hybrid communication engine
• Message pipelining and coalescing
• Adaptive chunking and accumulation
• Intelligent threshold detection
• Zero-copy transmission
• Adaptive protocol selection between eager and rendezvous
• Zero-copy send/recv
R. Biswas, X. Lu, and D. K. Panda, Accelerating TensorFlow with Adaptive RDMA-based gRPC, In Proceedings of the 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2018.
OUTLINE § Overview of TensorFlow and gRPC § Accelerating gRPC and TensorFlow with RDMA § Benchmarking gRPC and TensorFlow § Performance Evaluation § Conclusion
AVAILABLE BENCHMARKS, MODELS, AND DATASETS

Dataset   | Category               | Resolution      | Classes | Training Images | Testing Images
MNIST     | Digit Classification   | 28 × 28 B&W     | 10      | 60 K            | 10 K
CIFAR-10  | Object Classification  | 32 × 32 Color   | 10      | 50 K            | 10 K
ImageNet  | Object Classification  | 256 × 256 Color | 1000    | 1.2 M           | 100 K

Model              | Layers (Conv. / Fully-connected) | Dataset   | Framework
LeNet              | 2 / 2                            | MNIST     | TensorFlow, CaffeOnSpark, TensorFlowOnSpark
SoftMax Regression | NA / NA                          | MNIST     | TensorFlow, TensorFlowOnSpark
CIFAR-10 Quick     | 3 / 1                            | CIFAR-10  | CaffeOnSpark, TensorFlowOnSpark, MMLSpark
VGG-16             | 13 / 3                           | CIFAR-10  | TensorFlow, BigDL
AlexNet            | 5 / 3                            | ImageNet  | TensorFlow, CaffeOnSpark
GoogLeNet          | 22 / 0                           | ImageNet  | TensorFlow, CaffeOnSpark
Resnet-50          | 53 / 1                           | Synthetic | TensorFlow
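For reference, a minimal TensorFlow 1.x sketch (assumed, not from the slides) of the 2-conv / 2-fully-connected LeNet variant listed above; the filter counts and dense-layer width are illustrative choices.

```python
import tensorflow as tf

def lenet(images):                       # images: [batch, 28, 28, 1] MNIST input
    c1 = tf.layers.conv2d(images, 32, 5, activation=tf.nn.relu)   # conv layer 1
    p1 = tf.layers.max_pooling2d(c1, 2, 2)
    c2 = tf.layers.conv2d(p1, 64, 5, activation=tf.nn.relu)       # conv layer 2
    p2 = tf.layers.max_pooling2d(c2, 2, 2)
    flat = tf.reshape(p2, [-1, 4 * 4 * 64])                        # 28->24->12->8->4
    f1 = tf.layers.dense(flat, 512, activation=tf.nn.relu)        # fully-connected 1
    return tf.layers.dense(f1, 10)                                # fully-connected 2 (10 classes)

x = tf.placeholder(tf.float32, [None, 28, 28, 1])
logits = lenet(x)
```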
ARE CURRENT BENCHMARKS SUFFICIENT?
• Current DL models and benchmarks are oriented toward deep learning research
• Example: Facebook Caffe2 takes 1 hour to train ImageNet data [1]
• However, many system researchers are focused on improving the communication engine of deep learning frameworks
• A fast benchmark that models deep learning characteristics is highly desirable
1. Goyal, Priya, et al. "Accurate, large minibatch SGD: training imagenet in 1 hour." arXiv preprint arXiv:1706.02677 (2017).
TENSORFLOW DL MICRO-BENCHMARKS FOR GRPC
R. Biswas, X. Lu, and D. K. Panda, Designing a Micro-Benchmark Suite to Evaluate gRPC for TensorFlow: Early Experiences, BPOE, 2018.
OUTLINE § Overview of TensorFlow and gRPC § Accelerating gRPC and TensorFlow with RDMA § Benchmarking gRPC and TensorFlow § Performance Evaluation § Conclusion
PERFORMANCE BENEFITS FOR AR-GRPC WITH MICRO-BENCHMARK
(Figure: Point-to-point latency (us), Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps), for small (2 B–8 KB), medium (16 KB–512 KB), and large (1 MB–8 MB) payloads.)
• AR-gRPC (OSU design) latency on SDSC-Comet-FDR
– Up to 2.7x speedup over Default gRPC (IPoIB) for small messages
– Up to 2.8x speedup over Default gRPC (IPoIB) for medium messages
– Up to 2.5x speedup over Default gRPC (IPoIB) for large messages
R. Biswas, X. Lu, and D. K. Panda, Accelerating TensorFlow with Adaptive RDMA-based gRPC, In Proceedings of the 25th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 2018.
TF-GRPC-P2P-LATENCY
(Figure: TF-gRPC-P2P latency (ms) under Uniform, Random, and Skew payload generation schemes. OSU-RI2-IB-EDR: Default gRPC (Ethernet 40G), Default gRPC (IPoIB-100Gbps), AR-gRPC (RDMA-100Gbps). SDSC-Comet-IB-FDR: Default gRPC (Ethernet 10G), Default gRPC (IPoIB-56Gbps), AR-gRPC (RDMA-56Gbps).)
• OSU-RI2-IB-EDR: AR-gRPC (RDMA) reduces latency by 59% and 56% compared to Default gRPC over 40G Ethernet and IPoIB, respectively
• SDSC-Comet-IB-FDR: AR-gRPC (RDMA) reduces latency by 78% compared to Default gRPC over 10G Ethernet and by 69% compared to Default gRPC over IPoIB
TF-GRPC-PS-THROUGHPUT
(Figure: Parameter-server throughput (RPCs/second) under Uniform, Random, and Skew payload generation schemes — Default gRPC (Ethernet 40G), Default gRPC (IPoIB-100Gbps), and AR-gRPC (RDMA-100Gbps) on OSU-RI2-IB-EDR; Default gRPC (Ethernet 10G), Default gRPC (IPoIB-56Gbps), and AR-gRPC (RDMA-56Gbps) on SDSC-Comet-IB-FDR.)
• OSU-RI2-IB-EDR: AR-gRPC (RDMA) achieves a 3.4x speedup compared to Default gRPC over IPoIB for the uniform scheme
• SDSC-Comet-IB-FDR: AR-gRPC (RDMA) achieves 3.6x the throughput of Default gRPC over IPoIB for the uniform scheme
PERFORMANCE BENEFITS FOR AR-GRPC WITH TENSORFLOW MIMIC TEST
(Figure: Latency (ms) and Calls/second for 2 MB, 4 MB, and 8 MB payloads, Default gRPC (IPoIB-56Gbps) vs. AR-gRPC (RDMA-56Gbps), in a fully-connected architecture that mimics TensorFlow communication.)
• AR-gRPC (OSU design) TensorFlow mimic test on SDSC-Comet-FDR
– Up to 60% reduction in average latency over Default gRPC (IPoIB)
– Up to 2.68x performance speedup over Default gRPC (IPoIB)
EVALUATION OF TENSORFLOW: GOOGLENET & ALEXNET
(Figure: Images/second for gRPC vs. AR-gRPC with GoogleNet and AlexNet at batch sizes 8, 16, and 32 per GPU, on 8 and 12 nodes.)
GoogleNet & AlexNet Evaluation on OSU-RI2-IB-EDR (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NUMofGPUs
• GoogleNet has only 5 Million parameters, whereas AlexNet has about 60 Million parameters
• AR-gRPC scales better as we go from 4 nodes to 8 nodes
• For large batch size (32/GPU, total 224) the GoogleNet improvement is about 15% (597 vs 517)
• GoogleNet results in less network-intensive gradient updates
• However, AR-gRPC shows 89% (124 vs 65) performance improvement for AlexNet compared to default gRPC
EVALUATION OF TENSORFLOW: INCEPTION-V4
(Figure: Images/second for gRPC, gRPC + Verbs, gRPC + MPI, and AR-gRPC with Inception-v4 at batch sizes 8, 16, and 32 per GPU, on 4, 8, and 12 nodes.)
Inception4 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NUMofGPUs
• AR-gRPC improves TensorFlow performance by a maximum of 29%, 80%, and 144% compared to default gRPC on 4, 8, and 12 nodes, respectively
• For example: improvement of 80% (93 vs 51 images) for batch size 16/GPU (total 176) on 12 nodes
• AR-gRPC processes a maximum of 27%, 12%, and 31% more images than the Verbs channel
• AR-gRPC outperforms the MPI channel by a maximum of 29%, 151%, and 228% on 4, 8, and 12 nodes
EVALUATION OF TENSORFLOW: RESNET152
(Figure: Images/second for gRPC, gRPC + Verbs, gRPC + MPI, and AR-gRPC with Resnet152 at batch sizes 8, 16, and 32 per GPU, on 4, 8, and 12 nodes.)
Resnet152 Evaluation on Cluster A (Higher is Better); TotalBatchSize = (BatchSize/GPU) × NUMofGPUs
• AR-gRPC accelerates TensorFlow by 62% (batch size 8/GPU) compared to default gRPC on 4 nodes
• AR-gRPC improves Resnet152 performance by 32% (batch size 32/GPU) to 147% on 8 nodes
• AR-gRPC achieves a maximum speedup of 3x (55 vs 18 images) compared to default gRPC on 12 nodes
• Even for the higher batch size of 32/GPU (total 352), AR-gRPC improves TensorFlow performance by 82% on 12 nodes
• AR-gRPC processes a maximum of 40%, 35%, and 30% more images on 4, 8, and 12 nodes, respectively, than Verbs
• AR-gRPC achieves a maximum speedup of 1.61x, 3.3x, and 4.5x compared to the MPI channel on 4, 8, and 12 nodes, respectively
AR-GRPC SPEEDUP COMPARED TO DEFAULT GRPC
(Figure: Speedup of AR-gRPC over default gRPC for the CNNs AlexNet, GoogleNet, VGG16, Resnet50, Resnet152, and Inception4.)
OSU RDMA-TENSORFLOW DISTRIBUTION
§ High-Performance Design of TensorFlow over RDMA-enabled Interconnects
• High-performance RDMA-enhanced design with native InfiniBand support at the verbs level for gRPC and TensorFlow
• RDMA-based data communication
• Adaptive communication protocols
• Dynamic message chunking and accumulation
• Support for RDMA device selection
• Easily configurable for different protocols (native InfiniBand and IPoIB)
§ Current release: 0.9.1
• Based on Google TensorFlow 1.3.0
• Tested with
• Mellanox InfiniBand adapters (e.g., EDR)
• NVIDIA GPGPU K80
• CUDA 8.0 and cuDNN 5.0
• http://hidl.cse.ohio-state.edu
OUTLINE § Overview of TensorFlow and gRPC § Accelerating gRPC and TensorFlow with RDMA § Benchmarking gRPC and TensorFlow § Performance Evaluation § Conclusion
CONCLUSION
§ Presented an architecture overview of TensorFlow and gRPC
§ Discussed challenges in accelerating and benchmarking TensorFlow and gRPC
§ RDMA can benefit DL workloads, as shown by our AR-gRPC and the corresponding enhanced TensorFlow
• Unified high-performance communication runtime throughout the TensorFlow stack
• Up to 4.1x speedup compared to the default gRPC
• Up to 3x performance improvement for TensorFlow when using AR-gRPC compared to the default gRPC channel
• Significant improvement over the Verbs and MPI channels
• Consistently good performance for different CNNs
§ Plan to explore TensorFlow runtime to find more bottlenecks § Our work is publicly available: http://hidl.cse.ohio-state.edu/
15th ANNUAL WORKSHOP 2019
THANK YOU Xiaoyi Lu, Dhabaleswar K. (DK) Panda The Ohio State University
E-mail: {luxi, panda}@cse.ohio-state.edu http://www.cse.ohio-state.edu/~luxi http://www.cse.ohio-state.edu/~panda