RadFS
Random Access DFS
DFS in a slide
Client locates the DataNode, opens a socket, says hi
DataNode allocates a thread to stream block contents from the opened position
Client reads in as it processes
Great for streaming entire blocks
Positioned reads cut across the grain
A Different Approach
Everything is a positioned read
All interactions with DN are stateless
Wrapped ByteService interfaces
− Caching, Network, Checksum
− Each configurable, can be optimized for the task
Connections pooled and re-used often
Fewer threads on server
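A minimal sketch of the layering this implies, assuming a stateless positioned-read contract; the actual ByteService interface and class names in the HDFS-516 patch may differ:

```java
import java.io.IOException;

/** Sketch of the stateless, positioned-read contract (names assumed, not from the patch). */
public interface ByteService {
    /** Read up to length bytes at an absolute file position; no seek state, no server-side thread held. */
    int read(long position, byte[] buffer, int offset, int length) throws IOException;
}

/** Hypothetical layering: checksum -> caching -> network, each a ByteService wrapping the next. */
class CachingByteService implements ByteService {
    private final ByteService delegate;   // e.g. a network-backed service

    CachingByteService(ByteService delegate) { this.delegate = delegate; }

    public int read(long position, byte[] buffer, int offset, int length) throws IOException {
        // A real implementation would consult a client-side cache here and only
        // fall through to the delegate (the network) on a miss.
        return delegate.read(position, buffer, offset, length);
    }
}
```

Because every layer sees only (position, length) requests, each can be swapped or tuned independently per workload.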
DFS Anatomy – seek + read()
DFS seek+read, cont'd
Locates the preferred block, caches DN locations
Opens a new Socket and BlockReader for each random read
Reads from the Socket
Object creation means GC debt
No optimizations for repeated reads of the same data
Threads consume resources on the server after client hangup – files aren't automatically closed
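For context, this is roughly what a positioned read looks like through the stock DFS client (the path is supplied by the caller; nothing here is RadFS-specific). Each such call pays the per-read socket/BlockReader setup described above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PreadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path(args[0]));   // file to read, passed on the command line
        byte[] buf = new byte[2048];
        // Positioned read: does not move the stream's seek pointer, but on
        // stock DFS each call sets up a fresh socket + BlockReader.
        int n = in.read(1024L * 1024L, buf, 0, buf.length);
        System.out.println("read " + n + " bytes at offset 1MB");
        in.close();
    }
}
```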
RadFS Overview
RadFS seek+read, cont'd
Transparently caches frequently read data
Automatically pools/manages file handles
Reduces network congestion (in theory)
Lower DataNode workload
− 3 threads total instead of 1 per Xceiver
Configurable on client side for the task at hand
Network latency penalty on long-running reads
Checksum implementation means 2 reads per random read if caching is disabled
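A sketch of what the client-side configurability could look like; these property names are placeholders for the idea, not the actual keys used by the patch:

```java
import org.apache.hadoop.conf.Configuration;

public class RadClientConfig {
    /** Hypothetical tuning for a random-read-heavy job; key names are assumptions. */
    public static Configuration tuneForRandomReads() {
        Configuration conf = new Configuration();
        conf.setBoolean("radfs.cache.enabled", true);        // placeholder key: turn the client cache on
        conf.setLong("radfs.cache.size.bytes", 16L << 20);    // placeholder key: 16MB cache
        return conf;
    }
}
```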
Implementation Notes
Checksum is currently generated by wrapping ChecksumFileSystem around RadFileSystem
− Inefficient: reads 2 files over DFS
− Improper: what if the checksum block is corrupt?
CachingByteService implements lookahead (good) by copying bytes twice (bad)
Permissions happen “by accident” at the namenode
− Attackable by searching the blockid space on DNs
− Could exchange a UserAccessToken on each request
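The current checksum wiring is roughly analogous to how Hadoop's LocalFileSystem wraps RawLocalFileSystem; the subclass name below is an assumption, but it shows why every checksummed read becomes two DFS reads (data file plus .crc file):

```java
import org.apache.hadoop.fs.ChecksumFileSystem;

/** Hypothetical glue class; the patch's actual class name may differ. */
public class ChecksummedRadFileSystem extends ChecksumFileSystem {
    public ChecksummedRadFileSystem() {
        // Wraps the raw RadFileSystem the same way LocalFileSystem wraps
        // RawLocalFileSystem: the .crc sidecar file is itself stored in DFS,
        // so a checksummed random read touches two files over the network.
        super(new RadFileSystem());
    }
}
```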
Benchmark Environment
EC2 “Medium” - 2x2GHz, 1.7GB, shared I/O
Operations against 20GB sequence file
All tests run single-threaded from the lightly loaded namenode
Fast internal network; adequate memory, but not enough to page the entire file
All benchmarks in a given set were run on the same instance; middle value from 3 runs reported
Random Reads - 2k
10,000 random reads of 2k each over the length of a 20GB file
DFS averaged 7.8ms per read, while Rad with no cache averaged 4.4ms
Caching added a full 2ms – the hardcoded lookahead was no help, and there was lots of unnecessary byte copying
[Chart: Random Reads – 2kb, average latency in ms, comparing DFS, RadFS, and Rad with no cache]
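A sketch of the benchmark loop as described (10,000 positioned 2k reads at random offsets); the original harness isn't reproduced here, so paths and warm-up details are assumptions:

```java
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RandomReadBench {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                      // the 20GB sequence file
        long fileLen = fs.getFileStatus(file).getLen();
        byte[] buf = new byte[2048];
        Random rnd = new Random(42);

        FSDataInputStream in = fs.open(file);
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000; i++) {
            long pos = (long) (rnd.nextDouble() * (fileLen - buf.length));
            in.readFully(pos, buf, 0, buf.length);          // one positioned read per iteration
        }
        long elapsed = System.currentTimeMillis() - start;
        in.close();
        System.out.println("avg ms per read: " + (elapsed / 10000.0));
    }
}
```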
SequenceFile Search
Binary search over a 10GB sequence file (simplified sketch below)
DFS, RadFS with various cache settings
Indicative of potential filesystem uses
− Lucene
− Large numbers of secondary indices
− Ease of development
− Read-only RDBMS-like systems built from ETLs or other long-running processes
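A simplified sketch of the search pattern: sorted fixed-length records probed with positioned reads. The real benchmark searched an actual SequenceFile (which needs sync-marker handling); the record layout and key encoding here are illustrative assumptions:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public class BinarySearchOverDfs {
    static final int RECORD_SIZE = 2048;   // assumed fixed record size

    /** Returns the index of the record whose leading 8-byte key matches, or -1. */
    static long search(FSDataInputStream in, long numRecords, long targetKey) throws IOException {
        byte[] rec = new byte[RECORD_SIZE];
        long lo = 0, hi = numRecords - 1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            in.readFully(mid * RECORD_SIZE, rec, 0, RECORD_SIZE);   // one positioned read per probe
            long key = toLong(rec);                                  // key assumed to be the first 8 bytes
            if (key == targetKey) return mid;
            if (key < targetKey) lo = mid + 1; else hi = mid - 1;
        }
        return -1;
    }

    static long toLong(byte[] b) {
        long v = 0;
        for (int i = 0; i < 8; i++) v = (v << 8) | (b[i] & 0xff);
        return v;
    }
}
```

Every search re-reads the records near the top of the search tree, which is exactly the kind of repeated access a client-side cache is meant to exploit.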
[Chart: Sequence File Binary Search – 5000 searches, avg ms per search, comparing DFS, RAD0, RAD16MB, and RAD128MB]
Streaming
DFS is inherently faster for streaming due to the dedicated server thread
Checksumming is expensive! Early RadFS builds beat DFS at 1-byte read()s because they didn't have checksumming
Streaming jobs would require a PipeliningByteService that requests from the DataNode, then streams in and checksums on a separate client-side thread
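A minimal sketch of the PipeliningByteService idea, reusing the hypothetical ByteService interface sketched earlier: while the caller processes one chunk, a single background thread fetches (and would checksum) the next. Names and chunking details are assumptions, not the patch's design:

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PipeliningByteService implements ByteService {
    private final ByteService delegate;               // e.g. the network-backed service
    private final ExecutorService prefetcher = Executors.newSingleThreadExecutor();
    private final int chunkSize;
    private Future<byte[]> pending;                   // chunk being fetched ahead
    private long pendingPos = -1;

    PipeliningByteService(ByteService delegate, int chunkSize) {
        this.delegate = delegate;
        this.chunkSize = chunkSize;
    }

    public synchronized int read(long position, byte[] buffer, int offset, int length) throws IOException {
        try {
            byte[] chunk = (pending != null && position == pendingPos)
                    ? pending.get()                                   // sequential read: prefetch hit
                    : fetch(position, Math.max(length, chunkSize));   // otherwise fetch synchronously
            int n = Math.min(length, chunk.length);
            if (n == 0) return -1;                                    // end of data
            System.arraycopy(chunk, 0, buffer, offset, n);

            // Kick off the next fetch so it overlaps with the caller's processing.
            pendingPos = position + n;
            final long next = pendingPos;
            pending = prefetcher.submit(new Callable<byte[]>() {
                public byte[] call() throws IOException { return fetch(next, chunkSize); }
            });
            return n;
        } catch (Exception e) {
            throw new IOException("prefetch failed", e);
        }
    }

    /** Reads up to len bytes from the delegate; checksum verification would also happen here. */
    private byte[] fetch(long pos, int len) throws IOException {
        byte[] b = new byte[len];
        int n = delegate.read(pos, b, 0, len);
        if (n <= 0) return new byte[0];
        byte[] out = new byte[n];
        System.arraycopy(b, 0, out, 0, n);
        return out;
    }
}
```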
[Chart: Streaming – 1GB in 1-byte reads, time in seconds, comparing DFS and RAD with no checksum]
[Chart: Streaming – 1GB in 2k reads, time in seconds, comparing DFS and RAD]
Going forward – modular reader
Going forward - Applications
Could improve HBase: solves the file handle problem and improves latency
Could be used to create low-latency lookup formats accessible from scripting languages
− Cache is automatic, simplifying development
− “Table” directory with a main store file and several secondary index files generated by ETL
− Lucene indices? Can be built with MapReduce
Going forward - Development
Copy the existing HDFS method of interleaving checksums directly from the DataNode – one read
− Audit checksumming code for CPU efficiency – reading can be CPU bound
− Implement as a ByteService instead of a clumsy wrapper around FileSystem; make it configurable
Implement PipeliningByteService to improve streaming by pre-fetching pages
Exchange a UserAccessToken at each read; could possibly be used for encryption of the blockid
Contribute!
Patch is at Apache JIRA issue HDFS-516
Will be on GitHub momentarily
Goals:
− Equivalent streaming performance to DFS
− Faster random reads, with a caching option
− Lower resource consumption on the server
3 doable tasks above
Large configuration space to explore
Email me:
[email protected]