Cascading www.cascading.org
[email protected] Wednesday, May 14, 2008
Design Goals Make large processing jobs more transparent Reusable processing components independent of resources Incremental “data” builds Simplify testing of processes Scriptable from higher level languages (Groovy, JRuby, Jython, etc)
Wednesday, May 14, 2008
Cascading Introduction
Wednesday, May 14, 2008
Tuple Streams Value Stream
Group Stream [K1,K2,...,Kn
[V1,V2,...,Vn
Tuple A set of ordered data [“John”, “Doe”, 39] Value Stream Just tuples
[V1,V2,...,Vn
[V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn
Tuples groups by a key
Wednesday, May 14, 2008
[V1,V2,...,Vn [K1,K2,...,Kn [V1,V2,...,Vn [V1,V2,...,Vn
[V1,V2,...,Vn [V1,V2,...,Vn
Group Stream
[V1,V2,...,Vn
[V1,V2,...,Vn [V1,V2,...,Vn
Tuple Streams Scalar functions and filters Apply to value and group streams Aggregate functions Apply to group stream Functions can be chained
Source
[values]
[values]
[groups]
func
Group
aggr
func
[values] Source
Wednesday, May 14, 2008
func
Group
aggr
Sink
Sink
[values]
[values]
[groups/values]
[values]
Stream Processing Flow Pipe Assembly S
F
F
G
Pipe Assemblies A chain of scalar functions, groupings, aggregate functions Reusable, independent of data source/sink Cascade Flows S F Assemblies plus sources and sinks F S Cascades S F A collection of Flows Wednesday, May 14, 2008
A
A
S
S
F
S
S
F
S
Processing Patterns Source
Group
Sink
Source
Sink
Chain Group
Sink
Source
Source
Splits
Group
Sink
Group
Sink
Source
Joins Cross Wednesday, May 14, 2008
Source
Source
Sink
Group
Sink
Sink
MapReduce Planner Flow
Flow
Job Map
S
F
Job Map
Reduce G
A
F
F
F
Reduce
S
G
A
S
F Map
Job Map S
Flows are logical ‘units of work’
F
Reduce G
S
F Map
Flows ‘compiled’ into MR Jobs Intermediate files are created (and destroyed) to join Jobs
Wednesday, May 14, 2008
Job
F
A
Map S
F
S
Topological Scheduler Flows walk MapReduce Jobs in dependency order Cascades walk Flows in dependency order Independent Jobs and Flows are scheduled to run concurrently Listeners can react to element events (notify completion or failures) Only stale data-sets are rebuilt (configurable)
Wednesday, May 14, 2008
Scripting - Groovy Flow flow = builder.flow("wordcount") { source(input, scheme: text()) // input is filename of raw text document tokenize(/[.,]*\s+/) // output new tuple for each split, result replaces stream by default group() // group on stream count() // count values in group, creates 'count' field by default group(["count"], reverse: true) // group/sort on 'count', reverse the sort order sink(output) } flow.complete() // execute, block till completed Wednesday, May 14, 2008
System Integration FileSystems (unique to Cascading) Raw file S3 reading/writing (MD5) Raw file HTTP reading (MD5) Zip files Can bypass native Hadoop ‘collectors’ Event notification via listeners (XMPP/SQS/Zookeeper notifications) Groovy scripting for easier local shell/file operations (wget, scp, etc)
Wednesday, May 14, 2008
Cascading API & Internals
Wednesday, May 14, 2008
Core Concepts Taps and Schemes Tuples and Fields Pipes and PipeAssemblies Each and Every Operators Groups Flows, FlowSteps, and FlowConnectors Cascades, and CascadeConnectors, optional Wednesday, May 14, 2008
Taps and Schemes Taps, abstract out where and how a data resources is accessed hdfs, http, local, S3, etc Taps, used as Tuple (data) stream sinks, sources, or both Schemes, define what a resource is made of text lines, SequenceFile, CSV, etc
Wednesday, May 14, 2008
Tuples and Fields Tuples are the ‘records’, read from Tap sources, written to Tap sinks Fields are the ‘column names’, sourced from Schemes Tuple class, an ordered collection of Comparable values (“a string”, 1.0, new SomeComparableWritable()) Fields class, a list of field names, absolute or relative positions (“total”, 3, -1) // fields ‘total’, 4th position, last position Wednesday, May 14, 2008
Pipes and PipeAssemblies Tuple streams pass through Pipes to be processed Pipes, apply functions, filters, and aggregators to the Tuple stream Pipe instances are chained together into assemblies Reusable assemblies are subclasses of class PipeAssembly E
A
A
B' P
C'
E
E
A
B E
G
A
C E
Wednesday, May 14, 2008
B'
C'
E
Group Class and Subclasses Group, subclass of Pipe, groups the Tuple stream on given fields GroupBy and CoGroup subclass Group GroupBy groups and sorts CoGroup performs joins T
Fe
E
Fa G
T
Wednesday, May 14, 2008
E
G
A
T
T
E
A
T
Each and Every Classes Each, subclass of Pipe, applies Functions and Filters to each Tuple instance (a,b,c) -> Each( func() ) -> (a,b,c,d) Every, subclass of Pipe, applies Aggregators to every Tuple group (a: b,c) -> Every( agg()) -> (a,d: b,c)
Wednesday, May 14, 2008
Fe
Fa
E
A
Flows and FlowConnectors Flows encapsulate assemblies and sink and source Taps FlowConnectors connect assemblies and Taps into Flows
Flow
Flow
FlowStep G
E
A
FlowStep
T
T T
E
E
G
FlowStep E
Wednesday, May 14, 2008
E
G
A
T
T
E
A
T
FlowSteps and FlowConnectors Internally, FlowConnectors ‘compile’ assemblies into FlowSteps FlowSteps are MapReduce jobs, which are executed in Topo order Temporary files are created to link FlowSteps Flow FlowStep
FlowStep
Map Stack T
Wednesday, May 14, 2008
E
Reduce Stack G
A
Map Stack T
Reduce Stack G
E
T
Cascades and CascadeConnectors Are optional Cascades bind Flows together via shared Taps CascadeConnectors connect Flows Flows are executed in Topo order Cascade
T
Wednesday, May 14, 2008
T
F
T
F
T
T
E
T
F
T
F
Syntax Each( previous, argSelector, function/filter, resultSelector ) Every( previous, argSelector, aggregator, resultSelector ) GroupBy( previous, groupSelector, sortSelector ) CoGroup( joinN, joiner, declaredFields ) Function( numArgs, declaredFields, .... ) Filter (numArgs, ... ) Aggregator( numArgs, declaredFields, ... ) Wednesday, May 14, 2008