Map Reduce Using Cascading

  • May 2020
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Map Reduce Using Cascading as PDF for free.

More details

  • Words: 911
  • Pages: 22
Cascading www.cascading.org [email protected] Wednesday, May 14, 2008

Design Goals Make large processing jobs more transparent Reusable processing components independent of resources Incremental “data” builds Simplify testing of processes Scriptable from higher level languages (Groovy, JRuby, Jython, etc)

Wednesday, May 14, 2008

Cascading Introduction

Wednesday, May 14, 2008

Tuple Streams Value Stream

Group Stream [K1,K2,...,Kn

[V1,V2,...,Vn

Tuple A set of ordered data [“John”, “Doe”, 39] Value Stream Just tuples

[V1,V2,...,Vn

[V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn [V1,V2,...,Vn

Tuples groups by a key

Wednesday, May 14, 2008

[V1,V2,...,Vn [K1,K2,...,Kn [V1,V2,...,Vn [V1,V2,...,Vn

[V1,V2,...,Vn [V1,V2,...,Vn

Group Stream

[V1,V2,...,Vn

[V1,V2,...,Vn [V1,V2,...,Vn

Tuple Streams Scalar functions and filters Apply to value and group streams Aggregate functions Apply to group stream Functions can be chained

Source

[values]

[values]

[groups]

func

Group

aggr

func

[values] Source

Wednesday, May 14, 2008

func

Group

aggr

Sink

Sink

[values]

[values]

[groups/values]

[values]

Stream Processing Flow Pipe Assembly S

F

F

G

Pipe Assemblies A chain of scalar functions, groupings, aggregate functions Reusable, independent of data source/sink Cascade Flows S F Assemblies plus sources and sinks F S Cascades S F A collection of Flows Wednesday, May 14, 2008

A

A

S

S

F

S

S

F

S

Processing Patterns Source

Group

Sink

Source

Sink

Chain Group

Sink

Source

Source

Splits

Group

Sink

Group

Sink

Source

Joins Cross Wednesday, May 14, 2008

Source

Source

Sink

Group

Sink

Sink

MapReduce Planner Flow

Flow

Job Map

S

F

Job Map

Reduce G

A

F

F

F

Reduce

S

G

A

S

F Map

Job Map S

Flows are logical ‘units of work’

F

Reduce G

S

F Map

Flows ‘compiled’ into MR Jobs Intermediate files are created (and destroyed) to join Jobs

Wednesday, May 14, 2008

Job

F

A

Map S

F

S

Topological Scheduler Flows walk MapReduce Jobs in dependency order Cascades walk Flows in dependency order Independent Jobs and Flows are scheduled to run concurrently Listeners can react to element events (notify completion or failures) Only stale data-sets are rebuilt (configurable)

Wednesday, May 14, 2008

Scripting - Groovy Flow flow = builder.flow("wordcount") { source(input, scheme: text()) // input is filename of raw text document tokenize(/[.,]*\s+/) // output new tuple for each split, result replaces stream by default group() // group on stream count() // count values in group, creates 'count' field by default group(["count"], reverse: true) // group/sort on 'count', reverse the sort order sink(output) } flow.complete() // execute, block till completed Wednesday, May 14, 2008

System Integration FileSystems (unique to Cascading) Raw file S3 reading/writing (MD5) Raw file HTTP reading (MD5) Zip files Can bypass native Hadoop ‘collectors’ Event notification via listeners (XMPP/SQS/Zookeeper notifications) Groovy scripting for easier local shell/file operations (wget, scp, etc)

Wednesday, May 14, 2008

Cascading API & Internals

Wednesday, May 14, 2008

Core Concepts Taps and Schemes Tuples and Fields Pipes and PipeAssemblies Each and Every Operators Groups Flows, FlowSteps, and FlowConnectors Cascades, and CascadeConnectors, optional Wednesday, May 14, 2008

Taps and Schemes Taps, abstract out where and how a data resources is accessed hdfs, http, local, S3, etc Taps, used as Tuple (data) stream sinks, sources, or both Schemes, define what a resource is made of text lines, SequenceFile, CSV, etc

Wednesday, May 14, 2008

Tuples and Fields Tuples are the ‘records’, read from Tap sources, written to Tap sinks Fields are the ‘column names’, sourced from Schemes Tuple class, an ordered collection of Comparable values (“a string”, 1.0, new SomeComparableWritable()) Fields class, a list of field names, absolute or relative positions (“total”, 3, -1) // fields ‘total’, 4th position, last position Wednesday, May 14, 2008

Pipes and PipeAssemblies Tuple streams pass through Pipes to be processed Pipes, apply functions, filters, and aggregators to the Tuple stream Pipe instances are chained together into assemblies Reusable assemblies are subclasses of class PipeAssembly E

A

A

B' P

C'

E

E

A

B E

G

A

C E

Wednesday, May 14, 2008

B'

C'

E

Group Class and Subclasses Group, subclass of Pipe, groups the Tuple stream on given fields GroupBy and CoGroup subclass Group GroupBy groups and sorts CoGroup performs joins T

Fe

E

Fa G

T

Wednesday, May 14, 2008

E

G

A

T

T

E

A

T

Each and Every Classes Each, subclass of Pipe, applies Functions and Filters to each Tuple instance (a,b,c) -> Each( func() ) -> (a,b,c,d) Every, subclass of Pipe, applies Aggregators to every Tuple group (a: b,c) -> Every( agg()) -> (a,d: b,c)

Wednesday, May 14, 2008

Fe

Fa

E

A

Flows and FlowConnectors Flows encapsulate assemblies and sink and source Taps FlowConnectors connect assemblies and Taps into Flows

Flow

Flow

FlowStep G

E

A

FlowStep

T

T T

E

E

G

FlowStep E

Wednesday, May 14, 2008

E

G

A

T

T

E

A

T

FlowSteps and FlowConnectors Internally, FlowConnectors ‘compile’ assemblies into FlowSteps FlowSteps are MapReduce jobs, which are executed in Topo order Temporary files are created to link FlowSteps Flow FlowStep

FlowStep

Map Stack T

Wednesday, May 14, 2008

E

Reduce Stack G

A

Map Stack T

Reduce Stack G

E

T

Cascades and CascadeConnectors Are optional Cascades bind Flows together via shared Taps CascadeConnectors connect Flows Flows are executed in Topo order Cascade

T

Wednesday, May 14, 2008

T

F

T

F

T

T

E

T

F

T

F

Syntax Each( previous, argSelector, function/filter, resultSelector ) Every( previous, argSelector, aggregator, resultSelector ) GroupBy( previous, groupSelector, sortSelector ) CoGroup( joinN, joiner, declaredFields ) Function( numArgs, declaredFields, .... ) Filter (numArgs, ... ) Aggregator( numArgs, declaredFields, ... ) Wednesday, May 14, 2008

Related Documents

Map Reduce
May 2020 0
Map Reduce
August 2019 21
Map Reduce Multicore
October 2019 15
Using Concept Map Tools
November 2019 23