Aron MacDonald

SAP HANA Vora®: Functions and Integration

Rheinwerk Publishing, Bonn • Boston

What You'll Learn

Learn to use SAP HANA Vora to extend Apache Spark for SAP data. Start out by discovering basic operations and how to use them. Then explore how SAP HANA Vora enables you to enhance business functions in your applications, consume Hadoop data in SAP HANA, and consume SAP HANA data in Hadoop.

1 Getting Started with SAP HANA Vora and Apache Spark
2 Transform Data with Apache Spark
  2.1 Using RDDs
  2.2 Using DataFrames
3 Consuming Data with Vora
  3.1 Consuming Hadoop Data
  3.2 Consuming SAP HANA Data
  3.3 Cross-Consumption: Join SAP HANA and Vora Data
  3.4 Additional Business Functions
  3.5 Additional Information on Vora
4 Combining Apache Spark, SAP HANA, and Vora
  4.1 Download Oil Price Data Using Apache Spark
  4.2 Download NY Times Articles Using Apache Spark
  4.3 Store Net Income Results Using SAP HANA
  4.4 Analyze Oil Data Using Vora
5 What's Next?

1 Getting Started with SAP HANA Vora and Apache Spark

SAP HANA Vora (hereafter referred to as "Vora") is SAP's new in-memory query engine, which plugs into the Apache Spark and Hadoop frameworks to provide interactive analysis of big data.

Hadoop is open-source software, designed to run on inexpensive commodity hardware, enabling businesses to store and process data at the petabyte scale. Hadoop is made up of a large collection of projects or libraries, covering areas like real-time data ingestion, document storage, and in-memory reporting engines. Hadoop is constantly evolving, and choosing the right project for your current needs requires careful consideration. At Hadoop's core are several projects:

» The Hadoop Distributed File System (HDFS)
» YARN, a second-generation map/reduce framework and resource manager
» Apache Spark, an in-memory data-processing engine

Over recent years, Apache Spark has evolved to become the new de facto standard data-processing engine for Hadoop; it's fast and flexible and operates "in-memory" (when the dataset can fit). Vora expands upon Apache Spark by providing added business features as well as best-in-class integration with SAP HANA (using an organization's live corporate data), enabling cross-consumption reporting and advanced analytics at scale. Figure 1 illustrates how Hadoop, Vora, and SAP HANA might appear within an enterprise's big data architecture to manage transactional and behavioral datasets.

Figure 1 SAP HANA and Vora/Hadoop Reference Architecture

This E-Bite demonstrates, step by step, basic data transformation operations using Apache Spark and Vora's data consumption operations to get you started on your big data journey. For a more detailed overview of Vora, please check out SAP HANA Vora: An Overview, http://www.slideshare.net/SAPTechnology/sap-hana-voraan-overview.

All examples shown in this E-Bite are running SAP HANA Vora Developer Edition. SAP HANA Vora Developer Edition is free to try by registering at https://go.sap.com/cmp/syb/crm-xul5-int-voratrdm/index.html. Before we jump into technical and practical details, let's briefly discuss this software.


Once you have followed the AWS installation steps for Vora, you'll be presented with the screen shown in Figure 2. This Vora launchpad provides a brief overview of your newly created Hadoop cluster, running the Hortonworks distribution of Hadoop (HDP) and Vora. The Vora launchpad provides a user-friendly way of adding and removing nodes on your test cluster and also has important links to other installed services, including:

» Apache Ambari
A tool for provisioning, managing, and monitoring Apache Hadoop clusters
» Apache Zeppelin
A web-based notebook that enables interactive data analytics using SQL, Scala, Vora, and more

Figure 2 shows the Vora launchpad. Click on the links in the top-right corner to access the installed services.

Figure 2 Vora Launchpad


Figure 3 shows the Ambari console, where more complex configuration and fine-tuning of your Hadoop cluster can take place.

Figure 3 Ambari Console

In this E-Bite, Zeppelin is the primary tool used for interacting with Apache Spark and Vora. When you first launch Zeppelin, you are presented with a list of the notebooks (documents) you are working on, as shown in Figure 4. Within each notebook, code can be created, saved, and executed. Figure 5 shows an example of a basic Apache Spark function in a Zeppelin notebook. The top half of a notebook tile, highlighted in orange, shows your own custom Apache Spark or Vora code. The bottom half of the tile, highlighted in green, shows the results of executing the code.


Figure 4 Zeppelin Main Screen

Figure 5 Launch Apache Spark via a Zeppelin Tile

Note
Apache Spark and Vora support multiple programming languages such as Scala, Python, and Java. As Scala is currently the most common language used with Apache Spark, the examples in this E-Bite are in Scala.


Within each Zeppelin notebook, SQL can also be executed and visualized, avoiding the need to jump to external reporting tools, especially during testing. Figure 6 shows an example of an oil price analysis performed using Apache Spark and Vora. This example will be discussed further in Section 4.

2 Transform Data with Apache Spark

Vora is an add-on to Apache Spark, providing enhanced SQL logic and business functionality. Apache Spark will continue to be used in conjunction with Vora when complex transformation logic is required to prepare data prior to query execution. To maximize the use of Vora, it is important to understand the basics of Apache Spark, especially when data preparation is required.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming. Apache Spark has many features, but at its core are the three main features covered in this E-Bite:

» Resilient distributed dataset (RDD)
RDDs are fault-tolerant collections of elements that can be operated on in parallel across your Hadoop cluster. There are two ways to create RDDs: parallelizing an existing collection in your custom program or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop input format. (Both approaches are sketched after the note below.)
» Spark SQL
Spark SQL is a module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.
» DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in R/Python but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

Note
Spark also has other advanced functionality related to Spark Streaming and the Machine Learning Library (MLlib), useful for machine learning algorithms. Unfortunately, these are not covered in this brief overview. Please see http://spark.apache.org/mllib/ and http://spark.apache.org/streaming/ for additional info.
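To make the RDD bullet point concrete, the following minimal sketch (not from the original text) shows both creation methods, assuming the SparkContext sc provided by Zeppelin; the HDFS file path is a hypothetical example:

// 1. Parallelize an existing collection in your program
val fromCollection = sc.parallelize(Seq("GB01", "US01", "AU01"))

// 2. Reference a dataset in external storage (here, a hypothetical
//    file on HDFS)
val fromStorage = sc.textFile("hdfs://master.cluster:8020/user/vora/sample.txt")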


2.1 Using RDDs

RDDs have a long list of functions for manipulating and transforming data. Table 1 gives a brief overview of some of the common functions, grouped into four categories.

General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy, reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap

Math/Statistical: sample, randomSplit, count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct

Set Theory/Relational: union, intersection, subtract, distinct, cartesian, zip, takeOrdered

Data Structure/I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

Table 1 Apache Spark RDD Functions

For a more complete list of the functions, with examples, see http://training.databricks.com/visualapi.pdf. This E-Bite uses several examples of these functions to help get you started on your Vora journey.

map Function

You can use the map function to transform a row of data from one type to another. For example, with the code in Listing 1, you can transform an RDD with a single column, containing a list of company codes, into a two-column structure. The first column will contain the company code, and the second will contain an integer of 1, as shown in Figure 7.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.map(obj => (obj, 1))

Listing 1 map Function

Figure 7 map Results

filter Function

You can use the filter function, as shown in Listing 2, to filter rows of the RDD. As shown in Figure 8, in this example, only the rows belonging to company code "US01" are returned.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.filter(obj => obj == "US01")

Listing 2 filter Function

Figure 8 filter Results

keyBy Function

Following the example code in Listing 3, you can use the keyBy function to create a new key column in the RDD. As shown in Figure 9, the first two digits of the company code are used to create a key.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.keyBy(w => w.substring(0, 2))
println(companyCodeRDDy.collect().mkString(", "))

Listing 3 keyBy Function

Figure 9 keyBy Results

countByKey Function

You can use the countByKey function, as shown in Listing 4, to count the number of occurrences of each key. Note that countByKey is an action that returns a map of keys to counts rather than a new RDD. The results are shown in Figure 10.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01")).keyBy(w => w.substring(0, 2))
val companyCodeRDDy = companyCodeRDDx.countByKey()

Listing 4 countByKey Function

Figure 10 countByKey Results

sample Function

Following the example in Listing 5, you can use the sample function to create a randomized subset of rows from the RDD. This sample subset, shown in Figure 11, might be particularly useful if you need to create representative subsets to test predictive analytic functions when working with very large datasets.

val companyCodeRDDx = sc.parallelize(Array(("GB01", 100), ("US01", 100), ("US01", 300), ("AU01", 100)))
val companyCodeRDDy = companyCodeRDDx.sample(false, 0.5)

Listing 5 sample Function

Figure 11 sample Results


distinct Function

Following the example code in Listing 6, you can use the distinct function to create a new RDD containing only distinct (different) values. The results are shown in Figure 12.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.distinct()

Listing 6 distinct Function

Figure 12 distinct Results

reduceByKey Function

You can use the reduceByKey function, as shown in Listing 7, to create a new RDD with aggregated values based on a key. This function is similar to an SQL statement such as SELECT COMPANYCODE, SUM(VALUE) FROM ... GROUP BY COMPANYCODE. The results are shown in Figure 13.

val companyCodeRDDx = sc.parallelize(Array(("GB01", 100), ("US01", 100), ("US01", 300), ("AU01", 100)))
val companyCodeRDDy = companyCodeRDDx.reduceByKey(_ + _).collect

Listing 7 reduceByKey Function

Figure 13 reduceByKey Results

union Function

Following the example code in Listing 8, you can use the union function to create a new RDD containing the combined rows of two RDDs with identical structures, as shown in Figure 14. This function is similar to an SQL UNION statement.

val companyCodeRDDx = sc.parallelize(Array(("GB01", 100), ("US01", 100), ("US01", 300)))
val companyCodeRDDy = sc.parallelize(Array(("AU01", 100)))
val companyCodeRDDz = companyCodeRDDx.union(companyCodeRDDy)

Listing 8 union Function

Figure 14 union Results

join Function

You can use the join function to create a new RDD containing the join of different RDDs that share a common column, such as a transactional table and a master data table containing reporting attributes. This function is similar to an SQL join statement. In Listing 9, two RDDs are joined based on a shared "company code" column to look up the country code attribute contained in the second RDD. The results are shown in Figure 15.

val companyCodeRDD = sc.parallelize(Array(("GB01", 100), ("US01", 100), ("US01", 300), ("AU01", 100)))
val companyCodeAttrRDD = sc.parallelize(Array(("GB01", "GB"), ("US01", "US")))
val companyCodeJoinRDD = companyCodeRDD.join(companyCodeAttrRDD).collect
val companyCodeLeftJoinRDD = companyCodeRDD.leftOuterJoin(companyCodeAttrRDD).collect

Listing 9 join Function

Figure 15 join Results

flatMap Function

You can use the flatMap function to split a column into a new RDD containing multiple rows. In the example code in Listing 10, an RDD contains a single row of company codes delimited by a comma. flatMap is used to return the company codes as multiple rows, as shown in Figure 16.

val sapDataRDD = sc.parallelize(Array("GB01,US01,US01,AU01"))
sapDataRDD.flatMap(x => x.split(",")).collect

Listing 10 flatMap Function

Figure 16 flatMap Results

map and split Functions

Listing 11 uses the map and split functions to transform a single column of data that contains comma-delimited information into two separate columns, as shown in Figure 17.

val sapDataRDD = sc.parallelize(Array("GB01,100", "US01,100", "US01,300", "AU01,100"))
sapDataRDD.map(x => {
  var splitStr = x.split(",")
  (splitStr(0), splitStr(1))
}).collect

Listing 11 map and split Functions

Figure 17 map and split Results

Use Patterns to Map into Multiple Columns

In some cases, a file may not have standardized delimiters but still may have a recognizable format. A typical example of this is an Apache log file. In the following example, a simple file might contain a company code, year, period, document number, and date (in SAP format). The row could be split using a " " (space) delimiter as shown earlier, but more complex pattern-matching logic, detailed in Listing 12, can also be used to generate custom results, as shown in Figure 18.

val data = sc.parallelize(Array(
  "US01 2016 01 00100000001 20160115",
  "US01 2016 01 00100000002 20160115",
  "US01 2016 01 00100000003 20160115"))

val Pattern = """(\S+) (\S+) (\S+) (\S+) (\S+)""".r

// Create a case class for the financial document header.
case class financialDocumentHeader(companyCode: String,
  year: String,
  month: String,
  documentNumber: String,
  postingDate: String)

def parseFinancialDocumentHeaderLine(line: String): financialDocumentHeader = {
  val patternMatch = Pattern.findFirstMatchIn(line)
  if (patternMatch.isEmpty) {
    throw new RuntimeException("Cannot parse line: " + line)
  }
  val lineData = patternMatch.get
  financialDocumentHeader(lineData.group(1), lineData.group(2),
    lineData.group(3), lineData.group(4), lineData.group(5))
}

var dataHeaderDocs = data.map(parseFinancialDocumentHeaderLine)

Listing 12 Pattern-Matching Logic

Figure 18 Pattern-Matching Logic Results


Execute Linux Commands

In some cases, it can be useful to execute Linux commands directly from Apache Spark, which makes it unnecessary to open additional sessions to search for files in HDFS. The example code in Listing 13 shows a basic Hadoop listing command, the results of which are shown in Figure 19.

import scala.sys.process._
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://master.cluster:8020/user/vora/").!!

Listing 13 Execute Linux Command from Hadoop to Search for HDFS File

Figure 19 HDFS Listing

Working with HDFS Directories

Within Apache Spark, it is also possible to execute common Hadoop commands. In Listing 14, the temp directory is deleted and re-created in HDFS, as shown in Figure 20.

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val fs = FileSystem.get(sc.hadoopConfiguration)

//Delete Temp Directory
fs.delete(new Path("/user/vora/temp"), true)
var status = fs.listStatus(new Path("/user/vora/"))
println("\nDirectory Listing Before:")
status.foreach(x => println(x.getPath))

//Create new Temp directory
fs.mkdirs(new Path("/user/vora/temp"))
status = fs.listStatus(new Path("/user/vora"))
println("\nDirectory Listing After:")
status.foreach(x => println(x.getPath))

Listing 14 Delete and Create HDFS Directories

Figure 20 New Temp Directory Created

saveAsTextFile Function: Saving RDDs to HDFS

RDDs are not persisted indefinitely. In some cases, the final RDD may be needed for subsequent reprocessing, so Apache Spark can save an RDD to a local file system, to HDFS, or to another remote file system, such as AWS S3. If you are running Apache Spark on Hadoop, then the most commonly used storage location is HDFS. The example code in Listing 15 demonstrates the saveAsTextFile function to save an RDD to a text file on HDFS, as shown in Figure 21.


import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/user/vora/sapData"), true)
val sapDataRDD = sc.parallelize(Array("GB01,100", "US01,100", "US01,300", "AU01,100"))
sapDataRDD.repartition(1).saveAsTextFile("sapData")

Listing 15 saveAsTextFile Function

Figure 21 saveAsTextFile Results

textFile Function: Load CSV from HDFS

The example code in Listing 16 demonstrates the textFile function, used to load a CSV file, stored on HDFS, into an RDD, as shown in Figure 22.

val hdfsFileRDD = sc.textFile("hdfs://master.cluster:8020/user/vora/sapData")

Listing 16 textFile Function

Figure 22 Load CSV Results

2.2 Using DataFrames

DataFrames work with structured data, similar to an SQL table, but have a much shorter list of available functions when compared with RDDs. DataFrames share many of the same functions as standard SQL, such as SELECT, WHERE (filter), AGG (aggregate), JOIN, UNION, and so on. For a complete list of DataFrame functions, see http://spark.apache.org/docs/latest/sql-programming-guide.html.

Note
Datasets were introduced in Apache Spark 1.6. Datasets merge core functionality from RDDs and DataFrames. This is not yet supported by Vora (expected Q2, 2016).
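To give a flavor of the Dataset API mentioned in the note, here is a minimal sketch (not from the original text) using Spark 1.6 syntax; the case class and values are illustrative assumptions:

import sqlContext.implicits._

// A typed record: Datasets combine RDD-style typed operations
// with DataFrame-style optimizations
case class CompanyAmount(company: String, amountUSD: Double)

val companyDS = Seq(
  CompanyAmount("GB01", 1000.00),
  CompanyAmount("US01", 5000.00),
  CompanyAmount("US01", -3000.00)
).toDS()

// Typed, compile-checked operations, as with an RDD
val totalPositive = companyDS.filter(_.amountUSD > 0).map(_.amountUSD).reduce(_ + _)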

Create a Simple DataFrame

The example code in Listing 17 creates a simple DataFrame from an in-memory sequence, as shown in Figure 23.

val companyDataDF = Seq(("GB01", "Revenue", 1000.00),
  ("US01", "Revenue", 5000.00),
  ("US01", "Expense", -3000.00),
  ("AU01", "Revenue", 300.00)).toDF("Company", "AccountGroup", "Amount_USD")

Listing 17 Create a Simple DataFrame

Figure 23 Create DataFrame Results


Print Schema

After creating a DataFrame, it may be useful to see its definition using the printSchema function (Listing 18), similar to an SQL DESCRIBE TABLE. The results are shown in Figure 24.

companyDataDF.printSchema

Listing 18 printSchema Function

Figure 24 printSchema Results

Select Columns

Following the example code in Listing 19, you can use the select function to reduce the number of columns. The results are shown in Figure 25.

companyDataDF.select("Company", "Amount_USD").show(10)

Listing 19 select Function

Figure 25 select Column Results


filter Function

The filter function can be used to filter rows. Listing 20 shows different syntaxes for performing the same filter operation. The results are shown in Figure 26.

companyDataDF.filter("AccountGroup = 'Revenue'").show(10)
companyDataDF.filter($"AccountGroup" === "Revenue").show(10)

Listing 20 filter Function

Figure 26 filter Results

Combining filter and select Functions

DataFrame functions can be combined. In the example code in Listing 21, rows are filtered, and the number of columns is reduced. The results are shown in Figure 27.

companyDataDF.filter("Company = 'US01'").select("Company", "Amount_USD").show(10)

Listing 21 Combining the filter and select Functions

Figure 27 filter and select Results

groupBy and agg Functions

The groupBy and agg functions can be used to aggregate data. The example code in Listing 22 calculates the total profit for a company code. The results are shown in Figure 28.

val groupedData = companyDataDF.filter("Company = 'US01'").select("Company", "Amount_USD").groupBy("Company")
val groupedDataDF = groupedData.agg(sum("Amount_USD"))

Listing 22 groupBy and agg Functions for Data Aggregation

Figure 28 groupBy and agg Results

countDistinct Function

Following the example code in Listing 23, the countDistinct function can be used to count the distinct number of values of a column. The results are shown in Figure 29.

companyDataDF.agg(countDistinct("Company")).show()

Listing 23 countDistinct Function

Figure 29 countDistinct Results

Load JSON to a DataFrame

JSON structures are increasingly used for results produced by web APIs. Listing 24 shows how a very simple JSON structure can be loaded into a DataFrame. The results are shown in Figure 30.

val stringJsonRDD = sc.parallelize(Seq(
  """{ "CompanyCode": "GB01", "AccountGrp": "Revenue", "Amount_USD": 1000.00 }""",
  """{ "CompanyCode": "US01", "AccountGrp": "Revenue", "Amount_USD": 5000.00 }""",
  """{ "CompanyCode": "US01", "AccountGrp": "Expense", "Amount_USD": -3000.00 }""",
  """{ "CompanyCode": "AU01", "AccountGrp": "Revenue", "Amount_USD": 300.00 }"""
))
val resultsJSON = sqlContext.read.json(stringJsonRDD)

Listing 24 Load a Simple JSON Structure


Figure 30 Load JSON Structure Result

Explode a JSON

Unfortunately, not all JSON structures are flat. In many cases, JSON structures may have nested lists of data. Listing 25 shows how the levels of a JSON structure can be exploded (flattened) into a more useful columnar form, as shown in Figure 31 and Figure 32.

val stringJsonRDD = sc.parallelize(Seq(
  """{ "data" : [
    { "CompanyCode": "GB01", "AccountGrp": "Revenue", "Amount_USD": 1000.00 },
    { "CompanyCode": "US01", "AccountGrp": "Revenue", "Amount_USD": 5000.00 },
    { "CompanyCode": "US01", "AccountGrp": "Expense", "Amount_USD": -3000.00 },
    { "CompanyCode": "AU01", "AccountGrp": "Revenue", "Amount_USD": 300.00 }
  ] }"""
))
val resultsJSON = sqlContext.read.json(stringJsonRDD)
resultsJSON.printSchema
resultsJSON.toDF.show()
resultsJSON.toDF.printSchema
resultsJSON.toDF.withColumn("dataList", explode($"data")).show()
var finalResults = resultsJSON.toDF.withColumn("dataList", explode($"data"))
  .withColumn("CompanyCode", $"dataList.CompanyCode")
  .withColumn("AccountGrp", $"dataList.AccountGrp")
  .withColumn("Amount_USD", $"dataList.Amount_USD")
finalResults.show()
finalResults.select("CompanyCode", "AccountGrp", "Amount_USD").show()

Listing 25 Explode a More Complete JSON Structure

Figure 31 Explode JSON Results, Part 1

Figure 32 Explode JSON Results, Part 2

Save as Parquet Format

To achieve the best query performance on Hadoop using Spark SQL and Vora, it's essential to use a columnar store file structure (similar to SAP HANA's columnar store structure). The most common standards for Hadoop are ORC and Parquet. Following the example code in Listing 26, the results of a DataFrame are stored in the Parquet file format on HDFS, as shown in Figure 33.

fs.delete(new Path("/user/vora/sapDataParquet"), true)
finalResults.select("CompanyCode", "AccountGrp", "Amount_USD").repartition(1).save("sapDataParquet", "parquet")
status = fs.listStatus(new Path("/user/vora/sapDataParquet"))
println("\nDirectory Listing After:")
status.foreach(x => println(x.getPath))

Listing 26 Save as Parquet File

Figure 33 Save as Parquet Results

Load Parquet Format

Listing 27 shows how a Parquet file on HDFS can be loaded into a DataFrame. The results are shown in Figure 34.

val finalResultsReloaded = sqlContext.load("sapDataParquet", "parquet")
finalResultsReloaded.show()

Listing 27 Load a Parquet File

Figure 34 Load Parquet Results


Register Temp Table for Spark SQL

Following the example code in Listing 28, you can make a DataFrame available to Spark SQL as a table, using the registerTempTable function. The results are shown in Figure 35.

finalResultsReloaded.registerTempTable("sapDataParquet")

Listing 28 registerTempTable Function

Figure 35 registerTempTable Results

Using Spark SQL

The following are a few examples of using Zeppelin to execute Spark SQL. Listing 29 uses Spark SQL to show tables, as shown in Figure 36.

%sql show tables

Listing 29 Execute Apache Spark SQL to Show Tables

Figure 36 Show Tables Results


Next, execute the SQL code in Listing 30 and try Zeppelin's different chart types, such as its tabular and pie charts. Figure 37 and Figure 38 show the data displayed in a tabular format and in a pie chart, respectively.

%sql select CompanyCode, sum(Amount_USD) as SUM_AMOUNT_USD from sapDataParquet group by CompanyCode

Listing 30 Execute a Simple SQL Statement to Aggregate Results

Figure 37 Results in Tabular Display

Figure 38 Results in Pie Chart Display

3 Consuming Data with Vora

This section covers the basic functionality of Vora. It demonstrates how Hadoop data and SAP HANA data can be consumed and combined. Vora's operations fall into two distinct categories:

» Analyzing data in Hadoop
» Analyzing data in SAP HANA

3.1 Consuming Hadoop Data

This section covers some of Vora's major functions for reading data from HDFS.

Create a Vora Table

The Vora SQL code shown in Listing 31 can be used to create a table linked to the HDFS CSV files created in Section 2.1. The results are shown in Figure 39.

%vora CREATE TABLE testCSV (
COMPANYCODE VARCHAR(4),
AMOUNT_USD DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
tableName "testCSV",
paths "/user/vora/sapData/part-00000",
hosts "master.cluster,secondary.cluster",
zkurls "master.cluster:2181,secondary.cluster:2181",
nameNodeUrl "master.cluster:8020"
)

Listing 31 Create a Vora Table

Figure 39 Create Table Results

Note
On the Vora Developer Edition, you can use the master.cluster and secondary.cluster values instead of their corresponding internal IPs. This can be helpful to avoid having to adjust your scripts each time your Hadoop cluster is stopped and started.

Select Results of CSV

The Vora SQL code shown in Listing 32 is used to query the table. The results are shown in Figure 40.

%vora select * from testCSV

Listing 32 Vora SQL to Query a Table

Figure 40 Results of SQL Query on a Table


Create a Vora Table Based on Parquet Files

You can use the Vora SQL code shown in Listing 33 to create a table linked to the HDFS Parquet files created in Section 2.2.

%vora CREATE TABLE testParquet (
COMPANYCODE VARCHAR(4),
ACCOUNTGROUP VARCHAR(10),
AMOUNT_USD DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
tableName "testParquet",
paths "/user/vora/sapDataParquet/part-r-00000-d65a7c11-9ba1-45f1-8a7d-eb733b9325d0.gz.parquet",
hosts "master.cluster,secondary.cluster",
zkurls "master.cluster:2181,secondary.cluster:2181",
nameNodeUrl "master.cluster:8020",
format "parquet"
)

Listing 33 Vora SQL to Create Tables Linked to Parquet Files

Select Results of Parquet File

You can use the Vora SQL code shown in Listing 34 to query the Parquet table. The selection results are shown in Figure 41.

%vora select COMPANYCODE, sum(AMOUNT_USD) as SUM_AMOUNT_USD from testParquet group by COMPANYCODE

Listing 34 Vora SQL to Query Parquet Table

Figure 41 Selection Results on Parquet Table


Show Tables

Vora tables are different from Spark SQL tables. Use the SQL code shown in Listing 35 to see the tables available in Vora. The results are shown in Figure 42.

%vora show tables

Listing 35 Vora SQL to Show Tables

Figure 42 Show Tables Results

Show Data Sources

Vora currently uses Zookeeper (Hadoop's generic configuration management tool) to store Vora configurations. Use the command shown in Listing 36 to list the data sources of each table registered in Zookeeper. The results are shown in Figure 43.

%vora SHOW DATASOURCETABLES USING com.sap.spark.vora OPTIONS(
zkurls "master.cluster:2181,secondary.cluster:2181",
nameNodeUrl "master.cluster:8020"
)

Listing 36 List Data Sources Registered in Zookeeper

Figure 43 Show Data Sources Results

Note
The names of the data sources do not need to match the table names created; however, matching names help to identify their relationships.

Clearing Zookeeper

During testing, a large number of data sources may be created, and periodically, it may be necessary to clear Zookeeper. Use the statement shown in Listing 37 to perform the cleanup. This statement does not delete the source data in HDFS; it simply resets all the Zookeeper references. Figure 44 and Figure 45 show the results.

import com.sap.spark.vora.client._
ClusterUtils.clearZooKeeperCatalog("master.cluster:2181")

Listing 37 Clear Zookeeper

Figure 44 Clear Zookeeper Results

Figure 45 Show Tables Results

Note
After cleaning up Zookeeper, the tables and data sources still in use will need to be re-created. These are only symbolic links to the source HDFS data, so they can be re-created very easily if you keep a list.
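One way to "keep a list" is to hold the CREATE TABLE statements in a Scala collection and replay them programmatically, using the same sapSqlContext.sql approach shown in Listing 38 below. The following is a minimal sketch (not from the original text) along those lines:

// Keep the Vora table definitions so they can be replayed after a
// Zookeeper cleanup (statement text taken from Listing 31)
val tableDefinitions = Seq(
  """CREATE TABLE testCSV (
    |COMPANYCODE VARCHAR(4),
    |AMOUNT_USD DOUBLE
    |)
    |USING com.sap.spark.vora
    |OPTIONS (
    |tableName "testCSV",
    |paths "/user/vora/sapData/part-00000",
    |hosts "master.cluster,secondary.cluster",
    |zkurls "master.cluster:2181,secondary.cluster:2181",
    |nameNodeUrl "master.cluster:8020"
    |)""".stripMargin
)

// Replay every definition to restore the symbolic links
tableDefinitions.foreach(stmt => sapSqlContext.sql(stmt))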

Create a Vora Table Programmatically

Listing 38 is example code in Scala that can be used to create a Vora table linked to the HDFS CSV files created in Section 2.1. The SQL statement can be called directly, and the results are shown in Figure 46.

sapSqlContext.sql( s"""CREATE TABLE testCSV (
  COMPANYCODE VARCHAR(4),
  AMOUNT_USD  DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
  tableName "testCSV",
  paths "/user/vora/sapData/part-00000",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020"
)""".stripMargin )

Listing 38 Create a Vora Table


Figure 46 Programmatically Create a Table

Using Vora with Large TPC-H Files

Presumably, you are using Hadoop and Vora because you already have very large files that need to be processed, which may not be cost effective to load and process in traditional databases. If you don't yet have access to large files but would like to test Vora at scale, then you will need to make much larger files than those used in our earlier examples. A simple way of generating large files is by using public benchmarking tools such as TPC-H and TPC-DS. For more information, see http://www.tpc.org/. There, you will find utilities for generating test files as large as you can handle. Alternatively, you can also download similar files from the AWS S3 Redshift Demo, https://s3.amazonaws.com/redshift-demo/, if you have the bandwidth. In this section, sample files from s3://redshift-demo/tpc-h/100/lineitem/ were downloaded and stored in HDFS in the /user/vora/lineitem directory. Figure 47 shows the first few rows of the TPC-H lineitem file. Notice the file has a "|" (vertical bar) delimiter.


Figure 47 TPC-H lineitem Data Example

Vora has a special CSV delimiter option, which enables these files to be loaded very easily with the SQL code shown in Listing 39. The result is shown in Figure 48.

%vora CREATE TABLE lineitemCSV (
  L_ORDERKEY      BIGINT,
  L_PARTKEY       BIGINT,
  L_SUPPKEY       BIGINT,
  L_LINENUMBER    INTEGER,
  L_QUANTITY      DECIMAL(15,5),
  L_EXTENDEDPRICE DECIMAL(15,5),
  L_DISCOUNT      DECIMAL(15,5),
  L_TAX           DECIMAL(15,5),
  L_RETURNFLAG    STRING,
  L_LINESTATUS    STRING,
  L_SHIPDATE      DATE,
  L_COMMITDATE    DATE,
  L_RECEIPTDATE   DATE,
  L_SHIPINSTRUCT  STRING,
  L_SHIPMODE      STRING,
  L_COMMENT       STRING
)
USING com.sap.spark.vora
OPTIONS (
  tableName "lineitemCSV",
  paths "/user/vora/lineitem/part-00000",
  csvdelimiter "|",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020"
)

Listing 39 Load TPC-H Files


Figure 48 Create lineitem Table

Next, we can count the number of records loaded with the SQL code shown in Listing 40. The results are shown in Figure 49.

%vora select count(*) from lineitemCSV

Listing 40 Count Record Loads

Figure 49 Count of lineitem Table


Appending HDFS Files to Vora Table

The SQL code shown in Listing 41 enables additional files to be added to the Vora table. This code does not copy the data; it only adds a link to each file, as shown in Figure 50.

%vora APPEND TABLE lineitemCSV
OPTIONS (
  paths "/user/vora/lineitem/part-00001,/user/vora/lineitem/part-00002",
  csvdelimiter "|",
  eagerLoad "true"
)

Listing 41 Link Files to a Vora Table

Figure 50 Append lineitem Table

Next, let's check the row count again (Figure 51); it is now 3,000,000 rows.

Figure 51 New Row Count Results

Convert Large CSV Files to Parquet

Large CSV files are fine for testing performance, but to optimize Vora SQL performance, you'll need to convert the TPC-H files into the Parquet columnar format.

Listing 42 shows some example code, using Apache Spark, to split the file based on the "|" (vertical bar) delimiter. The results are shown in Figure 52.

val fileRdd = sc.textFile("lineitem")
val splitRdd = fileRdd.map( line => {
  var lsplit = line.split('|')
  ( lsplit(0).toInt,
    lsplit(1).toLong,
    lsplit(2).toLong,
    lsplit(3).toInt,
    lsplit(4).toDouble,
    lsplit(5).toDouble,
    lsplit(6).toDouble,
    lsplit(7).toDouble,
    lsplit(8), lsplit(9), lsplit(10), lsplit(11),
    lsplit(12), lsplit(13), lsplit(14), lsplit(15) )
})

Listing 42 Split Large CSV Files

Figure 52 Use RDD for a Split Based on a Vertical Bar or Pipe Delimiter (|)


Next, using the code in Listing 43, the RDD is converted to a DataFrame, and the results are saved in the Parquet format with five partitions (resulting in five files, as shown in Figure 53).

splitRdd.toDF("l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
  "l_quantity", "l_extendedprice", "l_discount", "l_tax",
  "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate",
  "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment")
  .repartition(5)
  .save("lineitemParquet", "parquet")

Listing 43 Convert RDD to DataFrame

The repartition function is a crucial part of converting large files. A large number of very small files will result in very poor performance, while a single, very large file will cause other problems. File-size tuning depends on many factors, including the size of your cluster, available memory, and data retrieval patterns, as illustrated in the sketch that follows.
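The following is a rough heuristic sketch (not from the original text; the 128 MB target and the input size are assumptions for illustration) for deriving a partition count from the input size:

// Hypothetical sizing heuristic: aim for roughly 128 MB per output file.
// totalSizeBytes would come from, e.g., "hadoop dfs -du" on the input path.
val totalSizeBytes  = 750L * 1024 * 1024   // assumed input size: ~750 MB
val targetFileBytes = 128L * 1024 * 1024   // HDFS-friendly file size target
val numPartitions   = math.max(1, (totalSizeBytes / targetFileBytes).toInt)
// splitRdd.toDF(...).repartition(numPartitions).save("lineitemParquet", "parquet")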

Next, let's check whether the five Parquet files were created by using the code in Listing 44. The results are shown in Figure 54.

import scala.sys.process._
val lsResult = Seq("hadoop", "dfs", "-du", "-h", "/user/vora/lineitemParquet").!!

Listing 44 Check Created Files


Figure 54 Check Files in HDFS

Now, using the code in Listing 45, we can create a table using the new Parquet files. The results are shown in Figure 55.

%vora CREATE TABLE lineitemParquet (
  L_ORDERKEY      BIGINT,
  L_PARTKEY       BIGINT,
  L_SUPPKEY       BIGINT,
  L_LINENUMBER    INTEGER,
  L_QUANTITY      DECIMAL(15,5),
  L_EXTENDEDPRICE DECIMAL(15,5),
  L_DISCOUNT      DECIMAL(15,5),
  L_TAX           DECIMAL(15,5),
  L_RETURNFLAG    STRING,
  L_LINESTATUS    STRING,
  L_SHIPDATE      DATE,
  L_COMMITDATE    DATE,
  L_RECEIPTDATE   DATE,
  L_SHIPINSTRUCT  STRING,
  L_SHIPMODE      STRING,
  L_COMMENT       STRING
)
USING com.sap.spark.vora
OPTIONS (
  tableName "lineitemParquet",
  paths "/user/vora/lineitemParquet/part-r-00000-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00001-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00002-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00003-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00004-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)

Listing 45 Create a Table with Parquet Files

Figure 55 Create lineitem Parquet Table

Even with tens of millions of records and a very small Hadoop cluster, you can still achieve very high performance. As shown at the bottom of Figure 56, the Spark SQL query responded in two seconds.

Figure 56 Performance Test on lineitem Table
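If you want to take simple timings like this outside of the notebook UI, a minimal sketch (not from the original text) is to wrap the query in a wall-clock measurement:

// Sketch: time a query from the Spark shell; the table name assumes the
// lineitemParquet table created above.
val start = System.nanoTime
sqlContext.sql("select count(*) from lineitemParquet").show()
val elapsedSec = (System.nanoTime - start) / 1e9
println(f"Query took $elapsedSec%.2f s")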


3.2 Consuming SAP HANA Data

This section covers some of Vora's major functions for reading from and writing to SAP HANA. In this example, SAP HANA SP10, AWS Developer Edition, is used in conjunction with Vora. See http://cal.sap.com to launch a trial SAP HANA system.

Create a Simple Table in SAP HANA

By now, you are probably familiar with the simple test dataset used in this E-Bite. Execute the SQL code in Listing 46 in SAP HANA to create a table and populate it with some entries. The results are shown in Figure 57 and Figure 58.

drop table SAPDATA;
create column table SAPDATA (
  companycode  varchar(4),
  accountgroup varchar(10),
  amount_usd   double --decimal(18,2)
);
insert into "CODEJAMMER"."SAPDATA" values('GB01', 'Revenue', 1000);
insert into "CODEJAMMER"."SAPDATA" values('US01', 'Revenue', 5000);
insert into "CODEJAMMER"."SAPDATA" values('US01', 'Expense', -3000);
insert into "CODEJAMMER"."SAPDATA" values('AU01', 'Revenue', 300);

Listing 46 Create a Table in SAP HANA

Figure 57 Create a Table in SAP HANA


Figure 58 Data Preview of a New Table in SAP HANA

Query an SAP HANA Table from Vora

It is now possible to access this SAP HANA table directly from Vora, using the code in Listing 47. No data is copied to Hadoop unless a query is executed. The results are shown in Figure 59.

%vora CREATE TABLE HANA_SAPDATA
USING com.sap.spark.hana
OPTIONS (
  path "SAPDATA",
  dbschema "CODEJAMMER",
  host " ",
  instance "00",
  user "CODEJAMMER",
  passwd "CodeJam2015"
)

Listing 47 Access an SAP HANA Table from Vora


Figure 59 Create an SAP HANA Table in Vora

Next, we can execute the SQL statement shown in Listing 48 in Vora to retrieve data from SAP HANA. The results are shown in Figure 60.

%vora select * from HANA_SAPDATA

Listing 48 Retrieving Data from SAP HANA

Figure 60 Search an SAP HANA Table in Vora
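Aggregations work the same way; for example, a small grouped query like the following (a sketch, not from the original text) can be delegated to SAP HANA where possible:

%vora select COMPANYCODE, sum(AMOUNT_USD) as SUM_AMOUNT_USD
from HANA_SAPDATA
where ACCOUNTGROUP = 'Revenue'
group by COMPANYCODE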


Create a Simple View in SAP HANA

Vora is also able to read SAP HANA views. To demonstrate this, first we need to create a simple view that joins the following tables: SFLIGHT, SCARR (on CARRID), and M_TIME_DIMENSION (on FLDATE/SAPDATE). Create a view similar to the one shown in Figure 61.

Figure 61 Basic SAP HANA View on SFLIGHT

The results of the data preview in SAP HANA are shown in Figure 62.

Figure 62 Data Preview of SFLIGHT View

Query an SAP HANA View from Vora

Your SAP HANA view can now be easily accessed directly through Vora, using the code in Listing 49. The results are shown in Figure 63.

%vora CREATE TABLE HANA_SFLIGHT_VIEW
USING com.sap.spark.hana
OPTIONS (
  path "CODEJAMMER/CA_SLFIGHT_001",
  dbschema "_SYS_BIC",
  host " ",
  instance "00",
  user "CODEJAMMER",
  passwd "CodeJam2015"
)

Listing 49 Access an SAP HANA View from Vora


Figure 63 Create an SAP HANA View Table in Vora

Next, we can query the view from Vora with the SQL code in Listing 50. The results are shown in Figure 64.

%vora select DATE_SQL, sum(PRICE) as SUM_PRICE
from HANA_SFLIGHT_VIEW group by DATE_SQL

Listing 50 Query an SAP HANA View from Vora

Figure 64 Results of an SAP HANA View in Vora

We can check whether the query was executed in SAP HANA by following the menu path Admin Console • Performance • SQL Plan Cache in SAP HANA Studio. The results are shown in Figure 65.


Figure 65 Subquery Called in SAP HANA

Use Apache Spark to Write Data to SAP HANA

Within Apache Spark, you can now write code to create tables or append data to SAP HANA. The code shown in Listing 51 demonstrates how to create a new table in SAP HANA (table SAPDATA2) and insert a single row of information into it. The results are shown in Figure 66.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val nameNodeHostAndPort = "master.cluster:8020"
//val sqlcSAP = new SapSQLContext(sc)

//This table name is for the Vora instance
val tableNameVora = "HANA_SAPDATA"
//This table name holds for SAP HANA
val tableNameHana = "SAPDATA2"
//Database schema name in SAP HANA
val dbSchema = "CODEJAMMER"
//SAP HANA host
val host = "52.73.12.173"
//SAP HANA instance ID
val instance = "00"
//User name and password
val user = "CODEJAMMER"
val passwd = "CodeJam2015"

//New data row to add, built via a DataFrame
val companyDataRDD = Seq(("TH01", "Revenue", 100.00))
  .toDF("COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD").rdd
//val companyDataDF = Seq(("SG01", "Revenue", 100.00)).toDF("COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD")
val schema = StructType(Array(
  StructField("COMPANY", StringType, false),
  StructField("ACCOUNTGROUP", StringType, false),
  StructField("AMOUNT_USD", DoubleType, false)))
val companyDataDF = sqlContext.createDataFrame(companyDataRDD, schema)

val configuration = Map(("host" -> host), ("instance" -> instance),
  ("user" -> user), ("passwd" -> passwd))
val writeOptions = configuration + ("path" -> tableNameHana) + ("dbschema" -> dbSchema)

//Create table and insert (fails if the table already exists)
//companyDataDF.write.format("com.sap.spark.hana").mode(SaveMode.ErrorIfExists).options(writeOptions).save()
//Create / append
companyDataDF.write.format("com.sap.spark.hana").mode(SaveMode.Append).options(writeOptions).save()

//Check results
val queryResult = sqlContext.sql(s"""show tables""")
//val queryResult = sqlContext.sql(s"""select * from $tableNameVora""")
queryResult.collect().foreach(println)

Listing 51 Create a Table and Insert a Single Row in SAP HANA


Figure 66 Create a Table in SAP HANA Using Vora

We can check that the new table SAPDATA2 has been created in SAP HANA and inspect its contents, as shown in Figure 67.
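The new table can, of course, also be read back through Vora; the following sketch (not from the original text; the table name HANA_SAPDATA2 is chosen here for illustration) registers it just as in Listing 47:

%vora CREATE TABLE HANA_SAPDATA2
USING com.sap.spark.hana
OPTIONS (
  path "SAPDATA2",
  dbschema "CODEJAMMER",
  host "52.73.12.173",
  instance "00",
  user "CODEJAMMER",
  passwd "CodeJam2015"
)

A simple select then confirms the appended row:

%vora select * from HANA_SAPDATA2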


Figure 67 New SAP HANA Table, Created from Vora

3.3 Cross-Consumption: Join SAP HANA and Vora Data

In this section, a simple cross-consumption query is executed to check how query logic is passed down to SAP HANA. In Section 3.1, a small table, testCSV, was created in Vora. In Section 3.2, a small table, SAPDATA, was created in SAP HANA and made accessible in Vora as HANA_SAPDATA.


Now, we can execute the SQL code shown in Listing 52 to join these tables and analyze how the query logic is passed down. Notice there is a where clause on HANA_SAPDATA (using table alias A) for company code (A.COMPANYCODE = 'US01'). The results are shown in Figure 68.

%vora select A.*, B.AMOUNT_USD as OTHER_AMOUNT_USD
from HANA_SAPDATA A
inner join testCSV B on A.COMPANYCODE = B.COMPANYCODE
where A.COMPANYCODE = 'US01'

Listing 52 Join Tables to Analyze a Query

Figure 68 Results of Cross-Consumption Query

In SAP HANA, we can check the subquery executed by following the menu path Admin Console • Performance • SQL Plan Cache. The results are shown in Figure 69.


Figure 69 Subquery Called by Vora

Examples of the SQL generated in SAP HANA by Vora are shown in Listing 53.

select count(*) as NUM_ROWS from (
  SELECT "_table1"."COMPANYCODE", "_table1"."ACCOUNTGROUP", "_table1"."AMOUNT_USD"
  FROM "CODEJAMMER"."SAPDATA" AS "_table1"
  WHERE ("_table1"."COMPANYCODE" = 'US01')
)

SELECT "_table1"."COMPANYCODE", "_table1"."ACCOUNTGROUP", "_table1"."AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA" AS "_table1"
WHERE ("_table1"."COMPANYCODE" = 'US01')
LIMIT 2 OFFSET 0

Listing 53 SQL Generated in SAP HANA


Listing 53 shows a positive result because it indicates that the where clause has been passed down from Vora to the subquery executed in SAP HANA. If we run a similar Vora SQL statement in which the where clause is instead on testCSV for company code, the logic is not passed down to the subqueries in SAP HANA (see Listing 54).

select count(*) as NUM_ROWS from (
  SELECT "COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD"
  FROM "CODEJAMMER"."SAPDATA"
)

SELECT "COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA"
LIMIT 4 OFFSET 0

Listing 54 SQL Generated in SAP HANA with Missing where Clause
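For reference, a Vora query of the shape that triggers this behavior might look like the following sketch (an assumption for illustration; the original text does not show the exact statement):

%vora select A.COMPANYCODE, B.AMOUNT_USD
from HANA_SAPDATA A
inner join testCSV B on A.COMPANYCODE = B.COMPANYCODE
where B.COMPANYCODE = 'US01'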

Another observation is that, if only a reduced number of columns is specified in the Vora SQL (for example, in Listing 55), this reduced selection successfully passes down to SAP HANA.

SELECT "_table1"."COMPANYCODE", "_table1"."AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA" AS "_table1"
LIMIT 4 OFFSET 0

Listing 55 SQL Generated in SAP HANA with Reduced Columns

3.4 Additional Business Functions

In the future, SAP plans to add business functions to Vora that are not supported in Apache Spark. The first collection of available functions includes hierarchy support and unit of measure (UoM) conversion.

Working with Hierarchies

Hierarchical data structures define a parent-child relationship between different data items, providing an abstraction that enables complex computations to be performed on different levels of data.


An organization, for example, is basically a hierarchy in which the connections between nodes (for example, between managers and developers) are determined by the reporting lines defined by that organization. Since it is very difficult to use standard SQL to work with and analyze hierarchical data, Spark SQL has been enhanced to provide the missing hierarchy functionality. To utilize hierarchies, a custom hierarchy table needs to be defined for each object with a hierarchy, similar to the one shown in Table 2.

Key    Predecessor (INT)   Successor (INT)   Order (INT)
US01   None                1                 1
GB01   1                   2                 1
AU01   1                   3                 2
GB02   3                   4                 1
GB03   3                   5                 2

Table 2 Example Hierarchy Relationship (h_src)

The SQL for joining a hierarchy table with a transactional table is shown in Listing 56.

SELECT B.name, A.address, B.level
FROM (SELECT name, LEVEL(node) AS level
      FROM HIERARCHY (
        USING h_src AS v
        JOIN PARENT u ON v.pred = u.succ
        SEARCH BY ord ASC
        START WHERE pred IS NULL
        SET node) AS H) B,
     addresses A
WHERE B.name = A.name

Listing 56 Join a Hierarchy Table with a Transactional Table
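To experiment with this, the hierarchy table itself can be prepared as a temporary table first; the following is a minimal sketch (not from the original text; column names follow Listing 56 and Table 2):

// Sketch: register the h_src hierarchy from Table 2 as a temp table so the
// HIERARCHY query above can reference it. Option[Int] models the NULL
// predecessor of the root node.
val hSrc = Seq(
  ("US01", None,    1, 1),
  ("GB01", Some(1), 2, 1),
  ("AU01", Some(1), 3, 2),
  ("GB02", Some(3), 4, 1),
  ("GB03", Some(3), 5, 2)
).toDF("name", "pred", "succ", "ord")
hSrc.registerTempTable("h_src")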


UoM Conversion

The SAP HANA data source allows you to use user-defined functions (UDFs) that are implemented solely in SAP HANA (that is, they do not exist in Apache Spark). You can do this by using the "$" prefix in the select statement to convert a measure, as shown in Listing 57.

import org.apache.spark.sql._

val sqlc = new SapSQLContext(sc)
lazy val configuration = Map(("host" -> ""), ("instance" -> "00"),
  ("user" -> "CODEJAMMER"), ("passwd" -> "CodeJam2015"))
lazy val sampleInputConf = configuration + ("path" -> "SAPDATA") + ("dbschema" -> "CODEJAMMER")

val sampleInputRelation = sqlc.read.format("com.sap.spark.hana")
  .options(sampleInputConf).load()
sampleInputRelation.registerTempTable("SAMPLE_INPUT")

var query = sqlc.sql("Select $CONVERT_UNIT" +
  "(QUANT,SOURCE_UNIT,'CODEJAMMER',TARGET_UNIT,'000') as converted " +
  "FROM SAPDATA")
query.collect().foreach(println)

Listing 57 Use the "$" Prefix to Apply SAP HANA UDFs to Results

3.5 Additional Information on Vora

For a more detailed look at Vora, please check out the following links:

» SAP HANA Vora, YouTube
  https://www.youtube.com/playlist?list=PLkzo92owKnVyyNRkogvGloXYcDtLqeByX
» SAP HANA Academy, GitHub
  https://github.com/saphanaacademy/Vora
» SAP HANA Vora 1.1, SAP Help Portal Page
  http://help.sap.com/hana_vora.html
» SAP HANA Vora Installation and Developer Guide
  http://help.sap.com/Download/Multimedia/hana_vora/SAP_HANA_Vora_Installation_Developer_Guide_en.pdf
» SAP HANA Vora Troubleshooting
  http://scn.sap.com/blogs/vora/2015/12/09/sap-hana-vora--troubleshooting
» Tagged Questions, Stack Overflow
  http://stackoverflow.com/questions/tagged/vora

4 Combining Apache Spark, SAP HANA, and Vora

With some basic knowledge of Apache Spark functions and Vora, we are now ready to use them together to transform and analyze data.

In the following section, Apache Spark will be used to load and transform oil price data and New York Times news articles relating to oil over the past ten years. SAP HANA will also be populated with net income figures from several major oil companies for the same years. Finally, Vora will be used to combine and analyze these datasets, which are stored in Hadoop HDFS and SAP HANA.

4.1 Download Oil Price Data Using Apache Spark

In this section, we will use Apache Spark to retrieve oil price data and store this data in Hadoop HDFS for subsequent processing by Vora. As shown in Figure 70, the U.S. Energy Information Administration (http://www.eia.gov/opendata/) has an API for downloading open datasets, such as oil price data, at http://api.eia.gov/series/?api_key=YOUR_API_KEY_HERE&series_id=PET.RBRTE.D.


Figure 70 U.S. Energy Information Administration API

To access this data, you must get an API key. Sign up for free access at http://www.eia.gov/opendata/register.cfm. You'll receive an email with your API key, as shown in Figure 71.

Figure 71 U.S. Energy Information Administration API Key

Figure 71 U.S Energy Information Administration API Key

The example code shown in Listing 58 can be used to call the API, down­ load the JSON response, explode the first level of the JSON, and finally display these results to the screen, as shown in Figure 72. //Function converts Remote JSON URL Response to a Data Frame def getJsonUrlAsDF(url: String): org. apache. spark. s q l . DataFrame = { val client

65

|

4

C o m b in in g A pache Spark, SAP H A N A , and V ora

org.apache.http impl.client.HttpClientBui I d e r . c r e a t e ( ) . bui 1d() val request = new org.apache.http.client.methods. HttpGet(url) val response - cli e n t . execute( request) val handler = new org.apache.http.impl.client.BasicResponseHandler() //Get Response String var responseString = handler.handleResponse(response).trim //Convert to RDD var responseRDD

= s c . p ar a l 1elize( responsestring :: Nil)

//Return Data Frame return sqlContext.read.j son ( responseRDD).toDF() //Download Latest Oil Data var responseDF = getJsonUrlAsDF("h t t p : / / a p i . e i a . gov/series/?api_key= &series_id=PET. RBRTE. DM) //Explode JSON and Select Relevant Coluns var o i 1DataDF responseDF.withColumn("Data", explode($"series. data" ) ) .withColumn("Ele ment", explode($"Data" ) ) . s e l e c t ( "Element") oilDataDF.show(5) Listing

58 Download Oil Price Data

Figure 72 Oil Data Initial Results


From the above results, we aren't able to make sense of the row listing. We can check the schema again with the command shown in Listing 59. The results are shown in Figure 73.

oilDataDF.printSchema()

Listing 59 Check the Schema

Figure 73 printSchema Results

In this case, we aren't able to explode or select the columns any further. We need to perform some RDD map operations to split each row into multiple columns and then save the reorganized data to HDFS by using the code in Listing 60. The results are shown in Figure 74.

case class oilPriceStruct( Date: java.sql.Date, Price_BBL: Double)

val oilDataFixedDF = oilDataDF.rdd.map( row => {
  var rowString = row(0).toString
  var rowSplit = rowString.substring(5, rowString.length - 1).split(",")

  //Date SQL formatting
  var dateStr = rowSplit(0)
  var simpleDateFormat: java.text.SimpleDateFormat =
    new java.text.SimpleDateFormat("yyyyMMdd")
  var utilDate: java.util.Date = simpleDateFormat.parse(dateStr)
  var sqlDate: java.sql.Date = new java.sql.Date(utilDate.getTime())

  //Return in structured format
  oilPriceStruct(sqlDate, rowSplit(1).toDouble)
}).toDF()

oilDataFixedDF.repartition(1).save("oilDataParquet", "parquet")
oilDataFixedDF.sort($"Date".desc).show(50)

Listing 60 Split a Row into Multiple Columns with RDD Operations
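As a quick sanity check (a sketch, not from the original text), the saved Parquet data can be read straight back and counted:

// Sketch: read the freshly written Parquet data back and verify the row count.
val checkDF = sqlContext.read.parquet("oilDataParquet")
println(checkDF.count())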

Figure 74 Oil Data Final Results

We now have the results saved in HDFS in the Parquet format.

4.2 Download NY Times Articles Using Apache Spark

The New York Times has an API that returns ten articles (per page) based on specific search criteria. A maximum of 100 pages per search can be returned. In the following steps, we will use Apache Spark to automate the download of all New York Times articles with "oil" in the title over the past ten years and store the results in Hadoop HDFS for subsequent processing by Vora.


Sign Up for API

First, as shown in Figure 75 and Figure 76, sign up for access to the article search API at http://developer.nytimes.com/.

Figure 75 Request a NY Times API Key

Figure 76 New NY Times API Key


Note
The NY Times API is free to use; however, there are limits on the number of searches made per second and per day, so take care when writing Spark logic that calls it in a loop.

Perform a Single Search

Next, we will use Spark to perform a single search for articles where "oil" is mentioned in the article title. We also need to perform some date-type manipulation to allow the results to be compared with the oil price data loaded earlier in Section 4.1. Execute the code shown in Listing 61 to use the API to perform a single-page search. The results are shown in Figure 77 and Figure 78.

import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.DataTypes._

//Convert timestamp string to date ... (NOTE: the newer cast(DateType) works from Spark 1.5)
def dtStrToDate(dateStr: String): java.sql.Date = {
  if (dateStr != null) {
    var UTCDateFormat: java.text.SimpleDateFormat =
      new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
    UTCDateFormat.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
    var utilDate: java.util.Date = UTCDateFormat.parse(dateStr)
    var sqlDate: java.sql.Date = new java.sql.Date(utilDate.getTime())
    return sqlDate
  } else return null
}
val dtStrToDate_udf = udf(dtStrToDate _)

def searchNYTimes(search: String, page: Int): org.apache.spark.sql.DataFrame = {
  Thread.sleep(100)
  val queryUrl = "http://api.nytimes.com/svc/search/v2/articlesearch.json?" +
    search + "&api-key=YOUR_API_KEY_HERE&page=" + page

  val client = org.apache.http.impl.client.HttpClientBuilder.create().build()
  val request = new org.apache.http.client.methods.HttpGet(queryUrl)
  val response = client.execute(request)
  val handler = new org.apache.http.impl.client.BasicResponseHandler()
  var results = handler.handleResponse(response).trim

  var resultsRDD = sc.parallelize(results :: Nil)
  val resultsJSON = sqlContext.read.json(resultsRDD)
  var resultsExpDF = resultsJSON.toDF().withColumn("Docs", explode($"response.docs"))
  if (resultsExpDF.count() > 0) {
    resultsExpDF = resultsExpDF.withColumn("web_url", $"Docs.web_url")
      .withColumn("source", $"Docs.source")
      .withColumn("pub_date", dtStrToDate_udf($"Docs.pub_date"))
      .withColumn("news_desk", $"Docs.news_desk")
      .withColumn("section_name", $"Docs.section_name")
      .withColumn("subsection_name", $"Docs.subsection_name")
      .withColumn("snippet", $"Docs.snippet")
      .drop("copyright").drop("Docs").drop("response").drop("status")
  }
  return resultsExpDF
}

var x = searchNYTimes("q=oil+price&fq=news_desk:(%22Business%22,%22Energy%22," +
  "%22Environment%22,%22Financial%22)&begin_date=20070101&end_date=20071231", 1)
x.show()
x.count()

Listing 61 Single Page Search for "Oil" in Article Title
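Once the helper is defined, other searches are one-liners; for example, the following sketch (not from the original text) pulls a page of 2015 articles and shows a couple of columns:

// Example usage (sketch): reuse searchNYTimes for a different year range.
val articles2015 = searchNYTimes(
  "q=oil+price&fq=news_desk:(%22Business%22,%22Energy%22)" +
  "&begin_date=20150101&end_date=20151231", 0)
articles2015.select("pub_date", "snippet").show(5)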

Figure 77 Results of Single Page Search of "Oil" in Article Header, Part 1

Figure 78 Results of Single Page Search of "Oil" in Article Header, Part 2

For a single search criterion, the NY Times API only supports up to 100 pages of results. To broaden the search, narrower selection criteria are needed (e.g., by year), and the results then need to be combined using union statements.

Download a Range of New York Times Articles

Now, we will use Spark to download all news articles related to "oil" created over the last ten years. The code shown in Listing 62 illustrates how the API can be called multiple times and the results combined and saved into Hadoop HDFS, as shown in Figure 79.

import org.apache.spark.sql._

var unionResultsDF: DataFrame = null
var tempDF: DataFrame = null
var totalPages: Int = 100 //currently max 100 supported
var searchTerm: String = "q=oil+price&fq=news_desk:(%22Business%22,%22Energy%22," +
  "%22Financial%22)&begin_date=YEARNUM0101&end_date=YEARNUM1231"

for (yearNum <- 2007 to 2017) {
  tempDF = null
  for (pageNum <- 0 to (totalPages - 1)) {
    //println(pageNum)
    if (pageNum == 0 || tempDF.count() > 0) {
      var search = searchTerm.replace("YEARNUM", yearNum.toString)
      tempDF = searchNYTimes(search, pageNum)
      if (unionResultsDF != null) {
        unionResultsDF = unionResultsDF.unionAll(tempDF)
      } else {
        unionResultsDF = tempDF
      }
    }
  }
}

unionResultsDF.show(40)
unionResultsDF.repartition(1).save("nytimesParquet", "parquet")

Listing 62 Download Ten Years of Articles
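Once the job has finished, the saved summary can be read back for a quick look; the following sketch (not from the original text) counts the stored articles per year:

// Sketch: read the stored article summary and count articles per year.
import org.apache.spark.sql.functions.year
val nytDF = sqlContext.read.parquet("nytimesParquet")
nytDF.groupBy(year($"pub_date").as("year")).count().show()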

Figure 79 Results of an Iterative Search of the New York Times


We have now successfully saved a summary of ten years of articles on Hadoop HDFS.
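As a quick sanity check (not part of the original walkthrough), you can read the Parquet file back into a DataFrame and inspect it. This sketch assumes the sqlContext that the Zeppelin Spark interpreter provides:

// Read the saved articles back from HDFS and verify row count and schema.
val savedDF = sqlContext.read.parquet("nytimesParquet")
println(savedDF.count())
savedDF.printSchema()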

4.3 Store Net Income Results Using SAP HANA

Now, we will create a table in SAP HANA to store the average net income (in billions) for several of the world's largest oil companies over the past ten years. Execute the SQL code in Listing 63 in SAP HANA Studio to create a table and populate it with the average, publicly listed, quarterly results.

DROP TABLE CODEJAMMER.NETINCOME;
CREATE COLUMN TABLE CODEJAMMER.NETINCOME(
  CALQUARTER VARCHAR(5),
  NETINCOME DECIMAL(18,5)
);
insert into CODEJAMMER.NETINCOME values('20071',26.72);
insert into CODEJAMMER.NETINCOME values('20072',28.03);
insert into CODEJAMMER.NETINCOME values('20073',28.19);
insert into CODEJAMMER.NETINCOME values('20074',31.2);
insert into CODEJAMMER.NETINCOME values('20081',30.81);
insert into CODEJAMMER.NETINCOME values('20082',31.92);
insert into CODEJAMMER.NETINCOME values('20083',35.17);
insert into CODEJAMMER.NETINCOME values('20084',25.25);
insert into CODEJAMMER.NETINCOME values('20091',20.33);
insert into CODEJAMMER.NETINCOME values('20092',16.55);
insert into CODEJAMMER.NETINCOME values('20093',13.64);
insert into CODEJAMMER.NETINCOME values('20094',16.87);
insert into CODEJAMMER.NETINCOME values('20101',18.98);
insert into CODEJAMMER.NETINCOME values('20102',19.99);
insert into CODEJAMMER.NETINCOME values('20103',20.3);
insert into CODEJAMMER.NETINCOME values('20104',25.06);
insert into CODEJAMMER.NETINCOME values('20111',27.84);
insert into CODEJAMMER.NETINCOME values('20112',29.81);
insert into CODEJAMMER.NETINCOME values('20113',31.74);
insert into CODEJAMMER.NETINCOME values('20114',30.41);
insert into CODEJAMMER.NETINCOME values('20121',30.7);
insert into CODEJAMMER.NETINCOME values('20122',31.21);
insert into CODEJAMMER.NETINCOME values('20123',27.18);
insert into CODEJAMMER.NETINCOME values('20124',27.48);
insert into CODEJAMMER.NETINCOME values('20131',26.76);
insert into CODEJAMMER.NETINCOME values('20132',24.85);
insert into CODEJAMMER.NETINCOME values('20133',24.59);
insert into CODEJAMMER.NETINCOME values('20134',22.91);
insert into CODEJAMMER.NETINCOME values('20141',21.37);
insert into CODEJAMMER.NETINCOME values('20142',23.32);
insert into CODEJAMMER.NETINCOME values('20143',22.69);
insert into CODEJAMMER.NETINCOME values('20144',19.43);
insert into CODEJAMMER.NETINCOME values('20151',16.26);
insert into CODEJAMMER.NETINCOME values('20152',14.06);
insert into CODEJAMMER.NETINCOME values('20153',6.29);
insert into CODEJAMMER.NETINCOME values('20154',2.36);

Listing 63 SQL to Generate NETINCOME Results in SAP HANA
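If you'd rather not type all 36 INSERT statements by hand, they can be generated from quarter/value pairs. This is an optional sketch, not part of the original listing:

// Generate the INSERT statements of Listing 63 from (quarter, value) pairs.
val quarterlyNetIncome = Seq(
  "20071" -> 26.72, "20072" -> 28.03, "20073" -> 28.19, "20074" -> 31.2
  // ... continue through "20154" -> 2.36 with the values from Listing 63
)
quarterlyNetIncome.foreach { case (quarter, amount) =>
  println(s"insert into CODEJAMMER.NETINCOME values('$quarter',$amount);")
}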

Next, create a calculation view that combines the new NETINCOME table with the time dimension table (joined on the CALQUARTER column) to enable date-specific searches to be executed, as shown in Figure 80.

Figure 80 SAP HANA Net Income Calculation View


4.4 Analyze Oil Data Using Vora

Finally, we will use Vora to access each of the datasets created in the earlier steps. We will then use Vora to combine and visualize the results.

Oil Price Data

Create a Vora table using the code in Listing 64 to access the oil price Parquet file (/user/vora/oilDataParquet/part-r-00000-4a549f81-9b0f-4f0a-be13-15ba7213a52e.gz.parquet) created by Apache Spark in Section 4.1. The resulting price table is shown in Figure 81.

%vora
CREATE TABLE OILDATA (
  PRICE_DATE DATE,
  PRICE_BBL DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
  tableName "OILDATA",
  paths "/user/vora/oilDataParquet/part-r-00000-4a549f81-9b0f-4f0a-be13-15ba7213a52e.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)

Listing 64 Create a Vora Table Linked to Oil Price Data

Figure 81 Vora Oil Price Table
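The same DDL can also be issued from a Scala paragraph instead of the %vora interpreter. The sketch below assumes the SapSQLContext class shipped with the Vora Spark extension and the notebook's SparkContext sc; check your Vora version's documentation for the exact setup:

// Run this instead of (not in addition to) the %vora paragraph in Listing 64.
import org.apache.spark.sql.SapSQLContext

val vc = new SapSQLContext(sc)
vc.sql("""CREATE TABLE OILDATA (PRICE_DATE DATE, PRICE_BBL DOUBLE)
USING com.sap.spark.vora
OPTIONS (
  tableName "OILDATA",
  paths "/user/vora/oilDataParquet/part-r-00000-4a549f81-9b0f-4f0a-be13-15ba7213a52e.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)""")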

Execute a test query on the table using the code in Listing 65. You'll see oil price changes displayed as in Figure 82.


%vora
select * from OILDATA order by PRICE_DATE asc

Listing 65 Check Oil Price Fluctuations

Figure 82 Chart Showing Oil Price Changes
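Before moving on, it may be interesting to smooth the daily prices in Figure 82 into yearly averages. Here is a sketch using the Spark DataFrame API (Spark 1.5 or later), assuming the SapSQLContext vc from the sketch above:

// Average oil price per calendar year, computed on the Vora OILDATA table.
import org.apache.spark.sql.functions.{avg, col, year}

val oilDF = vc.table("OILDATA")
oilDF.groupBy(year(col("PRICE_DATE")).as("yr"))
  .agg(avg(col("PRICE_BBL")).as("avg_price_bbl"))
  .orderBy("yr")
  .show()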

Oil News Articles

Create a Vora table using the code in Listing 66 to access the news article Parquet file (/user/vora/nytimesParquet/part-r-00000-82a4db8e-a59d-4fec-8c66-b670c151a328.gz.parquet) created by Apache Spark in Section 4.2. The results are shown in Figure 83.

%vora
CREATE TABLE NYTIMES (
  WEBURL VARCHAR(100),
  SOURCE VARCHAR(100),
  PUBDATE VARCHAR(10),
  NEWSDESK VARCHAR(100)
)
USING com.sap.spark.vora
OPTIONS (
  tableName "NYTIMES",
  paths "/user/vora/nytimesParquet/part-r-00000-82a4db8e-a59d-4fec-8c66-b670c151a328.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)

Listing 66 Create a Vora Table Linked to Oil News Articles

Figure 83 Vora Table Linked to Oil News Articles

Execute a test query on the table using the code in Listing 67. The results are shown in Figure 84.

%vora
select PUBDATE, count(*) from NYTIMES
--where PUBDATE < '2016-02-05'
group by PUBDATE
order by PUBDATE asc

Listing 67 Articles Written Per Day for the Past Ten Years

Figure 84 Chart Showing the Number of Articles Written about Oil over Ten Years
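As an extra check (not in the original text), you can confirm which news desks made it into the table; given the news_desk filter used for the download, you would expect only Business, Energy, and Financial. A sketch assuming the vc context from the earlier sketches:

// Article counts per news desk, to sanity-check the download filter.
vc.sql("SELECT NEWSDESK, COUNT(*) FROM NYTIMES GROUP BY NEWSDESK ORDER BY NEWSDESK").show()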


Net Income Results

Using the code in Listing 68, you can create a Vora table to access the net income results stored in SAP HANA, which we created in Section 4.3. The results are shown in Figure 85.

%vora
CREATE TABLE HANA_NETINCOME
USING com.sap.spark.hana
OPTIONS (
  path "CODEJAMMER/CA_NETINCOME_01",
  dbschema "_SYS_BIC",
  host "",
  instance "00",
  user "CODEJAMMER",
  passwd "CodeJam2015"
)

Listing 68 Create a Vora Table Linked to Net Income Results

Figure 85 Vora Table Linked to Net Income Data

Execute a test query on the table using the code in Listing 69. The results are shown in Figure 86.

%vora
select * from HANA_NETINCOME

Listing 69 Show Contents of NETINCOME Table


Combining Results with Vora

Finally, with a tiny bit of manipulation using the code in Listing 70, we can overlay the results to see if there are any noticeable patterns between the number of articles written about oil, oil price fluctuations, and net income over the past ten years. (Note that the article count is multiplied by 10 and the quarterly net income by 4, so that all three series sit on a comparable scale in a single chart.)

%vora
select OILDATA.PRICE_DATE, OILDATA.PRICE_BBL, NI.NETINCOME, NYT.artCount
from OILDATA
left outer join (
  select PUBDATE, count(*)*10 as artCount
  from NYTIMES
  group by PUBDATE
) NYT on OILDATA.PRICE_DATE = NYT.PUBDATE
left outer join (
  SELECT cast(DATE_SQL as String) AS DSQL,
         cast(NETINCOME as double)*4 as NETINCOME
  FROM HANA_NETINCOME
) as NI on OILDATA.PRICE_DATE = NI.DSQL
where OILDATA.PRICE_DATE between '2007-01-01' and '2017-01-01'
order by OILDATA.PRICE_DATE asc

Listing 70 Generate a Graph to Compare Oil Prices, Articles, and Net Income


Figure 87 Combined Chart Showing Ten Years of Oil Data

Success! In Figure 87, you can clearly see an increase in the number of news articles written around the same time as major changes in oil prices. You can also see the predictable trend that the net income of oil companies aligns very closely with oil prices. You might even be able to infer that, in 2007, oil companies were better able to cope with lower oil prices than they are today. Perhaps, with more detailed sentiment analysis on the articles or more detailed analysis using corporate data, we may be able to make better predictions for tomorrow.

5 What's Next?

Now that you're familiar with SAP HANA Vora and its basic functions and configuration, what next? Vora helps you process big data faster and more efficiently than ever, and simplifies data management for SAP HANA and Hadoop. But there's more to SAP HANA data than Vora!

Recommendation from Our Editors

If you're looking to get knee-deep into data model design, SAP HANA Advanced Data Modeling is the book for you. With step-by-step instructions and sample coding, this book will teach you how to build and design predictive, simulation, and optimization models. From information views to AFL models, you'll learn to scale for large datasets and performance tune your models to perfection. Visit www.sap-press.com/3863 to learn more!

In addition to this book, our editors picked a few other SAP PRESS publications that you might also be interested in. Check out the next page to learn more!


More from SAP PRESS

Ready to get familiar with SAP HANA? From initial setup to connecting to a business intelligence platform, follow step-by-step instructions and detailed examples for a complete SAP HANA implementation.
Implementing SAP HANA: 860 pages, 2nd edition, pub. 12/2014
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3703

Learn your options for implementing SAP BW on SAP HANA. Find out how SAP HANA changes data modeling, reporting, and administration for SAP BW.
Implementing SAP BW on SAP HANA: 467 pages, pub. 05/2015
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3609

Looking to move past the SAP HANA basics and get knee-deep into some real data model design? Learn how to build and design predictive, simulation, and optimization models with this book. From information views to AFL models, you'll learn to scale for large datasets and performance tune your models to perfection!
SAP HANA Advanced Data Modeling: 392 pages, pub. 10/2015
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3863

SAP PRESS E-Bites

SAP PRESS E-Bites provide you with a high-quality response to your specific project need. If you're looking for detailed instructions on a specific task, need to become familiar with a small but crucial sub-component of an SAP product, or want to understand all the hype around product xyz, SAP PRESS E-Bites have you covered. Authored by the top professionals in the SAP universe, E-Bites provide the excellence you know from SAP PRESS, in a digestible electronic format, delivered (and consumed) in a fraction of the time!

Abani Pattanayak, Williams Ruter
Smart Data Access: Data Virtualization in SAP HANA
ISBN 978-1-4932-1372-6 | $14.99 | 95 pages

Aron MacDonald
Integrating SAP HANA and Hadoop
ISBN 978-1-4932-1293-4 | $12.99 | 105 pages

Tamas Szirtes, Aviad Rivlin
Implementing SAP Fiori Launchpad
ISBN 978-1-4932-1291-0 | $12.99 | 73 pages

The Author of this E-Bite

Aron MacDonald is an independent SAP HANA consultant, currently working for Shell Oil Company. His area of focus is the integration between SAP HANA and Hadoop. He is an SAP HANA Distinguished Engineer (HDE), which is an SAP program devoted to recognizing and promoting SAP HANA technical skills. Learn more about Aron at https://www.sap-press.com/sap-hana-vora-functions-and-integration_4137/author/.
