Aron MacDonald
SAP HANA Vora: Functions and Integration
Rheinwerk Publishing, Bonn • Boston
What You'll Learn

Learn to use SAP HANA Vora to extend Apache Spark for SAP data. Start out by discovering basic operations and how to use them. Then explore how SAP HANA Vora enables you to enhance business functions in your applications, consume Hadoop data in SAP HANA, and consume SAP HANA data in Hadoop.
1 Getting Started with SAP HANA Vora and Apache Spark

2 Transform Data with Apache Spark
   2.1 Using RDDs
   2.2 Using DataFrames

3 Consuming Data with Vora
   3.1 Consuming Hadoop Data
   3.2 Consuming SAP HANA Data
   3.3 Cross-Consumption: Join SAP HANA and Vora Data
   3.4 Additional Business Functions
   3.5 Additional Information on Vora

4 Combining Apache Spark, SAP HANA, and Vora
   4.1 Download Oil Price Data Using Apache Spark
   4.2 Download NY Times Articles Using Apache Spark
   4.3 Store Net Income Results Using SAP HANA
   4.4 Analyze Oil Data Using Vora

5 What's Next?
1 Getting Started with SAP HANA Vora and Apache Spark
SAP HANA Vora (hereafter referred to as "Vora") is SAP's new in-memory query engine, which plugs into the Apache Spark and Hadoop frameworks to provide interactive analysis of big data.

Hadoop is open-source software, designed to run on inexpensive commodity hardware, enabling businesses to store and process data at the petabyte scale. Hadoop is made up of a large collection of projects or libraries, covering areas like real-time data ingestion, document storage, and in-memory reporting engines. Hadoop is constantly evolving, and choosing the right project for your current needs requires careful consideration. At Hadoop's core are several projects:

» The Hadoop Distributed File System (HDFS)
» YARN, a second-generation map/reduce framework and resource manager
» Apache Spark, an in-memory data-processing engine

Over recent years, Apache Spark has evolved to become the de facto standard data-processing engine for Hadoop; it's fast and flexible and operates in memory (when the dataset fits). Vora expands upon Apache Spark by providing added business features as well as best-in-class integration with SAP HANA (using an organization's live corporate data), enabling cross-consumption reporting and advanced analytics at scale. Figure 1 illustrates how Hadoop, Vora, and SAP HANA might appear within an enterprise's big data architecture to manage transactional and behavioral datasets.
Figure 1 SAP HANA and Vora/Hadoop Reference Architecture (third-party analytic tools on top of transactional and behavioral data stores)
This E-Bite demonstrates, step by step, basic data transformation operations using Apache Spark and Vora's data consumption operations to get you started on your big data journey. For a more detailed overview of Vora, please check out SAP HANA Vora: An Overview at http://www.slideshare.net/SAPTechnology/sap-hana-voraan-overview. All examples shown in this E-Bite run on the SAP HANA Vora Developer Edition, which is free to try by registering at https://go.sap.com/cmp/syb/crm-xul5-int-voratrdm/index.html. Before we jump into technical and practical details, let's briefly discuss this software.
Once you have followed the AWS installation steps for Vora, you'll be presented with the screen shown in Figure 2. This Vora launchpad provides a brief overview of your newly created Hadoop cluster, running the Hortonworks distribution of Hadoop (HDP) and Vora. The Vora launchpad provides a user-friendly way of adding and removing nodes on your test cluster and also has important links to other installed services, including:

» Apache Ambari
A tool for provisioning, managing, and monitoring Apache Hadoop clusters

» Apache Zeppelin
A web-based notebook that enables interactive data analytics using SQL, Scala, Vora, and more

Figure 2 shows the Vora launchpad. Click on the links in the top-right corner to access the installed services.
Figure 2 Vora Launchpad
Figure 3 shows the Ambari console, where more complex configuration and fine-tuning of your Hadoop cluster can take place.
Figure 3 Ambari Console
In this E-Bite, Zeppelin is the primary tool used for interacting with Apache Spark and Vora. When you first launch Zeppelin, you are presented with a list of the notebooks (documents) you are working on, as shown in Figure 4. Within each notebook, code can be created, saved, and executed. Figure 5 shows an example of a basic Apache Spark function in a Zeppelin notebook. The top half of a notebook tile, highlighted in orange, shows your own custom Apache Spark or Vora code. The bottom half of the tile, highlighted in green, shows the results of executing the code.
Figure 4 Zeppelin Main Screen
Figure 5 Launch Apache Spark via a Zeppelin Tile
Note
Apache Spark and Vora support multiple programming languages, such as Scala, Python, and Java. As Scala is currently the most common language used with Apache Spark, the examples in this E-Bite are in Scala.
Within each Zeppelin notebook, SQL can also be executed and visualized, avoiding the need to jump to external reporting tools, especially during testing. Figure 6 shows an example of an oil price analysis performed using Apache Spark and Vora. This example will be discussed further in Section 4.
2 Transform Data with Apache Spark
Vora is an add-on to Apache Spark, providing enhanced SQL logic and business functionality. Apache Spark will continue to be used in conjunction with Vora when complex transformation logic is required to prepare data prior to query execution. To maximize the use of Vora, it is important to understand the basics of Apache Spark, especially when data preparation is required.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python, and R, as well as an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Apache Spark has many features, but at its core are the three main features covered in this E-Bite:

» Resilient distributed dataset (RDD)
RDDs are fault-tolerant collections of elements that can be operated on in parallel across your Hadoop cluster. There are two ways to create RDDs: parallelizing an existing collection in your custom program or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop input format (a short sketch follows the note below).

» Spark SQL
Spark SQL is a module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed. Internally, Spark SQL uses this extra information to perform extra optimizations.

» DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a DataFrame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.

Note
Spark also has other advanced functionality related to Spark Streaming and the Machine Learning Library (MLlib), useful for machine learning algorithms. Unfortunately, these are not covered in this brief overview. Please see http://spark.apache.org/mllib/ and http://spark.apache.org/streaming/ for additional info.
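The following is a minimal sketch (an illustration, not one of the numbered listings) of the two ways to create an RDD described above. It assumes a Zeppelin notebook where the SparkContext sc is already available, and the HDFS path used is hypothetical:

// 1. Parallelize an existing in-memory collection
val fromCollection = sc.parallelize(Array("GB01", "US01", "UK01"))

// 2. Reference a dataset in an external storage system (here: a text file on HDFS)
val fromStorage = sc.textFile("/user/vora/someFile.txt")

println(fromCollection.count())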
2.1 Using RDDs
RDDs have a long list of functions for manipulating and transforming data. Table 1 gives a brief overview of some of the common functions.

» General: map, filter, flatMap, mapPartitions, mapPartitionsWithIndex, groupBy, sortBy, reduce, collect, aggregate, fold, first, take, forEach, top, treeAggregate, treeReduce, forEachPartition, collectAsMap

» Math/Statistical: sample, randomSplit, count, takeSample, max, min, sum, histogram, mean, variance, stdev, sampleVariance, countApprox, countApproxDistinct

» Set Theory/Relational: union, intersection, subtract, distinct, cartesian, zip, takeOrdered

» Data Structure/I/O: keyBy, zipWithIndex, zipWithUniqueID, zipPartitions, coalesce, repartition, repartitionAndSortWithinPartitions, pipe, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveAsHadoopDataset, saveAsHadoopFile, saveAsNewAPIHadoopDataset, saveAsNewAPIHadoopFile

Table 1 Apache Spark RDD Functions
For a more complete list of the functions, with examples, see http://training.databricks.com/visualapi.pdf. This E-Bite uses several examples of these functions to help get you started on your Vora journey.

map Function
You can use the map function to transform a row of data from one type to another. For example, with the code in Listing 1, you can transform an RDD with a single column, containing a list of company codes, into a two-column structure. The first column will contain the company code, and the second will contain an integer of 1, as shown in Figure 7.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.map(obj => (obj, 1))

Listing 1 map Function
Figure 7 map Results
filter Function
You can use the filter function, as shown in Listing 2, to filter rows of the RDD. As shown in Figure 8, in this example, only the rows belonging to company code "US01" are returned.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.filter(obj => obj == "US01")

Listing 2 filter Function
Figure 8 filter Results
keyBy Function
Following the example code in Listing 3, you can use the keyBy function to create a new key column in the RDD. As shown in Figure 9, the first two digits of the company code are used to create a key.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.keyBy(w => w.substring(0, 2))
println(companyCodeRDDy.collect().mkString(", "))

Listing 3 keyBy Function
Figure 9 keyBy Results
countByKey Function
You can use the countByKey function, as shown in Listing 4, to count the occurrences of each key; the result is a map from each key to its count. The results are shown in Figure 10.
val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01")).keyBy(w => w.substring(0, 2))
val companyCodeRDDy = companyCodeRDDx.countByKey()

Listing 4 countByKey Function
Figure 10 countByKey Results
sample Function
Following the example in Listing 5, you can use the sample function to create a randomized subset of rows from the RDD. This sample subset, shown in Figure 11, might be particularly useful if you need to create representative subsets to test predictive analytic functions when working with very large datasets.

val companyCodeRDDx = sc.parallelize(Array(("GB01",100), ("US01",100), ("US01",300), ("AU01",100)))
val companyCodeRDDy = companyCodeRDDx.sample(false, 0.5)

Listing 5 sample Function
Figure 11 sample Results
distinct Function
Following the example code in Listing 6, you can use the distinct function to create a new RDD containing only distinct (different) values. The results are shown in Figure 12.

val companyCodeRDDx = sc.parallelize(Array("GB01", "US01", "US01", "UK01"))
val companyCodeRDDy = companyCodeRDDx.distinct()

Listing 6 distinct Function
Figure 12 distinct Results
reduceByKey Function
You can use the reduceByKey function, as shown in Listing 7, to create a new RDD with aggregated values based on a key. This function is similar to an SQL statement such as SELECT COMPANYCODE, SUM(VALUE) FROM <table> GROUP BY COMPANYCODE. The results are shown in Figure 13.

val companyCodeRDDx = sc.parallelize(Array(("GB01",100), ("US01",100), ("US01",300), ("AU01",100)))
val companyCodeRDDy = companyCodeRDDx.reduceByKey(_ + _).collect

Listing 7 reduceByKey Function
Figure 13 reduceByKey Results
union Function
Following the example code in Listing 8, you can use the union function to create a new RDD containing the union of rows from two identically structured RDDs, as shown in Figure 14. This function is similar to an SQL UNION statement.

val companyCodeRDDx = sc.parallelize(Array(("GB01",100), ("US01",100), ("US01",300)))
val companyCodeRDDy = sc.parallelize(Array(("AU01",100)))
val companyCodeRDDz = companyCodeRDDx.union(companyCodeRDDy)

Listing 8 union Function
Figure 14 union Results
join Function
You can use the join function to create a new RDD containing the join of different RDDs that share a common column, such as a transactional table and a master data table containing reporting attributes. This function is similar to an SQL join statement. In Listing 9, two RDDs are joined on a shared company code column to look up the country code attribute contained in the second RDD. The results are shown in Figure 15.

val companyCodeRDD = sc.parallelize(Array(("GB01",100), ("US01",100), ("US01",300), ("AU01",100)))
val companyCodeAttrRDD = sc.parallelize(Array(("GB01","GB"), ("US01","US")))
val companyCodeJoinRDD = companyCodeRDD.join(companyCodeAttrRDD).collect
val companyCodeLeftJoinRDD = companyCodeRDD.leftOuterJoin(companyCodeAttrRDD).collect

Listing 9 join Function
Figure 15 join Results
flatMap Function
You can use the flatMap function to split a column into a new RDD containing multiple rows. In the example code in Listing 10, an RDD contains a single row of company codes delimited by a comma. flatMap is used to return the company codes as multiple rows, as shown in Figure 16.

val sapDataRDD = sc.parallelize(Array("GB01,US01,US01,AU01"))
sapDataRDD.flatMap(x => x.split(",")).collect

Listing 10 flatMap Function
Figure 16 flatMap Results
map and split Functions
Listing 11 uses the map and split functions to transform a single column of data that contains comma-delimited information into two separate columns, as shown in Figure 17.

val sapDataRDD = sc.parallelize(Array("GB01,100", "US01,100", "US01,300", "AU01,100"))
sapDataRDD.map(x => {
  var splitStr = x.split(",")
  (splitStr(0), splitStr(1))
}).collect

Listing 11 map and split Functions
Figure 17 map and split Results
Use Patterns to Map into Multiple Columns
In some cases, a file may not have standardized delimiters but still may have a recognizable format. A typical example of this is an Apache log file. In the following example, a simple file might contain a company code, year, period, document number, and date (in SAP format). The row could be split using a " " (space) delimiter as shown earlier, but more complex pattern-matching logic, detailed in Listing 12, can also be used to generate custom results, as shown in Figure 18.
val data = sc.parallelize(Array(
  "US01 2016 01 00100000001 20160115",
  "US01 2016 01 00100000002 20160115",
  "US01 2016 01 00100000003 20160115"))

val Pattern = """(\S+) (\S+) (\S+) (\S+) (\S+)""".r

// Create a case class for the financial document header
case class financialDocumentHeader(companyCode: String,
  year: String,
  month: String,
  documentNumber: String,
  postingDate: String)

def parseFinancialDocumentHeaderLine(line: String): financialDocumentHeader = {
  val patternMatch = Pattern.findFirstMatchIn(line)
  if (patternMatch.isEmpty) {
    throw new RuntimeException("Cannot parse line: " + line)
  }
  val lineData = patternMatch.get
  financialDocumentHeader(lineData.group(1), lineData.group(2),
    lineData.group(3), lineData.group(4), lineData.group(5))
}

var dataHeaderDocs = data.map(parseFinancialDocumentHeaderLine)

Listing 12 Pattern-Matching Logic
Figure 18 Pattern-Matching Logic Results
Execute Linux Commands
In some cases, executing Linux commands directly from Apache Spark can be useful, making it unnecessary to open additional sessions to search for files in HDFS. The example code in Listing 13 shows a basic Hadoop listing command, the results of which are shown in Figure 19.
import scala.sys.process._
val lsResult = Seq("hadoop", "fs", "-ls", "hdfs://master.cluster:8020/user/vora/").!!

Listing 13 Execute Linux Command from Hadoop to Search for HDFS Files
Figure 19 HDFS Listing
Working with HDFS Directories
Within Apache Spark, it is also possible to execute common Hadoop commands. In Listing 14, the temp directory is deleted and re-created in HDFS, as shown in Figure 20.

import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val fs = FileSystem.get(sc.hadoopConfiguration)

// Delete temp directory
fs.delete(new Path("/user/vora/temp"), true)
var status = fs.listStatus(new Path("/user/vora/"))
println("\nDirectory Listing Before:")
status.foreach(x => println(x.getPath))

// Create new temp directory
fs.mkdirs(new Path("/user/vora/temp"))

status = fs.listStatus(new Path("/user/vora"))
println("\nDirectory Listing After:")
status.foreach(x => println(x.getPath))

Listing 14 Delete and Create HDFS Directories
Figure 20 New Temp Directory Created
saveAsTextFile Function: Saving RDDs to HDFS
RDDs are not persisted indefinitely, but in some cases the final RDD may be needed for subsequent reprocessing. Apache Spark can save an RDD to a local file system, to HDFS, or to another remote file system, such as AWS S3. If you are running Apache Spark on Hadoop, then the most commonly used storage location is HDFS. The example code in Listing 15 demonstrates the saveAsTextFile function to save an RDD to a text file on HDFS, as shown in Figure 21.
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

val fs = FileSystem.get(sc.hadoopConfiguration)
fs.delete(new Path("/user/vora/sapData"), true)

val sapDataRDD = sc.parallelize(Array("GB01,100", "US01,100", "US01,300", "AU01,100"))
sapDataRDD.repartition(1).saveAsTextFile("sapData")

Listing 15 saveAsTextFile Function
Figure 21 saveAsTextFile Results
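Related, though not part of the numbered listings: within a single Spark session, an RDD can also be kept in memory for reuse with cache. The following is a minimal sketch, assuming the file saved in Listing 15 exists:

// Caching keeps an RDD in memory for reuse within the current session only;
// it is not a substitute for persisting results to HDFS.
val cachedRDD = sc.textFile("sapData").cache()
println(cachedRDD.count())  // the first action materializes the cache
println(cachedRDD.count())  // subsequent actions reuse the cached data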
textFile Function: Load CSV from HDFS
The example code in Listing 16 demonstrates the textFile function, used to load a CSV file, stored on HDFS, into an RDD, as shown in Figure 22.

val hdfsFileRDD = sc.textFile("hdfs://master.cluster:8020/user/vora/sapData")

Listing 16 textFile Function
Figure 22 Load CSV Results
2.2 Using DataFrames
DataFrames work with structured data, similar to an SQL table, but have a much shorter list of available functions when compared with RDDs. DataFrames share many of the same functions as standard SQL, such as SELECT, WHERE (filter), AGG (aggregate), JOIN, UNION, and so on. For a complete list of DataFrame functions, see http://spark.apache.org/docs/latest/sql-programming-guide.html.

Note
Datasets were introduced in Apache Spark 1.6. Datasets merge core functionality from RDDs and DataFrames. This is not yet supported by Vora (expected Q2, 2016).
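To illustrate the note above, here is a minimal sketch of the Dataset API on plain Apache Spark 1.6 (an illustration rather than one of the numbered listings, and not supported by Vora at the time of writing):

import sqlContext.implicits._

// A typed view over structured data: one case class per row
case class CompanyAmount(company: String, amount: Double)

val companyDS = Seq(CompanyAmount("GB01", 1000.00),
  CompanyAmount("US01", 5000.00)).toDS()

// Datasets combine RDD-style functional operations with DataFrame optimizations
companyDS.filter(_.amount > 2000.00).show()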
Create a Simple DataFrame
The example code in Listing 17 creates a simple DataFrame from an in-memory collection using the toDF function, as shown in Figure 23.

val companyDataDF = Seq(("GB01","Revenue", 1000.00),
  ("US01","Revenue", 5000.00),
  ("US01","Expense", -3000.00),
  ("AU01","Revenue", 300.00)).toDF("Company", "AccountGroup", "Amount_USD")

Listing 17 Create a Simple DataFrame
Figure 23 Create DataFrame Results
Print Schema
After creating a DataFrame, it may be useful to see its definition using the printSchema function (Listing 18), similar to an SQL DESCRIBE table statement. The results are shown in Figure 24.

companyDataDF.printSchema

Listing 18 printSchema Function
Figure 24 printSchema Results
Select Columns
Following the example code in Listing 19, you can use the select function to reduce the number of columns. The results are shown in Figure 25.

companyDataDF.select("Company", "Amount_USD").show(10)

Listing 19 select Function
Figure 25 select Column Results
filter Function
The filter function can be used to filter rows. Listing 20 shows different syntaxes for performing the same filter operation. The results are shown in Figure 26.

companyDataDF.filter("AccountGroup = 'Revenue'").show(10)
companyDataDF.filter($"AccountGroup" === "Revenue").show(10)

Listing 20 filter Function
Figure 26 filter Results
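A brief aside on the second syntax in Listing 20 (an environment note, not covered in the original text): the $"column" notation comes from the SQLContext implicits, which Zeppelin's Spark interpreter typically imports for you. In a standalone Spark application you may need the import yourself:

// Enables the $"column" syntax and === column comparisons
import sqlContext.implicits._
companyDataDF.filter($"Amount_USD" > 0).show()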
Combining filter and select Functions
DataFrame functions can be combined. In the example code in Listing 21, rows are filtered, and the number of columns is reduced. The results are shown in Figure 27.

companyDataDF.filter("Company = 'US01'").select("Company", "Amount_USD").show(10)

Listing 21 Combining the filter and select Functions
Figure 27 filter and select Results
groupBy and agg Functions
The groupBy and agg functions can be used to aggregate data. Using the example code in Listing 22, we calculate the total profit of a company code. The results are shown in Figure 28.

val groupedData = companyDataDF.filter("Company = 'US01'").select("Company", "Amount_USD").groupBy("Company")
val groupedDataDF = groupedData.agg(sum("Amount_USD"))

Listing 22 groupBy and agg Functions for Data Aggregation
Figure 28 groupBy and agg Results
countDistinct Function
Following the example code in Listing 23, the countDistinct function can be used to count the number of distinct values in a column. The results are shown in Figure 29.

companyDataDF.agg(countDistinct("Company")).show()

Listing 23 countDistinct Function
Figure 29 countDistinct Results
Load JSON to a DataFrame
JSON structures are increasingly used for results produced by web APIs. Listing 24 shows how a very simple JSON structure can be loaded into a DataFrame. The results are shown in Figure 30.

val stringJsonRDD = sc.parallelize(Seq(
  """{ "CompanyCode": "GB01", "AccountGrp": "Revenue", "Amount_USD": 1000.00 }""",
  """{ "CompanyCode": "US01", "AccountGrp": "Revenue", "Amount_USD": 5000.00 }""",
  """{ "CompanyCode": "US01", "AccountGrp": "Expense", "Amount_USD": -3000.00 }""",
  """{ "CompanyCode": "AU01", "AccountGrp": "Revenue", "Amount_USD": 300.00 }"""
))

val resultsJSON = sqlContext.read.json(stringJsonRDD)

Listing 24 Load a Simple JSON Structure
Figure 30 Load JSON Structure Result
Explode a JSON
Unfortunately, not all JSON structures are flat. In many cases, JSON struc tures may have nested lists of data. Listing 25 shows how the levels of a JSON structure can be exploded (flattened) into a more useful columnar form, as shown in Figure 31 and Figure 32. val stringJsonRDD = s c . paral 1elize(Seq( •• if
11
{ "data” : [ { "CompanyCode": "GBOl". "AccountGrp": "Revenue", "Amount_USD": 1000.00 }, { "CompanyCode": "US01", "AccountGrp": "Revenue", "AmountJJSD": 5000.00 }, { "CompanyCode": "US01", "AccountGrp": "Expense", "AmountJJSD": -3000.00
29
|
2
Efl ®
Transform D ata w ith A pache Spark
|
2
{ "CompanyCode": "AU0 1 ", "AccountGrp": "Revenue” , "Amount_USD": 3 0 0. 0 0
] I «i it it
)
)
val resultsJSON = sqlContext. read. json(stringJsonRDD) resultsJSON.printSchema resultsJSON.toDF.showC) resultsJSON.toDF.pri ntSchema resultsJSON. toDF.withColumn("dataList", explode<$"data")). show() var finalResults = resultsJSON. toDF.wi thColumn("dataLi s t " , explode($"data") ) .withColumn( "CompanyCode", $"dataList.CompanyCode") .withColumn("AccountGrp", $"data Li st.AccountGrp") .wi thColumn("Amount_USD", $"data Li s t . Amount_USD") finalResults.show() f in a l R e s u lt s . s e l e c t ( "CompanyCode", "AccountGrp". "AmountJJSD") .show() Listing 25 Explode a More Complete JSON Structure
Figure 31 Explode JSON Results, Part 1
Figure 32 Explode JSON Results, Part 2
Save as Parquet Format
To achieve the best query performance on Hadoop using Spark SQL and Vora, it's essential to use a columnar store file structure (similar to SAP HANA's columnar store structure). The most common standards for Hadoop are ORC and Parquet. Following the example code in Listing 26, the results of a DataFrame are stored in the Parquet file format on HDFS, as shown in Figure 33.

fs.delete(new Path("/user/vora/sapDataParquet"), true)

finalResults.select("CompanyCode", "AccountGrp", "Amount_USD")
  .repartition(1).save("sapDataParquet", "parquet")

status = fs.listStatus(new Path("/user/vora/sapDataParquet"))
println("\nDirectory Listing After:")
status.foreach(x => println(x.getPath))

Listing 26 Save as Parquet File
Figure 33 Save as Parquet Results
Load Parquet Format
Listing 27 shows how a Parquet file on HDFS can be loaded into a DataFrame. The results are shown in Figure 34.

val finalResultsReloaded = sqlContext.load("sapDataParquet", "parquet")
finalResultsReloaded.show()

Listing 27 Load a Parquet File
Figure 34 Load Parquet Results
Register Temp Table for Spark SQL
Following the example code in Listing 28, you can make a DataFrame available to Spark SQL as a table, using the registerTempTable function. The results are shown in Figure 35.

finalResultsReloaded.registerTempTable("sapDataParquet")

Listing 28 registerTempTable Function
Figure 35 registerTempTable Results
Using Spark SQL
The following are a few examples of using Zeppelin to execute Spark SQL. Listing 29 uses Spark SQL to show tables, as shown in Figure 36.

%sql show tables

Listing 29 Execute Apache Spark SQL to Show Tables
Figure 36 Show Tables Results
Next, execute the SQL code in Listing 30 and try Zeppelin's different chart types, such as its tabular and pie charts. Figure 37 and Figure 38 show the data displayed in a tabular format and in a pie chart, respectively.

%sql select CompanyCode, sum(Amount_USD) as SUM_AMOUNT_USD from sapDataParquet group by CompanyCode

Listing 30 Execute a Simple SQL Statement to Aggregate Results
Figure 37 Results in Tabular Display
Figure 38 Results in Pie Chart Display
3 Consuming Data with Vora
This section covers the basic functionality of Vora. It demonstrates how Hadoop data and SAP HANA data can be consumed and combined. Vora's operations fall into two distinct categories:

» Analyzing data in Hadoop
» Analyzing data in SAP HANA

3.1 Consuming Hadoop Data

This section covers some of Vora's major functions for reading data from HDFS.
Create a Vora Table

The Vora SQL code shown in Listing 31 can be used to create a table linked to the HDFS CSV files created in Section 2.1. The results are shown in Figure 39.

%vora
CREATE TABLE testCSV (
  COMPANYCODE VARCHAR(4),
  AMOUNT_USD DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
  tableName "testCSV",
  paths "/user/vora/sapData/part-00000",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020"
)

Listing 31 Create a Vora Table
Figure 39 Create Table Results
Note
On the Vora Developer Edition, you can use the master.cluster and secondary.cluster host names instead of their corresponding internal IPs. This helps you avoid having to adjust your scripts each time your Hadoop cluster is stopped and started.
Select Results of CSV

The Vora SQL code shown in Listing 32 is used to query the table. The results are shown in Figure 40.

%vora select * from testCSV

Listing 32 Vora SQL to Query a Table
Figure 40 Results of SQL Query on a Table
Create a Vora Table Based on Parquet Files
You can use the Vora SQL code shown in Listing 33 to create a table linked to the HDFS Parquet files created in Section 2.2.

%vora
CREATE TABLE testParquet (
  COMPANYCODE VARCHAR(4),
  ACCOUNTGROUP VARCHAR(10),
  AMOUNT_USD DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
  tableName "testParquet",
  paths "/user/vora/sapDataParquet/part-r-00000-d65a7c11-9ba1-45f1-8a7d-eb733b9325d0.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)

Listing 33 Vora SQL to Create Tables Linked to Parquet Files
Select Results of Parquet File
You can use the Vora SQL code shown in Listing 34 to query the Parquet table. The selection results are shown in Figure 41.

%vora select COMPANYCODE, sum(AMOUNT_USD) as SUM_AMOUNT_USD from testParquet group by COMPANYCODE

Listing 34 Vora SQL to Query Parquet Table
Figure 41 Selection Results on Parquet Table
Show Tables
Vora tables are different from Spark SQL tables. Use the SQL code shown in Listing 35 to see the tables available in Vora. The results are shown in Figure 42.

%vora show tables

Listing 35 Vora SQL to Show Tables
Figure 42 Show Tables Results
Show Data Sources
Vora currently uses Zookeeper (Hadoop's generic configuration management tool) to store Vora configurations. Use the command shown in Listing 36 to list the data sources of each table registered in Zookeeper. The results are shown in Figure 43.

%vora SHOW DATASOURCETABLES
USING com.sap.spark.vora
OPTIONS(
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020"
)

Listing 36 List Data Sources Registered in Zookeeper
Figure 43 Show Data Sources Results
Note
The names of the data sources do not need to match the names of the tables created; however, matching names make it easier to identify their relationships.
Clearing Zookeeper
During testing, a large number of data sources may be created, and periodically, it may be necessary to clear Zookeeper. Use the statement shown in Listing 37 to perform the cleanup. This statement does not delete the source data in HDFS; it simply resets all the Zookeeper references. Figure 44 and Figure 45 show the results.

import com.sap.spark.vora.client._
ClusterUtils.clearZooKeeperCatalog("master.cluster:2181")

Listing 37 Clear Zookeeper
Figure 44 Clear Zookeeper Results
Figure 45 Show Tables Results
Note
After cleaning up Zookeeper, the tables and data sources still in use will need to be re-created. These are only symbolic links to the source HDFS data, so they can be re-created very easily if you keep a list.
Create a Vora Table Programmatically

Listing 38 is example code that can be used to create a Vora table linked to the HDFS CSV files created in Section 2.1. The SQL statement can be called directly from Scala, and the results are shown in Figure 46.

sapSqlContext.sql(s"""
CREATE TABLE testCSV (
  COMPANYCODE VARCHAR(4),
  AMOUNT_USD DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
  tableName "testCSV",
  paths "/user/vora/sapData/part-00000",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020"
)""".stripMargin)

Listing 38 Create a Vora Table
Figure 46 Programmatically Create a Table
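As a follow-up sketch (not one of the numbered listings), the same context can then be used to query the new table from Scala, assuming the sapSqlContext from Listing 38:

// Run a Vora SQL query programmatically and display the results
val testCSVResults = sapSqlContext.sql("SELECT * FROM testCSV")
testCSVResults.show()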
Using Vora with Large TPC-H Files
Presumably, you are using Hadoop and Vora because you already have very large files that need to be processed, which may not be cost effective to load and process in traditional databases. If you don't yet have access to large files but would like to test Vora at scale, then you will need to create much larger files than those used in our earlier examples. A simple way of generating large files is by using public benchmarking tools such as TPC-H and TPC-DS. For more information, see http://www.tpc.org/. There, you will find utilities for generating test files as large as you can handle. Alternatively, you can download similar files from the AWS S3 Redshift Demo, https://s3.amazonaws.com/redshift-demo/, if you have the bandwidth. In this section, sample files from s3://redshift-demo/tpc-h/100/lineitem/ were downloaded and stored in HDFS in the /user/vora/lineitem directory. Figure 47 shows the first few rows of the TPC-H lineitem file. Notice the file has a "|" (vertical bar) delimiter.
Figure 47 shows the first few rows of the TPC-H lineitem file. Notice the file has a "|" (vertical bar) delimiter.

Figure 47 TPC-H lineitem Data Example
Vora has a special CSV delimiter option, which enables these files to be loaded very easily, with the SQL code shown in Listing 39. The result is shown in Figure 48.

%vora CREATE TABLE lineitemCSV (
L_ORDERKEY BIGINT,
L_PARTKEY BIGINT,
L_SUPPKEY BIGINT,
L_LINENUMBER INTEGER,
L_QUANTITY DECIMAL(15,5),
L_EXTENDEDPRICE DECIMAL(15,5),
L_DISCOUNT DECIMAL(15,5),
L_TAX DECIMAL(15,5),
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING
)
USING com.sap.spark.vora
OPTIONS (
tableName "lineitemCSV",
paths "/user/vora/lineitem/part-00000",
csvdelimiter "|",
hosts "master.cluster,secondary.cluster",
zkurls "master.cluster:2181,secondary.cluster:2181",
nameNodeUrl "master.cluster:8020"
)

Listing 39 Load TPC-H Files
Figure 48 Create lineitem Table
Next, we can count the number of records loaded with the SQL code shown in Listing 40. The results are shown in Figure 49.

%vora select count(*) from lineitemCSV

Listing 40 Count Record Loads
Figure 49 Count of lineitem Table
Appending HDFS Files to Vora Table
The SQL code shown in Listing 41 enables additional files to be added to the Vora table. This code does not copy the data; it only adds a link to each file, as shown in Figure 50.

%vora APPEND TABLE lineitemCSV
OPTIONS (
paths "/user/vora/lineitem/part-00001,/user/vora/lineitem/part-00002",
csvdelimiter "|",
eagerLoad "true"
)

Listing 41 Link Files to a Vora Table
Figure 50 Append lineitem Table
Next, let's check the row count again (Figure 51).

Figure 51 New Row Count Results
Convert Large CSV Files to Parquet
Large CSV files are fine for testing performance, but to optimize Vora SQL performance, you'll need to convert the TPC-H files into the Parquet columnar format.
Listing 42 shows some example code, using Apache Spark, to split the file based on the "|" (vertical bar) delimiter. The results are shown in Figure 52.

val fileRdd = sc.textFile("lineitem")
val splitRdd = fileRdd.map( line => {
  var lsplit = line.split('|')
  ( lsplit(0).toInt,
    lsplit(1).toLong,
    lsplit(2).toLong,
    lsplit(3).toInt,
    lsplit(4).toDouble,
    lsplit(5).toDouble,
    lsplit(6).toDouble,
    lsplit(7).toDouble,
    lsplit(8),
    lsplit(9),
    lsplit(10),
    lsplit(11),
    lsplit(12),
    lsplit(13),
    lsplit(14),
    lsplit(15) )
})

Listing 42 Split Large CSV Files
Figure 52 Use RDD for a Split Based on a Vertical Bar or Pipe Delimiter (|)
Next, using the code in Listing 43, the RDD is converted to a DataFrame, and the results are saved in the Parquet format with five partitions (resulting in five files, as shown in Figure 53).

splitRdd.toDF("l_orderkey", "l_partkey", "l_suppkey", "l_linenumber",
  "l_quantity", "l_extendedprice", "l_discount", "l_tax", "l_returnflag",
  "l_linestatus", "l_shipdate", "l_commitdate", "l_receiptdate",
  "l_shipinstruct", "l_shipmode", "l_comment")
  .repartition(5)
  .save("lineitemParquet", "parquet")

Listing 43 Convert RDD to DataFrame
The repartition function is a crucial part of converting large files. A large number of very small files will result in very poor performance, while a single, very large file will cause other problems. File-size tuning depends on many factors, including the size of your cluster, available memory, and data retrieval patterns; a rough sizing sketch follows below.
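As an illustration only (not from the E-Bite), one way to pick a partition count is to aim for a target output file size; the 128 MB default here is an assumption you should tune for your own cluster:

//Hypothetical sizing helper: aim for roughly targetBytes per output file
def partitionsFor(totalBytes: Long, targetBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

//For example, a 640 MB dataset yields 5 partitions:
val n = partitionsFor(640L * 1024 * 1024)
//splitRdd.toDF(...).repartition(n).save("lineitemParquet", "parquet")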
Next, let's check whether the five Parquet files were created by using the code in Listing 44. The results are shown in Figure 54.

import scala.sys.process._
val lsResult = Seq("hadoop", "dfs", "-du", "-h", "/user/vora/lineitemParquet").!!

Listing 44 Check Created Files
Figure 54 Check Files in HDFS

Now, using the code in Listing 45, we can create a table using the new Parquet files. The results are shown in Figure 55.
%vora CREATE TABLE lineitemParquet (
L_ORDERKEY BIGINT,
L_PARTKEY BIGINT,
L_SUPPKEY BIGINT,
L_LINENUMBER INTEGER,
L_QUANTITY DECIMAL(15,5),
L_EXTENDEDPRICE DECIMAL(15,5),
L_DISCOUNT DECIMAL(15,5),
L_TAX DECIMAL(15,5),
L_RETURNFLAG STRING,
L_LINESTATUS STRING,
L_SHIPDATE DATE,
L_COMMITDATE DATE,
L_RECEIPTDATE DATE,
L_SHIPINSTRUCT STRING,
L_SHIPMODE STRING,
L_COMMENT STRING
)
USING com.sap.spark.vora
OPTIONS (
tableName "lineitemParquet",
paths "/user/vora/lineitemParquet/part-r-00000-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00001-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00002-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00003-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet,/user/vora/lineitemParquet/part-r-00004-ee2a931b-ebd5-46fb-a9b7-71a27952c6ea.gz.parquet",
hosts "master.cluster,secondary.cluster",
zkurls "master.cluster:2181,secondary.cluster:2181",
nameNodeUrl "master.cluster:8020",
format "parquet"
)

Listing 45 Create a Table with Parquet Files
Figure 55 Create lineitem Parquet Table
Even with tens of millions of records and a very small Hadoop cluster, you can still achieve very high performance. As shown at the bottom of Figure 56, the Spark SQL query responded in two seconds.

Figure 56 Performance Test on lineitem Table
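The query text in Figure 56 does not reproduce well in print. A cleaned-up approximation of its second statement, with the table name and ship-date value assumed (the screenshot is partly illegible), would be:

//Approximation of the Figure 56 query: revenue for items shipped on one day
//(table name and date are assumptions, not confirmed by the E-Bite)
sapSqlContext.sql("""
  select sum(L_QUANTITY * L_EXTENDEDPRICE)
  from lineitemParquet
  where L_SHIPDATE = '1996-12-01'
""").show()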
3.2 Consuming SAP HANA Data

This section covers some of Vora's major functions for reading from and writing to SAP HANA. In this example, SAP HANA SP10, AWS Developer Edition, is used in conjunction with Vora. See http://cal.sap.com to launch a trial SAP HANA system.

Create a Simple Table in SAP HANA
By now, you are probably familiar with the simple test dataset used in this E-Bite. Execute the SQL code in Listing 46 in SAP HANA to create a table and populate it with some entries. The results are shown in Figure 57 and Figure 58.

drop table SAPDATA;
create column table SAPDATA (
companycode varchar(4),
accountgroup varchar(10),
amount_usd double --decimal(18,2)
);
insert into "CODEJAMMER"."SAPDATA" values('GB01', 'Revenue', 1000);
insert into "CODEJAMMER"."SAPDATA" values('US01', 'Revenue', 5000);
insert into "CODEJAMMER"."SAPDATA" values('US01', 'Expense', -3000);
insert into "CODEJAMMER"."SAPDATA" values('AU01', 'Revenue', 300);

Listing 46 Create a Table in SAP HANA
Figure 57 Create a Table in SAP HANA
Figure 58 Data Preview of a New Table in SAP HANA
Query an SAP HANA Table from Vora
Now, it is possible to access this SAP HANA table directly from Vora, using the code in Listing 47. No data is copied to Hadoop unless a query is executed. The results are shown in Figure 59.

%vora CREATE TABLE HANA_SAPDATA
USING com.sap.spark.hana
OPTIONS (
path "SAPDATA",
dbschema "CODEJAMMER",
host " ",
instance "00",
user "CODEJAMMER",
passwd "CodeJam2015"
)

Listing 47 Access an SAP HANA Table from Vora
Figure 59 Create an SAP HANA Table in Vora
Next, we can execute the SQL statement shown in Listing 48 in Vora to retrieve data from SAP HANA. The results are shown in Figure 60.

%vora select * from HANA_SAPDATA

Listing 48 Retrieving Data from SAP HANA
Figure 60 Search an SAP HANA Table in Vora
Create a Simple View in SAP HANA
Vora is also able to read SAP HANA views. To demonstrate this, first we need to create a simple view that joins the following tables: SFLIGHT, SCARR (on CARRID), and M_TIME_DIMENSION (on FLDATE/SAPDATE). Create a view similar to the one shown in Figure 61.
Figure 61 Basic SAP HANA View on SFLIGHT
The results of the data preview in SAP HANA are shown in Figure 62.
Figure 62 Data Preview of SFLIGHT View
Query an SAP HANA View from Vora
Your SAP HANA view can now be easily accessed directly through Vora, using the code in Listing 49. The results are shown in Figure 63.

%vora CREATE TABLE HANA_SFLIGHT_VIEW
USING com.sap.spark.hana
OPTIONS (
path "CODEJAMMER/CA_SLFIGHT_001",
dbschema "_SYS_BIC",
host " ",
instance "00",
user "CODEJAMMER",
passwd "CodeJam2015")

Listing 49 Access an SAP HANA View from Vora
Figure 63 Create an SAP HANA View Table in Vora
Next, we can query the view from Vora with the SQL code in Listing 50. The results are shown in Figure 64.

%vora select DATE_SQL, sum(PRICE) AS SUM_PRICE from HANA_SFLIGHT_VIEW group by DATE_SQL

Listing 50 Query an SAP HANA View from Vora
Figure 64 Results of an SAP HANA View in Vora
We can check whether the query executed in SAP HANA by following the menu path Admin Console • Performance • SQL Plan Cache in SAP HANA Studio. The results are shown in Figure 65.
Figure 65 Subquery Called in SAP HANA
Use Apache Spark to Write Data to SAP HANA
Within Apache Spark, you can now write code to create tables or append data to SAP HANA. The code shown in Listing 51 demonstrates how to create a new table in SAP HANA (table SAPDATA2) and insert a single row of information into it. The results are shown in Figure 66.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val nameNodeHostAndPort = "master.cluster:8020"
//val sqlcSAP = new SapSQLContext(sc)

//This table name is for the Vora instance
val tableNameVora = "HANA_SAPDATA"
//This table name is for SAP HANA
val tableNameHana = "SAPDATA2"
//Database schema name in SAP HANA
val dbSchema = "CODEJAMMER"
//SAP HANA host
val host = "52.73.12.173"
//SAP HANA instance ID
val instance = "00"
//User name and password
val user = "CODEJAMMER"
val passwd = "CodeJam2015"

//New data row to add, from a DataFrame
val companyDataRDD = Seq(("TH01", "Revenue", 100.00))
  .toDF("COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD").rdd
//val companyDataDF = Seq(("SG01", "Revenue", 100.00))
//  .toDF("COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD")
val schema = StructType(Array(
  StructField("COMPANY", StringType, false),
  StructField("ACCOUNTGROUP", StringType, false),
  StructField("AMOUNT_USD", DoubleType, false)))
val companyDataDF = sqlc.createDataFrame(companyDataRDD, schema)

val configuration = Map(("host" -> host), ("instance" -> instance),
  ("user" -> user), ("passwd" -> passwd))
val writeOptions = configuration + ("path" -> tableNameHana) + ("dbschema" -> dbSchema)

//Create table and insert
//companyDataDF.write.format("com.sap.spark.hana")
//  .mode(SaveMode.ErrorIfExists).options(writeOptions).save()
//Create / append
companyDataDF.write.format("com.sap.spark.hana")
  .mode(SaveMode.Append).options(writeOptions).save()

//Check results
val queryResult = sqlContext.sql(s"""show tables""")
//val queryResult = sqlContext.sql(s"""select * from $tableNameVora""")
queryResult.collect().foreach(println)

Listing 51 Create a Table and Insert a Single Row in SAP HANA
Figure 66 Create a Table in SAP HANA Using Vora
We can check that the new table SAPDATA2 has been created in SAP HANA and check its contents, as shown in Figure 67.
Figure 67 New SAP HANA Table, Created from Vora

3.3 Cross-Consumption: Join SAP HANA and Vora Data
In this section, a simple cross-consumption query is executed to check how query logic is passed down to SAP HANA. In Section 3.1, a small table, testCSV, was created in Vora. In Section 3.2, a small table, HANA_SAPDATA, was created in SAP HANA and made accessible in Vora.
Now, we can execute the SQL code shown in Listing 52 to join these tables and analyze how query logic is passed down. Notice there is a where clause on HANA_SAPDATA (using table alias A) for company code (A.COMPANYCODE = 'US01'). The results are shown in Figure 68.

%vora select A.*, B.AMOUNT_USD as OTHER_AMOUNT_USD
from HANA_SAPDATA A
inner join testCSV B on A.COMPANYCODE = B.COMPANYCODE
where A.COMPANYCODE = 'US01'

Listing 52 Join Tables to Analyze a Query
Figure 68 Results of Cross-Consumption Query
In SAP HANA, we can check the subquery executed by following the menu path Admin Console • Performance • SQL Plan Cache. The results are shown in Figure 69.
Figure 69 Subquery Called by Vora
Examples of the SQL generated in SAP HANA by Vora are shown in Listing 53.

select count(*) as NUM_ROWS from (
SELECT "_table1"."COMPANYCODE", "_table1"."ACCOUNTGROUP", "_table1"."AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA" AS "_table1"
WHERE ("_table1"."COMPANYCODE" = 'US01') )

SELECT "_table1"."COMPANYCODE", "_table1"."ACCOUNTGROUP", "_table1"."AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA" AS "_table1"
WHERE ("_table1"."COMPANYCODE" = 'US01')
LIMIT 2 OFFSET 0

Listing 53 SQL Generated in SAP HANA
Listing 53 shows a positive result because it indicates that the where clause has been passed down from Vora to the subquery executed in SAP HANA. If we run a similar Vora SQL statement where the where clause is instead on testCSV for company code (producing, for example, the SQL in Listing 54), the logic is not passed down to the subqueries in SAP HANA.

select count(*) as NUM_ROWS from (
SELECT "COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA" )

SELECT "COMPANYCODE", "ACCOUNTGROUP", "AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA"
LIMIT 4 OFFSET 0

Listing 54 SQL Generated in SAP HANA with Missing where Clause
Another observation is that, if only a reduced number of columns is specified in the Vora SQL (for example, in Listing 55), this reduced selection successfully passes down to SAP HANA.

SELECT "_table1"."COMPANYCODE", "_table1"."AMOUNT_USD"
FROM "CODEJAMMER"."SAPDATA" AS "_table1"
LIMIT 4 OFFSET 0

Listing 55 SQL Generated in SAP HANA with Reduced Columns
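For reference, a Vora-side statement of the shape that would generate such a reduced subquery (an assumption; the E-Bite does not show the originating query) is simply a projection of the two columns:

sapSqlContext.sql("select COMPANYCODE, AMOUNT_USD from HANA_SAPDATA").show(4)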
3.4 Additional Business Functions
In the future, SAP plans to add additional business functions to Vora that are not supported in Apache Spark. The first collection of available functions includes hierarchy support and UoM conversion.

Working with Hierarchies
Hierarchical data structures define a parent-child relationship between different data items, providing an abstraction that enables complex computations to be performed on different levels of data.
An organization, for example, is basically a hierarchy where the connections between nodes (for example, between managers and developers) are determined by the reporting lines that are defined by that organization. Since it is very difficult to use standard SQL to work with and perform analyses of hierarchical data, Spark SQL has been enhanced to provide the missing hierarchy functionality. To utilize hierarchies, a custom hierarchy table needs to be defined for each object with a hierarchy, similar to the one shown in Table 2.

Key     Predecessor (INT)   Successor (INT)   Order (INT)
US01    None                1                 1
GB01    1                   2                 1
AU01    1                   3                 2
GB02    3                   4                 1
GB03    3                   5                 2

Table 2 Example Hierarchy Relationship (h_src)
The SQL for joining a hierarchy table with a transactional table is shown in Listing 56.

SELECT B.name, A.address, B.level
FROM (SELECT name, LEVEL(node) AS level
FROM HIERARCHY (
USING h_src AS v
JOIN PARENT u ON v.pred = u.succ
SEARCH BY ord ASC
START WHERE pred IS NULL
SET node) AS H) B, addresses A
WHERE B.name = A.name

Listing 56 Join a Hierarchy Table with a Transactional Table
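Listing 56 assumes that the hierarchy table h_src and a transactional table addresses are already registered. As a minimal sketch (an assumption, not from the E-Bite), the h_src rows from Table 2 could be registered from Spark like this, using None for the root's missing predecessor:

//Sketch: register the Table 2 rows as a temporary table for Listing 56
import sqlContext.implicits._
val hSrc = Seq(
  ("US01", None,    1, 1),
  ("GB01", Some(1), 2, 1),
  ("AU01", Some(1), 3, 2),
  ("GB02", Some(3), 4, 1),
  ("GB03", Some(3), 5, 2)
).toDF("name", "pred", "succ", "ord")
hSrc.registerTempTable("h_src")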
UoM Conversion
The SAP HANA data source allows you to use UDFs that are implemented solely in SAP HANA (that is, they do not exist in Apache Spark). You can do this by using the "$" prefix in the select statement to convert a measure, as shown in Listing 57.

import org.apache.spark.sql._
val sqlc = new SapSQLContext(sc)
lazy val configuration = Map(("host" -> ""), ("instance" -> "00"),
  ("user" -> "CODEJAMMER"), ("passwd" -> "CodeJam2015"))
lazy val sampleInputConf = configuration + ("path" -> "SAPDATA") + ("dbschema" -> "CODEJAMMER")
val sampleInputRelation = sqlc.read.format("com.sap.spark.hana")
  .options(sampleInputConf).load()
sampleInputRelation.registerTempTable("SAMPLE_INPUT")
var query = sqlc.sql("Select $CONVERT_UNIT" +
  "(QUANT,SOURCE_UNIT,'CODEJAMMER',TARGET_UNIT,'000') as converted " +
  "FROM SAPDATA")
query.collect().foreach(println)

Listing 57 Use the "$" Prefix to Apply HANA UDFs to Results

3.5 Additional Information on Vora
For a more detailed look at Vora, please check out the following links:

» SAP HANA Vora, YouTube
https://www.youtube.com/playlist?list=PLkzo92owKnVyyNRkogvGloXYcDtLqeByX
» SAP HANA Academy, GitHub
https://github.com/saphanaacademy/Vora
» SAP HANA Vora 1.1, SAP Help Portal Page
http://help.sap.com/hana_vora.html
» SAP HANA Vora Installation and Developer Guide
http://help.sap.com/Download/Multimedia/hana_vora/SAP_HANA_Vora_Installation_Developer_Guide_en.pdf
» SAP HANA Vora, Troubleshooting
http://scn.sap.com/blogs/vora/2015/12/09/sap-hana-vora--troubleshooting
» Tagged Questions, Stack Overflow
http://stackoverflow.com/questions/tagged/vora
4 Combining Apache Spark, SAP HANA, and Vora
With some basic knowledge of Apache Spark functions and Vora, we are now ready to use them together to transform and analyze data.
In the following section, Apache Spark will be used to load and transform oil price data and New York Times news articles relating to oil over the past ten years. SAP HANA will also be populated with net income figures from several major oil companies for the same years. Finally, Vora will be used to combine and analyze these datasets, which are stored in Hadoop HDFS and SAP HANA.
4.1 Download Oil Price Data Using Apache Spark
In this section, we will use Apache Spark to retrieve oil price data and store this data in Hadoop HDFS for subsequent processing by Vora. As shown in Figure 70, the U.S. Energy Information Administration (http://www.eia.gov/opendata/) has an API for downloading open datasets, such as oil price data, at http://api.eia.gov/series/?api_key=YOUR_API_KEY_HERE&series_id=PET.RBRTE.D.
Figure 70 U.S. Energy Information Administration API
To access this data, you must get an API key. Sign up for free access at http://www.eia.gov/opendata/register.cfm. You'll receive an email with your API key, as shown in Figure 71.

Figure 71 U.S. Energy Information Administration API Key
The example code shown in Listing 58 can be used to call the API, download the JSON response, explode the first level of the JSON, and finally display these results to the screen, as shown in Figure 72.

//Function converts a remote JSON URL response to a DataFrame
def getJsonUrlAsDF(url: String): org.apache.spark.sql.DataFrame = {
  val client = org.apache.http.impl.client.HttpClientBuilder.create().build()
  val request = new org.apache.http.client.methods.HttpGet(url)
  val response = client.execute(request)
  val handler = new org.apache.http.impl.client.BasicResponseHandler()
  //Get response string
  var responseString = handler.handleResponse(response).trim
  //Convert to RDD
  var responseRDD = sc.parallelize(responseString :: Nil)
  //Return DataFrame
  return sqlContext.read.json(responseRDD).toDF()
}

//Download latest oil data
var responseDF = getJsonUrlAsDF("http://api.eia.gov/series/?api_key=&series_id=PET.RBRTE.D")

//Explode JSON and select relevant columns
var oilDataDF = responseDF.withColumn("Data", explode($"series.data"))
  .withColumn("Element", explode($"Data"))
  .select("Element")
oilDataDF.show(5)

Listing 58 Download Oil Price Data
Figure 72 Oil Data Initial Results
From the above results, we aren't able to make sense of the row listing. We can check the schema again with the command shown in Listing 59. The results are shown in Figure 73.

oilDataDF.printSchema()

Listing 59 Check the Schema
Figure 73 printSchema Results
In this case, we aren't able to explode or select the columns any further. We need to perform some RDD map operations to split each row into multiple columns and then save the reorganized data to HDFS by using the code in Listing 60. The results are shown in Figure 74.

case class oilPriceStruct(Date: java.sql.Date, Price_BBL: Double)

val oilDataFixedDF = oilDataDF.rdd.map( row => {
  var rowString = row(0).toString
  var rowSplit = rowString.substring(5, rowString.length - 1).split(",")

  //Date SQL formatting
  var dateStr = rowSplit(0)
  var simpleDateFormat: java.text.SimpleDateFormat = new java.text.SimpleDateFormat("yyyyMMdd")
  var utilDate: java.util.Date = simpleDateFormat.parse(dateStr)
  var sqlDate: java.sql.Date = new java.sql.Date(utilDate.getTime())

  //Return in structured format
  oilPriceStruct(sqlDate, rowSplit(1).toDouble)
}).toDF()

oilDataFixedDF.repartition(1).save("oilDataParquet", "parquet")
oilDataFixedDF.sort($"Date".desc).show(50)

Listing 60 Split a Row into Multiple Columns with RDD Operations
Figure 74 Oil Data Final Results
We now have the results saved in HDFS in the Parquet format.
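As a quick optional check (a sketch, not from the E-Bite), the saved file can be read straight back from HDFS to confirm the schema and row count:

//Reload the saved Parquet data and inspect it
val oilCheck = sqlContext.read.parquet("oilDataParquet")
oilCheck.printSchema()
oilCheck.count()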
4.2 Download NY Times Articles Using Apache Spark
The New York Times has an API that will return ten articles (per page) based on specific search criteria. A maximum of 100 pages for a specific search can be returned. In the following steps, we will use Apache Spark to automate the download of all New York Times articles with "oil" in the title over the past ten years and store the results in Hadoop HDFS for subsequent processing by Vora.
Sign Up for API
First, as shown in Figure 75 and Figure 76, sign up for access to the API to search for articles at http://developer.nytimes.com/.

Figure 75 Request a NY Times API Key
Figure 76 New NY Times API Key
Note The NY Times API is free to use; however, there are limits on the number of searches made per second and per day, so take care when using Spark logic.
Perform a Single Search
Next, we will use Spark to perform a single search for articles where "oil" is mentioned in the article title. We also need to perform some date type manipulation to allow the results to be compared with the oil price data loaded earlier in Section 4.1. Execute the code shown in Listing 61 to use the API to perform a single page search. The results are shown in Figure 77 and Figure 78.

import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.DataTypes._

//Convert a timestamp string to a date (NOTE: cast(DateType) works from Spark 1.5)
def dtStrToDate(dateStr: String): java.sql.Date = {
  if (dateStr != null) {
    var UTCDateFormat: java.text.SimpleDateFormat =
      new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
    UTCDateFormat.setTimeZone(java.util.TimeZone.getTimeZone("UTC"))
    var utilDate: java.util.Date = UTCDateFormat.parse(dateStr)
    var sqlDate: java.sql.Date = new java.sql.Date(utilDate.getTime())
    return sqlDate
  } else return null
}
val dtStrToDate_udf = udf(dtStrToDate _)

def searchNYTimes(search: String, page: Int): org.apache.spark.sql.DataFrame = {
  Thread.sleep(100)
  val exampleUrl = "http://api.nytimes.com/svc/search/v2/articlesearch.json?" +
    search + "&api-key=&page=" + page
  val bingUrl_p1 = "https://api.datamarket.azure.com/Bing/Search/v1/Composite?Sources=%27news%27&Query=%27intitle%3A"
  val bingUrl_p2 = "%27&$format=json&$skip="
  //var bingUrlFull = bingUrl_p1 + search + bingUrl_p2 + page
  var bingUrlFull = exampleUrl
  val client = org.apache.http.impl.client.HttpClientBuilder.create().build()
  val request = new org.apache.http.client.methods.HttpGet(bingUrlFull)
  val response = client.execute(request)
  val handler = new org.apache.http.impl.client.BasicResponseHandler()
  var bingResults = handler.handleResponse(response).trim
  var bingResultsRDD = sc.parallelize(bingResults :: Nil)
  val resultsJSON = sqlContext.read.json(bingResultsRDD)
  var resultsExpDF = resultsJSON.toDF().withColumn("Docs", explode($"response.docs"))
  if (resultsExpDF.count() > 0) {
    resultsExpDF = resultsExpDF.withColumn("web_url", $"Docs.web_url")
      .withColumn("source", $"Docs.source")
      .withColumn("pub_date", dtStrToDate_udf($"Docs.pub_date"))
      .withColumn("news_desk", $"Docs.news_desk")
      .withColumn("section_name", $"Docs.section_name")
      .withColumn("subsection_name", $"Docs.subsection_name")
      .withColumn("snippet", $"Docs.snippet")
      .drop("copyright").drop("Docs").drop("response").drop("status")
  }
  return resultsExpDF
}

var x = searchNYTimes("q=oil+price&fq=news_desk:(%22Business%22,%22Energy%22,%22Environment%22,%22Financial%22)&begin_date=20070101&end_date=20071231", 1)
x.show()
x.count()

Listing 61 Single Page Search for "Oil" in Article Title
Figure 77 Results of Single Page Search of "Oil" in Article Header, Part 1
Figure 78 Results of Single Page Search of "Oil" in Article Header, Part 2
For a single search criterion, the NY Times API only supports up to 100 pages of results. To broaden the search, narrower selection criteria are needed (e.g., by year), and the results then need to be combined using union statements.

Download a Range of New York Times Articles
Now, we will use Spark to download all news articles related to "oil" created over the last ten years. The code shown in Listing 62 illustrates how the API can be called multiple times and the results combined and saved into Hadoop HDFS, as shown in Figure 79.

import org.apache.spark.sql._

var unionResultsDF: DataFrame = null;
var tempDF: DataFrame = null;
var totalPages: Int = 100; //currently max 100 supported
var searchTerm: String = "q=oil+price&fq=news_desk:(%22Business%22,%22Energy%22,%22Financial%22)&begin_date=YEARNUM0101&end_date=YEARNUM1231";

for (yearNum <- 2007 to 2017) {
  tempDF = null
  for (pageNum <- 0 to (totalPages - 1)) {
    //println(pageNum)
    if (pageNum == 0 || tempDF.count() > 0) {
      var search = searchTerm.replace("YEARNUM", yearNum.toString)
      tempDF = searchNYTimes(search, pageNum)
      if (unionResultsDF != null) {
        unionResultsDF = unionResultsDF.unionAll(tempDF)
        //println("Next: " + pageNum)
      } else {
        //println("1st: " + pageNum)
        unionResultsDF = tempDF
      }
    }
  }
}
unionResultsDF.show(40)
unionResultsDF.repartition(1).save("nytimesParquet", "parquet")

Listing 62 Download Ten Years of Articles
Figure 79 Results of an Iterative Search of the New York Times
We have now successfully saved a summary of ten years of articles on Hadoop HDFS.
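As a quick optional check (a sketch, not from the E-Bite), the saved articles can be read back and profiled by news desk:

//Reload the saved article summaries and count them per news desk
val articles = sqlContext.read.parquet("nytimesParquet")
articles.groupBy("news_desk").count().show()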
4.3 Store Net Income Results Using SAP HANA
Now, we will create a table in SAP HANA to store the average net income (in billions) for several of the world's largest oil companies over the past ten years. Execute the SQL code in Listing 63 in SAP HANA Studio to create a table and populate it with the average, publicly listed, quarterly results. DROP TABLE CODEJAMMER.NETINCOME; CREATE COLUMN TABLE CODEJAMMER.NET INCOME( CALQUARTER VARCHARC5), NETINCOME DECIMAL(18,5)
); in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t i nsert in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t
i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto i nto
CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME
valuesC'20154',2.36); valu es ('20071'.26.72) valu es (’20072’ ,28.03) valu es ('20073',28.19) valuesC'20074',31.2); valu es (’20081',30.81) valuesC1200821,31.92) valuesC'20083',35.17) valu es ('20084',25.25) valuesC *20091'.20.33) valu es ('20092',16.55) valu es ('20093',13.64) valuesC'20094',16.87) valuesC'20101',18.98) valuesC'20102'.19.99) valuesC'20103',20.3); valuesC'20104',25.06) valuesC'20111',27.84) valuesC'20112',29.81) valuesC'20113',31.74) valuesC'20114',30.41) valuesC'20121',30.7); valuesC'20122',31.21) valuesC'20123',27.18) valuesC'20124',27.48)
75
|
4
C o m b in in g A pache Spark, SAP H A N A , and V ora
in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t in se r t
into into into into i nto into into into i nto into into into
CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME CODEJAMMER.NETINCOME
|
4
valu es (’20131',26.76) valu es (*20132',24.85) values(’20133',24.59) valu es (’20134’ ,22.91) values(’20141’ ,21.37) values(’20142’ ,23.32) values(’20143’ ,22.69) values(’20144’ ,19.43) valu es (’20151’ ,16.26) valu es (’20152’ .14.06) valu es ('20153',6.29); valu es (’20154’ ,2.36);
Listing 63 SQL to Generate NETINCOME Results in SAP HANA
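Before building the calculation view, you can optionally confirm the load with a quick query in SAP HANA Studio. This is a minimal check that simply echoes the rows you just inserted:

-- Optional check: confirm the quarterly rows loaded correctly
SELECT CALQUARTER, NETINCOME
  FROM CODEJAMMER.NETINCOME
 ORDER BY CALQUARTER;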
Next, create a calculation view that combines the new NETINCOME table with the time dimension table (joined on the CALQUARTER column) so that date-specific searches can be executed, as shown in Figure 80.
The view CODEJAMMER::CA_NETINCOME_01 joins a projection of the NETINCOME table to the M_TIME_DIMENSION table on CALQUARTER and exposes three output columns: CALQUARTER, DATE_SQL, and NETINCOME.

Figure 80 SAP HANA Net Income Calculation View
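Once the view is active, you can test it directly in SAP HANA Studio before exposing it to Vora. The query below is a minimal sketch: it assumes the view was activated under the CODEJAMMER package (so its runtime name is "_SYS_BIC"."CODEJAMMER/CA_NETINCOME_01", the same path used later in Listing 68) and that it exposes the CALQUARTER, DATE_SQL, and NETINCOME columns shown in Figure 80.

-- Optional check: query the activated calculation view
SELECT "CALQUARTER", "DATE_SQL", SUM("NETINCOME") AS "NETINCOME"
  FROM "_SYS_BIC"."CODEJAMMER/CA_NETINCOME_01"
 GROUP BY "CALQUARTER", "DATE_SQL"
 ORDER BY "CALQUARTER";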
4.4 Analyze Oil Data Using Vora
Finally, we will use Vora to access each of the datasets created in the earlier steps. We will then use Vora to combine and visualize the results.

Oil Price Data
Create a Vora table using the code in Listing 64 to access the oil price Parquet file (/user/vora/oilDataParquet/part-r-00000-4a549f81-9b0f-4f0a-be13-15ba7213a52e.gz.parquet) created by Apache Spark in Section 4.1. The resulting price table is shown in Figure 81.

%vora CREATE TABLE OILDATA (
  PRICE_DATE DATE,
  PRICE_BBL DOUBLE
)
USING com.sap.spark.vora
OPTIONS (
  tableName "OILDATA",
  paths "/user/vora/oilDataParquet/part-r-00000-4a549f81-9b0f-4f0a-be13-15ba7213a52e.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)
Listing 64 Create a Vora Table Linked to Oil Price Data
Figure 81 Vora Oil Price Table
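Before running reporting queries, you can confirm that the new table is registered in the Spark SQL catalog. This is a minimal check, assuming your Spark/Vora version supports the SHOW TABLES command:

%vora show tables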
Execute a test query on the table using the code in Listing 65. You'll see oil price changes displayed as in Figure 82.
%vora select * from OILDATA order by PRICE_DATE asc

Listing 65 Check Oil Price Fluctuations
Figure 82 Chart Showing Oil Price Changes
Oil News Articles
Create a Vora table using the code in Listing 66 to access the news article Parquet file (/user/vora/nytimesParquet/part-r-00000-82a4db8e-a59d-4fec-8c66-b670c151a328.gz.parquet) created by Apache Spark in Section 4.2. The results are shown in Figure 83.

%vora CREATE TABLE NYTIMES (
  WEBURL VARCHAR(100),
  SOURCE VARCHAR(100),
  PUBDATE VARCHAR(10),
  NEWSDESK VARCHAR(100)
)
USING com.sap.spark.vora
OPTIONS (
  tableName "NYTIMES",
  paths "/user/vora/nytimesParquet/part-r-00000-82a4db8e-a59d-4fec-8c66-b670c151a328.gz.parquet",
  hosts "master.cluster,secondary.cluster",
  zkurls "master.cluster:2181,secondary.cluster:2181",
  nameNodeUrl "master.cluster:8020",
  format "parquet"
)
Listing 66 Create a Vora Table Linked to Oil News Articles
Figure 83 Vora Table Linked to Oil News Articles
Execute a test query on the table using the code in Listing 67. The results are shown in Figure 84.

%vora select PUBDATE, count(*) from NYTIMES
--where PUBDATE < '2016-02-05'
group by PUBDATE
order by PUBDATE asc

Listing 67 Articles Written Per Day for the Past Ten Years
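The daily counts are sparse over a ten-year window, so a yearly rollup can make the long-term trend easier to read. The variation below is a minimal sketch; it assumes PUBDATE is stored as a 'YYYY-MM-DD' string (as defined in Listing 66) and that your Vora version supports Spark SQL's substring function.

%vora select substring(PUBDATE, 1, 4) as PUBYEAR, count(*) as ARTICLES
from NYTIMES
group by substring(PUBDATE, 1, 4)
order by PUBYEAR asc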
Figure 84 Chart Showing the Number of Articles Written about Oil over Ten Years
Net Income Results
Using the code in Listing 68, you can create a Vora table to access the net income results stored in SAP HANA, which we created in Section 4.3. The results are shown in Figure 85. (The host option is left blank here; supply your own SAP HANA host name.)

%vora CREATE TABLE HANA_NETINCOME
USING com.sap.spark.hana
OPTIONS (
  path "CODEJAMMER/CA_NETINCOME_01",
  dbschema "_SYS_BIC",
  host "",
  instance "00",
  user "CODEJAMMER",
  passwd "CodeJam2015"
)
Listing 68 Create a Vora Table Linked to Net Income Results
Figure 85 Vora Table Linked to Net Income Data
Execute a test query on the table using the code in Listing 69. The results are shown in Figure 86.

%vora select * from HANA_NETINCOME

Listing 69 Show Contents of NETINCOME Table
Combining Results with Vora
Finally, with a tiny bit of manipulation using the code in Listing 70, we can overlay the results to see if there are any noticeable patterns between the number of articles written about oil, oil price fluctuations, and net income over the past ten years. Note that the article count is multiplied by 10 and the net income by 4 simply to scale all three series onto a comparable axis for the overlay chart.

%vora select OILDATA.PRICE_DATE, OILDATA.PRICE_BBL, NI.NETINCOME, NYT.artCount
from OILDATA
left outer join (
  select PUBDATE, count(*)*10 as artCount
  from NYTIMES
  group by PUBDATE
) NYT on OILDATA.PRICE_DATE = NYT.PUBDATE
left outer join (
  SELECT cast(DATE_SQL as String) AS DSQL,
         cast(NETINCOME as double)*4 as NETINCOME
  FROM HANA_NETINCOME
) as NI on OILDATA.PRICE_DATE = NI.DSQL
where OILDATA.PRICE_DATE between '2007-01-01' and '2017-01-01'
order by OILDATA.PRICE_DATE asc

Listing 70 Generate a Graph to Compare Oil Prices, Articles, and Net Income
Figure 87 Combined Chart Showing Ten Years of Oil Data
Success! In Figure 87, you can clearly see spikes in the number of news articles written around the same time as major changes in oil prices. You can also see the predictable trend that the net income of oil companies tracks oil prices very closely. You might even infer that, in 2007, oil companies were better able to cope with lower oil prices than they are today. Perhaps, with more detailed sentiment analysis on the articles, or deeper analysis using corporate data, we could make better predictions for tomorrow.
5 What's Next?
Now that you're familiar with SAP HANA Vora and its basic functions and configuration, what next? Vora helps you process big data faster and more efficiently than ever, and simplifies data management for SAP HANA and Hadoop. But there's more to SAP HANA data than Vora!

Recommendation from Our Editors

If you're looking to get knee-deep into data model design, SAP HANA Advanced Data Modeling is the book for you. With step-by-step instructions and sample coding, this book will teach you how to build and design predictive, simulation, and optimization models. From information views to AFL models, you'll learn to scale for large datasets and performance tune your models to perfection. Visit www.sap-press.com/3863 to learn more!
In addition to this book, our editors picked a few other SAP PRESS publications that you might also be interested in. Check out the next page to learn more!
More from SAP PRESS

Ready to get familiar with SAP HANA? From initial setup to connecting to a business intelligence platform, follow step-by-step instructions and detailed examples for a complete SAP HANA implementation.

Implementing SAP HANA: 860 pages, 2nd edition, pub. 12/2014
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3703
Learn your options for implementing SAP BW on SAP HANA. Find out how SAP HANA changes data modeling, reporting, and administration.

Implementing SAP BW on SAP HANA: 467 pages, pub. 05/2015
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3609
Looking to move past the SAP HANA basics and get knee-deep into some real data model design? Learn how to build and design predictive, simulation, and optimization models with this book. From information views to AFL models, you'll learn to scale for large datasets and performance tune your models to perfection!

SAP HANA Advanced Data Modeling: 392 pages, pub. 10/2015
E-book: $69.99 | Print: $79.95 | Bundle: $89.99
www.sap-press.com/3863
SAP PRESS E-Bites

SAP PRESS E-Bites provide you with a high-quality response to your specific project need. If you're looking for detailed instructions on a specific task; or if you need to become familiar with a small, but crucial sub-component of an SAP product; or if you want to understand all the hype around product xyz: SAP PRESS E-Bites have you covered. Authored by the top professionals in the SAP universe, E-Bites provide the excellence you know from SAP PRESS, in a digestible electronic format, delivered (and consumed) in a fraction of the time!
Abani Pattanayak, Williams Ruter
Smart Data Access: Data Virtualization in SAP HANA
ISBN 978-1-4932-1372-6 | $14.99 | 95 pages

Aron MacDonald
Integrating SAP HANA and Hadoop
ISBN 978-1-4932-1293-4 | $12.99 | 105 pages

Tamas Szirtes, Aviad Rivlin
Implementing SAP Fiori Launchpad
ISBN 978-1-4932-1291-0 | $12.99 | 73 pages
The Author of this E-Bite

Aron MacDonald is an independent SAP HANA consultant, currently working for Shell Oil Company. His area of focus is the integration between SAP HANA and Hadoop. He is an SAP HANA Distinguished Engineer (HDE), which is an SAP program devoted to recognizing and promoting SAP HANA technical skills. Learn more about Aron at https://www.sap-press.com/sap-hana-vora-functions-and-integration_4137/author/.