
A PROJECT REPORT ON

“AIRLINE ANALYSIS OF DATA USING HIVE”

Submitted in partial fulfillment for the award of the degree of
BACHELOR OF TECHNOLOGY
(Rajasthan Technical University, Kota)
IN INFORMATION TECHNOLOGY

SESSION (2018-2019)

Submitted To:
Ma’am Payal Awal
Assistant Professor

Submitted By:
Saumya Gupta (14EEMIT030)
Shalli Gadia (14EEMIT031)
VII Sem IT

DEPARTMENT OF INFORMATION TECHNOLOGY GOVERNMENT WOMEN ENGINEERING COLLEGE NASIRABAD ROAD, MAKHUPURA, AJMER-305002 March, 2018


CANDIDATE DECLARATION

I hereby declare that the work presented in the seminar report entitled “AIRLINE ANALYSIS OF DATA USING HIVE”, submitted in partial fulfillment for the award of the degree of Bachelor of Technology in Information Technology to the Department of Computer Engineering, Government Women’s Engineering College, Ajmer, is an authentic record of my own work carried out during the session 2018-19.

Saumya Gupta (14EEMIT030) Shalli Gadia (14EEMIT031) Government Women’s Engineering College, AJMER


CERTIFICATE

This is to certify that the seminar report entitled “AIRLINE ANALYSIS OF DATA USING HIVE”, being submitted by Saumya Gupta and Shalli Gadia in partial fulfillment of the requirements of the degree of Bachelor of Technology in Information Technology at Rajasthan Technical University, Kota, is work carried out by them under my guidance during the academic session 2018-2019. This work has been found satisfactory by me and is approved for submission.

Supervisor: Ma’am Payal Awal (Assistant Professor)

Place: Ajmer
Date: 22-10-2018


ACKNOWLEDGEMENT

I would like to express my appreciation and deep gratitude to my supervisor, Ma’am Payal Awal, for her continuous guidance and encouragement throughout the development of this seminar. I am grateful to the faculty members of the Information Technology department for their support. I wish to give sincere thanks to Dr. Ranjan Maheshwari, Principal, Govt. Women Engineering College, Ajmer, for providing a suitable environment in which to carry out my seminar work. I also thank Mukesh Khandelwal, Head of the Department of Information Technology, Govt. Women Engineering College, Ajmer, for providing all the facilities required to carry out this seminar report.

Yours Sincerely,

Saumya Gupta(15EEMIT030) Shalli Gadia(15EEMIT031) VII Sem IT


TABLE OF CONTENTS

CHAPTER NO.   TITLE NAME                                            PAGE NO.

              CANDIDATE DECLARATION
              CERTIFICATE
              ACKNOWLEDGEMENT
              ABSTRACT
1.            INTRODUCTION                                           1
              1.1 Big Data                                           2
              1.2 Big Data History                                   2
              1.3 Major Problems in Big Data                         3
2.            HADOOP
              2.1 What is Hadoop?                                    4
              2.1.1 Components of Hadoop                             4
              2.2 Apache Hadoop Ecosystem                            5
              2.2.1 Architecture Comparison: Hadoop 1 vs Hadoop 2    6
              2.3 Hadoop Design Principles                           6
3.            HADOOP DISTRIBUTED FILE SYSTEM
              3.1 HDFS Architecture                                  7
              3.2 Assumptions and Goals                              7
4.            INSTALLATION PROCESS
              4.1 Install VirtualBox                                 9
              4.2 Install Cloudera                                   10
5.            COMMANDS
              5.1 Hadoop HDFS Commands                               12
6.            SQOOP
              6.1 Description                                        15
              6.2 Sqoop Import Command                               15
              6.3 Sqoop Export Command                               16
7.            HIVE
              7.1 What is Hive?                                      17
              7.2 Facts of Hive                                      17
              7.3 Hive Architecture                                  17
              7.4 How to Process Data with Hive                      18
              7.5 Getting Started with Hive                          18
              7.5.1 Hive Table                                       18
              7.6 Performing Queries                                 19
8.            ANALYSIS OF AIRPORT DATA
              8.1 Airport Data Set                                   20
              8.2 Airline Data Set                                   21
              8.3 Route Data Set                                     22
              8.4 Analysis of a Few Aspects                          22
              8.5 Methodology                                        23
              8.5.1 Commands for Analysis of Airport Data            23
              CONCLUSION                                             26
              REFERENCES                                             27


ABSTRACT

In the contemporary world, data analysis is a challenge across many disciplines, even where there is deep specialization within each discipline. Effective data analytics helps in analyzing the data of any business system, and it is big data that accelerates the process of data analysis, paving the way for the success of any business intelligence system. As an industry expands, its data expands with it, and it becomes increasingly difficult to handle the huge amount of data that gets generated, whatever the business may be, across fields ranging from social media and finance to flight data, the environment and health. Big data can be used to assess risk in the insurance industry and to track reactions to products in real time. It is also used to monitor things as diverse as wave movements, flight data, traffic data, financial transactions, health and crime. The challenge of big data is how to use it to create something of value to the user: how it can be gathered, stored, processed and analyzed so that raw data is turned into information that supports decision making. In this report, big data is depicted in the form of a case study on airline data. The proposed method considers the following scenario: an airport has a huge amount of data related to the number of flights, dates and times of arrival and departure, flight routes, the number of airports operating in each country, and the list of active airlines in each country. The problem faced until now is that only a limited amount of this data could be analyzed from conventional databases. The intention of the proposed model is to develop a model for the airline data that provides a platform for new analytics based on a set of analytical queries.


CHAPTER 1 INTRODUCTION

BIG DATA
Big data is a term that describes the large volume of data, both structured and unstructured, that inundates a business on a day-to-day basis. But it is not the amount of data that is important; it is what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves. When the amount of data is beyond the processing capability of a device, that data is big data for that particular device. Example: 1 GB of data is big data for a basic (non-smart) phone, but it is not big data for a PC.

Big Data History
While the term “big data” is relatively new, the act of gathering and storing large amounts of information for eventual analysis is ages old. The concept gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three Vs:

Volume. Organizations collect data from a variety of sources, including business transactions, social media and information from sensor or machine-to-machine data. In the past, storing it would have been a problem, but new technologies (such as Hadoop) have eased the burden.

Velocity. Data streams in at an unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time.

Variety. Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, audio, stock ticker data and financial transactions.

At SAS, we consider two additional dimensions when it comes to big data:

Variability. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data.

Complexity. Today's data comes from multiple sources, which makes it difficult to link, match, cleanse and transform data across systems. However, it is necessary to connect and correlate relationships, hierarchies and multiple data linkages, or your data can quickly spiral out of control.

Major problems faced in big data:
- storage of the massive data
- processing of the data


CHAPTER 2 HADOOP

What is Hadoop?
Hadoop is an open source distributed processing framework that manages data processing and storage for big data applications running in clustered systems. It is at the center of a growing ecosystem of big data technologies that are primarily used to support advanced analytics initiatives, including predictive analytics, data mining and machine learning applications. Hadoop can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing and analyzing data than relational databases and data warehouses provide.

Components of Hadoop

Hadoop Distributed File System (HDFS): HDFS is the distributed storage system designed to provide high-performance access to data across multiple nodes in a cluster. HDFS is capable of storing huge amounts of data (100+ terabytes) and streaming it at high bandwidth to big data analytics applications.

MapReduce: MapReduce is a programming model that enables distributed processing of large data sets on compute clusters of commodity hardware. Hadoop MapReduce first performs the map phase, which splits a large input into pieces and transforms them into another set of data. The reduce phase then takes the output of the map phase and assembles the results into a consumable solution. Hadoop can run MapReduce programs written in many languages, such as Java, Ruby, Python and C++. Owing to the parallel nature of MapReduce programs, Hadoop easily facilitates large-scale data analysis using multiple machines in the cluster.

YARN: Yet Another Resource Negotiator (YARN) is a large-scale, distributed operating system for big data applications and is considered the next generation of Hadoop's compute platform. It provides a clustering platform that helps manage resources and schedule tasks. YARN was designed to separate global and application-specific resource-management components, and it improves utilization over the more static MapReduce rules of early Hadoop versions through dynamic allocation of cluster resources.

Apache Hadoop Frameworks

1. Hive: Hive is an open-source data warehousing framework that structures and queries data using a SQL-like language called HiveQL, so developers can work with structured data in a distributed system without hand-writing complex MapReduce applications. When a piece of logic cannot be expressed in Hive, traditional map/reduce programmers can plug in their custom mappers and reducers. Hive presents a familiar, relational-style view of the data and can accelerate queries using its indexing feature.

2. HBase: HBase is an open-source, distributed, versioned, non-relational database that provides random, real-time read/write access to your big data. HBase is a NoSQL database for Hadoop and a great framework for businesses that have to deal with multi-structured or sparse data. It pushes the boundaries of Hadoop, which otherwise runs processes in batch and does not allow modification: with HBase, you can modify data in real time without leaving the HDFS environment. HBase is a good fit for data that falls into a big table. It stores and searches billions of rows and millions of columns, and it distributes the table across multiple nodes, paving the way for MapReduce jobs to run locally.

3. Pig: Pig is an open-source technology that enables cost-effective storage and processing of large data sets without requiring any specific format. Pig is a high-level platform that uses the Pig Latin language for expressing data analysis programs, and it features a compiler that produces sequences of MapReduce programs. The framework processes very large data sets across hundreds to thousands of computing nodes, which makes it amenable to substantial parallelization. In simple terms, Pig can be considered a high-level mechanism for executing MapReduce jobs on Hadoop clusters using parallel programming.

Apache Hadoop Ecosystem

The Hadoop platform consists of two key services: a reliable, distributed file system called the Hadoop Distributed File System (HDFS) and a high-performance parallel data processing engine called Hadoop MapReduce. Hadoop was created by Doug Cutting and named after his son's toy elephant. Vendors that provide Hadoop-based platforms include Cloudera, Hortonworks, MapR, Greenplum, IBM and Amazon.

The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive data sets that can scale from tens of terabytes to petabytes in size. The popularity of Hadoop has grown in the last few years because it meets the needs of many organizations for flexible data analysis with an unmatched price-performance curve. These flexible analysis capabilities apply to data in a variety of formats, from unstructured data such as raw text, to semi-structured data such as logs, to structured data with a fixed schema.

Hadoop has been particularly useful in environments where massive server farms are used to collect data from a variety of sources. Hadoop can process parallel queries as big, background batch jobs on the same server farm, which saves the user from having to acquire additional hardware for a traditional database system (assuming such a system could scale to the required size). Hadoop also reduces the effort and time required to load data into another system, because the data can be processed directly within Hadoop; with very large data sets, that loading overhead becomes impractical. Many of the ideas behind the open-source Hadoop project originated in the Internet search community, most notably at Google and Yahoo!. Search engines employ massive farms of inexpensive servers that crawl the Internet, retrieving web pages into local clusters where they are analyzed with massive, parallel queries to build search indices and other useful data structures.

The Hadoop ecosystem includes other tools that address particular needs. Hive is a SQL dialect and Pig is a dataflow language; both hide the tedium of creating MapReduce jobs behind higher-level abstractions more appropriate for user goals. ZooKeeper is used for federating services and Oozie is a scheduling system. Avro, Thrift and Protobuf are platform-portable data serialization and description formats.


HADOOP DESIGN PRINCIPLES
1. Facilitate the storage and processing of large and rapidly growing data sets:
   > structured and unstructured data
   > simple programming models
2. Scale out rather than scale up.
3. Bring code to data rather than data to code.
4. Provide high scalability and availability.
5. Use commodity hardware.


CHAPTER 3 HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS Architecture

Assumptions and Goals

Hardware Failure – In HDFS, hardware failure is common. An HDFS instance may consist of hundreds or thousands of servers, each storing part of the data, so in such a large network there is a real probability that some component will fail. Fault detection and automatic recovery are therefore core goals of HDFS.

Streaming Data Access – Streaming data access means transferring data at a constant, high rate in order to carry out various functions. By streaming data, HDFS can support services such as high-definition TV delivery or continuous backup to a storage medium. Data is therefore read continuously at a constant rate rather than in small blocks/packets.

Latency – Latency is the time delay caused by the various operations during processing. In Hadoop, initial time is spent on activities such as resource distribution, job submission and split creation, so latency in Hadoop is relatively high.

Large Data Sets – Applications that run on Hadoop work with considerable data sets; memory requirements can vary from gigabytes to terabytes.

Moving Computation vs Moving Data – In HDFS, computation is moved to the data. In Hadoop, taking the computation toward the data is more efficient, and HDFS provides interfaces that move the application to where the data is located.

Name Node and Data Node
The Hadoop Distributed File System follows a master-slave architecture. A Hadoop cluster consists of a single name node, which acts as the master server: it manages the file system namespace and regulates access to files by clients.

Difference between Name Node and Data Node

The name node executes file system namespace operations such as closing and renaming files and directories, and it maintains the mapping of blocks to data nodes, whereas data nodes are responsible for reading and writing data and for block creation, replication and deletion. In HDFS, a file is divided into one or more blocks.

Hard Link – A hard link is a file that links a name with a file in the distributed file system. There can be multiple hard links to the same file, so we can create multiple names for the same file, producing an aliasing effect: for example, if the contents of the file are altered through one name, the change is visible when the same file is opened under another name.

Soft Link (Symbolic Link) – A symbolic link holds a reference to another file or directory in the form of a relative path. If the link is deleted, the target is not affected; if the target is moved or removed, the link still points to the old target, and a link to a non-existing target is broken.

Replication Factor – The replication factor is the number of copies that should be maintained for a particular file. It is stored in the name node, which maintains the file system namespace.

Data Replication – Data replication is a main feature of HDFS and makes it a very reliable system that can store large files. Files are broken into blocks, and all blocks are the same size except the last one. To provide reliability, blocks are replicated. The block size and replication factor are specified at creation time, but they are not fixed and can be changed later. The name node receives a block report and a heartbeat from each data node at periodic intervals, which ensures the data nodes are working properly; the block report contains a list of all blocks on a data node. Files can be written only once, and the name node makes the decisions about block replication.

Replica Placement – Optimized replica placement distinguishes HDFS from other distributed file systems.

Secondary Name Node – The secondary name node connects to the name node and builds snapshots of the primary name node's directory (namespace) information.
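The replication factor and block health of data already in HDFS can be inspected and changed from the command line. A minimal sketch, reusing the /user/dataflair paths that appear in Chapter 5; the paths themselves are only illustrative:

Set the replication factor of a file to 2 and wait until re-replication completes:
hdfs dfs -setrep -w 2 /user/dataflair/dir1/sample

Print the block report for a directory tree, including replication and block locations:
hdfs fsck /user/dataflair/dir1 -files -blocks -locations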


CHAPTER 4 INSTALLATION PROCESS

Install VirtualBox 5.0.4
Steps to install VirtualBox on a Windows host


Install Cloudera as a virtual appliance

Steps to deploy Cloudera CDH 5.4.x with VirtualBox



CHAPTER 5 COMMANDS

Hadoop HDFS Commands

version
Usage: version
Example: hdfs version
Description: This Hadoop command prints the Hadoop version.

mkdir
Usage: mkdir <path>
Example: hdfs dfs -mkdir /user/dataflair/dir1
Description: This HDFS command takes path URIs as arguments and creates directories. It creates any parent directories in the path that are missing (like mkdir -p in Linux).

ls
Usage: ls <path>
Example: hdfs dfs -ls /user/dataflair/dir1
Description: This command displays a list of the contents of the directory specified by the path provided by the user, showing the name, permissions, owner, size and modification date of each entry.

ls -R
Example: hdfs dfs -ls -R
Description: Behaves like -ls, but recursively displays entries in all subdirectories of a path.

put
Usage: put <localsrc> <dest>
Example: hdfs dfs -put /home/dataflair/Desktop/sample /user/dataflair/dir1
Description: This basic Hadoop command copies the file or directory from the local file system to the destination within HDFS.

copyFromLocal
Usage: copyFromLocal <localsrc> <dest>
Example: hdfs dfs -copyFromLocal /home/dataflair/Desktop/sample /user/dataflair/dir1
Description: Similar to the put command, except that the source is restricted to a local file reference.

get
Usage: get [-crc] <src> <localdst>
Example: hdfs dfs -get /user/dataflair/dir2/sample /home/dataflair/Desktop
Description: This command copies the file or directory in HDFS identified by the source to the local file system path identified by the local destination.

getmerge
Example: hdfs dfs -getmerge /user/dataflair/dir2/sample /home/dataflair/Desktop
Description: This command retrieves all files that match the source path entered by the user in HDFS and copies them into a single, merged file in the local file system identified by the local destination.

getfacl
Example: hadoop fs -getfacl /user/dataflair/dir1/sample
         hadoop fs -getfacl -R /user/dataflair/dir1
Description: This command shows the Access Control Lists (ACLs) of files and directories. If a directory contains a default ACL, getfacl also displays the default ACL. Options: -R displays the ACLs of all files and directories recursively; path is the file or directory to list.

getfattr
Example: hadoop fs -getfattr -d /user/dataflair/dir1/sample
Description: This command displays the extended attribute names and values of a file or directory; -d dumps all extended attribute values associated with the path.

copyToLocal
Usage: copyToLocal <src> <localdst>
Example: hdfs dfs -copyToLocal /user/dataflair/dir1/sample /home/dataflair/Desktop
Description: Similar to the get command, except that the destination is restricted to a local file reference.

cat
Usage: cat <filename>
Example: hdfs dfs -cat /user/dataflair/dir1/sample
Description: This command displays the contents of the file on the console (stdout).

mv
Usage: mv <src> <dest>
Example: hadoop fs -mv /user/dataflair/dir1/purchases.txt /user/dataflair/dir2
Description: This command moves the file or directory indicated by the source to the destination, within HDFS.

cp
Usage: cp <src> <dest>
Example: hadoop fs -cp /user/dataflair/dir2/purchases.txt /user/dataflair/dir1
Description: This command copies the file or directory identified by the source to the destination, within HDFS.


CHAPTER 6 SQOOP

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

Description
Sqoop supports incremental loads of a single table or of a free-form SQL query, as well as saved jobs which can be run multiple times to import the updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase, and exports can be used to put data from Hadoop into a relational database. The name Sqoop comes from "SQL-to-Hadoop". Sqoop became a top-level Apache project in March 2012.

Sqoop – IMPORT Command

sqoop import --connect jdbc:mysql://localhost/employees --username edureka --table employees
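The incremental loads and Hive imports mentioned above use additional flags on the same command. A hedged sketch, reusing the hypothetical MySQL connection string and edureka username from the example; the check column emp_id and the last value are illustrative:

sqoop import --connect jdbc:mysql://localhost/employees --username edureka --table employees --incremental append --check-column emp_id --last-value 1000

sqoop import --connect jdbc:mysql://localhost/employees --username edureka --table employees --hive-import --create-hive-table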


Sqoop – Export
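A minimal sketch of the export direction, assuming the result data already sits in an HDFS directory; the connection string, table name and HDFS path below are illustrative:

sqoop export --connect jdbc:mysql://localhost/employees --username edureka --table employees_summary --export-dir /user/dataflair/emp_summary --input-fields-terminated-by ','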


CHAPTER 7 HIVE

What Is Hive?
• Apache Hive is a data warehouse software project (initially developed by Facebook) built on top of Apache Hadoop for providing data summarization, query and analysis.
• Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. It provides a SQL interface to query data stored in the Hadoop Distributed File System (HDFS) or in Amazon S3, which on AWS is accessed through an HDFS-like abstraction layer called EMRFS (Elastic MapReduce File System).

Apache Hive: Fast Facts

Hive Architecture


How to process data with Apache Hive?
• The execution engine submits the stages of the query plan to the appropriate components. In each task, the deserializer associated with the table or with the intermediate outputs is used to read rows from HDFS files, and the rows are passed through the associated operator tree. Once output is generated, it is written to a temporary HDFS file through the serializer. That temporary file then feeds the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location.
• For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the driver.
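The stages that the execution engine runs for a particular statement can be listed with HiveQL's EXPLAIN command. A small illustration; the query itself is only an example, written against the airports table created in Chapter 8:

EXPLAIN
SELECT country, count(*) AS airport_count
FROM airports
GROUP BY country;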

Getting Started With Hive

Hive Table
A Hive table consists of:
- Data: a file or group of files in HDFS
- Schema: metadata stored in a relational database (the metastore)
Schema and data are separate:
- a schema can be defined for existing data
- data can be added or removed independently
- Hive can be pointed at existing data
You have to define a schema if you have existing data in HDFS that you want to use in Hive; one way to do this is sketched below.

Defining a Table
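For the case above, where the data already exists in HDFS, an external table attaches a schema without moving the files. A minimal sketch; the directory and columns below are illustrative and not part of this report's data sets:

CREATE EXTERNAL TABLE web_logs (ip STRING, request_time STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/web_logs';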

Managing Table

Loading Data
• Use LOAD DATA to import data into a Hive table (a sketch follows this list).
• The data is not modified by Hive; it is loaded as is.
• Use the word OVERWRITE to write over a file of the same name.
• The schema is checked when the data is queried.
• If a field does not match the schema, it will be read as a null.
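A minimal sketch of defining a managed table and loading a local file into it; the file path is illustrative, and OVERWRITE replaces whatever the table already holds:

CREATE TABLE purchases (purchase_date STRING, store STRING, amount DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA LOCAL INPATH '/home/cloudera/purchases.txt' OVERWRITE INTO TABLE purchases;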

Performing Queries (HiveQL)
• SELECT
• Supports the following:
  - WHERE clause
  - UNION ALL and DISTINCT
  - GROUP BY and HAVING
  - LIMIT clause
• Can use REGEX column specification

Subqueries
• Hive supports subqueries only in the FROM clause (as in the sketch below).
• The columns in the subquery SELECT list are available in the outer query.
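A short sketch combining a FROM-clause subquery with GROUP BY and LIMIT, written against the routes table created in Chapter 8:

SELECT t.airline, t.direct_routes
FROM (SELECT airline, count(*) AS direct_routes FROM routes WHERE stops = '0' GROUP BY airline) t
LIMIT 10;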


CHAPTER 8 ANALYSIS OF AIRPORT DATA

The proposed method considers the following scenario: an airport has a huge amount of data related to the number of flights, dates and times of arrival and departure, flight routes, the number of airports operating in each country, and the list of active airlines in each country. The problem faced until now is that only a limited amount of this data could be analyzed from conventional databases. The intention of the proposed model is to develop a model for the airline data that provides a platform for new analytics based on the following queries.

Airport Data Set

Attributes and their descriptions:
Airport ID: Unique OpenFlights identifier for this airport.
Name: Name of airport. May or may not contain the City name.
City: Main city served by airport. May be spelled differently from Name.
Country: Country or territory where airport is located.
IATA/FAA: 3-letter FAA code, for airports located in Country "United States of America".
ICAO: 4-letter ICAO code.
Latitude: Decimal degrees, usually to six significant digits. Negative is South, positive is North.
Longitude: Decimal degrees, usually to six significant digits. Negative is West, positive is East.
Altitude: In feet.
Timezone: Hours offset from UTC. Fractional hours are expressed as decimals, e.g. India is 5.5.
DST: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).
Tz database time zone: Timezone in "tz" (Olson) format, e.g. "America/Los_Angeles".

Airline Data Set

Attributes and their descriptions:
Airline ID: Unique OpenFlights identifier for this airline.
Name: Name of the airline.
Alias: Alias of the airline. For example, All Nippon Airways is commonly known as "ANA".
IATA: 2-letter IATA code, if available.
ICAO: 3-letter ICAO code, if available.
Callsign: Airline callsign.
Country: Country or territory where the airline is incorporated.
Active: "Y" if the airline is or has until recently been operational, "N" if it is defunct. This field is not reliable: in particular, major airlines that stopped flying long ago but have not had their IATA code reassigned (e.g. Ansett/AN) will incorrectly show as "Y".


Route Data Set

Attributes and their descriptions:
Airline: 2-letter (IATA) or 3-letter (ICAO) code of the airline.
Airline ID: Unique OpenFlights identifier for the airline.
Source airport: 3-letter (IATA) or 4-letter (ICAO) code of the source airport.
Source airport ID: Unique OpenFlights identifier for the source airport.
Destination airport: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport.
Destination airport ID: Unique OpenFlights identifier for the destination airport.
Codeshare: "Y" if this flight is a codeshare (that is, not operated by Airline but by another carrier), empty otherwise.
Stops: Number of stops on this flight ("0" for direct).
Equipment: 3-letter codes for plane type(s) generally used on this flight, separated by spaces.

This report proposes a method to analyze a few aspects of the airline data, namely: a) the list of airports operating in India, b) the list of airlines having zero stops, c) the list of airlines operating with code share, d) the list of the highest airports in each country, and e) the list of active airlines in the United States.


METHODOLOGY
The methodology used is as follows:
1. Create tables with the required attributes.
2. Extract the semi-structured data into the tables using the LOAD DATA command.
3. Analyze the data for the following queries:
   a) list of airports operating in India
   b) list of airlines having zero stops
   c) list of airlines operating with code share
   d) which country has the highest number of airports
   e) list of active airlines in the United States

Commands for analysis of airport data

Fig.: Create tables and load the data sets into HDFS

create table airports (id int, name string, city string, country string, faa string, icao string, latitude string, longitude string, timezone string, DST string, tz string) row format delimited fields terminated by ',';

create table final_airlines (airline string, name string, alias string, IATA string, ICAO string, callsign string, country string, active string) row format delimited fields terminated by ',';

create table routes (airline string, id string, source string, source_id string, destination string, destination_id string, code_share string, stops string, equipments string) row format delimited fields terminated by ',';

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/airlines/Final_airlines.enc' INTO TABLE final_airlines;

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/airlines/airports_mod.enc' INTO TABLE airports;

LOAD DATA LOCAL INPATH '/home/cloudera/Desktop/airlines/routes.enc' INTO TABLE routes;

Fig.: Count of airlines grouped by active status (used for the active-airlines analysis)

create table active_airlines as select active, count(*) as status from final_airlines group by active;
select * from active_airlines;
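The code-share aspect (query c in the methodology) is not among the commands shown above; a hedged sketch in the same style, using the code_share column of the routes table defined earlier, could be:

create table codeshare_routes as select * from routes where code_share LIKE '%Y%';
select * from codeshare_routes;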


Fig.: List of airlines having zero stops

create table zero_stops as select * from routes where stops = '0';
select * from zero_stops;

List of airports operating in country India

create table india_airports as select * from airports where country LIKE '%India%';
select * from india_airports;

List of active airlines in the United States

create table active_airlines_US as select * from final_airlines where (country LIKE '%United States%') and (active LIKE '%Y%');
select * from active_airlines_US;


CONCLUSION

This project addresses the related work on distributed databases found in the literature, the challenges ahead with big data, and a case study on airline data analysis using Hive. The authors attempted a detailed analysis of the airline data sets, such as listing the airports operating in India, the airlines having zero stops, the airlines operating with code share, the country with the highest number of airports, and the active airlines in the United States. The focus was on processing big data sets using the Hive component of the Hadoop ecosystem in a distributed environment. This work will benefit developers and business analysts in accessing and processing their user queries.


REFERENCES
[1] Challenges and Opportunities with Big Data. http://cra.org/ccc/wp-content/uploads/sites/2/2015/05/bigdatawhitepaper.pdf
[2] Oracle: Big Data for Enterprise, June 201. http://www.oracle.com/us/products/database/big-data-for-enterprise-519135.pdf
[3] Data set taken from Edureka. http://www.edureka.co/my-course/big-data-and-hadoop
[4] HiveQL Language Manual
[5] Apache Tez
