Apache Hadoop

– Kumaresan Manickavelu, June 2020

Problems With Scale

• Failure is the defining difference between distributed and local programming
• If components fail, their workload must be picked up by still-functioning units
• Nodes that fail and restart must be able to rejoin the group activity without a full group restart
• Increased load should cause graceful decline
• Increasing resources should support a proportional increase in load capacity
• Data must be stored and shared with the processing units

Hadoop Ecosystem

Apache Hadoop is a collection of open-source software for reliable, scalable, distributed computing.

• Hadoop Common: the common utilities that support the other Hadoop subprojects
• HDFS: a distributed file system that provides high-throughput access to application data
• MapReduce: a software framework for distributed processing of large data sets on compute clusters
• Pig: a high-level data-flow language and execution framework for parallel computation
• HBase: a scalable, distributed database that supports structured data storage for large tables

HDFS

• Based on Google’s GFS
• Redundant storage of massive amounts of data on cheap and unreliable computers
• Optimized for huge files that are mostly appended to and read

Architecture

• HDFS has a master/slave architecture
• An HDFS cluster consists of a single NameNode and a number of DataNodes
• HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients.
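The metadata/data split above can be sketched in a few lines. This is an illustrative toy, not real HDFS code: the `NameNode` class, block naming, and round-robin placement are assumptions made up for the example; a real NameNode also handles heartbeats, re-replication, leases, and much more.

```python
# Toy sketch of the NameNode's two metadata maps: the file system
# namespace (file -> blocks) and the block map (block -> DataNodes).
# DataNodes would hold the actual bytes; only metadata is modeled here.

class NameNode:
    def __init__(self):
        self.namespace = {}        # file path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode names

    def create_file(self, path, num_blocks, datanodes, replication=2):
        block_ids = [f"{path}#blk{i}" for i in range(num_blocks)]
        self.namespace[path] = block_ids
        for i, blk in enumerate(block_ids):
            # simple round-robin replica placement across DataNodes
            self.block_locations[blk] = [
                datanodes[(i + r) % len(datanodes)] for r in range(replication)
            ]
        return block_ids

nn = NameNode()
blocks = nn.create_file("/logs/a.txt", num_blocks=3,
                        datanodes=["dn1", "dn2", "dn3"])
print(nn.namespace["/logs/a.txt"])    # three block ids
print(nn.block_locations[blocks[0]])  # two replica locations
```

A client would ask the NameNode for `block_locations` and then stream data directly from the DataNodes, keeping the NameNode out of the data path.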

MapReduce

• Provides a clean abstraction for programmers to write distributed applications
• Factors many reliability concerns out of application logic
• A batch data processing system
• Automatic parallelization and distribution
• Fault tolerance
• Status and monitoring tools

Programming Model 

The programmer has to implement an interface of two functions:

– map (in_key, in_value) -> (out_key, intermediate_value) list
– reduce (out_key, intermediate_value list) -> out_value list
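The contract above can be sketched as a minimal single-process driver. This is a Python illustration of the programming model, not Hadoop's actual Java API; `run_mapreduce` and the word-count lambdas are invented for the example.

```python
from itertools import groupby

# Minimal sketch of the MapReduce contract: apply map() to every input
# record, group the intermediate pairs by key (the "shuffle"), then
# apply reduce() once per distinct key.

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))      # map phase
    intermediate.sort(key=lambda kv: kv[0])          # shuffle/sort by key
    output = []
    for out_key, group in groupby(intermediate, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        output.append((out_key, reduce_fn(out_key, values)))  # reduce phase
    return output

# Word count, the canonical example:
records = [(1, "I Love India"), (2, "I Love eBay")]
counts = run_mapreduce(
    records,
    map_fn=lambda _, line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
print(counts)  # [('I', 2), ('India', 1), ('Love', 2), ('eBay', 1)]
```

The framework, not the programmer, owns the loop, the sort, and the grouping; in real Hadoop those steps are also where parallelism and fault tolerance live.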

MapReduce Flow

Mapper (indexing example)

Input is the line number and the actual line.

Input 1: (“100”, “I Love India”)
Output 1: (“I”, “100”), (“Love”, “100”), (“India”, “100”)

Input 2: (“101”, “I Love eBay”)
Output 2: (“I”, “101”), (“Love”, “101”), (“eBay”, “101”)

Reducer (indexing example)

Input is a word and its line numbers.

Input 1: (“I”, “100”, “101”)
Input 2: (“Love”, “100”, “101”)
Input 3: (“India”, “100”)
Input 4: (“eBay”, “101”)

Output: the words are stored along with their line numbers.
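The indexing example end to end can be sketched as below. The function names and the in-memory shuffle are assumptions for illustration; a real Hadoop job would express the same mapper and reducer as Java classes and let the framework do the grouping across machines.

```python
from collections import defaultdict

# Inverted-index sketch matching the slides: the mapper emits
# (word, line_no) pairs; the reducer collects the line numbers per word.

def index_mapper(line_no, line):
    return [(word, line_no) for word in line.split()]

def index_reducer(word, line_nos):
    return (word, sorted(set(line_nos)))

inputs = [("100", "I Love India"), ("101", "I Love eBay")]

# map phase
pairs = [p for line_no, line in inputs for p in index_mapper(line_no, line)]

# shuffle: group intermediate values by key
grouped = defaultdict(list)
for word, line_no in pairs:
    grouped[word].append(line_no)

# reduce phase
index = dict(index_reducer(word, nos) for word, nos in grouped.items())
print(index["I"])     # ['100', '101']
print(index["eBay"])  # ['101']
```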

Google PageRank example

Mapper
• Input is a link and the HTML content
• Output is a list of outgoing links and the PageRank of this page

Reducer
• Input is a link and a list of PageRanks of pages linking to this page
• Output is the PageRank of this page, which is the weighted average of all input PageRanks
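One iteration of this mapper/reducer pair can be sketched on a toy graph. This is a simplified sum-of-shares form without the damping factor real PageRank uses, and the graph, `pagerank_map`, and rank values are all invented for the example; in practice the pass is repeated until the ranks converge.

```python
# One PageRank-style MapReduce pass. Mapper: each page divides its
# current rank evenly among its outgoing links. Reducer: each page's
# new rank is the sum of the shares it receives.

links = {            # toy web graph: page -> outgoing links
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
rank = {page: 1.0 / len(links) for page in links}  # uniform start

def pagerank_map(page, outlinks):
    share = rank[page] / len(outlinks)
    return [(target, share) for target in outlinks]

# map phase: emit (target page, rank share) pairs
contributions = [kv for page, outs in links.items()
                 for kv in pagerank_map(page, outs)]

# reduce phase: sum the incoming shares per page
new_rank = {page: 0.0 for page in links}
for page, share in contributions:
    new_rank[page] += share

print({page: round(r, 3) for page, r in new_rank.items()})
```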

Hadoop at Yahoo!

• World’s largest Hadoop production application: the Yahoo! Search Webmap, a Hadoop application that runs on a Linux cluster of more than 10,000 cores
• Yahoo! is the biggest contributor to Hadoop
• Converting all its batch processing to Hadoop

Hadoop at Amazon

• Hadoop can be run on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3)
• The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw TIFF image data (stored in S3) into 11 million finished PDFs in the space of 24 hours, at a computation cost of about $240
• Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data; it uses a hosted Hadoop framework

Thanks

Questions?

kumaresan.manickavelu@gmail.com

