A Project report on
TREND ANALYSIS BASED ON ACCESS PATTERN OVER WEB LOGS USING HADOOP
In partial fulfillment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE & ENGINEERING
Submitted By
I. HARIPRIYA
(12JD1A0518)
P. NAGA SAI APARNA
(13JD5A0506)
N. PRAJNA
(12JD1A0546)
S. SOUMIKA
(12JD1A0558)
Under the Guidance of
V. RAJESH BABU, M.Tech
Assistant Professor
Department of Computer Science & Engineering
ELURU COLLEGE OF ENGINEERING AND TECHNOLOGY
DUGGIRALA (V), PEDAVEGI (M), ELURU-534004
(Affiliated to Jawaharlal Nehru Technological University)
2012-2016
ELURU COLLEGE OF ENGINEERING AND TECHNOLOGY
(Affiliated to JNTUK, Approved by AICTE-NEW DELHI)
Department of Computer Science and Engineering
This is to certify that the project report titled "TREND ANALYSIS BASED ON ACCESS PATTERN OVER WEB LOGS USING HADOOP" is being submitted by I. HARIPRIYA (12JD1A0518), P. NAGA SAI APARNA (13JD5A0506), N. PRAJNA (12JD1A0546) and S. SOUMIKA (12JD1A0558) of B.Tech IV year II semester, Computer Science & Engineering, and is a record of bonafide work carried out by them. The results embodied in this report have not been submitted to any other University for the award of any degree.
PROJECT GUIDE
HEAD OF THE DEPARTMENT
Mr. V. RAJESH BABU, M.Tech (CSE)
Assistant Professor
EXTERNAL EXAMINER
ACKNOWLEDGEMENT
The present project work is the result of several days of study of various aspects of project development. During this effort we have received a great amount of help from our Chairman Sri V. RAGHAVENDRARAO and Secretary Sri V. RAMA KRISHNARAO of Padmavathi Group of Institutions, which we wish to acknowledge and thank from the depth of our hearts. We are thankful to our Principal Dr. P. BALAKRISHNA PRASAD for permitting and encouraging us to carry out this project. We are deeply indebted to Dr. G. GURUKESAVADAS, Professor & Head of the Department, whose motivation and constant encouragement led us to pursue a project in the field of software development. We are very much obliged and thankful to our project guide Mr. V. RAJESH BABU, Assistant Professor, for providing this opportunity and for the constant encouragement given by him during the course. We are grateful for his valuable guidance and suggestions during our project work. Our parents have put us ahead of themselves. Because of their hard work and dedication, we have had opportunities beyond our wildest dreams. Our heartfelt thanks to them for giving us all we ever needed to be successful students and individuals. Finally, we express our thanks to all our faculty members, classmates, friends and neighbors who helped us in the completion of our project; without their infinite love and patience this would never have been possible.
I. HARIPRIYA
(12JD1A0518)
P. NAGA SAI APARNA
(13JD5A0506)
N. PRAJNA
(12JD1A0546)
S. SOUMIKA
(12JD1A0558)
ABSTRACT
The first part covers some fundamental theory and summarizes the basic goals and techniques of log file analysis. It reveals that log file analysis is a neglected field of computer science: available papers mostly describe specific log analyzers, and only a few contain any general methodology. The second part contains three case studies that illustrate different applications of log file analysis. The examples were selected to show quite different approaches and goals of analysis, and thus they set up different requirements. The analysis of requirements then follows in the next part, which discusses various criteria put on a general analysis tool and also proposes some design suggestions. Finally, the last part outlines the design and implementation of a universal analyzer. Some features are presented in more detail, while others are just intentions or suggestions.
TABLE OF CONTENTS

CHAPTER NO   NAME                                      PAGE NO
             ACKNOWLEDGEMENT                           iii
             ABSTRACT                                  iv
             TABLE OF CONTENTS                         v
             LIST OF FIGURES                           vii
1            INTRODUCTION                              1
             1.1 About Log Files                       1
             1.2 Motivation                            6
             1.3 Problem Definition                    6
             1.4 Current State of Technology           7
             1.5 Current Practice                      8
2            HISTORY OF HADOOP                         10
3            LITERATURE SURVEY                         22
4            SYSTEM ANALYSIS                           27
             4.1 Existing System                       27
             4.2 Proposed System                       27
             4.3 System Specification                  29
             4.3.1 Software Requirements               29
             4.3.2 Hardware Requirements               29
5            SYSTEM DESIGN                             30
             5.1 System Architecture                   30
             5.2 Use Case Diagram                      31
             5.3 Behavioral Model                      34
             5.3.1 Data Flow Diagram                   34
             5.3.2 Sequence Diagram                    35
6            SYSTEM IMPLEMENTATION                     36
             6.1 Language of Metadata                  36
             6.1.1 Log Description                     36
             6.1.2 Examples                            37
             6.2 Language of Analysis                  38
             6.2.1 Principle of Operation              38
             6.2.2 Data Filtering and Cleaning         39
             6.2.3 Analytical Tasks                    41
             6.3 Analyzer Layout                       44
             6.3.1 Operation                           44
             6.3.2 File Types                          45
             6.3.3 Modules                             46
7            SOFTWARE TESTING                          61
             7.1 System Testing                        61
             7.2 Architecture Testing                  64
             7.3 Performance Testing                   65
8            RESULTS                                   70
9            CONCLUSIONS AND FUTURE WORK               74
             9.1 Conclusion                            74
             9.2 Future Enhancements                   74
10           REFERENCES                                75
LIST OF FIGURES

CHAPTER NO   FIGURE NAME                               PAGE NO
1            Big Data Applications                     3
3            System Description                        16
3            Pig Processing Steps                      12
4            Hadoop Framework                          18
4            The Log Analysis (Use Case Diagram)       15
4            Searching the Web                         15
4            Flow of Data                              16
4            The Log Analysis (Sequence Diagram)       17
5            Analyzer Layout                           26
5            Components of Hadoop Framework            30
5            Flow of Data in MapReduce Framework       32
5            Format of Execution                       33
5            Log Analysis Solution Architecture        37
CHAPTER 1
INTRODUCTION
1.1 About Log Files
Current software applications often produce (or can be configured to produce) some auxiliary text files known as log files. Such files are used during various stages of software development, mainly for debugging and profiling purposes. Use of log files helps testing by making debugging easier: it allows following the logic of the program, at a high level, without having to run it in debug mode. Nowadays, log files are commonly used also at customers' installations for the purpose of permanent software monitoring and/or fine-tuning. Log files have become a standard part of large applications and are essential in operating systems, computer networks and distributed systems. Log files are often the only way to identify and locate an error in software, because log file analysis is not affected by any time-based issues known as the probe effect. This is in contrast to the analysis of a running program, where the analytical process can interfere with time-critical or resource-critical conditions within the analyzed program. Log files are often very large and can have a complex structure. Although the process of generating log files is quite simple and straightforward, log file analysis can be a tremendous task that requires enormous computational resources, a long time and sophisticated procedures. This often leads to a common situation in which log files are continuously generated and occupy valuable space on storage devices, but nobody uses them or utilizes the enclosed information. Big data really means big data: it is a collection of large datasets that cannot be processed using traditional computing techniques. Big data is not merely data; rather, it has become a complete subject, which involves various tools, techniques and frameworks.
What Comes Under Big Data?
Big data involves the data produced by different devices and applications. Given below are some of the fields that come under the umbrella of Big Data.
Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Stock Exchange Data: The stock exchange data holds information about the 'buy' and 'sell' decisions made by customers on the shares of different companies.
Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
Transport Data: Transport data includes model, capacity, distance and availability of a vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:
Structured data: Relational data.
Semi Structured data: XML data.
Unstructured data: Word, PDF, Text, Media Logs.
Benefits of Big Data
Big data is really critical to our lives, and it is emerging as one of the most important technologies in the modern world. Following are just a few benefits which are well known to all of us:
Using the information kept in social networks like Facebook, marketing agencies learn about the response to their campaigns, promotions, and other advertising media.
Using the information in social media, such as the preferences and product perception of their consumers, product companies and retail organizations plan their production.
Using the data regarding the previous medical history of patients, hospitals provide better and quicker service.
Big Data Technologies
Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making resulting in greater operational efficiencies, cost reductions, and reduced risks for the business. To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security. There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:
Operational Big Data
This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored. NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, cheaper, and faster to implement. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
Analytical Big Data
This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data. MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines. These two classes of technology are complementary and frequently deployed together.
Operational vs. Analytical Systems

                 Operational          Analytical
Latency          1 ms - 100 ms        1 min - 100 min
Concurrency      1000 - 100,000       1 - 10
Access Pattern   Writes and Reads     Reads
Queries          Selective            Unselective
Data Scope       Operational          Retrospective
End User         Customer             Data Scientist
Technology       NoSQL                MapReduce, MPP Database
Challenges of Big Data
The major challenges associated with big data are as follows:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To fulfill the above challenges, organizations normally take the help of enterprise servers.
1.2 Motivation
There are various applications (known as log file analyzers or log file visualization tools) that can digest a log file of a specific vendor or structure and produce easily human-readable summary reports. Such tools are undoubtedly useful, but their usage is limited only to log files of a certain structure. Although such products have configuration options, they can answer only built-in questions and create built-in reports. The initial motivation of this work was the lack of a Cisco NetFlow analyzer that could be used to monitor and analyze large computer networks like the metropolitan area network of the University of West Bohemia (WEBNET) or the country-wide backbone of the Czech Academic Network (CESNET) using Cisco NetFlow data exports. Because of the amount of log data (every packet is logged!), the evolution of the NetFlow log format over time and the wide spectrum of
monitoring goals/questions, it seems that the introduction of a new, systematic, efficient and open approach to log analysis is necessary. There is also a belief that it is useful to do research in the field of log file analysis and to design an open, very flexible modular tool that would be capable of analyzing almost any log file and answering any questions, including very complex ones. Such an analyzer should be programmable, extendable, efficient (because of the volume of log files) and easy to use for end users. It should not be limited to analyzing just log files of a specific structure or type, and the type of question should not be restricted either.
1.3 Problem Definition
The overall goal of this research is to invent and design a good model of generic processing of log files. The problem covers areas of formal languages and grammars, finite state machines, lexical and syntax analysis, data-driven programming techniques and data mining/warehousing techniques. The following list sums up the areas involved in this research.
Formal definition of a log file
Formal description of the structure and syntax of a log file (metadata)
Lexical and syntax analysis of log file and metadata information
Formal specification of a programming language for easy and efficient log analysis
Design of internal data types and structures (includes RDBMS)
Design of such programming language
Design of a basic library/API and functions or operators for easy handling of logs within the programming language
Deployment of data mining/warehousing techniques if applicable
Design of a user interface
The expected results are both theoretical and practical. The main theoretical results are (a) the design of the framework and (b) the formal language for log analysis description. From a practical point of view, the results of the research and the design of the analyzer should be verified and evaluated by a prototype implementation of an experimental Cisco NetFlow analyzer.
1.4 Current State of Technology
In the past decades surprisingly little attention has been paid to the problem of getting useful information from log files. It seems there are two main streams of research. The first one concentrates on validating program runs by checking the conformity of log files to a state machine: records in a log file are interpreted as transitions of a given state machine, and if some illegal transition occurs, then there is certainly a problem, either in the software under test, in the state machine specification or in the testing software itself. The second branch of research is represented by articles that just describe various ways of producing statistical output. The following items summarize the current possible usage of log files:
Generic program debugging and profiling
Tests whether program conforms to a given state machine
Various usage statistics, top ten, etc.
Security monitoring
According to available scientific papers, it seems that the most rapidly evolving and developed area of log file analysis is the WWW industry. Log files of HTTP servers are nowadays used not only for system load statistics; they also offer a very valuable and cheap source of feedback. Providers of web content were the first who lacked more detailed and sophisticated reports based on server logs. They require detecting behavioral patterns, paths, trends etc. Simple statistical methods do not satisfy these needs, so an advanced approach must be used. There are over 30 commercially available applications for web log analysis and many more available for free on the Internet. Regardless of their price, they are disliked by their users and considered too slow, inflexible and difficult to maintain. Some log files, especially small and simple ones, can also be analyzed using common spreadsheet or database programs. In such a case, the logs are imported into a worksheet or database and then analyzed using the available functions and tools. The remaining, but probably the most promising, way of log file processing is represented by data-driven tools like AWK. In connection with regular expressions such tools are very efficient and flexible. On the other hand, they are too low-level, i.e. their usability is limited to text files and one-way, single-pass operation. For higher-level tasks they lack mainly advanced data structures.
1.5 Current Practice
Prior to a more formal definition, let us simply describe log files and their usage. Typically, log files are used by programs in the following way:
The log file is an auxiliary output file, distinct from other outputs of the program. Almost all log files are plain text files.
On startup of the program, the log file is either empty, or contains whatever was left from previous runs of the program.
During program operation, lines (or groups of lines) are gradually appended to the log file, never deleting or changing any previously stored information.
Each record (i.e. a line or a group of lines) in a log file is caused by a given event in the program, like user interaction, function call, input or output procedure etc.
Records in log files are often parameterized, i.e. they show current values of variables, return values of function calls or any other state information
The information reported in log files is the information that programmers consider important or useful for program monitoring and/or locating faults.
Syntax and format of log files can vary a lot. There are very brief log files that contain just sets of numbers, and there are log files containing whole essays. Nevertheless, an average log file is a compromise between these two approaches; it contains a minimum of unnecessary information while it is still easily human-readable. Such files, for example, contain variable names, variable values, comprehensive delimiters, smart text formatting, hint keywords, comments etc.
CHAPTER 2
HISTORY OF HADOOP
2.1 About Apache Hadoop
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code to nodes so that they process, in parallel, the data they hold. This approach takes advantage of data locality (nodes manipulating the data they have access to) to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking. The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications;[6][7] and
Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.
The term Hadoop has come to refer not just to the base modules above, but also to the ecosystem,[8] or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, Apache Storm.[9] Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on their MapReduce and Google File System.[10] The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Other projects in the Hadoop ecosystem expose richer user interfaces.
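To make the "map" and "reduce" parts of a user's program concrete, the following sketch shows the classic word-count job written in Java against the org.apache.hadoop.mapreduce API. It is only an illustration of the programming model, not code taken from this project; the input and output paths supplied on the command line are assumptions for the example. Packaged into a JAR, such a job would typically be run with a command of the form "hadoop jar wordcount.jar WordCount <input path> <output path>".

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts collected for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner pre-aggregates map output locally
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }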
2.2 History
The genesis of Hadoop came from the Google File System paper that was published in October 2003. This paper spawned another research paper from Google, "MapReduce: Simplified Data Processing on Large Clusters".[13] Development started in the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant.
The initial code that was factored out of Nutch consisted of about 5,000 lines of code for NDFS and 6,000 lines of code for MapReduce. The first committer added to the Hadoop project was Owen O'Malley in March 2006. Hadoop 0.1.0 was released in April 2006, and the framework continues to evolve through the many contributors to the Apache Hadoop project.
2.3 Architecture
Hadoop consists of the Hadoop Common package, which provides file system and OS-level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java Archive (JAR) files and scripts needed to start Hadoop. For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is located. Hadoop applications can use this information to execute code on the node where the data is and, failing that, on the same rack/switch to reduce backbone traffic. HDFS uses this method when replicating data for data redundancy across multiple racks. This approach reduces the impact of a rack power outage or switch failure; if one of these hardware failures occurs, the data will remain available.
A multi-node Hadoop cluster
A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, Name Node, and Data Node. A slave or worker node acts as both a Data Node and Task Tracker, though it is possible to have
data-only worker nodes and compute-only worker nodes. These are normally used only in nonstandard applications.[62] Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and shutdown scripts require that Secure Shell (ssh) be set up between nodes in the cluster.[63] In a larger cluster, HDFS nodes are managed through a dedicated Name Node server to host the file system index, and a secondary Name Node that can generate snapshots of the name node's memory structures, thereby preventing file-system corruption and loss of data. Similarly, a standalone Job Tracker server can manage job scheduling across nodes. When Hadoop MapReduce is used with an alternate file system, the Name Node, secondary Name Node, and Data Node architecture of HDFS are replaced by the filesystem-specific equivalents.
2.4 File Systems
Hadoop Distributed File System
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single name node plus a cluster of data nodes, although redundancy options are available for the name node due to its criticality. Each data node serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (but to increase I/O performance some RAID configurations are still useful). With the default replication value, 3, data is
stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file system differ from the target goals for a Hadoop application. The trade-off of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append.[65] HDFS added high-availability capabilities, as announced for release 2.0 in May 2012,[66] letting the main metadata server (the Name Node) fail over manually to a backup. The project has also started developing automatic fail-over. The HDFS file system includes a so-called secondary name node, a misleading name that some might incorrectly interpret as a backup name node for when the primary name node goes offline. In fact, the secondary name node regularly connects with the primary name node and builds snapshots of the primary name node's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary name node without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the name node is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate name nodes. An advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example: if node A contains data (x,y,z) and node B contains data (a,b,c), the job tracker schedules node B to perform map or reduce tasks on (a,b,c) and node A would be scheduled to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems, this advantage is not always available. This can have a significant impact on job-completion times, which has been demonstrated when running data-intensive jobs.
HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API), the Thrift API (to generate a client in the language of the users' choosing: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, the HDFS-UI web application over HTTP, or via third-party network client libraries.[68]
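As an illustration of the native Java API mentioned above, the following sketch opens a file stored in HDFS and copies its contents to standard output. The NameNode address and the file path are assumptions made only for this example.

    import java.io.InputStream;
    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
      public static void main(String[] args) throws Exception {
        // Hypothetical cluster address and file location, used only for illustration.
        String uri = "hdfs://namenode:9000/logs/access.log";

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);  // obtain a client for the cluster
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));                     // open the file; blocks come from data nodes
          IOUtils.copyBytes(in, System.out, 4096, false);  // stream the contents to stdout
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }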
2.5 Other File Systems
Hadoop works directly with any distributed file system that can be mounted by the underlying operating system simply by using a file:// URL; however, this comes at a price: the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data; this is information that Hadoop-specific file system bridges can provide. As of May 2011, the list of supported file systems bundled with Apache Hadoop was:
HDFS: Hadoop's own rack-aware file system.[69] This is designed to scale to tens of petabytes of storage and runs on top of the file systems of the underlying operating systems.
FTP File system: this stores all its data on remotely accessible FTP servers.
Amazon S3 (Simple Storage Service) file system. This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this file system, as it is all remote.
Windows Azure Storage Blobs (WASB) file system. WASB, an extension on top of HDFS, allows distributions of Hadoop to access data in Azure blob stores without moving the data permanently into the cluster.
A number of third-party file system bridges have also been written, none of which are currently in Hadoop distributions. However, some commercial distributions of Hadoop ship with an alternative file system as the default—specifically IBM and MapR.
In 2009, IBM discussed running Hadoop over the IBM General Parallel File System.[70] The source code was published in October 2009.[71]
In April 2010, Parascale published the source code to run Hadoop against the Parascale file system.[72]
In April 2010, Appistry released a Hadoop file system driver for use with its own CloudIQ Storage product.[73]
In June 2010, HP discussed a location-aware IBRIX Fusion file system driver.[74]
In May 2011, MapR Technologies, Inc. announced the availability of an alternative file system for Hadoop, which replaced the HDFS file system with a full random-access read/write file system.
Job Tracker and Task Tracker: the MapReduce Engine
Above the file systems comes the MapReduce engine, which consists of one Job Tracker, to which client applications submit MapReduce jobs. The Job Tracker pushes work out to available Task Tracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the Job Tracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a Task Tracker fails or times out, that part of the job is rescheduled. The Task Tracker on each node spawns a separate Java Virtual Machine process to prevent the Task Tracker itself from failing if the running job crashes its JVM. A heartbeat is sent from the Task Tracker to the Job Tracker every few minutes to check its status. The Job Tracker and Task Tracker status and information are exposed by Jetty and can be viewed from a web browser.
Known limitations of this approach are:
The allocation of work to Task Trackers is very simple. Every Task Tracker has a number of available slots (such as "4 slots"). Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.
If one Task Tracker is very slow, it can delay the entire MapReduce job— especially towards the end of a job, where everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
Scheduling
By default Hadoop uses FIFO scheduling, and optionally 5 scheduling priorities, to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the Job Tracker, and the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler, described next) was added.
Fair scheduler
The fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and quality of service for production jobs. The fair scheduler has three basic concepts.
1. Jobs are grouped into pools.
2. Each pool is assigned a guaranteed minimum share.
3. Excess capacity is split between jobs.
By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots and reduce slots, as well as a limit on the number of running jobs.
Capacity scheduler
The capacity scheduler was developed by Yahoo. The capacity scheduler supports several features that are similar to the fair scheduler.[79]
Queues are allocated a fraction of the total resource capacity.
Free resources are allocated to queues beyond their total capacity.
Within a queue a job with a high level of priority has access to the queue's resources.
There is no preemption once a job is running.
2.6 Other Applications
The HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive data warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data. It can also be used to complement a real-time system, as in a lambda architecture. As of October 2009, commercial applications of Hadoop[80] included:
Log and/or click stream analysis of various kinds
Marketing analytics
Machine learning and/or sophisticated data mining
Image processing
Processing of XML messages
Web crawling and/or text processing
General archiving, including of relational/tabular data, e.g. for compliance
Prominent users
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application
that runs on a Linux cluster with more than 10,000 cores and produced data that was used in every Yahoo! web search query.[81] There are multiple Hadoop clusters at Yahoo!, and no HDFS file systems or MapReduce jobs are split across multiple datacenters. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine. In June 2009, Yahoo! made the source code of the Hadoop version it runs available to the public via the open-source community.[82] In 2010, Facebook claimed that it had the largest Hadoop cluster in the world, with 21 PB of storage.[83] In June 2012, they announced the data had grown to 100 PB,[84] and later that year they announced that the data was growing by roughly half a PB per day.[85] As of 2013, Hadoop adoption had become widespread: more than half of the Fortune 50 used Hadoop.[86]
Hadoop hosting in the cloud
Hadoop can be deployed in a traditional onsite datacenter as well as in the cloud. The cloud allows organizations to deploy Hadoop without hardware to acquire or specific setup expertise. Vendors who currently have a cloud offering include Microsoft, Amazon, IBM and Google.
On Microsoft Azure
Azure HDInsight is a service that deploys Hadoop on Microsoft Azure. HDInsight uses Hortonworks HDP and was jointly developed for HDI with Hortonworks. HDI allows programming extensions with .NET (in addition to Java). HDInsight also supports creation of Hadoop clusters using Linux with Ubuntu. By deploying HDInsight in the cloud, organizations can spin up the number of nodes they want and only get charged for the compute and storage that is used.[90] Hortonworks implementations can also move data from the on-premises datacenter to the cloud for backup, development/test, and bursting scenarios.[90] It is also possible to run Cloudera or Hortonworks Hadoop clusters on Azure Virtual Machines.
On Amazon EC2/S3 services
It is possible to run Hadoop on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). As an example, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth). There is support for the S3 object store in the Apache Hadoop releases, though this is below what one expects from a traditional POSIX file system. Specifically, operations such as rename() and delete() on directories are not atomic, and can take time proportional to the number of entries and the amount of data in them.
Amazon Elastic MapReduce
Elastic MapReduce (EMR) was introduced by Amazon.com in April 2009. Provisioning of the Hadoop cluster, running and terminating jobs, and handling data transfer between EC2 (VM) and S3 (object storage) are automated by Elastic MapReduce. Apache Hive, which is built on top of Hadoop for providing data warehouse services, is also offered in Elastic MapReduce. Support for using Spot Instances was added in August 2011. Elastic MapReduce is fault tolerant for slave failures, and it is recommended to run only the Task Instance Group on spot instances to take advantage of the lower cost while maintaining availability.
Commercial support
A number of companies offer commercial implementations of or support for Hadoop.
ASF's view on the use of "Hadoop" in product names
The Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache
Hadoop. The naming of products and derivative works from other vendors and the term "compatible" are somewhat controversial within the Hadoop developer community.
CHAPTER 3
LITERATURE SURVEY
3.1 BIG DATA
"Every day, we create 2500 PB of data — so much that 90% of the data in the world today has been created in the last two years alone." Modern systems have to deal with far more data than was the case in the past. Organizations are generating huge amounts of data, and that data has inherent value and cannot be deleted. We live in the data age! The web has been growing rapidly in size as well as scale during the last 10 years and shows no signs of slowing down. Statistics show that every passing year more data gets generated than in all the previous years combined. Moore's law holds true not only for hardware but for the data being generated too. Without wasting time
coining a new phrase for such vast amounts of data, the computing industry decided to just call it, plain and simple, Big Data. More than structured information stored neatly in rows and columns, Big Data actually comes in complex, unstructured formats: everything from web sites, social media and email to videos, presentations, etc. This is a critical distinction, because, in order to extract valuable business intelligence from Big Data, any organization will need to rely on technologies that enable a scalable, accurate, and powerful analysis of these formats.
Big data dimensions: IBM's 4 V's (Volume, Velocity, Variety and Veracity)
Volume
Volume is the amount of data which is processed. There are many business areas where we are uploading and processing data at a very high rate, in terms of terabytes and zettabytes. Going by present statistics, every day around 25 terabytes of data are uploaded to Facebook, 12 terabytes to Twitter and around 10 terabytes from various devices such as RFID readers, telecommunications and networking equipment. Storing these huge amounts of data requires large clusters and servers with high bandwidth. The problem is not only storing the information but also processing it at much higher speed. This has become a major issue in most companies.
Velocity
Velocity is defined as the speed at which the data is created, modified, retrieved and processed. This is one of the major challenges of big data because the time needed for processing and performing different operations matters a great deal when dealing with massive amounts of data. If we use traditional databases for such data, they are of little use and do not fully satisfy the customers in organizations and industries. So the processing rate of such data is a significant consideration when talking about big data; hence velocity is one of the big challenges in dealing with big data.
Variety
In a distributed environment there may be various types of data present. This is known as the variety of data. Data can be categorized as structured, semi-structured and unstructured, and the process of analyzing and operating on it varies from one category to another. Social media such as Facebook posts or tweets can give different insights, such as sentiment analysis on your brand, while sensor data will give you information about how a product is used and what its shortcomings are. Processing information from such different sets of data is therefore a major issue.
Veracity
Big data veracity refers to the biases, noise and abnormality in data: whether the data being stored and mined is meaningful to the problem being analyzed. In other words, veracity can be treated as the uncertainty of data due to data inconsistency and incompleteness, ambiguities, latency and model approximations in the process of analyzing data.
Yarn Architecture
3.2 Technical Challenges
Scalable storage
Massive Parallel Processing and System Integration
Data Analysis
3.3 Hadoop as a Solution
The next logical question arises: how do we efficiently process such large data sets? One of the pioneers in this field was Google, which designed scalable frameworks like MapReduce and the Google File System. Inspired by these designs, an Apache open source initiative was started under the name Hadoop. Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines. Apache Hadoop, at its core, consists of two sub-projects: Hadoop MapReduce and the Hadoop Distributed File System. Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Other Hadoop-related projects at Apache include Pig, Hive, HBase, Mahout, Sqoop, Flume, Oozie and ZooKeeper.
The Motivation for Hadoop: Traditional Large-Scale Computation
Traditionally, computation has been processor-bound: small amounts of data and a significant amount of complex processing performed on that data. For years, the primary concern was to increase the computing power of a single machine: a faster processor, more RAM. Programming for traditional distributed systems is complex.
Data exchange requires synchronization, temporal dependencies are complicated, and it is difficult to deal with partial failures of the system. As for data storage, data for a distributed system is stored on a SAN; at compute time, data is copied to the compute nodes (data to code). This is only good for relatively small amounts of data.
What is Hadoop?
Open source project
Written in java
Optimized to handle
Massive amounts of data through parallelism
A variety of data (structured, unstructured and semi structured)
Uses inexpensive commodity hardware and gives great performance
Reliability provided through replication
Apache Hadoop 2.7.1 is a stable release in the 2.7.x release line, building upon the previous stable release 2.7.0.
Hadoop core concepts
Computation happens where the data is stored, wherever possible.
Nodes talk to each other as little as possible via TCP Handshake.
If a node fails, the master will detect that failure and re-assign the work to a different node on the system
Restarting a task does not require communication with nodes working on other portions of the data
When data is loaded into the system, it is split into 'blocks', typically 64 MB or 128 MB
Map tasks (the first part of the MapReduce system) work on relatively small portions of data, typically a single block.
A master program allocates work to nodes such that a Map task will work on a block of data stored locally on that node
Many nodes work in parallel, each on their own part of the overall dataset
CHAPTER 4
SYSTEM ANALYSIS
4.1 EXISTING SYSTEM
The existing system is a traditional database. A DBMS makes it possible for end users to create, read, update and delete data in a database. The DBMS essentially serves as an interface between the database and end users or application programs, ensuring that data is consistently organized and remains easily accessible.
DISADVANTAGES
Traditionally, computation has been processor-bound
- Small amounts of data
- Significant amount of complex processing performed on that data
For years, the primary concern was to increase the computing power of a single machine
- Faster processor, more RAM
Programming for traditional distributed systems is complex
- Data exchange requires synchronization
- Temporal dependencies are complicated
- Difficult to deal with partial failures of the system
Data Storage
- Data for a distributed system is stored on a SAN
- At compute time, data is copied to the compute nodes
- Good only for relatively small amounts of data
4.2 PROPOSED SYSTEM
'Big Data' has been gaining importance in different industries over the last year or two, on a scale that generates lots of data every day. Big Data is a term applied to data sets of such a large size that traditional databases are unable to process their operations in a reasonable amount of time. It has tremendous potential to transform business in several ways. Here the challenge is not only storing the data, but also accessing and analyzing the required data in a specified amount of time. One popular way of meeting these challenges of big data is to use Hadoop. Hadoop is a well-known open-source implementation of the MapReduce programming model for processing big data, running data-intensive jobs in parallel on clusters of commodity servers. It is a highly scalable compute platform. Hadoop enables users to store and process bulk amounts of data, which is not possible with less scalable techniques.
SYSTEM ARCHITECTURE:
SYSTEM PROCESSING STEPS:
Step 1: Streaming data through Flume from web servers and storing the data sets on Hadoop (HDFS)
Step 2: Processing (performing ETL jobs on) the log files with Pig
Step 3: Storing the resultant data set on Hadoop, or exporting it once again into Hive (a data warehouse system)
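This project performs the ETL step with Pig; purely as an illustration of the kind of aggregation involved, the sketch below shows an equivalent mapper written as Hadoop MapReduce code in Java. It assumes the web server writes its logs in the Apache common/combined log format; the regular expression and the choice of output key are assumptions for the example, and a summing reducer (as in the word-count sketch of Chapter 2) would complete the per-URL hit counts that reflect the access trend.

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (requested URL, 1) for every valid access-log record, so that a
    // summing reducer can produce hit counts per URL.
    public class AccessLogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

      // Assumed Apache log layout: host ident user [time] "METHOD /path PROTO" status bytes ...
      private static final Pattern LOG_LINE = Pattern.compile(
          "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(\\S+) (\\S+) [^\"]*\" (\\d{3}) (\\S+).*");

      private static final IntWritable ONE = new IntWritable(1);
      private final Text url = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        Matcher m = LOG_LINE.matcher(value.toString());
        if (!m.matches()) {
          return;               // data cleaning: skip malformed records
        }
        url.set(m.group(4));    // group 4 holds the requested resource path
        context.write(url, ONE);
      }
    }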
4.3 SYSTEM SPECIFICATIONS
4.3.1 SOFTWARE SPECIFICATIONS
Operating System : Windows 7 / Windows 8, CentOS, Ubuntu
VMware Player/Workstation / Oracle VirtualBox
4.3.2 MINIMUM HARDWARE SPECIFICATIONS
Hard Disk : 40 GB
RAM : 4 GB
Processor : i3/i5 processor which supports virtualization
CHAPTER 5
SYSTEM DESIGN
5.1 SYSTEM ARCHITECTURE
Hadoop framework
ARCHITECTURE OF DATA WAREHOUSE
5.2 USE CASE DIAGRAM
Use Case diagrams identify the functionality provided by the system (use cases), the users who interact with the system (actors), and the associations between the users and the functionality. Use Cases are used in the analysis phase of software development to articulate the high-level requirements of the system. The primary goals of Use Case diagrams include:
Providing a high-level view of what the system does
Identifying the users (“actors”) of the system
Determining areas needing human-computer interfaces
Use Cases extend beyond pictorial diagrams. In fact, text-based use case descriptions are often used to supplement diagrams and explore use case functionality in more detail.
Graphical Notation
The basic components of Use Case diagrams are the Actor, the Use Case, and the Association.
Actor
An Actor, as mentioned, is depicted using a stick figure. The role of the user is written beneath the icon. Actors are not limited to humans: if a system communicates with another application, and expects input or delivers output, then that application can also be considered an actor.
Use Case
A Use Case is functionality provided by the system. Use Cases are depicted with an ellipse; the name of the use case is written within the ellipse.
Association
Associations are used to link Actors with Use Cases in some form. Associations are depicted by a line connecting the actor and the use case.
Behind each Use Case is a series of actions to achieve the proper functionality, as well as alternate paths for instances where validation fails or errors occur. These actions can be further defined in a Use Case description. Because this is not addressed in UML, there are no standards for Use Case descriptions. However, there are some common templates you can follow, and whole books on the subject of writing Use Case descriptions. Common methods of writing Use Case descriptions include:
Write a paragraph describing the sequence of activities in the Use Case
List two columns, with the activities of the actor and the responses by the system
Use a template (such as those from the Rational Unified Process or Alistair Cockburn's book, Writing Effective Use Cases) identifying actors, preconditions, postconditions, main success scenarios and extensions.
Remember, the goal of the process is to be able to communicate the requirements of the system, so use whichever method is best for your team and your organization.
USE CASE DIAGRAMS
The Log Analysis
Searching the web
5.3 BEHAVIOURAL DIAGRAMS
5.3.1 Data Flow Diagram
A Data Flow Diagram (DFD) is a graphical representation of the "flow" of data through an information system, modeling its process aspects. A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated. DFDs can also be used for the visualization of data processing (structured design).
The Flow of Data
5.3.2 Sequence Diagram
The Log Analysis (sequence diagram). Participants: Web Server, Administrator, Data Analyst, Manager.
1: login()
2: gives access to log files()
3: manages the server logs()
4: analyses the web server log files()
5: creates a report of processed log files()
6: gives the report()
7: decision making()
8: decision sending()
9: final report()
10: terminate()
CHAPTER 6
SYSTEM IMPLEMENTATION
6.1 Language of Metadata
This section describes in more detail some design issues and explains the decisions that have been made. Although this report is written during the early stages of design and implementation, it is believed to reflect the key design ideas. The first stage of log file analysis is the process of understanding the log content. This is done by lexical scanning of log entries and parsing them into internal data structures according to a given syntax and semantics. The semantics can either be explicitly specified in the log file, or it must be provided by a human (preferably by the designer of the involved software product) in another way, i.e. by a description in a corresponding metadata file.
6.1.1 Log Description
Unfortunately, the common practice is that log files do not contain any explicit semantic information. In such a case the user has to supply the missing information, i.e. to create the metadata in a separate file that contains the log file description. The format of the metadata information (the language of metadata) is hard-wired into the analyzer. The exact notation is not determined yet, but there is a suggestion in the next paragraph. Generally the format of a log file is given, and we cannot change the internal file structure directly by adding other information. Instead, we must provide some external information. For example, we can simply list all possible records in the log file and explain their meaning, or write a grammar that would correspond to a given log format. A complete list of all possibilities or a list of all grammar rules (transcriptions) is exhausting, therefore we should seek a more efficient notation. One possible solution to this problem is based on matching log records against a set of patterns and searching for the best fit. This idea is illustrated by an example in the next paragraph.
6.1.2 Example
For illustration, let us assume a fake log file that contains a mixture of several types of events with different parameters and different delimiters, like this: structured reports from various programs, like
12:52:11 alpha kernel: Loaded 113 symbols in 8 modules,
12:52:11 alpha inet: inetd startup succeeded.
12:52:11 alpha sshd: Connection from 192.168.1.10 port 64187
12:52:11 beta login: root login on tty1
Periodic reports of a temperature
16/05/01 temp=21
17/05/01 temp=23
18/05/01 temp=22
Reports about switching a heater on and off
heater on
heater off
Then we can construct a description of the log structure in the following way. There are four rules that identify parts of data separated by delimiters that are listed in parentheses. Each description of a data element consists of a name (or an asterisk if the name is not further used) and of an optional type description. There are macros that recognize common built-in types of information like time, date, IP address etc. in several notations.

time:$TIME( )host( )program(: )message
time:$TIME( )*( sshd: )*( )address:$IP( port )port
:$DATE( temp=)temperature
(heater )status

The first line fits any record that begins with a timestamp (the macro $TIME should fit various notations of time) followed by a space, a host name, a program name followed by a colon, one more space and a message. The second line fits records of the sshd program from any host; this rule also distinguishes the corresponding IP address and port number. The third line fits dated temperature records, and finally the fourth fits the heater status. Users would later use the names defined herein (i.e. "host", "program", "address" etc.) directly in their programs as identifiers of variables.
Note: this is just an example of how the meta file can look. The exact specification is a subject of further work.
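To make the matching idea concrete, the sketch below shows how the first rule above could be realized inside an analyzer with a Java regular expression whose named groups play the role of the variables host, program and message. The concrete pattern is only an assumption for illustration; it is not the notation of the metadata language itself.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class LogRecordScanner {
      // Rough equivalent of the rule: time:$TIME( )host( )program(: )message
      private static final Pattern SYSLOG_RULE = Pattern.compile(
          "(?<time>\\d{2}:\\d{2}:\\d{2}) (?<host>\\S+) (?<program>\\S+): (?<message>.*)");

      public static void main(String[] args) {
        String record = "12:52:11 alpha sshd: Connection from 192.168.1.10 port 64187";
        Matcher m = SYSLOG_RULE.matcher(record);
        if (m.matches()) {
          // The named groups become the variables later used by analytical plug-ins.
          System.out.println("host    = " + m.group("host"));
          System.out.println("program = " + m.group("program"));
          System.out.println("message = " + m.group("message"));
        }
      }
    }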
6.2 Language of Analysis
The language of analysis is a cardinal element of the framework. It determines the fundamental features of any analyzer and therefore requires careful design. Let us repeat its main purpose: the language of analysis is a notation for efficient finding or description of various data relations within log files, or more generally, finding unknown relations in heterogeneous tables. This is the language of user-written filter and analytical plug-ins. A well-designed language should be powerful enough to express easily all required relations using an abstract notation. Simultaneously, it must allow a feasible implementation in available programming languages. The following paragraphs present a draft of such a language. They sequentially describe the principle of operation, basic features and structures. A complete formal description will be developed during further work.
6.2.1 Principle of Operation and Basic Features
The language of analysis is of a data-driven programming style that eliminates the usage of frequent program loops. Its basic operation is roughly similar to AWK, as is further explained. Programs are sets of conditions and actions that are applied on each record of a given log file. Conditions are expressions of Boolean functions and can use abstract data types. Actions are imperative procedures without direct input/output but with the ability to control the flow of data or execution. Unlike AWK, there is no single-pass processing of the stream of log records. The analyzer takes conditions one by one and transforms them into SQL commands that select the necessary data into a temporary table. The table is then processed record by record: each time, data from one record are used as variable values in the respective action, and the action is executed. The type system is weak; variable type is determined by assigned values. The language of analysis defines the following sections:
1. An optional FILTER section that contains a set of conditions that allow selective filtration of the log content. The FILTER section is stored separately in a filter plug-in. Filter plug-ins use a different syntax from analytical plug-ins. A more detailed description is available later in this chapter. 2. An optional DEFINE section, where output variables and user-defined functions are declared or defined. The log content is already accessible in a pseudo-array using identifiers declared in the metafile.
3. An optional BEGIN section that is executed once before 'main' program execution. The BEGIN section can be used for initialization of some variables or for analysis of 'command line' arguments passed to the plug-in. 4. The 'MAIN' analytic program that consists of ordered pairs [condition; action]. Rules are written in the form of Boolean expressions, actions in the form of imperative procedures. All log entries are step by step tested against all conditions; if a condition is satisfied then the corresponding action is executed. In contrast to AWK, the conditions and actions manipulate data at a high abstraction level, i.e. they use variables and not lexical elements like AWK does. In addition, both conditions and actions can use a variety of useful operators and functions, including but not limited to string and Boolean functions like in AWK. Actions would be written in a pseudo-C imperative language. 5. An optional END part that is executed once after the 'main' program execution. Here a plug-in can produce summary information or perform final, close-up computation. The DEFINE, BEGIN, 'MAIN' and END sections together make up an analytical plug-in module.
6.2.2 Data Filtering and Cleaning
Data filtering and cleaning is a simple transformation of a log file to reduce its size; the reduced information is of the same type and structure. During this phase, an analyzer has to determine which log entries should be filtered out and what information should be passed on for further processing if the examined record successfully passes filtration. The filtration is performed by testing each log record against a set of conditions.
A set of conditions must be declared to have one of the following policies (in all cases there are two types of conditions, `negative' and `positive'):
Order pass, fail: all pass conditions are evaluated before fail conditions; the initial state is FAIL, all conditions are evaluated, there is no shortcut, and the order of conditions is not significant.
Order fail, pass: all fail conditions are evaluated before pass conditions; the initial state is PASS, all conditions are evaluated, there is no shortcut, and the order of conditions is not significant.
Order require, satisfy: a log record must pass all require conditions up to and including the first satisfy condition (if there is any); the record fails filtration at the first unsatisfied condition; the order of conditions is significant.

The conditions are written as common Boolean expressions and use identifiers from the metadata. They can also use functions and operators from the API library. Here are some example fragments of filter plug-ins (they assume the sample log file described earlier):

#Example 1: messages from alba not older than 5 days
order pass,fail
pass host=="alba"
pass (time-NOW)[days]<5

#Example 2: problems with hackers and secure shell
order fail,pass
fail all
pass (host=="beta")&&(program=="sshd")
pass (program=="sshd")&&(address.dnslookup IN "badboys.com")

#Example 3: heater test
order require,satisfy
satisfy temperature>30   # if temperature is too high -> critical
require temperature>25   # if temperature is high and ...
require status=="on"     # heater is on -> heater malfunction ???

There must also be a way to specify what information should be excluded from further processing. This can be done by 'removal' of selected variables from the internal data structures. For example, the following command indicates that the variables "time" and "host" are not used further:

omit time,host
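To make the filtering policies above more concrete, here is a minimal Java sketch of how the 'order pass, fail' policy could be evaluated for a single record. The record model (a map from field names to values), the class name and the example conditions are illustrative assumptions only, not part of the proposed analyzer.

// Hedged sketch: evaluating the "order pass, fail" policy on one record.
// A record is modelled as a map of field name -> value; this is an
// illustrative simplification, not the analyzer's real data structure.
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

class PassFailFilter {
    private final List<Predicate<Map<String, String>>> passConditions;
    private final List<Predicate<Map<String, String>>> failConditions;

    PassFailFilter(List<Predicate<Map<String, String>>> pass,
                   List<Predicate<Map<String, String>>> fail) {
        this.passConditions = pass;
        this.failConditions = fail;
    }

    // Order pass, fail: initial state is FAIL, every condition is evaluated,
    // there is no shortcut, and the order within each group does not matter.
    boolean accepts(Map<String, String> record) {
        boolean state = false;                                // initial state: FAIL
        for (Predicate<Map<String, String>> p : passConditions) {
            if (p.test(record)) state = true;                 // a matching pass rule accepts
        }
        for (Predicate<Map<String, String>> f : failConditions) {
            if (f.test(record)) state = false;                // a matching fail rule rejects
        }
        return state;
    }

    public static void main(String[] args) {
        // Roughly mirrors Example 1 above: keep messages from host "alba".
        PassFailFilter filter = new PassFailFilter(
            List.of(r -> "alba".equals(r.get("host"))),
            List.of());
        System.out.println(filter.accepts(Map.of("host", "alba", "program", "sshd"))); // true
        System.out.println(filter.accepts(Map.of("host", "beta", "program", "sshd"))); // false
    }
}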
6.2.3 Analytical Tasks

This section further describes the language of analysis, especially the features and expressions used in analytical tasks. Some features are illustrated in the subsequent examples. A program consists of an unordered list of conditions and actions. Conditions are Boolean expressions in C-like notation and they are evaluated
from left to right. Conditions can be preceded by a label; in such a case they act like procedures and can be referenced in actions. Conditions use the same syntax as actions; see below for more details. Actions are imperative procedures that are executed if the respective condition is true. Actions have no file or console input and output. All identifiers are case-sensitive. Here is a list of basic features:
There are common control structures like the sequence command, if-then-else conditions, three program loops (for, while, do-while), function invocation etc. There are also special control commands like abort or rewind.

An expression is a common composition of identifiers, constants, operators and parentheses. The priority of operators is defined but can be overridden by parentheses.

There are unary and binary operators with prefix, infix and postfix notation. Unary postfix operators are mainly type converters.

Variables should be declared prior to their usage. At the beginning they are empty; they have no initial value. A variable's type is determined by the first assignment, but it can change with subsequent assignments. There are the following basic types: number, string, and array. Strings are enclosed in quotes; arrays use brackets for the index.

Variables are referred to directly by their names. If a variable is used inside a string, its name is preceded by a $ sign.

Variable scope and visibility span the whole program.

Arrays are hybrid (also referred to as hash arrays); they are associative arrays with possible indexed access. Internally, they are represented by a list of key-value pairs.

Multidimensional arrays are allowed; in fact they are arrays of arrays. Fields are accessed via an index or using an internal pointer.

An index can be of type number or string. Arrays can be heterogeneous.

A new field in an array can be created by assignment without an index or by assignment with a not-yet-existing key.

Common array operators like first, last, next etc. are available.

Structures (record types) are internally treated as arrays; therefore there is no special structure type. Sample 'structures' are date, time, IP address and similar.

Procedures and functions are written as named pairs (condition, action). Conditions are optional. Arguments are referred to within a procedure by their order; $1 is the first one. A procedure is invoked by its identifier followed by parentheses, optionally with passed arguments. Recursion is not allowed.

There is also a set of internal variables that reflect the internal state of the analyzer: current log file, current line number, current log record etc.

As in the C language, the core language is separated from useful functions, which are collected in libraries. There are 'system' and 'user' libraries. Some system functions are 'primitive', which means that they cannot be rewritten using this language.
Below are several simple examples that illustrate basic features of the suggested language of analysis. The examples are believed to be self-explanatory and they use the same sample log file from page 33.

#Example 1: counting and statistics
DEFINE {average, count}
BEGIN {count=0; sum=0}
(temperature<>0) {count++; sum+=temperature}
END {average=sum/count}

#Example 2: security report - ssh sessions
DEFINE {output_tbl}
BEGIN {user=""; msg=[]}
(program=="sshd") {msg=FINDNEXT(30,5m,(program==login));
   user=msg[message].first;
   output_tbl[address.dnslookup][user]++ }

#Example 3: heater validation by a state machine
DEFINE {result}
BEGIN {state="off"; tempr=0; min=argv[1]; max=argv[2]}
(temperature<>0) {tempr=temperature}
(status=="on")  {if (state=="off" && tempr<min) state="on"
   else {result="Error at $time"; abort} }
(status=="off") {if (state=="on" && tempr>max) state="off"
   else {result="Error at $time"; abort} }
END {result="OK"}

#Example 4: finding relations
DEFINE {same_addr, same_port}
(program=="sshd") {foo(address); bar(port)}
foo:(address==$1) {same_addr[]=message}
bar:(port==$1)    {same_port[]=message}

#Example 5: finding causes
DEFINE {causes}
BEGIN {msg=[]}
(program=="sshd") {msg=FINDMSG(100,10m,3); sort(msg);
   for (i=0; i<msg.length; i++)
      causes[]="$i: $msg[i][time] $msg[i][program]" }

Again, the examples above just illustrate how things could work; a formal specification is the subject of further work.

6.2.4 Discussion

The proposed mode of operation is believed to take advantage of both AWK and imperative languages while avoiding the limitations of AWK. AWK makes it possible to write remarkably short and efficient programs for text processing, but it is nearly useless when it has to work with high-level data types and terms such as variable values and semantics.

6.3 Analyzer Layout

The previous chapters imply that a universal log analyzer is a fairly complicated piece of software that must fulfil many requirements. For the purpose of easy and open implementation, the analyzer is therefore divided into several functional modules, as shown in the figure below:
The figure proposes one possible modular decomposition of an analyzer. It shows the main modules, the files used, the user, the basic flows of information (solid arrows) and the basic control flows (outlined arrows).

6.3.1 Operation

Before a detailed description of the analyzer's components, let us enumerate the basic stages of log processing from the user's point of view and thus explain how the analyzer operates.

1. Job setup. In a typical session, the user specifies via a user interface one or more log files and a metadata file, and selects the modules that should participate in the job, each module for one analytical task.

2. Pre-processing. The analyzer then reads the metadata information (using the lexical and syntax analyzers) and thus learns the structure of the given log file.

3. Filtration, cleaning, database construction. After learning the log structure, the analyzer starts reading the log file and, according to the metadata, stores pieces of information into the database. During this process some unwanted records can be filtered out, or selected parts of each record can be omitted to reduce the database size.
4. The analysis. When all logs have been read and stored in the database, the analyzer executes, one after another, all user-written modules that perform analytical tasks. Each program is expected to produce some output in the form of a set of variables, or the output may be missing if the log does not contain the sought-after pattern.

5. Processing of results. The output of each module (if any) is passed to the visualization module, which uses yet another type of file (visual plug-ins) to learn how to present the obtained results. For example, it can transform the obtained variables into pie charts, bar graphs, nice-looking tables, lists, sentences etc. Finally, the transformed results are either displayed to the user via the user interface module or stored in an output file.

6.3.2 File Types

There are several types of files used to describe data and to control program behavior during the analysis. The following list describes each type of file:

Control file. This file can optionally be used as a configuration file and thus eliminates the need for any user interface. It contains a description of the log source (i.e. local filename or network socket), the metadata location, and the list of applied filtering plug-ins, analytical plug-ins and visual plug-ins.

Metadata file. This file contains information about the structure and format of a given log file. First, there are parameters used for lexical and syntax analysis of the log. Second, the file gives semantics to various pieces of information in the log, i.e. it declares the names of variables used later in analytical plug-ins and describes the fixed and variable parts of each record (timestamp, hostname, IP address etc.).
Log file(s). The file or files that should be analyzed in the job. Instead of a local file, a network socket can also be used.
Filtering plug-in(s). Contains a set of rules that are applied to each log record to determine whether it should be passed on for further processing or thrown away. A filter plug-in can also remove unnecessary information from each record.
Analytical plug-ins. These plug-ins do the real job. They are programs that search for patterns, count occurrences and do most of the work; they contain the 'business logic' of the analysis. The result of executing an analytical plug-in is a set of variables filled with the required information. The plug-ins should not be too difficult to write because the log is already pre-processed. Some details about this type of plug-in are discussed later in this chapter.
Visual plug-in(s). These files contain instructions and parameters telling the visualization module how to transform the outputs of analytical modules. In other words, a visual plug-in transforms the content of several variables (the output of analytical plug-ins) into a sequence of graphic primitives (or commands) for the visualization module.
Output file. A place where the final results of the analytical job can be stored; under normal operation, however, the final results are displayed to the user and not stored in any file.
6.3.3 Modules

Finally, here is a list of the proposed modules together with a brief description of their function and usage:
User interface. It allows the user to set up the job and also handles and displays the final results of the analysis. The user has to specify where to find the log file(s) and which plug-ins to use. The user interface can be replaced by control and output files.
Main executive. This module controls the operation of the analyzer. Based on user interaction or a control file, it executes the subordinate modules and passes the correct parameters to them.
Generic lexical analyzer. At the beginning a metadata file and log files are processed by this module to identify lexical segments. The module is configured and used by the metadata parser and log file parser.
Generic syntax analyzer. Like the lexical analyzer, this module is also used by both the metadata parser and the log file parser. It performs syntax analysis of metadata information and log files.
Metadata parser. This module reads a metadata file and gathers information about the structure of the corresponding log file. The information is subsequently used to configure the log file parser.
Log file parser. This module uses the information gained by the metadata parser. While it reads the log content line by line, pieces of information from each record (with correct syntax) are stored in corresponding variables according to their semantics.
Filtering, cleaning: This module reads one or more filter plug-ins and examines the output of the log file parser. All records that pass filtration are stored into a database.
Software environment

Apache Hadoop: The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures. The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
ZooKeeper™: A high-performance coordination service for distributed applications.
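As a small illustration of how application code reaches data stored in HDFS, the following Java sketch reads a log file through the Hadoop FileSystem API and counts its records. The HDFS path /logs/access.log is a hypothetical example, not a path used by this project.

// Minimal sketch: reading a web log stored in HDFS with the Hadoop FileSystem API.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLogReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();       // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path logPath = new Path("/logs/access.log");    // hypothetical HDFS location
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(logPath)))) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;                                 // e.g. count records before analysis
            }
            System.out.println("Records read from HDFS: " + count);
        }
    }
}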
Components of the Hadoop Framework

MapReduce

MapReduce is a framework with which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity hardware in a reliable manner. MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map task. The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But once we write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides. A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage: This stage is the combination of the shuffle stage and the reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers in the cluster. The framework manages all the details of data passing, such as issuing tasks, verifying task completion, and copying data around the cluster between the nodes. Most of the computing takes place on nodes with data on local disks, which reduces the network traffic. After completion of the given tasks, the cluster collects and reduces the data to form an appropriate result and sends it back to the Hadoop server.
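The following Java sketch illustrates the map and reduce stages described above with a job that counts hits per client IP address in a web log. The class names, the assumption that the IP is the first whitespace-separated field, and the input/output paths are illustrative only; this is not the project's actual code.

// Hedged sketch: a MapReduce job counting hits per client IP in a web log.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HitsPerIp {

    // Map stage: each log line is split and the client IP (first field in the
    // common log format) is emitted with a count of 1.
    public static class IpMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ip = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(" ");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                ip.set(fields[0]);
                context.write(ip, ONE);
            }
        }
    }

    // Reduce stage: counts for the same IP are summed after the shuffle.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "hits per ip");
        job.setJarByClass(HitsPerIp.class);
        job.setMapperClass(IpMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /logs/access.log
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /logs/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged into a jar and submitted to the cluster with the hadoop jar command.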
Flow of Data in the MapReduce Framework

Phase     Input              Output
Map       <k1, v1>           list(<k2, v2>)
Reduce    <k2, list(v2)>     list(<k3, v3>)
Terminology

Mapper - Maps the input key/value pairs to a set of intermediate key/value pairs.
MasterNode - Node where the JobTracker runs and which accepts job requests from clients.
SlaveNode - Node where the Map and Reduce programs run.
NameNode - Node that manages the Hadoop Distributed File System (HDFS).
DataNode - Node where data is present in advance before any processing takes place.
JobTracker - Schedules jobs and tracks the assigned jobs to the TaskTracker.
TaskTracker - Tracks the tasks and reports status to the JobTracker.
Job - An execution of a Mapper and Reducer across a dataset.
Task - An execution of a Mapper or a Reducer on a slice of data.
Task Attempt - A particular instance of an attempt to execute a task on a SlaveNode.
Format of Execution

The World Wide Web has an estimated 2 billion users and contains anywhere from 15 to 45 billion Web pages, with around 10 million pages added each day. With such large numbers, almost every website owner and developer who has a decent presence on the Internet faces a complex problem: how to make sense of their web pages and all the users who visit their websites. Every Web server worth its salt logs the user activities for the websites it supports and the Web pages it serves up to the virtual world. These Web logs are mostly used for debugging issues or to get insight into details that are interesting from a business or performance point of view. Over time, the size of the logs keeps increasing until it becomes very difficult
to manually extract any important information out of them, particularly for busy websites. The Hadoop framework does a good job of tackling this challenge in a timely, reliable, and cost-efficient manner. We propose a solution based on the Pig framework that aggregates data at an hourly, daily or yearly granularity. The proposed architecture features a data-collection layer and a database layer as an end-to-end solution, but we focus on the analysis layer, which is implemented in the Pig Latin language.

Analyzing Logs Generated by Web Servers

The challenge for the proposed solution is to analyze Web logs generated by the Apache Web Server. Apache Web logs follow a standard, customizable pattern. Their description and sample Web logs can be found easily on the Web. These logs contain information in various fields, such as timestamp, IP address, page visited, referrer, and browser, among others. Each row in the Web log corresponds to a visit or event on a Web page. The size of Web logs can range anywhere from a few KB to hundreds of GB. We have to design solutions that, based on different dimensions such as timestamp, browser and country, can extract patterns and information out of these logs and provide us with vital bits of information, such as the number of hits for a particular website or Web page, the number of unique users, and so on. Each potential problem can be divided into a particular use case and then solved. The technologies used are the Apache Hadoop framework, Apache Pig, the Java programming language, and regular expressions (regex).
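Since regular expressions are listed among the technologies used, the sketch below shows one common way to break an Apache combined-log-format line into its fields with a Java regex. The exact expression and field grouping are an assumption for illustration, not the pattern used in the implementation.

// Hedged sketch: parsing one Apache combined-log-format line with a regex.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApacheLogLineParser {
    // Groups: 1 ip, 2 identity, 3 user, 4 timestamp, 5 request, 6 status,
    // 7 bytes, 8 referrer, 9 user agent.
    private static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326 "
                + "\"http://www.example.com/start.html\" \"Mozilla/4.08\"";
        Matcher m = LOG_PATTERN.matcher(line);
        if (m.matches()) {
            System.out.println("IP:        " + m.group(1));
            System.out.println("Timestamp: " + m.group(4));
            System.out.println("Request:   " + m.group(5));
            System.out.println("Status:    " + m.group(6));
            System.out.println("Referrer:  " + m.group(8));
        }
    }
}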
Hadoop Solution Architecture
The proposed architecture is a layered architecture, and each layer has components. It scales according to the number of logs generated by your Web servers, enables you to harness the data to get key insights, and is based on an economical, scalable platform.

The Log Analysis Software Stack

Hadoop is an open source framework that allows users to process very large data in parallel. It is based on the framework that supports the Google search engine. The Hadoop core is mainly divided into two modules: HDFS is the Hadoop Distributed File System; it allows you to store large amounts of data using multiple commodity servers connected in a cluster. Map-Reduce (MR) is a framework for parallel processing of large data sets; the default implementation is bound to HDFS. The database can be a NoSQL database such as HBase. The advantage of a NoSQL database is that it provides scalability for the reporting module as well, since we can keep historical processed data for reporting purposes. HBase is an open source columnar (NoSQL) database that uses HDFS. It can also use MR jobs to process data. It gives real-time, random read/write access to very large data sets; HBase can store very large tables having millions of rows. It is a distributed database and can also keep multiple versions of a single row. The Pig framework is an open source platform for analyzing large data sets and is implemented as a layered language over the Hadoop Map-Reduce framework. It is built to ease the work of developers who would otherwise write code in the Map-Reduce format, since code in MapReduce format needs to be written in Java; in contrast, Pig enables users to write code in a scripting language. Flume is a distributed, reliable and available service for collecting, aggregating and moving large amounts of log data (src: flume wiki). It was built to push large logs into Hadoop HDFS for further processing. It is a data flow solution, where there is an originator and a destination for each node, and it is divided into Agent and Collector tiers for collecting logs and pushing them to the destination storage.
Data Flow and Components

Content is created by multiple Web servers and logged on local hard discs. This content is then pushed to HDFS using the Flume framework. Flume has agents running on the Web servers; these are machines that collect data, which intermediate collectors aggregate and finally push to HDFS. Pig scripts are scheduled to run using a job scheduler (this could be cron or any more sophisticated batch job solution). These scripts actually analyze the logs on various dimensions and extract the results. Results from Pig are by default inserted into HDFS, but we can use a storage implementation for other repositories as well, such as HBase, MongoDB, etc. We have also tried the solution with HBase (please see the implementation section). Pig scripts can either push this data to HDFS, after which MR jobs are required to read it and push it into HBase, or Pig scripts can push the data into HBase directly. The HBase database will hold the data processed by the Pig scripts, ready for reporting and further slicing and dicing. The data-access Web service is a REST-based service that eases access and integration with data clients. The client can be written in any language that can access a REST-based API. These clients could be BI- or UI-based clients.
Log Analysis Solution Architecture

Apache Tomcat 7 Server UML Composite Structure Diagram
Implementation steps

Analyzing log files, churning them and extracting meaningful information is a potential use case for Hadoop. Let us consider Pig for Apache log analysis. Pig has some built-in libraries that help us load Apache log files into Pig and also perform some cleanup operations on string values from crude log files. All the functionalities are available in piggybank.jar, usually found under the pig/contrib/piggybank/java/ directory.
Log Format: "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
%h - The IP address of the client (remote host) that made the request. If hostname lookup is enabled, the server will try to log the hostname instead of the IP address.

%l - The '-' indicates that the requested information is not available; this field is the RFC 1413 identity of the client.

%u - The userid of the person requesting the document, as determined by HTTP authentication. Likewise, if '-' is present, the requested information is not available.

%t - The time at which the server finished processing the request. The format is [day/month/year:hour:minute:second zone], where day = 2*digit, month = 3*letter, year = 4*digit, hour = 2*digit, minute = 2*digit, second = 2*digit, zone = ('+' | '-') 4*digit.

%r - The request line from the client, in double quotes: the method used by the client (e.g. GET) and the requested resource, e.g. /apache_pb.gif. It can also be logged as %m %U%q %H, which gives the method, path, query string and protocol.

%>s - The status code that the server sends back to the client: 2xx is a successful response, 3xx a redirection, 4xx a client error and 5xx a server error.

%b - The size of the object returned to the client, measured in bytes.

%{Referer}i - The referrer, i.e. the site from which the client reports having been referred.

%{User-agent}i - The User-Agent HTTP request header: identifying information that the client browser reports about itself.

As the first step, we need to register the piggybank jar file with our Pig session; only then can we use its functionalities in our Pig Latin.
1. Register the Piggybank jar

REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;

Once we have registered the jar file, we need to define a few functions to be used in our Pig Latin. For any basic Apache log analysis we need a loader to load the log files into Pig in a column-oriented format; we can create an Apache log loader as follows.
2. Define a log loader

DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
(Piggy Bank has other log loaders as well.) In Apache log files the default date format is 'dd/MMM/yyyy:HH:mm:ss Z'. Such a date will not help us much in log analysis; we may have to extract the date without the timestamp. For that we use DateExtractor().

3. Define a date extractor

DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
Once we have the required functions, we need to load the log file into Pig.

4. Load the Apache log file into Pig

--load the log files from HDFS into Pig using CommonLogLoader
logs = LOAD '/userdata/access.log' USING ApacheCommonLogLoader
       AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);
Now we are ready to dive into the actual log analysis. There are multiple pieces of information you may need to extract from a log; we will look at a few of the common requirements here.
Note: you need to first register the jar, define the classes to be used and load the log files into Pig before trying out any of the Pig Latin below.
Example 1: Find unique hits per day

Pig Latin
--Extracting the day alone and grouping records based on days
grpd = GROUP logs BY DayExtractor(dt);
--looping through each group to get the unique number of userIds
cntd = FOREACH grpd {
   tempId = logs.userId;
   uniqueUserId = DISTINCT tempId;
   GENERATE group AS day, COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on the number of unique user ids in descending order
srtd = ORDER cntd BY cnt DESC;
--storing the final result into an HDFS directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
Example 2: Find unique hits to websites (IPs) per day
Pig Latin
--Extracting the day and grouping records based on days and ip address
grpd = GROUP logs BY (DayExtractor(dt), ip_address);
--looping through each group to get the unique number of userIds
cntd = FOREACH grpd {
   tempId = logs.userId;
   uniqueUserId = DISTINCT tempId;
   GENERATE FLATTEN(group) AS (day, ip), COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on the number of unique user ids in descending order
srtd = ORDER cntd BY cnt DESC;
--storing the final result into an HDFS directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
BUSINESS LOGIC

REGISTER '/piggybank.jar';
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
logs = LOAD '/common_access_log' USING ApacheCommonLogLoader AS (addr: chararray, logname: chararray, user: chararray, time: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int);
addrs = GROUP logs BY addr;
counts = FOREACH addrs GENERATE flatten($0), COUNT($1) as count;
DUMP counts;
CHAPTER 7
SOFTWARE TESTING

7.1 System testing

Testing Approach
Big data is still emerging and there is a lot of onus on testers to identify innovative ideas to test the implementation. Testers can create small utility tools, for example using Excel macros, for data comparison, which can help in deriving a dynamic structure from the various data sources during the pre-Hadoop processing stage. For instance, Aspire designed a test automation framework for one of its large retail customers' big data implementations in the BDT (Behavior Driven Testing) model using Cucumber and Ruby. The framework helped in performing count and data validation during the data processing stage by comparing the record counts between the Hive and SQL tables, and confirmed that the data was properly loaded without any truncation by verifying the data between the Hive and SQL tables.

Similarly, when it comes to validation of the map-reduce processing stage, it definitely helps if the tester has good experience with programming languages. The reason is that, unlike SQL, where queries can be constructed to work through the data, the MapReduce framework transforms a list of key-value pairs into a list of values. A good unit testing framework like JUnit or PyUnit can help validate the individual parts of a MapReduce job, but it does not test them as a whole. Building a test automation framework using a programming language like Java can help here. The automation framework can focus on the bigger picture pertaining to MapReduce jobs while encompassing the unit tests as well. Setting up the automation framework on a continuous integration server like Jenkins can be even more helpful. However, building the right framework for big data applications relies on how the test environment is set up, as the processing happens in a distributed manner; there could be a cluster of machines on the QA server where testing of MapReduce jobs should happen.

Testing a Big Data application is more a verification of its data processing than testing the individual features of the software product. When it comes to Big Data testing, performance and functional testing are key. In Big Data testing, QA engineers verify the successful processing of terabytes of data using a commodity cluster and other supportive components. It demands a high level of testing skill because the processing is very fast. Processing may be of three types: batch, real-time and interactive.
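As a small illustration of the unit-testing idea mentioned above, the following JUnit 4 sketch validates a hypothetical log-parsing helper of the kind a mapper might call for each record. The helper class and its behavior are assumptions made for the example, not part of the project's test suite.

// Hedged sketch: JUnit 4 test for a hypothetical log-parsing helper.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;

import org.junit.Test;

public class LogFieldTest {

    // Hypothetical helper that a mapper could call for each record: it
    // extracts the client IP (first whitespace-separated token) from a line.
    static class LogField {
        static String clientIp(String logLine) {
            if (logLine == null || logLine.trim().isEmpty()) {
                return null;                       // malformed or empty record
            }
            return logLine.trim().split("\\s+")[0];
        }
    }

    @Test
    public void extractsIpFromWellFormedLine() {
        String line = "192.168.1.20 - - [10/Oct/2000:13:55:36 -0700] \"GET /index.html HTTP/1.0\" 200 1043";
        assertEquals("192.168.1.20", LogField.clientIp(line));
    }

    @Test
    public void returnsNullForEmptyLine() {
        assertNull(LogField.clientIp("   "));
    }
}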
Along with this, data quality is also an important factor in big data testing. Before testing the application, it is necessary to check the quality of the data, and this should be considered a part of database testing. It involves checking various characteristics like conformity, accuracy, duplication, consistency, validity, data completeness, etc.
The following figure gives a high level overview of phases in Testing Big Data Applications
Big Data testing can be broadly divided into three steps.

Step 1: Data Staging Validation
The first step of big data testing, also referred to as the pre-Hadoop stage, involves process validation.
Data from various sources like RDBMS, weblogs, social media, etc. should be validated to make sure that correct data is pulled into the system.
Comparing source data with the data pushed into the Hadoop system to make sure they match
Verify the right data is extracted and loaded into the correct HDFS location
Tools like Talend and Datameer can be used for data staging validation.
Step 2: "MapReduce" Validation
The second step is validation of the "MapReduce" process. In this stage, the tester verifies the business logic on every single node and then validates it after running against multiple nodes, ensuring that the MapReduce process works correctly:
Data aggregation or segregation rules are implemented on the data
Key value pairs are generated
Validating the data after Map Reduce process
Step 3: Output Validation Phase
The final or third stage of Big Data testing is the output validation process. The output data files are generated and ready to be moved to an EDW (Enterprise Data Warehouse) or any other system based on the requirement.
Activities in the third stage include:
To check the transformation rules are correctly applied
To check the data integrity and successful data load into the target system
To check that there is no data corruption by comparing the target data with the HDFS file system data
Architecture Testing

Hadoop processes very large volumes of data and is highly resource intensive. Hence, architectural testing is crucial to ensure the success of your Big Data project. A poorly or improperly designed system may lead to performance degradation, and the system could fail to meet the requirements. At a minimum, Performance and Failover test services should be carried out in a Hadoop environment. Performance testing includes testing of job completion time, memory utilization, data throughput and similar system metrics, while the motive of the Failover test service is to verify that data processing occurs seamlessly in case of failure of data nodes.

Performance Testing
Performance testing for Big Data includes the following main actions:

Data ingestion and throughput: In this stage, the tester verifies how fast the system can consume data from various data sources. Testing involves identifying the number of messages the queue can process in a given time frame. It also includes how quickly data can be inserted into the underlying data store, for example the insertion rate into a MongoDB or Cassandra database.

Data processing: This involves verifying the speed with which the queries or MapReduce jobs are executed. It also includes testing the data processing in isolation when the underlying data store is populated with the data sets, for example running MapReduce jobs on the underlying HDFS.

Sub-component performance: These systems are made up of multiple components, and it is essential to test each of these components in isolation, for example how quickly a message is indexed and consumed, MapReduce jobs, query performance, search, etc.

Performance Testing Approach

Performance testing for a big data application involves testing huge volumes of structured and unstructured data, and it requires a specific testing approach to test such massive data.
Performance Testing is executed in this order
Process begins with the setting of the Big data cluster which is to be tested for performance
Identify and design corresponding workloads
Prepare individual clients (Custom Scripts are created)
Execute the test and analyze the results (if objectives are not met, tune the component and re-execute)
Optimum Configuration
Parameters for Performance Testing Various parameters to be verified for performance testing are
Data Storage: How data is stored in different nodes
Commit logs: How large the commit log is allowed to grow
Concurrency: How many threads can perform write and read operation
Caching: Tune the cache setting "row cache" and "key cache."
Timeouts: Values for connection timeout, query timeout, etc.
JVM Parameters: Heap size, GC collection algorithms, etc.
Map reduce performance: Sorts, merge, etc.
Message queue: Message rate, size, etc.
Test Environment Needs
Test Environment needs depend on the type of application you are testing. For Big data testing, test environment should encompass
It should have enough space for storing and processing a large amount of data
It should have cluster with distributed nodes and data
It should have minimum CPU and memory utilization to keep performance high
Challenges in Big Data Testing

Automation: Automation testing for Big Data requires someone with technical expertise. Also, automated tools are not equipped to handle unexpected problems that arise during testing.

Virtualization: It is one of the integral phases of testing. Virtual machine latency creates timing problems in real-time big data testing. Also, managing images in Big Data is a hassle.
Large Dataset
Need to verify more data and need to do it faster
Need to automate the testing effort
Need to be able to test across different platforms
Performance testing challenges
Diverse set of technologies: Each sub-component belongs to different technology and requires testing in isolation
Unavailability of specific tools: No single tool can perform the end-to-end testing. For example, NoSQL might not fit for message queues
Test Scripting: A high degree of scripting is needed to design test scenarios and test cases
Test environment: It needs special test environment due to large data size
Monitoring Solution: Limited solutions exist that can monitor the entire environment
Diagnostic Solution: A custom solution needs to be developed to drill down into the performance bottleneck areas
As data engineering and data analytics advance to the next level, Big Data testing is inevitable. Big data processing could be batch, real-time, or interactive. The three stages of testing Big Data applications are:
Data staging validation
"MapReduce" validation
Output validation phase
Architecture testing is an important phase of Big Data testing, as a poorly designed system may lead to unprecedented errors and degradation of performance. Performance testing for big data includes verifying:
Data throughput
Data processing
Sub-component performance
Big data testing is very different from Traditional data testing in terms of Data, Infrastructure & Validation Tools
Big Data Testing challenges include virtualization, test automation and dealing with large dataset. Performance testing of Big Data applications is also an issue.
In summary, test automation can be a good approach to testing big data implementations. Identifying the requirements and building a robust automation framework can help in doing comprehensive testing. However, a lot depends on the skills of the tester and on how the big data environment is set up. In addition to functional testing of big data applications using approaches such as test automation, given the large size of the data there is definitely a need for performance and load testing in big data implementations.
Tools used in Big Data Scenarios

Big Data Cluster       Big Data Tools
NoSQL databases        CouchDB, MongoDB, Cassandra, Redis, ZooKeeper, HBase
MapReduce              Hadoop, Hive, Pig, Cascading, Oozie, Kafka, S4, MapR, Flume
Storage                S3, HDFS (Hadoop Distributed File System)
Servers                Elastic, Heroku, Google App Engine, EC2
Processing             R, Yahoo! Pipes, Mechanical Turk, BigSheets, Datameer
Big data Testing vs. Traditional database Testing
Properties compared: Data, Infrastructure, and Validation Tools.

Data
Traditional database testing: The tester works with structured data. The testing approach is well defined and time-tested. The tester has the option of a "Sampling" strategy done manually or an "Exhaustive Verification" strategy using an automation tool.
Big Data testing: The tester works with both structured as well as unstructured data. The testing approach requires focused R&D efforts. The "Sampling" strategy in Big Data is a challenge.

Infrastructure
Traditional database testing: It does not require a special test environment, as the file size is limited.
Big Data testing: It requires a special test environment due to the large data size and files (HDFS).

Validation Tools
Traditional database testing: The tester uses either Excel-based macros or UI-based automation tools. Testing tools can be used with basic operating knowledge and less training.
Big Data testing: There are no defined tools; the range is vast, from programming tools like MapReduce to HiveQL. It requires a specific set of skills and training to operate the testing tools. Also, the tools are in their nascent stage and over time they may come up with new features.
CHAPTER 8 RESULTS
CHAPTER 9
CONCLUSION AND FUTURE WORK

9.1 Conclusion

This project aims to show how large amounts of unstructured data can be easily analyzed and how important information can be extracted from the data using the Apache Hadoop framework and its related technologies. The approach described here can be used to effectively address data analysis challenges such as traffic log analysis, user consumption patterns, best-selling products, and so on. Information gleaned from the analysis can have important economic and societal value, and can help organizations and people to operate more efficiently. The limit to implementations is set only by human imagination.
9.2 Future enhancements

For now, Hadoop does a good job of pushing the limits on the size of data that can be analyzed and of extracting valuable information from seemingly arbitrary data. However, a lot of work still needs to be done to improve the latency time in Hadoop solutions so that Hadoop will be applicable to scenarios where almost real-time responses are required.
CHAPTER 10
REFERENCES

[1] J. H. Andrews: "Theory and practice of log file analysis." Technical Report 524, Department of Computer Science, University of Western Ontario, May 1998.
[2] J. H. Andrews: "Testing using log file analysis: tools, methods, and issues." Proc. 13th IEEE International Conference on Automated Software Engineering, Oct. 1998, pp. 157-166.
[3] J. H. Andrews: "A Framework for Log File Analysis." http://citeseer.nj.nec.com/159829.html
[4] J. H. Andrews, Y. Zhang: "Broad-spectrum studies of log file analysis." International Conference on Software Engineering, pages 105-114, 2000.
[5] J. H. Andrews: "Testing using Log File Analysis: Tools, Methods, and Issues." Available at http://citeseer.nj.nec.com
[6] L. Lamport: "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, Vol. 21, No. 7, July 1978.
[7] K. M. Chandy, L. Lamport: "Distributed Snapshots: Determining Global States of Distributed Systems." ACM Transactions on Computer Systems, Vol. 3, No. 1, February 1985.
[8] F. Cristian: "Probabilistic Clock Synchronization." 9th Int. Conference on Distributed Computing Systems, June 1989.
[9] M. Guzdial, P. Santos, A. Badre, S. Hudson, M. Gray: "Analyzing and visualizing log files: A computational science of usability." Presented at HCI Consortium Workshop, 1994.
[10] M. J. Guzdial: "Deriving software usage patterns from log files." Georgia Institute of Technology, GVU Center Technical Report #93-41, 1993.
[11] Tec-Ed, Inc.: "Assessing Web Site Usability from Server Log Files." White Paper. http://citeseer.nj.nec.com/290488.html
[12] Osmar R. Zaiane, Man Xin, and Jiawei Han: "Discovering web access patterns and trends by applying OLAP and data mining technology on web logs." In Proc. Advances in Digital Libraries ADL'98, pages 19-29, Santa Barbara, CA, USA, April 1998.
[13] Cisco Systems: "NetFlow Services and Application." White paper. Available at http://www.cisco.com
[14] Cisco Systems: "NetFlow FlowCollector Installation and User Guide." Available at http://www.cisco.com
[15] D. Gunter, B. Tierney, B. Crowley, M. Holding, J. Lee: "NetLogger: A Toolkit for Distributed System Performance Analysis." Proceedings of the IEEE MASCOTS 2000 Conference, August 2000, LBNL-46269.
[16] B. Tierney, W. Johnston, B. Crowley, G. Hoo, C. Brooks, D. Gunter: "The NetLogger Methodology for High Performance Distributed Systems Performance Analysis." Proceedings of the IEEE High Performance Distributed Computing Conference (HPDC-7), July 1998, LBNL-42611.
[17] Google Web Directory: "A list of HTTP log analysis tools." http://directory.google.com/Top/Computers/Software/Internet/Site_Management/Log_Analysis
[18] J. Abela, T. Debeaupuis: "Universal Format for Logger Messages." Expired IETF draft.
[19] C. J. Calabrese: "Requirements for a Network Event Logging Protocol." IETF draft.
[20] SOFA group at Charles University, Prague. http://nenya.ms.mff.cuni.cz/thegroup/SOFA/sofa.html