CHAPTER 1
INTRODUCTION

1.1 OVERVIEW OF THE PROJECT
The volume, variety, and velocity of data being produced in all areas of the retail industry are growing exponentially, creating both challenges and opportunities for those diligently analyzing this data to gain a competitive advantage. Although retailers have been using data analytics to generate business intelligence for years, the extreme scale and composition of today's data necessitates new approaches and tools. The retail industry has entered the big data era, with access to more information that can be used to create amazing shopping experiences and forge tighter connections between customers, brands, and retailers. A trail of data follows products as they are manufactured, shipped, stocked, advertised, purchased, consumed, and talked about by consumers – all of which can help forward-thinking retailers improve sales and operations performance. This requires an end-to-end retail analytics solution capable of analyzing large datasets populated by retail systems and sensors, enterprise resource planning (ERP), inventory control, social media, and other sources. In an attempt to demystify retail data analytics, this project chronicles a real-world implementation that is producing tangible benefits, such as allowing retailers to:
Increase sales per visit with a deeper understanding of customers’ purchase patterns.
Learn about new sales opportunities by identifying unexpected trends from social media.
Improve inventory management with greater visibility into the product pipeline.

A set of simple analytics experiments was performed to create capabilities and a framework for conducting large-scale, distributed data analytics. These experiments facilitated an understanding of the edge-to-cloud business analytics value proposition and, at the same time, provided insight into the technical architecture and integration needed for implementation.

1.2 COMPETITIVE ADVANTAGE
According to the McKinsey Global Institute, big data has the potential to increase net retailer margins by 60 percent. Likewise, companies in the top third of their industry in the use of data-driven decision making were, on average, five percent more productive and six percent more profitable than their competitors, wrote Andrew McAfee and Erik Brynjolfsson in a Harvard Business Review article. Generating the insight needed to reap substantial business benefits requires new, innovative approaches and technologies. Big data in retail is like a mountain, and retailers must uncover those tiny but game-changing golden nuggets of insight and knowledge that can be used to create a competitive advantage.

1.3 BIG DATA TECHNOLOGIES
In order for retailers to realize the full potential of big data, they must find a new approach for handling large amounts of data. Traditional tools and infrastructure are struggling to keep up with larger and more varied data sets arriving at high velocity. New technologies are emerging to make big data analytics scalable and cost-effective, such as a distributed grid of computing resources in which processing is pushed out to the nodes where the data resides, in contrast to long-established approaches that retrieve data for processing at a central point. Hadoop is a popular, open-source software framework that enables distributed processing of large data sets on clusters of computers. The framework scales easily on servers based on Intel Xeon processors, as demonstrated by Cloudera, a provider of Apache Hadoop-based software, support services, and training. Although Hadoop has captured a lot of attention, other tools and technologies are also available for working on different types of big data analytics problems.
1.4 REAL-WORLD USE CASES
To demonstrate real-world big data use cases, this project chronicles work with Living Naturally, whose programs serve roughly 3,000 natural food stores in the United States. Living Naturally develops and markets a suite of online and mobile programs designed to enhance the productivity and marketing capabilities of retailers and suppliers. Since 1999, Living Naturally has worked with thousands of retail customers and 20,000 major industry brands. Over a period of three months, a team of four analysts worked with Living Naturally on a retail analytics project spanning four key phases: problem definition, data preparation, algorithm definition, and big data solution implementation. During the project, the following use cases were investigated in detail:
Product Pipeline Tracking: When inventory levels are out of sync with demand, make recommendations to retail buyers to remedy the situation and maximize return for the store.
Market Basket Analysis: When an item goes on sale, let retailers know about adjacent products that could benefit from a sales increase as well (a small illustrative sketch of this idea follows).
Social Media Analysis: Before products go viral on social media, suggest that retail buyers increase order sizes so they can respond to shifting consumer demand and avoid out-of-stocks.
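The market basket use case can be made concrete with a short sketch. The snippet below is a minimal, hypothetical Python example (the basket data and product names are invented, not taken from the Living Naturally project): it counts how often pairs of items appear in the same transaction, which is the simplest way to surface products that tend to sell together when one of them is promoted.

    from collections import Counter
    from itertools import combinations

    # Hypothetical point-of-sale baskets; in practice these come from POS extracts.
    baskets = [
        {"blue jeans", "belt", "socks"},
        {"blue jeans", "belt"},
        {"multivitamin", "protein bar"},
        {"blue jeans", "socks"},
    ]

    pair_counts = Counter()
    for basket in baskets:
        # Count every unordered pair of items bought together in one visit.
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # When one item goes on sale, list the items most often bought alongside it.
    on_sale = "blue jeans"
    adjacent = {(set(pair) - {on_sale}).pop(): count
                for pair, count in pair_counts.items() if on_sale in pair}
    for item, count in sorted(adjacent.items(), key=lambda kv: -kv[1]):
        print(f"{item} appears with {on_sale} in {count} baskets")

In a production setting the same pair counting would be expressed as a MapReduce or Hive job over the full transaction history rather than an in-memory loop.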
CHAPTER 2
LITERATURE REVIEW

2.1 Background of the Study
The background study includes the investigation of, and possible changes to, the existing system. Analysis is used to gain an understanding of the existing system and what is required of it. At the conclusion of the system analysis, there is a system description and a set of requirements for a new system. A detailed study of the process must be made using techniques such as interviews and questionnaires. The data collected from these sources must be scrutinized to arrive at a conclusion, and that conclusion is an understanding of how the system functions. This system is called the existing system. The existing system is then subjected to close study and its problem areas are identified. The designer now functions as a problem solver and tries to sort out the difficulties that the enterprise faces. The solutions are given as proposals. Each proposal is weighed against the existing system analytically and the best one is selected. The proposal is presented to the user for endorsement, reviewed on user request, and suitable changes are made. This is a loop that ends as soon as the user is satisfied with the proposal. The preliminary study is a problem-solving activity that requires intensive communication between the system users and the system developers. It involves various feasibility studies, from which a rough picture of the system activities can be obtained and used to take decisions regarding the strategies to be followed for effective system development. The tasks to be carried out in system analysis involve: examining the documents and the relevant aspects of the existing system, its failures and problems; analysing the findings and recording the results; defining and documenting in outline the proposed system; testing the proposed design against the known facts; producing a detailed report to support the proposals; and estimating the resources required to design and implement the proposed system.
The objective of this system study is to determine whether there is any need for the new system. All levels of feasibility measures have to be performed, thereby establishing the performance expected of the new system.

2.1.1 Problem Definition
Problem definition deals with observations, site visits and discussions to identify, analyze and document project requirements, and with feasibility studies and technical assessments to determine the best approaches for full system development. In the present system, adding new features is very difficult and creates more overhead, and the system is not easily customizable. The existing system records customer details, employee details, and new service/event details manually. A change in one module or any part of the system propagates widely to other parts or sections, and it is difficult for the administrator to coordinate all the changes. The present system therefore becomes more error prone when updates are considered, and it also has many security problems. Keeping this problem definition in mind, the proposed system is designed to be easily customizable, user friendly, and easy to extend with new features in the future.

2.1.2 Requirement Analysis
Requirements analysis, in systems engineering and software engineering, encompasses the tasks that go into determining the needs or conditions to be met by a new or altered product or project, taking account of the possibly conflicting requirements of the various stakeholders, and analyzing, documenting, validating and managing software or system requirements. Requirements analysis is critical to the success of a systems or software project. Requirement analysis for the proposed retail analytics system produced the results described in the following sections.
2.2 EXISTING SYSTEM
Increasingly connected consumers and retail channels have made a wide variety of new data types and sources available to today's retailer. Traditional structured data, such as the transactional and operational data found in a data warehouse, is now only a small part of the overall data landscape for retailers. Savvy retailers now collect large volumes of unstructured data such as web logs and clickstream data, location data, social network interactions, and data from a variety of sensors. The opportunities offered by this data are significant and span all functional areas in retail. According to a 2011 study by the McKinsey Global Institute, retailers fully exploiting big data technology stand to dramatically improve margins and productivity. In order for retailers to take advantage of these new types of data, they need a new approach to storage and analysis.

2.2.1 The Data Deluge, and Other Barriers
Modern retailers are increasingly pursuing omni-channel retail strategies that seek to create a seamless customer experience across a variety of physical and digital interaction channels, including in-store, telephone, web, mobile and social media. The emergence of omni-channel retailing, the need for a 360-degree view of the customer, and advances in retail technology are leading today's retailers to collect a wide variety of new types of data. These new data sources, including clickstream and web logs, social media interactions, search queries, in-store sensors and video, marketing assets and interactions, and a variety of public and purchased data sets provided by third parties, have put tremendous pressure on traditional retail data systems. The challenges retailers face in collecting big data are often characterized by "the three Vs":
Volume, or the sheer amount of data being amassed by today's enterprises.
Velocity, or the speed at which this data is created, collected and processed.
Variety, or the fact that the fastest growing types of data have little or no inherent structure, or a structure that changes frequently.
These three Vs have individually and collectively outstripped the capabilities of traditional storage and analytics solutions that were created in a world of more predictable data. Compounding the challenges represented by the Vs is the fact that collected data can have little or no value as individual or small groups of records. But when explored in the aggregate, in very large quantities or over a longer historical time frame, the combined data reveals patterns that feed advanced analytical applications. In addition, retailers seeking to harness the power of big data face a distinct set of challenges. These include:
Data Integration. It can be challenging to integrate unstructured and transactional data while also accommodating privacy concerns and regulations.
Skill Building. Many of the most valuable big data insights in retail come from advanced techniques like machine learning and predictive analytics. However, to use these techniques analysts and data scientists must build new skill sets.
Application of Insight. Incorporating big data analytics into retail will require a shift in industry practices from those based on weekly or monthly historical reporting to new techniques based on real-time decision-making and predictive analytics.

Fortunately, open technology platforms like Hadoop allow retailers to overcome
many of the general and industry-specific challenges just described. Retailers who do so benefit from the rapid innovation taking place within the Hadoop community and the broader big data ecosystem.

2.3 PROPOSED SYSTEM
To overcome the drawbacks of the existing system, retailers from across the industry have turned to Apache Hadoop to meet these needs and to collect, manage and analyze a wide variety of data. In doing so, they are gaining new insights into the customer, offering the right product at the right price, and improving operations and the supply chain. Apache Hadoop is an open-source data processing platform created at web-scale Internet companies confronted with the challenge of storing and processing massive amounts of structured and unstructured data. By combining the affordability of low-cost commodity servers with an open source approach, Hadoop provides cost-effective and efficient data storage and processing that scales to meet the needs of the very largest retail organizations. Hadoop enables a shared "data lake" that allows retail organizations to:
Collect Everything: A data lake can store any type of data, including an enterprise's structured data, publicly available data, and semi-structured and unstructured social media content. Data generated by the Internet of Things, such as in-store sensors that capture location data on how products and people move about the store, is also easily included.
Dive in Anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their own terms, performing data exploration and discovery to support business decisions.
Flexibly Access Data: A data lake supports multiple data access patterns across a shared infrastructure, enabling batch, interactive, real-time, in-memory and other types of processing on the underlying data sets.

A Hadoop-based data lake, in conjunction with existing data management
investments, can provide retail enterprises an opportunity for Big Data analytics while at the same time increasing storage and processing efficiency, which reduces costs. Big data technology and approaches have broad application within the retail domain. McKinsey Global Institute identified 16 big data use cases, or “levers,” across marketing, merchandising, operations, supply chain and new business models.
Function               Big data levers
Marketing              Cross-selling; location-based marketing; in-store behavior analysis; customer micro-segmentation; sentiment analysis; enhancing the multichannel consumer experience
Merchandising          Assortment optimization; pricing optimization; placement and design optimization
Operations             Performance transparency; labor inputs optimization
Supply chain           Inventory management; distribution and logistics optimization; informing supplier negotiations
New business models    Price comparison services; web-based markets

Table 1: Big Data Levers in Retail Analytics
2.5 FACT FINDING TECHNIQUES
Requirements analysis encompasses all of the tasks that go into the investigation, scoping and definition of a new or altered system. The first activity in the analysis phase is the preliminary investigation. During the preliminary investigation, data collection is very important, and the following fact finding techniques can be used for collecting the data:
Observation - This is a skill which the analyst has to develop. The analyst has to identify the right information, choose the right person and look at the right place to achieve his objective. He should have a clear vision of how each department works and of the work flow between them, and for this he should be a good observer.
Interview - This method is used to collect information from groups or individuals. The analyst sits face to face with the people and records their responses. The questions to be asked are prepared in advance, and the analyst should be ready to answer any question in return. The information collected is quite accurate and reliable, as doubts are cross-checked and cleared at the customer site itself. This method also helps to bridge areas of misunderstanding and to discuss future problems.

Requirements analysis is an important part of the system design process, whereby requirements engineers and business analysts, along with systems engineers or software developers, identify the needs or requirements of a client. Once the client's requirements have been identified and the facts collected, the system designers are in a position to design a solution.
2.6 REVIEW OF THE LITERATURE
Text summarization is an old challenge in text mining but is in dire need of researchers' attention in the areas of computational intelligence, machine learning and natural language processing. A set of features is extracted from each sentence to help identify its importance in the document, since reading the full text every time is time consuming. A clustering approach is useful to decide which type of data is present in a document. The concept of k-means clustering is used for natural language processing of text for word matching, and in order to extract meaningful information from a large set of offline documents, data mining document clustering algorithms are adopted. Work on automated text summarization has focused on two main ideas: first, how a summarizer has to treat a huge quantity of data, and second, how it may be possible to produce a human-quality summary. Depending on the nature of text representation in the summary, a summary can be categorized as an abstract or an extract. An extract is a summary consisting of sentences taken from the document that convey its overall idea, while an abstract is a summary that re-expresses the subject matter of the document. In general, the task of document summarization covers generic summarization and query-oriented summarization. The query-oriented method generates a summary of documents with respect to a query, while the generic method summarizes the overall sense of the document without any additional information. Traditional document clustering algorithms use the full text of documents to generate feature vectors. Such methods often produce unsatisfactory results because there is much noisy information in documents, and the varying-length problem of documents is also a significant negative factor affecting performance. The technique reviewed here retrieves important sentences that emphasize high information richness, and the sentences with maximum scores are clustered to generate the summary of the document. Thus k-means clustering is used to group the highest-scoring sentences of the document and find the relations needed to extract clusters with the most relevant sets in the document, which helps in producing the summary. The main purpose of the k-means clustering algorithm here is to generate a predefined length of summary containing maximally informative sentences. The approach to automatic text summarization is demonstrated by extracting sentences from the Reuters-21578 corpus, which consists of newspaper articles.
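The extractive, clustering-based approach described above can be sketched in a few lines. The example below is only an illustration, assuming scikit-learn is available; the sample sentences and the choice of k are invented, whereas the reviewed work draws its sentences from the Reuters-21578 corpus. Sentences are converted to TF-IDF vectors, grouped with k-means, and the sentence closest to each cluster centre is kept for the summary.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import pairwise_distances_argmin_min

    # Illustrative sentences standing in for a full news article.
    sentences = [
        "Retail sales rose sharply in the last quarter.",
        "Analysts attribute the growth to online promotions.",
        "Inventory levels at several chains remain low.",
        "Suppliers are struggling to keep shelves stocked.",
        "Customer sentiment on social media stayed positive.",
    ]

    # Represent each sentence as a TF-IDF feature vector.
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(sentences)

    k = 2  # desired summary length in sentences
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # Keep the sentence nearest each cluster centre as its representative.
    closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
    summary = [sentences[i] for i in sorted(set(closest))]
    print(" ".join(summary))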
CHAPTER 3
RESEARCH METHODOLOGY

3.1 APACHE HADOOP
Apache Hadoop is an open source software framework used for distributed storage and processing of big data sets using the MapReduce programming model. It consists of computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework. The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes so they can process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have local access to. This allows the dataset to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking. The base Apache Hadoop framework is composed of the following modules:
Hadoop Common – contains libraries and utilities needed by other Hadoop modules;
Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
Hadoop YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications; and
Hadoop MapReduce – an implementation of the MapReduce programming model for large scale data processing.
The term Hadoop has come to refer not just to the base modules and sub-modules above, but also to the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm. Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and the Google File System. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program (a small Hadoop Streaming example is given at the end of Section 3.1.3). Other projects in the Hadoop ecosystem expose richer user interfaces.

3.1.1 Architecture
Hadoop consists of the Hadoop Common package, which provides file system and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java ARchive (JAR) files and scripts needed to start Hadoop. For effective scheduling of work, every Hadoop-compatible file system should provide location awareness: the name of the rack (more precisely, of the network switch) where a worker node is.
Hadoop applications can use this information to execute code on the node where the data is and, failing that, on the same rack/switch, to reduce backbone traffic. HDFS uses this method when replicating data for redundancy across multiple racks. This approach reduces the impact of a rack power outage or switch failure; if one of these hardware failures occurs, the data will remain available.

3.1.2 File Systems
Hadoop Distributed File System: The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Some consider HDFS to be a data store rather than a file system, due to its lack of POSIX compliance and its inability to be mounted, but it does provide shell commands and Java API methods that are similar to those of other file systems. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode because of its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication, and clients use remote procedure calls (RPC) to communicate with each other. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (though some RAID configurations are still useful for increasing I/O performance). With the default replication value of 3, data is stored on three nodes: two on the same rack and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX compliant, because the requirements for a POSIX file system differ from the target goals of a Hadoop application. The trade-off of not having a fully POSIX-compliant file system is increased performance for data throughput and support for non-POSIX operations such as Append.
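The shell interface mentioned above can be exercised with a handful of commands. The sketch below wraps the standard hdfs dfs sub-commands in Python; the paths, file name and explicit replication factor are hypothetical, and it assumes a running cluster with the hdfs client on the PATH.

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' sub-command and raise if it fails."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    # Create a directory in HDFS and copy a local sales extract into it.
    hdfs("-mkdir", "-p", "/retail/sales")
    hdfs("-put", "-f", "pos_transactions.csv", "/retail/sales/")

    # Request three replicas (the HDFS default) explicitly, then list the result.
    hdfs("-setrep", "3", "/retail/sales/pos_transactions.csv")
    hdfs("-ls", "/retail/sales")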
Figure 3.1: Architecture of Hadoop

HDFS added high-availability capabilities, as announced for release 2.0 in May 2012, letting the main metadata server (the NameNode) fail over manually to a backup. The project has also started developing automatic fail-over. The HDFS file system includes a so-called secondary namenode, a misleading name that some might incorrectly interpret as a backup namenode for when the primary namenode goes offline. In fact, the secondary namenode regularly connects to the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions and then edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck when supporting a huge number of files, especially a large number of small files.
HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some issues in HDFS, namely the small-file issue, the scalability problem, the Single Point of Failure (SPoF), and the bottleneck created by huge metadata requests. An advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example, if node A contains data (x, y, z) and node B contains data (a, b, c), the job tracker schedules node B to perform map or reduce tasks on (a, b, c) and node A to perform map or reduce tasks on (x, y, z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems, this advantage is not always available, which can have a significant impact on job-completion times, as has been demonstrated when running data-intensive jobs. HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API), the Thrift API (which generates a client in the language of the user's choosing: C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, or OCaml), the command-line interface, the HDFS-UI web application over HTTP, or via third-party network client libraries. HDFS is designed for portability across various hardware platforms and for compatibility with a variety of underlying operating systems. However, the HDFS design introduces portability limitations that result in some performance bottlenecks, since the Java implementation cannot use features that are exclusive to the platform on which HDFS is running. Due to its widespread integration into enterprise-level infrastructure, monitoring HDFS performance at scale has become an increasingly important issue.
Monitoring end-to-end performance requires tracking metrics from datanodes, namenodes, and the underlying operating system. There are currently several monitoring platforms that track HDFS performance, including Hortonworks, Cloudera, and Datadog.

3.1.3 JobTracker and TaskTracker: The MapReduce Engine
Above the file systems comes the MapReduce engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work out to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java Virtual Machine (JVM) process to prevent the TaskTracker itself from failing if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The JobTracker and TaskTracker status and information are exposed by Jetty and can be viewed from a web browser. Known limitations of this approach are:
The allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as "4 slots"), and every active map or reduce task takes up one slot. The JobTracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence of its actual availability.
If one TaskTracker is very slow, it can delay the entire MapReduce job, especially towards the end, when everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.
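As noted in Section 3.1, the map and reduce steps can be supplied in any language through Hadoop Streaming. The pair of scripts below is a minimal word-count sketch in Python; the script names, input and output paths, and the location of the streaming jar are assumptions that vary by installation.

    # mapper.py - emits "word<TAB>1" for every word read from standard input
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py - sums counts per word; Hadoop sorts the mapper output by key
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").rpartition("\t")
        if not word:
            continue
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

A job built from these scripts would typically be submitted with something like: hadoop jar hadoop-streaming.jar -input /retail/reviews -output /retail/wordcount -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar path and options depend on the Hadoop version installed).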
3.1.4 Scheduling
By default Hadoop uses FIFO scheduling, with an optional five scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, and the ability to use an alternate scheduler was added (such as the Fair Scheduler or the Capacity Scheduler, described next).

3.1.5 Fair Scheduler
The Fair Scheduler was developed by Facebook. Its goal is to provide fast response times for small jobs and QoS for production jobs. The Fair Scheduler has three basic concepts: jobs are grouped into pools; each pool is assigned a guaranteed minimum share; and excess capacity is split between jobs. By default, jobs that are uncategorized go into a default pool. Pools can specify a minimum number of map slots and reduce slots, as well as a limit on the number of running jobs.

3.1.6 Capacity Scheduler
The Capacity Scheduler was developed by Yahoo. It supports several features that are similar to those of the Fair Scheduler: queues are allocated a fraction of the total resource capacity; free resources are allocated to queues beyond their total capacity; and within a queue, a job with a high level of priority has access to the queue's resources. There is no preemption once a job is running.

3.2 APACHE HIVE
Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data summarization, query, and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Without Hive, traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

3.2.1 Architecture
The major components of the Hive architecture are:
Figure 3.2: Architecture of Hive
Metastore: Stores metadata for each of the tables, such as their schema and location. It also includes the partition metadata, which helps the driver track the progress of various data sets distributed over the cluster. The data is stored in a traditional RDBMS format. The metadata helps the driver keep track of the data and it is highly critical; hence, a backup server regularly replicates the metadata so it can be retrieved in case of data loss.
Driver: Acts like a controller which receives the HiveQL statements. It starts the execution of a statement by creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of a HiveQL statement, and also acts as a collection point of the data or query result obtained after the reduce operation.
Compiler: Performs compilation of the HiveQL query, converting the query into an execution plan. This plan contains the tasks and steps that Hadoop MapReduce must perform to produce the output requested by the query. The compiler converts the query to an abstract syntax tree (AST). After checking for compatibility and compile-time errors, it converts the AST to a directed acyclic graph (DAG). The DAG divides operators into MapReduce stages and tasks based on the input query and data.
Optimizer: Performs various transformations on the execution plan to get an optimized DAG. Transformations can be aggregated together, such as converting a pipeline of joins into a single join, for better performance. It can also split tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability. The transformation logic used for optimization can be modified or pipelined using another optimizer.
Executor: After compilation and optimization, the executor executes the tasks according to the DAG. It interacts with the job tracker of Hadoop to schedule tasks to be run. It takes care of pipelining the tasks by making sure that a task with dependencies is executed only after all its prerequisites have run.
CLI, UI, and Thrift Server: The command line interface and user interface allow an external user to interact with Hive by submitting queries and instructions and monitoring the process status. The Thrift server allows external clients to interact with Hive just as JDBC/ODBC servers do.

3.2.2 APACHE SQOOP
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and external datastores such as relational databases and enterprise data warehouses. Sqoop is used to import data from external datastores into the Hadoop Distributed File System or related Hadoop ecosystems like Hive and HBase. Similarly, Sqoop can also be used to extract data from Hadoop or its ecosystems and export it to external datastores such as relational databases and enterprise data warehouses. Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres and others. For Hadoop developers, the interesting work starts after data is loaded into HDFS, when they explore the data in order to find the insights concealed in it. For this, data residing in relational database management systems needs to be transferred to HDFS, worked on, and sometimes transferred back to the relational systems. In the big data world, developers find this transfer of data between relational database systems and HDFS tedious and uninteresting, yet it is frequently required. Developers can always write custom scripts to transfer data in and out of Hadoop, but Apache Sqoop provides an alternative. Sqoop automates most of the process, relying on the database to describe the schema of the data to be imported. Sqoop uses the MapReduce framework to import and export the data, which provides a parallel mechanism as well as fault tolerance. Sqoop makes a developer's life easy by providing a command line interface: the developer just provides basic information such as source, destination and database authentication details in the sqoop command, and Sqoop takes care of the rest. A sketch of this workflow is given below.
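As a concrete sketch of that command-line workflow, the snippet below builds a typical sqoop import that lands a MySQL sales table directly in a Hive table and then queries it with HiveQL through the beeline client. The connection strings, table names, paths and credentials are hypothetical, and the exact flags available depend on the Sqoop and Hive versions installed.

    import subprocess

    # Import the MySQL 'pos_sales' table into a Hive table using Sqoop.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/retail",      # hypothetical source database
        "--username", "analyst",
        "--password-file", "/user/analyst/.dbpass",
        "--table", "pos_sales",
        "--hive-import", "--hive-table", "retail.pos_sales",
        "--num-mappers", "4",                           # degree of parallel import
    ], check=True)

    # Query the imported data with HiveQL via the beeline CLI.
    subprocess.run([
        "beeline", "-u", "jdbc:hive2://hiveserver:10000/retail",
        "-e", "SELECT product_id, SUM(quantity) AS units "
              "FROM pos_sales GROUP BY product_id ORDER BY units DESC LIMIT 10;",
    ], check=True)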
Sqoop provides many salient features, including:
Full load
Incremental load
Parallel import/export
Import of the results of an SQL query
Compression
Connectors for all major RDBMS databases
Kerberos security integration
Loading data directly into Hive/HBase
Support for Accumulo

Sqoop is robust, and has great community support and contributions. It is widely used in most big data companies to transfer data between relational databases and Hadoop. Relational database systems are widely used to interact with traditional business applications, so relational databases have become one of the sources that generate big data. As we are dealing with big data, Hadoop stores and processes it using different processing frameworks like MapReduce, Hive, HBase, Cassandra and Pig, and storage frameworks like HDFS, to gain the benefits of distributed computing and distributed storage. In order to store and analyze big data from relational databases, the data needs to be transferred between the database systems and the Hadoop Distributed File System (HDFS). Here, Sqoop comes into the picture: it acts as an intermediate layer between Hadoop and relational database systems. You can import and export data between relational database systems and Hadoop and its ecosystems directly using Sqoop. Sqoop provides a command line interface to end users and can also be accessed using Java APIs. A Sqoop command submitted by the end user is parsed by Sqoop, which launches a Hadoop map-only job to import or export the data; a reduce phase is required only when aggregations are needed, and Sqoop just imports and exports the data without doing any aggregation.
Sqoop parses the arguments provided in the command line and prepares the map job. The map job launches multiple mappers, depending on the number defined by the user in the command line. For a Sqoop import, each mapper task is assigned a part of the data to be imported, based on a key defined in the command line. Sqoop distributes the input data equally among the mappers to get high performance. Each mapper then creates a connection with the database using JDBC, fetches the part of the data assigned to it by Sqoop, and writes it into HDFS, Hive or HBase based on the options provided in the command line.
Figure 3.3: Architecture of Sqoop
Sqoop supports incremental loads of a single table or of a free-form SQL query, as well as saved jobs which can be run multiple times to import the updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase, and exports can be used to put data from Hadoop into a relational database. Sqoop gets its name from "SQL-to-Hadoop", and it became a top-level Apache project in March 2012.
Informatica provides a Sqoop-based connector from version 10.1. Pentaho provides open source Sqoop-based connector steps, Sqoop Import and Sqoop Export, in its ETL suite Pentaho Data Integration since version 4.5 of the software. Microsoft uses a Sqoop-based connector to help transfer data from Microsoft SQL Server databases to Hadoop.

3.3 POWER BI
Power BI is not a new name in the BI market; components of Power BI have been in the market for different periods of time. Some components, such as Power BI Desktop, are so new that they reached general availability only on 24 July 2015, while Power Pivot was first released in 2010. The Microsoft team worked over a long period of time to build a big umbrella called Power BI. This umbrella is not just a visualization tool such as Tableau, not just a self-service data analysis tool such as PivotTable and PivotChart in Excel, and not just a cloud-based tool for data analysis. Power BI is a combination of all of those, and much more. With Power BI you can connect to many data sources (a wide range of data sources is supported, and more are added every month). You can mash up the data as you want with a very powerful data mash-up engine. You can model the data, build your star schema, or add measures and calculated columns with a super-fast in-memory engine. You can visualize data with a great range of data visualization elements and customize them to tell the story behind the data. You can publish your dashboards and visualizations in the cloud and share them with whomever you want. You can work with on-premises as well as Azure/cloud-based data sources, and there are many more things you can do with Power BI that you cannot do easily with other products. Power BI is a cloud-based data analysis service which can be used for reporting and data analysis from a wide range of data sources. Power BI is simple and user friendly enough that business analysts and power users can work with it and benefit from it; on the other hand, it is powerful and mature enough to be used in enterprise systems by BI developers for complex data mash-up and modelling scenarios.
Power BI is made up of six main components, which were released to the market separately and can even be used individually:
Power Query: data mash-up and transformation tool.
Power Pivot: in-memory tabular data modelling tool.
Power View: data visualization tool.
Power Map: 3D geo-spatial data visualization tool.
Power Q&A: natural language question and answering engine.
Power BI Desktop: a powerful companion development tool for Power BI.
There are other parts of Power BI as well, such as the PowerBI.com website, through which Power BI data analysis can be shared and hosted as a cloud service, and the Power BI mobile apps, with Power BI supported on Android, Apple, and Windows phones. Some of the above components are mature and have been tested for a very long time; others are new and receive frequent, regular updates. Power BI provides easy-to-follow graphical user interfaces, so a business user can simply use Power Query or Power BI Desktop to mash up data without writing a single line of code. On the other hand, it is so powerful with the Power Query formula language (M) and Data Analysis Expressions (DAX) that developers can write complex code for data mash-up and calculated measures to meet challenging requirements. So if you have heard somewhere that Power BI is a basic self-service data analysis tool for business analysts and cannot be used for large enterprise systems, that is simply wrong: Power BI has been used in many large enterprise-scale systems and applications, and in many case studies all around the world.
Power BI components can be used individually or in combination. Power Query has an add-in for Excel 2010 and Excel 2013, and it is embedded in Excel 2016. The Power Query add-in is available free for everyone to download and use alongside an existing Excel installation (as long as it is Excel 2010 or a higher version). Power Pivot has been an add-in since Excel 2010 and is embedded in Excel from Excel 2013 onward; this add-in is again free to use. Power View is a free add-in for Excel 2013. Power Map is an add-in for Excel 2013 and is embedded in Excel 2016 as 3D Maps. Power Q&A does not require any installation or add-in; it is just an engine for question answering that works on top of models built in Power BI with the other components. The components above can be used in combination: you can mash up the data with Power Query and load the result set into a Power Pivot model, and you can use the model built in Power Pivot for data visualization in Power View or Power Map. Fortunately, there is a development tool that combines the three main components of Power BI: Power BI Desktop gives you a combined editor for Power Query, Power Pivot, and Power View. Power BI Desktop is available as a stand-alone product that can be downloaded separately, and with it you have all parts of the solution in one holistic view.

3.3.1 Power Query
Power Query is the data transformation and mash-up engine. It can be downloaded as an add-in for Excel or used as part of Power BI Desktop. With Power Query you can extract data from many different data sources: databases such as SQL Server, Oracle, MySQL and DB2; files such as CSV, text and Excel (you can even loop through a folder); services such as Microsoft Exchange, Outlook and Azure; and applications such as Facebook. You can also use online search or a web address as the source to fetch data from a web page. Power Query gives you a graphical user interface to transform data as needed; adding columns, changing types, and transformations for date and time, text and many other operations are available. Power Query can load the result set into Excel or into a Power Pivot model. Power Query also uses a powerful formula language behind the scenes called M. M is much more powerful than the GUI built on top of it, and there is much functionality in M that cannot be accessed through the graphical user interface. Power Query and M are covered in more depth in later chapters, so that complex transformations can be applied to the data with confidence. The screenshot below shows a view of the Power Query editor and some of its transformations.
Figure 3.4: Power Query
3.3.2 Power Pivot
Power Pivot is the data modelling engine, which works on the xVelocity in-memory tabular engine. The in-memory engine gives Power Pivot a super-fast response time, and the modelling engine provides a great place to build your star schema, calculated measures and columns, and relationships between entities. Power Pivot uses the Data Analysis Expressions (DAX) language for building measures and calculated columns. DAX is a powerful functional language, and there are heaps of functions for it in the library. Details of Power Pivot modelling and DAX are covered in later chapters. The screenshot below shows the relationship diagram of Power Pivot.
Figure 3.5: Power Pivot
3.3.3 Power View
The main data visualization component of Power BI is Power View. Power View is an interactive data visualization tool that can connect to data sources and fetch the metadata to be used for data analysis. Power View has many chart types in its list of visualizations and gives you the ability to filter data for each visualization element or for the entire report. You can use slicers for better slicing and dicing of the data. Power View reports are interactive: the user can highlight part of the data, and the different elements in Power View talk to each other.
Figure 3.6: Power View

3.3.4 Power Map
Power Map is for visualizing geo-spatial information in 3D mode. Rendering a visualization in 3D gives you another dimension: you can visualize one measure as the height of a column in 3D and another measure as a heat-map view. You can highlight data based on geographical location, such as country, city, state, and street address.
Power Map works with Bing Maps to get the best visualization based on geographical information: either latitude and longitude, or country, state, city, and street address. Power Map is an add-in for Excel 2013 and is embedded in Excel 2016.

3.3.5 Power BI Desktop
Power BI Desktop is the newest component in the Power BI suite. It is a holistic development tool for Power Query, Power Pivot and Power View. With Power BI Desktop you have everything under the same solution, and it is easier to develop the BI and data analysis experience with it. Power BI Desktop is updated frequently and regularly. The product was in preview mode for a period of time under the name Power BI Designer. There is more to Power BI Desktop than fits in a small paragraph here; it is covered in later chapters, including a "Power BI Hello World" section with a demo of the product, which gives a better view of its newest features. The screenshot below shows a view of this tool.
Figure 3.7: Power BI Desktop
3.3.6 Power BI Website
A Power BI solution can be published to the Power BI website. On the website, the data source can be scheduled to refresh (depending on whether the source supports scheduled data refresh). Dashboards can be created for the reports and shared with others. The Power BI website even gives you the ability to slice and dice the data online without requiring any tool other than a simple web browser, and you can build reports and visualizations directly on the Power BI site as well.

3.3.7 Power Q&A
Power Q&A is a natural language engine for asking questions of your data model and getting answers. Once you have built your data model and deployed it to the Power BI website, you or your users can ask questions and get answers easily. There are tips and tricks about how to build your data model so that it can answer questions in the best way, which are covered in later chapters. Power Q&A works with Power View for the data visualizations, so users can simply ask a question such as "number of customers by country" and Power Q&A will answer it in a map view with numbers as bubbles.

3.3.8 Power BI Mobile Apps
There are mobile apps for the three main mobile OS providers: Android, Apple, and Windows Phone. These apps give you an interactive view of the dashboards and reports on the Power BI site, and you can share them directly from the mobile app. You can highlight part of a report, write a note on it and share it with others.
Figure 3.8: Power Q&A
Figure 3.9: Power BI Mobile Apps
3.3.9 Power BI Pricing
Power BI provides these premium services for free: you can create an account on the PowerBI.com website at no cost, and many components of Power BI can be used individually for free as well. You can download and install Power BI Desktop and the Power Query, Power Pivot, Power View, and Power Map add-ins, all for free. Some features of these products are reserved for the paid version, however, such as Power BI Pro, which provides additional capabilities.
CHAPTER 4
EXPERIMENTAL RESULTS

4.1 OVERVIEW
The retail analytics solution is developed to help companies stay ahead of shopper trends by applying customer analytics in retail to support smart business decision-making. To quickly grasp how the company is performing, a retail dashboard helps with strategic oversight and decisions. Data analytics can help in optimizing prices by monitoring competitive pricing and discount strategies. To ensure the right product mix, data analytics can be used to monitor the breadth and depth of competitors' catalogues, identify their newly launched products, and interpret and act on meaningful data insights, including in-store and online shopper patterns.

4.1.1 Problem Definition
Before starting a big data project, it is important to have clarity of purpose, which can be accomplished with the following steps.

4.1.2 Identify the Problem
Retailers need to have a clear idea of the problems they want to solve and understand the value of solving them. For instance, a clothing store may find that four of ten customers looking for blue jeans cannot find their size on the shelf, which results in lost sales. Using big data, the retailer's goal could be to reduce the frequency of out-of-stocks to less than one in ten customers, thus increasing the sales potential for jeans by over 50 percent (if only six in ten shoppers could previously buy and nine in ten now can, potential sales rise by a factor of 9/6, a 50 percent increase). Big data can be used to solve many problems, such as reducing spoilage, increasing margin, increasing transaction size or sales volume, improving new product launch success, or increasing customer dwell time in stores. For example, a popular medical doctor with a health-oriented television show recommended raspberry ketone pills as a weight reducer to his very large following on Twitter.
This caused a run on the product at health stores and ultimately led to empty shelves, which took a long time to restock since there was little inventory in the supply chain.

4.1.3 Identify All the Data Sources
Retailers collect a variety of data from point-of-sale (POS) systems, including the typical sales by product and sales by customer via loyalty cards. However, there can be less obvious sources of useful data that retailers can find by "walking the store", a process intended to provide a better understanding of a product's end-to-end life cycle. This exercise allows retailers to think about problems with a fresh set of eyes, avoid thinking of their data in siloed terms, and consider new opportunities that present themselves when big data is used to find correlations between existing and new data sources, such as:
Video: surveillance cameras and anonymous video analytics on digital signs.
Social media: Twitter, Facebook, blogs and product reviews.
Supply chain: orders, shipments, invoices, inventory, sales receipts, and payments.
Advertising: coupons in flyers and newspapers, and advertisements on TV and in-store signs.
Environment: weather, community events, seasons, and holidays.
Product movement: RFID tags and GPS.
The team queried social media feeds, compared them to historic store inventory levels, and found it was possible to give retail buyers an early warning of potentially rising product demand. In this case, it was important to find a reliable leading indicator that could be used to predict how much product should be ordered and when.

4.1.5 Determine Metrics That Need Improvement
After clearly defining a problem, it is critical to determine metrics that can accurately measure the effectiveness of the future big data solution. The recommendation is to make sure the metrics can be translated into a monetary value (e.g., income from increased sales, savings from reduced spoilage) and incorporated into a return on investment (ROI) calculation. The team focused on metrics related to reducing shrinkage and spoilage, and on increasing profitability by suggesting which products a store should pick for promotion.

4.1.6 Verify the Solution Is Workable
A workable big data solution hinges on presenting the findings to employees in a clear and acceptable way. As described previously, the solution provided retail buyers with timely product ordering recommendations displayed in an intuitive UI running on the devices used to enter orders.

4.2 CONSIDER BUSINESS PROCESS RE-ENGINEERING
By definition, the outcomes of a big data project can change the way people do their jobs. It is important to consider the costs and time of the business process re-engineering that may be needed to fully implement the big data solution.

4.2.1 Calculate an ROI
Retailers should calculate an ROI to ensure their big data project makes financial sense. At this point, the ROI is likely to include some assumptions and unknowns, but hopefully these will have only a second-order effect on the financials.
If the return is deemed too low, it may make sense to repeat the prior steps and look for other problems and opportunities to pursue. Moving forward, the following steps will add clarity to what can be accomplished and raise the confidence level of the ROI.

4.2.2 Data Preparation
Retail data used for big data analysis typically comes from multiple sources and different operational systems, and may have inconsistencies. In this project, transaction data was supplied by several systems, which truncated UPC codes or stored them differently due to special characters in the data field. More often than not, one of the first steps is to cleanse, enhance, and normalize the data. The data cleaning tasks performed were the following (a short scripted sketch follows the list):
Populate missing data elements
Scrub data for discrepancies
Make data categories uniform
Reorganize data fields
Normalize UPC data
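A minimal sketch of how these cleaning steps might be scripted, assuming pandas is available; the file names and column names are invented, since the real feeds and field layouts are not shown here.

    import pandas as pd

    # Read the raw POS extract, keeping UPCs as strings so leading zeros survive.
    sales = pd.read_csv("pos_extract.csv", dtype={"upc": str})

    # Populate missing data elements with sensible defaults.
    sales["quantity"] = sales["quantity"].fillna(0)
    sales["store_id"] = sales["store_id"].fillna("UNKNOWN")

    # Make data categories uniform and scrub obvious discrepancies.
    sales["category"] = sales["category"].str.strip().str.title()
    sales = sales[sales["quantity"] >= 0]

    # Normalize UPC data: drop non-digit characters and left-pad truncated codes.
    sales["upc"] = (sales["upc"].str.replace(r"\D", "", regex=True)
                                .str.zfill(12))

    sales.to_csv("pos_extract_clean.csv", index=False)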
4.3 DATA FLOW DIAGRAM
A data flow diagram is a way of representing system requirements in graphic form. A DFD, also known as a "bubble chart", has the purpose of clarifying system requirements and identifying the major transformations that will become programs in system design. It is therefore the starting point of the design phase, functionally decomposing the requirements specification down to the lowest level of detail. A DFD consists of a series of bubbles joined by lines: the bubbles represent data transformations and the lines represent data flows in the system.
4.3.1 DFD Symbols
In a DFD there are four symbols:
A square defines a source or destination of system data.
An arrow identifies data flow in motion; it is a pipeline through which information flows.
A circle or bubble represents a process that transforms incoming data flows into outgoing data flows.
An open rectangle is a data store: data at rest, or a temporary repository of data, used in constructing a DFD.

4.3.2 Rules in Drawing DFDs
Processes should be named and numbered for easy reference.
The direction of flow is from top to bottom and from left to right.
Data traditionally flow from source to destination, although they may flow back to a source.
When a process is exploded into lower levels, the sub-processes are numbered.
The names of data stores, sources, and destinations are written in capital letters; process and data flow names have the first letter of each word capitalized.
A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated. The DFD is designed to aid communication. The rule of thumb is to explode the DFD to a functional level, so that the next sublevel does not exceed ten processes. Beyond that, it is best to take each function separately and expand it to show the explosion of a single process. Data flow diagrams are one of the three essential perspectives of the structured systems analysis and design method (SSADM). With a data flow diagram, users are able to visualize how the system will operate, what the system will accomplish, and how the system will be implemented. It is common practice to show the interaction between the system and external agents, which act as data sources and data sinks.
Figure 4.1: Dataflow Diagram for Retail Analytics
4.4 MODULES
MARKETING AND DEVELOPMENT PLANS
SHARP, STRATEGIC FOCUS ON RETAIL BI
MERCHANDISING
STORES
E-COMMERCE
CUSTOMER RELATIONSHIP MANAGEMENT
4.4.1 Marketing and Development Plans
A marketing and development plan is a comprehensive document or blueprint that outlines a company's advertising and marketing efforts for the coming year. It describes the business activities involved in accomplishing specific marketing objectives within a set time frame. A marketing plan also includes a description of the current marketing position of a business, a discussion of the target market, and a description of the marketing mix that the business will use to achieve its marketing goals. A marketing and development plan has a formal structure, but can be used as a formal or informal document, which makes it very flexible. It contains some historical data, future predictions, and methods or strategies to achieve the marketing objectives.
Marketing and development plans start with the identification of customer needs through market research and of how the business can satisfy these needs while generating an acceptable level of return. This includes processes such as market situation analysis, action programs, budgets, sales forecasts, strategies, and projected financial statements. A marketing and development plan can also be described as a technique that helps a business decide on the best use of its resources to achieve corporate objectives. It can also contain a full analysis of the strengths and weaknesses of a company, its organization, and its products.
The marketing and development plan shows the steps or actions that will be used to achieve the plan's goals. For example, a marketing and development plan may include a strategy to increase the business's market share by fifteen percent; the plan would then outline the objectives that need to be achieved in order to reach that fifteen percent increase. The plan can be used to describe the methods of applying a company's marketing resources to fulfil marketing objectives. Marketing and development planning segments the markets, identifies the market position, forecasts the market size, and plans a viable market share within each market segment.
Marketing and development planning can also be used to prepare a detailed case for introducing a new product, revamping current marketing strategies for an existing product, or putting together a company marketing plan to be included in the company's corporate or business plan.
4.4.2 Sharp, Strategic Focus on Retail
Given the enormous growth of data, retailers are suffering from their inability to effectively exploit their data assets. As a result, retailers must now capitalize on the capabilities of business intelligence (BI) software, particularly in areas such as customer intelligence, performance management, financial analysis, fraud detection, risk management, and compliance. Furthermore, retailers need to develop their real-time BI delivery ability in order to respond faster to business issues. This brief argues the following statements:
• BI must be fully aligned with business processes.
• The ability to respond faster to customers, to regulators, or to management generates the need for real-time automation.
4.4.3 Merchandising
Merchandising is the activity of promoting the sale of goods at retail. Merchandising activities may include display techniques, free samples, on-the-spot demonstrations, pricing, shelf talkers, special offers, and other point-of-sale methods. According to the American Marketing Association, merchandising encompasses "planning involved in marketing the right merchandise or service at the right place, at the right time, in the right quantities and at the right price." Retail merchandising refers to the various activities which contribute to the sale of products to consumers for their end use. Every retail store has its own line of merchandise to offer to customers. The display of the merchandise plays an important role in attracting customers into the store and prompting them to purchase as well.
Merchandising helps in the attractive display of the products at the store in order to increase their sale and generate revenues for the retail store.
Merchandising helps in the sensible presentation of the products available for sale to entice customers and make them brand loyalists.
Promotional Merchandising
The ways the products are displayed and stocked on the shelves play an important role in influencing the buying behaviour of the individuals.
4.4.4 Stores
Retail in-store analytics is vital for providing a near-real-time view of key in-store metrics, collecting footfall statistics using people-counting solutions (video-based, thermal-based, or Wi-Fi-based). This information is fed into in-store analytics tools for measuring sales conversion by correlating traffic inflow with the number and value of transactions. The deep insights from such near-real-time views of traffic, sales, and engagement help in planning store workforce needs based on traffic patterns. Reliable in-store data and analytics solutions enable better comparison of store performance, an optimized campaign strategy, and improved assortment and merchandising decisions.
Happiest Minds In-Store Analytics Solution
Challenges related to change management, technology implementation, workforce management, and backend operations discourage players from leveraging smart in-store analytics solutions in physical stores. Happiest Minds addresses these challenges. Its in-store analytics solution, available on cloud and on-premise, leverages digital technologies such as mobile, beacon/location tracking devices, video-based people counting, and analytics to monitor and improve in-store efficiency, increase employee productivity, and increase sales. The in-store conversion dashboard, with multi-dimensional drill-down views, provides location and time views.
4.4.5 E-Commerce
E-commerce is growing day by day in both B2B and B2C contexts. The retailing industry, including fashion retail and grocery retailing, has caught on to the bandwagon and begun to offer e-trading or online shopping. In the early 1990s, companies set up websites with very little understanding of e-commerce and consumer behaviour. E-commerce as a model is totally different from traditional shopping in every respect. Companies have quickly realised the need to have a separate e-commerce strategy as part of the overall retail strategy. Retail strategy involves planning for business growth keeping in view current market trends, opportunities, and threats, and building a strategic plan that helps the company deal with these external factors and stay on course to reach its goals. Further, the retail business strategy is concerned with identifying the markets to be in, and building the product portfolio and bandwidth, coupled with brand positioning and the various elements of brand visibility, in-store promotions, and so on. Business operations are more or less standard, proven models that are adopted as best practices.
4.4.6 Customer Relationship Management
A sound and well-rounded customer relationship management system is an important element in maintaining one's business in the retail marketing industry. Customer relationship management is not only a business strategy but also a powerful tool to connect retail companies with their consumers. Developing this bond is essential in driving the business to the next level of success.
4.4.7 Retail Marketing Landscape
Today's retail marketing landscape is changing, and retail industry organizations struggle to achieve or maintain good marketing communications with existing consumers as well as prospective customers. Unlike the shotgun approach of past years, most retail companies now aim at specific targets. Tracking a particular goal, as opposed to random, scattershot pursuits, allows these companies to channel their marketing efforts to obtain the highest possible return on investment.
There is a need to identify consumer-related issues, better understand customers, and meet their needs with the company's products and services. By making accurate estimates of product or service demand in a given consumer market, one can formulate support and development strategies accordingly.
Integrating a Customer Relationship Management System in Retail Marketing
A retail business sells goods or services, strives to attract more consumers through marketing and advertising, and seeks customer feedback. And these are just among the many things that a retail business has to juggle: product supply, finance, operations, membership databases, etc. With a customer relationship management integration (CRM integration) system in place, managers and supervisors of retail businesses can set goals, implement processes, and measure and achieve them in a more efficient manner.
A CRM integration system can combine several systems to allow a single view. Data can be integrated from consumer lifestyle, expenditure, and brand choice. If the CRM system is implemented to track marketing strategies across products and services, it can provide a scientific, data-based approach to marketing and advertising analysis. The CRM integration system improves the overall efficiency of marketing campaigns since it allows retail companies to specifically target the right group of consumers. The right message, at the right time, to the right people delivers a positive marketing response from consumers and translates into more sales. Overall, this system provides a clear picture of the consumer segments, allowing retail companies to develop suitable business strategies, formulate appropriate marketing plans for their products or services, and anticipate changes in the business landscape.
4.5 ALGORITHM DEFINITION PROCESS
4.5.1 Footfall
The number of people visiting a shop or a chain of shops in a period of time is called its footfall. Footfall is an important indicator of how successfully a company's marketing, brand, and format are bringing people into its shops. Footfall is an indicator of the reach a retailer has, but footfall needs to be converted into sales, and this is not guaranteed to happen. Many retailers have struggled to turn high footfall into sales.
Trends in footfall do tell investors something useful. They may be an indicator of growth and help investors to understand why a retailer's sales growth (or decline) is happening. Investors may want to know whether sales growth is due to an increase in the number of people entering the shops (footfall) or more success at turning visitors into buyers (which can be seen by comparing footfall to the number of transactions). Sales growth may also come from selling more items to each buyer (compare the number of transactions to sales volumes), selling more expensive items (an improvement in the sales mix), or increasing prices. Which of these numbers is disclosed varies from company to company; investors should look at whatever is available.
Calculation based on footfall:
Customer conversion ratio = No. of transactions / Customer traffic x 100
4.5.2 Conversion Rate
A conversion is any action that you define. It could be a purchase, an old-fashioned phone call, a contact form submission, a newsletter signup, a social share, a specified length of time a visitor spends on a web page, playing a video, a download, etc. Many small businesses only have a gut sense of where their new business comes from because they have not been tracking conversions. Knowing your conversion rate(s) is a first step in understanding how your sales funnel is performing and what marketing avenues are giving the greatest return on investment (ROI).
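To make the footfall and customer conversion ratio calculations above concrete, here is a minimal HiveQL sketch, assuming the visitors and placedorder tables used in the queries of the appendix; the month value 3 is an illustrative filter, not part of the original design:

-- Footfall for a given month: count of store visits recorded in the visitors table.
SELECT count(*) AS footfall
FROM visitors
WHERE month(datecreated) = 3;

-- Customer conversion ratio = number of transactions / customer traffic x 100.
SELECT t.transactions / f.footfall * 100 AS conversion_ratio
FROM (SELECT count(*) AS transactions
      FROM placedorder
      WHERE month(transactiondate) = 3) t
CROSS JOIN (SELECT count(*) AS footfall
            FROM visitors
            WHERE month(datecreated) = 3) f;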
Once you have defined what conversions you want to track, you can calculate the conversion rate. For the purposes of the following example, let's call a conversion a sale. Even if you do not yet have a viable website, as long as you are tracking the number of leads you get and the number of resulting sales (conversions), you can calculate your conversion rate like so:
Conversion Rate = Total Number of Sales / Number of Leads x 100
Example: if you made 20 sales last year and you had 100 inquiries/leads, your sales-to-lead conversion rate would be 20%. If you are tracking conversions from website visitors, the formula is:
Conversion Rate = Total Number of Sales / Number of Unique Visitors x 100
4.5.3 Visits Frequency
Visits frequency is the total number of visits divided by the total number of visitors during the same timeframe:
Visits / Visitors = Average Visits per Visitor
Sophisticated users may also want to calculate average visits per visitor for different visitor segments. This can be especially valuable when examining the activity of new versus returning visitors or, for online retailers, customers versus non-customers.
Presentation: the challenge with presenting average visits per visitor is the need to examine an appropriate timeframe for this KPI to make sense. Depending on the business model, it may be daily or it may be annual. Search engines like Google or Yahoo can easily justify examining this average on a daily, weekly, and monthly basis, whereas marketing sites that support very long sales cycles waste their time with any granularity finer than monthly. Consider changing the name of the indicator when presenting it, to reflect the timeframe under examination, e.g., "Average Daily Visits per Visitor" or "Average Monthly Visits per Visitor".
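A minimal HiveQL sketch of average monthly visits per visitor, again assuming the visitors table used in the appendix queries (one row per recorded visit); the month filter is illustrative:

-- Average visits per visitor = total visits / distinct visitors in the period.
SELECT count(*) / count(DISTINCT customerid) AS avg_monthly_visits_per_visitor
FROM visitors
WHERE month(datecreated) = 3;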
Expectation: expectations for average visits per visitor vary widely by business model. Retail sites selling high-consideration items will ideally have a low average number of visits, indicating low barriers to purchase; sites selling low-consideration items will ideally have a high average number of visits, ideally indicating numerous repeat purchases. Online retailers are advised to segment this KPI by customers and non-customers, as well as new versus returning visitors regardless of customer status. Advertising and marketing sites will ideally have high average visits per visitor, a strong indication of loyalty and interest. Customer support sites will ideally have low average visits per visitor, suggesting either high satisfaction with the products being supported or easy resolution of problems. Support sites with a high frequency of visits per visitor should closely examine average page views per visit, average time spent on site, and call centre volumes, especially if the KPI is increasing (i.e., getting worse).
4.5.4 Repeat Customer Percent
The repeat customer rate is calculated by dividing your repeat customers by your total paying customers. Every store has two types of customers: new customers and repeat customers. Knowing your repeat customer rate will show you what percentage of customers are coming back to your store to shop again. This is an important factor in customer lifetime value. Customer acquisition costs (CAC) are among the highest expenses in the e-commerce world. When you pay buckets of marketing money to Google AdWords, Bing Ads, Facebook Ads, and Pinterest, you want to ensure you are effectively encouraging customers to shop more than once.
Repeat customer % = Repeat customers / Total no. of customers x 100
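A minimal HiveQL sketch of the repeat customer percentage, assuming the placedorder table from the appendix queries (one row per order):

-- Repeat customer rate = customers with more than one order / all paying customers x 100.
SELECT sum(CASE WHEN orders > 1 THEN 1 ELSE 0 END) / count(*) * 100 AS repeat_customer_pct
FROM (SELECT customerid, count(*) AS orders
      FROM placedorder
      GROUP BY customerid) c;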
4.5.5 Calculation of Top and Bottom 5 Revenue Generators (Product/Store/Customer)
The top and bottom 5 revenue generators for product/store/customer give the demand for a product with respect to stores and customers. With this information we can also identify the buying likelihood of customers and get a clearer idea of the stock to be purchased. We can also increase the availability of a product if we know the frequently purchased products, and profit will be higher if we reduce the stock of the products generating the least revenue. A sample calculated-dimension expression for the revenue generators is:
=aggr(if(rank(sum(Revenue))<=5 or rank(-sum(Revenue))<=5, Customer), Customer)
4.5.6 Average Basket Size
On average, a customer with a basket will buy two items more on any given visit than a customer left to fumble with two or three purchases, other shopping bags, and a handbag or purse. This is no small matter as store managers attempt to maximise "basket size", or "average transaction value" as it is known in the colourful language of financial analysts.
Average basket size = Turnover ÷ Number of people passing through the cash desk
It is one of the merchandising ratios. It measures the average amount spent by a customer at a point of sale on a visit. Average basket size also refers to the number of items sold in a single purchase, equivalent to total units sold ÷ number of invoices.
4.5.7 Average Ticket Size
The "average ticket" is the average dollar amount of a final sale, including all items purchased, of a merchant's typical sales. Processors use the average ticket to better understand the merchant's business and to determine how much risk the merchant poses to the processor.
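A minimal HiveQL sketch of average basket size and average ticket size; it assumes the order details table (Table 5) is named orderdetails in Hive, and reuses the placedorder table from the appendix queries, so the table names are assumptions for illustration:

-- Average basket size = total units sold / number of invoices (orders).
SELECT sum(quantity) / count(DISTINCT orderid) AS avg_basket_size
FROM orderdetails;

-- Average ticket size = total order value / number of orders.
SELECT sum(totalorderamount) / count(*) AS avg_ticket_size
FROM placedorder;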
Many merchants find it difficult to determine the size of their average ticket due to widely varying sale amounts, or because the business is new. Often, processors only need an estimate and not an exact figure.
A deal ticket records all the terms, conditions, and basic information of a trade agreement. It is created after the transaction of shares, futures contracts, or other derivatives, and is also referred to as a "trading ticket". A deal ticket is similar to a trading receipt: it tracks the price, volume, names, and dates of a transaction, along with all other important information. Companies use deal tickets as part of an internal control system, allowing them organized access to the transaction history. Deal tickets can be kept in either electronic or physical form.
4.5.8 Cost of Goods Sold (COGS)
Cost of goods sold refers to the carrying value of goods sold during a particular period. Costs are associated with particular goods using one of several formulas, including specific identification, first-in first-out (FIFO), or average cost. Costs include all costs of purchase, costs of conversion, and other costs incurred in bringing the inventories to their present location and condition. The costs of goods made by the business include material, labour, and allocated overhead. The costs of goods not yet sold are deferred as costs of inventory until the inventory is sold or written down in value.
4.5.9 Gross Margin Return on Investment
GMROI demonstrates whether a retailer is able to make a profit on its inventory. GMROI is calculated by dividing gross margin by the average inventory cost, keeping in mind what gross margin is: the net sale of goods minus the cost of goods sold. Equivalently, divide sales by the average cost of inventory and multiply by the gross margin percentage. Retailers need to be well aware of the GMROI on their merchandise, because it lets them determine how much they are earning for every dollar they invest in inventory.
GMROI = Gross margin / Average inventory cost
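A minimal HiveQL sketch of the GMROI formula above; the sales_fact and inventory_snapshot tables and their columns are hypothetical, used only to show where the figures would come from:

-- GMROI = gross margin / average inventory cost,
-- where gross margin = net sales - cost of goods sold.
SELECT (s.net_sales - s.cogs) / i.avg_inventory_cost AS gmroi
FROM (SELECT sum(sale_amount) AS net_sales,
             sum(cost_amount) AS cogs
      FROM sales_fact) s
CROSS JOIN (SELECT avg(stock_value) AS avg_inventory_cost
            FROM inventory_snapshot) i;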
4.5.10 Sell-Through Percentage
The sell-through rate is a calculation, commonly represented as a percentage, comparing the amount of inventory a retailer receives from a manufacturer or supplier against what is actually sold to the customer. The period examined (usually one month) is useful when comparing the sales of one product or style against another, or, more importantly, when comparing the sell-through of a specific product from one month to another to examine trends. So, in your store, if you bought 100 chairs and after 30 days had sold 20 chairs (meaning you had 80 chairs left in inventory), then your sell-through rate would be 20 percent. Using your beginning-of-month (BOM) inventory, you divide your sales by that BOM. It is calculated this way:
Sell-through = Sales / Stock on hand (BOM) x 100 (to convert to a percentage)
Or, in our example, (20 / 100) x 100 = 20 percent.
Sell-through is a healthy way to assess whether your investment is returning well for you. For example, a sell-through rate of 5 percent might mean you either have too many on hand (you are overbought) or have priced too high. In comparison, a sell-through rate of 80 percent might mean you have too little inventory (underbought) or have priced too low. Ultimately, the analysis of the sell-through rate is based on what you want from the merchandise.
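A minimal HiveQL sketch of sell-through per item; orderdetails again stands in for Table 5, while bom_inventory (itemid, bom_stock) is a hypothetical beginning-of-month stock snapshot, so both the second table and its columns are assumptions:

-- Sell-through % per item = units sold in the period / BOM stock on hand x 100.
-- max(bom_stock) is used because the BOM figure is constant per item after the join.
SELECT d.itemid,
       sum(d.quantity) / max(b.bom_stock) * 100 AS sell_through_pct
FROM orderdetails d
JOIN bom_inventory b ON d.itemid = b.itemid
GROUP BY d.itemid;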
4.6 HIVE TABLE STRUCTURE

Table 1: Customer Address
Field Name            Field Type
Customerid            Int
City                  String
State                 String
Zip                   Int

Table 2: Customer_Details
Field Name            Field Type
Customer_accountid    String
Account_opened        Timestamp

Table 3: Customer
Field Name            Field Type
Customerid            Int
FristName             String
LastName              String
Datacreated           Timestamp

Table 4: Itemdetails
Field Name            Field Type
Itemid                String
Itemtypeid            String
Mfgname               String
Createddate           Timestamp

Table 5: Order Details
Field Name            Field Type
Orderdetailsid        String
Orderid               String
Itemid                String
Quantity              Int
Unitprice             Int
Shipmethodid          String

Table 6: Orders
Field Name            Field Type
Orderid               String
Customerid            String
Transactiondate       Timestamp
Totalorderamount      Int
Taxamoutn             Int
Frightamount          Int
Trdisamount           Int
Subtotal              Int
Paymenttypeid         String
Ordersource           String
Cancelreturn          String
Datecreated           Timestamp
Storeid               String

Table 7: Payment_type
Field Name            Field Type
Itemtypeid            String
Paymenttype           String
Createduser           String
Datecreated           Timestamp
Modifieduser          String
Datemodified          Timestamp

Table 8: Shipment_type
Field Name            Field Type
Shipmethodid          String
Shipmentmethod        String
Createduser           String
Modifieduser          String
Datemodified          Timestamp

Table 9: Store-Details
Field Name            Field Type
Id                    Int
Name                  String
Region                String
City                  String
Country               String
State                 String
Zip                   String

Table 10: Items
Field Name            Field Type
Itemtype              String
Typename              String
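As a brief illustration of how one of the layouts above might be declared in Hive, here is a minimal sketch for the order details table (Table 5); the Hive table name orderdetails, the delimiter, and the storage format are illustrative choices, not part of the original design:

-- Column names follow Table 5; file format and delimiter are assumptions.
CREATE TABLE IF NOT EXISTS orderdetails (
  orderdetailsid STRING,
  orderid        STRING,
  itemid         STRING,
  quantity       INT,
  unitprice      INT,
  shipmethodid   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;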
4.7 INPUT DESIGN
Once the output requirements have been finalized, the next step is to find out what inputs are needed to produce the desired outputs. Inaccurate input data results in errors in data processing; errors entered by a data entry operator can be controlled by input design. Input design is the process of converting user-originated inputs to a computer-based format. The objectives of input design should focus on:
Controlling amount of input
Avoiding delay
Avoiding errors in data
Avoiding extra steps
Keeping the process simple
Input is considered the process of keying data into the system, which is then converted to the system format. People all over the world, belonging to different cultures and geographies, will use a web site, so the input screens given in the site should be flexible and fast to use. With the highly competitive environment existing today in web-based businesses, the success of the site depends on the number of users logging on to the site and transacting with the company. A smooth, easy-to-use site interface and flexible data entry screens are a must for the success of the site. The easy-to-use hyperlinks of the site help in navigating between different pages of the site in a faster way. A document should be concise, because longer documents contain more data, take longer to enter, and have a greater chance of data entry errors. The more quickly an error is detected, the closer the error is to the person who generated it, and so the more easily it is corrected. A data input specification is a detailed description of the individual fields (data elements) on an input document, together with their characteristics. Error messages should be specific and precise, not general, ambiguous, or vague.
4.8 OUTPUT DESIGN
The outputs from computer systems are required mainly to communicate the results of processing to users. They are also used to provide a permanent ("hard") copy of these results for later consultation. Output is what the client is buying when he or she pays for a development project; inputs, databases, and processes exist to provide output. Printouts should be designed around the output requirements of the user. The output devices are chosen keeping in mind factors such as compatibility of the device with the system, response time requirements, expected print quality, and the number of copies needed. The output to be produced also depends on the following factors:
Type of user and purpose
Contents of output
Format of the output
Frequency and timing of output
Volume of output
Sequence
Quality
The success or failure of software is decided by the integrity and correctness of the output produced by the system. One of the main objectives behind the automation of business systems is the fast and prompt generation of reports in a short time period. In today's competitive world of business, it is very important for companies to keep themselves up to date about the happenings in the business. Prompt and reliable reports are considered to be the lifeline of every business today; at the same time, wrong reports can shatter the business itself and create huge and irreparable losses. So the outputs and reports generated by the software system are of paramount importance.
4.9 IMPLEMENTATION PLANNING
This section describes the implementation of the application and the details of how to access this control from any application. Implementation is the process of assuring that the information system is operational and then allowing users to take over its operation for use and evaluation. Implementation includes the following activities.
Obtaining and installing the system hardware.
Installing the system and making it run on its intended hardware.
Providing user access to the system.
Creating and updating the database.
Documenting the system for its users and for those who will be responsible for maintaining it in the future.
Making arrangements to support the users as the system is used.
Transferring ongoing responsibility for the system from its developers to the operations or maintenance team.
Evaluating the operation and use of the system.
4.10 IMPLEMENTATION PHASE IN THIS PROJECT
The new system has been implemented and integrated with the already existing hardware. The database was put into Microsoft SQL Server and is accessible through the Internet from any geographic location. Documentation is provided in such a way that it is useful for users and maintainers.
4.11 MAINTENANCE
Maintenance is any work done to change the system after it is operational. The term maintenance describes activities that occur following the delivery of the product to the customer. The maintenance phase of the software life cycle is the time period in which a software product performs useful work. Maintenance activities involve making enhancements to products, adapting products to new environments, and correcting problems. An integral part of software is maintenance, which requires an accurate maintenance plan to be prepared during software development. It should specify how users will request modifications or report problems, and the budget should include resource and cost estimates. A decision should be made for the development of every new system feature and its quality objectives. Software maintenance, which can last for 5-6 years (or even decades) after the development process, calls for an effective plan which can address the scope of software maintenance, the tailoring of the post-delivery/deployment process, the designation of who will provide maintenance, and an estimate of the life-cycle costs. The selection and proper enforcement of standards is a challenging task right from the early stages of software engineering, and has not been given due importance by the concerned stakeholders. In this project, data is retrieved from the database by searching it; for maintaining the data, the project has a backup facility so that an additional copy of the data is kept. Moreover, the project can write the annual data to a CD, which could be used for later reference.
CHAPTER 5
CONCLUSION
In this research work, the goal is to propose an automatic and intelligent system able to analyze data on inventory levels, supply chain movement, consumer demands, sales, etc., that are crucial for making marketing and procurement decisions. It will also help the organization to analyze the data against various factors and to slice and dice the data by region, time (year, quarter, month, week and day), product (product category, product type, model and product), and so on. The analytics on demand and supply data can be used for maintaining procurement levels and for taking marketing decisions. Retail analytics gives detailed customer insights along with insights into the business and processes of the organization, with scope and need for improvement.
CHAPTER 6
APPENDIX
SAMPLE SOURCE CODE
UDF FOR TEXT AND SPECIAL CHARACTER TRIMMING IN HIVE

package org.hardik.letsdobigdata;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Strip extends UDF {

    private Text result = new Text();

    // Strip the given characters from both ends of the input string.
    public Text evaluate(Text str, String stripChars) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString(), stripChars));
        return result;
    }

    // Strip surrounding whitespace from the input string.
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        result.set(StringUtils.strip(str.toString()));
        return result;
    }
}
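A short usage sketch for the Strip UDF above, run from the Hive shell; the jar path is an illustrative assumption (Hive 0.13 or later allows SELECT without FROM):

-- Register the UDF and call it (jar path and function alias are illustrative).
ADD JAR /tmp/strip-udf.jar;
CREATE TEMPORARY FUNCTION strip AS 'org.hardik.letsdobigdata.Strip';

SELECT strip('  hadoop  ');            -- trims surrounding whitespace
SELECT strip('hadoopabc', 'abc');      -- trims the listed characters from both ends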
UDAF FOR MEAN CALCULATION

package org.hardik.letsdobigdata;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.hardik.letsdobigdata.MeanUDAF.MeanUDAFEvaluator.Column;

@Description(name = "Mean", value = "_FUNC(double) - computes mean",
        extended = "select col1, MeanFunc(value) from table group by col1;")
public class MeanUDAF extends UDAF {

    // Define logging
    static final Log LOG = LogFactory.getLog(MeanUDAF.class.getName());

    public static class MeanUDAFEvaluator implements UDAFEvaluator {

        /** Use the Column class to serialize intermediate computation.
            This is our group-by column. */
        public static class Column {
            double sum = 0;
            int count = 0;
        }

        private Column col = null;

        public MeanUDAFEvaluator() {
            super();
            init();
        }

        // A - Initialize evaluator, indicating that no values have been aggregated yet.
        public void init() {
            LOG.debug("Initialize evaluator");
            col = new Column();
        }

        // B - Iterate every time there is a new value to be aggregated.
        public boolean iterate(double value) throws HiveException {
            LOG.debug("Iterating over each value for aggregation");
            if (col == null)
                throw new HiveException("Item is not initialized");
            col.sum = col.sum + value;
            col.count = col.count + 1;
            return true;
        }

        // C - Called when Hive wants partially aggregated results.
        public Column terminatePartial() {
            LOG.debug("Return partially aggregated results");
            return col;
        }

        // D - Called when Hive decides to combine one partial aggregation with another.
        public boolean merge(Column other) {
            LOG.debug("Merging by combining partial aggregation");
            if (other == null) {
                return true;
            }
            col.sum += other.sum;
            col.count += other.count;
            return true;
        }

        // E - Called when the final result of the aggregation is needed.
        public double terminate() {
            LOG.debug("At the end of last record of the group - returning final result");
            return col.sum / col.count;
        }
    }
}
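A short usage sketch for the MeanUDAF above, following the example given in its @Description annotation; the jar path, table name, and column names are illustrative assumptions:

-- Register the UDAF and call it per group (jar path and table are illustrative).
ADD JAR /tmp/mean-udaf.jar;
CREATE TEMPORARY FUNCTION MeanFunc AS 'org.hardik.letsdobigdata.MeanUDAF';

SELECT col1, MeanFunc(value) FROM some_table GROUP BY col1;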
HIVE QUERIES FOR ANALYTICS

1. Select count(*) from visitor where month(created) = 'month';
Counts the number of visitors in the shop in a particular interval of time. Footfall means counting the number of persons in the shopping mall/retail shop.

2. Select customerid, sum(totalorderamount) from placedorder group by customerid order by 2 desc limit 5;
This query is used to get the greatest revenue-producing product/customer/store.
Select customerid, sum(totalorderamount) from placedorder group by customerid order by 2 asc limit 5;
This query is used to get the least revenue-producing product/customer/store.

3. Select customerid, count(itemid) from placedorder join deliveredorder on placedorder.orderid = deliveredorder.orderid group by customerid;
This query is used to calculate the average basket value, i.e., the count of items purchased by each customer.

4. Select sum(amount)/count(orderid) from placedorder;
This query is used to get the average ticket size from the details of the customers' orders.

5. Select storeid, customerid, count(1) from visitors group by storeid, customerid having count(1) > 1;
Identifies repeat (regular) customers per store, used for the regular customer percentage.

6. Select customerid, count(*) from visitors where month(datecreated) = 'month' group by customerid;
Number of times a customer visits the shop in a month.
SCREENSHOTS
Figure A.6.1: UDF for Text splitting in Retail Analytics
Figure A.6.2: UDAF for Retail Analytics(Finding the mean)
Figure A.6.3: Sqoop transformation data
Figure A.6.4: Finding FootFall
Figure A.6.5: Finding conversion rate
Figure A.6.6: Finding basket size
Figure A.6.7: Frequency visitors
Figure A.6.8: Finding the 5 least revenue-producing products
Figure A.6.9: Finding Top revenue producing product
Figure A.6.10: Repeated customer percentage
Figure A.6.11: Creating the Integrating Service
Figure A.6.12: Connecting Hive with PowerBI
CHAPTER 7
REFERENCES
1. Apache Hadoop Project. http://hadoop.apache.org
2. Apache Hive Project. http://hadoop.apache.org/hive
3. Cloudera Distribution Including Apache Hadoop (CDH). http://www.cloudera.com
4. Greenplum Database. http://www.greenplum.com
5. GridMix Benchmark. http://hadoop.apache.org/docs/mapreduce/current/gridmix.html
6. Oracle Database, Oracle Corporation. http://www.oracle.com
7. PigMix Benchmark. https://cwiki.apache.org/confluence/display/PIG/PigMix
8. Teradata Database, Teradata Inc. http://www.teradata.com
9. TwinFin, Netezza Inc. http://www.netezza.com/
10. TPC Benchmark DS, 2012.
PUBLICATIONS
Mrs. A. Sangeetha, MCA, (M.Phil. CS), and Mrs. K. Akilandeswari, MCA, M.Phil., Assistant Professor, "Survey on Data Democratization using Big Data", paper published in the International Journal of Science and Research.