Practice+use+case

  • October 2019
  • PDF

This document was uploaded by user and they confirmed that they have the permission to share it. If you are author or own the copyright of this book, please report to us by using this DMCA report form. Report DMCA


Overview

Download & View Practice+use+case as PDF for free.

More details

  • Words: 709
  • Pages: 3
Practice Use Case In this Chapter you will practice applying what you learned in the previous lessons in the context of a use case working with customer data. Before you begin working on this lesson, read the following background information about the Customers dataset and use case. Once you have completed the use case, you can find the steps we took and the results produced with in the following pages. This example is meant to test what you have learned thus far. When you feel comfortable with this use case example, we recommend you challenge your wrangling skills by taking our Certification Exam. Before getting started, download the following datasets: • Use Case Data

Example

You work for a technical services company that sells three monthly subscription products: • • •

Silver (price: $9.99/month) Gold (price: $14.99/month) Platinum (price: $29.99/month)

The company occasionally offers promotional discounts – therefore, some product prices may be slightly lower than those listed above. Your overall goal is to provide an analysis of sales activity by zip code. In order to do this, you’ll need to join your Lab_customers dataset (where zip code information resides) with sales data from your transactions datasets. Once you’ve joined the data, you’ll need to aggregate the results.

Create a new flow

Add Following Datasets: • • • • • • • •

lab_customers.csv lab_2010_transactions.csv lab_2011_transactions.csv lab_2012_transactions.csv lab_2013_transactions.csv lab_2014_transactions.csv lab_2015_transactions.csv zip_to_state_map.csv

Add recipe to Lab_Customers

Add a Recipe to the Lab_Customers dataset. Edit the Recipe to enter the Transformer Grid

Filter out faulty data

Remove the contacts who are not actual customers: We identify actual customers by the start_date column.

Provide a default end_date value for active customers

For those current customers without an end date, set the default end date to January 01, 2050.

Fix the start_date data type mismatches

Notice that the start_date column contains in two different formats. You want all of the dates in this column to conform to the format ‘yyyy/MM/dd’.

Lab_Transactions

Add a Recipe to the Lab_2010_transactions dataset. Edit that Recipe to enter the grid

Remove dummy data

Since this company sells online subscription products, the website sends a test transaction of $0.01 to validate that a customer’s credit card is valid. Test transactions are recorded in the database and appear in the transactions file. You need to remove these transactions so they don’t distort your analysis.

Clean up Discount Column

Remove the % from the Discount column, then convert the column to a decimal. Fill in the missing values with 0.

Standardize product names

After 2011, the company renamed the subscription products. To maintain consistency between the transactions files from each year, you need to replace the old product names with the new product names. • •

Change ‘basic’ to ‘silver’ Change ‘deluxe’ to ‘gold’

Determine the real price for each product

The real price for each product is a combination of the list price and the discount. Create a new column that contains the actual charge for each transaction.

Union transactions datasets

Union 2010 – 2015 datasets. We want to apply the above transforms to all data, make sure your recipe step is added to the beginning of your recipe, so that all the steps in your recipe are applied to all of the unioned datasets. • • • •

2011_Transactions 2012_Transactions 2013_Transactions 2014_Transactions



2015_Transactions

Join with the Customers dataset

Join transaction data with the customer data, keep the following columns. • • • • • • • • •

Customer_id (2010_transactions) transaction_date (2010 Transactions) adjusted_price (2010 Transactions) product (2010 Transactions) address_state (Customers) address_zip (Customers) region (Customers) start_date (Customers) end_date (Customers)

Lookup state by zip

Scroll to the address_state column and notice that there are some blank records. Perform a lookup on the address_zip column to zip_to_state_map.csv. This Lookup file contains a state mapping for each zip code. Rename column3 to State and delete Address_state

Filter transactions based on date

Only keep transactions that have occurred during the course of a customer’s active subscription (i.e. the transaction_date is before the end_date

Create a new table with aggregated results

Create a table that lists the count, sum, min and max of adjusted price, grouped by address zip.

Run the job at scale

Run job and view job results.

Solution Solution Guide