Case Study 1: The Look e-Commerce

RevoU DAMC

Introduction

Case Study Instructions:

Look at this data and start thinking. List down 3 trends/points that you think need to be shown:

bigquery-public-data.thelook_ecommerce.orders

bigquery-public-data.thelook_ecommerce.order_items

bigquery-public-data.thelook_ecommerce.users

bigquery-public-data.thelook_ecommerce.products

From here, try to explore the data and make changes, filter, and prepare the data that you need.

Create some visualizations or dashboard with the best type of chart that you have learned. You could use any tools as you wish.

Then, make 1-2 slides from the graphs with the insights you got to present your findings to the stakeholders

1. Ask

1.1 Business Objectives

How is our users' demographic?
How is our orders and sales performance?
What products are most desirable and profitable?

1.2 Business Tasks

List down 3 trends/points and create some visualizations or dashboard with the best type of chart that you have learned.

1.3 Deliverables

A clear statement of the business task
A description of all data sources used
Documentation of any cleaning or manipulation of data
A summary of your analysis
Supporting visualizations and key findings
Your top three recommendations based on your analysis

1.4 Case Study Roadmap

Study Case Roadmap - Ask
Guidance Question: What problem you're trying to solve? Create a dashboard about users' demographic and business performance covering orders, sales, and products. How's your knowledge can make a business decision? All acquired knowledge in further analysis will be shown as a finding report.
Key Tasks: Identified business task(s) Considered the Stakeholder(s)
Result A clear statement about the business task

2. Prepare

2.1 Work Steps

The data required available in BigQuery Public Data marketplace. This service is included in free Google BigQuery account, so this can be accessed directly in BigQuery without other preparing methods. The Look e-Commerce is a fictitious company developed by Google Looker team. Furthermore, the contents of this dataset are fully synthetic.

Dataset contains detail information about customers, products, orders, logistics, web events, and digital marketing campaigns. Speaking of credibility, since this is a primary source data, the data consider as credible. As mentioned before, this dataset contains detail information about customers such as name, age, gender, and address; privacy aspect must be regulated more strictly. In conclusion, this dataset is well-formatted.

Details on data type for each table shown below:

**Orders Table**
#	Field	Data Type
1.	order_id	INT
2.	user_id	INT
3.	status	STRING
4.	gender	STRING
5.	created_at	DATETIME
6.	returned_at	DATETIME
7.	shipped_at	DATETIME
8.	delivered_at	DATETIME
9.	num_of_item	INT

**Order Items Table**
#	Field	Data Type
1.	id	INT
2.	order_id	INT
3.	user_id	INT
4.	product_id	INT
5.	inventory_item_id	INT
6.	status	STRING
7.	created_at	DATETIME
8.	shipped_at	DATETIME
9.	delivered_at	DATETIME
10.	returned_at	DATETIME
11.	sale_of_price	FLOAT

**Users Table**
#	Field	Data Type
1.	id	INT
2.	first_name	STRING
3.	last_name	STRING
4.	email	STRING
5.	age	INT
6.	gender	STRING
7.	state	STRING
8.	street_address	STRING
9.	postal_code	STRING
10.	city	STRING
11.	country	STRING
12.	latitude	FLOAT
13.	longitude	FLOAT
14.	traffic_source	STRING
15.	created_at	DATE

**Products Table**
#	Field	Data Type
1.	id	INT
2.	cost	INT
3.	category	STRING
4.	name	STRING
5.	brand	STRING
6.	retail_price	FLOAT
7.	department	STRING
8.	sku	STRING
9.	distribution_center_id	INT

2.2 Case Study Roadmap

Study Case Roadmap - Prepare
Guidance Question: Where is your data located? Dataset is available in BigQuery Public Data marketplace. How is the data organized? The data is stored as dataset and hosted in BigQuery. Are there issues with bias or credibility in this data? Does your data ROCCC? There is no bias since this is a primary data, therefore it is credible, too. The data is ROCCC: Reliable: The data is acquired from the The Look e-Commerce database in BigQuery. Original: Data is gained from primary source. Comprehensive: The data is already provided to answer the business problem. Current: Data was expected to be updated daily, so it is current. Citation: The data was provided by a highly credible organization (Google Looker Team). How are you addressing licensing, privacy, security, and accessibility? As stated in Google Cloud Platform, all the public datasets are under Apache License 2.0 license. Since this is publicly available, then the data is accessible. But, privacy aspects must be considered strictly. How did you verify the data’s integrity? All data has the exact same format. How does it help you answer your question? All the information acquired. Are there any problems with the data? No, the data is formatted correctly.
Key Tasks: Identify how it’s organized Sort and filter the data Determine the credibility of the data
A description of all data sources used

3. Process

This analysis mainly use BigQuery as cloud storage database and Looker Studio as analyzing and visualizing data tools.

3.1 Cleaning

Since the data was available in BigQuery Public Data marketplace, I ran SQL scripts in BigQuery cloud console. First step I did was check for duplications, to do that I ran this scripts:

SELECT 
    COUNT([column]) - COUNT(DISTINCT [column]) as dup
FROM
    `bigquery-public-data.thelook_ecommerce.[table_name]` as [alias];

The result was zero in all four tables. It means there were not any duplications in the table. Then I wanted to check if there was null value in the table using this script:

SELECT * FROM bigquery-public-data.thelook_ecommerce.[table_name]
WHERE
    [column_1] IS NULL OR
    [column_2] IS NULL OR
    ...;

The result was also zero, meaning that there is no null value in the table. The last step is checking the data consistency. To do this, I ran the SQL script below:

/* 1. CHECKING ID LENGTH DATA. SINCE THE ID DATA TYPE WAS INT64,
I NEEDED TO CONVERT THE DATA TYPE INTO STRING FIRST. */
SELECT DISTINCT
    LENGTH(CAST([id_column] as STRING)) as len_id
FROM
    `bigquery-public-data.thelook_ecommerce.[table_name]`

/* 2. CHECK FOR CITY, STATE, AND COUNTRY CONSISTENCY*/
SELECT DISTINCT
    [strColumnName]
FROM
    `bigquery-public-data.thelook_ecommerce.[table_name]`

All data seemed to be cleaned. There were not any duplications, misspelling, or null values. But there were two different countries naming as seen in the table below:

#	Country Name	English Naming
1.	España	Spain
2.	Deutschland	Germany

Since the dataset cannot be modified (due to limitation and restriction condition by BigQuery), I made a temporary table using WITH AS() statement to modify those country naming into the proper one using CASE statement.

WITH users_cl AS (
    SELECT CASE
        WHEN country = "España" THEN country = "Spain"
        WHEN country = "Deutschland" THEN country = "Germany"
    ELSE users_cl END
)

As a part of data cleaning processes, I checked the email data in users table using SQL script below. The result is email data completely clean.

  WITH user_temp AS (
    SELECT
      id,
      first_name,
      last_name,
      email
    FROM
      `bigquery-public-data.thelook_ecommerce.users`
  )
  SELECT
    id,
    first_name,
    last_name,
    email,
    REGEXP_EXTRACT(email, r'^(.*?)@') as name,
    LEFT(REGEXP_EXTRACT(email, r'^(.*?)@'), LENGTH(first_name)) as surname,
    RIGHT(REGEXP_EXTRACT(email, r'^(.*?)@'), LENGTH(last_name)) as fam_nam,
    REGEXP_EXTRACT(email, r'[.@](.*?)\.') as comp,
    RIGHT(email, 3) as dom
  FROM
    user_temp
  LIMIT 10;

  /* 'comp' STANDS FOR "COMPANY" AND 'dom' STANDS FOR "DOMAIN" */

EXPLANATION: REGEXP_EXTRACT() is regular expression that BigQuery provided in their built-in functions. This function match the substring with regular expression pattern. Example, when I extracted 'name' in email fields the regex can be interpretated as: "Matching any character at the beginning of the string (^) before the '@' symbol."

3.2 Case Study Roadmap

Case Study Roadmap - Process
Guiding questions: What tools are you choosing and why? In this analysis I used BigQuery as database (since the data hosted in BigQuery) and Looker Studio (formerly known as Google Data Studio) as dashboard report tools and utilized to aggregate and analyze data. Have you ensured your data’s integrity? Yes, all data formatted correctly. What steps have you taken to ensure that your data is clean? I checked them using SQL script to count any duplication, ensured no blank or null records, and the consistency of strings' length. How can you verify that your data is clean and ready to analyze? I tested the data using SQL script multiple times to ensure all data was ready to be analyzed. Have you documented your cleaning process so you can review and share those results? Yes, I recorded all processes.
Key tasks: Check the data for errors. Choose your tools. Transform the data so you can work with it effectively. Document the cleaning process.
Documentation of any cleaning or manipulation of data

4. Analyze

I utilized Looker Studio as the analyze tool. But, since Looker Studio can not directly manipulate tables in the dataset (like Power Query in Power BI), I need to query all data required by connecting BigQuery to Looker Studio using custom query connection feature. Therefore, I need to write individual SQL scripts to perform all analysis I wanted to do.

4.1 Users Demographic

First dashboard I made was about users' demographic. This visualization tells about how much our users are, where are they come from, how do they access our website, and how is our users growth from time to time. To perform the users card visualization, I write a custom query like this:

SELECT DISTINCT
  users.id,
  users.gender
FROM 
  bigquery-public-data.thelook_ecommerce.users as users
ORDER BY
  users.id ASC;

I used the COUNT() function in Looker Studio to count total distinct users. Then, sorted it by gender; I copied the total users card and filtered the data for each gender. The results is that our active users were 100,000 people, with almost 50:50 ratio for gender distribution. After that, I wanted explore our users age range to see which one was dominating. Therefore, to make the age range and gender distribution chart, I write the scripts like this:

WITH age_group AS (
  SELECT users.age,
    CASE WHEN users.age>=12 AND users.age<=17 THEN '12-17'
         WHEN users.age>=18 AND users.age<=24 THEN '18-24'
         WHEN users.age>=25 AND users.age<=34 THEN '25-34'
         WHEN users.age>=35 AND users.age<=44 THEN '35-44'
         WHEN users.age>=45 AND users.age<=54 THEN '45-54'
  		 WHEN users.age>=55 AND users.age<=64 THEN '55-64'
         ELSE '>65' END as age_groups, users.gender as gender
  FROM bigquery-public-data.thelook_ecommerce.users as users
)
SELECT  age_groups, 
        COUNT(*) as total, 
        gender,
	    CASE WHEN gender = 'F' THEN 'Female'
            ELSE 'Male' END as genders
FROM age_group
GROUP BY age_groups, gender;

The age classification followed the age categories used by several govermental organizations which categorize age range from teenagers, youngs, young adults, adults, to seniors. This classification needed as a consideration of users buying power and other psychological aspects. The result shows that four groups of adults with range of age from 25 to 64 dominating 67.85 percent of our total users. Each group of age has 50:50 gender ratio (the difference was not significant). According to this, our users dominated by they who had stable income. This may also describe that products we sold are matched with adults preferences.

After knowing the users distribution, I wanted to know how our users growth was. Since the analysis was done in September 2022, the latest complete data was in September, too. To create the visualization, I write a custom query like this:

SELECT
  id as users_id,
  gender,
  created_at
FROM bigquery-public-data.thelook_ecommerce.users
ORDER BY 1;

From the users table, I extracted the created_at column which contained when was the user register an account in our website. In Looker Studio, I used COUNTDISTINCT() function to count how much total distinct users, breakdown the gender using gender field, and then extract month and year data from created_at field. This would count total users joined our website in a year. The result shows that the last we have an increasing number of users was on the 2019-2020 period, then it went plateau on 2020-2021 period, then it went down from 2021 to September 2022.

But if we breakdown for female line, the decreasing occured on 2020-2021 period, then went up again on 2021-2022 before going totally down in 2022. This finding indicates that we need to build a strategy to regain our users and traffic to maintain our business.

Now, I wanted know what traffic they used to get into our website. To do this, I ran script like this:

WITH tg AS (
  SELECT users.traffic_source as traffic,
    CASE WHEN users.traffic_source = 'search' THEN 'Search'
         WHEN users.traffic_source = 'organic' THEN 'Organic'
         WHEN users.traffic_source = 'facebook' THEN 'Facebook'
         WHEN users.traffic_source = 'display' THEN 'Display'
         WHEN users.traffic_source = 'email' THEN 'Email'
         ELSE users.traffic_source END AS traffic_groups
  FROM bigquery-public-data.thelook_ecommerce.users as users
)
SELECT  traffic_groups, 
        COUNT(*) AS Total
FROM tg
GROUP BY traffic_groups;

I plotted in a pie chart to see a clearly distinctive result. It turns out that most of users know our website by searching it in any search engine. It followed by organic users, Facebook, email, and display. This suggests us to maintain and improve our current SEO or SEM strategy. However, we can not overlook the other traffics. This can also suggests us to improve our branding to attract more organic users, improve our ads and marketing strategy in Facebook Ads, email marketing, or displays. In other word, we need to improve the accessibility and users recognition to our business.

Lastly, I wanted to know where are they come from, does our users consisted only by local or domestic users or not. To do this, I ran a SQL script like this:

WITH country_group AS (
  SELECT users.country,
    CASE WHEN users.country = 'United States' THEN 'US'
         WHEN users.country = 'South Korea' THEN 'S.Korea'
         WHEN users.country = 'United Kingdom' THEN 'UK'
         WHEN users.country = 'España' THEN 'Spain'
         WHEN users.country = 'Deutschland' THEN 'Germany'
         ELSE users.country END AS country_groups
    FROM bigquery-public-data.thelook_ecommerce.users as users
)
SELECT country_groups, COUNT(*) AS TOTAL
FROM country_group
GROUP BY country_groups

As mentioned in the cleaning process previously, there were two country name data that did not follow the official naming. I used the temp table made in cleaning process and extended it with a few country to make it shorter to increase the readability. I plotted it into a vertical bar chart. The result is surprising, top three countries our users from are China, US, and Brasil; which leads by China with 11,000 more people than US. It means that our products are fulfilled both domestic and international needing. This findings also tells us to improve international recognition, because we have a great potential to extend our market.

4.2 Orders and Sales Performance

As seen in the image above, second dashboard I made was about orders and sales performance which consisted of total complete order, gross sales, profit, average sales, orders and sales trends. To do the sales report card visualizations, I ran SQL scripts:

/* 1. COMPLETE ORDER */
SELECT
  orders.order_id,
  orders.delivered_at
FROM  bigquery-public-data.thelook_ecommerce.orders as orders
WHERE orders.status = 'Complete';

/* 2. GROSS SALES AND AVERAGE SALES */
SELECT
  items.order_id,
  items.user_id,
  items.delivered_at as delivered,
  orders.gender,
  SUM(orders.num_of_item * items.sale_price) as order_revenue
FROM bigquery-public-data.thelook_ecommerce.orders as orders
INNER JOIN bigquery-public-data.thelook_ecommerce.order_items as items
ON items.order_id = orders.order_id
WHERE items.status = 'Complete'
GROUP BY 1,2,3,4;

/* 3. PROFIT */
SELECT
  items.order_id,
  items.user_id,
  orders.gender,
  items.status,
  items.product_id,
  items.delivered_at as delivered,
  products.cost,
  products.retail_price,
  items.sale_price,
  orders.num_of_item as num_item,
  ROUND(products.retail_price, 0) - ROUND(products.cost, 0) as profit,
  ((ROUND(products.retail_price, 0) - ROUND(products.cost, 0))/products.cost) as percent_profit,
FROM bigquery-public-data.thelook_ecommerce.order_items as items
INNER JOIN bigquery-public-data.thelook_ecommerce.products as products
    ON items.product_id = products.id
INNER JOIN bigquery-public-data.thelook_ecommerce.orders as orders
    ON items.order_id = orders.order_id
WHERE items.status ='Complete'
ORDER BY 1,2 ASC;

All these cards are linked to the date range control dropdown that can be seen on the right above of the sales trend chart. The date range control only controlled the cards and not the other visualizations.

I only aggregated complete orders data and excluded cancellations and returns to gain the succesful sales data. That was why the complete orders was only 15,810 orders from January to September 2022, which was only 1,757 average orders per month or 58-59 average orders per day. Even though, the total complete orders was increasing 61.3% compared to the same period of time in year 2021. Through the third quarter of 2022, the gross sales was $2,567,319 with profit $709,349. These numbers increasing 60% compared to the same period of time last year.

To clarify, the term 'gross sales' in this analysis refers to revenue (net sales) and term 'profit' refers to gross profit which only substract the revenue with cost of goods sold (CoGS) and not the fixed cost.

\text{Profit Margin} = \frac{profit}{revenue}\times100\% \\

\text{Profit Margin} = \frac{709,349}{2,567,319}\times100\%

\text{Profit Margin} = 27.6\%

The equation above shows that our profit margin was 27.6 percent. This number must be compared to the national average number or our competitor's number for the same industry to get a comprehensive interpretation. Average sales per complete orders through the third quarter of 2022 was $162, 0.2 percent lower than last year, but it was not significant.

To make the sales trends visualization, I used the same Gross Sales SQL script like in the previous total gross sales card visualizations. This time, I plotted it into a line chart by filtering the date into four quarters in a year. As seen in the chart, our gross sales was constantly growing up. Until it reached the highest sales in Q1 to Q2 on year 2022. There was an increase in gross sales (total revenue) of 237,000 USD in the period of first and second quarter of 2022, or 37 percent increase from the first quarter. This number touched the all-time high revenue from 2019 to 2022.

The same SQL script (Gross Sales SQL script) also used to analyze the orders trend. Orders trend was useful to explain why there was an increase of gross profit beside the possibility of the increase of users buying power. I plotted the data into a line chart like before, but I use COUNT() function to the order_id field instead as the parameter. The result was clear, in the same period (Q1 to Q2 of 2022) the total order went to the all-time high with an increase of 1,421 orders. It can conclude that the more frequent our users buy products from us, the more we get total revenue. However, other factors like buying volumes, may also effect the sales we get.

Finally, still using the same SQL script, I plotted the data into a vertical stacked bar chart to see which buyers/users order our products more frequently. The result is surprisingly equalled. The ratio is constantly 50:50 between male and female users through 2019 to 2022. Total order numbers could grow until two-times each year, except for 2022 (total order numbers growed for 50 percent). This findings may indicates that our products always cover both genders segmentation.

4.3 Products

The third dashboard is about products. The visualizations will be about what product categories ordered the most, which gave the highest gross sales (revenue), and which from it was the most profitable brand. From this analysis, it was expected to know what products are the most desirable and needed by users, how can we strategize branded goods stocks in our warehous for every category to gain more sales and profit, and most importantly current users' preferences.

First chart is about top 10 product categories were ordered by users. I used this SQL script:

SELECT
  orders.order_id,
  products.category,
  products.brand,
  products.name,
  items.delivered_at,
  orders.gender,
  SUM(ROUND(items.sale_price, 0)) as total_sales,
  SUM(ROUND(products.retail_price, 0) - ROUND(products.cost, 0)) as profit
FROM bigquery-public-data.thelook_ecommerce.order_items as items
INNER JOIN bigquery-public-data.thelook_ecommerce.products as products
    ON items.product_id = products.id
INNER JOIN bigquery-public-data.thelook_ecommerce.orders as orders
    ON items.order_id = orders.order_id
WHERE items.status = 'Complete'
GROUP BY 1,2,3,4,5,6
ORDER BY 2;

I joined data from three tables to extract data I need. The result showed that intimates and jeans products dominated with more than 1,500 orders. The different was distinctive, intimates products were 100% ordered by female users, whereas respectively in the rest of the other categories, female users ordered only 35-39% of the total orders.

Second visualization I made is about top 10 product category based on gross sales. I used this SQL script for the last two visualizations:

SELECT
  products.category,
  products.brand,
  items.delivered_at,
  SUM(ROUND(items.sale_price, 0)) as total_sales,
  SUM(ROUND(products.retail_price, 0) - ROUND(products.cost, 0)) as profit
FROM bigquery-public-data.thelook_ecommerce.order_items as items
INNER JOIN bigquery-public-data.thelook_ecommerce.products as products
ON items.product_id = products.id
WHERE items.status = 'Complete'
GROUP BY 1,2,3
ORDER BY 1;

I plotted table and parameter into a bar chart where the x-axis plotted the product categories and the y-axis was the parameter of total gross sales. As seen in the image above, the top three categories such as Outwear & Coats, Jeans, and Sweaters had the distinct total gross sales. If we look details into the result, outwears, coats, jeans, and sweaters are considered as seasonal wears. Followed by suits & sport coats and swim wears which may also part of seasonal wears. This indicates that there can be a high demand of specific product category on each seasons.

By any means, we need to consider the best marketing, ads, or selling strategy for each seasons. However, we may also need to enhance our work to sell more fashions and casuals wears to avoid dependance on seasonal products.

For the last, with the same SQL script, I plotted the table and parameter into a vertical stacked bar chart where the x-axis was profit and y-axis was product category and brand as the breakdown dimension. This chart specifically categorizes by profit, so there was a different result shows in the 9^th and 10^th rank of product category. Even though the top 3 gross sales owned by outerwear & coats, jeans, and sweaters; the top 3 profitable category owned by jeans, outerwear & coats, and suits & sport coats. By visual, we can conclude that some brands give the highest profit to a product category.

For example: (1) True Religion, Columbia, and Calvin Klein gained the highest profit for outerwears & coats category; (2) Diesel, 7 For All Mankind, and True Religion gained the highest profit for jeans category; (3) Quiksilver gained the highest profit for swim and shorts category; and (4) Ray-Ban gained the highest profit for accessories category.

SUMMARIES:

Users consist of male and female people with 50:50 ratio.

Groups of adult users with range of age 25 to 64 years old dominating 67.85 percent of our total users.

Gender distribution was 50:50 for each groups of age.

There was a significant decrease in the user growth chart in the 2021-2022 timeframe.

Most of users access our website through the search engines with 70 percent coverage.

Top 3 country where the users come from were China (34.2%), United States (22.5%), and Brasil (14.5%).

There were an increase in sales and orders performance in 2022.

Sales and orders trends showed constant growth for year-to-year data.

Gross profit we acquired in Q1 to Q3 of 2022 was 709,349 US dollars. The profit margin was 27.6 percent.

One hundred percent of intimates products was ordered by female users. There were only about 35-39 percent females ordered the rest of the other categories.

Outwears & coats, jeans, and sweaters were top 3 categories which generated the highest gross sales.

Top 3 profitable categories owned by jeans, outerwear & coats, and suits & sport coats. Some brands generated the highest profit in a specific product category.

4.4 Case Study Roadmap

Case Study Roadmap - Analyze
Guiding Questions: How should you organize your data to perform analysis on it? The data was already well-organized in BigQuery by Looker Team, no need further organizing work to the data. Has your data been properly formatted? Yes, I had ensured that the data was properly formatted. What surprises did you discover in the data? Users gender distribution was equal. User growth was dropping significantly in 2022. There were more international users than national users. Sales and orders performance growing constantly. Intimates products were only ordered by female users. Some brands generated the highest profit in a category. What trends or relationships did you find in the data? Users growth in 2022 was not going well. Sales and orders trends constantly growing up. High orders volume might effect to the high sales generated. How will these insights help answer your business questions? All findings showed in the dashboard depicted The Look's recent business performance.
Key Tasks: Aggregate your data so it’s useful and accessible. Organize and format your data. Perform calculations. Identify trends and relationships.
A summary of your analysis

Case Study Roadmap - Share
Guiding Questions: Were you able to answer the business question? Yes, i was able to answer the questions. What story does your data tell? Overall, sales performance was great, but not the users growth. Company needs to build the right strategies to increase users growth, maintain sales performance, and increase the other product segments. How do your findings relate to your original question? All information showed in the dashboard answered the original questions. Who is your audience? What is the best way to communicate with them? My audience is a marketing and sales managers and stakeholders. The most effective way to communicate with them is to build a concise presentation that includes all requested findings, data visualizations, and a brief explanation of the visualizations. Can data visualization help you share your findings? Yes, data visualization helps me to share my findings. Is your presentation accessible to your audience? Yes, my presentation is accessible.
Key Tasks: Determine the best way to share your findings. Create effective data visualizations. Present your findings. Ensure your work is accessible.
Supporting visualizations and key findings

6. Act

Case Study Roadmap - Act
Guiding Questions: What is your final conclusion based on your analysis? Decreasing users growth must be company's priority. Create a suitable strategy to maintain sales performance and improve other product segments. How could your team and business apply your insights? They can construct an appropriate plan by combining these insights with those they already possess. What next steps would you or your stakeholders take based on your findings? Combining these findings with previous analysis to see if there was any similarity, we then built a suitable strategy by considering business and users needs. Is there additional data you could use to expand on your findings? No, the dataset already contained all the data required if the analysis needed to be expanded.
Key Tasks: Create your portfolio. Add your case study. Practice presenting your case study to a friend or family member.
Your top three recommendations based on your analysis