Bellabeat case study

This case study analyzes smart-device usage data to inform Bellabeat's marketing strategy.

About the company

Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits. Since it was founded in 2013, Bellabeat has grown rapidly and quickly positioned itself as a tech-driven wellness company for women.

Questions

  • What are some trends in smart device usage?
  • How could these trends apply to Bellabeat customers?
  • How could these trends help influence Bellabeat marketing strategy?

Business task

  • Identify marketing strategies and opportunities for Bellabeat based on usage trends in smart-device data.

Tools

  • RStudio

Data source

  • FitBit Fitness Tracker Data (CC0: Public Domain, dataset made available through Mobius): This Kaggle dataset contains personal fitness tracker data from thirty Fitbit users. Thirty eligible Fitbit users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring. It includes information about daily activity, steps, and heart rate that can be used to explore users’ habits.

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Import data

activity <- read.csv("dailyActivity_merged.csv")
calories <- read.csv("dailyCalories_merged.csv")
sleep <- read.csv("sleepDay_merged.csv")
heartRate <- read.csv("heartrate_seconds_merged.csv")

Explore and clean data

head(activity)
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    4/12/2016      13162          8.50            8.50
## 2 1503960366    4/13/2016      10735          6.97            6.97
## 3 1503960366    4/14/2016      10460          6.74            6.74
## 4 1503960366    4/15/2016       9762          6.28            6.28
## 5 1503960366    4/16/2016      12669          8.16            8.16
## 6 1503960366    4/17/2016       9705          6.48            6.48
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0               1.88                     0.55
## 2                        0               1.57                     0.69
## 3                        0               2.44                     0.40
## 4                        0               2.14                     1.26
## 5                        0               2.71                     0.41
## 6                        0               3.19                     0.78
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                6.06                       0                25
## 2                4.71                       0                21
## 3                3.91                       0                30
## 4                2.83                       0                29
## 5                5.04                       0                36
## 6                2.51                       0                38
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories
## 1                  13                  328              728     1985
## 2                  19                  217              776     1797
## 3                  11                  181             1218     1776
## 4                  34                  209              726     1745
## 5                  10                  221              773     1863
## 6                  20                  164              539     1728
head(calories)
##           Id ActivityDay Calories
## 1 1503960366   4/12/2016     1985
## 2 1503960366   4/13/2016     1797
## 3 1503960366   4/14/2016     1776
## 4 1503960366   4/15/2016     1745
## 5 1503960366   4/16/2016     1863
## 6 1503960366   4/17/2016     1728
head(sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
head(heartRate)
##           Id                 Time Value
## 1 2022484408 4/12/2016 7:21:00 AM    97
## 2 2022484408 4/12/2016 7:21:05 AM   102
## 3 2022484408 4/12/2016 7:21:10 AM   105
## 4 2022484408 4/12/2016 7:21:20 AM   103
## 5 2022484408 4/12/2016 7:21:25 AM   101
## 6 2022484408 4/12/2016 7:22:05 AM    95
n_distinct(activity$Id)
## [1] 33
n_distinct(calories$Id)
## [1] 33
n_distinct(sleep$Id)
## [1] 24
n_distinct(heartRate$Id)
## [1] 14

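Sleep data comes from 24 of the 33 users, which is still enough to analyze. Heart rate data comes from only 14 users, which is too few for this analysis. Note that the activity and calories tables contain 33 distinct Ids, slightly more than the thirty users mentioned in the dataset description.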
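
To see which users are missing from a table, the Id sets can be compared directly. The following is a minimal sketch on made-up data: `activity_demo` and `sleep_demo` are hypothetical stand-ins for the real tables, and only base R is used so that it runs without the CSV files.

```r
# Hypothetical stand-ins for the real tables; the Ids are invented.
activity_demo <- data.frame(Id = c(1, 1, 2, 3, 3, 4))
sleep_demo    <- data.frame(Id = c(1, 3, 3))

# Number of distinct users per table (base-R equivalent of n_distinct()).
user_counts <- c(
  activity = length(unique(activity_demo$Id)),
  sleep    = length(unique(sleep_demo$Id))
)

# Users who logged activity but never logged sleep.
# base:: is spelled out because lubridate masks setdiff().
missing_sleep <- base::setdiff(unique(activity_demo$Id), unique(sleep_demo$Id))

# Share of activity users who also appear in the sleep table.
sleep_coverage <- user_counts[["sleep"]] / user_counts[["activity"]]

print(user_counts)
print(missing_sleep)
print(sleep_coverage)
```

Applied to the real tables, the same comparison would list the users who have activity records but no sleep records.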

sum(duplicated(activity))
## [1] 0
sum(duplicated(calories))
## [1] 0
sum(duplicated(sleep))
## [1] 3
activity <- activity %>%
  distinct() %>%
  drop_na()
calories <- calories %>%
  distinct() %>%
  drop_na()
sleep <- sleep %>%
  distinct() %>%
  drop_na()

sum(duplicated(sleep))
## [1] 0
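
The cleaning step can be illustrated on a toy table. This sketch assumes a hypothetical `sleep_demo` data frame (invented values) and uses base R as a rough equivalent of `distinct() %>% drop_na()`.

```r
# Hypothetical sleep-like table with one exact duplicate row and one missing value.
sleep_demo <- data.frame(
  Id = c(1, 1, 1, 2, 2),
  SleepDay = c("4/12/2016", "4/12/2016", "4/13/2016", "4/12/2016", "4/13/2016"),
  TotalMinutesAsleep = c(327, 327, 384, NA, 412)
)

# duplicated() flags the second and later copies of a row, so the sum is the
# number of rows that distinct() would remove.
n_dup <- sum(duplicated(sleep_demo))

# Keep rows that are both first occurrences and free of missing values.
sleep_clean <- sleep_demo[!duplicated(sleep_demo) & complete.cases(sleep_demo), ]

print(n_dup)
print(nrow(sleep_clean))
```

Checking `sum(duplicated(...))` again after cleaning, as done above for `sleep`, confirms that no repeated rows remain.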

No duplicates remain, and the data is now clean.

  • Date formatting

activity <- activity %>%
  rename(date = ActivityDate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

sleep <- sleep %>%
  rename(date = SleepDay) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y %I:%M:%S %p"))
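
A `Date` value has no time zone, which is why a `tz` argument does not apply when converting to dates. As a rough base-R sketch of the same conversion (an alternative to `as_date()`, not the code used in this report), the two date formats can be parsed and compared as shown here. The sample strings are taken from the `head()` output; choosing UTC for the intermediate date-time is an assumption made only for reproducibility.

```r
# Sample strings copied from the head() output of the two tables.
activity_raw <- c("4/12/2016", "4/13/2016")
sleep_raw    <- c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM")

# Activity dates carry no time component, so one format string is enough.
activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")

# Sleep timestamps are parsed as date-times in a fixed zone (UTC is an
# assumption), then truncated to calendar dates.
sleep_dttm <- as.POSIXct(sleep_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")
sleep_date <- as.Date(sleep_dttm, tz = "UTC")

# Both vectors now share the Date class, so they can serve as a join key.
print(all(activity_date == sleep_date))
```

Note that `%p` (AM/PM) parsing depends on the session locale, so results may differ on non-English systems.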

Summarise data

  • Summary statistics for daily steps, calories, and sleep
activity %>%
  select(TotalSteps) %>%
  summary()
##    TotalSteps   
##  Min.   :    0  
##  1st Qu.: 3790  
##  Median : 7406  
##  Mean   : 7638  
##  3rd Qu.:10727  
##  Max.   :36019
calories %>%
  select(Calories) %>%
  summary()
##     Calories   
##  Min.   :   0  
##  1st Qu.:1828  
##  Median :2134  
##  Mean   :2304  
##  3rd Qu.:2793  
##  Max.   :4900
sleep %>%
  select(TotalMinutesAsleep, TotalTimeInBed) %>%
  summary()
##  TotalMinutesAsleep TotalTimeInBed 
##  Min.   : 58.0      Min.   : 61.0  
##  1st Qu.:361.0      1st Qu.:403.8  
##  Median :432.5      Median :463.0  
##  Mean   :419.2      Mean   :458.5  
##  3rd Qu.:490.0      3rd Qu.:526.0  
##  Max.   :796.0      Max.   :961.0

From the summary

According to the summary above:

  • The average daily step count is 7638, which falls into the ‘Fairly active’ range (source: https://www.10000steps.org.au/articles/counting-steps/).
  • The median is 7406 steps, so the sample group sits between Lightly active and Fairly active.
  • The average daily calorie figure is 2304, somewhat higher than typical reference values for women, so the sample may include both men and women.

Analyze and Share

Merge the activity and sleep tables by Id and date, then analyze per-user averages.

step_sleep <- merge(activity,sleep, by = c("Id","date"))
glimpse(step_sleep)
## Rows: 410
## Columns: 18
## $ Id                       <dbl> 1503960366, 1503960366, 1503960366, 150396036…
## $ date                     <date> 2016-04-12, 2016-04-13, 2016-04-15, 2016-04-…
## $ TotalSteps               <int> 13162, 10735, 9762, 12669, 9705, 15506, 10544…
## $ TotalDistance            <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ TrackerDistance          <dbl> 8.50, 6.97, 6.28, 8.16, 6.48, 9.88, 6.68, 6.3…
## $ LoggedActivitiesDistance <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveDistance       <dbl> 1.88, 1.57, 2.14, 2.71, 3.19, 3.53, 1.96, 1.3…
## $ ModeratelyActiveDistance <dbl> 0.55, 0.69, 1.26, 0.41, 0.78, 1.32, 0.48, 0.3…
## $ LightActiveDistance      <dbl> 6.06, 4.71, 2.83, 5.04, 2.51, 5.03, 4.24, 4.6…
## $ SedentaryActiveDistance  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ VeryActiveMinutes        <int> 25, 21, 29, 36, 38, 50, 28, 19, 41, 39, 73, 3…
## $ FairlyActiveMinutes      <int> 13, 19, 34, 10, 20, 31, 12, 8, 21, 5, 14, 23,…
## $ LightlyActiveMinutes     <int> 328, 217, 209, 221, 164, 264, 205, 211, 262, …
## $ SedentaryMinutes         <int> 728, 776, 726, 773, 539, 775, 818, 838, 732, …
## $ Calories                 <int> 1985, 1797, 1745, 1863, 1728, 2035, 1786, 177…
## $ TotalSleepRecords        <int> 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotalMinutesAsleep       <int> 327, 384, 412, 340, 700, 304, 360, 325, 361, …
## $ TotalTimeInBed           <int> 346, 407, 442, 367, 712, 320, 377, 364, 384, …
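# Id is numeric, so convert it to a factor to get one evenly spaced bar per user
step_sleep %>%
  group_by(Id) %>%
  summarise(avg_step = mean(TotalSteps)) %>%
  ggplot() +
  geom_col(mapping = aes(x = factor(Id), y = avg_step, fill = avg_step))

# Average sleep per user, converted from minutes to hours
step_sleep %>%
  group_by(Id) %>%
  summarise(avg_sleep = mean(TotalMinutesAsleep) / 60) %>%
  ggplot() +
  geom_col(mapping = aes(x = factor(Id), y = avg_sleep, fill = avg_sleep))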
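
By default, `merge()` performs an inner join, so only Id and date pairs present in both tables survive; this is why the merged table has 410 rows. The sketch below shows the effect on small hypothetical tables (`activity_demo` and `sleep_demo` are invented for illustration).

```r
# Hypothetical daily tables; user 2 has no sleep record on the second day.
activity_demo <- data.frame(
  Id = c(1, 1, 2, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12", "2016-04-13")),
  TotalSteps = c(13162, 10735, 4000, 5200)
)
sleep_demo <- data.frame(
  Id = c(1, 1, 2),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  TotalMinutesAsleep = c(327, 384, 410)
)

# Default merge() is an inner join: unmatched activity rows are dropped.
inner <- merge(activity_demo, sleep_demo, by = c("Id", "date"))

# all.x = TRUE keeps every activity row and fills missing sleep values with NA.
left <- merge(activity_demo, sleep_demo, by = c("Id", "date"), all.x = TRUE)

print(nrow(inner))
print(nrow(left))
print(sum(is.na(left$TotalMinutesAsleep)))
```

The inner join suits this analysis because every remaining row has both step and sleep values, at the cost of discarding days without a sleep record.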

  • Relationships between steps, calories, and sleep

step_sleep %>%
  ggplot(aes(x = TotalSteps, y = Calories)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Steps vs. Calories")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

step_sleep %>%
  ggplot(aes(x = Calories, y = TotalMinutesAsleep)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Calories vs. Sleep")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

step_sleep %>%
  ggplot(aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Steps vs. Sleep")
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

According to the graphs above:

  • There is a clear positive relationship between steps and calories. Tracking and reporting activity to users is therefore a must-have feature of the Bellabeat app.
  • The relationship between calories and sleep time is unclear.
  • There is a negative relationship between steps and sleep time. If the Bellabeat app aims to help users manage their sleep time, steps are one factor to consider.
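
Visual impressions like these can be backed up with correlation coefficients. This is a minimal sketch using `cor()` on a hypothetical `demo` data frame; the values are invented for illustration and are not results from the Fitbit data.

```r
# Hypothetical daily records, invented for illustration only.
demo <- data.frame(
  TotalSteps = c(2000, 5000, 8000, 11000, 14000),
  Calories = c(1600, 1900, 2300, 2500, 3000),
  TotalMinutesAsleep = c(480, 450, 430, 400, 390)
)

# Pearson correlation measures linear association; Spearman uses ranks and is
# less sensitive to outliers such as very high step counts.
pearson <- cor(demo, method = "pearson")
spearman <- cor(demo, method = "spearman")

print(round(pearson["TotalSteps", "Calories"], 2))
print(round(pearson["TotalSteps", "TotalMinutesAsleep"], 2))
print(spearman["TotalSteps", "TotalMinutesAsleep"])
```

On the real data, the same call could be run as `cor(step_sleep[, c("TotalSteps", "Calories", "TotalMinutesAsleep")])`. A coefficient near zero would correspond to an unclear relationship in the plot.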

avg_user_group <- step_sleep %>%
  group_by(Id) %>%
  summarise(avg_totalSteps = mean(TotalSteps),avg_cal = mean(Calories), avg_totalSleepTime = mean(TotalMinutesAsleep)) %>%
  mutate(user_activeType = case_when(
    avg_totalSteps < 7500 ~ "Lightly active",
    avg_totalSteps >= 7500 & avg_totalSteps < 10000 ~ "Fairly active",
    avg_totalSteps >= 10000 ~ "Very active"
  ))

# Count users per activity type first so each pie slice has a numeric size
avg_user_group %>%
  count(user_activeType) %>%
  ggplot(aes(x = "", y = n, fill = user_activeType)) +
  geom_col() +
  coord_polar("y", start = 0) +
  theme_void()
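
The share of users in each activity group can also be reported as numbers. The sketch below reproduces the `case_when()` thresholds with base-R `cut()`; the `avg_steps` vector is hypothetical and chosen to hit the category boundaries.

```r
# Hypothetical per-user average step counts, chosen to hit the boundaries.
avg_steps <- c(3000, 7499, 7500, 9999, 10000, 12000)

# right = FALSE makes each interval closed on the left, matching the
# thresholds in case_when(): [-Inf, 7500), [7500, 10000), [10000, Inf).
active_type <- cut(
  avg_steps,
  breaks = c(-Inf, 7500, 10000, Inf),
  labels = c("Lightly active", "Fairly active", "Very active"),
  right = FALSE
)

# Users per group, and each group's share of all users.
group_counts <- table(active_type)
group_share <- prop.table(group_counts)

print(group_counts)
print(round(group_share, 2))
```

Reporting the counts alongside the pie chart makes it easier to verify which group is largest.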

Summary

According to the analysis, lightly active users form the largest group, so the company should focus on this target group first. Step and sleep data are important information to provide to users. Heart rate data was not considered in this study because it covers only 14 users (fewer than half of the 33 users in the activity data). Hence, the Bellabeat app and products should offer step and sleep tracking as a minimum requirement.
