6  Data Transformation

Data transformation is a critical step in the data analysis process, aimed at converting raw data into more useful and meaningful information. This process involves a variety of techniques used to adjust the format of the data, improve its quality, and convert it into a form that is easier to analyze and understand. Data transformation also plays a key role in enhancing the accuracy and effectiveness of analytical models, especially in the context of handling large and complex datasets.

In the following mind map, we will explore several data transformation techniques that can be used to improve data quality and generate better insights. The mind map covers major categories such as temporal transformations, distribution transformations, normalization, categorical encoding, feature engineering, and more, each with important subcomponents that aid in the data processing workflow.

Dummy Dataset

Below is a dummy business dataset created in R. You can generate it yourself using packages such as dplyr, lubridate, and stringi. The dataset simulates transaction data from a retail business with the following key elements:

Column            Description
----------------  -----------------------------------------------
Transaction_ID    Unique ID for each transaction
Transaction_Date  Transaction date
Customer_ID       Unique ID for the customer
Product_Category  Product category (Electronics, Clothing, etc.)
Product_ID        Product ID
Quantity          Number of items purchased
Unit_Price        Price per unit of the product
Discount          Transaction discount (0–0.3)
Region            Region (North, South, East, West)
Sales_Channel     Sales channel (Online / Offline)
Total_Price       Total price after applying the discount
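
For readers who want a reproducible starting point, here is a minimal sketch of a generator for a comparable dataset. The seed, date range, and distributions are illustrative assumptions, not the exact parameters behind the output shown in this chapter; the practicum snippets at the end of the chapter follow the same pattern.

library(dplyr)
library(lubridate)
library(stringi)

set.seed(123)  # illustrative seed
n <- 500       # illustrative sample size

data_bisnis <- tibble(
  Transaction_ID   = stri_rand_strings(n, 12),
  Transaction_Date = sample(seq.Date(as.Date("2020-01-01"),
                                     as.Date("2023-12-31"), by = "day"),
                            n, replace = TRUE),
  Customer_ID      = stri_rand_strings(n, 10),
  Product_Category = sample(c("Electronics", "Clothing", "Groceries",
                              "Home", "Books"), n, replace = TRUE),
  Product_ID       = sprintf("P%04d", sample(1:500, n, replace = TRUE)),
  Quantity         = sample(1:10, n, replace = TRUE),
  Unit_Price       = round(runif(n, 5, 100), 2),
  Discount         = round(runif(n, 0, 0.3), 2),
  Region           = sample(c("North", "South", "East", "West"), n, replace = TRUE),
  Sales_Channel    = sample(c("Online", "Offline"), n, replace = TRUE)
) %>%
  mutate(Total_Price = round(Quantity * Unit_Price * (1 - Discount), 2))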

6.1 Temporal Transformations

Temporal transformations refer to techniques used to manipulate and extract meaningful patterns from time-based data. These transformations are essential when dealing with time series or datasets containing date and time information. The goal is to uncover trends, seasonal patterns, or lagged relationships that improve the predictive power of models.

Common temporal transformation techniques include:

  • Lag, Difference, and Rolling: Capture temporal dependencies and smooth fluctuations.
  • Extract Hour, Day, Week, Month, Year: Derive time-based features that often influence behavior or outcomes.
  • Cumulative Sum / Count / Mean: Track running totals or averages over time for trend analysis.

These methods help structure time-based data in a way that enables deeper insights and improved decision-making.

6.1.1 Lag, Diff, Rolling

# (Example for Quantity, ordered by transaction date within each customer)
Temporal <- data_bisnis %>%
  arrange(Customer_ID, Transaction_Date) %>%
  group_by(Customer_ID) %>%
  mutate(
    Lag_Quantity = lag(Quantity),              # previous transaction's quantity
    Diff_Quantity = Quantity - lag(Quantity),  # change from the previous transaction
    RollingMean_3 = zoo::rollapply(Quantity, width = 3, FUN = mean,
                                   fill = NA, align = "right")  # 3-transaction rolling mean
  ) %>%
  ungroup()
head(Temporal)
# A tibble: 6 × 14
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 1PaWvGiAokHB   2023-02-09       02kNipUny9  Groceries        P0253     
2 IYlg0MAGwsUw   2020-11-30       04npfK5VJa  Books            P0113     
3 0kgKWZOpEp6Z   2020-03-18       0NwgTyo7P2  Clothing         P0445     
4 ddFsMFlQOa2J   2023-05-19       0PlvCaxPuS  Electronics      P0298     
5 VOCzzxDOxQps   2020-07-07       0S9qOEuCr9  Clothing         P0353     
6 FyRCpwyOqKWK   2021-12-02       0dEDsy4fWM  Groceries        P0175     
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Lag_Quantity <int>,
#   Diff_Quantity <int>, RollingMean_3 <lgl>

Note that RollingMean_3 prints as <lgl>: because customer IDs are randomly generated, no customer in this simulated data has the three transactions a 3-row rolling window needs, so the column is entirely NA.

6.1.2 Extract Date

Extract <- data_bisnis %>%
  mutate(
    Day_of_Week = weekdays(Transaction_Date),
    Month = month(Transaction_Date, label = TRUE),
    Year = year(Transaction_Date),
    Is_Weekend = ifelse(Day_of_Week %in% c("Saturday", "Sunday"), 1, 0),
    Region = as.factor(Region),
    Product_Category = as.factor(Product_Category)
  ) %>%
  # One-hot encode the two factors (model.matrix with -1 drops the intercept)
  bind_cols(
    as.data.frame(model.matrix(~ Region - 1, data = .)),
    as.data.frame(model.matrix(~ Product_Category - 1, data = .))
  )

# View the first few rows
head(Extract)
# A tibble: 6 × 24
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <fct>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 19 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <fct>, Sales_Channel <chr>, Total_Price <dbl>, Day_of_Week <chr>,
#   Month <ord>, Year <dbl>, Is_Weekend <dbl>, RegionEast <dbl>,
#   RegionNorth <dbl>, RegionSouth <dbl>, RegionWest <dbl>,
#   Product_CategoryBooks <dbl>, Product_CategoryClothing <dbl>,
#   Product_CategoryElectronics <dbl>, Product_CategoryGroceries <dbl>,
#   Product_CategoryHome <dbl>

6.1.3 Cumulative Values

Temporal3 <- data_bisnis %>%
  arrange(Customer_ID, Transaction_Date) %>%  # row order matters for running totals
  group_by(Customer_ID) %>%
  mutate(
    Cumulative_Quantity = cumsum(Quantity),
    Cumulative_Spend = cumsum(Total_Price),
    Cumulative_AvgPrice = cummean(Unit_Price)
  ) %>%
  ungroup()
head(Temporal3)
# A tibble: 6 × 14
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 1PaWvGiAokHB   2023-02-09       02kNipUny9  Groceries        P0253     
2 IYlg0MAGwsUw   2020-11-30       04npfK5VJa  Books            P0113     
3 0kgKWZOpEp6Z   2020-03-18       0NwgTyo7P2  Clothing         P0445     
4 ddFsMFlQOa2J   2023-05-19       0PlvCaxPuS  Electronics      P0298     
5 VOCzzxDOxQps   2020-07-07       0S9qOEuCr9  Clothing         P0353     
6 FyRCpwyOqKWK   2021-12-02       0dEDsy4fWM  Groceries        P0175     
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>,
#   Cumulative_Quantity <int>, Cumulative_Spend <dbl>,
#   Cumulative_AvgPrice <dbl>

6.2 Distribution Transformations

Distribution transformations are techniques used to modify the shape of a dataset’s distribution, making it more suitable for analysis or modeling. Many statistical models assume that data follows a normal distribution; therefore, transforming skewed or irregularly distributed data can lead to better model performance and more reliable insights.

Key distribution transformation techniques include:

  • Log Transform: Compresses large values and reduces right-skewness.
  • Box-Cox and Yeo-Johnson: Flexible transformations that stabilize variance and normalize data.
  • Variance Stabilization / Skew Reduction: Improves symmetry and homoscedasticity in datasets.

These transformations are particularly useful when dealing with highly skewed data, outliers, or heteroscedastic variance, allowing for more robust and interpretable analyses.

6.2.1 Log Transform

# Guard against non-positive totals before the log transform
min_positive <- min(data_bisnis$Total_Price[data_bisnis$Total_Price > 0])

Log <- data_bisnis %>%
  mutate(
    Safe_Total_Price = ifelse(Total_Price <= 0, min_positive, Total_Price),
    Log_Total_Price = log1p(Safe_Total_Price)  # log1p(x) = log(1 + x)
  )
head(Log)
# A tibble: 6 × 13
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>,
#   Safe_Total_Price <dbl>, Log_Total_Price <dbl>

6.2.2 Box-Cox & Yeo-Johnson

library(bestNormalize)
# yeojohnson() applies the Yeo-Johnson transform directly; unlike Box-Cox,
# it also handles zero and negative values. bestNormalize() would instead
# pick the best-fitting transformation automatically.
Box_Cox <- data_bisnis %>%
  mutate(
    YeoJ_Quantity = yeojohnson(Quantity)$x.t,
    YeoJ_TotalPrice = yeojohnson(Total_Price)$x.t
  )
head(Box_Cox)
# A tibble: 6 × 13
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, YeoJ_Quantity <dbl>,
#   YeoJ_TotalPrice <dbl>

6.2.3 Stabilize Variance
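
A common variance stabilizer is the square-root transform, a classic choice for count-like variables such as Quantity. Below is a minimal sketch on the dummy dataset; the column choices are illustrative.

# Square-root transform: a classic variance stabilizer for counts
Stabilized <- data_bisnis %>%
  mutate(
    Sqrt_Quantity = sqrt(Quantity),
    Sqrt_Total_Price = sqrt(Total_Price)  # assumed non-negative by construction
  )
head(Stabilized)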

6.3 Scaling & Normalization

Scaling and normalization are essential preprocessing steps in data transformation that ensure all numerical features contribute equally to model performance. These techniques adjust the range or distribution of data values, especially when features have different units or magnitudes.

Common techniques include:

  • Min-Max Scaling: Rescales values to a fixed range (typically 0 to 1).
  • Standardization (Z-score): Centers data around zero with unit variance.
  • Robust Scaling: Uses median and IQR, making it more resistant to outliers.

Applying these transformations improves the convergence and accuracy of algorithms such as gradient descent and distance-based models (e.g., k-NN, SVM), which are sensitive to feature magnitude.

# Z-Score Standardization
data_std <- data_bisnis %>%
  mutate(
    Quantity_Std = as.numeric(scale(Quantity)),      # scale() returns a one-column matrix;
    Unit_Price_Std = as.numeric(scale(Unit_Price))   # as.numeric() flattens it to a vector
  )
head(data_std)
# A tibble: 6 × 13
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Quantity_Std <dbl>,
#   Unit_Price_Std <dbl>

# Min-Max Normalization
data_norm <- data_bisnis %>%
  mutate(
    Quantity_Norm = (Quantity - min(Quantity)) / (max(Quantity) - min(Quantity)),
    Total_Price_Norm = (Total_Price - min(Total_Price)) / (max(Total_Price) - min(Total_Price))
  )
head(data_norm)
# A tibble: 6 × 13
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Quantity_Norm <dbl>,
#   Total_Price_Norm <dbl>
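
Robust scaling, listed above but not shown in the snippets, centers on the median and scales by the IQR, which makes it resistant to outliers. A minimal sketch (the column choices are illustrative):

# Robust Scaling: median-centered, IQR-scaled, resistant to outliers
data_robust <- data_bisnis %>%
  mutate(
    Quantity_Robust = (Quantity - median(Quantity)) / IQR(Quantity),
    Unit_Price_Robust = (Unit_Price - median(Unit_Price)) / IQR(Unit_Price)
  )
head(data_robust)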

6.4 Categorical Encoding

Categorical encoding transforms non-numeric (categorical) variables into a numerical format that can be used by machine learning algorithms. Since most models cannot handle text or label data directly, encoding is crucial for integrating categorical features into predictive modeling.

Popular encoding techniques include:

  • One-Hot Encoding: Creates binary columns for each category, useful for nominal data.
  • Label Encoding: Assigns each category a unique integer, best for ordinal variables.
  • Frequency Encoding: Replaces categories with their frequency or count in the dataset.

Choosing the right encoding method helps preserve the meaning of categorical data while maintaining model performance and interpretability.

6.4.1 One-hot Encoding

library(fastDummies)
one_hot <- dummy_cols(data_bisnis,
                      select_columns = c("Region",
                                         "Sales_Channel",
                                         "Product_Category"),
                      remove_first_dummy = TRUE,      # avoid the dummy-variable trap
                      remove_selected_columns = TRUE) # drop the original columns
head(one_hot)
# A tibble: 6 × 16
  Transaction_ID Transaction_Date Customer_ID Product_ID Quantity Unit_Price
  <chr>          <date>           <chr>       <chr>         <int>      <dbl>
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  P0370             2      15.2 
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  P0185             5      10.2 
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  P0443             3      17.7 
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  P0035             6      28.3 
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  P0375             3      11.9 
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  P0447             1       5.43
# ℹ 10 more variables: Discount <dbl>, Total_Price <dbl>, Region_North <int>,
#   Region_South <int>, Region_West <int>, Sales_Channel_Online <int>,
#   Product_Category_Clothing <int>, Product_Category_Electronics <int>,
#   Product_Category_Groceries <int>, Product_Category_Home <int>

6.4.2 Frequency Encoding

# Frequency encoding: replace each category with its relative frequency
freq_enc <- function(col) {
  tab <- table(col)
  as.numeric(tab[col]) / length(col)
}

Frequency <- data_bisnis %>%
  mutate(
    Region_freq = freq_enc(Region),
    Product_Category_freq = freq_enc(Product_Category)
  )
head(Frequency)
# A tibble: 6 × 13
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Region_freq <dbl>,
#   Product_Category_freq <dbl>
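
6.4.3 Label Encoding

Label encoding was described above but not demonstrated; here is a minimal sketch that maps factor levels to integers. Note that factor() orders levels alphabetically by default, so a truly ordinal variable should have its levels set explicitly.

label_enc <- data_bisnis %>%
  mutate(
    # as.integer(factor(...)) assigns 1, 2, ... by level order
    Region_Label = as.integer(factor(Region)),
    Product_Category_Label = as.integer(factor(Product_Category))
  )
head(label_enc)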

6.5 Feature Engineering

Feature engineering is the process of creating new input features from existing data to improve model performance. It involves extracting, transforming, or combining variables in ways that make patterns more accessible to machine learning algorithms.

Key strategies include:

  • Mathematical combinations: Creating interaction terms (e.g., price × quantity) or ratios (e.g., discount / price).
  • Domain-driven insights: Building features based on domain knowledge (e.g., efficiency, customer value).
  • Statistical summaries: Aggregating features by group (e.g., average purchase per customer).
  • Parsing IDs or Text: Extracting patterns or categories from strings.

Effective feature engineering often leads to significant boosts in model accuracy and interpretability, making it one of the most impactful steps in a data pipeline.

# 1. New Features from Raw Data
# 2. Product of Features, Crossed Terms
# 3. Price per Unit, Efficiency
# 4. Ranking, Percentile
# 5. From IDs: Prefix, Length, Pattern
# 6. Avg, Sum, Count by Group

Feature_Eng <- data_bisnis %>%
  mutate(
    Price_per_Unit = Total_Price / Quantity,
    ID_Prefix = substr(Product_ID, 1, 2),
    ID_Length = nchar(Product_ID),
    Discount_Level = ntile(Discount, 4)
  ) %>%
  group_by(Region) %>%
  mutate(
    Avg_Quantity_Region = mean(Quantity),
    Sum_Sales_Region = sum(Total_Price)
  ) %>%
  ungroup()
head(Feature_Eng)
# A tibble: 6 × 17
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 12 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Price_per_Unit <dbl>,
#   ID_Prefix <chr>, ID_Length <int>, Discount_Level <int>,
#   Avg_Quantity_Region <dbl>, Sum_Sales_Region <dbl>

6.5.1 Interaction Features

# Interaction between Quantity and Unit_Price
data_interaction <- data_bisnis %>%
  mutate(
    Price_Impact = Quantity * Unit_Price
  )
head(data_interaction)
# A tibble: 6 × 12
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 7 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Price_Impact <dbl>

6.5.2 Ratio Features

# Discount to Price Ratio
data_ratio <- data_bisnis %>%
  mutate(
    Discount_Ratio = Discount / (Unit_Price + 1e-5)  # Avoid division by zero
  )
head(data_ratio)
# A tibble: 6 × 12
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 7 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Discount_Ratio <dbl>

6.5.3 Group Aggregation

# Total sales by each Customer
customer_sales <- data_bisnis %>%
  group_by(Customer_ID) %>%
  summarise(
    Avg_Spent = mean(Total_Price),
    Max_Spent = max(Total_Price),
    Transaction_Count = n()
  )

# Join with main data
data_bisnis <- left_join(data_bisnis, customer_sales, by = "Customer_ID")
head(data_bisnis)
# A tibble: 6 × 14
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>

6.5.4 Rank Transformation

data_ranked <- data_bisnis %>%
  mutate(
    Sales_Rank = rank(-Total_Price) # Higher total price gets lower rank number
  )
head(data_ranked)
# A tibble: 6 × 15
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>, Sales_Rank <dbl>

6.5.5 Text Cleaning & Feature Creation

# Create numeric suffix from Product_ID
data_text <- data_bisnis %>%
  mutate(
    Product_Num = as.numeric(gsub("P", "", Product_ID))
  )
head(data_text)
# A tibble: 6 × 15
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>, Product_Num <dbl>

6.5.6 Cumulative Features

data_cumulative <- data_bisnis %>%
  arrange(Customer_ID, Transaction_Date) %>%
  group_by(Customer_ID) %>%
  mutate(
    Cumulative_Sales = cumsum(Total_Price)
  ) %>%
  ungroup()
head(data_cumulative)
# A tibble: 6 × 15
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 1PaWvGiAokHB   2023-02-09       02kNipUny9  Groceries        P0253     
2 IYlg0MAGwsUw   2020-11-30       04npfK5VJa  Books            P0113     
3 0kgKWZOpEp6Z   2020-03-18       0NwgTyo7P2  Clothing         P0445     
4 ddFsMFlQOa2J   2023-05-19       0PlvCaxPuS  Electronics      P0298     
5 VOCzzxDOxQps   2020-07-07       0S9qOEuCr9  Clothing         P0353     
6 FyRCpwyOqKWK   2021-12-02       0dEDsy4fWM  Groceries        P0175     
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>, Cumulative_Sales <dbl>

6.6 Outlier Handling

Outlier handling refers to the identification and treatment of data points that significantly deviate from other observations. Outliers can distort statistical analyses and model predictions if not managed properly.

Common techniques include:

  • Z-Score: Identifies how many standard deviations a point is from the mean.
  • Interquartile Range (IQR): Flags values that fall far outside the middle 50% of data.
  • Winsorizing: Limits extreme values by capping them at a certain percentile.

Deciding whether to keep, transform, or remove outliers depends on their cause and the goals of the analysis. Proper handling ensures that models are not overly influenced by anomalous data, leading to more stable and accurate outcomes.

# Z-score method: flag values more than 3 standard deviations from the mean
z_scores <- scale(data_bisnis$Quantity)
data_flagged <- data_bisnis %>%
  mutate(Outlier_Flag = ifelse(abs(z_scores) > 3, "Outlier", "Normal"))

# IQR method: keep only values within 1.5 * IQR of the quartiles
Q1 <- quantile(data_bisnis$Total_Price, 0.25)
Q3 <- quantile(data_bisnis$Total_Price, 0.75)
IQR_val <- Q3 - Q1
data_outliers <- data_bisnis %>%
  filter(Total_Price > (Q1 - 1.5 * IQR_val) & Total_Price < (Q3 + 1.5 * IQR_val))

head(data_outliers)
# A tibble: 6 × 14
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
5 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
6 zu0iP1OBFuNI   2021-09-25       ifYw95qmoL  Groceries        P0345     
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>
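
Winsorizing, the third technique listed above, caps extremes instead of removing them. A minimal sketch capping Total_Price at the 5th and 95th percentiles (the cutoffs are illustrative):

# Winsorizing: cap values beyond chosen percentiles
lower <- quantile(data_bisnis$Total_Price, 0.05)
upper <- quantile(data_bisnis$Total_Price, 0.95)
data_winsor <- data_bisnis %>%
  mutate(Total_Price_Wins = pmin(pmax(Total_Price, lower), upper))
head(data_winsor)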

6.7 Discretization

Discretization, also known as binning, is the process of converting continuous numerical variables into categorical intervals or “bins.” This technique simplifies data, can expose non-linear effects to otherwise linear models, and enhances interpretability.

Common approaches include:

  • Equal-width binning: Divides the range of values into intervals of equal size.
  • Equal-frequency binning: Ensures each bin contains roughly the same number of observations.
  • Custom bins: Created based on domain knowledge or statistical thresholds (e.g., age groups, income brackets).

Binning is especially useful when the exact values of a variable are less important than the range they fall into. It can also help deal with non-linearity and reduce the impact of outliers.

binning <- data_bisnis %>%
  mutate(
    Price_Level = cut(Unit_Price,
                      breaks = quantile(Unit_Price, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                      labels = c("Low", "Medium", "High"),
                      include.lowest = TRUE)
  )
head(binning)
# A tibble: 6 × 15
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>, Price_Level <fct>
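
The example above uses quantile breaks (equal-frequency binning). Equal-width and custom bins, also listed above, can be sketched as follows; the labels and breakpoints are illustrative.

# Equal-width binning: four intervals of equal size across the value range
equal_width <- data_bisnis %>%
  mutate(Price_Band = cut(Unit_Price, breaks = 4,
                          labels = c("Low", "Mid-Low", "Mid-High", "High")))

# Custom bins from domain thresholds
custom_bins <- data_bisnis %>%
  mutate(Spend_Band = cut(Total_Price,
                          breaks = c(0, 50, 200, Inf),
                          labels = c("Small", "Medium", "Large")))
head(custom_bins)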

6.8 Seasonality

Seasonality and signal decomposition involve identifying recurring patterns or cyclical behaviors in time-series data. These patterns can significantly enhance predictive models, especially in fields like sales forecasting, weather prediction, or traffic analysis.

One powerful tool for detecting seasonality is the Fourier Transform, which converts time-based signals into frequency components, revealing hidden periodic trends.

Applications include:

  • Extracting weekly, monthly, or yearly patterns
  • Smoothing or denoising time series
  • Creating seasonal features for machine learning models

By isolating and leveraging these seasonal signals, analysts can improve forecast accuracy and understand underlying business cycles.

Seasonality <- data_bisnis %>%
  mutate(
    Year = year(Transaction_Date),
    DayOfYear = yday(Transaction_Date),
    DaysInYear = if_else(leap_year(Transaction_Date), 366, 365),
    sin_year = sin(2 * pi * DayOfYear / DaysInYear),
    cos_year = cos(2 * pi * DayOfYear / DaysInYear)
  )

head(Seasonality)
# A tibble: 6 × 19
  Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
  <chr>          <date>           <chr>       <chr>            <chr>     
1 7zmPHxF7XfN9   2021-07-14       BAl3Y7yxev  Clothing         P0370     
2 y4bCY9pKTBWU   2020-11-16       TYY0h5C190  Electronics      P0185     
3 8k0B7XX19Ykf   2023-03-22       nUX640AaXg  Home             P0443     
4 l8ahQz5YNOKz   2023-01-02       sBZyUSJLEP  Home             P0035     
5 kmufgw8wx5qk   2023-06-05       GMfVH2ZWNX  Groceries        P0375     
6 aI0KADT0mn7C   2023-03-15       YxqAmfTU9M  Clothing         P0447     
# ℹ 14 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
#   Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
#   Max_Spent <dbl>, Transaction_Count <int>, Year <dbl>, DayOfYear <dbl>,
#   DaysInYear <dbl>, sin_year <dbl>, cos_year <dbl>
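
The sin/cos pair above encodes a known annual cycle. To detect unknown cycles, a periodogram (computed via the fast Fourier transform) of a regularly spaced series can be inspected. Below is a minimal sketch, assuming sales are first aggregated to a daily series; real transaction data usually has missing dates that would need to be filled in first.

# Aggregate to a daily series (gaps should be filled for a true regular series)
daily_sales <- data_bisnis %>%
  group_by(Transaction_Date) %>%
  summarise(Daily_Total = sum(Total_Price), .groups = "drop") %>%
  arrange(Transaction_Date)

# Periodogram via stats::spec.pgram; the dominant frequency suggests a period
spec <- spec.pgram(daily_sales$Daily_Total, plot = FALSE)
dominant_period <- 1 / spec$freq[which.max(spec$spec)]
dominant_period  # approximate cycle length, in observations (here: days)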

6.9 Practicum

Your task is to demonstrate the various data transformation techniques that you have learned, including Temporal Transformations, Distribution Transformations, Scaling/Normalization, Categorical Encoding, Feature Engineering, Outlier Handling, Binning, Signal/Seasonality (Fourier), and other relevant transformations.

You will be divided into several groups, and each group will receive a dataset. Your objective is to apply the appropriate transformation techniques to the dataset, analyze the outcomes, and present your findings to the class.

6.9.1 Weather

# Ensure all required packages are installed
packages <- c("dplyr", "stringi", "lubridate", "DT")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load libraries
library(dplyr)
library(stringi)
library(lubridate)
library(DT)

# Create a complex dummy weather dataset for data transformation
set.seed(42)
n <- 500

dates <- seq.Date(from = as.Date("2020-01-01"), to = as.Date("2024-12-31"), by = "day")
sample_dates <- sample(dates, n, replace = TRUE)

month <- month(sample_dates)
season <- case_when(
  month %in% c(11, 12, 1, 2) ~ "Rainy Season",
  month %in% c(6, 7, 8, 9) ~ "Dry Season",
  TRUE ~ "Transitional Season"
)

humidity <- round(runif(n, 60, 100), 1)
temperature <- round(rnorm(n, mean = 27, sd = 3), 1)
rainfall <- round(rgamma(n, shape = 2, rate = 0.2), 1)
wind_speed <- round(runif(n, 1, 15), 1)

weather_data <- tibble(
  Observation_ID = stri_rand_strings(n, 12),
  Date = sample_dates,
  Location = sample(c("Jakarta", "Bandung", "Surabaya", "Medan", "Makassar"), n, replace = TRUE),
  Season = season,
  Temperature = temperature,
  Humidity = humidity,
  Rainfall = rainfall,
  Wind_Speed = wind_speed
)

# Show interactive table with download buttons
datatable(
  weather_data,
  extensions = 'Buttons',
  options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    scrollY = "400px",
    scrollCollapse = TRUE,
    paging = FALSE
  ),
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; 
             font-size: 18px; font-weight: bold;',
    'Dummy Weather Dataset'
  ),
  class = 'stripe hover compact'
)

6.9.2 General Health

# Ensure all required packages are installed
packages <- c("dplyr", "stringi", "lubridate", "DT")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load libraries
library(dplyr)
library(stringi)
library(lubridate)
library(DT)

# Create a complex dummy health dataset for data transformation
set.seed(42)
n <- 500

dates <- seq.Date(from = as.Date("2020-01-01"), to = as.Date("2024-12-31"), by = "day")
sample_dates <- sample(dates, n, replace = TRUE)

# Simulate health parameters
age <- sample(18:80, n, replace = TRUE)
bmi <- round(rnorm(n, mean = 25, sd = 4), 1)  # BMI (Body Mass Index)
blood_pressure <- round(rnorm(n, mean = 120, sd = 15), 1)  # Systolic BP
cholesterol <- round(rnorm(n, mean = 200, sd = 40), 1)  # Cholesterol (mg/dL)
glucose <- round(rnorm(n, mean = 90, sd = 20), 1)  # Glucose (mg/dL)
heart_rate <- round(rnorm(n, mean = 75, sd = 10), 1)  # Heart Rate (bpm)

# Simulate location and health condition
location <- sample(c("Jakarta", "Bandung", "Surabaya", "Medan", "Makassar"), n, replace = TRUE)
health_condition <- sample(c("Healthy", "Hypertension", "Diabetes", "Obesity", "Cardiovascular Disease"), n, replace = TRUE)

# Simulate seasonal impact on health
season <- case_when(
  month(sample_dates) %in% c(11, 12, 1, 2) ~ "Rainy Season",
  month(sample_dates) %in% c(6, 7, 8, 9) ~ "Dry Season",
  TRUE ~ "Transitional Season"
)

# Create the health dataset
health_data <- tibble(
  Patient_ID = stri_rand_strings(n, 12),
  Date = sample_dates,
  Age = age,
  BMI = bmi,
  Blood_Pressure = blood_pressure,
  Cholesterol = cholesterol,
  Glucose = glucose,
  Heart_Rate = heart_rate,
  Location = location,
  Health_Condition = health_condition,
  Season = season
)

# Show interactive table with download buttons
datatable(
  health_data,
  extensions = 'Buttons',
  options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    scrollY = "400px",
    scrollCollapse = TRUE,
    paging = FALSE
  ),
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; 
             font-size: 18px; font-weight: bold;',
    'Dummy Health Dataset'
  ),
  class = 'stripe hover compact'
)

6.9.3 Financial Market

# Ensure all required packages are installed
packages <- c("dplyr", "stringi", "lubridate", "DT")
new_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load libraries
library(dplyr)
library(stringi)
library(lubridate)
library(DT)

# Create a complex dummy financial market dataset for data transformation
set.seed(42)
n <- 500

# Dates from 2020-01-01 to 2024-12-31
dates <- seq.Date(from = as.Date("2020-01-01"), to = as.Date("2024-12-31"), by = "day")
sample_dates <- sample(dates, n, replace = TRUE)

# Simulate financial parameters
stock_price <- round(runif(n, 100, 1500), 2)  # Stock price (in USD)
volume_traded <- sample(1000:1000000, n, replace = TRUE)  # Trading volume
market_cap <- round(stock_price * volume_traded, 2)  # Market cap (in USD)
pe_ratio <- round(rnorm(n, mean = 15, sd = 5), 2)  # Price-to-Earnings ratio
dividend_yield <- round(runif(n, 0, 10), 2)  # Dividend Yield (%)
return_on_equity <- round(rnorm(n, mean = 12, sd = 3), 2)  # Return on Equity (%)

# Simulate stock market sector and performance
sector <- sample(c("Technology", "Finance", "Healthcare", "Energy", "Consumer Goods"), n, replace = TRUE)
performance <- sample(c("Positive", "Negative", "Stable"), n, replace = TRUE)

# Create the financial market dataset
financial_data <- tibble(
  Stock_ID = stri_rand_strings(n, 12),
  Date = sample_dates,
  Stock_Price = stock_price,
  Volume_Traded = volume_traded,
  Market_Cap = market_cap,
  PE_Ratio = pe_ratio,
  Dividend_Yield = dividend_yield,
  Return_on_Equity = return_on_equity,
  Sector = sector,
  Performance = performance
)

# Show interactive table with download buttons
datatable(
  financial_data,
  extensions = 'Buttons',
  options = list(
    dom = 'Bfrtip',
    buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
    scrollY = "400px",
    scrollCollapse = TRUE,
    paging = FALSE
  ),
  caption = htmltools::tags$caption(
    style = 'caption-side: top; text-align: left; 
             font-size: 18px; font-weight: bold;',
    'Dummy Financial Market Dataset'
  ),
  class = 'stripe hover compact'
)