6 Data Transformation
Data transformation is a critical step in the data analysis process, aimed at converting raw data into more useful and meaningful information. This process involves a variety of techniques used to adjust the format of the data, improve its quality, and convert it into a form that is easier to analyze and understand. Data transformation also plays a key role in enhancing the accuracy and effectiveness of analytical models, especially in the context of handling large and complex datasets.
In the following mind map, we will explore several data transformation techniques that can be used to improve data quality and generate better insights. The mind map covers major categories such as temporal transformations, distribution transformations, normalization, categorical encoding, feature engineering, and more, each with important subcomponents that aid in the data processing workflow.
Dummy Dataset
Below is a dummy business dataset generated in R for this demo. You can also generate a similar dataset yourself using packages such as dplyr, lubridate, and stringi; a sketch of one possible generation script follows the table below. This dataset simulates transaction data from a retail business with the following key elements:
| Column | Description |
|---|---|
| Transaction_ID | Unique ID for each transaction |
| Transaction_Date | Transaction date |
| Customer_ID | Unique ID for the customer |
| Product_Category | Product category (Electronics, Clothing, etc.) |
| Product_ID | Product ID |
| Quantity | Number of items purchased |
| Unit_Price | Price per unit of the product |
| Discount | Transaction discount (0–0.3) |
| Region | Region (North, South, East, West) |
| Sales_Channel | Sales channel (Online / Offline) |
| Total_Price | Total price after applying the discount |
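For reference, here is a minimal sketch of how a dataset with this structure could be generated. The seed, number of rows, and distributions are illustrative assumptions, not the exact script used to build data_bisnis.

library(dplyr)
library(lubridate)
library(stringi)

set.seed(123)   # illustrative seed
n <- 1000       # illustrative number of transactions

data_bisnis <- tibble(
  Transaction_ID   = stri_rand_strings(n, 12),
  Transaction_Date = sample(seq.Date(as.Date("2020-01-01"), as.Date("2023-12-31"),
                                     by = "day"), n, replace = TRUE),
  Customer_ID      = stri_rand_strings(n, 10),
  Product_Category = sample(c("Electronics", "Clothing", "Groceries", "Books", "Home"),
                            n, replace = TRUE),
  Product_ID       = paste0("P", sprintf("%04d", sample(1:500, n, replace = TRUE))),
  Quantity         = sample(1:10, n, replace = TRUE),
  Unit_Price       = round(runif(n, 5, 50), 2),
  Discount         = round(runif(n, 0, 0.3), 2),
  Region           = sample(c("North", "South", "East", "West"), n, replace = TRUE),
  Sales_Channel    = sample(c("Online", "Offline"), n, replace = TRUE)
) %>%
  # Total price after applying the discount
  mutate(Total_Price = round(Quantity * Unit_Price * (1 - Discount), 2))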
6.1 Temporal Transformations
Temporal transformations refer to techniques used to manipulate and extract meaningful patterns from time-based data. These transformations are essential when dealing with time series or datasets containing date and time information. The goal is to uncover trends, seasonal patterns, or lagged relationships that improve the predictive power of models.
Common temporal transformation techniques include:
- Lag, Difference, and Rolling: Capture temporal dependencies and smooth fluctuations.
- Extract Hour, Day, Week, Month, Year: Derive time-based features that often influence behavior or outcomes.
- Cumulative Sum / Count / Mean: Track running totals or averages over time for trend analysis.
These methods help structure time-based data in a way that enables deeper insights and improved decision-making.
6.1.1 Lag, Diff, Rolling
# (Example for Quantity, ordered by transaction date)
Tempral <- data_bisnis %>%
  arrange(Customer_ID, Transaction_Date) %>%
  group_by(Customer_ID) %>%
  mutate(
    Lag_Quantity = lag(Quantity),
    Diff_Quantity = Quantity - lag(Quantity),
    RollingMean_3 = zoo::rollapply(Quantity, width = 3, FUN = mean, fill = NA, align = 'right')
  ) %>%
  ungroup()

head(Tempral)
# A tibble: 6 × 14
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 1PaWvGiAokHB 2023-02-09 02kNipUny9 Groceries P0253
2 IYlg0MAGwsUw 2020-11-30 04npfK5VJa Books P0113
3 0kgKWZOpEp6Z 2020-03-18 0NwgTyo7P2 Clothing P0445
4 ddFsMFlQOa2J 2023-05-19 0PlvCaxPuS Electronics P0298
5 VOCzzxDOxQps 2020-07-07 0S9qOEuCr9 Clothing P0353
6 FyRCpwyOqKWK 2021-12-02 0dEDsy4fWM Groceries P0175
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Lag_Quantity <int>,
# Diff_Quantity <int>, RollingMean_3 <lgl>
6.1.2 Extract Date
Extract <- data_bisnis %>%
  mutate(
    Day_of_Week = weekdays(Transaction_Date),
    Month = month(Transaction_Date, label = TRUE),
    Year = year(Transaction_Date),
    Is_Weekend = ifelse(Day_of_Week %in% c("Saturday", "Sunday"), 1, 0),
    Region = as.factor(Region),
    Product_Category = as.factor(Product_Category)
  ) %>%
  bind_cols(
    as.data.frame(model.matrix(~ Region - 1, data = .)),
    as.data.frame(model.matrix(~ Product_Category - 1, data = .))
  )

# View the first few rows
head(Extract)
# A tibble: 6 × 24
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <fct> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 19 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <fct>, Sales_Channel <chr>, Total_Price <dbl>, Day_of_Week <chr>,
# Month <ord>, Year <dbl>, Is_Weekend <dbl>, RegionEast <dbl>,
# RegionNorth <dbl>, RegionSouth <dbl>, RegionWest <dbl>,
# Product_CategoryBooks <dbl>, Product_CategoryClothing <dbl>,
# Product_CategoryElectronics <dbl>, Product_CategoryGroceries <dbl>,
# Product_CategoryHome <dbl>
6.1.3 Cumulative Values
Tempral3 <- data_bisnis %>%
  group_by(Customer_ID) %>%
  mutate(
    Cumulative_Quantity = cumsum(Quantity),
    Cumulative_Spend = cumsum(Total_Price),
    Cumulative_AvgPrice = cummean(Unit_Price)
  ) %>%
  ungroup()

head(Tempral3)
# A tibble: 6 × 14
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>,
# Cumulative_Quantity <int>, Cumulative_Spend <dbl>,
# Cumulative_AvgPrice <dbl>
6.2 Distribution Transformations
Distribution transformations are techniques used to modify the shape of a dataset’s distribution, making it more suitable for analysis or modeling. Many statistical models assume that data follows a normal distribution; therefore, transforming skewed or irregularly distributed data can lead to better model performance and more reliable insights.
Key distribution transformation techniques include:
- Log Transform: Compresses large values and reduces right-skewness.
- Box-Cox and Yeo-Johnson: Flexible transformations that stabilize variance and normalize data.
- Variance Stabilization / Skew Reduction: Improves symmetry and homoscedasticity in datasets.
These transformations are particularly useful when dealing with highly skewed data, outliers, or heteroscedastic variance, allowing for more robust and interpretable analyses.
6.2.1 Log Transform
min_positive <- min(data_bisnis$Total_Price[data_bisnis$Total_Price > 0])

Log <- data_bisnis %>%
  mutate(
    Safe_Total_Price = ifelse(Total_Price <= 0, min_positive, Total_Price),
    Log_Total_Price = log1p(Safe_Total_Price)
  )

head(Log)
# A tibble: 6 × 13
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>,
# Safe_Total_Price <dbl>, Log_Total_Price <dbl>
6.2.2 Box-Cox
library(bestNormalize)
Box_Cox <- data_bisnis %>%
  mutate(
    YeoJ_Quantity = bestNormalize(Quantity)$x.t,
    YeoJ_TotalPrice = bestNormalize(Total_Price)$x.t
  )

head(Box_Cox)
# A tibble: 6 × 13
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, YeoJ_Quantity <dbl>,
# YeoJ_TotalPrice <dbl>
6.2.3 Stabilize Variance
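This subsection has no worked example in the original material. As a minimal sketch, a square-root transform is a common variance stabilizer for count-like variables; using Quantity here is an illustrative assumption.

# Square-root transform as a simple variance stabilizer for count-like data
# (illustrative sketch; Quantity is used as an example count variable)
Stabilized <- data_bisnis %>%
  mutate(Sqrt_Quantity = sqrt(Quantity))

head(Stabilized)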
6.3 Scaling & Normalization
Scaling and normalization are essential preprocessing steps in data transformation that ensure all numerical features contribute equally to model performance. These techniques adjust the range or distribution of data values, especially when features have different units or magnitudes.
Common techniques include:
- Min-Max Scaling: Rescales values to a fixed range (typically 0 to 1).
- Standardization (Z-score): Centers data around zero with unit variance.
- Robust Scaling: Uses the median and IQR, making it more resistant to outliers (a sketch appears at the end of this section).
Applying these transformations improves the convergence and accuracy of algorithms such as gradient descent and distance-based models (e.g., k-NN, SVM), which are sensitive to feature magnitude.
# Z-Score Standardization
data_std <- data_bisnis %>%
  mutate(
    Quantity_Std = scale(Quantity),
    Unit_Price_Std = scale(Unit_Price)
  )

head(data_std)
# A tibble: 6 × 13
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>,
# Quantity_Std <dbl[,1]>, Unit_Price_Std <dbl[,1]>
# Min-Max Normalization
data_norm <- data_bisnis %>%
  mutate(
    Quantity_Norm = (Quantity - min(Quantity)) / (max(Quantity) - min(Quantity)),
    Total_Price_Norm = (Total_Price - min(Total_Price)) / (max(Total_Price) - min(Total_Price))
  )

head(data_norm)
# A tibble: 6 × 13
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Quantity_Norm <dbl>,
# Total_Price_Norm <dbl>
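Robust scaling, listed above but not demonstrated, can be sketched with base R as follows; the columns scaled here are illustrative choices.

# Robust Scaling: center on the median and scale by the IQR
data_robust <- data_bisnis %>%
  mutate(
    Quantity_Robust   = (Quantity - median(Quantity)) / IQR(Quantity),
    Unit_Price_Robust = (Unit_Price - median(Unit_Price)) / IQR(Unit_Price)
  )

head(data_robust)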
6.4 Categorical Encoding
Categorical encoding transforms non-numeric (categorical) variables into a numerical format that can be used by machine learning algorithms. Since most models cannot handle text or label data directly, encoding is crucial for integrating categorical features into predictive modeling.
Popular encoding techniques include:
- One-Hot Encoding: Creates binary columns for each category, useful for nominal data.
- Label Encoding: Assigns each category a unique integer, best for ordinal variables (a short sketch appears at the end of this section).
- Frequency Encoding: Replaces categories with their frequency or count in the dataset.
Choosing the right encoding method helps preserve the meaning of categorical data while maintaining model performance and interpretability.
6.4.1 One-hot Encoding
library(fastDummies)
one_hot <- dummy_cols(data_bisnis,
                      select_columns = c("Region",
                                         "Sales_Channel",
                                         "Product_Category"),
                      remove_first_dummy = TRUE,
                      remove_selected_columns = TRUE)

head(one_hot)
# A tibble: 6 × 16
Transaction_ID Transaction_Date Customer_ID Product_ID Quantity Unit_Price
<chr> <date> <chr> <chr> <int> <dbl>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev P0370 2 15.2
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 P0185 5 10.2
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg P0443 3 17.7
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP P0035 6 28.3
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX P0375 3 11.9
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M P0447 1 5.43
# ℹ 10 more variables: Discount <dbl>, Total_Price <dbl>, Region_North <int>,
# Region_South <int>, Region_West <int>, Sales_Channel_Online <int>,
# Product_Category_Clothing <int>, Product_Category_Electronics <int>,
# Product_Category_Groceries <int>, Product_Category_Home <int>
6.4.2 Frequency Encoding
freq_enc <- function(col) {
  tab <- table(col)
  return(as.numeric(tab[col]) / length(col))
}

Frequency <- data_bisnis %>%
  mutate(
    Region_freq = freq_enc(Region),
    Product_Category_freq = freq_enc(Product_Category)
  )

head(Frequency)
# A tibble: 6 × 13
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 8 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Region_freq <dbl>,
# Product_Category_freq <dbl>
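Label encoding, mentioned in the list above, is not shown in the original examples. A minimal sketch using base R factors follows; note that factor() orders levels alphabetically, so the integer codes are only meaningful for genuinely ordinal variables.

# Label Encoding: map each category to an integer code
label_encoded <- data_bisnis %>%
  mutate(
    Region_Label           = as.integer(factor(Region)),
    Product_Category_Label = as.integer(factor(Product_Category))
  )

head(label_encoded)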
6.5 Feature Engineering
Feature engineering is the process of creating new input features from existing data to improve model performance. It involves extracting, transforming, or combining variables in ways that make patterns more accessible to machine learning algorithms.
Key strategies include:
- Mathematical combinations: Creating interaction terms (e.g., price × quantity) or ratios (e.g., discount / price).
- Domain-driven insights: Building features based on domain knowledge (e.g., efficiency, customer value).
- Statistical summaries: Aggregating features by group (e.g., average purchase per customer).
- Parsing IDs or Text: Extracting patterns or categories from strings.
Effective feature engineering often leads to significant boosts in model accuracy and interpretability, making it one of the most impactful steps in a data pipeline.
# 1. New Features from Raw Data
# 2. Product of Features, Crossed Terms
# 3. Price per Unit, Efficiency
# 4. Ranking, Percentile
# 5. From IDs: Prefix, Length, Pattern
# 6. Avg, Sum, Count by Group
Feature_Eng <- data_bisnis %>%
  mutate(
    Price_per_Unit = Total_Price / Quantity,
    ID_Prefix = substr(Product_ID, 1, 2),
    ID_Length = nchar(Product_ID),
    Discount_Level = ntile(Discount, 4)
  ) %>%
  group_by(Region) %>%
  mutate(
    Avg_Quantity_Region = mean(Quantity),
    Sum_Sales_Region = sum(Total_Price)
  ) %>%
  ungroup()

head(Feature_Eng)
# A tibble: 6 × 17
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 12 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Price_per_Unit <dbl>,
# ID_Prefix <chr>, ID_Length <int>, Discount_Level <int>,
# Avg_Quantity_Region <dbl>, Sum_Sales_Region <dbl>
6.5.1 Interaction Features
# Interaction between Quantity and Unit_Price
data_interaction <- data_bisnis %>%
  mutate(
    Price_Impact = Quantity * Unit_Price
  )

head(data_interaction)
# A tibble: 6 × 12
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 7 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Price_Impact <dbl>
6.5.2 Ratio Features
# Discount to Price Ratio
data_ratio <- data_bisnis %>%
  mutate(
    Discount_Ratio = Discount / (Unit_Price + 1e-5)  # Avoid division by zero
  )

head(data_ratio)
# A tibble: 6 × 12
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 7 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Discount_Ratio <dbl>
6.5.3 Group Aggregation
# Total sales by each Customer
customer_sales <- data_bisnis %>%
  group_by(Customer_ID) %>%
  summarise(
    Avg_Spent = mean(Total_Price),
    Max_Spent = max(Total_Price),
    Transaction_Count = n()
  )

# Join with main data
data_bisnis <- left_join(data_bisnis, customer_sales, by = "Customer_ID")

head(data_bisnis)
# A tibble: 6 × 14
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>
6.5.4 Rank Transformation
data_ranked <- data_bisnis %>%
  mutate(
    Sales_Rank = rank(-Total_Price)  # Higher total price gets lower rank number
  )

head(data_ranked)
# A tibble: 6 × 15
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>, Sales_Rank <dbl>
6.5.5 Text Cleaning & Feature Creation
# Create numeric suffix from Product_ID
data_text <- data_bisnis %>%
  mutate(
    Product_Num = as.numeric(gsub("P", "", Product_ID))
  )

head(data_text)
# A tibble: 6 × 15
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>, Product_Num <dbl>
6.5.6 Cumulative Features
data_cumulative <- data_bisnis %>%
  arrange(Customer_ID, Transaction_Date) %>%
  group_by(Customer_ID) %>%
  mutate(
    Cumulative_Sales = cumsum(Total_Price)
  )

head(data_cumulative)
# A tibble: 6 × 15
# Groups: Customer_ID [6]
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 1PaWvGiAokHB 2023-02-09 02kNipUny9 Groceries P0253
2 IYlg0MAGwsUw 2020-11-30 04npfK5VJa Books P0113
3 0kgKWZOpEp6Z 2020-03-18 0NwgTyo7P2 Clothing P0445
4 ddFsMFlQOa2J 2023-05-19 0PlvCaxPuS Electronics P0298
5 VOCzzxDOxQps 2020-07-07 0S9qOEuCr9 Clothing P0353
6 FyRCpwyOqKWK 2021-12-02 0dEDsy4fWM Groceries P0175
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>, Cumulative_Sales <dbl>
6.6 Outlier Handling
Outlier handling refers to the identification and treatment of data points that significantly deviate from other observations. Outliers can distort statistical analyses and model predictions if not managed properly.
Common techniques include:
- Z-Score: Identifies how many standard deviations a point is from the mean.
- Interquartile Range (IQR): Flags values that fall far outside the middle 50% of data.
- Winsorizing: Limits extreme values by capping them at a certain percentile (a sketch appears at the end of this section).
Deciding whether to keep, transform, or remove outliers depends on their cause and the goals of the analysis. Proper handling ensures that models are not overly influenced by anomalous data, leading to more stable and accurate outcomes.
# Z-score method for detecting outliers
z_scores <- scale(data_bisnis$Quantity)
data_outliers <- data_bisnis %>%
  mutate(Outlier_Flag = ifelse(abs(z_scores) > 3, "Outlier", "Normal"))

# IQR method for detecting and removing outliers
Q1 <- quantile(data_bisnis$Total_Price, 0.25)
Q3 <- quantile(data_bisnis$Total_Price, 0.75)
IQR_val <- Q3 - Q1
data_outliers <- data_bisnis %>%
  filter(Total_Price > (Q1 - 1.5 * IQR_val) & Total_Price < (Q3 + 1.5 * IQR_val))

head(data_outliers)
# A tibble: 6 × 14
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
5 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
6 zu0iP1OBFuNI 2021-09-25 ifYw95qmoL Groceries P0345
# ℹ 9 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>
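Winsorizing, listed above but not demonstrated, can be sketched with base R as follows; the 5th/95th percentile cut-offs are illustrative assumptions.

# Winsorizing: cap extreme Total_Price values at the 5th and 95th percentiles
p05 <- quantile(data_bisnis$Total_Price, 0.05)
p95 <- quantile(data_bisnis$Total_Price, 0.95)

data_winsor <- data_bisnis %>%
  mutate(Total_Price_Winsor = pmin(pmax(Total_Price, p05), p95))

head(data_winsor)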
6.7 Discretization
Discretization, also known as binning, is the process of converting continuous numerical variables into categorical intervals or “bins.” This technique simplifies data, improves model performance (especially for tree-based models), and enhances interpretability.
Common approaches include:
- Equal-width binning: Divides the range of values into intervals of equal size (a sketch appears at the end of this section).
- Equal-frequency binning: Ensures each bin contains roughly the same number of observations.
- Custom bins: Created based on domain knowledge or statistical thresholds (e.g., age groups, income brackets).
Binning is especially useful when the exact values of a variable are less important than the range they fall into. It can also help deal with non-linearity and reduce the impact of outliers.
binning <- data_bisnis %>%
  mutate(
    Price_Level = cut(Unit_Price,
                      breaks = quantile(Unit_Price, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                      labels = c("Low", "Medium", "High"),
                      include.lowest = TRUE)
  )

head(binning)
# A tibble: 6 × 15
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 10 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>, Price_Level <fct>
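The example above uses quantile-based breaks, which behaves like equal-frequency binning. For completeness, an equal-width binning sketch is shown below; the number of bins and labels are illustrative assumptions.

# Equal-width binning: split Unit_Price into three intervals of equal width
equal_width <- data_bisnis %>%
  mutate(
    Price_Bin = cut(Unit_Price, breaks = 3, labels = c("Low", "Medium", "High"))
  )

table(equal_width$Price_Bin)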
6.8 Seasonality
Seasonality and signal decomposition involve identifying recurring patterns or cyclical behaviors in time-series data. These patterns can significantly enhance predictive models, especially in fields like sales forecasting, weather prediction, or traffic analysis.
One powerful tool for detecting seasonality is the Fourier Transform, which converts time-based signals into frequency components, revealing hidden periodic trends.
Applications include:
- Extracting weekly, monthly, or yearly patterns
- Smoothing or denoising time series
- Creating seasonal features for machine learning models
By isolating and leveraging these seasonal signals, analysts can improve forecast accuracy and understand underlying business cycles.
Seasonality <- data_bisnis %>%
  mutate(
    Year = year(Transaction_Date),
    DayOfYear = yday(Transaction_Date),
    DaysInYear = if_else(leap_year(Transaction_Date), 366, 365),
    sin_year = sin(2 * pi * DayOfYear / DaysInYear),
    cos_year = cos(2 * pi * DayOfYear / DaysInYear)
  )

head(Seasonality)
# A tibble: 6 × 19
Transaction_ID Transaction_Date Customer_ID Product_Category Product_ID
<chr> <date> <chr> <chr> <chr>
1 7zmPHxF7XfN9 2021-07-14 BAl3Y7yxev Clothing P0370
2 y4bCY9pKTBWU 2020-11-16 TYY0h5C190 Electronics P0185
3 8k0B7XX19Ykf 2023-03-22 nUX640AaXg Home P0443
4 l8ahQz5YNOKz 2023-01-02 sBZyUSJLEP Home P0035
5 kmufgw8wx5qk 2023-06-05 GMfVH2ZWNX Groceries P0375
6 aI0KADT0mn7C 2023-03-15 YxqAmfTU9M Clothing P0447
# ℹ 14 more variables: Quantity <int>, Unit_Price <dbl>, Discount <dbl>,
# Region <chr>, Sales_Channel <chr>, Total_Price <dbl>, Avg_Spent <dbl>,
# Max_Spent <dbl>, Transaction_Count <int>, Year <dbl>, DayOfYear <dbl>,
# DaysInYear <dbl>, sin_year <dbl>, cos_year <dbl>
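Beyond sine/cosine features, seasonal and trend components can also be separated explicitly. The sketch below aggregates Total_Price by month and decomposes the series with stl(); it assumes every month between the first and last transaction is present, so that the monthly series is regular.

# Monthly aggregation and STL decomposition (illustrative sketch)
monthly_sales <- data_bisnis %>%
  mutate(Month = floor_date(Transaction_Date, "month")) %>%
  group_by(Month) %>%
  summarise(Monthly_Total = sum(Total_Price), .groups = "drop") %>%
  arrange(Month)

sales_ts <- ts(monthly_sales$Monthly_Total,
               start = c(year(min(monthly_sales$Month)), month(min(monthly_sales$Month))),
               frequency = 12)

decomp <- stl(sales_ts, s.window = "periodic")
plot(decomp)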
6.9 Practicum
Your task is to demonstrate the various data transformation techniques that you have learned, including Temporal Transformations, Distribution Transformations, Scaling/Normalization, Categorical Encoding, Feature Engineering, Outlier Handling, Binning, Signal/Seasonality (Fourier), and other relevant transformations.
You will be divided into several groups, and each group will receive a dataset. Your objective is to apply the appropriate transformation techniques to the dataset, analyze the outcomes, and present your findings to the class.
6.9.1 Weather
# Ensure all required packages are installed
<- c("dplyr", "stringi", "lubridate", "DT")
packages <- packages[!(packages %in% installed.packages()[, "Package"])]
new_packages if(length(new_packages)) install.packages(new_packages)
# Load libraries
library(dplyr)
library(stringi)
library(lubridate)
library(DT)
# Create a complex dummy weather dataset for data transformation
set.seed(42)
<- 500
n
<- seq.Date(from = as.Date("2020-01-01"), to = as.Date("2024-12-31"), by = "day")
dates <- sample(dates, n, replace = TRUE)
sample_dates
<- month(sample_dates)
month <- case_when(
season %in% c(11, 12, 1, 2) ~ "Rainy Season",
month %in% c(6, 7, 8, 9) ~ "Dry Season",
month TRUE ~ "Transitional Season"
)
<- round(runif(n, 60, 100), 1)
humidity <- round(rnorm(n, mean = 27, sd = 3), 1)
temperature <- round(rgamma(n, shape = 2, rate = 0.2), 1)
rainfall <- round(runif(n, 1, 15), 1)
wind_speed
<- tibble(
weather_data Observation_ID = stri_rand_strings(n, 12),
Date = sample_dates,
Location = sample(c("Jakarta", "Bandung", "Surabaya", "Medan", "Makassar"), n, replace = TRUE),
Season = season,
Temperature = temperature,
Humidity = humidity,
Rainfall = rainfall,
Wind_Speed = wind_speed
)
# Show interactive table with download buttons
datatable(
weather_data,extensions = 'Buttons',
options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
scrollY = "400px",
scrollCollapse = TRUE,
paging = FALSE
),caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;
font-size: 18px; font-weight: bold;'
),class = 'stripe hover compact'
)
6.9.2 General Health
# Ensure all required packages are installed
<- c("dplyr", "stringi", "lubridate", "DT")
packages <- packages[!(packages %in% installed.packages()[, "Package"])]
new_packages if(length(new_packages)) install.packages(new_packages)
# Load libraries
library(dplyr)
library(stringi)
library(lubridate)
library(DT)
# Create a complex dummy health dataset for data transformation
set.seed(42)
<- 500
n
<- seq.Date(from = as.Date("2020-01-01"), to = as.Date("2024-12-31"), by = "day")
dates <- sample(dates, n, replace = TRUE)
sample_dates
# Simulate health parameters
<- sample(18:80, n, replace = TRUE)
age <- round(rnorm(n, mean = 25, sd = 4), 1) # BMI (Body Mass Index)
bmi <- round(rnorm(n, mean = 120, sd = 15), 1) # Systolic BP
blood_pressure <- round(rnorm(n, mean = 200, sd = 40), 1) # Cholesterol (mg/dL)
cholesterol <- round(rnorm(n, mean = 90, sd = 20), 1) # Glucose (mg/dL)
glucose <- round(rnorm(n, mean = 75, sd = 10), 1) # Heart Rate (bpm)
heart_rate
# Simulate location and health condition
<- sample(c("Jakarta", "Bandung", "Surabaya", "Medan", "Makassar"), n, replace = TRUE)
location <- sample(c("Healthy", "Hypertension", "Diabetes", "Obesity", "Cardiovascular Disease"), n, replace = TRUE)
health_condition
# Simulate seasonal impact on health
<- case_when(
season month(sample_dates) %in% c(11, 12, 1, 2) ~ "Rainy Season",
month(sample_dates) %in% c(6, 7, 8, 9) ~ "Dry Season",
TRUE ~ "Transitional Season"
)
# Create the health dataset
<- tibble(
health_data Patient_ID = stri_rand_strings(n, 12),
Date = sample_dates,
Age = age,
BMI = bmi,
Blood_Pressure = blood_pressure,
Cholesterol = cholesterol,
Glucose = glucose,
Heart_Rate = heart_rate,
Location = location,
Health_Condition = health_condition,
Season = season
)
# Show interactive table with download buttons
datatable(
health_data,extensions = 'Buttons',
options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
scrollY = "400px",
scrollCollapse = TRUE,
paging = FALSE
),caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;
font-size: 18px; font-weight: bold;'
),class = 'stripe hover compact'
)
6.9.3 Financial Market
# Ensure all required packages are installed
<- c("dplyr", "stringi", "lubridate", "DT")
packages <- packages[!(packages %in% installed.packages()[, "Package"])]
new_packages if(length(new_packages)) install.packages(new_packages)
# Load libraries
library(dplyr)
library(stringi)
library(lubridate)
library(DT)
# Create a complex dummy financial market dataset for data transformation
set.seed(42)
<- 500
n
# Dates from 2020-01-01 to 2024-12-31
<- seq.Date(from = as.Date("2020-01-01"), to = as.Date("2024-12-31"), by = "day")
dates <- sample(dates, n, replace = TRUE)
sample_dates
# Simulate financial parameters
<- round(runif(n, 100, 1500), 2) # Stock price (in USD)
stock_price <- sample(1000:1000000, n, replace = TRUE) # Trading volume
volume_traded <- round(stock_price * volume_traded, 2) # Market cap (in USD)
market_cap <- round(rnorm(n, mean = 15, sd = 5), 2) # Price-to-Earnings ratio
pe_ratio <- round(runif(n, 0, 10), 2) # Dividend Yield (%)
dividend_yield <- round(rnorm(n, mean = 12, sd = 3), 2) # Return on Equity (%)
return_on_equity
# Simulate stock market sector and performance
<- sample(c("Technology", "Finance", "Healthcare", "Energy", "Consumer Goods"), n, replace = TRUE)
sector <- sample(c("Positive", "Negative", "Stable"), n, replace = TRUE)
performance
# Create the financial market dataset
<- tibble(
financial_data Stock_ID = stri_rand_strings(n, 12),
Date = sample_dates,
Stock_Price = stock_price,
Volume_Traded = volume_traded,
Market_Cap = market_cap,
PE_Ratio = pe_ratio,
Dividend_Yield = dividend_yield,
Return_on_Equity = return_on_equity,
Sector = sector,
Performance = performance
)
# Show interactive table with download buttons
datatable(
financial_data,extensions = 'Buttons',
options = list(
dom = 'Bfrtip',
buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
scrollY = "400px",
scrollCollapse = TRUE,
paging = FALSE
),caption = htmltools::tags$caption(
style = 'caption-side: top; text-align: left;
font-size: 18px; font-weight: bold;'
),class = 'stripe hover compact'
)