An Analysis of Random Forest

Jessica Cramer, Hunter Hillis, Murpys Mendez

Introduction

 

  • Random Forest is a machine learning algorithm that constructs multiple decision trees

  • The method aggregates the predictions of these trees to produce a final classification

  • This approach reduces overfitting compared to a single decision tree

  • Random Forest is known for achieving strong predictive performance

  • Provides practical interpretability through tools like feature importance and OOB error

Why Use Random Forest?

 

  • Decision trees can be unstable and highly sensitive to small changes in data

  • Random Forest improves stability by averaging many trees

  • Introduces randomness through bootstrapped samples of the data and random subsets of features at each split

  • These mechanisms reduce variance and improve generalization
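The variance-reduction effect of averaging can be sketched in base R: if each "tree" returned an independent noisy prediction with unit variance, the variance of an ensemble average of B trees shrinks toward 1/B (a toy illustration, not the project's code):

```r
set.seed(1)
B <- 100  # number of "trees" in the ensemble

# Each replicate averages B independent unit-variance "tree" predictions
ensemble_preds <- replicate(2000, mean(rnorm(B)))

var(ensemble_preds)  # close to 1 / B = 0.01, versus 1 for a single tree
```

Real trees are correlated because their bootstrap samples overlap, so the reduction is smaller in practice; the random feature subsets exist precisely to decorrelate the trees.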

How Random Forest Works

 

  • Each tree is trained on a bootstrap sample of the data

  • At each node, a random subset of predictors is chosen for splitting

  • Final prediction is based on the majority vote of all trees

  • Out-of-bag samples provide a built-in estimate of error

  • The method balances bias and variance through ensemble learning
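The out-of-bag estimate follows directly from bootstrap sampling: each tree's bootstrap sample leaves out roughly 36.8% of observations, since \((1 - 1/n)^n \to e^{-1}\), and those left-out rows serve as a free validation set for that tree. A quick base-R check:

```r
set.seed(42)
n <- 10000

# One bootstrap sample of row indices, drawn with replacement
boot_idx <- sample(n, replace = TRUE)

# Fraction of rows never drawn, i.e. out-of-bag for this tree
oob_frac <- 1 - length(unique(boot_idx)) / n
oob_frac  # approximately exp(-1), about 0.368
```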

Method

 

  • Bootstrap sampling creates new samples by randomly drawing with replacement

  • Trees are grown until all terminal nodes (leaves) are pure or another stopping criterion is met

  • A result is reached by aggregating the predictions of the individual trees

    • Classification: majority vote of the trees (mode)

    • Regression: average of the tree predictions
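The two aggregation rules above can be sketched in base R over hypothetical per-tree predictions:

```r
set.seed(3)

# Classification: 7 hypothetical trees vote on each of 5 observations
votes <- matrix(sample(c("fraud", "ok"), 7 * 5, replace = TRUE), nrow = 5)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))  # mode per row

# Regression: the same trees would instead return numbers, averaged per row
tree_preds <- matrix(rnorm(7 * 5, mean = 10), nrow = 5)
averaged <- rowMeans(tree_preds)
```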

Splitting Criteria

 

At every node the best split is selected based on an impurity measure

 

  • Gini index (classification):   \(i_G(t) = 1 - \sum_{k=1}^{J} p(c_k \mid t)^2\)

 

  • Entropy (classification):   \(i_H(t) = - \sum_{k=1}^{J} p(c_k \mid t) \, \log_2 \big( p(c_k \mid t) \big)\)

 

  • Mean Squared Error (regression):   \(\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\)
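All three criteria are direct to compute; for example, for a node with hypothetical class proportions (0.5, 0.3, 0.2) and a small hypothetical regression node:

```r
p <- c(0.5, 0.3, 0.2)          # hypothetical class proportions at a node
gini    <- 1 - sum(p^2)        # Gini index: 0.62
entropy <- -sum(p * log2(p))   # entropy: about 1.485 bits

y   <- c(3.0, 2.5, 4.0)        # hypothetical responses in a regression node
mse <- mean((y - mean(y))^2)   # node MSE around the node's mean prediction
```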

Hyperparameters

 

Hyperparameter tuning balances model complexity, computation, and generalization

Commonly tuned parameters are:

 

  • ntree: number of trees

  • maxnodes: maximum number of terminal nodes each tree can have

  • mtry: number of features considered at each split

  • nodesize: controls the minimum number of observations a terminal node (leaf) must have

 

Grid search and randomized search are commonly used for hyperparameter tuning, improving model accuracy and robustness in a systematic way (Probst 2019)
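As a reference point for tuning mtry, the R randomForest package defaults to √p features per split for classification and p/3 for regression, where p is the number of predictors; for example, with 30 predictors:

```r
p <- 30  # hypothetical number of predictors

mtry_classification <- floor(sqrt(p))       # randomForest default for classification: 5
mtry_regression     <- max(floor(p / 3), 1) # randomForest default for regression: 10
```

Tuning typically searches a range around these defaults rather than the full 1..p.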

Class Imbalance

 

Random Forests can perform poorly under severe class imbalance because trees tend to favor the majority class

The commonly used Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic minority-class samples to balance the dataset (Zhang 2024)
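At its core, SMOTE creates each synthetic point by interpolating between a minority-class observation and one of its k nearest minority-class neighbors; a one-point sketch with hypothetical coordinates:

```r
set.seed(7)
x        <- c(1.0, 2.0)  # a hypothetical minority-class observation
neighbor <- c(1.4, 2.6)  # one of its k nearest minority-class neighbors
gap      <- runif(1)     # random interpolation weight in [0, 1]

# Synthetic sample lies on the segment between the observation and its neighbor
synthetic <- x + gap * (neighbor - x)
```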

 

Figure 1: Illustration of the Synthetic Minority Oversampling Technique.

Data Description

  • The dataset contains over 140k card transactions, each described by over 300 features and labeled as either fraud or not fraud
  • The target class is highly imbalanced: 92.15% non-fraud vs 7.85% fraud
  • The fraud class has a slightly higher average transaction amount ($88.8 vs $83.1)
  • MasterCard shows the highest fraud rate at 8.9%, followed by Discover at 7.8%

Data Preparation

  • Empty and incomplete features were dropped entirely

  • To consolidate the number of features, PCA was applied

  • Remaining dataset composition:


Card Type      Total Transactions   Fraud Cases   Fraud Rate
Credit                     75,090         6,693         8.9%
Debit                      68,950         4,613         6.7%
Unknown                       178            12         6.7%
Charge Card                    15             0         0.0%

Data Feature Relationships

  • The remaining features show minimal pairwise correlation with one another
Figure 2: Correlation Matrix.

Balancing the Target Class

  • The minority class was upsampled with SMOTE at ratios of 50% and 100%
Figure 3: Oversampling ratio

Optimizing the Model

Hyper-parameters used for tuning the model:

  • Number of Trees
  • Number of Features in each Tree
  • Feature importance computation (the importance flag)
library(randomForest)  # randomForest()
library(caret)         # confusionMatrix()

# Define hyperparameter grid (candidate values defined elsewhere)
param_grid <- expand.grid(
  ntree = ntree_values,
  mtry = mtry_values,
  importance = importance_values
)
results <- data.frame()

# Grid search: train and evaluate a model for each combination
for (i in 1:nrow(param_grid)) {
  params <- param_grid[i, ]
  model <- randomForest(
    as.formula(paste(target, "~ .")),
    data = train,
    ntree = params$ntree,
    mtry = params$mtry,
    importance = params$importance
  )
  preds <- predict(model, newdata = test)
  cm <- confusionMatrix(preds, test[[target]])
  # Record this combination's accuracy so the grid can be compared afterwards
  results <- rbind(results, cbind(params, accuracy = cm$overall["Accuracy"]))
}

Voting method: the final class is the majority vote across all trees

Creating the Model

  • Data was split 80/20 for train and test
  • Grid search was conducted over all three datasets and the three hyper-parameters
  • Model performance was assessed using accuracy and the false positive rate
# Train model
model <- randomForest(
  formula = as.formula(paste(target, "~ .")),
  data = train,
  ntree = params$ntree,
  mtry = params$mtry,
  importance = params$importance
)

# Predict on test set
preds <- predict(model, newdata = test)
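The 80/20 split mentioned above can be sketched in base R (using a toy stand-in data frame, not the project's dataset):

```r
set.seed(123)
# Toy stand-in for the transaction data
df_toy <- data.frame(x = rnorm(100), y = factor(rep(c("fraud", "ok"), 50)))

# Randomly assign 80% of rows to training, the remainder to testing
idx       <- sample(nrow(df_toy), size = floor(0.8 * nrow(df_toy)))
train_toy <- df_toy[idx, ]   # 80 rows for training
test_toy  <- df_toy[-idx, ]  # 20 rows held out for evaluation
```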

Model Evaluation

  • The model was evaluated with the ROC curve and the confusion matrix, achieving an AUC of 0.96 and an accuracy of 0.97

The ROC curves show that all three versions of the data performed relatively well, achieving high true positive rates.

The model correctly identified 26,393 legitimate transactions and 1,549 fraudulent ones, producing 189 false positives and 716 false negatives.

Metric         Value
Accuracy      0.9690
Precision     0.9735
Recall        0.9934
Specificity   0.6830
F1 Score      0.9834
AUC           0.9597
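As a sanity check, these metrics can be recomputed from the reported confusion-matrix counts, treating the legitimate class as positive (caret's first-factor-level convention); small rounding differences from the table are expected:

```r
tp <- 26393  # legitimate transactions correctly identified
tn <- 1549   # fraudulent transactions correctly identified
fp <- 716    # fraudulent transactions predicted legitimate
fn <- 189    # legitimate transactions predicted fraudulent

accuracy    <- (tp + tn) / (tp + tn + fp + fn)                # ~0.969
precision   <- tp / (tp + fp)                                 # ~0.974
recall      <- tp / (tp + fn)                                 # ~0.993
specificity <- tn / (tn + fp)                                 # ~0.683
f1          <- 2 * precision * recall / (precision + recall)  # ~0.983
```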

Feature Importance

Feature importance analysis ranked the predictors by their contribution to the Random Forest model, identifying the features that drive its predictions.


Findings From this Project


  • Applying Random Forest to the dataset showed how preprocessing affects results
  • Balancing the highly imbalanced classes improved the model’s ability to detect target cases
  • Hyperparameter tuning influenced predictive performance
  • Model evaluation metrics helped assess accuracy and overall behavior

Strengths Demonstrated by this Model


  • Capable of achieving strong predictive performance with relatively few tuning requirements
  • Robust against noise and fluctuations in the data
  • Produces interpretable measures such as feature importance
  • Offers out-of-bag error estimates for internal validation

Conclusion


  • Random Forest is a versatile and reliable machine learning algorithm
  • Random Forest is well suited for both classification and regression tasks
  • Handles datasets with complex relationships and many features
  • Maintains strong performance even in the presence of noise, missing data, and outliers
  • The ensemble approach helps to reduce overfitting and increase accuracy

References

Probst, Philipp. 2019. “Hyperparameters, Tuning and Meta-Learning for Random Forest and Other Machine Learning Algorithms.” LMU Munich Dissertations, 1–220. https://edoc.ub.uni-muenchen.de/24557/.
Zhang, Xijie. 2024. “Fraud Detection Using Machine Learning: An Evaluation of Logistic Regression and Random Forest,” 169–73. https://doi.org/10.1109/SCOUT64349.2024.00041.