An Analysis of Random Forest

Jessica Cramer, Hunter Hillis, Murpys Mendez

Introduction

 

  • Random Forest is a machine learning algorithm that constructs multiple decision trees

  • The method aggregates the predictions of these trees to produce a final classification

  • This approach reduces overfitting compared to a single decision tree

  • Random Forest is known for achieving strong predictive performance

  • Provides practical interpretability through tools like feature importance and OOB error

Why Use Random Forest?

 

  • Decision trees can be unstable and highly sensitive to small changes in data

  • Random Forest improves stability by averaging many trees

  • Introduces randomness through bootstrapped samples of the data and random subsets of features at each split

  • These mechanisms reduce variance and improve generalization
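The variance-reduction effect of averaging can be sketched in base R: if each "tree" returned an independent noisy prediction with unit variance, the variance of an ensemble average of B trees shrinks toward 1/B (a toy illustration, not the project's code):

```r
set.seed(1)
B <- 100  # number of "trees" in the ensemble

# Each replicate averages B independent unit-variance "tree" predictions
ensemble_preds <- replicate(2000, mean(rnorm(B)))

var(ensemble_preds)  # close to 1 / B = 0.01, versus 1 for a single tree
```

Real trees are correlated because their bootstrap samples overlap, so the reduction is smaller in practice; the random feature subsets exist precisely to decorrelate the trees.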

How Random Forest Works

 

  • Each tree is trained on a bootstrap sample of the data

  • At each node, a random subset of predictors is chosen for splitting

  • Final prediction is based on the majority vote of all trees

  • Out-of-bag samples provide a built-in estimate of error

  • The method balances bias and variance through ensemble learning
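The out-of-bag estimate follows directly from bootstrap sampling: each tree's bootstrap sample leaves out roughly 36.8% of observations, since \((1 - 1/n)^n \to e^{-1}\), and those left-out rows serve as a free validation set for that tree. A quick base-R check:

```r
set.seed(42)
n <- 10000

# One bootstrap sample of row indices, drawn with replacement
boot_idx <- sample(n, replace = TRUE)

# Fraction of rows never drawn, i.e. out-of-bag for this tree
oob_frac <- 1 - length(unique(boot_idx)) / n
oob_frac  # approximately exp(-1), about 0.368
```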

Method

 

  • Bootstrap sampling creates new samples by randomly drawing with replacement

  • Trees are grown until all terminal nodes (leaves) are pure or another stopping criterion is met

  • A result is reached by aggregating the predictions of the individual trees

    • Classification: majority vote of the trees (mode)

    • Regression: average of the tree predictions
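The two aggregation rules above can be sketched in base R over hypothetical per-tree predictions:

```r
set.seed(3)

# Classification: 7 hypothetical trees vote on each of 5 observations
votes <- matrix(sample(c("fraud", "ok"), 7 * 5, replace = TRUE), nrow = 5)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))  # mode per row

# Regression: the same trees would instead return numbers, averaged per row
tree_preds <- matrix(rnorm(7 * 5, mean = 10), nrow = 5)
averaged <- rowMeans(tree_preds)
```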

Splitting Criteria

 

At every node the best split is selected based on an impurity measure

 

  • Gini index (classification):   \(i_G(t) = 1 - \sum_{k=1}^{J} p(c_k \mid t)^2\)

 

  • Entropy (classification):   \(i_H(t) = - \sum_{k=1}^{J} p(c_k \mid t) \, \log_2 \big( p(c_k \mid t) \big)\)

 

  • Mean Squared Error (regression):   \(\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\)
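All three criteria are direct to compute; for example, for a node with hypothetical class proportions (0.5, 0.3, 0.2) and a small hypothetical regression node:

```r
p <- c(0.5, 0.3, 0.2)          # hypothetical class proportions at a node
gini    <- 1 - sum(p^2)        # Gini index: 0.62
entropy <- -sum(p * log2(p))   # entropy: about 1.485 bits

y   <- c(3.0, 2.5, 4.0)        # hypothetical responses in a regression node
mse <- mean((y - mean(y))^2)   # node MSE around the node's mean prediction
```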

Hyperparameters

 

Hyperparameter tuning balances model complexity, computation, and generalization

Commonly tuned parameters are:

 

  • ntree: number of trees

  • maxnodes: maximum number of terminal nodes each tree can have

  • mtry: number of features considered at each split

  • nodesize: controls the minimum number of observations a terminal node (leaf) must have

 

Grid search and randomized search are commonly used for hyperparameter tuning, improving model accuracy and robustness in a systematic way (Probst 2019)
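As a reference point for tuning mtry, the R randomForest package defaults to √p features per split for classification and p/3 for regression, where p is the number of predictors; for example, with 30 predictors:

```r
p <- 30  # hypothetical number of predictors

mtry_classification <- floor(sqrt(p))       # randomForest default for classification: 5
mtry_regression     <- max(floor(p / 3), 1) # randomForest default for regression: 10
```

Tuning typically searches a range around these defaults rather than the full 1..p.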

Class Imbalance

 

Random Forests can perform poorly under severe class imbalance because trees tend to favor the majority class

The commonly used Synthetic Minority Over-sampling Technique (SMOTE) generates synthetic minority-class samples to balance the dataset (Zhang 2024)
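At its core, SMOTE creates each synthetic point by interpolating between a minority-class observation and one of its k nearest minority-class neighbors; a one-point sketch with hypothetical coordinates:

```r
set.seed(7)
x        <- c(1.0, 2.0)  # a hypothetical minority-class observation
neighbor <- c(1.4, 2.6)  # one of its k nearest minority-class neighbors
gap      <- runif(1)     # random interpolation weight in [0, 1]

# Synthetic sample lies on the segment between the observation and its neighbor
synthetic <- x + gap * (neighbor - x)
```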

 

Figure 1: Illustration of the Synthetic Minority Oversampling Technique.

Data Description

  • The dataset contains over 140k card transactions, each described by over 300 features and labeled as either fraud or not fraud
  • The target class is highly imbalanced: 92.15% non-fraud vs 7.85% fraud
  • The fraud class has a slightly higher average transaction amount ($88.8 vs $83.1)
  • MasterCard shows the highest fraud rate at 8.9%, followed by Discover at 7.8%

Data Preparation

  • Empty and incomplete features were dropped entirely

  • To consolidate the number of features, PCA was applied

  • Remaining dataset composition:


Card Type      Total Transactions   Fraud Cases   Fraud Rate
Credit                     75,090         6,693         8.9%
Debit                      68,950         4,613         6.7%
Unknown                       178            12         6.7%
Charge Card                    15             0         0.0%

Data Feature Relationships

  • The remaining features show minimal pairwise correlation with one another
Figure 2: Correlation Matrix.

Balancing the Target Class

  • The minority class was upsampled with SMOTE at ratios of 50% and 100%
Figure 3: Oversampling ratio

Optimizing the Model

Hyper-parameters used for tuning the model:

  • Number of Trees
  • Number of Features in each Tree
  • Feature importance computation (the importance flag)
library(randomForest)  # randomForest()
library(caret)         # confusionMatrix()

# Define hyperparameter grid (candidate values defined elsewhere)
param_grid <- expand.grid(
  ntree = ntree_values,
  mtry = mtry_values,
  importance = importance_values
)
results <- data.frame()

# Grid search: train and evaluate a model for each combination
for (i in 1:nrow(param_grid)) {
  params <- param_grid[i, ]
  model <- randomForest(
    as.formula(paste(target, "~ .")),
    data = train,
    ntree = params$ntree,
    mtry = params$mtry,
    importance = params$importance
  )
  preds <- predict(model, newdata = test)
  cm <- confusionMatrix(preds, test[[target]])
  # Record this combination's accuracy so the grid can be compared afterwards
  results <- rbind(results, cbind(params, accuracy = cm$overall["Accuracy"]))
}

Voting method: the final class is the majority vote across all trees

Creating the Model

  • Data was split 80/20 for train and test
  • Grid search was conducted over all three datasets and the three hyper-parameters
  • Model performance was assessed using accuracy and the false positive rate
# Train model
model <- randomForest(
  formula = as.formula(paste(target, "~ .")),
  data = train,
  ntree = params$ntree,
  mtry = params$mtry,
  importance = params$importance
)

# Predict on test set
preds <- predict(model, newdata = test)
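The 80/20 split mentioned above can be sketched in base R (using a toy stand-in data frame, not the project's dataset):

```r
set.seed(123)
# Toy stand-in for the transaction data
df_toy <- data.frame(x = rnorm(100), y = factor(rep(c("fraud", "ok"), 50)))

# Randomly assign 80% of rows to training, the remainder to testing
idx       <- sample(nrow(df_toy), size = floor(0.8 * nrow(df_toy)))
train_toy <- df_toy[idx, ]   # 80 rows for training
test_toy  <- df_toy[-idx, ]  # 20 rows held out for evaluation
```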

Model Evaluation

  • The model was evaluated with the ROC curve and the confusion matrix, achieving an AUC of 0.96 and an accuracy of 0.97

The ROC curves show that all three versions of the data performed relatively well, achieving high true positive rates.

The model correctly identified 26,393 legitimate transactions and 1,549 fraudulent ones, producing 189 false positives and 716 false negatives.

Metric         Value
Accuracy      0.9690
Precision     0.9735
Recall        0.9934
Specificity   0.6830
F1 Score      0.9834
AUC           0.9597
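As a sanity check, these metrics can be recomputed from the reported confusion-matrix counts, treating the legitimate class as positive (caret's first-factor-level convention); small rounding differences from the table are expected:

```r
tp <- 26393  # legitimate transactions correctly identified
tn <- 1549   # fraudulent transactions correctly identified
fp <- 716    # fraudulent transactions predicted legitimate
fn <- 189    # legitimate transactions predicted fraudulent

accuracy    <- (tp + tn) / (tp + tn + fp + fn)                # ~0.969
precision   <- tp / (tp + fp)                                 # ~0.974
recall      <- tp / (tp + fn)                                 # ~0.993
specificity <- tn / (tn + fp)                                 # ~0.683
f1          <- 2 * precision * recall / (precision + recall)  # ~0.983
```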

Feature Importance

Feature importance analysis ranked the predictors by their contribution to the Random Forest model, identifying the features that drive its predictions.


Findings From this Project


  • Applying Random Forest to the dataset showed how preprocessing affects results
  • Balancing the highly imbalanced classes improved the model’s ability to detect target cases
  • Hyperparameter tuning influenced predictive performance
  • Model evaluation metrics helped assess accuracy and overall behavior

Strengths Demonstrated by this Model


  • Capable of achieving strong predictive performance with relatively few tuning requirements
  • Robust against noise and fluctuations in the data
  • Produces interpretable measures such as feature importance
  • Offers out-of-bag error estimates for internal validation

Conclusion


  • Random Forest is a versatile and reliable machine learning algorithm
  • Random Forest is well suited for both classification and regression tasks
  • Handles datasets with complex relationships and many features
  • Maintains strong performance even in the presence of noise, missing data, and outliers
  • The ensemble approach helps to reduce overfitting and increase accuracy

References

Probst, Philipp. 2019. “Hyperparameters, Tuning and Meta-Learning for Random Forest and Other Machine Learning Algorithms.” LMU Munich Dissertations, 1–220. https://edoc.ub.uni-muenchen.de/24557/.
Zhang, Xijie. 2024. “Fraud Detection Using Machine Learning: An Evaluation of Logistic Regression and Random Forest,” 169–73. https://doi.org/10.1109/SCOUT64349.2024.00041.