Abstract
Using real customer data from a large community bank in the southern United
States, this paper analyzes the customer churn prediction problem by
constructing and comparing ten machine learning classification models with five
sampling techniques. Our results show that the Random Forest, XGBoost, AdaBoost,
and Bagging Meta classifiers outperform the others in terms of overall accuracy,
F-score, and AUC on the test observations. For these four classifiers, the
overall accuracy ranges from 87% to 96% across the five sampling methods
explored, while the AUC values range from 0.90 to 0.93. Considering overall
accuracy and F-score, AdaBoost with the original and MTDF sampling techniques
dominates the others. Considering the AUC measure, however, XGBoost and Random
Forest perform similarly to AdaBoost, and all three slightly dominate Bagging
Meta across all sampling techniques. Overall, the performance measures for
these four classifiers are comparable across all sampling techniques. The paper
further presents the important features of customer churn behavior, as
identified by the model. The diagnostic analysis also provides an insightful
comparison between churned and non-churned customers.
JEL classification numbers: C0, C5, C8, G21.
Keywords: Machine learning, Big data, Sampling techniques, Customer
churn, Customer retention, Financial services, Community bank.