Customer Churn Analysis for Three Banks
Updated: May 26, 2021
Customer churn is a common concern for businesses. The problem I set out to solve is identifying customers who are considering leaving the bank. I tackled the issue from many different directions, and there was a bit of a learning curve along the way.
The dataset that I used can be found here:
My Jupyter Notebook can be downloaded here:
My first step was to rule out some of the algorithms we had recently been studying.
I started with a linear regression. As a former math teacher, I have to admit I was very skeptical about using a linear regression model. I was surprised at first to see high accuracy, but I soon became convinced that the model was not a good match for my data. This is where I learned firsthand how misleading accuracy can be on an unbalanced dataset: approximately 8,000 customers had stayed with the bank and 2,000 had left. I remember thinking that 80% was not a bad start for my project, but it did not sit well with me, so I dug deeper and discovered that the algorithm was, for the most part, just voting with the majority class. There was no real predictive power.
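To make that concrete, here is a minimal sketch (my own illustration, not the original notebook) of how a majority-class baseline scores about 80% on an 8,000/2,000 split while catching zero churners:

```python
# Sketch: a majority-class baseline on an 8000/2000 imbalanced dataset.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 0 = stayed (8000 customers), 1 = left (2000 customers)
y = np.array([0] * 8000 + [1] * 2000)
X = np.zeros((10000, 1))  # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = baseline.predict(X)

print("Accuracy:", accuracy_score(y, preds))          # 0.80 -- looks good
print("Recall on churners:", recall_score(y, preds))  # 0.00 -- catches nobody
```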
I noticed that some of the features were quite skewed, so I applied a log transform to even out their distributions.
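As a rough sketch of that idea (the Balance column below is just a placeholder, not necessarily one of the features I transformed):

```python
# Sketch: log-transforming a right-skewed feature.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Balance": [0.0, 500.0, 12000.0, 250000.0]})  # toy values

# log1p handles zeros safely and compresses the long right tail
df["Balance_log"] = np.log1p(df["Balance"])
print(df)
```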
But I eventually decided to abandon this strategy because I felt the data was no longer representative after those changes.
PCA dimensionality reduction at 95% explained variance only removed one feature from my dataset, and it made the results harder to interpret.
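For reference, a minimal sketch of PCA keeping 95% of the variance, with a stand-in feature matrix rather than the real one:

```python
# Sketch: PCA retaining 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 10)                 # stand-in for the real features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.95)                 # keep enough components for 95% variance
X_reduced = pca.fit_transform(X_scaled)
print(X_scaled.shape[1], "->", X_reduced.shape[1], "features")
```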
I also briefly rebalanced the dataset so there was a more even distribution between customers staying and leaving the bank (one way to do this is sketched below), but this did not help either. No matter how I sliced it, regression models were not the solution.
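One common way to rebalance is to downsample the majority class; the sketch below assumes a DataFrame with an Exited target column, which is an assumption about the data layout rather than my exact code.

```python
# Sketch: downsample the majority class to balance the two groups.
import pandas as pd
from sklearn.utils import resample

def downsample(df: pd.DataFrame, target: str = "Exited") -> pd.DataFrame:
    majority = df[df[target] == 0]   # customers who stayed
    minority = df[df[target] == 1]   # customers who left
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority), random_state=42)
    # combine and shuffle the balanced frame
    return pd.concat([majority_down, minority]).sample(frac=1, random_state=42)
```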
“I tried many different algorithms and solved the churn analysis problem in many different ways.”
Random Forest
My first successful algorithm was the Random Forest. I obtained about 85% accuracy, and the model began predicting customers from both populations. Keep in mind this is an unbalanced dataset.
The ROC curve also showed reasonable performance.
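A condensed sketch of this step, using a synthetic imbalanced dataset as a stand-in for the real features:

```python
# Sketch: fit a Random Forest and look past plain accuracy.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=10,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC AUC :", roc_auc_score(y_test, y_prob))
print(classification_report(y_test, y_pred))  # check both classes, not just accuracy
```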
I built a pipeline and ran a GridSearchCV. This was a very large search space; I learned later how to choose better value ranges for my grid searches.
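The setup looked roughly like the sketch below; the parameter ranges shown are illustrative, not the actual (much larger) grid I searched.

```python
# Sketch: pipeline + GridSearchCV for the Random Forest.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("rf", RandomForestClassifier(random_state=42)),
])

param_grid = {
    "rf__n_estimators": [100, 300],
    "rf__max_depth": [None, 5, 10],
    "rf__min_samples_leaf": [1, 5],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
# search.fit(X_train, y_train)   # assuming the train/test split from above
# print(search.best_params_)
```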
This left me with the best parameters for my Random Forest, but I did not see a huge difference in the results.
Decision Tree
I built a decision tree but really only used it for the nice plot. I was running low on time in the semester, and I had my heart set on seeing the results from a neural network.
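For reference, producing that kind of plot takes only a few lines; the data here is synthetic and the depth is capped just to keep the figure readable.

```python
# Sketch: fit a small decision tree and plot its splits.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = make_classification(n_samples=1000, n_features=6,
                           weights=[0.8, 0.2], random_state=42)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

plt.figure(figsize=(14, 6))
plot_tree(tree, filled=True, class_names=["Stayed", "Left"])
plt.show()
```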
“I built my first neural network right before deciding to get my masters in data science.”
Neural Network
This model was my favorite. It was a Sequential deep neural network with three layers.
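A sketch of that kind of architecture (the layer sizes and the 10-feature input are assumptions, not my exact model):

```python
# Sketch: a three-layer Sequential network for binary churn prediction.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(10,)),               # assumed number of input features
    layers.Dense(16, activation="relu"),    # hidden layer 1
    layers.Dense(8, activation="relu"),     # hidden layer 2
    layers.Dense(1, activation="sigmoid"),  # output: probability of leaving
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```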
I built a pipeline and used it to make predictions on whether or not a person would leave the bank.
This allowed me to make up new customers and test whether they would be likely to leave.
I even built a report showing which customers the bank should contact to try to retain them.
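A sketch of those last two steps, scoring made-up customers with the (hypothetical, already trained) model from above and ranking them for outreach:

```python
# Sketch: score invented customers and list who to contact first.
import numpy as np
import pandas as pd

# Pretend these rows are made-up customers, already scaled like the training
# data; the 10-column feature layout is an assumption.
new_customers = np.random.rand(5, 10)

churn_prob = model.predict(new_customers).ravel()   # probability of leaving

report = pd.DataFrame({
    "customer_id": range(len(new_customers)),
    "churn_probability": churn_prob,
}).sort_values("churn_probability", ascending=False)

# Flag anyone above a 50% churn probability for outreach
report["contact"] = report["churn_probability"] > 0.5
print(report)
```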