Churn Analysis

In this project, we use the given dataset to find out which factors drive the customers of a telecommunications company to churn. Later on, we will construct a logistic regression model in the hope of predicting which customers will churn in the future.

Business questions:

1) Which customers will churn?

2) Do specific customer attributes (gender, senior citizen status, partner, etc.) affect the churn rate?

3) Do the specific types of services (phone services, multiple lines, internet services, etc.) subscribed to by the customers affect their churn rate?

4) How do customers make the decision to churn or not churn?

Data Cleaning

Before we start the EDA and modelling, we need to clean the data. As shown below, we begin by remapping certain columns to 0 or 1, converting certain columns to a numerical data type, detecting missing values and removing those records (since very few records actually have any missing values), and searching for outliers in the dataset with the use of box plots.
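The remapping and type conversion described above can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the four-row sample and the `clean` helper are hypothetical, and the column names (`Partner`, `Churn`, `TotalCharges`) are assumed to match the dataset.

```python
import pandas as pd

# Hypothetical four-row sample; the real dataset is far larger.
raw = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "Partner": ["Yes", "No", "No", "Yes"],
    "TotalCharges": ["29.85", "1889.5", " ", "108.15"],
    "Churn": ["No", "No", "Yes", "Yes"],
})

def clean(df):
    """Remap Yes/No columns to 1/0 and convert TotalCharges to a number."""
    out = df.copy()
    for col in ["Partner", "Churn"]:
        out[col] = out[col].map({"Yes": 1, "No": 0})
    # Values that cannot be parsed (such as a blank string) become NaN.
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
    return out

cleaned = clean(raw)
# Count missing values per column to see which columns need attention.
print(cleaned.isna().sum())
```

Using `errors="coerce"` turns unparseable entries into missing values instead of raising an error, so they show up in the missing-value count.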

From the description above, there appears to be missing data in TotalCharges (count only returns 7032 rows instead of 7043). Fortunately, only 11 rows contain missing data, which is not a significant amount given the size of this dataset. Thus, we can filter out the rows with missing values to prevent issues later on.

It should be noted, though, that had a significant number of rows been missing data, filtering would not be viable. Instead, we would typically replace the missing values with some statistic of the values we do have (such as the mean).
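To make the two options concrete, here is a small sketch contrasting row filtering with mean imputation. The data frame below is made up for illustration; only the `TotalCharges` column name comes from the text above.

```python
import pandas as pd

# Hypothetical sample with one missing TotalCharges value.
charges = pd.DataFrame({
    "tenure": [1, 34, 0, 2],
    "TotalCharges": [29.85, 1889.50, None, 108.15],
})

# Option 1: drop the rows with missing values (the approach taken here).
dropped = charges.dropna(subset=["TotalCharges"])

# Option 2: fill the gaps with the mean of the observed values.
mean_charge = charges["TotalCharges"].mean()  # NaN is skipped by default
imputed = charges.fillna({"TotalCharges": mean_charge})

print(len(dropped), len(imputed))
```

Dropping keeps only fully observed records, while imputation keeps every row at the cost of introducing values that were never actually observed.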

Exploratory Data Analysis (EDA)

Before we begin modelling, let's take a more detailed look at the data we have and attempt to answer some of the business questions posed in the first section.

Here, we can see how gender, senior-citizenship and partner (marital status) relate to churn.

From this, we can tell that gender doesn't appear to have much of a relationship with churn, since both genders have similar distributions. This is not particularly surprising or interesting, so let's move on.

Senior-citizenship is a bit more interesting. Senior citizens were generally less prevalent in the dataset, but the proportion of them who churned was noticeably higher than that of their younger counterparts. This may be because they no longer need telecommunications services, or possibly because they forget to pay their telco bills, leading to churn.

On the other hand, those with a partner seemed slightly less likely to churn. This may be due to those with a family preferring stability over constant change, or them simply having less time to worry about changing their telcos.
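The comparisons above come down to counting customers and computing the share who churned within each group. A minimal sketch of that calculation follows; the sample values are invented for illustration, and `churn_rate_by` is a hypothetical helper that assumes `Churn` has already been remapped to 0/1.

```python
import pandas as pd

# Made-up sample: four non-senior and two senior customers.
sample = pd.DataFrame({
    "SeniorCitizen": [0, 0, 0, 0, 1, 1],
    "Churn":         [0, 0, 0, 1, 1, 0],
})

def churn_rate_by(df, column):
    """Return the group size and churn rate for each value of `column`."""
    return df.groupby(column)["Churn"].agg(customers="count", churn_rate="mean")

summary = churn_rate_by(sample, "SeniorCitizen")
print(summary)
```

Because churn is coded as 0 or 1, the group mean is exactly the proportion of churners, which makes groups of different sizes directly comparable.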

The three groups of bar charts above all show how various services may influence a customer's desire to purchase. For brevity's sake I will not go into detail for all of them. However, those services that seemed to result in more churning than would be expected should definitely be reviewed by the company to fix any possible issues causing the higher churn rates.

Lastly, the use of paperless billing seemed to increase the churn rates quite significantly. This could be due to the ease with which those who use paperless billing can switch to another service provider online, whereas those who prefer physical bills would have to go through more procedures should they wish to change their telco.

From the boxplots above, most of the details do not appear to be particularly surprising. Those who had a shorter tenure would in turn have lower total charges, and due to their short tenure, these customers may have never been interested in a long-term contract with the telco to begin with. Additionally, those with high monthly charges also tended to churn more, again not really surprising.

What is interesting, however, is that a group of customers with high tenure and total charges ended up churning (they appear as outliers in the tenure and total charges graphs). This implies that, for some reason, a wave of the telco's older customers suddenly decided to abandon it. This could be due to any number of factors (such as a poor business decision, or the presence of a more attractive competitor), and certainly warrants further analysis by the company.
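Points drawn as outliers in a box plot can also be listed directly. The sketch below assumes the common 1.5 × IQR whisker rule (the default in many plotting libraries, though the plots above may have used a different setting); the tenure values and the `boxplot_outliers` helper are hypothetical.

```python
import pandas as pd

def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond the whiskers of a standard box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Made-up tenures (in months) of churned customers: mostly short, one long.
churned_tenure = pd.Series([1, 2, 2, 3, 4, 5, 6, 70])
long_tenure_churners = boxplot_outliers(churned_tenure)
print(long_tenure_churners)
```

Extracting these rows makes it possible to inspect the long-tenure churners individually instead of only seeing them as dots on a chart.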

Modelling

Now, we will try to predict whether the customer will churn or not based on the attributes selected above.

We first set a label column (in this case churn, i.e. our target variable). This is because we will be using the decision tree machine learning model to perform our predictions, and since this model is a supervised learning technique, we need a label or target variable to be defined first so our model actually has something to predict.

But before that, we have to split our dataset into a training and testing set.

We will split our dataset, which consists of 7032 records, into two groups. 70% will be used as a training set from which the model learns the patterns, and 30% will be used as a test set on which the model applies what it has learned to make predictions. X holds the feature columns and y holds churn: if we know the X values, then we can predict the y value.
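A 70/30 split of this kind can be sketched with scikit-learn's `train_test_split`. The ten-row data frame, the feature columns and the `random_state` value below are assumptions chosen for illustration, not the notebook's actual settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row sample standing in for the cleaned dataset.
data = pd.DataFrame({
    "tenure": [1, 5, 12, 24, 36, 48, 60, 72, 3, 18],
    "MonthlyCharges": [70.0, 85.5, 20.1, 55.3, 99.9, 45.0, 30.2, 110.5, 75.0, 64.8],
    "Churn": [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
})

X = data.drop(columns="Churn")  # feature columns
y = data["Churn"]               # label (target variable)

# Hold out 30% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing `random_state` keeps the split reproducible, so results can be compared across runs.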

The cell above feeds the dataset into the decision tree model (in this case, the split versions of the training and testing data). Generally, the training set is for learning purposes, while the testing set is for application/prediction purposes. The algorithm will attempt to learn from the training data first before it makes any predictions on the test data.

Here, we will only make use of a decision tree with 3 layers.
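As a rough sketch of this step, the snippet below fits a depth-limited decision tree with scikit-learn. It reads "3 layers" as `max_depth=3`, which is an assumption, and it uses synthetic data from `make_classification` in place of the churn dataset, so the accuracy it prints says nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the churn dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Limit the tree to three levels of splits (assumed meaning of "3 layers").
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)          # learn from the training set

predictions = tree.predict(X_test)  # apply to the unseen test set
accuracy = tree.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

A shallow tree is easier to read and less prone to overfitting than an unrestricted one, although it may miss finer patterns in the data.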