
Churn Analysis - Part 2: Indicators


Hello everyone,

In the last post, we decided to continue our study with logistic regression, and we obtained a ROC curve with an area under the curve (AUC) of 0.843.

Now let's imagine two scenarios. In the first one, we have a large budget and want to target many customers. In the second one, we have a restricted budget, so we want to contact only 25% of the customers in our database.
For the first scenario, we will focus on the threshold value in order to optimize the number of correct classifications.
For the second scenario, we will focus on each customer's probability of churning, also called the score. We will sort our database by score and then select the top 25% of customers using a lift curve.

Scenario 1: Threshold Optimization

Each customer has a score that corresponds to their probability of churning. Let's look at the scores of the first five customers in our database.

head(pred_logistic,5)
##          1          2          3          4          5 
## 0.62513394 0.32512336 0.52812027 0.31273422 0.01264661

By default, the threshold is 0.5: we predict a positive answer if the score is higher than 0.5 and a negative answer otherwise.
For the five customers above, this means the model predicts a positive answer for customers 1 and 3 and a negative answer for the other three.
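
As a minimal sketch (using pred_logistic, the score vector shown above), this default rule can be applied directly to the first five scores:

# apply the default 0.5 threshold to the first five scores
ifelse(head(pred_logistic, 5) > 0.5, 1, 0)
# expected: 1 0 1 0 0 (positive for customers 1 and 3)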

To optimize the threshold, we compute several statistical indicators (accuracy, precision, sensitivity, F1 and kappa) for threshold values from 0.05 to 0.847, with a step of 0.001. We do not go above 0.847 because beyond that value the model only predicts negative answers.
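
This upper bound simply reflects the highest score in our test set; assuming pred_logistic is the score vector used above, a quick check would be:

# the largest predicted score; any threshold above it yields only negative answers
max(pred_logistic)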

if (!require("tidyverse")) install.packages("tidyverse")
library("tidyverse")
if (!require("caret")) install.packages("caret")
library("caret")
if (!require("e1071")) install.packages("e1071")
library("e1071")
# pair the true answers with the predicted scores and sort by decreasing score
comp = cbind.data.frame(answer = db_test$ChurnNum, 
                        pred = pred_logistic) %>%
        arrange(desc(pred))

# compute the performance indicators for a given threshold x
indic_perf = function(x){
        compare = comp %>%
                mutate(pred = ifelse(pred > x, 1, 0))
        
        # confusionMatrix() needs both predicted classes to be present
        if(ncol(table(compare)) > 1){
        
        mat = confusionMatrix(table(compare), positive = "1")

        #accuracy
        acc = mat$overall["Accuracy"]
        
        #Kappa
        kap = mat$overall["Kappa"]
        
        #sensitivity
        sen = mat$byClass["Sensitivity"]
        
        #F1
        f1_stat = mat$byClass["F1"]
        
        #Precision
        prec = mat$byClass["Precision"]
        

        }else{
                acc = NA
                prec = NA
                sen = NA 
                kap = NA
                f1_stat = NA
        }
        return(data.frame(threshold = x, accuracy = acc, 
                          precision = prec, sensitivity = sen, 
                          kappa = kap, f1= f1_stat))
}
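
# For illustration, a single call at the default threshold of 0.5
# returns one row with the threshold and the five indicators:
indic_perf(0.5)
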
# compute the indicators for each threshold and drop the thresholds
# where only one class is predicted (NA indicators)
indics = do.call(rbind, lapply(seq(0.05, 0.95, by = 0.001), 
                               indic_perf)) %>%
        filter(!is.na(accuracy))


if (!require("plotly")) install.packages("plotly")
library("plotly")
if (!require("IRdisplay")) install.packages("IRdisplay")
library("IRdisplay")

# reshape to long format and flag the maximum value of each indicator
gather_indics = tidyr::gather(indics, variable, 
                      value, -threshold) %>%
  group_by(variable) %>%
  mutate(color =  (max(value) == value), 
         threshold = as.numeric(threshold) )
        
q=ggplot(gather_indics , aes(x= threshold, y=value)) +
        ggtitle("Indicator values by thresholds")+
        geom_point(aes(color = color), size=0.5) +
        facet_wrap(~variable, scales = 'free_x') +
        scale_color_manual(values = c(NA, "tomato")) +
        labs(x="thresholds", y=" ") +
        geom_line(color="navy") + theme_bw()+ 
        theme( legend.position="none")
#offline(ggplotly(q),  width = 'auto')
q

We then build a table showing, for each indicator, the rows where it reaches its maximum value.

max_indics = indics %>%
        filter(accuracy == max(accuracy, na.rm = TRUE) |
               precision == max(precision, na.rm = TRUE) |
               sensitivity == max(sensitivity, na.rm = TRUE) |
               kappa == max(kappa, na.rm = TRUE) |
               f1 == max(f1, na.rm = TRUE))

max_indics
##    threshold  accuracy   precision sensitivity       kappa          f1
## 1      0.050 0.5071157 0.978571429   0.3479365 0.199653582 0.513348946
## 2      0.051 0.5099620 0.978571429   0.3492670 0.202547963 0.514795679
## 3      0.052 0.5118596 0.978571429   0.3501597 0.204484542 0.515764706
## 4      0.053 0.5132827 0.978571429   0.3508323 0.205940664 0.516493874
## 5      0.054 0.5166034 0.978571429   0.3524116 0.209350638 0.518203310
## 6      0.055 0.5208729 0.978571429   0.3544631 0.213760484 0.520417854
## 7      0.056 0.5232448 0.978571429   0.3556132 0.216222936 0.521656354
## 8      0.057 0.5237192 0.978571429   0.3558442 0.216716506 0.521904762
## 9      0.339 0.7718216 0.751785714   0.5517693 0.475812739 0.636432351
## 10     0.584 0.8026565 0.471428571   0.6875000 0.437820539 0.559322034
## 11     0.585 0.8026565 0.469642857   0.6884817 0.437107483 0.558386412
## 12     0.799 0.7405123 0.023214286   1.0000000 0.033727638 0.045375218
## 13     0.800 0.7400380 0.021428571   1.0000000 0.031159002 0.041958042
## 14     0.801 0.7400380 0.021428571   1.0000000 0.031159002 0.041958042
## 15     0.802 0.7400380 0.021428571   1.0000000 0.031159002 0.041958042
## 16     0.803 0.7390892 0.017857143   1.0000000 0.026008938 0.035087719
## 17     0.804 0.7390892 0.017857143   1.0000000 0.026008938 0.035087719
## 18     0.805 0.7390892 0.017857143   1.0000000 0.026008938 0.035087719
## 19     0.806 0.7390892 0.017857143   1.0000000 0.026008938 0.035087719
## 20     0.807 0.7390892 0.017857143   1.0000000 0.026008938 0.035087719
## 21     0.808 0.7390892 0.017857143   1.0000000 0.026008938 0.035087719
## 22     0.809 0.7381404 0.014285714   1.0000000 0.020841748 0.028169014
## 23     0.810 0.7381404 0.014285714   1.0000000 0.020841748 0.028169014
## 24     0.811 0.7381404 0.014285714   1.0000000 0.020841748 0.028169014
## 25     0.812 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 26     0.813 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 27     0.814 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 28     0.815 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 29     0.816 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 30     0.817 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 31     0.818 0.7376660 0.012500000   1.0000000 0.018251704 0.024691358
## 32     0.819 0.7371917 0.010714286   1.0000000 0.015657346 0.021201413
## 33     0.820 0.7371917 0.010714286   1.0000000 0.015657346 0.021201413
## 34     0.821 0.7371917 0.010714286   1.0000000 0.015657346 0.021201413
## 35     0.822 0.7371917 0.010714286   1.0000000 0.015657346 0.021201413
## 36     0.823 0.7367173 0.008928571   1.0000000 0.013058663 0.017699115
## 37     0.824 0.7367173 0.008928571   1.0000000 0.013058663 0.017699115
## 38     0.825 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 39     0.826 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 40     0.827 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 41     0.828 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 42     0.829 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 43     0.830 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 44     0.831 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 45     0.832 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 46     0.833 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 47     0.834 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 48     0.835 0.7362429 0.007142857   1.0000000 0.010455645 0.014184397
## 49     0.836 0.7357685 0.005357143   1.0000000 0.007848280 0.010657194
## 50     0.837 0.7352941 0.003571429   1.0000000 0.005236558 0.007117438
## 51     0.838 0.7352941 0.003571429   1.0000000 0.005236558 0.007117438
## 52     0.839 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 53     0.840 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 54     0.841 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 55     0.842 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 56     0.843 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 57     0.844 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 58     0.845 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 59     0.846 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062
## 60     0.847 0.7348197 0.001785714   1.0000000 0.002620469 0.003565062

This still leaves a lot of rows, so we keep only the five most relevant thresholds: 0.057, 0.339, 0.584, 0.585 and 0.799.

max_indics %>%
        filter( threshold %in% c("0.057", "0.339", "0.584", "0.585", "0.799"))
##   threshold  accuracy  precision sensitivity      kappa         f1
## 1     0.057 0.5237192 0.97857143   0.3558442 0.21671651 0.52190476
## 2     0.339 0.7718216 0.75178571   0.5517693 0.47581274 0.63643235
## 3     0.584 0.8026565 0.47142857   0.6875000 0.43782054 0.55932203
## 4     0.585 0.8026565 0.46964286   0.6884817 0.43710748 0.55838641
## 5     0.799 0.7405123 0.02321429   1.0000000 0.03372764 0.04537522

At this stage, we can choose a threshold based on what we consider most important. I suggest continuing with 0.339, which gives the following confusion matrix.

# confusion matrix at the chosen threshold of 0.339
compare = comp %>%
                mutate(pred = ifelse(pred > 0.339, 1, 0))
confusionMatrix(table(compare), positive = "1")
## Confusion Matrix and Statistics
## 
##       pred
## answer    0    1
##      0 1206  342
##      1  139  421
##                                           
##                Accuracy : 0.7718          
##                  95% CI : (0.7533, 0.7896)
##     No Information Rate : 0.638           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4758          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5518          
##             Specificity : 0.8967          
##          Pos Pred Value : 0.7518          
##          Neg Pred Value : 0.7791          
##              Prevalence : 0.3620          
##          Detection Rate : 0.1997          
##    Detection Prevalence : 0.2657          
##       Balanced Accuracy : 0.7242          
##                                           
##        'Positive' Class : 1               
## 
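
The printed summary does not show F1, but caret computes it as the harmonic mean of precision (Pos Pred Value) and sensitivity; we can recompute it from the values reported above:

# F1 as the harmonic mean of precision and sensitivity at threshold 0.339
prec = 0.7518
sens = 0.5518
2 * prec * sens / (prec + sens)
# roughly 0.636, in line with the f1 column of the table above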

With a threshold of 0.339, we obtain the best F1 value and an acceptable level of accuracy, precision and sensitivity (recall). Now let's focus on the second scenario.

Scenario 2: Lift Curve

Now we want to target a percentage of the customers in our database. We sort customers by their score and then look at the churn rate in different percentiles. The best way to visualize this is a lift curve: customers are sorted by their probability of churning and grouped into cumulative percentiles, so that the 5th percentile represents the top 5% of customers with the highest probability of churning according to our model.

We also know that 26.54% of the customers in our database churn. With a random model, we would therefore expect a churn rate of 26.54% in each percentile. This gives us a baseline against which to compare our model.

# split the sorted database into 20 cumulative groups of 5%
fivtile = nrow(comp)/20
step = floor(fivtile * 1:20)

# churn rate among the top x customers (comp is sorted by decreasing score)
pct = sapply(step, function(x){
        return(mean(comp[1:x, 1]))})

lift = data.frame(label = seq(from = 5, to = 100, by = 5), score = pct*100)
q = ggplot(lift, aes(x=label, y=score))+
        
        geom_bar(stat="identity",position="stack",color="navy", fill="navy")+ 
        ggtitle("Churn rate per cumulative percentile of \n customers  with the highest probability to leave")+
        coord_cartesian(xlim = c(5,100), ylim = c(0,100))+
        scale_y_continuous(breaks = seq(from = 0, to = 100, by=25), labels = function(x) paste0(x,"%", sep = ""))+
        scale_x_continuous(breaks = c(5, 25, 50, 75, 100), labels = function(x) paste0(x,"%", sep = ""))+
        labs(x="cumulative percentile ", y="churn rate") + 
        # dashed baseline: overall churn rate (the value of the last, 100% percentile)
        geom_line(aes(y=score[20]),linetype=2, size=1, color="tomato")+ 
        theme_minimal()+
        theme( 
                panel.grid.major.x = element_blank(),
                panel.grid.minor.x = element_blank(),
                plot.title = element_text(size=17,face="bold", hjust=0.5),
                axis.text.x = element_text(vjust=1),
                axis.ticks.x = element_blank(),
                axis.title.x = element_text(size=13, face="bold", vjust=-0.4),
                axis.title.y = element_text(size=13,face="bold")#,
                #strip.text.x = element_text(face="italic", size=11)
                )

print(q)

Our model is clearly more efficient than the random model: up to the 40th percentile, we detect at least twice as much churn. If we target the top 25% of our database, the observed churn rate is 61.67%.
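
To put a number on this gain, we can divide the churn rate in the top 25% by the overall churn rate, using the lift data frame built above (its 5th row is the 25% point and its 20th row the 100% point):

# lift at the 25% mark: churn rate in the top 25% divided by the overall rate
lift$score[5] / lift$score[20]
# about 61.67 / 26.5, i.e. a lift of roughly 2.3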

Conclusion

In the first part, we cleaned the database, performed an exploratory data analysis and tested different models, finally choosing the one with the largest area under the curve (AUC): the logistic regression.
In this second part, we imagined two different scenarios. In the first one, we focused on the threshold value and found the optimal threshold: 0.339. In the second one, we wanted to target only 25% of the database and saw that our model is efficient on the top 25% of customers: the expected churn rate without a model is 26.54%, while the churn rate reaches 61.67% with our model.

Don’t hesitate to follow us on Twitter @rdata_lu and to subscribe to our YouTube channel.
You can also contact us if you have any comments or suggestions. See you in the next post!
