Applying Kaplan-Meier estimation, log-rank tests, and Cox proportional hazards regression to model time-to-churn for a subscription software company — identifying the key drivers of customer retention.
Customer churn is the central challenge for subscription-based businesses. Unlike standard classification ("will they churn?"), survival analysis answers the richer question: when will they churn, and what factors accelerate or delay that event? This project models time-to-churn for 2,847 SaaS customers using survival analysis techniques, accounting for the right-censoring inherent in subscription data (many customers are still active at the time of analysis).
The analysis progresses from non-parametric Kaplan-Meier survival curves through stratified log-rank tests to a semi-parametric Cox proportional hazards model. I identify the most impactful covariates, verify the proportional hazards assumption with Schoenfeld residual tests, and produce individual risk scores that the company can use for targeted retention campaigns.
The dataset contains 2,847 customer records from a B2B SaaS analytics platform, observed over a 36-month window. Each record includes the customer's tenure (months until churn or censoring), an event indicator, and several covariates:
| Variable | Description | Summary |
|---|---|---|
| tenure | Months until churn or last observation | Median: 14 mo |
| churned | Event indicator (1 = churned, 0 = censored) | 38.2% churned |
| plan_tier | Subscription tier (Basic / Pro / Enterprise) | 41% / 35% / 24% |
| monthly_spend | Monthly subscription cost ($) | Mean: $284 |
| support_tickets | Total support tickets filed | Mean: 4.2 |
| login_frequency | Average weekly logins | Mean: 3.8 |
| onboarding_score | Onboarding completion (0–100) | Mean: 67 |
| contract_type | Monthly vs. Annual billing | 58% / 42% |
The overall Kaplan-Meier curve shows a steep drop in the first 6 months — nearly 20% of customers churn within the first half-year — followed by a more gradual decline. The median survival time is 22 months.
```r
library(survival)
library(survminer)

# Overall survival curve
surv_obj <- Surv(churn$tenure, churn$churned)
km_fit <- survfit(surv_obj ~ 1, data = churn)

# ggsurvplot() needs the data frame to build the curve and risk table
ggsurvplot(km_fit,
           data = churn,
           conf.int = TRUE,
           risk.table = TRUE,
           xlab = "Months",
           ylab = "Retention Probability",
           title = "Overall Customer Survival Curve",
           palette = "#f43f5e")
```
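To make the estimator behind this curve concrete, here is a minimal sketch that computes Kaplan-Meier retention by hand and compares it with `survfit()`. The five-customer data frame `toy` is hypothetical and exists only for illustration; it is not drawn from the project's data.

```r
library(survival)

# Hypothetical toy data: tenure in months, 1 = churned, 0 = censored (still active)
toy <- data.frame(
  tenure  = c(2, 3, 3, 5, 8),
  churned = c(1, 1, 0, 1, 0)
)

# Kaplan-Meier by hand: at each distinct time, multiply the running
# retention probability by (1 - churns / customers still at risk)
event_times <- sort(unique(toy$tenure))
at_risk <- sapply(event_times, function(t) sum(toy$tenure >= t))
churns  <- sapply(event_times, function(t) sum(toy$tenure == t & toy$churned == 1))
km_manual <- cumprod(1 - churns / at_risk)

# Same estimate from the survival package
km_toy <- survfit(Surv(tenure, churned) ~ 1, data = toy)

print(data.frame(month = event_times, at_risk, churns,
                 km_manual, km_survfit = km_toy$surv))
```

Censored customers never count as churns, but they stay in the at-risk denominator until their last observed month. This is how right-censoring is handled without discarding active accounts.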
The retention curves separate sharply by plan tier. Enterprise customers show markedly higher retention than the other tiers (log-rank test, p < 0.001).
```r
# Stratified by plan tier
km_tier <- survfit(surv_obj ~ plan_tier, data = churn)
ggsurvplot(km_tier, data = churn, pval = TRUE, risk.table = TRUE,
           palette = c("#f43f5e", "#6c8cff", "#4ade80"))

# Log-rank test
survdiff(surv_obj ~ plan_tier, data = churn)
# Chi-sq = 89.4 on 2 degrees of freedom, p < 0.001
```
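The p-value reported by the log-rank test comes from comparing the chi-square statistic with a chi-square distribution on (number of groups - 1) degrees of freedom. The sketch below shows that step on simulated data. The data frame `toy` and the per-tier churn rates are assumptions made up for illustration, so its output will not match the figures above.

```r
library(survival)

set.seed(1)
n <- 300
tiers <- c("Basic", "Pro", "Enterprise")

# Simulated customers (illustrative only, not the project's data)
toy <- data.frame(
  plan_tier = factor(sample(tiers, n, replace = TRUE), levels = tiers)
)

# Assumed monthly churn rates per tier
rate <- c(Basic = 0.06, Pro = 0.04, Enterprise = 0.025)[as.character(toy$plan_tier)]
event_time  <- rexp(n, rate = rate)
toy$tenure  <- pmin(event_time, 36)           # 36-month observation window
toy$churned <- as.integer(event_time <= 36)   # 0 = censored at window end

# Log-rank test across the three tiers
lr <- survdiff(Surv(tenure, churned) ~ plan_tier, data = toy)

# Degrees of freedom = number of groups - 1
dof <- length(lr$n) - 1
p_value <- pchisq(lr$chisq, df = dof, lower.tail = FALSE)

cat("Chi-sq =", round(lr$chisq, 1), "on", dof, "df, p =", format.pval(p_value), "\n")
```

A significant result only says that at least one tier differs from the others. It does not say which pairs of tiers differ.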
The Cox PH model estimates how each covariate affects the hazard (instantaneous churn risk) without assuming a specific baseline hazard distribution. The key output is hazard ratios — a ratio above 1 means the covariate increases churn risk, below 1 means it's protective.
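In symbols, the model just described is usually written as follows. This is the standard textbook form. The notation ($h_0$ for the baseline hazard, $x_1, \dots, x_p$ for the covariates, $\beta_j$ for their coefficients) is introduced here for illustration and does not appear in the original analysis.

$$
h(t \mid x) = h_0(t)\,\exp(\beta_1 x_1 + \dots + \beta_p x_p)
$$

Comparing two customers who differ by one unit in covariate $x_j$ and are otherwise identical, the baseline hazard cancels:

$$
\mathrm{HR}_j = \frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = \exp(\beta_j)
$$

This cancellation is why $h_0(t)$ can be left unspecified, and why each hazard ratio is assumed not to depend on $t$.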
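```r
# Fit Cox PH model
# Spend and onboarding score are rescaled so hazard ratios read per $100 and per 10 points.
# Assumes Basic and Monthly are the reference levels of plan_tier and contract_type.
cox_fit <- coxph(surv_obj ~ plan_tier + I(monthly_spend / 100) + support_tickets +
                   login_frequency + I(onboarding_score / 10) + contract_type,
                 data = churn)
summary(cox_fit)
```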
| Covariate | Hazard Ratio | 95% CI | p-value |
|---|---|---|---|
| Plan: Pro (vs. Basic) | 0.68 | [0.57, 0.81] | < 0.001 |
| Plan: Enterprise (vs. Basic) | 0.41 | [0.33, 0.52] | < 0.001 |
| Monthly Spend (per $100) | 0.85 | [0.78, 0.93] | 0.003 |
| Support Tickets (per ticket) | 1.12 | [1.07, 1.18] | < 0.001 |
| Login Frequency (per login/wk) | 0.82 | [0.76, 0.88] | < 0.001 |
| Onboarding Score (per 10 pts) | 0.91 | [0.86, 0.96] | 0.001 |
| Contract: Annual (vs. Monthly) | 0.53 | [0.44, 0.63] | < 0.001 |
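A table like the one above can be assembled from a fitted model roughly as follows. This is a minimal sketch on simulated data: the data frame `toy`, its effect sizes, and the reduced covariate set are assumptions made for illustration, not the project's data or results.

```r
library(survival)

set.seed(42)
n <- 500

# Simulated customers (illustrative only)
toy <- data.frame(
  contract_type   = factor(sample(c("Monthly", "Annual"), n, replace = TRUE),
                           levels = c("Monthly", "Annual")),  # Monthly = reference
  login_frequency = round(runif(n, 0, 8), 1),
  support_tickets = rpois(n, 4)
)

# Assumed true log-hazard effects used to generate churn times
lp <- -0.6 * (toy$contract_type == "Annual") -
       0.2 * toy$login_frequency +
       0.1 * toy$support_tickets
event_time  <- rexp(n, rate = 0.05 * exp(lp))
toy$tenure  <- pmin(event_time, 36)
toy$churned <- as.integer(event_time <= 36)

fit <- coxph(Surv(tenure, churned) ~ contract_type + login_frequency + support_tickets,
             data = toy)

# Hazard ratios are exponentiated coefficients; the CI is exponentiated the same way
ci <- exp(confint(fit))
hr_table <- data.frame(
  hazard_ratio = exp(coef(fit)),
  ci_lower     = ci[, 1],
  ci_upper     = ci[, 2],
  p_value      = summary(fit)$coefficients[, "Pr(>|z|)"]
)
print(round(hr_table, 3))
```

Setting the factor levels explicitly, as done for `contract_type` here, controls which category serves as the reference in each "vs." comparison.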
I checked the PH assumption with the Schoenfeld residual test. The global test was not significant (p = 0.31), so there is no evidence that the hazard ratios change over time. Plots of the scaled Schoenfeld residuals showed no systematic time trends either.
```r
# Test proportional hazards assumption
ph_test <- cox.zph(cox_fit)
print(ph_test)
# GLOBAL: chisq = 8.14, p = 0.31 (no evidence against proportional hazards)

ggcoxzph(ph_test)  # Visual inspection of scaled Schoenfeld residuals

# Concordance index
concordance(cox_fit)
# C-index = 0.742 (SE = 0.012)
```
Three actionable findings for the retention team:
1. Annual contracts are associated with a 47% lower churn hazard (HR = 0.53), among the largest effects in the model. Incentivizing annual billing through discounts is a high-leverage retention strategy.
2. Each additional weekly login is associated with an 18% lower churn hazard (HR = 0.82). Product engagement appears protective, so the product team should prioritize features that drive habitual usage.
3. Support tickets are a warning signal (HR = 1.12 per ticket). Each additional ticket raises the churn hazard by 12%, and the effect compounds: three tickets correspond to roughly 1.12^3 ≈ 1.40 times the hazard of an otherwise identical customer with none. Proactive outreach after 3+ tickets could intercept at-risk customers.
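As a worked check on how these hazard ratios translate into percentages and combine, the arithmetic below uses the fitted values from the table. It assumes the model's multiplicative structure holds, and the combined customer profile is hypothetical.

$$
\begin{aligned}
\text{percent change in hazard} &= (\mathrm{HR} - 1) \times 100\% \\
\text{annual contract:}\quad (0.53 - 1) \times 100\% &= -47\% \\
\text{one extra weekly login:}\quad (0.82 - 1) \times 100\% &= -18\% \\
\text{annual contract and two extra weekly logins:}\quad 0.53 \times 0.82^{2} &\approx 0.36 \quad (\text{about } 64\% \text{ lower})
\end{aligned}
$$

Effects multiply on the hazard scale rather than add, so the combined reduction (64%) is smaller than the sum of the individual reductions (47% + 18% + 18%).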
This project demonstrates how survival analysis provides a richer framework for understanding churn than simple logistic regression. Modeling time to churn rather than just churn probability captures the dynamics of customer attrition: the steep early dropout, the stabilization after the first year, and the differential trajectories across customer segments. The Cox model's concordance index of 0.74 indicates good discriminative ability, and the individual risk scores it produces can be operationalized into a real-time churn warning system.
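The individual risk scores mentioned above could be produced along these lines. This is a minimal sketch on simulated data, not the project's scoring pipeline: the data frame `toy`, its effect sizes, and the top-10% cutoff are all assumptions chosen for illustration.

```r
library(survival)

set.seed(42)
n <- 500

# Simulated customers (illustrative only)
toy <- data.frame(
  contract_type   = factor(sample(c("Monthly", "Annual"), n, replace = TRUE),
                           levels = c("Monthly", "Annual")),
  login_frequency = round(runif(n, 0, 8), 1),
  support_tickets = rpois(n, 4)
)
lp <- -0.6 * (toy$contract_type == "Annual") -
       0.2 * toy$login_frequency +
       0.1 * toy$support_tickets
event_time  <- rexp(n, rate = 0.05 * exp(lp))
toy$tenure  <- pmin(event_time, 36)
toy$churned <- as.integer(event_time <= 36)

fit <- coxph(Surv(tenure, churned) ~ contract_type + login_frequency + support_tickets,
             data = toy)

# Relative risk score: hazard relative to an average customer in the sample
# (values above 1 mean higher-than-average churn hazard)
toy$risk_score <- predict(fit, type = "risk")

# Only customers who are still active (censored) can be targeted for retention
active <- subset(toy, churned == 0)

# Hypothetical cutoff: flag the top 10% of active customers by risk score
cutoff <- quantile(active$risk_score, 0.9)
active$high_risk <- active$risk_score >= cutoff

# Highest-risk active customers first
print(head(active[order(-active$risk_score), ], 5))
```

These scores rank customers relative to one another. They are not churn probabilities, and a probability by a given month would also require the estimated baseline hazard.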