Daniel Tsemekhman - Data & Analytics Portfolio

Overview

Customer churn is the central challenge for subscription-based businesses. Unlike standard classification ("will they churn?"), survival analysis answers the richer question: when will they churn, and what factors accelerate or delay that event? This project models time-to-churn for 2,847 SaaS customers using survival analysis techniques, accounting for the right-censoring inherent in subscription data (many customers are still active at the time of analysis).

The analysis progresses from non-parametric Kaplan-Meier survival curves through stratified log-rank tests to a semi-parametric Cox proportional hazards model. I identify the most impactful covariates, verify the proportional hazards assumption with Schoenfeld residual tests, and produce individual risk scores that the company can use for targeted retention campaigns.

Kaplan-Meier Estimator · Log-Rank Test · Cox PH Model · Right Censoring · Schoenfeld Residuals · Concordance Index

Data Description

The dataset contains 2,847 customer records from a B2B SaaS analytics platform, observed over a 36-month window. Each record includes the customer's tenure (months until churn or censoring), an event indicator, and several covariates:

Variable	Description	Summary
tenure	Months until churn or last observation	Median: 14 mo
churned	Event indicator (1 = churned, 0 = censored)	38.2% churned
plan_tier	Subscription tier (Basic / Pro / Enterprise)	41% / 35% / 24%
monthly_spend	Monthly subscription cost ($)	Mean: $284
support_tickets	Total support tickets filed	Mean: 4.2
login_frequency	Average weekly logins	Mean: 3.8
onboarding_score	Onboarding completion (0–100)	Mean: 67
contract_type	Monthly vs. Annual billing	58% / 42%
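As a minimal sketch of how the tenure and churned columns could be derived from raw subscription records under right-censoring: the column names signup_date and cancel_date, the study_end date, and the four example rows are all hypothetical and are not taken from the project's data.

```r
# Hypothetical raw subscription records; column names and dates are assumptions
subs <- data.frame(
  customer_id = 1:4,
  signup_date = as.Date(c("2021-01-15", "2021-06-01", "2022-03-10", "2023-02-01")),
  cancel_date = as.Date(c("2021-09-15", NA, "2023-05-10", NA))
)
study_end <- as.Date("2023-12-31")  # assumed end of the observation window

# Event indicator: 1 if a cancellation was observed, 0 if still active (censored)
subs$churned <- as.integer(!is.na(subs$cancel_date))

# Follow-up ends at cancellation for churners and at study_end for everyone else
end_date <- subs$cancel_date
end_date[is.na(end_date)] <- study_end

# Tenure in months, using an average month length of 30.44 days
days <- as.numeric(difftime(end_date, subs$signup_date, units = "days"))
subs$tenure <- days / 30.44

subs[, c("customer_id", "tenure", "churned")]
```

Customers without a cancellation date contribute their observed time up to the end of the window, which is what lets the later survival models use them instead of discarding them.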

Kaplan-Meier Analysis

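The overall Kaplan-Meier curve shows a steep drop in the first 6 months, with nearly 20% of customers churning within the first half-year, followed by a more gradual decline. The median survival time is 22 months. This is longer than the 14-month median tenure in the data table because the Kaplan-Meier estimate accounts for censored customers, whose observed tenure understates their eventual time to churn.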

library(survival)
library(survminer)

# Overall survival curve
surv_obj <- Surv(churn$tenure, churn$churned)
km_fit   <- survfit(surv_obj ~ 1, data = churn)

# Pass the data explicitly; ggsurvplot() needs it to build the risk table
ggsurvplot(km_fit,
           data = churn,
           conf.int = TRUE,
           risk.table = TRUE,
           xlab = "Months",
           ylab = "Retention Probability",
           title = "Overall Customer Survival Curve",
           palette = "#f43f5e")
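To make the estimator behind this curve concrete, here is a rough sketch that computes the Kaplan-Meier (product-limit) estimate by hand and compares it with survfit(). The eight-row toy data frame is invented for illustration and is not the project's dataset.

```r
library(survival)

# Toy data (made up): tenure in months, 1 = churned, 0 = censored
toy <- data.frame(
  tenure  = c(2, 3, 3, 5, 6, 8, 8, 10),
  churned = c(1, 1, 0, 1, 0, 1, 1, 0)
)

# Product-limit estimate: at each distinct churn time t, multiply the running
# survival probability by (1 - d_t / n_t), where d_t is the number of churn
# events at t and n_t is the number of customers still at risk just before t
event_times <- sort(unique(toy$tenure[toy$churned == 1]))
at_risk <- sapply(event_times, function(t) sum(toy$tenure >= t))
events  <- sapply(event_times, function(t) sum(toy$tenure == t & toy$churned == 1))
km_manual <- cumprod(1 - events / at_risk)

# survfit() should give the same values at the churn times
km_toy <- survfit(Surv(tenure, churned) ~ 1, data = toy)
km_pkg <- summary(km_toy, times = event_times)$surv

data.frame(month = event_times, at_risk, events, km_manual, km_pkg)
```

Censored customers never trigger a drop in the curve, but they stay in the at-risk count until their last observed month, which is how the estimator uses partial follow-up.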

Stratifying by plan tier reveals a dramatic separation between the curves. Enterprise customers show markedly higher retention, and the log-rank test confirms that the tiers differ (p < 0.001).

# Stratified by plan tier
km_tier <- survfit(surv_obj ~ plan_tier, data = churn)
ggsurvplot(km_tier, data = churn, pval = TRUE, risk.table = TRUE,
           palette = c("#f43f5e", "#6c8cff", "#4ade80"))

# Log-rank test
survdiff(surv_obj ~ plan_tier, data = churn)
# Chi-sq = 89.4, p < 0.001
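The overall log-rank test only says that at least one tier differs from the others. One possible follow-up, sketched here on simulated stand-in data (the churn rates, sample size, and the Bonferroni correction are my assumptions, not part of the original analysis), is to extract the p-value explicitly and then compare tiers pairwise.

```r
library(survival)

set.seed(42)
# Simulated stand-in data (not the project's dataset): exponential churn times
# with lower churn rates for higher tiers, censored at 36 months
n    <- 300
tier <- factor(sample(c("Basic", "Pro", "Enterprise"), n, replace = TRUE),
               levels = c("Basic", "Pro", "Enterprise"))
rate <- c(Basic = 0.06, Pro = 0.04, Enterprise = 0.02)[as.character(tier)]
t_churn <- rexp(n, rate)
sim <- data.frame(
  tenure    = pmin(t_churn, 36),
  churned   = as.integer(t_churn <= 36),
  plan_tier = tier
)

# Overall log-rank test; degrees of freedom = number of groups - 1
lr      <- survdiff(Surv(tenure, churned) ~ plan_tier, data = sim)
df      <- length(lr$n) - 1
p_value <- pchisq(lr$chisq, df = df, lower.tail = FALSE)

# Pairwise log-rank tests, each on the two-tier subset
pairs  <- combn(levels(sim$plan_tier), 2, simplify = FALSE)
p_pair <- sapply(pairs, function(g) {
  sub <- droplevels(sim[sim$plan_tier %in% g, ])
  fit <- survdiff(Surv(tenure, churned) ~ plan_tier, data = sub)
  pchisq(fit$chisq, df = 1, lower.tail = FALSE)
})
names(p_pair) <- sapply(pairs, paste, collapse = " vs ")

# Bonferroni adjustment for the three comparisons
p.adjust(p_pair, method = "bonferroni")
```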

Cox Proportional Hazards Model

The Cox PH model estimates how each covariate affects the hazard (instantaneous churn risk) without assuming a specific baseline hazard distribution. The key outputs are hazard ratios: a ratio above 1 means the covariate increases churn risk, while a ratio below 1 means it is protective.

# Fit Cox PH model
cox_fit <- coxph(surv_obj ~ plan_tier + monthly_spend + support_tickets +
                  login_frequency + onboarding_score + contract_type,
                  data = churn)
summary(cox_fit)

Hazard Ratio Estimates

Covariate	Hazard Ratio	95% CI	p-value
Plan: Pro (vs. Basic)	0.68	[0.57, 0.81]	< 0.001
Plan: Enterprise (vs. Basic)	0.41	[0.33, 0.52]	< 0.001
Monthly Spend (per $100)	0.85	[0.78, 0.93]	0.003
Support Tickets (per ticket)	1.12	[1.07, 1.18]	< 0.001
Login Frequency (per login/wk)	0.82	[0.76, 0.88]	< 0.001
Onboarding Score (per 10 pts)	0.91	[0.86, 0.96]	0.001
Contract: Annual (vs. Monthly)	0.53	[0.44, 0.63]	< 0.001
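The table reports monthly spend per $100 and onboarding score per 10 points, while the model formula uses the raw columns. One way such scaled ratios can be obtained, sketched here on simulated stand-in data (the effect sizes and sample size are arbitrary assumptions, and this is not necessarily how the table was produced), is to multiply each coefficient by the unit size before exponentiating.

```r
library(survival)

set.seed(1)
# Simulated stand-in data (not the project's dataset)
n <- 500
sim <- data.frame(
  monthly_spend    = runif(n, 50, 600),
  onboarding_score = runif(n, 0, 100)
)
lp      <- -0.0016 * sim$monthly_spend - 0.009 * sim$onboarding_score
t_churn <- rexp(n, rate = 0.08 * exp(lp))
sim$tenure  <- pmin(t_churn, 36)
sim$churned <- as.integer(t_churn <= 36)

fit <- coxph(Surv(tenure, churned) ~ monthly_spend + onboarding_score, data = sim)

# Coefficients are log hazard ratios per one raw unit ($1, 1 point).
# Multiplying by the unit size before exponentiating rescales the ratio.
scale_by  <- c(monthly_spend = 100, onboarding_score = 10)
hr_scaled <- exp(coef(fit)[names(scale_by)] * scale_by)
ci_scaled <- exp(confint(fit)[names(scale_by), ] * scale_by)

cbind(HR = hr_scaled, ci_scaled)
```

Fitting the model with I(monthly_spend / 100) and I(onboarding_score / 10) in the formula should give the same ratios directly.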

Concordance Index: 0.74
Annual Contract HR: 0.53
Login Freq. HR: 0.82
Support Tickets HR: 1.12

Proportional Hazards Assumption

I checked the PH assumption using the Schoenfeld residual test. The global test was non-significant (p = 0.31), so there is no evidence that the hazard ratios change over time. I also visually inspected the scaled Schoenfeld residual plots and observed no systematic time trends.

# Test proportional hazards assumption
ph_test <- cox.zph(cox_fit)
print(ph_test)
# GLOBAL: chisq = 8.14, p = 0.31 (no evidence against proportional hazards)

ggcoxzph(ph_test)  # Visual inspection

# Concordance index
concordance(cox_fit)
# C-index = 0.742 (SE = 0.012)
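For readers unfamiliar with the concordance index, the following sketch computes it by hand on simulated stand-in data (the single covariate, effect size, and sample size are assumptions) and compares the result with concordance(). It relies on there being no tied churn times or tied risk scores, which holds for continuous simulated values.

```r
library(survival)

set.seed(7)
# Simulated stand-in data (not the project's dataset)
n <- 60
login_frequency <- runif(n, 0, 8)
t_churn <- rexp(n, rate = 0.08 * exp(-0.2 * login_frequency))
sim <- data.frame(
  tenure  = pmin(t_churn, 36),
  churned = as.integer(t_churn <= 36),
  login_frequency = login_frequency
)

fit  <- coxph(Surv(tenure, churned) ~ login_frequency, data = sim)
risk <- predict(fit, type = "lp")  # higher value = higher predicted churn hazard

# A pair is comparable when the customer with the shorter tenure actually churned.
# It is concordant when that customer also has the higher predicted risk.
comparable <- 0
concordant <- 0
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (sim$churned[i] == 1 && sim$tenure[i] < sim$tenure[j]) {
      comparable <- comparable + 1
      if (risk[i] > risk[j]) concordant <- concordant + 1
    }
  }
}
c_manual <- concordant / comparable

c(manual = c_manual, package = unname(concordance(fit)$concordance))
```

A value of 0.5 corresponds to random ordering and 1 to perfect ordering, so the reported 0.74 means the model ranks roughly three out of four comparable customer pairs correctly.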

Business Implications

Three actionable findings for the retention team:

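1. Annual contracts are associated with a 47% lower churn hazard (HR = 0.53), holding the other covariates fixed. Incentivizing annual billing through discounts is a high-leverage retention strategy; only the Enterprise tier shows a larger protective effect (HR = 0.41).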

2. Each additional weekly login is associated with an 18% lower churn hazard (HR = 0.82). Product engagement is strongly protective, so the product team should prioritize features that drive habitual usage.

3. Support tickets are a warning signal (HR = 1.12 per ticket). Each additional ticket is associated with a 12% higher churn hazard, and the effect compounds: three tickets correspond to roughly 1.4 times the hazard (1.12^3 ≈ 1.40). Proactive outreach after 3+ tickets could intercept at-risk customers.
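The percentages in the three findings follow directly from the hazard ratios. As a small worked example using the values from the hazard ratio table (the variable names are mine), the conversion and the compounding across tickets look like this:

```r
# Hazard ratios reported in the hazard ratio table
hr <- c(annual_contract = 0.53, login_per_week = 0.82, support_ticket = 1.12)

# Percent change in churn hazard per one-unit increase: (HR - 1) * 100
pct_change <- (hr - 1) * 100

# Hazard ratios multiply, so k extra tickets scale the hazard by 1.12^k
tickets <- 1:5
ticket_multiplier <- unname(hr["support_ticket"])^tickets

round(pct_change)
round(setNames(ticket_multiplier, paste0(tickets, "_tickets")), 2)
```

These are changes in the instantaneous hazard with the other covariates held fixed, not changes in the overall probability of churning.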

Key Takeaways

This project demonstrates how survival analysis provides a richer framework for understanding churn than simple logistic regression. By modeling time to churn rather than just churn probability, we capture the dynamics of customer attrition: the steep early dropout, the more gradual decline that follows, and the differential trajectories across customer segments. The Cox model's concordance index of 0.74 indicates good discriminative ability, and the individual risk scores it produces could be operationalized into a churn warning system.
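As a rough sketch of how individual risk scores might be produced from a fitted Cox model: the data below is simulated, the two covariates and their effect sizes are assumptions, and the top-10% cutoff is an arbitrary choice rather than anything from the original analysis.

```r
library(survival)

set.seed(3)
# Simulated stand-in data (not the project's dataset)
n <- 400
sim <- data.frame(
  login_frequency = runif(n, 0, 8),
  support_tickets = rpois(n, 4)
)
lp      <- -0.2 * sim$login_frequency + 0.11 * sim$support_tickets
t_churn <- rexp(n, rate = 0.05 * exp(lp))
sim$tenure  <- pmin(t_churn, 36)
sim$churned <- as.integer(t_churn <= 36)

fit <- coxph(Surv(tenure, churned) ~ login_frequency + support_tickets, data = sim)

# Relative risk score: each customer's hazard relative to a customer with
# average covariate values (type = "risk" returns exp(linear predictor))
sim$risk_score <- predict(fit, type = "risk")

# Flag still-active (censored) customers in the top 10% of risk
active <- sim[sim$churned == 0, ]
cutoff <- quantile(active$risk_score, 0.90)
at_risk_customers <- active[active$risk_score >= cutoff, ]

head(at_risk_customers[order(-at_risk_customers$risk_score), ])
```

Scores above 1 indicate a higher-than-average churn hazard, so a retention team could rank active customers by this column when choosing whom to contact first.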

Survival Analysis of SaaS Customer Churn