Statistics
Introduction
In contractual settings, understanding customer attrition is crucial. Arguably, it is the manager's duty to harness the full potential of attrition rates, given the privilege of observing them. Complementing attrition is its counterpart, retention, commonly employed in calculating customer tenure and lifetime value. These metrics are vital to a business and must be thoroughly owned and understood.
Typically, retention (the proportion of customers retained at each period) is assumed to be constant across a cohort. This good-enough approach, however, leads to a gross underestimation of customer tenure and value for your most valuable customers. At first glance it may seem logical, but a more thorough examination reveals its weaknesses. A constant rate of retention requires a uniform attrition rate among a cohort's customers, irrespective of the observation period. This, however, defies intuition. It seems implausible that a customer who has resubscribed to Netflix for the 30th time would have the same likelihood of leaving in the 31st period as a new subscriber. Empirically, retention rates have even been shown to increase over time.
Some would argue that these customers are getting better, but this contradicts the nature of human behavior. People don't aspire to become better consumers; they are simply guided by their own, seemingly random, desires. The good customers had always been good; identifying them, however, required observation of their actions. Customers are drawn to try new products, and, if a product agrees with their preferences, they continue the relationship with the business. Such preferences remain unchanged (assuming stationarity) by the mere act of trying a product. Even exemplary customers will eventually churn, particularly if an alternative product better aligns with their preferences or current necessities.
How can we ascertain what actions our customers will take, or more aptly, how do we discern the assortment of customers we possess? Each customer, acting independently, obscures the underlying factors influencing their behavior. Yet, we endeavor to do our best. Fortunately, we have a proxy for determining customer types with similar behavior patterns—the duration of our relationship.
To harness churn data and our beliefs of customer behavior, we employ the shifted beta-geometric distribution (sBG). The full motivation for its choice is described in a paper linked later. To briefly summarize, the (shifted-) geometric distribution aligns with our understanding of customer attrition. It is parameterized by \(\theta\) (churn propensity), which encodes the various unobservable influences impacting a customer's decision to churn—product preferences, personality traits, occupational hazards, etc. Given we know an individual's true \(\theta\), we can calculate the probability of that customer surviving through period t: \(p(T>t|\theta)\).
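Concretely, for a customer with churn propensity \(\theta\), the shifted-geometric distribution gives the probability of churning in period \(t\) and of surviving beyond it (the two forms the likelihood will rely on later):
$$\begin{align*} p(T = t \mid \theta) &= \theta (1-\theta)^{t-1}, \quad t = 1, 2, 3, \ldots \\ p(T > t \mid \theta) &= (1-\theta)^{t} \end{align*}$$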
The beta distribution accounts for our intuition regarding customer heterogeneity. It represents the spectrum of \(\theta\) values among our customers and is characterized by two hyperparameters, \(\alpha\) and \(\beta\). The versatile shapes of this distribution, and the corresponding retention curves they imply, can be examined through the interactive graph below.
Retention rates will always rise over time; this is an artifact of the drop-out effect of bad customers. Accordingly, it would be ill-advised to consider only retention rates when analyzing one's customer base. This is evident in the survival curve, which reveals that customer cohorts with high churn rates rapidly vanish, despite achieving significant retention in later stages. In contrast, cohorts with a lower churn propensity retain a substantial portion of their initial base while still attaining high retention. Typically, companies excessively inclined towards the higher-churn segment of the distribution either cease to exist or adapt, attracting more high-value customers.
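For readers who prefer code to sliders, the curves in the graph can be reproduced from \(\alpha\) and \(\beta\) using the standard closed forms for the sBG model, \(S(t) = B(\alpha, \beta + t)/B(\alpha, \beta)\) and \(r(t) = S(t)/S(t-1) = (\beta + t - 1)/(\alpha + \beta + t - 1)\). The sketch below is illustrative only; the function names and parameter values are mine, not part of the analysis that follows. Note that retention rises toward 1 even though every individual's \(\theta_i\) is constant, which is exactly the sorting effect described above.

import numpy as np
from scipy.special import betaln

def sbg_survival(alpha, beta, t):
    # S(t) = B(alpha, beta + t) / B(alpha, beta), computed on the log scale for stability
    return np.exp(betaln(alpha, beta + t) - betaln(alpha, beta))

def sbg_retention(alpha, beta, t):
    # r(t) = S(t) / S(t - 1) = (beta + t - 1) / (alpha + beta + t - 1)
    return (beta + t - 1) / (alpha + beta + t - 1)

periods = np.arange(1, 13)
print(sbg_survival(1.0, 2.0, periods))   # fraction of the cohort still active after each period
print(sbg_retention(1.0, 2.0, periods))  # period-over-period retention, rising toward 1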
Methodology
The motivation behind this analysis is straightforward. Having implemented a solution employing Empirical Bayes, I felt a compelling desire to explore a more comprehensive approach. My methodology consists of implementing the fully Bayesian model; and, to benchmark the results, I will include results from the Empirical Bayes method (one whose simplicity allows for implementation in Excel).
Fully Bayesian Treatment
The data consist of 1,000 customers and their corresponding churn periods, which have been censored at the 7\(^{th}\) period. That is, for customers still active at that point, it is unknown when they churned. Each customer's observed churn period, denoted \(y_i\), is assumed to follow a Geometric distribution with a corresponding parameter representing churn propensity, \(\theta_i\). This structure implies a constant retention rate for each customer. Each \(\theta_i\) is assumed to be drawn from the same Beta distribution, described by parameters \(\alpha\) and \(\beta\). This structure can be represented by the following formulas:
$$\begin{align*} y_i &\sim \text{Geom}(\theta_i) \\ \theta_i &\sim \text{Beta}(\alpha, \beta) \end{align*}$$
A helpful analogy for internalizing this structure is to imagine each customer within a cohort reaching into a bag and pulling out a coin. The probability of this coin landing on heads is their propensity to churn (\(\theta_i\)). The mix of coins within the bag follows a Beta distribution; and, coin selection is only done once, at the beginning of the customer relationship. At each period, the customers flip their coins, and those whose coin lands on heads, churn. These constraints again underscore the two main assumptions of the model: each period, a customer decides to leave the firm with a constant probability \(\theta_i\) (stationarity) where each \(\theta_i\) is generated by a Beta distribution. The Beta distribution is a convenient choice as it exhibits conjugacy with the Geometric distribution and is bounded between 0 and 1.
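To make the analogy concrete, the data-generating process can be simulated in a few lines (a sketch for intuition only; the cohort size and the \(\alpha\), \(\beta\) values here are arbitrary): each customer draws a coin once from the Beta "bag" and flips that same coin every period until it lands heads.

import numpy as np

rng = np.random.default_rng(0)
alpha_true, beta_true, n_customers = 1.0, 2.0, 1000  # arbitrary illustrative values

theta = rng.beta(alpha_true, beta_true, size=n_customers)  # each customer draws a coin once
churn_period = rng.geometric(theta)                        # first period the coin lands heads

# Right-censor at the 7th period, mirroring the data set analyzed below
censored = churn_period > 7
counts_by_period = np.bincount(churn_period[~censored], minlength=8)[1:]
print(counts_by_period, censored.sum())  # churners per period 1-7, plus still-active customers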
Under a Bayesian interpretation, our unknown parameters are considered random variables. Additionally, we can choose to imbue these parameters with our subjective beliefs, codified by a prior distribution. The result is an updated estimation of our belief in the parameters' values, as given by a joint posterior distribution. This joint distribution for an individual can be calculated using Bayes' rule:
$$\begin{align*} p(\theta_i, \alpha, \beta | y_i) & \propto p(y_i| \theta_i, \alpha, \beta) \cdot p(\theta_i , \alpha, \beta) \\ & = p(y_i | \theta_i, \alpha, \beta) \cdot p(\theta_i | \alpha, \beta) \cdot p(\alpha, \beta) \quad \text{where } p(\alpha, \beta) \propto 1\\ & \propto \left (p(y_i | \theta_i) \cdot p(\theta_i | \alpha, \beta) \right)\cdot p(\alpha, \beta) \end{align*}$$
This formula represents the joint posterior for a specific customer up to a normalizing constant; however, our analysis is not done. The data for this model include 1,000 customers, so to calculate the full joint posterior we must incorporate the entirety of the data. Normally, this would involve calculating the product of our likelihoods and prior distributions; however, recall that our data are right-censored at the 7th period. This structure requires two different likelihood contributions: one based on the geometric's pmf; the other, on its survival function:
$$\begin{align*} p(\boldsymbol{\theta}, \alpha, \beta | \boldsymbol{y}) & \propto p(\boldsymbol{y} | \boldsymbol{\theta}, \alpha, \beta) \cdot p(\boldsymbol{\theta} | \alpha, \beta) \cdot p(\alpha, \beta) \quad \text{where } p(\alpha, \beta) \propto 1\\ & = \left( \prod_{i=1}^{1000} \left (p(y_i | \theta_i, \alpha, \beta) p(\theta_i | \alpha, \beta) \right) \right) \cdot p(\alpha, \beta)\\ & \propto \prod_{i=1}^{1000} \left (p(y_i | \theta_i) p(\theta_i | \alpha, \beta) \right)\\ & = \prod_{i=1}^{759} \left( \theta_i (1 - \theta_i)^{y_i-1} \frac{ \theta_i^{\alpha-1} (1 - \theta_i)^{\beta-1}}{B(\alpha, \beta)} \right) \prod_{i=760}^{1000} \left( (1 - \theta_i)^{7} \frac{ \theta_i^{\alpha - 1} (1 - \theta_i)^{\beta - 1}}{B(\alpha, \beta)} \right)\\ & = \left( \frac{1}{B(\alpha, \beta)} \right)^{1000} \left( \prod_{i=1}^{759} \left( \theta_i^{\alpha} (1 - \theta_i)^{\beta+ y_i -2} \right) \prod_{i=760}^{1000} \left( \theta_i^{\alpha-1} (1 - \theta_i)^{\beta+6} \right)\right) \end{align*}$$
Note that I have chosen a flat prior for both \(\alpha\) and \(\beta\), resulting in \(p(\alpha, \beta) \propto 1\). Additionally, it's important to note that I have hard-coded the exponent for the censored customers at \(t = 7\) in the second product, as this number is a constant. We do not know the exact churn period for these customers; instead, we know that \(p(T>t|\theta)\) is given by the survival function of the geometric distribution, namely \((1-\theta_i)^{t}\).

We now have the form of our joint posterior, but we need to sample from it. Unfortunately, our posterior is a non-standard distribution, making direct sampling difficult. To circumvent this, we will use a Gibbs sampling scheme with a Metropolis step: a Markov chain Monte Carlo method that lets us sample each parameter by conditioning on the current samples of the other parameters, even when some of the conditional distributions are non-standard. The initial step involves determining the conditional posteriors of all our parameters. This is achieved by disregarding terms that do not involve the parameter of interest; the resulting expression is then proportional to the true conditional distribution. For example, the parameters \(\boldsymbol{\theta}\) have conditional posterior distributions proportional to the following equations:
$$p(\theta_i | \theta_{j \neq i }, \alpha, \beta, y_i) \propto \begin{cases} \displaystyle\theta_i^{\alpha} (1 - \theta_i)^{\beta+ y_i -2} & \text{for } 1 \leq i < 760, \\[5pt] \displaystyle\theta_i^{\alpha-1} (1 - \theta_i)^{\beta+6} & \text{for } i \geq 760 \end{cases}$$
We recognize these as the functional forms of the Beta distribution:
$$\theta_i | \theta_{j \neq i }, \alpha, \beta, y_i \sim \begin{cases} \displaystyle \text{Beta}(\alpha+1, \beta+ y_i -1) & \text{ for } 1 \leq i < 760, \\ \displaystyle \text{Beta}(\alpha, \beta+7) & \text{ for } i \geq 760 \end{cases}$$
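Because these conditionals are standard Beta distributions, sampling them is direct. The Code section below calls two helper functions for these draws whose implementations are not shown; the following is a plausible reconstruction based on the formulas above (each helper returns a length-1 array from scipy, which is why the sampler indexes the result with [0]):

from scipy.stats import beta

def conditional_posterior_theta_uncensored(alpha, b, y_i):
    # theta_i | rest ~ Beta(alpha + 1, b + y_i - 1) for a customer observed churning in period y_i
    return beta.rvs(alpha + 1, b + y_i - 1, size=1)

def conditional_posterior_theta_censored(alpha, b):
    # theta_i | rest ~ Beta(alpha, b + 7) for a customer still active after the 7th period
    return beta.rvs(alpha, b + 7, size=1)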
Similarly, by ignoring elements of the joint distribution not reliant on \(\alpha\) and \(\beta\), a proportional joint conditional distribution can be calculated. Unfortunately, this results in a non-standard distribution, which will require the Metropolis step of our algorithm to sample from:
$$p(\alpha, \beta | \boldsymbol{y}, \boldsymbol{\theta}) \propto \prod_{i=1}^{1000} \left( \frac{\theta_i^{\alpha} (1 - \theta_i)^{\beta}}{B(\alpha, \beta)} \right)$$
In each iteration of the algorithm, samples are obtained from each conditional distribution using the current samples (or previous samples, depending on one's perspective). Each time a sample for a parameter is obtained, it becomes the current sample for that parameter and is used in subsequent draws. The scheme below illustrates the sampling process.
- Initialize the parameters \(\boldsymbol{\theta}\), \(\alpha\), and \( \beta \) (the starting points in the state space), and choose a symmetric proposal distribution \( q(x^{*}|x^{t-1}) \) for the parameters whose conditional distributions are non-standard.
- For each iteration \( t = 1, 2, \ldots, T \):
- Sample the parameters \(\boldsymbol{\theta}\) using \(\theta_i | \theta_{j \neq i }, \alpha, \beta, y_i \sim \begin{cases} \displaystyle \text{Beta}(\alpha+1, \beta+ y_i -1) & \text{ for } 1 \leq i < 760, \\ \displaystyle \text{Beta}(\alpha, \beta+7) & \text{ for } i \geq 760 \end{cases}\)
- Generate proposals \( x^{*} \) from the proposal distribution \( q(x^{*}|x^{t-1}) \) for both \(\alpha\) and \(\beta\), where \(\sigma_{\alpha}\) and \(\sigma_{\beta}\) are tuning parameters for the algorithm:
- \(\alpha^{*} \sim \text{Normal}(\alpha^{t-1}, \sigma_{\alpha})\)
- \(\beta^{*} \sim \text{Normal}(\beta^{t-1}, \sigma_{\beta})\)
- Input the proposed values into their non-standard joint distribution and compare to the current values using a ratio r:
- \(r = \frac{p(\alpha^{*}, \beta^{*} | \boldsymbol{y}, \boldsymbol{\theta})}{p(\alpha^{t-1}, \beta^{t-1} | \boldsymbol{y}, \boldsymbol{\theta})}\)
- We accept the proposed values with probability \(\min(1, r)\) (i.e. when r is greater than u, where u \(\sim \text{Uniform}(0,1)\)).
- If accepted, our current sample values are updated to the proposed values. Otherwise, the old samples remain.
Sometimes it is computationally useful to work with the logarithms of our distribution. Utilizing the properties of logarithms, r is now represented as a difference of logarithms and is compared against \(\log(u)\).
$$\begin{align*} p(\alpha, \beta | \boldsymbol{y}, \boldsymbol{\theta}) & \propto \prod_{i=1}^{1000} \left( \frac{\theta_i^{\alpha} (1 - \theta_i)^{\beta}}{B(\alpha, \beta)} \right)\\ & = B(\alpha, \beta)^{-1000} \cdot \prod_{i=1}^{1000} \left( \theta_i^{\alpha} (1 - \theta_i)^{\beta} \right)\\ \log(p(\alpha, \beta | \boldsymbol{y}, \boldsymbol{\theta})) & \propto \log(B(\alpha,\beta)^{-1000}) + \log\left(\prod_{i=1}^{1000} \theta_i^{\alpha}(1-\theta_i)^{\beta} \right)\\ & = -1000\log(B(\alpha, \beta)) + \log(\theta_1^{\alpha} \cdot \ldots \cdot \theta_{1000}^{\alpha} \cdot (1-\theta_1)^{\beta} \cdot \ldots \cdot (1-\theta_{1000})^{\beta})\\[3mm] & = -1000\log(B(\alpha, \beta)) + \log(\theta_1^{\alpha}) + \ldots + \log(\theta_{1000}^{\alpha}) \\ & \quad \text{ } + \log((1-\theta_1)^{\beta}) + \ldots + \log((1 - \theta_{1000})^{\beta})\\[3mm] & = -1000\log(B(\alpha, \beta)) + \alpha\log(\theta_1) + \ldots + \alpha\log(\theta_{1000})\\ & \quad \text{ } + \beta\log(1-\theta_1) + \ldots + \beta\log(1 - \theta_{1000})\\ & = -1000\log(B(\alpha, \beta)) + \alpha\sum_{i=1}^{1000}(\log(\theta_i)) + \beta\sum_{i=1}^{1000}(\log(1-\theta_i))\\ & = \alpha\sum_{i=1}^{1000}(\log(\theta_i)) + \beta\sum_{i=1}^{1000}(\log(1-\theta_i)) - 1000\log(B(\alpha, \beta)) \end{align*}$$
$$\log(r) = \log(p(\alpha^{*}, \beta^{*} | \boldsymbol{y}, \boldsymbol{\theta})) -\log(p(\alpha^{t-1}, \beta^{t-1} | \boldsymbol{y}, \boldsymbol{\theta}))$$
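The final line of this derivation translates directly into code. Below is my reconstruction of the log_conditional_posterior_log_alpha_beta helper referenced in the Code section (its implementation is not shown there): the sampler proposes on the log scale, so the helper accepts \(\log(\alpha)\) and \(\log(\beta)\), exponentiates them, and evaluates the expression above. No Jacobian term is added here because the sampler applies the Hastings correction for the log-normal proposal separately.

import numpy as np
from scipy.special import betaln

def log_conditional_posterior_log_alpha_beta(log_alpha, log_beta, thetas):
    # log p(alpha, beta | y, theta) up to an additive constant, evaluated at
    # alpha = exp(log_alpha), beta = exp(log_beta), for the current vector of theta samples
    alpha = np.exp(log_alpha)
    b = np.exp(log_beta)
    n = len(thetas)
    return (alpha * np.sum(np.log(thetas))
            + b * np.sum(np.log1p(-thetas))
            - n * betaln(alpha, b))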
Code
import numpy as np
from scipy.stats import norm, lognorm, uniform

def log_gibbs(log_alpha, log_beta, num_samples, sd_a_prop, sd_b_prop):
    np.random.seed(11101)  # seed NumPy's RNG (scipy.stats draws use it) for reproducibility
    # Churners observed in periods 1-7, plus a final count of customers censored at period 7
    # arr = np.array([369, 163, 86, 56, 37, 27, 21, 241])  # alternative data set (759/241 split used in the derivation above)
    arr = np.array([131, 126, 90, 60, 42, 34, 26, 491])
    y = np.repeat(np.arange(1, len(arr) + 1), arr)  # one churn/censoring period per customer
    n_uncensored = int(arr[:-1].sum())              # 509 uncensored customers for this data set
    alpha_samples = np.zeros((1, num_samples))
    beta_samples = np.zeros((1, num_samples))
    theta_samples = np.zeros((1000, num_samples))
    num_accepted = 0
    for i in range(num_samples):
        alpha = np.exp(log_alpha)
        b = np.exp(log_beta)
        if i % 100 == 0:
            print('sample iteration number ' + str(i))
            print(num_accepted / (i + 1))
            print("alpha: " + str(alpha) + " beta: " + str(b))
        # Gibbs step: draw each theta_i from its Beta conditional posterior
        for j in range(0, n_uncensored):
            theta_samples[j, i] = conditional_posterior_theta_uncensored(alpha, b, y[j])[0]
        for j in range(n_uncensored, 1000):
            theta_samples[j, i] = conditional_posterior_theta_censored(alpha, b)[0]
        # Metropolis-Hastings step: propose alpha and beta on the log scale
        log_alpha_prop = norm.rvs(loc=log_alpha, scale=sd_a_prop)
        log_beta_prop = norm.rvs(loc=log_beta, scale=sd_b_prop)
        alpha_prop = np.exp(log_alpha_prop)
        beta_prop = np.exp(log_beta_prop)
        target_prop = log_conditional_posterior_log_alpha_beta(log_alpha_prop, log_beta_prop, theta_samples[:, i])
        current_prop = log_conditional_posterior_log_alpha_beta(log_alpha, log_beta, theta_samples[:, i])
        # Hastings correction: the proposal is log-normal (hence asymmetric) on the original scale
        log_given_current = (lognorm.logpdf(alpha_prop, s=sd_a_prop, scale=alpha)
                             + lognorm.logpdf(beta_prop, s=sd_b_prop, scale=b))
        log_given_proposed = (lognorm.logpdf(alpha, s=sd_a_prop, scale=alpha_prop)
                              + lognorm.logpdf(b, s=sd_b_prop, scale=beta_prop))
        r = target_prop - current_prop + log_given_proposed - log_given_current
        u = np.log(uniform.rvs(loc=0, scale=1))
        if r >= u:
            log_alpha = log_alpha_prop
            log_beta = log_beta_prop
            num_accepted += 1
        # Back-transform the current (post accept/reject) log_alpha and log_beta to store them
        alpha_samples[0, i] = np.exp(log_alpha)
        beta_samples[0, i] = np.exp(log_beta)
    result = {
        "theta.samp": theta_samples,
        "alpha.samp": alpha_samples,
        "beta.samp": beta_samples,
        "acceptance_rate": num_accepted / num_samples
    }
    return result
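Assuming the helper sketches given earlier, the sampler can be run along the following lines. The starting values, proposal standard deviations, and burn-in length below are illustrative guesses, not necessarily the settings used for the analysis:

draws = log_gibbs(log_alpha=np.log(1.0), log_beta=np.log(1.0),
                  num_samples=5000, sd_a_prop=0.1, sd_b_prop=0.1)

# Discard a burn-in period before summarizing the posterior over alpha and beta
burn_in = 1000
alpha_post = draws["alpha.samp"][0, burn_in:]
beta_post = draws["beta.samp"][0, burn_in:]
print(alpha_post.mean(), beta_post.mean(), draws["acceptance_rate"])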