Today, we shall look at more examples of Bayesian data analyses using PyMC.
#Import the necessary libraries:
import arviz as az
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm #PyMC3 is now replaced by PyMC
A simple experiment to measure the length of an object led to the following 15 measurements: $17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61.$ What is the best estimate for the length of this object?
The standard method for solving this problem is the following. Let $\theta$ denote the unknown length of the object that we are trying to obtain. The standard estimate of $\theta$ is the mean of the observations: \begin{align*} \hat{\theta} := \bar{y} = \frac{y_1 + \dots + y_n}{n}. \end{align*} The uncertainty in this estimate is quantified via \begin{align*} \frac{\hat{\sigma}}{\sqrt{n}} ~~~~ \text{ where } \hat{\sigma} := \sqrt{\frac{1}{n-1} \sum_{i=1}^n \left(y_i - \bar{y} \right)^2} \end{align*}
#Data
n = 15
y_obs = np.array([17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61])
#Usual Solution:
thetaest = np.mean(y_obs)
thetaerror = np.std(y_obs, ddof=1)/np.sqrt(n)
print(thetaest, thetaerror)
17.615333333333336 0.00150132216861293
We now present a Bayesian solution to this problem using PyMC. As in every Bayesian Analysis, we shall assume a prior model for the unknown parameters, and then use a likelihood model to connect the unknown parameters to data. The posterior will then be used for answering the scientific questions. Here is one way of specializing this standard procedure to this problem.
Let $\theta$ denote the unknown true length that we are trying to obtain. We shall assume the prior: $$\theta \in \text{Uniform}[-80, 80]$$ (it is obvious from the measurements that $\theta$ should lie between 0 and 30 so the $\text{uniform}[-80, 80]$ prior is really an "uninformative" prior; uninformative basically means that the prior reveals very little nontrivial information about the parameter).
For the likelihood, we shall assume the Gaussian model: \begin{align*} y_1, \dots, y_n \mid \theta \overset{\text{i.i.d}}{\sim} N(\theta, \sigma^2) \end{align*} Here $\sigma$ is a parameter which characterizes the uncertainty in each individual measurement. We can try to heuristically figure out the value of $\sigma$ from the observed data (say using the standard estimate $\hat{\sigma}$ defined above). A more principled procedure is to also treat it as an unknown parameter and specify a prior for $\sigma$. Here is then a full specification of the prior and likelihood for solving this problem: \begin{align*} \text{Prior}: ~~ \theta \sim \text{Uniform}[-80, 80] ~~ \text{ and } ~~ \log \sigma \sim \text{Uniform}[-10, 10], \end{align*} and likelihood: \begin{align*} y_1, \dots, y_n \mid \theta, \sigma \overset{\text{i.i.d}}{\sim} N(\theta, \sigma^2) \end{align*} Here are a few additional remarks about the prior. As $\sigma$ is a positive-valued parameter, the common practice is to specify the uniform prior on $\log \sigma$. In this problem, it is clear from the measurements that each individual measurement is quite precise which means that $\sigma$ should be small. We still chose the range $[-10, 10]$ for the uniform prior on $\log \sigma$ to be uninformative.
We now specify this model in PyMC.
#Bayes Solution
measurement_model = pm.Model()
with measurement_model:
    theta = pm.Uniform("theta", lower = -80, upper = 80)
    log_sigma = pm.Uniform("log_sigma", lower = -10, upper = 10)
    sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
    Y = pm.Normal("Y", mu = theta, sigma = sigma, observed=y_obs)
    #Sample from posterior:
    idata = pm.sample(2000, chains = 2, return_inferencedata = True) 
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (2 chains in 4 jobs) NUTS: [theta, log_sigma]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 3 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics
The posterior samples for $\theta$ can be accessed from the PyMC output. We will also get posterior samples for $\log(\sigma)$ which can be converted to posterior samples for $\sigma$ by exponentiating.
theta_samples = idata.posterior['theta'].values.flatten()
log_sigma_samples = idata.posterior['log_sigma'].values.flatten()
#Histogram of posterior theta samples for theta
plt.hist(theta_samples, bins = 500)
plt.xlabel(r'Values of $\theta$')
plt.ylabel('Frequency')
plt.title(r'Histogram of the Posterior $\theta$ samples')
plt.show();
#Histogram of posterior log_sigma samples
plt.hist(log_sigma_samples, bins = 500)
plt.xlabel(r'values of $\log(\sigma)$')
plt.ylabel('Frequency')
plt.title(r'Histogram of the Posterior $\log(\sigma)$ samples')
plt.show();
#Posterior samples for sigma are obtained by exponentiating the log_sigma samples
sigma_samples = np.exp(log_sigma_samples)
plt.hist(sigma_samples, bins = 500)
plt.xlabel(r'values of $\sigma$')
plt.ylabel('Frequency')
plt.title(r'Histogram of the Posterior $\sigma$ samples')
plt.show();
#Our best estimate of the unknown length can be taken to be the mean of the posterior samples for theta:
thetamean = np.mean(theta_samples)
display(thetamean)
#Our uncertainty in theta can be captured by the standard deviation of the posterior samples
thetastd = np.std(theta_samples)
display(thetastd)
#Our best estimate for sigma (which can be interpreted as the uncertainty in each individual measurement) can be obtained 
#by taking the mean of the posterior samples for sigma:
sigmamean = np.mean(sigma_samples)
display(sigmamean)
17.615357797323185
0.0015931447991204468
0.006142381305189109
Note that the estimates of $\theta$ and $\sigma$ from the Bayesian analysis match quite closely with those obtained from the standard procedure:
display([thetamean, thetaest])
display([thetastd, thetaerror])
[17.615357797323185, 17.615333333333336]
[0.0015931447991204468, 0.00150132216861293]
$95\%$ intervals for $\theta$ and $\sigma$ (these are intervals in which the true parameters will lie with probability 0.95; the formal term for these intervals is "Credible Intervals") can be calculated as follows.
#A 95% interval for theta based on the posterior samples is computed as follows:
lower_limit_theta = np.percentile(theta_samples, 2.5)
upper_limit_theta = np.percentile(theta_samples, 97.5)
display([lower_limit_theta, upper_limit_theta])
#Similarly, A 95% interval for sigma based on the posterior samples is computed as follows:
lower_limit_sigma = np.percentile(sigma_samples, 2.5)
upper_limit_sigma = np.percentile(sigma_samples, 97.5)
display([lower_limit_sigma, upper_limit_sigma])
[17.612289074482923, 17.61846644701896]
[0.0043205952568492155, 0.008952565464264876]
A slight arbitrariness in the above analysis comes from the specific bounds $[-80, 80]$ and $[-10, 10]$ used in the prior distributions: $\theta \in \text{Uniform}[-80, 80]$ and $\log \sigma \sim \text{Uniform}[-10, 10]$. The point of these priors is not be too informative so the posterior is largely determined by the data and the likelihood. One can repeat the analysis with other choices of the bounds and see how the results change. Another option is to just set the bounds extremely large. In other words, we take \begin{align*} \theta \sim \text{Uniform}[-C, C] ~~ \text{ and } ~~ \log \sigma \sim \text{Uniform}[-C, C] \end{align*} were $C$ is very large. If we directly plug this model into PyMC with a large value of $C$, the posterior sample generation method might run into numerical issues leading to errors (for example, try the code below). In general, one sets the bounds to be uninformative while not being too large leading to numerical issues.
You can uncomment the following code and see that PyMC throws an error.
#In the following, we take C = 10000
#Uncomment the code to run
# measurement_model = pm.Model()
# with measurement_model:
#     theta = pm.Uniform("theta", lower = -10000, upper = 10000)
#     log_sigma = pm.Uniform("log_sigma", lower = -10000, upper = 10000)
#     sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
#     Y = pm.Normal("Y", mu = theta, sigma = sigma, observed=y_obs)
#     #Sample from posterior:
#     idata = pm.sample(2000, chains = 2, return_inferencedata = True) 
    
#This model with C extremely large is throwing an error. 
Consider the same problem as before but now we have two additional observations. The data now is $17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61, 25, 31$. Again the goal is to obtain the best estimate for the length of this object.
Looking at the observations, one can clearly see that the new observations might have resulted from mistakes (either in recording or in the instrument). If we repeat the same process as before with this new set of data, we will get answers that are quite different from the previous analysis.
y_obs = np.array([17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61, 25, 31])
n = len(y_obs)
print(n)
17
#Usual Solution:
thetaest = np.mean(y_obs)
thetaerror = np.std(y_obs, ddof=1)/np.sqrt(n)
print(thetaest, thetaerror)
18.837058823529414 0.8751232117897164
So the estimate of $\theta$ now changes to 18.837 from 17.615. The uncertainty estimate also changes nontrivially to 0.849 from 0.0015. The Bayesian analysis with the same model as in the previous section also leads to similar conclusions.
measurement_model = pm.Model()
with measurement_model:
    theta = pm.Uniform("theta", -80, 80)
    log_sigma = pm.Uniform("log_sigma", -10, 10)
    sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
    Y = pm.Normal("Y", mu = theta, sigma = sigma, observed=y_obs)
    #Sample from posterior:
    idata = pm.sample(2000, chains = 2, return_inferencedata = True) 
theta_samples = idata.posterior['theta'].values.flatten()
log_sigma_samples = idata.posterior['log_sigma'].values.flatten()
sigma_samples = np.exp(log_sigma_samples)
#Our best estimate of the unknown length can be taken to be the mean of the posterior samples for theta:
thetamean = np.mean(theta_samples)
display([thetamean, np.mean(y_obs)])
#Our uncertainty in theta can be captured by the standard deviation of the posterior samples
thetastd = np.std(theta_samples)
display(thetastd)
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (2 chains in 4 jobs) NUTS: [theta, log_sigma]
Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 2 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics
[18.817034942559157, 18.837058823529414]
0.9416332348601021
In this problem, we clearly need a more sophisticated analysis which involves, in addition to estimation of $\theta$, inspection of each observation to determine whether it is legitimate or fake. This can be achieved by Bayesian analysis with the following refined model. In this model, we shall introduce binary random variables $z_1, \dots, z_n$ where the purpose of $z_i$ is to flag whether $y_i$ is a legitimate observation or is a fake observation due to a recording mistake. When $z_i = 0$, the observation $y_i$ is legitimate and we use the $N(\theta, \sigma^2)$ distribution as before. On the other hand, when $z_i = 1$, the observation $y_i$ is a fake one and we use an uninformative distribution such as $N(0, 100^2)$ to indicate the fact that $y_i$ can pretty much be anything when it is fake. We also model $z_1, \dots, z_n$ as i.i.d Bernoulli($w$) with a noninformative uniform prior on $w$. Overall the model can be written as follows: \begin{align*} & w \sim \text{Uniform}[0, 1] ~~ \text{ and } ~~ \theta \sim \text{Flat} ~~ \text{ and } ~~ \log \sigma \sim \text{Flat} \\ & z_1, \dots, z_n \mid w \sim \text{Bernoulli}(w) \\ & y_i \mid z_i = 0, \theta, \sigma \sim N(\theta, \sigma^2) ~~ \text{ and } ~~ y_i \mid z_i = 1 \sim N(0, 100^2). \end{align*} This model is often referred to as a Gaussian (or Normal) Mixture Model because of the assumption that the data points $y_i$ are a mixture of two normals: $N(\theta, \sigma^2)$ for the legitimate observations, and $N(0, 100^2)$ for the fake observations. This model is implemented in PyMC as follows. The posterior samples for $z_i$ can be used to detect whether the $i^{th}$ observation is legitimate or fake (indeed, if the majority of the samples for $z_i$ equal 0, then the observation can be termed legitimate, and if the majority of $z_i$ samples equal 1, then the observation can be termed fake). The estimate and uncertainty in $\theta$ can be obtained in the usual way based on the samples for $\theta$.
measurement_somefake_model = pm.Model()
with measurement_somefake_model:
    w = pm.Beta("w", alpha = 1, beta = 1)
    theta = pm.Uniform("theta", lower = -80, upper = 80)
    log_sigma = pm.Uniform("log_sigma", lower = -10, upper = 10)
    sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
    
    theta2 = 0
    sigma2 = 100
    
    thetas = pm.math.stack([theta, theta2])
    sigmas = pm.math.stack([sigma, sigma2])
    category = pm.Bernoulli("category", p = w, shape = n)
    mean_idx = thetas[category]
    sd_idx = sigmas[category]
    obs = pm.Normal("obs", mu = mean_idx, sigma = sd_idx, observed = y_obs)
    idata = pm.sample(1000, chains = 2, return_inferencedata = True)
Multiprocess sampling (2 chains in 4 jobs) CompoundStep >NUTS: [w, theta, log_sigma] >BinaryGibbsMetropolis: [category]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics
The above procedure will give us posterior samples for $w, \theta, \sigma$, and $z_1,...,z_n$. Let us look at each set of samples individually.
The samples for $w$ are given below:
w_samples = idata.posterior['w'].values.flatten()
print(w_samples)
wmean = np.mean(w_samples)
wstd = np.std(w_samples)
print(f"mean of w: {wmean}")
print(f"SD of w: {wstd}")
[0.08974985 0.10023658 0.16318447 ... 0.08025677 0.21912921 0.12668582] mean of w: 0.1563788527985102 SD of w: 0.07999833790480428
The samples for $\theta$:
theta_samples = idata.posterior['theta'].values.flatten()
theta_est = np.mean(theta_samples) #this will be our estimate of theta
theta_std = np.std(theta_samples) #this will be the uncertainty in the estimate of theta
print(f"Estimate for theta: {theta_est}") 
print(f"Uncertainty in theta (SD): {theta_std}") 
sigma_samples = idata.posterior['sigma'].values.flatten()
sigma_est = np.mean(sigma_samples)
print(f"Estimate for sigma: {sigma_est}") #this will be the estimate of sigma
Estimate for theta: 17.61535873280061 Uncertainty in theta (SD): 0.001599080046578374 Estimate for sigma: 0.006145534327400754
One can check that the estimate of $\theta$ is basically the same as in the previous section when we did not have erroneous observations. Same is true with the uncertainty measure in the estimate of $\theta$. Same is true with the estimate of $\sigma$.
Now let us look at the posterior samples for $z_1,...,z_n$.
print(idata.posterior['category'].shape)
(2, 1000, 17)
For the last observation, we would expect $z_i$ to be mostly equal to one.
category_samples_onepoint = idata.posterior['category'][:, :, 16].values.flatten()
print(category_samples_onepoint)
[1 1 1 ... 1 1 1]
This is indeed the case. This means that the model is able to figure out that this observation is fake.
We can count the number of ones in the posterior samples for $z_i$ for each $i$. This will give us an estimate of the probability that the $i$th observation is fake.
category_samples = idata.posterior['category'].values
combined_samples = category_samples.reshape(-1, n)
category_means = np.mean(combined_samples, axis = 0)
category_sums = np.sum(combined_samples, axis = 0)
print(np.round(category_means))
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
Clearly, the model is able to flag the last two observations as fake. The rest of the observations are labelled legitimate.
You should repeat this analyis by making the erroneous observations to be somewhat close to the actual observations and examine the performance of the model in those challenging situations.
The Gaussian Mixture Model from the previous example is often used to separate datasets into two groups. Here we shall see this in the context of a real dataset on Exoplanets. This dataset is downloaded from the NASA website, and contains data on radius, mass, density, orbital period and star temperature of 517 Exoplanets (exoplanets are planets outside our solar system; planets orbit around a star; the star temperature variable represents the temperature of that star). Of all the variables, we shall only work with the variable radius. The dataset gives radius of each exoplanet as a multiple of the radius of Earth. For example, $\text{radius} = 8$ means that the planet's radius is 8 times the radius of Earth.
dplan = pd.read_csv('exoplanets.csv')
print(dplan.shape)
dplan.head(10)
(517, 6)
| name | orbital_period | mass | radius | star_temperature | density | |
|---|---|---|---|---|---|---|
| 0 | 2MASS J21402931+1625183 A b | 7336.500000 | 6657.910000 | 10.312188 | 2300.0 | NaN | 
| 1 | 55 Cnc e | 0.736539 | 8.078476 | 1.905513 | 5196.0 | 6.400 | 
| 2 | BD+20 594 b | 41.685500 | 16.299962 | 2.230571 | 5766.0 | 7.890 | 
| 3 | CoRoT-1 b | 1.508956 | 327.334000 | 16.701261 | 5950.0 | 0.380 | 
| 4 | CoRoT-10 b | 13.240600 | 873.950000 | 10.872633 | 5075.0 | 3.700 | 
| 5 | CoRoT-11 b | 2.994330 | 740.474000 | 16.028727 | 6440.0 | 0.990 | 
| 6 | CoRoT-12 b | 2.828042 | 291.422600 | 16.140816 | 5675.0 | 0.411 | 
| 7 | CoRoT-13 b | 4.035190 | 415.682400 | 9.919877 | 5945.0 | 2.340 | 
| 8 | CoRoT-14 b | 1.512140 | 2415.280000 | 12.217701 | 6035.0 | 7.300 | 
| 9 | CoRoT-16 b | 5.352270 | 170.023000 | 13.114413 | 5650.0 | 0.440 | 
The interesting aspect of the radius variable is that its histogram seems to show two distinct modes. This is illustrated below:
#Let us plot the histogram of the radii:
plt.figure(figsize=(10, 8))
plt.hist(dplan['radius'], bins = 100)
plt.xlabel('Radius')
plt.ylabel('Frequency')
plt.title('Histogram of Planet Radii')
plt.show()
#Clearly there are two groups of planet radii: Big planets and Small planets. 
On the basis of this histogram, there seem to be two distinct types of planets: "small" (those with small radii) and "large" (those with large radii). This begs the question: can we separate the planets in the dataset into these two groups? Note that it is not easy to do this without any formal analysis because we do quite know where to "cut" the radii. For example, would we say that planets with radii larger than 8 should be considered "large" or should we say that planets with radii larger than 9 should be considered "large"? The Gaussian mixture model can be used to answer this question.
Here is the formal description of the Gaussian Mixture Model in this setting. The observed data points are $y_1, \dots, y_N$. We associate binary random variables $z_1, \dots, z_n$ which serve as group (small or large) indicators. When $z_i = 0$, the observation $y_i$ will be assumed to have the $N(\theta_0, \sigma_0^2)$ distribution for some $\theta_0$ and $\sigma_0$. When $z_i = 1$, the observation $y_i$ will be assumed to have the $N(\theta_1, \sigma_1^2)$ distribution for some other $\theta_1$ and $\sigma_1$. We also model $z_1, \dots, z_n$ as i.i.d $\text{Bernoulli}(w)$ random variables. This model also has the additional parameters $w, \theta_0, \theta_1, \sigma_0, \sigma_1$ (these are referred to as hyperparameters). For these, we assume that \begin{align*} w \sim \text{Uniform}[0, 1] ~~ \text{ and } ~~ \theta_0 \sim N(2, 10^2) ~~ \text{ and } ~~ \theta_1 \sim N(14, 10^2) ~~ \text{ and } ~~ \sigma_0 \sim \text{HalfNormal}(0, 10^2) ~~ \text{ and } ~~ \sigma_1 \sim \text{HalfNormal}(0, 10^2). \end{align*} From the histogram above, it is clear that the "small" planets have mean radius around 2 and the "large" planets have mean radius around 14. This is why we set 2 and 14 as the centers of the normal distributions for $\theta_0$ and $\theta_1$. The large variance (of 100) means that these are not too strongly informative priors. $\text{HalfNormal}(0, \sigma^2)$ distribution is defined as the distribution of $|X|$ where $X \sim N(0, \sigma^2)$. This is a common choice for the distribution of positive random variables such as $\sigma_0$ and $\sigma_1$. Again the variance here is taken to be 100 in order to make these priors uninformative.
This mixture model can be implemented in PyMC as follows.
#Fitting a Gaussian mixture model using pymc
#N = dplan_no_missing.shape[0]
#observed = dplan_no_missing['pl_radj']
N = dplan.shape[0]
exoplanet_model = pm.Model()
with exoplanet_model:
    w = pm.Beta("w", alpha = 1, beta = 1)
    thetas = pm.Normal("thetas", mu = np.array([2, 14]), sigma = 10, shape = 2)
    sigmas = pm.HalfNormal("sigmas", sigma = 10, shape = 2) #The HalfNormal distribution is the absolute value of the Normal
    category = pm.Bernoulli("category", p = w, shape = N)
    mean_individual = thetas[category]
    sd_individual = sigmas[category]
    obs = pm.Normal("obs", mu = mean_individual, sigma = sd_individual, observed = dplan['radius'])
    idata = pm.sample(1000, chains = 2, return_inferencedata = True)
Multiprocess sampling (2 chains in 4 jobs) CompoundStep >NUTS: [w, thetas, sigmas] >BinaryGibbsMetropolis: [category]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 71 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics
PyMC will provide us with posterior samples for $w$, $\theta_0, \theta_1$, $\sigma_0, \sigma_1$ as well as the binary variables $z_1, \dots, z_n$. We can explore each of these posterior samples as follows.
#Calculate posterior means of w, thetas and sigmas:
w_samples = idata.posterior['w'].values.flatten()
wmean = np.mean(w_samples)
thetas_samples = idata.posterior['thetas'].values
combined_thetas_samples = thetas_samples.reshape(-1, 2)
thetas_means = np.mean(combined_thetas_samples, axis = 0)
sigmas_samples = idata.posterior['sigmas'].values
combined_sigmas_samples = sigmas_samples.reshape(-1, 2)
sigmas_means = np.mean(combined_sigmas_samples, axis = 0)
print(f"mean of w: {wmean}")
print(f"mean of theta_0: {thetas_means[0]}")
print(f"mean of theta_1: {thetas_means[1]}")
print(f"mean of sigma_0: {sigmas_means[0]}")
print(f"mean of theta_0: {sigmas_means[1]}")
mean of w: 0.6644072056827766 mean of theta_0: 2.6638927220217274 mean of theta_1: 13.466795467201019 mean of sigma_0: 1.2249132379040955 mean of theta_0: 3.2315425843829293
The posterior mean estimates of the two group means ($\theta_0$ and $\theta_1$) are 2.66 and 13.46. Estimates of $\sigma_0$ and $\sigma_1$ are $1.22$ and $3.23$. Note that your estimates might be slightly different because of randomness in the posterior samples given by PyMC. The estimate of $w$ is 0.66. As the estimate of $w$ is more than 1, it means that we have more large planets compare to small.
Given the hyperparameters $w, \theta_0, \theta_1, \sigma_0, \sigma_1$, the distribution of $y_i$ is given by: \begin{align*} y_i \mid z_i = 0 \sim N(\theta_0, \sigma_0^2) ~~ \text{ and } ~~ y_i \mid z_i = 1 \sim N(\theta_1, \sigma_1^2). \end{align*} We can marginalize these two to obtain the marginal distribution of $y_i$ (without any conditioning of $z_i$) as \begin{align*} y_i \sim (1-w) N(\theta_0, \sigma_0^2) + w N(\theta_1, \sigma_1^2) \end{align*} This density $(1-w) N(\theta_0, \sigma_0^2) + w N(\theta_1, \sigma_1^2)$ is called a normal mixture density, and it can be plotted on the histogram of the data. This plot can be examined visually to check the fit of the model.
#Let us plot the fitted two component normal mixture density on the histogram (to visualize the fit):
from scipy.stats import norm
plt.hist(dplan['radius'], bins = 100, density = True, alpha = 0.6, label = 'Planet Radius Data')
x = np.linspace(min(dplan['radius']), max(dplan['radius']), 1000)
density = wmean * norm.pdf(x, thetas_means[0], sigmas_means[0]) + (1-wmean) * norm.pdf(x, thetas_means[1], sigmas_means[1])
plt.plot(x, density, 'r', label = 'Gaussian Mixture Density')
plt.title('Histogram of Planet Radii with the Fitted Gaussian Mixture Density')
plt.xlabel('Radius')
plt.ylabel('Density')
plt.legend()
plt.show()
The above plot gives a decent fit. It appears though that the Gaussian distribution might not provide a good model for each of the groups. The radius variable is clearly positive-valued. It is common practice to work with logarithms of positive-valued variables while fitting Gaussian distributions. Let us now repeat the analysis (with the same mixture model) on the logarithmic scale.
#We now repeat the analysis on the logarithmic scale
#First let us plot the histogram of log(radius)
data_log = np.log(dplan['radius'])
#Let us plot the histogram of the radii:
plt.figure(figsize=(10, 8))
plt.hist(data_log, bins = 100)
plt.xlabel('log(Radius)')
plt.ylabel('Frequency')
plt.title('Histogram of Logarithm of Planet Radii')
plt.show()
#Clearly there are two groups of planet radii: Big planets and Small planets. 
#Fitting a Gaussian mixture model using pymc
N = len(data_log)
exoplanet_model = pm.Model()
with exoplanet_model:
    w = pm.Beta("w", alpha = 1, beta = 1)
    thetas = pm.Normal("thetas", mu = np.array([1, 2.5]), sigma = 3, shape = 2)
    sigmas = pm.HalfNormal("sigmas", sigma = 3, shape = 2)
    category = pm.Bernoulli("category", p = w, shape = N)
    mean_individual = thetas[category]
    sd_individual = sigmas[category]
    obs = pm.Normal("obs", mu = mean_individual, sigma = sd_individual, observed = data_log)
    idata = pm.sample(1000, chains = 2, return_inferencedata = True)
Multiprocess sampling (2 chains in 4 jobs) CompoundStep >NUTS: [w, thetas, sigmas] >BinaryGibbsMetropolis: [category]
Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 72 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics
In the above, we are using the prior: \begin{align*} w \sim \text{Uniform}[0, 1] ~~ \text{ and } ~~ \theta_0 \sim N(1, 3^2) ~~ \text{ and } ~~ \theta_1 \sim N(2.5, 10^2) ~~ \text{ and } ~~ \sigma_0 \sim \text{HalfNormal}(0, 3^2) ~~ \text{ and } ~~ \sigma_1 \sim \text{HalfNormal}(0, 3^2), \end{align*} because, from the histogram above, it is clear that the "small" planets have mean log(radius) around 1 and the "large" planets have mean log(radius) around 2.5. Also on the logarithmic scale, the values are much smaller so we are using $9 = 3^2$ as an uninformatively large variance for all the normals.
#Calculate posterior means of w, thetas and sigmas:
w_samples = idata.posterior['w'].values.flatten()
wmean = np.mean(w_samples)
thetas_samples = idata.posterior['thetas'].values
combined_thetas_samples = thetas_samples.reshape(-1, 2)
thetas_means = np.mean(combined_thetas_samples, axis = 0)
sigmas_samples = idata.posterior['sigmas'].values
combined_sigmas_samples = sigmas_samples.reshape(-1, 2)
sigmas_means = np.mean(combined_sigmas_samples, axis = 0)
print("Results on the log scale:")
print(f"mean of w: {wmean}")
print(f"mean of theta_0: {thetas_means[0]}")
print(f"mean of theta_1: {thetas_means[1]}")
print(f"mean of sigma_0: {sigmas_means[0]}")
print(f"mean of theta_0: {sigmas_means[1]}")
Results on the log scale: mean of w: 0.6134098346896935 mean of theta_0: 1.0123333806004144 mean of theta_1: 2.6148677968707754 mean of sigma_0: 0.6262416891159611 mean of theta_0: 0.19873751616543645
The estimate of $w$ is now slightly lower at 0.61. We can check the goodness of fit by superimposing the normal mixture density on top of the histogram as follows.
#Let us plot the fitted two component normal mixture density on the histogram (to visualize the fit):
from scipy.stats import norm
plt.hist(data_log, bins = 100, density = True, alpha = 0.6, label = 'Logarithm of Planet Radius Data')
x = np.linspace(min(data_log), max(data_log), 1000)
density = wmean * norm.pdf(x, thetas_means[0], sigmas_means[0]) + (1-wmean) * norm.pdf(x, thetas_means[1], sigmas_means[1])
plt.plot(x, density, 'r', label = 'Gaussian Mixture Density')
plt.title('Histogram of Planet Radii with the Fitted Gaussian Mixture Density')
plt.xlabel('log(Radius)')
plt.ylabel('Density')
plt.legend()
plt.show()
The quality of fit is much better than before. We can next explore the posterior samples for the binary variables $z_1, \dots, z_n$. The mean of these posterior samples provide us with probabilities for each of the planets to be large.
We can plot a histogram of these probabilities.
category_samples = idata.posterior['category'].values
combined_samples = category_samples.reshape(-1, N)
print(combined_samples.shape)
category_means = np.mean(combined_samples, axis = 0)
plt.hist(category_means, density = True)
plt.title('Histogram of Probabilities of each planet being large')
plt.xlabel('Probability')
plt.ylabel('Density')
plt.show()
#For large planets, the corresponding entry in category_means will be large. For small planets, the entry in category_means will be small. 
#If the model is unsure between large and small, it will give a value for category_means which is close to 0.5. 
(2000, 517)
We can now create a classification of planets as small and large but looking at whether category_means is less than 0.5 or more than 0.5.
#Creating a Classification of Data Points based on the posterior mean of the category variable:
binary_class = (category_means > 0.5).astype(int)
dplan['classi'] = binary_class
print(np.mean(dplan['classi']))
0.6286266924564797
In the previous histogram, we can now color the small planets as red and the large planets as blue. This leads to the following histogram.
#Plotting two separate histograms in one plot
radius_zero = dplan[dplan['classi'] == 0]['radius']
radius_one = dplan[dplan['classi'] == 1]['radius']
plt.hist(radius_zero, bins = 100, color = 'red', alpha = 0.5, label = 'Small Radii')
plt.hist(radius_one, bins = 100, color = 'blue', alpha = 0.5, label = 'Large Radii')
plt.title('Histograms')
plt.xlabel('Radius')
plt.ylabel('Frequency')
plt.legend()
plt.show()                   
Where did the model place the cutoff for small and large radii?
print(np.max(radius_zero), np.min(radius_one))
8.1600792 8.3730483
Thus planets with radii above the range 8.16 - 8.37 have been classified as large (and the rest as small) by this Bayesian Model. For context, Jupiter's radius is about 11.2 times that of Earth so it will be classified as a large planet.
What is the posterior probability of the second category (large) when radius equals $8.1600792$ and when radius equals $8.3730483$? We can check the locations of these planets and check their probabilities.
dplan[(dplan['radius'] == 8.1600792) | (dplan['radius'] == 8.3730483)]
| name | orbital_period | mass | radius | star_temperature | density | classi | |
|---|---|---|---|---|---|---|---|
| 262 | Kepler-35 b | 131.45800 | 40.3606 | 8.160079 | 5606.0 | 0.41 | 0 | 
| 317 | Kepler-539 b | 125.63243 | 308.2660 | 8.373048 | 5820.0 | 2.90 | 1 | 
print(category_means[262])
print(category_means[317])
0.426 0.55
These values are quite close to $0.5$ which means that the model is unclear as to whether these should be classified as small or large. For the radius $8.1600792$, it went with "small" with probability $55\%$. For the radius $8.3730483$, it went with "large" with probability $54\%$.
You can do the above analysis with a more up-to-date dataset that contains not 500, but 5000 exoplanets. PyMC will run much slower on such a large dataset, so we did not include it in this analysis. If you are interested, we've provided the dataset exoplanet_data_5000.csv. Feel free to test it our on your own!
We have so far seen several examples of Bayesian Modeling using PyMC. The output here is always in the form of posterior Monte Carlo samples. These samples can be used to approximate the underlying actual posterior distributions. We shall next spend some time exploring some sampling techniques that PyMC uses. We will not go into too much depth here and only study the most basic sampling algorithms.
Rejection sampling is a very old method for obtaining Monte Carlo Samples. Suppose the goal is to obtain samples from a density $f_{\text{Target}}(u)$. Rejection sampling provides an algorithm for generating samples from $f_{\text{Target}}(u)$ using another density $f_{\text{Proposal}}(u)$. This proposal density should satisfy the following two properties:
Rejection Sampling is based on the following very simple (Bayesian) fact. Consider the following Bayesian model: $$ \begin{align*} \text{Prior}: \Theta \sim f_{\text{Proposal}} ~~~ \text{ and } ~~~ \text{Likelihood}: Y \mid \Theta = u \sim \text{Bernoulli} \left(\frac{f_{\text{Target}}(u)}{M f_{\text{Proposal}}(u)} \right) \end{align*} $$ So in this model, the unknown parameter $\Theta$ has the prior density $f_{\text{Proposal}}$. The data is binary and the likelihood is given by the Bernoulli distribution with parameter $\frac{f_{\text{Target}}(u)}{M f_{\text{Proposal}}(u)}.$ Note that the Bernoulli parameter should be always between 0 and 1 which is why we need the condition $f_{\text{Target}}(u) \leq M f_{\text{Proposal}}(u)$ for all values of $u$. The key fact underlying Rejection Sampling is that the posterior density of $\Theta$ in this model given $Y = 1$ is exactly $f_{\text{Target}}$: $$ \begin{align*} \Theta \mid Y = 1 ~~ \sim f_{\text{Target}} \end{align*} $$
This is proved by a simple application of Bayes rule as follows: $$ \begin{align*} f_{\Theta \mid Y = 1}(u) &= \frac{f_{\Theta}(u) \mathbb{P}\{Y = 1 \mid \Theta = u\}}{\int f_{\Theta}(v) \mathbb{P}\{Y = 1 \mid \Theta = v\} dv} \\ &= \frac{f_{\text{Proposal}}(u) \frac{f_{\text{Target}}(u)}{M f_{\text{Proposal}}(u)}}{\int f_{\text{Proposal}}(v) \frac{f_{\text{Target}}(v)}{M f_{\text{Proposal}}(v)} dv} \\ &= \frac{\frac{f_{\text{Target}}(u)}{M}}{\int \frac{f_{\text{Target}}(v)}{M } dv} = f_{\text{Target}}(u). \end{align*} $$
Based on this idea, the algorithm for Rejection Sampling goes as follows.
Here is how this algorithm works for sampling from Beta distributions. Here $f_{\text{Target}}$ is a Beta density and $f_{\text{Proposal}}$ is the uniform density.
#Rejection sampling for generating samples from Beta(4, 1):
#Our proposal distribution will be Uniform[0, 1]
#The value of M can be taken to be the largest value of the density. 
M = 4
N = 20000 #this is the number of proposal samples that we will generate
prior_samples = np.random.rand(N)
p_prior_samples = prior_samples ** 3
Y_samples = np.random.binomial(n = 1, p = p_prior_samples)
posterior_samples = prior_samples[Y_samples == 1]
print(len(posterior_samples))
plt.hist(posterior_samples, bins = 500, density = True, alpha = 0.6, label = 'Rejection Sampling Samples from Beta(4, 1)') 
x = np.linspace(0, 1, 1000)
from scipy.stats import beta
pdf_values = beta.pdf(x, 4, 1)
plt.plot(x, pdf_values, 'r-', label = 'Beta(4, 1) Density')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.title('Histogram of Samples from Rejection Sampling and Beta(4, 1) density')
plt.show()
#The match between the histogram and the true density is not bad but we are only 
#getting about 1/4 of the total samples (others are rejected because of Y = 0).
4895
#Rejection sampling for generating samples from Beta(20, 2):
#Our proposal distribution will be Uniform[0, 1]
#The value of M can be taken to be anything larger than the largest value of the density. 
M = 8
N = 20000 #this is the number of proposal samples that we will generate
prior_samples = np.random.rand(N)
p_prior_samples = 420 * (prior_samples ** 19) * (1 - prior_samples) * (1/8)
Y_samples = np.random.binomial(n = 1, p = p_prior_samples)
posterior_samples = prior_samples[Y_samples == 1]
print(len(posterior_samples))
plt.hist(posterior_samples, bins = 500, density = True, alpha = 0.6, label = 'Rejection Sampling Samples from Beta(20, 2)') 
x = np.linspace(0, 1, 1000)
from scipy.stats import beta
pdf_values = beta.pdf(x, 20, 2)
plt.plot(x, pdf_values, 'r-', label = 'Beta(20, 2) Density')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.title('Histogram of Samples from Rejection Sampling and Beta(20, 2) density')
plt.show()
#The match between the histogram and the true density is not bad but we are only 
#getting about 1/8 of the total samples (others are rejected because of Y = 0).
2506
In general, the marginal probability of $Y = 1$ in the rejection sampling Bayesian model equals $1/M$: \begin{align*} \mathbb{P} \{Y = 1\} &= \int f_{\Theta}(v) \mathbb{P}\{Y = 1 \mid \Theta = v\} dv \\ &= \int f_{\text{Proposal}}(v) \frac{f_{\text{Target}}(v)}{M f_{\text{Proposal}}(v)} dv = \frac{1}{M} \int f_{\text{Target}}(v) dv = \frac{1}{M} \end{align*} So if $N$ samples $(\Theta^{(i)}, Y^{(i)})$ are originally generated from the prior (proposal density), then only about $N/M$ will have $Y^{(i)} = 1$. Thus we can expect to have about $N/M$ samples from the posterior (target density). Thus if $M$ is large, we will have very few samples from the target density. The main trick in Rejection Sampling therefore is to choose the proposal density so that $M$ is not too large (this, of course, may not be always possible because the proposal density also needs to be such that samples can be generated from it easily). This is the reason why Rejection Sampling is inefficient in most practical instances.