Today, we shall look at more examples of Bayesian data analyses using PyMC.

In [1]:

```
#Import the necessary libraries:
import arviz as az
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm #PyMC3 is now replaced by PyMC
```

A simple experiment to measure the length of an object led to the following 15 measurements: $17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61.$ What is the best estimate for the length of this object?

In [2]:

```
#Data
n = 15
y_obs = np.array([17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61])
```

In [3]:

```
#Usual Solution:
thetaest = np.mean(y_obs)
thetaerror = np.std(y_obs, ddof=1)/np.sqrt(n)
print(thetaest, thetaerror)
```

17.615333333333336 0.00150132216861293

We now present a Bayesian solution to this problem using PyMC. As in every Bayesian Analysis, we shall assume a prior model for the unknown parameters, and then use a likelihood model to connect the unknown parameters to data. The posterior will then be used for answering the scientific questions. Here is one way of specializing this standard procedure to this problem.

Let $\theta$ denote the unknown true length that we are trying to obtain. We shall assume the prior: $$\theta \in \text{Uniform}[-80, 80]$$ (it is obvious from the measurements that $\theta$ should lie between 0 and 30 so the $\text{uniform}[-80, 80]$ prior is really an "uninformative" prior; uninformative basically means that the prior reveals very little nontrivial information about the parameter).

For the likelihood, we shall assume the Gaussian model: \begin{align*} y_1, \dots, y_n \mid \theta \overset{\text{i.i.d}}{\sim} N(\theta, \sigma^2) \end{align*} Here $\sigma$ is a parameter which characterizes the uncertainty in each individual measurement. We can try to heuristically figure out the value of $\sigma$ from the observed data (say using the standard estimate $\hat{\sigma}$ defined above). A more principled procedure is to also treat it as an unknown parameter and specify a prior for $\sigma$. Here is then a full specification of the prior and likelihood for solving this problem: \begin{align*} \text{Prior}: ~~ \theta \sim \text{Uniform}[-80, 80] ~~ \text{ and } ~~ \log \sigma \sim \text{Uniform}[-10, 10], \end{align*} and likelihood: \begin{align*} y_1, \dots, y_n \mid \theta, \sigma \overset{\text{i.i.d}}{\sim} N(\theta, \sigma^2) \end{align*} Here are a few additional remarks about the prior. As $\sigma$ is a positive-valued parameter, the common practice is to specify the uniform prior on $\log \sigma$. In this problem, it is clear from the measurements that each individual measurement is quite precise which means that $\sigma$ should be small. We still chose the range $[-10, 10]$ for the uniform prior on $\log \sigma$ to be uninformative.

We now specify this model in PyMC.

In [4]:

```
#Bayes Solution
measurement_model = pm.Model()
with measurement_model:
theta = pm.Uniform("theta", lower = -80, upper = 80)
log_sigma = pm.Uniform("log_sigma", lower = -10, upper = 10)
sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
Y = pm.Normal("Y", mu = theta, sigma = sigma, observed=y_obs)
#Sample from posterior:
idata = pm.sample(2000, chains = 2, return_inferencedata = True)
```

Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (2 chains in 4 jobs) NUTS: [theta, log_sigma]

100.00% [6000/6000 00:02<00:00 Sampling 2 chains, 0 divergences]

Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 3 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics

In [5]:

```
theta_samples = idata.posterior['theta'].values.flatten()
log_sigma_samples = idata.posterior['log_sigma'].values.flatten()
#Histogram of posterior theta samples for theta
plt.hist(theta_samples, bins = 500)
plt.xlabel(r'Values of $\theta$')
plt.ylabel('Frequency')
plt.title(r'Histogram of the Posterior $\theta$ samples')
plt.show();
#Histogram of posterior log_sigma samples
plt.hist(log_sigma_samples, bins = 500)
plt.xlabel(r'values of $\log(\sigma)$')
plt.ylabel('Frequency')
plt.title(r'Histogram of the Posterior $\log(\sigma)$ samples')
plt.show();
#Posterior samples for sigma are obtained by exponentiating the log_sigma samples
sigma_samples = np.exp(log_sigma_samples)
plt.hist(sigma_samples, bins = 500)
plt.xlabel(r'values of $\sigma$')
plt.ylabel('Frequency')
plt.title(r'Histogram of the Posterior $\sigma$ samples')
plt.show();
```

In [6]:

```
#Our best estimate of the unknown length can be taken to be the mean of the posterior samples for theta:
thetamean = np.mean(theta_samples)
display(thetamean)
#Our uncertainty in theta can be captured by the standard deviation of the posterior samples
thetastd = np.std(theta_samples)
display(thetastd)
#Our best estimate for sigma (which can be interpreted as the uncertainty in each individual measurement) can be obtained
#by taking the mean of the posterior samples for sigma:
sigmamean = np.mean(sigma_samples)
display(sigmamean)
```

17.615357797323185

0.0015931447991204468

0.006142381305189109

In [7]:

```
display([thetamean, thetaest])
display([thetastd, thetaerror])
```

[17.615357797323185, 17.615333333333336]

[0.0015931447991204468, 0.00150132216861293]

In [8]:

```
#A 95% interval for theta based on the posterior samples is computed as follows:
lower_limit_theta = np.percentile(theta_samples, 2.5)
upper_limit_theta = np.percentile(theta_samples, 97.5)
display([lower_limit_theta, upper_limit_theta])
#Similarly, A 95% interval for sigma based on the posterior samples is computed as follows:
lower_limit_sigma = np.percentile(sigma_samples, 2.5)
upper_limit_sigma = np.percentile(sigma_samples, 97.5)
display([lower_limit_sigma, upper_limit_sigma])
```

[17.612289074482923, 17.61846644701896]

[0.0043205952568492155, 0.008952565464264876]

A slight arbitrariness in the above analysis comes from the specific bounds $[-80, 80]$ and $[-10, 10]$ used in the prior distributions: $\theta \in \text{Uniform}[-80, 80]$ and $\log \sigma \sim \text{Uniform}[-10, 10]$. The point of these priors is not be too informative so the posterior is largely determined by the data and the likelihood. One can repeat the analysis with other choices of the bounds and see how the results change. Another option is to just set the bounds extremely large. In other words, we take \begin{align*} \theta \sim \text{Uniform}[-C, C] ~~ \text{ and } ~~ \log \sigma \sim \text{Uniform}[-C, C] \end{align*} were $C$ is very large. If we directly plug this model into PyMC with a large value of $C$, the posterior sample generation method might run into numerical issues leading to errors (for example, try the code below). In general, one sets the bounds to be uninformative while not being too large leading to numerical issues.

You can uncomment the following code and see that `PyMC`

throws an error.

In [9]:

```
#In the following, we take C = 10000
#Uncomment the code to run
# measurement_model = pm.Model()
# with measurement_model:
# theta = pm.Uniform("theta", lower = -10000, upper = 10000)
# log_sigma = pm.Uniform("log_sigma", lower = -10000, upper = 10000)
# sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
# Y = pm.Normal("Y", mu = theta, sigma = sigma, observed=y_obs)
# #Sample from posterior:
# idata = pm.sample(2000, chains = 2, return_inferencedata = True)
#This model with C extremely large is throwing an error.
```

Consider the same problem as before but now we have two additional observations. The data now is $17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61, 25, 31$. Again the goal is to obtain the best estimate for the length of this object.

Looking at the observations, one can clearly see that the new observations might have resulted from mistakes (either in recording or in the instrument). If we repeat the same process as before with this new set of data, we will get answers that are quite different from the previous analysis.

In [10]:

```
y_obs = np.array([17.62, 17.61, 17.61, 17.62, 17.62, 17.615, 17.615, 17.625, 17.61, 17.62, 17.62, 17.605, 17.61, 17.62, 17.61, 25, 31])
n = len(y_obs)
print(n)
```

17

In [11]:

```
#Usual Solution:
thetaest = np.mean(y_obs)
thetaerror = np.std(y_obs, ddof=1)/np.sqrt(n)
print(thetaest, thetaerror)
```

18.837058823529414 0.8751232117897164

In [12]:

```
measurement_model = pm.Model()
with measurement_model:
theta = pm.Uniform("theta", -80, 80)
log_sigma = pm.Uniform("log_sigma", -10, 10)
sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
Y = pm.Normal("Y", mu = theta, sigma = sigma, observed=y_obs)
#Sample from posterior:
idata = pm.sample(2000, chains = 2, return_inferencedata = True)
theta_samples = idata.posterior['theta'].values.flatten()
log_sigma_samples = idata.posterior['log_sigma'].values.flatten()
sigma_samples = np.exp(log_sigma_samples)
#Our best estimate of the unknown length can be taken to be the mean of the posterior samples for theta:
thetamean = np.mean(theta_samples)
display([thetamean, np.mean(y_obs)])
#Our uncertainty in theta can be captured by the standard deviation of the posterior samples
thetastd = np.std(theta_samples)
display(thetastd)
```

100.00% [6000/6000 00:02<00:00 Sampling 2 chains, 0 divergences]

Sampling 2 chains for 1_000 tune and 2_000 draw iterations (2_000 + 4_000 draws total) took 2 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics

[18.817034942559157, 18.837058823529414]

0.9416332348601021

**Gaussian (or Normal) Mixture Model** because of the assumption that the data points $y_i$ are a mixture of two normals: $N(\theta, \sigma^2)$ for the legitimate observations, and $N(0, 100^2)$ for the fake observations. This model is implemented in PyMC as follows. The posterior samples for $z_i$ can be used to detect whether the $i^{th}$ observation is legitimate or fake (indeed, if the majority of the samples for $z_i$ equal 0, then the observation can be termed legitimate, and if the majority of $z_i$ samples equal 1, then the observation can be termed fake). The estimate and uncertainty in $\theta$ can be obtained in the usual way based on the samples for $\theta$.

In [13]:

```
measurement_somefake_model = pm.Model()
with measurement_somefake_model:
w = pm.Beta("w", alpha = 1, beta = 1)
theta = pm.Uniform("theta", lower = -80, upper = 80)
log_sigma = pm.Uniform("log_sigma", lower = -10, upper = 10)
sigma = pm.Deterministic("sigma", pm.math.exp(log_sigma))
theta2 = 0
sigma2 = 100
thetas = pm.math.stack([theta, theta2])
sigmas = pm.math.stack([sigma, sigma2])
category = pm.Bernoulli("category", p = w, shape = n)
mean_idx = thetas[category]
sd_idx = sigmas[category]
obs = pm.Normal("obs", mu = mean_idx, sigma = sd_idx, observed = y_obs)
idata = pm.sample(1000, chains = 2, return_inferencedata = True)
```

Multiprocess sampling (2 chains in 4 jobs) CompoundStep >NUTS: [w, theta, log_sigma] >BinaryGibbsMetropolis: [category]

100.00% [4000/4000 00:07<00:00 Sampling 2 chains, 0 divergences]

Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 7 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics

The above procedure will give us posterior samples for $w, \theta, \sigma$, and $z_1,...,z_n$. Let us look at each set of samples individually.

The samples for $w$ are given below:

In [19]:

```
w_samples = idata.posterior['w'].values.flatten()
print(w_samples)
wmean = np.mean(w_samples)
wstd = np.std(w_samples)
print(f"mean of w: {wmean}")
print(f"SD of w: {wstd}")
```

The samples for $\theta$:

In [16]:

```
theta_samples = idata.posterior['theta'].values.flatten()
theta_est = np.mean(theta_samples) #this will be our estimate of theta
theta_std = np.std(theta_samples) #this will be the uncertainty in the estimate of theta
print(f"Estimate for theta: {theta_est}")
print(f"Uncertainty in theta (SD): {theta_std}")
sigma_samples = idata.posterior['sigma'].values.flatten()
sigma_est = np.mean(sigma_samples)
print(f"Estimate for sigma: {sigma_est}") #this will be the estimate of sigma
```

One can check that the estimate of $\theta$ is basically the same as in the previous section when we did not have erroneous observations. Same is true with the uncertainty measure in the estimate of $\theta$. Same is true with the estimate of $\sigma$.

Now let us look at the posterior samples for $z_1,...,z_n$.

In [17]:

```
print(idata.posterior['category'].shape)
```

(2, 1000, 17)

For the last observation, we would expect $z_i$ to be mostly equal to one.

In [18]:

```
category_samples_onepoint = idata.posterior['category'][:, :, 16].values.flatten()
print(category_samples_onepoint)
```

[1 1 1 ... 1 1 1]

This is indeed the case. This means that the model is able to figure out that this observation is fake.

We can count the number of ones in the posterior samples for $z_i$ for each $i$. This will give us an estimate of the probability that the $i$th observation is fake.

In [21]:

```
category_samples = idata.posterior['category'].values
combined_samples = category_samples.reshape(-1, n)
category_means = np.mean(combined_samples, axis = 0)
category_sums = np.sum(combined_samples, axis = 0)
print(np.round(category_means))
```

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]

Clearly, the model is able to flag the last two observations as fake. The rest of the observations are labelled legitimate.

You should repeat this analyis by making the erroneous observations to be somewhat close to the actual observations and examine the performance of the model in those challenging situations.

The Gaussian Mixture Model from the previous example is often used to separate datasets into two groups. Here we shall see this in the context of a real dataset on Exoplanets. This dataset is downloaded from the NASA website, and contains data on radius, mass, density, orbital period and star temperature of 517 Exoplanets (exoplanets are planets outside our solar system; planets orbit around a star; the star temperature variable represents the temperature of that star). Of all the variables, we shall only work with the variable radius. The dataset gives radius of each exoplanet as a multiple of the radius of Earth. For example, $\text{radius} = 8$ means that the planet's radius is 8 times the radius of Earth.

In [24]:

```
dplan = pd.read_csv('exoplanets.csv')
print(dplan.shape)
dplan.head(10)
```

(517, 6)

Out[24]:

name | orbital_period | mass | radius | star_temperature | density | |
---|---|---|---|---|---|---|

0 | 2MASS J21402931+1625183 A b | 7336.500000 | 6657.910000 | 10.312188 | 2300.0 | NaN |

1 | 55 Cnc e | 0.736539 | 8.078476 | 1.905513 | 5196.0 | 6.400 |

2 | BD+20 594 b | 41.685500 | 16.299962 | 2.230571 | 5766.0 | 7.890 |

3 | CoRoT-1 b | 1.508956 | 327.334000 | 16.701261 | 5950.0 | 0.380 |

4 | CoRoT-10 b | 13.240600 | 873.950000 | 10.872633 | 5075.0 | 3.700 |

5 | CoRoT-11 b | 2.994330 | 740.474000 | 16.028727 | 6440.0 | 0.990 |

6 | CoRoT-12 b | 2.828042 | 291.422600 | 16.140816 | 5675.0 | 0.411 |

7 | CoRoT-13 b | 4.035190 | 415.682400 | 9.919877 | 5945.0 | 2.340 |

8 | CoRoT-14 b | 1.512140 | 2415.280000 | 12.217701 | 6035.0 | 7.300 |

9 | CoRoT-16 b | 5.352270 | 170.023000 | 13.114413 | 5650.0 | 0.440 |

In [25]:

```
#Let us plot the histogram of the radii:
plt.figure(figsize=(10, 8))
plt.hist(dplan['radius'], bins = 100)
plt.xlabel('Radius')
plt.ylabel('Frequency')
plt.title('Histogram of Planet Radii')
plt.show()
#Clearly there are two groups of planet radii: Big planets and Small planets.
```

Here is the formal description of the Gaussian Mixture Model in this setting. The observed data points are $y_1, \dots, y_N$. We associate binary random variables $z_1, \dots, z_n$ which serve as group (small or large) indicators. When $z_i = 0$, the observation $y_i$ will be assumed to have the $N(\theta_0, \sigma_0^2)$ distribution for some $\theta_0$ and $\sigma_0$. When $z_i = 1$, the observation $y_i$ will be assumed to have the $N(\theta_1, \sigma_1^2)$ distribution for some other $\theta_1$ and $\sigma_1$. We also model $z_1, \dots, z_n$ as i.i.d $\text{Bernoulli}(w)$ random variables. This model also has the additional parameters $w, \theta_0, \theta_1, \sigma_0, \sigma_1$ (these are referred to as hyperparameters). For these, we assume that \begin{align*} w \sim \text{Uniform}[0, 1] ~~ \text{ and } ~~ \theta_0 \sim N(2, 10^2) ~~ \text{ and } ~~ \theta_1 \sim N(14, 10^2) ~~ \text{ and } ~~ \sigma_0 \sim \text{HalfNormal}(0, 10^2) ~~ \text{ and } ~~ \sigma_1 \sim \text{HalfNormal}(0, 10^2). \end{align*} From the histogram above, it is clear that the "small" planets have mean radius around 2 and the "large" planets have mean radius around 14. This is why we set 2 and 14 as the centers of the normal distributions for $\theta_0$ and $\theta_1$. The large variance (of 100) means that these are not too strongly informative priors. $\text{HalfNormal}(0, \sigma^2)$ distribution is defined as the distribution of $|X|$ where $X \sim N(0, \sigma^2)$. This is a common choice for the distribution of positive random variables such as $\sigma_0$ and $\sigma_1$. Again the variance here is taken to be 100 in order to make these priors uninformative.

This mixture model can be implemented in PyMC as follows.

In [26]:

```
#Fitting a Gaussian mixture model using pymc
#N = dplan_no_missing.shape[0]
#observed = dplan_no_missing['pl_radj']
N = dplan.shape[0]
exoplanet_model = pm.Model()
with exoplanet_model:
w = pm.Beta("w", alpha = 1, beta = 1)
thetas = pm.Normal("thetas", mu = np.array([2, 14]), sigma = 10, shape = 2)
sigmas = pm.HalfNormal("sigmas", sigma = 10, shape = 2) #The HalfNormal distribution is the absolute value of the Normal
category = pm.Bernoulli("category", p = w, shape = N)
mean_individual = thetas[category]
sd_individual = sigmas[category]
obs = pm.Normal("obs", mu = mean_individual, sigma = sd_individual, observed = dplan['radius'])
idata = pm.sample(1000, chains = 2, return_inferencedata = True)
```

Multiprocess sampling (2 chains in 4 jobs) CompoundStep >NUTS: [w, thetas, sigmas] >BinaryGibbsMetropolis: [category]

100.00% [4000/4000 01:10<00:00 Sampling 2 chains, 0 divergences]

Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 71 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics

In [30]:

```
#Calculate posterior means of w, thetas and sigmas:
w_samples = idata.posterior['w'].values.flatten()
wmean = np.mean(w_samples)
thetas_samples = idata.posterior['thetas'].values
combined_thetas_samples = thetas_samples.reshape(-1, 2)
thetas_means = np.mean(combined_thetas_samples, axis = 0)
sigmas_samples = idata.posterior['sigmas'].values
combined_sigmas_samples = sigmas_samples.reshape(-1, 2)
sigmas_means = np.mean(combined_sigmas_samples, axis = 0)
print(f"mean of w: {wmean}")
print(f"mean of theta_0: {thetas_means[0]}")
print(f"mean of theta_1: {thetas_means[1]}")
print(f"mean of sigma_0: {sigmas_means[0]}")
print(f"mean of theta_0: {sigmas_means[1]}")
```

The posterior mean estimates of the two group means ($\theta_0$ and $\theta_1$) are 2.66 and 13.46. Estimates of $\sigma_0$ and $\sigma_1$ are $1.22$ and $3.23$. Note that your estimates might be slightly different because of randomness in the posterior samples given by PyMC. The estimate of $w$ is 0.66. As the estimate of $w$ is more than 1, it means that we have more large planets compare to small.

Given the hyperparameters $w, \theta_0, \theta_1, \sigma_0, \sigma_1$, the distribution of $y_i$ is given by: \begin{align*} y_i \mid z_i = 0 \sim N(\theta_0, \sigma_0^2) ~~ \text{ and } ~~ y_i \mid z_i = 1 \sim N(\theta_1, \sigma_1^2). \end{align*} We can marginalize these two to obtain the marginal distribution of $y_i$ (without any conditioning of $z_i$) as \begin{align*} y_i \sim (1-w) N(\theta_0, \sigma_0^2) + w N(\theta_1, \sigma_1^2) \end{align*} This density $(1-w) N(\theta_0, \sigma_0^2) + w N(\theta_1, \sigma_1^2)$ is called a normal mixture density, and it can be plotted on the histogram of the data. This plot can be examined visually to check the fit of the model.

In [31]:

```
#Let us plot the fitted two component normal mixture density on the histogram (to visualize the fit):
from scipy.stats import norm
plt.hist(dplan['radius'], bins = 100, density = True, alpha = 0.6, label = 'Planet Radius Data')
x = np.linspace(min(dplan['radius']), max(dplan['radius']), 1000)
density = wmean * norm.pdf(x, thetas_means[0], sigmas_means[0]) + (1-wmean) * norm.pdf(x, thetas_means[1], sigmas_means[1])
plt.plot(x, density, 'r', label = 'Gaussian Mixture Density')
plt.title('Histogram of Planet Radii with the Fitted Gaussian Mixture Density')
plt.xlabel('Radius')
plt.ylabel('Density')
plt.legend()
plt.show()
```

In [32]:

```
#We now repeat the analysis on the logarithmic scale
#First let us plot the histogram of log(radius)
data_log = np.log(dplan['radius'])
#Let us plot the histogram of the radii:
plt.figure(figsize=(10, 8))
plt.hist(data_log, bins = 100)
plt.xlabel('log(Radius)')
plt.ylabel('Frequency')
plt.title('Histogram of Logarithm of Planet Radii')
plt.show()
#Clearly there are two groups of planet radii: Big planets and Small planets.
```

In [33]:

```
#Fitting a Gaussian mixture model using pymc
N = len(data_log)
exoplanet_model = pm.Model()
with exoplanet_model:
w = pm.Beta("w", alpha = 1, beta = 1)
thetas = pm.Normal("thetas", mu = np.array([1, 2.5]), sigma = 3, shape = 2)
sigmas = pm.HalfNormal("sigmas", sigma = 3, shape = 2)
category = pm.Bernoulli("category", p = w, shape = N)
mean_individual = thetas[category]
sd_individual = sigmas[category]
obs = pm.Normal("obs", mu = mean_individual, sigma = sd_individual, observed = data_log)
idata = pm.sample(1000, chains = 2, return_inferencedata = True)
```

100.00% [4000/4000 01:11<00:00 Sampling 2 chains, 0 divergences]

Sampling 2 chains for 1_000 tune and 1_000 draw iterations (2_000 + 2_000 draws total) took 72 seconds. We recommend running at least 4 chains for robust computation of convergence diagnostics

In [34]:

```
#Calculate posterior means of w, thetas and sigmas:
w_samples = idata.posterior['w'].values.flatten()
wmean = np.mean(w_samples)
thetas_samples = idata.posterior['thetas'].values
combined_thetas_samples = thetas_samples.reshape(-1, 2)
thetas_means = np.mean(combined_thetas_samples, axis = 0)
sigmas_samples = idata.posterior['sigmas'].values
combined_sigmas_samples = sigmas_samples.reshape(-1, 2)
sigmas_means = np.mean(combined_sigmas_samples, axis = 0)
print("Results on the log scale:")
print(f"mean of w: {wmean}")
print(f"mean of theta_0: {thetas_means[0]}")
print(f"mean of theta_1: {thetas_means[1]}")
print(f"mean of sigma_0: {sigmas_means[0]}")
print(f"mean of theta_0: {sigmas_means[1]}")
```

In [35]:

```
#Let us plot the fitted two component normal mixture density on the histogram (to visualize the fit):
from scipy.stats import norm
plt.hist(data_log, bins = 100, density = True, alpha = 0.6, label = 'Logarithm of Planet Radius Data')
x = np.linspace(min(data_log), max(data_log), 1000)
density = wmean * norm.pdf(x, thetas_means[0], sigmas_means[0]) + (1-wmean) * norm.pdf(x, thetas_means[1], sigmas_means[1])
plt.plot(x, density, 'r', label = 'Gaussian Mixture Density')
plt.title('Histogram of Planet Radii with the Fitted Gaussian Mixture Density')
plt.xlabel('log(Radius)')
plt.ylabel('Density')
plt.legend()
plt.show()
```

The quality of fit is much better than before. We can next explore the posterior samples for the binary variables $z_1, \dots, z_n$. The mean of these posterior samples provide us with probabilities for each of the planets to be large.

We can plot a histogram of these probabilities.

In [37]:

```
category_samples = idata.posterior['category'].values
combined_samples = category_samples.reshape(-1, N)
print(combined_samples.shape)
category_means = np.mean(combined_samples, axis = 0)
plt.hist(category_means, density = True)
plt.title('Histogram of Probabilities of each planet being large')
plt.xlabel('Probability')
plt.ylabel('Density')
plt.show()
#For large planets, the corresponding entry in category_means will be large. For small planets, the entry in category_means will be small.
#If the model is unsure between large and small, it will give a value for category_means which is close to 0.5.
```

(2000, 517)

In [38]:

```
#Creating a Classification of Data Points based on the posterior mean of the category variable:
binary_class = (category_means > 0.5).astype(int)
dplan['classi'] = binary_class
print(np.mean(dplan['classi']))
```

0.6286266924564797

In [39]:

```
#Plotting two separate histograms in one plot
radius_zero = dplan[dplan['classi'] == 0]['radius']
radius_one = dplan[dplan['classi'] == 1]['radius']
plt.hist(radius_zero, bins = 100, color = 'red', alpha = 0.5, label = 'Small Radii')
plt.hist(radius_one, bins = 100, color = 'blue', alpha = 0.5, label = 'Large Radii')
plt.title('Histograms')
plt.xlabel('Radius')
plt.ylabel('Frequency')
plt.legend()
plt.show()
```

Where did the model place the cutoff for small and large radii?

In [40]:

```
print(np.max(radius_zero), np.min(radius_one))
```

8.1600792 8.3730483

Thus planets with radii above the range 8.16 - 8.37 have been classified as large (and the rest as small) by this Bayesian Model. For context, Jupiter's radius is about 11.2 times that of Earth so it will be classified as a large planet.

What is the posterior probability of the second category (large) when radius equals $8.1600792$ and when radius equals $8.3730483$? We can check the locations of these planets and check their probabilities.

In [43]:

```
dplan[(dplan['radius'] == 8.1600792) | (dplan['radius'] == 8.3730483)]
```

Out[43]:

name | orbital_period | mass | radius | star_temperature | density | classi | |
---|---|---|---|---|---|---|---|

262 | Kepler-35 b | 131.45800 | 40.3606 | 8.160079 | 5606.0 | 0.41 | 0 |

317 | Kepler-539 b | 125.63243 | 308.2660 | 8.373048 | 5820.0 | 2.90 | 1 |

In [44]:

```
print(category_means[262])
print(category_means[317])
```

0.426 0.55

These values are quite close to $0.5$ which means that the model is unclear as to whether these should be classified as small or large. For the radius $8.1600792$, it went with "small" with probability $55\%$. For the radius $8.3730483$, it went with "large" with probability $54\%$.

You can do the above analysis with a more up-to-date dataset that contains not 500, but 5000 exoplanets. `PyMC`

will run much slower on such a large dataset, so we did not include it in this analysis. If you are interested, we've provided the dataset `exoplanet_data_5000.csv`

. Feel free to test it our on your own!

Rejection sampling is a very old method for obtaining Monte Carlo Samples. Suppose the goal is to obtain samples from a density $f_{\text{Target}}(u)$. Rejection sampling provides an algorithm for generating samples from $f_{\text{Target}}(u)$ using another density $f_{\text{Proposal}}(u)$. This proposal density should satisfy the following two properties:

- Obtaining samples from $f_{\text{Proposal}}(u)$ should be easy.
- There must exist a positive real number $M$ such that $$f_{\text{Target}}(u) \leq M f_{\text{Proposal}}(u)$$ for all values of $u$. Although the method will work, in principle, for any positive $M$ satisfying the above condition, it becomes extremely inefficient if $M$ gets large as will be clear below. So we want $M$ to satisfy the above condition and also to be not too large.

Based on this idea, the algorithm for Rejection Sampling goes as follows.

- Generate lots of samples $(\Theta^{(i)}, Y^{(i)}$ from the probability model described above. In other words, take $\Theta^{(i)} \sim f_{\text{Proposal}}$ and $Y^{(i)} \mid \Theta = \Theta^{(i)} \sim \text{Bernoulli} \left(\frac{f_{\text{Target}}(\Theta^{(i)})}{M f_{\text{Proposal}}(\Theta^{(i)})} \right)$
- Based on the samples, approximate the distribution of $\Theta \mid Y = 1$. This is done by discarding (or
**rejecting**) the samples for which $Y^{(i)} = 0$ and keeping only the ones with $Y^{(i)} = 1$. The $\Theta^{(i)}$'s in the remaining samples approximate $f_{\text{Target}}$.

Here is how this algorithm works for sampling from Beta distributions. Here $f_{\text{Target}}$ is a Beta density and $f_{\text{Proposal}}$ is the uniform density.

In [45]:

```
#Rejection sampling for generating samples from Beta(4, 1):
#Our proposal distribution will be Uniform[0, 1]
#The value of M can be taken to be the largest value of the density.
M = 4
N = 20000 #this is the number of proposal samples that we will generate
prior_samples = np.random.rand(N)
p_prior_samples = prior_samples ** 3
Y_samples = np.random.binomial(n = 1, p = p_prior_samples)
posterior_samples = prior_samples[Y_samples == 1]
print(len(posterior_samples))
plt.hist(posterior_samples, bins = 500, density = True, alpha = 0.6, label = 'Rejection Sampling Samples from Beta(4, 1)')
x = np.linspace(0, 1, 1000)
from scipy.stats import beta
pdf_values = beta.pdf(x, 4, 1)
plt.plot(x, pdf_values, 'r-', label = 'Beta(4, 1) Density')
plt.xlabel('Value')
plt.ylabel('Density')
plt.legend()
plt.title('Histogram of Samples from Rejection Sampling and Beta(4, 1) density')
plt.show()
#The match between the histogram and the true density is not bad but we are only
#getting about 1/4 of the total samples (others are rejected because of Y = 0).
```

4895

In [46]:

```
#Rejection sampling for generating samples from Beta(20, 2):
#Our proposal distribution will be Uniform[0, 1]
#The value of M can be taken to be anything larger than the largest value of the density.
M = 8
N = 20000 #this is the number of proposal samples that we will generate
prior_samples = np.random.rand(N)
p_prior_samples = 420 * (prior_samples ** 19) * (1 - prior_samples) * (1/8)
Y_samples = np.random.binomial(n = 1, p = p_prior_samples)
posterior_samples = prior_samples[Y_samples == 1]
print(len(posterior_samples))
plt.hist
```