{ "cells": [ { "cell_type": "markdown", "id": "da61d6ad", "metadata": {}, "source": [ "# Bayesian Hierarchical Models" ] }, { "cell_type": "code", "execution_count": 67, "id": "6be719f2", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "from IPython.display import YouTubeVideo\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "sns.set()" ] }, { "cell_type": "markdown", "id": "1628e767", "metadata": {}, "source": [ "We've seen so far that a Bayesian approach can be useful in cases where we have prior domain knowledge that we want to incorporate into our model. We've also seen that the effect of choosing a prior depends heavily on how much data we have: the less data we have, the more our conclusions tilt toward the prior.\n", "\n", "In many cases, we don't have much external prior knowledge. Instead, we want to use the larger parts of the dataset to help compensate for the smaller parts. We'll make this (very) abstract idea concrete with an example looking at kidney cancer deaths in the US between 1980 and 1989. The data used in this section, as well as inspiration for the modeling and analysis, comes from [Bayesian Data Analysis](http://www.stat.columbia.edu/~gelman/book/), pp. 47-51. The cleaned version of the data comes from [Robin Ryder](https://github.com/robinryder/BDA-kidney). Note that the dataset suffers from a severe bias: it only contains information on white men. We'll discuss issues with this later in this section.\n", "\n", "We'll walk through the process of setting up a model for this more complex dataset, and along the way we'll see several advantages of Bayesian models and several perspectives on them." ] }, { "cell_type": "markdown", "id": "ee8ef374", "metadata": {}, "source": [ "## Example: Understanding Kidney Cancer Death Risk\n", "\n", "Before we can start modeling, we must first understand the data. 
We'll focus on the following columns:\n", "* `state`: the US state\n", "* `Location`: the county and state as a string\n", "* `fips`: the [FIPS code](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standard_state_code) for each county, a standard identifier that can often be used to join datasets with county-level information\n", "* `dc` and `dc.2`: the number of kidney cancer deaths in 1980-1984 and 1985-1989, respectively\n", "* `pop` and `pop.2`: the population in 1980-1984 and 1985-1989, respectively" ] }, { "cell_type": "code", "execution_count": 6, "id": "5ad47a90", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | state | Location | dc | dc.2 | pop | pop.2 |\n",
"|---|---|---|---|---|---|---|\n",
"| 0 | ALABAMA | Autauga County, Alabama | 2 | 1 | 61921 | 64915 |\n",
"| 1 | ALABAMA | Baldwin County, Alabama | 7 | 15 | 170945 | 195253 |\n",
"| 2 | ALABAMA | Barbour County, Alabama | 0 | 1 | 33316 | 33987 |\n",
"| 3 | ALABAMA | Bibb County, Alabama | 0 | 1 | 30152 | 31175 |\n",
"| 4 | ALABAMA | Blount County, Alabama | 3 | 5 | 88342 | 91547 |\n", "