Prediction

You may find it helpful to review Chapter 15 of the Data 8 textbook, which looks at prediction using linear regression and \(k\)-nearest neighbors.

In many cases, our goal is prediction: to predict one variable from several others. We’ll usually call the variable we’re trying to predict \(y\), and the variables we predict from \(x\). Here are a few examples of prediction in real-world settings:

  • A subscription-based company might be interested in predicting whether a certain customer will renew their subscription next month (\(y\)), using demographic information, historical usage data, and general market trends (\(x\)). This information can help the company decide whether to send promotions to that customer, or, by aggregating many such predictions, estimate how much revenue it will bring in over the next month.

  • A YouTube content producer might be interested in predicting the number of views on their next video (\(y\)), using information about their YouTube channel (follower count, etc.) and information about the video itself (length, number of guest stars, production cost, etc.) (\(x\)). This information could be useful when negotiating with advertisers, who want to know how many people will see their ad if it appears in the video.

In each of these examples, we have multiple predictors \(x\) (usually written as \(x_1, x_2, \ldots, x_d\), or just as a vector \(x\)), and we’re predicting a single target variable \(y\) (usually scalar-valued). The prediction target can be binary (such as whether a customer renews), otherwise discrete, or continuous.

Here’s the general framework that we’ll use: we’ll start by assuming that we have some pairs of known examples, \((x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), where each pair contains a vector \(x_i\) of predictors for data point \(i\), and a known value \(y_i\) of the prediction target for that data point. In the examples above, this might be historical information on customers who did and didn’t renew in the past, or data on previously released videos with known view counts and channel information at the time of release. We’ll use these points to learn a relationship between \(x\) and \(y\), and then apply what we learned to new points \(x_{n+1}, x_{n+2}, \ldots\). Any points we use to learn that relationship are referred to as the training set.
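
To make this workflow concrete, here’s a minimal sketch in Python. The data values, the choice of predictors, and the use of scikit-learn’s `LinearRegression` are all just for illustration; any model that can be fit to a training set and then asked to predict on new points follows the same fit-then-predict pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training set: n = 5 past videos, d = 2 predictors each
# (say, follower count in thousands and video length in minutes).
X_train = np.array([[120, 10],
                    [150, 12],
                    [ 90,  8],
                    [200, 15],
                    [170, 11]])
y_train = np.array([40_000, 55_000, 25_000, 80_000, 60_000])  # known view counts

# Learn a relationship between x and y from the training set.
model = LinearRegression()
model.fit(X_train, y_train)

# Apply what we learned to new points x_{n+1}, x_{n+2}, ...
X_new = np.array([[180, 13],
                  [100,  9]])
print(model.predict(X_new))  # predicted view counts for the two new videos
```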

You’ve already seen this pattern used several times before:

  • …in linear regression, where we fit or train a linear model using \((x_1, y_1), \ldots, (x_n, y_n)\) and then use the learned coefficients to make predictions on new data points.

  • …in \(k\)-nearest neighbors classification, where we store the entire training set, and then when classifying (i.e., predicting classification labels for) new points, we find the \(k\) training points closest to the new point and use a majority vote of their labels as our prediction (sketched below).
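
Here’s a minimal, hand-rolled sketch of that second pattern using NumPy. The data and the choice of \(k\) are made up for illustration; in practice you might instead use a library implementation such as scikit-learn’s `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict a label for x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical training set: 2 predictors, binary labels (e.g., renewed or not).
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0], [1.5, 1.5]])
y_train = np.array([0, 0, 1, 1, 0])

print(knn_predict(X_train, y_train, np.array([2.0, 2.0]), k=3))  # prints 0
```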

Note that prediction is not necessarily the same as causality! A YouTuber might find that the view count on their most recent video is a strong predictor of the view count for their next video, but that doesn’t necessarily mean one causes the other. That doesn’t make it any less useful for prediction, though. So, for now, we’ll limit ourselves to the world of making and understanding predictions, and avoid drawing any conclusions about causality or the lack thereof. In the next chapter, we’ll build a better understanding of the distinction between the two, and discover when we can in fact reason about causality.