Resources
We will post all lecture materials on the course syllabus. They will also be listed in the following publicly visible GitHub repo.
Here is a collection of resources that will help you learn more about various concepts and skills covered in the class. Learning by reading is a key part of being a well-rounded data scientist. We will not assign mandatory reading but instead encourage you to look at these and other materials. If you find something helpful, post it on Piazza, and consider contributing it to the course website.
You can send us changes to the course website by forking the course website GitHub repository and sending a pull request. You will then become part of the history of the DS102 class at Berkeley.
Web References
As a data scientist, you will often need to search for information on various libraries and tools. In this class we will be using several key Python libraries. Here are their documentation pages:
- The Bash Command Line:
- Linux and Bash: Intro to Linux, Cloud Computing (which you can skip for the purposes of this class), and the Bash command line. You can skip all portions that don’t pertain to using the command line.
- Bash Part 2: Part 2 of the intro to command line.
- Python:
- Python Tutorial: Teach yourself Python. This is a pretty comprehensive tutorial.
- Python + NumPy Tutorial: This tutorial provides a great overview of a lot of the functionality we will be using in DS102.
- Python 101: A notebook demonstrating a lot of Python functionality with some (minimal) explanation.
- Plotting:
- matplotlib.pyplot tutorial: This short tutorial provides an overview of the basic plotting utilities we will be using.
- seaborn: The Seaborn library has some nice additional visualization functions that we may use occasionally.
- Pandas:
- The Pandas Cookbook: This provides a nice overview of some of the basic Pandas functions. However, it is slightly out of date.
- Learn Pandas: A set of lessons providing an overview of the Pandas library.
- Python for Data Science: Another set of notebooks demonstrating Pandas functionality.
- Git:
- Getting Started with Git: A tutorial on version control and Git.
- Git Reference: A condensed version of Git instructions.
- Understanding the Git Flow: This will give you a better idea of how Git projects work.
- Learning about Branches: This is a perhaps overly interactive tutorial that some people might find helpful.
- Explaining Git with D3
Books
Because data science is a relatively new and rapidly evolving discipline, there is no single ideal textbook for this subject. Instead, we plan to use readings from a collection of books, all of which are free. We have also listed a few optional books that provide additional context for those who are interested.
- Principles and Techniques of Data Science: This is the accompanying textbook written for the DS102 course.
- Introduction to Statistical Learning (Free online PDF): This book is a great reference for the machine learning and some of the statistics material in the class.
- Data Science from Scratch (Available as eBook for Berkeley students): This more applied book covers many of the topics in this class using Python but doesn’t go into sufficient depth for some of the more mathematical material.
- Doing Data Science (Available as eBook for Berkeley students): This book provides a unique case-study view of data science but uses R and not Python.
- Python for Data Analysis (Available as eBook for Berkeley students): This book provides a good reference for the Pandas library.
Reading Resources
- Matrix Cookbook: This “cookbook” is a handy collection of facts about linear algebra and matrices.
- All of Statistics: This book is a great, broad introduction to mathematical statistics. It begins with probability concepts (e.g. Bayes’ theorem), works through many statistical inference topics (e.g. hypothesis testing, decision theory, and the bootstrap), and also includes statistical modeling (e.g. regression and causal inference). The textbook as a whole covers many more ideas from statistics than will be used in or needed for this course, but students may still find it useful to reference specific topics within it to supplement ideas covered in lecture or review ideas from previous courses. For example:
- Chapters 1-3 review some background ideas about probability and random variables
- Chapter 12 discusses the statistical decision theory framework
- Section 9.3 reviews maximum likelihood estimation, while the first few sections of Chapter 11 review the core idea behind Bayesian inference
- Sections 10.2, 10.6, and 10.7 cover p-values, the likelihood ratio test, and multiple testing ideas
- Chapter 13 covers linear and logistic regression
- Chapters 7-8 review empirical distributions and the bootstrap
- Chapter 16 covers causal inference
- Computer Age Statistical Inference: This book takes a fairly modern view of statistics, often examining the influence of computation on the field. Keep in mind that the book was written with master’s students in mind. As such, it covers many topics beyond the scope of this course, but it nevertheless provides useful, high-level discussion of some course topics for students looking for additional information. For example:
- Chapters 2 and 3 do an excellent job of comparing and contrasting frequentist and Bayesian inference, with illustrative examples
- Chapter 4 discusses maximum likelihood estimation
- Chapter 15 provides additional details about multiple hypothesis testing and false discovery rate control
- There is also one section each on logistic regression, the EM algorithm, the bootstrap, conjugate priors, and Gibbs sampling