Think Stats – Probability and Statistics for Programmers

Chapter 1 – Statistical thinking for programmers

This book is about turning data into knowledge. Data is cheap (at least relatively); knowledge is harder to come by. I will present three related pieces:

  • Probability is the study of random events. Most people have an intuitive understanding of degrees of probability, which is why you can use words like “probably” and “unlikely” without special training, but we will talk about how to make quantitative claims about those degrees.
  • Statistics is the discipline of using data samples to support claims about populations. Most statistical analysis is based on probability, which is why these pieces are usually presented together.
  • Computation is a tool that is well-suited to quantitative analysis, and computers are commonly used to process statistics. Also, computational experiments are useful for exploring concepts in probability and statistics.

The thesis of this book is that if you know how to program, you can use that skill to help you understand probability and statistics. These topics are often presented from a mathematical perspective, and that approach works well for some people. But some important ideas in this area are hard to work with mathematically and relatively easy to approach computationally.

The rest of this chapter presents a case study motivated by a question I heard when my wife and I were expecting our first child: do first babies tend to arrive late?

1.1 Do first babies arrive late? 

If you Google this question, you will find plenty of discussion. Some people claim it’s true, others say it’s a myth, and some people say it’s the other way around: first babies come early.

In many of these discussions, people provide data to support their claims. I found many examples like these:

“My two friends that have given birth recently to their first babies, BOTH went almost 2 weeks overdue before going into labour or being induced.”
“My first one came 2 weeks late and now I think the second one is going to come out two weeks early!!”
“I don’t think that can be true because my sister was my mother’s first and she was early, as with many of my cousins.”

Reports like these are called anecdotal evidence because they are based on data that is unpublished and usually personal. In casual conversation, there is nothing wrong with anecdotes, so I don’t mean to pick on the people I quoted.

But we might want evidence that is more persuasive and an answer that is more reliable. By those standards, anecdotal evidence usually fails, because:

  • Small number of observations: If the gestation period is longer for first babies, the difference is probably small compared to the natural variation. In that case, we might have to compare a large number of pregnancies to be sure that a difference exists.
  • Selection bias: People who join a discussion of this question might be interested because their first babies were late. In that case the process of selecting data would bias the results.
  • Confirmation bias: People who believe the claim might be more likely to contribute examples that confirm it. People who doubt the claim are more likely to cite counterexamples.
  • Inaccuracy: Anecdotes are often personal stories, and often misremembered, misrepresented, repeated inaccurately, etc.
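To make the first point concrete, here is a minimal simulation sketch. It is not based on real data: the mean gestation of 39 weeks, the standard deviation of 2 weeks, the true difference of 0.1 weeks, and the function names are all assumptions chosen for illustration. It estimates how often a sample shows first babies arriving earlier on average, even though the simulated first babies really do arrive slightly later.

```python
import random


def observed_difference(n, true_diff, sd, rng):
    """Mean gestation of n simulated first babies minus that of n others."""
    base = 39.0  # assumed mean gestation in weeks (illustrative, not measured)
    firsts = [rng.gauss(base + true_diff, sd) for _ in range(n)]
    others = [rng.gauss(base, sd) for _ in range(n)]
    return sum(firsts) / n - sum(others) / n


def fraction_wrong_sign(n, trials=200, true_diff=0.1, sd=2.0, seed=1):
    """Fraction of simulated samples in which first babies look earlier,
    even though they are later by true_diff weeks on average."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    wrong = sum(
        1 for _ in range(trials)
        if observed_difference(n, true_diff, sd, rng) < 0
    )
    return wrong / trials


# Compare a handful of anecdotes with progressively larger samples.
for n in (2, 20, 200, 2000):
    print(n, fraction_wrong_sign(n))
```

Under these assumptions, a sample of two pregnancies per group points the wrong way almost half the time, and the fraction shrinks as the sample grows. That is the sense in which a small difference needs a large number of observations.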

So how can we do better?

1.2 A statistical approach 

To address the limitations of anecdotes, we will use the tools of statistics, which include:

  • Data collection: We will use data from a large national survey that was designed explicitly with the goal of generating statistically valid inferences about the U.S. population.
  • Descriptive statistics: We will generate statistics that summarize the data concisely, and evaluate different ways to visualize data.
  • Exploratory data analysis: We will look for patterns, differences, and other features that address the questions we are interested in. At the same time we will check for inconsistencies and identify limitations.
  • Hypothesis testing: Where we see apparent effects, like a difference between two groups, we will evaluate whether the effect is real, or whether it might have happened by chance.
  • Estimation: We will use data from a sample to estimate characteristics of the general population.
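As a rough sketch of what "might have happened by chance" can mean computationally, the following code runs a permutation test on the difference in means between two groups. This is one common approach, shown here as an assumption rather than as the method used with the survey data; the function name, the iteration count, and the synthetic gestation lengths are all hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, iterations=1000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when the group labels are shuffled at random."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(iterations):
        rng.shuffle(pooled)  # reassign observations to groups by chance
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / iterations


# Synthetic gestation lengths in weeks (illustrative values, not survey data).
rng = random.Random(42)
firsts = [rng.gauss(39.1, 2.0) for _ in range(200)]
others = [rng.gauss(39.0, 2.0) for _ in range(200)]
print(permutation_p_value(firsts, others, iterations=500))
```

A small result suggests the observed difference would be unusual if the two groups were really the same; a large one suggests it could easily be due to chance.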

By performing these steps with care to avoid pitfalls, we can reach conclusions that are more justifiable and more likely to be correct.

Attribution

Allen B. Downey (2011), Think Stats: Probability and Statistics for Programmers, URL: https://greenteapress.com/thinkstats/

This work is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License (https://creativecommons.org/licenses/by-nc/3.0/).
