Chapter 4 Probability

In this chapter we will discuss probability as a framework for working with uncertainty. We’ll start by defining probability and then move to the axioms of probability. We’ll talk about different types of probability that cover scenarios for complex events when some characteristics of the outcome are known. We’ll then move on to rules of probability that permit us to manipulate probabilities.

Probability is used to talk about the plausibility of events. Sometimes the co-occurrence of events gives us additional information (if someone is wealthy, they may also wear designer labels). However, sometimes the co-occurrence gives us no additional information. We will talk about independence and conditional independence.

Probability is tricky and associated with a number of fallacies, which we will discuss. This forms a subset of fallacies associated with data analysis in general which we will cover in a later chapter. Finally, we’ll go over some applications of probability and the role of simulation in solving problems involving probability.

4.1 From Systems Thinking to Probability

It may not be intuitive but the use of probability as a tool to model plausibility arises naturally out of systems thinking. To explain how, we consider the canonical example of probability: flipping a coin.

If we think about a Coin Flipping System, what are the elements of this “flipping a coin” system? Obviously, the coin has something to do with it, maybe its shape, size, weight, balance, etc. Coins will behave differently on the Earth than on the Moon or elsewhere in the universe so the gravity, atmosphere, etc. all play a role. There’s the size and shape of the hand, location of the coin, and the force applied. Finally, there’s this universe’s general laws of physics. So “flipping a coin” is a system and, more specifically, it is a deterministic system.

But it turns out that we cannot actually measure most of the variables that influence the outcome (whether the coin comes up heads or tails). Because of this, even though there is no actual variable called “chance”, it’s easier to model the variable “coin flip” as being random. The key point, though, is that the outcome is not “arbitrary”. We’re using probability to model the system of flipping a coin because it’s a great way of working with uncertainty (the opposite of plausibility; we’ll use both terms).

It is perhaps unfortunate that the theory of probability was developed around gambling because the whole idea of “games of chance” biases our views about what probability is and what it can be used for. The concept of “probability” thus becomes associated with some fuzzy ideas about “chance”, “randomness” and “arbitrariness”. Throughout this text, we just take probability to mean a measure of plausibility or uncertainty…and usually that uncertainty comes from ignorance. Not from some inherent, mysterious “randomness” in the process.

On the other hand, we do have everyday notions and ideas about probability. We often say that things are “probable” or “likely” and we’ll talk about the “probability” of something happening or the “likelihood” of something happening. These actually fit very well with our idea of probability as measuring uncertainty. However, we need to bear in mind that although we will often use the colloquial “likelihood” now and again, “likelihood”, as we will see, has a technical meaning as well. In general, this should not be a problem as the context will tell us which meaning of likelihood is being used. It’s just something we need to be aware of.

Long story short, when we flip a coin, and we don’t know about or we simply ignore all of those factors we cannot measure, we need a way of working with the resulting uncertainty over the outcomes. Probability allows us to do that.

As a side note, if you don’t believe that there is at least one deterministic coin flipping system, consider that researchers have built a coin flipping machine that can reliably make a coin always come up heads.

4.2 Definition of Probability

Data is ultimately evidence for a claim. In Data Science, we want to use data to establish claims that answer a question or solve a problem. The degree to which data supports a claim could be called the claim’s plausibility. As we gather data, some claims will be more plausible than others.

Our tool for dealing with plausibility is probability. Following Edwin Jaynes, we see probability as an extension of logic from claims that are true or false to claims that have different degrees of support.

Consider the following syllogism in logic:

  1. If something is a person, it is mortal.
  2. Socrates is a person.
  3. Socrates is mortal.

This is an example of modus ponens. Now consider the following syllogism:

  1. If something is a swan, it is white.
  2. This animal is a swan.
  3. This animal is white.

Of course, when this syllogism was framed, Australia had not yet been discovered by Europeans. However, when the Europeans arrived, they found black swans. The syllogism is now unsound because the first premise is false. In logic, veracity is an all or nothing quality. Statements are true or false.

While there are various methods of handling degrees of truth, we will focus in this text on probability. Probability permits us to work with statements like the following:

  1. If something is a swan, it is likely to be white.
  2. This animal is a swan.
  3. It is likely to be white.

Probability enables us to work in a world where the evidence is not clear cut and only lends partial support–plausibility–to a claim. The evidence might also support multiple claims to varying degrees preventing us from being absolutely certain in our decisions.

In this chapter we will discuss the basics of probability. In the next chapter, we will combine probability and systems thinking to talk about stochastic processes. In a later chapter, we will return to probability and talk about the continuous version of modus ponens, Bayes Rule and statistical inference.

4.3 Axioms of Probability

There are different versions of the “axioms of probability” but we’ll use the Kolmogorov Axioms of Probability because they’re straightforward. They define the basic properties of a probability measure.

Probability is defined over outcomes (we’ll talk about events later). Outcomes are, more or less, mutually exclusive observations about a fixed domain, which can generally include the properties of objects. We’ve already discussed the coin flip, W = {"heads", "tails"}. If we were talking about customers, we might have an outcome space of W = {"purchase", "not purchase"}. If a customer purchases something, we can talk about an outcome space of where they live, S = {"AL", "AK", "AR", ..., "WY"}, and their gender, G = {"male", "female"} (very few businesses account for non-binary identification). The customer’s state of residence and gender are properties of the person. Finally, we can talk about joint outcome spaces as the Cartesian product of simple outcome spaces, S x G = {("AL", "male"), ("AL", "female"), ("AK", "male"), ("AK", "female"),...}.

We will define different types of probability later but for now we take P(W) to mean “the probability of W”. The argument W is an outcome space, such as state or gender. P(W) can be thought of as an assignment of probability to each element of W (as a table of probability values or perhaps a function). P(G="male") returns the probability of the single outcome in G, G="male". Following the conventional notation, when confusion is unlikely, we can also write P(male) to mean the same as P(G="male").

If we let \(W\) be the set of all possible outcomes and \(w\) be some particular outcome (which are considered to be atomic or mutually exclusive) then the axioms of probability are:

  1. \(P(w) \geq 0\)
  2. \(P(W) = \sum_i P(w_i) = 1\)
  3. \(P(w_i \cup w_j) = P(w_i) + P(w_j)\)

The first axiom states that a probability, a degree of certainty, must be non-negative. While one can certainly think of an interpretation of a negative probability (perhaps our certainty against an event happening), it is cleaner to think of all probabilities as being non-negative.

The second axiom says that we must be 100% certain in all the outcomes in \(W\) taken together; at least one of them must happen. It also constrains probabilities to lie in the range \([0, 1]\). We can always convert an “improper” probability distribution into a proper probability distribution by normalizing: dividing all “improper” probabilities through by the sum total.

In fact, we may often do this on purpose when eliciting probabilities from people. If we assign a certainty value of 1 to an outcome A, we can ask someone if they feel that outcome B is twice as likely to happen (in which case it would get a value of 2), or half as likely to happen (in which case it would get a 1/2). We can then go through and normalize.
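For instance, here is a minimal sketch of this normalization in Python (the elicited values are hypothetical):

elicited = {"A": 1.0, "B": 2.0, "C": 0.5} # hypothetical elicited certainty values
total = sum(elicited.values())            # 3.5
probabilities = {k: v / total for k, v in elicited.items()}
print(probabilities) # {'A': 0.29, 'B': 0.57, 'C': 0.14}, approximately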

The third axiom says that the probability of \(w_i\) or \(w_j\) is equal to the sum of their individual probabilities, \(P(w_i)\) and \(P(w_j)\). If our degree of belief in rain tomorrow is 0.23 and our degree of belief in snow tomorrow is 0.10 then our degree of belief–probability–in either rain or snow tomorrow must be 0.33. This axiom is known as the Additive Law of Probability. Note that this is only true because we’re talking about atomic outcomes.

It is sometimes the case that the axioms are defined in terms of events. An event is a generalization of an outcome and may contain several outcomes. Rolling a 1 on a six-sided die is an outcome. Rolling an odd number on a six-sided die is an event. Note that rolling a 1 on a six-sided die is also an event, which can lead to some confusion.

If we consider events, then the 3rd Axiom becomes:

3a. \(P(E_i \cup E_j) = P(E_i) + P(E_j) - P(E_i \cap E_j)\)

Why the difference? The first version of the axiom is defined for outcomes which must be mutually exclusive. The second version is defined for events which may not be mutually exclusive. Events can be collections of outcomes.

For example, \(W\) might be the sides of a six-sided die, \(W = \{1, 2, 3, 4, 5, 6\}\). These are the possible outcomes. In contrast, \(E_1\) might be “all even valued sides of the die” and \(E_2\) might be “all sides whose value is < 4”, which are events. While the outcomes in \(W\) are mutually exclusive, the events \(E_1\) and \(E_2\) are not (they have the value 2 in common). It follows that all outcomes are events but not vice versa.

The power set of a set \(X\) is the set of all subsets of \(X\). For example, {1, 2, 3} has the power set {{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1}, {2}, {3}, {}}. You can think of events as coming from the power set over the set of all individual outcomes, with only some of them being of any interest.

If \(W\) defines the set of outcomes, then the power set defines all possible events. Note that all outcomes are events but not all events are outcomes. So the power set of {All US States} will include {AK, HI}, {AR}, {ME, NM}, {CA, AZ, OR, WA, NV}. Some of these may be of interest to us and some may not, and some of these sets have names such as “Continental US” or “Western States”. The upshot is that all outcomes are by definition mutually exclusive but events are not. When working with events, we must use the modified version of Axiom #3.

We will usually refer to events and need only worry about whether or not they’re atomic or mutually exclusive. This meshes nicely with the language of software engineering in general and logging specifically.

4.3.1 Notation

Probability notation can get a bit crazy. We’ve already mentioned that \(P()\) can mean a lot of different things:

  1. \(P(A)\) - is the probability distribution over the outcomes/events of A. It is most likely a table of events and probability values.
  2. \(P(A=a_1)\) - is the probability that A takes on the value a1. It’s a single probability value.
  3. \(P(a_1)\) - when the context is not ambiguous, this is a shorthand for the above. It is a single probability value.

What if there’s more than one variable?

  1. \(P(A, B)\) - is the probability distribution over the Cartesian product of outcomes/events in A and B. If A = {a1, a2, a3} and B = {b1, b2} then \(P(A, B)\) returns a single probability value for each of (a1, b1), (a1, b2), (a2, b1), (a2, b2), (a3, b1), (a3, b2).
  2. \(P(A|B)\) - is actually multiple probability distributions. The \(|B\) part is read as “given B”. This is known as a conditional probability which we’ll talk about later. There is one probability distribution for each possible value of B. So if B = {b1, b2, b3, b4}, \(P(A|B)\) is actually four probability distributions.

Operations on probability distributions are kind of like joins in database queries:

  1. \(P(A)P(A)\) - This is an outer join of A (Cartesian product) with itself: \(P(a_1)P(a_1)\), \(P(a_1)P(a_2)\), \(P(a_2)P(a_1)\), \(P(a_2)P(a_2)\).
  2. \(P(A)P(B)\) - This is an outer join of A with B.
  3. \(P(A|B)P(B)\) - This is an inner join. Remember that \(P(A|B)\) is actually multiple probability distributions. This means that if B = {b1, b2} then we have \(P(A|B=b_1)P(B=b_1)\), \(P(A|B=b_2)P(B=b_2)\). In this case, B acts as the “foreign key” between the two sets.

As you work through some examples later, you’ll get a better feel for how it all hangs together.
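Before moving on, here is one possible encoding of these ideas in Python (a sketch with made-up numbers; dictionaries are not the only reasonable representation):

# P(A) and P(B) as dicts from outcome to probability (values are made up)
p_a = {"a1": 0.3, "a2": 0.7}
p_b = {"b1": 0.4, "b2": 0.6}

# P(A)P(B): an "outer join" over every combination of outcomes
outer = {(a, b): p_a[a] * p_b[b] for a in p_a for b in p_b}

# P(A|B): one distribution over A for each value of B (values are made up)
p_a_given_b = {"b1": {"a1": 0.5, "a2": 0.5},
               "b2": {"a1": 0.2, "a2": 0.8}}

# P(A|B)P(B): an "inner join" in which B acts as the "foreign key"
inner = {(a, b): p_a_given_b[b][a] * p_b[b] for b in p_b for a in p_a}
print(inner) # this product is exactly the joint distribution P(A, B)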

4.4 Types of Probability

When we have a complex event space, we may have different types of probability. These probability types cover joint probability, conditional probability, and marginal or prior probability.

4.4.1 Joint Probability

Consider a social and economic process whereby certain areas are more or less densely populated, have more jobs or better jobs, or have different kinds of industries. We can think of these areas as having two properties, Community and Income.

We may not have sufficient information to characterize this process (a system that changes over time) for a particular area and so we want to use probability to describe how plausible certain occurrences of these joint events are. The possible events for Community (C) are {urban, suburban, rural} and the possible events for Income are {low, high}. Taken together these define an event space that is the cross product of the two sets {(urban, low), (urban, high),…}.

If this seems weird to you, it is! Normally we work the other way around. We work from data (events) about communities and infer these proportions. Inference is covered in a later chapter. I want you to get used to using probability for things other than card games.

The following joint probability distribution over these two sets measures our belief that the next observed area will have some combination of properties for Community and Income:

Joint Probability, \(P(Community, Income)\)

area      low    high
rural     0.04   0.02
suburban  0.19   0.22
urban     0.29   0.24

Each entry in the table is a particular probability estimate such as:

\(P(C=urban, I=high) = P(urban, high) = 0.24\)

\(P(C=rural, I=low) = P(rural, low) = 0.04\)

If we want to know the probability of the event (urban, low) or (urban, high) then by axiom #3, we have:

\(P(urban, low \cup urban, high) = P(urban, low) + P(urban, high) = 0.29 + 0.24 = 0.53\)

and similarly, if we want to know the probability of (urban, high) or (suburban, high) we can calculate it as:

\(P(urban, high \cup suburban, high) = P(urban, high) + P(suburban, high) = 0.24 + 0.22 = 0.46\)

This seems a bit unnatural at first because it doesn’t, on the surface, appear to be the same as flipping a coin. But the fact that there are 6 numbers (5 of which are free; the 6th is derivable from axiom #2) instead of the coin’s 1 free number doesn’t change anything.

The main differences are:

  1. The process involves the development of geographic areas (instead of the physical process of flipping a coin).
  2. There are two (joint) events being measured (instead of just the one, heads or tails).

As it turns out, calculating these probabilities from data is one of the simplest forms of modeling. We will discuss this later.

On a final note, there is no limit (except perhaps processing power, space, or credulity) to the number of variables you can have in a joint probability distribution. \(P(A, B)\) is one. \(P(A, B, C)\) is one. And \(P(A, B, C, D, E, F, G)\) is one. Of course, as you add more variables, the size of the corresponding probability table increases exponentially.
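As a concrete sketch, the joint probability table above can be represented in Python as a dictionary keyed by (community, income) pairs, and the calculations we just did become lookups and sums:

joint = {("rural", "low"): 0.04, ("rural", "high"): 0.02,
         ("suburban", "low"): 0.19, ("suburban", "high"): 0.22,
         ("urban", "low"): 0.29, ("urban", "high"): 0.24}

# Axiom #2: the probabilities in the whole table sum to 1
print(sum(joint.values())) # 1.0 (up to floating point)

# Axiom #3: P(urban, low or urban, high) over mutually exclusive outcomes
print(joint[("urban", "low")] + joint[("urban", "high")]) # 0.53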

4.4.2 Conditional Probability

\(P(C, I)\) is a joint probability distribution over all the possible combinations of Community and Income. But what if we already know the Income? This is denoted by \(P(C | I)\) and is called a conditional probability distribution because we “condition” our uncertainty on the information we have.

When we look at a specific event, a joint probability might be something like \(P(urban, low)\). This is the probability of the next geographic area being both an urban community and low income if we don’t know either value. However, the conditional probability \(P(urban | low)\) is the probability of the next geographic area being an urban community given that we already know it is low income. These two probabilities will not necessarily be the same (a point to which we will return later).

Conditional probability effectively focuses our attention on a subset of the joint probability space where some of our uncertainty is resolved by knowing some of what happened. Remember that joint probability is the probability of any combination of events, assuming we know nothing. But if we know that \(income = low\) then we do not need to look at \(income = high\) anymore because it’s not possible. We know that that event is not going to happen and thus we can ignore it. This is true whether there are 2 possible events or 100 and whether we know the values for 1 or 99. Whenever we find out the value of a variable–that some event has happened–we can ignore all other values for that variable as no longer being possible.

When we do this, however, we’re left with improper probability distributions. By the 2nd Axiom of Probability, all probability distributions must sum to 1 therefore we need to normalize the remaining probabilities (of event combinations that can happen). We do this by dividing through by the sum of the remaining possible events.

More formally, the definition of conditional probability is:

\(P(A|B) = \frac{P(A, B)}{P(B)}\)

and if you play with the math a bit, you will see that this is the same thing. Don’t be confused by the seeming division of a probability distribution by another probability distribution. This is just a shorthand for element-wise division. Therefore, using the joint probability table from above, we can calculate:

\(P(urban|low) = P(urban, low)/P(low) = 0.29/0.52 = 0.56\)

Our conditional probability table, \(P(Community|Income)\), describes the uncertainty under any outcome for Income:

Conditional Probability, \(P(Community|Income)\)

area      low    high
rural     0.08   0.04
suburban  0.36   0.46
urban     0.56   0.50

Here, \(P(Community|Income)\) consists of 2 probability distributions (one for each possible value of Income), and each column sums to 1.

We can do this for Community as well:

Conditional Probability, \(P(Income|Community)\)

area      low    high
rural     0.67   0.33
suburban  0.46   0.54
urban     0.55   0.45

This means that the table above actually contains three separate probability distributions, one for each possible outcome of Community, and each distribution (a row in this case) adheres to Axiom #2 (it sums to 1).

In general, with conditional probabilities, knowing that some event occurs often changes our information about the certainty or uncertainty of another event. However, in general, we cannot say in which direction:

\[P(A) \lessgtr P(A | B)\]

Conditioning on B may raise, lower, or leave unchanged the probability of A.

Conditional probabilities are perfectly general.

Just as we can have a joint probability distribution over many different event spaces such as P(A, B, C, D, E, F) we can have a conditional probability distribution where the result or value for some (at least one) of the sets is known. For example, we might know the values of \(C\) and \(D\):

\(P(A, B | C, D) = \frac{P(A, B, C, D)}{P(C, D)}\)
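In code, conditioning is just filtering to the known value and renormalizing. Here is a sketch that recovers the first column of \(P(Community|Income)\) from the joint table:

joint = {("rural", "low"): 0.04, ("rural", "high"): 0.02,
         ("suburban", "low"): 0.19, ("suburban", "high"): 0.22,
         ("urban", "low"): 0.29, ("urban", "high"): 0.24}

# P(Community | Income=low): keep the income == "low" entries, renormalize
low_entries = {c: p for (c, i), p in joint.items() if i == "low"}
p_low = sum(low_entries.values()) # the marginal P(low) = 0.52
p_c_given_low = {c: p / p_low for c, p in low_entries.items()}
print(p_c_given_low) # {'rural': 0.077, 'suburban': 0.365, 'urban': 0.558}, approximately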

4.4.3 Marginal or Prior Probability

When dealing with a joint probability distribution over many sets, there may sometimes be sets that we don’t care about, that is, we’re only interested in the probabilities of some subset of the joint event space. The simplest example is having a joint probability distribution \(P(A, B)\) and only being interested in the probability distribution \(P(A)\) or \(P(B)\). In this case, we can marginalize out the events we are not interested in. “Marginalization” comes from the practice of calculating marginal values/probabilities in the margins of books and ledgers.

For example, if we have our joint probability distribution \(P(C, I)\) and we only care about community, \(C\), we can marginalize out income:

\(P(urban, low \cup urban, high) = P(urban, low) + P(urban, high) = 0.29 + 0.24 = 0.53\)

\(P(suburban, low \cup suburban, high) = P(suburban, low) + P(suburban, high) = 0.19 + 0.22 = 0.41\)

\(P(rural, low \cup rural, high) = P(rural, low) + P(rural, high) = 0.04 + 0.02 = 0.06\)

which leads to the following:

Marginal Probability, \(P(Community)\)

area      any
rural     0.06
suburban  0.41
urban     0.53

This follows from the axioms of probability. We can also do this for income:

Marginal Probability, \(P(Income)\)

area  low    high
any   0.52   0.48

which strangely leads us back to coin flipping.

According to Jaynes, you can think of P(heads) as a marginalization over everything we don’t know or don’t care about in a larger joint probability distribution such as P(heads, gravity, Earth, Jim, left handed, …).
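In code, marginalization is nothing more than summing over the variable we want to remove (a sketch using the same joint table as before):

from collections import defaultdict

joint = {("rural", "low"): 0.04, ("rural", "high"): 0.02,
         ("suburban", "low"): 0.19, ("suburban", "high"): 0.22,
         ("urban", "low"): 0.29, ("urban", "high"): 0.24}

p_community = defaultdict(float)
p_income = defaultdict(float)
for (community, income), p in joint.items():
    p_community[community] += p # marginalize out Income
    p_income[income] += p       # marginalize out Community

print(dict(p_community)) # {'rural': 0.06, 'suburban': 0.41, 'urban': 0.53}
print(dict(p_income))    # {'low': 0.52, 'high': 0.48}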

4.5 Rules of Probability

There are some handy rules that follow from the axioms of probability. In a probability course, you might be required to prove them. We will present them without proof.

4.5.1 Monotonicity

\(A \subseteq B \Rightarrow P(A) \leq P(B)\)

This says that if A is a subset of B then the probability of A must not exceed the probability of B. This is really an abuse of notation. What we really mean is:

\(A \subseteq B \Rightarrow \sum_i P(a_i) \leq \sum_j P(b_j)\)

4.5.2 Negation

\(P(\neg{a}) = 1 - P(a)\)

This follows from axiom #2. If we write out the summation and isolate the single event \(a\) and then re-arrange terms, we get the above rule.

4.5.3 Total Probability

This is also called the Law of Total Probability. It has the form:

\(P(A) = \sum_i P(A|B=b_i) P(B=b_i)\)

Remember that P(A) is a set of probabilities (one for each element in A). This rule makes a bit more sense if you break it out:

\(P(A=a_1) = P(A=a_1|B=b_1) P(B=b_1) + P(A=a_1|B=b_2) P(B=b_2) + \ldots + P(A=a_1|B=b_n) P(B=b_n)\)

We can derive the Law by looking at the definition of conditional probability for a single event:

\(P(A=a_1|B=b_1) = \frac{P(A=a_1, B=b_1)}{P(B=b_1)}\)

and re-arranging terms:

\(P(A=a_1, B=b_1) = P(A=a_1|B=b_1)P(B=b_1)\)

From our discussion about marginalization and Axiom #2, we know that:

\(P(A=a_1) = P(A=a_1, B=b_1) + P(A=a_1, B=b_2) + \ldots + P(A=a_1, B=b_n)\)

By substitution, we have:

\(P(A=a_1) = P(A=a_1|B=b_1) P(B=b_1) + P(A=a_1|B=b_2) P(B=b_2) + \ldots + P(A=a_1|B=b_n) P(B=b_n)\)

and for the entire set A, we have:

\(P(A) = \sum_i P(A|B=b_i) P(B=b_i)\)

Total Probability is very useful (believe it or not) because there are many situations where you don’t know P(A) but you know P(A|B) and P(B). We will see this later.
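As a small sketch with made-up numbers, suppose we know how likely rain is in each season, P(rain|season), and how likely each season is, P(season), but we don’t know P(rain) directly:

# Hypothetical values for P(rain | season) and P(season)
p_rain_given_season = {"wet": 0.40, "dry": 0.05}
p_season = {"wet": 0.30, "dry": 0.70}

# Law of Total Probability: P(rain) = sum of P(rain|season) * P(season)
p_rain = sum(p_rain_given_season[s] * p_season[s] for s in p_season)
print(p_rain) # 0.40*0.30 + 0.05*0.70 = 0.155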

4.5.4 Chain Rule

Again, starting with the definition of conditional probability we have:

\(P(A|B) = \frac{P(A, B)}{P(B)}\)

and by re-arranging we have:

\(P(A, B) = P(A|B)P(B)\)

We can expand this to more sets:

\(P(A,B,C) = P(A|B,C) P(B|C) P(C)\)

\(P(A,B,C,D) = P(A|B,C,D) P(B|C,D) P(C|D) P(D)\)

We can expand in any order:

\(P(A,B,C,D) = P(D|A,B,C) P(B|A,C) P(A|C) P(C)\)

But this is really just a clever manipulation of the definition of conditional probability:

\(P(A,B,C,D) = P(D|A,B,C) P(B|A,C) P(A|C) P(C)\)

\(P(A,B,C,D) = \frac{P(D,A,B,C)}{P(B,A,C)} \frac{P(B,A,C)}{P(A, C)}\frac{P(A, C)}{P(C)} P(C)\)

Still, it can be a handy thing to know and it provides the foundation for Bayesian Networks.
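We can check the two-variable case, \(P(C, I) = P(C|I)P(I)\), against the community/income tables from earlier:

# P(urban, low) should equal P(urban | low) * P(low)
p_urban_given_low = 0.29 / 0.52  # from the conditional probability table
p_low = 0.52                     # from the marginal probability table
print(p_urban_given_low * p_low) # 0.29, matching the joint probability table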

4.5.5 Bayes Rule

Bayes Rule is a similar manipulation of conditional probability. We start with the definition of conditional probability:

\(P(A|B) = \frac{P(A,B)}{P(B)}\)

and re-arrange:

\(P(A,B) = P(A|B)P(B)\)

We can start with the “opposite” conditional probability:

\(P(B|A) = \frac{P(A,B)}{P(A)}\)

and re-arrange:

\(P(A,B) = P(B|A)P(A)\)

which means I can set these two equal to each other:

\(P(B|A)P(A) = P(A|B)P(B)\)

and re-arrange:

\(P(B|A) = \frac{P(A|B)P(B)}{P(A)}\)

which does not look particularly interesting until I start giving my sets interesting names: B=Hypothesis and A=Data:

\(P(H|D) = \frac{P(D|H)P(H)}{P(D)}\)

which says…the probability of the hypothesis (or model or parameter) given the data is equal to the probability of the data given the hypothesis times the probability of the hypothesis. This is then normalized by the probability of the data.

These probabilities all have names:

  • P(H|D) = posterior probability
  • P(D|H) = likelihood
  • P(H) = prior probability
  • P(D) = normalizer

Note that we almost never know P(D) and we will often use total probability to calculate it:

\(P(D) = \sum_h P(D|H=h)P(H=h)\)

Remember what our notation means. By solving for \(P(H|D)\) we are solving for all the hypotheses and all the data outcomes at once. This means the posterior distribution described by \(P(H|D)\) is not a single probability distribution but many. For a single hypothesis, we would have something like:

\(P(h_1|D) = \frac{P(D|h_1)P(h_1)}{P(D)}\)

Note that the denominator still involves all the hypotheses. We often end up having to use the rule of Total Probability here:

\(P(h_1|D) = \frac{P(D|h_1)P(h_1)}{P(D|h_1)P(h_1) + P(D|h_2)P(h_2) + \dots + P(D|h_n)P(h_n)}\)

Bayes Rule is particularly important to data science because it says exactly how we should change our degree of certainty given new evidence. It also plays a foundational role in several machine learning techniques (Naive Bayes Classifier, Bayesian Belief Networks). It is also the main formula for Bayesian inference.

We will spend an entire later chapter on Bayes Rule because of its central role in inference.

4.6 Independence and Conditional Independence

From before, we manipulated the definition of conditional probability as follows:

\(P(A|B) = \frac{P(A,B)}{P(B)}\)

and re-arrange:

\(P(A,B) = P(A|B)P(B)\)

Remember our interpretation of conditional probability: knowing what event in B happened gives us additional information that influences our expectations about which event in A will happen. If it doesn’t, then:

\(P(A, B) = P(A)P(B)\)

in which case A and B are said to be independent. This is known as the Multiplication Rule of Probability.

In the previous section on Probability Rules, we introduced the Chain Rule. The Chain Rule basically gives us permission to factor a joint probability distribution however we please. For example, we might have:

\(P(A, B, C) = P(A|B, C)P(B|C)P(C)\)

we can also do the same with a conditional joint probability:

\(P(A,B| C) = P(A|B, C)P(B|C)\)

This says we can factor our uncertainty in the joint events of A and B given some known event in C into two parts: the uncertainty over an event in A given events in B and C are known times the uncertainty in an event in B given the event in C is known.

For a concrete example, consider a joint probability of Religion, Party and Wealth. That’s \(P(A, B, C)\) or \(P(Party, Wealth, Religion)\). Now suppose we condition on \(Religion\) so we have \(P(Party, Wealth | Religion)\). We’re now describing our uncertainty over someone’s Party and Wealth given we know their Religion. What the above says is that we can factor that into two parts. The first part is the probability of Party given we know their Wealth and Religion and the second part is the probability of Wealth given we know their Religion.

Returning to the example, however, if the following is true:

\(P(A,B|C) = P(A|C)P(B|C)\)

then we can say that A and B are conditionally independent given C. Note that the above can be true while, at the same time, the following is also true:

\(P(A, B) \neq P(A)P(B)\)

The one does not imply the other.

There are many applications in statistics and machine learning that either require or assume independence or conditional independence.
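Numerically, checking independence is just comparing the joint probability to the product of the marginals. Using the community/income numbers from earlier:

# Compare P(urban, low) with P(urban) * P(low)
p_joint_urban_low = 0.29    # from the joint probability table
p_urban, p_low = 0.53, 0.52 # from the marginal probability tables
print(p_urban * p_low)      # 0.2756 != 0.29, so Community and Income are not independent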

4.7 Probabilistic Fallacies

As it turns out, people in general are not particularly good with probabilities and make quite a few mistakes. These mistakes are fallacies in probabilistic reasoning just as there are reasoning and argument errors in general (for example, ad hominem arguments).

4.7.1 Conjunction Fallacy

Monotonicity plays a very important role in judging the probabilities of events. Perhaps the most famous example arises in the form of the Conjunction Fallacy illustrated as follows:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice and participated in anti-nuclear demonstrations.

Which is more probable?

  1. Linda is a bank teller.
  2. Linda is a bank teller and active in the feminist movement.

The majority of people, when asked, picked #2 but this cannot be. If B is the set of all bank tellers and A is the set of feminists, then the set of feminist bank tellers (those in both A and B) is likely smaller than B, and it certainly cannot be larger. Being an element of both A and B cannot be more probable than being an element of B.

This basically says that \(P(A=a, B=b) \leq P(A=a)\). Note that this is not the same thing as what we said before with conditional probability because we are talking about joint probability here.

The main problem here actually seems to be confusing joint and conditional probabilities. Still, generally, the more specific you are, the less probable it is. We will see later that this is important for designing experiments and analyzing data.

4.7.2 Gambler’s Fallacy and Hot-Streak Fallacy

A common fallacy is the belief that there is some sort of overarching or underlying power or force that keeps probabilities in balance. The most common way this fallacy shows itself is as the Gambler’s Fallacy. In the Gambler’s Fallacy, the person incorrectly believes that because a rare event has not happened, its time must come, that it must happen in order to balance the probabilities in some way. For example, if “red 13” has not come up in Roulette for a while, then it “must” happen any time now.

The Hot-Streak Fallacy is similar in that it assumes that some extraordinary event must continue to happen. However, some research has shown that in certain cases this is not necessarily a fallacy: the “Hot Hand” is not always an illusion.

This is a very easy mistake to make so Data Scientists and consumers of Data Science must be especially wary of this particular fallacy.

4.7.3 Inverse Probability Fallacy

The Inverse Probability Fallacy relates to conditional probabilities, specifically by believing that the following are the same:

\(P(A|B) = P(B|A)\)

A concrete example of this fallacy would involve making the following claim:

\(P(rain|prediction) = P(prediction|rain)\)

Here we are saying that the probability of rain given that rain was predicted is equal to the probability of a prediction for rain given that it is raining. These are two entirely different things. Remember that probability is just counting and normalization.

In the first case, we count all the times where it rained given a prediction of rain. In the second case, we count all the times rain was predicted given that it rained. You can think of a table like so:

actual  prediction
yes     no
yes     yes
no      yes
yes     no

In the first case, we add up all the times where actual is yes and prediction is yes but divide by the times that prediction is yes (regardless of actual). In the second case, we add up all the times where actual is yes and prediction is yes but divide by the times that actual is yes (regardless of prediction).

These will not necessarily lead to the same number.
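Here is a quick sketch that does this counting over the table above:

# Each row of the table above as an (actual, prediction) pair
records = [("yes", "no"), ("yes", "yes"), ("no", "yes"), ("yes", "no")]

both = sum(1 for actual, pred in records if actual == "yes" and pred == "yes")
pred_yes = sum(1 for actual, pred in records if pred == "yes")
actual_yes = sum(1 for actual, pred in records if actual == "yes")

print(both / pred_yes)   # P(rain | prediction) = 1/2
print(both / actual_yes) # P(prediction | rain) = 1/3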

4.7.4 Base Rate Fallacy

Suppose we’re asked what religion we think Garth is and, knowing that he’s from middle America, we can guess that he’s notionally a Christian. If we further learn that Garth is a goth who wears dark clothing with various mystical symbols on it, our estimate of Garth’s religion would probably swing in the direction of his being a Satanist.

However, the base rate (prior) of being a Satanist is really quite low. There are about 2,000,000,000 Christians in the world and only hundreds of thousands of Satanists. While the probability of Garth being a Christian might go down, knowing that Garth is a goth shouldn’t really flip our sense of the probability from most likely a Christian to most likely a Satanist. Doing this is called the Base Rate Fallacy.

This is also related to Bayes Rule which tells us exactly how much we should change our prior when reviewing evidence. The difference between the posterior and prior is the incremental evidence whereas the posterior alone is the total evidence. The Base Rate fallacy can also be attributed to confusing incremental evidence and total evidence when the base rate is low. Upon learning that Garth dresses in black, wears eyeliner and black fingernail polish, we adjust our beliefs about his religion. But to conclude that he’s a Satanist, ignoring the base rate, is to improperly change our beliefs based on the evidence alone.

4.7.5 Prosecutor’s Fallacy

This fallacy refers to a Prosecutor (and sometimes a Defense Attorney) arguing the wrong thing, which can result in a mistrial. As it turns out, this rarely happens among the attorneys themselves but it can happen in expert testimony. This is really a family of fallacies, the first of which relates to Bayes Rule and the Inverse Probability Fallacy:

\(P(innocence|evidence) = \frac{P(evidence|innocence)P(innocence)}{P(evidence)}\)

The Fallacy is committed when the prosecutor assumes that just because \(P(evidence|innocence)\) is small (“if he were innocent, this evidence would be really unlikely”), \(P(innocence|evidence)\) must be equally as small. This happens a lot with forensic evidence, especially DNA evidence. But it simply isn’t true that if \(P(evidence|innocence) = 1:1,000,000\) then \(P(innocence|evidence) = 1:1,000,000\).

Another version of the Fallacy confuses the prior and conditional probabilities, that is, it assumes that \(P(A)=P(A|B)\) and is thus related to the Base Rate Fallacy.

4.8 Applications

We can solidify our understanding of probability, conditional probability and Bayes Rule by going over some problems, some of which are quite famous. We’ll start first with some general probability problems and then move in the second part to problems using Bayes Rule.

Most of these problems come from Allen Downey’s excellent book Think Bayes.

Some of these problems are just calculations, either by hand or by computer. Others involve answering questions with simulations in order to calculate probabilities. You may be called upon to do something similar as a data scientist. For example, you have just fielded an advertising campaign in 20 major cities. Testing showed that the advertising campaign was 10 percent better than the previous one. However, the last three weeks of returns in New York have been below average. What is the probability of this happening given that the campaign really is 10 percent better?

4.8.1 Applications in General Probability

These problems are general probability problems (although the Monty Hall problem can be solved using Bayes Rule).

4.8.1.1 A Girl Named Florida

Consider the following problems: we have a family with two children.

  • What is the probability that they are both girls?

To answer this question, we have to make some assumptions about the probability of a child being either a boy or girl (which we will take to mean either XX or XY chromosomes). The generally accepted probabilities are P(boy) = 0.5 and P(girl) = 0.5 (ignoring other chromosomal possibilities).

Recall our definition of independence. Two sets of events, \(A\) and \(B\), are independent if the following holds true:

\[P(A, B) = P(A)P(B)\]

but that’s how we determine that A and B are independent. If we assume that the events are independent, then we can turn it around to calculate the probability of the joint event:

\[P(A)P(B) = P(A, B)\]

Here \(A\) is the “sex of the first child” and \(B\) is the “sex of the second child”. This means we can take P(A=girl) as 0.5 and P(B=girl) as 0.5–shortened to P(girl) x P(girl)– which equals 0.5 x 0.5 = 0.25.

Note that if we wanted to calculate the probability of them being different sexes, then we’d have to calculate the probability of having a boy then a girl (0.5 x 0.5 = 0.25) and the probability of having a girl then a boy (0.5 x 0.5 = 0.25) and combine them based on the Additive Law of Probability: 0.25 + 0.25 = 0.50.

But remember how I said probability is just counting?

There are 4 possibilities:

  1. Boy, Girl.
  2. Girl, Boy.
  3. Girl, Girl.
  4. Boy, Boy.

There is only one way in which both children are girls so the probability of two girls is 1/4 = 0.25. There are two ways in which the children are mixed sexes so the probability of that joint event is 2/4 = 1/2 = 0.5.

  • What is the probability that they are both girls given that at least one is a girl?

Now there are only 3 possibilities–the ones that include a Girl as either the first or the second birth:

  1. Boy, Girl.
  2. Girl, Boy.
  3. Girl, Girl.

Using the counting method, we can see that there is only one way to get the result we’re interested in and three possible outcomes so the probability is 1/3.

Using the mathy way, we know from above that the probability of each individual outcome is 1/4 or 0.25:

  1. Boy, Girl = 1/4
  2. Girl, Boy = 1/4
  3. Girl, Girl = 1/4

However, since we have ruled out the {Boy, Boy} possibility by assumption, we have to renormalize the probabilities. Normalization just means “make all the probabilities add up to 1 again” and you do this by adding the probabilities together (which is 3/4) and dividing each original probability by this normalizer:

  1. Boy, Girl = (1/4) / (3/4) = 1/3
  2. Girl, Boy = (1/4) / (3/4) = 1/3
  3. Girl, Girl = (1/4) / (3/4) = 1/3

And we get 1/3 as before. The reason we show both ways to get the answer is because, as you might expect, there are cases–most cases–where the counting approach isn’t tractable.

  • What is the probability that they are both girls given that the oldest (first) is a girl?

We do the same thing again except that any outcome that has a Boy as the oldest is removed:

  1. Girl, Boy = 1/4
  2. Girl, Girl = 1/4

And again, we need to have our probabilities add up to 1 so we normalize them:

  1. Girl, Boy = (1/4) / (2/4) = 1/2
  2. Girl, Girl = (1/4) / (2/4) = 1/2

If you think about it a second, this makes perfect sense…the events are independent so knowing that the first is a girl doesn’t give us any information about the second child’s sex.

That was a fairly typical probability problem. There is a crazy variant that asks:

  • What is the probability that they are both girls given that one of them is a girl named Florida?

Think about it. Does the name change anything?

We’re now going to switch to problems where simulation is often a useful tool. If you ever have a probability problem that you can’t quite formulate right or if someone doesn’t believe your answer, think: can I simulate this?
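For example, here is a minimal simulation of the “both girls given at least one girl” question from above (the function name is my own); it should print a value near 1/3:

from random import choice

def simulate_both_girls_given_one(trials=100000):
    both, at_least_one = 0, 0
    for _ in range(trials):
        children = (choice(["boy", "girl"]), choice(["boy", "girl"]))
        if "girl" in children:
            at_least_one += 1
            if children == ("girl", "girl"):
                both += 1
    return both / at_least_one

print(simulate_both_girls_given_one()) # approximately 1/3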

4.8.1.2 Birthday Problem

The Birthday Problem is as follows: what is the probability that two people in a given group of size \(N\) have the same birthday (month and day)?

  1. Guess. What do you think the probability is? 10%, 20%, 30%…100%?
  2. Think about how you might answer this mathematically.
  3. Think about how you might solve this easily as a simulation. What assumptions do you need to make?

We’re going to simulate the problem by writing a few functions. The first function takes \(k\) persons as an argument and assigns them randomly to one of the 365 days of the year (we ignore leap years). As we do so, we count how many people have that birthday.

First some imports…

from random import randint, uniform
from collections import defaultdict

We use defaultdict because missing keys are automatically assigned a value of 0 instead of causing a KeyError.

def tally_k_birthdays( k):
    counts = defaultdict( int)
    for i in range( 0, k):
        birthday = randint( 1, 365)
        counts[ birthday] += 1
    return counts

Let’s see what we get for 10 people:

tally_k_birthdays( 10)

Now all we need to do is take this dictionary of values and see if any of the days (we only need one) has a count greater than one which would mean that two (or more) people have the same birthday:

def identify_same_birthdays( counts_of_birthdays):
    for day in counts_of_birthdays.keys():
        if counts_of_birthdays[ day] > 1:
            return True
    return False

In general, in order to get a good result from a simulation, it must be run multiple times and the results averaged. We write a function to do just that. The arguments are \(N\) people and \(times\) simulations.

def sample_group( N, times):
    match = 0.0
    for i in range( times):
        birthday_count = tally_k_birthdays( N)
        if identify_same_birthdays( birthday_count):
            match += 1.0
    return match / times

We can now run the function and see approximately what the probability is for two people to have the same birthday in a class with \(N=26\) students:

sample_group( 26, 10000)

It’s much more probable than people usually think.

This is a good example of a simple simulation of a system process. Again, in theory, everything is fairly deterministic. Parents decided to have children, the children were born on certain days, the children grew up and were in a particular class (one such situation), or they got older and went to university (another situation), or took up an interest in art and went to an art gallery (another such situation), and in all cases the simulation works.

It doesn’t work if an assumption is violated. If the situation is a Meetup for people born in March, we would need an entirely different simulation.

  1. Can you reprogram the simulation to see how many people it takes to have a 50% probability of someone with the same birthday, if everyone is born in the same month?

4.8.1.3 Monty Hall Problem

Monty Hall was the host for Let’s Make a Deal before Wayne Brady. One of the “bits” on the show involved picking a curtain in hopes of winning a great prize like a car and this probability problem is based on it. It’s actually a very famous problem.

There are three curtains: 1, 2, and 3. Behind one of those curtains is a car. On the show, the other curtains often had gag gifts behind them, like a goat, but we assume they’re empty. The contestant picks the curtain they believe hides the car. After picking, Monty reveals what is behind one of the other curtains, always one that does not hide the car. One important assumption is that if the contestant has picked the car, Monty reveals one of the other two curtains at random.

The contestant is then given the option to either stick with the curtain they picked or switch to the remaining curtain. The question is this: should the contestant switch? What do you think?

There are a number of ways to answer this question but we’re going to use simulation because that’s often the most definitive. In fact, Paul Erdős, the famous mathematician, would not believe the correct answer until it was simulated.

First, we have a function that simulates one Monty Hall “Problem”. It basically says:

  1. set up the problem
  2. place the car at random.
  3. generate a random contestant pick.
  4. figure out which curtain to reveal.
  5. figure out which curtain is closed.
  6. if do_switch is True, make the pick equal to the closed curtain. Otherwise, keep it the same.
  7. return whether the picked curtain equals the car’s curtain (True or False).

def evaluate_a_monty_hall_scenario( do_switch=False):
    options = {1, 2, 3}
    car = randint( 1, 3)
    pick = randint( 1, 3)
    # Monty opens a curtain hiding neither the car nor the pick. (When
    # car == pick, this takes the first remaining curtain rather than a
    # random one; the win probabilities are unaffected.)
    opened = list( options.difference( {car}).difference( {pick}))[0]
    closed = list( options.difference( {pick}).difference( {opened}))[0]
    if do_switch:
        pick = closed
    return car == pick

Let’s run it 10 times:

for i in range( 0, 10):
    print( evaluate_a_monty_hall_scenario(True))

We’re now going to run the Monty Hall problem function 10,000 times and evaluate what happens first, if you don’t switch and second, if you switch:

def evaluate_monty_hall_problem( switch=False):
    trials = 10000
    count = 0
    for i in range( 0, trials):
        result = evaluate_a_monty_hall_scenario( switch)
        if result:
            count += 1
    return float( count) / trials

evaluate_monty_hall_problem()
evaluate_monty_hall_problem(True)

And there you have it: if you switch, you win the car about 66% of the time; if you stay, only about 33%.

4.8.2 Applications of Bayes Rule

Speaking of switching, one of the main types of problems we’ll be solving are problems involving Bayes Rule. In fact, Bayesian Inference depends entirely on understanding Bayes Rule and evaluating it for a large number of possibilities. We’ll start out with smaller problems.

For whatever reason, Bayes Rule examples are either weather or medical tests. We’ll start with the weather:

4.8.2.1 Rain or Shine

Sam is getting married tomorrow in an outdoor ceremony in the desert. In recent years, it has only rained 5 days per year. Unfortunately, the meteorologist has predicted rain for tomorrow. Should Sam rent a tent for the ceremony?

We can solve this problem using Bayes Rule which remember is:

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

But instead what we want is:

\[P(W|F) = \frac{P(F|W)P(W)}{P(F)}\]

where \(W\) is weather (rain or shine) and \(F\) is forecast (rain or shine). Remember that \(P(W)\) in the numerator is our prior probability. What is our prior probability? Well, it only rains 5 days a year on average:

rain             shine
5/365 = 0.0137   360/365 = 0.9863

I think this is what Sam had in mind when he planned his wedding.

But now he needs to take new evidence into account: a forecast of rain. The likelihood \(P(F|W)\) is essentially the probability of the meteorologist being correct: given that it rained, what is the probability that rain was forecast? Sam looks this up on the Internet.

F      W=rain   W=shine
rain   0.8      0.2
shine  0.2      0.8

What does this mean? Given that it rained, there is an 80% chance there was a forecast of rain:

\(P(F=rain|W=rain) = 0.8\)

Because it would otherwise be confusing, we do not take shortcuts here. We will use the longhand notation, F=rain and W=rain, to distinguish the two events. Up above, we had Bayes Rule defined over entire random variables.

What does it look like for the specific outcome we’re interested in?

\[P(W=rain|F=rain) = \frac{P(F=rain|W=rain)P(W=rain)}{P(F=rain)}\]

We have everything we need except the denominator. We can use total probability for it, though:

\(P(F=rain) = P(F=rain|W=rain)P(W=rain) + P(F=rain|W=shine)P(W=shine)\)

\(0.8 \times 0.0137 + 0.2 \times 0.9863 = 0.208\)

and now we have:

\(P(W=rain|F=rain) = \frac{0.8 \times 0.0137}{0.208} = 0.053\)

So really, Sam should just go ahead with the wedding (at least from a weather perspective).

4.8.2.2 Breast Cancer

The logic underlying this problem is why certain routine screenings for breast cancer were discontinued. The numbers, however, are made up.

1% of women at age 40 who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in the age group had a positive mammography. What is the probability of her having breast cancer?

We have two variables, each with two outcomes: \(M\) is {pos, neg}, and \(C\) is {yes, no}. As before, we need to set up Bayes Rule and determine what information we have and what information we need to calculate.

\[P(yes|pos) = \frac{P(pos|yes)P(yes)}{P(pos)}\]

We have the prior, \(P(yes)\), which is simply 0.01. We have the likelihood we need, which is established in the second sentence: \(P(pos|yes)\) = 0.8 (which means that \(P(neg|yes)\) = 0.2). We don’t have \(P(pos)\). We will need to use total probability again.

\(P(pos) = P(pos|yes)P(yes) + P(pos|no)P(no)\)

We have \(P(pos|no)\) from the 3rd sentence: 0.096. Note that this clearly shows where total probability comes from. If we want to calculate the probability of a positive test result, we need to take into account all the possible sources of positive test results. These come from those with cancer who get a positive test result (the first term) and those without cancer who get a positive test result (the second term). The probability of not having cancer is just 1 - P(yes).

\(P(pos) = 0.8 \times 0.01 + 0.096 \times 0.99 = 0.103\)

and now we can just plug in the numbers.

\(P(yes|pos) = \frac{0.8 \times 0.01}{0.103} = 0.078\)

This result makes an important assumption, though: the only information we have about this woman is that the test was a routine screening. Why might this not be the case?

OK, we’re computer scientists…enough math. We can let computers do the math.

4.8.2.3 Elvis

Apparently Elvis was one of a set of twins. He had a twin brother who died at birth. We want to know the probability that Elvis had an identical twin. This isn’t really enough information to answer anything so…

Wikipedia to the rescue…“Twins are estimated to be approximately 1.9% of the world population, with monozygotic twins making up 0.2% of the total, 8% of all twins”.

You should solve this by hand right now, writing out the problem. It might surprise you how difficult it is to get started. Consider the following…what is the event we want to know about and what is the evidence?

So the evidence is that the child was male and the event we’re trying to determine the probability of is that Elvis and the child were identical twins:

\[P(I|M) = \frac{P(M|I)P(I)}{P(M)}\]

I’m going to start out with a helper function that normalizes a probability distribution the way I have decided to represent it (as a map):

def normalize( dist):
    normalizer = sum( dist.values())
    for k in dist.keys():
        dist[ k] = dist[ k] / normalizer
    return dist # not strictly necessary; dist is modified in place

I’m describing the events as Identical twin or Fraternal twin. The probabilities come from the Wikipedia article. In Python, it is very convenient to represent a discrete Probability distribution with a Dict where the keys are outcomes {“I”, “F”} and the values are the probabilities of those outcomes.

elvis_prior = {"I": 0.08, "F": 0.92}

Here we use a Dict to express a likelihood which ends up as a nested Dict. Remember that \(P(A|B)\) is a Probability distribution for each value of “B”. In this case, the outer key is the “given” so that we can say “given I” and look up the appropriate probability distribution. The inner Dict represents the probability distribution over the events of “A”, in this case the sex of the baby, Male or Female.

elvis_likelihoods = {
  "I": { "M": 1.00, "F": 0.00},
  "F": { "M": 0.50, "F": 0.50}
}

Below is a function that will calculate the posterior probability for the entire probability distribution (over all events). As we’ve mentioned before, in Bayes Rule:

\(P(A|B) = \frac{P(B|A)P(A)}{P(B)}\)

we are calculating an entire posterior probability distribution…a probability for each value of A given each value of B. Additionally, it is unlikely that we know the value of the normalizer \(P(B)\) directly. However, we can calculate \(P(B)\) using the Rule of Total Probability:

\(P(B) = P(B|A=a_1)P(A=a_1) + P(B|A=a_2)P(A=a_2) + \ldots + P(B|A=a_n)P(A=a_n)\)

but it turns out that if we are interested in the probability of every hypothesis in A, we are going to calculate all of these values anyway; we don’t need to go through any extra effort. First, we note that if we are only concerned about the relative order of the hypotheses, we do not need to normalize, so we have:

\(P(A=a_1|B) \propto P(B|A=a_1)P(A=a_1)\)

\(P(A=a_2|B) \propto P(B|A=a_2)P(A=a_2)\)

\(P(A=a_n|B) \propto P(B|A=a_n)P(A=a_n)\)

where \(\propto\) means “proportional to”. We can calculate all of these without calculating the normalizer, \(P(B)\). But having calculated all those terms, we have calculated the terms we need to compute the normalizer and calculate the actual probabilities:

\(P(A=a_1|B) = \frac{P(B|A=a_1)P(A=a_1)}{P(B)}\)

\(P(A=a_2|B) = \frac{P(B|A=a_2)P(A=a_2)}{P(B)}\)

\(P(A=a_n|B) = \frac{P(B|A=a_n)P(A=a_n)}{P(B)}\)

This is what the following function does, although for all values of A and B.

def query( prior, likelihoods, evidence):
    posterior = {}
    for k in prior.keys():
        posterior[ k] = likelihoods[ k][ evidence] * prior[ k]
    normalize( posterior)
    return posterior

Now we can print out the prior probability and the posterior probability:

print( "prior=", elvis_prior)
print( "posterior=", query( elvis_prior, elvis_likelihoods, "M"))

The evidence (that the other child was a boy), increases the probability that they were identical twins (if the other child had been female, it would have been impossible).

What other piece of evidence is implicit in this calculation?

4.8.2.4 M & M’s

Here is a bit more challenging problem.

A friend shows me two bags of M&M’s and tells me that one is from 1994 and the other is from 1996. He won’t tell me which is which but gives me an M&M from each bag: the one from the first bag is yellow and the one from the second bag is green. Which bag is which?

So the first step is map out the events we’re trying to predict and the evidence. I’ll use the same basic approach as before, representing probability distributions as Dicts.

The key information, however, is that the blue M&M was introduced in 1995. Before that, the color mixes in the bags were:

color   1994  1996
brown   30%   13%
yellow  20%   14%
red     20%   13%
green   10%   20%
orange  10%   16%
tan     10%   0%
blue    0%    24%

(I’m not sure where this data came from!)

You should try to solve this for yourself before looking at my solution.

Here is the color distribution for the 1994 bag:

mix94 = dict(brown=0.3, yellow=0.2, red=0.2, green=0.1, orange=0.1, tan=0.1)
mix94

and the color distribution for the 1996 bag:

mix96 = dict(blue=0.24, green=0.2, orange=0.16, yellow=0.14, red=0.13, brown=0.13)
mix96

Now, my two possible events are: either the first bag is the 1994 bag (A) or the first bag is the 1996 bag (B):

A = dict(bag1=mix94, bag2=mix96)
B = dict(bag1=mix96, bag2=mix94)

which gives me my likelihoods:

m_m_likelihoods = {"A": A, "B": B}
m_m_likelihoods

This is a more complex likelihood than we’re used to seeing.

Given that event A holds (bag 1 is the 1994 bag), the probability of picking a yellow M&M from bag 1 is 20%. Given that event B holds (bag 1 is the 1996 bag), the probability of picking a yellow M&M out of bag 1 is 14%.

Our prior is 50/50 for each of the events A and B because there are two bags.

m_m_priors = {"A": 0.5, "B": 0.5}

Our evidence is that I took a yellow M&M out of Bag 1 and a green M&M out of Bag 2:

m_m_evidences = [('bag1', 'yellow'), ('bag2', 'green')]

And now some code to massage it all together:

from copy import deepcopy

def calculate_m_m_posteriors( priors, likelihoods, evidences):
    posteriors = {}
    current_priors = deepcopy( priors)
    for evidence in evidences:
        bag, mnm = evidence
        for hypothesis in priors.keys():
            posteriors[ hypothesis] = likelihoods[ hypothesis][ bag][ mnm] \
              * current_priors[ hypothesis]
        normalize( posteriors)
        current_priors = posteriors
        print( "evidence=", evidence, "posterior=", posteriors)
    return posteriors

print( "prior", m_m_priors)
calculate_m_m_posteriors( m_m_priors, m_m_likelihoods, m_m_evidences)

Based on the evidence, the more likely event is “A”…that the first bag is the 1994 bag of M&Ms and the second bag is the 1996 bag of M&Ms.

One special thing to note about Bayes Rule is that it doesn’t matter if you take the evidence altogether or piece by piece (no pun intended); as long as the pieces of evidence are conditionally independent given the hypothesis (as they are here), you will always get the same result. It’s slightly easier in this case to cycle through the evidence and use the posterior distribution that results as the prior distribution for the next calculation.

This is the beauty of Bayes Rule (and makes it slightly easier to program).

It is always worth noting that in all cases we used probability to deal with systems–processes–exhibiting uncertainty whether it was a breast cancer testing process, the weather, Elvis’s deceased twin or M&M’s.

  1. Can you solve the Monty Hall problem using Bayes Rule?
  2. Can you identify–even in the most general terms–the processes underlying each of these problems?

4.9 Probability and Simulation

Now it may turn out that you cannot easily describe your problem analytically or enumerate all the possibilities, in which case you can turn to simulation to do your event generation and counting to get the probabilities you’re interested in. Here’s an example that actually happened to me.

4.9.1 Cities

Some data scientists are involved in personalization at online retailers. They apply personalization models to different views of the products, whether in sales emails or on the website: anywhere there are default lists of products. Of course, not being content to rest on their laurels, these data scientists are constantly developing new personalization models. In order to test whether a new model is better than an old one, they engage in A/B testing. We’ll talk about A/B testing later in the book but, for now, A/B testing just means dividing your, ahem, test subjects into two groups completely at random. Group A gets the control, or existing, personalization model. Group B gets the treatment, or new, personalization model. That sets up the context for the following problem in computational probability.
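As a minimal sketch of what “completely at random” means here (a hypothetical helper, not any retailer’s actual system), each customer is independently assigned to a group with a fair coin flip:

from random import random

def assign_group( customer):
    # fair coin flip: roughly half get control (A), half get treatment (B)
    return "A" if random() < 0.5 else "B"

groups = [(customer, assign_group( customer)) for customer in ["c1", "c2", "c3"]]

With large mailing lists, this puts about half of each city’s subscribers in each group.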

Suppose you are one of those data scientists and have a personalization model that is being tested in 20 cities nationwide. Each city has 100,000 customers on the mailing list and you send an email every day.

The test has been running a few days and it looks like the new model (called the “treatment”) is about 10% better than the current practice (the “control”). Let us assume this is true (we’ll see later how to test such a claim). After another week of testing, the director of marketing comes to you, concerned that the new model has done very poorly in one of the cities 3 days in a row, and opines that it must not really be 10% better.

Is the director of marketing right? Does this mean the new model is worse than the current one? Maybe, maybe not. Deciding which model is better is a problem in inference. That isn’t really what the director of marketing is asserting, though. They’re asserting that the new model can’t be better if there is a losing streak of 3 days in one city. This is a question we can answer right now, using computational probability.

What we want to know, and what the stakeholder needs to know, is: if the new model really is 10% better, what is the probability of seeing a string of 3 worse outcomes in a particular city? If the probability is high, then we needn’t worry. If the probability is low, we might be concerned about the test.

4.9.2 What do we know?

  • 20 cities
  • each city has 100,000 subscribers, with the list split in half between control and treatment
  • the purchase rate for the control is 0.0001 (0.01%)
  • the purchase rate for the treatment is 0.00011 (0.011%)
  • the lift is 0.011/0.010 = 1.1, that is, 1.1 - 1.0 = 0.1 or 10%

These rates probably seem small, but daily purchase rates often are.
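The lift is just the ratio of the two rates, which we can confirm with a line of arithmetic:

control_rate = 0.0001
treatment_rate = 0.00011
print( treatment_rate/control_rate - 1.0) # about 0.1, that is, a 10% lift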

Let’s start out by simulating a single day’s worth of purchases in a single city. We know what the ideal purchase rate is, but it’s not going to come out exactly the same every day. We need to simulate those purchases and calculate the actual purchase rate:

from random import random, seed
seed(128934662)
def actual_purchase_rate( population, purchase_rate):
    # each subscriber purchases (1.0) or not (0.0) as an independent Bernoulli trial
    purchases = [1.0 if random() < purchase_rate else 0.0 
                 for i in range( population)]
    return sum( purchases)/population

Let’s see how it does:

for _ in range( 5):
    print( actual_purchase_rate( 50000, 0.0001))

These look reasonable.
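As an aside, the number of purchases among a population of subscribers is a single draw from a Binomial distribution, so if you have numpy available you could sample the purchase count directly instead of looping over every subscriber. This is an alternative sketch, not the approach used in this chapter:

import numpy as np

def actual_purchase_rate_binomial( population, purchase_rate):
    # the number of purchasers is a Binomial(population, purchase_rate) draw
    return np.random.binomial( population, purchase_rate)/population

It produces the same distribution of actual purchase rates, just much faster.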

Now let’s simulate a comparison of the outcomes for control and treatment in a city. If the control does better, we’ll record a “1.0”; otherwise (the treatment does at least as well), we’ll record a “0.0”. Remember that half of each mailing list gets the control and half gets the treatment:

def difference_in_purchase_rates(population, control_rate, treatment_rate):
    control_actual = actual_purchase_rate( population//2, control_rate)
    treatment_actual = actual_purchase_rate( population//2, treatment_rate)
    difference = 1.0 if control_actual > treatment_actual else 0.0
    return difference

Now let’s see how that looks:

for _ in range( 5):
    print( difference_in_purchase_rates( 100000, 0.0001, 0.00011))

Now we want to see what happens over N days:

def simulate_difference_for_n_days(population, control_rate, treatment_rate, days):
    return [difference_in_purchase_rates( population, control_rate, treatment_rate)
            for i in range( days)]

Let’s see what it looks like over 30 days:

print( simulate_difference_for_n_days( 100000, 0.0001, 0.00011, 30))

There are a number of ways we might interpret the idea of a “streak”.

  • A streak is N 1’s followed by a 0. For example, if N is 3, then we’re looking for something like 0, 1, 1, 1, 0.
  • A streak is N or more 1’s followed by a 0. For example, if N is 3, then we’re looking for 0, 1, 1, 1, 0 but also 0, 1, 1, 1, 1, 0.
  • A streak is any run of N consecutive 1’s, counting overlapping runs. For example, if N is 3 and there are five 1’s in a row, that counts as 3 streaks: (1, 2, 3), (2, 3, 4), (3, 4, 5).

What kind of streak we’re looking for matters because it affects both the identification of the event of interest and the number of possible events, and thus affects our conclusions. Since probability is just counting, we need to be sure of what we’re counting.

For this problem, let’s say that we’re interested in the third definition…we want to know whenever 3 days in a row have 1’s, regardless of what happens before or after. This makes the simulation a bit easier because for a simulation of length M and a sequence of length N, there are M-N+1 such possible sequences (for M = 30 days and N = 3, that’s 28 windows).

Let’s write a function that identifies these sequences:

def count_sequences( n, data):
    # every overlapping window of length n in the data
    all_sequences = [data[i:i+n] for i in range(len( data)-n+1)]
    # a window is a streak if it is all 1's, that is, if it sums to n
    streaks = sum([1.0 if sum(xs) == float(n) else 0.0 for xs in all_sequences])
    return streaks, len( all_sequences)
data = simulate_difference_for_n_days( 100000, 0.0001, 0.00011, 30)
print( data)
streak, sequences = count_sequences( 3, data)
print( streak, sequences, streak/sequences)

So in this run, about 7.1% of the 28 possible 3-day windows in a 30 day period were losing streaks for a single city. A single run is noisy, though; to estimate the average rate, we need to re-run the experiment a lot of times:

streaks = []
sequences = []
for i in range( 100):
    data = simulate_difference_for_n_days( 100000, 0.0001, 0.00011, 30)
    streak, sequence = count_sequences( 3, data)
    streaks.append( streak)
    sequences.append( sequence)
print( sum(streaks)/sum(sequences))

So, roughly, there’s a 5.0% chance that any particular 3-day window in a single city shows a losing streak over a 30 day period, even if the treatment really is better.

But is this what we really want to know? There are 20 cities…we want to know the probability of observing such a streak in at least one of the 20 cities…so the event space is “streak-in-a-city”.

def streak_in_a_city( cities, population, control_rate, treatment_rate, streak_length):
    results = []
    for i in range( cities):
        data = simulate_difference_for_n_days(population, control_rate,
                                              treatment_rate, 30)
        streak, sequence = count_sequences( streak_length, data)
        # record whether this city saw at least one streak in the 30 day period
        result = 1.0 if streak > 0 else 0.0
        results.append( result)
    return sum(results), cities
streaks, cities = streak_in_a_city( 20, 100000, 0.0001, 0.00011, 3)
print( streaks, cities, streaks/cities)
streaks = []
experiments = []
for i in range( 100):
    streak, cities = streak_in_a_city( 20, 100000, 0.0001, 0.00011, 3)
    streaks.append( streak)
    experiments.append( cities)
print( sum( streaks)/sum(experiments))

As we can see, even if the treatment is definitely better than the control, about 67.7% of cities show a losing streak of 3 days at some point in a 30 day period. Losing streaks like the one the director noticed are exactly what we should expect.

Here’s an interesting observation: once we had the 5.0% probability of a losing streak in a single city, did we need to run a simulation for 20 cities? The answer is no.

If we let a losing streak be “tails” then the question we’re asking about our new model and cities is, if the probability of tails is 5%, and we flip 20 coins simultaneously, what is the probability that we’ll see at least one tail? The “at least one” part is what makes it harder. However, if we reframe the question as, what is the probability of seeing 20 heads when tossing 20 coins if the probability of heads is 95%, then we have a draw from a Binomial distribution:

from scipy.stats import binom
binom.pmf(20, 20, 0.95) # P(exactly 20 heads in 20 flips) = 0.95**20, about 0.36

If this is the probability of all heads, then we can use the Axioms of Probability to find the probability of 1 tail or 2 tails or 3 tails, etc., which together make up “at least one tail”. Since the probabilities of all outcomes sum to one, and we have the probability of the single outcome we don’t want, we can simply subtract it from 1 to get the probability of the outcomes we do want: 1 - 0.36 = 0.64, the probability of “at least one tail”. The result is close to what we simulated. You should always be on the lookout for such shortcuts.
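In code, the shortcut is a single line, reusing the binom we imported above:

print( 1 - binom.pmf( 20, 20, 0.95)) # 1 - 0.95**20, about 0.64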

Now try solving the problem yourself with different assumptions about what constitutes a streak or even what the true difference between the new model (treatment) and control is (the lift…which can be negative and still be called “lift”).

4.10 Conclusion

We covered a lot of ground in this chapter, from the basic ideas of probability to fallacies in probabilistic reasoning. Through it all, probability remains a fundamental tool in data science. Returning to our definition of data science:

Data science is the application of math and computers to solve problems that stem from a lack of knowledge, constrained by the small number of people with any interest in the answers.

The fundamental problem of Science is extrapolating general conclusions from specific data sets. That is, inference. Probability is our fundamental tool for dealing with the problem of inference because it allows us to model our uncertainty in a very rigorous way. Still…it’s not a panacea. It does not magically make inference deductive instead of inductive.

Although the types, rules, and laws of probability are important, the most important part of this chapter is Bayes Rule (or Theorem). Bayes Rule tells us how to update our beliefs based on evidence and forms the basis for the statistical inference discussed in this book.

4.11 Review

When asked to give an example, do not use dice, coins or cards (or any gambling device).

  1. Why is a coin flip deterministic but we still need probability to model it?
  2. What is our working definition of probability?
  3. If \(A = \{a_1, a_2, a_3\}\) and \(B = \{b_1, b_2\}\), then answer the following questions:
    1. What does \(P(A, B)\) denote?
    2. What does \(P(A)\) denote? How did we arrive at it from \(P(A, B)\)?
    3. What does \(P(A=a_1)\) denote?
    4. What does \(P(a_1)\) denote? Why should we be careful when using a “shorthand”?
    5. What does \(P(A|B)\) denote? How many probability distributions does it represent?
    6. What does \(P(A|B)P(B)\) denote? Write it out.
    7. Express \(P(A)\) using the Total Probability.
  4. What is the difference between an outcome and an event? Give an example not shown elsewhere.
  5. Give an example of the independence of two outcomes.
  6. Give an example of conditional independence of three outcomes.
  7. What is the Gambler’s Fallacy?
  8. What is the Inverse Probability Fallacy?
  9. What is the Prosecutor’s Fallacy?
  10. Joe has been randomly selected for drug testing from a population that has about 3% heroin use. Joe tests positive for heroin use (\(u\)). The test used correctly identifies users 95% of the time \(P(+|u) = 0.95\) and correctly identifies non-users 90% of the time \(P(-|c) = 0.90\) (\(c\) for “clean”). What is the probability that Joe is using heroin, \(P(u|+)\)? What are the incremental and total evidence in this problem?