# Probability Spaces

## Contents

# 7.1. Probability Spaces#

Probability measures the amount of uncertainty in an event. Typically, the probability of an event is expressed as a nonnegative real number ranging between \(0\) and \(1\).

When an event has a probability \(1\), we are certain that the event will occur.

When an event has a probability \(0\), we are certain that the event will not occur.

When we toss a coin, we are not certain whether it will turn heads (H) or tails (T). We normally assign a probability of \(0.5\) to both H and T outcomes as we believe that both outcomes are equally likely.

When we throw a die, then the possible outcomes are \(\{ 1, 2, 3, 4, 5, 6 \}\). If we believe that each outcome is equally likely, then we can assign a probability of \(\frac{1}{6}\) to each outcome.

If we have a fairly reliable digital communication channel, then the probability of error may be as low as \(1e-6\). In other words, there is a one in a million chance of a transmitted bit flipping during the transmission.

Given a sample space of outcomes, there are two main activities involved:

Assigning a probability to different outcomes or events in a manner that the probabilities are sensible.

Use the laws of probability theory to infer the probabilities of other outcomes or events.

Our notes will be based on the axiomatic treatment of probability. We describe the rules for assigning sensible probabilities to different events. We then develop the theory for computing with the probabilities. We leave out the task of estimating the probabilities of individual events which is covered extensively in statistics. The foundations of modern probability theory are rooted in the measure theory which is a study of measures. A measure is a generalization of geometric notions like length, area and volume. The probability of a set is also a measure. Although we don’t provide extensive coverage of measure theory in these notes, we cover some fundamental concepts like \(\sigma\)-algebra in our introduction to probability theory.

For an introduction to probability theory, see [8, 64, 68, 69]. For engineering applications, see [72]. For a comprehensive measure theoretic treatment, see [10].

(Random experiment)

A *random experiment* is an experiment in which the
outcomes are nondeterministic.
In other words, different outcomes can occur each time
the experiment is run.

## 7.1.1. Sample Space#

(Sample space)

The *sample space* associated with a random experiment
is the set of all possible outcomes of the experiment.

We shall often denote the sample space by the symbol \(\Omega\). Individual outcomes shall often be denoted by \(\zeta\).

## 7.1.2. Sigma Algebras and Fields#

(Field, Algebra)

Consider a sample space \(\Omega\)
and a certain collection of subsets of \(\Omega\) denoted by \(\FFF\).
We say that \(\FFF\) forms an *algebra* (in the set theoretic sense)
over \(\Omega\) if meets the following rules:

\(\EmptySet \in \FFF\).

If \(A, B \in \FFF\) then \(A \cup B \in \FFF\) and \(A \cap B \in \FFF\).

If \(E \in \FFF\) then \(\Omega \setminus E = E^c \in \FFF\).

In other words,
\(\FFF\) contains the empty set,
\(\FFF\) is closed under union, intersection, complement operations.
The pair \((\Omega, \FFF)\) is known as a *field*.

Note

The term *algebra* is used here in the sense of Boolean
algebra from set theory whose operations include
union, intersection and complement.
The notion of *field* here is different from the notion
of fields (e.g. \(\RR\) and \(\CC\))
in ring theory.
Similarly the term algebra over \(\Omega\) should not
be confused with algebras over fields or rings in ring theory.

(trivial algebra)

Let \(\Omega\) be an arbitrary sample space. Then, \(\{ \EmptySet, \Omega \}\) is a trivial algebra over \(\Omega\).

(algebras over a set of 4 elements)

Let \(\Omega = \{ 1, 2, 3, 4 \}\).

\(\{\EmptySet, \{ 1, 2, 3, 4 \} \}\) is an algebra.

\(\{\EmptySet, \{ 1, 2\}, \{3, 4 \}, \{ 1, 2, 3, 4 \} \}\) is an algebra.

\(\{\EmptySet, \{ 1\}, \{2, 3, 4 \}, \{ 1, 2, 3, 4 \} \}\) is an algebra.

The power set consisting of all subsets of \(\Omega\) is an algebra.

(Properties of an algebra)

Let \(\FFF\) be an algebra over \(\Omega\). Then

\(\Omega \in \FFF\).

If \(A_1, \dots, A_n \in \FFF\), then \(A_1 \cup \dots \cup A_n \in \FFF\).

If \(A_1, \dots, A_n \in \FFF\), then \(A_1 \cap \dots \cap A_n \in \FFF\).

If \(A, B \in \FFF\) then \(A \setminus B \in \FFF\).

\(\FFF\) includes the sample space. \(\FFF\) is closed under finite unions and finite intersections. \(\FFF\) is closed under set difference.

Proof. (1) Sample space

\(\EmptySet \in \FFF\).

\(\Omega = \EmptySet^c\).

Hence \(\Omega \in \FFF\).

(2) Closure under finite union by mathematical induction

Base case: for \(n=2\) is trivial.

Assume that the statement is true for some \(n \geq 2\).

Let \(A_1, \dots, A_n, A_{n+1} \in \FFF\).

By inductive hypothesis \(A = A_1 \cup \dots \cup A_n \in \FFF\).

Then

\[ A_1 \cup \dots \cup A_n \cup A_{n + 1} = (A_1 \cup \dots \cup A_n) \cup A_{n + 1} = A \cup A_{n + 1}. \]Since both \(A, A_{n + 1} \in \FFF\), hence there union is also in \(\FFF\).

Hence by mathematical induction, \(\FFF\) is closed under all finite unions.

(3) Closure under finite intersection

We note that

\[ A_1 \cap \dots \cap A_n = (A_1^c \cup \dots \cup A_n^c)^c. \]Since \(A_i \in \FFF\), hence \(A_i^c \in \FFF\) for \(i=1,\dots,n\).

Then \(A = A_1^c \cup \dots \cup A_n^c \in \FFF\) due to (2).

Then \(A_1 \cap \dots \cap A_n = A^c \in \FFF\).

(4) Set difference

We note that \(A \setminus B = A \cap B^c\).

Since \(B \in \FFF\), hence \(B^c \in \FFF\).

Since \(A, B^c \in \FFF\), hence \(A \cap B^c \in \FFF\).

(Algebra from a partition)

Let \(\Omega\) be a sample space. Let \(A = \{ A_1, \dots, A_n \}\) be a (finite) partition of \(\Omega\). In other words, \(A_i\) are pairwise disjoint and their union is \(\Omega\). Then the collection \(\FFF\) consisting of all unions of the sets \(A_i\) (including the empty set which is the union of zero sets) forms an algebra.

Since there are \(n\) sets in the partition \(A\), hence number of elements of \(\FFF\) is \(2^n\).

By definition \(\EmptySet\) and \(\Omega\) are in \(\FFF\).

Let \(X, Y \in \FFF\). Then both \(X\) and \(Y\) are unions of some members of \(A\).

Hence \(X \cup Y\) is also a union of some members of \(A\).

Hence \(\FFF\) is closed under union.

If \(X\) and \(Y\) are disjoint then \(X \cap Y \in \FFF\).

Otherwise, \(X \cap Y\) is the union of some members of \(A\) which are common in both \(X\) and \(Y\). Hence \(X \cap Y \in \FFF\).

Hence \(\FFF\) is closed under intersection.

Let \(X \in \FFF\). Then \(X\) is union of some members of \(A\).

Then \(\Omega \setminus X\) is the union of remaining members of \(A\).

Hence \(\Omega \setminus X \in \FFF\).

Hence \(\FFF\) is closed under complement.

Hence \(\FFF\) is an algebra.

Often, the sample space \(\Omega\) is an infinite set (e.g., \(\RR\)) and the field of subsets of \(\Omega\) is also infinite. For example, the Borel field defined over \(\RR\) contains all the open and closed intervals in \(\RR\). We have to deal with situations which involve countable unions and intersections of sets in the field. Mathematical induction shows that a field is closed under finite unions and intersections but it is not enough to prove that it is also closed under countable unions. To handle such cases, we need to extend the definition of a field.

\(\sigma\)-field, \(\sigma\)-algebra)

(Consider an infinite sample space \(\Omega\)
and a certain collection of subsets of \(\Omega\) denoted by \(\FFF\).
We say that \(\FFF\) is a \(\sigma\)-*algebra* over \(\Omega\) if
it is an algebra over \(\Omega\) and it is closed
under countable unions.
In other words, if \(A_1, A_2, \dots\) is a countable
collection of sets in \(\FFF\), then, their union

The pair \((\Omega, \FFF)\) is known as a \(\sigma\)-*field*.

When the sample space is obvious from the context, we will often call \(\FFF\) also as a field or a \(\sigma\)-field as appropriate.

(Power set)

Let \(\Omega\) be an arbitrary sample space. Then its power set \(\PPP(\Omega)\) is a \(\sigma\)-algebra.

(Countable intersection)

Let \(\FFF\) be a \(\sigma\)-algebra over \(\Omega\). Then \(\FFF\) is closed under countable intersection. In other words, if \(A_1, A_2, \dots\) is a countable collection of sets in \(\FFF\), then, their intersection

Proof. We use the complement property.

Let \(A_1, A_2, \dots\) be subsets in \(\FFF\).

Then \(A_1^c, A_2^c, \dots\) are also in \(\FFF\).

Then their countable union:

\[ \bigcup_{i=1}^{\infty} A_i^c \in \FFF. \]Taking complement, we get

\[ \bigcap_{i=1}^{\infty} A_i \in \FFF \]as desired.

Any \(\sigma\)-algebra \(\FFF\) of subsets of a sample space \(\Omega\) likes between the two extremes:

### 7.1.2.1. Generated \(\sigma\) Algebra#

(atom)

An *atom* of \(\FFF\) is a set \(A \in \FFF\) such that
the only subsets of \(A\) which are also in \(\FFF\)
are the empty set \(\EmptySet\) and \(A\) itself.

\(\sigma\)-algebras)

(Intersection ofLet \(\GGG = \{ \GGG_i \}_{i \in I}\) be a nonempty collection of \(\sigma\)-algebras over \(\Omega\) where \(I\) is some index set. Then their intersection \(\bigcap_{i \in I} \GGG_i\) is also a \(\sigma\)-algebra.

Proof. Denote \(\FFF = \bigcap_{i \in I} \GGG_i\). We shall verify all the properties of a \(\sigma\)-algebra.

Empty Set

Since \(\EmptySet \in \GGG_i\) for every \(i\), hence \(\EmptySet \in \FFF\).

Union

Let \(A, B \in \FFF\).

Then \(A, B \in \GGG_i\) for every \(i\).

Hence \(A \cup B \in \GGG_i\) for every \(i\).

Hence \(A \cup B \in \FFF\).

Intersection

Let \(A, B \in \FFF\).

Then \(A, B \in \GGG_i\) for every \(i\).

Hence \(A \cap B \in \GGG_i\) for every \(i\).

Hence \(A \cap B \in \FFF\).

Countable union

Let \(A_1, A_2, \dots\) be a countable collection of subsets in \(\FFF\).

Then \(A_1, A_2, \dots \in \GGG_i\) for every \(i\).

Since each \(\GGG_i\) is a \(\sigma\)-algebra, hence \(\bigcup_{j=1}^{\infty} A_j \in \GGG_i\) for every \(i\).

Hence \(\bigcup_{j=1}^{\infty} A_j \in \FFF\).

\(\sigma\)-algebra generated by a collection of sets)

(Let \(\AAA = \{ A_i \}_{i \in I}\) be a collection of subsets of \(\Omega\)
where \(I\) is an index set.
Let \(\FFF\) be the smallest \(\sigma\)-algebra such that
\(A_i \in \FFF\) for every \(i \in I\).
Then \(\FFF\) is called the \(\sigma\)-*algebra generated by* \(\AAA\)
and is denoted by \(\sigma(\AAA)\).

Here by smallest, we mean that if there is any other \(\sigma\)-algebra \(\GGG\) such that \(A_i \in \GGG\) for every \(i \in I\), then \(\FFF \subseteq \GGG\).

Since the power set \(\PPP(\Omega)\) is a \(\sigma\) algebra and it contains all subsets of \(\Omega\) (including \(\AAA\)) hence there is always a \(\sigma\)-algebra containing a given collection of sets.

It may not be possible to visualize every member of \(\FFF\) easily from the descriptions of \(A_i\).

We next show that a smallest \(\sigma\)-algebra exists for every collection \(\AAA\).

\(\sigma\)-algebra)

(Existence of the generatedLet \(\AAA\) be a collection of subsets of \(\FFF\). Then there exists a smallest \(\sigma\)-algebra containing \(\AAA\). In other words, there is a \(\sigma\)-algebra generated by \(\AAA\).

Proof. The main issue is to verify that if there is any other \(\sigma\)-algebra, it contains the smallest one. We provide a constructive proof.

Let \(\GGG = \{ \GGG_j \}_{j \in J}\) be the collection of all \(\sigma\)-algebras containing all sets of \(\AAA\).

We can see that \(\GGG\) is nonempty since \(\PPP(\Omega) \in \GGG\).

By Theorem 7.3, \(\FFF = \bigcap_{j \in J} \GGG_j\) is a \(\sigma\)-algebra.

We claim that \(\sigma (\AAA) = \FFF\).

Since \(\AAA \subseteq \GGG_j\) for every \(j \in J\), hence \(\AAA \subseteq \FFF\).

By construction \(\FFF \subseteq \GGG_j\) for every \(j \in J\). In other words, \(\FFF\) is contained in every \(\sigma\)-algebra containing \(\AAA\).

Hence \(\FFF\) is indeed the \(\sigma\)-algebra generated by \(\AAA\).

We note that if \(\AAA\) itself is a \(\sigma\)-algebra, then of course \(\sigma(\AAA) = \AAA\).

(Algebra generated by 2 sets)

Let \(\Omega\) be an arbitrary sample space. Let \(A\) and \(B\) be two subsets of \(\Omega\) which are not necessarily disjoint. We shall construct the algebra \(\FFF\) generated by \(A\) and \(B\).

Since \(A \in \FFF\), hence \(A^c \in \FFF\).

Similarly, \(B^c \in \FFF\).

Since \(A\) and \(B\) and their complements are in \(\FFF\), hence \(AB, AB^c, A^c B, A^c B^c \in \FFF\) as \(\FFF\) is closed under intersection.

Let us name them \(E = AB\), \(F = AB^c\), \(G = A^c B\) and \(H = A^c B^c\).

We can see that these four sets \(E, F, G, H\) are disjoint and

\[ \Omega = E \cup F \cup G \cup H. \]We now have a partition of \(\Omega\) into \(4\) disjoint sets. We can follow the lead from Example 7.3 to construct an algebra by constructing all the unions of \(0\) or more sets from the collection \(\{ E, F, G, H \}\).

The empty set \(\EmptySet\) doesn’t contain any of these sets \({4 \choose 0}\).

There are \({4 \choose 1} = 4\) of these disjoint subsets.

There are \({4 \choose 2} = 6\) pair-wise unions of these \(4\) sets.

There are \({4 \choose 3} = 4\) unions of \(3\) of these \(4\) subsets.

There is \({4 \choose 4} = 1\) union of all the \(4\) subsets which is \(\Omega\).

A total of \(1 + 4 + 6 + 4 + 1 = 16 = 2^4\) possible subsets are formed.

In particular, note that \(A = E \cup F\), \(B = E \cup G\), \(A^c = G \cup H\) and \(B^c = F \cup H\).

We can see that this collection of \(16\) subsets of \(\Omega\) is an algebra following Example 7.3.

This is indeed the smallest algebra containing \(A\) and \(B\) as any other algebra must contain \(\FFF\).

Hence it is the algebra generated by \(A\) and \(B\).

We can see that \(E, F, G, H\) are the atoms of the algebra \(\FFF\).

### 7.1.2.2. Dynkin \(\pi-\lambda\) theorem#

When constructing probability measures (see Probability Measure and Space), it is generally impossible to assign a probability to each subset in a \(\sigma\)-algebra. The Carathéodory extension theorem allows us to define a measure explicitly for only a small collection of simple sets, which may or may not form a \(\sigma\)-algebra (e.g. the atoms inside a \(\sigma\)-algebra) and automatically extend the measure to all other sets in the algebra. The uniqueness claim in the extension theorem makes use of Dynkin \(\pi-\lambda\) theorem (below). Readers may skip this subsection in the first reading.

\(\pi\) system)

(Let \(\Omega\) be a sample space. A collection \(P\)
of subsets of \(\Omega\) is a \(\pi\)-*system* if
\(P\) is closed under finite intersections.
In other words, if \(A, B \in P\), then \(A \cap B \in P\).

\(\lambda\) system)

(Let \(\Omega\) be a sample space. A collection \(L\)
of subsets of \(\Omega\) is a \(\lambda\)-*system* if

\(L\) contains the empty set \(\EmptySet\)

\(L\) is closed under complements: if \(A \in L\) then \(A^c \in L\)

\(L\) is closed under countable

*disjoint*union: if \(A_1, A_2, \dots \in L\) and \(A_i \cap A_j = \EmptySet\) for every \(i \neq j\), then \(\bigcup_{i=1}^{\infty} A_i \in L\).

\(\pi-\lambda\) theorem)

(DynkinIf \(P\) is a \(\pi\) system and \(L\) a \(\lambda\)-system of subsets of \(\Omega\) with \(P \subseteq L\) then

In other words, the \(\sigma\)-algebra generated by \(P\) is contained in \(L\).

The proof of this result is involved. The overall strategy works as follows:

We shall construct a \(\sigma\)-algebra that lies between \(P\) and \(L\).

In particular, we shall construct the set \(l(P)\) which is the intersection of all \(\lambda\)-systems containing \(P\).

We then show that \(l(P)\) is a \(\lambda\)-system.

We then show that \(l(P)\) is also a \(\pi\)-system.

We show that a collection which is both a \(\lambda\)-system and a \(\pi\)-system is also a \(\sigma\)-algebra.

Thus, we claim that \(l(P)\) is a \(\sigma\)-algebra.

We finally show that \(\sigma(P) \subseteq l(P) \subseteq L\).

In order to prove this result, we first prove some of the intermediate steps individually.

(Closedness under proper differences)

A \(\lambda\)-system is closed under proper differences. In other words, if \(A, B \in L\) where \(L\) is a \(\lambda\)-system and \(A \subseteq B\) then the difference \(B \setminus A\) is also in \(L\).

Proof. We note that \(B \setminus A = B \cap A^c\). We shall show this as a complement of the disjoint union of sets.

Since \(B \in L\), hence \(B^c \in L\).

Since \(A \subseteq B\), hence \(A\) and \(B^c\) are disjoint.

Hence \(D = A \cup B^c \in L\) since it is a disjoint union.

Hence \(D^c = A^c \cap B\) in \(L\).

But \(B \setminus A = B \cap A^c = D^c\).

Hence \(B \setminus A \in L\).

\(\pi + \lambda \implies \sigma\))

(A family of subsets of \(\Omega\) which is both a \(\pi\)-system and a \(\lambda\)-system is a \(\sigma\)-algebra.

Proof. Let \(S\) be a family of subsets of \(\Omega\) which is both a \(\pi\)-system and a \(\lambda\)-system.

\(\EmptySet \in S\) since it is a \(\lambda\)-system.

\(S\) is closed under finite intersections since it is a \(\pi\)-system.

We just need to show that it is closed under countable unions of (not necessarily disjoint) sets.

Let \(A_1, A_2, \dots \in S\).

We shall write \(\bigcup_{i=1}^{\infty} A_i\) as a countable union of disjoint sets.

Let \(B_1 = A_1\).

For \(n \geq 2\), let

\[ B_n = A_n \setminus (A_1 \cup A_2 \cup \dots \cup A_{n-1}) = A_n \cap A_1^c \cap A_2^c \cap \dots \cap A_{n-1}^c. \]In other words, \(B_n\) consists of all elements of \(A_n\) which don’t appear in any of the sets \(A_1, \dots, A_{n-1}\).

Then \(B_1, B_2, \dots\) is a sequence of disjoint sets.

We can also see that \(A_1 \cup \dots \cup A_n = B_1 \cup \dots \cup B_n\) for every \(n\).

Hence \(\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i\).

Since \(S\) is a \(\lambda\)-system, hence \(A_i^c \in S\) for every \(i\).

Since \(S\) is a \(\pi\)-system, hence \(B_i \in S\) for every \(i\).

Hence \(\bigcup_{i=1}^{\infty} B_i \in S\) as it is a countable union of disjoint sets.

\(\lambda\)-co-systems)

(Suppose \(L'\) is a \(\lambda\)-system of subsets of \(\Omega\). For any set \(A \in L'\), let \(S_A\) be the set of all \(B \in \Omega\) for which \(A \cap B \in L'\). Then \(S_A\) is a \(\lambda\)-system.

Proof. Empty set

Since \(\EmptySet = A \cap \EmptySet \in L'\), hence \(\EmptySet \in S_A\).

Countable union of disjoint sets

Let \(B_1, B_2, \dots \in S_A\) be a sequence of disjoint subsets in \(S_A\).

Then \(A \cap B_i \in L'\) for every \(i\).

Then \(\bigcup (A \cap B_i) \in L'\) since \(A \cap B_i\) are also disjoint.

Also \(\bigcup (A \cap B_i) = A \cap (\bigcup B_i)\).

Since \(A \cap (\bigcup B_i) \in L'\), hence \(\bigcup B_i \in S_A\).

Hence \(S_A\) is closed under countable union of disjoint sets.

Complements

Let \(B \in S_A\). Then \(A \cap B \in L'\).

Now \(A \cap B^c = A \setminus B = A \setminus (A \cap B)\).

Since \(A \in L'\) and \(A \cap B \in L'\) and \(A \cap B \subseteq A\), hence due to Lemma 7.1, \(A \setminus (A \cap B) \in L'\).

Since \(A \cap B^c \in L'\) hence \(B^c \in S_A\).

Hence \(S_A\) is closed under complements.

Thus, \(S_A\) is indeed a \(\lambda\)-system.

Let \(l(P)\) be the intersection of all \(\lambda\)-systems containing \(P\). Then \(l(P)\) is a \(\lambda\)-system.

Proof. We can easily verify that \(l(P)\) satisfies all the properties of a \(\lambda\)-system.

Let \(\LLL = \{L_i \}\) be the collection of all \(\lambda\)-systems containing \(P\).

Then \(l(P) = \bigcap L_i\).

Since \(\EmptySet \in L_i\) for every \(i\), hence \(\EmptySet \in l(P)\).

Let \(A \in l(P)\).

Then \(A \in L_i\) for every \(i\).

Hence \(A^c \in L_i\) for every \(i\).

Hence \(A^c \in l(P)\).

Let \(A_1,A_2, \dots \in l(P)\) be a collection of pairwise disjoint sets.

Then \(A_1,A_2,\dots \in L_i\) for every \(i\).

Hence \(\bigcup A_i \in L_i\) for every \(i\).

Hence \(\bigcup A_i \in l(P)\).

Let \(l(P)\) be the intersection of all \(\lambda\)-systems containing \(P\). Then \(l(P)\) is a \(\pi\)-system.

Proof. The proof uses a *bootstrap* argument often used in
measure theory.

We first show that for any \(A \in P\) and \(B \in l(P)\), \(A \cap B \in l(P)\).

Let \(A \in P\).

Then \(A \in l(P)\).

Let \(S_A\) be the set of all sets \(B \subseteq \Omega\) such that \(A \cap B \in l(P)\).

By Lemma 7.3, \(S_A\) is a \(\lambda\)-system.

Let \(B \in P\). Then \(A \cap B \in P\) since \(P\) is a \(\pi\)-system.

But \(P \subseteq l(P)\).

Hence \(A \cap B \in l(P)\).

Hence \(B \in S_A\).

Hence \(P \subseteq S_A\).

Thus, \(S_A\) is a \(\lambda\)-system containing \(P\).

Hence \(l(P) \subseteq S_A\) as \(l(P)\) is the intersection of all \(\lambda\)-systems containing \(P\).

Thus, for any \(A \in P\) and for any \(B \in l(P)\), the intersection \(A \cap B \in l(P)\).

\(B \in l(P) \implies B \in S_A\).

\(B \in S_A \implies A \cap B \in l(P)\).

We now show that for any \(A, B \in l(P)\), \(A \cap B \in l(P)\).

Consider any \(B \in l(P)\).

Let \(S_B\) be the set of all sets \(C \subseteq \Omega\) such that \(B \cap C \in l(P)\).

By preceding argument \(P \subseteq S_B\).

Let \(A \in P\).

Since \(A \in P\) and \(B \in l(P)\) hence \(A \cap B \in l(P)\).

Hence \(A \in S_B\).

Hence \(P \subseteq S_B\).

By Lemma 7.3, \(S_B\) is a \(\lambda\)-system.

Therefore \(l(P) \subseteq S_B\).

This means that for any \(A \in l(P)\), the intersection \(A \cap B \in l(P)\).

\(A \in l(P) \implies A \in S_B\).

\(A \in S_B \implies B \cap A \in l(P)\).

Thus, \(l(P)\) is closed under intersections and is indeed a \(\pi\)-system.

We are now ready to prove the Dynkin \(\pi-\lambda\) theorem (Theorem 7.5).

Proof. Let \(l(P)\) be the intersection of all \(\lambda\)-systems containing \(P\).

By hypothesis \(L\) is a \(\lambda\)-system containing \(P\).

By definition \(l(P) \subseteq L\).

By Lemma 7.4, \(l(P)\) is a \(\lambda\)-system.

By Lemma 7.5, \(l(P)\) is a \(\pi\)-system.

By Lemma 7.2, \(l(P)\) is a \(\sigma\)-algebra.

By definition \(\sigma(P)\) is the smallest \(\sigma\)-algebra containing \(P\).

Hence \(P \subseteq \sigma(P) \subseteq l(P) \subseteq L\).

### 7.1.2.3. Borel \(\sigma\)-Algebra#

Recall the notions of topology (open sets, closed sets, etc.) on the real line from Topology of Real Line. If we consider the collection of open sets of \(\RR\), denoted by \(\OOO\), then it is clear that it is not a \(\sigma\)-algebra as it is not closed under complements. We are interested in the \(\sigma\)-algebra generated by \(\OOO\). By Theorem 7.4 we know that such a \(\sigma\)-algebra exists.

\(\sigma\)-algebra)

(BorelThe \(\sigma\)-algebra generated by the open sets of the
real line \(\RR\) is known as the
*Borel* \(\sigma\)-*algebra* of \(\RR\)
and is denoted by \(\BBB\).
In other words, if \(\OOO\) is the collection
of all open subsets of \(\RR\), then

The pair \((\RR, \BBB)\) is known as the
*Borel field*.
The members of the Borel \(\sigma\)-algebra
are known as *Borel sets*.

Since \(\BBB\) is a \(\sigma\)-algebra, it contains all open sets, all closed sets, all countable unions of closed sets, all countable intersections of open sets. There are other subsets of real line which are not included in the Borel \(\sigma\)-algebra. However, they don’t happen to be of much engineering and scientific interest.

(Examples of Borel sets)

Following are some examples of Borel sets:

\(\{ x \}\) for any \(x \in \RR\).

The set of rational numbers.

Any countable subset of \(\RR\).

The intervals \((0,1)\), \([0,1]\), \((0, 1]\), \([1,0)\).

\([1,2] \cup [4,5]\).

\(G_{\delta}\)-set)

(A countable intersection of open sets is known as a \(G_{\delta}\)-set.

\(F_{\sigma}\)-set)

(A countable union of closed sets is known as a \(F_{\sigma}\)-set.

We recall that a countable intersection of open sets need not be open and a countable union of closed sets need not be closed. However \(\BBB\) contains all the \(G_{\delta}\) and \(F_{\sigma}\) sets since it is a \(\sigma\)-algebra.

We haven’t yet shown that \(\BBB \neq 2^{\RR}\). In other words, there exist non-Borel subsets of \(\RR\). There are other characterizations of the Borel \(\sigma\)-algebra which provide us with a better understanding of its structure.

(Generation from one-sided closed intervals)

The Borel \(\sigma\)-algebra \(\BBB\) is generated by intervals of the form \((-\infty, a]\), where \(a \in \QQ\) is a rational number.

Proof. .

Let \(\OOO_0\) denote the collection of all open intervals.

By Theorem 2.1, every open set in \(\RR\) is an at most countable union of open intervals.

Hence \(\sigma(\OOO_0) = \BBB\).

Let \(\DDD\) denote the collection of all intervals of the form \((-\infty, a], a \in \QQ\).

Let \((a,b) \in \OOO_0\) for some \(a < b\).

Let \(a_n\) be a rational number in \((a, a + \frac{1}{n})\). We can see that \(a_n \downarrow a\) as \(n \to \infty\).

Let \(b_n\) be a rational number in \((b - \frac{1}{n}, b)\). We can see that \(b_n \uparrow b\) as \(n \to \infty\).

Thus,

\[ (a, b) = \bigcup_{n=1}^{\infty}(a_n, b_n] = \bigcup_{n=1}^{\infty} \{ (-\infty, b_n] \cap (-\infty, a_n]^c \}. \]Hence \((a, b) \in \sigma(\DDD)\).

Hence \(\OOO_0 \subseteq \sigma(\DDD)\).

Hence \(\sigma(\OOO_0) \subseteq \sigma(\DDD)\).

However, every element of \(\DDD\) is a closed set.

Hence \(\sigma(\DDD) \subseteq \BBB\).

We have

\[ \BBB \subseteq \sigma(\OOO_0) \subseteq \sigma(\DDD) \subseteq \BBB. \]Hence \(\BBB = \sigma(\DDD)\).

## 7.1.3. Events#

(Event)

An *event* is a subset of the sample space of
a random experiment that belongs to some
\(\sigma\)-algebra defined on it.
Let \(\Omega\) be the sample space of a random experiment.
Let \(\FFF\) be a \(\sigma\)-algebra of subsets of \(\Omega\).
Then every member of \(\FFF\) is an event.

Note

Events are the subsets of the sample space to which probabilities can be assigned. For finite sample spaces, any subset can be an event. For infinite sample spaces, it may not be possible to meaningfully assign probabilities to every possible subset. We need the notion of a \(\sigma\)-algebra which is a collection of subsets of the sample space satisfying closure under countable unions, intersections and complements. The subsets belonging to a \(\sigma\)-algebra can be assigned probabilities and are events.

We can translate the set-theoretic language to the language of events as follows. Let \(A, B\) be two different events.

\(A\) doesn’t occur is denoted by \(A^c\).

Either \(A\) or \(B\) occur is denoted by \(A \cup B\).

Both \(A\) and \(B\) occur is denoted by \(A B\).

\(A\) occurs and \(B\) doesn’t occur is denoted by \(A \setminus B\). This can also be denoted as \(A B^c\).

The events \(A\) and \(B\) are

*exhaustive*if \(\Omega = A \cup B\). In particular \(A \cup A^c = \Omega\).\(A\) and \(B\) events are exclusive if \(A B = \EmptySet\).

(Elementary event)

An event consisting of only one outcome
is called a *singleton event* or an *elementary event*.

(Certain event)

The sample space \(\Omega\) is known as the *certain event*.

Since \(\Omega\) contains all possible outcomes of an experiment, hence this event always occurs whenever the experiment is run.

(Null event)

The empty set \(\EmptySet\) is known as the *null event*.

The null event never occurs.

(Mutually exclusive events)

Let \(E\) and \(F\) be two events. If \(E\) and \(F\) are disjoint sets then we say that the two events are mutually exclusive.

## 7.1.4. Probability Measure and Space#

We next provide an axiomatic definition of a probability measure. Note that we will often write a joint event (intersection of two events) as \(A B\) rather than \(A \cap B\).

(Probability measure)

Let \(\Omega\) be a sample space and
let \(\FFF\) be a \(\sigma\) algebra of subsets of \(\Omega\).
A *probability measure* is a set function \(\PP : \FFF \to \RR\)
that assigns to every event \(E \in \FFF\) a real number
\(\PP(E)\) called the probability of the event \(E\) satisfying
the following rules:

Nonnegativity: \(\PP(E) \geq 0\).

Unit measure or normalization: \(\PP(\Omega) = 1\).

Additivity: \(\PP(E \cup F) = \PP(E) + \PP(F)\) if \(E F = \EmptySet\).

We can write it in the form of axioms (first introduced by Andrey Kolmogorov).

(First axiom: nonnegativity)

The probability of an event is a nonnegative real number.

This axiom implies that the probability is always finite.

(Second axiom: unit measure)

(Third axiom: additivity)

Let \(E, F \in \FFF\) be two disjoint sets. Then

### 7.1.4.1. Probability Space#

(Probability space)

A sample space endowed with a \(\sigma\)-algebra and
a probability measure is known as a *probability space*.
In other words,
let \(\Omega\) be the sample space of a random experiment,
let \(\FFF\) be a \(\sigma\) algebra of subsets of \(\Omega\)
and let \(\PP\) be a probability measure defined on \(\FFF\).
Then the triplet \((\Omega, \FFF, \PP)\) is known as
a probability space.

We next establish some basic facts about a probability measure.

### 7.1.4.2. Basic Properties#

(Properties of a probability measure)

Let \((\Omega, \FFF, \PP)\) be a probability space. Then for the events contained in \(\FFF\) satisfy the following properties.

Probability of null event: \(\PP(\EmptySet) = 0\).

\(\PP(E F^c) = \PP(E) - \PP(EF)\).

Complement rule: \(\PP(E) = 1 - \PP(E^c)\).

Sum rule: \(\PP(E \cup F) = \PP(E) + \PP(F) - \PP(EF)\).

Monotonicity: If \(E \subseteq F\), then

\[ \PP (E)\leq \PP(F). \]Numeric bound:

\[ 0 \leq \PP(E) \leq 1. \]Finite additivity: For any positive integer \(n\), we have

\[ \PP\left ( \bigcup_{i=1}^n E_i \right) = \sum_{i=1}^n \PP(E_i) \]if \(E_1, E_2, \dots, E_n\) are pairwise disjoint events.

Proof. (1)

\(\EmptySet\) and \(\Omega\) are disjoint.

Hence

\[ \PP(\EmptySet \cup \Omega) = \PP(\EmptySet) + \PP(\Omega). \]This simplifies to

\[ 1 = \PP(\EmptySet) + 1. \]Hence \(\PP(\EmptySet) = 0\).

(2)

Recall that \(E F^c\) and \(EF\) are disjoint sets with \(E F^c \cup EF = E\).

Hence

\[ \PP(E) = \PP(E F^c) + \PP(E F). \]Hence

\[ \PP(E F^c) = \PP(E) - \PP(EF). \]

(3)

\(E\) and \(E^c\) are disjoint with \(E \cup E^c = \Omega\).

Hence

\[ 1 = \PP(\Omega) = \PP(E \cup E^c) = \PP(E) + \PP(E^c). \]Hence \(\PP(E) = 1 - \PP(E^c)\).

(4)

Recall that we can split \(E \cup F\) into disjoint sets

\[ E \cup F = EF^c \cup EF \cup E^c F. \]By additivity, we have

\[\begin{split} \PP(E \cup F) &= \PP(EF^c \cup (EF \cup E^c F))\\ &= \PP(EF^c) + \PP(EF \cup E^c F)\\ &= \PP(EF^c) + \PP(EF) + \PP(E^c F)\\ &= \PP(E) - \PP(EF) + \PP(EF) + \PP(F) - \PP(EF)\\ &= \PP(E) + \PP(F) - \PP(EF). \end{split}\]

(5)

We have \(F = F E \cup F E^c = E \cup F E^c\).

\(E\) and \(FE^c\) are disjoint.

Then

\[ \PP (F) = \PP(E \cup F E^c) = \PP(E) + \PP(F E^c). \]By nonnegativity, \(\PP(F E^c) \geq 0\).

Hence \(\PP(E) \leq \PP(F)\).

(6)

We have \(\PP(E) = 1 - \PP(E^c)\).

But \(\PP(E^c) \geq 0\).

Hence \(\PP(E) \leq 1\).

(7)

The statement is trivially true for \(n=1\).

The statement is true for \(n=2\) by the additivity rule.

Assume that the statement is true for some \(k \geq 2\).

In other words, for every collection of events \(E_1, \dots, E_k\), such that the events are pairwise disjoint, we have

\[ \PP\left ( \bigcup_{i=1}^k E_i \right) = \sum_{i=1}^k \PP(E_i). \]Let \(E_1, E_2, \dots, E_k, E_{k+1}\) be a collection of \(k+1\) pairwise disjoint events. Define \(E= \bigcup_{i=1}^k E_i\).

We have \(E \cap E_{k+1} = \EmptySet\).

Then

\[\begin{split} \PP\left ( \bigcup_{i=1}^{k+1} E_i \right) &= \PP ( E \cup E_{k + 1}) \\ &= \PP (E) + \PP(E_{k + 1})\\ &= \PP\left ( \bigcup_{i=1}^k E_i \right) + \PP(E_{k + 1})\\ &= \sum_{i=1}^k \PP(E_i) + \PP(E_{k + 1})\\ &= \sum_{i=1}^{k+1} \PP(E_i). \end{split}\]By principle of mathematical induction, the statement is true for every \(n\).

Note

We assign probabilities to events and not to individual outcomes of a random experiment. If the sample space is finite or countable, often it is convenient to assign probabilities to individual outcomes. One should treat this as assignment of probability to the event consisting of a single outcome; a singleton event.

In the following, we shall assume that a probability space \((\Omega, \FFF, \PP)\) has been given and all events are contained in \(\FFF\).

(Union of three events)

Let \(A, B, C\) be three events. Then

Proof. Define \(D = B \cup C\).

Then

\[ \PP(A \cup B \cup C) = \PP(A \cup D) = \PP(A) + \PP(D) - \PP(AD). \]Further

\[ \PP(D) = \PP(B \cup C) = \PP(B) + \PP(C) - \PP(BC). \]Note that \(AD = AB \cup AC\).

Also \(AB \cap AC = ABC\).

Hence

\[ \PP(AD) = \PP(AB \cup AC) = \PP(AB) + \PP(AC) - \PP(ABC). \]Putting these back, we get

\[\begin{split} \PP(A \cup B \cup C) &= \PP(A) + (\PP(B) + \PP(C) - \PP(BC)) - (\PP(AB) + \PP(AC) - \PP(ABC)) \\ &= \PP(A) + \PP(B) + \PP(C) - \PP(AB) - \PP(BC) - \PP(AC) + \PP(ABC). \end{split}\]

### 7.1.4.3. Inclusion-Exclusion Principle#

Theorem 7.8 can be extended to the union of \(n\) events. This is known as the inclusion-exclusion principle.

(Inclusion-exclusion principle)

Let \(A_1, A_2, \dots, A_n\) be \(n\) events in a probability space \((\Omega, \FFF, \PP)\). Then

where \(S_k\) is the sum of the probability of all \(k\)-cardinality intersections among the sets \(A_1, \dots, A_n\). In particular,

In general for every \(k \in 1,\dots,n\), we can write:

It is known as inclusion-exclusion principle since \(S_1\) is included then \(S_2\) is excluded, then \(S_3\) is included and so on.

Proof. The proof is based on mathematical induction.

### 7.1.4.4. Boole’s Inequality#

(Boole’s inequality)

Let \(A_1, A_2, \dots, A_n\) be a finite collection of events. Then we have

Proof. We prove it using induction.

For \(n=1\), obviously

\[ \PP (A_1) \leq \PP (A_1). \]Assume the inequality is true for the set of \(n\) events for some \(n \geq 1\). In other words,

\[ \PP \left ( \bigcup_{i=1}^n A_i \right) \leq \sum_{i=1}^n \PP \left ( A_i \right). \]Since

\[ \PP (A \cup B ) = \PP (A) + \PP(B) - \PP (A \cap B), \]hence

\[ \PP \left ( \bigcup_{i=1}^{n + 1} A_i \right) = \PP \left ( \bigcup_{i=1}^n A_i \right) + \PP (A_{n + 1}) - \PP \left ( \bigcup_{i=1}^n A_i \bigcap A_{n +1} \right ). \]Since

\[ \PP \left ( \bigcup_{i=1}^n A_i \bigcap A_{n +1} \right ) \geq 0, \]hence

\[ \PP \left ( \bigcup_{i=1}^{n + 1} A_i \right) \leq \PP \left ( \bigcup_{i=1}^n A_i \right) + \PP (A_{n + 1}) \leq \sum_{i=1}^{n + 1} \PP \left ( A_i \right). \]

### 7.1.4.5. Countable Additivity#

Often, we need to work with problems where we need to estimate the probability of a countable union of events. The basic axioms of a probability measure are unable to handle this. We need one more axiom that a probability measure must satisfy.

(Fourth axiom: countable additivity)

Let \(E_1, E_2, \dots\) be a (countable) sequence of mutually exclusive events (disjoint sets). Then

## 7.1.5. Joint Probability#

(Joint probability)

Let \(A\) and \(B\) be two different events. Then
the *joint probability* of the events \(A\)
and \(B\) is the probability that the two events
occur together and is given by \(\PP(A B)\).

Similarly, let \(\{A_i \}_{i \in I}\) be a collection
of events indexed by the set \(I\). Then their *joint probability*
is given by

In other words, it is the probability of every event happening together.

## 7.1.6. Conditional Probability#

Conditional probability provides us the means to reason about experiments with partial information. If an event is known to have happened, then the probabilities of other events associated with an experiment change. Here are some examples:

If a cancer test is 90% reliable and it turns positive for a person then the probability of the person having cancer increases dramatically.

One can analyze a corpus of English literature to estimate the probabilities of a letter coming after another. Given that a letter \(t\) has appeared as the first letter of the word, the probability that the next letter will be \(h\) is higher than the general probability of \(h\) being the second letter of a word.

If a die is rolled twice successively and we are told that the sum of two rolls is \(7\), then the probability that the first roll is \(1\) is \(0.5\).

One point to note that conditional probability doesn’t establish any chronological order between events. It merely describes how the probabilities change based on partial information about an experiment.

(Conditional probability)

Let \(A\) and \(B\) be two events.
Assume that \(\PP(A) > 0\).
The *conditional probability* of the event \(B\)
given that the event \(A\) has happened is denoted by
\(\PP(B | A)\). It is defined as

Note that the conditional probability is not defined if \(\PP(A) = 0\).

By definition, we can see that

Consider an experiment of tossing a coin 3 times.

The sample space is

\[ \Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}. \]Assume that all the outcomes are equally likely. Each each outcome has the probability \(\frac{1}{8}\).

Let \(A\) denote the event that the first toss is a head. We have

\[ A = \{ HHH, HHT, HTH, HTT \}. \]We can see that \(\PP(A) = \frac{1}{2}\).

Let \(B\) be the event that more heads than tails come up in the three tosses. We have

\[ B = \{HHH, HHT, HTH, THH \}. \]We can see that \(\PP(B) = \frac{1}{2}\).

If the first outcome is a head, then the probability that more heads than tails come will increase.

Let us first check the event \(A \cap B\). We have

\[ A \cap B = \{ HHH, HHT, HTH \}. \]Hence \(\PP(A B) = \frac{3}{8}\).

Then the probability that more heads than tails come up given that first toss is a head is given by

\[ \PP(B | A) = \frac{3 / 8}{ 1/2} = \frac{3}{4}. \]We can also compute the probability that the first toss is a head given that more heads than tails come up as

\[ \PP(A | B) = \frac{3 / 8}{ 1/2} = \frac{3}{4}. \]

We should verify that the conditional probability as defined above satisfies the axioms of probability.

The conditional probability is a probability measure.

Proof. (Nonnegativity) By definition, it is a ratio of nonnegative quantities. Hence, it is nonnegative.

(Normalization) We can see that

(Additivity)

Let \(B_1\) and \(B_2\) be disjoint events.

Then \(A B_1\) and \(A B_2\) are also disjoint events.

Hence

\[\begin{split} \PP(B_1 \cup B_2 | A) &= \frac{\PP(A (B_1 \cup B_2 ))}{\PP(A)}\\ &= \frac{\PP(A B_1 \cup A B_2 ))}{\PP(A)}\\ &= \frac{\PP(A B_1) + \PP(A B_2 ))}{\PP(A)}\\ &= \frac{\PP(A B_1))}{\PP(A)} + \frac{\PP(A B_2 ))}{\PP(A)}\\ &= \PP(B_1 | A) + \PP(B_2 | A). \end{split}\]

The argument for countable additivity is similar.

We note that

### 7.1.6.1. Properties#

Since \(\PP(B | A)\) is a valid probability measure, all the properties of a probability measure are applicable for the conditional probability also.

(Properties of a conditional probability measure)

Let \((\Omega, \FFF, \PP)\) be a probability space. Let all probabilities be conditioned on an event \(A\). Then the following properties hold:

\(\PP(\EmptySet | A) = 0\).

\(\PP(A | A) = 1\).

\(\PP(E F^c | A) = \PP(E | A) - \PP(EF | A)\).

\(\PP(E | A) = 1 - \PP(E^c | A)\).

\(\PP(E \cup F | A) = \PP(E | A) + \PP(F | A) - \PP(EF | A)\).

If \(E \subseteq F\), then

\[ \PP (E | A)\leq \PP(F | A). \]For any positive integer \(n\), we have

\[ \PP\left ( \bigcup_{i=1}^n E_i | A \right) = \sum_{i=1}^n \PP(E_i | A) \]if \(E_1, E_2, \dots, E_n\) are pairwise disjoint events.

Union of three events

\[\begin{split} \PP(B \cup C \cup D | A ) &= \PP(B | A) + \PP(C | A) + \PP(D | A) \\ &- \PP(BC | A) - \PP(CD | A) - \PP(BD | A) + \PP(BCD | A). \end{split}\]Let \(B_1, B_2, \dots, B_n\) be a finite collection of events. Then we have

\[ \PP \left ( \bigcup_{i=1}^n B_i | A \right) \leq \sum_{i=1}^n \PP \left ( B_i | A \right). \]

The proofs are similar to Theorem 7.7 and other results in the following.

Note

Since \(\PP(A | A) = 1\), one can see that all of the conditional probability is concentrated on the outcomes in \(A\). Thus, we might as well discard the outcomes in \(A^c\) and treat the conditional probabilities as a probability measure on the new sample space \(A\).

### 7.1.6.2. Multiplication Rule#

(Multiplication rule)

Let \(A_1, A_2, \dots, A_n\) be a finite collection of events and let \(A\) be an event which occurs if and only if each of these events occur. In other words,

Then

Proof. We can see that

Draw 4 cards from a deck of 52 cards. What is the probability that none of them is a spade?

Define \(A_i\) as the event that the \(i\)-th card is not a spade.

Our event of interest is \(A = A_1 A_2 A_3 A_4\).

There are 13 cards of the suit spade. There are 39 other cards.

\(\PP(A_1) = \frac{39}{52}\).

Given that \(A_1\) has happened (i.e. the first card is not a spade), we are left with 51 cards out of which 38 are not spade.

Hence \(\PP(A_2 | A_1) = \frac{38}{51}\).

The probability that the third card is not a spade is \(\PP(A_3 | A_1 A_2) = \frac{37}{50}\).

The probability that the fourth card is not a spade is \(\PP(A_4 | A_1 A_2 A_3) = \frac{36}{49}\).

Applying the multiplication rule

\[\begin{split} \PP(A) = \PP(A_1 A_2 A_3 A_4) &= \PP(A_1) \PP(A_2 | A_1) \PP(A_3 | A_1 A_2) \PP(A_4 | A_1 A_2 A_3) \\ &= \frac{39}{52}\frac{38}{51}\frac{37}{50}\frac{36}{49}. \end{split}\]

Another way to calculate this is through counting the number of ways four cards can be chosen.

Total number of ways four cards can be chosen is \(52 \choose 4\).

Total number of ways four cards can be chosen which are not spades are \(39 \choose 4\).

Hence the probability that none of the four cards are spades is

\[ \frac{39 \choose 4}{52 \choose 4} = \frac{39!}{4! 35!}\frac{4! 48!}{52!} = \frac{39 \cdot 38 \cdot 37 \cdot 36}{52 \cdot 51 \cdot 50 \cdot 49}. \]

### 7.1.6.3. Marginal Probability#

(Marginal probability)

Let \(A\) and \(B\) be two events.
The *marginal probability* of the event \(A\)
is the probability \(\PP(A)\) which is not
conditioned on the event \(B\).

### 7.1.6.4. Total Probability Theorem#

(Total probability theorem)

Let \(A_1, \dots, A_n\) be disjoint events that form a partition of the sample space \(\Omega\). Assume that \(\PP(A_i) > 0\) for every \(i\). Then for any event \(B\) we have

Proof. .

Since \(A_1, \dots, A_n\) are disjoint, hence so are \(A_1 B, \dots, A_n B\).

We have

\[ B = A_1 B \cup \dots \cup A_n B. \]Applying additivity axiom,

\[ \PP(B) = \PP(A_1 B) + \dots + \PP(A_n B). \]Applying conditional probability definition, we have

\[ \PP(B) = \PP(A_1)\PP(B | A_1) + \dots + \PP(A_n)\PP(B | A_n). \]

## 7.1.7. Bayes Rule#

The most famous application of total probability theorem is the Bayes rule. It relates the conditional probabilities of the form \(\PP(A | B)\) with conditional probabilities of the form \(\PP(B | A)\) in which the order of conditioning is reversed.

(Bayes rule)

Let \(A_1, \dots, A_n\) be disjoint events that form a partition of the sample space \(\Omega\). Assume that \(\PP(A_i) > 0\) for every \(i\). Let \(B\) be another event such that \(\PP(B) > 0\). Then we have

Proof. .

By definition of conditional probability, we have

\[ \PP (A_i B) = \PP(A_i | B) \PP(B) \]and

\[ \PP (A_i B) = \PP(B | A_i) \PP(A_i). \]Hence

\[ \PP(A_i | B) \PP(B) = \PP(B | A_i) \PP(A_i). \]Dividing both sides with \(\PP(B)\), we get

\[ \PP(A_i | B) = \frac{\PP(A_i) \PP(B | A_i)}{\PP(B)}. \]Expanding \(\PP(B)\) via total probability theorem, we get the desired result.

(Statistical inference)

Bayes’ rule is a key tool in the field of
*statistical inference*.

We conduct an experiment where can observe an

*effect*which may be due to a number of*causes*.The events \(A_1, \dots, A_n\) denote the causes which may not be observable directly.

The event \(B\) is the effect which can be observed.

The probability \(\PP(B | A_i)\) models the relationship between the cause \(A_i\) and the effect \(B\). It represents the likelihood of \(B\) happening given that \(A_i\) has happened.

We are often interested in knowing the probability of \(A_i\) given that \(B\) has been observed.

This is an

*inference*since the events \(A_i\) cannot be observed directly.The probability \(\PP(A_i | B)\) is known as the

*posterior probability*of the event \(A_i\) given that \(B\) has happened.This is distinguished from the unconditional/marginal probability \(\PP(A_i)\) which is the probability of the event \(A_i\) without any information about the event \(B\).

\(\PP(A_i)\) is known as the

*prior*probability of \(A_i\).

## 7.1.8. Independence#

The conditional probability \(\PP(A | B)\) captures the information that occurrence of the event \(B\) provides about the event \(A\). In many cases, it may happen that observing \(B\) provides us no information about the event \(A\). Consequently, the probability \(\PP(A)\) is not altered when conditioning on the event \(B\). Then, we say that the events \(A\) and \(B\) are independent.

(Independence of two events)

Let \(A\) and \(B\) be two events.
We say that \(A\) and \(B\) are *independent* if

Independence is a symmetric property. We can see that if \(A\) is independent of \(B\) then \(B\) is independent of \(A\).

It follows that for independent events

If \(\PP(A) > 0\) then

\[ \PP(B | A) = \PP(B). \]And if \(\PP(B) > 0\) then

\[ \PP(A | B) = \PP(A). \]

Consider the problem of rolling a \(4\)-sided dice twice. Assume that all outcomes are equally likely.

Let \(A\) be the event that the first roll is \(1\).

Let \(B\) be the event that the sum of two rolls is \(5\).

We have \(A = \{(1,1), (1,2), (1,3), (1,4) \}\).

We have \(B = \{(1, 4), (2, 3), (3, 2), (4, 1) \}\).

Number of outcomes is \(16\).

\(\PP(A) = \frac{1}{4}\).

\(\PP(B) = \frac{1}{4}\).

We can see that \(A B = \{(1, 4)\}\).

Then \(\PP(A B) = \frac{1}{16}\).

We can see that \(\PP(A B) = \PP(A) \PP(B)\).

The two events \(A\) and \(B\) are independent.

On the surface, it may seem that there should be dependence between the events \(A\) and \(B\) but it is not so.

Since the outcomes in the event \(B\) are equally likely, we can carefully examine the list of outcomes in the event \(B\) to see that observing the event \(B\) gives us no additional information about whether the first roll will give us \(1\).

Now consider the problem of rolling a \(6\)-sided dice twice with the same events as before.

We have \(A = \{(1,1), (1,2), (1,3), (1,4), (1,5), (1,6) \}\).

We have \(B = \{(1, 4), (2, 3), (3, 2), (4, 1) \}\).

Number of outcomes is \(36\).

\(\PP(A) = \frac{1}{6}\).

\(\PP(B) = \frac{1}{9}\).

We can see that \(A B = \{(1, 4)\}\).

Then \(\PP(A B) = \frac{1}{36}\).

Clearly, \(\PP(A B) \neq \PP(A) \PP(B)\).

The two events are dependent.

In fact \(\PP(A | B) = \frac{1}{4}\) In other words, the probability of \(A\) increases given that \(B\) has been observed.

Looking at the outcomes in \(B\) we can see that observance of \(B\) eliminates the possibility of \(5\) and \(6\) from the first roll.

This leads to the increase in the probability of \(1\) in the first roll.

We can also see that \(\PP(B | A) = \frac{1}{6}\).

The probability of \(B\) given \(A\) has also increased.

### 7.1.8.1. Conditional Independence#

Recall that the conditional probability is a probability measure in itself. This enables us to introduce the notion of conditional independence.

(Conditional independence)

Given an event \(C\), two events \(A\) and \(B\) are
called *conditionally independent* if

Let \(A, B, C\) be three different events. Assume that \(\PP(B C) > 0\). Given that \(C\) has occurred, \(A\) and \(B\) are conditionally independent if and only if

Proof. Note that \(\PP(B C) > 0\) implies that \(\PP(B) > 0\), \(\PP(C) > 0\), \(\PP(B | C) > 0\) and \(\PP(C | B) > 0\).

By conditional probability definition and multiplication rule, we have

\[\begin{split} \PP (A B | C) &= \frac{\PP (A B C)}{\PP(C)}\\ &= \frac{\PP(C) \PP(B | C) \PP (A | B C )}{\PP(C)}\\ &= \PP(B | C) \PP (A | B C ). \end{split}\]Comparing this with the expression of conditional independence, we have

\[ \PP(B | C) \PP (A | B C ) = \PP (A | C) \PP (B | C). \]Eliminating the common term, we have

\[ \PP (A | B C ) = \PP (A | C). \]

It is important to note that

Unconditional independence of \(A\) and \(B\) doesn’t imply conditional independence given \(C\).

Conditional independence of \(A\) and \(B\) given \(C\) doesn’t imply unconditional independence.

Consider tossing a fair coin twice. Assume that all outcomes are equally likely.

\(\Omega = \{ HH, HT, TH, TT \}\).

Let \(A\) be the event that the first toss is a head. \(A = \{ HH, HT \}\).

Let \(B\) be the event that the second toss is a head. \(B = \{TH, HH \}\).

It is easy to see that \(A\) and \(B\) are independent.

Let \(C\) be the event that the two tosses have different results. \(C = \{TH, HT \}\).

\(\PP (A | C) = \frac{1}{2}\).

\(\PP (B | C) = \frac{1}{2}\).

\(\PP (A B | C) = 0\).

Clearly, the two events are not conditionally independent.

Let there be two coins. The first coin is fair while the second coin is unfair with the probability of head being \(0.9\). Assume that a coin can be chosen at random with equal probability and then it is tossed twice (the two tosses being independent).

Let \(C\) be the event that the second (unfair) coin was chosen.

Let \(H_i\) be the event that the \(i\)-th toss resulted in heads.

Given the choice of the coin, the two events \(H_1\) and \(H_2\) are independent.

\(\PP(H_1 | C) = 0.9\).

\(\PP(H_2 | C) = 0.9\).

\(\PP(H_1 H_2 | C) = 0.81\).

However, the two events \(H_1\) and \(H_2\) are not independent unconditionally.

If the first toss is a head, then the probability that the coin is unfair increases. Consequently, the probability that the second toss will also be a head increases.

From total probability theorem, we have

\[ \PP(H_1) = \PP(C)\PP(H_1 | C) + \PP(C^c)\PP(H_1 | C^c) = 0.5 \cdot 0.9 + 0.5 \cdot 0.5 = 0.7. \]Similarly \(\PP(H_2) = 0.7\).

However,

\[ \PP(H_1 H_2) = \PP(C)\PP(H_1 H_2 | C) + \PP(C^c)\PP(H_1 H_2 | C^c) = 0.5 \cdot 0.81 + 0.5 \cdot 0.25 = 0.53. \]\(\PP(H_1) \PP(H_2) = 0.7 \cdot 0.7 = 0.49 \neq \PP(H_1 H_2)\).

### 7.1.8.2. Three Events#

Naturally, we are interested in extending the notion of independence for multiple events. However this is a bit tricky. We start with the simpler case of three events which can explain the issues involved. Independence means that occurrence or non-occurrence of any event or any pair of events should not provide any information about the occurrence of any other event or pair of events.

(Independence of three events)

Let \(A, B\) and \(C\) be three events.
We say that \(A, B\) and \(C\) are *jointly independent* if

The first three conditions establish the
*pairwise independence* of the three sets. However, the
fourth condition is necessary and it doesn’t follow
from the first three conditions.
Conversely, the fourth condition doesn’t imply the first
three conditions.

Let us look at some counter examples.

Consider the experiment of tossing a fair coin twice with equally likely outcomes.

\(\Omega = \{ HH, HT, TH, TT \}\).

Let \(H_i\) be the event that \(i\)-th toss is a head.

Let \(A\) be the event that the two tosses have different results.

\(H_1 = \{ HT, HH \}\).

\(H_2 = \{TH, HH \}\).

\(A = \{HT, TH \}\).

\(A H_1 = \{HT \}\).

\(A H_2 = \{TH \}\).

\(H_1 H_2 = \{ HH \}\).

\(\PP(H_1) = \PP(H_2) = \PP(A) = \frac{1}{2}\).

\(\PP(A H_1) = \PP (A H_2) = \PP (H_1 H_2) = \frac{1}{4}\).

We can see that \(A, H_1, H_2\) are pairwise independent.

However \(A H_1 H_2 = \EmptySet\) since occurrence of \(A\) makes \(H_1\) and \(H_2\) mutually exclusive.

Hence \(\PP (A H_1 H_2) = 0 \neq \PP(A) \PP(H_1) \PP(H_2)\).

Hence \(A, H_1,H_2\) are not independent.

Consider the experiment of tossing a fair dice twice with equally likely outcomes.

The sample space has \(36\) possible outcomes.

Let \(A\) be the event that the first roll has the outcomes \(1\), \(2\), or \(3\).

Let \(B\) be the event that the first roll has the outcomes \(3\), \(4\), or \(5\).

Let \(C\) be the event that the sum of two rolls is \(9\).

We can see that

\[ C = \{ (3,6), (4,5), (5,4), (6,3) \}. \]We have

\[ \PP(A) = \frac{1}{2}, \PP(B) = \frac{1}{2}, \PP(C) = \frac{1}{9}. \]We can see that

\[ A B C = \{ (3,6) \}. \]Hence \(\PP(ABC) = \frac{1}{36}\).

Clearly,

\[ \PP(A B C) = \PP(A) \PP(B) \PP(C). \]Now \(A B\) is the event that the first roll is \(3\).

\(B C = \{ (3,6), (4,5), (5,4) \}\).

\(A C = \{ (3,6)\}\).

Hence, we have

\[ \PP(A B) = \frac{1}{6}, \PP(B C) = \frac{1}{12}, \PP(A C) = \frac{1}{36}. \]None of \(A, B, C\) are pairwise independent.

Hence \(A,B,C\) are not independent.

### 7.1.8.3. \(n\) Events#

We are now ready to generalize the notion of independence to any finite number of events. The key idea is that information about occurrence or non-occurrence of any subset of events should not provide any information about the occurrence or non-occurrence of the remaining events.

\(n\) events)

(Independence ofLet \(A_1, A_2, \dots, A_n\) be \(n\) events contained in \(\FFF\).
We say that \(A_1, A_2, \dots, A_n\) are *jointly independent*
if

for all combinations of indices such that \(1 \leq i_1 < i_2 < \dots < i_k \leq n\).

In other words, \(A_1, A_2, \dots, A_n\) are *independent* if

for every (nonempty) subset \(S\) of \(\{1,2,\dots, n\}\).

### 7.1.8.4. Independent Trials#

(Independent trials)

If an experiment involves a sequence of
independent but identical stages, we say
that we have a sequence of *independent trials*.

(Bernoulli trials)

If every stage in a sequence of independent trials
has only two possible outcomes, we say that
we have a sequence of *Bernoulli trials*.

It is customary to denote the two outcomes of a Bernoulli trial by \(H\) and \(T\) with the probabilities \(p\) and \(q=1-p\) respectively.

(Repeated coin tosses)

Consider tossing a coin \(n\) times as a sequence of Bernoulli trials.

The outcomes of each experiment are \(H\) (head) and \(T\) (tail).

Let \(\PP(H) = p\) and \(\PP(T) = q = 1 - p\).

The outcome of \(n\) tosses can be described as a string of \(n\) letters each of which is \(H\) or \(T\).

There are \(2^n\) possible strings. They form the sample space of the compound experiment.

Let a particular string have \(k\) heads and \(n-k\) tails.

Then the probability of this string is given by

\[ \PP(\zeta_1, \dots, \zeta_n) = \prod_{i=1}^n \PP(\{ \zeta_i \}) = p^k q^{n - k}. \]There are \(n \choose k\) strings which consist of \(k\) heads and \(n-k\) tails.

Each of these strings (singleton events) are mutually exclusive of each other.

Hence, by the additivity axiom, the probability of having \(k\) heads and \(n-k\) tails in \(n\) trials is given by

\[ {n \choose k} p^k q^{n - k}. \]

(Binomial probability)

The probability of \(k\) heads in a sequence of \(n\)
Bernoulli trials, denoted by \(p(k)\),
is known as *binomial probability* and
is given by

where \({n \choose k}\) is a binomial coefficients and \(p\) is the probability of head.

See Enumerative Combinatorics for a quick review of binomial coefficients.

(Sum of binomial probabilities)

Proof. Recall from Theorem 1.43 that

Putting \(a = p\) and \(b=1-p\), we get

## 7.1.9. Compound Experiments#

Often we need to examine the outcomes of different experiments together. Here are some examples:

Tossing a coin and throwing a dice

Tossing a coin twice in succession (first and second tosses are separate experiments)

Two or more experiments together form a compound experiment. Repeated trials are an example of a compound experiment.

(Compound experiment)

Let \(A\) and \(B\) be two different experiments.
Let \(\Omega_1\) be the sample space of \(A\)
and \(\Omega_2\) be the sample space of \(B\).
Then the sample space of the *compound experiment*
is given by the Cartesian product \(\Omega = \Omega_1 \times \Omega_2\).

### 7.1.9.1. Product \(\sigma\)-Algebra#

If \(\FFF_1\) and \(\FFF_2\) are \(\sigma\)-algebras over \(\Omega_1\) and \(\Omega_2\) respectively, it is natural to ask how can we construct a \(\sigma\)-algebra for the compound sample space.

Let \(E_1\) be an event associated with experiment \(A\)
and \(E_2\) be an event associated with experiment \(B\).
Then \(E = E_1 \times E_2\) is a *product event* associated with
the compound experiment. We have

However, we can see that the set

is not a \(\sigma\)-algebra. Naturally, the closest \(\sigma\)-algebra is the \(\sigma\)-algebra generated by this set.

\(\sigma\) algebra)

(ProductLet \((\Omega_1, \FFF_1)\) and \((\Omega_2, \FFF_2)\) be
two different \(\sigma\)-fields. Then the
*product* \(\sigma\)-*algebra* \(\FFF_1 \otimes \FFF_2\)
is the \(\sigma\)-algebra on \(\Omega_1 \times \Omega_2\)
generated by the collection of all product events:

The members of a product \(\sigma\)-algebra are known
as *compound events*.

### 7.1.9.2. Compound Events#

A *compound event* can be written
as a finite (or countable) disjoint union of product events
from the two experiments. A finite union looks like

where \(E_{1,i}\) and \(E_{2, i}\) are events in \(\FFF_1\) and \(\FFF_2\) respectively.

(Compound events as union of product events)

Consider two experiments each of which consists of throwing a die.

The sample space for both experiments is

\[ \Omega_1 = \Omega_2 = \{ 1,2,3,4,5,6 \}. \]There are \(36\) possible outcomes in the compound experiment.

The compound sample space is given by

\[ \Omega = \Omega_1 \times \Omega_2 = \{ (1,1), (1,2), \dots, (1,6), (2,1), \dots, (2,6), \dots, (6,1), \dots, (6,6) \}. \]

### 7.1.9.3. Independent Experiments#

(Independent experiments)

Two experiments are called *independent*
if the outcome of one experiment doesn’t
depend on the (past, present or future) outcomes
of the other experiment.
In that case, for every product event \(E = E_1 \times E_2\),
we can write

Let \(E\) be a compound event of two independent experiments given as a disjoint union of product events. Then the probability measure for the compound event is given by