We provide an exercise-oriented introduction to Bayes decision theory and maximum likelihood parameter estimation.

Bayes Decision Theory

Bayesian decision theory revolves around making classification decisions based on probabilities and the costs of those decisions. It assumes that we know (or can derive) all the relevant probabilities.

We denote the state of nature as $\omega$, which can take on different classes such as $\omega_1$ for sea bass or $\omega_2$ for salmon. We may also have prior probabilities for each class, $P(\omega_i)$. If we have no prior information about the classes, we can treat each prior as equal: $P(\omega_i) = 1/c$, where $c$ is the number of classes.

Class-conditional densities, or likelihoods, $p(x \vert \omega_i)$, represent the probability density of a feature $x$ given some class $\omega_i$. For example, one such class-conditional density could be the distribution of heights of salmon (e.g., 25% are 5 ft, 75% are 6 ft).

Bayes' rule tells us that the posterior probability is $P(\omega_i \vert x) = p(x \vert \omega_i) P(\omega_i) / p(x)$. This gives us a way to derive the probabilities we need to make decisions. It also tells us that the probability of a certain class given an observed feature is proportional to the likelihood multiplied by the prior. We'll see below that deciding on the class with the maximum posterior minimizes error, so finding the class with the maximum likelihood × prior product is how we make a good decision.

Note that $p(x)$, the evidence, is the sum of the likelihood-prior products over all classes: $p(x) = \sum_{j=1}^{c} p(x \vert \omega_j) P(\omega_j)$. This normalization makes the class posteriors for a given $x$ sum to 1. When comparing classes, this expensive normalization can be ignored, since every posterior for the same feature $x$ shares the same $p(x)$.

We also know that the joint probability of two events, $P(a, b)$, equals the probability of the first event multiplied by the conditional probability of the second given the first: $P(a, b) = P(a) P(b \vert a)$.

Theorem: The Maximum-Posterior Decision Rule Minimizes Error Probability

Suppose we have the following decision rule for binary classification: given an observed feature $x$, decide $\omega_1$ if $P(\omega_1 \vert x) > P(\omega_2 \vert x)$, else decide $\omega_2$. The probability of making an error given $x$ is then $P(\text{error} \vert x) = 1 - P(\omega_{\text{decided}} \vert x)$, i.e., the posterior of the class we did not decide on, which equals $\min[P(\omega_1 \vert x), P(\omega_2 \vert x)]$.

To find the total probability of error under this decision rule, we integrate the joint probability of making an error and observing feature $x$, which equals the probability of error given $x$ weighted by the density of $x$: $P(\text{error}) = \int P(\text{error}, x)\,dx = \int P(\text{error} \vert x)\, p(x)\,dx$. So now we have the probability of error of our decision rule, given that we know or can derive the relevant probabilities (posterior and evidence).

To make a good decision rule, we can decrease the probability of error by minimizing $P(\text{error} \vert x)$ at every feature $x$. Note that we can't modify the evidence $p(x)$, since that is a property of the data, not of our decision rule. Our rule picks the larger posterior and therefore achieves the minimum possible $P(\text{error} \vert x)$ at every $x$, so it is optimal for binary classification in terms of error. Thus we have justified that for binary classification, we should decide on the class with the highest $P(\omega_i \vert x)$, as this minimizes our error.

Note that we can also rewrite our decision rule using Bayes' rule: if $p(x \vert \omega_1) P(\omega_1) > p(x \vert \omega_2) P(\omega_2)$, choose $\omega_1$; else choose $\omega_2$.
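To make the rule concrete, here is a minimal Python sketch (my addition, not from the original text). The Gaussian class-conditional densities, their parameters, and the priors are illustrative placeholders, and `scipy` is assumed to be available:

```python
from scipy.stats import norm

# Hypothetical class-conditional densities for a scalar feature x
# (e.g., fish length). Parameters are made-up illustrative values.
likelihood_w1 = norm(loc=5.0, scale=1.0)   # p(x | w1), e.g. sea bass
likelihood_w2 = norm(loc=6.5, scale=1.5)   # p(x | w2), e.g. salmon
prior_w1, prior_w2 = 0.6, 0.4              # P(w1), P(w2)

def decide(x):
    """Decide the class with the larger posterior P(w_i | x).

    The evidence p(x) is shared by both posteriors, so comparing
    likelihood * prior is equivalent, exactly as derived above.
    """
    score_w1 = likelihood_w1.pdf(x) * prior_w1
    score_w2 = likelihood_w2.pdf(x) * prior_w2
    return "w1" if score_w1 > score_w2 else "w2"

print(decide(5.2))  # near the w1 mean -> "w1"
print(decide(7.0))  # near the w2 mean -> "w2"
```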

Generalization of Bayes Decision Theory with Loss Functions

We can generalize $x$ from a scalar feature to a feature vector without changing any proofs. Also, to cover non-classification problems, instead of error probabilities we can choose something more general to minimize: a loss function $\lambda(\alpha_i \vert \omega_j)$, defined per action $\alpha_i$, assuming the true class is $\omega_j$.

If we take some action $\alpha_i$, such as classifying to a particular class, the expected loss of taking that action, or conditional risk, is $R(\alpha_i \vert x) = \sum_j \lambda(\alpha_i \vert \omega_j) P(\omega_j \vert x)$: the sum over classes of the loss for taking that action when the true class is $\omega_j$, weighted by the posterior probability of $\omega_j$ given the observed feature. The total risk, or expected loss of our decision rule over all actions, is then the integral of the conditional risk of the chosen action weighted by the feature's density: $R = \int R(\alpha(x) \vert x)\, p(x)\,dx$.

Likelihood Ratio: By writing out the conditional risk in terms of the losses $\lambda_{ij} = \lambda(\alpha_i \vert \omega_j)$ and the posteriors, we can obtain an alternative representation of the minimum-risk decision rule algebraically: decide $\omega_1$ if the likelihood ratio $\frac{p(x \vert \omega_1)}{p(x \vert \omega_2)}$ exceeds the misclassification-loss ratio $\frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}}$ multiplied by the prior ratio $\frac{P(\omega_2)}{P(\omega_1)}$. This allows us to interpret the Bayes decision rule as deciding $\omega_1$ whenever the likelihood ratio exceeds a threshold determined by the losses and the priors.
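As a concrete illustration (my addition), the following sketch picks the action minimizing the conditional risk for an arbitrary loss matrix; the loss values and posterior vector are made-up examples:

```python
import numpy as np

# loss[i][j] = lambda(alpha_i | w_j): the loss for taking action i when
# the true class is j. These values are illustrative assumptions.
loss = np.array([[0.0, 2.0],   # deciding w1: free if correct, costly if not
                 [1.0, 0.0]])  # deciding w2: mildly costly if wrong

def min_risk_action(posteriors):
    """Pick the action minimizing R(alpha_i | x) = sum_j loss[i, j] * P(w_j | x)."""
    risks = loss @ posteriors   # conditional risk of each action
    return int(np.argmin(risks)), risks

action, risks = min_risk_action(np.array([0.55, 0.45]))
print(action, risks)  # risks = [0.9, 0.55]: decide w2 even though
                      # P(w1|x) > P(w2|x), since mistaking w2 costs twice as much
```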

Minimum Error Classification

To get a concrete loss function, we can define the zero-one loss: $\lambda(\alpha_i \vert \omega_j) = 0$ if $i = j$ and $1$ otherwise. This loss function assigns no loss to a correct decision and unit loss to any error. The conditional risk is then $R(\alpha_i \vert x) = \sum_j \lambda(\alpha_i \vert \omega_j) P(\omega_j \vert x) = \sum_{j \neq i} P(\omega_j \vert x) = 1 - P(\omega_i \vert x)$, so minimizing risk under zero-one loss is exactly maximizing the posterior.

Likelihood Normalization and Likelihood Ratio Practice Problems

Suppose two equally probable one-dimensional densities are of the form $p(x \vert \omega_i) \propto e^{-\vert x - a_i \vert / b_i}$ for $i = 1, 2$ and $b_i > 0$.

(Problem) Write an analytic expression for each density; that is, normalize each function for arbitrary $a_i$ and positive $b_i$.

To normalize each density, we must find the constant $c_i$ such that $\int_{-\infty}^{\infty} c_i e^{-\vert x - a_i \vert / b_i}\,dx = 1$. We can remove the absolute value by splitting the integral at $x = a_i$ into two pieces: $\int_{-\infty}^{a_i} c_i e^{(x - a_i)/b_i}\,dx$ and $\int_{a_i}^{\infty} c_i e^{-(x - a_i)/b_i}\,dx$. Using the u-substitution $u = (x - a_i)/b_i$, each piece evaluates to $c_i b_i$, so $2 c_i b_i = 1$ and $c_i = \frac{1}{2 b_i}$, giving $p(x \vert \omega_i) = \frac{1}{2 b_i} e^{-\vert x - a_i \vert / b_i}$.

categories: likelihood, absolute value splitting, u-substitution
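As a quick numerical sanity check (my addition, not part of the original solution), we can integrate the normalized density for a few arbitrary parameter choices, assuming `scipy` is available:

```python
import numpy as np
from scipy.integrate import quad

def density(x, a, b):
    """p(x | w_i) = 1/(2b) * exp(-|x - a| / b), the normalized form derived above."""
    return np.exp(-abs(x - a) / b) / (2 * b)

for a, b in [(0.0, 1.0), (1.0, 2.0), (-3.0, 0.5)]:
    total, _ = quad(density, -np.inf, np.inf, args=(a, b))
    print(f"a={a}, b={b}: integral = {total:.6f}")  # ~1.0 in every case
```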

(Problem) Calculate the likelihood ratio as a function of your four variables.

We simply take the ratio of the normalized likelihoods obtained above for $i = 1$ and $i = 2$. After simple algebraic manipulation of the exponentials and fractions, we obtain $\frac{p(x \vert \omega_1)}{p(x \vert \omega_2)} = \frac{b_2}{b_1}\, e^{\left(\vert x - a_2 \vert\, b_1 - \vert x - a_1 \vert\, b_2\right) / (b_1 b_2)}$.

categories: likelihood ratio, algebra

(Problem) Sketch a graph of the likelihood ratio $p(x \vert \omega_1)/p(x \vert \omega_2)$ for the case $a_1 = 0$, $b_1 = 1$, $a_2 = 1$, and $b_2 = 2$.

First we substitute the given values into our expression above, yielding $2 e^{(\vert x - 1 \vert - 2\vert x \vert)/2}$. To sketch the graph, rather than plugging in points manually, we can split the function into a piecewise function because of the absolute values. The breakpoints are at $x = 0$ and $x = 1$, giving three partitions: $(-\infty, 0)$, $[0, 1]$, and $(1, \infty)$. Evaluating at the breakpoints gives $2e^{1/2}$ at $x = 0$ and $2/e$ at $x = 1$. Each piece is a transformed exponential ($2e^{(1+x)/2}$, $2e^{(1-3x)/2}$, and $2e^{-(1+x)/2}$, respectively), so the main thing to get right in the sketch is the slope of each piece, which depends on the sign of the exponent.

categories: piecewise function, absolute value splitting
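The sketch can also be checked numerically. Here is a short sketch (my addition) that plots the ratio and verifies the two breakpoint values, assuming `numpy` and `matplotlib` are available:

```python
import numpy as np
import matplotlib.pyplot as plt

a1, b1, a2, b2 = 0.0, 1.0, 1.0, 2.0

def likelihood_ratio(x):
    """(b2/b1) * exp((|x - a2| b1 - |x - a1| b2) / (b1 b2)), from above."""
    return (b2 / b1) * np.exp((np.abs(x - a2) * b1 - np.abs(x - a1) * b2) / (b1 * b2))

x = np.linspace(-3, 4, 400)
plt.plot(x, likelihood_ratio(x))
plt.axvline(0, ls="--")  # breakpoints of the piecewise form
plt.axvline(1, ls="--")
plt.xlabel("x")
plt.ylabel("p(x|w1) / p(x|w2)")
plt.show()

print(likelihood_ratio(0.0), 2 * np.exp(0.5))  # both ~3.297
print(likelihood_ratio(1.0), 2 / np.e)         # both ~0.736
```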

Position of Minimum Error Decision Boundary and Posterior Behavior

Let the conditional densities for a two-category one-dimensional problem be given by the Cauchy distribution: $p(x \vert \omega_i) = \frac{1}{\pi b} \cdot \frac{1}{1 + \left(\frac{x - a_i}{b}\right)^2}$, where $i = 1, 2$.

(Problem) Check that the likelihoods are indeed normalized.

We must check that each conditional density integrates to one. Performing the u-substitution $u = (x - a_i)/b$ gives $dx = b\,du$, cancelling the $1/b$ factor and leaving $\frac{1}{\pi} \int_{-\infty}^{\infty} \frac{du}{1 + u^2}$. Since the integrand is symmetric, this equals $\frac{2}{\pi} \int_{0}^{\infty} \frac{du}{1 + u^2}$. We then substitute $u = \tan(z)$, so $du = \sec^2(z)\,dz$, and the limits change from $(0, \infty)$ to $(0, \pi/2)$ since $\arctan(\infty) = \pi/2$. The integrand $\frac{\sec^2(z)}{1 + \tan^2(z)}$ simplifies trigonometrically to 1, so the integral is $\frac{2}{\pi} \cdot \frac{\pi}{2} = 1$, confirming the likelihoods are indeed normalized.

categories: u-substitution

(Problem) Assuming $P(\omega_1) = P(\omega_2)$, show that $P(\omega_1 \vert x) = P(\omega_2 \vert x)$ if $x = (a_1 + a_2)/2$; that is, the minimum error decision boundary is a point midway between the peaks of the two distributions, regardless of $b$.

Since the priors are equal (and the evidence is shared), Bayes' rule tells us the posteriors are equal exactly when the likelihoods are equal. Equating the two Cauchy likelihoods above requires $(x - a_1)^2 = (x - a_2)^2$, i.e., $\vert x - a_1 \vert = \vert x - a_2 \vert$, which (for $a_1 \neq a_2$) holds only at the midpoint $x = (a_1 + a_2)/2$, regardless of $b$.

categories: bayes rule, likelihood, decision boundary
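A small numerical illustration (my addition): evaluating the two Cauchy likelihoods at the midpoint for arbitrary made-up values of $a_1$, $a_2$, and $b$ shows they coincide there, and with equal priors, equal likelihoods mean equal posteriors:

```python
import numpy as np

def cauchy(x, a, b):
    """p(x | w_i) for the Cauchy density defined above."""
    return 1.0 / (np.pi * b * (1.0 + ((x - a) / b) ** 2))

a1, a2, b = -2.0, 3.0, 1.7   # arbitrary illustrative values
mid = (a1 + a2) / 2.0
print(cauchy(mid, a1, b), cauchy(mid, a2, b))  # identical, regardless of b
```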

(Problem) How do $P(\omega_1 \vert x)$ and $P(\omega_2 \vert x)$ behave as $x \to -\infty$ and $x \to +\infty$?

As $x \to \pm\infty$, both likelihoods tend to 0 at the same rate, so their ratio tends to 1. When comparing the two posteriors, we can ignore the evidence since it is shared. With the likelihoods equal in the limit, the ratio of the posteriors tends to the ratio of the priors. Since the priors are equal in this problem, the posteriors become equal, and each must approach $1/2$. We can also solve this by explicitly writing out the posterior formula and taking the limit.

categories: likelihood, posterior formulation

Bayes Decision Boundary Problem for Gaussian Discriminant

Consider a two-category classification problem in two dimensions with $p(x \vert \omega_1) \sim N(\mathbf{0}, I)$, $p(x \vert \omega_2) \sim N\!\left(\begin{pmatrix}1\\1\end{pmatrix}, I\right)$, and $P(\omega_1) = P(\omega_2) = 1/2$.

(Problem) Calculate the Bayes decision boundary.

Gaussian Discriminant Function: The decision boundary is the set of points where the discriminant functions are equal ($g_1(x) = g_2(x)$). To classify with Gaussian likelihoods, we use the discriminant function for the Gaussian case; with equal priors and shared covariance $I$, the shared terms drop out and $g_i(x) = -\frac{1}{2}\Vert x - \mu_i \Vert^2$. Equating $g_1$ and $g_2$ gives the perpendicular bisector of the segment between the means: here, $x_1 + x_2 = 1$.

categories: gaussian discriminant
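As a sketch of the verification (assuming the second mean is indeed $(1, 1)^t$ as reconstructed above), we can check that points on the line $x_1 + x_2 = 1$ yield equal likelihoods under the two Gaussians, and hence equal posteriors given the equal priors:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.zeros(2), np.ones(2)   # assumes the second mean is (1, 1)^t
rv1 = multivariate_normal(mu1, np.eye(2))
rv2 = multivariate_normal(mu2, np.eye(2))

# Points on the candidate boundary x1 + x2 = 1 are equidistant from the
# two means, so the two densities agree there.
for x in [np.array([0.5, 0.5]), np.array([2.0, -1.0]), np.array([-1.0, 2.0])]:
    print(rv1.pdf(x), rv2.pdf(x))
```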

Decision Rule for Classes with Opposite Likelihoods and Independent Binary Features

Let the components of the vector $x = (x_1, \ldots, x_d)^t$ be independent and binary-valued (0 or 1), with $d$ odd, $p_{i1} = p(x_i = 1 \vert \omega_1) = p > 1/2$ and $p_{i2} = p(x_i = 1 \vert \omega_2) = 1 - p$ for $i = 1, \ldots, d$, and $P(\omega_1) = P(\omega_2) = 1/2$.

(Problem) Show that the minimum-error-rate decision rule becomes: decide $\omega_1$ if $\sum_{i=1}^{d} x_i > d/2$, and $\omega_2$ otherwise.

Plug the known likelihoods into the discriminant function for classes with independent binary features: $g(x) = \sum_{i=1}^{d} x_i \ln\frac{p_{i1}(1 - p_{i2})}{p_{i2}(1 - p_{i1})} + \sum_{i=1}^{d} \ln\frac{1 - p_{i1}}{1 - p_{i2}} + \ln\frac{P(\omega_1)}{P(\omega_2)}$. Here each weight is $2\ln\frac{p}{1-p}$, the bias is $d\ln\frac{1-p}{p}$, and the prior term vanishes, so $g(x) > 0$ reduces (since $\ln\frac{p}{1-p} > 0$ for $p > 1/2$) to $\sum_{i=1}^{d} x_i > d/2$.

categories: discriminant functions, independent binary features
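A brief simulation (my addition) confirms that the majority-vote rule $\sum_i x_i > d/2$ agrees with the exact likelihood comparison; the values of $d$ and $p$ below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 7, 0.8   # d odd, p > 1/2 (illustrative values)

def log_likelihood(x, q):
    """log p(x | class) for independent Bernoulli features with P(x_i = 1) = q."""
    return np.sum(x * np.log(q) + (1 - x) * np.log(1 - q))

for _ in range(5):
    x = rng.integers(0, 2, size=d)
    rule = "w1" if x.sum() > d / 2 else "w2"
    exact = "w1" if log_likelihood(x, p) > log_likelihood(x, 1 - p) else "w2"
    print(x, rule, exact)  # the two decisions always agree
```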

Explicit Posterior Calculation for Classification on Gaussian Classes

Suppose we have three categories in two dimensions with the following underlying distributions:

$p(x \vert \omega_1) \sim N(\mathbf{0}, I)$, $\quad p(x \vert \omega_2) \sim N\!\left(\begin{pmatrix}1\\1\end{pmatrix}, I\right)$, $\quad p(x \vert \omega_3) \sim \frac{1}{2} N\!\left(\begin{pmatrix}0.5\\0.5\end{pmatrix}, I\right) + \frac{1}{2} N\!\left(\begin{pmatrix}-0.5\\0.5\end{pmatrix}, I\right)$, with $P(\omega_i) = 1/3$ for $i = 1, 2, 3$.

(Problem) By explicit calculation of posteriors, classify the point $x = (0.3, 0.3)^t$ for minimum probability of error.

Gaussian Likelihood/Posterior: Plug the point into each class's Gaussian (or mixture-of-Gaussians) likelihood and multiply by the prior to get each posterior, up to the shared evidence. Pick the class with the highest posterior to minimize error: $\omega_1$.
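Here is a minimal sketch of the explicit calculation (my addition), assuming the three distributions as reconstructed above and `scipy` available:

```python
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([0.3, 0.3])
I = np.eye(2)

# Likelihoods under the three class models given above.
p1 = multivariate_normal([0.0, 0.0], I).pdf(x)
p2 = multivariate_normal([1.0, 1.0], I).pdf(x)
p3 = 0.5 * multivariate_normal([0.5, 0.5], I).pdf(x) \
   + 0.5 * multivariate_normal([-0.5, 0.5], I).pdf(x)

# Equal priors of 1/3 cancel, so normalizing the likelihoods gives the posteriors.
likelihoods = np.array([p1, p2, p3])
posteriors = likelihoods / likelihoods.sum()
print(posteriors)  # the largest entry corresponds to w1
```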

(Problem) Suppose that for a particular test point the first feature is missing; that is, classify $x = (*, 0.3)^t$.

Recalculate the posteriors, except marginalize out the missing value by integrating each likelihood over $x_1$ from $-\infty$ to $\infty$: $p(x_2 \vert \omega_i) = \int_{-\infty}^{\infty} p(x_1, x_2 \vert \omega_i)\,dx_1$. For Gaussians with identity covariance, this just leaves the one-dimensional marginal density in $x_2$.

Maximum Likelihood Estimation Practice Problems

Let $x$ have a uniform density $p(x \vert \theta) \sim U(0, \theta)$, i.e., $p(x \vert \theta) = 1/\theta$ for $0 \le x \le \theta$ and $0$ otherwise.

(Problem) Suppose that $n$ samples $D = \{x_1, \ldots, x_n\}$ are drawn independently according to $p(x \vert \theta)$. Show that the maximum-likelihood estimate for $\theta$ is $\max[D]$, that is, the value of the maximum element in $D$.

Maximum-Likelihood: $p(D \vert \theta) = \prod_{k=1}^{n} p(x_k \vert \theta)$. Each factor is $1/\theta$ if $0 \le x_k \le \theta$ and $0$ otherwise, so the likelihood is $1/\theta^n$ when $\theta \ge \max[D]$ and $0$ when $\theta < \max[D]$. Since $1/\theta^n$ is strictly decreasing in $\theta$, the likelihood is maximized at the smallest feasible value, $\hat{\theta} = \max[D]$.
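To see the result empirically (my addition), we can draw samples from a uniform density with a known $\theta$ and scan the log-likelihood over a grid; `theta_true` is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(42)
theta_true = 3.0
D = rng.uniform(0, theta_true, size=50)

# The likelihood is (1/theta)^n for theta >= max(D) and 0 otherwise, so it
# is maximized at the smallest feasible value: theta_hat = max(D).
thetas = np.linspace(0.1, 5.0, 500)
log_lik = np.where(thetas >= D.max(), -len(D) * np.log(thetas), -np.inf)
print("argmax over grid:", thetas[np.argmax(log_lik)])
print("max(D):          ", D.max())
```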

MLE with Poor Models

(Problem) Show that if our model is poor, the maximum likelihood classifier we derive is not the best, even among our (poor) model set, by exploring the following example. Suppose we have two equally probable categories (i.e., $P(\omega_1) = P(\omega_2) = 0.5$). Furthermore, we know that $p(x \vert \omega_1) \sim N(0, 1)$ but assume that $p(x \vert \omega_2) \sim N(\mu, 1)$. (That is, the parameter $\theta$ we seek by maximum likelihood is the mean of the second distribution.) Imagine, however, that the true underlying distribution is $p(x \vert \omega_2) \sim N(1, 10^6)$.

(Problem) What is the value of our maximum-likelihood estimate $\hat{\mu}$ in our poor model, given a large amount of data?

For a Gaussian model with known variance 1, the MLE of the mean is the sample average: $\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k$. Given a large amount of data drawn from the true distribution $N(1, 10^6)$, the sample average converges to the true mean, so $\hat{\mu} = 1$.

(Problem) What is the decision boundary arising from this maximum likelihood estimate in the poor model?

Gaussian Probability Density: Equate the two modeled Gaussian densities, $N(0, 1)$ and $N(\hat{\mu}, 1) = N(1, 1)$. With equal priors and equal variances, the boundary is the midpoint of the means: $x = 1/2$.

(Problem) Ignore for the moment the maximum likelihood approach, and use the method from Chapter 2 to derive the Bayes optimal decision boundary given the true underlying distributions: $p(x \vert \omega_1) \sim N(0, 1)$ and $p(x \vert \omega_2) \sim N(1, 10^6)$. Be careful to include all portions of the decision boundary.

Equate the Gaussian densities using the true variances: $\frac{1}{\sqrt{2\pi}} e^{-x^2/2} = \frac{1}{\sqrt{2\pi \cdot 10^6}} e^{-(x - 1)^2 / (2 \cdot 10^6)}$. Taking logs gives a quadratic in $x$, so the boundary has two points: the narrow $\omega_1$ Gaussian wins only on an interval around 0, roughly $x \approx \pm\sqrt{2 \ln 1000} \approx \pm 3.72$.
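As a numerical check (my addition), solving the log-density difference for its roots with `scipy.optimize.brentq` recovers the two crossing points near $\pm 3.72$:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Log-density difference between the true class models N(0, 1) and N(1, 10^6).
# Note scipy's scale parameter is the standard deviation, sqrt(10^6) = 1000.
f = lambda x: norm(0, 1).logpdf(x) - norm(1, 1000).logpdf(x)

# f > 0 near x = 0 (the narrow Gaussian dominates) and f < 0 far away,
# so there are two boundary points, roughly +/- sqrt(2 ln 1000).
print(brentq(f, -10, 0))  # ~ -3.71
print(brentq(f, 0, 10))   # ~ +3.72
```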

(Problem) Now consider again classifiers based on the (poor) model assumption $p(x \vert ω2) ∼ N (µ, 1)$. Using your result immediately above, find a new value of µ that will give lower error than the maximum-likelihood classifier.

We know our model is bad (the variance is wrong, so MLE gave a value that maximizes likelihood under the model but doesn't minimize the real-world error), but now we know the decision boundary we want, so we can pick a mean that moves this bad model's boundary there. We found that the poor model's boundary sits at the average of the two means, so we can set $\mu = 2 \times$ the ideal decision boundary found above (e.g., $\mu \approx 7.43$ to place the boundary near $x \approx 3.72$).
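As a rough Monte Carlo sketch (my addition) of why this helps, we can compare the empirical error of the poor model's classifier at the ML estimate $\mu = 1$ versus $\mu \approx 7.43$, sampling from the true distributions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 200_000

# Samples from the true class-conditional distributions (equal priors).
x1 = rng.normal(0, 1, n)      # true p(x | w1) = N(0, 1)
x2 = rng.normal(1, 1000, n)   # true p(x | w2) = N(1, 10^6)

def error_rate(mu):
    """Empirical error of the poor model N(0,1)-vs-N(mu,1) with equal priors."""
    decide_w1 = lambda x: norm(0, 1).pdf(x) > norm(mu, 1).pdf(x)
    err1 = np.mean(~decide_w1(x1))   # true w1 samples classified as w2
    err2 = np.mean(decide_w1(x2))    # true w2 samples classified as w1
    return 0.5 * (err1 + err2)

print(error_rate(1.0))    # ML estimate: boundary at x = 0.5
print(error_rate(7.43))   # boundary near 3.72: noticeably lower error
```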

Resources and Acknowledgments

Several exercises are taken from or inspired by Pattern Classification, Duda & Hart.

Further Readings

MIT MLE Estimates

This short reading is a well-written tutorial from an undergraduate MIT statistics class on using MLE for a variety of simple statistical problems. Useful for practicing recognizing where MLE can (and should) be applied.