Python: Advanced Guide to Artificial Intelligence

MLE and MAP learning

Let's suppose we have a data generating process pdata used to draw a dataset X:
X = {x_1, x_2, ..., x_N}, with x_i drawn from pdata
In many statistical learning tasks, our goal is to find the optimal parameter set θ according to a maximization criterion. The most common approach is based on the maximization of the likelihood and is called Maximum Likelihood Estimation (MLE). In this case, the optimal set θ is found as follows:
θ_MLE = argmax_θ L(θ|X) = argmax_θ p(X|θ)
This approach has the advantage of being unbiased by wrong preconditions but, at the same time, it excludes any possibility of incorporating prior knowledge into the model. It simply looks for the θ that maximizes p(X|θ) over the whole parameter space. Even if this approach is almost unbiased, there's a higher probability of finding a sub-optimal solution that can be quite different from a reasonable (even if not certain) prior. After all, several models are too complex to allow us to define a suitable prior probability (think, for example, of reinforcement learning strategies where there's a huge number of complex states); in these cases, MLE offers the most reliable solution. Moreover, it's possible to prove that the MLE of a parameter θ converges in probability to the real value:
P(|θ_MLE - θ| > ε) → 0 as N → ∞, for every ε > 0
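
As a minimal sketch of both the maximization criterion and this convergence property (the Gaussian pdata, its parameters, and the sample sizes are arbitrary assumptions made only for illustration), the following snippet draws increasingly large datasets and shows the ML estimate of the mean approaching the true value:

import numpy as np

# Hypothetical data generating process pdata: a Gaussian whose mean plays
# the role of the parameter theta we want to estimate
true_mean, true_std = 2.5, 1.0
rng = np.random.RandomState(1000)

for n in (10, 100, 1000, 100000):
    X = rng.normal(true_mean, true_std, size=n)  # dataset X drawn from pdata
    # For a Gaussian with known variance, the MLE of the mean is the sample mean
    # (it maximizes p(X|theta) = prod_i N(x_i | theta, sigma^2))
    theta_mle = X.mean()
    print('n={:>6d}  theta_MLE={:.4f}  |error|={:.4f}'.format(
        n, theta_mle, abs(theta_mle - true_mean)))
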
On the other hand, if we consider Bayes' theorem, we can derive the following relation:
p(θ|X) = p(X|θ) p(θ) / p(X) ∝ p(X|θ) p(θ)
The posterior probability, p(θ|X), is obtained using both the likelihood and a prior probability, p(θ), and hence takes into account the existing knowledge encoded in p(θ). The choice to maximize p(θ|X) is called the Maximum A Posteriori (MAP) approach, and it's often a good alternative to MLE when it's possible to formulate trustworthy priors or when, as in the case of Latent Dirichlet Allocation (LDA), the model is purposely built on specific prior assumptions.
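
The following sketch makes this relation concrete with a toy Gaussian example that is not part of the text (the data, the informative prior, and the grid are arbitrary assumptions): the posterior is approximated on a grid of θ values as likelihood × prior, and the MAP estimate is compared with the MLE:

import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(1000)

# Hypothetical setting: estimate the mean theta of a Gaussian with known std=1.0
X = rng.normal(1.8, 1.0, size=20)

theta_grid = np.linspace(-2.0, 5.0, 2001)

# Log-likelihood log p(X|theta) evaluated on the grid
log_likelihood = np.array([norm.logpdf(X, loc=t, scale=1.0).sum() for t in theta_grid])

# Informative (and possibly wrong) prior p(theta), centered at 0
log_prior = norm.logpdf(theta_grid, loc=0.0, scale=0.5)

# Unnormalized log-posterior: log p(theta|X) = log p(X|theta) + log p(theta) + const
log_posterior = log_likelihood + log_prior

print('theta_MLE = {:.3f}'.format(theta_grid[np.argmax(log_likelihood)]))
print('theta_MAP = {:.3f}'.format(theta_grid[np.argmax(log_posterior)]))  # pulled toward the prior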

Unfortunately, a wrong or incomplete prior distribution can bias the model, leading to unacceptable results. For this reason, MLE is often the default choice even when it's possible to formulate reasonable assumptions about the structure of p(θ). To understand the impact of a prior on an estimate, let's suppose we have observed n=1000 binomially distributed experiments (θ corresponds to the parameter p) and that k=800 of them had a successful outcome. The likelihood is as follows:
L(θ|X) = p(X|θ) = θ^k (1-θ)^(n-k) = θ^800 (1-θ)^200
For simplicity, let's compute the log-likelihood:
log L(θ|X) = k log θ + (n-k) log(1-θ) = 800 log θ + 200 log(1-θ)
If we compute the derivative with respect to θ and set it equal to zero, we get the following:
d/dθ log L(θ|X) = k/θ - (n-k)/(1-θ) = 0  →  θ = k/n = 800/1000 = 0.8
So the MLE for θ is 0.8, which is coherent with the observations (we can say that, after observing 1000 experiments with 800 successful outcomes, p(X|Success)=0.8). If we only had the data X, we could say that a success is more likely than a failure, because 800 out of 1000 experiments were positive.
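
The same result can be verified numerically. This sketch simply minimizes the negative log-likelihood with SciPy's bounded scalar minimizer (one possible tool among many) and recovers θ = k/n = 0.8:

import numpy as np
from scipy.optimize import minimize_scalar

n, k = 1000, 800

def negative_log_likelihood(theta):
    # log L(theta) = k*log(theta) + (n-k)*log(1-theta), up to an additive constant
    return -(k * np.log(theta) + (n - k) * np.log(1.0 - theta))

result = minimize_scalar(negative_log_likelihood, bounds=(0.001, 0.999), method='bounded')
print('theta_MLE = {:.4f}'.format(result.x))  # approximately 0.8 = k/n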

However, after this simple exercise, an expert might tell us that, considering the largest possible population, the marginal probability is p(Success)=0.001 (Bernoulli distributed, with p(Failure) = 1 - p(Success)), so our sample is not representative. If we trust the expert, we need to compute the posterior probability using Bayes' theorem:
p(Success|X) ∝ p(X|Success) p(Success) = 0.8 · 0.001 = 0.0008
Surprisingly, the posterior probability is very close to zero and we should reject our initial hypothesis! At this point, there are two options: if we want to build a model based only on our data, the MLE is the only reasonable choice because, considering the posterior, we would have to accept that we have a very poor dataset (this probably indicates a bias introduced when drawing the samples from the data generating process pdata).
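
These numbers are easy to reproduce. The following sketch treats Success and Failure as the two competing hypotheses, with the complementary likelihoods 0.8 and 0.2 (the normalization step is an addition, shown only to confirm that the posterior stays close to zero even after dividing by the evidence):

# Expert's prior and the likelihoods estimated from the 1000 observed experiments
p_success_prior = 0.001
p_failure_prior = 1.0 - p_success_prior
likelihood_success = 0.8   # p(X|Success), the MLE computed above
likelihood_failure = 0.2   # p(X|Failure)

unnormalized = likelihood_success * p_success_prior
evidence = likelihood_success * p_success_prior + likelihood_failure * p_failure_prior

print('Unnormalized posterior:', unnormalized)             # 0.0008
print('Normalized posterior:  ', unnormalized / evidence)  # about 0.004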

On the other hand, if we really trust the expert, we have a few options for managing the problem:

  • Checking the sampling process in order to assess its quality (we may discover that better sampling leads to a much lower k value)
  • Increasing the number of samples
  • Computing the MAP estimation of θ (a minimal sketch follows this list)
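
As a sketch of the last option, we can assume a conjugate Beta prior for θ (the Beta distribution and its parameters are assumptions made here, not choices taken from the text, picked because they make the MAP estimate available in closed form). A prior whose mode is the expert's value of 0.001 pulls the estimate far away from the MLE of 0.8:

n, k = 1000, 800

# Hypothetical Beta(a, b) prior encoding the expert's belief: its mode is
# (a-1)/(a+b-2) = 0.001 (the strength of the prior is an arbitrary choice)
a, b = 2.0, 1000.0

# With a Beta prior and a binomial likelihood, the posterior is Beta(a + k, b + n - k),
# and its mode (the MAP estimate) is (a + k - 1) / (a + b + n - 2)
theta_map = (a + k - 1.0) / (a + b + n - 2.0)
theta_mle = k / float(n)

print('theta_MLE = {:.4f}'.format(theta_mle))  # 0.8
print('theta_MAP = {:.4f}'.format(theta_map))  # about 0.4, pulled toward the prior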

I suggest that the reader try both approaches with simple models, in order to compare the relative accuracies. In this book, we're always going to adopt the MLE when it's necessary to estimate the parameters of a model with a statistical approach. This choice is based on the assumption that our datasets are correctly sampled from pdata. If this is not possible (think about an image classifier that must distinguish between horses, dogs, and cats, built with a dataset containing pictures of 500 horses, 500 dogs, and only 5 cats), we should expand the dataset or use data augmentation techniques to create artificial samples.
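
As a minimal sketch of the last suggestion (the specific transformations are arbitrary assumptions; real pipelines usually rely on dedicated augmentation utilities), artificial samples can be obtained by applying simple label-preserving transformations to the existing images:

import numpy as np

def augment(image):
    # image: a (height, width, channels) array; returns a few label-preserving variants
    variants = [image,
                np.fliplr(image),                 # horizontal flip
                np.roll(image, shift=4, axis=0),  # small vertical shift
                np.roll(image, shift=-4, axis=1)] # small horizontal shift
    return variants

# Usage with a dummy image (e.g., each of the 5 cat pictures could be expanded this way)
dummy = np.random.RandomState(1000).uniform(size=(32, 32, 3))
augmented = augment(dummy)
print(len(augmented), 'samples generated from 1 original image')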