
PDF Estimation

Main Idea

Fig I: Input Distribution.

A probability density p(x) assigns probability to any interval [a, b] of the input space by integration:

P(x \in [a,b]) = \int_a^b p(x)dx
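As a quick sanity check, here is a small Python sketch (using SciPy and an illustrative standard normal as the density; these choices are just for the example) comparing the integral of p(x) over [a, b] with the corresponding CDF difference:

from scipy.integrate import quad
from scipy.stats import norm

# Illustrative density: standard normal.
a, b = -1.0, 2.0

# P(x in [a, b]) by numerically integrating the density.
prob_integral, _ = quad(norm.pdf, a, b)

# The same probability from the CDF.
prob_cdf = norm.cdf(b) - norm.cdf(a)

print(prob_integral, prob_cdf)  # both are approximately 0.8186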

Likelihood

Given a dataset \mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}, we can fit the parameters \theta by maximizing the log-likelihood of the data:

\underset{\theta}{\text{max}} \sum_i \log p_\theta(x^{(i)})

or equivalently:

\underset{\theta}{\text{min}} \; \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ - \log p_\theta(x) \right]

This is equivalent to minimizing the KL divergence between the empirical data distribution \hat{p}_\text{data}(x) and the model p_\theta:

D_\text{KL}(\hat{p}_\text{data} \| p_\theta) = \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ - \log p_\theta(x) \right] - H(\hat{p}_\text{data})

where \hat{p}_\text{data}(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[x = x^{(i)}]
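Expanding the definition of the KL divergence makes the equivalence explicit: the entropy term H(\hat{p}_\text{data}) does not depend on \theta, so minimizing the KL divergence over \theta is the same as minimizing the expected negative log-likelihood.

D_\text{KL}(\hat{p}_\text{data} \| p_\theta) = \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ \log \hat{p}_\text{data}(x) - \log p_\theta(x) \right] = \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ - \log p_\theta(x) \right] - H(\hat{p}_\text{data})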

Stochastic Gradient Descent

Maximum likelihood is an optimization problem, so we can use stochastic gradient descent (SGD) to solve it. SGD minimizes the expectation of a function f, assuming f is differentiable with respect to \theta:

\argmin_\theta \mathbb{E} \left[ f(\theta) \right]

With maximum likelihood, the optimization problem becomes:

\argmin_\theta \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ - \log p_\theta(x) \right]

We typically use SGD because it scales to large datasets and lets us plug in deep learning architectures and off-the-shelf frameworks.
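As a minimal sketch of SGD on the negative log-likelihood (the single-Gaussian model, synthetic data, and hyperparameters below are placeholders, not part of the original notes):

import torch

# Synthetic 1-D dataset; in practice this would be the training data.
data = torch.randn(1000) * 2.0 + 3.0

# Model parameters theta = (mu, log_sigma), fit by SGD.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([mu, log_sigma], lr=1e-2)

for step in range(2000):
    # Minibatch sampled from the empirical data distribution.
    idx = torch.randint(0, data.shape[0], (100,))
    batch = data[idx]

    # Negative log-likelihood of the batch under N(mu, sigma^2).
    nll = -torch.distributions.Normal(mu, log_sigma.exp()).log_prob(batch).mean()

    optimizer.zero_grad()
    nll.backward()
    optimizer.step()

print(mu.item(), log_sigma.exp().item())  # should approach roughly 3.0 and 2.0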


Example

Mixture of Gaussians

p_\theta(x) = \sum_{i=1}^k \pi_i \mathcal{N}(x ; \mu_i, \sigma_i^2)

where the parameters \theta are the k mixture weights, means, and variances:

\theta = (\pi_1, \cdots, \pi_k, \mu_1, \cdots, \mu_k, \sigma_1, \cdots, \sigma_k)

To sample, we first pick a cluster center according to the mixture weights and then add Gaussian noise around it, as in the sketch below. However, this approach doesn't really work for high-dimensional datasets.
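A small sketch of this sampling procedure (the mixture parameters here are fixed, illustrative values rather than parameters learned from data):

import torch

# Illustrative parameters for k = 3 components.
pi = torch.tensor([0.2, 0.5, 0.3])      # mixture weights
mu = torch.tensor([-4.0, 0.0, 5.0])     # component means
sigma = torch.tensor([1.0, 0.5, 2.0])   # component standard deviations

def sample(n):
    # Step 1: pick a cluster center according to the mixture weights.
    k = torch.distributions.Categorical(pi).sample((n,))
    # Step 2: add Gaussian noise around the chosen center.
    return mu[k] + sigma[k] * torch.randn(n)

def log_prob(x):
    # log p(x) = logsumexp_i [ log pi_i + log N(x; mu_i, sigma_i^2) ]
    per_component = torch.distributions.Normal(mu, sigma).log_prob(x[:, None]) + pi.log()
    return torch.logsumexp(per_component, dim=-1)

x = sample(5)
print(x, log_prob(x))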

Histogram Method

Gotchas

Search Sorted

Numpy
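NumPy ships a built-in np.searchsorted; a small sketch of using it to look up histogram bins (the bin edges and inputs here are made up):

import numpy as np

bin_edges = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # sorted bin boundaries
inputs = np.array([0.1, 0.3, 0.99])

# Index of the bin each input falls into; side="right" so a value equal to
# a left edge lands in the bin to its right.
bin_idx = np.searchsorted(bin_edges, inputs, side="right") - 1
print(bin_idx)  # [0 1 3]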


PyTorch

import torch

def searchsorted(bin_locations, inputs, eps=1e-6):
    # Nudge the last bin edge upward so inputs equal to it still land in the last bin.
    bin_locations[..., -1] += eps
    # Count how many bin edges each input is >= ; subtracting 1 gives the bin index.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1

This is an unofficial implementation; there is still ongoing discussion in the PyTorch community about adding it (see the GitHub issue). For now, we use the version found in several open-source codebases.
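For reference, a possible call to the function above with made-up bin locations (note that it modifies the last bin edge in place by eps):

import torch

# Made-up bin edges shared across a batch of 3 inputs.
bin_locations = torch.linspace(0.0, 1.0, steps=5).expand(3, 5).clone()
inputs = torch.tensor([0.1, 0.3, 0.99])

print(searchsorted(bin_locations, inputs))  # tensor([0, 1, 3])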