PDF Estimation
Main Idea
Fig I: Input Distribution.
Likelihood
Given a dataset \mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}, we can find the parameters \theta by solving the maximum likelihood optimization problem:

\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(x^{(i)})

or equivalently:

\theta^* = \arg\max_\theta \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[\log p_\theta(x)\right]

This is equivalent to minimizing the KL-divergence between the empirical data distribution \hat{p}_\text{data}(x) and the model p_\theta:

\min_\theta \; \text{KL}\left(\hat{p}_\text{data} \,\|\, p_\theta\right)

where \hat{p}_\text{data}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[x = x^{(i)}]
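A quick way to see this equivalence is to expand the KL divergence; the first term is the (negative) entropy of the data distribution and does not depend on \theta:

\text{KL}\left(\hat{p}_\text{data} \,\|\, p_\theta\right) = \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[\log \hat{p}_\text{data}(x)\right] - \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[\log p_\theta(x)\right]

so minimizing the KL divergence over \theta is the same as maximizing the expected log-likelihood \mathbb{E}_{x \sim \hat{p}_\text{data}}[\log p_\theta(x)].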
Stochastic Gradient Descent
Maximum likelihood is an optimization problem, so we can use stochastic gradient descent (SGD) to solve it. SGD solves problems of the form

\min_\theta \; \mathbb{E}\left[f(\theta)\right]

assuming f is a differentiable function of \theta.

With maximum likelihood, the optimization problem becomes:

\min_\theta \; \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[-\log p_\theta(x)\right]

We typically use SGD because it scales to large datasets and lets us use standard deep learning architectures and convenient software packages.
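As an illustration (not from the original text), here is a minimal PyTorch sketch that fits a toy 1-D Gaussian model by running SGD on a minibatch estimate of \mathbb{E}_{x \sim \hat{p}_\text{data}}[-\log p_\theta(x)]; the dataset, learning rate, and step count are arbitrary choices:

```python
import torch

# Toy dataset (assumed for illustration): samples with mean 3 and std 2.
data = torch.randn(10_000) * 2.0 + 3.0

# Model p_theta(x) = N(mu, sigma^2) with learnable mu and log(sigma).
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.SGD([mu, log_sigma], lr=0.05)

for step in range(2000):
    idx = torch.randint(0, data.shape[0], (128,))        # draw a minibatch
    x = data[idx]
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    loss = -dist.log_prob(x).mean()                      # estimate of E[-log p_theta(x)]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(mu.item(), log_sigma.exp().item())  # should approach 3 and 2
```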
Example
Mixture of Gaussians
A mixture of k Gaussians models the density as

p_\theta(x) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(x; \mu_j, \sigma_j^2)

where the parameters \theta are the k means, variances, and mixture weights. To sample, we pick a cluster center according to the mixture weights and then add Gaussian noise. However, this doesn't really work for high-dimensional datasets.
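To make this concrete, here is a small sketch (PyTorch; the weights, means, and standard deviations are made-up values) of the 1-D mixture density and the pick-a-center-then-add-noise sampling procedure:

```python
import torch

# Illustrative parameters (made up): k = 3 components in 1-D.
weights = torch.tensor([0.2, 0.5, 0.3])    # mixture weights pi_j (sum to 1)
means = torch.tensor([-2.0, 0.0, 3.0])     # component means mu_j
stds = torch.tensor([0.5, 1.0, 0.8])       # component standard deviations sigma_j

def mog_log_prob(x):
    # log p_theta(x) = logsumexp_j [ log pi_j + log N(x; mu_j, sigma_j^2) ]
    comp = torch.distributions.Normal(means, stds)
    log_probs = comp.log_prob(x[..., None]) + weights.log()
    return torch.logsumexp(log_probs, dim=-1)

def mog_sample(n):
    # Pick a cluster center according to the weights, then add Gaussian noise.
    idx = torch.distributions.Categorical(probs=weights).sample((n,))
    return means[idx] + stds[idx] * torch.randn(n)

print(mog_log_prob(torch.tensor([0.0, 2.9])))
samples = mog_sample(1000)
```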
Histogram Method
Gotchas
Search Sorted
NumPy provides np.searchsorted out of the box; for PyTorch, a common workaround is the implementation below.
```python
import torch

def searchsorted(bin_locations, inputs, eps=1e-6):
    # Nudge the last bin edge so inputs equal to it still land in the final bin.
    bin_locations[..., -1] += eps
    # Count how many bin edges each input is >= to; subtract 1 to get the bin index.
    return torch.sum(inputs[..., None] >= bin_locations, dim=-1) - 1
```
This is an unofficial implementation. There is still some discussion in the PyTorch community about adding a native version; see the GitHub issue here. For now, we just use the implementation found in various open-source repositories.
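As a quick sanity check (the bin edges and inputs below are made-up values), the function returns the index of the bin each input falls into:

```python
import torch

# Hypothetical bin edges for 4 equal-width bins on [0, 1].
bin_locations = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])
inputs = torch.tensor([0.10, 0.30, 0.99])

print(searchsorted(bin_locations, inputs))  # tensor([0, 1, 3])
```

Note that the implementation modifies the last entry of bin_locations in place, so pass a clone if the original bin edges need to be preserved.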