PDF Estimation

Main Idea

Fig 1: Input Distribution.
\[P(x \in [a,b]) = \int_a^b p(x)dx\]

Likelihood

Given a dataset \(\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}\), we can find parameters \(\theta\) by maximizing the likelihood of the data:

\[\underset{\theta}{\text{max}} \sum_i \log p_\theta(x^{(i)})\]

or equivalently:

\[\underset{\theta}{\text{min }} \mathbb{E}_x \left[ - \log p_\theta(x) \right]\]

This is equivalent to minimizing the KL-divergence between the empirical data distribution \(\hat{p}_\text{data}(x)\) and the model \(p_\theta\):

\[ D_\text{KL}(\hat{p}_\text{data} \| p_\theta) = \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ - \log p_\theta(x) \right] - H(\hat{p}_\text{data}) \]

where \(\hat{p}_\text{data}(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}[x = x^{(i)}]\). Since the entropy term \(H(\hat{p}_\text{data})\) does not depend on \(\theta\), minimizing the expected negative log-likelihood minimizes the KL-divergence.
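The identity above can be checked numerically. This is a minimal sketch with a hypothetical discrete distribution over three symbols (the values of `p_hat` and `p_theta` are made up for illustration):

```python
import numpy as np

# Hypothetical empirical and model distributions over 3 symbols
p_hat = np.array([0.5, 0.3, 0.2])    # empirical data distribution
p_theta = np.array([0.4, 0.4, 0.2])  # model distribution

# Expected negative log-likelihood under the empirical distribution
nll = -np.sum(p_hat * np.log(p_theta))

# KL-divergence and entropy of the empirical distribution
kl = np.sum(p_hat * np.log(p_hat / p_theta))
entropy = -np.sum(p_hat * np.log(p_hat))

# Identity: E[-log p_theta] = D_KL(p_hat || p_theta) + H(p_hat)
assert np.isclose(nll, kl + entropy)
```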

Stochastic Gradient Descent

Maximum likelihood is an optimization problem, so we can solve it with stochastic gradient descent (SGD). SGD minimizes the expectation of a function \(f\), assuming \(f\) is differentiable in \(\theta\):

\[\underset{\theta}{\text{argmin}}\ \mathbb{E} \left[ f(\theta) \right]\]

With maximum likelihood, the optimization problem becomes:

\[\underset{\theta}{\text{argmin}}\ \mathbb{E}_{x \sim \hat{p}_\text{data}} \left[ - \log p_\theta(x) \right]\]

We typically use SGD because it scales to large datasets and integrates naturally with deep learning architectures and standard automatic-differentiation frameworks.
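As a toy instance of the recipe above, here is a minimal sketch of SGD on the negative log-likelihood of a unit-variance Gaussian with unknown mean \(\mu\) (the dataset, learning rate, and batch size are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)  # toy dataset

# Per-sample NLL for a unit-variance Gaussian: 0.5 * (x - mu)^2 + const,
# so the per-sample gradient with respect to mu is (mu - x).
mu, lr = 0.0, 0.05
for epoch in range(20):
    rng.shuffle(data)
    for batch in np.array_split(data, 20):  # minibatches of 50
        grad = np.mean(mu - batch)          # stochastic gradient estimate
        mu -= lr * grad

# The MLE for mu is the sample mean, so SGD should land near it
assert abs(mu - data.mean()) < 0.1
```

The same loop structure carries over unchanged when \(p_\theta\) is a neural network and the gradient comes from autodiff rather than a closed form.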


Example

Mixture of Gaussians

\[p_\theta(x) = \sum_{i=1}^k \pi_i \mathcal{N}(x ; \mu_i, \sigma_i^2)\]

where the parameters are the \(k\) means, variances, and mixture weights (with \(\sum_i \pi_i = 1\)):

\[\theta = (\pi_1, \cdots, \pi_k, \mu_1, \cdots, \mu_k, \sigma_1, \cdots, \sigma_k)\]

However, mixtures of Gaussians scale poorly to high-dimensional datasets. To sample, we first pick a component \(i\) with probability \(\pi_i\), then draw from \(\mathcal{N}(\mu_i, \sigma_i^2)\), i.e. add Gaussian noise around that cluster center.
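The two-step sampling procedure can be sketched in a few lines. The weights, means, and standard deviations below are hypothetical values for a 1D mixture with \(k = 3\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D mixture with k = 3 components
pi = np.array([0.5, 0.3, 0.2])     # mixture weights, sum to 1
mu = np.array([-2.0, 0.0, 3.0])    # component means
sigma = np.array([0.5, 1.0, 0.3])  # component standard deviations

def sample(n):
    # Step 1: pick a component index i with probability pi_i
    idx = rng.choice(len(pi), size=n, p=pi)
    # Step 2: add Gaussian noise around the chosen component's mean
    return rng.normal(loc=mu[idx], scale=sigma[idx])

xs = sample(10_000)
```

With enough samples, the empirical mean of `xs` approaches the mixture mean \(\sum_i \pi_i \mu_i\).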

Histogram Method

The simplest non-parametric density estimator. We partition the domain into bins and estimate the density by counting samples in each bin:

\[\hat{p}(x) = \frac{\text{count in bin containing } x}{N \cdot \Delta}\]

where \(N\) is the total number of samples and \(\Delta\) is the bin width. The main trade-off is bin width: too wide and we lose detail, too narrow and the estimate becomes noisy. See Kernel Density Estimation for a smoother alternative.
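The estimator above is a one-liner with `np.histogram`. This sketch uses equal-width bins over the sample range (the bin count and toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)  # toy samples

# Equal-width bins over the sample range; delta is the bin width
n_bins = 50
edges = np.linspace(data.min(), data.max(), n_bins + 1)
delta = edges[1] - edges[0]
counts, _ = np.histogram(data, bins=edges)

# Density estimate: count in each bin divided by N * delta
p_hat = counts / (len(data) * delta)

# A valid density estimate integrates to 1 over the binned domain
assert np.isclose(np.sum(p_hat * delta), 1.0)
```

Varying `n_bins` makes the bias/variance trade-off visible: few wide bins oversmooth, many narrow bins produce a noisy estimate.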

Implementation Notes

Search Sorted

NumPy

indices = np.searchsorted(bin_locations, inputs)

PyTorch

indices = torch.searchsorted(bin_locations, inputs)

Both NumPy and PyTorch provide built-in searchsorted functions that efficiently find insertion points in sorted arrays.
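For instance, with sorted bin edges, `searchsorted` maps each input to the index of its containing bin (the edge and input values here are made up for illustration):

```python
import numpy as np

# Bin edges must be sorted for searchsorted to work correctly
bin_locations = np.array([0.0, 1.0, 2.0, 3.0])
inputs = np.array([0.5, 1.5, 2.5])

# For each input, find where it would be inserted to keep the array sorted;
# with default side="left", this is the index of the first edge >= the input
indices = np.searchsorted(bin_locations, inputs)
# → array([1, 2, 3])
```

This is the standard way to assign samples to histogram bins in \(O(\log k)\) per query rather than scanning all \(k\) edges.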