PDF Estimation
Main Idea
Likelihood
Given a dataset \(\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}\), we can find parameters \(\theta\) by solving the maximum likelihood optimization problem:
\[\theta^* = \arg\max_\theta \sum_{i=1}^{n} \log p_\theta(x^{(i)})\]
or equivalently:
\[\theta^* = \arg\max_\theta \; \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[\log p_\theta(x)\right]\]
This is equivalent to minimizing the KL-Divergence between the empirical data distribution \(\hat{p}_\text{data}(x)\) and the model \(p_\theta\),
where \(\hat{p}_\text{data}(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[x = x^{(i)}]\).
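One way to see this equivalence is to expand the KL divergence using the empirical distribution defined above:
\[
\begin{aligned}
D_{\mathrm{KL}}\!\left(\hat{p}_\text{data} \,\|\, p_\theta\right)
  &= \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[\log \hat{p}_\text{data}(x) - \log p_\theta(x)\right] \\
  &= -H(\hat{p}_\text{data}) - \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[\log p_\theta(x)\right].
\end{aligned}
\]
Since the entropy term \(H(\hat{p}_\text{data})\) does not depend on \(\theta\), minimizing the KL divergence over \(\theta\) is the same as maximizing the expected log-likelihood.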
Stochastic Gradient Descent
Maximum likelihood is an optimization problem, so we can use stochastic gradient descent (SGD) to solve it. SGD minimizes objectives of the form \(\mathbb{E}[f(\theta)]\), assuming \(f\) is a differentiable function of \(\theta\).
With maximum likelihood, the optimization problem becomes:
\[\arg\min_\theta \; \mathbb{E}_{x \sim \hat{p}_\text{data}}\left[-\log p_\theta(x)\right]\]
We typically use SGD because it scales to large datasets and works naturally with deep learning architectures and standard frameworks.
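As a concrete illustration, here is a minimal PyTorch training loop that minimizes the negative log-likelihood with SGD. The `log_prob` method is a hypothetical interface assumed here (any model that returns \(\log p_\theta(x)\) per sample would work); this is a sketch, not a specific recommended setup.

```python
import torch

def train(model, dataloader, epochs=10, lr=1e-3):
    """Sketch of maximum likelihood training with SGD.

    Assumes `model.log_prob(x)` returns log p_theta(x) for each sample
    in the batch (a hypothetical interface, not a standard PyTorch method).
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (x,) in dataloader:                # assumes batches of single tensors, e.g. from a TensorDataset
            loss = -model.log_prob(x).mean()   # negative log-likelihood
            optimizer.zero_grad()
            loss.backward()                    # gradients via autograd
            optimizer.step()                   # SGD update
    return model
```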
Example
Mixture of Gaussians
\[p_\theta(x) = \sum_{i=1}^{k} \pi_i \, \mathcal{N}(x;\, \mu_i, \sigma_i^2)\]
where the parameters \(\theta\) are the \(k\) means \(\mu_i\), variances \(\sigma_i^2\), and mixture weights \(\pi_i\).
To sample, we pick a cluster center according to the mixture weights and then add Gaussian noise around it. However, this approach does not work well for high-dimensional datasets.
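A small NumPy sketch of this sampling procedure and of evaluating the mixture density (the parameter values below are made up for illustration):

```python
import numpy as np

# Illustrative mixture-of-Gaussians parameters (made up for this example):
# k = 3 components with mixture weights pi, means mu, and std deviations sigma.
pi = np.array([0.5, 0.3, 0.2])
mu = np.array([-2.0, 0.0, 3.0])
sigma = np.array([0.5, 1.0, 0.8])

def sample_mog(n, rng=np.random.default_rng(0)):
    """Sample n points: pick a component, then add Gaussian noise around its mean."""
    z = rng.choice(len(pi), size=n, p=pi)        # choose cluster indices
    return rng.normal(mu[z], sigma[z])           # Gaussian noise around the chosen mean

def mog_density(x):
    """Evaluate p_theta(x) = sum_i pi_i * N(x; mu_i, sigma_i^2)."""
    x = np.asarray(x)[..., None]                 # broadcast over components
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return (pi * comp).sum(axis=-1)

samples = sample_mog(1000)
print(mog_density([0.0, 1.5]))
```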
Histogram Method
The simplest non-parametric density estimator. We partition the domain into bins and estimate the density by counting samples in each bin:
\[\hat{p}(x) = \frac{c_i}{N\,\Delta} \quad \text{for } x \text{ in bin } i,\]
where \(c_i\) is the number of samples in bin \(i\), \(N\) is the total number of samples, and \(\Delta\) is the bin width. The main trade-off is bin width: too wide and we lose detail, too narrow and the estimate becomes noisy. See Kernel Density Estimation for a smoother alternative.
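A short NumPy sketch of this estimator (the bin count and synthetic data below are illustrative):

```python
import numpy as np

def histogram_density(samples, num_bins=50):
    """Histogram density estimate: counts per bin, normalized by N * bin width."""
    samples = np.asarray(samples)
    edges = np.linspace(samples.min(), samples.max(), num_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)
    delta = edges[1] - edges[0]                   # bin width
    density = counts / (len(samples) * delta)     # p_hat = c_i / (N * Delta)
    return density, edges

# Illustrative usage on synthetic data:
rng = np.random.default_rng(0)
data = rng.normal(size=10_000)
density, edges = histogram_density(data)
print(density.sum() * (edges[1] - edges[0]))      # integrates to ~1
```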
Implementation Notes
Search Sorted
NumPy
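A minimal example with `np.searchsorted` (the array values are illustrative):

```python
import numpy as np

bin_edges = np.array([0.0, 1.0, 2.0, 3.0])       # sorted array, e.g. histogram bin edges
queries = np.array([0.5, 2.7, -1.0])

# For each query, find the index where it would be inserted to keep the array sorted.
indices = np.searchsorted(bin_edges, queries)
print(indices)  # [1 3 0]
```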
PyTorch
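The PyTorch equivalent, `torch.searchsorted`, works the same way on tensors:

```python
import torch

bin_edges = torch.tensor([0.0, 1.0, 2.0, 3.0])   # sorted 1-D tensor
queries = torch.tensor([0.5, 2.7, -1.0])

# Returns the insertion indices that keep `bin_edges` sorted.
indices = torch.searchsorted(bin_edges, queries)
print(indices)  # tensor([1, 3, 0])
```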
Both NumPy and PyTorch provide built-in searchsorted functions that efficiently find insertion points in sorted arrays.