Exploiting Structure in Kernel Matrices
2021-07-28, 13:40–13:50 (UTC), Green

Kernel methods are widely used in statistics, machine learning, and physical simulations. These methods give rise to dense matrices that are naïvely expensive to multiply or invert. Herein, we present CovarianceFunctions.jl, a package that automatically detects and exploits low rankness, hierarchical structure, and approximate sparsity in kernel matrices. We highlight applications of this technology in Bayesian optimization and physical simulations.

CovarianceFunctions.jl implements many commonly used kernel functions, including stationary ones like the exponentiated quadratic, rational quadratic, and Matérn kernels, as well as non-stationary ones like the polynomial and neural network kernels. A crucial component of the package is the "Gramian" matrix type, which lazily represents kernel matrices with virtually no memory footprint. The package's most significant functionality derives from algorithms designed for particular combinations of kernel and data types, since these are able to drastically reduce the computational complexity of multiplication and inversion. However, even in the general case, the lazy representation eliminates the O(n^2) memory allocation otherwise required simply to store a kernel matrix of n data points.
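The idea behind such a lazy matrix type can be illustrated with a minimal Python sketch (this is a conceptual illustration, not the package's Julia API; the class name `LazyGramian` and its methods are hypothetical): entries are computed on demand, so only the data itself is stored.

```python
import numpy as np

class LazyGramian:
    """Lazily evaluated kernel matrix: entries are computed on demand,
    so storage is O(n) (just the data) rather than O(n^2)."""

    def __init__(self, kernel, x):
        self.kernel, self.x = kernel, x
        self.shape = (len(x), len(x))

    def __getitem__(self, ij):
        # Compute a single entry k(x_i, x_j) on demand.
        i, j = ij
        return self.kernel(self.x[i], self.x[j])

    def matvec(self, v):
        # O(n^2) time, but only O(n) extra memory: entries are
        # generated one at a time and never stored.
        return np.array([sum(self.kernel(xi, xj) * vj
                             for xj, vj in zip(self.x, v))
                         for xi in self.x])
```

A matrix-vector product with this type gives the same result as one with the densely stored kernel matrix, without ever allocating the n x n array.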

For stationary kernels in low dimensions, the package implements a new hierarchical factorization based on multipole expansions of the kernels via automatic differentiation. The user need only input the kernel function and data points, and the package automatically computes the relevant analytic expansions, which are then leveraged within a treecode analogous to the Barnes-Hut algorithm. Fast multiplies are then performed in O(n log(n)) time, and solves are performed using an iterative method whose preconditioner is also based on the aforementioned treecode.
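A minimal Python sketch of the treecode idea (all names here are illustrative; it uses 1-D data, a fixed Gaussian kernel with a hand-written derivative rather than automatic differentiation, and a first-order expansion standing in for the package's multipole expansions): well-separated clusters of sources are summarized by a few moments, so each target interacts with O(log n) clusters instead of all n sources.

```python
import numpy as np

ELL = 0.3  # kernel lengthscale (illustrative choice)

def gauss(t):
    return np.exp(-t ** 2 / (2 * ELL ** 2))

def dgauss(t):  # derivative of the Gaussian kernel
    return -t / ELL ** 2 * gauss(t)

class Node:
    """Binary tree over 1-D points; each node stores the moments of
    its sources, used for far-field approximation."""
    def __init__(self, idx, x, v, leaf_size=8):
        pts = x[idx]
        self.center = pts.mean()
        self.radius = np.abs(pts - self.center).max()
        self.idx = idx
        self.W = v[idx].sum()                           # monopole moment
        self.D = (v[idx] * (pts - self.center)).sum()   # dipole moment
        self.children = []
        if len(idx) > leaf_size:
            mid = np.median(pts)
            left, right = idx[pts <= mid], idx[pts > mid]
            if len(left) and len(right):
                self.children = [Node(left, x, v, leaf_size),
                                 Node(right, x, v, leaf_size)]

def treecode_eval(node, x, v, xi, theta=0.2):
    """Approximate sum_j k(xi - x_j) v_j, Barnes-Hut style."""
    t = xi - node.center
    if abs(t) > 0 and node.radius / abs(t) < theta:
        # Far field: first-order expansion k(t - d) ~ k(t) - k'(t) d.
        return gauss(t) * node.W - dgauss(t) * node.D
    if not node.children:
        # Near field: evaluate the leaf's sources exactly.
        return np.sum(gauss(xi - x[node.idx]) * v[node.idx])
    return sum(treecode_eval(ch, x, v, xi, theta) for ch in node.children)
```

Tightening the opening angle `theta` (or raising the expansion order, as the package does via automatically computed expansions) trades speed for accuracy.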

For exponentially decaying stationary kernels (e.g. the exponential, Matérn, and RBF kernels) in high dimensions, the associated kernel matrix can often be approximated well by a sparse matrix. However, naïvely detecting this approximate sparsity pattern would require evaluating the entire matrix in O(n^2) time. Instead, the package takes advantage of vantage-point trees to quickly find the most prominent neighbors of each data point in O(nk log(n)) time, where k is the maximum number of relevant neighbors of a data point.
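The sparsity-detection idea can be sketched as follows, with SciPy's k-d tree standing in for the vantage-point trees the package uses (the function name and parameters are illustrative): since an RBF entry drops below a tolerance once the pairwise distance exceeds a cutoff, a spatial tree can enumerate exactly the above-tolerance pairs without touching the remaining O(n^2) entries.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial import cKDTree

def sparse_rbf_kernel(x, ell=1.0, tol=1e-6):
    """Sparse approximation of the RBF kernel matrix, keeping only
    entries exp(-||xi - xj||^2 / (2 ell^2)) >= tol."""
    # kernel >= tol  <=>  ||xi - xj|| <= r_cut
    r_cut = ell * np.sqrt(-2.0 * np.log(tol))
    tree = cKDTree(x)
    pairs = tree.query_pairs(r_cut, output_type="ndarray")  # i < j pairs
    i, j = pairs[:, 0], pairs[:, 1]
    vals = np.exp(-np.sum((x[i] - x[j]) ** 2, axis=1) / (2 * ell ** 2))
    n = len(x)
    # Symmetrize and add the unit diagonal.
    rows = np.concatenate([i, j, np.arange(n)])
    cols = np.concatenate([j, i, np.arange(n)])
    data = np.concatenate([vals, vals, np.ones(n)])
    return csr_matrix((data, (rows, cols)), shape=(n, n))
```

Every kept entry is exact and every dropped entry is below `tol`, so the elementwise error of the sparse approximation is bounded by `tol`.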

In the context of Bayesian optimization with gradient information, the associated gradient kernel matrices are of size nd x nd, naïvely requiring O(n^2 d^2) operations for multiplication, which quickly becomes prohibitive as the number of parameters d increases. Based on recent work that uncovered a particular structure in a large class of these gradient kernel matrices, the package contains an exact multiplication algorithm that requires only O(n^2 d) operations. As a result, we are able to demonstrate first-order Bayesian optimization on problems of higher dimensionality than previously possible.
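To make the structure concrete: for the standard RBF kernel k(x, y) = exp(-||x - y||^2 / 2), the (i, j) block of the gradient kernel matrix works out to k(tau) (I - tau tau^T) with tau = x_i - x_j, so each d x d block can be applied to a vector in O(d) rather than O(d^2) operations. A Python sketch of this kind of structure exploit (an illustration of the idea for one kernel, not the package's implementation):

```python
import numpy as np

def rbf(tau):
    return np.exp(-0.5 * np.dot(tau, tau))

def grad_kernel_block(xi, xj):
    """Dense d x d block of the gradient RBF kernel:
    d^2 k / (dx dy^T) = k(tau) (I - tau tau^T), tau = xi - xj."""
    tau = xi - xj
    return rbf(tau) * (np.eye(len(tau)) - np.outer(tau, tau))

def grad_kernel_matvec(x, v):
    """Multiply the nd x nd gradient RBF kernel matrix by v in
    O(n^2 d), applying each block in O(d) via its structure."""
    n, d = x.shape
    v = v.reshape(n, d)
    out = np.zeros_like(v)
    for i in range(n):
        for j in range(n):
            tau = x[i] - x[j]
            # (I - tau tau^T) v_j = v_j - tau (tau . v_j): O(d) work.
            out[i] += rbf(tau) * (v[j] - tau * np.dot(tau, v[j]))
    return out.reshape(-1)
```

The structured matvec agrees exactly with multiplying by the densely assembled nd x nd matrix, while never forming any d x d block.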

In the absence of the particular structures above, the package attempts to construct a low-rank approximation of the matrix via a generic pivoted Cholesky algorithm that lazily computes the kernel matrix's entries, allowing the algorithm to terminate in O(nr^2) operations, where r is the numerical rank, without ever forming the entire matrix. If the matrix has no exploitable structure at all, however, this falls back to O(n^3) complexity.
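A Python sketch of the lazy pivoted Cholesky idea (the generic textbook algorithm, not the package's code; names are illustrative): the factorization requests only the diagonal plus one column per step, so a rank-r approximation touches O(nr) entries and does O(nr^2) arithmetic.

```python
import numpy as np

def lazy_pivoted_cholesky(entry, n, rank_tol=1e-10, max_rank=None):
    """Low-rank factor L with L @ L.T ~ A, where A is accessed only
    through the callback entry(i, j). Stops once the largest residual
    diagonal entry falls below rank_tol."""
    max_rank = n if max_rank is None else max_rank
    d = np.array([entry(i, i) for i in range(n)])  # residual diagonal
    cols = []
    for _ in range(max_rank):
        p = int(np.argmax(d))           # pivot: largest residual
        if d[p] <= rank_tol:
            break                       # numerically converged
        col = np.array([entry(i, p) for i in range(n)])  # one column of A
        if cols:
            L = np.column_stack(cols)
            col = col - L @ L[p]        # subtract previous factors
        col = col / np.sqrt(d[p])
        cols.append(col)
        d = d - col ** 2                # update residual diagonal
    return np.column_stack(cols) if cols else np.zeros((n, 0))
```

On an exactly low-rank kernel matrix, such as a degree-2 polynomial kernel in one dimension (rank 3), the loop terminates after r steps regardless of n.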

In addition to implementing the above algorithms, a main feature of the package is the automatic selection of the most scalable algorithm depending on the kernel and data type. We believe that this type of automation is particularly useful for practitioners who rely on kernel methods and need to scale them to large datasets. We further invite specialists to contribute their methods for efficient computation with kernel matrices.

PhD Candidate at Cornell University

PhD Candidate in Computer Science at Cornell University