Time | Title | Speaker
8.30 | Overture | Organizers
8.30-9.30 | A Tutorial on PAC-Bayesian Theory | François Laviolette
9.30-10.15 | A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity | Peter Grünwald

Over the last 15 years, machine learning theorists have bounded the performance of empirical risk minimization by (localized) Rademacher complexity; Bayesians with frequentist sympathies have studied Bayesian consistency and rate-of-convergence theorems, sometimes under misspecification; and PAC-Bayesians have studied convergence properties of generalized Bayesian and Gibbs posteriors. We show that, amazingly, most such bounds readily follow from essentially a single result that bounds excess risk in terms of a novel complexity COMP\( (\eta,w) \), which depends on a learning rate \( \eta \) and a luckiness function \( w \), the latter generalizing the concept of a 'prior'. Depending on the choice of \( w \), COMP\( (\eta,w) \) specializes to PAC-Bayesian (KL(posterior||prior)) complexity, MDL (normalized maximum likelihood) complexity and Rademacher complexity, and the bounds obtained are optimized for generalized Bayes, ERM, penalized ERM (such as the Lasso) or other methods. Tuning \( \eta \) leads to optimal excess risk convergence rates, even for very large (polynomial entropy) classes, which have always been problematic for the PAC-Bayesian approach; the optimal \( \eta \) depends on 'fast rate' properties of the domain, such as central, Bernstein and Tsybakov conditions.

Joint work with Nishant Mehta, University of Victoria. See the paper.
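To give a concrete instance of the family of bounds that COMP\( (\eta,w) \) specializes to (an illustration only, not the paper's unified result): for losses in \( [0,1] \), a prior \( \pi \) fixed before seeing the data, and any \( \delta \in (0,1) \), the standard PAC-Bayes-kl inequality states that, with probability at least \( 1-\delta \) over an i.i.d. sample of size \( n \), simultaneously for all posteriors \( \rho \),
\[
\mathrm{kl}\Big( \mathbb{E}_{h\sim\rho}\,\hat{L}_n(h) \,\Big\|\, \mathbb{E}_{h\sim\rho}\,L(h) \Big) \;\le\; \frac{\mathrm{KL}(\rho\,\|\,\pi) + \ln\frac{2\sqrt{n}}{\delta}}{n},
\]
where \( \hat{L}_n \) and \( L \) are the empirical and true risks and \( \mathrm{kl}(q\|p) = q\ln\frac{q}{p} + (1-q)\ln\frac{1-q}{1-p} \) is the binary relative entropy.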
10.15-11.00 | Coffee break and poster session
11.00-11.45 | Some recent advances on Approximate Bayesian Computation techniques | Jean-Michel Marin

In an increasing number of application domains, the statistical model is so complex that the point-wise computation of the likelihood is intractable. That is typically the case when the underlying probability distribution involves numerous latent variables. Approximate Bayesian Computation (ABC) is a widely used technique to bypass that difficulty. I will review some recent developments on ABC techniques, emphasizing the fact that modern machine learning approaches are useful in this field. Although intrinsically very different from PAC-Bayesian strategies (the choice of a generative model is essential within the ABC paradigm), I will highlight some links between these two methodologies.
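As a concrete illustration of the basic ABC idea (the talk surveys far more sophisticated variants), here is a minimal rejection-ABC sketch in Python; the Gaussian model, summary statistic and tolerance below are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def rejection_abc(observed, simulate, sample_prior, summary, tol, n_draws=10000, rng=None):
    """Minimal rejection ABC: keep prior draws whose simulated summary
    statistic lands within `tol` of the observed summary."""
    rng = np.random.default_rng(rng)
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)              # draw a parameter from the prior
        fake = simulate(theta, rng)            # simulate data given theta
        if abs(summary(fake) - s_obs) <= tol:  # compare summaries, no likelihood needed
            accepted.append(theta)
    return np.array(accepted)                  # approximate posterior sample

# Toy example (illustrative): infer the mean of a Gaussian with known variance.
obs = np.random.default_rng(0).normal(2.0, 1.0, size=100)
post = rejection_abc(
    observed=obs,
    simulate=lambda mu, rng: rng.normal(mu, 1.0, size=100),
    sample_prior=lambda rng: rng.normal(0.0, 5.0),
    summary=np.mean,
    tol=0.1,
)
print(post.mean(), post.size)
```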
11.45-12.05 | Contributed talk: A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks | Behnam Neyshabur

We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.
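To make the two norms in this complexity measure concrete, here is a small NumPy sketch that computes per-layer spectral and Frobenius norms and combines them in one common way such quantities enter spectrally-normalized bounds (product of spectral norms times a Frobenius-to-spectral ratio term); margin normalization, depth and width factors and constants from the actual bound are omitted, and the layer shapes are arbitrary.

```python
import numpy as np

def spectral_complexity(weights):
    """Illustrative complexity measure for a feedforward net given its weight
    matrices: product of spectral norms times a Frobenius-over-spectral ratio
    term. Margin, depth and constant factors of the actual bound are omitted."""
    spec = [np.linalg.norm(W, ord=2) for W in weights]      # largest singular values
    frob = [np.linalg.norm(W, ord='fro') for W in weights]  # Frobenius norms
    prod_spec = np.prod(spec)
    ratio_sum = sum((f / s) ** 2 for f, s in zip(frob, spec))
    return prod_spec * np.sqrt(ratio_sum)

# Example on random weights for a 3-layer network (shapes chosen arbitrarily).
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(784, 300)), rng.normal(size=(300, 100)), rng.normal(size=(100, 10))]
print(spectral_complexity(Ws))
```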
12.05-14.00 | Lunch break
14.00-14.40 | Dimension-free PAC-Bayesian Bounds | Olivier Catoni

PAC-Bayesian inequalities have already proved to be a great tool to obtain dimension-free generalization bounds, such as margin bounds for Support Vector Machines. In this talk, we will play with PAC-Bayesian inequalities and influence functions to present new robust estimators for the mean of random vectors and random matrices, as well as for linear least squares regression. A common theme of the presentation will be to establish dimension-free bounds and to work under mild polynomial moment assumptions regarding the tail of the sample distribution.
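As an illustration of the influence-function idea, here is a minimal sketch of a Catoni-type robust mean estimator for scalar, heavy-tailed data; the scale parameter alpha is set naively, and the estimators for random vectors, matrices and least squares discussed in the talk are substantially more refined.

```python
import numpy as np

def psi(x):
    """A Catoni-type influence function: grows only logarithmically, so
    extreme observations have a bounded effect on the estimating equation."""
    return np.where(x >= 0, np.log1p(x + x * x / 2), -np.log1p(-x + x * x / 2))

def catoni_mean(x, alpha):
    """Solve sum_i psi(alpha * (x_i - theta)) = 0 for theta by bisection;
    the left-hand side is non-increasing in theta, so bisection applies."""
    lo, hi = x.min() - 1.0, x.max() + 1.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if psi(alpha * (x - mid)).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Heavy-tailed toy data: Student-t with 3 degrees of freedom, true mean 0.
rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=1000)
alpha = np.sqrt(2 * np.log(1 / 0.05) / (len(x) * x.var()))  # naive scale choice
print(catoni_mean(x, alpha), x.mean())
```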
14.40-15.00 | Contributed talk: Dimension-free PAC-Bayesian bounds for the estimation of the mean of a random vector | Olivier Catoni

We present a new estimator of the mean of a random vector, computed by applying some threshold function to the norm. Non-asymptotic, dimension-free, almost sub-Gaussian bounds are proved under weak moment assumptions, using PAC-Bayesian inequalities.
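A minimal sketch of the norm-thresholding idea, assuming the simplest possible threshold (observations whose Euclidean norm exceeds a level lam are shrunk back to that level before averaging); the specific threshold function and the data-driven choice of the level analysed in the paper may differ.

```python
import numpy as np

def thresholded_mean(X, lam):
    """Mean estimator for heavy-tailed random vectors: rows of X whose
    Euclidean norm exceeds `lam` are rescaled to norm `lam` before averaging."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    shrink = np.minimum(1.0, lam / np.maximum(norms, 1e-12))  # threshold on the norm
    return (X * shrink).mean(axis=0)

# Toy example in dimension 50 with heavy-tailed (Student-t) coordinates, true mean 0.
rng = np.random.default_rng(0)
X = rng.standard_t(df=3, size=(2000, 50))
print(np.linalg.norm(thresholded_mean(X, lam=20.0)),  # robust estimate vs. ...
      np.linalg.norm(X.mean(axis=0)))                 # ... the empirical mean
```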
15.00-15.30 | Coffee break and poster session
15.30-16.15 | A Strongly Quasiconvex PAC-Bayesian Bound | Yevgeny Seldin

We propose a new PAC-Bayesian bound and a way of constructing a hypothesis space, so that the bound is convex in the posterior distribution and also convex in a trade-off parameter between empirical performance of the posterior distribution and its complexity. The complexity is measured by the Kullback-Leibler divergence to a prior. We derive an alternating procedure for minimizing the bound. We show that the bound can be rewritten as a one-dimensional function of the trade-off parameter and provide sufficient conditions under which the function has a single global minimum. When the conditions are satisfied, the alternating minimization is guaranteed to converge to the global minimum of the bound. We provide experimental results demonstrating that rigorous minimization of the bound is competitive with cross-validation in tuning the trade-off between complexity and empirical performance. In all our experiments the trade-off turned out to be quasiconvex, even when the sufficient conditions were violated.

Joint work with Niklas Thiemann, Christian Igel, and Olivier Wintenberger.
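To illustrate the alternating procedure on a finite hypothesis set, here is a sketch assuming a bound of the generic form \( \hat{L}(\rho)/(1-\lambda/2) + \big(\mathrm{KL}(\rho\|\pi) + \ln(2\sqrt{n}/\delta)\big)/\big(n\lambda(1-\lambda/2)\big) \): for fixed \( \lambda \) the minimizing posterior is a Gibbs distribution, and for a fixed posterior \( \lambda \) is optimized by a one-dimensional search. The exact bound, the hypothesis-space construction and the convergence conditions are in the paper; the grid over \( \lambda \) below is a simplification.

```python
import numpy as np

def alternating_minimization(emp_loss, prior, n, delta=0.05, iters=50):
    """Alternate between the closed-form Gibbs posterior for fixed lambda and a
    grid search over lambda in (0, 2) for the fixed posterior, for a bound of
    the generic form stated above. `emp_loss` holds the empirical losses of a
    finite set of hypotheses, `prior` their prior weights."""
    log_prior = np.log(prior)
    lam_grid = np.linspace(0.01, 1.99, 399)
    ln_term = np.log(2 * np.sqrt(n) / delta)
    lam = 1.0
    for _ in range(iters):
        # Gibbs posterior for the current lambda (computed in log space).
        log_rho = log_prior - lam * n * emp_loss
        log_rho -= np.max(log_rho)
        rho = np.exp(log_rho)
        rho /= rho.sum()
        # Bound value as a function of lambda for the current posterior.
        kl = np.sum(rho * (np.log(rho + 1e-300) - log_prior))
        exp_loss = np.dot(rho, emp_loss)
        bounds = exp_loss / (1 - lam_grid / 2) \
            + (kl + ln_term) / (n * lam_grid * (1 - lam_grid / 2))
        lam = lam_grid[np.argmin(bounds)]
    return rho, lam, bounds.min()

# Toy example: 100 hypotheses with a uniform prior and random empirical losses.
rng = np.random.default_rng(0)
rho, lam, bound = alternating_minimization(rng.uniform(0.2, 0.5, size=100),
                                           np.full(100, 1.0 / 100), n=1000)
print(lam, bound)
```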
16.15-17.00 | Distribution Dependent Priors for Stable Learning | John Shawe-Taylor

One feature of the PAC-Bayes approach is that, though the prior must be fixed, we do not need to have an explicit expression for it, only to be able to bound the distance between prior and posterior. Furthermore, the choice of prior only impacts the quality of the bound and not the validity of the results. We will discuss the implications of these observations, describing ways in which the prior may be chosen to improve the quality of the bounds obtained. The application of these ideas to the stability analysis for SVMs delivers a tightening of the well-known stability bounds.
17.00-17.30 | Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes | Dan Roy

One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/or easiness of the data is that many learning bounds are quantitatively vacuous when applied to networks learned by SGD in this "deep learning" regime. Logically, in order to explain generalization, we need nonvacuous bounds.

I will discuss recent work using PAC-Bayesian bounds and optimization to arrive at nonvacuous generalization bounds for neural networks with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and old work on flat minima and MDL-based explanations of generalization, as well as to variational inference for deep learning. Time permitting, I'll discuss new work interpreting Entropy-SGD as a PAC-Bayesian method.

Joint work with Gintare Karolina Dziugaite, based on this paper.
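One ingredient in turning a PAC-Bayes-kl statement into a numerical risk bound is inverting the binary kl divergence; a minimal sketch of that step follows (the optimization of a stochastic network's posterior, which is the core of the work, is not shown, and the numbers in the example are made up).

```python
import numpy as np

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q||p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q_hat, c, tol=1e-9):
    """Largest p in [q_hat, 1) with kl(q_hat||p) <= c, found by bisection.
    This converts an empirical error q_hat and a complexity term c into an
    upper bound on the true error."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# Example: empirical error 0.05 and complexity term (KL(rho||pi) + ln(2*sqrt(n)/delta)) / n = 0.02.
print(kl_inverse(0.05, 0.02))  # upper bound on the true error rate
```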
17.30-18.25 | Discussion (moderated by Francis Bach)

Panelists: Olivier Catoni, Peter Grünwald, François Laviolette, Jean-Michel Marin, Yevgeny Seldin, John Shawe-Taylor, Yee-Whye Teh
18.25 | Concluding remarks | Organizers