# NIPS 2017 Workshop

## (Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights

Long Beach Convention Center, California
Room 101-A
December 9, 2017
Schedule

## Scope

Industry-wide successes of machine learning at the dawn of the (so-called) big data era has led to an increasing gap between practitioners and theoreticians. The former are using off-the-shelf statistical and machine learning methods, while the latter are designing and studying the mathematical properties of such algorithms. The tradeoff between those two movements is somewhat addressed by Bayesian researchers, where sound mathematical guarantees often meet efficient implementation and provide model selection criteria. In the late 90s, a new paradigm has emerged in the statistical learning community, used to derive probably approximately correct (PAC) bounds on Bayesian-flavored estimators. This PAC-Bayesian theory has been pioneered by Shawe-Taylor and Willamson (1997), and McAllester (1998, 1999). It has been extensively formalized by Catoni (2004, 2007) and has triggered, slowly but surely, increasing research efforts during last decades.

We believe it is time to pinpoint the current PAC-Bayesian trends relatively to other modern approaches in the (statistical) machine learning community. Indeed, we observe that, while the field grows by its own, it took some undesirable distance from some related areas. Firstly, it seems to us that the relation to Bayesian methods has been forsaken in numerous works, despite the potential of PAC-Bayesian theory to bring new insights to the Bayesian community and to go beyond the classical Bayesian/frequentist divide. Secondly, the PAC-Bayesian methods share similarities with other quasi-Bayesian (or pseudo-Bayesian) methods studying Bayesian practices from a frequentist standpoint, such as the Minimum Description Length (MDL) principle (Grünwald, 2007). Last but not least, even if some practical and theory grounded learning algorithm has emerged from PAC-Bayesian works, these are almost unused for real-world problems.

In short, this workshop aims at gathering statisticians and machine learning researchers to discuss current trends and the future of {PAC,quasi}-Bayesian learning. From a broader perspective, we aim to bridge the gap between several communities that can all benefit from sharper statistical guarantees and sound theory-driven learning algorithms.

### References

1. J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of COLT, 1997.
2. D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of COLT, 1998.
3. D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of COLT, 1999.
4. O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
5. O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 56. Institute of Mathematical Statistics, 2007.
6. P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.

## Invited Speakers

The workshop welcomes world-class experts on {quasi,PAC,∅}-Bayesian learning.

### Olivier Catoni

Senior researcher, CNRS, France.

### Peter Grünwald

Professor, CWI, The Netherlands.

### Neil Lawrence

Professor, University of Sheffield, and
Amazon Research Cambridge, UK.

### Jean-Michel Marin

Professor, Université de Montpellier, France.

### Dan Roy

Assistant Professor, University of Toronto, Canada.

### Yevgeny Seldin

Associate Professor, University of Copenhagen, Denmark.

### John Shawe-Taylor

Professor, University College London, UK.

### Yee-Whye Teh

Professor, University of Oxford, UK.

## Schedule

The workshop will run in room 101-A.

Check NIPS website.

Time Title Speaker
8.30 Overture Organizers
8.30-9.30 A Tutorial on PAC-Bayesian Theory François Laviolette
9.30-10.15
A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity

Over the last 15 years, machine learning theorists have bounded the performance of empirical risk minimization by (localized) Rademacher complexity; Bayesians with frequentist sympathies have studied Bayesian consistency and rate of convergence theorems, sometimes under misspecification, and PAC-Bayesians have studied convergence properties of generalized Bayesian and Gibbs posteriors. We show that, amazingly, most such bounds readily follow from essentially a single result that bounds excess risk in terms of a novel complexity COMP$$(\eta,w)$$. which depends on a learning rate $$\eta$$ and a luckiness function $$w$$, the latter generalizing the concept of a 'prior'. Depending on the choice of $$w$$, COMP$$(\eta,w)$$ specializes to PAC-Bayesian (KL(posterior||prior) complexity, MDL (normalized maximum likelihood) complexity and Rademacher complexity, and the bounds obtained are optimized for generalized Bayes, ERM, penalized ERM (such as Lasso) or other methods. Tuning $$\eta$$ leads to optimal excess risk convergence rates, even for very large (polynomial entropy) classes which have always been problematic for the PAC-Bayesian approach; the optimal $$\eta$$ depends on 'fast rate' properties of the domain, such as central, Bernstein and Tsybakov conditions.

Joint work with Nishant Mehta, University of Victoria. See the paper.

Peter Grünwald
10.15-11.00 Coffee break and posters session
11.00-11.45
Some recent advances on Approximate Bayesian Computation techniques

In an increasing number of application domains, the statistical model is so complex that the point-wise computation of the likelihood is intractable. That is typically the case when the underlying probability distribution involves numerous latent variables. Approximate Bayesian Computation (ABC) is a widely used technique to bypass that difficulty. I will review some recent developments on ABC techniques, emphazing the fact that modern machine learning approaches are useful in this field. Although intrinsically very different of PAC-Bayesian strategies - the choice of a generative model is essential within the ABC paradigm - I will highlight some links between these two methodologies.

Jean-Michel Marin
11.45-12.05
Contributed talk: A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.

Behnam Neyshabur
12.05-14.00 Lunch break
14.00-14.40
Dimension-free PAC-Bayesian Bounds

PAC-Bayesian inequalities have already proved to be a great tool to obtain dimension free generalization bounds, such as margin bounds for Support Vector Machines. In this talk, we will play with PAC-Bayesian inequalities and influence functions to present new robust estimators for the mean of random vectors and random matrices, as well as for linear least squares regression. A common theme of the presentation will be to establish dimension free bounds and to work under mild polynomial moment assumptions regarding the tail of the sample distribution.

Olivier Catoni
14.40-15.00
Contributed talk: Dimension free PAC-Bayesian bounds for the estimation of the mean of a random vector

We present a new estimator of the mean of a random vector, computed by applying some threshold function to the norm. Non asymptotic dimension-free almost sub-Gaussian bounds are proved under weak moment assumptions, using PAC-Bayesian inequalities.

Olivier Catoni
15.00-15.30 Coffee break and posters session
15.30-16.15
A Strongly Quasiconvex PAC-Bayesian Bound

We propose a new PAC-Bayesian bound and a way of constructing a hypothesis space, so that the bound is convex in the posterior distribution and also convex in a trade-off parameter between empirical performance of the posterior distribution and its complexity. The complexity is measured by the Kullback-Leibler divergence to a prior. We derive an alternating procedure for minimizing the bound. We show that the bound can be rewritten as a one-dimensional function of the trade-off parameter and provide sufficient conditions under which the function has a single global minimum. When the conditions are satisfied the alternating minimization is guaranteed to converge to the global minimum of the bound. We provide experimental results demonstrating that rigorous minimization of the bound is competitive with cross-validation in tuning the trade-off between complexity and empirical performance. In all our experiments the trade-off turned to be quasiconvex even when the sufficient conditions were violated.

Joint work with Niklas Thiemann, Christian Igel, and Olivier Wintenberger.

Yevgeny Seldin
16.15-17.00
Distribution Dependent Priors for Stable Learning

One feature of PAC-Bayes approach is that though the prior must be fixed, we do not need to have an explicit expression for it, only to be able to bound the distance between prior and posterior. Furthermore, the choice of prior only impacts the quality of the bound and not the validity of the results. We will discuss the implications of these observations describing ways in which the prior may be chosen to improve the quality of the bounds obtained. The application of these ideas to the stability analysis for SVMs delivers a tightening of the well-known stability bounds.

John Shawe-Taylor
17.00-17.30
Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes

One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/or easiness of the data is that many learning bounds are quantitatively vacuous when applied to networks learned by SGD in this "deep learning" regime. Logically, in order to explain generalization, we need nonvacuous bounds.

I will discuss recent work using PAC-Bayesian bounds and optimization to arrive at nonvacuous generalization bounds for neural networks with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and old work on flat minima and MDL-based explanations of generalization, as well as to variational inference for deep learning. Time permitting, I'll discuss new work interpreting Entropy-SGD as a PAC-Bayesian method.

Joint work with Gintare Karolina Dziugaite, based on this paper.

Dan Roy
17.30-18.25 Discussion (moderated by Francis Bach) Panelists: Olivier Catoni, Peter Grünwald, François Laviolette, Jean-Michel Marin, Yevgeny Seldin, John Shawe-Taylor, Yee-Whye Teh
18.25 Concluding remarks Organizers

## Call for Papers

All accepted papers will have a poster presentation, and we will select two papers for oral presentations. We welcome submissions related to the following list of topics:

• {PAC,quasi,∅}-Bayesian generalization guarantees,
• Novel theoretical perspectives on Bayesian methods,
• Application of the PAC-Bayesian theory to different learning frameworks,
• Learning algorithms inspired by a {PAC,quasi}-Bayesian analysis.

### Submission Instructions

Submission is no longer possible.

• Page limit: 4 pages (excluding references).
• Please use the NIPS 2017 submission format.
• We are committed to a double-blind reviewing process. Submissions must be made anonymous: please hide authors' names and affiliations.
• Results which were previously published may be submitted, although we encourage original contributions. Please clearly indicate if the submitted work has been presented or published elsewhere.
• Please note that at least one author of each accepted paper must be available to present the paper at the workshop.

## Accepted papers

### Contributed talks

1. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester and Nathan Srebro
2. Dimension free PAC-Bayesian bounds for the estimation of the mean of a random vector
Olivier Catoni and Ilaria Giulini

### Posters

1. Multi-way Interacting Regression via Factorization Machines
Mikhail Yurochkin, XuanLong Nguyen and Nikolaos Vasiloglou
2. Lookahead Bayesian Optimization with Inequality Constraints
Remi Lam and Karen Willcox
3. Convergence rates of a partition based Bayesian multivariate density estimation method
Linxi Liu, Dangna Li and Wing Hung Wong
Jian Wu, Matthias Poloczek, Andrew Gordon Wilson and Peter I. Frazier
5. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
Balaji Lakshminarayanan, Alexander Pritzel and Charles Blundell
6. Adaptive Bayesian Sampling with Monte Carlo EM
Anirban Roychowdhury and Srinivasan Parthasarathy
7. Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models
Rishit Sheth and Roni Khardon
8. Exploring Generalization in Deep Learning
Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester and Nathan Srebro
9. A PAC-Bayesian Analysis of Randomized Learning with Application to Stochastic Gradient Descent
Ben London
10. Structured Bayesian Pruning via Log-Normal Multiplicative Noise
Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha and Dmitry Vetrov
11. PAC-Bayesian Generalization Bound for Multi-class Learning
Loubna Benabbou and Pascal Lang

## Material

Slides are available below. A YouTube channel hosts most of the videos of the workshop: go to YouTube.

• Olivier Catoni - Dimension-free PAC-Bayesian Bounds
[ Slides 1/2 ] • [ Slides 2/2 ] • [ Video ] • [ Picture ]
• Peter Grünwald - A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity
[ Slides ] • [ Video ] • [ Picture ]
• François Laviolette - A Tutorial on PAC-Bayesian Theory
[ Slides ] • [ Video ] • [ Picture ]
• Jean-Michel Marin - Some recent advances on Approximate Bayesian Computation techniques
[ Slides ] • [ Video ] • [ Picture ]
• Behnam Neyshabur - A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
[ Slides ] • [ Video ] • [ Picture ]
• Dan Roy - Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes
[ Slides ] • [ Video ] • [ Picture ]
• Yevgeny Seldin - A Strongly Quasiconvex PAC-Bayesian Bound
[ Slides ] • [ Video ] • [ Picture ]
• John Shawe-Taylor - Distribution Dependent Priors for Stable Learning
[ Slides ] • [ Video ] • [ Picture ]

## Organizers

### Pascal Germain

Researcher, Inria, France. Senior Researcher, Inria, France. Researcher, Inria, France.