(Almost) 50 shades of Bayesian Learning: PAC-Bayesian trends and insights

Scope

Industry-wide successes of machine learning at the dawn of the (so-called) big data era has led to an increasing gap between practitioners and theoreticians. The former are using off-the-shelf statistical and machine learning methods, while the latter are designing and studying the mathematical properties of such algorithms. The tradeoff between those two movements is somewhat addressed by Bayesian researchers, where sound mathematical guarantees often meet efficient implementation and provide model selection criteria. In the late 90s, a new paradigm has emerged in the statistical learning community, used to derive probably approximately correct (PAC) bounds on Bayesian-flavored estimators. This PAC-Bayesian theory has been pioneered by Shawe-Taylor and Willamson (1997), and McAllester (1998, 1999). It has been extensively formalized by Catoni (2004, 2007) and has triggered, slowly but surely, increasing research efforts during last decades.

We believe it is time to pinpoint the current PAC-Bayesian trends relatively to other modern approaches in the (statistical) machine learning community. Indeed, we observe that, while the field grows by its own, it took some undesirable distance from some related areas. Firstly, it seems to us that the relation to Bayesian methods has been forsaken in numerous works, despite the potential of PAC-Bayesian theory to bring new insights to the Bayesian community and to go beyond the classical Bayesian/frequentist divide. Secondly, the PAC-Bayesian methods share similarities with other quasi-Bayesian (or pseudo-Bayesian) methods studying Bayesian practices from a frequentist standpoint, such as the Minimum Description Length (MDL) principle (Grünwald, 2007). Last but not least, even if some practical and theory grounded learning algorithm has emerged from PAC-Bayesian works, these are almost unused for real-world problems.

In short, this workshop aims at gathering statisticians and machine learning researchers to discuss current trends and the future of {PAC,quasi}-Bayesian learning. From a broader perspective, we aim to bridge the gap between several communities that can all benefit from sharper statistical guarantees and sound theory-driven learning algorithms.

References

J. Shawe-Taylor and R. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of COLT, 1997.
D. A. McAllester. Some PAC-Bayesian theorems. In Proceedings of COLT, 1998.
D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of COLT, 1999.
O. Catoni. Statistical Learning Theory and Stochastic Optimization. Saint-Flour Summer School on Probability Theory 2001 (Jean Picard ed.), Lecture Notes in Mathematics. Springer, 2004.
O. Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Institute of Mathematical Statistics Lecture Notes—Monograph Series, 56. Institute of Mathematical Statistics, 2007.
P. D. Grünwald. The Minimum Description Length Principle. The MIT Press, 2007.

Time	Title	Speaker
8.30	Overture	Organizers
8.30-9.30	A Tutorial on PAC-Bayesian Theory	François Laviolette
9.30-10.15	A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity Over the last 15 years, machine learning theorists have bounded the performance of empirical risk minimization by (localized) Rademacher complexity; Bayesians with frequentist sympathies have studied Bayesian consistency and rate of convergence theorems, sometimes under misspecification, and PAC-Bayesians have studied convergence properties of generalized Bayesian and Gibbs posteriors. We show that, amazingly, most such bounds readily follow from essentially a single result that bounds excess risk in terms of a novel complexity COMP\( (\eta,w) \). which depends on a learning rate \( \eta \) and a luckiness function \( w \), the latter generalizing the concept of a 'prior'. Depending on the choice of \( w \), COMP\( (\eta,w) \) specializes to PAC-Bayesian (KL(posterior\|\|prior) complexity, MDL (normalized maximum likelihood) complexity and Rademacher complexity, and the bounds obtained are optimized for generalized Bayes, ERM, penalized ERM (such as Lasso) or other methods. Tuning \( \eta \) leads to optimal excess risk convergence rates, even for very large (polynomial entropy) classes which have always been problematic for the PAC-Bayesian approach; the optimal \( \eta \) depends on 'fast rate' properties of the domain, such as central, Bernstein and Tsybakov conditions. Joint work with Nishant Mehta, University of Victoria. See the paper.	Peter Grünwald
10.15-11.00	Coffee break and posters session
11.00-11.45	Some recent advances on Approximate Bayesian Computation techniques In an increasing number of application domains, the statistical model is so complex that the point-wise computation of the likelihood is intractable. That is typically the case when the underlying probability distribution involves numerous latent variables. Approximate Bayesian Computation (ABC) is a widely used technique to bypass that difficulty. I will review some recent developments on ABC techniques, emphazing the fact that modern machine learning approaches are useful in this field. Although intrinsically very different of PAC-Bayesian strategies - the choice of a generative model is essential within the ABC paradigm - I will highlight some links between these two methodologies.	Jean-Michel Marin
11.45-12.05	Contributed talk: A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks We present a generalization bound for feedforward neural networks in terms of the product of the spectral norm of the layers and the Frobenius norm of the weights. The generalization bound is derived using a PAC-Bayes analysis.	Behnam Neyshabur
12.05-14.00	Lunch break
14.00-14.40	Dimension-free PAC-Bayesian Bounds PAC-Bayesian inequalities have already proved to be a great tool to obtain dimension free generalization bounds, such as margin bounds for Support Vector Machines. In this talk, we will play with PAC-Bayesian inequalities and influence functions to present new robust estimators for the mean of random vectors and random matrices, as well as for linear least squares regression. A common theme of the presentation will be to establish dimension free bounds and to work under mild polynomial moment assumptions regarding the tail of the sample distribution.	Olivier Catoni
14.40-15.00	Contributed talk: Dimension free PAC-Bayesian bounds for the estimation of the mean of a random vector We present a new estimator of the mean of a random vector, computed by applying some threshold function to the norm. Non asymptotic dimension-free almost sub-Gaussian bounds are proved under weak moment assumptions, using PAC-Bayesian inequalities.	Olivier Catoni
15.00-15.30	Coffee break and posters session
15.30-16.15	A Strongly Quasiconvex PAC-Bayesian Bound We propose a new PAC-Bayesian bound and a way of constructing a hypothesis space, so that the bound is convex in the posterior distribution and also convex in a trade-off parameter between empirical performance of the posterior distribution and its complexity. The complexity is measured by the Kullback-Leibler divergence to a prior. We derive an alternating procedure for minimizing the bound. We show that the bound can be rewritten as a one-dimensional function of the trade-off parameter and provide sufficient conditions under which the function has a single global minimum. When the conditions are satisfied the alternating minimization is guaranteed to converge to the global minimum of the bound. We provide experimental results demonstrating that rigorous minimization of the bound is competitive with cross-validation in tuning the trade-off between complexity and empirical performance. In all our experiments the trade-off turned to be quasiconvex even when the sufficient conditions were violated. Joint work with Niklas Thiemann, Christian Igel, and Olivier Wintenberger.	Yevgeny Seldin
16.15-17.00	Distribution Dependent Priors for Stable Learning One feature of PAC-Bayes approach is that though the prior must be fixed, we do not need to have an explicit expression for it, only to be able to bound the distance between prior and posterior. Furthermore, the choice of prior only impacts the quality of the bound and not the validity of the results. We will discuss the implications of these observations describing ways in which the prior may be chosen to improve the quality of the bounds obtained. The application of these ideas to the stability analysis for SVMs delivers a tightening of the well-known stability bounds.	John Shawe-Taylor
17.00-17.30	Deep Neural Networks: From Flat Minima to Numerically Nonvacuous Generalization Bounds via PAC-Bayes One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/or easiness of the data is that many learning bounds are quantitatively vacuous when applied to networks learned by SGD in this "deep learning" regime. Logically, in order to explain generalization, we need nonvacuous bounds. I will discuss recent work using PAC-Bayesian bounds and optimization to arrive at nonvacuous generalization bounds for neural networks with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and old work on flat minima and MDL-based explanations of generalization, as well as to variational inference for deep learning. Time permitting, I'll discuss new work interpreting Entropy-SGD as a PAC-Bayesian method. Joint work with Gintare Karolina Dziugaite, based on this paper.	Dan Roy
17.30-18.25	Discussion (moderated by Francis Bach)	Panelists: Olivier Catoni, Peter Grünwald, François Laviolette, Jean-Michel Marin, Yevgeny Seldin, John Shawe-Taylor, Yee-Whye Teh
18.25	Concluding remarks	Organizers


Benjamin Guedj	Francis Bach	Pascal Germain
Researcher, Inria, France.	Senior Researcher, Inria, France.	Researcher, Inria, France.

NIPS 2017 Workshop

(Almost) 50 Shades of Bayesian Learning: PAC-Bayesian trends and insights

Scope

References

Invited Speakers

Olivier Catoni

Peter Grünwald

François Laviolette

Neil Lawrence

Jean-Michel Marin

Dan Roy

Yevgeny Seldin

John Shawe-Taylor

Yee-Whye Teh

Schedule

Call for Papers

Submission Instructions

Scientific Committee

Accepted papers

Contributed talks

Posters

Material

Organizers

Benjamin Guedj

Francis Bach

Pascal Germain

Sponsors

Endorsed by