Lecture Notes in Computer Science 7700
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
Grégoire Montavon
Geneviève B. Orr
Klaus-Robert Müller (Eds.)
Neural Networks:
Tricks of the Trade
Second Edition
Volume Editors
Grégoire Montavon
Technische Universität Berlin
Department of Computer Science
Franklinstr. 28/29, 10587 Berlin, Germany
E-mail: gregoire.montavon@tu-berlin.de
Geneviève B. Orr
Willamette University
Department of Computer Science
900 State Street, Salem, OR 97301, USA
E-mail: gorr@willamette.edu
Klaus-Robert Müller
Technische Universität Berlin
Department of Computer Science
Franklinstr. 28/29, 10587 Berlin, Germany
and
Korea University
Department of Brain and Cognitive Engineering
Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
E-mail: klaus-robert.mueller@tu-berlin.de
ISSN 0302-9743 e-ISSN 1611-3349
ISBN 978-3-642-35288-1 e-ISBN 978-3-642-35289-8
DOI 10.1007/978-3-642-35289-8
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012952591
CR Subject Classification (1998): F.1, I.2.6, I.5.1, C.1.3, F.2, J.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 1998, 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface to the Second Edition
There have been substantial changes in the field of neural networks since the first
edition of this book in 1998. Some of them have been driven by external factors
such as the increase of available data and computing power. The Internet made
public massive amounts of labeled and unlabeled data. The ever-increasing raw
mass of user-generated and sensed data is made easily accessible by databases
and Web crawlers. Nowadays, anyone having an Internet connection can parse
the 4,000,000+ articles available on Wikipedia and construct a dataset out of
them. Anyone can capture a Web TV stream and obtain days of video content
to test their learning algorithm.
Another development is the amount of available computing power that has
continued to rise at steady rate owing to progress in hardware design and en-
gineering. While the number of cycles per second of processors has thresholded
due to physics limitations, the slow-down has been offset by the emergence of
processing parallelism, best exemplified by the massively parallel graphics pro-
cessing units (GPU). Nowadays, everybody can buy a GPU board (usually al-
ready available in consumer-grade laptops), install free GPU software, and run
computation-intensive simulations at low cost.
These developments have raised the following question: Can we make use of
this large computing power to make sense of these increasingly complex datasets?
Neural networks are a promising approach, as they have the intrinsic modeling
capacity and flexibility to represent the solution. Their intrinsically distributed
nature allows one to leverage the massively parallel computing resources.
During the last two decades, the focus of neural network research and the
practice of training neural networks underwent important changes. Learning in
deep architectures (or “deep learning”) has to a certain degree displaced the once more
prevalent regularization issues, or more precisely, changed the practice of regularizing
neural networks. Use of unlabeled data via unsupervised layer-wise pretrain-
ing or deep unsupervised embeddings is now often preferred over traditional
regularization schemes such as weight decay or restricted connectivity. This new
paradigm has started to spread over a large number of applications such as image
recognition, speech recognition, natural language processing, complex systems,
neuroscience, and computational physics.
The second edition of the book reloads the first edition with more tricks.
These tricks arose from 14 years of theory and experimentation (from 1998
to 2012) by some of the world’s most prominent neural networks researchers.
These tricks can make a substantial difference (in terms of speed, ease of im-
plementation, and accuracy) when it comes to putting algorithms to work on
real problems. Tricks may not necessarily have solid theoretical foundations or
formal validation. As Yoshua Bengio states in Chap. 19, “the wisdom distilled
here should be taken as a guideline, to be tried and challenged, not as a practice
set in stone” [1].
The second part of the new edition starts with tricks for optimizing neural
networks faster and making more efficient use of the potentially infinite stream of
data presented to them. Chapter 18 [2] shows that a simple stochastic gradi-
ent descent (learning one example at a time) is suited for training most neural
networks. Chapter 19 [1] introduces a large number of tricks and recommenda-
tions for training feed-forward neural networks and choosing the multiple hyper-
parameters.
When the representation built by the neural network is highly sensitive to
small parameter changes, for example, in recurrent neural networks, second-order
methods based on mini-batches such as those presented in Chap. 20 [9] can be a
better choice. The seemingly simple optimization procedures presented in these
chapters require their fair share of tricks in order to work optimally. The software
Torch7 presented in Chap. 21 [5] provides a fast and modular implementation
of these neural networks.
The novel second part of this volume continues with tricks to incorporate
invariance into the model. In the context of image recognition, Chap. 22 [4] shows
that translation invariance can be achieved by learning a k-means representation
of image patches and spatially pooling the k-means activations. Chapter 23 [3]
shows that invariance can be injected directly in the input space in the form
of elastic distortions. Unlabeled data are ubiquitous and using them to capture
regularities in data is an important component of many learning algorithms.
For example, we can learn an unsupervised model of data as a first step, as
discussed in Chaps. 24 [7] and 25 [10], and feed the unsupervised representation
to a supervised classifier. Chapter 26 [12] shows that similar improvements can
be obtained by learning an unsupervised embedding in the deep layers of a neural
network, with added flexibility.
The book concludes with the application of neural networks to modeling time
series and optimal control systems. Modeling time series can be done using a very
simple technique discussed in Chap. 27 [8] that consists of fitting a linear model on
top of a “reservoir” that implements a rich set of time series primitives. Chapter 28
[13] offers an alternative to the previous method by directly identifying the underly-
ing dynamical system that generates the time series data. Chapter 29 [6] presents
how these system identification techniques can be used to identify a Markov de-
cision process from the observation of a control system (a sequence of states and
actions in the reinforcement learning terminology). Chapter 30 [11] concludes by
showing how the control system can be dynamically improved by fitting a neural
network as the control system explores the space of states and actions.
The book intends to provide a timely snapshot of tricks, theory, and algo-
rithms that are of use. Our hope is that some of the chapters of the new second
edition will become our companions when doing experimental work—eventually
becoming classics, as some of the papers of the first edition have become. Even-
tually in some years, there may be an urge to reload again...
September 2012 Grégoire Montavon
Klaus-Robert Müller
Acknowledgments. This work was supported by the World Class University Pro-
gram through the National Research Foundation of Korea funded by the Ministry
of Education, Science, and Technology, under Grant R31-10008. The editors also
acknowledge partial support by DFG (MU 987/17-1).
References
[1] Bengio, Y.: Practical Recommendations for Gradient-based Training of Deep Ar-
chitectures. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the
Trade, 2nd edn. LNCS, vol. 7700, pp. 437–478. Springer, Heidelberg (2012)
[2] Bottou, L.: Stochastic Gradient Descent Tricks. In: Montavon, G., Orr, G.B.,
Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700,
pp. 421–436. Springer, Heidelberg (2012)
[3] Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep Big Mul-
tilayer Perceptrons for Digit Recognition. In: Montavon, G., Orr, G.B., Müller,
K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 581–598.
Springer, Heidelberg (2012)
[4] Coates, A., Ng, A.Y.: Learning Feature Representations with k-means. In: Mon-
tavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS,
vol. 7700, pp. 561–580. Springer, Heidelberg (2012)
[5] Collobert, R., Kavukcuoglu, K., Farabet, C.: Implementing Neural Networks Effi-
ciently. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 537–557. Springer, Heidelberg (2012)
[6] Duell, S., Udluft, S., Sterzing, V.: Solving Partially Observable Reinforcement
Learning Problems with Recurrent Neural Networks. In: Montavon, G., Orr,
G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700,
pp. 687–707. Springer, Heidelberg (2012)
[7] Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. In:
Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn.
LNCS, vol. 7700, pp. 621–637. Springer, Heidelberg (2012)
[8] Lukoševičius, M.: A Practical Guide to Applying Echo State Networks. In: Mon-
tavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS,
vol. 7700, pp. 659–686. Springer, Heidelberg (2012)
[9] Martens, J., Sutskever, I.: Training Deep and Recurrent Networks with Hessian-
free Optimization. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700, pp. 479–535. Springer, Heidelberg (2012)
[10] Montavon, G., Müller, K.-R.: Deep Boltzmann Machines and the Centering Trick.
In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd
edn. LNCS, vol. 7700, pp. 621–637. Springer, Heidelberg (2012)
[11] Riedmiller, M.: 10 Steps and Some Tricks to Set Up Neural Reinforcement Con-
trollers. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 735–757. Springer, Heidelberg (2012)
[12] Weston, J., Ratle, F., Collobert, R.: Deep Learning Via Semi-supervised Embed-
ding. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 639–655. Springer, Heidelberg (2012)
[13] Zimmermann, H.-G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural
Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700,
pp. 687–707. Springer, Heidelberg (2012)
Table of Contents
Introduction ...................................................... 1
Speeding Learning
Preface .......................................................... 7
1. Efficient BackProp .............................................. 9
Yann LeCun, Leon Bottou, Genevieve B. Orr, and
Klaus-Robert Müller
Regularization Techniques to Improve Generalization
Preface .......................................................... 49
2. Early Stopping — But When? .................................... 53
Lutz Prechelt
3. A Simple Trick for Estimating the Weight Decay Parameter .......... 69
Thorsteinn S. Rögnvaldsson
4. Controlling the Hyperparameter Search in MacKay’s Bayesian Neural
Network Framework............................................. 91
Tony Plate
5. Adaptive Regularization in Neural Network Modeling ............... 111
Jan Larsen, Claus Svarer, Lars Nonboe Andersen, and
Lars Kai Hansen
6. Large Ensemble Averaging ....................................... 131
David Horn, Ury Naftaly, and Nathan Intrator
Improving Network Models and Algorithmic Tricks
Preface .......................................................... 139
7. Square Unit Augmented, Radially Extended, Multilayer Perceptrons .. 143
Gary William Flake
8. A Dozen Tricks with Multitask Learning ........................... 163
Rich Caruana
9. Solving the Ill-Conditioning in Neural Network Learning ............. 191
Patrick van der Smagt and Gerd Hirzinger
10. Centering Neural Network Gradient Factors ....................... 205
Nicol N. Schraudolph
11. Avoiding Roundoff Error in Backpropagating Derivatives............ 225
Tony Plate
Representing and Incorporating Prior Knowledge in
Neural Network Training
Preface .......................................................... 231
12. Transformation Invariance in Pattern Recognition – Tangent
Distance and Tangent Propagation ............................... 235
Patrice Y. Simard, Yann A. LeCun, John S. Denker, and
Bernard Victorri
13. Combining Neural Networks and Context-Driven Search for
On-line, Printed Handwriting Recognition in the Newton ........... 271
Larry S. Yaeger, Brandyn J. Webb, and Richard F. Lyon
14. Neural Network Classification and Prior Class Probabilities ......... 295
Steve Lawrence, Ian Burns, Andrew Back, Ah Chung Tsoi, and
C. Lee Giles
15. Applying Divide and Conquer to Large Scale Pattern Recognition
Tasks ........................................................ 311
Jürgen Fritsch and Michael Finke
Tricks for Time Series
Preface .......................................................... 339
16. Forecasting the Economy with Neural Nets: A Survey of
Challenges and Solutions ....................................... 343
John Moody
17. How to Train Neural Networks .................................. 369
Ralph Neuneier and Hans Georg Zimmermann
Big Learning in Deep Neural Networks
Preface .......................................................... 419
18. Stochastic Gradient Descent Tricks ............................... 421
Léon Bottou
19. Practical Recommendations for Gradient-Based Training
of Deep Architectures .......................................... 437
Yoshua Bengio
20. Training Deep and Recurrent Networks with Hessian-Free
Optimization .................................................. 479
James Martens and Ilya Sutskever
21. Implementing Neural Networks Efficiently......................... 537
Ronan Collobert, Koray Kavukcuoglu, and Clément Farabet
Better Representations: Invariant, Disentangled and
Reusable
Preface .......................................................... 559
22. Learning Feature Representations with K-Means ................... 561
Adam Coates and Andrew Y. Ng
23. Deep Big Multilayer Perceptrons for Digit Recognition ............. 581
Dan Claudiu Cireşan, Ueli Meier, Luca Maria Gambardella, and
Jürgen Schmidhuber
24. A Practical Guide to Training Restricted Boltzmann Machines ...... 599
Geoffrey E. Hinton
25. Deep Boltzmann Machines and the Centering Trick ................ 621
Grégoire Montavon and Klaus-Robert Müller
26. Deep Learning via Semi-supervised Embedding .................... 639
Jason Weston, Frédéric Ratle, and Ronan Collobert
Identifying Dynamical Systems for Forecasting and
Control
Preface .......................................................... 657
27. A Practical Guide to Applying Echo State Networks ............... 659
Mantas Lukoševičius
28. Forecasting with Recurrent Neural Networks: 12 Tricks ............. 687
Hans-Georg Zimmermann, Christoph Tietz, and Ralph Grothmann
29. Solving Partially Observable Reinforcement Learning Problems
with Recurrent Neural Networks ................................. 709
Siegmund Duell, Steffen Udluft, and Volkmar Sterzing
30. 10 Steps and Some Tricks to Set up Neural Reinforcement
Controllers .................................................... 735
Martin Riedmiller
Author Index ................................................ 759
Subject Index ................................................ 761
Introduction
It is our belief that researchers and practitioners acquire, through experience
and word-of-mouth, techniques and heuristics that help them successfully apply
neural networks to difficult real world problems. Often these “tricks” are theoret-
ically well motivated. Sometimes they are the result of trial and error. However,
their most common link is that they are usually hidden in people’s heads or in
the back pages of space-constrained conference papers. As a result newcomers
to the field waste much time wondering why their networks train so slowly and
perform so poorly.
This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade
whose goal was to begin the process of gathering and documenting these tricks.
The interest that the workshop generated, motivated us to expand our collection
and compile it into this book. Although we have no doubt that there are many
tricks we have missed, we hope that what we have included will prove to be
useful, particularly to those who are relatively new to the field. Each chapter
contains one or more tricks presented by a given author (or authors). We have
attempted to group related chapters into sections, though we recognize that the
different sections are far from disjoint. Some of the chapters (e.g. 1,13,17) contain
entire systems of tricks that are far more general than the category they have
been placed in.
Before each section we provide the reader with a summary of the tricks con-
tained within, to serve as a quick overview and reference. However, we do not
recommend applying tricks before having read the accompanying chapter. Each
trick may only work in a particular context that is not fully explained in the
summary. This is particularly true for the chapters that present systems where
combinations of tricks must be applied together for them to be effective.
Below we give a coarse roadmap of the contents of the individual chapters.
Speeding Learning
The book opens with a chapter based on Leon Bottou and Yann LeCun’s popular
workshop on efficient backpropagation where they present a system of tricks for
speeding the minimization process. Included are tricks that are very simple to
implement as well as more complex ones, e.g. based on second-order methods.
Though many of the readers may recognize some of these tricks, we believe that
this chapter provides both: a thorough explanation of their theoretical basis as
well as an understanding of the subtle interactions among them.
This chapter provides an ideal introduction for the reader. It starts with dis-
cussing fundamental tricks addressing input representation, initialization, target
values, choice of learning rates, choice of the nonlinearity, and so on. Subse-
quently, the authors introduce in great detail tricks for estimation and approxi-
mation of the Hessian in neural networks. This provides the basis for a discussion
of second-order algorithms, fast training methods like the stochastic Levenberg-
Marquardt algorithm, and tricks for learning rate adaptation.
Regularization Techniques to Improve Generalization
Fast minimization is important but only if we can also insure good general-
ization. We therefore next include a collection of chapters containing a range
of approaches for improving generalization. As one might expect, there are no
tricks that work well in all situations. However, many examples and discussions
are included to help the reader to decide which will work best for their own
problem.
Chapter 2 addresses what is one of the most commonly used techniques:
early stopping. Here Lutz Prechelt discusses the pitfalls of this seemingly simple
technique. He quantifies the tradeoff between generalization and training time
for various stopping criteria, which leads to a trick for picking an appropriate
criterion.
Using a weight decay penalty term in the cost function is another common
method for improving generalization. The difficulty, however, is in finding a good
estimate of the weight decay parameter. In chapter 3, Thorsteinn Rögnvaldsson
presents a fast technique for finding a good estimate, surprisingly, by using in-
formation measured at the early stopping point. Experimental evidence for its
usefulness is given in several applications.
Tony Plate in chapter 4 treats the penalty terms along the lines of MacKay,
i.e. as hyperparameters to be found through iterative search. He presents and
compares tricks for making the hyperparameter search in classification networks
work in practice by speeding it up and simplifying it. Key to his success is a
control of the frequency of the hyperparameter updates and a better strategy in
cases where the Hessian becomes out-of-bounds.
In chapter 5, Jan Larsen et al. present a trick for adapting regularization
parameters by using simple gradient descent (with respect to the regularization
parameters) on the validation error. The trick is tested on both classification
and regression problems.
Averaging over multiple predictors is a well known method for improving
generalization. Two questions that arise are how many predictors are “enough”
and how does the number of predictors affect the stopping criteria for early
stopping. In the final chapter of this section, David Horn et al. present solutions
to these questions by providing a method for estimating the error of an infinite
number of predictors. They then demonstrate this trick for a prediction task.
Improving Network Models and Algorithmic Tricks
In this section we examine tricks that help improve the network model. Even
though standard multilayer perceptrons (MLPs) are, in theory, universal ap-
proximators, other architectures may provide a more natural fit to a problem. A
better fit means that training is faster and that there is a greater likelihood of
finding a good and stable solution. For example, radial basis functions (RBFs)
are preferred for problems that exhibit local features in a finite region. Of course,
which architecture to choose is not always obvious.
In chapter 7, Gary Flake presents a trick that gives MLPs the power of both
an MLP and an RBF so that one does not need to choose between them. This
trick is simply to add extra inputs whose values are the square of the regular
inputs. Both a theoretical and intuitive explanation are presented along with a
number of simulation examples.
Rich Caruana in chapter 8 shows that performance can be improved on a
main task by adding extra outputs to a network that predict related tasks. This
technique, known as multi-task learning (MTL), trains these extra outputs in
parallel with the main task. This chapter presents multiple examples of what
one might use as these extra outputs as well as techniques for implementing
MTL effectively. Empirical examples include mortality rankings for pneumonia
and road-following in a network learning to steer a vehicle.
Patrick van der Smagt and Gerd Hirzinger consider in chapter 9 the ill-
conditioning of the Hessian in neural network training and propose using what
they call a linearly augmented feed-forward network, employing input/output
short-cut connections that share the input/hidden weights. This gives rise to
better conditioning of the learning problem and, thus, to faster learning, as
shown in a simulation example with data from a robot arm.
In chapter 10, Nicol Schraudolph takes the idea of scaling and centering the
inputs even further than chapter 1 by proposing to center all factors in the
neural network gradient: inputs, activities, error signals and hidden unit slopes.
He gives experimental evidence for the usefulness of this trick.
In chapter 11, Tony Plate’s short note reports a numerical trick for computing
derivatives more accurately with only a small memory overhead.
Representing and Incorporating Prior Knowledge in
Neural Network Training
Previous chapters (e.g. Chapter 1) present very general tricks for transforming
inputs to improve learning: prior knowledge of the problem is not taken into
account explicitly (of course regularization, as discussed in Chapters 2-5, im-
plicitly assumes a prior but on the weight distribution). For complex, difficult
problems, however, it is not enough to take a black box approach, no matter
how good that black box might be. This section examines how prior knowledge
about a problem can be used to greatly improve learning. The questions asked
include how to best represent the data, how to make use of this representation
for training, and how to take advantage of the invariances that are present. Such
issues are key for proper neural network training. They are also at the heart
of the tricks pointed out by Patrice Simard, et al. in the first chapter of this
section. Here, the authors present a particularly interesting perspective on how
to incorporate prior knowledge into data. They also give the first review of the
tangent distance classification method and related techniques evolving from it
such as tangent prop. These methods are applied to the difficult task of optical
character recognition (OCR).
In chapter 13, Larry Yaeger, et al. give an overview of the tricks and tech-
niques for on-line handwritten character recognition that were eventually used
in the Apple Computer’s Newton MessagePad® and eMate®. Anyone who has
used these systems knows that their handwriting recognition capability works
exceedingly well. Although many of the issues that are discussed in this chapter
overlap with those in OCR, including representation and prior knowledge, the
solutions are complementary. This chapter also gives a very nice overview of
what design choices proved to be efficient as well as how different tricks such as
choice of learning rate, over-representation of more difficult patterns, negative
training, error emphasis and so on work together.
Whether it be handwritten character recognition, speech recognition or med-
ical applications, a particularly difficult problem encountered is the unbalanced
class prior probabilities that occur, for example, when certain writing styles
and subphoneme classes are uncommon or certain illnesses occur less frequently.
Chapter 13 briefly discusses this problem in the context of handwriting recogni-
tion and presents a heuristic which controls the frequency with which samples
are picked for training.
In chapter 14, Steve Lawrence, et al. discuss the issue of unbalanced class prior
probabilities in greater depth. They present and compare several different heuris-
tics (prior scaling, probabilistic sampling, post scaling and class membership
equalization) one of which is similar to the one in chapter 13. They demonstrate
their tricks solving an ECG classification problem and provide some theoretical
explanations.
Many training techniques work well for small to moderate size nets. However
when problems consist of thousands of classes and millions of examples, not
uncommon in applications such as speech recognition, many of these techniques
break down. This chapter by Jürgen Fritsch and Michael Finke is devoted to the
issue of large scale classification problems and representation design in general.
Here the problem of unbalanced class prior probabilities is also tackled.
Although Fritsch and Finke specifically exemplify their design approach for
the problem of building a large vocabulary speech recognizer, it becomes clear
that these techniques are also applicable to the general construction of an appro-
priate hierarchical decision tree. A particularly interesting result in this paper
is that the structural design to incorporate prior knowledge about speech done
by a human speech expert was outperformed by their machine learning tech-
nique using an agglomerative clustering algorithm to choose the structure of the
decision tree.
Tricks for Time Series
We close the book with two papers on the subject of time series and economic
forecasting. In the first of these chapters, John Moody presents an excellent
survey of both the challenges of macroeconomic forecasting as well as a number of
neural network solutions. The survey is followed by a more detailed description of
smoothing regularizers, model selection methods (e.g. AIC, effective number of
parameters, nonlinear cross-validation), and input selection via sensitivity-based
input pruning. Model interpretation and visualization are also discussed.
In the final chapter, Ralph Neuneier and Hans Georg Zimmermann present
an impressive integrated system for neural network training of time series and
economic forecasting. Every aspect of the system is discussed including input
preprocessing, cost functions, handling of outliers, architecture, regularization
techniques, as well as solutions for dealing with the problem of bottom-heavy
networks, i.e. the input dimension is large while the output dimension is very
small. There is also a thought-provoking discussion of the Observer-Observer
dilemma: we want both to create a model based on observed data while, at the
same time, use this model to judge the correctness of new incoming data. Even
those people not interested specifically in economic forecasting are encouraged
to read this very useful example of how to incorporate prior (system) knowledge
into training.
Final Remark
As a final remark, we note that some of the views taken in the chapters are
contradictory, e.g. some authors favor one regularization method over another,
while other authors make exactly the opposite statement. On the one hand, one
can explain these discrepancies by stating that the field is still very active and
therefore opposing viewpoints will inevitably exist until more is understood. On
the other hand, it may be that both (contradicting) views are correct but on
different data sets and in different applications, e.g. an approach that considers
noisy time-series needs algorithms with a completely different robustness than
in, say, an OCR setting. In this sense, the present book mirrors an active field
and a variety of applications with its diversity of views.
August 1998 Jenny & Klaus
Acknowledgements. We would like to thank all authors for their collaboration.
Special thanks to Steven Lemm for considerable help with the typesetting. K.-
R.M. acknowledges partial financial support from DFG (grant JA 379/51 and
JA 379/7) and EU ESPRIT (grant 25387-STORM).
Speeding Learning
Preface
There are those who argue that developing fast algorithms is no longer necessary
because computers have become so fast. However, we believe that the complexity
of our algorithms and the size of our problems will always expand to consume all cy-
cles available, regardless of the speed of our machines. Thus, there will never come a
time when computational efficiency can or should be ignored. Besides, in the quest
to find solutions faster, we also often find better and more stable solutions as well.
This section is devoted to techniques for making the learning process in backprop-
agation (BP) faster and more efficient. It contains a single chapter based on a work-
shop by Leon Bottou and Yann LeCun. While many alternative learning systems
have emerged since the time BP was first introduced, BP is still the most widely
used learning algorithm. The reason for this is its simplicity, efficiency, and its gen-
eral effectiveness on a wide range of problems. Even so, there are many pitfalls in
applying it, which is where all these tricks enter.
Chapter 1 begins gently by introducing us to a few practical tricks that are
very simple to implement. Included are easy to understand qualitative explana-
tions of each. There is a discussion of stochastic (on-line) vs batch mode
learning where the advantages and disadvantages of both are presented while
making it clear that stochastic learning is most often preferred (p. 13). There
is a trick that aims at maximizing the per iteration information presented to
the network simply by knowing how best to shuffle the examples (p. 15).
This is followed by an entire set of tricks that must be coordinated together for
maximum effectiveness. These include:
how to normalize, decorrelate, and scale the inputs (p. 16)
how to choose the sigmoid (p. 17)
how to set target values (classification) (p. 19)
how to initialize the weights (p. 20)
how to pick the learning rates (p. 20).
Additional issues discussed include the effectiveness of momentum and the choice
between radial basis units and sigmoid units (p. 21).
Chapter 1 then introduces us to a little of the theory, providing deeper under-
standing of some of the preceding tricks. Included are discussions of the effect
of learning rates on the speed of learning and of the relationship between the
Hessian matrix, the error surface, and the learning rates. Simple examples of
linear and multilayer nets are provided to illustrate the theoretical results.
The chapter next enters more difficult territory by giving an overview of sec-
ond order methods (p. 31). Quickly summarized here, they are
Newton method: generally impractical to use since it requires inverting the full Hessian and works only in batch mode.
Conjugate gradient: an O(N) algorithm that doesn’t use the Hessian, but requires a line search and so works only in batch mode.
Quasi-Newton, Broyden-Fletcher-Goldfarb-Shanno (BFGS) method: an O(N²) algorithm that computes an estimate of the inverse Hessian. It requires a line search and also only works in batch mode.
Gauss-Newton method: an O(N³) algorithm that uses the square Jacobian approximation of the Hessian. Mainly used for batch and works only for mean squared error loss functions.
Levenberg-Marquardt method: extends the Gauss-Newton method to include a regularization parameter for stability.
Second order methods can greatly speed learning at each iteration but often at
an excessive computational cost. However, by replacing the exact Hessian with
an approximation of either the full or partial Hessian, the benefits of second order
information can still be reaped without incurring as great a computational cost.
The first and most direct method for estimating the full Hessian is fi-
nite differences which simply requires little more than two backpropagations to
compute each row of the Hessian (p. 35). Another is to use the square Jacobian
approximation which guarantees a positive semi-definite matrix which may be
beneficial for improving stability. If even more simplification is desired, one can
just compute the diagonal elements of the Hessian. All of the methods mentioned
here are easily implemented using BP.
Unfortunately, for very large networks, many of the classical second order
methods do not work well because storing the Hessian is far too expensive and
because batch mode, required by most of the methods, is too slow. On-line
second order methods are needed instead. One such technique presented here is
a stochastic diagonal Levenberg Marquardt method (p. 40).
If all that is needed is the product of the Hessian with an arbitrary vec-
tor rather than the Hessian itself, then much time can be saved using a method
that computes this entire product directly using only a single backpropagation
step (p. 37). Such a technique can be used to compute the largest eigenvalue and
associated eigenvector of the Hessian. The inverse of the largest eigenvalue can
then be used to obtain a good estimate of the learning rate.
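As a rough sketch of this idea (our own illustration, not the chapter’s code), the snippet below uses a finite-difference approximation of the Hessian-vector product, i.e. two gradient evaluations instead of the single extra backpropagation step described on p. 37, together with the power method to estimate the largest eigenvalue of the Hessian of a toy quadratic loss, and derives a learning rate from it. All names and the toy loss are assumptions for illustration.

```python
import numpy as np

# Toy quadratic loss E(w) = 0.5 * w^T A w, so the true Hessian is A.
A = np.diag([1.0, 4.0, 25.0])

def grad(w):
    return A @ w                              # gradient of the toy loss

def hessian_vector(w, v, eps=1e-5):
    """Finite-difference approximation of the product H v."""
    return (grad(w + eps * v) - grad(w)) / eps

def largest_eigenvalue(w, n_iter=100):
    """Power method applied to Hessian-vector products."""
    v = np.random.randn(w.size)
    for _ in range(n_iter):
        hv = hessian_vector(w, v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hessian_vector(w, v))    # Rayleigh quotient ~ largest eigenvalue

w = np.random.randn(3)
lam_max = largest_eigenvalue(w)
eta = 1.0 / lam_max                           # learning rate estimate from the trick above
print(lam_max, eta)                           # roughly 25.0 and 0.04 for this toy loss
```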
Finally, three useful tricks are presented for computing the principal eigen-
value and vector without having to compute the Hessian: the power
method, Taylor expansion, and an on-line method (p. 42).
Jenny & Klaus
1
Efficient BackProp
Yann A. LeCun¹, Léon Bottou¹, Genevieve B. Orr², and Klaus-Robert Müller³
¹ Image Processing Research Department, AT&T Labs - Research, 100 Schulz Drive, Red Bank, NJ 07701-7033, USA
² Willamette University, 900 State Street, Salem, OR 97301, USA
³ GMD FIRST, Rudower Chaussee 5, 12489 Berlin, Germany
{yann,leonb}@research.att.com, gorr@willamette.edu, klaus@first.gmd.de
Abstract. The convergence of back-propagation learning is analyzed
so as to explain common phenomena observed by practitioners. Many
undesirable behaviors of backprop can be avoided with tricks that are
rarely exposed in serious technical publications. This paper gives some
of those tricks, and offers explanations of why they work.
Many authors have suggested that second-order optimization methods
are advantageous for neural net training. It is shown that most “classical”
second-order methods are impractical for large neural networks. A few
methods are proposed that do not have these limitations.
1.1 Introduction
Backpropagation is a very popular neural network learning algorithm because
it is conceptually simple, computationally efficient, and because it often works.
However, getting it to work well, and sometimes to work at all, can seem more of
an art than a science. Designing and training a network using backprop requires
making many seemingly arbitrary choices such as the number and types of nodes,
layers, learning rates, training and test sets, and so forth. These choices can be
critical, yet there is no foolproof recipe for deciding them because they are largely
problem and data dependent. However, there are heuristics and some underlying
theory that can help guide a practitioner to make better choices.
In the first section below we introduce standard backpropagation and discuss
a number of simple heuristics or tricks for improving its performance. We next
discuss issues of convergence. We then describe a few “classical” second-order
non-linear optimization techniques and show that their application to neural
network training is very limited, despite many claims to the contrary in the
literature. Finally, we present a few second-order methods that do accelerate
learning in certain cases.
1.2 Learning and Generalization
There are several approaches to automatic machine learning, but many of the successful approaches can be categorized as gradient-based learning methods. The learning machine, as represented in Figure 1.1, computes a function $M(Z^p, W)$ where $Z^p$ is the $p$-th input pattern, and $W$ represents the collection of adjustable parameters in the system. A cost function $E^p = C(D^p, M(Z^p, W))$ measures the discrepancy between $D^p$, the “correct” or desired output for pattern $Z^p$, and the output produced by the system. The average cost function $E_{\mathrm{train}}(W)$ is the average of the errors $E^p$ over a set of input/output pairs called the training set $\{(Z^1, D^1), \ldots, (Z^P, D^P)\}$. In the simplest setting, the learning problem consists in finding the value of $W$ that minimizes $E_{\mathrm{train}}(W)$. In practice, the performance of the system on a training set is of little interest. The more relevant measure is the error rate of the system in the field, where it would be used in practice. This performance is estimated by measuring the accuracy on a set of samples disjoint from the training set, called the test set. The most commonly used cost function is the Mean Squared Error:

$$E^p = \frac{1}{2}\bigl(D^p - M(Z^p, W)\bigr)^2, \qquad E_{\mathrm{train}} = \frac{1}{P}\sum_{p=1}^{P} E^p$$
Fig. 1.1. Gradient-based learning machine (the inputs $Z^p$ are fed to the learning machine $M(Z, W)$ with parameters $W$; its outputs are compared with the desired outputs $D^p$ by the cost function, yielding the errors $E^p$)
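To make the notation concrete, the following minimal sketch (our own toy example, not the authors’ code) evaluates $E^p$ and $E_{\mathrm{train}}$ for a linear model $M(Z, W) = W \cdot Z$; the data and model are placeholders.

```python
import numpy as np

# Toy training set {(Z^1, D^1), ..., (Z^P, D^P)} and a linear model M(Z, W) = W . Z
P = 100
Z = np.random.randn(P, 3)                     # input patterns Z^p
W_true = np.array([0.5, -1.0, 2.0])
D = Z @ W_true + 0.1 * np.random.randn(P)     # noisy desired outputs D^p

def E_p(W, p):
    """Per-pattern cost E^p = 0.5 * (D^p - M(Z^p, W))^2."""
    return 0.5 * (D[p] - Z[p] @ W) ** 2

def E_train(W):
    """Average of the per-pattern errors over the training set."""
    return np.mean([E_p(W, p) for p in range(P)])

print(E_train(np.zeros(3)), E_train(W_true))  # the cost is much smaller near W_true
```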
This chapter is focused on strategies for improving the process of minimizing
the cost function. However, these strategies must be used in conjunction with
methods for maximizing the network’s ability to generalize, that is, to predict
the correct targets for patterns the learning system has not previously seen (e.g.
see chapters 2, 3, 4, 5 for more detail).
To understand generalization, let us consider how backpropagation works. We
start with a set of samples each of which is an input/output pair of the function
to be learned. Since the measurement process is often noisy, there may be errors
in the samples. We can imagine that if we collected multiple sets of samples
then each set would look a little different because of the noise and because
of the different points sampled. Each of these data sets would also result in
networks with minima that are slightly different from each other and from the
true function. In this chapter, we concentrate on improving the process of finding
the minimum for the particular set of examples that we are given. Generalization
techniques try to correct for the errors introduced into the network as a result
of our choice of dataset. Both are important.
Several theoretical efforts have analyzed the process of learning by minimizing
the error on a training set (a process sometimes called Empirical Risk Minimiza-
tion) [40, 41].
Some of those theoretical analyses are based on decomposing the generaliza-
tion error into two terms: bias and variance (see e.g. [12]). The bias is a measure
of how much the network output, averaged over all possible data sets differs from
the desired function. The variance is a measure of how much the network output
varies between datasets. Early in training, the bias is large because the net-
work output is far from the desired function. The variance is very small because
the data has had little influence yet. Late in training, the bias is small because
the network has learned the underlying function. However, if trained too long,
the network will also have learned the noise specific to that dataset. This is re-
ferred to as overtraining. In such a case, the variance will be large because the
noise varies between datasets. It can be shown that the minimum total error will
occur when the sum of bias and variance are minimal.
There are a number of techniques (e.g. early stopping, regularization) for
maximizing the generalization ability of a network when using backprop. Many
of these techniques are described in later chapters 2, 3, 5, 4.
The idea of this chapter, therefore, is to present minimization strategies (given
a cost function) and the tricks associated with increasing the speed and quality
of the minimization. It is however clear that the choice of the model (model
selection), the architecture and the cost function is crucial for obtaining a net-
work that generalizes well. So keep in mind that if the wrong model class is used
and no proper model selection is done, then even a superb minimization will
clearly not help very much. In fact, the existence of overtraining has led several
authors to suggest that inaccurate minimization algorithms can be better than
good ones.
1.3 Standard Backpropagation
Although the tricks and analyses in this paper are primarily presented in the
context of “classical” multi-layer feed-forward neural networks, many of them
also apply to most other gradient-based learning methods.
The simplest form of multilayer learning machine trained with gradient-based learning is simply a stack of modules, each of which implements a function $X_n = F_n(W_n, X_{n-1})$, where $X_n$ is a vector representing the output of the module, $W_n$ is the vector of tunable parameters in the module (a subset of $W$), and $X_{n-1}$ is the module’s input vector (as well as the previous module’s output vector). The input $X_0$ to the first module is the input pattern $Z^p$. If the partial derivative of $E^p$ with respect to $X_n$ is known, then the partial derivatives of $E^p$ with respect to $W_n$ and $X_{n-1}$ can be computed using the backward recurrence

$$\frac{\partial E^p}{\partial W_n} = \frac{\partial F}{\partial W}(W_n, X_{n-1})\,\frac{\partial E^p}{\partial X_n}, \qquad \frac{\partial E^p}{\partial X_{n-1}} = \frac{\partial F}{\partial X}(W_n, X_{n-1})\,\frac{\partial E^p}{\partial X_n} \qquad (1.1)$$

where $\frac{\partial F}{\partial W}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $W$ evaluated at the point $(W_n, X_{n-1})$, and $\frac{\partial F}{\partial X}(W_n, X_{n-1})$ is the Jacobian of $F$ with respect to $X$. The Jacobian of a vector function is a matrix containing the partial derivatives of all the outputs with respect to all the inputs. When the above equations are applied to the modules in reverse order, from layer $N$ to layer 1, all the partial derivatives of the cost function with respect to all the parameters can be computed. This way of computing gradients is known as back-propagation.
Traditional multi-layer neural networks are a special case of the above system where the modules are alternated layers of matrix multiplications (the weights) and component-wise sigmoid functions (the units):

$$Y_n = W_n X_{n-1} \qquad (1.2)$$
$$X_n = F(Y_n) \qquad (1.3)$$

where $W_n$ is a matrix whose number of columns is the dimension of $X_{n-1}$, and number of rows is the dimension of $X_n$. $F$ is a vector function that applies a sigmoid function to each component of its input. $Y_n$ is the vector of weighted sums, or total inputs, to layer $n$.

Applying the chain rule to the equations above, the classical backpropagation equations are obtained:

$$\frac{\partial E^p}{\partial y_n^i} = f'(y_n^i)\,\frac{\partial E^p}{\partial x_n^i} \qquad (1.4)$$
$$\frac{\partial E^p}{\partial w_n^{ij}} = x_{n-1}^j\,\frac{\partial E^p}{\partial y_n^i} \qquad (1.5)$$
$$\frac{\partial E^p}{\partial x_{n-1}^k} = \sum_i w_n^{ik}\,\frac{\partial E^p}{\partial y_n^i} \qquad (1.6)$$
The above equations can also be written in matrix form:

$$\frac{\partial E^p}{\partial Y_n} = F'(Y_n)\,\frac{\partial E^p}{\partial X_n} \qquad (1.7)$$
$$\frac{\partial E^p}{\partial W_n} = X_{n-1}\,\frac{\partial E^p}{\partial Y_n} \qquad (1.8)$$
$$\frac{\partial E^p}{\partial X_{n-1}} = W_n^T\,\frac{\partial E^p}{\partial Y_n} \qquad (1.9)$$

The simplest learning (minimization) procedure in such a setting is the gradient descent algorithm where $W$ is iteratively adjusted as follows:

$$W(t) = W(t-1) - \eta\,\frac{\partial E}{\partial W}. \qquad (1.10)$$
In the simplest case, η is a scalar constant. More sophisticated procedures use
variable η. In other methods η takes the form of a diagonal matrix, or is an
estimate of the inverse Hessian matrix of the cost function (second derivative
matrix) such as in the Newton and Quasi-Newton methods described later in
the chapter. A proper choice of η is important and will be discussed at length
later.
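The sketch below is a minimal illustration of equations (1.2)–(1.3), the matrix-form backward pass (1.7)–(1.9), and the update (1.10) for a two-layer network trained on a single pattern with the mean squared error of Section 1.2. The weight shapes, the use of tanh as the sigmoid-style unit, and all names are our own assumptions, not the authors’ code.

```python
import numpy as np

def f(y):  return np.tanh(y)                  # tanh used here as the sigmoid-style unit
def fp(y): return 1.0 - np.tanh(y) ** 2       # its derivative f'(y)

def train_step(W1, W2, Z, D, eta=0.05):
    """One gradient descent step for a two-layer network on a single pattern (Z, D)."""
    # Forward pass: Y_n = W_n X_{n-1}, X_n = F(Y_n)          (1.2)-(1.3)
    X0 = Z
    Y1 = W1 @ X0; X1 = f(Y1)
    Y2 = W2 @ X1; X2 = f(Y2)
    # Backward pass for E = 0.5 * ||D - X2||^2
    dE_dX2 = X2 - D
    dE_dY2 = fp(Y2) * dE_dX2                  # (1.7)
    dE_dW2 = np.outer(dE_dY2, X1)             # (1.8)
    dE_dX1 = W2.T @ dE_dY2                    # (1.9)
    dE_dY1 = fp(Y1) * dE_dX1
    dE_dW1 = np.outer(dE_dY1, X0)
    # Gradient descent update with scalar learning rate eta   (1.10)
    return W1 - eta * dE_dW1, W2 - eta * dE_dW2

rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((5, 3))        # hidden layer weights
W2 = 0.1 * rng.standard_normal((1, 5))        # output layer weights
W1, W2 = train_step(W1, W2, Z=rng.standard_normal(3), D=np.array([0.5]))
```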
1.4 A Few Practical Tricks
Backpropagation can be very slow particularly for multilayered networks where
the cost surface is typically non-quadratic, non-convex, and high dimensional
with many local minima and/or flat regions. There is no formula to guarantee
that (1) the network will converge to a good solution, (2) convergence is swift, or
(3) convergence even occurs at all. However, in this section we discuss a number
of tricks that can greatly improve the chances of finding a good solution while
also decreasing the convergence time often by orders of magnitude. More detailed
theoretical justifications will be given in later sections.
1.4.1 Stochastic versus Batch Learning
At each iteration, equation (1.10) requires a complete pass through the entire
dataset in order to compute the average or true gradient. This is referred to as
batch learning since an entire “batch” of data must be considered before weights
are updated. Alternatively, one can use stochastic (online) learning where a
single example $\{Z^t, D^t\}$ is chosen (e.g. randomly) from the training set at each iteration $t$. An estimate of the true gradient is then computed based on the error $E^t$ of that example, and then the weights are updated:

$$W(t+1) = W(t) - \eta\,\frac{\partial E^t}{\partial W}. \qquad (1.11)$$
Because this estimate of the gradient is noisy, the weights may not move precisely
down the gradient at each iteration. As we shall see, this “noise” at each iteration
can be advantageous. Stochastic learning is generally the preferred method for
basic backpropagation for the following three reasons:
Advantages of Stochastic Learning
1. Stochastic learning is usually much faster than batch learning.
2. Stochastic learning also often results in better solutions.
3. Stochastic learning can be used for tracking changes.
Stochastic learning is most often much faster than batch learning particularly
on large redundant datasets. The reason for this is simple to show. Consider the
simple case where a training set of size 1000 is inadvertently composed of 10
identical copies of a set with 100 samples. Averaging the gradient over all 1000
patterns gives the exact same result as computing the gradient based on just
the first 100. Thus, batch gradient descent is wasteful because it recomputes
the same quantity 10 times before one parameter update. On the other hand,
stochastic gradient will see a full epoch as 10 iterations through a 100-long
training set. In practice, examples rarely appear more than once in a dataset,
but there are usually clusters of patterns that are very similar. For example in
phoneme classification, all of the patterns for the phoneme /æ/ will (hopefully)
contain much of the same information. It is this redundancy that can make batch
learning much slower than on-line.
Stochastic learning also often results in better solutions because of the noise in
the updates. Nonlinear networks usually have multiple local minima of differing
depths. The goal of training is to locate one of these minima. Batch learning
will discover the minimum of whatever basin the weights are initially placed. In
stochastic learning, the noise present in the updates can result in the weights
jumping into the basin of another, possibly deeper, local minimum. This has
been demonstrated in certain simplified cases [15, 30].
Stochastic learning is also useful when the function being modeled is chang-
ing over time, a quite common scenario in industrial applications where the
data distribution changes gradually over time (e.g. due to wear and tear of the
machines). If the learning machine does not detect and follow the change it is
impossible to learn the data properly and large generalization errors will result.
With batch learning, changes go undetected and we obtain rather bad results
since we are likely to average over several rules, whereas on-line learning – if
operated properly (see below in section 1.4.7) – will track the changes and yield
good approximation results.
Despite the advantages of stochastic learning, there are still reasons why one
might consider using batch learning:
Advantages of Batch Learning
1. Conditions of convergence are well understood.
2. Many acceleration techniques (e.g. conjugate gradient) only op-
erate in batch learning.
3. Theoretical analysis of the weight dynamics and convergence
rates are simpler.
These advantages stem from the same noise that make stochastic learning
advantageous. This noise, which is so critical for finding better local minima
also prevents full convergence to the minimum. Instead of converging to the
exact minimum, the convergence stalls out due to the weight fluctuations. The
size of the fluctuations depends on the degree of noise of the stochastic updates.
The variance of the fluctuations around the local minimum is proportional to the
learning rate η [28, 27, 6]. So in order to reduce the fluctuations we can either
decrease (anneal) the learning rate or have an adaptive batch size. In theory [13, 30, 36, 35] it is shown that the optimal annealing schedule of the learning rate is of the form

$$\eta \sim \frac{c}{t}, \qquad (1.12)$$

where $t$ is the number of patterns presented and $c$ is a constant. In practice, this may be too fast (see chapter 13).
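As a small illustration of such a schedule (the floor constant t0 is our own addition to keep the rate finite early in training, not part of the chapter):

```python
def annealed_learning_rate(t, c=1.0, t0=100):
    """Learning rate of the form eta ~ c / t from (1.12); t0 keeps it bounded early on."""
    return c / max(t, t0)
```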
Another method to remove noise is to use “mini-batches”, that is, start with
a small batch size and increase the size as training proceeds. Møller discusses
one method for doing this [25] and Orr [31] discusses this for linear problems.
However, deciding the rate at which to increase the batch size and which inputs
to include in the small batches is as difficult as determining the proper learning
rate. Effectively the size of the learning rate in stochastic learning corresponds
to the respective size of the mini batch.
Note also that the problem of removing the noise in the data may be less
critical than one thinks because of generalization. Overtraining may occur long
before the noise regime is even reached.
Another advantage of batch training is that one is able to use second order
methods to speed the learning process. Second order methods speed learning
by estimating not just the gradient but also the curvature of the cost surface.
Given the curvature, one can estimate the approximate location of the actual
minimum.
Despite the advantages of batch updates, stochastic learning is still often the
preferred method particularly when dealing with very large data sets because it
is simply much faster.
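To make the contrast concrete, here is a hedged sketch (a toy linear model of our own, not from the chapter) of one epoch of batch learning versus stochastic learning: the batch version makes one update per pass over the data using the average gradient, while the stochastic version makes one noisy update per pattern.

```python
import numpy as np

def grad_p(W, Z, D, p):
    """Gradient of E^p = 0.5 * (D^p - W . Z^p)^2 for a toy linear model."""
    return -(D[p] - Z[p] @ W) * Z[p]

def batch_epoch(W, Z, D, eta):
    # One update per pass over the data, using the average (true) gradient, as in (1.10)
    g = np.mean([grad_p(W, Z, D, p) for p in range(len(D))], axis=0)
    return W - eta * g

def stochastic_epoch(W, Z, D, eta, rng):
    # One noisy update per pattern, patterns visited in random order, as in (1.11)
    for p in rng.permutation(len(D)):
        W = W - eta * grad_p(W, Z, D, p)
    return W
```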
1.4.2 Shuffling the Examples
Networks learn the fastest from the most unexpected sample. Therefore, it is
advisable to choose a sample at each iteration that is the most unfamiliar to
the system. Note, this applies only to stochastic learning since the order of
input presentation is irrelevant for batch¹. Of course, there is no simple way
to know which inputs are information rich, however, a very simple trick that
crudely implements this idea is to simply choose successive examples that are
from different classes since training examples belonging to the same class will
most likely contain similar information.
Another heuristic for judging how much new information a training example
contains is to examine the error between the network output and the target value
when this input is presented. A large error indicates that this input has not been
learned by the network and so contains a lot of new information. Therefore, it
makes sense to present this input more frequently. Of course, by “large” we mean
relative to all of the other training examples. As the network trains, these relative
errors will change and so should the frequency of presentation for a particular
input pattern. A method that modifies the probability of appearance of each
pattern is called an emphasizing scheme.
Choose Examples with Maximum Information Content
1. Shuffle the training set so that successive training examples
never (rarely) belong to the same class.
2. Present input examples that produce a large error more fre-
quently than examples that produce a small error.
¹ The order in which gradients are summed in batch may be affected by roundoff error if there is a significant range of gradient values.
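A rough sketch of the first trick in the box above, interleaving the classes so that successive training examples rarely share a label; the round-robin strategy and names are our own simplification.

```python
import numpy as np
from collections import defaultdict

def class_interleaved_order(labels, rng):
    """Return an ordering of example indices where successive examples rarely share a class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for idx in by_class.values():
        rng.shuffle(idx)                      # random order within each class
    order, pools = [], list(by_class.values())
    while any(pools):
        for idx in pools:                     # round-robin over the classes
            if idx:
                order.append(idx.pop())
    return order

rng = np.random.default_rng(0)
print(class_interleaved_order([0, 0, 0, 1, 1, 2], rng))   # e.g. [0, 3, 5, 2, 4, 1]
```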
However, one must be careful when perturbing the normal frequencies of input
examples because this changes the relative importance that the network places
on different examples. This may or may not be desirable. For example, this
technique applied to data containing outliers can be disastrous because outliers
can produce large errors yet should not be presented frequently. On the other
hand, this technique can be particularly beneficial for boosting the performance
for infrequently occurring inputs, e.g. /z/ in phoneme recognition (see chapter
13, 14).
1.4.3 Normalizing the Inputs
Convergence is usually faster if the average of each input variable over the train-
ing set is close to zero. To see this, consider the extreme case where all the inputs
are positive. Weights to a particular node in the first weight layer are updated
by an amount proportional to δx where δ is the (scalar) error at that node and x
is the input vector (see equations (1.5) and (1.10)). When all of the components
of an input vector are positive, all of the updates of weights that feed into a
node will be the same sign (i.e. sign(δ)). As a result, these weights can only
all decrease or all increase together for a given input pattern. Thus, if a weight
vector must change direction it can only do so by zigzagging which is inefficient
and thus very slow.
In the above example, the inputs were all positive. However, in general, any
shift of the average input away from zero will bias the updates in a particular
direction and thus slow down learning. Therefore, it is good to shift the inputs
so that the average over the training set is close to zero. This heuristic should
be applied at all layers which means that we want the average of the outputs
of a node to be close to zero because these outputs are the inputs to the next
layer [19], chapter 10. This problem can be addressed by coordinating how the
inputs are transformed with the choice of sigmoidal activation function. Here we
discuss the input transformation. The discussion of the sigmoid follows.
Convergence is faster not only if the inputs are shifted as described above but also if they are scaled so that all have about the same covariance, C_i, where

C_i = (1/P) Σ_{p=1}^{P} (z_i^p)² .   (1.13)

Here, P is the number of training examples, C_i is the covariance of the i-th input variable and z_i^p is the i-th component of the p-th training example. Scaling speeds
learning because it helps to balance out the rate at which the weights connected
to the input nodes learn. The value of the covariance should be matched with
that of the sigmoid used. For the sigmoid given below, a covariance of 1 is a
good choice.
The exception to scaling all covariances to the same value occurs when it is
known that some inputs are of less significance than others. In such a case, it
can be beneficial to scale the less significant inputs down so that they are “less
visible” to the learning process.
Transforming the Inputs
1. The average of each input variable over the training set should be close
to zero.
2. Scale input variables so that their covariances are about the same.
3. Input variables should be uncorrelated if possible.
The above two tricks of shifting and scaling the inputs are quite simple to
implement. Another trick that is quite effective but more difficult to implement
is to decorrelate the inputs. Consider the simple network in Figure 1.2. If inputs are uncorrelated then it is possible to solve for the value of w_1 that minimizes the error without any concern for w_2, and vice versa. In other words, the two variables are independent (the system of equations is diagonal). With correlated inputs, one must solve for both simultaneously which is a much harder problem.
Principal component analysis (also known as the Karhunen-Loeve expansion)
can be used to remove linear correlations in inputs [10].
Inputs that are linearly dependent (the extreme case of correlation) may also produce degeneracies which may slow learning. Consider the case where one input is always twice the other input (z_2 = 2z_1). The network output is constant along lines W_2 = v − (1/2)W_1, where v is a constant. Thus, the gradient is zero along these directions (see Figure 1.2). Moving along these lines has absolutely no effect on learning. We are trying to solve in 2-D what is effectively only a 1-D problem. Ideally we want to remove one of the inputs which will decrease the size of the network.
Figure 1.3 shows the entire process of transforming inputs. The steps are
(1) shift inputs so the mean is zero, (2) decorrelate inputs, and (3) equalize
covariances.
Fig. 1.2. Linearly dependent inputs (lines of constant E in the (ω_1, ω_2) weight space)
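The three steps of Figure 1.3 can be sketched in a few lines of Python/NumPy, assuming the training set is an array X of shape (P, n). The function name and the small regularizer eps are our own choices; this is an illustrative sketch, not the chapter's reference implementation.

import numpy as np

def transform_inputs(X, eps=1e-8):
    """Shift, decorrelate and scale inputs as in Figure 1.3."""
    # (1) Mean cancellation: each input variable gets zero mean.
    mean = X.mean(axis=0)
    Xc = X - mean
    # (2) KL expansion (PCA): project onto the eigenvectors of the
    #     input covariance matrix to remove linear correlations.
    cov = (Xc.T @ Xc) / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)
    Xd = Xc @ eigvecs
    # (3) Covariance equalization: scale each decorrelated component
    #     so that its covariance (Eq. 1.13) is about 1.
    Xw = Xd / np.sqrt(eigvals + eps)
    return Xw, (mean, eigvecs, eigvals)

The same mean, eigenvectors and eigenvalues must of course be reused to transform validation and test inputs.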
1.4.4 The Sigmoid
Nonlinear activation functions are what give neural networks their nonlinear
capabilities. One of the most common forms of activation function is the sigmoid
which is a monotonically increasing function that asymptotes at some finite value
as ±∞ is approached. The most common examples are the standard logistic
function f(x) = 1/(1 + e^{−x}) and the hyperbolic tangent f(x) = tanh(x), shown in Figure 1.4. Sigmoids that are symmetric about the origin (e.g. see Figure 1.4b) are preferred for the same reason that inputs should be normalized, namely,
Fig. 1.3. Transformation of inputs: mean cancellation, KL-expansion, covariance equalization
because they are more likely to produce outputs (which are inputs to the next
layer) that are on average close to zero. This is in contrast, say, to the logistic
function whose outputs are always positive and so must have a mean that is
positive.
Fig. 1.4. (a) Not recommended: the standard logistic function, f(x) = 1/(1 + e^{−x}). (b) Hyperbolic tangent, f(x) = 1.7159 tanh((2/3)x).
Sigmoids
1. Symmetric sigmoids such as hyperbolic tangent often converge faster
than the standard logistic function.
2. A recommended sigmoid [19] is: f(x) = 1.7159 tanh((2/3)x). Since the tanh function is sometimes computationally expensive, an approximation of it by a ratio of polynomials can be used instead.
3. Sometimes it is helpful to add a small linear term, e.g. f (x) = tanh(x)+
ax so as to avoid flat spots.
The constants in the recommended sigmoid given above have been chosen so that, when used with transformed inputs (see previous discussion), the variance of the outputs will also be close to 1 because the effective gain of the sigmoid is roughly 1 over its useful range. In particular, this sigmoid has the properties (a) f(±1) = ±1, (b) the second derivative is a maximum at x = 1, and (c) the effective gain is close to 1.
One of the potential problems with using symmetric sigmoids is that the
error surface can be very flat near the origin. For this reason it is good to avoid
initializing with very small weights. Because of the saturation of the sigmoids,
the error surface is also flat far from the origin. Adding a small linear term to
the sigmoid can sometimes help avoid the flat regions (see chapter 9).
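For concreteness, a short sketch of the recommended sigmoid and of the variant with a small linear term; the constant a and the function names are illustrative choices, not prescribed by the text.

import numpy as np

def sigmoid(x):
    """Recommended sigmoid f(x) = 1.7159 tanh(2x/3):
    f(+-1) = +-1 and the effective gain is close to 1."""
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

def sigmoid_with_twist(x, a=0.01):
    """Adding a small linear ("twisting") term helps avoid flat spots."""
    return 1.7159 * np.tanh(2.0 / 3.0 * x) + a * x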
1.4.5 Choosing Target Values
In classification problems, target values are typically binary (e.g. {-1,+1}). Com-
mon wisdom might seem to suggest that the target values be set at the value of
the sigmoid’s asymptotes. However, this has several drawbacks.
First, instabilities can result. The training process will try to drive the output
as close as possible to the target values, which can only be achieved asymptoti-
cally. As a result, the weights (output and even hidden) are driven to larger and
larger values where the sigmoid derivative is close to zero. The very large weights
increase the gradients, however, these gradients are then multiplied by an exponentially small sigmoid derivative (except when a twisting term² is added to the sigmoid) producing a weight update close to zero. As a result, the weights may become stuck.
Second, when the outputs saturate, the network gives no indication of confi-
dence level. When an input pattern falls near a decision boundary the output
class is uncertain. Ideally this should be reflected in the network by an out-
put value that is in between the two possible target values, i.e. not near either
asymptote. However, large weights tend to force all outputs to the tails of the sig-
moid regardless of the uncertainty. Thus, the network may predict a wrong class
without giving any indication of its low confidence in the result. Large weights
that saturate the nodes make it impossible to differentiate between typical and
nontypical examples.
A solution to these problems is to set the target values to be within the range of
the sigmoid, rather than at the asymptotic values. Care must be taken, however,
to insure that the node is not restricted to only the linear part of the sigmoid.
Setting the target values to the point of the maximum second derivative on the
sigmoid is the best way to take advantage of the nonlinearity without saturating
the sigmoid. This is another reason the sigmoid in Figure 1.4b is a good choice.
It has maximum second derivative at ±1 which correspond to the binary target
values typical in classification problems.
Targets
Choose target values at the point of the maximum second derivative on the
sigmoid so as to avoid saturating the output units.
² A twisting term is a small linear term added to the node output, e.g. f(x) = tanh(x) + ax.
1.4.6 Initializing the Weights
The starting values of the weights can have a significant effect on the training
process. Weights should be chosen randomly but in such a way that the sig-
moid is primarily activated in its linear region. If weights are all very large then
the sigmoid will saturate resulting in small gradients that make learning slow.
If weights are very small then gradients will also be very small. Intermediate
weights that range over the sigmoid’s linear region have the advantage that (1)
the gradients are large enough that learning can proceed and (2) the network
will learn the linear part of the mapping before the more difficult nonlinear part.
Achieving this requires coordination between the training set normalization,
the choice of sigmoid, and the choice of weight initialization. We start by requir-
ing that the distribution of the outputs of each node have a standard deviation
(σ) of approximately 1. This is achieved at the input layer by normalizing the
training set as described earlier. To obtain a standard deviation close to 1 at
the output of the first hidden layer we just need to use the above recommended
sigmoid together with the requirement that the input to the sigmoid also have a standard deviation σ_y = 1. Assuming the inputs y_i to a unit are uncorrelated with variance 1, the standard deviation of the unit's weighted sum will be

σ_{y_i} = ( Σ_j w_{ij}² )^{1/2} .   (1.14)
Thus, to insure that the σ_{y_i} are approximately 1, the weights should be randomly drawn from a distribution with mean zero and a standard deviation given by

σ_w = m^{−1/2}   (1.15)

where m is the number of inputs to the unit.
Initializing Weights
Assuming that:
1. the training set has been normalized, and
2. the sigmoid from Figure 1.4b has been used
then weights should be randomly drawn from a distribution (e.g. uniform) with mean zero and standard deviation

σ_w = m^{−1/2}   (1.16)

where m is the fan-in (the number of connections feeding into the node).
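A minimal sketch of this initialization rule, assuming normalized inputs and the sigmoid of Figure 1.4b; the choice of a uniform distribution and the array layout are our own.

import numpy as np

def init_weights(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Draw weights with mean zero and standard deviation m^(-1/2),
    where m is the fan-in (Eq. 1.16). A uniform distribution on
    [-b, b] has standard deviation b/sqrt(3), so b = sqrt(3/fan_in)."""
    sigma = fan_in ** -0.5
    bound = np.sqrt(3.0) * sigma
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))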
1.4.7 Choosing Learning Rates
There is at least one well-principled method (described in section 1.9.2) for esti-
mating the ideal learning rate η. Many other schemes (most of them rather em-
pirical) have been proposed in the literature to automatically adjust the learning
rate. Most of those schemes decrease the learning rate when the weight vector
“oscillates”, and increase it when the weight vector follows a relatively steady di-
rection. The main problem with these methods is that they are not appropriate
for stochastic gradient or on-line learning because the weight vector fluctuates
all the time.
Beyond choosing a single global learning rate, it is clear that picking a different learning rate η_i for each weight can improve the convergence. A well-principled way of doing this, based on computing second derivatives, is described in section 1.9.1. The main philosophy is to make sure that all the weights in the network converge roughly at the same speed.
Depending upon the curvature of the error surface, some weights may require a
small learning rate in order to avoid divergence, while others may require a large
learning rate in order to converge at a reasonable speed. Because of this, learning
rates in the lower layers should generally be larger than in the higher layers (see
Figure 1.21). This corrects for the fact that in most neural net architectures, the
second derivative of the cost function with respect to weights in the lower layers
is generally smaller than that of the higher layers. The rationale for the above
heuristics will be discussed in more detail in later sections along with suggestions
for how to choose the actual value of the learning rate for the different weights
(see section 1.9.1).
If shared weights are used such as in time-delay neural networks (TDNN) [42]
or convolutional networks [20], the learning rate should be proportional to the
square root of the number of connections sharing that weight, because we know
that the gradients are a sum of more-or-less independent terms.
Equalize the Learning Speeds
– give each weight its own learning rate
– learning rates should be proportional to the square root of the number of inputs to the unit
– learning rates in lower layers should typically be larger than in the higher layers
Other tricks for improving the convergence include:
Momentum. Momentum,

Δw(t+1) = −η ∂E_{t+1}/∂w + μ Δw(t),

can increase speed when the cost surface is highly nonspherical because it damps the size of the steps along directions of high curvature thus yielding a larger effective learning rate along the directions of low curvature [43] (μ denotes the strength of the momentum term). It has been claimed that momentum generally helps more in batch mode than in stochastic mode, but no systematic study of this is known to the authors.
Adaptive Learning Rates. Many authors, including Sompolinsky et al. [37],
Darken & Moody [9], Sutton [38], Murata et al. [28] have proposed rules for
automatically adapting the learning rates (see also [16]). These rules control the
speed of convergence by increasing or decreasing the learning rate based on the
error.
We assume the following facts for a learning rate adaptation algorithm: (1) the
smallest eigenvalue of the Hessian (see Eq.(1.27)) is sufficiently smaller than the
second smallest eigenvalue and (2) therefore after a large number of iterations,
the parameter vector w(t) will approach the minimum from the direction of the
minimum eigenvector of the Hessian (see Eq.(1.27), Figure 1.5). Under these
conditions the evolution of the estimated parameter can be thought of as a one-
dimensional process and the minimum eigenvector v can be approximated (for
a large number of iterations: see Figure 1.5) by
v = (∂E/∂w) / ‖∂E/∂w‖,

where ‖·‖ denotes the L_2 norm. Hence we can adopt the projection

ξ = v^T ∂E/∂w = ‖∂E/∂w‖

onto the approximated minimum eigenvector v as a one-dimensional measure of the distance to the minimum. This distance can be used to control the learning rate (for details see [28])

w(t+1) = w(t) − η(t) ∂E_t/∂w ,   (1.17)
r(t+1) = (1 − δ) r(t) + δ ∂E_t/∂w ,   (0 < δ < 1)   (1.18)
η(t+1) = η(t) + α η(t) ( β ‖r(t+1)‖ − η(t) ) ,   (1.19)

where δ controls the leak size of the average, α and β are constants, and r is an auxiliary variable used to calculate the leaky average of the gradient ∂E/∂w.
Note that this set of rules is easy to compute and straightforward to imple-
ment. We simply have to keep track of an additional vector in Eq.(1.18): the
averaged gradient r. The norm of this vector then controls the size of the learn-
ing rate (see Eq.(1.19)). The algorithm follows the simple intuition: far away
from the minimum (large distance ξ) it proceeds in big steps and close to the
minimum it anneals the learning rate (for theoretical details see [28]).
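A sketch of these adaptation rules (Eqs. 1.17–1.19) written for a generic parameter vector and a per-pattern gradient function; the function name, the helper grad and the constants are illustrative assumptions, not values recommended by the chapter.

import numpy as np

def adaptive_lr_sgd(w, grad, patterns, eta=0.1, delta=0.05,
                    alpha=0.02, beta=1.0):
    """Stochastic gradient descent with the learning rate adaptation
    of Eqs. (1.17)-(1.19): r is a leaky average of the gradient and
    its norm controls the learning rate."""
    r = np.zeros_like(w)
    for x in patterns:
        g = grad(w, x)
        w = w - eta * g                                              # Eq. (1.17)
        r = (1.0 - delta) * r + delta * g                            # Eq. (1.18)
        eta = eta + alpha * eta * (beta * np.linalg.norm(r) - eta)   # Eq. (1.19)
    return w, eta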
1.4.8 Radial Basis Functions vs Sigmoid Units
Although most systems use nodes based on dot products and sigmoids, many
other types of units (or layers) can be used. A common alternative is the radial
basis function (RBF) network (see [7, 26, 5, 32]). In RBF networks, the dot product of the weight and input vector is replaced with a Euclidean distance
Fig. 1.5. Convergence of the flow. During the final stage of learning the average flow is approximately one dimensional towards the minimum w* and it is a good approximation of the minimum eigenvalue direction of the Hessian.
between the input and weight and the sigmoid is replaced by an exponential.
The output activity is computed, e.g. for one output, as

g(x) = Σ_{i=1}^{N} w_i exp( −(1/(2σ_i²)) ‖x − ν_i‖² ),

where ν_i (σ_i) is the mean (standard deviation) of the i-th Gaussian. These units
can replace or coexist with the standard units and they are usually trained by
combination of gradient descent (for output units) and unsupervised clustering
for determining the means and widths of the RBF units.
Unlike sigmoidal units which can cover the entire space, a single RBF unit
covers only a small local region of the input space. This can be an advantage
because learning can be faster. RBF units may also form a better set of basis
functions to model the input space than sigmoid units, although this is highly
problem dependent (see chapter 7). On the negative side, the locality property of RBFs may be a disadvantage particularly in high dimensional spaces because many units are needed to cover the space. RBFs are more appropriate in (low dimensional) upper layers and sigmoids in (high dimensional) lower layers.
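For reference, a sketch of the RBF output computation given above; the argument names are ours.

import numpy as np

def rbf_output(x, weights, means, sigmas):
    """g(x) = sum_i w_i exp(-||x - nu_i||^2 / (2 sigma_i^2)),
    i.e. a weighted sum of N Gaussian basis functions."""
    dists = np.sum((means - x) ** 2, axis=1)      # ||x - nu_i||^2 for each unit
    return np.sum(weights * np.exp(-dists / (2.0 * sigmas ** 2)))

# Example with two RBF units in a 2-D input space:
# rbf_output(np.array([0.5, -0.2]),
#            weights=np.array([1.0, -0.5]),
#            means=np.array([[0.0, 0.0], [1.0, 1.0]]),
#            sigmas=np.array([0.5, 1.0]))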
1.5 Convergence of Gradient Descent
1.5.1 A Little Theory
In this section we examine some of the theory behind the tricks presented earlier.
We begin in one dimension where the update equation for gradient descent can
be written as

W(t+1) = W(t) − η dE(W)/dW .   (1.20)
We would like to know how the value of η affects convergence and the learning
speed. Figure 1.6 illustrates the learning behavior for several different sizes of η
when the weight W starts out in the vicinity of a local minimum. In one dimension, it is easy to define the optimal learning rate, η_opt, as being the learning rate that will move the weight to the minimum, W_min, in precisely one step (see Figure 1.6(i)b). If η is smaller than η_opt then the stepsize will be smaller and convergence will take multiple timesteps. If η is between η_opt and 2η_opt then the weight will oscillate around W_min but will eventually converge (Figure 1.6(i)c). If η is more than twice the size of η_opt (Figure 1.6(i)d) then the stepsize is so large that the weight ends up farther from W_min than before. Divergence results.
Fig. 1.6. Gradient descent for different learning rates: (i) panels a–d show E(ω) for η < η_opt, η = η_opt, η_opt < η < 2η_opt and η > 2η_opt; (ii) the gradient dE/dω as a function of ω
What is the optimal value of the learning rate η_opt? Let us first consider the case in 1-dimension. Assuming that E can be approximated by a quadratic function, η_opt can be derived by first expanding E in a Taylor series about the current weight, W_c:

E(W) = E(W_c) + (W − W_c) dE(W_c)/dW + (1/2)(W − W_c)² d²E(W_c)/dW² + ... ,   (1.21)
where we use the shorthand dE(W_c)/dW ≡ dE/dW evaluated at W = W_c. If E is quadratic the second order derivative is constant and the higher order terms vanish. Differentiating both sides with respect to W then gives

dE(W)/dW = dE(W_c)/dW + (W − W_c) d²E(W_c)/dW² .   (1.22)
Setting W = W_min and noting that dE(W_min)/dW = 0, we are left after rearranging with

W_min = W_c − ( d²E(W_c)/dW² )^{−1} dE(W_c)/dW .   (1.23)
Comparing this with the update equation (1.20), we find that we can reach a minimum in one step if

η_opt = ( d²E(W_c)/dW² )^{−1} .   (1.24)

Perhaps an easier way to obtain this same result is illustrated in Figure 1.6(ii). The bottom graph plots the gradient of E as a function of W. Since E is quadratic, the gradient is simply a straight line with value zero at the minimum and ∂E(W_c)/∂W at the current weight W_c. The second derivative ∂²E/∂W² is simply the slope of this line and is computed using the standard slope formula

∂²E/∂W² = ( ∂E(W_c)/∂W − 0 ) / ( W_c − W_min ) .   (1.25)

Solving for W_min then gives equation (1.23).
While the learning rate that gives fastest convergence is η_opt, the largest learning rate that can be used without causing divergence is (also see Figure 1.6(i)d)

η_max = 2 η_opt .   (1.26)

If E is not exactly quadratic then the higher order terms in equation (1.21) are not precisely zero and (1.23) is only an approximation. In such a case, it may take multiple iterations to locate the minimum even when using η_opt, however, convergence can still be quite fast.
In multiple dimensions, determining η_opt is a bit more difficult because the right side of (1.24) is a matrix H^{−1}, where H is called the Hessian, whose components are given by

H_{ij} ≡ ∂²E / (∂W_i ∂W_j)   (1.27)

with 1 ≤ i, j ≤ N, and N equal to the total number of weights.
H is a measure of the curvature of E. In two dimensions, the lines of constant
E for a quadratic cost are oval in shape as shown in Figure 1.7. The eigenvec-
tors of H point in the directions of the major and minor axes. The eigenvalues
measure the steepness of E along the corresponding eigendirection.
Example. In the least mean square (LMS) algorithm, we have a single layer linear network with error function

E(W) = (1/(2P)) Σ_{p=1}^{P} ( d^p − Σ_i w_i x_i^p )²   (1.28)

where P is the number of training vectors. The Hessian in this case turns out to be the same as the covariance matrix of the inputs,

H = (1/P) Σ_p x^p (x^p)^T .   (1.29)
Fig. 1.7. Lines of constant E: (a) eigenvectors of H rotated away from the coordinate axes, (b) eigenvectors of H aligned with the coordinate axes
Fig. 1.8. For the LMS algorithm, the eigenvectors and eigenvalues of H measure the spread of the inputs in input space
Thus, each eigenvalue of H is also a measure of the covariance or spread of the
inputs along the corresponding eigendirection as shown in Figure 1.8.
Using a scalar learning rate is problematic in multiple dimensions. We want
η to be large so that convergence is fast along the shallow directions of E (small
eigenvalues of H), however, if η is too large the weights will diverge along the
steep directions (large eigenvalues of H). To see this more specifically, let us
again expand E, but this time about a minimum

E(W) ≈ E(W_min) + (1/2) (W − W_min)^T H(W_min) (W − W_min) .   (1.30)
Differentiating (1.30) and using the result in the update equation (1.20) gives

W(t+1) = W(t) − η ∂E(t)/∂W   (1.31)
       = W(t) − η H(W_min) (W(t) − W_min) .   (1.32)

Subtracting W_min from both sides gives

( W(t+1) − W_min ) = ( I − η H(W_min) ) ( W(t) − W_min ) .   (1.33)
If the prefactor (I − ηH(W_min)) is a matrix transformation that always shrinks a vector (i.e. its eigenvalues all have magnitude less than 1) then the update equation will converge.
How does this help with choosing the learning rates? Ideally we want different
learning rates along the different eigendirections. This is simple if the eigendi-
rections are lined up with the coordinate axes of the weights. In such a case,
the weights are uncoupled and we can assign each weight its own learning rate
based on the corresponding eigenvalue. However, if the weights are coupled then
we must first rotate H such that H is diagonal, i.e. the coordinate axes line up
with the eigendirections (see Figure 1.7b). This is the purpose of diagonalizing
the Hessian discussed earlier.
Let Θ be the rotation matrix such that

Λ = Θ H Θ^T   (1.34)

where Λ is diagonal and Θ^T Θ = I. The cost function then can be written as

E(W) ≈ E(W_min) + (1/2) (W − W_min)^T Θ^T [ Θ H(W_min) Θ^T ] [ Θ (W − W_min) ] .   (1.35)
Making a change of coordinates to ν = Θ(W − W_min) simplifies the above equation to

E(ν) ≈ E(0) + (1/2) ν^T Λ ν   (1.36)

and the transformed update equation becomes

ν(t+1) = (I − ηΛ) ν(t) .   (1.37)

Note that I − ηΛ is diagonal with diagonal components 1 − ηλ_i. This equation will converge if |1 − ηλ_i| < 1, i.e. η < 2/λ_i for all i. If constrained to have a single scalar learning rate for all weights then we must require

η < 2/λ_max   (1.38)

in order to avoid divergence, where λ_max is the largest eigenvalue of H. For fastest convergence we have

η_opt = 1/λ_max .   (1.39)
If λ_min is a lot smaller than λ_max then convergence will be very slow along the λ_min direction. In fact, convergence time is proportional to the condition number κ ≡ λ_max/λ_min, so that it is desirable to have as small an eigenvalue spread as possible.
However, since we have rotated H to be aligned with the coordinate axes, (1.37) consists actually of N independent 1-dimensional equations. Therefore, we can choose a learning rate for each weight independent of the others. We see that the optimal rate for the i-th weight ν_i is η_{opt,i} = 1/λ_i.
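These bounds are easy to check numerically on a small quadratic cost. The sketch below uses the eigenvalues 0.84 and 0.036 from the linear network example of section 1.5.2 as a diagonal Hessian; the code itself is an illustration, not part of the original text.

import numpy as np

# Quadratic cost E(w) = 1/2 w^T H w with an ill-conditioned Hessian.
H = np.array([[0.84, 0.0],
              [0.0, 0.036]])
lam = np.linalg.eigvalsh(H)
eta_opt = 1.0 / lam.max()      # fastest single scalar rate (Eq. 1.39)
eta_max = 2.0 / lam.max()      # divergence threshold (Eq. 1.38)

def gradient_descent(eta, steps=50):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - eta * (H @ w)   # gradient of the quadratic cost is H w
    return w

print(eta_opt, eta_max)
# Slightly below eta_max the iterates still converge (slowly along the
# small-eigenvalue direction); above eta_max they diverge.
print(gradient_descent(0.9 * eta_max), gradient_descent(1.1 * eta_max))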
1.5.2 Examples
Linear Network. Figure 1.10 displays a set of 100 examples drawn from two
Gaussian distributed classes centered at (-0.4,-0.8) and (0.4,0.8). The eigenvalues
of the covariance matrix are 0.84 and 0.036. We train a single layer linear network
with 2 inputs, 1 output, 2 weights, and 1 bias (see Figure (1.9)) using the LMS
algorithm in batch mode. Figure 1.11 displays the weight trajectory and error during learning when using learning rates of η = 1.5 and 2.5. Note that any learning rate above (see Eq. 1.38) η_max = 2/λ_max = 2/0.84 = 2.38 will cause divergence, as is evident for η = 2.5.
Fig. 1.9. Simple linear network
Fig. 1.10. Two classes drawn from Gaussian distributions centered at (-0.4,-0.8) and (0.4,0.8)
Figure 1.12 shows the same example using stochastic instead of batch mode
learning. Here, a learning rate of η =0.2 is used. One can see that the trajectory
is much noisier than in batch mode since only an estimate of the gradient is
used at each iteration. The cost is plotted as a function of epoch. An epoch
here is simply defined as 100 input presentations which, for stochastic learning,
corresponds to 100 weight updates. In batch, an epoch corresponds to one weight
update.
Multilayer Network. Figure 1.14 shows the architecture for a very simple
multilayer network. It has 1 input, 1 hidden, and 1 output node. There are
2 weights and 2 biases. The activation function is f (x)=1.71 tanh((2/3)x).
The training set contains 10 examples from each of 2 classes. Both classes are
Gaussian distributed with standard deviation 0.4. Class 1 has a mean of -1 and
class 2 has a mean of +1. Target values are -1 for class 1 and +1 for class 2.
Figure 1.13 shows the stochastic trajectory for the example.
Fig. 1.11. Weight trajectory and error curve during learning for (a) η = 1.5 and (b) η = 2.5
Fig. 1.12. Weight trajectory and error curve during stochastic learning for η = 0.2 (the batch trajectory is shown for comparison)
Fig. 1.13. Weight trajectories and errors for 1-1-1 network trained using stochastic learning
Fig. 1.14. The minimal multilayer network
1.5.3 Input Transformations and Error Surface Transformations
Revisited
We can use the results of the previous section to justify several of the tricks
discussed earlier.
Subtract the means from the input variables
The reason for the above trick is that a nonzero mean in the input variables
creates a very large eigenvalue. This means the condition number will be large,
i.e. the cost surface will be steep in some directions and shallow in others so that
convergence will be very slow. The solution is to simply preprocess the inputs
by subtracting their means.
For a single linear neuron, the eigenvectors of the Hessian (with means sub-
tracted) point along the principal axes of the cloud of training vectors (recall
Figure 1.8). Inputs that have a large variation in spread along different directions
of the input space will have a large condition number and slow learning. And so
we recommend:
Normalize the variances of the input variables.
If the input variables are correlated, this will not make the error surface
spherical, but it will possibly reduce its eccentricity.
Correlated input variables usually cause the eigenvectors of H to be rotated
away from the coordinate axes (Figure 1.7a versus 1.7b) thus weight updates
are not decoupled. Decoupled weights make the “one learning rate per weight”
method optimal, thus, we have the following trick:
Decorrelate the input variables.
Now suppose that the input variables of a neuron have been decorrelated. The Hessian for this neuron is then diagonal and its eigenvectors point along the coordinate axes. In such a case the gradient is not the best descent direction, as can be seen in Fig 1.7b: at the point P, an arrow shows that the gradient does not point towards the minimum. However, if we instead assign each weight its own learning rate (equal to the inverse of the corresponding eigenvalue) then the
descent direction will be in the direction of the other arrow that points directly
towards the minimum:
Use a separate learning rate for each weight.
1.6 Classical Second Order Optimization Methods
In the following we will briefly introduce the Newton, conjugate gradient, Gauss-
Newton, Levenberg Marquardt and the Quasi-Newton (BFGS) method (see also
[11, 34, 3, 5]).
Fig. 1.15. Sketch of the whitening properties of the Newton algorithm: the Newton algorithm in the original coordinates behaves like gradient descent in the whitened coordinates
1.6.1 Newton Algorithm
To get an understanding of the Newton method let us recapitulate the results
from section 1.5.1. Assuming a quadratic loss function E (see Eq.(1.21)) as
depicted in Figure 1.6(ii), we can compute the weight update along the lines of Eq.(1.21)-(1.23)

Δw = −η ( ∂²E/∂w² )^{−1} ∂E/∂w = −η H(w)^{−1} ∂E/∂w ,   (1.40)

where η must be chosen in the range 0 < η ≤ 1 since E is in practice not perfectly quadratic. In this equation information about the Hessian H is taken into account. If the error function was quadratic one step would be sufficient to converge.
Usually the energy surface around the minimum is rather ellipsoid, or in the extreme like a taco shell, depending on the conditioning of the Hessian. A whitening transform, well known from the signal processing literature [29], can change this ellipsoid shape to a spherical shape through u = ΘΛ^{1/2} w (see Figure 1.15 and Eq.(1.34)). So the inverse Hessian in Eq.(1.40) basically spheres out the error surface locally. The following two approaches can be shown to be equivalent: (a) use the Newton algorithm in an untransformed weight space and (b) do usual gradient descent in a whitened coordinate system (see Figure 1.15) [19].

Fig. 1.16. Sketch of conjugate gradient directions in a 2D error surface (first descent direction, gradients and the conjugate direction)
Summarizing, the Newton algorithm converges in one step if the error function
is quadratic and (unlike gradient descent) it is invariant with respect to linear
transformations of the input vectors. This means that the convergence time is
not affected by shifts, scaling and rotation of input vectors. However, one of the main drawbacks is that an N × N Hessian matrix must be stored and inverted, which takes O(N³) per iteration and is therefore impractical for more than a few variables. Since the error function is in general non-quadratic, there is no guarantee of convergence. If the Hessian is not positive definite (if it has some zero or even negative eigenvalues where the error surface is flat or some directions are curved downward), then the Newton algorithm will diverge, so the Hessian must be positive definite. Of course the Hessian matrix of multilayer networks is in general not positive definite everywhere. For these reasons the Newton
algorithm in its original form is not usable for general neural network learning.
However it gives good insights for developing more sophisticated algorithms, as
discussed in the following.
1.6.2 Conjugate Gradient
There are several important properties in conjugate gradient optimization: (1) it is an O(N) method, (2) it doesn't use the Hessian explicitly, (3) it attempts to find descent directions that try to minimally spoil the result achieved in the previous iterations, (4) it uses a line search, and most importantly, (5) it works only for batch learning.
The third property is shown in Figure 1.16. Assume we pick a descent di-
rection, e.g. the gradient, then we minimize along a line in this direction (line
search). Subsequently we should try to find a direction along which the gradi-
ent does not change its direction, but merely its length (conjugate direction),
because moving along this direction will not spoil the result of the previous it-
eration. The evolution of the descent directions ρ_k at iteration k is given as

ρ_k = −∇E(w_k) + β_k ρ_{k−1} ,   (1.41)
where the choice of β_k can be done either according to Fletcher and Reeves [34]

β_k = ∇E(w_k)^T ∇E(w_k) / ( ∇E(w_{k−1})^T ∇E(w_{k−1}) )   (1.42)

or Polak and Ribiere

β_k = ( ∇E(w_k) − ∇E(w_{k−1}) )^T ∇E(w_k) / ( ∇E(w_{k−1})^T ∇E(w_{k−1}) ) .   (1.43)
Two directions ρ_k and ρ_{k−1} are defined as conjugate if

ρ_k^T H ρ_{k−1} = 0,

i.e. conjugate directions are orthogonal directions in the space of an identity Hessian matrix (see Figure 1.17). Very important for convergence in both choices is
a good line search procedure. For a perfectly quadratic function with N variables
a convergence within N steps can be proved. For non-quadratic functions Po-
lak and Ribiere’s choice seems more robust. Conjugate gradient (1.41) can also
be viewed as a smart choice for choosing the momentum term known in neural
network training. It has been applied with large success in multi-layer network
training on problems that are moderate sized with rather low redundancy in the
data. Typical applications range from function approximation, robotic control
[39], time-series prediction and other real valued problems where high accuracy
is wanted. Clearly on large and redundant (classification) problems stochastic
backpropagation is faster. Although attempts have been made to define mini-batches [25], the main disadvantage of conjugate gradient methods remains that they are batch methods (partly due to the precision requirements of the line search procedure).
Fig. 1.17. Sketch of conjugate gradient directions ρ_{k−1} and ρ_k in a 2D error surface
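A compact sketch of batch conjugate gradient with the Polak-Ribiere choice (Eq. 1.43) is given below; the crude backtracking line search stands in for the much more precise line search the text calls for, and all names are illustrative.

import numpy as np

def conjugate_gradient(E, gradE, w, iters=100):
    """Batch conjugate gradient with Polak-Ribiere beta (Eq. 1.43)
    and a crude backtracking line search along each direction."""
    g = gradE(w)
    rho = -g                                   # first direction: steepest descent
    for _ in range(iters):
        # Backtracking line search along rho (a real implementation
        # needs a much more careful line search).
        step, e0 = 1.0, E(w)
        while E(w + step * rho) > e0 and step > 1e-12:
            step *= 0.5
        w = w + step * rho
        g_new = gradE(w)
        beta = max(0.0, (g_new - g) @ g_new / (g @ g))   # Polak-Ribiere (restarted)
        rho = -g_new + beta * rho                        # Eq. (1.41)
        g = g_new
    return w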
1.6.3 Quasi-Newton (BFGS)
The Quasi-Newton (BFGS) method (1) iteratively computes an estimate of the inverse Hessian, (2) is an O(N²) algorithm, (3) requires line search and (4) works only for batch learning.
The positive definite estimate of the inverse Hessian is computed directly without requiring matrix inversion and by only using gradient information. Algorithmically this can be described as follows: (1) first a positive definite matrix M is chosen, e.g. M = I, (2) then the search direction is set to

ρ(t) = M(t) ∇E(w(t)),

(3) a line search is performed along ρ, which gives the update for the parameters at time t

w(t) = w(t−1) − η(t) ρ(t).
Finally (4) the estimate of the inverse Hessian is updated. Compared to the
Newton algorithm the Quasi-Newton approach only needs gradient information.
The most successful Quasi-Newton algorithm is the Broyden-Fletcher-Goldfarb-
Shanno (BFGS) method. The update rule for the estimate of the inverse Hessian
is
M(t) = M(t−1) + ( 1 + φ^T M φ / (δ^T φ) ) δ δ^T / (δ^T φ) − ( δ φ^T M + M φ δ^T ) / (δ^T φ) ,   (1.44)

where some abbreviations have been used for the following N × 1 vectors

φ = ∇E(w(t)) − ∇E(w(t−1)),   δ = w(t) − w(t−1).   (1.45)
Although, as mentioned above, the complexity is only O(N²), we are still required to store an N × N matrix, so the algorithm is only practical for small networks with non-redundant training sets. Recently, variants have been proposed that aim to reduce storage requirements (see e.g. [3]).
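The update rule (1.44) translates almost line by line into code. In the sketch below, M is the current inverse Hessian estimate (e.g. the identity at the start) and phi, delta are the gradient and parameter differences of Eq. (1.45); the function name is ours.

import numpy as np

def bfgs_update(M, phi, delta):
    """Update the inverse Hessian estimate M using Eq. (1.44),
    where phi is the change in gradient and delta the change in
    the parameters (Eq. 1.45)."""
    dTp = delta @ phi                      # scalar delta^T phi
    dd = np.outer(delta, delta)            # delta delta^T
    dpM = np.outer(delta, phi) @ M         # delta phi^T M
    Mpd = M @ np.outer(phi, delta)         # M phi delta^T
    return M + (1.0 + phi @ M @ phi / dTp) * dd / dTp - (dpM + Mpd) / dTp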
1.6.4 Gauss-Newton and Levenberg Marquardt
The Gauss-Newton and Levenberg Marquardt algorithms (1) use the square Jacobian approximation, (2) are mainly designed for batch learning, (3) have a complexity of O(N³) and (4) most importantly, they work only for mean squared error loss functions. The Gauss-Newton algorithm is like the Newton algorithm, however the Hessian is approximated by the square of the Jacobian (see also section 1.7.2 for a further discussion)

Δw = −( Σ_p (∂f(w, x^p)/∂w)^T (∂f(w, x^p)/∂w) )^{−1} ∇E(w).   (1.46)
The Levenberg Marquardt method is like the Gauss-Newton above, but it has a regularization parameter μ that prevents it from blowing up if some eigenvalues are small

Δw = −( Σ_p (∂f(w, x^p)/∂w)^T (∂f(w, x^p)/∂w) + μI )^{−1} ∇E(w),   (1.47)
where I denotes the identity matrix. The Gauss-Newton method is valid for quadratic cost functions; however, a similar procedure also works with the Kullback-Leibler cost and is called Natural Gradient (see e.g. [1, 44, 2]).
1.7 Tricks to Compute the Hessian Information in
Multilayer Networks
We will now discuss several techniques aimed at computing full or partial Hessian
information by (a) finite difference method, (b) square Jacobian approximation
(for Gauss-Newton and Levenberg-Marquardt algorithm), (c) computation of
the diagonal of the Hessian and (d) by obtaining a product of the Hessian and
a vector without computing the Hessian. Other semi-analytical techniques that
allow the computation of the full Hessian are omitted because they are rather
complicated and also require many forward/backward propagation steps [5, 8].
1.7.1 Finite Difference
We can write the k-th row of the Hessian as

H^{(k)} = ∂(∇E(w))/∂w_k ≈ ( ∇E(w + δ φ_k) − ∇E(w) ) / δ ,

where φ_k = (0, 0, 0, ..., 1, ..., 0) is a vector of zeros with only one 1 at the k-th position. This can be implemented with a simple recipe: (1) compute the total
gradient by multiple forward and backward propagation steps. (2) Add δ to the
k-th parameter and compute again the gradient, and finally (3) subtract both
results and divide by δ. Due to numerical errors in this computation scheme the
resulting Hessian might not be perfectly symmetric. In this case it should be
symmetrized as described below.
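This recipe is a few lines of code once a gradient routine is available; in the sketch below, gradE stands in for the forward/backward propagation passes and all names are ours.

import numpy as np

def finite_difference_hessian(gradE, w, delta=1e-5):
    """Estimate the Hessian row by row,
    H[k] ~ (gradE(w + delta*e_k) - gradE(w)) / delta,
    then symmetrize to remove numerical asymmetry."""
    g0 = gradE(w)                       # (1) total gradient at w
    H = np.zeros((len(w), len(w)))
    for k in range(len(w)):
        w_pert = w.copy()
        w_pert[k] += delta              # (2) perturb the k-th parameter
        H[k] = (gradE(w_pert) - g0) / delta   # (3) difference quotient
    return 0.5 * (H + H.T)              # symmetrize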
1.7.2 Square Jacobian Approximation for the Gauss-Newton and
Levenberg-Marquardt Algorithms
Assuming a mean squared cost function

E(w) = (1/2) Σ_p ( d^p − f(w, x^p) )^T ( d^p − f(w, x^p) )   (1.48)
then the gradient is

∂E(w)/∂w = −Σ_p ( d^p − f(w, x^p) )^T ∂f(w, x^p)/∂w   (1.49)
and the Hessian follows as

H(w) = Σ_p (∂f(w, x^p)/∂w)^T (∂f(w, x^p)/∂w) − Σ_p ( d^p − f(w, x^p) )^T ∂²f(w, x^p)/∂w∂w .   (1.50)
A simplifying approximation of the Hessian is the square of the Jacobian, which is a positive semi-definite matrix of dimension N × N:

H(w) ≈ Σ_p (∂f(w, x^p)/∂w)^T (∂f(w, x^p)/∂w) ,   (1.51)
where the second term from Eq.(1.50) was dropped. This is equivalent to assuming that the network is a linear function of the parameters w. Again this is readily implemented for the k-th column of the Jacobian: for all training patterns, (1) we forward propagate, then (2) set the activity of the output units to 0 and only the k-th output to 1, (3) a backpropagation step is taken and the gradient is accumulated.
1.7.3 Backpropagating Second Derivatives
Let us consider a multi-layer system with some functional blocks with N_i inputs, N_o outputs and N parameters, of the form O = F(W, X). Now assume we knew ∂²E/∂O², which is an N_o × N_o matrix. Then it is straightforward to compute this matrix:

∂²E/∂W² = (∂O/∂W)^T (∂²E/∂O²) (∂O/∂W) + (∂E/∂O) ∂²O/∂W² .   (1.52)
We can drop the second term in Eq.(1.52) and the resulting estimate of the Hessian is positive semi-definite. A further reduction is achieved if we ignore all but the diagonal terms of ∂²E/∂O²:

∂²E/∂w_i² = Σ_k (∂²E/∂o_k²) (∂o_k/∂w_i)² .   (1.53)
A similar derivation can be done to obtain the N_i × N_i matrix ∂²E/∂x².
1.7.4 Backpropagating the Diagonal Hessian in Neural Nets
Backpropagation procedures for computing the diagonal Hessian are well known
[18, 4, 19]. It is assumed that each layer in the network has the functional form o_i = f(y_i) = f( Σ_j w_{ij} x_j ) (see Figure 1.18 for the sigmoidal network). Using the Gauss-Newton approximation (dropping the term that contains f''(y)) we obtain:
∂²E/∂y_k² = (∂²E/∂o_k²) ( f'(y_k) )² ,   (1.54)
∂²E/∂w_{ki}² = (∂²E/∂y_k²) x_i²   (1.55)
and

∂²E/∂x_i² = Σ_k (∂²E/∂y_k²) w_{ki}² .   (1.56)
With f being a Gaussian nonlinearity as shown in Figure 1.18 for the RBF networks we obtain

∂²E/∂w_{ki}² = (∂²E/∂y_k²) (x_i − w_{ki})²   (1.57)

and

∂²E/∂x_i² = Σ_k (∂²E/∂y_k²) (x_i − w_{ki})² .   (1.58)
The cost of computing the diagonal second derivatives by running these equations from the last layer to the first one is essentially the same as the regular backpropagation pass used for the gradient, except that the squares of the weights are used in the weighted sums. This technique is applied in the “optimal brain damage” pruning procedure (see [21]).
Fig. 1.18. Backpropagating the diagonal Hessian: sigmoidal units (left) and RBF units (right)
1.7.5 Computing the Product of the Hessian and a Vector
In many methods that make use of the Hessian, the Hessian is used exclusively in products with a vector. Interestingly, there is a way of computing such products without going through the trouble of computing the Hessian itself. The finite difference method can fulfill this task for an arbitrary vector Ψ:

H Ψ ≈ (1/α) ( ∂E/∂w (w + αΨ) − ∂E/∂w (w) ) ,   (1.59)
using only two gradient computations (at points w and w + αΨ respectively), which can be readily computed with backprop (α is a small constant).
This method can be applied to compute the principal eigenvector and eigenvalue of H by the power method. By iterating and setting

Ψ(t+1) = H Ψ(t) / ‖Ψ(t)‖ ,   (1.60)

the vector Ψ(t) will converge to the largest eigenvector of H and ‖Ψ(t)‖ to the corresponding eigenvalue [23, 14, 10]. See also [33] for an even more accurate method that (1) does not use finite differences and (2) has similar complexity.
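A sketch combining Eq. (1.59) with the power iteration (1.60), again with gradE standing in for backprop; function names, the constant alpha and the iteration count are illustrative choices.

import numpy as np

def hessian_vector_product(gradE, w, v, alpha=1e-4):
    """H v ~ (gradE(w + alpha*v) - gradE(w)) / alpha  (Eq. 1.59)."""
    return (gradE(w + alpha * v) - gradE(w)) / alpha

def largest_eigenvalue(gradE, w, iters=50, rng=np.random.default_rng(0)):
    """Power method (Eq. 1.60): iterate psi <- H psi / ||psi||;
    the norm of psi converges to the largest eigenvalue of H."""
    psi = rng.standard_normal(len(w))
    for _ in range(iters):
        psi = hessian_vector_product(gradE, w, psi / np.linalg.norm(psi))
    return np.linalg.norm(psi), psi / np.linalg.norm(psi)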
1.8 Analysis of the Hessian in Multi-layer Networks
It is interesting to understand how some of the tricks shown previously influence the Hessian, i.e. how the Hessian changes with architecture and details of the implementation. Typically, the eigenvalue distribution of the Hessian looks like the one sketched in Figure 1.20: a few small eigenvalues, many medium ones and a few very large ones. We will now argue that the large eigenvalues cause the trouble in the training process; they are created by [23, 22]
– non-zero mean inputs or neuron states [22] (see also chapter 10)
– wide variations of the second derivatives from layer to layer
– correlations between state variables.
To exemplify this, we show the eigenvalue distribution of a network trained on
OCR data in Figure 1.20. Clearly, there is a wide spread of eigenvalues (see Fig-
ure 1.19) and we observe that the ratio between e.g. the first and the eleventh
eigenvalue is about 8. The long tail of the eigenvalue distribution (see Fig-
ure 1.20) is rather painful because the ratio between the largest and smallest
eigenvalue gives the conditioning of the learning problem. A large ratio corre-
sponds to a big difference in the axes of the ellipsoidally shaped error function: the larger the ratio, the more we find taco-shell shaped minima, which are extremely steep towards the small axis and very flat along the long axis.
Fig. 1.19. Eigenvalue spectrum (log scale) in a 4 layer shared weights network (256×128×64×10) trained on 320 handwritten digits; the ratio between the 1st and the 11th eigenvalues is about 8
Fig. 1.20. Eigenvalue spectrum (histogram of eigenvalue magnitudes; the few very large eigenvalues are the “big killers”) in a 4 layer shared weights network (256×128×64×10) trained on 320 handwritten digits
Fig. 1.21. Multilayered architecture: the second derivative is often smaller in lower
layers
Another general characteristic of the Hessian in multi-layer networks is the
spread between layers. In Figure 1.21 we roughly sketch how the shape of the
Hessian varies from being rather flat in the first layer to being quite steep in
the last layer. This affects the learning speed and can provide an ingredient to explain the slow learning in lower layers and the fast (sometimes oscillating) learning in the last layer. A trick to compensate for this different scale of learning is to use the inverse diagonal Hessian to control the learning rate (see also section 1.6, chapter 17).
1.9 Applying Second Order Methods to Multilayer
Networks
Before we concentrate in this section on how to tailor second order techniques
for training large networks, let us first repeat some rather pessimistic facts about
applying classical second order methods. Techniques using full Hessian informa-
tion (Gauss -Newton, Levenberg-Marquardt and BFGS) can only apply to very
small networks trained in batch mode, however those small networks are not
the ones that need speeding up the most. Most second order methods (conju-
gate gradient, BFGS, ...) require a line-search and can therefore not be used
in the stochastic mode. Many of the tricks discussed previously apply only to
batch learning. From our experience we know that a carefully tuned stochastic
gradient descent is hard to beat on large classification problems. For smaller
problems that require accurate real-valued outputs like in function approxima-
tion or control problems, we see that conjugate gradient (with Polak-Ribiere
Eq.(1.43)) offers the best combination of speed, reliability and simplicity. Sev-
eral attempts using “mini batches” in applying conjugate gradient to large and
redundant problems have been made recently [17, 25, 31]. A variant of conjugate
gradient optimization (called scaled CG) seems interesting: here the line search
procedure is replaced by a 1D Levenberg Marquardt type algorithm [24].
1.9.1 A Stochastic Diagonal Levenberg Marquardt Method
To obtain a stochastic version of the Levenberg Marquardt algorithm the idea
is to compute the diagonal Hessian through a running estimate of the second
derivative with respect to each parameter. The instantaneous second derivative
can be obtained via backpropagation as shown in the formulas of section 1.7. As
soon as we have those running estimates we can use them to compute individual
learning rates for each parameter
η_{ki} = ε / ( ∂²E/∂w_{ki}² + μ ) ,   (1.61)

where ε denotes the global learning rate, and ∂²E/∂w_{ki}² is a running estimate of the diagonal second derivative with respect to w_{ki}. μ is a parameter to prevent η_{ki}
from blowing up in case the second derivative is small, i.e. when the optimization
moves in flat parts of the error function. The running estimate is computed as
( ∂²E/∂w_{ki}² )_new = (1 − γ) ( ∂²E/∂w_{ki}² )_old + γ ∂²E^p/∂w_{ki}² ,   (1.62)
where γ is a small constant that determines the amount of memory that is
being used. The second derivatives can be computed prior to training over e.g.
a subset of the training set. Since they change only very slowly they only need
to be reestimated every few epochs. Note that the additional cost over regular backpropagation is negligible and convergence is, as a rule of thumb, about three times faster than a carefully tuned stochastic gradient algorithm.
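The resulting update loop can be sketched as follows; diag_hess_p and grad_p are hypothetical helpers that return the per-pattern diagonal second derivatives of section 1.7.4 and the per-pattern gradient, and the constants are placeholders rather than recommended values.

import numpy as np

def stochastic_diag_lm(w, grad_p, diag_hess_p, patterns,
                       eps=0.01, mu=0.1, gamma=0.01):
    """Stochastic diagonal Levenberg-Marquardt: per-parameter learning
    rates eta = eps / (h + mu), where h is a running (leaky) estimate
    of the diagonal second derivative (Eqs. 1.61, 1.62)."""
    h = np.zeros_like(w)                         # running diagonal Hessian estimate
    for x in patterns:
        h = (1.0 - gamma) * h + gamma * diag_hess_p(w, x)   # Eq. (1.62)
        eta = eps / (h + mu)                                 # Eq. (1.61)
        w = w - eta * grad_p(w, x)                           # per-parameter step
    return w

In practice the second derivative estimates change only slowly, so they need only be (re)estimated over a small subset of the training set every few epochs rather than at every step as in this sketch.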
In Figures 1.22 and 1.23 we see the convergence of the stochastic diagonal Levenberg Marquardt method (1.61) for a toy example with two different sets of learning rates. Obviously the experiment shown in Figure 1.22 contains fewer fluctuations than the one in Figure 1.23 due to the smaller learning rates.
Fig. 1.22. Stochastic diagonal Levenberg-Marquardt algorithm. Data set from 2 Gaussians with 100 examples. The network has one linear unit, 2 inputs and 1 output, i.e. three parameters (2 weights, 1 bias). Learning rates: η0 = 0.12, η1 = 0.03, η2 = 0.02; largest eigenvalue of the Hessian λ_max = 0.84, maximum admissible learning rate (batch) η_max = 2.38.
Fig. 1.23. Stochastic diagonal Levenberg-Marquardt algorithm. Data set from 2 Gaussians with 100 examples. The network has one linear unit, 2 inputs and 1 output, i.e. three parameters (2 weights, 1 bias). Learning rates: η0 = 0.76, η1 = 0.18, η2 = 0.12; largest eigenvalue of the Hessian λ_max = 0.84, maximum admissible learning rate (batch) η_max = 2.38.
1.9.2 Computing the Principal Eigenvalue/Vector of the Hessian
In the following we give three tricks for computing the principal eigenvalue/vector of the Hessian without having to compute the Hessian itself. Remember that in section 1.4.7 we also introduced a method to approximate the smallest eigenvector of the Hessian (without having to compute the Hessian) through averaging (see also [28]).
Power Method. We repeat the result of our discussion in section 1.7.5: starting from a random initial vector Ψ, the iteration

Ψ_new = H Ψ_old / ‖Ψ_old‖

will eventually converge to the principal eigenvector (or a vector in the principal eigenspace) and ‖Ψ_old‖ will converge to the corresponding eigenvalue [14, 10].
Fig. 1.24. Evolution of the eigenvalue as a function of the number of pattern presentations for a shared weight network with 5 layers, 64638 connections and 1278 free parameters, for leak sizes γ = 0.1, 0.03, 0.01, 0.003. The training set consists of 1000 handwritten digits.
Taylor Expansion. Another method makes use of the fact that small perturbations of the gradient also lead to the principal eigenvector of H

Ψ_new = (1/α) ( ∂E/∂w ( w + α Ψ_old/‖Ψ_old‖ ) − ∂E/∂w (w) ) ,   (1.63)
where α is a small constant. One iteration of this procedure requires two forward
and two backward propagation steps for each pattern in the training set.
Online Computation of Ψ. The following rule makes use of the running average to obtain the largest eigenvalue of the average Hessian very fast

Ψ_new = (1 − γ) Ψ_old + (1/α) ( ∂E^p/∂w ( w + α Ψ_old/‖Ψ_old‖ ) − ∂E^p/∂w (w) ) .   (1.64)
Fig. 1.25. Mean squared error (after 1 to 5 epochs) as a function of the ratio between the learning rate and the predicted optimal learning rate for a fully connected network (784 × 30 × 10). The training set consists of 300 handwritten digits.
To summarize, the eigenvalue/vector computation proceeds as follows:
1. a random vector is chosen for initialization of Ψ,
2. an input pattern is presented with its desired output, a forward and backward propagation step is performed and the gradients G(w) are stored,
3. α Ψ_old/‖Ψ_old‖ is added to the current weight vector w,
4. a forward and backward propagation step is performed with the perturbed weight vector and the gradients G(w') are stored,
5. the difference (1/α)(G(w') − G(w)) is computed and the running average of the eigenvector is updated,
6. we loop from (2)-(6) until a reasonably stable result is obtained for Ψ,
7. the optimal learning rate is then given as

η_opt = 1 / ‖Ψ‖ .
Fig. 1.26. Mean squared error (after 1 to 5 epochs) as a function of the ratio between the learning rate and the predicted optimal learning rate for a shared weight network with 5 layers (1024 × 1568 × 392 × 400 × 100 × 10), 64638 (local) connections and 1278 free parameters (shared weights). The training set consists of 1000 handwritten digits.
In Figure 1.24 we see the evolution of the eigenvalue as a function of the number
of pattern presentations for a neural network in a handwritten character recog-
nition task. In practice we adapt the leak size of the running average in order
to get fewer fluctuations (as also indicated in the figure). In the figure we see that after fewer than 100 pattern presentations the correct order of magnitude for the eigenvalue, i.e. the learning rate, is reached. From the experiments we also observe that the fluctuations of the average Hessian over training are small.
In Figures 1.25 and 1.26 we start with the same initial conditions, and perform
a fixed number of epochs with learning rates computed by multiplying the pre-
dicted learning rate by a predefined constant. Choosing constant 1 (i.e. using the
predicted optimal rate) always gives residual errors which are very close to the
error achieved by the best choice of the constant. In other words, the “predicted
optimal rate” is optimal enough.
1.10 Discussion and Conclusion
According to the recommendations mentioned above, a practitioner facing a
multi-layer neural net training problem would go through the following steps:
– shuffle the examples
– center the input variables by subtracting the mean
– normalize the input variables to a standard deviation of 1
– if possible, decorrelate the input variables
– pick a network with the sigmoid function shown in Figure 1.4
– set the target values within the range of the sigmoid, typically +1 and -1
– initialize the weights to random values as prescribed by Eq. (1.16).
The preferred method for training the network should be picked as follows:
– if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.
– if the training set is not too large, or if the task is regression, use conjugate gradient.
Classical second-order methods are impractical in almost all useful cases.
The non-linear dynamics of stochastic gradient descent in multi-layer neural
networks, particularly as it pertains to generalization, is still far from being well
understood. More theoretical work and systematic experimental work is needed.
Acknowledgement. Y.L. & L.B. & K.-R. M. gratefully acknowledge mutual
exchange grants from DAAD and NSF.
References
[1] Amari, S.: Neural learning in structured parameter spaces — natural Riemannian gradient. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9, p. 127. MIT Press (1997)
[2] Amari, S.: Natural gradient works efficiently in learning. Neural Computa-
tion 10(2), 251–276 (1998)
[3] Battiti, R.: First- and second-order methods for learning: Between steepest descent
and newton’s method. Neural Computation 4, 141–166 (1992)
[4] Becker, S., LeCun, Y.: Improving the convergence of backpropagation learning with second order methods. In: Touretzky, D., Hinton, G., Sejnowski, T. (eds.) Proceedings of the 1988 Connectionist Models Summer School, pp. 29–37. Lawrence Erlbaum Associates (1989)
[5] Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford
(1995)
[6] Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.)
Online Learning in Neural Networks (1997 Workshop at the Newton Institute).
The Newton Institute Series. Cambridge University Press, Cambridge (1998)
[7] Broomhead, D.S., Lowe, D.: Multivariable function interpolation and adaptive
networks. Complex Systems 2, 321–355 (1988)
[8] Buntine, W.L., Weigend, A.S.: Computing second order derivatives in Feed-
Forward networks: A review. IEEE Transactions on Neural Networks (1993) (to
appear)
[9] Darken, C., Moody, J.E.: Note on learning rate schedules for stochastic optimiza-
tion. In: Lippmann, R.P., Moody, J.E., Touretzky, D.S. (eds.) Advances in Neural
Information Processing Systems, vol. 3, pp. 832–838. Morgan Kaufmann, San Ma-
teo (1991)
[10] Diamantaras, K.I., Kung, S.Y.: Principal Component Neural Networks. Wiley,
New York (1996)
[11] Fletcher, R.: Practical Methods of Optimization, ch. 8.7: Polynomial time algo-
rithms, 2nd edn., pp. 183–188. John Wiley & Sons, New York (1987)
[12] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Computation 4(1), 1–58 (1992)
[13] Goldstein, L.: Mean square optimality in the continuous time Robbins Monro pro-
cedure. Technical Report DRB-306, Dept. of Mathematics, University of Southern
California, LA (1987)
[14] Golub, G.H., Van Loan, C.F.: Matrix Computations, 2nd edn. Johns Hopkins
University Press, Baltimore (1989)
[15] Heskes, T.M., Kappen, B.: On-line learning processes in artificial neural networks.
In: Tayler, J.G. (ed.) Mathematical Approaches to Neural Networks, vol. 51, pp.
199–233. Elsevier, Amsterdam (1993)
[16] Jacobs, R.A.: Increased rates of convergence through learning rate adaptation.
Neural Networks 1, 295–307 (1988)
[17] Kramer, A.H., Sangiovanni-Vincentelli, A.: Efficient parallel learning algorithms
for neural networks. In: Touretzky, D.S. (ed.) Proceedings of the 1988 Confer-
ence on Advances in Neural Information Processing Systems, pp. 40–48. Morgan
Kaufmann, San Mateo (1989)
[18] LeCun, Y.: Modeles connexionnistes de l’apprentissage (connectionist learning
models). PhD thesis, Universitè P. et M. Curie, Paris VI (1987)
[19] LeCun, Y.: Generalization and network design strategies. In: Pfeifer, R., Schreter,
Z., Fogelman, F., Steels, L. (eds.) Proceedings of the International Conference
Connectionism in Perspective, University of Zürich, October 10-13. Elsevier, Am-
sterdam (1988)
[20] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Handwritten digit recognition with a backpropagation network. In:
Touretsky, D.S. (ed.) Advances in Neural Information Processing Systems, vol. 2.
Morgan Kaufmann, San Mateo (1990)
[21] LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretsky, D.S.
(ed.) Advances in Neural Information Processing Systems, vol. 2, pp. 598–605
(1990)
[22] LeCun, Y., Kanter, I., Solla, S.A.: Second order properties of error surfaces. In:
Advances in Neural Information Processing Systems, vol. 3. Morgan Kaufmann,
San Mateo (1991)
[23] LeCun, Y., Simard, P.Y., Pearlmutter, B.: Automatic learning rate maximization
by on-line estimation of the hessian’s eigenvectors. In: Giles, Hanson, Cowan (eds.)
Advances in Neural Information Processing Systems, vol. 5. Morgan Kaufmann,
San Mateo (1993)
[24] Møller, M.: A scaled conjugate gradient algorithm for fast supervised learning.
Neural Networks 6, 525–533 (1993)
[25] Møller, M.: Supervised learning on large redundant training sets. International
Journal of Neural Systems 4(1), 15–25 (1993)
48 Y.A. LeCun et al.
[26] Moody, J.E., Darken, C.J.: Fast learning in networks of locally-tuned processing
units. Neural Computation 1, 281–294 (1989)
[27] Murata, N.: PhD thesis, University of Tokyo (1992) (in Japanese)
[28] Murata, N., Müller, K.-R., Ziehe, A., Amari, S.: Adaptive on-line learning in
changing environments. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances
in Neural Information Processing Systems, vol. 9, p. 599. MIT Press (1997)
[29] Oppenheim, A.V., Schafer, R.W.: Digital Signal Processing. Prentice-Hall, Engle-
wood Cliffs (1975)
[30] Orr, G.B.: Dynamics and Algorithms for Stochastic learning. PhD thesis, Oregon
Graduate Institute (1995)
[31] Orr, G.B.: Removing noise in on-line search using adaptive batch sizes. In: Mozer,
M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing
Systems, vol. 9, p. 232. MIT Press (1997)
[32] Orr, M.J.L.: Regularization in the selection of radial basis function centers. Neural
Computation 7(3), 606–623 (1995)
[33] Pearlmutter, B.A.: Fast exact multiplication by the hessian. Neural Computa-
tion 6, 147–160 (1994)
[34] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipies
in C: The art of Scientific Programming. Cambridge University Press, Cambridge
(1988)
[35] Saad, D. (ed.): Online Learning in Neural Networks (1997 Workshop at the Newton
Institute). The Newton Institute Series. Cambridge University Press, Cambridge
(1998)
[36] Saad, D., Solla, S.A.: Exact solution for on-line learning in multilayer neural net-
works. Physical Review Letters 74, 4337–4340 (1995)
[37] Sompolinsky, H., Barkai, N., Seung, H.S.: On-line learning of dichotomies: al-
gorithms and learning curves. In: Oh, J.-H., Kwon, C., Cho, S. (eds.) Neural
Networks: The Statistical Mechanics Perspective, pp. 105–130. World Scientific,
Singapore (1995)
[38] Sutton, R.S.: Adapting bias by gradient descent: An incremental version of delta-
bar-delta. In: Swartout, W. (ed.) Proceedings of the 10th National Conference on
Artificial Intelligence, pp. 171–176. MIT Press, San Jose (July 1992)
[39] van der Smagt, P.: Minimisation methods for training feed-forward networks. Neu-
ral Networks 7(1), 1–11 (1994)
[40] Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (1995)
[41] Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
[42] Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recogni-
tion using time-delay neural networks. IEEE Transactions on Acoustics, Speech,
and Signal Processing ASSP-37, 328–339 (1989)
[43] Wiegerinck, W., Komoda, A., Heskes, T.: Stochastic dynamics of learning with
momentum in neural networks. Journal of Physics A 27, 4425–4437 (1994)
[44] Yang, H.H., Amari, S.: The efficiency and the robustness of natural gradient de-
scent learning rule. In: Jordan, M.I., Kearns, M.J., Solla, S.A. (eds.) Advances in
Neural Information Processing Systems, vol. 10. MIT Press (1998)
Regularization Techniques to Improve
Generalization
Preface
Good tricks for regularization are extremely important for improving the gener-
alization ability of neural networks. The first and most commonly used trick is
early stopping, which was originally described in [11]. In its simplest version,
the trick is as follows:
Take an independent validation set, e.g. take out a part of the training
set, and monitor the error on this set during training. The error on the
training set will decrease, whereas the error on the validation set will
first decrease and then increase. The early stopping point occurs where
the error on the validation set is the lowest. It is here that the network
weights provide the best generalization.
As Lutz Prechelt points out in chapter 2, the above picture is highly idealized.
In practice, the shape of the error curve on the validation set is more likely very
ragged with multiple minima. Choosing the “best” early stopping point then
involves a trade-off between (1) improvement of generalization and (2) speed of
learning. If speed is not an issue then, clearly, the safest strategy is to train all
the way until the minimum error on the training set is found, while monitoring
the location of the lowest error rate on the validation set. Of course, this can
take a prohibitive amount of computing time. This chapter presents less costly
strategies employing a number of different stopping criteria, e.g. when the ratio
between the generalization loss and the progress exceeds a given threshold (see
p. 57). A large simulation study using various benchmark problems is used in
the discussion and analysis of the differences (with respect to e.g. robustness,
effectiveness, training time, ...) between these proposed stopping criteria (see
p. 60ff.). So far, theoretical studies [12, 1, 6] have not addressed this trade-off.
Weight decay is also a commonly used technique for controlling capacity in
neural networks. Early stopping is considered to be fast, but it is not well defined
(keep in mind the pitfalls mentioned in chapter 2). On the other hand, weight
decay regularizers [5, 2] are well understood, but finding a suitable parameter λ
to control the strength of the weight decay term can be tediously time consuming.
Thorsteinn Rögnvaldsson proposes a simple trick for estimating λ by making use of the best of both worlds (see p. 75): simply compute the gradient at the early stopping solution W_es and divide it by the norm of W_es,

    λ̂ = ‖∇E(W_es)‖ / (2‖W_es‖).

Other penalties are also possible. The trick is speedy, since we neither have to do a complete training nor a scan of the whole λ parameter space, and the accuracy of the determined λ̂ is good, as seen from some interesting simulations.
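As a minimal sketch of this estimate, assume w_es holds the weights found by early stopping and grad_E returns the gradient of the unregularized training error at those weights (both names are hypothetical):

```python
import numpy as np

def estimate_weight_decay(w_es, grad_E):
    """lambda_hat = ||grad E(W_es)|| / (2 ||W_es||), with E the unregularized fit error."""
    g = grad_E(w_es)
    return np.linalg.norm(g) / (2.0 * np.linalg.norm(w_es))

if __name__ == "__main__":
    # Toy quadratic fit error E(w) = 0.5 * ||w - w0||^2, so grad E(w) = w - w0.
    w0 = np.array([1.0, -2.0, 0.5])
    grad_E = lambda w: w - w0
    w_es = np.array([0.8, -1.5, 0.3])          # pretend early stopping ended here
    print(estimate_weight_decay(w_es, grad_E))
```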
Tony Plate in chapter 4 treats the penalty factors for the weights (hyperpa-
rameters) along the Bayesian framework of MacKay [8] and Neal [9]. There
are two levels in searching for the best network. The inner loop is a minimization
of the training error keeping the hyperparameters fixed, whereas the outer loop
searches the hyperparameter space with the goal of maximizing the evidence
of having generated the data. This whole procedure is rather slow and compu-
tationally expensive, since, in theory, the inner search needs to converge (to a
local minimum) at each outer loop search step. When applied to classification
networks using the cross-entropy error function the outer-loop search can be
unstable with the hyperparameter values oscillating wildly or going to inappro-
priate extremes. To make this Bayesian framework work better in practice, Tony
Plate proposes a number of tricks that speed up and simplify the hyperparameter
search strategies (see p. 96). In particular, his search strategies center around
the questions: (1) how often (when) should the hyperparameters be updated
(see p. 96) and (2) what should be done if the Hessian is out-of-bounds (see p.
97ff.). To discuss the effects of the choices made in (1) and (2), Tony Plate uses
simulations based on artificial examples and concludes with a concise set of rules
for making the hyperparameter framework work better.
In chapter 5, Jan Larsen et al. formulate an iterative gradient descent scheme
for adapting their regularization parameters (note, different regularizers
can be used for input/hidden and hidden/output weights). The trick is simple:
perform gradient descent on the validation set errors with respect to the regu-
larization parameters, and iteratively use the results for updating the estimate
of the regularization parameters (see p. 116). This method holds for a variety
of penalty terms (e.g. weight decay). The computational overhead of computing the gradients is negligible; however, an inverse Hessian has to be estimated. If
second order methods are used for training, then the inverse Hessian may already
be available, so there is little additional effort. Otherwise obtaining full Hessian
information is rather tedious and limits the approach to smaller applications
(see discussion in chapter 1). Nevertheless approximations of the Hessian (e.g.
diagonal) could also be used to limit the computation time. Jan Larsen, et al.,
demonstrate the applicability of their trick on classification (vowel data) and
regression (time-series prediction) problems.
Averaging over multiple predictors is a well known method for improving
generalization (see e.g. [10, 3, 7, 13]). David Horn et al. raise two questions in
ensemble training: (1) how many predictors are “enough” and (2) how does the
number of predictors affect the stopping criteria for early stopping (see p. 134).
They present solutions for answering these questions by providing a method for
estimating the error of an infinite number of predictors and they demonstrate the
usefulness of their trick for the sunspot prediction task. Additional theoretical
reasoning is given to explain their success in terms of variance minimization
within the ensemble.
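A minimal sketch of the underlying averaging idea (not Horn et al.'s exact estimator): train several predictors independently and watch how the error of the averaged prediction changes with the number of predictors. The bootstrapped polynomial "predictors" and the toy data below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 80)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)   # noisy training targets
x_val, y_val = np.linspace(0, 1, 200), np.sin(2 * np.pi * np.linspace(0, 1, 200))

def train_predictor(seed):
    """One 'predictor': a polynomial fit on a bootstrap resample of the training data."""
    r = np.random.default_rng(seed)
    idx = r.integers(0, x.size, x.size)
    coeffs = np.polyfit(x[idx], y[idx], 6)
    return lambda xs: np.polyval(coeffs, xs)

predictors = [train_predictor(s) for s in range(20)]
for n in (1, 2, 5, 10, 20):
    avg = np.mean([p(x_val) for p in predictors[:n]], axis=0)   # ensemble average
    print(f"{n:2d} predictors: validation MSE = {np.mean((avg - y_val) ** 2):.4f}")
```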
Jenny & Klaus
References
[1] Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H.H.: Asymptotic sta-
tistical theory of overtraining and cross-validation. IEEE Transactions on Neural
Networks 8(5), 985–996 (1997)
[2] Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford
(1995)
[3] Breiman, L.: Bagging predictors. Machine Learning 26(2), 123–140 (1996)
[4] Cowan, J.D., Tesauro, G., Alspector, J. (eds.): Advances in Neural Information
Processing Systems 6, San Mateo, CA. Morgan Kaufman Publishers Inc. (1994)
[5] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks ar-
chitectures. Neural Computation 7(2), 219–269 (1995)
[6] Kearns, M.: A bound on the error of cross validation using the approximation and
estimation rates, with consequences for the training-test split. Neural Computa-
tion 9(5), 1143–1161 (1997)
[7] Lincoln, W.P., Skrzypek, J.: Synergy of clustering multiple back propagation net-
works. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Sys-
tems 2, San Mateo, CA, pp. 650–657. Morgan Kaufmann (1990)
[8] MacKay, D.J.C.: A practical Bayesian framework for backpropagation networks.
Neural Computation 4, 448–472 (1992)
[9] Neal, R.M.: Bayesian Learning for Neural Networks. Lecture Notes in Statistics,
vol. 118. Springer, New York (1996)
[10] Perrone, M.P.: Improving Regression Estimation: Averaging Methods for Variance
Reduction with Extensions to General Convex Measure Optimization. PhD thesis,
Brown University (May 1993)
[11] Plaut, D.C., Nowlan, S.J., Hinton, G.E.: Experiments on learning by back-
propagation. Technical report, Computer Science Dept., Pittsburgh, PA (1986)
[12] Wang, C., Venkatesh, S.S., Judd, J.S.: Optimal stopping and effective machine
complexity in learning. In: [4] (1994)
[13] Wolpert, D.H.: Stacked generalization. Neural Networks 5(2), 241–259 (1992)
2
Early Stopping — But When?
Lutz Prechelt
Fakultät für Informatik, Universität Karlsruhe
D-76128 Karlsruhe, Germany
prechelt@ira.uka.de
http://www.ipd.ira.uka.de/˜prechelt/
Abstract. Validation can be used to detect when overfitting starts dur-
ing supervised training of a neural network; training is then stopped
before convergence to avoid the overfitting (“early stopping”). The exact
criterion used for validation-based early stopping, however, is usually
chosen in an ad-hoc fashion or training is stopped interactively. This
trick describes how to select a stopping criterion in a systematic fashion;
it is a trick for either speeding learning procedures or improving gen-
eralization, whichever is more important in the particular situation. An
empirical investigation on multi-layer perceptrons shows that there ex-
ists a tradeoff between training time and generalization: From the given
mix of 1296 training runs using 12 different problems and 24 different
network architectures, I conclude that slower stopping criteria allow for small
improvements in generalization (here: about 4% on average), but cost
much more training time (here: about factor 4 longer on average).
2.1 Early Stopping Is Not Quite as Simple
2.1.1 Why Early Stopping?
When training a neural network, one is usually interested in obtaining a network
with optimal generalization performance. However, all standard neural network
architectures such as the fully connected multi-layer perceptron are prone to
overfitting [10]: While the network seems to get better and better, i.e., the error
on the training set decreases, at some point during training it actually begins
to get worse again, i.e., the error on unseen examples increases. The idealized
expectation is that during training the generalization error of the network evolves
as shown in Figure 2.1. Typically the generalization error is estimated by a
validation error, i.e., the average error on a validation set, a fixed set of examples
not from the training set.
There are basically two ways to fight overfitting: reducing the number of di-
mensions of the parameter space or reducing the effective size of each dimension.
Techniques for reducing the number of parameters are greedy constructive learn-
ing [7], pruning [5, 12, 14], or weight sharing [18]. Techniques for reducing the
size of each parameter dimension are regularization, such as weight decay [13]
and others [25], or early stopping [17]. See also [8, 20] for an overview and [9]
for an experimental comparison.
Early stopping is widely used because it is simple to understand and imple-
ment and has been reported to be superior to regularization methods in many
cases, e.g. in [9].
2.1.2 The Basic Early Stopping Technique
In most introductory papers on supervised neural network training one can find a
diagram similar to the one shown in Figure 2.1. It is claimed to show the evolution
over time of the per-example error on the training set and on a validation set
not used for training (the training error curve and the validation error curve).
Given this behavior, it is clear how to do early stopping using validation:
Training error
Validation error
Fig. 2.1. Idealized training and validation error curves. Vertical: errors; horizontal:
time.
1. Split the training data into a training set and a validation set, e.g. in a 2-to-1
proportion.
2. Train only on the training set and evaluate the per-example error on the
validation set once in a while, e.g. after every fifth epoch.
3. Stop training as soon as the error on the validation set is higher than it was
the last time it was checked.
4. Use the weights the network had in that previous step as the result of the
training run.
This approach uses the validation set to anticipate the behavior in real use (or
on a test set), assuming that the error on both will be similar: The validation
error is used as an estimate of the generalization error.
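The four steps can be written down almost literally. The sketch below (NumPy only; the over-parameterized polynomial regression and the gradient descent settings are toy placeholders) checks the validation error every fifth epoch, remembers the best weights seen so far, and stops on the first increase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-d regression task, split 2-to-1 into training and validation data.
x = rng.uniform(-1, 1, 90)
y = np.sin(3 * x) + 0.25 * rng.normal(size=x.size)
X = np.vander(x, 12, increasing=True)           # over-parameterized polynomial features
X_tr, y_tr, X_va, y_va = X[:60], y[:60], X[60:], y[60:]

w = np.zeros(X.shape[1])
lr = 0.1
best_w, best_va, last_checked = w.copy(), np.inf, np.inf

for epoch in range(1, 5001):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)   # gradient of the training MSE
    w -= lr * grad
    if epoch % 5 == 0:                              # step 2: evaluate once in a while
        e_va = np.mean((X_va @ w - y_va) ** 2)
        if e_va < best_va:
            best_va, best_w = e_va, w.copy()        # remember the best weights so far
        if e_va > last_checked:                     # step 3: stop on the first increase
            print(f"stopped at epoch {epoch}, best validation MSE {best_va:.4f}")
            break
        last_checked = e_va
else:
    print(f"reached the epoch limit, best validation MSE {best_va:.4f}")

w = best_w                                          # step 4: use the best weights
```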
2.1.3 The Ugliness of Reality
However, for real neural network training the validation set error does not evolve
as smoothly as shown in Figure 2.1, but looks more like in Figure 2.2. See Sec-
tion 2.4 for a rough explanation of this behavior. As we see, the validation error
can still go further down after it has begun to increase. Moreover, in a realistic
setting we never know the exact generalization error but estimate it by the
validation set error instead. There is no obvious rule for deciding when the min-
imum of the generalization error is obtained. Real validation error curves almost
always have more than one local minimum. The above curve exhibits as many
as 16 local minima before severe overfitting begins at about epoch 400. Of these
local minima, 4 are the global minimum up to where they occur. The optimal
stopping point in this example would be epoch 205. Note that stopping in epoch
400 compared to stopping shortly after the first “deep” local minimum at epoch
45 trades an about sevenfold increase of learning time for an improvement of
validation set error by 1.1% (by finding the minimum at epoch 205). If repre-
sentative data is used, the validation error is an unbiased estimate of the actual
network performance; so we expect a 1.1% decrease of the generalization error in
this case. Nevertheless, overfitting might sometimes go undetected because the
validation set is finite and thus not perfectly representative of the problem.
Unfortunately, the above or any other validation error curve is not typical
in the sense that all curves share the same qualitative behavior. Other curves
might never reach a better minimum than the first, or than, say, the third; the
mountains and valleys in the curve can be of very different width, height, and
shape. The only thing all curves seem to have in common is that the differences
between the first and the following local minima are not huge.
As we see, choosing a stopping criterion predominantly involves a tradeoff
between training time and generalization error. However, some stopping criteria
may typically find better tradeoffs than others. This leads to the question of
Fig. 2.2. A real validation error curve (Glass dataset, 4+4 hidden units). Vertical: validation set error; horizontal: time (in training epochs).
which criterion to use with cross validation to decide when to stop training.
This is why we need the present trick: To tell us how to really do early stopping.
2.2 How to Do Early Stopping Best
What we need is a predicate that tells us when to stop training. We call such
a predicate a stopping criterion. Among all possible stopping criteria we are
searching for those which yield the lowest generalization error and also for those
with the best “price-performance ratio”, i.e., that require the least training for a
given generalization error or that (on average) result in the lowest generalization
error for a certain training time.
2.2.1 Some Classes of Stopping Criteria
There are a number of plausible stopping criteria and this work considers three
classes of them. To formally describe the criteria, we need some definitions first.
Let E be the objective function (error function) of the training algorithm, for example the squared error. Then E_tr(t), the training set error (for short: training error), is the average error per example over the training set, measured after epoch t. E_va(t), the validation error, is the corresponding error on the validation set and is used by the stopping criterion. E_te(t), the test error, is the corresponding error on the test set; it is not known to the training algorithm but estimates the generalization error and thus benchmarks the quality of the network resulting from training. In real life, the generalization error is usually unknown and only the validation error can be used to estimate it.
The value E_opt(t) is defined to be the lowest validation set error obtained in epochs up to t:

    E_opt(t) := min_{t' ≤ t} E_va(t')

Now we define the generalization loss at epoch t to be the relative increase of the validation error over the minimum-so-far (in percent):

    GL(t) = 100 · (E_va(t) / E_opt(t) − 1)

High generalization loss is one obvious candidate reason to stop training, because it directly indicates overfitting. This leads us to the first class of stopping criteria: stop as soon as the generalization loss exceeds a certain threshold. We define the class GL_α as

    GL_α: stop after first epoch t with GL(t) > α
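A small sketch of the GL_α check, assuming E_va is the list of validation errors recorded so far (one entry per evaluation):

```python
def generalization_loss(E_va):
    """GL(t) = 100 * (E_va(t) / E_opt(t) - 1), with E_opt the lowest value so far."""
    return 100.0 * (E_va[-1] / min(E_va) - 1.0)

def stop_GL(E_va, alpha):
    """GL_alpha: stop after the first epoch with GL(t) > alpha."""
    return generalization_loss(E_va) > alpha

print(stop_GL([0.90, 0.85, 0.84, 0.88], alpha=3))   # GL ≈ 4.8 > 3, so True
```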
However, we might want to suppress stopping if the training is still progressing
very rapidly. The reasoning behind this approach is that when the training error
still decreases quickly, generalization losses have a higher chance of being “repaired”;
we assume that often overfitting does not begin until the error decreases only
slowly. To formalize this notion we define a training strip of length k to be a sequence of k epochs numbered n + 1 . . . n + k, where n is divisible by k. The training progress (in per thousand) measured after such a training strip is then

    P_k(t) := 1000 · ( Σ_{t'=t−k+1}^{t} E_tr(t') / (k · min_{t'=t−k+1}^{t} E_tr(t')) − 1 )
that is, “how much was the average training error during the strip larger than
the minimum training error during the strip?” Note that this progress measure is
high for unstable phases of training, where the training set error goes up instead
of down. This is intended, because many training algorithms sometimes produce
such “jitter” by taking inappropriately large steps in weight space. The progress
measure is, however, guaranteed to approach zero in the long run unless the
training is globally unstable (e.g. oscillating).
Now we can define the second class of stopping criteria: use the quotient of generalization loss and progress.

    PQ_α: stop after first end-of-strip epoch t with GL(t) / P_k(t) > α
In the following we will always assume strips of length 5 and measure the vali-
dation error only at the end of each strip.
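A sketch of the progress measure and the PQ_α check for one strip (k = 5, as assumed above); E_tr_strip holds the training errors of the last strip and E_va the validation errors measured at strip ends:

```python
def progress(E_tr_strip):
    """P_k(t) = 1000 * (sum of the strip's training errors / (k * strip minimum) - 1)."""
    k = len(E_tr_strip)
    return 1000.0 * (sum(E_tr_strip) / (k * min(E_tr_strip)) - 1.0)

def stop_PQ(E_va, E_tr_strip, alpha):
    """PQ_alpha: stop when generalization loss divided by progress exceeds alpha.

    In practice the complementary rule "stop when progress drops below 0.1"
    (see below) keeps the denominator away from zero.
    """
    gl = 100.0 * (E_va[-1] / min(E_va) - 1.0)
    return gl / progress(E_tr_strip) > alpha

strip = [0.52, 0.50, 0.49, 0.49, 0.48]          # training errors of the last k = 5 epochs
print(progress(strip))                          # about 33 per thousand
print(stop_PQ([0.90, 0.85, 0.84, 0.86], strip, alpha=2))   # False here
```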
A completely different kind of stopping criterion relies only on the sign of the changes in the generalization error. We define the third class of stopping criteria: stop when the generalization error increased in s successive strips.

    UP_s: stop after epoch t iff UP_{s−1} stops after epoch t − k and E_va(t) > E_va(t − k)
    UP_1: stop after first end-of-strip epoch t with E_va(t) > E_va(t − k)
The idea behind this definition is that when the validation error has increased not
only once but during s consecutive strips, we assume that such increases indicate
the beginning of final overfitting, independent of how large the increases actually
are. The UP criteria have the advantage of measuring change locally so that they
can be used in the context of pruning algorithms, where errors must be allowed
to remain much higher than previous minima over long training periods.
None of these criteria alone can guarantee termination. We thus complement
them by the rule that training is stopped when the progress drops below 0.1 or
after at most 3000 epochs.
All stopping criteria are used in the same way: They decide to stop at some time t during training and the result of the training is then the set of weights that exhibited the lowest validation error E_opt(t). Note that in order to implement this scheme, only one duplicate weight set is needed.
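A sketch of the UP_s check on the validation errors measured at strip ends; unrolling the recursive definition, it simply asks whether the last s strip-to-strip changes were all increases:

```python
def stop_UP(E_va, s):
    """UP_s: stop when the validation error increased in s successive strips.

    E_va[i] is the validation error measured at the end of strip i.
    """
    if len(E_va) < s + 1:
        return False
    recent = E_va[-(s + 1):]
    return all(recent[i + 1] > recent[i] for i in range(s))

E_va = [0.90, 0.85, 0.84, 0.86, 0.87, 0.89]
print(stop_UP(E_va, s=2))   # the last two strip-to-strip changes are increases: True
print(stop_UP(E_va, s=4))   # not four increases in a row: False
```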
2.2.2 The Trick: Criterion Selection Rules
These three classes of stopping criteria, GL, UP, and PQ, were evaluated on
a variety of learning problems as described in Section 2.3 below. The results
indicate that “slower” criteria, which stop later than others, on the average lead
to improved generalization compared to “faster” ones. However, the training time
that has to be expended for such improvements is rather large on average and
also varies dramatically when slow criteria are used. The systematic differences
between the criteria classes are only small.
For training setups similar to the one used in this work, the following rules
can be used for selecting a stopping criterion:
1. Use fast stopping criteria unless small improvements of network performance
(e.g. 4%) are worth large increases of training time (e.g. factor 4).
2. To maximize the probability of finding a “good” solution (as opposed to
maximizing the average quality of solutions), use a GL criterion.
3. To maximize the average quality of solutions, use a PQ criterion if the net-
work overfits only very little or an UP criterion otherwise.
2.3 Where and How Well Does This Trick Work?
As no mathematical analysis of the properties of stopping criteria is possible
today (see Section 2.4 for the state of the art), we resort to an experimental
evaluation.
We want to find out which criteria will achieve how much generalization using
how much training time on which kinds of problems. To achieve broad cover-
age, we use 12 different network topologies, 12 different learning tasks, and 14
different stopping criteria. To keep the experiment feasible, only one training
algorithm is used.
2.3.1 Concrete Questions
To derive and evaluate the stopping criteria selection rules presented above we
need to answer the following questions:
1. Training time: How long will training take with each criterion, i.e., how fast
or slow are they?
2. Efficiency: How much of this training time will be redundant, i.e., will occur
after the to-be-chosen validation error minimum has been seen?
3. Effectiveness: How good will the resulting network performance be?
4. Robustness: How sensitive are the above qualities of a criterion to changes
of the learning problem, network topology, or initial conditions?
5. Tradeoffs: Which criteria provide the best time-performance tradeoff?
6. Quantification: How can the tradeoff be quantified?
The answers will directly lead to the rules already presented above in Sec-
tion 2.2.2. To find the answers to the questions we record for a large number of
runs when each criterion would stop and what the associated network perfor-
mance would be.
2.3.2 Experimental Setup
Approach. To measure network performance, we partition each dataset into
two disjoint parts: Training data and test data. The training data is further
subdivided into a training set of examples used to adjust the network weights
and a validation set of examples used to estimate network performance during
training as required by the stopping criteria. The validation set is never used
for weight adjustment. This decision was made in order to obtain pure stopping
criteria results. In contrast, in a real application after a reasonable stopping
time has been computed, one would include the validation set examples in the
training set and retrain from scratch.
Stopping Criteria. The stopping criteria examined were GL_1, GL_2, GL_3, GL_5, PQ_0.5, PQ_0.75, PQ_1, PQ_2, PQ_3, UP_2, UP_3, UP_4, UP_6, and UP_8. All criteria were evaluated simultaneously, i.e., each single training run returned one result for each of the criteria. This approach reduces the variance of the estimation.
Learning Tasks. Twelve different problems were used, all from the Proben1 NN benchmark set [19]. All problems are real datasets from realistic application domains; they form a sample of a broad class of domains, but none of them exhibits extreme nonlinearity. The problems have between 8 and 120 inputs, between 1 and 19 outputs, and between 214 and 7200 examples. All inputs and outputs are normalized to the range 0...1. Nine of the problems are classification tasks using 1-of-n output encoding (cancer, card, diabetes, gene, glass, heart, horse, soybean, and thyroid), three are approximation tasks (building, flare, and hearta).
Datasets and Network Architectures. The examples of each problem were
partitioned into training (50%), validation (25%), and test set (25% of examples)
in three different random ways, resulting in 36 datasets. Each of these datasets
was trained with 12 different feedforward network topologies: one hidden layer
networks with 2, 4, 8, 16, 24, or 32 hidden nodes and two hidden layer networks
with 2+2, 4+2, 4+4, 8+4, 8+8, or 16+8 hidden nodes in the first+second hidden
layer, respectively; all these networks were fully connected including all possible
shortcut connections. For each of the network topologies and each dataset, two
runs were made with linear output units and one with sigmoidal output units
using the activation function f(x) = x/(1 + |x|).
Training Algorithm. All runs were done using the RPROP training algorithm [21] using the squared error function and the parameters η⁺ = 1.1, η⁻ = 0.5, Δ₀ = 0.05...0.2 randomly per weight, Δ_max = 50, Δ_min = 0, initial weights −0.5...0.5 randomly. RPROP is a fast backpropagation variant that is about
as fast as quickprop [6] but more stable without adjustment of the parame-
ters. RPROP requires epoch learning, i.e., the weights are updated only once
per epoch. Therefore, the algorithm is fast without parameter tuning for small
training sets but not recommendable for large training sets. Lack of parameter
tuning helps to avoid the common methodological error of tuning parameters
using the test error.
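For illustration, here is a minimal sketch of the per-weight step-size adaptation RPROP performs, in a simplified form without the weight-backtracking step of the original algorithm; grad_fn and the quadratic toy objective are placeholders, and the step-size constants follow the values quoted above (with a fixed Δ₀ instead of a random one per weight):

```python
import numpy as np

def rprop(grad_fn, w, n_epochs=100, eta_plus=1.1, eta_minus=0.5,
          delta_init=0.1, delta_max=50.0, delta_min=0.0):
    """Adapt a per-weight step size from gradient sign changes (simplified RPROP)."""
    delta = np.full_like(w, delta_init)
    prev_grad = np.zeros_like(w)
    for _ in range(n_epochs):
        g = grad_fn(w)
        same_sign = prev_grad * g
        # Grow the step where the gradient kept its sign, shrink it where it flipped.
        delta = np.where(same_sign > 0, np.minimum(delta * eta_plus, delta_max), delta)
        delta = np.where(same_sign < 0, np.maximum(delta * eta_minus, delta_min), delta)
        w = w - np.sign(g) * delta            # move against the gradient sign only
        prev_grad = g
    return w

if __name__ == "__main__":
    # Toy objective E(w) = 0.5 * ||w - target||^2, so grad E(w) = w - target.
    target = np.array([3.0, -1.0, 0.5])
    print(np.round(rprop(lambda w: w - target, np.zeros(3)), 3))   # close to target
```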
2.3.3 Experiment Results
Altogether, 1296 training runs were made for the comparison, giving 18144 stop-
ping criteria performance records for the 14 criteria. 270 of these records (or
1.5%) from 125 different runs reached the 3000 epoch limit instead of using the
stopping criterion itself.
The results for each stopping criterion averaged over all 1296 runs are shown
in Table 2.1. Figure 2.3 describes the variance embedded in the means given in
the table. I will now explain and then interpret the entries in both table and
figure. Note that the discussion is biased by the particular collection of criteria
chosen for the study.
Definitions. For each run, we define E_va(C) as the minimum validation error found until criterion C indicates to stop; it is the error after epoch number t_m(C) (read: “time of minimum”). E_te(C) is the corresponding test error and characterizes network performance. Stopping occurs after epoch t_s(C) (read: “time of stop”). A best criterion Ĉ of a particular run is one with minimum t_s of all those (among the examined) with minimum E_va, i.e., a criterion that found the best validation error fastest. There may be several best, because multiple criteria may stop at the same epoch. Note that there is no single criterion Ĉ, because Ĉ changes from run to run. C is called good in a particular run if E_va(C) = E_va(Ĉ), i.e., if it is among those that found the lowest validation set error, no matter how fast or slow.
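A small sketch of these definitions: given, for one hypothetical run, each criterion's stopping epoch t_s and the lowest validation error E_va it found, the good criteria and the best criterion Ĉ can be picked as follows (all numbers are made up):

```python
# (criterion name, stopping epoch t_s, lowest validation error E_va) for one made-up run
records = [("GL2", 120, 0.842), ("UP3", 95, 0.845), ("PQ1", 160, 0.842), ("UP8", 240, 0.842)]

lowest = min(e for _, _, e in records)
good = [name for name, _, e in records if e == lowest]    # found the lowest validation error
best = min((r for r in records if r[2] == lowest), key=lambda r: r[1])[0]   # fastest good one
print("good criteria:", good)    # ['GL2', 'PQ1', 'UP8']
print("best criterion:", best)   # 'GL2'
```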
2.3.4 Discussion: Answers to the Questions
We now discuss the questions raised in Section 2.3.1.
Table 2.1. Behavior of stopping criteria. S_GL2 is normalized training time, B_GL2 is normalized test error (both relative to GL_2). r is the training time redundancy, P_g is the probability of finding a good solution. For further description please refer to the text.

                 training time                 efficiency and effectiveness
  C          S_Ĉ(C)   S_GL2(C)   r(C)       B_Ĉ(C)   B_GL2(C)   P_g(C)
  UP_2       0.792    0.766      0.277      1.055    1.024      0.587
  GL_1       0.956    0.823      0.308      1.044    1.010      0.680
  UP_3       1.010    1.264      0.419      1.026    1.003      0.631
  GL_2       1.237    1.000      0.514      1.034    1.000      0.723
  UP_4       1.243    1.566      0.599      1.020    0.997      0.666
  PQ_0.5     1.253    1.334      0.663      1.027    1.002      0.658
  PQ_0.75    1.466    1.614      0.863      1.021    0.998      0.682
  GL_3       1.550    1.450      0.712      1.025    0.994      0.748
  PQ_1       1.635    1.796      1.038      1.018    0.994      0.704
  UP_6       1.786    2.381      1.125      1.012    0.990      0.737
  GL_5       2.014    2.013      1.162      1.021    0.991      0.772
  PQ_2       2.184    2.510      1.636      1.012    0.990      0.768
  UP_8       2.485    3.259      1.823      1.010    0.988      0.759
  PQ_3       2.614    3.095      2.140      1.009    0.988      0.800
Fig. 2.3. Variance of slowness S_Ĉ(C) (top), redundancy r(C) (middle), and badness B_Ĉ(C) (bottom) for each pair of learning problem and stopping criterion. In each of the 168 columns, the dot represents the mean computed from 108 runs: learning problem and stopping criterion are fixed, while three other parameters are varied (12 topologies × 3 runs × 3 dataset variants). The length of the line is twice the standard deviation within these 108 values. Within each block of dot-line plots, the plots represent (in order) the problems building, cancer, card, diabetes, flare, gene, glass, heart, hearta, horse, soybean, thyroid. The horizontal line marks the median of the means. Note: When comparing the criteria groups, remember that overall the PQ criteria chosen are slower than the others. It is unfair to compare, for example, PQ_0.5 to GL_1 and UP_2.
1. Training time: The slowness of a criterion C in a run, relative to another criterion x, is S_x(C) := t_s(C)/t_s(x), i.e., the relative total training time. As we see, the times relative to a fixed criterion, as shown in column S_GL2(C), vary by more than a factor of 4. Therefore, the decision for a particular stopping criterion influences training times dramatically, even if one considers only the range of criteria used here. In contrast, even the slowest criteria train only about 2.5 times as long as the fastest criterion of each run that finds the same result, as indicated in column S_Ĉ(C). This shows that the training times are not completely unreasonable even for the slower criteria, but do indeed pay off to some degree.
2. Efficiency: The redundancy of a criterion can be defined as r(C) := (t_s(C)/t_m(C)) − 1. It characterizes how long the training continues after the final solution has been seen. r(C) = 0 would be perfect, r(C) = 1 means that the criterion trains twice as long as necessary. Low values indicate efficient criteria. As we see, the slower a criterion is, the less efficient it tends to get. Even the fastest criteria “waste” about one fifth of their overall training time. The slower criteria train twice as long as necessary to find the same solution.
3. Effectiveness: We define the badness of a criterion C in a run relative to another criterion x as B_x(C) := E_te(C)/E_te(x), i.e., its relative error on the test set. P_g(C) is the fraction of the 1296 runs in which C was a good criterion. This is an estimate of the probability that C is good in a run. As we see from the P_g column, even the fastest criteria are fairly effective. They reach a result as good as the best (of the same run) in about 60% of the cases. On the other hand, even the slowest criteria are not at all infallible; they achieve about 80%. However, P_g says nothing about how far from the optimum the non-good runs are. Columns B_Ĉ(C) and B_GL2(C) indicate that these differences are usually rather small: column B_GL2(C) shows that even the criteria with the lowest error achieve only about 1% lower error on the average than the relatively fast criterion GL_2. In column B_Ĉ(C) we see that several only modestly slow criteria have just about 2% higher error on the average than the best criteria of the same run. For obtaining the lowest possible generalization error, independent of training time, it appears that one has to use an extreme criterion such as GL_50 or even use a conjunction of all three criteria classes with high parameter values.
4. Robustness: We call a criterion robust to the degree that its performance is independent of the learning problem and the learning environment (network topology, initial conditions etc.). Optimal robustness would mean that in Figure 2.3 all dots within a block are at the same height (problem independence) and all lines have length zero (environment independence). Note that slowness and badness are measured relative to the best criterion of the same program run. We observe the following:
– With respect to slowness and redundancy, slower criteria are much less robust than faster ones. In particular the PQ criteria are quite sensitive to the learning problem, with the card and horse problems being worst in this experimental setting.
– With respect to badness, the picture is completely different: slower criteria tend to be slightly more robust than faster ones. PQ criteria are a little more robust than the others while GL criteria are significantly less robust. All criteria are more or less unstable for the building, cancer, and thyroid problems. In particular, all GL criteria have huge problems with the building problem, whose dataset 1 is the only one that is partitioned non-randomly; it uses chronological order of examples, see [19]. The slower variants of the other criteria types are nicely robust in this case.
– Similar statements apply when one analyzes the influence of only large or only small network topologies separately (not shown in any figure or table). One notable exception was the fact that for networks with very few hidden nodes the PQ criteria are more cost-effective than both the GL and the UP criteria for minimizing B_Ĉ(C). The explanation may be that such small networks do not overfit severely; in this case it is advantageous to take training progress into account as an additional factor to determine when to stop training.
Overall, fast criteria improve the predictability of the training time, while slow ones improve the predictability of the solution quality.
5. Best tradeoffs: Despite the common overall trend, some criteria may be more cost-effective than others, i.e., provide better tradeoffs between training time and resulting network performance. Column B_Ĉ of the table suggests that the best tradeoffs between test error and training time are (in order of increasing willingness to spend lots of training time) UP_3, UP_4, and UP_6, if one wants to minimize the expected network performance from a single run. These criteria are also robust. If on the other hand one wants to make several runs and pick the network that seems to be best (based on its validation error), P_g is the relevant metric and the GL criteria are preferable. The best tradeoffs are marked with a star in the table. Figure 2.4 illustrates these results. The upper curve corresponds to column B_Ĉ of the table (plotted against column S_Ĉ); local minima indicate
Fig. 2.4. Badness B_Ĉ(C) and P_g against slowness S_Ĉ(C) of criteria
criteria with the best tradeoffs. The lower curve corresponds to column P_g; local maxima indicate the criteria with the best tradeoffs. All measurements are scaled by 1000.
6. Quantification: From columns S_GL2(C) and B_GL2(C) we can quantify the
tradeoff involved in the selection of a stopping criterion as follows: In the range
of criteria examined we can roughly trade a 4% decrease in test error (from 1.024
to 0.988) for an about fourfold increase in training time (from 0.766 to 3.095).
Within this range, some criteria are somewhat better than others, but there is
no panacea.
2.3.5 Generalization of These Results
It is difficult to say whether or how these results apply to different contexts
than those of the above evaluation. Speculating though, I would expect that the
behavior of the stopping criteria
– is similar for other learning rules, unless they frequently make rather extreme steps in parameter space,
– is similar for other error functions, unless they are discontinuous,
– is similar for other learning tasks, as long as they are in the same ballpark with respect to their nonlinearity, number of inputs and outputs, and amount of available training data.
Note, however, that at least with respect to the learning task, deviations do occur (see Figure 2.3). More research is needed in order to describe which properties of the learning tasks lead to which differences in stopping criteria behavior, or, more generally, in order to understand how the features of tasks influence learning methods.
2.4 Why This Works
Detailed theoretical analyses of the error curves cannot yet be done for the
most interesting cases such as sigmoidal multi-layer perceptrons trained on a
modest number of examples; today they are possible for restricted scenarios
only [1, 2, 3, 24] and usually do not aim at finding the optimal stopping crite-
rion in a way comparable to the present work. However, a simplification of the
analysis performed by Wang et al. [24] or the alternative view induced by the
bias/variance decomposition of the error as described by Geman et al. [10] can
give some insights why early stopping behaves as it does.
At the beginning of training (phase I), the error is dominated by what Wang
et al. call the approximation error: the network has hardly learned anything
and is still very biased. During training this part of the error is further and
further reduced. At the same time, however, another component of the error
increases: the complexity error that is induced by the increasing variance of the
network model as the possible magnitude and diversity of the weights grows.
If we train long enough, the error will be dominated by the complexity error
(phase III). Therefore, there is a phase during training, when the approximation
and complexity (or: bias and variance) components of the error compete but
none of them dominates (phase II). See Amari et al. [1, 2] for yet another view
of the training process, using a geometrical interpretation. The task of early
stopping as described in the present work is to detect when phase II ends and
the dominance of the variance part begins.
Published theoretical results on early stopping appear to provide some nice
techniques for practical application: Wang et al. [24] offer a method for comput-
ing the stopping point, based on complexity considerations, without using a
separate validation set at all. This could save precious training examples. Amari
et al. [1, 2] compute the optimal split proportion of training data into training
and validation set.
On the other hand, unfortunately, the practical applicability of these theo-
retical analyses is severely restricted. Wang et al.’s analysis applies to networks
where only output weights are being trained; no hidden layer training is cap-
tured. It is unclear to what degree the results apply to the multi-layer networks
considered here. Amari et al.’s analysis applies to the asymptotic case of very
many training examples. The analysis does not give advice on stopping criteria;
it shows that early stopping is not useful when very many examples are avail-
able but does not cover the much more frequent case when training examples
are scarce.
There are several other theoretical works on early stopping, but none of them
answers our practical questions. Thus, given these theoretic results, one is still
left with making a good stopping decision for practical cases of multilayer net-
works with only few training examples and faced with a complicated evolution of
the validation set error as shown in Figure 2.2. This is why the present empirical
investigation was necessary.
The jagged form of the validation error curve during phase II arises because
neither bias nor variance change monotonically, let alone smoothly. The bias
error component may change abruptly because training algorithms never perform true gradient descent, but take finite steps in parameter space that sometimes
have severe results. The observed variance error component may change abruptly
because, first, the validation set error is only an estimate of the actual general-
ization error and, second, the effect of a parameter change may be very different
in different parts of parameter space.
Quantitatively, the different error minima that occur during phase II are quite
close together in terms of size, but may be rather far apart in terms of training
epoch. The exact validation error behavior seems rather unpredictable when only
a short left section of the error curve is given. The behavior is also very different
for different training situations.
For these reasons no class of stopping criteria has any big advantage over
another (on average, for the mix of situations considered here), but scaling the
same criterion to be slower always tends to gain a little generalization.
References
[1] Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H.: Statistical theory of
overtraining - is cross-validation effective. In: [23], pp. 176–182 (1996)
[2] Amari, S., Murata, N., Müller, K.-R., Finke, M., Yang, H.: Asymptotic statistical
theory of overtraining and cross-validation. IEEE Trans. on Neural Networks 8(5),
985–996 (1997)
[3] Baldi, P., Chauvin, Y.: Temporal evolution of generalization during learning in
linear networks. Neural Computation 3, 589–603 (1991)
[4] Cowan, J.D., Tesauro, G., Alspector, J. (eds.): Advances in Neural Information
Processing Systems 6. Morgan Kaufman Publishers Inc., San Mateo (1994)
[5] Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: [22], pp. 598–605
(1990)
[6] Fahlman, S.E.: An empirical study of learning speed in back-propagation networks.
Technical Report CMU-CS-88-162, School of Computer Science, Carnegie Mellon
University, Pittsburgh, PA (September 1988)
[7] Fahlman, S.E., Lebiere, C.: The Cascade-Correlation learning architecture. In:
[22], pp. 524–532 (1990)
[8] Fiesler, E.: Comparative bibliography of ontogenic neural networks (1994) (sub-
mitted for publication)
[9] Finnoff, W., Hergert, F., Zimmermann, H.G.: Improving model selection by non-
convergent methods. Neural Networks 6, 771–783 (1993)
[10] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Computation 4, 1–58 (1992)
[11] Hanson, S.J., Cowan, J.D., Giles, C.L. (eds.): Advances in Neural Information
Processing Systems 5. Morgan Kaufman Publishers Inc., San Mateo (1993)
[12] Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal
brain surgeon. In: [11], pp. 164–171 (1993)
[13] Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In: [16],
pp. 950–957 (1992)
[14] Levin, A.U., Leen, T.K., Moody, J.E.: Fast pruning using principal components.
In: [4] (1994)
[15] Lippmann, R.P., Moody, J.E., Touretzky, D.S. (eds.): Advances in Neural Infor-
mation Processing Systems 3. Morgan Kaufman Publishers Inc., San Mateo (1991)
[16] Moody, J.E., Hanson, S.J., Lippmann, R.P. (eds.): Advances in Neural Information
Processing Systems 4. Morgan Kaufman Publishers Inc., San Mateo (1992)
[17] Morgan, N., Bourlard, H.: Generalization and parameter estimation in feedforward
nets: Some experiments. In: [22], pp. 630–637 (1990)
[18] Nowlan, S.J., Hinton, G.E.: Simplifying neural networks by soft weight-sharing.
Neural Computation 4(4), 473–493 (1992)
[19] Prechelt, L.: PROBEN1 - A set of benchmarks and benchmarking rules for neural
network training algorithms. Technical Report 21/94, Fakultät für Informatik,
Universität Karlsruhe, Germany. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.gz
on ftp.ira.uka.de (September 1994)
[20] Reed, R.: Pruning algorithms - a survey. IEEE Transactions on Neural Net-
works 4(5), 740–746 (1993)
[21] Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: The RPROP algorithm. In: Proc. of the IEEE Intl. Conf. on Neural
Networks, San Francisco, CA, pp. 586–591 (April 1993)
[22] Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems 2. Mor-
gan Kaufman Publishers Inc., San Mateo (1990)
[23] Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.): Advances in Neural Infor-
mation Processing Systems 8. MIT Press, Cambridge (1996)
[24] Wang, C., Venkatesh, S.S., Judd, J.S.: Optimal stopping and effective machine
complexity in learning. In: [4] (1994)
[25] Weigend, A.S., Rumelhart, D.E., Huberman, B.A.: Generalization by weight-
elimination with application to forecasting. In: [15], pp. 875–882 (1991)
3
A Simple Trick for Estimating the Weight Decay
Parameter
Thorsteinn S. Rögnvaldsson
Centre for Computer Architecture (CCA), Halmstad University, P.O. Box 823,
S-301 18 Halmstad, Sweden
denni@cca.hh.se
http://www.hh.se/staff/denni/
Abstract. We present a simple trick to get an approximate estimate of
the weight decay parameter λ. The method combines early stopping and
weight decay into the estimate

    λ̂ = ‖∇E(W_es)‖ / (2‖W_es‖),

where W_es is the set of weights at the early stopping point, and E(W) is the training data fit error.
The estimate is demonstrated and compared to the standard cross-validation procedure for λ selection on one synthetic and four real life data sets. The result is that λ̂ is as good an estimator for the optimal weight decay parameter value as the standard search estimate, but orders of magnitude quicker to compute.
The results also show that weight decay can produce solutions that are significantly superior to committees of networks trained with early stopping.
3.1 Introduction
A regression problem which does not put constraints on the model used is ill-
posed [21], because there are infinitely many functions that can fit a finite set
of training data perfectly. Furthermore, real life data sets tend to have noisy
inputs and/or outputs, which is why models that fit the data perfectly tend
to be poor in terms of out-of-sample performance. Since the modeler’s task is
to find a model for the underlying function while not overfitting to the noise,
models have to be based on criteria which include other qualities besides their
fit to the training data.
In the neural network community the two most common methods to avoid
overfitting are early stopping and weight decay [17]. Early stopping has the ad-
vantage of being quick, since it shortens the training time, but the disadvantage
of being poorly defined and not making full use of the available data. Weight
decay, on the other hand, has the advantage of being well defined, but the dis-
advantage of being quite time consuming. This is because much time is spent
selecting a suitable value for the weight decay parameter (λ), by searching
over several values of λ and estimating the out-of-sample performance using e.g.
cross validation [25].
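The standard search mentioned here looks roughly like the following sketch, where ridge regression stands in for a neural network so that the inner training step stays short; the data, the λ grid, and the fold scheme are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=120)

def cv_error(lmbda, X, y, folds=5):
    """K-fold cross-validation error of ridge regression with weight decay lmbda."""
    idx = np.arange(len(y))
    errs = []
    for k in range(folds):
        val = idx[k::folds]
        tr = np.setdiff1d(idx, val)
        A = X[tr].T @ X[tr] + lmbda * np.eye(X.shape[1])
        w = np.linalg.solve(A, X[tr].T @ y[tr])           # closed-form regularized fit
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

lambdas = 10.0 ** np.arange(-4, 3)                        # search over several values
best = min(lambdas, key=lambda l: cv_error(l, X, y))
print("selected weight decay parameter:", best)
```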
In this paper, we present a very simple method for estimating the weight decay
parameter, for the standard weight decay case. This method combines early
stopping with weight decay, thus merging the quickness of early stopping with
the more well defined weight decay method, providing a weight decay parameter
which is essentially as good as the standard search method estimate when tested
empirically.
We also demonstrate in this paper that the arduous process of selecting λ can
be rewarding compared to simpler methods, like e.g. combining networks into
committees [16].
The paper is organized as follows: In section 2 we present the background of
how and why weight decay or early stopping should be used. In section 3 we
review the standard method for selecting λ and also introduce our new estimate.
In section 4 we give empirical evidence on how well the method works, and in
section 5 we summarize our conclusions.
3.2 Ill-Posed Problems, Regularization, and Such Things...
3.2.1 Ill-Posed Problems
In what follows, we denote the input data by x(n), the target data by y(n),
and the model (neural network) output by f(W, x(n)), where W denotes the parameters (weights) for the model. We assume a target data generating process of the form

    y(n) = φ[x(n)] + ε(n)    (3.1)

where φ is the underlying function and ε(n) are sampled from a stationary uncorrelated (IID) zero mean noise process with variance σ²
. We select models f
from a model family F , e.g. multilayer perceptrons, to learn an approximation to
the underlying function φ, based on the training data. That is, we are searching
for
f
f(W
) F such that E(f
) E(f,φ) f F, (3.2)
where E(f, φ) is a measure of the “distance” between the model f and the true
model φ. Since we only have access to the target values y, and not the underlying
function φ, E(f,φ) is often taken to be the mean square error
    E(f, φ) ≡ E(f, y) = E(W) = (1/2N) Σ_{n=1}^{N} [y(n) − f(W, x(n))]².    (3.3)
Unfortunately, minimizing (3.3) is, more often than not, an ill-posed problem.
That is, it does not meet the following three requirements [21]:
– The model (e.g. neural network) can learn the function from the training data, i.e. there exists a solution f ∈ F.
– The solution is unique.
– The solution is stable under small variations in the training data set. For instance, training with two slightly different training data sets sampled from the same process must result in similar solutions (similar when evaluated on e.g. test data).
The first and second of these requirements are often not considered serious prob-
lems. It is always possible to find a multilayer perceptron that learns the training
data perfectly by using many internal units, since any continuous function can be
constructed with a single hidden layer network with sigmoid units (see e.g. [6]),
and we may be happy with any solution and ignore questions on uniqueness.
However, a network that has learned the training data perfectly will be very
sensitive to changes in the training data. Fulfilling the first requirement is thus
usually in conflict with fulfilling the third requirement, which is a really impor-
tant requirement. A solution which changes significantly with slightly different
training sets will have very poor generalization properties.
3.2.2 Regularization
It is common to introduce so-called regularizers (called “stabilizers” by Tikhonov [21]) in order to make the learning task well posed (or at least less ill-posed). That is, instead of only minimizing an error of fit measure like (3.3), we augment it with a regularization term λR(W)
which expresses e.g. our prior beliefs about the solution.
The error functional then takes the form
    E(W) = (1/2N) Σ_{n=1}^{N} [y(n) − f(W, x(n))]² + λR(W) = E_0(W) + λR(W),    (3.4)

where λ is the regularization parameter which weighs the importance of R(W) relative to the error of fit E_0(W).
The effect of the regularization term is to shrink the model family F ,ormake
some models more likely than others. As a consequence, solutions become more
stable to small perturbations in the training data.
The term “regularization” encompasses all techniques which make use of
penalty terms added to the error measure to avoid overfitting. This includes
e.g. weight decay [17], weight elimination [26], soft weight sharing [15], Lapla-
cian weight decay [12] [27], and smoothness regularization [2] [9] [14] . Certain
forms of “hints” [1] can also be called regularization.
3.2.3 Bias and Variance
The benefit of regularization is often described in the context of model bias and
model variance. This originates from the separation of the expected generalization error E_gen into three terms [8]
    E_gen = ∫ ⟨[y(x) − f(x)]²⟩ p(x) dx
          = ∫ [φ(x) − ⟨f(x)⟩]² p(x) dx + ∫ ⟨[f(x) − ⟨f(x)⟩]²⟩ p(x) dx + ∫ ⟨[y(x) − φ(x)]²⟩ p(x) dx
          = Bias² + Variance + σ²,    (3.5)
where ⟨·⟩ denotes taking the expectation over an ensemble of training sets. Here
p(x) denotes the input data probability density.
A high sensitivity to training data noise corresponds to a large model variance. A large bias term means either that φ ∉ F, or that φ is downweighted in favour of other models in F. We thus have a trade-off between model bias and model variance, which corresponds to the trade-off between the first and third requirements on well-posed problems.
Model bias is weighed versus model variance by selecting both a parametric form for R(W) and an optimal² value for the regularization parameter λ.
Many neural network practitioners ignore the first part and choose weight
decay by default, which corresponds to a Gaussian parametric form for the prior
on W . Weight decay is, however, not always the best choice (in fact, it is most
certainly not the best choice for all problems). Weight decay does not, for instance, consider the function the network is producing; it only puts a constraint on the parameters. Another, perhaps more correct, choice would be to constrain
the higher order derivatives of the network function (which is commonplace in
statistics) like in e.g. [14].
3.2.4 Bayesian Framework
From a Bayesian and maximum likelihood perspective, prior information about the model (f) is weighed against the likelihood of the training data (D) through Bayes theorem (see [4] for a discussion on this). Denote the probability for observing data set D by p(D), the prior distribution of models f by p(f), and the likelihood for observing the data D, if f is the correct model, by p(D|f). We then have for the posterior probability p(f|D) for the model f given the observed data D
$$p(f|D) = \frac{p(D|f)\,p(f)}{p(D)}$$
$$-\ln p(f|D) = -\ln p(D|f) - \ln p(f) + \ln p(D)$$
$$-\ln p(f|D) = \sum_{n=1}^{N}\bigl[y(n) - f(W,x(n))\bigr]^{2} - \ln p(f) + \text{constant}, \tag{3.6}$$
where Gaussian noise ε is assumed in the last step. If we identify 2NλR(W) with the negative logarithm of the model prior, −ln p(f), then maximizing p(f|D) is equivalent to minimizing expression (3.4).
² Optimality is usually measured via cross-validation or some similar method.
From this perspective, choosing R(W ) is equivalent to choosing a parameter-
ized form for the model prior p(f), and selecting a value for λ corresponds to
estimating the parameters for the prior.
3.2.5 Weight Decay
Weight decay [17] is the neural network equivalent to the Ridge Regression [11]
method. In this case R(W) = ‖W‖² = Σ_k w_k² and the error functional is
$$E(W) = E_{0}(W) + \lambda R(W) = \frac{1}{2N}\sum_{n=1}^{N}\bigl[y(n) - f(W,x(n))\bigr]^{2} + \lambda \|W\|^{2}, \tag{3.7}$$
and λ is usually referred to as the weight decay parameter. In the Bayesian framework, weight decay means implicitly imposing the model prior
$$p[f(W)] = \sqrt{\frac{\lambda}{2\pi\sigma^{2}}}\,\exp\!\left(-\frac{\lambda \|W\|^{2}}{2\sigma^{2}}\right) \tag{3.8}$$
where σ² is the variance of the noise in the data.
Weight decay often improves the generalization properties of neural network
models, for reasons outlined above.
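As an illustration, the weight decay error functional (3.7) is easy to evaluate for a given data set. The following minimal NumPy sketch assumes a prediction function f(W, x) and arrays of inputs and targets; all names are placeholders rather than anything from the original text:

```python
import numpy as np

def weight_decay_error(f, W, x, y, lam):
    """Evaluate E(W) = E_0(W) + lambda * ||W||^2, as in eq. (3.7).

    f   : callable mapping (W, x) to predictions of shape (N,)
    W   : 1-d array of network weights
    x   : inputs, shape (N, d);  y : targets, shape (N,)
    lam : weight decay parameter lambda
    """
    residuals = y - f(W, x)
    e0 = 0.5 * np.mean(residuals ** 2)   # error of fit E_0(W) = (1/2N) sum [y - f]^2
    return e0 + lam * np.dot(W, W)       # add the penalty lambda * ||W||^2
```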
3.2.6 Early Stopping
Undoubtedly, the simplest and most widely used method to avoid overfitting is
to stop training before the training set has been learned perfectly. This is done
by setting aside a fraction of the training data for estimating the out-of-sample
performance. This data set is called the validation data set. Training is then
stopped when the error on the validation set starts to increase. Early stopping
often shortens the training time significantly, but suffers from being ill-defined, since there really is no well-defined stopping point, and from being wasteful with data, since a part of the data is set aside.
There is a connection between early stopping and weight decay, if learning
starts from small weights, since weight decay applies a potential which forces
all weights towards zero. For instance, Sjöberg and Ljung [20] show that, if a
constant learning rate η is used, the number of iterations n at which training is
stopped is related to the weight decay parameter λ roughly as
$$\lambda \approx \frac{1}{2\eta n}. \tag{3.9}$$
This does not, however, mean that using early stopping is equivalent to using
weight decay in practice. Expression (3.9) is based on a constant learning rate,
a local expansion around the optimal stopping point, ignoring local minima,
and assumes small input noise levels, which may not reflect the situation when
overfitting is a serious problem. The choice of learning algorithm can also affect
the early stopping point, and one cannot expect (3.9) to hold exactly in the
practical case.
Inspired by this connection between early stopping and weight decay, we use
early stopping in the following section to estimate the weight decay parameter
λ.
3.3 Estimating λ
From a pure Bayesian point of view, the prior is something we know/assume
in advance and do not use the training data to select (see e.g. [5]). There
is consequently no such thing as “λ selection” in the pure Bayesian model selec-
tion scheme. This is of course perfectly fine if the prior is correct. However, if
we suspect that our choice of prior is less than perfect, then we are better off
if we take an “empirical Bayes” approach and use the data to tune the prior,
through λ.
Several options for selecting λ have been proposed. Weigend et al. [26] present,
for a slightly different weight cost term, a set of heuristic rules for changing λ during
the training. Although Weigend et al. demonstrate the use of these heuristics on a
couple of time series problems, we cannot get these rules to work consistently to our
satisfaction. A more principled approach is to try several values of λ and estimate
the out-of-sample error, either by correcting the training error with some factor or
term, or by using cross-validation. The former is done in e.g. [10], [23], and [24] (see
also references therein). The latter is done by e.g. [25].
The method of using validation data for estimating the out-of-sample error is
robust but slow since it requires training several models. We use cross-validation
here because of its reliability.
3.3.1 Search Estimates
Finding the optimal λ requires the use of a search algorithm, which must be
robust because the validation error can be very noisy. A simple and straight-
forward way is to start at some large λ where the validation error is large, due
to the large model bias, and step towards lower values until the out-of-sample
error becomes large again, due to the large model variance. In our experience, it
often makes sense to do the search in log λ (i.e. with equally spaced increments
in log λ).
The result of such a search is a set of K values {λ_k} with corresponding average n-fold cross-validation errors {log E_nCV,k} and standard deviations {σ_nCV,k} for the validation errors. These are defined as
$$\log E_{nCV,k} = \frac{1}{n}\sum_{j=1}^{n} \log E_{j,k} \tag{3.10}$$
$$\sigma^{2}_{nCV,k} = \frac{1}{n-1}\sum_{j=1}^{n} \bigl(\log E_{j,k} - \log E_{nCV,k}\bigr)^{2} \tag{3.11}$$
when λ = λ_k. The number of validation data sets is n and E_{j,k} denotes the validation error when λ = λ_k and we use validation set j. Taking logarithms
is motivated by our observation that the validation error distribution looks ap-
proximately log-normal and we use this in our selection of the optimal λ value
below.
Once the search is finished, the optimal λ is selected. This is not necessarily
trivial since a large range of values may look equally good, or one value may
have a small average cross-validation error with a large variation in this error,
and another value may have a slightly higher average cross-validation error with
a small variation in this error. The simplest approach is to look at a plot of
the validation errors versus λ and make a judgement on where the optimal λ
is, but this adds an undesired subjectiveness to the choice. Another is to take
a weighted average over the different λ values, which is what we use here (see
Ripley [19] for a discussion on variants of λ selection methods).
Our estimate for the optimal λ is the value
$$\hat{\lambda}_{\mathrm{opt}} = \frac{\sum_{k=1}^{K} n_{k}\lambda_{k}}{\sum_{k=1}^{K} n_{k}} \tag{3.12}$$
where n_k is the number of times λ_k corresponds to the minimum validation error when we sample validation errors from K log-normal distributions with means log E_nCV,k and standard deviations σ_nCV,k, assuming that the validation errors
are independent. This is illustrated on a hypothetical example in Figure 3.1.
The choice (3.12) was made after confirming that it often agrees well with our
subjective choice for λ. We refer to this below as a “Monte Carlo estimate” of λ.
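A minimal sketch of this Monte Carlo selection is given below. It assumes the search has already produced, for each candidate λ, the log-mean and standard deviation of the cross-validation errors from eqs. (3.10)–(3.11), and, following the procedure of Figure 3.1, it averages the winners in log λ; function and argument names are illustrative only:

```python
import numpy as np

def monte_carlo_lambda(lambdas, log_e_mean, log_e_std, n_samples=10_000, rng=None):
    """Monte Carlo estimate of lambda_opt in the spirit of eq. (3.12).

    lambdas    : array of K candidate lambda values from the search grid
    log_e_mean : array of K average CV errors log E_nCV,k, eq. (3.10)
    log_e_std  : array of K standard deviations sigma_nCV,k, eq. (3.11)
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(lambdas)
    # Draw one error per candidate from its lognormal (normal in log space),
    # pick the lambda with the smallest sampled error, and repeat n_samples times.
    samples = rng.normal(log_e_mean, log_e_std, size=(n_samples, K))
    winners = samples.argmin(axis=1)
    n_k = np.bincount(winners, minlength=K)        # how often each lambda "wins"
    # Weighted average of log lambda, as in the bottom plate of Figure 3.1.
    log_lambda_opt = np.dot(n_k, np.log10(lambdas)) / n_k.sum()
    return 10.0 ** log_lambda_opt
```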
3.3.2 Two Early Stopping Estimates
If W* is the set of weights when E(W) in eq. (3.4) is minimized, then
$$\nabla E(W^{*}) = \nabla E_{0}(W^{*}) + \lambda \nabla R(W^{*}) = 0, \tag{3.13}$$
which implies
$$\lambda = \frac{\|\nabla E_{0}(W^{*})\|}{\|\nabla R(W^{*})\|} \tag{3.14}$$
for the regularization parameter λ. Thus, if we have a reasonable estimate of W*, or of ∇E₀(W*) and ∇R(W*), then we can use this to estimate λ. An appealingly simple way of estimating W* is to use early stopping, because of its connection with weight decay.
Denoting the set of weights at the early stopping point by W_es, we have
$$\hat{\lambda}_{1} = \frac{\|\nabla E_{0}(W_{\mathrm{es}})\|}{\|\nabla R(W_{\mathrm{es}})\|}, \tag{3.15}$$
as a simple estimate for λ. A second possibility is to consider the whole set of linear equations defined by (3.13) and minimize the squared error
$$\bigl\|\nabla E_{0}(W_{\mathrm{es}}) + \lambda \nabla R(W_{\mathrm{es}})\bigr\|^{2} = \bigl\|\nabla E_{0}(W_{\mathrm{es}})\bigr\|^{2} + 2\lambda\,\nabla E_{0}(W_{\mathrm{es}}) \cdot \nabla R(W_{\mathrm{es}}) + \lambda^{2}\bigl\|\nabla R(W_{\mathrm{es}})\bigr\|^{2} \tag{3.16}$$
Fig. 3.1. Illustration of the procedure for estimating λ̂_opt on a hypothetical example. From the search we have a set of K lognormal distributions with means log E_nCV,k and variances σ²_nCV,k, which is illustrated in the top plate. From these K distributions, we sample K error values and select the λ corresponding to the minimum error value as "winner". This is repeated several times (100 in the figure but 10,000 times in the experiments in the text), collecting statistics on how often different λ values are winners, and the mean log λ is computed. This is illustrated in the bottom plate, which shows the histogram resulting from sampling 100 times. From this we get log λ̂_opt = −2.48 ± 1.15, which gives us λ = 10⁻²·⁴⁸ = 0.003 for training the "best" network.
with respect to λ. That is, solving the equation
$$\frac{\partial}{\partial\lambda}\bigl\|\nabla E_{0}(W_{\mathrm{es}}) + \lambda \nabla R(W_{\mathrm{es}})\bigr\|^{2} = 0 \tag{3.17}$$
which gives
$$\hat{\lambda}_{2} = \max\!\left(0,\; \frac{-\nabla E_{0}(W_{\mathrm{es}}) \cdot \nabla R(W_{\mathrm{es}})}{\bigl\|\nabla R(W_{\mathrm{es}})\bigr\|^{2}}\right). \tag{3.18}$$
The estimate is bounded from below since λ must be positive.
The second estimate, λ̂₂, corresponds to a linear regression without intercept term on the set of points {∂ᵢE₀(W_es), ∂ᵢR(W_es)}, whereas the first estimate, λ̂₁, is closer to the ratio max[|∂ᵢE₀(W_es)|]/max[|∂ᵢR(W_es)|]. It follows from the Cauchy-Schwarz inequality that
$$\hat{\lambda}_{1} \geq \hat{\lambda}_{2}. \tag{3.19}$$
For the specific case of weight decay, where R(W) = ‖W‖², expressions (3.15) and (3.18) become
$$\hat{\lambda}_{1} = \frac{\bigl\|\nabla E_{0}(W_{\mathrm{es}})\bigr\|}{2\|W_{\mathrm{es}}\|}, \tag{3.20}$$
$$\hat{\lambda}_{2} = \max\!\left(0,\; \frac{-\nabla E_{0}(W_{\mathrm{es}}) \cdot W_{\mathrm{es}}}{2\|W_{\mathrm{es}}\|^{2}}\right). \tag{3.21}$$
These estimates are sensitive to the particularities of the training and validation
data sets used, and possibly also to the training algorithm. One must therefore
average them over different validation and training sets. It is, however, still
quicker to do this than to do a search since early stopping training often is
several orders of magnitude faster to do than a full minimization of (3.7).
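For quadratic weight decay, the two early stopping estimates (3.20) and (3.21) reduce to a few vector operations on the early-stopping weights and the gradient of the fit error evaluated there. A minimal sketch (array names are placeholders) is:

```python
import numpy as np

def weight_decay_estimates(grad_e0, w_es):
    """Early-stopping estimates of the weight decay parameter, eqs. (3.20)-(3.21).

    grad_e0 : gradient of the unregularized error E_0 evaluated at the early
              stopping weights W_es (1-d array)
    w_es    : the weights at the early stopping point (1-d array)
    """
    norm_w = np.linalg.norm(w_es)
    # lambda_1 = ||grad E_0(W_es)|| / (2 ||W_es||), eq. (3.20)
    lam1 = np.linalg.norm(grad_e0) / (2.0 * norm_w)
    # lambda_2 = max(0, -grad E_0(W_es) . W_es / (2 ||W_es||^2)), eq. (3.21)
    lam2 = max(0.0, -np.dot(grad_e0, w_es) / (2.0 * norm_w ** 2))
    return lam1, lam2
```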
One way to view the estimates (3.15) and (3.18) is as the weight decay pa-
rameters that correspond to the early stopping point. However, our aim here is
not to imitate early stopping with weight decay, but to use early stopping to
estimate the weight decay parameter λ. We hope that using weight decay with
this λ value will actually result in better out-of-sample performance than what
we get from doing early stopping (the whole exercise becomes rather meaningless
if this is not the case).
As a sidenote, we imagine that (3.15) and (3.18) could be used also to esti-
mate weight decay parameters in cases when different weight decays are used
for weights in different layers. This would then be done by considering these
estimates for different groups of weights.
3.4 Experiments
3.4.1 Data Sets
We here demonstrate the performance of our algorithm on a set of five regression
problems. For each problem, we vary either the number of inputs, the number of
hidden units, or the amount of training data to study the effects of the numbers
of parameters relative to the number of training data points. The five problems
are:
Synthetic Bilinear Problem. The task is to model a bilinear function of the
form
$$\varphi(x_{1}, x_{2}) = x_{1} x_{2}. \tag{3.22}$$
We use three different sizes of training data sets, M ∈{20, 40, 100}, but a con-
stant validation set size of 10 patterns. The validation patterns are in addition
to the M training patterns. The test error, or generalization error, is computed
by numerical integration over 201 × 201 data points on a two-dimensional lat-
tice (x₁, x₂) ∈ [−1, 1]². The target values (but not the inputs) are contaminated with three different levels of Gaussian noise with standard deviation σ ∈ {0.1, 0.2, 0.5}. This gives a total of 3 × 3 = 9 different experiments on
this particular problem, which we refer to as setup A1, A2, ..., and A9 below.
This allows controlled studies w.r.t. noise levels and training set sizes, while
keeping the network architecture constant (2 inputs, 8 tanh hidden, and one
linear output).
Predicting Puget Sound Power and Light Co. Power Load between
7 and 8 a.m. the Following Day. This data set is taken from the Puget
Sound Power and Light Co’s power prediction competition [3]. The winner of
this competition used a set of linear models, one for each hour of the day. We
have selected the subproblem of predicting the load between 7 and 8 a.m. 24 hrs.
in advance. This hour shows the largest variation in power load. The training set
consists of 844 weekdays between January 1985 and September 1990. Of these,
150 days are randomly selected and used for validation. We use 115 winter week-
days, from between November 1990 and March 1992, for out-of-sample testing.
The inputs are things like current load, average load during the last 24 hours,
average load during the last week, time of the year, etc., giving a total of 15
inputs. Three different numbers of internal units are tried on this task: 15, 10,
and 5, and we refer to these experiments as B1, B2, and B3 below.
Predicting Daily Riverflow in Two Icelandic Rivers. This problem is
tabulated in [22], and the task is to model tomorrow’s average flow of water
in one of two Icelandic rivers, knowing today’s and previous days’ waterflow,
temperature, and precipitation. The training set consists of 731 data points,
corresponding to the years 1972 and 1973, out of which we randomly sample
150 datapoints for validation. The test set has 365 data points (the year 1974).
We use two different lengths of lags, 8 or 4 days back, which correspond to 24
or 12 inputs, while the number of internal units is kept constant at 12. These
experiments are referred to as C1, C2, C3, and C4 below.
Predicting the Wolf Sunspots Time Series. This time series has been used
several times in the context of demonstrating new regularization techniques, for
instance by [15] and [26]. We try three different network architectures on this prob-
lem, always keeping 12 input units but using 4, 8, or 12 internal units in the network.
These experiments are referred to as setup D1, D2, and D3 below. The training set
size is kept constant at M = 221 (years 1700-1920), out of which we randomly
pick 22 patterns for validation. We test our models under four different conditions:
Single step prediction on “test set 1” with 35 data points (years 1921-1955), 4-step
iterated prediction on “test set 1”, 8-step iterated prediction on all 74 available test
years (1921-1994), and 11-step iterated prediction on all available test years. These
test conditions are coded as s1, m4, m8, and m11.
Estimating the Peak Pressure Position in a Combustion Engine. This
is a data set with 4 input variables (ignition time, engine load, engine speed, and
air/fuel ratio) and only 49 training data points, out of which we randomly pick 9
patterns for validation. The test set consists of 35 data points, which have been
measured under slightly different conditions than the training data. We try four
different numbers of internal units on this task: 2, 4, 8, or 12, and refer to these
experiments as E1, E2, E3, and E4.
3.4.2 Experimental Procedure
The experimental procedure is the same for all problems: We begin by estimating
λ in the "traditional" way by searching over the region log λ ∈ [−6.5, 1.0] in steps of Δ log λ = 0.5. For each λ value, we train 10 networks using the Rprop training algorithm³ [18]. Each network is trained until the total error (3.7) is minimized, measured by
$$\log_{10}\!\left(\frac{1}{100}\sum_{i=1}^{100}\frac{|\Delta E_{i}|}{\|\Delta W_{i}\|}\right) < -5, \tag{3.23}$$
where the sum runs over the most recent 100 epochs, or until 10⁵ epochs have passed, whichever occurs first. The convergence criterion (3.23) is usually fulfilled within 10⁵ epochs. New validation and training sets are sampled for each of the 10 networks, but the different validation sets are allowed to overlap. Means and standard deviations, log E_nCV,k and σ_nCV,k, for the errors are estimated from these 10 network runs, assuming a lognormal distribution for the validation errors. Figure 3.2 shows an example of such a search for the Wolf sunspot problem, using a neural network with 12 inputs, 8 internal units, and 1 linear output.
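A small sketch of this stopping test, under the reading of (3.23) given above (absolute error change divided by the weight-change norm, averaged over the last 100 epochs), might look as follows; the argument names are illustrative:

```python
import numpy as np

def rprop_converged(delta_e, delta_w_norm, window=100, threshold=-5.0):
    """Convergence test in the spirit of eq. (3.23): log10 of the mean ratio
    |dE| / ||dW|| over the last `window` epochs must drop below `threshold`.

    delta_e      : sequence of per-epoch changes in the total error (3.7)
    delta_w_norm : sequence of per-epoch weight-change norms ||dW||
    """
    if len(delta_e) < window:
        return False
    ratios = np.abs(delta_e[-window:]) / np.asarray(delta_w_norm[-window:])
    return np.log10(ratios.mean()) < threshold
```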
Fig. 3.2. Left panel: Training and validation errors on the Wolf sunspot time series, setup D2, plotted versus the weight decay parameter λ. Each point corresponds to an average over 10 runs with different validation and training sets. The error bars mark 95% confidence limits for the average validation and training errors, under the assumption that the errors are lognormally distributed. The objective Monte Carlo method gives log λ̂_opt = −2.00 ± 0.31. Right panel: The corresponding plot for the test error on the sunspots "test set 1". The network architecture is 12 inputs, 8 tanh internal units, and 1 linear output.
Using the objective Monte Carlo method described above, we estimate an optimal λ̂_opt value from this search. This value is then used to train 10 new

³ Initial tests showed that the Rprop algorithm was considerably more efficient and
robust than e.g. backprop or conjugate gradients in minimizing the error. We did
not, however, try true second order algorithms like Levenberg-Marquardt or Quasi-
Newton.
Fig. 3.3. Left panel: Histogram showing the estimated values λ̂₁ for 100 different training runs, using different training and validation sets each time (⟨log10 λ⟩ = −2.4 ± 0.4). Right panel: Similar histogram for λ̂₂ (⟨log10 λ⟩ = −3.7 ± 0.6). The problem (D2) is the same as that depicted in Figure 3.2.
networks with all the training data (no validation set). The test errors for these
networks are then computed using the held out test set.
A total of 16 × 10 = 160 network runs are thus done to select the λ̂_opt for
each experiment. This corresponds to a few days’ or a week’s work, depending
on available hardware and the size of the problem. Although this is in excess of
what is really needed in practice (one could get away with about half as many
runs in a real application) the time spent doing this is aggravating. The times
needed for doing the searches described in this paper ranged from 10 up to 400
cpu-hours, depending on the problem and the computer⁴. For comparison, the
early stopping experiments described below took between 10 cpu-minutes and
14 cpu-hours. There was typically a ratio of 40 between the time needed for a
search and the time needed for an early stopping estimate.
We then estimate λ̂₁ and λ̂₂ by training 100 networks with early stopping. One
problem here is that the stopping point is ill-defined, i.e. the first observed mini-
mum in the validation error is not necessarily the minimum where one should stop.
The validation error quite often decreases again beyond this point. To avoid such
problems, we keep a record of the weights corresponding to the latest minimum
validation error and continue training beyond that point. The training is stopped
when as many epochs have passed as it took to find the validation error minimum
without encountering a new minimum. The weights corresponding to the last vali-
dation error minimum are then used as the early stopping weights. For example, if
the validation error has a minimum at say 250 epochs, we then wait until a total of
500 epochs have passed before deciding on that particular stopping point. From the
100 networks, we get 100 estimates for λ̂₁ and λ̂₂. We take the logarithm of these and compute means log λ̂₁ and log λ̂₂, and corresponding standard deviations.
⁴ A variety of computers were used for the simulations, including NeXT, Sun Sparc,
DEC Alpha, and Pentium computers running Solaris.
The resulting arithmetic mean values are taken as the estimates for λ and the stan-
dard deviations are used as measures of the estimation error. The arithmetic means
are then used to train 10 networks which use all the training data. Figure 3.3 shows
the histograms corresponding to the problem presented in Figure 3.2.
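The stopping rule described above can be written as a small wrapper around any training loop; the callables in this sketch are hypothetical stand-ins for whatever training code is actually used:

```python
def early_stopping_weights(train_one_epoch, validation_error, get_weights):
    """Early stopping with the rule used in the text: remember the weights at
    the best validation error seen so far, and stop once as many epochs have
    passed since that minimum as it took to reach it (e.g. minimum at epoch
    250 -> decide at epoch 500, unless a new minimum appears first).

    train_one_epoch  : callable performing one training epoch
    validation_error : callable returning the current validation error
    get_weights      : callable returning a copy of the current weights
    """
    best_err, best_epoch, best_weights = float("inf"), 0, get_weights()
    epoch = 0
    while True:
        train_one_epoch()
        epoch += 1
        err = validation_error()
        if err < best_err:
            best_err, best_epoch, best_weights = err, epoch, get_weights()
        elif best_epoch > 0 and epoch >= 2 * best_epoch:
            return best_weights, best_epoch   # W_es and the stopping epoch
```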
When comparing test errors achieved with different methods, we use the
Wilcoxon rank test [13], also called the Mann-Whitney test, and report dif-
ferences at 95% confidence level.
3.4.3 Quality of the λ Estimates
As a first test of the quality of the estimates λ̂₁ and λ̂₂, we check how well they agree with the λ̂_opt estimate, which can be considered a "truth". The estimates for all the problem setups are tabulated in Table 3.1 and plotted in Figure 3.4.
Table 3.1. Estimates of λ for the 23 different problem setups. Code A corresponds to the synthetic problem, code B to the Power prediction, code C to the riverflow prediction, code D to the Sunspots series, and code E to the maximum pressure position problem. For the log λ̂_opt column, errors are the standard deviations of the Monte Carlo estimate. For the early stopping estimates, errors are the standard deviations of the estimates.

Problem                       log λ̂_opt       log λ̂₁          log λ̂₂
A1 (M = 20, σ = 0.1)          -2.82 ± 0.04    -2.71 ± 0.66    -3.44 ± 1.14
A2 (M = 20, σ = 0.2)          -2.67 ± 0.42    -2.32 ± 0.58    -3.20 ± 0.96
A3 (M = 20, σ = 0.5)          -0.49 ± 1.01    -1.93 ± 0.78    -3.14 ± 1.15
A4 (M = 40, σ = 0.1)          -2.93 ± 0.49    -2.85 ± 0.73    -3.56 ± 0.87
A5 (M = 40, σ = 0.2)          -2.53 ± 0.34    -2.41 ± 0.64    -2.91 ± 0.68
A6 (M = 40, σ = 0.5)          -2.43 ± 0.44    -2.13 ± 0.74    -2.85 ± 0.77
A7 (M = 100, σ = 0.1)         -3.45 ± 0.78    -3.01 ± 0.86    -3.74 ± 0.93
A8 (M = 100, σ = 0.2)         -3.34 ± 0.71    -2.70 ± 0.73    -3.33 ± 0.92
A9 (M = 100, σ = 0.5)         -3.31 ± 0.82    -2.34 ± 0.63    -3.13 ± 1.06
B1 (Power, 15 hidden)         -3.05 ± 0.21    -3.82 ± 0.42    -5.20 ± 0.70
B2 (Power, 10 hidden)         -3.57 ± 0.35    -3.75 ± 0.45    -4.93 ± 0.50
B3 (Power, 5 hidden)          -4.35 ± 0.66    -3.78 ± 0.52    -5.03 ± 0.74
C1 (Jökulsá Eystra, 8 lags)   -2.50 ± 0.10    -3.10 ± 0.33    -4.57 ± 0.59
C2 (Jökulsá Eystra, 4 lags)   -2.53 ± 0.12    -3.15 ± 0.40    -4.20 ± 0.59
C3 (Vatnsdalsá, 8 lags)       -2.48 ± 0.11    -2.65 ± 0.40    -3.92 ± 0.56
C4 (Vatnsdalsá, 4 lags)       -2.39 ± 0.55    -2.67 ± 0.45    -3.70 ± 0.62
D1 (Sunspots, 12 hidden)      -2.48 ± 0.12    -2.48 ± 0.50    -3.70 ± 0.42
D2 (Sunspots, 8 hidden)       -2.00 ± 0.31    -2.43 ± 0.45    -3.66 ± 0.60
D3 (Sunspots, 4 hidden)       -2.51 ± 0.44    -2.39 ± 0.48    -3.54 ± 0.65
E1 (Pressure, 12 hidden)      -3.13 ± 0.43    -3.03 ± 0.70    -4.69 ± 0.91
E2 (Pressure, 8 hidden)       -3.01 ± 0.52    -3.02 ± 0.64    -4.72 ± 0.82
E3 (Pressure, 4 hidden)       -3.83 ± 0.80    -3.07 ± 0.71    -4.50 ± 1.24
E4 (Pressure, 2 hidden)       -4.65 ± 0.78    -3.46 ± 1.34    -4.21 ± 1.40
The linear correlation between log λ̂₁ and log λ̂_opt is 0.71, which is more than three standard deviations larger than the expected correlation between 23 random points. Furthermore, a linear regression with intercept gives the result
$$\hat{\lambda}_{\mathrm{opt}} = 0.30 + 1.13\,\hat{\lambda}_{1}. \tag{3.24}$$
Thus, λ̂₁ is a fairly good estimator of λ̂_opt.
The linear correlation between λ̂₂ and λ̂_opt is 0.48, more than two standard deviations from the random correlation. A linear regression gives
$$\hat{\lambda}_{\mathrm{opt}} = -0.66 + 0.57\,\hat{\lambda}_{2}, \tag{3.25}$$
and the second estimator λ̂₂ is clearly a poorer estimator of λ̂_opt.
Fig. 3.4. Plot of the results in Table 3.1. Left plate: The λ̂₁ estimate plotted versus λ̂_opt. The linear correlation between log λ̂₁ and log λ̂_opt is 0.71. Right plate: λ̂₂ plotted versus λ̂_opt. The linear correlation between log λ̂₂ and log λ̂_opt is 0.48. The sizes of the crosses correspond to the error bars in Table 3.1.
We next compare the out-of-sample performances of these different λ esti-
mates, which is what really matters to the practitioner. Table 3.2 lists the dif-
ferences in out-of-sample performance when using the early stopping estimates
or the search estimate. A “+” means that using the early stop estimate results
in significantly (95% significance level) lower test error than if λ̂_opt is used. Similarly, a "–" means that the search estimate gives significantly lower test error
than the early stopping estimates. A “0” means there is no significant difference.
The conclusion from Table 3.2 is that λ̂₂ is significantly worse than λ̂_opt, but that there is no consistent difference between λ̂₁ and λ̂_opt. The two estimates are essentially equal, in terms of test error. In some cases, like the power prediction problem, it would have been beneficial to do a small search around the early stop estimate to check for a possibly better value.
The test errors for the combustion engine (setups E) are not included in Tables
3.2 (and 3.3) because the test set is too different from the training set to provide
relevant results. In fact, no regularized network is significantly better than an
unregularized network on this problem.
Table 3.2. Relative performance of single networks trained using the estimates λ̂₁ and λ̂₂, for the weight decay parameter, and the performance of single networks trained using the search estimate λ̂_opt. The relative performances are reported as: "+" means that using λ̂ᵢ results in a test error which is significantly lower than what the search estimate λ̂_opt gives, "0" means that the performances are equivalent, and "–" means that using λ̂_opt results in a lower test error than when using λ̂ᵢ. All results are reported for a 95% confidence level when using the Wilcoxon test. See the text on why the E results are left out.
Problem Setup                 λ̂₁ vs. λ̂_opt    λ̂₂ vs. λ̂_opt
A1 (M = 20, σ = 0.1)          0    0
A2 (M = 20, σ = 0.2)          0
A3 (M = 20, σ = 0.5)          0    0
A4 (M = 40, σ = 0.1)          0    0
A5 (M = 40, σ = 0.2)          0
A6 (M = 40, σ = 0.5)          0    0
A7 (M = 100, σ = 0.1)         0
A8 (M = 100, σ = 0.2)         0    0
A9 (M = 100, σ = 0.5)         +    0
B1 (Power, 15 hidden)
B2 (Power, 10 hidden)
B3 (Power, 5 hidden) +
C1 (Jökulsá Eystra, 8 lags)
C2 (Jökulsá Eystra, 4 lags) 0
C3 (Vatnsdalsá, 8 lags)
C4 (Vatnsdalsá, 4 lags) 0
D1.s1 (Sunspots, 12 hidden) 0
D2.s1 (Sunspots, 8 hidden) +
D3.s1 (Sunspots, 4 hidden) 0
D1.m4 (Sunspots, 12 hidden) 0
D2.m4 (Sunspots, 8 hidden) +
D3.m4 (Sunspots, 4 hidden) +
D1.m8 (Sunspots, 12 hidden) 0
D2.m8 (Sunspots, 8 hidden)
D3.m8 (Sunspots, 4 hidden) +
D1.m11 (Sunspots, 12 hidden) 0 0
D2.m11 (Sunspots, 8 hidden) +
D3.m11 (Sunspots, 4 hidden) 0
3.4.4 Weight Decay versus Early Stopping Committees
Having trained all these early stopping networks, it is reasonable to ask whether using them to estimate λ for a weight decay network is the optimal use of these networks. Another possible use is, for instance, to construct a committee [16] from them.
To test this, we compare the test errors for our regularized networks with
those when using a committee of 10 networks trained with early stopping. The
results are listed in Table 3.3.
Some observations from Table 3.3, bearing in mind that the set of problems
is small, are: Early stopping committees seem like the better option when the
problem is very noisy (setups A3, A6, and A9), and when the network does
not have very many degrees of freedom (setups B3, C4, and D3). Weight decay
networks, on the other hand, seem to work better than committees on problems
with many degrees of freedom (setups B1 and C3), problems with low noise
levels and much data (setup A7), and problems where the prediction is iterated
through the network (m4, m8, and m11 setups). We emphasize, however, that
these conclusions are drawn from a limited set of problems and that all problems
tend to have their own set of weird characteristics.
We also check which model works best on each problem. On the power pre-
diction, the best overall model is a large network (B1) which is trained with
weight decay. On the river prediction problems, the best models are small (C2
and C4) and trained with either weight decay (Jökulsá Eystra) or early stopping
and then combined into committees (Vatnsdalsá). On the sunspot problem, the
best overall model is a large network (D1) trained with weight decay.
These networks are competitive with previous results on the same data sets.
The performance of the power load B1 weight decay networks, using λ̂_opt, is significantly better than what a human expert produces, and also significantly better than the results by the winner of the Puget Sound Power and Light Co. Power Load Competition [7], although the difference is small. The test results are summarized in Figure 3.5. The performance of the sunspot D1 weight decay network is comparable with the network by Weigend et al., listed in [26]. Figure 3.6 shows the performance of the D1 network trained with weight decay, λ = λ̂₁, and compares it to the results by Weigend et al. [26]. The weight decay
network produces these results using a considerably simpler λ selection method
and regularization cost than the one presented in [26].
From these anecdotal results, one could be bold and say that weight decay
shows a slight edge over early stopping committees. However, it is fair to say that
it is a good idea to try both committees and weight decay when constructing
predictor models.
It is emphasized that these results are from a small set of problems, but that
these problems (except perhaps for the synthetic data) are all realistic in the
sense that the datasets are small and noisy.
Table 3.3. Relative performance of single networks trained using weight decay and
early stopping committees with 10 members. The relative performance of weight decay
(WD) and 10 member early stopping committees are reported as: “+” means that
weight decay is significantly better than committees, “0” means that weight decay and
committees are equivalent, and “–” means that committees are better than weight
decay. All results are reported for a 95% confidence level when using the Wilcoxon
test. See the text on why the E results are left out.
Problem Setup                 WD(λ̂_opt) vs. Comm.    WD(λ̂₁) vs. Comm.
A1 (M = 20, σ = 0.1)          0    +
A2 (M = 20, σ = 0.2)          0    0
A3 (M = 20, σ = 0.5)
A4 (M = 40, σ = 0.1)          0    0
A5 (M = 40, σ = 0.2)          +    +
A6 (M = 40, σ = 0.5)
A7 (M = 100, σ = 0.1)         +    +
A8 (M = 100, σ = 0.2)         0    0
A9 (M = 100, σ = 0.5)         0
B1 (Power, 15 hidden) +
B2 (Power, 10 hidden) 0
B3 (Power, 5 hidden)
C1 (Jökulsá Eystra, 8 lags) + 0
C2 (Jökulsá Eystra, 4 lags) + 0
C3 (Vatnsdalsá, 8 lags) + +
C4 (Vatnsdalsá, 4 lags)
D1.s1 (Sunspots, 12 hidden) 0 0
D2.s1 (Sunspots, 8 hidden) 0
D3.s1 (Sunspots, 4 hidden)
D1.m4 (Sunspots, 12 hidden) + +
D2.m4 (Sunspots, 8 hidden) + +
D3.m4 (Sunspots, 4 hidden) 0 +
D1.m8 (Sunspots, 12 hidden) + +
D2.m8 (Sunspots, 8 hidden) + 0
D3.m8 (Sunspots, 4 hidden) 0 0
D1.m11 (Sunspots, 12 hidden) + +
D2.m11 (Sunspots, 8 hidden) 0 +
D3.m11 (Sunspots, 4 hidden) + +
Fig. 3.5. The performance (test NMSE) of the 10 neural networks with 15 inputs, 15 hidden units, and one output unit, trained with weight decay using λ = λ̂_opt, on the power prediction problem (setup B1). "Human expert" denotes the prediction result by the human expert at Puget Sound Power and Light Co., and "Competition winner" denotes the result by the model that won the Puget Sound Power and Light Co.'s Power Prediction Competition.
Fig. 3.6. The performance (MSE/1535 versus years forward on test set 1) of a neural network with 12 inputs, 12 hidden units, and one output unit, trained with weight decay using λ = λ̂₁, on iterated predictions for the sunspot problem. The error bars denote one standard deviation for the 10 trained networks. The dashed line shows the results when using the network listed in [26]. Note that these results are achieved with a simple weight decay cost and a very simple method for selecting λ, whereas [26] use weight elimination and a complicated heuristic scheme for setting λ.
3.5 Conclusions
The established connection between early stopping and weight decay regulariza-
tion naturally leads to the idea of using early stopping to estimate the weight
decay parameter. In this paper we have shown how this can be done and that the resulting λ yields test errors as low as those achieved with the standard cross-validation method, although this varies between problems.
tions, this means replacing a search which may take days or weeks, with a com-
putation that usually does not require more than a few minutes or hours. This
value can also be used as a starting point for a more extensive cross-validation
search.
We have also shown that using several early stopping networks to estimate λ
can be smarter than combining the networks into committees. The conclusion
from this is that although there is a correspondence between early stopping and weight decay under asymptotic conditions, this does not mean that early stopping and weight decay give equivalent results in real-life situations.
The method unfortunately only works for regularization terms that have a
connection with early stopping, like quadratic weight decay or "weight decay like" regularizers where the weights are constrained towards the origin in weight
space (but using e.g. a Laplacian prior instead of the usual Gaussian prior). The
method does not carry over to regularizers which do not have any connection to
early stopping (like e.g. Tikhonov smoothing regularizers).
Acknowledgements. David B. Rosen is thanked for a very inspiring dinner
conversation during the 1996 “Machines that Learn” Workshop in Snowbird,
Utah. Milan Casey Brace of Puget Sound Power and Light Co. is thanked for
supplying the power load data. Financial support is gratefully acknowledged from
NSF (grant CDA-9503968), Olle and Edla Ericsson’s Foundation, the Swedish
Institute, and the Swedish Research Council for Engineering Sciences (grant
TFR-282-95-847).
References
[1] Abu-Mostafa, Y.S.: Hints. Neural Computation 7, 639–671 (1995)
[2] Bishop, C.M.: Curvature-driven smoothing: A learning algorithm for feedforward
networks. IEEE Transactions on Neural Networks 4(5), 882–884 (1993)
[3] Brace, M.C., Schmidt, J., Hadlin, M.: Comparison of the forecast accuracy of
neural networks with other established techniques. In: Proceedings of the First
International Forum on Applications of Neural Networks to Power Systems, Seattle
WA, pp. 31–35 (1991)
[4] Buntine, W.L., Weigend, A.S.: Bayesian back-propagation. Complex Systems 5,
603–643 (1991)
[5] Cheeseman, P.: On Bayesian model selection. In: The Mathematics of General-
ization - The Proceedings of the SFI/CNLS Workshop on Formal Approaches to
Supervised Learning, pp. 315–330. Addison-Wesley, Reading (1995)
[6] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals and Systems 2, 304–314 (1989)
[7] Engle, R., Clive, F., Granger, W.J., Ramanathan, R., Vahid, F., Werner, M.:
Construction of the puget sound forecasting model. EPRI Project # RP2919,
Quantitative Economics Research Institute, San Diego, CA (1991)
[8] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Computation 4(1), 1–58 (1992)
[9] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural networks ar-
chitectures. Neural Computation 7, 219–269 (1995)
[10] Hansen, L.K., Rasmussen, C.E., Svarer, C., Larsen, J.: Adaptive regularization. In:
Vlontzos, J., Hwang, J.-N., Wilson, E. (eds.) Proceedings of the IEEE Workshop
on Neural Networks for Signal Processing IV, pp. 78–87. IEEE Press, Piscataway
(1994)
[11] Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation of nonorthogonal
problems. Technometrics 12, 55–67 (1970)
[12] Ishikawa, M.: A structural learning algorithm with forgetting of link weights. Tech-
nical Report TR-90-7, Electrotechnical Laboratory, Information Science Division,
1-1-4 Umezono, Tsukuba, Ibaraki 305, Japan (1990)
[13] Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics, 3rd edn. Hafner
Publishing Co., New York (1972)
[14] Moody, J.E., Rögnvaldsson, T.S.: Smoothing regularizers for projective basis func-
tion networks. In: Advances in Neural Information Processing Systems 9. MIT
Press, Cambridge (1997)
[15] Nowlan, S., Hinton, G.: Simplifying neural networks by soft weight-sharing. Neural
Computation 4, 473–493 (1992)
[16] Perrone, M.P., Cooper, L.C.: When networks disagree: Ensemble methods for hy-
brid neural networks. In: Artificial Neural Networks for Speech and Vision, pp.
126–142. Chapman and Hall, London (1993)
[17] Plaut, D., Nowlan, S., Hinton, G.: Experiments on learning by backpropaga-
tion. Technical Report CMU-CS-86-126, Carnegie Mellon University, Pittsburgh,
PA (1986)
[18] Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: The RPROP algorithm. In: Ruspini, H. (ed.) Proc. of the IEEE Intl.
Conference on Neural Networks, San Francisco, California, pp. 586–591 (1993)
[19] Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University
Press, Cambridge (1996)
[20] Sjöberg, J., Ljung, L.: Overtraining, regularization, and searching for minimum
with application to neural nets. Int. J. Control 62(6), 1391–1407 (1995)
[21] Tikhonov, A.N., Arsenin, V.Y.: Solutions of Ill-Posed problems. V. H. Winston &
Sons, Washington D.C. (1977)
[22] Tong, H.: Non-linear Time Series: A Dynamical System Approach. Clarendon
Press, Oxford (1990)
[23] Utans, J., Moody, J.E.: Selecting neural network architectures via the prediction
risk: Application to corporate bond rating prediction. In: Proceedings of the First
International Conference on Artificial Intelligence Applications on Wall Street.
IEEE Computer Society Press, Los Alamitos (1991)
[24] Wahba, G., Gu, C., Wang, Y., Chappell, R.: Soft classification, a.k.a. risk esti-
mation, via penalized log likelihood and smoothing spline analysis of variance. In:
The Mathematics of Generalization - The Proceedings of the SFI/CNLS Work-
shop on Formal Approaches to Supervised Learning, pp. 331–359. Addison-Wesley,
Reading (1995)
[25] Wahba, G., Wold, S.: A completely automatic French curve. Communications in
Statistical Theory & Methods 4, 1–17 (1975)
[26] Weigend, A., Rumelhart, D., Hubermann, B.: Back-propagation, weight-
elimination and time series prediction. In: Sejnowski, T., Hinton, G., Touretzky,
D. (eds.) Proc. of the Connectionist Models Summer School. Morgan Kaufmann
Publishers, San Mateo (1990)
[27] Williams, P.M.: Bayesian regularization and pruning using a Laplace prior. Neural
Computation 7, 117–143 (1995)
4
Controlling the Hyperparameter Search in
MacKay’s Bayesian Neural Network Framework
Tony Plate
School of Mathematical and Computing Sciences,
Victoria University, Wellington, New Zealand
tap@mcs.vuw.ac.nz
http://www.mcs.vuw.ac.nz/˜tap/
Abstract. In order to achieve good generalization with neural networks
overfitting must be controlled. Weight penalty factors are one common
method of providing this control. However, using weight penalties creates
the additional search problem of finding the optimal penalty factors.
MacKay [5] proposed an approximate Bayesian framework for training
neural networks, in which penalty factors are treated as hyperparameters
and found in an iterative search. However, for classification networks
trained with cross-entropy error, this search is slow and unstable, and
it is not obvious how to improve it. This paper describes and compares
several strategies for controlling this search. Some of these strategies
greatly improve the speed and stability of the search. Test runs on a
range of tasks are described.
4.1 Introduction
Neural networks can provide useful flexible statistical models for non-linear re-
gression and classification. However, as with all such models, the flexibility must
be controlled to avoid overfitting. One way of doing this in neural networks is to
use weight penalty factors ( regularization parameters). This creates the problem
of finding the values of the penalty factors which will maximize performance on
new data. As various researchers have pointed out, including MacKay [5], Neal
[10] and Bishop [1], it is generally advantageous to use more than one penalty
factor, in order to differentially penalize weights between different layers of the
network. However, doing this makes it computationally infeasible to choose op-
timal penalty factors by k-fold cross validation.
MacKay [5] describes a Bayesian framework for training neural networks and
choosing optimal penalty factors (which are hyperparameters in his framework).
In this framework, we choose point estimates of hyperparameters to maximize the
“evidence” of the network. Parameters (i.e., weights) can be assigned into different
groups, and each controlled by a separate hyperparameter. This allows weights be-
tween different layers to be penalized differently. MacKay [6, 8] and Neal [10] have
shown that it also provides a way of implementing "Automatic Relevance Determination" (ARD), in which connections emerging from different units in the input layer
are assigned to different regularization groups. The idea is that hyperparameters
controlling weights for irrelevant inputs should become large, driving those weights
to zero, while hyperparameters for relevant inputs stabilize at small to moderate
values. This can help generalization by causing the network to ignore irrelevant
inputs and also makes it possible to see at a glance which inputs are important.
In this framework the search for an optimal network has two levels. The inner
level is a standard search for weights which minimize error on the training data,
with fixed hyperparameters. The outer level is a search for hyperparameters
which maximize the evidence. For the Bayesian theory to apply, the inner level
search should be allowed to converge to a local minimum at each step of the outer
level search. However, this can be expensive and slow. Problems with speed and
stability of the search seem especially severe with classification networks trained
with cross-entropy error.
This paper describes experiments with different control strategies for updating
the hyperparameters in the outer level search. These experiments show that the
simple “let it run to convergence and then update” strategy often does not work
well, and that other strategies can generally work better. In previous work, the
current author successfully employed one of these strategies in an application of
neural networks to epidemiological data analysis [11]. The experiments reported
here confirm the necessity for update strategies and also demonstrate that although
the strategy used in this previous work is reasonably effective in some situations,
there are simpler and better strategies which work in a wider range of situations.
These experiments also furnish data on the relationship between the evidence and
the generalization error. This data confirms theoretical expectations about when
the evidence should and should not be a good indication of generalization error.
In the second section of this chapter, the update formulas for hyperparameters
are given. Network propagation and weight update formulas are not given, as
they are well known and available elsewhere, e.g., in Bishop [1]. Different control
strategies for the outer level hyperparameter search are described in the third
section. In the fourth section, the simulation experiments are described, and the
results are reported in the fifth section. The experimental relationships between
evidence and generalization error are reported in the sixth section.
4.2 Hyperparameter Updates
The update formulas for the hyperparameters (weight penalty factors) in the
outer level search are quite simple. Before describing them we need some termi-
nology. For derivations and background theory see Bishop [1], MacKay [5, 7], or
Thodberg [13].
• n is the total number of weights in the network.
• w_i is the value of the ith weight.
• K is the number of hyperparameters.
• I_c is the set of indices of the weights in the cth hyperparameter group.
• α_c is the value of the hyperparameter controlling the cth hyperparameter group; it specifies the prior distribution on the weights in that group. α_[i] denotes the value of the hyperparameter controlling the group to which weight i belongs.
• n_c is the number of weights in the cth hyperparameter group.
• C is the weight cost (penalty term) for the network: C = (1/2) Σ_{i=1}^{n} α_[i] w_i².
• m is the total number of training examples.
• y_j and t_j are the network outputs and target values, respectively, for the jth training example.
• E is the error term of the network. For the classification networks described here, the modified cross-entropy (Bishop [1], p. 232) is used:
  E = −Σ_{j=1}^{m} { t_j log(y_j / t_j) + (1 − t_j) log((1 − y_j)/(1 − t_j)) }.
  Note that all graphs and tables of test set performance use the "deviance", which is twice the error.
• H is the Hessian of the network (the second partial derivatives of the sum of the error and weight cost). h_ij denotes the ijth element of this matrix, and h⁻¹_ij denotes the ijth element of H⁻¹:
  h_ij = ∂²(E + C)/(∂w_i ∂w_j).
• H_E is the matrix of second partial derivatives of just the error, and H_C is the matrix of second partial derivatives of just the weight cost.
• Tr(H⁻¹) is the trace of the inverse of H: Tr(H⁻¹) = Σ_{i=1}^{n} h⁻¹_ii.
• Tr_c(H⁻¹) is the trace of the inverse Hessian for just those elements of the cth regularization group: Tr_c(H⁻¹) = Σ_{i∈I_c} h⁻¹_ii.
• γ_c is a derived parameter which can be seen as an estimate of the number of well-determined parameters in the cth regularization group, i.e., the number of parameters determined by the data rather than by the prior.
The overall training procedure is shown in Figure 4.1.
The updates for the hyperparameters α_c depend on the estimate γ_c (the number of well-determined parameters in group c) which is calculated as follows (Eqn 27 in [7]; derivable from Eqn 10.140 in [1]):
$$\gamma_{c} = n_{c} - \alpha_{c}\,\mathrm{Tr}_{c}(H^{-1}). \tag{4.1}$$
If a Gaussian distribution is a reasonable approximation to the posterior weight distribution, γ_c should be between 0 and n_c. Furthermore, we expect each parameter in group c to contribute between 0 and 1 to γ_c. Hence, we expect h⁻¹_ii to always be in the range [0, 1/α_[i]].
set the α_c to initial values
set w_i to initial random values
repeat
    repeat
        make an optimization step for weights to minimize E + C
    until finished weight optimization
    re-estimate the α_c
until finished max number of passes through training data

Fig. 4.1. The training procedure
The updates for the α_c are as follows (Eqn 22 in [7]; Eqn 10.74 in [1]):
$$\alpha_{c} = \frac{\gamma_{c}}{\sum_{i \in I_{c}} w_{i}^{2}} \tag{4.2}$$
MacKay [7] remarks that this formula can be seen as matching the prior to the data: 1/α_c is an estimate of the variance for the weights in group c, taking into account the effective number of well-determined parameters (effective degrees of freedom) in that group.
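Written as code, one re-estimation pass takes the diagonal of the inverse Hessian, the current weights, and the group structure, and applies (4.1) and (4.2) group by group. The sketch below assumes the inverse Hessian diagonal is already available and applies none of the safeguards discussed in Section 4.3; all names are illustrative:

```python
import numpy as np

def update_hyperparameters(alpha, groups, weights, h_inv_diag):
    """One pass of the evidence-framework updates, eqs. (4.1) and (4.2).

    alpha      : array of K current hyperparameter values alpha_c
    groups     : list of K index arrays, the weight indices I_c of each group
    weights    : 1-d array of all network weights w_i
    h_inv_diag : diagonal of the inverse Hessian of E + C, i.e. the h^{-1}_ii
    """
    new_alpha = np.empty(len(groups), dtype=float)
    for c, idx in enumerate(groups):
        # gamma_c = n_c - alpha_c * Tr_c(H^{-1}), eq. (4.1)
        gamma_c = len(idx) - alpha[c] * h_inv_diag[idx].sum()
        # alpha_c = gamma_c / sum_{i in I_c} w_i^2, eq. (4.2)
        new_alpha[c] = gamma_c / np.dot(weights[idx], weights[idx])
    return new_alpha
```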
4.2.1 Difficulties with Using the Update Formulas
The difficulties with using these update formulas arise when the assumption that
the error plus cost surface is a quadratic bowl is false. This assumption can fail
in two ways: the error plus cost surface may not be quadratic, or it may not be
a bowl (i.e., the Hessian is not positive definite). In either of these situations, it
is possible for γ_c to be out of the range [0, n_c]. To illustrate, consider a single
diagonal element of the Hessian in the situation where off-diagonal elements are
zero:
$$H = \begin{pmatrix} \ddots & & 0 \\ & h_{ii} & \\ 0 & & \ddots \end{pmatrix} = \begin{pmatrix} \ddots & & 0 \\ & h^{E}_{ii} + \alpha_{[i]} & \\ 0 & & \ddots \end{pmatrix}$$
Since the off-diagonal elements are zero, the inverse Hessian is simple to write down:
$$H^{-1} = \begin{pmatrix} \ddots & & 0 \\ & \dfrac{1}{h^{E}_{ii} + \alpha_{[i]}} & \\ 0 & & \ddots \end{pmatrix}$$
Suppose the ith parameter is in the cth regularization group, by itself. Then the
number of well-determined parameters in this group is given by:
$$\gamma_{[i]} = 1 - \alpha_{[i]}\, h^{-1}_{ii} = 1 - \frac{\alpha_{[i]}}{h^{E}_{ii} + \alpha_{[i]}} = \frac{h^{E}_{ii}}{h^{E}_{ii} + \alpha_{[i]}} \tag{4.3}$$
If h^E_ii is positive, γ_[i] will be between 0 and 1. γ_[i] will be large if h^E_ii is large relative to α_[i], which means that w_i is well determined by the data, i.e., small moves of w_i will make a large difference to E. γ_[i] will be small if h^E_ii is small relative to α_[i], which means that w_i is poorly determined by the data.
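For example, with α_[i] = 1, a sharply curved error direction with h^E_ii = 9 contributes γ_[i] = 9/10 = 0.9 of a well-determined parameter, while a nearly flat direction with h^E_ii = 0.01 contributes only γ_[i] = 0.01/1.01 ≈ 0.01.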
The expectation that h⁻¹_ii is in the range [0, 1/α_c] (and hence contributes between 0 and 1 well-determined parameter to γ_[i]) can fail even if the model is at a local minimum of E + C. Being at a local minimum of E + C does not guarantee that h^E_ii will be positive: it is possible for the hyperparameter to "pin" the weight value to a convex portion of a non-quadratic E surface. Consider the case where the Hessian is diagonal and positive definite, but h^E_ii is negative. From Eqn 4.3, we can see that h⁻¹_ii can make a negative contribution¹ to γ_c, which makes little sense in terms of "numbers of well-determined parameters". This situation² is illustrated in Figure 4.2: at the minimum of the E + C function the E function is convex (d²E/dw² is negative). Here, negative degrees of freedom would be calculated under the (incorrect) assumption that error plus cost is quadratic. This is important for neural networks, because even if sum-squared error is used, non-linearities in the sigmoids can cause the error plus cost function not to be a quadratic function of weights.
Fig. 4.2. In minimizing E + C, a weight cost function C can pin a weight value to a convex portion of the error surface E. The plot on the left shows the surfaces E, C, and E + C together with the minimum of E + C; the plot on the right shows the second derivatives ∂²E/∂w², ∂²C/∂w², and ∂²(E + C)/∂w² in the region of the minimum.
If the model is not at a local minimum of E + C all bets are off. H may not
even be positive definite (i.e., the Hessian of a quadratic bowl), and if this is the
case it is almost certain that some h⁻¹_ii will be out of the range [0, 1/α_[i]]. Even
¹ With general matrices it is possible that h⁻¹_ii < −1/α_[i], in which case the contribution will be an unbounded positive number.
² In Figure 4.2, E = w(w − 1)(w + 1)² + 1, C = 4w², d(E + C)/dw ≈ 0 at w = 0.152645, and d²E/dw² ≈ −0.8036 at w = 0.152645.
if H is positive definite, and H_E is also positive definite, it can still be the case that some h⁻¹_ii are out-of-bounds, and thus make contributions of less than zero or more than one "well-determined parameter" each.
These difficulties leave us with two problems:
1. When should the hyperparameters be updated?
2. What should be done when h⁻¹_ii is not in [0, 1/α_[i]] (i.e., is "out-of-bounds").
Bishop [1] suggests updating hyperparameters after every few weight updates. Thodberg [13] suggests updating the α_c after every weight update, but only recalculating the γ_c occasionally (at five evenly-spaced intervals throughout the whole process). While updating hyperparameters after every weight update is not feasible when using conjugate gradient or other second-order training methods, common practice seems to be more or less in line with Thodberg's recommendations: train the network more or less to convergence before each hyperparameter update. However, this strategy can result in an extremely slow overall search. Furthermore, training to convergence does not eliminate the problem of out-of-bounds h⁻¹_ii values. In the remainder of this chapter, various strategies for choosing when to update hyperparameters, and for dealing with out-of-bounds h⁻¹_ii values, are described and compared in experiments.
4.3 Control Strategies
The strategies tested here fall into three groups: strategies for when to update
hyperparameters, strategies for dealing with out-of-bounds h⁻¹_ii values, and special strategies for dealing with exceptional cases. For each task, searches were run for a fixed number of minimization steps, with different combinations of control
strategies. Each step of the inner minimization loop was a step of a conjugate
gradient method, and could involve a number of passes through the training
data, though the average was just over two.
4.3.1 Choosing When to Update Hyperparameters
Four different strategies for choosing when to break out of the inner loop and
update hyperparameters were employed:
rare: Update at 10 evenly spaced intervals.
medium: Update at 30 evenly spaced intervals.
often: Update at 100 evenly spaced intervals.
patience: Update when the improvement in the last n steps was less than 1% of
the improvement since the start of the inner loop, or when the improvement
in the last n steps was less than 0.01% of the null error (the minimum error
that can be achieved with a constant output value). At least n steps of the
inner loop are taken.
Convergence was difficult to test when using a number of different datasets and
networks. One standard way of detecting convergence is to test whether the ratio
|¯g|/|¯w| is less than some threshold (|¯g| is the Euclidean length of the vector of
weight derivatives, and | ¯w| is the Euclidean length of the vector of weights.)
However, the appropriate threshold varied greatly among different tasks. In any
case, with all strategies, if a convergence test was met (|ḡ|/|w̄| < 10⁻⁶), the
inner loop terminated, and the hyperparameters were updated. This did not
occur often in the experiments described here.
The “patience” strategy is intended to be a surrogate for convergence: the
inner loop runs out of patience when the improvement achieved in the last n
steps is minuscule. With “patience”, the inner loop is guaranteed to terminate
if the error is bounded below. In practice, the inner loop runs out of patience
reasonably quickly: the update rate is somewhere between "rare" and "medium"
depending on the difficulty of the optimization problem.
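As a sketch, the patience test can be expressed as a simple check on the recorded inner-loop error values; the exact bookkeeping used in the experiments is not spelled out, so this is only one plausible reading, with illustrative names:

```python
def out_of_patience(errors, n, null_error):
    """"Patience" test for ending the inner weight-optimization loop: stop when
    the improvement over the last n steps is less than 1% of the improvement
    since the start of the inner loop, or less than 0.01% of the null error.

    errors     : list of E + C values, one per inner-loop step (index 0 = start)
    n          : patience window length
    null_error : error of the best constant-output model
    """
    if len(errors) <= n:
        return False                       # always take at least n steps
    recent = errors[-n - 1] - errors[-1]   # improvement over the last n steps
    total = errors[0] - errors[-1]         # improvement since the loop started
    return recent < 0.01 * total or recent < 1e-4 * null_error
```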
4.3.2 Dealing with Out-of-Bounds Estimates of Numbers of
Well-Determined Parameters
In each of the experiments, one of the following strategies was used to deal with out-of-bounds h⁻¹_ii values. In describing these strategies, h̃⁻¹_ii or γ̃_c are used to denote the values which are used instead of the originally calculated ones.
none: This strategy allows γ_c to take on unreasonably large values, but not negative ones (h⁻¹_ii values are not checked):
$$\tilde{\gamma}_{c} = \begin{cases} 0 & \text{if } \gamma_{c} < 0 \\ \gamma_{c} & \text{otherwise} \end{cases}$$
group: This strategy forces the total number of well-determined parameters in a regularization group to be reasonable:
$$\tilde{\gamma}_{c} = \begin{cases} 0 & \text{if } \gamma_{c} < 0 \\ n_{c} & \text{if } \gamma_{c} > n_{c} \\ \gamma_{c} & \text{otherwise} \end{cases}$$
trim: This strategy forces the contribution of each h⁻¹_ii to be reasonable. If h⁻¹_ii is out-of-bounds, it is assumed to represent zero well-determined parameters:
$$\tilde{h}^{-1}_{ii} = \begin{cases} 1/\alpha_{[i]} & \text{if } h^{-1}_{ii} \text{ is not in } [0, 1/\alpha_{[i]}] \\ h^{-1}_{ii} & \text{otherwise} \end{cases}$$
snip: This strategy forces the contribution of each h⁻¹_ii to be reasonable. If h⁻¹_ii is out-of-bounds, it is assumed to represent one well-determined parameter (see the code sketch after this list):
$$\tilde{h}^{-1}_{ii} = \begin{cases} 0 & \text{if } h^{-1}_{ii} \text{ is not in } [0, 1/\alpha_{[i]}] \\ h^{-1}_{ii} & \text{otherwise} \end{cases}$$
useold: This strategy forces the contribution of each $h^{-1}_{ii}$ to be reasonable. If $h^{-1}_{ii}$ is out-of-bounds, the last good estimate of the well-determinedness of parameter i is used:
\[ \tilde{\gamma}_c = \sum_{i=1}^{n_c} \begin{cases} 1 - \alpha'_c\, h'^{-1}_{ii} & \text{if } h^{-1}_{ii} \text{ is not in } [0, 1/\alpha_{[i]}] \\ 1 - \alpha_c\, h^{-1}_{ii} & \text{otherwise,} \end{cases} \]
where $\alpha'_c$ and $h'^{-1}_{ii}$ are the most recent values such that $h'^{-1}_{ii}$ is in $[0, 1/\alpha'_{[i]}]$, or, if there are no such values, $h'^{-1}_{ii} = 0$.
cond: This strategy only updates $\alpha_c$, using $\gamma_c$ or a snipped version of $\gamma_c$, under the following conditions:
(a) all eigenvalues of H are positive, and
(b) for all $i \in \mathcal{I}_c$, $h^{-1}_{ii}$ is in $[0, 1/\alpha_{[i]}]$, or a snipped version of $\gamma_c$ will result in a change in $\alpha_c$ in the same direction as the last change when all the eigenvalues of H were positive and all the $h^{-1}_{ii}$ in group c were in bounds, and there have not been more than five such changes since all of the $h^{-1}_{ii}$ for group c have been in range.
cheap: This is MacKay's [7] “cheap and cheerful” method, in which all parameters are assumed to be well determined, i.e., $\gamma_c = n_c$. The advantage of this method is that the Hessian need not be calculated. MacKay remarks that this method can be expected to perform poorly when there are a large number of poorly determined parameters.
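To make the bookkeeping concrete, the following is a minimal sketch (my own illustration; the NumPy-based helper and its name are not from the chapter) of computing the number of well-determined parameters for one regularization group from the diagonal of the inverse Hessian, under the “trim” and “snip” substitutions:

    import numpy as np

    def gamma_for_group(h_inv_diag, alpha_c, strategy="snip"):
        """Estimate gamma_c = sum_i (1 - alpha_c * h_inv_diag[i]) for one group.

        h_inv_diag -- diagonal entries of the inverse Hessian for the weights
                      in regularization group c
        alpha_c    -- current hyperparameter (weight decay) for the group
        strategy   -- handling of out-of-bounds entries:
                      "trim": such an entry counts as 0 well-determined parameters
                      "snip": such an entry counts as 1 well-determined parameter
        """
        h = np.asarray(h_inv_diag, dtype=float)
        upper = 1.0 / alpha_c                       # each entry should lie in [0, 1/alpha_c]
        out_of_bounds = (h < 0.0) | (h > upper)
        if strategy == "trim":
            h = np.where(out_of_bounds, upper, h)   # 1 - alpha*(1/alpha) = 0
        elif strategy == "snip":
            h = np.where(out_of_bounds, 0.0, h)     # 1 - alpha*0 = 1
        return float(np.sum(1.0 - alpha_c * h))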
4.3.3 Further Generally Applicable Strategies
Several further strategies which could be combined with any of the ones already
mentioned were also employed:
nza: (no zero alphas) Do not accept an updated alpha value of zero (retain the
old value).
limit: Limit the change in an alpha value to have a magnitude of no more than a factor of 10, i.e., round the new $\alpha_c$ into the interval $[0.1\alpha_c, 10\alpha_c]$ defined by the old value.
omit: If there are any $h^{-1}_{ii}$ not in $[0, 1/\alpha_{[i]}]$, omit the corresponding rows and columns from H to give the smaller matrix $H'$, and use the diagonal elements of $H'^{-1}$ for $h^{-1}_{ii}$. The idea is to omit troublesome components of the model so that they do not interfere with estimates of well-determinedness for well-behaved parameters. Strategies from the previous section are used to assign well-determinedness values for parameters with out-of-bounds $h^{-1}_{ii}$ values in $H^{-1}$ or $H'^{-1}$.
4.4 Experimental Setup
The experiments reported here used seven different test functions, based on the
five 2-dimensional functions used by Hwang et al. [4] and Roosen and Hastie [12]:
linear function:
\[ f_0(x_1, x_2) = (2x_1 + x_2)/0.6585 \]
simple interaction:
\[ f_1(x_1, x_2) = 10.391\big((x_1 - 0.4)(x_2 - 0.6) + 0.36\big) \]
radial function:
\[ f_2(x_1, x_2) = 24.234\big((x_1 - 0.5)^2 + (x_2 - 0.5)^2\big)\big(0.75 - ((x_1 - 0.5)^2 + (x_2 - 0.5)^2)\big) \]
harmonic function:
\[ f_3(x_1, x_2) = 42.659\big(0.1 + (x_1 - 0.5)\big(0.05 - 10(x_1 - 0.5)^2(x_2 - 0.5)^2 + (x_1 - 0.5)^4 + 5(x_2 - 0.5)^4\big)\big) \]
additive function:
\[ f_4(x_1, x_2) = 1.3356\big(1.5(1 - x_1) + e^{2x_1 - 1}\sin(3\pi(x_1 - 0.6)^2) + e^{3(x_2 - 0.5)}\sin(4\pi(x_2 - 0.9)^2)\big) \]
complicated interaction:
\[ f_5(x_1, x_2) = 1.9\big(1.35 + e^{x_1}\sin(13(x_1 - 0.6)^2)\, e^{-x_2}\sin(7x_2)\big) \]
interaction plus linear:
\[ f_6(x_1, x_2) = 0.83045\big(f_0(x_1, x_2) + f_1(x_1, x_2)\big) \]
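For reference, a direct transcription of these target functions in code (my own sketch; NumPy is used only for convenience):

    import numpy as np

    def f0(x1, x2): return (2 * x1 + x2) / 0.6585
    def f1(x1, x2): return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

    def f2(x1, x2):
        r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
        return 24.234 * r2 * (0.75 - r2)

    def f3(x1, x2):
        a, b = x1 - 0.5, x2 - 0.5
        return 42.659 * (0.1 + a * (0.05 - 10 * a**2 * b**2 + a**4 + 5 * b**4))

    def f4(x1, x2):
        return 1.3356 * (1.5 * (1 - x1)
                         + np.exp(2 * x1 - 1) * np.sin(3 * np.pi * (x1 - 0.6) ** 2)
                         + np.exp(3 * (x2 - 0.5)) * np.sin(4 * np.pi * (x2 - 0.9) ** 2))

    def f5(x1, x2):
        return 1.9 * (1.35 + np.exp(x1) * np.sin(13 * (x1 - 0.6) ** 2)
                             * np.exp(-x2) * np.sin(7 * x2))

    def f6(x1, x2): return 0.83045 * (f0(x1, x2) + f1(x1, x2))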
Training data for the binary classification tasks was generated by choosing 250 $x^j_1$ and $x^j_2$ points from a uniform distribution over [0, 1]. Another 250 $s^j$ points were chosen from the uniform [0, 1] distribution to determine whether the target should be 0 or 1. The $\{0, 1\}$ target $t^j_i$ for case j for function i depended on the probability $p^j_i$ of a 1, calculated as the sigmoid of the function value (with the mean of the function subtracted):
\[ p^j_i = \frac{1}{1 + e^{-(f_i(x^j_1, x^j_2) - \mu_i)}}, \qquad t^j_i = \begin{cases} 0 & \text{if } p^j_i < s^j \\ 1 & \text{otherwise} \end{cases} \]
where $\mu_i$ is the mean of each function over the unit square ($\mu_0 = 2.28$, $\mu_1 = 3.6$, $\mu_2 = 2.3$, $\mu_3 = 4.4$, $\mu_4 = 2.15$, $\mu_5 = 2.7$, $\mu_6 = 4.92$).
A further 5 random distractor points chosen from a uniform [0, 1] distribution were concatenated with each $(x^j_1, x^j_2)$ pair, to give a 7-dimensional function-approximation task. These extra points were added so that, in order for the neural network to generalize well, the automatic relevance determination would have to set the weights from the irrelevant inputs to zero (by driving the α's for those weights to high values).
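A minimal sketch of this data-generation recipe (my own transcription; the function f reuses the definitions sketched earlier, and mu is the corresponding mean quoted above):

    import numpy as np

    def make_task_data(f, mu, n_cases=250, n_distractors=5, seed=0):
        """Generate one training set: 2 relevant inputs, 5 distractors, binary targets."""
        rng = np.random.default_rng(seed)
        x1 = rng.uniform(0, 1, n_cases)
        x2 = rng.uniform(0, 1, n_cases)
        s = rng.uniform(0, 1, n_cases)               # uniform variates deciding the class
        p = 1.0 / (1.0 + np.exp(-(f(x1, x2) - mu)))  # sigmoid of the centred function value
        t = (p >= s).astype(int)                     # target is 0 when p < s, otherwise 1
        distractors = rng.uniform(0, 1, (n_cases, n_distractors))
        X = np.column_stack([x1, x2, distractors])   # 7-dimensional inputs
        return X, t, p

For example, make_task_data(f6, 4.92) would build one training set for task I6, with f6 as sketched above.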
For testing, the probabilities were used as targets, rather than stochastic bi-
nary values. This was done to reduce the noise in measuring the test error. The
test set inputs consisted of 400 (x
1
,x
2
) points from the uniform grid over [0, 1],
and 400 vectors of distractors chosen from a uniform distribution over [0, 1].In
order that test and training errors be comparable the test error was calculated
as the expected error over the test points, assuming actual targets had been
chosen randomly with the given target probabilities. The expected error for one
test case was calculated as follows:
E
j
i
=
#
p
j
i
log y
j
i
+(1p
j
i
) log(1 y
j
i
)
$
where y
j
i
was the prediction of the network for test case j.
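A short sketch of this computation over a test set (my own illustration, assuming that the deviance reported in Table 4.1 is twice the summed expected error, as the table caption states; p holds the true target probabilities and y the network outputs):

    import numpy as np

    def expected_test_deviance(p, y, eps=1e-12):
        """Twice the summed expected cross-entropy error over the test cases."""
        y = np.clip(y, eps, 1 - eps)   # guard against log(0)
        E = -(p * np.log(y) + (1 - p) * np.log(1 - y))
        return 2.0 * np.sum(E)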
The seven learning tasks are called I0 through I6 (the “I” is for “impure” in
Roosen and Hastie’s terminology [12].) It should be noted that these learning
tasks are much harder than the tasks used by Hwang et al. [4], as the binomial
outputs make the training data much noisier, and irrelevant inputs are present.
4.4.1 Targets for “Good” Performance
Various simple modeling strategies were applied to the data to give an indica-
tion of what level of test error performance was achievable, and to demonstrate
how important it was to ignore the distractor inputs. Three different models
were tried: linear and quadratic logistic models, and generalized additive models
(GAMs) [3], each with and without the distractor inputs. No attempt was made
to prevent overfitting, as the intention of these models was to show how much
overfitting can occur if distractors are not ignored.
null: No inputs are used. The predicted output is the average of the targets (1
parameter).
true: The actual function value. This is included to indicate the level of noise
in the data.
lin: A logistic (linear) model fit using only x
1
and x
2
as inputs (3 parameters).
lin.D: A logistic (linear) model fit using all inputs, including the distractors (8
parameters).
quad: A logistic model (with quadratic terms) fit using only x
1
and x
2
as inputs
(6 parameters).
quad.D: A logistic model (with quadratic terms) fit using all inputs, including
the distractors (36 parameters).
gam: A generalized additive model fit using only x
1
and x
2
as inputs, with three
degrees of freedom for each dimension (approx 7 parameters).
gam.D: A generalized additive model fit using all inputs, with three degrees of
freedom for each dimension (approx 22 parameters).
T: The target for “good” network performance.
The test set deviances for the various models and tasks are shown in Table 4.1, ordered by error for each task. Some tasks are easy, while others are very difficult (finding good solutions for task I3 appears to be extremely difficult, as Roosen and Hastie [12] also discovered). Targets for “good” neural network performance were derived from the errors achieved by these models. The targets were chosen to be achievable by neural network models and yet lower than the test set deviance achieved by any of the above simple models using all the inputs (except for I3, which neural networks had great difficulty with).
Table 4.1. Test set deviances (twice the error) for various models. Names ending in “.D”
are those of models which used both the distractor and relevant inputs. The number
in parentheses beside the target for good network performance (T) is the number of
networks which achieved this target at the end of training (out of 540).
I0: true 478, lin 487, T (165) 493, lin.D 498, quad 501, gam 502, gam.D 519, null 555, quad.D 556
I1: true 488, quad 495, T (258) 520, lin 545, gam 548, null 556, lin.D 558, quad.D 595, gam.D 608
I2: true 475, quad 494, gam 507, T (44) 530, gam.D 551, null 555, lin 561, quad.D 575, lin.D 577
I3: true 495, gam 546, T (7) 555, lin 556, lin.D 557, null 558, quad 566, gam.D 574, quad.D 669
I4: true 480, gam 498, T (114) 530, quad 533, gam.D 537, lin 554, null 556, lin.D 558, quad.D 597
I5: true 491, gam 539, T (58) 545, quad 547, lin 547, lin.D 555, null 555, gam.D 572, quad.D 620
I6: true 478, quad 485, T (233) 500, lin 517, lin.D 523, gam 529, quad.D 540, gam.D 550, null 557
4.4.2 Network Architecture and Training
Standard feed-forward networks were used, with details as follows. All networks
had 3 to 15 hidden units, which computed the tanh function (a symmetric sig-
moid). Inputs were in the range 0 to 1. The output unit computed the logistic
function of its total input. Weights were initialized to random values drawn from
a Gaussian distribution with variance 0.5.
Networks had nine hyperparameters (weight penalties): one for the weights
for each input, one for hidden unit biases, and one for hidden to output weights.
There was no penalty on the output bias. All hyperparameters were initialized
to 0.5. Hyperparameters were allowed to increase to a maximum value of 10,000.
Networks were trained using a conjugate gradient algorithm for the number of steps
specified in Table 4.2. These numbers of steps were chosen to give ample time for
reasonably good methods to converge on some solution (the harder problems
required more steps). Each step of the conjugate gradient algorithm involved
one or more passes through the training set (the average was just over two).
Training was terminated if the total number of passes through the training set
exceeded 2.3 times the maximum allowed number of conjugate gradient steps.
The Hessian was calculated using the exact analytical method described in
Buntine and Weigend [2]. This requires h +1 passes through the training data,
where h is the number of hidden units. This is usually far faster than a finite-
differences method, which requires n + 1 passes for the forward differences method and 2n + 1 passes for the more accurate central differences method. The total number
of floating point operations involved in the exact calculation of the Hessian is
dominated by the update of the Hessian matrix (Eqn 15c in [2]). For a network
with one output unit it is approximately $3.5(h+1)Nn^2$ (there are 7 operations in each Hessian element update, but only half the elements need be computed as the matrix is symmetric). Eigendecomposition and inversion of the Hessian takes approximately $\frac{4}{3}n^3$
operations, but this is usually small compared to the cal-
culation of the Hessian matrix. As long as hyperparameters are not updated too
frequently, the time taken by Hessian evaluation and inversion is generally not
an excessive amount on top of the time taken by the standard weight-training
part of the procedure. For example, in the easier tasks (I0, I1, and I6) with the
most frequent Hessian calculations (the “often” updates: 100 during training), approxi-
mately one-third of the computation time was spent in Hessian calculations. In
the more difficult tasks, relatively less time was spent in Hessian calculations
because the training times were longer.
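As a rough worked example (my own arithmetic, not from the chapter, using the problem sizes given above: N = 250 training cases, 7 inputs, and a 5-hidden-unit network, which gives about n = 46 weights):

    # Rough operation counts for one Hessian evaluation and inversion,
    # for h = 5 hidden units, about n = 46 weights, and N = 250 training cases.
    h, n, N = 5, 46, 250
    hessian_flops = 3.5 * (h + 1) * N * n ** 2    # roughly 1.1e7 operations to build the Hessian
    inversion_flops = (4.0 / 3.0) * n ** 3        # roughly 1.3e5 operations to eigendecompose/invert
    print(hessian_flops, inversion_flops)
    # Inversion is cheap next to building the Hessian, so the overall overhead is governed
    # by how often the hyperparameters (and hence the Hessian) are recomputed.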
Thirty-six different combinations of the hyperparameter update strategies discussed in Section 4.3 were tested. Training on each problem was repeated five times with different initial weights. The same five sets of random initial weights were used for each strategy. This means that there were a total of 180 attempts to train each sized network for each problem.
Table 4.2. Number of conjugate gradient steps allowed for different sized networks on the different tasks

                  Number of hidden units
  Task            3–5       10        15
  I0, I1, I6      1800      3000      6000
  I2, I3, I4      3000      6000      12000
  I5              6000      12000     30000
4.5 Effectiveness of Control Strategies
The effectiveness of various combinations of control strategies was judged by whether or not the test error at the end of training was acceptable (using the deviance targets in Table 4.1). Good performance on any of the tasks was not possible without setting the hyperparameters in the appropriate ranges: low for the weights coming from the $x_1$ and $x_2$ inputs, and high for the weights coming from the distractor inputs. Figure 4.3 shows example plots of hyperparameter values versus test deviance for networks trained on task I6 (with jitter added to make dense clouds of points visible). Task I6 was a reasonably easy task: nearly all networks ended with appropriately low values for the relevant-input hyperparameters. Finding appropriately high values for the distractor hyperparameters was more difficult, and those networks which did not find them did not perform well. All such plots had the same tendencies as those in Figure 4.3: low test set deviance was achieved only by networks with low values for relevant-input hyperparameters and high values for distractor hyperparameters. Some of the other tasks were more difficult, e.g., I5, and poor search strategies would set relevant-input hyperparameters to high values, resulting in poor performance from ignoring the relevant inputs.
[Figure 4.3 shows two scatter plots of final hyperparameter value (log scale, 0.01 to 10000) against test deviance (490 to 530): the left panel for the hyperparameters on the $x_1$ and $x_2$ weights, the right panel for the hyperparameters on the weights from the distractor inputs D1–D5.]
Fig. 4.3. Final hyperparameter values versus test set deviances for networks with 5 hidden units trained on task I6
The number of successes for each of the 36 (combinations of) strategies on
each task is shown in Table 4.3. Each asterisk represents one success, i.e., a
network which ended up with test set performance lower than the target for
good performance (Table 4.1.) The maximum number of successes for any cell is
15, as 5 random starts were used for each of three different sized networks. The
row totals are out of 105, and the column totals are out of 540. The grand total
(987 successes) is out of 3780.
The clear overall best strategy is “snip+often”. The special strategies “nza”
and “limit” seem to help a little. These conclusions are strengthened by exam-
ining plots of the evolution of test deviance during training. Figures 4.4 and 4.5
show plots of test deviance during training for networks with 5 hidden units
on task I6, and for networks with 10 hidden units on task I4. There are five
lines in each plot because five random starts were used for each network. The ideal network has a test deviance which descends rapidly and then stays below the target performance (the dotted line); this shows that the strategy quickly
found good values for hyperparameters (the tasks were set up so that it was not
possible to achieve low test deviance without having appropriate values for the
hyperparameters). Lines which flatten out at around 557 are for networks whose
hyperparameters have all been set to high values, so that the network ignores all
inputs and thus has the same performance as the null model (see Table 4.1). The
search was terminated early for many of these networks, because no hyperpa-
rameters or weights were changing. Note that few of the networks with 10 hidden
units reached the target for good performance on task I4 though networks with
5 hidden units did much better. This indicates that the redundancy in these
networks could not be effectively controlled.
Table 4.3. Number of successes for each strategy on each task (the asterisk groups run across tasks I0–I6; each asterisk is one success)

  strategy                      total  successes by task (I0–I6)
  cheap+often                       4  ****
  group+medium                     10  *** *** * ***
  group+patience                   10  * *** * *****
  none+patience                    10  **** * *****
  none+rare                        11  *** ** ******
  group+rare                       12  *** *** ******
  snip+rare                        14  ** ** **** ******
  trim+often                       16  ***** **** **** ***
  trim+patience                    17  *** ****** * * * *****
  none+medium                      18  **** ****** * * ** ****
  omit+snip+patience               18  ******* ****** *****
  omit+snip+patience+limit         18  ******* ****** *****
  trim+medium                      20  ***** ****** *** *** ***
  trim+rare                        20  ** ****** *** **** *****
  useold+patience                  20  **** ****** * * ** * *****
  omit+useold+patience             21  **** ******* *** ** *****
  omit+useold+patience+limit       22  **** ****** **** *** *****
  cheap+medium                     22  ***** ***** *** * ********
  cheap+patience                   23  ***** ***** * ** **********
  none+often                       26  ******** ********* * *** ** ***
  snip+medium                      26  ****** ********* ***** ** ****
  snip+patience                    26  ****** ***** * ***** **** *****
  omit+trim+patience               27  ********** ****** ** ** ** *****
  snip+patience+nza+limit          27  ***** ******* * ***** **** *****
  omit+useold+often                28  ********* ******** *** ** ******
  omit+useold+often+limit          29  ********* ******** ** *** *******
  snip+patience+nza                29  ******* ******* * ***** **** *****
  group+often                      31  ******** *********** * * ** ********
  cond+often+limit+nza             32  ************* ********** *********
  omit+snip+often+limit            32  ************ ********* ***********
  useold+often                     32  *********** ******** **** ** *******
  cheap+rare                       34  ********* ********* ***** ** *********
  omit+snip+often                  36  ** ************* ******** *************
  snip+often                       50  ******** *************** ** ** **** ***** **************
  snip+often+nza                   54  ********** *************** ** ** ***** ****** **************
  snip+often+nza+limit             54  ********** ************** * ******* ******** **************
  (task totals)                   987  165 258 44 7 114 58 233
Update strategies other than “snip” tend to be very unstable. Frequent updates
seem essential for achieving good final performance within a reasonable amount
of time. The more complex strategies for getting or retaining good estimates of
γ’s, i.e., omit and useold, seemed to be of little benefit.
Finding good hyperparameter values is difficult. If updates are too frequent,
α’s can rise uncontrollably or become unstable. If updates are too infrequent,
the search is too slow. Uncontrollable rises in α’s can occur when the network
has not started using the weights it needs, and those weights are small. In these
circumstances α is overestimated, which forces weights to become smaller in a
runaway-feedback process. Instability results because of the feedback between
γ and α: a change in γ causes a same-direction change in α, and a change in α causes an opposite-direction change in γ. Thus, if γ is overestimated, this leads to an overestimation of α, which leads to a lower reestimate of γ, which in turn can lead to a lower reestimate of α. While this process sometimes results in
stable self-correcting behavior, other times it results in uncontrolled oscillations
(which is what is happening with the “none+often” strategy: second row, third
column in Figures 4.4 and 4.5). In general, while it is slow to raise an α from a
value that is too low, it is more difficult to lower an α from a value which is too
high (because the weights are forced to zero immediately). The reasons for the
differing behavior of the various strategies appear to be as follows:
none, group: Frequently give out-of-bounds or very poor estimates for γ, and cause instability in the search.
cond: Results in fairly stable behavior, but often does not find the optimal
values for all α’s because it stops updating them.
cheap: Overestimates γ and kills weights too quickly.
trim: Sometimes underestimates γ because it assumes zero degrees of freedom
when the γ estimates from the Hessian are out-of-bounds, and thus reduces
α values which should be large, resulting in instability.
snip: Sometimes overestimates γ because it assumes one degree of freedom for
a parameter whose γ estimate from the Hessian is out-of-bounds. However,
this keeps α values high without raising them too much: overestimation of
γ appears to be much less harmful than underestimation.
Update frequency was also important. With good α calculation strategies, up-
dating frequently (i.e., “often”) gave fastest convergence to good α values and
best overall performance. Waiting for some degree of convergence in the conju-
gate gradient search before updating α values (i.e., “patience”) was of no benefit
at all.
4.6 Relationship of Test Set Error to Evidence
If we have trained a number of networks, we often want to know which will
have the lowest generalization error. The “evidence” value calculated for each
network can be used to choose networks which will perform well. The evidence is
the log likelihood of the data given the values of the hyperparameters, integrated
over weight values based on the assumption of a Gaussian distribution for the
posterior of the weights. The evidence for a network evaluated with cross-entropy
error is as follows (Eqn 10.67 in [1]; Eqn 30 in [7]):
\[ \ln p(D|\alpha) = -\frac{1}{2}\sum_{i=1}^{n} \alpha_{[i]} w_i^2 \;-\; E \;-\; \frac{1}{2}\ln|H| \;+\; \sum_c \frac{n_c}{2}\ln\alpha_c \;-\; \frac{N}{2}\ln(2\pi) \qquad (4.4) \]
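A minimal sketch of evaluating Eq. (4.4) for a trained network (my own transcription; the argument names are illustrative, and the option of ignoring negative eigenvalues follows the recommendation made later in this chapter):

    import numpy as np

    def log_evidence(w, alpha_per_weight, group_sizes, group_alphas,
                     data_error, hessian_eigvals, n_cases, drop_negative=True):
        """Gaussian-approximation log evidence for fixed hyperparameters (Eq. 4.4)."""
        eig = np.asarray(hessian_eigvals, dtype=float)
        if drop_negative:
            eig = eig[eig > 0.0]            # ignore negative eigenvalues of H
        log_det_H = np.sum(np.log(eig))     # ln|H| from the retained eigenvalues
        return (-0.5 * np.sum(alpha_per_weight * np.asarray(w) ** 2)
                - data_error
                - 0.5 * log_det_H
                + np.sum(0.5 * np.asarray(group_sizes) * np.log(group_alphas))
                - 0.5 * n_cases * np.log(2 * np.pi))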
Whether or not the evidence is a good guide to which networks will perform
well is questionable, as various assumptions on which evidence calculations are
based are often violated in particular networks. The simulations performed offer
a good opportunity to examine how accurately high evidence indicates good test
set performance.
The evidence is particularly sensitive to low eigenvalues in the Hessian of the
network. The validity of the evidence value is doubtful in cases where the Hessian
has low or negative eigenvalues. Bishop [1] recommends omitting eigenvalues
which are lower than some threshold from the calculation of the evidence.
Figure 4.6 shows plots of test deviances that would be achieved by selecting
twenty networks based on their evidence values. Two different methods of calcu-
lating the evidence were used: (a) ignore negative eigenvalues of the Hessian, and
(b) if an eigenvalue is lower than the smallest non-zero $\alpha_c$, replace it with that $\alpha_c$
(“clipped evidence”). Because the evidence is sensitive to low eigenvalues, four dif-
ferent filters were applied to networks: (a) use all networks; (b) throw out networks
[Figure 4.4 consists of ten rows of four panels, one panel per strategy combination, each plotting test set deviance (480 to 580) against the number of passes through the training set (0 to 6000) for the five random starts; each panel title names the strategy and the minimum deviance reached.]
Fig. 4.4. Test set deviance during training for networks with 5 hidden units on task I6. The horizontal axis is the number of passes through the training set. The dotted line is the target for good performance. The minimum value quoted is the minimum deviance at any point in training.
[Figure 4.5 has the same ten-by-four layout of panels as Figure 4.4, plotting test set deviance (520 to 600) against the number of passes through the training set (0 to 20000); each panel title names the strategy and the minimum deviance reached.]
Fig. 4.5. Test set deviance during training for networks with 10 hidden units on task I4. The horizontal axis is the number of passes through the training set. The dotted line is the target for good performance. The minimum value quoted is the minimum deviance at any point in training.
[Figure 4.6 contains one panel per task (I4, I5, and I6), each showing box-and-whisker distributions of test deviance for nine selection rules: the actual best test deviances, and the 20 highest-evidence networks under evidence or clipped evidence, computed over all nets, nets with no negative eigenvalues, nets with no low eigenvalues, and nets with no out-of-bounds $h^{-1}_{ii}$ values (OOBDs).]
Fig. 4.6. Distribution of test errors for the 20 “best” nets selected according to evidence and network diagnostics
with negative eigenvalues; (c) throw out networks with eigenvalues lower than the smallest $\alpha_c$; and (d) throw out networks with out-of-bounds $h^{-1}_{ii}$ values.
The first row in each plot of Figure 4.6 shows the distribution of the actual best
20 test deviances (this would normally be impossible to calculate, but we can cal-
culate it here because we have artificial tasks.) If evidence were a perfect predictor
of test set performance, the 20 networks with highest evidence would be the same
20 networks, but it is not, as the rest of the figure shows. The remaining eight rows
show the distribution of test errors for nets with the highest evidence, with evidence
calculated in both of the ways described above. Lower pairs of rows rule more nets
out of consideration, based on diagnostics of the Hessian and inverse Hessian.
The two different ways of calculating the evidence did not make any discernible
difference. The box-and-whisker plots show the median as a solid circle. The
box shows the upper and lower quartiles (i.e., 50% of points are in the box).
The whiskers show the furthest points within 1.5 times the interquartile range
(the length of the box) of the quartiles. Points outside the whiskers are plotted
individually with hollow circles.
Plots of the evidence versus test deviance show a strong correlation, which
seems strongest and most reliable for networks which have no low eigenvalues.
However, the plots in Figure 4.6 show that using the evidence to select networks
is not always reliable. Networks with high evidence but with poorly conditioned
Hessians have a wide range of test error, but some networks with the lowest test
error have poorly conditioned Hessians. This means that allowing networks with
poorly conditioned Hessians resulted in choosing too many networks with high
test error, whereas rejecting networks with poorly conditioned Hessians rejected
too many of the networks with low test error. This seemed to be more of a
problem for the more difficult tasks. On the easier tasks (e.g., I6), networks with
high evidence and no low eigenvalues had very low test deviance.
4.7 Conclusion
For classification networks, the search for optimal hyperparameters can be slow
and unstable. However, the search can be improved by using the strategies sum-
marized below. Similar experiments with regression networks (with linear out-
put units and sum-square errors, on tasks with continuous targets and Gaussian
noise) revealed that the hyperparameter search in such tasks is generally faster
and more stable than in classification tasks. Only when the tasks had very high
noise in the training cases (with variance of 4 or greater) did the hyperparameter
search become at all difficult. Under these conditions, none of the strategies tried
was clearly superior.
One of the main causes of instability in the hyperparameter search is low
eigenvalues in the Hessian. In turn, one of the main causes of this is redun-
dancies in the network. It is easily verified that the Hessian for a network with
redundancies can have zero or close-to-zero eigenvalues; the only thing that prevents the eigenvalues from being zero is non-zero hyperparameters. In neural net-
works both additive (parallel) and multiplicative (serial) redundancies can occur.
Larger networks are more likely to have redundancies. Hence we could expect
the search to be more stable for smaller networks, and this is what was observed
(though of course some small networks did not perform well as they did not have
sufficient capacity to model the task).
The initial hyperparameter and weight values are important. If the initial
hyperparameters are too high, they can force all the weights to zero before the
network has a chance to learn anything. If the initial hyperparameters are too
low the network can get trapped in an overfitting mode. Thodberg [13] makes
the reasonable suggestion that the initial hyperparameters should be set so that
the weight cost is 10% of the error at the beginning of training.
The results described in this chapter lead to the following recommendations
for hyperparameter updates:
– update hyperparameters frequently
– if any $h^{-1}_{ii}$ values are out-of-bounds, replace them with zero in the calculation of $\gamma_{[i]}$ (so that each out-of-bounds $h^{-1}_{ii}$ contributes one well-determined parameter)
– ignore updates for $\alpha_c$ which suggest a zero value
– limit changes in $\alpha_c$ to have a maximum magnitude of a factor of 10
– ignore negative eigenvalues in calculations of the evidence
– regard evidence values as unreliable for networks with eigenvalues lower than the lowest $\alpha_c$
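Taken together, the first four recommendations correspond to an update step roughly like the following sketch (my own illustration, not code from the chapter; it assumes MacKay's usual evidence-framework re-estimation formula $\alpha_c = \gamma_c / \sum_i w_i^2$ for the weights in group c):

    import numpy as np

    def update_alpha(alpha_old, group_weights, h_inv_diag_group):
        """One 'snip + nza + limit' re-estimation of a group's weight decay alpha_c."""
        h = np.asarray(h_inv_diag_group, dtype=float)
        h = np.where((h < 0.0) | (h > 1.0 / alpha_old), 0.0, h)   # "snip" out-of-bounds entries
        gamma = float(np.sum(1.0 - alpha_old * h))                # well-determined parameters
        w2 = float(np.sum(np.asarray(group_weights) ** 2))
        if w2 <= 0.0:
            return alpha_old
        alpha_new = gamma / w2               # re-estimate alpha_c = gamma_c / sum_i w_i^2
        if alpha_new <= 0.0:                 # "nza": never accept a zero (or negative) alpha
            return alpha_old
        return float(np.clip(alpha_new, 0.1 * alpha_old, 10.0 * alpha_old))   # "limit"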
Those interested in Bayesian approaches to neural network modeling should
also consider Neal’s [9] Markov-chain Monte Carlo methods. Although these
Monte Carlo methods sometimes require longer computation times than methods
based on MacKay's approximate Bayesian framework, they have some theoretical
advantages and also require less tweaking.
References
[1] Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press
(1995)
[2] Buntine, W.L., Weigend, A.S.: Computing second derivatives in feed-forward net-
works: A review. IEEE Transactions on Neural Networks 5(3), 480–488 (1994)
[3] Hastie, T.J., Tibshirani, R.J.: Generalized additive models. Chapman and Hall,
London (1990)
[4] Hwang, J.N., Lay, S.-R., Maechler, M., Martin, R.D., Schimert, J.: Regression
modeling in back-propagation and projection pursuit learning. IEEE Transactions
on Neural Networks 5(3), 342–353 (1994)
[5] MacKay, D.J.C.: A practical Bayesian framework for backpropagation networks.
Neural Computation 4(3), 448–472 (1992)
[6] MacKay, D.J.C.: Bayesian methods for backpropagation networks. In: Domany,
E., van Hemmen, J.L., Schulten, K. (eds.) Models of Neural Networks III, ch. 6,
Springer, New York (1994)
[7] MacKay, D.J.C.: Probable networks and plausible predictions - a review of prac-
tical Bayesian methods for supervised neural networks. Network: Computation in
Neural Systems 6, 469–505 (1995)
[8] MacKay, D.J.C.: Bayesian non-linear modelling for the 1993 energy prediction
competition. In: Heidbreder, G. (ed.) Maximum Entropy and Bayesian Methods,
Santa Barbara 1993, pp. 221–234. Kluwer, Dordrecht (1996)
[9] Neal, R.M.: Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report TR9702, Dept. of Statistics, University of Toronto (1997), Software,
http://www.cs.utoronto.ca/~radford/
[10] Neal, R.M.: Bayesian Learning for Neural Networks. Springer, New York (1996)
[11] Plate, T., Bert, J., Grace, J., Band, P.: A comparison between neural networks
and other statistical techniques for modeling the relationship between tobacco and
alcohol and cancer. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in
Neural Information Processing 9 (NIPS 1996). MIT Press (1997)
[12] Roosen, C., Hastie, T.: Logistic response projection pursuit. Technical report,
AT&T Bell Laboratories (1993)
[13] Thodberg, H.H.: A review of Bayesian neural networks with an application to near
infrared spectroscopy. IEEE Transactions on Neural Networks 7(1), 56–72 (1996)
5
Adaptive Regularization in Neural Network
Modeling
Jan Larsen¹, Claus Svarer², Lars Nonboe Andersen¹, and Lars Kai Hansen¹
¹ Department of Mathematical Modeling, Building 321
Technical University of Denmark
DK-2800 Lyngby, Denmark
² Neurobiology Research Unit
Department of Neurology, Building 9201
Copenhagen University Hospital
Blegdamsvej 9
DK-2100 Copenhagen, Denmark
{jl,lna,lkhansen}@imm.dtu.dk, csvarer@pet.rh.dk
http://eivind.imm.dtu.dk, http://neuro.pet.rh.dk
Abstract. In this paper we address the important problem of optimiz-
ing regularization parameters in neural network modeling. The suggested
optimization scheme is an extended version of the recently presented al-
gorithm [25]. The idea is to minimize an empirical estimate like the
cross-validation estimate of the generalization error with respect to
regularization parameters. This is done by employing a simple itera-
tive gradient descent scheme using virtually no additional programming
overhead compared to standard training. Experiments with feed-forward
neural network models for time series prediction and classification tasks
showed the viability and robustness of the algorithm. Moreover, we pro-
vided some simple theoretical examples in order to illustrate the potential
and limitations of the proposed regularization framework.
5.1 Introduction
Neural networks are flexible tools for time series processing and pattern recogni-
tion. By increasing the number of hidden neurons in a 2-layer architecture, any relevant target function can be approximated arbitrarily closely [19]. The associated risk of overfitting on noisy data is of major concern in neural network design, which finds expression in the ubiquitous bias-variance dilemma, see e.g., [9].
The need for regularization is two-fold: First, it remedies numerical problems in the training process by smoothing the cost function and by introducing curvature in low (possibly zero) curvature regions of the cost function. Secondly, reg-
ularization is a tool for reducing variance by introducing extra bias. The overall
objective of architecture optimization is to minimize the generalization error.
The architecture can be optimized directly by stepwise selection procedures (in-
cluding pruning techniques) or indirectly using regularization. In general, one
would prefer a hybrid scheme; however, a very flexible regularization may obviate the need for selection procedures. In the numerical experiments we mainly consider hybrid pruning/adaptive regularization schemes.
The trick presented in this communication addresses the problem of adapting
regularization parameters.
The trick consists in formulating a simple iterative gradient descent
scheme for adapting the regularization parameters aiming at minimiz-
ing the generalization error.
We suggest using an empirical estimate¹ of the generalization error, viz. the K-fold cross-validation estimate [8], [39]. In [25] and [3] the proposed scheme was studied using the hold-out validation estimator.
In addition to empirical estimators for the generalization error a number of
algebraic estimators like FPE [1], FPER [23], GEN [21], GPE [31] and NIC [33]
have been developed in recent years. These estimates, however, depend on a
number of statistical assumptions which can be quite hard to justify. In partic-
ular, they are $o(1/N_t)$ estimators where $N_t$ is the number of training examples.
However, for many practical modeling set-ups it is hard to meet the large training
set assumption.
In [14] properties of adaptive regularization are studied in the simple case of estimating the mean of a random variable using an algebraic estimate of the average² generalization error, and [15] proposed an adaptive regularization scheme
for neural networks based on an algebraic estimate. However, experiments in-
dicate that this scheme has a drawback regarding robustness. In addition, the
requirement of a large training set may not be met.
The Bayesian approach to adapting regularization parameters is to minimize the
so-called evidence [5, Ch. 10], [30]. The evidence, however, does not in a simple
way relate to the generalization error which is our primary object of interest.
Furthermore, [2] and [38] consider the use of a validation set to tune the amount
of regularization, in particular when using the early-stop technique.
Section 5.2 considers training and empirical generalization assessment. In Sec-
tion 5.3 the framework for optimization of regularization parameters is presented.
The experimental section 5.4 deals with examples of feed-forward neural network models for classification and time series prediction. Further, in order to
study the theoretical potential/limitations of the proposed framework, we in-
clude some simulations on a simple toy problem.
5.2 Training and Generalization
Suppose that the neural network is described by the vector function f (x; w)
where x is the input vector and w is the vector of network weights and thresholds
¹ For further discussion on empirical generalization assessment, see e.g., [24].
² Average w.r.t. different training sets.
with dimensionality m. The objective is to use the network model for approxi-
mating the true conditional input-output distribution p(y|x), or some moments
hereof. For regression and signal processing problems we normally model the
conditional expectation E{y|x}.
Assume that we have available a data set $D = \{x(k); y(k)\}_{k=1}^{N}$ of N input-output examples. In order to both train and empirically estimate the generalization performance we follow the idea of K-fold cross-validation [8], [39] and split the data set into K randomly chosen disjoint sets of approximately equal size, i.e., $D = \cup_{j=1}^{K} V_j$ and $\forall i \neq j: V_i \cap V_j = \emptyset$. Training and validation is replicated K times, and in the j'th run training is done on the set $T_j = D \setminus V_j$ and validation is performed on $V_j$.
The cost function, $C_{T_j}$, for network training on $T_j$, is supposed to be the sum of a loss function (or training error), $S_{T_j}(w)$, and a regularization term $R(w, \kappa)$ parameterized by a set of regularization parameters $\kappa$, i.e.,
\[ C_{T_j}(w) = S_{T_j}(w) + R(w, \kappa) = \frac{1}{N_{tj}} \sum_{k=1}^{N_{tj}} \ell\big(y(k), \hat{y}(k); w\big) + R(w, \kappa) \qquad (5.1) \]
where $\ell(\cdot)$ measures the distance between the output $y(k)$ and the network prediction $\hat{y}(k) = f(x(k); w)$. In Section 5.4 we will consider the log-likelihood and the square error loss function $\ell = |y - \hat{y}|^2$. $N_{tj} \equiv |T_j|$ defines the number of training examples in $T_j$ and k is used to index the k'th example $[x(k), y(k)]$.
Training provides the estimated weight vector $\hat{w}_j = \arg\min_w C_{T_j}(w)$.
The j'th validation set $V_j$ consists of $N_{vj} = N - N_{tj}$ examples and the validation error³ of the trained network reads
\[ S_{V_j}(\hat{w}_j) = \frac{1}{N_{vj}} \sum_{k=1}^{N_{vj}} \ell\big(y(k), \hat{y}(k); \hat{w}_j\big) \qquad (5.2) \]
where the sum runs over the $N_{vj}$ validation examples. $S_{V_j}(\hat{w}_j)$ is thus an estimate of the generalization error, defined as the expected loss,
\[ G(\hat{w}_j) = E_{x,y}\{\ell(y, \hat{y}; \hat{w}_j)\} = \int \ell(y, \hat{y}; \hat{w}_j)\, p(x, y)\, dx\, dy \qquad (5.3) \]
where p(x, y) is the unknown joint input-output probability density. Generally,
$S_{V_j}(\hat{w}_j) = G(\hat{w}_j) + O(1/\sqrt{N_{vj}})$ where $O(\cdot)$ is the Landau order function⁴. Thus we need a large $N_{vj}$ to achieve an accurate estimate of the generalization error. On the other hand, this leaves only few data for training, and thus the true generalization error $G(\hat{w}_j)$ increases. Consequently there exists a trade-off between the two conflicting aims which calls for finding an optimal split ratio. The optimal split ratio⁵ is an interesting open and difficult problem since it depends on the total algorithm in which the validation error enters. Moreover, it depends on the learning curve⁶ [17].
³ That is, the loss function on the validation set.
⁴ If $h(x) = O(g(x))$ then $|h(x)|/|g(x)| < \infty$ for $x \to 0$.
⁵ For more elaborations on the split of data, see e.g., [2], [20], [24] and [26].
⁶ Defined as the average generalization error as a function of the number of training examples.
The final K-fold cross-validation estimate is given by the average of the validation error estimates,
\[ \hat{\Gamma} = \frac{1}{K} \sum_{j=1}^{K} S_{V_j}(\hat{w}_j). \qquad (5.4) \]
$\hat{\Gamma}$ is an estimate of the average generalization error over all possible training sets of size $N_{tj}$,
\[ \Gamma = E_T\{G(\hat{w}_j)\}. \qquad (5.5) \]
$\hat{\Gamma}$ is an unbiased estimate of $\Gamma$ if the data of D are independently distributed⁷, see e.g., [16].
see e.g., [16].
The idea is now to optimize the amount of regularization by minimizing
%
Γ
w.r.t. the regularization parameters κ. An algorithm for this purpose is described
in Section 5.3. Furthermore, we might consider optimizing regularization using
the hold-out validation estimate corresponding to K =1. In this case one has to
choose a split ratio. Without further ado, we will recommend a 50/50 splitting.
Suppose that we found the optimal κ using the cross-validation estimate.
Replications of training result in K different weight estimates
%
w
j
which might
be viewed as an ensemble of networks. In [16] we showed under certain mild
conditions that when considering a o(1/N ) approximation, the average general-
ization error of the ensemble network f
ens
(x)=
K
j=1
β
j
· f (x,
%
w
j
) equals that
of the network trained on all examples in D where β
j
weights the contribution
from the j’th network and
j
β
j
=1.IfK is a divisor in N then j, β
j
=1/K,
otherwise β
j
=(N N
vj
)/N (K 1). Consequently, one might use the ensemble
network to compensate for the increase in generalization error due to only train-
ing on N
tj
= N N
vj
data. Alternatively, one might retrain on the full data set
D using the optimal κ. We use the latter approach in the experimental section.
A minimal necessary requirement for a procedure which estimates the network
parameters on the training set and optimizes the amount of regularization from a
cross-validation set is: the generalization error of the regularized network should
be smaller than that of the unregularized network trained on the full data set
D. However, this is not always the case, and is the quintessence of various “no
free lunch” theorems [12], [44], [46]:
– If the regularizer is parameterized using many parameters, $\kappa$, there is a potential risk of over-fitting on the cross-validation data. A natural way to avoid this situation is to limit the number of regularization parameters. Another recipe is to impose constraints on $\kappa$ (hyper-regularization).
– The specific choice of the regularizer's functional form imposes prior constraints on the functions to be implemented by the network⁸. If the prior information is mismatched to the actual problem it might be better not to use regularization.
– The de-biasing procedure described above, which compensates for training only on $N_{tj} < N$ examples, might fail to yield better performance since the
⁷ That is, $[x(k_1), y(k_1)]$ is independent of $[x(k_2), y(k_2)]$ for all $k_1 \neq k_2$.
⁸ The functional constraints are through the penalty imposed on the weights.
weights are now optimized using all data, including those which were left out exclusively for optimizing the regularization parameters.
– The split between training and validation data, and consequently the number of folds, K, may not be chosen appropriately.
These problems are further addressed in Section 5.4.1.
5.3 Adapting Regularization Parameters
The choice of regularizer may be motivated by
– the fact that the minimization of the cost function is normally an ill-posed task. Regularization smoothens the cost function and thereby facilitates the training. The weight decay regularizer⁹, originally suggested by Hinton in the neural networks literature, is a simple way to accomplish this task, see e.g., [35].
– a priori knowledge of the weights, e.g., in terms of a prior distribution (when using a Bayesian approach). In this case the regularization term normally plays the role of a log-prior distribution. Weight decay regularization may be viewed as a Gaussian prior, see e.g., [2]. Other types of priors, e.g., the Laplacian [13], [43] and soft weight sharing [34], have been considered. Moreover, priors have been developed for the purpose of restricting the number of weights (pruning), e.g., the so-called weight elimination [42].
– desired characteristics of the functional mapping performed by the network. Typically, a smooth mapping is preferred. Regularizers which penalize curvature of the mapping have been suggested in [4], [7], [32], [45], [10].
In the experimental section we consider weight decay regularization and some
generalizations hereof. Without further ado, weight decay regularization has
proven to be useful in many neural network applications.
The standard approach for estimation of regularization parameters is more
or less systematic search and evaluation of the cross-validation error. However,
this is not viable for multiple regularization parameters. On the other hand, as
will be demonstrated, it is possible to derive an optimization algorithm based
on gradient descent.
Consider a regularization term R(w, κ) which depends on q regularization pa-
rameters contained in the vector $\kappa$. Since the estimated weights $\hat{w}_j = \arg\min_w C_{T_j}(w)$ are controlled by the regularization term, we may in fact consider the cross-validation error (5.4) as an implicit function of the regularization parameters, i.e.,
\[ \hat{\Gamma}(\kappa) = \frac{1}{K} \sum_{j=1}^{K} S_{V_j}\big(\hat{w}_j(\kappa)\big) \qquad (5.6) \]
⁹ Also known as ridge regression.
where $\hat{w}_j(\kappa)$ is the $\kappa$-dependent vector of weights estimated from training set $T_j$. The optimal regularization can be found by using gradient descent¹⁰,
\[ \kappa^{(n+1)} = \kappa^{(n)} - \eta\, \frac{\partial\hat{\Gamma}}{\partial\kappa}\big(\hat{w}(\kappa^{(n)})\big) \qquad (5.7) \]
where $\eta > 0$ is a step-size (learning rate) and $\kappa^{(n)}$ is the estimate of the regularization parameters in iteration n.
Suppose the regularization term is linear in the regularization parameters,
\[ R(w, \kappa) = \kappa^{\top} r(w) = \sum_{i=1}^{q} \kappa_i r_i(w) \qquad (5.8) \]
where $\kappa_i$ are the regularization parameters and $r_i(w)$ the associated regularization functions. Many suggested regularizers are linear in the regularization parameters; this includes the popular weight decay regularization as well as regularizers imposing smooth functions such as the Tikhonov regularizer [4], [2] and the smoothing regularizer for neural networks [32], [45]. However, there exist exceptions such as weight elimination [42] and soft weight sharing [34]; in this case the presented method needs a few modifications.
Using the results of the Appendix, the gradient of the cross-validation error equals
\[ \frac{\partial\hat{\Gamma}}{\partial\kappa}(\kappa) = \frac{1}{K} \sum_{j=1}^{K} \frac{\partial S_{V_j}}{\partial\kappa}(\hat{w}_j), \qquad (5.9) \]
\[ \frac{\partial S_{V_j}}{\partial\kappa}(\hat{w}_j) = -\frac{\partial r}{\partial w}(\hat{w}_j) \cdot J_j^{-1}(\hat{w}_j) \cdot \frac{\partial S_{V_j}}{\partial w}(\hat{w}_j), \qquad (5.10) \]
where $J_j = \partial^2 C_{T_j}/\partial w \partial w^{\top}$ is the Hessian of the cost function. As an example, consider the case of weight decay regularization with separate weight decays for two groups of weights, e.g., the input-to-hidden and hidden-to-output weights, i.e.,
\[ R(w, \kappa) = \kappa_I \cdot |w_I|^2 + \kappa_H \cdot |w_H|^2 \qquad (5.11) \]
where $\kappa = [\kappa_I, \kappa_H]$, $w = [w_I, w_H]$ with $w_I$, $w_H$ denoting the input-to-hidden and hidden-to-output weights, respectively. The gradient of the validation error then yields
\[ \frac{\partial S_{V_j}}{\partial\kappa_I}(\hat{w}_j) = -2(\hat{w}_j^I)^{\top} \cdot g_j^I, \qquad \frac{\partial S_{V_j}}{\partial\kappa_H}(\hat{w}_j) = -2(\hat{w}_j^H)^{\top} \cdot g_j^H \qquad (5.12) \]
where $g_j$ is the vector
\[ g_j = [g_j^I, g_j^H] = J_j^{-1}(\hat{w}_j) \cdot \frac{\partial S_{V_j}}{\partial w}(\hat{w}_j). \qquad (5.13) \]
¹⁰ We have recently extended this algorithm incorporating second-order information via the Conjugate Gradient technique [11].
In summary, the algorithm for adapting regularization parameters consists of the following 8 steps:
1. Choose the split ratio; hence, the number of folds, K.
2. Initialize $\kappa$ and the weights of the network¹¹.
3. Train the K networks with fixed $\kappa$ on $T_j$ to achieve $\hat{w}_j(\kappa)$, $j = 1, 2, \cdots, K$. Calculate the validation errors $S_{V_j}$ and the cross-validation estimate $\hat{\Gamma}$.
4. Calculate the gradients $\partial S_{V_j}/\partial\kappa$ and $\partial\hat{\Gamma}/\partial\kappa$ cf. (5.9) and (5.10). Initialize the step-size $\eta$.
5. Update $\kappa$ using (5.7).
6. Retrain the K networks from the previous weight estimates and recalculate the cross-validation error $\hat{\Gamma}$.
7. If there is no decrease in the cross-validation error then perform a bisection of $\eta$ and go to step 5; otherwise, continue.
8. Repeat steps 4–7 until the relative change in the cross-validation error is below a small percentage or, e.g., the 2-norm of the gradient $\partial\hat{\Gamma}/\partial\kappa$ is below a small number.
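A compact sketch of this loop for the two-weight-decay case of Eqs. (5.11)–(5.13) follows. It is only an illustration of the scheme, not the authors' code: the training routine, the validation-gradient and Hessian helpers, and the two-group weight layout are assumed to be supplied by the surrounding model code.

    import numpy as np

    def adapt_weight_decays(kappa, splits, idx_I, idx_H, train, val_grad,
                            cost_hessian, cv_error, n_outer=50, eta0=0.1):
        """Gradient descent on kappa = [kappa_I, kappa_H] following Eqs. (5.7)-(5.13).

        splits       -- list of (train_fold, val_fold) pairs, one per fold
        idx_I, idx_H -- index arrays selecting input-to-hidden / hidden-to-output weights
        train        -- train(fold, kappa, w_init) -> estimated weight vector
        val_grad     -- val_grad(fold, w) -> gradient of the validation error dS_V/dw
        cost_hessian -- cost_hessian(fold, kappa, w) -> Hessian J of the training cost
        cv_error     -- cv_error(splits, kappa, weights) -> cross-validation estimate
        """
        weights = [train(tr, kappa, None) for tr, _ in splits]
        gamma_hat = cv_error(splits, kappa, weights)
        for _ in range(n_outer):
            grad = np.zeros(2)
            for (tr, va), w in zip(splits, weights):
                g = np.linalg.solve(cost_hessian(tr, kappa, w), val_grad(va, w))  # Eq. (5.13)
                grad[0] += -2.0 * np.dot(w[idx_I], g[idx_I])    # Eq. (5.12), input group
                grad[1] += -2.0 * np.dot(w[idx_H], g[idx_H])    # Eq. (5.12), hidden group
            grad /= len(splits)                                  # Eq. (5.9)
            eta = eta0
            while True:                                          # step 7: bisect eta if no improvement
                kappa_new = kappa - eta * grad                   # Eq. (5.7)
                new_w = [train(tr, kappa_new, w) for (tr, _), w in zip(splits, weights)]
                new_gamma = cv_error(splits, kappa_new, new_w)
                if new_gamma < gamma_hat or eta < 1e-8:
                    break
                eta *= 0.5
            kappa, weights, gamma_hat = kappa_new, new_w, new_gamma
        return kappa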
Compared to standard neural network training the above algorithm generally does not lead to severe computational overhead. First of all, the standard approach of tuning regularization parameters by more or less systematic search requires a lot of training sessions. The additional terms to be computed in the
adaptive algorithm are: 1) the derivative of the regularization functions w.r.t. the weights, $\partial r/\partial w$, 2) the gradient of the validation errors, $\partial S_{V_j}/\partial w$, and 3) the inverse Hessians, $J_j^{-1}$. The first term is often a simple function of the weights¹²
and computationally inexpensive. In the case of feed-forward neural networks,
the second term is computed by one pass of the validation examples through a
standard back-propagation algorithm. The third term is computationally more
expensive. However, if the network is trained using a second order scheme, which
requires computation of the inverse Hessian¹³, there is no computational over-
head.
The adaptive algorithm requires of the order of $K \cdot \mathrm{itr}_\kappa \cdot \mathrm{itr}_\eta$ weight retrainings. Here $\mathrm{itr}_\kappa$ is the number of iterations in the gradient descent scheme for $\kappa$ and $\mathrm{itr}_\eta$ is the average number of bisections of $\eta$ in step 7 of the algorithm. In the
experiments carried out the number of retrainings is approx. 100–300 times K.
Recall, since we keep on retraining from the current weight estimate, the number
of training epochs is generally small.
The number of weight retrainings is somewhat higher than that involved when
optimizing the network by using a pruning technique like validation set based Op-
timal Brain Damage (vOBD) [25], [27]. vOBD based on K-fold cross-validation
requires of the order of K ·m retrainings, where m = dim(w). The adaptive reg-
ularization algorithm is easily integrated with the pruning algorithm as demon-
strated in the experimental section.
¹¹ In Sec. 5.4.1 a practical initialization procedure for $\kappa$ is described.
¹² For weight decay, it is 2w.
¹³ Often the computations are reduced by using a Hessian approximation, e.g., the Gauss-Newton approximation. Many studies have reported significant training speed-up by using second order methods, see e.g., [22], [35].
5.4 Numerical Experiments
5.4.1 Potentials and Limitations in the Approach
The purpose of the section is to demonstrate the potential and limitations of
the suggested adaptive regularization framework. We consider the simple linear
data generating system, viz. estimating the mean of a Gaussian variable,
\[ y(k) = w^* + \varepsilon(k) \qquad (5.14) \]
where $w^*$ is the true mean and the noise $\varepsilon(k) \sim \mathcal{N}(0, \sigma_\varepsilon^2)$.
We employ 2-fold cross-validation, i.e., $D = T_1 \cup T_2$, where $T_j$, $j = 1, 2$ denote the two training sets in the validation procedure, containing approximately half the examples¹⁴. The linear model $y(k) = w + e(k)$ is trained using the mean square cost function augmented by simple weight decay, as shown by
\[ C_{T_j}(w) = \frac{1}{N_{tj}} \sum_{k=1}^{N_{tj}} \big(y(k) - w\big)^2 + \kappa \cdot w^2 \qquad (5.15) \]
where k runs over the examples of the data set in question. The estimated weights are $\hat{w}_j = \bar{y}_j/(1 + \kappa)$ where $\bar{y}_j = N_{tj}^{-1} \sum_{k=1}^{N_{tj}} y(k)$ are the estimated means. For
this simple case, the minimization of the cross-validation error, given by
\[ \hat{\Gamma}(\kappa) = \frac{1}{2} \sum_{j=1}^{2} S_{V_j}\big(\hat{w}_j(\kappa)\big), \qquad S_{V_j}\big(\hat{w}_j(\kappa)\big) = \frac{1}{N_{vj}} \sum_{k=1}^{N_{vj}} \big(y(k) - \hat{w}_j\big)^2, \qquad (5.16) \]
can be done exactly. The optimal $\kappa$ is given by
\[ \kappa_{\mathrm{opt}} = \frac{\bar{y}_1^2 + \bar{y}_2^2}{2\,\bar{y}_1 \bar{y}_2} - 1. \qquad (5.17) \]
Assuming N to be even, the ensemble average of the estimated weights¹⁵, $\hat{w}_j(\kappa_{\mathrm{opt}})$, leads to the final estimate
\[ \hat{w}_{\mathrm{reg}} = \frac{1}{2}\big(\hat{w}_1(\kappa_{\mathrm{opt}}) + \hat{w}_2(\kappa_{\mathrm{opt}})\big) = \frac{\bar{y}_1 \bar{y}_2 (\bar{y}_1 + \bar{y}_2)}{\bar{y}_1^2 + \bar{y}_2^2}. \qquad (5.18) \]
Notice two properties: First, the estimate is self-consistent, as $\lim_{N\to\infty} \hat{w}_{\mathrm{reg}} = \lim_{N\to\infty} \hat{w}_D = w^*$ where $\hat{w}_D = N^{-1}\sum_{k=1}^{N} y(k) = (\bar{y}_1 + \bar{y}_2)/2$ is the unregularized estimate trained on all data. Secondly, it is easy to verify that $\bar{y}_j \sim \mathcal{N}(w^*, 2\sigma_\varepsilon^2/N)$. That is, if the normalized true weight $\theta \equiv w^*/A$, where $A = \sqrt{2/N}\cdot\sigma_\varepsilon$, is large, then $\bar{y}_j \approx w^*$, which means $\hat{w}_{\mathrm{reg}} \approx \hat{w}_D$.
¹⁴ That is, $N_{t1} = N/2$ and $N_{t2} = N - N_{t1}$. Note that these training sets are also the two validation sets, $V_1 = T_2$, and vice versa.
¹⁵ The ensemble average corresponds to retraining on all data using $\kappa_{\mathrm{opt}}$. The weighting of the two estimates is only valid for N even (see Sec. 5.2 for the general case).
The objective is now to test whether using $\hat{w}_{\mathrm{reg}}$ results in a lower generalization error than employing the unregularized estimate $\hat{w}_D$. The generalization error associated with using the weight w is given by
\[ G(w) = \sigma_\varepsilon^2 + (w - w^*)^2. \qquad (5.19) \]
Further define the generalization error improvement,
\[ Z = G(\hat{w}_D) - G(\hat{w}_{\mathrm{reg}}) = (\hat{w}_D - w^*)^2 - (\hat{w}_{\mathrm{reg}} - w^*)^2. \qquad (5.20) \]
Note that Z is merely a function of the random variables $\bar{y}_1$, $\bar{y}_2$ and the true weight $w^*$, i.e., it suffices to get samples of $\bar{y}_1$, $\bar{y}_2$ when evaluating properties of Z. Define the normalized variables
\[ \tilde{y}_j = \frac{\bar{y}_j}{A} \sim \mathcal{N}\!\left(\frac{w^*}{\sigma_\varepsilon}\cdot\sqrt{\frac{N}{2}},\; 1\right) = \mathcal{N}(\theta, 1). \qquad (5.21) \]
It is easily shown that the normalized generalization error improvement $Z/A^2$ is a function of $\tilde{y}_1$, $\tilde{y}_2$ and $\theta$; hence, the distribution of $Z/A^2$ is parameterized solely by $\theta$.
As a quality measure we consider the probability of improvement in generalization error given by $\mathrm{Prob}\{Z > 0\}$. Note that $\mathrm{Prob}\{Z > 0\} = 1/2$ corresponds to equal preference of the two estimates. The probability of improvement depends only on the normalized weight $\theta$ since $\mathrm{Prob}\{Z > 0\} = \mathrm{Prob}\{Z/A^2 > 0\}$.
Moreover, we consider the relative generalization error improvement, defined as
\[ \mathrm{RGI} = 100\% \cdot \frac{Z}{G(\hat{w}_D)}. \qquad (5.22) \]
In particular, we focus on the probability that the relative improvement in generalization is bigger than¹⁶ x, i.e., $\mathrm{Prob}(\mathrm{RGI} > x)$. Optimally, $\mathrm{Prob}(\mathrm{RGI} > x)$ should be close to 1 for $x \leq 0\%$ and slowly decaying towards zero for $0\% < x \leq 100\%$. Using the notation $\tilde{w}_{\mathrm{reg}} = \hat{w}_{\mathrm{reg}}/A$, $\tilde{w}_D = \hat{w}_D/A$, RGI can be written as
\[ \mathrm{RGI} = 100\% \cdot \frac{(\tilde{w}_D - \theta)^2 - (\tilde{w}_{\mathrm{reg}} - \theta)^2}{N/2 + (\tilde{w}_D - \theta)^2}. \qquad (5.23) \]
Thus, the distribution of RGI is parameterized by θ and N.
The quality measures are computed by generating Q independent realizations of $\tilde{y}_1$, $\tilde{y}_2$, i.e., $\{\tilde{y}_1^{(i)}, \tilde{y}_2^{(i)}\}_{i=1}^{Q}$. E.g., the probability of improvement is estimated by $P_{\mathrm{imp}} = Q^{-1}\sum_{i=1}^{Q} \mu(Z^{(i)})$ where $\mu(x) = 1$ for $x > 0$, and zero otherwise.
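This Monte Carlo procedure is simple enough to sketch directly (an illustrative transcription of Eqs. (5.18)–(5.21) in normalized units; since the distribution of $Z/A^2$ depends only on $\theta$, only $\theta$ enters):

    import numpy as np

    def prob_improvement(theta, Q=100_000, seed=0):
        """Estimate P_imp = Prob{Z > 0} for the regularized 2-fold CV estimate of a mean."""
        rng = np.random.default_rng(seed)
        y1 = rng.normal(theta, 1.0, Q)                      # normalized fold means ~ N(theta, 1)
        y2 = rng.normal(theta, 1.0, Q)
        w_D = 0.5 * (y1 + y2)                               # unregularized estimate (normalized)
        w_reg = y1 * y2 * (y1 + y2) / (y1**2 + y2**2)       # Eq. (5.18) in normalized units
        Z = (w_D - theta) ** 2 - (w_reg - theta) ** 2       # normalized improvement Z / A^2
        return float(np.mean(Z > 0.0))

For instance, prob_improvement(0.0) should approach 1, in line with panel (a) of Fig. 5.1.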
The numerical results of comparing $\hat{w}_{\mathrm{reg}}$ to the unregularized estimate $\hat{w}_D$ are summarized in Fig. 5.1.
¹⁶ Note that $\mathrm{Prob}(\mathrm{RGI} > 0) = \mathrm{Prob}(Z > 0)$.
[Figure 5.1 has four panels: (a) $P_{\mathrm{imp}}$ as a function of $\theta$ (0 to 10) for N = 2, 8, 20; (b)–(d) $\mathrm{Prob}(\mathrm{RGI} > x)$ as a function of x (from −100% to 100%) for N = 2, 8, 20, at $\theta = 0$, $\theta = 2$ and $\theta = 10$ respectively.]
Fig. 5.1. Result of comparing the optimally regularized estimate $\hat{w}_{\mathrm{reg}}$ of the mean of a Gaussian variable to the unregularized estimate $\hat{w}_D$. The results are based on $Q = 10^5$ independent realizations. The probability of improvement $P_{\mathrm{imp}}$, shown in panel (a), is one when the normalized true weight $\theta = \sqrt{N/2}\cdot w^*/\sigma_\varepsilon = 0$, and above 0.5 for $\theta < 0.8$. That is, when the prior information of the weight decay regularizer is correct (true weight close to zero), when N is small, or when $\sigma_\varepsilon$ is large. As $\theta$ becomes large, $P_{\mathrm{imp}}$ tends to 0.5 due to the fact that $\hat{w}_{\mathrm{reg}} \approx \hat{w}_D$. Panels (b)–(d) display $\mathrm{Prob}(\mathrm{RGI} > x)$ for $\theta \in \{0, 2, 10\}$. The ideal probability curve is 1 for $x < 0$ and a slow decay towards zero for $x > 0$. The largest improvement is attained for small $\theta$ and small N. Panels (c) and (d) indicate that small N gives the largest probability for $x > 0$; however, it also gives the smallest probability for negative x. That is, a higher chance of getting a good improvement also increases the chance of deterioration. Notice that, even though $P_{\mathrm{imp}} < 0.5$ for $\theta = 2, 10$, there is still a reasonable probability of getting a significant improvement.
5.4.2 Classification
We test the performance of the adaptive regularization algorithm on a vowel
classification problem. The data are based on the Peterson and Barney database
[36]. The classes are vowel sounds characterized by the first four formant frequen-
cies. 76 persons (33 male, 28 female and 15 children) have pronounced $c = 10$ different vowels (IY, IH, EH, AE, AH, AA, AO, UH, UW, ER) two times. This results in a database of 1520 examples in total. The database is the verified database described in [41] where all data$^{17}$ are used, including examples where the utterance failed to be identified unanimously in the listening test (26 listeners). All examples were included to make the task more difficult.
The regularization was adapted using a hold-out validation error estimator; thus the examples were split into a data set, $D$, consisting of $N = 760$ examples (16 male, 14 female and 8 children) and an independent test set of the remaining 760 examples. The regularization was adapted by splitting the data set $D$ equally into a validation set of $N_v = 380$ examples and a training set of $N_t = 380$ examples (8 male, 7 female and 4 children in each set).
We used a feed-forward 2-layer neural network with hyperbolic tangent neurons in the hidden layer and modified SoftMax normalized outputs, $\hat{y}_i$, see e.g., [2], [18], [3]. Thus, the outputs estimate the posterior class probabilities $p(\mathcal{C}_i|\mathbf{x})$, where $\mathcal{C}_i$ denotes the $i$'th class, $i = 1, 2, \cdots, c$. Bayes rule (see e.g., [2]) is used to assign $\mathcal{C}_i$ to input $\mathbf{x}$ if $i = \arg\max_j p(\mathcal{C}_j|\mathbf{x})$. Suppose that the network weights are given by $\mathbf{w} = [\mathbf{w}^I, \mathbf{w}^I_{\mathrm{bias}}, \mathbf{w}^H, \mathbf{w}^H_{\mathrm{bias}}]$, where $\mathbf{w}^I$, $\mathbf{w}^H$ are input-to-hidden and hidden-to-output weights, respectively, and the bias weights are assembled in $\mathbf{w}^I_{\mathrm{bias}}$ and $\mathbf{w}^H_{\mathrm{bias}}$. Suppose that the targets are $y_i(k) = 1$ if $\mathbf{x}(k) \in \mathcal{C}_i$, and zero otherwise. The network is optimized using a log-likelihood loss function augmented by a weight decay regularizer using 4 regularization parameters,
$$C(\mathbf{w}) = -\frac{1}{N_t}\sum_{k=1}^{N_t}\sum_{i=1}^{c} y_i(k)\,\log\bigl(\hat{y}_i(k, \mathbf{w})\bigr) + \kappa^I\,|\mathbf{w}^I|^2 + \kappa^I_{\mathrm{bias}}\,|\mathbf{w}^I_{\mathrm{bias}}|^2 + \kappa^H\,|\mathbf{w}^H|^2 + \kappa^H_{\mathrm{bias}}\,|\mathbf{w}^H_{\mathrm{bias}}|^2 . \qquad (5.24)$$
We further define unnormalized weight decays as $\alpha \equiv \kappa \cdot N_t$. This regularizer is motivated by the fact that the bias, input and hidden layer weights play different roles; e.g., the input, hidden and bias signals normally have different scale (see also [2, Ch. 9.2]).
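For illustration, the cost (5.24) can be sketched in a few lines of NumPy. The code below is a hedged reimplementation, not the authors' code: it uses a plain softmax over the 9 outputs (the modified SoftMax parameterization is not reproduced), and the network sizes, data and weight-decay values are placeholders.

```python
import numpy as np

def cost(params, X, Y, kappas):
    """Regularized cost of (5.24): cross-entropy plus four weight-decay terms.
    params = (WI, bI, WH, bH); kappas = (kI, kI_bias, kH, kH_bias)."""
    WI, bI, WH, bH = params                 # input->hidden, hidden biases, hidden->output, output biases
    kI, kI_bias, kH, kH_bias = kappas
    H = np.tanh(X @ WI + bI)                # hidden layer, hyperbolic tangent neurons
    A = H @ WH + bH                         # output activations
    A -= A.max(axis=1, keepdims=True)       # numerically stabilized softmax
    P = np.exp(A) / np.exp(A).sum(axis=1, keepdims=True)
    xent = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))
    decay = (kI * np.sum(WI**2) + kI_bias * np.sum(bI**2)
             + kH * np.sum(WH**2) + kH_bias * np.sum(bH**2))
    return xent + decay

# Toy usage: 4 formant inputs, 5 hidden units, 9 output units (placeholder data).
rng = np.random.default_rng(0)
params = (rng.uniform(-0.5, 0.5, (4, 5)), np.zeros(5),
          rng.uniform(-0.5, 0.5, (5, 9)), np.zeros(9))
X = rng.normal(size=(380, 4))
Y = np.eye(9)[rng.integers(0, 9, 380)]
print(cost(params, X, Y, kappas=(1e-3, 1e-3, 1e-3, 1e-3)))
```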
The simulation set-up is:
– Network: 4 inputs, 5 hidden neurons, 9 outputs$^{18}$.
– Weights are initialized uniformly over $[-0.5, 0.5]$; regularization parameters are initialized at zero. One step in a gradient descent training algorithm (see e.g., [29]) is performed and the weight decays are re-initialized at $\lambda_{\max}/10^2$, where $\lambda_{\max}$ is the max. eigenvalue of the Hessian matrix of the cost function. This initialization scheme is motivated by the following observations:
  – Weight decays should be so small that they do not reduce the approximation capabilities of the network significantly.
  – They should be so large that the algorithm is prevented from being trapped in a local optimum and numerical instabilities are eliminated.
$^{17}$ The database can be retrieved from ftp://eivind.imm.dtu.dk/dist/data/vowel/PetersonBarney.tar.Z
$^{18}$ We only need 9 outputs since the posterior class probability of the 10th class is given by $1 - \sum_{j=1}^{9} p(\mathcal{C}_j|\mathbf{x})$.
– Training is now done using a Gauss-Newton algorithm (see e.g., [29]). The Hessian is inverted using the Moore-Penrose pseudo inverse, ensuring that the eigenvalue spread$^{19}$ is less than $10^8$ (see the sketch after this list).
– The regularization step-size $\eta$ is initialized at 1.
– When the adaptive regularization scheme has terminated, 3% of the weights are pruned using a validation set based version of the Optimal Brain Damage (vOBD) recipe [25], [27].
– Alternation between pruning and adaptive regularization continues until the validation error has reached a minimum.
– Finally, remaining weights are retrained on all data using the optimized weight decay parameters.
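A minimal sketch of such an eigenvalue-spread-limited pseudo inversion (an illustration only, assuming a symmetric, e.g. Gauss-Newton, Hessian; the spread limit of $10^8$ is the one quoted above):

```python
import numpy as np

def limited_pinv(H, max_spread=1e8):
    """Moore-Penrose pseudo inverse of a symmetric Hessian H, discarding
    eigendirections whose eigenvalue is below lambda_max / max_spread."""
    lam, U = np.linalg.eigh(H)                # eigenvalues in ascending order
    cutoff = lam.max() / max_spread
    inv_lam = np.zeros_like(lam)
    keep = lam > cutoff                       # drop ill-conditioned directions
    inv_lam[keep] = 1.0 / lam[keep]
    return (U * inv_lam) @ U.T                # U diag(inv_lam) U^T

# Toy check: a nearly singular 3x3 Hessian.
H = np.diag([1e-12, 1.0, 2.0])
print(limited_pinv(H))   # the 1e-12 direction is simply zeroed out
```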
Table 5.1. Probability of misclassification (pmc) and log-likelihood cost function (without reg. term, see (5.24)) for the classification example. The neural network averages and standard deviations are computed from 10 runs. In the case of small fixed regularization, weight decays were set at initial values equal to $\lambda_{\max}/10^6$ where $\lambda_{\max}$ is the largest eigenvalue of the Hessian matrix of the cost function. Optimal regularization refers to the case of optimizing 4 weight decay parameters. Pruning refers to validation set based OBD. KNN refers to k-nearest-neighbor classification.

Probability of Misclassification (pmc)
                                NN (small fixed reg.)   NN (opt. reg.+prun.)   KNN (k = 9)
Training Set                    0.075 ± 0.026           0.107 ± 0.008          0.150
Validation Set                  0.143 ± 0.014           0.115 ± 0.004          0.158
Test Set                        0.146 ± 0.010           0.124 ± 0.006          0.199
Test Set (train. on all data)   0.126 ± 0.010           0.119 ± 0.004          0.153

Log-likelihood Cost Function
                                NN (small fixed reg.)   NN (opt. reg.+prun.)
Training Set                    0.2002 ± 0.0600         0.2881 ± 0.0134
Validation Set                  0.7016 ± 0.2330         0.3810 ± 0.0131
Test Set                        0.6687 ± 0.2030         0.3773 ± 0.0143
Test Set (train. on all data)   0.4426 ± 0.0328         0.3518 ± 0.0096
Table 5.1 reports the average and standard deviations of the probability of
misclassification (pmc) and log-likelihood cost function over 10 runs for pruned
networks using the optimal regularization parameters. Note that retraining on
19
Eigenvalue spread should not be larger than the square root of the machine precision
[6].
the full data set decreases the test pmc slightly on average. In fact, improvement was noticed in 9 out of 10 runs. The table further shows the gain of the combined adaptive regularization/pruning algorithm relative to using a small fixed weight decay. However, recall, cf. Sec. 5.4.1, that the actual gain is very dependent on the noise level, data set size, etc. The objective is not to demonstrate high gain for a specific problem, but rather to demonstrate that the algorithm runs fairly robustly in a classification set-up. For comparison we used a k-nearest-neighbor (KNN) classifier (see e.g., [2]) and found that $k = 9$ neighbors was optimal by minimizing pmc on the validation set. The neural network performed significantly better. Contrasting the obtained results to other work is difficult. In [37] results on the Peterson-Barney vowel problem are reported, but their data are not exactly the same; only the first 2 formant frequencies were used. Furthermore, different test sets have been used for the different methods presented. The best result reported [28] is obtained by using KNN and reaches pmc = 0.186, which is significantly higher than our results.
Fig. 5.2 shows the evolution of the adaptive regularization as well as the
pruning algorithm.
5.4.3 Time Series Prediction
We tested the performance of the adaptive regularization schemes on the Mackey-Glass chaotic time series prediction problem, see e.g., [22], [40]. The goal is to predict the series 100 steps ahead based on previous observations. The feed-forward net configuration is an input lag-space $\mathbf{x}(k) = [x(k), x(k-6), x(k-12), x(k-18)]$ of 4 inputs, 25 hidden hyperbolic tangent neurons, and a single linear output unit $\hat{y}(k)$ which predicts $y(k) = x(k+100)$. The cost function is the squared error, $N_t^{-1}\sum_{k=1}^{N_t}(y(k) - \hat{y}(k, \mathbf{w}))^2$, augmented by a weight decay regularizer using 4 different weight decays as described in Section 5.4.2.
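A minimal sketch of how such a lag-space data set can be assembled from a scalar series (illustration only; the series x is assumed to be given, e.g. a simulated Mackey-Glass trajectory):

```python
import numpy as np

def lag_space_dataset(x, lags=(0, 6, 12, 18), horizon=100):
    """Build input vectors x(k) = [x(k), x(k-6), x(k-12), x(k-18)] and
    targets y(k) = x(k+100) from a 1-d series x."""
    x = np.asarray(x, dtype=float)
    start, stop = max(lags), len(x) - horizon
    X = np.stack([x[start - l:stop - l] for l in lags], axis=1)   # one column per lag
    y = x[start + horizon:stop + horizon]                         # 100-step-ahead targets
    return X, y

# Usage sketch: X, y = lag_space_dataset(series); then split into 250/250 train/validation.
```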
The simulation set-up is:
– The data set, $D$, has $N = 500$ examples and an independent test set has 8500 examples.
– The regularization parameters are optimized using a hold-out validation error with an even split$^{20}$ of the data set into training and validation sets, each having 250 examples.
– Weight decays are initialized at zero and one Gauss-Newton iteration is performed, then weight decays are re-initialized at $\lambda_{\max}/10^6$, where $\lambda_{\max}$ is the max. eigenvalue of the Hessian matrix of the cost function.
– The network is trained using a Gauss-Newton training scheme. The Hessian is inverted using the Moore-Penrose pseudo inverse, ensuring that the eigenvalue spread is less than $10^8$.
– The regularization step-size $\eta$ is initialized at $10^{-2}$.
– Alternation between adapting the 4 weight decays and validation set based pruning [25].
– The pruned network is retrained on all data using the optimized weight decay parameters.
$^{20}$ The sensitivity to different splits is considered in [25].
[Fig. 5.2, six panels with legends Train/Validation/Test: (a) error vs. iteration, (b) classification error vs. iteration, (c) weight decay value vs. iteration, (d) error vs. no. of weights, (e) classification error vs. no. of weights, (f) weight decay value vs. run no., for the four weight-decay groups (input–hidden, input bias–hidden, hidden–output, hidden bias–output).]
Fig. 5.2. Classification example. Panels (a), (b) and (c) show the evolution of the adaptive regularization algorithm in a typical run (fully connected network). The weight decays are optimized aiming at minimizing the validation error in panel (a). Note that also the test error decreases. This tendency is also evident in panel (b) displaying pmc, even though a small increase is noticed. In panel (c) the convergence of the unnormalized weight decays, $\alpha = \kappa \cdot N_t$, is depicted. Panels (d) and (e) show the evolution of errors and pmc during the pruning session. The optimal network is chosen as the one with minimal validation error, as indicated by the vertical line. There is only a marginal effect of pruning in this run. Finally, in panel (f), the variation of the optimal (end of pruning) $\alpha$'s in different runs is demonstrated. A clear similarity over runs is noticed.
Table 5.2. Normalized squared error performance for the time series prediction example. All figures are in units of $10^{-3}\,\sigma_x^2$, and averages and standard deviations are computed from 10 runs. In the case of small fixed regularization, weight decays were set at initial values equal to $\lambda_{\max}/10^6$ where $\lambda_{\max}$ is the largest eigenvalue of the Hessian matrix of the cost function. Optimal regularization refers to the case of optimizing 4 weight decay parameters. Pruning refers to validation set based OBD.

                                NN (small fixed reg.)   NN (small fixed reg.+prun.)   NN (opt. reg.+prun.)
Training Set                    0.17 ± 0.07             0.12 ± 0.04                   0.10 ± 0.07
Validation Set                  0.53 ± 0.26             0.36 ± 0.07                   0.28 ± 0.14
Test Set                        1.91 ± 0.68             1.58 ± 0.21                   1.29 ± 0.46
Test Set (train. on all data)   1.33 ± 0.43             1.34 ± 0.26                   1.17 ± 0.48
Table 5.2 reports the averages and standard deviations of the normalized squared error (i.e., the squared error normalized with the estimated variance of $x(k)$, denoted $\hat{\sigma}_x^2$) over 10 runs for optimal regularization parameters. Retraining on the full data set decreases the test error somewhat on average. Improvement was noticed in 10 out of 10 runs. We tested 3 different cases: small fixed regularization, small fixed regularization assisted by pruning, and combined adaptive regularization/pruning. It turns out that pruning alone does not improve performance; however, supplementing it by adaptive regularization gives a test error reduction. We furthermore tried a more flexible regularization scheme, viz. individual weight decay, where $R(\mathbf{w}, \boldsymbol{\kappa}) = \sum_{i=1}^{m} \kappa_i w_i^2$ and $\kappa_i \geq 0$ are imposed. In the present case it turned out that the flexible regularizer was not able to outperform the joint adaptive regularization/pruning scheme, possibly due to the training and validation set sizes.
Fig. 5.3 demonstrates adaptive regularization and pruning in a typical case
using 4 weight decays.
5.5 Conclusions
In this paper it was suggested to adapt regularization parameters by minimizing
the cross-validation error or a simple hold-out validation error. We derived a sim-
ple gradient descent scheme for optimizing regularization parameters which has
a small programming overhead and an acceptable computational overhead com-
pared to standard training. Numerical examples with a toy linear model showed
limitations and advantages of the adaptive regularization approach. Moreover,
numerical experiments on classification and time series prediction problems suc-
cessfully demonstrated the functionality of the algorithm. Adaptation of regu-
larization parameters resulted in lower generalization error; however, it should
[Fig. 5.3, four panels with legends Train/Validation/Test: (a) normalized squared error vs. iterations, (b) weight decay value vs. iterations, (c) normalized squared error vs. no. of weights, (d) weight decay value vs. no. of weights, for the four weight-decay groups (input–hidden, input bias–hidden, hidden–output, hidden bias–output).]
Fig. 5.3. Time series prediction example. Panels (a) and (b) show a typical evolution of the errors and the unnormalized weight decay values $\alpha$ when running the adaptive regularization algorithm using 4 weight decays. The normalized validation error drops by approx. a factor of 2 when adapting weight decays. It turns out that some regularization of the input-to-hidden and output bias weights is needed, whereas the other weights essentially require no regularization$^{22}$. In panels (c) and (d) it is demonstrated that pruning reduces the test error slightly. The optimal network is chosen as the one with minimal validation error, as indicated by the vertical line.
be emphasized that the actual yield is very dependent on the problem and the choice of the regularizer's functional form.
Acknowledgments. This research was supported by the Danish Natural Science and Technical Research Councils through the Computational Neural Network Center. JL furthermore acknowledges the Radio Parts Foundation for financial support. Mads Hintz-Madsen and Morten With Pedersen are acknowledged for stimulating discussions.
$^{22}$ Recall that if a weight decay $\kappa$ is below $\lambda_{\max}/10^8$ it does not influence the Moore-Penrose pseudo inversion of the Hessian.
Appendix
Assume that the regularization term is linear in the regularization parameters, i.e.,
$$R(\mathbf{w}, \boldsymbol{\kappa}) = \boldsymbol{\kappa}^{\top}\mathbf{r}(\mathbf{w}) = \sum_{i=1}^{q} \kappa_i\, r_i(\mathbf{w}). \qquad (5.25)$$
The gradient of the cross-validation error (5.4) is
$$\frac{\partial\widehat{\Gamma}}{\partial\boldsymbol{\kappa}}(\boldsymbol{\kappa}) = \frac{1}{K}\sum_{j=1}^{K} \frac{\partial S_{V_j}}{\partial\boldsymbol{\kappa}}\bigl(\widehat{\mathbf{w}}_j(\boldsymbol{\kappa})\bigr). \qquad (5.26)$$
Using the chain rule, the gradient vector of the validation error, $S_{V_j}$, can be written as
$$\frac{\partial S_{V_j}}{\partial\boldsymbol{\kappa}}\bigl(\widehat{\mathbf{w}}_j(\boldsymbol{\kappa})\bigr) = \frac{\partial\widehat{\mathbf{w}}^{\top}}{\partial\boldsymbol{\kappa}}\bigl(\widehat{\mathbf{w}}_j(\boldsymbol{\kappa})\bigr)\cdot\frac{\partial S_{V_j}}{\partial\mathbf{w}}\bigl(\widehat{\mathbf{w}}_j(\boldsymbol{\kappa})\bigr), \qquad (5.27)$$
where $\partial\widehat{\mathbf{w}}/\partial\boldsymbol{\kappa}$ is the $q \times m$ derivative matrix of the estimated weights w.r.t. the regularization parameters and $m = \dim(\mathbf{w})$. In order to find this derivative matrix, consider the gradient of the cost function w.r.t. the weights as a function of $\boldsymbol{\kappa}$ and use the following expansion around the current estimate $\boldsymbol{\kappa}^{(n)}$,
$$\frac{\partial C_{T_j}}{\partial\mathbf{w}}(\boldsymbol{\kappa}) = \frac{\partial C_{T_j}}{\partial\mathbf{w}}(\boldsymbol{\kappa}^{(n)}) + \frac{\partial^2 C_{T_j}}{\partial\mathbf{w}\,\partial\boldsymbol{\kappa}}(\boldsymbol{\kappa}^{(n)})\cdot(\boldsymbol{\kappa} - \boldsymbol{\kappa}^{(n)}) + o(|\boldsymbol{\kappa} - \boldsymbol{\kappa}^{(n)}|). \qquad (5.28)$$
Requiring $\widehat{\mathbf{w}}(\boldsymbol{\kappa}^{(n+1)})$ in the next iteration to be an optimal weight vector, i.e., $\partial C_{T_j}/\partial\mathbf{w}(\boldsymbol{\kappa}^{(n+1)}) = \mathbf{0}$, implies
$$\frac{\partial^2 C_{T_j}}{\partial\mathbf{w}\,\partial\boldsymbol{\kappa}}\bigl(\widehat{\mathbf{w}}(\boldsymbol{\kappa}^{(n)})\bigr) = \mathbf{0}. \qquad (5.29)$$
Recall that $\partial C_{T_j}/\partial\mathbf{w}(\boldsymbol{\kappa}^{(n)}) = \mathbf{0}$ by assumption. (5.29) can be used for determining $\partial\widehat{\mathbf{w}}/\partial\boldsymbol{\kappa}$. Recognizing that the cost function $C_{T_j}(\widehat{\mathbf{w}}(\boldsymbol{\kappa})) = S_{T_j}(\widehat{\mathbf{w}}(\boldsymbol{\kappa})) + R(\widehat{\mathbf{w}}(\boldsymbol{\kappa}), \boldsymbol{\kappa})$ depends implicitly (through $\widehat{\mathbf{w}}(\boldsymbol{\kappa})$) and explicitly on $\boldsymbol{\kappa}$, it is possible, by using (5.25), to derive the following relation$^{23}$:
$$\frac{\partial\widehat{\mathbf{w}}}{\partial\boldsymbol{\kappa}}(\widehat{\mathbf{w}}_j) = -\frac{\partial\mathbf{r}}{\partial\mathbf{w}}(\widehat{\mathbf{w}}_j)\cdot\mathbf{J}_j^{-1}(\widehat{\mathbf{w}}_j), \qquad (5.30)$$
where $\mathbf{J}_j = \partial^2 C_{T_j}/\partial\mathbf{w}\,\partial\mathbf{w}^{\top}$ is the Hessian of the cost function, which e.g., might be evaluated using the Gauss-Newton approximation [29]. Finally, substituting (5.30) into (5.27) gives
$$\frac{\partial S_{V_j}}{\partial\boldsymbol{\kappa}}(\widehat{\mathbf{w}}_j) = -\frac{\partial\mathbf{r}}{\partial\mathbf{w}}(\widehat{\mathbf{w}}_j)\cdot\mathbf{J}_j^{-1}(\widehat{\mathbf{w}}_j)\cdot\frac{\partial S_{V_j}}{\partial\mathbf{w}}(\widehat{\mathbf{w}}_j). \qquad (5.31)$$
$\partial S_{V_j}/\partial\mathbf{w}$ is found by ordinary back-propagation on the validation set, while $\partial\mathbf{r}/\partial\mathbf{w}$ is calculated from the specific assumptions on the regularizer.
$^{23}$ For convenience, here $\widehat{\mathbf{w}}$'s explicit $\boldsymbol{\kappa}$-dependence is omitted.
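To make the recipe concrete, the sketch below illustrates one hold-out ($K = 1$) update of the weight decays using (5.30)–(5.31) for the group weight-decay regularizer of Sec. 5.4.2, where $r_i(\mathbf{w})$ is the sum of squared weights in group $i$. It is an assumption-laden illustration, not the authors' implementation; in particular, the non-negativity projection of $\kappa$ is an assumption of this sketch.

```python
import numpy as np

def kappa_gradient(groups, J, dSdw, w):
    """Gradient of the hold-out validation error w.r.t. the weight decays (5.31):
    dS/dkappa = -(dr/dw) J^{-1} dS/dw, for group weight decays r_i(w) = |w_i|^2.
    groups: list of index arrays, one per regularization parameter."""
    drdw = np.zeros((len(groups), len(w)))
    for i, idx in enumerate(groups):
        drdw[i, idx] = 2.0 * w[idx]              # d r_i / d w (nonzero only inside group i)
    return -drdw @ np.linalg.solve(J, dSdw)      # cf. (5.30)-(5.31)

def kappa_step(kappa, grad, eta):
    """One gradient-descent step on the validation error; kappa is kept
    non-negative (this projection is an assumption of the sketch)."""
    return np.maximum(kappa - eta * grad, 0.0)
```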
References
[1] Akaike, H.: Fitting Autoregressive Models for Prediction. Annals of the Institute
of Statistical Mathematics 21, 243–247 (1969)
[2] Amari, S., Murata, N., Müller, K.R., Finke, M., Yang, H.: Asymptotic Statistical
Theory of Overtraining and Cross-Validation. Technical report METR 95-06 and
IEEE Transactions on Neural Networks 8(5), 985–996 (1995)
[3] Nonboe Andersen, L., Larsen, J., Hansen, L.K., Hintz-Madsen, M.: Adaptive Reg-
ularization of Neural Classifiers. In: Principe, J., et al. (eds.) Proceedings of the
IEEE Workshop on Neural Networks for Signal Processing VII, pp. 24–33. IEEE,
Piscataway (1997)
[4] Bishop, C.M.: Curvature-Driven Smoothing: A Learning Algorithm for Feedfor-
ward Neural Networks. IEEE Transactions on Neural Networks 4(4), 882–884
(1993)
[5] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press,
Oxford (1995)
[6] Dennis, J.E., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization
and Non-linear Equations. Prentice- Hall, Englewood Cliffs (1983)
[7] Drucker, H., Le Cun, Y.: Improving Generalization Performance in Character
Recognition. In: Juang, B.H., et al. (eds.) Neural Networks for Signal Process-
ing: Proceedings of the 1991 IEEE-SP Workshop, pp. 198–207. IEEE, Piscataway
(1991)
[8] Geisser, S.: The Predictive Sample Reuse Method with Applications. Journal of
the American Statistical Association 50, 320–328 (1975)
[9] Geman, S., Bienenstock, E., Doursat, R.: Neural Networks and the Bias/Variance
Dilemma. Neural Computation 4, 1–58 (1992)
[10] Girosi, F., Jones, M., Poggio, T.: Regularization Theory and Neural Networks
Architectures. Neural Computation 7(2), 219–269 (1995)
[11] Goutte, C., Larsen, J.: Adaptive Regularization of Neural Networks using Conju-
gate Gradient. In: Proceedings of ICASSP 1998, Seattle, USA, vol. 2, pp. 1201–
1204 (1998)
[12] Goutte, C.: Note on Free Lunches and Cross-Validation. Neural Computation 9(6),
1211–1215 (1997)
[13] Goutte, C.: Regularization with a Pruning Prior. Neural Networks (1997) (to
appear)
[14] Hansen, L.K., Rasmussen, C.E.: Pruning from Adaptive Regularization. Neural
Computation 6, 1223–1232 (1994)
[15] Hansen, L.K., Rasmussen, C.E., Svarer, C., Larsen, J.: Adaptive Regularization.
In: Vlontzos, J., Hwang, J.-N., Wilson, E. (eds.) Proceedings of the IEEE Work-
shop on Neural Networks for Signal Processing IV, pp. 78–87. IEEE, Piscataway
(1994)
[16] Hansen, L.K., Larsen, J.: Linear Unlearning for Cross-Validation. Advances in
Computational Mathematics 5, 269–280 (1996)
[17] Hertz, J., Krogh, A., Palmer, R.G.: Introduction to the Theory of Neural Com-
putation. Addison-Wesley Publishing Company, Redwood City (1991)
[18] Hintz-Madsen, M., With Pedersen, M., Hansen, L.K., Larsen, J.: Design and Eval-
uation of Neural Classifiers. In: Usui, S., Tohkura, Y., Katagiri, S., Wilson, E.
(eds.) Proceedings of the IEEE Workshop on Neural Networks for Signal Process-
ing VI, pp. 223–232. IEEE, Piscataway (1996)
[19] Hornik, K.: Approximation Capabilities of Multilayer Feedforward Networks. Neu-
ral Networks 4, 251–257 (1991)
[20] Kearns, M.: A Bound on the Error of Cross Validation Using the Approxima-
tion and Estimation Rates, with Consequences for the Training-Test Split. Neural
Computation 9(5), 1143–1161 (1997)
[21] Larsen, J.: A Generalization Error Estimate for Nonlinear Systems. In: Kung, S.Y.,
et al. (eds.) Proceedings of the 1992 IEEE-SP Workshop on Neural Networks for
Signal Processing, vol. 2, pp. 29–38. IEEE, Piscataway (1992)
[22] Larsen, J.: Design of Neural Network Filters, Ph.D. Thesis, Electronics Institute,
Technical University of Denmark (1993),
ftp://eivind.imm.dtu.dk/dist/PhD_thesis/jlarsen.thesis.ps.Z
[23] Larsen, J., Hansen, L.K.: Generalization Performance of Regularized Neural Net-
work Models. In: Vlontzos, J., et al. (eds.) Proceedings of the IEEE Workshop on
Neural Networks for Signal Processing IV, pp. 42–51. IEEE, Piscataway (1994)
[24] Larsen, J., Hansen, L.K.: Empirical Generalization Assessment of Neural Network
Models. In: Girosi, F., et al. (eds.) Proceedings of the IEEE Workshop on Neural
Networks for Signal Processing V, pp. 30–39. IEEE, Piscataway (1995)
[25] Larsen, J., Hansen, L.K., Svarer, C., Ohlsson, M.: Design and Regularization of
Neural Networks: The Optimal Use of a Validation Set. In: Usui, S., Tohkura,
Y., Katagiri, S., Wilson, E. (eds.) Proceedings of the IEEE Workshop on Neural
Networks for Signal Processing VI, pp. 62–71. IEEE, Piscataway (1996)
[26] Larsen, J., et al.: Optimal Data Set Split Ratio for Empirical Generalization Error
Estimates (in preparation)
[27] Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal Brain Damage. In: Touretzky,
D.S. (ed.) Proceedings of the 1989 Conference on Advances in Neural Information
Processing Systems, vol. 2, pp. 598–605. Morgan Kaufmann Publishers, San
Mateo (1990)
[28] Lowe, D.: Adaptive Radial Basis Function Nonlinearities and the Problem of Gen-
eralisation. In: Proc. IEE Conf. on Artificial Neural Networks, pp. 171–175 (1989)
[29] Ljung, L.: System Identification: Theory for the User. Prentice-Hall, Englewood
Cliffs (1987)
[30] MacKay, D.J.C.: A Practical Bayesian Framework for Backprop Networks. Neural
Computation 4(3), 448–472 (1992)
[31] Moody, J.: Prediction Risk and Architecture Selection for Neural Networks. In:
Cherkassky, V., et al. (eds.) From Statistics to Neural Networks: Theory and
Pattern Recognition Applications, vol. 136. Springer-Verlag Series F, Berlin (1994)
[32] Moody, J., Rögnvaldsson, T.: Smoothing Regularizers for Projective Basis Func-
tion Networks. In: Proceedings of the 1996 Conference on Advances in Neural
Information Processing Systems, vol. 9. MIT Press, Cambridge (1997)
[33] Murata, N., Yoshizawa, S., Amari, S.: Network Information Criterion Deter-
mining the Number of Hidden Units for an Artificial Neural Network Model. IEEE
Transactions on Neural Networks 5(6), 865–872 (1994)
[34] Nowlan, S., Hinton, G.: Simplifying Neural Networks by Soft Weight Sharing.
Neural Computation 4(4), 473–493 (1992)
[35] With Pedersen, M.: Training Recurrent Networks. In: Proceedings of the IEEE
Workshop on Neural Networks for Signal Processing VII. IEEE, Piscataway (1997)
[36] Peterson, G.E., Barney, H.L.: Control Methods Used in a Study of the Vowels.
JASA 24, 175–184 (1952)
[37] Shadafan, R.S., Niranjan, M.: A Dynamic Neural Network Architecture by Se-
quential Partitioning of the Input Space. Neural Computation 6(6), 1202–1222
(1994)
[38] Sjöberg, J.: Non-Linear System Identification with Neural Networks, Ph.D. The-
sis no. 381, Department of Electrical Engineering, Linköping University, Sweden
(1995)
[39] Stone, M.: Cross-validatory Choice and Assessment of Statistical Predictors. Jour-
nal of the Royal Statistical Society B 36(2), 111–147 (1974)
[40] Svarer, C., Hansen, L.K., Larsen, J., Rasmussen, C.E.: Designer Networks for
Time Series Processing. In: Kamm, C.A., et al. (eds.) Proceedings of the IEEE
Workshop on Neural Networks for Signal Processing, vol. 3, pp. 78–87. IEEE,
Piscataway (1993)
[41] Watrous, R.L.: Current Status of Peterson-Barney Vowel Formant Data. JASA 89,
2459–2460 (1991)
[42] Weigend, A.S., Huberman, B.A., Rumelhart, D.E.: Predicting the Future: A Con-
nectionist Approach. International Journal of Neural Systems 1(3), 193–209 (1990)
[43] Williams, P.M.: Bayesian Regularization and Pruning using a Laplace Prior. Neu-
ral Computation 7(1), 117–143 (1995)
[44] Wolpert, D.H., Macready, W.G.: The Mathematics of Search. Technical Report
SFI-TR-95-02-010, Santa Fe Institute (1995)
[45] Wu, L., Moody, J.: A Smoothing Regularizer for Feedforward and Recurrent Neu-
ral Networks. Neural Computation 8(3) (1996)
[46] Zhu, H., Rohwer, R.: No Free Lunch for Cross Validation. Neural Computa-
tion 8(7), 1421–1426 (1996)
6
Large Ensemble Averaging
David Horn¹, Ury Naftaly¹, and Nathan Intrator²
¹ School of Physics and Astronomy
² School of Mathematical Sciences
Raymond and Beverly Sackler Faculty of Exact Sciences
Tel Aviv University, Tel Aviv 69978, Israel
horn@neuron.tau.ac.il
http://neuron.tau.ac.il/~horn/
Abstract. Averaging over many predictors leads to a reduction of the
variance portion of the error. We present a method for evaluating the
mean squared error of an infinite ensemble of predictors from finite (small
size) ensemble information. We demonstrate it on ensembles of networks
with different initial choices of synaptic weights. We find that the optimal
stopping criterion for large ensembles occurs later in training time than
for single networks. We test our method on the sunspots data set and
obtain excellent results.
6.1 Introduction
Ensemble averaging has been proposed in the literature as a means to improve the generalization properties of a neural network predictor [3, 11, 7]. We follow
this line of thought and consider averaging over a set of networks that differ from
one another just by the initial values of their synaptic weights.
We introduce a method to extract the performance of large ensembles from
that of finite size ones. This is explained in the next section, and is demonstrated
on the sunspots data set. Ensemble averaging over the initial conditions of the
neural networks leads to a lower prediction error, which is obtained for a later
training time than that expected from single networks. Our method outperforms
the best published results for the sunspots problem [6].
The theoretical setting of the method is provided by the bias/variance decom-
position. Within this framework, we define a particular bias/variance decompo-
sition for networks differing by their initial conditions only. While the bias of
the ensemble of networks with different initial conditions remains unchanged,
the variance error decreases considerably.
6.2 Extrapolation to Large-Ensemble Averages
The training procedure of neural networks starts out with some choice of initial
values of the connection weights. We consider ensembles of networks that differ
from one another just by their initial values and average over them. Since the
space of initial conditions is very large we develop a technique which allows us
to approximate averaging over the whole space.
Our technique consists of constructing groups of a fixed number of networks, $Q$. All networks differ from one another by the random choice of their initial weights. For each group we define our predictor to be the average of the outputs of all $Q$ networks. Choosing several different groups of the same size $Q$, and averaging over their predictions for the test set, defines the finite size average that is displayed in Fig. 6.1. Then we perform a parametric estimate of the limit $Q \to \infty$. A simple regression in $1/Q$ suffices to obtain this limit in the sunspots problem, as shown in Fig. 6.2. In general one may encounter a more complicated inverse power behavior, indicating correlations between networks with different initial weights.
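In code, the extrapolation amounts to a linear fit in $1/Q$; a minimal sketch with hypothetical ARV values (the numbers below are made up purely for illustration):

```python
import numpy as np

def extrapolate_arv(Q_values, arv):
    """Linear regression of ARV against 1/Q; the intercept estimates the
    Q -> infinity ensemble error, the slope the variance removed by averaging."""
    inv_q = 1.0 / np.asarray(Q_values, dtype=float)
    slope, intercept = np.polyfit(inv_q, np.asarray(arv, dtype=float), deg=1)
    return intercept, slope

# Hypothetical usage with the group sizes used in the text:
Q = [1, 2, 4, 5, 10, 20]
arv = [0.084, 0.080, 0.078, 0.0775, 0.076, 0.0755]   # made-up numbers for illustration
print(extrapolate_arv(Q, arv))
```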
Fig. 6.1. ARV of test set
Prediction error (ARV) is plotted vs. training time in kilo epochs (KE). The
curves correspond to different choices of group sizes: Q =1, 2, 4, 5, 10, 20 from
top to bottom. The lowest curve is the extrapolation to Q →∞.
6.2.1 Application to the Sunspots Problem
Yearly sunspot statistics have been gathered since 1700. These data have been
extensively studied and have served as a benchmark in the statistical literature
[9, 10, 4]. Following previous publications [10, 6, 8] we choose the training set to
contain the period between 1701 and 1920, and the test set to contain the years 1921 to 1955.

[Fig. 6.2 (ARV vs. 1/Q). Extrapolation method used for extracting the Q → ∞ prediction. The results for different ensemble sizes Q at two different training periods, t = 70 KE (dots) and 140 KE (circles), lie on straight lines as a function of 1/Q. For each curve, the first three points from the right represent ensemble sizes of 1, 2, and 4, respectively. While the three points of 140 KE all lie above the corresponding ones of 70 KE, an extrapolation to larger ensemble sizes suggests that the overall performance will be better for 140 KE, as is observed from the fit to the line.]

Following [10], we calculate the prediction error according to the average relative variance (ARV) of the data set S:
$$\mathrm{ARV} = \frac{\sum_{k\in S}\,\bigl(y_k - f(\mathbf{x}_k)\bigr)^2}{\sum_{k\in S}\,\bigl(y_k - E[y_k]\bigr)^2} \qquad (6.1)$$
$y_k(\mathbf{x}_k)$ are the data values and $f(\mathbf{x}_k)$ is the predictor. In our time series problem, for any given time point $t = k$, the input vector $\mathbf{x}_k$ has component values taken from the series at times $t-1, t-2, \cdots, t-12$ (as in [10]). The denominator in Eq. 6.1 is $\sigma^2 = 1535$ for the training set. The same value is used for the test set.
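Reading (6.1), with the fixed training-set normalization just described, as the mean squared prediction error divided by $\sigma^2$, the ARV can be computed as in the following minimal sketch (an illustration, not the authors' code):

```python
import numpy as np

def arv(y_true, y_pred, sigma2=1535.0):
    """Average relative variance (6.1): mean squared prediction error normalized
    by the training-set variance (sigma^2 = 1535 for the sunspots training set)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2) / sigma2
```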
We use neural networks with 12 inputs, one sigmoidal hidden layer consisting of
4 units and a linear output. They are then enlarged to form recurrent networks
(SRN) [1] in which the input layer is increased by adding to it the hidden layer
of the previous point in the time series. The learning algorithm consists of back
propagation applied to an error function which is the MSE of the training set. A
validation set containing 35 randomly chosen points was left out during training
to serve for performance validation.
Fig. 6.1 displays our results on the test set as a function of the number of training epochs. We observe a descending order of $Q = 1, 2, 4, 5, 10, 20$ followed by the extrapolation to $Q \to \infty$. All of these curves correspond to averages over groups of size $Q$, calculated by running 60 networks. To demonstrate how the extrapolation is carried out, we display in Fig. 6.2 the points obtained for $t = 70$ and $t = 140$ KE as a function of $1/Q$. It is quite clear that a linear extrapolation is very satisfactory. Moreover, the results for $Q = 20$ are not far from the extrapolated $Q \to \infty$ results. Note that the minimum of the $Q \to \infty$ curve in Fig. 6.1 occurs at a much higher training time than that of the $Q = 1$ single network curve. This is also evident from the crossing of the $t = 70$ and $t = 140$ KE lines in Fig. 6.2. An important conclusion is that the stopping criterion for ensemble training (to be applied, of course, to every network in the group) is very different from that of single network training.
6.2.2 Best Result
The curves shown in Fig. 1 were obtained with a learning rate of 0.003. Lower
learning rates lead to lower errors. In that case the effect of ensemble averaging
is not as dramatic. We obtained our best result by changing our input vector into
the six-dimensional choice of Pi & Peterson [8] that consists of $x_{t-1}$, $x_{t-2}$, $x_{t-3}$, $x_{t-4}$, $x_{t-9}$ and $x_{t-10}$. Using a learning rate of 0.0005 on the otherwise unchanged SRN described above, we obtain the minimum of the prediction error at 0.0674, which is better than any previously reported result.
6.3 Theoretical Analysis
The theoretical setting of the method is provided by the bias/variance decompo-
sition. Within this framework, we define a particular bias/variance decomposi-
tion for networks differing by their initial conditions only. This is a particularly
useful subset of the general set of all sources of variance.
The performance of an estimator is commonly evaluated by the Mean Square Error (MSE), defined as
$$\mathrm{MSE}(f) \equiv E\bigl[(y - f(\mathbf{x}))^2\bigr], \qquad (6.2)$$
where the average is over test sets for the predictor $f$, and $y$ are the target values of the data in $\mathbf{x}$. Assuming the expectation $E$ is taken with respect to the true probability of $\mathbf{x}$ and $y$, the MSE can be decomposed into
$$E\bigl[(y - f(\mathbf{x}))^2\bigr] = E\bigl[(y - E[y|\mathbf{x}])^2\bigr] + E\bigl[(f(\mathbf{x}) - E[y|\mathbf{x}])^2\bigr]. \qquad (6.3)$$
The first RHS term represents the variability or the noise in the data and is
independent of the estimator f. It suffices therefore to concentrate on the second
term.

[Fig. 6.3 (ARV vs. training time t in K-epochs). Our best results for the test set of the sunspots problem. Plotted here are Q = 1 results for various choices of initial conditions, represented by their averages with error-bars extending over a standard deviation, and Q = 20 results (the thinner points), as a function of training time in K-epochs. The network is based on the Pi & Peterson variables, and the learning rate is 0.0005.]

Any given predictor $f(\mathbf{x})$ is naturally limited by the set of data on which
it is trained. Considering a typical error one may average over all data space [2]
and decompose this error into Bias and Variance components:
$$E_D\bigl[(f(\mathbf{x}) - E[y|\mathbf{x}])^2\bigr] = B_D + V_D, \qquad (6.4)$$
where
$$B_D(f(\mathbf{x})) = \bigl(E_D[f(\mathbf{x})] - E[y|\mathbf{x}]\bigr)^2, \qquad (6.5)$$
$$V_D(f(\mathbf{x})) = E_D\bigl[(f(\mathbf{x}) - E_D[f(\mathbf{x})])^2\bigr]. \qquad (6.6)$$
In our application we use as our predictor $E_I[f(\mathbf{x})]$, where the subscript $I$ denotes the space of initial weights of the neural network that serves as $f(\mathbf{x})$. To understand the effect of averaging over initial weights, let us construct a space $R$ by the direct product of $D$ and $I$. It can then be shown that
$$B_R(f(\mathbf{x})) = B_D\bigl(E_I[f(\mathbf{x})]\bigr), \qquad (6.7)$$
and
$$V_R(f(\mathbf{x})) \geq V_D\bigl(E_I[f(\mathbf{x})]\bigr). \qquad (6.8)$$
This means that using $E_I[f(\mathbf{x})]$ as the predictor, the characteristic error has reduced variance but unchanged bias.
The bias term may also be represented as $B_R = E_D[B_I]$, where
$$B_I(f(\mathbf{x})) = \bigl(E_I[f(\mathbf{x})] - E[y|\mathbf{x}]\bigr)^2. \qquad (6.9)$$
$B_I$ is unaffected when $f(\mathbf{x})$ is replaced by its average. The analogously defined variance term, $V_I$, gets eliminated by such averaging. In other words, by averaging over all $I$ we eliminate all variance due to the choice of initial weights. The difference between the $Q = 1$ and $Q \to \infty$ curves in Fig. 6.1 represents this reduction of variance.
To understand the $Q$ dependence of Fig. 6.2, consider an average $\bar{f}(\mathbf{x})$ defined over functions that represent independent identically distributed random variables. It can then be shown that
$$B(\bar{f}) = B(f), \qquad V(\bar{f}) = V(f)/Q. \qquad (6.10)$$
Hence we can interpret the $1/Q$ behavior displayed in Fig. 6.2 as a demonstration that the choice of initial conditions in that analysis acts as effective random noise. In general this is not necessarily the case, since networks with different initial conditions may have non-trivial correlations. For a more thorough discussion of this and other points see [5].
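The $1/Q$ scaling of (6.10) is easy to verify numerically; the sketch below (synthetic iid "predictors", not the sunspot networks) averages Q noisy copies of a common prediction and shows the variance of the average shrinking while the mean stays put:

```python
import numpy as np

rng = np.random.default_rng(0)
common_prediction, noise = 3.0, 1.0       # shared (possibly biased) prediction plus iid noise
for Q in (1, 2, 4, 10, 20):
    ensemble = common_prediction + rng.normal(0.0, noise, size=(100000, Q)).mean(axis=1)
    # variance shrinks like noise**2 / Q; the mean (and hence the bias) is unchanged
    print(Q, round(ensemble.mean(), 3), round(ensemble.var(), 4))
```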
In conclusion, we see that averaging over networks with different initial weights
is helpful in reducing the prediction error by eliminating the variance induced
by initial conditions. Performing this average over groups of finite size Q one
finds out from the Q dependence if the errors induced by initial conditions are
correlated or not. Moreover, one may estimate the Q needed to eliminate this
source of variance.
References
[1] Elman, J.L., Zipser, D.: Learning the Hidden Structure of Speech. J. Acoust. Soc.
Amer. 83, 1615–1626 (1988)
[2] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Comp. 4(1), 1–58 (1992)
[3] Lincoln, W.P., Skrzypek, J.: Synergy of clustering multiple back propagation net-
works. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Sys-
tems, vol. 2, pp. 650–657. Morgan Kaufmann, San Mateo (1990)
[4] Morris, J.: Forecasting the sunspot cycle. J. Roy. Stat. Soc. Ser. A 140, 437–447
(1977)
[5] Naftaly, U., Intrator, N., Horn, D.: Optimal Ensemble Averaging of Neural Net-
works. Network, Comp. Neural Sys. 8, 283–296 (1997)
[6] Nowlan, S.J., Hinton, G.E.: Simplifying neural networks by soft weight-sharing.
Neural Computation 4, 473–493 (1992)
[7] Perrone, P.M.: Improving regression estimation: averaging methods for variance
reduction with extensions to general convex measure optimization. PhD thesis,
Brown University, Institute for Brain and Neural Systems (1993)
6. Large Ensemble Averaging 137
[8] Pi, H., Peterson, C.: Finding the Embedding Dimension and Variable Dependen-
cies in Time Series. Neural Comp. 6, 509–520 (1994)
[9] Priestley, M.B.: Spectral Analysis and Time Series. Academic Press (1981)
[10] Weigend, A.S., Huberman, B.A., Rumelhart, D.: Predicting the future: A connec-
tionist approach. Int. J. Neural Syst. 1, 193–209 (1990)
[11] Wolpert, D.H.: Stacked generalization. Neural Networks 5, 241–259 (1992)
Improving Network Models and Algorithmic
Tricks
Preface
This section contains 5 chapters presenting easy-to-implement tricks which modify either the architecture and/or the learning algorithm so as to enhance the network's modeling ability. Better modeling means better solutions in less time.
In chapter 7, Gary Flake presents a trick that gives an MLP the additional
power of an RBF. Trivial to implement, one simply adds extra inputs whose
values are the square of the original inputs (p. 144). While adding higher
order terms as inputs is not a new idea, this chapter contributes new insight
by providing (1) a good summary of previous work, (2) simple clear examples
illustrating this trick, (3) a theoretical justification showing that one need only
add the higher order squared terms, and (4) a thorough comparison with nu-
merous other network models. The need for only the squared terms is significant
because it means that we gain this extra power without having the number of
inputs grow excessively large. We remark that this idea can be extended further
by including relevant features other than squared inputs e.g. by using kernel
PCA [2] to obtain the non-linear features.
Rich Caruana in chapter 8 presents multi-task learning (MTL) (p. 163)
where extra outputs are added to a network to predict tasks separate but related
to the primary task. To introduce the trick, the chapter begins with an example
and detailed discussion of a simple boolean function of binary inputs. The author
then presents several examples of what one might use as extra outputs in practice. These
include, among others: (1) features that are available only after predictions
must be made, but which are available offline at training time (p. 170), (2) the
same task but with a different metric (p. 175), and (3) the same task but with
different output representations (p. 176). Empirical results are presented for
(1) mortality rankings for pneumonia where the extra outputs are test results
not available when the patient first enters the hospital but which are available a
posteriori to complement the training data, and (2) a vehicle steering task where
other outputs include location of centerline and road edges, etc. The last part of
the chapter is devoted to topics for implementing MTL effectively, such as size
of hidden layers (p. 181), early stopping (p. 181), and learning rates (p. 185).
The next chapter by Patrick van der Smagt and Gerd Hirzinger presents a
trick to reduce the problem of ill-conditioning in the Hessian. If a unit has a
very small outgoing weight then the influence of the incoming weights to that
unit will be severely diminished (see chapter 1 for other sources of Hessian ill-
conditioning). This results in flat spots in the error surface, which translates into
slow training (see also [1]). The trick is to add linear shortcut connections (p.
196) from the input to the output nodes to create what the authors refer to as
a linearly augmented feed-forward network. These connections share the
weights with the input to hidden connections so that no new weight parameters
are added. This trick enhances the sensitivity of the network to those incom-
ing weights thus removing or reducing the flat spots in the error surface. The
improvement in the quality of the error surface is illustrated in a toy example.
Simulations with data from a robot arm are also shown.
The trick discussed by Nicol Schraudolph in chapter 10 is to center the various
factors comprising the neural network’s gradient (p. 208): input and hidden
unit activities (see chapter 1), error signals, and the slope of the hidden units’
nonlinear activation functions (p. 208). To give an example: activity centering
(p. 207) is done by simply transforming the values of the components x
i
into
ˇx
i
= x
i
−x
i
,
where · denotes averaging over training samples. All different centering strate-
gies can be implemented efficiently for a stochastic, batch or mini batch learning
scenario (p. 209). He also uses shortcut connections (p. 208) but quite differ-
ently from the previous chapter: the shortcut connections contain new weights
(not shared) which complement slope centering by carrying the linear compo-
nent of the signal, making it possible for the rest of the network to concentrate
on the nonlinear component of the problem. So, with respect to shortcut con-
nections, the approaches in chapters 9 and 10 appear complementary. Centering
gives a nice speed-up without much harm to the generalization error, as seen in
the simulations on toy and vowel data (p. 211).
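As a minimal illustration of activity centering (batch statistics only; the error-signal and slope centering discussed in the chapter are not shown), one might write:

```python
import numpy as np

def center(a):
    """Subtract the mean activity over the training batch: a_i -> a_i - <a_i>."""
    return a - a.mean(axis=0, keepdims=True)

# Usage sketch inside a forward pass (W1, W2, b are hypothetical weight arrays):
# X_c = center(X); H = np.tanh(X_c @ W1); H_c = center(H); y = H_c @ W2 + b
```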
In chapter 11, Tony Plate presents a trick requiring only minimal mem-
ory overhead that reduces numerical round-off in backpropagation networks.
Round-off error can occur in the standard method for computing the derivative
of the logistic function since it requires calculating the product
$y(1-y)$, where $y$ is the output of either a hidden or output unit. When the value of $y$ is close to 1, then the limited precision of single or even double precision floating point numbers can result in the product being zero (a small numerical illustration follows the list below). This may not be a serious problem for on-line learning but can cause significant problems for networks using batch mode, particularly when second order methods are used. Such round-off can occur in other types of units as well. This chapter provides formulas for reducing such round-off errors in the computation of
– derivatives of the error for logistic units or tanh units (p. 226 and 229),
– derivatives in a one-of-k classification problem with cross-entropy error and softmax (p. 228),
– derivatives and errors in a two-class classification problem using a single logistic output unit with cross-entropy error and 0/1 targets (p. 228).
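The cancellation is easy to reproduce; the sketch below is a generic numerical illustration, not Plate's formulas: the derivative factor vanishes when computed from the rounded output y, but survives when 1 − y is computed directly from the net input.

```python
import numpy as np

x = 40.0                               # large net input
y = 1.0 / (1.0 + np.exp(-x))           # rounds to exactly 1.0 in double precision
print(y * (1.0 - y))                   # 0.0 -- the derivative information is lost
one_minus_y = 1.0 / (1.0 + np.exp(x))  # equals 1 - y, computed without cancellation
print(y * one_minus_y)                 # ~4.25e-18, small but non-zero
```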
Jenny & Klaus
References
[1] Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Computation 9(1), 1–42
(1997)
[2] Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10, 1299–1319 (1998)
7
Square Unit Augmented, Radially Extended,
Multilayer Perceptrons
Gary William Flake
Siemens Corporate Research, Inc.
755 College Road East
Princeton, NJ 08540
flake@scr.siemens.com
http://mitpress.mit.edu/books/FLAOH/cbnhtml/author.html
Abstract. Consider a multilayer perceptron (MLP) with d inputs, a sin-
gle hidden sigmoidal layer and a linear output. By adding an additional
d inputs to the network with values set to the square of the first d inputs,
properties reminiscent of higher-order neural networks and radial basis
function networks (RBFN) are added to the architecture with little added
expense in terms of weight requirements. Of particular interest, this ar-
chitecture has the ability to form localized features in a d-dimensional
space with a single hidden node but can also span large volumes of the in-
put space; thus, the architecture has the localized properties of an RBFN
but does not suffer as badly from the curse of dimensionality. I refer to a
network of this type as a SQuare Unit Augmented, Radially Extended,
MultiLayer Perceptron (SQUARE-MLP or SMLP).
7.1 Introduction and Motivation
When faced with a new and challenging problem, the most crucial decision that
a neural network researcher must make is in the choice of the model class to pur-
sue. Several different types of neural architectures are commonly found in most
model building tool-boxes, with two of the more familiar, radial basis function
networks (RBFNs) [15, 16, 13, 1, 17] and multilayer perceptrons (MLPs) [25, 20],
exemplifying the differences found between global and local model types. Specif-
ically, an MLP is an example of a global model that builds approximations with
features that alter the entire input-output response, while an RBFN is a local
model that uses features confined to a finite region in the input-space. This single
difference between the two model types has major implications for many archi-
tectural and algorithmic issues. While MLPs are slow learners, have low memory
retention, typically use homogeneous learning rules and are relatively less trou-
bled by the curse of dimensionality, RBFNs are nearly opposite in every way:
they are fast learners, have high memory retention, typically use heterogeneous
learning rules and are greatly troubled by the curse of dimensionality.
Because of these differences, it is often tempting to use one or more heuristics
to make the choice, e.g., RBFNs (MLPs) for low (high) dimensional problems,
or RBFNs (MLPs) for continuous function approximation (pattern classifica-
tion) problems. While these rules-of-thumb are often sufficient for simple prob-
lems there are many exceptions that defy the rules (e.g., see [12]). Moreover,
more challenging problems from industrial settings often have high-dimensional
input-spaces that are locally well-behaved and may not be clearly defined as
either function approximation or pattern classification problems. This means
that choosing the best architectures for a particular problem can be a nontrivial
problem in itself.
Ironically, a good compromise to this dilemma has been known for quite some
time but only recently has the elegance of the trick been appreciated. For years,
researchers have commonly used augmented MLPs by adding the squares of
the inputs as auxiliary inputs. The justification for this has always been fairly
casual and has usually boiled down to the argument that using this trick couldn’t
possibly hurt. But as it turns out, an MLP augmented in this way with n hidden
nodes can almost perfectly approximate an RBFN with n basis functions. The
“almost” comes from the fact that the radial basis function in the augmented
MLP is not Gaussian but quasi-Gaussian (which is an admittedly undefined term
that I simply use to mean “so close to Gaussian that it really doesn’t matter.”).
This means that an MLP augmented with the squares of its inputs can easily
form local features with a single hidden node but can also span vast regions of
the input-space, thereby effectively ignoring inputs when needed. Thus, the best
of both architectural approaches is retained by using this amazingly simple trick.
The remainder of this chapter is divided into 5 more sections. Section 7.2 contains
a description of the trick and briefly gives a comparison of the proposed architec-
ture to other classes of well-known models. In Section 7.3, a function approximation
problem and a pattern classification problem are used as examples to demonstrate
the effectiveness of the trick. Afterwards, a well-known and challenging vowel clas-
sification problem is studied in greater detail. Section 7.4 theoretically justifies the
trick by showing the equivalence of the resulting architecture and RBFNs, while
Section 7.5 gives a more intuitive justification for the proposed trick by illustrat-
ing the types of surfaces and boundaries that can be formed by a single node with
auxiliary square inputs. Finally, in Section 7.6, I give my conclusions.
7.2 The Trick: A SQUARE-MLP
The proposed trick involves only a simple modification to the standard MLP
architecture: the input layer of an MLP is augmented with an extra set of inputs
that are coupled to the squares of the original inputs. This trick can be imple-
mented in at least two different ways. The first technique is to simply augment
a data set with the extra inputs. Thus, if one had a set with each input pattern
having d components, then a new data set can be made from the original that
has 2d inputs with the extra d inputs set equal to the squares of the original
inputs. Implementing the trick in this way is expensive from a memory point
of view but has the advantage that not a single line of new source code need
be written to try it out. Moreover, this allows the trick to be tried even on a
commercial simulator where one may not have access to source code.
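In code, this first (data-side) implementation is essentially a one-liner; a minimal sketch (X is a hypothetical pattern matrix of shape n_patterns × d):

```python
import numpy as np

def square_augment(X):
    """Append the squares of the original d inputs, giving 2d inputs per pattern."""
    X = np.asarray(X, dtype=float)
    return np.concatenate([X, X ** 2], axis=1)

# Example: a 2-input pattern [x1, x2] becomes [x1, x2, x1**2, x2**2].
print(square_augment(np.array([[3.0, -2.0]])))
```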
The second way to implement the trick is to explicitly code the actual changes into the architecture:
$$y = \sum_i w_i\, g\!\left(\sum_j u_{ij}\,x_j + \sum_k v_{ik}\,x_k^2 + a_i\right) + b, \qquad (7.1)$$
with $g(x) = \tanh(x)$ or $1/(1 + \exp(-x))$. I call such a network a SQuare Unit Aug-
mented, Radially Extended, MultiLayer Perceptron (SQUARE-MLP or SMLP).
The "square unit augmented" portion of the name comes from the newly added $v_{ik}\,x_k^2$ terms. The reason behind the "radially extended" portion of the name will
become clear in Sections 7.4 and 7.5. All experiments in Section 7.3 use the ar-
chitecture described by Equation 7.1. The history of this trick is rather difficult
to trace primarily because it is such a trivial trick; however, a brief list of some
related ideas is presented below.
Engineering and Statistics. Very early related ideas have been pursued in the
statistics community in the form of polynomial regression and Volterra filters in
the engineering community [24, 22]. However, in both of these related approaches
the model output is always linear in the polynomial terms, which is not the case
with the SMLP architecture or in the other neural architectures discussed below.
Sigma-Pi Networks. Some neural network architectures which are much more com-
plicated than Equation 7.1 have the SMLP as a special case. Perhaps the earliest
reference to a similar idea in the neural network literature can be traced back to
Sigma-Pi networks [20], which extend an MLP's linear net input function with a summation of products, $\sum_i w_{ji}\prod_k x_{ik}$. One could imagine a multilayer Sigma-Pi network that manages to compute the squares of the $x_{ik}$ terms prior to them being
passed through to a sigmoidal activation function. This would be a rather clumsy
way of calculating the squares of the inputs, but it is possible to do it, nonetheless.
Higher-Order Networks. Perhaps the closest example is the higher-order net-
work, proposed by Lee et al. [14], which is similar to Equation 7.1 but uses the
full quadratic net input function:
$$y = \sum_i w_i\, g\!\left(\sum_j u_{ij}\,x_j + \sum_k\sum_l v_{ikl}\,x_k x_l + a_i\right) + b. \qquad (7.2)$$
With $v_{ikl}$ set to zero when $k \neq l$, a SMLP is recovered. Thus, a SMLP is actually
a higher-order network with a diagonal quadratic term. Higher-order networks
have been shown to be very powerful extensions of MLPs. They can form both
local and global features but only at the cost of squaring the number of weights
for each hidden node. The memory requirements become an even greater issue
when more sophisticated optimization routines are applied to an architecture
such as Newton’s or quasi-Newton methods which require memory proportional
to the square of the number of weights in the network.
Functional Link Networks. Another related architecture is the functional-link
network [18], which is similar to a standard MLP but explicitly augments the
network with the results of scalar functions applied to the inputs. For example,
in some applications it may be known in advance that the desired output of
the network is a function of the sine and cosine of one or more inputs (e.g.,
the inputs may correspond to angles of a robot arm). In this case, one would
do well to include these values explicitly as inputs into the network instead of
forcing the network to learn a potentially difficult-to-model concept. Functional-
link networks may use any scalar function that, in the end, essentially performs a
type of preprocessing on the data. Usually, expert knowledge is used to determine
which extra scalar functions are to be incorporated into the network; that is,
there is no general technique for choosing the best preprocessor functions a
priori. However, given a set of nonlinear transformations on the input data one
can perform principal component analysis (PCA) on the nonlinear feature space
to select a subset that carries the most variance. A computationally feasible
version of this technique has been proposed in [23], which they refer to as kernel
PCA. In any event, by using the square function to augment a functional-link
network, the SMLP is once again recovered.
While I have tried to assign proper credit, it is generally accepted that the basic
idea of adding the squares of the inputs to a model is at least as old as the sage
advice “preprocessing is everything.”
7.3 Example Applications
The following three examples demonstrate problem domains in which an SMLP
can conceivably outperform an MLP or an RBFN. All of the examples are well-
known benchmarks. In each case, the output response of the models must form
local features while simultaneously spanning a large region of the input-space.
In general, the MLPs will have difficulty forming the local features, while the
RBFNs will have trouble spanning the flat regions of the input space.
7.3.1 Hill-Plateau Function Approximation
The first problem is an admittedly contrived example that was chosen precisely
because it is difficult for both MLPs and RBFNs. The “Hill-Plateau” surface [21],
displayed in Figure 7.1, has a single local bump on a sigmoidal ridge. Training
data for this problem consists of a two-dimensional uniform sampling on a 21×21
grid of the region shown in the figure while the testing data comes from a finer
41 ×41 grid.
Fig. 7.1. A hill-plateau exemplifies the differences between local and global architec-
tures. MLPs can easily form the plateau but have a hard time on the hill, while RBFNs
trivially form the hill but are troubled by the plateau.
Table 7.1. Best of twenty runs for the Hill-Plateau surface

Model   # of Nodes   # of Weights   Test RMSE
RBFN    2            9              0.333406
        3            13             0.071413
        4            17             0.042067
        5            21             0.002409
MLP     2            9              0.304800
        3            13             0.015820
        4            17             0.001201
SMLP    2            13             0.000025
Besides a standard MLP, a normalized RBFN (NRBFN) was used for this problem, which is described by the two equations:
$$y = \sum_i w_i\,\frac{r_i(\mathbf{x})}{\sum_j r_j(\mathbf{x})} + a \qquad\text{and} \qquad (7.3)$$
$$r_i(\mathbf{x}) = \exp\bigl(-\|\mathbf{x} - \mathbf{c}_i\|^2/\sigma_i^2\bigr), \qquad (7.4)$$
with $\mathbf{c}_i$ and $\sigma_i$ being the $i$th basis center and width, respectively. To train the NRBFNs, the centers were first clustered in the input-space of the training patterns with the k-means clustering algorithm. The width of each basis function was then set proportional to the distance to the nearest neighbor. Afterwards, the least-mean-square solution of the linear terms, $w_i$ and $a$, was solved for
-6
-4
-2
0
2
4
6
-8 -6 -4 -2 0 2 4 6 8
Fig. 7.2. Data for the Two-Spiral problem
All of this formed the initial set of weights for the quasi-Newton (BFGS)
optimization routine, which was then run on all weights simultaneously for up
to 200 epochs.
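For readers who want to reproduce this initialization, here is a minimal sketch of the procedure just described, assuming NumPy and scikit-learn's KMeans are available; the subsequent BFGS refinement of all weights (e.g., with scipy.optimize.minimize) is omitted for brevity.

    import numpy as np
    from sklearn.cluster import KMeans  # assumed available for the clustering step

    def init_nrbfn(X, y, n_basis):
        """Initialize a normalized RBFN (Eqs. 7.3-7.4): k-means centers,
        nearest-neighbor widths, pseudo-inverse solution of the linear terms."""
        centers = KMeans(n_clusters=n_basis, n_init=10).fit(X).cluster_centers_
        # width of each basis ~ distance to its nearest neighboring center
        d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        widths = d.min(axis=1)
        # normalized basis responses r_i(x) / sum_j r_j(x)
        r = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1) ** 2
                   / widths ** 2)
        R = r / r.sum(axis=1, keepdims=True)
        # exact least-squares solution for w_i and the bias a via the pseudo-inverse
        design = np.hstack([R, np.ones((len(X), 1))])
        coef = np.linalg.pinv(design) @ y
        return centers, widths, coef[:-1], coef[-1]

These solved values then serve as the starting point for the joint gradient-based refinement described above.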
For the MLPs, a single hidden layer with the tanh(x) activation function and
a linear output was used. All weights were initially set to uniform random values
in a −0.1 to 0.1 range and were then trained with the quasi-Newton method
(BFGS) for up to 200 epochs. The SMLPs were set up and trained exactly as the
MLPs.
Table 7.1 shows the results for all three architectures. As can be seen, both
the RBFNs and the MLPs have a fair amount of difficulty with this task, even
though the training data is noise free and the training procedures are fairly
sophisticated. In contrast, the SMLP manages to nail the surface with only
two hidden nodes. Moreover, the testing error is orders of magnitude better than
the best results from the other two architectures.
7.3.2 Two-Spirals Classification
The two-spirals classification problem is a well-known benchmark that is extremely
challenging for all neural network architectures; additionally, virtually no results
have been reported for RBFNs as the problem is such that local models would
have to memorize the training data with many basis functions (> 100) in order to
come even close to solving it. Thus, this is an example of a problem for which
RBFNs are not even viable candidates, which is why they are not considered further. Fig-
ure 7.2 shows the data for the two-spirals problem, which consists of 194 points
on the x-y-plane that belong to one of two spirals, each of which rotates around
the origin three times.
Previous results by other researchers for this problem have mostly focused
on traditional MLPs and MLPs with shortcut connections. The best reported
results for 100% classification accuracy are summarized below:
– (Lang & Witbrock [9]): 2-5-5-5-1 MLP with shortcut connections and 138
  total weights. Average convergence time of 20,000 batched backpropagation
  epochs.
– (Lang & Witbrock [9]): 2-5-5-5-1 MLP with shortcut connections, 138 total
  weights, and cross-entropy error function. Average convergence time of 12,000
  batched backpropagation epochs.
– (Lang & Witbrock [9]): 2-5-5-5-1 MLP with shortcut connections and 138
  total weights. Average of 7,900 quickprop [3] epochs.
– (Frostrom [unpublished]): 2-20-10-1 MLP with no shortcut connections and
  281 weights. Required 13,900 batched backpropagation with momentum epochs.
– (Fahlman and Lebiere [4]): Cascade-Correlation MLP using 12 to 19 hidden
  units (15 average) and an average of 1700 quickprop epochs. Because of
  the cascade correlation topology, these networks used between 117 and 250
  weights.
As these results show, the two-spiral problem is exceptionally difficult, requiring
both complicated network topologies and long training times.
Compared to the results above, the SMLP architecture seems to be very well-
suited to this problem. An SMLP with 15 hidden hyperbolic tangent units (for
a total of only 91 weights) was trained with a conjugate gradient optimization
routine. In ten out of ten trials, the SMLP solved the two-spirals problem with an
average of 2500 training epochs (but as few as 800). Notice that the architecture
for the SMLP, both topologically and in the number of weights, is much simpler
than those used in the studies with the MLPs. As a result, the optimization
algorithms can be much more efficient. This is a case of the representation power
of an architecture simplifying the learning, thereby making weight optimization
a faster process.
Although it was not always possible to consistently train a simpler SMLP
to solve this problem, an SMLP with 10 hidden nodes (and only 61 weights)
succeeded on three separate trials, taking an average of 1500 epochs. The output
response surface of this SMLP is shown in Figure 7.3. In Section 7.5 we will
examine the different types of surfaces that can be formed by a single SMLP
hidden node. We shall see that the SMLP's ability to easily form local and
global features is crucial to its ability to rapidly solve the two-spiral problem.
Fig. 7.3. SMLP reconstructed Two-Spiral surface from only ten hidden nodes
7.3.3 Vowel Classification
The Deterding vowel recognition data set [2] is another widely studied bench-
mark that is much more difficult than the two earlier problems and is also more
indicative of the type of problem that a neural network practitioner could be
faced with. The data consists of auditory features of steady state vowels spo-
ken by British English speakers. There are 528 training patterns and 462 test
patterns with each pattern consisting of 10 features and belonging to exactly
one of 11 classes that correspond to the spoken vowel. The speakers are of both
genders, making this a very interesting problem.
All results for this section use an architecture with 10 inputs, a varying number
of hidden units, and 11 outputs. Some results from previous studies are summa-
rized in Table 7.2. Some of the earlier studies are somewhat anecdotal in that
they used either a single experiment or only a few experiments but they are infor-
mative as they demonstrate what the sophisticated neural network practitioner
could expect to achieve on this problem with a wide range of architectures.
Interestingly, Robinson’s results show that a nearest neighbor classifier is very
difficult to beat for this problem. With a 56% correct classification rate, a nearest
neighbor classifier outperforms all of Robinson’s neural solutions. However, near-
est neighbor approaches require vast amounts of data to be stored as a look-up
table, so this is not a particularly encouraging result.
The best score, reported by Hastie and Tibshirani [7], was achieved with a
Discriminant Adaptive Nearest Neighbor (DANN) classifier. The score of 61.7%
was from the best classifier found in a number of simulation studies; hence, this
score represents the best known prior result as found by an expert attempting
multiple solutions.
Table 7.3 lists the results of the experiments done specifically for this work.
Five different model/optimization combinations are shown in the table, and each
row in the table corresponds to fifty separate trials started with different random
initial conditions. For the major entries labeled as “trained” the weights of the
model were determined by conjugate gradient for models with more than 1,000
weights and quasi-Newton’s method for models with fewer than 1,000 weights.
The optimization routines were set to minimize an error function of the form
E = e² + λ‖w‖², where e is the difference between the actual and desired outputs
and λ is a weight decay term that penalizes large weights. λ was set to 10⁻⁴
for all experiments. Values of λ equal to 10⁻³ and 10⁻⁵ consistently gave worse
results for all architectures, so 10⁻⁴ was a fair compromise.¹ All MLP and SMLP
architectures used a tanh(x) activation function, while the RBFN is the same
as described in Section 7.3.1 but is unnormalized (the normalized RBFN gave
consistently inferior results). The optimization routines were always allowed to
run until convergence (change in error measure less than 10⁻²⁰) unless otherwise
noted.
The weights in the RBFN and NRBFN models were “solved” with a three
step process: 1) set the centers to the cluster centers generated by the k-means
clustering algorithm applied to the input vectors of the training data; 2) set the
widths proportional to the distance to the nearest neighbor of each center; and
3) solve the remaining linear weights as a least mean square problem with a
matrix pseudo-inverse.
The SMLP architecture can be solved for in a manner similar to how the
RBFN and NRBFN networks are solved. The details of this procedure are cov-
ered in Section 7.4, but it suffices to say at this point that the procedure is nearly
identical with the exception that the weights corresponding to the centers and
the widths must be slightly transformed.
Interestingly, the SMLP can use this “solve” procedure to compute an initial
set of weights for an SMLP that is then trained with a gradient-based method.
This has the effect of predisposing the SMLP to a very good initial solution
that can be refined by the gradient-based optimization routines. The row in Ta-
ble 7.3 with the parenthetical label “hybrid” corresponds to SMLPs trained in this
way.
Three of the columns in Table 7.3 show three different ways of measuring
success for the various models, of which the only statistically significant measure
is the column labeled “% Correct (Average),” which is the average test set score
achieved after the optimization procedure halted on the training data. The scores
reported under the column heading “% Correct (Best)” correspond to the best
¹ Note that since the SMLP has an extra set of weights, care must be taken to control
the capacity and avoid over-fitting the data.
Table 7.2. Previous results on the vowel data as summarized in [19], [12], [5], [8], [6],
and [7]. All entries are either deterministic techniques or are the best scores reported,
unless the score appears with a “∗,” in which case the score represents an average over
multiple runs.

Model                             Number of   Number of   Percent
                                  Hidden      Weights     Correct
Single-Layer Perceptron           —           11          33
Multilayer Perceptron [19]        11          253         44
                                  22          495         45
                                  88          1,947       51
Multilayer Perceptron [12]        5           121         50.1
(with renormalization)            10          231         57.5
                                  20          451         50.6
Stochastic Network [5]            8           297         54
(FF-R classifier)                 16          473         56
                                  32          825         57.9
Radial Basis Function             88          1,936       48
                                  528         11,616      53
Gaussian Node Network             11          253         47
                                  22          495         54
                                  88          1,947       53
                                  528         11,627      55
Square Node Network               11          253         50
(not an SMLP)                     22          495         51
                                  88          1,947       55
Modified Kanerva Model            88          968         43
                                  528         5,808       50
Local Approximation               2           5,808       50.0
                                  3           5,808       52.8
                                  5           5,808       53.0
                                  10          5,808       48.3
                                  20          5,808       45.0
Nearest Neighbor                  —           (5,808)     56
Linear Discriminant Analysis      —           715         44
Softmax                           —           -?-         33
Quadratic Discriminant Analysis   —           -?-         47
CART                              —           -?-         44
CART (linear comb. splits)        —           -?-         46
FDA / BRUTO                       —           -?-         56
Softmax / BRUTO                   —           -?-         50
FDA / MARS (degree 1)             —           -?-         55
FDA / MARS (degree 2)             —           -?-         58
Softmax / MARS (degree 1)         —           -?-         52
Softmax / MARS (degree 2)         —           -?-         50
LOCOCODE / Backprop (30 inputs)   11          473         58
DANN                              —           -?-         61.7
Fig. 7.4. The results from Table 7.3 shown in a graphical format (percent correct
versus number of weights): (a) the average results shown for all model types; (b) the
average of the SMLP and MLP models with error bars shown; (c) the average of the
SMLP and RBFN models with error bars shown
Table 7.3. Results from this study: all averages are computed from 50 trials, with any
result less than 33% (the score of a perceptron) being discarded as non-convergent. See
the text for an explanation of the terms “Cheating,” “Best,” and “Average.”

Model       # of      # of      % Correct    % Correct   % Correct   Standard
            Hidden    Weights   (Cheating)   (Best)      (Average)   Deviation
MLP         11        253       53.24        51.08       44.53       3.34
(trained)   44        979       58.00        54.76       49.54       2.55
            88        1947      57.57        57.14       50.67       3.69
SMLP        11        363       63.63        60.82       53.25       5.02
(trained)   22        715       64.93        63.63       55.11       3.31
            33        1067      65.15        65.15       56.54       3.92
            44        1419      66.66        65.15       58.50       2.51
RBFN        11        253       —            56.92       52.11       2.64
(solved)    22        495       —            63.20       57.36       2.89
            33        737       —            66.88       59.03       3.45
            44        979       —            67.53       61.38       2.66
            66        1463      —            65.80       61.79       2.23
            88        1947      —            67.09       61.58       2.38
SMLP        11        363       —            58.87       54.14       2.79
(solved)    22        715       —            63.63       56.68       3.23
            33        1067      —            63.85       57.17       3.09
            44        1419      —            67.31       59.41       2.89
            66        2123      —            68.39       60.36       3.04
            88        2827      —            67.09       60.30       2.57
SMLP        11        363       66.66        63.85       57.19       2.98
(hybrid)    22        715       66.88        63.41       59.22       2.50
            33        1067      66.45        64.71       59.64       2.81
            44        1419      68.18        66.88       60.59       2.74
final test score achieved from the 50 runs, while the “% Correct (Cheating)” is the
best test score achieved at any time during the training by any of the models in
the fifty runs. Since the “solved” models have their weights computed in a single
step, the “cheating” score only has meaning for the iterative techniques. One
way of interpreting the “cheating” score is that it is the best score that could be
achieved if one had a perfect cross validation set to use for the purpose of early
stopping.
Some of the information in Table 7.3 is graphically summarized in Figure 7.4,
where it can be more readily appreciated. In every case the SMLPs and the
RBFNs outperform the MLPs by a statistically significant margin. However,
the difference between the SMLPs and the RBFNs is much narrower, with the
RBFNs and the hybrid SMLPs being nearly identical performance-wise. Also
note that the hybrid training scheme appears to offer some improvement over
both the trained and the solved SMLPs.
Fig. 7.5. A quasi-Gaussian activation function as an affine-transformed sigmoid
7.4 Theoretical Justification
Given the context of the numerical experiments from the previous section, we
are now ready to see how an SMLP can be thought of as a “radially extended”
version of an MLP. In this section, I will rewrite Equation 7.1 into a form that
is equivalent to an RBFN; thus, we will see how it is possible for an SMLP to
almost perfectly approximate an RBFN.
The first step is to more closely examine the sigmoidal activation function.
Let sigmoid(x) = 1/(1 + exp(−x)) and gauss(x) = exp(−x²). We can define a
quasi-Gaussian function as:

q(x) = 2 − 2/(gauss(x) + 1) = 2 − 2 sigmoid(x²).                  (7.5)
This means that a local kernel function can be formed from an affine transfor-
mation of a sigmoid whose input has been squared.
Figure 7.5 shows how the quasi-Gaussian function relates to the true Gaussian
function. Both functions are unimodal and exponentially decay in both direc-
tions. Moreover, a similar transformation can be applied to a hyperbolic tangent
function; hence, it really doesn't matter which of the two common sigmoid func-
tions is used, as either can be transformed into a basis function.
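A small numerical check of Equation 7.5 (assuming NumPy) makes the correspondence concrete; both curves peak at 1 at the origin and decay toward 0 away from it.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def quasi_gauss(x):
        # Eq. 7.5: an affine transform of a sigmoid whose input has been squared
        return 2.0 - 2.0 * sigmoid(x ** 2)

    x = np.linspace(-4, 4, 9)
    print(np.round(quasi_gauss(x), 3))    # quasi-Gaussian: unimodal, max of 1 at x = 0
    print(np.round(np.exp(-x ** 2), 3))   # true Gaussian for comparison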
Since a basis function has a center and a width, we want to be able to form
local features of arbitrary size at an arbitrary location. Typically, this means
that a basis function incorporates a distance measure such as Euclidean distance.
With a center denoted by c_i and a width proportional to σ_i, we can rewrite a
normalized Euclidean distance function as follows:

(1/σ_i²) ‖x − c_i‖² = (1/σ_i²) (x·x − 2 c_i·x + c_i·c_i)
                    = −(2/σ_i²) c_i·x + (1/σ_i²) x·x + (1/σ_i²) c_i·c_i
                    = Σ_j (−2 c_ij/σ_i²) x_j + Σ_k (1/σ_i²) x_k² + (1/σ_i²) c_i·c_i   (7.6)
Thus, the equation

2 − 2 sigmoid( Σ_j u_ij x_j + Σ_k v_ik x_k² + a_i ),              (7.7)

looks a lot like a radial basis function. By comparing Equation 7.6 to Equation 7.7
it is trivial to set the u_ij, v_ik, and a_i terms in such a way that a local “bump” is
placed at a specific location with a specific width. This means that a single hidden
node in an SMLP network can form a local feature in an input of any dimension.
By way of comparison, Lapedes and Farber [10, 11] similarly constructed local
features with standard MLPs. However, in a d-dimensional input space, one
would need an MLP with two hidden layers, 2d hidden nodes in the first hidden
layer, and another hidden node in the second hidden layer, just to form a single
local “bump”.
This simple analysis shows that local features are exceptionally easy to form in
an SMLP but are potentially very difficult to form in an MLP. As mentioned in
Section 7.3.3, it is possible to exploit the similarity between SMLPs and RBFNs
and “solve” the weights in an SMLP with a non-iterative procedure. The first
step is to choose a set of basis centers that can be determined by sub-sampling
or clustering the input-space of the training data. After the centers are chosen,
the nearest neighbor of each center with respect to the other centers can be
calculated. These distances can be used as the widths of the basis centers. Next,
the centers c_i and widths σ_i can be plugged into Equation 7.6 to determine the
values of the u_ij, v_ik, and a_i weights. Finally, the linear weights in the SMLP, w_i
and b from Equation 7.1, can be solved for exactly by using a matrix pseudo-
inverse procedure.
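The following is a minimal sketch of this “solve” procedure, assuming each SMLP hidden unit has the form sigmoid(Σ_j u_ij x_j + Σ_k v_ik x_k² + a_i) as in Equation 7.1; since the quasi-Gaussian 2 − 2 sigmoid(·) is just an affine transform of the sigmoid, the exact least-squares step on the raw hidden outputs absorbs that transform into w_i and b. The function and variable names are illustrative only.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def solve_smlp(X, y, centers, widths):
        """Non-iterative SMLP 'solve': map basis centers/widths to the u, v, a
        weights via Eq. 7.6, then fit the linear output weights exactly."""
        sig2 = widths ** 2
        U = -2.0 * centers / sig2[:, None]                      # u_ij = -2 c_ij / sigma_i^2
        V = np.repeat(1.0 / sig2[:, None], X.shape[1], axis=1)  # v_ik = 1 / sigma_i^2
        a = np.sum(centers ** 2, axis=1) / sig2                 # a_i  = c_i . c_i / sigma_i^2
        H = sigmoid(X @ U.T + (X ** 2) @ V.T + a)               # hidden unit outputs
        design = np.hstack([H, np.ones((len(X), 1))])
        coef = np.linalg.pinv(design) @ y                       # exact solution for w_i and b
        return U, V, a, coef[:-1], coef[-1]

The same solved weights can also serve as the starting point for the gradient-based refinement used by the “hybrid” scheme of Section 7.3.3.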
Thus, one can train an SMLP as one would an MLP or one could solve an
SMLP as one would an RBFN. It is also possible to combine the approaches and
let the solved weights be the initial weights for a training procedure. Using the
procedure to solve the weights can sometimes cut the computational overhead
for computing the weights by orders of magnitude compared to typical training
methods. Moreover, as was found in the numerical experiments in Section 7.3.3,
solutions found with this hybrid scheme may easily exceed the quality of solutions
found with more traditional approaches.
Fig. 7.6. Several different types of surfaces and decision boundaries that can be
formed by a single SMLP node: radial boundary, ridge boundary, hourglass boundary,
radial bump, smooth ridge, smooth saddle, step function, wedge boundary, ellipsoidal
boundary, smooth sigmoid, smooth wedge, and ellipsoidal bump
Fig. 7.7. Output response surfaces of the 10 hidden nodes in the SMLP network that
solved the two-spiral problem
7.5 Intuitive and Topological Justification
While the previous analysis shows that an SMLP can efficiently approximate
an RBFN, the transformation from an SMLP into an RBFN examined only a
single special case of the type of features that can be formed by an SMLP node.
In fact, a single SMLP node can form many other types of features besides
hyper-sigmoids and local bumps. To show that this is indeed the case, one only
needs to compose a two-dimensional diagonal quadratic form with a sigmoidal
function to see what types of surfaces are possible.
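A minimal sketch of such a composition (parameter values chosen purely for illustration) shows how a single node produces several of the surfaces in Figure 7.6.

    import numpy as np

    def smlp_node(x1, x2, u, v, a):
        """Response of a single SMLP hidden node with a diagonal quadratic input."""
        z = u[0] * x1 + u[1] * x2 + v[0] * x1 ** 2 + v[1] * x2 ** 2 + a
        return np.tanh(z)

    g = np.linspace(-4, 4, 81)
    X1, X2 = np.meshgrid(g, g)

    bump   = smlp_node(X1, X2, u=(0, 0), v=(-1, -1), a=2)   # radial bump
    ridge  = smlp_node(X1, X2, u=(0, 0), v=(-1, 0),  a=2)   # smooth ridge
    saddle = smlp_node(X1, X2, u=(0, 0), v=(1, -1),  a=0)   # smooth saddle
    slope  = smlp_node(X1, X2, u=(1, 0), v=(0, 0),   a=0)   # ordinary sigmoidal surface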
Figure 7.6 shows some familiar surfaces and decision boundaries that can be
formed with a single SMLP node. As expected, hyper-sigmoids and bumps can
be formed. What is interesting, however, is that there are many types of SMLP
features that are neither local nor global. For example, a ridge, wedge or saddle
may look like a local or global feature if it is projected onto one lower dimension;
however, whether this projection is local or global depends on the subspace in
which the projection is formed.
How important is it for an architecture to be able to form these features? The
very fact that these types of boundaries and surfaces have names means that
they are important enough that one may need a type of model that can efficiently
form them. However, if we reexamine the two-spirals problem from Section 7.3.2
it is possible to dissect the decision boundary formed by the SMLP (shown in
Figure 7.3) to see how the surface was formed. What is truly interesting is that
radial boundaries, wedges, and a sigmoid were all crucial to forming the entire
decision surface. If the SMLP lacked the ability to form any of these features,
then it is easily possible that the hidden node requirements for this problem
would explode.
There is, however, one caveat with the variety of features that can be formed
with a single SMLP node. The wedge, ridge, ellipse, and saddle structures shown
in Figure 7.6 must always be aligned with one of the input axes. In other words,
it is impossible to make a ridge that would run parallel to the line defined by
x = y. We can see that this is true by noting that (x − y)² has the term −2xy in
its expansion, which means that the quadratic form is non-diagonal. In general,
in order to have the ability to rotate all of the features in Figure 7.6, one would
need the full quadratic form, thus requiring a higher-order network instead of
the SMLP. Hence, while eliminating the off-diagonal terms in a higher-order
network saves a considerable number of weights, there is a cost in the types of
features that can be formed.
7.6 Conclusions
We have seen that there are problems whose solutions require features of both
local and global scope. MLPs excel at forming global features but have a difficult
time forming local features. RBFNs are exactly the opposite. The SMLP archi-
tecture can efficiently form both types of features with only a small penalty in
the number of weights. However, the increase in the number of weights is more
than compensated for by improvements in the network’s ability to form features.
This often results in simpler networks with fewer weights that can learn much
faster and approximate more accurately.
For the two main numerical studies in this work, it was found that the SMLP
architecture performed as well or better than the best known techniques for the
two-spirals problem and the Deterding vowel recognition data. Moreover, these
results are strengthened by the fact that the average performance of the SMLP
was often superior to the best known results for the other techniques.
It was also found that the dual nature of the SMLP can be exploited in the
form of hybrid algorithms. SMLPs can be “solved” like an RBFN, trained like an
MLP, or both. It is also noteworthy that the nonlinear weights in an SMLP are
only “moderately” nonlinear. For example, gradient-based learning algorithms
can be used on RBFNs but it is known that the high degree of nonlinearity found
in the weights which correspond to centers and widths can often make gradient
training very slow for RBFNs when the nonlinear weights are included. Similarly,
since MLPs require two hidden layers of nodes to form local features efficiently,
the first layer of weights in an MLP is exceedingly nonlinear because those weights
are eventually passed through two layers of nonlinear nodes. In contrast, the
nonlinear weights in an SMLP are only passed through a single layer of nonlinear
nodes. Although it is unproven at this time, it seems like a reasonable conjecture
that SMLP networks may be intrinsically better conditioned for gradient-based
learning of local “bumps” than MLPs with two hidden layers.
Acknowledgements. I thank Frans Coetzee, Chris Darken, Lee Giles, Jenny
Orr, Ray Watrous, and the anonymous reviewers for many helpful comments
and discussions.
References
[1] Casdagli, M.: Nonlinear prediction of chaotic time series. Physica D 35, 335–356
(1989)
[2] Deterding, D.H.: Speaker Normalisation for Automatic Speech Recognition. PhD
thesis, University of Cambridge (1989)
[3] Fahlman, S.E.: Faster-learning variations on back-propagation: An empirical
study. In: Proceedings of the 1988 Connectionist Models Summer School. Mor-
gan Kaufmann (1988)
[4] Fahlman, S.E., Lebiere, C.: The cascade-correlation learning architecture. In:
Touretzky, S. (ed.) Advances in Neural Information Processing Systems, vol. 2.
Morgan Kaufmann (1990)
[5] Finke, M., Müller, K.-R.: Estimating a-posteriori probabilities using stochastic
network models. In: Mozer, M., Smolensky, P., Touretzky, D.S., Elman, J.L.,
Weigend, A.S. (eds.) Proceedings of the 1993 Connectionist Models Summer
School, pp. 324–331. Erlbaum Associates, Hillsdale (1994)
[6] Hastie, T., Tibshirani, R.: Flexible discriminant analysis by optimal scoring. Tech-
nical report, AT&T Bell Labs, Murray Hill, New Jersey (1993)
[7] Hastie, T., Tibshirani, R.: Discriminant adaptive nearest neighbor classification.
IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6), 607–616
(1996)
[8] Hochreiter, S., Schmidhuber, J.: Lococode. Technical Report FKI-222-97, Fakultät
für Informatik, Technische Universität München (1997)
[9] Lang, K.J., Witbrock, M.J.: Learning to tell two spirals apart. In: Proceedings of
the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Francisco
(1988)
[10] Lapedes, A., Farber, R.: Nonlinear signal processing using neural networks: Pre-
diction and system modelling. Technical Report LA-UR-87-2662, Los Alamos Na-
tional Laboratory, Los Alamos, NM (1987)
[11] Lapedes, A., Farber, R.: How neural nets work. In: Anderson, D.Z. (ed.) Neural
Information Processing Systems, pp. 442–456. American Institute of Physics,
New York (1988)
[12] Lawrence, S., Tsoi, A.C., Back, A.D.: Function approximation with neural net-
works and local methods: Bias, variance and smoothness. In: Bartlett, P., Burkitt,
A., Williamson, R. (eds.) Australian Conference on Neural Networks, pp. 16–21.
Australian National University (1996)
[13] Lee, S., Kil, R.M.: Multilayer feedforward potential function networks. In: IEEE
international Conference on Neural Networks, pp. 1:161–1:171. SOS Printing, San
Diego (1988)
[14] Lee, Y.C., Doolen, G., Chen, H.H., Sun, G.Z., Maxwell, T., Lee, H.Y., Giles,
C.L.: Machine learning using higher order correlation networks. Physica D 22-D,
276–306 (1986)
[15] Moody, J., Darken, C.: Learning with localized receptive fields. In: Touretsky, D.,
Hinton, G., Sejnowski, T. (eds.) Proceedings of the 1988 Connectionist Models
Summer School, Morgan Kaufmann (1988)
[16] Moody, J., Darken, C.: Fast learning in networks of locally-tuned processing units.
Neural Computation 1, 281–294 (1989)
[17] Niranjan, M., Fallside, F.: Neural networks and radial basis functions in classifying
static speech patterns. Computer Speech and Language 4, 275–289 (1990)
[18] Pao, Y.H.: Adaptive Pattern Recognition and Neural Networks. Addison-Wesley
Publishing Company, Inc., Reading (1989)
[19] Robinson, A.J.: Dynamic Error Propagation Networks. PhD thesis, Cambridge
University (1989)
[20] Rumelhart, D.E., McClelland, J.L.: the PDP Research Group. In: Parallel Dis-
tributed Processing: Explorations in the Microstructure of Cognition, vol. 2. MIT
Press (1986)
[21] Sarle, W.: The comp.ai.neural-nets Frequently Asked Questions List (1997)
[22] Schetzen, M.: The Volterra and Wiener Theories of Nonlinear Systems. John Wiley
and Sons, New York (1980)
[23] Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Technical report, Max-Planck-Institut für biologische Kyber-
netik, 1996. Neural Computation 10(5), 1299–1319 (1998)
[24] Volterra, V.: Theory of Functionals and of Integro-differential Equations. Dover
(1959)
[25] Werbos, P.: Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences. PhD thesis, Harvard University (1974)
8
A Dozen Tricks with Multitask Learning
Rich Caruana
Just Research and Carnegie Mellon University,
4616 Henry Street, Pittsburgh, PA 15213
caruana@cs.cmu.edu
http://www.cs.cmu.edu/~caruana/
Abstract. Multitask Learning is an inductive transfer method that im-
proves generalization accuracy on a main task by using the information
contained in the training signals of other related tasks. It does this by
learning the extra tasks in parallel with the main task while using a
shared representation; what is learned for each task can help other tasks
be learned better. This chapter describes a dozen opportunities for ap-
plying multitask learning in real problems. At the end of the chapter we
also make several suggestions for how to get the most out of multitask
learning on real-world problems.
When tackling real problems, one often encounters valuable information that
is not easily incorporated in the learning process. This chapter shows a dozen
ways to benefit from the information that often gets ignored. The basic trick is
to create extra tasks that get trained on the same net with the main task. This
Multitask Learning is a form of inductive transfer¹ that improves performance
on the main task by using the information contained in the training signals of
other related tasks. It does this by learning the main task in parallel with the
extra tasks while using a shared representation; what is learned for each task
can help other tasks be learned better.
We use the term “task” to refer to a function that will be learned from a
training set. We call the important task that we wish to learn better the main
task. Other tasks whose training signals will be used by multitask learning to
learn the main task better are the extra tasks. Often, we do not care how well
extra tasks are learned. Their sole purpose is to help the main task be learned
better. We call the union of the main task and the extra tasks a domain. Here
we restrict ourselves to domains where the tasks are defined on a common set of
input features, though some of the extra tasks may be functions of only a subset
of these input features.
This chapter shows that most real-world domains present a number of op-
portunities for multitask learning (MTL). Because these opportunities are not
¹ Inductive transfer is the process of transferring anything learned for one problem to
help learning of other related problems.
always obvious, most of the chapter is dedicated to showing different ways useful
extra tasks arise in real problems. We demonstrate several of these opportunities
using real data. The chapter ends with a few suggestions that help you get the
most out of multitask learning. Some of these suggestions are so important that
if you don’t follow them, MTL can easily hurt performance on the main task
instead of helping it.
8.1 Introduction to Multitask Learning in Backprop Nets
Consider the following boolean functions defined on eight bits, B_1···B_8:

Task 1 =  B_1 ∨ Parity(B_2···B_6)
Task 2 = ¬B_1 ∨ Parity(B_2···B_6)
Task 3 =  B_1 ∧ Parity(B_2···B_6)
Task 4 = ¬B_1 ∧ Parity(B_2···B_6)

where B_i represents the ith bit, “¬” is logical negation, “∨” is disjunction, “∧”
is conjunction, and Parity(B_2···B_6) is the parity of bits 2–6. Bits B_7 and B_8
are not used by the functions. These four tasks are related in several ways:
– they are all defined on the same inputs, bits B_1···B_8;
– they all ignore the same inputs, bits B_7 and B_8;
– each uses a common computed subfeature, Parity(B_2···B_6);
– when B_1 = 0, Task 1 needs Parity(B_2···B_6), but Task 2 does not, and vice
  versa;
– as with Tasks 1 and 2, when Task 3 needs Parity(B_2···B_6), Task 4 does
  not need it, and vice versa.
We can train artificial neural nets on these tasks with backprop. Bits B_1···B_8
are the inputs to the net. The task values computed by the four functions are the
are the inputs to the net. The task values computed by the four functions are the
target outputs. We create a data set by enumerating all 256 combinations of the
eight input bits, and computing for each setting of the bits the task signals for
Tasks 1, 2, 3, and 4 using the definitions above. This yields 256 different cases,
with four different training signals for each case.
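A minimal sketch of this data-set construction, assuming the task definitions reconstructed above (disjunction for Tasks 1–2, conjunction for Tasks 3–4), is:

    import itertools
    import numpy as np

    def parity(bits):
        return sum(bits) % 2

    # Enumerate all 256 settings of the eight input bits and compute the four
    # task signals (bits[0] is B1, bits[1:6] are B2..B6; B7 and B8 are unused).
    X, Y = [], []
    for bits in itertools.product([0, 1], repeat=8):
        b1, p = bits[0], parity(bits[1:6])
        Y.append([b1 or p,          # Task 1
                  (1 - b1) or p,    # Task 2
                  b1 and p,         # Task 3
                  (1 - b1) and p])  # Task 4
        X.append(bits)

    X, Y = np.array(X), np.array(Y)
    print(X.shape, Y.shape)   # (256, 8) (256, 4)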
8.1.1 Single and Multitask Learning of Task 1
Consider Task 1 the main task. Tasks 2, 3, and 4 are the extra tasks. That is,
we are interested only in improving the accuracy of models trained for Task 1.
We’ve done an experiment where we train Task 1 on the three nets shown in
Figure 8.1. All the nets are fully connected feed-forward nets with 8 inputs, 100
hidden units, and 1–4 outputs. Where there are multiple outputs, each output
is fully connected to the hidden units. Nets were trained in batch mode using
backprop with MITRE’s Aspirin/MIGRAINES 6.0 with learning rate = 0.1 and
momentum = 0.9.
Task 1 is trained alone on the net on the left of Figure 8.1. This is a backprop
net trained on a single task. We refer to this as single task learning (STL) or
single task backprop (STL-backprop). The net in the center of Figure 8.1 trains
Task 1 on a net that is also trained on Task 2. The hidden layer of this net is
shared by Tasks 1 and 2. This is multitask backprop (MTL-backprop) with two
tasks. The net on the right side of Figure 8.1 trains Task 1 with Tasks 2, 3, and
4. The hidden layer of this net is shared by all four tasks. This is MTL-backprop
with four tasks. How well will Task 1 be learned by the different nets?
Fig. 8.1. Three Neural Net Architectures for Learning Task 1
We performed 25 independent trials by resampling training and test sets from
the 256 cases. From the 256 cases, we randomly sample 128 cases for the training
set, and use the remaining 128 cases as a test set. (For now we ignore the com-
plexity of early stopping, which can be tricky with MTL nets. See section 8.3.2
for a thorough discussion of early stopping in MTL nets.)
For each trial, we trained three nets: an STL net for Task 1, an MTL net for
Tasks 1 and 2, and an MTL net for Tasks 1–4. We measure performance only
on the output for Task 1. When there are extra outputs for Task 2 or Tasks 2–4,
these are trained with backprop, but ignored when the net is evaluated. The sole
purpose of the extra outputs is to affect what is learned in the hidden layer these
outputs share with Task 1.
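The following is a minimal NumPy sketch of such a shared-hidden-layer MTL net — not the Aspirin/MIGRAINES setup actually used in these experiments — showing how the gradients from all outputs add at the shared hidden layer while only the Task 1 output need be evaluated.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_mtl(X, Y, n_hidden=100, lr=0.1, momentum=0.9, epochs=5000):
        """Batch backprop on a net with one shared hidden layer and one sigmoid
        output per task; extra outputs only shape the shared hidden layer."""
        n_in, n_out = X.shape[1], Y.shape[1]
        W1 = rng.normal(0, 0.1, (n_in + 1, n_hidden))    # input (+bias) -> hidden
        W2 = rng.normal(0, 0.1, (n_hidden + 1, n_out))   # hidden (+bias) -> outputs
        V1, V2 = np.zeros_like(W1), np.zeros_like(W2)
        Xb = np.hstack([X, np.ones((len(X), 1))])
        for _ in range(epochs):
            H = sigmoid(Xb @ W1)
            Hb = np.hstack([H, np.ones((len(X), 1))])
            O = sigmoid(Hb @ W2)
            dO = (O - Y) * O * (1 - O)               # squared-error output deltas
            dH = (dO @ W2[:-1].T) * H * (1 - H)      # deltas from all tasks add here
            V2 = momentum * V2 - lr * Hb.T @ dO / len(X)
            V1 = momentum * V1 - lr * Xb.T @ dH / len(X)
            W2 += V2
            W1 += V1
        return W1, W2

Evaluating only column 0 of the net's output (Task 1) gives the STL-comparable test score, while columns 1–3 exist solely to influence the shared hidden layer during training.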
8.1.2 Results
Every 5000 epochs we evaluated the performance of the nets on the test set. We
measured the RMS error of the output with respect to the target values, the
criterion being optimized by backprop. We also measured the accuracy of the
output in predicting the boolean function values. If the net output is less than
0.5, it was treated as a prediction of 0, otherwise it was treated as a prediction
of 1.
Figure 8.2 shows the RMSE for Task 1 on the test set during training. The
three curves in the graph are each the average of 25 trials.² RMSE on the main
task, Task 1, is reduced when Task 1 is trained on a net simultaneously trained
on other related tasks. RMSE is reduced when Task 1 is trained with extra Task
2, and is further reduced when extra Tasks 3 and 4 are added. Training multiple
tasks on one net does not increase the number of training patterns seen by the
net. Each net sees exactly the same training cases. The MTL nets do not see
more training cases; they receive more training signals with each case.
Fig. 8.2. RMSE Test-set Performance of Three Different Nets on Task 1
Figure 8.3 shows the average test-set accuracy on Task 1 for 25 trials with
the three different nets. Task 1 has boolean value 1 about 75% of the time. A
simple learner that learned to predict 1 all the time should achieve about 75%
accuracy on Task 1. When trained alone (STL), performance on Task 1 is about
80%. When Task 1 is trained with Task 2, performance increases to about 88%.
When Task1 is trained with Tasks 2, 3, and 4, performance increases further to
about 90%. Table 8.1 summarizes the results of examining the training curve
from each trial.
² Average training curves can be misleading, particularly if training curves are not
monotonic. For example, it is possible for method A to always achieve better error
than method B, but for the average of method A to be everywhere worse than the
average of method B, because the regions where performance on method A is best
do not align, but do align for method B. Before presenting average training curves,
we always examine the individual curves to make sure the average curve is not
misleading.
Fig. 8.3. Test-set Percent Correct of Three Different Nets on Task 1
Table 8.1. Test-set performance on Task 1 of STL of Task 1, MTL of Tasks 1 and 2,
and MTL of Tasks 1, 2, 3, and 4. *** indicates performance is statistically better than
STL at .001 or better.

NET                       STL: 1   MTL: 1+2    MTL: 1+2+3+4
Root-Mean-Squared-Error   0.211    0.134 ***   0.122 ***
Percent Correct           79.7%    87.5% ***   88.9% ***
8.1.3 Discussion
Why is the main task learned better if it is trained on a net learning other
related tasks at the same time? We ran a number of experiments to verify that
the performance increase with MTL is due to the fact that the tasks are related,
and not just a side effect of training multiple outputs on one net.
Adding noise to neural nets sometimes improves their generalization perfor-
mance [22]. To the extent that MTL tasks are uncorrelated, their contribution
to the aggregate gradient may appear as noise to other tasks and this might
improve generalization. To see if this explains the benefits we see from MTL, in
one experiment we train Task 1 on a net with three random tasks.
A second effect to be concerned about is that adding tasks tends to increase the
effective learning rate on the input-to-hidden layer weights because the gradients
from the multiple outputs add at the hidden layer, and this might favor nets with
multiple outputs. To test this, we train an MTL net with four copies of Task
1. Each of the four outputs receives exactly the same training signal. This is a
degenerate form of MTL where no extra information is given to the net by the
extra tasks.
A third effect that needs to be ruled out is net capacity. 100 hidden units is a
lot for these tasks. Does the MTL net, which has to share the 100 hidden units
among four tasks, generalize better because each task has fewer hidden units?
To test for this, we train Task 1 on STL nets with 200 hidden units and with
25 hidden units. This will tell us if generalization would be better with more or
less capacity.
Finally, we ran a fourth experiment based on the heuristic used in [37]. We
shuffle the training signals (the target output values) for Tasks 2, 3, and 4 before
training an MTL net on the four tasks. Shuffling reassigns the target values to
the input vectors in the training set for Tasks 2, 3, and 4. The main task, Task
1, is not affected. The distributions of the training signals for outputs 2–4 have
not changed, but the training signals are no longer related to Task 1. This is a
powerful test that has the potential to rule-out many mechanisms that do not
depend on relationships between the tasks.
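A minimal sketch of this shuffle test (the helper is hypothetical; the column indices assume the targets are stored as a 256×4 array with Task 1 in column 0):

    import numpy as np

    rng = np.random.default_rng(0)

    def shuffle_extra_tasks(Y, extra_cols=(1, 2, 3)):
        """Permute the training signals of the extra tasks: their marginal
        distributions are unchanged, but their relation to Task 1 is broken."""
        Y = Y.copy()
        for c in extra_cols:
            Y[:, c] = rng.permutation(Y[:, c])
        return Y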
We ran each experiment 25 times using exactly the same data sets used in
the previous section. Figure 8.4 shows the generalization performance on Task
1 in the four experiments. For comparison, the performance of STL, MTL
with Tasks 1 and 2, and MTL with Tasks 1–4 from the previous section are also
shown.
When Task 1 is trained with random extra tasks, performance on Task 1 drops
below the performance on Task 1 when it is trained alone on an STL net. We
conclude MTL of Tasks 1–4 probably does not learn Task 1 better by adding
noise to the learning process through the extra outputs.
When Task 1 is trained with three additional copies of Task 1, the performance
is comparable to that when Task 1 is trained alone with STL.³ We conclude
that MTL does not learn Task 1 better just because backprop works better with
multiple outputs.
When Task 1 is trained on an STL net with 25 hidden units, performance
is comparable to the performance with 100 hidden units. Moreover, when Task
1 is trained on an STL net with 200 hidden units, it is slightly better. (The
differences between STL with 25, 100, and 200 hidden units are not statistically
significant.) We conclude that performance on Task 1 is relatively insensitive
to net size for nets between 25 and 200 hidden units, and, if anything, Task 1
would benefit from a net with more capacity, not one with less capacity. Thus
it is unlikely that MTL on Tasks 1–4 performs better on Task 1 because Tasks
2–4 are using up extra capacity that is hurting Task 1.
³ We sometimes observe that training multiple copies of a task on one net does improve
performance. When we have observed this, the benefit is never large enough to
explain away the benefits observed with MTL. But it is interesting and surprising,
as the improvement is gained without any additional information being given to
the net. The most likely explanation is that the multiple connections to the hidden
layer allow different hidden layer predictions to be averaged and thus act as a weak
boosting mechanism.
Fig. 8.4. RMSE test-set performance of Task 1 when trained with: MTL with three
random tasks; MTL with three more copies of Task 1; MTL with shuffled training
signals for Tasks 2–4; STL on nets with 25 or 200 hidden units.
When Task 1 is trained with training signals for Tasks 2–4 that have been
shuffled, the performance of MTL drops below the performance of Task 1 trained
alone on an STL net. Clearly the benefit we see with MTL on these problems is
not due to some accident caused by the distribution of the extra outputs. The
extra outputs must be related to the main task to help it.
These experiments rule out most explanations for why MTL outperforms STL
on Task 1 that do not require Tasks 2–4 be related to Task 1. So why is the main
task learned better when trained in parallel with Tasks 2–4?
One reason is that Task 1 needs to learn the subfeature Parity(B_2···B_6)
that it shares with Tasks 2–4. Tasks 2–4 give the net information about this
subfeature that it would not get from Task 1 alone. For example, when B_1 = 1,
the training signal for Task 1 contains no information about Parity(B_2···B_6).
We say B_1 masks Parity(B_2···B_6) when B_1 = 1. But the training signals for
Task 2 provide information about the Parity subfeature in exactly those cases
where Task 1 is masked. Thus the hidden layer in a net trained on both Tasks
1 and 2 gets twice as much information about the Parity subfeature as a net
trained on one of these tasks, despite the fact that they see exactly the same
training cases. The MTL net is getting more information with each training case.
Another reason why MTL helps Task 1 is that all the tasks are functions of
the same inputs, bits B_1···B_6, and ignore the same inputs, B_7 and B_8. Because
the tasks overlap on the features they use and don't use, the MTL net is better
able to select which input features to use.
A third reason why MTL helps Task 1 is that there are relationships between
the way the different tasks use the inputs that promote learning good internal
representations. For example, all the tasks logically combine input B_1 with a
function of inputs B_2···B_6. This similarity tends to prevent the net from learning
internal representations that, for example, directly combine bits B_1 and B_2. A
net trained on all the tasks together is biased to learn more modular, in this case
more correct, internal representations that support the multiple tasks. This bias
towards modular internal representations reduces the net's tendency to learn
spurious correlations that occur in any finite training sample: there may be a
random correlation between bit B_3 and the output for Task 1 that looks fairly
strong in this one training set, but if that spurious correlation does not also help
the other tasks, it is less likely to be learned.
8.2 Tricks for Using Multitask Learning in the Real
World
The previous section introduced multitask learning (MTL) in backprop nets us-
ing four tasks carefully designed to have relationships that make learning them
in parallel work better than learning them in isolation. How often will real prob-
lems present extra tasks that allow multitask learning to improve performance
on the main task?
This section shows that many real world problems yield opportunities for mul-
titask learning. We present a dozen prototypical real-world applications where
the training signals for related extra tasks are available and can be leveraged.
We believe most real-world problems fall into one or more of these prototypical
classes. This claim might sound surprising given that few of the test problems
traditionally used in machine learning are multitask problems. We believe most
of the problems used in machine learning so far have been heavily preprocessed
to fit the single task learning mold. Most of the opportunities for MTL in these
problems were eliminated as the problems were defined.
8.2.1 Using the Future to Predict the Present
Often valuable features become available after predictions must be made. If
learning is done offline, these features can be collected for the training set and
used for learning. These features can’t be used as inputs, because they will not
be available when making predictions for future test cases. They can, however,
be used as extra outputs for multitask learning. The predictions the learner
makes for these extra tasks will be ignored when the system is used to make
predictions for the main task. Their sole function is to provide extra information
to the learner during training so that it can learn the main task better.
One source of applications of learning from the future is sequential decision
making in medicine. Given the initial symptoms, decisions are made about what
tests to make and what treatment to begin. New information becomes available
when the tests are completed and as the patient responds (or fails to respond) to
the treatment. From this new information, new decisions are made. Should more
tests be made? Should the treatment be changed? Has the patient’s condition
changed? Is this patient now high risk, or low risk? Does the patient need to be
hospitalized? Etc.
When machine learning is applied to early stages in the decision making pro-
cess, only those input features that typically would be available for patients at
this stage of the process are usually used. This is unfortunate. In an historical
database, all of the patients may have run the full course of medical testing and
treatment and their final outcome may be known. Must we ignore the results of
lab tests and other valuable features in the database just because these will not
be available for patients at the stage of medical decision making for which we
wish to learn a model?
The Pneumonia Risk Prediction Problem. Consider pneumonia. There
are 3,000,000 cases of pneumonia each year in the U.S., 900,000 of which get
hospitalized. Most pneumonia patients recover given appropriate treatment, and
many can be treated effectively without hospitalization. Nonetheless, pneumonia
is serious: 100,000 of those hospitalized for pneumonia die from it, and many
more are at elevated risk if not hospitalized.
Consider the problem of predicting a patient’s risk from pneumonia before
they are hospitalized. (The problem is not to diagnose if the patient has pneu-
monia, but to determine how much risk the pneumonia poses to the patient.) A
primary goal in medical decision making is to accurately, swiftly, and econom-
ically identify patients at high risk from diseases like pneumonia so that they
may be hospitalized to receive aggressive testing and treatment; patients at low
risk may be more comfortably, safely, and economically treated at home.
Some of the most useful features for assessing risk are the lab tests that
become available only after a patient is hospitalized. It is the extra lab tests
made after patients are admitted to the hospital that we use as extra tasks for
MTL; they cannot be used as inputs because they will not be available for most
future patients when making the decision to hospitalize.⁴
The most useful decision aid for this problem would be to predict which pa-
tients will live or die. This is too difficult. In practice, the best that can be
achieved is to estimate a probability of death (POD) from the observed symp-
toms. In fact, it is sufficient to learn to order patients by POD so lower-risk
patients can be discriminated from higher risk patients; patients at least risk
may then be considered for outpatient care.
⁴ Other researchers who tackled this problem ignored the lab tests because they
knew they would not be available at run time and did not see ways to use them
other than as inputs.
The performance criteria used by others working with this database [15] is the
accuracy with which one can select prespecified fractions of a patient population
who will live. For example, given a population of 10,000 patients, find the 20%
of this population at least risk. To do this we learn a risk model, and a threshold
for this risk model, that allows 20% of the population (2000 patients) to fall
below it. If 30 of the 2000 patients below this threshold die, the error rate is
30/2000 = 0.015. We say that the error rate for FOP 0.20 is 0.015 (FOP stands
for “fraction of population”). Here we consider FOPs 0.1, 0.2, 0.3, 0.4, and 0.5.
Our goal is to learn models and thresholds such that the error rate at each FOP
is minimized.
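A minimal sketch of this evaluation metric, assuming the predicted risk scores and 0/1 mortality outcomes are stored in NumPy arrays:

    import numpy as np

    def fop_error_rate(risk_scores, died, fop=0.2):
        """Error rate at a fraction of population (FOP): take the fraction `fop`
        of patients predicted to be at least risk and report the fraction of
        them who actually died."""
        order = np.argsort(risk_scores)          # lowest predicted risk first
        n = int(round(fop * len(risk_scores)))
        return float(np.mean(died[order[:n]]))

    # e.g., 30 deaths among the 2,000 lowest-risk of 10,000 patients -> 0.015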
Multitask Learning and Pneumonia Risk Prediction. The straightfor-
ward approach to this problem is to use backprop to train an STL net to learn
to predict which patients live or die, and then use the real-valued predictions
of this net to sort patients by risk. This STL net has 30 inputs for the 30 ba-
sic pre-hospitalization measurements, a single hidden layer, and a single output
trained with targets 0=lived, 1=died.⁵ Given a large training set, a net trained
this way should learn to predict the probability of death for each patient, not
which patients live or die. If the training sample is small, the net will overfit
and learn a very nonlinear function that outputs values near 0/1 for cases in
the training set, but which does not generalize well. It is critical to use early
stopping to halt training before this happens.
We developed a method called Rankprop specifically for this domain. Rankprop
learns to rank patients without learning to predict mortality (0=lived,1=died).
Figure 8.5 compares the performance of squared error on 0/1 targets with
rankprop on this problem. Rankprop outperforms traditional backprop using
squared error on targets 0=lived,1=died by 10%-40% on this domain, depend-
ing on which FOP is used for comparison. See [9] for details about rankprop.⁶
There are 35 future lab values that we use as extra backprop outputs, as shown
in Figure 8.6. The expectation is that these extra outputs will bias the shared
hidden layer toward representations that better capture important features of
each patient’s condition, and that this will lead to more accurate predictions of
patient risk at the main task output.
The STL net has 30 inputs, 8 hidden units, and one output trained to predict
risk with rankprop. The MTL net has the same 30 inputs, 64 hidden units, one
output for rankprop, and 35 extra outputs trained with squared error. (Prelim-
inary experiments suggested 8–32 hidden units was optimal for STL, and that
MTL performs best with nets as large as 512 hidden units. We used 8 and 64
hidden units so that we could run many experiments.) The 35 extra outputs on
⁵ We tried both squared error and cross entropy. The difference between the two was
small. Squared error performed slightly better.
⁶ We use rankprop for the rest of our experiments on this domain because it is the
best performer we know of on this problem. We want to see if MTL can make the
best method better.
Fig. 8.5. The performance of SSE (0/1 targets) and rankprop on the 5 FOPs in the
pneumonia domain. Lower error indicates better performance.
the MTL net (see Figure 8.6) are trained at the same time the net is trained to
predict risk.
We train the net using training and validation sets containing 1000 patients
randomly drawn from the database. Training is halted on both the STL and
MTL nets when overfitting is observed on the main rankprop risk task. On the
MTL net, the performance of the extra tasks is not taken into account for early
stopping. Only the performance of the output for the main task is considered
when deciding when to stop training. (See section 8.3.2 for more discussion of
early stopping with MTL nets.) Once training is halted, the net is tested on the
remaining unused patients in the database.
Results. Table 8.2 shows the mean performance of ten runs of rankprop using
STL and MTL. The bottom row shows the percent improvement in performance
obtained on this problem by using the future lab measurements as extra MTL
outputs. Negative percentages indicate MTL reduces error. Although MTL low-
ers the error at each FOP compared with STL, only the differences at FOP 0.3,
0.4, and 0.5 are statistically significant with ten trials using a standard t-test.
The improvement from MTL is 5–10%. This improvement can be of consid-
erable consequence in medical domains. To verify that the benefit from MTL is
due to relationships between what is learned for the future labs and the main
task, we ran the shuffle test (see section 8.1.3). We shuffled the training signals
for the extra tasks in the training sets before training the nets with MTL.
Fig. 8.6. Using future lab results as extra outputs to bias learning for the main risk
prediction task. The lab tests would help most if they could be used as inputs, but
will not yet have been measured when risk must be predicted, so we use them as extra
outputs for MTL instead.
Table 8.2. Error Rates (fraction deaths) for STL with Rankprop and MTL with
Rankprop on Fractions of the Population predicted to be at low risk (FOP) between
0.0 and 0.5. MTL makes 5–10% fewer errors than STL.
FOP 0.1 0.2 0.3 0.4 0.5
STL Rankprop .0083 .0144 .0210 .0289 .0386
MTL Rankprop .0074 .0127 .0197 .0269 .0364
% Change -10.8% -11.8% -6.2% * -6.9% * -5.7% *
Figure 8.7 shows the results of MTL with shuffled training signals for the
extra tasks. The results of STL and MTL with unshuffled extra tasks are also
shown. Shuffling the training signals for the extra tasks reduces the performance
of MTL below that of STL. We conclude that it is the relationship between the
main task and the extra tasks that lets MTL perform better on the main task;
the benefit disappears when these relationships are broken by shuffling the extra
task signals.
We have also run experiments where we use the future lab tests as inputs to
a net trained to predict risk, and impute the values for the lab tests when they
are missing on future test cases. Imputing missing values for the lab tests did
not yield performance comparable to MTL on this problem. Similar experiments
with feature nets [17] also failed to yield improvements comparable to MTL.
Fig. 8.7. Performance of STL, MTL, and MTL with shuffled extra task signals on
pneumonia risk prediction at the five FOPs
Future measurements are available in many offline learning problems. As just
one very different example, a robot or autonomous vehicle can more accurately
measure the size, location, and identity of objects if it passes nearer them in
the future. For example, road stripes and the edge of the road can be detected
reliably as a vehicle passes alongside them, but detecting them far ahead of the
vehicle is hard. Since driving brings future road closer to the car, stripes and
road borders can be measured accurately as the car passes them. Dead reckoning
allows these future measurements to be added to the training set. They can't be
used as inputs, since they won't be available in time while driving. But they can be
used to augment a training set. We suspect that using future measurements as
extra outputs will be a frequent source of extra tasks in real problems.
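As a concrete illustration, the following minimal NumPy sketch (not code from this chapter; the data, layer sizes, and the down-weighting factor lam are illustrative assumptions) trains a single shared hidden layer with one main output plus extra outputs for the future measurements, and reads only the main output at prediction time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_extra = 30, 64, 5          # inputs, shared hidden units, extra tasks
X = rng.normal(size=(200, n_in))             # stand-in training inputs
y_main = rng.normal(size=(200, 1))           # main task target (e.g. risk score)
Y_extra = rng.normal(size=(200, n_extra))    # future measurements, known only for training cases

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, 1 + n_extra))  # one output head per task

def forward(X):
    H = np.tanh(X @ W1)                      # shared hidden representation
    return H, H @ W2                         # column 0 = main output, rest = extra outputs

lr, lam = 0.01, 0.2                          # lam down-weights the extra tasks
for epoch in range(200):
    H, out = forward(X)
    err = out - np.hstack([y_main, Y_extra])
    err[:, 1:] *= lam                        # extra outputs only bias the shared hidden layer
    dW2 = H.T @ err / len(X)
    dH = err @ W2.T * (1 - H**2)             # backprop through tanh
    dW1 = X.T @ dH / len(X)
    W2 -= lr * dW2
    W1 -= lr * dW1

# At prediction time only the main output is read; the extra heads are discarded.
risk = forward(X)[1][:, 0]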
8.2.2 Multiple Metrics
Sometimes it is hard to capture everything that is important in one error metric.
When alternate metrics capture different, but useful, aspects of a problem, MTL
can be used to benefit from the multiple metrics. One example of this is the
pneumonia problem in the previous section. Rankprop outperforms backprop
using traditional squared error on this problem, but has trouble learning to
rank cases at such low risk that virtually all patients survive because these
cases provide little ordering information. Interestingly, squared error performs
best when cases have high purity, such as in regions of feature space where
most cases have low risk. Squared error is at its best where rankprop is weakest.
Adding an extra output trained with squared error to a net learning to predict
pneumonia risk with rankprop improves the accuracy of the rankprop output an
additional 5–10% for the least-risk cases. The earliest example of using multiple
output representations we know of is [38], which uses both SSE and cross-entropy
outputs for the same task.
8.2.3 Multiple Output Representations
Sometimes it is not apparent what output encoding is best for a problem. Dis-
tributed output representations often help parts of a problem be learned well
because the parts have separate error gradients. But non-distributed output
representations are sometimes more accurate. Consider the problem of learn-
ing to classify a face as one of twenty faces. One output representation is to
have one output for each face. Another representation is to have outputs that
code for features such as beard/no_beard, glasses/no_glasses, long_hair/short,
eye_color(blue, brown), male/female, that are sufficient to distinguish the faces.
Correct classification, however, may require that each feature be correctly pre-
dicted. The non-distributed output coding that uses one output for each individ-
ual may be more reliable. But training the net to recognize specific traits should
help, too. MTL is one way to merge these conflicting requirements in one net by
using both output representations, even if only one representation will be used
for prediction.
A related approach to multiple output encodings is error correcting codes
[18]. Here, multiple encodings for the outputs are designed so that the combined
prediction is less sensitive to occasional errors in some of the outputs. It is not
clear how much ECOC benefits from MTL-like mechanisms. In fact, ECOC may
benefit from being trained on STL nets (instead of MTL nets) so that different
outputs do not share the same hidden layer and thus are less correlated. But
see [27] for ways of using MTL to decorrelate errors in multiple outputs to boost
committee machine performance.
8.2.4 Time Series Prediction
The simplest way to use MTL for time series prediction is to use a single net
with multiple outputs, each output corresponding to the same task at a different
time. This net makes predictions for the same task at different times. We tested
this on a robot domain where the goal is to predict what the robot will sense 1,
2, 4, and 8 meters in the future as it moves forward. Training all four of these
distances on one MTL net improved the accuracy of the long range predictions
about 10% (see chapter 17 where MTL is used in a time series application).
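A hedged sketch of the target construction this implies, using an artificial series and an assumed window length; each training case gets one output per future offset, so the same task is trained at several times on one net.

```python
import numpy as np

series = np.sin(np.linspace(0, 20, 500))         # stand-in sensor signal
window, horizons = 10, [1, 2, 4, 8]              # past window, future offsets

X, Y = [], []
for t in range(window, len(series) - max(horizons)):
    X.append(series[t - window:t])               # past values as inputs
    Y.append([series[t + h] for h in horizons])  # same task at different future times
X, Y = np.array(X), np.array(Y)                  # Y has one column per horizon (MTL outputs)
```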
8.2.5 Using Non-operational Features
Some features are impractical to use at run time, either because they are too
expensive to compute, or because they need human expertise that won’t be
available. We usually have more time, however, to prepare our training sets.
When it is impractical to compute some features on the fly at run time, but
practical to compute them for the training set, these features can be used as
extra outputs to help learning. Pattern recognition provides a good example of
this. We tested MTL on a door recognition problem where the goal is to recognize
doorways and doorknobs. The extra tasks were features such as the location of
door edges and doorway centers that required laborious hand labelling that
would not be applied to the test set. The MTL nets that were trained to predict
these additional hand-labelled features were 25% more accurate at locating doors
and doorknobs. Other domains where hand-labelling can be used to augment
training sets this way include text domains, medical domains, acoustic domains,
and speech domains.
8.2.6 Using Extra Tasks to Focus Attention
Learning often uses large, ubiquitous patterns in the inputs, while ignoring small
or less common inputs that might also be useful. MTL can be used to coerce
the learner to attend to patterns in the input it would otherwise ignore. This is
done by forcing it to learn internal representations to support related tasks that
depend on these patterns.
A good example is road following. Here, STL nets often ignore lane markings
when learning to steer because lane markings are usually a small part of the
image, are constantly changing, and are often difficult to see (even for humans).
If a net learning to steer is also required to learn to recognize road stripes, the net
will learn to attend to those parts of the image where stripes occur. To the extent
that the stripe tasks are learnable, the net will develop internal representations
to support them. Since the net is also learning to steer using the same hidden
layer, the steering task can use the parts of the stripe hidden representation that
are useful for steering.
We tested this idea using a road image simulator developed by Pomerleau
to permit rapid testing of learning methods for road-following domains [28].
Figure 8.8 shows several 2-D road images.
The principal task is to predict steering direction. For the MTL experiments,
we used eight additional tasks:
– whether the road is one or two lanes
– location of centerline (if any)
– location of left edge of road
– location of right edge of road
– location of road center
– intensity of road surface
– intensity of region bordering road
– intensity of centerline (if any)
These additional tasks are all computable from the internal variables in the
simulator. Table 8.3 shows the average performance of ten runs of single and
multitask learning on each of these tasks. The MTL net has 32 inputs, 16 hidden
units, and 9 outputs. The 36 STL nets have 32 inputs, 2, 4, 8 or 16 hidden units,
and 1 output each.
The last two columns compare STL and MTL. The first is the percent reduc-
tion in error of MTL over the best STL run. Negative percentages indicate MTL
Fig. 8.8. Sample single and two lane roads generated with Pomerleau’s road simulator
performs better. The last column is the percent improvement of MTL over the
average STL performance. On the important steering task, MTL outperforms
STL 15–30%.
We ran a follow-up experiment to test how important centerstripes are to the
STL and MTL nets. We eliminated the stripes from the images in a test set. If
MTL learned more about centerstripes than STL, and uses what it learned about
centerstripes for the main steering task, we expect to see steering performance
degrade more for MTL than for STL when we remove the centerstripes from the
images. Error increased more for the MTL nets than for the STL nets, suggesting
the MTL nets are making more use of the stripes in the images.
Table 8.3. Performance of STL and MTL on the road following domain. The best-performing STL run for each task is marked _thus_. Differences statistically significant at .05 or better are marked with an *.

                       ROOT-MEAN SQUARED ERROR ON TEST SET
TASK              Single Task Backprop (STL)       MTL      MTL Change     MTL Change
                  2HU      4HU     8HU     16HU    16HU     to Best STL    to Mean STL
1 or 2 Lanes      .201     .209    .207   _.178_   .156     -12.4% *       -21.5% *
Left Edge        _.069_    .071    .073    .073    .062     -10.1% *       -13.3% *
Right Edge        .076     .062    .058   _.056_   .051      -8.9% *       -19.0% *
Line Center       .153    _.152_   .152    .152    .151      -0.7%          -0.8%
Road Center       .038    _.037_   .039    .042    .034      -8.1% *       -12.8% *
Road Greylevel   _.054_    .055    .055    .054    .038     -29.6% *       -30.3% *
Edge Greylevel   _.037_    .038    .039    .038    .038       2.7%           0.0%
Line Greylevel    .054     .054   _.054_   .054    .054       0.0%           0.0%
Steering          .093    _.069_   .087    .072    .058     -15.9% *       -27.7% *
8.2.7 Hints: Tasks Hand-Crafted by a Domain Expert
Extra outputs can be used to inject rule hints into nets about what they should
learn [32, 33]. This is MTL where the extra tasks are carefully engineered to
coerce the net to learn specific internal representations. Hints can also be pro-
vided to backprop nets via extra terms in the error signal backpropagated for
the main task output [1, 2]. The extra error terms constrain what is learned to
satisfy desired properties of the main task such as monotonicity [31], symmetry, or
transitivity with respect to certain sets of inputs. MTL, which does not use extra
error terms on task outputs, could be used in concert with these techniques.
8.2.8 Handling Other Categories in Classification
In real-world applications of digit recognition, some of the images given to the
classifier may be alphabetic characters or punctuation marks instead of digits.
One way to prevent accidentally classifying a “t” as a one or seven is to create
an “other” category that is the correct classification for non-digit images. The
large variety of characters mapped to this “other” class makes learning this class
potentially very difficult. MTL suggests an alternate way to do this. Split the
“other” class into separate classes for the individual characters that are trained
in parallel with the main digit tasks. A single output coding for the “other” class
can be used, as well. Breaking the “other” category into multiple tasks gives the
net more learnable error signal for these cases [26].
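A small sketch of the target coding this suggests; the particular set of non-digit symbols below is an assumption for illustration.

```python
import numpy as np

digits = [str(d) for d in range(10)]
others = list("abct,.!?")                 # non-digit symbols that occur in the training data
classes = digits + others                 # 10 digit outputs + one output per "other" symbol

def target(label):
    t = np.zeros(len(classes))
    t[classes.index(label)] = 1.0
    return t

# At prediction time, any argmax that falls outside the first 10 outputs
# is simply reported as "other".
print(target("7")[:10], target("t")[10:])
```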
8.2.9 Sequential Transfer
MTL is parallel transfer. Often tasks arise serially and we can't wait for all of
them before we begin learning. In these cases we can use parallel transfer to perform
sequential transfer. If the training data can be stored, do MTL using whatever
tasks are available when it is time to start learning, and re-train as new tasks or
new data arise. If training data cannot be stored, or if we already have models for
which data is not available, we can still use MTL. Use the models to generate
synthetic data that is then used as extra training signals. This approach to
sequential transfer avoids catastrophic interference (forgetting old tasks while
learning new ones). Moreover, it is applicable where the analytical methods of
evaluating domain theories required by some serial transfer methods [29, 34] are
not available. For example, the domain theory need not be differentiable, it only
needs to make predictions. One issue that arises when synthesizing data from
prior models is what distribution to sample from. See [16] for a discussion of
synthetic sampling.
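One possible way to realize this is sketched below, under the assumption that the prior model is available as a callable function; here the new task's own inputs serve as the sampling distribution, which is only one of the choices discussed in [16].

```python
import numpy as np

rng = np.random.default_rng(1)
W_old = rng.normal(size=(8, 1))            # stand-in for a frozen, previously trained model

def old_model(X):
    return np.tanh(X @ W_old)

X_new = rng.normal(size=(500, 8))          # training inputs for the new task
y_new = rng.normal(size=(500, 1))          # new main-task targets

# The old model labels the sampled inputs; its predictions become an extra
# training signal for an MTL net trained on both target columns.
y_extra = old_model(X_new)
targets = np.hstack([y_new, y_extra])
```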
8.2.10 Similar Tasks With Different Data Distributions
Sometimes there are multiple instances of the same problem, but the distribution
of samples differs for each instantiation. For example, most hospitals diagnose
and treat the same diseases, but the demographics of the patients each hospital
serves are different. Hospitals in Florida see older patients, urban hospitals see
poorer patients, etc. Models trained separately for each hospital would perform
best, but often there is insufficient data to train a separate model for each hos-
pital. Pooling the data, however, may not lead to models that are accurate for
each hospital. MTL provides one solution to this problem. Use one net to make
predictions for each hospital, using a different output on the net for each hos-
pital. Because each patient is a training case for only one hospital, error can be
backpropagated only through the one output that has a target value for each
input vector.
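A minimal sketch of this masking, with made-up data: each case contributes error only at the output belonging to its own hospital, so the shared hidden layer is trained by all hospitals while each output remains hospital-specific.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_hospitals = 20, 32, 4
X = rng.normal(size=(300, n_in))
y = rng.normal(size=300)
hospital = rng.integers(0, n_hospitals, size=300)   # which output each case trains

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_hospitals))

lr = 0.01
rows = np.arange(len(X))
for epoch in range(100):
    H = np.tanh(X @ W1)
    out = H @ W2
    err = np.zeros_like(out)
    err[rows, hospital] = out[rows, hospital] - y   # error only at the case's own output
    dW2 = H.T @ err / len(X)
    dH = err @ W2.T * (1 - H**2)
    dW1 = X.T @ dH / len(X)
    W2 -= lr * dW2
    W1 -= lr * dW1
```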
8.2.11 Learning with Hierarchical Data
In many domains, the data falls in a hierarchy of classes. Most applications of
machine learning to hierarchical data make little use of the hierarchy. MTL pro-
vides one way of exploiting hierarchical information. When training a model to
classify data at one level in the hierarchy, include as extra tasks the classification
tasks that arise for ancestors, descendants, and siblings of the current classifi-
cation task. The easiest way to accomplish this is to train one MTL net to
predict all class distinctions in the hierarchy at the same time.
8.2.12 Some Inputs Work Better as Outputs
The common practice in backprop nets is to use all features that will be available
for test cases as inputs, and have outputs only for tasks that need to be pre-
dicted. On real problems, however, learning often works better given a carefully
selected subset of the features to use as inputs [7, 23, 24]. One way to benefit from
features not used as inputs is to use them as extra outputs for MTL. We’ve done
experiments with both synthetic and real problems where moving some features
from the input side of the net to the output side of the net improves performance
on the main task. We use feature selection to select those features that should
be used as inputs, and then treat some of the remaining features as extra tasks.
Figure 8.9 shows the ROC Area on a pneumonia problem as the number of input features on the backprop net varies.⁷ ROC Areas closer to 1 indicate better performance. There are 192 features available for most patients. Using all 192 features as inputs (Net1) is suboptimal. Better performance is obtained by using the first 50 features selected with feature selection (Net2). The horizontal line at the top of the graph (Net3) shows the ROC Area obtained by using the first 50 features as inputs, and the next 100 features as extra outputs. Using these same 150 features all as inputs (Net4) yields worse performance.⁸

⁷ This is not the same pneumonia problem used in section 8.2.1.
⁸ Although the 95% confidence intervals for Net2 and Net3 overlap with ten trials, a paired t-test shows the results are significant at .01.
Fig. 8.9. ROC Area on the Pneumonia Risk Prediction Task vs. the number of input
features used by the backprop net
8.3 Getting the Most Out of MTL
The basic machinery for doing multitask learning in neural nets is present in
backprop. Backprop, however, was not designed to do MTL well. This chapter
presents suggestions for how to make MTL in backprop nets work better. Some
of the suggestions are counterintuitive, but if they are not followed, MTL can hurt
generalization on the main task instead of helping it.
8.3.1 Use Large Hidden Layers
The basic idea behind MTL in backprop nets is that what is learned in the
hidden layer for one task can be useful to other tasks. MTL works when tasks
share hidden units. One might think that small hidden layers would help MTL
by promoting sharing between tasks.
For the kinds of problems we’ve examined here, this usually does not work.
Usually, tasks are different enough that much of what each task needs to learn
does not transfer to many (or any) other tasks. Using a large hidden layer ensures
that there are enough hidden units for tasks to learn independent hidden layer
representations when they need to. Sharing can still occur, but only when the
overlap between the hidden layer representations for different tasks is strong. In
many real-world problems, the loss in accuracy that results from forcing tasks
to share by keeping the hidden layer small is larger than the benefit that arises
from the sharing. Usually it is important to use large hidden layers with MTL.
8.3.2 Do Early Stopping for Each Task Separately
The classic NETtalk application [30] trained both phonemes and stresses
on one backprop net. NETtalk is an early example of MTL. But the builders
Fig. 8.10. On NETtalk, the Stress task trains very quickly and overfits long before the
Phoneme task reaches peak performance
of NETtalk viewed the multiple outputs as codings for a single problem, not as
independent tasks that benefited each other by being trained together.
Figure 8.10 shows the learning curves for the phoneme and stress subtasks
separately. It is clear that the stress tasks begin to overfit before the phoneme
tasks reach peak performance. Better performance could be obtained on NETtalk
by doing early stopping on the stress and phoneme tasks individually, or by
balancing their learning rates so they reach peak performance at the same time.
Early stopping prevents overfitting by halting the training of error-driven
procedures like backprop before they achieve minimum error on the training set
(see chapter 2). Recall the steering prediction problem from section 8.2.6. We
applied MTL to this problem by training a net on eight extra tasks in addition
to the main steering task. Figure 8.11 shows nine learning curves, one for each of
the tasks on this MTL net. Each graph is the validation set error during training.
Table 8.4 shows the best place to halt each task. There is no one epoch where
training can be stopped so as to achieve maximum performance on all tasks.
If all tasks are important, and one net is used to predict all the tasks, halting
training where the error summed across all outputs is minimized is the best you
can do. Figure 8.12 shows the combined RMS error of the nine tasks. The best
average RMSE occurs at 75,000 backprop passes.
[Figure 8.11 contains nine panels, one per task (number_of_lanes, road_left_edge, road_right_edge, line_center, road_center, road_greylevel, nonroad_greylevel, line_greylevel, steering_direction), each plotting root-mean-squared error against backprop passes (epochs x 100).]
Fig. 8.11. Test-Set Performance of MTL Net Trained on Nine Tasks
But using one net to make predictions for all the tasks is suboptimal. Better
performance is achieved by using the validation set to do early stopping on each
output individually. The trick is to make a copy of the net at the epoch where
performance on each task is best, and use this copy to make predictions for that
task. After making each copy, continue training the net until the other tasks
reach peak performance. Sometimes, it is best to continue training all outputs
on the net, including those that have begun to overfit. Sometimes, it is better
to stop training (or use a lower learning rate) for outputs that have begun to
overfit. Keep in mind that once an output has begun to overfit, we no longer
care how well the net performs on that task because we have a copy of the net
from an earlier epoch when performance on that task was best. The only reason
to continue training the task is because it may benefit other tasks that have not
reached peak performance yet.
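A sketch of the bookkeeping involved, with stand-in training and validation routines (the real training code is not shown here): the net is copied whenever a task reaches a new best validation error, and that copy is later used to make predictions for that task.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)
tasks = ["steering", "left_edge", "right_edge"]
net = {"W": rng.normal(size=(8, len(tasks)))}            # stand-in for the MTL net

def train_one_epoch(net):
    net["W"] -= 0.01 * rng.normal(size=net["W"].shape)   # placeholder for one backprop pass

def val_errors(net):
    return {t: float(np.abs(net["W"][:, i]).mean()) for i, t in enumerate(tasks)}

best_err = {t: float("inf") for t in tasks}
best_net = {}
for epoch in range(50):
    train_one_epoch(net)
    errs = val_errors(net)
    for t in tasks:
        if errs[t] < best_err[t]:
            best_err[t] = errs[t]
            best_net[t] = copy.deepcopy(net)             # copy used later for task t only
```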
Fig. 8.12. Combined Test-Set Performance on all Nine Tasks
Table 8.4. Performance of MTL on the nine tasks in the steering domain when training is halted on each task individually compared with halting using the combined error across all tasks

TASK                 Halted Individually        Halted Combined          Difference
                     BP Pass   Performance      BP Pass   Performance
1: 1 or 2 Lanes       100000      0.444          75000       0.456          2.7%
2: Left Edge          100000      0.309          75000       0.321          3.9%
3: Right Edge         100000      0.376          75000       0.381          1.3%
4: Line Center         75000      0.486          75000       0.486          0.0%
5: Road Center        200000      0.208          75000       0.239         14.9%
6: Road Greylevel     750000      0.552          75000       0.680         23.2%
7: Edge Greylevel     375000      0.518          75000       0.597         15.3%
8: Line Greylevel          1      1.010          75000       1.158         14.7%
9: Steering           125000      0.276          75000       0.292          5.8%
Table 8.4 compares the performance of early stopping done per task with
the performance one obtains by halting training for the entire MTL net at one
place using the combined error. On average, early stopping for tasks individually
reduces error 9.0%. This is a large difference. For some tasks, the performance
of the MTL net is worse than the performance of STL on this task if the MTL
net is not halted on that task individually.
Before leaving this topic, it should be mentioned that the training curves for
the individual outputs on an MTL net are not necessarily monotonic. While it is
not unheard of for the test-set error of an STL net to be multimodal, the training-
set error for an STL net should descend monotonically or become flat. This is
not true for the errors of individual outputs on an MTL net. The training-set
error summed across all outputs should never increase, but any one output may
8. A Dozen Tricks with Multitask Learning 185
exhibit more complex behavior. The graph for road_greylevel (graph number 6)
in Figure 8.11 shows a multimodal test-set curve. The training set curve for this
output is similar. This makes judging when to halt training more difficult with
MTL nets. Because of this, we always do early stopping on MTL nets by training
past the epoch where performance on each task appears to be best, and either
retrain the net a second time (with the same random seed) to get the copies,
or are careful to keep enough copies during the first run that we have whatever
copies we will need.
8.3.3 Use Different Learning Rates for Different Tasks
Is it possible to control the rates at which different tasks train so they each reach
their best performance at the same time? Would best performance on each task
be achieved if each task reached peak performance at the same time? If not, is
it better for extra tasks to learn slower or faster than the main task?
The rate at which different tasks learn using vanilla backprop is rarely optimal
for MTL. Tasks that train slower than the main task will not have learned enough
to help the main task when training on the main task is stopped. Tasks that
train faster than the main task may overfit so much before the main task is
learned well that either what they have learned is no longer useful to the main
task, or they may drive the main task into premature overfitting.
The most direct method of controlling the rate at which different tasks learn
is to use a different learning rate for each task, i.e., for each output. We have
experimented with using gradient descent to find learning rates for each extra
output to maximize the generalization performance on the main task. Table 8.5
shows the performance on the main steering task before and after optimizing the
learning rates of the other eight extra tasks. Optimizing the learning rates for the
extra MTL tasks improved performance on the main task an additional 11.5%.
This improvement is over and above the original improvement of 15%–25% for
MTL over STL.
Examining the training curves for all the tasks as the learning rates are op-
timized shows that the changes in the learning rates of the extra tasks have a
significant effect on the rate at which the extra tasks are learned. Interestingly,
it also has a significant effect on the rate at which the main task is learned.
Table 8.5. Performance of MTL on the main Steering Direction task before and after
optimizing the learning rates of the other eight extra tasks
TRIAL       Before Optimization    After Optimization    Difference
Trial 1            0.227                  0.213              -6.2%
Trial 2            0.276                  0.241             -12.7%
Trial 3            0.249                  0.236              -5.2%
Trial 4            0.276                  0.231             -16.3%
Trial 5            0.276                  0.234             -15.2%
Average            0.261                  0.231             -11.5% *
This is surprising because we keep the learning rate for the main task fixed
during optimization. Perhaps even more interesting is the fact that optimizing
the learning rate to maximize the generalization accuracy of the main task also
improved generalization on the extra tasks nearly as much as it helped the main
task. What is good for the goose appears to be good for the gander.
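A minimal sketch of the mechanism, using fixed hand-chosen per-output rates rather than rates tuned by gradient descent as in the experiment above; the data and layer sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_tasks = 16, 32, 9
X = rng.normal(size=(200, n_in))
Y = rng.normal(size=(200, n_tasks))
task_lr = np.array([1.0] + [0.3] * (n_tasks - 1))   # main task first, extra tasks slower

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_tasks))

base_lr = 0.01
for epoch in range(100):
    H = np.tanh(X @ W1)
    err = (H @ W2 - Y) * task_lr                    # per-output scaling of the error signal
    dW2 = H.T @ err / len(X)
    dH = err @ W2.T * (1 - H**2)
    dW1 = X.T @ dH / len(X)
    W2 -= base_lr * dW2
    W1 -= base_lr * dW1
```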
8.3.4 Use a Private Hidden Layer for the Main Task
Sometimes the optimal number of hidden units is 100 or more per output. If
there are hundreds of extra tasks this translates to thousands of hidden units.
This not only creates computational difficulties, but degrades performance
on the main task because most of the hidden layer representation is constructed
for other tasks. The main task output unit then has a massive hidden unit selection
problem as it tries to use only those few hidden units that are useful to it.
Fig. 8.13. MTL Architecture With a Private Hidden Layer for the Main Task(s), and
a Shared Hidden Layer Used by the Main Task(s) and the Extra Tasks
Figure 8.13 shows a net architecture that solves this problem. Instead of one
hidden layer shared equally by all tasks, there are two disjoint hidden layers.
Hidden layer 1 is a private hidden layer used only by the main task(s). Hidden
layer 2 is shared by the main task(s) and the extra tasks. This is the hidden layer
that supports MTL transfer. Because the main task sees and affects the shared
hidden layer, but the extra tasks do not affect the hidden layer reserved for the
main task(s), hidden layer 2 can be kept small without hurting the main task.
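A sketch of the forward pass for this architecture, with illustrative sizes: the main output reads from both its private hidden layer and the shared layer, while the extra outputs read only from the shared layer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_private, n_shared, n_extra = 16, 32, 8, 20

Wp = rng.normal(scale=0.1, size=(n_in, n_private))       # private hidden layer (main task only)
Ws = rng.normal(scale=0.1, size=(n_in, n_shared))        # shared hidden layer (all tasks)
V_main = rng.normal(scale=0.1, size=(n_private + n_shared, 1))
V_extra = rng.normal(scale=0.1, size=(n_shared, n_extra))

def forward(X):
    Hp = np.tanh(X @ Wp)
    Hs = np.tanh(X @ Ws)
    main = np.hstack([Hp, Hs]) @ V_main                  # main task sees both layers
    extra = Hs @ V_extra                                 # extra tasks see only the shared layer
    return main, extra

main_out, extra_out = forward(rng.normal(size=(5, n_in)))
```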
8.4 Chapter Summary
We usually think of the inputs to a backprop net as the place where information is
given to the net, and the outputs as the place where the net outputs predictions.
Backprop, however, pushes information into the net through the outputs during
training. The information fed into a net through its outputs is as important as
the information fed into it through its inputs.
8. A Dozen Tricks with Multitask Learning 187
Multitask Learning is a way of using the outputs of a backprop net to push
additional information into the net during training. If the net architecture allows
sharing of what is learned for different outputs, this additional information can
help the main task be learned better. (See [5, 6, 4, 8, 3, 9, 11, 10, 20, 35, 36, 12,
21, 13, 14] for additional discussion about multitask learning.)
MTL trains multiple tasks in parallel not because this is a more efficient way
to learn multiple tasks, but because the information in the training signals for
the extra tasks can help the main task be learned better. Sometimes what is
optimal for the main task is not optimal for the extra tasks. It is important to
optimize the technique so that performance on the main task is best, even if this
hurts performance on the extra tasks. If the extra tasks are important too, it
may be best to rerun learning for each important extra task, with the technique
optimized for each task one at a time.
This chapter presented a number of opportunities for using extra outputs to
leverage information that is available in real domains. The trick in most of these
applications is to view the outputs of the net as inputs that are used only during
learning. Any information that is available when the net is trained, but which
would not be available later when the net is used for prediction, can potentially
be used as extra outputs. There are many domains where useful extra tasks will
be available. The list of prototypical domains provided in this chapter is not
complete. More kinds of extra tasks will be identified in the future.
Acknowledgements. R. Caruana was supported by ARPA grant F33615-93-
1-1330, NSF grant BES-9315428, and Agency for Health Care Policy grant
HS06468. The work to find input features that worked better when used as extra
outputs is joint work with Virginia de Sa, who was supported by a postdoctoral
fellowship from the Sloan Foundation. We thank the University of Toronto for
the Xerion Simulator, and D. Koller and M. Sahami for the use of their feature
selector.
References
[1] Abu-Mostafa, Y.S.: Learning from Hints in Neural Networks. Journal of Complex-
ity 6(2), 192–198 (1990)
[2] Abu-Mostafa, Y.S.: Hints. Neural Computation 7, 639–671 (1995)
[3] Baxter, J.: Learning Internal Representations. In: COLT 1995, Santa Cruz, CA
(1995)
[4] Baxter, J.: Learning Internal Representations. Ph.D. Thesis, The Flinders Uni-
versity of South Australia (December 1994)
[5] Caruana, R.: Multitask Learning: A Knowledge-Based Source of Inductive Bias.
In: Proceedings of the 10th International Conference on Machine Learning, ML
1993, University of Massachusetts, Amherst, pp. 41–48 (1993)
[6] Caruana, R.: Multitask Connectionist Learning. In: Proceedings of the 1993 Con-
nectionist Models Summer School, pp. 372–379 (1994)
[7] Caruana, R., Freitag, D.: Greedy Attribute Selection. In: ICML 1994, Rutgers,
NJ, pp. 28–36 (1994)
[8] Caruana, R.: Learning Many Related Tasks at the Same Time with Backpropa-
gation. In: NIPS 1994, pp. 656–664 (1995)
[9] Caruana, R., Baluja, S., Mitchell, T.: Using the Future to “Sort Out” the Present:
Rankprop and Multitask Learning for Medical Risk Prediction. In: Proceedings
of Advances in Neural Information Processing Systems, NIPS 1995, pp. 959–965
(1996)
[10] Caruana, R., de Sa, V.R.: Promoting Poor Features to Supervisors: Some Inputs
Work Better As Outputs. In: NIPS 1996 (1997)
[11] Caruana, R.: Multitask Learning. Machine Learning 28, 41–75 (1997)
[12] Caruana, R.: Multitask Learning. Ph.D. thesis, Carnegie Mellon University, CMU-
CS-97-203 (1997)
[13] Caruana, R., O’Sullivan, J.: Multitask Pattern Recognition for Autonomous
Robots. In: The Proceedings of the IEEE Intelligent Robots and Systems Confer-
ence (IROS 1998), Victoria (1998) (to appear)
[14] Caruana, R., de Sa, V.R.: Using Feature Selection to Find Inputs that Work Better
as Outputs. In: The Proceedings of the International Conference on Neural Nets
(ICANN 1998), Sweden (1998) (to appear)
[15] Cooper, G.F., Aliferis, C.F., Ambrosino, R., Aronis, J., Buchanan, B.G., Caruana,
R., Fine, M.J., Glymour, C., Gordon, G., Hanusa, B.H., Janosky, J.E., Meek,
C., Mitchell, T., Richardson, T., Spirtes, P.: An Evaluation of Machine Learning
Methods for Predicting Pneumonia Mortality. Artificial Intelligence in Medicine 9,
107–138 (1997)
[16] Craven, M., Shavlik, J.: Using Sampling and Queries to Extract Rules from
Trained Neural Networks. In: Proceedings of the 11th International Conference
on Machine Learning, ML 1994, Rutgers University, New Jersey, pp. 37–45 (1994)
[17] Davis, I., Stentz, A.: Sensor Fusion for Autonomous Outdoor Navigation Using
Neural Networks. In: Proceedings of IEEE’s Intelligent Robots and Systems Con-
ference (1995)
[18] Dietterich, T.G., Bakiri, G.: Solving Multiclass Learning Problems via Error-
Correcting Output Codes. Journal of Artificial Intelligence Research 2, 263–286
(1995)
[19] Fine, M.J., Singer, D., Hanusa, B.H., Lave, J., Kapoor, W.: Validation of a Pneu-
monia Prognostic Index Using the MedisGroups Comparative Hospital Database.
American Journal of Medicine (1993)
[20] Ghosn, J., Bengio, Y.: Multi-Task Learning for Stock Selection. In: NIPS 1996
(1997)
[21] Heskes, T.: Solving a Huge Number of Similar Tasks: A Combination of Multi-
task Learning and a Hierarchical Bayesian Approach. In: Proceedings of the 15th
International Conference on Machine Learning, Madison, Wisconsin, pp. 233–241
(1998)
[22] Holmstrom, L., Koistinen, P.: Using Additive Noise in Back-propagation Training.
IEEE Transactions on Neural Networks 3(1), 24–38 (1992)
[23] John, G., Kohavi, R., Pfleger, K.: Irrelevant Features and the Subset Selection
Problem. In: ICML 1994, Rutgers, NJ, pp. 121–129 (1994)
[24] Koller, D., Sahami, M.: Towards Optimal Feature Selection. In: ICML 1996, Bari,
Italy, pp. 284–292 (1996)
[25] Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard,
W., Jackel, L.D.: Backpropagation Applied to Handwritten Zip-Code Recognition.
Neural Computation 1, 541–551 (1989)
[26] Le Cun, Y.: Private communication (1997)
[27] Munro, P.W., Parmanto, B.: Competition Among Networks Improves Commit-
tee Performance. In: Proceedings of Advances in Neural Information Processing
Systems, NIPS 1996, vol. 9 (1997) (to appear)
[28] Pomerleau, D.A.: Neural Network Perception for Mobile Robot Guidance. Doc-
toral Thesis, Carnegie Mellon University: CMU-CS-92-115 (1992)
[29] Pratt, L.Y., Mostow, J., Kamm, C.A.: Direct Transfer of Learned Information
Among Neural Networks. In: Proceedings of AAAI 1991 (1991)
[30] Sejnowski, T.J., Rosenberg, C.R.: NETtalk: A Parallel Network that Learns to
Read Aloud. Johns Hopkins: JHU/EECS-86/01 (1986)
[31] Sill, J., Abu-Mostafa, Y.: Monotonicity Hints. In: Proceedings of Neural Informa-
tion Processing Systems, NIPS 1996, vol. 9 (1997) (to appear)
[32] Suddarth, S.C., Holden, A.D.C.: Symbolic-neural Systems and the Use of Hints
for Developing Complex Systems. International Journal of Man-Machine Stud-
ies 35(3), 291–311 (1991)
[33] Suddarth, S.C., Kergosien, Y.L.: Rule-injection Hints as a Means of Improving
Network Performance and Learning Time. In: Proceedings of EURASIP Workshop
on Neural Nets, pp. 120–129 (1990)
[34] Thrun, S.: Explanation-Based Neural Network Learning: A Lifelong Learning Ap-
proach. Kluwer Academic Publisher (1996)
[35] Thrun, S., Pratt, L. (eds.): Machine Learning. Second Special Issue on Inductive
Transfer (1997)
[36] Thrun, S., Pratt, L. (eds.): Learning to Learn. Kluwer (1997)
[37] Valdes-Perez, R., Simon, H.A.: A Powerful Heuristic for the Discovery of Com-
plex Patterned Behavior. In: Proceedings of the 11th International Conference on
Machine Learning, ML 1994, Rutgers University, New Jersey, pp. 326–334 (1994)
[38] Weigend, A., Rumelhart, D., Huberman, B.: Generalization by Weight-
Elimination with Application to Forecasting. In: Proceedings of Advances in Neu-
ral Information Processing Systems, NIPS 1990, vol. 3, pp. 875–882 (1991)
9
Solving the Ill-Conditioning in Neural Network
Learning
Patrick van der Smagt and Gerd Hirzinger
German Aerospace Research Establishment,
Institute of Robotics and System Dynamics,
P.O. Box 1116, D–82230 Wessling, Germany
smagt@dlr.de
http://www.robotic.dlr.de/Smagt/
Abstract. In this paper we investigate the feed-forward learning prob-
lem. The well-known ill-conditioning which is present in most feed-for-
ward learning problems is shown to be the result of the structure of the
network. Also, the well-known problem that weights between ‘higher’
layers in the network have to settle before ‘lower’ weights can converge
is addressed. We present a solution to these problems by modifying the
structure of the network through the addition of linear connections which
carry shared weights. We call the new network structure the linearly
augmented feed-forward network, and it is shown that the universal ap-
proximation theorems are still valid. Simulation experiments show the
validity of the new method, and demonstrate that the new network is
less sensitive to local minima and learns faster than the original network.
9.1 Introduction
One of the major problems with feed-forward network learning remains the ac-
curacy and speed of the learning algorithms. Since the learning problem is a
complex and highly nonlinear one [12, 4], iterative learning procedures must be
used to solve the optimisation problem [2, 14]. A continuing desire to improve
the behavior of the learning algorithm has lead to many excellent optimisation
algorithms which are especially tailored for feed-forward network learning.
However, an important problem is the particular form of the error function
that represents the learning problem. It has long been noted [10, 16] that the
derivatives of the error function are usually ill-conditioned. This ill-conditioning
is reflected in error landscapes which contain many saddle points and flat areas.
Although this problem can be solved by using stochastic learning methods
(e.g., [9, 1, 13]), these methods require many learning iterations in order to find
an optimum, and are therefore not suited for problems where fast learning is a
requirement. We therefore remain focused on gradient-based learning methods.
Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN
978-3-540-65311-0 (1998).
Algorithms exist which attempt to find well-behaving minima [7], yet an impor-
tant factor of the learning problem remains the structure of the feed-forward
network.
In this chapter an explanation of the ill-conditioned learning problem is pro-
vided as well as a solution to alleviate this problem. Section 9.2 formally in-
troduces the learning problem, and describes the problem of singularities in the
learn matrices. In section 9.3 the cause of the singularities are analyzed and an
adapted learning rule is introduced which alleviates this problem. Section 9.4
discusses a few applications.
9.2 The Learning Process

With a neural network N : R^N × R^n → R^M we create an approximation to a set of p learning samples {(x_1, y_1), (x_2, y_2), ..., (x_p, y_p)}, with x_i ∈ R^N and y_i ∈ R^M, for which holds that ∀ 1 ≤ i ≤ p : F(x_i) = y_i. The function F : R^N → R^M is called the model function.

Let n be the number of free parameters W of the network. In this particular case we are interested in approximating the learning samples rather than the underlying function F, or assume that the p learning samples are representative for F.
The o'th output of the function that is represented by the neural network can be written as

    N(x, W)_o = \sum_h w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big) + \theta_o        (9.1)

where s(x) is the transfer function, o indicates an output unit, h a hidden unit, and i an input unit. The symbol w_ho indicates an element of W corresponding with the connection from hidden unit h to output unit o; w_ih is similarly used for a connection from input unit i to hidden unit h. Finally, θ is a bias weight and therefore an element of W. An exemplar feed-forward network with one input and output unit and two hidden units is depicted in figure 9.1.
The learning task consists of minimizing an approximation error, which is usually defined as

    E_N(W) = \sum_{i=1}^{p} \| N(x_i, W) - y_i \|        (9.2)

where for ‖·‖ we prefer to use the L_2 norm. We will leave the subscript N out when no confusion arises. E(W) is (highly) nonlinear in W, such that iterative search techniques are required to find the W for which E(W) is sufficiently small. Finally we define the residual pattern error

    e_{M(i-1)+j} = N(x_i, W)_j - y_{ij},        (9.3)

i.e., the error in the j'th output value for the i'th learning sample.
Fig. 9.1. An exemplar feed-forward neural network. The circles represent neurons; the
black filled circle is a bias unit, and always carries an activation value of ‘1’.
9.2.1 Learning Methodology

Gradient based learning methods are characterized by considering low-order terms from the Taylor expansion to the approximation error,

    E(\Delta W + W_0) = E(W_0) + \nabla E|_{W_0}\, \Delta W + \Delta W^T\, \nabla^2 E|_{W_0}\, \Delta W + \ldots        (9.4)

In most cases more than the second order term is neglected. We define

    \tilde{E}_1(\Delta W + W_0) = E(W_0) + \nabla E|_{W_0}\, \Delta W        (9.5)

and

    \tilde{E}_2(\Delta W + W_0) = \tilde{E}_1(\Delta W + W_0) + \Delta W^T\, \nabla^2 E|_{W_0}\, \Delta W        (9.6)

being the first-order and second-order approximation to E, respectively.
By locally considering the approximation error as a first- or second-order func-
tion of W , we can use several existing approximation methodologies to minimize
E. When, according to the local second-order approximation information, E is
minimized, the local information is updated and a second minimization step is
carried out. This process is repeated until a minimum is found.
Well-known minimization methods are steepest descent (a variant of which is
known as error backpropagation), conjugate gradient optimisation, Levenberg-
Marquardt optimisation, variable metric methods, and (quasi-) Newton meth-
ods. Each of these methods has its advantages and disadvantages which are
discussed elsewhere [14].
The optimisation methods work in principle as follows. By considering the second-order approximation to E, an optimum can be analytically found if ∇E and ∇²E are known. After the optimum has been located, ∇E and ∇²E are recomputed using the local information, and the minimum is relocated. This process is repeated until a minimum is found.
Naturally, the success of this approach stands or falls with the form of the
error function. If the error function is not too complex but smooth, and can
be reasonably approximated by a quadratic function, the discussed optimisation
methods are a reliable and fast way of finding minima. In feed-forward network
training, however, the error functions appear to have a large number of flat areas
where minimization is a difficult task due to limited floating point accuracy.
9.2.2 Condition of the Learning Problem

The flat areas of the error function can be formalized as follows. We define the (Mp) × n Jacobian matrix as

    J \equiv \begin{pmatrix} \nabla e_1^T \\ \nabla e_2^T \\ \vdots \\ \nabla e_{Mp}^T \end{pmatrix}
    \quad \text{where} \quad
    \nabla e_k \equiv \begin{pmatrix} \partial e_k/\partial w_1 \\ \partial e_k/\partial w_2 \\ \vdots \\ \partial e_k/\partial w_n \end{pmatrix}        (9.7)

such that we can write J as

    J \equiv \begin{pmatrix}
      \partial e_1/\partial w_1 & \partial e_1/\partial w_2 & \cdots & \partial e_1/\partial w_n \\
      \partial e_2/\partial w_1 & \partial e_2/\partial w_2 & \cdots & \partial e_2/\partial w_n \\
      \vdots & \vdots & \ddots & \vdots \\
      \partial e_{Mp}/\partial w_1 & \partial e_{Mp}/\partial w_2 & \cdots & \partial e_{Mp}/\partial w_n
    \end{pmatrix}.        (9.8)

In the learning process we thus encounter that ∇E = 2 J^T e.
We furthermore define the Hessian H = ∇²E, which is the matrix of second order derivatives of E. We are interested in the eigenvalues and eigenvectors of H. Since H is a positive semidefinite symmetric matrix close to a local minimum, its eigenvalues are all non-negative real numbers. When an eigenvalue is very small, the effect of moving in the direction of the corresponding eigenvector on the approximation error is very small. This means that, in that direction, the error function is (nearly) singular. The singularity of the error function can be expressed in the condition of H, which is defined as the quotient of its largest and its smallest eigenvalues.
As mentioned above, a bad condition of H may occur at minima or flat points
in the error function E(W ). It appears that feed-forward learning tasks are
generally characterized by having a singular or near-singular Hessian matrix.
Although the said learning methods are mathematically not influenced by a
badly conditioned Hessian, it does lead to inaccuracies due to the limited floating
point accuracy of the digital computer.
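The conditioning can be examined numerically. The sketch below (network size and data are illustrative, and the Jacobian is formed by finite differences) uses the Gauss-Newton matrix J^T J as a stand-in for the Hessian near a minimum and reports the ratio of its largest to smallest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)[:, None]               # 20 one-dimensional learning samples
y = np.tanh(2 * x)

def net(w, x):
    # tiny 1-5-1 feed-forward net; w packs W1 (5), b1 (5), W2 (5), b2 (1)
    W1, b1, W2, b2 = w[:5], w[5:10], w[10:15], w[15]
    return np.tanh(x @ W1[None, :] + b1) @ W2 + b2

def residuals(w):
    return net(w, x).ravel() - y.ravel()

w = rng.normal(scale=0.1, size=16)
eps = 1e-6
J = np.empty((len(x), len(w)))
for k in range(len(w)):                            # finite-difference Jacobian of the residuals
    d = np.zeros_like(w); d[k] = eps
    J[:, k] = (residuals(w + d) - residuals(w - d)) / (2 * eps)

eig = np.linalg.eigvalsh(J.T @ J)                  # eigenvalues of the Gauss-Newton matrix
print("condition of J^T J:", eig.max() / max(eig.min(), 1e-30))
```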
9.2.3 What Causes the Singularities
Reference [10] lists a few cases in which the Hessian may become singular or near-singular. The listed reasons are associated with the ill character of the sigmoid transfer function. Typical for this function is the fact that lim_{x→∞} s(x) = c_+ and lim_{x→−∞} s(x) = c_−. Also, bad conditioning can be the result of uncentered data, and can also be alleviated [11].
Assuming that some neuron is in this ‘saturated’ state for all learning patterns,
its input weights will have a delta equal to 0 (according to the backpropagation
rule) such that these weights will never change. For each of the incoming weights
of this neuron, this leads to a 0-row in H, and therefore singularity.
However, there is another important reason for singularity, which may espe-
cially occur after network initialization. When a multi-layer feed-forward network
has a small (e.g., less than 0.1) weight leaving from a hidden unit, the influence
of the weights that feed into this hidden unit is significantly reduced. Therefore the ∂e/∂w_k will be close to 0, leading to near-null rows in J and a near-singular H.
We observe that this kind of singularity is very common and touches a char-
acteristic problem in feed-forward network learning: The gradients in the lower-
layer weights are influenced by the higher-layer weights. A related problem is
the influence that the change of the weights between the hidden and output
units have on the change of the input weights; when they rapidly and frequently
change, as will be the case during the initial stages of learning, the lower weights
will have nonsensical repeated perturbations.
9.2.4 Definition of Minimum

A minimum is said to be reached when the derivative of the error function is zero, i.e., ∂E/∂w_k = 0 for 1 ≤ k ≤ n. Since the gradient of the error function equals the column-sum of the Jacobian, a minimum has been reached when

    \forall\, 1 \le k \le n : \sum_{i=1}^{Mp} \frac{\partial e_i}{\partial w_k} = 0,        (9.9)

i.e., when each of the columns adds up to 0. Eq. (9.9) defines the minimum for a batch-learning system: the gradient, when summed over all learning samples, should be 0. This also means, however, that it may occur that elements of a column-sum cancel each other out, even when not all elements of the Jacobian are 0. Differently put, it may occur that the gradients for some patterns are non-zero, whereas the gradients sum up to zero.

The optimal case is reached when all elements of the Jacobian are 0. This means, of course, that the residual error of each pattern is 0.
9.3 Local Minima are Caused by BackPropagation
In this section we propose a new neural network structure which alleviates
the above problems. In the standard backpropagation learning procedure, the
gradient of the error function with respect to the weights is determined by computing the following steps for each learning pattern:

1. For each output unit o, compute the delta δ_o = y_o − a_o, where a_o is the activation value for that unit when a learning sample is presented.
2. Compute the weight derivative for the weights w_ho from the hidden to output units: Δw_ho = δ_o a_h, where a_h is the activation value for the hidden unit.
3. Compute the delta for the hidden unit: δ_h = \sum_o δ_o w_{ho} s'(a_h).
4. Compute the weight derivative for the weights w_ih from the input to hidden units:

    \Delta w_{ih} = \delta_h a_i = \sum_o \delta_o w_{ho}\, s'(a_h)\, a_i.        (9.10)
The gradient is then computed as the summation of the Δw’s.
From (9.10) we can see that there are four cases when the gradient for a weight from an input to a hidden unit is negligible, such that the corresponding row and column in the Hessian are near-zero:

– When δ_o is small. This is correct, since that case means that the output of the network is close to its desired output.
– When w_ho is small. This is an undesired situation: a small weight from hidden to output unit paralyzes weights from input to hidden units. This is especially important since the weight might have to change its value later.
– When s'(a_h) is small; this occurs when the weight w_ih is large. Again, this saturation type of paralysis is undesired.
– Finally, when a_i is small. This is desired: when the input value is insignificant, it should have no influence on the output.
9.3.1 A New Neural Network Structure

In order to alleviate these problems, we propose a change to the learning system of (9.10) as follows:

    \Delta w_{ih} = \sum_o \delta_o \big(w_{ho}\, s'(a_h) + c\big)\, a_i = \delta_h a_i + c \sum_o \delta_o a_i.

By adding a constant c to the middle term, we can solve both paralysis problems. In effect, an extra connection from each input unit to each output unit is created, with a weight value coupled to the weight from the input to hidden unit.
Fig. 9.2. An exemplar adapted feed-forward neural network
Although c can be made a learnable parameter, in the sequel we will assume c = 1. The o'th output of the neural network is now computed as

    M(x, W)_o = \sum_h \Big[ w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big) + \sum_i w_{ih} x_i \Big] + \theta_o.        (9.11)

We call the new network the linearly augmented feed-forward network. The structure of this network is depicted in figure 9.2. Note the equivalence of N and M, viz.,

    M(x, W)_o \equiv N(x, W)_o + \sum_h \sum_i w_{ih} x_i.        (9.12)
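A small numerical sketch of (9.11) and (9.12) with c = 1; the sizes and data are arbitrary, and the check at the end simply confirms that M equals N plus the summed linear terms of the hidden units.

```python
import numpy as np

def s(x):
    return 1.0 / (1.0 + np.exp(-x))            # sigmoid transfer function

def N_net(x, W_ih, theta_h, w_ho, theta_o):    # ordinary feed-forward net, eq. (9.1)
    return s(x @ W_ih + theta_h) @ w_ho + theta_o

def M_net(x, W_ih, theta_h, w_ho, theta_o):    # linearly augmented net, eq. (9.11)
    a = x @ W_ih + theta_h
    return s(a) @ w_ho + (x @ W_ih).sum(axis=1) + theta_o

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                    # 4 samples, 3 inputs
W_ih = rng.normal(size=(3, 2))                 # 2 hidden units
theta_h = rng.normal(size=2)
w_ho = rng.normal(size=2)
theta_o = 0.1

# Eq. (9.12): M equals N plus the summed linear terms of all hidden units.
assert np.allclose(M_net(x, W_ih, theta_h, w_ho, theta_o),
                   N_net(x, W_ih, theta_h, w_ho, theta_o) + (x @ W_ih).sum(axis=1))
```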
9.3.2 Influence on the Approximation Error E

Although the optimal W will be different for the networks N and M, we can still compare the forms of the error functions E_N vs. E_M. Using (9.2) and (9.11) we can compute the error in the approximation of a single learning sample (x, y) with N inputs, κ hidden units, and a single output:

    E_M(W)^2 = \big(M[x, W] - y\big)^2 = \Big(N(x, W) - y + \sum_i \sum_h w_{ih} x_i\Big)^2
             = E_N(W)^2 + 2 \sum_i \sum_h w_{ih} x_i \big(N[x, W] - y\big) + \Big(\sum_i \sum_h w_{ih} x_i\Big)^2.        (9.13)
The error for the network M differs from the error for N in two terms. When we consider E_M at those values of W where E_N is minimal, we can see that the difference between E_N and E_M consists of a normalization term (\sum_h w_h^T x)^2, where w_h is the vector of weights connecting the inputs to hidden unit h. This non-negative term will only be zero when the vectors w_1^T x, w_2^T x, ..., w_κ^T x cancel each other out for each input vector x which is in the training set. In words: E_M(W)^2 introduces a penalty for hidden units doing the same thing, thus making the network use its resources more efficiently.
9.3.3 M and the Universal Approximation Theorems

It has been shown in various publications [3, 6, 8] that the ordinary feed-forward neural network N can represent any Borel-measurable function with a single layer of hidden units which have sigmoidal or Gaussian activation functions in the hidden layer.

Theorem 1. The network M can represent any Borel-measurable function with a single layer of hidden units which have sigmoidal or Gaussian activation functions in the hidden layer.

Proof. We show that any network N can be written as a network M; therefore, the class of networks M are universal approximators. By using (9.1) and (9.11),

    N(x, W)_o = \sum_{h=1}^{\kappa} w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big) + \theta_o
              = \sum_{h=1}^{\kappa} \Big[ w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big) + \sum_i w_{ih} x_i \Big] + \theta_o - \sum_{h=1}^{\kappa} \sum_i w_{ih} x_i
              = M(x, W)_o + \sum_{l=\kappa+1}^{2\kappa} \Big[ 0 \cdot s\Big(\sum_i -w_{i,l-\kappa} x_i + 0\Big) + \sum_i -w_{i,l-\kappa} x_i \Big] + 0
              = M(x, W)_o + M(x, V)_o

where V is a weight matrix such that the elements of V corresponding to the weights from hidden to output units are 0, and the other weights equal the negation of their W counterparts. Furthermore, bias weights are set to 0.

The sum M(x, W)_o + M(x, V)_o represents two M-networks, which can also be written as a single M(x, W')-network with the double amount of hidden units, where W' = [W V]. Using the theorems from [3, 6, 8] the proof is complete.

Note that it is also possible to write each M-network as an N-network, by doubling the number of hidden units and using infinitesimal weights from the input units to these hidden units, and their multiplicative inverse for the weights from these hidden to the output units.

Figure 9.3 shows the equivalence of an N and M network for the two-hidden unit case.
Fig. 9.3. Equivalence of an N-network (left) and an M-network (right). Note that bias units are left out for clarity.
9.3.4 Example

As an example, we train a network with a single hidden unit, and no bias connections, to represent the learning samples (1, 1) and (2, 2). The hidden unit has a sigmoid activation function. The function that is computed by the network thus is N(x, W) = w_2 s(w_1 x) for the original neural network, and M(x, W) = w_2 s(w_1 x) + w_1 x for the adapted neural network. We use the sigmoid function s(x) = 1/(1 + e^{-x}) as activation function, which has the following well-known properties: lim_{x→∞} s(x) = 1, lim_{x→−∞} s(x) = 0, and lim_{x→±∞} s'(x) = 0.
Figure 9.4 shows the error function and its derivatives for this neural network. In the top row of this figure we see the original neural network. Notice in the top middle figure, depicting ∂E/∂w_1, that ∂E/∂w_1 ≈ 0 for small values of w_2. In other words, when w_2 is small, w_1 will hardly change its value. Similarly, when w_1 is large, then ∂E/∂w_1 will be small due to the fact that the derivative of the transfer function is nearly 0. In the bottom row the modified neural network is depicted. Left, again, the error function. The middle figure clearly shows that the derivative has no areas anymore which are zero or very small. The right figure still shows that, if w_1 has a large negative value, ∂E/∂w_2 is negligible: after all, the activation value of the hidden unit is near-zero.
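The paralysis can be checked numerically. The following sketch evaluates dE/dw_1 by finite differences for both networks at a point with small w_2 (the case discussed above); the gradient is nearly zero for N but clearly non-zero for M.

```python
import numpy as np

xs = np.array([1.0, 2.0])                     # the learning samples (1,1) and (2,2)
ys = np.array([1.0, 2.0])

def s(x):
    return 1.0 / (1.0 + np.exp(-x))

def E(net, w1, w2):
    return np.sum((net(w1, w2) - ys) ** 2)    # approximation error over both samples

N = lambda w1, w2: w2 * s(w1 * xs)            # original network
M = lambda w1, w2: w2 * s(w1 * xs) + w1 * xs  # linearly augmented network

w1, w2, eps = 0.5, 0.01, 1e-6                 # small w2: the paralysis regime for N
for name, net in [("N", N), ("M", M)]:
    g = (E(net, w1 + eps, w2) - E(net, w1 - eps, w2)) / (2 * eps)
    print(name, "dE/dw1 =", g)                # near zero for N, clearly non-zero for M
```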
9.4 Applications
We have applied the new learning scheme to several approximation problems. In
all problems, each network has been run 3,000 times with different initial random
values for the weights. In order to train the network, we used a Polak-Ribière
conjugate gradient optimization technique with Powell restarts [14].
The applications with real data (problems 3 and 4) use two independent sets
of data: a learn set and a cross-validation set. In all cases, the network was
Fig. 9.4. Error function and derivatives using the original and adapted learning rule. The top row shows the error function for the original learning rule (left), as well as its derivative ∂E/∂w_1 (centre) and ∂E/∂w_2 (right). The bottom row shows the same graphs for the adapted learning rule. The contour lines have a distance of 0.5 (left graphs) and 0.2 (middle and right graphs).
trained until the error over the cross-validation set was minimal (i.e., up to but
not beyond the point where the network started to over-fit). The approximation
errors that are reported are computed using the samples in the cross-validation
set.
Problem 1: (synthetic data) the XOR problem. The well-known XOR problem
consists of representing four learning samples [(0, 0), 0], [(0, 1), 1], [(1, 0), 1], and
[(1, 1), 0]. The network has two inputs, two hidden units, and one output.
It has been noted [5] that the XOR problem is very atypical in neural network
learning, because it is penalized for generalization. Nevertheless, the XOR prob-
lem is generally considered as a small standard learning benchmark problem.
Whereas the network N reaches local minima in 22.4% of the runs, the linearly augmented network M always reached a global minimum. For N, the average number of steps to obtain an approximation error of 0.0 (within the 32-bit floating point accuracy of the computer) equals 189.1; for M, 98.3 steps were required.
When using the ordinary error backpropagation learning rule (i.e., no conju-
gate gradient learning), the XOR learning problem has been reported to lead to
8.7% local minima [14].
Problem 2: (synthetic data) approximating tan(x). As a second test, the networks have been used for the approximation of the function F(x) = tan(x). The function has been uniformly sampled in the domain [0] using 20 learning samples in total. For the approximation we used a network with one input, five hidden units in a single hidden layer, and one output.

With network N, 14.6% of the runs led to a local minimum. The linearly augmented neural network is not perfect; it is stuck in a local minimum in 6.2% of the runs. In all other cases, a global minimum very close to 0.0 was found.
Problem 3: (real data) approximating robot arm inverse kinematics. Thirdly, we approximated data describing a hand-eye coordination learning problem with a Manutec R2 robot arm.
The data is organized as follows. There are five input variables, describing the current position of the robot arm in joint angles θ_2, θ_3, as well as the visual position of an object with respect to the hand-held camera in pixel coordinates x, y, z. The output part of the data consists of the required joint rotations Δθ_1, Δθ_2, and Δθ_3 necessary to reach the object. See [15] for further description of this hand-eye coordination problem. In this particular problem, 1103 learning samples are used; 1103 samples are used for cross-validation. The optimal number of six hidden units in a single layer [15] is used. Only the cross-validating data is used for evaluating the network.
With the normal network N, in 2.3% of the cases the network got stuck in a minimum with an unacceptably high error for both the learn and test sets. This never occurred in the 3,000 trials in which the linearly augmented network was used. The cross-validated approximation error after 10,000 learning steps equals 4.20 · 10^{-4} for network N, and 4.04 · 10^{-4} for network M (both values are average per learning sample). The new method shows a slight improvement here.
Problem 4: (real data) gear box deformation data. The final test consists of the following problem, which is encountered in the control of a newly developed lightweight robot arm. A gear box connects a DC motor with a robot arm segment. In order to position the robot arm segment at a desired joint angle θ_a, the DC motor has to exert a force τ. However, in the normal setup a joint angle decoder is only available at the DC motor side, which measures the angle θ_m. By mounting an extra decoder at the arm segment in a laboratory setup we can measure θ_a. The actual joint angle θ_a differs slightly from θ_m because of the gear-box elasticity. We attempt to learn θ_a from (θ_m, θ̇_m).
In order to learn these data, we use a network with 3 inputs, 6 hidden units
in a single layer, and one output. The data consists of 4,000 learning samples
as well as 4,000 samples used for the cross-validation. In each run, up to 10,000
learning iterations are performed but not beyond the point of overfitting.
Neither network N nor M gets stuck in a minimum with an unacceptably high error for both the learn and test sets. A surprise, however, is encountered in the cross-validated approximation error that is computed after 10,000 learning steps. For network N, this error equals 2.85 · 10^{-4} per learning sample; for M, however, this error is as low as 1.87 · 10^{-6} per sample. Further data analysis has shown that the data has strong linear components, which explains the fact that the approximation error is two orders of magnitude lower.
Results are summarized in Table 9.1.
Table 9.1. Results of the ordinary feed-forward network N and the linearly augmented feed-forward network M on the four problems

                                     N                M
  XOR        % local minima       22.4              0.0
             avg. steps          189.1             98.3
  tan        % local minima       14.6              6.2
  robot      % local minima        2.3              0.0
  3D data    avg. error      4.20 · 10^{-4}   4.04 · 10^{-4}
  gear box   % local minima        0.0              0.0
  data       avg. error      2.85 · 10^{-4}   1.87 · 10^{-6}
9.5 Conclusion
It has been shown that the ordinary backpropagation learning rule leads to
bad conditioning in the matrix of second-order derivatives of the error function
which is encountered in feed-forward neural network learning. This again leads to
local minima and saddle points in the error landscape. In order to alleviate this
problem, we have introduced an adaptation to the backpropagation rule, which can be implemented as an adapted feed-forward neural network structure. We call this network the linearly augmented feed-forward neural network. The adaptation leads to a learning rule which obtains stable values for the weights which connect the input units with the hidden units, even while the weights from hidden to output units change.
A mathematical analysis shows the validity of the method; in particular, the
universal approximation theorems are shown to remain valid with the new neural
network structure. The application of the new method to two sets of synthetic
data and two sets of real data shows that the new method is much less sensitive
to local minima, and reaches an optimum in fewer iterations.
Acknowledgments. The authors acknowledge the help of Alin Albu-Schäffer
in supplying the gear box data.
References
[1] Aarts, E., Korst, J.: Simulated Annealing and Boltzmann Machines. John Wiley
& Sons (1989)
[2] Battiti, R.: First- and second-order methods for learning: Between steepest descent
and Newton’s method. Neural Computation 4, 141–166 (1992)
[3] Cybenko, G.: Approximation by superpositions of a sigmoidal function. Mathe-
matics of Control, Signals, and Systems 2(4), 303–314 (1989)
[4] DasGupta, B., Siegelmann, H.T., Sontag, E.D.: On the complexity of training neu-
ral networks with continuous activation functions. IEEE Transactions on Neural
Networks 6(6), 1490–1504 (1995)
[5] Fahlman, S.E.: An empirical study of learning speed in back-propagation networks. Technical Report CMU-CS-88-162, Carnegie Mellon University (September 1988)
[6] Funahashi, K.-I.: On the approximate realization of continuous mappings by neural
networks. Neural Networks 2(3), 183–192 (1989)
[7] Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Computation 9(1), 1–42
(1997)
[8] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are
universal approximators. Neural Networks 2(5), 359–366 (1989)
[9] Robbins, H., Monro, S.: A stochastic approximation method. Annals of Mathe-
matical Statistics 22(1), 400–407 (1951)
[10] Saarinen, S., Bramley, R., Cybenko, G.: Ill-conditioning in neural network training problems. SIAM Journal on Scientific Computing 14(3), 693–714 (1993)
[11] Schraudolph, N.N.: On centering neural network weight updates. Technical Report
IDSIA-19-97, IDSIA (April 1997)
[12] Sontag, E.D., Sussmann, H.J.: Backpropagation can give rise to spurious local
minima even for networks without hidden layers. Complex Systems 3(1), 91–106
(1989)
[13] Unnikrishnan, K.P., Venugopal, K.P.: Alopex: A correlation based learning al-
gorithm for feedforward and recurrent neural networks. Neural Computation 6,
469–490 (1994)
[14] van der Smagt, P.: Minimisation methods for training feed-forward networks. Neu-
ral Networks 7(1), 1–11 (1994)
[15] van der Smagt, P.: Visual Robot Arm Guidance using Neural Networks. PhD
thesis, Dept of Computer Systems, University of Amsterdam (March 1995)
[16] Zhang, Q.J., Zhang, Y.J., Ye, W.: Local-sparse connection multilayer networks.
In: Proc. IEEE Conf. Neural Networks, pp. 1254–1257. IEEE (1995)
10
Centering Neural Network Gradient Factors
Nicol N. Schraudolph
IDSIA, Corso Elvezia 36
6900 Lugano, Switzerland
nic@idsia.ch
http://www.idsia.ch/
Abstract. It has long been known that neural networks can learn faster
when their input and hidden unit activities are centered about zero; re-
cently we have extended this approach to also encompass the centering
of error signals [15]. Here we generalize this notion to all factors involved
in the network’s gradient, leading us to propose centering the slope of
hidden unit activation functions as well. Slope centering removes the
linear component of backpropagated error; this improves credit assign-
ment in networks with shortcut connections. Benchmark results show
that this can speed up learning significantly without adversely affecting
the trained network’s generalization ability.
10.1 Introduction
Centering is a general methodology for accelerating learning in adaptive systems
of the type exemplified by neural networks — that is, systems that are typically
nonlinear, continuous, and redundant; that learn incrementally from examples,
generally by some form of gradient descent. Its basic tenet is:
All pattern-dependent factors entering the update equation for a neural
network weight should be centered, i.e., have their average over patterns
subtracted out.
Prior Work. It is well-known that the inputs to an LMS adaptive filter should
be centered to permit rapid yet stable adaptation [22], and it has been argued
[12] that the same applies to input and hidden unit activity in a multi-layer
network. Although Sejnowski [16] proposed a variant of Hebbian learning in
which both the pre- and postsynaptic factors of the weight update are centered,
the idea was not taken up when backpropagation became popular. The benefits of
centering error signals in multi-layer networks were thus reported only recently
[15]; here we finally suggest centering as a general methodology, and present
backpropagation equations in which all factors are centered.
Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN
978-3-540-65311-0 (1998).
Independence of Architecture. Although centering is introduced here in the
context of feedforward networks with sigmoid activation functions, the approach
itself has a far wider reach. The implementation details may vary, but in essence
centering is not tied to any particular architecture: its principles are equally
applicable to feedforward and recurrent networks, with sigmoid, radial, or other
basis functions, with or without topological structure, time delays, multiplicative
gates, etc. — in short, the host of architectural elements used in neural network
design.
Independence of Learning Algorithm. Similarly, centering is not wedded
to any particular learning algorithm either. It may be applied to determinis-
tic (batch) or stochastic (online) gradient descent; more importantly, it may be
freely combined with more sophisticated optimization techniques such as ex-
pectation maximization, conjugate gradient and quasi-Newton methods. It also
leaves available the many useful tricks often employed with stochastic gradient
descent, such as momentum, learning rate adaptation, gradient normalization,
and so forth. Due to this flexibility, centering has the potential to further accel-
erate even the fastest of these methods.
Overview. Section 10.2 introduces the centering approach in terms of the mod-
ifications it mandates for ordinary backpropagation learning. We then discuss
implementation details in Section 10.3 before presenting benchmark results in
Section 10.4. Section 10.5 concludes our paper with a brief analysis of how cen-
tering facilitates faster learning.
10.2 Centered Backpropagation
The backpropagation learning algorithm is characterized by three equations,
describing the forward propagation of activity, the backpropagation of error, and
the modification of weights, respectively. Here are the implications of centering
for each of these three equations:
10.2.1 Activity Propagation
Conventional. Consider a neural network with activation of node j given by

    x_j = f_j(y_j) , \qquad y_j = \sum_{i \in A_j} w_{ij}\, \check{x}_i ,        (10.1)

where f_j is a nonlinear (typically sigmoid) activation function, w_ij are the synaptic weights, and A_j denotes the set of anterior nodes feeding their activity x̌_i into node j. Conventionally, the anterior nodes' output is fed forward directly, i.e., (∀i) x̌_i ≡ x_i. We imply that nodes are activated via Equation 10.1 in appropriately ordered (feedforward) sequence, and that some have their values clamped so as to represent external inputs to the network. In particular, we posit a bias input x_0 ≡ 1 and require that all nodes are connected to it: (∀j > 0) 0 ∈ A_j.
Centered. As suggested by LeCun et al. [12], the activity of the network’s input
and hidden units should be centered to permit faster learning (see Chapter 1).
We do so by setting

    (\forall i > 0) \quad \check{x}_i = x_i - \langle x_i \rangle ,        (10.2)

where ⟨·⟩ denotes averaging over training samples (see Section 10.3 for ways to implement this operator). Note that the bias input must not be centered: since x_0 = ⟨x_0⟩ = 1, it would otherwise become inoperative.
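To make Equation 10.2 concrete, here is a minimal NumPy sketch (mine, not from the original chapter; the array names are illustrative) of centering a batch of unit activities while leaving the bias input untouched:

    import numpy as np

    # X holds the activities of all non-bias units, one row per training pattern
    # (synthetic example data standing in for input or hidden unit outputs)
    X = np.random.rand(100, 5)

    # Eq. 10.2: subtract the per-unit average over the training samples
    X_centered = X - X.mean(axis=0)

    # the bias input stays clamped at 1 and is NOT centered,
    # since centering it would make it identically zero (inoperative)
    bias = np.ones((100, 1))
    X_with_bias = np.hstack([bias, X_centered])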
10.2.2 Weight Modification
Conventional. The weights w_ij of the network given in Equation 10.1 are typically optimized by gradient descent in some objective function E. With a local step size η_ij for each weight, this results in the weight update equation

    \Delta w_{ij} = \eta_{ij}\, \delta_j\, \check{x}_i ,
    \quad \text{where} \quad \delta_j = \partial E / \partial y_j .        (10.3)
Centered. We have recently proposed [15] that the error signals δ_j should be centered as well to achieve even faster convergence. This is done by updating all non-bias weights via

    (\forall i > 0) \quad \Delta w_{ij} = \eta_{ij}\, \check{\delta}_j\, \check{x}_i ,
    \quad \text{where} \quad \check{\delta}_j = \delta_j - \langle \delta_j \rangle .        (10.4)

As before, this centered update must not be used for bias weights, for otherwise they would remain forever stuck (except for stochastic fluctuations) at their present values:

    \langle \Delta w_{0j} \rangle \;\propto\; \langle \check{\delta}_j \rangle
    = \langle \delta_j \rangle - \langle \delta_j \rangle = 0 .        (10.5)

Instead, bias weights are updated conventionally (Equation 10.3). Since this means that the average error ⟨δ_j⟩ is given exclusively to the bias weight w_0j, we have previously called this technique d.c. error shunting [15].
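As an illustration of Equations 10.3-10.5 (a sketch under my own conventions, with column 0 of the activity matrix holding the bias input), non-bias weights are driven by the centered error while the bias weight alone absorbs its average:

    import numpy as np

    def centered_update_for_node(x_check, delta, eta):
        """Batch form of Eqs. 10.3-10.4 for one node j.

        x_check : (P, N+1) activities, column 0 = bias input (all ones),
                  columns 1.. already centered per Eq. 10.2
        delta   : (P,) raw errors dE/dy_j, one per pattern
        eta     : scalar step size (a single learning rate for simplicity)
        """
        delta_check = delta - delta.mean()              # Eq. 10.4: centered error
        dw = np.empty(x_check.shape[1])
        # bias weight: conventional update (Eq. 10.3), so the d.c. error
        # component is shunted to it rather than discarded
        dw[0] = eta * np.mean(delta * x_check[:, 0])
        # all other weights: centered update (Eq. 10.4)
        dw[1:] = eta * np.mean(delta_check[:, None] * x_check[:, 1:], axis=0)
        return dw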
10.2.3 Error Backpropagation
Conventional. For output units, the error δ_j can be computed directly from the objective function; for hidden units, it must be derived through error backpropagation:

    \delta_j = f'_j(y_j)\, \gamma_j ,
    \quad \text{where} \quad \gamma_j = \sum_{k \in P_j} w_{jk}\, \delta_k ,        (10.6)

where P_j denotes the set of posterior nodes fed from node j, and f'_j(y_j) is the node's current slope — the derivative of its nonlinearity f_j at the present level of activation.
Centered. By inserting Equation 10.6 into Equation 10.4, we can express the weight update for hidden units as a triple product of their activity, backpropagated error, and slope:

    \Delta w_{ij} \;\propto\; f'_j(y_j)\, \check{\gamma}_j\, \check{x}_i ,        (10.7)

where γ̌_j denotes backpropagated centered errors. It is not necessary to center the γ̌_j explicitly since

    \langle \check{\gamma}_j \rangle
    = \Big\langle \sum_{k \in P_j} w_{jk}\, \check{\delta}_k \Big\rangle
    = \sum_{k \in P_j} w_{jk}\, \langle \check{\delta}_k \rangle = 0 .        (10.8)
By centering activity and error signals we have so far addressed two of the
three factors in Equation 10.7, leaving the remaining third factor — the node’s
slope — to be dealt with. This is done by modifying the nonlinear part of error
backpropagation (Equation 10.6) to
    \delta_j = \check{f}'_j(y_j)\, \check{\gamma}_j ,
    \quad \text{where} \quad \check{f}'_j(y_j) = f'_j(y_j) - \langle f'_j(y_j) \rangle .        (10.9)
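A hedged sketch of what Equation 10.9 looks like inside a batch backward pass through one layer of tanh hidden units (the layout and names are my own, not the author's code): the slope is centered over the batch before it multiplies the backpropagated error.

    import numpy as np

    def slope_centered_hidden_errors(y_hidden_net, W_out, delta_check_out):
        """Backpropagate centered errors through tanh hidden units with slope centering.

        y_hidden_net    : (P, H) net inputs y_j of the hidden units over a batch
        W_out           : (H, O) hidden-to-output weights w_jk
        delta_check_out : (P, O) centered output errors

        Returns delta_j of Eq. 10.9: (f'_j(y_j) - <f'_j(y_j)>) * gamma_j.
        """
        gamma = delta_check_out @ W_out.T           # Eq. 10.6: backpropagated error
        slope = 1.0 - np.tanh(y_hidden_net) ** 2    # f'_j(y_j) for tanh units
        slope_check = slope - slope.mean(axis=0)    # Eq. 10.9: centered slope
        return slope_check * gamma

Because the centered slope removes the linear component of the backpropagated error, the linear part of the input-output mapping must then be carried by separate shortcut weights, as discussed next.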
Decimation of Linear Errors. Note that for a linear node n we would have f'_n(y_n) ≡ 1, and Equation 10.9 would always yield δ_n ≡ 0. In other words, slope centering (for any node) blocks backpropagation of the linear component of error signals, namely the component which a linear node in the same position would receive.
Viewed in terms of error decimation, we have thus taken the logical next step past
error centering, which removed the d.c. (constant) component of error signals.
Shortcuts. It was important then to have a parameter (the bias weight) to receive and correct the d.c. error component about to be eliminated. Likewise, we now require additional weights to implement the linear mapping from anterior to posterior nodes that the unit in question is itself no longer capable of. Formally, we demand that for each node j for which Equation 10.9 is used, we have

    (\forall i \in A_j) \quad P_j \subseteq P_i .        (10.10)
We refer to connections that bypass a node (or layer) in this fashion as shortcuts.
It has been noted before that neural network learning sometimes improves with
the addition of shortcut weights. In our own experiments (see Section 10.4),
however, we find that it is slope centering that makes shortcut weights genuinely
useful.
A Complementary Approach? In Chapter 9, van der Smagt and Hirzinger
also advocate shortcuts as a means for accelerating neural network learning.
Note, however, that their use of shortcuts is quite different from ours: in order
to improve the conditioning of a neural network, they add shortcut connections
whose weights are coupled to (shared with) existing weights. They thus suit-
ably modify the network’s topology without adding new weight parameters, or
deviating from a strict gradient-based optimization framework. By contrast, we
deliberately decimate the linear component of the gradient for hidden units in
order to focus them on their nonlinear task. We then use shortcuts with ad-
ditional weight parameters to take care of the linear mapping that the hidden
units now ignore.
While both these approaches use shortcuts to achieve their ends, from another
perspective they appear almost complementary: whereas we eliminate the linear
component from our gradient, van der Smagt and Hirzinger in fact add just
such a component to theirs. It may even be possible to profitably combine the
two approaches in a single — admittedly rather complicated — neural network
architecture.
10.3 Implementation Techniques
We can distinguish a variety of approaches to centering a variable in a neural
network in terms of how the averaging operator ⟨·⟩ is implemented. Specifically,
averaging may be performed either exactly or approximately, and applied either
a priori, or adaptively during learning in either batch (deterministic) or online
(stochastic) settings:
centering method:
                             approximate             exact
  a priori                   by design               extrinsic
  adaptive    online         running average
              batch          previous batch          two-pass, single-pass
10.3.1 A Priori Methods
By Design. Some of the benefits of centering may be reaped without any modi-
fication of the learning algorithm, simply by setting up the system appropriately.
For instance, the hyperbolic tangent (tanh) function with its symmetric range from -1 to 1 will typically produce better-centered output than the commonly used logistic sigmoid f(y) = 1/(1 + e^{-y}) ranging from 0 to 1, and is therefore the preferred activation function for hidden units [12]. Similarly, the input representation can (and should) be chosen such that inputs will be roughly centered.
When using shortcuts, one may even choose a priori to subtract a constant (say,
half their maximum) from hidden unit slopes to improve their centering.
We refer to these approximate methods as centering by design. Though inex-
act, they provide convenient and easily implemented tricks to speed up neural
network learning. Regardless of whether further acceleration techniques will be
required or not, it is generally a good idea to keep centering in mind as a design
principle when setting up learning tasks for neural networks.
Extrinsic. Quantities that are extrinsic to the network (i.e., not affected by its weight changes) may often be centered exactly prior to learning. In particular, for any given training set the network's inputs can be centered in this fashion. Even in online settings where the training set is not known in advance, it is sometimes possible to perform such extrinsic centering based upon prior knowledge of the training data: instead of a time series x(t) one might for instance present the temporal difference signal x(t) - x(t-1) as input to the network, which will be centered if x(t) is stationary.
10.3.2 Adaptive Methods
Online. When learning online, the immediate environment of a single weight
within a multi-layer network is highly non-stationary, due to the simultaneous
adaptation of other weights, if not due to the learning task itself. A uniquely
defined average of some signal x(t) to be centered is therefore not available
online, and we must make do with running averages, i.e., smoothed versions of the
signal itself. A popular smoother is the exponential trace

    \bar{x}(t+1) = \alpha\, \bar{x}(t) + (1 - \alpha)\, x(t) ,        (10.11)

which has the advantage of being history-free and causal, i.e., requiring neither past nor future values of x for the present update. The free parameter α (with 0 ≤ α < 1) determines the time scale of averaging. Its choice is not trivial: if it is too small, x̄ will be too noisy; if it is too large, the average will lag too far behind the (drifting) signal.
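The exponential trace of Equation 10.11 is a one-liner; a minimal sketch (the value of alpha is illustrative):

    def exponential_trace(x_bar, x_new, alpha=0.99):
        # Eq. 10.11: x_bar(t+1) = alpha * x_bar(t) + (1 - alpha) * x(t)
        return alpha * x_bar + (1.0 - alpha) * x_new

The online-centered value is then x_new - x_bar; a larger alpha gives a smoother but more sluggish average, a smaller alpha a noisier but more responsive one.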
Note that the computational cost of centering by this method is proportional
to the number of nodes in the network. In densely connected networks, this is
dwarfed by the number of weights, so that the propagation of activities and error
signals through these weights dominates the computation. The cost of online
centering will therefore make itself felt in small or sparsely connected networks
only.
Two-Pass Batch. A simple way to implement exact centering in a batch learn-
ing context is to perform two passes through the training set for each weight
update: the first to calculate the required averages, the second to compute the
resulting weight changes. This obviously may increase the computational cost
of network training by almost a factor of two. For relatively small networks and
training sets, the activity and error for each node and pattern can be stored
during the first pass, so that the second pass only consists of the weight update
(Equation 10.4). Where this is not possible, a feedforward-only first pass (Equa-
tion 10.1) is sufficient to compute average activities and slopes; error centering
may then be implemented via one of the other methods described here.
Previous Batch. To avoid the computational overhead of a two-pass method,
one can use the averages collected over the previous batch in the computation
of weight changes for the current batch. This approximation assumes that the
averages involved do not change too much from batch to batch; this may result in
stability problems in conjunction with very high learning rates. Computationally
this method is quite attractive in that it is cheaper still than the online technique
described above. When mini-batches are used for training, both approaches can
be combined profitably by centering with an exponential trace over mini-batch
averages.
Single-Pass Batch. It is possible to perform exact centering in just a single
pass through a batch of training patterns. This is done by expanding the triple
product of the fully centered batch weight update (cf. Equation 10.7). Using f'_j as a shorthand for f'_j(y_j), we have

    \langle \Delta w_{ij} \rangle \;\propto\;
    \big\langle (x_i - \langle x_i \rangle)\,(\gamma_j - \langle \gamma_j \rangle)\,(f'_j - \langle f'_j \rangle) \big\rangle
    = \langle x_i\, \gamma_j\, f'_j \rangle
      - \langle x_i\, \gamma_j \rangle \langle f'_j \rangle
      - \langle x_i\, f'_j \rangle \langle \gamma_j \rangle
      - \langle \gamma_j\, f'_j \rangle \langle x_i \rangle
      + 2\, \langle x_i \rangle \langle \gamma_j \rangle \langle f'_j \rangle .        (10.12)
In addition to the ordinary (uncentered) batch weight update term ⟨x_i γ_j f'_j⟩ and the individual averages ⟨x_i⟩, ⟨γ_j⟩, and ⟨f'_j⟩, the single-pass centered update (10.12) thus also requires collection of the sub-products ⟨x_i γ_j⟩, ⟨x_i f'_j⟩, and ⟨γ_j f'_j⟩. Due to the extra computation involved, the single-pass batch update is not necessarily more efficient than a two-pass method. It is simplified considerably, however, when not all factors are involved — for instance, when activities have already been centered a priori so that ⟨x_i⟩ ≈ 0.
Note that the expansion technique shown here may be used to derive an exact
single-pass batch method for any weight update that involves the addition (or
subtraction) of some quantity that must be computed from the entire batch of
training patterns. This includes algorithms such as BCM learning [4, 10] and
binary information gain optimization [14].
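As a quick numerical check of the expansion in Equation 10.12 (my own sketch with synthetic data), accumulating the batch averages and sub-products in a single pass reproduces the explicitly centered triple product exactly:

    import numpy as np

    rng = np.random.default_rng(0)
    x, g, f = rng.normal(size=(3, 1000))   # stand-ins for x_i, gamma_j, f'_j over a batch

    # direct evaluation: batch average of the fully centered triple product
    direct = np.mean((x - x.mean()) * (g - g.mean()) * (f - f.mean()))

    # single-pass form of Eq. 10.12, built from averages and pairwise sub-products only
    single_pass = (np.mean(x * g * f)
                   - np.mean(x * g) * f.mean()
                   - np.mean(x * f) * g.mean()
                   - np.mean(g * f) * x.mean()
                   + 2.0 * x.mean() * g.mean() * f.mean())

    assert np.isclose(direct, single_pass)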
10.4 Empirical Results
While activity centering has long been part of backpropagation lore, and empiri-
cal results for error centering have been reported previously [15], slope centering
is being proposed for the first time here. It is thus too early to assess its general
applicability or utility; here we present a number of experiments designed to
show the typical effect that centering has on speed and reliability of convergence
as well as generalization performance in feedforward neural networks trained by
accelerated backpropagation methods.
The next section describes the general setup and acceleration techniques used
in all our experiments. Subsequent sections then present our respective results for
two well-known benchmarks: the toy problem of symmetry detection in binary
patterns, and a difficult vowel recognition task.
10.4.1 Setup of Experiments
Benchmark Design. For each benchmark task we performed a number of
experiments to compare performance with vs. without various forms of centering.
Each experiment consisted of 100 runs starting from different initial weights
but identical in all other respects. For each run, networks were initialized with
random weights from a zero-mean Gaussian distribution with standard deviation
0.3. All experiments were given the same sequence of random numbers for their
100 weight initializations; the seed for this sequence was picked only after the
design of the benchmark had been finalized.
Training Modality. In order to make the results as direct an assessment of cen-
tering as possible, training was done in batch mode so as to avoid the additional
free parameters (e.g., smoothing time constants) required by online methods.
Where not done a priori, centering was then implemented with the exact two-
pass batch method. In addition, we always updated the hidden-to-output weights
of the network before backpropagating error through them. This is known to
sometimes improve convergence behavior [17], and we have found it to increase
stability at the large step sizes we desire.
Competitive Controls. The ordinary backpropagation (plain gradient descent) algorithm has many known defects, and a large number of acceleration techniques have been proposed for it. We informally tested a number of such techniques, then picked the combination that achieved the fastest reliable convergence. This combination, vario-η and bold driver, was then used for all experiments reported here. Thus any performance advantage for centering reported thereafter has been realized on top of a state-of-the-art accelerated gradient method as control.
Vario-η [23, page 48]. This interesting technique, also described in Chapter 17, sets the local learning rate for each weight inversely proportional to the standard deviation of its stochastic gradient. The weight change thus becomes

    \Delta w_{ij} = \frac{\eta\, g_{ij}}{\epsilon + \sigma(g_{ij})} ,
    \quad \text{where} \quad g_{ij} \equiv \frac{\partial E}{\partial w_{ij}}
    \quad \text{and} \quad \sigma(u) \equiv \sqrt{\langle u^2 \rangle - \langle u \rangle^2} ,        (10.13)

with the small positive constant ε preventing division by near-zero values. Vario-η can be used in both batch and online modes, and is quite effective in that it not only performs gradient normalization, but also adapts step sizes to the level of noise in the local gradient signal.
We used vario-η for all experiments reported here, with ε = 0.1. In a batch implementation this leaves only one free parameter to be determined: the global learning rate η.
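A rough batch-mode sketch of the vario-η step of Equation 10.13 (my own reading of the rule; the names and the value of eta are assumptions), dividing each weight's gradient by the standard deviation of its per-pattern contributions:

    import numpy as np

    def vario_eta_step(per_pattern_grads, eta=0.05, eps=0.1):
        """per_pattern_grads: (P, ...) array of per-pattern gradients g_ij."""
        g_mean = per_pattern_grads.mean(axis=0)   # batch gradient <g_ij>
        g_std = per_pattern_grads.std(axis=0)     # sigma(g_ij) over the batch
        # Eq. 10.13: local step size inversely proportional to the gradient's st.d.;
        # eps = 0.1 (the value used in the chapter) guards against division by ~0
        return eta * g_mean / (eps + g_std)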
Bold Driver [11, 21, 2, 3]. This algorithm for adapting the global learning rate
η is simple and effective, but only works for batch learning. Starting from some
initial value, η is increased by a certain factor after each batch in which the error
did not increase by more than a very small constant ε (required for numerical
stability). Whenever the error rises by more than ε, however, the last weight
change is undone, and η decreased sharply.
All experiments reported here were performed using bold driver with a learning rate increment of 2%, a decrement of 50%, and ε = 10^{-10}. These values were found to provide fast, reliable convergence across all experiments. Due to the amount of recomputation they require, we do count the “failed” epochs (whose weight changes are subsequently undone) in our performance figures.
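A simplified sketch of the bold driver schedule as described above (the 2%/50% factors are those quoted in the text; the function signature is my own):

    def bold_driver(eta, prev_error, new_error, eps=1e-10, up=1.02, down=0.5):
        """Adapt the global learning rate after one batch.

        Returns (new_eta, accept): if the batch error rose by more than eps,
        the last weight change should be undone (accept=False) and eta is cut
        sharply; otherwise the change is kept and eta grows slightly.
        """
        if new_error > prev_error + eps:
            return eta * down, False
        return eta * up, True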
10.4.2 Symmetry Detection Problem
In our first benchmark, a fully connected feedforward network with 8 inputs, 8
hidden units and a single output is to learn the symmetry detection task: given
an 8-bit binary pattern at the input, it is to signal at the output whether the
pattern is symmetric about its middle axis (target = 1) or not (target = 0). This
is admittedly a toy problem, although not a trivial one.
Since the target is binary, we used a logistic output unit and cross-entropy
loss function. For each run the network was trained on all 256 possible patterns
until the root-mean-square error of its output over the batch fell below 0.01. We
recorded the number of epochs required to reach this criterion, but did not test
for generalization ability on this task.
Error and Activity Centering. In our first set of experiments we examined
the separate and combined effect of centering the network’s activity and/or error
signals. For convenience, activity centering was performed a priori by using -1
and 1 as input levels, and the hyperbolic tangent (tanh) as activation function
for hidden units. The off-center control experiments were done with 0 and 1 as
input levels and the logistic activation function f(y) = 1/(1 + e^{-y}). Note that
all differences between the tanh and logistic nonlinearities are eliminated by the
vario-η algorithm, except for the eccentricity of their respective outputs.
Results. Table 10.1 shows that centering either activity or error signals pro-
duced an approximate 7-fold increase in convergence speed. In no instance was
a run that used one (or both) of these centering methods slower than the corre-
sponding control without centering. The similar magnitude of the speed-up sug-
gests that it may be due to the improved conditioning of the Hessian achieved
by centering either errors or activities (see Section 10.5). Note, however, that
activity centering beat error centering almost 2/3 of the time in the direct com-
parison.
On the other hand, error centering appeared to improve the reliability of con-
vergence: it cut the convergence time’s coefficient of variation (the ratio between
its standard deviation and mean, henceforth: c.v.) in half while activity centering left it unchanged. We speculate that this may be the beneficial effect of centering on the backpropagated error, which does not occur for activity centering.

Table 10.1. The effect of centering activities and/or error signals on the symmetry detection task without shortcuts. Reported are the empirical mean, standard deviation, and 25th/50th/75th percentile (rounded to three significant digits) of the number of epochs required to converge to criterion. Also shown is the result of directly comparing runs with identical random seeds; the number of runs in each comparison may sum to less than 100 due to ties.

                            error signals: conventional        error signals: centered
  activities:               mean ± st.d.     quartiles         mean ± st.d.    quartiles
  off-center (0/1)          669  ± 308      453/580/852        97.5 ± 21.8     82/95.5/109
  centered (-1/1)           93.1 ± 46.7     67.5/79.5/94       65.4 ± 15.9     57/62/70

  Direct comparison within each row (conventional vs. centered error signals, # of faster runs): 0 vs. 100 with off-center activities, 14 vs. 84 with centered activities.
Finally, a further speedup of 50% (while maintaining the lower c.v.) occurred
when both activity and error signals were centered. This may be attributed to
the fact that our centering of hidden unit activity by design (cf. Section 10.3)
was only approximate. To assess the significance of these effects, note that since
the data was collected over 100 runs, the standard error of the reported mean time to convergence is 1/√100 = 1/10 of its reported standard deviation.
Shortcuts and Slope Centering. In the second set of experiments we left both
activity and error signals centered, and examined the separate and combined
effect of adding shortcuts and/or slope centering. Note that since the complement
of a symmetric bit pattern is also symmetric, the symmetry detection task has no
linear component at all; we would therefore expect shortcuts to be of minimal benefit here.
Results. Table 10.2 shows that indeed adding shortcuts alone was not beneficial; in fact, it slowed down convergence in over 80% of the cases, and significantly
increased the c.v. Subsequent addition of slope centering, however, brought
about an almost 3-fold increase in learning speed, and restored the original c.v.
of about 1/4. When used together, slope centering and shortcuts never increased
convergence time, and on average cut it in half. By contrast, slope centering
without shortcuts failed to converge at all about 1/3 of the time. This may come
as a surprise, considering that the given task had no linear component. However,
consider the following:
Table 10.2. The effect of centering slopes and/or adding shortcuts on the symmetry detection task with centered activity and error signals. Results are shown in the same manner as in Table 10.1.

                            slopes: conventional               slopes: centered
  topology:                 mean ± st.d.    quartiles          mean ± st.d.     quartiles
  no shortcuts              65.4 ± 15.9     57/62/70           51.6 ± 16.2 *    43/64.5/–
  shortcuts                 90.4 ± 31.1     69.5/80/102        33.1 ±  8.6      28/31/35

  Direct comparison within each row (conventional vs. centered slopes, # of faster runs): 52 vs. 48 without shortcuts, 0 vs. 100 with shortcuts.
  * Mean and standard deviation exclude 34 runs which did not converge.
Need for Shortcuts. Due to the monotonicity of their nonlinear transfer func-
tion, hidden units always carry some linear moment, in the sense of a positive
correlation between their net input and output. In the absence of shortcuts,
the hidden units must arrange themselves so that their linear moments together
match the overall linear component of the task (here: zero). This adaptation
process is normally driven by the linear component of the error — which slope
centering removes.
The remaining nonlinear error signals can still jostle the hidden units into
an overall solution, but such an indirect process is bound to be unreliable: as it
literally removes slope from the error surface, slope centering creates numerous
local minima. Shortcut weights turn these local minima into global ones by
modeling the missing (linear) component of the gradient, thereby freeing the
hidden units from any responsibility to do so.
In summary, while a network without shortcuts trained with slope centering
may converge to a solution, the addition of shortcut weights is necessary to ensure
that slope centering will not be detrimental to the learning process. Conversely,
slope centering can prevent shortcuts from acting as redundant “detractors” that
impede learning instead of assisting it. These two techniques should therefore
always be used in conjunction.
10.4.3 Vowel Recognition Problem
Our positive experiences with centering on the symmetry detection task imme-
diately raise two further questions: 1) will these results transfer to more challenging, realistic problems, and 2) is the gain in learning speed, as often happens, bought at the expense of generalization ability? In order to address
these questions, we conducted further experiments with the speaker-independent
vowel recognition data due to Deterding [5], a popular benchmark for which good
generalization performance is rather difficult to achieve.
The Task. The network’s task is to recognize the eleven steady-state vowels
of British English in a speaker-independent fashion, given 10 spectral features
(specifically: LPC-derived log area ratios) of the speech signal. The data consists
of 990 patterns to be classified: 6 instances for each of the 11 vowels spoken by
each of 15 speakers. We follow the convention of splitting it into a training set
containing the data from the first 8 (4 male, 4 female) speakers, and a test set
containing those of the remaining 7 (4 male, 3 female). Note that there is no
separate validation set available.
Prior Work. Robinson [13] pioneered the use of Deterding’s data as a bench-
mark by comparing the performance of a number of neural network architectures
on it. Interestingly, none of his methods could outperform the primitive single
nearest neighbor approach (which misclassifies 44% of test patterns), thus posing
a challenge to the pattern recognition community. Trained on the task as formu-
lated above, conventional backpropagation networks in fact appear to reach their
limits at error rates of around 42% [6, 9], while an adaptive nearest neighbor
technique can achieve 38% [7]. In Chapter 7, Flake reports comparably favorable results for RBF networks as well as his own hybrid architectures. Even better
performance can be obtained by using speaker sex/identity information [19, 20],
or by training a separate model for each vowel [8]. By combining these two ap-
proaches, a test set error of 23% has been reached [18], the lowest we are aware
of to date.
Training and Testing. We trained fully connected feedforward networks with
10 inputs, 22 hidden units, and 11 logistic output units by minimization of cross-
entropy loss. The target was 1 for the output corresponding to the correct vowel,
0 for all others. Activity centering was done a priori by explicitly centering the
inputs (separately for training and test set), and by using the tanh nonlinearity
for hidden units. The uncentered control experiments used the original input
data, and logistic activation functions.
The relatively small size of our networks enabled us to run all experiments
out to 2 000 epochs of training. After each epoch, the network’s generalization
ability was measured in terms of its misclassification rate on the test set. For the
purpose of testing, a maximum likelihood approach was adopted: the network’s
highest output for a given test pattern was taken to indicate its classification of
that pattern.
First Results. Figure 10.1 shows how the average test set error (over 100
runs) evolved during training in each of the 8 experiments we performed for
this benchmark. For all curves, error bars were at most the size of the marks
shown along the curve, and have therefore been omitted for clarity. Following our
experience on the symmetry detection task, shortcuts and slope centering were
always used in conjunction whereas activity and error centering were examined
independently. The following effects can be discerned:
Fig. 10.1. Evolution of the average test set error while learning the vowel recognition
task with activity centering (triangular marks), error centering (filled marks), and/or
slope centering with shortcut weights (solid lines), vs. their uncentered controls. Ex-
periments are denoted a)–h) as in Table 10.3.
1. All experiments with activity centering (triangular marks) clearly outper-
formed all experiments without it (circular marks) in both average conver-
gence speed and minimum average test set error.
2. All experiments with shortcuts and slope centering (solid lines) outperformed
the corresponding experiment without them (dashed lines).
3. With one notable exception (experiment d), error centering (filled marks)
sped up convergence significantly. Its effect was greatest in the experiments
without activity centering.
4. The best experiment in terms of both convergence speed and minimum av-
erage test set error was e), the fully centered one; the worst was a), the fully
conventional one.
The qualitative picture that emerges is that centering appears to significantly
speed up convergence without adversely affecting the trained network’s general-
ization ability. We will now attempt to quantify this finding.
Quantifying the Effect. Since the curves in Figure 10.1 are in fact superpo-
sitions of 100 nonlinear curves each, they are ill-suited to quantitative analysis:
value and location of the minimum average test set error do not tell us anything
about the distribution of such minima across individual runs, not even their average value or location. In order to obtain such quantitative results, we need to
identify an appropriate minimum in test set error for each run. This will allow
us to directly compare runs with identical initial weights across experiments,
as well as to characterize the distribution of minima within each experiment
by aggregate statistics (e.g., mean, standard deviation, quartiles) for both the
minimum test set error, and the time taken to reach it.
A fair and consistent strategy to identify minima suitable for the quantitative
comparisons we have in mind is not trivial to design. Individual runs may have
multiple minima in test set error, or none at all. If we were to just use the global
minimum over the duration of the run (2 000 epochs), we would not be able
to distinguish a fast method which makes some insignificant improvement to a
long-found minimum late in the run from a slow method which takes that long
to reach its first minimum. Given that we are concerned with both the quality
of generalization performance and the speed with which it is achieved, a greedy
strategy for picking appropriate minima is indicated.
Identification of Minima. We follow the evolution of test set error over the
course of each run, noting new minima as we encounter them. If the best value
found so far is not improved upon within a certain period of time, we pick it as the
minimum of that run for the purpose of quantitative analysis. The appropriate
length of waiting period before giving up on further improvement is a difficult
issue — see Chapter 2 for a discussion. For a fair comparison between faster and
slower optimization methods, it should be proportional to the time it took to
reach the minimum in question: a slow run then has correspondingly more time
to improve its solution than a fast one.
Unfortunately this approach fails if a minimum of test set error occurs during
the initial transient, within the first few epochs of training: the waiting period
would then be too short, causing us to give up prematurely. On the other hand,
we cannot wait longer than the overall duration of the run. We therefore stop looking for further improvement in a run after min(2 000, 2t + 100) epochs, where t records when the network first achieved the lowest test set error seen so far in
that run. Only 9 out of the 800 runs reported here expired at the upper limit of
2 000 epochs, so we are confident that its imposition did not significantly skew
our results.
Test Set Used as Validation Set. Note that since we use the test set to deter-
mine at which point to compare performance, we have effectively appropriated
it as a validation set. The minimum test set errors reported below are therefore
not unbiased estimators for the network’s ability to generalize to novel speakers,
and should not be compared to proper measurements of this ability (for which
the test set must not affect the training procedure in any way). Nonetheless, let
us not forget that the lowest test set error does measure the network’s general-
ization ability in a consistent fashion after all: even though these scores are all
biased to favor a particular set of novel speakers (the test set), by no means does
this render their comparison against each other insignificant.
Overview of Results. Table 10.3 summarizes quantitative results obtained in this fashion for the vowel recognition problem. To assess their significance, recall that the standard error in the mean of a performance measure reported here is 1/√100 = 1/10 of its reported standard deviation. Figure 10.2 depicts the same data graphically in the form of cumulative histograms for the minimum test set error and the number of epochs required to reach it.

Table 10.3. Minimum test set error (misclassification rate in %), and the number of epochs required to reach it, for the vowel recognition task. Results are reported in the same manner as in Tables 10.1 and 10.2. Experiments a)–h) combine activity, error, and slope centering (with shortcuts) as in Figure 10.1; a) is the fully conventional and e) the fully centered setup.

                       minimum test set error             epochs required
  experiment           mean ± st.d.   quartiles           mean ± st.d.   quartiles
  a)                   48.0 ± 3.6    45.7/47.3/50.0       554  ± 321     365/486/691
  b)                   49.1 ± 2.9    47.0/49.6/50.9       125  ±  82     67.5/104/163
  c)                   43.9 ± 2.5    42.3/43.9/46.0       156  ± 110     75/137/215
  d)                   44.3 ± 2.3    42.9/44.2/45.9       158  ± 141     72/124/186
  e)                   44.2 ± 2.5    42.3/44.4/46.3       72.4 ± 55.5    37.5/51.5/81
  f)                   44.2 ± 2.8    42.3/44.4/46.1       113  ±  64     69.5/97.5/148
  g)                   46.8 ± 3.7    44.0/47.0/48.9       126  ± 139     64/94/138
  h)                   46.5 ± 3.1    44.5/46.8/48.5       270  ± 164     162/235/316
The results generally confirm the trends observed in Figure 10.1. Runs in the
fully centered experiment e) clearly converged most rapidly and to test set
errors that were among the best. Compared to the conventional setup a), full
centering converged almost 8 times faster on average while generalizing better
80% of the time.
Fig. 10.2. Cumulative histograms for the minimum test set error (left), and the number of epochs required to reach it (right), for the vowel recognition task. Curves are labelled as in Figure 10.1, and marked every 10th percentile.

Generalization Performance. Average misclassification rates on the test set ranged from 44% to 49%, which we consider a fair result given our comparatively small networks. They cluster into three groups: networks with activity centering
achieved around 44%, the two others with shortcuts and slope centering came
in under 47%, while the remaining two only reached 48–49%. The cumulative
histogram (Figure 10.2, left) shows that all activity-centered networks had an
almost identical distribution of minimum test set errors.
Note that centering the inputs changes the task, and that the addition of
shortcuts changes the network topology. It is possible that this, rather than centering per se, accounts for their beneficial effect on generalization. Error
centering was the one feature in our experiments that changed the dynamics
of learning exclusively. Its addition appeared to slightly worsen generalization,
particularly in the absence of other forms of centering. This could be caused by
a reduction (due to centering) of the effective number of parameters in what is
already a rather small model. Such an effect should not overly concern us: one
could easily recoup the lost degrees of freedom by slightly increasing the number
of hidden units for centered networks.
Convergence Speed. All three forms of centering examined here clearly sped
up convergence, both individually and in combination. A slight anomaly ap-
peared in that the addition of error centering in going from experiment c) to d)
had no significant effect on the average number of epochs required. A look at
the cumulative histogram (Figure 10.2, right) reveals that while experiment d)
is ahead between the 20th and 80th percentile, c) had fewer unusually slow runs
than d), and a few exceptionally fast ones.
With the other forms of centering in place, the addition of error centering was
unequivocally beneficial: average convergence time decreased from 113 epochs
in f) to 72.4 epochs in e). The histogram shows that the fully centered e) is far
ahead of the competition through almost the entire percentile range.
Finally, it is interesting to note that the addition of shortcuts and slope cen-
tering, both on their own and to a network with activity and error centering,
roughly doubled the convergence speed — the same magnitude of effect as ob-
served on the symmetry detection task.
10.5 Discussion
The preceding section has shown that centering can indeed have beneficial effects
on the learning speed and generalization ability of a neural network. Why is this
so? In what follows, we offer an explanation from three (partially overlapping)
perspectives, considering in turn the effect of centering on the condition number
of the Hessian, the level of noise in the local gradient, and the credit assignment
between different parts of the network.
Conditioning the Hessian. It is well known that the minimal convergence
time for first-order gradient descent on quadratic error surfaces is inversely re-
lated to the condition number of the Hessian matrix, i.e., the ratio between its
largest and its smallest eigenvalue. A common strategy for accelerating gradient
descent is therefore to seek to improve the condition number of the Hessian.
For a single linear node y = w^T x with squared loss function, the Hessian is simply the covariance matrix of the inputs: H = ⟨x x^T⟩. Its largest eigenvalue is typically caused by the d.c. component of x [12]. Centering the inputs removes that eigenvalue, thus conditioning the Hessian and permitting larger step sizes (cf. Chapter 1). For batch learning, error centering has exactly the same effect on the local weight update:

    \Delta w \;\propto\; \langle (\delta - \langle \delta \rangle)\, x \rangle
    = \langle \delta\, x \rangle - \langle \delta \rangle \langle x \rangle
    = \langle \delta\, (x - \langle x \rangle) \rangle .        (10.14)

Error centering does go further than activity centering, however, in that it also affects the error backpropagated to anterior nodes. Moreover, Equation 10.14 does not hold for online learning, where the gradient is noisy.
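The eigenvalue argument is easy to reproduce numerically (my own illustration with synthetic inputs): a d.c. offset on the inputs creates one dominant eigenvalue of ⟨xx^T⟩, and subtracting the mean removes it.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10000, 5)) + 3.0      # unit-variance inputs with a large d.c. component

    H_raw = (X.T @ X) / len(X)                 # H = <x x^T> for a single linear node
    H_centered = np.cov(X, rowvar=False, bias=True)   # the same after mean subtraction

    print(np.linalg.eigvalsh(H_raw)[-1])       # largest eigenvalue, dominated by the d.c. term
    print(np.linalg.eigvalsh(H_centered)[-1])  # largest eigenvalue now close to 1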
Noise Reduction. It can be shown that centering improves the signal-to-noise
ratio of the local gradient. Omitting the slope factor for the sake of simplicity,
consider the noisy weight update
    \Delta w_{ij} \;\propto\; (\delta_j + \phi)(x_i + \xi)
    = \delta_j x_i + \xi\, \delta_j + \phi\, x_i + \phi\, \xi ,        (10.15)

where φ and ξ are the noise terms, presumed to be zero-mean, and independent of activity, error, and each other. In the expansion on the right-hand side, the first term is the desired (noise-free) weight update while the others represent noise that contaminates it. While the last (pure noise) term cannot be helped, we can reduce the variance of the two mixed terms by centering δ_j and x_i so as to minimize ⟨δ_j^2⟩ and ⟨x_i^2⟩, respectively.
One might of course contend that in doing so, we are also shrinking the signal δ_j x_i, so that in terms of the signal-to-noise ratio we are no better off than before (in fact, worse). This cuts right to the heart of the matter, for centering rests upon the notion that the error signal relevant to a non-bias, non-shortcut weight is the fully centered weight update, and that any d.c. components in δ_j x_i should therefore also be regarded as a form of noise. This presumption can of course be maintained only because we do have bias and shortcut weights to address the error components that centering removes.
Improved Credit Assignment. From the perspective of a network that has
these additional parameters, then, centering is a way to improve the assignment
of responsibility for the network’s errors: constant errors are shunted to the bias
weights, linear errors to the shortcut weights, and the remainder of the network
is bothered only with those parts of the error signal that actually require a
nonlinearity. Centering thus views hidden units as a scarce resource that should
only be called upon where necessary. Given the computational complications that
arise in the training of nonlinear nodes, we submit that this is an appropriate
and productive viewpoint.
Future Work. While the results reported here are quite promising, more ex-
periments are required to assess the general applicability and effectiveness of
centering. For feedforward networks, we would like to explore the use of cen-
tering with multiple hidden layers, stochastic (online) gradient descent, and for
function approximation (rather than classification) problems. The centering ap-
proach per se, however, is rather more general than that, and so further ahead we
anticipate its application to a range of more sophisticated network architectures,
learning algorithms, and problem domains.
Acknowledgments. I would like to thank the editors of this book as well as
my colleagues Jürgen Schmidhuber, Marco Wiering, and Rafał Sałustowicz for
their helpful comments. This work was supported by the Swiss National Science
Foundation under grant numbers 2100–045700.95/1 and 2000–052678.97/1.
References
[1] Anderson, J., Rosenfeld, E. (eds.): Neurocomputing: Foundations of Research.
MIT Press, Cambridge (1988)
[2] Battiti, R.: Accelerated back-propagation learning: Two optimization methods.
Complex Systems 3, 331–342 (1989)
[3] Battiti, R.: First- and second-order methods for learning: Between steepest descent
and Newton’s method. Neural Computation 4(2), 141–166 (1992)
[4] Bienenstock, E., Cooper, L., Munro, P.: Theory for the development of neuron se-
lectivity: Orientation specificity and binocular interaction in visual cortex. Journal
of Neuroscience 2 (1982); Reprinted in [1]
[5] Deterding, D.H.: Speaker Normalisation for Automatic Speech Recognition. PhD
thesis, University of Cambridge (1989)
[6] Finke, M., Müller, K.-R.: Estimating a-posteriori probabilities using stochastic
network models. In: Mozer, M.C., Smolensky, P., Touretzky, D.S., Elman, J.L.,
Weigend, A.S. (eds.) Proceedings of the 1993 Connectionist Models Summer
School, Boulder, CO. Lawrence Erlbaum Associates, Hillsdale (1994)
[7] Hastie, T.J., Tibshirani, R.J.: Discriminant adaptive nearest neighbor classifi-
cation. IEEE Transactions on Pattern Analysis and Machine Intelligence 18(6),
607–616 (1996)
[8] Herrmann, M.: On the merits of topography in neural maps. In: Kohonen, T.
(ed.) Proceedings of the Workshop on Self-Organizing Maps, pp. 112–117. Helsinki
University of Technology (1997)
[9] Hochreiter, S., Schmidhuber, J.: Feature extraction through lococode. Neural
Computation (1998) (to appear)
[10] Intrator, N.: Feature extraction using an unsupervised neural network. Neural
Computation 4(1), 98–107 (1992)
[11] Lapedes, A., Farber, R.: A self-optimizing, nonsymmetrical neural net for content
addressable memory and pattern recognition. Physica D 22, 247–259 (1986)
[12] LeCun, Y., Kanter, I., Solla, S.A.: Eigenvalues of covariance matrices: Application
to neural-network learning. Physical Review Letters 66(18), 2396–2399 (1991)
[13] Robinson, A.J.: Dynamic Error Propagation Networks. PhD thesis, University of
Cambridge (1989)
[14] Schraudolph, N.N., Sejnowski, T.J.: Unsupervised discrimination of clustered data
via optimization of binary information gain. In: Hanson, S.J., Cowan, J.D., Giles,
C.L. (eds.) Advances in Neural Information Processing Systems, vol. 5, pp. 499–
506. Morgan Kaufmann, San Mateo (1993)
[15] Schraudolph, N.N., Sejnowski, T.J.: Tempering backpropagation networks: Not
all weights are created equal. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E.
(eds.) Advances in Neural Information Processing Systems, vol. 8, pp. 563–569.
MIT Press, Cambridge (1996)
[16] Sejnowski, T.J.: Storing covariance with nonlinearly interacting neurons. Journal
of Mathematical Biology 4, 303–321 (1977)
[17] Shah, S., Palmieri, F., Datum, M.: Optimal filtering algorithms for fast learning
in feedforward neural networks. Neural Networks 5, 779–787 (1992)
[18] Tenenbaum, J.B., Freeman, W.T.: Separating style and content. In: Mozer, M.C.,
Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Sys-
tems, vol. 9, pp. 662–668. The MIT Press, Cambridge (1997)
[19] Turney, P.D.: Exploiting Context When Learning to Classify. In: Brazdil, P.B.
(ed.) ECML 1993. LNCS, vol. 667, pp. 402–407. Springer, Heidelberg (1993)
[20] Turney, P.D.: Robust classification with context-sensitive features. In: Proceedings
of the Sixth International Conference on Industrial and Engineering Applications
of Artificial Intelligence and Expert Systems, pp. 268–276 (1993)
[21] Vogl, T.P., Mangis, J.K., Rigler, A.K., Zink, W.T., Alkon, D.L.: Accelerating the
convergence of the back-propagation method. Biological Cybernetics 59, 257–263
(1988)
[22] Widrow, B., McCool, J.M., Larimore, M.G., Johnson Jr., C.R.: Stationary and
nonstationary learning characteristics of the LMS adaptive filter. Proceedings of
the IEEE 64(8), 1151–1162 (1976)
[23] Zimmermann, H.G.: Neuronale Netze als Entscheidungskalkül. In: Rehkugler, H.,
Zimmermann, H.G. (eds.) Neuronale Netze in der Ökonomie: Grundlagen und
finanzwirtschaftliche Anwendungen, pp. 1–87. Vahlen Verlag, Munich (1994)
11
Avoiding Roundoff Error in Backpropagating
Derivatives
Tony Plate
School of Mathematical and Computing Sciences,
Victoria University, Wellington, New Zealand
tap@mcs.vuw.ac.nz
http://www.mcs.vuw.ac.nz/~tap/
Abstract. One significant source of roundoff error in backpropagation
networks is the calculation of derivatives of unit outputs with respect
to their total inputs. The roundoff error can result in high relative error in derivatives, and in particular, in derivatives being calculated to be zero when in fact they are small but non-zero. This roundoff error
is easily avoided with a simple programming trick which has a small
memory overhead (one or two extra floating point numbers per unit)
and an insignificant computational overhead.
11.1 Introduction
Backpropagating derivatives is an essential part of training multilayer networks.
Accuracy of these derivatives is important to many training methods, especially
ones which use second-order information, such as conjugate gradients. The stan-
dard formula for backpropagating error derivatives (eg., as given in Ripley [3],
Bishop [1], and Rumelhart, Hinton, and Williams [4]) use floating point arith-
metic in such a way that can result in significant roundoff error. In particular,
small derivatives are rounded off to zero. These errors can cause second-order
methods to become confused (because of inaccurate gradients) and can also cause
the weight search to stop or be very slow because some derivatives are calculated
to be zero when in fact they are small but non-zero. This chapter explains how
this particular source of roundoff error can be avoided simply and cheaply. The
method applies to both logistic and tanh units, and to the sum-squared and
cross-entropy error functions.
In this chapter, the symbol = is used to denote mathematical equality, and the symbol ← is used to denote an assignment to some floating-point variable. Floating point values are denoted by an asterisk, e.g., x*_i is the floating point version of x_i.
11.2 Roundoff Error in Sigmoid Units
Consider a non-input unit whose output y_i is computed as the logistic function of its total input x_i:

    y_i = 1 / (1 + exp(−x_i))   and   y*_i ← 1 / (1 + exp(−x*_i)).
In the backpropagation phase of training, we need to calculate the partial derivative of the error with respect to the total input. This is done using the chain rule:

    ∂E/∂x_i = (∂E/∂y_i) (∂y_i/∂x_i).
The standard way of calculating ∂y_i/∂x_i is to use the formula relating it to y_i, which for the logistic function is

    (∂y_i/∂x_i)* ← y*_i (1 − y*_i).
This computation is a potential source of roundoff error: if the actual value of y_i is so close to one that y*_i is exactly one, then (∂y_i/∂x_i)* will equal zero. Figure 11.1 shows the values of this expression calculated in single and double precision floating point arithmetic. In single precision, when x_i is greater than about 17.33, y*_i (1 − y*_i) evaluates to zero. For x_i values slightly lower than 17.33 there is significant quantization. In double precision, y*_i (1 − y*_i) evaluates to zero when x_i is greater than about 36.74.
Fig. 11.1. Quantization in calculated values of unit derivatives (∂y/∂x) for logistic units, computed using the formula y*(1 − y*) where y* ← 1/(1 + exp(−x*)); left panel: single precision, right panel: double precision. Roundoff error causes quantization on the right hand side of each plot.
Note that in Figure 11.1 the roundoff error due to the calculation of 1 − y* only occurs for positive x. For negative x, y* approaches zero, is accurately represented, and the relative roundoff error in 1 − y* is insignificant. This provides a clue as to how to avoid the roundoff error for positive x: don’t compute 1 − y* when y* is close to one. Indeed, these roundoff errors can be avoided entirely by storing an extra value with each unit, which is z_i = 1 − y_i, calculated accurately in floating point arithmetic as follows:

    y*_i ← 1 / (1 + exp(−x*_i))   and   z*_i ← exp(−x*_i) / (1 + exp(−x*_i)).
Together, these two floating point numbers accurately represent the unit output at its extremes: y*_i is an accurate representation when the output is close to zero, and z*_i is an accurate representation of one minus the output when the output is close to one. With them, ∂y_i/∂x_i is simply and accurately calculated as follows:

    (∂y_i/∂x_i)* ← y*_i z*_i.
Implementing these calculations requires the storage of one extra floating point number per unit (i.e., z*_i), and the computation of one extra division per unit. This extra resource usage is insignificant because the total resource requirements for a forward- and back-propagation pass are proportional to the number of weights in the network, which is usually of the order of the square of the number of units.
The value y*_i is also used in calculating partial derivatives for weights on connections emerging from unit i. However, the errors in the representations of values of y_i close to one do not cause high relative errors in these weight derivatives (except in special circumstances where derivatives from different examples cancel each other, but there is no simple remedy in these rare situations). Hence, it is generally safe to use just y*_i for these calculations.
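To make the trick concrete, here is a small NumPy sketch (ours, not from the original chapter; the function names are made up). It computes y*, z* and the derivative as above, and the branch on the sign of x additionally keeps the exponential from overflowing for large negative inputs:

import numpy as np

def logistic_forward(x):
    """Compute y = 1/(1+exp(-x)) and z = 1 - y without cancellation.

    Both quantities are formed directly from exp(-|x|), so whichever of y or z
    is tiny keeps full relative precision.
    """
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(-np.abs(x))            # always <= 1, never overflows
    y_pos = 1.0 / (1.0 + e)           # accurate y for x >= 0
    z_pos = e / (1.0 + e)             # accurate 1 - y for x >= 0
    y = np.where(x >= 0, y_pos, z_pos)
    z = np.where(x >= 0, z_pos, y_pos)
    return y, z

def logistic_dydx(y, z):
    """dy/dx = y*(1-y), computed as y*z so that 1 - y never rounds to zero."""
    return y * z

# Naive y*(1-y) underflows to exactly 0 for large x; y*z does not.
y, z = logistic_forward(40.0)
print(y * (1.0 - y))        # 0.0 (roundoff)
print(logistic_dydx(y, z))  # ~4.2e-18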
11.2.1 Sum-Squared Error Computations
Roundoff errors can also occur in the calculation of errors and their derivatives. This is unlikely to be important if targets are rounded-off versions of true targets, e.g., targets like 0.3129871 or 0.9817523, because such targets have as much roundoff error as the values of unit outputs.
However, if sum-squared error is used, and targets are all 0 or 1 and are accurate (i.e., 1 is not a rounded-off version of 0.99999999 or 1.00000001), and unit i computes a logistic function of its total input, then the following formulas can be used (where t_i is the target for unit i, E = ½ (t_i − y_i)², and z*_i = 1 − y*_i, calculated as before):

    E* ← ½ (t_i − y*_i)²            if y*_i ≤ 0.5
         ½ ((t_i − 1) + z*_i)²       if y*_i > 0.5

    (∂E/∂y_i)* ← y*_i − t_i          if y*_i ≤ 0.5
                 (1 − t_i) − z*_i     if y*_i > 0.5

These formulas could be split into a greater number of simpler cases if t_i were always 0 or 1, but as they are they are correct for general values of t_i and accurate when t_i is 0 or 1.
11.2.2 Single Logistic-Output Cross-Entropy Computations
If the network has a single output unit (unit i) which represents probability in a two-class classification problem, then the cross-entropy error function [1] for one example is as follows:

    E = −[t_i log y_i + (1 − t_i) log(1 − y_i)].
This error function is most appropriately used with a logistic function on the output unit. In this case, errors and partial derivatives can be calculated from the following formulas, and the calculations will be accurate for t_i = 0 or 1 (where z*_i = 1 − y*_i, calculated as before). To get accurate results, it is necessary to use the function log1p(x), which computes log(1 + x) accurately for tiny x and is available in Unix math libraries.
    E* ← −[t_i log y*_i + (1 − t_i) log1p(−y*_i)]     if y*_i ≤ 0.5
         −[t_i log1p(−z*_i) + (1 − t_i) log z*_i]      if y*_i > 0.5

    (∂E/∂x_i)* ← y*_i − t_i            if y*_i ≤ 0.5
                 (1 − t_i) − z*_i       if y*_i > 0.5
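The following sketch (ours) shows one way these formulas might be coded for a single logistic output, using NumPy's log1p; zero-weighted terms are skipped so that 0 · log(0) never arises:

import numpy as np

def xent_logistic(t, y, z):
    """Cross-entropy E = -t log(y) - (1-t) log(1-y) and dE/dx for one logistic unit.

    Expects y and z = 1 - y computed accurately. log1p keeps full precision in
    log(1-y) when y is tiny and in log(y) = log1p(-z) when y is close to one.
    """
    E = 0.0
    if t != 0.0:
        E -= t * (np.log(y) if y <= 0.5 else np.log1p(-z))
    if t != 1.0:
        E -= (1.0 - t) * (np.log1p(-y) if y <= 0.5 else np.log(z))
    dE_dx = (y - t) if y <= 0.5 else (1.0 - t) - z
    return E, dE_dx

# With x = 40 the naive dE/dx = y - t rounds to zero; the z-based form does not.
e = np.exp(-40.0)
y, z = 1.0 / (1.0 + e), e / (1.0 + e)
print(xent_logistic(1.0, y, z))   # (~4.2e-18, ~-4.2e-18)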
11.2.3 Other Approaches to Avoiding Zero-Derivatives with the
Logistic Function
Fahlman [2] suggested adding 0.1 to ∂y*_i/∂x*_i in order to decrease learning time by eliminating flat spots in the training surface (i.e., spots with small derivatives),
and also to avoid zero-derivatives due to roundoff. In Fahlman’s experiments this
technique improved the learning time. However, it also makes the derivatives
incorrect with respect to the error function. This may not matter in networks
where the actual output numbers do not mean very much other than “high” or
“low”. However, in networks where achieving accurate estimates of the targets
is important, e.g., where the targets are probabilities, or targets are continuous
values in a modeling task, this technique is undesirable as it causes the gradient
search to not minimize the error, but to minimize some other quantity instead.
Another technique sometimes used with neural networks is to transform tar-
gets from the range [0, 1] to [0.1, 0.9]. Again, a secondary motivation is to avoid
zero-derivatives, and again, this technique should not be used where actual out-
put values have any more significance than “high” or “low”.
11.3 Softmax and Cross-Entropy Computations
In networks which must classify instances into one of k mutually exclusive classes,
cross-entropy is often used as the error measure, together with the softmax
output function. The softmax output function allows the outputs of the k output
units to be interpreted as probabilities: it forces each output to be between zero
and one, and forces their total to be one. Assuming that output units are numbered 1 to k, the equations for softmax and cross-entropy (for one example) are as follows:

    y_i = exp(x_i) / Σ_{j=1}^{k} exp(x_j)   and   E = −Σ_{j=1}^{k} t_j log y_j.
The derivative of the error with respect to x has a very simple form:

    ∂E/∂x_i = y_i − t_i.
This equation is subject to high relative error due to roundoff when t_i is exactly one and y_i is close to one. This can happen often in training neural nets, and can lead to derivatives being calculated as zero when in fact they are just small.
This roundoff error, and also possible overflow in the computation of y*_i, can be avoided by using the following computations, which use extra floating point variables to store the values of z_i = 1 − y_i and x′_i = x_i − max_j x_j:

    m ← max_j x*_j
    x′*_i ← x*_i − m
    Q ← Σ_j exp(x′*_j)
    y*_i ← (1/Q) exp(x′*_i)

    z*_i ← 1 − y*_i                       if y*_i ≤ 0.5
           (1/Q) Σ_{j≠i} exp(x′*_j)        if y*_i > 0.5

    (∂E/∂x_i)* ← y*_i − t_i               if y*_i ≤ 0.5
                 (1 − t_i) − z*_i          if y*_i > 0.5

    E*_j ← −t_j log(y*_j)                  if y*_j ≤ 0.5
           −t_j log1p(−z*_j)               if y*_j > 0.5
The space overhead is low: at most two extra floating point variables per output unit, depending on the implementation. The time overhead is also low. The only lengthy additional computation is the second alternative for z*_i, which can be performed for at most one i because it is impossible for more than one y_i to be greater than 0.5.
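A possible NumPy rendering of these computations (ours; the names and the exact handling of the error terms are our choices) is:

import numpy as np

def softmax_xent_stable(x, t):
    """Softmax outputs, cross-entropy E = -sum_j t_j log y_j, and dE/dx_i.

    Works with x' = x - max(x) to avoid overflow, and keeps z_i = 1 - y_i
    accurately for the (at most one) unit with y_i > 0.5, so that
    dE/dx_i = y_i - t_i does not round off to zero.
    """
    x = np.asarray(x, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    xs = x - x.max()                    # x'_i = x_i - m
    e = np.exp(xs)
    Q = e.sum()
    y = e / Q
    # z_i = 1 - y_i, computed without cancellation when y_i > 0.5
    z = np.where(y > 0.5, (Q - e) / Q, 1.0 - y)
    dE_dx = np.where(y > 0.5, (1.0 - t) - z, y - t)
    # error terms: log1p(-z_j) = log(y_j), accurate when y_j is close to 1
    E = -sum(t[j] * (np.log1p(-z[j]) if y[j] > 0.5 else np.log(y[j]))
             for j in range(len(y)) if t[j] != 0.0)
    return y, E, dE_dx

# The true class output saturates; the naive gradient y - t would be exactly 0.
y, E, g = softmax_xent_stable(np.array([40.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0]))
print(E, g)   # E ~ 8.5e-18; no gradient entry rounds to exactly zero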
11.4 Roundoff Error in Tanh Units
The same ideas apply to the calculation of derivatives in tanh units, which are described by the following equations:

    y_i = (1 − exp(−2x_i)) / (1 + exp(−2x_i))   and   ∂y_i/∂x_i = (1 − y_i)(1 + y_i).

For tanh units, the output y_i can take on values between −1 and 1. Hence, to represent outputs accurately at the extremes we need two extra floating point numbers
to store the values z_i = 1 − y_i and u_i = 1 + y_i. Then the following expressions evaluated in floating point arithmetic will avoid unnecessary roundoff error:

    v*_i ← exp(−2x*_i)
    u*_i ← 2 / (1 + v*_i)
    z*_i ← v*_i u*_i
    y*_i ← u*_i − 1
    (∂y_i/∂x_i)* ← z*_i u*_i
The tanh function is not often used on output units, but if desired, formulas for
accurately calculating errors and derivatives when targets are always 1 or -1 are
easily derived.
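For completeness, a small sketch (ours) of the corresponding tanh computations; the sign branch is our addition so that the exponential never overflows:

import numpy as np

def tanh_forward(x):
    """Compute y = tanh(x) together with z = 1 - y and u = 1 + y accurately.

    Both z and u are formed directly from exp(-2|x|), so whichever of them is
    tiny keeps full relative precision at either extreme of the output.
    """
    x = np.asarray(x, dtype=np.float64)
    v = np.exp(-2.0 * np.abs(x))      # <= 1, never overflows
    small = 2.0 * v / (1.0 + v)       # 1 - |y|, accurate when |y| is near 1
    big = 2.0 / (1.0 + v)             # 1 + |y|
    z = np.where(x >= 0, small, big)   # z = 1 - y
    u = np.where(x >= 0, big, small)   # u = 1 + y
    y = np.where(x >= 0, 1.0 - small, small - 1.0)
    return y, z, u

def tanh_dydx(z, u):
    """dy/dx = (1 - y)(1 + y), computed as z*u to avoid rounding to zero."""
    return z * u

y, z, u = tanh_forward(20.0)
print((1.0 - y) * (1.0 + y))   # 0.0 due to roundoff
print(tanh_dydx(z, u))         # ~1.7e-17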
11.5 Why Bother?
Since the derivatives which are affected by roundoff error are very small, the pro-
posal to calculate them accurately might provoke the response “why bother?” For
on-line learning methods (stochastic gradient), there is probably no point, as com-
puting small gradients to high accuracy is unlikely to make any difference. How-
ever, for nets trained in batch mode with second-order methods such as conjugate
gradients, small derivatives can be quite important, for the following reasons:
1. Quantization in error or derivatives can confuse line search routines, so avoid-
ing quantization is a good thing.
2. Many small derivatives can add up.
3. Computing small but non-zero derivatives allows training methods to con-
tinue as opposed to stopping because of zero derivatives. Some training meth-
ods can make significant progress with small derivatives, so it is possible that
the weights will move out of the flat area of the error function.
Indeed, there is little reason not to compute these values accurately, as the extra
storage and computations are insignificant.
References
[1] Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press
(1995)
[2] Fahlman, S.E.: Fast-learning variations on back-propagation: An empirical study.
In: Touretzky, D., Hinton, G., Sejnowski, T. (eds.) Proceedings of the 1988 Connec-
tionist Models Summer School, pp. 38–51. Morgan Kaufmann, San Mateo (1989)
[3] Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University
Press (1995)
[4] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations
by error propagation. In: Parallel Distributed Processing: Explorations in the Mi-
crostructure of Cognition, vol. 1, pp. 318–362. MIT Press, Cambridge (1986)
Representing and Incorporating Prior Knowledge
in Neural Network Training
Preface
The present section focuses on tricks for four important aspects in learning: (1)
incorporation of prior knowledge, (2) choice of representation for the learning
task, (3) unequal class prior distributions, and finally (4) large network training.
Patrice Simard, et al. review the famous tangent distance and tangent
propagation algorithms. Included are tricks for speeding up the algorithms and increasing their stability that were not published in previous work. One important trick for
obtaining good performance in their methods is smoothing (see p. 249) of the
input data, which is illustrated for 2D handwritten character recognition images.
To obtain the tangent we must compute the derivative; however, this obviously
cannot be calculated for discrete images. To compute tangent vectors, it must
be possible to interpolate between the pixel values of the images (or the features
of the images). Since many interpolations are possible, a “smoothing” regular-
izer is used because it has the extra benefit of imposing some control on the
locality of the transformation invariance. However, care must be taken not to
over-smooth (or useful features are washed away). Another trick is to use the
so-called elastic tangent distance (see p. 249) which eliminates the problem
of singular systems which arises when there is zero distance between two pat-
terns, or when their tangent vectors are parallel. Finally, through a very refined
hierarchy of resolution and accuracy (see p. 251) the tangent distance algo-
rithm can be sped up by two or three orders of magnitude over a straightforward
implementation.
Tangentprop takes the invariance idea even further by making it possible to
incorporate local or global invariances and prior knowledge directly into the loss
function of backpropagation and to backpropagate tangents efficiently (see p.
259). The chapter concludes with a digression on the mathematical background
(Lie groups) and the simplifications that follow from the theory of this very
successful (record breaking OCR) algorithm.
In the next chapter Larry Yaeger, et al. present an entire collection of tricks
for an application that also is of strong commercial interest, on-line handwritten
character recognition. These tricks were used by the authors to develop the rec-
ognizer in Apple Computer’s Newton MessagePad® and eMate® products.
Their recognizer consists of three main components: a segmenter, a classifier,
and a context driven search component. Of particular interest is the intricate
representation of stroke information (see p. 274) that serves as input to
the classifier. Prior knowledge has led the authors to include specific stroke
features, grayscale character images, and coarse character attributes (stroke
count and character aspect ratio), all combined into an ensemble of networks.
The ensemble decisions are then combined in an algorithmic search through a hy-
pothesis space of dictionaries and combinations of dictionaries comprising a
broad coverage, weakly applied language model.
An extensive set of tricks for training the network is also discussed, including:
Normalizing output error: Assist output activities of secondary choices to
have non-zero values (NormOutErr). This enhances the robustness of the
subsequent integrated search procedure since the search can now also take
the n best choices into account (p. 276).
Negative training: Reduce the effect of invalid character segmentation (p.
278).
Stroke warping: Generate randomly warped versions of patterns to increase
the generalization ability (p. 279).
Frequency balancing: Reduce the problem of unbalanced class priors using
the simple trick of repeating low frequency classes more often in order to
force the network to allocate more resources to these cases (p. 280).
Error emphasis: Account for different and uncommon writing styles by pre-
senting difficult or unusual patterns more often (p. 281).
Quantized weights: Enable the neural network classifier to run with only one-
byte weights and train with a temporary additional two bytes (p. 282).
In chapter 14, Steve Lawrence, et al. discuss several different tricks for alleviating
the problem of unbalanced class prior probabilities. Some theoretical explana-
tions are also provided. In the first trick, prior scaling (p. 296), weight updates
are scaled so that the total expected update for each class is equal. The second
trick, probabilistic sampling (p. 298), slightly modifies the frequency balanc-
ing in chapter 13. Here, a class is first chosen and then from within this class
a sample is drawn. The next trick is referred to as post scaling (p. 298). The
network is trained as usual but the network outputs are rescaled after training.
This method can also be used to optimize other criteria that are different from
the loss function the network has been trained with. Finally the authors propose
to equalize class memberships (p. 299) by either subsampling the class with
higher frequency or by duplicating patterns from the class with lower frequency
(however they report that this trick works least efficiently). The effectiveness of
each of these tricks is examined and compared for an ECG classification problem.
Training problems with thousands of classes and millions of examples, as
are common for speech and handwritten character recognition problems, pose a
major challenge. While many of the training techniques discussed so far work well
for moderate size nets, they can fail miserably for these extremely large problems.
In chapter 15, Jürgen Fritsch and Michael Finke design a representation and
architecture for such large scale learning problems and, like the previous two
chapters, they also tackle the problem of unbalanced class priors (since not all
of the 24k subphonemes are equally probable). They exemplify their approach
by building a large vocabulary speech recognizer. In the first step they break
down the task into a hierarchy of smaller decision problems of controllable size
(divide and conquer, p. 313) and estimate the conditional probabilities for
each node of the decision tree with a neural network. The network training uses
mini batches and individual adaptive learning rates that are increased
if progress is made in weight space and decreased if the fluctuations in weight
space are too high (p. 332). These estimated probabilities modeled by every
single neural network node are combined to give an overall estimate of the
class decision probabilities.
The authors either determine the decision tree structure manually or esti-
mate it by their ACID clustering algorithm (p. 324). Interestingly, the man-
ual structure design was outperformed by the proposed agglomerative clustering
scheme. No doubt that prior knowledge helps to achieve better classification re-
sults. However, this astonishing result indicates that human prior knowledge,
although helpful in general, is suboptimal for structuring such a large task,
particularly since automatic clustering allows for fine-grain subdivision of the
classification task and aims for uniformity of priors. This desirable goal is hardly
achievable by manual construction of the classification hierarchy. Furthermore,
the human prior knowledge also does not provide the best basis from which a
machine learning algorithm can learn optimally, a fact that is important to keep
in mind for other applications as well.
Jenny & Klaus
12
Transformation Invariance in Pattern Recognition
– Tangent Distance and Tangent Propagation
Patrice Y. Simard¹, Yann A. LeCun¹, John S. Denker¹, and Bernard Victorri²
¹ Image Processing Services Research Lab, AT&T Labs - Research,
100 Schulz Drive, Red Bank, NJ 07701-7033, USA
patrice@research.att.com
http://www.research.att.com/info/patrice
² CNRS, ELSAP, ENS, 1 rue Maurice Arnoux, F-92120 MONTROUGE, France
Abstract. In pattern recognition, statistical modeling, or regression,
the amount of data is a critical factor affecting the performance. If the
amount of data and computational resources are unlimited, even trivial
algorithms will converge to the optimal solution. However, in the practi-
cal case, given limited data and other resources, satisfactory performance
requires sophisticated methods to regularize the problem by introduc-
ing aprioriknowledge. Invariance of the output with respect to certain
transformations of the input is a typical example of such aprioriknowl-
edge. In this chapter, we introduce the concept of tangent vectors, which
compactly represent the essence of these transformation invariances, and
two classes of algorithms, “tangent distance” and “tangent propagation”,
which make use of these invariances to improve performance.
12.1 Introduction
Pattern Recognition is one of the main tasks of biological information process-
ing systems, and a major challenge of computer science. The problem of pattern
recognition is to classify objects into categories, given that objects in a particular
category may have widely-varying features, while objects in different categories
may have quite similar features. A typical example is handwritten digit recogni-
tion. Characters, typically represented as fixed-size images (say 16 by 16 pixels),
must be classified into one of 10 categories using a classification function. Build-
ing such a classification function is a major technological challenge, as irrelevant
variabilities among objects of the same class must be eliminated, while mean-
ingful differences between objects of different classes must be identified. These
classification functions for most real pattern recognition tasks are too compli-
cated to be synthesized “by hand” using only what humans know about the task.
Instead, we use sophisticated techniques that combine humans’ a priori knowl-
edge with information automatically extracted from a set of labeled examples
(the training set). These techniques can be divided into two camps, according to
the number of parameters they require: the “memory based” algorithms, which in
effect store a sizeable subset of the entire training set, and the “learned-function”
techniques, which learn by adjusting a comparatively small number of parame-
ters. This distinction is arbitrary because the patterns stored by a memory-based
algorithm can be considered the parameters of a very complex learned function.
The distinction is however useful in this work, because memory based algorithms
often rely on a metric which can be modified to incorporate transformation in-
variances, while learned-function algorithms consist of selecting a classification
function, the derivatives of which can be constrained to reflect the same transfor-
mation invariances. The two methods for incorporating invariances are different
enough to justify two independent sections.
12.1.1 Memory Based Algorithms
To compute the classification function, many practical pattern recognition sys-
tems, and several biological models, simply store all the examples, together with
their labels, in a memory. Each incoming pattern can then be compared to all
the stored prototypes, and the labels associated with the prototypes that best
match the input determine the output. The above method is the simplest exam-
ple of the memory-based models. Memory-based models require three things: a
distance measure to compare inputs to prototypes, an output function to produce
an output by combining the labels of the prototypes, and a storage scheme to
build the set of prototypes.
All three aspects have been abundantly treated in the literature. Output func-
tions range from simply voting the labels associated with the k closest proto-
types (K-Nearest Neighbors), to computing a score for each class as a linear
combination of the distances to all the prototypes, using fixed [21] or learned [5]
coefficients. Storage schemes vary from storing the entire training set, to picking
appropriate subsets of it (see [8], chapter 6, for a survey) to learned-functions
such as learning vector quantization (LVQ) [17] and gradient descent. Distance
measures can be as simple as the Euclidean distance, assuming the patterns and
prototypes are represented as vectors, or more complex as in the generalized
quadratic metric [10] or in elastic matching methods [15].
Fig. 12.1. According to the Euclidean distance, the pattern to be classified is more similar to prototype B. A better distance measure would find that prototype A is closer because it differs mainly by a rotation and a thickness transformation, two transformations which should leave the classification invariant.
A simple but inefficient pattern recognition method is to use a simple distance
measure, such as Euclidean distance between vectors representing the raw input,
combined with a very large set of prototypes. This method is inefficient because
almost all possible instances of a category must be present in the prototype
set. In the case of handwritten digit recognition, this means that digits of each
class in all possible positions, sizes, angles, writing styles, line thicknesses, skews,
etc... must be stored. In real situations, this approach leads to impractically large
prototype sets or to mediocre recognition accuracy as illustrated in Figure 12.1.
An unlabeled image of a thick, slanted “9” must be classified by finding the closest
prototype image out of two images representing respectively a thin, upright “9”
and a thick, slanted “4”. According to the Euclidean distance (sum of the squares of the pixel to pixel differences), the “4” is closer. The result is an incorrect
classification. The classical way of dealing with this problem is to use a so-called
feature extractor whose purpose is to compute a representation of the patterns
that is minimally affected by transformations of the patterns that do not modify
their category. For character recognition, the representation should be invariant
with respect to position, size changes, slight rotations, distortions, or changes in
line thickness. The design and implementation of feature extractors is the major
bottleneck of building a pattern recognition system. For example, the problem
illustrated in Figure 12.1 can be solved by deslanting and thinning the images.
An alternative to this is to use an invariant distance measure constructed
in such a way that the distance between a prototype and a pattern will not be
affected by irrelevant transformations of the pattern or of the prototype. With an
invariant distance measure, each prototype can match many possible instances
of a pattern, thereby greatly reducing the number of prototypes required.
The natural way of doing this is to use “deformable” prototypes. During the
matching process, each prototype is deformed so as to best fit the incoming pat-
tern. The quality of the fit, possibly combined with a measure of the amount of
deformation, is then used as the distance measure [15]. With the example of Fig-
ure 12.1, the “9” prototype would be rotated and thickened so as to best match
the incoming “9”. This approach has two shortcomings. First, a set of allowed
deformations must be designed based on a priori knowledge. Fortunately, this is
feasible for many tasks, including character recognition. Second, the search for
the best-matching deformation is often enormously expensive, and/or unreliable.
Consider the case of patterns that can be represented by vectors. For example,
the pixel values of a 16 by 16 pixel character image can be viewed as the compo-
nents of a 256-dimensional vector. One pattern, or one prototype, is a point in
this 256-dimensional space. Assuming that the set of allowable transformations is
continuous, the set of all the patterns that can be obtained by transforming one
prototype using one or a combination of allowable transformations is a surface
in the 256-D pixel space. More precisely, when a pattern P is transformed (e.g.
rotated) according to a transformation s(P, α) which depends on one parameter
α (e.g. the angle of the rotation), the set of all the transformed patterns
    S_P = {x | ∃α for which x = s(P, α)}     (12.1)
is a one-dimensional curve in the vector space of the inputs. In the remainder of
this chapter, we will always assume that we have chosen s to be differentiable with respect to both P and α and such that s(P, 0) = P.
When the set of transformations is parameterized by n parameters α_i (rotation, translation, scaling, etc.), the intrinsic dimension of the manifold S_P is at most n. For example, if the allowable transformations of character images are horizontal and vertical shifts, rotations, and scaling, the surface will be a 4-dimensional manifold.
In general, the manifold will not be linear. Even a simple image translation
corresponds to a highly non-linear transformation in the high-dimensional pixel
space. For example, if the image of an “8” is translated upward, some pixels
oscillate from white to black and back several times. Matching a deformable
prototype to an incoming pattern now amounts to finding the point on the sur-
face that is at a minimum distance from the point representing the incoming
pattern. This non-linearity makes the matching much more expensive and un-
reliable. Simple minimization methods such as gradient descent (or conjugate
gradient) can be used to find the minimum-distance point, however, these meth-
ods only converge to a local minimum. In addition, running such an iterative
procedure for each prototype is usually prohibitively expensive.
If the set of transformations happens to be linear in pixel space, then the
manifold is a linear subspace (a hyperplane). The matching procedure is then
reduced to finding the shortest distance between a point (vector) and a hyper-
plane, which is an easy-to-solve quadratic minimization problem. This special
case has been studied in the statistical literature and is sometimes referred to
as Procrustes analysis [24]. It has been applied to signature verification [12] and
on-line character recognition [26].
This chapter considers the more general case of non-linear transformations
such as geometric transformations of gray-level images. Remember that even a
simple image translation corresponds to a highly non-linear transformation in
the high-dimensional pixel space. The main idea of the chapter is to approx-
imate the surface of possible transforms of a pattern by its tangent plane at
the pattern, thereby reducing the matching to finding the shortest distance be-
tween two planes. This distance is called the tangent distance. The result of the
approximation is shown in Figure 12.2, in the case of rotation for handwritten
digits. At the top of the figure, is the theoretical curve in pixel space which
represents equation (12.1), together with its linear approximation. Points of the
transformation curve are depicted below for various amounts of rotation (each
angle corresponds to a value of α). The bottom of Figure 12.2 depicts the linear
approximation of the curve s(P, α) given by the Taylor expansion of s around
α = 0:

    s(P, α) = s(P, 0) + α ∂s(P, α)/∂α + O(α²) ≈ P + αT.     (12.2)
This linear approximation is completely characterized by the point P and the tangent vector T = ∂s(P, α)/∂α. Tangent vectors, also called the Lie derivatives of
Fig. 12.2. Top: Representation of the effect of the rotation in pixel space. Middle: Small rotations of an original digitized image of the digit “2”, for different angle values of α. Bottom: Images obtained by moving along the tangent to the transformation curve for the same original digitized image P by adding various amounts (α) of the tangent vector T.
the transformation s, will be the subject of section 12.4. As can be seen from
Figure 12.2, for reasonably small angles (α < 1), the approximation is very
good.
Figure 12.3 illustrates the difference between the Euclidean distance, the full
invariant distance (minimum distance between manifolds) and the tangent dis-
tance. In the figure, both the prototype and the pattern are deformable (two-
sided distance), but for simplicity or efficiency reasons, it is also possible to
deform only the prototype or only the unknown pattern (one-sided distance).
Although in the following we will concentrate on using tangent distance to
recognize images, the method can be applied to many different types of signals:
temporal signals, speech, sensor data...
Fig. 12.3. Illustration of the Euclidean distance and the tangent distance between P and E. The curves S_P and S_E represent the sets of points obtained by applying the chosen transformations (for example translations and rotations) to P and E. The lines going through P and E represent the tangents to these curves. Assuming that the working space has more dimensions than the number of chosen transformations (on the diagram, assume one transformation in a 3-D space), the tangent spaces do not intersect and the tangent distance is uniquely defined.
12.1.2 Learned-Function Algorithms
Rather than trying to keep a representation of the training set, it is also possible
to choose a classification function by learning a set of parameters. This is the
approach taken in neural networks, curve fitting, regression, et cetera.
We assume all data is drawn independently from a given statistical distribution P, and our learning machine is characterized by the set of functions it can implement, G_w(x), indexed by the vector of parameters w. We write F(x) to represent the “correct” or “desired” labeling of the point x. The task is to find a value for w such that G_w best approximates F. We can use a finite set of training data to help find this vector. We assume the correct labeling F(x) is known for all points in the training set. For example, G_w may be the function computed by a neural net having weights w, or G_w may be a polynomial having coefficients w. Without additional information, finding a value for w is an ill-posed problem unless the number of parameters is small and/or the size of the training set is large. This is because the training set does not provide enough information to distinguish the best solution among all the candidate w’s.
This problem is illustrated in Figure 12.4 (left). The desired function F (solid line) is to be approximated by a function G_w (dotted line) from four examples {(x_i, F(x_i))}_{i=1,2,3,4}. As exemplified in the picture, the fitted function G_w
largely disagrees with the desired function F between the examples, but it is not possible to infer this from the training set alone. Many values of w can generate many different functions G_w, some of which may be terrible approximations of F, even though they are in complete agreement with the training set. Because of this, it is customary to add “regularizers”, or additional constraints, to restrict the search for an acceptable w. For example, we may require the function G_w to be “smooth”, by adding the constraint that ‖w‖² should be minimized. It is important that the regularizer reflects a property of F, hence regularizers depend on a priori knowledge about the function to be modeled.
Fig. 12.4. Learning a given function (solid line) from a limited set of examples (x_1 to x_4). The fitted curves are shown by the dotted lines. Left: The only constraint is that the fitted curve goes through the examples. Right: The fitted curves not only go through each example but their derivatives evaluated at the examples also agree with the derivatives of the given function.
Selecting a good family G = {G_w, w ∈ ℝ^q} of functions is a difficult task, sometimes known as “model selection” [16, 14]. If G contains a large family of functions, it is more likely that it will contain a good approximation of F (the function we are trying to approximate), but it is also more likely that the selected candidate (using the training set) will generalize poorly because many functions in G will agree with the training data and take outrageous values between the training samples. If, on the other hand, G contains a small family of functions, it is more likely that a function G_w which fits the data will be a good approximation of F. The capacity of the family of functions G is often referred to as the VC dimension [28, 27]. If a large amount of data is available, G should contain a large family of functions (high VC dimension), so that more functions can be approximated, and in particular, F. If, on the other hand, the data is scarce, G should be restricted to a small family of functions (low VC dimension), to control the values between the (more distant) samples¹.

¹ Note that this point of view also applies to memory based systems. In the case where all the training data can be kept in memory, however, the VC dimension is infinite, and the formalism is meaningless. The VC dimension is a learning paradigm and is not useful unless learning is involved.

The VC dimension can also be
controlled by putting a knob on how much effect is given to some regularizers.
For instance it is possible to control the capacity of a neural network by adding
“weight decay” as a regularizer. Weight decay is a heuristic that favors smooth
classification functions, by making a tradeoff: decreasing ‖w‖² at the cost,
usually, of slightly increased error on the training set. Since the optimal classi-
fication function is not necessarily smooth, for instance at a decision boundary,
the weight decay regularizer can have adverse effects.
As mentioned earlier, the regularizer should reflect interesting properties (a priori knowledge) of the function to be learned. If the functions F and G_w are assumed to be differentiable, which is generally the case, the search for G_w can be greatly improved by requiring that G_w’s derivatives evaluated at the points {x_i} are more or less equal (this is the regularizer knob) to the derivatives of F at the same points (Figure 12.4, right). This result can be extended to multidimensional inputs. In this case, we can impose the equality of the derivatives of F and G_w in certain directions, not necessarily in all directions of the input space. Such constraints find immediate use in traditional pattern recognition problems. It is often the case that a priori knowledge is available on how the desired function varies with respect to some transformations of the input. It is straightforward to derive the corresponding constraint on the directional derivatives of the fitted function G_w in the directions of the transformations (previously named tangent vectors). Typical examples can be found in pattern recognition where the desired classification function is known to be invariant with respect to some transformation of the input such as translation, rotation, scaling, etc.; in other words, the directional derivatives of the classification function in the directions of these transformations are zero.
This is illustrated in Figure 12.4. The right part of the figure shows how the additional constraints on G_w help generalization by constraining the values of G_w outside the training set. For every transformation which has a known effect on the classification function, a regularizer can be added in the form of a constraint on the directional derivative of G_w in the direction of the tangent vector (such as the one depicted in Figure 12.2), computed from the curve of transformation.
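Tangent propagation itself is presented later in this chapter; purely as a rough illustration (ours, not the chapter's algorithm), such a directional-derivative constraint could be added to a training loss along the following lines, with the derivative approximated by a symmetric finite difference (G, w, x and tangent are placeholder names):

import numpy as np

def directional_derivative_penalty(G, w, x, tangent, eps=1e-4):
    """Penalty on the directional derivative of G(w, x) along a tangent vector.

    The classifier output should not change under a known invariance
    transformation, so its derivative in the tangent direction is driven
    towards zero; here it is approximated by a finite difference.
    """
    d = (G(w, x + eps * tangent) - G(w, x - eps * tangent)) / (2.0 * eps)
    return np.sum(d ** 2)

# Toy usage: a linear "classifier" G(w, x) = w.x and a made-up tangent vector.
G = lambda w, x: np.dot(w, x)
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 3.0])
tangent = np.array([1.0, 1.0, 0.0])
total_loss = 0.0   # ... data-fitting term would go here ...
total_loss += 0.1 * directional_derivative_penalty(G, w, x, tangent)
print(total_loss)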
The next section will analyze in detail how to use a distance based on tangent vectors in memory based algorithms. The subsequent section will discuss the use of tangent vectors in neural networks, with the tangent propagation algorithm. The last section will compare different algorithms to compute tangent vectors.
12.2 Tangent Distance
The Euclidean distance between two patterns P and E is in general not appropriate because it is sensitive to irrelevant transformations of P and of E. In contrast, the transformed distance D(E, P) is defined to be the minimal distance between the two manifolds S_P and S_E, and is therefore invariant with respect to the transformations used to generate S_P and S_E (see Figure 12.3). Unfortunately, these manifolds have no analytic expression in general, and finding the distance
between them is a difficult optimization problem with multiple local minima.
Besides, true invariance is not necessarily desirable since a rotation of a “6” into
a “9” does not preserve the correct classification.
Our approach consists of computing the minimum distance between the linear surfaces that best approximate the non-linear manifolds S_P and S_E. This solves
three problems at once: 1) linear manifolds have simple analytical expressions
which can be easily computed and stored, 2) finding the minimum distance
between linear manifolds is a simple least squares problem which can be solved
efficiently and, 3) this distance is locally invariant but not globally invariant.
Thus the distance between a “6” and a slightly rotated “6” is small but the
distance between a “6” and a “9” is large. The different distances between P and
E are represented schematically in Figure 12.3.
The figure represents two patterns P and E in 3-dimensional space. The man-
ifolds generated by s are represented by one-dimensional curves going through E
and P respectively. The linear approximations to the manifolds are represented
by lines tangent to the curves at E and P . These lines do not intersect in 3 di-
mensions and the shortest distance between them (uniquely defined) is D(E,P).
The distance between the two non-linear transformation curves D(E,P) is also
shown on the figure.
An efficient implementation of the tangent distance D(E,P) will be given in
the next section, using image recognition as an illustration. We then compare
our methods with the best known competing methods. Finally we will discuss
possible variations on the tangent distance and how it can be generalized to
problems other than pattern recognition.
12.2.1 Implementation
In this section we describe formally the computation of the tangent distance. Let
the function s transform an image P to s(P, α) according to the parameter α. We require s to be differentiable with respect to α and P, and require s(P, 0) = P. If P is a 2 dimensional image for instance, s(P, α) could be a rotation of P by the angle α. If we are interested in all transformations of images which conserve distances (isometry), s(P, α) would be a rotation by α_θ followed by a translation by (α_x, α_y) of the image P. In this case α = (α_θ, α_x, α_y) is a vector of parameters of dimension 3. In general, α = (α_1, . . . , α_m) is of dimension m.
Since s is differentiable, the set S_P = {x | ∃α for which x = s(P, α)} is a differentiable manifold which can be approximated to the first order by a hyperplane T_P. This hyperplane is tangent to S_P at P and is generated by the columns of the matrix

    L_P = ∂s(P, α)/∂α |_{α=0} = [∂s(P, α)/∂α_1, . . . , ∂s(P, α)/∂α_m] |_{α=0}     (12.3)
which are vectors tangent to the manifold. If E and P are two patterns to be compared, the respective tangent planes T_E and T_P can be used to define a new distance D between these two patterns. The tangent distance D(E, P) between
E and P is defined by

    D(E, P) = min_{x ∈ T_E, y ∈ T_P} ‖x − y‖².     (12.4)
The equation of the tangent planes T_E and T_P is given by:

    E′(α_E) = E + L_E α_E     (12.5)
    P′(α_P) = P + L_P α_P     (12.6)

where L_E and L_P are the matrices containing the tangent vectors (see equation (12.3)) and the vectors α_E and α_P are the coordinates of E′ and P′ (using bases L_E and L_P) in the corresponding tangent planes. Note that E′, E, L_E and α_E denote vectors and matrices in linear equations (12.5). For example, if the pixel space was of dimension 5, and there were two tangent vectors, we could rewrite equation (12.5) as
    [E′_1]   [E_1]   [L_11  L_12]
    [E′_2]   [E_2]   [L_21  L_22]   [α_1]
    [E′_3] = [E_3] + [L_31  L_32] · [α_2]     (12.7)
    [E′_4]   [E_4]   [L_41  L_42]
    [E′_5]   [E_5]   [L_51  L_52]
The quantities L_E and L_P are attributes of the patterns so in many cases they can be precomputed and stored.
Computing the tangent distance

    D(E, P) = min_{α_E, α_P} ‖E′(α_E) − P′(α_P)‖²     (12.8)
amounts to solving a linear least squares problem. The optimality condition is that the partial derivatives of D(E, P) with respect to α_P and α_E should be zero:

    ∂D(E, P)/∂α_E = 2 (E′(α_E) − P′(α_P))ᵀ L_E = 0     (12.9)
    ∂D(E, P)/∂α_P = 2 (P′(α_P) − E′(α_E))ᵀ L_P = 0.     (12.10)
Substituting E′ and P′ by their expressions yields the following linear system of equations, which we must solve for α_P and α_E:

    L_Pᵀ (E − P − L_P α_P + L_E α_E) = 0     (12.11)
    L_Eᵀ (E − P − L_P α_P + L_E α_E) = 0.     (12.12)
The solution of this system is

    (L_PE L_EE⁻¹ L_Eᵀ − L_Pᵀ)(E − P) = (L_PE L_EE⁻¹ L_EP − L_PP) α_P     (12.13)
    (L_EP L_PP⁻¹ L_Pᵀ − L_Eᵀ)(E − P) = (L_EE − L_EP L_PP⁻¹ L_PE) α_E     (12.14)

where L_EE = L_Eᵀ L_E, L_PE = L_Pᵀ L_E, L_EP = L_Eᵀ L_P and L_PP = L_Pᵀ L_P. LU decompositions of L_EE and L_PP can be precomputed. The most expensive part
in solving this system is evaluating L_EP (L_PE can be obtained by transposing L_EP). It requires m_E × m_P dot products, where m_E is the number of tangent vectors for E and m_P is the number of tangent vectors for P. Once L_EP has been computed, α_P and α_E can be computed by solving two (small) linear systems of respectively m_E and m_P equations. The tangent distance is obtained by computing ‖E′(α_E) − P′(α_P)‖ using the values of α_P and α_E in equations (12.5) and (12.6). If n is the dimension of the input space (i.e. the length of vectors E and P), the algorithm described above requires roughly n(m_E + 1)(m_P + 1) + 3(m_E³ + m_P³) multiply-adds. Approximations to the tangent distance can however be computed more efficiently.
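As an illustration (ours, not the chapter's implementation), the two-sided tangent distance can also be obtained with a single least-squares solve instead of the normal equations (12.13) and (12.14); the result is the same up to numerical error:

import numpy as np

def tangent_distance(E, P, LE, LP):
    """Two-sided tangent distance between patterns E and P (squared).

    E, P are flattened images (length n); LE (n x mE) and LP (n x mP) hold the
    tangent vectors of E and P as columns. The closest points on the two
    tangent planes are found by one least-squares solve.
    """
    A = np.hstack([LE, -LP])                       # [L_E, -L_P]
    coef, *_ = np.linalg.lstsq(A, P - E, rcond=None)
    mE = LE.shape[1]
    aE, aP = coef[:mE], coef[mE:]
    diff = (E + LE @ aE) - (P + LP @ aP)           # E'(a_E) - P'(a_P)
    return diff @ diff

# Toy usage with random data: n = 16*16 pixels, 2 tangent vectors per pattern.
rng = np.random.default_rng(0)
E, P = rng.normal(size=256), rng.normal(size=256)
LE, LP = rng.normal(size=(256, 2)), rng.normal(size=(256, 2))
print(tangent_distance(E, P, LE, LP) <= np.sum((E - P) ** 2))   # True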
12.2.2 Some Illustrative Results
Local Invariance. The “local² invariance” of tangent distance can be illustrated by transforming a reference image by various amounts and measuring its distance to a set of prototypes.
The bottom of Figure 12.5 shows 10 typical handwritten digit images. One of
them – the digit “3” – is chosen to be the reference. The reference is translated
horizontally by the amount indicated in the abscissa. There are ten curves for
Euclidean distance and ten more curves for tangent distance, measuring the
distance between the translated reference and one of the 10 digits.
Since the reference was chosen from the 10 digits, it is not surprising that
the curve corresponding to the digit “3” goes to 0 when the reference is not
translated (0 pixel translation). It is clear from the figure that if the reference
(the image “3”) is translated by more than 2 pixels, the Euclidean distance will
confuse it with other digits, namely “8” or “5”. In contrast, there is no possible
confusion when tangent distance is used. As a matter of fact, in this example,
the tangent distance correctly identifies the reference up to a translation of 5
pixels! Similar curves were obtained with all the other transformations (rotation,
scaling, etc...).
The “local” invariance of tangent distance with respect to small transforma-
tions generally implies more accurate classification for much larger transforma-
tions. This is the single most important feature of tangent distance.
Fig. 12.5. Euclidean and tangent distances between 10 typical images of handwritten digits and a translated image of the digit “3”. The abscissa represents the amount of horizontal translation (measured in pixels).

² Local invariance refers to invariance with respect to small transformations (i.e. a rotation of a very small angle). In contrast, global invariance refers to invariance with respect to arbitrarily large transformations (i.e. a rotation of 180 degrees). Global invariance is not desirable in digit recognition, since we need to distinguish “6” from a “9”.
³ An isomorphism actually, see “Lie algebra” in [6].

The locality of the invariance has another important benefit: Local invariance can be enforced with very few tangent vectors. The reason is that for infinitesimal (local) transformations, there is a direct correspondence³ between the tangent vectors of the tangent plane and the various compositions of transformations. For example, the three tangent vectors for X-translation, Y-translation and
rotations around the origin, generate a tangent plane corresponding to all the
possible compositions of horizontal translations, vertical translations and rota-
tions. The resulting tangent distance is then locally invariant to all the trans-
lations and all the rotations (around any center). Figure 12.6 further illustrates
this phenomenon by displaying points in the tangent plane generated from only
5 tangent vectors. Each of these images looks like it has been obtained by ap-
plying various combinations of scaling, rotation, horizontal and vertical skewing,
and thickening. Yet, the tangent distance between any of these points and the
original image is 0.
Handwritten Digit Recognition. Experiments were conducted to evaluate
the performance of tangent distance for handwritten digit recognition. An inter-
esting characteristic of digit images is that we can readily identify a set of local
transformations which do not affect the identity of the character, while covering
a large portion of the set of possible instances of the character. Seven such image
transformations were identified: X and Y translations, rotation, scaling, two hy-
perbolic transformations (which can generate shearing and squeezing), and line
thickening or thinning. The first six transformations were chosen to span the
set of all possible linear coordinate transforms in the image plane. (Neverthe-
less, they correspond to highly non-linear transforms in pixel space.) Additional
12. Tangent Distance and Tangent Propagation 247
Original
Tangent vectors Points in the tangent plane
Fig. 12.6. Left: Original image. Middle: 5 tangent vectors corresponding respectively
to the 5 transformations: scaling, rotation, expansion of the X axis while compressing
the Y axis, expansion of the first diagonal while compressing the second diagonal and
thickening. Right: 32 points in the tangent space generated by adding or subtracting
each of the 5 tangent vectors.
transformations have been tried with less success. Three databases were used to
test our algorithm:
US Postal Service Database: The database consisted of 16 × 16 pixel size-
normalized images of handwritten digits, coming from US mail envelopes. The
training and testing set had respectively 9709 and 2007 examples.
NIST1 Database: The second experiment was a competition organized by the
National Institute of Standards and Technology (NIST) in Spring 1992. The
object of the competition was to classify a test set of 59,000 handwritten digits,
given a training set of 223,000 patterns.
NIST2 Database: The third experiment was performed on a database made
out of the training and testing database provided by NIST (see above). NIST
had divided the data into two sets which unfortunately had different distribu-
tions. The training set (223,000 patterns) was easier than the testing set (59,000
patterns). In our experiments we combined these two sets 50/50 to make a train-
ing set of 60,000 patterns and testing/validation sets of 10,000 patterns each, all
having the same characteristics.
For each of these three databases we tried to evaluate human performance to
benchmark the difficulty of the database. For USPS, two members of our group
went through the test set and both obtained a 2.5% raw error performance.
The human performance on NIST1 was provided by the National Institute of
Standards and Technology. The human performance on NIST2 was measured on
a small subsample of the database and must therefore be taken with caution.
Several of the leading algorithms were tested on each of these databases.
The first experiment used the K-Nearest Neighbor algorithm, using the or-
dinary Euclidean distance. The prototype set consisted of all available training
examples. A 1-Nearest Neighbor rule gave optimal performance in USPS while
a 3-Nearest Neighbors rule performed better in NIST2.
The second experiment was similar to the first, but the distance function was
changed to tangent distance with 7 transformations. For the USPS and NIST2
databases, the prototype set was constructed as before, but for NIST1 it was
constructed by cycling through the training set. Any patterns which were mis-
classified were added to the prototype set. After a few cycles, no more prototypes were added (the training error was 0). This resulted in 10,000 prototypes. A 3-
Nearest Neighbors rule gave optimal performance on this set.
Other algorithms such as neural nets [18, 20], optimal margin classifier [7],
local learning [3] and boosting [9] were also used on these databases. A case
study can be found in [20].
Table 12.1. Results: Performances in % of errors for (in order) human, K-nearest
neighbor, tangent distance, Lenet1 (simple neural network), Lenet4 (large neural net-
work), optimal margin classifier (OMC), local learning (LL) and boosting (Boost)
Human K-NN T.D. Lenet1 Lenet4 OMC LL Boost
USPS 2.5 5.7 2.5 4.2 4.3 3.3 2.6
NIST1 1.6 3.2 3.7 4.1
NIST2 0.2 2.4 1.1 1.7 1.1 1.1 1.1 0.7
The results are summarized in Table 12.1. As illustrated in the table, the
tangent distance algorithm equals or outperforms all other algorithms we tested,
in all cases except one: Boosted LeNet 4 was the winner on the NIST2 database.
This is not surprising. The K-nearest neighbor algorithm (with no preprocessing)
is very unsophisticated in comparison to local learning, optimal margin classifier,
and boosting. The advantage of tangent distance is the a priori knowledge of
transformation invariance embedded into the distance. When the training data
is sufficiently large, as is the case in NIST2, some of this knowledge can be picked
up from the data by the more sophisticated algorithms. In other words, the value
of a priori knowledge decreases as the size of the training set increases.
12.2.3 How to Make Tangent Distance Work
This section is dedicated to the technological “know how” which is necessary
to make tangent distance work with various applications. “Tricks” of this sort
are usually not published for various reasons (they are not always theoretically
sound, page area is too valuable, the tricks are specific to one particular appli-
cation, commercial competitive considerations discourage telling everyone how
to reproduce the result, etc.), but they are often a determining factor in making
the technology a success. Several of these techniques will be discussed here.
Smoothing the Input Space: This is the single most important factor in
obtaining good performance with tangent distance. By definition, the tangent
vectors are the Lie derivatives of the transformation function s(P, α) with respect
to α. They can be written as:
    L_P = ∂s(P, α)/∂α = lim_{ε→0} (s(P, ε) − s(P, 0)) / ε.     (12.15)
It is therefore very important that s be differentiable (and well behaved) with
respect to α. In particular, it is clear from equation (12.15) that s(P, ε) must be computed for arbitrarily small ε. Fortunately, even when P can only take
discrete values, it is easy to make s differentiable. The trick is to use a smoothing interpolating function C_σ as a preprocessing for P, such that s(C_σ(P), α) is differentiable (with respect to C_σ(P) and α, not with respect to P). For instance, if the input space for P is binary images, C_σ(P) can be a convolution of P with a Gaussian function of standard deviation σ. If s(C_σ(P), α) is a translation of α pixels, the derivative of s(C_σ(P), α) can easily be computed since s(C_σ(P), ε) can be obtained by translating Gaussian functions. This preprocessing will be discussed in more detail in section 12.4.
The smoothing factor σ controls the locality of the invariance. The smoother
the transformation curve defined by s is, the longer the linear approximation
will be valid. In general the best smoothing is the maximum smoothing which
does not blur the features. For example, in handwritten character recognition
with 16x16 pixel images, a Gaussian function with a standard deviation of 1
pixel yielded the best results. Increased smoothing led to confusion (such as a
“5” mistaken for “6” because the lower loop had been closed by the smoothing)
and decreased smoothing didn’t make full use of the invariance properties.
If the available computation time allows it, the best strategy is to extract
features first, smooth shamelessly, and then compute the tangent distance on
the smoothed features.
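As a concrete illustration of this preprocessing, the smoothing can be implemented with an off-the-shelf Gaussian filter; the sketch below assumes scipy is available and uses σ = 1, the value reported above for 16x16 handwritten digits:

import numpy as np
from scipy.ndimage import gaussian_filter

def smooth(pattern, sigma=1.0):
    """Gaussian smoothing C_sigma(P): convolve the pixel image with a Gaussian
    of standard deviation sigma before any tangent-distance computation."""
    return gaussian_filter(np.asarray(pattern, dtype=float), sigma)

# smooth both patterns first, then compute the tangent distance on the results:
# E_s, P_s = smooth(E), smooth(P)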
Controlled Deformation: The linear system given in equation (12.8) is singu-
lar if some of the tangent vectors for E or P are parallel. Although the probability
of this happening is zero when the data is taken from a real-valued continuous
distribution (as is the case in handwritten character recognition), it is possible
that a pattern may be duplicated in both the training and the test set, resulting
in a division by zero error. The fix is quite simple and elegant. Equation (12.8)
can be replaced by equation:
D(E, P) = \min_{\alpha_E, \alpha_P} \|E + L_E \alpha_E - P - L_P \alpha_P\|^2 + k\|L_E \alpha_E\|^2 + k\|L_P \alpha_P\|^2.    (12.16)
The physical interpretation of this equation, depicted in Figure 12.7, is that
the point E′(α_E) on the tangent plane T_E is attached to E with a spring with
spring constant k and to P′(α_P) (on the tangent plane T_P) with spring constant
1, and P′(α_P) is also attached to P with spring constant k. (All three springs
have zero natural length.)
Fig. 12.7. The tangent distance between E and P is the elastic energy stored in each
of the three springs connecting P, P′, E′ and E. P′ and E′ can move without friction
along the tangent planes. The spring constants are indicated on the figure.
The new tangent distance is the total elastic potential energy stored in all three
springs at equilibrium. As for the standard tangent distance, the solution can
easily be obtained by differentiating equation (12.16) with respect to α_E and
α_P. The differentiation yields:
L_P^\top \bigl(E - P - (1 + k) L_P \alpha_P + L_E \alpha_E\bigr) = 0    (12.17)

L_E^\top \bigl(E - P - L_P \alpha_P + (1 + k) L_E \alpha_E\bigr) = 0.    (12.18)
The solution of this system is

\bigl(L_{PE} L_{EE}^{-1} L_E^\top - (1 + k) L_P^\top\bigr)(E - P) = \bigl(L_{PE} L_{EE}^{-1} L_{EP} - (1 + k)^2 L_{PP}\bigr)\,\alpha_P    (12.19)

\bigl(L_{EP} L_{PP}^{-1} L_P^\top - (1 + k) L_E^\top\bigr)(E - P) = \bigl((1 + k)^2 L_{EE} - L_{EP} L_{PP}^{-1} L_{PE}\bigr)\,\alpha_E    (12.20)

where L_{EE} = L_E^\top L_E, L_{PE} = L_P^\top L_E, L_{EP} = L_E^\top L_P and L_{PP} = L_P^\top L_P. The
system has the same complexity as the vanilla tangent distance except that it
always has a solution for k > 0, and it is more numerically stable. Note that
for k = 0 it is equivalent to the standard tangent distance, while for k = ∞
we have the Euclidean distance. This approach is also very useful when the
number of tangent vectors is greater than or equal to the number of dimensions
of the space. The standard tangent distance would most likely be zero (when
the tangent spaces intersect), but the “spring” tangent distance still expresses
valuable information about the invariances.
If the number of dimensions of the input space is large compared to the num-
ber of tangent vectors, keeping k as small as possible is better because it doesn’t
interfere with the “sliding” along the tangent planes (E′ and P′ are less con-
strained).
Contrary to intuition, there is no danger of sliding too far in high dimensional
space because tangent vectors are always roughly orthogonal and they could only
slide far if they were parallel.
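The sketch below solves the stationarity conditions (12.17)-(12.18) directly as one small linear system rather than through the closed forms (12.19)-(12.20); the function and variable names are ours, not the chapter's:

import numpy as np

def spring_tangent_distance(E, P, LE, LP, k=0.1):
    """'Spring' tangent distance of eq. (12.16).  E, P are flattened patterns
    (length n); LE, LP hold the tangent vectors as columns (n x mE, n x mP).
    The stationarity conditions (12.17)-(12.18) are solved as one linear system."""
    d = E - P
    LEE, LPP, LEP = LE.T @ LE, LP.T @ LP, LE.T @ LP
    mE = LE.shape[1]
    A = np.block([[(1 + k) * LEE, -LEP],
                  [LEP.T, -(1 + k) * LPP]])
    b = np.concatenate([-LE.T @ d, -LP.T @ d])
    alpha = np.linalg.lstsq(A, b, rcond=None)[0]   # well behaved for k > 0
    aE, aP = alpha[:mE], alpha[mE:]
    eE, eP = LE @ aE, LP @ aP                      # deformations on each side
    return (d + eE - eP) @ (d + eE - eP) + k * (eE @ eE) + k * (eP @ eP)

For k = 0 this reduces to the ordinary tangent distance system, and large k pulls both α vectors toward zero, recovering the Euclidean distance.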
Hierarchy of Distances: If several invariances are used, classification using
tangent distance alone would be quite expensive. Fortunately, if a typical memory
based algorithm is used, for example, K-nearest neighbors, it is quite unnecessary
to compute the full tangent distance between the unclassified pattern and all
the labeled samples. In particular, if a crude estimate of the tangent distance
indicates with a sufficient confidence that a sample is very far from the pattern to
be classified, no more computation is needed to know that this sample is not one
of the K-nearest neighbors. Based on this observation one can build a hierarchy of
distances which can greatly reduce the computation of each classification. Let’s
assume, for instance, that we have m approximations D
i
of the tangent distance,
ordered such that D
1
is the crudest approximation of the tangent distance and
D
m
is exactly tangent distance (for instance D
1
to D
5
could be the Euclidean
distance with increasing resolution, and D
6
to D
10
each add a tangent vector at
full resolution).
The basic idea is to keep a pool of all the prototypes which could potentially be
the K-nearest neighbors of the unclassified pattern. Initially the pool contains all
the samples. Each of the distances D_i corresponds to a stage of the classification
process. The classification algorithm has 3 steps at each stage, and proceeds from
stage 1 to stage m or until the classification is complete. Step 1: the distance D_i
between all the samples in the pool and the unclassified pattern is computed.
Step 2: a classification and a confidence score are computed from these distances.
If the confidence is good enough, say better than C_i (for instance, if all the
samples left in the pool are in the same class), the classification is complete;
otherwise proceed to step 3. Step 3: the K_i closest samples according to distance
D_i are kept in the pool, while the remaining samples are discarded.
Finding the K_i closest samples can be done in O(p) (where p is the number
of samples in the pool) since these elements need not be sorted [22, 2]. The
reduced pool is then passed to stage i + 1.
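A sketch of this filtering cascade is shown below; the confidence measure (the fraction of the pool agreeing with the current best label) is a simplification of the criterion described above, and K and C are assumed to have been tuned on a validation set beforehand:

import numpy as np

def hierarchical_classify(x, prototypes, labels, distances, K, C):
    """Classify x with a hierarchy of distances.  distances[i](x, p) is the i-th
    (increasingly expensive) distance; K[i] is the pool size kept after stage i
    and C[i] the confidence needed to stop early."""
    labels = np.asarray(labels)
    pool = np.arange(len(prototypes))            # start with all samples
    best = labels[0]
    for i, dist in enumerate(distances):
        d = np.array([dist(x, prototypes[j]) for j in pool])
        best = labels[pool[np.argmin(d)]]
        # crude confidence: fraction of the remaining pool agreeing with 'best'
        confidence = float(np.mean(labels[pool] == best))
        if confidence >= C[i] or i == len(distances) - 1:
            return best                           # classification complete
        keep = np.argpartition(d, min(K[i], len(pool) - 1))[:K[i]]
        pool = pool[keep]                         # discard the rest, go to stage i+1
    return best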
The two constants C_i and K_i must be determined in advance using a validation
set. This can easily be done graphically by plotting the error as a function of
K_i and C_i at each stage (starting with all K_i equal to the number of labeled
samples and C_i = 1 for all stages). At each stage there is a minimum K_i and
a minimum C_i which give optimal performance on the validation set. By taking
larger values, we can decrease the probability of making errors on the test sets.
The slightly worse performance of using a hierarchy of distances is often well
worth the speed-up. The computational cost of a pattern classification is then
equal to:
computational cost = \sum_i (number of prototypes at stage i) × (distance complexity at stage i) × (probability of reaching stage i)    (12.21)
All this is better illustrated with an example as in Figure 12.8. This system was
used for the USPS experiment described in a previous section. In classification
Fig. 12.8. Pattern recognition using a hierarchy of distances. The filter proceeds from
left (starting with the whole database) to right (where only a few prototypes remain).
At each stage distances between prototypes and the unknown pattern are computed
and sorted; then the best candidate prototypes are selected for the next stage. As
the complexity of the distance increases, the number of prototypes decreases, making
computation feasible. At each stage a classification is attempted and a confidence score
is computed. If the confidence score is high enough, the remaining stages are skipped.
of handwritten digits (16x16 pixel images), D_1, D_2, and D_3 were the Euclidean
distances at resolutions 2×2, 4×4 and 8×8 respectively. D_4 was the one-sided
tangent distance with X-translation, on the sample side only, at resolution 8×8.
D_5 was the double-sided tangent distance with X-translation at resolution 16×16.
Each of the subsequent distances added one tangent vector on each side (Y-
translation, scaling, rotation, hyperbolic deformation 1, hyperbolic deformation 2
and thickness) until the full tangent distance was computed (D_11).
Table 12.2 shows the expected number of multiply-adds at each of the stages.
It should be noted that the full tangent distance need only be computed for
1 in 20 unknown patterns (probability 0.05), and only with 5 samples out of
the original 10,000. The net speed-up was on the order of 500, compared with
computing the full tangent distance between every unknown pattern and every
sample (this is 6 times faster than computing the Euclidean distance at full
resolution).
Multiple Iterations: Tangent distance can be viewed as one iteration of a
Newton-type algorithm which finds the points of minimum distance on the true
transformation manifolds. The vectors α_E and α_P are the coordinates of the two
closest points in the respective tangent spaces, but they can also be interpreted
as the values for the real (non-linear) transformations. In other words, we can
use α_E and α_P to compute the points s(E, α_E) and s(P, α_P), the real non-
linear transformations of E and P. From these new points, we can recompute
the tangent vectors and the tangent distance, and reiterate the process. If the
appropriate conditions are met, this process can converge to a local minimum of
the distance between the two transformation manifolds of P and E.
This process did not improve handwritten character recognition, but it yielded
impressive results in face recognition [29]. In that case, each successive iteration
was done at increasing resolution (hence combining hierarchical distances and
multiple iterations), making the whole process computationally efficient.
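A rough sketch of the iterated refinement might look as follows; tangents and apply_transform are hypothetical stand-ins for application code that recomputes tangent vectors at a pattern and applies the true non-linear transformation s(·, α):

import numpy as np

def iterated_tangent_distance(E, P, tangents, apply_transform, n_iter=3):
    """Multiple-iteration refinement of tangent distance (a sketch).
    tangents(x) returns the matrix of tangent vectors at pattern x (columns);
    apply_transform(x, alpha) applies the true non-linear transformation
    s(x, alpha); both are hypothetical application-specific callbacks."""
    for _ in range(n_iter):
        LE, LP = tangents(E), tangents(P)
        d = E - P
        L = np.hstack([LE, -LP])
        alpha, *_ = np.linalg.lstsq(L, -d, rcond=None)   # one tangent-distance step
        aE, aP = alpha[:LE.shape[1]], alpha[LE.shape[1]:]
        E, P = apply_transform(E, aE), apply_transform(P, aP)
    return np.linalg.norm(E - P)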
Table 12.2. Summary of the computation for the classification of one pattern: the first
column is the distance index, the second column indicates the number of tangent vectors
(0 for the Euclidean distance), the third column indicates the resolution in pixels, the
fourth is K_i, the number of prototypes on which the distance D_i must be computed,
the fifth column indicates the number of additional dot products which must be computed
to evaluate distance D_i, the sixth column indicates the probability of not skipping that
stage after the confidence score has been used, and the last column indicates the total
average number of multiply-adds which must be performed (product of columns 3 to 6)
at each stage.
 i   # of T.V.   Reso   # of proto (K_i)   # of prod   Probab   # of mul/add
 1       0          4         9709             1        1.00       40,000
 2       0         16         3500             1        1.00       56,000
 3       0         64          500             1        1.00       32,000
 4       1         64          125             2        0.90       14,000
 5       2        256           50             5        0.60       40,000
 6       4        256           45             7        0.40       32,000
 7       6        256           25             9        0.20       11,000
 8       8        256           15            11        0.10        4,000
 9      10        256           10            13        0.10        3,000
10      12        256            5            15        0.05        1,000
11      14        256            5            17        0.05        1,000
12.3 Tangent Propagation
The previous section dealt with memory-based techniques. We now apply
tangent-distance principles to learned-function techniques.
The key idea is to incorporate the invariance directly into the learned classi-
fication function. In this section, we present an algorithm, called “tangent prop-
agation”, in which gradient descent is used to propagate information about the
invariances of the training data. The process is a generalization of the widely-
used “back propagation” method, which propagates information about the training data itself.
We again assume all data is drawn independently from a given statistical
distribution P, and our learning machine is characterized by the set of functions
it can implement, G_w(x), indexed by the vector of parameters w. Ideally, we
would like to find w which minimizes the energy function

E = \int \| G_w(x) - F(x) \|^2 \, dP(x)    (12.22)

where F(x) represents the “correct” or “desired” labeling of the point x. In the
real world we must estimate this integral using only a finite set of training points
B drawn from the distribution P. That is, we try to minimize

E_p = \sum_{i=1}^{p} \| G_w(x_i) - F(x_i) \|^2    (12.23)

where the sum runs over the training set B. An estimate of w can be computed
by following a gradient descent using the weight-update rule:

\Delta w = -\eta \frac{\partial E_p}{\partial w}.    (12.24)
Let’s consider an input transformation s(x, α) controlled by a parameter α. As
always, we require that s is differentiable and that s(x, 0) = x. Now, in addition
to the known labels of the training data, we assume that ∂F(s(x_i, α))/∂α is known at
α = 0 for each point x_i in the training set. To incorporate the invariance property
into G_w(x), we add the following constraint on the derivative:

E_r = \sum_{i=1}^{p} \left( \frac{\partial G_w(s(x_i, \alpha))}{\partial \alpha} - \frac{\partial F(s(x_i, \alpha))}{\partial \alpha} \right)^2_{\alpha=0}    (12.25)

should be small at α = 0. In many pattern classification problems, we are inter-
ested in the local classification invariance property for F(x) with respect to the
transformation s (the classification does not change when the input is slightly
transformed), so we can simplify equation (12.25) to:

E_r = \sum_{i=1}^{p} \left( \frac{\partial G_w(s(x_i, \alpha))}{\partial \alpha} \right)^2_{\alpha=0}    (12.26)

since ∂F(s(x_i, α))/∂α = 0. To minimize this term we can modify the gradient descent
rule to use the energy function

E = \eta E_p + \mu E_r    (12.27)

with the weight update rule:

\Delta w = -\frac{\partial E}{\partial w}.    (12.28)

The learning rates (or regularization parameters) η and μ are tremendously
important, because they determine the tradeoff between learning the invariances
(based on the chosen directional derivatives) versus learning the label itself (i.e.
the zeroth derivative) at each point in the training set.
The local variation of the classification function, which appears in equa-
tion (12.26), can be written as:

\frac{\partial G_w(s(x, \alpha))}{\partial \alpha}\Big|_{\alpha=0} = \frac{\partial G_w(s(x, \alpha))}{\partial s(x, \alpha)} \cdot \frac{\partial s(x, \alpha)}{\partial \alpha}\Big|_{\alpha=0} = \nabla_x G_w(x) \cdot \frac{\partial s(x, \alpha)}{\partial \alpha}\Big|_{\alpha=0}    (12.29)

since s(x, α) = x if α = 0, and where ∇_x G_w(x) is the Jacobian of G_w(x) for
pattern x, and ∂s(x, α)/∂α is the tangent vector associated with transformation
s as described in the previous section. Multiplying the tangent vector by the
Jacobian involves one forward propagation through a “linearized” version of the
network. If α is multi-dimensional, the forward propagation must be repeated
for each tangent vector.
The theory of Lie algebras [11] ensures that compositions of local (small)
transformations correspond to linear combinations of the corresponding tangent
vectors (this result will be discussed further in section 12.4). Consequently, if
E_r(x) = 0 is verified, the network derivative in the direction of a linear combina-
tion of the tangent vectors is equal to the same linear combination of the desired
derivatives. In other words, if the network is successfully trained to be locally
invariant with respect to, say, horizontal translations and vertical translations,
it will be invariant with respect to compositions thereof.
It is possible to devise an efficient algorithm, “tangent prop”, for performing
the weight update (equation (12.28)). It is analogous to ordinary backpropaga-
tion, but in addition to propagating neuron activations, it also propagates the
tangent vectors. The equations can be easily derived from Figure 12.9.
Fig. 12.9. Forward propagated variables (a, x, γ, ξ), and backward propagated vari-
ables (b, y, β, ψ) in the regular network (roman symbols) and the Jacobian (linearized)
network (greek symbols). Converging forks (in the direction in which the signal is
traveling) are sums, diverging forks just duplicate the values.
12.3.1 Local Rule
The forward propagation equation is:
a_i^l = \sum_j w_{ij}^l x_j^{l-1}, \qquad x_i^l = \sigma(a_i^l)    (12.30)

where σ is a nonlinear differentiable function (typically a sigmoid). The forward
propagation starts at the first layer (l = 1), with x^0 being the input layer, and
ends at the output layer (l = L). Similarly, the tangent forward propagation
(tangent prop) is defined by:

\gamma_i^l = \sum_j w_{ij}^l \xi_j^{l-1}, \qquad \xi_i^l = \sigma'(a_i^l)\,\gamma_i^l.    (12.31)

The tangent forward propagation starts at the first layer (l = 1), with ξ^0 being
the tangent vector ∂s(x, α)/∂α, and ends at the output layer (l = L). The tangent
gradient backpropagation can be computed using the chain rule:

\frac{\partial E}{\partial \xi_i^l} = \sum_k \frac{\partial E}{\partial \gamma_k^{l+1}} \frac{\partial \gamma_k^{l+1}}{\partial \xi_i^l}, \qquad \frac{\partial E}{\partial \gamma_i^l} = \frac{\partial E}{\partial \xi_i^l} \frac{\partial \xi_i^l}{\partial \gamma_i^l}    (12.32)

\beta_i^l = \sum_k \psi_k^{l+1} w_{ki}^{l+1}, \qquad \psi_i^l = \beta_i^l\, \sigma'(a_i^l).    (12.33)

The tangent backward propagation starts at the output layer (l = L), with ξ^L
being the network variation ∂G_w(s(x, α))/∂α, and ends at the input layer. Similarly,
the gradient backpropagation equation is:

\frac{\partial E}{\partial x_i^l} = \sum_k \frac{\partial E}{\partial a_k^{l+1}} \frac{\partial a_k^{l+1}}{\partial x_i^l}, \qquad \frac{\partial E}{\partial a_i^l} = \frac{\partial E}{\partial x_i^l} \frac{\partial x_i^l}{\partial a_i^l} + \frac{\partial E}{\partial \xi_i^l} \frac{\partial \xi_i^l}{\partial a_i^l}    (12.34)

b_i^l = \sum_k y_k^{l+1} w_{ki}^{l+1}, \qquad y_i^l = b_i^l\, \sigma'(a_i^l) + \beta_i^l\, \sigma''(a_i^l)\, \gamma_i^l.    (12.35)

The standard backward propagation starts at the output layer (l = L), with
x^L = G_w(x^0) being the network output, and ends at the input layer. Finally,
the weight update is:

\Delta w_{ij}^l = -\frac{\partial E}{\partial a_i^l} \frac{\partial a_i^l}{\partial w_{ij}^l} - \frac{\partial E}{\partial \gamma_i^l} \frac{\partial \gamma_i^l}{\partial w_{ij}^l}    (12.36)

\Delta w_{ij}^l = -\,y_i^l\, x_j^{l-1} - \psi_i^l\, \xi_j^{l-1}.    (12.37)
The computation requires one forward propagation and one backward propaga-
tion per pattern and per tangent vector during training. After the network is
trained, it is approximately locally invariant with respect to the chosen trans-
formation. After training, the evaluation of the learned function is in all ways
identical to a network which is not trained for invariance (except that the weights
have different values).
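For concreteness, the sketch below applies equations (12.30)-(12.37) to a one-hidden-layer sigmoid network, with the squared output error weighted by η and the squared output derivative weighted by μ as in equation (12.27); the variable names mirror the equations, the desired directional derivative is taken to be zero (local invariance), and the extra step size lr is our addition rather than part of the original formulation:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tangent_prop_step(W1, W2, x0, target, xi0, eta=1.0, mu=1.0, lr=0.01):
    """One on-line tangent-prop update for a one-hidden-layer sigmoid network
    (a sketch of eqs. (12.30)-(12.37); xi0 is the input tangent vector and the
    desired directional derivative of the output is taken to be zero)."""
    # forward pass (12.30)
    a1 = W1 @ x0;  x1 = sigmoid(a1)
    a2 = W2 @ x1;  x2 = sigmoid(a2)
    s1p = x1 * (1 - x1);  s1pp = s1p * (1 - 2 * x1)     # sigma', sigma'' at a1
    s2p = x2 * (1 - x2);  s2pp = s2p * (1 - 2 * x2)     # sigma', sigma'' at a2
    # tangent forward pass (12.31)
    g1 = W1 @ xi0;  xi1 = s1p * g1
    g2 = W2 @ xi1;  xi2 = s2p * g2
    # tangent backward pass (12.32)-(12.33): psi = dE/dgamma, beta = dE/dxi
    psi2 = 2 * mu * xi2 * s2p
    beta1 = W2.T @ psi2
    psi1 = beta1 * s1p
    # standard backward pass (12.34)-(12.35), including the tangent cross-term
    y2 = 2 * eta * (x2 - target) * s2p + 2 * mu * xi2 * s2pp * g2
    b1 = W2.T @ y2
    y1 = b1 * s1p + beta1 * s1pp * g1
    # weight updates (12.36)-(12.37), scaled by an overall step size lr
    W2 -= lr * (np.outer(y2, x1) + np.outer(psi2, xi1))
    W1 -= lr * (np.outer(y1, x0) + np.outer(psi1, xi0))
    return W1, W2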
Fig. 12.10. Generalization performance curve as a function of the training set size for
the tangent prop and the backprop algorithms
12.3.2 Results
Two experiments illustrate the advantages of tangent prop. The first experiment
is a classification task, using a small (linearly separable) set of 480 binary im-
ages of handwritten digits. The training sets consist of 10, 20, 40, 80, 160 or
320 patterns, and the test set contains 160 patterns. The patterns are smoothed
using a Gaussian kernel with standard deviation of one half pixel. For each of
the training set patterns, the tangent vectors for horizontal and vertical trans-
lation are computed. The network has two hidden layers with locally connected
shared weights, and one output layer with 10 units (5194 connections, 1060 free
parameters) [19]. The generalization performance as a function of the training
set size for traditional backprop and tangent prop are compared in Figure 12.10.
We have conducted additional experiments in which we implemented not only
translations but also rotations, expansions and hyperbolic deformations. This
set of 6 generators is a basis for all linear transformations of coordinates for
two dimensional images. It is straightforward to implement other generators in-
cluding gray-level-shifting, “smooth” segmentation, local continuous coordinate
transformations and independent image segment transformations.
The next experiment is designed to show that in applications where data is
highly correlated, tangent prop yields a large speed advantage. Since the dis-
tortion model implies adding lots of highly correlated data, the advantage of
tangent prop over the distortion model becomes clear.
The task is to approximate a function that has plateaus at three locations. We
want to enforce local invariance near each of the training points (Figure 12.11,
bottom). The network has one input unit, 20 hidden units and one output unit.
Fig. 12.11. Comparison of the distortion model (left column) and tangent prop (right
column). The top row gives the learning curves (error versus number of sweeps through
the training set). The bottom row gives the final input-output function of the network;
the dashed line is the result for unadorned back prop.
Two strategies are possible: either generate a small set of training points cov-
ering each of the plateaus (open squares on Figure 12.11 bottom), or generate
one training point for each plateau (closed squares), and enforce local invariance
around them (by setting the desired derivative to 0). The training set of the
former method is used as a measure of performance for both methods. All pa-
rameters were adjusted for approximately optimal performance in all cases. The
learning curves for both models are shown in Figure 12.11 (top). Each sweep
through the training set for tangent prop is a little faster since it requires only
6 forward propagations, while it requires 9 in the distortion model. As can be
seen, stable performance is achieved after 1300 sweeps for the tangent prop,
versus 8000 for the distortion model. The overall speedup is therefore about 10.
Tangent prop in this example can take advantage of a very large regularization
term. The distortion model is at a disadvantage because the only parameter
that effectively controls the amount of regularization is the magnitude of the
distortions, and this cannot be increased to large values because the right answer
is only invariant under small distortions.
12.3.3 How to Make Tangent Prop Work
Large Network Capacity: Relatively few experiments have been done with
tangent propagation. It is clear, however, that the invariance constraint can be
extremely beneficial. If the network does not have enough capacity, it will not
benefit from the extra knowledge introduced by the invariance.
Interleaving of the Tangent Vectors: Since the tangent vectors introduce
even more correlation inside the training set, a substantial speed up can be
obtained by alternating a regular forward and backward propagation with a
tangent forward and backward propagation (even if there are several tangent
vectors, only one is used at each pattern). For instance, if there were 3 tangent
vectors, the training sequence could be:
x_1,\; t_1(x_1),\; x_2,\; t_2(x_2),\; x_3,\; t_3(x_3),\; x_4,\; t_1(x_4),\; x_5,\; t_2(x_5),\; \ldots    (12.38)
where x_i means a forward and backward propagation for pattern i and t_j(x_i)
means a tangent forward and backward propagation of tangent vector j of pat-
tern i. With such interleaving, the learning converges faster than grouping all the
tangent vectors together. Of course, this only makes sense with on-line updates
as opposed to batch updates.
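A trivial helper (ours, not the chapter's) that generates such an interleaved schedule for on-line training:

def interleaved_schedule(patterns, n_tangent):
    """Yield (pattern_index, tangent_index) pairs in the interleaved order of
    eq. (12.38): each regular pass is followed by one tangent pass, cycling
    through the tangent vectors; tangent_index is None for the regular pass."""
    for i, _ in enumerate(patterns):
        yield i, None              # regular forward/backward propagation
        yield i, i % n_tangent     # tangent propagation with a single tangent vector

For 5 patterns and 3 tangent vectors, list(interleaved_schedule(range(5), 3)) reproduces (with 0-based indices) the ordering of equation (12.38).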
12.4 Tangent Vectors
In this section, we consider the general paradigm for transformation invariance
and for the tangent vectors which have been used in the two previous sections.
Before we introduce each transformation and their corresponding tangent vec-
tors, a brief explanation is given of the theory behind the practice. There are
two aspects to the problem. First it is possible to establish a formal connection
between groups of transformations of the input space (such as translation, rota-
tion, etc. of ℝ²) and their effect on a functional of that space (such as a mapping
of ℝ² to ℝ, which may represent an image, in continuous form). The theory of
Lie groups and Lie algebras [6] allows us to do this. The second problem has to do
with coding. Computer images are finite vectors of discrete variables. How can a
theory which was developed for differentiable functionals of ℝ² to ℝ be applied
to these vectors? We first give a brief explanation of the theorems of Lie groups
and Lie algebras which are applicable to pattern recognition. Next, we explore
solutions to the coding problem. Finally some examples of transformation and
coding are given for particular applications.
12.4.1 Lie Groups and Lie Algebras
Consider an input space I (the plane ℝ² for example) and a differentiable func-
tion f which maps points of I to ℝ:

f : X \in I \longmapsto f(X) \in \mathbb{R}.    (12.39)

The function f(X) = f(x, y) can be interpreted as the continuous (defined for
all points of ℝ²) equivalent of the discrete computer image P[i, j].
Next, consider a family of transformations t_α, parameterized by α, which
maps bijectively a point of I to a point of I:

t_\alpha : X \in I \longmapsto t_\alpha(X) \in I.    (12.40)

We assume that t_α is differentiable with respect to α and X, and that t_0 is the
identity. For example t_α could be the group of affine transformations of ℝ²:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x + \alpha_1 x + \alpha_2 y + \alpha_5 \\ \alpha_3 x + y + \alpha_4 y + \alpha_6 \end{pmatrix} \quad \text{with} \quad \begin{vmatrix} 1+\alpha_1 & \alpha_2 \\ \alpha_3 & 1+\alpha_4 \end{vmatrix} \neq 0.    (12.41)
This is a Lie group⁴ with 6 parameters. Another example is the group of direct
isometry:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x\cos\theta - y\sin\theta + a \\ x\sin\theta + y\cos\theta + b \end{pmatrix}    (12.42)
which is a Lie group with 3 parameters.
We now consider the functional s(f, α), defined by

s(f, \alpha) = f \circ t_\alpha^{-1}.    (12.43)

This functional s, which takes another functional f as an argument, should
remind the reader of Figure 12.2, where P, the discrete equivalent of f, is the
argument of s.
The Lie algebra associated with the action of t_α on f is the space generated
by the m local transformations L_{α_i} of f defined by:

L_{\alpha_i}(f) = \frac{\partial s(f, \alpha)}{\partial \alpha_i}\Big|_{\alpha=0}.    (12.44)
We can now write the local approximation of s as:

s(f, \alpha) = f + \alpha_1 L_{\alpha_1}(f) + \alpha_2 L_{\alpha_2}(f) + \cdots + \alpha_m L_{\alpha_m}(f) + o(\alpha^2)(f).    (12.45)
This equation is the continuous equivalent of equation (12.2) used in the intro-
duction.
The following example illustrates how L_{α_i} can be computed from t_α. Let’s
consider the group of direct isometry defined in equation (12.42) (with parameter
α = (θ, a, b) as before, and X = (x, y)):

s(f, \alpha)(X) = f\bigl((x - a)\cos\theta + (y - b)\sin\theta,\; -(x - a)\sin\theta + (y - b)\cos\theta\bigr).    (12.46)
If we differentiate around α = (0, 0, 0) with respect to θ, we obtain

\frac{\partial s(f, \alpha)}{\partial \theta}(X) = y\,\frac{\partial f}{\partial x}(x, y) + (-x)\,\frac{\partial f}{\partial y}(x, y)    (12.47)
⁴ A Lie group is a group that is also a differentiable manifold such that the differen-
tiable structure is compatible with the group structure.
that is

L_\theta = y\,\frac{\partial}{\partial x} + (-x)\,\frac{\partial}{\partial y}.    (12.48)
The transformations L_a = -\frac{\partial}{\partial x} and L_b = -\frac{\partial}{\partial y} can be obtained in a similar
fashion. All local transformations of the group can be written as

s(f, \alpha) = f + \theta\Bigl(y\,\frac{\partial f}{\partial x} + (-x)\,\frac{\partial f}{\partial y}\Bigr) - a\,\frac{\partial f}{\partial x} - b\,\frac{\partial f}{\partial y} + o(\alpha^2)(f)    (12.49)
which corresponds to a linear combination of the 3 basic operators L_θ, L_a and
L_b⁵. The property which is most important to us is that the 3 operators generate
the whole space of local transformations. The result of applying the operators to
a function f, such as a 2D image for example, is the set of vectors which we have
been calling “tangent vectors” in the previous sections. Each point in the tangent
space corresponds to a unique transformation and, conversely, any transformation
of the Lie group (in the example, all rotations of any angle and center together
with all translations) corresponds to a point in the tangent plane.
12.4.2 Tangent Vectors
The last problem which remains to be solved is the problem of coding. Computer
images, for instance, are coded as a finite set of discrete (even binary) values.
These are hardly the differentiable mappings of I to ℝ which we have been
assuming in the previous subsection.
To solve this problem we introduce a smooth interpolating function C which
maps the discrete vectors to a continuous mapping of I to ℝ. For example, if P
is an image of n pixels, it can be mapped to a continuously valued function f over
ℝ² by convolving it with a two-dimensional Gaussian function g_σ of standard
deviation σ. This is because g_σ is a differentiable mapping of ℝ² to ℝ, and P
can be interpreted as a sum of impulse functions. In the two-dimensional case
we can write the new interpretation of P as:
P(x, y) = \sum_{i,j} P[i][j]\, \delta(x - i)\, \delta(y - j)    (12.50)

where P[i][j] denotes the finite vector of discrete values, as stored in a computer.
The result of the convolution is of course differentiable because it is a sum of
Gaussian functions. The Gaussian mapping is given by:

C_\sigma : P \longmapsto f = P * g_\sigma.    (12.51)

In the two-dimensional case, the function f can be written as:

f(x, y) = \sum_{i,j} P[i][j]\, g_\sigma(x - i, y - j).    (12.52)
⁵ These operators are said to generate a Lie algebra, because on top of the addition
and multiplication by a scalar, there is a special multiplication called “Lie bracket”
defined by [L_1, L_2] = L_1 L_2 - L_2 L_1. In the above example we have [L_θ, L_a] = L_b,
[L_a, L_b] = 0, and [L_b, L_θ] = L_a.
Fig. 12.12. Graphic illustration of the computation of f and two tangent vectors corre-
sponding to L_x = ∂/∂x (X-translation) and L_y = ∂/∂y (Y-translation), from a binary
image I. The Gaussian function g(x, y) = exp(-(x² + y²)/(2σ²)) has a standard deviation
of σ = 0.9 in this example, although its graphic representations (small images on the
right) have been rescaled for clarity.
Other coding functions C can be used, such as cubic spline or even bilinear
interpolation. Bilinear interpolation between the pixels yields a function f which
is differentiable almost everywhere. The fact that the derivatives have two values
at the integer locations (because the bilinear interpolation is different on both
sides of each pixel) is not a problem in practice: just choose one of the two
values.
The Gaussian mapping is preferred for two reasons: First, the smoothing pa-
rameter σ can be used to control the locality of the invariance. This is because
when f is smoother, the local approximation of equation (12.45) is valid for larger
transformations. And second, when combined with the transformation operator
L, the derivative can be applied on the closed form of the Gaussian function.
For instance, if the X-translation operator L = ∂/∂x is applied to f = P * g_σ, the
actual computation becomes:

L_X(f) = \frac{\partial}{\partial x}(P * g_\sigma) = P * \frac{\partial g_\sigma}{\partial x}    (12.53)
because of the differentiation properties of convolution when the support is com-
pact. This is easily done by convolving the original image with the X-derivative of
the Gaussian function g_σ. This operation is illustrated in Figure 12.12. Similarly,
the tangent vector for scaling can be computed with
L_S(f) = \Bigl(x\frac{\partial}{\partial x} + y\frac{\partial}{\partial y}\Bigr)(I * g_\sigma) = x\Bigl(I * \frac{\partial g_\sigma}{\partial x}\Bigr) + y\Bigl(I * \frac{\partial g_\sigma}{\partial y}\Bigr).    (12.54)
This operation is illustrated in Figure 12.13.
Fig. 12.13. Graphic illustration of the computation of the tangent vector T_u = D_x S_x +
D_y S_y (bottom image). In this example the displacement for each pixel is proportional
to the distance of the pixel to the center of the image (D_x(x, y) = x - x_0 and D_y(x, y) =
y - y_0). The two multiplications (horizontal lines) as well as the addition (vertical right
column) are done pixel by pixel.
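In practice the smoothed image and the translation tangent vectors can each be obtained with a single Gaussian (or Gaussian-derivative) convolution; the sketch below uses scipy.ndimage as one possible implementation and assumes axis 0 is y (rows) and axis 1 is x (columns):

import numpy as np
from scipy.ndimage import gaussian_filter

def translation_tangent_vectors(P, sigma=0.9):
    """Smoothed image f = P * g_sigma and the X-/Y-translation tangent vectors,
    obtained by convolving P with the derivatives of the Gaussian (eq. 12.53)."""
    P = np.asarray(P, dtype=float)
    f = gaussian_filter(P, sigma)
    tx = gaussian_filter(P, sigma, order=(0, 1))   # d/dx: derivative along columns
    ty = gaussian_filter(P, sigma, order=(1, 0))   # d/dy: derivative along rows
    return f, tx, ty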
12.4.3 Important Transformations in Image Processing
This section summarizes how to compute the tangent vectors for image process-
ing (in 2D). Each discrete image I_i is convolved with a Gaussian g_σ of standard
deviation σ to obtain a representation of the continuous image f_i, according to
the equation:

f_i = I_i * g_\sigma.    (12.55)
The resulting image f_i will be used in all the computations requiring I_i (except
for computing the tangent vector). For each image I_i, the tangent vectors are
computed by applying the operators corresponding to the transformations of
interest to the expression I_i * g_σ. The result, which can be precomputed, is an
image which is the tangent vector. The following list contains some of the most
useful tangent vectors:
X-translation: This transformation is useful when the classification function
is known to be invariant with respect to the input transformation:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x + \alpha \\ y \end{pmatrix}.    (12.56)

The Lie operator is defined by:

L_X = \frac{\partial}{\partial x}.    (12.57)
Y-translation: This transformation is useful when the classification function
is known to be invariant with respect to the input transformation:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x \\ y + \alpha \end{pmatrix}.    (12.58)

The Lie operator is defined by:

L_Y = \frac{\partial}{\partial y}.    (12.59)
Rotation: This transformation is useful when the classification function is
known to be invariant with respect to the input transformation:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x\cos\alpha - y\sin\alpha \\ x\sin\alpha + y\cos\alpha \end{pmatrix}.    (12.60)

The Lie operator is defined by:

L_R = y\,\frac{\partial}{\partial x} + (-x)\,\frac{\partial}{\partial y}.    (12.61)
Scaling: This transformation is useful when the classification function is known
to be invariant with respect to the input transformation:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x + \alpha x \\ y + \alpha y \end{pmatrix}.    (12.62)

The Lie operator is defined by:

L_S = x\,\frac{\partial}{\partial x} + y\,\frac{\partial}{\partial y}.    (12.63)
Parallel hyperbolic transformation: This transformation is useful when the
classification function is known to be invariant with respect to the input
transformation:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x + \alpha x \\ y - \alpha y \end{pmatrix}.    (12.64)

The Lie operator is defined by:

L = x\,\frac{\partial}{\partial x} - y\,\frac{\partial}{\partial y}.    (12.65)
Diagonal hyperbolic transformation: This transformation is useful when
the classification function is known to be invariant with respect to the input
transformation:
t_\alpha : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x + \alpha y \\ y + \alpha x \end{pmatrix}.    (12.66)

The Lie operator is defined by:

L = y\,\frac{\partial}{\partial x} + x\,\frac{\partial}{\partial y}.    (12.67)
Thickening: This transformation is useful when the classification function is
known to be invariant with respect to variation of thickness. This is known
in morphology as dilation and its inverse, erosion. It is very useful in certain
domains (such as handwritten character recognition) because thickening and
thinning are natural variations which correspond to the pressure applied on a
pen, or to different absorption properties of the ink on the paper. A dilation
(resp. erosion) can be defined as the operation of replacing each value f(x, y)
by the largest (resp. smallest) value of f(x', y') found within a neighborhood
of a certain shape, centered at (x, y). The region is called the structural
element. We will assume that the structural element is a sphere of radius
α. We define the thickening transformation as the function which takes the
function f and generates the function f_α defined by:

f_\alpha(X) = \max_{\|r\| \le \alpha} f(X + r) \quad \text{for } \alpha \ge 0    (12.68)

f_\alpha(X) = \min_{\|r\| \le -\alpha} f(X + r) \quad \text{for } \alpha \le 0.    (12.69)
The derivative of the thickening for α ≥ 0 can be written as:

\lim_{\alpha \to 0} \frac{f_\alpha(X) - f(X)}{\alpha} = \lim_{\alpha \to 0} \frac{\max_{\|r\| \le \alpha} f(X + r) - f(X)}{\alpha}.    (12.70)
f(X) can be put within the max expression because it does not depend on
r. Since α tends toward 0, we can write:

f(X + r) - f(X) = r \cdot \nabla f(X) + O(\|r\|^2) \approx r \cdot \nabla f(X).    (12.71)
The maximum of

\max_{\|r\| \le \alpha} f(X + r) - f(X) = \max_{\|r\| \le \alpha} r \cdot \nabla f(X)    (12.72)

is attained when r and ∇f(X) are co-linear, that is, when

r = \alpha\, \frac{\nabla f(X)}{\|\nabla f(X)\|}    (12.73)
assuming α ≥ 0. It can easily be shown that this equation holds when α is
negative, because we then try to minimize equation (12.69). We therefore
have:

\lim_{\alpha \to 0} \frac{f_\alpha(X) - f(X)}{\alpha} = \|\nabla f(X)\|    (12.74)
which is the tangent vector of interest. Note that this is true for α positive or
negative. The same tangent vector describes both thickening and thinning.
Alternatively, we can use our computation of the displacement r and define
the following transformation of the input:

t_\alpha(f) : \begin{pmatrix} x \\ y \end{pmatrix} \longmapsto \begin{pmatrix} x + \alpha r_x \\ y + \alpha r_y \end{pmatrix}    (12.75)

where

(r_x, r_y) = r = \alpha\, \frac{\nabla f(X)}{\|\nabla f(X)\|}.    (12.76)
This transformation of the input space is different for each pattern f (we do
not have a Lie group of transformations), but the field structure generated
by the (pseudo-Lie) operator is still useful. The operator used to find the
tangent vector is defined by:

L_T(f) = \|\nabla f\|    (12.77)
Fig. 12.14. Illustration of 5 tangent vectors (top: scaling, rotation, axis deformation,
diagonal deformation, thickness deformation), with corresponding displacements (middle)
and transformation effects (bottom). The displacements D_x and D_y are represented in
the form of a vector field. It can be noted that the tangent vector for the thickness
deformation (right column) corresponds to the norm of the gradient of the gray-level image.
which means that the tangent vector image is obtained by computing the
normalized gray level gradient of the image at each point (the gradient at
each point is normalized).
The last 5 transformations are depicted in Figure 12.14 with the tangent vector.
The last operator corresponds to a thickening or thinning of the image. This un-
usual transformation is extremely useful for handwritten character recognition.
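Putting the listed operators together, all of these tangent-vector images can be precomputed from the two Gaussian-derivative convolutions; the arrangement below is one possible sketch (coordinates are measured from the image center, and image rows grow downward, so signs may need flipping depending on the chosen orientation), not the chapter's reference code:

import numpy as np
from scipy.ndimage import gaussian_filter

def tangent_vector_images(I, sigma=0.9):
    """Tangent-vector images for the transformations listed above, computed
    from the Gaussian-derivative convolutions (axis 0 = rows = y, axis 1 = columns = x)."""
    I = np.asarray(I, dtype=float)
    fx = gaussian_filter(I, sigma, order=(0, 1))   # I * dg/dx
    fy = gaussian_filter(I, sigma, order=(1, 0))   # I * dg/dy
    h, w = I.shape
    y, x = np.mgrid[0:h, 0:w].astype(float)
    x -= (w - 1) / 2.0
    y -= (h - 1) / 2.0
    return {
        "x-translation":       fx,                 # L_X = d/dx
        "y-translation":       fy,                 # L_Y = d/dy
        "rotation":            y * fx - x * fy,    # L_R = y d/dx - x d/dy
        "scaling":             x * fx + y * fy,    # L_S = x d/dx + y d/dy
        "parallel hyperbolic": x * fx - y * fy,
        "diagonal hyperbolic": y * fx + x * fy,
        "thickening":          np.hypot(fx, fy),   # |grad f| (eq. 12.74)
    }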
12.5 Conclusion
The basic tangent distance algorithm is quite easy to understand and implement.
Even though hardly any preprocessing or learning is required, the performance
is surprisingly good and compares well to the best competing algorithms. We
believe that the main reason for this success is its ability to incorporate a priori
knowledge into the distance measure. The only algorithm which performed better
than tangent distance on one of the three databases was boosting, which has
similar a priori knowledge about transformations built into it.
Many improvements are of course possible. For instance, smart preprocessing
can allow us to measure the tangent distance in a more appropriate “feature”
space, instead of the original pixel space. In image classification, for example, the
features could be horizontal and vertical edges. This would most likely further
improve the performance⁶. The only requirement is that the preprocessing must
be differentiable, so that the tangent vectors can be computed (propagated) into
the feature space.
It is also straightforward to modify more complex algorithms such as LVQ
(learning vector quantization) to use a tangent distance. In this case even the
tangent vectors can be trained. The derivation has been done for batch training
[13] and for on-line training [23] of the tangent vectors. When such training
is performed, the a priori knowledge comes from other constraints imposed on
the tangent vectors (for instance how many tangent vectors are allowed, which
classes of transformation do they represent, etc).
Finally, many optimizations which are commonly used in distance-based algo-
rithms can be used just as successfully with tangent distance to speed up computa-
tion. The multi-resolution approach has already been tried successfully [25]. Other
methods like “multi-edit-condensing” [1, 30] and K-d trees [4] are also possible.
The main advantage of tangent distance is that it is a modification of a stan-
dard distance measure to allow it to incorporate a priori knowledge that is
specific to the problem at hand. Any algorithm based on a common distance
measure (as is often the case in classification, vector quantization, prediction,
etc.) can potentially benefit from a more problem-specific distance. Many of
these “distance based” algorithms do not require any learning, which means that
they can be adapted instantly by just adding new patterns in the database. These
additions are leveraged by the a priori knowledge put in the tangent distance.
⁶ There may be an additional cost for computing the tangent vectors in the feature
space if the feature space is very complex.
The two drawbacks of tangent distance are its memory and computational
requirements. The most computationally and memory efficient algorithms gen-
erally involve learning [20]. Fortunately, the concept of tangent vectors can also
be used in learning. This is the basis for the tangent propagation algorithm.
The concept is quite simple: instead of learning a classification function from
examples of its values, one can also use information about its derivatives. This
information is provided by the tangent vectors. Unfortunately, not many exper-
iments have been done in this direction. The two main problems with tangent
propagation are that the capacity of the learning machine has to be adjusted
to incorporate the additional information pertinent to the tangent vectors, and
that training time must be increased. After training, the classification time and
complexity are unchanged, but the classifier’s performance is improved.
To a first approximation, using tangent distance or tangent propagation is like
having a much larger database. If the database was plenty large to begin with,
tangent distance or tangent propagation would not improve the performance.
To a better approximation, tangent vectors are like using a distortion model to
magnify the size of the training set. In many cases, using tangent vectors will be
preferable to collecting (and labeling!) vastly more training data, and preferable
(especially for memory-based classifiers) to dealing with all the data generated
by the distortion model. Tangent vectors provide a compact and powerful rep-
resentation of a priori knowledge which can easily be integrated in the most
popular algorithms.
Acknowledgement. P.S. and Y.L. gratefully acknowledge NSF grant No. INT-
9726745.
References
[1] Devijver, P.A., Kittler, J.: Pattern Recognition: A Statistical Approach. Prentice-
Hall, Englewood Cliffs (1982)
[2] Aho, A.V., Hopcroft, J.E., Ullman, J.D.: Data Structures and Algorithms. Addison-
Wesley (1983)
[3] Bottou, L., Vapnik, V.N.: Local learning algorithms. Neural Computation 4(6),
888–900 (1992)
[4] Broder, A.J.: Strategies for efficient incremental nearest neighbor search. Pattern
Recognition 23, 171–178 (1990)
[5] Broomhead, D.S., Lowe, D.: Multivariable functional interpolation and adaptive
networks. Complex Systems 2, 321–355 (1988)
[6] Choquet-Bruhat, Y., DeWitt-Morette, C., Dillard-Bleick, M.: Analysis, Manifolds
and Physics. North-Holland, Amsterdam (1982)
[7] Cortes, C., Vapnik, V.: Support vector networks. Machine Learning 20, 273–297
(1995)
[8] Dasarathy, B.V.: Nearest Neighbor (NN) Norms: NN Pattern classification Tech-
niques. IEEE Computer Society Press, Los Alamitos (1991)
[9] Drucker, H., Schapire, R., Simard, P.Y.: Boosting performance in neural networks.
International Journal of Pattern Recognition and Artificial Intelligence 7(4), 705–
719 (1993)
[10] Fukunaga, K., Flick, T.E.: An optimal global nearest neighbor metric. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 6(3), 314–318 (1984)
[11] Gilmore, R.: Lie Groups, Lie Algebras and some of their Applications. Wiley, New
York (1974)
[12] Hastie, T., Kishon, E., Clark, M., Fan, J.: A model for signature verification.
Technical Report 11214-910715-07TM, AT&T Bell Laboratories (July 1991)
[13] Hastie, T., Simard, P.Y.: Metrics and models for handwritten character recogni-
tion. Statistical Science 13 (1998)
[14] Hastie, T.J., Tibshirani, R.J.: Generalized Linear Models. Chapman and Hall,
London (1990)
[15] Hinton, G.E., Williams, C.K.I., Revow, M.D.: Adaptive elastic models for hand-
printed character recognition. In: Advances in Neural Information Processing Sys-
tems, pp. 512–519. Morgan Kaufmann Publishers (1992)
[16] Hoerl, A.E., Kennard, R.W.: Ridge regression: Biased estimation for non-
orthogonal problems. Technometrics 12, 55–67 (1970)
[17] Kohonen, T.: Self-organization and associative memory. Springer Series in Infor-
mation Sciences, vol. 8. Springer (1984)
[18] Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In:
Touretzky, D. (ed.) Advances in Neural Information Processing Systems, vol. 2,
Morgan Kaufmann, Denver (1989)
[19] LeCun, Y.: Generalization and network design strategies. In: Pfeifer, R., Schreter,
Z., Fogelman, F., Steels, L. (eds.) Connectionism in Perspective, Zurich, Switzer-
land (1989); Elsevier, An extended version was published as a technical report of
the University of Toronto
[20] LeCun, Y., Jackel, L.D., Bottou, L., Cortes, C., Denker, J.S., Drucker, H., Guyon,
I., Muller, U.A., Sackinger, E., Simard, P., Vapnik, V.: Learning algorithms for
classification: A comparison on handwritten digit recognition. In: Oh, J.H., Kwon,
C., Cho, S. (eds.) Neural Networks: The Statistical Mechanics Perspective, pp.
261–276. World Scientific (1995)
[21] Parzen, E.: On estimation of a probability density function and mode. Ann. Math.
Stat. 33, 1065–1076 (1962)
[22] Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T.: Numerical Recipes.
Cambridge University Press, Cambridge (1988)
[23] Schwenk, H.: The diabolo classifier. Neural Computation (1998) (in press)
[24] Sibson, R.: Studies in the robustness of multidimensional scaling: Procrustes statis-
tics. J. R. Statist. Soc. 40, 234–238 (1978)
[25] Simard, P.Y.: Efficient computation of complex distance metrics using hierarchical
filtering. In: Advances in Neural Information Processing Systems. Morgan Kauf-
mann Publishers (1994)
[26] Sinden, F., Wilfong, G.: On-line recognition of handwritten symbols. Technical
Report 11228-910930-02IM, AT&T Bell Laboratories (June 1992)
[27] Vapnik, V.N.: Estimation of dependences based on empirical data. Springer (1982)
[28] Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative fre-
quencies of events to their probabilities. Th. Prob. and its Applications 17(2),
264–280 (1971)
[29] Vasconcelos, N., Lippman, A.: Multiresolution tangent distance for affine-invariant
classification. In: Advances in Neural Information Processing Systems, vol. 10, pp.
843–849. Morgan Kaufmann Publishers (1998)
[30] Voisin, J., Devijver, P.: An application of the multiedit-condensing technique to
the reference selection problem in a print recognition system. Pattern Recogni-
tion 20(5), 465–474 (1987)
13
Combining Neural Networks and Context-Driven
Search for On-line, Printed Handwriting
Recognition in the Newton
Larry S. Yaeger¹, Brandyn J. Webb², and Richard F. Lyon³
¹ Apple Computer, 5540 Bittersweet Rd., Beanblossom, IN 46160
² The Future, 4578 Fieldgate Rd., Oceanside, CA 92056
³ Foveonics, Inc., 10131-B Bubb Rd., Cupertino, CA 95014
larryy@pobox.com
http://www.beanblossom.in.us/larryy/
Abstract. While on-line handwriting recognition is an area of
long-standing and ongoing research, the recent emergence of portable,
pen-based computers has focused urgent attention on usable, practi-
cal solutions. We discuss a combination and improvement of classical
methods to produce robust recognition of hand-printed English text,
for a recognizer shipping in new models of Apple Computer’s New-
ton MessagePad® and eMate®. Combining an artificial neural network
(ANN), as a character classifier, with a context-driven search over seg-
mentation and word recognition hypotheses provides an effective recog-
nition system. Long-standing issues relative to training, generalization,
segmentation, models of context, probabilistic formalisms, etc., need to
be resolved, however, to get excellent performance. We present a num-
ber of recent innovations in the application of ANNs as character classi-
fiers for word recognition, including integrated multiple representations,
normalized output error, negative training, stroke warping, frequency
balancing, error emphasis, and quantized weights. User-adaptation and
extension to cursive recognition pose continuing challenges.
13.1 Introduction
Pen-based hand-held computers are heavily dependent upon fast and accurate
handwriting recognition, since the pen serves as the primary means for inputting
data to such devices. Some earlier attempts at handwriting recognition have
utilized strong, limited language models to maximize accuracy. However, this
approach has proven to be unacceptable in real-world applications, generating
disturbing and seemingly random word substitutions known colloquially within
Apple and Newton as “The Doonesbury Effect”, due to Gary Trudeau’s satirical
look at first-generation Newton recognition performance. But the original hand-
writing recognition technology in the Newton, and the current, much-improved
“Cursive Recognizer” technology, both of which were licensed from ParaGraph
International, Inc., are not the subject of this article.
In Apple’s Advanced Technology Group (aka Apple Research Labs), we pur-
sued a different approach, using bottom-up classification techniques based on
trainable artificial neural networks (ANNs), in combination with comprehensive
but weakly-applied language models. To focus our work on a subproblem that
was tractable enough to lead to usable products in a reasonable time, we initially
restricted the domain to hand-printing, so that strokes are clearly delineated by
pen lifts. By simultaneously providing accurate character-level recognition, dic-
tionaries exhibiting very wide coverage of the language, and the ability to write
entirely outside those dictionaries, we have produced a hand-print recognizer that
some have called the “first usable” handwriting recognition system. The ANN
character classifier required some innovative training techniques to perform its
task well. The dictionaries required large word lists, a regular expression gram-
mar (to describe special constructs such as date, time, phone numbers, etc.), and
a means of combining all these dictionaries into a comprehensive language model.
And well balanced prior probabilities had to be determined for in-dictionary and
out-of-dictionary writing. Together with a maximum-likelihood search engine,
these elements form the basis of the so-called “Print Recognizer”, that was first
shipped in Newton OS 2.0 based MessagePad 120 units in December, 1995, and
has shipped in all subsequent Newton devices. In the MessagePad 2000 and 2100,
despite retaining its label as a “Print Recognizer”, it has been extended to handle
connected characters (as well as a full Western European character set).
There is ample prior work in combining low-level classifiers with dynamic time
warping, hidden Markov models, Viterbi algorithms, and other search strategies
to provide integrated segmentation and recognition for writing [15] and speech
[11]. And there is a rich background in the use of ANNs as classifiers, includ-
ing their use as low-level character classifiers in a higher-level word recognition
system [2]. But these approaches leave a large number of open-ended questions
about how to achieve acceptable (to a real user) levels of performance. In this
paper, we survey some of our experiences in exploring refinements and improve-
ments to these techniques.
13.2 System Overview
Apple’s print recognizer (APR) consists of three conceptual stages – Tentative
Segmentation, Classification, and Context-Driven Search – as indicated in Fig-
ure 13.1. The primary data upon which we operate are simple sequences of (x,y)
coordinate pairs, plus pen-up/down information, thus defining stroke primitives.
The Segmentation stage decides which strokes will be combined to produce seg-
ments - the tentative groupings of strokes that will be treated as possible charac-
ters - and produces a sequence of these segments together with legal transitions
between them. This process builds an implicit graph which is then labeled in
the Classification stage and examined for a maximum likelihood interpretation
in the Search stage. The Classification stage evaluates each segment using the
ANN classifier, and produces a vector of output activations that are used as
letter-class probabilities. The Search stage then uses these class probabilities to-
gether with models of lexical and geometric context to find the N most likely
word or sentence hypotheses.
[Block diagram: (x,y) points and pen-lifts → Tentative Segmentation → character segmentation hypotheses → Neural Network Classifier → character class hypotheses → Search with Context → words]
Fig. 13.1. A simplified block diagram of our hand-print recognizer
13.3 Tentative Segmentation
Character segmentation – the process of deciding which strokes comprise which
characters – is inherently ambiguous. Ultimately this decision must be made,
but, short of writing in boxes, it is impossible to do so (with any accuracy)
in advance, external to the recognition process. Hence the initial segmentation
stage in APR produces multiple, tentative groupings of strokes, and defers the
final segmentation decisions until the search stage, thus integrating those seg-
mentation decisions with the overall recognition process.
APR uses a potentially exhaustive, sequential enumeration of stroke combina-
tions to generate a sequence of viable character-segmentation hypotheses. These
segments are subjected to some obvious constraints (such as “all strokes must be
used” and “no strokes may be used twice”), and some less obvious filters (to cull
“impossible” segmentations for the sake of efficiency). The resulting algorithm
produces the actual segments that will be processed as possible characters, along
with the legal transitions between these segments.
The legal transitions are defined by forward and reverse delays. The forward
delay indicates the next possible segment in the sequence (though later segments
may also be legal), pointing just past the last segment that shares the trailing
stroke of the current segment. The reverse delay indicates the start of the current
batch of segments, all of which share the same leading stroke. Due to the enu-
meration scheme, a segment’s reverse delay is the same as its stroke count minus
one, unless preceding segments (sharing the same leading stroke) were elimi-
nated by the filters mentioned previously. These two simple delay parameters
(per segment) suffice to define an implicit graph of all legal segment transitions.
For a transition from segment number i to segment number j to be legal, the
sum of segment i’s forward delay plus segment j’s reverse delay must be equal to
j - i. Figure 13.2 provides an example of some ambiguous ink and the segments
that might be generated from its strokes, supporting interpretations of “dog”,
“clog”, “cbg”, or even “%g”.
Segment Number   Stroke Count   Forward Delay   Reverse Delay
       1               1              3               0
       2               2              4               1
       3               3              4               2
       4               1              2               0
       5               2              2               1
       6               1              1               0
       7               1              0               0
(The Ink and Segment columns of the original figure, which show the corresponding stroke images, are omitted here.)
Fig. 13.2. Segmentation of strokes into tentative characters or segments
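The legality test itself is a one-liner; the sketch below encodes the delays of Figure 13.2 (field names are ours) and checks two example transitions:

def legal_transition(segments, i, j):
    """Segment j may follow segment i iff forward_delay(i) + reverse_delay(j) == j - i."""
    return segments[i]["forward"] + segments[j]["reverse"] == j - i

# forward/reverse delays of the seven segments in Figure 13.2
segs = {1: {"forward": 3, "reverse": 0}, 2: {"forward": 4, "reverse": 1},
        3: {"forward": 4, "reverse": 2}, 4: {"forward": 2, "reverse": 0},
        5: {"forward": 2, "reverse": 1}, 6: {"forward": 1, "reverse": 0},
        7: {"forward": 0, "reverse": 0}}
print(legal_transition(segs, 1, 4))   # True  (3 + 0 == 4 - 1)
print(legal_transition(segs, 1, 2))   # False (3 + 1 != 2 - 1)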
13.4 Character Classification
The output of the segmentation stage is a stream of segments that are then
passed to an ANN for classification as characters. Except for the architecture
and training specifics detailed below, a fairly standard multi-layer perceptron
trained with error back-propagation (BP) provides the ANN character classifier
at the heart of APR. A large body of prior work exists to indicate the general
applicability of ANN technology as a classifier providing good estimates of a
posteriori probabilities of each class given the input ([5, 12, 11], and others cited
herein). Compelling arguments have been made for why ANNs providing poste-
rior probabilities in a probabilistic recognition formulation should be expected
to outperform other recognition approaches [8], and ANNs have performed well
as the core of speech recognition systems [10].
13.4.1 Representation
A recurring theme in ANN research is the extreme importance of the represen-
tation of the data that is given as input to the network. We experimented with a
variety of input representations, including stroke features both anti-aliased (gray-
scale) and not (binary), and images both anti-aliased and not, and with various
schemes for positioning and scaling the ink within the image input window.
In every case, anti-aliasing was a significant win. This is consistent with oth-
ers’ findings, that ANNs perform better when presented with smoothly varying,
distributed inputs than they do when presented with binary, localized inputs. Al-
most the simplest image representation possible, a non-aspect-ratio-preserving,
expand-to-fill-the-window image (limited only by a maximum scale factor to keep
from blowing dots up to the full window size), together with either a single unit
or a thermometer code (some number of units turned on in sequence to represent
larger values) for the aspect ratio, proved to be the most effective single-classifier
solution. However, the best overall classifier accuracy was ultimately obtained
by combining multiple distinct representations into nearly independent, parallel
classifiers, joined at a final output layer. Hence representation proved not only
to be as important as architecture, but, ultimately, to help define the architec-
ture of our nets. For our final, hand-optimized system, we utilize four distinct
inputs, as indicated in Figure 13.3. The stroke count representation was dithered
(changed randomly at a small probability), to expand the effective training set,
prevent the network from fixating on this simple input, and thereby improve the
network’s ability to generalize. A schematic of the various input representations
can be seen as part of the architecture drawing in Figure 13.4 in the next section.
Input Feature   Resolution   Description
Image           14x14        anti-aliased, scale-to-window, scale-limited
Stroke          20x9         anti-aliased, limited resolution tangent slope, resampled to fixed number of points
Aspect Ratio    1x1          normalized and capped to [0,1]
Stroke Count    5x1          dithered thermometer code
Fig. 13.3. Input representations used in APR
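As a concrete, deliberately simplified sketch of the two scalar encodings in Figure 13.3, the Python fragment below builds a five-unit thermometer code for the stroke count, dithered during training, and a normalized, capped aspect ratio. The dithering probability and the five-stroke cap are illustrative values, since the chapter only states that the stroke count is changed randomly at a small probability:

    import random

    def thermometer(value, n_units=5, max_value=5):
        # Turn on the first k units in sequence to represent larger values.
        k = min(n_units, round(n_units * min(value, max_value) / max_value))
        return [1.0] * k + [0.0] * (n_units - k)

    def stroke_count_input(stroke_count, training=False, dither_prob=0.05):
        if training and random.random() < dither_prob:
            stroke_count = max(1, stroke_count + random.choice([-1, +1]))
        return thermometer(stroke_count)

    def aspect_ratio_input(width, height):
        # Normalized and capped to [0, 1], as in Figure 13.3.
        return [min(1.0, width / height)] if height > 0 else [1.0]

    print(stroke_count_input(3), aspect_ratio_input(12.0, 30.0))
    # [1.0, 1.0, 1.0, 0.0, 0.0] [0.4]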
13.4.2 Architecture
As with representations, we experimented with a variety of architectures, in-
cluding simple fully-connected layers, receptive fields, shared weights, multiple
hidden layers, and, ultimately, multiple nearly independent classifiers tied to a
common output layer. The final choice of architecture includes multiple input
representations, a first hidden layer (separate for each input representation) us-
ing receptive fields, fully connected second hidden layers (again distinct for each
representation), and a final, shared, fully-connected output layer. The simple scalar
features (aspect ratio and stroke count) connect to both second hidden layers.
The final network architecture, for our original English-language system, is
shown in Figure 13.4.
Layers are fully connected, except for the inputs to the first hidden layer on the
image side. This first hidden layer on the image side consists of 8 separate grids,
each of which accepts inputs from the image input grid with its own receptive
field sizes and strides, shown parenthetically in Figure 13.4 as (x-size x y-size; x-
stride, y-stride). A stride is the number of units (pixels) in the input image space
between sequential positionings of the receptive fields, in a given direction. The
7x2 and 2x7 side panels (surrounding the central 7x7 grid) pay special attention
to the edges of the image. The 9x1 and 1x9 side panels specifically examine full-
size vertical and horizontal features, respectively. The 5x5 grid observes features
at a different spatial scale than the 7x7 grid.
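The receptive-field notation can be made concrete with a small sketch. Assuming (an inference from the grid sizes, not stated explicitly in the text) that the central 7x7 hidden grid uses 8x8 receptive fields at a stride of 1 in each direction, the following hypothetical helper enumerates the field positions over the 14x14 image:

    def receptive_field_origins(image_w, image_h, field_w, field_h, stride_x, stride_y):
        # Top-left corner of each receptive field placed over the input image.
        return [(x, y)
                for y in range(0, image_h - field_h + 1, stride_y)
                for x in range(0, image_w - field_w + 1, stride_x)]

    # 8x8 fields stepped by 1 pixel over the 14x14 image give a 7x7 grid of hidden units.
    print(len(receptive_field_origins(14, 14, 8, 8, 1, 1)))  # 49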
Combining the two classifiers at the output layer, rather than, say, averaging
the outputs of completely independent classifiers, allows generic BP to learn
the best way to combine them, which is both convenient and powerful. But
our integrated multiple-representations architecture is conceptually related to
and motivated by prior experiments at combining nets such as Steve Nowlan’s
“mixture of experts” [7].
Fig. 13.4. Final English-language net architecture. (See the text for an explanation of the notation.)
13.4.3 Normalizing Output Error
Analyzing a class of errors involving words that were misrecognized due to per-
haps a single misclassified character, we realized that the net was doing a poor
job of representing second and third choice probabilities. Essentially, the net
was being forced to attempt unambiguous classification of intrinsically ambigu-
ous patterns due to the nature of the mean squared error minimization in BP,
coupled with the typical training vector which consists of all 0’s except for the
single 1 of the target. Lacking any viable means of encoding legitimate proba-
bilistic ambiguity into the training vectors, we decided to try “normalizing” the
“pressure towards 0” vs. the “pressure towards 1” introduced by the output error
during training. We refer to this technique as NormOutErr, due to its normaliz-
ing effect on target versus non-target output error.
We reduce the BP error for non-target classes relative to the target class by
a factor that normalizes the total non-target error seen at a given output unit
relative to the total target error seen at that unit. Assuming a training set with
equal representation of classes, this normalization should then be based on the
number of non-target versus target classes in a typical training vector, or, simply,
the number of output units (minus one). Hence for non-target output units, we
scale the error at each unit by a constant:
    e' = A e
Fig. 13.5. Empirical p = P(correct) vs. net output y histogram for a net trained with A = 0.11 (d = 0.1), with the corresponding theoretical curve
where e is the error at an output unit, and A is defined to be:
    A = 1 / (d (N_outputs - 1))
where N_outputs is the number of output units, and d is our tuning parameter,
typically ranging from 0.1 to 0.2. Error at the target output unit is unchanged.
Overall, this raises the activation values at the output units, due to the re-
duced pressure towards zero, particularly for low-probability samples. Thus the
learning algorithm no longer converges to a least mean-squared error (LMSE)
estimate of p(class|input), but to an LMSE estimate of a nonlinear function
f(p(class|input),A) depending on the factor A by which we reduced the error
pressure toward zero.
Using a simple version of the technique of [3], we worked out what that re-
sulting nonlinear function is. The net will attempt to converge to minimize the
modified quadratic error function
    Ê^2 = p (1 - y)^2 + A (1 - p) y^2
by setting its output y for a particular class to
    y = p / (A - A p + p)
where p = p(class|input), and A is as defined above. For small values of p, the
activation y is increased by a factor of nearly 1/A relative to the conventional
case of y = p, and for high values of p the activation is closer to 1 by nearly a
factor of A. The inverse function, useful for converting back to a probability, is
    p = y A / (y A + 1 - y)
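A minimal sketch of NormOutErr, with illustrative names and no claim to match the production trainer, is the following: the back-propagated error at each non-target output unit is multiplied by A = 1 / (d (N_outputs - 1)), the target unit's error is left untouched, and the inverse function above converts a trained net's activation back to a probability.

    import numpy as np

    def normouterr_errors(y, target_index, d=0.1):
        # y: vector of output activations for one pattern; returns the scaled error vector.
        n_outputs = len(y)
        A = 1.0 / (d * (n_outputs - 1))
        t = np.zeros(n_outputs)
        t[target_index] = 1.0
        e = t - y                        # conventional BP output error
        scale = np.full(n_outputs, A)    # reduced "pressure towards 0" at non-target units
        scale[target_index] = 1.0        # full error at the target unit
        return scale * e

    def activation_to_probability(y, d, n_outputs):
        # Invert the learned nonlinearity: p = yA / (yA + 1 - y).
        A = 1.0 / (d * (n_outputs - 1))
        return (y * A) / (y * A + 1.0 - y)

For a 95-class net and d = 0.1 this gives A of about 0.11, the value used in Figure 13.5.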
Fig. 13.6. Character and word error rates for two different values of NormOutErr (d). A value of 0.0 disables NormOutErr, yielding normal BP (character error 9.5%, word error 31.6%). The unusually high value of 0.8 (A = 0.013) produces nearly equal pressures towards 0 and 1 (character error 12.4%, word error 22.7%).
We verified the fit of this function by looking at histograms of character-level
empirical percentage-correct versus y, as in Figure 13.5.
Even for this moderate amount of output error normalization, it is clear that
the lower-probability samples have their output activations raised significantly,
relative to the 45° line that A = 1 yields.
The primary benefit derived from this technique is that the net does a much
better job of representing second and third choice probabilities, and low proba-
bilities in general. Despite a small drop in top choice character accuracy when
using NormOutErr, we obtain a very significant increase in word accuracy by this
technique. Figure 13.6 shows an exaggerated example of this effect, for an atyp-
ically large value of d (0.8), which overly penalizes character accuracy; however,
the 30% decrease in word error rate is normal for this technique. (Note: These
data are from a multi-year-old experiment, and are not necessarily representative
of current levels of performance on any absolute scale.)
13.4.4 Negative Training
The previously discussed inherent ambiguities in character segmentation neces-
sarily result in the generation and testing of a large number of invalid segments.
During recognition, the network must classify these invalid segments just as it
would any valid segment, with no knowledge of which are valid or invalid. A sig-
nificant increase in word-level recognition accuracy was obtained by performing
negative training with these invalid segments. This consists of presenting invalid
segments to the net during training, with all-zero target vectors. We retain con-
trol over the degree of negative training in two ways. First is a negative-training
factor (ranging from 0.2 to 0.5) that modulates the learning rate (equivalently
by modulating the error at the output layer) for these negative patterns. This
reduces the impact of negative training on positive training, thus modulating the
impact on characters that specifically look like elements of multi-stroke charac-
ters (e.g., I, 1, l, o, O, 0). Secondly, we control a negative-training probability
(ranging between 0.05 and 0.3), which determines the probability that a partic-
ular negative sample will actually be trained on (for a given presentation). This
both reduces the overall impact of negative training, and significantly reduces
training time, since invalid segments are more numerous than valid segments.
As with NormOutErr, this modification hurts character-level accuracy a little
bit, but helps word-level accuracy a lot.
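The two negative-training controls can be expressed as a short, hedged sketch; the training-loop interface (net.backprop, net.n_outputs) and the particular parameter values are placeholders, not the actual APR implementation:

    import random

    NEGATIVE_TRAIN_FACTOR = 0.3   # modulates the learning rate for negative patterns (0.2-0.5)
    NEGATIVE_TRAIN_PROB = 0.1     # chance an invalid segment is trained on at all (0.05-0.3)

    def train_on_segment(net, segment, target_index, learning_rate):
        if target_index is None:                       # invalid ("negative") segment
            if random.random() > NEGATIVE_TRAIN_PROB:  # usually skip it entirely
                return
            target = [0.0] * net.n_outputs             # all-zero target vector
            net.backprop(segment, target, learning_rate * NEGATIVE_TRAIN_FACTOR)
        else:                                          # valid segment: normal training
            target = [0.0] * net.n_outputs
            target[target_index] = 1.0
            net.backprop(segment, target, learning_rate)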
13.4.5 Stroke Warping
During training (but not during recognition), we produce random variations
in stroke data, consisting of small changes in skew, rotation, and x and y lin-
ear and quadratic scalings. This produces alternate character forms that are
consistent with stylistic variations within and between writers, and induces an
explicit aspect ratio and rotation invariance within the framework of standard
back-propagation. The amounts of each distortion to apply were chosen through
cross-validation experiments, as just the amount needed to yield optimum gen-
eralization. (Cross-validation is a standard technique for early stopping of ANN
training, to prevent over-learning of the training set, and thus reduced accu-
racy on new data outside that training set. The technique consists of keeping
aside some subset of the available data, the cross-validation set, and testing on
it at some interval, but never training on it, and then stopping the training
when accuracy ceases to improve on this cross-validation set, despite the fact
that accuracy might continue to improve on the training set.) We chose relative
amounts of the various transformations by testing for optimal final, converged
accuracy on a cross-validation set. We then increased the amount of all stroke
warping being applied to the training set, just to the point at which accuracy
on the training set ceased to diverge from accuracy on the cross-validation set.
We also examined a number of such samples by eye to verify that they rep-
resent a natural range of variation. A small set of such variations is shown in
Figure 13.7.
Fig. 13.7. A few random stroke warpings of the same original “m” data
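A hedged sketch of the kind of distortion involved follows; the distortion ranges below are invented for illustration, since the chapter chose its amounts by cross-validation, and the quadratic scaling terms are omitted for brevity:

    import math, random

    def warp_strokes(strokes):
        # strokes: list of strokes, each a list of (x, y) points roughly centered on the origin.
        rot = random.uniform(-0.1, 0.1)        # small rotation, in radians
        skew = random.uniform(-0.1, 0.1)       # small skew
        sx = 1.0 + random.uniform(-0.1, 0.1)   # linear x scaling
        sy = 1.0 + random.uniform(-0.1, 0.1)   # linear y scaling
        cos_r, sin_r = math.cos(rot), math.sin(rot)
        warped = []
        for stroke in strokes:
            new_stroke = []
            for x, y in stroke:
                x, y = x + skew * y, y                                # skew
                x, y = cos_r * x - sin_r * y, sin_r * x + cos_r * y   # rotate
                new_stroke.append((sx * x, sy * y))                   # scale
            warped.append(new_stroke)
        return warped

Because the transformation is applied to the original coordinate data, each presentation of a training character can be warped uniquely and cheaply.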
Our stroke warping scheme is somewhat related to the ideas of Tangent Dist
and Tangent Prop [14, 13] (see chapter 12), in terms of the use of predetermined
families of transformations, but we believe it is much easier to implement. It is
also somewhat distinct in applying transformations on the original coordinate
data, as opposed to using distortions of images. The voice transformation scheme
of [4] is also related, but they use a static replication of the training set through a
small number of transformations, rather than dynamic random transformations
of an essentially infinite variety.
13.4.6 Frequency Balancing
Training data from natural English words and phrases exhibit very non-uniform
priors for the various character classes, and ANNs readily model these priors.
However, as with NormOutErr, we find that reducing the effect of these priors
on the net, in a controlled way, and thus forcing the net to allocate more of its
resources to low-frequency, low-probability classes is of significant benefit to the
overall word recognition process. To this end, we explicitly (partially) balance
the frequencies of the classes during training. We do this by probabilistically
skipping and repeating patterns, based on a precomputed repetition factor. Each
presentation of a repeated pattern is “warped” uniquely, as discussed previously.
To compute the repetition factor for a class i, we first compute a normalized
frequency of that class:
    F_i = S_i / S̄
where S_i is the number of samples in class i, and S̄ is the average number of
samples over all classes, computed in the obvious way:
    S̄ = (1/C) Σ_{i=1..C} S_i
with C being the number of classes. Our repetition factor is then defined to be:
    R_i = (a / F_i)^b
with a and b being adjustable controls over the amount of skipping vs. repeating
and the degree of prior normalization, respectively. Typical values of a range
from 0.2 to 0.8, while b ranges from 0.5 to 0.9. The factor a < 1 lets us do more
skipping than repeating; e.g., for a = 0.5, classes with relative frequency equal to
half the average will neither skip nor repeat; more frequent classes will skip, and
less frequent classes will repeat. A value of 0.0 for b would do nothing, giving
R_i = 1.0 for all classes, while a value of 1.0 would provide “full” normalization.
A value of b somewhat less than one seems to be the best choice, letting the net
keep some bias in favor of classes with higher prior probabilities.
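A small sketch of the repetition-factor computation and its probabilistic use follows; the class counts and the a, b values below are merely illustrative, though they fall within the quoted ranges:

    import random

    def repetition_factors(class_counts, a=0.5, b=0.7):
        # F_i = S_i / S_bar,  R_i = (a / F_i) ** b
        s_bar = sum(class_counts.values()) / len(class_counts)
        return {c: (a / (s / s_bar)) ** b for c, s in class_counts.items()}

    def presentations(pattern_class, rep_factors):
        # How many times to present a pattern this epoch (0 or more).
        r = rep_factors[pattern_class]
        n = int(r)                       # guaranteed presentations
        if random.random() < r - n:      # fractional part handled probabilistically
            n += 1
        return n

    reps = repetition_factors({"e": 12000, "t": 9000, "q": 150, "z": 90})
    # Frequent classes get R < 1 (sometimes skipped); rare classes get R > 1 (repeated).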
This explicit prior-bias reduction is conceptually related to Lippmann’s [8]
and Morgan and Bourlard’s [10] recommended method for converting from the
net’s estimate of posterior probability, p(class|input), to the value needed in an
HMM or Viterbi search, p(input|class), which is to divide by p(class) priors. Us-
ing that technique, however, should produce noisier estimates for low frequency
classes, due to the divisions by low frequencies, resulting in a set of estimates
that are not really optimized in a LMSE sense (as the net outputs are). In ad-
dition, output activations that are naturally bounded between 0 and 1, due to
the sigmoid, convert to potentially very large probability estimates, requiring a
re-normalization step. Our method of frequency balancing during training elim-
inates both of these concerns. Perhaps more significantly, frequency balancing
also allows the standard BP training process to dedicate more network resources
to the classification of the lower-frequency classes, though we have no current
method for characterizing or quantifying this benefit.
13.4.7 Error Emphasis
While frequency balancing corrects for under-represented classes, it cannot ac-
count for under-represented writing styles. We utilize a conceptually related
probabilistic skipping of patterns, but this time for just those patterns that the
net correctly classifies in its forward/recognition pass, as a form of “error empha-
sis”, to address this problem. We define a correct-train probability (ranging from
0.1 to 1.0) that is used as a biased coin to determine whether a particular pat-
tern, having been correctly classified, will also be used for the backward/training
pass or not. This only applies to correctly segmented, or “positive” patterns, and
misclassified patterns are never skipped.
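A one-line version of this control, with an invented function name, makes the logic explicit:

    import random

    def should_backprop(correctly_classified, correct_train_prob):
        # Applies only to correctly segmented ("positive") patterns;
        # misclassified patterns are never skipped.
        return (not correctly_classified) or (random.random() < correct_train_prob)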
Especially during early stages of training, we set this parameter fairly low
(around 0.1), thus concentrating most of the training time and the net’s learning
capability on patterns that are more difficult to correctly classify. This is the only
way we were able to get the net to learn to correctly classify unusual character
variants, such as a 3-stroke “5” as written by only one training writer.
Variants of this scheme are possible in which misclassified patterns would be
repeated, or different learning rates would apply to correctly and incorrectly
classified patterns. It is also related to techniques that use a training subset,
from which easily-classified patterns are replaced by randomly selected patterns
from the full training set [6].
13.4.8 Annealing
Though some discussions of back-propagation espouse explicit formulae for mod-
ulating the learning rate over time, many seem to assume the use of a single,
fixed learning rate. We view the stochastic back-propagation process as a kind of
simulated annealing, with a learning rate starting very high and decreasing only
slowly to a very low value. But rather than using any prespecified formula to
decelerate learning, the rate at which the learning rate decreases is determined
by the dynamics of the learning process itself. We typically start with a rate near
1.0 and reduce the rate by a multiplicative decay factor of 0.9 until it gets down
to about 0.001. The rate decay factor is applied following any epoch in which the
total squared error increased on the training set, relative to the previous epoch.
This “total squared error” is summed over all output units and over all patterns
in one full epoch, and normalized by those counts. So even though we are using
“online” or stochastic gradient descent, we have a measure of performance over
whole epochs that can be used to guide the “annealing” of the learning rate. Re-
peated tests indicate that this approach yields better results than low (or even
moderate) initial learning rates, which we speculate to be related to a better
ability to escape local minima.
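A minimal sketch of this error-driven schedule follows; train_one_epoch is an assumed helper that runs one stochastic epoch at the given rate and returns the normalized total squared error:

    def anneal(train_one_epoch, initial_rate=1.0, decay=0.9, final_rate=0.001):
        rate = initial_rate
        prev_error = float("inf")
        while rate > final_rate:
            error = train_one_epoch(rate)   # mean squared error over all outputs and patterns
            if error > prev_error:          # performance got worse: cool the system down
                rate *= decay
            prev_error = error
        return rate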
In addition, we find that we obtain best overall results when we also allow
some of our many training parameters to change over the course of a training
run. In particular, the correct train probability needs to start out very low to
give the net a chance to learn unusual character styles, but it should finish up
near 1.0 in order to not introduce a general posterior probability bias in favor of
classes with lots of ambiguous examples. We typically train a net in four “phases”
according to parameters such as in Figure 13.8.
Phase   Learning Rate   Correct Train Prob   Negative Train Prob   Epochs
1       1.0 - 0.5       0.1                  0.05                  25
2       0.5 - 0.1       0.25                 0.1                   25
3       0.1 - 0.01      0.5                  0.18                  50
4       0.01 - 0.001    1.0                  0.3                   30
Fig. 13.8. Typical multi-phase schedule of learning rates and other parameters for
training a character-classifier net
13.4.9 Quantized Weights
The work of Asanovic and Morgan [1] shows that two-byte (16-bit) weights
are about the smallest that can be tolerated in training large ANNs via back-
propagation. But memory is expensive in small devices, and RISC processors,
such as the ARM-610 in the first devices in which this technology was deployed,
are much more efficient doing one-byte loads and multiplies than two-byte loads
and multiplies, so we were motivated to make one-byte weights work.
Running the net for recognition demands significantly less precision than does
training the net. It turns out that one-byte weights provide adequate precision
for recognition, if the weights are trained appropriately. In particular, a dynamic
range should be fixed, and weights limited to that legal range during training,
and then rounded to the requisite precision after training. For example, we find
that a range of weight values from (almost) -8 to +8 in steps of 1/16 does
a good job. Figure 13.9 shows a typical resulting distribution of weight val-
ues. If the weight limit is enforced during high-precision training, the resources
of the net will be adapted to make up for the limit. Since bias weights are
few in number, however, and very important, we allow them to use two bytes
with essentially unlimited range. Performing our forward/recognition pass with
low-precision, one-byte weights (a ±3.4 fixed-point representation), we find no
noticeable degradation relative to floating-point, four-byte, or two-byte weights
using this scheme.
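The rounding itself is simple; a sketch, assuming the (almost) -8 to +8 range in steps of 1/16 quoted above (bias weights, which keep two bytes, are not shown):

    def quantize_weight(w, step=1.0 / 16.0, limit=8.0):
        w = max(-limit, min(limit - step, w))   # clip to the legal dynamic range during training
        return round(w / step) * step           # round to the nearest 1/16 after training

    def to_signed_byte(w):
        # Encode a quantized weight as a signed byte holding w * 16, i.e. a 3.4 fixed-point value.
        return int(round(w * 16))               # in [-128, 127]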
Fig. 13.9. Distribution of weight values (counts per bin of width 1/16, on a log count
scale) in a net with one-byte weights. Weights with magnitudes greater than 4 are
sparse, but important.
We have also developed a scheme for training with augmented one-byte
weights. It uses a temporary augmentation of the weight values with two addi-
tional low-order bytes to achieve precision in training, but runs the forward pass
of the net using only the one-byte high-order part. Thus any cumulative effect of
the one-byte rounded weights in the forward pass can be compensated through
further training. Small weight changes accumulate in the low-order bytes, and
only occasionally carry into a change in the one-byte weights used by the net. In
a personal product, this scheme could be used for adaptation to the user, after
which the low-order residuals could be discarded and the temporary memory
reclaimed.
13.5 Context-Driven Search
The output of the ANN classifier is a stream of probability vectors, one vector
for each segmentation hypothesis, with as many potentially nonzero probability
elements in each vector as there are characters (that the system is capable of
recognizing). In practice, we typically only pass the top ten (or fewer) scored
character-class hypotheses, per segment, to the search engine, for the sake of
efficiency. The search engine then looks for a minimum-cost path through this
vector stream, abiding by the legal transitions between segments, as defined in
the tentative-segmentation step discussed previously. This minimum-cost path
is the APR system’s best interpretation of the ink input by the user, and is
returned to the system in which APR is embedded as the recognition result for
whole words or sentences of the user’s input.
The search is driven by a somewhat ad hoc, generative language model, which
consists of a set of graphs that are searched in parallel. We use a simple beam
search in a negative-log-probability (or penalty) space for the best N hypothe-
ses. The beam is based on a fixed maximum number of hypotheses, rather than a
particular value. Each possible transition token (character) emitted by one of
the graphs is scored not only by the ANN, but by the language model itself, by
a simple letter-case model, and by geometric-context models discussed below.
The fully integrated search process takes place over a space of character- and
word-segmentation hypotheses, as well as character-class hypotheses.
13.5.1 Lexical Context
Context is essential to accurate recognition, even if that context takes the form
of a very broad language model. Humans achieve just 90% accuracy on isolated
characters from our database. Lacking any context this would translate to a
word accuracy of not much more than 60% (0.9^5), assuming an average word
length of 5 characters. We obviously need to do much better, with even lower
isolated-character accuracy, and we accomplish this by the application of our
context models.
A simple model of letter case and adjacency (penalizing case transitions
except between the first and second characters, penalizing alphabetic-to-numeric
transitions, and so on), together with the geometric-context models discussed
later, is sufficient to raise word-level accuracy up to around 77%.
The next large gain in accuracy requires a genuine language model. We pro-
vide this model by means of dictionary graphs, and assemblages of those graphs
combined into what we refer to as BiGrammars. BiGrammars are essentially
scored lists of dictionaries, together with specified legal (scored) transitions be-
tween those dictionaries. This scheme allows us to use word lists, prefix and suf-
fix lists, and punctuation models, and to enable appropriate transitions between
them. Some dictionary graphs are derived from a regular-expression grammar
that permits us to easily model phone numbers, dates, times, etc., as shown in
Figure 13.10.
dig = [0123456789]
digm01 = [23456789]
acodenums = (digm01 [01] dig)
acode = { (“1-”? acodenums “-”):40 ,
(“1”? “(“ acodenums “)”):60 }
phone = (acode? digm01 dig dig “-” dig dig dig dig)
Fig. 13.10. Sample of the regular-expression language used to define a simple
telephone-number grammar. Symbols are defined by the equal operator; square brack-
ets enclose multiple, alternative characters; parentheses enclose sequences of symbols;
curly braces enclose multiple, alternative symbols; an appended colon followed by num-
bers designates a prior probability of that alternative; an appended question mark
means “zero or one occurrence”; and the final symbol definition represents the graph
or grammar expressed by this dictionary.
All of these dictionaries can be searched in parallel by combining them into
a general-purpose BiGrammar that is suitable for most applications. It is also
possible to combine subsets of these dictionaries, or special-purpose dictionar-
ies, into special BiGrammars targeted at more limited contexts. A very simple
BiGrammar Phone
[Phone.lang 1. 1. 1.]
Fig. 13.11. Sample of a simple BiGrammar describing a telephone-only context.
The BiGrammar is first named (Phone), and then specified as a list of dictionaries
(Phone.lang), together with the probability of starting with this dictionary, ending
with this dictionary, and cycling within this dictionary (the three numerical values).
BiGrammar FairlyGeneral
(.8
(.6
[WordList.dict .5 .8 1. EndPunct.lang .2]
[User.dict .5 .8 1. EndPunct.lang .2]
)
(.4
[Phone.lang .5 .8 1. EndPunct.lang .2]
[Date.lang .5 .8 1. EndPunct.lang .2]
)
)
(.2
[OpenPunct.lang 1. 0. .5
(.6
WordList.dict .5
User.dict .5
)
(.4
Phone.lang .5
Date.lang .5
)
]
)
[EndPunct.lang 0. .9 .5 EndPunct.lang .1]
Fig. 13.12. Sample of a slightly more complex BiGrammar describing a fairly general
context. The BiGrammar is first named (FairlyGeneral), and then specified as a list of
dictionaries (the *.dict and *.lang entries), together with the probability of starting with
this dictionary, ending with this dictionary, and cycling within this dictionary (the first
three numerical values following each dictionary name), plus any dictionaries to which
this dictionary may legally transition, along with the probability of taking that transi-
tion. The parentheses permit easy specification of multiplicative prior probabilities for
all dictionaries contained within them. Note that in this simple example, it is not pos-
sible (starting probability = 0) to start a string with the EndPunct (end punctuation)
dictionary, just as it is not possible to end a string with the OpenPunct dictionary.
BiGrammar, which might be useful to specify context for a field that only ac-
cepts telephone numbers, is shown in Figure 13.11. A more complex BiGrammar
(though still far short of the complexity of our final general-input context) is
shown in Figure 13.12.
We refer to our language model as being “weakly applied” because in parallel
with all of the wordlist-based dictionaries and regular-expression grammars, we
simultaneously search both an alphabetic-characters grammar (“wordlike”) and
a completely general, any-character-anywhere grammar (“symbols”). These more
flexible models, though given fairly low a priori probabilities, permit users to
write any unusual character string they might desire. When the prior probabili-
ties for the various dictionaries are properly balanced, the recognizer is able to
benefit from the language model, and deliver the desired level of accuracy for
common in-dictionary words (and special constructs like phone numbers, etc.),
yet can also recognize arbitrary, non-dictionary character strings, especially if
they are written neatly enough that the character classifier can be confident of
its classifications.
We have also experimented with bi-grams, tri-grams, and N-grams, and we are
continuing experiments with other, more data-driven language models; so far,
however, our generative approach has yielded the best results.
13.5.2 Geometric Context
We have never found a way to reliably estimate a baseline or topline for char-
acters, independent of classifying those characters in a word. Non-recognition-
integrated estimates of these line positions, based on strictly geometric features,
have too many pathological failure modes, which produce erratic recognition
failures. Yet the geometric positioning of characters most certainly bears infor-
mation important to the recognition process. Our system factors the problem
by letting the ANN classify representations that are independent of baseline
and size, and then using separate modules to score both the absolute size of
individual characters, and the relative size and position of adjacent characters.
The scoring based on absolute size is derived from a set of simple Gaussian
models of individual character heights, relative to some running scale param-
eters computed both during learning and during recognition. This CharHeight
score directly multiplies the scores emitted by the ANN classifier, and helps
significantly in case disambiguation.
We also employ a GeoContext module that scores adjacent characters, based
on the classification hypotheses for those characters and on their relative size
and placement. GeoContext scores each tentative character based on its class
and the class of the immediately preceding letter (for the current search hy-
pothesis). The character classes are used to look up expected character sizes
and positions in a standardized space (baseline=0.0, topline=1.0). The ink be-
ing evaluated provides actual sizes and positions that can be compared directly
to the expected values, subject only to a scale factor and offset, which are cho-
sen so as to minimize the estimated error of fit between data and model. This
same quadratic error term, computed from the inverse covariance matrix of a
full multivariate Gaussian model of these sizes and positions, is used directly as
GeoContext’s score (or penalty, since it is applied in the -log probability space
of the search engine). Figure 13.13 illustrates the bounding boxes derived from
the user’s ink vs. the table-driven model, with the associated error measures for
our GeoContext module.
"if" from User vs. Table:
Fig. 13.13. The eight measurements that contribute to the GeoContext error vector
and corresponding score for each letter pair
GeoContext’s multivariate Gaussian model is learned directly from data. The
problem in doing so was to find a good way to train per-character parameters of
top, bottom, width, space, etc., in our standardized space, from data that had
no labeled baselines, or other absolute referent points. Since we had a technique
for generating an error vector from the table of parameters, we decided to use
a back-propagation variant to train the table of parameters to minimize the
squared error terms in the error vectors, given all the pairs of adjacent characters
and correct class labels from the training set.
GeoContext plays a major role in properly recognizing punctuation, in dis-
ambiguating case, and in recognition in general. A more extended discussion of
GeoContext has been provided by Lyon and Yaeger [9].
13.5.3 Integration with Word Segmentation
Just as it is necessary to integrate character segmentation with recognition via
the search process, so is it essential to integrate word segmentation with recog-
nition and search, in order to obtain accurate estimates of word boundaries, and
to reduce the large class of errors associated with missegmented words. To per-
form this integration, we first need a means of estimating the probability of a
word break between each pair of tentative characters. We use a simple statisti-
cal model of gap sizes and stroke-centroid spacing to compute this probability
(spaceProb). Gaussian density distributions, based on means and standard devi-
ations computed from a large training corpus, together with a prior probability
scale factor, provide the basis for the word-gap and stroke-gap (non-word-gap)
models, as illustrated in Figure 13.14. Since any given gap is, by definition, ei-
ther a word gap or a non-word gap, the simple ratio defined in Figure 13.14
provides a convenient, self-normalizing estimate of the word-gap probability. In
practice, that equation further reduces to a simple sigmoid form, thus allowing
us to take advantage of a lookup-table-based sigmoid derived for use in the ANN.
In a thresholding, non-integrated word-segmentation model, word breaks would
be introduced when spaceProb exceeds 0.5; i.e., when a particular gap is more
likely to be a word-gap than a non-word-gap. For our integrated system, both
word-break and non-word-break hypotheses are generated at each segment tran-
sition, and weighted by spaceProb and (1-spaceProb), respectively. The search
process then proceeds over this larger hypothesis space to produce best esti-
mates of whole phrases or sentences, thus integrating word segmentation as well
as character segmentation.
    P_WordBreak = Γ_WordGap / (Γ_WordGap + Γ_StrokeGap)
Fig. 13.14. Gaussian density distributions yield a simple statistical model of word-
break probability, which is applied in the region between the peaks of the StrokeGap
and WordGap distributions. Hashed areas indicate regions of clear-cut decisions, where
P_WordBreak is set to either 0.0 or 1.0, to avoid problems dealing with tails of these simple
distributions.
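A sketch of this estimate follows; the Gaussian parameters and prior scale factor are hypothetical stand-ins for values that would be estimated from a training corpus, and the clear-cut regions outside the two peaks are clamped to 0.0 and 1.0 as in Figure 13.14:

    import math

    def gaussian(x, mean, std):
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

    def space_prob(gap, word_mean=3.0, word_std=1.0,
                   stroke_mean=0.5, stroke_std=0.4, word_prior_scale=1.0):
        if gap <= stroke_mean:   # clearly a stroke (non-word) gap
            return 0.0
        if gap >= word_mean:     # clearly a word gap
            return 1.0
        g_word = word_prior_scale * gaussian(gap, word_mean, word_std)
        g_stroke = gaussian(gap, stroke_mean, stroke_std)
        return g_word / (g_word + g_stroke)

In practice the ratio can be rewritten as a logistic (sigmoid) function of the log-density difference, consistent with the lookup-table sigmoid mentioned above.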
13.6 Discussion
The combination of elements described in the preceding sections produces a
powerful, integrated approach to character segmentation, word segmentation,
and recognition. Users’ experiences with APR are almost uniformly positive, un-
like experiences with previous handwriting recognition systems. Writing within
the dictionary is remarkably accurate, yet the ease with which people can write
outside the dictionary has fooled many people into thinking that the Newton’s
“Print Recognizer” does not use dictionaries. As discussed previously, our recog-
nizer certainly does use dictionaries. Indeed, the broad-coverage language model,
though weakly applied, is essential for high accuracy recognition. Curiously, there
seems to be little problem with dictionary perplexity, that is, little difficulty as a result
of using very large, very complex language models. We attribute this fortunate
behavior to the excellent performance of the neural network character classifier
at the heart of the system. One of the side benefits of the weak application of
the language model is that even when recognition fails and produces the wrong
result, the answer that is returned to the user is typically understandable by the
user, perhaps involving the substitution of a single character. Two useful phenomena
ensue as a result. First, the user learns what works and what doesn’t, especially
when she refers back to the ink that produced the misrecognition, so the system
trains the user gracefully over time. Second, the meaning is not lost the way it
can be, all too easily, with whole-word substitutions, with that “Doonesbury
Effect” found in first-generation, strong-language-model recognizers.
Though we have provided legitimate accuracy statistics for certain compar-
ative tests of some of our algorithms, we have deliberately shied away from
claiming specific levels of accuracy in general. Neat printers, who are familiar
with the system, can achieve 100% accuracy if they are careful. Testing on data
from complete novices, writing for the first time using a metal pen on a glass
surface, without any feedback from the recognition system, and with ambigu-
ous instructions about writing with “disconnected characters” (intended to mean
printing, but often interpreted to mean writing with otherwise cursive characters
but separated by large spaces in a wholly unnatural style), can yield word-level
accuracies as low as 80%. Of course, the entire interesting range of recognition
accuracies lies between these two extremes. Perhaps a slightly more meaningful
statistic comes from common reports on usenet newsgroups, and some personal
testing, that suggest accuracies of 97% to 98% in regular use. But for scien-
tific purposes, none of these numbers have any real meaning, since our testing
datasets are proprietary, and the only valid tests between different recognizers
would have to be based on results obtained by processing the exact same bits,
or by analyzing large numbers of experienced users of the systems in the field –
a difficult project which has not been undertaken.
One of the key reasons for the success of APR is the suite of innovative
neural network training techniques that help the network encode better class
probabilities, especially for under-represented classes and writing styles. Many
of these techniques – stroke count dithering, normalization of output error, fre-
quency balancing, error emphasis – share a unifying theme: Reducing the effect
of a priori biases in the training data on network learning significantly improves
the network’s performance in an integrated recognition system, despite a mod-
est reduction in the network’s accuracy for individual characters. Normaliza-
tion of output error prevents over-represented non-target classes from biasing
the net against under-represented target classes. Frequency balancing prevents
over-represented classes from biasing the net against under-represented classes.
And stroke-count dithering and error emphasis prevent over-represented writing
styles from biasing the net against under-represented writing styles. One could
even argue that negative training eliminates an absolute bias towards properly
segmented characters, and that stroke warping reduces the bias towards those
writing styles found in the training data, although these techniques also provide
wholly new information to the system.
Though we’ve offered arguments for why each of these techniques, individually,
helps the overall recognition process, it is unclear why prior-bias reduction, in
general, should be so consistently valuable. The general effect may be related
to the technique of dividing out priors, as is sometimes done to convert from
p(class|input) to p(input|class). But we also believe that forcing the net, during
learning, to allocate resources to represent less frequent sample types may be
directly beneficial. In any event, it is clear that paying attention to such biases
and taking steps to modulate them is a vital component of effective training
of a neural network serving as a classifier in a maximum-likelihood recognition
system.
The majority of this paper describes a sort of snapshot of the system and
its architecture as it was deployed in its first commercial release, when it was,
indeed, purely a “Print Recognizer”. Letters had to be fully “disconnected”; i.e.,
the pen had to be lifted between each pair of characters. The characters could
overlap to some extent, but the ink could not be continuous. Connected char-
acters proved to be the largest remaining class of errors for most of our users,
since even a person who normally prints (as opposed to writing in cursive script)
may occasionally connect a pair of characters – the cross-bar of a “t” with the
“h” in “the”, the “o” and “n” in any word ending in “ion”, and so on. To address
this issue, we experimented with some fairly straightforward modifications to
our recognizer, involving the fragmenting of user-strokes into multiple system-
strokes, or fragments. Once the ink representing the connected characters is
broken up into fragments, we then allow our standard integrated segmentation
and recognition process to stitch them back together into the most likely char-
acter and word hypotheses, as always. This technique has proven itself to work
quite well, and the version of the “Print Recognizer” in the MessagePad 2000 and
2100 supports recognition of printing with connected characters. This capability
was added without significant modification of the main recognition algorithms
as presented in this paper. Due to certain assumptions and constraints in the
current release of the software, APR is not yet a full cursive recognizer, though
that is an obvious next direction to explore.
The net architecture discussed in section 13.4.2 and shown in Figure 13.4 also
corresponds to the true printing-only recognizer. The final output layer has 95
elements corresponding to the full printable ASCII character set plus the British
Pound sign. Initially for the German market, and now even in English units, we
have extended APR to handle diacritical marks and the special symbols needed
for most European languages (although there is only very limited coverage of
foreign languages in the dictionaries of English units). The main innovation that
permitted this extended character set was an explicit handling of any compound
character as a base plus an accent. This way only a few nodes needed to be added
to the neural network output layer, representing just the bases and accents,
rather than all combinations and permutations of same. And training data for all
compound characters sharing a common base or a common accent contributed to
the network’s ability to learn that base or accent, as opposed to contributing only
to the explicit base+accent combination. Here again, however, the fundamental
recognizer technology has not changed significantly from that presented in this
paper.
13.7 Future Extensions
We are optimistic that our algorithms, having proven themselves to work essen-
tially as well for connected characters as for disconnected characters, may extend
gracefully to full cursive.
On a more speculative note, we believe that the technique may extend well
to ideographic languages, substituting radicals for characters, and ideographic
characters for words.
Finally, a note about learning and user adaptation: For a learning technology
such as ANNs, user adaptation is an obvious and natural fit, and was planned
as part of the system from its inception. However, due to RAM constraints
in the initial shipping product, and the subsequent prioritization of European
character sets and connected characters, we have not yet deployed a learning
system. We have, however, done some testing of user adaptation, and believe
it to be of considerable value. Figure 13.15 shows a comparison of the average
performance on an old user-independent net trained on data from 45 writers,
and the performance for three individuals using (A) the user-independent net,
(B) a net trained on data exclusively from that individual, and (C) a copy of the
user-independent net adapted to the specific user by some incremental training.
(Note: These data are from a multi-year-old experiment, and are not necessarily
representative of current levels of performance on any absolute scale.)
Fig. 13.15. User-adaptation test results (character error rate, %) for three individual
writers with three different nets each, plus the overall results for 45 writers tested on
a user-independent net trained on all 45 writers
An important distinction is being made here between “user-adapted” and
“user-specific” nets. “User-specific” nets have been trained with a relatively large
corpus of data exclusively from that specific user. “User-adapted” nets were based
on the user-independent net, with some additional training using limited data
from the user in question. All testing was performed with data held out from all
training sets.
One obvious thing to note is the reduction in error rate, ranging from a factor
of 2 to a factor of 5, that both user-specific and user-adapted nets provide.
An equally important thing to note is that the user-adapted net performs
essentially as well as a user-specific net, in fact slightly better for two of the
three writers. Given ANNs’ penchant for local minima, we were concerned that
this might not be the case. But it appears that the features learned during the
user-independent net training served the user-adapted net well. We believe that
a very small amount of training data from an individual will allow us to adapt
the user-independent net to that user, and improve the overall accuracy for that
user significantly, especially for individuals with more stylized writing, or whose
writing style is underrepresented in our user-independent training corpus. And
even for writers with common and/or neat writing styles, there is inherently less
ambiguity in a single writer’s style than in a corpus of data necessarily doing its
best to represent essentially all possible writing styles.
These results may be exaggerated somewhat by the limited data in the user-
independent training corpus at the time these tests were performed (just 45 writ-
ers), and at least two of the three writers in question had particularly problematic
writing styles. We have also made significant advances in our user-independent
recognition accuracies since these tests were performed. Nonetheless, we believe
these results are suggestive of the significant value of user adaptation, even in
preference to a user-specific solution.
Acknowledgements. This work was done in collaboration with Bill Stafford,
Apple Computer, and Les Vogel, Angel Island Technologies. We would also like
to acknowledge the contributions of Michael Kaplan, Rus Maxham, Kara Hayes,
Gene Ciccarelli, and Stuart Crawford. We also received support and assistance
from the Newton Recognition, Software Testing, and Hardware Support groups.
We are also indebted to our many colleagues in the connectionist community
for their advice, help, and encouragement over the years, as well as our many
colleagues at Apple who pitched in to help throughout the life of this project.
Some of the techniques described in this paper are the subject of pending U.S.
and foreign patent applications.
References
[1] Asanovic, K., Morgan, N.: Experimental determination of precision requirements
for back-propagation training of artificial neural networks. Technical Report TR-
91-036, International Computer Science Institute, Berkeley, CA (June 1991)
[2] Bengio, Y., LeCun, Y., Nohl, C., Burges, C.: LeRec: A NN/HMM hybrid for on-
line handwriting recognition. Neural Computation 7(6), 1289–1303 (1995)
[3] Bourlard, H., Wellekens, C.J.: Links between markov models and multilayer per-
ceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-
12(12), 1167–1178 (1990)
[4] Chang, E.I., Lippmann, R.P.: Using voice transformations to create additional
training talkers for word spotting. In: Tesauro, G., Touretzky, D., Leen, T. (eds.)
Advances in Neural Information Processing Systems, vol. 7, pp. 875–882. The MIT
Press (1995)
[5] Gish, H.: A probabilistic approach to the understanding and training of neural
network classifiers. In: Proceedings of the IEEE Conference on Acoustics, Speech
and Signal Processing, pp. 1361–1364. IEEE Press (1990)
[6] Guyon, I., Henderson, D., Albrecht, P., LeCun, Y., Denker, J.: Writer independent
and writer adaptive neural network for on-line character recognition. In: Impedovo,
S. (ed.) From Pixels to Features III, pp. 493–506. Elsevier, Amsterdam (1992)
[7] Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local
experts. Neural Computation 3(1), 79–87 (1991)
[8] Lippmann, R.P.: Neural networks, Bayesian a posteriori probabilities, and pattern
classification. In: Cherkassky, V., Friedman, J.H., Wechsler, H. (eds.) From
Statistics to Neural Networks: Theory and Pattern Recognition Applications, pp.
83–104. Springer, Berlin (1994)
[9] Lyon, R., Yaeger, L.: On-line hand-printing recognition with neural networks. In:
Fifth International Conference on Microelectronics for Neural Networks and Fuzzy
Systems, Lausanne, Switzerland. IEEE Computer Society Press (1996)
[10] Morgan, N., Bourlard, H.: Continuous speech recognition: an introduction to the
hybrid HMM/connectionist approach. IEEE Signal Processing Magazine 13(3), 24–42
(1995)
[11] Renals, S., Morgan, N., Cohen, M., Franco, H.: Connectionist probability esti-
mation in the Decipher speech recognition system. In: Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing, vol. I,
pp. 601–604 (1992)
[12] Richard, M.D., Lippmann, R.P.: Neural Network Classifiers Estimate Bayesian a
posteriori probabilities. Neural Computation 3(4), 461–483 (1991)
[13] Simard, P., LeCun, Y., Denker, J.: Efficient pattern recognition using a new trans-
formation distance. In: Hanson, S.J., Cowan, J.D., Giles, C.L. (eds.) Proceedings
of the 1992 Conference on Advances in Neural Information Processing Systems,
vol. 5, pp. 50–58. Morgan Kaufmann, San Mateo (1992)
[14] Simard, P., Victorri, B., LeCun, Y., Denker, J.: Tangent prop—A formalism for
specifying selected invariances in an adaptive network. In: Moody, J.E., Hanson,
S.J., Lippmann, R.P. (eds.) Advances in Neural Information Processing Systems,
vol. 4, pp. 895–903. Morgan Kaufmann Publishers, Inc. (1992)
[15] Tappert, C.C., Suen, C.Y., Wakahara, T.: The state of the art in on-line hand-
writing recognition. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence PAMI-12(8), 787–808 (1990)
14
Neural Network Classification and Prior Class Probabilities
Steve Lawrence¹, Ian Burns², Andrew Back³, Ah Chung Tsoi⁴, and C. Lee Giles¹,*
¹ NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
  http://www.neci.nj.nec.com
² Open Access Pty Ltd, Level 2, 7–9 Albany St, St. Leonards, NSW 2065, Australia
³ RIKEN Brain Science Institute, 2-1 Hirosawa, Wako-shi, Saitama, 351-0198, Japan
⁴ Faculty of Informatics, University of Wollongong, Northfields Ave, Wollongong, NSW 2522, Australia
{lawrence,giles}@research.nj.nec.com, ian.burns@oa.com.au, back@brain.riken.go.jp, Ah_Chung_Tsoi@uow.edu.au
http://www.neci.nj.nec.com/homepages/lawrence/
Abstract. A commonly encountered problem in MLP (multi-layer per-
ceptron) classification problems is related to the prior probabilities of
the individual classes if the number of training examples that corre-
spond to each class varies significantly between the classes, then it may
be harder for the network to learn the rarer classes in some cases. Such
practical experience does not match theoretical results which show that
MLPs approximate Bayesian a posteriori probabilities (independent of
the prior class probabilities). Our investigation of the problem shows
that the difference between the theoretical and practical results lies with
the assumptions made in the theory (accurate estimation of Bayesian a
posteriori probabilities requires the network to be large enough, training
to converge to a global minimum, infinite training data, and the a priori
class probabilities of the test set to be correctly represented in the train-
ing set). Specifically, the problem can often be traced to the fact that
efficient MLP training mechanisms lead to sub-optimal solutions for most
practical problems. In this chapter, we demonstrate the problem, discuss
possible methods for alleviating it, and introduce new heuristics which
are shown to perform well on a sample ECG classification problem. The
heuristics may also be used as a simple means of adjusting for unequal
misclassification costs.
Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN
978-3-540-65311-0 (1998).

Lee Giles is also with the Institute for Advanced Computer Studies, University of
Maryland, College Park, MD 20742.
G. Montavon et al. (Eds.): NN: Tricks of the Trade, 2nd edn., LNCS 7700, pp. 295–309, 2012.
© Springer-Verlag Berlin Heidelberg 2012
14.1 Introduction
It has been shown theoretically that MLPs approximate Bayesian a posteriori
probabilities when the desired network outputs are 1-of-M and squared-error or
cross-entropy cost functions are used [6, 11, 12, 15, 23, 25, 26, 28, 29, 32]. This
result relies on a number of assumptions for accurate estimation: the network
must be large enough and training must find a global minimum, infinite training
data is required, and the a priori class probabilities of the test set must be
correctly represented in the training set.
In practice, MLPs have also been shown to accurately estimate Bayesian a
posteriori probabilities for certain experiments [10]. However, a commonly en-
countered problem in MLP classification is related to the case when the frequency
of the classes in the training set varies significantly
1
. If the number of training
examples for each class varies significantly between classes then there may be a
bias towards predicting the more common classes [3, 4], leading to worse clas-
sification performance for the rarer classes. In [5] it was observed that classes
with low a priori probability in a speech application were “ignored” (no samples
were classified as these classes after training). Such problems indicate that either
the estimation of Bayesian a posteriori probabilities is inaccurate, or that such
estimation may not be desired (e.g., due to varying misclassification costs; this is
explained further in Section 14.4). Bourlard and Morgan [7] have demonstrated
inaccurate estimation of Bayesian a posteriori probabilities in speech recogni-
tion. This chapter discusses how the problem may occur along with methods of
dealing with the problem.
¹ For the data in general. Others have considered the case of different class probabilities between the training and test sets, e.g. [23].
14.2 The Trick
This section describes the tricks for alleviating the aforementioned problem.
Motivation for their use and experimental results are provided in the following
sections. The methods all consider some kind of scaling which is performed on
a class-by-class basis.²
² Anand et al. [2] have also presented an algorithm related to unequal prior class probabilities. However, their algorithm aims only to improve convergence speed. Additionally, their algorithm is only for two-class problems and batch update.
14.2.1 Prior Scaling
A method of scaling weight updates on a class by class basis according to the prior
class probabilities is proposed in this section. Consider gradient descent weight
updates for each pattern: w^l_ki(new) = w^l_ki(old) + Δw^l_ki(p), where
Δw^l_ki(p) = -η ∂E(p)/∂w^l_ki, p is the pattern index, and w_ki is the weight between
neuron k in layer l and neuron i in layer l - 1. Scaling the weight updates on a pattern-by-pattern
basis is considered such that the total expected update for patterns belonging
to each class is equal (i.e. independent of the number of patterns in the class):
    ⟨ Σ_{p=1..N_p} |s_x Δw^l_ki(p)| ⟩_{p_c = x} = c_1,   ∀ x ∈ X     (14.1)
where p_c is the target classification of pattern p, c_1 is a constant, s_x is a scaling
factor, x ranges over all classes X, ⟨·⟩ denotes expectation, and the p_c = x
subscript indicates that the sum is only over the patterns in a particular class
x. This effectively scales the updates for lower frequency classes so that they are
higher – the aim is to account for the fact that lower frequency classes tend to
be “ignored” in certain situations. We assume that the expected weight update
for individual patterns in each class is equal:
    ⟨ |Δw^l_ki(p)| ⟩_{p_c = x} = c_2,   ∀ x ∈ X     (14.2)
where c_2 is a constant not related to c_1. The scaling factor required is therefore:
    s_x = 1 / (p_x N_c)     (14.3)
where s_x is the scaling factor for all weight updates associated with a pattern
belonging to class x, N_c is the number of classes, and p_x is the prior probability
of class x.
Scaling as defined above invalidates the Bayesian a posteriori probability proofs
(for example, scaling a class by two can be compared with duplicating every pattern
in the data for that class, causing changes in probability distributions), i.e. there is
no reason to expect that the scaling strategy will be optimal. This, and the empirical
result that the scaling may improve performance, leads to the hypothesis that there
may be a point between no prior scaling and prior scaling as defined above which
produces performance better than either of the two extremes. The following scaling
rule can be used to select a degree of scaling between the two extremes:
    s_x = 1 - c_s + c_s / (p_x N_c)     (14.4)
where 0 ≤ c_s ≤ 1 is a constant specifying the amount of prior scaling to use.
c_s = 0 corresponds to no scaling according to prior probabilities, and c_s = 1
corresponds to scaling as above. Prior scaling in this form can be expressed as
training with the following alternative cost function:³
³ A cost function with similar motivation, the “classification figure-of-merit” (CFM) proposed by Hampshire and Waibel [13], has been suggested as a possible improvement when prior class probabilities vary [3]. In [13], the CFM cost function leads to networks which make different errors to those trained with the MSE criterion, and can therefore be useful for improving performance by combining classifiers trained with the CFM and the MSE. However, networks trained with the CFM criterion do not result in higher classification performance than networks trained with the MSE criterion for the experiments reported in [13].
Definition 1
    E = (1/2) Σ_{k=1..N_p} Σ_{j=1..N_c} s_x (d_kj - y_kj)^2     (14.5)
where the network has one output for each of the N_c classes, N_p is the number
of patterns, d is the desired or target output, y is the predicted output, and x is
the class of pattern k.
When using prior scaling as defined in this section, the individual s_x values can be large for classes with low prior probability. This may lead to the requirement of decreasing the learning rate in order to prevent the relatively large weight updates interfering with the gradient descent process. Comparing the use of prior scaling and not using prior scaling then becomes problematic because the optimal learning rate is different for each case. An alternative is to normalize the s_x values so that the maximum is 1. Another possibility is to present patterns
repeatedly to the network instead of scaling weight updates, i.e. for a class with
a scaling factor of 2 each pattern would be presented twice. This would have
the advantage of reducing the range of weight updates in terms of magnitude,
e.g. an update of magnitude x might be repeated twice rather than using a
single update of magnitude 2x. This may allow the use of a higher learning rate,
and therefore reduce the number of epochs required. However, a disadvantage of
repeating patterns is that the effective training set would be larger, resulting in
longer training times for the same number of epochs. Such a technique could be
done probabilistically, and this is the subject of the next technique.
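To make the prior scaling rule concrete, here is a minimal Python sketch (ours, not from the chapter; all names are hypothetical) that computes the per-class factors of equations (14.3)/(14.4) and uses them to weight the per-pattern squared error of equation (14.5):

```python
import numpy as np

def class_scaling_factors(priors, c_s=1.0):
    """Per-class factors s_x = 1 - c_s + c_s / (p_x * N_c), eq. (14.4);
    c_s = 1 recovers eq. (14.3) and c_s = 0 disables prior scaling."""
    priors = np.asarray(priors, dtype=float)
    return 1.0 - c_s + c_s / (priors * len(priors))

def prior_scaled_cost(targets, outputs, labels, s):
    """Cost of eq. (14.5): each pattern's squared error is weighted by the
    scaling factor of its class (labels are integer class indices)."""
    per_pattern = 0.5 * np.sum((targets - outputs) ** 2, axis=1)
    return np.sum(s[np.asarray(labels)] * per_pattern)

# ECG-like priors for classes (N, PVC, S, F): rare classes get large factors
s = class_scaling_factors([0.737, 0.191, 0.0529, 0.0196], c_s=0.8)
```

In a stochastic backpropagation setting, the same factor s_x would simply multiply the weight update of every pattern belonging to class x.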
14.2.2 Probabilistic Sampling
Yaeger et al. [33] (chapter 13) have proposed a method called frequency balanc-
ing which is similar to the prior scaling method above. In frequency balancing,
Yaeger et al. use all training samples in random order for each training epoch
and allow each sample to be presented to the network a random number of times,
which may be zero or more and is computed probabilistically. A balancing factor
is included, which is analogous to the scaling factor above (c_s).
We introduce a very similar method here called probabilistic sampling whereby training patterns are chosen randomly in the following manner: the class is chosen randomly with the probability of choosing each class x being (1 - c_s) p_x + c_s/N_c. A training sample is then chosen randomly from among all training samples for the chosen class.
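A minimal sketch of this sampling scheme, assuming the training data has already been grouped by class (the names patterns_by_class and sample_pattern are ours):

```python
import numpy as np

def sample_pattern(patterns_by_class, priors, c_s, rng):
    """Probabilistic sampling: pick class x with probability
    (1 - c_s) * p_x + c_s / N_c, then pick one of its patterns uniformly."""
    priors = np.asarray(priors, dtype=float)
    class_probs = (1.0 - c_s) * priors + c_s / len(priors)
    x = rng.choice(len(priors), p=class_probs)
    pattern = patterns_by_class[x][rng.integers(len(patterns_by_class[x]))]
    return pattern, x

rng = np.random.default_rng(0)
toy_data = [rng.standard_normal((90, 8)), rng.standard_normal((10, 8))]  # imbalanced
pattern, label = sample_pattern(toy_data, priors=[0.9, 0.1], c_s=1.0, rng=rng)
```

With c_s = 1 every class is drawn equally often regardless of its prior; with c_s = 0 the original class frequencies are reproduced in expectation.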
14.2.3 Post Scaling
Instead of scaling weight updates or altering the effective class frequencies, it is
possible to train the network as usual and then scale the outputs of the network
after training. For example, the network could be trained as usual and then
the outputs scaled according to the prior probabilities in a similar fashion to
the prior scaling method (using equation 14.3 or 14.4). Experiments with this
technique alone show that it is not always as successful as prior scaling of the
weight updates. This may be because the estimation of the lower frequency
classes can be less accurate than that of the higher frequency classes [24] (the
deviations of the network outputs from the true values in regions with a higher
number of data points influence the squared error cost function more than the
deviations in regions with a lower number of points [23]).
The post scaling technique introduced here can also be used to optimize a
given criterion, e.g. the outputs may be scaled so that the probability of pre-
dicting each class matches the prior probabilities in the training set as closely
as possible. Post scaling to minimize a different criterion is demonstrated in the
results section. For the results in this chapter, the minimization is performed
using a simple hill-climbing algorithm which adjusts a scaling factor associated
with each of the outputs of the network.
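The following sketch illustrates one way such a hill-climbing post scaling could be implemented (a later footnote mentions 400 steps and a multiply-or-divide constant reduced from 1.5 to 1; the random choice of which output to perturb and the acceptance rule are our assumptions):

```python
import numpy as np

def post_scale(outputs, labels, criterion, n_steps=400, rng=None):
    """Learn one multiplicative scale factor per network output by hill
    climbing so that criterion(outputs * scale, labels) is minimized."""
    rng = rng or np.random.default_rng(0)
    scale = np.ones(outputs.shape[1])
    best = criterion(outputs * scale, labels)
    for step in range(n_steps):
        factor = 1.5 - 0.5 * step / n_steps          # step constant shrinks to 1
        trial = scale.copy()
        k = rng.integers(len(scale))                 # perturb one output's factor
        trial[k] *= factor if rng.random() < 0.5 else 1.0 / factor
        value = criterion(outputs * trial, labels)
        if value < best:                             # keep the move only if it helps
            best, scale = value, trial
    return scale
```

The criterion can be any function of the scaled outputs and the labels, e.g. the MSSE defined in the next section.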
14.2.4 Equalizing Class Membership
A simple method for alleviating difficulty with unequal prior class probabilities is
to adjust (e.g. equalize) the number of patterns in each class, either by subsam-
pling [24] (removing patterns from higher frequency classes), or by duplication
(of patterns in lower frequency classes)⁴. For subsampling, patterns can be re-
moved randomly, or heuristics may be used to remove patterns in regions of low
ambiguity. Subsampling involves a loss of information which can be detrimen-
tal. Duplication involves a larger dataset and longer training times for the same
number of training epochs (the convergence time may be longer or shorter).
14.3 Experimental Results
Results on an ECG classification problem are reported in this section after dis-
cussing the use of alternative performance measures. Results on a simple artificial
problem are also included in the explanation section.
14.3.1 Performance Measures
When the interclass prior probabilities of the classes vary significantly, then the
overall classification error may not be the most appropriate performance crite-
rion. For example, a model may always predict the most common class and still
provide relatively high performance. Statistics such as the Sensitivity, Positive
Predictivity, and False Positive Rate can provide more meaningful results [1].
These are defined on a class by class basis as follows:
The Sensitivity of a class is the proportion of events labeled as that class
which are correctly detected. For the two class confusion matrix shown in table
14.1, the sensitivity of class 1 is c_11/(c_11 + c_12).
⁴ The heuristic of adding noise during training [22] could be useful here as with the other techniques in this chapter.
The Positive Predictivity of a class is the proportion of events which were
predicted to be the class and were labeled as that class. For the two class con-
fusion matrix shown in table 14.1, the positive predictivity of class 1 is c_11/(c_11 + c_21).
The False Positive Rate of a class is the proportion of all patterns for other
classes which were incorrectly classified as that class. For the two class confusion
matrix shown in table 14.1, the false positive rate of class 1 is c_21/(c_11 + c_21).
Table 14.1. A sample confusion matrix which is used to illustrate sensitivity, positive
predictivity, and false positive rate. Rows correspond to the desired classes and columns
correspond to the predicted classes.
Class     1       2
  1      c_11    c_12
  2      c_21    c_22
No single performance criterion can be labeled as the best for comparing
algorithms or models because the best criterion to use is problem dependent.
Here, we take the sensitivity as defined above, and create a single performance
measure, the mean squared sensitivity error (MSSE). We define the MSSE as
follows:
Definition 2
$$\mathrm{MSSE} = \frac{1}{N_c} \sum_{i=1}^{N_c} (1 - S_i)^2 \qquad (14.6)$$
where N_c is the number of classes and S_i is the sensitivity of class i as defined earlier.
Sensitivities range from 0 (no examples of the class correctly classified) to 1 (all
examples correctly classified). Thus, a lower MSSE corresponds to better per-
formance. We choose this criterion because each class is given equal importance
and the square causes lower individual sensitivities to be penalized more (e.g. for
a two class problem, class sensitivities of 100% and 0% produce a higher MSSE
than sensitivities of 50% and 50%). Note that this is only one possible criterion,
and other criteria could be used in order to reflect different requirements, e.g.
specific misclassification costs for each class. The post scaling heuristic can be
used with any criterion (and doing so may be simpler than reformulating the
neural network training algorithm for the new criterion).
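As an illustration, the per-class sensitivities and the MSSE of equation (14.6) can be computed directly from a confusion matrix; this small sketch (ours) also reproduces the two-class example mentioned above:

```python
import numpy as np

def sensitivities(confusion):
    """Per-class sensitivity: diagonal count divided by the row (desired class) total."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

def msse(confusion):
    """Mean squared sensitivity error, eq. (14.6)."""
    return np.mean((1.0 - sensitivities(confusion)) ** 2)

# Sensitivities of 100% and 0% score worse (0.5) than 50% and 50% (0.25)
print(msse([[10, 0], [10, 0]]), msse([[5, 5], [5, 5]]))
```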
14.3.2 ECG Classification Problem
This section presents results using the aforementioned techniques on an ECG classification problem. The database used is the MIT-BIH Arrhythmia database [21], a common publicly available ECG database which contains a large number of ECG records that have been carefully annotated by experts. Detection of
the following four beat types is considered: Normal (N), Premature Ventricular
Contraction (PVC), Supraventricular Contraction (S), and Fusion (F) [21], i.e.
there are four output classes. The four classes are denoted 1 (N), 2 (PVC), 3 (S),
and 4 (F). An autoregressive model is calculated for a window of 200 samples
centered over the peak of the R-wave of each beat. The inputs are the polar
coordinates of each pole in the z-plane, i.e. frequency changes are reflected in
the angular variation of the poles and damping is reflected in the magnitude
variations. The model order was four corresponding to eight input variables.
The prior probability of the classes (according to the training data) is (0.737,
0.191, 0.0529, 0.0196) corresponding to beat types (N, PVC, S, F).
MLPs with 20 hidden units were trained with stochastic backpropagation
(update after each pattern) using an initial learning rate of 0.02 which was
linearly reduced to zero over the training period of 500,000 updates. We used
5,000 points in each of the training, validation and test sets. The validation
set was used for early stopping. The following algorithms were used: a) prior scaling with the degree of scaling, c_s, varied from 0 to 1; b) probabilistic sampling with the degree of scaling, c_s, varied from 0 to 1; c) as per a) and b) with the addition of post scaling; and d) equalizing the number of cases in each class by removing cases in more common classes. The post scaling attempted to minimize the MSSE on the training set⁵. 10 trials were performed for each case.
The median test set MSSE for d) was 0.195. The results for probabilistic sampling and probabilistic sampling plus post scaling are shown with box-whiskers plots⁶ in figure 14.1. For probabilistic sampling, the best scaling results correspond to a degree of scaling in between no scaling and scaling according to the prior probabilities (c_s ≈ 0.8). When c_s is larger, the sensitivity of class 1 drops significantly and results in higher false positive rates for the other classes. When c_s is lower, the sensitivity of classes 3 and 4 drops significantly. It can be seen that the addition of post scaling appears to almost always improve performance for this problem. The optimal degree of scaling, c_s ≈ 0.8, is difficult to determine a priori. However, it can be seen that the addition of post scaling makes the selection of c_s far less critical (c_s = 0.3 to c_s = 1.0 result in similar performance). Figure 14.2 shows confusion matrices (in graphical form). Without
⁵ 400 steps were used for the hill climbing algorithm where each step corresponded to either multiplying or dividing an individual output scale factor by a constant which was reduced linearly over time from 1.5 to 1. The time taken was short compared to the overall training time.
⁶ The distribution of results is often not Gaussian and alternative means of presenting
results other than the mean and standard deviation can be more informative. Box-
whiskers plots show the interquartile range (IQR) with a box and the median as a
bar across the box. The whiskers extend from the ends of the box to the minimum
and maximum values. The median and the IQR are simple statistics which are not
as sensitive to outliers as the mean and the standard deviation [31]. The median is
the value in the middle when arranging the distribution in order from the smallest to
the largest value. If the data is divided into two equal groups about the median, then
the IQR is the difference between the medians of these groups. The IQR contains
50% of the points.
scaling (c_s = 0), it can be seen that classes 3 & 4 have low sensitivity. With scaling using c_s = 1 all classes are now recognized, however the sensitivity of class 1 is worse and the false positive rate of classes 3 & 4 is significantly worse.
The results for prior scaling and prior scaling combined with post scaling were very similar but slightly worse than the results with probabilistic sampling. The prior scaling results are not plotted in order to make the graph easier to follow; however, the qualitative results are as follows: for low c_s, prior scaling and probabilistic sampling perform very similarly. However, for high c_s, probabilistic sampling has a clear advantage for this problem. This is perhaps just as expected: the relatively high variation in prior class probabilities leads to a high variation in weight update magnitudes across the classes when using high c_s. Results for all methods can be seen in table 14.2.
Table 14.2. Results for the various methods. We show the average results for the best selection of c_s and also an average across all selections of c_s. Note that selection of the optimal value of c_s is less critical when using post scaling in addition to either the prior scaling or probabilistic sampling methods.

Method                         Prior        Prior Scaling +   Probabilistic   Probabilistic Sampling +   Equalizing
                               Scaling      Post Scaling      Sampling        Post Scaling               Membership
Average MSSE (for best c_s)    0.10         0.096             0.099           0.089                      0.195
                               (c_s=0.8)    (c_s=0.6)         (c_s=0.8)       (c_s=0.3)
Average MSSE (over all c_s)    0.19         0.10              0.18            0.099                      0.195
14.4 Explanation
This section discusses why the techniques presented can be useful, limitations of
the techniques, and how they relate to the theoretical result that MLPs approx-
imate Bayesian a posteriori probabilities under certain conditions.
14.4.1 Convergence and Representation Issues
We first list four possible situations:
1. The proofs regarding estimation of Bayesian a posteriori probabilities assume
networks with an infinite number of hidden nodes in order to obtain accurate
approximation. For a given problem, it can be seen that a network which
is too small will be unable to estimate the probabilities accurately due to
limited resources.
[Figure 14.1: test set mean square sensitivity error (y-axis, roughly 0.05 to 0.45) plotted against the degree of scaling (x-axis, 0 to 1) for probabilistic sampling and for probabilistic sampling + post scaling.]
Fig. 14.1. Box-whiskers plots (on the left in each case) along with the usual mean plus
and minus one standard deviation plots (on the right in each case) showing the test
set MSSE for probabilistic sampling and for probabilistic sampling plus post scaling.
Each result is derived from 10 trials with different starting conditions. The probabilistic
sampling plus post scaling case is offset by 0.03 to aid viewing. It can be seen that the
selection of the scaling degree for the best performance is not as critical when using
the combination of probabilistic sampling and post scaling.
2. Training an MLP is NP-complete in general and it is well known that prac-
tical training algorithms used for MLPs often result in sub-optimal solutions
(e.g. due to local minima). Often, a result of attaining a sub-optimal solution
is that not all of the network resources are efficiently used. Experiments with
a controlled task have indicated that the sub-optimal solutions often have
smaller weights on average [17].
3. Weight decay [16] or weight elimination [30] are often used in MLP training
and aim to minimize a cost function which penalizes large weights. These
techniques tend to result in networks with smaller weights.
4. A commonly recommended technique with MLP classification is to set the
training targets away from the bounds of the activation function (e.g. (-0.8,
0.8) instead of (-1, 1) for the tanh activation function) [14].
[Figure 14.2: a row of graphical confusion matrices, one per value of c_s; shading scale from 0 to 1.0.]
Fig. 14.2. Confusion matrices for the test set as the degree of prior scaling, c_s, is varied from 0 (left) to 1 (right). The columns correspond to the predicted classes and the rows correspond to the desired classes. The classes are (left to right and top to bottom) N, PVC, S, F. For each desired class, the predicted classes are shaded in proportion to the number of examples which are labeled as the desired class. White indicates no predictions. A general trend can be observed where classes S & F are recognized as normal when c_s = 0, and progressively more of the normal class examples are recognized as classes PVC, S, & F as c_s approaches 1.
These four situations can all lead to a bias towards smaller weights, or “smoother” models⁷. The possibility of such a bias is not taken into account by the proofs regarding posterior probabilities, i.e. the difference between theory and practice may, in part, be explained by violation of the assumption that sufficient convergence is obtained.
When a network is biased towards a “smoother” solution, and accurate fitting of the optimal function is not possible, the result may be a tendency to “ignore” lower frequency classes⁸, e.g. if a network has the choice of fitting either a high frequency class or a low frequency class then it can provide a lower MSE by fitting the high frequency class⁹. We demonstrate by example.
We generated artificial training data using the following distributions: class 1: N(-5, 1, 2) + N(0, 1, 2) + N(5, 1, 2); class 2: N(-2.5, 0.25, 0.5) + N(2.5, 0.25, 0.5), where N(μ, σ, x) is a normal distribution with mean μ, standard deviation σ, and is truncated to lie within (μ - x, μ + x). We generated 500 training and test examples from these distributions with the probability of selection for classes (1,2) being (0.9, 0.1), i.e. the training and test sets have nine times as many samples of class 1 as they do of class 2. Note that there is no overlap between the classes. Figure 14.3 shows typical output probability plots for training an MLP with 10 hidden nodes¹⁰ with and without probabilistic sampling. 10 trials were performed in each case with very similar results (see table 14.3). It can be seen that the network “ignores” class two without the use of probabilistic sampling.
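For readers who wish to reproduce the toy problem, one possible way to generate such data is sketched below; the uniform choice among the mixture components of each class is our assumption, as the text does not state the component weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(mu, sigma, x):
    """One draw from N(mu, sigma) truncated to (mu - x, mu + x), by rejection."""
    while True:
        s = rng.normal(mu, sigma)
        if abs(s - mu) < x:
            return s

def make_dataset(n=500, p_class1=0.9):
    """Class 1 mixes N(-5,1,2), N(0,1,2), N(5,1,2); class 2 mixes
    N(-2.5,0.25,0.5), N(2.5,0.25,0.5); components chosen uniformly (assumed)."""
    comps = {0: [(-5, 1, 2), (0, 1, 2), (5, 1, 2)],
             1: [(-2.5, 0.25, 0.5), (2.5, 0.25, 0.5)]}
    labels = (rng.random(n) >= p_class1).astype(int)
    data = np.array([truncated_normal(*comps[c][rng.integers(len(comps[c]))])
                     for c in labels])
    return data, labels

x, y = make_dataset()
```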
It should be noted that using conjugate gradient training for this simple
problem results in relatively accurate estimation of both classes with standard
training (alternate parameters with backpropagation may also be successful).
Rather than arguing for either backpropagation or conjugate gradient here (nei-
ther training algorithm is expected to always find a global minimum in general),
we simply note that our experience and the experience of others [7, 18, 19, 27]
suggest that conjugate gradient is not superior for many problems – i.e. back-
propagation works better on one class of problems and conjugate gradient works
better on another class. Conjugate gradient resulted in significantly worse per-
formance when tested on the ECG problem. It should be noted that there are
⁷ In general, smaller weights correspond to smoother functions, however this is not always true. For example, this is not the case when fitting the function sech²(x) using two tanh sigmoids [8] (because sech²(x) = lim_{d→0} (tanh(x + d) − tanh(x))/d, i.e. the weights become indefinitely large as the approximation improves).
⁸ In relation to the representational capacity (size of the network), Barnard and Botha [3] have observed that MLP networks have a tendency to guess higher probability classes when a network is too small to approximate the decision boundaries reasonably well.
⁹ Lyon and Yaeger [20] find that their frequency balancing technique reduces the effect of the prior class probabilities on the network and effectively forces the network to allocate more resources to the lower frequency classes.
¹⁰ 500,000 stochastic training updates with backpropagation, initial learning rate 0.02 reduced linearly to zero.
many options when implementing a conjugate gradient training algorithm and
that poor performance may be attributed to the implementation used. We have
used a modified implementation of the algorithm from Fletcher [9].
[Figure 14.3: two panels plotting network output against normalized input, titled “Network Outputs without Probabilistic Sampling” and “Network Outputs using Probabilistic Sampling”; each panel shows the curves Output 1 and Output 2.]
Fig. 14.3. Network outputs for the artificial problem with (below) and without (above) probabilistic sampling. It can be seen that the network “ignores” the lower frequency class without the use of probabilistic sampling. Note that the input has been normalized.
Table 14.3. Mean and standard deviation of the classification error for the artificial
problem both with and without the use of probabilistic sampling
Classification Error Mean Standard Deviation
Standard Training 11.4 0.02
With Probabilistic Sampling 0.8 0.004
14.4.2 Overlapping Distributions
Consider figure 14.4. If classes 1 and 2 have distributions differing only by translation (c_1 and c_2) then the decision threshold between these classes should be chosen at x_1. Equal percentages of each of these classes will be classified as the other class. Now, if the distribution of class 2 is instead as shown (c_2′) then the decision threshold between the classes should be chosen at x_2. In this case, a higher percentage of class 2 will be classified as class 1 than the reverse. If it is desirable to maximize the class by class sensitivity then scaling such that the effective distribution of c_2′ is c_2 might be appropriate. Similarly, class 3 (c_3) will be “ignored” without any scaling.
Scaling on a class by class basis may be desired when i) the distribution of
samples in the training set does not match the true distribution (e.g. it may be
more expensive to collect samples of a particular class)¹¹, or ii) the distribution
of the classes does not represent their relative importance, e.g. in a medical
classification problem the cost of misclassifying a diseased case as normal may
be much higher than the cost of classifying a normal case as a (possibly) diseased
case [24]. The importance of each class may be independent of the class prior
probabilities. Note that scaling such that lower frequency classes are made to
be artificially more important can be useful when considering a higher level
problem. For example, the training data from natural English words and phrases
exhibit very non-uniform priors for different characters. Yaeger et al. [33] find
that reducing the effect of these priors on the network using frequency balancing
improves the performance of the higher level word recognition training.
Observations. a) There is no intrinsic problem if the distributions do not
overlap. b) When distributions overlap, it is desirable to preprocess the data in
a manner that results in reduced overlap. However, it is often not possible to
obtain zero overlap (due to noise, for example).
Fig. 14.4. Overlapping distributions
14.4.3 Limitations
We note a couple of limitations with the heuristics considered herein:
1. Local issues. The heuristics presented counteract biases in the network, train-
ing algorithm and/or training data. There is no reason for these biases to
be constant throughout the input space, e.g. scaling may be helpful in one
region but detrimental in another.
¹¹ It may be possible to obtain more accurate estimates of class probabilities using data that has class labels without input information. For example, word frequency information can be obtained from text databases and the frequency of various diseases can be obtained from health statistics [23].
2. Nonlinear calibration. There is no reason for the linear scaling heuristics used
here to be optimal (in the sense that they best counteract the biases).
14.4.4 A Posteriori Proofs
Theoretically it is possible to show that the scaling techniques invalidate the a posteriori proofs: when performing scaling on a class by class basis, the decision thresholds which are used to determine the winning class should be altered accordingly.
bilistic sampling techniques when the conditions given above do not exist. This
use is related to the problem whereby lower frequency classes may be estimated
less accurately than higher frequency classes (see section 14.2.3) – training may
be performed with the heuristically altered problem (e.g. so that the class fre-
quencies are effectively equal) and the outputs or decision thresholds can be
altered accordingly.
14.5 Conclusions
In practice, training issues or characteristics of a given classification problem can
mean that scaling the predicted class probabilities may improve performance in
terms of overall classification error and/or in terms of an alternative criterion.
We introduced algorithms which a) scale weight updates on a class by class basis
according to the prior class probabilities, b) alter class frequencies probabilisti-
cally (very similar to the frequency balancing technique of Yaeger et al. [33]),
and c) scale outputs after training in order to maximize a given performance
criterion. For an electrocardiogram (ECG) classification problem, we found that
the prior scaling, probabilistic sampling, and post scaling techniques provided
better performance in comparison to a) no heuristics, and b) subsampling in
order to equalize the number of cases in each class. The best performance for
prior scaling and probabilistic sampling was obtained with a degree of scaling
in between no scaling and scaling according to the prior probabilities. The opti-
mal degree was difficult to determine a priori. However, it was found that using
prior scaling or probabilistic sampling in combination with post scaling made
the selection of the optimal degree far less critical.
References
[1] AAMI. Testing and reporting performance results of ventricular Arrhythmia detec-
tion algorithms. In: Association for the Advancement of Medical Instrumentation,
ECAR 1987, Arlington, VA (1987)
[2] Anand, R., Mehrotra, K.G., Mohan, C.K., Ranka, S.: An improved algorithm for
neural network classification of imbalanced training sets. IEEE Transactions on
Neural Networks 4(6), 962–969 (1993)
[3] Barnard, E., Botha, E.C.: Back-propagation uses prior information efficiently.
IEEE Transactions on Neural Networks 4(5), 794–802 (1993)
[4] Barnard, E., Casasent, D.: A comparison between criterion functions for linear
classifiers, with an application to neural nets. IEEE Transactions on Systems,
Man, and Cybernetics 19(5), 1030–1041 (1989)
[5] Barnard, E., Cole, R.A., Hou, L.: Location and classification of plosive consonants
using expert knowledge and neural-net classifiers. Journal of the Acoustical Society
of America 84(suppl. 1), S60 (1988)
[6] Bourlard, H.A., Morgan, N.: Links between Markov models and multilayer per-
ceptrons. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing
Systems, vol. 1, pp. 502–510. Morgan Kaufmann, San Mateo (1989)
[7] Bourlard, H.A., Morgan, N.: Connectionist Speech Recognition: A Hybrid Ap-
proach. Kluwer Academic Publishers, Boston (1994)
[8] Scott Cardell, N., Joerding, W., Li, Y.: Why some feedforward networks cannot
learn some polynomials. Neural Computation 6(4), 761–766 (1994)
[9] Fletcher, R.: Practical Methods of Optimization, 2nd edn. John Wiley & Sons (1987)
[10] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Computation 4(1), 1–58 (1992)
[11] Gish, H.: A probabilistic approach to the understanding and training of neural
network classifiers. In: Proceedings of the IEEE Conference on Acoustics, Speech
and Signal Processing, pp. 1361–1364. IEEE Press (1990)
[12] Hampshire, J.B., Pearlmutter, B.: Equivalence proofs for multilayer perceptron
classifiers and the Bayesian discriminant function. In: Touretzky, D.S., Elman,
J.L., Sejnowski, T.J., Hinton, G.E. (eds.) Proceedings of the 1990 Connectionist
Models Summer School, Morgan Kaufmann, San Mateo (1990)
[13] Hampshire, J.B., Waibel, A.H.: A novel objective function for improved phoneme
recognition using time delay neural networks. In: International Joint Conference
on Neural Networks, Washington, DC, pp. 235–241 (June 1989)
[14] Haykin, S.: Neural Networks, A Comprehensive Foundation. Macmillan, New York
(1994)
[15] Kanaya, F., Miyake, S.: Bayes statistical behavior and valid generalization of pat-
tern classifying neural networks. IEEE Transactions on Neural Networks 2(1), 471
(1991)
[16] Krogh, A., Hertz, J.A.: A simple weight decay can improve generalization. In:
Moody, J.E., Hanson, S.J., Lippmann, R.P. (eds.) Advances in Neural Information
Processing Systems, vol. 4, pp. 950–957. Morgan Kaufmann, San Mateo (1992)
[17] Lawrence, S., Lee Giles, C., Tsoi, A.C.: Lessons in neural network training: Overfitting may be harder than expected. In: Proceedings of the Fourteenth National Con-
ference on Artificial Intelligence, AAAI 1997, pp. 540–545. AAAI Press, Menlo
Park (1997)
[18] LeCun, Y.: Efficient learning and second order methods. In: Tutorial Presented at
Neural Information Processing Systems, vol. 5 (1993)
[19] LeCun, Y., Bengio, Y.: Pattern recognition. In: Arbib, M.A. (ed.) The Handbook
of Brain Theory and Neural Networks, pp. 711–715. MIT Press (1995)
[20] Lyon, R., Yaeger, L.: On-line hand-printing recognition with neural networks. In:
Fifth International Conference on Microelectronics for Neural Networks and Fuzzy
Systems, Lausanne, Switzerland. IEEE Computer Society Press (1996)
[21] MIT-BIH. MIT-BIH Arrhythmia database directory. Technical Report BMEC
TR010 (Revised), Massachusetts Institute of Technology and Beth Israel Hospital
(1988)
[22] Murray, A.F., Edwards, P.J.: Enhanced MLP performance and fault tolerance
resulting from synaptic weight noise during training. IEEE Transactions on Neural
Networks 5(5), 792–802 (1994)
[23] Richard, M.D., Lippmann, R.P.: Neural network classifiers estimate Bayesian a
posteriori probabilities. Neural Computation 3(4), 461–483 (1991)
[24] Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University
Press, Cambridge (1996)
[25] Rojas, R.: A short proof of the posterior probability property of classifier neural
networks. Neural Computation 8, 41–43 (1996)
[26] Ruck, D.W., Rogers, S.K., Kabrisky, M., Oxley, M.E., Suter, B.W.: The multilayer
perceptron as an approximation to an optimal Bayes estimator. IEEE Transactions
on Neural Networks 1(4), 296–298 (1990)
[27] Schiffman, W., Joost, M., Werner, R.: Optimization of the backpropagation algo-
rithm for training multilayer perceptrons. Technical report, University of Koblenz
(1994)
[28] Shoemaker, P.A.: A note on least-squares learning procedures and classification
by neural network models. IEEE Transactions on Neural Networks 2(1), 158–160
(1991)
[29] Wan, E.: Neural network classification: A Bayesian interpretation. IEEE Transac-
tions on Neural Networks 1(4), 303–305 (1990)
[30] Weigend, A.S., Rumelhart, D.E., Huberman, B.A.: Generalization by weight-
elimination with application to forecasting. In: Lippmann, R.P., Moody, J.E.,
Touretzky, D.S. (eds.) Advances in Neural Information Processing Systems, vol.
3, pp. 875–882. Morgan Kaufmann, San Mateo (1991)
[31] Weiss, N.A., Hassett, M.J.: Introductory Statistics, 2nd edn. Addison-Wesley,
Reading (1987)
[32] White, H.: Learning in artificial neural networks: A statistical perspective. Neural
Computation 1(4), 425–464 (1989)
[33] Yaeger, L., Lyon, R., Webb, B.: Effective training of a neural network character
classifier for word recognition. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.)
Advances in Neural Information Processing Systems, vol. 9. MIT Press, Cambridge
(1997)
15
Applying Divide and Conquer to Large Scale
Pattern Recognition Tasks
Jürgen Fritsch¹ and Michael Finke²
¹ Interactive Systems Laboratories, University of Karlsruhe,
Am Fasanengarten 5, 76128 Karlsruhe, Germany
fritsch@ira.uka.de
² Interactive Systems Laboratories, Carnegie Mellon University,
5000 Forbes Avenue, Pittsburgh, PA 15213, USA
finkem@cs.cmu.edu
http://www.cs.cmu.edu/˜finkem/
Abstract. Rather than presenting a specific trick, this paper aims at
providing a methodology for large scale, real-world classification tasks in-
volving thousands of classes and millions of training patterns. Such prob-
lems arise in speech recognition, handwriting recognition and speaker or
writer identification, just to name a few. Given the typically very large
number of classes to be distinguished, many approaches focus on para-
metric methods to independently estimate class conditional likelihoods.
In contrast, we demonstrate how the principles of modularity and hi-
erarchy can be applied to directly estimate posterior class probabilities
in a connectionist framework. Apart from offering better discrimination
capability, we argue that a hierarchical classification scheme is crucial in
tackling the above mentioned problems. Furthermore, we discuss train-
ing issues that have to be addressed when an almost infinite amount of
training data is available.
15.1 Introduction
The majority of contributions in the field of neural computation deal with rela-
tively small datasets and, in case of classification tasks, with a relatively small
number of classes to be distinguished. Representatives of such problems include
the UCI machine learning database [16] and the Proben [20] benchmark set
for learning algorithms. Research concentrates on aspectssuchasmissingdata,
model selection, regularization, overfitting vs. generalization and the bias/vari-
ance trade-off. Over the years, many methods and ’tricks’ have been developed to
optimally learn and generalize when only a limited amount of data is available.
Previously published in: Orr, G.B. and Müller, K.-R. (Eds.): LNCS 1524, ISBN
978-3-540-65311-0 (1998).
On the other hand, many problems in human computer interaction (HCI) such
as speech and handwriting recognition, lipreading and speaker and writer iden-
tification require comparably large training databases and also often exhibit a
large number of classes to be discriminated, such as (context-dependent) phones,
letters and individual speakers or writers. For example, in state-of-the-art large
vocabulary continuous speech recognition, we are typically faced with an inven-
tory of several thousand basic acoustic units and training databases consisting of
several millions of preprocessed speech patterns. There is only a limited amount
of publications available on the sometimes very different problems concerning the
choice of learning machines and training algorithms for such tasks and datasets.
This article addresses exactly the latter kind of learning tasks and provides
a principled approach to large scale classification problems, exemplifying it on
the problem of connectionist speech recognition. Our approach is grounded on
the powerful divide and conquer paradigm that traditionally has always been
applied to problems of rather large size. We argue that a hierarchical approach
that modularizes classification tasks is crucial in applying statistical estimators
such as artificial neural networks. In that respect, this paper presents not just
a single ’trick of the trade’, it offers a methodology for large scale classification
tasks. Such tasks have traditionally been addressed by building generative mod-
els rather than focusing on the prediction of posteriors without making strong
assumptions on the distribution of the input.
The remainder of the paper is organized as follows. Section 2 presents the gen-
eral approach to soft hierarchical classification. Section 3 then discusses methods
to design the topology of hierarchical classifiers - a task that is of increasing im-
portance when dealing with large numbers of classes. Finally, section 4 demon-
strates in detail the application of hierarchical classification to connectionist
statistical speech recognition. Section 5 concludes this paper with a summary.
15.2 Hierarchical Classification
Consider the task of classifying patterns x as belonging to one of N classes ω_k. Given that we have access to the class conditional probability densities p(x|ω_k), Bayes theory states that the optimal decision should be based on the a-posteriori probabilities

$$p(\omega_k \mid x) = \frac{p(x \mid \omega_k)\, p(\omega_k)}{\sum_i p(x \mid \omega_i)\, p(\omega_i)}\,.$$
Given that equal risks are associated with all possible misclassifications, the opti-
mal decision is to choose the class with maximum a-posteriori probability given a
specific pattern x. Two distinct approaches have to be considered when applying
Bayes theory to a learning from examples task with generally unknown distri-
butions. In the first approach, one tries to estimate class-conditional likelihoods
p(x|ω_k) and prior probabilities p(ω_k) from a labeled dataset, which are then used to calculate posterior probabilities according to Bayes rule. In principle, this approach can be applied to tasks with an arbitrarily large number of classes
since the class-conditional likelihoods can be estimated independently. However,
such an approach focuses on the modeling of the class-conditional densities. For
classification accuracy however, it is more important to model class boundaries.
The second approach accommodates this perspective by directly estimating
posterior class probabilities from datasets. It was shown (e. g. [6]) that a large
class of artificial neural networks such as multi-layer perceptrons and recur-
rent neural networks can be trained to approximate posterior class probabilities.
The degree of accuracy of the approximation however depends on many factors,
among them the plasticity of the network. Comparing the two approaches, the
discriminative power of methods that estimate posterior probabilities directly is
generally higher, resulting in better classification accuracy especially when the
class-conditional distributions are very complex. This fact (among others) ex-
plains the success and popularity of neural network classifiers on many learning
from examples tasks.
However, when the number of classes to be distinguished increases to say
several thousand, neural network estimators of posterior probabilities fail to
provide good approximations mainly because of two reasons: First, real-world
problems involving such a large number of classes often exhibit an extremely
non-uniform distribution of priors, see chapter 14. Many learning algorithms for
neural networks (especially stochastic on-line gradient descent) have difficulties
with non-uniformly distributed classes. Particularly the distribution of poste-
riors of infrequent classes tends to be approximated poorly. Second, and more
important, one of the prerequisites for training neural networks to estimate pos-
teriors, the 1-out-of-N coding of training targets, implies that the number of
output neurons matches the number of classes. It is unfeasible to train a neu-
ral network with thousands of output neurons. Also, with increasing number of
classes, the complexity of the optimum discriminant functions also increases and
the potential for conflicts between classes grows. Thus, from our point of view,
typical monolithic neural network classifiers are not applicable because of their
limitation to tasks with relatively few classes.
15.2.1 Decomposition of Posterior Probabilities
Applying the principle of divide and conquer, we can break down the task of dis-
criminating between thousands of classes into a hierarchical structure of many
smaller classification tasks of controlled size. This idea underlies the approaches
to decision tree architectures [5, 21, 23]. Decision trees classify input patterns
by asking categorical questions at each internal node. Depending on the answer
to these questions a single path is followed to one of the child nodes and the
process repeats until a leaf node is reached and a (winner) class label is emit-
ted. Therefore, decision tree classifiers can only supply us with hard decisions.
No information about the confusability of a specific input pattern is given to
us. Rather, we are often interested in the posterior class probabilities because
we wish to have a measure of the ambiguity of a decision. Furthermore, we
are sometimes required to feed a measure of the degree of membership for all
[Figure 15.1: a classification tree with root S, children S_1, S_2, and grandchildren S_11, S_12, S_13 and S_21, S_22, S_23; the edges are labelled with the conditional posteriors p(S_i|x), p(S_1j|S_1, x) and p(S_2j|S_2, x).]
Fig. 15.1. Hierarchical decomposition of posteriors
potential classes into a superior decision making process. As we will see in section
4, statistical speech recognition is a typical example for the latter scenario.
Adhering to the divide and conquer approach but generalizing the decision
tree framework, the statistical method of factoring posteriors can be applied
to design soft classification trees [24, 25]. For now, we assume that optimal posterior probabilities are available. Let S be a (possibly large) set of classes ω_k to be discriminated. Consider we have a method at our disposal which gives us a partitioning of S into M disjoint and non-empty subsets S_i such that members of S_i are almost never confused with members of S_j (j ≠ i). A particular class ω_k will now be a member of S and exactly one of the subsets S_i. Therefore, we can rewrite the posterior probability of class ω_k as a joint probability of the class and the corresponding subset S_i and factor it according to
$$p(\omega_k \mid x) = p(\omega_k, S_i \mid x) = p(S_i \mid x)\; p(\omega_k \mid S_i, x) \qquad \text{with } \omega_k \in S_i.$$
Thus, the global task of discriminating between all the classes in S has been
converted into (1) discriminating between subsets S_i and (2) independently discriminating between the classes ω_k remaining within each of the subsets S_i.
Recursively repeating this process yields a hierarchical tree-organized structure
(Fig. 15.1).
Note that the number of subclasses S_i of each node does not need to be constant throughout the classifier tree and might be subject to optimization during the tree design phase. In order to compute the posterior probability for a specific
class, we have to follow the path from root node to the leaf corresponding to the
class in question, multiplying all the conditional posteriors along the way. Both
the design of the tree structure (divide) and the estimation and multiplication
(conquer) of conditional posteriors at each node are important aspects in this
architecture, that have to be considered thoroughly because in practice, only
approximations to the conditional posteriors are available.
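The decomposition can be stated very compactly in code. The sketch below is our own illustration (the Node structure and all names are hypothetical): each internal node holds a small classifier that returns the conditional posteriors of its children, and the leaf posteriors are obtained by multiplying these values along every root-to-leaf path as in Fig. 15.1.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Node:
    """Either a leaf carrying a class label, or an internal node whose
    classifier(x) returns the conditional posteriors of its children."""
    label: Optional[str] = None
    classifier: Optional[Callable] = None
    children: List["Node"] = field(default_factory=list)

def leaf_posteriors(node: Node, x, mass: float = 1.0) -> Dict[str, float]:
    """Distribute the incoming probability mass: multiply the conditional
    posteriors along every root-to-leaf path."""
    if not node.children:
        return {node.label: mass}
    result: Dict[str, float] = {}
    for child, p in zip(node.children, node.classifier(x)):
        result.update(leaf_posteriors(child, x, mass * p))
    return result

# Toy tree with stub node classifiers: posteriors sum to one over the leaves
tree = Node(classifier=lambda x: [0.7, 0.3], children=[
    Node(classifier=lambda x: [0.9, 0.1], children=[Node(label="a"), Node(label="b")]),
    Node(classifier=lambda x: [0.5, 0.5], children=[Node(label="c"), Node(label="d")]),
])
print(leaf_posteriors(tree, x=None))  # {'a': 0.63, 'b': 0.07, 'c': 0.15, 'd': 0.15}
```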
15.2.2 Hierarchical Interpretation
The presented architecture can be interpreted as a probability mass distribution
device. At the root node, an initial probability mass of 1 is fed into the architec-
ture. At each node, the incoming probability mass is multiplied by the respective
conditional posterior probabilities and fed into the child nodes. Eventually, the
probability mass is distributed among all the leaves (classes) rendering their pos-
terior probabilities. In contrast, classifier trees are mostly used as hard-switching
devices, where only a single path from root node to one of the leaves is taken.
A hierarchical decomposition of posterior probabilities through a soft classifi-
cation tree offers several advantages. If one of the nodes in the tree, for example
the root node fails to provide good estimates of conditional posteriors, a hard
decision tree will produce many classification errors. In a soft classification tree,
such shortcomings will influence the decision process less dramatically. Also,
recovery from errors is often possible through a superior decision process.
Another aspect of soft classification trees that can be exploited for various
purposes is the sum-to-unity property observable in any horizontal cross-section
at any level of the tree. The tree can be cut off at a certain level and still be
used as a soft classification tree that computes posterior class probabilities. This
is equivalent to creating a new (smaller) set of classes by clustering and merg-
ing the original classes according to the tree topology. In general, the resulting
classification task will be easier to solve than the original one.
Related to the sum-to-unity property of cross-sections is that the partial pos-
teriors computed on a path from the root node to a leaf are decreasing mono-
tonically. This in turn allows paths to be closed whenever a suitable threshold is
reached, pruning whole subtrees with classes that would otherwise receive poste-
riors smaller than the threshold. This property yields the possibility to smoothly
trade off classification accuracy against computational complexity. In the limit,
when only a single path with highest conditional posterior is followed, the soft
classification tree transmutes into a hard decision tree.
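A sketch of such a thresholded traversal is given below (our own construction, using a hypothetical nested-tuple tree representation); subtrees whose accumulated probability mass falls below the threshold are skipped entirely.

```python
def pruned_posteriors(node, x, threshold=1e-3, mass=1.0):
    """Threshold-pruned traversal of a soft classification tree. A node is
    either a leaf (a class name string) or a pair (classifier, children),
    where classifier(x) yields the children's conditional posteriors."""
    if isinstance(node, str):
        return {node: mass}
    classifier, children = node
    posteriors = {}
    for child, p in zip(children, classifier(x)):
        # Partial posteriors only decrease along a path, so the whole subtree
        # can be closed as soon as its mass drops below the threshold.
        if mass * p >= threshold:
            posteriors.update(pruned_posteriors(child, x, threshold, mass * p))
    return posteriors
```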
15.2.3 Estimation of Conditional Node Posteriors
Given a hierarchical decomposition of posterior class probabilities, it remains to
instantiate the tree nodes with estimators for the required conditional posteriors.
Conditioning a posterior on a subset of classes S_i can be accomplished by restricting the training set of the corresponding learning device to the patterns with a class label from S_i. According to this setting, the available training data in each
node is distributed among all its child nodes according to the class partitioning.
While the root node receives all available training data, nodes further down the
tree receive less data than their predecessors. On the other hand, specialization
increases from root node to leaves. This fact has important consequences on
learning speed and model selection when training whole hierarchies.
One of the important issues in hierarchical decompositions of posterior prob-
abilities are the unavoidable inaccuracies of practical estimators for the condi-
tional posteriors that have to be provided in each tree node. Neural networks
can only be trained to approximate the true distribution of posterior class proba-
bilities and the degree of accuracy depends on both the inherent difficulty of the
task as given by the training set and the network structure and training schedule
being used. Inaccurate approximations to the true distribution of posteriors hurt
most in the upper layers of a classification tree - a fact that has to be taken into
account by tree design procedures, which we will discuss next.
15.3 Classifier Tree Design
When it comes to the design of soft classifier trees, or equivalently to the de-
sign of hierarchical decompositions of class posteriors, the choice of algorithm
depends mostly on the number of initial classes. We will first discuss optimal
tree structures before turning to the heuristic design algorithms that become necessary when dealing with large numbers of classes.
15.3.1 Optimality
The optimal soft classification tree for a given task and given type and struc-
ture of estimators for the conditional node posteriors is the one which results
in minimum classification error in the Bayes setting. If all the node classifiers
would compute the true conditional posteriors, the tree structure would have no
influence on the classifier performance because any kind of factoring (through
any kind of tree structure) yields an exact decomposition of the class posteriors.
However, in practice, approximation errors of node classifiers render the choice
of tree structure an important issue. For small numbers of classes, the optimal
tree can in principle be found by exhaustively training and testing all possible
partitionings for a particular node (starting with the root node) and choosing the
one that gives the highest recognition accuracy. However, even if restricting the
tree structure to binary branching nodes and balanced partitionings, the number
K of partitionings that have to be examined at the root node
$$K = \frac{N!}{\left(\frac{N}{2}!\right)^{2}}$$
quickly brings this algorithm to its limits, even for a moderate number of classes
N. Therefore, we have to consider heuristics to derive near optimal tree struc-
tures. For example, one valid possibility is to assume that the accuracy of achiev-
able approximations to the true posteriors is related to the separability of the
corresponding sets of classes.
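As a concrete illustration of why this exhaustive search is impractical: already for N = 20 classes, the formula above gives K = 20!/(10!)² = 184,756 balanced binary partitionings to evaluate at the root node alone, each requiring its own classifier to be trained and tested.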
15.3.2 Prior Knowledge
Following the above mentioned guideline, prior knowledge about the task in
question can often be applied to hierarchically partition the global set of classes
into reasonable subsets. The goal is to partition the remaining set of classes in
a way that intuitively maximizes the separability of the subsets. For example,
given a set of phones in a speech recognizer, a reasonable first partitioning would
be to build subsets consisting of voiced and unvoiced phones. In larger speech
recognition systems where we have to distinguish among multi-state context-
dependent phones, prior knowledge such as state and context identity can be used
as splitting criterion. In tasks such as speaker or writer identification, features
such as gender or age are potential candidates for splitting criteria.
The advantage of such knowledge driven decompositions is a fast tree design
phase which is a clear superiority of this approach when dealing with large
numbers of classes. However, this method for the design of hierarchical classifiers
is subjective and error prone. Two experts in a specific field might disagree
strongly on what constitutes a reasonable hierarchy. Furthermore, it is not always
the case that reasonable partitionings yield good separability of subsets. Expert
knowledge can be misleading.
15.3.3 Confusion Matrices
In case the number of classes is small enough to allow the training of a single
classifier, the design of a soft classifier tree can be based on the confusion matrix
of the trained monolithic classifier. Indicating the confusability of each pair of
classes, the confusion matrix yields relatively good measures of the separabil-
ity of pairs of classes. This information can be exploited for designing a tree
structure using a clustering algorithm. For instance, we can define the following
(symmetric) distance measure between two disjoint sets of classes S_k and S_l:

$$d(S_k, S_l) = \sum_{\omega_i \in S_k} \sum_{\omega_j \in S_l} C(\omega_i, \omega_j \mid T) + C(\omega_j, \omega_i \mid T)$$

where C(ω_i, ω_j | T) denotes the number of times class ω_i is confused with class ω_j as measured on a set of labeled patterns T. The distance d(S_k, S_l) can now
be used as a replacement for the usual Euclidean distance measure in a stan-
dard bottom-up clustering algorithm. Unfortunately, once the number of classes
increases to several thousand, training of a monolithic classifier becomes increas-
ingly difficult.
15.3.4 Agglomerative Clustering
Assuming that separability of classes correlates with approximation accuracy of
estimators for the posterior class probabilities, we can go further and assume that
separability of classes can be measured by a suitable distance between the class
conditional distributions in feature space. We already introduced such a distance
measure in form of the elements of a class confusion matrix. Other, computation-
ally less expensive distance measures would be the Euclidean distance between
class means or the Mahalanobis distance between the classes' second order statis-
tics. Irrespective of the chosen distance measure, the goal always is to group the
set of classes in a way that results in maximum inter- and minimum intra-group
distances. Solutions to this problem are known as agglomerative clustering al-
gorithms and a large pool of variations of the basic algorithm is available in the
literature [7].
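To make the two preceding subsections concrete, the sketch below (ours; the names and the toy confusion matrix are hypothetical) plugs the confusion-based distance of Sect. 15.3.3 into a simple bottom-up merging loop. We merge the most confused groups first so that hard distinctions end up deep in the tree; the chapter specifies the distance measure but leaves the exact merging criterion to the chosen clustering algorithm.

```python
import numpy as np

def group_confusion(group_a, group_b, C):
    """d(S_k, S_l): total confusion counts between two disjoint groups of
    classes, where C[i, j] counts how often class i was classified as class j."""
    return sum(C[i, j] + C[j, i] for i in group_a for j in group_b)

def build_tree(C):
    """Bottom-up tree design: repeatedly merge the two most confused groups."""
    groups = [([i], i) for i in range(C.shape[0])]   # (member classes, subtree)
    while len(groups) > 1:
        a, b = max(((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
                   key=lambda ij: group_confusion(groups[ij[0]][0], groups[ij[1]][0], C))
        merged = (groups[a][0] + groups[b][0], (groups[a][1], groups[b][1]))
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return groups[0][1]

# Toy confusion matrix: classes 0/1 and 2/3 are often confused with each other
C = np.array([[50, 9, 1, 0], [8, 55, 0, 1], [1, 0, 60, 7], [0, 2, 9, 45]])
print(build_tree(C))  # ((0, 1), (2, 3))
```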
15.4 Application to Speech Recognition
In this section, we will demonstrate the main ideas and benefits of the hier-
archical classifier approach on the task of large vocabulary continuous speech
recognition (LVCSR). More specifically, we will focus on acoustic modeling for
statistical speech recognition using hidden Markov models (HMM) [27]. To give
an impression of the complexity of such a task: training databases typically con-
sist of tens of millions of speech patterns, the number of acoustic classes being
distinguished ranges from ca. 50 (monophones) to over 20000 (context-dependent
polyphones).
15.4.1 Statistical Speech Recognition
The basic statistical entity in HMM based speech recognition is the posterior probability of word sequences W_1,...,W_N given a sequence of acoustic observations X_1,...,X_M and a set of model parameters Θ

$$P(W_1,\ldots,W_N \mid X_1,\ldots,X_M, \Theta).$$

During training, we are seeking parameters Θ that maximize this probability on the training data

$$\hat{\Theta} = \arg\max_{\Theta} \prod_{t=1}^{T} P(W_1,\ldots,W_{N(t)} \mid X_1,\ldots,X_{M(t)}, \Theta)$$

and during recognition, we want to find the sequence of words that maximizes this probability for a given acoustic observation and fixed model parameters Θ

$$\hat{W}_1,\ldots,\hat{W}_N = \arg\max_{W_1,\ldots,W_N} P(W_1,\ldots,W_N \mid X_1,\ldots,X_M, \Theta).$$
In order to simplify the process of maximizing the posterior probability of word sequences, Bayes rule is usually applied

$$P(W_1,\ldots,W_N \mid X_1,\ldots,X_M) = \frac{P(X_1,\ldots,X_M \mid W_1,\ldots,W_N)\; P(W_1,\ldots,W_N)}{P(X_1,\ldots,X_M)}$$
This rule separates the estimation process into the so called acoustic model (AM)
consisting of terms that depend on the acoustic observations X_1,...,X_M and
the language model (LM) consisting of terms that depend only on the sequence of words W_1,...,W_N. In this paper we will focus on acoustic modeling using
connectionist estimators as a typical example of a task involving the discrimina-
tion of thousands of classes. For a review on other important aspects of LVCSR
such as pronunciation modeling, language modeling and decoding algorithms we
refer the reader to [27].
The task of acoustic modeling (ignoring the denominator) is to estimate pa-
rameters Θ_AM which maximize

$$P(X_1,\ldots,X_M \mid W_1,\ldots,W_N, \Theta_{AM}).$$
Words W_i are modeled as sequences (or graphs) of phone models. The mapping
from words to phone models is usually accomplished by means of a pronunciation
dictionary. Phone models in turn are usually modeled as m-state left-to-right
hidden Markov models (HMM) to capture the temporal and acoustic variability
of speech signals. The following figure shows the process of converting a sequence
of words into (1) a pronunciation graph (possibly with pronunciation variants)
and (2) an HMM state graph.
[Figure 15.2: the word sequence “I WILL GO” expanded into a pronunciation graph over the phones /AY/, /W/, /IH/, /L/, /G/, /OW/ and further into an HMM state graph.]
Fig. 15.2. Typical hidden Markov model in speech recognition
In this framework, where word sequences are represented as directed acyclic
graphs of HMM states, the likelihood of an acoustic observation can be rewritten
as
$$P(X_1,\ldots,X_M \mid W_1,\ldots,W_N) = \sum_{s_1,\ldots,s_M} P(X_1,\ldots,X_M \mid s_1,\ldots,s_M)\; p(s_1,\ldots,s_M)$$
where the summation extends over all possible state sequences s_1,...,s_M in the HMM model for the word sequence W_1,...,W_N. In the Viterbi approximation, the above likelihood is approximated by the probability of the most likely state sequence

$$P(X_1,\ldots,X_M \mid W_1,\ldots,W_N) \approx \max_{s_1,\ldots,s_M} P(X_1,\ldots,X_M \mid s_1,\ldots,s_M)\; p(s_1,\ldots,s_M).$$
Given a specific state sequence, the likelihood of the acoustic observations given
that sequence can be factored as follows
$$P(X_1,\ldots,X_M \mid s_1,\ldots,s_M) = \prod_{i=1}^{M} p(X_i \mid X_1,\ldots,X_{i-1}, s_1,\ldots,s_M)\; p(s_1,\ldots,s_M).$$
In the application of first-order hidden Markov models to the estimation of such
likelihoods one usually makes two simplifying assumptions:
Independence of Observations:

$$P(X_1,\ldots,X_M \mid s_1,\ldots,s_M) \approx \prod_{i=1}^{M} p(X_i \mid s_1,\ldots,s_M)\; p(s_1,\ldots,s_M)$$

First-order Assumption:

$$P(X_1,\ldots,X_M \mid s_1,\ldots,s_M) \approx \prod_{i=1}^{M} p(X_i \mid s_i)\; p(s_i \mid s_{i-1})$$
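Under these two assumptions, scoring one particular state sequence reduces to a sum of log emission and log transition terms. The small sketch below is only the per-path score, not the full Viterbi search; emission_logprob and trans_logprob are hypothetical callables (the former could be a connectionist estimator, the latter a multinomial table).

```python
def path_log_score(states, observations, emission_logprob, trans_logprob):
    """Score of one HMM state sequence under the independence and first-order
    assumptions: sum over i of log p(X_i | s_i) + log p(s_i | s_{i-1}).
    The initial state's transition term is omitted for simplicity."""
    total, prev = 0.0, None
    for s, x in zip(states, observations):
        total += emission_logprob(x, s)
        if prev is not None:
            total += trans_logprob(prev, s)
        prev = s
    return total
```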
15.4.2 Emission and Transition Modeling
Mainstream LVCSR systems follow the above approach by modeling emission
probability distributions p(X_i|s_i) and transition probabilities p(s_i|s_{i-1}) separately and independently. Emission probability distributions are usually modeled using mixture densities from the exponential family, such as the mixture of Gaussians
$$p(X_i \mid s_i) = \sum_{k=1}^{n} \gamma_k\, \mathcal{N}_k(X_i \mid s_i)$$

where the γ_k denote mixture coefficients and the N_k mixture component densities (here: normal distributions). Transition probabilities on the other hand
are modeled by simple multinomial probabilities since they are conditioned on
a discrete variable only (not on the input vector).
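As a small numerical sketch of the emission model, a single state's mixture-of-Gaussians density can be evaluated as follows; diagonal covariances are our simplifying assumption, not something the chapter prescribes.

```python
import numpy as np

def gmm_emission(x, weights, means, variances):
    """p(x | s) = sum_k gamma_k * N_k(x): a mixture of Gaussians with diagonal
    covariances for one HMM state s."""
    x = np.asarray(x, dtype=float)
    density = 0.0
    for gamma, mu, var in zip(weights, means, variances):
        var = np.asarray(var, dtype=float)
        norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
        density += gamma * norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))
    return density

# One state, two components in a 2-dimensional feature space
p = gmm_emission([0.3, -0.1],
                 weights=[0.6, 0.4],
                 means=[np.zeros(2), np.ones(2)],
                 variances=[np.ones(2), 2.0 * np.ones(2)])
```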
The advantage of this approach is a decoupled estimation process that sep-
arates temporal and acoustic modeling. As such, it allows one to easily vary HMM
state topologies after training in order to modify temporal behavior. For instance,
state duplication is a popular technique to increase the minimum duration con-
straint in phone models. Having separated emission and transition probability
estimation, state duplication consists of simply sharing acoustic models among
multiple states and adapting the transition probabilities.
However, the disadvantage of the above approach is a mismatch in the dy-
namic range of emission and transition probabilities. The reason is that transition
probabilities are modeled separately as multinomial probabilities, constrained
by the requirement to sum to one. This leads to a dominant role of emission
probabilities with transition probabilities hardly influencing overall system per-
formance.
15.4.3 Phonetic Context Modeling
So far we have assumed that only one HMM is required per modeled monophone
(see Fig. 15.2). Since the English language can be modeled by approximately 45
monophones, one might get the impression that only that number of HMM mod-
els need to be trained. In practice however, one observes an effect called coar-
ticulation that causes large variations in the way specific monophones actually
sound, depending on their phonetic context.
Usually, explicit modeling of phones in phonetic context yields great gains in
recognition accuracy. However, it is not immediately clear how to achieve robust
context-dependent modeling. Consider, for example, so called triphone models. A
triphone essentially represents the realization of a specific monophone in a spe-
cific context spanning one phone to the left and right. Assuming an inventory of
45 monophones, the number of (theoretically) possible triphones is $45^3 = 91125$.
Many of these triphones will occur rarely or never in actual speech due to the lin-
guistic constraints in the language. Using triphones therefore results in a system
which has too many parameters to train. To avoid this problem, one has to intro-
duce a mechanism for sharing parameters across different triphone models.
Typically, a CART-like decision tree is adopted to cluster triphones into gen-
eralized triphones based on both their a-priori probability and their acoustic
similarity. Such a top-down clustering requires the specification of viable at-
tributes to be used as questions on phonetic context in order to split tree nodes.
Mostly, linguistic classes such as vowel, consonant, fricative, plosive, etc. are
being employed. Furthermore, one can generalize triphones to polyphones by al-
lowing dependence on a wider context (and not just the immediate left and right
neighboring phones). Fig. 15.3 shows a typical decision tree for the clustering of
polyphonic variations of a particular monophone state.
The collection of all leaf nodes of decision trees for each monophone state in
a given system represents a robust and general set of context-dependent sub-
phonetic units. Since each of these units corresponds to several triphone HMM
[Figure: binary decision tree asking phonetic-context questions such as +1=SONORANT?, +2=VOICED?, +1=SIBILANT?, -1=VOICED?, +1=BILABIAL?, -1=LABIAL?, with leaf nodes AX-m(0) through AX-m(8).]
Fig. 15.3. Phonetic Context Modeling using Decision Trees. Shown is a decision tree
modeling phonetic contexts of middle state (3-state HMM) of monophone /AX/.
states, they are often called tied states. Typically, a large vocabulary continuous
speech recognizer models between 3000 and 24000 such tied states. Mainstream
LVCSR systems scale to any number of context-dependent modeling units since
emission and transition models are independently estimated for each tied state.
15.4.4 Connectionist Acoustic Modeling
Locally discriminant connectionist acoustic modeling is the most popular ap-
proach to integrate neural networks into an HMM framework [3, 4, 18]. It is
based on converting estimates of local posterior class probabilities to scaled like-
lihoods using Bayes rule. These scaled likelihoods can then be used as observation
probability estimates in standard HMMs. For a moderately small number N of
HMM states, a neural network can be trained to jointly estimate posterior probabilities $p(s_i \mid X_i)$ for each state $s_i$ given an input vector $X_i$. Bayes rule yields the corresponding scaled¹ class conditional likelihoods
$$\hat{p}(X_i \mid s_i) = \frac{p(s_i \mid X_i)}{p(s_i)}.$$
¹ The missing additional term, the probability of the input vector $p(X_i)$, is usually omitted because it is independent of the class/state identity and therefore does not influence a Viterbi style search for the most likely state sequence.
[Figure: tree of networks labeled SIL-Net, NOISE-Net, SPEECH-Net, MONO-Net, CONTEXT-Nets, and STATE-Nets.]
Fig. 15.4. Topology of a Hierarchy of Neural Networks (HNN) to estimate context-
dependent posteriors, factored based on a-priori phonetic knowledge
While $p(s_i \mid X_i)$ is estimated using a neural network, prior probabilities $p(s_i)$ can
be estimated by relative frequencies as observed in the training data. Several
researchers (e. g. [3, 14]) have reported improvements with connectionist acoustic
modeling when the technique for the estimation of emission probabilities was the
only difference in comparison. Since mainstream HMMs for speech recognizers
are mostly trained in a maximum likelihood framework using the Expectation-
Maximization (EM) algorithm, incorporation of discriminatively trained neural
networks that focus on modeling of class boundaries instead of class distributions
is often observed to be beneficial. Also, compared to mixtures of Gaussians based
acoustic models, connectionist acoustic models are often reported to achieve the
same accuracy with far fewer parameters.
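A minimal sketch of the posterior-to-scaled-likelihood conversion described above (hypothetical array names, not the authors' implementation): network posteriors are divided by relative-frequency state priors, done here in the log domain for numerical stability.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_counts, floor=1e-8):
    """Convert per-frame state posteriors into scaled log-likelihoods.

    log_posteriors : (T, S) log p(s | X_t) from the neural network
    state_counts   : (S,)   state occupation counts from the training alignment
    """
    priors = state_counts / state_counts.sum()      # relative-frequency priors p(s)
    priors = np.maximum(priors, floor)              # guard against unseen states
    return log_posteriors - np.log(priors)          # log p(s|X)/p(s) = log p_hat(X|s)

# toy example: 2 frames, 4 states
post = np.log(np.array([[0.70, 0.10, 0.10, 0.10],
                        [0.25, 0.25, 0.25, 0.25]]))
print(scaled_log_likelihoods(post, state_counts=np.array([400, 300, 200, 100])))
```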
However, when the number of HMM states is increased to model context-
dependent polyphones (triphones, quintphones), a single neural network can no
longer be applied to estimate posteriors. It becomes necessary to factor the pos-
terior state probabilities [17] and modularize the process of estimating those
posteriors. In most approaches, the posteriors are factored on phonetic context
or monophone identity (e.g. [4, 9, 15]). Viewing factoring as a hierarchical de-
composition of posteriors, we generalized the approaches to context-dependent
connectionist acoustic modeling by introducing a tree structured hierarchy of
neural networks (HNN) [12, 13] corresponding to a multi-level factoring of pos-
teriors based on a-priori knowledge such as broad sound classes (silence, noises,
phones), phonetic context and HMM state identity. Fig. 15.4 shows the topology
of such a structure.
At the top of this hierarchy, we discriminate silence, noise and speech sounds
by means of two networks (SIL-Net, SPEECH-Net). The motivation for this spe-
cific partitioning comes from the observation that these three classes are easy to
distinguish acoustically. The remainder of the tree structure decomposes the pos-
terior of speech, conditioning on monophone, context and state identity as these
are convenient sound classes modeled by any phone based HMM speech recog-
nizer. The hierarchy of Fig. 15.4 can be decomposed even further, for instance
by factoring conditional monophone posteriors (estimated by the MONO-Net)
based on linguistic features (e.g. voiced/unvoiced, vowel/consonantal, fricative
etc.). The motivation behind such a decomposition is twofold. First, it reduces
the number of local classes in each node, improving approximation accuracy and
second, it yields a decoupled and specialized set of expert networks having to
handle a smaller amount of phonetic variation.
However, as mentioned in section 3, the use of prior knowledge for the design
of a hierarchy of neural networks does not take into account dissimilarity of the
observed classes in feature space. We therefore developed an agglomerative clus-
tering algorithm to automatically design such hierarchies for the estimation of
posteriors for a large number of classes. We termed this framework ACID/HNN
[11].
15.4.5 ACID Clustering
ACID (Agglomerative Clustering based on Information Divergence) is a bottom-
up clustering algorithm for the design of tree-structured soft classifiers such as
a hierarchy of neural networks (HNN) [10, 11]. Although developed for connec-
tionist acoustic modeling, the algorithm can in principle be used for any kind of
classification task. Starting from a typically very large set of initial classes, for
example the set of decision tree clustered HMM states in a speech recognizer²,
the ACID algorithm constructs a binary hierarchy. The nodes in the resulting
tree are then instantiated with estimators for the respective conditional poste-
rior probabilities, for instance in form of an HNN. The clustering metric in the
ACID algorithm is the symmetric information divergence [26]
$$d(s_i, s_j) = \int_x \big( p(x \mid s_i) - p(x \mid s_j) \big) \log \frac{p(x \mid s_i)}{p(x \mid s_j)}\, dx$$
between class conditional densities of clusters. In contrast to standard agglom-
erative clustering algorithms which mostly represent clusters by their means
and employ the Euclidean distance metric, we chose to represent clusters by
parametric mixture densities (mixtures of Gaussians) in the ACID algorithm.
² In our case, we experimented with up to 24000 initial classes.
Modeling clusters with mixture densities is much more adequate than using just
the mean, and it still allows clustering large numbers of classes in a reasonable
time. The symmetric information divergence (also called Kullback-Leibler dis-
tance) measures the dissimilarity of two distributions and was therefore chosen
as the clustering metric. Typically, each initial class (state) is modeled using a
single full covariance multivariate Gaussian density
$$p(x \mid s_i) = \frac{1}{\sqrt{(2\pi)^n\, |\Sigma_i|}}\; \exp\Big\{ -\tfrac{1}{2}\, (x - \mu_i)^t\, \Sigma_i^{-1}\, (x - \mu_i) \Big\}$$
with mean vector $\mu_i$ and covariance matrix $\Sigma_i$. Clustering then continuously
merges initial classes which corresponds to building mixture densities based on
the Gaussians. The symmetric information divergence between two states $s_i$ and $s_j$ with Gaussian distributions amounts to
$$d(s_i, s_j) = \tfrac{1}{2}\, \mathrm{tr}\big\{ (\Sigma_i - \Sigma_j)(\Sigma_j^{-1} - \Sigma_i^{-1}) \big\} + \tfrac{1}{2}\, \mathrm{tr}\big\{ (\Sigma_i^{-1} + \Sigma_j^{-1})(\mu_i - \mu_j)(\mu_i - \mu_j)^t \big\}$$
The computation of this distance measure requires $O(n^2)$ multiplications and
additions (assuming pre-computed inverse covariance matrices), where n is the
dimensionality of the input feature space. To reduce the computational load of
the ACID clustering algorithm, one can model the class conditional likelihoods
with diagonal covariance matrices only. Feature space transformations such as
principal component analysis and linear discriminant analysis can be used to
approximate such distributions. When using diagonal covariance Gaussians, the
symmetric information divergence simplifies to
$$d(s_i, s_j) = \frac{1}{2} \sum_{k=1}^{n} \frac{(\sigma_{jk}^2 - \sigma_{ik}^2)^2 + (\sigma_{ik}^2 + \sigma_{jk}^2)(\mu_{ik} - \mu_{jk})^2}{\sigma_{ik}^2\, \sigma_{jk}^2}$$
where $\sigma_{ik}^2$ and $\mu_{ik}$ denote the $k$-th coefficient of the variance and mean vectors
of state $s_i$, respectively. The evaluation of the latter distance measure requires
only O(n) multiplications and additions.
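A small sketch of the O(n) divergence above for two diagonal-covariance Gaussians (the variable names are assumptions):

```python
import numpy as np

def symmetric_divergence_diag(mu_i, var_i, mu_j, var_j):
    """Symmetric information divergence between two diagonal-covariance Gaussians."""
    num = (var_j - var_i) ** 2 + (var_i + var_j) * (mu_i - mu_j) ** 2
    return 0.5 * np.sum(num / (var_i * var_j))

# quick check: identical Gaussians have zero divergence, shifted/scaled ones do not
mu = np.array([0.0, 1.0]); var = np.array([1.0, 2.0])
print(symmetric_divergence_diag(mu, var, mu, var))                # 0.0
print(symmetric_divergence_diag(mu, var, mu + 1.0, var * 2.0))    # > 0
```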
Making the simplifying assumption of linearity of information divergence, we
can define the following distance measure between clusters of Gaussians $S_k$ and $S_l$:
$$D(S_k, S_l) = \sum_{s_i \in S_k} p(s_i \mid S_k) \sum_{s_j \in S_l} p(s_j \mid S_l)\; d(s_i, s_j)$$
This distance measure is used in the ACID clustering algorithm:
1. Initialize the algorithm with N clusters $S_i$, each containing (1) a parametric model of the class-conditional likelihood and (2) a count $C_i$ indicating the frequency of class $s_i$ in the training set.
2. Compute within-cluster priors $p(s_i \mid S_k)$ for each cluster $S_k$, using the counts $C_i$.
3. Compute the symmetric divergence measure $D(S_k, S_l)$ between all pairs of clusters $S_k$ and $S_l$.
4. Find the pair of clusters $S_k$ and $S_l$ with minimum divergence.
5. Create a new cluster $S = S_k \cup S_l$ containing all Gaussians of $S_k$ and $S_l$ plus their respective class counts. The resulting parametric model is a mixture of Gaussians where the mixture coefficients are the class priors.
6. Delete clusters $S_k$ and $S_l$.
7. While there are at least 2 clusters remaining, continue with step 2.
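The listing above translates into a short, deliberately naive agglomerative loop. The sketch below (hypothetical data structures, not the authors' implementation) reuses the diagonal-Gaussian divergence from earlier and is meant only to illustrate the bookkeeping of counts, priors and merges.

```python
import numpy as np

def acid_cluster(gaussians, counts, divergence):
    """Naive ACID-style agglomeration.

    gaussians  : list of (mu, var) pairs, one single Gaussian per initial class
    counts     : class frequencies C_i from the training set
    divergence : function d(g_i, g_j) between two single Gaussians
    Returns the merge history as pairs of member-index lists.
    """
    clusters = [[(i, counts[i])] for i in range(len(gaussians))]  # (index, count) members
    history = []

    def D(a, b):
        # pairwise divergences weighted by within-cluster priors p(s_i | S_k)
        ca = sum(c for _, c in a); cb = sum(c for _, c in b)
        return sum((ci / ca) * (cj / cb) * divergence(gaussians[i], gaussians[j])
                   for i, ci in a for j, cj in b)

    while len(clusters) >= 2:
        pairs = [(D(clusters[k], clusters[l]), k, l)
                 for k in range(len(clusters)) for l in range(k + 1, len(clusters))]
        _, k, l = min(pairs)                              # closest pair of clusters
        history.append(([i for i, _ in clusters[k]], [i for i, _ in clusters[l]]))
        merged = clusters[k] + clusters[l]                # mixture of the member Gaussians
        clusters = [c for m, c in enumerate(clusters) if m not in (k, l)] + [merged]
    return history

# toy usage with 5 random single Gaussians and the diagonal symmetric divergence
rng = np.random.default_rng(1)
gs = [(rng.normal(size=2), np.ones(2)) for _ in range(5)]
div = lambda a, b: 0.5 * np.sum(((a[1] - b[1])**2 + (a[1] + b[1]) * (a[0] - b[0])**2)
                                / (a[1] * b[1]))
print(len(acid_cluster(gs, counts=[10, 20, 30, 40, 50], divergence=div)))  # 4 merges
```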
ACID Initialization. Initialization requires estimating class-conditional likelihoods for all (tied) states modeled by the recognizer. The number N of initial
classes therefore is determined by other parts of the speech recognizer, namely
by the phonetic decision tree that is typically applied to cluster phonetic con-
texts, or equivalently to tie HMM states [27]. Initial class conditional densities
for these classes can be computed from state alignments using either the Viterbi
or the Forward-Backward algorithm on training data and corresponding HMM
state graphs generated from training transcriptions. Estimation of initial para-
metric models for the ACID algorithm therefore requires a single pass through
the training data. After initial models have been estimated, the actual ACID
clustering does not require any further passes through the training data. Fur-
thermore, note that this algorithm clusters HMM states without knowledge of
their phonetic identity, solely based on acoustic dissimilarity.
ACID Dendrograms. For illustration purposes, Fig. 15.5 shows a dendrogram
of a typical ACID clustering run on a relatively small set of only 56 initial
classes corresponding to the set of single-state monophone HMMs in a context-
independent speech recognizer. The set of classes consists of 44 standard English
phones along with 7 noise sounds (marked with a plus), 4 phones modeling
interjections (marked with an ampersand) and silence (SIL).
Already the top level split separates silence, breathing and noise sounds (lower
subtree) almost perfectly from phonetic sounds (upper subtree). Furthermore,
clusters of acoustically similar phones can be observed in the ACID tree, for
instance
IX,IH,IY,Y
JH,CH,SH,ZH
Z,S,F
ER,AXR,R
[Figure: dendrogram over the 56 initial classes (44 English phones, 7 noise sounds marked with "+", 4 interjection phones marked with "&", and SIL).]
Fig. 15.5. Typical dendrogram of ACID clustering
ACID clustering was found to be quite effective in generating a hierarchical
decomposition of a classification task into subtasks of increasing difficulty (when
traversing the tree from root node to leaves). In the case of connectionist acoustic
modeling for speech recognition, we observed that nodes in the upper layers
of an ACID clustered HNN tree distinguish between broad phonetic classes,
whereas nodes further down the tree begin to distinguish the particular phones
within a broad phonetic class. Thus, ACID clustering constitutes an effective
algorithm for discovering inherent hierarchical structure and exploiting it for
modular classification.
Model Selection. The choice of model size and topology becomes very impor-
tant in the application of hierarchical soft classifiers to tasks such as connectionist
speech recognition. While the global tree topology is determined by the outcome
of the ACID clustering (or any other tree design procedure), it remains to de-
cide on local (node-internal) classifier topology. The task of a local classifier is
to estimate conditional posterior probabilities based on the available training
data. Since a particular local estimator is conditioned on all predecessor nodes
in the tree, it only receives training data from all the classes (leaves) that can
be reached from the respective node. This amounts to a gradually diminishing
training set when going from root node to nodes further down the tree. Fig. 15.6
shows this property of HNNs with a plot of the amount of available training pat-
terns vs. node depth for a binary hierarchy with 6000 leaves. Note the logscale
on the ordinate.
[Plot: (min, average, max) available training data (log scale, 10^3 to 10^8) vs. node depth 1-19.]
Fig. 15.6. Available Training Data in Different Depths of HNN Tree
When deciding on the local model complexity, we consider tree nodes as lying
in a continuum between the following two extrema:
Top of the Hierarchy:
– large amounts of training data available
– allows for large node classifiers
– relatively easy, general classification tasks
Bottom of the Hierarchy:
– only small amounts of training data available
– requires relatively small node classifiers
– comparably hard classification tasks
– high degree of specialization
Ideally, the complexity of local node classifiers should be selected so as to max-
imize generalization ability of the complete hierarchy. Generalization, on the
other hand, is influenced by three factors: (1) size and distribution of the train-
ing set, (2) model complexity and (3) classification complexity of the specific task
at hand. Obviously, we cannot alter the last of these factors. Furthermore, in
the context of our architecture, we assume that the size of the training set for
each node is fixed by the tree topology, once the hierarchy has been designed.
Therefore, we have to choose model complexity based on available training data
and difficulty of classification task.
In our experiments in connectionist acoustic modeling, we typically use multi
layer perceptron (MLP) nodes with a single hidden layer and control model
complexity by varying the number of hidden units. We use standard projective
kernels with tanh activations for the hidden units and a task dependent non-
linearity for the output units (sigmoid for binary and softmax for multiway
classification). The overall number of weights in such a network depends linearly
on the number of hidden units. According to [1] and with some approximations,
a rule of thumb is to choose the number of hidden units M to satisfy
$$N > \frac{M}{\epsilon}$$
where N is the size of the training set and $\epsilon$ is the expected error rate on the test
set. In our case, the variation in the number of training patterns in the different
nodes dominates the above formula. Therefore, we set the number of hidden
units proportional to $b^{-n}$, where $b$ is the branching factor of the classification
tree and n is the node depth. As long as the tree is approximately balanced
in terms of the prior distribution of child nodes, this strategy leads to hidden
layers with size proportional to the number of available training patterns. A
more fundamental treatment of model complexity using multiple training runs
and cross validation is desirable. However, in case of large-scale applications
such as speech recognition, such a strategy is not realizable because of the high
computational cost resulting from very large training databases. Less heuristic
approaches to select model complexity still have to be explored.
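A tiny sketch of the depth-dependent sizing heuristic (the base size, branching factor and floor are arbitrary illustrative choices, not the values used in the chapter):

```python
def hidden_units(depth, base=256, branching=2, minimum=16):
    """Hidden layer size proportional to branching**(-depth), floored at a minimum."""
    return max(minimum, int(round(base * branching ** (-depth))))

for n in range(8):
    print(n, hidden_units(n))   # 256, 128, 64, 32, 16, 16, 16, 16
```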
15.4.6 Training Hierarchies of Neural Networks on Large Datasets
For the demonstration of various aspects of training large and complex structures
such as hierarchies of neural networks on typical datasets, we report on exper-
iments on the Switchboard [19] speech recognition database. Switchboard is a
large corpus of conversational American English dialogs, recorded in telephone
quality all over the US. It consists of about 170 hours of speech which typically
corresponds to about 60 million training samples. The corpus currently serves
as a benchmark for the official evaluation of state-of-the-art speech recognition
systems. Switchboard is a comparably hard task, current best systems achieve
word error rates in the vicinity of 30-40%. Fig. 15.7 shows the structure of an
HNN based connectionist acoustic model for an HMM based recognizer, in our
case the Janus recognition toolkit (JanusRTk) [8].
[Figure: input x enters a Hierarchy of Neural Networks; its outputs ("Posteriors -> Scaled Likelihoods") are mapped to HMM states via a phonetic decision tree.]
Fig. 15.7. Hierarchy of Neural Networks for Connectionist Acoustic Modeling: The
upper part shows an ACID clustered HNN after node merging. This architecture com-
putes posterior probabilities for a set of generalized polyphones. To allow for integra-
tion into the HMM framework, these posteriors are converted to scaled likelihoods. The
correspondence to actual HMM states is accomplished by means of phonetic decision
trees.
Due to the inherent variability and complexity of the task and the large
amount of training data, typical speech recognition systems model several thou-
sand distinct subphonetic units (HMM states) as base classes. This requires training an estimator for posterior probabilities of thousands of distinct acoustic
classes based on millions of training samples, in order to take advantage of the
full modeling granularity of the speech recognizer.
In the following, we will discuss several aspects of training a hierarchical soft
classifier on large datasets such as Switchboard. Due to the modular structure
of the classifier, the size of the model inventory and the training database, the
following discussion leads to rather unique problems and solutions. However, it
is important to emphasize that these properties stem from the structure of the
classifier and the size of the task - not from the specific task of acoustic modeling
for speech recognition. Thus, they are transferable to comparably large tasks,
e. g. handwriting, speaker or face recognition.
Classifier Tree Topology. Depending on the number of classes to be modeled,
tree design algorithm, branching factor and size and structure of local node
classifiers have to be chosen. For Switchboard, we experimented with three
systems consisting of 6000, 10000 and 24000 distinct classes, respectively. We
used the ACID clustering algorithm to design an initial tree structure from
the set of base classes for the 6k and 24k systems. As a second step of the
tree design phase, we applied a greedy node merging algorithm on the ACID
clustered hierarchy. Node merging decreases the number of internal nodes while
increasing the average branching factor (arity) of the tree. Training of such
hierarchies is less problematic than training of the original binary tree structure
since the difference among nodes (in terms of the number of available training
patterns) is somewhat attenuated and the overall number of networks is reduced.
However, local classification tasks change from 2-way (binomial) to more complex
multi-way (multinomial) problems which might have an impact on the accuracy
of estimating conditional posteriors. Therefore, we constrain the node merging
algorithm to produce nodes with a maximum branching factor of 8-12. This
value was found to improve training speed while not affecting overall classifier
accuracy. Considerably larger branching factors are not reasonable in our case as
we would gradually lose the advantage of the hierarchical structure by flattening
the tree.
For the 10k system, we used the architecture of Fig. 15.4, which was
designed by prior knowledge, not taking into account any measure of class sim-
ilarity. This structure exhibits a larger average branching factor and less depth
than the ACID clustered trees. Although we could decrease the branching factor
at the MONO node by introducing linguistic classes as mentioned earlier, we
still have large branching factors at the context nodes which are much harder to
resolve with prior knowledge only.
The resulting tree nodes were instantiated with MLPs with a (single) hidden layer of varying size. The local MLPs' output layers were parameterized with the softmax non-linearity for two reasons. First, it complies with the requirement that the modeled probability distribution sums to one, and second, the softmax function implements the expected value of the multinomial probability density.
Fig. 15.8 gives an overview of the structure of the ACID/HNN systems. Tree com-
pactification reduced the number of internal nodes of the 24k system from 24k
to about 4k by increasing the average number of local classes (average branching
6k system (left):
level   #nodes (= #networks)   hidden units/network
1       1                      256
2       1                      256
3       1                      256
4       3                      192
5       19                     128
6       121                    64
7       816                    32
total   962

24k system (right):
level   #nodes (= #networks)   hidden units/network
1       1                      128
2       10                     128
3       77                     64
4       524                    32
5       3434                   16
total   4046

Fig. 15.8. Overview of ACID clustered HNNs for 6k (left) and 24k (right) classes
factor) from 2 to about 8. Especially when dealing with large numbers of classes,
we found that moderate tree compactification improved classifier performance.
The overall numbers of parameters of the tree classifiers were 2M for the 6k
system, 2.4M for the 10k system and 3.1M for the 24k system.
Training Algorithm and Parameters. Training a distributed, hierarchically
organized collection of neural networks on different amounts of training data is
a challenging task. Our training criterion is maximum likelihood, assuming a
multinomial probability model (1-out-of-N) over all base classes. A target class
label is associated with each training pattern, indicating the correct base class.
All networks in nodes along the path from root node to the current target class’
leaf receive the current pattern for training. Because of the large amount of
training data, we use on-line (stochastic) gradient ascent in log-likelihood with
small batches (10-100 patterns) to train the individual networks. More elaborate
training algorithms which apply second order methods in optimizing the objec-
tive function are too costly in our scenario - a single epoch of training, processing
all 60 million patterns in the training database takes 3-5 days on a Sparc Ultra
workstation. A practical training algorithm therefore must not take longer than
1-4 epochs to converge. Furthermore, because of the large number of networks
that have to be trained, a potential training algorithm can not be allowed to use
large amounts of memory - which could be the case with second order methods.
Of course, training of individual node classifiers is independent and can therefore
easily be parallelized for shared memory multi-processors which alleviates the
latter requirement.
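The routing of a single training pattern can be sketched as follows (a toy stand-in with a minimal softmax node classifier; the real system uses MLPs with hidden layers and the Janus infrastructure): every node on the path from the root to the target leaf performs one gradient step on its locally correct child index.

```python
import numpy as np

class NodeNet:
    """Minimal softmax classifier standing in for a node's MLP (illustration only)."""
    def __init__(self, n_inputs, n_children, rng):
        self.W = rng.normal(scale=0.01, size=(n_children, n_inputs))

    def train_step(self, x, target, lr):
        p = np.exp(self.W @ x); p /= p.sum()            # softmax posteriors over children
        grad = np.outer(p, x); grad[target] -= x        # gradient of -log p[target]
        self.W -= lr * grad

def train_on_pattern(path, x, lr=0.05):
    """Every node on the root-to-leaf path trains on the locally correct child index."""
    for node, child_index in path:
        node.train_step(x, child_index, lr)

# toy hierarchy: a root with 3 children, one of which has 2 grandchildren
rng = np.random.default_rng(0)
root, mid = NodeNet(4, 3, rng), NodeNet(4, 2, rng)
train_on_pattern([(root, 1), (mid, 0)], x=rng.normal(size=4))
```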
Since we are relying on stochastic gradient ascent in our training method, we
additionally use a simple momentum term to smooth gradients. Also, we use
local learning rates for each network that are initialized with a global learning
rate and adapted individually to the specific learning task. The global learning
rate is annealed in an exponentially decaying scheme:
$$\eta_G^{n+1} = \eta_G^{n}\, \gamma_G.$$
Typically, we use an initial global learning rate $\eta_G$ between 0.001 and 0.01, a momentum constant of 0.5 and a global annealing factor $\gamma_G$ of 0.999 ... 0.9999, applied after each batch update.
In order to accommodate the different learning speeds of the node classifiers
due to the different amount of available training data, we control individual
learning rates using the following measure of correlation between successive gradient vectors $g_{n-1}$ and $g_n$:
$$\alpha_n = \arccos \frac{g_n^t\, g_{n-1}}{\|g_n\|\, \|g_{n-1}\|}$$
$\alpha_n$ measures the angle between the gradients $g_{n-1}$ and $g_n$. Small angles indicate high correlation and therefore steady movement in weight space. Therefore, we increase the learning rate linearly up to the current maximum (as determined by initial learning rate, annealing factor and number of updates performed) whenever $\alpha_n < 90°$ for several batch updates M. Large angles, on the other hand, indicate random jumps in weight space. We therefore decrease the learning rate exponentially whenever $\alpha_n > 90°$ for several batch updates M.
In summary, we obtain the following update rule for local learning rate $\eta_i$ of network $i$:
$$\eta_i^{n+1} = \min\left( \eta_G^{n+1},\; \begin{cases} \eta_i^{n} + \delta & \text{if}\;\; \frac{1}{M}\sum_{k=0}^{M} \alpha_{n-k} < 90° - \epsilon \\ \eta_i^{n}\, \gamma & \text{if}\;\; \frac{1}{M}\sum_{k=0}^{M} \alpha_{n-k} > 90° + \epsilon \\ \eta_i^{n} & \text{else} \end{cases} \right)$$
with linear increase $\delta = 0.001 \ldots 0.01$ and exponential annealing factor $\gamma = 0.5 \ldots 0.9$. The number of batch updates M controls the smoothing of $\alpha$, whereas $\epsilon$ controls the influence of the global learning rate. For $\epsilon \to 90°$, local learning rates are forced to follow the global learning rate, whereas low values of $\epsilon$ allow local learning rates to develop individually. Typical values that have been used in our experiments are $M = 10$ and $\epsilon = 20°$.
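A sketch of the learning-rate control described above (parameter names and default values are illustrative assumptions): the angle between successive gradients is smoothed over M batch updates and drives a linear increase or exponential decrease of the local rate, capped by the annealed global rate.

```python
import numpy as np

class LocalRate:
    """Gradient-angle controlled learning rate for one network (illustrative sketch)."""
    def __init__(self, eta0, delta=0.005, gamma=0.7, M=10, eps_deg=20.0):
        self.eta, self.delta, self.gamma = eta0, delta, gamma
        self.M, self.eps = M, eps_deg
        self.prev_grad, self.angles = None, []

    def update(self, grad, eta_global):
        if self.prev_grad is not None:
            cos = grad @ self.prev_grad / (np.linalg.norm(grad) * np.linalg.norm(self.prev_grad))
            self.angles.append(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
            self.angles = self.angles[-self.M:]          # keep the last M angles
        self.prev_grad = grad
        if len(self.angles) == self.M:
            mean_angle = np.mean(self.angles)
            if mean_angle < 90.0 - self.eps:
                self.eta += self.delta                   # steady progress: increase linearly
            elif mean_angle > 90.0 + self.eps:
                self.eta *= self.gamma                   # random jumps: decrease exponentially
        self.eta = min(self.eta, eta_global)             # never exceed the annealed global rate
        return self.eta

# toy usage with random gradients and an exponentially annealed global rate
rate, eta_g = LocalRate(eta0=0.005), 0.01
for step in range(30):
    g = np.random.default_rng(step).normal(size=8)
    eta_g *= 0.999
    lr = rate.update(g, eta_g)
```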
Adapting individual learning rates to the training speed is a critical issue in
hierarchical classifiers. Networks at the top of the tree have to be trained on
very large amounts of training data. Therefore, learning rates must be allowed
to become relatively small in order to benefit from all the data and not reach
the point of saturation too early. On the other hand, networks at the bottom of
the tree have to be trained with comparably small amounts of data. In order to
train these networks within the same small number of passes through the overall
data, we have to apply comparably large learning rates to reach a maximum in
local likelihood as fast as possible. However, unconstrained adaptation of learn-
ing rates with aggressive optimization of learning speed may result in failure to
converge. In our experiments, global initialization of all networks with the same maximum learning rate, global annealing of the maximum learning rate, and local adaptation of individual learning rates constrained to never exceed the global learning rate gave the best results.
Generalization/Overfitting. Simply speaking, we did not observe any over-
fitting in our experiments. Taking a look at the training of a large hierarchy in
terms of performance on an independent cross-validation set (Fig. 15.9), we can
see that the likelihood on this data levels off, but never starts to decrease again,
as is often observed on smaller tasks. In the plots of Fig. 15.9, the vertical lines
indicate multiple epochs (passes) through the training data (consisting of 87000
utterances). Obviously, the large amount of available training data allows for
excellent generalization; early stopping was not necessary. This behavior is sur-
prising at first sight, because we did not use any kind of explicit regularization
of the local MLPs. At second sight, however, we can identify several reasons for
the good generalization of HNNs on this task:
Training data can be considered very noisy in our case, since samples come
from a large variety of different speakers and recording conditions. Training
with noisy data is similar to regularization and therefore improves general-
ization [2].
Consider the hierarchy for the 6k classes system (Fig. 15.8). Some of the 816
networks at the bottom of the tree probably have not seen enough training
patterns to generalize well to new data. Although all of these networks to-
gether constitute 85% of the total number of networks, they contribute just
as one of 7 networks to any particular posterior probability. The networks in
the upper part of the hierarchy have the largest influence on the evaluation of
posterior probabilities. For those networks, the amount of available training
data can be considered so abundant that test set error approaches training
set error rate. In other words, optimal generalization can be achieved.
Results. We evaluate the proposed hierarchical classifiers as connectionist acous-
tic models in a speech recognition system. Performance of speech recognizers is
usually measured in terms of the word error rate on a reasonably large set of
test utterances. In our case, we test the different acoustic classifiers with the
Janus [8] recognizer on the first 30 seconds of each speaker in the official 1996
Switchboard evaluation test set, consisting of 366 utterances not present in the
training set.
acoustic classifier   # classes   # parameters   word error rate
HNN                   10000       2.0 M          37.3 %
ACID/HNN               6000       2.4 M          36.7 %
ACID/HNN              24000       3.1 M          33.3 %
[Plots vs. number of training utterances (87000, 174000, 261000): Normalized Log-Likelihood of X-Validation Set; Average Normalized Log-Likelihood of Node Classifiers; Average Normalized Classification Error of Node Classifiers.]
Fig. 15.9. Cross-Validation during training of 24k ACID/HNN architecture
The above results are competitive with those of state-of-the-art systems and
indicate a clear advantage of the ACID clustered over the pre-determined hi-
erarchical classifiers. We suspect that the reason for the better performance of automatically clustered hierarchies of neural networks is the difference in tree topology. Automatically clustered HNNs such as the presented ACID/HNN trees exhibit small and comparably uniform average branching factors that allow robust training of estimators of conditional posterior probabilities. In contrast, hand-
crafted hierarchies such as the 10k HNN tree contain nodes with rather large
branching factors. Fig. 15.10 shows the branching factors for all the networks in
the 10k tree structure.
[Plot: branching factor (number of local classes) for each of the networks in the 10k HNN; title: "Size of Individual Classification Problems in 10k System".]
Fig. 15.10. Branching Factors of Individual Nodes in 10k HNN
The largest observed branching factor in this tree was 276. This requires
the joint estimation of conditional posterior probabilities for as many as 276
classes which may result in rather poor approximations to the true posterior
probabilities for some of the networks in the tree.
Furthermore, the superior performance of both ACID/HNN classifiers over
the hand-crafted 10k tree demonstrates the full scalability of the hierarchical
approach and justifies the increase in the number of parameters. Earlier attempts
to train hand-crafted hierarchies for 24k classes failed to provide classifiers that
could be used as acoustic models in a speech recognizer. Poor approximations to
the real posterior probabilities led to instabilities in decoding when dividing by
priors in this case. Apart from that, we do not know of any other non-parametric
approach capable of directly and discriminatively estimating posterior probabil-
ities for such a large amount of classes.
15.5 Conclusions
We have presented and discussed a methodology for the estimation of poste-
rior probabilities for large numbers of classes using a hierarchical connectionist
framework. The aim of the paper is to demonstrate the necessity of hierarchical
approaches to modularize classification tasks in large-scale application domains
such as speech recognition, where thousands of classes have to be considered
and millions of training samples are available. The divide and conquer approach
proves to be a versatile tool in breaking down the complexity of the original
problem into many smaller tasks. Furthermore, agglomerative clustering tech-
niques can be applied to automatically impose a suitable hierarchical structure
on a given set of classes, even in the case this set contains tens of thousands
of classes. In contrast to the relatively small standard benchmarks for learning
machines, aspects such as choice of training method, model selection and gen-
eralization ability appear in different light when tackling large-scale probability
estimation problems.
References
[1] Baum, E.B., Haussler, D.: What Size Net Gives Valid Generalization? Neural
Computation 1, 151–160 (1989)
[2] Bishop, C.M.: Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation 7(1), 108–116 (1995)
[3] Bourlard, H., Morgan, N.: Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Press (1994)
[4] Bourlard, H., Morgan, N.: A Context Dependent Neural Network for Continuous
Speech Recognition. In: IEEE Proc. Intl. Conf. on Acoustics, Speech and Signal
Processing, San Francisco, CA, vol. 2, pp. 349–352 (1992)
[5] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regres-
sion Trees. Wadsworth International Group, Belmont (1984)
[6] Bridle, J.: Probabilistic Interpretation of Feed Forward Classification Network
Outputs, with Relationships to Statistical Pattern Recognition. In: Fogelman-
Soulie, F., Hérault, J. (eds.) Neurocomputing: Algorithms, Architectures, and
Applications. Springer, New York (1990)
[7] Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley & Sons,
Inc. (1973)
[8] Finke, M., Fritsch, J., Geutner, P., Ries, K., Zeppenfeld, T.: The JanusRTk
Switchboard/Callhome 1997 Evaluation System. In: Proceedings of LVCSR Hub5-
e Workshop, Baltimore, Maryland, May 13-15 (1997)
[9] Franco, H., Cohen, M., Morgan, N., Rumelhart, D., Abrash, V.: Context-
Dependent Connectionist Probability Estimation in a Hybrid Hidden Markov
Model Neural Net Speech Recognition System. Computer Speech and Lan-
guage 8(3), 211–222 (1994)
[10] Fritsch, J., Finke, M.: ACID/HNN: Clustering Hierarchies of Neural Networks for
Context-Dependent Connectionist Acoustic Modeling. In: Proceedings of Interna-
tional Conference on Acoustics, Speech and Signal Processing, Seattle, Wa (May
1998)
[11] Fritsch, J.: ACID/HNN: A Framework for Hierarchical Connectionist Acoustic
Modeling. In: Proceedings of IEEE Workshop on Automatic Speech Recognition
and Understanding, Santa Barbara, Ca (December 1997)
[12] Fritsch, J., Finke, M., Waibel, A.: Context-Dependent Hybrid HME/HMM Speech
Recognition using Polyphone Clustering Decision Trees. In: Intl. Conf. on Acous-
tics, Speech and Signal Processing, Munich, Germany, vol. 3, p. 1759 (1997)
[13] Fritsch, J.: Modular Neural Networks for Speech Recognition, Tech. Report CMU-
CS-96-203, Carnegie Mellon University, Pittsburgh, PA (1996)
[14] Hochberg, M.M., Cook, G.D., Renals, S.J., Robinson, A.J., Schechtman, R.S.: The
1994 ABBOT Hybrid Connectionist-HMM Large-Vocabulary Recognition System.
In: Spoken Language Systems Technology Workshop, pp. 170–176. ARPA (Jan-
uary 1995)
[15] Kershaw, D.J., Hochberg, M.M., Robinson, A.J.: Context-Dependent Classes
in a Hybrid Recurrent Network-HMM Speech Recognition System, Tech.
Rep. CUED/F-INFENG/TR217, Cambridge University Engineering Department,
Cambridge, England (1995)
[16] Merz, C.J., Murphy, P.M.: UCI Repository of Machine Learning Databases, Uni-
versity of California, Department of Information and Computer Science (1996),
http://www.ics.uci.edu/~mlearn/MLRepository.html
[17] Morgan, N., Bourlard, H.: Factoring Networks by a Statistical Method. Neural
Computation 4(6), 835–838 (1992)
[18] Morgan, N., Bourlard, H.: An Introduction to Hybrid HMM/Connectionist Continuous Speech Recognition. Signal Processing Magazine, 25–42 (May 1995)
[19] NIST, Conversational Speech Recognition Workshop, DARPA Hub-5E Evalua-
tion, May 13-15, Baltimore, Maryland (1997)
[20] Prechelt, L.: Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94, University of Karlsruhe, Germany (1994)
[21] Quinlan, J.R.: Induction of Decision Trees. Machine Learn. 1, 81–106 (1986)
[22] Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications
in Speech Recognition. Proceedings of the IEEE 77, 257–285 (1989)
[23] Safavian, S.R., Landgrebe, D.: A Survey of Decision Tree Classifier Methodology.
IEEE Transactions on Systems, Man and Cybernetics 21(3), 660–674 (1991)
[24] Schürmann, J., Doster, W.: A Decision Theoretic Approach to Hierarchical Clas-
sifier Design. Pattern Recognition 17(3), 359–369 (1984)
[25] Schürmann, J.: Pattern Classification: A Unified View of Statistical and Neural
Approaches. John Wiley & Sons, Inc., New York (1996)
[26] Tou, J.T., Gonzalez, R.C.: Pattern Recognition Principles. Addison-Wesley, Reading (1974)
[27] Young, S.: Large Vocabulary Continuous Speech Recognition: a Review. CUED
Technical Report, Cambridge University (1996)
Tricks for Time Series
Preface
In the last section we focus on tricks related to time series analysis and economic
forecasting. In chapter 16, John Moody opens with a survey of the challenges of
macroeconomic forecasting including problems such as noise, nonstationarities,
nonlinearities, and the lack of good a priori models. Lest one be discouraged,
descriptions of many possible neural network solutions are next presented includ-
ing hyperparameter selection (e.g. for regularization, training window length),
input variable selection, model selection (size and topology of network), better
regularizers, committee forecasts, and model visualization.
The survey is followed by a more detailed description of smoothing reg-
ularizers, model selection methods (e.g. prediction risk, nonlinear cross-
validation (NCV) (p. 357)) and sensitivity-based pruning (SBP) (p. 359)
for input selection. The goal of using regularizers is to introduce bias into the
model. But what is the “right bias”? Weight decay may be too ad hoc in that it
does not consider the nature of the function being learned. As an alternative,
the author presents several new smoothing regularizers for both feedforward and
recurrent networks that empirically are found to work better.
In model selection, prediction risk is used as the criterion for determin-
ing “best fits”. Several methods for estimating prediction risk are discussed and
compared: generalized cross-validation (GCV), Akaike’s final prediction error
(FPE), predicted squared error (PSE), and generalized prediction error (GPE)
which can be expressed in terms of the effective number of parameters.
In cross validation, separate networks are trained on different subsets of the
data to obtain estimates of the prediction risk. With nonlinear loss functions,
however, each network may converge to distinct local minima making compar-
ison difficult. NCV alleviates this problem by initializing each network to be
trained on a CV subset in the same minimum $w_0$ (obtained on the full training
set). This way the CV errors computed on the different subsets estimate the
prediction risk locally around this minimum $w_0$ and not by using some remote
local minima. SBP is used to select the “best subset” of input variables to use.
Here a sensitivity measure (e.g. delta error, average gradient, average absolute
gradient, RMS gradient) (p. 360) is used to measure the change in the training
error that would result if a given input is removed. Input variables are ranked
based on importance and, beginning with the least important, are pruned one
at a time, retraining in between. Finally, John Moody shows how sensitivity
measures can be examined visually over time to better understand the role of
each input (p. 361). Throughout, empirical results are presented for forecasting
the U.S. Index of Industrial Production.
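For concreteness, here is a minimal sketch of an SBP-style loop (not Moody's code; the sensitivity measure is a crude delta-error variant in which an input is clamped to its mean as a stand-in for removal):

```python
import numpy as np

def sbp_rank(model_error, X, y):
    """Rank inputs by delta-error sensitivity: the increase in training error when
    an input is clamped to its mean (a simple stand-in for removing it)."""
    base = model_error(X, y)
    scores = []
    for j in range(X.shape[1]):
        X_clamped = X.copy()
        X_clamped[:, j] = X[:, j].mean()
        scores.append(model_error(X_clamped, y) - base)
    return np.argsort(scores)          # least important inputs first

# usage sketch: prune one input at a time, retraining in between
# while X.shape[1] > k_min:
#     order = sbp_rank(trained_model_error, X, y)
#     X = np.delete(X, order[0], axis=1)
#     retrain(model, X, y)
```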
In chapter 17, Ralph Neuneier and Hans Georg Zimmermann describe in de-
tail their impressive integrated system with a lot of different tricks for neural
network training in the context of economic forecasting. The authors discuss all
design steps of their system, e.g. input preprocessing, cost functions, handling
of outliers, architecture, regularization techniques, learning techniques, robust
estimation, estimation of error bars, pruning techniques. As in chapter 1, the
tricks here are also highly interleaved, i.e. many tricks will not retain their full
efficiency if they are used individually and not en bloc. Let us start with the pre-
processing for which the authors use: squared inputs (see chapter 7), scaled
relative differences, scaled forces in form of scaled curvatures or mean re-
verting that can characterize turning points in time series (p. 370). To limit the
influence of outliers the previously mentioned inputs are transformed by
$\hat{x} = \tanh(wx),$
where the parameters w are learned as one layer in the training process (p.
371). Subsequently, a bottleneck network reduces the number of inputs to the
relevant ones (p. 372). Since networks for time series prediction are in general
bottom heavy, i.e. the input dimension is large while the output dimension is
very small, it is important to increase the number of output targets, so that
more useful error signals can be backpropagated (see also [3, 2]). For this the
authors introduce two output layers: (1) a point prediction layer, where not
only the value to be predicted, i.e. $y_{t+n}$, but also a number of neighboring values in time are used and (2) an interaction layer, where differences between these values, i.e. $y_{t+n+1} - y_{t+n}$, corresponding to curvature are employed (p. 375).
Following this interesting trick, several point predictions are averaged to reduce
the variance of the prediction, similar to bagging [1] (p. 377). This overall design
gives rise to an 11-layer neural network architecture, where the available prior
knowledge from financial forecasting is coded.
For training this large but rather constrained architecture, cost functions
are defined (p. 383) and a method based on the CDEN approach (p. 384) is
proposed for estimating the error bars of a forecast.
To train the network the so-called vario-η learning rule is introduced, which
is essentially a stochastic approximation of a Quasi-Newton algorithm with in-
dividual learning rates for each weight (p. 392). The authors discuss how the
simple pattern-by-pattern rule has structural consequences that improve gener-
alization behavior; or to put it differently: the stochasticity of learning implicitly
includes a curvature penalty on unreliable network parts.
Then the authors raise the provoking question of the Observer-Observer di-
lemma: to create a model based on observed data while, at the same time, using
this model to judge the correctness of new incoming data (p. 391). This leads
to (1) clearning and (2) training with noise on input data. The rationale behind
clearning is that a very noisy environment, such as financial data, will spoil a
good prediction if the data is taken too seriously. That is, we are allowed to move
the data point a little bit in input space in order to get smoother predictions
(p. 395). Similarly, different additive noise levels for each input are chosen by
an algorithm in order to adapt the noise level to the (estimated) importance
of the input: a small noise level is used for perfectly described or unimportant
inputs whereas a large noise level should be chosen for a poorly described (likely
noise corrupted) input (p. 397).
In the next step, pruning methods for optimizing the architecture are de-
scribed. They are: (1) node-pruning (p. 401) and (2) several types of weight-
pruning:(a)stochastic pruning (p. 401), (b) early-brain-damage pruning
(p. 402), (c) inverse-kurtosis pruning (p. 403), and (d) instability pruning
(p. 405). The authors use a combination of instability pruning and early-brain-
damage in their application; the first gives stable models and the latter generates
very sparse networks.
Finally, the whole set of tricks is combined into an integrated training process
and monitored on a validation set: early/late stopping, pruning and weight decay
regularization are alternated (p. 414) to obtain an excellent estimate of the
German bond rate from June 1994 to May 1996 (p. 415).
Jenny & Klaus
References
[1] Breiman, L.: Bagging predictors. Machine Learning 26(2), 123–140 (1996)
[2] Caruana, R., de Sa, V.R.: Promoting poor features to supervisors: Some inputs
work better as outputs. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances
in Neural Information Processing Systems, vol. 9, p. 389. The MIT Press (1997)
[3] Caruana, R., Pratt, L., Thrun, S.: Multitask learning. Machine Learning 28, 41
(1997)
16
Forecasting the Economy with Neural Nets:
A Survey of Challenges and Solutions
John Moody
Department of Computer Science & Engineering
Oregon Graduate Institute of Science & Technology
P.O. Box 91000, Portland, OR 97291, USA
moody@cse.ogi.edu
http://www.cse.ogi.edu/~moody/
Abstract. Macroeconomic forecasting is a very difficult task due to the
lack of an accurate, convincing model of the economy. The most accu-
rate models for economic forecasting, “black box” time series models,
assume little about the structure of the economy. Constructing reliable
time series models is challenging due to short data series, high noise
levels, nonstationarities, and nonlinear effects. This chapter describes
these challenges and presents some neural network solutions to them.
Important issues include balancing the bias/variance tradeoff and the
noise/nonstationarity tradeoff. A brief survey of methods includes hy-
perparameter selection (regularization parameter and training window
length), input variable selection and pruning, network architecture selec-
tion and pruning, new smoothing regularizers, committee forecasts and
model visualization. Separate sections present more in-depth descrip-
tions of smoothing regularizers, architecture selection via the generalized
prediction error (GPE) and nonlinear cross-validation (NCV), input se-
lection via sensitivity based pruning (SBP), and model interpretation and
visualization. Throughout, empirical results are presented for forecast-
ing the U.S. Index of Industrial Production. These demonstrate that,
relative to conventional linear time series and regression methods, supe-
rior performance can be obtained using state-of-the-art neural network
models.
16.1 Challenges of Macroeconomic Forecasting
Of great interest to forecasters of the economy is predicting the “business cycle”,
or the overall level of economic activity. The business cycle affects society as
a whole by its fluctuations in economic quantities such as the unemployment
rate (the misery index), corporate profits (which affect stock market prices), the
demand for manufactured goods and new housing units, bankruptcy rates, in-
vestment in research and development, investment in capital equipment, savings
rates, and so on. The business cycle also affects important socio-political factors
such as the general mood of the people and the outcomes of elections.
The standard measures of economic activity used by economists to track the
business cycle include the Gross Domestic Product (GDP) and the Index of In-
dustrial Production (IP). GDP is a broader measure of economic activity than is
IP. However, GDP is computed by the U.S. Department of Commerce on only a
quarterly basis, while Industrial Production is more timely, as it is computed and
published monthly. IP exhibits stronger cycles than GDP, and is therefore more in-
teresting and challenging to forecast. (See figure 16.1.) In this chapter, all empirical
results presented are for forecasting the U.S. Index of Industrial Production.
[Plot: "Index of Industrial Production: 1967 - 1993"; x-axis: Year (1970-1990); y-axis: Index Value (60-110).]
Fig. 16.1. U.S. Index of Industrial Production (IP) for the period 1967 to 1993. Shaded
regions denote official recessions, while unshaded regions denote official expansions.
The boundaries for recessions and expansions are determined by the National Bureau
of Economic Research based on several macroeconomic series. As is evident for IP,
business cycles are irregular in magnitude, duration, and structure.
Macroeconomic modeling and forecasting is challenging for several reasons:
Non-experimental Science: Like evolutionary biology and cosmology, macroe-
conomics is largely a non-experimental science. There is only one instance of the
world economy, and the economy of each country is not a closed system. Ob-
serving the state of an economy in aggregate is difficult, and it is generally not
possible to do controlled experiments.
No a priori Models: A convincing and accurate scientific model of business
cycle dynamics is not yet available due to the complexities of the economic
system, the impossibility of doing controlled experiments on the economy, and
non-quantifiable factors such as mass psychology and sociology that influence
economic activity. There are two main approaches that economists have used to
model the macroeconomy, econometric models and linear time series models:
Econometric Models: These models attempt to model the macroeconomy at
a relatively fine scale and typically contain hundreds or thousands of equa-
tions and variables. The model structures are chosen by hand, but model
parameters are estimated from the data. While econometric models are of
some use in understanding the workings of the economy qualitatively, they
are notoriously bad at making quantitative predictions.
Linear Time Series Models: Given the poor forecasting performance of econo-
metric models, many economists have resorted to analyzing and forecasting
economic activity by using the empirical “black box” techniques of standard
linear time series analysis. Such time series models typically have perhaps half
a dozen to a dozen input series. The most reliable and popular of these models
during the past decade or so have been bayesian vector autoregressive (BVAR)
models [22]. As we have found in our own work, however, neural networks can
often outperform standard linear time series models. The lack of an a priori
model of the economy makes input variable selection, the selection of lag struc-
tures, and network model selection critical issues.
Noise: Macroeconomic time series are intrinsically very noisy and generally have
poor signal to noise ratios. (See figures 16.2 and 16.3.) The noise is due both to
the many unobserved variables in the economy and to the survey techniques used
to collect data for those variables that are measured. The noise distributions are
typically heavy tailed and include outliers. The combination of short data series
and significant noise levels makes controlling model variance, model complexity,
and the bias / variance tradeoff important issues [9]. One measure of complexity
for nonlinear models is $P_{\mathrm{eff}}$, the effective number of parameters [24, 25]. $P_{\mathrm{eff}}$ can
be controlled to balance bias and variance by using regularization and model
selection techniques.
Nonstationarity: Due to the evolution of the world’s economies over time,
macroeconomic series are intrinsically nonstationary. To confound matters, the
definitions of many macroeconomic series are changed periodically as are the
techniques employed in measuring them. Moreover, estimates of key series are
periodically revised retroactively as better data are collected or definitions are
changed. Not only do the underlying dynamics of the economy change with time,
but the noise distributions for the measured series vary with time also. In many
cases, such nonstationarity shortens the usable length of the data series, since
training on older data will induce biases in predictions. The combination of noise
and nonstationarity gives rise to a noise / nonstationarity tradeoff [23], where
using a short training window results in too much model variance or estimation
error due to noise in limited training data, while using a long training window
results in too much model bias or approximation error due to nonstationarity.
Nonlinearity: Traditional macroeconomic time series models are linear [12, 14].
However, recent work by several investigators have suggested that nonlinearities
can improve macroeconomic forecasting models in some cases [13, 27, 39, 35, 40].
[Plots: Log_2(IP) and its 1, 3, 6, 9, and 12 month future returns over the period 1950-1990.]
Fig. 16.2. The U.S. Index of Industrial Production and five return series (rates of
change measured as log differences) for time scales of 1, 3, 6, 9, and 12 months. These
return series served as the prediction targets for the standard Jan 1950 - Dec 1979 / Jan
1980 - Jan 1990 benchmark results reported in [27]. The difficulty of the prediction task
is evidenced by the poor signal to noise ratios and erratic behavior of the target series.
For the one month returns, the performance of our neural network predictor in table 1
suggests that the SNR is around 0.2. For all returns series, significant nonstationarities
and deviations from normality of the noise distributions are present.
(See table 16.1 and figures 16.2 and 16.3.) Based upon our own experience, the
degree of nonlinearity captured by neural network models of macroeconomic
series tends to be mild [27, 20, 38, 42, 28, 45]. Due to the high noise levels
and limited data, simpler models are favored. This makes reliable estimation of
nonlinearities more difficult.
16.2 A Survey of Neural Network Solutions
We have been investigating a variety of algorithms for neural network model
selection that go beyond the vanilla neural network approach.¹

¹ We define a vanilla neural network to be a fully connected, two-layer sigmoidal
network with a full set of input variables and a fixed number of hidden units that is
trained on a data window of fixed length with backprop and early stopping using a
validation set. No variable selection, pruning, regularization, or committee techniques
are used.
[Figure 16.3: twelve panels, each plotted over 1950-1990, showing DLEAD and its component series HSBP, FM2D82, FSPCOM, LPHRM, LUINC, MOCM82, MPCON8, IVPAC, HHSNTN, IVCUD8, and PSM99S.]
Fig. 16.3. The U.S. Index of Leading Indicators (DLEAD) and its 11 component series
as currently defined. The Leading Index is a key tool for forecasting business cycles.
The input variables for the IP forecasting models included transformed versions of
DLEAD and several of its components [27]. The difficulty of macroeconomic forecasting
is again evident, due to the high noise levels and erratic behaviors of DLEAD and its
components. (Note that the component series included in DLEAD have been changed
several times during the past 47 years. The labels for the various series are those defined
in Citibase: HSBP denotes housing starts, FM2D82 is M2 money supply, FSPCOM is
the Standard & Poors 500 stock index, and so on.)
The goal of this work is to construct models with minimal prediction risk (expected
test set error). The techniques that we are developing and testing are described below.
Given the brief nature of this survey, I have not attempted to provide an ex-
haustive list of the many relevant references in the literature.
Hyperparameter Selection: Hyperparameters are parameters that appear in
the training objective function, but not in the network itself. Examples include
the regularization parameter, the training window length, and robust scale pa-
rameters. Examples of varying the regularization parameter and the training
window length for a 12 month IP forecasting model are shown in figures 16.4
and 16.5. Varying the regularization parameter trades off bias and variance,
while varying the training window length trades off noise and nonstationarity.
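A minimal sketch of such a two-dimensional hyperparameter search is given below (Python/NumPy). The forecasting model itself is hidden behind a hypothetical fit_and_score routine, which is not part of the chapter; only the grid over weight decay values and training window lengths follows the text, and the five-year validation hold-out is a choice made for this example.

```python
import numpy as np

def hyperparameter_grid_search(inputs, targets, fit_and_score,
                               weight_decays=(0.01, 0.05, 0.15, 0.5, 1.0),
                               window_years=(5, 7, 10, 15, 20),
                               months_per_year=12):
    """Grid search over the two hyperparameters discussed in the text:
    the regularization (weight decay) parameter and the training window length.
    `fit_and_score(X_tr, y_tr, X_val, y_val, weight_decay)` is a user-supplied
    routine that trains one network and returns its validation error."""
    n_val = 5 * months_per_year                 # hold out the most recent 5 years
    X_val, y_val = inputs[-n_val:], targets[-n_val:]

    best_err, best_cfg = np.inf, None
    for wd in weight_decays:
        for win in window_years:
            n_train = win * months_per_year
            # Training window: the n_train months immediately preceding the validation set.
            X_tr = inputs[-(n_val + n_train):-n_val]
            y_tr = targets[-(n_val + n_train):-n_val]
            err = fit_and_score(X_tr, y_tr, X_val, y_val, weight_decay=wd)
            if err < best_err:
                best_err, best_cfg = err, {"weight_decay": wd, "window_years": win}
    return best_err, best_cfg
```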
Table 16.1. Comparative summary of normalized prediction errors for rates of return
on Industrial Production for the period January 1980 to January 1990 as presented
in [27]. The four model types were trained on data from January 1950 to December
1979. The neural network models significantly outperform the trivial predictors and
linear models. For each forecast horizon, the normalization factor is the variance of the
target variable for the training period. Nonstationarity in the IP series makes the test
errors for the trivial predictors larger than 1.0. In subsequent work, we have obtained
substantially better results for the IP problem [20, 38, 42, 28, 45].
Prediction    Trivial              Univariate AR(14)    Multivariate           Sigmoidal Nets
Horizon       (Average of          Model                Linear Reg.            w/ PC Pruning
(Months)      Training Set)        (Iterated Pred.)     (Direct Pred.)         (Direct Pred.)
---------------------------------------------------------------------------------------------
  1           1.04                 0.90                 0.87                   0.81
  2           1.07                 0.97                 0.85                   0.77
  3           1.09                 1.07                 0.96                   0.75
  6           1.10                 1.07                 1.38                   0.73
  9           1.10                 0.96                 1.38                   0.67
 12           1.12                 1.23                 1.20                   0.64
Input Variable Selection and Pruning: Selecting an informative set of input
variables and an appropriate representation for them (“features”) is critical to
the solution of any forecasting problem. The variable selection and representa-
tion problem is part of the overall model selection problem. Variable selection
procedures can be either model-independent or model-dependent. The Delta
Test, a model independent procedure, is a nonparametric statistical algorithm
that selects meaningful predictor variables by direct examination of the data set
[36]. Other model-independent techniques make use of the mutual information
[4, 5, 46] or joint mutual information [46]. Sensitivity-based pruning (SBP) tech-
niques are model-dependent algorithms that prune unnecessary or harmful input
variables from a trained network [33, 30, 25, 42, 21]. Sensitivity based pruning
methods are described in greater detail in section 16.4.6.
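As one concrete, model-independent screening step, the sketch below ranks candidate inputs by an estimate of their mutual information with the prediction target. It uses scikit-learn's mutual_info_regression as a readily available estimator; this is a stand-in for illustration, not the Delta Test or the specific estimators of [4, 5, 46].

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def rank_inputs_by_mutual_information(X, y, names):
    """Model-independent screening: rank candidate input series by their
    estimated mutual information with the target series.
      X : (N, d) matrix of candidate input variables
      y : (N,) prediction target
      names : list of d variable names"""
    mi = mutual_info_regression(X, y)        # one score per candidate input
    order = np.argsort(mi)[::-1]             # most informative first
    return [(names[i], float(mi[i])) for i in order]
```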
Model Selection and Pruning: A key technique for controlling the bias / vari-
ance tradeoff for noisy problems is to select the size and architecture of the net-
work. For two-layer networks, this includes selecting the number of internal units,
choosing a connectivity structure, and pruning unneeded nodes, weights, or weight
matrix eigennodes. A constructive algorithm for selecting the number of internal
units is sequential network construction (SNC) [2, 30, 25]. Techniques for pruning
weights and internal nodes include sensitivity-based pruning methods like optimal
brain damage (OBD) [19] and optimal brain surgeon (OBS) [15]. Our recently-
proposed supervised principal components pruning (PCP) method [20] prunes
weight matrix eigennodes, rather than weights. Since PCP does not require train-
ing to a local minimum, it can be used with early stopping. It has computational
advantages over OBS, and can outperform OBD when input variables or hidden
node activities are noisy and correlated. Figure 16.6 shows reductions in prediction
Fig. 16.4. Example of the Noise / Nonstationarity Tradeoff and selection of the best
training window, in this case 10 years [38, 28]. The longer training windows of 15 and
20 years yield higher test set error due to the model bias induced by nonstationarity.
The shorter training windows of 5 and 7 years have significantly higher errors due
to model variance resulting from noise in the data series and smaller data sets. The
test errors correspond to models trained with the best regularization parameter 0.15
indicated in figure 16.5.
errors obtained by using PCP on a set of IP forecasting models. Section 16.4 de-
scribes the model selection problem and the use of estimates of prediction risk such
as nonlinear cross-validation (NCV) to guide the selection process in greater detail.
Better Regularizers: Introducing biases in a model via regularization or prun-
ing reduces model variance and can thus reduce prediction risk. Prediction risk
can be best minimized by choosing appropriate biases. One such set of biases
are smoothing constraints. We have proposed new classes of smoothing regular-
izers for both feedforward and recurrent networks [29, 45] that often yield better
performance than the standard weight decay approach. These are described in
greater detail in section 16.3.
Committee Forecasts: Due to the extremely noisy nature of economic time
series, the control of forecast variance is a critical issue. One approach for re-
ducing forecast variance is to average the forecasts of a committee of models.
Researchers in economics have studied and used combined estimators for a long
time, and generally find that they outperform their component estimators and
that unweighted averages tend to outperform weighted averages, for a variety of
weighting methods [12, 44, 6]. Reductions of prediction error variances obtained
by unweighted committee averaging for a selection of different IP forecasting
models are shown in figure 16.7.
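A minimal sketch of unweighted committee averaging and of the normalized error metric used in figure 16.7 is shown below; the trained member networks and their forecasts are assumed to be given.

```python
import numpy as np

def committee_forecast(member_predictions):
    """Unweighted committee average: each row of `member_predictions` holds the
    forecasts of one trained network over the test period. Averaging across
    members reduces forecast variance."""
    member_predictions = np.asarray(member_predictions)
    return member_predictions.mean(axis=0)

def normalized_mse(forecast, target, training_target_variance):
    """NMSE: mean squared error normalized by the variance of the target
    variable over the training period (as in table 16.1 and figure 16.7)."""
    return np.mean((np.asarray(forecast) - np.asarray(target)) ** 2) / training_target_variance
```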
Model Interpretation and Visualization: It is important not only to be able to
make accurate forecasts, but to also understand what factors influence the forecasts
that are made. This can be accomplished via the sensitivity analyses described in
sections 16.4.6 and 16.4.8 and the visualization tool presented in section 16.4.8.
Fig. 16.5. Example of the effect of regularization (weight decay) parameter on test
error [38, 28]. The five curves are for training windows of length 5, 7, 10, 15, and 20
years. The Bias / Variance Tradeoff is clearly evident in all the curves; the minimum
test set errors occur for weight decay parameters of order 0.1. Larger errors due to bias
occur for larger weight decay coefficients, while larger errors due to model variance
occur for smaller values of the coefficient.
16.3 Smoothing Regularizers for Better Generalization
Introducing biases in a model via regularization or pruning reduces model variance
and can thus reduce prediction risk (see also chapters 2-6). Prediction risk can be
best minimized by choosing appropriate biases. Quadratic weight decay [37, 18, 17],
the standard approach to regularization used in the neural nets community, is an
ad hoc function of the network weights. Weight decay is ad hoc in the sense that it
imposes direct constraints on the weights independent of the nature of the function
being learned or the parametrization of the network model. A more principled ap-
proach is to require that the function f(W, x) learned by the network be smooth.
This can be accomplished by penalizing the m-th order curvature of f(W, x). The
regularization or penalty functional is then the smoothing integral

    S(W, m) = \int d^d x \; \Omega(x) \left\| \frac{d^m f(W, x)}{dx^m} \right\|^2 ,    (16.1)
where Ω(x) is a weighting function and ‖·‖ denotes the Euclidean tensor norm.²
Since numerical computation of (16.1) generally requires expensive Monte
Carlo integrations and is therefore impractical during training, we have derived
algebraically simple approximations and bounds to S(W, m) for feedforward
² The relation of this type of smoothing functional to radial basis functions has been
studied by [10]. However, the approach developed in that work does not extend to
standard feedforward sigmoidal networks, which are a special case of projective basis
function networks (PBF's).
Fig. 16.6. Prediction errors for two sets of neural network models for 12 month returns
for IP, with (dotted line) and without (solid line) Supervised Principal Components
Pruning (PCP) [20]. Each data point is the mean error for 11 nets, while the error bars
represent one standard deviation. Statistically significant improvements in prediction
performance are obtained for the 6, 9, and 12 month prediction horizons by using
the PCP algorithm to reduce the network complexities. While techniques like optimal
brain damage and optimal brain surgeon prune weights from the network, PCP reduces
network complexity and hence model variance by pruning eigennodes of the weight
matrices. Unlike the unsupervised use of principal components, PCP removes those
eigennodes that yield the greatest reduction in estimated prediction error.
networks that can be easily evaluated at each training step [29]. For these new
classes of algebraically simple m-th order smoothing regularizers for networks
of projective basis functions (PBF's)

    f(W, x) = \sum_{j=1}^{N} u_j \, g\!\left( x^T v_j + v_{j0} \right) + u_0 , \qquad W = (u, v) ,

with general transfer functions g[·], the regularizers are:
    R_G(W, m) = \sum_{j=1}^{N} u_j^2 \, \|v_j\|^{2m-1} \qquad \text{(Global Form)}

    R_L(W, m) = \sum_{j=1}^{N} u_j^2 \, \|v_j\|^{2m} \qquad \text{(Local Form).}
Our empirical experience shows that these new smoothing regularizers typically
yield better prediction accuracies than standard weight decay.
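For illustration, the two penalties can be evaluated directly from the weights of a two-layer PBF network, as in the sketch below; the array shapes and the way the penalty is added to a squared-error loss are assumptions made for this example, not details taken from [29].

```python
import numpy as np

def smoothing_regularizers(u, V, m=1):
    """Global and local m-th order smoothing regularizers for a PBF network
    f(W, x) = sum_j u_j g(x^T v_j + v_j0) + u_0.
      u : (N,)   output weights u_j (excluding the bias u_0)
      V : (N, d) input weight vectors v_j (excluding the biases v_j0)"""
    vnorm = np.linalg.norm(V, axis=1)                  # ||v_j||
    r_global = float(np.sum(u ** 2 * vnorm ** (2 * m - 1)))
    r_local = float(np.sum(u ** 2 * vnorm ** (2 * m)))
    return r_global, r_local

# Typical use during training (lambda_ is the regularization parameter):
#   loss = mean_squared_error + lambda_ * r_local
```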
In related work, we have derived an algebraically-simple regularizer for re-
current nets [45]. This regularizer can be viewed as a generalization of the first
order Tikhonov stabilizer (the m = 1 local form above) to dynamic models. For
two layer networks with recurrent connections described by

    Y(t) = g\!\left( A\, Y(t - \tau) + V X(t) \right) , \qquad \hat{Z}(t) = U\, Y(t) ,
[Figure 16.7: ranges of normalized error for the models m1.85, m1.89, m9.80, m9.84, m9.88, m12.80, m12.83, and m2.83, shown without committees and with committees of 10 (average of 100 committees / 1000 nets).]
Fig. 16.7. Reduction in error variance for prediction of the U.S. Index of Industrial
Production by use of combining forecasts (or committees) [38, 28]. Abscissa points
are various combinations of prediction horizon and test period. For example, “m12.80”
denotes networks trained to make 12 month forecasts on the ten years prior to 1979
and tested by making true ex ante forecasts on the year 1980. Performance metric is
normalized mean square error (NMSE) computed over the particular year. All training
sets have length 10 years. For each point, bars show range of values for either 1000
individual models, or 100 committees of 10. The individual networks each have three
sigmoidal internal units, one linear output, and typically a dozen or so input variables
selected by the δ-test from an initial set of 48 candidate variables.
the training criterion with the regularizer is

    E = \frac{1}{N} \sum_{t=1}^{N} \left\| Z(t) - \hat{Z}(W, I(t)) \right\|^2 + \lambda \, \rho_\tau^2(W) ,

where W = {U, V, A} is the network parameter set, Z(t) are the targets, I(t) =
{X(s), s = 1, 2, ..., t} represents the current and all historical input information,
N is the size of the training data set, ρ_τ²(W) is the regularizer, and λ is a
regularization parameter. The closed-form expression for the regularizer for time-
lagged recurrent networks is:

    \rho_\tau(W) = \frac{\gamma \|U\| \|V\|}{1 - \gamma \|A\|} \left( 1 - e^{\frac{\gamma \|A\| - 1}{\tau}} \right) ,

where ‖·‖ is the Euclidean matrix norm and γ is a factor which depends upon
the maximal value of the first derivatives of the internal unit activations g(·).
Simplifications of the regularizer can be obtained for simultaneous recurrent
nets (τ → 0 ), two-layer feedforward nets, and one layer linear nets. We have
Fig. 16.8. Regularization parameter vs. normalized prediction errors for the task of
predicting the one month rates of change of the U.S. Index of Industrial Production
[45]. The example given is for a recurrent network trained with standard weight decay
(left) or with the new recurrent smoothing regularizer (right). For standard weight
decay, the optimal regularization parameter is 0.03 corresponding to a test error of
0.734. For the new smoothing regularizer, the optimal regularization parameter which
leads to the least validation error is 0.8 corresponding to a test error of 0.646. The new
recurrent regularizer thus yields a 12% reduction in test error relative to that obtained
using quadratic weight decay.
successfully tested this regularizer in a number of case studies and found that
it performs better than standard quadratic weight decay. A comparison of this
recurrent regularizer to quadratic weight decay for 1 month forecasts of IP is
shown in figure 16.8.
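A sketch of how the closed-form penalty might be evaluated from the weight matrices is given below. It assumes tanh internal units (so γ = 1, the maximal derivative of tanh) and interprets the Euclidean matrix norm as the operator 2-norm; both choices, and the treatment of the case γ‖A‖ ≥ 1, are assumptions made for this example.

```python
import numpy as np

def recurrent_smoothing_penalty(U, V, A, tau, gamma=1.0):
    """Closed-form smoothing regularizer rho_tau(W) for a time-lagged recurrent
    network Y(t) = g(A Y(t - tau) + V X(t)), Zhat(t) = U Y(t).
    gamma = 1 corresponds to tanh internal units; the 'Euclidean matrix norm'
    is taken here to be the operator 2-norm (largest singular value)."""
    nU = np.linalg.norm(U, 2)
    nV = np.linalg.norm(V, 2)
    nA = np.linalg.norm(A, 2)
    if gamma * nA >= 1.0:
        # Outside the regime in which the closed form is well behaved.
        return np.inf
    return (gamma * nU * nV) / (1.0 - gamma * nA) * \
           (1.0 - np.exp((gamma * nA - 1.0) / tau))

# Training criterion from the text:  E = MSE + lambda_ * rho**2
```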
16.4 Model Selection and Interpretation
In this section, we provide a more in-depth discussion of several issues and tech-
niques for neural network model selection, including the problem of selecting
inputs. We describe techniques for selecting architectures via estimates of the
prediction risk, especially the generalized prediction error (GPE) and nonlinear
cross-validation (NCV). We present sensitivity-based pruning (SBP) methods for
selecting input variables, and demonstrate the use of these methods for predict-
ing the U.S. Index of Industrial Production. Finally, we discuss some approaches
to model interpretation and visualization that enable an understanding of eco-
nomic relationships.
16.4.1 Improving Forecasts via Architecture and Input Selection
For the discussion of architecture selection in this paper, we focus on the most
widely used neural network architecture, the two-layer perceptron (or backprop-
agation) network. The response function for such a network with architecture λ
having I_λ input variables, H_λ internal (hidden) neurons, and a single output is:

    \hat{\mu}_\lambda(x) = h\!\left( u_0 + \sum_{j=1}^{H_\lambda} u_j \, g\!\left( v_{j0} + \sum_{i=1}^{I_\lambda} v_{ji} \, x_i \right) \right) .    (16.2)

Here, h and g are typically sigmoidal nonlinearities, the v_{ji} and v_{j0} are input
weights and thresholds, the u_j and u_0 are the output weights and threshold, and
the index λ is an abstract label for the specific two layer perceptron network
architecture. While we consider for simplicity this restricted class of perceptron
networks in this section, our approach can be easily generalized to networks with
multiple outputs and multiple layers.
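For concreteness, the response function (16.2) can be written in a few lines of NumPy; the argument names and the default linear output h are choices made for this sketch (the empirical models discussed later in the chapter use a linear output unit).

```python
import numpy as np

def two_layer_response(x, u0, u, v0, V, g=np.tanh, h=lambda a: a):
    """Response function of eq. (16.2) for a single input pattern x.
      u0 : output threshold            u : (H,) output weights u_j
      v0 : (H,) hidden thresholds v_j0 V : (H, I) input weights v_ji
    g is the hidden-unit nonlinearity; h is the output nonlinearity."""
    hidden = g(v0 + V @ x)       # activations of the H_lambda hidden units
    return h(u0 + u @ hidden)    # single network output
```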
For two layer perceptrons, the architecture selection problem is to find a good,
near-optimal architecture λ for modeling a given data set. The architecture λ is
characterized by the number of hidden units H_λ, the subset of input variables I_λ,
and the subset of weights u_j and v_{ji} that are non-zero. If all of the u_j and v_{ji} are
non-zero, the network is referred to as fully connected. Since an exhaustive search
over the space of possible architectures is impossible, the procedure for selecting
this architecture requires a heuristic search. See Figure 16.9 for examples of
heuristic search strategies and [25] and [30] for additional discussion.
Fig. 16.9. Heuristic Search Strategies: After selecting the number of hidden units
H_λ, the input removal and weight elimination can be carried out in parallel (A) or
sequentially (B). In (B), the selection of the number of hidden units and removal of
inputs may be iterated (dashed line).
In this section, we focus on selecting the “best subset” of input variables for pre-
dicting the U.S. Index of Industrial Production. In order to avoid an exhaustive
search over the exponentially-large space of architectures obtained by considering
all possible combinations of inputs, we employ a directed search strategy using the
sensitivity-based input pruning (SBP) algorithm (see section 16.4.6).
16.4.2 Architecture Selection via the Prediction Risk
The notion of “best fit” can be captured via an objective criterion, such as max-
imum a posteriori probability (MAP), minimum Bayesian information criterion
(BIC), minimum description length (MDL), or generalization ability. The gener-
alization ability can be defined precisely as the prediction risk P_λ, the expected
performance of an estimator in predicting new observations. In this section,
we use the prediction risk as our selection criterion for two reasons. First, it
is straightforward to compute, and second, it provides more information than
MAP, BIC, or MDL, since it tells us how much confidence to put in predictions
produced by our best model.
Consider a set of observations D = {(x_j, t_j); j = 1 ... N} that are assumed to
be generated as t_j = μ(x_j) + ε_j, where μ(x) is an unknown function, the inputs
x_j are drawn independently with an unknown stationary probability density
function p(x), the ε_j are independent random variables with zero mean (ε̄ = 0)
and variance σ², and the t_j are the observed target values. The learning or
regression problem is to find an estimate μ̂_λ(x; D) of μ(x) given the data set D
from a class of predictors or models μ_λ(x) indexed by λ. In general, λ ∈ Λ =
(S, A, W), where S ⊆ X denotes a chosen subset of the set of available input
variables X, A is a selected architecture within a class of model architectures A,
and W are the adjustable parameters (weights) of architecture A.
The prediction risk P_λ (defined above) can be approximated by the expected
performance on a finite test set. P_λ can be defined for a variety of loss functions.
For the special case of squared error, it is:

    P_\lambda = \int dx \; p(x) \left[ \mu(x) - \hat{\mu}(x) \right]^2 + \sigma^2    (16.3)

        \approx E\!\left\{ \frac{1}{N} \sum_{j=1}^{N} \left( t_j - \hat{\mu}_\lambda(x_j) \right)^2 \right\}    (16.4)

where (x_j, t_j) are new observations that were not used in constructing μ̂_λ(x).
In what follows, we shall use P_λ as a measure of the generalization ability of a
model. Our strategy is to choose an architecture λ in the model space Λ which
minimizes an estimate of the prediction risk P_λ.
16.4.3 Estimation of Prediction Risk
Since it is not possible to exactly calculate the prediction risk P_λ given only
a finite sample of data, we have to estimate it. The restriction of limited data
makes the model selection and prediction risk estimation problems more difficult.
This is the typical situation in economic forecasting, where the time series are
short.
A limited training set results in a more severe bias/variance (or underfitting
vs overfitting) tradeoff (see e.g. [9]), so the model selection problem is both more
challenging and more crucial. In particular, it is easier to overfit a small training
set, so care must be taken to select a model that is not too large. Also, limited
data sets make prediction risk estimation more difficult if there is not enough
data available to hold out a sufficiently large independent test sample. In such
situations, one must use alternative approaches which enable the estimation of
prediction risk from the training data, such as data resampling and algebraic esti-
mation techniques. Data resampling methods include nonlinear refinements of ν
fold cross–validation (NCV) and bootstrap estimation, while algebraic estimates
(in the regression context) include Akaike’s final prediction error (FPE) [1], for
linear models, and the recently proposed generalized prediction error (GPE) for
nonlinear models [31, 24, 25], which is identical to the independently-derived
network information criterion [34]. For comprehensive discussions of prediction
risk estimation, see [8, 16, 43, 25].
16.4.4 Algebraic Estimates of Prediction Risk
Predicted Squared Error for Linear Models. For linear regression models
with the squared error loss function, a number of useful algebraic estimates for
the prediction risk have been derived. These include the well known generalized
cross–validation (GCV) [7, 11] and Akaike’s final prediction error (FPE) [1]
formulas:
    GCV_\lambda = ASE_\lambda \, \frac{1}{\left( 1 - \frac{Q_\lambda}{N} \right)^{2}}
    \qquad\qquad
    FPE_\lambda = ASE_\lambda \, \frac{1 + \frac{Q_\lambda}{N}}{1 - \frac{Q_\lambda}{N}} .    (16.5)
Q_λ denotes the number of weights of model λ (ASE_λ denotes the average squared
error). Note that although GCV and FPE are slightly different for small sample
sizes, they are asymptotically equivalent for large N:

    GCV_\lambda \approx FPE_\lambda \approx ASE_\lambda \left( 1 + 2 \, \frac{Q_\lambda}{N} \right)    (16.6)
A more general expression of predicted squared error (PSE) is:
    PSE_\lambda = ASE_\lambda + 2 \hat{\sigma}^2 \, \frac{Q_\lambda}{N} ,    (16.7)

where σ̂² is an estimate of the noise variance in the data. Estimation strategies for
(16.7) and its statistical properties have been analyzed by [3]. FPE is obtained
as a special case of PSE by setting σ̂² = ASE_λ / (N − Q_λ). See [8, 16, 43] for
tutorial treatments.
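The three algebraic estimates translate directly into code; the sketch below assumes the average squared training error ASE_λ, the number of weights Q_λ, and the sample size N are already known.

```python
def gcv(ase, n_weights, n_samples):
    """Generalized cross-validation estimate, eq. (16.5)."""
    return ase / (1.0 - n_weights / n_samples) ** 2

def fpe(ase, n_weights, n_samples):
    """Akaike's final prediction error, eq. (16.5)."""
    return ase * (1.0 + n_weights / n_samples) / (1.0 - n_weights / n_samples)

def pse(ase, n_weights, n_samples, noise_var=None):
    """Predicted squared error, eq. (16.7). If no noise variance estimate is
    supplied, the FPE choice sigma^2 = ASE / (N - Q) is used."""
    if noise_var is None:
        noise_var = ase / (n_samples - n_weights)
    return ase + 2.0 * noise_var * n_weights / n_samples
```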
It should be noted that PSE, FPE and GCV are asymptotically unbiased
estimates of the prediction risk for the neural network models considered here
under certain conditions. These are: (1) the noise ε_j in the observed targets t_j is
independent and identically distributed, (2) the resulting model is unbiased, (3)
weight decay is not used, and (4) the nonlinearity in the model can be neglected.
For PSE, we further require that an asymptotically unbiased estimate of σ̂² is
used. In practice, however, essentially all neural network fits to data will be
biased and/or have significant nonlinearity.
Although PSE, FPE and GCV are asymptotically unbiased only under the
above assumptions, they are much cheaper to compute than NCV since no re-
training is required.
Generalized Prediction Error (GPE) for Nonlinear Models. The pre-
dicted squared error PSE, and therefore the final prediction error FPE, are
special cases of the generalized prediction error GPE [31, 24, 25]. We present an
abbreviated description here.
GPE estimates the prediction risk for biased nonlinear models which may
use general loss functions and include regularizers such as weight decay. The
algebraic form is

    GPE_\lambda \equiv E_{\lambda\,\mathrm{train}} + \frac{2}{N} \, \mathrm{tr}\, \hat{V} \hat{G}_\lambda ,    (16.8)
where E_{λ train} is the training set error (average value of loss function on training
set), V̂ is a nonlinear generalization of the estimated noise covariance matrix
of the observed targets, and Ĝ_λ is the estimated generalized influence matrix, a
nonlinear analog of the standard influence or hat matrix.
GPE can be expressed in an equivalent form as:
    GPE_\lambda = E_{\lambda\,\mathrm{train}} + 2 \hat{\sigma}^2_{\mathrm{eff}} \, \frac{\hat{Q}_{\lambda\,\mathrm{eff}}}{N} ,    (16.9)
where Q̂_eff ≡ tr Ĝ is the estimated effective number of model parameters, and
σ̂²_eff ≡ (tr V̂ Ĝ)/(tr Ĝ) is the estimated effective noise variance in the data. For
nonlinear and/or regularized models, Q̂_{λ eff} is generally not equal to the number
of weights Q_λ.
When the noise in the target variables is assumed to be independent with
uniform variance and the squared error loss function is used, (16.9) simplifies to:
    GPE_\lambda = ASE_\lambda + 2 \hat{\sigma}^2 \, \frac{\hat{Q}_{\lambda\,\mathrm{eff}}}{N} .    (16.10)
Note that replacing Q̂_{λ eff} with Q_λ gives the expression for PSE. Various other
special cases of (16.8) and (16.10) have been derived by other authors and can
be found in [8, 16, 43]. GPE was independently derived by [34], who called it
the Network Information Criterion (NIC).
16.4.5 NCV: Cross-Validation for Nonlinear Models
Cross-validation (CV) is a sample re–use method for estimating prediction risk;
it makes maximally efficient use of the available data. A perturbative refinement
of CV for nonlinear models is called nonlinear cross-validation (NCV) [25, 30].
Fig. 16.10. A nonlinear model can have many local minima in the error function. Each
local minimum w_i, w_j and w_k (solid curve) corresponds to a different set of parameters
and thus to a different model. Training on a different finite sample of data or retraining
on a subsample, as in nonlinear cross-validation, gives rise to a slightly different error
curve (dashed) and perturbed minima w′_i, w′_j and w′_k. Variations due to data sampling
in error curves and their minima are termed model variance.
Let the data D be divided into ν randomly selected disjoint subsets D_j of
roughly equal size: ∪_{j=1}^{ν} D_j = D and, for i ≠ j, D_i ∩ D_j = ∅. Let N_j denote the
number of observations in subset D_j. Let μ̂_{λ(D_j)}(x) be an estimator trained on
all data except for (x, t) ∈ D_j. Then, the cross-validation average squared error
for subset j is defined as

    CV_{D_j}(\lambda) = \frac{1}{N_j} \sum_{(x_k, t_k) \in D_j} \left( t_k - \hat{\mu}_{\lambda(D_j)}(x_k) \right)^2 .    (16.11)
These are averaged over j to obtain the ν-fold cross-validation estimate of pre-
diction risk:
    CV(\lambda) = \frac{1}{\nu} \sum_{j} CV_{D_j}(\lambda) .    (16.12)
Typical choices for ν are 5 and 10. Leave–one–out CV is obtained in the limit
ν = N . CV is a nonparametric estimate of the prediction risk that relies only
on the available data.
The frequent occurrence of multiple minima in nonlinear models (see Fig-
ure 16.10), each of which represents a different predictor, requires a refinement
of the cross-validation procedure. This refinement, nonlinear cross-validation
(NCV), is illustrated in Figure 16.11 for ν =5.
A network is trained on the entire data set D to obtain a model μ̂_λ(x) with
weights W_0. These weights are used as the starting point for the ν-fold cross–
validation procedure. Each subset D_j is removed from the training data in turn.
The network is re-trained using the remaining data starting at W_0 (rather than
using random initial weights). Under the assumption that deleting a subset from
the training data does not lead to a large difference in the locally-optimal weights,
the retraining from W_0 “perturbs” the weights to obtain W_i, i = 1 ... ν. The
Cross-Validation error computed for the “perturbed models” μ̂_{λ(D_j)}(x) thus es-
timates the prediction risk for the model with locally-optimal weights W_0 as
desired, and not the performance of other predictors at other local minima.
Fig. 16.11. Illustration of the computation of 5–fold nonlinear cross-validation
(NCV). First, the network is trained on all data to obtain weights w_0, which are used
as the starting point for the cross-validation. Each data subset D_i, i = 1 ... 5 is removed
from the training data D in turn. The network is trained, starting at w_0, using the
remaining data. This “perturbs” the weights to obtain w_i. The test error of the “per-
turbed model” w_i is computed on the hold-out sample D_i. The average of these errors
is the 5-fold CV estimate of the prediction risk for the model with weights w_0.
If the network were trained from random initial weights for each subset,
it could converge to a different minimum corresponding to W_i different from the
one corresponding to W_0. This would correspond to a different model. Thus,
starting from W_0 assures us that the cross-validation estimates the prediction
risk for the particular model in question corresponding to W ≈ W_0.
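The following sketch illustrates the NCV idea using scikit-learn's MLPRegressor: the network is first fit on all data to obtain W_0, and each fold then continues training from a copy of those weights via warm_start. This is an illustrative stand-in, not the authors' implementation, and the hyperparameter values are placeholders.

```python
import copy
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import KFold

def nonlinear_cross_validation(X, y, nu=5, hidden_units=3, alpha=0.1, seed=0):
    """nu-fold NCV sketch: train once on all data to obtain weights W0, then for
    each fold copy the fitted network and continue training (warm start) on that
    fold's training portion, so each retrained net stays near the minimum W0."""
    base = MLPRegressor(hidden_layer_sizes=(hidden_units,), activation='tanh',
                        alpha=alpha, max_iter=2000, random_state=seed)
    base.fit(X, y)                                   # weights W0

    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=nu, shuffle=True,
                                     random_state=seed).split(X):
        net = copy.deepcopy(base)                    # start from W0
        net.set_params(warm_start=True, max_iter=200)
        net.fit(X[train_idx], y[train_idx])          # perturbs W0 -> W_i
        resid = y[test_idx] - net.predict(X[test_idx])
        fold_errors.append(np.mean(resid ** 2))      # CV_{D_j}(lambda), eq. (16.11)
    return float(np.mean(fold_errors))               # eq. (16.12)
```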
16.4.6 Pruning Inputs via Directed Search and Sensitivity Analysis
Selecting a “best subset” of input variables is a critical part of model selection
for forecasting. This is especially true when the number of available input series
is large, and exhaustive search through all combinations of variables is com-
putationally infeasible. Inclusion of irrelevant variables not only does not help
prediction, but can reduce forecast accuracy through added noise or systematic
bias. A model-dependent method for input variable selection is sensitivity-based
pruning (SBP) [41, 32, 30, 25]. Extensions to this approach are presented in [21].
With this algorithm, candidate architectures are constructed by evaluating the
effect of removing an input variable from the fully connected network. These are
ranked in order of increasing training error. Inputs are then removed following a
“Best First” strategy, i.e. selecting the input that, when removed, increases the
training error least.
The SBP algorithm computes a sensitivity measure S_i to evaluate the change
in training error that would result if input x_i were removed from the network.
One such sensitivity measure is the delta error, defined as:
    \text{Delta Error (DE):} \qquad S_i = \frac{1}{N} \sum_{j=1}^{N} S_{ij}    (16.13)

where S_{ij} is the sensitivity computed for exemplar x_j. Since there are usually
many fewer inputs than weights, a direct evaluation of S_i is feasible:

    S_{ij} = SE(\bar{x}_i, w_\lambda) - SE(x_{ij}, w_\lambda) , \qquad \bar{x}_i = \frac{1}{N} \sum_{j=1}^{N} x_{ij}    (16.14)

S_i measures the effect on the training squared error (SE) of replacing the i-th
input x_i by its average x̄_i for all exemplars (replacement of a variable by its
average value removes its influence on the network output).
Note that in computing S_i, no retraining is done in evaluating SE(x̄_i, w_λ).
Also note that it is not sufficient to just set x_{ij} = 0 ∀ j, because the value of the
bias of each hidden unit was determined during training and would not be offset
properly by setting the input arbitrarily to zero. Of course, if the inputs are
normalized to have zero mean prior to training, then setting an input variable
to zero is equivalent to replacing it by its mean.
If S_i is large, the network error will be significantly increased by replacing
the i-th input variable with its mean. If S_i is small, we need the help of other
measures to decide whether the i-th input variable is useful. Three additional
sensitivity measures [21] can be computed based on perturbing an input or a
hidden variable and monitoring network output variations:

    \text{Average Gradient (AG):} \qquad S_i = \frac{1}{N} \sum_{j=1}^{N} \frac{\partial f^{(j)}}{\partial x_i} ,

    \text{Average Absolute Gradient (AAG):} \qquad S_i = \frac{1}{N} \sum_{j=1}^{N} \left| \frac{\partial f^{(j)}}{\partial x_i} \right| ,

    \text{RMS Gradient (RMSG):} \qquad S_i = \sqrt{ \frac{1}{N} \sum_{j=1}^{N} \left( \frac{\partial f^{(j)}}{\partial x_i} \right)^{2} } ,
where for ease of notation we define f(x) ≡ μ̂_λ(x) and f^{(j)} ≡ f(x_{1j}, ..., x_{ij}, ..., x_{dj})
is the network output given the j-th input data pattern.
These three sensitivity measures together offer useful information. If S_i^{AG} is
positive and large, then on average the change of the direction of the network
output f is the same as that of the i-th input variable. If S_i^{AG} is negative and
has large magnitude, the change of the direction of f on average is opposite to
that of the i-th input variable. When S_i^{AG} is close to zero, we can not get much
information from this measure. If S_i^{AAG} is large, the output f is sensitive to the
i-th input variable; if S_i^{AAG} is small, f is not sensitive to the i-th input variable.
If S_i^{RMSG} is very different from S_i^{AAG}, the i-th input series could be very noisy
and have a lot of outliers.
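A sketch of the four measures is given below for a generic trained model f. The delta error follows (16.13)-(16.14); the gradient-based measures are approximated here by central finite differences rather than backpropagated derivatives, which is an implementation choice made for this example.

```python
import numpy as np

def sbp_sensitivities(f, X, t, eps=1e-4):
    """Sensitivity measures of section 16.4.6 for a trained model.
      f : callable mapping an input matrix (N, d) to network outputs (N,)
      X : (N, d) training inputs;  t : (N,) training targets
    Returns per-input arrays: DE (delta error), AG, AAG and RMSG."""
    N, d = X.shape
    se_full = (t - f(X)) ** 2                    # per-exemplar squared error
    de = np.empty(d); ag = np.empty(d); aag = np.empty(d); rmsg = np.empty(d)
    for i in range(d):
        X_mean = X.copy()
        X_mean[:, i] = X[:, i].mean()            # replace x_i by its average (no retraining)
        de[i] = np.mean((t - f(X_mean)) ** 2 - se_full)   # eqs. (16.13)-(16.14)

        X_plus, X_minus = X.copy(), X.copy()
        X_plus[:, i] += eps; X_minus[:, i] -= eps
        grad = (f(X_plus) - f(X_minus)) / (2 * eps)       # df^{(j)}/dx_i per exemplar
        ag[i] = grad.mean()
        aag[i] = np.abs(grad).mean()
        rmsg[i] = np.sqrt(np.mean(grad ** 2))
    return de, ag, aag, rmsg
```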
16.4.7 Empirical Example
As described in [42], we construct neural network models for predicting the rate
of change of the U.S. Index of Industrial Production (IP). The prediction horizon
for the IP results presented here is 12 months. Following previous work [27, 20],
the results reported here use networks with three sigmoidal units and a single
linear output unit.
The data set consists of monthly observations of IP and other macroeconomic
and financial series for the period from January 1950 to December 1989. The
data set thus has a total of 480 exemplars. Input series are derived from around
ten raw time series, including IP, the Index of Leading Indicators, the Standard
& Poors 500 Index, and so on. Both the “unfiltered” series and various “filtered”
versions are considered for inclusion in the model, for a total of 48 possible input
variables. The target series and all 48 candidate input series are normalized to
zero mean and unit standard deviation. Figures 16.12 and 16.13 show the results
of the sensitivity analysis for the case where the training-set consists of 360
exemplars randomly chosen from the 40 year period; the remaining 120 monthly
observations constitute the test-set.
Local optima for the number of inputs are found at 15 on the FPE curve and
13 on the NCV curve. Due to the variability in the FPE and NCV estimates
(readily apparent in figure 16.13 for NCV), we favor choosing the first good
local minimum for these curves rather than a slightly better global minimum.
This local minimum for NCV corresponds to a global minimum for the test
error. Choosing it leads to a reduction of 35 (from 48 to 13) in the number of
input series and a reduction in the number of network weights from 151 to 46.
Inclusion of additional input variables, while decreasing the training error, does
not improve the test-set performance.
This empirical example demonstrates the effectiveness of the sensitivity-based
pruning (SBP) algorithm guided by an estimate of prediction risk, such as the
nonlinear cross-validation (NCV) algorithm, for selecting a small subset of input
variables from a large number of available inputs. The resulting network mod-
els exhibit better prediction performances, as measured by either estimates of
prediction risk or errors on actual test sets, than models that make use of all 48
input series.
16.4.8 Gaining Economic Understanding through Model
Visualization
Although this chapter has focussed on data-driven time series models for eco-
nomic forecasting, it is possible to extract information from these models about
Fig. 16.12. Sensitivity-Based Pruning (SBP) method for selecting a subset of input
variables for a neural net forecasting model [42]. The original network was trained on all
48 input variables to predict the 12 month percentage changes in Industrial Production
(IP). The variables have been ranked in order of decreasing importance according to
a sensitivity measure. The input variables are pruned one-by-one from the network;
at each stage, the network is retrained. The figure shows four curves: the Training
Error, Akaike Final Prediction Error (FPE), Nonlinear Cross-Validation Error (NCV)
[30, 25], and the actual Test Error. NCV is used as a selection criterion and suggests
that only 13 of the variables should be included. NCV predicts the actual test error
quite well relative to FPE.
the structure of the economy. The sensitivity analyses presented above in sec-
tion 16.4.6 provide a global understanding about which inputs are important for
predicting quantities of interest, such as the business cycle. Further information
can be gained, however, by examining the evolution of sensitivities over time
[21].
Sensitivity analysis performed for an individual exemplar provides information
about which input features play an important role in producing the current
prediction. Two sensitivity measures for individual exemplars can be defined as:
    \text{Delta Output (DO):} \qquad S_i = \Delta f_i^{(j)} \equiv f(x_{1j}, \ldots, x_{ij}, \ldots, x_{dj}) - f(x_{1j}, \ldots, \bar{x}_i, \ldots, x_{dj}) ,

    \text{Output Gradient (OG):} \qquad S_i = \frac{\partial f^{(j)}}{\partial x_{ij}} ,

where as above we define f^{(j)} ≡ f(x_{1j}, ..., x_{ij}, ..., x_{dj}). If S_i^{DO} or S_i^{OG} is large,
then the i-th variable plays an important role in making the current prediction,
Fig. 16.13. Sensitivity Input Pruning for IP (12 month prediction horizon). The figure
illustrates the spread in test-set error for each of the 10 subsets used to calculate NCV
(denoted by circles). The NCV error is the average of these test-set errors.
and slightly changing the value of the variable may cause a large change in the
network output. Figure 16.14 gives an example of a graphical display of the
individual exemplar sensitivities.
Using this graphical display, we can observe which input variables play im-
portant roles in producing the current forecast, or which input variables, when
we change them, can significantly increase or decrease the forecast error. We
can also observe how the roles of different input variables change through time.
For example, in Figure 16.14, for exemplars with indices from 56 to 60, the third
input variable has large negative sensitivity measures. Starting with the 65th ex-
emplar, the sixth input variable starts to play an important role, and this lasts
until the 71st exemplar. This kind of display provides insight into the dynamics
of the economy, as learned from the data by the trained neural network model.
16.5 Discussion
In concluding this brief survey of the algorithms for improving forecast accu-
racy with neural networks, it is important to note that many other potentially
useful techniques have been proposed (see also chapter 17). Also, the empirical
results presented herein are intended to be illustrative, rather than definitive.
Further work on both the algorithms and forecasting models may yield additional
improvements. As a final comment, I would like to emphasize that given the dif-
ficulty of macroeconomic forecasting, no single technique for reducing prediction
risk is sufficient for obtaining optimal performance. Rather, a combination of
techniques is required.
Fig. 16.14. Sensitivity analysis results for individual exemplars for a 10 input model
for predicting the U.S. Index of Industrial Production [21]. Black and gray represent
negative and positive respectively. The size of a rectangle represents the magnitude of
a value. The monthly time index changes along the horizontal direction. The indices of
input variables are plotted vertically. This type of graph shows which input variables are
important for making forecasts at various points in time. This enables an understanding
of economic relationships.
Acknowledgements. The author wishes to thank Todd Leen, Asriel Levin,
Yuansong Liao, Hong Pi, Steve Rehfuss, Thorsteinn Rögnvaldsson, Matthew
Saffell, Joachim Utans, Lizhong Wu and Howard Yang for their many contribu-
tions to this research. This chapter is an expanded version of [26]. This work was
supported at OGI by ONR/ARPA grants N00014-92-J-4062 and N00014-94-1-
0071, NSF grants CDA-9309728 and CDA-9503968, and at Nonlinear Prediction
Systems by DARPA contracts DAAH01-92-CR361 and DAAH01-96-CR026.
References
[1] Akaike, H.: Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203–
217 (1970)
[2] Ash, T.: Dynamic node creation in backpropagation neural networks. Connection
Science 1(4), 365–375 (1989)
[3] Barron, A.: Predicted squared error: a criterion for automatic model selection. In:
Farlow, S. (ed.) Self–Organizing Methods in Modeling. Marcel Dekker, New York
(1984)
[4] Battiti, R.: Using mutual information for selecting features in supervised neural
net learning. IEEE Trans. on Neural Networks 5(4), 537–550 (1994)
[5] Bonnlander, B.: Nonparametric selection of input variables for connectionist learn-
ing. Technical report, PhD Thesis. Department of Computer Science, University
of Colorado (1996)
[6] Clemen, R.T.: Combining forecasts: A review and annotated bibliography. Inter-
national Journal of Forecasting (5), 559–583 (1989)
[7] Craven, P., Wahba, G.: Smoothing noisy data with spline functions: Estimating
the correct degree of smoothing by the method of generalized cross-validation.
Numer. Math. 31, 377–403 (1979)
[8] Eubank, R.L.: Spline Smoothing and Nonparametric Regression. Marcel Dekker,
Inc. (1988)
[9] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Computation 4(1), 1–58 (1992)
[10] Girosi, F., Jones, M., Poggio, T.: Regularization theory and neural network archi-
tectures. Neural Computation 7, 219–269 (1995)
[11] Golub, G., Heath, H., Wahba, G.: Generalized cross validation as a method for
choosing a good ridge parameter. Technometrics 21, 215–224 (1979)
[12] Granger, C.W.J., Newbold, P.: Forecasting Economic Time Series, 2nd edn. Aca-
demic Press, San Diego (1986)
[13] Granger, C.W.J., Terasvirta, T.: Modelling Nonlinear Economic Relationships.
Oxford University Press (1993)
[14] Hamilton, J.D.: Time Series Analysis. Princeton University Press (1994)
[15] Hassibi, B., Stork, D.G.: Second order derivatives for network pruning: Optimal
brain surgeon. In: Hanson, S.J., Cowan, J.D., Giles, C.L. (eds.) Advances in Neural
Information Processing Systems, vol. 5, pp. 164–171. Morgan Kaufmann Publish-
ers, San Mateo (1993)
[16] Hastie, T.J., Tibshirani, R.J.: Generalized Additive Models. Monographs on Statis-
tics and Applied Probability, vol. 43. Chapman and Hall (1990)
[17] Hoerl, A.E., Kennard, R.W.: Ridge regression: applications to nonorthogonal prob-
lems. Technometrics 12, 69–82 (1970)
[18] Hoerl, A.E., Kennard, R.W.: Ridge regression: biased estimation for nonorthogo-
nal problems. Technometrics 12, 55–67 (1970)
[19] LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretzky, D.S.
(ed.) Advances in Neural Information Processing Systems, vol. 2. Morgan Kauf-
mann Publishers (1990)
[20] Levin, A.U., Leen, T.K., Moody, J.E.: Fast pruning using principal components.
In: Cowan, J., Tesauro, G., Alspector, J. (eds.) Advances in Neural Information
Processing Systems, vol. 6. Morgan Kaufmann Publishers, San Francisco (1994)
[21] Liao, Y., Moody, J.E.: A neural network visualization and sensitivity analysis
toolkit. In: Amari, S.I., Xu, L., Chan, L.-W., King, I., Leung, K.-S. (eds.) Pro-
ceedings of the International Conference on Neural Information Processing, pp.
1069–1074. Springer-Verlag Singapore Pte. Ltd. (1996)
[22] Litterman, R.B.: Forecasting with Bayesian vector autoregressions – five years of
experience. Journal of Business and Economic Statistics 4(1), 25–38 (1986)
[23] Moody, J.: Challenges of Economic Forecasting: Noise, Nonstationarity, and Non-
linearity. Invited talk presented at Machines that Learn., Snowbird Utah (April
1994)
[24] Moody, J.: The effective number of parameters: an analysis of generalization and
regularization in nonlinear learning systems. In: Moody, J.E., Hanson, S.J., Lipp-
mann, R.P. (eds.) Advances in Neural Information Processing Systems, vol. 4, pp.
847–854. Morgan Kaufmann Publishers, San Mateo (1992)
[25] Moody, J.: Prediction risk and neural network architecture selection. In:
Cherkassky, V., Friedman, J.H., Wechsler, H. (eds.) From Statistics to Neural
Networks: Theory and Pattern Recognition Applications. Springer (1994)
[26] Moody, J.: Macroeconomic Forecasting: Challenges and Neural Network Solutions.
In: Proceedings of the International Symposium on Artificial Neural Networks,
Hsinchu, Taiwan (1995) (Invited keynote address)
[27] Moody, J., Levin, A., Rehfuss, S.: Predicting the U.S. index of industrial produc-
tion. Neural Network World 3(6), 791–794 (1993); Special Issue: Proceedings of
Parallel Applications in Statistics and Economics 1993
[28] Moody, J., Rehfuss, S., Saffell, M.: Macroeconomic forecasting with neural net-
works (1999) (manuscript in preparation)
[29] Moody, J., Rögnvaldsson, T.: Smoothing regularizers for projective basis function
networks. In: Proceedings of Advances in Neural Information Processing Systems,
NIPS 1996, vol. 9. MIT Press, Cambridge (1997)
[30] Moody, J., Utans, J.: Architecture selection strategies for neural networks: Appli-
cation to corporate bond rating prediction. In: Refenes, A.N. (ed.) Neural Net-
works in the Capital Markets. John Wiley & Sons (1994)
[31] Moody, J.E.: Note on generalization, regularization and architecture selection in
nonlinear learning systems. In: Juang, B.H., Kung, S.Y., Kamm, C.A. (eds.) Neu-
ral Networks for Signal Processing, pp. 1–10. IEEE Signal Processing Society
(1991)
[32] Moody, J.E., Utans, J.: Principled architecture selection for neural networks: Ap-
plication to corporate bond rating prediction. In: Moody, J.E., Hanson, S.J., Lipp-
mann, R.P. (eds.) Advances in Neural Information Processing Systems, vol. 4, pp.
683–690. Morgan Kaufmann Publishers, San Mateo (1992)
[33] Mozer, M.C., Smolensky, P.: Skeletonization: A technique for trimming the fat
from a network via relevance assessment. In: Touretzky, D.S. (ed.) Advances in
Neural Information Processing Systems, vol. 1. Morgan Kaufmann Publishers, San
Mateo (1990)
[34] Murata, N., Yoshizawa, S., Amari, S.: Network information criterion determining
the number of hidden units for an artificial neural network model. IEEE Transac-
tions on Neural Networks 5(6), 865–872 (1994)
[35] Natter, M., Haefke, C., Soni, T., Otruba, H.: Macroeconomic forecasting using
neural networks. In: Neural Networks in the Capital Markets 1994 (1994)
[36] Pi, H., Peterson, C.: Finding the embedding dimension and variable dependencies
in time series. Neural Computation, 509–520 (1994)
[37] Plaut, D., Nowlan, S., Hinton, G.: Experiments on learning by back propagation.
Technical Report CMU-CS-86-126, Dept. of Computer Science, Carnegie-Mellon
University, Pittsburgh, Pennsylvania (1986)
[38] Rehfuss, S.: Macroeconomic forecasting with neural networks (1994) (unpublished
simulations)
[39] Rehkugler, H., Zimmermann, H.G. (eds.): Neuronale Netze in der Ökonomie. Ver-
lag Vahlen (1994)
[40] Swanson, N.R., White, H.: A model selection approach to real-time macroeconomic
forecasting using linear models and artificial neural networks. Discussion paper,
Department of Economics, Pennsylvania State University (1995)
[41] Utans, J., Moody, J.: Selecting neural network architectures via the prediction
risk: Application to corporate bond rating prediction. In: Proceedings of the First
International Conference on Artificial Intelligence Applications on Wall Street.
IEEE Computer Society Press, Los Alamitos (1991)
[42] Utans, J., Moody, J., Rehfuss, S.: Selecting input variables via sensitivity analysis:
Application to predicting the U.S. business cycle. In: Proceedings of Computa-
tional Intelligence in Financial Engineering. IEEE Press (1995)
[43] Wahba, G.: Spline models for observational data. CBMS-NSF Regional Conference
Series in Applied Mathematics (1990)
[44] Winkler, R.L., Makridakis, S.: The combination of forecasts. Journal of Royal
Statistical Society (146) (1983)
[45] Wu, L., Moody, J.: A smoothing regularizer for feedforward and recurrent net-
works. Neural Computation 8(2) (1996)
[46] Yang, H., Moody, J.: Input variable selection based on joint mutual information.
Technical report, Department of Computer Science, Oregon Graduate Institute
(1998)
17
How to Train Neural Networks
Ralph Neuneier and Hans Georg Zimmermann
Siemens AG, Corporate Technology, D-81730 München, Germany
{Ralph.Neuneier,Georg.Zimmermann}@mchp.siemens.de
http://w2.siemens.de/zfe_nn/homepage.html
Abstract. The purpose of this paper is to give guidance in neural
network modeling. Starting with the preprocessing of the data, we dis-
cuss different types of network architecture and show how these can be
combined effectively. We analyze several cost functions to avoid unstable
learning due to outliers and heteroscedasticity. The Observer - Obser-
vation Dilemma is solved by forcing the network to construct smooth
approximation functions. Furthermore, we propose some pruning algo-
rithms to optimize the network architecture. All these features and tech-
niques are linked up to a complete and consistent training procedure (see
figure 17.25 for an overview), such that the synergy of the methods is
maximized.
17.1 Introduction
The use of neural networks in system identification or regression tasks is often
motivated by the theoretical result that in principle a three layer network can
approximate any structure contained in a data set [14]. Consequently, the char-
acteristics of the available data determine the quality of the resulting model. The
authors believe that this is a misleading point of view, especially, if the amount
of useful information that can be extracted from the data is small. This situation
arises typically for problems with a low signal to noise ratio and a relatively small
training data set at hand. Neural networks are such a rich class of functions, that
the control of the optimization process, i. e. the learning algorithm, pruning, ar-
chitecture, cost functions and so forth, is a central part of the modeling process.
The statement that “The neural network solution is not better than [any classi-
cal] method” has been used too often in order to describe the results of neural
network modeling. At any rate, the assessment of this evaluation presupposes a
precise knowledge of the procedure involved in the neural network solution that
has been achieved. This is the case because a great variety of additional features
and techniques can be applied at the different stages of the process to prevent all
the known problems like overfitting and sensitivity to outliers concerning neural
networks. Due to the lack of a general recipe, one can often find a statement
declaring that the quality of the neural network model depends strongly on the
person who generated the model, which is usually perceived as negative. In con-
trast, we consider the additional features an outstanding advantage of neural
networks compared to classical methods, which typically do not allow such a
sophisticated control of their optimization. The aim of our article is to provide a
set of techniques to efficiently exploit the capabilities of neural networks. More
important, these features will be combined in such a way that we will achieve a
maximal effect of synergy by their application.
First, we begin with the preprocessing of the data and define a network archi-
tecture. Then, we analyze the interaction between the data and the architecture
(Observer - Observation Dilemma) and discuss several pruning techniques to
optimize the network topology. Finally, we conclude by combining the proposed
features into a unified training procedure (see fig. 17.25 in section 17.8).
Most of the paper is relevant to nonlinear regression in general. Some consid-
erations are focussed on time series modeling and forecasting. All the proposed
approaches have been tested on diverse tasks we have to solve for our clients,
e. g. forecasting financial markets. The typical problem can be characterized
by a relatively small set of very noisy data and a high dimensional input vector
to cover the complexity of the underlying dynamical system. The paper gives
an overview of the unified training procedure we have developed to solve such
problems.
17.2 Preprocessing
Besides the obvious scaling of the data (in the following abbreviated by scale(·)),
which transforms the different time series such that each series has a mean value
of zero and a statistical variance of one, some authors have proposed complicated
preprocessing functions. In the field of financial forecasting, these functions are
often derived from technical analysis in order to capture some of the underlying
dynamics of the financial markets (see [37] and [25] for some examples).
After many experiments with real data, we have settled with the following
simple transformations. If the original time series which has been selected as an
input, is changing very slowly with respect to the prediction horizon, i. e. there is
no clearly identifiable mean reverting equilibrium, then an indicator for the iner-
tia and an information of the driving force has been proven to be very informa-
tive. The inertia can be described by a momentum (relative change, eq. 17.1) and
the force by the acceleration of the time series (normalized curvature, eq. 17.2).
If we have a prediction horizon of n steps into the future the original time series
x_t is transformed in the following way:

    \text{momentum:} \qquad \tilde{x}_t = \mathrm{scale}\!\left( \frac{x_t - x_{t-n}}{x_{t-n}} \right) ,    (17.1)

    \text{force:} \qquad \hat{x}_t = \mathrm{scale}\!\left( \frac{x_t - 2 x_{t-n} + x_{t-2n}}{x_{t-n}} \right) .    (17.2)
In eq. 17.1, the relative difference is computed to eliminate exponential trends
which, for example, may be caused by inflationary influences. Using only the pre-
processing functions of eq. 17.1 typically leads to poor models which only follow
obvious trends. The forces, i. e. the transformations by eq. 17.2, are important
to characterize the turning points of the time series.
A time series may be fast in returning to its equilibrium state after new infor-
mation has entered the market, as is the case for most prices of goods and stock
rates. In this case, we substitute eq. 17.2 by a description of the forces which drive
the price back to the estimated equilibrium. A simple way to estimate the underly-
ing price equilibrium is to take the average over some past values of the time series.
Instead of using the relative difference between the estimate and the current value,
we look at the difference between the equilibrium and the past value, which lies
in the middle of the averaging window. In our example, this is formulated as
    \hat{x}_t = \mathrm{scale}\!\left( \frac{x_{t-n} - \frac{1}{2n+1} \sum_{\tau=0}^{2n} x_{t-\tau}}{x_{t-n}} \right) .    (17.3)
Note that in eq. 17.3 we use x_{t-n} instead of the present value x_t, leading to
an estimation of the equilibrium centered around x_{t-n}. Since, in fact, we are
. Since, in fact, we are
interested in measuring the tension between a price level and the underlying
market equilibrium, this estimation offers a more appropriate description at the
cost of ignoring the newest possible point information. This concept, known as
mean reverting dynamics in economics, is analogous to the behavior of a pendulum
in physics.
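The three transformations (17.1)-(17.3) are straightforward to implement; the sketch below pads the first observations with NaN so that the transformed series keep the original time index, which is a convention chosen for this example.

```python
import numpy as np

def scale(z):
    """Standardize a series to zero mean and unit variance (the scale(.) of the text)."""
    z = np.asarray(z, dtype=float)
    return (z - np.nanmean(z)) / np.nanstd(z)

def momentum(x, n):
    """Eq. (17.1): scaled relative change over the prediction horizon n."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    out[n:] = (x[n:] - x[:-n]) / x[:-n]
    return scale(out)

def force(x, n):
    """Eq. (17.2): scaled normalized curvature (discrete acceleration)."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    out[2 * n:] = (x[2 * n:] - 2 * x[n:-n] + x[:-2 * n]) / x[n:-n]
    return scale(out)

def mean_reverting_force(x, n):
    """Eq. (17.3): tension between x_{t-n} and an equilibrium estimated as the
    average over the 2n+1 values x_{t-2n}, ..., x_t (centered on x_{t-n})."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    for t in range(2 * n, len(x)):
        equilibrium = np.mean(x[t - 2 * n: t + 1])    # 2n+1 values
        out[t] = (x[t - n] - equilibrium) / x[t - n]
    return scale(out)
```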
17.3 Architectures
We will present several separate architectural building blocks (figures 17.1 to
17.7), which will finally be combined into a unified neural network architecture
for time series analysis (figure 17.8). Most of the structural elements can be used
in general regression.
17.3.1 Net Internal Preprocessing by a Diagonal Connector
In order to have potentially very different inputs on a comparable scale, we
standardize the input data to zero mean and unit variance. A problem with
this common approach is that outliers in the input can have a large impact.
This is particularly serious for data that contain large shocks, as in finance and
economics. To avoid this problem, we propose an additional, net internal (short
for network internal) preprocessing of the inputs by scaling them according to
$$x' = \tanh(w\,x)\,. \qquad (17.4)$$
The implementation of this preprocessing layer is shown in fig. 17.1. This layer
has the same number of hidden units as the input layer and uses standard tanh
squashing functions. The particular weight matrix between the input layer and
the preprocessing layer is only a square diagonal matrix.
For short term forecasts (e. g. modeling daily returns of stock markets), we
typically initialize the weights with the value of 0.1 to ensure that the tanh is
in its linear range, which in turn ensures that the external inputs pass through
essentially unchanged. If interested in long term models (e. g. six month forecast
horizon), we prefer to start with an initial value of 1. The reason is that monthly
data are typically more contaminated by “non-economical” effects like political
shocks. The large initial weight values eliminate such outliers from the beginning.
The weights in the diagonal connector are restricted to be positive to avoid
fluctuations of the sign of $x'$ during training. This constraint keeps eq. 17.4 as
a monotonic transformation with the ability to limit outliers. No bias should be
used for the preprocessing layer to prevent numerical ambiguities. We found that
these additional constraints improve the training stability of the preprocessing
layer.
These weights will be adapted during training in the same way as all the other
weights within the network. In practice, we observe both growing and shrinking
values for the weights. Growing values cause a larger proportion of the input
range to be compressed by the squashing function of the tanh. Very small values
of diagonal elements indicate the option to prune the corresponding input.
Fig. 17.1. Net internal preprocessing (a diagonal connector with weights $w_1, \dots, w_n$ between n input nodes and n hidden nodes) to limit the influence of outliers and to eliminate unimportant inputs
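A minimal sketch of such a diagonal preprocessing layer, assuming plain stochastic gradient descent and a hypothetical class name of our own, could look as follows:

    import numpy as np

    class DiagonalPreprocessing:
        # net internal preprocessing x' = tanh(w * x) of eq. 17.4 with a diagonal,
        # positive weight vector w and no bias
        def __init__(self, n_inputs, init=0.1):
            # init = 0.1 for short term models, 1.0 for long term models (see text)
            self.w = np.full(n_inputs, init)

        def forward(self, x):
            return np.tanh(self.w * x)

        def backward(self, x, grad_out, lr=0.01):
            # element-wise gradient because the connector is a diagonal matrix
            grad_w = grad_out * (1.0 - np.tanh(self.w * x) ** 2) * x
            self.w -= lr * grad_w
            self.w = np.maximum(self.w, 0.0)    # keep the diagonal weights positive

    layer = DiagonalPreprocessing(n_inputs=4)
    print(layer.forward(np.array([0.3, -1.5, 8.0, 0.0])))  # the outlier at 8.0 is limited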
17.3.2 Net Internal Preprocessing by a Bottleneck Network
It is an old idea to use a bottleneck network to shrink the dimensionality of the
input vector. This technique was not only used to build encoders [1], but also to
perform a principal component analysis [26]. Using a bottleneck as an internal
preprocessing building block within a larger network offers the opportunity for
us to compress the input information. In addition, the tanh squashing function
of the bottleneck layer acts as a limiter of the outliers in the data.
In a first attempt, one may implement a bottleneck architecture as an input-
hidden-output-layer sub-network using the inputs also as targets on the output
level. An additional connector from the hidden layer is connected to the remain-
ing parts of the network. This design allows the compression of input signals,
but it implies two major disadvantages. If we apply input pruning to
the original inputs the decompression becomes disordered. Furthermore, adding
noise to the inputs becomes more complex because the disturbed inputs have
also to be used as targets at the output layer.
A different design of the bottleneck sub-network which avoids these diffi-
culties is indicated in fig. 17.2. On the left side of fig. 17.2 we have the
typical compressor-decompressor network. The additional connection is a frozen
identity matrix, which duplicates the input layer to the output layer. Since the
target for this layer is set to zero, the output of the compressor-decompressor
network will adopt the negative values of the inputs in order to compensate
for the identical copy of the inputs. Even though the sign of the input signals
has been reversed, this circumstance does not imply any consequence for the
information flow through the bottleneck layer. This architecture allows an ap-
propriate elimination of inputs as well as the disturbance of the input signals by
artificial noise.
Fig. 17.2. Net internal preprocessing cluster using a bottleneck (a compressor-decompressor sub-network with n input and n output nodes, plus a frozen identity connector id)
In our experiments, we observed that it is possible to train the bottleneck in
parallel with the rest of the network. As a consequence, the bottleneck can be
shrunk as long as the error on the training set remains at the same level as it
would without a compression. As will be described in section 17.6 it is important
to use a dynamically controlled input noise as a regularization method during
learning. The architecture of fig. 17.2 allows the combination of noise and net
internal preprocessing by data compression.
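The following toy sketch (our own illustrative NumPy code, assuming a one-layer compressor and a linear decompressor) mimics the construction of fig. 17.2: the frozen identity adds the raw input to the decompressor output, and the combined output is trained towards a zero target.

    import numpy as np

    rng = np.random.default_rng(0)
    n, k, T = 10, 3, 200                        # input dimension, bottleneck size, patterns
    X = rng.standard_normal((T, n))

    W_enc = 0.1 * rng.standard_normal((k, n))   # compressor into the bottleneck
    W_dec = 0.1 * rng.standard_normal((n, k))   # decompressor back to n nodes
    lr = 0.02

    for epoch in range(300):
        for x in X:
            h = np.tanh(W_enc @ x)              # bottleneck activations
            out = W_dec @ h + x                 # frozen identity connector adds the input
            err = out                           # target of the output cluster is zero
            W_dec -= lr * np.outer(err, h)
            W_enc -= lr * np.outer((W_dec.T @ err) * (1 - h ** 2), x)

    # after training, W_dec @ tanh(W_enc @ x) approximates -x as well as the
    # k-dimensional bottleneck allows, i.e. the sign-reversed compressed inputs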
17.3.3 Squared Inputs
Following the suggestion of G. Flake in [10], we also supply the network with
the squared values of the inputs which leads to an integration of global and local
decision making. The processing of the original inputs and the squared inputs
within a tanh activation function enables the network to act as a combination of
widely-known neural networks using sigmoidal activation function (MLP) and
radial basis function networks (RBF). The output of a typical three-layer MLP
is
$$y = \sum_j v_j \tanh\!\left(\sum_i w_{ji}\, x_i - \theta_j\right), \qquad (17.5)$$
whereas the output of a RBF computes to
$$y = \sum_j v_j\, e^{-\frac{1}{2}\sum_i \left(\frac{x_i - \mu_{ji}}{\sigma_{ji}}\right)^2}. \qquad (17.6)$$
Some research papers typically deal with the comparison of these two types of
basis functions whereas Flake proposes the following combining approach
$$y = \sum_j v_j \tanh\!\left(\sum_i \left(w_{ji}\, x_i + u_{ji}\, x_i^2\right) - \theta_j\right), \qquad (17.7)$$
which covers the MLP of eq. 17.5 and simultaneously approximates the RBF
output eq. 17.6 to a sufficient level.
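As a small illustration, eq. 17.7 can be evaluated directly; the weight names in the following NumPy snippet are hypothetical:

    import numpy as np

    def flake_output(x, V, W, U, theta):
        # y = sum_j v_j tanh( sum_i (w_ji x_i + u_ji x_i^2) - theta_j )   (eq. 17.7)
        hidden = np.tanh(W @ x + U @ (x ** 2) - theta)
        return V @ hidden

    rng = np.random.default_rng(1)
    n_in, n_hid = 5, 7
    x = rng.standard_normal(n_in)
    V, theta = rng.standard_normal(n_hid), rng.standard_normal(n_hid)
    W, U = rng.standard_normal((n_hid, n_in)), rng.standard_normal((n_hid, n_in))
    print(flake_output(x, V, W, U, theta))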
Nevertheless, we propose some minor changes to Flake’s approach by taking
the square of the internally preprocessed inputs instead of supplying the squared
original inputs separately. That way, we take advantage of the preprocessing
cluster because it limits the outliers. Furthermore, pruning inputs leads to a
simultaneous elimination of its linear and squared transformation. A possible
implementation is shown in fig. 17.3. The connector id is a fixed identity matrix.
The cluster above this connector uses square as the nonlinear activation function.
The next hidden cluster is able to create MLP and RBF structures with its tanh
squashing function.
One might think that the addition of a new connector between the square and
the hidden layer with all its additional weights would boost the overfitting. Our
experience indicates just the opposite. Furthermore, when optimizing a usual MLP,
one typically stops training before the error on the training set gets too small (early
stopping) in order to increase the probability of good results on the test set. We have
observed that the error on the training set can be very small while at the same time
achieving good test results, even if one does not use weight pruning. Our experiences
can be understood on the basis of the local modeling features of the architecture.
That is, local artifacts in the input space can be encapsulated, so that they no longer
have a global influence which may lead to bad generalization performance. This is
especially true for high dimensional input spaces.
Fig. 17.3. Net internal preprocessing cluster and the square cluster, which produces the input signals for following clusters. Note that the connector from bias to “hidden” is required but suppressed for visual clarity.
17.3.4 Interaction Layer
This section summarizes the articles of Weigend and Zimmermann in [36, 32].
In most applications of neural networks in financial engineering the number of
inputs is huge (of the order of a hundred), but only a single output is used. This
situation can be viewed as a large “inverted” funnel; the hard problem is that the
only information available for learning is that of a single output. This is one of the
reasons for the “data-hungriness” of single-output neural networks, i. e. a large
set of training data is often required in order to distinguish nonlinearities from
noise. The flip-side of the coin is that small data sets need to be regularized, thus
only allowing an identification of a simple model, which implies a bias towards
linear models. This circumstance is not to be confused with the bias towards
linear models which is a consequence of some techniques for avoiding overfitting
such as early stopping or weight decay.
One approach for increasing the information flow from the output side to the
input side is to increase the number of output units. In the simplest case, two
output units can be used, one to predict the return, the other one to predict the
sign of the return [34].
Since we have to model dynamical systems, we would like to provide enough
information to characterize the state of the autonomous part of the dynamics
on the output side, similar to Takens’ theorem for the notion of state on the input
side [11]. The idea of multiple output units has recently been popularized in the
connectionist community in a non-time series context by Caruana [6].
The present paper is concerned with time series, so the embedding of the out-
put can be done analogously to the input side of a tapped “delay” line, indicated
in fig. 17.4 as the point prediction layer.
Fig. 17.4. Point predictions followed by the interaction layer (the point prediction layer outputs $y_{t+n-1}$, $y_{t+n}$, $y_{t+n+1}$; the interaction layer computes $y_{t+n}-y_{t+n-1}$, $y_{t+n+1}-y_{t+n}$ and $y_{t+n+1}-2y_{t+n}+y_{t+n-1}$ with fixed weights)
For the forecast we are interested in $y_{t+n}$, the n-step ahead forecast of variable
y. Additionally, we also want to predict $y_{t+n-1}$ and $y_{t+n+1}$. However, after ex-
perimenting with this architecture, we do not consider it to be much of a benefit
to our aim, because there seems only very little interaction between the behavior
of the outputs as reflected through the implicit transfer by sharing hidden units.
This prompted us to introduce an explicit second output layer, the interac-
tion layer. The additional layer computes the next-neighbor derivatives
$(y_{t+n} - y_{t+n-1})$ and $(y_{t+n+1} - y_{t+n})$, as well as the curvature $(y_{t+n+1} - 2y_{t+n} + y_{t+n-1})$.
Differences between neighbors are encoded as fixed weights between point
prediction and interaction layer, so they do not enlarge the number of parameters
to be estimated. The overall cost function is the sum of all six contributions
where all individual contributions are weighted evenly. If the target values are
not properly scaled, it may be useful to give equal contribution to the error and
to scale each error output by the average error of that output unit.
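The effect of the fixed interaction weights on the cost can be illustrated with a small sketch (our own NumPy code, assuming a quadratic error and an even weighting of all six contributions):

    import numpy as np

    def interaction_cost(pred, obs):
        # pred, obs: [y_{t+n-1}, y_{t+n}, y_{t+n+1}] forecasts and observations
        D = np.array([[-1.0,  1.0, 0.0],    # y_{t+n}   - y_{t+n-1}
                      [ 0.0, -1.0, 1.0],    # y_{t+n+1} - y_{t+n}
                      [ 1.0, -2.0, 1.0]])   # y_{t+n+1} - 2 y_{t+n} + y_{t+n-1}
        point_err = pred - obs
        interaction_err = D @ (pred - obs)  # fixed weights, no additional parameters
        return 0.5 * (point_err ** 2).sum() + 0.5 * (interaction_err ** 2).sum()

    obs = np.array([1.0, 1.2, 1.1])
    model1 = obs + 0.1                            # correct slopes and curvature
    model2 = obs + np.array([0.1, -0.1, 0.1])     # same pointwise errors, wrong shape
    print(interaction_cost(model1, obs), interaction_cost(model2, obs))   # model1 is favored

This reproduces the situation of fig. 17.5 below: both models have identical pointwise errors, but only the second one is additionally penalized by the interaction layer.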
If the point forecasts were perfect, the interaction layer would have no effect
at all. To explain the effect of non-zero errors, consider fig. 17.5. Both predictions
of the three points have the same pointwise errors at each of the three neighboring
points. However, both the slopes and the curvature are correct in Model 1 (they
don’t give additional errors), but do add to the errors for Model 2.¹
Fig. 17.5. Geometric interpretation of the effect of the interaction layer on the cost
function. Given are three curves connecting points at three adjacent steps in time.
Whereas Model 1 (connected by dashed line) and Model 2 (connected by the dotted
line) have identical pointwise errors to the observations (connected by the solid line),
taking derivatives and curvatures into account favors Model 1.
The principle of the interaction layer can also be used to model further re-
lationships on the output side. Let us take as an example a forecasting model
of several exchange rates. Between all these forecasts we should guarantee that
there is no arbitrage between the assets at the forecast horizon. In other words,
at $y_{t+n}$ there is no way to change money in a closed loop and have a positive
return. These intermarket relationships can be realized in the form of an interaction
layer.
In general, the control of intermarket relationships becomes more important
if we proceed from forecasting models to portfolio analysis models. In the latter
the correct forecast of the interrelationships of the assets is more important than
a perfect result on any single asset.
17.3.5 Averaging
Let us assume that we have m sub-networks for the same learning task. Different
solutions of the sub-networks may be caused by instabilities of the learning or by
a different design of the sub-networks. It is a well known principle that averaging
the output of several networks may give us a better and more stable result [24, 4].
These advantages can be clearly seen if we define as average error function
$$E_{\text{average}} = \frac{1}{T}\sum_{t=1}^{T}\left(\frac{1}{m}\sum_{i=1}^{m} y_{i,t+n} - y^d_{t+n}\right)^2 \qquad (17.8)$$
¹ Note that in the case of a quadratic error function, the interaction layer can be substituted by a single output layer of point predictions, combined with a positive definite non-diagonal quadratic form as target function.
Fig. 17.6. Averaging of several point predictions for the same forecast horizon
with $y^d_{t+n}$ as the target pattern, $y_{i,t+n}$ as the output of sub-network i, m as the
number of sub-networks, and T as the number of training patterns. Assuming
that the errors of the sub-networks are uncorrelated,

$$\frac{1}{T}\sum_{t=1}^{T}\left(y_{i,t+n} - y^d_{t+n}\right)\left(y_{j,t+n} - y^d_{t+n}\right) = 0, \quad i \neq j, \qquad (17.9)$$
leads to

$$E_{\text{average}} = \frac{1}{m}\left[\frac{1}{m}\sum_{i=1}^{m}\frac{1}{T}\sum_{t=1}^{T}\left(y_{i,t+n} - y^d_{t+n}\right)^2\right] \qquad (17.10)$$

$$\hphantom{E_{\text{average}}} = \frac{1}{m}\,\operatorname{average}\!\left(E_{\text{sub-networks}}\right). \qquad (17.11)$$
According to this argument, the error due to the uncertainty of the training
can be reduced. Finally, it is to be noted that averaging adds no additional
information about the specific application of our network.
17.3.6 Regularization by Random Targets
It is known in the neural network community that the addition of random targets
can improve the learning behavior.² An architectural realization of this idea is
shown in fig. 17.7.
In economic applications we typically use large input vectors in order to cap-
ture all probably relevant indicators. The large number of parameters resulting
from connecting the input layer to a hidden layer of an appropriate size is the source
² The authors would appreciate any useful citation.
Fig. 17.7. A simple architecture using the original targets (upper left branch) and the
random targets for regularization (upper right branch)
of overfitting. To partly remedy this effect, one can extend the neural network
with additional but random targets. As a decreasing error on these outputs can
only be achieved by memorizing these random events, this technique absorbs a
part of the overparametrization of the network. It should not be confused with
artificial noise on the output side because the additional patterns are randomly
selected according to a probability distribution before the training and are held
fixed during the learning.
The question of whether enough or too many additional random targets have been
supplied can be answered by observing the learning behavior. During training the error
with respect to these random targets is steadily decreasing until convergence. If
the number of additional targets is too small one may observe overfitting effects
(e. g. using a validation set) on the real targets after this convergence. On the
other hand, if the number of additional targets is too large the learning slows
down.
One may suspect that the parameters focusing on the random targets have an
unpredictable effect on the generalization set, but we could not observe such a
behavior. If squared inputs are used (see sec. 17.3.3), their local learning capabilities
support this by encapsulating local artifacts.
17.3.7 An Integrated Network Architecture for Forecasting
Problems
As a next step in our investigation, we suppose our task to be an economic fore-
casting model. In the following section we integrate the architectural building
blocks discussed above into an eleven layer network specialized to fulfill our pur-
pose (see fig. 17.8). As indicated in section 17.2, the net external preprocessing
contains at least two inputs per original time series: the momentum in the form
of the relative difference (eq. 17.1) and a force indicator in the form of a curvature
or mean reverting description (eq. 17.2).
The lowest part of the network shows the net internal preprocessing by a diag-
onal connector. Alternatively, we could also use the bottleneck network. By the
square layer, we allow our network to cover the difference and similarity anal-
ysis of MLP- and RBF- networks. The signals from the internally preprocessed
inputs and their squared values are used as inputs, weighted by the associated
parameters, to the hidden embedding and hidden force layers (see fig. 17.8).
In contrast to typical neural networks, the upper part of the net is organized
in a new way. The underlying dynamical system is supposed to be characterized
by an estimation of different features around the forecast horizon. We distin-
guish these indicators by two particular characteristics, embeddings and forces.
The forecast of these different features has been separated into two branches of
the network in order to avoid interference during training. Instead of directly
forecasting our final target which has the form:
$$\frac{y_{t+n} - y_t}{y_t}\,, \qquad (17.12)$$
we use the following indicators as targets in these two output layers:
$$\begin{aligned}
\text{embeddings:}\quad
u_i &= \frac{y_{t+n+i} + y_{t+n} + y_{t+n-i}}{3\,y_t} - 1\,, & i &= 1,\dots,m\\
v_i &= \frac{\frac{1}{2i+1}\sum_{j=-i}^{i} y_{t+n+j} - y_t}{y_t}\,, & i &= 1,\dots,m\\[4pt]
\text{forces:}\quad
r_i &= \frac{-y_{t+n+i} + 2y_{t+n} - y_{t+n-i}}{3\,y_t}\,, & i &= 1,\dots,m\\
s_i &= \frac{y_{t+n} - \frac{1}{2i+1}\sum_{j=-i}^{i} y_{t+n+j}}{y_t}\,, & i &= 1,\dots,m
\end{aligned} \qquad (17.13)$$
Target $u_i$ describes a normalized 3-point embedding with respect to our forecast
horizon t + n while $v_i$ represents a complete average around t + n versus present
time. The forces $r_i$ and $s_i$ are formulated as curvatures or mean reverting (see also
section 17.2). Similar to the embeddings they describe features of a dynamical
system with an increasing width. The different widths in turn allow a complete
characterization of the time series, analogous to the characterization by points
in Takens’ Theorem [29].
The motivation of this design is contained in the step from the embedding
and force output layers to the multiforecast output layer. We have
$$\frac{y_{t+n} - y_t}{y_t} = u_i + r_i\,, \qquad i = 1,\dots,m$$
$$\frac{y_{t+n} - y_t}{y_t} = v_i + s_i\,, \qquad i = 1,\dots,m. \qquad (17.14)$$
That means, we get 2m estimations of our final target by simply adding up the
embeddings and their associated forces in pairs. In the network this is easy to
realize by two identity connectors. After the multiforecast layer we can add an
averaging connector to get our final output. This averaging can be done by fixed
weights 1/2m or by learning after finishing the training of the lower network.
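A short sketch (illustrative NumPy code under the notation of eq. 17.13) shows how the target indicators can be computed and that they add up pairwise to the final target of eq. 17.14:

    import numpy as np

    def embedding_force_targets(y, t, n, m):
        # embeddings u_i, v_i and forces r_i, s_i of eq. 17.13 for one time index t
        u = np.array([(y[t+n+i] + y[t+n] + y[t+n-i]) / (3 * y[t]) - 1 for i in range(1, m+1)])
        v = np.array([(y[t+n-i:t+n+i+1].mean() - y[t]) / y[t]          for i in range(1, m+1)])
        r = np.array([(-y[t+n+i] + 2 * y[t+n] - y[t+n-i]) / (3 * y[t]) for i in range(1, m+1)])
        s = np.array([(y[t+n] - y[t+n-i:t+n+i+1].mean()) / y[t]        for i in range(1, m+1)])
        return u, v, r, s

    y = 100.0 + np.cumsum(np.random.randn(60))          # toy series, strictly positive
    u, v, r, s = embedding_force_targets(y, t=20, n=6, m=6)
    target = (y[26] - y[20]) / y[20]
    assert np.allclose(u + r, target) and np.allclose(v + s, target)   # eq. 17.14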
The control-embedding and control-force clusters are motivated by the fol-
lowing observation. On the embedding / force output level the network has
to estimate only slightly different features depending on the width parameter.
Neural networks have a tendency to estimate these features too similarly. To
counteract this behavior we have to add two additional clusters which control
the difference between the individual outputs inside the embedding and the force
cluster. Thus, the network has not only to learn $u_i$ and $u_{i+1}$ on the embedding
output level but also the difference $u_i - u_{i+1}$ on the control embedding level.
The same is valid for $v_i$, $r_i$, $s_i$.
From a formal viewpoint the multiforecast cluster, the control-embedding clus-
ter and control-force cluster are interaction layers supporting the identification
of the underlying dynamical system in the form of embeddings and forces. Keep in
mind that although the full network seems to be relatively complex, most of
the connectors are fixed during training. Those are only used to produce the
appropriate information flows whose design is the real focus of this section.
Our proposed network design allows for an intuitive evaluation of the dimension
of the target indicators embeddings and forces: how to choose m in eq. 17.13?
Start with a relatively large m and train the network to the minimal error on the
training set as will be described in section 17.8. Then train only the weights of
the up to now fixed connector between the cluster multi-forecast and average fore-
cast (fig. 17.8). If the dimension m has been chosen too large, training may lead
to weights which suppress the long range embeddings and forces. Thus, it
is possible to achieve an optimal dimension m of embeddings and forces. For the
case of a six month forecast horizon we were successful with a value of m = 6.
The eleven layer network automatically integrates aspects of section 17.3.6
concerning random targets. If the dimension m has been chosen too large, then
the extreme target indicators act as random targets. On the other hand, if the
forecast problem is characterized by high noise in the short term, the indicators
for smaller m values generate random targets. Thus, choosing m too large does
not harm our network design as discussed in section 17.3.6, but can improve
generalization by partly absorbing the overparametrization of the network.
Fig. 17.8. The integrating eleven layer architecture
The different forecasts $o_i$ of the multiforecast cluster in fig. 17.8 can be used
to estimate the structural instability s of the network model by

$$s = \sum_{i=1}^{2m} |o_i - \bar{o}| \quad\text{with}\quad \bar{o} = \frac{1}{2m}\sum_{i=1}^{2m} o_i\,. \qquad (17.15)$$
A subsequent decision support system using this network can interpret the mea-
surements s as indicators of how much one can trust the model. Note that these
values of uncertainty must not be identified with error bars as described in sec-
tion 17.5 because the s merely quantify the instability of the learning.
17.4 Cost Functions
Typical error functions can be written as a sum of individual terms over all T
training patterns,
$$E = \frac{1}{T}\sum_{t=1}^{T} E_t\,, \qquad (17.16)$$

with the individual error $E_t$ depending on the network output $y(x_t, w)$ and the
given target data $y^d_t$. The often used square error,

$$E_t = \frac{1}{2}\left(y(x_t, w) - y^d_t\right)^2, \qquad (17.17)$$
can be derived from the maximum-likelihood principle and a Gaussian noise
model. Eq. 17.17 yields relatively simple error derivatives and results in asymp-
totically best estimators under certain distribution assumptions, i. e. homoscedas-
ticity. In practical applications, however, several of these assumptions are
commonly violated, which may dramatically reduce the prediction reliability
of the neural network. A problem arising from this violation is the large im-
pact outliers in the target data can have on the learning, which is also a result
of scaling the original time series to zero mean and unit variance. This effect is
particularly serious for data in finance and economics which contain large shocks.
Another cause of difficulties in financial time series analysis is heteroscedasticity,
i.e. situations where the variance of the target variable changes over time. We will
especially consider cases where the variance is input-dependent: $\sigma^2_t = \sigma^2(x_t)$.
We propose two approaches to reduce the problems resulting from outliers
and heteroscedasticity.
17.4.1 Robust Estimation with LnCosh
Outliers are common in financial time series and are usually caused by “infor-
mation shocks” such as announcements of government data or dividends paid by
companies that are out of line with market expectations. These shocks appear as
discontinuities in the trajectory of an affected asset. To be robust against such
shocks, typical cost functions like
$$E_t = \left|y(x_t, w) - y^d_t\right| \qquad (17.18)$$
are used which do not overweight large errors. A smoother version is given by
$$E_t = \frac{1}{a}\log\cosh\!\left(a\left(y(x_t, w) - y^d_t\right)\right), \qquad (17.19)$$
with parameter a > 1. We typically use a ∈ [3, 4]. This function approximates
the parabola of the squared errors for small differences (Gaussian noise model),
and is proportional to the absolute value of the difference for larger values of the
difference (Laplacian noise model). The assumed noise model, the cost function
17.19 and its derivative for the case a = 5 are shown in fig. 17.9.
Fig. 17.9. The Laplacian-like noise model, left, the log cosh error function 17.19,
middle, and its corresponding derivative, right
The function log cosh is motivated by the observation that the derivative of |x| is sign(x)
and that tanh(ax) is a smooth approximation of this step function, with the integral
$\int \tanh(az)\,dz = \frac{1}{a}\log\cosh(az)$.
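A numerically stable sketch of the cost function 17.19 and its derivative (illustrative NumPy code; the choice a = 3 follows the recommendation above) is given below:

    import numpy as np

    def lncosh_loss(y_pred, y_true, a=3.0):
        # eq. 17.19; log(cosh(z)) is evaluated as |z| + log(1 + exp(-2|z|)) - log(2)
        z = a * (y_pred - y_true)
        loss = (np.abs(z) + np.log1p(np.exp(-2.0 * np.abs(z))) - np.log(2.0)) / a
        grad = np.tanh(z)                     # derivative with respect to y_pred
        return loss.mean(), grad

    diffs = np.array([0.05, -0.1, 3.0])      # the last entry mimics an outlier
    print(lncosh_loss(diffs, np.zeros(3)))   # the outlier contributes roughly |d|, not d^2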
17.4.2 Robust Estimation with CDEN
This section describes a more general framework for robust estimation which is
based on the theory of density estimation. The advantages are twofold. First, any
parameters of the cost function (e. g. the a for eq. 17.19) can be determined by
learning which avoids an artificial bias by setting those parameters to predefined
values. Second, the proposed methods allow the modeling of heteroscedastic time
series whose variance changes over time.
Probability density estimating neural networks have recently gained major
attention in the neural network community as a more adequate tool to describe
probabilistic relations than common feed-forward networks (see [35], [22] and
[20]). Interest has hereby focussed on exploiting the additional information which
is inherent to the conditional density, for example the conditional variance as
a measure of prediction reliability [28], the representation of multi-valued map-
pings in the form of multi-modal densities to approach inverse problems [3], or
the use of the conditional densities for optimal portfolio construction [18]. In
this paper, we will use the Conditional Density Estimation Network (CDEN),
which is, among various other density estimation methods, extensively discussed
in [22] and [20].
A possible architecture of the CDEN is shown in fig. 17.10. It is assumed that
p(y|x), the conditional density to be identified, may be formulated as the para-
metric density p(y|φ(x)). The condition on x is realized by the parameter vector
φ which determines the form of the probability distribution p(y|φ(x)). Both φ(x)
and p(y) may be implemented as neural networks with the output of the first
determining the weights of the latter. Denoting the weights contained in the
parameter prediction network φ(x) as w, we may thus write $p(y|x, w) = p(y|\phi_w(x))$.
Assuming independence such that $p(y_1, \dots, y_T) = p(y_1)\cdots p(y_T)$, we minimize
the negative log-likelihood error function
$$E = -\log p(y_1, \dots, y_T) = -\log \prod_{t=1}^{T} p(y_t|x_t, w) = -\sum_{t=1}^{T} \log p(y_t|x_t, w)\,. \qquad (17.20)$$
By performing gradient descent using a variant of the common backpropagation
algorithm, we obtain a maximum likelihood estimate of the weights in the
parameter prediction network [22, 20].
Fig. 17.10. Conditional Density Estimation Network, CDEN (a parameter prediction network maps the input x to the parameters of a parametric density family Φ, yielding p(y|x, w))
Specific problems with the discussed model are first approached by determin-
ing an appropriate density family p(y|φ(x)). A powerful choice is a Gaussian
mixture
$$p(y|\phi(x)) = \sum_{i=1}^{n} P_i(x)\, p(y|\mu_i(x), \sigma_i(x)), \qquad P_i(x) \geq 0, \quad \sum_i^{n} P_i(x) = 1, \qquad (17.21)$$
because it covers a wide range of probability models. For the univariate case
(one output), the $p(y|\mu_i(x), \sigma_i(x))$ are normal density functions:
$$p(y|\mu_i(x), \sigma_i(x)) = \frac{1}{\sqrt{2\pi}\,\sigma_i(x)}\, e^{-\frac{1}{2}\left(\frac{y - \mu_i(x)}{\sigma_i(x)}\right)^2}. \qquad (17.22)$$
There are several ways to determine the individual density parameters contained
in $(P_i, \mu_i, \sigma_i)_{i=1}^{n}$. Either they are set as the output of the parameter prediction
network, or they are trained as x-independent, adaptable weights of the density
network, or some of them may be given by prior knowledge (e. g. clustering,
neuro-fuzzy).
A probability model p(y) based on the CDEN architecture in fig. 17.11 which
perceives the presence of outliers and thus leads to robust estimators is illustrated
in the left part of figure 17.12. The CDEN consists of two Gaussians with iden-
tical estimation of their mean μ(x). While the narrow Gaussian represents the
distribution of the non-outlier part of the data, the wider one expresses the
assumption that some data are located at larger distances from the prediction.
The fact that outliers are basically exceptions may be reflected by an appropriate
choice of the mixture weightings $P_i$, which may be regarded as prior probabilities
for each Gaussian distribution. Using this probability model yields the following
norm for the maximum-likelihood minimization:
$$E_t = -\log\!\left[\frac{P_1}{\sigma_1}\, e^{-\frac{1}{2}\left(\frac{y(x_t,w) - y^d_t}{\sigma_1}\right)^2} + \frac{P_2}{\sigma_2}\, e^{-\frac{1}{2}\left(\frac{y(x_t,w) - y^d_t}{\sigma_2}\right)^2}\right]. \qquad (17.23)$$
Fig. 17.11. A CDEN for limiting the influence of outliers (two Gaussians with a common mean prediction $y_{t+n}$, widths $\sigma_1(t+n)$, $\sigma_2(t+n)$ and prior weights $P_1$, $P_2$)
The qualitative behavior of $E_t$ in the one-dimensional case is illustrated in the
middle and right part of figure 17.12 (compare also fig. 17.9). One may clearly see
how the mixture limits the influence of outliers. Norms with influence-limiting
properties are called M-estimators in regression [5].
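In code, the influence-limiting norm of eq. 17.23 can be sketched as follows (illustrative NumPy; the mixture parameters are fixed here, whereas in the CDEN they would be adapted during training):

    import numpy as np

    def outlier_mixture_loss(y_pred, y_true, P1=0.9, sigma1=0.2, P2=0.1, sigma2=2.0):
        # eq. 17.23: negative log of a two-Gaussian mixture with a common mean;
        # the wide component accounts for the outliers
        d = y_pred - y_true
        g1 = P1 / sigma1 * np.exp(-0.5 * (d / sigma1) ** 2)
        g2 = P2 / sigma2 * np.exp(-0.5 * (d / sigma2) ** 2)
        return -np.log(g1 + g2)

    d = np.linspace(-3.0, 3.0, 7)
    print(outlier_mixture_loss(d, np.zeros_like(d)))  # grows much slower than d**2 for large |d|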
The difficulty involved in using M-estimators is that they usually possess a
set of parameters which have to be properly determined. In our case, the param-
eters are $P_1$, $P_2$, $\sigma_1$ and $\sigma_2$. In the framework of the CDEN architecture, they
are considered as adaptable weights of the density network. The advantage is
that unlike classical M-estimators, the parameters are determined by the data
during training and thus do not bias the solution. One method which has proven
to be successful in regression is to substitute eq. 17.23 by a mixture of a quadratic
error function and a linear function of the absolute error to limit the influence
of the outliers. An adaptive version of such an error measure may easily be
implemented in the CDEN framework by substituting a mixture of Gaussian
and Laplace distributions in equation 17.21. Other limiting error functions may
be constructed alike.
Fig. 17.12. The Gaussian mixture noise model, left, the error function 17.23, middle, and its corresponding derivative, right
Heteroscedasticity may arise due to changes in the risk associated with an
investment. In the stock market, for example, the variance of a stock’s return
is commonly related to the company’s debt-equity ratio due to the well-known
leverage effect of a varying income on its return on equity. Heteroscedasticity has
been extensively studied in time series analysis and is commonly approached us-
ing the (G)ARCH methodology. While the latter explains the conditional vari-
ances based on past residuals, the CDEN particularly accounts for nonlinear
dependencies of the variances on past observations.
If we assume a normally distributed noise model with zero mean and variance
$\sigma^2$, an appropriate representation is a single Gaussian with variable scale and
location parameter. An implementation in the CDEN is straightforward. We
use a parameter prediction network with two outputs, one for the conditional
expectation and another one for the conditional variance. This special case of the
CDEN has also been extensively investigated by Nix and Weigend [21]. During
the training of the network, the weights are optimized with respect to
$$E_t = \log\!\left(\sqrt{2\pi}\,\sigma_t(x_t, w)\right) + \frac{1}{2\sigma_t(x_t, w)^2}\left(y(x_t, w) - y^d_t\right)^2. \qquad (17.24)$$
Minimization according to eq. 17.17 and eq. 17.24 obviously only differs in that
the individual errors $(y(x_t, w) - y^d_t)$ are weighted by estimates of the inverse
variances $1/\sigma^2_t(x_t, w)$ in the CDEN. Training the CDEN thus corresponds to
deriving a generalized least squares estimator (GLSE), except that we use an
estimate $\hat{\sigma}_t(x_t, w)$ instead of the unknown $\sigma_t$.³
³ Definition and properties of the GLSE are extensively discussed in [5]. The case where an estimate of σ is used during optimization is commonly denoted as a two-stage estimation.
17.5 Error Bar Estimation with CDEN
In addition to a more robust learning, the CDEN approach can be used to
estimate the uncertainty associated with the prediction of the expected value
$y_{t+n} := y(x_t, w)$. Here we assume a normally distributed error and have to
optimize a likelihood function of the form
$$E = \frac{1}{T}\sum_{t=1}^{T}\left[\log\!\left(\sqrt{2\pi}\,\sigma(x_t, w_\sigma)\right) + \frac{\left(y(x_t, w) - y^d_t\right)^2}{2\sigma^2(x_t, w_\sigma)}\right]. \qquad (17.25)$$
If we assume a Laplacian noise model, the cost function is
$$E = \frac{1}{T}\sum_{t=1}^{T}\left[\log\!\left(2\sigma(x_t, w_\sigma)\right) + \frac{\left|y(x_t, w) - y^d_t\right|}{\sigma(x_t, w_\sigma)}\right]. \qquad (17.26)$$
The log cosh-approximation of the Laplacian model has the form
$$E = \frac{1}{T}\sum_{t=1}^{T}\left[\log\!\left(\pi\sigma(x_t, w_\sigma)\right) + \log\cosh\!\left(\frac{y(x_t, w) - y^d_t}{\sigma(x_t, w_\sigma)}\right)\right]. \qquad (17.27)$$
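As an illustration, the Gaussian likelihood of eq. 17.25 can be written as a per-pattern cost in a few lines (our own NumPy sketch; sigma is assumed to be kept strictly positive by the network, e. g. by an exponential activation plus a small offset):

    import numpy as np

    def gaussian_errorbar_loss(y_pred, sigma_pred, y_true):
        # negative log-likelihood of eq. 17.25 with an input dependent sigma(x)
        return np.mean(np.log(np.sqrt(2.0 * np.pi) * sigma_pred)
                       + (y_pred - y_true) ** 2 / (2.0 * sigma_pred ** 2))

    y_true = np.array([0.0, 0.5, -0.2])
    y_pred = np.array([0.1, 0.4, 0.1])
    sigma  = np.array([0.2, 0.2, 0.5])        # larger error bar for the third pattern
    print(gaussian_errorbar_loss(y_pred, sigma, y_true))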
In fig. 17.13 we show a network architecture which we found useful for such
error bar estimations. The architecture combines net internal preprocessing with
the squared input approach and estimates in two branches the expected value
$y(x_t, w)$ and the standard deviation $\sigma(x_t, w_\sigma)$. The CDEN cluster combines
these pieces of information and computes the flow of the error.
We have found that the mixture of local and global analysis as shown in
fig. 17.13 harmonizes well with the aim of solving a forecasting problem including
error bars. To assure positive values for σ(x) a suitable activation function $e^{(\cdot)}$
is used in the σ cluster. By using a positive offset (bias) one can avoid the
singularities in the likelihood target function which are caused by very small
σ(x).
In fig. 17.14 the CDEN with one Gauss-function is combined with the archi-
tecture of section 17.3.7. Here we assume that the estimated “forces” contain the
necessary information to approximate the uncertainty of our averaged forecast:

$$\sigma^2(y_{t+n}|x_t) = \sigma^2(y_{t+n}|\mathrm{forces}(x_t)) \qquad (17.28)$$
or, more specifically, using the acceleration and mean reverting forces of fig. 17.14:
$$\sigma^2(y_{t+n}|x_t) = \sum_{i=1}^{2m} w_i \cdot \mathrm{force}_i^2(x_t)\,. \qquad (17.29)$$
Fig. 17.13. Error bar estimation with CDEN
By a linear combination of the squared values of the forces, this architecture
learns the input dependent variance $\sigma(y_{t+n}|x_t)$ (see fig. 17.14 and eq. 17.13 for a
description of forces). Thus, we are able to achieve a forecast as well as an error
bar.
The interesting point in this combination is not only the possibility of using
the general framework as a means to identify dynamical systems.
In this environment we can even analyze the responsibility of the forces, long
range or short range, with respect to the uncertainty of the forecast. In several
monthly forecast models we have found a monotonic increase of importance from
the short to the long range forces.
Error bars and variance estimates are an essential piece of information for the
typically used mean-variance approach in portfolio management [7]. The aim is
Fig. 17.14. Error bar estimation with CDEN using the architecture of section 17.3.
The control-embedding and control-force clusters are suppressed for visual clarity.
to compute efficient portfolios which allocate the investor’s capital to several
assets in order to maximize the return of the investment for a certain level of
risk. The variance is often estimated by linear models or is given a priori by
an expert. In contrast, the CDEN approximates $\sigma(y_{t+n}|x_t)$ as a function of the
current state of the financial market and of the forecast of the neural network.
Covariances can also be estimated with CDEN using a multivariate Gaussian
in eq. 17.22. Implementation details can be found in [17, 19]. If one approxi-
mates the conditional density p(y|φ(x)) with several Gaussians, the estimated
expected mean, variance and covariance may be computed using typical moment
generating transformations [23, 19].
17.6 Data Meets Structure
17.6.1 The Observer-Observation Dilemma
Human beings believe that they are able to solve a psychological version of the
Observer-Observation Dilemma. On the one hand, they use their observations to
constitute an understanding of the laws of the world, on the other hand, they use
this understanding to evaluate the correctness of the incoming pieces of infor-
mation. Of course, as everybody knows, human beings are not free from making
mistakes in this psychological dilemma. We encounter a similar situation when
we try to build a mathematical model using data. Learning relationships from
the data is only one part of the model building process. Overrating this part
often leads to the phenomenon of overfitting in many applications (especially
in economic forecasting). In practice, evaluation of the data is often done by
external knowledge, i. e. by optimizing the model under constraints of smooth-
ness and regularization [16]. If we assume that our model summarizes the best
knowledge of the system to be identified, why shouldn’t we use the model itself
to evaluate the correctness of the data? One approach to do this is called Clearn-
ing [33]. In this paper, we present a unified approach of the interaction between
the data and a neural network (see also [38]). It includes a new symmetric view
on the optimization algorithms, here learning and cleaning, and their control by
parameter and data noise.
17.6.2 Learning Reviewed
We are especially interested in using the output of a neural network y(x, w),
given the input pattern, x, and the weight vector, w, as a forecast of financial
time series. In the context of neural networks learning normally means the min-
imization of an error function E by changing the weight vector w in order to
achieve good generalization performance. Again, we assume that the error func-
tion can be written as a sum of individual terms over all T training patterns,
E =
1
T
T
t=1
E
t
. The often used sum-of-square error can be derived from the
maximum-likelihood principle and a Gaussian noise model:
E
t
=
1
2
y(x, w) y
d
t
2
, (17.30)
with $y^d_t$ as the given target pattern. If the error function is a nonlinear function
of the parameters, learning has to be done iteratively by a search through the
weight space, changing the weights from step τ to τ + 1 according to:

$$w^{(\tau+1)} = w^{(\tau)} + \Delta w^{(\tau)}\,. \qquad (17.31)$$
There are several algorithms for choosing the weight increment $\Delta w^{(\tau)}$, the easiest
being gradient descent. After each presentation of an input pattern, the gradient
$g_t := \left.\nabla E_t\right|_{w}$ of the error function with respect to the weights is computed. In
the batch version of gradient descent the increments are based on all training
patterns
$$\Delta w^{(\tau)} = -\eta\, g = -\eta\, \frac{1}{T}\sum_{t=1}^{T} g_t\,, \qquad (17.32)$$
whereas the pattern-by-pattern version changes the weights after each presenta-
tion of a pattern $x_t$ (often randomly chosen from the training set):

$$\Delta w^{(\tau)} = -\eta\, g_t\,. \qquad (17.33)$$
The learning rate η is typically held constant or follows an annealing procedure
during training to assure convergence.
Our experiments have shown that small batches are most useful, especially
in combination with Vario-Eta, a stochastic approximation of a Quasi-Newton
method [9]:
$$\Delta w^{(\tau)} = -\frac{\eta}{\sqrt{\frac{1}{T}\sum_t \left(g_t - g\right)^2}} \cdot \frac{1}{N}\sum_{t=1}^{N} g_t\,, \qquad (17.34)$$

with N ≈ 20.
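A sketch of a Vario-Eta update with a running estimate of the gradient statistics (illustrative NumPy code; the smoothing constant and the small eps are our own choices) is given below:

    import numpy as np

    class VarioEta:
        # stochastic approximation of eq. 17.34: per-weight step sizes
        # eta / sqrt(running variance of the gradients)
        def __init__(self, n_weights, eta=0.05, decay=0.95, eps=1e-8):
            self.eta, self.decay, self.eps = eta, decay, eps
            self.g_mean = np.zeros(n_weights)
            self.g_var = np.ones(n_weights)

        def step(self, w, batch_gradient):
            # batch_gradient: average gradient over a small batch (N of about 20)
            self.g_mean = self.decay * self.g_mean + (1 - self.decay) * batch_gradient
            self.g_var = (self.decay * self.g_var
                          + (1 - self.decay) * (batch_gradient - self.g_mean) ** 2)
            return w - self.eta / np.sqrt(self.g_var + self.eps) * batch_gradient

    opt = VarioEta(n_weights=3)
    w = np.zeros(3)
    w = opt.step(w, batch_gradient=np.array([0.5, -0.01, 0.2]))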
Let us assume that the error function of a specific problem is characterized
by a minimum in a narrow valley whose boundaries are parallel to the axes of
the weight space. For the two dimensional case, such a situation is shown in
fig. 17.15.
Fig. 17.15. Vario-Eta, a stochastic approximation of a Quasi-Newton method
A gradient approach would follow a “zigzagging” track and would approx-
imate the minimum very slowly. With Vario-Eta, the zigzagging along $w_2$ is
damped and the drift along $w_1$ is accelerated. This behavior is similar to the
weight trajectories classical Newton methods show in such a situation. The ac-
tual implementation uses stochastic approximations to compute the standard
deviation.
17. How to Train Neural Networks 393
The use of the standard deviation instead of the variance in Vario-Eta offers
an additional advantage in training large networks. Passing through a long sequence of
layers in a neural network, the error signals contain less and less information.
The normalization in eq. 17.34 rescales the learning information for every weight.
If one designed Vario-Eta as close as possible to second order methods, it would
be appropriate to use the variance instead of the standard deviation in the denom-
inator, but then we would also lose the scaling property. In section 17.6.3, we
will provide a further advantage of using the standard deviation.
Learning pattern-by-pattern or with small batches can be viewed as a stochas-
tic search process because we can write the weight increments as:
$$\Delta w^{(\tau)} = -\eta\left[g + \left(\frac{1}{N}\sum_{t=1}^{N} g_t - g\right)\right]. \qquad (17.35)$$
These increments consist of the term g with a drift to a local minimum and of
the noise term $\left(\frac{1}{N}\sum_{t=1}^{N} g_t - g\right)$ disturbing this drift.
17.6.3 Parameter Noise as an Implicit Penalty Function
Consider the Taylor expansion of E(w) around some point w in the weight space
$$E(w + \Delta w) = E(w) + \sum_i \frac{\partial E}{\partial w_i}\Delta w_i + \frac{1}{2}\sum_{i,j}\frac{\partial^2 E}{\partial w_i \partial w_j}\Delta w_i \Delta w_j + \dots\,. \qquad (17.36)$$
Assume a given sequence of T disturbance vectors $\Delta w_t$, whose elements are
uncorrelated over t with zero mean and variance (row-)vector $\mathrm{var}(\Delta w_i)$. The
expected value $\langle E(w)\rangle$ can then be approximated by

$$\langle E(w)\rangle \approx \frac{1}{T}\sum_t E(w + \Delta w_t) = E(w) + \frac{1}{2}\sum_i \mathrm{var}(\Delta w_i)\,\frac{\partial^2 E}{\partial w_i^2}\,, \qquad (17.37)$$
assuming that the first and second derivatives of E are stable if we are close to
a local minimum. In eq. 17.37, noise on the weights acts implicitly as a penalty
term to the error function given by the second derivatives $\frac{\partial^2 E}{\partial w_i^2}$. The noise
variances $\mathrm{var}(\Delta w_i)$ operate as penalty parameters. As a consequence, flat minima
solutions, which may be important for achieving good generalization performance,
are favored [13].
Learning pattern-by-pattern automatically introduces such noise in the training
procedure, i.e., $\Delta w_t = -\eta\, g_t$. Close to convergence, we can assume that $g_t$
is i.i.d. with zero mean and variance vector $\mathrm{var}(g_i)$, so that the expected value
can be approximated by

$$\langle E(w)\rangle \approx E(w) + \frac{\eta^2}{2}\sum_i \mathrm{var}(g_i)\,\frac{\partial^2 E}{\partial w_i^2}\,. \qquad (17.38)$$
This type of learning introduces a local penalty parameter $\mathrm{var}(g_i)$, characterizing
the stability of the weights $w = [w_i]_{i=1,\dots,k}$. In a local minimum the sum of
gradients for the weight $w_i$ is $\sum_t g_{it} = 0$, whereas the variance $\mathrm{var}(g_i)$ may be
large. In this case the solution is very sensitive to resampling of the data
and therefore unstable. To improve generalization, the curvature of the error
function around such weights with high variance should be strongly penalized.
This is automatically done by pattern-by-pattern learning.
The noise effect due to Vario-Eta learning, $\Delta w_t(i) = -\frac{\eta}{\sigma_i}\, g_{ti}$, leads to an
expected value

$$\langle E(w)\rangle \approx E(w) + \frac{\eta^2}{2}\sum_i \frac{\partial^2 E}{\partial w_i^2}\,. \qquad (17.39)$$
By canceling the term $\mathrm{var}(g_i)$ in eq. 17.38, Vario-Eta achieves a simplified uni-
form penalty parameter, which depends only on the learning rate η. Whereas
pattern-by-pattern learning is a slow algorithm with a locally adjusted penalty
control, Vario-Eta is fast only at the cost of a simplified uniform penalty term:

local learning rate (Vario-Eta) ↔ global penalty
global learning rate (pattern by pattern) ↔ local penalty
Following these thoughts, we will now show that typical Newton methods
should not be used in stochastic learning. Let us assume that we are close to a
local minimum and that the Hessian H of the error function is not significantly
changing anymore. On this supposition, the implied noise would be of the form
$\Delta w_t = -\eta\, H^{-1} g_t$. Because of the stable Hessian, the mean of the noise is zero so
that the expected error function becomes
$$\langle E(w)\rangle \approx E(w) + \frac{\eta^2}{2}\sum_i \mathrm{var}(g_i)\left(\frac{\partial^2 E}{\partial w_i^2}\right)^{-1}. \qquad (17.40)$$
In this case, we have again a local control of the penalty parameter through the
variance of the gradients $\mathrm{var}(g_i)$, as in pattern-by-pattern learning. But
now, we are penalizing weights at points in the weight space where the inverse
of the curvature is large. This means, we penalize flat minima solutions, which
counters our goal of searching for stable solutions.
Typically Newton methods are used as cumulative learning methods so that
the previous arguments do not apply. Therefore, we conclude that second order
methods should not be used in stochastic search algorithms. To support global
learning and to exploit the bias towards flat minima solutions of such algorithms,
we only use pattern-by-pattern learning or Vario-Eta in the following.

Table 17.1. Structure-Speed-Dilemma

learning            | structure | speed
pattern by pattern  |     +     |
Vario-Eta           |           |   +
We summarize this section by giving some advice on how to achieve flat
minima solutions (see also table 17.1):
– Train the network to a minimal training error solution with Vario-Eta, which
is a stochastic approximation of a Newton method and therefore fast.
– Add a final phase of pattern-by-pattern learning with a uniform learning rate
to fine tune the local curvature structure by the local penalty parameters
(eq. 17.38). For networks with many layers, this step should be omitted be-
cause the gradients will vanish due to the long signal flows. Only Vario-Eta
with its scaling capability can solve such optimization problems appropri-
ately.
– Use a learning rate η as high as possible to keep the penalty effective. The
training error may vary a bit, but the inclusion of the implicit penalty is
more important.
We want to point out that the decision of which learning algorithm to use not
only influences the speed and global behavior of the learning, but also, for fixed
learning rates, leads to different structural solutions. This structural consequence
has also to be taken into account if one analyzes and compares stochastic learning
algorithms.
17.6.4 Cleaning Reviewed
When training neural networks, one typically assumes that the input data is
noise-free and one forces the network to fit the data exactly. Even the control
procedures to minimize overfitting effects (i.e., pruning) consider the inputs as
exact values. However, this assumption is often violated, especially in the field
of financial analysis, and we are taught by the phenomenon of overfitting not to
follow the data exactly. Clearning, as a combination of cleaning and learning,
has been introduced in [33]. In the following, we focus on the cleaning aspects.
The motivation was to minimize overfitting effects by considering the input data
as being corrupted by noise whose distribution has to be learned also.
Fig. 17.16. If the slope of the modeled function is large, then a small shift in the input
data decreases the output error dramatically
The cleaning error function for the pattern t is given by the sum of two terms
assuming same variance levels for input and output
$$E^{y,x}_t = \frac{1}{2}\left(\left\|y_t - y^d_t\right\|^2 + \left\|x_t - x^d_t\right\|^2\right) = E^{y}_t + E^{x}_t\,, \qquad (17.41)$$
with $x^d_t$, $y^d_t$ as the observed data point. In pattern-by-pattern learning, the
network output $y(x_t, w)$ determines the weight adaptation as usual,

$$w^{(\tau+1)} = w^{(\tau)} - \eta\, \frac{\partial E^{y}}{\partial w}\,. \qquad (17.42)$$
We also must memorize correction vectors $\Delta x_t$ for all input data of the training
set in order to present the cleaned input $x_t$ to the network,

$$x_t = x^d_t + \Delta x_t\,. \qquad (17.43)$$
The update rule for the corrections, initialized with $\Delta x^{(0)}_t = 0$, can be derived
from typical adaptation sequences $x^{(\tau+1)}_t = x^{(\tau)}_t - \eta\, \frac{\partial E^{y,x}}{\partial x}$, leading to

$$\Delta x^{(\tau+1)}_t = (1 - \eta)\,\Delta x^{(\tau)}_t - \eta\left(y_t - y^d_t\right)\frac{\partial y}{\partial x}\,. \qquad (17.44)$$
This is a nonlinear version of the error-in-variables concept in statistics [27] (see
also fig. 17.16 for the two dimensional case).
All of the necessary quantities, i. e. $(y_t - y^d_t)\frac{\partial y(x,w)}{\partial x}$, are computed by typical
back-propagation algorithms anyway. We found that the algorithms work well if
the same learning rate η is used for both the weight and cleaning updates. For
regression, cleaning forces the acceptance of a small error in x, which can in turn
decrease the error in y dramatically, especially in the case of outliers. Successful
applications of cleaning are reported in [33] and [30].
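For illustration, one cleaning step of eq. 17.44 can be sketched as follows (our own NumPy code; the toy model and the helper functions net_forward and net_input_grad are hypothetical):

    import numpy as np

    def cleaning_step(delta_x, x_obs, y_target, net_forward, net_input_grad, eta=0.01):
        # one update of the correction vector (eq. 17.44) for a single training pattern
        x = x_obs + delta_x                             # cleaned input (eq. 17.43)
        err = net_forward(x) - y_target                 # output error on the cleaned input
        return (1.0 - eta) * delta_x - eta * err * net_input_grad(x)

    w = np.array([1.0, -2.0])                           # toy linear "network" y = w.x
    net_forward = lambda x: w @ x
    net_input_grad = lambda x: w                        # dy/dx of the toy model
    dx = np.zeros(2)
    for _ in range(200):
        dx = cleaning_step(dx, x_obs=np.array([1.0, 1.0]), y_target=0.5,
                           net_forward=net_forward, net_input_grad=net_input_grad)
    print(dx)   # correction vector moving the observed input towards the model's explanation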
Although the network may learn an optimal model for the cleaned input
data, there is no easy way to work with cleaned data on the test set because for
this data we do not know the output target difference for computing eq. 17.44.
As a consequence, the model is evaluated on a test set with a different noise
characteristic compared to the training set. We will later propose a combination
of learning with noise and cleaning to work around this serious disadvantage.
17.6.5 Data Noise Reviewed
Artificial noise on the input data is often used during training because it creates
an infinite number of training examples and expands the data to empty parts of
the input space. As a result, the tendency of learning by heart may be limited
because smoother regression functions are produced.
Now, we consider again the Taylor expansion, this time applied to E(x) around
some point x in the input space. The expected value $\langle E(x)\rangle$ is approximated by

$$\langle E(x)\rangle \approx \frac{1}{T}\sum_t E(x + \Delta x_t) = E(x) + \frac{1}{2}\sum_j \mathrm{var}(\Delta x_j)\,\frac{\partial^2 E}{\partial x_j^2}\,, \qquad (17.45)$$

where $\frac{\partial^2 E}{\partial x_j^2}$ are the diagonal elements of the Hessian $H_{xx}$ of the error function
with respect to the inputs x. Again, in eq. 17.45, noise on the inputs acts implic-
itly as a penalty term to the error function, with the noise variances $\mathrm{var}(\Delta x_j)$
operating as penalty parameters (compare eq. 17.37). Noise on the input im-
proves generalization behavior by favoring smooth models [3].
The noise levels can be set to a constant value, e. g. given by a priori knowl-
edge, or made adaptive as described now. We will concentrate on a uniform or normal
noise distribution. Then, the adaptive noise level $\xi_j$ is estimated for each input
j individually. Suppressing pattern indices, we define the average residual errors
$\xi_j$ and $\xi^2_j$ as:
$$\text{uniform residual error:}\quad \xi_j = \frac{1}{T}\sum_t \left|\frac{\partial E^{y}}{\partial x_j}\right|, \qquad (17.46)$$

$$\text{Gaussian residual error:}\quad \xi^2_j = \frac{1}{T}\sum_t \left(\frac{\partial E^{y}}{\partial x_j}\right)^2. \qquad (17.47)$$
Actual implementations use stochastic approximation, e. g. for the uniform resid-
ual error

$$\xi^{(\tau+1)}_j = \left(1 - \frac{1}{T}\right)\xi^{(\tau)}_j + \frac{1}{T}\left|\frac{\partial E^{y}}{\partial x_j}\right|. \qquad (17.48)$$
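A sketch of this stochastic approximation (illustrative NumPy code with toy residual gradients) is:

    import numpy as np

    def update_noise_level(xi, grad_E_wrt_x, T):
        # eq. 17.48: running estimate of the uniform residual error per input j
        return (1.0 - 1.0 / T) * xi + (1.0 / T) * np.abs(grad_E_wrt_x)

    T = 500
    xi = np.zeros(3)
    for _ in range(T):
        grad = np.random.randn(3) * np.array([0.1, 1.0, 0.01])   # toy values of dE^y/dx_j
        xi = update_noise_level(xi, grad, T)
    print(xi)    # inputs with a large residual error obtain a high noise level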
The different residual error levels can be interpreted as follows (table 17.2):
A small level $\xi_j$ may indicate an unimportant input j or a perfect fit of the
network concerning this input j. In both cases, a small noise level is appropriate.
On the other hand, a high value of $\xi_j$ for an input j indicates an important but
imperfectly fitted input. In this case high noise levels are advisable. High values of
$\xi_j$ lead to a stiffer regression model and may therefore increase the generalization
performance of the network. Therefore, we use $\xi_j$ or $\xi^2_j$ as the parameter to control
the level of noise for input j.
Table 17.2. The interpretation of different levels of the residual errors ξ

observation of the residual error | interpretation                    | advice for the noise control
ξ small                           | perfect fit or unimportant input  | use low noise
ξ large                           | imperfect fit but important input | use high noise
17.6.6 Cleaning with Noise
Typically, training with noisy inputs involves taking a data point and adding
a random variable drawn from a fixed or adaptive distribution. This new data
point $x_t$ is used as an input to the network. If we assume that the data is
corrupted by outliers and other influences, it is preferable to add the noise term
to the cleaned input. For the case of Gaussian noise the resulting new input is:

$$x_t = x^d_t + \Delta x_t + \xi\phi\,, \qquad (17.49)$$
with φ drawn from the normal distribution. The cleaning of the data leads to a
corrected mean of the data and therefore to a more symmetric noise distribution,
which also covers the observed data $x_t$.
We propose a variant which allows more complicated and problem dependent
noise distributions:
$$x_t = x^d_t + \Delta x_t - \Delta x_k\,, \qquad (17.50)$$

where k is a random number drawn from the indices of the memorized correc-
tion vectors $[\Delta x_t]_{t=1,\dots,T}$. By this, we exploit the distribution properties of the
correction vectors in order to generate a possibly asymmetric and/or dependent
noise distribution, which still covers $x_t = x^d_t$ if k = t.
One might wonder why we want to disturb the cleaned input $x^d_t + \Delta x_t$ with an
additional noisy term $\Delta x_k$. The reason for this is that we want to benefit from
representing the whole input distribution to the network instead of only using
one particular realization. This approach supplies a solution to the cleaning
problem when switching from the training set to the test set as described in
section 17.6.4.
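A minimal sketch of eq. 17.50 (illustrative NumPy code; the memorized correction vectors are random here) reads:

    import numpy as np

    def noisy_cleaned_input(t, x_obs, corrections, rng):
        # eq. 17.50: cleaned input of pattern t, disturbed by the correction vector
        # of a randomly drawn pattern k; reproduces the observed x^d_t whenever k = t
        k = rng.integers(len(corrections))
        return x_obs[t] + corrections[t] - corrections[k]

    rng = np.random.default_rng(0)
    x_obs = rng.standard_normal((100, 4))                  # observed inputs x^d_t
    corrections = 0.1 * rng.standard_normal((100, 4))      # memorized correction vectors
    x_input = noisy_cleaned_input(5, x_obs, corrections, rng)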
17.6.7 A Unifying Approach: The Separation of Structure and
Noise
In the previous sections we explained how the data can be separated into a
cleaned part and an unexplainable noisy part. Analogously, the neural network
is described as a time invariant structure (otherwise no forecasting is possible)
and a noisy part.
data = cleaned data + time invariant data noise
neural network = time invariant parameters + parameter noise
We propose using cleaning and adaptive noise to separate the data and using
learning and stochastic search to separate the structure of the neural network.
data = cleaning (neural network) + adaptive noise (neural network)
neural network = learning (data) + stochastic search (data)
The algorithms analyzing the data depend directly on the network whereas the
methods searching for structure are directly related to the data. It should be clear
that the model building process should combine both aspects in an alternating
or simultaneous manner. We normally use learning and cleaning simultaneously.
The interaction of the data analysis and network structure algorithms is a direct
embodiment of the concept of the Observer-Observation Dilemma.
The aim of the unified approach can be described, assuming here as an example a
Gaussian noise model, as the minimization of the error due to both the structure
and the data:

$$\frac{1}{2T}\sum_{t=1}^{T}\left(\left\|y_t - y^d_t\right\|^2 + \left\|x_t - x^d_t\right\|^2\right) \;\to\; \min_{x_t,\, w} \qquad (17.51)$$
Combining the algorithms and approximating the cumulative gradient g by $\tilde{g}$
using an exponential smoothing over the patterns, we obtain

data:
$$\Delta x^{(\tau+1)}_t = (1 - \eta)\,\Delta x^{(\tau)}_t - \eta\left(y_t - y^d_t\right)\frac{\partial y}{\partial x}$$
$$x_t = x^d_t + \underbrace{\Delta x^{(\tau)}_t}_{\text{cleaning}} - \underbrace{\Delta x^{(\tau)}_k}_{\text{noise}}$$

structure:
$$\tilde{g}^{(\tau+1)} = (1 - \alpha)\,\tilde{g}^{(\tau)} + \alpha\left(y_t - y^d_t\right)\frac{\partial y}{\partial w}$$
$$w^{(\tau+1)} = w^{(\tau)} - \underbrace{\eta\,\tilde{g}^{(\tau)}}_{\text{learning}} - \underbrace{\eta\left(g_t - \tilde{g}^{(\tau)}\right)}_{\text{noise}} \qquad (17.52)$$
The cleaning of the data by the network computes an individual correction term
for each training pattern. The adaptive noise procedure according to eq. 17.50
generates a potentially asymmetric and dependent noise distribution which also
covers the observed data. The implied curvature penalty, whose strength depends
on the individual liability of the input variables, can improve the generalization
performance of the neural network.
The learning of the structure involves a search for time invariant parameters
characterized by $\frac{1}{T}\sum_t g_t = 0$. The parameter noise supports this exploration as a
performance may be further improved by the implied curvature penalty depend-
ing on the local liability of the parameters. Note that, although the description
of the weight updates collapses to the simple form of eq. 17.33, we preferred the
formula above to emphasize the analogy between the mechanism which handles
the data and the structure.
In searching for an optimal combination of data and parameters, we have
needed to model noise in both. This is not an indication of our failure to build a perfect
model but rather it is an important element for controlling the interaction of data
and structure.
17.7 Architectural Optimization
The initial network can only be a guess at the appropriate final architecture. One
way to handle this inadequacy is to use a growing network. As a possibility in the op-
posite direction, pruning can be applied to shrink the network. From our experience
the advantage of fast learning of growing networks is more than counterbalanced
by the better learning properties of large architectures. At least in the first learning
steps, large networks are not trapped as quickly in local minima. When applying
the pruning methods according to the late stopping concept, which trains the weights
of the network until the error function converges (see section 17.8.1), the learning
procedure had a sufficient number of adaptation steps to adjust the broad network
structure. Another advantage of the pruning technique is the greater flexibility in
generating sparse and possibly irregular network architectures.
17.7.1 Node-Pruning
We evaluate the importance of input or hidden nodes by using as a test value
$$\text{test}_i = E(x_1,\ldots,x_i = \mu(x_i),\ldots,x_n) - E(x_1,\ldots,x_n)\,, \qquad (17.53)$$
with $\mu(x_i)$ describing the mean value of the $i$-th input in the time series for the
training set. This test value is a measure of the increase of the error if we omit
one of the input series.
The creation of the bottleneck structure in the net internal preprocessing
(section 17.2) is best performed by applying this test for the hidden neurons in
the bottleneck on the training set. Thereby the disturbance of the learning is
reduced to the minimum. In one of the steps of the training procedure (section
17.8) we will show that the pruning of the input neurons should be done on the
validation set to improve the time stability of the forecasting.
In addition to the deletion of nodes the ranking of the inputs can give us a
deeper insight into the task (which inputs are the most relevant and so on).
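A minimal sketch of this test, assuming a helper `error_fn(X, y)` that returns the network error on a data set (the name is illustrative, not from the text):

```python
import numpy as np

def input_pruning_ranking(error_fn, X, y):
    """Test values of eq. 17.53: the increase of the error when input i is
    replaced by its mean over the time series (illustrative sketch)."""
    base_error = error_fn(X, y)
    means = X.mean(axis=0)                    # mu(x_i) over the time series
    tests = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        X_i = X.copy()
        X_i[:, i] = means[i]                  # freeze input i at its mean
        tests[i] = error_fn(X_i, y) - base_error
    # Small test values identify inputs that are candidates for deletion;
    # the ranking also shows which inputs are the most relevant.
    return np.argsort(tests), tests           # indices by increasing importance
```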
17.7.2 Weight-Pruning
The neural network topology represents only a hypothesis of the true underly-
ing class of functions. Due to possible mis-specification, we may have defects of
the parameter noise distribution. Pruning algorithms not only limit the mem-
ory of the network, but they also appear to be useful for correcting the noise
distribution in different ways.
Stochastic-Pruning
Stochastic-Pruning [8] is basically a t-test on the weights w,
$$\text{test}_w = \frac{|w + g|}{\sqrt{\frac{1}{T}\sum_{t}(g_t - g)^2}}\,, \qquad (17.54)$$
with $g = \frac{1}{T}\sum_t g_t$. Weights with low $\text{test}_w$ values constitute candidates for pruning. From the viewpoint of our approach, this pruning algorithm is equivalent
ing. From the viewpoint of our approach, this pruning algorithm is equivalent
to the cancellation of weights with low liability as measured by the size of the
weight divided by the standard deviation of its fluctuations. By this, we get a
stabilization of the learning against resampling of the training data.
This easy-to-compute method worked well in combination with the early stopping concept, in contrast to the insufficient results obtained with the late stopping concept.
For early stopping Stochastic-Pruning acts like a t-test, whereas for late stopping
the gradients of larger weights do not fluctuate enough to give useful test values.
Thus, the pruning procedure includes an artificial bias to nonlinear models.
Furthermore, Stochastic-Pruning is also able to revive already pruned weights.
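A possible implementation of this test, assuming the per-pattern gradients $g_t$ have been collected in a matrix (an assumption made here for illustration only):

```python
import numpy as np

def stochastic_pruning_test(w, G):
    """Test values of eq. 17.54: |w + mean gradient| divided by the standard
    deviation of the per-pattern gradient fluctuations (illustrative sketch).
    w : array (n_w,)   current weights
    G : array (T, n_w) gradients g_t of each training pattern w.r.t. w
    """
    g_mean = G.mean(axis=0)
    std = np.sqrt(((G - g_mean) ** 2).mean(axis=0)) + 1e-12   # avoid div by zero
    test = np.abs(w + g_mean) / std
    # Weights with low test values (low liability) are pruning candidates.
    return test
```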
Early-Brain-Damage
A further weight pruning method is EBD, Early-Brain-Damage [31], which is
based on the often cited OBD pruning method [15]. In contrast to OBD, EBD
allows its application before the training has reached a local minimum.
For every weight, EBD computes as a test value an approximation of the difference between the error function for $w = 0$ versus the value of the error function for the best situation this weight can have
$$\text{test}_w = E(0) - E(w_{\min}) = -g^\top w + \frac{1}{2}\,w^\top H w + \frac{1}{2}\,g^\top H^{-1} g\,. \qquad (17.55)$$
The above approximation is motivated by a Taylor expansion of the error func-
tion. From
$$E(\tilde w) = E(w) + g^\top(\tilde w - w) + \frac{1}{2}(\tilde w - w)^\top H (\tilde w - w) \qquad (17.56)$$
we get
$$E(0) = E(w) - g^\top w + \frac{1}{2}\,w^\top H w \qquad (17.57)$$
and as a solution to the minimum problem $E(\tilde w) \to \min$ we have
$$w_{\min} = w - H^{-1} g \quad\text{together with}\quad E(w_{\min}) = E(w) - \frac{1}{2}\,g^\top H^{-1} g\,. \qquad (17.58)$$
The difference of these two error values is the proposed EBD test. The Hessian
H in this approach is computed in the same way as in the original OBD calculus
[15].
One of the advantages of EBD over OBD is the possibility of performing
the testing while being slightly away from a local minimum. In our training
procedure we propose using noise even in the final part of learning so that we
are only near a local minimum. Furthermore, EBD is also able to revive already
pruned weights.
Similar to Stochastic Pruning, EBD favors weights with a low rate of fluc-
tuations. If a weight is pushed around by a high noise, the implicit curvature
Fig. 17.17. EBD versus OBD weight pruning
penalty would favor a flat minimum around this weight leading to its elimination
by EBD.
Our practical experience shows that EBD pruning allows the creation of ex-
tremely sparse networks. We had examples where we could prune an initial
network with 3000 weights down to a structure with around 50 weights. The
first iterations of EBD pruning would typically give no improvement in general-
ization. This is due to the fact that EBD is testing the importance of a weight
regarding the error function, i. e. the same information which is used by the
backpropagation algorithm. To put it another way: EBD cancels out only
weights which are not disturbing the learning. Only at the end of the training
procedure does the network have to give up a part of the coded structure which
leads to an improvement in generalization.
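The EBD test can be sketched as follows under a diagonal Hessian approximation, which is how the Hessian is handled in the OBD calculus [15]; the arrays of gradients and Hessian diagonal entries are assumed to be supplied by the training code, so this is only an illustrative sketch.

```python
import numpy as np

def ebd_test(w, g, h_diag):
    """EBD test values (eq. 17.55) with a diagonal Hessian approximation.
    w, g, h_diag : arrays of shape (n_w,) holding the weights, the gradients
    and the diagonal of H (illustrative sketch only)."""
    h = np.maximum(h_diag, 1e-12)              # guard against zero curvature
    # Per weight: test_w = E(0) - E(w_min) = -g*w + 0.5*H*w^2 + 0.5*g^2/H
    test = -g * w + 0.5 * h * w ** 2 + 0.5 * g ** 2 / h
    # Weights with small test values change the error least when set to zero
    # and are pruning candidates; pruned weights may later be revived.
    return test
```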
Inverse-Kurtosis
A third pruning method we want to discuss is a method which we call Inverse-Kurtosis. The motivation follows from an analysis of possible distributions of the gradient impulses acting on a weight, as shown in fig. 17.18.
If the network is trained to a local minimum the mean of all gradients by def-
inition is equal to zero. Nevertheless the distribution of the gradients may differ.
Now we have to analyze the difference of a peaked or very broad distribution
versus a normal distribution. It is our understanding that the peaked distribu-
tion indicates a weight which reacts only to a small number of training patterns.
A broader distribution, on the other hand, is a sign that many training patterns
Fig. 17.18. Possible distributions of the gradients $g_t$, if weights are fixed
focus on the optimization of this weight. In other words, a weight with a peaked
distribution has learned by heart special events of the time series but it has not
modeled a general underlying structure of the data. Therefore, Inverse-Kurtosis
pruning harmonizes well with our network architecture which can encapsulate
local structures with its squared layer. Furthermore, a weight with a very broad
distribution of its gradients is pushed around by random events because it re-
acts to almost every pattern with a similar strength. A straightforward way to
distinguish between the above distributions is based on the kurtosis to measure
the difference to a normal distribution:
$$\text{distance}_w = \frac{\frac{1}{T}\sum_{t=1}^{T}(g_t - g)^4}{\left(\frac{1}{T}\sum_{t=1}^{T}(g_t - g)^2\right)^{2}} - 3\,. \qquad (17.59)$$
To rank the importance of the network weights based on this difference to a normal distribution we define the test values as
$$\text{test}_w = \frac{1}{\varepsilon + |\text{distance}_w|}\,, \qquad (17.60)$$
with a small term $\varepsilon \approx 0.001$ to avoid numerical problems.
Similar to the previously mentioned pruning techniques we have now large
test values for weights we do not want to eliminate. Note that in this test, we
have neglected the size of the gradients and only taken into account the form
of the distribution. The weights themselves are not a part of the evaluation,
and we have not observed that this method has much effect on the distribution
of the weight sizes found by the learning. In contrast, Stochastic Pruning and
EBD have a tendency to prune out small weights because they explicitly refer
to the size of the weights. Basically, Inverse-Kurtosis computes low test values
(i. e. probable pruning candidates) for weights which are encapsulating only local
artifacts or random events in the data instead of modeling a general structure.
In our experience we found that this technique can give a strong improvement
of generalization in the first several iterations of pruning.
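A minimal sketch computing these test values from a matrix of per-pattern gradients (illustrative only):

```python
import numpy as np

def inverse_kurtosis_test(G, eps=1e-3):
    """Inverse-Kurtosis test values (eqs. 17.59 and 17.60) from per-pattern
    gradients G of shape (T, n_w); illustrative sketch only."""
    g_mean = G.mean(axis=0)
    m2 = ((G - g_mean) ** 2).mean(axis=0) + 1e-12
    m4 = ((G - g_mean) ** 4).mean(axis=0)
    distance = m4 / m2 ** 2 - 3.0        # excess kurtosis w.r.t. a normal law
    test = 1.0 / (eps + np.abs(distance))
    # Low test values flag weights whose gradient distribution is far from
    # normal (peaked or very broad), i.e. weights encoding local artifacts.
    return test
```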
Instability-Pruning
The local consequence of our learning until the minimal training error is that
the cumulative gradient g for every weight is zero on the training set,
$$g = \frac{1}{T}\sum_{t\in T} g_t = 0\,. \qquad (17.61)$$
If this condition was valid for the validation set our model would be perfectly
stable:
$$g_V = \frac{1}{V}\sum_{t\in V} g_t = 0\,. \qquad (17.62)$$
Since $g_V$ typically is different from zero we have to check if the measured dif-
ference is significant enough to indicate instability. In the language of statistics
this is a two sample test for the equality of the mean of two distributions. The
following test value (Welch-test) measures this difference:
$$\text{distance}_w = \frac{g_V - g}{\sqrt{\dfrac{\sigma_V^2}{V} + \dfrac{\sigma_T^2}{T}}}\,, \qquad (17.63)$$
which is approximately normally distributed with zero mean and unit variance.
As a pruning test value we define
$$\text{test}_w = \frac{1}{\varepsilon + |\text{distance}_w|}\,. \qquad (17.64)$$
This test value can be used to construct a ranking of the weights in the following
way: Train the network until weights are near a local optimum and compute the
test values. Then, take out 5% of the most unstable weights as measured by
this test criterion and redo the learning step. Eliminating more than 5% of the weights is not advisable because Instability-Pruning does not refer to the cost function used for learning, and thus may have a large impact on the error.
This test of stability allows the definition of an interesting criterion when to
stop the model building process. After pruning weights until a given stability
level, we can check if the model is still approximating well enough. This final
stopping criterion is possible due to the asymptotic normal distribution of the
test. For example, if we define weights with $|\text{distance}_w| < 1$ as stable, then we prune weights with $\text{test}_w < 1$.
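A minimal sketch of this stability test, assuming per-pattern gradients have been collected on both the training and the validation set (an assumption for illustration only):

```python
import numpy as np

def instability_test(G_train, G_val, eps=1e-3):
    """Welch-type stability test (eqs. 17.63 and 17.64). G_train and G_val
    hold per-pattern gradients on the training and validation set with
    shapes (T, n_w) and (V, n_w); illustrative sketch only."""
    T, V = G_train.shape[0], G_val.shape[0]
    g_t, g_v = G_train.mean(axis=0), G_val.mean(axis=0)
    var_t, var_v = G_train.var(axis=0), G_val.var(axis=0)
    distance = (g_v - g_t) / np.sqrt(var_v / V + var_t / T + 1e-12)
    test = 1.0 / (eps + np.abs(distance))
    # Prune e.g. the 5% most unstable weights (smallest test values), or all
    # weights with test < 1 if |distance| < 1 defines a stable weight.
    return test
```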
One may argue that the usage of the validation set in this approach is an
implicit modeling of the validation set and therefore the validation set allows no
error estimation on the generalization set. On the other hand, the typical need for a validation set is the estimation of a final stopping point in the pruning
procedure. We have tried an alternative approach: Instead of using the validation
set in terms of g
V
and σ
2
V
in our comparison let us take a weighted measurement
of the training set such that most recent data is given more importance:
$$\tilde g = \frac{2}{T(T+1)}\sum_{t\in T} t\, g_t\,, \qquad (17.65)$$
$$\tilde\sigma^2 = \frac{2}{T(T+1)}\sum_{t\in T} t\,(g_t - \tilde g)^2\,. \qquad (17.66)$$
Checking the stability over the training set by substituting $g_V$ resp. $\sigma_V^2$ with eq. 17.65 resp. 17.66 in eq. 17.63 avoids using the validation set. In our first
experiences we observed that this version works as well as the previous one
although the evaluation still needs more testing.
In comparison to EBD, which eliminates weights with only a small effect on
the error function, Instability-Pruning introduces a new feature in the process
of model building: stability over time. Additional information about when to
choose which pruning method will be given in the following section 17.8.4.
17.8 The Training Procedure
The authors believe that complex real world problems should be attacked by a
combination of various methods in order to handle the different types of difficul-
ties which may arise during the optimization of neural networks. Thus, the aim
of this section is to link all the previously described features to a complete and
consistent training procedure.
17.8.1 Training Paradigms: Early vs. Late Stopping
One of the most well known techniques for attacking the overfitting problem
is the early stopping procedure. In this procedure, we start the learning with
a network initialized by small weights. In its most simple version one uses a
validation set to estimate the beginning of overfitting, at which point the network
learning is stopped. A more sophisticated variant of the procedure is to learn
up to the start of overfitting and then to prune a part of the network structure,
by say 10% of the weights. Then, one restarts the smaller network and repeats
the steps. This sequence, which is schematically shown in fig. 17.19, has to be
iterated until a stable structure with no signs of overfitting is achieved.
In principle this procedure may work and it will generate a model with good
generalization performance, but in many cases it will fail to do so, as we have
observed in our experiments during the last several years. Difficulties with early
stopping arise because the stopping point turns up after a few learning epochs
through the training data. The authors have worked on examples where the
Fig. 17.19. Early stopping: After initialization with small weights the network is trained until the error on the validation set starts to increase (search path 1 to stopping point 1). Then, some weights are pruned, the remaining weights are reinitialized and training starts again until stopping point 2. This procedure has to be iterated until a stable structure with no signs of overfitting is achieved.
stopping point appeared as early as after the second epoch of learning. In this
case, the solution is restricted to linear models, since the network has not been
offered any chance to learn a complex nonlinearity from the data. A decreased
learning rate does not mean a reduction of this bias, because it only slows down
the movement away from the linear models. Using initial weights with larger
values is also problematic for two reasons. The random initial (probably incor-
rect) specification of the network may lead to decreasing error curves due to
shrinking weight values, but after a while overfitting will probably start again.
Another important critique of this initialization is the intrinsic dependency of
the solution on the starting point.
The same argument is true for some of the penalty term approaches for reg-
ularizing the networks. Weight decay, for example, is a regularizer with a bias
towards linear models. Most of the other penalty functions (like data indepen-
dent smoothness) set additional constraints on the model building process. These
restrictions are not necessarily related to the task to be solved (see [16] for
smoothing regularizers for neural networks with sigmoidal activation function in
the hidden layer).
Following these thoughts, in this section we present a sequence of steps that
follow a late stopping concept. We start with small weights to lower the influence
of the random initialization. Then, we learn until the minimal error on the train-
ing data is achieved. Typically, the network shows clear overfitting at this point
but we can be sure that the maximum nonlinearity which can be represented
by the chosen network has been learned from the data. At this point we must
describe the way back to a solution with good generalization performance, as
indicated in fig. 17.20.
Fig. 17.20. Late stopping: After initialization with small weights the network is trained until the minimal training error. Subsequent optimization of the network structure by pruning increases generalization performance. Then, the network is trained again until the minimal training error.
Basically the two parts of the training, learning to the minimal training er-
ror and extracting an appropriate generalizing model, can be understood as a
generation of a structural hypothesis followed by a falsification of the generated
structure. These steps will be explained in detail in the next paragraphs.
17.8.2 Setup Steps
Before starting the learning process we have to specify the network architecture.
Let us assume that our aim is the identification of a dynamical system. Then, we
propose using the network architecture (fig. 17.8) of section 17.3, possibly with the CDEN extension (fig. 17.14) of section 17.5 in order to additionally estimate
error bars.
A further setup decision is the separation of the data set into a training
set, a validation set and a generalization set. Possible separations are shown in
fig. 17.21.
Considering forecasting models of time series (at least in economic applica-
tions), the sets should be explicitly separated in the order indicated in fig. 17.21.1.
Otherwise there is no chance to test the model stability over time. If we ran-
domly took out the validation set of the training set (fig. 17.21.2), we would
have no chance to test this stability because the validation patterns are always
embedded in training patterns.
In the proposed sequence we do not use the most recent data before the test
set in the learning. As a consequence, one might tend to choose the validation
patterns from the oldest part while using the most recent data in the training
set (fig. 17.21.3). This leads to a better basis for the generation of our structural
hypothesis by learning. Basically, the pruning methods are a falsification of the
generated structure. Using the separation in fig. 17.21.3 with the validation set
Fig. 17.21. Separation of the data: (1) training set, validation set and test set in temporal order; (2) validation patterns randomly taken out of the training set; (3) validation set taken from the oldest data, followed by the training set and the test set.
including the oldest data can make this falsification misleading. In a fast changing
world, model stability over time is an important performance characteristic of a
good network model. Thus, we prefer the separation in fig. 17.21.1.
The several preprocessed time series which serve as inputs have to be checked
for correlation because highly correlated inputs only serve to increase the amount
of numerical instability in our model. To avoid this, we can introduce a bottle-
neck substructure to perform a principal component analysis. For our typical
application with about 50 inputs, we use net internal preprocessing by the di-
agonal matrix. For models with a very large number of inputs (some hundreds)
the bottleneck approach may be superior.
If using the net internal preprocessing by the diagonal matrix, it is necessary
to check the pairwise correlation of the inputs. Following the common guideline
of using lots of chart indicators in financial modeling it is typical that many of
these indicators have a high correlation. Keeping them in the training set will
cause the solutions of different optimization runs to give us different answers
regarding the importance of the input factors.
Since we often start with a number of weights which is large relative to the
number of training patterns, we also propose using a small weight decay during
learning. The penalty parameter should be in the range of 10% of the learning
rate because we are not interested in reducing the number of effective parameters.
With this small value of the decay parameter, only those weights which simply
learn nothing are pulled to zero. By this, we still achieve the minimal training
error, but eliminate all the unnecessary weights which have an unpredictable
effect on the test set.
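A minimal sketch of this setup, with the weight decay set to roughly 10% of the learning rate; the gradient routine and the pattern iterator are assumed helpers, not part of the text.

```python
import numpy as np

def train_with_small_decay(w, patterns, grad_fn, eta=0.05, epochs=10):
    """Pattern-by-pattern learning with a weight decay of about 10% of the
    learning rate (illustrative sketch; grad_fn(w, x, y) returns the gradient
    of the error for one pattern and is an assumed helper)."""
    decay = 0.1 * eta                         # penalty ~ 10% of the learning rate
    for _ in range(epochs):
        for x_t, y_t in patterns:
            g_t = grad_fn(w, x_t, y_t)
            w = w - eta * g_t - decay * w     # only "idle" weights drift to zero
    return w
```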
17.8.3 Learning: Generation of Structural Hypothesis
If confronted with a large data set, such as in the task of forecasting daily returns,
we propose training the weights with the VarioEta adaptation algorithm. As a
stochastic approximation to a Newton method this is a fast learning technique.
Obeying the arguments about the implicit penalty of section 17.6 we should let
the VarioEta training be followed by a simple pattern by pattern learning with a
constant learning rate to achieve structurally correct learning. As mentioned, it is
valuable to hold the learning rate as high as possible to benefit from the implied
penalty term. However, if interested in monthly forecasting models, one may use
only the pattern-by-pattern learning because learning speed is not relevant due
to the small data set. On the other hand, the implied curvature penalty is more
important to generate good models (see section 17.6).
We propose to start the cleaning and the cleaning noise procedure of section
17.6 from the beginning. In this way, one can observe the following interesting
relationship and interaction between the stochastic learning and the cleaning
noise which improves the learning behavior. Since the initial noise variance is
set to zero, the noise to the input variables will start with a low level and will
then increase rapidly during the first learning epochs. After several epochs, when
the network has captured some of the structure in the data, the noise decays
in parallel with the residual input error of the network. As a consequence, in
the beginning of the training, the network can learn only the global structure
in the data. Later in the training process, more and more detailed features of
the data are extracted which leads simultaneously to lower cleaning noise levels
(fig. 17.22). Cleaning$^4$ improves the learning process by sequencing the data
structure from global to increasingly specialized features. Furthermore, noise
improves the stochastic search in the weight space and therefore reduces the
problem of local minima.
Fig. 17.22. Due to the large noise level in the beginning of the training phase, the network first learns global features. At later training steps, it is able to extract more and more details.
Depending on the problem, we observed that the implicitly controlled noise
level can go down close to zero or it can converge to an intermediate level. We
finish the step of learning to a minimal training error when we observe a stable
$^4$ Even in an early stopping concept Cleaning may improve the learning.
behavior of the error function on the training and validation set simultaneously
with the stable behavior of the mean noise level for all inputs on the training
set. The final model of this step is a structural hypothesis about the dynamical
system we are analyzing.
17.8.4 Pruning: Falsification of the Generated Structure
Next, we have to test the stability of our model over time. Especially in economic
modeling, it is not evident that the structure of the capital markets we are extracting from the seventies or eighties is still valid now. We can check this
stability by testing the network behavior on the validation set which follows the
training set on the time axis.
By using input pruning on the validation set (see section 17.7), we are able to
identify input series that may be of high relevance in the training set but may
also have no or even a counterproductive effect on the validation data. Note that the pruning of non-relevant time series is not possible before achieving the minimal training error, because before that point there exists no well-defined relationship between all the inputs and the target variables. On the other hand, unimportant
or inconsistent inputs with respect to the validation set should not be eliminated
later in order to facilitate the construction of models with high generalization
performance by subsequent learning.
An alternative way to check the stability of the model is to perform weight
pruning on the validation set using the Instability-Pruning algorithm. If the data
set is very large, weight pruning has a significant advantage because only one
pass through the data set is necessary, in comparison to n passes to get the
ranking of the input pruning with n as the number of inputs. This becomes very
important in the field of data-mining where we have to deal with hundreds or
thousands of megabytes of training data.
Each time after pruning of weights or inputs, the network is trained again
to achieve a stable behavior of the error curves on the training / validation set
(fig. 17.24) and of the noise level (fig. 17.23). Here, by stable behavior of the
error curve we mean that there is no significant downward trend (i. e. improving
of the model) or upward trend (i. e. restructuring of the model). In the diagram
shown in fig. 17.25, these stop conditions are marked with ().
To eliminate local artifacts in the network structure at this stage, we can
include some iterations of Inverse-Kurtosis pruning and subsequent retraining.
The process of iteratively pruning and training ends if a drastic decay, i. e. 5%,
of this criterion is no longer observable. For several iterations of training and
pruning, this strategy leads to a rapidly decreasing error on the validation set.
Due to the fact that the shrinking number of network parameters leads to
increasing gradients, pattern by pattern learning with a constant learning rate
is most useful because the implicit local curvature penalty becomes increasingly
important.
One may criticize the proposed pruning policy because it is indeed
an optimization on the validation set. Thus, the error on the validation set
does not represent a good estimation of the generalization error anymore. We
believe that there is no way to omit this pruning step since the test of the
time stability is important for achieving models with high generalization
performance.
In the subsequent part of the network learning, we use “Occam’s Razor” by
weight pruning based on test values computed on the training set. This may
be EBD or Instability-Pruning (see section 17.7.2). The performance of both
methods based on hundreds of experiments is comparable. After pruning about
10% resp. 5% of the active weights, we train again until stable error curves and
noise levels are obtained.
One can combine these pruning methods by first eliminating weights with
Instability-Pruning and generating very sparse networks using EBD afterwards.
Instability-Pruning favors model stability over time whereas EBD allows sparse
models which still approximate well.
If using the eleven layer architecture of fig. 17.8, weight pruning should only
be applied to the weights connecting the preprocessing or square layer to the
hidden layers for two reasons. First, these connectors represent the majority of
the weights in the neural network. Second, it is important to apply the pruning
techniques only to such weights which have the same distance to the output
neurons. Thus, the test values of the gradient based pruning algorithms are
comparable, which leads to a reliable ranking of the weights.$^5$
Fig. 17.23. Interaction between pruning and the adaptive noise level
$^5$ If one decides not to use the eleven layer architecture but a typical neural network
extended by random targets, one has to consider the following consequences for
input and weight pruning. The test values have to be computed only with respect
to the correct targets. By this, weights which are modeling random targets will lose their importance. EBD pruning works very well in identifying such flat minima
weights (section 17.7.2), and thus, is our favorite pruning method for such a modeling
approach.
There is an interesting interaction between the pruning and the adaptive noise
procedure (see fig. 17.23). At each pruning step, we eliminate some parameters
which have memorized some of the structure in the data. Thereby, the residual
input errors will increase, which leads to increased noise levels on the inputs.
Consequently, after each pruning step the learning has to focus on more global
structures of the data. Owing to the fact that by pruning the network memory
gradually decreases, the network cannot rely on any particular features of the
data but rather is forced to concentrate more and more on the general underlying
structure. Interestingly, this part of the procedure can be viewed as being the
opposite of learning until the minimal training error. There the network is forced
by the cleaning noise to learn the global structures first before trying to extract
specific characteristics from the data.
17.8.5 Final Stopping Criteria of the Training
The last question we have to answer is the definition of the final stopping point.
Obviously, after many weight pruning and training iterations, the error will
increase for the training and the validation set. Thus, the minimal error on the
validation set should give us an estimate of when to stop the optimization. In
practice, the error curves during learning are not smooth and monotonic curves
with only one minimum. The most simple advice is to prune until all weights
are eliminated while keeping a trace to store the best intermediate solution
on the validation set. Fig. 17.24 displays the error behavior in the different
phases of the training procedure assuming the worst situation from the viewpoint
of an early stopping method due to the immediately increasing error on the
validation set.
Instability-Pruning offers an alternative definition of a final stopping point. If
we substitute EBD by this pruning method for applying Occam’s Razor, each
Fig. 17.24. The error behavior during the training. Region (1) describes the way to a minimal training error, (2) is the typical behavior during the input pruning on the validation set and (3) shows the consequence of Occam's razor in form of the pruning.
pruning / retraining iteration will increase the stability of the model because
Instability-Pruning deletes unstable weights (see section 17.7.2).
17.8.6 Diagram of the Training Procedure
The following diagram in fig. 17.25 shows the training steps combined into a
unified training procedure.
Fig. 17.25. The diagram of the training procedure; the marks (*) indicate the stopping conditions (sections 17.8.3 and 17.8.4). The boxes of the diagram contain the following steps:
- Data preprocessing and separation in training, validation and test set.
- Selection of a network architecture.
- Initialization: Activate a small weight decay and the cleaning / cleaning noise procedure.
- Fast learning by VarioEta to stop conditions *.
- Structurally correct learning using pattern by pattern learning to stop conditions *. Adjust the learning rate such that the implied curvature penalty function is active.
- Falsify the time stability of the model by input pruning (one input at each time step) or Instability Pruning (5% of the weights) as long as the validation set error drops at least for 5%. Retraining after each pruning using pattern by pattern learning to stop conditions *. Activate a trace of the weights on the validation set.
- Delete encapsulated artefacts in the network structure by Inverse-Kurtosis pruning.
- Occam's Razor: Decrease the memory by weight pruning EBD (or Instability-Pruning) on the training set. Eliminate 10% (5%) of the effective weights per iteration. Retraining after each pruning using pattern by pattern learning to stop conditions *.
- Define the best intermediate solution on the validation set (stored in the trace) as the resulting model.
17.9 Experiments
In a research project sponsored by the European Community we are using the
proposed approach to estimate the returns of 3 financial markets for each of
the G7 countries [12]. These estimations will be subsequently used in an asset
allocation scheme to create a Markowitz-optimal portfolio. In this paper, we only
report the results of the estimation of the German bond rate, which is one of
the more difficult tasks due to the reunification of Germany and GDR. Here, we
have to predict the return of the bond rate 6 months ahead. The inputs consist
of 39 variables obtained by preprocessing 16 relevant financial time series$^6$. The
training set covers the time from April 1974 to December 1991, the validation set
from January 1992 to May 1994. The results are collected using the out-of-sample
data which runs from June 1994 to May 1996. To demonstrate the behavior of the
algorithms, we compare our approach with a standard neural network with one
hidden layer (20 neurons, tanh transfer function) and one linear output (eq. 17.30
as error function). First, we trained the neural network until convergence with
pattern-by-pattern learning using a small batch size of 20 patterns. We refer
to this network as a classical approach. Then, we trained the 11-layer network
(fig. 17.8) using the unified approach as described in section 4.1. Due to small
data sets we used pattern-by-pattern learning without VarioEta. The data were
manipulated by the cleaning and noise method of eq. 17.50. We compare the
resulting predictions of the networks on the basis of three performance measures
(see tab. 17.3). First, the hit rate counts how often the sign of the return of
the bond has been correctly predicted. As for the other measures, the step from
the forecast model to a trading system is here kept very simple. If the output
is positive, we buy shares of the bond, otherwise we sell them. The potential
realized is the ratio of the return to the maximum possible return over the
training (test) set. The annualized return is the average yearly profit of the
trading systems. Our approach turns out to be superior. For example, we almost
doubled the annualized return from 4.5% to 8.5% on the test set. In fig. 17.26,
we compare the accumulated return of the two approaches on the test set. The unified approach not only shows a higher profitability, but also has a far smaller maximal drawdown.
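The exact computation of these measures is not spelled out in the text; one plausible reading, for a simple sign-based trading rule on per-period returns, is the following sketch (all names are illustrative).

```python
import numpy as np

def trading_measures(pred, actual, periods_per_year=12):
    """Hit rate, realized potential and annualized return of a sign-based
    trading rule (one plausible reading of the measures in the text).
    pred, actual : arrays of predicted and realized returns per period."""
    position = np.sign(pred)                       # buy if positive, else sell
    profit = position * actual
    hit_rate = np.mean(np.sign(pred) == np.sign(actual))
    realized_potential = profit.sum() / np.abs(actual).sum()   # vs. perfect foresight
    annualized_return = profit.mean() * periods_per_year
    return hit_rate, realized_potential, annualized_return
```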
Table 17.3. Comparison of the hit rate, the realized potential and the annualized
return of the two networks for the training (test) data
network | hit rate | realized potential | annualized return
our approach | 96% (81%) | 100% (75%) | 11.22% (8.5%)
classical approach | 93% (66%) | 96% (44%) | 10.08% (4.5%)
$^6$ At the moment, we are not allowed to be more specific on the actual data we used.
Fig. 17.26. Comparison of the accumulated profit&loss curve of the two approaches (accumulated profit&loss over time, unified vs. classical)
17.10 Conclusion
A typical regression analysis by neural networks begins with the statement:
“In principle, a neural network can model everything”. If the data is not very reliable due to noise or missing factors, learning is only one part of the model building procedure. Often, task-independent regularization is used as a weapon
building process demands more information (prior knowledge) about the specific
task.
We have shown that there are a large number of techniques for including ad-
ditional constraints which depend on the problem. Consequently, the resulting
model is not only based on the training data but also on the additional con-
straints and also on the exact steps of the training procedure. From a Bayesian
viewpoint, this can be described as an integration of priors in the model build-
ing process. Thus, this article can also be interpreted as a discussion of valuable
priors and of how to combine them in order to achieve a maximal synergy.
The authors believe that the additional features are an outstanding advantage
of neural networks and will lead to more robust and successful models. This paper gives an overview of some of the features we have found useful in our applications.
Finally, the unified training procedure is a recipe for building a model with
neural networks by way of a relatively simple sequence of steps. An important
aspect of this paper has been the study of the interactions between the different
techniques. The described algorithms are integrated in the Simulation Environ-
ment for Neural Networks, SENN, a product of Siemens AG. More information
can be found on the web page
http://www.senn.sni.de.
References
[1] Ackley, D.H., Hinton, G.E., Sejnowski, T.J.: A learning algorithm for Boltzmann
machines. Cognitive Science 9, 147–169 (1985); Reprinted in [2]
[2] Anderson, J.A., Rosenfeld, E. (eds.): Neurocomputing: Foundations of Research.
The MIT Press, Cambridge (1988)
[3] Bishop, C.M.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford
(1995)
[4] Breiman, L.: Bagging predictors. Technical Report TR No. 421, Department of
Statistics, University of California (1994)
[5] Bunke, H., Bunke, O.: Nonlinear Regression, Functional Analysis and Robust
Methods, vol. 2. John Wiley and Sons (1989)
[6] Caruana, R.: Multitask learning. Machine Learning 28, 41 (1997)
[7] Elton, E.J., Gruber, M.J.: Modern Portfolio Theory and Investment Analysis.
John Wiley & Sons (1995)
[8] Finnoff, W., Hergert, F., Zimmermann, H.G.: Improving generalization perfor-
mance by nonconvergent model selection methods. In: Aleksander, I., Taylor, J.
(eds.) Proc. of the Inter. Conference on Artificial Neural Networks, ICANN 1992,
vol. 2, pp. 233–236 (1992)
[9] Finnoff, W., Hergert, F., Zimmermann, H.G.: Neuronale Lernverfahren mit vari-
abler Schrittweite, Tech. report, Siemens AG (1993)
[10] Flake, G.W.: Square Unit Augmented, Radially Extended, Multilayer Percep-
trons. In: Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 1st edn. LNCS,
vol. 7700, pp. 143–161. Springer, Heidelberg (2012)
[11] Gershenfeld, N.A.: An experimentalist’s introduction to the observation of dynam-
ical systems. In: Hao, B.L. (ed.) Directions in Chaos, vol. 2, pp. 310–384. World
Scientific, Singapore (1989)
[12] Herve, P., Naim, P., Zimmermann, H.G.: Advanced Adaptive Architectures for
Asset Allocation: A Trial Application. In: Forecasting Financial Markets (1996)
[13] Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Computation 9(1), 1–42
(1997)
[14] Hornik, K.: Approximation Capabilities of Multilayer Feedforward Networks. Neu-
ral Networks 4, 251–257 (1991)
[15] le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretzky, D.S.
(ed.) Advances in Neural Information Processing Systems, NIPS 1989, vol. 2, pp.
598–605. Morgan Kaufmann, San Mateo (1990)
[16] Moody, J.E., Rögnvaldsson, T.S.: Smoothing regularizers for projective basis func-
tion networks. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neu-
ral Information Processing Systems, vol. 9, p. 585. The MIT Press (1997)
[17] Williams, P.M.: Using Neural Networks to Model Conditional Multivariate Den-
sities. Technical Report CSRP 371, School of Cognitive and Computing Sciences,
Univ. of Sussex (February 1995)
[18] Neuneier, R.: Optimal asset allocation using adaptive dynamic programming. In:
Advances in Neural Information Processing Systems, vol. 8. MIT Press (1996)
[19] Neuneier, R.: Optimale Investitionsentscheidungen mit Neuronalen Netzen. PhD
thesis, Universität Kaiserslautern, Institut für Informatik (1998)
[20] Neuneier, R., Finnoff, W., Hergert, F., Ormoneit, D.: Estimation of Conditional
Densities: A Comparison of Neural Network Approaches. In: Intern. Conf. on
Artificial Neural Networks, ICANN, vol. 1, pp. 689–692. Springer (1994)
[21] Nix, D.A., Weigend, A.S.: Estimating the mean and variance of the target prob-
ability distribution. In: World Congress of Neural Networks. Lawrence Erlbaum
Associates (1994)
[22] Ormoneit, D.: Estimation of Probability Densities using Neural Networks. Mas-
ter’s thesis, Fakultät für Informatik, Technische Universität München (1993)
[23] Papoulis, A.: Probability, Random Variables, and Stochastic Processes, 3rd edn.
McGraw Hill, Inc. (1991)
[24] Perrone, M.P.: Improving Regression Estimates: Averaging Methods for Variance
Reduction with Extensions to General Convex Measure Optimization. PhD thesis,
Brown University (1993)
[25] Refenes, A.P. (ed.): Neural Networks in the Capital Market. Wiley & Sons (1994)
[26] Sanger, T.D.: Optimal unsupervised learning in a single-layer linear feedforward
network. Neural Networks 2, 459–473 (1989)
[27] Seber, G.A.F., Wild, C.J.: Nonlinear Regression. John Wiley & Sons, New York
(1989)
[28] Srivastava, A.N., Weigend, A.S.: Computing the probability density in connection-
ist regression. In: Marinaro, M., Morasso, P.G. (eds.) Proceedings of the Interna-
tional Conference on Artificial Neural Networks, Sorrento, Italy (ICANN 1994),
pp. 685–688. Springer (1994); Also in Proceedings of the IEEE International Con-
ference on Neural Networks, Orlando, FL (IEEE–ICNN 1994), pp. 3786–3789.
IEEE Press (1994)
[29] Takens, F.: Detecting Strange Attractors in Turbulence. In: Rand, D.A., Young,
L.S. (eds.) Dynamical Systems and Turbulence. Lecture Notes in Mathematics,
vol. 898, pp. 366–381. Springer (1981)
[30] Tang, B., Hsieh, W., Tangang, F.: Clearning neural networks with continuity con-
straints for prediction of noisy time series. In: Progress in Neural Information Pro-
cessing (ICONIP 1996), pp. 722–725. Springer, Berlin (1996)
[31] Tresp, V., Neuneier, R., Zimmermann, H.G.: Early brain damage. In: Advances
in Neural Information Processing Systems, vol. 9. MIT Press (1997)
[32] Weigend, A.S., Zimmermann, H.G.: Exploiting local relations as soft constraints
to improve forecasting. Computational Intelligence in Finance 6(1) (January 1998)
[33] Weigend, A.S., Zimmermann, H.G., Neuneier, R.: The observer-observation
dilemma in neuro-forecasting: Reliable models from unreliable data through
clearning. In: Freedman, R. (ed.) AI Applications on Wall Street, pp. 308–317.
Software Engineering Press, New York (1995)
[34] Weigend, A.S., Rumelhart, D.E., Huberman, B.A.: Generalization by weight-
elimination with application to forecasting. In: Lippmann, R.P., Moody, J.E.,
Touretzky, D.S. (eds.) Advances in Neural Information Processing Systems, vol.
3, pp. 875–882. Morgan Kaufmann, San Mateo (1991)
[35] White, H.: Parametrical statistical estimation with artificial neural networks.
Technical report, University of California, San Diego (1991)
[36] Zimmermann, H.G., Weigend, A.S.: Representing dynamical systems in feed-
forward networks: A six layer architecture. In: Weigend, A.S., Abu-Mostafa, Y.,
Refenes, A.-P.N. (eds.) Decision Technologies for Financial Engineering: Proceed-
ings of the Fourth International Conference on Neural Networks in the Capital
Markets (NNCM 1996). World Scientific, Singapore (1997)
[37] Zimmermann, H.G.: Neuronale Netze als Entscheidungskalkül. In: Rehkugler, H.,
Zimmermann, H.G. (eds.) Neuronale Netze in der Ökonomie. Verlag Franz Vahlen
(1994)
[38] Zimmermann, H.G., Neuneier, R.: The observer-observation dilemma in neuro-
forecasting. In: Advances in Neural Information Processing Systems, vol. 10. MIT
Press (1998)
Big Learning and Deep Neural Networks
Preface
More data and compute resources open the way to “big learning”, that is, scaling
up machine learning to large data sets and complex problems. In order to solve
these new problems, we need to identify the complex dependencies that interre-
late inputs and outputs [1]. Achieving this goal requires powerful algorithms in
terms of both representational power and ability to absorb and distill information
from large data flows. More and more it becomes apparent that neural networks
provide an excellent toolset to scale learning. This holds true for simple linear
systems as well as today’s grand challenges where the underlying application
problem requires high nonlinearity and complex structured representations.
The following four chapters introduce the basic toolbox for solving these
complex learning problems. They include first and second-order optimization
methods, the best practice of training neural networks and an introduction to
Torch7, a neural network library for implementing large-scale learning problems.
A first challenge is to feed the model with enough data in order to prop-
erly identify the rich set of dependencies. Chapter 18 [2] presents the stochastic
gradient descent algorithm (learning one example at the time), showing that it
can minimize the objective function at a rate that is under certain conditions
asymptotically optimal.
Neural networks are a useful framework to carry out such optimization.
The correspondence principle for neural networks that will be detailed later in
Chapter 28 [7] states that we can associate to an equation or an optimization
problem, a neural network architecture that reduces the learning problem to an
exchange of local signals between neighboring units of the network. Therefore,
the tractability of the optimization problem is much dependent on the neural
network architecture and its multiple hyperparameters [3]. These hyperparame-
ters and the methods to choose them are dissected in Chapter 19 [5].
In some cases, the multiple hyperparameters of the neural network are insuf-
ficient to produce a well-conditioned optimization problem. Difficult optimiza-
tion problems include, for example, recurrent neural networks or deep neural
networks, both involving long-range dependencies. Exploiting second order in-
formation (i.e. the second derivatives of the objective function) can lead to faster
convergence. Second-order methods are much more sensitive to noise than first
order methods and require carefully tuned optimization parameters in order to
achieve a good speed-stability tradeoff. Second-order methods for deep and re-
current neural networks are discussed in much depth in Chapter 20 [6].
Finally, algorithms must be translated into code that can be run by the
machine. Chapter 21 [4] presents Torch7, an efficient and easy-to-use software
for building and training neural networks. The software provides a large number
of neural network primitives and is highly modular so that it can be applied to a
wide range of problems including computer vision, natural language processing,
speech recognition and many more.
Grégoire & Klaus
References
[1] Bengio, Y., LeCun, Y.: Scaling learning algorithms towards AI. In: Large Scale
Kernel Machines (2007)
[2] Bottou, L.: Stochastic Gradient Descent Tricks. In: Montavon, G., Orr, G.B.,
Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 421–
436. Springer, Heidelberg (2012)
[3] LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr, G.B.,
Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer, Heidel-
berg (1998)
[4] Collobert, R., Kavukcuoglu, K., Farabet, C.: Implementing Neural Networks Effi-
ciently. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 537–557. Springer, Heidelberg (2012)
[5] Bengio, Y.: Practical Recommendations for Gradient-based Training of Deep Ar-
chitectures. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the
Trade, 2nd edn. LNCS, vol. 7700, pp. 437–478. Springer, Heidelberg (2012)
[6] Martens, J., Sutskever, I.: Training Deep and Recurrent Networks with Hessian-free
Optimization. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the
Trade, 2nd edn. LNCS, vol. 7700, pp. 479–535. Springer, Heidelberg (2012)
[7] Zimmermann, H.-G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural
Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–707. Springer, Heidelberg (2012)
18
Stochastic Gradient Descent Tricks
Léon Bottou
Microsoft Research, Redmond, WA
leon@bottou.org
http://leon.bottou.org
Abstract. Chapter 1 strongly advocates the stochastic back-propagation
method to train neural networks. This is in fact an instance of a
more general technique called stochastic gradient descent (SGD). This
chapter provides background material, explains why SGD is a good
learning algorithm when the training set is large, and provides useful
recommendations.
18.1 Introduction
Chapter 1 strongly advocates the stochastic back-propagation method to train
neural networks. This is in fact an instance of a more general technique called
stochastic gradient descent (SGD). This chapter provides background material,
explains why SGD is a good learning algorithm when the training set is large,
and provides useful recommendations.
18.2 What Is Stochastic Gradient Descent?
Let us first consider a simple supervised learning setup. Each example z is a pair
$(x, y)$ composed of an arbitrary input $x$ and a scalar output $y$. We consider a loss function $\ell(\hat y, y)$ that measures the cost of predicting $\hat y$ when the actual answer is $y$, and we choose a family $\mathcal F$ of functions $f_w(x)$ parametrized by a weight vector $w$. We seek the function $f \in \mathcal F$ that minimizes the loss $Q(z, w) = \ell(f_w(x), y)$ averaged on the examples. Although we would like to average over the unknown distribution $dP(z)$ that embodies the Laws of Nature, we must often settle for computing the average on a sample $z_1 \ldots z_n$.
$$E(f) = \int \ell(f(x), y)\, dP(z) \qquad\qquad E_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i), y_i) \qquad (18.1)$$
The empirical risk $E_n(f)$ measures the training set performance. The expected
risk E(f) measures the generalization performance, that is, the expected
performance on future examples. The statistical learning theory [25] justifies
minimizing the empirical risk instead of the expected risk when the chosen family
F is sufficiently restrictive.
18.2.1 Gradient Descent
It has often been proposed (e.g., [18]) to minimize the empirical risk $E_n(f_w)$ using gradient descent (GD). Each iteration updates the weights $w$ on the basis of the gradient of $E_n(f_w)$,
$$w_{t+1} = w_t - \gamma\,\frac{1}{n}\sum_{i=1}^{n} \nabla_w Q(z_i, w_t)\,, \qquad (18.2)$$
where $\gamma$ is an adequately chosen learning rate. Under sufficient regularity assumptions, when the initial estimate $w_0$ is close enough to the optimum, and when the learning rate $\gamma$ is sufficiently small, this algorithm achieves linear convergence [6], that is, $-\log\rho \sim t$, where $\rho$ represents the residual error.$^1$
Much better optimization algorithms can be designed by replacing the scalar
learning rate $\gamma$ by a positive definite matrix $\Gamma_t$ that approaches the inverse of the Hessian of the cost at the optimum:
$$w_{t+1} = w_t - \Gamma_t\,\frac{1}{n}\sum_{i=1}^{n} \nabla_w Q(z_i, w_t)\,. \qquad (18.3)$$
This second order gradient descent (2GD) is a variant of the well known Newton
algorithm. Under sufficiently optimistic regularity assumptions, and provided
that $w_0$ is sufficiently close to the optimum, second order gradient descent achieves quadratic convergence. When the cost is quadratic and the scaling matrix $\Gamma$ is exact, the algorithm reaches the optimum after a single iteration. Otherwise, assuming sufficient smoothness, we have $-\log\log\rho \sim t$.
18.2.2 Stochastic Gradient Descent
The stochastic gradient descent (SGD) algorithm is a drastic simplification.
Instead of computing the gradient of $E_n(f_w)$ exactly, each iteration estimates this gradient on the basis of a single randomly picked example $z_t$:
$$w_{t+1} = w_t - \gamma_t\, \nabla_w Q(z_t, w_t)\,. \qquad (18.4)$$
The stochastic process $\{w_t,\ t = 1, \ldots\}$ depends on the examples randomly picked
at each iteration. It is hoped that (18.4) behaves like its expectation (18.2)
despite the noise introduced by this simplified procedure.
Since the stochastic algorithm does not need to remember which examples
were visited during the previous iterations, it can process examples on the fly in
a deployed system. In such a situation, the stochastic gradient descent directly
optimizes the expected risk, since the examples are randomly drawn from the
ground truth distribution.
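A minimal sketch of the update (18.4), assuming a helper `grad_Q(z, w)` that returns the per-example gradient; the constant learning rate used as a default is an arbitrary illustration, not a recommendation from the text.

```python
import numpy as np

def sgd(grad_Q, w0, examples, gamma=lambda t: 0.01, epochs=1, seed=0):
    """Plain stochastic gradient descent as in eq. (18.4): one randomly
    picked example per update (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    w, t = np.array(w0, dtype=float), 0
    for _ in range(epochs):
        for i in rng.permutation(len(examples)):   # random order over examples
            w = w - gamma(t) * grad_Q(examples[i], w)
            t += 1
    return w
```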
$^1$ For mostly historical reasons, linear convergence means that the residual error
asymptotically decreases exponentially, and quadratic convergence denotes an even
faster asymptotic convergence. Both convergence rates are considerably faster than
the SGD convergence rates discussed in section 18.2.3.
Table 18.1. Stochastic gradient algorithms for various learning systems

Adaline [26]
Loss: $Q_{\text{adaline}} = \frac{1}{2}\big(y - w^\top\Phi(x)\big)^2$; Features $\Phi(x) \in \mathbb{R}^d$, Classes $y = \pm 1$.
Update: $w \leftarrow w + \gamma_t\big(y_t - w^\top\Phi(x_t)\big)\,\Phi(x_t)$

Perceptron [17]
Loss: $Q_{\text{perceptron}} = \max\{0,\, -y\, w^\top\Phi(x)\}$; Features $\Phi(x) \in \mathbb{R}^d$, Classes $y = \pm 1$.
Update: $w \leftarrow w + \gamma_t \begin{cases} y_t\,\Phi(x_t) & \text{if } y_t\, w^\top\Phi(x_t) \le 0 \\ 0 & \text{otherwise} \end{cases}$

K-Means [12]
Loss: $Q_{\text{kmeans}} = \min_k \frac{1}{2}(z - w_k)^2$; Data $z \in \mathbb{R}^d$, Centroids $w_1 \ldots w_k \in \mathbb{R}^d$, Counts $n_1 \ldots n_k \in \mathbb{N}$, initially 0.
Update: $k^* = \arg\min_k (z_t - w_k)^2$, $\ n_{k^*} \leftarrow n_{k^*} + 1$, $\ w_{k^*} \leftarrow w_{k^*} + \frac{1}{n_{k^*}}(z_t - w_{k^*})$
(counts provide optimal learning rates!)

SVM [5]
Loss: $Q_{\text{svm}} = \lambda w^2 + \max\{0,\, 1 - y\, w^\top\Phi(x)\}$; Features $\Phi(x) \in \mathbb{R}^d$, Classes $y = \pm 1$, Hyperparameter $\lambda > 0$.
Update: $w \leftarrow w - \gamma_t \begin{cases} \lambda w & \text{if } y_t\, w^\top\Phi(x_t) > 1 \\ \lambda w - y_t\,\Phi(x_t) & \text{otherwise} \end{cases}$

Lasso [23]
Loss: $Q_{\text{lasso}} = \lambda|w|_1 + \frac{1}{2}\big(y - w^\top\Phi(x)\big)^2$ with $w = (u_1 - v_1, \ldots, u_d - v_d)$; Features $\Phi(x) \in \mathbb{R}^d$, Classes $y = \pm 1$, Hyperparameter $\lambda > 0$.
Update: $u_i \leftarrow \big[\,u_i - \gamma_t\big(\lambda - (y_t - w^\top\Phi(x_t))\,\Phi_i(x_t)\big)\big]_+$, $\quad v_i \leftarrow \big[\,v_i - \gamma_t\big(\lambda + (y_t - w^\top\Phi(x_t))\,\Phi_i(x_t)\big)\big]_+$
with notation $[x]_+ = \max\{0, x\}$.
Table 18.1 illustrates stochastic gradient descent algorithms for a number
of classic machine learning schemes. The stochastic gradient descent for the
Perceptron, for the Adaline, and for k-Means match the algorithms proposed in
the original papers. The SVM and the Lasso were first described with traditional
optimization techniques. Both $Q_{\text{svm}}$ and $Q_{\text{lasso}}$ include a regularization term controlled by the hyper-parameter $\lambda$. The K-means algorithm converges to a local minimum because $Q_{\text{kmeans}}$ is nonconvex. On the other hand, the proposed
update rule uses second order learning rates that ensure a fast convergence. The
proposed Lasso algorithm represents each weight as the difference of two positive
variables. Applying the stochastic gradient rule to these variables and enforcing
their positivity leads to sparser solutions.
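As an illustration, a minimal sketch of the SVM row of Table 18.1; the decreasing learning-rate schedule used here is an arbitrary choice for the sketch, not prescribed by the table.

```python
import numpy as np

def sgd_svm(X, y, lam=1e-3, gamma0=0.1, epochs=10, seed=0):
    """SGD for the linear SVM of Table 18.1 (hinge loss plus lambda*||w||^2).
    X : array (n, d) of features Phi(x);  y : array (n,) of labels +/-1."""
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            gamma_t = gamma0 / (1.0 + t)          # illustrative schedule
            if y[i] * (w @ X[i]) > 1.0:           # margin satisfied
                w -= gamma_t * lam * w            # only the regularizer acts
            else:                                  # hinge loss is active
                w -= gamma_t * (lam * w - y[i] * X[i])
            t += 1
    return w
```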
18.2.3 The Convergence of Stochastic Gradient Descent
The convergence of stochastic gradient descent has been studied extensively
in the stochastic approximation literature. Convergence results usually require
decreasing learning rates satisfying the conditions $\sum_t \gamma_t^2 < \infty$ and $\sum_t \gamma_t = \infty$.
The Robbins-Siegmund theorem [16] provides the means to establish almost sure
convergence under surprisingly mild conditions [3], including cases where the loss
function is non smooth.
The convergence speed of stochastic gradient descent is in fact limited by
the noisy approximation of the true gradient. When the learning rates decrease
too slowly, the variance of the parameter estimate $w_t$ decreases equally slowly. When the learning rates decrease too quickly, the expectation of the parameter estimate $w_t$ takes a very long time to approach the optimum.
When the Hessian matrix of the cost function at the optimum is strictly
positive definite, the best convergence speed is achieved using learning rates
$\gamma_t \sim t^{-1}$ (e.g. [14]). The expectation of the residual error then decreases with similar speed, that is, $E(\rho) \sim t^{-1}$. These theoretical convergence rates are frequently observed in practice.
When we relax these regularity assumptions, the theory suggests slower
asymptotic convergence rates, typically like $E(\rho) \sim t^{-1/2}$ (e.g., [28]). In
practice, the convergence only slows down during the final stage of the
optimization process. This may not matter in practice because one often
stops the optimization before reaching this stage (see section 18.3.1.)
Second order stochastic gradient descent (2SGD) multiplies the gradients by a
positive definite matrix $\Gamma_t$ approaching the inverse of the Hessian:
$$w_{t+1} = w_t - \gamma_t\,\Gamma_t\, \nabla_w Q(z_t, w_t)\,. \qquad (18.5)$$
Unfortunately, this modification does not reduce the stochastic noise and therefore does not significantly improve the variance of $w_t$. Although constants are improved, the expectation of the residual error still decreases like $t^{-1}$, that is, $E(\rho) \sim t^{-1}$ at best (e.g. [1], appendix).
Therefore, as an optimization algorithm, stochastic gradient descent is asymp-
totically much slower than a typical batch algorithm. However, this is not the
whole story...
18.3 When to Use Stochastic Gradient Descent?
During the last decade, the data sizes have grown faster than the speed
of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. The
analysis presented in this section shows that stochastic gradient descent performs
very well in this context.
Use stochastic gradient descent
when training time is the bottleneck.
18.3.1 The Trade-Offs of Large Scale Learning
Let f
=argmin
f
E(f) be the best possible prediction function. Since we
seek the prediction function from a parametrized family of functions F,let
18. Stochastic Gradient Descent Tricks 425
f
F
=argmin
f∈F
E(f) be the best function in this family. Since we optimize
the empirical risk instead of the expected risk, let f
n
=argmin
f∈F
E
n
(f)
be the empirical optimum. Since this optimization can be costly, let us stop
the algorithm when it reaches a solution
˜
f
n
that minimizes the objective
function with a predefined accuracy E
n
(
˜
f
n
) <E
n
(f
n
)+ρ. The excess error
E = E
E(
˜
f
n
) E(f
)
can then be decomposed in three terms [2] :
E = E
E(f
F
) E(f
)

E
app
+ E
E(f
n
) E(f
F
)

E
est
+ E
E(
˜
f
n
) E(f
n
)

E
opt
. (18.6)
– The approximation error E_app = E[ E(f∗_F) − E(f∗) ] measures how closely
  functions in F can approximate the optimal solution f∗. The approximation
  error can be reduced by choosing a larger family of functions.
– The estimation error E_est = E[ E(f_n) − E(f∗_F) ] measures the effect of
  minimizing the empirical risk E_n(f) instead of the expected risk E(f). The
  estimation error can be reduced by choosing a smaller family of functions or
  by increasing the size of the training set.
– The optimization error E_opt = E[ E(f̃_n) − E(f_n) ] measures the impact of the
  approximate optimization on the expected risk. The optimization error can
  be reduced by running the optimizer longer. The additional computing time
  depends of course on the family of functions and on the size of the training
  set.
Given constraints on the maximal computation time T_max and the maximal
training set size n_max, this decomposition outlines a trade-off involving the size
of the family of functions F, the optimization accuracy ρ, and the number of
examples n effectively processed by the optimization algorithm:

    min_{F, ρ, n}  E = E_app + E_est + E_opt   subject to   n ≤ n_max  and  T(F, ρ, n) ≤ T_max .    (18.7)
Two cases should be distinguished:

– Small-scale learning problems are first constrained by the maximal number
  of examples. Since the computing time is not an issue, we can reduce the
  optimization error E_opt to insignificant levels by choosing ρ arbitrarily small,
  and we can minimize the estimation error E_est by choosing n = n_max. We
  then recover the approximation-estimation trade-off that has been widely
  studied in statistics and in learning theory.
– Large-scale learning problems are constrained by the maximal computing
  time, usually because the supply of training examples is very large. Approximate
  optimization can achieve a better expected risk because more training
  examples can be processed during the allowed time. The specifics depend on
  the computational properties of the chosen optimization algorithm.
18.3.2 Asymptotic Analysis of the Large-Scale Case
Solving (18.7) in the asymptotic regime amounts to ensuring that the terms of the
decomposition (18.6) decrease at similar rates. Since the asymptotic convergence
rate of the excess error (18.6) is the convergence rate of its slowest term, the
computational effort required to make a term decrease faster would be wasted.
For simplicity, we assume in this section that the Vapnik-Chervonenkis
dimensions of the families of functions F are bounded by a common constant. We
also assume that the optimization algorithms satisfy all the assumptions required
to achieve the convergence rates discussed in section 18.2. Similar analyses can
be carried out for specific algorithms under weaker assumptions (e.g. [22]).
A simple application of the uniform convergence results of [25] then gives the
upper bound

    E = E_app + E_est + E_opt = E_app + O( √(log n / n) + ρ ) .
Unfortunately the convergence rate of this bound is too pessimistic. Faster
convergence occurs when the loss function has strong convexity properties [9]
or when the data distribution satisfies certain assumptions [24]. The equivalence

    E = E_app + E_est + E_opt ∼ E_app + ( log n / n )^α + ρ ,   for some α ∈ [1/2, 1] ,    (18.8)

provides a more realistic view of the asymptotic behavior of the excess error (e.g.
[13, 4]). Since the three components of the excess error should decrease at the
same rate, the solution of the trade-off problem (18.7) must then obey the
multiple asymptotic equivalences

    E ∼ E_app ∼ E_est ∼ E_opt ∼ ( log n / n )^α ∼ ρ .    (18.9)
Table 18.2 summarizes the asymptotic behavior of the four gradient algorithms
described in section 18.2. The first three rows list the computational cost of each
iteration, the number of iterations required to reach an optimization accuracy
ρ, and the corresponding computational cost. The last row provides a more
interesting measure for large scale machine learning purposes. Assuming we
operate at the optimum of the approximation-estimation-optimization trade-
off (18.7), this line indicates the computational cost necessary to reach a
predefined value of the excess error, and therefore of the expected risk. This
is computed by applying the equivalences (18.9) to eliminate the variables n and
ρ from the third row results.²

² Note that ε ∼ (log(n)/n)^α implies both α^{−1} log ε ∼ log log(n) − log(n) ∼ −log(n)
and n ∼ ε^{−1/α} log n. Replacing log(n) in the latter gives n ∼ ε^{−1/α} log(1/ε).
Table 18.2. Asymptotic equivalents for various optimization algorithms: gradient
descent (GD, eq. 18.2), second order gradient descent (2GD, eq. 18.3), stochastic
gradient descent (SGD, eq. 18.4), and second order stochastic gradient descent (2SGD,
eq. 18.5). Although they are the worst optimization algorithms, SGD and 2SGD achieve
the fastest convergence speed on the expected risk. They differ only by constant factors
not shown in this table, such as condition numbers and weight vector dimension.

                                 GD                        2GD                                   SGD    2SGD
    Time per iteration :         n                         n                                     1      1
    Iterations to accuracy ρ :   log(1/ρ)                  log log(1/ρ)                          1/ρ    1/ρ
    Time to accuracy ρ :         n log(1/ρ)                n log log(1/ρ)                        1/ρ    1/ρ
    Time to excess error E :     (1/E^{1/α}) log²(1/E)     (1/E^{1/α}) log(1/E) log log(1/E)     1/E    1/E
Although the stochastic gradient algorithms, SGD and 2SGD, are clearly the
worst optimization algorithms (third row), they need less time than the other
algorithms to reach a predefined expected risk (fourth row). Therefore,
in the large scale setup, that is, when the limiting factor is the computing time
rather than the number of examples, the stochastic learning algorithms perform
asymptotically better!
18.4 General Recommendations
The rest of this contribution provides a series of recommendations for using
stochastic gradient algorithms. Although some of these recommendations seem
trivial, experience has shown again and again how easily they can be overlooked.
18.4.1 Preparing the Data
Randomly shuffle the training examples.
Although the theory calls for picking examples randomly, it is usually faster to
zip sequentially through the training set. But this does not work if the examples
are grouped by class or come in a particular order. Randomly shuffling the
examples eliminates this source of problems. Section 1.4.2 provides an additional
discussion.
Use preconditioning techniques.
Stochastic gradient descent is a first-order algorithm and therefore suffers
dramatically when it reaches an area where the Hessian is ill-conditioned.
Fortunately, many simple preprocessing techniques can vastly improve the
situation. Sections 1.4.3 and 1.5.3 provide many useful tips.
18.4.2 Monitoring and Debugging
Monitor both the training cost
and the validation error.
Since stochastic gradient descent is useful when the training time is the primary
concern, we can spare some training examples to build a decent validation set. It
is important to periodically evaluate the validation error during training because
we can stop training when we observe that the validation error has not improved
in a long time.
It is also important to periodically compute the training cost because
stochastic gradient descent is an iterative optimization algorithm. Since the
training cost is exactly what the algorithm seeks to optimize, the training cost
should be generally decreasing.
A good approach is to repeat the following operations:
1. Zip once through the shuffled training set and perform the stochastic gradient
descent updates (18.4).
2. With an additional loop over the training set, compute the training cost.
Training cost here means the criterion that the algorithm seeks to optimize.
You can take advantage of the loop to compute other metrics, but the
training cost is the one to watch.
3. With an additional loop over the validation set, compute the validation
set error. Error here means the performance measure of interest, such as
the classification error. You can also take advantage of this loop to cheaply
compute other metrics.
Computing the training cost and the validation error represents a significant
computational effort because it requires additional passes over the training and
validation data. But this beats running blind.
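As an illustration, here is a minimal Python sketch of this monitoring loop. The helpers sgd_update, cost, and error are hypothetical stand-ins for the update (18.4), the per-example training criterion, and the validation performance measure; they are passed in as callables.

    import random

    def train_with_monitoring(w, train_set, valid_set, sgd_update, cost, error, n_epochs=10):
        # sgd_update(w, z) applies one stochastic gradient update (18.4) in place,
        # cost(w, z) returns the per-example training criterion, and
        # error(w, z) returns the performance measure of interest (e.g. 0/1 loss).
        random.shuffle(train_set)                      # shuffle the examples once
        for epoch in range(n_epochs):
            for z in train_set:                        # 1. one sweep of SGD updates
                sgd_update(w, z)
            train_cost = sum(cost(w, z) for z in train_set) / len(train_set)
            valid_err = sum(error(w, z) for z in valid_set) / len(valid_set)
            print(epoch, train_cost, valid_err)        # 2.-3. monitor both quantities
        return w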
Check the gradients using finite differences.
When the computation of the gradients is slightly incorrect, stochastic gradient
descent often works slowly and erratically. This has led many to believe that
slow and erratic is the normal operation of the algorithm.
During the last twenty years, I have often been approached for advice in setting
the learning rates γ_t of some rebellious stochastic gradient descent program.
My advice is to forget about the learning rates and check that the gradients
are computed correctly. This reply is biased because people who compute the
gradients correctly quickly find that setting small enough learning rates is easy.
Those who ask usually have incorrect gradients. Carefully checking each line of
the gradient computation code is the wrong way to check the gradients. Use
finite differences:
1. Pick an example z.
2. Compute the loss Q(z, w) for the current w.
3. Compute the gradient g = ∇_w Q(z, w).
4. Apply a slight perturbation w′ = w + δ. For instance, change a single weight
   by a small increment, or use δ = γ g with γ small enough.
5. Compute the new loss Q(z, w′) and verify that Q(z, w′) ≈ Q(z, w) + δ⊤g .
This process can be automated and should be repeated for many examples
z, many perturbations δ, and many initial weights w. Flaws in the gradient
computation tend to only appear when peculiar conditions are met. It is not
uncommon to discover such bugs in SGD code that has been quietly used for
years.
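A finite-difference check along these lines might look as follows in Python; Q and grad_Q are hypothetical callables returning the loss and the gradient computed by the implementation under test, and w is a NumPy weight vector.

    import numpy as np

    def check_gradient(Q, grad_Q, w, z, scale=1e-6, tol=1e-4):
        # Verify that Q(z, w + delta) is close to Q(z, w) + delta . g
        g = grad_Q(z, w)
        delta = scale * np.random.randn(*w.shape)      # small random perturbation
        predicted = Q(z, w) + delta @ g                # first-order prediction
        actual = Q(z, w + delta)
        denom = max(abs(actual - Q(z, w)), 1e-20)      # size of the first-order change
        return abs(actual - predicted) / denom < tol

The check should be repeated for many examples, perturbations, and initial weights, as recommended above.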
Experiment with the learning rates γ_t using a small sample of the training set.
The mathematics of stochastic gradient descent are amazingly independent of
the training set size. In particular, the asymptotic SGD convergence rates [14] are
independent from the sample size. Therefore, assuming the gradients are correct,
the best way to determine the correct learning rates is to perform experiments
using a small but representative sample of the training set. Because the sample
is small, it is also possible to run traditional optimization algorithms on this
same dataset in order to obtain a reference point and set the training cost target.
When the algorithm performs well on the training cost of the small dataset,
keep the same learning rates, and let it soldier on the full training set. Expect the
validation performance to plateau after a number of epochs roughly comparable
to the number of epochs needed to reach this point on the small training set.
18.5 Linear Models with L2 Regularization
This section provides specific recommendations for training large linear models
with L2 regularization. The training objective of such models has the form

    E_n(w) = (λ/2) ‖w‖² + (1/n) ∑_{t=1}^{n} ℓ(y_t w⊤x_t)    (18.10)

where y_t = ±1, and where the function ℓ(m) is convex. The corresponding
stochastic gradient update is then obtained by approximating the derivative of
the sum by the derivative of the loss with respect to a single example:

    w_{t+1} = (1 − γ_t λ) w_t − γ_t y_t x_t ℓ′(y_t w_t⊤x_t)    (18.11)
Examples:

– Support Vector Machines (SVM) use the non-differentiable hinge loss [5]:

      ℓ(m) = max{0, 1 − m} .

– It is often more convenient in the linear case to use the log-loss:

      ℓ(m) = log(1 + e^{−m}) .

  The differentiable log-loss is more suitable for the gradient algorithms
  discussed here. This choice leads to a logistic regression algorithm: probability
  estimates can be derived using the logistic function:

      P(y = +1 | x) ≈ 1 / (1 + e^{−w⊤x}) .
All statistical models with linear parametrization are in fact amenable to
stochastic gradient descent, using the log-likelihood of the model as the loss
function Q(z, w). For instance, results for Conditional Random Fields (CRF)
[8] are reported in Sec. 18.5.4.
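As a concrete illustration, here is a Python sketch of update (18.11) for logistic regression with the log-loss, using the learning rate schedule recommended later in Sec. 18.5.2; the function and variable names are illustrative, not taken from the reference implementation.

    import numpy as np

    def sgd_logreg(X, y, lam, gamma0, n_epochs=5):
        # X is an (n, d) array of examples, y a vector of labels in {-1, +1}.
        # Log-loss l(m) = log(1 + exp(-m)), so l'(m) = -1 / (1 + exp(m)).
        n, d = X.shape
        w = np.zeros(d)
        t = 0
        for epoch in range(n_epochs):
            for i in np.random.permutation(n):             # shuffled pass over the data
                gamma = gamma0 / (1.0 + gamma0 * lam * t)  # schedule of Sec. 18.5.2
                m = y[i] * np.dot(w, X[i])
                dloss = -1.0 / (1.0 + np.exp(m))           # l'(y_t w.x_t)
                w = (1.0 - gamma * lam) * w - gamma * y[i] * dloss * X[i]
                t += 1
        return w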
18.5.1 Sparsity
Leverage the sparsity of the training examples {x_t}.
– Represent w_t as a product s_t W_t where s_t ∈ ℝ.
The training examples often are very high dimensional vectors with only a few
nonzero coefficients. The stochastic gradient update (18.11)

    w_{t+1} = (1 − γ_t λ) w_t − γ_t y_t x_t ℓ′(y_t w_t⊤x_t)

is then inconvenient because it first rescales all coefficients of vector w by factor
(1 − γ_t λ). In contrast, the rest of the update only involves the weight coefficients
corresponding to a nonzero coefficient in the pattern x_t.
Expressing the vector w_t as the product s_t W_t, where s_t is a scalar, provides
a workaround [21]. The stochastic gradient update (18.11) can then be divided
into operations whose complexity scales with the number of nonzero terms in x_t:

    g_t = ℓ′(y_t s_t W_t⊤x_t) ,
    s_{t+1} = (1 − γ_t λ) s_t ,
    W_{t+1} = W_t − γ_t y_t g_t x_t / s_{t+1} .
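In Python, one such update step could be sketched as follows, with the sparse example represented as a dictionary of nonzero coefficients (an illustrative layout, not the one used in the reference implementation).

    def sparse_sgd_step(W, s, x_nz, y, gamma, lam, dloss):
        # The actual weights are w = s * W; x_nz maps feature index -> value.
        m = y * s * sum(W[j] * v for j, v in x_nz.items())   # y_t w_t . x_t, sparse dot product
        g = dloss(m)                                         # g_t = l'(y_t s_t W_t . x_t)
        s_new = (1.0 - gamma * lam) * s                      # the scalar absorbs the rescaling
        for j, v in x_nz.items():                            # touch only the nonzero features
            W[j] -= gamma * y * g * v / s_new
        return W, s_new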
18.5.2 Learning Rates
Use learning rates of the form γ_t = γ_0 (1 + γ_0 λ t)^{−1}.
Determine the best γ_0 using a small training data sample.
When the Hessian matrix of the cost function at the optimum is strictly positive,
the best convergence speed is achieved using learning rates of the form (λ_min t)^{−1}
where λ_min is the smallest eigenvalue of the Hessian [14]. The theoretical analysis
also shows that overestimating λ_min by more than a factor two leads to very
slow convergence. Although we do not know the exact value of λ_min, the L2
regularization term in the training objective function means that λ_min ≥ λ.
Therefore we can safely use learning rates that asymptotically decrease like
(λt)^{−1}.
Unfortunately, simply using γ_t = (λt)^{−1} leads to very large learning rates in
the beginning of the optimization. It is possible to use an additional projection
step [21] to contain the damage until the learning rates reach reasonable values.
However it is simply better to start with reasonable learning rates. The formula
γ_t = γ_0 (1 + γ_0 λ t)^{−1} ensures that the learning rates γ_t start from a predefined
value γ_0 and asymptotically decrease like (λt)^{−1}.
The most robust approach is to determine the best γ_0 as explained earlier,
using a small sample of the training set. This is justified because the asymptotic
SGD convergence rates [14] are independent from the sample size. In order to
make the method more robust, I often use a γ_0 slightly smaller than the best
value observed on the small training sample.
Such learning rates have been found to be effective in situations that far
exceed the scope of this particular analysis. For instance, they work well with
nondifferentiable loss functions such as the hinge loss [21]. They also work well
when one adds an unregularized bias term to the model. However it is then wise
to use smaller learning rates for the bias term itself.
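A small sketch of this recipe; train_cost_after_epochs is a hypothetical helper that runs SGD on the small sample with a candidate γ_0 and returns the resulting training cost.

    def learning_rate(t, gamma0, lam):
        # gamma_t = gamma0 (1 + gamma0 lam t)^{-1}: starts at gamma0 and
        # asymptotically decreases like (lam t)^{-1}.
        return gamma0 / (1.0 + gamma0 * lam * t)

    def pick_gamma0(train_cost_after_epochs, sample, lam,
                    candidates=(1.0, 0.3, 0.1, 0.03, 0.01, 0.003)):
        # Keep the gamma0 giving the lowest training cost on the small sample;
        # in practice one may then use a slightly smaller value for robustness.
        return min(candidates, key=lambda g0: train_cost_after_epochs(sample, lam, g0))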
18.5.3 Averaged Stochastic Gradient Descent
The averaged stochastic gradient descent (ASGD) algorithm [19] performs
the normal stochastic gradient update (18.4) and computes the average

    w̄_t = (1/(t − t_0)) ∑_{i=t_0+1}^{t} w_i .
This average can be computed efficiently using a recursive formula. For instance,
in the case of the L2-regularized training objective (18.10), the following weight
updates implement the ASGD algorithm:

    w_{t+1} = (1 − γ_t λ) w_t − γ_t y_t x_t ℓ′(y_t w_t⊤x_t)
    w̄_{t+1} = w̄_t + μ_t (w_{t+1} − w̄_t)

with the averaging rate μ_t = 1/max{1, t − t_0}.
When one uses learning rates γ_t that decrease slower than t^{−1}, the theoretical
analysis of ASGD shows that the training error E_n(w̄_t) decreases like t^{−1} with
the optimal constant [15]. This is as good as the second order stochastic gradient
descent (2SGD) for a fraction of the computational cost of (18.5).
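A direct transcription of these updates for the logistic regression sketch used earlier (the learning rate exponent 3/4 anticipates the recommendation given below):

    import numpy as np

    def asgd_logreg(X, y, lam, gamma0, t0, n_epochs=2):
        # Plain SGD iterate w plus its running average w_bar, engaged after t0.
        n, d = X.shape
        w, w_bar = np.zeros(d), np.zeros(d)
        t = 0
        for epoch in range(n_epochs):
            for i in np.random.permutation(n):
                gamma = gamma0 / (1.0 + gamma0 * lam * t) ** 0.75   # decreases slower than 1/t
                m = y[i] * np.dot(w, X[i])
                dloss = -1.0 / (1.0 + np.exp(m))
                w = (1.0 - gamma * lam) * w - gamma * y[i] * dloss * X[i]
                mu = 1.0 / max(1, t - t0)                           # averaging rate
                w_bar = w_bar + mu * (w - w_bar)
                t += 1
        return w_bar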
Unfortunately, ASGD typically starts more slowly than the plain SGD and can
take a long time to reach the optimal asymptotic convergence speed. Although
an adequate choice of the learning rates helps [27], the problem worsens when
the dimension d of the inputs x_t increases. Unfortunately, there are no clear
guidelines for selecting the time t_0 that determines when we engage the averaging
process.
Try averaged stochastic gradient with
– Learning rates γ_t = γ_0 (1 + γ_0 λ t)^{−3/4}
– Averaging rates μ_t = 1/max{1, t − d, t − n}
Similar to the trick explained in Sec. 18.5.1, there is an efficient method to
implement averaged stochastic gradient descent for sparse training data. The
idea is to represent the variables w_t and w̄_t as

    w_t = s_t W_t
    w̄_t = (A_t + α_t W_t) / β_t

where s_t, α_t and β_t are scalars. The averaged stochastic gradient update equations
can then be rewritten in a manner that only involves scalar or sparse
operations [27]:
    g_t = ℓ′(y_t s_t W_t⊤x_t) ,
    s_{t+1} = (1 − γ_t λ) s_t ,
    W_{t+1} = W_t − γ_t y_t x_t g_t / s_{t+1} ,
    A_{t+1} = A_t + γ_t α_t y_t x_t g_t / s_{t+1} ,
    β_{t+1} = β_t / (1 − μ_t) ,
    α_{t+1} = α_t + μ_t β_{t+1} s_{t+1} .
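A sketch of one such sparse ASGD step, continuing the dictionary representation used in the earlier sparsity sketch; the early iterations (where μ_t = 1) are assumed to be handled separately, since the division by (1 − μ_t) requires μ_t < 1.

    def sparse_asgd_step(W, A, s, alpha, beta, x_nz, y, gamma, mu, lam, dloss):
        # Iterate w = s * W, average w_bar = (A + alpha * W) / beta; requires mu < 1.
        m = y * s * sum(W[j] * v for j, v in x_nz.items())
        g = dloss(m)
        s_new = (1.0 - gamma * lam) * s
        for j, v in x_nz.items():                      # sparse updates of W and A
            step = gamma * y * g * v / s_new
            W[j] -= step
            A[j] += alpha * step
        beta_new = beta / (1.0 - mu)
        alpha_new = alpha + mu * beta_new * s_new
        return s_new, alpha_new, beta_new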
18.5.4 Experiments
This section briefly reports experimental results illustrating the actual perfor-
mance of SGD and ASGD on a variety of linear systems. The source code is
available at
http://leon.bottou.org/projects/sgd. All learning rates were
determined as explained in section 18.5.2.
Figure 18.1 reports results achieved using SGD for a linear SVM trained for
the recognition of the CCAT category in the RCV1 dataset [10] using both
the hinge loss and the log loss. The training set contains 781,265 documents
represented by 47,152 relatively sparse TF/IDF features. SGD runs considerably
faster than either the standard SVM solvers SVMLight and SVMPerf [7] or
the super-linear optimization algorithm TRON [11].
    Algorithm            Time         Test Error
    Hinge loss SVM, λ = 10^{−4}:
      SVMLight           23,642 s.    6.02 %
      SVMPerf            66 s.        6.03 %
      SGD                1.4 s.       6.02 %
    Log loss SVM, λ = 10^{−5}:
      TRON (-e 0.01)     30 s.        5.68 %
      TRON (-e 0.001)    44 s.        5.70 %
      SGD                2.3 s.       5.66 %

[The accompanying plot shows the training time (secs) needed by SGD and TRON to reach a given optimization accuracy (trainingCost − optimalTrainingCost), and the corresponding expected risk.]
Fig. 18.1. Results achieved with an L2-regularized linear model trained on the RCV1
task using both the hinge loss and the log loss. The lower half of the plot shows the time
required by SGD and TRON to reach a predefined accuracy ρ on the log loss task. The
upper half shows that the expected risk stops improving long before the super-linear
optimization algorithm TRON overcomes SGD.
[Figure 18.2 plots the expected risk and the test error (%) against the number of epochs for SGD, SGDQN, and ASGD.]
Fig. 18.2. Comparison of the test set performance of SGD, SGDQN, and ASGD for an
L2-regularized linear model trained with the squared hinge loss on the ALPHA task
of the 2008 Pascal Large Scale Learning Challenge. ASGD nearly reaches the optimal
expected risk after a single pass.
Figure 18.2 reports results achieved for a linear model trained on the ALPHA
task of the 2008 Pascal Large Scale Learning Challenge using the squared hinge
loss ℓ(m) = max{0, 1 − m}². For reference, we also provide the results achieved
by the SGDQN algorithm [1], which was one of the winners of this competition
and works by adapting a separate learning rate for each weight. The training set
contains 100,000 patterns represented by 500 centered and normalized variables.
Performances measured on a separate testing set are plotted against the number
of passes over the training set. ASGD achieves near optimal results after one
epoch only.
Figure 18.3 reports results achieved using SGD, SGDQN, and ASGD for
a CRF [8] trained on the CONLL 2000 Chunking task [20]. The training
set contains 8936 sentences for a 1.68 × 10⁶ dimensional parameter space.
Performances measured on a separate testing set are plotted against the number
of passes over the training set.
[Figure 18.3 plots the test loss and the test FB1 score against the number of epochs for SGD, SGDQN, and ASGD.]
Fig. 18.3. Comparison of the test set performance of SGD, SGDQN, and ASGD on an
L2-regularized CRF trained on the CONLL Chunking task. On this task, SGDQN appears
more attractive because ASGD does not fully reach its asymptotic performance.
SGDQN appears more attractive because ASGD does not reach its asymptotic
performance. All three algorithms reach the best test set performance in a couple
of minutes. The standard CRF L-BFGS optimizer takes 72 minutes to compute
an equivalent solution.
18.6 Conclusion
Stochastic gradient descent and its variants are versatile techniques that have
proven invaluable as learning algorithms for large datasets. The best advice
for a successful application of these techniques is (i) to perform small-scale
experiments with subsets of the training data, and (ii) to pay ruthless attention
to the correctness of the gradient computation.
References
[1] Bordes, A., Bottou, L., Gallinari, P.: SGD-QN: Careful quasi-Newton stochastic
gradient descent. Journal of Machine Learning Research 10, 1737–1754 (2009);
with erratum, JMLR 11, 2229–2240 (2010)
[2] Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Platt, J., Koller,
D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information Processing Sys-
tems, vol. 20, pp. 161–168. NIPS Foundation (2008),
http://books.nips.cc
[3] Bottou, L.: Online algorithms and stochastic approximations. In: Saad, D. (ed.)
Online Learning and Neural Networks. Cambridge University Press, Cambridge
(1998)
[4] Bousquet, O.: Concentration Inequalities and Empirical Processes Theory Ap-
plied to the Analysis of Learning Algorithms. Ph.D. thesis, Ecole Polytechnique,
Palaiseau, France (2002)
[5] Cortes, C., Vapnik, V.: Support-vector network. Machine Learning 20(3), 273–297
(1995)
[6] Dennis, J., Schnabel, R.B.: Numerical Methods For Unconstrained Optimization
and Nonlinear Equations. Prentice-Hall, Inc., Englewood Cliffs (1983)
[7] Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th
ACM SIGKDD International Conference, New York (2006)
[8] Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Prob-
abilistic models for segmenting and labeling sequence data. In: Brodley, C.E.,
Danyluk, A.P. (eds.) Proceedings of the Eighteenth International Conference on
Machine Learning (ICML), pp. 282–289. Morgan Kaufmann, Williams College
(2001)
[9] Lee, W.S., Bartlett, P.L., Williamson, R.C.: The importance of convexity in learn-
ing with squared loss. IEEE Transactions on Information Theory 44(5), 1974–1980
(1998)
[10] Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection
for text categorization research. Journal of Machine Learning Research 5, 361–397
(2004)
[11] Lin, C.J., Weng, R.C., Keerthi, S.S.: Trust region newton methods for large-scale
logistic regression. In: Ghahramani, Z. (ed.) Proc. Twenty-Fourth International
Conference on Machine Learning (ICML), pp. 561–568. ACM (2007)
[12] MacQueen, J.: Some methods for classification and analysis of multivariate ob-
servations. In: LeCam, L.M., Neyman, J. (eds.) Proceedings of the Fifth Berkeley
Symposium on Mathematics, Statistics, and Probabilities, vol. 1, pp. 281–297.
University of California Press, Berkeley (1967)
[13] Massart, P.: Some applications of concentration inequalities to statistics. Annales
de la Faculté des Sciences de Toulouse series 6 9(2), 245–303 (2000)
[14] Murata, N.: A statistical study of on-line learning. In: Saad, D. (ed.) Online Learn-
ing and Neural Networks. Cambridge University Press, Cambridge (1998)
[15] Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averag-
ing. SIAM J. Control Optim. 30(4), 838–855 (1992)
[16] Robbins, H., Siegmund, D.: A convergence theorem for non negative almost su-
permartingales and some applications. In: Rustagi, J.S. (ed.) Optimizing Methods
in Statistics. Academic Press (1971)
[17] Rosenblatt, F.: The perceptron: A perceiving and recognizing automaton. Tech.
Rep. 85-460-1, Project PARA, Cornell Aeronautical Lab (1957)
[18] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations
by error propagation. In: Parallel Distributed Processing: Explorations in the Mi-
crostructure of Cognition, vol. I, pp. 318–362. Bradford Books, Cambridge (1986)
[19] Ruppert, D.: Efficient estimations from a slowly convergent Robbins-Monro pro-
cess. Tech. Rep. 781, Cornell University Operations Research and Industrial En-
gineering (1988)
[20] Sang, E.F.T.K., Buchholz, S.: Introduction to the CoNLL-2000 shared task:
Chunking. In: Cardie, C., Daelemans, W., Nedellec, C., Tjong Kim Sang, E.F.
(eds.) Proceedings of CoNLL 2000 and LLL 2000, Lisbon, Portugal, pp. 127–132
(2000)
[21] Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal estimated subgradient
solver for SVM. In: Proc. 24th Intl. Conf. on Machine Learning (ICML 2007), pp.
807–814. ACM (2007)
[22] Shalev-Shwartz, S., Srebro, N.: SVM optimization: inverse dependence on training
set size. In: Proceedings of the 25th International Machine Learning Conference
(ICML 2008), pp. 928–935. ACM (2008)
[23] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society (Series B) 58, 267–288 (1996)
[24] Tsybakov, A.B.: Optimal aggregation of classifiers in statistical learning. Annals
of Statististics 32(1) (2004)
[25] Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative fre-
quencies of events to their probabilities. Theory of Probability and its Applica-
tions 16(2), 264–280 (1971)
[26] Widrow, B., Hoff, M.E.: Adaptive switching circuits. In: IRE WESCON Conv.
Record, Part 4, pp. 96–104 (1960)
[27] Xu, W.: Towards optimal one pass large scale learning with averaged stochastic
gradient descent (2011),
http://arxiv.org/abs/1107.2490
[28] Zinkevich, M.: Online convex programming and generalized infinitesimal gradient
ascent. In: Proc. Twentieth International Conference on Machine Learning (2003)
19
Practical Recommendations for Gradient-Based
Training of Deep Architectures
Yoshua Bengio
Université de Montréal
Abstract. Learning algorithms related to artificial neural networks and
in particular for Deep Learning may seem to involve many bells and
whistles, called hyper-parameters. This chapter is meant as a practical
guide with recommendations for some of the most commonly used hyper-
parameters, in particular in the context of learning algorithms based
on back-propagated gradient and gradient-based optimization. It also
discusses how to deal with the fact that more interesting results can be
obtained when allowing one to adjust many hyper-parameters. Overall,
it describes elements of the practice used to successfully and efficiently
train and debug large-scale and often deep multi-layer neural networks.
It closes with open questions about the training difficulties observed with
deeper architectures.
19.1 Introduction
Following a decade of lower activity, research in artificial neural networks was
revived after a 2006 breakthrough [61, 14, 95] in the area of Deep Learning,
based on greedy layer-wise unsupervised pre-training of each layer of features.
See [7] for a review. Many of the practical recommendations that justified the
previous edition of this book are still valid, and new elements were added, while
some survived longer by virtue of the practical advantages they provided. The
panorama presented in this chapter regards some of these surviving or novel
elements of practice, focusing on learning algorithms aiming at training deep
neural networks, but leaving most of the material specific to the Boltzmann
machine family to another chapter [60].
Although such recommendations come out of a living practice that emerged
from years of experimentation and to some extent mathematical justification,
they should be challenged. They constitute a good starting point for the exper-
imenter and user of learning algorithms but very often have not been formally
validated, leaving open many questions that can be answered either by theo-
retical analysis or by solid comparative experimental work (ideally by both). A
good indication of the need for such validation is that different researchers and
research groups do not always agree on the practice of training neural networks.
Several of the recommendations presented here can be found implemented
in the Deep Learning Tutorials¹ and in the related Pylearn2 library², all based
on the Theano library (discussed below) written in the Python programming
language.
The 2006 Deep Learning breakthrough [61, 14, 95] centered on the use of
unsupervised representation learning to help learning internal representations³
by providing a local training signal at each level of a hierarchy of features⁴.
Unsupervised representation learning algorithms can be applied several times
to learn different layers of a deep model. Several unsupervised representation
learning algorithms have been proposed since then. Those covered in this chap-
ter (such as auto-encoder variants) retain many of the properties of artificial
multi-layer neural networks, relying on the back-propagation algorithm to esti-
mate stochastic gradients. Deep Learning algorithms such as those based on the
Boltzmann machine and those based on auto-encoder or sparse coding variants
often include a supervised fine-tuning stage. This supervised fine-tuning as well
as the gradient descent performed with auto-encoder variants also involves the
back-propagation algorithm, just as like when training deterministic feedforward
or recurrent artificial neural networks. Hence this chapter also includes recom-
mendations for training ordinary supervised deterministic neural networks or
more generally, most machine learning algorithms relying on iterative gradient-
based optimization of a parametrized learner with respect to an explicit training
criterion.
This chapter assumes that the reader already understands the standard algo-
rithms for training supervised multi-layer neural networks, with the loss gradient
computed thanks to the back-propagation algorithm [103]. It starts by explain-
ing basic concepts behind Deep Learning and the greedy layer-wise pretraining
strategy (Section 19.1.1), and recent unsupervised pre-training algorithms (de-
noising and contractive auto-encoders) that are closely related in the way they
are trained to standard multi-layer neural networks (Section 19.1.2). It then re-
views in Section 19.2 basic concepts in iterative gradient-based optimization and
in particular the stochastic gradient method, gradient computation with a flow
graph, automatic differentation. The main section of this chapter is Section 19.3,
which explains hyper-parameters in general, their optimization, and specifically
covers the main hyper-parameters of neural networks. Section 19.4 briefly de-
scribes simple ideas and methods to debug and visualize neural networks, while
Section 19.5 covers parallelism, sparse high-dimensional inputs, symbolic inputs
¹ http://deeplearning.net/tutorial/
² http://deeplearning.net/software/pylearn2
³ A neural network computes a sequence of data transformations, each step encoding
the raw input into an intermediate or internal representation, in principle to make
the prediction or modeling task of interest easier.
⁴ In standard multi-layer neural networks trained using back-propagated gradients, the
only signal that drives parameter updates is provided at the output of the network
(and then propagated backwards). Some unsupervised learning algorithms provide
a local source of guidance for the parameter update in each layer, based only on the
inputs and outputs of that layer.
and embeddings, and multi-relational learning. The chapter closes (Section 19.6)
with open questions on the difficulty of training deep architectures and improv-
ing the optimization methods for neural networks.
19.1.1 Deep Learning and Greedy Layer-Wise Pretraining
The notion of reuse, which explains the power of distributed representations [7], is
also at the heart of the theoretical advantages behind Deep Learning.Complexity
theory of circuits, e.g. [54, 55], (which include neural networks as special cases)
has much preceded the recent research on deep learning. The depth of a circuit is
the length of the longest path from an input node of the circuit to an output node
of the circuit. Formally, one can change the depth of a given circuit by changing
the definition of what each node can compute, but only by a constant factor [7].
The typical computations we allow in each node include: weighted sum, product,
artificial neuron model (such as a monotone non-linearity on top of an affine
transformation), computation of a kernel, or logic gates. Theoretical results [54,
55, 13, 10, 9] clearly identify families of functions where a deep representation
can be exponentially more efficient than one that is insufficiently deep. If the
same set of functions can be represented from within a family of architectures
associated with a smaller VC-dimension (e.g. fewer hidden units⁵), learning theory
would suggest that it can be learned with fewer examples, yielding improvements
in both computational efficiency and statistical efficiency.
Another important motivation for feature learning and Deep Learning is that
they can be done with unlabeled examples, so long as the factors (unobserved
random variables explaining the data) relevant to the questions we will ask later
(e.g. classes to be predicted) are somehow salient in the input distribution itself.
This is true under the manifold hypothesis, which states that natural classes and
other high-level concepts in which humans are interested are associated with
low-dimensional regions in input space (manifolds) near which the distribution
concentrates, and that different class manifolds are well-separated by regions
of very low density. It means that a small semantic change around a particular
example can be captured by changing only a few numbers in a high-level abstract
representation space. As a consequence, feature learning and Deep Learning are
intimately related to principles of unsupervised learning, and they can work in
the semi-supervised setting (where only a few examples are labeled), as well as
in the transfer learning and multi-task settings (where we aim to generalize to
new classes or tasks). The underlying hypothesis is that many of the underlying
factors are shared across classes or tasks. Since representation learning aims to
extract and isolate these factors, representations can be shared across classes and
tasks.
One of the most commonly used approaches for training deep neural net-
works is based on greedy layer-wise pre-training [14]. The idea, first introduced
in Hinton et al. [61], is to train one layer of a deep architecture at a time using
⁵ Note that in our experiments, deep architectures tend to generalize very well even
when they have quite large numbers of parameters.
unsupervised representation learning. Each level takes as input the representa-
tion learned at the previous level and learns a new representation. The learned
representation(s) can then be used as input to predict variables of interest, for
example to classify objects. After unsupervised pre-training, one can also perform
supervised fine-tuning of the whole system⁶, i.e., optimize not just the
classifier but also the lower levels of the feature hierarchy with respect to some
objective of interest. Combining unsupervised pre-training and supervised fine-
tuning usually gives better generalization than pure supervised learning from
a purely random initialization. The unsupervised representation learning algo-
rithms for pre-training proposed in 2006 were the Restricted Boltzmann Machine
or RBM [61], the auto-encoder [14] and a sparsifying form of auto-encoder similar
to sparse coding [95].
19.1.2 Denoising and Contractive Auto-encoders
An auto-encoder has two parts: an encoder function f that maps the input x to a
representation h = f (x), and a decoder function g that maps h back in the space of
x in order to reconstruct x. In the regular auto-encoder the reconstruction function
r(·)=g(f(·)) is trained to minimize the average value of a reconstruction loss on
the training examples. Note that reconstruction loss should be high for most other
input configurations⁷. The regularization mechanism makes sure that reconstruction
cannot be perfect everywhere, while minimizing the reconstruction loss at
training examples digs a hole in reconstruction error where the density of training
examples is large. Examples of reconstruction loss functions include ‖x − r(x)‖²
(for real-valued inputs) and −∑_i [ x_i log r_i(x) + (1 − x_i) log(1 − r_i(x)) ] (when
interpreting x_i as a bit or a probability of a binary event). Auto-encoders cap-
ture the input distribution by learning to better reconstruct more likely input
configurations. The difference between the reconstruction vector and the input
vector can be shown to be related to the log-density gradient as estimated by
the learner [114, 16] and the Jacobian matrix of the reconstruction with respect
to the input gives information about the second derivative of the density, i.e.,
in which direction the density remains high when you are on a high-density
manifold [99, 16]. In the Denoising Auto-Encoder (DAE) and the Contractive
Auto-Encoder (CAE), the training procedure also introduces robustness (insen-
sitivity to small variations), respectively in the reconstruction r(x) or in the
representation f(x). In the DAE [115, 116], this is achieved by training with
stochastically corrupted inputs, but trying to reconstruct the uncorrupted in-
puts. In the CAE [99], this is achieved by adding an explicit regularizing term
in the training criterion, proportional to the norm of the Jacobian of the en-
coder, ||
∂f(x)
∂x
||
2
. But the CAE and the DAE are very related [16]: when the
⁶ The whole system composes the computation of the representation with computation
of the predictor's output.
⁷ Different regularization mechanisms have been proposed to push reconstruction error
up in low density areas: denoising criterion, contractive criterion, and code sparsity.
It has been argued that such constraints play a role similar to the partition function
for Boltzmann machines [96].
noise is Gaussian and small, the denoising error minimized by the DAE is equiv-
alent to minimizing the norm of the Jacobian of the reconstruction function
r(·)=g(f(·)), whereas the CAE minimizes the norm of the Jacobian of the
encoder f(·). Besides Gaussian noise, another interesting form of corruption has
been very successful with DAEs: it is called the masking corruption and consists
in randomly zeroing out a large fraction (like 20% or even 50%) of the inputs,
where the zeroed out subset is randomly selected for each example. In addition
to the contractive effect, it forces the learned encoder to be able to rely only on
an arbitrary subset of the input features.
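For concreteness, a minimal sketch of the masking corruption (the auto-encoder itself is not shown); names and defaults are illustrative only.

    import numpy as np

    def masking_corruption(x, fraction=0.2, rng=np.random):
        # Zero out a random subset of the input coordinates, drawn independently
        # for each example; a DAE is trained to reconstruct the clean x from this.
        mask = rng.uniform(size=x.shape) >= fraction
        return x * mask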
Another way to prevent the auto-encoder from perfectly reconstructing every-
where is to introduce a sparsity penalty on h, discussed below (Section 19.3.1).
19.1.3 Online Learning and Optimization of Generalization Error
The objective of learning is not to minimize training error or even the train-
ing criterion. The latter is a surrogate for generalization error, i.e., performance
on new (out-of-sample) examples, and there are no hard guarantees that min-
imizing the training criterion will yield good generalization error: it depends
on the appropriateness of the parametrization and training criterion (with the
corresponding prior they imply) for the task at hand.
Many learning tasks of interest will require huge quantities of data (most of
which will be unlabeled) and as the number of examples increases, so long as
capacity is limited (the number of parameters is small compared to the num-
ber of examples), training error and generalization approach each other. In the
regime of such large datasets, we can consider that the learner sees an unending
stream of examples (e.g., think about a process that harvests text and images
from the web and feeds it to a machine learning algorithm). In that context, it
is most efficient to simply update the parameters of the model after each exam-
ple or few examples, as they arrive. This is the ideal online learning scenario,
and in a simplified setting, we can even consider each new example z as being
sampled i.i.d. from an unknown generating distribution with probability density
p(z). More realistically, examples in online learning do not arrive i.i.d. but in-
stead from an unknown stochastic process which exhibits serial correlation and
other temporal dependencies. Many learning algorithms rely on gradient-based
numerical optimization of a training criterion. Let L(z,θ) be the loss incurred
on example z when the parameter vector takes value θ. The gradient vector for
the loss associated with a single example is
    ∂L(z, θ) / ∂θ .

If we consider the simplified case of i.i.d. data, there is an interesting observation
to be made: the online learner is performing stochastic gradient descent
on its generalization error. Indeed, the generalization error C of a learner with
parameters θ and loss function L is

    C = E[L(z, θ)] = ∫ p(z) L(z, θ) dz

while the stochastic gradient from sample z is

    ĝ = ∂L(z, θ) / ∂θ

with z a random variable sampled from p. The gradient of generalization error is

    ∂C/∂θ = ∂/∂θ ∫ p(z) L(z, θ) dz = ∫ p(z) ∂L(z, θ)/∂θ dz = E[ĝ]

showing that the online gradient ĝ is an unbiased estimator of the generalization
error gradient ∂C/∂θ. It means that online learners, when given a stream of non-
repetitive training data, really optimize (maybe not in the optimal way, i.e., using
a first-order gradient technique) what we really care about: generalization error.
19.2 Gradients
19.2.1 Gradient Descent and Learning Rate
The gradient or an estimator of the gradient is used as the core part the computa-
tion of parameter updates for gradient-based numerical optimization algorithms.
For example, simple online (or stochastic) gradient descent [102, 28] updates the
parameters after each example is seen, according to
    θ^(t) ← θ^(t−1) − ε_t ∂L(z_t, θ) / ∂θ

where z_t is an example sampled at iteration t and where ε_t is a hyper-parameter
that is called the learning rate and whose choice is crucial. If the learning rate
is too large⁸
, the average loss will increase. The optimal learning rate is usually
close to (by a factor of 2) the largest learning rate that does not cause divergence
of the training criterion, an observation that can guide heuristics for setting the
learning rate [8], e.g., start with a large learning rate and if the training criterion
diverges, try again with 3 times smaller learning rate, etc., until no divergence
is observed.
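This heuristic can be written down as a small search loop; diverges is a hypothetical callable that runs a short training session with the given rate and reports whether the training criterion increased.

    def find_learning_rate(diverges, eps0=1.0, shrink=3.0, max_tries=10):
        eps = eps0
        for _ in range(max_tries):
            if not diverges(eps):
                return eps        # largest tried rate that does not diverge
            eps /= shrink         # try a 3 times smaller learning rate
        return eps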
See [26] for a deeper treatment of stochastic gradient descent, including sug-
gestions to set learning rate schedule and improve the asymptotic convergence
through averaging.
In practice, we use mini-batch updates based on an average of the gradients⁹
inside each block of B examples:

    θ^(t) ← θ^(t−1) − ε_t (1/B) ∑_{t′=Bt+1}^{B(t+1)} ∂L(z_{t′}, θ) / ∂θ .    (19.1)

⁸ Above a value which is approximately 2 times the largest eigenvalue of the average
loss Hessian matrix.
⁹ Compared to a sum, an average makes a small change in B have only a small effect on
the optimal learning rate, with an increase in B generally allowing a small increase
in the learning rate because of the reduced variance of the gradient.
With B = 1 we are back to ordinary online gradient descent, while with B equal
to the training set size, this is standard (also called “batch”) gradient descent.
With intermediate values of B there is generally a sweet spot. When B increases
we can get more multiply-add operations per second by taking advantage of
parallelism or efficient matrix-matrix multiplications (instead of separate matrix-
vector multiplications), often gaining a factor of 2 in practice in overall training
time. On the other hand, as B increases, the number of updates per computation
done decreases, which slows down convergence (in terms of error vs number of
multiply-add operations performed) because less updates can be done in the
same computing time. Combining these two opposing effects yields a typical
U-curve with a sweet spot at an intermediate value of B.
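A minimal sketch of the mini-batch update (19.1); grad_L is a hypothetical callable returning ∂L(z, θ)/∂θ for a single example, and examples is a list of training examples.

    import numpy as np

    def minibatch_sgd(theta, grad_L, examples, eps, B=32, n_epochs=1):
        for epoch in range(n_epochs):
            np.random.shuffle(examples)                 # visit mini-batches in a fresh order
            for start in range(0, len(examples), B):
                batch = examples[start:start + B]
                g = sum(grad_L(z, theta) for z in batch) / len(batch)   # average gradient
                theta = theta - eps * g
        return theta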
Keep in mind that even the true gradient direction (averaging over the whole
training set) is only the steepest descent direction locally but may not point
in the right direction when considering larger steps. In particular, because the
training criterion is not quadratic in the parameters, as one moves in parameter
space the optimal descent direction keeps changing. Because the gradient direc-
tion is not quite the right direction of descent, there is no point in spending a lot
of computation to estimate it precisely for gradient descent. Instead, doing more
updates more frequently helps to explore more and faster, especially with large
learning rates. In addition, smaller values of B may benefit from more explo-
ration in parameter space and a form of regularization both due to the “noise”
injected in the gradient estimator, which may explain the better test results
sometimes observed with smaller B.
When the training set is finite, training proceeds by sweeps through the train-
ing set called an epoch, and full training usually requires many epochs (iterations
through the training set). Note that stochastic gradient (either one example at a
time or with mini-batches) is different from ordinary gradient descent, sometimes
called “batch gradient descent”, which corresponds to the case where B equals
the training set size (i.e., there is one parameter update per epoch). The great
advantage of stochastic gradient descent and other online or minibatch update
methods is that their convergence does not depend on the size of the training
set, only on the number of updates and the richness of the training distribution.
In the limit of a large or infinite training set, a batch method (which updates
only after seeing all the examples) is hopeless. In fact, even for ordinary datasets
of tens or hundreds of thousands of examples (or more!), stochastic gradient de-
scent converges much faster than ordinary (batch) gradient descent, and beyond
some dataset sizes the speed-up is almost linear (i.e., doubling the size almost
doubles the gain)
10
. It is really important to use the stochastic version in order
to get reasonable clock-time convergence speeds.
As for any stochastic gradient descent method (including the mini-batch case),
it is important for efficiency of the estimator that each example or mini-batch
be sampled approximately independently. Because random access to memory
(or even worse, to disk) is expensive, a good approximation, called incremental
¹⁰ On the other hand, batch methods can be parallelized easily, which becomes an
important advantage with currently available forms of computing power.
gradient [21], is to visit the examples (or mini-batches) in a fixed order corre-
sponding to their order in memory or disk (repeating the examples in the same
order on a second epoch, if we are not in the pure online case where each example
is visited only once). In this context, it is safer if the examples or mini-batches
are first put in a random order (to make sure this is the case, it could be useful
to first shuffle the examples). Faster convergence has been observed if the order
in which the mini-batches are visited is changed for each epoch, which can be
reasonably efficient if the training set holds in computer memory.
19.2.2 Gradient Computation and Automatic Differentiation
The gradient can be either computed manually or through automatic differen-
tiation. Either way, it helps to structure this computation as a flow graph,in
order to prevent mathematical mistakes and make sure an implementation is
computationally efficient. The computation of the loss L(z,θ) as a function of θ
is laid out in a graph whose nodes correspond to elementary operations such as
addition, multiplication, and non-linear operations such as the neural networks
activation function (e.g., sigmoid or hyperbolic tangent), possibly at the level
of vectors, matrices or tensors. The flow graph is directed and acyclic and has
three types of nodes: input nodes, internal nodes, and output nodes. Each of its
nodes is associated with a numerical output which is the result of the applica-
tion of that computation (none in the case of input nodes), taking as input the
output of previous nodes in a directed acyclic graph. Example z and parameter
vector θ (or their elements) are the input nodes of the graph (i.e., they do not
have inputs themselves) and L(z,θ) is a scalar output of the graph. Note that
here, in the supervised case, z can include an input part x (e.g. an image) and
a target part y (e.g. a target class associated with an object in the image). In
the unsupervised case z = x. In a semi-supervised case, there is a mix of labeled
and unlabeled examples, and z includes y on the labeled examples but not on
the unlabeled ones.
In addition to associating a numerical output o_a to each node a of the flow
graph, we can associate a gradient g_a = ∂L(z, θ)/∂o_a. The gradient will be defined and
computed recursively in the graph, in the opposite direction of the computation
of the nodes' outputs, i.e., whereas o_a is computed using outputs o_p of predecessor
nodes p of a, g_a will be computed using the gradients g_s of successor nodes s of
a. More precisely, the chain rule dictates

    g_a = ∑_s g_s ∂o_s/∂o_a

where the sum is over immediate successors of a. Only output nodes have no
successor, and in particular for the output node that computes L, the gradient
is set to 1 since ∂L/∂L = 1, thus initializing the recursion. Manual or automatic
differentiation then only requires to define the partial derivative associated with
each type of operation performed by any node of the graph. When implementing
gradient descent algorithms with manual differentiation the result tends to be
verbose, brittle code that lacks modularity – all bad things in terms of software
engineering. A better approach is to express the flow graph in terms of objects
that modularize how to compute outputs from inputs as well as how to compute
the partial derivatives necessary for gradient descent. One can pre-define the
operations of these objects (in a “forward propagation” or fprop method) and
their partial derivatives (in a “backward propagation” or bprop method) and
encapsulate these computations in an object that knows how to compute its
output given its inputs, and how to compute the gradient with respect to its
inputs given the gradient with respect to its output. This is the strategy adopted
in the Theano library¹¹ with its Op objects [18], as well as in libraries such as
Torch¹² [37] and Lush¹³.
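The following toy class illustrates this fprop/bprop interface in Python (a schematic example in the spirit of these libraries, not their actual API).

    import numpy as np

    class Multiply:
        # Elementwise product node: o = x * y.
        def fprop(self, x, y):
            self.x, self.y = x, y               # keep inputs for the backward pass
            return x * y
        def bprop(self, grad_output):
            # Chain rule: dL/dx = dL/do * do/dx = grad_output * y, and symmetrically for y.
            return grad_output * self.y, grad_output * self.x

    node = Multiply()
    out = node.fprop(np.array([1.0, 2.0]), np.array([3.0, 4.0]))
    grad_x, grad_y = node.bprop(np.ones(2))     # gradient of sum(out) w.r.t. the inputs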
Compared to
Torch and Lush, Theano adds an interesting ingredient which
makes it a full-fledged automatic differentiation tool: symbolic computation. The
flow graph itself (without the numerical values attached) can be viewed as a
symbolic representation (in a data structure) of a numerical computation. In
Theano, the gradient computation is first performed symbolically, i.e., each Op
object knows how to create other Ops corresponding to the computation of the
partial derivatives associated with that
Op. Hence the symbolic differentiation
of the output of a flow graph with respect to any or all of its input nodes can
be performed easily in most cases, yielding another flow graph which specifies
how to compute these gradients, given the input of the original graph. Since
the gradient graph typically contains the original graph (mapping parameters to
loss) as a sub-graph, in order to make computations efficient it is important to
automate (as done in
Theano) a number of simplifications which are graph trans-
formations preserving the semantics of the output (given the input) but yielding
smaller (or more numerically stable or more efficiently computed) graphs (e.g.,
removing redundant computations). To take advantage of the fact that com-
puting the loss gradient includes as a first step computing the loss itself, it is
advantageous to structure the code so that both the loss and its gradient are
computed at once, with a single graph having multiple outputs. The advantages
of performing gradient computations symbolically are numerous. First of all, one
can readily compute gradients over gradients, i.e., second derivatives, which are
useful for some learning algorithms. Second, one can define algorithms or training
criteria involving gradients themselves, as required for example in the Contrac-
tive Auto-Encoder (which uses the norm of a Jacobian matrix in its training
criterion, i.e., really requires second derivatives, which here are cheap to com-
pute). Third, it makes it easy to implement other useful graph transformations
such as graph simplifications or numerical optimizations and transformations
that help making the numerical results more robust and more efficient (such as
working in the domain of logarithms of probabilities rather than in the domain
of probabilities directly). Other potential beneficial applications of such sym-
bolic manipulations include parallelization and additional differential operators
(such as the R-operator, recently implemented in
Theano, which is very useful to
¹¹ http://deeplearning.net/software/theano/
¹² http://www.torch.ch
¹³ http://lush.sourceforge.net
compute the product of a Jacobian matrix ∂f(x)/∂x or Hessian matrix ∂²L(x, θ)/∂θ²
with a vector without ever having to actually compute and store the matrix itself [90]).
19.3 Hyper-parameters
A pure learning algorithm can be seen as a function taking training data as input
and producing as output a function (e.g. a predictor) or model (i.e. a bunch
of functions). However, in practice, many learning algorithms involve hyper-
parameters, i.e., annoying knobs to be adjusted. In many algorithms such as
Deep Learning algorithms the number of hyper-parameters (ten or more!) can
make the idea of having to adjust all of them unappealing. In addition, it has
been shown that the use of computer clusters for hyper-parameter selection can
have an important effect on results [91]. Choosing hyper-parameter values is
formally equivalent to the question of model selection, i.e., given a family or set
of learning algorithms, how to pick the most appropriate one inside the set? We
define a hyper-parameter for a learning algorithm A as a variable to be set prior
to the actual application of A to the data, one that is not directly selected by the
learning algorithm itself. It is basically an outside control knob. It can be discrete
(as in model selection) or continuous (such as the learning rate discussed above).
Of course, one can hide these hyper-parameters by wrapping another learning
algorithm, say B, around A, to select A's hyper-parameters (e.g. to minimize
validation set error). We can then call B a hyper-learner, and if B has no hyper-
parameters itself then the composition of B over A could be a “pure” learning
algorithm, with no hyper-parameter. In the end, to apply a learner to training
data, one has to have a pure learning algorithm. The hyper-parameters can be
fixed by hand or tuned by an algorithm, but their value has to be selected. The
value of some hyper-parameters can be selected based on the performance of
A on its training data, but most cannot. For any hyper-parameter that has an
impact on the effective capacity of a learner, it makes more sense to select its
value based on out-of-sample data (outside the training set), e.g., a validation
set performance, online error, or cross-validation error. Note that some learning
algorithms (in particular unsupervised learning algorithms such as algorithms
for training RBMs by approximate maximum likelihood) are problematic in this
respect because we cannot directly measure the quantity that is to be optimized
(e.g. the likelihood) because it is intractable. On the other hand, the expected
denoising reconstruction error is easy to estimate (by just averaging the denoising
error over a validation set).
Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation¹⁴, in the case of small datasets) to estimate the generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside).

¹⁴ Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split’s training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split.
19.3.1 Neural Network Hyper-parameters
Different learning algorithms involve different sets of hyper-parameters, and it
is useful to get a sense of the kinds of choices that practitioners have to make in
choosing their values. We focus here mostly on those relevant to neural networks
and Deep Learning algorithms.
Hyper-parameters of the Approximate Optimization. First of all, several
learning algorithms can be viewed as the combination of two elements: a train-
ing criterion and a model (e.g., a family of functions, a parametrization) on the
one hand, and on the other hand, a particular procedure for approximately opti-
mizing this criterion. Correspondingly, one should distinguish hyper-parameters
associated with the optimizer from hyper-parameters associated with the model
itself, i.e., typically the function class, regularizer and loss function. We have al-
ready mentioned above some of the hyper-parameters typically associated with
gradient-based optimization. Here is a more extensive descriptive list, focusing
on those used in stochastic (mini-batch) gradient descent (although the number of training iterations is used for all iterative optimization algorithms).
– The initial learning rate (ε₀ below, Eq. (19.2)). This is often the single most important hyper-parameter and one should always make sure that it has been tuned (up to approximately a factor of 2). Typical values for a neural network with standardized inputs (or inputs mapped to the (0,1) interval) are less than 1 and greater than 10⁻⁶, but these should not be taken as strict ranges and greatly depend on the parametrization of the model. A default value of 0.01 typically works for standard multi-layer neural networks but it would be foolish to rely exclusively on this default value. If there is only time to optimize one hyper-parameter and one uses stochastic gradient descent, then this is the hyper-parameter that is worth tuning.
– The choice of strategy for decreasing or adapting the learning rate schedule (with hyper-parameters such as the time constant τ in Eq. (19.2) below). The default value of τ → ∞ means that the learning rate is constant over training iterations. In many cases the benefit of choosing other than this default value is small. An example of O(1/t) learning rate schedule, used in Bergstra and Bengio [17], is

    εₜ = ε₀ τ / max(t, τ)                                (19.2)

which keeps the learning rate constant for the first τ steps and then decreases it in O(1/t^α), with traditional recommendations (based on asymptotic analysis of the convex case) suggesting α = 1. See Bach and Moulines [2] for a recent analysis of the rate of convergence for the general case of α ≤ 1, suggesting that smaller values of α should be used in the non-convex case,
especially when using a gradient averaging or momentum technique (see be-
low). An adaptive and heuristic way of automatically setting τ above is to
keep εₜ constant until the training criterion stops decreasing significantly (by
more than some relative improvement threshold) from epoch to epoch. That
threshold is a less sensitive hyper-parameter than τ itself. An alternative to
a fixed schedule with a couple of (global) free hyper-parameters like in the
above formula is the use of an adaptive learning rate heuristic, e.g., the sim-
ple procedure proposed in [26]: at regular intervals during training, using
a fixed small subset of the training set (what matters is only the number
of examples used, not what fraction of the whole training set it represents),
continue training with N different choices of learning rate (all in parallel),
and keep the value that gave the best results until the next re-estimation of
the optimal learning rate. Other examples of adaptive learning rate strategies
are discussed below (Sec. 19.6.2).
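As a concrete illustration of the schedule in Eq. (19.2), here is a minimal Python sketch (the default values of eps_0 and tau are only placeholders, not recommendations beyond those discussed above):

# Learning rate schedule of Eq. (19.2): constant at eps_0 for the first tau
# updates, then decaying as O(1/t).
def learning_rate(t, eps_0=0.01, tau=10000):
    return eps_0 * tau / max(t, tau)

# e.g., learning_rate(20000) == 0.005 when eps_0 = 0.01 and tau = 10000.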
– The mini-batch size (B in Eq. (19.1)) is typically chosen between 1 and a few hundreds, e.g. B = 32 is a good default value, with values above 10 taking advantage of the speed-up of matrix-matrix products over matrix-vector products. The impact of B is mostly computational, i.e., larger B yields faster computation (with appropriate implementations) but requires visiting more examples in order to reach the same error, since there are fewer updates per epoch. In theory, this hyper-parameter should impact training time and not so much test performance, so it can be optimized separately from the other hyper-parameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected. B and ε₀ may slightly interact with other hyper-parameters so both should be re-optimized at the end. Once B is selected, it can generally be fixed while the other hyper-parameters can be further optimized (except for a momentum hyper-parameter, if one is used).
– Number of training iterations T (measured in mini-batch updates). This hyper-parameter is particular in that it can be optimized almost for free using the principle of early stopping: by keeping track of the out-of-sample error (as for example estimated on a validation set) as training progresses (every N updates), one can decide how long to train for any given setting of all the other hyper-parameters. Early stopping is an inexpensive way to avoid strong overfitting, i.e., even if the other hyper-parameters would otherwise lead to overfitting, early stopping will considerably reduce the overfitting damage that would otherwise ensue. It also means that it hides the overfitting effect of other hyper-parameters, possibly obscuring the analysis that one may want to do when trying to figure out the effect of individual hyper-parameters, i.e., it tends to even out the performance obtained by many otherwise overfitting configurations of hyper-parameters by compensating for too large a capacity with a smaller training time. For this reason, it might be useful to turn early-stopping off when analyzing the effect of individual hyper-parameters.
Now let us turn to implementation details. Practically, one needs to continue training beyond the selected number of training iterations T̂ (which should be the point of lowest validation error in the training run) in order to ascertain that validation error is unlikely to go lower than at the selected point. A heuristic introduced in the Deep Learning Tutorials¹⁵ is based on the idea of patience (set initially to 10000 examples in the MLP tutorial), which is a minimum number of training examples to see after the candidate selected point T̂ before deciding to stop training (i.e. before accepting this candidate as the final answer). As training proceeds and new candidate selected points T̂ (new minima of the validation error) are observed, the patience parameter is increased, either multiplicatively or additively on top of the last T̂ found. Hence, if we find a new minimum¹⁶ at t, we save the current best model, update T̂ ← t, and we increase our patience up to t + constant or t × constant. Note that validation error should not be estimated after each training update (that would be really wasteful) but after every N examples, where N is at least as large as the validation set (ideally several times larger so that the early stopping overhead remains small)¹⁷.

¹⁵ http://deeplearning.net/tutorial/
¹⁶ Ideally, we should use a statistical test of significance and accept a new minimum (over a longer training period) only if the improvement is statistically significant, based on the size and variance estimates one can compute for the validation set.
¹⁷ When an extra processor on the same machine is available, validation error can conveniently be recomputed by a processor different from the one performing the training updates, allowing more frequent computation of validation error.
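A minimal Python sketch of this patience heuristic follows; train_for, validation_error and save_model are hypothetical placeholders for the actual training loop, validation evaluation and model checkpointing:

# Early stopping with the "patience" heuristic described above.
def fit_with_patience(train_for, validation_error, save_model,
                      initial_patience=10000, check_every=5000,
                      patience_factor=2.0):
    patience = initial_patience          # minimum number of examples to see
    best_error = float('inf')
    t = 0                                # training examples seen so far
    while t < patience:
        train_for(check_every)           # continue training on check_every examples
        t += check_every
        err = validation_error()
        if err < best_error:             # new candidate point T_hat found at t
            best_error = err
            save_model()                 # keep the current best model
            patience = max(patience, int(t * patience_factor))
    return best_error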
– Momentum β. It has long been advocated [56, 59] to temporally smooth out the stochastic gradient samples obtained during the stochastic gradient descent. For example, a moving average of the past gradients can be computed with ḡ ← (1 − β)ḡ + βg, where g is the instantaneous gradient ∂L(z_t, θ)/∂θ or a minibatch average, and β is a small positive coefficient that controls how fast the old examples get downweighted in the moving average. The simplest momentum trick is to make the updates proportional to this smoothed gradient estimator ḡ instead of the instantaneous gradient g. The idea is that it removes some of the noise and oscillations that gradient descent has, in particular in the directions of high curvature of the loss function¹⁸. A default value of β = 1 (no momentum) works well in many cases but in some cases momentum seems to make a positive difference. Polyak averaging [93] is a related form of parameter averaging¹⁹ that has theoretical advantages and has been advocated and shown to bring improvements on some unsupervised learning procedures such as RBMs [110]. More recently, several mathematically motivated algorithms [88, 75] have been proposed that incorporate some form of momentum and that also ensure much faster convergence (linear rather than sublinear) compared to stochastic gradient descent, at least for convex optimization problems. See also [26] for an example of averaged SGD with successful empirical speedups in the convex case. Note however that in the pure online case (stream of examples) and under some assumptions, the sublinear rate of convergence of stochastic gradient descent with O(1/t) decrease of learning rate is an optimal rate, at least for convex problems [87]. That would suggest that for really large training sets it may not be possible to obtain better rates than ordinary stochastic gradient descent, albeit the constants in front (which depend on the condition number of the Hessian) may still be greatly reduced by using second-order information online [28, 27].

¹⁸ Think about a ball coming down a valley. Since it has not started from the bottom of the valley it will oscillate between its sides as it settles deeper, forcing the learning rate to be small to avoid large oscillations that would kick it out of the valley. Averaging out the local gradients along the way will cancel the opposing forces from each side of the valley.
¹⁹ Polyak averaging uses for predictions a moving average of the parameters found in the trajectory of stochastic gradient descent.
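A minimal Python/NumPy sketch of SGD with this smoothed-gradient update (grad is a hypothetical function returning the minibatch gradient):

import numpy as np

# SGD with the moving-average gradient g_bar <- (1 - beta)*g_bar + beta*g;
# beta = 1 recovers plain stochastic gradient descent.
def sgd_momentum(theta, batches, grad, eps=0.01, beta=1.0):
    g_bar = np.zeros_like(theta)
    for batch in batches:
        g = grad(theta, batch)                   # instantaneous (minibatch) gradient
        g_bar = (1.0 - beta) * g_bar + beta * g  # temporally smoothed gradient
        theta = theta - eps * g_bar              # update with the smoothed estimator
    return theta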
– Layer-specific optimization hyper-parameters: although rarely done,
it is possible to use different values of optimization hyper-parameters (such
as the learning rate) on different layers of a multi-layer network. This is
especially appropriate (and easier to do) in the context of layer-wise unsu-
pervised pre-training, since each layer is trained separately (while the layers
below are kept fixed). This would be particularly useful when the number of
units per layer varies a lot from layer to layer. See the paragraph below enti-
tled Layer-wise optimization of hyper-parameters (Sec. 19.3.3). Some
researchers also advocate the use of different learning rates for the different
types of parameters one finds in the model, such as biases and weights in the
standard multi-layer network, but the issue becomes more important when
parameters such as precision or variance are included in the lot [38].
Up to now we have only discussed the hyper-parameters in the setup where
one trains a neural network by stochastic gradient descent. With other opti-
mization algorithms, some hyper-parameters are typically different. For exam-
ple, Conjugate Gradient (CG) algorithms typically have a number of line search
steps (which is a hyper-parameter) and a tolerance for stopping each line search
(another hyper-parameter). An optimization algorithm like L-BFGS (limited-
memory Broyden-Fletcher-Goldfarb-Shanno) also has a hyper-parameter con-
trolling the memory usage of the algorithm, the rank of the Hessian approxima-
tion kept in memory, which also has an influence on the efficiency of each step.
Both CG and L-BFGS are iterative (e.g., one line search per iteration), and the
number of iterations can be optimized as described above for stochastic gradient
descent, with early stopping.
19.3.2 Hyper-parameters of the Model and Training Criterion
Let us now turn to “model” and “criterion” hyper-parameters typically found in
neural networks, especially deep neural networks.
– Number of hidden units n_h. Each layer in a multi-layer neural network typically has a size that we are free to set and that controls capacity. Because of early stopping and possibly other regularizers (e.g., weight decay, discussed below), it is mostly important to choose n_h large enough. Larger than optimal values typically do not hurt generalization performance much, but of course they require proportionally more computation (in O(n_h²) if scaling all the layers at the same time in a fully connected architecture). Like for many other hyper-parameters, there is the option of allowing a different value of n_h for each hidden layer²⁰ of a deep architecture. See the paragraph below entitled Layer-wise optimization of hyper-parameters (Sec. 19.3.3). In a large comparative study [70], we found that using the same size for all layers worked generally better than or the same as using a decreasing size (pyramid-like) or increasing size (upside-down pyramid), but of course this may be data-dependent. For most tasks that we worked on, we found that an overcomplete²¹ first hidden layer works better than an undercomplete one. Another even more often validated empirical observation is that the optimal n_h is much larger when using unsupervised pre-training in a supervised neural network, e.g., going from hundreds of units to thousands of units. A plausible explanation is that after unsupervised pre-training many of the hidden units are carrying information that is irrelevant to the specific supervised task of interest. In order to make sure that the information relevant to the task is captured, larger hidden layers are therefore necessary when using unsupervised pre-training.

²⁰ A hidden layer is a group of units that is neither an input layer nor an output layer.
²¹ Larger than the input vector.
– Weight decay regularization coefficient λ. A way to reduce overfitting is to add a regularization term to the training criterion, which limits the capacity of the learner. The parameters of machine learning models can be regularized by pushing them towards a prior value, which is typically 0. L2 regularization adds a term λ Σ_i θ_i² to the training criterion, while L1 regularization adds a term λ Σ_i |θ_i|. Both types of terms can be included. There is a clean Bayesian justification for such a regularization term: it is the negative log-prior −log P(θ) on the parameters θ. The training criterion then corresponds to the negative joint likelihood of data and parameters, −log P(data, θ) = −log P(data|θ) − log P(θ), with the loss function L(z, θ) being interpreted as −log P(z|θ) and −log P(data|θ) = Σ_{t=1}^{T} L(z_t, θ) if the data consists of T i.i.d. examples z_t. This detail is important to note because when one is doing stochastic gradient-based learning, it makes sense to use an unbiased estimator of the gradient of the total training criterion (including both the total loss and the regularizer), but one only considers a single mini-batch or example at a time. How should the regularizer be weighted in this sum, which is different from the sum of the regularizer and the total loss on all examples? On each mini-batch update, the gradient of the regularization penalty should be multiplied not just by λ but also by B/T, i.e., one over the number of updates needed to go once through the training set. When the training set size is not a multiple of B, the last mini-batch will have size B′ < B and the contribution of the regularizer to the mini-batch gradient should therefore be modified accordingly (i.e. scaled by B′/B compared to other mini-batches). In the pure online setting (there is no training set size fixed ahead of time, nor iterating again on the examples), it would then make sense to use B/t at example t, or one over the number of updates to date. L2 regularization penalizes large values more strongly and corresponds to a Gaussian prior ∝ exp(−½‖θ‖²/σ²) with prior variance σ² = 1/(2λ). Note that there is a connection between early stopping (see above, choosing the number of training iterations) and L2 regularization [34], with one basically playing the same role as the other (but early stopping allowing a much more efficient selection of the hyper-parameter value, which suggests dropping L2 regularization altogether when early-stopping is used). However, L1 regularization behaves differently and can sometimes be useful, acting as a form of feature selection. L1 regularization makes sure that parameters that are not really very useful are driven to zero (i.e. encouraging sparsity of the parameter values), and corresponds to a Laplace density prior ∝ e^{−|θ|/s} with scale parameter s = 1/λ. L1 regularization often helps to make the input filters²² cleaner (more spatially localized) and easier to interpret. Stochastic gradient descent will not yield actual zeros but values hovering around zero. If both L1 and L2 regularization are used, a different coefficient (i.e. a different hyper-parameter) should be considered for each, and one may also use a different coefficient for different layers. In particular, the input weights and output weights may be treated differently.
One reason for treating output weights differently (i.e., not relying only
on early stopping) is that we know that it is sufficient to regularize only the
output weights in order to constrain capacity: in the limit case of the number
of hidden units going to infinity, L2 regularization corresponds to Support
Vector Machines (SVM) while L1 regularization corresponds to boosting [12].
Another reason for treating inputs and outputs differently from hidden units
is because they may be sparse. For example, some input features may be 0
most of the time while others are non-zero frequently. In that case, there are
fewer examples that inform the model about that rarely active input feature,
and the corresponding parameters (weights outgoing from the correspond-
ing input units) should be more regularized than the parameters associated
with frequently observed inputs. A similar situation may occur with target
variables that are sparse (e.g., trying to predict rarely observed events). In
both cases, the effective number of meaningful updates seen by these parameters is less than the actual number of updates. This suggests scaling the regularization coefficient of these parameters by one over the effective number of updates seen by the parameter. A related formula turns up in Bayesian probit regression applied to sparse inputs [53]. Some practitioners also choose to penalize only the weights w and not the biases b associated with the hidden unit activations w⊤z + b for a unit taking the vector of values z as input. This guarantees that even with strong regularization, the predictor would converge to the optimal constant predictor, rather than the one corresponding to 0 activation. For example, with the mean-square loss and the cross-entropy loss, the optimal constant predictor is the output average.
²² The input weights of a 1st layer neuron are often called “filters” because of analogies with signal processing techniques such as convolutions.
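As a concrete illustration of the B/T scaling of the regularizer discussed above, here is a minimal Python/NumPy sketch of one mini-batch update with an L2 penalty (grad_loss is a hypothetical function returning the gradient of the average loss on the batch):

import numpy as np

# One SGD update in which the gradient of the L2 penalty lambda*||theta||^2 is
# scaled by B/T, so that a full pass over the T training examples applies the
# penalty exactly once.
def minibatch_update(theta, batch, grad_loss, eps, lam, T):
    B = len(batch)
    g = grad_loss(theta, batch)              # gradient of the average loss on this batch
    g = g + (B / T) * lam * 2.0 * theta      # gradient of the penalty, scaled by B/T
    return theta - eps * g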
– Sparsity of activation regularization coefficient α. A common practice in
the Deep Learning literature [95, 97, 81, 82, 3, 49, 33, 52] consists in adding
a penalty term to the training criterion that encourages the hidden units to
be sparse, i.e., with values at or near 0. Although the L1 penalty (discussed
above in the case of weights) can also be applied to hidden units activations,
this is mathematically very different from the L1 regularization term on
parameters. Whereas the latter corresponds to a prior on the parameters,
the former does not because it involves the training distribution (since we
are looking at data-dependent hidden units outputs). Although we will not
discuss this much here, the inspiration for a sparse representation in Deep
Learning comes from the earlier work on sparse coding [89]. As discussed
in Goodfellow et al. [51] sparse representations may be advantageous because
they encourage representations that disentangle the underlying factors of
representation. A sparsity-inducing penalty is also a way to regularize (in
the sense of reducing the number of examples that the learner can learn
by heart) [97], which means that the sparsity coefficient is likely to interact
with the many other hyper-parameters which influence capacity. In general,
increased sparsity can be compensated by a larger number of hidden units. Several approaches have been proposed to induce a sparse representation (or with more hidden units whose activations are closer to 0). One approach [97, 72, 120] is simply to penalize the L1 norm of the representation or another function of the hidden units’ activation (such as the Student-t log-prior). This typically makes sense for non-linearities such as the sigmoid which have a saturating output around 0, but not for the hyperbolic tangent non-linearity (whose saturation is near the -1 and 1 interval borders rather than near the origin). Another option is to penalize the biases of the hidden units, to make them more negative [95, 81, 51, 69]. Note that penalizing the bias runs the danger that the weights could compensate for the bias²³, which could hurt the numerical optimization of parameters. When directly penalizing the hidden unit outputs, several variants can be found in the literature, but no clear comparative analysis has been published to evaluate which one works better. Although the L1 penalty (i.e., simply α times the sum of output elements h_j in the case of sigmoid non-linearity) would seem the most natural (because of its use in sparse coding), it is used in few papers involving sparse auto-encoders. A close cousin of the L1 penalty is the Student-t penalty, log(1 + h_j²), originally proposed for sparse coding [89]. Several researchers penalize the average output h̄_j (e.g. over a mini-batch), and instead of pushing it to 0, encourage it to approach a fixed target ρ. This can be done through a mean-square error penalty such as Σ_j (ρ − h̄_j)², or maybe more sensibly (because h_j behaves like a probability), a Kullback-Leibler divergence with respect to the binomial distribution with probability ρ: −ρ log h̄_j − (1 − ρ) log(1 − h̄_j) + constant, e.g., with ρ = 0.05, as in [59]. In addition to the regularization penalty itself, the choice of activation function can have a strong impact on the sparsity obtained. In particular, rectifying non-linearities (such as max(0, x), instead of a sigmoid) have been very successful in several instances [64, 86, 49, 84, 50]. The rectifier also relates to the hard tanh [35], whose derivatives are also 0 or 1. In sparse coding and sparse predictive coding [65] the activations are directly optimized and actual zeros are the expected result of the optimization. In that case, ordinary stochastic gradient is not guaranteed to find these zeros (it will oscillate around them) and other methods such as proximal gradient are more appropriate [21].

²³ Because the input to the layer generally has a non-zero average, which, when multiplied by the weights, acts like a bias.
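A minimal Python/NumPy sketch of the KL-style sparsity penalty just described, applied to a mini-batch of sigmoid activations (the names and default values are illustrative only):

import numpy as np

# Sparsity penalty pushing the per-unit average activation h_bar_j (over a
# mini-batch) towards a small target rho, added to the training criterion.
def kl_sparsity_penalty(H, rho=0.05, alpha=0.1, eps=1e-8):
    # H: array of shape (batch_size, n_hidden) of sigmoid activations in (0, 1)
    h_bar = H.mean(axis=0)
    kl = -rho * np.log(h_bar + eps) - (1.0 - rho) * np.log(1.0 - h_bar + eps)
    return alpha * kl.sum()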
– Neuron non-linearity. The typical neuron output is s(a) = s(w⊤x + b), where x is the vector of inputs into the neuron, w the vector of weights and b the offset or bias parameter, while s is a scalar non-linear function. Several non-linearities have been proposed and some choices of non-linearities have been shown to be more successful [64, 48, 49]. The most commonly used by the author, for hidden units, are the sigmoid 1/(1 + e^{−a}), the hyperbolic tangent (e^a − e^{−a})/(e^a + e^{−a}), the rectifier max(0, a) and the hard tanh [35]. Note that the sigmoid was shown to yield serious optimization difficulties when used as the top hidden layer of a deep supervised network [48] without unsupervised pre-training, but works well for auto-encoder variants²⁴. For output (or reconstruction) units, hard neuron non-linearities like the rectifier do not make sense because when the unit is saturated (e.g. a < 0 for the rectifier) and associated with a loss, no gradient is propagated inside the network, i.e., there is no chance to correct the error²⁵. In the case of hidden layers the gradient manages to go through a subset of the hidden units, even if the others are saturated. For output units a good trick is to obtain the output non-linearity and the loss by considering the associated negative log-likelihood and choosing an appropriate (conditional) output probability model, usually in the exponential family. For example, one can typically take squared error and linear outputs to correspond to a Gaussian output model, cross-entropy and sigmoids to correspond to a binomial output model, and −log output[target class] with softmax outputs to correspond to multinomial output variables. For reasons yet to be elucidated, having a sigmoidal non-linearity on the output (reconstruction) units (along with target inputs normalized in the (0,1) interval) seems to be helpful when training the contractive auto-encoder.

²⁴ The author hypothesizes that this discrepancy is due to the fact that the weight matrix W of an auto-encoder of the form r(x) = W⊤sigmoid(Wx) is pulled towards being orthonormal, since this would make the auto-encoder closer to the identity function, because W⊤Wx ≈ x when W is orthonormal and x is in the span of the rows of W.
²⁵ A hard non-linearity for the output units is very different from a hard non-linearity in the loss function, such as the hinge loss. In the latter case the derivative is 0 only when there is no error.
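For reference, the four hidden-unit non-linearities mentioned above, written out in a short Python/NumPy sketch (the hard tanh is taken here as clipping to [−1, 1]):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):                        # (e^a - e^-a) / (e^a + e^-a)
    return np.tanh(a)

def rectifier(a):                   # max(0, a)
    return np.maximum(0.0, a)

def hard_tanh(a):                   # derivative is 0 or 1
    return np.clip(a, -1.0, 1.0)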
– Weights initialization scaling coefficient. Biases can generally be initialized to zero but weights need to be initialized carefully to break the symmetry between hidden units of the same layer²⁶. Because different output units receive different gradient signals, this symmetry breaking issue does not concern the output weights (into the output units), which can therefore also be set to zero. Although several tricks [79, 48] for initializing the weights into hidden layers have been proposed (i.e. a hyper-parameter is the discrete choice between them), Bergstra and Bengio [17] also inserted as an extra hyper-parameter a scaling coefficient for the initialization range. These tricks are based on the idea that units with more inputs (the fan-in of the unit) should have smaller weights. Both LeCun et al. [79] and Glorot and Bengio [48] recommend scaling by the inverse of the square root of the fan-in, although Glorot and Bengio [48] and the Deep Learning Tutorials use a combination of the fan-in and fan-out, e.g., sample a Uniform(−r, r) with r = √(6/(fan-in + fan-out)) for hyperbolic tangent units and r = 4√(6/(fan-in + fan-out)) for sigmoid units. We have found that we could avoid any hyper-parameter related to initialization using these formulas (and the derivation in Glorot and Bengio [48] can be used to derive the formula for other settings). Note however that in the case of RBMs, a zero-mean Gaussian with a small standard deviation around 0.1 or 0.01 works well [59] to initialize the weights, while visible biases are typically set to their optimal value if the weights were 0, i.e., log(x̄/(1 − x̄)) in the case of a binomial visible unit whose corresponding binary input feature has empirical mean x̄ in the training set.

²⁶ By symmetry, if hidden units of the same layer share the same input and output weights, they will compute the same output and receive the same gradient, hence performing the same update and remaining identical, thus wasting capacity.
An important choice is whether one should use unsupervised pre-training
(and which unsupervised feature learning algorithm to use) in order to ini-
tialize parameters. In most settings we have found unsupervised pre-training
to help and very rarely to hurt, but of course that implies additional training
time and additional hyper-parameters.
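A minimal Python/NumPy sketch of the initialization formulas quoted above (the function and argument names are illustrative):

import numpy as np

# Sample weights from Uniform(-r, r) with r depending on fan-in and fan-out,
# using the tanh formula by default and the 4x larger range for sigmoid units;
# biases are set to zero.
def init_layer(fan_in, fan_out, unit_type='tanh', rng=np.random):
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if unit_type == 'sigmoid':
        r *= 4.0
    W = rng.uniform(-r, r, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b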
– Random seeds. There are often several sources of randomness in the train-
ing of neural networks and deep learners (such as for random initializa-
tion, sampling examples, sampling hidden units in stochastic models such
as RBMs, or sampling corruption noise in denoising auto-encoders). Some
random seeds could therefore yield better results than others. Because of the
presence of local minima in the training criterion of neural networks (except
in the linear case or with fixed lower layers), parameter initialization matters.
See Erhan et al. [44] for an example of histograms of test errors for hundreds
of different random seeds. Typically, the choice of random seed only has a
slight effect on the result and can mostly be ignored in general or for most
of the hyper-parameter search process. If computing power is available, then
a final set of jobs with different random seeds (5 to 10) for a small set of
best choices of hyper-parameter values can squeeze a bit more performance.
Another way to exploit computing power to push performance a bit is model
averaging, as in Bagging [29] and Bayesian methods. After training them, the outputs of different networks (or in general different learning algorithms)
can be averaged. For example, the difference between the neural networks
being averaged into a committee may come from the different seeds used for
parameter initialization, or the use of different subsets of input variables, or
different subsets of training examples (the latter being called Bagging).
– Preprocessing. Many preprocessing steps have been proposed to massage raw data into appropriate inputs for neural networks and model selection must also choose among them. In addition to element-wise standardization (subtract mean and divide by standard deviation), Principal Components Analysis (PCA) has often been advocated [79, 17] and also allows dimensionality reduction, at the price of an extra hyper-parameter (the number of principal components retained, or the proportion of variance explained). A convenient non-linear preprocessing is the uniformization [84] of each feature (which estimates its cumulative distribution F_i and then transforms each feature x_i by its quantile F_i⁻¹(x_i), i.e., returns an approximate normalized rank or quantile for the value x_i). A simpler to compute transform that may help reduce the tails of input features is a non-linearity such as the logarithm or the square root, in an attempt to make them more Gaussian-like.
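A minimal Python/NumPy sketch of two of these preprocessing steps, element-wise standardization and a rank-based uniformization of each feature (column) of a design matrix X:

import numpy as np

def standardize(X):
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    return (X - mu) / sigma

def uniformize(X):
    # Replace each value by its normalized rank within its column, an empirical
    # approximation of mapping x_i through its cumulative distribution.
    n = X.shape[0]
    ranks = X.argsort(axis=0).argsort(axis=0)   # ranks 0 .. n-1 per column
    return (ranks + 0.5) / n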
In addition to the above somewhat generic choices, more choices arise with dif-
ferent architectures and learning algorithms. For example, the denoising auto-
encoder has a hyper-parameter scaling the amount of input corruption and the
contractive auto-encoder has as hyper-parameter a coefficient scaling the norm of
the Jacobian of the encoder, i.e., controlling the importance of the contraction
penalty. The latter seems to be a rather sensitive hyper-parameter that must
be tuned carefully. The contractive auto-encoder’s success also seems sensitive
to the weight tying constraint used in many auto-encoder architectures: the
decoder’s weight matrix is equal to the transpose of the encoder’s weight ma-
trix. The specific architecture used in the contractive auto-encoder (with tied
weights, sigmoid non-linearities on hidden and reconstruction units, along with
squared loss or cross-entropy loss) works quite well but other related variants do
not always train well, for reasons that remain to be understood.
There are also many architectural choices that are relevant in the case of
convolutional architectures (e.g. for modeling images, time-series or sound) [78,
80, 71] in which hidden units have local receptive fields.
19.3.3 Manual Search and Grid Search
Many of the hyper-parameters or model choices described above can be ignored
by picking a standard trick suggested here or in some other paper. Still, one
remains with a substantial number of choices to be made, which may give the
impression of neural network training as an art. With modern computing fa-
cilities based on large computer clusters, it is however possible to make the
optimization of hyper-parameters a more reproducible and automated process,
using techniques such as grid search or better, random search, or even hyper-
parameter optimization, discussed below.
General Guidance for the Exploration of Hyper-parameters. First of
all, let us consider recommendations for exploring hyper-parameter settings,
whether with manual search, with an automated procedure, or with a combi-
nation of both. We call a numerical hyper-parameter one that involves choosing
a real number or an integer (where order matters), as opposed to making a
discrete symbolic choice from an unordered set. Examples of numerical hyper-
parameters are regularization coefficients, number of hidden units, number of
training iterations, etc. One has to think of hyper-parameter selection as a difficult form of learning: there is both an optimization problem (looking for hyper-
parameter configurations that yield low validation error) and a generalization
problem: there is uncertainty about the expected generalization after optimizing
validation performance, and it is possible to overfit the validation error and get
optimistically biased estimators of performance when comparing many hyper-
parameter configurations. The training criterion for this learning is typically the
validation set error, which is a proxy for generalization error. Unfortunately,
the relation between hyper-parameters and validation error can be complicated.
Although to first approximation we expect a kind of U-shaped curve (when con-
sidering only a single hyper-parameter, the others being fixed), this curve can
also have noisy variations, in part due to the use of finite data sets.
Best value on the border. When considering the validation error ob-
tained for different values of a numerical hyper-parameter one should pay
attention as to whether or not the best value found is near the border of the
investigated interval. If it is near the border, then this suggests that better
values can be found with values beyond the border: it is recommended in
that case to explore further, beyond that border. Because the relation be-
tween a hyper-parameter and validation error can be noisy, it is generally
not enough to try very few values. For instance, trying only 3 values for a
numerical hyper-parameter is insufficient, even if the best value found is the
middle one.
Scale of values considered. Exploring values of a numerical
hyper-parameter entails choosing a starting interval to be searched, which
is therefore a kind of hyper-hyper-parameter. By choosing the interval large
enough to start with, but based on previous experience with this hyper-
parameter, we ensure that we do not get completely wrong results. Now
instead of choosing the intermediate values linearly in the chosen interval,
it often makes much more sense to consider a linear or uniform sampling
in the log-domain (in the space of the logarithm of the hyper-parameter).
For example, the results obtained with a learning rate of 0.01 are likely to
be very similar to the results with 0.011 while results with 0.001 could be
quite different from results with 0.002 even though the absolute difference is
the same in both cases. The ratio between different values is often a better
guide of the expected impact of the change. That is why exploring uniformly
or regularly-spaced values in the space of the logarithm of the numerical
hyper-parameter is typically preferred for positive-valued numerical hyper-
parameters.
Computational considerations. Validation error is actually not the only
measure to consider in selecting hyper-parameters. Often, one has to con-
sider computational cost, either of training or prediction. Computing re-
sources for training and prediction are limited and generally condition the
choice of intervals of considered values: for example increasing the number
of hidden units or number of training iterations also scales up computation.
An interesting idea is to use computationally cheap estimators of validation
error to select some hyper-parameters. For example, Saxe et al. [105] showed
that the architecture hyper-parameters of convolutional networks could be
selected using random weights in the lower layers of the network (filters of
the convolution). While this yields a noisy and biased (pessimistic) esti-
mator of the validation error which would otherwise be obtained with full
training, this cheap estimator appears to be correlated with the expensive
validation error. Hence this cheap estimator is enough for selecting some
hyper-parameters (or for keeping under consideration for further and more
expensive evaluation only the few best choices found). Even without cheap
estimators of generalization error, high-throughput computing (e.g., on clus-
ters, GPUs, or clusters of GPUs) can be exploited to run not just hundreds
but thousands of training jobs, something not conceivable only a few years
ago, with each job taking on the order of hours or days for larger datasets.
With computationally cheap surrogates, some researchers have run on the order of ten thousand trials, and we can expect future advances in parallelized computing power to boost these numbers.
Coordinate Descent and Multi-resolution Search. When performing a
manual search and with access to only a single computer, a reasonable strategy
is coordinate descent: change only one hyper-parameter at a time, always making
a change from the best configuration of hyper-parameters found up to now.
Instead of a standard coordinate descent (which systematically cycles through
all the variables to be optimized) one can make sure to regularly fine-tune the
most sensitive variables, such as the learning rate.
Another important idea is that there is no point in exploring the effect of
fine changes before one or more reasonably good settings have been found. The
idea of multi-resolution search is to start the search by considering only a few
values of the numerical hyper-parameters (over a large range), or considering
large changes each time a new value is tried. One can then start from the one
or few best configurations found and explore more locally around them with
smaller variations around these values.
Automated and Semi-automated Grid Search. Once some interval or set
of values has been selected for each hyper-parameter (thus defining a search
space), a simple strategy that exploits parallel computing is the grid search.
One first needs to convert the numerical intervals into lists of values (e.g., K
regularly-spaced values in the log-domain of the hyper-parameter). The grid
search is simply an exhaustive search through all the combinations of these
values. The cross-product of these lists contains a number of elements that
is unfortunately exponential in the number of hyper-parameters (e.g., with 5 hyper-parameters, each allowed to take 6 different values, one gets 6⁵ = 7776
configurations). In section 19.3.4 below we consider an approach that works more
efficiently than the grid search when the number of hyper-parameters increases
beyond 2 or 3.
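A minimal Python sketch of such a grid search over the cross-product of value lists (train_and_validate is a hypothetical function that trains one model and returns its validation error; each call is an independent job that could be dispatched to a cluster):

import itertools

grid = {
    'learning_rate': [1e-4, 1e-3, 1e-2, 1e-1],   # regularly spaced in the log-domain
    'n_hidden':      [100, 400, 1600],
    'weight_decay':  [0.0, 1e-5, 1e-3],
}

def grid_search(grid, train_and_validate):
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        err = train_and_validate(config)
        if best is None or err < best[0]:
            best = (err, config)
    return best   # (best validation error, best configuration)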
The advantage of the grid search, compared to many other optimization strate-
gies (such as coordinate descent), is that it is fully parallelizable. If a large com-
puter cluster is available, it is tempting to choose a model selection strategy that
can take advantage of parallelization. One practical disadvantage of grid search
(especially against random search, Sec. 19.3.4), with a parallelized set of jobs on
a cluster, is that if only one of the jobs fails²⁷ then one has to launch another volley of jobs to complete the grid (and yet a third one if any of these fails, etc.), thus multiplying the overall computing time.

²⁷ For all kinds of hardware and software reasons, a job failing is very common.
Typically, a single grid search is not enough and practitioners tend to pro-
ceed with a sequence of grid searches, each time adjusting the ranges of values
considered based on the previous results obtained. Although this can be done
manually, this procedure can also be automated by considering the idea of multi-
resolution search to guide this outer loop. Different, more local, grid searches can
be launched in the neighborhood of the best solutions found previously. In ad-
dition, the idea of coordinate descent can also be thrown in, by making each
grid search focus on only a few of the hyper-parameters. For example, it is com-
mon practice to start by exploring the initial learning rate while keeping fixed
(and initially constant) the learning rate descent schedule. Once the shape of
the schedule has been chosen, it may be possible to further refine the learning
rate, but in a smaller interval around the best value found.
Humans can get very good at performing hyper-parameter search, and having
a human in the loop also has the advantage that it can help detect bugs or
unwanted or unexpected behavior of a learning algorithm. However, for the sake
of reproducibility, machine learning researchers should strive to use procedures
that do not involve human decisions in the middle, only at the outset (e.g.,
setting hyper-parameter ranges, which can be specified in a paper describing the
experiments).
Layer-Wise Optimization of Hyper-parameters. In the case of Deep Learn-
ing with unsupervised pre-training there is an opportunity for combining coor-
dinate descent and cheap relative validation set performance evaluation associ-
ated with some hyper-parameter choices. The idea, described by Mesnil et al.
[84], Bengio [8], is to perform greedy choices for the hyper-parameters associ-
ated with lower layers (near the input) before training the higher layers. One
first trains (unsupervised) the first layer with different hyper-parameter values
and somehow estimates the relative validation error that would be obtained from
these different configurations if the final network only had this single layer as
internal representation. In the common case where the ultimate task is super-
vised, it means training a simple supervised predictor (e.g. a linear classifier) on
top of the learned representation. In the case of a linear predictor (e.g. regres-
sion or logistic regression) this can even be done on the fly while unsupervised
training of the representation progresses (i.e. can be used for early stopping as
well), as in [70]. Once a set of apparently good (according to this greedy evalu-
ation) hyper-parameters values has been found (or possibly using only the best
one found), these good values can be used as starting point to train (and hyper-
optimize) a second layer in the same way, etc. The completelygreedyapproachis
to keep only the best configuration up to now (for the lower layers), but keeping
the K best configurations overall only multiplies computational costs of hyper-
parameter selection by K for layers beyond the first one, because we would
still keep only the best K configurations from all the 1st layer and 2nd layer
hyper-parameters as starting points for exploring 3rd layer hyper-parameters,
etc. This procedure is formalized in the Algorithm 19.1 below. Since greedy
layer-wise pre-training does not modify the lower layers when pre-training the
upper layers, this is also very efficient computationally. This procedure allows
one to set the hyper-parameters associated with the unsupervised pre-training
stage, and then there remains hyper-parameters to be selected for the supervised
fine-tuning stage, if one is desired. A final supervised fine-tuning stage is strongly
suggested, especially when there are many labeled examples [67].
19.3.4 Random Sampling of Hyper-parameters
A serious problem with the grid search approach to find good hyper-parameter
configurations is that it scales exponentially badly with the number of hyper-
parameters considered. In the above sections we have discussed numerous hyper-
parameters and if all of them were to be explored at the same time it would be
impossible to use only a grid search to do so.
One may think that there are no other options simply because this is an in-
stance of the curse of dimensionality. But like we have found in our work on Deep
Learning [7], if there is some structure in a target function we are trying to dis-
cover, then there is a chance to find good solutions without paying an exponential
price. It turns out that in many practical cases we have encountered, there is
a kind of structure that random sampling can exploit [17]. The idea of random
sampling is to replace the regular grid by a random (typically uniform) sam-
pling. Each tested hyper-parameter configuration is selected by independently
sampling each hyper-parameter from a prior distribution (typically uniform in
the log-domain, inside the interval of interest). For a discrete hyper-parameter,
a multinomial distribution can be defined according to our prior beliefs on the
likely good values. At worst, i.e., with no prior preference at all, this would be
a uniform distribution across the allowed values. In fact, we can use our prior
knowledge to make this prior distribution quite sophisticated. For example, we
can readily include knowledge that some values of some hyper-parameters only
make sense in the context of other particular values of hyper-parameters. This
is a practical consideration for example when considering layer-specific hyper-
parameters when the number of layers itself is a hyper-parameter.
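A minimal Python/NumPy sketch of such random sampling, with a numerical hyper-parameter drawn log-uniformly, a discrete choice drawn from a prior over values, and layer-specific sizes drawn only for the sampled number of layers (all names and intervals here are illustrative):

import numpy as np

def sample_config(rng=np.random):
    n_layers = int(rng.choice([1, 2, 3]))
    return {
        'learning_rate': float(10 ** rng.uniform(-6, 0)),    # log-uniform in [1e-6, 1]
        'n_hidden': [int(rng.choice([200, 500, 1000, 2000]))
                     for _ in range(n_layers)],               # one size per layer
        'activation': str(rng.choice(['tanh', 'rectifier', 'sigmoid'])),
    }

# Each sampled configuration is an independent training job; more jobs can be
# added later without invalidating those already run.
configs = [sample_config() for _ in range(100)]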
Algorithm 19.1 Greedy layer-wise hyper-parameter optimization.
input K: number of best configurations to keep at each level.
input NLEVELS: number of levels of the deep architecture
input LEVELSETTINGS: list of hyper-parameter settings to be considered for unsupervised pre-training of a level
input SFTSETTINGS: list of hyper-parameter settings to be considered for supervised fine-tuning
Initialize set of best configurations S = ∅
for L = 1 to NLEVELS do
  for C in LEVELSETTINGS do
    for H in (S or {∅}) do
      * Pretrain level L using hyper-parameter setting C for level L and the parameters obtained with setting H for lower levels.
      * Evaluate target task performance L using this depth-L pre-trained architecture (e.g. train a linear classifier on top of these layers and estimate validation error).
      * Push the pair (C ∪ H, L) into S if it is among the K best performing of S.
    end for
  end for
end for
for C in SFTSETTINGS do
  for H in S do
    * Supervised fine-tuning of the pre-trained architecture associated with H, using supervised fine-tuning hyper-parameter setting C.
    * Evaluate target task performance L′ of this fine-tuned predictor (e.g. validation error).
    * Push the pair (C ∪ H, L′) into S if it is among the K best performing of S.
  end for
end for
output S: the set of K best-performing models with their settings and validation performance.
The experiments performed [17] show that random sampling can be many
times more efficient than grid search as soon as the number of hyper-parameters
goes beyond the 2 or 3 typically seen with SVMs and vanilla neural networks.
The main reason why faster convergence is observed is because it allows one to
explore more values for each hyper-parameter, whereas in grid search, the same
value of a hyper-parameter is repeated in exponentially many configurations (of
all the other hyper-parameters). In particular, if only a small subset of the hyper-
parameters really matters, then this procedure can be shown to be exponentially
more efficient. What we found is that for different datasets and architectures, the
subset of hyper-parameters that mattered most was different, but it was often
the case that a few hyper-parameters made a big difference (and the learning rate
is always one of them!). When marginalizing (by averaging or minimizing) the
validation performance to visualize the effect of one or two hyper-parameters,
we get a noisier picture using a random search compared to a grid search, because of the random variations of the other hyper-parameters, but one with much more resolution, because so many more different values have been considered. Practically, one can plot the curves of best validation error as the number of random trials²⁸ is increased (with mean and standard deviation, obtained by
considering, for each choice of number of trials, all possible same-size subsets of
trials), and this curve tells us that we are approaching a plateau, i.e., it tells
us whether it is worth it or not to continue launching jobs, i.e., we can per-
form a kind of early stopping in the outer optimization over hyper-parameters.
Note that one should distinguish the curve of the “best trial in first N trials”
with the curve of the mean (and standard deviation) of the “best in a subset
of size N”. The latter is a better statistical representative of the improvements
we should expect if we increase the number of trials. Even if the former has a
plateau, the latter may still be on the increase, pointing for the need to more
hyper-parameter configuration samples, i.e., more trials [17]. Comparing these
curves with the equivalent obtained from grid search we see faster convergence
with random search. On the other hand, note that one advantage of grid search
compared to random sampling is that the qualitative analysis of results is eas-
ier because one can consider variations of a single hyper-parameter with all the
other hyper-parameters being fixed. It may remain a valid option to do a small
grid search around the best solutions found by random search, considering only
the hyper-parameters that were found to matter or which concern a scientific
question of interest²⁹.

²⁸ Each random trial corresponding to a training job with a particular choice of hyper-parameter values.
²⁹ This is often the case in machine learning research, e.g., does depth of architecture matter? Then we need to control accurately for the effect of depth, with all other hyper-parameters optimized for each value of depth.
Random search maintains the advantage of easy parallelization provided by
grid search and improves on it. Indeed, a practical advantage of random search
compared to grid search is that if one of the jobs fails then there is no need to
re-launch that job. It also means that if one has launched 100 random search
jobs, and finds that the convergence curve still has an interesting slope, one can
launch another 50 or 100 without wasting the first 100. It is not that simple to
combine the results of two grid searches because they are not always compatible
(i.e., one is not a subset of the other).
Finally, although random search is a useful addition to the toolbox of the
practitioner, semi-automatic exploration is still helpful and one will often iterate
between launching a new volley of jobs and analysis of the results obtained with
the previous volley in order to guide model design and research. What we need
is more, and more efficient, automation of hyper-parameter optimization. There
are some interesting steps in this direction [62, 19, 63, 109] but much more needs
to be done.
19.4 Debugging and Analysis
19.4.1 Gradient Checking and Controlled Overfitting
A very useful debugging step consists in verifying that the implementation of
the gradient ∂L/∂θ is compatible with the computation of L as a function of θ. If the analytically computed gradient does not match the one obtained by a finite difference approximation, this signals that a bug is probably present somewhere. First of all, looking at for which i one gets an important relative change between ∂L/∂θ_i and its finite difference approximation, we can get hints as to where the problem may be. An error in sign is particularly troubling, of course. A good next step is then to verify in the same way intermediate gradients ∂L/∂a, with a some quantities that depend on the faulty θ, such as intervening neuron activations.

As many researchers know, the gradient can be approximated by a finite difference approximation obtained from the first-order Taylor expansion of a scalar function f with respect to a scalar argument x:

    ∂f(x)/∂x = (f(x + ε) − f(x))/ε + o(ε).

But a less known fact is that a second order approximation can be achieved by considering the following alternative formula:

    ∂f(x)/∂x = (f(x + ε) − f(x − ε))/(2ε) + o(ε²).

The second order terms of the Taylor expansions of f(x + ε) and f(x − ε) cancel each other because they are even, leaving only 3rd or higher order terms, i.e., o(ε²) error after dividing the difference by ε. Hence this formula is twice as expensive (not a big deal while debugging) but provides quadratically more precision.
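A minimal Python/NumPy sketch of such a gradient check using the centered formula (f maps a 1-D parameter vector to a scalar loss and grad_f is the analytic gradient being verified; both are hypothetical placeholders):

import numpy as np

# Compare the analytic gradient to a centered finite-difference estimate and
# report the worst relative difference and the index where it occurs.
def check_gradient(f, grad_f, theta, eps=1e-4):
    analytic = grad_f(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        numeric[i] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    rel = np.abs(analytic - numeric) / (np.abs(analytic) + np.abs(numeric) + 1e-12)
    return rel.max(), int(rel.argmax())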
Note that because of finite precision in the computation, there will be a differ-
ence between the analytic (even correct) and finite difference gradient. Contrary
to naive expectations, the relative difference may grow if we choose an ε that is
too small, i.e., the error should first decrease as ε is decreased, and then may
worsen when numerical precision kicks in, due to non-linearities. We have often
used a value of ε = 10⁻⁴ in neural networks, a value that is sufficiently small to detect most bugs.
Once the gradient is known to be well computed, another sanity check is that
gradient descent (or any other gradient-based optimization) should be able to
overfit on a small training set
30
. In particular, to factor out effects of SGD hyper-
parameters, a good sanity check for the code (and the other hyper-parameters)
is to verify that one can overfit on a small training set using a powerful second
order method such as L-BFGS. For any optimizer, though, as the number of
examples is increased, the degradation of training error should be gradual while
validation error should improve. And one typically sees the advantages of SGD
over batch second-order methods like L-BFGS increase as the training set size
increases. The break-even point may depend on the task, parallelization (multi-
core or GPU, see Sec. 19.5 below), and architecture (number of computations compared to number of parameters, per example).

³⁰ In principle, bad local minima could prevent that, but in the overfitting regime, e.g., with more hidden units than examples, the global minimum of the training error can generally be reached almost surely from random initialization, presumably because the training criterion becomes convex in the parameters that suffice to get the training error to zero [12], i.e., the output weights of the neural network.
Of course, the real goal of learning is to achieve good generalization error,
and the latter can be estimated by measuring performance on an independent
test set. When test error is considered too high, the first question to ask is
whether it is because of a difficulty in optimizing the training criterion or because
of overfitting. Comparing training error and test error (and how they change
as we change hyper-parameters that influence capacity, such as the number of
training iterations) helps to answer that question. Depending on the answer, of
course, the appropriate ways to improve test error are different. Optimization
difficulties can be fixed by looking for bugs in the training code, inappropriate
values of optimization hyper-parameters, or simply insufficient capacity (e.g.
not enough degrees of freedom, hidden units, embedding sizes, etc.). Overfitting
difficulties can be addressed by collecting more training data, introducing more
or better regularization terms, multi-task training, unsupervised pre-training,
unsupervised term in the training criterion, or considering different function
families (or neural network architectures). In a multi-layer neural network, both
problems can be simultaneously present. For example, as discussed in Bengio
et al. [14], Bengio [7], it is possible to have zero training error with a large top-
level hidden layer that allows the output layer to overfit, while the lower layers
are not doing a good job of extracting useful features because they were not
properly optimized.
Unless using a framework such as Theano which automatically handles the
efficient allocation of buffers for intermediate results, it is important to pay
attention to such buffers in the design of the code. The first objective is to
avoid memory allocation in the middle of the training loop, i.e., all memory
buffers should be allocated once and for all. Careless reuse of the same memory
buffers for different uses can however lead to bugs, which can be checked, in
the debugging phase, by initializing buffers to the NaN (Not-A-Number) value,
which propagates into downstream computation (making it easy to detect that
uninitialized values were used; see footnote 31).

Footnote 30: In principle, bad local minima could prevent that, but in the overfitting
regime, e.g., with more hidden units than examples, the global minimum of the training
error can generally be reached almost surely from random initialization, presumably
because the training criterion becomes convex in the parameters that suffice to get
the training error to zero [12], i.e., the output weights of the neural network.
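A minimal sketch of the NaN-initialization trick just described (the buffer names
and shapes are made up for illustration): pre-allocate all intermediate buffers
once, and fill them with NaN at the start of each debugging run so that any read
of an uninitialized value contaminates downstream results.

import numpy as np

batch_size, n_in, n_hid = 128, 784, 500

# Allocate all intermediate buffers once, outside the training loop.
buffers = {
    "hidden_act": np.empty((batch_size, n_hid)),
    "hidden_grad": np.empty((batch_size, n_hid)),
    "weight_grad": np.empty((n_in, n_hid)),
}

def reset_for_debugging(bufs):
    # Fill every buffer with NaN; any computation that reads a value that
    # was never written will then produce NaN outputs, which are easy to
    # detect with np.isnan checks on losses or gradients.
    for b in bufs.values():
        b.fill(np.nan)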
19.4.2 Visualizations and Statistics
The most basic statistics that should be measured during training are error
statistics. The average loss on the training set and the validation set and their
evolution during training are very useful to monitor progress and differentiate
overfitting from poor optimization. To make comparisons easier, it may be useful
to compare neural networks during training in terms of their “age” (number of
updates made times mini-batch size B, i.e., number of examples visited) rather
than in terms of number of epochs (which is very sensitive to the training set
size).
When using unsupervised training to learn the first few layers of a deep ar-
chitecture, a very common debugging and analysis tool is the visualization of
filters, i.e., of the weight vectors associated with individual hidden units. This is
simplest in the case of the first layer and where the inputs are images (or image
patches), time-series, or spectrograms (all of which are visually interpretable).
Several recipes have been proposed to extend this idea to visualize the preferred
input of hidden units in layers that follow the first one [81, 43]. In the case of the
first layer, since one often obtains Gabor filters, a parametric fit of these filters to
the weight vector can be done so as to visualize the distribution of orientations,
positions and scales of the learned filters. An interesting special case of visualiz-
ing first-layer weights is the visualization of word embeddings (see Section 19.5.3
below) using a dimensionality reduction technique such as t-SNE [113].
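As a minimal sketch of this kind of visualization (assuming a first-layer weight
matrix W of shape n_inputs × n_hidden and 28×28 image inputs; matplotlib is
used for display, and all names are illustrative):

import numpy as np
import matplotlib.pyplot as plt

def plot_first_layer_filters(W, img_shape=(28, 28), n_cols=10):
    # Each column of W is the weight vector ("filter") of one hidden unit;
    # reshape it back to the input image shape and display it.
    n_filters = W.shape[1]
    n_rows = int(np.ceil(n_filters / n_cols))
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))
    axes = np.atleast_1d(axes).ravel()
    for i, ax in enumerate(axes):
        ax.axis("off")
        if i < n_filters:
            ax.imshow(W[:, i].reshape(img_shape), cmap="gray")
    plt.show()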
An extension of the idea of visualizing filters (which can apply to non-linear or
deeper features) is that of visualizing local (around the given test point) leading
tangent vectors, i.e., the main directions in input space to which the represen-
tation (at a given layer) is most sensitive [100].
In the case where the inputs are not images or easily visualizable, or to get
a sense of the weight values in different hidden units, Hinton diagrams [58] are
also very useful, using small squares whose color (black or white) indicates a
weight’s sign and whose area represents its magnitude.
Another way to visualize what has been learned by an unsupervised (or joint
label-input) model is to look at samples from the model. Sampling procedures
have been defined at the outset for RBMs, Deep Belief Nets, and Deep Boltz-
mann Machines, for example based on Gibbs sampling. When weights become
larger, mixing between modes can become very slow with Gibbs sampling. An
interesting alternative is rates-FPCD [112, 30] which appears to be more robust
to this problem and generally mixes faster, but at the cost of losing theoretical
guarantees.
Footnote 31: Personal communication from David Warde-Farley, who learned this
trick from Sam Roweis.
In the case of auto-encoder variants, it was not clear until recently whether
they were really capturing the underlying density (since they are not optimized
with respect to the maximum likelihood principle or an approximation of it). It
was therefore even less clear if there existed appropriate sampling algorithms for
auto-encoders, but a recent proposal for sampling from contractive auto-encoders
appears to be working very well [101], based on arguments about the geometric
interpretation of the first derivative of the encoder [16], showing that denoising
and contractive auto-encoders capture local moments (first and second) of the
training density.
To get a sense of what individual hidden units represent, it has also been
proposed to vary only one unit while keeping the others fixed, e.g., to the value
obtained by finding the hidden units representation associated with a particular
input example.
Another interesting technique is the visualization of the learning trajectory in
function space [44]. The idea is to associate the function (as opposed to simply
the parameters) computed by a neural network with a low-dimensional (2-D
or 3-D) representation, e.g., with the t-SNE [113] or Isomap [111] algorithms,
and then plot the evolution of this function during training, or the population
of such trajectories for different initializations. This provides a visualization of
effective local minima (footnote 32) and shows that no two different random
initializations ended up in the same effective local minimum.
Finally, another useful type of visualization is to display statistics (e.g., his-
togram, mean and standard deviation) of activations (inputs and outputs of the
non-linearities at each layer), activation gradients, parameters and parameter
gradients, by groups (e.g. different layers, biases vs weights) and across training
iterations. See Glorot and Bengio [48] for a practical example. A particularly in-
teresting quantity to monitor is the discriminative ability of the representations
learnt at each layer, as discussed in [85], and ultimately leading to an analysis of
the disentangled factors captured by the different layers as we consider deeper
architectures.
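A minimal sketch of such monitoring, printing a few statistics per group of
values; the group names and the call sites in the training loop are hypothetical:

import numpy as np

def summarize(name, arr):
    # Print simple statistics for one group of values (activations,
    # gradients, weights or biases of a given layer).
    arr = np.asarray(arr).ravel()
    print(f"{name}: mean={arr.mean():.4f} std={arr.std():.4f} "
          f"min={arr.min():.4f} max={arr.max():.4f}")

# Hypothetical usage inside the training loop, every few hundred updates:
# summarize("layer1/activations", h1)
# summarize("layer1/weight_grads", dW1)
# summarize("layer1/weights", W1)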
19.5 Other Recommendations
19.5.1 Multi-core Machines, BLAS and GPUs
Matrix operations are the most time-consuming in efficient implementations of
many machine learning algorithms and this is particularly true of neural net-
works and deep architectures. The basic operations are matrix-vector products
(forward propagation and back-propagation) and vector times vector outer prod-
ucts (resulting in a matrix of weight gradients). Matrix-matrix multiplications
can be done substantially faster than the equivalent sequence of matrix-vector
products for two reasons: by smart caching mechanisms such as implemented in
the BLAS library (which is called from many higher-level environments such as
python’s numpy and Theano, Matlab, Torch or Lush), and thanks to parallelism.
Appropriate versions of BLAS can take advantage of multi-core machines to
distribute these computations across cores. The speed-up is however generally a
fraction of the total speedup one can hope for (e.g. 4× on a 4-core machine),
because of communication overheads and because not all computation is
parallelized. Parallelism becomes more efficient when the sizes of these matrices
are increased, which is why mini-batch updates can be computationally
advantageous, and more so when more cores are present.

Footnote 32: It is difficult to know for sure whether it is a true local minimum or
whether it only appears to be one because the optimization algorithm is stuck.
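Returning to mini-batches, the following small numpy sketch illustrates the
point: the per-example matrix-vector products of a layer's forward pass can be
replaced by a single matrix-matrix product over the whole mini-batch, which a
BLAS-backed library executes much more efficiently (all sizes are arbitrary):

import numpy as np

W = np.random.randn(1000, 500)      # weights of one layer
X = np.random.randn(128, 1000)      # a mini-batch of 128 examples

# Per-example matrix-vector products (slow: 128 separate BLAS calls).
H_loop = np.stack([x @ W for x in X])

# One matrix-matrix product over the mini-batch (fast: a single GEMM).
H_batch = X @ W

assert np.allclose(H_loop, H_batch)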
The extreme multi-core machines are the GPUs (Graphics Processing Units),
with hundreds of cores. Unfortunately, they also come with constraints and spe-
cialized compilers which make it more difficult to fully take advantage of their
potential. On 512-core machines, we are routinely able to get speed-ups of 4×
to 40× for large neural networks. To make the use of GPUs practical, it re-
ally helps to use existing libraries that efficiently implement computations on
GPUs. See Bergstra et al. [18] for a comparative study of the Theano library
(which compiles numpy-like code for GPUs). One practical issue is that only the
GPU-compiled operations will typically be done on the GPU, and that trans-
fers between the GPU and CPU considerably slow things down. It is important
to use a profiler to find out what is done on the GPU and how efficient these
operations are in order to quickly invest one’s time where needed to make an
implementation GPU-efficient and keep most operations on the GPU card.
19.5.2 Sparse High-Dimensional Inputs
Sparse high-dimensional inputs can be efficiently handled by traditional super-
vised neural networks by using a sparse matrix multiplication. Typically, the
input is a sparse vector while the weights are in a dense matrix, and one should
use an efficient implementation made for just this case in order to optimally take
advantage of sparsity. There is still going to be an overhead on the order of 2×
or more (on the multiply-add operations, not the others) compared to a dense
implementation of the matrix-vector product.
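A minimal sketch of such a sparse-input forward product using scipy.sparse
(sizes and density are made up for illustration):

import numpy as np
import scipy.sparse as sp

n_examples, n_inputs, n_hidden = 64, 50000, 100

# Sparse high-dimensional inputs (e.g., bag-of-words counts) in CSR format.
X = sp.random(n_examples, n_inputs, density=0.001, format="csr")
W = np.random.randn(n_inputs, n_hidden) * 0.01

# The sparse-dense product only touches the non-zero input entries.
H = X @ W            # dense result of shape (n_examples, n_hidden)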
For many unsupervised learning algorithms there is unfortunately a difficulty.
The computation for these learning algorithms usually involves some kind of
reconstruction of the input (like for all auto-encoder variants, but also for RBMs
and sparse coding variants), as if the inputs were in the output space of the
learner. Two exceptions to this problem are semi-supervised embedding [117]
and Slow Feature Analysis [119, 20]. The former pulls the representation of
nearby examples near each other and pushes dissimilar points apart, while also
tuning the representation for a supervised learning task. The latter maximizes
the learned features’ variance while minimizing their covariance and maximizing
their temporal auto-correlation.
For algorithms that do need a form of input reconstruction, an efficient ap-
proach based on sampled reconstruction [39] has been proposed, successfully
implemented and evaluated for the case of auto-encoders and denoising auto-
encoders. The first idea is that on each example (or mini-batch), one samples
a subset of the elements of the reconstruction vector, along with the associated
reconstruction loss. One only needs to compute the reconstruction and the loss
associated with these sampled elements (or features), as well as the associated
back-propagation operations into hidden units and reconstruction weights. That
alone would multiplicatively reduce the computational cost by the amount of
sparsity but would make the gradient much more noisy, and possibly biased as
well if the sampling distribution is not uniform. To reduce the variance of that
estimator, the idea is to guess for which features the reconstruction loss will be
larger and to sample with higher probability these features (and their loss). In
particular, the authors always sample the features with a non-zero in the input
(or the corrupted input, in the denoising case), and uniformly sample an equal
number of those with a zero in the input and corrupted input. To make the
estimator unbiased now requires introducing a weight on the reconstruction loss
associated with each sampled feature, inversely proportional to the probability
of sampling it, i.e., this is an importance sampling scheme. The experiments
show that the speed-up increases linearly with the amount of sparsity while the
average loss is optimized as well as in the deterministic full-computation case.
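The following is a simplified sketch of the importance-weighting idea behind
sampled reconstruction, with made-up variable names and a squared-error loss;
it is not the exact scheme of Dauphin et al. [39], only an illustration of sampling
the non-zero dimensions plus an equal number of zeros and reweighting each
term by the inverse of its sampling probability:

import numpy as np

def sampled_reconstruction_loss(x, x_hat, rng):
    # x: high-dimensional sparse input vector; x_hat: its full reconstruction.
    # Sample all non-zero input dimensions plus an equal number of zero
    # dimensions, and reweight each sampled term so the estimator of the
    # full reconstruction loss stays unbiased.
    nonzero = np.flatnonzero(x)
    zero = np.flatnonzero(x == 0)
    n_zero_samples = min(len(nonzero), len(zero))
    sampled_zero = rng.choice(zero, size=n_zero_samples, replace=False)
    idx = np.concatenate([nonzero, sampled_zero])
    # Inclusion probabilities: 1 for non-zeros, n_zero_samples/len(zero) for zeros.
    frac = n_zero_samples / max(len(zero), 1)
    p = np.concatenate([np.ones(len(nonzero)),
                        np.full(n_zero_samples, frac)])
    sq_err = (x[idx] - x_hat[idx]) ** 2
    return np.sum(sq_err / p)

# Hypothetical usage: rng = np.random.default_rng(0), then call the function
# on each example's input and reconstruction inside the training loop.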
19.5.3 Symbolic Variables, Embeddings, Multi-task Learning and
Multi-relational Learning
Parameter sharing [68, 77, 31, 4, 5] is an old neural network technique for
increasing statistical power: if a parameter is used in N times more contexts
(different tasks, different parts of the input, etc.) then it may be as if we had N
times more training examples for tuning its value. More examples to estimate a
parameter reduce its variance (with respect to sampling of training examples),
which directly influences generalization error: for example the generalization
mean squared error can be decomposed as the sum of a bias term and a variance
term [46]. The reuse idea was first exploited by applying the same parameter
to different parts of the input, as in convolutional neural networks [68, 77].
Reuse was also exploited by sharing the lower layers of a network (and the
representation of the input that they capture) across multiple tasks associated
with different outputs of the network [31, 4, 5]. This idea is also one of the key
motivations behind Deep Learning [7] because one can think of the intermediate
features computed in higher (deeper) layers as different tasks that can share the
sub-features computed in lower layers (nearer the input). This very basic notion
of reuse is key to improving generalization in many settings, guiding the design
of neural network architectures in practical applications as well.
An interesting special case of these ideas is in the context of learning with
symbolic data. If some input variables are symbolic, taking value in a finite
alphabet, they can be represented as neural network inputs by a one-hot sub-
vector of the input vector (with a 0 everywhere except at the position associated
with the particular symbol). Now, sometimes different input variables refer to
different instances of the same type of symbol. A patent example is with neu-
ral language models [11, 6], where the input is a sequence of words. In these
models, the same input layer weights are reused for words at different positions
in the input sequence (as in convolutional networks). The product of a one-hot
sub-vector with this shared weight matrix is a generally dense vector, and this
associates each symbol in the alphabet with a point in a vector space (footnote 33),
which we call its embedding. The idea of vector space representations for words and
symbols is older [40] and is a particular case of the notion of distributed repre-
sentation [57, 58] central to the connectionist approaches. Learned embeddings of
symbols (or other objects) can be conveniently visualized using a dimensionality
reduction algorithm such as t-SNE [113].
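A minimal numpy sketch of this equivalence (here the rows of the hypothetical
matrix W_embed hold the embeddings, whereas the chapter's convention stores
them as columns of the shared weight matrix; the dimensions are arbitrary):
multiplying a one-hot vector by the weight matrix simply selects one row, which
is how embedding lookups are implemented in practice.

import numpy as np

vocab_size, embedding_dim = 10000, 50
W_embed = np.random.randn(vocab_size, embedding_dim) * 0.01

word_index = 42
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

# Explicit one-hot product vs. direct row lookup: identical results,
# but the lookup avoids all the useless multiplications by zero.
via_product = one_hot @ W_embed
via_lookup = W_embed[word_index]
assert np.allclose(via_product, via_lookup)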
In addition to sharing the embedding parameters across positions of words
in an input sentence, Collobert et al. [36] share them across natural language
processing tasks such as Part-Of-Speech tagging, chunking and semantic role
labeling. Parameter sharing is a key idea behind convolutional nets, recurrent
neural networks and dynamic Bayes nets, in which the same parameters are used
for different temporal or spatial slices of the data. This idea has been generalized
from sequences and 2-D images to arbitrary graphs with recursive neural net-
works or recursive graphical models [92, 45, 25, 108], Markov Logic Networks [98]
and relational learning [47]. A relational database can be seen as a set of objects
(or typed values) and relations between them, of the form (object1, relation-
type, object2). The same global set of parameters can be shared to characterize
such relations, across relations (which can be seen as tasks) and objects. Object-
specific parameters are the parameters specifying the embedding of a particular
discrete object. One can think of the elements of each embedding vector as im-
plicit learned attributes. Different tasks may demand different attributes, so that
objects which share some underlying characteristics and behavior should end
up having similar values of some of their attributes. For example, words ap-
pearing in semantically and syntactically similar contexts end up getting a very
close embedding [36]. If the same attributes can be useful for several tasks, then
statistical power is gained through parameter sharing, and transfer of informa-
tion between tasks can happen, making the data of some task informative for
generalizing properly on another task.
The idea proposed in Bordes et al. [23, 24] is to learn an energy function that
is lower for positive (valid) relations present in the training set, and parametrized
in two parts: on the one hand the symbol embeddings and on the other hand the
rest of the neural network that maps them to a scalar energy. In addition, by
considering relation types themselves as particular symbolic objects, the model
can reason about relations themselves and have relations between relation types.
For example, “To be” can act as a relation type (in subject-attribute relations),
but in the statement “‘To be’ is a verb” it appears both as a relation type and
as an object of the relation.
Such multi-relational learning opens the door to the application of neural
networks outside of their traditional applications, which were based on a single
homogeneous source of data, often seen as a matrix with one row per example
and one column (or group of columns) per random variable. Instead, one often
has multiple heterogeneous sources of data (typically providing examples seen
Footnote 33: The result of the matrix multiplication, which equals one of the columns
of the matrix.
as a tuple of values), each involving different random variables. So long as these
different sources share some variables, then the above multi-relational multi-
task learning approaches can be applied. Each variable can be associated with its
embedding function (that maps the value of a variable to a generic representation
space that is valid across tasks and data sources). This framework can be applied
not only to symbolic data but to mixed symbolic/numeric data if the mapping
from object to embedding is generalized from a table look-up to a parametrized
function (the simplest being a linear mapping) from its raw attributes (e.g.,
image features) to its embedding. This has been exploited successfully to design
image search systems in which images and queries are mapped to the same
semantic space [118].
19.6 Open Questions
19.6.1 On the Added Difficulty of Training Deeper Architectures
There are experimental results which provide some evidence that, at least in
some circumstances, deeper neural networks are more difficult to train than
shallow ones, in the sense that there is a greater chance of missing out on bet-
ter minima when starting from random initialization. This is borne out by all
the experiments where we find that some initialization scheme can drastically
improve performance. In the Deep Learning literature this has been shown with
the use of unsupervised pre-training (supervised or not), both applied to super-
vised tasks training a neural network for classification [61, 14, 95] and
unsupervised tasks training a Deep Boltzmann Machine to model the data
distribution [104].
The learning trajectory visualizations of Erhan et al. [44] have shown that
even when starting from nearby configurations in function space, different initial-
izations seem to always fall in a different effective local minimum. Furthermore,
the same study showed that the minima found when using unsupervised pre-
training were far in function space from those found from random initialization,
in addition to giving better generalization error. Both of these findings highlight
the importance of initialization, hence of local minima effects, in deep networks.
Finally, it has been shown that these effects were both increased when consider-
ing deeper architectures [44].
There are also results showing that specific ways of setting the initial dis-
tribution and ordering of examples (“curriculum learning”) can yield better so-
lutions [42, 15, 66]. This also suggests that very particular ways of initializing
parameters, very different from uniformly sampled, can have a strong impact
on the solutions found by gradient descent. The hypothesis proposed in [15] is
that curriculum learning can act similarly to a continuation method, i.e., starting
from an easier optimization task (e.g. convex) and tracking the local minimum
as the learning task is gradually made more difficult and closer to the real task
of interest.
Why would training deeper networks be more difficult? This is clearly still
an open question. A plausible partial answer is that deeper networks are also
more non-linear (since each layer composes more non-linearity on top of the
previous ones), making gradient-based methods less efficient. It may also be
that the number and structure of local minima both change qualitatively as we
increase depth. Theoretical arguments support a potentially exponential gain in
expressive power of deeper architectures [7, 9], and it is plausible that with this
added expressive power, which comes from the combinatorics of composed reuse
of sub-functions, could come a corresponding increase in the number (and
possibly quality) of local minima. But the best ones could then also be more
difficult to find.
On the practical side, several experimental results point to factors that may
help training deep architectures:
A local training signal. What many successful procedures for training
deep networks have in common is that they involve a local training signal that
helps each layer decide what to do without requiring the back-propagation
of gradients through many non-linearities. This includes of course the many
variants of greedy layer-wise pre-training but also the less well-known semi-
supervised embedding algorithm [117].
Initialization in the right range. Based on the idea that both activations
and gradients should be able to flow well through a deep architecture without
significant reduction in variance, Glorot and Bengio [48] proposed setting the
initial weights to make the Jacobian of each layer have singular values
near 1 (or preserve variance in both directions); see the sketch after this list.
In their experiments this clearly helped to greatly reduce the gap between
purely supervised and pre-trained deep networks.
Choice of non-linearities. In the same study [48] and a follow-up [49]
it was shown that the choice of hidden layer non-linearities interacted with
depth. In particular, without unsupervised pre-training, a deep neural net-
work with sigmoids in the top hidden layer would get stuck for a long time
on a plateau and generally produce inferior results, due to the special role
of 0 and of the initial gradients from the output units. Symmetric non-
linearities like the hyperbolic tangent did not suffer from that problem, while
softer non-linearities (without exponential tails) such as the softsign function
s(a) = a / (1 + |a|) worked even better. In Glorot et al. [49] it was shown
that an asymmetric but hard-limiting non-linearity such as the rectifier
(s(a) = max(0, a), see also [86]) actually worked very well (but should not
be used for output units), in spite of the prior belief that, because saturated
hidden units pass no gradient, gradients would not flow well into lower layers.
In fact gradients flow very well, but on selected paths, possibly making the
credit assignment (which parameters should change to handle the current
error) sharper and the Hessian condition number better. A recent heuristic
related to the difficulty of gradient propagation through neural net
non-linearities is the idea of “centering” the non-linear operation such that
each hidden unit has zero average output and zero average slope [107, 94].
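As referenced above, here is a minimal sketch of such an initialization; the
scaling rule sqrt(6/(fan_in + fan_out)) follows Glorot and Bengio [48], and the
layer sizes are made up:

import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    # Sample weights uniformly in [-r, r] with r chosen so that the variance
    # of activations and of back-propagated gradients is approximately
    # preserved across the layer.
    rng = rng or np.random.default_rng(0)
    r = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-r, r, size=(n_in, n_out))

W1 = glorot_uniform(784, 500)
W2 = glorot_uniform(500, 10)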
19.6.2 Adaptive Learning Rates and Second-Order Methods
To improve convergence and remove learning rates from the list of
hyper-parameters, many authors have advocated exploring adaptive learning
rate methods, either for a global learning rate [32], a layer-wise learning rate,
a neuron-wise learning rate, or a parameter-wise learning rate [22] (which then
starts to look like a diagonal Newton method). LeCun [76], LeCun et al. [79]
advocate the use of a second-order diagonal Newton (always positive) approxi-
mation, with one learning rate per parameter (associated with the approximated
inverse second derivative of the loss with respect to the parameter). Hinton [59]
proposes scaling learning rates so that the average weight update is on the order
of 1/1000th of the weight magnitude. LeCun et al. [79] also propose a simple
power method in order to estimate the largest eigenvalue of the Hessian (which
would be the optimal learning rate). An interesting alternative to variants of
Newton’s method are variants of the natural gradient method [1], but like the
basic Newton method it is computationally too expensive, requiring operations
on a square matrix that is too large (number of parameters by number of parame-
ters). Diagonal and low-rank online approximations of natural gradient [73, 74]
have been proposed and shown to speed-up training in some contexts. Several
adaptive learning rate procedures have been proposed recently and merit more
attention and evaluations in the neural network context, such as adagrad [41]
and the adaptive learning rate method from Schaul et al. [106], which claims to
completely remove the need for a learning rate hyper-parameter.
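As an illustration of a per-parameter adaptive learning rate, here is a minimal
AdaGrad-style update in the spirit of [41]; the base learning rate and epsilon are
arbitrary choices, and the function name is hypothetical:

import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, eps=1e-8):
    # Accumulate squared gradients per parameter and scale each parameter's
    # step by the inverse square root of that accumulated sum.
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Hypothetical usage: accum = np.zeros_like(theta) before training, then
# theta, accum = adagrad_update(theta, grad, accum) at each update.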
Whereas stochastic gradient descent converges very quickly initially, it is gen-
erally slower than second-order methods for the final convergence, and this may
be important in some applications. As a consequence, batch training algorithms
(performing only one update after seeing the whole training set) such as the
Conjugate Gradient method (a second order method) have dominated stochas-
tic gradient descent for not too large datasets (e.g. less than thousands or tens
of thousands of examples). Furthermore, it has recently been proposed and suc-
cessfully applied to use second-order methods over large mini-batches [72, 83].
The idea is to do just a few iterations of the second-order methods on each mini-
batch and then move on to the next mini-batch, starting from the best previous
point found. A useful twist is to start training with one or more epochs of SGD,
since SGD remains the fastest optimizer early on in training.
At this point in time however, although the second-order and natural gradient
methods are appealing conceptually, have demonstrably helped in the studied
cases and may in the end prove to be very important, they have not yet become
a standard for neural networks optimization and need to be validated and maybe
improved by other researchers, before displacing simple (mini-batch) stochastic
gradient descent variants.
19.7 Conclusion
In spite of decades of experimental and theoretical work on artificial neural
networks, and with all the impressive progress made since the first edition of
this book, in particular in the area of Deep Learning, there is still much to
be done to better train neural networks and better understand the underlying
issues that can make the training task difficult. As stated in the introduction, the
wisdom distilled here should be taken as a guideline, to be tried and challenged,
not as a practice set in stone. The practice summarized here, coupled with the
increase in available computing power, now allows researchers to train neural
networks on a scale that is far beyond what was possible at the time of the first
edition of this book, helping to move us closer to artificial intelligence.
Acknowledgements. The author is grateful for the comments and feedback
provided by Nicolas Le Roux, Ian Goodfellow, James Bergstra, Guillaume Des-
jardins, Razvan Pascanu, David Warde-Farley, Eric Larsen, Frederic Bastien, and
Sina Honari, as well as for the financial support of NSERC, FQRNT, CIFAR,
and the Canada Research Chairs.
References
[1] Amari, S.: Natural gradient works efficiently in learning. Neural Computa-
tion 10(2), 251–276 (1998)
[2] Bach, F., Moulines, E.: Non-asymptotic analysis of stochastic approximation al-
gorithms. In: NIPS 2011 (2011)
[3] Bagnell, J.A., Bradley, D.M.: Differentiable sparse coding. In: NIPS 2009, pp.
113–120 (2009)
[4] Baxter, J.: Learning internal representations. In: COLT 1995, pp. 311–320 (1995)
[5] Baxter, J.: A Bayesian/information theoretic model of learning via multiple task
sampling. Machine Learning 28, 7–40 (1997)
[6] Bengio, Y.: Neural net language models. Scholarpedia 3(1), 3881 (2008)
[7] Bengio, Y.: Learning deep architectures for AI. Now Publishers (2009)
[8] Bengio, Y.: Deep learning of representations for unsupervised and transfer learn-
ing. In: JMLR W&CP: Proc. Unsupervised and Transfer Learning (2011)
[9] Bengio, Y., Delalleau, O.: On the Expressive Power of Deep Architectures. In:
Kivinen, J., Szepesvári, C., Ukkonen, E., Zeugmann, T. (eds.) ALT 2011. LNCS,
vol. 6925, pp. 18–36. Springer, Heidelberg (2011)
[10] Bengio, Y., LeCun, Y.: Scaling learning algorithms towards AI. In: Large Scale
Kernel Machines (2007)
[11] Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic lan-
guage model. JMLR 3, 1137–1155 (2003)
[12] Bengio, Y., Le Roux, N., Vincent, P., Delalleau, O., Marcotte, P.: Convex neural
networks. In: NIPS 2005, pp. 123–130 (2006a)
[13] Bengio, Y., Delalleau, O., Le Roux, N.: The curse of highly variable functions for
local kernel machines. In: NIPS 2005, pp. 107–114 (2006b)
[14] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: NIPS 2006 (2007)
[15] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In:
ICML 2009 (2009)
[16] Bengio, Y., Alain, G., Rifai, S.: Implicit density estimation by local moment
matching to sample from auto-encoders. Technical report, arXiv:1207.0057 (2012)
[17] Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. J.
Machine Learning Res. 13, 281–305 (2012)
[18] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins,
G., Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a CPU and GPU math
expression compiler. In: Proc. Python for Scientific Comp. Conf. (SciPy) (2010)
[19] Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter
optimization. In: NIPS 2011 (2011)
[20] Berkes, P., Wiskott, L.: Applying Slow Feature Analysis to Image Sequences
Yields a Rich Repertoire of Complex Cell Properties. In: Dorronsoro, J.R. (ed.)
ICANN 2002. LNCS, vol. 2415, pp. 81–86. Springer, Heidelberg (2002)
[21] Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for
convex optimization: a survey. Technical Report 2848, LIDS (2010)
[22] Bordes, A., Bottou, L., Gallinari, P.: Sgd-qn: Careful quasi-newton stochastic
gradient descent. Journal of Machine Learning Research 10, 1737–1754 (2009)
[23] Bordes, A., Weston, J., Collobert, R., Bengio, Y.: Learning structured embed-
dings of knowledge bases. In: AAAI (2011)
[24] Bordes, A., Glorot, X., Weston, J., Bengio, Y.: Joint learning of words and mean-
ing representations for open-text semantic parsing. In: AISTATS 2012 (2012)
[25] Bottou, L.: From machine learning to machine reasoning. Technical report,
arXiv.1102 (2011)
[26] Bottou, L.: Stochastic Gradient Descent Tricks. In: Montavon, G., Orr, G.B.,
Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp.
421–436. Springer, Heidelberg (2012)
[27] Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: NIPS 2008
(2008)
[28] Bottou, L., LeCun, Y.: Large-scale on-line learning. In: NIPS 2003 (2004)
[29] Breiman, L.: Bagging predictors. Machine Learning 24(2), 123–140 (1994)
[30] Breuleux, O., Bengio, Y., Vincent, P.: Quickly generating representative samples
from an rbm-derived process. Neural Computation 23(8), 2053–2073 (2011)
[31] Caruana, R.: Multitask connectionist learning. In: Proceedings of the 1993 Con-
nectionist Models Summer School, pp. 372–379 (1993)
[32] Cho, K., Raiko, T., Ilin, A.: Enhanced gradient and adaptive learning rate for
training restricted boltzmann machines. In: ICML 2011, pp. 105–112 (2011)
[33] Coates, A., Ng, A.Y.: The importance of encoding versus training with sparse
coding and vector quantization. In: ICML 2011 (2011)
[34] Collobert, R., Bengio, S.: Links between perceptrons, MLPs and SVMs. In: ICML
2004 (2004a)
[35] Collobert, R., Bengio, S.: Links between perceptrons, MLPs and SVMs. In: In-
ternational Conference on Machine Learning, ICML (2004b)
[36] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. Journal of Machine Learning
Research 12, 2493–2537 (2011a)
[37] Collobert, R., Kavukcuoglu, K., Farabet, C.: Torch7: A matlab-like environment
for machine learning. In: BigLearn, NIPS Workshop (2011b)
[38] Courville, A., Bergstra, J., Bengio, Y.: Unsupervised models of images by spike-
and-slab RBMs. In: ICML 2011 (2011)
[39] Dauphin, Y., Glorot, X., Bengio, Y.: Sampled reconstruction for large-scale learn-
ing of embeddings. In: Proc. ICML 2011 (2011)
[40] Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.:
Indexing by latent semantic analysis. J. Am. Soc. Information Science 41(6),
391–407 (1990)
[41] Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learn-
ing and stochastic optimization. Journal of Machine Learning Research (2011)
[42] Elman, J.L.: Learning and development in neural networks: The importance of
starting small. Cognition 48, 781–799 (1993)
[43] Erhan, D., Courville, A., Bengio, Y.: Understanding representations learned in
deep architectures. Technical Report 1355, Université de Montréal/DIRO (2010a)
[44] Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., Bengio, S.:
Why does unsupervised pre-training help deep learning? J. Machine Learning
Res. 11, 625–660 (2010b)
[45] Frasconi, P., Gori, M., Sperduti, A.: A general framework for adaptive processing
of data structures. IEEE Transactions on Neural Networks 9(5), 768–786 (1998)
[46] Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance
dilemma. Neural Computation 4(1), 1–58 (1992)
[47] Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning. MIT Press
(2006)
[48] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: AISTATS 2010, pp. 249–256 (2010)
[49] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In:
AISTATS 2011 (2011a)
[50] Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment
classification: A deep learning approach. In: ICML 2011 (2011b)
[51] Goodfellow, I., Le, Q., Saxe, A., Ng, A.: Measuring invariances in deep networks.
In: NIPS 2009, pp. 646–654 (2009)
[52] Goodfellow, I., Courville, A., Bengio, Y.: Spike-and-slab sparse coding for unsu-
pervised feature discovery. In: NIPS Workshop on Challenges in Learning Hier-
archical Models (2011)
[53] Graepel, T., Candela, J.Q., Borchert, T., Herbrich, R.: Web-scale Bayesian click-
through rate prediction for sponsored search advertising in microsoft’s bing search
engine. In: ICML (2010)
[54] Håstad, J.: Almost optimal lower bounds for small depth circuits. In: STOC 1986,
pp. 6–20 (1986)
[55] Håstad, J., Goldmann, M.: On the power of small-depth threshold circuits. Com-
putational Complexity 1, 113–129 (1991)
[56] Hinton, G.E.: Relaxation and its role in vision. Ph.D. thesis, University of Edin-
burgh (1978)
[57] Hinton, G.E.: Learning distributed representations of concepts. In: Proc. 8th
Annual Conf. Cog. Sc. Society, pp. 1–12 (1986)
[58] Hinton, G.E.: Connectionist learning procedures. Artificial Intelligence 40, 185–
234 (1989)
[59] Hinton, G.E.: A practical guide to training restricted Boltzmann machines. Tech-
nical Report UTML TR 2010-003, Department of Computer Science, University
of Toronto (2010)
[60] Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. In:
Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn.
LNCS, vol. 7700, pp. 599–619. Springer, Heidelberg (2012)
[61] Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief
nets. Neural Computation 18, 1527–1554 (2006)
[62] Hutter, F.: Automated Configuration of Algorithms for Solving Hard Computa-
tional Problems. Ph.D. thesis, University of British Columbia (2009)
[63] Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization
for general algorithm configuration. In: Coello Coello, C.A. (ed.) LION 5. LNCS,
vol. 6683, pp. 507–523. Springer, Heidelberg (2011)
[64] Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-
stage architecture for object recognition? In: ICCV (2009)
[65] Kavukcuoglu, K., Ranzato, M.-A., Fergus, R., LeCun, Y.: Learning invariant
features through topographic filter maps. In: CVPR 2009 (2009)
[66] Krueger, K.A., Dayan, P.: Flexible shaping: how learning in small steps helps.
Cognition 110, 380–394 (2009)
[67] Lamblin, P., Bengio, Y.: Important gains from supervised fine-tuning of deep ar-
chitectures on large labeled sets. In: NIPS 2010 Deep Learning and Unsupervised
Feature Learning Workshop (2010)
[68] Lang, K.J., Hinton, G.E.: The development of the time-delay neural network
architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-
Mellon University (1988)
[69] Larochelle, H., Bengio, Y.: Classification using discriminative restricted Boltz-
mann machines. In: ICML 2008 (2008)
[70] Larochelle, H., Bengio, Y., Louradour, J., Lamblin, P.: Exploring strategies for
training deep neural networks. J. Machine Learning Res. 10, 1–40 (2009)
[71] Le, Q., Ngiam, J., Chen, Z., Hao Chia, D.J., Koh, P.W., Ng, A.: Tiled convolu-
tional neural networks. In: NIPS 2010 (2010)
[72] Le, Q., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.: On optimization
methods for deep learning. In: ICML 2011 (2011)
[73] Le Roux, N., Manzagol, P.-A., Bengio, Y.: Topmoumoute online natural gradient
algorithm. In: NIPS 2007 (2008)
[74] Le Roux, N., Bengio, Y., Fitzgibbon, A.: Improving first and second-order meth-
ods by modeling uncertainty. In: Optimization for Machine Learning. MIT Press
(2011)
[75] Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an ex-
ponential convergence rate for strongly-convex optimization with finite training
sets. Technical report, arXiv:1202.6258 (2012)
[76] LeCun, Y.: Modèles connexionistes de l’apprentissage. Ph.D. thesis, Université
de Paris VI (1987)
[77] LeCun, Y.: Generalization and network design strategies. Technical Report CRG-
TR-89-4, University of Toronto (1989)
[78] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard,
W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition.
Neural Computation 1(4), 541–551 (1989)
[79] LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr,
G.B., Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer,
Heidelberg (1998a)
[80] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient based learning applied
to document recognition. IEEE 86(11), 2278–2324 (1998b)
[81] Lee, H., Ekanadham, C., Ng, A.: Sparse deep belief net model for visual
area V2. In: NIPS 2007 (2007)
[82] Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In: ICML 2009
(2009)
[83] Martens, J.: Deep learning via Hessian-free optimization. In: ICML 2010, pp.
735–742 (2010)
[84] Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I., Lavoie,
E., Muller, X., Desjardins, G., Warde-Farley, D., Vincent, P., Courville, A.,
Bergstra, J.: Unsupervised and transfer learning challenge: a deep learning ap-
proach. In: Proc. Unsupervised and Transfer Learning, JMLR W&CP, vol. 7
(2011)
[85] Montavon, G., Braun, M.L., Müller, K.-R.: Deep Boltzmann machines as feed-
forward hierarchies. In: AISTATS 2012 (2012)
[86] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann ma-
chines. In: ICML 2010 (2010)
[87] Nemirovski, A., Yudin, D.: Problem complexity and method efficiency in opti-
mization. Wiley (1983)
[88] Nesterov, Y.: Primal-dual subgradient methods for convex problems. Mathemat-
ical Programming 120(1), 221–259 (2009)
[89] Olshausen, B.A., Field, D.J.: Sparse coding with an overcomplete basis set: a
strategy employed by V1? Vision Research 37, 3311–3325 (1997)
[90] Pearlmutter, B.: Fast exact multiplication by the Hessian. Neural Computa-
tion 6(1), 147–160 (1994)
[91] Pinto, N., Doukhan, D., DiCarlo, J.J., Cox, D.D.: A high-throughput screening
approach to discovering good forms of biologically inspired visual representation.
PLoS Comput. Biol. 5(11), e1000579 (2009)
[92] Pollack, J.B.: Recursive distributed representations. Artificial Intelligence 46(1),
77–105 (1990)
[93] Polyak, B., Juditsky, A.: Acceleration of stochastic approximation by averaging.
SIAM J. Control and Optimization 30(4), 838–855 (1992)
[94] Raiko, T., Valpola, H., LeCun, Y.: Deep learning made easier by linear
transformations in perceptrons. In: AISTATS 2012 (2012)
[95] Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse
representations with an energy-based model. In: NIPS 2006 (2007)
[96] Ranzato, M., Boureau, Y.-L., LeCun, Y.: Sparse feature learning for deep belief
networks. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural
Information Processing Systems (NIPS 2007), vol. 20, pp. 1185–1192. MIT Press,
Cambridge (2008a)
[97] Ranzato, M., Boureau, Y., LeCun, Y.: Sparse feature learning for deep belief
networks. In: NIPS 2007 (2008b)
[98] Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62,
107–136 (2006)
[99] Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contracting auto-
encoders: Explicit invariance during feature extraction. In: ICML 2011 (2011a)
[100] Rifai, S., Dauphin, Y., Vincent, P., Bengio, Y., Muller, X.: The manifold tangent
classifier. In: NIPS 2011 (2011b)
[101] Rifai, S., Bengio, Y., Dauphin, Y., Vincent, P.: A generative process for sampling
contractive auto-encoders. In: ICML 2012 (2012)
[102] Robbins, H., Monro, S.: A stochastic approximation method. Annals of Mathe-
matical Statistics 22, 400–407 (1951)
[103] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323, 533–536 (1986)
[104] Salakhutdinov, R., Hinton, G.: Deep Boltzmann machines. In: AISTATS 2009
(2009)
[105] Saxe, A.M., Koh, P.W., Chen, Z., Bhand, M., Suresh, B., Ng, A.: On random
weights and unsupervised feature learning. In: ICML 2011 (2011)
[106] Schaul, T., Zhang, S., LeCun, Y.: No More Pesky Learning Rates. Technical
report (2012)
[107] Schraudolph, N.N.: Centering Neural Network Gradient Factors. In: Orr, G.B.,
Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 207–548. Springer,
Heidelberg (1998)
[108] Socher, R., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language
with recursive neural networks. In: ICML 2011 (2011)
[109] Srinivasan, A., Ramakrishnan, G.: Parameter screening and optimisation for ILP
using designed experiments. Journal of Machine Learning Research 12, 627–662
(2011)
[110] Swersky, K., Chen, B., Marlin, B., de Freitas, N.: A tutorial on stochastic approx-
imation algorithms for training restricted boltzmann machines and deep belief
nets. In: Information Theory and Applications Workshop (2010)
[111] Tenenbaum, J., de Silva, V., Langford, J.C.: A global geometric framework for
nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
[112] Tieleman, T., Hinton, G.: Using fast weights to improve persistent contrastive
divergence. In: ICML 2009 (2009)
[113] van der Maaten, L., Hinton, G.E.: Visualizing data using t-sne. J. Machine Learn-
ing Res. 9 (2008)
[114] Vincent, P.: A connection between score matching and denoising autoencoders.
Neural Computation 23(7) (2011)
[115] Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.-A.: Extracting and com-
posing robust features with denoising autoencoders. In: ICML 2008 (2008)
[116] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A.: Stacked de-
noising autoencoders: Learning useful representations in a deep network with a
local denoising criterion. J. Machine Learning Res. 11 (2010)
[117] Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embed-
ding. In: ICML 2008 (2008)
[118] Weston, J., Bengio, S., Usunier, N.: Wsabie: Scaling up to large vocabulary image
annotation. In: Proceedings of the International Joint Conference on Artificial
Intelligence, IJCAI (2011)
[119] Wiskott, L., Sejnowski, T.J.: Slow feature analysis: Unsupervised learning of
invariances. Neural Computation 14(4), 715–770 (2002)
[120] Zou, W.Y., Ng, A.Y., Yu, K.: Unsupervised learning of visual invariance with
temporal coherence. In: NIPS 2011 Workshop on Deep Learning and Unsuper-
vised Feature Learning (2011)
20
Training Deep and Recurrent Networks with
Hessian-Free Optimization
James Martens and Ilya Sutskever
University of Toronto, Department of Computer Science
{jmartens,ilya}@cs.utoronto.ca
Abstract. In this chapter we will first describe the basic HF approach,
and then examine well-known performance-improving techniques such as
preconditioning which we have found to be beneficial for neural network
training, as well as others of a more heuristic nature which are harder to
justify, but which we have found to work well in practice. We will also
provide practical tips for creating efficient and bug-free implementations
and discuss various pitfalls which may arise when designing and using
an HF-type approach in a particular application.
20.1 Introduction
Hessian-Free optimization (HF) is an approach for unconstrained minimization
of real-valued smooth objective functions. Like standard Newton’s method, it
uses local quadratic approximations to generate update proposals. It belongs
to the broad class of approximate Newton methods that are practical for prob-
lems of very high dimensionality, such as the training objectives of large neural
networks. Different algorithms that use many of the same key principles have
appeared in the literatures of various communities under different names such
as Newton-CG, CG-Steihaug, Newton-Lanczos, and Truncated Newton [27, 28],
but applications to machine learning, and especially neural networks, have been
limited or non-existent until recently. With the work of Martens [22] and later
Martens and Sutskever [23] it has been demonstrated that such an approach,
if carefully designed and implemented, can work very well for optimizing non-
convex functions such as the training objective for deep neural networks and
recurrent neural networks (RNNs), given sensible random initializations. This
was significant because gradient descent methods have been observed to be very
slow and sometimes completely ineffective [17, 4, 18] on these problems, unless
special non-random initialization schemes like layer-wise pre-training [17, 16, 3]
are used. HF, which is a general optimization-based approach, can be used in
conjunction with or as an alternative to existing pre-training methods and is
more widely applicable, since it relies on fewer assumptions about the specific
structure of the network.
In this report we will first describe the basic HF approach, and then examine
well-known general purpose performance-improving techniques as well as others
that are specific to HF (versus other Truncated-Newton type approaches) or
to neural networks. We will also provide practical tips for creating efficient and
correct implementations, and discuss the pitfalls which may arise when designing
and using an HF-based approach in a particular application.
Table 20.1. A summary of the notation used. Note we will occasionally use some of
these symbols to describe certain concepts that are local to a given sub-section. The
subscripts “k” and “k−1” will often be dropped for compactness where they are implied
from the context.

Notation        Description
[x]_i           The i-th entry of a vector x
[A]_{i,j}       The (i, j)-th entry of a matrix A
1_m             A vector of length m whose entries are 1
sq(·)           The element-wise square of a vector or a matrix
vec(A)          The vectorization of a matrix A
f               The objective function
f_i             The objective function on case i
k               The current iteration of HF
θ_k             The parameter setting at the k-th HF iteration
n               The dimension of θ
δ_k             The variable being optimized by CG at the k-th HF iteration
M_{k−1}         A local quadratic approximation of f at θ_{k−1}
M̂_{k−1}         A “damped” version of the above
B_{k−1}         The curvature matrix of M_{k−1}
B̂_{k−1}         The curvature matrix of M̂_{k−1}
h′, ∇h          The gradient of a scalar function h
h″, ∇²h         The Hessian of a scalar function h
L(·)            The loss function
ρ               The reduction ratio: (f(θ_k) − f(θ_{k−1})) / M_{k−1}(δ_k)
F(θ)            A function that maps parameters to predictions on all training cases
D               A damping matrix
P               A preconditioning matrix
K_i(A, r_0)     The subspace span{r_0, A r_0, ..., A^{i−1} r_0}
ℓ               The number of layers of a feedforward net
z               The output of the network
m               The dimension of z
T               The number of time-steps of an RNN
λ               Strength constant for penalty damping terms
λ_j             The j-th eigenvalue of the curvature matrix
diag(A)         A vector consisting of the diagonal of the matrix A
diag(v)         A diagonal matrix A satisfying [A]_{i,i} = [v]_i
20.2 Feedforward Neural Networks
We now formalize feedforward neural networks (FNNs). Given an input x and a
setting of the parameters θ that determine the weight matrices and biases
(W_1, ..., W_{ℓ−1}, b_1, ..., b_{ℓ−1}), the FNN computes its output y_ℓ by the following
recurrence:

    y_{i+1} = s_i(W_i y_i + b_i)

where y_1 = x. The vectors y_i are the activations of the neural network, and the
activation functions s_i(·) are some nonlinear functions, typically sigmoid or
tanh functions applied coordinate-wise.
Given a matching target t, the FNN's training objective on a single case,
f(θ; (x, t)), is given by

    f(θ; (x, t)) = L(y_ℓ; t)

where L(z; t) is a loss function which quantifies how bad z is at predicting the
target t. Note that L may not compare z directly to t, but instead may transform
it first into some prediction vector p.

Finally, the training error, which is the objective of interest for learning, is
obtained by averaging the losses f(θ; (x, t)) over a set S of input-output pairs
(aka training cases):

    f(θ) = (1/|S|) Σ_{(x,t)∈S} f(θ; (x, t))
Algorithm 20.1. An algorithm for computing the gradient of a feedforward
neural network

input: y_1 (the input x); θ mapped to (W_1, ..., W_{ℓ−1}, b_1, ..., b_{ℓ−1}).
for all i from 1 to ℓ−1 do
    x_{i+1} ← W_i y_i + b_i
    y_{i+1} ← s_{i+1}(x_{i+1})
end for
dy_ℓ ← ∂L(y_ℓ; t)/∂y_ℓ    (t is the target)
for all i from ℓ−1 downto 1 do
    dx_{i+1} ← dy_{i+1} ⊙ s′_{i+1}(x_{i+1})
    dW_i ← dx_{i+1} y_i^⊤
    db_i ← dx_{i+1}
    dy_i ← W_i^⊤ dx_{i+1}
end for
output: ∇f(θ) as mapped from (dW_1, ..., dW_{ℓ−1}, db_1, ..., db_{ℓ−1}).
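For concreteness, the following is a minimal numpy sketch of the same forward
and backward computations for a two-layer network with tanh hidden units,
linear outputs and squared-error loss; these specific choices (and all names) are
illustrative, not the chapter's:

import numpy as np

def fnn_loss_and_grad(params, x, t):
    # params = (W1, b1, W2, b2); tanh hidden layer, linear output units,
    # squared-error loss. Follows the structure of Algorithm 20.1.
    W1, b1, W2, b2 = params
    # Forward pass.
    x2 = W1 @ x + b1
    y2 = np.tanh(x2)
    x3 = W2 @ y2 + b2
    y3 = x3                       # linear output units
    loss = 0.5 * np.sum((y3 - t) ** 2)
    # Backward pass.
    dy3 = y3 - t
    dx3 = dy3                     # derivative of the identity is 1
    dW2 = np.outer(dx3, y2)
    db2 = dx3
    dy2 = W2.T @ dx3
    dx2 = dy2 * (1 - y2 ** 2)     # derivative of tanh
    dW1 = np.outer(dx2, x)
    db1 = dx2
    return loss, (dW1, db1, dW2, db2)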
20.3 Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are the time-series analog of feed-forward
neural networks. RNNs model the mapping from an input sequence to an output
sequence, and possess feedback connections in their hidden units that allow them
to use information about past inputs to inform the predictions of future outputs.
They may also be viewed as a special kind of feed-forward net with a “layer” for
each time-step of the sequence. But unlike in a feed-forward network where each
layer has its own parameters, the “layers” of an RNN share their parameters.
Their high-dimensional hidden state and nonlinear dynamics allow RNNs to
learn very general and versatile representations, and to express highly complex
sequential relationships. This representational power makes it possible, in prin-
ciple, for RNNs to learn compact solutions for very difficult sequence modeling
and labeling tasks. But despite their attractive qualities, RNNs did not enjoy
widespread adoption after their initial discovery due to the perception that they
were too difficult to properly train. The vanishing gradients problem [4, 18],
where the derivative terms can exponentially decay to zero or explode during
back-propagation through time, is cited as one of the main reasons for this
difficulty. In the case of decay, important back-propagated error signals from the
output at future time-steps may decay nearly to zero by the time they have
been back-propagated far enough to reach the relevant inputs. This makes the
unmodified gradient a poor direction to follow if the RNN is to learn to exploit
long-range input-output dependencies in certain datasets.
Recent work by Martens & Sutskever [23] has demonstrated that HF is a viable
method for optimizing RNNs on datasets that exhibit pathological long range
dependencies that were believed difficult or impossible to learn with gradient
descent. These problems were first examined by Hochreiter & Schmidhuber [19]
where the proposed solution was to modify the RNN architecture with special
memory units.
Basic RNNs are parameterized by three matrices and a special initial hidden
state vector, so that θ ≡ (W_xh, W_hh, W_zh, h_0), where W_xh are the connections
from the inputs to the hidden units, W_hh are the recurrent connections, and
W_zh are the hidden-to-output connections. Given a sequence of vector-valued
inputs x = (x_1, ..., x_T) and vector-valued target outputs t = (t_1, ..., t_T), the
RNN computes a sequence of hidden states and predictions according to:

    h_τ = s(W_xh x_τ + W_hh h_{τ−1})
    z_τ = W_zh h_τ

where h_0 is a special parameter vector of the initial state and s(·) is a nonlinear
activation function (typically evaluated coordinate-wise).
The RNN learning objective for a single input-output pair of sequences (x, t)
is given by:

    f(θ; (x, t)) = L(z; t) ≡ Σ_{τ=1}^{T} L_τ(z_τ; t_τ)

where L_τ is a loss function as in the previous section. As with FNNs, the objective
function is obtained by averaging the loss over the training cases:

    f(θ) = (1/|S|) Σ_{(x,t)∈S} f(θ; (x, t))
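For concreteness, a minimal numpy sketch of the RNN forward recurrence above,
with tanh chosen as the nonlinearity s and all names made up:

import numpy as np

def rnn_forward(params, xs):
    # params = (W_xh, W_hh, W_zh, h0); xs is a sequence of input vectors.
    # Returns the hidden states and predictions following the recurrence
    # h_t = s(W_xh x_t + W_hh h_{t-1}), z_t = W_zh h_t.
    W_xh, W_hh, W_zh, h0 = params
    h = h0
    hs, zs = [], []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)   # tanh as the nonlinearity s
        hs.append(h)
        zs.append(W_zh @ h)
    return hs, zs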
20.4 Hessian-Free Optimization Basics
We consider the setting of unconstrained minimization of a twice-differentiable
objective function f : Rⁿ → R with respect to a vector of real-valued parameters
θ ∈ Rⁿ. Second-order optimizers such as HF are derived from the classical New-
ton’s method (a.k.a. the Newton-Raphson method), an approach based on the
idea of iteratively optimizing a sequence of local quadratic models/approxima-
tions of the objective function in order to produce updates to θ. In the simplest
situation, given the previous setting of the parameters θ_{k−1}, iteration k produces
a new iterate θ_k by minimizing a local quadratic model M_{k−1}(δ) of the objective
f(θ_{k−1} + δ), which is formed using gradient and curvature information local to
θ_{k−1}. More precisely, we define

    M_{k−1}(δ) = f(θ_{k−1}) + ∇f(θ_{k−1})^⊤ δ + (1/2) δ^⊤ B_{k−1} δ        (20.1)
where B_{k−1} is the “curvature matrix”, and is chosen to be the Hessian H(θ_{k−1})
of f at θ_{k−1} in the case of standard Newton's method. The new iterate θ_k is
computed as θ_{k−1} + α_k δ_k, where δ_k is the minimizer of eqn. 20.1 and α_k ∈ [0, 1]
is typically chosen via a line-search, with a preference for α_k = 1. A
standard efficient method for performing this kind of line search will be briefly
discussed in section 20.8.8. The multiplication of δ_k by α_k can be viewed as a
crude instance of a general technique called “update damping”, which we will
introduce next, and later discuss in depth in section 20.8.
When B_{k−1} is positive definite (PD), M(δ) will be bounded below and so its
minimizer will exist and be given by δ_k = −B_{k−1}^{−1} ∇f(θ_{k−1}), which is the
standard Newton step. Unfortunately, for many good choices of B_{k−1}, such as
the Hessian at θ_{k−1}, even computing the entire n × n curvature matrix B_{k−1}, let
alone inverting it or solving the system B_{k−1} δ_k = −∇f(θ_{k−1}) (at a cost of O(n³)),
will be impractical for all but very small neural networks.
The main idea in Truncated-Newton methods such as HF is to avoid this
costly inversion by partially optimizing the quadratic function M using the linear
conjugate gradient algorithm (CG) [15], and using the resulting approximate
minimizer δ_k to update θ. CG is a specialized optimizer created specifically for
quadratic objectives of the form q(x) = (1/2) x^⊤ A x − b^⊤ x, where A ∈ R^{n×n} is
positive semi-definite (PSD), and b ∈ Rⁿ. CG works by constructing the update from
a sequence of vectors which have the property that they are “A-conjugate” and
can thus be optimized independently in sequence. To apply CG to eqn. 20.1 we
take x = δ, A = B_{k−1} and b = −∇f(θ_{k−1}), noting that the constant term f(θ_{k−1})
can be ignored.
Note: From this point forward, we will abbreviate M_{k−1} with M and B_{k−1}
with B when the subscript is implied by the context.
CG has the nice property that it only requires access to matrix-vector products
with the curvature matrix B (which can be computed much more efficiently
than the entire matrix in many cases, as we will discuss in section 20.5), and
it has a fixed-size storage overhead of a few n-dimensional vectors. Moreover,
CG is a very powerful algorithm, which after i iterations will find the provably
optimal solution of any convex quadratic function q(x) over the Krylov subspace
K_i(A, r_0) ≡ span{r_0, Ar_0, A^2 r_0, ..., A^{i−1} r_0}, where r_0 = Ax_0 − b and x_0 is
the initial solution [32]. Any other gradient-based method applied directly to a
quadratic function like M, even a very powerful one like Nesterov's accelerated
gradient descent [29], can also be shown to produce solutions which lie in the
Krylov subspace, and thus will always be strictly outperformed by CG given the
same number of iterations¹.
Fortunately, in addition to these strong optimality properties, CG works ex-
tremely well in practice and may often converge in a number of iterations i ≪ n,
depending on the structure of B. But even when it does not converge it tends to
make very good partial progress.
The preconditioned CG algorithm is given in alg. 20.2. Note that Ap_i only
needs to be computed once in each iteration of the main loop, and the quadratic
objective q(x_i) can be cheaply computed as q(x_i) = (1/2)(r_i − b)^⊤ x_i. Also note
that any notation such as α_i or y_i should not be confused with the other uses of
these symbols that occur elsewhere in this report. The preconditioning matrix
these symbols that occur elsewhere in this report. The preconditioning matrix
P allows CG to operate within a transformed coordinate system and a good
choice of P can substantially accelerate the method. This is possible despite the
previously claimed optimality of CG because P induces a transformed Krylov
subspace. Preconditioning, methods for implementing it, its role within HF, and
its subtle interaction with other parts of the HF approach, will be discussed in
section 20.11.
With practicality in mind, one can terminate CG according to various criteria,
balancing the quality of the solution against the number of iterations required to
obtain it (and hence the number of matrix-vector products, the main computational
expense of the method). The approach taken by Martens [22] was to terminate
CG based on a measure of relative progress in optimizing M, computed as:

s_j = (M(x_j) − M(x_{j−k})) / M(x_j)
¹ This being said, it is possible to construct quadratic optimization problems where
CG will perform essentially no better than accelerated gradient descent, although
it is also possible to construct ones where CG converges in only a few iterations while
accelerated gradient descent takes much longer.
Algorithm 20.2. Preconditioned conjugate gradient algorithm (PCG)
inputs: b, A, x_0, P
r_0 ← Ax_0 − b
y_0 ← solution of Py = r_0
p_0 ← −y_0
i ← 0
while termination conditions do not apply do
    α_i ← (r_i^⊤ y_i) / (p_i^⊤ A p_i)
    x_{i+1} ← x_i + α_i p_i
    r_{i+1} ← r_i + α_i A p_i
    y_{i+1} ← solution of Py = r_{i+1}
    β_{i+1} ← (r_{i+1}^⊤ y_{i+1}) / (r_i^⊤ y_i)
    p_{i+1} ← −y_{i+1} + β_{i+1} p_i
    i ← i + 1
end while
output: x_i
where x_j is the j-th iterate of CG and k is the size of the window over which the
average is computed, which should be increased with j. A reasonable choice that
works well in practice is k = max(10, j/10). CG can be terminated at iteration
j when

s_j < 0.0001    (20.2)

or some other such constant. Depending on the situation it may make more sense
to truncate earlier to find a more economical trade-off between relative progress
and computation.
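For concreteness, here is a sketch of alg. 20.2 in NumPy together with the relative-progress
stopping rule just described; the function names, the simple diagonal preconditioner, and the
use of q in place of M (they differ only by a constant) are our own illustrative choices:

```python
import numpy as np

def pcg(b, A_mv, x0, P_diag, max_iters=250, tol=1e-4):
    """Preconditioned CG for q(x) = 0.5 x^T A x - b^T x (cf. alg. 20.2).

    A_mv(v) returns Av; P_diag is the diagonal of a diagonal preconditioner,
    so 'solving Py = r' is just an elementwise division."""
    x, r = x0.copy(), A_mv(x0) - b
    y = r / P_diag
    p = -y
    vals = [0.5 * (r - b) @ x]            # q(x_i) = 0.5 (r_i - b)^T x_i
    for j in range(1, max_iters + 1):
        Ap = A_mv(p)                      # one matrix-vector product per iteration
        alpha = (r @ y) / (p @ Ap)
        x = x + alpha * p
        r_new = r + alpha * Ap
        y_new = r_new / P_diag
        beta = (r_new @ y_new) / (r @ y)
        p = -y_new + beta * p
        r, y = r_new, y_new
        vals.append(0.5 * (r - b) @ x)
        k = max(10, j // 10)              # window size from the text
        if j >= k and vals[j] < 0:
            s = (vals[j] - vals[j - k]) / vals[j]
            if s < tol:                   # relative progress has stalled (eqn. 20.2)
                break
    return x
```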
However, deciding when to terminate CG turns out to be a much more com-
plex and subtle issue than implied by the above discussion, and in section 20.8.7
of this paper we will discuss additional reasons to terminate CG that have noth-
ing directly to do with the value of M. In particular, earlier truncations may
sometimes have a beneficial damping effect, producing updates that give a bet-
ter improvement in f than would be obtained by a fully converged solution (or
equivalently, one produced by exact inversion of the curvature matrix).
When f is non-convex (as it is with neural networks), B will sometimes be
indefinite, and so the minimizer of M may not exist. In particular, progressively
larger δ’s may produce arbitrarily low values of M, leading to nonsensical or
undefined updates. This issue can be viewed as an extreme example of the general
problem that the quadratic model M is only a crude local approximation to
f, and so its minimizer (assuming it even exists), might lie in a region of R
n
where the approximation breaks down, sometimes catastrophically. While the
aforementioned line-search can remedy this problem to some degree, this is a
general problem with 2nd-order optimization that must be carefully addressed.
Ways to do this are sometimes called “damping methods”, a term which we shall
use here, and include such techniques as restriction of the optimization over
M(·) to a “trust-region”, and the augmentation of M by penalty terms which
are designed to encourage the minimizer of M to be somewhere in R^n where M
remains a good approximation to f. Such approaches must be used with care,
since restricting/penalizing the optimization of M too much will result in very
reliable updates which are nonetheless useless due to being too “small”. In section
20.8 we will discuss various general damping methods in 2nd-order optimization,
and some which are more specific to HF.
While the damping methods such as those mentioned above allow one to op-
timize M even when B is indefinite, there is another way to deal with the indef-
initeness problem directly. The classical Gauss-Newton algorithm for non-linear
least squares uses a positive semi-definite curvature matrix which is viewed as an
approximation to the Hessian, and Schraudolph [11] was able to generalize this
idea to cover a much larger class of objective functions that include most neural
network training objectives. This “generalized Gauss-Newton matrix” (GGN),
is also guaranteed to be positive semi-definite, and tends to work much better
than the Hessian in practice as a curvature matrix when optimizing non-convex
objectives. While using the GGN matrix will not eliminate the need for damp-
ing, Martens [22] nonetheless found that it was easier to use than the Hessian,
producing better updates and requiring less damping. The computational and
theoretical aspects of the GGN matrix and its use within HF will be discussed
in detail in section 20.6.
Objective functions f(θ) that appear in machine learning are almost always
defined as arithmetic averages over a training set S, and thus so are the gradient
and the curvature matrix-vector products:

f(θ) = (1/|S|) Σ_{(x,t)∈S} f(θ; (x, t))
∇f(θ) = (1/|S|) Σ_{(x,t)∈S} ∇f(θ; (x, t))
B(θ)v = (1/|S|) Σ_{(x,t)∈S} B(θ; (x, t))v

where f(θ; (x, t)) is the objective and B(θ; (x, t)) the curvature matrix associated
with the training pair (x, t).
In order to make HF practical for large datasets it is necessary to estimate
the gradient and curvature matrix-vector products using subsets of the train-
ing data, called “minibatches.” And while it may seem natural to compute the
matrix-vector products required by CG using a newly sampled minibatch at each
iteration of alg. 20.2, CG is unfortunately not designed to handle this kind of
“stochasticity” and its theory depends very much on a stable definition of B for
concepts like B-conjugacy to even make sense. And in practice, we have found
that such an approach does not seem to work very well, and results in CG itself
diverging in some cases. The solution advocated by Martens [22] and indepen-
dently by Byrd et al. [8] is to fix the minibatch used to define B for the entire
run of CG. Minibatches and the practical issues which arise when using them
will be discussed in more depth in section 20.12.
Algorithm 20.3. High-level outline for the basic Hessian-free approach. Various
details have been purposefully left unstated, and some aspects will be subject to
change throughout this report.
inputs: θ_0, λ
Set δ_0 ← 0
k ← 1
while solution is not satisfactory do
    Select a set of points S for the gradient .............................. sec. 20.12
    b ← −∇f(θ_{k−1}) on S .................................................. sec. 20.2
    Select a set of points S′ for the curvature ............................ sec. 20.12
    Compute a preconditioner P at θ_{k−1} .................................. sec. 20.11
    Compute a damping matrix D_k ........................................... sec. 20.8
    Define A(v) ≡ G(θ_{k−1})v + λD_k v on S′ ............................... sec. 20.6
    Choose a decay constant ζ ∈ [0, 1] ..................................... sec. 20.10
    δ_k ← PCG(b, A, ζδ_{k−1}, P) ........................................... alg. 20.2
    Update λ with the Levenberg-Marquardt method ........................... sec. 20.8.5
    Choose/compute a step-size α ........................................... sec. 20.8.8
    θ_k ← θ_{k−1} + αδ_k
    k ← k + 1
end while
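A skeleton of this outer loop might look as follows; `gradient`, `ggn_mv`, `sample_batch`
and `pcg` are assumed callables supplied by the user (the names are illustrative, and the
Levenberg-Marquardt adjustment and step-size selection are deliberately omitted here):

```python
import numpy as np

def hessian_free(theta, gradient, ggn_mv, sample_batch, pcg,
                 n_updates=100, lam=1.0, zeta=0.95):
    """Skeleton of alg. 20.3: build and partially minimize a damped quadratic
    model with PCG at each iteration, then apply the resulting update."""
    delta = np.zeros_like(theta)
    for k in range(n_updates):
        grad_batch = sample_batch()              # S, for the gradient
        curv_batch = sample_batch()              # S', held fixed for the whole CG run
        b = -gradient(theta, grad_batch)
        # Damped curvature product A(v) = G v + lambda v (Tikhonov damping for simplicity)
        A_mv = lambda v: ggn_mv(theta, curv_batch, v) + lam * v
        P_diag = np.ones_like(theta)             # trivial preconditioner (see sec. 20.11)
        delta = pcg(b, A_mv, zeta * delta, P_diag)
        # lambda adjustment (sec. 20.8.5) and step-size selection (sec. 20.8.8) omitted
        theta = theta + delta
    return theta
```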
20.5 Exact Multiplication by the Hessian
To use the Hessian H of f as the curvature matrix B within HF we need an
algorithm to efficiently compute matrix-vector products with arbitrary vectors
v ∈ R^n. Noting that the Hessian is the Jacobian of the gradient, we have that
the Hessian-vector product H(θ)v is the directional derivative of the gradient
∇f(θ) in the direction v, and so by the definition of directional derivatives,

H(θ)v = lim_{ε→0} (∇f(θ + εv) − ∇f(θ)) / ε
This equation implies a finite-differences algorithm for computing Hv at the
cost of a single extra gradient evaluation. But in practice, and in particular
when dealing with highly nonlinear functions like neural network training objec-
tives, methods that use finite differences suffer from significant numerical issues,
which can make them generally undesirable and perhaps even unusable in some
situations.
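As a sketch, a central-difference variant of this finite-difference approach (usually slightly
more accurate than the one-sided formula above; `grad_f` is an assumed gradient routine):

```python
import numpy as np

def hessian_vector_fd(grad_f, theta, v, eps=1e-4):
    """Finite-difference approximation of H(theta) v via directional
    derivatives of the gradient; subject to the numerical issues noted above."""
    return (grad_f(theta + eps * v) - grad_f(theta - eps * v)) / (2 * eps)
```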
Fortunately, there is a method for computing the sought-after directional
derivative in a numerically stable way that does not resort to finite differences.
In the optimization theory literature, the method is known as "forward
differentiation" [34, 30], although we follow the exposition of Pearlmutter [31],
who rediscovered it for neural networks and other related models. The idea is
to make repeated use of the chain rule, much like in the backpropagation al-
gorithm, to differentiate the value of every node in the computational graph of
the gradient. We formalize this notion by introducing the R_v-notation. Let R_v X
denote the directional derivative of X in direction v:

R_v X = lim_{ε→0} (X(θ + εv) − X(θ)) / ε = (∂X/∂θ) v    (20.3)
Being a derivative, the R_v(·) operator obeys the usual rules of differentiation:

R_v(X + Y) = R_v X + R_v Y          linearity       (20.4)
R_v(XY) = (R_v X)Y + X R_v Y        product rule    (20.5)
R_v(h(X)) = (R_v X) h′(X)           chain rule      (20.6)

where h′ denotes the Jacobian of h. From this point on we will abbreviate R_v as
simply "R" to keep the notation compact.
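For example, applying these rules to a single layer y = s(Wx + b) (with x itself possibly
depending on θ) gives Ry = s′(Wx + b) ⊙ ((RW)x + W(Rx) + Rb). In code this might look
like the following sketch, where `s` and `ds` are an assumed activation and its derivative:

```python
import numpy as np

def R_layer(W, b, x, RW, Rb, Rx, s=np.tanh, ds=lambda a: 1 - np.tanh(a) ** 2):
    """Forward R-pass through one layer y = s(Wx + b), returning (y, Ry).
    Product rule on Wx, linearity on +b, chain rule on s."""
    a = W @ x + b
    Ra = RW @ x + W @ Rx + Rb        # product rule + linearity
    y = s(a)
    Ry = ds(a) * Ra                  # chain rule: elementwise Jacobian of s
    return y, Ry
```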
Noting that Hv =R{∇f (θ)}, computing the Hessian-vector product amounts
to computing R{∇f(θ)} by applying these rules recursively to the computational
graph for f(θ), in a way analogous to back-propagation (but operating forward
instead of backwards).
To make this precise, we will formalize the notion of a computational graph
for an arbitrary vector-valued function h(θ), which can be thought of as a spe-
cial kind of graph which implements the computation of a given function by
breaking it down into a collection of simpler operations, represented by M nodes,
with various input-output dependencies between the nodes indicated by directed
edges. The nodes of the computational graph are vector-valued, and each node i
computes an arbitrary differentiable function a_i = γ_i(z_i) of its input z_i. Each
input vector z_i is formally the concatenation of the outputs a_j of its parent
nodes j ∈ P_i. The input θ is distributed over a set of input nodes I ⊂ {1, ..., M}
and the outputs are computed at output nodes O ⊂ {1, ..., M}.
In summary, the function h(θ) is computed according to the following proce-
dure:

1. For each i ∈ I set a_i according to entries of θ
2. For i from 1 to M such that i ∉ I:
       z_i = concat_{j∈P_i} a_j
       a_i = γ_i(z_i)
3. Output h(θ) according to the values in {a_i}_{i∈O}

where P_i is the set of parents of node i.
Fig. 20.1. An example of a computational graph of the loss of a neural network
objective. The weights are considered the inputs here.
The advantage of the computational graph formal-
ism is that it allows the application of the R-operator
to be performed in a fool-proof and mechanical way
that can be automated. In particular, our function
R(h(θ)) can be computed as follows:

1. For each i ∈ I set Ra_i according to entries of v
   (which correspond to entries of θ)
2. For i from 1 to M such that i ∉ I:
       Rz_i = concat_{j∈P_i} Ra_j        (20.7)
       Ra_i = γ_i′(z_i) Rz_i             (20.8)
3. Set output R(h(θ)) according to the values in {Ra_i}_{i∈O}

where γ_i′(z_i) is the Jacobian of γ_i.
In general, computing γ_i′(z_i) (or more simply mul-
tiplying it by a vector) is simple² and is of compara-
ble cost to computing γ_i(z_i), which makes computing
the Hessian-vector product using this method com-
parable to the cost of the gradient. Notice however
that we need to have each z_i available in order to
evaluate γ_i′(z_i) in general, so all of the z_i's (or equiv-
alently all of the a_i's) must either be computed in
tandem with the Ra_i's and Rz_i's (making the cost
of the Hessian-vector product roughly comparable to
the cost of two evaluations of the gradient), or be precomputed and cached (e.g.
during the initial computation of the gradient).
When using an iterative algorithm like CG that requires multiple Hessian-
vector products for the same θ, caching can save considerable computation, but
as discussed in section 20.7 may require considerable extra storage when com-
puting matrix-vector products over large minibatches.
Algorithm 20.4 gives the pseudo-code for computing the Hessian-
vector product associated with the feedforward neural network defined
in section 20.2. The parameter vector θ defines the weight matrices
and the biases (W_1, ..., W_{ℓ−1}, b_1, ..., b_{ℓ−1}) and v maps analogously to
(RW_1, ..., RW_{ℓ−1}, Rb_1, ..., Rb_{ℓ−1}). This algorithm was derived by applying the
rules 20.4–20.6 to each line of alg. 20.1, where various required quantities such
as y_i are assumed to be available either because they are cached, or by running
the corresponding lines of alg. 20.1 in tandem.
20.6 The Generalized Gauss-Newton Matrix
The indefiniteness of the Hessian is problematic for 2nd-order optimization of
non-convex functions because an indefinite curvature matrix B may result in a
² If this is not the case then node i should be split into several simpler operations.
Algorithm 20.4. An algorithm for computing H(θ)v in feedforward neural
networks.
input: v mapped to (RW_1, ..., RW_{ℓ−1}, Rb_1, ..., Rb_{ℓ−1})
Ry_0 ← 0 (since y_0 is not a function of the parameters)
for all i from 1 to ℓ − 1 do
    Rx_{i+1} ← RW_i y_i + W_i Ry_i + Rb_i                                        (product rule)
    Ry_{i+1} ← Rx_{i+1} ⊙ s′_{i+1}(x_{i+1})                                       (chain rule)
end for
Rdy_ℓ ← R{∂L(y_ℓ; t)/∂y_ℓ} = (∂{∂L(y_ℓ; t)/∂y_ℓ}/∂y_ℓ) Ry_ℓ = (∂²L(y_ℓ; t)/∂y_ℓ²) Ry_ℓ
for all i from ℓ − 1 downto 1 do
    Rdx_{i+1} ← Rdy_{i+1} ⊙ s′_{i+1}(x_{i+1}) + dy_{i+1} ⊙ R{s′_{i+1}(x_{i+1})}    (product rule)
              = Rdy_{i+1} ⊙ s′_{i+1}(x_{i+1}) + dy_{i+1} ⊙ s″_{i+1}(x_{i+1}) ⊙ Rx_{i+1}   (chain rule)
    RdW_i ← Rdx_{i+1} y_i^⊤ + dx_{i+1} Ry_i^⊤                                    (product rule)
    Rdb_i ← Rdx_{i+1}
    Rdy_i ← RW_i^⊤ dx_{i+1} + W_i^⊤ Rdx_{i+1}                                    (product rule)
end for
output: H(θ)v as mapped from (RdW_1, ..., RdW_{ℓ−1}, Rdb_1, ..., Rdb_{ℓ−1}).
quadratic M which is not bounded below and thus does not have a minimizer to
use as the update δ. This problem can be addressed in a multitude of ways. For
example, imposing a trust-region (sec. 20.8.6) will constrain the optimization,
or a penalty-based damping method (sec. 20.8.1) will effectively add a positive
semi-definite (PSD) contribution to B which may render it positive definite (PD).
Another solution specific to truncated Newton methods is to truncate CG as
soon as it generates a conjugate direction with negative curvature (i.e., when
p_i^⊤ A p_i < 0 in alg. 20.2), a solution which may be useful in some applications
but which we have not found to be particularly effective for neural network
training.
Based on our experience, the best solution to the indefiniteness problem is to
instead use the generalized Gauss-Newton (GGN) matrix proposed by Schrau-
dolph [11], which is a provably positive semidefinite curvature matrix that can
be viewed as an approximation to the Hessian. We will denote this matrix as G.
The generalized Gauss-Newton matrix can be derived in at least two ways,
and both require that the objective f(θ) be expressed as the composition of two
functions as f(θ) = L(F(θ)) where L is convex. In a neural network setting,
F maps the parameters θ to an m-dimensional vector of the neural network's
outputs z ≡ F(θ), and L(z) is a convex "loss function" which typically measures
the difference between the network's outputs (which may be further transformed
within L to produce "predictions" p) and the targets. For RNNs, z will be a vector
of the outputs from all the time-steps and L computes the sum over losses at
each one of them.
One way to view the GGN matrix is as an approximation of H where we drop
certain terms that involve the 2nd-derivatives of F. Applying the chain rule to
compute the Hessian of f (at θ_{k−1}), we get:

f = L(F(θ))
∇f(θ) = J^⊤ ∇L
f″(θ) = J^⊤ L″ J + Σ_{i=1}^m [∇L]_i ([F]_i)″

where J denotes the Jacobian of F, ∇L is the gradient of L(z) w.r.t. z, and
all 1st and 2nd derivatives are evaluated at θ_{k−1}. The first term J^⊤ L″ J is a
positive semi-definite matrix whenever L(z) is convex in z, and is defined as the GGN
matrix. Note that in the special case where L(z) = (1/2)‖z‖² (so that L″ = I)
we recover the standard Gauss-Newton matrix usually seen in the context of
non-linear least squares optimization and the Levenberg-Marquardt algorithm
[25].
Martens and Sutskever [23] showed that the GGN matrix can also be viewed as
the Hessian of a particular approximation of f constructed by replacing F with
its 1st-order approximation. Consider a local convex approximation f̂ to f at
θ_{k−1} that is obtained by taking the first-order approximation F(θ) ≈ F(θ_{k−1}) + Jδ
(where δ = θ − θ_{k−1}):

f̂(δ) = L(F(θ_{k−1}) + Jδ)    (20.9)

The approximation f̂ is convex because it is a composition of a convex function
and an affine function. It is easy to see that f̂ and f have the same derivative
when δ = 0, because

∇f̂ = J^⊤ ∇L

which is precisely the derivative of f at θ_{k−1}. And the Hessian of f̂ at δ = 0 is
precisely the GGN matrix:

f̂″ = J^⊤ L″ J = G
Note that it may be possible to represent a function f with multiple distinct
compositions of the form L(F(θ)), and each of these will give rise to a slightly
different GGN matrix. For neural networks, a natural choice for the output vector
z is often just to identify it as the output of the final layer (i.e., y_ℓ), however this
may not always result in a convex L. As a rule of thumb, it is best to define L and
F in a way that L performs "as much of the computation of f as possible" (but
this is a problematic concept due to the existence of multiple distinct sequences
of operations for computing f). For the case of neural networks with a softmax
output layer and cross-entropy error, it is best to define L so that it performs
both the softmax and then the cross-entropy, while F computes only the inputs
to the soft-max function. This is also the recommendation made by Schraudolph
[11]. A possible reason that this choice works best is due to the fact that F
is being replaced with its first-order approximation whose range is unbounded.
Hence the GGN matrix makes sense only when L's input domain is R^m (as
opposed to [0, 1]^m for the cross-entropy error), since this is the range of the
1st-order approximation of F.
20.6.1 Multiplying by the Gauss-Newton Matrix
For the GGN matrix to be useful in the context of HF, we need an efficient algo-
rithm for computing the Gv products. Methods for multiplying by the classical
Gauss-Newton matrix are well-known in the optimization literature [30], and
these methods were generalized by Schraudolph [11] for the GGN matrix, using
an approach which we will now describe.
We know from the previous section that the GGN matrix can be expressed as
the product of three matrices: Gv = J^⊤ L″ J v. Thus multiplication of a vector
v by the GGN matrix amounts to the sequential multiplication of that vector
by these 3 matrices. First, the product Jv is a Jacobian times vector and is
therefore precisely equal to the directional derivative R_v{F(θ)}, and thus can
be efficiently computed with the R-method as in section 20.5. Next, given that
the loss function L is usually simple, multiplication of Jv by L″ is also simple
(sec. 20.6.2). Finally, we multiply the vector L″Jv by the matrix J^⊤ using the
backpropagation algorithm. Note that the backpropagation algorithm takes the
derivatives w.r.t. the predictions (∇L) as inputs, and returns the derivative
w.r.t. the parameters, namely J^⊤∇L, but we can replace ∇L with any vector
we want.
Algorithm 20.5. An algorithm for computing Gv of a feedforward neural
network.
input: v mapped to (RW_1, ..., RW_{ℓ−1}, Rb_1, ..., Rb_{ℓ−1}).
Ry_0 ← 0 (y_0 is not a function of the parameters)
for all i from 1 to ℓ − 1 do
    Rx_{i+1} ← RW_i y_i + W_i Ry_i + Rb_i        (product rule)
    Ry_{i+1} ← Rx_{i+1} ⊙ s′_{i+1}(x_{i+1})
end for
Rdy_ℓ ← (∂²L(y_ℓ; t)/∂y_ℓ²) Ry_ℓ
for all i from ℓ − 1 downto 1 do
    Rdx_{i+1} ← Rdy_{i+1} ⊙ s′_{i+1}(x_{i+1})
    RdW_i ← Rdx_{i+1} y_i^⊤
    Rdb_i ← Rdx_{i+1}
    Rdy_i ← W_i^⊤ Rdx_{i+1}
end for
output: Gv as mapped from (RdW_1, ..., RdW_{ℓ−1}, Rdb_1, ..., Rdb_{ℓ−1}).
As observed by Martens and Sutskever [23], the second interpretation of the
GGN matrix given in the previous section immediately implies an alternative
method for computing Gv products. In particular, we can use the R method
from sec. 20.5 to efficiently multiply by the Hessian of f̂, given a computational
graph for f̂. While doing this would require one to replace the part of the
forward pass corresponding to F with a multiplication by the analogous Jacobian
evaluated at θ_{k−1} (which can be done using the R-operator method applied to
f), a simpler approach is just to modify the algorithm for computing ∇f so that
all derivative terms involving intermediate quantities in the back-propagation
through F are treated as "constants", which while they are computed from θ_{k−1},
are formally independent of θ. This version will only compute f̂ properly for
θ = θ_{k−1}, but this is fine for our purposes since this is the point at which we
wish to evaluate the GGN matrix.
Algorithm 20.5 multiplies by the GGN matrix for the special case of a feed-
forward neural network and is derived using this second technique.
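To make alg. 20.5 concrete, here is a sketch of the Gv product for a small feedforward
network with tanh hidden layers, a linear output layer, and squared-error loss (so L″ = I);
the parameter layout and helper names are our own illustrative choices:

```python
import numpy as np

def fnn_forward(Ws, bs, y0):
    """Forward pass; tanh on hidden layers, identity on the output layer."""
    ys, xs = [y0], []
    for i, (W, b) in enumerate(zip(Ws, bs)):
        x = W @ ys[-1] + b
        xs.append(x)
        ys.append(np.tanh(x) if i < len(Ws) - 1 else x)
    return xs, ys

def ggn_vector_product(Ws, bs, y0, RWs, Rbs):
    """Gv for squared-error loss (L'' = I): forward R-pass for Jv,
    multiply by the loss Hessian, then backpropagate to apply J^T."""
    xs, ys = fnn_forward(Ws, bs, y0)
    Ry = np.zeros_like(y0)
    for i, (W, b) in enumerate(zip(Ws, bs)):
        Rx = RWs[i] @ ys[i] + W @ Ry + Rbs[i]
        Ry = (1 - np.tanh(xs[i]) ** 2) * Rx if i < len(Ws) - 1 else Rx
    Rd = Ry                                          # L''(Jv) with L'' = I
    RdWs, Rdbs = [None] * len(Ws), [None] * len(Ws)
    for i in reversed(range(len(Ws))):
        RdWs[i] = np.outer(Rd, ys[i])
        Rdbs[i] = Rd
        Rd = Ws[i].T @ Rd
        if i > 0:                                    # pass through the previous tanh
            Rd = Rd * (1 - np.tanh(xs[i - 1]) ** 2)
    return RdWs, Rdbs
```

For a softmax/cross-entropy output the only change would be to replace the identity loss
Hessian with the diag(p) − pp^⊤ multiplication from table 20.2.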
20.6.2 Typical Losses
In this section we present a number of typical loss functions and their Hessians
(table 20.2). The function p(z) computes the predictions p from the network
outputs z. These losses are convex and it is easy to multiply by their Hessians
Table 20.2. Typical losses with their derivatives and Hessians. The loss L and the
nonlinearity p(z) are "matching", which means that the Hessian is independent of the
target t and is PSD.

Name                             | L(z; t)                          | ∇L(z; t) | L″(z; t)        | p
Squared error                    | (1/2)‖p − t‖²                    | p − t    | I               | p = z
Cross-entropy error              | −t log p − (1 − t) log(1 − p)    | p − t    | diag(p(1 − p))  | p = Sigmoid(z)
Cross-entropy error (multi-dim)  | −Σ_i [t]_i log [p]_i             | p − t    | diag(p) − pp^⊤  | p = Softmax(z)
without explicitly forming the matrix, since they are each either diagonal or the
sum of a diagonal and a rank-1 term.
When applying this formulation to FNNs, note that because it formally in-
cludes the computation of the predictions p from the network outputs z (assumed
to lie anywhere in R^m) in the loss function itself (instead of in the activation
function s_ℓ at the output layer), s_ℓ should be set to the identity function.
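Because these Hessians are diagonal or diagonal-plus-rank-1, multiplying them by a vector
never requires forming a matrix; a sketch for the sigmoid and softmax cases of table 20.2:

```python
import numpy as np

def sigmoid_ce_hessian_vec(z, v):
    """L''(z) v for elementwise sigmoid + cross-entropy: diag(p(1-p)) v."""
    p = 1.0 / (1.0 + np.exp(-z))
    return p * (1 - p) * v

def softmax_ce_hessian_vec(z, v):
    """L''(z) v for softmax + cross-entropy: (diag(p) - p p^T) v."""
    p = np.exp(z - z.max())
    p /= p.sum()
    return p * v - p * (p @ v)
```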
20.6.3 Dealing with Non-convex Losses
We may sometimes want to have a non-convex loss function. The generalized
Gauss-Newton matrix construction will not produce a positive definite matrix in
this case because the GGN matrix J^⊤ L″ J will usually be PSD only when L″ is,
which is a problem that can be addressed in one of several ways. For example, if our
loss is L(y; t) = ‖tanh(y) − t‖²/2, which is non-convex, we could formally treat
the tanh nonlinearity as being part of F (replacing F with tanh ∘ F), and redefine
the loss L as ‖y − t‖²/2. Another trick which may work would be to approximate
the loss-Hessian L″ with a positive definite matrix, which could be done, say,
by adding a scaled multiple of the diagonal to L″, or by taking the eigen-
decomposition of L″ and discarding the eigenvectors that have negative eigenval-
ues.
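A sketch of the second trick (our own illustrative implementation): project the small m × m
loss Hessian onto the PSD cone by zeroing its negative eigenvalues before multiplying by a
vector.

```python
import numpy as np

def psd_approx_hessian_vec(H_loss, v):
    """Replace the loss Hessian by a PSD approximation obtained by discarding
    negative-eigenvalue components, then multiply by v."""
    eigvals, eigvecs = np.linalg.eigh(H_loss)        # H_loss is symmetric
    eigvals = np.maximum(eigvals, 0.0)               # drop negative curvature
    return eigvecs @ (eigvals * (eigvecs.T @ v))
```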
20.7 Implementation Details
20.7.1 Efficiency via Parallelism
A good implementation of HF can make fruitful use of parallelization in two
ways.
First, it can benefit from model parallelism, which is the ability to perform
the input and output computations associated with each neuron in a given layer
in parallel. Although model parallelism accelerates any optimization algorithm that
is applied to neural networks, current hardware is incapable of fully taking ad-
vantage of it, mostly because weights are stored in a centralized memory with
very limited bandwidth.
Second, an implementation of HF can benefit from data parallelism, where the
computation of the gradient or curvature matrix vector products is performed
independently and in parallel across the training cases in the current minibatch.
Data parallelism is much easier to exploit in current hardware because it requires
minimal communication, in stark contrast to model parallelism, which requires
frequent and rapid communication of unit activations. The potential speedup
offered by data parallelism is limited by the gains that can be derived from
using larger minibatches to compute updates in HF, as well as the sheer amount
of parallel computing power available.
HF tends to benefit from using relatively large minibatches, especially com-
pared to first-order methods like stochastic gradient descent, and so exploiting
data parallelism may bring significant reductions in computation time. Nonethe-
less, there is a point of diminishing returns after which making the minibatch
larger provides limited or no benefit in terms of the quality of the update pro-
posals (as measured by how much they reduce f).
Data parallelism is typically implemented using vectorization, which is a way
of specifying a single computational process that is independently performed on
every element of a vector. Since most implementations that use vectorization
(e.g. GPU code) become more efficient per case as the size of the minibatch
increases, there is a distinct benefit to using larger minibatches (up until the
aforementioned point of diminishing returns, or the point where the implemen-
tation's parallel computing resources are fully utilized).
Most of the computation performed by HF consists of computing the GGN
vector products and fortunately it is possible to obtain a 50% speedup over a
naive implementation of the GGN vector products using activity caching. Recall
that a multiplication by the GGN matrix consists of a multiplication by J which
is followed by a multiplication by J^⊤, both of which require the neural network's
unit activations (the y_i's in FNNs or y_τ's in RNNs). However, given that the
network’s activations are a function of only θ, and that CG multiplies different
vectors by the same GGN matrix (so its setting of θ is fixed), it is possible to
cache the network's activations y_i and to reuse them for all the GGN-vector
products made during an entire run of CG.
When a model is very large, which is the case for a large RNN with a large
number of time-steps T , the unit activations produced by even a modestly-sized
minibatch may become too numerous to fit in memory. This is especially a
problem for GPU implementations since GPUs typically have much less mem-
ory available than CPUs. This has two undesirable consequences. First, activ-
ity caching becomes impossible, and second, it necessitates the splitting of a
large minibatch into many smaller "computational minibatches" (the results from
which will be summed up after each has been processed), which can greatly re-
duce the cost-effectiveness of vectorization.
The problem can be addressed in at least 2 ways. One is to cache the ac-
tivations in a larger but slower memory storage (e.g. the CPU memory), and
to retrieve them as needed. This is often faster than the use of many smaller
minibatches.
Another way involves reducing the storage requirements at the cost of per-
forming re-computation of some of the states. In particular, we store the hidden
states at every multiple of √T time-steps (thus reducing the storage requirement
by a factor of √T), and recompute sequences of √T time-steps between these
"check-points" as they become needed, discarding them immediately after use.
Due to the way the forward and backwards passes involved in computing gradients
and matrix-vector products go through the time-steps in linear order, the state at
each time-step needs to be recomputed at most once in the case of the gradient,
and twice in the case of the matrix-vector product.
Fig. 20.2. An illustration of the method for conserving the memory of the RNN. Each
column represents a hidden state of an RNN, and only the highlighted columns reside
in memory at any given time.
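A sketch of this checkpointing scheme for recovering RNN hidden states on demand, with a
stride on the order of √T; `rnn_step(theta, h, x)` is an assumed single-time-step function
and all names here are illustrative:

```python
import numpy as np

def make_checkpoints(rnn_step, theta, xs, h0, stride):
    """Run the forward pass, storing only every `stride`-th hidden state."""
    checkpoints = {0: h0}
    h = h0
    for tau, x in enumerate(xs, start=1):
        h = rnn_step(theta, h, x)
        if tau % stride == 0:
            checkpoints[tau] = h
    return checkpoints

def recompute_segment(rnn_step, theta, xs, checkpoints, start, end):
    """Recompute the hidden states for time-steps (start, end] from the
    nearest stored checkpoint; the result can be discarded after use."""
    h = checkpoints[start]
    segment = []
    for tau in range(start + 1, end + 1):
        h = rnn_step(theta, h, xs[tau - 1])
        segment.append(h)
    return segment
```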
20.7.2 Verifying the Correctness of G Products
A well-known pitfall for neural network practitioners is an incorrect imple-
mentation for computing the gradient, which is hard to diagnose without hav-
ing a correct implementation to compare against. The usual procedure is to
re-implement the gradient computation using finite differences and verify that
the two implementations agree, up to some reasonable precision.
To verify the correctness of an implementation of a truncated Newton optimizer
like HF, we must also verify the correctness of the curvature-matrix vector
products. When B = H, there are well-known black-box finite-differences
implementations available which can be used for this purpose. Thus we will
concentrate on how to verify the correctness of the Gv products.
Given that G = J^⊤ L″ J, so Gv = J^⊤(L″(Jv)), computing the G-vector prod-
ucts via finite differences reduces to doing this for Jw, L″w and J^⊤w for arbitrary
vectors w of appropriate dimension (not necessarily the same for each).

1. For Jw we compute (F(θ + εw) − F(θ − εw))/(2ε) for a small ε.
2. For L″w we can simply approximate L″ using one of the aforementioned
   finite-differences implementations that are available for approximating Hes-
   sians.
3. For J^⊤w we exploit [J^⊤]_{j,i} = [Je_j]_i, where e_j is the j-th standard basis vector,
   and use the method in point 1 to approximate Je_j.
To be especially thorough, one should probably test that Ge_j agrees with its
finite differences version for each j, effectively constructing the whole matrix G.
For this kind of finite-differences numerical differentiation to be practical it
is important to use small toy versions of the target networks, with far fewer
units in each layer, and smaller values for the depth ℓ or sequence length T (such
as 4). In most situations, a good value of ε is often around 10^{−4}, and it is possible
to achieve a relative estimation error from the finite differences approximation
of around 10^{−6}, assuming a high-precision floating point implementation (i.e.
float64 rather than float32).
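A sketch of such a check (our own illustrative code): compare each column Ge_j of the exact
Gv implementation under test against a finite-difference construction built from Jw products;
`loss_hessian_vec(z, u)` is an assumed routine computing L″(z)u.

```python
import numpy as np

def fd_jacobian_vec(F, theta, w, eps=1e-4):
    """Central-difference approximation of Jw (point 1 above)."""
    return (F(theta + eps * w) - F(theta - eps * w)) / (2 * eps)

def check_ggn_product(F, loss_hessian_vec, ggn_vec, theta, n, rtol=1e-5):
    """Compare exact Ge_j against J^T (L'' (J e_j)) assembled from finite
    differences, column by column (points 1 and 3 above)."""
    J = np.column_stack([fd_jacobian_vec(F, theta, e) for e in np.eye(n)])
    for j in range(n):
        e_j = np.eye(n)[j]
        approx = J.T @ loss_hessian_vec(F(theta), J @ e_j)
        exact = ggn_vec(theta, e_j)
        err = np.linalg.norm(exact - approx) / max(np.linalg.norm(approx), 1e-12)
        assert err < rtol, f"column {j}: relative error {err:.2e}"
```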
It is also important to use random θ’s that are of a reasonable scale. Param-
eters that are too small will fail to engage the nonlinearities, leaving them in
their “linear regions” and making them behave like linear functions, while pa-
rameters that are too large may cause “saturation of the units of the network,
making them behave like step-functions (the opposite extreme). In either case, a
proposed implementation of some exact derivative computation could match the
finite differences versions to high precision despite being incorrect, as the local
derivatives of the activation functions may be constant or even zero.
Another option to consider when implementing complex gradient/matrix com-
putations is to use an automatic differentiation package such as Theano
[5]. This approach has the advantage of being mostly fool-proof, at the possible
cost of customization and efficiency (e.g. it may be hard to cache the activities
using the previously discussed techniques).
20.8 Damping
While unmodified Newton’s method may work well for certain objectives, it tends
to do very poorly if applied directly to highly nonlinear objective functions, such
as those which arise when training neural networks. The reason for this failure
has to do with the fact that the minimizer δ* of the quadratic approximation
M may be very large and “aggressive” in the early and the intermediate stages
of the optimization, in the sense that it is often located far beyond the region
where the quadratic approximation is reasonably trust-worthy.
The convergence theory for non-convex smooth optimization problems (which
include neural net training objectives) describes what happens only when the
optimization process gets close enough to a local minimum so that the steps taken
are small compared to the change in curvature (e.g. as measured by the Lipschitz
constant of the Hessian). In such a situation, the quadratic model will always be
highly accurate at δ*, and so one can fully optimize M and generate a sequence of
updates which will converge "quadratically" to the local minimum of f. And for
some very simple optimization problems which can arise in practice it may even
be possible to apply unmodified Newton’s method without any trouble, ignoring
the theoretical requirement of proximity to a local minimum. However, for neural
network training objectives, and in particular deep feed-forward networks and
RNNs, the necessity of these proximity assumptions quickly becomes clear after
basic experiments, where such naive 2nd-order optimization tends to diverge
rapidly from most sensible random initializations of θ.
The solution that is sometimes advocated for this problem is to use a more sta-
ble and reliable method, like gradient-descent for the beginning of optimization,
and then switch later to 2nd-order methods for “fine convergence”. Optimization
theory guarantees that as long as the learning rate constant is sufficiently small,
gradient descent will converge from any starting point. But precise convergence
is often not necessary, or even undesirable (due to issues of overfitting). Instead,
if we believe that making use of curvature information can be beneficial in con-
structing updates long before the “fine convergence” regime described by local
convergence theory sets in, it may be worthwhile to consider how to make more
careful and conservative use of curvature information in order to construct large
but still sensible update proposals, instead of defaulting to 1st-order ones out of
necessity.
“Damping”, a term used mostly in the engineering literature, and one which we
will adopt here, refers to methods which modify M or constrain the optimization
over it in order to make it more likely that the resulting update δ will lie in a
region where M remains a reasonable approximation to f and hence yield a
substantial reduction. The key difficulty with damping methods is that if they
are overused or improperly calibrated, the resulting updates will be "reliable"
but also too small and insignificant (as measured by the reduction in f).
An effective damping method is of critical importance to the performance of a
2nd-order method, and obtaining the best results will likely require the use of a
variety of different techniques, whose usefulness depends both on the particular
application and the underlying 2nd-order method. In this section we will discuss
some generic damping methods that can be used in 2nd-order optimizers and
how to apply them in HF (either separately or in some combination) along with
methods which are specific to neural networks and HF.
One thing to keep in mind when reading this section is that while the im-
mediate goal of damping methods is to increase the quality of the parameter
update produced by optimizing M (as measured by the immediate improvement
in the objective f), damping methods can and will have an important influence
on the global optimization performance of 2nd-order optimizers when applied to
multimodal objective functions, in ways that are sometimes difficult to predict
or explain, and will be problem dependent. For example, we have observed em-
pirically that on difficult neural-net training objectives, damping schemes which
tend to produce updates that give the best reductions in f in the short term,
may not always yield the best global optimization performance in the long term.
20.8.1 Tikhonov Damping
“Tikhonov regularization” or Tikhonov damping
3
is arguably the most well-
known damping method, and works by penalizing the squared magnitude δ
2
of the update δ by introducing an additional quadratic penalty term into the
quadratic model M. Thus, instead of minimizing M, we minimize a “damped”
quadratic
ˆ
M(δ) M(δ)+
λ
2
δ
δ = f(θ)+f(θ)
δ +
1
2
δ
ˆ
Bδ
where
ˆ
B=B+λI,whereλ 0 is a scalar parameter determining the “strength”
of the damping. Computing the matrix-vector product with
ˆ
B is straightforward
since
ˆ
Bv =(B+λI)v =Bv + λv.
As λ → ∞, the damped curvature matrix B̂ tends to a multiple of the identity
and the minimizer δ* has the property that δ* ≈ −∇f(θ)/λ, meaning the overall
optimization process reduces to gradient descent with a particular learning rate.
To better understand the effect of the Tikhonov damping, note that the ad-
dition of a scalar multiple of the identity matrix to B has the effect of increas-
ing each of the eigenvalues by precisely λ. This can be seen by noting that if
B = VΣV^⊤, where V = [v_1|v_2|...|v_n] are the eigenvectors of B (which are orthonor-
mal since B is symmetric), and Σ ≡ diag(λ_1, λ_2, ..., λ_n) the diagonal matrix of
eigenvalues, then B̂ = VΣV^⊤ + λI = VΣV^⊤ + λVV^⊤ = V(Σ + λI)V^⊤. Thus
the curvature associated with each eigenvector v_j in the damped matrix is given
by v_j^⊤ B̂ v_j = λ_j + λ.
This modulation of the curvature has a profound effect on the inverse
of B̂, since B̂^{−1} = V(Σ + λI)^{−1}V^⊤, where (Σ + λI)^{−1} =
diag((λ_1 + λ)^{−1}, (λ_2 + λ)^{−1}, ..., (λ_n + λ)^{−1}), and this will be particularly sig-
nificant for λ_j's that are small compared to λ, since (λ_j + λ)^{−1} will generally be
much smaller than λ_j^{−1} in such cases.
³ A name which we will use to avoid confusion with the other meaning of the term
"regularization" in the learning context.
The effect on the minimizer δ* of M̂ can be seen by noting that

δ* = −Σ_j (v_j^⊤ ∇f(θ_{k−1}))/(λ_j + λ) v_j

so the distance v_j^⊤ δ* that δ* moves θ in the direction v_j will be effectively
multiplied by λ_j/(λ_j + λ). Thus, Tikhonov damping should be appropriate when the
quadratic model is most untrustworthy along directions of very low curvature
(along which δ* will tend to travel very far in the absence of damping).
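A small numerical illustration of this rescaling effect (our own toy example): for a quadratic
with eigencurvatures 100 and 0.01, a modest λ barely changes the high-curvature component
of the update but shrinks the low-curvature component dramatically.

```python
import numpy as np

# Toy quadratic: B has eigenvalues 100 (steep direction) and 0.01 (flat direction).
eigs = np.array([100.0, 0.01])
g = np.array([1.0, 1.0])                  # gradient expressed in the eigenbasis

for lam in [0.0, 1.0]:
    delta = -g / (eigs + lam)             # minimizer of the (damped) quadratic
    print(lam, delta)
# lam = 0.0 -> [-0.01, -100.0]   (enormous step along the flat direction)
# lam = 1.0 -> [-0.0099, -0.99]  (flat component shrunk by 0.01/1.01, steep one by 100/101)
```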
Picking a good value of λ is critical to the success of a Tikhonov damping
approach. Too high, and the update will resemble gradient descent with a very
small learning rate and most of the power of 2nd-order optimization will be lost,
with the low-curvature directions particularly affected. Conversely, if λ is too
small, the quadratic model M̂ will be too aggressively optimized by CG, resulting
in a very large parameter update (particularly in directions of low curvature) which
may cause an increase in f instead of a decrease. Unfortunately, determining a
good value of λ is a nontrivial problem, which is sensitive to the overall scale of
the objective function (i.e. using λ = 1 for f gives the same update as λ = 2
would for 2f), and other more subtle properties of f, many of which will vary
over the parameter space. It is in fact very rarely the case that a single value of
λ will be appropriate at all θ's.
A method for dynamically adapting λ during optimization, which we have
found works reasonably well in practice, will be discussed in section 20.8.5. Note
that Tikhonov damping is the method used by LeCun et al. [21], where the
constant "μ" (which is not adapted) plays the role of λ.
It is worth noting that Vinyals and Povey [33] have recently developed an alter-
native approach to Tikhonov damping, based on the idea of directly optimizing
f over a K-dimensional Krylov basis generated by CG (or equivalently a Lanczos
iteration). Because the Krylov subspace generated using B̂ = B + λI doesn't
depend on λ (assuming a CG initialization of x_0 = 0), this method searches over
a space of solutions that contains all those which would be found by optimizing
a Tikhonov-damped M̂ for some λ. Because of this it can find solutions which
will give more reduction in f than CG could obtain for any value of λ. The
downsides of the approach are that the searching must be performed using a
general-purpose 2nd-order optimizer like BFGS, which will require extra gradi-
ent and function evaluations, that a basis for the entire Krylov subspace must
be stored in memory (which may not always be practical when n is large), and
finally that CG initializations cannot influence the construction of the Krylov
subspace.
20.8.2 Problems with Tikhonov Damping
For standard parameterizations of neural networks, where entries of the weight
matrices and bias vectors are precisely the entries of θ, and the regularization
is the standard spherical L2 penalty β‖θ‖², Tikhonov damping appears to be
a natural choice, and works pretty well in practice. This is because for certain
nicely behaved and also useful areas of the parameter space, the effective scale
at which each parameter operates is (very) roughly equal. But imagine a simple
reparameterization of a FNN so that at some particular layer j, θ parameterizes
10^4 W_j instead of W_j. Now the objective function is 10^4 times more sensitive than
it was before to changes in the parameters associated with layer j and only layer
j, and imposing a Tikhonov damping penalty consisting of an equally weighted
sum of squared changes over all entries of θ (given by (λ/2)‖δ‖² = (λ/2) Σ_{i=1}^n δ_i²)
no longer seems like a good idea.
For an even more extreme example, consider the case where we would like
to constrain some of the weights of the network to be positive, and do this by
a simple reparameterization via the exp function, so that for each component
[θ]_i of θ corresponding to one of these weights w, we have w = exp([θ]_i) instead
of w = [θ]_i. By applying the chain rule we see that in the new parameterization,
the i-th component of the gradient, and the i-th row and column of the GGN
matrix, are both effectively multiplied by exp([θ]_i), resulting in the update δ*
changing by a factor exp([θ]_i)^{−1} in entry i.
More formally, if we define C ∈ R^{n×n} to be the Jacobian of the function φ which
maps the new parameters back to the default ones, then the gradient and GGN
matrix in the new parameterization can be expressed in terms of those from
the original parameterization as C^⊤∇f and C^⊤GC respectively⁴. The optimal
update thus becomes:

δ* = −(C^⊤ B C)^{−1} C^⊤ ∇f = −C^{−1} B^{−1} ∇f

For our particular example, C is a diagonal matrix satisfying [C]_{i,i} = exp([θ]_i)
for reparameterized entries of θ, and [C]_{i,i} = 1 for the rest.
Assuming that the original 2nd-order update was a reasonable one in the
original parameterization, the 2nd-order update as computed in the new param-
eterization should also be reasonable (when taken in the new parameterization). In
particular, a reparameterized weight w (and hence f) will become exponentially
more sensitive to changes in [θ]_i as [θ]_i itself grows, and exponentially less sen-
sitive as it shrinks, so an extra multiplicative factor of exp([θ]_i)^{−1} compensates
for this nicely. This should be contrasted with gradient descent, where the up-
date will change in exactly the opposite way (being multiplied by exp([θ]_i)), thus
further compounding the sensitivity problem.
Unfortunately, if we use standard Tikhonov damping directly in the repa-
rameterized space, the assumption that all parameters operate at similar scales
will be strongly violated, and we will lose the nice self-rescaling property of our
update. For example, the curvature associated with [θ]_i, which is equal to the
curvature for w multiplied by exp([θ]_i)², may be completely overwhelmed by the
addition of λ to the diagonal of G when [θ]_i is below zero, resulting in an update
which will fail to make a substantial change in [θ]_i. Conversely, if [θ]_i is large
⁴ Note that this result holds for smooth and invertible φ, as long as we use the GGN
matrix. If we use the Hessian, it holds only if φ is affine.
then the Tikhonov damping contribution won't properly penalize large changes
to [θ]_i, which may lead to a very large and untrustworthy update.
We could hope that a sensible scheme for adapting λ would compensate by
adjusting λ in proportion with exp([θ]_i), but the issue is that there are many
other components of θ, such as other exp-reparameterized weights, and these
may easily be smaller or larger than [θ]_i, and thus operate at vastly different
scales. In practice, what will most likely happen is that any sensible scheme
for dynamically adjusting λ will cause it to increase until it matches the scale
of the largest of these reparameterized weights, resulting in updates which make
virtually no changes to the other weights of the network.
In general, Tikhonov and any of the other quadratic penalty based damping
methods we will discuss in the following sections can all be made arbitrarily
strong through the choice of λ, thus constraining the optimization of M̂ to a
degree sufficient to ensure that the update will not leave the region where M is
a sensible approximation. What differentiates good approaches from bad ones
is how they weigh different directions relative to each other. Schemes that tend
to assign more weight to directions associated with more serious violations of
the approximation quality of M will get away with using smaller values of λ,
thus allowing the sub-optimization of M̂ to be less constrained and thus produce
larger and more useful updates to θ.
20.8.3 Scale-Sensitive Damping
The scale sensitivity of the Tikhonov damping is similar to the scale sensitivity
that plagues 1st-order methods, and is precisely the type of issue we would like
to avoid by moving to 2nd-order methods. Tikhonov damping makes the same
implicit assumptions about scale that are made by first-order methods: that the
default norm ‖·‖ on R^n is a reasonable way to measure change in θ and a
reasonable quantity to penalize when searching for a suitable update to θ. 1st-
order methods can even be viewed as a special case of 2nd-order methods where
the curvature term is given entirely by a Tikhonov-type damping penalty, so
that B̂ = λI and δ* = −(1/λ)∇f(θ).
One solution to this problem is to only use parameterizations which exhibit
approximately uniform sensitivity properties, but this is limiting and it may be
hard to tell at-a-glance if such a property holds for a particular network and
associated parameterization.
A potential way to address this problem is to use a quadratic penalty function
which depends on the current position in parameter space (θ_{k−1}) and is designed
to better respect the local scale properties of f at θ_{k−1}. In particular, instead
of adding the penalty term (λ/2)‖δ‖² to M we may instead add (λ/2)‖δ‖²_{D_{k−1}} =
(λ/2) δ^⊤ D_{k−1} δ, where D_{k−1} is some symmetric positive definite (PD) matrix that
depends on θ_{k−1}. Such a term may provide a more meaningful measure of change
in θ, by accounting for the sensitivity properties of f more precisely. We call the
matrix D_{k−1} the damping matrix, and will drop the subscript k − 1 for brevity.
Scale-sensitive damping is implemented in HF by working with a "damped"
curvature matrix given by B̂ = B + λD, where the required matrix-vector products
can be computed as B̂v = Bv + λDv, assuming an efficient algorithm for
computing matrix-vector products with D.
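A sketch of such a damped matrix-vector product, using the D = C^⊤C choice for the
exp-reparameterization discussed above (C is diagonal, so Dv is elementwise); `curvature_mv`
is an assumed routine for the undamped product Bv, and the names are illustrative:

```python
import numpy as np

def damped_curvature_mv(curvature_mv, theta, v, lam, reparam_mask):
    """Compute (B + lambda * D) v with D = C^T C, where C is the diagonal
    Jacobian of the map back to the default parameterization:
    C_ii = exp(theta_i) for exp-reparameterized entries, 1 otherwise."""
    c = np.where(reparam_mask, np.exp(theta), 1.0)   # diagonal of C
    Dv = (c ** 2) * v                                # D v = C^T C v, elementwise
    return curvature_mv(theta, v) + lam * Dv
```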
A specific damping matrix which may work well in the case of the exp-
reparameterized network discussed in the previous sub-section would be D =
C^⊤C (for a definition of C, see the previous sub-section). With such a choice we
find that the update δ* produced by fully optimizing M̂ is equal to C^{−1} times
the update which would have been obtained with the original parameterization
and standard Tikhonov damping with strength λ. Similarly to the undamped
case, this is true because:

(C^⊤ B C + λ C^⊤ C)^{−1} C^⊤ g = C^{−1} (B + λI)^{−1} C^{−⊤} C^⊤ g = C^{−1} (B + λI)^{−1} g
It should also be noted that this choice of damping matrix corresponds to a
penalty function (1/2)‖δ‖²_{C^⊤C}, which is precisely the Gauss-Newton approximation
of (λ/2)‖θ′‖² = (λ/2)‖φ(θ)‖² w.r.t. the new parameters θ, where θ′ = φ(θ) are the
default/original parameters. The interpretation is that we are penalizing change
in the original parameters (which are assumed to have a very roughly uniform
scale), despite performing optimization w.r.t. the new ones.
While we were able to design a sensible custom scheme in this example, ex-
ploiting the fact that the default parameterization of a neural network gives
parameters which tend to operate at approximately similar scales (in most ar-
eas of the parameter space anyway), it would be nice to have a more generic
and self-adaptive approach in the cases where we do not have such a property.
One possible approach is to set D to be the diagonal matrix formed by tak-
ing the diagonal of B (i.e. D = diag(diag(B))), a choice made in the classical
Levenberg-Marquardt algorithm. With this choice, the update δ* produced by
fully optimizing the damped quadratic M̂ will be invariant to diagonal linear
reparameterizations of θ.
Another nice property of this choice of D is that it produces an update which
is invariant to rescaling of f (i.e. optimizing βf instead of f for some β > 0). By
contrast, a pure Tikhonov damping scheme would rely on the careful adjustment
of λ to achieve such an invariance.
One obvious way to overcome the deficiencies of a damping approach based
on a diagonal matrix would be to use a non-diagonal one, such as the original
curvature matrix B itself. Such a choice for D produces updates that share all
of the desirable invariance properties associated with a pure undamped Newton
approach (assuming full optimization of M̂), such as invariance to arbitrary
linear reparameterizations, and rescalings of f. This is because with this choice,
the damping-modified curvature matrix B̂ is simply (1 + λ)B, and if we assume
either full optimization of M̂, or partial optimization via a run of CG initialized
from 0, this type of damping has the effect of simply rescaling the update δ by
a factor of 1/(1 + λ). In section 20.8.8 we will discuss line-search methods which
effectively accomplish the same type of rescaling.
Despite the nice scale invariance properties associated with these choices,
there are good reasons not to use either of them in practice, or at least to use
them only with certain modifications, and in conjunction with other approaches.
While the Tikhonov approach arguably makes too few assumptions about the
local properties of f, damping approaches based on D = B or its diagonal may
make too many. In particular, they make the same modeling assumptions as the
original undamped quadratic approximation M itself. For example, B may not
even be full-rank, and in such a situation it may be the case that M will predict
unbounded improvement along some direction in B's nullspace, a problem which
will not be handled by damping with D = B for any value of λ, no matter how
big. Even if B is full-rank, there may be directions of near-zero curvature which
can cause a less extreme version of the same problem. Since the diagonal of B will
usually be full-rank even when B isn't, or just better conditioned in general,
using it instead may give some limited immunity to these kinds of problems, but
it is far from an ideal solution, as demonstrated in fig. 20.3.
Fig. 20.3. A 2D toy example of how using D = diag(B) results in an overly restricted
update. Let u = [1, 1]^⊤ and v = [1, −1]^⊤, and let B be uu^⊤ + avv^⊤ where a is large (e.g.
10^4, although we take a = 15 for display purposes). This matrix is full rank, and its
diagonal entries are given by [a + 1, a + 1]^⊤, representing the fact that the quadratic is
highly sensitive to independent changes to the 2 parameters. The small circular region
is where the update will be effectively restricted to when we make λ large enough.
In order to explain such degeneracies and understand why choices like D =B
can be bad, it is useful to more closely examine and critique our original choices
for making them. The quadratic approximation breaks down due to higher-order
effects (and even certain unmodelled 2nd-order effects in the case of the GGN
matrix) and it is the goal of damping to help compensate for this. By taking
D =Bwe are penalizing directions according to their curvature, and so are in
some sense assuming that the relative strength of the contribution to f from
the high-order terms (and thus the untrustworthiness of M) along two different
directions can be predicted reasonably well by looking at the ratio of their re-
spective curvatures. And while there is a tendency for this to be true for certain
objective functions, making this assumption too strongly may be dangerous.
Unfortunately, in the absence of any other information about the semi-local
behavior of the function f , it may not always be clear what kind of assumption
we should fall back on. To move towards the uniform scale assumption implied
by the Tikhonov approach by choosing D to be some interpolation between the
diagonal of B and a multiple of the identity (e.g. using the methods discussed in
20.11.2) seems like an arbitrary choice, since in general there may not be anything
particularly special or natural about whatever default parameterization of f
that we are given to work with. Despite this, such a strategy can be reasonably
effective in some situations, and is a reasonable compromise between Tikhonov
damping and damping with B.
A conceivably better approach would be to collect information about higher-
order derivatives, or to use information collected and aggregated from previous
iterations of the optimization process to build a simple model of the coarse
geometric structure of f. Or perhaps some useful information could be gleaned
from examining the structure of the computational graph of f. Unfortunately
we are unaware of the existence of general methods for building D based on such
ideas, and so this remains a direction for future research.
In the next section we discuss a method called “structural damping” which
constructs D using knowledge of the particular structure of deep and recurrent
neural networks, in order to construct damping matrices which may be better
at selectively penalizing directions for the purposes of damping.
20.8.4 Structural Damping
Recurrent Neural Networks are known to be difficult to train with gradient
descent, so it is conceivable that problematic variations in scale and curvature are
responsible. Indeed, a direct application of the implementation of HF presented
by Martens [22] to RNNs can yield reasonable results, performing well on a
family of synthetic pathological problems [19, 23] designed to have very long-
range temporal dependencies of up to 100 time-steps. However Martens and
Sutskever [23] found that performance could be made substantially better and
more robust using an idea called “structural damping”.
Martens and Sutskever [23] found that a basic Tikhonov damping approach
performed poorly when applied to training RNNs. In particular, in order to
avoid very large and untrustworthy update proposals, they found λ needed to
be very high, and this in turn would lead to much slower optimization. This
need for a large λ can be explained by the extreme sensitivity of the RNN’s
long sequence of hidden states to changes in the parameters and in particular
the hidden dynamics matrix W_hh. While these sorts of problems exist with deep
feed-forward neural networks like the autoencoders considered in Hinton and
Salakhutdinov [17] and Martens [22], the situation with RNNs is much more
extreme, since they have many more effective “layers”, and their parameters are
applied repeatedly at every time-step and can thus have a dramatic effect on the
entire hidden state sequence [4, 18]. Due to this extreme and highly non-linear
sensitivity, local quadratic approximations are likely to be highly inaccurate in
certain directions in parameter space, even over very small distances. A Tikhonov
damping approach can only compensate for this by imposing a strict penalty
against changes in all directions, since it lacks any mechanism to be more selective.
Structural damping addresses this problem by imposing a quadratic penalty
not just to changes in parameters, but also to certain intermediate quantities that
appear in the evaluation of f, such as the hidden state activities of an RNN.
This allows us to be more selective in the way we penalize directions of change
in parameter space, focusing on those that are more likely to lead to the large
changes in the hidden state sequence, which due to their highly nonlinear nature,
tend to correspond to catastrophic breakdowns in the accuracy of the quadratic
approximation.
Speculatively, structural damping may have another more subtle benefit for
RNN learning. It is known [20, 23] that good random initializations give rise to
nontrivial hidden state dynamics that can carry useful information about the
past inputs even before any learning has taken place. So if an RNN is initialized
carefully to contain such random dynamics, the inclusion of structural damping
may encourage the updates to preserve them at least until the hidden-to-output
weights have had some time to be adapted to the point where the long-range
information contained in the hidden activities actually gets used to inform future
predictions. After such a point, a locally greedy optimizer like HF will have more
obvious reasons to preserve the dynamics.
To formalize structural damping we first re-express the nonlinear objective
f(θ) as a composition of functions L(z(h(θ), θ)), where h(θ) computes the hidden states (whose change we wish to penalize), z(h, θ) computes the outputs, and L(z) computes the loss.
Given the current parameter setting θ_{k−1}, the local (undamped) quadratic approximation is given by M(δ) and its curvature is B. We prevent large changes in the hidden state by penalizing the distance between h(θ_{k−1} + δ) and h(θ_{k−1}) according to the penalty function

S(δ) = d(h(θ_{k−1} + δ); h(θ_{k−1}))

where d is a distance function or loss function such as a squared error or the cross-entropy⁵.
Ideally, we would define the damped local objective as:

M̂(δ) = M(δ) + μS(δ) + (λ/2)‖δ‖²

where μ is a strength constant similar to λ. But since we cannot minimize a non-quadratic objective with CG, we must resort to using a local quadratic approximation to S(δ). This will be given by δ⊤D_{k−1}δ/2, where D_{k−1} is the Gauss-Newton matrix of the penalty function S(δ) at δ = 0. Note that the quadratic approximation to S(δ) does not have a linear term because δ = 0 is a minimum of S.
Fortunately, it is straightforward to multiply by the generalized Gauss-Newton
matrix of S using the techniques outlined in section 20.6. Thus we could compute
⁵ The cross-entropy is suitable when the hidden units use a logistic sigmoid nonlinearity.
the products Bv and μD_{k−1}v using two separate Gauss-Newton matrix-vector products and adding together the results, approximately doubling the computational burden. In order to avoid this, we can instead compute the sum (B_{k−1} + μD_{k−1})v directly, by exploiting the fact that S(δ) can be computed much more efficiently by reusing the h(θ_{k−1} + δ) which gets computed as an intermediate quantity for f(θ_{k−1} + δ). Indeed, consider a neural network whose output units include the hidden state as well as the predictions:
c(θ) ≡ [h(θ), z(h(θ), θ)]
and whose loss function is given by L_μ(h, y) = μ d(h; h(θ_{k−1})) + L(y), so that we have f(θ) + μS(θ) = L_μ(c(θ)). This "new" neural network includes structural damping in its loss, and any automatic routines computing the required Jacobian-vector products for c will be no more expensive than the analogous routines in the original network. Multiplication by the Hessian of the loss,

L''_μ = ( L''(y)      0
          0           μ d''(h; h(θ_{k−1})) )

is also easy and can be done block-wise.
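As a minimal illustration (our own sketch, not taken from any released implementation), the following numpy snippet shows how the block-wise multiplication by L''_μ can be carried out in the common case where the distance d is the squared error, so that d''(h; h(θ_{k−1})) is simply the identity; all names and shapes here are hypothetical.

import numpy as np

def multiply_by_Lmu_hessian(v_h, v_y, L_out_hess, mu):
    # v_h: block of the vector corresponding to the appended hidden-state "outputs"
    # v_y: block corresponding to the ordinary predictions
    # L_out_hess: Hessian L''(y) of the ordinary loss w.r.t. the outputs
    # With a squared-error distance d, the structural damping block is mu * I,
    # so the product is computed block-by-block and then re-assembled.
    return mu * v_h, L_out_hess @ v_y

# tiny usage example with made-up dimensions
v_h, v_y = np.ones(5), np.ones(3)
L_out_hess = np.eye(3)          # e.g. a squared-error output loss
print(multiply_by_Lmu_hessian(v_h, v_y, L_out_hess, mu=0.1))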
20.8.5 The Levenberg-Marquardt Heuristic
For the penalty-based damping methods such as those described above to work
well, λ (and perhaps also μ, as defined in the previous sub-section) must be
constantly adapted to keep up with the changing local curvature properties of
f.
Fortunately, we have found the well-known Levenberg-Marquardt (LM) heuristic, which is usually used in the context of the LM method [25], to be effective at adapting λ in a sensible way, even in the context of the truncated CG runs that are used in HF.
The key quantity behind the LM heuristic is the "reduction ratio", denoted by ρ, which is given by

ρ ≡ ( f(θ_{k−1} + δ_k) − f(θ_{k−1}) ) / M_{k−1}(δ_k)    (20.10)
The reduction ratio measures the ratio of the reduction in the objective, f(θ_{k−1} + δ_k) − f(θ_{k−1}), produced by the update δ_k, to the amount of reduction predicted by the quadratic model. When ρ is much smaller than 1, the quadratic model overestimates the amount of reduction and so λ should be increased, encouraging future updates to be more conservative and thus lie somewhere where the quadratic model more accurately predicts the reduction. In contrast, when ρ is close to 1, the quadratic approximation is likely to be fairly accurate near δ_k, and so we can afford to reduce λ, thus relaxing the constraints on δ_k and allowing for "larger" and more substantial updates.
The Levenberg-Marquardt heuristic is given by:
1. If ρ > 3/4 then λ ← (2/3)λ
2. If ρ < 1/4 then λ ← (3/2)λ
Although the constants in the above description of the LM heuristic are somewhat arbitrary, we found them to work well in our experiments.
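In code, the heuristic amounts to only a few lines; the sketch below (our own illustration, with hypothetical names) also includes the computation of the reduction ratio ρ from eqn. 20.10.

def reduction_ratio(f_new, f_old, model_reduction):
    # rho of eqn. 20.10: the actual change f(theta_{k-1} + delta_k) - f(theta_{k-1})
    # divided by the change M_{k-1}(delta_k) predicted by the quadratic model
    # (recall that M_{k-1}(0) = 0, so M_{k-1}(delta_k) is itself the predicted change).
    return (f_new - f_old) / model_reduction

def levenberg_marquardt_update(lam, rho):
    # The LM heuristic with the constants used in the text.
    if rho > 3.0 / 4.0:
        lam *= 2.0 / 3.0      # the model was accurate: damp less
    elif rho < 1.0 / 4.0:
        lam *= 3.0 / 2.0      # the model over-promised: damp more
    return lam

# usage at the end of one HF iteration (schematically):
#   rho = reduction_ratio(f(theta_prev + delta_k), f(theta_prev), M(delta_k))
#   lam = levenberg_marquardt_update(lam, rho)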
Note that the above heuristic is valid even in the situation where ρ < 0, which can arise in one of two ways. The first way is where M_{k−1}(δ_k) < 0 and f(θ_{k−1} + δ_k) − f(θ_{k−1}) > 0, which means that the quadratic approximation is very inaccurate around δ_k and doesn't even get the sign of the change right. The other possibility is that M_{k−1}(δ_k) > 0 and f(θ_{k−1} + δ_k) − f(θ_{k−1}) < 0, which can only occur if CG is initialized from a nonzero previous solution and doesn't make sufficient progress from that point to obtain a negative value of M_{k−1} before being terminated/truncated. When this occurs one should either allow CG to use more iterations or possibly initialize the next run of CG from 0 (as this will guarantee that M_{k−1}(δ_k) < 0, since M_{k−1}(0) = 0 and CG decreases M_{k−1} monotonically).
While the definition of ρ in eqn. 20.10 uses the undamped quadratic in the denominator, the damped quadratic approximation M̂_{k−1} can also be used, and in our experience will give similar results, favoring only slightly lower values of λ.
Because of this tendency for M to lose accuracy as CG iterates (see subsec-
tion 20.8.7), the value of ρ tends to decrease as well (sometimes after an initial
up-swing caused by using a non-zero initialization as in section 20.10). If CG
were to run to convergence, the Levenberg-Marquardt heuristic would work just
as it does in the classical Levenberg-Marquardt algorithm, which is to say, very
effectively and giving provable strong local convergence guarantees. But the situ-
ation becomes more complicated when this heuristic is used in conjunction with
updates produced by unconverged runs of CG, because the “optimal” value of λ,
which the LM heuristic is trying to find, will be a function of how much progress
CG tends to make when optimizing M.
Fortunately, as long as the local properties of f change slowly enough, ter-
minating CG according to a fixed maximum number of steps should result in a
relatively stable and well-chosen value of λ.
But unfortunately, well-intentioned methods which attempt to be smarter and terminate CG based on the value of f, for example, can run into problems caused by this dependency of the optimal λ on the performance of CG. In particular, this "smart" decision of when to stop CG will have an effect on ρ, which will affect the choice of λ via the LM heuristic, which will affect the damping and hence how the value of f evolves as CG iterates (at the next HF iteration), which will finally affect the decision of when to stop CG, bringing us full circle. It is this kind of feed-back which may result in unexpected and undesirable behaviors when using the LM heuristic, such as λ and the length of the CG runs both going to zero as HF iterates, or both quantities creeping upwards to inappropriately large values.
20.8.6 Trust-Region Methods
In contrast to damping methods that are based on penalty terms designed to
encourage updates to be “smaller” according to some measure, trust region meth-
ods impose an explicit constraint on the optimization of the quadratic model
M. Instead of performing unconstrained optimization on the (possibly damped)
quadratic M̂, a constrained optimization is performed over M. This is referred
to as the trust-region sub-problem, and is defined by:
δ_R = argmin_{δ : δ∈R} M(δ)

where R ⊆ ℝⁿ is some region localized around δ = 0, called the "trust-region".
Ideally, R has the property that M remains a reasonable approximation to f for any δ ∈ R, without being overly restrictive. Or perhaps more weakly (and practically), that whatever update δ_R is produced by solving the trust-region sub-problem will produce a significant reduction in f. Commonly, R is chosen to
be a ball of radius r centered at 0, although it could just as easily be an ellipsoid
or something more exotic, as long as the required constrained optimization can
be performed efficiently enough.
There is a formal connection between trust-region methods and penalty-based damping methods such as Tikhonov damping, which states that when R is an elliptical ball around 0 of radius r, given by R = {x : ‖x‖_Q ≤ r} for some positive definite matrix Q, then for each quadratic function M(δ) there exists a λ such that

argmin_{δ : δ∈R} M(δ) = argmin_δ { M(δ) + (λ/2)‖δ‖²_Q }

This result is valid for all quadratic functions M, even when B is indefinite, and can be proved using Lagrange multipliers.⁶
Trust-region methods have some appeal over the previously discussed penalty-
based damping methods, because it may be easier to reason intuitively, and
perhaps also mathematically, about the effect of an explicit trust-region (with a
particular radius r) on the update than a quadratic penalty. Indeed, the trust
region region R is invariant to changes in the scale of the objective⁷, which may make it easier to tune, either manually or by some automatic method.
However, the trust-region optimization problem is much more difficult than the unconstrained quadratic optimization of M̂. It cannot be directly solved either by CG or by matrix inversion. Even in the case where a spherical trust-region with radius r is used, the previously discussed result is non-constructive and merely guarantees the existence of an appropriate λ that makes the exact minimizer of the Tikhonov-damped M̂ equal to the solution of the trust-region sub-problem. Finding such a λ is a hard problem, and while there are algorithms that do this [26], they involve expensive operations such as full decompositions of
⁶ For completeness, we present an outline of the proof here: Consider the Lagrangian L(δ, λ) = M(δ) + λ(‖δ‖²_Q/2 − r). It is known that there exists a λ* such that the minimizer of the trust-region problem δ* satisfies ∂L(δ*, λ*)/∂(δ, λ) = 0. For M(δ) = g⊤δ + δ⊤Bδ/2, this is equivalent to g + Bδ* + λ*Qδ* = 0, and thus (B + λ*Q)δ* = −g. If the matrix B + λ*Q is positive definite, then δ* can be expressed as the unconstrained minimizer of g⊤δ + δ⊤(B + λ*Q)δ/2.
⁷ If the objective is multiplied by 2, then λ also needs to be multiplied by 2 to achieve the same effect when using a penalty method. By contrast, the trust region R is unaffected by such a scale change.
the B matrix, or finding the solutions of multiple Tikhonov-damped quadratics of the form M̂. When CG is used to perform partial optimization of the quadratic model, there are also good approximation algorithms [13] based on examining the tridiagonal decomposition of B (restricted to the Krylov subspace), but these require either storing a basis for the entire Krylov subspace (which may be impractical when n is large), or will require a separate run of CG once λ has been found via the tridiagonal decomposition, effectively doubling the number of matrix-vector products that must be computed.
Even if we can easily solve (or partially optimize) the trust-region sub-
problem, we are still left with the problem of adjusting r. And while it is likely
that the “optimal” r will remain a more stable quantity than the “optimal” λ over
the parameter space, it still needs to be adjusted using some heuristic. Indeed,
one heuristic which is advocated for adjusting r [30] is precisely the Levenberg-
Marquardt heuristic discussed in section 20.8.5 which is already quite effective
(in our experience) at adjusting λ directly. This leaves us with the question: why
not just adjust λ directly and avoid the extra work required to compute λ by
way of r?
One method which is effective at finding reasonable approximate solutions of the trust-region sub-problem, for the case where R is a ball of radius r, is to run CG (initialized at zero) until the norm of the solution exceeds r (i.e. the solution leaves the trust-region) and truncate CG at that iteration, with a possible modification to the α multiplier for the final conjugate direction to ensure a solution of norm exactly r. This is known as the Steihaug-Toint method, and it can be shown [13] that, provided CG is initialized from a zero starting solution and no preconditioning is used, the norm of the solution will increase monotonically, and that if CG is truncated in the manner described above, the resultant solution δ_k will satisfy M(δ_k) ≤ (1/2) M(δ*_k), where δ*_k is the optimal solution to the
trust-region sub-problem. This seems like a good compromise, and it is more
economical than approaches that try to solve the trust region problem exactly
(or using better approximations, as discussed above). Unfortunately, the restric-
tion that CG must be initialized from zero cannot be easily removed, and in our experience such non-zero initializations turn out to be very beneficial, as we will discuss in section 20.10.
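To make the procedure concrete, here is a rough sketch (ours, under the stated assumptions rather than a reference implementation) of the Steihaug-Toint rule inside a plain, unpreconditioned CG loop; mvp is any routine returning the product of the curvature matrix with a vector, and for simplicity we truncate at the first iterate that leaves the ball rather than solving for the exact boundary crossing.

import numpy as np

def steihaug_truncated_cg(mvp, grad, radius, max_iters=250, tol=1e-8):
    # Minimizes M(x) = grad^T x + 0.5 x^T B x with CG started from zero,
    # stopping as soon as the iterate leaves the spherical trust-region of
    # the given radius.  mvp(v) must return B v.
    x = np.zeros_like(grad)
    res = -grad              # residual of B x = -grad at x = 0
    d = res.copy()
    rr = res @ res
    for _ in range(max_iters):
        Bd = mvp(d)
        alpha = rr / (d @ Bd)
        x_next = x + alpha * d
        if np.linalg.norm(x_next) >= radius:
            return x_next    # left the trust-region: truncate here
        x = x_next
        res = res - alpha * Bd
        rr_new = res @ res
        if np.sqrt(rr_new) < tol:
            break
        d = res + (rr_new / rr) * d
        rr = rr_new
    return x

# toy usage with an explicit curvature matrix
B = np.diag([1.0, 10.0, 100.0])
g = np.array([1.0, 1.0, 1.0])
print(steihaug_truncated_cg(lambda v: B @ v, g, radius=0.5))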
Another argument against using the Steihaug-Toint method is that if the
trust-region is left after only a few steps of CG, it will likely be the case that
few, if any, of the low-curvature directions have converged to a significant degree
(see section 20.9). And while we know that this will not affect the optimized
value of M by more than a factor of 2, it will nonetheless produce a qualitatively
different type of update than one produced using a penalty method, which will
have a possibly negative effect on the overall optimization trajectory.
Another possible argument against using the Steihaug-Toint method is that it cannot be used with preconditioned CG. However, as long as we are willing to enforce an elliptical trust-region of the form {x : ‖x‖_P < r}, where P is the preconditioning matrix, instead of a spherical one, the method still works and
its theory remains valid. And depending on the situation, this kind of elliptical
trust-region may actually be a very natural choice.
20.8.7 CG Truncation as Damping
Within HF, the main reason for terminating CG before it has converged is one
of a practical nature: the matrix-vector products are expensive and additional
iterations will eventually provide diminishing returns as far as optimizing the
quadratic model M. But there is another more subtle reason we may want to
truncate early: CG truncation may be viewed as a special type of damping which
may be used in conjunction with (or as an alternative to) the various other
damping methods discussed in this section.
As CG iterates, the accuracy of M(δ) tends to go down, even while f(θ_{k−1} + δ) may still be improving. One way to explain this, assuming a zero initialization, is that the norm of δ will increase monotonically with each step, and thus be more likely to leave the implicit region around δ = 0 where M is a reasonable approximation of f. There is also theory which suggests that CG will converge to δ* first along directions of mostly higher curvature, and only later along directions of mostly low curvature (see section 20.9). While pursuing low curvature
directions seems to be important for optimization of deep networks and RNNs,
it also tends to be associated with large changes in θ which can lead to more
serious breakdowns in the accuracy of the local quadratic model.
The Steihaug-Toint method described in section 20.8.6 already makes use of
this phenomenon and relies exclusively on truncating the solution early to en-
force trust-region constraints. And it is well-known that truncating CG, which
effectively limits the size of the Krylov subspace, has certain regularizing prop-
erties in the context of solving noisy and ill-posed systems of linear equations
[14].
Although this wasn't emphasized in the paper, Martens [22] supplemented the progress-based CG termination criteria with a maximum limit of 250 steps. In practice, we have found that this maximum is consistently reached after the first 100 (or so) iterations of HF when the approach is applied to problems like the "CURVES" auto-encoder from Hinton and Salakhutdinov [17], and that it plays an important role which is not limited to saving computation time. Instead, we can observe that the improvement in the value of the objective f is non-monotonic in the number of CG steps, and may peak long before condition 20.2 is satisfied (see fig. 20.4).
Martens [22] proposed “CG-backtracking” as a method to take advantage of
this tendency for earlier iterates to be more favorable as updates, by selecting
the “best” iterate among some manageable subset, as measured by the associated
reduction in f . One possible approach which is reasonably efficient is to compute
the objective f only on a small subset of the current minibatch or training set,
and only at every multiple of c steps for some fixed c, or every power of ν steps
(rounding to the nearest integer) for some fixed ν, and then terminate once it
appears that there will be no further possible improvement in the objective.
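As a rough sketch of what such a scheme might look like in practice (ours, with a hypothetical f_subset routine that evaluates the objective on a small subset of the data), one can save the CG iterates, score a geometrically spaced subset of them, and return the best one:

def cg_backtrack(iterates, f_subset, nu=1.3):
    # iterates: list of CG iterates x_1, x_2, ... saved during the run
    # f_subset: evaluates the objective f on a small subset of the minibatch
    # Evaluate only at (rounded) powers of nu and keep the best iterate seen.
    best_x, best_f = None, float("inf")
    step, power = 1, 1.0
    while step <= len(iterates):
        fx = f_subset(iterates[step - 1])
        if fx < best_f:
            best_x, best_f = iterates[step - 1], fx
        power *= nu
        step = max(step + 1, int(round(power)))
    return best_x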
Fig. 20.4. A plot of the objective function (solid) versus the quadratic approximation
(dotted) as a function of the number of CG steps. This was selected as a typical example
of a single run of CG performed by HF on a typical deep auto-encoder training task.
An important thing to keep in mind is that the damping effect of CG trun-
cation will depend on both the preconditioning scheme used within CG (as dis-
cussed in section 20.11), and on the particular manner in which CG is initialized
(as discussed in sec. 20.10). In the extreme case where P = B, the eigenvalues will all be equal and CG will converge in one step, rendering CG truncation trivial/useless. And for "stronger" preconditioners that let CG converge faster, it can be argued that truncation at a particular iteration i will have a smaller effect than with "weaker" preconditioners. But this perspective oversimplifies a very complex issue.
Preconditioning is effectively reparameterizing the quadratic optimization
problem, which has the effect of creating new eigenvectors with new eigenvalues
and of changing the Krylov subspace (as discussed further in section 20.11.1).
This in turn affects the “order” in which CG tends to prioritize convergence along
different directions (see section 20.9). Thus, when CG is terminated long before
convergence, preconditioning will have an important effect on the nature of the
implicit damping and thus on the quality of the update.
From the perspective of the Steihaug-Toint method and trust-regions, CG
preconditioning can be thought of as determining the shape of the trust-region
that is being implicitly enforced through the use of truncation. In particular, the
trust region will be given by R = {x : ‖x‖_P ≤ r}, for some r, where P is the preconditioning matrix. Similarly, initializing CG from some non-zero point x_0 can be thought of as shifting the center of said trust-region away from 0. In both
cases, the guarantee remains true that the solution found by truncated CG will be at least half as good (in terms of the value of M), subject to the trust-region radius implied by the current iterate, as long as we modify the definition of the trust-region appropriately.
20.8.8 Line Searching
The most basic form of damping, which is present in almost every 2nd-order optimization algorithm, is standard line searching applied to the update proposal δ_k. In particular, we select a scalar α_k to multiply the update δ_k before adding it to θ, so that θ_k = θ_{k−1} + α_k δ_k. Usually α is set to 1 as long as certain conditions hold (see Nocedal et al. [30]), and decreased only as necessary until they do. Doing this guarantees that certain local convergence theorems apply.
Provided that δ_k is a descent direction (δ_k⊤∇f(θ_{k−1}) < 0), we know that for a sufficiently small (but non-zero) value of α we will have:

f(θ_{k−1}) > f(θ_k) = f(θ_{k−1} + αδ_k)    (20.11)
δ_k will be a descent direction as long as B̂_{k−1} is positive definite, ∇f is computed on the entire training set, and M̂ is optimized either exactly, or partially with CG (provided that we achieve M(δ) < 0). Thus, in many practical scenarios, a line search will ensure that the update results in a decrease in f, although it may be very small. And if we are content with the weaker condition that only the terms of f corresponding to the minibatches used to estimate ∇f(θ_{k−1}) must decrease, then we can drop the requirement that ∇f be computed using the whole training set.
One easy way to implement line searching is via the back-tracking approach, which starts at α = 1 and repeatedly multiplies this by some constant β ∈ (0, 1) until the "sufficient-decrease condition" applies. This is given by

f(θ_{k−1} + αδ_k) ≤ f(θ_{k−1}) + cα∇f(θ_{k−1})⊤δ_k

where c is some small constant like 10⁻². It can be shown that this approach will produce an update which has strong convergence guarantees⁸.
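A generic backtracking implementation of this condition (a sketch under the assumption that f and grad_f evaluate the objective and its gradient on the relevant data; it is not the chapter's own code) is shown below.

def backtracking_line_search(f, grad_f, theta, delta, c=1e-2, beta=0.8):
    # Shrink alpha from 1 by the factor beta until the sufficient-decrease
    # (Armijo) condition f(theta + alpha*delta) <= f(theta) + c*alpha*grad^T delta
    # holds.  delta should be a descent direction, so slope < 0.
    f0 = f(theta)
    slope = grad_f(theta) @ delta
    alpha = 1.0
    while f(theta + alpha * delta) > f0 + c * alpha * slope:
        alpha *= beta
        if alpha < 1e-8:     # give up: the update is unusable at any scale
            break
    return alpha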
Unlike 1st-order methods, where the total number of updates to θ can often be on the order of 10⁴–10⁶ for a large neural network, powerful approximate Newton methods like HF may only require on the order of 10²–10³, so the expense of a line-search is much easier to justify.
Note also that it is also possible to view line searching as a special type of penalty-based damping method, where we use a penalty term of the form ((1 − α)/(2α)) δ⊤Bδ. In other words, we simply increase the scale of the curvature matrix B by a factor of 1/α. This interpretation is valid as long as we solve M̂ exactly, or partially by CG as long as it is initialized from 0.
The line search is best thought of as a last line of defense to compensate for
occasional and temporary maladjustment of the various non-static parameters
of the other damping methods, such as λ or r, and not as a replacement for
⁸ But note that other possible ways of choosing an α that satisfies this condition may make α too small.
these methods. If the line search becomes very active (i.e., very often chooses an
α strictly less than 1) there are two probable causes, which should be addressed
directly instead of by relying on the line-search. The first is poorly designed/inap-
propriate damping methods and/or poor heuristics for adjusting their non-static
meta-parameters. The second probable cause is that the data used to estimate
the curvature and/or gradient is insufficient and the update has effectively “over-
fitted” the current minibatch.
The argument for not relying on line searches to fix whatever problems might
exist with the updates produced by optimizing M is that they work by rescaling
each eigen-component (or conjugate direction) of the update equally by the mul-
tiplicative factor α, which is a very limited and inflexible approach. It turns out
that in many practical situations, the type of scaling modification performed by
a penalty method like Tikhonov damping is highly preferable, such as when B
is very ill-conditioned. In such a situation, minimizing M results in an update
proposal δ_k which is scaled much too strongly in certain, possibly spurious, low-curvature eigen-directions, by a factor of possibly 10³ or more, and a line-search
will have to divide all components by this factor in order to make the update
viable, which results in an extremely small update. Meanwhile, a Tikhonov ap-
proach, because it effectively modifies the curvatures by the additive constant
λ (as discussed in section 20.8.1) can easily suppress these very low-curvature
directions while leaving the higher curvature directions relatively untouched,
which makes the update much bigger and more useful as a result.
20.9 Convergence of CG
In this section we will examine the theoretical convergence properties of CG
and provide justifications for various statements regarding its convergence made
throughout this report, such as how δ converges along different “directions” in
parameter space, and how CG prioritizes these directions according to their
“associated curvature”. This analysis has particularly important implications for
preconditioning (see section 20.11) and CG truncation damping (see section
20.8.7).
Before we begin, it should be noted that due to inexact computer arithmetic, in practice the conclusions of this analysis (which implicitly assume exact arithmetic) will only hold approximately. Indeed, CG, unlike many other numerical algorithms, is highly sensitive to numerical issues, and after only 5–10 iterations on a high-precision machine it will produce iterates that can differ significantly from those which would be produced by a hypothetical exact implementation.
We will analyze CG applied to a general quadratic objective of the form q(x) = (1/2) x⊤Ax − b⊤x for a symmetric positive definite matrix A ∈ ℝ^{n×n} and b ∈ ℝⁿ. This can be related to HF by taking A = B (or A = B̂ in the case of a damped quadratic approximation) and b = −∇f.
Note that x as defined here is an n-dimensional vector and should not be
confused with its use elsewhere in this chapter as the input to a neuron.
Let {λ_j}_{j=1}^n be the eigenvalues of A and {v_j}_{j=1}^n the corresponding (unit) eigenvectors, x_0 be the initialization of CG, and x* the minimizer of q. Since the v_j's are an orthonormal basis for ℝⁿ (because A is symmetric and invertible), for any x we can express x_0 − x* in terms of the v_j's, giving x_0 − x* = Σ_{j=1}^n ξ_j v_j for ξ_j = v_j⊤(x_0 − x*).
It can be shown [30, 32] that:

‖x_i − x*‖²_A = min_p Σ_{j=1}^n λ_j p(λ_j)² ξ_j²    (20.12)

where ‖z‖²_A ≡ (1/2) z⊤Az and where the minimum is taken over all polynomials p of degree i with constant term 1. This result can be used to prove various convergence theorems for CG [30]. For example, CG will always converge to x* after a number of iterations less than or equal to the number m of distinct eigenvalues of A, since it is easy to design a polynomial of degree m with constant term 1 that satisfies p(λ_j) = 0 for all j.
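This last fact is easy to check numerically; the toy script below (our own, not from the text) constructs a matrix with only m = 3 distinct eigenvalues and verifies that plain CG started from zero essentially reaches the exact minimizer after 3 iterations.

import numpy as np

def cg(A, b, iters):
    # Plain conjugate gradient for A x = b, started from x = 0.
    x = np.zeros_like(b)
    r = b.copy()
    d = r.copy()
    rr = r @ r
    for _ in range(iters):
        Ad = A @ d
        alpha = rr / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return x

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([1.0, 1.0, 5.0, 5.0, 20.0, 20.0]) @ Q.T   # m = 3 distinct eigenvalues
b = rng.standard_normal(6)
x_star = np.linalg.solve(A, b)
print(np.linalg.norm(cg(A, b, 3) - x_star))   # essentially zero after m = 3 steps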
To gain more insight into eqn. 20.12 we will re-derive and re-express it in a way that implies an intuitive interpretation for each term. It is easy to show that:

q(z + x_0) − q(x_0) = (1/2) z⊤Az + r_0⊤z

where r_0 = Ax_0 − b (i.e. the initial residual). And so q(z + x_0) − q(x_0) is a quadratic in z with constant term equal to 0, and linear term r_0.
Defining η_j = r_0⊤v_j to be the size of eigenvector v_j in the direction r_0 = Ax_0 − b (which is the initial residual), and observing that v_j⊤Av_j = λ_j v_j⊤v_j = λ_j, we have for any α ∈ ℝ:

q(αv_j + x_0) − q(x_0) = (1/2) α² v_j⊤Av_j + α r_0⊤v_j = (1/2) α²λ_j + αη_j
Since the v_j's are an orthonormal basis for ℝⁿ (because A is symmetric and invertible), we can express x − x_0 (for any x) in terms of the v_j's, giving x = x_0 + Σ_j β_j v_j where β_j = v_j⊤(x − x_0). Note that the v_j's are also mutually conjugate (which follows from them being both orthonormal and eigenvectors) and that since q(z + x_0) − q(x_0) is quadratic in z with constant term 0, we have that for any two conjugate vectors u and w:

q(u + w + x_0) − q(x_0) = (q(u + x_0) − q(x_0)) + (q(w + x_0) − q(x_0))

which is straightforward to show. Thus we have:
q(x) − q(x_0) = q((x − x_0) + x_0) − q(x_0)
             = Σ_{j=1}^n ( q(β_j v_j + x_0) − q(x_0) ) = Σ_{j=1}^n ( (1/2) β_j²λ_j + β_jη_j )

What this says is that the sizes β_j of the contributions of each eigenvector/eigen-direction to x have an independent influence on the value of q(x) − q(x_0), and so we can meaningfully talk about how each one of them independently "converges" as CG iterates.
In a sense, each β_j is being optimized by CG, with the ultimate goal of minimizing the corresponding 1-D quadratic q(β_j v_j + x_0) − q(x_0), whose minimizer is β*_j = −η_j/λ_j = v_j⊤(x* − x_0), with associated minimum value:

q(β*_j v_j + x_0) − q(x_0) = −(1/2) η_j²/λ_j = −ω_j

where we define ω_j ≡ η_j²/(2λ_j). The difference between the current value of q(β_j v_j + x_0) and its minimum has a particularly nice form:

q(β_j v_j + x_0) − q(β*_j v_j + x_0) = (1/2) β_j²λ_j + β_jη_j + ω_j = ω_j ( (λ_j/η_j) β_j + 1 )²
Now suppose that x − x_0 ∈ K_i(A, r_0), so that there exists some (i−1)-degree polynomial s s.t. x − x_0 = s(A) r_0, and note that s(A)v_j = s(λ_j)v_j. Then,

β_j = v_j⊤(x − x_0) = v_j⊤ s(A) r_0 = (s(A)v_j)⊤ r_0 = (s(λ_j)v_j)⊤ r_0 = s(λ_j) η_j

so that:

q(β_j v_j + x_0) − q(β*_j v_j + x_0) = ω_j ( (λ_j/η_j) s(λ_j)η_j + 1 )² = ω_j ( λ_j s(λ_j) + 1 )² = ω_j p(λ_j)²

where we define p(z) = zs(z) + 1 (which is a general polynomial of degree i with constant coefficient 1).
Summarizing, we have:

q(x) − q(x*) = (q(x) − q(x_0)) − (q(x*) − q(x_0))
             = Σ_{j=1}^n ( q(β_j v_j + x_0) − q(β*_j v_j + x_0) ) = Σ_{j=1}^n ω_j p(λ_j)²    (20.13)
We now apply this result to CG. We know that after i steps, CG applied to q with initialization x_0 finds the iterate x_i which minimizes q(x_i) subject to the restriction x_i − x_0 ∈ K_i(A, r_0). This is equivalent to minimizing over all possible p with the requirement that p has degree i with constant coefficient 1, or in other words:

q(x_i) − q(x*) = min_p Σ_{j=1}^n ω_j p(λ_j)²    (20.14)

Thus we see that CG is effectively picking a polynomial p to minimize the weighted sum of p² evaluated at the different eigenvalues.
Fig. 20.5. This figure demonstrates geometrically how the contribution to the polynomial p(z) of an additional root ν or ν′ in the vicinity of a small eigenvalue λ_1 or a large eigenvalue λ_2 (resp.) affects the loss term associated with the other eigenvalue. In particular, the distance of the lines above or below the horizontal axis is equal to the factor whose square effectively multiplies the loss term associated with the given eigenvalue.
As mentioned before, there are various results that make use of expressions
similar to this one in order to prove results about how the distribution of the
eigenvalues of A determine how quickly CG can make progress optimizing q.One
particularly interesting result states that if the eigenvalues cluster into m groups,
then since we can easily design a degree-m polynomial p which is relatively small
in the vicinity of each cluster by placing a root of p at each cluster center, the
error will be quite low by the m-th iteration [30].
However, the particular form of eqn. 20.14 and its derivation allow us to paint a more intuitive picture of how CG operates. Each of the terms in the sum 20.13 corresponds to a direction-restricted objective q(β_j v_j + x_0) − q(β*_j v_j + x_0) = ω_j p(λ_j)², which is indirectly optimized by CG w.r.t. the β_j's. The size of each of these "loss" terms negatively correlates with how much progress CG has made in optimizing x along the corresponding eigen-direction, and by examining the form of these terms, we can talk about how CG will "prioritize" these different terms (and hence the optimization of their associated eigen-directions) through its choice of an optimal polynomial p.
Firstly, consider the "weights" ω_j = −(q(β*_j v_j + x_0) − q(x_0)) = (1/2) η_j²/λ_j, which measure the total decrease in q that can be obtained by fully optimizing along the associated direction v_j. Their effect is thus to shift the focus of CG towards those eigen-directions which will give the most reduction. They are inversely proportional to the curvature λ_j, and proportional to η_j² = (v_j⊤r_0)², which is the square of the size of the contribution of the eigen-direction within the initial residual (which in HF will be the gradient of f when x_0 = 0), and this makes the ω_j's "scale-invariant", in the sense that any linear reparameterization which preserves the eigenvectors of A, while possibly rescaling the eigenvalues, will have no effect on the ω_j's.
Secondly, we note the effect of the size of the λ_j's, or in other words, the curvatures associated with the v_j's. If it weren't for the requirement that p must have a constant term of 1, the λ_j's probably wouldn't have any influence on CG's prioritizing of directions (beyond how they modulate the weights ω_j). But note that general polynomials of degree i with constant coefficient 1 must have the form:

p(z) = ∏_{k=1}^i ( 1 − z/ν_k )

for ν_k ∈ ℂ. We will argue by illustrative example that this fact implies that CG
will favor high-curvature directions, everything else being equal. Suppose there are two tight clusters of eigenvalues of A, a low-magnitude one located close to zero and a large-magnitude one located further away. Suppose also that they have equal total loss as measured by the sum of the associated ω_j p(λ_j)²'s (for the current p). Placing an additional root ν_k close to the large-magnitude cluster will greatly reduce the associated ω_j p(λ_j)² loss terms in that cluster, by effectively multiplying each by (1 − λ_j/ν_k)², which will be small due to the closeness of ν_k to each λ_j. Meanwhile, for the λ_j's in the small-magnitude cluster, the associated loss terms will be multiplied by (1 − λ_j/ν_k)², which won't be greater than 1 since 0 < λ_j < ν_k, implying that these loss terms will not increase (in fact, they will very slightly decrease).
Now contrast this with what would happen if we placed a root ν_k close to the small-magnitude cluster. As before, the loss terms associated with that cluster will be greatly reduced. However, because λ_j ≫ ν_k for λ_j's in the large-magnitude cluster, we will have (1 − λ_j/ν_k)² ≫ 1 for such λ_j's, and so the associated loss terms will greatly increase, possibly even resulting in a net increase in q. Thus CG, being optimal, will place the root near to the large-magnitude cluster in this situation, versus the small-magnitude one, despite convergence of either one yielding the same improvement in q.
20.10 Initializing CG
As in the previous section we will use the generic notation q(x) = (1/2) x⊤Ax − b⊤x to refer to the quadratic objective being optimized by CG.
A useful property of CG is that it is able to make use of arbitrary initial guesses x_0 for x. This choice can have a strong effect on the performance of CG, which is not surprising since x_i depends strongly on the Krylov subspace K_i(A, r_0) (where r_0 = Ax_0 − b), which in turn depends strongly on x_0. From
the perspective of the previous section, an initial x_0 may be "more converged" than x = 0 along some eigen-directions and less converged along others, thus affecting the corresponding weights ω_j, which measure the total reduction that can be obtained by fully optimizing eigen-direction v_j (versus leaving it as it is in x_0). This "redistribution" of weights caused by taking a different x_0 may make the quadratic optimization easier or harder for CG to optimize, depending on how the eigenvalues and associated weights are distributed.
Since the local geometry of the error surface of f (and hence the local damped quadratic model q = M̂) changes relatively slowly between updates (at least along some eigen-directions), this suggests using the previous update δ_{k−1} as the starting solution x_0 for CG, as was done by Martens [22].
In practice, this choice can result in an initial value of q which is higher than zero, and thus seemingly worse than just using x_0 = 0, which satisfies q(x_0) = 0. x_0 may not even be a descent direction, implying that q(εx_0) > 0 for all ε > 0. But these objections are based on the naive notion that the value of q tells us everything there is to know about the quality of a potential initialization. What we observe in practice is that while CG runs initialized with x_0 = δ_{k−1} "start slow" (as measured by the value of q(x)), they eventually catch up and then surpass runs started from x_0 = 0.
To make sense of this finding, we first note it is easy to design initializations which will have arbitrarily high values of q, but which require only one CG step to reach the minimum. To do this, we simply take the minimizer of q and add a large multiple of one of the eigenvectors of A to it. This corresponds to the situation where only one eigenvalue λ_j has non-zero weight ω_j, so that to make q(x) − q(x*) = 0 CG can simply select the degree-1 polynomial which places a root at λ_j.
More generally, x_0 may be more converged than 0 along eigen-directions which are more numerous, or which have small and spread-out eigenvalues (i.e. curvatures), and meanwhile less converged than 0 (perhaps severely so) only along eigen-directions which are fewer in number, or have larger or more tightly clustered eigenvalues. If the latter group has a larger total weight (given by the sum of the ω_j's as defined in the previous section) this will cause q(x_0) > 0. But since the former directions will be easier for CG to optimize than the latter, this implies that the given x_0 will still be a highly preferable initialization over the "safer" choice of 0, as long as CG is given enough iterations to properly optimize the latter group of badly initialized but "easy-to-fix" eigen-directions.
We surmise that the choice x_0 = δ_{k−1} fits into the situation described above, where the latter group of eigen-directions corresponds to the slowly optimized low-curvature directions that tend to remain descent-directions across many HF iterations. Consistent with this theory is our observation that the number of CG steps required to achieve q(x) < 0 from the initialization x_0 = δ_{k−1} tends to grow linearly with the number of CG steps used at the previous HF iteration⁹.
⁹ This is, incidentally, one reason why it is good to use a fixed maximum number of CG iterations at each HF step.
Analogously to how the current update vector is "decayed" by a scalar constant when using gradient descent with momentum, we have found that it is helpful to slightly decay the initialization, taking x_0 = ζδ_{k−1} for some constant 0 ≤ ζ ≤ 1, such as 0.95.
Choosing this decay factor carefully is not nearly as important for HF as it can be for momentum methods. This is because while momentum methods modify their current update vectors by a single gradient-descent step, HF uses an entire run of CG, which can make much more significant changes. This allows HF to more easily scale back x_0 along eigen-directions which, while they may have been helpful at the previous θ, are no longer appropriate to follow from the current θ. In particular, x_0 will be quickly "corrected" along turbulent directions of high curvature, reducing (but not completely eliminating) the need for a decay to help "clean up" these directions.
Our experience suggests that properly tuning the decay constant becomes more important as aggressive CG truncation, or other factors like weak preconditioning, limit CG's ability either to modify the update from its initial value x_0, or to make good progress along important low-curvature directions. While the former problem calls for lowering ζ, the latter calls for raising it. The optimal value will likely depend on the amount of truncation, the type of preconditioning, and the local geometry of the objective being optimized. ζ = 0.95 seems to be a good default value, but it may help to reduce it when using an approach which truncates CG very early. It may also be beneficial to increase it in the later stages of optimization, where CG struggles much harder to optimize q along low-curvature directions.
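Schematically, the decayed initialization sits inside the HF outer loop as in the Python skeleton below (a sketch with hypothetical routine names; run_cg stands for whatever damped, preconditioned, truncated CG solver is being used).

def hf_outer_loop(theta, f_and_grad, make_mvp, run_cg, num_iters, zeta=0.95):
    # Skeleton of the HF outer loop, showing only the decayed CG initialization
    # x0 = zeta * delta_{k-1}; damping, preconditioning, CG truncation and the
    # line search are all assumed to live inside the hypothetical run_cg.
    delta = None
    for _ in range(num_iters):
        loss, grad = f_and_grad(theta)
        mvp = make_mvp(theta)                       # v -> B_hat v products
        x0 = zeta * delta if delta is not None else None
        delta = run_cg(mvp, grad, x0=x0)
        theta = theta + delta
    return theta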
20.11 Preconditioning
As powerful as CG is, there are quadratic optimization problems which can be easily solved using more direct methods, but with which CG will struggle. For example, if the curvature matrix is diagonal, CG will in general require i iterations to converge (where i is the number of distinct values on the diagonal), using a total of O(in) time. Meanwhile, we could easily solve the entire system by straightforward inversion of the diagonal curvature matrix in time O(n). CG is, in a sense, unaware of this special structure and unable to exploit it.
While the curvature matrix will in general not be diagonal or have any other special form that makes it easy to invert, there may nevertheless be cheap operations which can exploit information about the coarse structure of the curvature matrix to do some of the work in optimizing q, reducing the burden on CG.
In the context of HF, preconditioning refers to the reparameterization of M̂¹⁰ according to some linear transformation that is relatively easy to invert, with the idea that CG will make more rapid progress per iteration optimizing w.r.t. the new parameterization.
¹⁰ We will use the ˆ· notation for the damped quadratic and associated damped curvature matrix B̂, since this is what CG will actually optimize when used within HF.
Formally, given some invertible transformation defined by a matrix C, we transform the quadratic objective M̂(δ) by a change of coordinates δ = C⁻¹γ and optimize w.r.t. γ instead of δ:

M̂(C⁻¹γ) = (1/2) γ⊤ C⁻⊤ B̂ C⁻¹ γ + ∇f⊤ C⁻¹ γ
Applying preconditioning to CG is very easy and amounts to computing transformed residual vectors y_i at each iteration by solving Py_i = r_i, where P = C⊤C (see alg. 20.2). This can be accomplished, say, by multiplication of r_i by P⁻¹, which for many common choices of P (such as diagonal approximations of B̂) is a cheap operation.
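For a diagonal P the extra work per iteration is just an element-wise division; the following sketch (ours, not a reference implementation) shows where the solve Py_i = r_i enters the standard preconditioned CG recursion.

import numpy as np

def preconditioned_cg(mvp, grad, P_diag, max_iters=250, tol=1e-8):
    # Minimizes grad^T x + 0.5 x^T B x with preconditioned CG started from zero.
    # mvp(v) returns B v, and P_diag holds the diagonal of the preconditioner P,
    # so the transformed residual y solving P y = r is simply r / P_diag.
    x = np.zeros_like(grad)
    r = -grad
    y = r / P_diag
    d = y.copy()
    ry = r @ y
    for _ in range(max_iters):
        Bd = mvp(d)
        alpha = ry / (d @ Bd)
        x = x + alpha * d
        r = r - alpha * Bd
        if np.linalg.norm(r) < tol:
            break
        y = r / P_diag
        ry_new = r @ y
        d = y + (ry_new / ry) * d
        ry = ry_new
    return x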
Preconditioning can be applied to other optimization methods, such as gra-
dient descent, where it corresponds to a non-static linear reparameterization of
the objective f that typically varies with each iteration, and amounts simply to
multiplication of the gradient update by P⁻¹. In fact, one way to view 2nd-order
optimization is as a particular non-static preconditioning approach for gradient
descent, where P is given by the curvature matrix B (or some approximation or
Krylov subspace restriction of it).
20.11.1 The Effects of Preconditioning
In section 20.9, we saw how the eigenvalues of the curvature matrix and the
corresponding sizes of the contributions to the initial residual of each eigen-
direction effectively determine the convergence characteristics of CG, in terms of
the “order” in which the eigen-directions tend to converge, and how quickly. It was
found that each eigen-direction has an effective "weight" ω_j (corresponding to the total decrease in q which can be obtained by completely optimizing it), and that CG prioritizes convergence along the eigen-directions both according to their weights and their associated curvature/eigenvalue, preferring larger values of both. Because CG is optimal, it will tend to make faster progress along directions whose eigenvalues are in close proximity to many other ones that are associated with directions of high weight (due to its ability to make progress on many such directions at once when their eigenvalues are closely packed).
Thus to understand how a potential preconditioning scheme affects the convergence of CG we can look at the eigen-distribution of the transformed curvature matrix C⁻⊤B̂C⁻¹, and the associated weights, which depend on the transformed initial residual C⁻⊤(B̂x_0 − ∇f). Choices for C (or equivalently P) which yield tight clusters of eigenvalues should lead to overall faster convergence, at least along the directions which are contained in such clusters.
But as discussed in section 20.8.7, the eigenvectors and corresponding eigen-
value distribution will affect the order in which various directions converge, and
this will interact in a non-trivial way with CG truncation damping. In partic-
ular, certain directions which would otherwise never be touched by CG in the
original parameterization, either because their eigenvalues are located far away
from any high-weight eigenvalue clusters, or because they have very low curva-
ture (i.e., low eigenvalue), could, within the reparameterization, become part of
eigen-directions with the opposite properties, and thus be partially optimized by
CG even when it is aggressively truncated.
This is a potential problem, since it is our experience that certain very low
curvature directions tend to be highly non-trustworthy for neural network train-
ing objectives (in the default parameterization). In particular, they often tend to
correspond to degeneracies in the quadratic model, such as those introduced by
using different sets of data to compute the gradient and curvature-matrix vector
products (see section 20.12.1), or to directions which yield small reductions in
q for the current minibatch but large increases on other training data (an issue
called “minibatch overfitting”, which is discussed in section 20.12.2).
20.11.2 Designing a Good Preconditioner
Designing a good preconditioner is an application specific art, especially for HF,
and it is unlikely that any one preconditioning scheme will be the best in all
situations. There will often be a trade-off between the computational efficiency
of implementing the preconditioner and its effectiveness, both in terms of how much it speeds up the convergence of CG, and of how it may reduce the effectiveness of CG truncation damping.
While the previous section describes how a preconditioner can help in theory,
in practice it is not obvious how to design one based directly on insights about
eigen-directions and their prioritization.
An approach which is popular and often very effective in various domains where CG is used is to design P to be some kind of easily inverted approximation of the curvature matrix (in our case, B̂). While the ultimate purpose of preconditioning is to help CG optimize more effectively, which may conceivably be accomplished by less obvious choices for P, approximating B̂ may be an easier goal to approach directly. Justifying this idea is the fact that when P = B̂, the preconditioned matrix is I, so CG will converge in one step.
Adopting the perspective that P should approximate B̂, the task of designing a good preconditioner becomes one of balancing approximation quality with practical concerns, such as the cost of multiplying by P⁻¹.
Of course, "approximation quality" is a problematic concept, since the various ways we might want to define it precisely, such as via various matrix norms, may not correlate well with the effectiveness of P as a preconditioner. Indeed, CG is invariant to the overall scale of the preconditioner, and so while βB̂ would be an optimal preconditioner for any β > 0, it could be considered an arbitrarily poor approximation to B̂ as β grows, depending on how we measure this.
Diagonal P ’s are a very convenient choice due to the many nice properties
they naturally possess, such as being full rank, easy to invert, and easy to store.
They also tend to be quite effective for optimizing deep feed-forward neural
networks, due to how the scale of the gradient and curvature w.r.t. the hidden
activities grows or shrinks exponentially as we proceed backwards through the
layers [4, 18], and how each parameter is associated with a single layer. Without
compensating for this with diagonal preconditioning, the eigenvalues of the ef-
fective curvature matrix will likely be much more “spread out” and thus harder
for CG to deal with. By contrast, RNN optimization does not seem to benefit
as much from diagonal preconditioning, as reported by Martens and Sutskever
[23]. Despite how RNNs can possess per-timestep scale variations analogous to
the per-layer scale variations sometimes seen with feed-forward nets, these won’t
manifest as differences in scales between any particular parameters (i.e. diagonal
scale differences), due to the way each parameter is used at every time-step.
Many obvious ways of constructing non-diagonal preconditioners end up re-
sulting in P ’s which are expensive and cumbersome to use when n is large. For
example, if P or P
1
is the sum of a k-rank matrix and a diagonal, it will require
O((k +1)n) storage, which for very large n will be a problem (unless k is very
small).
A well-designed diagonal preconditioner P should represent a conservative estimate of the overall scale of each parameter, and while the diagonal of the curvature matrix is a natural choice in many situations, such as when the curvature matrix is diagonally dominant, it is seriously flawed for curvature matrices with a strong non-diagonal component. Nonetheless, building a diagonal preconditioner based on d = diag(B̂) (or an approximation of this) is a sensible idea, and forms the basis of the approaches taken by Martens [22] and Chapelle and Erhan [10]. However, it may be beneficial, as Martens [22] found, not to use d directly, but to choose P to be somewhere between diag(d) and a scalar multiple of the identity matrix. This has the effect of making it more gentle and conservative, and it works considerably better in practice. One way to accomplish this is by raising each entry of d (or equivalently, the whole matrix P) to some power 0 < ξ < 1, which will make P tend to the identity as ξ approaches 0.
In situations where diagonal damping penalty terms like the Tikhonov term are weak or absent, it may also be beneficial to include an additional additive constant κ, which also has the effect of making P tend to a scalar multiple of the identity as κ grows, so that we have:

P = (diag(d) + κI)^ξ

If there is information available about the coarse relative scale of the parameters, in the form of some vector s ∈ ℝⁿ, such as in the reparameterized neural network example discussed in sec. 20.8.2, it may be better to use κ diag(s) instead of κI.
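Constructing such a diagonal preconditioner is then a one-liner; in the sketch below d is whatever estimate of diag(B̂) is available (for instance from the methods of the next two subsections), and the particular default values of ξ and κ are placeholders rather than recommendations from the text.

import numpy as np

def make_diag_preconditioner(d, xi=0.75, kappa=1e-2, scale=None):
    # P = (diag(d) + kappa * diag(scale))^xi, stored and returned as a vector.
    # d     : estimate of diag(B_hat), assumed non-negative
    # xi    : exponent in (0, 1]; smaller values pull P towards the identity
    # kappa : additive constant; scale (optional) replaces the identity with a
    #         vector of coarse per-parameter scales
    s = np.ones_like(d) if scale is None else scale
    return (d + kappa * s) ** xi

# inside preconditioned CG, the solve P y = r then becomes y = r / P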
It is important to emphasize that d should approximate diag(B̂) and not diag(B), since it is the former curvature matrix which is used in the quadratic that CG actually optimizes. When D is a diagonal matrix, one should take

d = diag(B) + λ diag(D)

where the latter contribution can be computed independently and exactly (and not via the methods for approximating diag(B) which we will discuss next). Meanwhile, if the damping matrix D is non-diagonal, then one should take

d = diag(B + λD)

where we might in fact use the aforementioned methods in order to approximate the diagonal of B + λD together.
So far the discussion has ignored the cost of obtaining the diagonal of a curvature matrix. Although it is easy to compute Hessian-vector products of arbitrary functions, there exists no efficient exact algorithm for computing the diagonal of the Hessian of a general nonlinear function (Martens et al. [24, sec. 4]), so approximations must be used. LeCun et al. [2] report an efficient method for computing the diagonal of the Gauss-Newton matrix, but close examination reveals that it is mathematically unsound (although it can still be viewed as a heuristic approximation).
In the case of the Gauss-Newton matrix, it is possible to obtain the exact diagonal
at the cost of k runs of backpropagation, where k is the number of output
units [6]. This approach can be generalized in the obvious way to compute the
diagonal of the generalized Gauss-Newton matrix, and is feasible for classification
problems with small numbers of classes, although not feasible for problems such
as deep autoencoders or RNNs which have high-dimensional outputs. In the next
sections, we describe two practical methods for approximating the diagonal of
the GGN matrix regardless of the dimension of the output.
20.11.3 The Empirical Fisher Diagonal
One approach to approximating the diagonal of the GGN matrix G is to instead compute the diagonal of a related matrix for which exact computation of the diagonal is easier. For this purpose Martens [22] selected the Empirical Fisher Information matrix F, which is an approximation to the well-known Fisher information matrix [1] (which is itself related to the generalized Gauss-Newton matrix). The Empirical Fisher Information matrix is given by

F = Σ_i ∇f_i ∇f_i⊤

where ∇f_i is the gradient on case i. However, because of its special low-rank form, its diagonal is readily computable as:

diag(F) = Σ_i sq(∇f_i)

where sq(x) denotes the coordinate-wise square.
Because the ∇f_i's are available from the gradient computation ∇f = Σ_i ∇f_i over the minibatch, additionally computing diag(F) over the same minibatch in parallel incurs no extra cost, save for the possible requirement of storing the ∇f_i's, which can be avoided for feed-forward networks but not RNNs. Algorithm 20.6 computes the diagonal of the Empirical Fisher Information matrix without the extra storage. In the algorithm, each y_i is a matrix with B columns which represent the activations of a minibatch with B cases, and sq(·) is the coordinate-wise square. It differs from algorithm 20.2 only in lines 9 and 10.
In general, it is possible to compute the sum of squares of gradients in a
minibatch in parallel without storing the squares of the individual gradients
(which is often prohibitively expensive) whenever the computational graph of
Algorithm 20.6. An algorithm for computing the diagonal diag(F) of the Empirical Fisher Information matrix of a feedforward neural network (includes the standard forward pass)
1: input: y_1, θ mapped to (W_1, ..., W_{ℓ−1}, b_1, ..., b_{ℓ−1})
2: for all i from 1 to ℓ−1 do
3:   x_{i+1} ← W_i y_i + b_i
4:   y_{i+1} ← s_{i+1}(x_{i+1})
5: end for
6: dy_ℓ ← ∂L(y_ℓ; t)/∂y_ℓ   (t is the target)
7: for i from ℓ−1 downto 1 do
8:   dx_{i+1} ← dy_{i+1} ⊙ s'_{i+1}(x_{i+1})
9:   Set entries of diag(F) corresponding to W_i to sq(dx_{i+1}) sq(y_i)⊤
10:  Set entries of diag(F) corresponding to b_i to sq(dx_{i+1}) 1_B
11:  dy_i ← W_i⊤ dx_{i+1}
12: end for
13: output: diag(F)
the gradient makes precisely one additive contribution to every parameter for each case. In this case, it is possible to add [∇f_i]_j² to the appropriate entry of diag(F) as soon as it is computed, so we need not allocate temporary storage for [∇f_i]_j for each i and j (rather, only for each j).
However, when the computational graph of the derivative (for a given case i) makes multiple additive contributions to each [∇f_i]_j, it is necessary to allocate temporary storage for this quantity, since we must square its total contribution before summing over the cases. Interestingly, this issue does not occur for the RNN's gradient computation, since without the per-i squaring, each contribution to [∇f_i]_j can be stored in a single vector for all the i's.
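A direct numpy transcription of Algorithm 20.6 for a small fully-connected network might look as follows; this is our own sketch, which uses a single nonlinearity for all layers for brevity and whose argument names and layout are hypothetical.

import numpy as np

def empirical_fisher_diag(Ws, bs, y_in, targets, act, act_prime, loss_grad):
    # Forward pass plus the modified backward pass of Algorithm 20.6.
    # Ws, bs         : lists of weight matrices and bias vectors, one per layer
    # y_in           : input activations, one column per case in the minibatch
    # act, act_prime : element-wise nonlinearity and its derivative
    # loss_grad(y, t): gradient of the per-case loss w.r.t. the network outputs
    # Returns (dW, db): the entries of diag(F) corresponding to each W_i and b_i.
    ys, xs = [y_in], []
    for W, b in zip(Ws, bs):
        x = W @ ys[-1] + b[:, None]
        xs.append(x)
        ys.append(act(x))
    dW, db = [], []
    dy = loss_grad(ys[-1], targets)
    for i in reversed(range(len(Ws))):
        dx = dy * act_prime(xs[i])
        dW.insert(0, (dx ** 2) @ (ys[i] ** 2).T)   # sq(dx_{i+1}) sq(y_i)^T
        db.insert(0, (dx ** 2).sum(axis=1))        # sq(dx_{i+1}) 1_B
        dy = Ws[i].T @ dx
    return dW, db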
20.11.4 An Unbiased Estimator for the Diagonal of G
Chapelle and Erhan [10] give a randomized algorithm for computing an unbiased
estimate of the diagonal of the generalized Gauss-Newton matrix, which requires
the same amount of work as computing the gradient. And just as with any
unbiased estimate, this approach can be repeatedly applied, and the results
averaged, to achieve more precise estimates.
The method of Chapelle and Erhan is described in algorithm 20.7 below:
Algorithm 20.7. Computing an unbiased estimate of the diagonal of the GGN
matrix

1: Sample v ∈ R^m from a distribution satisfying E[vv^T] = I
2: output: sq(J^T (L'')^{1/2} v)
As discussed in section 20.6, multiplication of an arbitrary v ∈ R^m by the
Jacobian J of F can be performed efficiently by the usual back-propagation
algorithm. The correctness of algorithm 20.7 is easy to prove:
    E[sq(J^T (L'')^{1/2} v)] = E[diag((J^T (L'')^{1/2} v)(J^T (L'')^{1/2} v)^T)]
                             = diag(J^T (L'')^{1/2} E[vv^T] (L'')^{1/2} J)
                             = diag(J^T (L'')^{1/2} (L'')^{1/2} J)      (as E[vv^T] = I)
                             = diag(J^T L'' J)
                             = diag(G)

where we have used the identity sq(x) = diag(xx^T).
Martens et al. [24], following the work of Chapelle and Erhan [10], introduced
an efficient unbiased approximation method for estimating the entire Hessian
or GGN matrix of a given function (or just their diagonals, if this is desired)
with a cost also comparable to computing the gradient. In the special case of
estimating the diagonal of the GGN matrix, the two methods are equivalent. Of
practical import, Martens et al. [24] also proved that sampling the components
of v uniformly from {−1, 1} will produce lower variance estimates than will be
obtained by sampling them from N(0, 1).
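To illustrate how such an estimator might be used in practice, here is a minimal Torch7-style sketch of averaging several estimates; JtLhalf is a hypothetical callback standing for multiplication by J^T (L'')^{1/2} (it is not an actual Torch7 or HF API), and nParams is the number of parameters:

-- A minimal sketch (not the authors' code) of averaging K unbiased estimates of diag(G)
local function estimateGGNDiag(JtLhalf, nParams, K)
   local est = torch.zeros(nParams)
   for k = 1, K do
      -- entries of v drawn uniformly from {-1, +1} (lower variance than N(0, 1))
      local v = torch.rand(nParams):apply(function(x) return x < 0.5 and -1 or 1 end)
      est:add(torch.pow(JtLhalf(v), 2))
   end
   return est:div(K)
end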
Computationally, the method of Chapelle and Erhan is very similar to the
method for computing the diagonal of the Empirical Fisher Information matrix that was
described in the previous section. Indeed, while the latter computes sq(J^T ∇L),
this method computes sq(J^T (L'')^{1/2} v) for random v's, and so the methods have
similar implementations. In particular, both estimates can be computed in parallel
over cases in the minibatch, and they share the issue with temporary storage
discussed in the previous section, which can be overcome for feed-forward
networks but not for RNNs.
In our experience, both methods tend to produce preconditioners with similar
properties and performance characteristics, although Chapelle and Erhan [10]
found that in certain situations this unbiased estimate gave better results. One
clear advantage of this method is that it can correctly account for structural
damping, which is not done by using the empirical Fisher matrix, as the gradients
of the structural damping objective are equal to zero. The disadvantage of this
method is that due to the stochastic nature of the curvature estimates, there
could be parameters with non-zero gradients whose diagonal estimates could
nonetheless be very small or even zero (which will never happen with the diagonal
of the Fisher matrix). The diagonal of the Fisher matrix also has the additional
advantage that it can be computed in tandem with the gradient at virtually no
extra cost.
20.12 Minibatching
In modern machine learning applications, training sets can be very large, and
a learning algorithm which processes all the examples in the training set to
compute each parameter update (called "batch processing") will likely be very
slow or even totally impractical [7]. Some training datasets may even be infinite
and so it may not even make any sense to talk about an algorithm operating in
batch-mode at all.
“Online” or “stochastic” gradient algorithms like stochastic gradient descent
(SGD) can theoretically use gradient information computed on arbitrarily small
subsets of the training set, called “minibatches”, as long as the learning rate is
sufficiently small. HF, on the other hand, uses minibatches to estimate the curva-
ture matrix, which is used in a very strong way to produce large and sometimes
aggressive parameter updates. These curvature matrix estimates may become
increasingly low-rank and degenerate as the minibatch size shrinks (assuming
no contribution from damping or weight-decay), which in some cases (see sub-
section 20.12.1) may lead to unbounded and nonsensical updates, although the
damping mechanisms discussed in sec. 20.8 can compensate for this to some
extent.
But even without these more obvious degeneracy issues, it can be argued that,
intuitively, the matrix B captures “soft-constraints” about how far we can go in
any one direction before making things worse, and if the constraints relevant to
a particular training case are not well approximated in the curvature estimated
from the minibatch, the update δ obtained from optimizing M could easily make
the objective f worse on such a case, perhaps severely so.
Thus 2nd-order methods like HF which must estimate the curvature only from
the current minibatch, may not work nearly as well with very small minibatches.
And while there are several strategies to deal with minibatches that are “too
small”, (as we will discuss in subsection 20.12.2), the benefits of using a 2nd-
order method like HF may be diminished by their use.
Fortunately, in the case of neural networks, there are elegant and natural
implementations which exploit data-parallelism and vectorization to efficiently
compute gradients and curvature matrix-vector products over minibatches (see
section 20.7). For highly parallel architectures like GPUs, these tend to give
an increase in computational cost which remains sublinear (as a function of
minibatch size) up to and beyond minibatch sizes which are useful in HF.
20.12.1 Higher Quality Gradient Estimates
Unlike 1st-order optimization schemes like SGD, where the number of iterations
required to approach a good solution can reach as high as 10^5–10^7, the number
of iterations required by a strong 2nd-order optimizer like HF is in our experience
orders of magnitude smaller, and usually around 10^2–10^3. While the linear
term b = −∇f(θ_{k−1}) passed to CG needs to be computed only once for each
update, CG may require on the order of 10^2–10^3 matrix-vector products with
the curvature matrix A = B to produce each update. These matrix-vector products
are by far the most computationally costly part of any truncated Newton
approach.
It may therefore be cost-effective to compute the gradient on a much larger
minibatch than is used to compute the matrix-vector products. Martens [22]
recommended using this technique (as do Byrd et al. [8]), and in our experience
it can often improve optimization speed if used carefully and in the right contexts.
But despite this, there are several good theoretical reasons why it might be
better, at least in some situations, to use the same minibatch to compute gradient
and curvature matrix-vector products. These have been corroborated by our
practical experience.
We will refer to the minibatches used to compute the gradient and curvature
matrix-vector products as the “gradient minibatch” and “curvature minibatch”,
respectively.
When the gradient and curvature minibatches are equal, and the GGN cur-
vature matrix is used, the quadratic model M maintains its interpretation as
the local Taylor series approximation of a convex function, which will simply be
the approximation of f obtained by linearizing F (see eqn. 20.9), but restricted
to the data in the current minibatch. In such a situation, with some additional
reasonable assumptions about the convex function L (strong convexity would
be sufficient, but is more than what is needed), the quadratic model M can be
written as:
    M(δ) = (1/2) δ^T B δ + ∇f_{k−1}^T δ + f(θ_{k−1})
         = (1/2) δ^T J^T L'' J δ + δ^T J^T ∇L + f(θ_{k−1})
         = (1/2) (Jδ)^T L'' (Jδ) + (Jδ)^T L'' (L'')^{−1} ∇L
              + (1/2) ∇L^T (L'')^{−1} L'' (L'')^{−1} ∇L − (1/2) ∇L^T (L'')^{−1} L'' (L'')^{−1} ∇L + f(θ_{k−1})
         = (1/2) (Jδ + (L'')^{−1} ∇L)^T L'' (Jδ + (L'')^{−1} ∇L) + c
         = (1/2) ‖Jδ + (L'')^{−1} ∇L‖²_{L''} + c

where c = f(θ_{k−1}) − (1/2) ∇L^T (L'')^{−1} ∇L and all quantities are computed only on the
current minibatch. Here we have used the fact that L'' is invertible, which follows
from the fact that L is strongly convex.
This result is interesting because it applies only when B is the generalized
Gauss-Newton matrix (instead of the Hessian), and it establishes a bound on
the maximum improvement in f that the quadratic model M can ever predict:
(1/2) ∇L^T (L'')^{−1} ∇L, a quantity which does not depend on the properties of the
network, only on its current predictions and the associated convex loss function.

Such a boundedness result does not exist whenever M is estimated using
different minibatches for the gradient and curvature. In this case, the estimated
gradient may easily lie outside the column space of the estimated curvature
matrix, in the sense that there may exist directions d s.t. g^T d < 0, ‖d‖ = 1, but
d^T B d = 0. In such a case it is easy to see that the quadratic model is unbounded
and M(αd) → −∞ as α → ∞. While boundedness can be guaranteed with the
inclusion of damping penalty terms which ensure that the damped curvature
matrix B̂_k is positive definite, and will also be guaranteed when the curvature
matrix B is full rank, it may be the case that the boundedness is "weak" in
the sense that d^T B d may be non-zero but extremely small, leading to a nearly
degenerate update δ. For example, when using Tikhonov damping we know
that d^T B̂_k d ≥ λ, but in order for this to sufficiently constrain the update along
direction d, λ may have to be large enough that it would impose unreasonably
high constraints on optimization in all directions.
More intuitively, the gradient represents a linear reward for movement in
certain directions d (the strength of which is given by g^T d) while the curvature
matrix represents a quadratic penalty. If we include the linear rewards associated
with a subset of cases without also including the corresponding quadratic penal-
ties, then there is a chance that this will lead to a degenerate situation where
some directions will have lots of reward (predicted linear reduction) without any
corresponding penalty (curvature). This can result in an update which makes f
worse even on cases contained in both the gradient and curvature minibatches,
for reasons that have nothing directly to do with a breakdown in the reliability
of the quadratic approximation to f . On the other hand, if the curvature and
gradient minibatches are equal and the quadratic approximation to f is oth-
erwise reliable (or properly damped), then using equal gradient and curvature
minibatches provides a minimal guarantee that f will improve on the current
minibatch after the proposed update is applied.
Another more subtle way in which using a smaller-sized curvature minibatch
than gradient minibatch could be counterproductive, is that in addition to caus-
ing a dangerous underestimation of the curvature associated with the left-out
cases, it may also lead to an overestimation of the curvature for the cases actu-
ally in the curvature-minibatch. This is because when we compute estimates of
the curvature by averaging, we must divide by the number of cases in the mini-
batch, and since this number will be smaller for the curvature estimate than
for the gradient, the gradient contributions from these cases will be smaller in
proportion to the corresponding curvature terms.
Byrd et al. [8] showed that if the eigenvalues of the curvature matrices (estimated
using any method) are uniformly bounded from below, in the sense that
there exists μ > 0 s.t. B̂ − μI is PSD for all possible B̂'s which we might produce,
then assuming the use of a basic line-search and other mild conditions, a
truncated Newton algorithm like HF which estimates the gradient on the full
training set will converge, in the sense that the gradients will go to zero. But
this result makes no use of the particular form of B and is as a result very weak,
saying nothing about the rate of convergence, or how small the updates will have
to be. As far as the theorem is concerned, B can be any λ-dependent matrix with
the required boundedness property, and need not have anything to do with local
quadratic models of f at all.
Despite all of these objections, the higher quality estimates of the gradient
may nonetheless provide superior convergence properties in some situations. The
best trade-off between these various factors is likely to be highly dependent on
the particular problem, the stage of the optimization (early versus late), and
the damping mechanisms being used. Our experience is that penalty and CG-
truncation damping become more active when there is a significant qualitative
mismatch between the gradient and curvature estimates, which is more likely to
happen when the training dataset, or the network's responses to it, are particularly
"diverse".
20.12.2 Minibatch Overfitting and Methods to Combat It
As mentioned in the previous section, the updates produced by HF may be effec-
tively “overfit” to the current minibatch of training data. While a single update
of SGD has the same problem, this is less of an issue because the updates are
extremely cheap and numerous. HF, by contrast, performs a run of CG with
anywhere between 10 and 300+ iterations, which is a long and expensive process,
and must be performed using the same fixed estimates of the gradient and curva-
ture from a single minibatch. Ideally, we could use a stochastic algorithm when
optimizing the local quadratic models which would be able to see much more
data at no extra cost. Unfortunately we are not aware of any batch methods
which possess the same strongly optimal performance for optimizing quadratics
as CG does, while also working well as a stochastic method.
The simplest solution to dealing with the minibatch overfitting problem is
to increase the size of the minibatches, thus providing CG with more accurate
estimates of the gradient and curvature. When optimizing neural networks, we
have observed that the minibatch overfitting problem becomes gradually worse as
optimization proceeds, and so implementing this solution will require continually
growing the minibatches, possibly without bound. Fortunately, there are other
ways of dealing with this problem.
The damping methods discussed in section 20.8 were developed to compensate
for untrustworthiness of the local quadratic approximations M being made to
f, which exists due to the simple fact that f is not actually a convex quadratic,
and so M may fail to be a sensible approximation to f at its minimum δ*. These
methods work by imposing various soft or hard constraints on the update δ in
order to keep it "closer" to 0 (where M is trustworthy by construction), according
to some metric.
Using minibatches to compute the gradient and curvature imposes a different
kind of untrustworthiness on the quadratic model, arising from the fact that the
function being approximated is not actually the true objective but rather just an
unbiased sample-based approximation of it. But despite the differing nature of
the source of this untrustworthiness, the previously developed damping methods
turn out to be well suited to the task of compensating for it, in our experience.
Of these, decreasing the maximum number of CG steps, using larger minibatches
for the gradient, and decreasing the default learning rate (which is
equivalent to damping by adding multiples of B, as discussed in section 20.8.3)
according to some SGD-like schedule, seem to be the most effective approaches in
practice. If standard Tikhonov damping is used and its strength λ is increased to
compensate for minibatch overfitting, this will make HF behave asymptotically
like SGD with a dynamically adjusted learning rate.
There is a compelling analogy between the minibatch overfitting which occurs
when optimizing δ with CG, and the general overfitting of a non-linear optimizer
applied to a conventional learning problem. And some of the potential solutions
to both of these problems turn out to be analogous as well. Tikhonov damping,
for example, is analogous to an L2 prior or “weight-decay” penalty (but centered
at δ =0), and CG truncation is analogous to “early-stopping”.
Recall that damping approaches are justified as a method for dealing with
quadratic approximation error because as the “size” of the update shrinks (ac-
cording to any reasonable metric), it will eventually lie inside a region where this
source of error must necessarily be negligible. This is due to the simple fact that
any super-linear terms in the Taylor series expansion of f, which are unmodeled
by M, will approach zero much more rapidly than the size of the update. It
is important to keep in mind that a similar justification does not apply to the
handling of minibatch-related estimation errors with update damping methods.
Indeed, the negative gradient computed on a given minibatch may not even be a
descent direction for the total objective (as computed on the complete training
set), and even an infinitesimally small update computed from a given minibatch
may actually make the total objective worse.
Thus, when tuning damping mechanisms to handle minibatch overfitting (ei-
ther by hand, or dynamically using an automatic heuristic), one shouldn’t aim
for obtaining a reduction in the total f that is a fixed multiple of that which
is predicted by the minibatch-computed M (as is done in section 20.8.5), but
rather to simply obtain a more modest reduction which is proportional to the
relative contribution of the current minibatch to the entire training dataset.
It is also worthwhile noting that the practice of initializing CG from the up-
date computed at the previous iteration of HF (as discussed in section 20.10)
seems to bias CG towards finding solutions that generalize better to data out-
side of the current minibatch. We don’t have a good understanding for why this
happens, but one possible explanation is that by carrying δ over to each new
run of CG and performing an incomplete optimization on it using new data, δ
is allowed to slowly grow (as HF iterates) along low-curvature directions (which
get optimized much more slowly by CG than high-curvature directions, as shown
in section 20.9) that, by necessity, must generalize across lots of training data.
The reasoning is that if such slowly optimized directions didn't generalize well,
then they would inevitably be detected as high-curvature ascent directions for
some new minibatch and quickly zeroed out by CG before ever having a chance
to grow large in δ.
Finally, Byrd et al. [9] have developed methods to deal with the minibatch
overfitting problem, which are based on heuristics that increase the minibatch
size and also terminate CG early, according to estimates of the variance of
the gradient and curvature-matrix vector products. While this is a potentially
effective approach (which we don’t have experience with), there are several
problems with it, in theory. First, variance is measured according to highly
parameterization-dependent metrics which are not particularly meaningful. Sec-
ond, increasing the size of the minibatch, which is only one method to deal with
minibatch overfitting, is not a strategy which will remain practical for very long.
Thirdly, aggressive early termination heuristics for CG, similar to this one, tend
to interact badly with non-zero CG initializations (because termination of CG
may be forced before M̂(δ) < 0 is achieved) and other forms of damping.
And finally, there are other more direct ways of measuring how well updates will
generalize, such as simply monitoring f on some training data outside of the
current minibatch.
20.13 Tricks and Recipes
There are many things to keep in mind when designing an HF-style optimization
algorithm for a particular application and it can be somewhat daunting even to
those of us that have lots of experience with the method. So in order to make
things easier in this regard, in this section we relate some of our experience in
designing effective methods, and describe several particular setups that seem
to work particularly well for certain deep neural network learning problems,
assuming the use of a standard parameterization.
Common elements to all successful approaches we have tried are (a configuration
sketch summarizing them appears after the list):

– use of the GGN matrix instead of the Hessian
– the CG initialization technique described in section 20.10
– a well-designed preconditioner. When using Tikhonov damping, a reasonable
choice is an estimate of the diagonal of the GGN matrix, modified using the
technique described in section 20.11.2 with κ = 0 and ψ = 0.75. When
using one of the scale-sensitive methods described in section 20.8.3, it may
be necessary to increase κ to something like 10^2 times the mean of the
diagonal entries of D
– the use of one of the diagonal damping methods, possibly in conjunction with
structural damping for certain RNN learning problems. For feedforward network
learning problems under the default parameterization, Tikhonov damping
often works well, and usually so does using a modified estimate of the
diagonal of the GGN matrix, provided that κ is large enough (as in the
previous point)
– the use of the progress-based termination criterion for CG described in section
20.4, in addition to some other condition which may stop CG sooner,
such as a fixed iteration limit
– dynamic adjustment of damping constants (e.g. λ) according to the LM
heuristic or a similar method
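The sketch below collects these recommendations into an illustrative configuration table; every field name is ours and does not correspond to any actual HF or Torch7 API:

-- Illustrative defaults only; the field names are hypothetical
local hfDefaults = {
   curvature      = 'GGN',                     -- generalized Gauss-Newton matrix, not the Hessian
   cgInitFromPrev = true,                      -- CG initialization technique of section 20.10
   precond        = {kappa = 0, psi = 0.75},   -- modified diagonal GGN estimate (section 20.11.2)
   damping        = 'Tikhonov',                -- plus structural damping for some RNN problems
   cgStopping     = {progressBased = true, maxIters = 250},  -- section 20.4 criterion + hard limit
   lmAdjustLambda = true,                      -- adapt the damping constant lambda with the LM heuristic
}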
Now, we describe the particulars of each of these successful methods.
First, there is the original algorithm described in [22], which works pretty
well. Here, the “CG-backtracking” approach is used to select an iterate based
on the objective function value (see section 20.8.7), the gradient is computed
on a larger subset of the training data than the curvature, and CG is always
terminated before reaching a fixed maximum number of steps (around 50–250,
depending on the problem).
Second, there is a subtle but powerful variation on the above method which
differs only in how CG is terminated, how the iterate used for the update δ is
selected, and how CG is initialized at the next iteration of HF. In particular,
CG is terminated as soon as the objective function, as evaluated on the data in
the “curvature minibatch” (see section 20.12.1) gets significantly worse than its
value from some number of steps ago (e.g. 10). The iterate used as the parameter
update δ is selected to minimize M (or perhaps f) as evaluated on some data
which is not contained in the curvature minibatch. Lastly, CG is initialized at
the next iteration k+1, not from the previous update δ_k, but instead from the
CG iterate which gave the highest objective function value on the curvature
minibatch (which will be close to the last). In practice, the quantities used to
determine when to terminate CG, and how to select the best iterate, do not need
to be computed at every step of CG, and can also be computed on much smaller
(but representative) subsets of data.
Finally, an approach which has emerged recently as perhaps the most effi-
cient and effective, but also the most difficult to use, involves modifying HF to
behave more like a traditional momentum method, thus making stronger use
of the CG initializations (see section 20.10) to better distribute the work involved
in optimizing the local quadratics across many iterations. To do this, we use a
smaller maximum number of CG steps (around 25 to 50), smaller minibatches
of training-data, and we also pay more attention to the CG initialization decay-
constant ζ, which usually means increasing it towards the end of optimization.
Using shorter runs of CG helps with minibatch overfitting, and makes it feasible
to use smaller minibatches and also compute the gradient and curvature using
the same data. And as discussed in section 20.12.1, computing the gradient on
the same data as the curvature has numerous theoretical advantages, and in
practice seems to result in a reduced need for extra damping, thus resulting in a
λ that shrinks reliably towards 0 when adjusted by the LM heuristic. However,
this method tends to produce “noisy” updates, which while arguably beneficial
from the standpoint of global optimization, make it more difficult to obtain
finer convergence on some problems. So when nearing the end of optimization,
we adopt some of the methods described in section 20.12.2, such as lowering
the learning rate, using shorter CG runs, increasing the minibatch size, and/or
switching back to using larger minibatches to compute gradients (making sure
to raise the damping constant λ to compensate for this) in order to achieve fine
convergence.
One more piece of general advice we have is that using small amounts of
weight-decay regularization can be highly beneficial from the standpoint of global
optimization. In particular, to get the lowest training error possible, we have
observed that it helps to use such regularization at the beginning of optimization
only to disable it near the end. Also, using a good initialization is extremely
important in regards to global optimization, and methods which work well for
deep networks include the sparse initialization scheme advocated in [22], and
the method of Glorot & Bengio [12], and of course the pre-training techniques
pioneered in [17].
20.14 Summary
We described various components of the Hessian-free optimization, how they
can interact non-trivially, and how their effectiveness (or possibly harmfulness)
is situation dependent. The main points to keep in mind are:
– for non-convex optimizations it is usually preferable to use the generalized
Gauss-Newton matrix, which is guaranteed to be PSD
– updates must be "damped" due to the untrustworthiness of the quadratic
model
– there are various damping techniques that can be used, and their effectiveness
depends highly on f and how it is parameterized
– truncating CG before convergence, in addition to making HF practical, can
also provide a beneficial damping effect
– the strategy for terminating CG is usually a combination of a progress-based
heuristic and a hard limit anywhere between 10 and 300+ (which should be
considered an important meta-parameter)
– preconditioning can sometimes enable CG to make more rapid progress per
step, but only if used correctly
– simple diagonal preconditioning methods tend to work well for feed-forward
nets but not for RNNs
– preconditioning interacts non-trivially with certain forms of damping, such
as CG truncation, which must be kept in mind
– initializing CG from the update computed by the previous run of CG can
have a beneficial "momentum-type effect"
– HF tends to require much larger minibatches than are used in SGD
– minibatch overfitting can cause HF's update proposals to be poor even for
δ's where the quadratic model is trustworthy
– using more data to compute the gradients than the curvature matrix-vector
products is a low-cost method of potentially increasing the quality of the
updates, but it can sometimes do more harm than good
– minibatch overfitting can also be combated using some of the standard
damping methods, along with simply increasing the minibatch size
– structural damping works well for training RNNs, particularly on problems
with pathological long-range dependencies
– exploiting data-parallelism is very important for producing an efficient implementation
– correctness of curvature matrix-vector products should be checked using finite
difference methods
– the extra memory costs associated with the parallel computation of gradients
and curvature matrix-vector products can be mitigated
The difficulty of customizing an HF approach for a particular application will
no doubt depend on the specifics of the model and the dataset. While in many
cases a generic approach can be used to good effect, some more difficult problems
like RNNs or feed-forward networks with non-standard parameterizations may
require additional care. And even on “easier” problems, a better designed ap-
proach may allow one to surpass performance barriers that may have previously
been mistaken for convergence.
This report has described many of the tricks and ideas, along with their theo-
retical justifications, which may be useful in this regard. While we can try to predict
what combination of ideas will work best for a given problem, based on previous
experience and/or mathematical/intuitive reasoning, the only way to be sure is
with careful experimentation. Unfortunately, optimization theory has a long way
to go before being able to predict the performance of a method like HF applied
to the highly non-convex objective functions associated with neural networks.
References
[1] Amari, S.I.: Natural gradient works efficiently in learning. Neural Computa-
tion 10(2), 251–276 (1998)
[2] Becker, S., Le Cun, Y.: Improving the convergence of back-propagation learning
with second order methods. In: Proceedings of the 1988 Connectionist Models
Summer School, pp. 29–37. Morgan Kaufmann, San Matteo (1988)
[3] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: Advances in Neural Information Processing Systems, vol. 19,
p. 153 (2007)
[4] Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gra-
dient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166
(1994)
[5] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G.,
Turian, J., Warde-Farley, D., Bengio, Y.: Theano: a cpu and gpu math expres-
sion compiler. In: Proceedings of the Python for Scientific Computing Conference
(SciPy), vol. 4 (2010)
[6] Bishop, C.: Exact calculation of the hessian matrix for the multilayer perceptron.
Neural Computation 4(4), 494–501 (1992)
[7] Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in
Neural Information Processing Systems, vol. 20, pp. 161–168 (2008)
[8] Byrd, R.H., Chin, G.M., Neveitt, W., Nocedal, J.: On the use of stochastic hes-
sian information in optimization methods for machine learning. SIAM Journal on
Optimization 21, 977 (2011)
[9] Byrd, R.H., Chin, G.M., Nocedal, J., Wu, Y.: Sample size selection in optimization
methods for machine learning. Mathematical Programming, 1–29 (2012)
[10] Chapelle, O., Erhan, D.: Improved preconditioner for hessian free optimization.
In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
[11] Schraudolph, N.: Fast Curvature Matrix-Vector Products for Second-Order Gra-
dient Descent. Neural Computation 14 (2002)
[12] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward
neural networks. In: Proceedings of the International Conference on Artificial In-
telligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and
Statistics (2010)
[13] Gould, N.I.M., Lucidi, S., Roma, M., Toint, P.L.: Solving the trust-region sub-
problem using the lanczos method. SIAM Journal on Optimization 9(2), 504–525
(1999)
[14] Hansen, P.C., O’Leary, D.P.: The use of the l-curve in the regularization of discrete
ill-posed problems. SIAM Journal on Scientific Computing 14, 1487 (1993)
[15] Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear sys-
tems (1952)
[16] Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation 18(7), 1527–1554 (2006)
[17] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neu-
ral networks. Science 313(5786), 504–507 (2006)
[18] Hochreiter, S.: Untersuchungen zu dynamischen neuronalen netzen. diploma
thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität
München (1991)
[19] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computa-
tion 9(8), 1735–1780 (1997)
[20] Jaeger, H., Haas, H.: Harnessing nonlinearity: Predicting chaotic systems and
saving energy in wireless communication. Science 304(5667), 78–80 (2004)
[21] LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr,
G.B., Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer,
Heidelberg (1998)
[22] Martens, J.: Deep learning via hessian-free optimization. In: Proceedings of the
27th International Conference on Machine Learning (ICML), vol. 951 (2010)
[23] Martens, J., Sutskever, I.: Learning recurrent neural networks with hessian-free
optimization. In: Proc. ICML (2011)
[24] Martens, J., Sutskever, I., Swersky, K.: Estimating the hessian by backpropagating
curvature. In: Proc. ICML (2012)
[25] Moré, J.J.: The levenberg-marquardt algorithm: implementation and theory. Nu-
merical Analysis, 105–116 (1978)
[26] Moré, J.J., Sorensen, D.C.: Computing a trust region step. SIAM Journal on
Scientific and Statistical Computing 4, 553 (1983)
[27] Nash, S.G.: Newton-type minimization via the lanczos method. SIAM Journal on
Numerical Analysis, 770–788 (1984)
[28] Nash, S.G.: A survey of truncated-newton methods. Journal of Computational
and Applied Mathematics 124(1), 45–59 (2000)
[29] Nesterov, Y.: A method for unconstrained convex minimization problem with the
rate of convergence O(1/k^2). In: Doklady AN SSSR, vol. 269, pp. 543–547 (1983)
[30] Nocedal, J., Wright, S.J.: Numerical optimization. Springer (1999)
[31] Pearlmutter, B.A.: Fast exact multiplication by the hessian. Neural Computa-
tion 6(1), 147–160 (1994)
[32] Shewchuk, J.R.: An introduction to the conjugate gradient method without the
agonizing pain (1994)
[33] Vinyals, O., Povey, D.: Krylov subspace descent for deep learning. arXiv preprint
arXiv:1111.4259 (2011)
[34] Wengert, R.E.: A simple automatic derivative evaluation program. Communica-
tions of the ACM 7(8), 463–464 (1964)
21 Implementing Neural Networks Efficiently

Ronan Collobert (1), Koray Kavukcuoglu (2), and Clément Farabet (3,4)

(1) Idiap Research Institute, Martigny, Switzerland
(2) NEC Laboratories America, Princeton, NJ, USA
(3) Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
(4) Université Paris-Est, Équipe A3SI - ESIEE Paris, France
Abstract. Neural networks and machine learning algorithms in gen-
eral require a flexible environment where new algorithm prototypes and
experiments can be set up as quickly as possible with best possible com-
putational performance. To that end, we provide a new framework called
Torch7, that is especially suited to achieve both of these competing goals.
Torch7 is a versatile numeric computing framework and machine learn-
ing library that extends a very lightweight and powerful programming
language Lua. Its goal is to provide a flexible environment to design,
train and deploy learning machines. Flexibility is obtained via Lua, an
extremely lightweight scripting language. High performance is obtained
via efficient OpenMP/SSE and CUDA implementations of low-level nu-
meric routines. Torch7 can also easily be interfaced to third-party soft-
ware thanks to Lua’s light C interface.
Runtime efficiency is probably perceived as the most important topic when
considering an efficient neural network implementation. One should however not
under-estimate the time spent in designing the right neural network for a given
task, or even the amount of work put into feeding data to the neural network
properly. Designing the right network for a given task in a short amount of
time requires a flexible development environment and a properly designed neural
network toolbox.
Several efficient (in terms of runtime execution) neural network libraries for
very specific needs are freely available. QuickNet
(http://www.icsi.berkeley.edu/Speech/qn.html) is a good example in the speech
recognition community: it implements most commonly used algorithms, that
is multi-layer perceptrons with few layers. However, flexible libraries are quite
rare. It is not a trivial task to implement a library supporting a wide range of
complex networks (such as convolutional networks for images, text or speech...),
any type of connectivity (full connectivity, shared weights, order in layers...), or
several types of training algorithms (stochastic gradient, batch, second-order like
LBFGS...). It is even more difficult to find a unified environment where one can
easily read, prepare and properly feed the data to the network, or debug the internals
of the architecture (for example when the network is not training properly).
In Section 21.1, we will consider efficient neural network implementation in
terms of efficient environment. We will then focus on the runtime efficiency and
analyze different state-of-the-art approaches to speed-up the network training
and testing phases in section 21.2. In this work, our analysis is built on the
experience we acquired with our own neural network implementation, Torch
(http://www.torch.ch), and more particularly the latest version, Torch7.
21.1 Efficient Environment
An efficient environment for implementing neural networks should not be only
limited to neural networks themselves. It should provide all necessary tools for
efficient development of new numerical algorithms in general. Often one needs
various numerical algorithms to transform the data before feeding it to the neural
network. Algorithms will strongly vary from one research domain to the other.
Moreover, in the last few years, the research activity on neural networks started
to intersect with many other domains like optimization, linear algebra, parallel
processing to name a few. A successful framework should provide necessary tools
to cope with the variability in the development process. Only then would the
framework allow one not only to easily investigate new types of models or new
training algorithms, but also to easily compare or combine neural networks with
other machine learning algorithms.
In order for a framework to provide the necessary environment for the development
of new numerical algorithms, its extension capabilities should be very advanced.
Machine learning researchers face many problems where there is a need to use
existing libraries. As we will see in Section 21.2, this includes interfacing effi-
cient linear algebra libraries or even the neural network implementation itself.
The ability to interface these existing libraries with as little runtime and code
development overhead as possible is crucial for an efficient toolbox.
Finally, the neural network toolbox implementation itself should be modular
enough to allow for the creation of any kind of new neural network models or
implementation on different modalities of data, leaving the choice of the archi-
tecture as much as possible to the user.
In this section, we will cover the following three important points: efficiency
of development environment, extension capabilities and modular neural network
toolbox. The modular structure of Torch7 that fuses advantages of high-level
and low-level libraries is shown in Figure 21.1.
21.1.1 Scripting Language
A scripting (or interpreted) language is the most convenient solution for fast
prototyping and development of new algorithms. At the same time, it is crucial
Fig. 21.1. Modular Structure of Torch7. Low level numerical libraries are interfaced
with TH to provide a unified tensor library. luaT provides essential data structures
for object/class manipulation in Lua. The core Torch package uses TH and luaT to
provide a numerical computing environment purely in Lua. All other packages can use
either the Torch interface from inside the Lua scripting environment, or can interface
the low-level C routines directly for increased performance.
for the interpreted language to have a lightweight C API, both in terms of
simplicity and efficiency. Simplicity in the C API encourages easier interfacing
to existing external libraries and efficiency is the single most important criterion
for large-scale applications.
In a complex development environment, the scripting language becomes the
“glue” between heterogeneous components: different structures of the same con-
cept (coming from different libraries) can be merged together using a high-level
language, while keeping all the functionalities that are exposed from all the dif-
ferent libraries.
Lua. Among existing scripting languages (see, e.g., the comparisons at
http://shootout.alioth.debian.org), finding the ones that satisfy runtime
efficiency severely restricts the number of possibilities. In our machine learning
framework Torch7, we chose Lua, the fastest interpreted language (with also the
fastest Just-In-Time (JIT) compiler, LuaJIT, http://luajit.org/) we could find.
Lua also has the advantage
that it is designed to be easily embedded in a C application, and provides a very
clean C API, based on a virtual stack to pass values and carry out function
evaluation from C. This unifies the interface to C/C++ and minimizes the effort
required for wrapping third party libraries.
Lua is intended to be used as a powerful, light-weight scripting language for
any program that needs one. It is implemented as a library, written in pure C
in the common subset of ANSI C and C++. Quoting the Lua webpage
(http://www.lua.org):
Lua combines simple procedural syntax with powerful data description
constructs based on associative arrays and extensible semantics. Lua is
dynamically typed, runs by interpreting bytecode for a register-based
virtual machine, and has automatic memory management with incre-
mental garbage collection, making it ideal for configuration, scripting,
and rapid prototyping.
Lua offers good support for object-oriented programming, functional program-
ming, and data-driven programming. As shown in Figure 21.2, it handles nu-
merical computations very efficiently (compared to C). This is a great asset for
rapid implementation of new numerical algorithms. Lua’s main type is table,
which implements associative arrays in a very efficient manner (see Figure 21.2).
An associative array is an array that can be indexed not only with numbers,
but also with strings or any other value of the language. Tables have no fixed
size, can be resized dynamically, and can be used as “virtual tables” over an-
other table, to simulate various object-oriented paradigms. Tables are the only
data structuring mechanism in Lua, yet a powerful one. One can use tables to
represent ordinary arrays, symbol tables, sets, records, queues, and other data
structures, in a simple, uniform, and efficient way. Lua uses tables to represent
packages as well. In addition, functions are first class citizens of the language. A
function, just like any other variable can be passed as a variable to or returned
from a function. Last, but not the least, Lua supports closures. Combined with
tables, closures provide a very powerful and efficient syntax for data handling
and programming complicated algorithms.
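As a small plain-Lua illustration of tables and closures (a toy example of ours, unrelated to Torch7 itself):

-- Tables as records, and a closure capturing local state (plain Lua)
local dataset = { {x = 1, y = 2}, {x = 3, y = 4} }

local function makeCounter()
   local n = 0
   return function() n = n + 1; return n end   -- closure capturing n
end

local nextId = makeCounter()
for _, sample in ipairs(dataset) do
   sample.id = nextId()                        -- each record gets a unique, incrementing id
end
print(dataset[2].id)                           -- prints 2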
Fig. 21.2. Comparison of runtime efficiency of the C language (with gcc 4.4), Lua 5.1.4
and Python 2.7. Lua and Python JIT implementations were LuaJIT and PyPy, respec-
tively. The Mandelbrot and Binary Trees benchmarks are taken from “The Computer
Language Benchmarks Game”. All benchmarks were run using a single CPU on a high-
end 12 cores Xeon server. The Mandelbrot benchmark makes a heavy use of numbers,
while the Binary Trees benchmark makes a heavy use of data structures (struct in C,
tables in Python and Lua). The execution time is reported (on a log-scale axis) for
each language.
Why Not Python? It is hard to talk about a programming language with-
out starting a flame war. While Lua is well known in the gaming programmer
community, mostly due to its speed advantage and great embedding capabilities,
Python is more popular with the general public. Without a doubt, Python ships
with more libraries, and one can easily find support for almost any problem in
many different contexts. However, Lua offers at least two important advantages
over Python:
– First and foremost, the simplicity of integrating existing C/C++ libraries
is very important. Many efficient numerical algorithms are implemented in
specialized packages such as BLAS, LAPACK, FFTW and similar libraries. A
lightweight interface to existing code is crucial for achieving a high-performance
environment. In section 21.2.8 we show quantitative results on the
efficiency of Lua compared to Python when wrapping BLAS function calls.
– Second, since Lua is embeddable in C/C++, any prototyped application can
be turned into a final system/product with very little extra effort. Since Lua
is written in pure C and does not depend on any external library, it
can be easily used in embedded applications such as Android, iOS
(https://github.com/clementfarabet/torch-ios), FPGAs (http://www.neuflow.org)
and DSPs.
There are also alternatives to writing a custom interface between an interpreted
language and C/C++, like the Simplified Wrapper and Interface Generator (SWIG,
www.swig.org).
Although these might provide a simplified interface at first, writing a tensor li-
brary with several linear algebra backends requires a very fine-grained control
and we found it harder to manage this interface than to use the native
Lua API.
In addition to its performance advantage on number of operations (see Fig-
ure 21.2), Lua also provides other unique advantages – rarely found simultane-
ously in existing programming languages – for implementing a large-scale ma-
chine learning framework. In the following section we will show how we extended
Lua’s basic numerical capabilities to a rather complete framework for developing
complex numerical algorithms.
21.1.2 Multi-purpose Efficient N-Dimensional Tensor Object
Torch7 provides a generic Tensor library (called TH) that is written in pure
C. This library is interfaced in Lua, providing new efficient multi-dimensional
array types in the scripting language. Most packages in Torch7 (or third-party
packages that depend on Torch7 ) rely on this Tensor class to represent signals,
images, videos..., allowing Lua to nicely “glue” most libraries together. Fast pro-
totyping and creation of new packages is made possible, as the library is available
directly from both Lua and C. Interfacing or extending existing libraries is very
efficient. The following code demonstrates a few standard Tensor-based opera-
tions, from the Lua side:
-- create a tensor of single-precision floats
t = torch.FloatTensor(100,100)

-- randomized: sampled from a normal distribution
l = torch.randn(100,100)

-- basic operators
r = t + l/2

-- in-place operators
r:add(0.5, t)

-- common operators
r = torch.log(torch.exp(-r)+10)
As in Matlab, multiple types can co-exist in Torch7, and it is easy to cast from
one to the other:
-- a single-precision tensor
tfloat = torch.FloatTensor(100)

-- converted to double-precision
tdouble = tfloat:double()

r = torch.FloatTensor(tdouble:size())

-- automatically casts from double->float
r:copy(tdouble)
A sample matrix-matrix multiplication operation is done as in the following
example.

x = torch.Tensor(1000,5000)
y = torch.Tensor(5000,3000)
z = torch.mm(x,y)
print(z:size())

 1000
 3000
[torch.LongStorage of size 2]
The Torch7 Tensor library provides a lot of classic operations (including linear
algebra operations), efficiently implemented in C, leveraging SSE instructions on
Intel’s platforms and optionally binding linear algebra operations to existing ef-
ficient BLAS/Lapack implementations (like Intel MKL, OpenBLAS or ATLAS).
As we will see in the next section, we also support OpenMP instructions and
CUDA GPU computing for certain subset of operations where these platforms
offer unique performance advantages.
Related Approaches. Our Tensor library implementation was mostly inspired
by the SN [3] and Lush [5] toolboxes, which were among the first to introduce
the concept (in a LISP language framework). Matlab also supports N-dimensional
arrays (even though early releases only supported 2D matrices). Compared to
Matlab, we put big emphasis on memory allocation control, as we will see in
Section 21.2.2. Numpy (http://numpy.scipy.org) is another popular alternative,
but is only available for
the Python language. As mentioned before, Lua offers unique advantages for a
machine learning framework because of its speed and the simpler C interface.
21.1.3 Modular Neural Networks
Following [6], we view a neural network as a set of modules connected together
in a particular graph. In Torch7, the “nn” package provides a set of standard
neural network modules, as well as a set of container modules that can be used
to define arbitrary directed (acyclic or not) graphs. By explicitly describing the
graph’s architecture, using pluggable modules, we avoid the complexity of a
graph parser, or any other middle-ware compiler. In practice, most networks
are either sequential, or have simple branching patterns and recursions. The
following example shows how to describe a multi-layer perceptron:
mlp = nn.Sequential()
mlp:add(nn.Linear(100,1000))
mlp:add(nn.Tanh())
mlp:add(nn.Linear(1000,10))
mlp:add(nn.SoftMax())
Each module (or container) provides standard functions to compute its output
state and to back-propagate derivatives to its inputs and to its internal parameters.
Given the previous network, an input X, and the gradient dE/dY of some error E
with respect to the output Y, these three functions can be called like this:
-- compute the activations Y = f(X)
Y = mlp:updateOutput(X)

-- compute some loss E = l(Y,T)
E = loss:updateOutput(Y,T)

-- compute the gradient dE/dY = dl(Y,T)/dY
dE_dY = loss:updateGradInput(Y,T)

-- back-propagate the gradients, down to dE/dX
dE_dX = mlp:updateGradInput(X,dE_dY)

-- compute the gradients wrt the weights: dE/dW
mlp:accGradParameters(X,dE_dY)
The “nn” package in Torch7 provides about 80 different neural network modules,
allowing the user to implement most existing neural networks with minimal effort
in pure Lua.
Leveraging the TH Library. Neural network modules in Torch7 use Tensors
provided by the TH library (see Section 21.1.2) to represent their own input data,
output or internal states. Most modules are simply written in Lua, using the
Torch package for intensive numerical operations. Only packages which require
very specific operations have a dedicated C back-end. And, even in this case
many of them use the TH library interface from C. In any case, Tensors are used
as data containers to interact seamlessly with the rest of the library.
Training Algorithms. In Torch7, every neural network module, given the
partial derivatives with respect to its outputs, is able to compute the partial
derivative with respect to its parameters and its inputs. Thus, any complicated
network structure can be trained using gradient-based optimization methods.
Batch, mini-batch and stochastic gradient descent algorithms are supported.
More advanced algorithms, such as second-order gradient descent algorithms like
conjugate gradient or LBFGS are also possible, thanks to a numerical package
called “optim”. While this optimization package is designed to be used stand-
alone, it also provides second-order optimization capabilities for neural networks
when used with the “nn” package.
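As an illustration of how the pieces fit together, the following sketch performs one SGD step on a small network with the "nn" and "optim" packages; the data, network sizes and learning rate are arbitrary choices of ours, not a prescribed recipe:

-- A minimal (illustrative) gradient step with the "nn" and "optim" packages
require 'nn'
require 'optim'

local mlp = nn.Sequential()
mlp:add(nn.Linear(100, 1000)):add(nn.Tanh()):add(nn.Linear(1000, 10))
local criterion = nn.MSECriterion()

local params, gradParams = mlp:getParameters()
local x, t = torch.randn(100), torch.randn(10)   -- one (input, target) pair

-- opfunc as expected by optim: returns the loss and the gradient w.r.t. the parameters
local function feval(p)
   if p ~= params then params:copy(p) end
   mlp:zeroGradParameters()
   local y = mlp:forward(x)
   local loss = criterion:forward(y, t)
   mlp:backward(x, criterion:backward(y, t))
   return loss, gradParams
end

optim.sgd(feval, params, {learningRate = 1e-2})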
21.1.4 Additional Torch7 Packages
Torch7 comes with many built-in and third-party packages. In order to encour-
age collaborations and redistribution of machine learning algorithms, a built-in
package manager is provided. It can easily download, compile and install addi-
tional Torch7 packages from any package repository, when needed. At this time,
the most interesting packages related to numerical computation or numerical
analysis are:
– torch: Torch7's main package: provides Tensors, easy serialization and other
basic functionalities. This package provides Matlab-like functions to create,
transform and use Tensors.
– gnuplot: This package provides a plotting interface to Gnuplot using Tensors.
– image: An image processing package. It provides all the standard image
processing functions such as loading and saving images, rescaling, rotating,
converting color spaces, filtering operations, . . .
– optim: A compact package providing steepest descent, conjugate gradient
and limited-memory BFGS implementations.
– qt: Full bindings between Qt and Lua (thanks to Léon Bottou for this huge
piece of work), with transparent conversion of Torch7 Tensors from/to QImages.
Great for quickly developing interactive demos with advanced GUIs (running
natively on Linux, Mac or Windows platforms).
The list of available packages is constantly growing, as Lua makes it easy to
interface any C library. Third-party packages include: unsup, which contains
several unsupervised learning algorithms like sparse coding and auto-encoders;
mattorch, which provides a two-way interface between Matlab’s matrix format
and Torch’s tensor; parallel, which provides simple routines to fork and execute
Lua code on local or remote machines, and exchange data using Torch7 ’s seri-
alization mechanism; camera, a simple wrapper to camera/webcam drivers on
Linux and MacOSX; imgraph, a package that provides lots of routines to create
edge-weighted graphs on images, and manipulate these graphs.
21.2 Efficient Runtime Execution
Torch7 has been designed with efficiency in mind, leveraging SSE when possible
and supporting two ways of parallelization: OpenMP and CUDA. The Tensor
library (which is interfaced to Lua using the “torch” package) makes heavy use
of these techniques. From the user viewpoint, enabling CUDA and OpenMP can
lead to great speedups in any Lua script, at zero implementation cost (because
most packages rely on the Tensor library). Other packages (like the "nn" package)
also leverage OpenMP and CUDA for more specific usages not covered by the
Tensor library. In the following we explain specific advantages of Torch7 for
achieving an excellent runtime performance.
21.2.1 Float or Double Representations
One of the major computational bottlenecks of modern computers is their mem-
ory bandwidth. When implementing any numerical algorithm, the number of
memory accesses should always be reduced by all means. This has an impact
not only on the coding style, but also on the floating point type we will choose
when implementing neural networks. A C "double" usually takes 8 bytes in memory,
while a C "float" takes only 4. Given that high precision is rarely required in
neural networks, one might consider using single precision in most cases.
On a simple matrix-matrix operation, speedups of ×2 are common when using
floats instead of doubles. In practice, similar speedups are also observed in neural
networks using single precision. In that respect, in Torch7, the user can
easily choose (at runtime) the default floating point type.
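For example, the default type can be switched globally at runtime; a one-line sketch (torch.setdefaulttensortype is, to our knowledge, the standard Torch7 call for this):

-- switch the global default to single precision (halves memory traffic)
torch.setdefaulttensortype('torch.FloatTensor')
t = torch.Tensor(1000, 1000)   -- now created as a FloatTensor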
21.2.2 Memory Allocation Control
One of the main complaints about using high level interfaces (such as Matlab)
for numerical programming is the loss of control over memory allocation. The
high-level abstraction makes it very hard for the researcher to know when a
copy of a tensor is created. Although this is not a major problem for small-
scale applications, as the data size grows, repeated copy and memory allocation
might become problematic and even a bottleneck for the algorithm. To avoid such
problems, the Torch7 tensor library is designed to give complete control over memory
allocation, allocating new memory only when the user asks for it. To better demonstrate
this point, we repeat the matrix multiplication example.
x = torch.Tensor(1000,5000)
y = torch.Tensor(5000,3000)
z = torch.mm(x,y)
print(z:size())

 1000
 3000
[torch.LongStorage of size 2]
One can see that the tensor z, which did not exist before, is newly allocated in
this context. One can imagine that these series of operations are done repeatedly
inside a loop. In this case, Torch7 allows the following intuitive syntax.
x = torch.Tensor(1000,5000)
y = torch.Tensor(5000,3000)
z = torch.Tensor(1000,3000)
torch.mm(z,x,y)
As can be seen from this example, the torch.mm function can also take three arguments, in which case the first argument receives the result of the operation. This syntax is implemented consistently for all operations in the Tensor library, so that for every single operation the user can choose between passing in a target Tensor and allocating a new one, without any overhead or heavy syntax. For example, the following element-wise sine operation can be written in two different ways.
x = torch.rand(1000)

-- a new tensor is created
tsin = torch.sin(x)

-- a scalar one is added to tensor x (x is reused)
x:add(1)

-- tsin is reused
torch.sin(tsin,x)
In this example, neither the scalar addition to tensor x nor the computation of the sine of the resulting tensor allocated any new memory. The example also hints at another use of the tensor library: one can make method calls on a tensor object, as in any object-oriented language. This syntax makes it explicit that the operation is applied directly to the object on which the method is called.
21.2.3 BLAS/LAPACK Interfaces
The key to a successful numerical computation framework is to have efficient
implementations of linear algebra operations. This requires highly sophisticated
algorithms with very precise implementations. In order to be able to provide
the best experience, the C tensor library (TH) included in Torch7 interfaces the BLAS and LAPACK libraries (http://www.netlib.org). All the major Level 1, 2 and 3 BLAS operations (such as matrix-vector and matrix-matrix products) and the most widely used linear algebra routines (such as singular value decomposition, matrix inverse and least-squares solvers) are interfaced to BLAS and LAPACK respectively. This interface provides the user with a rich set of building-block operations from which higher-level algorithms can easily be implemented.
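As a small illustration of these building blocks (a minimal sketch using standard torch calls; the matrix sizes and variable names are our own):

A = torch.rand(500,500)
b = torch.rand(500)

-- Level 2 BLAS: matrix-vector product
c = torch.mv(A, b)

-- LAPACK-backed routines: singular value decomposition and matrix inverse
u, s, v = torch.svd(A)
Ainv = torch.inverse(A)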
21.2.4 SIMD Instructions
Most computations involved in a neural network consist in applying the same type of operation over (possibly large) vectors or matrices. Several CPU architectures, such as PowerPC, Intel or ARM, support SIMD (Single Instruction, Multiple Data) operations, which are perfectly suited for this kind of task: for example, with SSE (Streaming SIMD Extensions) on Intel processors one can perform 4 additions over a vector with a single instruction. Calling these instructions instead of regular CPU instructions can lead to great speedups. This type of optimization is unfortunately CPU-specific. Fortunately, in many cases one can rely on BLAS/LAPACK implementations specialized for a given platform, which already leverage SIMD instructions. For other neural-network-specific cases, such as convolutions, one must implement specialized routines for each architecture of choice. In Torch7, we leverage SSE (on Intel architectures) and NEON (on ARM architectures) whenever possible. Compared to a non-SSE implementation, speedups of about 1.5× are common, as shown in Figure 21.3 in the case of convolutional neural networks.
Fig. 21.3. Comparison of several convolutional neural network implementations (with-
out SSE, with SSE, or with BLAS). Tests were conducted using one core on an Intel bi-Xeon X5690 server. Performance is given in number of examples processed per second
(higher is better). Three different architectures were tested, with input image sizes of
32x32, 96x96 and 256x256 respectively.
21.2.5 Ordering Memory Accesses
As already mentioned in Section 21.2.1 and Section 21.2.2, memory accesses are one of the main bottlenecks on today's computers. In fact, not only the number of accesses is important, but also the order of these accesses. For example, operations with tensors that are not contiguous in memory (say, with extra jumps between each element of the tensor) should always be avoided. In many cases, it is better to reorganize the tensor into a contiguous memory block (possibly at the cost of a copy) before performing any intensive computations. A striking example for neural networks is convolutions. When performing a convolution over an image, successive dot products are computed between the kernel and all possible patches of the image. One can create a copy of all these patches beforehand (the drawback being a large memory cost for large convolutions or large images) and then apply a single matrix-matrix operation (using BLAS) to compute all dot products. The memory consumption increases proportionally to the number of pixels in the convolution kernel. As shown in Figure 21.3, this leads to unbeatable runtime performance, even though the initial memory copy is quite large. Torch7 provides this implementation as part of the neural network package too. Whenever sufficient memory is available, it is advantageous to use this implementation, whose design also takes advantage of multi-core CPU architectures.
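As a small illustration of the contiguity issue (a minimal sketch using standard tensor methods; the sizes are our own):

x = torch.rand(1000,1000)

-- transposing produces a view with a non-contiguous memory layout
xt = x:t()
print(xt:isContiguous())      -- false

-- copy into a contiguous block before performing intensive computations
xtc = xt:contiguous()
print(xtc:isContiguous())     -- true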
21.2.6 OpenMP Support
Open Multi-Processing (OpenMP) provides a shared memory CPU paralleliza-
tion framework on C/C++ and Fortran languages on almost every operating
system and compiler toolset. It generally requires minimal modification for in-
tegrating into an existing project. Torch7 is designed and developed to use
OpenMP directives for various operations in its tensor library and neural net-
work package. Although the details of the OpenMP specification are beyond the scope of this work, below we show one of the most commonly used OpenMP directives, the parallelization of for-loops:
1 // private makes a copy for each thread
2 #pragma omp parallel for private(i)
3 for (i=0; i<N; i++)
4 {
5     a[i] = i*i;
6 }
Without the omp parallel for directive at line 2, this piece of code will run to
completion using a single thread. However, since the loop iterations are independent of each other, a trivial single-line addition to the existing code parallelizes this computation over many cores.
Torch7 automatically detects if the compiler supports OpenMP directives and
compiles a high level package that adds multi-threaded tensor operations, con-
volutions and several neural network classes. The switch from single-threaded to multi-threaded code is completely transparent to the user: it only requires the -l openmp argument to be passed to the torch executable. With this option, Torch7 by default uses the OpenMP-enabled function calls when available. The number of threads to be used can be specified either by setting the “OMP_NUM_THREADS” environment variable to the desired number:
bash# export OMP_NUM_THREADS=4

or from inside Lua by calling the

torch.setNumThreads(4)

function. Moreover, OpenMP can even be temporarily disabled and re-enabled using the following function calls:

torch.setNumThreads(1)
torch.setNumThreads(N)
The multi-threading of BLAS operations relies on the specific BLAS library that Torch7 is linked against. For example, Intel's MKL library also uses OpenMP for parallelizing Level 3 BLAS operations. In the neural network package nn, the convolutional layers, the most common non-linearities like tanh and sigmoid, pooling operations like average, sum and max pooling, and various other primitive modules like sum and square are all parallelized. For modules that apply element-wise operations, the parallelization is almost as trivial as shown in the example above. For more complicated modules, like convolutional layers with multiple input and output feature maps, the forward pass is parallelized over output feature maps, so that every output feature map is computed in parallel. The gradient with respect to the kernels is parallelized over kernels, and the gradient with respect to the inputs over input feature maps. Using this strategy, convolutional network architectures can be sped up almost linearly.
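As a rough illustration (a sketch only; the module, input size, iteration count and thread counts are our own choices, and torch.setNumThreads is the thread-control call named in the text above):

require 'nn'

conv = nn.SpatialConvolution(16, 32, 7, 7)   -- 16 input maps, 32 output maps, 7x7 kernels
input = torch.randn(16, 96, 96)

torch.setNumThreads(1)
timer = torch.Timer()
for i = 1, 10 do conv:forward(input) end
print('1 thread:  ' .. timer:time().real .. ' s')

torch.setNumThreads(4)
timer = torch.Timer()
for i = 1, 10 do conv:forward(input) end
print('4 threads: ' .. timer:time().real .. ' s')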
21.2.7 CUDA Support
CUDA (Compute Unified Device Architecture) is nVidia’s framework for pro-
gramming their graphics processors to perform general purpose computations.
CUDA exposes the hierarchy of memories available to the graphics processor,
the two main ones being the external (large, high-latency) DRAM and the inter-
nal shared memory (a couple of kB, low-latency). It also exposes the hierarchy
of compute cores, and how they interact with each other, and with the different
types of memory.
Contrary to common belief, we found that writing CUDA code (kernels) can
be significantly simplified. It is very easy to obtain decent performance, and the
simplest kernels already yield satisfying speedups over regular C. Only three things need to be known and carefully handled: the interaction between shared memory and threads; memory coalescing, to maximize bandwidth to/from external DRAM; and the hierarchy of processing units, to efficiently divide the workload between blocks and threads. Once understood,
these concepts were sufficient to allow us to write our own 2D convolutions, which
are computed at about 200GFLOP/s on a GTX580, for large enough inputs.
For smaller inputs, our OpenMP+SSE implementation remains more efficient.
It is worth mentioning that Torch7 employs an efficient, yet general method
for implementing a wide variety of CUDA kernels. As shown in the upcoming
sections, this strategy results in the best performance in most cases. However, it is
also possible to achieve superior performance by developing CUDA kernels under
specific assumptions, like particular input or operator shape and sizes. Despite
the performance advantage, these cases generally require significant development
effort and produce modules that cannot be reused; thus they are not suitable
for a general machine learning library.
Once built with CUDA, Torch7 provides a new Tensor type: torch.CudaTensor. Tensors of this particular type live in the GPU's DRAM memory. All operators defined on standard Tensors are also defined on CudaTensors, which completely abstracts the use of the graphics processor. Here is a small illustrative example that demonstrates the simplicity of the interface:
-- lives in the CPU's DRAM
tf = torch.FloatTensor(4,100,100)

-- lives in the GPU's DRAM
tc = tf:cuda()

-- performed by the GPU
tc:mul(3)

-- res lives in the CPU's DRAM
res = tc:float()
On top of the Tensors’ main operators, all the matrix-based operators are avail-
able, as well as most standard convolution routines.
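The same abstraction extends to the nn package through the cunn extension; a minimal sketch (the small two-layer model below is purely illustrative):

require 'cunn'

-- build a small network and move it, together with its input, to the GPU
model = nn.Sequential()
model:add( nn.Linear(100,1000) )
model:add( nn.Tanh() )
model:add( nn.Linear(1000,10) )
model:cuda()

input = torch.randn(100):cuda()
output = model:forward(input)   -- forward pass executed on the GPU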
21.2.8 Benchmarks
In this section we analyze the efficiency of Torch7 in two different setups: first in
the framework of a matrix-matrix multiplication benchmark, then when training
various neural networks. To that effect, we compare Torch7 with Numpy and Theano. We chose Numpy as a reference because it is a widely used numerical library for Python, the latter being itself a widely used scripting language. Theano [1] is a recent compiler for mathematical expressions, built upon Python and Numpy, which has been shown to outperform many neural network implementations; this makes it a very relevant baseline. In our experiments we used the latest version of each package, that is, Theano 0.5, Numpy 1.6.1 and Scipy 0.10.1.
Measuring the Overhead of Interpreted Languages. The majority of the
computation for neural networks and many numerical algorithms is spent in
BLAS calls performing linear algebra operations. To that end, we demonstrate the efficiency of Torch7 and of the underlying C library TH. The Torch7 numerical routines follow a simple layered design. The first layer is an efficient C library that provides a high-level tensor package (TH). The TH library has a templated design that enables the choice of different precisions. Available types are Byte (unsigned char), Char (char), Short
Fig. 21.4. Benchmarks of matrix multiplication performance using the TH C library directly, the Torch7 torch package, the nn package in Torch7, Numpy and Theano. Tests were conducted on a machine with two Intel Xeon X5690 CPUs with 6 computational cores each. Hyper-threading was disabled. We considered multi-thread computation using 1, 2, 4, 8 and 12 CPU cores. Performance is given as the processing time in seconds, therefore smaller is better.
(16 bit integer), Integer (32 bit integer), Long (64 bit integer), Float (32 bit floating point) and Double (64 bit floating point). The TH library has no dependencies on Lua or on any language other than standard C, and is therefore also suitable for use by other projects that need efficient numerical routines. The choice of C was, like the choice of Lua, deliberate: since TH uses only C, it can be compiled in almost any environment, including cellphones, DSPs and embedded systems. TH provides an interface to many BLAS operations, but also contains hand-coded implementations of all functions in case no BLAS library is available. It also provides an interface to several of the most widely used LAPACK routines for linear algebra. The second layer on top of TH is the torch package, which integrates TH into Lua: all of the TH mathematical operations are exposed to Lua through the torch package. Finally, the nn package uses the torch package to provide a modular, yet fast and efficient, neural network library.
One might argue that such a layered approach would introduce quite a bit of
overhead. In order to quantify the overhead coming from each layer, we selected matrix-matrix multiplication as our test case, since it is one of the most widely used operations in linear algebra, and ran tests using different sizes of matrices and different layers of the software stack. We used 100×100 and 1000×1000 matrices, and benchmarked a C-only program that directly uses the TH library, the torch package, and a linear layer (with no bias) from the nn package in Torch7. We also included tests using the Numpy package and finally Theano. We compiled all packages against the Intel MKL library in order to achieve the best possible performance and maximize the advantages of CPU threading. As can be seen from the results given in Figure 21.4, the overhead coming from the TH, torch or nn layers is minimal, even for small matrices. Even though Python incurs a bit more overhead, for larger matrices the overhead is minimal in all configurations.
Comparing Machine Learning Packages. In a recent paper [1], the authors
introduced a new compiler for mathematical expressions, built upon Python and
Numpy. As with Torch7, Theano is (at this time) mainly used in a neural network framework. Theano can be run either on a CPU or on a GPU. The authors of Theano published benchmarks (involving the training of various neural network architectures) comparing Theano running on a GPU with alternative implementations, including Torch5, Matlab with GPUmat (running on a GPU) and EBLearn (http://www.eblearn.sf.net). Below, we reproduce these exact benchmarks, limiting ourselves to Torch7 versus Theano, as Theano already appears to be faster than the other existing implementations.
Table 21.1. Convolutional Network Architectures used in the benchmark study
                    32×32     96×96     256×256    #F.Maps
1.c Convolution     5×5       7×7       7×7        6
1.p Max-pooling     2×2       3×3       5×5        6
2.c Convolution     5×5       7×7       7×7        16
2.p Max-pooling     2×2       3×3       4×4        16
3.l Linear          120 output features
4.o Linear          10 output features
For a fair comparison, we compiled both Numpy and SciPy (on which Theano relies) and Torch7 against the Intel MKL library. The latest versions of Theano also support linking directly against MKL for certain operations (without going through Numpy), which we set up carefully. We ran the experiments on an Intel Xeon X5690 with 12 cores. We optionally used an nVidia Tesla M2090 GPU. Following the benchmark suite of [1], we considered the training of three kinds of multi-layer perceptrons, all with 784 inputs, 10 classes and a cross-entropy cost: (1) no hidden layer, (2) one hidden layer of size 500, and (3) three hidden layers of size 1000. We also considered the training of three kinds of convolutional neural networks (as shown in Table 21.1) on 32×32, 96×96 and 256×256 input images, following exactly the architectures given in [1]. The optimization algorithms we used were pure stochastic gradient descent (SGD) and SGD with a mini-batch of 60 examples. We compare all architectures running on a single CPU core, over multiple cores using OpenMP, or on the GPU. Note that Theano does not support OpenMP. However, it still gets a speedup on the multi-layer perceptron benchmarks, since the Intel MKL library (called through Numpy) supports multiple threads using OpenMP.
As shown in Figure 21.5, Torch7 is faster than Theano on most benchmarks.
Interestingly, Theano underperforms for small architectures using pure SGD
training (left column in Figure 21.5), which might be explained by a Python
overhead, as mentioned in the previous section. Another interesting observation is
the surprising performance of OpenMP implementations compared to the GPU
implementation. As can be seen from the graphs, only the largest network architectures benefit from using the GPU. It is also worth mentioning that, for the CNN with 32×32 inputs using batch training, Theano's GPU implementation is superior to Torch7's. Under certain conditions, GPU optimizations can pay off by providing significant speed-ups; however, they also require significant development effort while covering only a small input domain. For the CNN experiments a second Torch7 benchmark, TorchMM, is included. In this case, matrix-matrix product operations are used to perform the convolutions, as explained in Section 21.2.5. It can be seen that this implementation significantly outperforms the other models from Theano and Torch7, including the GPU implementations.
21.3 Efficient Optimization Heuristics
As pointed out in Chapter 18, the size of datasets has grown faster than the speed of processors over the last few years. When estimating the parameters of a neural network, it is thus crucial to use an optimization procedure that can scale accordingly. Recently, research on optimization methods for neural networks has become an important topic [2, 4, 8, 7]. Torch7 provides a flexible framework designed to make it easy to develop optimization algorithms for neural networks.
Let us consider the case of supervised learning, where one has a training set of $N$ examples $(x_n, y_n)$, with $x_n$ an observed input vector and $y_n$ an output target vector that we wish to predict. We consider a loss function $l(\hat{y}_n, y_n)$ that measures the cost of predicting $\hat{y}_n$ when the actual answer is $y_n$. We also consider a predictor $f_w(x_n)$, with trainable parameters $w$. The task of learning can be defined as finding the vector $w$ that minimizes the loss function $L$ over the entire training set:

L(w) = \frac{1}{N} \sum_{n=1}^{N} l(f_w(x_n), y_n),   (21.1)

w^\star = \operatorname*{argmin}_w L(w).   (21.2)
This general form of loss minimization can be easily carried out using one of a
variety of optimization methods like Conjugate Gradient Descent (CG), BFGS
or Limited Memory BFGS, Levenberg-Marquardt methods or simple SGD. In
Torch7, these methods can be applied simply, using one unifying idea: decoupling the form of the function $f_w$ from the optimization procedure. By grouping all the trainable parameters into a single parameter vector, and by using a gradient vector of the same size, the type and shape of the neural network are completely abstracted from the developer. Combined with the powerful closure mechanism of Lua, one can develop optimization algorithms
Fig. 21.5. Benchmarks of Torch7 versus Theano, while training various neural network architectures with the SGD algorithm. Tests were conducted on a machine with two Intel Xeon X5690 CPUs and an Nvidia M2090 GPU. We considered multi-thread computation using 1 to 12 CPU cores with OpenMP, and GPU computation with the Nvidia CUDA interface. Performance is given in number of examples processed per second (higher is better). “batch” means 60 examples at a time were fed when training with SGD. TorchMM uses the convolutional neural network layer implementation introduced in Section 21.2.5.
for the most complicated neural network models as easily as for the simplest ones. The
following code shows how this is done:
-- create an arbitrary model:
model = nn.Sequential()
model:add( nn.Linear(100,1000) )
model:add( nn.Tanh() )
model:add( nn.Linear(1000,10) )

-- and a loss function:
loss = nn.MSECriterion()

-- extract the parameters, and the gradient holder
w,dloss_dw = model:getParameters()

-- w and dloss_dw are two vectors of the same size
Once the trainable parameter vector has been extracted, arbitrary external optimization procedures can be used. Torch7 provides a few standard methods (LBFGS, CG, SGD, ASGD), which simply require: (1) a function that computes the loss $L(w)$ and the gradient $\frac{dL}{dw}$, and (2) the parameter vector $w$ and the gradient vector $\frac{dL}{dw}$. Of course, $L(w)$ can be either the true loss or any approximation of it. The user-defined function is responsible for sampling from the training dataset and for estimating these approximations.
With these two concepts in mind, one can easily define a loop over a training dataset, defining at each iteration a closure that computes $L(w)$ and $\frac{dL}{dw}$. The following listing shows an example of such a loop, assuming a pre-shuffled training dataset in which each entry is a pair $(x_n, y_n)$:
-- assuming a training dataset 'trainset', and the model
-- defined above: 'model', 'w' and 'dloss_dw':
for e = 1,nepochs do
  for i,sample in ipairs(trainset) do
    -- next training pair:
    x_n = sample[1]
    y_n = sample[2]

    -- create a closure that estimates y_n_hat = f_w(x_n),
    -- stochastically
    feval = function()
      -- estimate loss:
      y_n_hat = model:forward(x_n)
      f = loss:forward(y_n_hat, y_n)

      -- estimate gradients:
      dloss_dw:zero()
      dloss_dy_n_hat = loss:backward(y_n_hat, y_n)
      model:backward(x_n, dloss_dy_n_hat)

      -- return loss, and gradients
      return f,dloss_dw
    end

    -- now that the closure is defined, pass it to an
    -- optimization algorithm:
    w,fs = optim.sgd(feval,w)

    -- + the new w is returned, but as computations are
    --   done in place, it is typically not necessary to
    --   store it (the old w contains the new value)
    -- + fs is a list of all the function (loss)
    --   evaluations that were done during optimization.
    --   SGD only returns one value, as it does not
    --   perform any line search.
  end
end
In the listing above, one can see that the loss and gradient estimation can be easily changed at runtime, and estimated over arbitrary batch sizes. To use a batch size different from 1 (as done above), one simply needs to create a list of training pairs, and the feval function then needs to loop over these training pairs to estimate the approximate loss and gradients.
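As a minimal sketch of this idea (assuming the model, loss, w and dloss_dw defined above, and a hypothetical list batch of training pairs; this is our own illustration, not part of the original listing):

-- mini-batch closure: 'batch' is a list of {x_n, y_n} pairs
feval_batch = function()
  local f = 0
  dloss_dw:zero()
  for _,pair in ipairs(batch) do
    local x_n, y_n = pair[1], pair[2]
    local y_n_hat = model:forward(x_n)
    f = f + loss:forward(y_n_hat, y_n)
    local dloss_dy = loss:backward(y_n_hat, y_n)
    model:backward(x_n, dloss_dy)      -- gradients accumulate into dloss_dw
  end
  -- return the average loss and gradient over the mini-batch
  return f/#batch, dloss_dw:div(#batch)
end

w,fs = optim.sgd(feval_batch, w)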
21.4 Conclusion
Compared to the early days of neural network training, the challenges in obtaining an efficient implementation have not changed much; the means, however, have changed slightly. Already in the late 80's, the SN [3] toolbox provided a scripting language (LISP) for building neural networks in a modular way. At that time, memory bandwidth and processor speed were of about the same order of magnitude. Nowadays, we have to pay much more attention to memory accesses: counting the number of instructions is no longer sufficient for optimizing code. Specific vectorized instructions can be easily integrated, but will not give order-of-magnitude speedups. In the end, what brings the most advantage is parallelization. As computers become more and more parallel, it becomes crucial to leverage parallelization frameworks such as OpenMP properly. On the more extreme side, GPUs (e.g., programmed with CUDA) are not as attractive as one might expect: GPU-specific implementations require heavy extra work for a speedup (see Figure 21.5) that can be quite disappointing compared to what one can get with a few extra lines of code using OpenMP.
Acknowledgments. Torch7 is the official successor to Torch3 (http://www.torch.ch/torch3) and Torch5 (http://torch5.sourceforge.net/). Over the years many distinguished machine learning researchers have contributed to Torch, including Léon Bottou (whom we thank in particular for providing the Qt library interface), Jason Weston, Iain Melvin, Samy Bengio and Johnny Mariéthoz.
We would also like to thank Yann LeCun and Léon Bottou for sharing their work on SN and Lush, and for extending their support and advice. Finally, we thank James Bergstra for making his benchmark code available.15
References
[1] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G.,
Turian, J., Bengio, Y.: Theano: a CPU and GPU math expression compiler. In:
Proceedings of the Python for Scientific Computing Conference, SciPy (2010)
[2] Bottou, L.: Large-scale machine learning with stochastic gradient descent. In:
Lechevallier, Y., Saporta, G. (eds.) Proceedings of the 19th International Confer-
ence on Computational Statistics (COMPSTAT 2010), Paris, France, pp. 177–187.
Springer (August 2010)
[3] Bottou, L., LeCun, Y.: SN: A simulator for connectionist models. In: Proceedings
of NeuroNimes 1988, Nimes, France (1988)
[4] Le, Q.V., Coates, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 265–272 (2011)
[5] LeCun, Y., Bottou, L.: Lush reference manual. Technical report (2002), code,
http://lush.sourceforge.net
[6] LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr, G.B.,
Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer, Heidel-
berg (1998)
[7] Martens, J.: Deep learning via hessian-free optimization. In: Proceedings of the
27th International Conference on Machine Learning (ICML), vol. 951 (2010)
[8] Vinyals, O., Povey, D.: Krylov subspace descent for deep learning. Arxiv preprint
arXiv:1111.4259 (2011)
15 http://www.github.com/jaberg/DeepLearningBenchmarks
Better Representations: Invariant, Disentangled
and Reusable
Preface
In many cases, the amount of labeled data is limited and does not allow for fully
identifying the function that needs to be learned. When labeled data is scarce,
the learning algorithm is exposed to simultaneous underfitting and overfitting.
The learning algorithm starts to “invent” nonexistent regularities (overfitting)
while at the same time not being able to model the true ones (underfitting). In
the extreme case, this amounts to perfectly memorizing training data and not
being able to generalize at all to new data.
The following five chapters present various tricks to address the underfitting/overfitting problem. These include approaches to force invariance into the model in order to increase the signal-to-noise ratio, pretraining methods to disentangle the factors of variation, and ways of using auxiliary tasks to learn a shared representation.
In image recognition, we know a priori that a good classifier should be translation
invariant, rotation invariant, scale invariant, etc. Convolutional neural networks
[7] are special networks composed of convolution and pooling layers that explicitly
implement the translation invariance at multiple scales. They can be trained using
standard backpropagation. One can also construct the convolutional network [9] one layer at a time in an unsupervised fashion. Chapter 22 [3] shows that, in this
context, k-means is a particularly efficient method to learn the convolution filters.
Alternatively, the desired invariance can be injected in the model with arti-
ficial training samples that are translated, rotated or distorted versions of the
original samples. The idea is presented in Chapter 23 [2]. A well-engineered set of
transformations can produce a very large number of artificial samples and con-
siderably improve the generalization of the neural network. Things can become
complicated as invariance is not always absolute: For example, the handwritten
digit “1” is rotation-invariant only up to a certain angle, beyond which it can be
confused with the handwritten digit “7”.
Both convolutional networks and generating artificial samples yield excellent
results on problems such as handwritten digit recognition and classification of
small images. Combining both techniques brings even higher performance.
Unfortunately, these methods are only applicable when invariance is known. In
the most general case, the invariance is unknown and other tricks are necessary
in order to improve the learned representation.
The representation can be improved by learning an unsupervised model as a
first step [5], using large amounts of unlabeled data. Unsupervised pretraining
aims to construct a network of disentangled factors of variation. As a second
step, the supervised learner only needs to choose, from the set of disentangled factors, those that best predict the task of interest. Unsupervised pretraining transfers the burden of complex nonlinear modeling to unlabeled data, which is generally available in much larger amounts. This two-step approach was shown
to significantly reduce the underfitting/overfitting problem [5, 4] and improve
handwritten digit and phoneme recognition performance.
A well-established algorithm for learning the unsupervised representation is
the restricted Boltzmann machine. Chapter 24 [6] explains step by step how to
train it successfully and how to choose the multiple hyperparameters. Chapter
25 [8] shows how keeping the model centered throughout training facilitates
learning in the more sophisticated deep Boltzmann machine.
The series of five chapters concludes with the question of using auxiliary tasks to improve the solution learned by a neural network. As was shown in
Chapter 8 [1], this can be achieved by sharing an internal representation (and its
associated parameters) across multiple related tasks and training the resulting
multi-task network with backpropagation. A variant of multitask learning is
introduced in Chapter 26 [10] where the auxiliary task is not defined by a set
of labels but by a set of pairwise similarities between samples. This extension
considerably widens the domain of applicability of multi-task learning.
Grégoire & Klaus
References
[1] Caruana, R.: A Dozen Tricks with Multitask Learning. In: Orr, G.B., Müller,
K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 163–189. Springer, Heidelberg
(1998)
[2] Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep Big Mul-
tilayer Perceptrons for Digit Recognition. In: Montavon, G., Orr, G.B., Müller,
K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 581–598.
Springer, Heidelberg (2012)
[3] Coates, A., Ng, A.Y.: Learning Feature Representations with k-means. In: Mon-
tavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS,
vol. 7700, pp. 561–580. Springer, Heidelberg (2012)
[4] Erhan, D., Bengio, Y., Courville, A., Manzagol, P.-A., Vincent, P., Bengio, S.: Why
does unsupervised pre-training help deep learning? J. Machine Learning Res. 11,
625–660 (2010)
[5] Hinton, G.E., Osindero, S., Teh, Y.-W.: A fast learning algorithm for deep belief
nets. Neural Computation 18, 1527–1554 (2006)
[6] Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. In:
Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn.
LNCS, vol. 7700, pp. 437–478. Springer, Heidelberg (2012)
[7] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient based learning applied
to document recognition. IEEE 86(11), 2278–2324 (1998)
[8] Montavon, G., Müller, K.-R.: Deep Boltzmann Machines and the Centering Trick.
In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd
edn. LNCS, vol. 7700, pp. 621–637. Springer, Heidelberg (2012)
[9] Serre, T., Wolf, L., Poggio, T.: Object recognition with features inspired by visual
cortex. In: Computer Vision and Pattern Recognition Conference, pp. 994–1000.
IEEE Press (2005)
[10] Weston, J., Ratle, F., Collobert, R.: Deep Learning via Semi-Supervised Embed-
ding. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 639–655. Springer, Heidelberg (2012)
22
Learning Feature Representations with K-Means
Adam Coates and Andrew Y. Ng
Stanford University, Stanford CA 94306, USA
Abstract. Many algorithms are available to learn deep hierarchies of
features from unlabeled data, especially images. In many cases, these
algorithms involve multi-layered networks of features (e.g., neural net-
works) that are sometimes tricky to train and tune and are difficult to
scale up to many machines effectively. Recently, it has been found that
K-means clustering can be used as a fast alternative training method.
The main advantage of this approach is that it is very fast and easily
implemented at large scale. On the other hand, employing this method
in practice is not completely trivial: K-means has several limitations, and
care must be taken to combine the right ingredients to get the system
to work well. This chapter will summarize recent results and technical
tricks that are needed to make effective use of K-means clustering for
learning large-scale representations of images. We will also connect these
results to other well-known algorithms to make clear when K-means can
be most useful and convey intuitions about its behavior that are useful
for debugging and engineering new systems.
22.1 Introduction
A major goal in machine learning is to learn deep hierarchies of features for other
tasks. For instance, given a set of unlabeled images, many current algorithms
seek to greedily learn successive layers of features that will make subsequent
classification tasks (e.g., object recognition) easier to accomplish. A typical ap-
proach taken in the literature is to use an unsupervised learning algorithm to
train a model of the unlabeled data and then use the results to extract interest-
ing features from the data [35, 21, 31]. Depending on the choice of unsupervised
learning scheme, it is sometimes difficult to make these systems work well. There
can be many hyper-parameters and not much intuition for how to tune them.
More recently, we have found that using K-means clustering as the unsupervised
learning module in these types of “feature learning” pipelines can lead to excel-
lent results, often rivaling state-of-the-art systems [11]. In this chapter, we will
review some of this work with added notes on useful tricks and observations that
are helpful for building large-scale feature learning systems.
K-means has already been identified as a successful method to learn fea-
tures from images by computer vision researchers. The popular “bag of features”
model [13, 28] from the computer vision community is very similar to the pipeline
that we will use in this chapter, and many conclusions here are similar to those
identified by vision researchers [18, 1]. In this chapter, however, we will focus on
the ingredients needed to make K-means work well in a setting similar to that
used by other deep learning and feature learning systems: learning directly from
raw inputs (pixel intensities) and building multi-layered hierarchies, as well as
connecting K-means to other well-known feature learning systems.
The classic K-means clustering algorithm finds cluster centroids that minimize
the distance between data points and the nearest centroid. Also called “vector
quantization”, K-means can be viewed as a way of constructing a “dictionary”
$D \in \mathbb{R}^{n \times k}$ of $k$ vectors so that a data vector $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, m$, can be mapped to a code vector that minimizes the error in reconstruction. In this chapter, we will use a modified version of K-means (sometimes called “gain shape” vector quantization [41], or “spherical K-means” [14]) that finds $D$ according to:

\begin{align*}
\underset{D,\,s}{\text{minimize}} \quad & \sum_i \| D s^{(i)} - x^{(i)} \|_2^2 \\
\text{subject to} \quad & \| s^{(i)} \|_0 \le 1, \ \forall i \\
\text{and} \quad & \| D^{(j)} \|_2 = 1, \ \forall j
\end{align*}
where $s^{(i)}$ is a “code vector” associated with the input $x^{(i)}$, and $D^{(j)}$ is the $j$'th column of the dictionary $D$. The goal here is to find a dictionary $D$ and a new representation $s^{(i)}$ of each example $x^{(i)}$ that satisfies several criteria. First, given $s^{(i)}$ and $D$, we should be able to reconstruct the original $x^{(i)}$ well; in particular, we aim to minimize the squared difference between $x^{(i)}$ and its corresponding reconstruction $D s^{(i)}$. This goal is optimized under two constraints. The first constraint, $\| s^{(i)} \|_0 \le 1$, means that each $s^{(i)}$ is constrained to have at most one non-zero entry. Thus we are searching not only for a new representation of $x^{(i)}$ that preserves it as well as possible, but also for a very simple or parsimonious representation. The second constraint requires that each dictionary column have unit length, preventing the columns from becoming arbitrarily large or small. Otherwise we could arbitrarily rescale $D^{(j)}$ and the corresponding $s^{(i)}_j$ without effect.
This algorithm is very similar in spirit to other algorithms for learning efficient coding schemes, such as sparse coding [34, 17]:

\begin{align*}
\underset{D,\,s}{\text{minimize}} \quad & \sum_i \| D s^{(i)} - x^{(i)} \|_2^2 + \lambda \| s^{(i)} \|_1 \\
\text{subject to} \quad & \| D^{(j)} \|_2 = 1, \ \forall j.
\end{align*}
Sparse coding optimizes the same type of reconstruction objective, but constrains the complexity of $s^{(i)}$ by adding a penalty $\lambda \| s^{(i)} \|_1$ that encourages $s^{(i)}$ to be sparse. This is similar to the constraint used by K-means ($\| s^{(i)} \|_0 \le 1$), but allows more than one non-zero entry in each $s^{(i)}$, enabling a much more accurate representation of each $x^{(i)}$ while still requiring each $s^{(i)}$ to be simple.
From their descriptions above, it is no surprise that K-means and more sophisticated dictionary-learning schemes like sparse coding are often interchangeable: they differ in their optimization objectives, but produce code vectors $s^{(i)}$ and dictionaries $D$ that accomplish similar goals. Empirically though, sparse coding appears to be a better performer in many applications. For instance, replacing K-means with sparse coding in the classic bag-of-features model has been shown to significantly improve image recognition results [39]. Despite its simplicity, however, K-means is still a very useful algorithm for learning features due to its speed and scalability. Sparse coding requires us to solve a convex optimization problem [36, 15, 32] for every $s^{(i)}$ repeatedly during the learning procedure and is thus very expensive to deploy at large scale. For K-means, by contrast, the optimal $s^{(i)}$ used in the algorithm above is simply:

s^{(i)}_j = \begin{cases} D^{(j)\top} x^{(i)} & \text{if } j = \arg\max_l |D^{(l)\top} x^{(i)}| \\ 0 & \text{otherwise} \end{cases} \qquad (22.1)
Because this can be done very quickly (and solving for D given s is also easy), we
can train very large dictionaries rapidly by alternating optimization of D and s.
As well, K-means does not have any parameters requiring tuning other than k,
the number of centroids, making it relatively easy to get working. The surprise
is that large dictionaries learned by K-means often work very well in practice
provided we mix in a few other ingredients that are less commonly documented
in other works. This chapter is about what these ingredients are as well as some
intuition about why they are needed and how they affect results. For most of
this work, we will use images (or image patches) as input data to the algorithm,
but the basic principles are applicable to other types of data as well.
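To make the alternating optimization concrete, here is a minimal, illustrative sketch of one iteration written in Torch7-style Lua (the function name and data layout are our own choices, not the authors' implementation; X holds one whitened data point per column, D one unit-length centroid per column):

-- one iteration of spherical K-means: code update of Eq. (22.1),
-- then dictionary update and column normalization
function spherical_kmeans_iteration(X, D)
   local m = X:size(2)
   local k = D:size(2)
   -- projections of every data point onto every centroid (k-by-m)
   local P = torch.mm(D:t(), X)
   -- Eq. (22.1): keep only the largest-|projection| entry for each data point
   local _, idx = torch.max(torch.abs(P), 1)
   local S = torch.zeros(k, m)
   for i = 1, m do
      local j = idx[1][i]
      S[j][i] = P[j][i]
   end
   -- dictionary update and renormalization; adding the previous centroids
   -- damps the update (see Section 22.2.2) and keeps empty clusters unchanged
   local Dnew = torch.mm(X, S:t()):add(D)
   for j = 1, k do
      local col = Dnew:select(2, j)
      col:div(col:norm())
   end
   return Dnew, S
end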
22.2 Data, Pre-processing and Initialization
We will begin with a dataset composed of small image patches. For concreteness,
we will work with 16-by-16 pixel grayscale patches represented as a vector of 256 pixel intensities (i.e., $x^{(i)} \in \mathbb{R}^{256}$), but color patches can also be used similarly. These patches can be collected from unlabeled imagery by cropping out random 16-by-16 chunks. In order to build a “complete” dictionary (i.e., a dictionary with at least 256 centroids), we should ensure that there will be enough patches so that each cluster can claim a reasonable number of inputs. For 16-by-16 gray patches, m = 100,000 patches is enough. In practice, we will often need more data to train a K-means dictionary than is necessary for other algorithms (e.g., sparse coding), since each data point contributes to just one centroid in the end. Usually the added expense is easily offset by the speed of training. For notational convenience, we will assume that our data points are packed into the columns of a matrix $X \in \mathbb{R}^{n \times m}$. (Similarly, we will denote by $S$ the matrix whose columns are the code vectors $s^{(i)}$ from Eq. (22.1).)
22.2.1 Pre-processing
Before running a learning algorithm on our input data points $x^{(i)}$, it is useful to normalize the brightness and contrast of the patches. That is, for each $x^{(i)}$ we subtract out the mean of the intensities and divide by the standard deviation. A small value is added to the variance before division to avoid division by zero and also to suppress noise. For pixel intensities in the range [0, 255], adding 10 to the variance is often a good starting point:

x^{(i)} = \frac{\tilde{x}^{(i)} - \mathrm{mean}(\tilde{x}^{(i)})}{\sqrt{\mathrm{var}(\tilde{x}^{(i)}) + 10}}

where $\tilde{x}^{(i)}$ are the unnormalized patches, and “mean” and “var” are the mean and variance of the elements of $\tilde{x}^{(i)}$.
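For concreteness, a minimal sketch of this normalization in Torch7-style Lua (the function name and the one-patch-per-column layout of X are our own choices):

-- normalize brightness and contrast of patches stored as the columns of X
function normalize_patches(X)
   for i = 1, X:size(2) do
      local x = X:select(2, i)          -- i-th patch (a view, modified in place)
      x:add(-x:mean())                  -- subtract the mean intensity
      x:div(math.sqrt(x:var() + 10))    -- divide by std; +10 suppresses noise
   end
   return X
end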
Fig. 22.1. (a) Centroids learned by K-means from natural images without whiten-
ing. (b) A cartoon depicting the effect of whitening on the K-means solution. Left:
unwhitened data, where the centroids tend to be biased by the correlated data.
Right: whitened data, where centroids are more orthogonal. (c) Centroids learned from
whitened image patches.
After normalization, we can try to run K-means on the new input patches. The
centroids that are obtained (i.e., the columns of the dictionary D) are visualized
as patches in Figure 22.1a. It can be seen that K-means tends to learn low-
frequency edge-like centroids. This result has been reproduced many times in
the past [16, 37, 2]. Unfortunately, it turns out that these centroids tend to
work poorly in recognition tasks [11]. One explanation for this result is that the
correlations between nearby pixels (i.e., low-frequency variations in the images)
tend to be very strong even after brightness and contrast normalization. In the
presence of these correlations, K-means tends to generate many highly correlated
centroids rather than spreading the centroids out to span the data more evenly.
A cartoon depicting this problem is shown on the left of Figure 22.1b. To remedy
this situation, one should use whitening (also called “sphering”) to rescale the
input data to remove these correlations [22]. This tends to cause K-means to
allocate more centroids in the orthogonal directions, as shown on the right of
Figure 22.1b.
A simple choice of whitening transform is the ZCA whitening transform. If $V D V^\top = \mathrm{cov}(x)$ is the eigenvalue decomposition of the covariance of the data points $x$, then the whitened points are computed as $V (D + \epsilon_{\mathrm{zca}} I)^{-1/2} V^\top x$, where $\epsilon_{\mathrm{zca}}$ is a small constant. For contrast-normalized data, setting $\epsilon_{\mathrm{zca}}$ to 0.01 for 16-by-16 pixel patches, or 0.1 for 8-by-8 pixel patches, is a good starting point.
Note that setting this number too small can cause high-frequency noise to be
amplified and make learning more difficult. Since rotating the data does not alter
the behavior of K-means, one can also use other whitening transforms such as
PCA whitening (which differ from ZCA only by a rotation).
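A minimal sketch of ZCA whitening in the same Torch7-style Lua (again our own illustration; eps plays the role of the ε_zca constant above, and X stores one normalized patch per column):

-- ZCA-whiten patches stored as the columns of X (n-by-m)
function zca_whiten(X, eps)
   local m = X:size(2)
   local mu = X:mean(2)                           -- per-dimension mean (n-by-1)
   local Xc = X:clone():add(-1, mu:expandAs(X))   -- centered data
   local C = torch.mm(Xc, Xc:t()):div(m)          -- covariance estimate
   local e, V = torch.symeig(C, 'V')              -- cov = V * diag(e) * V^T
   local W = V * torch.diag(e:add(eps):pow(-0.5)) * V:t()
   return torch.mm(W, Xc)
end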
Running K-means on whitened image patches yields sharper edge features, similar to those discovered by sparse coding, ICA, and others, as seen in Figure 22.1c. This procedure of normalization, whitening, and K-means clustering is an effective “off the shelf” unsupervised learning module that can serve in many feature-learning roles. From this point forward, we will assume that whenever we apply K-means to new data, the data are normalized and whitened as described here. But keep in mind that the proper choices of the normalization and whitening parameters can sometimes require adjustment for new data sources. Though these are likely best set by cross validation, they can often be tuned visually (e.g., to yield image patches with high contrast, not too much noise, and not too much low-frequency undulation).
22.2.2 Initialization
The usual K-means clustering algorithm is known to require a number of small
tweaks to avoid common problems like empty clusters. One important consid-
eration is the initialization of the centroids. Though it is common to initialize
K-means to randomly-chosen examples drawn from the data, this has been found
to be a poor choice. It is possible that images tend to group too densely in some
areas, and thus initializing K-means with randomly chosen patches leads to a
large number of centroids starting close together. Many of these centroids ul-
timately end up becoming near-empty clusters, as a single cluster accumulates
all of the data points located within a dense area. Instead, it is better to ran-
domly initialize the centroids from a Normal distribution and then normalize
them to unit length. Note that because of the whitening stage, we expect that
the important components of our data have already been rescaled to a more or
less spherical distribution, so initializing to random vectors on a sphere is not a
terrible starting point.
Other well-known heuristics for improving the behavior of K-means can be
useful. For instance, heuristics for reinitializing empty clusters are commonly
used in other implementations. In practice, the initialization scheme above works
relatively well for image data. When empty clusters do occur, reinitializing the
centroids with random examples is usually sufficient, but this is rarely necessary.1 In fact, for a sufficiently scalable implementation, we can often train many centroids and simply throw away clusters that have too few data points.
1 Often, a large number of empty clusters indicates that the whitening or normalization parameters are improperly tuned, or that the data is too high-dimensional for K-means to be successful.
Another minor tweak that improves behavior is to use damped updates of the
centroids. Specifically, at each iteration we compute new centroids according to:
\begin{align*}
D^{\mathrm{new}} &:= \arg\min_D \|DS - X\|_2^2 + \|D - D^{\mathrm{old}}\|_2^2 \\
                 &= (XS^\top + D^{\mathrm{old}})(SS^\top + I)^{-1} \;\propto\; XS^\top + D^{\mathrm{old}} \quad \text{(column-wise, since } SS^\top \text{ is diagonal)}, \\
D^{\mathrm{new}} &:= \mathrm{normalize}(D^{\mathrm{new}}).
\end{align*}

Note that this form of damping does not affect “big” clusters very much (the $j$'th column of $XS^\top$ will be large compared to $D^{(j)}_{\mathrm{old}}$) and only serves to prevent small clusters from being pulled too far in a single iteration.
Including the initialization and pre-processing, the full K-means training routine presented above is summarized here:

1. Normalize inputs:
   $x^{(i)} := \dfrac{x^{(i)} - \mathrm{mean}(x^{(i)})}{\sqrt{\mathrm{var}(x^{(i)}) + \epsilon_{\mathrm{norm}}}}, \ \forall i$

2. Whiten inputs:
   $[V, D] := \mathrm{eig}(\mathrm{cov}(x))$  (so that $V D V^\top = \mathrm{cov}(x)$)
   $x^{(i)} := V (D + \epsilon_{\mathrm{zca}} I)^{-1/2} V^\top x^{(i)}, \ \forall i$

3. Loop until convergence (typically 10 iterations is enough):
   $s^{(i)}_j := \begin{cases} D^{(j)\top} x^{(i)} & \text{if } j = \arg\max_l |D^{(l)\top} x^{(i)}| \\ 0 & \text{otherwise} \end{cases} \quad \forall j, i$
   $D := X S^\top + D$
   $D^{(j)} := D^{(j)} / \|D^{(j)}\|_2, \ \forall j$
22.3 Comparison to Sparse Feature Learning
As was shown above, K-means learns oriented edge-like features when applied to
natural images, much like ICA [23] or sparse coding [34]. An important practical
question is whether this is accidental (e.g., because edges are so common that
learning “exemplars” from images is enough to find them) or whether this implies
that K-means can perform a type of sparse decomposition similar to ICA. When
we attempt to apply K-means to other types of data such as audio or features
Fig. 22.2. The result of running spherical K-means on points sampled from a heavy-
tailed distribution. The K-means “centroids” tend to point in the directions of the
tails.
computed by lower layers in a deep network, it is important to understand to
what extent this clustering algorithm mimics well-known projection methods like
ICA and what the limitations are. It is clear that because each s
(i)
is allowed to
contain only a single non-zero entry, K-means tries to learn centroids that single-
handedly explain an entire input image. It is thus not guaranteed that the learned
centroids will always be like the filters produced by sparse coding or ICA. These
algorithms learn genuine “distributed representations where a single image can
be explained jointly by multiple columns of the dictionary instead of just one.
Nevertheless, empirically it turns out that K-means does tend to discover sparse
projections of the data under the right conditions. Because of this property, we
can often use the learned dictionary in a manner similar to the dictionaries or
filters learned by other algorithms that explicitly search for sparse, distributed
representations.
One intuition for why this tends to occur can be seen in a simple low-
dimensional example. Consider the case where our data is sampled from two
independent, heavy-tailed (sparse) random variables. After normalization, the
data will be most dense near the coordinate axes, and less dense in the quad-
rants between them. As a result, while K-means will tend to represent many
points very badly, training 2 centroids on this distribution will tend to yield a
dictionary of basis vectors pointing in the direction of the tails (i.e., in the sparse
directions). This result is illustrated in Figure 22.2.
If the data dimensionality is not too high (e.g., a hundred or so) this “tail-
seeking” phenomenon also shows up in more complicated scenarios. Figure 22.3
shows examples of three sets of images generated from sparse sources. On the
left (at top), are 16-by-16 images with pixels sampled from independent Laplace
distributions (sparse pixel intensities). In the middle (top) are images composed
of a sparse combination of gratings and at right (top) a sparse combination of
non-orthogonal gabor filters. The bottom row shows the result of learning 256
centroids with K-means from 500,000 examples drawn from each distribution. It
can be seen that K-means does, in fact, roughly recover the sparse projections we
Fig. 22.3. Top: Three different sparse distributions of images. Bottom: Filters (cen-
troids) identified by K-means with a complete (256-centroid) dictionary.
would expect. A similar experiment appears in [34] to demonstrate the source-
separation ability of sparse coding—yet K-means tends to recover the same filters
even though these filters are clearly not especially similar to individual images.
That is, K-means can potentially do more than merely recover “exemplar” images
from the input distribution.
When applied to natural images, it is evident that the learned centroids
$D^{(j)}$ (as in Figure 22.1c) are relatively sparse projections of the data. Fig-
ure 22.4a shows a histogram of responses resulting from projecting randomly
chosen (whitened) image patches onto 4 different filters. The 4 filters used are:
(i) a centroid learned by K-means, (ii) a basis vector trained via sparse coding,
(iii) a randomly selected image patch, and (iv) a randomly initialized filter. In
the figure, it can be seen that using the K-means-trained centroid as a linear
filter gives us a very sparse projection of the data. Thus, it appears that relative
to other simple choices K-means does tend to seek out very sparse projections
of the data, even though its formulation as a clustering algorithm does not aim
to do this explicitly.
Despite this empirical similarity to ICA and sparse coding, K-means does have
a major drawback: it turns out that its ability to discover sparse directions in
the data depends heavily on the dimensionality of the input and the quantity of
data. In particular, as the dimensionality increases, we need increasingly large
quantities of data to get clean results. For instance, to obtain the results above
we had to use a very large number of examples. We can obtain similar results easily with sparse coding using just 10,000 examples. For very large patches, we must use even more. Figure 22.4b shows the results of running K-means on 500,000 64-by-64 patches: note that while we can capture a few edges, most of
the clusters are composed of a single patch from the dataset. At this scale, empty
or near-empty clusters become far more common and extremely large amounts of
data are needed to make K-means work well. This tradeoff is the main driver in
deciding when and how to employ K-means: depending on the dimensionality of
the input, a certain amount of data will be required (typically much more than is
needed for similar results from sparse coding). For modest dimensionalities (e.g.,
(a) (b)
Fig. 22.4. (a) A comparison of the distribution of linear filter responses when filters
obtained by 4 different methods are applied to natural image patches. K-means tends to
learn filters with very sparse response characteristics similar to sparse coding—much
more sparse than randomly chosen patches or randomly initialized filters. (b) “Cen-
troids” learned from 64-by-64 pixel patches. At this scale, K-means becomes difficult
to apply as many clusters become empty or singletons.
hundreds of inputs), this tradeoff can be advantageous because the additional
data requirements do not outweigh the very large constant-factor speedup that
is gained by training with K-means. For very high dimensionalities, however, it
may well be the case that another algorithm like sparse coding works better or
even faster.
22.4 Application to Image Recognition
The above discussion has provided the basic ingredients needed to turn K-means
into a simple feature learning method. Given a batch of unlabeled data, we can
now learn a dictionary D whose columns yield more or less sparse projections
of the data points. Just as with sparse coding, we can use the learned dictio-
nary (centroids) to define features for a supervised learning task [35]. A typical
pipeline used for image recognition applications (that is easy to use with learned
features) is based on the classic spatial pyramid model developed in the computer
vision literature [13, 28, 39, 11]. It is similar in many ways to a single-layered
convolutional neural network [29, 30]. The main components of this pipeline are:
(i) the unsupervised training algorithm (in this case, K-means), which generates
a bank of filters D, and (ii) a function f : R^n → R^k that maps an input image patch
x ∈ R^n to a feature vector y ∈ R^k given the dictionary of k filters. For instance,
we could choose f(x; D) = g(D^T x) for an element-wise nonlinear function g(·).
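To make the mapping concrete, here is a minimal NumPy sketch of such an encoder (our illustration, not code from the chapter), assuming D is an n-by-k array whose columns are the K-means centroids and g is only a placeholder rectifier:

import numpy as np

def encode(x, D, g=lambda u: np.maximum(u, 0.0)):
    """Map a (whitened) patch x in R^n to a feature vector y in R^k.
    D is n-by-k with one K-means centroid (filter) per column; g is an
    element-wise nonlinearity (a rectifier is used here only as a placeholder)."""
    return g(D.T @ x)

Better choices of g are discussed in Section 22.4.2.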
Fig. 22.5. A standard image recognition pipeline used in conjunction with K-means
dictionary learning

Using the learned feature extractor f(x; D), given any p-by-p pixel image
patch, we can now compute a representation y ∈ R^k for that patch. We can thus
define a (single layer) representation of the entire image by applying the function
f to many sub-patches. Specifically, given an image of w-by-w pixels, we define a
(w − p + 1)-by-(w − p + 1)-by-k array of features by computing the representation
y for every p-by-p “sub-patch” of the input image. For computational efficiency,
we may also step our p-by-p feature extractor across the image with some
step-size (or “stride”) greater than 1. This is illustrated in Figure 22.5.
Before classification, we typically reduce the dimensionality of the image rep-
resentation by pooling. For a step size of 1 pixel, our feature extractor produces
a (w − p + 1)-by-(w − p + 1)-by-k representation. We can reduce this by summing
(or applying some other reduction, e.g., max) over local regions of the feature
responses. Once we have the pooled responses, we could attempt to learn higher-
level features by applying K-means again (this is the approach pursued by [1]), or
just use all of the features directly with a standard linear classification algorithm
(e.g., SVM).
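As an illustration of this extract-then-pool pipeline (a sketch under our own conventions, not the chapter's code), the following assumes a square grayscale image, the hypothetical encode function from above, and average pooling over a 2-by-2 grid of regions:

import numpy as np

def extract_features(image, D, encode, p, stride=1):
    """Apply the per-patch encoder to every p-by-p sub-patch of a w-by-w image,
    giving an (n_rows, n_cols, k) array of feature vectors."""
    w = image.shape[0]
    offsets = range(0, w - p + 1, stride)
    return np.array([[encode(image[i:i + p, j:j + p].ravel(), D)
                      for j in offsets] for i in offsets])

def pool_2x2(feature_map):
    """Average-pool the feature responses over a 2-by-2 grid of equal regions
    and concatenate, reducing the map to a single vector of length 4*k."""
    h = feature_map.shape[0] // 2
    w = feature_map.shape[1] // 2
    quadrants = [feature_map[:h, :w], feature_map[:h, w:],
                 feature_map[h:, :w], feature_map[h:, w:]]
    return np.concatenate([q.mean(axis=(0, 1)) for q in quadrants])

The pooled vector (or a concatenation over a finer grid) is what would then be fed to the linear classifier.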
22.4.1 Parameters
The processing pipeline above has a large number of tunable parameters, such as
the patch size p, the choice of f (x; D), and the step size. It turns out that getting
these parameters set correctly can make a major difference in performance for
practical applications. In fact, these parameters often have a bigger influence
on classification performance than the training algorithm itself. When we are
unhappy with performance, searching for better choices of these parameters can
often be more beneficial than trying to modify our learning algorithm [11, 18].
Here we will briefly summarize current advice on how to choose these parameters.
First, when we use K-means to train the filter bank D, we noted previously
that the input dimensionality can significantly influence the data requirements
and success of training. Thus, in addition to other effects it may have on classi-
fication performance, it is important to choose the patch size p wisely. If p is too
large (e.g., 64 pixels, as in Figure 22.4b), then K-means may yield poor results.
Though this situation can be debugged visually for applications to image data,
it is much more difficult to know when K-means is doing well when it is applied
to other kinds of data such as when training a multi-layered network where the
higher-level features are hard to visualize. For this reason, it is recommended
that p be chosen by cross validation or it should be set so that the dimension-
ality of the data passed to K-means is kept small (typically not more than a
few hundred, depending on the amount of data used). For image patches, 6-by-
6 or 8-by-8 pixel (color or gray) patches work consistently well in the pipeline
outlined above.
Depending on the choice of pooling method, the step size and pooling regions
may need to be chosen differently. There is a significant body of work covering
these areas [6, 5, 4, 12, 24, 18]. In our own experience, for single layers of features,
average pooling over a few equally-sized regions (e.g., a 2-by-2 or 3-by-3 grid)
can work very well in practice and is a good “first try” for image recognition.
Finally, the number of features k learned by the algorithm has a significant
influence on the results. It has been observed several times [18, 11] that learn-
ing large numbers of features can substantially improve supervised classification
results. Indeed, it is frequently best to set k as large as compute resources will
allow, considering data requirements. Though performance typically asymptotes
as k becomes extremely large, increasing the size of k is a very effective way
to squeeze out a bit of extra performance from an already-built system. This
trend, combined with the fact that K-means is especially effective for building
very large dictionaries, is the main advantage of the system presented above.
22.4.2 Encoders
The choice of the feature “encoding” function f (x; D) also has a major impact on
recognition performance. For instance, consider the “hard assignment” encoding
where we take f(x; D) = s, with s the standard “one-hot” code used during
K-means training (Eq. (22.1)). It is well-known that this scheme performs very
poorly compared to other choices [18, 1]. Thus, once we have run K-means to
train our filters, one should certainly make an effort to choose a better encoder
f. There are many encoding schemes present in the literature [34, 38, 40, 42,
19, 20, 7] and, while they often include their own training algorithms, one can
choose to use K-means-trained dictionaries in conjunction with many of them.
Unfortunately, many encoding schemes proposed for computer vision tasks are
potentially very expensive to compute. Many require us to solve a difficult opti-
mization problem in order to compute f (x; D) [34, 40, 20, 7]. On the other hand,
some encoders are simple nonlinear functions of the filter responses D^T x that
can be computed extremely quickly. In previous work it has appeared that algo-
rithms like sparse coding are generally the best performers in benchmarks [39, 4].
However, in some cases we can manage to get away with much simpler encod-
ings. Specifically, when we use the single-layer architecture outlined above, it
turns out that algorithms like sparse coding and more sophisticated variants
(e.g., spike-slab sparse coding [20]) are difficult to top when we have very little
labeled data. But as can be seen in Figure 22.6, with much more labeled data
the disparity in performance between sparse coding and a very simple nonlinear
Fig. 22.6. A comparison of the performance between two encoders on the CIFAR-
10 [25] benchmark as a function of the number of labeled examples. When labeled data
is scarce, an expensive encoder can be worthwhile. If labeled data is plentiful, a fast,
simple encoder such as a soft-threshold nonlinearity is sufficient.
encoder decreases significantly. We have found, not surprisingly, that as we use
increasing quantities of labeled data the supervised learning stage takes over and
works equally well with most reasonable encoders.
As a result of this observation, application designers should consider the quan-
tity of available labeled data. If labeled data is abundant, a fast feed-forward
encoder works well (and is the easiest to use on large datasets). If labeled
data is scarce, however, it can be worthwhile to use a more expensive encoding
scheme. In the large-scale case we have found that soft-threshold nonlinearities
(f(x; D) = max{0, D^T x − α}, where α is a tunable constant) work very well.
By contrast, sigmoid nonlinearities (e.g., f(x; D, b) = (1 + exp(−D^T x + b))^(−1))
appear to work significantly less well [33, 12] in similar instances.
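For reference, the two encoders being compared can be sketched as follows (α and b are tunable constants; this is our illustration rather than the authors' implementation):

import numpy as np

def soft_threshold_encoder(x, D, alpha=0.5):
    # f(x; D) = max{0, D^T x - alpha}; alpha is a tunable constant
    return np.maximum(0.0, D.T @ x - alpha)

def sigmoid_encoder(x, D, b=0.0):
    # f(x; D, b) = (1 + exp(-D^T x + b))^(-1); reported to work less well here
    return 1.0 / (1.0 + np.exp(-(D.T @ x) + b))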
22.5 Local Receptive Fields and Multiple Layers
Several times we have referred to the difficulties involved in applying K-means to
high-dimensional data. In Section 22.4.1 it was explained that we should choose
the image patch size p (“receptive field” size) carefully to avoid exceeding the
modeling capacity of K-means (as in Figure 22.4b). If we have a very large image
it is generally not effective to apply K-means to the entire input at once.² Instead,
applying K-means to p-by-p pixel sub-patches is a reasonable substitute, since we
expect that most learned features will be localized to a small region. This “trick”
allows us to keep the input size to K-means relatively small (e.g., just p² input
dimensions for grayscale patches), but still use the resulting filters on a much
larger image by either reusing the filters for every p-by-p pixel sub-patch of the
image, or even by re-running K-means independently on each p-by-p region (if, for
some reason, the features present in other parts of the image differ significantly).
Though this approach is well known from computer vision applications, the same trick
works more generally and in some cases is indispensable for building working systems.

² Note that for very large inputs it becomes impractical to perform whitening, which
requires solving a very large optimization problem (e.g., eigenvalue decomposition).
In the authors’ experience K-means starts giving poor results before this
computational bottleneck is reached.
Fig. 22.7. (a) A dataset composed of concatenated pairs of independently sampled
image patches. (b) Centroids trained from pairs of image patches. Note that, as ex-
pected, K-means learns filters that only involve half of the input at a time. Due to
the increased dimensionality, significantly more data is needed compared to training
on each image separately. (c) A dataset composed of image-depth pairs. The left half
of each example is a whitened grayscale patch. The right half is a “depth image” [27]
where pixel intensity represents the distance of a surface from the camera. (d) Cen-
troids learned jointly from image-depth pairs learn only a few weak features that use
both modalities for similar reasons as in (b).
Consider a situation where our input is, in fact, a concatenation of two indepen-
dent signals. Concretely, let us take two image patches drawn at random from a
larger dataset and concatenate them side-by-side as in Figure 22.7a. When we run
K-means on this type of data we end up with the centroids in Figure 22.7b where
individual centroids tend to model just one of the two independent components,
much like we would expect from ICA. Unfortunately, as observed previously, to
achieve this result we actually need more data than if we had tried to model the
two patches separately. Hence, whenever we can determine a priori that our input
variables can be split into independent chunks, we should try to split them up im-
mediately and run K-means on each chunk separately. Note that the contrived ex-
ample here occurs in real applications, such as learning features from RGB-Depth
data [27]: Figure 22.7c shows examples of image intensity concatenated with depth
patches and Figure 22.7d shows centroids learned from them. Since at this scale
depth tends to be only weakly related to raw pixel intensity, it might be better to
run K-means separately on each modality of the data.
22.5.1 Deep Networks
In Section 22.4 we presented a simple pipeline that enabled us to extract a
single layer of features from an image by taking a dictionary learned from small
patches and using it to extract features over a larger image (see Figure 22.5).
We then used a pooling stage to reduce the number of features before applying a
supervised learning algorithm. It would be nice if we could learn higher layers of
features by taking the resulting single-layer representation and passing it back
through our feature learning pipeline. For instance, one simple way we might try
to accomplish this is to compute the pooled feature values for all of the examples
in our unlabeled dataset X to give us a new dataset Z, then apply the exact
same learning pipeline to Z to learn new features. This simple approach has
been applied in [1], but it turns out that this straight-forward strategy can hit
a major snag: the inputs Z used for the second layer of features will often be
very high dimensional if, as is common, we use very large dictionaries of features
(e.g., k = 10000 or more).
Concretely, let’s consider a simple example.³ Suppose that our goal is to learn
features for a 20-by-20 pixel image patch. With the approach of Section 22.4
we train k = 10000 16-by-16 pixel filters with K-means from 16-by-16 patches
cropped out of our dataset. We then take the learned dictionary D and extract
feature responses with a step size of 4 pixels f (x; D) from the 20-by-20 pixel
images, yielding a 2-by-2-by-10000 image representation. Finally, we sum up the
responses over each 2-by-2 region (i.e., all responses produced by each filter) to
yield a 1-by-1-by-10000 “pooled representation” which we will take as Z. We can
think of each feature value z_j as being a slightly translation-invariant version
of the feature detector associated with the filter D^(j). Note that each vector in
Z now has 10000 dimensions to represent the original 400-dimensional patch.
At this scale, even learning a “complete” representation of 10000 features from
the 10000-dimensional inputs Z becomes challenging. Regrettably, there is no
obvious choice of local receptive field that can be made here: the 10000 features
are unorganized and we have no way to split them up by hand.
One proposed solution to this problem is to use a simple form of pair-wise
“dependency test” to help identify groups of dependent input variables in an
automated way. If we can do this, then we can break up the input vector coordi-
nates into small groups suitable for input to K-means instead of picking groups
by hand. This tool is most valuable for building multiple layers of features with
K-means.
As an example, we can use a type of dependency called “energy correlation”.
Given two whitened inputs (i.e., two inputs z_j and z_k that have no linear
correlation), their energy correlation is just the correlation between their
squared responses. In particular, if we have E[z] = 0 and E[zz^T] = I, then we
will define the dependency between inputs z_j and z_k as:

d[z_j, z_k] = corr(z_j², z_k²) = (E[z_j² z_k²] − 1) / sqrt(E[z_j⁴ − 1] E[z_k⁴ − 1]).

³ The numbers used in this example are chosen to illustrate the problem and its
solution. For a more detailed and realistic setup, see [10].
This metric is easy to compute by first whitening the input data Z with ZCA
whitening [3], then computing the pairwise similarities between all of the fea-
tures:
d(j, k; Z) ≡ d[z_j, z_k] ≈ (Σ_i (z_j^(i))² (z_k^(i))² − 1) / sqrt(Σ_i ((z_j^(i))⁴ − 1) · Σ_i ((z_k^(i))⁴ − 1)).
This computation is practical for fewer than 10000 input features. It can still be
computed approximately for hundreds of thousands of features if necessary [10].
Thus, we now have a function d(j, k; Z) that can provide a measure of the
dependency between features z_j and z_k observed in a given dataset Z.
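A small NumPy sketch of this dependency test (our illustration; it uses the expectation form of the definition and assumes the columns of Z have already been ZCA-whitened to zero mean and unit variance):

import numpy as np

def energy_correlation_matrix(Z):
    """Pairwise energy correlations d(j, k; Z) between whitened features.
    Z is an m-by-d array of ZCA-whitened examples (zero mean, unit variance
    per feature), so E[z_j^2] = 1 for every feature j."""
    m = Z.shape[0]
    S = Z ** 2
    num = S.T @ S / m - 1.0                   # estimates E[z_j^2 z_k^2] - 1
    var = (Z ** 4).mean(axis=0) - 1.0         # estimates E[z_j^4] - 1
    return num / np.sqrt(np.outer(var, var) + 1e-8)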
Fig. 22.8. Groups of features selected by an automated dependency test. The features
corresponding to each group of filters would be processed separately by K-means to
build a second layer of features.
Now we would like to try to learn some “higher level” features on top of the
Z representation. Using our dependency test we can find reasonable choices of
receptive fields in a relatively simple way: we pick one feature, say z_0, and then
use the dependency test to find the R features with the strongest dependence on
z_0 according to the test (i.e., find the indices j so that d(0, j; Z) is large). We then
run K-means using only these R features as input. If we pick R small enough
(e.g., 100 or 200) the usual normalization, whitening and K-means training steps
can be applied easily and require virtually no tuning to work well. Because of
the smaller input dimension we only need to train a few hundred centroids and
thus we can use much less data than would be required to run K-means on the
original 10000-dimensional dataset. This procedure can be repeated for many
choices of the “seed” feature (z_0 above) until we have trained dictionaries from
receptive fields covering all of the input variables in z. Figure 22.8 shows the
first-layer filters from D corresponding to some of these groups of features (i.e.,
these are the D^(j) whose pooled responses z_j have high dependence according to
the energy correlation test).
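The receptive-field selection loop itself can be sketched as follows (our illustration; kmeans stands for the normalize/whiten/K-means pipeline described earlier and d_matrix for a dependency matrix such as the one computed above, both of which are assumptions rather than definitions from the chapter):

import numpy as np

def learn_group_dictionaries(Z, d_matrix, kmeans, R=100, n_groups=32, seed=0):
    """Pick random seed features, group the R features most dependent on each
    seed (largest entries of its row in d_matrix), and run K-means on each
    low-dimensional group of columns of Z separately."""
    rng = np.random.default_rng(seed)
    dictionaries = []
    for _ in range(n_groups):
        j0 = int(rng.integers(Z.shape[1]))         # the "seed" feature z_0
        group = np.argsort(-d_matrix[j0])[:R]      # indices j with large d(0, j; Z)
        dictionaries.append((group, kmeans(Z[:, group])))
    return dictionaries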
Table 22.1. Results on CIFAR-10 (full)

Architecture                  Accuracy (%)
1 Layer                       78.3%
1 Layer (4800 maps)           80.6%
2 Layers (Single RF)          77.4%
2 Layers (Random RF)          77.6%
2 Layers (Learned RF)         81.2%
3 Layers (Learned RF)         82.0%
VQ (6000 maps) [12]           81.5%
Conv. DBN [26]                78.9%
Deep NN [8]                   80.49%
Multi-column Deep NN [9]      88.79%
Table 22.2. Results on CIFAR-10 (400 ex. per class)

Architecture                  Accuracy (%)
1 Layer                       64.6% (±0.8%)
1 Layer (4800 maps)           63.7% (±0.7%)
2 Layers (Single RF)          65.8% (±0.3%)
2 Layers (Random RF)          65.8% (±0.9%)
2 Layers (Learned RF)         69.2% (±0.7%)
3 Layers (Learned RF)         70.7% (±0.7%)
Sparse coding (1 layer) [12]  66.4% (±0.8%)
VQ (1 layer) [12]             64.4% (±1.0%)
Table 22.3. Classification Results on STL-10

Architecture                  Accuracy (%)
1 Layer                       54.5% (±0.8%)
1 Layer (4800 maps)           53.8% (±1.6%)
2 Layers (Single RF)          55.0% (±0.8%)
2 Layers (Random RF)          54.4% (±1.2%)
2 Layers (Learned RF)         58.9% (±1.1%)
3 Layers (Learned RF)         60.1% (±1.0%)
Sparse coding (1 layer) [12]  59.0% (±0.8%)
VQ (1 layer) [12]             54.9% (±0.4%)
The effectiveness of this sort of approach combined with K-means has been
shown in previous work [10]. Table 22.1 details results obtained on the full CIFAR
dataset with various settings and comparisons to other contemporary methods.
First, we can see in these results that learning 2 layers of features is essen-
tially fruitless when using naive choices of receptive fields: a single receptive
field that includes all of the inputs, or receptive fields that connect to random
inputs. Indeed, the results for 2 layer networks are worse (77.4% and 77.6%)
than obtained using a single layer alone (78.3%). This should be expected: using
a single receptive field, K-means is unable to build good features due to the
high-dimensional input (like Figure 22.4b), yet using a random receptive field
wastes representational power modeling unrelated inputs (like Figure 22.7). By
contrast, results obtained with the receptive field learning scheme above (with
2nd-order dependency measure) are significantly better: achieving a significant
improvement over the baseline single-layer results, and even out-performing a
much larger single-layer network. With 3 layers, this system improves further
to 82.0% accuracy. Achieving the best possible results (as reported in [9]) may
require supervised training of the entire network, but this result demonstrates
very clearly the importance of controlling the connectivity of features in order for
K-means to work well in deep networks (where we typically use only unlabeled
data to construct features).
Training only from unlabeled data is much more useful in a scenario where
we have limited labeled training data. Tables 22.2 and 22.3 show results ob-
tained from similar experiments on the CIFAR-10 dataset when using only 400
labeled examples per class, and the STL-10 [11] dataset (where only 100 labels
are available per class). The results are very similar, even though we have less
supervision: poor choices of receptive fields almost entirely negate the benefits of
training multiple layers of features, but using the simple receptive field selection
technique above allows us to successfully build up to 3 layers of useful features
with K-means.
22.6 Conclusion
In this chapter we have reviewed many results, observations and tricks that are
useful for building feature-learning systems with K-means as a scalable unsuper-
vised learning module. The major considerations that we have covered, which
practitioners should keep in mind before embarking on a new application, are
summarized as:
1. Mean and contrast normalize inputs.
2. Use whitening to “sphere” the data, taking care to set the parameter ap-
propriately. If whitening cannot be performed due to input dimensionality,
one should split up the input variables.
3. Initialize K-means centroids randomly from Gaussian noise and normalize.
4. Use damped updates to help avoid empty clusters and improve stability
(a code sketch of items 3–5 follows this list).
5. Be mindful of the impact of dimensionality and sparsity on K-means. K-
means tends to find sparse projections of the input data by seeking out
“heavy-tailed” directions. Yet when the data is not properly whitened, the
input dimensionality is very high, or there is insufficient data, it may perform
poorly.
6. With higher dimensionalities, K-means will require significantly increased
amounts of data, possibly negating its speed advantage.
7. Exogenous parameters in the system (pooling, encoding methods, etc.) can
have a bigger impact on final performance than the learning algorithm itself.
Consider spending compute resources on more cross-validation for parame-
ters before concluding that a more expensive learning scheme is required.
8. Using more centroids almost always helps when using the image recognition
pipeline described in this chapter, provided we have enough training data.
Indeed, whenever more compute resources become available, this is the first
thing to try.
9. When labeled data is abundant, find a cheap encoder and let a supervised
learning system do most of the work. If labeled data is limited (e.g., hundreds
of examples per class), an expensive encoder may work better.
10. Use local receptive fields wherever possible. Input data dimensionality is
the main bottleneck to the success of K-means and should be kept as low as
possible. If local receptive fields cannot be chosen by hand, try an automated
dependency test to help cut up your data into (overlapping) groups of inputs
with lower dimensionality. This is likely a necessity for deep networks!
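As a compact illustration of items 3–5 (a sketch under our reading of the chapter's K-means variant, not a definitive implementation), the centroid update with Gaussian initialization, normalization and damping might look like:

import numpy as np

def kmeans_features(X, k, iters=10, eps=1e-8, seed=0):
    """Sketch of items 3-5: centroids initialized from Gaussian noise and
    normalized, one non-zero assignment per example (largest projection),
    and a damped centroid update to avoid empty clusters.
    X is an m-by-n array of normalized, whitened examples (rows)."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], k))
    D /= np.linalg.norm(D, axis=0) + eps
    for _ in range(iters):
        P = X @ D                                  # projections onto centroids
        idx = np.abs(P).argmax(axis=1)             # best-matching centroid
        S = np.zeros_like(P)
        S[np.arange(X.shape[0]), idx] = P[np.arange(X.shape[0]), idx]
        D_new = X.T @ S + D                        # "+ D" is the damped update
        D = D_new / (np.linalg.norm(D_new, axis=0) + eps)
    return D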
The above recommendations cover essentially all of the tools, tricks and insights
that underlie recent feature-learning results based on K-means. Though it is
unclear how far K-means can be pushed in comparison to more expressive algo-
rithms, the tips above are enough to know when K-means is appropriate and to
get it working in many challenging scenarios.
References
[1] Agarwal, A., Triggs, B.: Hyperfeatures – Multilevel Local Coding for Visual Recog-
nition. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS,
vol. 3951, pp. 30–43. Springer, Heidelberg (2006)
[2] Aharon, M., Elad, M., Bruckstein, A.: K-SVD: An algorithm for designing over-
complete dictionaries for sparse representation. IEEE Transactions on Signal Pro-
cessing 54(11), 4311–4322 (2006)
[3] Bell, A., Sejnowski, T.J.: The ‘independent components’ of natural scenes are edge
filters. Vision Research 37(23), 3327–3338 (1997)
[4] Boureau, Y., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recog-
nition. In: 23rd Conference on Computer Vision and Pattern Recognition, pp.
2559–2566 (2010)
[5] Boureau, Y., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in
visual recognition. In: 27th International Conference on Machine Learning, pp.
111–118 (2010)
[6] Boureau, Y., Roux, N.L., Bach, F., Ponce, J., LeCun, Y.: Ask the locals: multi-
way local pooling for image recognition. In: 13th International Conference on
Computer Vision, pp. 2651–2658 (2011)
[7] Bradley, D.M., Bagnell, J.A.: Differentiable sparse coding. In: Advances in Neural
Information Processing Systems, vol. 22, pp. 113–120 (2008)
[8] Ciresan, D.C., Meier, U., Masci, J., Gambardella, L.M., Schmidhuber, J.: Flexi-
ble, high performance convolutional neural networks for image classification. In:
International Joint Conference on Artificial Intelligence, pp. 1237–1242 (2011)
[9] Ciresan, D.C., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for
image classification. In: Computer Vision and Pattern Recognition, pp. 3642–3649
(2012)
[10] Coates, A., Ng, A.Y.: Selecting receptive fields in deep networks. In: Advances in
Neural Information Processing Systems, vol. 24, pp. 2528–2536 (2011)
[11] Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised
feature learning. In: 14th International Conference on AI and Statistics, pp. 215–
223 (2011)
[12] Coates, A., Ng, A.Y.: The importance of encoding versus training with sparse
coding and vector quantization. In: 28th International Conference on Machine
Learning, pp. 921–928 (2011)
[13] Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization
with bags of keypoints. In: ECCV Workshop on Statistical Learning in Computer
Vision, pp. 59–74 (2004)
[14] Dhillon, I.S., Modha, D.M.: Concept decompositions for large sparse text data
using clustering. Machine Learning 42(1), 143–175 (2001)
[15] Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. The
Annals of statistics 32(2), 407–499 (2004)
[16] Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene
categories. In: Computer Vision and Pattern Recognition, vol. 2, pp. 524–531
(2005)
[17] Garrigues, P., Olshausen, B.: Group sparse coding with a laplacian scale mixture
prior. In: Advances in Neural Information Processing Systems, vol. 23, pp. 676–684
(2010)
[18] van Gemert, J.C., Geusebroek, J.-M., Veenman, C.J., Smeulders, A.W.M.: Kernel
Codebooks for Scene Categorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.)
ECCV 2008, Part III. LNCS, vol. 5304, pp. 696–709. Springer, Heidelberg (2008)
[19] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In:
14th International Conference on Artificial Intelligence and Statistics, pp. 315–
323 (2011)
[20] Goodfellow, I., Courville, A., Bengio, Y.: Spike-and-slab sparse coding for unsuper-
vised feature discovery. In: NIPS Workshop on Deep Learning and Unsupervised
Feature Learning (2011)
[21] Hinton, G., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets.
Neural Computation 18(7), 1527–1554 (2006)
[22] Hyvärinen, A., Hurri, J., Hoyer, P.: Natural Image Statistics. Springer (2009)
[23] Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applica-
tions. Neural Networks 13(4-5), 411–430 (2000)
[24] Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-
stage architecture for object recognition? In: 12th International Conference on
Computer Vision, pp. 2146–2153 (2009)
[25] Krizhevsky, A.: Learning multiple layers of features from Tiny Images. Master’s
thesis, Dept. of Comp. Sci., University of Toronto (2009)
[26] Krizhevsky, A.: Convolutional Deep Belief Networks on CIFAR-10 (2010) (unpub-
lished manuscript)
[27] Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D
object dataset. In: International Conference on Robotics and Automation, pp.
1817–1824 (2011)
[28] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: Computer Vision and Pat-
tern Recognition (2006)
[29] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.,
Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural
Computation 1, 541–551 (1989)
[30] LeCun, Y., Huang, F.J., Bottou, L.: Learning methods for generic object recog-
nition with invariance to pose and lighting. In: Computer Vision and Pattern
Recognition, vol. 2, pp. 97–104 (2004)
[31] Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks
for scalable unsupervised learning of hierarchical representations. In: 26th Inter-
national Conference on Machine Learning, pp. 609–616 (2009)
[32] Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online learning for matrix factorization
and sparse coding. Journal of Machine Learning Research 11, 19–60 (2010)
[33] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann ma-
chines. In: 27th International Conference on Machine Learning, pp. 807–814 (2010)
[34] Olshausen, B.A., Field, D.J.: Emergence of simple-cell receptive field properties
by learning a sparse code for natural images. Nature 381(6583), 607–609 (1996)
[35] Raina, R., Battle, A., Lee, H., Packer, B., Ng, A.: Self-taught learning: trans-
fer learning from unlabeled data. In: 24th International Conference on Machine
learning, pp. 759–766 (2007)
[36] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), 267–288 (1996)
[37] Varma, M., Zisserman, A.: A statistical approach to material classification using
image patch exemplars. Transactions on Pattern Analysis and Machine Intelli-
gence 31(11), 2032–2047 (2009)
[38] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained lin-
ear coding for image classification. In: Computer Vision and Pattern Recognition,
pp. 3360–3367 (2010)
[39] Yang, J., Yu, K., Gong, Y., Huang, T.S.: Linear spatial pyramid matching using
sparse coding for image classification. In: Computer Vision and Pattern Recogni-
tion, pp. 1794–1801 (2009)
[40] Yu, K., Zhang, T., Gong, Y.: Nonlinear learning using local coordinate coding.
In: Advances in Neural Information Processing Systems, vol. 22, pp. 2223–2231
(2009)
[41] Zetzsche, C., Krieger, G., Wegmann, B.: The atoms of vision: Cartesian or polar?
Journal of the Optical Society of America 16(7), 1554–1565 (1999)
[42] Zhou, X., Yu, K., Zhang, T., Huang, T.S.: Image Classification Using Super-Vector
Coding of Local Image Descriptors. In: Daniilidis, K., Maragos, P., Paragios, N.
(eds.) ECCV 2010, Part V. LNCS, vol. 6315, pp. 141–154. Springer, Heidelberg
(2010)
23
Deep Big Multilayer Perceptrons for Digit Recognition

Dan Claudiu Cireşan¹,², Ueli Meier¹,², Luca Maria Gambardella¹,², and Jürgen Schmidhuber¹,²

¹ IDSIA, Galleria 2, 6928 Manno-Lugano, Switzerland
² University of Lugano & SUPSI, Switzerland
Abstract. The competitive MNIST handwritten digit recognition
benchmark has a long history of broken records since 1998. The most re-
cent advancement by others dates back 8 years (error rate 0.4%). Good
old on-line back-propagation for plain multi-layer perceptrons yields a
very low 0.35% error rate on the MNIST handwritten digits benchmark
with a single MLP and 0.31% with a committee of seven MLP. All we
need to achieve this best result (as of 2011) are many hidden layers, many
neurons per layer, numerous deformed training images to avoid overfit-
ting, and graphics cards to greatly speed up learning.
Keywords: NN (Neural Network), MLP (Multilayer Perceptron), GPU
(Graphics Processing Unit), training set deformations, MNIST¹, committee,
BP (back-propagation).
Note: This work combines three previously published papers [6, 7, 22].
23.1 Introduction
Automatic handwriting recognition is of academic and commercial interest. Cur-
rent algorithms are already pretty good at learning to recognize handwritten
digits. Post offices use them to sort letters; banks use them to read personal
checks. MNIST [21] is the most widely used benchmark for isolated handwritten
digit recognition. More than a decade ago, artificial neural networks called Mul-
tilayer Perceptrons or MLPs [35, 20, 29] were among the first classifiers tested on
MNIST. Most had few layers or few artificial neurons (units) per layer [21], but
apparently back then they were the biggest feasible MLPs, trained when CPU
cores were at least 20 times slower than today. A more recent MLP with a single
hidden layer of 800 units achieved 0.70% error [33]. However, more complex meth-
ods listed on the MNIST web page always seemed to outperform MLPs, and the
general trend went towards more and more complex variants of Support Vector
¹ http://yann.lecun.com/exdb/mnist/
Machines or SVMs [13] and combinations of NNs and SVMs [19] etc. Convolu-
tional neural networks (CNNs) achieved a record-breaking 0.40% error rate [33],
using novel elastic training image deformations. Recent methods pre-train each
hidden CNN layer one by one in an unsupervised fashion (this seems promising
especially for small training sets), then use supervised learning to achieve 0.39%
error rate [26, 27]. The biggest MLP so far [31] also was pre-trained without
supervision, then piped its output into another classifier to achieve an error of
1% without domain-specific knowledge. Deep MLPs initialized by unsupervised
pretraining were also successfully applied to speech recognition [23].
Are all these complexifications of plain MLPs really necessary? Can’t one sim-
ply train really big plain MLPs on MNIST? One reason is that at first glance
deep MLPs do not seem to work better than shallow networks [1]. Training them
is hard as back-propagated gradients quickly vanish exponentially in the number
of layers [16, 18, 15], just like in the first recurrent neural networks [17]. Indeed,
previous deep networks successfully trained with back-propagation (BP) either
had few free parameters due to weight-sharing [21, 33] or used unsupervised,
layer-wise pre-training [14, 1, 26]. But is it really true that deep BP-MLPs do
not work at all, or do they just need more training time? How to test this? Un-
fortunately, on-line BP for hundreds/thousands of epochs on large MLPs may
take weeks or months on standard serial computers. But can’t one parallelize it?
Well, on computer clusters this is hard due to communication latencies between
individual computers. Parallelization across training cases and weight updates
for mini-batches [24] might alleviate this problem, but still leaves the task of
parallelizing fully online-BP. Only GPUs are capable of such fine grained par-
allelism. Multi-threading on a multi-core processor is not easy either. We may
speed up BP using SSE (Streaming Single Instruction, Multiple Data Exten-
sions), either manually, or by setting appropriate compiler flags. The maximum
theoretical speedup under single precision floating-point, however, is four, which
is not enough. And MNIST is large - its 60,000 images take almost 50MB, too
much to fit in the L2/L3 cache of any current processor. This requires continually
accessing data in considerably slower RAM. To summarize, currently it is
next to impossible to train big MLPs on CPUs.
We showed how to overcome all these problems by training large, deep MLPs
on graphics cards [6] and obtained an error rate of 0.35% with a deep MLP.
Deformations proved essential to prevent MLPs with up to 12 million free pa-
rameters from overfitting. To let the deformation process keep up with network
training speed we had to port it onto the GPU as well.
At some stage in the classifier design process one usually has collected a set
of possible classifiers. Often one of them yields best performance. Intriguingly,
however, the sets of patterns misclassified by the different classifiers do not nec-
essarily overlap. This information could be harnessed in a committee. In the
context of handwritten recognition it was already shown [4] how a combination
of various classifiers can be trained more quickly than a single classifier yield-
ing the same error rate. Here we focus on improving recognition rate using a
committee of MLP. Our goal is to produce a group of classifiers whose errors on
various parts of the training set differ (are uncorrelated) as much as possible [2].
We show that for handwritten digit recognition this can be achieved by training
identical classifiers on data normalized in different ways prior to training.
23.2 Data
MNIST consists of two datasets, one for training (60,000 images) and one for
testing (10,000 images). Many studies divide the training set into two sets con-
sisting of 50,000 images for training and 10,000 for validation. Our network is
trained on slightly deformed images, continually generated in on-line fashion;
hence we may use the whole un-deformed training set for validation, without
wasting training images. Pixel intensities of the original gray scale images range
from 0 (background) to 255 (max foreground intensity). The 28 × 28 = 784 pixels per
image get mapped to real values (pixel intensity / 127.5) − 1.0 in [−1.0, 1.0],
and are fed into the NN input layer.
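In code, this input scaling is simply (a trivial sketch):

import numpy as np

def scale_mnist(images_uint8):
    # map pixel intensities 0..255 to real values in [-1.0, 1.0]
    return images_uint8.astype(np.float32) / 127.5 - 1.0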
23.3 Architectures
We train 5 MLPs with 2 to 9 hidden layers and varying numbers of hidden units.
Mostly but not always the number of hidden units per layer decreases towards
the output layer (Table 23.3). There are 1.34 to 12.11 million free parameters
(or weights, or synapses).
We use standard on-line BP [30], without momentum, but with a variable
learning rate that shrinks by a multiplicative constant after each epoch, from
10⁻³ down to 10⁻⁶. Weights are initialized with a uniform random distribution
in [−0.05, 0.05]. Each neuron’s activation function is a scaled hyperbolic tangent:
y(a) = A tanh(Ba), where A = 1.7159 and B = 0.6666 [21], and a softmax output
layer is used. Weight initialization and annealing rate are not overly important
as long as sensible choices are made.
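A sketch of these two choices (not the authors' GPU code; the 0.997 decay factor is the one reported in Section 23.6.1):

import numpy as np

A, B = 1.7159, 0.6666

def scaled_tanh(a):
    # y(a) = A * tanh(B * a), as recommended in [21]
    return A * np.tanh(B * a)

def learning_rate(epoch, lr0=1e-3, decay=0.997, lr_min=1e-6):
    # shrink the learning rate by a multiplicative constant after each epoch
    return max(lr_min, lr0 * decay ** epoch)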
23.4 Deforming Images to Get More Training Instances
So far, the best results on MNIST were obtained by deforming training images
[33], thus greatly increasing their number. This allows for training networks
with many weights without overfitting. We combine affine (rotation, scaling and
horizontal shearing) and elastic deformations (Figure 23.1), characterized by the
following real-valued parameters:
σ and α: for elastic distortions emulating uncontrolled oscillations of hand
muscles [33];
β: a random angle from [β, +β] describes either rotation or horizontal
shearing. In case of shearing, tan β defines the ratio between horizontal dis-
placement and image height;
γ
x
, γ
y
: for horizontal and vertical scaling, randomly selected from [1
γ/100, 1+γ/100].
Fig. 23.1. Original digit (top) and distorted digits (bottom). The digit was distorted
with four different displacement fields shown in the middle.
Each affine deformation is fully defined by the corresponding real-valued pa-
rameter that is randomly drawn from a uniform interval. Building the elastic
deformation field on the other hand consists of three parts: 1) create an initial
random distortion vector field, 2) smooth the random distortion field by con-
volving it with a Gaussian kernel defined by a standard deviation σ, and 3) scale
the smoothed deformation field with α, the elastic scaling parameter.
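These three steps can be sketched as follows (our illustration using SciPy's Gaussian filter and bilinear resampling, not the authors' GPU implementation):

import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_deform(image, sigma=6.0, alpha=36.0, rng=None):
    """1) draw a random displacement field, 2) smooth it with a Gaussian of
    standard deviation sigma, 3) scale it by alpha, then resample the image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [rows + dy, cols + dx], order=1, mode="constant")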
At the beginning of every epoch the entire original MNIST training set gets
deformed. Initial experiments with small networks suggested the following de-
formation parameters: σ = 5.0–6.0, α = 36.0–38.0, γ = 15–20. Since digits
1 and 7 are similar they get rotated/sheared less (β = 7.5°) than other digits
(β = 15.0°).
It takes 83 CPU seconds to deform the 60,000 MNIST training images, most
of them (75 seconds) for elastic distortions. Only the most time-consuming part
of the latter—convolution with a Gaussian kernel—is ported to the GPU. The
MNIST training set is split into 600 sequentially processed minibatches of 100
samples each. MNIST digits are scaled from the original 28 ×28 pixels to 29 ×29
pixels, to get a proper center, which simplifies convolution. Each batch grid
(10 × 10 images) has 290 × 290 cells, zero-padded to 310 × 310, thus avoiding
margin effects when applying a Gaussian convolution kernel of size 21 × 21.
Fig. 23.2. Mapping the thread grid of convolution onto the distortion field

The GPU program groups many threads into a block, where they share the same
Gaussian kernel and parts of the random field. All 29 × 290 blocks contain 21
(the kernel size) ×10 threads, each computing a vertical strip of the convolution
(Figure 23.2). Generating the elastic displacement field takes only 1.5 seconds.
Deforming the whole training set is more than 10 times faster, taking 6.5 instead
of the original 83 seconds. Further optimizations would be possible by porting all
deformations onto GPU, and by using the hardware’s interpolation capabilities
to perform the final bilinear interpolation. We omitted these since deformations
are already pretty fast (deforming all images of one epoch takes only 3-10 % of
total computation time, depending on MLP size).
23.5 Forming a Committee
The training procedure of a single network of the committee is summarized in
Figure 23.3. Each network is trained separately on normalized or original data.
The normalization is done for all digits in the training set prior to training
(normalization stage). For the network trained on original MNIST data the nor-
malization step is omitted. Normalization of the original MNIST data is mainly
motivated by practical experience. MNIST digits are already normalized such
that the width or height of the bounding box equals 20 pixels. The variation of
the aspect ratio for various digits is quite large, and we normalize the width of
the bounding box to range from 10 to 20 pixels with a step-size of 2 pixels prior
to training for all digits except ones. Normalizing the original MNIST training
data results in 6 normalized training sets. In total we perform experiments with
seven different data sets (6 normalized and the original MNIST).
We perform six experiments to analyze performance improvements due to
committees. Each committee consists of seven randomly initialized one-hidden-
layer MLPs with 800 hidden units, trained with the same algorithm on randomly
Fig. 23.3. Training a committee member. Original MNIST training data (left digit)
is normalized (W10) prior to training (middle digit). The normalized data is distorted
(D) for each training epoch and used as input (right digit) to the network (NN). Each
depicted digit represents the whole training set.
Fig. 23.4. Testing with a committee. If required, the input digits are width-normalized
(W blocks) and then processed by the corresponding MLP. The committee is formed
by averaging the outputs of all MLPs.
selected batches. The six committees differ only in how the data are normalized
(or not) prior to training and on how the data are deformed during training.
The committees are formed by simply averaging the corresponding outputs as
shown in Figure 23.4.
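In code, forming the committee amounts to averaging the per-net outputs (a sketch; nets and normalizers stand for the trained MLPs and their width-normalization preprocessing, which are not defined here):

import numpy as np

def committee_predict(digit, nets, normalizers):
    """Feed each MLP its own width-normalized version of the input digit and
    average the resulting class posteriors before taking the arg max."""
    outputs = [net(norm(digit)) for net, norm in zip(nets, normalizers)]
    return int(np.argmax(np.mean(outputs, axis=0)))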
The first two experiments are performed on undeformed original MNIST im-
ages. We train a committee of seven MLPs on original MNIST and we also
form a committee of MLPs trained on normalized data. In Table 23.1 the error
rates are listed for each of the individual nets and the committees. The improve-
ment of the committee with respect to the individual nets is marginal for the
first experiment. Adding normalization, the individual experts as well as the
corresponding committee of the second experiment achieve substantially better
recognition rates.
To study the combined effect of normalization and deformation, we per-
form four additional experiments on deformed MNIST (Tab. 23.2). Unless stated
Table 23.1. Error rates of individual nets and of the two resulting committees. For
experiment 1 seven randomly initialized nets are trained on the original MNIST, whereas
for experiment 2 seven randomly initialized nets are trained on width-normalized data:
WN x - Width Normalization of the bounding box to be x pixels wide; ORIG - original
MNIST.

           Error rate [%]
           Exp. 1           Exp. 2
Net 1:     init 1: 1.79     WN 10: 1.62
Net 2:     init 2: 1.80     WN 12: 1.37
Net 3:     init 3: 1.77     WN 14: 1.48
Net 4:     init 4: 1.72     WN 16: 1.53
Net 5:     init 5: 1.91     WN 18: 1.56
Net 6:     init 6: 1.86     WN 20: 1.49
Net 7:     init 7: 1.75     ORIG:  1.79
Average:           1.70            1.31
otherwise, default elastic deformation parameters σ = 6 and α = 36 are used.
All experiments with deformed images use independent horizontal and vertical
scaling of maximum 12.5% and a maximum rotation of ±12.5°. Experiment 3 is
similar to Experiment 1, except that the data are continually deformed. Error
rates of the individual experts are much lower than without deformation (Tab.
23.1). In experiment 4 we randomly reselect training and validation sets for
each of the individual experts, simulating in this way the bootstrap aggregation
technique [3]. The resulting committee performs slightly better than that of
experiment 3. In experiment 5 we vary deformations for each individual network.
Error rates of some of the nets are bigger than in experiments 3 and 4, but the
resulting committee has a lower error rate. In the last experiment we train seven
MLPs on width-normalized images that are also continually deformed during
training. The error rate of the committee (0.43 %) is the best result ever reported
for such a simple architecture. We conclude that width-normalization is essential
for good committee performance, i.e. it is not enough to form a committee from
trained nets with different initializations (experiment 3) or trained on subsets of
the original dataset (experiment 4).
23.6 Using the GPU to Train Deep MLPs
Using simple tricks, such as creating a virtually infinite amount of training data
through random distortions at the beginning of every epoch and forming a com-
mittee of experts trained on differently preprocessed data, state-of-the art results
are obtained on MNIST with a relatively small (800 hidden units) single hidden
layer MLP. Here we report results using deep MLPs, with as many as 5 hidden
layers and up to 12 millions of free parameters, that are prohibitive to train on
Table 23.2. Error rates of the individual nets and of the resulting committees. In
experiments 3 and 4 seven randomly initialized nets are trained on deformed (σ = 6,
α = 36) MNIST, whereas in experiment 4 training and validation sets are reselected.
In experiment 5 seven randomly initialized nets are trained on deformed (different σ,
α) MNIST, and in experiment 6 seven randomly initialized nets are trained on width-
normalized, deformed (σ = 6, α = 36) MNIST. WN x - Width Normalization of the
bounding box to be x pixels wide; ORIG - original MNIST.

           Error rate [%]
           Exp. 3          Exp. 4   Exp. 5                     Exp. 6
Net 1:     init 1: 0.72    0.68     σ = 4.5, α = 36: 0.69      WN 10: 0.64
Net 2:     init 2: 0.71    0.82     σ = 4.5, α = 42: 0.94      WN 12: 0.78
Net 3:     init 3: 0.72    0.73     σ = 6.0, α = 30: 0.55      WN 14: 0.70
Net 4:     init 4: 0.71    0.69     σ = 6.0, α = 36: 0.72      WN 16: 0.60
Net 5:     init 5: 0.62    0.71     σ = 6.0, α = 42: 0.60      WN 18: 0.59
Net 6:     init 6: 0.65    0.70     σ = 7.5, α = 30: 0.86      WN 20: 0.70
Net 7:     init 7: 0.69    0.75     σ = 7.5, α = 36: 0.79      ORIG:  0.71
Average:           0.56    0.53                      0.49             0.43
current CPUs but can successfully be trained on GPUs in a few days. All sim-
ulations were performed on a computer with a Core i7 920 2.66GHz processor,
12GB of RAM, and a GTX 480 graphics card. The GPU accelerates the defor-
mation routine by a factor of 10 (only elastic deformations are GPU-optimized);
the forward propagation (FP) and BP routines are sped up by a factor of 50.
We pick the trained MLP with the lowest validation error, and evaluate it on
the MNIST test set.
23.6.1 Single MLP
We train various MLP and summarize the results in Table 23.3. Training starts
with a learning rate of 10⁻³ multiplied with 0.997 after every epoch until it
reaches 10⁻⁶, thus resulting in more than 2000 epochs, which can be computed
in a few days even for the biggest net. The best network has an error rate of
only 0.35% (35 out of 10,000 digits). This is better than the best previously
published results, namely, 0.39% [26] and 0.40% [33], both obtained by more
complex methods. The 35 misclassified digits are shown in Figure 23.5a. Many
of them are ambiguous and/or uncharacteristic, with obviously missing parts
or strange strokes etc. Interestingly, the second guess of the network is correct
for 30 out of the 35 misclassified digits. The best test error of this MLP is even
lower (0.32%) and may be viewed as the maximum capacity of the network, i.e.,
what it could achieve if we did not have to select the stopping point by the lowest
validation error. Performance clearly profits from adding hidden layers and more units per
Table 23.3. Error rates on MNIST test set. Architecture: 841 input neurons, hidden
layers containing 2500, 2000, 1500, 1000 and 500 neurons, and 10 outputs. TEfBV -
test error for best validation, BTE - best test error.

ID  architecture (number of neurons in each layer)  TEfBV [%]  BTE [%]  simulation time [h]  weights [millions]  test error [%], no distortion
1   1000, 500, 10                                   0.49       0.44     23.4                 1.34                1.78
2   1500, 1000, 500, 10                             0.46       0.40     44.2                 3.26                1.85
3   2000, 1500, 1000, 500, 10                       0.41       0.39     66.7                 6.69                1.73
4   2500, 2000, 1500, 1000, 500, 10                 0.35       0.32     114.5                12.11               1.71
5   9 × 1000, 10                                    0.44       0.43     107.7                8.86                1.81
layer. For example, network 5 has more but smaller hidden layers than network
4 (Table 23.3).
Networks with up to 12 million weights can successfully be trained by plain
gradient descent to achieve test errors below 1% after 20-30 epochs in less than
2 hours of training. How can networks with so many parameters generalize well
on the unseen test set? Answer: the continual deformations of the training set
generate a virtually infinite supply of training examples, and the network rarely
sees any training image twice. Without any distortions, the error for all networks
is around 1.7-1.8% (last column in Table 23.3).
(a) (b)
Fig. 23.5. The misclassified digits, together with the two most likely predictions (bot-
tom, from left to right) and the correct label according to MNIST (top, right): (a) the
best network from Table 23.3. (b) the committee from Table 23.4.
23.6.2 Committee of MLP
Here we list results of a committee of MLP with the architecture that obtained
0.35% error rate on MNIST (841 neurons in the input layer, five hidden layers
Table 23.4. Error rates of the individual nets and of the resulting committee. Archi-
tecture: 841 input neurons, hidden layers containing 2500, 2000, 1500, 1000 and 500
neurons, and 10 outputs. WN x - Width Normalization of the bounding box to be x
pixels wide.

WN                    10    12    14    16    18    20    ORIGINAL MNIST
test error [%]        0.52  0.45  0.44  0.49  0.36  0.38  0.35
committee error [%]   0.31
containing 2500, 2000, 1500, 1000 and 500 neurons, and 10 outputs). We train
six additional nets with the same architecture on normalized data (the width of
the digits is normalized prior to training) and form a committee by averaging the
predictions of the individual nets (Table 23.4). The width-normalization is es-
sential for good committee performance as shown in Section 23.5. All committee
members distort their width-normalized training dataset before each epoch.
Interestingly, the error of the extremely simple committee (0.31%) is lower
than those of the individual nets. This is the best result ever reported on MNIST
using MLP. Many of the 31 misclassified digits (Figure 23.5b) are ambiguous
and/or uncharacteristic, with obviously missing parts or strange strokes etc.
Remarkably, the committee’s second guess is correct for 29 of the 31.
23.7 Discussion
In recent decades the amount of raw computing power per Euro has grown by
a factor of 100-1000 per decade. Our results show that this ongoing hardware
progress may be more important than advances in algorithms and software (al-
though the future will belong to methods combining the best of both worlds).
Current graphics cards (GPUs) are already more than 50 times faster than stan-
dard microprocessors when it comes to training big and deep neural networks
by the ancient algorithm, on-line back-propagation (weight update rate up to
7.5 × 10⁹/s, and more than 10¹⁵ per trained network). On the competitive MNIST
handwriting benchmark, single precision floating-point GPU-based neural nets
surpass all previously reported results, including those obtained by much more
complex methods involving specialized architectures, unsupervised pre-training,
combinations of machine learning classifiers etc. Training sets of sufficient size
to avoid overfitting are obtained by appropriately deforming images. Of course,
the approach is not limited to handwriting, and obviously holds great promise
for many visual and other pattern recognition problems.
Although big deep MLP are very powerful general classifiers when combined
with an appropriate distortion algorithm to enhance the training set, they cannot
compete with dedicated architectures such as max-pooling convolutional neural
networks on complex image classification problems. For tasks more difficult than
handwritten digit recognition MLP are not competitive anymore, both in clas-
sification performance and required training time. We have recently shown [11]
that large convolutional neural networks combined with max-pooling [32] im-
prove the state-of-the-art by 30-80% for a plethora of benchmarks like Latin
letters [8], Chinese characters [11], traffic signs [9, 12], stereo projection of 3D
models [10, 11] and even small natural images [11].
Acknowledgments. Part of this work got started when Dan Cireşan was a
PhD student at University ”Politehnica” of Timişoara. He would like to thank
his PhD advisor, Ştefan Holban, for his guidance, and Răzvan Moşincat for pro-
viding a CPU framework for MNIST. This work was partially supported by the
Swiss Commission for Technology and Innovation (CTI), Project n. 9688.1 IFF:
Intelligent Fill in Form., and by a FP7-ICT-2009-6 EU Grant, Project Code
270247: A Neuro-dynamic Framework for Cognitive Robotics: Scene Represen-
tations, Behavioral Sequences, and Learning.
Appendix - GPU Implementation
Graphics Processing Unit
Until 2007 the only way to program a GPU was to translate the problem-solving
algorithm into a set of graphical operations. Despite being hard to code and
difficult to debug, several GPU-based NN implementations were developed when
GPUs became faster than CPUs. Two layer MLPs [34] and CNNs [5] have been
previously implemented on GPUs. Although speedups were relatively modest,
these studies showed how GPUs can be used for machine learning. More recent
GPU-based CNNs trained in batch mode are two orders of magnitude faster
than CPU-based CNNs [32].
The GPU code is written using CUDA (Compute Unified Device Architec-
ture), a C-like general programming language. GPU speed and memory band-
width are vastly superior to those of CPUs, and crucial for fast MLP implemen-
tations. To fully understand our algorithm in terms of GPU / CUDA, please
visit the NVIDIA website [25]. According to CUDA terminology, the CPU is
called host and the graphics card device or GPU.
Deformations
Only the most time-consuming part of the elastic distortions, convolution with a
Gaussian kernel [33], is ported to the GPU. The MNIST training set is split into 600
sequentially processed batches. MNIST digits are scaled from the original 28×28
pixels to 29 ×29 pixels, to get a proper center, which simplifies convolution. An
image grid has 290 × 290 cells, zero-padded to 300 × 300, thus avoiding margin
effects when applying a Gaussian convolution kernel of size 21 × 21.
Our GPU program groups many threads into a block, where they share the
same gaussian kernel and parts of the random field. The blocks contain 21 (the
kernel size) ×10 threads, each computing a vertical strip of the convolution
operation (Listing 23.1).
Listing 23.1. Convolution Kernel for elastic distortion
__global__ void ConvolveField(float *randomfield, int width, int height,
    float *kernel, float *outputfield, float elasticScale){
  float sum=0;
  const int stride_k=GET_STRIDE(GAUSSIAN_FIELD_SIZE,pitch_x>>2);     //stride for gaussian kernel
  __shared__ float K[GAUSSIAN_FIELD_SIZE][stride_k];                 //kernel (21 x 32 values)
  __shared__ float R[GAUSSIAN_FIELD_SIZE+9][GAUSSIAN_FIELD_SIZE];    //random field (30 x 21 values)
  __shared__ float s[10][GAUSSIAN_FIELD_SIZE];                       //partial sums (10 x 21 values)
  int stride_in=GET_STRIDE(width,pitch_x>>2);                        //random field stride as a multiple of 32
  int stride_out=GET_STRIDE(width-GAUSSIAN_FIELD_SIZE+1,pitch_x>>2); //output stride as a multiple of 32

  //loading gaussian kernel into K (21 x 21 values)
  K[ 0+threadIdx.y][threadIdx.x] = kernel[( 0+threadIdx.y)*stride_k + threadIdx.x]; //rows 0..9
  K[10+threadIdx.y][threadIdx.x] = kernel[(10+threadIdx.y)*stride_k + threadIdx.x]; //rows 10..19
  if(threadIdx.y==0)
    K[20+threadIdx.y][threadIdx.x] = kernel[(20+threadIdx.y)*stride_k + threadIdx.x]; //row 20

  //loading randomfield into R
  //0..9 x 21 values
  R[ 0+threadIdx.y][threadIdx.x] = randomfield[(10*blockIdx.y+ 0+threadIdx.y)*stride_in + blockIdx.x + threadIdx.x];
  //10..19 x 21 values
  R[10+threadIdx.y][threadIdx.x] = randomfield[(10*blockIdx.y+10+threadIdx.y)*stride_in + blockIdx.x + threadIdx.x];
  //20..29 x 21 values
  R[20+threadIdx.y][threadIdx.x] = randomfield[(10*blockIdx.y+20+threadIdx.y)*stride_in + blockIdx.x + threadIdx.x];
  __syncthreads(); //wait until everything is read into shared memory

  //computing partial sums
  #pragma unroll 21 //GAUSSIAN_FIELD_SIZE
  for(int i=0;i<GAUSSIAN_FIELD_SIZE;i++)
    sum += R[threadIdx.y + i][threadIdx.x] * K[i][threadIdx.x];
  s[threadIdx.y][threadIdx.x]=sum;
  __syncthreads();

  if(threadIdx.x==0){ //the first column of threads computes the final values of the convolutions
    #pragma unroll 20 //GAUSSIAN_FIELD_SIZE-1
    for(int i=1;i<GAUSSIAN_FIELD_SIZE;i++) sum += s[threadIdx.y][i];
    outputfield[(blockIdx.y*10+threadIdx.y)*stride_out + blockIdx.x] = sum * elasticScale;
  }
}
Training Algorithm
We closely follow the standard BP algorithm [30], except that BP of deltas and
weight updates are disentangled and performed sequentially. This allows for more
parallelism within each routine.
Forward Propagation
The algorithm is divided into two kernels. The weight matrix W is partitioned
as illustrated in Figure 23.6.
Fig. 23.6. Forward propagation: a) mapping of kernel 1 grid onto the padded weight
matrix; b) mapping the kernel 2 grid onto the partial dot products matrix; c) output
of forward propagation
Kernel 1. Each block has 256 threads (Figure 23.6a), each computing a partial
dot product of 32-component vectors. The dot products are stored in a temporary
matrix (Figure 23.6b). This kernel has a very high throughput: the average memory
bandwidth is 115 GB/s. This is possible because many relatively small blocks
keep the GPU busy. Each block uses shared memory for storing the previous
layer's activations, which are simultaneously read by the first 32 threads of each
block and then used by all 256 threads. After thread synchronization, the partial
dot products are computed in parallel (Listing 23.2). The number of instructions
is kept to a minimum by pre-computing all common index parts.
Kernel 2. The thread grid (Figure 23.6b) has only one row of blocks consisting
of warp threads, since each thread has to compute a complete dot product (Fig-
ure 23.6c) and then pipe it into the activation function. This kernel (Listing 23.2)
is inefficient for layers with fewer than 1024 incoming connections per neuron,
especially for the last layer which has only ten neurons, one for each digit. That
is, its grid will have only one block, occupying only 6% of the GTX 480 GPU.
Listing 23.2. Forward propagation kernels
__global__ void MLP_FP_reduction_Kernel1(float *prevLN, float *W,
    float *partialsum, unsigned int neurons, unsigned int prevneurons){
  const int threads=256;
  const int stride=GET_STRIDE(neurons,pitch_x>>2); //horizontal stride of W matrix
  int X=blockIdx.x*threads + threadIdx.x;          //precomputing expressions
  int Y=X+stride*blockIdx.y;
  int Z=blockIdx.y*pitch_y*stride + X;
  float sum=0.0f;
  __shared__ float output[pitch_y];
  if(blockIdx.y==0)
    if(threadIdx.x==0) output[0]=1.0f;
    else if(threadIdx.x<pitch_y) //there are only 32 values to read and 128 threads
      output[threadIdx.x] = threadIdx.x-1<prevneurons ? prevLN[threadIdx.x-1] : 0.0f;
    else;
  else if(threadIdx.x<pitch_y)   //there are only 32 values to read and 128 threads
    output[threadIdx.x] = blockIdx.y*pitch_y+threadIdx.x-1<prevneurons ?
      prevLN[blockIdx.y*pitch_y+threadIdx.x-1] : 0.0f;
  else;
  __syncthreads();
  if(X<neurons){ //compute partial sums
    //#pragma unroll 32
    int size=0;
    if((blockIdx.y+1)*pitch_y>=prevneurons+1)
      size = prevneurons + 1 - blockIdx.y*pitch_y;
    else size=pitch_y;
    for (int ic=0; ic<size; ic++){
      sum += output[ic] * W[Z];
      Z+=stride;
    }
    partialsum[Y]=sum;
  }
}

__global__ void MLP_FP_reduction_Kernel2(float *currLN, float *partialsum,
    unsigned int neurons, unsigned int size){
  float sum=0.0f;
  int idx = blockIdx.x*(pitch_x>>2) + threadIdx.x;      //precomputed index
  unsigned int stride = GET_STRIDE(neurons,pitch_x>>2); //stride for partialsum matrix

  if(idx>=neurons) return; //is this thread computing a true neuron?
  for (int i=0; i<size; i++) sum += partialsum[i*stride+idx]; //computing the final dot product
  currLN[idx] = SIGMOIDF(sum); //applying activation
}
Backward Propagation
This is similar to FP, but we need W^T for coalesced access. Instead of transpos-
ing the matrix, the computations are performed on patches of data read from
device memory into shared memory, similar to the optimized matrix transposi-
tion algorithm of [28]. Shared memory access is much faster, without coalesc-
ing restrictions. Because we have to cope with layers of thousands of neurons,
back-propagating deltas uses a reduction method implemented in two kernels
communicating partial results via global memory (Listing 23.3).
Listing 23.3. Backpropagating deltas kernels
__global__ void backPropagateDeltasFC_A(float *indelta, float *weights,
    unsigned int ncon, unsigned int nrneur, float *partial){
  const int px = pitch_x>>2;
  unsigned int stride_x = GET_STRIDE(nrneur,px);
  unsigned int stride_y = GET_STRIDE(ncon,pitch_y);
  float outd = 0.0;
  int idx = blockIdx.x*px+threadIdx.x;
  int X = blockIdx.y*pitch_y*stride_x + idx;
  int Y = threadIdx.x;
  __shared__ float w[32*33]; //pitch_y and px should be equal! +1 to avoid bank conflicts!
  __shared__ float id[px];   //input delta
  #pragma unroll 32 //read the weight patch into shared memory
  for(int i=0;i<pitch_y;i++){w[Y]=weights[X]; X+=stride_x; Y+=33;}
  //read the input delta patch into shared memory
  if(idx>=nrneur) id[threadIdx.x]=0; //a fake input delta for inexistent indelta
  else id[threadIdx.x]=indelta[idx];
  __syncthreads(); //not needed for blocks with a warp number of threads: implicit synchronization
  #pragma unroll 32 //compute partial results
  for(int i=0;i<px;i++) outd+=w[threadIdx.x*33+i]*id[i];
  //write out the partial results
  partial[blockIdx.x*stride_y + blockIdx.y*pitch_y + threadIdx.x] = outd;
}

__global__ void backPropagateDeltasFC_B(float *outdelta, float *instates,
    unsigned int ncon, unsigned int nrneur, float *partial){
  int px=pitch_x>>2;
  unsigned int stride_x = GET_STRIDE(nrneur,px);
  unsigned int stride_y = GET_STRIDE(ncon,pitch_y);
  float outd = 0.0;
  int size=stride_x/px;
  int idx=blockIdx.x*pitch_y+threadIdx.x;
  if(idx==0); //true only for block and thread 0
  else{
    for(int i=0;i<size;i++)
      outd+=partial[i*stride_y + idx];
    outdelta[idx-1] = outd * DSIGMOIDF(instates[idx-1]); //-1 accounts for the bias ...
  }
}
Kernel 1. The bi-dimensional grid is divided into blocks of warp (32) threads.
The kernel starts by reading a patch of 32 × 32 values from W. The stride
of the shared memory block is 33 (warp +1), thus avoiding all bank conflicts
and significantly improving speed. Next, 32 input delta values are read and all
memory locations that do not correspond to real neurons (because of vertical
striding) are zero-padded to avoid branching in subsequent computations. The
number of elements is fixed to warp size, and the computing loop is unrolled for
further speedups. Before finishing, each thread writes its own partial dot product
to global memory.
Kernel 2. This kernel completes BP of deltas by summing up partial deltas
computed by the previous kernel. It multiplies the final result by the derivative
of the activation function applied to the current neuron’s state, and writes the
new delta to global memory.
Weight Updating
The algorithm (Listing 23.4) starts by reading the appropriate delta, and pre-
computes all repetitive expressions. Then the first 16 threads read the states from
global memory into shared memory. The “bias neuron” with constant activation
1.0 is dealt with by conditional statements, which could be avoided through
expressions containing the conditions. Once threads are synchronized, each single
thread updates 16 weights in a fixed unrolled loop.
Listing 23.4. Weights adjustment kernel
__global__ void adjustWeightsFC(float *states, float *deltas, float *weights,
    float eta, unsigned int ncon, unsigned int nrneur){
  const int pitch_y=16;
  const int threads=256;
  unsigned int px = pitch_x >> 2;
  unsigned int stride_x = GET_STRIDE(nrneur,px);
  float etadeltak = eta*deltas[blockIdx.x*threads+threadIdx.x], t;
  int b=blockIdx.y*stride_x*pitch_y + threads*blockIdx.x + threadIdx.x;
  __shared__ float st[pitch_y]; //for states
  int cond1 = blockIdx.y || threadIdx.x;
  int cond2 = (blockIdx.y+1)*pitch_y<=ncon;
  int size = cond2*pitch_y + !cond2*(ncon%pitch_y);
  if(threadIdx.x<pitch_y) st[threadIdx.x] = cond1*states[blockIdx.y*pitch_y + threadIdx.x - 1] + !cond1;
  __syncthreads();

  if (blockIdx.x*threads + threadIdx.x < nrneur){
    #pragma unroll 16
    for (int j=0; j<16; j++){
      t=weights[b];
      t-=etadeltak * st[j];
      weights[b]=t;
      b+=stride_x;}}
}
References
[1] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: Neural Information Processing Systems (2006)
[2] Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
[3] Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
[4] Chellapilla, K., Shilman, M., Simard, P.: Combining Multiple Classifiers for Faster
Optical Character Recognition. In: Bunke, H., Spitz, A.L. (eds.) DAS 2006. LNCS,
vol. 3872, pp. 358–367. Springer, Heidelberg (2006)
[5] Chellapilla, K., Puri, S., Simard, P.: High performance convolutional neural net-
works for document processing. In: International Workshop on Frontiers in Hand-
writing Recognition (2006)
[6] Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep, big, simple
neural nets for handwritten digit recognition. Neural Computation 22(12), 3207–
3220 (2010)
[7] Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Handwritten Digit
Recognition with a Committee of Deep Neural Nets on GPUs. Technical Re-
port IDSIA-03-11, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, IDSIA
(2011)
[8] Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Convolutional neu-
ral network committees for handwritten character recognition. In: International
Conference on Document Analysis and Recognition, pp. 1135–1139 (2011)
[9] Ciresan, D.C., Meier, U., Masci, J., Schmidhuber, J.: A committee of neural net-
works for traffic sign classification. In: International Joint Conference on Neural
Networks, pp. 1918–1921 (2011)
[10] Ciresan, D.C., Meier, U., Masci, J., Gambardella, L.M., Schmidhuber, J.: Flexi-
ble, high performance convolutional neural networks for image classification. In:
International Joint Conference on Artificial Intelligence, pp. 1237–1242 (2011)
[11] Ciresan, D.C., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for
image classification. In: Computer Vision and Pattern Recognition, pp. 3642–3649
(2012)
[12] Ciresan, D.C., Meier, U., Masci, J., Schmidhuber, J.: Multi-column deep neural
network for traffic sign classification. Neural Networks 32, 333–338 (2012)
[13] Decoste, D., Scholkopf, B.: Training invariant support vector machines. Machine
Learning (46), 161–190 (2002)
[14] Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neu-
ral networks. Science 313 (2006)
[15] Hinton, G.E.: To recognize shapes, first learn to generate images. Computational
Neuroscience: Theoretical Insights into Brain Function (2007)
[16] Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Diploma
thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität
München (1991),
http://www7.informatik.tu-muenchen.de/~hochreit; advisor: J. Schmidhuber
[17] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9,
1735–1780 (1997)
[18] Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in re-
current nets: the difficulty of learning long-term dependencies. In: Kremer, S.C.,
Kolen, J.F. (eds.) A Field Guide to Dynamical Recurrent Neural Networks. IEEE
Press (2001)
[19] Lauer, F., Suen, C., Bloch, G.: A trainable feature extractor for handwritten digit
recognition. Pattern Recognition (40), 1816–1824 (2007)
[20] LeCun, Y.: Une procédure d’apprentissage pour réseau a seuil asymmetrique (a
learning scheme for asymmetric threshold networks). In: Proceedings of Cognitiva
1985, Paris, France, pp. 599–604 (1985)
[21] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied
to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
[22] Meier, U., Ciresan, D.C., Gambardella, L.M., Schmidhuber, J.: Better digit recog-
nition with a committee of simple neural nets. In: ICDAR, pp. 1135–1139 (2011)
[23] Mohamed, A., Dahl, G., Hinton, G.E.: Deep belief networks for phone recognition.
In: Proc. of NIPS 2009 Workshop on Deep Learning for Speech Recognition and
Related Applications (2009)
[24] Nair, V., Hinton, G.E.: 3D object recognition with deep belief nets. In: Advances
in Neural Information Processing Systems (2009)
[25] NVIDIA: NVIDIA CUDA. Reference Manual, vol. 2.3. NVIDIA (2009)
[26] Ranzato, M., Poultney, C., Chopra, S., LeCun, Y.: Efficient learning of sparse
representations with an energy-based model. In: Platt, J., et al. (eds.) Advances
in Neural Information Processing Systems (NIPS 2006). MIT Press (2006)
[27] Ranzato, M., Huang, F.J., Boureau, Y.-L., LeCun, Y.: Unsupervised learning of invariant
feature hierarchies with applications to object recognition. In: Proc. of Computer
Vision and Pattern Recognition Conference (2007)
[28] Ruetsch, G., Micikevicius, P.: Optimizing matrix transpose in cuda. In: NVIDIA
GPU Computing SDK, pp. 1–2 (2009)
[29] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations
by error propagation. In: Parallel Distributed Processing: Explorations in the Mi-
crostructure of Cognition, vol. 1: Foundations, pp. 318–362. MIT Press, Cambridge
(1986)
[30] Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 2nd edn.
Prentice-Hall, Englewood Cliffs (2003)
[31] Salakhutdinov, R., Hinton, G.: Learning a nonlinear embedding by preserving class
neighborhood structure. In: Proc. of the International Conference on Artificial
Intelligence and Statistics, vol. 11 (2007)
[32] Scherer, D., Behnke, S.: Accelerating large-scale convolutional neural networks
with parallel graphics multiprocessors. In: Proc. of NIPS 2009 Workshop on Large-
Scale Machine Learning: Parallelism and Massive Datasets (2009)
[33] Simard, P., Steinkraus, D., Platt, J.C.: Best practices for convolutional neural net-
works applied to visual document analysis. In: Seventh International Conference
on Document Analysis and Recognition, pp. 958–963 (2003)
[34] Steinkraus, D., Simard, P.Y.: GPUs for machine learning algorithms. In: Interna-
tional Conference on Document Analysis and Recognition, pp. 1115–1120 (2005)
[35] Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences. PhD thesis, Harvard University (1974)
24
A Practical Guide to Training Restricted
Boltzmann Machines
Geoffrey E. Hinton
University of Toronto, Toronto, Ontario M5S 3G4
Department of Computer Science
hinton@cs.toronto.edu
Abstract. Restricted Boltzmann machines (RBMs) have been used as
generative models of many different types of data. RBMs are usually
trained using the contrastive divergence learning procedure. This requires
a certain amount of practical experience to decide how to set the val-
ues of numerical meta-parameters. Over the last few years, the machine
learning group at the University of Toronto has acquired considerable
expertise at training RBMs and this guide is an attempt to share this
expertise with other machine learning researchers.
24.1 Introduction
Restricted Boltzmann machines (RBMs) have been used as generative models of
many different types of data including labeled or unlabeled images [7], windows of
mel-cepstral coefficients that represent speech [12], bags of words that represent
documents [15], and user ratings of movies [17]. In their conditional form they
can be used to model high-dimensional temporal sequences such as video or
motion capture data [20] or speech [11]. Their most important use is as learning
modules that are composed to form deep belief nets [7].
RBMs are usually trained using the contrastive divergence learning procedure
[5]. This requires a certain amount of practical experience to decide how to set
the values of numerical meta-parameters such as the learning rate, the momen-
tum, the weight-cost, the sparsity target, the initial values of the weights, the
number of hidden units and the size of each mini-batch. There are also decisions
to be made about what types of units to use, whether to update their states
stochastically or deterministically, how many times to update the states of the
hidden units for each training case, and whether to start each sequence of state
updates at a data-vector. In addition, it is useful to know how to monitor the
progress of learning and when to terminate the training.
For any particular application, the code that was used gives a complete spec-
ification of all of these decisions, but it does not explain why the decisions were
made or how minor changes will affect performance. More significantly, it does
not provide a novice user with any guidance about how to make good decisions
for a new application. This requires some sensible heuristics and the ability to
relate failures of the learning to the decisions that caused those failures.
Over the last few years, the machine learning group at the University of
Toronto has acquired considerable expertise at training RBMs and this guide is
an attempt to share this expertise with other machine learning researchers.
24.2 An Overview of Restricted Boltzmann Machines
and Contrastive Divergence
Skip this section if you already know about RBMs
Consider a training set of binary vectors which we will assume are binary im-
ages for the purposes of explanation. The training set can be modeled using a
two-layer network called a “Restricted Boltzmann Machine” [18, 2, 5] in which
stochastic, binary pixels are connected to stochastic, binary feature detectors us-
ing symmetrically weighted connections. The pixels correspond to “visible” units
of the RBM because their states are observed; the feature detectors correspond
to “hidden” units. A joint configuration, (v, h) of the visible and hidden units
has an energy [9] given by:
E(v, h) = -\sum_{i \in \text{visible}} a_i v_i - \sum_{j \in \text{hidden}} b_j h_j - \sum_{i,j} v_i h_j w_{ij}    (24.1)
where v_i, h_j are the binary states of visible unit i and hidden unit j, a_i, b_j
are their biases and w_ij is the weight between them. The network assigns a
probability to every possible pair of a visible and a hidden vector via this energy
function:
p(v, h) = \frac{1}{Z} e^{-E(v,h)}    (24.2)
where the “partition function”, Z, is given by summing over all possible pairs of
visible and hidden vectors:
Z = \sum_{v,h} e^{-E(v,h)}    (24.3)
The probability that the network assigns to a visible vector, v, is given by
summing over all possible hidden vectors:

p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}    (24.4)
The probability that the network assigns to a training image can be raised by
adjusting the weights and biases to lower the energy of that image and to raise
the energy of other images, especially those that have low energies and therefore
make a big contribution to the partition function. The derivative of the log
probability of a training vector with respect to a weight is surprisingly simple.
\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model}    (24.5)
where the angle brackets are used to denote expectations under the distribution
specified by the subscript that follows. This leads to a very simple learning rule
for performing stochastic steepest ascent in the log probability of the training
data:
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{model} \right)    (24.6)

where ε is a learning rate.
Because there are no direct connections between hidden units in an RBM, it
is very easy to get an unbiased sample of ⟨v_i h_j⟩_data. Given a randomly selected
training image, v, the binary state, h_j, of each hidden unit, j, is set to 1 with
probability

p(h_j = 1 \mid v) = \sigma\left( b_j + \sum_i v_i w_{ij} \right)    (24.7)

where σ(x) is the logistic sigmoid function 1/(1 + exp(−x)). v_i h_j is then an
unbiased sample.
Because there are no direct connections between visible units in an RBM, it
is also very easy to get an unbiased sample of the state of a visible unit, given a
hidden vector:

p(v_i = 1 \mid h) = \sigma\left( a_i + \sum_j h_j w_{ij} \right)    (24.8)
Getting an unbiased sample of ⟨v_i h_j⟩_model, however, is much more difficult. It
can be done by starting at any random state of the visible units and perform-
ing alternating Gibbs sampling for a very long time. One iteration of alternating
Gibbs sampling consists of updating all of the hidden units in parallel using equa-
tion 24.7 followed by updating all of the visible units in parallel using equation
24.8.
A much faster learning procedure was proposed in [5]. This starts by setting
the states of the visible units to a training vector. Then the binary states of the
hidden units are all computed in parallel using equation 24.7. Once binary states
have been chosen for the hidden units, a “reconstruction” is produced by setting
each v_i to 1 with a probability given by equation 24.8. The change in a weight
is then given by
\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_\text{data} - \langle v_i h_j \rangle_\text{recon} \right)    (24.9)
A simplified version of the same learning rule that uses the states of individual
units instead of pairwise products is used for the biases.
The learning works well even though it is only crudely approximating the
gradient of the log probability of the training data [5]. The learning rule is much
more closely approximating the gradient of another objective function called the
Contrastive Divergence [5] which is the difference between two Kullback-Leibler
divergences, but it ignores one tricky term in this objective function so it is not
even following that gradient. Indeed, Sutskever and Tieleman have shown that
it is not following the gradient of any function [19]. Nevertheless, it works well
enough to achieve success in many significant applications.
RBMs typically learn better models if more steps of alternating Gibbs sam-
pling are used before collecting the statistics for the second term in the learning
rule, which will be called the negative statistics. CD_n will be used to denote
learning using n full steps of alternating Gibbs sampling.
24.3 How to Collect Statistics When Using Contrastive
Divergence
To begin with, we shall assume that all of the visible and hidden units are binary.
Other types of units will be discussed in section 24.13. We shall also assume that
the purpose of the learning is to create a good generative model of the set of
training vectors. When using RBMs to learn Deep Belief Nets (see the article on
Deep Belief Networks at www.scholarpedia.org) that will subsequently be fine-
tuned using backpropagation, the generative model is not the ultimate objective
and it may be possible to save time by underfitting it, but we will ignore that
here.
24.3.1 Updating the Hidden States
Assuming that the hidden units are binary and that you are using CD_1, the
hidden units should have stochastic binary states when they are being driven by
a data-vector. The probability of turning on a hidden unit, j, is computed by
applying the logistic function σ(x) = 1/(1 + exp(−x)) to its “total input”:
p(h_j = 1) = \sigma\left( b_j + \sum_i v_i w_{ij} \right)    (24.10)
and the hidden unit turns on if this probability is greater than a random number
uniformly distributed between 0 and 1.
It is very important to make these hidden states binary, rather than using
the probabilities themselves. If the probabilities are used, each hidden unit can
communicate a real-value to the visible units during the reconstruction. This
seriously violates the information bottleneck created by the fact that a hidden
unit can convey at most one bit (on average). This information bottleneck acts
as a strong regularizer.
For the last update of the hidden units, it is silly to use stochastic binary states
because nothing depends on which state is chosen. So use the probability itself
to avoid unnecessary sampling noise. When using CD_n, only the final update of
the hidden units should use the probability.
24.3.2 Updating the Visible States
Assuming that the visible units are binary, the correct way to update the visible
states when generating a reconstruction is to stochastically pick a 1 or 0 with a
probability determined by the total top-down input:
p_i = p(v_i = 1) = \sigma\left( a_i + \sum_j h_j w_{ij} \right)    (24.11)
However, it is common to use the probability, p_i, instead of sampling a binary
value. This is not nearly as problematic as using probabilities for the data-driven
hidden states, and it reduces sampling noise, thus allowing faster learning. There
is some evidence that it leads to slightly worse density models (Tijmen Tieleman,
personal communication, 2008). This probably does not matter when using an
RBM to pretrain a layer of hidden features for use in a deep belief net.
24.3.3 Collecting the Statistics Needed for Learning
Assuming that the visible units are using real-valued probabilities instead of
stochastic binary values, there are two sensible ways to collect the positive statis-
tics for the connection between visible unit i and hidden unit j:

⟨p_i h_j⟩_data   or   ⟨p_i p_j⟩_data

where p_j is a probability and h_j is a binary state that takes value 1 with prob-
ability p_j. Using h_j is closer to the mathematical model of an RBM, but using
p_j usually has less sampling noise, which allows slightly faster learning.¹
24.3.4 A Recipe for Getting the Learning Signal for CD_1
When the hidden units are being driven by data, always use stochastic binary
states. When they are being driven by reconstructions, always use probabilities
without sampling.
Assuming the visible units use the logistic function, use real-valued probabil-
ities for both the data and the reconstructions.²
When collecting the pairwise statistics for learning weights or the individual
statistics for learning biases, use the probabilities, not the binary states, and
make sure the weights have random initial values to break symmetry.
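As an illustration only, this recipe can be condensed into a few lines of NumPy. The sketch below is not from the chapter; it assumes a binary RBM stored as a weight matrix W (visible × hidden) plus two bias vectors, and all names, the learning rate, and the mini-batch layout (rows are cases, stored as floats) are invented for this example.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, vis_bias, hid_bias, lr=0.1):
    # One CD-1 update on a mini-batch; v_data rows are training cases (floats).
    n_cases = v_data.shape[0]
    # Data-driven hidden probabilities and stochastic binary states (eq. 24.10).
    h_prob = sigmoid(v_data @ W + hid_bias)
    h_state = (h_prob > np.random.rand(*h_prob.shape)).astype(v_data.dtype)
    # Reconstruction driven by the binary hidden states, using probabilities (eq. 24.11).
    v_recon = sigmoid(h_state @ W.T + vis_bias)
    # Reconstruction-driven hidden units also use probabilities, no sampling.
    h_recon = sigmoid(v_recon @ W + hid_bias)
    # Pairwise statistics use probabilities (section 24.3.3), averaged per case.
    pos = v_data.T @ h_prob / n_cases
    neg = v_recon.T @ h_recon / n_cases
    W += lr * (pos - neg)
    vis_bias += lr * (v_data.mean(axis=0) - v_recon.mean(axis=0))
    hid_bias += lr * (h_prob.mean(axis=0) - h_recon.mean(axis=0))
    return W, vis_bias, hid_bias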
24.4 The Size of a Mini-batch
It is possible to update the weights after estimating the gradient on a single
training case, but it is often more efficient to divide the training set into small
“mini-batches” of 10 to 100 cases.³ This allows matrix-matrix multiplies to be
used, which is very advantageous on GPU boards or in Matlab.
¹ Using h_j always creates more noise in the positive statistics than using p_j, but it
can actually create less noise in the difference of the positive and negative statistics
because the negative statistics depend on the binary decision for the state of j that
is used for creating the reconstruction. The probability of j when driven by the
reconstruction is highly correlated with the binary decision that was made for j
when it was driven by the data.
² So there is nothing random about the generation of the reconstructions given the
binary states of the hidden units.
³ The word “batch” is confusing and will be avoided because when it is used to contrast
with “on-line” it usually means the entire training set.
To avoid having to change the learning rate when the size of a mini-batch is
changed, it is helpful to divide the total gradient computed on a mini-batch by
the size of the mini-batch, so when talking about learning rates we will assume
that they multiply the average, per-case gradient computed on a mini-batch, not
the total gradient for the mini-batch.
It is a serious mistake to make the mini-batches too large when using stochastic
gradient descent (see chapters 1 and 18 for more details). Increasing the mini-
batch size by a factor of N leads to a more reliable gradient estimate, but it does
not increase the maximum stable learning rate by a factor of N, so the net effect
is that the weight updates are smaller per gradient evaluation.⁴
24.4.1 A Recipe for Dividing the Training Set into Mini-batches
For datasets that contain a small number of equiprobable classes, the ideal mini-
batch size is often equal to the number of classes and each mini-batch should
contain one example of each class to reduce the sampling error when estimat-
ing the gradient for the whole training set from a single mini-batch. For other
datasets, first randomize the order of the training examples, then use mini-batches
of about 10 cases.
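For the second case, a minimal sketch (illustrative only; the function name and use of NumPy's default random generator are choices made for this example):

import numpy as np

def make_minibatches(data, batch_size=10, rng=None):
    # Randomize the order of the training cases, then slice into mini-batches.
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(data))
    return [data[idx[i:i + batch_size]] for i in range(0, len(data), batch_size)]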
24.5 Monitoring the Progress of Learning
It is easy to compute the squared error between the data and the reconstructions,
so this quantity is often printed out during learning. The reconstruction error
on the entire training set should fall rapidly and consistently at the start of
learning and then more slowly. Due to the noise in the gradient estimates, the
reconstruction error on the individual mini-batches will fluctuate gently after
the initial rapid descent. It may also oscillate gently with a period of a few
mini-batches when using high momentum (see section 24.9).
Although it is convenient, the reconstruction error is actually a very poor
measure of the progress of learning. It is not the function that CD_n learning
is approximately optimizing, especially for n ≫ 1, and it systematically con-
founds two different quantities that are changing during the learning. The first
is the difference between the empirical distribution of the training data and the
equilibrium distribution of the RBM. The second is the mixing rate of the al-
ternating Gibbs Markov chain. If the mixing rate is very low, the reconstruction
error will be very small even when the distributions of the data and the model
are very different. As the weights increase, the mixing rate falls, so decreases
in reconstruction error do not necessarily mean that the model is improving
and, conversely, small increases do not necessarily mean the model is getting
worse. Large increases, however, are a bad sign except when they are temporary
and caused by changes in the learning rate, momentum, weight-cost or sparsity
meta-parameters.
⁴ The easy way to parallelize the learning on a cluster is to divide each mini-batch
into sub-mini-batches and to use different cores to compute the gradients on each
sub-mini-batch. The gradients computed by different cores must then be combined.
To minimize the ratio of communication to computation, it is tempting to make
the sub-mini-batches large. This usually makes the learning much less efficient, thus
wiping out much of the gain achieved by using multiple cores (Vinod Nair, personal
communication, 2007).
24.5.1 A Recipe for Using the Reconstruction Error
Use it but don’t trust it. If you really want to know what is going on during the
learning, use multiple histograms and graphic displays as described in section
24.15. Also consider using Annealed Importance Sampling [16] to estimate the
density on held out data. If you are learning a joint density model of labelled
data (see section 24.16), consider monitoring the discriminative performance on
the training data and on a held out validation set.
24.6 Monitoring the Overfitting
When learning a generative model, the obvious quantity to monitor is the prob-
ability that the current model assigns to a datapoint. When this probability
starts to decrease for held out validation data, it is time to stop learning. Unfor-
tunately, for large RBMs, it is very difficult to compute this probability because
it requires knowledge of the partition function. Nevertheless, it is possible to
directly monitor the overfitting by comparing the free energies of training data
and held out validation data. In this comparison, the partition function cancels
out. The free energy of a data vector can be computed in a time that is linear in
the number of hidden units (see section 24.16.1). If the model is not overfitting
at all, the average free energy should be about the same on training and valida-
tion data. As the model starts to overfit the average free energy of the validation
data will rise relative to the average free energy of the training data and this
gap represents the amount of overfitting.⁵
24.6.1 A Recipe for Monitoring the Overfitting
After every few epochs, compute the average free energy of a representative
subset of the training data and compare it with the average free energy of a
validation set. Always use the same subset of the training data. If the gap starts
growing, the model is overfitting, though the probability of the training data
may be growing even faster than the gap, so the probability of the validation
data may still be improving. Make sure that the same weights are used when
computing the two averages that you wish to compare.
⁵ The average free energies often change by large amounts during learning and this
means very little because the log partition function also changes by large amounts.
It is only differences in free energies that are easy to interpret without knowing the
partition function.
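The comparison in the recipe above can be coded in a few lines. The sketch below is illustrative and assumes the standard free-energy expression for a binary RBM, F(v) = −Σ_i v_i a_i − Σ_j log(1 + e^{x_j}) with x_j the total input to hidden unit j, which the chapter derives in section 24.16.1; the names are invented for this example.

import numpy as np

def mean_free_energy(v, W, vis_bias, hid_bias):
    # Average free energy of the rows of v for a binary RBM:
    # F(v) = -v.a - sum_j log(1 + exp(x_j)), with x_j = b_j + sum_i v_i w_ij.
    x = v @ W + hid_bias
    return float(np.mean(-(v @ vis_bias) - np.logaddexp(0.0, x).sum(axis=1)))

# After every few epochs, compare the same fixed training subset with a validation set,
# using the same weights for both; a growing (validation - training) gap indicates overfitting.
# gap = mean_free_energy(valid_set, W, a, b) - mean_free_energy(train_subset, W, a, b)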
24.7 The Learning Rate
If the learning rate is much too large, the reconstruction error usually increases
dramatically and the weights may explode.
If the learning rate is reduced while the network is learning normally, the
reconstruction error will usually fall significantly. This is not necessarily a good
thing. It is due, in part, to the smaller noise level in the stochastic weight updates
and it is generally accompanied by slower learning in the long term. Towards
the end of learning, however, it typically pays to decrease the learning rate.
Averaging the weights across several updates is an alternative way to remove
some of the noise from the final weights.
24.7.1 A Recipe for Setting the Learning Rates for Weights and
Biases
A good rule of thumb for setting the learning rate (Max Welling, personal com-
munication, 2002) is to look at a histogram of the weight updates and a his-
togram of the weights. The updates should be about 10⁻³ times the weights (to
within about an order of magnitude). When a unit has a very large fan-in, the
updates should be smaller since many small changes in the same direction can
easily reverse the sign of the gradient. Conversely, for biases, the updates can be
bigger.
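A rough, illustrative check of this rule of thumb (not from the chapter; the function name is invented):

import numpy as np

def update_to_weight_ratio(delta_W, W, eps=1e-12):
    # Should be roughly 1e-3; a much larger ratio suggests the learning rate is too high.
    return float(np.mean(np.abs(delta_W)) / (np.mean(np.abs(W)) + eps))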
24.8 The Initial Values of the Weights and Biases
The weights are typically initialized to small random values chosen from a zero-
mean Gaussian with a standard deviation of about 0.01. Using larger random
values can speed the initial learning, but it may lead to a slightly worse final
model. Care should be taken to ensure that the initial weight values do not allow
typical visible vectors to drive the hidden unit probabilities very close to 1 or
0 as this significantly slows the learning. If the statistics used for learning are
stochastic, the initial weights can all be zero since the noise in the statistics will
make the hidden units become different from one another even if they all have
identical connectivities.
It is usually helpful to initialize the bias of visible unit i to log[p_i/(1 − p_i)]
where p_i is the proportion of training vectors in which unit i is on. If this is not
done, the early stage of learning will use the hidden units to make unit i turn on
with a probability of approximately p_i.
When using a sparsity target probability of t (see section 24.11), it makes
sense to initialize the hidden biases to be log[t/(1 − t)]. Otherwise, initial hidden
biases of 0 are usually fine. It is also possible to start the hidden units with quite
large negative biases of about −4 as a crude way of encouraging sparsity.
24.8.1 A Recipe for Setting the Initial Values of the Weights and
Biases
Use small random values for the weights chosen from a zero-mean Gaussian with
a standard deviation of 0.01. Set the hidden biases to 0. Set the visible biases
to log[p_i/(1 − p_i)] where p_i is the proportion of training vectors in which unit i
is on. Look at the activities of the hidden units occasionally to check that they
are not always on or off.
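A sketch of this initialization (illustrative names; the clipping that protects log[p_i/(1 − p_i)] against units that are always off or always on in the training data is an added safeguard, not part of the recipe):

import numpy as np

def init_rbm(n_visible, n_hidden, train_data, rng=None):
    rng = rng or np.random.default_rng()
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # zero-mean Gaussian, std 0.01
    hid_bias = np.zeros(n_hidden)
    p = np.clip(train_data.mean(axis=0), 1e-6, 1 - 1e-6)   # fraction of cases with unit i on
    vis_bias = np.log(p / (1 - p))
    return W, vis_bias, hid_bias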
24.9 Momentum
Momentum is a simple method for increasing the speed of learning when the
objective function contains long, narrow and fairly straight ravines with a gentle
but consistent gradient along the floor of the ravine and much steeper gradients
up the sides of the ravine. The momentum method simulates a heavy ball rolling
down a surface. The ball builds up velocity along the floor of the ravine, but
not across the ravine because the opposing gradients on opposite sides of the
ravine cancel each other out over time. Instead of using the estimated gradient
times the learning rate to increment the values of the parameters, the momentum
method uses this quantity to increment the velocity, v, of the parameters and
the current velocity is then used as the parameter increment.
The velocity of the ball is assumed to decay with time and the “momentum”
meta-parameter, α, is the fraction of the previous velocity that remains after
computing the gradient on a new mini-batch:

\Delta \theta_i(t) = v_i(t) = \alpha\, v_i(t-1) - \epsilon \frac{dE}{d\theta_i}(t)    (24.12)

If the gradient remains constant, the terminal velocity will exceed ε dE/dθ_i by a
factor of 1/(1 − α). This is a factor of 10 for a momentum of 0.9, which is a typ-
ical setting of this meta-parameter. The temporal smoothing in the momentum
method avoids the divergent oscillations across the ravine that would be caused
by simply increasing the learning rate by a factor of 1/(1 − α).
The momentum method causes the parameters to move in a direction that is
not the direction of steepest descent, so it bears some resemblance to methods like
conjugate gradient, but the way it uses the previous gradients is much simpler.
Unlike methods that use different learning rates for each parameter, momentum
works just as well when the ravines are not aligned with the parameter axes.
An alternative way of viewing the momentum method (Tijmen Tieleman,
personal communication, 2008) is as follows: It is equivalent to increasing the
learning rate by a factor of 1/(1 − α) but delaying the full effect of each gradient
estimate by dividing the full increment into a series of exponentially decaying
installments. This gives the system time to respond to the early installments
by moving to a region of parameter space that has opposing gradients before it
feels the full effect of the increment. This, in turn, allows the learning rate to be
larger without causing unstable oscillations.
At the start of learning, the random initial parameter values may create very
large gradients and the system is unlikely to be in the floor of a ravine, so it is
usually best to start with a low momentum of 0.5 for a number of parameter
updates. This very conservative momentum typically makes the learning more
stable than no momentum at all by damping oscillations across ravines [4].
24.9.1 A Recipe for Using Momentum
Start with a momentum of 0.5. Once the large initial progress in the reduction
of the reconstruction error has settled down to gentle progress, increase the mo-
mentum to 0.9. This shock may cause a transient increase in the reconstruction
error. If this causes a more lasting instability, keep reducing the learning rate by
factors of 2 until the instability disappears.
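Put together, a velocity update following equation 24.12 and the schedule above might look as follows (illustrative sketch; grad is the average per-case CD learning signal, an ascent direction equivalent to −dE/dθ, and the names are invented for this example):

def momentum_step(param, velocity, grad, lr, momentum):
    # Eq. 24.12: the scaled gradient increments the velocity; the velocity increments the parameters.
    velocity = momentum * velocity + lr * grad
    param = param + velocity
    return param, velocity

# Schedule from the recipe: momentum = 0.5 early on, 0.9 once the initial progress has settled;
# if raising it destabilizes learning, halve the learning rate until learning is stable again.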
24.10 Weight-Decay
Weight-decay works by adding an extra term to the normal gradient. The extra
term is the derivative of a function that penalizes large weights. The simplest
penalty function, called “L2”, is half of the sum of the squared weights times a
coefficient which will be called the weight-cost.
It is important to multiply the derivative of the penalty term by the learning
rate. Otherwise, changes in the learning rate change the function that is being
optimized rather than just changing the optimization procedure.
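For example (illustrative sketch, with an L2 weight-cost; the penalty derivative is placed inside the same learning-rate multiplication as the CD gradient, so changing the learning rate does not change the function being optimized):

def weight_update_with_decay(W, cd_grad, lr, weight_cost=1e-4):
    # cd_grad: average per-case CD learning signal (positive minus negative statistics).
    # The L2 penalty derivative (weight_cost * W) is multiplied by the learning rate too.
    W += lr * (cd_grad - weight_cost * W)
    return W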
There are four different reasons for using weight-decay in an RBM. The first
is to improve generalization to new data by reducing overfitting to the training
data.⁶ The second is to make the receptive fields of the hidden units smoother
and more interpretable by shrinking useless weights. The third is to “unstick”
hidden units that have developed very large weights early in the training and
are either always firmly on or always firmly off. A better way to allow such units
to become useful again is to use a “sparsity” target as described in section 24.11.
The fourth reason is to improve the mixing rate of the alternating Gibbs
Markov chain. With small weights, the Markov chain mixes more rapidly.⁷ The
CD learning procedure is based on ignoring derivatives that come from later
steps in the Markov chain (Hinton, Osindero and Teh, 2006), so it tends to
approximate maximum likelihood learning better when the mixing is fast. The
ignored derivatives are then small for the following reason: When a Markov chain
is very close to its stationary distribution, the best parameters for modeling
samples from the chain are very close to its current parameters.
⁶ Since the penalty is applied on every mini-batch, Bayesians really ought to divide
the weight-cost by the size of the training set. They can then interpret weight-decay
as the effect of a Gaussian weight prior whose variance is independent of the size of
the training set. This division is typically not done. Instead, larger weight-costs are
used for smaller training sets.
⁷ With all zero weights, it reaches its rather boring stationary distribution in a single
full step.
A different form of weight-decay, called “L1”, is to use the derivative of the sum
of the absolute values of the weights. This often causes many of the weights to
become exactly zero whilst allowing a few of the weights to grow quite large. This
can make it easier to interpret the weights. When learning features for images,
for example, L1 weight-decay often leads to strongly localized receptive fields.
An alternative way to control the size of the weights is to impose a maximum
allowed value on the sum of the squares or absolute values of the incoming
weights for each unit. After each weight update, the weights are rescaled if they
exceed this maximum value. This helps to avoid hidden units getting stuck with
extremely small weights, but a sparsity target is probably a better way to avoid
this problem.
24.10.1 A Recipe for Using Weight-Decay
For an RBM, sensible values for the weight-cost coefficient for L2 weight-decay
typically range from 0.01 to 0.00001. Weight-cost is typically not applied to the
hidden and visible biases because there are far fewer of these so they are less
likely to cause overfitting. Also, the biases sometimes need to be quite large.
Try an initial weight-cost of 0.0001. If you are using Annealed Importance
Sampling [16] to estimate the density on a held-out validation set, try adjusting
the weight-cost by factors of 2 to optimize density. Small differences in weight-
cost are unlikely to cause big differences in performance. If you are training
a joint density model that allows you to test discriminative performance on a
validation set this can be used in place of the density for optimizing the weight-
cost. However, in either case, remember that weight-decay does more than just
prevent overfitting. It also increases the mixing rate, which makes CD learning
a better approximation to maximum likelihood. So even if overfitting is not a
problem because the supply of training data is infinite, weight-decay can still be
helpful.
24.11 Encouraging Sparse Hidden Activities
Hidden units that are only rarely active are typically easier to interpret than
those that are active about half of the time. Also, discriminative performance is
sometimes improved by using features that are only rarely active [13].
Sparse activities of the binary hidden units can be achieved by specifying
a “sparsity target” which is the desired probability of being active, p ≪ 1.
An additional penalty term is then used to encourage the actual probability of
being active, q, to be close to p. q is estimated by using an exponentially decaying
average of the mean probability that a unit is active in each mini-batch:

q_\text{new} = \lambda\, q_\text{old} + (1 - \lambda)\, q_\text{current}    (24.13)

where q_current is the mean activation probability of the hidden unit on the cur-
rent mini-batch.
The natural penalty measure to use is the cross entropy between the desired
and actual distributions:
\text{Sparsity penalty} \propto -p \log q - (1 - p) \log(1 - q)    (24.14)
For logistic units this has a simple derivative of q − p with respect to the total
input to a unit. This derivative, scaled by a meta-parameter called “sparsity-
cost”, is used to adjust both the bias and the incoming weights of each hidden
unit. It is important to apply the same derivative to both. If the derivative
is only applied to the bias, for example, the bias will typically keep becoming
more negative to ensure the hidden unit is rarely on, but the weights will keep
becoming more positive to make the unit more useful.
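A sketch of this bookkeeping (illustrative; the exact way the visible activities enter the weight part of the penalty gradient, shown in the usage comment, is one reasonable choice and not prescribed by the chapter):

import numpy as np

def sparsity_signal(q_old, h_prob, target=0.05, lam=0.95, sparsity_cost=0.01):
    # h_prob: hidden probabilities on the current mini-batch (cases x hidden units).
    q_new = lam * q_old + (1.0 - lam) * h_prob.mean(axis=0)   # eq. 24.13
    grad = sparsity_cost * (q_new - target)                   # derivative q - p of eq. 24.14
    return q_new, grad

# Apply the same signal to the biases and to the incoming weights of each hidden unit:
# hid_bias -= lr * grad
# W        -= lr * np.outer(v_data.mean(axis=0), grad)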
24.11.1 A Recipe for Sparsity
Set the sparsity target to between 0.01 and 0.1.⁸ Set the decay-rate, λ, of the
estimated value of q to be between 0.9 and 0.99. Histogram the mean activities
of the hidden units and set the sparsity-cost so that the hidden units have mean
probabilities in the vicinity of the target. If the probabilities are tightly clustered
around the target value, reduce the sparsity-cost so that it interferes less with
the main objective of the learning.
24.12 The Number of Hidden Units
Intuitions derived from discriminative machine learning are a bad guide for deter-
mining a sensible number of hidden units. In discriminative learning, the amount
of constraint that a training case imposes on the parameters is equal to the num-
ber of bits that it takes to specify the label. Labels usually contain very few bits
of information, so using more parameters than training cases will typically cause
severe overfitting. When learning generative models of high-dimensional data,
however, it is the number of bits that it takes to specify a data vector that
determines how much constraint each training case imposes on the parameters
of the model. This can be several orders of magnitude greater than the number of
bits required to specify a label. So it may be quite reasonable to fit a million
parameters to 10,000 training images if each image contains 1,000 pixels. This
would allow 1000 globally connected hidden units. If the hidden units are locally
connected or if they use weight-sharing, many more can be used.
24.12.1 A Recipe for Choosing the Number of Hidden Units
Assuming that the main issue is overfitting rather than the amount of computa-
tion at training or test time, estimate how many bits it would take to describe
each data-vector if you were using a good model (i.e. estimate the typical negative
log₂ probability of a data-vector under a good model). Then multiply that
estimate by the number of training cases and use a number of parameters that
is about an order of magnitude smaller. If you are using a sparsity target that is
very small, you may be able to use more hidden units. If the training cases are
highly redundant, as they typically will be for very big training sets, you need
to use fewer parameters.
⁸ If you are only using the sparsity target to revive hidden units that are never active
and suppress hidden units that are always active, a target value of 0.5 makes sense
(even though it makes nonsense of the name).
24.13 Different Types of Unit
RBM’s were developed using binary visible and hidden units, but many other
types of unit can also be used. A general treatment for units in the exponential
family is given in [24]. The main use of other types of unit is for dealing with
data that is not well-modeled by binary (or logistic) visible units.
24.13.1 Softmax and Multinomial Units
For a binary unit, the probability of turning on is given by the logistic sigmoid
function of its total input, x:

p = \sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^{x}}{e^{x} + e^{0}}    (24.15)
The energy contributed by the unit is −x if it is on and 0 if it is off. Equation
24.15 makes it clear that the probability of each of the two possible states is
proportional to the negative exponential of its energy. This can be generalized
to K alternative states.
p_j = \frac{e^{x_j}}{\sum_{i=1}^{K} e^{x_i}}    (24.16)
This is often called a “softmax” unit. It is the appropriate way to deal with
a quantity that has K alternative values which are not ordered in any way.
A softmax can be viewed as a set of binary units whose states are mutually
constrained so that exactly one of the K states has value 1 and the rest have
value 0. When viewed in this way, the learning rule for the binary units in a
softmax is identical to the rule for standard binary units. The only difference
is in the way the probabilities of the states are computed and the samples are
taken.
A further generalization of the softmax unit is to sample N times (with re-
placement) from the probability distribution instead of just sampling once. The
K different states can then have integer values bigger than 1, but the values
must add to N. This is called a multinomial unit and, again, the learning rule
is unchanged.
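As an illustration (not from the chapter; names are invented), the probabilities of equation 24.16 and a sample can be computed as follows. With n_samples = 1 this behaves as an ordinary softmax unit; with n_samples = N it is the multinomial unit just described, producing integer counts that sum to N.

import numpy as np

def softmax_sample(x, n_samples=1, rng=None):
    # x: total inputs to the K alternative states of one unit.
    rng = rng or np.random.default_rng()
    p = np.exp(x - np.max(x))   # subtracting the max only improves numerical stability
    p /= p.sum()                # eq. 24.16
    return p, rng.multinomial(n_samples, p)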
24.13.2 Gaussian Visible Units
For data such as patches of natural images or the Mel-Cepstrum coefficients used
to represent speech, logistic units are a very poor representation. One solution
is to replace the binary visible units by linear units with independent Gaussian
noise. The energy function then becomes:
E(v, h) = \sum_{i \in \text{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2} - \sum_{j \in \text{hid}} b_j h_j - \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}    (24.17)
where σ_i is the standard deviation of the Gaussian noise for visible unit i.
It is possible to learn the variance of the noise for each visible unit but this
is difficult using CD_1. In many applications, it is much easier to first normalise
each component of the data to have zero mean and unit variance and then to
use noise-free reconstructions, with the variance in equation 24.17 set to 1. The
reconstructed value of a Gaussian visible unit is then equal to its top-down input
from the binary hidden units plus its bias.
The learning rate needs to be about one or two orders of magnitude smaller
than when using binary visible units and some of the failures reported in the
literature are probably due to using a learning rate that is much too big. A
smaller learning rate is required because there is no upper bound to the size of a
component in the reconstruction and if one component becomes very large, the
weights emanating from it will get a very big learning signal. With binary hidden
and visible units, the learning signal for each training case must lie between −1
and 1, so binary-binary nets are much more stable.
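In code, the noise-free reconstruction of a Gaussian visible layer (data normalised to zero mean and unit variance, so σ_i = 1) is just a linear top-down pass; an illustrative sketch with invented names:

def gaussian_visible_reconstruction(h_state, W, vis_bias):
    # Top-down input from the binary hidden states plus the visible bias; no sampling noise.
    return h_state @ W.T + vis_bias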
24.13.3 Gaussian Visible and Hidden Units
If both the visible and the hidden units are Gaussian, the instability problems
become much worse. The individual activities are held close to their means by
quadratic “containment” terms with coefficients determined by the standard de-
viations of the assumed noise levels:
E(v, h) = \sum_{i \in \text{vis}} \frac{(v_i - a_i)^2}{2\sigma_i^2} + \sum_{j \in \text{hid}} \frac{(h_j - b_j)^2}{2\sigma_j^2} - \sum_{i,j} \frac{v_i}{\sigma_i}\, \frac{h_j}{\sigma_j}\, w_{ij}    (24.18)
If any of the eigenvalues of the weight matrix become sufficiently large, the
quadratic interaction terms can dominate the containment terms and there is
then no lower bound to the energy that can be achieved by scaling up the
activities in the direction of the corresponding eigenvector. With a sufficiently
small learning rate, CD_1 can detect and correct these directions, so it is possible
to learn an undirected version of a factor analysis model [10] using all Gaussian
units, but this is harder than using EM [3] to learn a directed model.
24.13.4 Binomial Units
A simple way to get a unit with noisy integer values in the range 0 to N is to
make N separate copies of a binary unit and give them all the same weights and
bias [21]. Since all copies receive the same total input, they all have the same
probability, p, of turning on and this only has to be computed once. The expected
number that are on is Np and the variance in this number is Np(1 − p). For
small p, this acts like a Poisson unit, but as p approaches 1 the variance becomes
small again, which may not be desirable. Also, for small values of p the growth
in p is exponential in the total input. This makes learning much less stable than
for the rectified linear units described in section 24.13.5.
One nice thing about using weight-sharing to synthesize a new type of unit
out of binary units is that the mathematics underlying binary-binary RBM’s
remains unchanged.
24.13.5 Rectified Linear Units
A small modification to binomial units makes them far more interesting as
models of real neurons and also more useful for practical applications. All
copies still have the same learned weight vector w and the same learned bias,
b, but each copy has a different, fixed offset to the bias. If the offsets are
−0.5, −1.5, −2.5, ..., −(N − 0.5), the sum of the probabilities of the copies is
extremely close to having a closed form:

\sum_{i=1}^{N} \sigma(x - i + 0.5) \approx \log(1 + e^{x})    (24.19)

where x = vw^T + b. So the total activity of all of the copies behaves like a
smoothed version of a rectified linear unit that saturates for sufficiently large
input. Even though log(1 + e^x) is not in the exponential family, we can model it
accurately using a set of binary units with shared weights and fixed bias offsets.
This set has no more parameters than an ordinary binary unit, but it provides
a much more expressive variable. The variance is σ(x) so units that are firmly
off do not create noise and the noise does not become large when x is large.
A drawback of giving each copy a bias that differs by a fixed offset is that the
logistic function needs to be used many times to get the probabilities required
for sampling an integer value correctly. It is possible, however, to use a fast
approximation in which the sampled value of the rectified linear unit is not
constrained to be an integer. Instead it is approximated by max(0, x + N(0, 1))
where N(0, 1) is Gaussian noise with zero mean and unit variance. This type of
rectified linear unit seems to work fine for either visible units or hidden units
when training with CD_1 [14].
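A sketch of this fast approximation (illustrative names; it follows the formula in the text above):

import numpy as np

def nrelu_sample(x, rng=None):
    # x: total input (v @ W + b); sample is max(0, x + N(0, 1)).
    rng = rng or np.random.default_rng()
    return np.maximum(0.0, x + rng.standard_normal(np.shape(x)))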
If both visible and hidden units are rectified linear, a much smaller learning
rate may be needed to avoid unstable dynamics in the activity or weight updates.
If the weight between two rectified linear units is greater than 1 there is no
lower bound to the energy that can be achieved by giving both units very high
activities so there is no proper probability distribution. Nevertheless, contrastive
divergence learning may still work provided the learning rate is low enough to
give the learning time to detect and correct directions in which the Markov chain
would blow up if allowed to run for many iterations. RBM’s composed of rectified
linear units are more stable than RBM’s composed of Gaussian units because
the rectification prevents biphasic oscillations of the weight dynamics in which
units alternate between very high positive activity for one mini-batch followed
by very high negative activity for the next mini-batch.
24.14 Varieties of Contrastive Divergence
Although CD_1 is not a very good approximation to maximum likelihood learn-
ing, this does not seem to matter when an RBM is being learned in order to
provide hidden features for training a higher-level RBM. CD_1 ensures that the
hidden features retain most of the information in the data vector and it is not
necessarily a good idea to use a form of CD that is a closer approximation to
maximum likelihood but is worse at retaining the information in the data vector.
If, however, the aim is to learn an RBM that is a good density or joint-density
model, CD_1 is far from optimal.
At the beginning of learning, the weights are small and mixing is fast, so CD_1
provides a good approximation to maximum likelihood. As the weights grow,
the mixing gets worse and it makes sense to gradually increase the n in CD_n
[1, 17]. When n is increased, the difference of pairwise statistics that is used for
learning will increase so it may be necessary to reduce the learning rate.
A more radical departure from CD
1
is called “persistent contrastive diver-
gence” [22]. Instead of initializing each alternating Gibbs Markov chain at a
datavector, which is the essence of CD learning, we keep track of the states of
a number of persistent chains or “fantasy particles”. Each persistent chain has
its hidden and visible states updated one (or a few) times after each weight
update. The learning signal is then the difference between the pairwise statis-
tics measured on a mini-batch of data and the pairwise statistics measured on
the persistent chains. Typically the number of persistent chains is the same as
the size of a mini-batch, but there is no good reason for this. The persistent
chains mix surprisingly fast because the weight-updates repel each chain from
its current state by raising the energy of that state [23].
When using persistent CD, the learning rate typically needs to be quite a lot
smaller and the early phase of the learning is much slower in reducing the recon-
struction error. In the early phase of learning the persistent chains often have
very correlated states, but this goes away with time. The final reconstruction
error is also typically larger than with CD_1 because persistent CD is, asymp-
totically, performing maximum likelihood learning rather than trying to make
the distribution of the one-step reconstructions resemble the distribution of the
data. Persistent CD learns significantly better models than CD_1 or even CD_10
[22] and is the recommended method if the aim is to build the best density model
of the data.
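As an illustration, here is a minimal NumPy sketch of one persistent-CD update for a binary RBM; the variable names and the single Gibbs sweep per weight update are illustrative choices, not prescriptions from the text:

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def pcd_update(v_data, v_fantasy, W, a, b, lr=1e-3):
    """One persistent contrastive divergence step for a binary RBM.

    v_data    : mini-batch of data vectors, shape (B, n_vis)
    v_fantasy : states of the persistent chains, shape (B, n_vis)
    W, a, b   : weights (n_vis, n_hid), visible biases, hidden biases
    """
    # Data-dependent statistics (hidden probabilities given the data).
    h_data = sigm(v_data @ W + b)
    # Update the persistent chains with one step of alternating Gibbs sampling.
    h_f = (sigm(v_fantasy @ W + b) > rng.random(h_data.shape)).astype(float)
    v_fantasy = (sigm(h_f @ W.T + a) > rng.random(v_fantasy.shape)).astype(float)
    h_model = sigm(v_fantasy @ W + b)
    # Learning signal: difference of pairwise statistics (data minus fantasy).
    B = v_data.shape[0]
    W += lr * (v_data.T @ h_data - v_fantasy.T @ h_model) / B
    a += lr * (v_data - v_fantasy).mean(axis=0)
    b += lr * (h_data - h_model).mean(axis=0)
    return v_fantasy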
Persistent CD can be improved by adding to the standard parameters an
overlay of “fast weights” which learn very rapidly but also decay very rapidly
[23]. These fast weights improve the mixing of the persistent chains. However,
the use of fast weights introduces yet more meta-parameters and will not be
discussed further here.
24.15 Displaying What Is Happening during Learning
There are many ways in which learning can go wrong and most of the common
problems are easy to diagnose with the right graphical displays. The three types
of display described below give much more insight into what is happening than
simply monitoring the reconstruction error.
Histograms of the weights, the visible biases and the hidden biases are very
useful. In addition, it is useful to examine histograms of the increments to these
parameters when they are updated, though it is wasteful to make these his-
tograms after every update.
For domains in which the visible units have spatial or temporal structure (e.g.
images or speech) it is very helpful to display, for each hidden unit, the weights
connecting that hidden unit to the visible units. These “receptive” fields are a
good way of visualizing what features the hidden units have learned. When dis-
playing the receptive fields of many hidden units it can be very misleading to use
different scales for different hidden units. Gray-scale displays of receptive fields
are usually less pretty but much more informative than false colour displays.
For a single minibatch, it is very useful to see a two-dimensional, gray-scale
display with a range of [0,1] that shows the probability of each binary hidden
unit on each training case in a mini-batch (if there are more than a few hundred
hidden units, just use a subset of them). This immediately allows you to
see if some hidden units are never used or if some training cases activate an
unusually large or small number of hidden units. It also shows how certain the
hidden units are. When learning is working properly, this display should look
thoroughly random without any obvious vertical or horizontal lines. Histograms
can be used instead of this display, but it takes quite a few histograms to convey
the same information.
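A minimal sketch of such a display, assuming matplotlib is available and a logistic hidden layer; the function name is an illustrative choice:

import numpy as np
import matplotlib.pyplot as plt

def show_hidden_probabilities(v_batch, W, b):
    """Gray-scale display of hidden probabilities for one mini-batch.

    Rows index training cases, columns index hidden units; values lie in [0, 1].
    A healthy display looks thoroughly random, with no obvious rows or columns.
    """
    p_hidden = 1.0 / (1.0 + np.exp(-(v_batch @ W + b)))
    plt.imshow(p_hidden, cmap="gray", vmin=0.0, vmax=1.0, aspect="auto")
    plt.xlabel("hidden unit")
    plt.ylabel("training case")
    plt.show()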
24.16 Using RBM’s for Discrimination
There are three obvious ways of using RBMs for discrimination. The first is to
use the hidden features learned by the RBM as the inputs for some standard
discriminative method. This will not be discussed further here, though it is
probably the most important way of using RBM’s, especially when many layers
of hidden features are learned unsupervised before starting on the discriminative
training.
The second method is to train a separate RBM on each class. After training,
the free energy of a test vector, t, is computed (see subsection 24.16.1) for each
class-specific RBM. The log probability that the RBM trained on class c assigns
to the test vector is given by:
$$\log p(t \mid c) = -F_c(t) - \log Z_c \qquad (24.20)$$
where $Z_c$ is the partition function of that RBM. Since each class-specific RBM
will have a different, unknown partition function, the free energies cannot be
used directly for discrimination. However, if the number of classes is small it
is easy to deal with the unknown log partition functions by simply training a
“softmax” model (on a separate training set) to predict the class from the free
energies of all of the class-specific RBMs:
$$\log p(\mathrm{class} = c \mid t) = \frac{e^{-F_c(t) - \log \hat{Z}_c}}{\sum_d e^{-F_d(t) - \log \hat{Z}_d}} \qquad (24.21)$$
where the $\hat{Z}$ are parameters that are learned by maximum likelihood training of
the softmax. Of course, equation 24.21 can also be used to learn the weights and
biases of each RBM but this requires a lot of data to avoid overfitting. Combining
the discriminative gradients for the weights and biases that come from equation
24.21 with the approximate gradients that come from contrastive divergence will
often do better than either method alone. The approximate gradient produced
by contrastive divergence acts as a strong regularizer to prevent overfitting and
the discriminative gradient ensures that there is some pressure to use the weights
and biases in a way that helps discrimination.
The third method is to train a joint density model using a single RBM that has
two sets of visible units. In addition to the units that represent a data vector,
there is a “softmax” label unit that represents the class. After training, each
possible label is tried in turn with a test vector and the one that gives lowest
free energy is chosen as the most likely class. The partition function is not a
problem here, since it is the same for all classes. Again, it is possible to combine
discriminative and generative training of the joint RBM by using discriminative
gradients that are the derivatives of the log probability of the correct class [6]:
$$\log p(\mathrm{class} = c \mid t) = \frac{e^{-F_c(t)}}{\sum_d e^{-F_d(t)}} \qquad (24.22)$$
24.16.1 Computing the Free Energy of a Visible Vector
The free energy of visible vector v is the energy that a single configuration would
need to have in order to have the same probability as all of the configurations
that contain v:
$$e^{-F(v)} = \sum_{h} e^{-E(v,h)} \qquad (24.23)$$
It is also given by the expected energy minus the entropy:
$$F(v) = -\sum_i v_i a_i - \sum_j p_j x_j + \sum_j \big(p_j \log p_j + (1-p_j)\log(1-p_j)\big) \qquad (24.24)$$
where $x_j = b_j + \sum_i v_i w_{ij}$ is the total input to hidden unit j and $p_j = \sigma(x_j)$ is
the probability that $h_j = 1$ given v. A good way to compute F(v) is to use yet
another expression for the free energy:
$$F(v) = -\sum_i v_i a_i - \sum_j \log(1 + e^{x_j}) \qquad (24.25)$$
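A minimal NumPy sketch of Eq. 24.25, together with the softmax-over-free-energies classification of Eq. 24.21; passing the learned log-partition offsets as a vector is an illustrative interface, not something prescribed by the text:

import numpy as np

def free_energy(v, W, a, b):
    """Free energy of a visible vector v (Eq. 24.25).

    W : weights (n_vis, n_hid), a : visible biases, b : hidden biases.
    """
    x = v @ W + b                                   # total inputs to the hidden units
    return -(v @ a) - np.sum(np.log1p(np.exp(x)))   # -sum_i v_i a_i - sum_j log(1 + e^{x_j})

def class_probabilities(t, class_rbms, log_z_hat):
    """Softmax over negative free energies plus learned log-partition offsets (Eq. 24.21).

    class_rbms : list of (W, a, b) tuples, one RBM per class.
    log_z_hat  : learned offsets log Z-hat_c, one per class.
    """
    scores = np.array([-free_energy(t, *rbm) - lz
                       for rbm, lz in zip(class_rbms, log_z_hat)])
    scores -= scores.max()                          # numerical stability
    p = np.exp(scores)
    return p / p.sum()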
24.17 Dealing with Missing Values
In a directed belief net it is very easy to deal with missing values for visible
variables. When performing inference, a missing value that is at the receiving
end of a directed connection has no effect on the units that send connections to
it. This is not true for the undirected connections used in an RBM. To perform
inference in the standard way, the missing value must first be filled in and there
are at least two ways to do this.
A particularly simple type of missing value occurs when learning a joint den-
sity for data in which each training case is composed of a vector v such as an
image plus a single discrete label. If the label is missing from a subset of the
cases, it can be Gibbs sampled from its exact conditional distribution. This is
done by computing the free energy (see section 24.16.1) for each possible value of
the label and then picking label l with probability proportional to exp(−F(l, v)).
After this, the training case is treated just like a complete training case.
For real-valued visible units, there is a different way to impute missing values
that still works well even if several values are missing from the same training
case [8]. If the learning cycles through the training set many times, the missing
values can be treated in just the same way as the other parameters. Starting
with a sensible initial guess, a missing value is updated each time the weights
are updated, but possibly using a different learning rate. The update for the
missing value for visible unit i on training case c is:
$$\Delta v_i^c = \frac{\partial F}{\partial \hat{v}_i^c} - \frac{\partial F}{\partial v_i^c} \qquad (24.26)$$
where $v_i^c$ is the imputed value and $\hat{v}_i^c$ is the reconstruction of the imputed value.
Momentum can be used for imputed values in just the same way as it is used
for the usual parameters.
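A minimal sketch of this imputation step, using the free energy of Eq. 24.25; treating the update exactly as Eq. 24.26 with a separate learning rate is an assumption of this sketch:

import numpy as np

def dF_dv(v, W, a, b):
    """Gradient of the free energy (Eq. 24.25) with respect to the visible vector."""
    p = 1.0 / (1.0 + np.exp(-(v @ W + b)))   # hidden probabilities
    return -a - p @ W.T

def impute_step(v, v_recon, missing_idx, W, a, b, lr=0.01):
    """Update imputed values for the missing visible units (Eq. 24.26)."""
    grad = dF_dv(v_recon, W, a, b) - dF_dv(v, W, a, b)
    v[missing_idx] += lr * grad[missing_idx]
    return v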
There is a more radical way of dealing with missing values that can be used
when the number of missing values is very large. This occurs, for example, with
user preference data where most users do not express their preference for most
objects [17]. Instead of trying to impute the missing values, we pretend they
do not exist by using RBM’s with different numbers of visible units for different
training cases. The different RBMs form a family of different models with shared
weights. Each RBM in the family can now do correct inference for its hidden
states, but the tying of the weights means that they may not be ideal for any
particular RBM. Adding a visible unit for a missing value and then performing
correct inference that integrates out this missing value does not give the same
distribution for the hidden units as simply omitting the visible unit, which is
why this is a family of models rather than just one model. When using a family
of models to deal with missing values, it can be very helpful to scale the hidden
biases by the number of visible units in the RBM [15].
Acknowledgements. This research was supported by NSERC and the Cana-
dian Institute for Advanced Research. Many of my past and present graduate
students and postdocs have made valuable contributions to the body of practi-
cal knowledge described in this chapter. I have tried to acknowledge particularly
valuable contributions in the chapter, but I cannot always recall who suggested
what.
References
[1] Carreira-Perpignan, M.A., Hinton, G.E.: On contrastive divergence learning. In:
Artificial Intelligence and Statistics (2005)
[2] Freund, Y., Haussler, D.: Unsupervised learning of distributions on binary vectors
using two layer networks. In: Advances in Neural Information Processing Systems
4, pp. 912–919. Morgan Kaufmann, San Mateo (1992)
[3] Ghahramani, Z., Hinton, G.: The EM algorithm for mixtures of factor analyzers.
Technical Report CRG-TR-96-1, University of Toronto (May 1996)
[4] Hinton, G.E.: Relaxation and its role in vision. PhD Thesis (1978)
[5] Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural Computation 14(8), 1771–1800 (2002)
[6] Hinton, G.E.: To recognize shapes, first learn to generate images. In: Computa-
tional Neuroscience: Theoretical Insights into Brain Function (2007)
[7] Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief
nets. Neural Computation 18(7), 1527–1554 (2006)
[8] Hinton, G.E., Osindero, S., Welling, M., Teh, Y.: Unsupervised discovery of non-
linear structure using contrastive backpropagation. Cognitive Science 30, 725–731
(2006b)
[9] Hopfield, J.J.: Neural networks and physical systems with emergent collective com-
putational abilities. Proceedings of the National Academy of Sciences 79, 2554–
2558 (1982)
[10] Marks, T.K., Movellan, J.R.: Diffusion networks, product of experts, and factor
analysis. In: Proc. Int. Conf. on Independent Component Analysis, pp. 481–485
(2001)
[11] Mohamed, A.R., Hinton, G.E.: Phone recognition using restricted boltzmann ma-
chines. In: ICASSP 2010 (2010)
[12] Mohamed, A.R., Dahl, G., Hinton, G.E.: Deep belief networks for phone recogni-
tion. In: NIPS 22 Workshop on Deep Learning for Speech Recognition (2009)
[13] Nair, V., Hinton, G.E.: 3-d object recognition with deep belief nets. In: Advances
in Neural Information Processing Systems, vol. 22, pp. 1339–1347 (2009)
[14] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann ma-
chines. In: Proc. 27th International Conference on Machine Learning (2010)
[15] Salakhutdinov, R.R., Hinton, G.E.: Replicated softmax: An undirected topic
model. In: Advances in Neural Information Processing Systems, vol. 22 (2009)
[16] Salakhutdinov, R.R., Murray, I.: On the quantitative analysis of deep belief net-
works. In: Proceedings of the International Conference on Machine Learning,
vol. 25, pp. 872–879 (2008)
[17] Salakhutdinov, R.R., Mnih, A., Hinton, G.E.: Restricted Boltzmann machines for
collaborative filtering. In: Ghahramani, Z. (ed.) Proceedings of the International
Conference on Machine Learning, vol. 24, pp. 791–798. ACM (2007)
[18] Smolensky, P.: Information processing in dynamical systems: Foundations of har-
mony theory. In: Rumelhart, D.E., McClelland, J.L. (eds.) Parallel Distributed
Processing, vol. 1, ch. 6, pp. 194–281. MIT Press, Cambridge (1986)
[19] Sutskever, I., Tieleman, T.: On the convergence properties of contrastive divergence.
In: Proceedings of the 13th International Conference on Artificial Intelligence and
Statistics (AISTATS), Sardinia, Italy (2010)
[20] Taylor, G., Hinton, G.E., Roweis, S.T.: Modeling human motion using binary
latent variables. In: Advances in Neural Information Processing Systems. MIT
Press (2006)
[21] Teh, Y.W., Hinton, G.E.: Rate-coded restricted Boltzmann machines for face
recognition. In: Advances in Neural Information Processing Systems, vol. 13, pp.
908–914 (2001)
[22] Tieleman, T.: Training restricted Boltzmann machines using approximations to
the likelihood gradient. In: Proceedings of the Twenty-first International Confer-
ence on Machine Learning (ICML 2008). ACM (2008)
[23] Tieleman, T., Hinton, G.E.: Using fast weights to improve persistent contrastive
divergence. In: Proceedings of the 26th International Conference on Machine
Learning, pp. 1033–1040. ACM, New York (2009)
[24] Welling, M., Rosen-Zvi, M., Hinton, G.E.: Exponential family harmoniums with
an application to information retrieval. In: Advances in Neural Information Pro-
cessing Systems, pp. 1481–1488. MIT Press, Cambridge (2005)
25
Deep Boltzmann Machines and the Centering Trick
Grégoire Montavon¹ and Klaus-Robert Müller¹,²
¹ Technische Universität Berlin, Machine Learning Group, 10587 Berlin, Germany
² Korea University, Department of Brain and Cognitive Engineering,
Anam-dong, Seongbuk-gu, Seoul 136-713, Korea
{gregoire.montavon,klaus-robert.mueller}@tu-berlin.de
Abstract. Deep Boltzmann machines are in theory capable of learning
efficient representations of seemingly complex data. Designing an algo-
rithm that effectively learns the data representation can be subject to
multiple difficulties. In this chapter, we present the “centering trick” that
consists of rewriting the energy of the system as a function of centered
states. The centering trick improves the conditioning of the underlying
optimization problem and makes learning more stable, leading to models
with better generative and discriminative properties.
Keywords: Deep Boltzmann machine, centering, reparameterization,
unsupervised learning, optimization, representations.
25.1 Introduction
Deep Boltzmann machines are undirected networks of interconnected units that
learn a joint probability density over these units by adapting connections between
them. They are in theory capable of learning statistically and computationally
efficient representations of seemingly complex data distributions.
Designing an algorithm that effectively learns the data representation can be
subject to multiple difficulties. Deep Boltzmann machines are sensitive to the
parameterization of their energy function. In addition, the gradient of the opti-
mization problem is not directly accessible and must instead be approximated
stochastically by continuously querying the model throughout training.
In this chapter, we present the “centering trick” that consists of rewriting the
energy function of the deep Boltzmann machine as a function of centered states.
We argue that centering improves the conditioning of the optimization prob-
lem and facilitates the emergence of complex structures in the deep Boltzmann
machine.
We demonstrate on the MNIST dataset that the centering trick allows mid-
sized deep Boltzmann machines to be trained faster and to produce a solution
which is a good generative model of data but also distills interesting discrimina-
tive features in the top layer.
25.2 Boltzmann Machines
In this section, we give some background on the Boltzmann machine [6]. We will
use the following notation: the sigmoid function is defined as $\mathrm{sigm}(x) = \frac{e^x}{e^x + 1}$,
$x \sim \mathcal{B}(p)$ denotes that the variable x is drawn randomly from a Bernoulli dis-
tribution of parameter p, and $\langle \cdot \rangle_P$ denotes the expectation operator with respect
to a probability distribution P. All these operations apply element-wise to the
input if the latter is a vector.
Fig. 25.1. Examples of Boltzmann machines used in practice (from left to right: fully
connected Boltzmann machine, deep Boltzmann machine (DBM), locally connected
DBM (LC-DBM)), with visible units depicted in gray and hidden units depicted in
white. The layered structure of a DBM is interesting because a particular representation
of data forms at each layer, possibly enabling the emergence of interesting statistics.
A Boltzmann machine is a network of $M_x$ interconnected binary units that
associates to each state $x \in \{0, 1\}^{M_x}$ the probability
$$p(x; \theta) = \frac{e^{-E(x;\theta)}}{\sum_{\xi} e^{-E(\xi;\theta)}}.$$
The term in the denominator is called the partition function and makes
probabilities sum to one. The function
$$E(x; \theta) = -\tfrac{1}{2}\, x^\top W x - x^\top b$$
is the energy of the state x given the model parameters θ = (W, b). From these
equations, we can interpret a good model of data as a model θ that has low energy
in regions of high data density and high energy elsewhere. The matrix W of size
$M_x \times M_x$ is symmetric and contains the connection strengths between units.
The vector b of size $M_x$ contains the biases associated to each unit. The diagonal
of W is constrained to be zero. Units are either visible units (representing the
sensory input) or hidden units (representing latent variables that are not directly
observable but contribute to explaining data).
From the equations above, we can derive the conditional probability of each
unit being activated given the other units
$$p(x_i = 1 \mid x_{-i}; \theta) = \mathrm{sigm}\Big(b_i + \sum_{j \neq i} W_{ij}\, x_j\Big)$$
where $x_{-i}$ denotes the set of all units but $x_i$. The gradient of the data log-
likelihood with respect to model parameters W and b takes the form
$$\Big\langle \frac{\partial}{\partial W} \log p(x_{\mathrm{vis}}; \theta) \Big\rangle_{\mathrm{data}} = \langle x x^\top \rangle_{\mathrm{data}} - \langle x x^\top \rangle_{\mathrm{model}} \qquad (25.1)$$
$$\Big\langle \frac{\partial}{\partial b} \log p(x_{\mathrm{vis}}; \theta) \Big\rangle_{\mathrm{data}} = \langle x \rangle_{\mathrm{data}} - \langle x \rangle_{\mathrm{model}} \qquad (25.2)$$
where $x_{\mathrm{vis}}$ are the visible units (i.e. the subset of units that represent the sen-
sory input). The terms $\langle \cdot \rangle_{\mathrm{data}}$ and $\langle \cdot \rangle_{\mathrm{model}}$ are respectively the data-dependent
expectations (obtained by conditioning the joint distribution on the observed
state of the visible units) and the data-independent expectations obtained by
sampling freely from the joint probability distribution.
25.2.1 Deep Boltzmann Machines
It is often desirable to incorporate some predefined structure into the Boltzmann
machine. The most common way to achieve this is to remove certain connections
in the network, that is, forcing parts of the matrix W to zero. Examples of Boltz-
mann machines with different structures are shown in Figure 25.1. For example,
in the deep Boltzmann machine (DBM) [12], units are organized in a deep lay-
ered manner where only adjacent layers communicate and where units within
the same layer are disconnected. Locally connected deep Boltzmann machines
add further constraints to the model by forcing the modeling of the interaction
between remote parts of the input to take place in the top layer.
The special layered structure of the DBM and its multiple variants has two
advantages: First, particular statistics can emerge at each layer that may capture
interesting features of data. Second, the layered structure of the DBM can be
folded into a bipartite graph (one side containing odd layers and the other side
containing even layers) where each side of the graph is conditionally independent
given the other side.
In the case of the deep Boltzmann machine shown in Figure 25.2 (left) with
$M_x$, $M_y$ and $M_z$ units at each layer, the energy associated to each state
$(x, y, z) \in \{0, 1\}^{M_x + M_y + M_z}$ is
$$E(x, y, z; \theta) = -y^\top W x - z^\top V y - x^\top a - y^\top b - z^\top c$$
where θ = {W, V, a, b, c} groups the model parameters. The bipartite graph
structure of the deep Boltzmann machine implies that an efficient alternating
Gibbs sampler can be derived:
$$x \sim \mathcal{B}(\mathrm{sigm}(W^\top y + a)), \quad
z \sim \mathcal{B}(\mathrm{sigm}(V y + c)), \quad
y \sim \mathcal{B}(\mathrm{sigm}(W x + V^\top z + b)). \qquad (25.3)$$
A similar alternating Gibbs sampler can be used for sampling states when the
input units x are clamped to the data:
$$z \sim \mathcal{B}(\mathrm{sigm}(V y + c)), \quad
y \sim \mathcal{B}(\mathrm{sigm}(W x_{\mathrm{data}} + V^\top z + b)). \qquad (25.4)$$
These alternating Gibbs samplers are illustrated in Figure 25.2 (right) and allow
us to collect the data-independent and data-dependent statistics that intervene
in the computation of the gradient (see Equation 25.1 and 25.2).
Fig. 25.2. On the left, diagram of a two-layer deep Boltzmann machine along with
its parameters. On the right, different sampling methods: (i) the path followed by the
alternating Gibbs sampler and (ii) the path followed by the alternating Gibbs sampler
when the input is clamped to data.
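A minimal NumPy sketch of the two alternating Gibbs samplers of Eq. 25.3 and 25.4; array shapes and function names are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda u: 1.0 / (1.0 + np.exp(-u))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def gibbs_free(x, y, z, W, V, a, b, c):
    """One sweep of the alternating Gibbs sampler of Eq. 25.3 (all units free).

    W has shape (M_y, M_x) and V has shape (M_z, M_y).
    """
    y = bernoulli(sigm(W @ x + V.T @ z + b))
    x = bernoulli(sigm(W.T @ y + a))
    z = bernoulli(sigm(V @ y + c))
    return x, y, z

def gibbs_clamped(x_data, y, z, W, V, b, c):
    """One sweep of the sampler of Eq. 25.4 with the input clamped to the data."""
    y = bernoulli(sigm(W @ x_data + V.T @ z + b))
    z = bernoulli(sigm(V @ y + c))
    return y, z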
25.3 Training Boltzmann Machines
While Equations 25.1 and 25.2 provide an exact gradient for maximizing the
log-likelihood of data, keeping track of data statistics and model statistics is
computationally demanding. The mixing rate of the model (i.e. the speed at
which the alternating Gibbs sampler converges to the model’s true distribution)
is typically slow and implies that we need to resort to some approximation.
Collecting data-dependent statistics is relatively easy as the complexity of
the distribution is reduced by the clamping of visible units to the data. In the
case where hidden units are independent when conditioned on the visible units,
sampling can be achieved exactly in only one pass of the Gibbs sampler. This is
the case of the restricted Boltzmann machine [4] presented in Chapter 24 [5]. In
practice, when the number of hidden-to-hidden connections is relatively low or
the connections are not particularly strong, reasonable approximations can be
obtained by running a few steps of the alternating Gibbs sampler.
Collecting data-independent statistics is much harder and typically requires
hundreds or thousands of iterations before converging to the true probability
distribution. A workaround to the problem is to approximate these statistics
by a small set of persistent chains (or “free particles”) $\{x_1, \ldots, x_n\}$ that are
continuously updated throughout training. This idea called persistent contrastive
divergence has been proposed by Tieleman [19].
The intuition behind persistent contrastive divergence is the following: let’s
first remember that the minima of the energy function correspond to high prob-
ability states and that the free particles are therefore inclined to descend the
energy function. As the model is trained, the energy of the free particles is
raised under the effect of the gradient update and free particles are encouraged
to slide down the “bump” created by the gradient update. The higher the learn-
ing rate, the higher the bumps, and the faster the particles are descending the
energy function. This implies that free particles are mixing much faster under
the effect of training than in a static setting.
When training a deep Boltzmann machine, at least two sources of instability
can be identified: (1) Approximation instability: The stochastic and approximate
nature of the learning algorithms described above implies that the estimation
of the gradient is noisy. The noise comes in part from the stochastic gradient
descent procedure, but principally from the approximate sampling procedure
that may cause systematically biased estimates of the gradient. (2) Structural
instability: As identified by Cho et al. [3], in standard Boltzmann
machines, the weight matrix W tends to model, in the first steps of the learn-
ing algorithm, a global bias instead of co-dependencies between units, as
we would expect. This is particularly problematic in the case of a Boltzmann
machine with hidden-to-hidden connections such as the DBM, because hidden
units tend to conglomerate and form a bias that may speed up learning initially
but that eventually destroys the learning signal between pairs of hidden units.
25.3.1 The Centering Trick
The centering trick attempts to mitigate these sources of instability by ensur-
ing that unit activations intervening in the computation of the gradient are
centered. Centering aims to produce a better conditioned optimization problem
that is more robust to the noise of the learning procedure and to avoid the use of
units as a global bias. Centering was already advocated in Chapter 1 [7] and 10
[17] in the context of backpropagation networks. Centering can be achieved by
rewriting the energy of the Boltzmann machine as a function of centered states:
Center the Boltzmann machine
$$E(x; \theta) = -\tfrac{1}{2}\, (x - \beta)^\top W (x - \beta) - (x - \beta)^\top b$$
The new variable β represents the offset associated to each unit of the network
and must be set to the mean activation of x. Setting $\beta = \mathrm{sigm}(b_0)$ where $b_0$ is the
initial bias ensures that units are initially centered. A similar parameterization
of the energy function has been proposed by Tang and Sutskever [18] where
the offset parameters were restricted to visible units. As we will see later, the
Boltzmann machine also benefits from centering the hidden units. From this
new energy function, we can derive the conditional probability of each unit in
the centered Boltzmann machine:
p(x
i
=1|x
i
; θ)=sigm(b
i
+
j=i
W
ij
(x β)
j
).
Similarly, the gradient of the model log-likelihood with respect to W and b now
takes the form:
∂W
log p(x
vis
; θ)
data
= (x β)(x β)
data
−(x β)(x β)
model
(25.5)
∂b
log p(x
vis
; θ)
data
= x β
data
−x β
model
. (25.6)
These gradients are similar to the enhanced gradients proposed by Cho et al. [3]
and to those arising from the parameterization proposed by Arnold et al. [1], with
the difference that our gradients do not account for the possibility that offsets
β deviate from the mean activations of units throughout training. If the latter
effect is problematic, it is possible to reparameterize the network continuously
or at regular intervals so that the offsets correspond to the new expected means
$\langle x \rangle_{\mathrm{data}}$. The reparameterization $\theta \to \theta'$ must leave the energy function invariant
up to a constant, that is, $E(x; \theta) = E(x; \theta') + \mathrm{const}$. Solving the equation under
the new centering constraints leads to the update equations $W' = W$,
$b' = b + W(\langle x \rangle_{\mathrm{data}} - \beta)$ and $\beta' = \langle x \rangle_{\mathrm{data}}$.
Update biases and offsets at regular intervals
$$b' = b + W(\langle x \rangle_{\mathrm{data}} - \beta)$$
$$\beta' = \langle x \rangle_{\mathrm{data}}$$
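A small NumPy sketch of this reparameterization, assuming a running estimate of the mean activations is available; the exponential moving average of rate ν is an illustrative choice (it mirrors the moving averages used in Figure 25.3 below):

import numpy as np

def recenter(W, b, beta, x_mean, nu=0.01):
    """Recenter the Boltzmann machine: b' = b + W(beta' - beta), beta' moved towards <x>_data.

    Moving the offsets by a small amount nu and compensating the bias keeps the
    energy function invariant up to a constant.
    """
    beta_new = (1.0 - nu) * beta + nu * x_mean
    b_new = b + W @ (beta_new - beta)
    return b_new, beta_new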
Similar derivations can be made for the deep Boltzmann machine. The energy
function of the deep Boltzmann machine becomes
$$E(x, y, z; \theta) = -(y-\beta)^\top W (x-\alpha) - (z-\gamma)^\top V (y-\beta) - (x-\alpha)^\top a - (y-\beta)^\top b - (z-\gamma)^\top c$$
where α, β and γ are the offsets associated to the units in each layer. A basic
algorithm for training a centered deep Boltzmann machine is given in Figure 25.3.
25.3.2 Understanding the Centering Trick
We look at the effect of centering on the stability of learning in a Boltzmann
machine. We argue that when the Boltzmann machine is centered, the optimiza-
tion problem is better conditioned (see Figure 25.5), more precisely, the ratio
between the highest and the lowest eigenvalue of the Hessian H is smaller. We
define ξ as the centered state ξ = x − β. Substituting x − β by ξ in Equation 25.5,
the derivative of the data log-likelihood with respect to the weight parameter
becomes
$$\Big\langle \frac{\partial}{\partial W} \log p(x; \theta) \Big\rangle_{\mathrm{data}} = \langle \xi\xi^\top \rangle_{W,\mathrm{data}} - \langle \xi\xi^\top \rangle_{W}$$
Training a centered deep Boltzmann machine

W, V = 0, 0
a, b, c = sigm⁻¹(⟨x⟩_data), b_0, c_0
α, β, γ = sigm(a), sigm(b), sigm(c)
initialize free particle (x_m, y_m, z_m) = (α, β, γ)
loop
    initialize data particle (x_d, y_d, z_d) = (pick(data), β, γ)
    loop
        y_d ∼ B(sigm(W(x_d − α) + V^⊤(z_d − γ) + b))
        z_d ∼ B(sigm(V(y_d − β) + c))
    end loop
    y_m ∼ B(sigm(W(x_m − α) + V^⊤(z_m − γ) + b))
    x_m ∼ B(sigm(W^⊤(y_m − β) + a))
    z_m ∼ B(sigm(V(y_m − β) + c))
    W = W + η [(y_d − β)(x_d − α)^⊤ − (y_m − β)(x_m − α)^⊤]
    V = V + η [(z_d − γ)(y_d − β)^⊤ − (z_m − γ)(y_m − β)^⊤]
    a = a + η (x_d − x_m) + ν W^⊤(y_d − β)
    b = b + η (y_d − y_m) + ν W(x_d − α) + ν V^⊤(z_d − γ)
    c = c + η (z_d − z_m) + ν V(y_d − β)
    α = (1 − ν) α + ν x_d
    β = (1 − ν) β + ν y_d
    γ = (1 − ν) γ + ν z_d
end loop
Fig. 25.3. Basic algorithm for training a two-layer centered deep Boltzmann machine.
The algorithm is based on persistent contrastive divergence and is kept minimal for
the sake of simplicity. The variable η is the learning rate and the variable ν is the rate
of the moving average necessary for the reparameterization. A Python implementation
of the algorithm is available at
http://gregoire.montavon.name/code/dbm.py.
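A compact NumPy sketch of the training loop in Figure 25.3 (this is not the authors' reference implementation linked above); the single-sample "mini-batch", the number of Gibbs sweeps, and all array shapes are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda u: 1.0 / (1.0 + np.exp(-u))
bern = lambda p: (rng.random(p.shape) < p).astype(float)

def train_centered_dbm(data, My=200, Mz=25, eta=4e-4, nu=0.01, steps=10000):
    """Train a two-layer centered DBM with persistent contrastive divergence."""
    Mx = data.shape[1]
    W, V = np.zeros((My, Mx)), np.zeros((Mz, My))
    x_mean = data.mean(axis=0).clip(1e-3, 1 - 1e-3)
    a, b, c = np.log(x_mean / (1 - x_mean)), np.zeros(My), np.zeros(Mz)
    alpha, beta, gamma = sigm(a), sigm(b), sigm(c)
    xm, ym, zm = alpha.copy(), beta.copy(), gamma.copy()      # free particle

    for _ in range(steps):
        xd = data[rng.integers(len(data))]                    # data particle
        yd, zd = beta.copy(), gamma.copy()
        for _ in range(3):                                    # data-dependent Gibbs sweeps
            yd = bern(sigm(W @ (xd - alpha) + V.T @ (zd - gamma) + b))
            zd = bern(sigm(V @ (yd - beta) + c))
        ym = bern(sigm(W @ (xm - alpha) + V.T @ (zm - gamma) + b))
        xm = bern(sigm(W.T @ (ym - beta) + a))
        zm = bern(sigm(V @ (ym - beta) + c))

        W += eta * (np.outer(yd - beta, xd - alpha) - np.outer(ym - beta, xm - alpha))
        V += eta * (np.outer(zd - gamma, yd - beta) - np.outer(zm - gamma, ym - beta))
        a += eta * (xd - xm) + nu * (W.T @ (yd - beta))
        b += eta * (yd - ym) + nu * (W @ (xd - alpha)) + nu * (V.T @ (zd - gamma))
        c += eta * (zd - zm) + nu * (V @ (yd - beta))
        alpha = (1 - nu) * alpha + nu * xd                    # moving averages of the offsets
        beta = (1 - nu) * beta + nu * yd
        gamma = (1 - nu) * gamma + nu * zd
    return W, V, a, b, c, alpha, beta, gamma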
Fig. 25.4. Examples of sigmoids f(x) = sigm(x + b) − β with biases b = 2, 0, −2 and
offsets β = sigm(2), sigm(0), sigm(−2). This figure illustrates how setting β_0 = sigm(b_0)
ensures that the sigmoid crosses the origin initially and does not contribute to modeling
a bias component.
where ·
W
denotes the expectation with respect to the probability distribution
associated to a model of weight parameter W and ·
W,data
denotes the expecta-
tion of the same model with visible units clamped to data. Using the definition
of the directional derivative, the second derivative with respect to a random
direction V (which is equal to the projected Hessian HV ) can be expressed as:
$$HV = \frac{\partial}{\partial V}\Big\langle \frac{\partial}{\partial W} \log p(x; W) \Big\rangle_{\mathrm{data}}$$
$$= \lim_{h \to 0} \frac{1}{h}\left[\Big\langle \frac{\partial}{\partial W} \log p(x; W + hV) \Big\rangle_{\mathrm{data}} - \Big\langle \frac{\partial}{\partial W} \log p(x; W) \Big\rangle_{\mathrm{data}}\right]$$
$$= \lim_{h \to 0} \frac{1}{h}\left[\big(\langle \xi\xi^\top \rangle_{W+hV,\mathrm{data}} - \langle \xi\xi^\top \rangle_{W+hV}\big) - \big(\langle \xi\xi^\top \rangle_{W,\mathrm{data}} - \langle \xi\xi^\top \rangle_{W}\big)\right]$$
$$= \lim_{h \to 0} \frac{1}{h}\big[\langle \xi\xi^\top \rangle_{W+hV,\mathrm{data}} - \langle \xi\xi^\top \rangle_{W,\mathrm{data}}\big] - \lim_{h \to 0} \frac{1}{h}\big[\langle \xi\xi^\top \rangle_{W+hV} - \langle \xi\xi^\top \rangle_{W}\big]$$
From the last line, we can see that the Hessian can be decomposed into a data-
dependent term and a data-independent term. A remarkable fact is that in the ab-
sence of hidden units, the data-dependent part of the Hessian is zero, because
the model—and therefore, the perturbation of the model—have no influence on
the states. The conditioning of the optimization problem can therefore be ana-
lyzed exclusively from a model perspective. The data-dependent term is likely
to be small even in the presence of hidden variables due to the sharp reduction
of entropy caused by the clamping of visible units to data.
We can think of a well-conditioned model as a model for which a perturbation
of the model parameter W in any direction V causes a well-behaved perturbation
of state expectations $\langle \xi\xi^\top \rangle_W$. Pearlmutter [11] showed that in a Boltzmann
machine with no hidden units, the projected Hessian can be further reduced to
$$HV = \langle \xi\xi^\top \cdot D \rangle_{W} - \langle \xi\xi^\top \rangle_{W}\, \langle D \rangle_{W}
\quad \text{where} \quad D = -\tfrac{1}{2}\, \xi^\top V \xi \qquad (25.7)$$
thus getting rid of the limit and leading to numerically more accurate estimates.
Chapter 1 [7] shows that the stability of the optimization problem can be quanti-
fied by the condition number defined as the ratio between the largest eigenvalue
$\lambda_1$ and the smallest eigenvalue $\lambda_n$ of H. A geometrical interpretation of the
condition number is given in Figure 25.5 (left). A low rank approximation of the
Hessian can be obtained as
$$\hat{H} = H(V_0 \,|\, \ldots \,|\, V_n) = (HV_0 \,|\, \ldots \,|\, HV_n) \qquad (25.8)$$
where the columns of $(V_0 \,|\, \ldots \,|\, V_n)$ form a basis of independent unit vectors that
projects the Hessian onto a low-dimensional random subspace. The condition num-
ber can then be estimated by performing a singular value decomposition of the
projected Hessian $\hat{H}$ and taking the ratio between the largest and smallest re-
sulting eigenvalues.
We estimate the condition number $\lambda_1 / \lambda_n$ of a fully connected Boltzmann
machine of 50 units at initial state (W = 0) for different bias and offset pa-
λ_1/λ_n          b = 2     b = 0     b = −2
β = sigm(2)       1.98     12.65     51.42
β = sigm(0)      22.97      1.82     22.20
β = sigm(−2)     52.72     13.40      1.94

Fig. 25.5. Left: relation between the condition number λ_1/λ_n and the shape of the
optimization problem (a centered BM has a small λ_1/λ_n, a standard BM a large one).
Gradient descent is easier to achieve when the condition number is small. Right:
condition number obtained for centered Boltzmann machines (shown in bold on the
diagonal) and for non-centered deep Boltzmann machines (off-diagonal elements). It can
clearly be seen that the condition number is much smaller when the Boltzmann machine
is centered.
rameters b and β using Equation 25.7 and 25.8. The condition numbers are
obtained by drawing 100 random unit vectors for the projection matrix V and
for each of them, estimating the statistics by sampling 1000 independent states ξ.
Numerical estimates are given in Figure 25.5 (right) and clearly exhibit the bet-
ter conditioning occurring when the Boltzmann machine is centered (i.e. when
β = sigm(b)).
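A minimal NumPy sketch of this estimation procedure (Eq. 25.7 and 25.8) for a fully connected Boltzmann machine at initial state W = 0; the numbers of directions and samples follow the text, and everything else (function name, interface, random unit directions) is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)

def estimate_condition_number(b, beta, n_units=50, n_dirs=100, n_samples=1000):
    """Estimate lambda_1 / lambda_n of the Hessian at W = 0 for bias b and offset beta."""
    p = 1.0 / (1.0 + np.exp(-b)) * np.ones(n_units)        # at W = 0, units are independent
    HV_cols = []
    for _ in range(n_dirs):
        V = rng.standard_normal((n_units, n_units))
        V = (V + V.T) / 2.0
        np.fill_diagonal(V, 0.0)
        V /= np.linalg.norm(V)                              # random unit direction
        # Sample independent centered states xi = x - beta and apply Eq. 25.7.
        x = (rng.random((n_samples, n_units)) < p).astype(float)
        xi = x - beta
        D = -0.5 * np.einsum("ni,ij,nj->n", xi, V, xi)      # D = -1/2 xi^T V xi
        outer = np.einsum("ni,nj->nij", xi, xi)
        HV = (outer * D[:, None, None]).mean(0) - outer.mean(0) * D.mean()
        HV_cols.append(HV.reshape(-1))
    H_hat = np.column_stack(HV_cols)                        # low-rank projection (Eq. 25.8)
    s = np.linalg.svd(H_hat, compute_uv=False)
    return s[0] / s[-1]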
25.4 Evaluating Boltzmann Machines
In this section, we present two complementary approaches to evaluating a Boltz-
mann machine. The first method consists of looking at the discriminative com-
ponents built in different portions of the Boltzmann machine (e.g. layers of a
DBM) using kernels. The analysis is based on the work of Montavon et al. [8, 9]
that characterizes the representation that emerges from the learning algorithm
at each layer of a neural network. Second, we present a method introduced by
Salakhutdinov and Hinton [13] that measures the generative performance of the
Boltzmann machine in terms of log-likelihood of test data.
25.4.1 Discriminative Analysis
When Boltzmann machines incorporate a special structure, for example, via
restricted connectivity, it can be useful to measure the discriminative capability
of the representations emerging in specific portions of the network. Measuring
the discriminative capability of a group of units is non-trivial, because (1) the
meaningful information is not carried in a symbolic form but instead, distributed
across the multiple units of the group and (2) the mapping of each data point
onto these units is not deterministic but instead corresponds to the conditional
distribution over the units in the group given the input x. We present here a
method that exploits the insight that the projection of the input distribution
onto the group of units forms a kernel (with a well-defined feature space [16])
that can be analyzed with respect to certain properties (or labels) t. The method
was first introduced by Montavon et al. [8] in the context of backpropagation
networks. The method is based on the observation that leading kernel principal
components can be approximated up to high accuracy from a finite, typically
small set of samples [2]. We propose a family of kernels for representing the top
layer of a DBM:
A family of kernels for deep Boltzmann machines:
$$k(x, x') = E_{z,z'}[f(z|x,\, z'|x')]$$
where z|x and z′|x′ denote the random top-layer activities conditioned
respectively on data points x and x′ and where the function f(z, z′) is a
similarity metric between z and z′. Typical choices for f are:
  linear: $f(z, z') = \langle z, z' \rangle$
  radial basis function: $f(z, z') = \exp\big(-\|z - z'\|^2 / \sigma\big)$
  equality: $f(z, z') = 1_{\{z = z'\}}$
These kernels are able to gracefully deal with multimodal posteriors in the top-
level distribution as the expectation operator lies outside the “detection function”
f and therefore accounts for all possible modalities of z|x and z′|x′. Note
that since the units of the Boltzmann machine are binary, the norms $\|\cdot\|_1^1, \ldots, \|\cdot\|_p^p$
are equivalent. Also, the linear and equality functions correspond up to some
normalization to the extremal cases of the radial basis function kernel (with σ
very large or very small). Once a kernel has been chosen, the analysis proceeds
in four steps:
1. Collect a small test set X of size n × m and its associated label matrix T
of size n × c, and compute the empirical kernel K of size n × n. The kernel
can be built iteratively by running a Gibbs sampler on each data point and
taking the average of all kernels. Alternatively a moving average of the kernel
matrix can be maintained throughout training in order to keep track of the
discriminative performance.
2. Perform an eigendecomposition of the kernel matrix $K = U \Lambda U^\top$ where Λ
is a diagonal matrix representing the eigenvalues of K sorted by decreasing
order, and where the columns of U represent the eigenvectors associated to
each eigenvalue. These eigenvectors are the kernel principal components of
X with respect to the kernel k and form a non-linear subspace that spans
the main directions of variation in the data [15].
3. The representation is then evaluated by looking at how many kernel principal
components are capturing the task T. Let $U_{1..d}$ and $\Lambda_{1..d}$ be the matrices
containing respectively the d leading eigenvectors and eigenvalues. Compute
the projected outputs $Y_d = U_{1..d} U_{1..d}^\top T$. These predictions are the optimal
fit in the least square sense based on the d kernel principal components¹.
¹ Alternatively, if we would like to focus on the discrimination boundary between
classes, a logistic model of the type $\max_{\beta} \mathrm{trace}(\log(\mathrm{softmax}(\Phi_{1..d}\,\beta)) \cdot T^\top)$ can be fitted,
where $\Phi_{1..d} = U_{1..d}\Lambda_{1..d}^{0.5}$ is the empirical feature space and β of size d × c contains
the regression parameters.
4. Compute the residuals curve e(d) and the AUC:
$$e(d) = \|T - Y_d\|_F^2, \qquad \mathrm{AUC} = \frac{1}{n}\sum_{d=1}^{n} e(d) \qquad (25.9)$$
These two quantities serve as metrics for evaluating how well the task is
represented by the kernel. An interpretation of residuals curves e(d) is given
in Figure 25.6.
Fig. 25.6. Cartoon showing how to interpret the residuals curves e(d) yielded by various
kernels on a certain task (each panel plots e(d) against d). Scenario 1: the kernel
contains all label-relevant information in its principal components. This is the optimal
case. Scenario 2: a large amount of label-relevant information is contained in the leading
components, but the remaining information is missing. Scenario 3: the task-relevant
information is contained in a large number of principal components but can be predicted
accurately. Scenario 4: the kernel is not suited for the task of interest. Note that although
Scenarios 2 and 3 have similar AUC, their residuals curves are qualitatively very different.
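A minimal NumPy sketch of the four-step analysis above (Eq. 25.9); it assumes the empirical kernel matrix K and the label matrix T have already been computed, and the function name is an illustrative choice:

import numpy as np

def kernel_residuals_auc(K, T):
    """Residuals curve e(d) and AUC for a kernel matrix K (n x n) and labels T (n x c)."""
    # Eigendecomposition of the symmetric kernel matrix, sorted by decreasing eigenvalue.
    eigval, U = np.linalg.eigh(K)
    order = np.argsort(eigval)[::-1]
    U = U[:, order]
    n = K.shape[0]
    e = np.empty(n)
    for d in range(1, n + 1):
        U_d = U[:, :d]
        Y_d = U_d @ (U_d.T @ T)            # projection of the labels on the d leading components
        e[d - 1] = np.sum((T - Y_d) ** 2)  # squared Frobenius norm of the residual
    auc = e.mean()                         # AUC = (1/n) * sum_d e(d)
    return e, auc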
25.4.2 Generative Analysis
The generative performance, measured in terms of data likelihood, is what the
Boltzmann machine is optimized for. Unfortunately, the data likelihood cannot
be measured easily as it involves the estimation of the partition function, which
is generally intractable. As a consequence, we must resort to sophisticated
approximation schemes. We present an analysis introduced by Salakhutdinov
and Murray [14] that estimates the likelihood of the learned Boltzmann machine
based on annealed importance sampling (AIS) [10]. We describe here the basic
analysis. Salakhutdinov and Hinton [13] introduced more elaborate procedures
for particular types of Boltzmann machines such as restricted, semi-restricted
and deep Boltzmann machines.
As we have seen in Section 25.2, a deep Boltzmann machine associates to each
input x a probability
$$p(x; \theta) = \frac{\Psi(\theta, x)}{Z(\theta)}
\quad \text{where} \quad
\Psi(\theta, x) = \sum_{y,z} p^{*}(x, y, z; \theta)
\quad \text{and} \quad
Z(\theta) = \sum_{x,y,z} p^{*}(x, y, z; \theta)$$
and where $p^{*}(x, y, z; \theta) = e^{-E(x,y,z;\theta)}$ is the unnormalized probability of state
(x, y, z). Computing Ψ(θ, x) and Z(θ) analytically is intractable because of the
exponential number of elements involved in the sum. We must therefore resort
to approximation procedures. Let us first rewrite the ratio of partition functions
as follows:
$$p(x; \theta) = \frac{\Psi(\theta, x)}{Z(\theta)}
= \frac{\Psi(\theta, x) / \Psi(0, x)}{Z(\theta) / Z(0)} \cdot \frac{\Psi(0, x)}{Z(0)} \qquad (25.10)$$
It can be noticed that the ratio of base-rate partition functions (θ = 0) is easy
to compute as θ = 0 makes all units independent. It has the analytical form
$$\frac{\Psi(0, x)}{Z(0)} = \Big(\frac{1}{2}\Big)^{M_x}. \qquad (25.11)$$
The two other ratios in Equation 25.10 can be estimated using annealed impor-
tance sampling. The annealed importance sampling method proceeds as follows:
Annealed importance sampling (AIS) [10]:

1. Generate a sequence of states $\xi_1, \ldots, \xi_K$ using a sequence of transi-
tion operators $T(\xi, \xi'; \theta_0), \ldots, T(\xi, \xi'; \theta_K)$ that leave $p(\xi)$ invariant,
that is,
   Draw $\xi_0$ from the base model (e.g. a random vector of zeros and ones)
   Draw $\xi_1$ given $\xi_0$ using $T(\xi, \xi'; \theta_1)$
   ...
   Draw $\xi_K$ given $\xi_{K-1}$ using $T(\xi, \xi'; \theta_K)$

2. Compute the importance weight
$$\omega_{\mathrm{AIS}} = \frac{p^{*}(\xi_1; \theta_1)}{p^{*}(\xi_1; \theta_0)} \cdot \frac{p^{*}(\xi_2; \theta_2)}{p^{*}(\xi_2; \theta_1)} \cdots \frac{p^{*}(\xi_K; \theta_K)}{p^{*}(\xi_K; \theta_{K-1})}$$
It can be shown that if the sequence of models $\theta_0, \theta_1, \ldots, \theta_K$, where $\theta_0 = 0$ and
$\theta_K = \theta$, evolves slowly enough, the importance weight obtained with the annealed
importance sampling procedure is an estimate for the ratio between the partition
function of the model θ and the partition function of the base rate model.
In our case, ξ denotes the state (x, y, z) of the DBM and the transition opera-
tor $T(\xi, \xi'; \theta)$ corresponds to the alternating Gibbs samplers defined in Equations
25.3 and 25.4. The sequence of parameters $\{\theta_0, \ldots, \theta_K\}$ can, for example, lie on
the line between 0 and θ, that is, $\theta_k = \alpha_k \cdot \theta_K$ where $\alpha_0 < \cdots < \alpha_K$. Alterna-
tively, the sequence of parameters can be those that are observed throughout
training. In that case, maintaining a moving average of the parameter throughout
training is necessary as the learning noise creates unnecessarily large variations
between two adjacent parameters.
We can now compute the two ratios of partition functions of Equation 25.10
as
$$\frac{Z(\theta)}{Z(0)} \approx E[\omega_{\mathrm{AIS}}]
\quad \text{and} \quad
\frac{\Psi(\theta, x)}{\Psi(0, x)} \approx E[\nu_{\mathrm{AIS}}(x)] \qquad (25.12)$$
where $\omega_{\mathrm{AIS}}$ is the importance weight resulting from the annealing process with
the freely running Gibbs sampler and $\nu_{\mathrm{AIS}}$ is the importance weight resulting
from the annealing with input units clamped to the data point. Substituting
Equations 25.11 and 25.12 into Equation 25.10, we obtain
$$p(x; \theta) \approx \frac{E[\nu_{\mathrm{AIS}}(x)]}{E[\omega_{\mathrm{AIS}}]} \cdot \Big(\frac{1}{2}\Big)^{M_x}$$
and therefore, the log-likelihood of the model is estimated as:
$$E_X[\log(p(x; \theta))] \approx E_X[\log E[\nu_{\mathrm{AIS}}(x)]] - \log E[\omega_{\mathrm{AIS}}] - M_x \log(2). \qquad (25.13)$$
Generally, computing an average of the importance weight $\nu_{\mathrm{AIS}}$ for each data
point x can take a long time. In practice, we can approximate it with a single
AIS run for each data point. In that case, it follows from Jensen’s inequality that
$$E_X[\log \nu_{\mathrm{AIS}}(x)] - \log E[\omega_{\mathrm{AIS}}] \;\le\; E_X[\log E[\nu_{\mathrm{AIS}}(x)]] - \log E[\omega_{\mathrm{AIS}}]. \qquad (25.14)$$
Consequently, this approximation tends to produce slightly pessimistic estimates
of the model log-likelihood; however, the variance of $\nu_{\mathrm{AIS}}$ is low compared to the
variance of $\omega_{\mathrm{AIS}}$ because the clamping of visible units to data points reduces the
diversity of AIS runs.
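A minimal NumPy sketch of the AIS procedure above, shown for a toy fully visible Boltzmann machine so that the transition operator is a single Gibbs sweep; the linear schedule θ_k = (k/K)·θ and all function names are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
sigm = lambda u: 1.0 / (1.0 + np.exp(-u))

def log_p_star(xi, W, b):
    """Unnormalized log-probability of a fully visible Boltzmann machine."""
    return 0.5 * xi @ W @ xi + xi @ b

def gibbs_sweep(xi, W, b):
    """One sweep of the Gibbs transition operator that leaves p(xi; W, b) invariant."""
    for i in range(len(xi)):
        p = sigm(W[i] @ xi - W[i, i] * xi[i] + b[i])
        xi[i] = float(rng.random() < p)
    return xi

def ais_log_weight(W, b, K=1000):
    """Log importance weight log(omega_AIS) between the base model (theta = 0) and (W, b)."""
    n = len(b)
    xi = (rng.random(n) < 0.5).astype(float)      # draw xi_0 from the base model
    log_w = 0.0
    for k in range(1, K + 1):
        a_prev, a_k = (k - 1) / K, k / K          # linear schedule theta_k = a_k * theta
        xi = gibbs_sweep(xi, a_k * W, a_k * b)    # draw xi_k using T( . ; theta_k)
        log_w += log_p_star(xi, a_k * W, a_k * b) - log_p_star(xi, a_prev * W, a_prev * b)
    return log_w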
25.5 Experiments
In this section, we present a few experiments that demonstrate the effectiveness
of the centering trick for learning deep Boltzmann machines. We use the MNIST
handwritten digits recognition dataset that consists of 60000 training samples
and 10000 test samples. Each sample is a 28 × 28 grayscale image representing
a handwritten digit along with its label. Grayscale values (between 0 and 1) are
treated as probabilities.
Architectures: We consider a deep Boltzmann machine (DBM) made of 28 ×28
input units, 200 intermediate units and 25 top units and a locally-connected
DBM (LC-DBM) made of 28×28 input units, 400 intermediate units that connect
to random input patches of size 6 ×6 and 100 top units. These architectures are
illustrated in Figure 25.7 (left). In the DBM, the modeling load is concentrated in
the first layer with the top layer serving merely to model global digit-like features.
On the other hand, in the LC-DBM, most of the modeling load is postponed
to the second layer and the first layer serves essentially as a low-level local
preprocessor. The initial offsets and biases for visible units are set to $\alpha = \langle x \rangle_{\mathrm{data}}$
and $a_0 = \mathrm{sigm}^{-1}(\alpha)$. We consider different initial biases ($b_0, c_0 = -2$, $b_0, c_0 = 0$
and $b_0, c_0 = 2$) and offsets ($\beta, \gamma = \mathrm{sigm}(-2)$, $\beta, \gamma = \mathrm{sigm}(0)$ and $\beta, \gamma = \mathrm{sigm}(2)$)
for the hidden units. These offsets and initial biases correspond to the sigmoids
plotted in Figure 25.4.
Learning: We use persistent contrastive divergence [19] to train the network
and keep track of 25 free particles in the background of the learning procedure. We
use a Gibbs sampler to collect both the data-independent and data-dependent
statistics. At each iteration of the learning procedure, we run 3 iterations of the
alternating Gibbs sampler for collecting the data-dependent statistics (from a
minibatch of 25 data points) and one iteration for updating the data-independent
statistics. We use a learning rate η =0.0004 per sample for each layer.
Evaluation: Evaluating the DBM in an online fashion requires keeping track
of the model parameters throughout training. We reduce the learning noise by
maintaining a moving average of the sequence of parameters observed during
learning. The moving average is tuned to remember approximately 10% of the
training history. We keep track of 500 data-dependent chains running on the
smoothed sequence of parameters and from which top-layer statistics k(z, z′) and
ratios Ψ(θ, x)/Ψ(0, x) are estimated. We also keep track of 100 data-independent
chains on the same sequence of parameters and from which the ratio Z(θ)/Z(0) is
estimated. Discriminative performance is measured as the projection residuals of
the labels on the kernel principal components and the area under the error curve
(see Equation 25.9) using an exponential RBF kernel with σ set to the mean of
pairwise distances between z and z′. Generative performance is measured in
terms of data log-likelihood (see Equation 25.13).
Results: Figure 25.7 summarizes the results of our analysis and corroborates
the importance of centering for obtaining a better discriminative and genera-
tive model of data. The centered DBM systematically produces better top-layer
AUC errors and has higher log-likelihood. The importance of centering for im-
proving generative models is particularly marked for the locally-connected DBM
(LC-DBM) where the top-layer is crucial for modeling long-range dependencies.
These results suggest that the centering trick is particularly useful when deal-
ing with hierarchical architectures where global statistics are handled only in the
deep layers of the network. Figure 25.8 (left) shows that the centered DBM yields
a kernel that contains most of the label information in its leading components
and has a low level of noise in the remaining components. This corresponds to
Scenario 1 of Figure 25.6. On the other hand, in the case of the non-centered
DBM, the labels span a few leading components of the kernel, but the remaining
components have a high level of noise. This corresponds to Scenario 2 of Figure
25.6. As a comparison, the simple input-layer RBF kernel exhibits high dimen-
sionality and low noise and thus, corresponds to Scenario 3 of Figure 25.6. The
importance of centering for producing good top-layer kernels is further confirmed
by looking at the second layer filters, visualized in Figure 25.8 (right) using a
Fig. 25.7. Evolution of the AUC error and log-likelihood throughout training for the
DBM (28 × 28 input units, 200 intermediate units, 25 top units) and the LC-DBM
(28 × 28 input units, 400 intermediate units, 100 top units). “Centered+” designates
deep Boltzmann machines that are continuously recentered throughout training. In the
DBM, reasonable generative performance can be achieved without centering as the top
layer is simply ignored by the rest of the model. In the LC-DBM, centering is important
for both generative and discriminative performance as the top layer is required for
modeling long-range dependencies. Continuously recentering yields the most robust
performance.
Fig. 25.8. Comparison of the top-layer representation produced by centered and non-
centered DBMs. Left: error residuals produced by centered DBMs and non-centered
DBMs. Right: 2D-PCA (with a linear kernel) and second-layer filters. Results suggest
richer top-layer representations for centered DBMs than for non-centered DBMs.
linear back-projection, and observing that they are much richer for the centered
DBM than for the non-centered one.
25.6 Conclusion
Learning deep Boltzmann machines is a difficult optimization problem that can
be highly sensitive to the parameterization of its energy function. In this chap-
ter, we propose the centering trick that consists of rewriting the energy as a
function of centered states. The centering trick improves the stability of deep
Boltzmann machines and makes it possible to learn models that exhibit both advantageous
discriminative and generative properties.
Our experiments have been most successful on mid-scale models (in the range
of a few hundred hidden units). The high representational power of deep Boltz-
mann machines makes it hard to extend our experiments to larger scale models
(of thousands of units) without using an explicit regularizer such as layer-wise
pretraining or limited connectivity. We believe that applying the centering trick
to large-scale models should be made in conjunction with a strong regularizer
that limits the effective dimensionality of the model.
Acknowledgements. The authors thank Mikio Braun and the multiple re-
viewers for their useful comments. This work was supported by the World Class
University Program through the National Research Foundation of Korea funded
by the Ministry of Education, Science, and Technology, under Grant R31-10008.
The authors also acknowledge partial support by DFG (MU 987/17-1).
References
[1] Arnold, L., Auger, A., Hansen, N., Ollivier, Y.: Information-geometric optimiza-
tion algorithms: A unifying picture via invariance principles, arXiv:1106.3708
(2011)
[2] Braun, M.L., Buhmann, J., Müller, K.-R.: On relevant dimensions in kernel feature
spaces. Journal of Machine Learning Research 9, 1875–1908 (2008)
[3] Cho, K., Raiko, T., Ilin, A.: Enhanced gradient and adaptive learning rate for
training restricted Boltzmann machines. In: Proceedings of the 28th International
Conference on Machine Learning, pp. 105–112 (2011)
[4] Hinton, G.E.: Training products of experts by minimizing contrastive divergence.
Neural Computation 14(8), 1771–1800 (2002)
[5] Hinton, G.E.: A Practical Guide to Training Restricted Boltzmann Machines. In:
Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn.
LNCS, vol. 7700, pp. 599–619. Springer, Heidelberg (2012)
[6] Hinton, G.E., Sejnowski, T.J.: Learning and relearning in Boltzmann machines. In:
Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
vol. 1, pp. 282–317. MIT Press (1986)
[7] LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr,
G.B., Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer,
Heidelberg (1998)
[8] Montavon, G., Braun, M.L., Müller, K.-R.: Kernel analysis of deep networks.
Journal of Machine Learning Research 12, 2563–2581 (2011)
[9] Montavon, G., Braun, M.L., Müller, K.-R.: Deep Boltzmann machines as feed-
forward hierarchies. Journal of Machine Learning Research - Proceedings Track 22,
789–804 (2012)
[10] Neal, R.M.: Annealed importance sampling. Statistics and Computing 11(2), 125–
139 (2001)
[11] Pearlmutter, B.A.: Fast exact multiplication by the Hessian. Neural Computa-
tion 6(1), 147–160 (1994)
[12] Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: Proceedings of
the International Conference on Artificial Intelligence and Statistics, vol. 5, pp.
448–455 (2009)
[13] Salakhutdinov, R.: Learning and Evaluating Boltzmann Machines. Technical Re-
port UTML TR 2008-002, Dept. of Computer Science, University of Toronto
(2008)
[14] Salakhutdinov, R., Murray, I.: On the quantitative analysis of deep belief networks.
In: Proceedings of the 25th International Conference on Machine Learning, ICML
2008, pp. 872–879 (2008)
[15] Schölkopf, B., Smola, A., Müller, K.-R.: Nonlinear component analysis as a kernel
eigenvalue problem. Neural Computation 10(5), 1299–1319 (1998)
[16] Schölkopf, B., Mika, S., Burges, C.J.C., Knirsch, P., Müller, K.-R., Rätsch, G.,
Smola, A.J.: Input space versus feature space in kernel-based methods. IEEE
Transactions on Neural Networks 10(5), 1000–1017 (1999)
[17] Schraudolph, N.N.: Centering Neural Network Gradient Factors. In: Orr, G.B.,
Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 207–226. Springer, Hei-
delberg (1998)
[18] Tang, Y., Sutskever, I.: Data normalization in the learning of restricted Boltzmann
machines. Technical Report UTML-TR-11-2, Department of Computer Science,
University of Toronto (2011)
[19] Tieleman, T.: Training restricted Boltzmann machines using approximations to
the likelihood gradient. In: Proceedings of the 25th International Conference on
Machine Learning, pp. 1064–1071 (2008)
26
Deep Learning via Semi-supervised Embedding
Jason Weston¹, Frédéric Ratle², Hossein Mobahi³, and Ronan Collobert⁴,⋆
¹ Google, New York, USA, jweston@google.com
² Nuance Communications, Montreal, Canada, frederic.ratle@gmail.com
³ Department of Computer Science, University of Illinois Urbana-Champaign, USA, hmobahi2@illinois.edu
⁴ IDIAP Research Institute, Martigny, Switzerland, ronan@collobert.com
Abstract. We show how nonlinear semi-supervised embedding algo-
rithms popular for use with “shallow” learning techniques such as kernel
methods can be easily applied to deep multi-layer architectures, either
as a regularizer at the output layer, or on each layer of the architecture.
Compared to standard supervised backpropagation this can give signifi-
cant gains. This trick provides a simple alternative to existing approaches
to semi-supervised deep learning whilst yielding competitive error
rates compared to those methods, and existing shallow semi-supervised
techniques.
26.1 Introduction
In this chapter we describe a trick for improving the generalization ability of
neural networks by utilizing unlabeled pairs of examples for semi-supervised
learning. The field of semi-supervised learning [7] has the goal of improving gen-
eralization on supervised tasks using unlabeled data. One of the tricks used
is the so-called embedding of data into a lower dimensional space (or the related
task of clustering); these are unsupervised dimensionality reduction techniques
that have been intensively studied. For example, researchers have used nonlinear
embedding or cluster representations as features for a supervised classifier, with
improved results. Many of those proposed architectures are disjoint and shal-
low, by which we mean the unsupervised dimensionality reduction algorithm is
trained on unlabeled data separately as a first step, and then its results are fed
to a supervised classifier which has a shallow architecture such as a (kernelized)
linear model. For example, several methods learn a clustering or a distance mea-
sure based on a nonlinear manifold embedding as a first step [8, 9]. Transductive
⋆ Much of the work in this chapter was completed while Jason Weston and Ronan
Collobert were working at NEC labs, Princeton, USA, and while Frédéric Ratle
was affiliated with the University of Lausanne, Switzerland. See [28] and [18] for
conference papers on this subject.
Support Vector Machines (TSVMs) [26] (which employs a kind of clustering) and
LapSVM [2] (which employs a kind of embedding) are examples of methods that
are joint in their use of unlabeled data and labeled data, while their architecture
is shallow. In this work we use the same embedding trick as those researchers,
but apply it to (deep) neural networks.
Deep architectures seem a natural choice in hard AI tasks which involve several
sub-tasks which can be coded into the layers of the architecture. As argued
by several researchers [14, 3] semi-supervised learning is also natural in such a
setting as otherwise one is not likely to ever have enough data to perform well.
This is both because of the dearth of labeled data, and because of the difficulty of
training the architectures. Secondly, intuitively one would think that training on
labeled and unlabeled data jointly should help guide the best use of the unlabeled
data for the labeled task compared to a two-stage disjoint approach. (However,
to our knowledge there is no systematic evidence of the latter, and there might
be reasons to train disjointly, for example label prediction tends to overfit faster than the embedding because there is less labeled data to fit it. Doing unsupervised
pretraining first and supervised fine-tuning afterwards might naturally solve this
problem. On the other hand, it is only because the problem is non-convex that
a two-stage approach does anything at all – all the learning from the first stage
may be “forgotten”).
Several authors have recently proposed methods for using unlabeled data in
deep neural network-based architectures. These methods either perform a greedy
layer-wise pre-training of weights using unlabeled data alone followed by super-
vised fine-tuning (which can be compared to the disjoint shallow techniques for
semi-supervised learning described before), or learn unsupervised encodings at
multiple levels of the architecture jointly with a supervised signal. Only consid-
ering the latter, the basic setup we advocate is simple:
1. Choose an unsupervised learning algorithm.
2. Choose a model with a deep architecture.
3. The unsupervised learning is plugged into any (or all) layers of the architec-
ture as an auxiliary task.
4. Train supervised and unsupervised tasks using the same architecture simul-
taneously (with a joint objective function).
The aim is that the unsupervised method will improve accuracy on the task at hand.
In this chapter we advocate a simple way of performing deep learning by leveraging
existing ideas from semi-supervised algorithms developed in shallow architectures.
In particular, we focus on the idea of combining an embedding-based regularizer
with a supervised learner to perform semi-supervised learning, such as is used in
Laplacian SVMs [2]. We show that this method can be (i) generalized to multi-layer networks and trained by stochastic gradient descent, and (ii) used within the deep learning framework given above. Experimentally, we also show that it seems to work quite well. We expect this is due to several effects: firstly, the extra embedding objective acts as a data-dependent regularizer and, secondly, as a weakly-supervised task that correlates well with the supervised task of interest. Finally,
adding this training objective at multiple layers of the network helps to train all
the layers rather than just backpropagating from the final layer as in supervised
learning.
Although the core of this chapter focuses on a particular algorithm (embed-
ding) in a joint setup, we expect the approach would also work in a disjoint setup, and with other unsupervised algorithms; for example, the Transductive SVM approach has also been generalized to the deep learning case [15].
26.2 Semi-supervised Embedding
Our method will adapt existing semi-supervised embedding techniques for shal-
low methods to neural networks. Hence, before we describe the method, let us
first review existing semi-supervised approaches. A key assumption in many semi-
supervised algorithms is the structure assumption¹: points within the same struc-
ture (such as a cluster or a manifold) are likely to have the same label. Given this
assumption, the aim is to use unlabeled data to uncover this structure. In order to
do this many algorithms such as cluster kernels [8], LDS [9], label propagation [30]
and LapSVM [2], to name a few, make use of regularizers that are directly related
to unsupervised embedding algorithms. To understand these methods we will first
review some relevant approaches to linear and nonlinear embedding.
26.2.1 Embedding Algorithms
We will focus on a rather general class of embedding algorithms that can be de-
scribed by the following type of optimization problem: given the data $x_1, \ldots, x_U$, find an embedding $f(x_i)$ of each point $x_i$ by minimizing

$$\sum_{i,j=1}^{U} L\big(f(x_i), f(x_j), W_{ij}\big)$$

w.r.t. the learning parameters α, subject to a balancing constraint.
This type of optimization problem has the following main ingredients:
– $f(x) \in \mathbb{R}^n$ is the embedding one is trying to learn for a given example $x \in \mathbb{R}^d$. It is parametrized by α. In many techniques $f(x_i) = f_i$ is a lookup table where each example $i$ is assigned an independent vector $f_i$.
– $L$ is a loss function between pairs of examples.
– The matrix $W$ of weights $W_{ij}$ specifies the similarity or dissimilarity between examples $x_i$ and $x_j$. This is supplied in advance and serves as a kind of label for the loss function.
– A balancing constraint is often required for certain objective functions so that a trivial solution is not reached.
¹ This is often referred to as the cluster assumption or the manifold assumption [7].
As is usually the case for such machine learning setups, one can specify the model
type (family of functions) and the loss to get different algorithmic variants. Many
well known methods fit into this framework, we describe some pertinent ones
below.
Multidimensional scaling (MDS) [16] is a classical algorithm that attempts to
preserve the distance between points, whilst embedding them in a lower dimen-
sional space, e.g. by using the loss function
$$L(f_i, f_j, W_{ij}) = \big(\|f_i - f_j\| - W_{ij}\big)^2.$$
MDS is equivalent to PCA if the metric is Euclidean [29].
ISOMAP [25] is a nonlinear embedding technique that attempts to capture
manifold structure in the original data. It works by defining a similarity metric
that measures distances along the manifold, e.g. $W_{ij}$ is defined by the shortest
path on the neighborhood graph. One then uses those distances to embed using
conventional MDS.
Laplacian Eigenmaps [1] learn manifold structure by emphasizing the preservation of local distances. One defines the distance metric between the examples by encoding them in the graph Laplacian $\tilde{L} = D - W$, where $D_{ii} = \sum_j W_{ij}$ is diagonal. Then, the following optimization is used:

$$\sum_{ij} L(f_i, f_j, W_{ij}) = \sum_{ij} W_{ij} \|f_i - f_j\|^2 = f^{\top} \tilde{L} f \qquad (26.1)$$

subject to the balancing constraint:

$$f^{\top} D f = I \quad \text{and} \quad f^{\top} D \mathbf{1} = 0. \qquad (26.2)$$
Siamese Networks [4] are also a classical method for nonlinear embedding. Neural
networks researchers think of such models as a network with two identical copies of
the same function, with the same weights, fed into a “distance measuring” layer to
compute whether the two examples are similar or not, given labeled data. In fact,
this is exactly the same as the formulation given at the beginning of this section.
Several loss functions have been proposed for siamese networks, here we de-
scribe a margin-based loss proposed by the authors of [13]:
$$L(f_i, f_j, W_{ij}) = \begin{cases} \|f_i - f_j\|_2^2 & \text{if } W_{ij} = 1, \\ \max\big(0,\, m - \|f_i - f_j\|_2\big)^2 & \text{if } W_{ij} = 0 \end{cases} \qquad (26.3)$$
which encourages similar examples to be close, and dissimilar ones to have a
distance of at least m from each other. Note that no balancing constraint is
needed with such a choice of loss as the margin constraint inhibits a trivial
solution. Compared to using constraints like (26.2) this is much easier to optimize
by gradient descent.
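To make the loss concrete, the following minimal numpy sketch evaluates (26.3) for a single pair; the function name and the default margin m = 1 are illustrative choices of ours, not taken from the chapter.

```python
import numpy as np

def margin_embedding_loss(f_i, f_j, w_ij, m=1.0):
    # Pairwise loss of Eq. (26.3): pull neighbors together,
    # push non-neighbors at least a margin m apart.
    d = np.linalg.norm(f_i - f_j)       # Euclidean distance in embedding space
    if w_ij == 1:                       # similar pair
        return d ** 2
    return max(0.0, m - d) ** 2         # dissimilar pair

# toy usage with two 3-dimensional embeddings
f_a, f_b = np.array([0.1, 0.2, 0.0]), np.array([0.9, -0.3, 0.4])
print(margin_embedding_loss(f_a, f_b, w_ij=1))  # penalizes any separation
print(margin_embedding_loss(f_a, f_b, w_ij=0))  # zero once the distance exceeds m
```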
26.2.2 Semi-supervised Algorithms
Several semi-supervised classification algorithms have been proposed which take
advantage of the algorithms described in the last section. Here we assume the
setting where one is given $M + U$ examples $x_i$, but only the first $M$ have a known label $y_i$.
Label Propagation [30] adds a Laplacian Eigenmap type regularization to a
nearest-neighbor type classifier:
$$\min_f \; \sum_{i=1}^{M} \|f_i - y_i\|^2 + \lambda \sum_{i,j=1}^{M+U} W_{ij} \|f_i - f_j\|^2 \qquad (26.4)$$
The algorithm tries to give two examples with a large weighted edge $W_{ij}$ the same label. The neighbors of neighbors tend to also get the same label as each other
label. The neighbors of neighbors tend to also get the same label as each other
by transitivity, hence the name label propagation.
LapSVM [2] uses the Laplacian Eigenmaps type regularizer with an SVM:
$$\min_{w,b} \; \|w\|^2 + \gamma \sum_{i=1}^{M} H\big(y_i f(x_i)\big) + \lambda \sum_{i,j=1}^{M+U} W_{ij} \|f(x_i) - f(x_j)\|^2 \qquad (26.5)$$
where $H(x) = \max(0, 1 - x)$ is the hinge loss, and the final classifier is $f(x) = w \cdot x + b$.
Other Methods. In [9] a method called graph is suggested, which combines a modified version of ISOMAP with an SVM. The authors also suggest combining the modified ISOMAP with TSVMs rather than SVMs, and call this Low Density Separation (LDS).
26.3 Semi-supervised Embedding for Deep Learning
We would like to use the ideas developed in semi-supervised learning for deep
learning. Deep learning consists of learning a model with several layers of non-
linear mapping. In this chapter we will consider multi-layer networks with N
layers of hidden units that give a C-dimensional output vector:
$$f_i(x) = \sum_{j=1}^{d} w^{O,i}_j h^N_j(x) + b^{O,i}, \quad i = 1, \ldots, C \qquad (26.6)$$

where $w^{O}$ are the weights for the output layer, and typically the $k^{\text{th}}$ layer is defined as

$$h^k_i(x) = S\Big( \sum_j w^{k,i}_j h^{k-1}_j(x) + b^{k,i} \Big), \quad k > 1 \qquad (26.7)$$

$$h^1_i(x) = S\Big( \sum_j w^{1,i}_j x_j + b^{1,i} \Big) \qquad (26.8)$$
and S is a non-linear squashing function such as tanh. Here, we describe a stan-
dard fully connected multi-layer network but prior knowledge about a particular
problem could lead one to other network designs. For example in sequence and
image recognition time delay and convolutional networks (TDNNs and CNNs)
[17] have been very successful. In those approaches one introduces layers that
apply convolutions on their input which take into account locality information
in the data, i.e. they learn features from image patches or windows within a
sequence.
The general method we propose for deep learning via semi-supervised embed-
ding is to add a semi-supervised regularizer in deep architectures in one of three
different modes, as shown in Figure 26.1:
(a) Add a semi-supervised loss (regularizer) to the supervised loss on the entire
network’s output (26.6):
$$\sum_{i=1}^{M} \ell\big(f(x_i), y_i\big) + \lambda \sum_{i,j=1}^{M+U} L\big(f(x_i), f(x_j), W_{ij}\big) \qquad (26.9)$$
This is most similar to the shallow techniques described before, e.g. equation
(26.5).
(b) Regularize the $k^{\text{th}}$ hidden layer (26.7) directly:
$$\sum_{i=1}^{M} \ell\big(f(x_i), y_i\big) + \lambda \sum_{i,j=1}^{M+U} L\big(f^k(x_i), f^k(x_j), W_{ij}\big) \qquad (26.10)$$

where $f^k(x) = \big(h^k_1(x), \ldots, h^k_{HU_k}(x)\big)$ is the output of the network up to the $k^{\text{th}}$ hidden layer ($HU_k$ is the number of hidden units on layer $k$).
(c) Create an auxiliary network which shares the first k layers of the original
network but has a new final set of weights:
$$g_i(x) = \sum_j w^{AUX,i}_j h^k_j(x) + b^{AUX,i} \qquad (26.11)$$
We train this network to embed unlabeled data simultaneously as we train
the original network on labeled data.
One can use the loss function (26.3) for embedding, and the hinge loss

$$\ell\big(f(x), y\big) = \sum_{c=1}^{C} H\big(y(c)\, f_c(x)\big),$$

for labeled examples, where $y(c) = 1$ if $y = c$ and $-1$ otherwise. For neighboring
points, this is the same regularizer as used in LapSVM and Laplacian Eigenmaps.
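As an illustration of mode (c), here is a small PyTorch sketch of an auxiliary embedding head of the form (26.11) sharing the first layer(s) with the supervised network; all layer sizes and names are our own illustrative assumptions, not prescriptions from the chapter.

```python
import torch.nn as nn

class AuxEmbedNet(nn.Module):
    """Supervised network f(x) plus an auxiliary embedding head g(x), Eq. (26.11),
    sharing the first hidden layer(s); the auxiliary head is discarded at test time."""
    def __init__(self, d_in=784, hidden=50, n_classes=10, embed_dim=10):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, hidden), nn.Tanh())  # h^k(x)
        self.classifier = nn.Linear(hidden, n_classes)                   # rest of f(x)
        self.aux_embed = nn.Linear(hidden, embed_dim)                    # new final weights w^AUX

    def forward(self, x):                 # supervised output f(x)
        return self.classifier(self.shared(x))

    def embed(self, x):                   # auxiliary embedding g(x) trained on unlabeled pairs
        return self.aux_embed(self.shared(x))
```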
[Figure: three diagrams of a 3-layer network with input and output, where the embedding is attached (a) to the output, (b) to an internal layer as an embedding space, or (c) to an auxiliary embedding layer branching off an internal layer.]
Fig. 26.1. Three modes of embedding in deep architectures
Algorithm 26.1 EmbedNN
Input: labeled data $(x_i, y_i)$, $i = 1, \ldots, M$, unlabeled data $x_i$, $i = M+1, \ldots, M+U$, set of functions $f(\cdot)$, and embedding functions $g_k(\cdot)$, see Figure 26.1 and equations (26.9), (26.10) and (26.11).
repeat
  Pick a random labeled example $(x_i, y_i)$
  Make a gradient step to optimize $\ell(f(x_i), y_i)$
  for each embedding function $g_k(\cdot)$ do
    Pick a random pair of neighbors $x_i$, $x_j$.
    Make a gradient step for $\lambda L(g_k(x_i), g_k(x_j), 1)$
    Pick a random unlabeled example $x_n$.
    Make a gradient step for $\lambda L(g_k(x_i), g_k(x_n), 0)$
  end for
until stopping criterion is met.
For non-neighbors, where $W_{ij} = 0$, this loss “pulls” points apart, thus inhibiting
trivial solutions without requiring difficult constraints such as (26.2). To achieve
an embedding without labeled data the latter is necessary or all examples would
collapse to a single point in the embedding space. This regularizer is therefore
preferable to using (26.1) alone. Pseudocode of the overall approach is given in
Algorithm 26.1.
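The following PyTorch sketch shows one round of Algorithm 26.1 for mode (a) (embedding regularizer on the output); the network, optimizer, the cross-entropy stand-in for the labeled loss, and all sizes are illustrative assumptions rather than the chapter's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def embed_nn_round(net, opt, labeled, neighbor_pair, unlabeled, lam=1.0, m=1.0):
    # Supervised gradient step on one labeled example (x: (1, d), y: (1,)).
    x, y = labeled
    opt.zero_grad()
    F.cross_entropy(net(x), y).backward()     # stand-in for the hinge loss used in the chapter
    opt.step()

    # Embedding gradient step, Eq. (26.3) applied to the network output (mode (a)).
    (xi, xj), xn = neighbor_pair, unlabeled
    opt.zero_grad()
    fi, fj, fn = net(xi), net(xj), net(xn)
    pull = (fi - fj).pow(2).sum()             # neighbors (W_ij = 1): pull together
    dist = (fi - fn).pow(2).sum().sqrt()      # random example treated as non-neighbor (W_ij = 0)
    push = torch.clamp(m - dist, min=0).pow(2)
    (lam * (pull + push)).backward()
    opt.step()

# toy setup: a two-layer network on 784-dimensional inputs with 10 classes
net = nn.Sequential(nn.Linear(784, 50), nn.Tanh(), nn.Linear(50, 10))
opt = torch.optim.SGD(net.parameters(), lr=0.01)
```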
Some possible tricks to take into consideration are:
– The hyperparameter λ: in most of our experiments we simply set λ = 1 and it worked well due to the alternating updates in Algorithm 26.1. Note, however, that if you are using many embedding loss functions they will dominate the objective in that case.
– We note that near the end of optimization it may be advantageous to reduce the learning rate of the regularizer more than the learning rate for the term that is minimizing the training error, so that the training error can be as low as possible on noiseless tasks (however, we did not try this in our experiments).
– If you use an internal embedding on the first layer of your network, it is likely that this embedding problem is harder than an internal embedding on a later layer, so you might not want to give them all the same learning rate or margin, but that complicates the hyperparameter choices. An alternative idea would be to use auxiliary layers on earlier layers, or even go through two auxiliary layers rather than one, to make the embedding task easier. Auxiliary layers are thrown away at test time.
– Embedding on the last output layer may not always be a good idea, depending on the type of network. For example, if you are using a softmax last layer, the 2-norm type embedding loss may not be appropriate for the log-probability representation in that layer. In that case we suggest doing the embedding on the last-but-one layer instead.
– Finally, although we did not try it, training in a disjoint fashion, i.e. doing the embedding training first and then continuing with a fine-tuning step on only the labeled data, might simplify the hyperparameter choices above.
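As a concrete (and untested in the chapter) way to act on the first two points, one can simply decay the weight λ of the embedding term during training, which has a similar effect to lowering the regularizer's learning rate; the schedule below is purely illustrative.

```python
def lambda_schedule(epoch, total_epochs, lam0=1.0, final_scale=0.1):
    # Linearly interpolate lambda from lam0 down to lam0 * final_scale,
    # so the supervised term dominates near the end of training.
    frac = epoch / max(1, total_epochs - 1)
    return lam0 * ((1.0 - frac) + frac * final_scale)
```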
26.3.1 Labeling Unlabeled Data as Neighbors (Building the Graph)
Training neural networks online using stochastic gradient descent is fast and can
scale to millions of examples. A possible bottleneck with the described approach
is computation of the matrix W , that is, computing which unlabeled examples
are neighbors and have value $W_{ij} = 1$. Embedding algorithms often use k-nearest
neighbor for this task. Many methods for its fast computation do exist, for
example hashing and tree-based methods.
However, there are also many other ways of collecting neighboring unlabeled
data that do not involve computing k-nn. For example, if one has access to
unlabeled sequence data the following tricks can be used:
– For image tasks one can make use of the temporal coherence of unlabeled video: two successive frames are very likely to contain similar content and represent the same concept classes. Each object in the video is also likely to be subject to small transformations, such as translation, rotation or deformation, over neighboring frames. Hence, using this with semi-supervised embedding could learn classes that are invariant to those changes. For example, one can take images from two consecutive (or close) frames of video as a neighboring pair with $W_{ij} = 1$; such pairs are likely to have the same label, and are collected cheaply. Frames that are far apart are assigned $W_{ij} = 0$ (see the sketch after this list).
– For text tasks one can use documents to collect unsupervised pairs. For example, one could consider sentences (or paragraphs) of a document as neighbors that contain semantically similar information (they are probably about the same topic).
– Similarly, for speech tasks it might be possible to use audio streams in the same way.
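Here is a small sketch of the video trick above; the frame indices, gap, and far-apart threshold are illustrative assumptions, and it assumes the unlabeled video is longer than the `far` threshold.

```python
import random

def video_pairs(frames, gap=1, far=50):
    # One neighboring pair (consecutive frames, W_ij = 1) and one
    # non-neighboring pair (frames at least `far` steps apart, W_ij = 0).
    i = random.randrange(len(frames) - gap)
    pos = (frames[i], frames[i + gap], 1)
    j = random.randrange(len(frames))
    far_indices = [t for t in range(len(frames)) if abs(t - j) >= far]
    neg = (frames[j], frames[random.choice(far_indices)], 0)
    return pos, neg
```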
26.3.2 When Do We Expect This Approach to Work?
One can see the described approach as an instance of multi-task learning [6] us-
ing unsupervised auxiliary tasks. In common with other semi-supervised learning
approaches, and indeed other deep learning approaches, given a k-nn type ap-
proach to building unlabeled pairs we only expect this to work if p(x) is useful
for the supervised task p(y|x), i.e. if the structure assumption is true. That is,
if the decision rule lies in a region of low density with respect to the distance
metric chosen for k-nearest neighbors. We believe many natural tasks have this
property.
However, if the graph is built using sequence data as described in the previous
section, it is then possible that the method does not rely on the low density
assumption at all. To see this, consider uniform two-dimensional data where the
class label is positive if it is above the y-axis, and negative if it is below. A nearest-
neighbor graph gives no information about the class label, or equivalently there
is no margin to optimize for TSVMs. However, if sequence data (analogous to a
video) only has data points with the same class label in consecutive frames then
this would carry information. Further, no computational cost is associated with
collecting video data for computing the embedding loss, in contrast to building
neighbor graphs. Finally, note that in high dimensional spaces nearest neighbors
might also perform poorly, e.g. in the pixel space of images.
26.3.3 Why Is This Approach Good?
There are a number of reasons why the deep semi-supervised embedding trick
might be useful compared to competing approaches:
– Deep embedding is very easy to optimize by gradient descent as it has a very simple loss function. This means it can be applied to any kind of neural network architecture cheaply and efficiently. As well as being generally applicable, it is also quite easy to implement.
– Compared to a reconstruction-based loss function, such as is used in an autoencoder, the gradient updates of our approach can be much cheaper. In our approach there is an encoding step, but no decoding step; that is, the loss is measured in the usually relatively low-dimensional embedding space. For high-dimensional input data (even if that data is sparse), e.g. text data, the reconstruction can be very slow, e.g. for a bag-of-words representation with a dictionary of tens of thousands of words. Further, in a convolutional-pooling network architecture it might be hard to reconstruct the original data, so again an encoder-decoder system might be hard to build, whereas our method only requires an encoder.
– Our approach does not necessarily require the so-called low density assumption which most other approaches depend upon. Many methods only work on data where that assumption is true (which we do not know in advance in general). Our method may still work, depending on how the pair data is collected. This point was elaborated in the previous subsection.
Table 26.1. Datasets used in our experiments. The first three are small scale datasets
used in the same experimental setup as found in [9, 24, 10]. The following six datasets
are large scale. The Mnist 1h, 6h, 1k, 3k and 60k variants are MNIST with a labeled
subset of data, following the experimental setup in [10]. SRL is a Semantic Role La-
beling task [20] with one million labeled training examples and 631 million unlabeled
examples. COIL100 is an object recognition dataset [19].
data set classes dims points labeled
g50c 2 50 500 50
Text 2 7511 1946 50
Uspst 10 256 2007 50
Mnist1h 10 784 70k 100
Mnist6h 10 784 70k 600
Mnist1k 10 784 70k 1000
Mnist3k 10 784 70k 3000
Mnist60k 10 784 70k 60000
SRL 16 - 631M 1M
COIL100 (30 objects) 30 72x72 pixels 7200 120
COIL100 (100 objects) 100 72x72 pixels 7200 400
26.4 Experimental Evaluation
We test the semi-supervised embedding approach on several datasets summa-
rized in Table 26.1.
26.4.1 Small-Scale Experiments
g50c, Text and Uspst are small-scale datasets often used for semi-supervised
learning experiments [9, 24, 10]. We followed the same experimental setup, aver-
aging results of ten splits of 50 labeled examples where the rest of the data is un-
labeled. In these experiments we test the embedding regularizer on the output of
a neural network (see equation (26.9) and Figure 26.1(a)). We define a two-layer
neural network (NN) with hu hidden units. We define W so that the 10 nearest
neighbors of $i$ have $W_{ij} = 1$, and $W_{ij} = 0$ otherwise. We train for 50 epochs of
stochastic gradient descent and fixed λ =1, but for the first 5 we optimized the
supervised target alone (without the embedding regularizer). This gives two free
hyperparameters: the number of hidden units hu = {0, 5, 10, 20, 30, 40, 50} and
the learning rate lr = {0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}.
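One possible way to build such a $W$ is scikit-learn's k-nearest-neighbor graph; the random placeholder data below only stands in for, e.g., the g50c inputs.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(500, 50)          # placeholder for the (labeled + unlabeled) inputs
W = kneighbors_graph(X, n_neighbors=10, mode='connectivity')  # sparse 0/1 matrix, W_ij = 1 for neighbors
rows, cols = W.nonzero()             # the (i, j) neighbor pairs to sample from during training
```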
We report in Table 26.2 the optimum choices of these values, selected both by 5-fold cross validation and by optimizing on the test set. Note the datasets are
very small, so cross validation is unreliable. Several methods from the literature
optimized their hyperparameters using the test set (those that are not marked
with (cv)). Our EmbedNN is competitive with state-of-the-art semi-supervised
methods based on SVMs, even outperforming them in some cases.
Table 26.2. Results on Small-Scale Datasets. We report the best test error over the hyperparameters of our method, EmbedNN, as in the methodology of [9], as well as the error when optimizing the parameters by cross-validation, EmbedNN(cv). LDS(cv) and LapSVM(cv) also use cross-validation.

                    g50c    Text    Uspst
SVM                 8.32    18.86   23.18
TSVM                5.80    5.71    17.61
LapSVM(cv)          5.4     10.4    12.7
LDS(cv)             5.4     5.1     15.8
Label propagation   17.30   11.71   21.30
Graph SVM           8.32    10.48   16.92
NN                  10.62   15.74   25.13
EmbedNN             5.66    5.82    15.49
EmbedNN(cv)         6.78    6.19    15.84
Table 26.3. Results on MNIST with 100, 600, 1000 and 3000 labels. A two-layer Neural Network (NN) is compared to an NN with Embedding regularizer (EmbedNN) on the output (O), i-th layer (Ii), or auxiliary embedding from the i-th layer (Ai) (see Figure 26.1). A convolutional network (CNN) is also tested in the same way. We compare to SVMs and TSVMs. RBM, SESM, DBN-NCA and DBN-rNCA (marked with (*)) taken from [21, 23] are trained on a different data split.

                Mnist1h   Mnist6h   Mnist1k   Mnist3k
SVM             23.44     8.85      7.77      4.21
TSVM            16.81     6.16      5.38      3.45
RBM(*)          21.5      -         8.8       -
SESM(*)         20.6      -         9.6       -
DBN-NCA(*)      -         10.0      -         3.8
DBN-rNCA(*)     -         8.7       -         3.3
NN              25.81     11.44     10.70     6.04
Embed_O NN      17.05     5.97      5.73      3.59
Embed_I1 NN     16.86     9.44      8.52      6.02
Embed_A1 NN     17.17     7.56      7.89      4.93
CNN             22.98     7.68      6.45      3.35
Embed_O CNN     11.73     3.42      3.34      2.28
Embed_I5 CNN    7.75      3.82      2.73      1.83
Embed_A5 CNN    7.87      3.82      2.76      2.07
Table 26.4. Mnist1h dataset with deep networks of 2, 4, 6, 8, 10 and 15 layers; each hidden layer has 50 hidden units. We compare classical NN training with EmbedNN where we either learn an embedding at the output layer (O) or an auxiliary embedding on all layers at the same time (ALL).

               2      4      6      8      10     15
NN             26.0   26.1   27.2   28.3   34.2   47.7
Embed_O NN     19.7   15.1   15.1   15.0   13.7   11.8
Embed_ALL NN   18.2   12.6   7.9    8.5    6.3    9.3
Table 26.5. Full Mnist60k dataset with deep networks of 2, 4, 6, 8, 10 and 15 layers, using either 50 or 100 hidden units. We compare classical NN training with Embed_ALL NN where we learn an auxiliary embedding on all layers at the same time.

                 2     4     6     8     10    15
NN (HUs=50)      2.9   2.6   2.8   3.1   3.1   4.2
Embed_ALL NN     2.8   1.9   2.0   2.2   2.4   2.6
NN (HUs=100)     2.0   1.9   2.0   2.2   2.3   3.0
Embed_ALL NN     1.9   1.5   1.6   1.7   1.8   2.4
26.4.2 MNIST Experiments
We compare our method in all three different modes (Figure 26.1) with conven-
tional semi-supervised learning (TSVM) using the same data split and validation
set as in [10]. We also compare to several deep learning methods: RBMs (Re-
stricted Boltzmann Machines), SESM (Sparse Encoding Symmetric Machine),
DBN-NCA and DBN-rNCA (Deep Belief Nets - (regularized) Neighbourhood
Components Analysis). (Note, however the latter were trained on a different
data split). In these experiments we consider 2-layer networks (NN) and 6-layer
convolutional neural nets (CNN) for embedding. We optimize the parameters
of NN ( hu = {50, 100, 150, 200, 400} hidden units and learning rates as before)
on the validation set. The CNN architecture is fixed: 5 layers of image patch-
type convolutions, followed by a linear layer of 50 hidden units, similar to [17].
The results given in Table 26.3 show the effectiveness of embedding in all three
modes, with both NNs and CNNs.
26.4.3 Deeper MNIST Experiments
We then conducted a similar set of experiments but with very deep architectures
– up to 15 layers, where each hidden layer has 50 hidden units. Using Mnist1h,
we first compare conventional NNs to Embed_ALL NN, where we learn an auxiliary nonlinear embedding (50 hidden units and a 10-dimensional embedding space) on each layer, as well as Embed_O NN, where we only embed the outputs.
Results are given in Table 26.4. When we increase the number of layers, NNs
trained with conventional backpropagation overfit and yield steadily worse test
error (although they are easily capable of achieving zero training error). In con-
trast, Embed_ALL NN improves with increasing depth due to the semi-supervised “regularization”. Embedding on all layers of the network has made deep learning possible. Embed_O NN (embedding on the outputs) also helps, but not as much.
We also conducted some experiments using the full MNIST dataset, Mnist60k. Again using deep networks of up to 15 layers with either 50 or 100 hidden units, Embed_ALL NN outperforms the standard NN. Results are given in Table 26.5. Despite the lack of extra unlabeled data, we still see the same effect as in the semi-supervised case: the NN overfits with increasing capacity, whereas EmbedNN is more robust (even if it exhibits some overfitting compared to the optimal depth, it is nowhere near as pronounced). Increasing the number of
hidden units is likely to improve these results further, e.g. using 4 layers and 500 hidden units on each layer, one obtains 1.27% using Embed_ALL NN. Overall, these results show that the regularization in Embed_ALL NN is useful in settings outside of semi-supervised learning.
Table 26.6. A deep architecture for Semantic Role Labeling with no prior knowledge outperforms state-of-the-art systems ASSERT and SENNA that incorporate knowledge about parts-of-speech and parse trees. A convolutional network (CNN) is improved by learning an auxiliary embedding (Embed_A1 CNN) for words represented as 100-dimensional vectors using the entire Wikipedia website as unlabeled data.

Method                              Test Error
ASSERT [20]                         16.54%
SENNA [11]                          16.36%
CNN [no prior knowledge]            18.40%
Embed_A1 CNN [no prior knowledge]   14.55%
26.4.4 Semantic Role Labeling
The goal of semantic role labeling (SRL) is, given a sentence and a relation
of interest, to label each word with one of 16 tags that indicate that word’s
semantic role with respect to the action of the relation. For example the sen-
tence "The cat eats the sh in the pond" is labeled in the following way:
"The
ARG0
cat
ARG0
eats
REL
the
ARG1
fish
ARG1
in
ARGMLOC
the
ARGMLOC
pond
ARGMLOC
" where ARG0 and ARG1 effectively indicate the subject and
object of the relation eats” and ARGM-LOC indicates a locational modifier.
The PropBank dataset includes around 1 million labeled words from the Wall
Street Journal. We follow the experimental setup of [11] and train a 5-layer con-
volutional neural network for this task, where the first layer represents the input
sentence words as 50-dimensional vectors. Unlike [11], we do not give any prior
knowledge to our classifier. In that work words were stemmed and clustered us-
ing their parts-of-speech. Our classifier is trained using only the original input
words.
We attempt to improve this system by, as before, learning an auxiliary embed-
ding task. Our embedding is learnt using unlabeled sentences from the Wikipedia
web site, consisting of 631 million words in total using the scheme described in
Section 26.3. The same lookup table of word vectors as in the supervised task is
used as input to an 11 word window around a given word, yielding 550 features.
Then a linear layer projects these features into a 100 dimensional embedding
space. All windows of text from Wikipedia are considered neighbors, and non-
neighbors are constructed by replacing the middle word in a sentence window
with a random word. Our lookup table indexes the most frequently used 30,000
words, and all other words are assigned index 30,001.
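A possible implementation of this pair-construction scheme is sketched below; the helper name and sampling details are our own assumptions, and tokens are assumed to be already mapped to integer indices (with rare words mapped to the extra index).

```python
import random

def wikipedia_window_pairs(tokens, vocab_size=30000, window=11):
    # Neighbors (W_ij = 1): two genuine 11-word windows of unlabeled text.
    # Non-neighbors (W_ij = 0): a genuine window paired with a copy whose
    # middle word is replaced by a random word from the vocabulary.
    def sample_window():
        start = random.randrange(len(tokens) - window)
        return tokens[start:start + window]

    w1, w2 = sample_window(), sample_window()
    corrupted = list(w1)
    corrupted[window // 2] = random.randrange(vocab_size)
    return (w1, w2, 1), (w1, corrupted, 0)
```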
The results in Table 26.6 indicate a clear improvement when learning an
auxiliary embedding. ASSERT [20] is an SVM parser-based system with many
hand-coded features, and SENNA is a NN which uses part-of-speech information
to build its word vectors. In contrast, our system is the only state-of-the-art
Table 26.7. Test Accuracy on COIL100 in various settings. Both 30 and 100 objects
were used following [27]. The semi-supervised embedding algorithm using temporal
coherence of video (Embed CNN) on the last but one layer of an 8 layer CNN, with
various choices of video, outperforms a standard (otherwise identical) 8-layer CNN
and other baselines. (Note that with 100 objects this is a transductive approach, as
we use the test set as unlabeled data during training, whereas with 30 objects a semi-
supervised approach is used.)
Method 30 objects 100 objects
Nearest Neighbor 81.8 70.1
SVM 84.9 74.6
SpinGlass MRF 82.79 69.41
Eigen Spline 84.6 77.0
VTU 89.9 79.1
Standard CNN 84.88 71.49
Embed CNN 95.03 92.25
method that does not use prior knowledge in the form of features derived from
parts-of-speech or parse tree data. The use of neural network techniques for
this application is explored in much more detail in [12], although a different
semi-supervised technique is used in that work.
26.4.5 Object Recognition Using Unlabeled Video
Finally, we detail some experiments using unlabeled video for semi-supervised
embedding, more details of these experiments can be found in [18]. We used the
COIL100 image dataset [19] which contains color pictures of 100 objects, each
72x72 pixels. There are 72 different views for every object, i.e. there are 7200
images in total. The images were obtained by placing the objects on a turntable
and taking a shot for each 5 degree turn. Note that the rotation of the objects
can be viewed as an unlabeled video which we can use in our semi-supervised
embedding approach.
The setup of our experiments is as follows. First, we use a standard convo-
lutional neural network (CNN) without utilizing any temporal information to
establish a baseline. We used an 8-layer network consisting of three sets of con-
volution followed by subsampling layers, a final convolution layer and a fully
connected layer that predicts the outputs.
For comparability with the settings available from other studies on COIL100,
we choose two experimental setups. These are (i) when all 100 objects of COIL
are considered in the experiment and (ii) when only 30 labeled objects out of
100 are studied (for both training and testing). In either case, 4 out of 72 views
(at 0, 90, 180, and 270 degrees) per object are used for training, and the rest of
the 68 views are used for testing. The results are given in Table 26.7 compared
to some existing methods [22, 27, 5]. Note that using 100 objects is a harder
task than using 30 objects (classes).
To use the semi-supervised embedding trick on our CNN for video, we treat
COIL100 as a continuous unlabeled video sequence of rotating objects with 72
consecutive frames per each object (after 72 frames the continuous video switches
object). We perform the embedding on the last but one layer of our 8 layer CNN,
i.e. on the representation yielded by the successive layers of the network just
before the final softmax. For the 100 object result, the test set is hence part of
the unlabeled video (a so-called “transductive” setting). Here we obtained 92.25%
accuracy (Embed CNN) which is much higher than the best alternative method
(VTU) and the standard CNN that we trained.
A natural question is what happens if we do not have access to test data
during training, i.e. the setting is a typical semi-supervised situation rather than a “transductive” setting. To explore this, we used 30 objects as the primary task, i.e. 4 views of each object in this set were used for training, and the rest for test. The other 70 objects were treated only as an unlabeled video sequence
(again, images of each object were put in consecutive frames of a video sequence).
Training with 4 views of 30 objects (labeled data) and 72 views of 70 objects
(unlabeled video sequence) resulted in an accuracy of 95.03% on recognizing 68
views of the 30 objects. This is about 10% above the standard CNN performance.
26.5 Conclusion
In this chapter, we showed how one can improve supervised learning for deep
architectures if one jointly learns an embedding task using unlabeled data.
Researchers using shallow architectures already showed two ways of using em-
bedding to improve generalization: (i) embedding unlabeled data as a separate
pre-processing step (i.e., first layer training) and; (ii) using embedding as a reg-
ularizer (i.e., at the output layer). It appears that similar techniques can also be used for multi-layer neural networks, using the tricks described in this chapter.
References
[1] Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Computation 15(6), 1373–1396 (2003)
[2] Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric frame-
work for learning from Labeled and Unlabeled Examples. Journal of Machine
Learning Research 7, 2399–2434 (2006)
[3] Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training
of deep networks. In: Advances in Neural Information Processing Systems, NIPS
19 (2007)
[4] Bromley, J., Bentz, J.W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Sackinger,
E., Shah, R.: Signature verification using a siamese time delay neural network. In-
ternational Journal of Pattern Recognition and Artificial Intelligence 7(4) (August
1993)
[5] Caputo, B., Hornegger, J., Paulus, D., Niemann, H.: A spin-glass markov random
field for 3-d object recognition. Technical Report LME-TR-2002-01, Institut fur
Informatik, Universitat Erlangen Nurnberg (2002)
[6] Caruana, R.: Multitask Learning. Machine Learning 28(1), 41–75 (1997)
[7] Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. Adaptive com-
putation and machine learning. MIT Press, Cambridge (2006)
[8] Chapelle, O., Weston, J., Schölkopf, B.: Cluster kernels for semi-supervised learn-
ing. In: Becker, S., Thrun, S., Obermayer, K. (eds.) NIPS, vol. 15, pp. 585–592.
MIT Press, Cambridge (2003)
[9] Chapelle, O., Zien, A.: Semi-supervised classification by low density separation.
In: International Conference on Artificial Intelligence and Statistics (AISTATS),
pp. 57–64 (January 2005)
[10] Collobert, R., Sinz, F., Weston, J., Bottou, L.: Large scale transductive svms.
Journal of Machine Learning Research 7, 1687–1712 (2006)
[11] Collobert, R., Weston, J.: Fast semantic extraction using a novel neural network
architecture. In: Proceedings of the 45th Annual Meeting of the Association of
Computational Linguistics, pp. 25–32 (2007)
[12] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. The Journal of Machine Learn-
ing Research 12, 2493–2537 (2011)
[13] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an in-
variant mapping. In: Proc. Computer Vision and Pattern Recognition Conference
(CVPR 2006). IEEE Press (2006)
[14] Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets.
Neural Comp. 18(7), 1527–1554 (2006)
[15] Karlen, M., Weston, J., Erkan, A., Collobert, R.: Large scale manifold transduc-
tion. In: Proceedings of the 25th International Conference on Machine Learning,
pp. 448–455. ACM (2008)
[16] Kruskal, J.B.: Multidimensional scaling by optimizing goodness of fit to a non-
metric hypothesis. Psychometrika 29(1), 1–27 (1964)
[17] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied
to document recognition. Proceedings of the IEEE 86(11) (1998)
[18] Mobahi, H., Collobert, R., Weston, J.: Deep learning from temporal coherence in
video. In: Proceedings of the 26th Annual International Conference on Machine
Learning, pp. 737–744. ACM (2009)
[19] Nene, S.A., Nayar, S.K., Murase, H.: Columbia object image library (coil-100).
Technical Report CUCS-006-96 (February 1996)
[20] Pradhan, S., Ward, W., Hacioglu, K., Martin, J., Jurafsky, D.: Shallow semantic
parsing using support vector machines. In: Proceedings of HLT/NAACL 2004
(2004)
[21] Ranzato, M., Huang, F., Boureau, Y., LeCun, Y.: Unsupervised learning of invari-
ant feature hierarchies with applications to object recognition. In: Proc. Computer
Vision and Pattern Recognition Conference (CVPR 2007). IEEE Press (2007)
[22] Roobaert, D., Van Hulle, M.: View-based 3d object recognition with support vec-
tor machines. In: IEEE International Workshop on Neural Networks for Signal
Processing, pp. 77–84 (1999)
[23] Salakhutdinov, R., Hinton, G.: Learning a Nonlinear Embedding by Preserving
Class Neighbourhood Structure. In: International Conference on Artificial Intelli-
gence and Statistics, AISTATS (2007)
[24] Sindhwani, V., Niyogi, P., Belkin, M.: Beyond the point cloud: from transductive
to semi-supervised learning. In: International Conference on Machine Learning,
ICML (2005)
[25] Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for
nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000)
[26] Vapnik, V.: Statistical Learning Theory. John Wiley and Sons, New York (1998)
[27] Wersing, H., Körner, E.: Learning optimized features for hierarchical models of
invariant recognition. Neural Computation 15(7), 1559–1599 (2003)
[28] Weston, J., Ratle, F., Collobert, R.: Deep learning via semi-supervised embedding.
In: Proceedings of the 25th International Conference on Machine Learning, pp.
1168–1175. ACM (2008)
[29] Williams, C.K.I.: On a connection between kernel PCA and metric multidimen-
sional scaling. In: Advances in Neural Information Processing Systems, NIPS 13
(2001)
[30] Zhu, X., Ghahramani, Z.: Learning from labeled and unlabeled data with label
propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University
(2002)
Identifying Dynamical Systems for Forecasting
and Control
Preface
Identifying dynamical systems from data is a promising approach to data fore-
casting and optimal control. Data forecasting is an essential component of ratio-
nal decision making in quantitative finance, marketing and planning. Optimal
control systems, that is, systems that can sense the environment and react ap-
propriately, enable the design of cost efficient gas turbines, smart grids and
human-machine interfaces.
A successful architecture for the task of modeling a dynamical system is the
recurrent neural network (RNN). The state of the dynamical system is repre-
sented by the set of units that compose the network and the transition between
two consecutive states is determined by the recurrent connection between these
units. The network can be trained with backpropagation through time, that is,
standard backpropagation on the RNN unfolded in time. Recurrent neural net-
works are notoriously difficult to train, especially when the neural network has to model long-term dependencies. Indeed, the local learning signal is of little use when local variations (high frequency components of the time series) do not reflect global trends (low frequency components).
This state of affairs has led many to seek alternatives to backpropagation
through time. A radical departure from backpropagation is the Echo State Net-
work [2]. The idea behind echo state networks is simple: (1) Create a huge neural
network with random recurrent connections and (2) fit a linear model between
the activations of the network and the time series to predict. The huge random
RNN is called a “reservoir” and implements an overcomplete set of nonlinear primitives, only a subset of which are useful for modeling the time series to predict. Tuning the reservoir to produce the most task-relevant primitives
requires some practical experience. Best practices for tuning echo state networks
are described in Chapter 27 [3].
An alternative approach for overcoming the inherent difficulties of backprop-
agation is to pay particular attention to the structure of the RNN. A carefully
designed RNN helps error derivatives to flow on larger time scales. Tricks such
as overshooting, error correction neural networks or variant-invariant separation
are introduced in Chapter 28 [6]. The same type of networks can be applied
to the identification of state-action representation for control systems. Chapter
29 [1] shows how to apply recurrent neural networks to the identification of a
full-fledged Markov decision process from the observation of an existing control
system. This so-called Markov decision process extraction network (MPEN) en-
courages the emergence of a joint state-action representation that best captures
the relevant information for the control task.
Q-learning [5] is a popular reinforcement learning algorithm for control sys-
tems. It associates to each state-action pair a Q-value that indicates how close to the goal the action at a given state brings us. Q-values are determined dynamically by the Q-learning algorithm, as a result of the exploration of the state-action space by the controller. The question remains how Q-values can generalize in a continuous state space. Chapter 30 [4] answers this ques-
tion and provides a practical guide to set up step-by-step neural reinforcement
controllers.
Grégoire & Klaus
References
[1] Duell, S., Udluft, S., Sterzing, V.: Solving Partially Observable Reinforcement
Learning Problems with Recurrent Neural Networks. In: Montavon, G., Orr, G.B.,
Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–
707. Springer, Heidelberg (2012)
[2] Jaeger, H., Haas, H.: Harnessing Nonlinearity: Predicting Chaotic Systems and
Saving Energy in Wireless Communication. Science 304(5667), 78–80 (2004)
[3] Lukoševičius, M.: A Practical Guide to Applying Echo State Networks. In: Mon-
tavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade, 2nd edn. LNCS,
vol. 7700, pp. 659–686. Springer, Heidelberg (2012)
[4] Riedmiller, M.: 10 Steps and Some Tricks to Set Up Neural Reinforcement Con-
trollers. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 735–757. Springer, Heidelberg (2012)
[5] Watkins, C.J.C.H.: Learning from Delayed Rewards. Ph.D. thesis, Cambridge Uni-
versity (1989)
[6] Zimmermann, H.-G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural
Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–707. Springer, Heidelberg (2012)
27
A Practical Guide to Applying
Echo State Networks
Mantas Lukoševičius
Jacobs University Bremen, Campus Ring 1,
28759 Bremen, Germany
m.lukosevicius@jacobs-university.de
Abstract. Reservoir computing has emerged in the last decade as an
alternative to gradient descent methods for training recurrent neural net-
works. Echo State Network (ESN) is one of the key reservoir computing
“flavors”. While being practical, conceptually simple, and easy to imple-
ment, ESNs require some experience and insight to achieve the hailed
good performance in many tasks. Here we present practical techniques
and recommendations for successfully applying ESNs, as well as some
more advanced application-specific modifications.
27.1 Introduction
Training Recurrent Neural Networks (RNNs) is inherently difficult. This (de-)
motivates many to avoid them altogether. RNNs, however, represent a very pow-
erful generic tool, integrating both large dynamical memory and highly adaptable
computational capabilities. They are the Machine Learning (ML) model most
closely resembling biological brains, the substrate of natural intelligence.
Error backpropagation (BP) [40] is to this date one of the most important
achievements in artificial neural network training. It has become the standard
method to train especially Feed-Forward Neural Networks (FFNNs). Many use-
ful practical aspects of BP are discussed in other chapters of this book and in
its previous edition, e.g., [26]. BP methods have also been extended to RNNs
[51, 52], but only with a partial success. One of the conceptual limitations of
BP methods for RNNs is that bifurcations can make training non-converging
[8]. Even when they do converge, this convergence is slow, computationally ex-
pensive, and can lead to poor local minima.
Ten years ago an alternative trend of understanding, training, and using RNNs
has been proposed with Echo State Networks (ESNs) [16, 21] in ML, and Liq-
uid State Machines (LSMs) [32] in computational neuroscience. It was shown
that RNNs often work well enough even without full adaptation of all network
weights. In the classical ESN approach the RNN (called reservoir) is generated
randomly, and only the readout from the reservoir is trained. It should be noted
that this basic idea was first clearly spelled out in a neuroscientific model of the
corticostriatal processing loop [7]. Perhaps surprisingly this approach yielded
excellent performance in many benchmark tasks, e.g., [16, 15, 19, 22, 47, 48].
The trend started by ESNs and LSMs became lately known as Reservoir
Computing (RC) [49]. RC is currently a prolific research area, giving important
insights into RNNs, procuring practical machine learning tools, as well as en-
abling computation with non-conventional hardware [31]. RC today subsumes
a number of related methods and extensions of the original idea [29], but the
original ESN approach still holds its ground for its simplicity and power.
The latest developments in BP for RNNs, second-order gradient descent meth-
ods called Hessian-free optimization, presented in [34] and discussed in a chapter
[35] of this book, alleviate some of the mentioned shortcomings. In particular,
they perform better on problems which require long memory [34]. These are
known to be hard for BP RNN training [1], unless networks are specifically de-
signed to deal with them [13]. Structural damping of Hessian-free optimization
[34], an online adaptation of the learning process which penalizes big changes in
RNN activations, likely tends to drive the learning process away from passing
through many bifurcations (that are exactly big changes in activations and can
probably be anticipated and avoided to some extent). On a benchmark suite
designed to challenge long short-term memory acquisition, ESNs however still
outperform Hessian-free trained RNNs [23].
ESNs from their beginning proved to be a highly practical approach to RNN
training. It is conceptually simple and computationally inexpensive. It reinvigo-
rated interest in RNNs, by making them accessible to wider audiences. However,
the apparent simplicity of ESNs can sometimes be deceptive. Successfully apply-
ing ESNs needs some experience. There is a number of things that can be done
wrong. In particular, the initial generation of the raw reservoir network is influ-
enced by a handful of global parameters, and these have to be set judiciously. The
same, however, can be said about virtually every ML technique. Techniques and
recommendations on successfully applying ESNs will be addressed in this work.
We will try to organize the “best practices” of ESNs into a logical order despite
the fact that they are often non-sequentially interconnected. We will start with
defining the ESN model and the basic learning procedure in Section 27.2. Then
we will spell out guidelines on producing good reservoirs in Section 27.3, various
aspects of training different types of readouts in Section 27.4, and dealing with
output feedback in Section 27.5. We will end with a short summary in Section 27.6.
27.2 The Basic Model
ESNs are applied to supervised temporal ML tasks where for a given training input signal $u(n) \in \mathbb{R}^{N_u}$ a desired target output signal $y^{\text{target}}(n) \in \mathbb{R}^{N_y}$ is known. Here $n = 1, \ldots, T$ is the discrete time and $T$ is the number of data points in the training dataset. In practice the dataset can consist of multiple sequences of varying lengths, but this does not change the principles. The task is to learn a model with output $y(n) \in \mathbb{R}^{N_y}$, where $y(n)$ matches $y^{\text{target}}(n)$ as well as possible, minimizing an error measure $E(y, y^{\text{target}})$, and, more importantly, generalizes well to unseen data. The error measure $E$ is typically a Mean-Square Error (MSE), for example a Root-Mean-Square Error (RMSE)
$$E(y, y^{\text{target}}) = \frac{1}{N_y} \sum_{i=1}^{N_y} \sqrt{\frac{1}{T} \sum_{n=1}^{T} \big( y_i(n) - y_i^{\text{target}}(n) \big)^2 }, \qquad (27.1)$$

which is also averaged over the $N_y$ dimensions $i$ of the output here.
The RMSE can also be dimension-wise normalized (divided) by the variance of the target $y^{\text{target}}(n)$, producing a Normalized Root-Mean-Square Error (NRMSE). The NRMSE has an absolute interpretation: it does not depend on the arbitrary scaling of the target $y^{\text{target}}(n)$, and the value of 1 can be achieved with a simple constant output $y(n)$ set to the mean value of $y^{\text{target}}(n)$. This suggests that a reasonable model of a stationary process should achieve an NRMSE accuracy between zero and one.
The normalization and the square root parts are more for human interpretability, as the optimal output $y$ minimizing any MSE is equivalent to the one minimizing (27.1), as long as no additional penalties or weighting are introduced into the equation, such as discussed in Sections 27.4.2 and 27.4.6.
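A small numpy sketch of these error measures follows; it divides the RMSE of each output dimension by the standard deviation of the corresponding target (equivalently, the MSE by its variance), which is one common convention and our own illustrative choice here.

```python
import numpy as np

def nrmse(y, y_target):
    # y and y_target have shape (N_y, T): outputs over T time steps.
    y, y_target = np.atleast_2d(y), np.atleast_2d(y_target)
    rmse = np.sqrt(np.mean((y - y_target) ** 2, axis=1))      # Eq. (27.1) per dimension
    return np.mean(rmse / np.std(y_target, axis=1))           # 1.0 for a constant mean predictor
```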
ESNs use an RNN type with leaky-integrated discrete-time continuous-value units. The typical update equations are

$$\tilde{x}(n) = \tanh\!\big( W^{\text{in}} [1; u(n)] + W x(n-1) \big), \qquad (27.2)$$
$$x(n) = (1 - \alpha)\, x(n-1) + \alpha \tilde{x}(n), \qquad (27.3)$$

where $x(n) \in \mathbb{R}^{N_x}$ is a vector of reservoir neuron activations and $\tilde{x}(n) \in \mathbb{R}^{N_x}$ is its update, all at time step $n$, $\tanh(\cdot)$ is applied element-wise, $[\cdot\,; \cdot]$ stands for a vertical vector (or matrix) concatenation, $W^{\text{in}} \in \mathbb{R}^{N_x \times (1+N_u)}$ and $W \in \mathbb{R}^{N_x \times N_x}$ are the input and recurrent weight matrices respectively, and $\alpha \in (0, 1]$ is the leaking rate. Other sigmoid wrappers can be used besides the tanh, which however is the most common choice. The model is also sometimes used without the leaky integration, which is a special case of $\alpha = 1$ and thus $\tilde{x}(n) \equiv x(n)$.
Fig. 27.1. An echo state network
The linear readout layer is defined as

$$y(n) = W^{\text{out}} [1; u(n); x(n)], \qquad (27.4)$$

where $y(n) \in \mathbb{R}^{N_y}$ is the network output, $W^{\text{out}} \in \mathbb{R}^{N_y \times (1+N_u+N_x)}$ the output weight matrix, and $[\cdot\,; \cdot\,; \cdot]$ again stands for a vertical vector (or matrix) concatenation. An additional nonlinearity can be applied to $y(n)$ in (27.4), as well as feedback connections $W^{\text{fb}}$ from $y(n-1)$ to $\tilde{x}(n)$ in (27.2). A graphical representation of an ESN illustrating our notation and the idea for training is depicted in Figure 27.1.
The original method of RC introduced with ESNs [16] was to:
1. generate a large random reservoir RNN ($W^{\text{in}}$, $W$);
2. run it using the training input $u(n)$ and collect the corresponding reservoir activation states $x(n)$;
3. compute the linear readout weights $W^{\text{out}}$ from the reservoir using linear regression, minimizing the MSE between $y(n)$ and $y^{\text{target}}(n)$;
4. use the trained network on new input data $u(n)$, computing $y(n)$ by employing the trained output weights $W^{\text{out}}$.
In subsequent sections we will delve deeper into the hidden intricacies of this procedure which appears so simple on the surface, and spell out practical hints for the concrete design choices that wait on the way. More specifically, Step 1 is elaborated on in Section 27.3; Step 2 is done by Equations (27.2) and (27.3), with initialization discussed in Section 27.4.5; Step 3 is formally defined and options explained in Section 27.4, with additional options for some particular applications in Section 27.5; and Step 4 is again performed by Equations (27.2), (27.3), and (27.4).
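The whole procedure fits in a few lines of numpy. The sketch below follows Equations (27.2)-(27.4) and Steps 1-3 with a dense reservoir and a ridge-regression readout (one of the readout options discussed in Section 27.4); all global parameter values are illustrative defaults of ours, not recommendations.

```python
import numpy as np

def esn_fit(U, Y_target, N_x=200, rho=0.9, alpha=0.3, reg=1e-8, seed=42):
    # U: inputs of shape (T, N_u); Y_target: targets of shape (T, N_y).
    rng = np.random.default_rng(seed)
    T, N_u = U.shape
    W_in = rng.uniform(-0.5, 0.5, (N_x, 1 + N_u))         # Step 1: random input weights
    W = rng.uniform(-0.5, 0.5, (N_x, N_x))                # dense reservoir for simplicity
    W *= rho / max(abs(np.linalg.eigvals(W)))             # rescale to spectral radius rho

    X = np.zeros((T, 1 + N_u + N_x))                      # Step 2: collect states [1; u(n); x(n)]
    x = np.zeros(N_x)
    for n in range(T):
        u = np.concatenate(([1.0], U[n]))
        x_tilde = np.tanh(W_in @ u + W @ x)               # Eq. (27.2)
        x = (1 - alpha) * x + alpha * x_tilde             # Eq. (27.3)
        X[n] = np.concatenate((u, x))

    # Step 3: readout weights by (regularized) linear regression.
    W_out = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y_target).T
    return W_in, W, W_out                                 # Step 4: y(n) = W_out [1; u(n); x(n)], Eq. (27.4)
```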
27.3 Producing a Reservoir
For producing a good reservoir it is important to understand what function it is
serving.
27.3.1 Function of the Reservoir
In practice it is important to keep in mind that the reservoir acts (i) as a
nonlinear expansion and (ii) as a memory of input u(n) at the same time.
There is a parallel between RC and kernel methods in ML. The reservoir can
be seen as (i) a nonlinear high-dimensional expansion x(n) of the input signal
u(n). For classification tasks, input data u(n) which are not linearly separable in
the original space $\mathbb{R}^{N_u}$, often become so in the expanded space $\mathbb{R}^{N_x}$ of $x(n)$, where they are separated by $W^{\text{out}}$. In fact, employing the “kernel trick” to integrate
over all possible reservoirs is also possible in the context of RC, even though not
really practical [12].
At the same time, (ii) the reservoir serves as a memory, providing tempo-
ral context. This is a crucial reason for using RNNs in the first place. In the
tasks where memory is not necessary, non-temporal ML techniques implement-
ing functional mappings from current input to current output should be used.
Both aspects (i) and (ii) combined, the reservoir, being an input-driven dy-
namical system, should provide a rich and relevant enough signal space in x(n),
such that the desired $y^{\text{target}}(n)$ could be obtained by linear combination from it.
There is however some trade-off between (i) and (ii) when setting the parameters
of the reservoir [50], which we will explain in more detail.
27.3.2 Global Parameters of the Reservoir
Given the RNN model (27.2),(27.3), the reservoir is defined by the tuple $(W^{\text{in}}, W)$. The input and recurrent connection matrices $W^{\text{in}}$ and $W$ are generated randomly according to some parameters discussed later and the leaking
erated randomly according to some parameters discussed later and the leaking
rate α is selected as a free parameter itself.
In analogy to other ML, and especially NN, approaches, what we call “param-
eters” here could as well be called “meta-parameters” or “hyper-parameters”, as
they are not concrete connection weights but parameters governing their distri-
butions. We will call them “global parameters” to better reflect their nature, or
simply “parameters” for brevity.
The defining global parameters of the reservoir are: the size $N_x$, sparsity, distribution of nonzero elements, and spectral radius of $W$; scaling(-s) of $W^{\text{in}}$;
and the leaking rate α. We will now proceed in this order to give more details on
each of these design choices and intuitions on how to make them. Then, in Section
27.3.3, we will summarize by advising how to prioritize these global parameters
and tune the really important, or rather task-specific, ones in a principled way.
Size of the Reservoir. One obviously crucial parameter of the model
(27.2),(27.3) is $N_x$, the number of units in the reservoir.
The general wisdom is that the bigger the reservoir, the better the obtain-
able performance, provided appropriate regularization measures are taken against
overfitting (see Section 27.4.1). Since training and running an ESN is computa-
tionally cheap compared to other RNN approaches, reservoir sizes of order $10^4$ are not uncommon [47]. The bigger the space of reservoir signals $x(n)$, the easier it is to find a linear combination of the signals to approximate $y^{\text{target}}(n)$. In our experience the reservoir can be too big only when the task is trivial and there is not enough data available, $T < 1 + N_u + N_x$.
For challenging tasks use as big a reservoir as you can
afford computationally.
That being said, computational trade-offs are important. In academic settings,
when comparing different approaches instead of going for the best possible per-
formance, authors often limit their reservoir sizes for convenience and compat-
ibility of results. Even when going for the best performance, starting with the
biggest possible reservoir from the beginning is cumbersome.
Select global parameters with smaller reservoirs, then
scale to bigger ones.
The tuning of global parameters (described below) often needs multiple trials,
thus each should not consume too much time. Good parameters are usually
transferable to bigger reservoirs, but some trials with big reservoirs can also be
done to confirm this.
A lower bound for the reservoir size N_x can roughly be estimated by considering the number of independent real values that the reservoir must remember from the input to successfully accomplish the task. The maximal number of stored values, called memory capacity, in an ESN cannot exceed N_x [17].
N_x should be at least equal to the estimate of independent real values the reservoir has to remember from the input to solve its task.
For i.i.d. inputs u(n), this estimate is N_u times a rough estimate of how many time steps the inputs should be remembered for to solve the task. While the result in [17] is precise for i.i.d. inputs, in practice there are often temporal and inter-channel correlations in u(n) that make it somewhat “compressible”. Also, the shapes of the “forgetting curves” of the reservoirs are typically not rectangular (they depend on other parameters), i.e., the forgetting is not instantaneous but gradual. As a result, the reservoir can often make do with smaller sizes.
Sparsity of the Reservoir. In the original ESN publications it is recommended to make the reservoir connections sparse, i.e., to make most of the elements of W equal to zero. In our practical experience sparse connections also tend to give a slightly better performance. In general, the sparsity of the reservoir does not affect the performance much and this parameter has a low priority for optimization. However, sparsity enables fast reservoir updates if sparse matrix representations are used.
Connect each reservoir node to a small fixed number of other nodes (e.g., 10) on average, irrespective of the reservoir size. Exploit this reservoir sparsity to speed up computation.
If a fixed fan-out number is chosen regardless of reservoir size, the computational cost of network state updates grows only linearly with the network size instead of quadratically. This greatly reduces the cost of running big reservoirs. The computational savings require virtually no additional effort when the programming environment supports efficient representation of, and operations with, sparse matrices, which many do.
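As an illustration, here is a minimal sketch (in Python, using NumPy and SciPy; the function and variable names are our own, not taken from any particular RC toolbox) of generating a reservoir with a fixed average fan-out in a sparse representation:

    import numpy as np
    import scipy.sparse as sparse

    def generate_sparse_reservoir(n_x, fanout=10, seed=42):
        # Reservoir matrix W with, on average, `fanout` nonzero entries per row,
        # values drawn from a symmetric uniform distribution, stored sparsely.
        rng = np.random.default_rng(seed)
        density = min(1.0, fanout / n_x)
        W = sparse.random(n_x, n_x, density=density, random_state=seed,
                          data_rvs=lambda size: rng.uniform(-0.5, 0.5, size))
        return W.tocsr()   # CSR format: W.dot(x) then costs O(fanout * n_x), not O(n_x^2)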
Distribution of Nonzero Elements. The matrix W is typically generated sparse, with nonzero elements having either a symmetrical uniform, discrete bi-valued, or normal distribution centered around zero. Different authors prefer different distributions. We usually prefer a uniform distribution for its continuity of values and boundedness. Gaussian distributions are also popular. Both distributions give virtually the same performance, which depends on the other parameters discussed here. The discrete bi-valued distribution tends to give a slightly less rich signal space (there is a non-zero probability of identical neurons), but might make analysis of what is happening in the reservoir easier. The width of the distributions does not matter, as it is reset in a way explained in the next section.
The input matrix W^in is usually generated according to the same type of distribution as W, but typically dense.
Spectral Radius. One of the most central global parameters of an ESN is the spectral radius of the reservoir connection matrix W, i.e., the maximal absolute eigenvalue of this matrix. It scales the matrix W, or, viewed alternatively, the width of the distribution of its nonzero elements.
Typically a random sparse W is generated; its spectral radius ρ(W) is com-
puted; then W is divided by ρ(W) to yield a matrix with a unit spectral radius;
this is then conveniently scaled with the ultimate spectral radius to be deter-
mined in a tuning procedure.
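A minimal sketch of this scaling procedure in Python (assuming the sparse matrix W from the illustrative sketch above; scale_spectral_radius is our own helper name):

    from scipy.sparse.linalg import eigs

    def scale_spectral_radius(W, rho_desired):
        # Largest-magnitude eigenvalue of the (sparse) reservoir matrix W.
        rho_W = abs(eigs(W, k=1, which='LM', return_eigenvectors=False)[0])
        return W * (rho_desired / rho_W)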
For the ESN approach to work, the reservoir should satisfy the so-called echo
state property: the state of the reservoir x(n) should be uniquely defined by the
fading history of the input u(n) [16]. In other words, for a long enough input
u(n), the reservoir state x(n) should not depend on the initial conditions that
were before the input.
Large ρ(W) values can lead to reservoirs hosting multiple fixed point, peri-
odic, or even chaotic (when sufficient nonlinearity in the reservoir is reached)
spontaneous attractor modes, violating the echo state property.
ρ(W) < 1 ensures echo state property in most
situations.
Even though it is possible to violate the echo state property even with ρ(W) < 1, this is unlikely to happen in practice. More importantly, the echo state property often holds for ρ(W) ≥ 1 for nonzero inputs u(n). This can be explained by the strong u(n) pushing activations of the neurons away from 0, where their tanh() nonlinearities have a unitary slope, to regions where this slope is smaller, thus reducing the gains of the neurons and the effective strength of the feedback connections. Intuitively speaking, due to activation-squashing nonlinearities, strong inputs “squeeze out” the autonomous activity from the reservoir activations. This means that for nonzero u(n), ρ(W) < 1 is not a necessary condition for the echo state property and optimal ρ(W) values can sometimes be significantly greater than 1.
In practice ρ(W) should be selected to maximize the performance, with the
value 1 serving as an initial reference point.
As a guiding principle, ρ(W) should be set greater for tasks where a more
extensive history of the input is required to perform it, and smaller for tasks
where the current output y(n) depends more on the recent history of u(n).
The spectral radius determines how fast the influence of an input dies out in a
reservoir with time, and how stable the reservoir activations are [50].
The spectral radius should be greater in tasks
requiring longer memory of the input.
Input Scaling. The scaling of the input weight matrix W^in is another key parameter to optimize in an ESN. For uniformly distributed W^in we usually define the input scaling a as the range of the interval [−a; a] from which values of W^in are sampled; for normally distributed input weights one may take the standard deviation as a scaling measure.
In order to have a small number of freely adjustable parameters, often all the columns of W^in are scaled together using a single scaling value. However, the scaling of the first column of W^in, corresponding to the bias input to the reservoir units in (27.2), can be optimized separately from the rest. If the remaining “active” input channels contribute to the task in very different ways, it is also advised to optimize their scalings separately.
Scale the whole W^in uniformly to have few global parameters in the ESN. However, to increase the performance: scale the first column of W^in (i.e., the bias inputs) separately; scale other columns of W^in separately if channels of u(n) contribute differently to the task.
This varies the number of free global parameters to set for W^in from 1 up to N_u + 1.
It was suggested in the original ESN publications to scale and shift the input
data, optimizing the magnitude of both. But the same effect can be achieved by
scaling the input weights of the active inputs and the bias, respectively.
Still, input data normalization is advisable for ESNs just as for any other
ML approach. This puts each learning task into a more standardized set-
ting. It may be helpful to have the range of the input data values bounded.
For example, apply the tanh(·) squashing to u(n) if its distribution is unbounded.
Otherwise the outliers can throw the reservoir state x(n) into some “unfamiliar”
regions not well covered by the usual working trajectories of x(n) for which the
global parameters have been optimized or the outputs learned. This can lead to a
virtual loss of useful memory (due to saturations in the activation nonlinearities)
or unpredictable outputs at these points, respectively.
It is advisable to normalize the data, and it may help to keep the inputs u(n) bounded, avoiding outliers (e.g., apply tanh(·) to u(n) if it is unbounded).
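For instance, a minimal preprocessing sketch along these lines (the function name and array layout, channels by time steps, are illustrative assumptions):

    def normalize_inputs(U):
        # Zero-mean, unit-variance normalization per input channel,
        # followed by tanh squashing to keep values bounded in (-1, 1).
        U = (U - U.mean(axis=1, keepdims=True)) / (U.std(axis=1, keepdims=True) + 1e-8)
        return np.tanh(U)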
Input scaling determines how nonlinear the reservoir responses are. For very linear tasks W^in should be small, letting units operate around the 0 point where their activation tanh(·) is virtually linear. For large W^in, the units will get easily saturated close to their −1 and 1 values, acting in a more nonlinear, binary switching manner. While ρ(W) also affects the nonlinearity, the reservoir activations become unstable when increasing ρ(W), as explained in Section 27.3.2, before it can make the reservoir highly nonlinear.
The amount of nonlinearity the task requires is not easy to judge. Finding
a proper setting benefits from experience and intuitive insight into nonlinear
dynamics. But also the masters of RC (if there are such) use trial and error to
tune this characteristic.
Looking at (27.2), it is clear that the scaling of W^in, together with the scaling of W (i.e., ρ(W)), determines the proportion of how much the current state x(n) depends on the current input u(n) and how much on the previous state x(n−1), respectively. The respective sizes N_u and N_x should also be taken into account.
The input scaling regulates:
the amount of nonlinearity of the reservoir represen-
tation x(n) (also increasing with ρ(W));
the relative effect of the current input on x(n) as
opposed to the history (in proportion to ρ(W)).
It has been empirically observed that the representation of the different principal components of u(n) in x(n) is roughly proportional to the square root of their magnitudes in u(n) [11]. In other words, the reservoir tends to flatten the spectrum of principal components of u(n) in x(n), which is something to keep in mind when choosing the right representation or preprocessing of the data. For example, if smaller principal components carry no useful information it might be helpful to remove them from the data by Principal Component Analysis (PCA) before feeding them to a reservoir, otherwise they will get relatively amplified there.
Leaking Rate. The leaking rate α of the reservoir nodes in (27.3) can be regarded as the speed of the reservoir update dynamics discretized in time. We can describe the reservoir update dynamics in continuous time as an Ordinary Differential Equation (ODE)

ẋ = −x + tanh(W^in [1; u] + W x).   (27.5)
If we make an Euler discretization of this ODE (27.5) in time, taking

Δx/Δt = (x(n+1) − x(n))/Δt ≈ ẋ,   (27.6)

we arrive at exactly (up to some time indexing conventions) the discrete time equations (27.2),(27.3), with α taking the place of the sampling interval Δt. Thus α can be regarded as the time interval in the continuous world between two consecutive time steps in the discrete realization. Also, empirically the effect of setting α is comparable to that of re-sampling u(n) and y^target(n) when the signals are slow [27, 41]. The leaking rate α can even be adapted online to deal with time warping of the signals [27, 22]. Equivalently, α can be introduced as a time constant in (27.5), if keeping Δt = 1.
While there are some slight alternative variations to (27.3) of how to do leaky integration (e.g., [22]), the version (27.3) has emerged as preferred, because it guarantees that x(n) never goes outside the (−1, 1) interval.
Set the leaking rate α in (27.3) to match the speed of the dynamics of u(n) and/or y^target(n).
This can, again, be difficult and subjective to determine in some cases, especially when the timescales of u(n) and y^target(n) are quite different. This is one more of the global parameters to be tuned by trial and error.
When the task requires modeling the dynamical system producing the time series on multiple time scales, it might be useful to set different leaking rates for different units (making α a vector α ∈ ℝ^{N_x}) [43], with a possible downside of having more parameters to optimize.
Alternatively, the leaky integration (27.3) can be seen as a simple digital
low-pass filter, also known as exponential smoothing, applied to every node.
Some contributions even suggest applying more powerful filters for this purpose
[53, 14].
In some cases setting a small α, and thus inducing slow dynamics of x(n), can dramatically increase the duration of the short-term memory in an ESN [23].
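To make the role of α concrete, here is a minimal sketch of running a leaky-integrated reservoir over an input sequence, restating the update (27.2),(27.3) under our notational assumptions (run_reservoir and its arguments are illustrative names):

    def run_reservoir(U, W, W_in, alpha):
        # U: inputs of shape (N_u, T); returns collected states X of shape (N_x, T).
        n_x, T = W.shape[0], U.shape[1]
        x = np.zeros(n_x)
        X = np.zeros((n_x, T))
        for n in range(T):
            u = np.concatenate(([1.0], U[:, n]))          # bias + input, as in (27.2)
            x_tilde = np.tanh(W_in @ u + W.dot(x))        # update proposal
            x = (1 - alpha) * x + alpha * x_tilde         # leaky integration (27.3)
            X[:, n] = x
        return X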
27.3.3 Practical Approach to Reservoir Production
Prioritizing Parameters. While all the ESN reservoir parameters discussed
in Section 27.3.2 have their guiding intuitions in setting them, fixing some of
them is more straightforward than others.
The main three parameters to optimize in an ESN
reservoir are:
input scaling(-s);
spectral radius;
leaking rate(-s).
These three parameters, discussed in Sections 27.3.2, 27.3.2, and 27.3.2 respec-
tively, are very important for a good performance and are quite task-specific.
The reservoir size N_x almost comes as an external restriction (Section 27.3.2), and the rest of the parameters can be set to reasonable default values: reservoir sparseness (Section 27.3.2), weight distribution (Section 27.3.2), or details of the model (27.2),(27.3). It is still worth investigating several options for them, as a lower priority.
As explained before, the performance can also be additionally improved in many cases by “splitting” a single parameter into several. Setting different scalings for the columns of W^in (corresponding to the bias input and possibly to different dimensions of the input if they are of a different nature) can go a long way. Also, setting leaking rates α differently for different units (e.g., by splitting them into several sub-populations with constant values) can help a lot in multi-timescale tasks.
Setup for Parameter Selection. One of the main advantages of ESNs is that
learning the outputs is fast. This should be exploited in evaluating how good a
reservoir generated by a particular set of parameters is.
The most pragmatic way to evaluate a reservoir is to
train the output (27.4) and measure its error.
Either the validation or the training error can be used. Validation is, of course, preferred if there is a danger of overfitting. The training error has the advantage of requiring less data and, in some cases, no rerun of the trained network. If a validation data set is necessary for the output training (as explained in Section 27.4), the error on it might be utilized with no additional cost, as a compromise between the training set and a yet separate second validation set.
If training of the output and validation is not fast enough, smaller initial
reservoirs (as stated in Section 27.3.2), or a reduced representative data set can
be used. For the same reason it is often an overkill to use a k-fold cross-validation
in global parameter optimization, at least in initial stages, unless the data are
really scarce.
It is important to keep in mind that randomly generated reservoirs, even with the same parameters, vary slightly in their performance. This variance is ever present but is typically more pronounced with smaller reservoirs than with bigger ones; the random variations inside of a big reservoir tend to “average out”. It is nonetheless important to keep this random fluctuation of performance separate from the one caused by different parameter values.
To eliminate the random fluctuation of performance,
keep the random seed fixed and/or average over several
reservoir samples.
Fixing a random seed in the programming environment before generating the
reservoirs makes the random aspect of the reservoirs identical across trials and
thus the experiments deterministically repeatable. Using a single reservoir is
faster, but with an obvious danger of below-average performance and/or overfit-
ting the parameters to a particular instance of a randomly generated reservoir:
good parameters might not carry over well to a different software implementa-
tion, or, e.g., different size of a reservoir.
Manual Parameter Selection. Manual selection of parameters is unavoid-
able to some extent in virtually all ML approaches. Even when parameters are
learned or selected through automated search, it is typically necessary to set
meta-parameters (or rather “meta-meta-parameters”) for these procedures.
When manually tuning the reservoir parameters,
change one parameter at a time.
Changes in several parameters at once often have opposing effects on performance, and it is then impossible to tell which change contributed what. A reasonable approach is to set a single parameter to a good enough value before starting to change another one, and to repeat this until the performance is satisfactory.
It is also advisable to take notes or log the performance automatically for
extended optimizations, in order not to “go in circles” when repeating the same
parameter values.
An empirical gradient direction can be estimated for a parameter by making a small change to it and observing the change in performance. However, the error landscapes are often non-convex, and trying distant values of the parameters can sometimes lead to dramatic improvements.
Always plot samples of reservoir activation signals
x(n) to have a feeling of what is happening inside the
reservoir.
This may reveal that x(n) are over-saturated, under-activated, exhibiting au-
tonomous cyclic or chaotic behavior, etc. Overall, plotting information additional
to the error rate helps a lot in gaining more insight into how the parameters
should be changed.
Typically, good average performance is not found in a very narrow parameter
range, thus a very detailed fine-tuning of parameters does not give a significant
improvement and is not necessary.
Automated Parameter Selection. Since manual parameter optimization
might quickly get tedious, automated approaches are often preferred.
Since ESNs have only a few parameters requiring more careful tuning, grid
search is probably the most straightforward option. It is easy enough to imple-
ment with a few nested loops, and high-level ML programming libraries, such
as Oger (mentioned in Section 27.6) in the case of RC, often have ready-made
routines for this.
A reasonable approach is to do a coarser grid search over wider parameter intervals to identify promising regions and then do a finer search (smaller steps) in these regions. As mentioned, the grid typically need not be very dense to achieve a good performance.
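A minimal sketch of such a grid search over the three main parameters (input scaling, spectral radius, leaking rate), assuming the illustrative helpers sketched earlier in this section plus some train/validation routine of your own (here called train_and_validate; n_u is assumed to hold the number of input channels):

    import itertools

    best = (None, np.inf)
    for scale_in, rho, alpha in itertools.product([0.1, 0.5, 1.0],     # input scalings
                                                  [0.5, 0.9, 1.2],     # spectral radii
                                                  [0.1, 0.3, 1.0]):    # leaking rates
        np.random.seed(0)                 # fixed seed: same reservoir "randomness" per trial
        W = scale_spectral_radius(generate_sparse_reservoir(n_x=500), rho)
        W_in = scale_in * (np.random.rand(500, 1 + n_u) - 0.5)
        err = train_and_validate(W, W_in, alpha)   # hypothetical: trains readout, returns validation error
        if err < best[1]:
            best = ((scale_in, rho, alpha), err)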
The best performance being on a boundary of a covered grid is a good indi-
cation that the optimal performance might be outside the grid.
In general, meta-parameter or hyper-parameter optimization is a very common topic in many branches of ML and beyond. There are numerous generic optimization methods applicable to this task described in the literature, such as random search or more sophisticated methods that try to model the error landscape (see, e.g., [2]); they can often cope with much larger search spaces than a grid search is effectively capable of. They are in principle just as applicable to ESNs, with the possibility of also including the parameters of second importance in the optimization.
There is also a way to optimize the global parameters of the reservoir through
a gradient descent [22]. It has, however, not been widely applied in the literature.
27.3.4 Pointers to Reservoir Extensions
There are also alternative ways of generating and adapting reservoirs suggested in the literature, including deterministic, e.g., [39], and data-specific, e.g., [36], ones. With a variety of such methods the modern field of RC has evolved from the initial paradigm of a fixed reservoir and only training a readout from it, to also adapting the reservoir, but differently from the readout, using generic, unsupervised, or even supervised methods. In some cases a hardware system is used as a reservoir and is thus predetermined by its specific features. See [29], and its update in Chapter 2 of [30], for a classification and overview. The classical ESN approach described here, however, still holds its ground for its simplicity and performance.
27.4 Training Readouts
27.4.1 Ridge Regression
Since readouts from an ESN are typically linear and feed-forward, Equation (27.4) can be written in matrix notation as

Y = W^out X,   (27.7)
where Y ∈ ℝ^{N_y×T} are all y(n) and X ∈ ℝ^{(1+N_u+N_x)×T} are all [1; u(n); x(n)] produced by presenting the reservoir with u(n), both collected into respective matrices by concatenating the column-vectors horizontally over the training period n = 1, ..., T. We use here a single X instead of [1; U; X] for notational brevity.
Finding the optimal weights W^out that minimize the squared error between y(n) and y^target(n) amounts to solving a typically overdetermined system of linear equations

Y^target = W^out X,   (27.8)

where Y^target ∈ ℝ^{N_y×T} are all y^target(n), with respect to W^out in a least-squares sense, i.e., a case of linear regression. In this context X can be called the design matrix. The system is overdetermined, because typically T ≫ 1 + N_u + N_x.
There are standard well-known ways to solve (27.8); we will discuss a couple of good choices here.
Probably the most universal and stable solution to (27.8) in this context is ridge regression, also known as regression with Tikhonov regularization:

W^out = Y^target X^T (X X^T + βI)^{−1},   (27.9)
where β is a regularization coefficient explained in Section 27.4.2, and I is the
identity matrix.
The most generally recommended way to learn linear output weights from an ESN is ridge regression (27.9).
We start with this method because it should be the first choice, even though it
is not the most trivial one. We will explain different aspects of this method in
the coming sections together with reasons for why it should be preferred and
alternatives that in some cases can be advantageous.
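A minimal sketch of ridge regression (27.9) in Python (assuming the state collection sketched in Section 27.3; the helper name is illustrative):

    def train_readout_ridge(X, Y_target, beta=1e-6):
        # X: design matrix of shape (1+N_u+N_x, T); Y_target: targets of shape (N_y, T).
        # Direct transcription of (27.9); in practice np.linalg.solve on the
        # transposed system is numerically preferable to an explicit inverse.
        n_rows = X.shape[0]
        return Y_target @ X.T @ np.linalg.inv(X @ X.T + beta * np.eye(n_rows))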
27.4.2 Regularization
To assess the quality of the solution produced by training, it is advisable to monitor the actual obtained output weights W^out. Large weights indicate that W^out exploits and amplifies tiny differences among the dimensions of x(n), and can be very sensitive to deviations from the exact conditions in which the network has been trained. This is a big problem in setups where the network receives its own output as the next input. The slight deviation of the output from the expected value quickly escalates in subsequent time steps. Ways of dealing with such a setup are explained in Section 27.5.
Extremely large W^out values may be an indication of a very sensitive and unstable solution.
Counteracting this effect is exactly what the regularization part βI in the ridge regression (27.9) is for. Instead of just minimizing RMSE (27.1), ridge regression (27.9) solves

W^out = argmin_{W^out} (1/N_y) Σ_{i=1..N_y} [ Σ_{n=1..T} (y_i(n) − y_i^target(n))^2 + β ‖w_i^out‖^2 ],   (27.10)
where w_i^out is the i-th row of W^out and ‖·‖ stands for the Euclidean norm.
The objective function in (27.10) adds a regularization, or weight decay, term β‖w_i^out‖^2, penalizing large values of W^out, to the squared error between y(n) and y^target(n). This is a sum of two objectives, a compromise between having a small training error and small output weights. The relative “importance” of these two objectives is controlled by the regularization parameter β.
Use regularization (e.g., (27.9)) whenever there is a
danger of overfitting or feedback instability.
In (27.9) the optimal regularization coefficient β depends on the concrete in-
stantiation of the ESN. It should be selected individually for a concrete reservoir
based on validation data.
Select β for a concrete ESN using validation, without
rerunning the reservoir through the training data.
There is no need to rerun the model through the data for every value of β, because none of the other variables in (27.9) are affected by its changes. Memory permitting, there is also no need to rerun the model on the (small) validation dataset, if you can store its X and compute the validation output by (27.7). This makes testing β values computationally much less expensive than testing the reservoir parameters explained in Section 27.3.
The optimal values of β can vary by many orders of magnitude, depending on the exact instance of the reservoir and the length of the training data. If doing a simple exhaustive search, it is advisable to search on a logarithmic grid.
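As a sketch of selecting β on a logarithmic grid without rerunning the reservoir (the train and validation design matrices and targets are assumed to have been collected once; names are illustrative):

    YXt = Y_target_train @ X_train.T        # collected once
    XXt = X_train @ X_train.T               # collected once
    I = np.eye(XXt.shape[0])

    best_beta, best_err = None, np.inf
    for beta in 10.0 ** np.arange(-8, 3):   # logarithmic grid for beta
        W_out = YXt @ np.linalg.inv(XXt + beta * I)
        err = np.sqrt(np.mean((W_out @ X_val - Y_target_val) ** 2))   # validation RMSE
        if err < best_err:
            best_beta, best_err = beta, err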
Setting β to zero removes the regularization: the objective function in (27.10) becomes equivalent to RMSE (27.1), making ridge regression a generalization of regular linear regression. The solution (27.9) with β = 0 becomes

W^out = Y^target X^T (X X^T)^{−1},   (27.11)

known as the normal equations method for solving linear regression (27.8). In practice, however, setting β = 0 often leads to numerical instabilities when inverting (X X^T) in (27.11). This too recommends using a logarithmic scale for selecting β where it never goes to zero. The problem can in some cases also be alleviated by using a pseudoinverse instead of the real inverse in (27.11).
A Gaussian process interpretation of the linear readout gives an alternative
criterion for setting β directly [4].
A similar regularization effect to Tikhonov (27.9) can be achieved by adding scaled white noise to x(n) in (27.3), a method that predates ridge regression in ESNs [16]. Like in ridge regression, i.i.d. noise emphasizes the diagonal of (X X^T). The advantage is that it is also propagated through W in (27.2), modeling better the effects of noisy signals in the reservoir. The output learns to recover from perturbed signals, making the model more stable with feedback loops (Section 27.5). The downside of this noise immunization is that the model needs to be rerun with each value of the noise scaling.
27.4.3 Large Datasets
Ridge regression (27.9) (or the Wiener-Hopf solution (27.11) as a special case) also allows a one-shot training with virtually unlimited amounts of data.
Notice that the sizes of the matrices (Y^target X^T) ∈ ℝ^{N_y×N_x} and (X X^T) ∈ ℝ^{N_x×N_x} do not depend on the length T of the training sequence. The two matrices can be updated by simply adding the corresponding results from the newly incoming data. This one-shot training approach in principle works with an unlimited amount of data: neither the required working memory nor the time of the training procedure (27.9) itself depends on the length T of the data.
With large datasets, collect the matrices (Y^target X^T) and (X X^T) incrementally for (27.9).
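A minimal sketch of such incremental collection over data chunks (assuming each chunk yields a design-matrix slice X_chunk and a target slice Y_chunk; data_chunks and the dimension variables are illustrative assumptions):

    n_rows = 1 + n_u + n_x
    YXt = np.zeros((n_y, n_rows))
    XXt = np.zeros((n_rows, n_rows))

    for X_chunk, Y_chunk in data_chunks():    # hypothetical generator over the dataset
        YXt += Y_chunk @ X_chunk.T            # accumulate Y_target X^T
        XXt += X_chunk @ X_chunk.T            # accumulate X X^T

    W_out = YXt @ np.linalg.inv(XXt + beta * np.eye(n_rows))   # ridge regression (27.9)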
The eventual limitation of the straightforward summation comes from the finite precision of floating point numbers: adding large (like the so-far accumulated matrix) and small (like the next update) numbers becomes inaccurate. A better summation scheme, such as a hierarchical multi-stage summation where the two added values are always of a similar magnitude (e.g., coming from the same number of time steps), or Kahan summation [24], which compensates for the accumulating errors, should be used instead.
With very large datasets, a more accurate summation scheme should be used for accumulating (Y^target X^T) and (X X^T).
Using extended precision numbers here could also help, as well as in other
calculations.
27.4.4 Direct Pseudoinverse Solution
A straightforward solution to (27.8) is

W^out = Y^target X^+,   (27.12)

where X^+ is the Moore-Penrose pseudoinverse of X. If (X X^T) is invertible, (27.12) in essence becomes equivalent to (27.11), but it works even when it is not.
The direct pseudoinverse calculation typically exhibits high numerical stability.
As a downside, it is expensive memory-wise for large design matrices X, thereby limiting the size of the reservoir N_x and/or the number of training samples T. Since there is virtually no regularization, the system of linear equations (27.8) should be well overdetermined, i.e., 1 + N_u + N_x ≪ T. In other words, the task should be difficult relative to the capacity of the reservoir so that overfitting does not happen.
Use direct pseudoinverse (27.12) to train ESNs with
high precision and little regularization when memory
and run time permit.
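A one-line sketch of this, using NumPy's pseudoinverse (variable names as in the earlier illustrative sketches):

    W_out = Y_target @ np.linalg.pinv(X)   # direct solution (27.12)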
Many modern programming libraries dealing with linear algebra have implemen-
tations of a matrix pseudoinverse, which can be used “off the shelf”. However,
implementations vary in their precision, computational efficiency, and numerical
stability.
For high precision tasks, check whether the regression (Y^target − W^out X) X^+ on the error Y^target − W^out X is actually all zeros, and add it to W^out if it is not.
This computational trick should not work in theory (the regression on the error
should be equal to zero), but sometimes does work in practice in Matlab [28],
possibly because of some internal optimizations.
Some high-level linear algebra libraries have ready-made subroutines for doing regression, i.e., solving linear least-squares problems as in (27.8), where the exact methods are not made explicit and can be chosen internally depending on the conditioning of the problem. Using them has the obvious disadvantage of lacking control over the issues discussed here.
A powerful extension of the basic ESN approach is training (very) many (very small) ESNs in parallel and averaging their outputs, which in some cases has drastically improved performance [19, 22]. This might not be true for tasks requiring a large memory, where one bigger reservoir may still be better than several smaller ones.
Averaging outputs from multiple reservoirs increases
the performance.
27.4.5 Initial Transient
Usually x(n) data from the beginning of the training run are discarded (i.e., not used for learning W^out), since they are contaminated by initial transients. To keep the notation simple, let us assume they come before n = 1.
For long sequences, discard from training the initial time steps of activations x(n) that are affected by the initial transient.
The initial transient is in essence a result of an arbitrary setting of x(0), which is typically x(0) = 0. This introduces an unnatural starting state which is not normally visited once the network has “warmed up” to the task. The number of time steps to discard depends on the memory of the network (which in turn depends on the reservoir parameters), and is typically in the order of tens or hundreds.
However, if the data consists of multiple short separate sequences (like in
sequence classification), “the initial transient” might be the usual working mode
of the ESN. In this case discarding the precious (possibly all!) data might be
disadvantageous. See Section 27.4.7 for more on this. Note that you would want to reset the state to the same initial value before each sequence to make the classification of sequences independent.
With multiple sequences of data, the time steps on which the learning should be performed should be concatenated in X and Y^target, the same way as if there were only a single long sequence.
A generalization of discarding data is presented in the next Section 27.4.6.
27.4.6 Regression Weighting
In the regression learning of ESNs it is easy to make some time steps count more than others by weighting the minimized square error differently. For this, the time steps of the square error are weighted with a weight vector s(n) ∈ ℝ:

E(y, y^target) = (1/N_y) Σ_{i=1..N_y} Σ_{n=1..T} s(n) (y_i(n) − y_i^target(n))^2.   (27.13)
This error is minimized with the same learning algorithms, but at each time step n the vectors [1; u(n); x(n)] and y^target(n) are element-wise multiplied with √s(n) before collecting them into X and Y^target. Higher values of s(n) put more emphasis on minimizing the error between y(n) and y^target(n). Putting a weight s(n) on a time step n has the same effect as if [1; u(n); x(n)] and y^target(n) had appeared s(n) times in the training with a regular weight s(n) = 1. Setting s(n) = 0 is equivalent to discarding the time steps from training altogether.
Use weighting to assign different importance to
different time steps when training.
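A minimal sketch of applying such weights before the regression (s is an assumed per-time-step weight array; the ridge training helper is the illustrative one from Section 27.4.1):

    sqrt_s = np.sqrt(s)        # s has shape (T,), e.g., 0 for transients, 1 otherwise
    X_w = X * sqrt_s           # scales each column (time step) of the design matrix
    Y_w = Y_target * sqrt_s    # scales the corresponding target columns
    W_out = train_readout_ridge(X_w, Y_w, beta=1e-6)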
This weighted least squares scheme can be useful in different situations like dis-
carding or weighting down the signals affected by the initial transients, corrupted
or missing data, emphasizing the ending of the signal [6], etc. It is fully compat-
ible with ridge regression (27.9) where the weighting (27.13) is applied to the
square error part of the objective function in (27.10).
For an even more refined version different channels of y(n) can be trained
separately and with different s(n). The weighting can for example be used to
counter an imbalance between positive vs. negative samples in a classification or
detection task.
27.4.7 Readouts for Classification
When the task is to classify separate short time series, the training is typically set up such that the output y(n) has a dimension for every class, and y^target(n) is equal to one in the dimension corresponding to the correct class and zero everywhere else. The model is trained to approximate y^target(n) and the class for a single sequence u(n) is very often decided by
class(u(n)) = argmax_k (1/|τ|) Σ_{n∈τ} y_k(n) = argmax_k ((Σy)_k),   (27.14)

where y_k(n) is the k-th dimension of y(n) produced by the ESN from u(n), τ is some integration interval (it can be the length of the whole sequence u(n)), and Σy stands for a shorthand notation of y(n) time-averaged over τ.
There is a better way to do this. Notice that in this case

Σy = (1/|τ|) Σ_{n∈τ} y(n) = (1/|τ|) Σ_{n∈τ} W^out [1; u(n); x(n)]   (27.15)
   = W^out (1/|τ|) Σ_{n∈τ} [1; u(n); x(n)] = W^out Σx,   (27.16)

where Σx is a shorthand for [1; u(n); x(n)] time-averaged over τ. The form (27.16) is a more efficient way to compute Σy, since there is only one multiplication with W^out.
More importantly, (27.16) can be used to make training more efficient and powerful. For a given short sequence u(n), instead of finding W^out that minimizes E(y^target(n), y(n)) for every n ∈ τ, it is better to find the one that minimizes the error between the time-averaged values, E(y^target, Σy). In this case y(n) (which is not actually explicitly computed) is allowed to deviate from y^target(n) as long as the time-averaged Σy is close to y^target. Here y^target ≡ Σy^target = y^target(n) = const for a single short sequence.
To classify sequences, train and use readouts from
time-averaged activations Σx (27.16), instead of x(n).
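A minimal sketch of this idea, collecting one time-averaged feature vector per sequence and classifying by the maximal readout (using the illustrative run_reservoir and train_readout_ridge helpers from above; sequences is a hypothetical list of (U, label) pairs):

    def sequence_features(U, W, W_in, alpha):
        # Time-average of [1; u(n); x(n)] over the whole sequence (the interval tau).
        X_seq = run_reservoir(U, W, W_in, alpha)
        ones = np.ones((1, U.shape[1]))
        return np.vstack([ones, U, X_seq]).mean(axis=1)

    features = np.column_stack([sequence_features(U, W, W_in, alpha) for U, _ in sequences])
    targets = np.eye(n_classes)[:, [label for _, label in sequences]]   # one-hot, one column per sequence
    W_out = train_readout_ridge(features, targets, beta=1e-2)
    predicted = np.argmax(W_out @ features, axis=0)                     # class by maximum, as in (27.14)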
Note that weighting is still possible both among the short sequences and inside
each sequence over the intervals τ, using weighted average instead of a simple
one. Actually, weighting over τ is often recommendable, emphasizing the ending
of the sequence where the whole information to make the classification decision
has been fed into the reservoir.
To retain information from different times in the short sequence, weighted averages Σ_1 x, ..., Σ_k x over several time intervals τ_1, ..., τ_k during the short sequence can be computed and concatenated into an extended state Σ*x = [Σ_1 x; ...; Σ_k x]. This extended state Σ*x can be used instead of Σx for an even more powerful classification. In this case W^out is also extended to W^out ∈ ℝ^{N_y × k·(1+N_u+N_x)}.
Concatenate weighted time-averages over different
intervals to read out from for an even more powerful
classification.
Since the short sequences are typically of different lengths (which is the advantage of using temporal classification methods), the intervals τ_1, ..., τ_k should be scaled to match the length of each sequence.
The techniques described so far in this section effectively reduce the time series classification to a static data classification problem by reducing the variable-length inputs u(n) to fixed-size feature vectors Σ*x ∈ ℝ^{k·(1+N_u+N_x)}. There are many powerful machine learning methods available to solve the static classification problem that can be employed at this point, such as logistic regression or maximum margin classifiers. See, e.g., [3] for different options. These methods define the error function differently and offer different, mostly iterative, optimization algorithms.
Different powerful classification methods for static data can be employed as the readout from the time-averaged activations Σ*x.
Among others, the same regression methods as for temporal data can be used to train a linear readout y = W^out Σ*x and decide the class by the maximum as in (27.14). In this case, for every short sequence u(n) only one pair of vectors y^target and Σ*x is collected into Y^target and X, respectively, for training by (27.9) or (27.12). Since this reduces the number of training data points from the total number of time steps to the number of short sequences, precautions against overfitting should be taken. Such regression training for classification has the advantage of a single-shot closed-form solution; however, it is not optimal because it does not directly optimize the correct classification rates.
For temporal pattern recognition tasks in a long sequence (i.e., detection plus classification), y^target(n) should be designed cleverly. The shapes, durations, and delays of the signals in y^target(n) indicating patterns in u(n) are also parameters that have to be optimized, as well as the algorithms producing the final recognition (in the form of discrete symbol sequences or annotations) from the continuous signals y(n). But this goes beyond the scope of this paper. Alternatively, dynamic programming methods (such as the Viterbi algorithm) can be used for trainable recognizers at the output layer, see [10].
27.4.8 Online Learning
Some applications require online model adaptation, e.g., [19]. In such cases the process generating the data is often not assumed to be stationary and is tracked by the constantly adapting model. W^out here acts as an adaptive linear combiner.
The simplest way to train W^out is the method known as the Least Mean Squares (LMS) algorithm [9]; it has many extensions and modifications. It is a stochastic gradient descent algorithm which at every time step n changes W^out in the direction of minimizing the instantaneous squared error ‖y^target(n) − y(n)‖^2.
LMS is a first-order gradient descent method, locally approximating the error surface with a hyperplane. This approximation is poor when the curvature of the error surface is very different in different directions, which is signified by a large eigenvalue spread of X X^T. In such a situation the convergence performance of LMS is unfortunately severely impaired.
An alternative to LMS for linear readout learning, known in linear signal processing as the Recursive Least Squares (RLS) algorithm, is insensitive to the detrimental effects of eigenvalue spread and boasts a much faster convergence. At each time step n it explicitly minimizes a square error that is exponentially discounted going back in time:

E(y, y^target, n) = (1/N_y) Σ_{i=1..N_y} Σ_{j=1..n} λ^{n−j} (y_i(j) − y_i^target(j))^2,   (27.17)
where 0 < λ ≤ 1 is the error “forgetting” parameter. This weighting is not unlike the one discussed in Section 27.4.6, where s^(n)(j) = λ^{n−j} at time step n. RLS can be seen as a method for minimizing (27.17) at each time step n similarly to Wiener-Hopf (27.11), but optimized by keeping and updating the estimate of (X X^T)^{−1} from time n−1 instead of recomputing it from scratch. The downside of RLS is that it is computationally more expensive (quadratic in the number of weights instead of linear like LMS) and notorious for numerical stability issues. Demonstrations of RLS for ESNs are presented in [19, 18]. A careful and comprehensive comparison of variants of RLS as ESN readouts is carried out in a Master's thesis [25], which may be helpful for practitioners.
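For illustration, a minimal sketch of a textbook RLS update for a single output dimension (this shows the generic algorithm, not any specific variant from [25]; names are illustrative):

    def rls_init(n_in, delta=1.0):
        # w: readout weights for one output; P: running estimate of (X X^T)^{-1}.
        return np.zeros(n_in), np.eye(n_in) / delta

    def rls_step(w, P, x, d, lam=0.999):
        # One update with input vector x = [1; u(n); x(n)], target d, forgetting factor lam.
        Px = P @ x
        k = Px / (lam + x @ Px)            # gain vector
        e = d - w @ x                      # a priori error
        w = w + k * e                      # weight update
        P = (P - np.outer(k, Px)) / lam    # covariance update
        return w, P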
The BackPropagation-DeCorrelation (BPDC) [44] and FORCE [46] learning
algorithms discussed in Section 27.5.3 are two other powerful methods for online
training of single-layer readouts with feedback connections from the reservoirs.
27.4.9 Pointers to Readouts Extensions
There are also alternative ways of training outputs from the reservoirs suggested
in the literature, e.g., Gaussian process [4], copula [5], or Support Vector Machine
[42] style outputs. See [29] and updated in Chapter 2 of [30] for an overview.
27.5 Dealing with Output Feedbacks
27.5.1 Output Feedbacks
Even if the reservoir is kept fixed, for some tasks the trained readouts are fed back to the reservoir and thus the training process changes its dynamics. In other words, a recurrence exists between the reservoir and the trained readout. Pattern generation is a typical example of such a task. This can be realized in two ways. Either by feedback connections W^fb ∈ ℝ^{N_x×N_y} from the output to the reservoir, replacing (27.2) with

x̃(n) = tanh(W^in [1; u(n)] + W x(n−1) + W^fb y(n−1)),   (27.18)

or by looping the output y(n−1) as an input u(n) for the next update step n in (27.2), in effect turning a trained one-step predictor ESN into a pattern generator. Note that these two options are equivalent and are just a matter of notation: u(n) and W^in instead of y(n−1) and W^fb, respectively. The same principles thus apply to producing W^fb as to W^in. In some cases, however, both an external input and an output feedback can be present.
external input and output feedback can be present.
This extends the power of RC, because it no longer relies on fixed random
input-driven dynamics to construct the output, but the dynamics are adapted
to the task. This power has its price, because stability issues arise here.
Use output feedbacks to the reservoir only if they are
necessary for the task.
This may include tasks that simply cannot be learned well enough without feed-
backs. Feedbacks enable reservoirs to achieve universal computational capabil-
ities [33] and can in practice be beneficial even where they are not an integral
part of the task [28].
In order to avoid falling prey to the same difficulties as with full RNN training
algorithms, two strategies are used in RC when learning outputs with feedbacks:
Breaking the feedback loop during the training, Section 27.5.2;
Adapting W^out online with specialized algorithms in the presence of real feedbacks, Section 27.5.3.
27.5.2 Teacher Forcing
The first strategy is to disengage the recurrent relationship between the reservoir and the readout using teacher forcing and treat output learning as a feedforward task. This is done by feeding the desired output y^target(n−1) through the feedback connections W^fb in (27.18) instead of the real output y(n−1) while learning (Figure 27.2a). The target signal y^target(n) bootstraps the learning process, and if the output is learned with high precision (i.e., y(n) ≈ y^target(n)), the recurrent system runs much in the same way with the real y(n) in the feedbacks
after training as it did with y^target(n) during training (Figure 27.2b). In a pure pattern generator setup with no additional inputs, u(n) and W^in may not be present at all; after training, the ESN is run to autonomously generate a pattern, using teacher forcing initially to start the pattern. As noted before, this is equivalent to training a one-time-step predictor without feedbacks W^fb and looping its output y(n−1) as the input u(n) through W^in.
Fig. 27.2. An ESN with output feedbacks: (a) training with teacher forcing; (b) running in a generative mode.
For simple tasks, feed y^target(n) instead of y(n) in (27.18) while learning, to break the recurrence.
This way W^out can be learned in the same efficient batch mode by linear regression, as explained before.
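A minimal sketch of this setup for a pure pattern generator (no external input u(n)), first driving the reservoir with the teacher signal and then running it generatively; the function names are illustrative, and the readout here uses only x(n), omitting the bias and input terms of (27.4) for brevity:

    def collect_states_teacher_forced(Y_target, W, W_fb, alpha):
        # Drive the reservoir with the desired output y_target(n-1) instead of y(n-1).
        n_x, T = W.shape[0], Y_target.shape[1]
        x, X = np.zeros(n_x), np.zeros((n_x, T))
        for n in range(T):
            fb = Y_target[:, n - 1] if n > 0 else np.zeros(Y_target.shape[0])
            x_tilde = np.tanh(W.dot(x) + W_fb @ fb)
            x = (1 - alpha) * x + alpha * x_tilde
            X[:, n] = x
        return X

    def generate(W_out, W, W_fb, alpha, x, y, n_steps):
        # After training, close the loop: feed the network's own output back in.
        outputs = []
        for _ in range(n_steps):
            x_tilde = np.tanh(W.dot(x) + W_fb @ y)
            x = (1 - alpha) * x + alpha * x_tilde
            y = W_out @ x
            outputs.append(y)
        return np.column_stack(outputs)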
Teacher forcing applied to speed up error backpropagation RNN training is also discussed in chapter [54] of this book.
There are some caveats here. The approach works very well if the output can be learned precisely [16]. However, if this is not the case, the distorted feedback leads to an even more distorted output and feedback at the next time step, and so on, with the actual generated output y(n) quickly diverging from the desired y^target(n). Even with well-learned outputs the dynamical stability of the autonomously running system is often an issue.
Stability with Feedbacks. In both cases regularization of the output by ridge regression and/or noise immunization, as explained in Section 27.4.2, is the key to success.
Regularization by ridge regression or noise is crucial to
make teacher-forced feedbacks stable.
Some additional options to the ones in Section 27.4.2 are available here. One is adding scaled noise to the forced teacher signal y^target(n), emulating an imperfectly learned y^target(n) by y(n) and making the network robust to this. In fact, a readout can be trained to ignore some inputs or feedbacks altogether by feeding strong noise into them during training [20].
Another is doing the training in several iterations and feeding back signals that are in between the perfect y^target(n) and the actual y(n) obtained from the previous iteration. For example, a one-time-step prediction of the signal by the ESN (as opposed to running with real feedbacks) can be used as the teacher forcer for the next iteration of training [19]. This way the model learns to recover from the directions of deviations from the correct signal that it actually produces, not just from random ones as in the case with noise, while at the same time the teacher signal does not diverge too far from the target.
Another recently proposed option is to also regularize the recurrent connections W themselves. A one-shot relearning of W with regularization (similar to ridge regression (27.9) for W^out), so that it produces the same x(n) as the initially randomly generated W, reduces the recurrent connection strengths and helps make the ESN generator more stable [38, 37].
27.5.3 Online Learning with Real Feedbacks
The second strategy to deal with output feedbacks in ESNs is using online (in-
stead of one-shot) learning algorithms to train the outputs W
out
while the feed-
backs are enabled and feed back the (yet imperfectly) learned output, not the
teacher signal. This way the model learns to stabilize itself in the real generative
setting.
General purpose online learning algorithms, such as discussed in Section
27.4.8, can be used for this. However, there exist a couple of online RC learning
algorithms that are specialized in training outputs with feedbacks, and in fact
would not work without them.
BackPropagation-DeCorrelation (BPDC) [44] is such a highly optimized RC
online learning algorithm which runs with a linear time complexity in the number
of connections. The algorithm is said to be insensitive to reservoir settings and
capable of tracking quickly changing signals. As a downside of the latter feature,
the trained network forgets the previously seen data and is highly biased by the
recent data. Some remedies for reducing this effect are reported in [45].
A recent RC approach named FORCE learning uses the RLS (Section 27.4.8) online learning algorithm to vigorously adapt W^out in the presence of the real feedbacks [46]. By the initial fast and strong adaptation of W^out, the feedbacks y(n) are kept close to the desired y^target(n) from the very beginning of the learning process, similar to teacher forcing. The algorithm benefits from initial spontaneous chaotic activations inside the reservoir which are then subdued by the feedbacks. It appears that FORCE learning is well suited to yield very stable and accurate neural pattern generators.
27.6 Summary and Implementations
We have presented many practical aspects for successfully applying ESNs. Some
of them are not universal and should be filtered depending on a particular task.
They are also not the only possible approaches, and can most likely be improved
upon. They collect, however, the best practices accumulated in the field over
the ten years from its start, and should serve well as guiding principles for ESN
researchers and practitioners.
Implementing an ESN is relatively straightforward: minimalistic one-page self-contained code examples in several programming languages are available through http://reservoir-computing.org/software/minimal. There are also a number of ready-made and expandable software libraries available which incorporate many of the techniques described here. A collection of open source RC toolboxes in different programming languages and varying degrees of sophistication can be found at http://reservoir-computing.org/software. The most comprehensive of them is the Oger toolbox in Python, http://reservoir-computing.org/oger.
The
http://reservoir-computing.org website is an overall good hub of
reservoir computing related resources, where new people can also register and
contribute.
Acknowledgments. The author was supported through funds from the Euro-
pean FP7 projects ORGANIC and AMARSi. He would like to specially thank
Herbert Jaeger for proof-reading this chapter and valuable suggestions, as well
as the anonymous reviewers, and the whole reservoir computing community on
whose collective wisdom this chapter is to some extent based.
References
[1] Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gra-
dient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166
(1994)
[2] Bergstra, J.S., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter
optimization. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P., Pereira, F.C.N.,
Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 23
(NIPS 2010), pp. 2546–2554 (2011)
[3] Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science
and Statistics). Springer-Verlag New York, Inc., Secaucus (2006)
[4] Chatzis, S.P., Demiris, Y.: Echo state Gaussian process. IEEE Transactions on
Neural Networks 22(9), 1435–1445 (2011)
[5] Chatzis, S.P., Demiris, Y.: The copula echo state network. Pattern Recogni-
tion 45(1), 570–577 (2012)
[6] Daukantas, S., Lukoševičius, M., Marozas, V., Lukoševičius, A.: Comparison of
“black box” and “gray box” methods for lost data reconstruction in multichannel
signals. In: Proceedings of the 14th International Conference “Biomedical Engi-
neering”, Kaunas, pp. 135–138 (2010)
[7] Dominey, P.F., Ramus, F.: Neural network processing of natural language: I. sensi-
tivity to serial, temporal and abstract structure of language in the infant. Language
and Cognitive Processes 15(1), 87–127 (2000)
[8] Doya, K.: Bifurcations in the learning of recurrent neural networks. In: Proceedings
of IEEE International Symposium on Circuits and Systems, vol. 6, pp. 2777–2780
(1992)
[9] Farhang-Boroujeny, B.: Adaptive Filters: Theory and Applications. Wiley (1998)
[10] Graves, A.: Supervised Sequence Labelling with Recurrent Neural Networks. PhD
thesis, Technical University Munich, Munich, Germany (2008)
[11] Hermans, M., Schrauwen, B.: Memory in reservoirs for high dimensional input.
In: Proceedings of the IEEE International Joint Conference on Neural Networks
(IJCNN 2010), pp. 1–7 (2010)
[12] Hermans, M., Schrauwen, B.: Recurrent kernel machines: Computing with infinite
echo state networks. Neural Computation 24(1), 104–133 (2012)
[13] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computa-
tion 9(8), 1735–1780 (1997)
[14] Holzmann, G., Hauser, H.: Echo state networks with filter neurons and a delay
and sum readout. Neural Networks 23(2), 244–256 (2010)
[15] Ilies, I., Jaeger, H., Kosuchinas, O., Rincon, M., Šakėnas, V., Vaškevičius, N.:
Stepping forward through echoes of the past: forecasting with echo state networks.
Short report on the winning entry to the NN3 financial forecasting competition
(2007), http://www.neural-forecasting-competition.com/downloads/NN3/methods/27-NN3_Herbert_Jaeger_report.pdf
[16] Jaeger, H.: The “echo state” approach to analysing and training recurrent neural
networks. Technical Report GMD Report 148, German National Research Center
for Information Technology (2001)
[17] Jaeger, H.: Short term memory in echo state networks. Technical Report GMD
Report 152, German National Research Center for Information Technology (2002)
[18] Jaeger, H.: Adaptive nonlinear system identification with echo state networks. In:
Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 593–600.
MIT Press, Cambridge (2003)
[19] Jaeger, H., Haas, H.: Harnessing nonlinearity: predicting chaotic systems and sav-
ing energy in wireless communication. Science 304(5667), 78–80 (2004)
[20] Jaeger, H.: Generating exponentially many periodic attractors with linearly grow-
ing echo state networks. Technical Report No. 3, Jacobs University Bremen (2006)
[21] Jaeger, H.: Echo state network. Scholarpedia 2(9), 2330 (2007)
[22] Jaeger, H., Lukoševičius, M., Popovici, D., Siewert, U.: Optimization and applica-
tions of echo state networks with leaky-integrator neurons. Neural Networks 20(3),
335–352 (2007)
[23] Jaeger, H.: Long short-term memory in echo state networks: Details of a simulation
study. Technical Report No. 27, Jacobs University Bremen (2012)
[24] Kahan, W.: Pracniques: further remarks on reducing truncation errors. Commu-
nications of the ACM 8(1), 40 (1965)
[25] Küçükemre, A.U.: Echo state networks for adaptive filtering. Master’s thesis, Uni-
versity of Applied Sciences Bonn-Rhein-Sieg, Germany (2006), http://reservoir-computing.org/publications/2006-echo-state-networks-adaptive-filtering
[26] LeCun, Y.A., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr,
G.B., Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer,
Heidelberg (1998)
[27] Lukoševičius, M., Popovici, D., Jaeger, H., Siewert, U.: Time warping invariant
echo state networks. Technical Report No. 2, Jacobs University Bremen (May
2006)
[28] Lukoševičius, M.: Echo state networks with trained feedbacks. Technical Report
No. 4, Jacobs University Bremen (February 2007)
[29] Lukoševičius, M., Jaeger, H.: Reservoir computing approaches to recurrent neural
network training. Computer Science Review 3(3), 127–149 (2009)
[30] Lukoševičius, M.: Reservoir Computing and Self-Organized Neural Hierarchies.
PhD thesis, Jacobs University Bremen, Bremen, Germany (2011)
[31] Lukoševičius, M., Jaeger, H., Schrauwen, B.: Reservoir computing trends. KI -
Künstliche Intelligenz, pp. 1–7 (2012)
[32] Maass, W., Natschläger, T., Markram, H.: Real-time computing without stable
states: a new framework for neural computation based on perturbations. Neural
Computation 14(11), 2531–2560 (2002)
[33] Maass, W., Joshi, P., Sontag, E.D.: Principles of real-time computing with feed-
back applied to cortical microcircuit models. In: Advances in Neural Information
Processing Systems 18 (NIPS 2005), pp. 835–842. MIT Press, Cambridge (2006)
[34] Martens, J., Sutskever, I.: Learning recurrent neural networks with Hessian-free
optimization. In: Proc. 28th Int. Conf. on Machine Learning (2011)
[35] Martens, J., Sutskever, I.: Training Deep and Recurrent Networks with Hessian-
free Optimization. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700, pp. 479–535. Springer, Heidelberg (2012)
[36] Ozturk, M.C., Xu, D., Príncipe, J.C.: Analysis and design of echo state networks.
Neural Computation 19(1), 111–138 (2007)
[37] Reinhart, F.R., Steil, J.J.: A constrained regularization approach for input-driven
recurrent neural networks. Differential Equations and Dynamical Systems 19, 27–
46 (2011)
[38] Reinhart, F.R., Steil, J.J.: Reservoir regularization stabilizes learning of echo state
networks with output feedback. In: Proceedings of the 19th European Symposium
on Artificial Neural Networks, ESANN 2011 (2011) (in Press)
[39] Rodan, A., Tiňo, P.: Minimum complexity echo state network. IEEE Transactions
on Neural Networks 22(1), 131–144 (2011)
[40] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations
by error propagation. In: Neurocomputing: Foundations of Research, pp. 673–695.
MIT Press, Cambridge (1988)
[41] Schrauwen, B., Defour, J., Verstraeten, D., Van Campenhout, J.: The Introduction
of Time-Scales in Reservoir Computing, Applied to Isolated Digits Recognition.
In: de Sá, J.M., Alexandre, L.A., Duch, W., Mandic, D.P. (eds.) ICANN 2007.
LNCS, vol. 4668, pp. 471–479. Springer, Heidelberg (2007)
[42] Shi, Z., Han, M.: Support vector echo-state machine for chaotic time-series pre-
diction. IEEE Transactions on Neural Networks 18(2), 359–372 (2007)
[43] Siewert, U., Wustlich, W.: Echo-state networks with band-pass neurons: to-
wards generic time-scale-independent reservoir structures. Internal status report,
PLANET intelligent systems GmbH (2007),
http://reslab.elis.ugent.be/node/112
[44] Steil, J.J.: Backpropagation-decorrelation: recurrent learning with O(N) complex-
ity. In: Proceedings of the IEEE International Joint Conference on Neural Net-
works (IJCNN 2004), vol. 2, pp. 843–848 (2004)
[45] Steil, J.J.: Memory in Backpropagation-Decorrelation O(N) Efficient Online Re-
current Learning. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN
2005. LNCS, vol. 3697, pp. 649–654. Springer, Heidelberg (2005)
[46] Sussillo, D., Abbott, L.F.: Generating coherent patterns of activity from chaotic
neural networks. Neuron 63(4), 544–557 (2009)
[47] Triefenbach, F., Jalalvand, A., Schrauwen, B., Martens, J.-P.: Phoneme recogni-
tion with large hierarchical reservoirs. In: Advances in Neural Information Pro-
cessing Systems 23 (NIPS 2010), pp. 2307–2315. MIT Press, Cambridge (2011)
[48] Verstraeten, D., Schrauwen, B., Stroobandt, D.: Reservoir-based techniques for
speech recognition. In: Proceedings of the IEEE International Joint Conference
on Neural Networks (IJCNN 2006), pp. 1050–1053 (2006)
[49] Verstraeten, D., Schrauwen, B., D’Haene, M., Stroobandt, D.: An experimen-
tal unification of reservoir computing methods. Neural Networks 20(3), 391–403
(2007)
[50] Verstraeten, D., Dambre, J., Dutoit, X., Schrauwen, B.: Memory versus non-
linearity in reservoirs. In: Proc. Int. Neural Networks (IJCNN) Joint Conf., pp.
1–8 (2010)
[51] Werbos, P.J.: Backpropagation through time: what it does and how to do it.
Proceedings of the IEEE 78(10), 1550–1560 (1990)
[52] Williams, R.J., Zipser, D.: A learning algorithm for continually running fully re-
current neural networks. Neural Computation 1, 270–280 (1989)
[53] Wyffels, F., Schrauwen, B., Verstraeten, D., Stroobandt, D.: Band-pass reservoir
computing. In: Hou, Z., Zhang, N. (eds.) Proceedings of the IEEE International
Joint Conference on Neural Networks (IJCNN 2008), Hong Kong, pp. 3204–3209
(2008)
[54] Zimmermann, H.-G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural
Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–707. Springer, Heidelberg (2012)
28
Forecasting with Recurrent Neural Networks:
12 Tricks
Hans-Georg Zimmermann, Christoph Tietz, and Ralph Grothmann
Siemens AG, Corporate Technology
Otto-Hahn-Ring 6, D-81739 Munich, Germany
Hans_Georg.Zimmermann@siemens.com
Abstract. Recurrent neural networks (RNNs) are typically considered
as relatively simple architectures, which come along with complicated
learning algorithms. This paper has a different view: We start from the
fact that RNNs can model any high dimensional, nonlinear dynamical
system. Rather than focusing on learning algorithms, we concentrate on
the design of network architectures. Unfolding in time is a well-known
example of this modeling philosophy. Here a temporal algorithm is trans-
ferred into an architectural framework such that the learning can be
performed by an extension of standard error backpropagation.
We introduce 12 tricks that not only provide deeper insights into the
functioning of RNNs but also improve the identification of the underlying
dynamical system from data.
28.1 Introduction
In many business management disciplines, complex planning and decision-
making can be supported by quantitative forecast models that take into account
a wide range of influencing factors with non-linear cause and effect relationships.
Furthermore, the uncertainty in forecasting should be considered. The procure-
ment of raw materials or demand planning are prime examples: The timing of
copper purchases can be optimized with accurate market price forecasts, whereas
precise forecasts of product sales increase the delivery reliability and reduce costs.
Likewise, technical applications, e.g. energy generation, also require the modeling
of complex dynamical systems.
In this contribution we deal with time-delay recurrent neural networks (RNNs)
for time series forecasting and introduce 12 tricks that not only ease the handling
of RNNs, but also improve the forecast accuracy. The RNNs and associated tricks
are applied in many of our customer projects from economics and industry.
RNNs offer significant benefits for dealing with the typical challenges associated
with forecasting. With their universal approximation properties [11], RNNs can
model high-dimensional, non-linear relationships. The time-delayed information
processing addresses temporal structures. In contrast, conventional econometrics
generally uses linear models (e.g. autoregressive models (AR), multivariate linear
regression) which can be efficiently estimated from historical data, but provide
only an inadequate framework for non-linear dynamical systems [12].
In Section 28.2 we introduce the so-called correspondence principle for neural
networks (Trick 1). For us neural networks are a class of functions which can be
transformed into architectures. We will work only with algorithms that process
information locally within the architectures. As we will outline, for some prob-
lems it is easier to start off with the NN architecture and formulate the equations
afterwards and for other problems vice versa. The locality of the algorithms en-
ables us to model even large systems. The correspondence principle is the basis
for different RNN models and associated tricks.
We will start with a basic RNN in state space formulation for the modeling
of partly autonomous and partly externally driven dynamical systems (so-called
open systems). The associated parameter optimization task is solved by (finite)
unfolding in time, which can be handled by a shared weights extension of stan-
dard backpropagation. Dealing with state space models, we are able to utilize
memory effects. Therefore, there is no need of a complicated input preprocess-
ing in order to represent temporal relationships. Nevertheless, learning of open
dynamical systems tends to focus on the external drivers and, thus, neglects the
identification of the autonomous part. To address this problem, Trick 2 enforces the
autonomous flow of the dynamics and thus enables long-term forecasting. Trick 3
finds a proper initialization for the first state vector of the recurrent neural network
in the finite unfolding.
Typically we do not know all external drivers of the open dynamical system.
This may cause the identification of pseudo causalities. Trick 4 is the extension of
the RNN with an error correction term, resulting in a so-called error correction
neural network (ECNN), which enables us to handle missing information, hidden
external factors or shocks on the dynamics. ECNNs are an appropriate frame-
work for low-dimensional dynamical systems with less than 5 target variables.
For the modeling of high-dimensional systems on low dimensional manifolds as
in electrical load curves Trick 5 adds a coordinate transformation (so-called bot-
tleneck) to the ECNN.
Standard RNNs use external drivers in the past and assume constant envi-
ronmental conditions from present time on. For fast changing environments this
is a questionable assumption. Internalizing the environment of the dynamics
into the model leads to Trick 6, so-called historically consistent neural networks
(HCNN). The special feature of the HCNN is that it not only models the in-
dividual dynamics of interest, but also models the external drivers. This leads
to a closed dynamical system formulation. Therefore, HCNNs are symmetric in
their input and output variables, i.e. the system description does not draw any
distinction between input, output and internal state variables.
In practice, HCNNs are difficult to train, because the models have no input
signals and are unfolded across the complete data horizon. This implies that we
learn from a single data pattern, which is the unique data history. In Trick 7 we
therefore introduce an architectural teacher forcing to make the best possible
use of the data from the observables and to accelerate training of the HCNN.
The HCNN models the dynamics for all of the observables and their interac-
tion in parallel. For this purpose a high-dimensional state transition matrix is
required. A fully connected state transition matrix can, however, lead to a signal
overload during the training of the neural network using error backpropagation
through time (EBTT). With Trick 8 we solve this problem by introducing sparse
state transition matrices.
The information flow within a HCNN is from the past to present and future
time, i.e. we have a causal model to explain the highly-interacting non-linear
dynamical systems across multiple time scales. Trick 9 extends this modeling
framework with an information flow from the future into the past. As we will show
this enables us to incorporate the effects of rational decision making and plan-
ning into the modeling. The resulting models are called causal-retro-causal his-
torically consistent neural networks (CRCNNs). Like HCNNs, CRCNNs
are difficult to train. In Trick 10 we extend the basic CRCNN architecture by an
architectural teacher forcing mechanism, which allows us to learn the CRCNN
using the standard EBTT algorithm. Trick 11 introduces a way to improve the
modeling of deterministic chaotic systems.
Finally, Trick 12 is dedicated to the modeling of uncertainty and risk. We
calculate ensembles of either HCNNs or CRCNNs to forecast probability distri-
butions. Both modeling frameworks give a perfect description of the dynamics
of the observables in the past. However, the partial observability of the world
results in a non-unique reconstruction of the hidden variables and thus, different
future scenarios. Since the genuine development of the dynamics is unknown
and all paths have the same probability, the average of the ensemble is the best
forecast, whereas the ensemble bandwidth describes the market risk.
Section 28.3 summarizes the primary findings of this contribution and points
to future directions of research.
28.2 Tricks for Recurrent Neural Networks
Trick 1. The Correspondence Principle for Neural Networks
In order to gain a deeper understanding of the functioning and composition of
RNNs we introduce our first conceptual trick, which is called correspondence
principle between equations, architectures and local algorithms. The correspon-
dence principle for neural networks (NN) implies that any equation for a NN can
be portrayed in graphical form by means of an architecture which represents the
individual layers of the network in the form of nodes and the matrices between
the layers in the form of edges. This correspondence is most beneficial in combi-
nation with local optimization algorithms that provide the basis for the training
of the NNs. For example, the error back propagation algorithm needs only lo-
cally available information from the forward and backward flow of the network
in order to calculate the partial derivatives of the NN error function[13]. The
use of local algorithms here provides an elegant basis for the expansion of the
neural network towards the modeling of large systems. Used in combination with
an appropriate (stochastic) learning rule, it is possible to use the gradients as a
basis for the identification of robust minima[8].
Now let us introduce a basic recurrent neural network (RNN) in state space
formulation. We start from the assumption that a vector time series y_τ is created by an open dynamical system, which can be described in discrete time τ using a state transition and output equation [3]:

    state transition:   s_{τ+1} = f(s_τ, u_τ)
    output equation:    y_τ = g(s_τ)                                  (28.1)
The hidden time-recurrent state transition equation s_{τ+1} = f(s_τ, u_τ) describes the upcoming state s_{τ+1} by means of a function of the current system state s_τ and of the external factors u_τ. The system formulated in the state transition
equation can therefore be interpreted as a partially autonomous and partially
externally driven dynamic. We call this an open dynamical system.
The output, also called observer, equation y_τ uses the non-observable system state s_τ in every time step τ to calculate the output of the dynamic system y_τ. The data-driven system identification is based on the selected parameterized functions f() and g(). We choose the parameters in f() and g() such that an appropriate error function is minimized (see Eq. 28.2).
    (1/T) Σ_{τ=1..T} ‖y_τ − y_τ^d‖²  →  min_{f,g}                    (28.2)
The two functions f () and g() are estimated using the quadratic error function
(Eq. 28.2) in such a way that the average distance between the observed data
y_τ^d and the system outputs y_τ across a number of observations τ = 1, ..., T is minimal.
Thus far, we have given a general description of the state transition and
the output equation for open dynamical systems. Without loss of generality we
can specify the functions f() and g() by means of a recurrent neural network
(RNN)[11, 15]:
    state transition:   s_{τ+1} = tanh(A s_τ + B u_τ)
    output equation:    y_τ = C s_τ                                  (28.3)
Eq. 28.3 specifies an RNN with weight matrices A, B and C to model the open
dynamical system. The RNN is designed as a non-linear state-space model, which
is able to approximate any function f() and g(). We choose the hyperbolic tan-
gent tanh() as the activation function for the hidden network layer s_{τ+1}. The output equation is specified as a linear function. The RNN output is generated
by a superposition of two components: (i) the autonomous dynamics (coded in
A), which accumulates information over time (memory), and (ii) the influence
of external factors (coded in B).
Note, that the state transition in Eq. 28.3 does not need an additional matrix in front of the hyperbolic tangent tanh() activation function, since such a matrix can be merged into matrix A. Furthermore, without loss of generality we can use a linear output equation in Eq. 28.3: if we had a non-linearity in the output equation, it could be merged into the state transition equation. For details see Schäfer et al. [11].
We use the technique of finite unfolding in time[10] to solve the temporal
system identification, i.e. for the selection of appropriate matrices A, B and C
to minimize the error function (Eq. 28.2). The underlying idea here is that any
RNN can be reformulated to form an equivalent feedforward neural network, if
matrices A, B and C are identical in the individual time steps (shared weights).
Fig. 28.1 depicts the resulting RNN.
Fig. 28.1. Basic RNN unfolded in time with shared weights A, B and C
One advantage of the shared matrices is the moderate number of free param-
eters (weights), which reduces the risk of over-fitting[18]. The actual training is
conducted using error backpropagation through time (EBTT) together with a
stochastic learning rule[13, 10]. For algorithmic solutions, the reader is referred
to the overview article by B. Pearlmutter[9].
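As an illustration of the correspondence principle in executable form, the following sketch rolls out Eq. 28.3 with shared matrices A, B and C and evaluates the quadratic error of Eq. 28.2. It is a minimal NumPy example with hypothetical dimensions and random data, not the SENN implementation mentioned in Section 28.3; the EBTT training itself is omitted.

import numpy as np

def rnn_forward(A, B, C, u, s0):
    """Roll out s_{tau+1} = tanh(A s_tau + B u_tau), y_tau = C s_tau (Eq. 28.3)
    with the same (shared) matrices A, B, C in every time step."""
    s = s0
    outputs = []
    for u_tau in u:
        outputs.append(C @ s)               # y_tau = C s_tau
        s = np.tanh(A @ s + B @ u_tau)      # s_{tau+1}
    return np.array(outputs)

def quadratic_error(y, y_d):
    """Average squared distance between model outputs and observations (Eq. 28.2)."""
    return np.mean(np.sum((y - y_d) ** 2, axis=1))

# hypothetical dimensions and random data, for illustration only
rng = np.random.default_rng(0)
dim_s, dim_u, dim_y, T = 10, 3, 2, 50
A = rng.normal(scale=0.1, size=(dim_s, dim_s))
B = rng.normal(scale=0.1, size=(dim_s, dim_u))
C = rng.normal(scale=0.1, size=(dim_y, dim_s))
u = rng.normal(size=(T, dim_u))
y_d = rng.normal(size=(T, dim_y))
print(quadratic_error(rnn_forward(A, B, C, u, np.zeros(dim_s)), y_d))

Because A, B and C are reused in every time step, the number of free parameters is independent of the unfolding length, which is precisely the shared-weights argument made above.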
Trick 2. Overshooting
In applications we often observed that RNNs tend to focus on only the most
recent external inputs in order to explain the dynamics. To balance the infor-
mation flow, we use a trick called overshooting. Overshooting extends the au-
tonomous system dynamics (coded in matrix A) into the future (here t + 2, t + 3, t + 4) [18] (see Fig. 28.2). In order to describe the development of the dynamics
in one of these future time steps adequately, matrix A must be able to transfer
information over time. The different instances of matrix A refer to the same
prototype matrix A. Thus the shared weights principle allows us to maintain the
locality of the correspondence principle (see Trick 1). By this we can compute
consistent multi-step forecasts. A corresponding RNN architecture is depicted in Fig. 28.2.

Fig. 28.2. RNN incorporating overshooting

For the RNN we typically use an input preprocessing u_τ = x_τ − x_{τ−1} as the transformation for the raw data x. This avoids trends in the input or target variables of the RNN. The missing external inputs u_{τ>0} in the future can be interpreted as a constant environment (u_{τ>0} ≡ 0). The effectiveness of over-
shooting depends on the strength of the autonomous dynamics. The stronger the
autonomous flow, the better is the forecast accuracy for the future overshoot-
ing time steps. Furthermore, overshooting has an implication for the learning
itself. Without overshooting, RNNs have the tendency to focus only on short-
term input-output relationships. With overshooting the learning has to work out
mid- to long-term input-output relationships.
Summarizing, it should be noted, that overshooting generates additional valu-
able forecast information about the analyzed dynamical system and acts as a
regularization method for the learning.
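A minimal sketch of overshooting under the same hypothetical setup: the raw series x is preprocessed into differences u_τ = x_τ − x_{τ−1}, the network is unfolded over the past inputs, and the autonomous dynamics is then iterated a few additional steps with the missing future inputs set to zero. All names and dimensions are illustrative.

import numpy as np

def preprocess_inputs(x):
    """u_tau = x_tau - x_{tau-1}: difference transformation of the raw series x."""
    return np.diff(x, axis=0)

def rnn_overshooting(A, B, C, u_past, s0, n_overshoot):
    """Unfold over the past inputs, then iterate the autonomous dynamics
    n_overshoot steps into the future with the missing inputs set to zero."""
    s = s0
    for u_tau in u_past:                    # past: externally driven part
        s = np.tanh(A @ s + B @ u_tau)
    forecasts = []
    for _ in range(n_overshoot):            # future: autonomous part only
        forecasts.append(C @ s)
        s = np.tanh(A @ s)                  # u assumed constant (zero) in the future
    return np.array(forecasts)

# illustrative usage: 4-step-ahead forecast from a hypothetical raw series x
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 3))
A = rng.normal(scale=0.1, size=(10, 10))
B = rng.normal(scale=0.1, size=(10, 3))
C = rng.normal(scale=0.1, size=(2, 10))
print(rnn_overshooting(A, B, C, preprocess_inputs(x), np.zeros(10), 4).shape)  # (4, 2)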
Trick 3. Handling the Uncertainty of the Initial State
One of the difficulties with finite unfolding in time is to find a proper initialization
for the first state vector of the RNN. Our next trick takes care of this problem.
An obvious solution is to set the first state s_0 equal to zero. We assume that the
unfolding includes enough (past) time steps such that the misspecification of the
initialization phase is overwritten along the state transitions. In other words, the
network accumulates information over time and thus, may eliminate the impact
of the arbitrary initial state on the network outputs.
Beyond this the modeling can be improved if we are able to make the unfolded RNN less sensitive to the unknown initial state s_0. A first approach to stiffen the model against the unknown s_0 is to apply a noise term Θ to the state s_0. A fixed noise term Θ which is drawn from a certain noise distribution is clearly inadequate to handle the uncertainty of the initial state. Since we do not know what is an appropriate level of noise, we have to find a way to estimate the noise level. We propose to apply an adaptive noise Θ, which fits best to the level of uncertainty of the unknown s_0. The characteristic of the adaptive noise term Θ is automatically determined as a by-product of the error backpropagation algorithm.
The basic idea is as follows [8]: We use the residual error of the neural network as computed by error backpropagation for s_0. The residual error as
measured at the initial state s_0 can be interpreted as the uncertainty which is due to the missing information about the true initial state vector s_0. We disturb the initial state s_0 with a noise term Θ which follows the distribution of the residual error. Given the uncertain initial states, learning tries to fulfill the output-target relationships along the dynamics. As a result of the learning we get a state transition matrix in the form of a contraction, which squeezes out the initial uncertainty. A corresponding network architecture is depicted in Fig. 28.3.
Fig. 28.3. Handling the uncertainty of the initial state s_0 by applying adaptive noise
It is important to notice that the noise term Θ is drawn from the observed residual errors without any assumption on the underlying noise distribution. The desensitization of the network to the initial state vector s_0 can therefore be seen as a self-scaling stabilizer of the modeling.
In general, a time discrete trajectory can be seen as a sequence of points over
time. Such a trajectory is comparable to a fine thread in the internal state space.
The trajectory is very sensitive to the initial state vector s_0. If we apply noise to s_0, the trajectory becomes a tube in the internal state space. Due to the
characteristics of the adaptive noise term, the tube contracts over time. This
enforces the identification of a stable dynamical system.
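The adaptive noise term Θ can be sketched as a small helper that collects the residual errors backpropagated to s_0 during training and draws the disturbance from this empirical pool instead of from an assumed distribution. The pool handling below is an illustrative simplification; how the residual at s_0 is obtained depends on the surrounding training code.

import numpy as np

class AdaptiveStateNoise:
    """Disturb s_0 with noise drawn from the residual errors observed at s_0."""

    def __init__(self, dim_s, pool_size=1000):
        self.pool = np.zeros((1, dim_s))    # start with a zero "residual"
        self.pool_size = pool_size

    def record_residual(self, residual_s0):
        # residual_s0: error signal backpropagated to the initial state s_0
        self.pool = np.vstack([self.pool, residual_s0])[-self.pool_size:]

    def disturb(self, s0, rng):
        # draw Theta from the empirical residual distribution (no parametric assumption)
        theta = self.pool[rng.integers(len(self.pool))]
        return s0 + theta

# illustrative usage inside a training loop (residuals would come from backpropagation):
# noise = AdaptiveStateNoise(dim_s=10)
# s0_noisy = noise.disturb(np.zeros(10), np.random.default_rng(0))
# noise.record_residual(residual_at_s0)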
Trick 4. Error Correction Neural Networks (ECNN)
A weakness of the RNN (see Fig. 28.1 or 28.2) is, that modeling might be dis-
turbed by unknown external influences or shocks. As a remedy, the next trick
called error correction neural networks (ECNN) introduces an additional term
z_τ = tanh(y_τ − y_τ^d) in the state transition (28.4). The term can be interpreted as a correctional factor: the model error (y_τ − y_τ^d) at time τ quantifies the misfit and may help to adjust the model output afterwards.

    state transition:   s_{τ+1} = tanh(A s_τ + B u_τ + D tanh(y_τ − y_τ^d))
    output equation:    y_τ = C s_τ                                  (28.4)
In Eq. 28.4 the model output y_τ is computed by Cs_τ and compared with the observation y_τ^d. The output clusters of the ECNN which generate error signals during the learning phase are z_τ (τ ≤ t) and y_τ (τ > t). Keep in mind that the target values of the sequence of output clusters z_τ are zero, because we want to optimize the compensation mechanism y_τ − y_τ^d between the expected value y_τ and its observation y_τ^d. The additional non-linear squashing function in z_τ = tanh(y_τ − y_τ^d) absorbs large errors and shocks. A special case occurs at the future time steps τ ≥ t + 1: here we have no compensation y_{t+1}^d of the internal expected value, and thus the system is offering a forecast y_{t+1} = Cs_{t+1}.
The system identification task (see Eq. 28.2) is once again solved by finite
unfolding in time [3]. Fig. 28.4 depicts the resulting ECNN[15].
Fig. 28.4. ECNN incorporating overshooting. Note, that −Id is the fixed negative of an identity matrix, while z_{t−τ} are output clusters to model the error correction mechanism.
The ECNN has two different inputs: (i) the externals u_τ, which directly influence the state transition, and (ii) the targets y_τ^d. Only the difference between y_τ and y_τ^d has an impact on s_{τ+1}. In the future τ > t, we have no compensation for y_τ and thus compute forecasts y_τ = Cs_τ. We have successfully applied
ECNN models to predict the demand of products and product groups within the
context of supply chain management.
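The state transition of Eq. 28.4 can be written as a single update function: in the past the observation y_τ^d is available and the error correction term is active, while in the future the correction and the external input are dropped and the output becomes a forecast. This is an illustrative sketch with hypothetical matrix shapes, not the authors' implementation.

import numpy as np

def ecnn_step(A, B, C, D, s, u=None, y_obs=None):
    """One ECNN step (Eq. 28.4); y_obs and u are None for future time steps tau > t."""
    y = C @ s                                     # model output y_tau = C s_tau
    pre = A @ s
    if u is not None:
        pre = pre + B @ u                         # external drivers (past only)
    if y_obs is not None:
        pre = pre + D @ np.tanh(y - y_obs)        # error correction term
    return np.tanh(pre), y

# illustrative usage with hypothetical shapes: one past step, then one forecast step
rng = np.random.default_rng(7)
A = rng.normal(scale=0.1, size=(10, 10)); B = rng.normal(scale=0.1, size=(10, 3))
C = rng.normal(scale=0.1, size=(2, 10));  D = rng.normal(scale=0.1, size=(10, 2))
s, y = ecnn_step(A, B, C, D, np.zeros(10), u=rng.normal(size=3), y_obs=rng.normal(size=2))
s, y_forecast = ecnn_step(A, B, C, D, s)          # tau > t: no compensation, y is a forecast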
Trick 5. Variant-Invariant Separation
For the modeling of high-dimensional dynamical systems, the next trick called
variant-invariant separation extends the architecture of the ECNN (Fig. 28.5) by
a coordinate transformation to reduce the dimensionality of the original forecast
problem. The dimension reduction is realized in form of a bottleneck sub-network
(Fig. 28.5, left). The compressor E removes time invariant structures from the
dynamics. The reconstruction of the complete dynamics (decompression) is done
by matrix F . The bottleneck is implicitly connected to the ECNN via the shared
matrices E and F [15]. We compress the high-dimensional vector y_t^d to a lower dimensional vector x_t using a bottleneck neural network. The vector x_t contains all relevant information about y_t^d. The ECNN predicts the low dimensional vector x_τ instead of the high dimensional vector y_τ.

Fig. 28.5. ECNN with Variant-Invariant Separation

The error correction compares the expectations x_τ = Cs_τ with the transformed observations x_τ^d = E(y_τ^d). Note, that the negative inputs −y_τ^d are required by the ECNN to generate the transformed targets x_τ^d. In our experience we found, that the training of the
extended ECNN (Fig. 28.5) is very robust, i.e. the coordinate transformation
and the forecasting can be trained in parallel. The time-invariant information
is not lost in the bottleneck network, but simply relegated to lower components
of variance of the representation. Furthermore, node pruning can be applied to
the middle layer. Due to shared weights the result of a pruning in the bottleneck
network is transferred to the ECNN branch as well.
We have successfully applied the ECNN depicted in Fig. 28.5 to forecast elec-
trical load curves and traffic flows. In load forecasting the typical application
scenario is that one has to forecast the load curve in 15 minute time buckets,
i.e. 96 observations per day. To avoid the error accumulation of a pure itera-
tive model it is more useful to forecast the load curve day-by-day. From our
experience a load curve (i.e. 96 observations per day) can be compressed to an 8-dimensional indicator vector (dim = 8) from which the load curve can in turn be
reconstructed. From a mathematical viewpoint this approach corresponds to the
modeling of dynamical systems on manifolds. More complicated manifolds can
be generated by deep bottleneck neural networks.
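A minimal sketch of the coordinate transformation: a daily load curve of 96 values is compressed by the matrix E to an 8-dimensional indicator vector and reconstructed by F. A single tanh layer is used here as an illustrative bottleneck; in the full architecture E and F are shared with the ECNN branch, which is omitted, and all weights and data are hypothetical.

import numpy as np

rng = np.random.default_rng(2)
dim_y, dim_x = 96, 8                             # 96 quarter-hour loads per day -> 8 indicators

E = rng.normal(scale=0.1, size=(dim_x, dim_y))   # compressor (shared with the ECNN branch)
F = rng.normal(scale=0.1, size=(dim_y, dim_x))   # decompressor

y_d = rng.normal(size=dim_y)                     # one observed daily load curve
x_d = np.tanh(E @ y_d)                           # transformed target x_d = E(y_d)
y_rec = F @ x_d                                  # reconstruction of the full load curve
print(x_d.shape, y_rec.shape)                    # (8,) (96,)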
Trick 6. Historically Consistent Neural Networks (HCNNs)
Many real-world technical and economic applications can be seen in the context
of large systems in which various (non-linear) dynamics interact with one an-
other (in time). Unfortunately only a small sub-set of variables can be observed.
From the sub-set of observed variables we have to reconstruct the hidden vari-
ables of the large system in order to understand the dynamics. Here the term
observables includes the input and output variables in conventional modeling
approaches (i.e. y_τ := (y_τ, u_τ)). This indicates a consistency problem in the RNN (Fig. 28.1) or ECNN (Fig. 28.4) respectively: on the output side, the RNN (ECNN) provides forecasts of the dynamics in the observables y_τ, whereas the input side assumes that the observables y_τ will not change from present time
on. This lack of consistency represents a clear contradiction within the model
framework. If, on the other hand, we are able to implement a model framework
in which common descriptions and forecasts can be used for the trend in all of
the observables, we will be in a position to close the open system; in other words, we will model a closed large dynamic system.
The next trick called historically consistent neural networks (HCNNs) intro-
duces a model class which follows the design principles for modeling of large dy-
namic systems and overcomes the conceptual weaknesses of conventional models.
Equation 28.5 formulates the historically consistent neural network (HCNN).
    state transition:   s_{τ+1} = A tanh(s_τ)
    output equation:    y_τ = [Id, 0] s_τ                            (28.5)

The joint dynamics for all observables is characterized in the HCNN (28.5) by the sequence of states s_τ. The observables (i = 1, ..., N) are arranged on the first N state neurons s_τ and followed by non-observable (hidden) variables as subsequent neurons. The connector [Id, 0] is a fixed matrix which reads out the observables. The initial state s_0 is described as a bias vector. The bias s_0 and matrix A contain the only free parameters.
Like standard RNNs (Fig. 28.1) HCNNs also have universal approximation
capabilities. The proof for the RNN can be found in [11]. Fig. 28.6 outlines the
proof for HCNNs. As depicted in Fig. 28.6 the proof of the universal approxi-
mation capabilities for HCNN can be divided in six steps:
1. The output equation is shifted one time step into the future and the resulting
s_{τ+1} is substituted by the system transition equation.
2. By combining outputs and state variables into an extended state we get an
extended state transition equation. The output of the system is derived from
the first components of the extended internal state.
3. For the extended state transition equation we apply the feedforward universal
approximation theorem. At least for a finite time horizon this guarantees a
small approximation error. Note, that in RNNs at least one large component
of the state vector together with the hyperbolic tangent can mimic a bias
vector. Thus, we have omitted the explicit notation of a bias vector in the
NN equations.
4. In this step we remove one of the two matrices within the state transition
equation. We apply a state transformation r_τ = A s_τ. This results in two
state transition equations.
5. The two state transition equations can be reorganized in one state transition,
which has twice the dimensionality of the original equation.
6. Rewriting the matrix located in front of the tanh activation function results
in the claimed formulation for closed systems.
Instead of being applied inside the tanh activation function, matrix A is used
outside the tanh activation function. This has the advantage that the states and
thus, also the system outputs are not limited to the finite state space (−1; 1)^n created by the tanh(·) nonlinearity. The output equation has a simple and ap-
plication independent form. Note, that we can only observe the first elements of
the state vector.
Fig. 28.6. Proof of the universal approximation capabilities for HCNNs
Fig. 28.7. Historically Consistent Neural Network (HCNN)
Fig. 28.7 depicts the HCNN architecture. The HCNN states s_τ are hidden layers with tanh squashing. The forecasts are supplied by the output layers y_τ. There are no target values available for the future time steps. The expected values y_{τ>t} can be read out at the corresponding future time steps of the network.
Since the HCNN model has no inputs, we have to unfold the neural network
along the complete data history. This is different from small recurrent neural networks (see e.g. Fig. 28.2), where we construct training data patterns in the form of
sliding windows. The HCNN learns the large dynamics from a single history of
observations (i.e. a single training data pattern). Forecasting commodity prices
with HCNN, we unfold the neural network over 440 trading days in the past to
predict the next 20 days.
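The HCNN rollout of Eq. 28.5 is easily sketched: the only free parameters are the bias vector s_0 and the transition matrix A, the observables are read off the first N state components, and the network is unfolded over the whole history plus the forecast horizon. Dimensions follow the commodity example above but are otherwise hypothetical; training is omitted.

import numpy as np

def hcnn_rollout(A, s0, n_obs, horizon):
    """s_{tau+1} = A tanh(s_tau), y_tau = [Id, 0] s_tau (Eq. 28.5)."""
    s = s0
    outputs = []
    for _ in range(horizon):
        outputs.append(s[:n_obs])            # read out the first N state components
        s = A @ np.tanh(s)                   # A acts outside the tanh
    return np.array(outputs)

# hypothetical setup: 300-dimensional state, 20 observables, 440 past + 20 future steps
rng = np.random.default_rng(3)
dim_s, n_obs = 300, 20
A = rng.normal(scale=1.0 / np.sqrt(dim_s), size=(dim_s, dim_s))
s0 = rng.normal(size=dim_s)                  # the bias s_0 and A are the only free parameters
print(hcnn_rollout(A, s0, n_obs, horizon=440 + 20).shape)   # (460, 20)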
Trick 7. Architectural Teacher Forcing (ATF) for HCNNs
In practice we observe that HCNNs are difficult to train since the models do not
have any input signals and are unfolded across the complete data set. Our next
trick, architectural teacher forcing (ATF) for HCNNs, makes the best possible
use of the data from the observables and accelerates the training of the HCNN
[20, 14, 9]. The HCNN with integrated teacher forcing is shown in Fig. 28.8 below.
Fig. 28.8. HCNN with Architectural Teacher Forcing (ATF)

In the HCNN with ATF (Fig. 28.8) the expected values for all observables up to time τ = t are replaced with the actual observations. The output layers of
the HCNN are given fixed target values of zero. The negative observed values −y_τ^d for the observables are added to the output layer. This forces the HCNN to create the expected values y_τ to compensate for the negative observed values −y_τ^d. The content of the output layer, i.e. y_τ − y_τ^d, is now transferred to the first N neurons of the hidden layer r_τ on a component-by-component basis with a minus symbol. In addition, we copy the expected values y_τ from the state s_τ to the intermediate hidden layer r_τ. As a result, the expected values y_τ on the first N components of the state vector s_τ are replaced by the observed values y_τ^d = y_τ − (y_τ − y_τ^d) (Fig. 28.8). All connections of the ATF mechanism are fixed.
Following the ATF step, the state transition matrix A is applied to move the
system into the next time step. By definition, we have no observations for the
observables in future time steps. Here, the system is iterated exclusively upon the
expected values. This turns an open into a closed dynamic system. The HCNN
in Fig. 28.8 is equivalent to the architecture in Fig. 28.7, if it converges to zero
error in the training. In this case we have solved the original problem.
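The core of the ATF mechanism is the substitution y_τ^d = y_τ − (y_τ − y_τ^d) on the first N state components before the transition matrix is applied. The following single-step sketch is a simplified reading of Fig. 28.8 (the intermediate layer r_τ and the output layers are folded into the substitution) and is not the original implementation.

import numpy as np

def hcnn_atf_step(A, s, n_obs, y_obs=None):
    """One HCNN step with teacher forcing: in the past (y_obs given) the expected
    values on the first n_obs state components are replaced by the observations
    before the transition matrix is applied; in the future the step runs freely."""
    y = s[:n_obs]                            # expected values read out of the state
    if y_obs is not None:
        s = s.copy()
        s[:n_obs] = y - (y - y_obs)          # = y_obs: the substitution of Fig. 28.8
    return A @ np.tanh(s), y

# in training, y_obs would be supplied for tau <= t and omitted for tau > t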
Trick 8. Sparsity, Dimensionality vs. Connectivity and Memory
HCNNs may have to model ten, twenty or even more observables in parallel over
time. It is clear that we have to work with high dimensional dynamical systems
(e.g. in our commodity price models we use dim(s) = 300). The iteration with
a fully connected state transition matrix A of such a dimension is dangerous:
Sometimes the matrix vector operations will produce large numbers which will
be spread in the recursive computation all over the network and will generate
an arithmetic overflow. To avoid this phenomenon we can choose a sparse matrix
A. Thus, the linear algebra does not accumulate large numbers and the spread
of large numbers through the network is damped by the sparsity too.
We have to answer the question of which dimensionality and which sparsity we will choose. In [16] we have worked out that dimensionality and sparsity
are related to another pair of meta-parameters: Connectivity (con) and memory
length (mem). Connectivity is defined as the number of nonzero elements in
each row of matrix A. The memory length is the number of steps from which
we have to collect information in order to reach a Markovian state, i.e. the state
vector contains all necessary information from the past. We propose the following
parameterization for the state dimension (dim(s)) and sparsity [16]:
    Dimension of A:   dim(s) = con · mem                                        (28.6)
    Sparsity of A:    sparsity = random( con / (mem · con) ) = random( 1 / mem )  (28.7)
Eq. 28.7 represents the insight that a sparse system conserves information over a
longer time period before it diffuses in the network. For instance a shift register
is very sparse and behaves only as a memory, whereas in a fully connected matrix
the superposition of information masks the information sources. Let us assume
that we have initialized the state transition matrix with a uniform random sparse
matrix A. Following Eq. 28.7 the more dense parts of A will model the faster
sub-dynamics within the overall dynamics, while the highly sparse parts of A
will focus on slow subsystems. As a result a sparse random initialization allows
the combined modeling of systems on different time scales.
Unfortunately, Eq. 28.6 favors very large dimensions. Our earlier work on the
subject (see [16]) started with the predefinition of the system's memory length
mem, because for RNNs the memory length is equal to the length of the past
unfolding in time. On the other hand, connectivity has to be chosen larger than
the number of the observables. Working with HCNNs the memory length is less
important, because we unfold the neural network along the whole data horizon.
Here the connectivity plays the superior role. From our experience we know that
the EBTT algorithm works stably with a connectivity which is equal to or smaller than 50 (con ≤ 50). For computational performance we usually limit the state dimensionality to dim(s) = 300. This implies a sparsity of 50/300 ≈ 17%. We leave the fine tuning of the parameters to the EBTT learning.
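Eqs. 28.6 and 28.7 translate directly into an initialization routine: given the connectivity con and the memory length mem, the state dimension is con · mem and every row of A receives con randomly placed nonzero entries, i.e. a fraction 1/mem of nonzeros. The sketch below uses the numbers quoted above (con = 50, dim(s) = 300); the Gaussian scale of the nonzero weights is an arbitrary illustrative choice.

import numpy as np

def init_sparse_transition(con, mem, scale=0.1, rng=None):
    """Random sparse A with dim(s) = con * mem (Eq. 28.6) and con nonzero entries
    per row, i.e. a sparsity of con / (mem * con) = 1 / mem (Eq. 28.7)."""
    rng = rng or np.random.default_rng()
    dim_s = con * mem
    A = np.zeros((dim_s, dim_s))
    for row in range(dim_s):
        cols = rng.choice(dim_s, size=con, replace=False)   # random sparsity grid
        A[row, cols] = rng.normal(scale=scale, size=con)
    return A

A = init_sparse_transition(con=50, mem=6)        # dim(s) = 300, sparsity = 1/6 ~ 17%
print(A.shape, np.count_nonzero(A) / A.size)     # (300, 300) ~0.167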
We propose to initialize the neural network with a randomly chosen spar-
sity grid. The sparsity grid is therefore chosen arbitrary and not optimized by
e.g. pruning algorithms. This raises the question if a random sparse initializa-
tion biases the network towards inferior solutions. This is handled by ensemble
forecasts. We have performed ensemble experiments with different sparsity grids
versus ensembles based on the same sparsity grid. We found, that the average of
the ensemble as well as the ensemble width are unaffected by the initialization of
the sparsity grid (for more details on ensemble forecasting see Trick 12). These
considerations hold only for large systems.
Trick 9. Causal-Retro-Causal Neural Networks (CRCNNs)
The fundamental idea of the HCNN is to explain the joint dynamics of the
observables in a causal manner, i.e. with an information flow from the past to
the future. However, rational planning is not only a consequence of a causal
information flow but also of anticipating future developments and responding
to them on the basis of a certain goal function. This is similar to the adjoint
equation in optimal control theory [6]. In other words, a retro-causal information
flow is equivalent to asking for the motivation of a behavior as a goal. In turn,
this is the anchor point for the reconstruction of the dynamics.
In order to incorporate the effects of rational decision making and planning
into the modeling, the next trick introduces causal-retro-causal neural networks
(CRCNNs). The idea behind the CRCNN is to enrich the causal information
flow within the HCNN, which is directed from the past to the future, by a retro-
causal information flow, directed from the future into the past. The CRCNN
model is given by the following set of equations (28.8):

    causal state transition:        s_τ = A tanh(s_{τ−1})
    retro-causal state transition:  s′_τ = A′ tanh(s′_{τ+1})
    output equation:                y_τ = [Id, 0] s_τ + [Id, 0] s′_τ          (28.8)
The output equation y_τ of the CRCNN (Eq. 28.8) is a mixture of causal and retro-causal influences. The dynamics of all observables is hence explained by a sequence of causal states s_τ and retro-causal states s′_τ using transition matrices A and A′ for the causal and retro-causal information flow. Upon the basis of Eq. 28.8, we draw the network architecture for the CRCNN as depicted in Fig. 28.9.
Fig. 28.9. A Causal-Retro-Causal Historically Consistent Neural Network (CRCNN)
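A sketch of the CRCNN dynamics of Eq. 28.8: a causal state sequence is rolled forward from an initial state, a retro-causal sequence is rolled backward from a terminal state, and the output superposes the first N components of both. Training and the ATF mechanism of Trick 10 are omitted; the states, matrices and dimensions are hypothetical.

import numpy as np

def crcnn_outputs(A, A_retro, s0, s_end, n_obs, horizon):
    """y_tau = [Id,0] s_tau + [Id,0] s'_tau with causal and retro-causal states (Eq. 28.8)."""
    causal = [s0]
    for _ in range(horizon - 1):
        causal.append(A @ np.tanh(causal[-1]))         # s_tau = A tanh(s_{tau-1})
    retro = [s_end]
    for _ in range(horizon - 1):
        retro.append(A_retro @ np.tanh(retro[-1]))     # s'_tau = A' tanh(s'_{tau+1})
    retro = retro[::-1]                                # order the retro states forward in time
    return np.array([c[:n_obs] + r[:n_obs] for c, r in zip(causal, retro)])

rng = np.random.default_rng(4)
dim_s, n_obs, horizon = 100, 5, 60
A = rng.normal(scale=0.1, size=(dim_s, dim_s))
A_retro = rng.normal(scale=0.1, size=(dim_s, dim_s))
y = crcnn_outputs(A, A_retro, rng.normal(size=dim_s), rng.normal(size=dim_s), n_obs, horizon)
print(y.shape)   # (60, 5)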
Trick 10. Architectural Teacher Forcing (ATF) for CRCNNs
The CRCNN (Fig. 28.9) is also unfolded across the entire time path, i.e. we learn
the unique history of the system. As with the training of the HCNN, CRCNNs
are difficult to train. Our next trick called architectural teacher forcing (ATF) for
CRCNNs formulates TF as a part of the CRCNN architecture, which allows us
to learn the CRCNN using the standard EBTT algorithm[3]. ATF enables us to
exploit the information contained in the data more efficiently and accelerates the
training itself. Fig. 28.10 depicts the CRCNN architecture incorporating ATF.
Fig. 28.10. Extended CRCNN with an architectural Teacher Forcing (ATF)
mechanism
Let us explain the ATF mechanism in the extended CRCNN model
(Fig. 28.10): The extended CRCNN uses a causal-retro-causal network to correct
the error of the opposite part of the network in a symmetric manner. In every
time step τ t the expected values y
τ
are replaced by the observations y
d
τ
using
the intermediate tanh() layers for the causal and for the retro-causal part. Since
the causal and the retro-causal part jointly explain the observables y
τ
,wehave
to inject the causal into the retro-causal part and vice versa. This is done by
compensating the actual observations y
d
τ
with the output of the causal (Δ
τ
)
and the retro-causal part (Δ
τ
) within output layers with fixed target values of
zero. The resulting content (Δ
τ
+ Δ
τ
y
d
τ
=0) of the output layers is negated
and transferred to the causal and retro-causal part of the CRCNN using the fixed
[Id,0]
connector. Within the intermediate tanh() layers the expectations of y
τ
are replaced with the actual observations y
d
τ
, whereby the contribution to y
τ
from the opposite part of the CRCNN is considered. Note, that ATF does not
lead to a larger number of free network parameters, since all new connections
are fixed and are used only to transfer data in the NN. In future direction τ>t
the CRCNN is iterated exclusively on the basis of expected values.
The usage of ATF does not reintroduce an input / output modeling, since
we replace the expected value of the observables with their actual observations,
while simultaneously considering the causal and retro-causal part of the dynam-
ics. For sufficiently large CRCNNs and convergence of the output error to zero,
the architecture in Fig. 28.10 converges towards the fundamental CRCNN ar-
chitecture shown in Fig. 28.9. The advantage of the CRCNN is, that it allows a
fully dynamical superposition of the causal and retro-causal information flows.
The CRCNN depicted in Fig. 28.10 describes a dynamics on a manifold. In ev-
ery time step the information flow incorporates closed loops, which technically
can be seen as equality constraints. These constraints are only implicitly defined
through the interaction of the causal and retro-causal parts. The closed loops
in the CRCNN architecture (Fig. 28.10) lead to fix-point recurrent substructures
within the model, which are hard to handle with EBTT. As a remedy we propose
a solution concept similar to Trick 7: We embed the CRCNN model into a larger
network architecture, which is easier to solve and converges to the same solution
as the original system. Fig. 28.11 depicts an initial draft for such an embedding.
The extended architecture in Fig. 28.11 is a duplication of the original model
depicted in Fig. 28.10. The CRCNN architecture in Fig. 28.11 does not contain
closed loops, because we split the ATF mechanism for the causal and retro-
causal part into two branches. It is important to notice that these branches are
implicitly connected through the shared weights in the causal and retro-causal
part. If this architecture converges, the ATF is no longer required and we have
two identical copies of the CRCNN model depicted in Fig. 28.9. The solution
proposed for the embedding is not the only feasible way to handle the fix-point
loops. We will outline alternative solutions in an upcoming paper. The CRCNN
is the basis for our projects on forecasting commodity prices.
Fig. 28.11. Asymmetric split of ATF in CRC neural networks
Trick 11. Stable & Unstable Information Flows in Dynamical Systems
Following Trick 3 it is natural to apply noise in the causal as well as in the retro-causal branch (see also Fig. 28.12). This should improve the stability of both time directions, i.e. of both information flows. In CRCNNs this feature has a special inter-
pretation: Stability in the causal information flow means that the uncertainty
in the beginning is damped along the time path from past to future. Instability
in the causal system means that a small disturbance in the past diffuses to very
different future scenarios under a chaotic regime (see Fig. 28.12, upper part).
Fig. 28.12. An unstable causal dynamics is converted to a stable retro-causal dynamics
If the causal information flow of a sub-dynamics is unstable, then the retro-causal description of this sub-system is stable. On the other hand, an unstable retro-causal dynamics is stable from a causal perspective. In a combined causal-retro-causal neural network the causal and the retro-causal branch can be simultaneously stable, even if the underlying dynamics is partially unstable. In order
to enforce the stability of the causal and the retro-causal part we apply noise at
the origins of both branches. (Thanks to Prof. Jürgen Jost, MPI Leipzig, for a fruitful discussion on this topic.)
Trick 12. Uncertainty and Risk
The experience gained during the latest financial crisis has triggered a far-
reaching discussion on the limitations of quantitative forecasting models and
made investors very conscious of risk[2]. In order to understand risk distribu-
tions, traditional risk management uses diffusion models. Risk is understood as
a random walk, in which the diffusion process is calibrated by the observed past
error of the underlying model[7]. In contrast the next trick called uncertainty
and risk focuses on ensemble forecasts in order to provide important insights
into complex risk relationships, since internal model (unobserved) variables can
be reconstructed from the trend in observed variables (observables).
If the system identification is calculated repeatedly for HCNNs or CRCNNs,
an ensemble of solutions will be produced, which all have a forecast error of
zero in the past, but which differ from one another in the future. Since every
HCNN or CRCNN model gives a perfect description of the observed data, the
complete ensemble is the true solution. A way to simplify the forecast is to take
the arithmetical average of the individual ensemble members as the expected
value, provided the ensemble histogram is unimodal in every time step.
In addition to the expected value, we consider the bandwidth of the ensemble,
i.e. its distribution. The form of the ensemble is governed by differences in the re-
construction of the hidden system variables from the observables: for every finite
volume of observations there is an infinite number of explanation models which
describe the data perfectly, but differ in their forecasts, since the observations
make it possible to reconstruct the hidden variables in different forms during the
training. In other words, our risk concept is based on the partial observability of
the world, leading to different reconstructions of the hidden variables and thus,
different future scenarios. Since all scenarios are perfectly consistent with the
history, we do not know which of the scenarios describes the future trend best
and risk emerges.
This approach directly addresses the model risk. For HCNN and CRCNN
modeling we claim that the model risk is equal to the forecast risk. The reasons
can be summarized as follows: First, HCNNs are universal approximators, which
are therefore able to describe every market scenario. Second, the form of the
ensemble distribution is caused by an underlying dynamics, which interprets the
market dynamics as the result of interacting decisions[17]. Third, in experiments
we have shown that the ensemble distribution is independent of the details of
the model configuration, if we use large models and large ensembles.
Let us exemplify our risk concept. The diagram below (Fig. 28.13, left) shows
the approach applied to the Dow Jones Industrial Index (DJX). For the ensem-
ble, a HCNN was used to generate 250 individual forecasts for the DJX. For
every forecast date, all of the individual forecasts for the ensemble represent
the empirical density function, i.e. a probability distribution over many possible
market prices at a single point in time (see Fig. 28.13, right). It is noticeable
that the actual development of the DJX is always within the ensemble channel
(see gray lines, Fig. 28.13, left). The expected value for the forecast distribution
is also an adequate point forecast for the DJX (see Fig. 28.13, right).
Fig. 28.13. HCNN ensemble forecast for the Dow Jones Index (12 weeks forecast hori-
zon), left, and associated index point distribution for the ensemble in time step t +12,
right
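Given an ensemble of HCNN or CRCNN forecasts, the expected value and the risk bandwidth are per-time-step statistics over the members, as in the following sketch. The array layout and the quantile levels chosen for the ensemble channel are illustrative assumptions, not part of the original study.

import numpy as np

def ensemble_statistics(forecasts, lower=0.05, upper=0.95):
    """forecasts: array of shape (n_members, horizon). Returns mean and bandwidth."""
    mean = forecasts.mean(axis=0)                 # expected value (if unimodal)
    lo = np.quantile(forecasts, lower, axis=0)    # lower edge of the ensemble channel
    hi = np.quantile(forecasts, upper, axis=0)    # upper edge of the ensemble channel
    return mean, lo, hi

rng = np.random.default_rng(5)
ensemble = rng.normal(size=(250, 12))             # e.g. 250 members, 12-week horizon
mean, lo, hi = ensemble_statistics(ensemble)
print(mean.shape, lo.shape, hi.shape)             # (12,) (12,) (12,)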
28.3 Conclusion and Outlook
Recurrent neural networks model dynamical systems in the form of non-linear
state space models. Just like any other NN, the equation of the RNNs, ECNNs,
HCNNs or CRCNNs can be expressed as an architecture which represents the
individual layers in the form of nodes and the connections between the layers
in the form of links. In the graphical architecture we can apply local learning
algorithms like error back propagation and an appropriate (stochastic) learning
rule to train the NN[13, 17, 3]. This relationship is called the correspondence
principle between equations, architectures and the local algorithms associated
with them (Trick 1).
Finite unfolding in time of RNN using shared weight matrices enables us to
stick to the correspondence principle. Overshooting enforces the autonomous
dynamics and enables long-term forecasting (Trick 2), whereas an adaptive
noise term handles the uncertainty of the finite unfolding in time (Trick 3).
ECNN utilizes the previous model error as an additional input. Hence, the
learning can interpret the model's misfit as an external shock which is used to guide the model dynamics afterwards. This allows us to prevent the autonomous part of the model from adapting misleading inter-temporal causalities. If we know
that a dynamical system is influenced by external shocks, the error correction
mechanism of the ECNN is an important prestructuring element of the networks
architecture to compensate missing inputs (Trick 4).
Extending the ECNN by variant-invariant separation, one is able to include
additional prior structural knowledge of the underlying dynamics into the model.
The separation of variants and invariants with a bottleneck coordinate transfor-
mation allows us to handle high-dimensional problems (Trick 5).
HCNNs model not just an individual dynamics, but complex systems made up
of a number of interacting sub-dynamics. HCNNs are symmetrical in their input
and output variables, i.e. the system description does not draw any distinction
between input, output and internal state variables. Thus, an open system be-
comes a closed system (Trick 6). Sparse transition matrices enable us to model
different time scales and stabilize the training (Trick 8). Causal and retro-causal
information flow within an integrated model (CRCNN) can be used to model ra-
tional planning and decision making in markets. CRCNNs dynamically combine
causal and retro-causal information to describe the prevailing market regime
(Trick 9). Architectural teacher forcing can be applied to efficiently train the
HCNN or CRCNN (Trick 7 and 10). An architectural extension (see Fig. 28.12)
enables us to balance the causal and retro-causal information flow during the
learning of the CRCNN (Trick 11).
We usually work with ensembles of HCNN or CRCNN to predict commodity
prices. All solutions have a model error of zero in the past, but show a different
behavior in the future. The reason for this lies in different ways of reconstructing
the hidden variables from the observations and is independent of different random
sparse initializations. Since every model gives a perfect description of the observed
data, we can use the simple average of the individual forecasts as the expected
value, assuming that the distribution of the ensemble is unimodal. The analysis
of the ensemble spread opens up new perspectives on market risks. We claim that
the model risk of a CRCNN or HCNN is equal to the forecast risk (Trick 12).
Work currently in progress aims to improve the embedding of the CRCNN
architecture (see Fig. 28.10) in order to simplify and stabilize the training. On
the other hand, we analyze the micro-structure of the ensembles and implement
the models in practical risk management and financial market applications.
All NN architectures and algorithms are implemented in the Simulation En-
vironment for Neural Networks (SENN), a product of Siemens Corporate Tech-
nology. Work is partially funded by the German Federal Research Ministry (BMBF
grant Alice, 01 IB10003 A-C).
References
[1] Calvert, D., Kremer, S.: Networks with Adaptive State Transitions. In: Kolen,
J.F., Kremer, S. (eds.) A Field Guide to Dynamical Recurrent Networks, pp. 15–
25. IEEE (2001)
[2] Föllmer, H.: Alles richtig und trotzdem falsch?, Anmerkungen zur Finanzkrise und
Finanzmathematik. In: MDMV, vol. 17, pp. 148–154 (2009)
[3] Haykin, S.: Neural Networks and Learning Machines, 3rd edn. Prentice Hall (2008)
[4] Hull, J.: Options, Futures & Other Derivative Securities. Prentice Hall (2001)
[5] Hornik, K., Stinchcombe, M., White, H.: Multilayer Feedforward Networks are
Universal Approximators. Neural Networks 2, 359–366 (1989)
[6] Kamien, M., Schwartz, N.: Dynamic Optimization: The Calculus of Variations
and Optimal Control in Economics and Management, 2nd edn. Elsevier Science
(October 1991)
[7] McNeil, A., Frey, R., Embrechts, P.: Quantitative Risk Management: Concepts,
Techniques and Tools. Princeton University Press, Princeton (2005)
[8] Neuneier, R., Zimmermann, H.-G.: How to Train Neural Networks. In: Orr, G.B.,
Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 373–423. Springer, Hei-
delberg (1998)
[9] Pearlmutter, B.: Gradient Calculations for Dynamic Recurrent Neural Networks.
In: Kolen, J.F., Kremer, S. (eds.) A Field Guide to Dynamical Recurrent Networks,
pp. 179–206. IEEE Press (2001)
[10] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Internal Representations
by Error Propagation. In: Rumelhart, D.E., McClelland, J.L., et al. (eds.) Parallel
Distributed Processing, vol. 1: Foundations. MIT Press, Cambridge (1986)
[11] Schäfer, A.M., Zimmermann, H.-G.: Recurrent Neural Networks Are Universal
Approximators. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN
2006. LNCS, vol. 4131, pp. 632–640. Springer, Heidelberg (2006)
[12] Wei, W.S.: Time Series Analysis: Univariate and Multivariate Methods. Addison-
Wesley Publishing Company, N.Y. (1990)
[13] Werbos, P.J.: Beyond Regression: New Tools for Prediction and Analysis in the
Behavioral Sciences. PhD Thesis, Harvard University (1974)
[14] Williams, R.J., Zipser, D.: A Learning Algorithm for continually running fully
recurrent neural networks. Neural Computation 1(2), 270–280 (1989)
[15] Zimmermann, H.G., Grothmann, R., Neuneier, R.: Modeling of Dynamical Sys-
tems by Error Correction Neural Networks. In: Soofi, A., Cao, L. (eds.) Modeling
and Forecasting Financial Data, Techniques of Nonlinear Dynamics. Kluwer (2002)
[16] Zimmermann, H.G., Grothmann, R., Schäfer, A.M., Tietz, Ch.: Modeling Large
Dynamical Systems with Dynamical Consistent Neural Networks. In: Haykin, S.,
et al. (eds.) New Directions in Statistical Signal Processing. MIT Press, Cambridge
(2006)
[17] Zimmermann, H.G.: Neuronale Netze als Entscheidungskalkül. In: Rehkugler, H.,
Zimmermann, H.G. (eds.) Neuronale Netze in der Ökonomie, Grundlagen und
wissenschaftliche Anwendungen. Vahlen, Munich (1994)
[18] Zimmermann, H.G., Neuneier, R.: Neural Network Architectures for the Model-
ing of Dynamical Systems. In: Kolen, J.F., Kremer, S. (eds.) A Field Guide to
Dynamical Recurrent Networks, pp. 311–350. IEEE Press (2001)
[19] Zimmermann, H.G., Grothmann, R., Tietz, C., von Jouanne-Diedrich, H.: Market
Modeling, Forecasting and Risk Analysis with Historical Consistent Neural Net-
works. In: Hu, B., et al. (eds.) Operations Research Proceedings 2010, Selected
Papers of the Annual Int. Conferences of the German OR Society (GOR), Munich.
Springer, Heidelberg (September 2011)
[20] Zimmermann, H.G., Grothmann, R., Tietz, Ch.: Forecasting Market Prices with
Causal-Retro-Causal Neural Networks. In: Klatte, D., Lüthi, H.-J., Schmedders,
K. (eds.) Operations Research Proceedings 2011, Selected Papers of the Int. Con-
ference on Operations Research 2011 (OR 2011), Zurich, Switzerland. Springer
(2012)
29
Solving Partially Observable Reinforcement
Learning Problems with Recurrent Neural
Networks
Siegmund Duell¹,², Steffen Udluft¹, and Volkmar Sterzing¹
¹ Siemens AG, Corporate Technology, Intelligent Systems and Control
² Berlin University of Technology, Machine Learning
{duell.siegmund.ext,steffen.udluft,volkmar.sterzing}@siemens.com
Abstract. The aim of this chapter is to provide a series of tricks and
recipes for neural state estimation, particularly for real world applica-
tions of reinforcement learning. We use various topologies of recurrent
neural networks as they allow us to identify the continuous valued, possibly
high dimensional state space of complex dynamical systems. Recurrent
neural networks explicitly offer possibilities to account for time and memory;
in principle they are able to model any type of dynamical system.
Because of these capabilities recurrent neural networks are a suitable
tool to approximate a Markovian state space of dynamical systems. In
a second step, reinforcement learning methods can be applied to solve
a defined control problem. Besides the trick of using a recurrent neural
network for state estimation, various issues regarding real world problems,
such as large sets of observables and long-term dependencies, are
addressed.
29.1 Introduction
In this chapter we present a state estimation approach to tackle partially ob-
servable reinforcement learning [26] problems in discrete time. Reinforcement
learning is the machine learning approach to the optimal control problem. In-
stead of designing the control strategy, reinforcement learning learns it from ac-
tual observations of the system to be controlled. Combined with powerful func-
tion approximators like neural networks, impressive results could be achieved
[28, 13, 19]. In most real world applications, some form of state estimation is
necessary to fulfill the requirements of reinforcement learning.
Consider the task of reducing the emissions of a gas turbine while keeping the humming caused by combustion dynamics low. A gas turbine consists of a
compressor, providing compressed air. Within the combustion chamber, this air
is burned with gas to drive the turbine and the coupled generator. The gas is
injected through multiple burners. At each burner, the fuel flow is split into
different fractions enabling control over the combustion process. The combus-
tion process results in emissions of NOx and CO, which have to be kept below
legal limits. At the same time, the combustion process causes humming. This
humming reduces the lifetime of the machine and can cause fatal damage to
the turbine. Reinforcement learning can address the problem of optimizing the
combustion process by tuning fuel fractions in an elegant way. Reinforcement
learning can also account for seasonal effects and wear because these effects can
be found within sensor readings. Approaches that are based on (fixed) physical
models are usually unable to find such effects. To successfully develop a rein-
forcement learning agent, the problem of state estimation (see chapter 30 [23]
for other problem solutions to real world reinforcement learning applications)
has to be solved, in order to fulfill the requirements of the reinforcement learn-
ing framework. Since a gas turbine provides a vast number of sensor readings,
such as temperature and pressure readings, mass flows and actor settings such
as valve positions and set points of low level controllers, the state is very high
dimensional. Even experts cannot anticipate all effects that could be caused by
the various subsystems and their interactions. A state estimation approach can
overcome this problem.
Various topologies of recurrent neural networks (RNNs) are used for state estimation as they allow the identification of the continuous valued, possibly high dimensional state space of real world applications. RNNs explicitly offer possibilities to account for time and memory; in principle they are able to model any type of dynamical system [9, 15, 11, 32]. Because of these capabilities, RNNs are a valuable tool for state estimation, especially for real world applications. Based on these estimates, methods of reinforcement learning can be applied to solve a defined control problem.
The chapter is divided into four parts. After a brief introduction to reinforcement learning and its requirements, the main trick of modeling a Markovian state space using an RNN is described in section 29.3. Section 29.4 proceeds with tricks to extract a Markov decision process from a possibly large set of observables and adapts a neural topology further towards the task of state estimation for reinforcement learning. To address the task of capturing different time scales of dynamical dependencies, a solution to the long-term dependency problem is provided in section 29.5. For all presented tricks, recipes are introduced as practical guides to avoid potential pitfalls and to improve general applicability. Further, some experiments demonstrating the applicability of the presented procedures are included.
29.2 Background
Reinforcement learning usually requires the system of interest to be described as a Markov decision process (MDP) M := (S, A, P, R), where S and A denote the state and action spaces, respectively, P : S × A × S → [0, 1] the state transition probabilities, i.e., the probability of entering a successor state s_{t+1} by executing an action a_t in state s_t, and R : S × A × S → ℝ the reward function assigning a transition its immediate cost or gain. The aim of reinforcement learning is to derive a policy π : S → A mapping each state to an action that maximizes the return, i.e., the sum of all (discounted) future rewards. A central characteristic of an MDP is the Markov property, which states that the probability of reaching a certain successor state s_{t+1} depends on the current state s_t and action a_t only.
However, in many real-world control problems the current state is not directly accessible. Instead, only a number of observables z_t ∈ Z can be used as source of information about the true current state s_t, rendering the MDP into a partially observable Markov decision process (POMDP) M' := (S, Z, A, P, R, O). In addition to the components of an MDP, a POMDP contains an observation space Z and a (usually unknown) observation function O : S × A → Z, describing the mapping from state-action pairs to observations.
In the gas turbine setting presented in the introduction, one also has to deal with a POMDP. Various measurements as well as actor settings like valve positions are provided as observables z. The state of the turbine can be influenced by changing the fuel flow valve positions. As a result, an immediate reward signal is calculated, providing information about the quality of performed actions.
A common approach for dealing with POMDPs is to model a distribution of possible current states [12], i.e., a belief state. When using such approaches, the selection of action a_t is based on the most probable current state and additionally
its uncertainty [20]. When dealing with technical systems, the partial observabil-
ity mostly stems from the limited available measurements and the fact that one
time step is insufficient to describe the current system state, i.e., most technical
systems can also be described as MDPs of higher order. By simply aggregating a
sufficient number of prior time slices, such Markov processes can be reduced to
MDPs of first order, i.e., the type of MDPs required by reinforcement learning. A simple aggregation, however, often leads to a high dimensional state space and therefore turns out to be impractical in most cases (Bellman's "curse of dimensionality" [2]). Either a deep understanding of the underlying system or, especially in more autonomous settings, a state estimator can overcome this problem. In the following, recurrent neural approaches to model a Markovian state space from partially observable dynamics are derived. The modeled state allows the application of well-understood, powerful reinforcement learning algorithms in the offline setting, such as the neural fitted Q iteration [24] and the recurrent control neural network [31], or online approaches such as actor-critic algorithms [21].
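To make the aggregation idea mentioned above concrete, the following Python sketch stacks the last k observation-action pairs into one state vector; names and dimensions are illustrative only and are not taken from the chapter:

import numpy as np

def aggregate_state(observations, actions, t, k):
    """Stack the last k observation-action pairs into a single state vector,
    reducing a higher-order Markov process to one of first order."""
    window = [np.concatenate([observations[i], actions[i]])
              for i in range(t - k + 1, t + 1)]
    return np.concatenate(window)

# toy usage: 100 steps, 3 observables, 1 action, history of 5 steps
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 3))
a = rng.normal(size=(100, 1))
s_t = aggregate_state(z, a, t=10, k=5)
print(s_t.shape)  # (20,) = k * (dim_z + dim_a); grows quickly with k

The rapid growth of this vector with k is exactly the dimensionality problem that the neural state estimators discussed below are meant to avoid.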
29.3 The Trick of Modeling a Markovian State Space
Using a Recurrent Neural Network
A reinforcement learning agent interacts with a system, thereby commonly altering its evolution. This development can be described as an open dynamical system. Note that in comparison to the description found in chapter 28 [36], the adjustable external drives of the system, a, are explicitly disjoint from other external drivers z. In practice a dynamical system like a gas turbine can be influenced by adjustable external drives like valve positions of fuel or cooling flows, but at the same time, other external drivers such as ambient conditions might also be relevant. For discrete time grids (t = 1, ..., T and T ∈ ℕ) this can be
Fig. 29.1. A dynamics can be affected by actions a, performed by a (reinforcement learning) controller, but also by external drivers z, e.g., ambient conditions of a gas turbine. Both alter the state s of the dynamics and cause a state transition, resulting in a set of observables y. For a gas turbine, observables are sensor readings such as temperatures or pressure levels as well as actor settings such as valve positions.
represented as a set of equations consisting of a state transition and an output equation [9, 11]:

$$s_{t+1} = f(s_t, a_t, z_t) \qquad \text{state transition}$$
$$y_t = g(s_t) \qquad \text{output equation.} \qquad (29.1)$$
Figure 29.1 illustrates equation 29.1. The state transition is a mapping from the present internal state of the system s_t and the influence of the external inputs a_t and z_t to the new state s_{t+1}. In the context of reinforcement learning, the inputs z_t are the agent's (partial) information about the system and a_t represents an action that has been selected by a control strategy. The output equation determines the observable output y_t. In a basic framework, the output is equivalent to the resulting change of the system, in other words the subsequent observables of the system z_{t+1}.
The task of identifying a dynamical system of Eq. 29.1 can be stated as the problem of finding (parametrized) functions f and g such that a distance measurement (Eq. 29.2) between the observed data y_t^d and the model output y_t is minimal:

$$\sum_{t=1}^{T} \left\| y_t - y_t^d \right\|^2 \to \min_{f,g} \qquad (29.2)$$
The identification task of Eqs. 29.1 and 29.2 can be formulated by an RNN of the form

$$s_{t+1} = \tanh(A s_t + c + B z_t + C a_t) \qquad \text{state transition}$$
$$y_t = D s_t \qquad \text{output equation} \qquad (29.3)$$

where A, B, C, and D are weight matrices of appropriate dimensions and c is a bias.
By approximating the functions f and g with an RNN using the weight matrices A, B, C, D, and a bias vector c of fixed dimensions, the system identification task of Eq. 29.2 is transformed into a parameter optimization problem:

$$\sum_{t=1}^{T} \left\| y_t - y_t^d \right\|^2 \to \min_{A,B,C,D,c} \qquad (29.4)$$
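As an illustration of Eqs. 29.3 and 29.4, the following numpy sketch unfolds the recurrence with shared weight matrices and evaluates the summed squared error; the dimensions and random parameters are placeholders of our own choosing, not values from the chapter:

import numpy as np

def rnn_forward(z_seq, a_seq, A, B, C, D, c, s0=None):
    """Unfold s_{t+1} = tanh(A s_t + c + B z_t + C a_t), y_t = D s_t (Eq. 29.3)."""
    s = np.zeros(A.shape[0]) if s0 is None else s0
    outputs = []
    for z_t, a_t in zip(z_seq, a_seq):
        outputs.append(D @ s)                       # output equation
        s = np.tanh(A @ s + c + B @ z_t + C @ a_t)  # state transition with shared weights
    return np.array(outputs)

def identification_error(y_model, y_data):
    """Summed squared error of Eq. 29.4."""
    return np.sum((y_model - y_data) ** 2)

# toy usage with random placeholder weights
rng = np.random.default_rng(0)
dim_s, dim_z, dim_a, T = 4, 3, 1, 50
A = rng.normal(scale=0.3, size=(dim_s, dim_s))
B = rng.normal(size=(dim_s, dim_z))
C = rng.normal(size=(dim_s, dim_a))
D = rng.normal(size=(dim_z, dim_s))
c = np.zeros(dim_s)
z_seq = rng.normal(size=(T, dim_z))
a_seq = rng.normal(size=(T, dim_a))
y = rnn_forward(z_seq, a_seq, A, B, C, D, c)
y_target = np.vstack([z_seq[1:], np.zeros((1, dim_z))])  # targets y_t^d = z_{t+1}
print(identification_error(y, y_target))

In practice the weights would of course be obtained by minimizing this error with backpropagation through time rather than being drawn at random.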
Fig. 29.2. A standard RNN used as state estimator during the training phase. Each circle represents a hidden layer of neurons (called cluster). Each arrow depicts a weight matrix, connecting all neurons between two layers. Note that all hidden clusters as well as all output clusters are connected to a bias (not shown in the figure).
This parameter optimization problem is solved by an RNN with finite unfolding in time using shared weight matrices A, B, C, and D [22, 9]. Fig. 29.2 depicts the resulting spatial neural network architecture. The RNN is trained with backpropagation through time, which is a shared-weights extension of standard backpropagation [22, 9]. The training of an RNN using shared weights is straightforward and delivers sound results. Unlike feed-forward networks, RNNs are able to establish memory due to the temporal structure with unfolding towards the past and the future. This allows inter-temporal dependencies and latent variables to be modeled. E.g., the emissions of a gas turbine are delayed by multiple steps because of sensor response characteristics and the distance between the source of the emissions (within the combustion chamber) and the point of measurement (within the exhaust gas flow).
In order to map the state estimation to the RNN, the hidden units s of the RNN (Fig. 29.2) describe the state variables for the reinforcement learning task. In other words, after training, these variables are used as outputs of the state estimation function. The externally available observables and the performed action are used as input vector z_t and action vector a_t; y_t^d = z_{t+1} defines the targets.
The RNN is extended into the future by so-called overshooting [35], i.e., the
network is unfolded beyond the present time into the future (Fig. 29.2). This
results in a whole sequence of forecasts as outputs. Overshooting regularizes the
learning and thus improves the model’s performance [35]. Overshooting does not
require additional network parameters as the shared weight matrices A, B, C, and D are used.
The resulting trained network models the state transition of the underlying dynamical system with respect to the sequence of actions. If the system is able to model the forecast horizon sufficiently well, the Markov property for the hidden cluster s_0 is arbitrarily well approximated [30, 27]. Therefore a sufficiently well
trained RNN results in a state estimator for a subsequent use in reinforcement
learning algorithms [29].
In this section the basic idea of using an RNN for state estimation was in-
troduced. Subsequent to the trick of using RNNs as state estimators, a series
of recipes to overcome various problems when designing neural state estimators
will be presented. First, the important role of actions on the state estimation
task and the requirement to generalize over all actions within the action space
(Sec. 29.3.1, 29.3.2) is discussed. Next, data and pattern preprocessing meth-
ods to provide a suitable training and test set (Sec. 29.3.3, 29.3.4, 29.3.5) are
introduced. Thereafter, the learning process is introduced (Sec. 29.3.6, 29.3.7).
Finally, the reduced recall-topology of a trained state estimator (Sec. 29.3.8)
is presented. The applicability of the approaches is illustrated on the cart-pole
problem in Sec. 29.3.9.
29.3.1 Improving the Generalization Capabilities with Respect to
Actions
In contrast to the standard definition of an open dynamical system and the
description of overshooting [35], the known sequence of actions is used as past and future inputs a_{−p}, a_{−p+1}, ..., a_0, a_1, ..., a_f, where p describes the number of past and f the number of future steps (Fig. 29.2). This is crucial, since the data might have been generated by an arbitrary, possibly unknown control strategy.
The resulting network should be capable of modeling the dynamics of the system based on the influence of these external drivers a_t, rather than predicting a sequence of actions from the Markovian state. E.g., the data of a gas turbine might be generated by an arbitrary control strategy. If this effect is neglected, the state estimator is forced to encode the underlying policy within the network. The resulting network is most likely useless since it does not generalize to different action sequences as they occur when following a reinforcement learning policy. In general, it is also recommended to test a trained state estimator for its generalization capabilities towards different action sequences. This is possible for most systems. E.g., for the gas turbine certain actions are known to reduce emissions; applying them to the turbine should lead to a change of NOx or CO.
29.3.2 A Recipe to Improve the Modeling of State Transitions
The standard RNN shown in Fig. 29.2 uses a single hidden cluster of neurons to encode a state s; the state transition is modeled by the matrix A. In case of state estimation for reinforcement learning we are interested in modeling a state transition, i.e., s_t × a_t → s_{t+1}. This can be explicitly expressed in the neural topology, improving the overall performance of the estimator. Especially the generalization capabilities with respect to different actions are improved. Note that a good state estimator not only minimizes the training error of the optimization task described in Eq. 29.4, but also generalizes well to different sequences of actions (Sec. 29.3.1). The neural representation of a state transition is shown in Fig. 29.3.
29.3.3 Scaling of Inputs and Targets
As in other supervised learning approaches, scaling of input and target values is an essential preprocessing step. Usually input scaling to a mean of zero and a standard deviation of one is preferred. Other scaling approaches can be used as well, e.g., to encode prior knowledge about a dynamics. E.g., for concentrations like the emissions of a gas turbine, a logarithm can be used to avoid negative, implausible output values.
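A minimal sketch of the two scalings mentioned above, z-scoring of inputs and a logarithmic transform for non-negative, concentration-like targets (function names are ours):

import numpy as np

def fit_zscore(x):
    """Per-column mean and standard deviation for scaling to zero mean, unit variance."""
    return x.mean(axis=0), x.std(axis=0) + 1e-12

def apply_zscore(x, mean, std):
    return (x - mean) / std

def log_scale(concentration, eps=1e-6):
    """Logarithmic scaling for non-negative quantities such as emission
    concentrations; the inverse transform can never produce negative values."""
    return np.log(concentration + eps)

def inverse_log_scale(y, eps=1e-6):
    return np.exp(y) - eps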
29.3.4 Block Validation
As the patterns for an RNN comprise a whole sequence of input and target
values, subsequent patterns are highly correlated. Each pattern differs from the
preceding one only by a single time slice. Consequently, assigning patterns to
the training (used to update weights) or validation set (not used to update
weights but to evaluate the performance of the current set of weights) at random
would lead to a strong correlation between training and validation set and hence between the training and validation error. In order to de-correlate the pattern sets, the granularity of the partitioning is increased by using blocks of patterns that do not contain overlapping information. A block is either used for training or testing. Ideally, sets with identical statistical properties are generated.
The usage of blocks is motivated by the occurrence of new situations. E.g., a
gas turbine might operate at full load and change to a reduced load level. After
a transient period, the turbine reaches the new load level. Such processes take
minutes, whereas the controller is designed to perform an action every couple of
seconds. By grouping the patterns to blocks, the sets are de-correlated and still
capture all operational situations. Choosing an appropriate block size for the
given problem is essential for good results and a reliable test set. If one simply split a data set of, e.g., two days of operation into two sets of a complete day each, both sets could have significantly different statistical properties and a set might easily miss some operational situations.
The random block validation algorithm (Alg. 29.1) addresses these problems.
The algorithm uses a data file containing M observations, i.e., M time slices
of data vectors. First the size of a block j is determined randomly within predefined limits min_blocksize and max_blocksize. Next, j patterns are generated from subsequent time slices of observations. Each pattern is unfolded over its entire time range. In other words, if the topology defines p past time slices and f future
Fig. 29.3. An RNN modeling explicit state transitions depending on an applied action: s_t × a_t → s_{t+1}. Note that all hidden clusters as well as all output clusters are connected to a bias (not shown in the figure).
Fig. 29.4. Patterns generated with the block validation procedure. All blocks are de-correlated to provide reliable, uncorrelated test sets (i.e., for validation and generalization tests).
time slices, the entire pattern covers p + f observations. Further each pattern
is scaled (see Sec. 29.3.3). The process is repeated until the block is filled or
one runs out of observations. Finally the type (training or test set) of a block is
determined according to predefined probabilities. A complete block is added to
the final pattern set. The resulting generated patterns are organized as shown
in Fig. 29.4. The resulting data set fulfills our requirements: the pattern sets (training set and test set) are de-correlated but have similar properties, since the different operational situations are distributed over all sets.
29.3.5 Removal of Invalid Data Patterns
In all applications with training data unfolded in time the problem of invalid
patterns can occur. An invalid data pattern contains data that does not fit into
a predefined time grid. E.g., every time the gas turbine is restarted, the stream
of data is restarted as well and causes a gap. Any pattern that is generated
and contains data from the previous shutdown, probably a couple of hours ago,
and the new start-up would be invalid. It does not describe the behavior of the
turbine and should therefore be removed from the used patterns.
A valid pattern contains, for each step unfolded in time, inputs that match
the defined time grid. E.g., if a time grid of t = τ is selected, a valid pattern unfolds over n equidistant steps t = τ, 2τ, ..., nτ, where n defines the number of
unfolded steps of the pattern. Any pattern that has gaps in the time line should
be excluded to avoid invalid training patterns. This problem occurs in many real
world applications but also for episodic benchmark problems such as a cart-pole
or the acrobot [26].
In practice this task can be solved by simply extending the block validation
algorithm (Alg. 29.1). Before the subroutine generate(t_m, ..., t_{m+n}) is called, each data vector is tested to match the defined time grid. All data vectors violating the grid are ignored.
Algorithm 29.1. Random block validation
Input: set of observations O = {t_m = (z_m, a_m, r_m) | m = 1, ..., M}
Result: block validated set of patterns for training and testing
m := 1
while m ≤ M do
    j := rand(min_blocksize, max_blocksize)      ▷ determine the block size within given limits
    i := 0
    while (i < j) ∧ (m + n) ≤ M do               ▷ generate a block of j patterns, where each pattern covers n time slices; break if insufficient data is available
        pattern := generate(t_m, ..., t_{m+n})   ▷ generate a valid pattern using the selected scaling
        block.add(pattern)
        i := i + 1
        m := m + 1
    end while
    type := rand(p_training, p_test)             ▷ choose the type of the new block according to the specified probabilities p for training and test set
    patternSet.add(block, type)                  ▷ add valid block to final pattern set
    m := m + n                                   ▷ skip data to avoid overlapping patterns; n depicts the network's unfolding in time
end while
return patternSet
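The following Python sketch follows the spirit of Alg. 29.1 and additionally applies the time-grid check of Sec. 29.3.5; all function and parameter names are our own, and the scaling step of Sec. 29.3.3 is omitted for brevity:

import random

def random_block_validation(times, data, n, min_blocksize, max_blocksize,
                            p_training=0.7, tau=1.0):
    """Split a time-ordered data set into de-correlated training and test blocks.

    times: list of time stamps, data: list of data vectors (z, a, r),
    n: unfolding length of the network (a pattern covers n + 1 time slices),
    tau: the required spacing of the time grid (Sec. 29.3.5).
    """
    training, test = [], []
    m, M = 0, len(data)
    while m + n < M:
        j = random.randint(min_blocksize, max_blocksize)  # block size within limits
        block, i = [], 0
        while i < j and m + n < M:
            window = range(m, m + n + 1)
            # accept only patterns whose time slices lie on the defined grid
            if all(abs((times[k + 1] - times[k]) - tau) < 1e-9 for k in window[:-1]):
                block.append([data[k] for k in window])   # one unfolded pattern
                i += 1
            m += 1
        # assign the complete block to either the training or the test set
        target = training if random.random() < p_training else test
        target.extend(block)
        m += n  # skip data to avoid overlap between consecutive blocks
    return training, test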
29.3.6 Learning Settings
We have had good experience with standard backpropagation on the described neural network topologies. This effectively realizes backpropagation through time for a fixed horizon, i.e., over all past and future time slices covered by the neural network. Robust learning is achieved by small, randomly chosen batches of patterns with a batch size B of, e.g., B = 1/1000 · N, where N is the number of all patterns. The weight updates Δw are calculated as

$$\Delta w = -\eta \frac{1}{B} \sum_{i=1}^{B} \frac{\partial E_i}{\partial w}, \qquad (29.5)$$

where η is the learning rate and E_i is the error for pattern i. The learning rate is chosen between 0.01 and 0.001 in most cases.
In many applications the learning process can be sped up significantly by an extension to standard backpropagation called VarioEta [18]. VarioEta scales the weight updates individually by the standard deviation over all patterns

$$\sigma_w = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{\partial E_i}{\partial w} - D_w \right)^2}, \qquad (29.6)$$

where

$$D_w = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial E_i}{\partial w} \qquad (29.7)$$

leading to the VarioEta update rule

$$\Delta w = -\eta \frac{1}{B} \sum_{i=1}^{B} \frac{\partial E_i}{\partial w} \cdot \frac{1}{\sigma_w}. \qquad (29.8)$$
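A minimal numpy sketch of the VarioEta-scaled batch update of Eqs. 29.5-29.8, assuming the per-pattern gradients ∂E_i/∂w are already available as the rows of a matrix (in practice they would be collected during backpropagation; all names are illustrative):

import numpy as np

def vario_eta_update(grads_all, batch_idx, eta=0.01):
    """VarioEta weight update (Eqs. 29.5-29.8).

    grads_all: array of shape (N, n_weights) with dE_i/dw for every pattern i,
    batch_idx: indices of the randomly chosen mini-batch of size B.
    """
    mean_grad = grads_all.mean(axis=0)                                      # D_w, Eq. 29.7
    sigma_w = np.sqrt(((grads_all - mean_grad) ** 2).mean(axis=0)) + 1e-12  # Eq. 29.6
    batch_grad = grads_all[batch_idx].mean(axis=0)                          # (1/B) sum of dE_i/dw
    return -eta * batch_grad / sigma_w                                      # Eq. 29.8

# toy usage: 1000 patterns, 20 weights, a small illustrative batch
rng = np.random.default_rng(0)
grads = rng.normal(size=(1000, 20))
delta_w = vario_eta_update(grads, rng.choice(1000, size=10, replace=False))
print(delta_w.shape)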
29.3.7 Double Rest Learning
The training process of RNNs with shared weights has proven empirically to be
reliable and robust. Nonetheless our goal is to get the best training result without
wasting time on unnecessary network training. For this purpose we developed
the double rest learning procedure (Alg. 29.2).
In principle the same network is trained multiple times on a constant data set but with different random initializations of the initial neural weights. If no better network was found after multiple trials of training, the process terminates. A set of patterns that contains a training set and at least one test set used for validation is required. The algorithm is initialized with a maximum number of trials, t_max, and
Algorithm 29.2. Double rest learning
Input: maximal epochs per training e_max, maximal training trials t_max, rest epochs e_rest, rest trials t_rest
Result: trained neural network
t := 0
t_total := 0
v := ∞
while t < t_rest ∧ t_total < t_max do
    net := restLearning(e_rest, e_max)
    v_t := net.min()
    if v_t < v then
        v := v_t
        bestNN := net
        t := 0
    end if
    t := t + 1
    t_total := t_total + 1
end while
return bestNN
the number of trials one wants to try to find a better solution, t_rest. For the rest learning algorithm (Alg. 29.3), the maximum number of epochs allowed for training a single network, e_max, and the number of epochs a single network is trained beyond the epoch that achieved the best validation error, e_rest, are required.
Double rest learning starts training single networks using the rest learning algorithm (Alg. 29.3). Whenever a new network is found to have a better validation error than a previous one, the number of rested trials t is reset to 0. The algorithm terminates as soon as no better network was found for t_rest trials or the maximum number of trials t_max is exceeded.
The rest learning process for a single network follows a similar approach (Alg. 29.3). The algorithm allows us to select the best network according to a validation set. A single network is trained for e epochs. If a better solution is found within e_rest, the epoch i with the best validation error is determined. The best set of weights, bestNN, from epoch i is stored, and the network is trained for another e_rest − i epochs. If the validation error did not improve for e_rest epochs or the maximum number of training epochs e_max is exceeded, the training process terminates.
Algorithm 29.3. Rest learning
Input: maximal epochs per training e_max, number of rest epochs e_rest
Result: trained neural network
t := 0
v_min := ∞
t_total := 0
e := e_rest
while t < e_rest ∧ t_total < e_max do
    net.learn(e)              ▷ train the network for e epochs
    t_total := t_total + e
    v := net.min()            ▷ retrieve the minimal validation error
    i := net.index(v)         ▷ retrieve the epoch resulting in the minimal validation error
    if v < v_min then
        v_min := v
        bestNN := net.weights(i)   ▷ retrieve the network with the weights resulting in the minimal validation error
        t := i
        e := e_rest
    else
        t := t + i
        e := e_rest − i
    end if
end while
return bestNN
Double rest learning can be applied to any neural network training where
one wants to select the best network from multiple trials using a robust stopping
criterion. It is especially useful when the meta-parameters of the training process
are subject to optimization and are modified at each trial³. Another advantage
³ In case of identical meta-parameters for all trials, it is adequate (and even more efficient) to use a fixed number of trials, i.e., t_max = t_rest.
is the parametrization of the algorithm. Even though there are four parameters, they can be assumed to be constant. For most applications we use: e_rest = 50, e_max = 5000, t_rest = 10, and t_max = ∞. This parametrization has shown robust results for a wide range of applications and 500 to 50000 training patterns.
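A compact sketch of the outer loop of double rest learning (Alg. 29.2); it assumes a generic train_with_early_stopping routine that performs the inner rest learning of Alg. 29.3 and returns a freshly initialized, trained network together with its best validation error (all names are placeholders):

def double_rest_learning(train_with_early_stopping, t_rest=10, t_max=float("inf")):
    """Repeat training with new random weight initializations and keep the best network.

    Stops after t_rest trials without improvement or after t_max trials in total.
    """
    best_net, best_error = None, float("inf")
    trials_since_improvement, total_trials = 0, 0
    while trials_since_improvement < t_rest and total_trials < t_max:
        net, val_error = train_with_early_stopping()   # inner rest learning (Alg. 29.3)
        if val_error < best_error:
            best_net, best_error = net, val_error
            trials_since_improvement = 0               # reset the rest counter on improvement
        trials_since_improvement += 1
        total_trials += 1
    return best_net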
29.3.8 A Recipe to Generate an Efficient State Estimation Function
After training an RNN, the neural network is truncated to receive a function of the form:

$$s_t = f(z_t, z_{t-1}, \ldots, z_{t-n}, a_t, a_{t-1}, \ldots, a_{t-n}), \qquad (29.9)$$
where n denotes the length of the considered history of the network. The optimal length n can be found by successively reducing the length of the past network until the error of the forecast, i.e., for the outputs y_0, y_1, ..., y_m, increases. An instantaneously increasing error either indicates a perfect choice of n or an insufficient history. Another training with an increased past horizon ensures that all prior required information is provided to the neural network.
The topology of the resulting network is shown in Fig. 29.5. This function can be used to transform observations into tuples of {s, a, s'} and apply a reinforcement learning method of choice to solve the control problem.
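Reusing the recurrence of Eq. 29.3, the truncated recall topology of Eq. 29.9 can be sketched as follows: only the past window of observations and actions is rolled forward, and the final hidden activation is returned as the state estimate s_0 (the parameter names mirror the earlier sketch and are placeholders):

import numpy as np

def estimate_state(z_hist, a_hist, A, B, C, c):
    """Return s_t = f(z_t, ..., z_{t-n}, a_t, ..., a_{t-n}) as in Eq. 29.9.

    z_hist, a_hist: arrays of shape (n + 1, dim), ordered from oldest to newest.
    """
    s = np.zeros(A.shape[0])
    for z_t, a_t in zip(z_hist, a_hist):
        s = np.tanh(A @ s + c + B @ z_t + C @ a_t)
    return s  # Markovian state estimate for the most recent time step

Tuples {s, a, s'} for a reinforcement learning algorithm are then obtained by applying this function to two consecutive windows of the recorded data.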
Fig. 29.5. A finalized state estimator after training. The former hidden unit s_0 is now used as the output of the function and provides a Markovian state representation for a given sequence of observations and actions. Note that all hidden clusters are connected to a bias (not shown in the figure).
29.3.9 Application of a Neural State Estimator
To demonstrate the capabilities of an RNN as state estimator, we chose the cart-pole problem, which has been extensively studied in control and reinforcement learning theory. For more than 30 years it has served as a benchmark for new ideas, because it is easy to understand and also quite representative of related questions. The classical problem has been completely solved in the past. E.g., Sutton [26] demonstrated that the pole can be balanced for an arbitrary number of time steps within a remarkably short training sequence. There are two major directions to make the cart-pole problem more challenging. One is to make the task itself more difficult, for example by using two poles [8] or a two-dimensional cart [7]. The other one is to make the original problem partially observable [17, 1, 8].
We will focus on the latter, since all other variants provide a Markovian state rep-
resentation in the first place, and therefore do not require state estimation. A re-
inforcement learning solution to the original (simulated) benchmark problem, as
well as to a real cart-pole system can be found in chapter 30 [23].
Problem Description. The classical cart-pole problem consists of a cart mov-
ing on a bounded track and trying to balance a pole on its top. This cart-pole
system is illustrated in Fig. 29.6 [17].
Fig. 29.6. The cart-pole system (cart position x, pole angle θ, applied force F)
The system is fully defined through four variables (t = 1, ..., T):

$$\begin{aligned}
x_t &:= \text{horizontal cart position}\\
\dot{x}_t &:= \text{horizontal velocity of the cart}\\
\theta_t &:= \text{angle between pole and vertical}\\
\dot{\theta}_t &:= \text{angular velocity of the pole}
\end{aligned} \qquad (29.10)$$
The goal is to balance the pole for a preferably long sequence of time steps without moving out of the limits. Possible actions are to push the cart left or right with a constant force F. The pole tilts when its angle θ_t is higher than 12 degrees. Either then or when the cart hits one of the boundaries, the system is punished with a negative reinforcement signal. In all other cases the reward is zero.
As already mentioned, the system has been extensively studied in its several forms. When the system was studied as partially observable, one usually omitted the two velocities ẋ_t and θ̇_t, i.e., only the cart's position and the angle between the pole and the vertical were given as inputs [17, 1, 8]. Solving this problem is not difficult because the model or algorithm just needs the memory of one past time step to calculate the missing information.
To demonstrate the advantages of RNNs (unfolded in time), only the horizontal position of the cart, x_t, is observable [29]. All other information is unknown to the system.
Model Description. To solve the problem described above, an RNN (Sec. 29.3) was used to develop the full dynamics of the cart-pole system. Input z_t and target y_t^d consist of the horizontal cart position x_t as well as the preprocessing transformation x_t − x_{t−1}. The input a_t contains the agent's action. No other information is observable by the model. The internal state space s_t is limited to four neurons, allowing the network to reconstruct the complete but only partially observable dynamics (Eq. 29.10) in its internal state space. The network is unfolded ten time steps into the past and future. Preceding experiments have shown that this memory length is sufficient to identify the dynamics. To make the network independent from the last unfolded time slice, a technique called cleaning noise is used as a start initialization [10]. The network is trained by backpropagation through time [22, 9].
In a second step the evolved state space is extracted from the RNN, i.e., a state estimation function was exported to calculate estimated states from observations (see Sec. 29.3.8). Then a generalized form of Samuel's adaptive heuristic critic (AHC) algorithm [25] was used to solve the control problem based on the state estimates. Note that the algorithm has to be started with an already filled lag structure, i.e., the past lags of the state estimator need to be filled with observations in order to provide a first state estimate. Otherwise there is a high probability that the algorithm is faced with a tilted pole in its first learning step, as a minimum of ten uncontrolled time steps would be necessary to fill all the lags.
Fig. 29.7. Correlation between the best quadratic combination of the reconstructed state space variables (s_t)_1, ..., (s_t)_4 of the RNN and the original ones (Eq. 29.10)
Results. As a first result the estimation quality of the different state variables of the cart-pole is illustrated in Fig. 29.7. The four plots show the correlation between the original state space variables of the dynamics, x_t, ẋ_t, θ_t, θ̇_t (Eq. 29.10), and the best linear combination of the reconstructed state space variables (s_t)_1, ..., (s_t)_4 and their squares (s_t)_1^2, ..., (s_t)_4^2 in each case. The high correlation for each state space variable demonstrates the reconstruction quality of the RNN. It also supports the use of RNNs for partially observable reinforcement learning problems.
We compared the results of our approach to a direct application of the AHC algorithm to the problem, i.e., without using an RNN in the first step. Note that no adaptive binning has been used. In both cases the discretization of the state space was chosen to yield the best results.
Fig. 29.8 plots the number of steps the pole could be balanced against the number of trials. The training process was stopped as soon as the first method was able to balance the pole for a minimum of 1000 steps. Fig. 29.8 shows how the RNN approach outperforms a direct application of the AHC algorithm.
Fig. 29.8. Comparison of the performance in the partially observable cart-pole problem of our RNN approach (upper curve) to a direct application of the AHC algorithm (lower curve); number of steps balanced plotted against the number of trials.
29.4 The Markov Decision Process Extraction Network
When applying reinforcement learning algorithms to real world problems such as
a gas turbine, one not only faces the problem of observations that do not provide
the Markov property but in many applications it is not even obvious which
variables to consider for the state definition. E.g., the measurements provided
by a gas turbine include information about various subsystems. In practice it is
often hard to determine if a certain variable has to be considered or a whole set
of variables from a certain subsystem can be dismissed.
Fig. 29.9. The Markov decision process extraction network (MPEN) consists of a past (left) and a future (right) subnetwork. The input variables are split into two groups: Actions a_t are controllable by the reinforcement learning agent, z_t denote observable variables from the dynamics. The future subnetwork has outputs r_t only. State transitions are modeled by s^i_t as well as s_t. Note that all weight matrices in the past (A, ..., D) are different from the future matrices (E, ..., H).
In addition to considering expert knowledge and methods for input selection, we developed an approach designed for optimal control problems, where a reward signal is available. From a reinforcement learning point of view, the performed action a_t, a possibly large set of observables z_t, as well as the reward r_t for the given tuple of observation and action are available for all steps in time t. In many real world scenarios, the reward function r_t = f_r(z_t, a_t) is known, because it describes the desired goal of the problem to be addressed and was most likely designed by us. This knowledge can be modeled in an advanced neural state estimator by splitting the network into past and future subnetworks to match the following equations:
$$s_{t+1} = f_{\mathrm{past}}(s_t, z_t, a_t), \quad t \le 0 \qquad (29.11)$$
$$s_{t+1} = f_{\mathrm{future}}(s_t, a_t), \quad t > 0. \qquad (29.12)$$
The split of the network addresses an inconsistency within the topology of the
RNN since observables z are not available for future time slices (t > 0). Another topology to address this problem would be the dynamically consistent recurrent neural network [34]. This topology, however, must predict all observables z, which can particularly cause problems when some variables are harder to predict than
others. Since the split into past and future allows us to use any target, variables not included in the input set can also be considered. This makes it possible to use the reward signal or, in case of a known reward function, the arguments of the reward function as targets:

$$r_t = g(s_t, a_t), \quad t \ge 0 \qquad (29.13)$$
Note that the current state s_t and the applied action a_t are sufficient to describe the expectation value of all relevant reward functions, because the entire information about the probability distribution of the successor state s_{t+1} must be included in these two arguments. This target allows us to change the objective
of the neural network to the desired dynamics:
$$\sum_{t=1}^{T} (r_t - r_t^d)^2 \to \min_{f,g} \qquad (29.14)$$
The resulting topology allows the neural network to accumulate all information
required for the Markov property from the provided past observations in the
past network, while the future network remodels the state transitions.
It is also possible to encode prior knowledge about the distribution of relevant information over the considered past horizon by adapting the topology of the past network accordingly. A possible modification could be to add additional inputs directly to the hidden cluster s_0, for instance if they are known to be constant over the considered past horizon, e.g., the ambient temperature, pressure, and humidity for the combustion process of a gas turbine. Since these variables are constant over the considered unfolded past network, they can directly be fed to the cluster s_0. The only constraint for any input of past information is that it has to be passed through the hidden cluster s_0, which is ultimately used as the output of the Markovian state representation. Therefore, possibilities for topologies of the past network are countless.
The representation of these findings leads to the Markov decision process extraction network (MPEN) shown in Fig. 29.9. The subnetwork on the left (past network) has no explicit target and provides information to the neural branch on the right (future network). The two subnetworks are connected via an arbitrary neural structure, e.g., a weight matrix or a multi-layer perceptron. The future network uses the information provided by s_0 as well as the future actions to learn a system's dynamics capable of predicting the sequence of future rewards. Note that future actions are important inputs, preventing the network from predicting the sequence of actions which induced the state transitions. This is important because action selection can be based on external information which is not included in the set of observables, or might even be unpredictable due to random exploration. See Sec. 29.3.1 for more details regarding this issue.
As proven in [27], an RNN can be used to approximate an MDP by predict-
ing all expected future successor states based on a history of observations. The
structure of an RNN forces each hidden cluster s to encode all necessary infor-
mation to estimate a successor state with respect to the influence of an action.
For this reason an RNN must be capable of estimating the expected rewards for each future state, because a reward function can only use a state, action, and the resulting successor state as arguments. Therefore, it is sufficient to model a dynamical system predicting the reward for all future time slices. Based on this conclusion, the approach is designed to model the minimal required dynamics of the regarded system [4].
The major advantage of the introduced neural topology over other RNN based
state estimators is the capability to model a minimal dynamics from a set of
observables without manual selection of variables. A further advantage is the
capability to extract a minimal state space. Networks that need to forecast all
observables such as the dynamically consistent RNN [34] encode all information
into the state space and are therefore not minimal. This is of special interest if the
set of observables contains unimportant variables that are possibly difficult to
predict and therefore cause a drop in forecast performance. Additionally, such
variables interfere with the training process, since the validation error can be
highly influenced by these variables. In other words, such variables produce the largest residuals, causing the training result of the entire neural network to be less robust.
29.4.1 Reward Function Design Influences the Performance of a
State Estimator
In contrast to the standard RNN, the difficulty of training the neural topology depends on the reinforcement learning problem itself, since the definition of the reward function provides the target. We could show that for episodic problems
like a cart-pole it is possible to train the network with the standard reward
function that gives some positive or negative feedback at the end of trajectories
[4]. However, the problem becomes much easier to solve when more information
about the quality of the current state is provided. This is reflected in additional
gradient information speeding up the learning process. For most real world ap-
plications, such a reward function can easily be provided since we usually participate in its design. In case the reward function is known, its arguments are usually preferred as targets.
29.4.2 Choosing the Forecast Horizon of a State Estimator
When designing a neural state estimator network, the question of how many future steps should be included in the prediction arises immediately. Since the networks are unfolded in time towards the past and the future for a limited number of steps, a practical answer to that problem has to be found. Fortunately, the definition of the reinforcement learning problem itself provides an answer. Using a discount factor γ to define the return limits the significant horizon of future time steps. The return, defining the performance of an agent's strategy, is defined by:

$$\sum_{t=1}^{\infty} r_t \gamma^t. \qquad (29.15)$$
In practice one can safely limit the number of future time steps accordingly,
since the impact of the rewards decreases exponentially.
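For example, a practical cut-off can be derived directly from the discount factor: choose the smallest horizon h for which the discount weight γ^h drops below a tolerance ε (the value of ε is our choice, not one from the chapter).

import math

def forecast_horizon(gamma, eps=0.01):
    """Smallest h with gamma**h < eps, i.e., rewards beyond h carry less than a
    fraction eps of the weight of an immediate reward."""
    return math.ceil(math.log(eps) / math.log(gamma))

print(forecast_horizon(0.9))    # 44
print(forecast_horizon(0.99))   # 459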
29.5 The Trick of Addressing Long Term Dependencies
Most state estimation approaches rely on Takens’s theorem [33] which states
that a sufficient number of past time slices contain all information necessary to
Fig. 29.10. The Markov decision process extraction network with shortcuts (MPEN-S) consists of a past (left) and a future (right) subnetwork. The input variables are split into two groups: Actions a_t are controllable by the reinforcement learning agent, z_t denote observable variables from the dynamics. The future subnetwork has outputs r_t. State transitions are modeled by s^i_t as well as s_t. Note that all weight matrices in the past (A, ..., E) are different from the future matrices (G, ..., L). All hidden clusters as well as all output clusters are connected to a bias (not shown in the figure).
estimate a Markovian state. Recurrent state estimators show remarkable perfor-
mance in various applications such as state estimation for gas turbine control
[28], but despite the advantages of RNNs there have been concerns regarding
their capability to model long-term dependencies [3]. In a system exhibiting long-
term dependencies the system's output at time T is dependent on the input at time t ≪ T [14]. The problem was discovered by Mozer [16], who found RNNs to be unable to capture global effects in classical music. The main reason for this effect is vanishing gradients in gradient-based learning methods [6, 14]. Long-
term dependencies occur in many time series ranging from technical systems to
financial data. To overcome this issue the Markov decision process extraction
network with shortcuts (MPEN-S) [5] is introduced. It is an extension of the
previously introduced MPEN topology (Sec. 29.4), where additional shortcuts
connect clusters across a fixed number of time slices (Fig. 29.10). Examples are
shortcuts of length n in a network with p steps in the past and f steps into the future that connect s_{−n} → s_0, s_{−(n+1)} → s_{−1}, ..., s_{−p} → s_{−p+n}. In the future part, shortcuts of the same length are added from s_1 → s_{n+1}, s_2 → s_{n+2}, ..., s_{f−n} → s_f.
The resulting topology is successfully used for state estimation problems that
face the problem of delayed observables or actions that show a delayed effect
on a dynamics. For instance, gas turbine emission measurements are delayed
by about one to two minutes, whereas the combustion dynamics of the turbine
occurs almost instantaneously. Both effects are influenced by the same action applied to the turbine; therefore the underlying state contains information about both the highly delayed and the short-term effect.
After presenting a recipe to find a good shortcut length in Sec. 29.5.1, we
present experiments to demonstrate the capabilities of the MPEN-S topology.
Experiments include a gas turbine simulation with highly delayed effects of action
on the dynamics.
29.5.1 A Recipe to Find a Good Shortcut Length
For selecting an adequate shortcut length n, the following heuristic can be con-
sidered: The severity of the vanishing-gradient problem is correlated with the
number of steps information has to be forwarded within the network. There-
fore, the value n is chosen to minimize the total number of steps information has to travel from any state in the past to the current state s_0, i.e.,

$$\sum_{i=1}^{p} \mathit{steps}(s_0, s_{-i}, n) \to \min_n,$$

where steps(s_0, s_{−i}, n) gives the minimum number of steps to travel from s_{−i} to s_0, including possible shortcuts. E.g., if n = 2, steps(s_0, s_{−1}) = 1, steps(s_0, s_{−2}) = 1, steps(s_0, s_{−3}) = 2, steps(s_0, s_{−4}) = 2. The only information required for this heuristic is the maximum number of past time slices that are assumed to influence the current state. Fig. 29.11 shows results from experiments with different shortcut lengths indicating that the heuristic leads to reasonable results.
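The heuristic can be evaluated mechanically. The sketch below computes steps(s_0, s_{-i}, n) in closed form, using the fact that a path from s_{-i} to s_0 combines shortcut hops of length n with ordinary hops of length one, and then picks the shortcut length with the minimal sum (our implementation, not code from the chapter):

def steps(i, n):
    """Minimum number of transitions from s_{-i} to s_0 when, in addition to the
    regular one-step connections, shortcuts spanning n time slices are available."""
    return i // n + i % n if n else i

def best_shortcut_length(p, candidates):
    """Shortcut length minimizing the sum of steps(s_0, s_{-i}, n) for i = 1..p."""
    totals = {n: sum(steps(i, n) for i in range(1, p + 1)) for n in candidates}
    return min(totals, key=totals.get), totals

n_best, totals = best_shortcut_length(p=20, candidates=range(2, 19))
print(n_best, totals[n_best])   # for p = 20 the minimum lies around n = 4..6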
29.5.2 Experiments on Long Term Dependencies Problems
To demonstrate the capabilities of the MPEN-S two benchmarks are used, a
sequence of random numbers as well as a gas turbine simulation. We compare
the MPEN-S with shortcuts of length n = 4 to the MPEN. The shortcut length n = 4 was chosen according to a heuristic (Sec. 29.5.1). For each architecture and benchmark, 10 networks are trained by online backpropagation with a learning rate of 0.001 on 10,000 observations, of which 30% are used for validation. Evaluation is based on another set of the same size but without noise. To judge the quality of the state estimate, represented by the activation values in s_0, we test the content of information within s_0. To do so we feed the estimated states as inputs to a feed forward neural network (two hidden layers with 150 neurons
Fig. 29.11. Results from experiments with different shortcut lengths and a past of p = 20. (a) shows the sum Σ_{i=1}^{p} steps(s_0, s_{−i}, n) of a network without shortcuts and of networks with shortcuts of different lengths (n ∈ {4, 5, 6} minimizes the sum). (b) shows the validation errors of these networks for a random numbers experiment (Sec. 29.5.2). The correlation between the sum of steps and the validation error is obvious.
each) whose targets are the true Markovian states of the benchmark applica-
tions. In the best case, the estimated states include all relevant information and
consequently allow a perfect mapping to the true Markovian states, i.e., the
correlation between the target and output is 1.
Random Numbers Experiment. In a first experiment, sequences of equally distributed random numbers x_i ∈ [0, 1] are used. The network with a past and a future horizon of i steps receives a sequence x_t, x_{t+1}, ..., x_{t+i} as inputs in the past part of the network. The sequence is also used as targets for the future part of the network, introducing a delay between input and corresponding output. This way, the network has to output an input given at time step t at time step t + i to minimize its error. In addition, equally distributed noise e ∈ [−0.05, 0.05] is added to the target values for training. The goal of the state estimation for the random numbers problem is to encode information about the past i random numbers.
Gas Turbine Simulation. To demonstrate the capabilities of the presented
approach on a problem similar to the real-world application of our interest, a gas
turbine simulation is used. The simulation provides a controllable variable pilot
affecting the emissions and humming of the turbine. The pilot fraction is one of
the most important fuel fractions regarding the combustion control. In order to
keep the simulation as simple as possible, the combustion tuning is reduced to
this variable. The goal of a strategy is to minimize emissions and avoid critical
humming, which is reflected in the reward function. While humming reacts in-
stantaneously, the emissions, like in real world, have a long delay and a defined
blur over several steps in time. Each step, the simulator provides observations
about its current state. The only additional information for the state estimation
model is the maximal expected delay d of the simulation. The goal of the state
estimator is to encode all relevant information about the past d steps.
Results. Fig. 29.12 illustrates the correlation of the estimated and true Marko-
vian states of the random numbers experiment ((a) and (b)) and the results of
the gas turbine simulation experiment ((c) and (d)). Both benchmark results
indicate that the MPEN approach is capable of estimating the Markovian states
well for small delays. For longer dependencies however, the approximation quality
drops significantly, while the MPEN-S can maintain its performance. Table 29.1
shows the average correlation and validation errors of all state variables. The
numbers show that the MPEN-S outperforms the MPEN both in reconstruc-
tion quality of the Markovian state (resulting in a better correlation) as well
as raw forecast performance (lower validation error). The validation error is a
good indicator for estimation quality, which is especially relevant for real-world
applications.
Fig. 29.12. Average correlations of true and estimated Markovian state, plotted over the variable index (panels: (a) MPEN and (b) MPEN-S for the random numbers problem, (c) MPEN and (d) MPEN-S for the gas turbine simulation). A higher variable index indicates a variable closer to the present. (a) and (b) show the estimation quality for the random numbers problem with a delay of 60 steps. (c) and (d) illustrate the estimation quality for the gas turbine simulation with a delay of 30 steps.
Table 29.1. Comparison of the MPEN and the MPEN-S in terms of validation error (MSE) as well as correlation between true and estimated Markovian state

architecture  experiment      delay  correlation  validation error
MPEN          random numbers  60     0.4          0.9
MPEN          gas turbine     30     0.988        0.0160
MPEN-S        random numbers  60     0.99         0.012
MPEN-S        gas turbine     30     0.995        0.0058
29.6 Conclusion
In this chapter we introduced methods for neural state estimation, particularly
for reinforcement learning applications. However, the methods can also be ap-
plied to other applications that require a Markovian state representation. Fur-
ther, a set of practical recipes to overcome problems especially relevant to real
world applications was presented.
Sec. 29.3 describes the usage of recurrent neural networks (RNNs) as state estimators. Thereafter, a series of recipes to improve the applicability and to overcome possible pitfalls is introduced. A partially observable cart-pole is used to demonstrate the approach in Sec. 29.3.9. In Sec. 29.4, the idea of state estimation based on standard RNNs is further developed towards the Markov decision process extraction network (MPEN). This topology is dynamically consistent and is capable of autonomously modeling a minimal Markov decision process (MDP)
from a large number of observables. Finally, long-term effects are addressed by
the Markov decision process extraction network with shortcuts (MPEN-S) in
Sec. 29.5. This topology is especially relevant to real world applications where
different effects can occur on various time scales. The capabilities of the MPEN-S
are demonstrated using benchmarks, including a gas turbine simulation. Results
on the benchmark applications indicate a significant improvement over previous
approaches. This is reflected in a smaller validation error of the forecast as well
as an improved estimation quality, especially for state variables dependent on
highly delayed observables. Another important conclusion to draw from the ex-
periments is the correlation between the validation error and estimation quality.
This information is of high value, since in any real-world application one can
only rely on the measure of the validation error.
Acknowledgment. Part of this work has been funded by the Federal German
Ministry for Education and Research under the grant ALICE, 01 IB10003 A-C.
References
[1] Bakker, B.: Reinforcement Learning with Long Short-Term Memory. In: Becker,
S., Dietterich, T.G., Ghahramani, Z. (eds.) Advances in Neural Information Pro-
cessing Systems, pp. 1475–1482. MIT Press (2002)
[2] Bellman, R.E.: Dynamic Programming. Princeton University Press (1957)
[3] Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gra-
dient descent is difficult. IEEE Transactions on Neural Networks 5(2), 157–166
(1994)
[4] Duell, S., Hans, A., Udluft, S.: The Markov Decision Process Extraction Network.
In: Proc. of the 18th European Symposium on Artificial Neural Networks (2010)
[5] Duell, S., Weichbrodt, L., Hans, A., Udluft, S.: Recurrent Neural State Estima-
tion in Domains with Long-Term Dependencies. In: Proc. of the 20th European
Symposium on Artificial Neural Networks (2012)
[6] Frasconi, P., Gori, M., Soda, G.: Local feedback multilayered networks. Neural
Computation 4(1), 120–130 (1992)
[7] Gomez, F., Miikkulainen, R.: 2-D Balancing with Recurrent Evolutionary Net-
works. In: Proceedings of the International Conference on Artificial Neural Net-
works (ICANN 1998), pp. 425–430. Springer (1998)
[8] Gomez, F.: Robust Non-Linear Control through Neuroevolution. PhD thesis, De-
partement of Computer Sciences Technical Report AI-TR-03-3003 (2003)
[9] Haykin, S.: Neural networks and learning machines, vol. 3. Prentice-Hall (2009)
[10] Haykin, S., Principe, J., Sejnowski, T., McWhirter, J.: New directions in statistical
signal processing: from systems to brain. MIT Press (2007)
[11] Kolen, J.F., Kremer, S.C.: A field guide to dynamical recurrent networks. IEEE
Press (2001)
[12] Kaelbling, L.P., Littman, M.L., Cassandra, A.R.: Planning and acting in partially
observable stochastic domains. Artificial Intelligence 101, 99–134 (1998)
[13] Kietzmann, T.C., Riedmiller, M.: The Neuro Slot Car Racer: Reinforcement
Learning in a Real World Setting. In: Proc. of the Int. Conf. on Machine Learning
and Applications. IEEE (2009)
[14] Lin, T., Horne, B.G., Tino, P., Giles, C.L.: Learning long-term dependencies in
NARX recurrent neural networks. IEEE Transactions on Neural Networks 7(6)
(1996)
[15] Medsker, L., Jain, L.: Recurrent Neural Networks: Design and Application. Inter-
national Series on Comp. Intelligence, vol. I. CRC Press (1999)
[16] Mozer, M.C.: Induction of multiscale temporal structure. In: Advances in Neural
Information Processing Systems, vol. 4, pp. 275–282 (1992)
[17] Meuleau, N., Peshkin, L., Kim, K.-E., Kaelbling, L.P.: Learning Finite-State
Controllers for Partially Observable Environments. In: Proceedings of the Fif-
teenth International Conference on Uncertainty in Artificial Intelligence (UAI
1999), pp. 427–436 (1999)
[18] Neuneier, R., Zimmermann, H.-G.: How to Train Neural Networks. In: Orr, G.B.,
Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 373–423. Springer, Hei-
delberg (1998)
[19] Peters, J., Schaal, A.: Reinforcement learning of motor skills with policy gradients.
Neural Networks 21(4) (2008)
[20] Ramachandran, D.: Knowledge and Ignorance in Reinforcement Learning. PhD
thesis, University of Illinois (2011)
[21] Rosenstein, M.T., Barto, A.G., Si, J., Powell, W., Wunsch, D.: Supervised actor-
critic reinforcement learning. In: Handbook of Learning and Approximate Dy-
namic Programming, pp. 359–380 (2012)
[22] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-
propagating errors. Nature 323(9), 533–536 (1986)
[23] Riedmiller, M.: 10 Steps and Some Tricks to Set Up Neural Reinforcement Con-
trollers. In: Montavon, G., Orr, G.B., ller, K.-R. (eds.) NN: Tricks of the Trade,
2nd edn. LNCS, vol. 7700, pp. 735–757. Springer, Heidelberg (2012)
[24] Riedmiller, M.: Neural Fitted Q Iteration - First Experiences with a Data Efficient
Neural Reinforcement Learning Method. In: Gama, J., Camacho, R., Brazdil, P.B.,
Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328.
Springer, Heidelberg (2005)
[25] Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM
Journal of Research and Development, 210–229 (1959)
[26] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press
(1998)
[27] Schneegass, D.: Steigerung der Informationseffizienz im Reinforcement-Learning.
PhD thesis, Luebeck University (2008)
[28] Schäfer, A.M., Schneegass, D., Sterzing, V., Udluft, S.: A Neural Reinforcement
Learning Approach to Gas Turbine Control. In: Proc. of the Int. Joint Conf. on
Neural Networks (2007)
[29] Schäfer, A.M., Udluft, S.: Solving Partially Observable Reinforcement Learning
Problems with Recurrent Neural Networks. In: Workshop Proc. of the European
Conf. on Machine Learning (2005)
[30] Schneegass, D., Udluft, S., Martinetz, T.: Neural Rewards Regression for Near-
Optimal Policy Identification in Markovian and Partial Observable Environments.
In: Proc. of the European Symposium on Artificial Neural Networks, pp. 301–306
(2007)
[31] Schäfer, A.M., Udluft, S., Zimmermann, H.G.: The Recurrent Control Neural Net-
work. In: Proc. of the European Symposium on Artificial Neural Networks, pp.
319–324 (2007)
[32] Schäfer, A.M., Zimmermann, H.-G.: Recurrent Neural Networks Are Universal
Approximators. In: Kollias, S.D., Stafylopatis, A., Duch, W., Oja, E. (eds.) ICANN
2006. LNCS, vol. 4131, pp. 632–640. Springer, Heidelberg (2006)
[33] Takens, F.: Detecting strange attractors in turbulence. Dynamical Systems and
Turbulence 898, 366–381 (1981)
[34] Zimmermann, H.G., Grothmann, R., Schäfer, A.M., Tietz, C.: Identification and
Forecasting of Large Dynamical Systems by Dynamical Consistent Neural Net-
works. In: New Directions in Statistical Signal Processing: From Systems to Brain,
pp. 203–242. MIT Press (2006)
[35] Zimmermann, H.G., Neuneier, R.: Neural network architectures for the model-
ing of dynamical systems. In: Kolen, J.F., Kremer, S.C. (eds.) A Field Guide to
Dynamical Recurrent Networks, pp. 311–350. IEEE Press (2001)
[36] Zimmermann, H.G., Tietz, C., Grothmann, R.: Forecasting with Recurrent Neural
Networks: 12 Tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) NN: Tricks
of the Trade, 2nd edn. LNCS, vol. 7700, pp. 687–707. Springer, Heidelberg (2012)
30
10 Steps and Some Tricks to Set up Neural
Reinforcement Controllers
Martin Riedmiller
Machine Learning Lab
Computer Science Department
Albert-Ludwigs Universitaet Freiburg
riedmiller@informatik.uni-freiburg.de
http://ml.informatik.uni-freiburg.de
Abstract. The paper discusses the steps necessary to set up a neural
reinforcement controller for successfully solving typical (real world) con-
trol tasks. The major intention is to provide a code of practice of crucial
steps that show how to transform control task requirements into the
specification of a reinforcement learning task. Thereby, we do not neces-
sarily claim that the way we propose is the only one (this would require
a lot of empirical work, which is beyond the scope of the paper), but
wherever possible we try to provide insights why we do it the one way or
the other. Our procedure of setting up a neural reinforcement learning
system worked well for a large range of real, realistic or benchmark-style
control applications.
Keywords: Neural reinforcement learning, fitted Q, batch reinforce-
ment learning, learning control.
30.1 Overview
The paper discusses the steps necessary to set up a neural reinforcement con-
troller for successfully solving typical (real world) control tasks. The major in-
tention is to provide a code of practice containing crucial steps necessary to
transform control task specifications into the specification and parameterization
of a reinforcement learning task. Thereby, we do not necessarily claim that the
way we propose is the only one (this would require a lot of empirical work, which
is beyond the scope of the paper). But, wherever possible we try to provide in-
sights why we do it the one way or the other. In that spirit, this paper is mainly
intended to be a subjective report on how we tackle control problems by rein-
forcement learning in practice. It is not meant as a general review article and
therefore, many related and alternative methods are not mentioned here.
When faced with a real world system, typically a very large number of ways
exist to formulate it as a learning problem. This is somewhat different from
the situation usually found in reinforcement learning papers, where all the main
settings (like state description, actions, control interval) are usually given. In
the following we therefore carefully distinguish between the (real world) control
problem (which is given) and the learning problem (which we have to design). Of
course, ideally, when the learning task is solved, the resulting policy should fulfill
the original controller task. The goal of this paper is to show how we can use the
degrees of freedom in the modelling of the learning task to successfully solve the
original control task. Our procedure of setting up a neural reinforcement learning
system worked well for a large range of real, realistic or benchmark-style control
applications, e.g. [9, 24, 10, 26, 7, 19, 13, 6].
30.2 The Reinforcement Learning Framework
30.2.1 Learning in Markovian Decision Processes
The approach for learning controllers followed here tackles control problems as
discrete-time Markovian Decision Processes (MDPs). An MDP is described by
a set S of states, a set A of actions, a stochastic transition function p(s, a, s′) describing the (stochastic) system behavior, and an immediate reward or cost function c : S × A → ℝ. The goal is to find an optimal policy π* : S → A that minimizes the expected cumulated costs for each state. In particular, we allow S to be continuous, assume A to be finite, and p to be unknown to our learning system (model-free approach). Decisions are taken in regular time steps with a constant cycle time Δt.
The underlying learning principle is based on Q-learning [30], a model-free
variant of the value iteration idea from dynamic programming. The basic idea
is to iteratively learn a value function, Q, that maps state-action pairs to their
expected optimal path costs. In Q-learning, the update rule is given by
\[
Q_{k+1}(s, a) := (1 - \alpha)\, Q_k(s, a) + \alpha \bigl( c(s, a) + \gamma \min_b Q_k(s', b) \bigr).
\]
Here, s denotes the state where the transition starts, a is the action that is applied, and s′ is the resulting state. α is a learning rate that has to be decreased in the course of learning in order to fulfill the conditions of stochastic approximation, and γ is a discounting factor. It can be shown that, under mild assumptions, Q-learning converges for finite state and action spaces, as long as every state-action pair is updated infinitely often (see e.g. [3]). Then, in the limit, the optimal Q-function, Q*, is reached. The optimal policy π* can then be derived by greedily evaluating the optimal Q-function:
\[
\pi^*(s) \in \arg\min_{a \in A} Q^*(s, a).
\]
For a detailed introduction to reinforcement learning the reader is referred to
the excellent textbooks [3, 27].
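To make the notation concrete, the following is a minimal tabular sketch of this update rule (our own illustration, assuming a small finite state and action space; the transition and the cost value are made up):

import numpy as np

def q_learning_update(Q, s, a, cost, s_next, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step for the cost-minimization setting used here:
    Q[s, a] <- (1 - alpha) * Q[s, a] + alpha * (cost + gamma * min_b Q[s', b])."""
    target = cost + gamma * np.min(Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

# toy usage: 5 states, 2 actions, one observed transition (s=0, a=1) -> s'=3 with cost 0.01
Q = np.zeros((5, 2))
Q = q_learning_update(Q, s=0, a=1, cost=0.01, s_next=3)
greedy_action = int(np.argmin(Q[0]))  # the greedy policy picks the cost-minimizing action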
30.2.2 Q-Learning with Function Approximation
When dealing with large or even continuous state spaces, a tabular representation
of the Q-function comes to its limits or is simply infeasible. A standard way to
tackle this, is the use of function approximation to represent the Q-function. We
focus on neural networks in the following, but other approximation schemes (like
e.g. Gaussian processes [4], CMACs [28, 29], ...) are being used as well.
One big advantage of using neural networks is their capability to generalize to
unseen situations - a fact particularly useful in large or continuous state spaces, where one cannot expect to experience all situations during training. However, this positive feature also has a negative impact: when the standard Q-learning
rule is applied to a certain state transition, it will also influence the value at
other inputs in an unforeseeable manner.
To work against this effect, we developed a neural Q-learning framework that is based on updating batches of transitions instead of single transition updates
as in the original Q-learning rule. This approach has turned out to be an instance
of the Fitted Q Iteration family of algorithms [5], and was named ‘Neural Fitted
Q Iteration (NFQ)’ accordingly [23].
The basic idea underlying NFQ is simple but decisive: the update of the value
function is performed considering the complete set of transition experiences in-
stead of single transitions. Transitions are collected in triples of the form (s, a, s′) by interacting with the environment. Here, s is the original state, a is the chosen action, and s′ is the resulting state. The set of experiences is called the sample
set D.
The algorithm is displayed in figure 30.1. It consists of two major steps: The
generation of the training set P and the training of these patterns within a
multilayer perceptron. The input part of each training pattern consists of the
state s_l and action a_l of training experience l. The target value of each pattern is computed as suggested by the Q-learning rule: it is the sum of the transition costs c(s_l, a_l) plus the expected minimal path costs for the successor state s′_l. The latter is computed on the basis of the current estimate of the Q-function, Q_k.
Since at this point, training the Q-function can be done as batch learning of a
fixed pattern set, we can use more advanced supervised learning techniques that
converge more quickly and more reliably than ordinary gradient descent tech-
niques. In particular, NFQ uses the Rprop algorithm for fast supervised learning.
Rprop adapts its search step size based on the signs of the partial derivatives and
has proven to be very fast and yet robust with respect to the choice of its param-
eters. For a detailed description of Rprop see [17]. The training of the pattern set
is executed either for a predefined number of epochs (=complete sweeps through
the pattern set), or until the error is below a certain predefined limit. Although
simple and straightforward, training for a fixed number of epochs works well
and therefore is our standard choice. For a more detailed discussion about NFQ,
please refer to [23].
NFQ_main() {
  input: a set of transition samples D;  output: Q-value function Q_N
  k = 0
  init_MLP() → Q_0;
  Do {
    generate_pattern_set P = {(input_l, target_l), l = 1, ..., #D} where:
      input_l  = (s_l, a_l),
      target_l = c(s_l, a_l) + γ min_b Q_k(s′_l, b)
    Rprop_training(P) → Q_{k+1}
    k := k + 1
  } While (k < k_max)
}
Fig. 30.1. Main loop of NFQ. k counts the number of iterations, k_max denotes the maximum number of iterations. init_MLP() returns a multilayer perceptron with randomly initialized weights. Rprop_training(P) takes pattern set P and returns a multilayer perceptron that has been trained on P using Rprop as the supervised training method.
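Purely as an illustration of this loop, a compact Python sketch might look as follows. It uses scikit-learn's MLPRegressor as a stand-in for the Rprop-trained multilayer perceptron (scikit-learn offers no Rprop, and the sigmoid output unit discussed later is omitted); the transition format and all names are our own and not part of the chapter's software.

import numpy as np
from sklearn.neural_network import MLPRegressor

def nfq(D, actions, gamma=0.95, k_max=20):
    """NFQ-style iteration over a fixed batch D of transitions (s, a, cost, s_next)."""
    S = np.array([t[0] for t in D], dtype=float)        # states
    A = np.array([t[1] for t in D], dtype=float)        # applied actions
    C = np.array([t[2] for t in D], dtype=float)        # immediate costs
    S2 = np.array([t[3] for t in D], dtype=float)       # successor states
    X = np.column_stack([S, A])                         # pattern inputs (s_l, a_l)
    q_net = None
    for _ in range(k_max):
        if q_net is None:
            targets = C.copy()                          # first sweep: no bootstrap term yet
        else:
            q_next = np.column_stack([
                q_net.predict(np.column_stack([S2, np.full(len(S2), b)]))
                for b in actions])                      # Q_k(s'_l, b) for every action b
            targets = C + gamma * q_next.min(axis=1)    # c(s_l, a_l) + gamma * min_b Q_k(s'_l, b)
        q_net = MLPRegressor(hidden_layer_sizes=(20, 20), activation='tanh', max_iter=300)
        q_net.fit(X, targets)                           # batch training of the pattern set
    return q_net

def greedy_action(q_net, s, actions):
    """Greedy policy: choose the action with the smallest predicted path costs."""
    inputs = np.array([list(s) + [b] for b in actions], dtype=float)
    return actions[int(np.argmin(q_net.predict(inputs)))]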
30.3 Characteristics of the Control Task
In this work, we consider control scenarios of the following type. The controller
has to control a technical system or process such that finally a desired target
situation is achieved. The current situation of the system is measured by sensors.
Thus, the control goal is usually defined by making one or more sensor values
equal to externally given reference values within some tolerance bounds. To do
so, the controller has to apply an appropriate sequence of control actions. The
control system is realized as a closed-loop system, that acts at discrete time
steps. At every time-step, the sensor values are measured, and the controller
computes a control action, which is then applied to the process.
Different types of control tasks exist within this framework. An important
characterization is whether the control task has a defined termination or not. In the
first case, control terminates immediately, once a certain goal criterion has been
achieved. A typical example would be to drive a mobile robot to a certain target
location. The task is terminated, once the target location is reached.
The second case is more challenging: the control task does not end once a target condition is reached. Instead, the controller has to actively keep the system in a set of goal states that all fulfill a certain success criterion. Typically, this
criterion is given by the sensor values being equal to their target reference values
within a small tolerance band. This is a very common scenario in the control
of technical systems. A typical example would be to achieve and hold a certain
temperature in a room. From a control perspective, this latter scenario is much
more challenging (since it contains the first one as a special case).
30.4 Modeling the Learning Task
This section discusses how to model a given (real-world) control task appropri-
ately within the neural reinforcement learning framework. We discuss alterna-
tives and ‘tricks’ while always trying to stay as close as possible to the framework
proposed by the theory of dynamic programming.
30.4.1 State Information
The underlying reinforcement learning framework crucially depends on the as-
sumption of the Markov property of state transitions: the successor state is a
(probabilistic) function of the current state and action. As a consequence, state
information provided to the learning system must be ‘rich’ enough, ‘rich’ in the sense that the observed state transition does not depend on additional historical information. In a real application, we cannot necessarily expect to get the
complete state information out of the values provided by the sensors. In classical
control theory, the concept of an observer is known to deduce state information
out of the sequence of observed sensor information. However, this requires the
availability of a system model, which we assume not to have in our learning
framework. A standard way to tackle this problem is to provide historical sensor
and action information from previous time steps. Since we are learning anyhow,
we do not rely on a particular semantic interpretability of the state information.
This allows, for example, providing more information than necessary, or redundant information, to be on the safe side. As a tradeoff, state information should
be kept as small and concise as possible to support good generalization, which
will generally lead to faster learning and better control performance. In techni-
cal dynamical systems, we often use temporal differences of sensor values as an
approximation to physically meaningful values like velocity or acceleration.
Like in supervised learning, using meaningful features instead of raw sensor
values to enforce generalization is often helpful. However, also like in supervised
learning, the design of good features typically requires deep insight into the
application (here: knowledge about system behavior which in the strong sense
we assume not to have). A current research direction is to autonomously learn
meaningful state representations directly out of high-dimensional raw sensor data
(like e.g. cameras) [15, 16, 25, 2].
Summary:
state information must be designed out of sensor information and must be
rich enough to support Markov property
redundant information is not a problem, but it is preferable to keep the input
dimensionality as low as possible
state representation can be transformed into features to enforce generaliza-
tion
state information does not necessarily have a human understandable meaning
30.4.2 Actions
The original control task often allows the application of (quasi) continuous con-
trol values, typically in a certain range between a minimum and a maximum
value. While in principle methods exist to learn continuous control actions (e.g.
[11, 22]), we will here focus on providing a discrete set of control actions to our
learning system. This is the most common framework in reinforcement learning
and corresponds to the framework as presented in section 30.2.
This means, for setting up the learning system, one has to explicitly choose a
discrete set of actions within the range of potential control signals. One potential
choice is a two-action set, consisting of the minimum and maximum control
action (‘bang-bang’-control). In classical control theory, such a two-value control
policy is the basis of time-optimal controllers. Of course, oscillating back and
forth between two extreme control signals, e.g. to keep the system close to a
desired sensor output, often is not acceptable when it comes to the control of
real hardware. Therefore it is often advisable to add additional actions to the
learning set, e.g. a neutral action that does not put additional energy into the
system.
The search space of available policies increases exponentially with the number
of actions. Therefore, from the perspective of learning time, one should try to
keep the number of available actions limited. However, there is of course a trade-
off: a smaller number of actions leads to a coarser control behavior and the
learning controller might not be able to fulfill the required accuracy in control.
There is a close interplay with the length of the control interval Δt: a larger control interval might require a larger action set to achieve the same level of controllability and vice versa: the smaller the control interval, the coarser the action set may be, since more frequent interaction (and thus correction) is possible.
between temporal and structural aspects of the action set to enable more flexi-
ble control policies [22].
Trivially, a minimal requirement on the action set is that a policy must exist that transfers the system to the goal state and - in case of the non-terminal
state framework (see below, section 30.4.4) - keeps it within the goal area.
Summary:
action set should be kept small to allow fast learning
tradeoff: more actions can enhance quality or accuracy of control policy
actions must allow to reach goal states and to keep the system within goal
area in the non-terminal goal state setting.
30.4.3 Choice of Control Interval Δt
The control interval Δt denotes the time between two control interventions of the learning system. In classical control theory, controller design often assumes
that interaction happens in continuous time (like e.g. in classical PID-control).
Therefore one aims to make the control interval Δt as small as possible to approximate the assumed continuous-time scenario; otherwise, the controller will not work as expected. This is not a necessary requirement in the learning framework proposed here. Instead, the controller learns to adapt its behavior to the given control interval; therefore, also larger control intervals can be managed.
This additional degree of freedom is a big advantage, since for example a slower
controller might be realized on less expensive hardware.
As a general tradeoff, learning gets easier if the control interval is larger, since there are fewer decision points. On the other hand, the smaller the control interval, the more accurately the system can be controlled. If absolutely no prior knowledge of the system behavior is available, then Δt must be chosen empirically. A potential strategy to determine Δt is to start with a relatively large time step, which helps to learn faster, and then to refine it until the desired accuracy is achieved.
As already discussed in section 30.4.2, there is a close interplay between the available action set, the control interval Δt, and the potential accuracy in control.
Summary:
the larger the control interval Δt, the fewer decisions have to be taken to reach the goal, therefore learning is generally faster
tradeoff: a smaller control interval Δt potentially allows more accurate control and better control quality
30.4.4 The Terminal Goal State and The Non-terminal Goal State
Setting
As already discussed in section 30.3, control tasks can either terminate once
a goal criterion is met, or (virtually) continue forever. In the latter case, the
control goal is not only to reach a state fulfilling certain success criteria, but
to actively keep the system in a set of goal states, that all fulfill these success
criteria. A typical success criterion for a state could be, for example, that all
sensor values correspond to their target values within some tolerance bound. In
the following we discuss, how these two control scenarios can be modeled in our
learning framework.
For control tasks, mainly two learning scenarios are most appropriate: the
terminal goal state and the non-terminal goal state framework. From the learning
perspective, the terminal goal state setting is the simpler one. The task is to
transfer the controlled system from an initial state to a terminal goal state by
an appropriate sequence of actions. Once the goal state is reached, the episode
is stopped, and the target Q-value of the last state action pair is set to the
transition costs plus final costs of the terminal state.
This can be implemented by computing the target Q values as follows:
\[
Q(s, u) = \begin{cases} c(s, u) + \gamma \cdot \mathrm{terminal\_costs}(s') & \text{if } s' \in X^+ \\ c(s, u) + \gamma \cdot \min_b Q(s', b) & \text{else} \end{cases} \tag{30.1}
\]
where c(s, u) denotes the immediate costs of a transition (see below). Here, X⁺ denotes the set of all terminal goal states that fulfill the success criteria and, by being reached, terminate the control task. For each terminal goal state, terminal_costs() assigns the corresponding terminal costs.
On the other hand, the non-terminal goal state framework is particularly tailored to control tasks where the controller also has to actively keep the system in a set of goal states. Here, X⁺ again denotes the set of all goal states that fulfill the success criteria, but in contrast to the above framework, the control task is not terminated when one of those states is reached.
This results in the following rule for the update of the Q-values:
\[
Q(s, u) = \begin{cases} 0 + \gamma \cdot \min_b Q(s', b) & \text{if } s' \in X^+ \\ c(s, u) + \gamma \cdot \min_b Q(s', b) & \text{else} \end{cases} \tag{30.2}
\]
where as before c(s, u) denotes the immediate costs of a transition outside the
goal region (see below).
This seemingly slight modification has two important consequences: the episode is not stopped once a state in the goal area is reached, and secondly no ‘grounding’ of the Q values to a terminal value occurs. This has a nasty effect on the value function when a multilayer perceptron is used to approximate it. Due
to interpolation effects and the lack of grounding, the value function tends to
steadily increase. We will discuss this problem in further detail in section 30.5.3.
Since learning in the terminal goal state framework is usually easier, it some-
times makes sense to model a per-se non-terminal control problem as a terminal
state learning problem. The general idea is to consider goal states with a low
change-rate as pseudo terminal states. Then, the terminal goal state framework
according to equation 30.1 can be applied. During learning, the task is always
stopped, when one of these pseudo terminal states is reached. In the application
phase, the policy learned by this procedure is then applied without stopping.
The idea behind this is, that whenever the system drifts away from its goal re-
gion during application, then the controller immediately brings it back to its
goal region.
However, this method is only an approximation to the actually desired behavior. It moreover requires heuristically defining what a ‘low change-rate’ means within the particular setting. So for non-terminal control tasks we recommend using the non-terminal goal state learning framework whenever possible. We
only wanted to mention this possibility, because sometimes a control task might
be too difficult to learn within the non-terminal goal state framework. Then,
an approximation by a terminal goal state problem might constitute a practical
way to make it work.
Summary:
in general, terminal goal states make learning easier
for control applications, often the nonterminal goal state framework is ap-
propriate, since finding a control policy that also stabilizes the system within
the target region is required.
30.4.5 Choice of X⁺
The set X⁺ comprises all states which fulfill the goal criteria as defined by the control task. One typical way to define X⁺ is to denote ranges for the values of each state variable. For the state variables that we want to control, we typically define a range around their specified target value, i.e. target value ±δ, where δ > 0 denotes the allowed tolerance. For other state variables, the allowed ranges might be infinitely large, denoting that we do not care what value they have for judging membership to X⁺.
Again, there is a tradeoff: the smaller we choose X⁺, the more accurate the successfully learned final controller will be. The larger X⁺, the easier it will be to learn, but we also have to accept less accurate controllers as a result.
An important requirement from the perspective of the learning framework is that X⁺ is large enough that a policy exists with which X⁺ can be reached from every starting state. Therefore, the choice of X⁺ is highly related to the choice of the action set and the choice of the control interval Δt.
In the undiscounted (γ = 1) non-terminal goal state case, an additional requirement applies for X⁺: it must be chosen such that a policy exists that keeps the system permanently within X⁺. This policy need not be known in advance. Again, this requirement implies the interplay between the choice of X⁺, the control interval Δt, and the available action set.
Summary:
the larger X⁺, the easier it is to learn
the smaller X⁺, the more accurate the learned controller will be
30.4.6 Choice of X⁻
In many control problems, constraints on the state variables exist that must not be violated by a successful control policy. The definition of the set of undesired states, X⁻, constitutes a way to model this requirement within the proposed learning framework. In a typical setting, a state is within X⁻ whenever a constraint of the original control problem is violated. Whenever a state within X⁻ is encountered, the control episode is stopped. Below, we show the resulting computation for the target of the Q-value as an extension of the equation for the non-terminal goal state framework (equation 30.2). The application within the terminal goal state framework is straightforward.
\[
Q(s, u) = \begin{cases} 0 + \gamma \cdot \min_b Q(s', b) & \text{if } s' \in X^+ \\ c(s, u) + \gamma \cdot \mathrm{terminal\_costs}(s') & \text{if } s' \in X^- \\ c(s, u) + \gamma \cdot \min_b Q(s', b) & \text{else} \end{cases} \tag{30.3}
\]
The terminal costs for a state within X⁻ should ideally be larger than the path costs for any successful policy. When using a multilayer perceptron with a sigmoid output function, we typically use a value close to 1 as terminal costs of a state within X⁻.
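As an illustration, the case distinction of equation 30.3 can be written as a small target-computation routine; the membership tests in_x_plus and in_x_minus, the discount and the terminal costs are placeholders for the task-specific choices discussed in this chapter.

def q_target(cost, s_next, q_min_next, in_x_plus, in_x_minus, gamma=1.0, terminal_costs=1.0):
    """Target value for one transition (s, a, s') following equation 30.3; q_min_next is
    min_b Q_k(s', b) evaluated with the current estimate of the Q-function."""
    if in_x_plus(s_next):               # successor lies in the goal region X+
        return 0.0 + gamma * q_min_next
    if in_x_minus(s_next):              # successor violates a constraint: terminal failure costs
        return cost + gamma * terminal_costs
    return cost + gamma * q_min_next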
Summary:
the learning framework allows the modeling of hard constraints on state
variables
30.4.7 Choice of Immediate and Final Costs
The choice of the immediate cost function c(s, u) determines the course of the
control trajectory until the target region is reached. It is not uncommon in
reinforcement learning to make c(s, u) a function of the distance to the target
region. This has the advantage, that the immediate costs already contain a local
hint to the goal, which may help learning considerably. From the perspective
of the control task, however, one has to keep in mind, that the final controller
optimizes the path costs to the goal. Optimizing the integrated distances to
the goal might not always be the ideal realization of what is actually intended
(imagine a situation, where a policy first makes a large error but then reaches
the goal in a few time steps, instead of a policy that only makes small errors but
for a long period of time). The situation gets even more difficult, if one has to
design an immediate cost function that trades off between two or more sensor
values, that all have to finally achieve a certain target value.
We therefore prefer a cost formulation that has the advantage of a very simple and thus broadly applicable cost function:
\[
c(s, u) = \begin{cases} 0 & \text{if } s \in X^+ \\ c & \text{else} \end{cases} \tag{30.4}
\]
where c>0 is a constant value. A reasonable choice of c is that c multiplied by
the estimated number of time steps of the optimal policy should be considerably
below the maximum path costs that can be represented by the neural network
(which, when using the standard sigmoid output function, is 1).
The immediate cost function proposed above moreover has the advantage,
that the learned optimal policy has a clear interpretation: it is the minimum
time controller. As a side note: using this immediate cost function, one can also
check the ability of a learning system to learn correct value functions: within
this framework, learning can only be successful, if the learned value function is
actually meaningful, since no hint towards the goal is provided by the immediate
cost function.
The terminal cost function is simple as well: terminal costs are 0, if a terminal
goal state is reached, and 1, if a constraint is violated. Of course, these values
may depend on the potential range of the output values of the neural network.
Summary:
costs should serve the purpose of meeting the specifications of the original control task as closely as possible
immediate costs may reflect local hints to the goal to help learning but this
might not necessarily reflect the intention of the original control task
30.4.8 Discounting
In the above equations, γ with 0 ≤ γ ≤ 1 denotes a discounting parameter. Using γ < 1 may make learning the value function easier, since the horizon of the future costs considered is reduced (consider e.g. the extreme case where γ = 0: then only the immediate costs are relevant). On the other hand, choosing a discount
rate also has an influence on the resulting optimal policies: if γ<1, immediate
costs that occur later in the sequence are weighted less. One has to make sure,
that this is in accordance with the initial control task formulation. We therefore
usually prefer a formulation with no discounting, i.e. γ = 1, and therefore have to make sure that for successful learning, additional assumptions are fulfilled (e.g. the existence of proper policies, which basically means that X⁺ can be reached from every state with non-zero probability; for a detailed discussion see e.g. [1]).
Summary:
discounting requires fewer assumptions and therefore can make learning simpler and/or more robust
it must be checked, whether introducing a discounting rate for the sake of
better learning still matches the intention of the original control task.
30.4.9 Choice of X₀
In a typical setup, at the start of each episode, an initial starting state is ran-
domly drawn from the starting state set X₀. Ideally, X₀ is chosen such that it covers the whole range of initial conditions that occur in the original control task.
In tasks that on average require a large number of steps to reach the goal states, the probability of hitting the goal region by chance can be pretty low. Here, a method that we call the ‘growing-competence’ heuristic [21] might help:
First, start with initial states close to the goal area and then incrementally
increase the set of starting states until it finally covers the complete original
starting state area.
Summary:
the set of initial starting states for learning should cover the intended working
space of the original control problem
if applicable, then starting with simple states first and then increasing the
range might help to improve the learning process dramatically
30.4.10 Choice of the Maximal Episode Length N
Control episodes might take infinitely long; this is inherently the case in the non-terminal goal state framework and can also occur in the terminal goal state setting, if the policy neither finds the goal region nor crashes. Therefore,
while learning, one typically stops the episode after some predefined number
of time steps. This is called the maximal episode length N in the following.
Theoretically, within the fitted Q learning framework, N is not a critical choice.
It just denotes the number of transitions sampled in a row (this is different
from learning methods that rely on complete trajectories). Actually, N might be
as low as 2. Then, per episode only one transition sample is collected. From a
practical perspective, however, it typically makes more sense to consider longer
episodes — in particular, when the policy used to sample the transitions drives
the system closer towards the goal region and therefore allows to collect more
and more ‘interesting’ transitions.
A rough heuristic that we use is to make N two or three times as large as the
expected average time a successful controller will need to reach the target region.
If N is chosen too large, then a lot of useless information might be collected -
consider for example very long episodes that just contain cycles of ever the same
states.
Summary:
theoretically, the choice of N is not critical
practically, N can considerably influence learning behavior, since it influences
the distribution of the collected transitions.
30.5 Tricks
30.5.1 Scaling the Input Values
Like in normal supervised learning, scaling the input values is an important
preprocessing step. Various methods and according explanations are discussed
in [14]. As a standard method, in all our learning experiments, we normalize the
input values to have a mean of 0 and a standard deviation of 1.
Summary:
like in supervised learning, it is important that all input values have a similar
level
a simple scaling to mean 0 and standard deviation 1 works well in all our
learning experiments so far
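A minimal sketch of this normalization, with statistics estimated from the collected states (helper names are our own):

import numpy as np

def fit_scaler(states):
    """Per-dimension mean and standard deviation, estimated from the collected states."""
    states = np.asarray(states, dtype=float)
    mean = states.mean(axis=0)
    std = states.std(axis=0)
    std[std == 0.0] = 1.0              # guard against constant input dimensions
    return mean, std

def scale(state, mean, std):
    """Normalize a state to zero mean and unit standard deviation per dimension."""
    return (np.asarray(state, dtype=float) - mean) / std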
30.5.2 The X⁺⁺-Trick
If no explicit terminal state exists (which is the case in the nonterminal goal state
framework), the output of the neural network tends to constantly increase from
iteration to iteration. This is due to the choice of the transition costs, which are
0 (within the target region) or positive (outside the target region). Therefore,
the target value of each state-action pair is larger than or at least as large as the evaluation of its successor state. Amplified by the generalization property of
the multilayer perceptron, this leads to the tendency to ever increase the output
values of all state action pairs.
A simple but effective remedy against this effect is to actually fix the values of some state-action pairs to 0. We call the set of such states, for which we assume this to be true, X⁺⁺. This heuristic is in accordance with a correct working of the value iteration scheme, as long as 0 is the expected optimal path cost for the respective state-action pairs in X⁺⁺. Of course, usually state-action pairs for which this is true cannot be assumed to be known a priori. Therefore, in order to apply this trick, one has to rely on heuristics. One reasonable choice of X⁺⁺ are states that lie in the center of X⁺, the region of zero transition costs. The reasoning behind this is the following: if one starts at a state x at the center of X⁺, then a good control policy has a very high chance of keeping the system within X⁺ forever — which justifies assigning zero path costs to that starting state.
If X⁺⁺ is chosen too large, then for some states within X⁺⁺ the assumption of optimal path costs of 0 may be violated. As a consequence, the resulting policy most likely will not fulfill the expected property of reliably keeping the system within X⁺. On the other hand, if X⁺⁺ is too small, the chance that a state actually falls into X⁺⁺ is very low and therefore the heuristic becomes ineffective. A remedy against this is to actually force the learning system to face states in X⁺⁺. One possibility to do that is to enforce starting episodes close to X⁺⁺; another possibility is to introduce artificial state transitions, which is discussed in the context of the hint-to-goal heuristic in the next section, 30.5.3.
Summary:
the X⁺⁺ heuristic is a method to prevent the value function from steadily increasing
if applied carefully, it is in perfect accordance with the value iteration scheme
30.5.3 Artificial Training Transitions
In a certain sense, the learning process can be interpreted as spreading its knowl-
edge of the optimal value function from the goal states to the rest of the state
space. Therefore, it is crucially required to actually have a reasonable number
of state action pairs that lead to a goal state within the overall transition sam-
ple set. An obvious recipe would be to try to enforce the occurrence of such
goal states, e.g. by starting episodes close to the goal area. However, this is not
possible for all systems because they do not allow setting arbitrary initial states.
An unconventional method to cope with the situation is to add artificial state
transitions to the sample set. Then, the pattern set used for training consists of
actually collected transitions, as well as additionally added artificial transitions.
This method was first introduced as part of the hint-to-goal heuristic in our first
NFQ paper [23] and has meanwhile also been successfully applied by other re-
searchers using other function approximation schemes (e.g. Gaussian processes,
[4]). The idea of the hint-to-goal heuristic is to introduce artificial state transitions that start in X⁺⁺ and end in X⁺⁺. Those states have — by definition of X⁺⁺ — terminal costs of 0. As a consequence, the value function is ‘clamped’
to zero at these input patterns. Supported by the generalization ability of the
function approximation, the neighboring states will also tend to have a low and
thus attractive value.
If for the artificially introduced state action pairs the optimal path costs
are actually zero, the hint-to-goal heuristic will not negatively interfere with
the correct working of the value iteration process. An obvious choice therefore
is a state-action pair where the state is well embedded in X⁺ such that the
assumption of optimal path costs of 0 is most likely fulfilled. We usually generate
such an artificial state action pair by combining such a state with every action
in the action set.
The number of artificial patterns should be chosen such that a ‘reasonable
balance’ between experience of success and regular state transitions exists (as a
rule of thumb, something between 1:100 and 1:10). This is of course a number,
that has to be determined empirically. We are currently working on methods
that automatically find such a balance, but this is ongoing work and beyond the
scope of this paper.
Summary:
the hint-to-goal heuristic might help to establish a goal region in the value
function, if real experiences of success are difficult to achieve during regular
learning
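A minimal sketch of how such artificial patterns could be appended to the pattern set; the chosen goal state, the action set, and the number of copies are placeholders that have to be set per task:

def add_hint_to_goal_patterns(inputs, targets, goal_state, actions, n_copies=100):
    """Append artificial patterns that clamp the Q-value of a state well inside X+ to 0.
    One pattern per action, each repeated n_copies times to balance the pattern set."""
    for _ in range(n_copies):
        for a in actions:
            inputs.append(list(goal_state) + [a])   # network input: state plus action
            targets.append(0.0)                     # assumed optimal path costs inside X+
    return inputs, targets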
30.5.4 Growing Batch
The fitted Q iteration framework originally works with a fixed set of transitions.
No particular assumption is made about how these transitions are collected. In the extreme case, these experiences are randomly sampled all over the working space of the controller in advance. In a practical setting, however, this is not always feasible. One reason is that sampling arbitrarily all over the working space is not realizable, since initial states cannot be set arbitrarily. Another reason is that sampling transitions equally over the working space might just be infeasible due to the huge amount of data required to cover the complete space.
Therefore it is desirable to concentrate on regions of the state space that are
relevant for the final controller. One method to realize this is the growing batch
method. The idea is that one starts with an empty transition set. After the first episode, the value function is updated and the new episode is controlled by exploiting the new value function. Different variants exist, e.g. the value function can be updated only after n episodes, or the number k_max of NFQ iterations between two episodes can be varied. In most of our experiments so far we successfully used this growing batch procedure with the choice of n = k_max = 1.
Summary:
the growing batch method aims at collecting more and more relevant transi-
tions when the performance of the policy increases.
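A minimal sketch of this interleaving with n = k_max = 1; run_episode and train_q are placeholders for the data collection and for one NFQ iteration as in figure 30.1:

def growing_batch(run_episode, train_q, n_episodes):
    """Growing batch: collect one episode with the current policy, append it to the
    transition set D, then perform one NFQ iteration (n = k_max = 1)."""
    D = []                             # growing set of transitions (s, a, cost, s_next)
    q_net = None                       # no value function before the first episode
    for _ in range(n_episodes):
        D.extend(run_episode(q_net))   # acts (epsilon-)greedily w.r.t. q_net, randomly if None
        q_net = train_q(D, q_net)      # one NFQ iteration: regenerate targets, retrain network
    return q_net, D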
30.5.5 Training the Neural Q-Function
To represent the value function, we use a neural multilayer perceptron. Although
it is often believed that setting up this kind of network is a black art and its parameters are hard to find, we found that this is not particularly critical in
the proposed neural fitted Q framework. One crucial point however is to use a
powerful training algorithm to train the weights. The Rprop learning algorithm
combines the advantage of fast learning and uncritical parameter choice [17].
We always use Rprop with its standard parameters. Also we found, that the
number of epochs (sweeps through the training set) is not particularly critical.
We therefore always train for 300 epochs and get good results. One can also
think of ways to monitor the training error and find some stopping criterion
to make this more flexible (e.g. to adapt to different network sizes, to different
pattern set sizes, etc.), but for the applications we had so far, we found this a
minor issue for learning success.
Surprisingly, the same robustness is observed for the choice of the neural net-
work size and structure. In our experience, a multilayer perceptron with 2 hidden
layers and 20 neurons per layer works well over a wide range of applications. We
use the tanh activation function for the hidden neurons and the standard sigmoid
function at the output neuron. The latter restricts the output range of estimated path costs to between 0 and 1, and the choice of the immediate costs and terminal costs has to be made accordingly. This means, in a typical setting, terminal
goal costs are 0, terminal failure costs are 1 and immediate costs are usually set
to a small value, e.g. c =0.01. The latter is done with the consideration, that
the expected maximum episode length times the transition costs should be well
below 1 to distinguish successful trajectories from failures.
As a general impression, the success of learning depends much more on the
proper setting of other parameters of the learning framework. The neural network
and its training procedure work very robustly over a wide range of choices.
Summary:
choice of multilayer perceptron is rather uncritical
important to have a powerful learning algorithm to adjust the weights
it is an advantage if the supervised learning algorithm is not particularly dependent on the choice of its parameters.
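For illustration, the network shape described above could be set up as follows in PyTorch (our choice of library; the chapter trains with Rprop, so the optimizer would have to be chosen separately):

import torch.nn as nn

def make_q_network(n_inputs):
    """Q-network as described above: two hidden layers with 20 tanh units each and a
    sigmoid output unit, restricting the predicted path costs to the range (0, 1)."""
    return nn.Sequential(
        nn.Linear(n_inputs, 20), nn.Tanh(),
        nn.Linear(20, 20), nn.Tanh(),
        nn.Linear(20, 1), nn.Sigmoid(),
    )

q_net = make_q_network(n_inputs=6)     # e.g. cart-pole: 5 state variables plus 1 action value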
30.5.6 Exploration
In reinforcement learning, exploration — the deviation from a greedy exploitation of the current value function — is important to explore the state space.
Various suggestions for good exploration strategies have been proposed, e.g.
considering a safe control behavior in the learning phase [12]. From our expe-
rience with NFQ, a simple ε-greedy exploration scheme is often sufficient. This
means that in every time step, with a certain probability (e.g. 0.1), the action is
chosen randomly instead of greedily exploiting the value function.
In many application cases, we also observe good results even with no explicit exploration at all. This is due to the fact that the learning process itself (the randomly initialized neural value function, the growing experience, the randomly distributed starting states) already bears a fair amount of randomness.
To learn without explicit exploration is also of practical interest. When always
acting greedily, the performance achieved in a training episode is already the
performance that the final greedy controller will show. This reduces the effort
of additional testing and therefore is particularly interesting for real world tasks.
Summary:
a simple ε-greedy exploration scheme is often sufficient
if the starting states are well distributed in the working space, then in con-
junction with the growing batch method, even an always greedy exploitation
of the current value function works in many cases
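A minimal sketch of ε-greedy action selection in the cost-minimization setting (the values and the action set are illustrative only):

import random

def epsilon_greedy(q_values, actions, epsilon=0.1):
    """With probability epsilon pick a random action; otherwise exploit greedily, i.e.
    pick the action with the smallest expected path costs."""
    if random.random() < epsilon:
        return random.choice(actions)
    return min(actions, key=lambda a: q_values[a])

# illustrative usage: Q-values per action for the current state; epsilon=0 gives pure greedy
action = epsilon_greedy({-12: 0.4, 0: 0.3, 12: 0.5}, actions=[-12, 0, 12])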
30.5.7 Delays
In practical systems, delays play a crucial role. Delays may occur both on the sensor side - i.e. a sensor value is available only n time steps later - and on the actor side - a control action takes effect only some time steps later. Simply neglect-
ing these effects typically leads to bad control behavior or even failures. Various
methods exist, e.g. to use prediction or filter methods to synchronize the infor-
mation available to the controller with the current world situation. One simple
but effective method is to augment state information with historical information
about previous actions applied to the system [31, 20].
Summary:
in practice, actuator and sensor delays are often not negligible
a simple remedy is to add historical information about previous action values
to the current state information
30.6 Experiments
30.6.1 The Control Task
The control task tackled in the following is to control a real cart pole system.
While cart-pole is a well known benchmark [27], this real world task is charac-
terized by additional challenging features:
the initial states cannot be set to arbitrary values. We moreover assume that no human intervention is allowed; in particular, the system can initially only be started with the pole hanging down
Fig. 30.2. The real cart pole system
the control task is to balance the pole upright with high accuracy and with
the cart at a given target position (here: in the middle of the track). The
controller therefore not only needs to learn to swing-up the pole from the
downright position, but also to do it in such a sophisticated manner, that
finally it can be balanced at the requested position.
one cannot directly control the force applied to the cart, but only the voltage
given to the DC motor driving the cart. This introduces additional dynamical
effects into the system.
due to communication effects, the sensor information is delayed
there is considerable noise in both actuation and sensor values.
there is a discontinuity (jump) in sensor values from −π to +π when the
pole is in the downright position.
there is a hard constraint: the position of the cart may not be less than
−0.25 m nor more than 0.25 m, since the track is bounded.
the final controller should be able to work from arbitrary initial start states,
not only from one position.
The range of control inputs is (quasi) continuous from -12 volt to 12 volt. Sensor
information provided by the system is the position of the cart and the pole; no
velocity information can be measured. The target values for the sensor values
should be reached as fast as possible. The minimum control interval allowed by
the hardware is Δt = 0.01 s.
Since on the real system we can only perform a limited number of exper-
iments, we also report some results on a reliable simulation of the real sys-
tem (section 30.6.5). The input and output interfaces are exactly the same
for both simulated and real plant. The simulation model was derived by pa-
rameterizing a physical model of the plant using real data. The accurate
match between real and simulated system behavior allowed us to keep
all modelling decisions and learning parameter settings the same for both
the simulated and the real system. Therefore in the following, we only de-
scribe the real system setup. An implementation of the simulated plant is
available within our open-source learning framework CLSquare available at
http://ml.informatik.uni-freiburg.de/research/clsquare.
30.6.2 Modeling as a Learning Task
State Description. State information provided to the learning system consists of the sensor values of position (p_t) and angle (α_t), and the normalized temporal differences of these measurements, (p_t − p_{t−1})/Δt and (α_t − α_{t−1})/Δt. The angle is zero when the pole is upright. The angular value has a discontinuity (a jump from −π to +π) when the pole is hanging down. Nothing particular is done to resolve this discontinuity. Instead, we expect the learning algorithm to be able to deal with that. To cope with the sensor delay, additionally the value of the previous control action a_{t−1} is added to the state information.
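Put together, a sketch of the resulting five-dimensional state vector (our own helper; the control interval of 0.05 s is the one chosen below):

def build_state(p_t, alpha_t, p_prev, alpha_prev, a_prev, dt=0.05):
    """Cart-pole state vector: cart position, pole angle, their normalized temporal
    differences (velocity estimates), and the previous action (to cope with delays)."""
    return [p_t,
            alpha_t,
            (p_t - p_prev) / dt,
            (alpha_t - alpha_prev) / dt,
            a_prev]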
Actions. The action set available for the learning system consists of the ‘stan-
dard’ choice of minimal and maximal control signal plus the ‘neutral’ action 0V.
Thus A = {−12V,0V, +12V }.
Control Interval Δt. As defined by the hardware, the minimum length of the control interval is 0.01 s. After some initial experiments, we found that a control interval of Δt = 0.05 s is still sufficient for an acceptable control quality while at
the same time allowing a fast and successful learning process.
Non-terminal Goal State Framework. For the cart-pole task, control must
be continued once pole angle and cart position reached their target values to
actively keep the system within the goal states. This means, that the correct
formulation of the learning problem is the non-terminal goal state setting. As a
consequence, every episode is only interrupted if the system state enters the failure set X⁻ or if the maximum number of steps per episode, N, is reached.
Choice of X⁺. A state is in X⁺ if the following two conditions are fulfilled: the cart position is at most 0.1 m away from the target position (here: the middle of the track) and the pole angle deviates from 0 rad by at most ±0.15 rad. The rest of the state entries are not considered for judging membership to X⁺.
Choice of X⁻. A state is in X⁻ if the cart position is less than −0.25 m or more than 0.25 m. This corresponds to the physical boundaries of the track. The rest of the state entries are not considered for judging membership to X⁻.
Immediate and Final Cost Functions. As immediate costs we use the stan-
dard minimum-time formulation with constant transition costs of 0.01. Thus,
\[
c(s, u) = \begin{cases} 0 & \text{if } |p_t| \leq 0.1\,\mathrm{m} \text{ and } |\alpha_t| \leq 0.15\,\mathrm{rad} \\ 0.01 & \text{else} \end{cases} \tag{30.5}
\]
When a state from X⁻ is observed, the episode is stopped and final costs of +1 are assigned.
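As an illustration, the goal and failure sets and the immediate costs of this setup can be expressed as simple predicates (assuming the target position is the middle of the track, taken as 0):

def in_x_plus(state):
    """Goal region X+: cart within 0.1 m of the target position, pole within 0.15 rad of upright."""
    p, alpha = state[0], state[1]
    return abs(p) <= 0.1 and abs(alpha) <= 0.15

def in_x_minus(state):
    """Failure region X-: the cart leaves the physical bounds of the track."""
    return state[0] < -0.25 or state[0] > 0.25

def immediate_cost(state):
    """Minimum-time costs of equation 30.5: zero inside the goal region, 0.01 otherwise."""
    return 0.0 if in_x_plus(state) else 0.01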
Episode Length. Empirically, a good episode length was found to be N = 200.
30.6.3 Applied Tricks
Scaling. We applied our standard input scaling procedure as described in 30.5.1.
Choice of X⁺⁺ and Artificial Transitions. X⁺⁺ contains only one state, namely the one in which all state variables are exactly 0. This corresponds to the center of X⁺. Of course, this state will most likely not occur by chance in the learning process. Therefore, this definition only makes sense in conjunction with adding artificial transitions in the spirit of the hint-to-goal heuristic. Here, we used 3 different artificial patterns, namely state (0,0,0,0,0) combined with all 3 actions. These transitions were repeated 100 times in the training pattern set, in order to establish some balance between the (huge) number of normal transitions and those special transitions. The target value for those artificial patterns is 0.
Growing Batch. Learning was implemented as a ‘growing batch’ process. This
means, that after every episode, one NFQ iteration (new calculation of Q-target
values, supervised learning of the neural Q function) was performed. Then the
next episode was controlled by ε-greedy exploitation of this new Q function.
Training the Neural Q Function. The neural Q function is represented
by a multilayer perceptron with 6 input neurons, two hidden layers with 20
neurons each and one output neuron. Hidden neurons use the tanh activation
function, the output neuron uses the standard sigmoid function. Rprop with
standard parameters was used for weight updates. In every NFQ iteration step,
the network weights of the learning network were randomly initialized between
-0.5 and 0.5 before training. The network was trained for 300 epochs per NFQ
iteration.
Exploration. No explicit exploration scheme was used for the experiments done
here, i.e. the current Q function was always exploited greedily to determine the
action. This has the advantage, that the application performance can already be
determined during training.
30.6.4 Measuring Quality
The quality of a learning control approach has two important aspects: the quality
of the learning process and the quality of the resulting controller. The quality of
the learning process is measured by the learning effort needed (usually measured in the number of transitions or episodes), the quality of the achieved solution with respect to the cost function used, and the reliability of the results over a number of learning trials.
The quality of the resulting controller is measured with respect to the specifi-
cation of the original control task. Relevant criteria are for example accuracy, ro-
bustness, working area, and performance measures like e.g. the time outside the
tolerated error zone. For a detailed discussion of different criteria also see [11].
Here, we first report results achieved in a realistic simulation. This allows us to
conduct a series of 10 experiments with different seeds of the random generator
in reasonable time. For learning on the real system, we used exactly the same
setup and parameters. The only difference was that the controller was
allowed to learn for a maximum of 500 episodes on the simulated cart-pole and
- due to time restrictions - for a maximum of 300 episodes on the real cart-pole
system.
30.6.5 Results on the Simulated Cart Pole
For the simulated system, all 10 runs delivered a successful controller. ‘Successful’ means that for a test set of 100 random initial starting situations, the controller was able to swing up the pole and then steadily keep the system within the desired tolerance. A test run lasted 20 s. Averaged over the 10 runs, the best controller was found after 392 episodes of training, with a standard deviation of 80. The average time needed by the best controllers was 3.23 s, with a standard deviation of 0.16 s.
Table 30.1. Results on the simulated cart-pole for the standard setup, averaged over 10 trials. Shown are the average number of episodes needed to train the best controller and its control performance, measured in time outside the target region. The number in brackets shows the respective standard deviation.

setup   | successful trials | best controller at episode | time outside of X⁺
Default | 10/10             | 392 (80.7)                 | 3.23 s (0.16 s)
30.6.6 Results on the Real Cart Pole
The evaluation on the real cart-pole system was slightly different, due to the
effort it takes to do experiments with the real device. However, the overall picture
of the learning behavior on the simulated and real system was consistent.
We performed three learning trials with different initializations of the random
generator. Each learning trial lasted 300 episodes. Besides the reduced number
of maximum episodes, the setup of the learning system was exactly the same as
for the simulated system. In all 3 trials performed, successful controllers were
learned within less than 300 episodes of training. In particular, the controllers
are very robust with respect to varying initial states or to disturbances from
outside.
A video documenting learning and final controller performance is available at
http://www.youtube.com/watch?v=Lt-KLtkDlh8
30.7 Conclusion
This paper discusses many of the basic modeling and methodological tricks to
set up a reinforcement learning task. These insights should help to successfully
handle a wide range of interesting control problems. The proposed method builds
on neural fitted Q iteration (NFQ), a method that considers the complete batch
of collected transitions to update the Q function. While the paper is written from
the perspective of using a neural network, it should also give useful insights when
using other kinds of function approximation schemes.
Current and future work is aiming to further improve the method in sev-
eral directions. One big direction is to improve NFQ with respect to resulting
controller quality (e.g. accuracy, continuous actions, interpretation of control
policies, increasing complexity of control tasks, etc). Some steps in this direction
have already been made and are discussed in [11]. Another area of ongoing and
future research is to further improve NFQ with respect to robustness and auton-
omy of the learning process. A third area is to improve efficiency with respect to
the data required for learning. Beyond that, distributed reinforcement learning algorithms that cooperatively control a complex system in a multi-agent setting are a vital research area. Distributed learning systems that are based on the neural learning framework presented here have been successfully applied in typical multi-agent scenarios like distributed job-shop scheduling [18, 8].
Acknowledgment. The author wants to especially thank Roland Hafner from
Cognit GmbH for the important initial ignition for writing this article.
References
[1] Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. I, II. Athena
Scientific, Belmont (1995)
[2] Blum, M., Springenberg, J.T., Wülfing, J., Riedmiller, M.: A Learned Feature
Descriptor for Object Recognition in RGB-D Data. In: Proceedings of the IEEE
International Conference on Robotics and Automation (ICRA), St. Paul, Min-
nesota, USA (2012)
[3] Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming. Athena Scientific,
Belmont (1996)
[4] Deisenroth, M.P., Rasmussen, C.E., Peters, J.: Gaussian Process Dynamic Pro-
gramming. Neurocomputing 72(7–9), 1508–1524 (2009)
[5] Ernst, D., Wehenkel, L., Geurts, P.: Tree-based batch mode reinforcement learn-
ing. Journal of Machine Learning Research 6, 503–556 (2005)
[6] Gabel, T., Lutz, C., Riedmiller, M.: Improved Neural Fitted Q Iteration Applied to
a Novel Computer Gaming and Learning Benchmark. In: Proceedings of the IEEE
Symposium on Approximate Dynamic Programming and Reinforcement Learning
(ADPRL 2011), Paris, France. IEEE Press (April 2011)
[7] Gabel, T., Riedmiller, M.: On Experiences in a Complex and Competitive Gaming
Domain: Reinforcement Learning Meets RoboCup. In: Proceedings of the IEEE
Symposium on Computational Intelligence and Games, Honolulu, USA (2007)
[8] Gabel, T., Riedmiller, M.: Adaptive Reactive Job-Shop Scheduling with Rein-
forcement Learning Agents. International Journal of Information Technology and
Intelligent Computing 24(4) (2008)
[9] Hafner, R., Riedmiller, M.: Reinforcement learning on an omnidirectional mobile
robot. In: Proceedings of the 2003 IEEE/RSJ International Conference on Intel-
ligent Robots and Systems (IROS 2003), Las Vegas (2003)
[10] Hafner, R., Riedmiller, M.: Neural Reinforcement Learning Controllers for a Real
Robot Application. In: Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA 2007), Rome, Italy (2007)
[11] Hafner, R., Riedmiller, M.: Reinforcement learning in feedback control. Machine
Learning 84(1–2), 137–169 (2011), doi:10.1007/s10994-011-5235-x
[12] Hans, A., Schneegass, D., Schäfer, A.M., Udluft, S.: Safe exploration for reinforce-
ment learning. In: ESANN, pp. 143–148 (2008)
[13] Kietzmann, T., Riedmiller, M.: The Neuro Slot Car Racer: Reinforcement Learn-
ing in a Real World Setting. In: Proceedings of the Int. Conference on Machine
Learning Applications (ICMLA 2009), Miami, Florida. Springer (December 2009)
[14] LeCun, Y., Bottou, L., Orr, G.B., Müller, K.-R.: Efficient BackProp. In: Orr,
G.B., Müller, K.-R. (eds.) NIPS-WS 1996. LNCS, vol. 1524, pp. 9–50. Springer,
Heidelberg (1998)
[15] Lange, S., Riedmiller, M.: Deep auto-encoder neural networks in reinforcement
learning. In: International Joint Conference on Neural Networks (IJCNN 2010),
Barcelona, Spain (2010)
[16] Lange, S., Riedmiller, M.: Deep learning of visual control policies. In: European
Symposium on Artificial Neural Networks, Computational Intelligence and Ma-
chine Learning (ESANN 2010), Brugge, Belgium (2010)
[17] Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: The RPROP algorithm. In: Ruspini, H. (ed.) Proceedings of the IEEE
International Conference on Neural Networks (ICNN), San Francisco, pp. 586–591
(1993)
[18] Riedmiller, M., Gabel, T.: Distributed Policy Search Reinforcement Learning
for Job-Shop Scheduling Tasks. TPRS International Journal of Production Re-
search 50(1) (2012); available online since May 2011
[19] Riedmiller, M., Gabel, T., Hafner, R., Lange, S.: Reinforcement Learning for Robot
Soccer. Autonomous Robots 27(1), 55–74 (2009)
[20] Riedmiller, M., Hafner, R., Lange, S., Lauer, M.: Learning to Dribble on a Real
Robot by Success and Failure. In: Proceedings of the 2008 International Conference
on Robotics and Automation (ICRA 2008), Pasadena CA. Springer (2008) (video
presentation)
[21] Riedmiller, M.: Learning to control dynamic systems. In: Trappl, R. (ed.) Pro-
ceedings of the 13th European Meeting on Cybernetics and Systems Research,
EMCSR 1996 (1996)
[22] Riedmiller, M.: Generating continuous control signals for reinforcement controllers
using dynamic output elements. In: European Symposium on Artificial Neural
Networks, ESANN 1997, Bruges (1997)
[23] Riedmiller, M.: Neural Fitted Q Iteration - First Experiences with a Data Efficient
Neural Reinforcement Learning Method. In: Gama, J., Camacho, R., Brazdil, P.B.,
Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720, pp. 317–328.
Springer, Heidelberg (2005)
[24] Riedmiller, M.: Neural reinforcement learning to swing-up and balance a real pole.
In: Proc. of the Int. Conference on Systems, Man and Cybernetics, 2005, Big
Island, USA (October 2005)
[25] Riedmiller, M., Lange, S., Voigtländer, A.: Autonomous reinforcement learning on
raw visual input data in a real world application. In: Proceedings of the Interna-
tional Joint Conference on Neural Networks, Brisbane, Australia (2012)
[26] Riedmiller, M., Montemerlo, M., Dahlkamp, H.: Learning to Drive in 20 Minutes.
In: Proceedings of the FBIT 2007 Conference, Jeju, Korea. Springer (2007) (Best
Paper Award)
[27] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
[28] Sutton, R.S.: Generalization in reinforcement learning: Successful examples using
sparse coarse coding. In: Touretzky, D.S., Mozer, M.C., Hasselmo, M.E. (eds.)
Advances in Neural Information Processing Systems, vol. 8, pp. 1038–1044. MIT
Press, Cambridge (1996)
[29] Timmer, S., Riedmiller, M.: Fitted Q Iteration with CMACs. In: Proceedings of
the IEEE International Symposium on Approximate Dynamic Programming and
Reinforcement Learning (ADPRL 2007), Honolulu, USA (2007)
[30] Watkins, C.J.: Learning from Delayed Rewards. PhD thesis, Cambridge University
(1989)
[31] Walsh, T.J., Nouri, A., Li, L., Littman, M.L.: Planning and Learning in Environ-
ments with Delayed Feedback. In: Kok, J.N., Koronacki, J., Lopez de Mantaras,
R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI),
vol. 4701, pp. 442–453. Springer, Heidelberg (2007)
Author Index
Andersen, Lars Nonboe 111
Back, Andrew 295
Bengio, Yoshua 437
Bottou, Léon 9, 419
Burns, Ian 295
Caruana, Rich 163
Cireşan, Dan Claudiu 581
Coates, Adam 561
Collobert, Ronan 537, 639
Denker, John S. 235
Duell, Siegmund 709
Farabet, Clément 537
Finke, Michael 311
Flake, Gary William 143
Fritsch, Jürgen 311
Gambardella, Luca Maria 581
Giles, C. Lee 295
Grothmann, Ralph 687
Hansen, Lars Kai 111
Hinton, Geoffrey E. 599
Hirzinger, Gerd 191
Horn, David 131
Intrator, Nathan 131
Kavukcuoglu, Koray 537
Larsen, Jan 111
Lawrence, Steve 295
LeCun, Yann A. 9, 235
Lukoševičius, Mantas 657
Lyon, Richard F. 271
Martens, James 479
Meier, Ueli 581
Mobahi, Hossein 639
Montavon, Grégoire 421, 559, 621, 659
Moody, John 339
Müller, Klaus-Robert 1, 7, 9, 49, 139,
231, 343, 421, 559, 621, 659
Naftaly, Ury 131
Neuneier, Ralph 369
Ng, Andrew Y. 561
Orr, Genevieve B. 1, 7, 9, 49, 139,
231, 343
Plate, Tony 91, 225
Prechelt, Lutz 53
Ratle, Frédéric 639
Riedmiller, Martin 735
Rögnvaldsson, Thorsteinn S. 69
Schmidhuber, Jürgen 581
Schraudolph, Nicol N. 205
Simard, Patrice Y. 235
Sterzing, Volkmar 709
Sutskever, Ilya 479
Svarer, Claus 111
Tietz, Christoph 687
Tsoi, Ah Chung 295
Udluft, Steffen 709
van der Smagt, Patrick 191
Victorri, Bernard 235
Webb, Brandyn J. 271
Weston, Jason 639
Yaeger, Larry S. 271
Zimmermann, Hans-Georg 369, 687
Subject Index
ACID, 324
ACID/HNN, 324
acoustic modeling, 318, 324
connectionist, 322
Adaline, 423
adaptive heuristic critic (AHC), 722, 723
annealed importance sampling (AIS), 632
Apple Computer
eMate, 271
MessagePad, 271
Newton, 271
architectural optimization, 401–406
asset allocation, 415
auto-encoder, 440
contractive auto-encoder, 440
denoising auto-encoder, 440
sparse auto-encoder, 453
automatic differentiation, 444
Automatic Relevance Detection, 92
automatic relevance detection (ARD), 92
autoregressive model, 301
averaging, 377
averaging rate, 432, 449
weight averaging, 432, 449, 633
backpropagation, 11, 301, 583
flatness, 196
backpropagation-decorrelation, 682
bag of features, 561, 569
Bayes rule, 312, 322
Bayesian
a posteriori probabilities, 295–297
framework, 91
hyperparameters, 91
model prior, 72
bias/variance, 71, 356
decomposition, 134
ensembles, 134
block validation, 715
Boltzmann machine, 599, 622
centering, 625
Hessian, 626
parameterization, 625
reparameterization, 626
CART, 321
centering
Boltzmann machine, 621, 625
neural network
activity, 207
error, 207
slope, 208
class
equalizing class membership, 299
frequencies, 296
probabilities, 295
classification, 120, 311
ECG, 299
error, 316
figure-of-merit, 297
handwritten characters, 274
k-nearest neighbor, 236
LVQ, 236
medical, 306
memory based, 236
modular, 312, 327
networks, 92
time series, 677
trees, 315
cleaning, 395
clearning, 395
clustering, 317
agglomerative, 324
committee
committee of neural networks, 585, 589
computational graph, 487
conditional random field, 433
conditioning, 191, 626
condition number, 27, 450, 471, 628
confusion matrix, 299, 317
conjugate gradient, 32, 304, 450, 484, 544,
553
conjugate directions, 33
conjugate gradient truncation, 510
convergence, 513
damping, 496
Fletcher-Reeves, 33
line search, 32, 512
Polak-Ribiere, 33
preconditioned, 484
trust-region methods, 507
context
context driven search, 283
geometric, 286
lexical, 284
contrastive divergence, 601, 614
persistent contrastive divergence, 625
controller
action, 710, 740
closed-loop system, 738
control interval, 735, 740
cycle time, 736
neural reinforcement controller, 735
state, 710, 739
convergence, 304
almost sure convergence, 423
convergence of conjugate gradient, 513
convergence speed, 423
convolutional neural network, 590
correspondence principle, 689
covariance, 26
covariance estimation, 390
credit assignment, 222
cross-validation, 113, 334, 357
generalized (GCV), 356
nonlinear, 357
damping, 496
conjugate gradient truncation, 510
K-means damped update, 566
line search, 512
scale-sensitive damping, 501
structural damping, 504
Tikhonov damping, 498
trust-region methods, 507
data
artificially generated, 304
autonomous vehicle navigation, 177
bilinear problem, 77
CIFAR-10, 575, 577
COIL100, 652
combustion engine, 78
data noise, 397
dataset extension, 583
Deterding’s vowel, 150, 215
distribution, 179
ECG, 299
elastic distortions, 584
German bond, 415
hill-plateau, 146
Index of Industrial Production, 344, 361
Mackey-Glass, 123
MIT-BIH Arrhythmia database, 300
MNIST, 581, 583, 633, 650
NETtalk, 181
NIST1, 247
NIST2, 247
overlapping distributions, 305
partitioning, 408
Pascal Large Scale Challenge, 432
Peterson and Barney, 120
pneumonia, 171
power load, 78
Proben1, 59
RCV1, 432
river flow, 78
STL-10, 575, 577
sunspots, 78, 132
Switchboard, 329, 334
two-spirals, 148
US Postal Service, 247
Wikipedia, 651
debugging, 463
activation statistics, 466
gradient verification, 428, 463
layer-wise analysis, 466, 629, 636
restricted Boltzmann machines, 615
verifying Gauss-Newton products, 495
decimation
of d.c. errors, 207
of linear errors, 208
decision boundary, 304
decision tree, 313, 321
phonetic, 326
deep network
deep Boltzmann machine (DBM), 623
locally connected (LC-DBM), 623
deep K-means, 574
dependency testing, 574
energy correlation, 574
embedding, 639
deformable prototype, 237
dendrogram, 326
density estimation, 384, 622
conditional, 384
dictionary, 562
dilemma
Observer-Observation Dilemma, 391,
399
Structure-Speed-Dilemma, 395
distance
Euclidean, 237
invariant, 239
tangent, 238, 242
divide & conquer, 312
Doonesbury effect, 271, 289
dynamic programming, 736
value iteration, 736
dynamical system, 712, 713
instable information flow, 702
open dynamical system, 714
risk, 703
stable information flow, 702
uncertainty, 703
early stopping, 73, 406, 448
in MTL nets, 181
echo state network, 659
effective number of parameters, 357
effective degrees of freedom, 94
efficiency
of stopping criteria, 58
embedding, 468, 639
deep embedding, 644
ISOMAP, 642
Laplacian eigenmaps, 642
multidimensional scaling, 642
neighbors graph, 646
siamese networks, 642
symbol embedding, 469
temporal embedding, 646
empirical, 421
empirical fisher diagonal, 523
ensemble
averaging, 131
forecast, 349
error, 383, 388
approximation error, 192, 425
cross-entropy, 92, 228, 296
cross-validation, 113
error-in-variables, 396
estimate, 55
estimation error, 425
evolution, 55
idealized, 54
realistic, 55
generalization, 53
generalized least square estimation, 387
increase, 57
Kullback-Leibler, 325
Likelihood
log, 384
scaled, 322
LnCosh, 383
local minima, 55
mean squared sensitivity error, 300
metrics, 175
normalized root mean square error, 661
optimization error, 425
prediction error, 355
final (FPE), 356
generalized (GPE), 357
squared (PSE), 356
residual pattern, 192
root mean square error, 660
roundoff error, 225
training, 53
validation, 53
word error rate, 334
error bar estimation, 388–390
estimation
conditional density, 384
covariance, 390
density, 384
error bar, 388–390
generalization error, 55
robust, 383–388
with CDEN, 384
with LnCosh, 383
evidence, 91
expected risk, 421
factoring posteriors, 323
false positive rate, 299
feature selection, 180
feed-forward network, 480
Hessian, 194
Jacobian, 194
linearly augmented, 191–202
feedback, 680
output feedback connections, 680
finite differences, 428
flat spots, 228
focus of attention, 177
forecasting, 687, 713, 720, 729, 730
function approximation, 709
GARCH, 387
Gauss-Newton matrix, 489
generalized, 489
multiplication, 492
generalization, 10, 56, 92, 113, 329, 334,
639
reuse, 468
generalized Gauss-Newton
estimating the diagonal, 524
Gibbs sampling, 601, 623
alternating Gibbs sampling, 601, 623
gradient descent, 422, 442
convergence, 23
divergence, 24
regularization parameters, 116
graphics processing unit (GPU), 466, 542,
556, 582, 587
convolution kernel, 591
kernel, 591
thread, 591
handwriting recognition, 246, 312
architecture, 275
online, 271
Hessian, 25, 194, 446
backpropagating diagonal Hessian, 36
Boltzmann machine, 626
conditioning, 221
eigenvalue spread, 38
eigenvalues, 38
exact multiplication by the Hessian, 487
maximum eigenvalue, 27, 42
online computation, 43
power method, 42
Taylor expansion, 42
minimum Eigenvalue, 22
product of Hessian and vector, 37
shape, 40
singular, 195–196
square Jacobian approximation, 34, 35
Hessian-free optimization, 479, 483
damping, 496
debugging, 495
parallelism, 494
heteroscedasticity, 384, 387
hidden Markov models, 319
hidden units, 450, 610
number of, 181, 302, 450
hierarchical, 313
classifier, 333, 336
features, 574
hierarchy of neural networks, 323, 324,
327, 329
hint-to-goal heuristic, 747
hyperparameters, 91, 347, 446
automated grid search, 458
coordinate descent, 458
grid search, 456
layer-wise optimization, 459
manual search, 456
multi-resolution search, 458
semi-supervised learning, 645
ill-conditioned, 191
ill-posed problem, 70
image recognition, 569
information divergence, 324
initialization
conjugate gradient, 517
K-means, 565
input
contrast normalization, 564
force, 370
momentum, 370
normalization, 16, 30
decorrelation, 17
equalization, 17
zero mean, 16
preprocessing, 374, 563, 714, 716
representation, 274
scaling, 714
sphering, 32
squared inputs, 374
symbolic, 468
whitening, 32, 564
invariance transformations, 263
diagonal hyperbolic transformation, 265
elastic distortions, 584
parallel hyperbolic transformation, 264
rotation, 264
scaling, 264
thickening, 265
X-translation, 263
Y-translation, 264
Jacobian, 34, 35, 194, 441
K-means, 423, 561, 562, 566
encoding, 571
hard assignment, 571
hyperparameters, 570
patch size, 570
pooling, 571
receptive fields, 570
initialization, 565
receptive field, 572
sparsity, 566
spherical K-means, 562
training procedure, 566
Kahan summation, 674
kernel principal component analysis, 630
Krylov subspace, 484
Kullback-Leibler, 325
labeled data, 572
large-scale learning, 424
trade-offs, 426
Lasso, 423
LBFGS, 544, 553
leaky integration, 661, 667
learning, 391
batch, 13, 544
advantages, 14
BFGS, 34
bold driver, 213
condition, 191–202
conjugate gradient, 32, 148, 193, 225,
230
curriculum learning, 470
FORCE learning, 682
Gauss-Newton, 34
gradient descent, 296
hierarchical, 180
hill-climbing, 299
hints, 179
K-means, 561, 562
Levenberg-Marquardt, 35, 193
life-long, 163
multi-relational, 468
multitask, 164, 468
Newton, 31, 148
online learning, 13, 441, 679
Perceptron, 28
Quasi-Newton, 34, 193, 392
rankprop, 172
rate, 20, 185
adaptation, 21
annealing, 14
momentum, 21
Rprop, 59, 737
second order, 31, 193–194, 225, 230
semi-supervised learning, 642
single task, 164
stochastic, 13
advantages, 13
stochastic diagonal Levenberg Mar-
quardt, 40
stopping criterion, 59
tangent propagation, 256
variable metric, 193
vario-η, 212, 392, 717
learning rate, 332, 429, 430, 432, 447, 606,
717
adaptive, 333, 448, 472
annealing, 281
individual, 40
maximal, 25
optimal, 25
Levenberg-Marquardt heuristic, 506
Lie group, 259
linear models, 429
linearly augmented feed-forward network,
191–202
equivalence with feed-forward network,
198
universal approximation, 198
liquid state machine, 659
local minimum, 191–202
local versus global features, 143, 155, 159
long-range dependencies, 713
long-term dependencies, 482, 691, 727, 730
loss function
convex, 489
non-convex, 493
second derivative, 493
Lua, 539
associative arrays, 539
closures, 540
performance, 541
LVCSR, 318, 319, 322
M-estimators, 386
macroeconomic forecasting, 343
Markov decision process, 710, 736
extraction network (MPEN), 725, 727,
729
with shortcuts, 727, 729
partially observable (POMDP), 711,
723
policy, 736
Markov property, 711, 714, 723, 725, 729
Markov-chain Monte Carlo methods, 109
Markovian state, 727
mean-variance approach, 389
medical
diagnosis, 171
risk prediction, 171
memory allocation, 545
mini-batch, 443, 544, 603, 717
Hessian-free optimization, 526
size of the mini-batch, 448
missing values, 617
mixture
densities, 320, 324
of Gaussians, 323, 385
model
complexity, 329
interpretation, 349
selection, 316, 348, 353, 446
architecture, 354
input, 354
momentum, 449, 607
monitoring
errors, 428
overfitting, 605
progress of learning, 604, 615, 630, 632
reconstruction error, 605
Moore-Penrose pseudoinverse, 675
MTL, 164
multi-core machines, 466
network information criterion (NIC), 357
neural network
architecture, 371–382
bottleneck network, 372
capacity, 181
correspondence principle, 689
efficient implementation, 538
expert, 324
feed-forward, 121, 123
interaction layer, 375
internal preprocessing, 371
linear, 28
multi-layer perceptron, 28, 313, 329, 581
neural fitted Q-iteration, 737
neural reinforcement controller, 735
radial basis function, 374
size, 181
vanilla, 346
noise
Gaussian, 383, 391
Laplacian, 383
parameter noise, 393
noise reduction, 221
noise/nonstationarity tradeoff, 345
non-linearity, 454
hard-limiting, 471
softsigm, 471
nonlinear calibration, 307
outliers, 372, 383, 386
information shocks, 383
output representation, 176
overfitting, 53, 55, 91
mini-batch overfitting, 529
overshooting, 714
partition function, 600, 622
ratio of partition functions, 631
pattern generation, 680
penalty
factors, 91
implicit, 393
implied curvature, 400
Perceptron, 423
performance
criteria, 299
measures, 299
polyphones, 321
portfolio management, 389
positive predictivity, 299
power method, 42
preconditioning, 427
conjugate gradient, 519
preconditioner design, 521
predicting
posteriors, 313
preprocessing, 370–371, 456, 563
by bottleneck network, 372
by diagonal connector, 371
price-performance ratio, 56
prior, 72
bias reduction, 289
knowledge, 242, 317
probabilities, 295
scaling, 296
probabilistic sampling, 298
pronunciation graph, 319
pruning, 359, 401–406
Early-Brain-Damage, 402
input variables, 348, 359
Instability-Pruning, 405
Inverse-Kurtosis, 403
of nodes, 401
of weights, 401
sensitivity-based (SBP), 359
Stochastic-Pruning, 401
Python, 541
Q-learning, 736
artificial training transitions, 747
cost function, 744
discounting, 736, 745
growing batch, 748
maximal episode length, 745
neural fitted Q-iteration, 737
neural Q-function, 749
policy, 736
X++ trick, 746
quantized weights, 282
R operator, 445, 487
radial basis function network, 146, 155
random seeds, 455
recurrent neural network, 313, 482, 659,
687, 710, 712, 720, 721, 724
causal-retro-causal (CRCNN), 699
connectivity, 698
dimensionality, 698
dynamically consistent, 724
echo state network, 659
error correction (ECNN), 693
historically consistent (HCNN), 695
initial state problem, 692
liquid state machine, 659
memory, 698
overshooting, 691, 713
shortcut connections, 208
sparsity, 698
unfolding in time, 713
variant-invariant separation, 694
regression, 676
least mean squares, 679
linear regression, 672
recursive least squares, 679
ridge regression, 672
weighted least squares, 677
regularization, 54, 71, 91, 111, 114, 181,
241, 349, 639, 672
adaptive, 115
choice of, 115
for recurrent networks, 351
smoothing, 350
Tikhonov regularization, 672
reinforcement learning, 709, 712, 735
cart-pole problem, 750
exploration, 749
neural reinforcement controller, 735
representational capacity, 304
reservoir, 660
parameters, 663
size, 663
sparsity, 664
spectral radius, 665
rest learning, 719
double rest learning, 718
restricted Boltzmann machine, 599
discrimination, 615
risk, 390
robust estimation, 383–388
with CDEN, 384
with LnCosh, 383
robust learning, 717
scaling, 296
scripting language, 538
search
for hyperparameters, 92
second order gradient descent, 422, 472,
479, 544
segmentation
character, 273
word, 287
semi-supervised learning, 642
deep embedding, 643
label propagation, 642
LapSVM, 643
sensitivity, 299
average absolute gradient, 360
average gradient, 360
delta output, 362
evolution, 362
output gradient, 362
pruning, 359
RMS gradient, 360
separation of structure and noise, 399
shuffle test, 167
shuffling, 427
SIMD operations, 547
NEON, 547
SSE, 542, 547, 582
softmax, 228, 331
software
BLAS, 466, 541, 542, 546, 550, 551
CUDA, 542, 545, 549, 556
EBLearn, 552
JanusRTk, 329
Just In Time compiler, 539
LAPACK, 541, 546
Lush, 542
Numpy, 543, 550
OpenMP, 545, 548, 556
SENN, 416
SN, 542
SWIG, 541
Theano, 445, 550, 552
Torch5, 552
Torch7, 537, 542
source separation, 568
sparsity, 430
feature sparsity, 453
input sparsity, 467
K-means, 566
sparse coding, 562
sparsity of activation, 453, 609
weight sparsity, 303
speech recognition, 296, 312, 318
connectionist, 312
statistical, 314, 318
system, 330
squared inputs, 143
stability
structural instability, 382
state estimation, 709, 714, 720, 724, 727,
729
state tying, 322
STL, 164
stochastic gradient descent, 332, 422, 442,
544, 553
averaged SGD, 431
stopping criterion, 56, 719
early stopping, 406
effectiveness, 62
efficiency, 56, 58, 62
final stopping, 413
late stopping, 407
predicate, 56
robustness, 58, 62
rules for selecting, 57
threshold
for stopping training, 56
time versus error tradeoff, 63
tradeoff
time versus error, 55, 58
variance, 60
SVM, 423, 432
symbolic differentiation, 445
Takens’s theorem, 726
tangent propagation, 253
targets
embeddings, 380
forces, 380
random, 378
teacher forcing, 680
architectural teacher forcing, 697, 700
temporal pattern recognition, 678
time series
linear models, 345
noise, 345
nonlinearity, 345
nonstationarity, 345
prediction, 123, 133, 176
Torch7, 537, 550, 552
CUDA, 550
neural networks, 543
packages, 544
Tensor, 542, 544
TH, 551
training
procedure, 406–414
transfer, 163
inductive, 163
sequential vs. parallel, 179
tricks
backpropagating diagonal Hessian, 36
backpropagating second derivatives, 36
choice of targets, 19
computing Hessian information, 35
data warping, 279
derivative roundoff error, 225
ensemble with different initial weights,
131
error emphasis, 281
frequency balancing, 298
Hessian
finite difference, 35
maximum eigenvalue, 42
minimum Eigenvalue, 22
square Jacobian approximation, 35
hyperparameters, 91
initializing the weights, 20
multitask learning, 164
negative training, 278
nonlinear cross-validation, 357
normalizing output errors, 276
post scaling, 298
prior frequency balancing, 280
prior scaling, 296
roundoff error, 225
sensitivity-based pruning, 359
shuffling examples, 15
sigmoid, 17
squared input units, 143
tangent distance, 248
elastic tangent distance, 249
hierarchy of distances, 251
smoothing, 249
speeding, 251
tangent propagation, 259
training big nets, 312
variance reduction with ensembles, 131
weight decay parameter estimate, 75, 77
triphone models, 321
unequal misclassification costs, 295
unit
binomial, 613
Gaussian, 612
logistic, 226
radial basis function, 22
rectified linear, 613
sigmoid, 17
softmax, 611
squared input, 143
tanh, 229
user adaptation, 291
vanishing gradient, 728
variable metric, 193
VC dimension, 241
vector quantization, 562
ventricular contraction
premature ventricular contraction, 301
supraventricular contraction, 301
visualization, 349, 361, 465
filters, 465, 636
Hinton diagram, 465
learning trajectory, 466
local tangent vectors, 465
sampling, 465
t-SNE, 466
weight
updates
magnitude of, 298
weight decay, 73, 118, 121, 123, 303, 451,
608, 673
L1 regularization, 452
L2 regularization, 452
parameter estimate, 74, 75
simulation, 79
weight sharing, 468, 713, 718
weights initialization, 454, 606
fan-in, 454
well determined parameters, 94