During my PhD, in 1985, I proposed and published (in French), an early version of the learning algorithm known as error backpropagation [LeCun, 1985] [LeCun, 1986]. Although a special form of my algorithm was equivalent to backpropagation it was based on backpropagating "virtual target values" for the hidden units rather than gradients. I had actually proposed the algorithm during my Diplome d'Etudes Approfondies work in 1983/1984 but had no theoretical justification for it back then. The final form of the backpropagation algorithm has had a profound impact on the fields of Machine Learning, and was the primary cause for the surge of interest in Neural Networks in the mid 80's. Subsequently, I refined the backpropagation algorithm, and proposed a consistent theoretical framework for its analysis. I applied backpropagation to handwriting recognition, medical diagnosis of abdominal pains, and intron/exon splicing site detection in DNA sequences [LeCun, 1987]. For the medical diagnosis application, I designed a network with specific groups of hidden units to represent "syndromes". Connections were established from symptoms (inputs) to syndroms, and from syndromes to diagnoses (outputs) if there was a known causal relationship between them (an early form of graphical models with logistic nodes, I guess). I also applied multilayer nets to associative data retrieval [LeCun and Fogelman-Soulie, 1987] [Gallinari et al., 1987] [Fogelman-Soulié et al., 1987c] [Fogelman-Soulié et al., 1987b] [Fogelman-Soulié et al., 1987a]
At the end of my PhD, my friend Leon Bottou and I started writing a software system called SN to experiment with machine learning and neural networks. It was built around a home-grown Lisp interpreter that eventually morphed into the Lush language. I completed this software while I was a Research Associate in Geoff Hinton's group at the University of Toronto. [LeCun and Bottou, 1988] [Bottou and LeCun, 1988] and has been using its various embodiments ever since.
Learning Theory, Learning AlgorithmsDuring my PostDoc in Toronto, I extended and published my thesis work on the theoretical foundation of backpropagation, and showed that back-propagation and several of its generalizations could be derived rigorously using Lagrange functions [LeCun, 1988], reprinted in [LeCun, 1992].
I also briefly worked on a new "perturbative" learning algorithm called GEMINI [LeCun, Gallant, and Hinton, 1989e].
With various colleagues, I published a theoretical analysis of learning curves [Solla, LeCun, 1991]; proposed a precursor of a now common Bayesian methods for producing class probability estimates from the output of a neural network [Denker, LeCun 1991]; and experimented with a practical boosting method for training and combining the outputs of multiple pattern recognition systems [Drucker et al. 1994].
I also worked on a method for estimating the effective capacity (or effective VC-dimension) of a learning machine by fitting the training error and testing error curves [Vapnik, Levin, LeCun 1994].
One of my long-time interests is the inherently subjective nature of the notions of information, probability, complexity, and entropy (including the definitions of those used in physics). John Denker and I wrote a short paper on the subject [Denker, LeCun, 1993].
The Dynamics of (fast) Neural Net Training
There are many hacks in the literature to accelerate the convergence of learning algorithms for multilayer neural nets. There have been many claims that second-order methods ought to converge faster than plain gradient descent. Unfortunately, most second order methods found in the classical non-linear optimization litterature are practically useless for neural-net training because they are all designed to work in "batch" mode (with accurate gradients and line searches). Neural net training is known to be considerably faster with "on-line" learning (also called stochastic gradient) where the parameters are updated after every training sample.
Therefore, during my PhD thesis, I proposed a stochastic version of the Levenberg-Marquardt algorithm with a diagonal approximation of the Hessian. The Diagonal Hessian in neural nets was shown to be very easy to compute with a "backprop" like procedure [LeCun, 1987]. Comparisons with classical non-linear optimization algorithms have been performed [Becker and LeCun, 1989].
With Ido Kanter and Sara Solla, I published a formula for the distribution of eigenvalues of certain types of random matrices (covariance matrices of random vectors). This result was used to predict the convergence speed of adaptive filter algorithms and neural-net learning procedures [LeCun, Kanter, Solla 1991] [Kanter, LeCun, Solla 1991]. A surprising result is that Linear Least Square converges fastest when the ratio of number of samples to number of input variables is about 4.
A major headache when training neural nets is how to choose the step size (learning rate) of the learning algorithm. I proposed a simple (and well justified) method for automatically computing (and adjusting) the step size for gradient-based neural-net training algorithms. It is based on a very efficient method for estimating and tracking the largest eigenvalue of the Hessian matrix of a neural nets error surface (without computing the Hessian itself) [LeCun, Simard, Pearlmutter 1993].
A recent tutorial paper on my work on efficient neural net training can be found in [LeCun, Bottou, Orr, Muller 1998].
Learning Algorithms, Generalization, and RegularizationI proposed a "pruning" (weight elimination) methods for neural networks called Optimal Brain Damage that has become quite popular over the years [LeCun, Denker, Solla, 1990].
Harris Drucker and I worked on a regularization method called double backpropagation that improves the smoothness of the functions learned by multilayer neural nets (and therefore their robustness to small distortions of the inputs) [Drucker, LeCun 1992] [Drucker, LeCun 1991a] [Drucker, LeCun 1991b].
Later, Patrice Simard and I (with J. Denker and B. Victorri) found a regularization method called Tangent Prop that can force neural nets to become invariant (or robust) to a set of chosen transformations of the inputs. [Simard, Victorri, LeCun, Denker 1992] [Simard, LeCun, Denker, Victorri 1992]. A recent review paper on Tangent Prop (and Tangent Distance) is available [Simard, LeCun, Denker, Victorri 1998].
Invariant Recognition: Convolutional Neural Networks
Other Classification Algorithms: Tangent Distance, Boosting, SVM,....Patrice Simard, John Denker and I worked on a method called Tangent Distance for measuring similarities between shapes while being robust to small distortions and displacements of the input pattern. We applied this method to handwriting recognition [Simard, LeCun, Denker 1993] [Simard, LeCun, Denker 1994]. A recent and rather complete paper on Tangent Prop and Tangent Distance is available [Simard, LeCun, Denker, Victorri 1998].
Applications: Zipcode and Check Readers
Applications: On-line Handwriting Recognition
Another line of work, in collaboration with Isabelle Guyon, concerned handwriting recognizers for pen-based computers using the idea of "Subsampled Time-Delay Neural Network" (a special type of temporal convolutional net) to recognize the trajectory of a stylus [Guyon et al. 1991] [Guyon et al. 1990] [Guyon et al. 1992].
I briefly worked on applying similar ideas to Chinese handwriting recognition [Wu, LeCun, Jackel, Jeng 1993],
Later, Yoshua Bengio and I started a new line of work on on-line handwriting recognition based on "pictorial" representations of the written words rather than on the temporal pen trajectories. Word images enhanced with the stroke direction information were fed to convolutional nets with great success. We also found principled ways of combining neural nets with Hidden Markov Model. This method allowed us to train the entire system with a discriminative criterion at the word level. [Bengio, LeCun, Nohl, Burges 1995] [LeCun, Bengio 1994] [Bengio, LeCun 1994] [Bengio, LeCun, Henderson 1994] A few cute applications were built around this system, including a pen-based visitor registration system for corporate buildings [Matic, Henderson, LeCun, Bengio 1994].
We realized after a while that, although handwriting recognition is an interesting problem in itself, it is also a very inefficient way to communicate with a computer, except in a small number of very specific situations [Guyon, Warwick, LeCun 1996]. So we lost interest in the problem. But by working on it, we had figured out how to train a large and complex system at the word level without requiring manual segmentation. Those ideas led to the Graph Transformer Network concept and were immediately applied to recognizing bank checks.
HMM/Neural Net Hybrids, Graph Transformer Networks, Global Training of Heterogeneous Learning Machines
Document Image Compression and Digital LibrariesBetween 1996 and 2001, much of my work has been centered on the DjVu project. DjVu is an image compression technology and a software platform for distributing high-resolution scanned documents over the Web.
At the end of 1995, AT&T decided to spin off Lucent and NCR. Mosaic and the World Wide Web were only about 2 years old, and, as a book lover, I thought (naively) that digital libraries would become one of the Web's killer apps (naturally, just like with TV, the Web became the ultimate tool for entertainment and commerce more than for education and culture). I had just become head of Image Processing Research in the newly created AT&T Labs, and felt that our work on handwriting recognition had reached a point of diminishing return (particularly since NCR, the main outlet for our research, was spun off). So, I was looking for ways to apply our expertise in document understanding to Web-based digital libraries.
I soon realized that there was no good way of distributing high-resolution color images of scanned documents on the web. There was significant expertise in the Lab on compressing B&W textual images (with Paul Howard's work on JBIG-2), and continuous-tone images (with various work on JPEG, MPEG, and wavelets) [Cox et al. 1998] [Haskell et al. 1998]. I thought that it would be nice if we could segment the text from the pictures and backgrounds so as to compress them separately. DjVu was born.
DjVu can compress scanned documents at 300dpi in B&W to 5KB-30KB per page (3-8 times better than TIFF - Group IV), and color documents to 30KB-100KB per page (5-10 times better than JPEG) [Bottou et al. 2000] [Haffner et al. 1999] [Haffner, Bottou, Howard, LeCun 1999] [LeCun, Bottou, Haffner, Howard 1998] [Bottou et al. 1998] [Haffner et al. 1998] [Cox et al. 1998].
Our first research prototype was ready at the end of 1996, our first practical system at the end of 1997, and our first public release of free compressors and browser plug-ins in March 1998. Commercial versions appeared in early 1999. Version 3.0 of the software appeared in summer 2000 and is commercialized by LizardTech, who partnered with AT&T.
DjVu is now used by hundreds of website around the world to distribute documents (mostly scanned documents). I maintain several DjVu-related web sites including DjVuZone, Any2DjVu, Bib2Web, and DjVuLibre.
More recently, we developed a real-time system that can detect human faces in images and simultaneously estimate their pose. This system uses convolutional networks trained with a loss function derived from the "Energy-Based Model" concept [Osadchy, Miller, and LeCun, 2004].
Object Detection and Recognition
Energy-Based Models (EBMs) are a non-probabilistic form of Graphical Model (more precisely, a un-normalized form of Factor Graph). They associate a scalar energy to configurations of variables that are observed and variables that must be predicted.
Given the values of the observed variables, the inference process in an EBM consists in finding the value of the variables to be predicted that minimizes the energy.
Learning consists in "shaping" the energy landscape so that it has minima near the desired values of the variables to be predicted. We characterize the type of loss functions that, when minimized with respect to the trainable parameters of the system, will shape the energy landscape in suitable ways [LeCun and Huang, 2005].
We demonstrate an application to face detection and pose estimation [Osadchy, Miller, and LeCun, 2004].
Yann LeCun, Professor
The Courant Institute of Mathematical Sciences
Room 1221, 715 Broadway, New York, NY 10012, USA
Copyright © 2000-2004 Yann LeCun.