We must stop crediting the wrong people for inventions made by others.
Instead let's heed the recent call in the journal Nature:
"Let 2020 be the year in which we value those who ensure that
science is self-correcting."[SV20]

As those who know me can testify, finding and citing the original sources of scientific and technological innovations is important to me, whether they are mine or other people's [DL1][DL2][NASC1-9]. The present page is offered as a resource for members of the machine learning community who share this inclination. I also invite others to contribute additional relevant references.
In grounding research in its true intellectual foundations, I do not mean to diminish the important contributions made by others. My goal is to encourage the entire community to be more scholarly in its efforts and to recognize the foundational work that sometimes gets lost in the frenzy of modern AI and machine learning.

Here I will focus on six false and/or misleading attributions of credit to Dr. Hinton
in the press release of
the 2019 Honda Prize [HON].
For each claim there is a paragraph
(I, II, III, IV, V, VI)
labeled by "Honda," followed by a critical comment labeled "Critique."
Reusing material and references from recent
blog posts [MIR][DEC], I'll point out
that Hinton's most visible publications failed to mention essential relevant prior work -
this may explain some of Honda's misattributions.

I. Honda:
"Dr. Hinton has created a number of technologies that have enabled the broader application of AI, including the backpropagation algorithm that forms the basis of the deep learning approach to AI."

Critique: Hinton and his co-workers have made certain
significant contributions to deep learning, e.g.,
[BM][CDI][RMSP][TSNE][CAPS].
However,
the claim above is plain wrong.
He was 2nd of 3 authors of an article on backpropagation [RUM] (1985)
which failed to mention that 3 years earlier, Paul Werbos proposed to train neural networks (NNs) with this method
(1982) [BP2].
And the article [RUM] even failed to mention Seppo Linnainmaa, the inventor of this famous algorithm for credit assignment in networks [BP1] (1970), also known as "reverse mode of automatic differentiation." (In 1960, Kelley already had a precursor thereof in the field of control theory [BPA]; compare [BPB][BPC].) See also [R7].
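
For readers who want to see what the method in question actually does: backpropagation is the reverse-mode application of the chain rule to a network of differentiable nodes, propagating error signals from the output back towards the input. Below is a minimal toy sketch in NumPy (the two-layer net and all variable names are illustrative only, not taken from any of the cited papers):

```python
# Minimal sketch of reverse-mode differentiation, i.e. the chain-rule
# bookkeeping that backpropagation performs on a network of differentiable
# nodes. Toy two-layer net; illustrative only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))          # 4 inputs, 3 features
y = rng.standard_normal((4, 1))          # targets
W1, W2 = rng.standard_normal((3, 5)), rng.standard_normal((5, 1))

# Forward pass: store the intermediate values needed later.
h_pre = x @ W1
h = np.tanh(h_pre)
y_hat = h @ W2
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Reverse pass: apply the chain rule node by node, last layer first.
d_y_hat = (y_hat - y) / len(y)           # dL/dy_hat
d_W2 = h.T @ d_y_hat                     # dL/dW2
d_h = d_y_hat @ W2.T                     # dL/dh
d_h_pre = d_h * (1.0 - h ** 2)           # through the tanh nonlinearity
d_W1 = x.T @ d_h_pre                     # dL/dW1
```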

By 1985, compute had become about 1,000 times cheaper than in 1970, and desktop computers
had become accessible in some academic labs. Computational experiments then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs [RUM]. But this was essentially just an experimental analysis of a known method [BP1][BP2]. And
the authors [RUM] did not cite the prior art [DLC].
(BTW, Honda [HON] claims over 60,000 academic
references to [RUM] which seems exaggerated [R5].)
More on the
history of backpropagation
can be found at Scholarpedia [DL2] and in my award-winning survey [DL1].

The first successful method for learning useful internal representations
in hidden layers of deep nets was published two decades
before [RUM]. In 1965, Ivakhnenko & Lapa had the first general, working learning algorithm for deep multilayer perceptrons with arbitrarily many layers
(also with multiplicative gates which have become popular)
[DEEP1-2][DL1][DL2]. Ivakhnenko's paper of 1971 [DEEP2] already described a deep learning feedforward net with 8 layers,
much deeper than those of 1985 [RUM],
trained by a highly cited method which was still popular in the new millennium [DL2], especially in Eastern Europe, where much of Machine Learning was born. (Ivakhnenko did not call it an NN, but that's what it was.) Hinton has never cited this, not even in his
recent survey [DLC]. Compare [MIR] (Sec. 1) [R8].

Note that there is a misleading "history of deep learning" propagated by Hinton and co-authors, e.g., Sejnowski [S20]. It goes more or less like this: In 1958, there was "shallow learning" in NNs without hidden layers [R58]. In 1969, Minsky & Papert [M69] showed that such NNs are very limited "and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s"[S20]. However, "shallow learning" (through linear regression and the method of least squares) has actually existed since about 1800 (Gauss & Legendre [DL1][DL2]).
Ideas from the early 1960s on deeper adaptive NNs [R61][R62] did not get very far, but by 1965, deep learning worked [DEEP1-2][DL2][R8]. So the 1969 book [M69] addressed a "problem" that had already been solved for 4 years. (Maybe Minsky really did not know, but he should have known.)

II. Honda: "In 2002, he introduced a fast learning algorithm for restricted Boltzmann machines (RBM) that allowed them to learn a single layer of distributed representation without requiring any labeled data. These methods allowed deep learning to work better and they led to the current deep learning revolution."

Critique: No, Hinton's interesting
unsupervised [CDI]
pre-training for deep NNs (e.g., [UN4]) was irrelevant for the current deep learning revolution.
In 2010, our team showed
that deep feedforward NNs (FNNs)
can be trained by plain backpropagation and do not at all require unsupervised
pre-training for important applications [MLP1] -
see Sec. 2 of [DEC].
This was achieved by greatly accelerating traditional FNNs on highly parallel
graphics processing units called GPUs.
Subsequently, in the early 2010s, this type of
unsupervised pre-training was largely abandoned in commercial applications - see [MIR],
Sec. 19.

III. Honda: "In 2009, Dr. Hinton and two of his students used multilayer neural nets to make a major breakthrough in speech recognition that led directly to greatly improved speech recognition."

Critique:
This is very misleading.
See Sec. 1 of [DEC]:
The first superior end-to-end neural speech recogniser that outperformed the
state of the art was based on two methods from my lab:
(1) Long Short-Term Memory
(LSTM, 1990s-2005) [LSTM0-6]
(overcoming the famous vanishing gradient problem first analysed by my
student Sepp Hochreiter in 1991 [VAN1]); (2) Connectionist Temporal Classification [CTC] (my student Alex Graves et al., 2006). Our team successfully applied CTC-trained LSTM to speech in 2007 [LSTM4] (also with hierarchical LSTM stacks [LSTM14]). This was very different from the hybrid methods used since the late 1980s, which combined NNs with traditional approaches such as Hidden Markov Models (HMMs), e.g., [BW][BRI][BOU]. Hinton et al. (2009-2012) still used the old hybrid approach [HYB12]. They did not compare their hybrid to CTC-LSTM.
Alex later reused our superior end-to-end neural approach [LSTM4][LSTM14] as a postdoc in Hinton's lab [LSTM8].
By 2015, when compute had become cheap enough,
CTC-LSTM dramatically improved Google's speech recognition [GSR][GSR15][DL4].
This was soon on almost every smartphone.
Google's
on-device speech recognition of 2019
(no longer on the server)
was still based on
LSTM. See [MIR],
Sec. 4.

IV. Honda:
"In 2012, Dr. Hinton and two more students revolutionized computer vision by showing that deep learning worked far better than the existing state-of-the-art for recognizing objects in images."

Critique: See Sec. 2 of [DEC] (relevant parts repeated here for convenience):
The basic ingredients of the computer vision revolution through convolutional NNs (CNNs)
were developed by
Fukushima (1979), Waibel (1987), LeCun (1989), Weng (1993) and others since the 1970s [CNN1-4].
A success of Hinton's team (ImageNet, Dec 2012) [GPUCNN4] was mostly due to GPUs used to speed up CNNs
(they also used Malsburg's ReLUs [CMB]
and a variant of Hanson's rule [Drop1] without citation; see Sec. V).
However, the
first superior award-winning GPU-based CNN
was created earlier in 2011 by our team in Switzerland (my postdoc Dan Ciresan et al.) [GPUCNN1,3,5][R6].
Our deep and fast CNN, sometimes called "DanNet,"
was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated
CNNs [GPUCNN].
Already in 2011, it showed "that deep learning worked far better than the existing state-of-the-art for recognizing objects in images."
In fact, it
won 4 important computer vision competitions in a row
between May 15, 2011, and September 10, 2012 [GPUCNN5],
before the similar GPU-accelerated CNN of Hinton's student Krizhevsky won the ImageNet 2012 contest [GPUCNN4-5][R6].

At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition in an international contest (where a team of LeCun, Hinton's frequent co-author, took second place).
Even the NY Times mentioned this.
DanNet was also the first deep CNN to win:
a Chinese handwriting contest (ICDAR 2011),
an image segmentation contest (ISBI, May 2012),
a contest on object detection in large images (ICPR, 10 Sept 2012),
and, at the same time, a medical imaging contest on cancer detection.
All before ImageNet 2012 [GPUCNN4-5][R6].
Our CNN image scanners were 1000 times faster than previous methods [SCAN]. The tremendous importance for health care etc. is obvious. Today IBM, Siemens, Google and many startups are pursuing this approach. Much of modern computer vision is extending the work of 2011, e.g., [MIR], Sec. 19.

V. Honda: "To achieve their dramatic results, Dr. Hinton also invented a widely used new method called "dropout" which reduces overfitting in neural networks by preventing complex co-adaptations of feature detectors."

Critique:
However, "dropout" is actually a variant of Hanson's much earlier stochastic delta rule (1990)[Drop1]. Hinton's 2012 paper [GPUCNN4]
did not cite this.
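
For concreteness, here is a minimal toy sketch (NumPy, illustrative only) of "inverted" dropout as it is usually implemented today: during training, each unit is zeroed with probability p and the survivors are rescaled, which discourages the co-adaptation of feature detectors mentioned in Honda's text. Hanson's stochastic delta rule [Drop1] instead injects noise at the level of individual weights; the 2018 preprint cited under [Drop1] discusses the relationship.

```python
# Minimal sketch of inverted dropout; illustrative only.
import numpy as np

def dropout(activations, p=0.5, training=True, rng=None):
    """Zero each unit with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return activations                      # inference: use all units, no rescaling
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)       # rescale so the expected activation is unchanged

h = np.ones((2, 4))
print(dropout(h, p=0.5, rng=np.random.default_rng(0)))
```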

Apart from this,
already in 2011 we showed that dropout is not necessary
to win computer vision competitions and achieve superhuman results - see
Sec. IV above. Back then, the only really
important task was to make CNNs deep and fast on GPUs [GPUCNN1,3,5][R6].
(Today, dropout is rarely used for CNNs.)

VI. Honda: "Of the countless AI-based technological services across the world, it is no exaggeration to say that few would have been possible without the results Dr. Hinton created."

Critique:
Name one that would NOT have been possible!
Most famous AI applications are based on results created by others.
Here is a representative list of our contributions, taken from
Sec. 1 and
Sec. 2 of [DEC]:

3. Language processing. The first superior end-to-end neural machine translation was also based on our LSTM.
In 1995, we already had excellent neural probabilistic models
of text [SNT].
In 2001, we showed that our LSTM can learn languages unlearnable by traditional models such as HMMs [LSTM13]. That is, a neural "subsymbolic" model suddenly excelled at learning "symbolic" tasks.
Compute still had to get 1000 times cheaper, but by 2016-17, both Google Translate [GT16][WU] (which mentions LSTM over 50 times) and Facebook Translate [FB17] were based on two connected LSTMs [S2S], one for incoming texts, one for outgoing translations - much better than what existed before [DL4].
By 2017, Facebook's users made 30 billion LSTM-based translations per week [FB17][DL4]. Compare: the most popular YouTube video needed 2 years to achieve only 6 billion clicks.
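
As a rough illustration of the architecture described above (two connected LSTMs [S2S], one reading the incoming text and one producing the outgoing translation), here is a minimal PyTorch sketch with made-up toy vocabulary and layer sizes; it is not the production Google or Facebook system:

```python
# Minimal sketch of a two-LSTM encoder-decoder; toy sizes, illustrative only.
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 128, 256  # hypothetical toy sizes

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_emb = nn.Embedding(SRC_VOCAB, EMB)
        self.tgt_emb = nn.Embedding(TGT_VOCAB, EMB)
        self.encoder = nn.LSTM(EMB, HID, batch_first=True)  # reads the source text
        self.decoder = nn.LSTM(EMB, HID, batch_first=True)  # writes the translation
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))            # final (h, c) summarizes the source
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)   # decoder conditioned on that summary
        return self.out(dec_out)                                  # next-token logits per target position

model = Seq2Seq()
logits = model(torch.randint(0, SRC_VOCAB, (2, 7)), torch.randint(0, TGT_VOCAB, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 1000])
```

The encoder's final hidden state summarizes the source sentence and initializes the decoder, which is trained to predict the next target token at each step.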

5. Robotics. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics, e.g., [LSTM-RL][RPG]. In the 2010s, combinations of RL and LSTM have become standard. For example, in 2018, an RL LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher [OAI1][OAI1a].

In the recent
decade of deep learning,
all of the applications 2-6 of [DEC] (including those above) depended on our LSTM. See [MIR],
Sec. 4.
And there are innumerable additional LSTM applications ranging from healthcare & chemistry & molecule design
to stock market prediction and self-driving cars [DEC].
By 2016, more than a quarter of the power of all those
Tensor Processing Units in Google's datacenters
was used for LSTM (only 5% for CNNs) [JOU17].
Apparently [LSTM1] has become the most cited AI and NN
research paper of the 20th century [R5].
By 2019, it got more citations per year than any other computer science paper of the 20th century [DEC]. The current record holder of the 21st century [HW2][R5] is also related to LSTM, since ResNet [HW2] (Dec 2015) is a special case of our Highway Net (May 2015) [HW1], the feedforward net version of vanilla LSTM [LSTM2] and the first working, really deep feedforward NN with over 100 layers.
(Admittedly, however, citations are a highly questionable measure of true impact [NAT1].)
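
The relationship between Highway Nets and ResNets follows directly from the layer formula quoted under [HW1] in the references below: each non-input layer computes g(x)x + t(x)h(x). A minimal NumPy sketch (the particular nonlinearities are illustrative assumptions, not the original implementation):

```python
# Minimal sketch of a highway layer and its ResNet special case; illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, W_t, W_g):
    """One highway layer: y = g(x)*x + t(x)*h(x), as quoted in [HW1]."""
    h = np.tanh(x @ W_h)      # candidate transformation h(x)
    t = sigmoid(x @ W_t)      # transform gate t(x)
    g = sigmoid(x @ W_g)      # carry gate g(x)
    return g * x + t * h

def residual_layer(x, W_h):
    """Special case g(x) = t(x) = 1: a plain residual (ResNet-style) layer [HW2]."""
    return x + np.tanh(x @ W_h)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                        # batch of 4, width 8
W_h, W_t, W_g = (rng.standard_normal((8, 8)) for _ in range(3))
y = highway_layer(x, W_h, W_t, W_g)                    # same shape as x
```

Setting both gates to the constant 1 removes the gating and leaves x + h(x), i.e., the residual connection of [HW2].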

7. Medical imaging etc.
Some of the most important NN applications are in healthcare.
In 2012, our Deep Learner was
the first to win a medical imaging contest
(on cancer detection), before ImageNet 2012 [GPUCNN5][R6].
Similar for materials science and quality control: Already in 2010, we introduced our
deep and fast GPU-based NNs to Arcelor Mittal, the world's largest steel maker,
and were able to greatly improve steel defect detection [ST].
This may have been the first deep learning breakthrough in heavy industry. There are many other early applications of our deep learning methods, which Hinton himself frequently used.

Concluding Remarks

Dr. Hinton and co-workers have made certain significant contributions to NNs
and deep learning, e.g., [BM][CDI][RMSP][TSNE][CAPS].
But his most visible work (lauded by Honda)
popularized methods created by other researchers whom he did not cite. As emphasized earlier [DLC]:
"The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it)."

It is a sign of our field's immaturity that popularizers are sometimes still credited for inventions of others.
Honda should correct this. Else others will. Science must not allow corporate PR to distort the academic record. Similar for certain scientific journals, which "need to make clearer and firmer commitments to self-correction" [SV20].

Unfortunately, Hinton's frequent failures to credit essential prior work by others
cannot serve as a role model for PhD students who are told by their advisors
to perform meticulous research on prior art, and to avoid at all costs
the slightest hint of plagiarism.

Yes, this critique is also an implicit critique of certain other awards to Dr. Hinton.
It is also related to some of the most popular posts and comments of 2019 at reddit/ml, the largest machine learning forum with over 800k subscribers. See, e.g., posts [R4-R8] influenced by [MIR] (although my name is frequently misspelled).

Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas, as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation [NASC1-2], the telephone [NASC3], the computer [NASC4-7], resilient robots [NASC8], and scientists of the 19th century [NASC9].

At least in science, by definition, the facts will always win in the end. As long as the facts have not yet won, it is not yet the end.
(No fancy award can ever change that.)

As Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."

Dr. Hinton: Having a public debate with Schmidhuber about academic credit is not advisable because it just encourages him and there is no limit to the time and effort that he is willing to put into trying to discredit his perceived rivals.

Reply: This is apparently an
ad hominem argument [AH3][AH2]
true to the motto:
"If you cannot dispute a fact-based message, attack the messenger himself."
Obviously I am not "discrediting" others (e.g., popularisers) by crediting the
inventors.

Dr. Hinton: He has even resorted to tricks like having multiple aliases in Wikipedia to make it look as if other people are agreeing with what he says.

Reply: Another ad hominem attack which I reject.

(Many of my web pages do, however, encourage others through this statement: "The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.")

Dr. Hinton: The page on his website about Alan Turing is a nice example of how he goes about trying to diminish other people's contributions.

Reply: This is yet another fact-free comment that has nothing to do with the contents of
my post. Nevertheless, I'll take the bait and respond (skip this reply if you are
not interested in this deviation from the topic). I believe that
my web pages on
Kurt Gödel (the founder of theoretical computer science in 1931 [GOD])
and Alan Turing
paint an accurate picture of the origins of our field
(also crediting important pioneers ignored by certain movies about Turing). As always, in the interest of self-correcting science [SV20], I'll be happy to correct the pages based on evidence.
But what exactly should I correct? Here the brief summary:
Both
Gödel and the American computer science pioneer
Alonzo Church (1935) [CHU] were cited by Turing, who published later (1936) [TUR]. Gödel introduced the first universal coding language (based on the integers). He used it to represent both data (such as axioms and theorems) and programs (such as proof-generating sequences of operations on the data). He famously constructed formal statements that talk about the computation of other formal statements, especially self-referential statements which state that they are not provable by any computational theorem prover. Thus he exhibited the fundamental limits of mathematics, computing, and Artificial Intelligence [GOD]. Compare [MIR] (Sec. 18). Church (1935) extended Gödel's result to the famous Entscheidungsproblem (decision problem) [CHU], using his alternative universal language called Lambda Calculus, the basis of LISP. Later, Turing
introduced yet another universal model (the Turing Machine) to do the same (1936) [TUR].
Nevertheless, although he was standing on the shoulders of others, Turing
was certainly one of the most important computer science pioneers.

Dr. Hinton: Despite my own best judgement, I feel that I cannot leave his charges completely unanswered so I am going to respond once and only once. I have never claimed that I invented backpropagation. David Rumelhart invented it independently long after people in other fields had invented it. It is true that when we first published we did not know the history so there were previous inventors that we failed to cite. What I have claimed is that I was the person to clearly demonstrate that backpropagation could learn interesting internal representations and that this is what made it popular. I did this by forcing a neural net to learn vector representations for words such that it could predict the next word in a sequence from the vector representations of the previous words. It was this example that convinced the Nature referees to publish the 1986 paper.

It is true that many people in the press have said I invented backpropagation and I have spent a lot of time correcting them. Here is an excerpt from the 2018 book by Michael Ford entitled "Architects of Intelligence": "Lots of different people invented different versions of backpropagation before David Rumelhart. They were mainly independent inventions and it's something I feel I have got too much credit for. I've seen things in the press that say that I invented backpropagation, and that is completely wrong. It's one of these rare cases where an academic feels he has got too much credit for something! My main contribution was to show how you can use it for learning distributed representations, so I'd like to set the record straight on that."

Reply: This is finally a response related to my post.
However, it does not at all contradict what I wrote in the relevant Sec. I.
It is true that Dr. Hinton credited in 2018 his co-author
Rumelhart [RUM] with the "invention" of backpropagation [AOI]. But neither in [AOI]
nor in his 2015 survey [DL3] did he
mention Linnainmaa (1970) [BP1], the true inventor of
this efficient algorithm for applying the chain rule to networks with differentiable nodes [BP4].
It should be mentioned
that [DL3] does cite Werbos (1974), who, however, described the method correctly only
later, in 1982 [BP2], and
also failed to cite [BP1].
Linnainmaa's method was well-known, e.g.,
[BP5][DL1][DL2][DLC].
It wasn't created by "lots of different people" but by exactly
one person who published first [BP1] and therefore should get the credit.
(Sec. I above also mentions the method's precursors [BPA][BPB][BPC].)
Dr. Hinton accepted the Honda Prize
although he apparently agrees that Honda's claims (e.g., Sec. I) are false.
He should ask Honda to correct their statements.

Dr. Hinton: Maybe Juergen would like to set the record straight on who invented LSTMs?

Reply: This question is again deviating from what's in my post. Nevertheless, I'll happily respond:
See [MIR], Sec. 3 and
Sec. 4 on the fundamental contributions of my
former student Sepp Hochreiter in his 1991 diploma thesis [VAN1], which I called "one of the most important documents in the history of machine learning." (Sec. 4 also mentions later great contributions by other students, including Felix Gers and Alex Graves.)

To summarize, Dr. Hinton's comments and ad hominem arguments
diverge from the contents of my post and do not challenge
any of the facts presented in Sec. I, II, III, IV, V, VI. The facts still stand.

Acknowledgments

Thanks to several expert reviewers for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.

[Drop1] Hanson, S. J. (1990). A Stochastic Version of the Delta Rule. Physica D, 42, 265-272. (Compare preprint
arXiv:1808.03578 on dropout as a special case, 2018.)

[CMB]
C. v. d. Malsburg (1973).
Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100. [See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.]

[HW1] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS'2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) ResNets [HW2] are a special case of this where g(x) = t(x) = const = 1.

[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber.
A Committee of Neural Networks for Traffic Sign Classification.
International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
[First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor. This led to massive interest from industry.]

[GPUCNN5] J. Schmidhuber. History of computer vision contests won by deep CNNs on GPU. March 2017.
[How IDSIA used GPU-based CNNs to win four important computer vision competitions 2011-2012 before others started using similar approaches.]

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
See also BIT 16, 146-160, 1976.

[UN1]
J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0].
[First working Deep Learner based on a deep RNN hierarchy, overcoming the vanishing gradient problem. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills - such approaches are now widely used.]

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993.
[Includes an ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NNs, plus lots of additional material and images related to other references in the present page.]

[CHU]
A. Church (1935). An unsolvable problem of elementary number theory. Bulletin of the American Mathematical Society, 41: 332-333. Abstract of a talk given on 19 April 1935, to the American Mathematical Society.
Also in American Journal of Mathematics, 58(2), 345-363 (1 Apr 1936).

[TUR]
A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230-267. Received 28 May 1936. Errata appeared in Series 2, 43, pp 544-546 (1937).

[AOI] M. Ford. Architects of Intelligence: The truth about AI from the people building it. Packt Publishing, 2018.
(Preface to German edition by J. Schmidhuber.)