Python Deep Learning
by Valentino Zocca, Gianmario Spacagna, Daniel Slater, and Peter Roelants
Deep learning applications
In the next few paragraphs, we will discuss how deep neural networks are applied in the fields of speech recognition and computer vision, and how in recent years their application has vastly improved accuracy in both fields, outperforming many machine learning algorithms not based on deep neural networks.
Speech recognition
Deep learning began to be used in speech recognition in this decade (2010 and later; see, for example, the 2012 article Deep Neural Networks for Acoustic Modeling in Speech Recognition by Hinton et al., available online at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf); until then, speech recognition was dominated by GMM-HMM methods (Hidden Markov Models with Gaussian Mixture emissions). Understanding speech is a complex task, since speech is not, as is naively thought, made up of separate words with clear boundaries between them. In reality, there are no clearly distinguishable parts in speech and no clear boundaries between spoken words. When studying the sounds that compose words, we often look at so-called triphones, which comprise three regions: the first part depends on the previous sound, the middle part is generally stable, and the last part depends on the following sound. In addition, it is typically better to detect only parts of a triphone; the detectors for these parts are called senones.
In Deep Neural Networks for Acoustic Modeling in Speech Recognition, several comparisons were made between the then state-of-the-art models and the authors' model, which comprised five hidden layers with 2,048 units per layer. The first comparison used the Bing voice search application: on 24 hours of training data, the deep network achieved 69.6% accuracy versus 63.8% for the classical GMM-HMM model. The same model was also tested on the Switchboard speech recognition task, a public speech-to-text transcription benchmark (similar in role to the MNIST dataset used for digit recognition) that includes about 2,500 conversations by 500 speakers from around the US. In addition, tests and comparisons were performed using Google Voice input speech, YouTube data, and English Broadcast News speech data. The following table summarizes the results from the article, showing the error rates for the DNN versus GMM-HMM models.
| Task | Total hours of training data | DNN (error rate) | GMM-HMM with same training (error rate) | GMM-HMM with longer training (error rate) |
| --- | --- | --- | --- | --- |
| Switchboard (test 1) | 309 | 18.5 | 27.4 | 18.6 (2,000 hrs) |
| Switchboard (test 2) | 309 | 16.1 | 23.6 | 17.1 (2,000 hrs) |
| English Broadcast News | 50 | 17.5 | 18.8 | |
| Bing Voice Search | 24 | 30.4 | 36.2 | |
| Google Voice | 5,870 | 12.3 | | 16.0 (>>5,870 hrs) |
| YouTube | 1,400 | 47.6 | 52.3 | |
In another article, New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview, by Deng, Hinton, and Kingsbury (https://www.microsoft.com/en-us/research/publication/new-types-of-deep-neural-network-learning-for-speech-recognition-and-related-applications-an-overview/), the authors also note that DNNs work particularly well on noisy speech.
Another advantage of DNNs is that, before their adoption, researchers had to hand-craft transformations of speech spectrograms. A spectrogram is a visual representation of the frequencies in a signal over time. DNNs can instead pick up primitive features autonomously and automatically, in this case represented by primitive spectral features. Techniques such as convolution and pooling operations can then be applied to these primitive spectral features to cope with typical speech variations between speakers. In recent years, more sophisticated neural networks with recurrent connections (RNNs) have been employed with great success (A. Graves, A. Mohamed, and G. Hinton, Speech Recognition with Deep Recurrent Neural Networks, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013; refer to http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf), for example, a particular type of deep neural network called LSTM (long short-term memory) that will be described in a later chapter.
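As a concrete illustration of this pipeline, the following minimal sketch computes a spectrogram with NumPy and SciPy and runs a single hand-rolled convolution and max-pooling pass over it. The synthetic two-tone waveform and the fixed smoothing kernel are assumptions made purely for illustration; in a real system, any mono audio signal would be the input and the filters would be learned.

```python
import numpy as np
from scipy import signal

# A synthetic one-second "speech" signal sampled at 16 kHz:
# two tones plus noise stand in for real audio (an assumption
# for illustration only).
fs = 16000
t = np.arange(fs) / fs
wave = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
wave += 0.1 * np.random.randn(fs)

# The spectrogram: frequency on one axis, time on the other,
# magnitude as intensity. This is the kind of primitive spectral
# feature a DNN can consume directly.
freqs, times, spec = signal.spectrogram(wave, fs=fs, nperseg=256)

# A tiny convolution + max-pooling pass over the spectrogram,
# done by hand to show the idea: convolution detects local
# time-frequency patterns, and pooling makes the detection
# tolerant to small shifts (for example, between speakers).
kernel = np.ones((3, 3)) / 9.0                      # simple smoothing filter
conv = signal.convolve2d(spec, kernel, mode='valid')
conv = conv[:conv.shape[0] // 2 * 2, :conv.shape[1] // 2 * 2]  # trim to even dims
pooled = conv.reshape(conv.shape[0] // 2, 2,
                      conv.shape[1] // 2, 2).max(axis=(1, 3))  # 2x2 max-pooling
print(spec.shape, conv.shape, pooled.shape)
```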
In Chapter 2, Neural Networks, we discussed different activation functions, and although the logistic sigmoid and the hyperbolic tangent are the best known, they are also often slow to train. Recently, the ReLU activation function has been used successfully in speech recognition, for example in the article by G. Dahl, T. Sainath, and G. Hinton, Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013 (http://www.cs.toronto.edu/~gdahl/papers/reluDropoutBN_icassp2013.pdf). In Chapter 5, Image Recognition, we will also explain the meaning of "dropout", as discussed in this paper (and mentioned in its title).
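The speed difference comes down to gradients. The short sketch below (plain NumPy, with example inputs chosen for illustration) compares the derivative of the logistic sigmoid, which vanishes as inputs grow in magnitude, with that of ReLU, which stays at 1 for any positive input, so gradient descent keeps receiving a useful signal.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

# The sigmoid saturates: its derivative sigmoid(x) * (1 - sigmoid(x))
# is near zero for large |x|, which slows gradient-based training.
# ReLU's derivative is exactly 1 for any positive input.
x = np.array([-4.0, -1.0, 0.5, 4.0])
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))
relu_grad = (x > 0).astype(float)
print("sigmoid'(x):", sigmoid_grad)   # roughly 0.018 at |x| = 4
print("relu'(x):   ", relu_grad)      # 0 or 1, never saturates
```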
Object recognition and classification
This is perhaps the area where the success of deep neural networks is best documented and understood. As in speech recognition, DNNs can discover basic representations and features automatically. In addition, hand-picked features were often able to capture only low-level edge information, while DNNs can capture higher-level representations such as edge intersections. In 2012, the results of the ImageNet Large Scale Visual Recognition Competition (available online at http://image-net.org/challenges/LSVRC/2012/results.html) showed the winning team, composed of Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton, beating the second-place team by an error rate of 16.4% versus 26.2%, using a large network with 60 million parameters and 650,000 neurons, with five convolutional layers followed by max-pooling layers. Convolutional layers and max-pooling layers will be the focus of Chapter 5, Image Recognition. It was a huge and impressive result, and that breakthrough sparked the current renaissance in neural networks. The authors used many novel techniques to help the learning process, bringing together convolutional networks, the use of GPUs, and tricks such as dropout and the ReLU activation function instead of the sigmoid.
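To make that architecture concrete, here is a minimal AlexNet-style sketch in Keras. The filter counts and kernel sizes follow the published description, but the sketch omits details such as local response normalization and the original two-GPU split, so read it as an illustration of the ingredients (convolutions, max-pooling, ReLU, dropout) rather than the authors' exact network.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# Five convolutional layers with interleaved max-pooling, ReLU
# activations throughout, and dropout before the classifier,
# in the spirit of Krizhevsky et al.'s network.
model = Sequential([
    Conv2D(96, (11, 11), strides=4, activation='relu',
           input_shape=(224, 224, 3)),
    MaxPooling2D((3, 3), strides=2),
    Conv2D(256, (5, 5), padding='same', activation='relu'),
    MaxPooling2D((3, 3), strides=2),
    Conv2D(384, (3, 3), padding='same', activation='relu'),
    Conv2D(384, (3, 3), padding='same', activation='relu'),
    Conv2D(256, (3, 3), padding='same', activation='relu'),
    MaxPooling2D((3, 3), strides=2),
    Flatten(),
    Dense(4096, activation='relu'),
    Dropout(0.5),                      # the "dropout trick"
    Dense(1000, activation='softmax')  # 1,000 ImageNet classes
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
```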
The network was trained using GPUs (we will talk about the advantages of GPUs in the next section) and showed how a large amount of labeled data can greatly improve the performance of deep learning neural nets, far outperforming more conventional approaches to image recognition and computer vision. Given the success of convolutional layers in deep learning, Zeiler and Fergus, in two articles (M. Zeiler and R. Fergus, Stochastic Pooling for Regularization of Deep Convolutional Neural Networks, in Proceedings of the International Conference on Learning Representations (ICLR), 2013, http://www.matthewzeiler.com/pubs/iclr2013/iclr2013.pdf, and M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, arXiv:1311.2901, pages 1-11, 2013, http://www.matthewzeiler.com/pubs/arxive2013/arxive2013.pdf), tried to understand why convolutional networks work so well in deep learning and what representations the network learns. They set out to visualize what the intermediate layers capture by mapping their neural activities back to image space, creating a de-convolutional network attached to each layer that provides a loop back to the pixels of the input image.

[Image taken from M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks]
The article shows what features are being revealed: layer 2 responds to corners and edges, layer 3 to different mesh patterns, layer 4 to dog faces and bird legs, and layer 5 to entire objects.

[Image taken from M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks]
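Reproducing Zeiler and Fergus's de-convolutional network is beyond the scope of a short example, but a related, much simpler inspection (reading out an intermediate layer's feature maps on a forward pass) takes only a few lines in Keras. This sketch assumes the AlexNet-style model from the earlier example; the random input image is just a stand-in.

```python
from keras.models import Model
import numpy as np

# Build a second model that shares the layers of `model` but stops
# at an intermediate convolutional layer, exposing its feature maps
# as output. (This inspects forward activations only; Zeiler and
# Fergus's deconvnet additionally maps them back to pixel space.)
layer_outputs = Model(inputs=model.input,
                      outputs=model.layers[2].output)

image = np.random.rand(1, 224, 224, 3)    # stand-in input image
feature_maps = layer_outputs.predict(image)
print(feature_maps.shape)                 # (1, height, width, n_filters)
```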
Deep learning can also be used for unsupervised learning, using networks that incorporate RBMs and autoencoders. In an article by Q. Le, M. Ranzato, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng, Building High-Level Features Using Large Scale Unsupervised Learning, in Proceedings of the International Conference on Machine Learning (ICML) (http://static.googleusercontent.com/media/research.google.com/en//archive/unsupervised_icml2012.pdf), the authors used a nine-layer network built from autoencoders, with one billion connections, trained on 10 million images downloaded from the Internet. Unsupervised feature learning allows the system to be trained to recognize faces without being told whether an image contains a face or not. In the article, the authors state:
"It is possible to train neurons to be selective for high-level concepts using entirely unlabeled data … neurons functions as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos … starting from these representations we obtain 15.8% accuracy for object recognition on ImageNet with 20,000 categories, a significant leap of 70% relative improvement over the state-of-the-art."