
Deep learning applications

In the next few paragraphs, we will discuss how deep neural networks are applied to speech recognition and computer vision, and how in recent years they have vastly improved accuracy in both fields, decisively outperforming many machine learning algorithms not based on deep neural networks.

Speech recognition

Deep learning came into use for speech recognition at the start of this decade (2010 and later; see, for example, the 2012 article titled Deep Neural Networks for Acoustic Modeling in Speech Recognition by Hinton et al., available online at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf); until then, speech recognition was dominated by so-called GMM-HMM methods (Hidden Markov Models with Gaussian Mixture emissions). Understanding speech is a complex task, since speech is not, as is naively thought, made up of separate words with clear boundaries between them. In reality, speech has no truly distinguishable parts and no clear boundaries between spoken words. In studying the sounds that compose words, we often look at so-called triphones, which comprise three regions: the first part depends on the previous sound, the middle part is generally stable, and the last part depends on the following sound. In addition, it is typically better to detect only parts of triphones, and the detectors for these parts are called senones.

In Deep Neural Networks for Acoustic Modeling in Speech Recognition, several comparisons were made between the then state-of-the-art models and the authors' model, which comprised five hidden layers with 2048 units per layer. The first comparison used the Bing voice search application: with 24 hours of training data, the deep network achieved 69.6% accuracy versus 63.8% for the classical GMM-HMM model. The same model was also tested on the Switchboard speech recognition task, a public speech-to-text transcription benchmark (similar in role to the MNIST dataset used for digit recognition) that includes about 2500 conversations by 500 speakers from around the US. In addition, tests and comparisons were performed using Google Voice input speech, YouTube data, and English Broadcast News speech data. In the following table, we summarize the results from the article, showing the error rates for the DNN versus the GMM-HMM.

| Task | Total hours of training data | DNN (error rate) | GMM-HMM with same training (error rate) | GMM-HMM with longer training (error rate) |
| --- | --- | --- | --- | --- |
| Switchboard (test1) | 309 | 18.5 | 27.4 | 18.6 (2000 hrs) |
| Switchboard (test2) | 309 | 16.1 | 23.6 | 17.1 (2000 hrs) |
| English Broadcast News | 50 | 17.5 | 18.8 | |
| Bing Voice Search | 24 | 30.4 | 36.2 | |
| Google Voice | 5870 | 12.3 | | 16.0 (>>5870 hrs) |
| YouTube | 1400 | 47.6 | 52.3 | |

In another article, New Types of Deep Neural Network Learning for Speech Recognition and Related Applications: An Overview, by Deng, Hinton, and Kingsbury (https://www.microsoft.com/en-us/research/publication/new-types-of-deep-neural-network-learning-for-speech-recognition-and-related-applications-an-overview/), the authors also note that DNNs work particularly well on noisy speech.

Another advantage of DNNs is that, before their adoption, practitioners had to hand-engineer transformations of speech spectrograms. A spectrogram is a visual representation of the frequencies in a signal over time. DNNs, by contrast, can autonomously and automatically pick up primitive features, in this case primitive spectral features. Techniques such as convolution and pooling operations can then be applied to these primitive spectral features to cope with typical speech variations between speakers. In recent years, more sophisticated neural networks with recurrent connections (RNNs) have been employed with great success (A. Graves, A. Mohamed, and G. Hinton, Speech Recognition with Deep Recurrent Neural Networks, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013; refer to http://www.cs.toronto.edu/~fritz/absps/RNN13.pdf), for example a particular type of deep neural network called the LSTM (long short-term memory network), which will be described in a later chapter.
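To make the idea of a spectrogram concrete, here is a minimal sketch of computing one with SciPy. The synthetic sine wave is just a stand-in for a real speech recording, and the frame settings (25 ms windows with a 10 ms hop at 16 kHz) are typical but illustrative choices:

```python
import numpy as np
from scipy import signal

# Synthesize a 1-second, 440 Hz test tone at 16 kHz,
# standing in for a real speech recording.
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)

# Compute the spectrogram: frequency bins on one axis, time frames on the
# other, each cell holding the signal's energy at that time and frequency.
frequencies, times, spec = signal.spectrogram(
    audio, fs=sample_rate,
    nperseg=400,    # 25 ms analysis window
    noverlap=240)   # windows slide by 160 samples (10 ms)

print(spec.shape)   # (frequency bins, time frames)
```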

In Chapter 2, Neural Networks, we discussed different activation functions; although the logistic sigmoid and the hyperbolic tangent are the best known, networks using them are also often slow to train. Recently, the ReLU activation function has been used successfully in speech recognition, for example in the article by G. Dahl, T. Sainath, and G. Hinton, Improving Deep Neural Networks for LVCSR Using Rectified Linear Units and Dropout, in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013 (http://www.cs.toronto.edu/~gdahl/papers/reluDropoutBN_icassp2013.pdf). In Chapter 5, Image Recognition, we will also explain the meaning of "dropout", as discussed in this paper (and mentioned in its title).
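For reference, the three activation functions are easy to write down. The following NumPy definitions are a minimal illustration of why ReLU avoids the saturation that slows training with the sigmoid and the hyperbolic tangent:

```python
import numpy as np

def sigmoid(x):
    # Logistic sigmoid: squashes inputs into (0, 1);
    # gradients vanish for large |x|, slowing training.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent: squashes inputs into (-1, 1); also saturates.
    return np.tanh(x)

def relu(x):
    # Rectified linear unit: zero for negative inputs, identity for
    # positive ones, so the gradient never saturates on the positive side.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```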

Object recognition and classification

This is perhaps the area where the success of deep neural networks is best documented and understood. As in speech recognition, DNNs can discover basic representations and features automatically. Moreover, hand-picked features were often able to capture only low-level edge information, while DNNs can capture higher-level representations such as edge intersections. In 2012, the results of the ImageNet Large Scale Visual Recognition Competition (available online at http://image-net.org/challenges/LSVRC/2012/results.html) showed the winning team, composed of Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton, beating the second-placed team with an error rate of 16.4% versus 26.2%, using a large network with 60 million parameters and 650,000 neurons, with five convolutional layers followed by max-pooling layers. Convolutional layers and max-pooling layers will be the focus of Chapter 5, Image Recognition. It was a huge and impressive result, and that breakthrough sparked the current renaissance in neural networks. The authors used many novel techniques to help the learning process, bringing together convolutional networks, the use of GPUs, and tricks such as dropout and the ReLU activation function in place of the sigmoid.
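The following is a minimal Keras sketch, not Krizhevsky et al.'s actual 60-million-parameter network, showing how these ingredients (convolutional layers, max-pooling, ReLU activations, and dropout) fit together; the layer sizes here are illustrative placeholders:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D,
                                     Flatten, Dense, Dropout)

model = Sequential([
    # Convolutional layer with ReLU instead of sigmoid,
    # as in the winning 2012 entry.
    Conv2D(32, (3, 3), activation='relu', input_shape=(224, 224, 3)),
    MaxPooling2D((2, 2)),   # max-pooling downsamples the feature maps
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dropout(0.5),           # dropout regularizes the fully connected layer
    Dense(1000, activation='softmax')  # 1000 ImageNet classes
])
model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.summary()
```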

The network was trained using GPUs (we will talk about GPU advantages in the next section) and showed how a large amount of labeled data can greatly improve the performance of deep neural networks, greatly outperforming more conventional approaches to image recognition and computer vision. Given the success of convolutional layers in deep learning, Zeiler and Fergus, in two articles (M. Zeiler and R. Fergus, Stochastic Pooling for Regularization of Deep Convolutional Neural Networks, in Proceedings of the International Conference on Learning Representations (ICLR), 2013, http://www.matthewzeiler.com/pubs/iclr2013/iclr2013.pdf, and M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, arXiv:1311.2901, pages 1-11, 2013, http://www.matthewzeiler.com/pubs/arxive2013/arxive2013.pdf), tried to understand why convolutional networks in deep learning work so well and what representations the network learns. Zeiler and Fergus set out to visualize what the intermediate layers capture by mapping their neural activities back to the input. They created a de-convolutional network attached to each layer, providing a loop back to the image pixels of the input; a simplified code sketch of this idea follows the figures below.

Image taken from M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks

The article shows which features are being revealed: layer 2 responds to corners and edges, layer 3 to different mesh patterns, layer 4 to dog faces and bird legs, and layer 5 to entire objects.

Image taken from M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks
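As a rough illustration of inspecting what an intermediate filter responds to, the following sketch uses gradient ascent on the input pixels, a simpler stand-in for Zeiler and Fergus's actual de-convolutional approach. The visualize_filter helper and the 224x224 input size are assumptions for the example, not code from the paper:

```python
import tensorflow as tf

def visualize_filter(model, layer_name, filter_index,
                     steps=30, step_size=1.0):
    # Sub-model that outputs the activations of the chosen layer.
    layer_output = model.get_layer(layer_name).output
    feature_extractor = tf.keras.Model(model.input, layer_output)

    # Start from a faint random image (size must match the model's input)
    # and climb the gradient of the filter's mean activation
    # with respect to the pixels.
    img = tf.Variable(tf.random.uniform((1, 224, 224, 3)) * 0.1 + 0.45)
    for _ in range(steps):
        with tf.GradientTape() as tape:
            activation = feature_extractor(img)
            loss = tf.reduce_mean(activation[..., filter_index])
        grads = tape.gradient(loss, img)
        img.assign_add(step_size * grads / (tf.norm(grads) + 1e-8))
    # The resulting image is the pattern this filter responds to most.
    return img.numpy()[0]
```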

Deep learning can also be used for unsupervised learning, by using networks that incorporate RBMs and autoencoders. In an article by Q. Le, M. Ranzato, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Ng, Building High-level Features Using Large Scale Unsupervised Learning, in Proceedings of the International Conference on Machine Learning (ICML) (http://static.googleusercontent.com/media/research.google.com/en//archive/unsupervised_icml2012.pdf), the authors used a 9-layer network built from autoencoders, with one billion connections, trained on 10 million images downloaded from the Internet. Unsupervised feature learning allowed the system to be trained to recognize faces without ever being told whether an image contains a face or not. In the article, the authors state:

"It is possible to train neurons to be selective for high-level concepts using entirely unlabeled data … neurons functions as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos … starting from these representations we obtain 15.8% accuracy for object recognition on ImageNet with 20,000 categories, a significant leap of 70% relative improvement over the state-of-the-art."
