Deep Learning 101


Deep learning has become something of a buzzword in recent years with the explosion of 'big data', 'data science', and their derivatives mentioned in the media. Justifiably, deep learning approaches have recently blown other state-of-the-art machine learning methods out of the water for standardized problems such as the MNIST handwritten digits dataset. My goal is to give you a layman's understanding of what deep learning actually is so you can follow some of my thesis research this year, as well as mentally filter out news articles that sensationalize these buzzwords.

Intro

MNIST (source)

Imagine you are trying to recognize someone's handwriting - whether they drew a '7' or a '9'. From years of seeing handwritten digits, you automatically notice the vertical line with a horizontal top section. If you see a closed loop in the top section of the digit, you think it is a '9'. If it is more like a horizontal line, you think of it as a '7'. Easy enough. What it took for you to correctly recognize the digit, however, is an impressive display of fitting smaller features together to make the whole - noticing contrasted edges to make lines, seeing a horizontal vs. vertical line, noticing the positioning of the vertical section underneath the horizontal section, noticing a loop in the horizontal section, etc.

Ultimately, this is what deep learning or representation learning is meant to do: discover multiple levels of features that work together to define increasingly more abstract aspects of the data (in our case, initial image pixels to lines to full-blown numbers). This post is going to be a rough summary of two main survey papers on the topic.

Why do we care about deep learning?

Current machine learning algorithms' performance depends heavily on the particular features of the data chosen as inputs. For example, document classification (such as marking emails as spam or not) can be performed by breaking down the input document into bag-of-words or n-grams as features. Choosing the correct feature representation of input data, or feature engineering, is a way that people can bring prior knowledge of a domain to increase an algorithm's computational performance and accuracy. To move towards general artificial intelligence, algorithms need to be less dependent on this feature engineering and better learn to identify the explanatory factors of input data on their own.
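As a concrete example of such feature engineering (my own sketch, not from the original post; scikit-learn's CountVectorizer is just one convenient choice for building bag-of-words features):

```python
# Bag-of-words feature engineering for document classification (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win a free prize now", "meeting notes attached", "free prize claim now"]
vectorizer = CountVectorizer(ngram_range=(1, 1))  # unigrams; (1, 2) would add bigrams
X = vectorizer.fit_transform(docs)                # document-term count matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # each row is one document's word-count feature vector
```

A downstream classifier (spam or not) then operates on these hand-chosen counts rather than the raw text, which is exactly the dependence deep learning tries to reduce.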

Deep learning tries to move in this direction by capturing a 'good' representation of input data by using compositions of non-linear transformations. A 'good' representation can be defined as one that disentangles underlying factors of variation for input data. It turns out that deep learning approaches can find useful abstract representations of data across many domains: it has had great commercial success powering most of Google and Microsoft's current speech recognition, image classification, natural language processing, object recognition, etc. Facebook is also planning on using deep learning approaches to understand its users [1]. Deep learning has been so impactful in industry that MIT Technology Review named it a top-10 breakthrough technology of 2013 [2].

So how do you build a deep representation of input data? The central idea is to learn a hierarchy of features one level at a time, where the input to one computational level is the output of the previous level, for an arbitrary number of levels. In contrast, 'shallow' representations (most current algorithms, like regression or SVMs) go directly from input data to output classification.

One good analogue for deep representations is neurons in the brain (a motivation for artificial neural networks) - the output of a group of neurons is agglomerated as the input to more neurons to form a hierarchical layer structure. Each layer N is composed of h computational nodes that connect to each computational node in layer N+1. See the image below for an example: 

Neural network layers (source)
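To make the layered computation concrete, here is a toy numpy forward pass (my own illustration; the sigmoid, layer sizes, and random weights are all arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 input units (e.g. pixel values)

# Two hidden layers: every node in layer N feeds every node in layer N+1.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # layer 1: 4 -> 3 units
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # layer 2: 3 -> 2 units

h1 = sigmoid(W1 @ x + b1)   # first level of features
h2 = sigmoid(W2 @ h1 + b2)  # more abstract features built from h1
print(h2)
```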

Interpretations of representation learning

There are two main ways to interpret the computation performed by these layered deep architectures:

  • Probabilistic graphical models have nodes in each layer that are considered as latent random variables. In this case, you care about the probability distribution of the input data x and the hidden latent random variables h that describe the input data, via the joint distribution P(x, h). These latent random variables describe a distribution over the observed data.
  • Direct encoding (neural network) models have nodes in each layer that are considered as computational units. This means each node h performs some computation (normally nonlinear, like a sigmoid function, hyperbolic tangent, or rectified linear unit) given its inputs from the previous layer; these common nonlinearities are sketched just after this list.
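For reference, each of the three nonlinearities mentioned above is a one-liner in numpy:

```python
import numpy as np

def sigmoid(z):            # squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):               # squashes to (-1, 1)
    return np.tanh(z)

def relu(z):               # rectified linear unit: max(0, z)
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```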

To get started, principal component analysis (PCA) is a simple feature extraction algorithm that can span both of these interpretations. PCA learns a linear transformation $h = f(x) = W^\top x + b$, where $W$ is a weight matrix for the inputs $x$ and $b$ is a bias. The columns of the $d_x \times d_h$ matrix $W$ form an orthogonal basis for the $d_h$ orthogonal directions of greatest variance in the input training data $x$. The result is $d_h$ decorrelated features that make up the representation layer $h$.

PCA (source)

From a probabilistic viewpoint, PCA is simply finding the principal eigenvectors of the covariance matrix of the data. This means that you are finding which features of the input data can explain away the most variance in the data [3].

From an encoding viewpoint, PCA is performing a linear computation over the input data to form a hidden representation h that has a lower dimensionality than the data.
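Both viewpoints show up in a few lines of numpy (a minimal sketch of my own; the choice of $d_h = 2$ and the toy data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # 500 samples, d_x = 5 input features
X_centered = X - X.mean(axis=0)

# Probabilistic view: principal eigenvectors of the data covariance matrix.
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
W = eigvecs[:, ::-1][:, :2]              # top d_h = 2 directions of variance

# Encoding view: a linear map to a lower-dimensional representation h,
# i.e. h = W^T (x - mean), which matches h = W^T x + b with b = -W^T mean.
H = X_centered @ W
print(np.round(np.cov(H, rowvar=False), 2))  # ~diagonal: the features are decorrelated
```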

Note that because PCA is a linear transformation of the input x, it cannot really be stacked in layers, because the composition of linear operations is just another linear operation (see the quick check below). There would be no abstraction benefit of multiple layers. To form powerful deep representations, we will look at stacking Restricted Boltzmann Machines (RBMs) from a probability viewpoint and nonlinear auto-encoders from a direct encoding viewpoint.
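The 'linear composition collapses' point is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
x = rng.normal(size=4)

# Two stacked linear layers are exactly one linear layer with weights W2 @ W1:
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)
```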

Probabilistic models: restricted Boltzmann machine (RBM)

A Boltzmann machine is a network of symmetrically coupled binary random variables or units. This means that it is a fully-connected, undirected graph. This graph can be divided into two parts:

  1. The visible binary units x that make up the input data and
  2. The hidden or latent binary units h that explain away the dependencies between the visible units x through their mutual interactions.

Boltzmann machine

(A graphical representation of an example Boltzmann machine. Each undirected edge represents dependency; in this example there are 3 hidden units and 4 visible units. source)

Boltzmann machines describe this pattern of interaction through the distribution over the joint space (x, h) with the energy function

$$E(x, h) = -\frac{1}{2}x^\top U x - \frac{1}{2}h^\top V h - x^\top W h - b^\top x - d^\top h$$

where the model parameters $\Theta$ are $\{U, V, W, b, d\}$: $U$ and $V$ hold the visible-visible and hidden-hidden interactions, $W$ the visible-hidden interactions, and $b$ and $d$ the visible and hidden biases.
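The energy function translates directly into a few lines of numpy (my own illustration, not from the post; the toy sizes and random parameters are arbitrary):

```python
import numpy as np

def boltzmann_energy(x, h, U, V, W, b, d):
    """E(x, h) = -1/2 x'Ux - 1/2 h'Vh - x'Wh - b'x - d'h"""
    return (-0.5 * x @ U @ x - 0.5 * h @ V @ h
            - x @ W @ h - b @ x - d @ h)

rng = np.random.default_rng(0)
dx, dh = 4, 3
x = rng.integers(0, 2, size=dx).astype(float)   # binary visible units
h = rng.integers(0, 2, size=dh).astype(float)   # binary hidden units
U, V = rng.normal(size=(dx, dx)), rng.normal(size=(dh, dh))
W, b, d = rng.normal(size=(dx, dh)), np.zeros(dx), np.zeros(dh)
print(boltzmann_energy(x, h, U, V, W, b, d))
```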

Trying to evaluate conditional probabilities over this fully connected graph ends up being an intractable problem. For example, computing the conditional probability of a single hidden unit given the visibles, P(h_i | x), requires marginalizing over all the other hidden variables. This would be evaluating a sum with $2^{d_h - 1}$ terms.

However, we can restrict the graph from being fully connected so that it contains only the interactions between the visible units x and hidden units h:

Restricted Boltzmann machine (source)

This gives us an RBM, which is a bipartite graph with the visible and hidden units forming distinct layers. Calculating the conditional distribution P(h | x) is readily tractable and now factorizes to:

$$P(h \mid x) = \prod_i P(h_i \mid x)$$
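Concretely, each factor is a Bernoulli whose probability is a sigmoid of an affine function of the visibles; this is the standard RBM result, though the post's figures don't spell it out. A minimal numpy sketch, with names matching the energy function's $W$ and hidden bias $d$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, d):
    """Each hidden unit is conditionally independent given x:
       P(h_i = 1 | x) = sigmoid(d_i + x' W[:, i])."""
    return sigmoid(d + x @ W)

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=4).astype(float)
W, d = rng.normal(size=(4, 3)), np.zeros(3)
print(p_h_given_x(x, W, d))   # one Bernoulli probability per hidden unit
```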

Very successful deep learning algorithms stack multiple RBMs together, where the hidden units h computed from the visible input data x become the new input data for another RBM, for an arbitrary number of layers. A sketch of this greedy layer-wise procedure follows the figure below.

Stacked RBMs
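Neither the training procedure nor any code appears in the original post; the following is a rough numpy sketch of the usual recipe — one step of contrastive divergence (CD-1) per update, followed by greedy stacking — with all sizes, data, and learning rates arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(X, W, b, d, lr=0.1, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    X: (n, dx) data; W: (dx, dh); b: visible bias; d: hidden bias."""
    ph = sigmoid(d + X @ W)                        # P(h = 1 | x)
    h = (rng.random(ph.shape) < ph).astype(float)  # sample the hiddens
    px = sigmoid(b + h @ W.T)                      # P(x = 1 | h): reconstruction
    ph2 = sigmoid(d + px @ W)                      # hiddens of the reconstruction
    W += lr * (X.T @ ph - px.T @ ph2) / len(X)     # positive - negative phase
    b += lr * (X - px).mean(axis=0)
    d += lr * (ph - ph2).mean(axis=0)
    return W, b, d

rng = np.random.default_rng(0)
X = (rng.random((100, 6)) < 0.5).astype(float)     # toy binary "visible" data

# Greedy stacking: train RBM 1 on x, then RBM 2 on h1 = E[h | x].
W1, b1, d1 = rng.normal(0, 0.1, (6, 4)), np.zeros(6), np.zeros(4)
for _ in range(50):
    W1, b1, d1 = cd1_step(X, W1, b1, d1, rng=rng)
H1 = sigmoid(d1 + X @ W1)                          # layer-1 features become...
W2, b2, d2 = rng.normal(0, 0.1, (4, 2)), np.zeros(4), np.zeros(2)
for _ in range(50):
    W2, b2, d2 = cd1_step(H1, W2, b2, d2, rng=rng)  # ...the next RBM's input
```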

There are a few drawbacks to the probabilistic approach to deep architectures:

  1. The posterior distribution P(h | x) becomes incredibly complicated if the model has more than a few interconnected layers. We are forced to resort to sampling or approximate inference techniques to solve the distribution, which carries computational costs and approximation error.
  2. Calculating this distribution over latent variables still does not give a usable feature vector to train a final classifier to make this algorithm useful for AI tasks. For example, we calculate all of these hidden distributions that explain the variations over the handwritten-digit recognition problem, but they do not give a final classification of a number. Actual feature values are normally derived from the distribution by taking the latent variables' expected values, which are then used as the input to a normal machine learning classifier, such as logistic regression (see the sketch after this list).
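Here is a minimal sketch of that last step (my own illustration; the 'pretrained' RBM weights below are just random stand-ins, and scikit-learn's logistic regression is one convenient final classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = (rng.random((200, 6)) < 0.5).astype(float)  # toy binary inputs
y = (X[:, 0] + X[:, 1] > 1).astype(int)         # toy labels
W, d = rng.normal(0, 0.1, (6, 4)), np.zeros(4)  # stand-in "pretrained" RBM weights

H = sigmoid(d + X @ W)          # E[h | x]: expected hidden values as features
clf = LogisticRegression().fit(H, y)
print(clf.score(H, y))          # the classifier works on features, not raw x
```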

Direct encoding models: auto-encoder

To get around the problem of deriving useful feature values, an auto-encoder is a non-probabilistic alternative approach to deep learning where the hidden units produce usable numeric feature values. An auto-encoder directly maps an input x to a hidden layer h through a parameterized closed-form equation called an encoder. Typically, this encoder function is a nonlinear transformation of the input to h in the form:

$$h = f(x) = s(Wx + b)$$

This resulting transformation is the feature-vector or representation computed from input x.

Conversely, a decoder function is used to then map from this feature space h back to the input space, which results in a reconstruction x'. This decoder is also a parameterized closed-form equation that nonlinearly 'undoes' the encoding function:

$$x' = g(h) = s(W'h + d)$$

In both cases, the nonlinear function s is normally an element-wise sigmoid, hyperbolic tangent, or rectified linear unit.

Thus, the goal of an auto-encoder is to minimize a loss function over the reconstruction error given the training data. Model parameters $\Theta$ are $\{W, b, W', d\}$, with the weight matrix $W$ most often having 'tied' weights such that $W' = W^\top$.
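Putting the encoder, decoder, and reconstruction loss together, here is a toy numpy auto-encoder with tied weights trained by plain gradient descent (a minimal sketch of the idea, not a reference implementation; the sizes, data, and learning rate are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dx, dh = 6, 3
W = rng.normal(0, 0.1, (dh, dx))   # tied weights: the decoder uses W.T
b, d = np.zeros(dh), np.zeros(dx)  # encoder bias b, decoder bias d
X = (rng.random((200, dx)) < 0.5).astype(float)

lr = 0.5
for _ in range(200):
    for x in X:
        h = sigmoid(W @ x + b)          # encoder: the feature vector
        x_rec = sigmoid(W.T @ h + d)    # decoder: the reconstruction x'
        # Backprop of the squared reconstruction error ||x' - x||^2:
        delta2 = 2 * (x_rec - x) * x_rec * (1 - x_rec)
        delta1 = (W @ delta2) * h * (1 - h)
        W -= lr * (np.outer(delta1, x) + np.outer(h, delta2))  # both paths use W
        b -= lr * delta1
        d -= lr * delta2

x = X[0]                                # reconstruction roughly matches the input
print(np.round(x, 1), np.round(sigmoid(W.T @ sigmoid(W @ x + b) + d), 1))
```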

Stacking auto-encoders in layers is the same process as with RBMs: 

Stacked auto-encoders

One disadvantage of auto-encoders is that they can easily memorize the training data - i.e. find the model parameters that map every input seen to a perfect reconstruction with zero error - given enough hidden units h. To combat this problem, regularization is necessary, which gives rise to variants such as sparse auto-encoders, contractive auto-encoders, or denoising auto-encoders; the denoising variant's corruption step is sketched below.
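As one example of such regularization, the denoising variant corrupts the input before encoding but still measures reconstruction error against the clean input, so memorizing exact inputs no longer works. A sketch of the corruption step alone, assuming masking noise (one common choice; not spelled out in the post):

```python
import numpy as np

def corrupt(x, p=0.3, rng=np.random.default_rng(0)):
    """Masking noise: randomly zero a fraction p of the inputs."""
    mask = rng.random(x.shape) >= p
    return x * mask

# Training pair for a denoising auto-encoder: encode corrupt(x),
# but compute the reconstruction loss against the original clean x.
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
print(corrupt(x))
```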

A practical advantage of auto-encoder variants is that they define a simple, tractable optimization objective that can be used to monitor progress.

Challenges and looking forward

Deep learning is currently a very active research topic. Many problems stand in the way of reaching more general AI-level performance:

Scaling computations - the more complex the input space (such as harder AI problems), the larger the deep networks have to be to capture its representation. These computations scale much worse than linearly, and current research in parallelizing the training algorithms and creating convolutional architectures is meant to make these algorithms useful in practice. In a convolutional architecture, a hidden unit's output does not become the input for every hidden unit in the next layer; connections can be restricted to hidden units within the same spatial area (see the toy example below). Further, there are so many hyper-parameters for these algorithms (number of layers, hidden units, nonlinear functions, training procedures) that choosing them is considered an 'art'.
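To illustrate the local-connectivity idea (a toy example of my own, not from the post), a 1-D convolutional layer connects each hidden unit to only a small window of the input and reuses the same weights at every position:

```python
import numpy as np

def conv1d_layer(x, w, b=0.0):
    """Each output unit connects only to a local window of len(w) inputs,
    and the same weights w are shared across every position."""
    k = len(w)
    return np.array([x[i:i + k] @ w + b for i in range(len(x) - k + 1)])

x = np.arange(8, dtype=float)        # 8 input units
w = np.array([1.0, 0.0, -1.0])       # 3-tap local filter
print(conv1d_layer(x, w))            # 6 hidden units, each seeing only 3 inputs
```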

Optimization - as the input datasets grow larger and larger (growing faster than the size of the models), training error and generalization error converge. Optimization difficulty during training of deep architectures comes from both finding local minima and ill-conditioning (the two main types of difficulty in continuous optimization problems). Better optimization can have an impact on scaling computations, and is interesting to study for obtaining better generalization. Layer-wise pretraining has helped immensely in recent years with optimization during the training of deep architectures.

Inference and sampling - all probabilistic models except the RBM require a non-trivial form of inference (inferring the values of the latent variables h given the observed x). Inference and sampling techniques can be slow during training, and can run into difficulty because the distributions can be incredibly complex and often have a very large number of modes.

Disentangling - finding the 'underlying factors' that explain the input data. Complex input data arise from the interaction of many interrelated sources - such as lights casting shadows, object material properties, etc., in image recognition. Disentangling these factors would allow for very powerful cross-task learning, leading to representations that can 'zoom in' on the features relevant to the current problem. Disentangling is the most ambitious challenge presented here, as well as the one with the most far-reaching impact towards more general AI.

Conclusion

  • Deep learning is about creating an abstract hierarchical representation of the input data to create useful features for traditional machine learning algorithms. Each layer in the hierarchy learns a more abstract and complex feature of the data, such as edges to eyes to faces.
  • This representation gets its power of abstraction by stacking nonlinear functions, where the output of one layer becomes the input to the next.
  • The two main schools of thought for analyzing deep architectures are probabilistic vs. direct encoding.
  • The probabilistic interpretation means that each layer defines a distribution of hidden units given the observed input, P(h | x).
  • The direct encoding interpretation learns two separate functions - the encoder and decoder - to transform the observed input to the feature space and then back to the observed space.
  • These architectures have had great commercial success so far, powering many natural language processing and image recognition tasks at companies like Google and Microsoft.

If you would like to learn more about the subject, check out this awesome hub or Bengio's page!

If you are inclined to use deep learning in code, Theano is an amazing open-source Python package developed by Bengio's LISA group at the University of Montreal.

Update: HN discussion


from: http://markus.com/deep-learning-101/

