Deep learning has become something of a buzzword in recent years with the explosion of 'big data', 'data science', and their derivatives mentioned in the media. Justifiably, deep learning approaches have recently blown other state-of-the-art machine learning methods out of the water for standardized problems such as the MNIST handwritten digits dataset. My goal is to give you a layman's understanding of what deep learning actually is so you can follow some of my thesis research this year as well as mentally filter out news articles that sensationalize these buzzwords.
(Figure: sample handwritten digits from the MNIST dataset. source)
Imagine you are trying to recognize someone's handwriting - whether they drew a '7' or a '9'. From years of seeing handwritten digits, you automatically notice the vertical line with a horizontal top section. If you see a closed loop in the top section of the digit, you think it is a '9'. If it is more like a horizontal line, you think of it as a '7'. Easy enough. What it took for you to correctly recognize the digit, however, is an impressive display of fitting smaller features together to make the whole - noticing contrasted edges to make lines, seeing a horizontal vs. vertical line, noticing the positioning of the vertical section underneath the horizontal section, noticing a loop in the horizontal section, etc.
Ultimately, this is what deep learning or representation learning is meant to do: discover multiple levels of features that work together to define increasingly more abstract aspects of the data (in our case, initial image pixels to lines to full-blown numbers). This post is going to be a rough summary of two of Yoshua Bengio's main survey papers on representation learning.
Current machine learning algorithms' performance depends heavily on the particular features of the data chosen as inputs. For example, document classification (such as marking emails as spam or not) can be performed by breaking down the input document into bag-of-words or n-grams as features. Choosing the correct feature representation of input data, or feature engineering, is a way that people can bring prior knowledge of a domain to increase an algorithm's computational performance and accuracy. To move towards general artificial intelligence, algorithms need to be less dependent on this feature engineering and better learn to identify the explanatory factors of input data on their own.
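To make the feature-engineering step concrete, here is a minimal bag-of-words sketch in Python for the spam example above. The whitespace tokenization and the tiny vocabulary are illustrative assumptions; real systems add punctuation handling, stemming, n-grams, and so on.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Map a raw document to a fixed-length vector of word counts."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["free", "money", "meeting", "tomorrow"]
print(bag_of_words("Free money free money now", vocab))  # -> [2, 2, 0, 0]
```

The classifier never sees the raw text, only these counts; how well it performs depends heavily on whether the chosen vocabulary captures what matters.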
Deep learning tries to move in this direction by capturing a 'good' representation of input data by using compositions of non-linear transformations. A 'good' representation can be defined as one that disentangles underlying factors of variation for input data. It turns out that deep learning approaches can find useful abstract representations of data across many domains: it has had great commercial success powering most of Google and Microsoft's current speech recognition, image classification, natural language processing, object recognition, etc. Facebook is also planning on using deep learning approaches to understand its users^{1}. Deep learning has been so impactful in industry that MIT Technology Review named it as a top-10 breakthrough technology of 2013^{2}.
So how do you build a deep representation of input data? The central idea is to learn a hierarchy of features one level at a time, where the input to one computational level is the output of the previous level, for an arbitrary number of levels. In contrast, 'shallow' representations (most current algorithms, such as regression or SVMs) go directly from input data to output classification.
One good analogue for deep representations is neurons in the brain (a motivation for artificial neural networks): the output of a group of neurons is agglomerated as the input to more neurons, forming a hierarchical layer structure. Each layer N is composed of h computational nodes that connect to each computational node in layer N+1. For example:
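Here is a minimal numpy sketch of that layered wiring; the layer sizes, the tanh nonlinearity, and the random untrained weights are all illustrative assumptions rather than a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One fully-connected layer: every node feeds every node in the next
    layer. Weights are random here; learning them is the whole problem."""
    W = 0.01 * rng.standard_normal((x.shape[0], n_out))
    b = np.zeros(n_out)
    return np.tanh(W.T @ x + b)

x = rng.standard_normal(784)   # e.g. a flattened 28x28 MNIST image
h1 = layer(x, 500)             # layer N's output...
h2 = layer(h1, 200)            # ...is layer N+1's input
print(h1.shape, h2.shape)      # (500,) (200,)
```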
There are two main ways to interpret the computation performed by these layered deep architectures:

1. A probabilistic viewpoint: each hidden layer is a set of latent random variables, and the architecture describes a probability distribution over the data.
2. A direct encoding viewpoint: each hidden layer is a set of computed feature values, and the architecture is a deterministic function mapping the input to its representation.
To get started, principal component analysis (PCA) is a simple feature extraction algorithm that can span both of these interpretations. PCA learns a linear transform h = f(x) = W^T x + b, where W is a weight matrix for the inputs x and b is a bias. The columns of the d_x × d_h matrix W form an orthogonal basis for the d_h orthogonal directions of greatest variance in the input training data x. The result is d_h decorrelated features that make up the representation layer h.
From a probabilistic viewpoint, PCA is simply finding the principal eigenvectors of the covariance matrix of the data. This means that you are finding which directions of the input data explain the most variance^{3}.
From an encoding viewpoint, PCA is performing a linear computation over the input data to form a hidden representation h that has a lower dimensionality than the data.
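Both viewpoints show up in a few lines of numpy. A minimal sketch, assuming the covariance-eigendecomposition route (the centering step stands in for the bias b):

```python
import numpy as np

def pca(x, d_h):
    """Return d_h decorrelated features: projections of x onto the
    d_h orthogonal directions of greatest variance."""
    x_centered = x - x.mean(axis=0)          # centering plays the role of b
    cov = np.cov(x_centered, rowvar=False)   # covariance matrix of the data
    eigvals, eigvecs = np.linalg.eigh(cov)   # its principal eigenvectors
    top = np.argsort(eigvals)[::-1][:d_h]    # indices of the largest d_h
    W = eigvecs[:, top]                      # d_x x d_h orthogonal basis
    return x_centered @ W, W                 # h = W^T x for each sample

x = np.random.default_rng(0).standard_normal((200, 10))
h, W = pca(x, d_h=3)
print(h.shape)                               # (200, 3)
print(np.round(np.cov(h, rowvar=False), 4))  # near-diagonal matrix
```

The covariance of the returned h is near-diagonal, which is exactly the decorrelation property described above.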
Note that because PCA is a linear transformation of the input x, it cannot really be stacked in layers: composing two linear transforms gives h = W_2^T (W_1^T x + b_1) + b_2 = (W_1 W_2)^T x + (W_2^T b_1 + b_2), which is just another linear operation of the same form, so there would be no abstraction benefit from multiple layers. To form powerful deep representations, we will look at stacking Restricted Boltzmann Machines (RBMs) from the probability viewpoint and nonlinear auto-encoders from the direct encoding viewpoint.
A Boltzmann machine is a network of symmetrically-coupled binary random variables or units. This means that it is a fully-connected, undirected graph. This graph can be divided into two parts:

1. The visible units x, which represent the observed data.
2. The hidden units h, which capture the dependencies between the visible units.
(Figure: a graphical representation of an example Boltzmann machine. Each undirected edge represents a dependency; in this example there are 3 hidden units and 4 visible units. source)
Boltzmann machines describe this pattern of interaction through the distribution over the joint space [x, h] with the energy function:

E(x, h) = -(1/2) x^T U x - (1/2) h^T V h - x^T W h - b^T x - d^T h

where the model parameters Θ are {U, V, W, b, d}: U couples visible units to each other, V couples hidden units to each other, W couples visible units to hidden units, and b and d are the visible and hidden biases.
Trying to evaluate conditional probabilities over this fully-connected graph ends up being an intractable problem. For example, computing the conditional probability of a hidden variable given the visibles, P(h_i | x), requires marginalizing over all the other hidden variables: a sum with 2^{d_h - 1} terms, which for even 100 hidden units is already about 6 × 10^{29} terms.
However, we can restrict the graph from being fully connected to only containing the interactions between the visible units x and hidden units h.
This gives us an RBM, which is a bipartite graph with the visible and hidden units forming two distinct layers. Calculating the conditional distribution P(h_i | x) is readily tractable and now factorizes to:

P(h | x) = ∏_i P(h_i | x), with P(h_i = 1 | x) = sigmoid(d_i + x^T W_i)

where W_i denotes the i-th column of W.
Very successful deep learning algorithms stack multiple RBMs together, where the hiddens h from the visible input data x become the new input data for another RBM for an arbitrary number of layers.
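A hedged numpy sketch of that stacking, using the factorized conditional above (the weights are random and untrained, and the contrastive-divergence training step is omitted entirely):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_hiddens(x, W, d):
    """Each P(h_i = 1 | x) is an independent sigmoid, so the whole
    hidden layer can be sampled at once."""
    p = sigmoid(d + x @ W)
    return (rng.random(p.shape) < p).astype(float)

# Untrained, randomly initialized parameters: shapes only, for illustration.
x = (rng.random(784) < 0.5).astype(float)           # binary visible vector
W1, d1 = 0.01 * rng.standard_normal((784, 500)), np.zeros(500)
W2, d2 = 0.01 * rng.standard_normal((500, 200)), np.zeros(200)

h1 = sample_hiddens(x, W1, d1)    # first RBM's hiddens...
h2 = sample_hiddens(h1, W2, d2)   # ...are the second RBM's visible data
print(h1.shape, h2.shape)         # (500,) (200,)
```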
There are a few drawbacks to the probabilistic approach to deep architectures:

1. Inference is hard: quantities such as the partition function and the posteriors over the latent variables are generally intractable and must be approximated.
2. The hidden units are random variables, so a trained layer yields a distribution rather than directly usable numeric feature values.
To get around the problem of deriving useful feature values, an auto-encoder is a non-probabilistic alternative approach to deep learning where the hidden units produce usable numeric feature values. An auto-encoder directly maps an input x to a hidden layer h through a parameterized closed-form equation called an encoder. Typically, this encoder function is a nonlinear transformation of the input to h of the form:
h = f(x) = s(Wx + b)
The result h of this transformation is the feature vector or representation computed from the input x.
Conversely, a decoder function is used to then map from this feature space h back to the input space, which results in a reconstruction x'. This decoder is also a parameterized closed-form equation that nonlinearly 'undoes' the encoding function:
x' = g(h) = s(W'h + d)
In both cases, the nonlinear function s is normally an element-wise sigmoid, hyperbolic tangent, or rectified linear unit.
Thus, the goal of an auto-encoder is to minimize a loss function over the reconstruction error given the training data. Model parameters Θ are {W, b, W', d}, with the weight matrix W most often having 'tied' weights such that W' = W^T.
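A minimal sketch of a single tied-weight auto-encoder in numpy, assuming a sigmoid nonlinearity and a squared-error reconstruction loss (the gradient-descent update on W, b, and d is the training step, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tied weights: W' = W^T, so the learned parameters are just W, b, and d.
n_in, n_hidden = 64, 16
W = 0.1 * rng.standard_normal((n_in, n_hidden))
b, d = np.zeros(n_hidden), np.zeros(n_in)

def forward(x):
    h = sigmoid(x @ W + b)        # encoder h = s(Wx + b), batched row-wise
    x_rec = sigmoid(h @ W.T + d)  # decoder x' = s(W'h + d), with W' = W^T
    return h, x_rec

x = rng.random((32, n_in))        # a toy batch of 32 inputs
h, x_rec = forward(x)
loss = np.mean((x - x_rec) ** 2)  # reconstruction error to be minimized
print(h.shape, x_rec.shape, loss)
```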
Stacking auto-encoders in layers is the same process as with RBMs: train the first auto-encoder on the input data x, then use its hidden representation h as the input data for the next auto-encoder, and repeat for an arbitrary number of layers, as sketched below.
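In loop form, the greedy layer-wise procedure looks like the following sketch, where train_autoencoder and encode are hypothetical helpers standing in for the training step above, not a real API:

```python
def stack_autoencoders(x, layer_sizes):
    """Greedy layer-wise stacking: each auto-encoder is trained on the
    representation produced by the one beneath it.
    train_autoencoder and .encode are hypothetical helpers."""
    layers, data = [], x
    for n_hidden in layer_sizes:
        ae = train_autoencoder(data, n_hidden)  # hypothetical training step
        layers.append(ae)
        data = ae.encode(data)                  # hiddens feed the next layer
    return layers
```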
One disadvantage of auto-encoders is that they can easily memorize the training data - i.e. find the model parameters that map every input seen to a perfect reconstruction with zero error - given enough hidden units h. To combat this problem, regularization is necessary, which gives rise to variants such as sparse auto-encoders, contractive auto-encoders, or denoising auto-encoders.
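For instance, the denoising variant corrupts each input and trains the model to reconstruct the original. A sketch of just the corruption step (the 30% masking rate is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, mask_prob=0.3):
    """Randomly zero out inputs. Trained to reconstruct the *clean* x from
    this corrupted version, the model cannot simply memorize an identity map."""
    return x * (rng.random(x.shape) >= mask_prob)

x = rng.random((4, 8))
x_noisy = corrupt(x)   # encode x_noisy, but measure reconstruction against x
```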
A practical advantage of auto-encoder variants is that they define a simple, tractable optimization objective that can be used to monitor progress.
Deep learning is currently a very active research topic. Many problems stand in the way of reaching more general AI-level performance:
Scaling computations - the more complex the input space (as in harder AI problems), the larger the deep networks have to be to capture its representation. These computations scale much worse than linearly, and current research into parallelizing the training algorithms and creating convolutional architectures is meant to make these algorithms useful in practice. Convolutional architectures mean that a hidden unit's output does not become an input for every hidden unit in the next layer; connections can be restricted to hidden units within the same spatial area (see the sketch after this list). Further, there are so many hyper-parameters for these algorithms (number of layers, hidden units, nonlinear functions, training procedures) that choosing them has come to be considered an 'art'.
Optimization - as input datasets grow larger and larger (growing faster than the size of the models), training error and generalization error converge. Optimization difficulty during the training of deep architectures comes from both finding local minima and ill-conditioning (the two main types of difficulty in continuous optimization problems). Better optimization can have an impact on scaling computations, and is also interesting to study for obtaining better generalization. Layer-wise pretraining has helped immensely in recent years with optimizing deep architectures during training.
Inference and sampling - all of the probabilistic models except the RBM require a non-trivial form of inference (guessing values of the latent variables h given the conditional distribution over x). Inference and sampling techniques can be slow during training, and can struggle because the distributions involved are incredibly complex, often with a very large number of modes.
Disentangling - finding the 'underlying factors' that explain the input data. Complex input data arise from the interaction of many interrelated sources - such as lights casting shadows, object material properties, etc. for image recognition. This would allow for very powerful cross-task learning, leading to a representation that can 'zoom in' on the relevant features in the learned representation given the current problem. Disentanglement is the most ambitious challenge presented so far, as well as the one with the most far-reaching impact towards more general AI.
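Returning to the convolutional idea from the scaling point above, a one-dimensional sketch of restricted connectivity (the filter values are arbitrary):

```python
import numpy as np

def local_layer(x, w):
    """Restricted connectivity: each output unit sees only a small spatial
    window of the input, with the same weights w reused at every position."""
    return np.convolve(x, w, mode="valid")

x = np.arange(10, dtype=float)
w = np.array([0.25, 0.5, 0.25])   # a 3-wide shared filter (arbitrary values)
print(local_layer(x, w))          # 8 outputs, each from a 3-wide neighborhood
```

Contrast this with the fully-connected layer sketched earlier, where every hidden unit saw the entire input.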
If you would like to learn more about the subject, check out this awesome hub or Bengio's page!
If you are inclined to use deep learning in code, Theano is an amazing open-source Python package developed by Bengio's LISA group at the University of Montreal.
Update: HN discussion
Source: http://markus.com/deep-learning-101/