Wednesday, August 16, 2017

On the Expressive Power of Deep Neural Networks

In a Twitter exchange with Mark, Surya mentioned four background papers from his group supporting his ICML presentation last week. Here they are:





We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings can be summarized as follows:
(1) The complexity of the computed function grows exponentially with depth.
(2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights.
(3) Regularizing on trajectory length (trajectory regularization) is a simpler alternative to batch normalization, with the same performance.
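
To make the trajectory length idea above a little more concrete, here is a minimal sketch (my own toy illustration, not the authors' code): sweep the input of a random tanh network along a circle and measure how much longer the output path gets as depth grows.

import numpy as np

def random_net(depth, width, sigma_w=2.0, sigma_b=1.0, rng=None):
    # i.i.d. Gaussian weights/biases, variances scaled with width as in mean field analyses
    rng = rng or np.random.default_rng(0)
    Ws = [rng.normal(0.0, sigma_w / np.sqrt(width), (width, width)) for _ in range(depth)]
    bs = [rng.normal(0.0, sigma_b, width) for _ in range(depth)]
    def f(x):
        h = x
        for W, b in zip(Ws, bs):
            h = np.tanh(W @ h + b)
        return h
    return f

def trajectory_length(f, width, n_points=2000, rng=None):
    # sweep the input along a circle in a random 2-D plane and sum the output increments
    rng = rng or np.random.default_rng(1)
    u, v = rng.normal(size=(2, width))
    ts = np.linspace(0.0, 2.0 * np.pi, n_points)
    outs = np.array([f(np.cos(t) * u + np.sin(t) * v) for t in ts])
    return np.sum(np.linalg.norm(np.diff(outs, axis=0), axis=1))

for depth in (1, 2, 4, 8, 12):
    net = random_net(depth, width=100)
    print(depth, trajectory_length(net, width=100))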


We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.
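
For readers who want to play with this mean field picture, here is a small sketch (my own illustration with hand-picked sigma_w and sigma_b values, not the paper's code) that iterates the variance map q^{l+1} = sigma_w^2 E[phi(sqrt(q^l) z)^2] + sigma_b^2 and the corresponding correlation map for a random tanh network, using Monte Carlo estimates of the Gaussian expectations.

import numpy as np

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(2, 200_000))          # Gaussian samples for the expectations
phi = np.tanh

def variance_map(q, sw2, sb2):
    return sw2 * np.mean(phi(np.sqrt(q) * z1) ** 2) + sb2

def correlation_map(c, qstar, sw2, sb2):
    u1 = np.sqrt(qstar) * z1
    u2 = np.sqrt(qstar) * (c * z1 + np.sqrt(1.0 - c ** 2) * z2)
    return (sw2 * np.mean(phi(u1) * phi(u2)) + sb2) / qstar

sw2, sb2 = 1.5, 0.05                             # sigma_w^2, sigma_b^2 (chaotic side for tanh)
q = 1.0
for _ in range(50):                              # iterate to the fixed point q*
    q = variance_map(q, sw2, sb2)

c = 0.99
for layer in range(20):                          # how correlations between two inputs evolve with depth
    c = correlation_map(c, q, sw2, sb2)
    print(layer, round(c, 4))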


We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in generic, deep neural networks with random weights. Our results reveal an order-to-chaos expressivity phase transition, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth but not width. We prove this generic class of deep random functions cannot be efficiently computed by any shallow network, going beyond prior work restricted to the analysis of single functions. Moreover, we formalize and quantitatively demonstrate the long conjectured idea that deep networks can disentangle highly curved manifolds in input space into flat manifolds in hidden space. Our theoretical analysis of the expressive power of deep networks broadly applies to arbitrary nonlinearities, and provides a quantitative underpinning for previously abstract notions about the geometry of deep functions.




Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
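
The random orthogonal initialization mentioned at the end of the abstract is easy to try; here is a minimal sketch (plain NumPy, my own illustration) that draws an orthogonal weight matrix from the QR decomposition of a Gaussian matrix.

import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    # Q factor of a Gaussian matrix; sign fix makes the distribution uniform (Haar)
    rng = rng or np.random.default_rng(0)
    a = rng.normal(size=shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))
    return gain * q

W = orthogonal_init((256, 256))
print(np.allclose(W.T @ W, np.eye(256), atol=1e-8))   # columns are orthonormal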


Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Thesis: Towards Scalable Gradient-Based Hyperparameter Optimization in Deep Neural Networks by Jie Fu

Although a little late, congratulations Dr. Fu !



Towards Scalable Gradient-Based Hyperparameter Optimization in Deep Neural Networks by Jie Fu
It is well-known that the performance of large-sized deep neural networks (DNNs) is sensitive to the setting of their hyperparameters. Hyperparameter optimization is thus recognized as a crucial step in the process of applying DNNs to achieve best performance and drive industrial applications. For many years, the de-facto standard for hyperparameter tuning in deep learning has been a simple grid search. Recently, Bayesian optimization has been proposed for automatic hyperparameter tuning. However, it can hardly tune more than 20 hyperparameters simultaneously. Furthermore, the elementary- and hyper-parameter optimization tasks are usually solved separately, where the hyperparameter optimization process, defined as the outer loop, does not make full use of the inner elementary optimization process. To address these issues, we propose effective, efficient and scalable gradient-based methods for optimizing elementary- and hyper-parameters in DNNs in a unified manner. The first is a novel approximate method, DrMAD, for obtaining gradients with respect to hyperparameters based on asymmetric reverse-mode automatic differentiation. It is 15 ∼ 45 times faster and consumes 50 ∼ 100 times less memory on a variety of benchmark datasets compared to the state-of-the-art methods for optimizing hyperparameters, with minimal compromise to its effectiveness. Inspired by the approximate nature of DrMAD, we develop an adaptive and approximate gradient-based method for optimizing elementary parameters in DNNs, which is more effective. We also propose an effective, efficient and scalable neural optimizer using a recurrent neural network (RNN) for tuning dynamic parameter-wise hyperparameters of another DNN. The proposed neural optimizer is trained using the approximate hypergradients obtained from DrMAD. Extensive experiments show that our approach outperforms the state-of-the-art neural optimizer in terms of classification accuracy of the DNN being optimized for long horizons, and converges at least 20 times faster and consumes about 100 times less memory. To the best of our knowledge, the works described in this thesis represent the first forays into scalable gradient-based methods for elementary- and hyper-parameter optimization in DNNs in a unified manner.



The performance of deep neural networks is well-known to be sensitive to the setting of their hyperparameters. Recent advances in reverse-mode automatic differentiation allow for optimizing hyperparameters with gradients. The standard way of computing these gradients involves a forward and backward pass of computations. However, the backward pass usually needs to consume unaffordable memory to store all the intermediate variables to exactly reverse the forward training procedure. In this work we propose a simple but effective method, DrMAD, to distill the knowledge of the forward pass into a shortcut path, through which we approximately reverse the training trajectory. Experiments on several image benchmark datasets show that DrMAD is at least 45 times faster and consumes 100 times less memory compared to state-of-the-art methods for optimizing hyperparameters with minimal compromise to its effectiveness. To the best of our knowledge, DrMAD is the first research attempt to make it practical to automatically tune thousands of hyperparameters of deep neural networks. The code can be downloaded from this https URL

The DrMAD GitHub is here: https://github.com/bigaidream-projects/drmad
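
To give a flavor of the shortcut idea behind DrMAD, here is a toy sketch (my own illustration, not the DrMAD code): train a small least-squares model, keep only the initial and final weights, and reconstruct the intermediate weights by linear interpolation, as the approximate reverse pass would, instead of storing the whole trajectory.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 50))
y = rng.normal(size=200)
lr, steps = 0.01, 100

w = np.zeros(50)
trajectory = [w.copy()]                          # what an exact reverse pass would have to store
for _ in range(steps):
    grad = A.T @ (A @ w - y) / len(y)
    w = w - lr * grad
    trajectory.append(w.copy())

w0, wT = trajectory[0], trajectory[-1]
# DrMAD-style shortcut: reconstruct w_t on the segment between w0 and wT only
recon = [(1 - t / steps) * w0 + (t / steps) * wT for t in range(steps + 1)]

err = np.mean([np.linalg.norm(a - b) / (np.linalg.norm(a) + 1e-12)
               for a, b in zip(trajectory[1:], recon[1:])])
print(f"mean relative reconstruction error along the trajectory: {err:.3f}")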

Monday, August 14, 2017

Randomization or Condensation?: Linear-Cost Matrix Sketching Via Cascaded Compression Sampling / A Bootstrap Method for Error Estimation in Randomized Matrix Multiplication / Effective sketching methods for value function approximation

We are starting the week with some sketching and randomized approaches ! 







Matrix sketching is aimed at finding compact representations of a matrix while simultaneously preserving most of its properties, which is a fundamental building block in modern scientific computing. Randomized algorithms represent the state of the art and have attracted huge interest from the fields of machine learning, data mining, and theoretical computer science. However, they still require the use of the entire input matrix in producing desired factorizations, which can be a major computational and memory bottleneck in truly large problems. In this paper, we uncover an interesting theoretical connection between matrix low-rank decomposition and lossy signal compression, based on which a cascaded compression sampling framework is devised to approximate an m-by-n matrix in only O(m+n) time and space. Indeed, the proposed method accesses only a small number of matrix rows and columns, which significantly improves the memory footprint. Meanwhile, by sequentially teaming two rounds of approximation procedures and upgrading the sampling strategy from a uniform probability to more sophisticated, encoding-oriented sampling, significant algorithmic boosting is achieved to uncover more granular structures in the data. Empirical results on a wide spectrum of real-world, large-scale matrices show that by taking only linear time and space, the accuracy of our method rivals those state-of-the-art randomized algorithms consuming a quadratic, O(mn), amount of resources.
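
The row-and-column-access theme is easy to illustrate. Here is a minimal sketch of a basic CUR-style approximation (not the paper's cascaded compression sampling, just the generic sampling idea it refines) that reads only a few rows and columns of the matrix.

import numpy as np

rng = np.random.default_rng(0)
m, n, r, k = 2000, 1500, 10, 40
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))     # exactly rank-r test matrix

rows = rng.choice(m, size=k, replace=False)
cols = rng.choice(n, size=k, replace=False)
C = A[:, cols]                                  # k columns, read in full
R = A[rows, :]                                  # k rows, read in full
U = np.linalg.pinv(A[np.ix_(rows, cols)])       # small k-by-k core
A_hat = C @ U @ R

print(np.linalg.norm(A - A_hat) / np.linalg.norm(A))      # ~0 for an exactly rank-r matrix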



In recent years, randomized methods for numerical linear algebra have received growing interest as a general approach to large-scale problems. Typically, the essential ingredient of these methods is some form of randomized dimension reduction, which accelerates computations, but also creates random approximation error. In this way, the dimension reduction step encodes a tradeoff between cost and accuracy. However, the exact numerical relationship between cost and accuracy is typically unknown, and consequently, it may be difficult for the user to precisely know (1) how accurate a given solution is, or (2) how much computation is needed to achieve a given level of accuracy. In the current paper, we study randomized matrix multiplication (sketching) as a prototype setting for addressing these general problems. As a solution, we develop a bootstrap method for directly estimating the accuracy as a function of the reduced dimension (as opposed to deriving worst-case bounds on the accuracy in terms of the reduced dimension). From a computational standpoint, the proposed method does not substantially increase the cost of standard sketching methods, and this is made possible by an "extrapolation" technique. In addition, we provide both theoretical and empirical results to demonstrate the effectiveness of the proposed method.
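
Here is a rough sketch of the kind of bootstrap the paper describes (my own toy setup, assuming uniform row sampling): estimate A^T B from a few rescaled sampled rows, then bootstrap those same sampled rows to estimate the error of the sketched product without ever forming the exact one.

import numpy as np

rng = np.random.default_rng(0)
n, p, q, t = 10_000, 30, 20, 500
A = rng.normal(size=(n, p))
B = rng.normal(size=(n, q))

idx = rng.integers(0, n, size=t)
As = A[idx] * np.sqrt(n / t)                     # rescaled sampled rows (unbiased estimator)
Bs = B[idx] * np.sqrt(n / t)
est = As.T @ Bs                                  # sketched estimate of A.T @ B

boot_errs = []
for _ in range(100):                             # bootstrap the sampled rows only
    b = rng.integers(0, t, size=t)
    boot_errs.append(np.linalg.norm(As[b].T @ Bs[b] - est))
print("bootstrap error estimate:", np.quantile(boot_errs, 0.9))
print("actual error            :", np.linalg.norm(est - A.T @ B))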

High-dimensional representations, such as radial basis function networks or tile coding, are common choices for policy evaluation in reinforcement learning. Learning with such high-dimensional representations, however, can be expensive, particularly for matrix methods, such as least-squares temporal difference learning or quasi-Newton methods that approximate matrix step-sizes. In this work, we explore the utility of sketching for these two classes of algorithms. We highlight issues with sketching the high-dimensional features directly, which can incur significant bias. As a remedy, we demonstrate how to use sketching more sparingly, with only a left-sided sketch, that can still enable significant computational gains and the use of these matrix-based learning algorithms that are less sensitive to parameters. We empirically investigate these algorithms, in four domains with a variety of representations. Our aim is to provide insights into effective use of sketching in practice.



Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !

Saturday, August 12, 2017

Saturday Morning Videos: Convolutional Neural Networks for Visual Recognition (Spring 2017 Stanford CS 231n)

Justin mentioned this on his Twitter feed:

Friday, August 11, 2017

30 new Numerical Tours in Julia - implementation -

Gabriel mentioned it on his Twitter feed:
More than 30 new Numerical Tours in Julia have just been released. They consist of all the most important Numerical Tours and cover all the topics, from wavelet image denoising to 3D mesh parameterization. Enjoy!
This conversion has been funded by the ERC project SIGMA-Vision and has been performed by Ayman Chaouki and Quentin Moret; congratulations on this nice work!
These are the Julia tours, which can be browsed as HTML pages but can also be downloaded as iPython notebooks. Please read the installation page for more information about how to run these tours.
Note that porting all the Numerical Tours to Julia is a work in progress. Help is welcome; please refer to the GitHub repository for how to proceed.
The topics covered include: Basics; Wavelets; Approximation, Coding and Compression; Denoising; Inverse Problems; Optimization; Shapes; Audio Processing; Computer Graphics; Mesh Parameterization and Deformation; Geodesic Processing; Optimal Transport.


Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Thursday, August 10, 2017

Solving ill-posed inverse problems using iterative deep neural networks / Jobs: 2 Postdocs @ KTH, Sweden - implementation -

Ozan just sent me the following e-mail. It has the right mix of elements of The Great Convergence: applying learning-to-learn methods to inverse problems that we thought compressive sensing could solve well (CT tomography), papers supporting those results, an implementation, a blog entry and two postdoc jobs. Awesome !
Dear Igor,


I have for some time followed your excellent blog Nuit Blanche. I'm not familiar with how you select entries for Nuit Blanche, but let me take the opportunity to provide potential input for Nuit Blanche on the exciting research we pursue at the Department of Mathematics, KTH Royal Institute of Technology. If you find any of this interesting, please feel free to post it on Nuit Blanche.


1. Deep learning and tomographic image reconstruction
The main objective for the research is to develop theory and algorithms for 3D tomographic reconstruction. An important recent development has been to use techniques from deep learning to solve inverse problems. We have developed a rather generic, yet adaptable, framework that combines elements of variational regularization with machine learning for solving large scale inverse problems. More precisely, the idea is to learn a reconstruction scheme by making use of the forward operator, noise model and other a priori information. This goes beyond learning a denoiser, where one first performs an initial (non machine-learning) reconstruction and then uses machine learning on the resulting image-to-image (denoising) problem. Several groups have taken this learned-denoiser route and the results are in fact quite remarkable, outperforming previous state-of-the-art methods. Our approach, however, combines the reconstruction and denoising steps, which further improves the results. The following two arXiv reports, http://arxiv.org/abs/1707.06474 and http://arxiv.org/abs/1704.04058, provide more details; there is also a blog post at http://adler-j.github.io/2017/07/21/Learning-to-reconstruct.html by one of our PhD students that explains this idea of "learning to reconstruct".


2. Post doctoral fellowships
I'm looking to fill two 2-year post-doctoral fellowships, one dealing with regularization of spatiotemporal and/or multichannel images and the other with methods for combining elements of variational regularization with deep learning for solving inverse problems. The announcements are given below. I would be glad if you could post these also on your blog.


Postdoctoral fellow in PET/SPECT Image Reconstruction (S-2017-1166)
Deadline: December 1, 2017
Brief description:
The position includes research & development of algorithms for PET and SPECT image reconstruction. Work is closely related to on-going research on (a) multi-channel regularization for PET/CT and SPECT/CT imaging, (b) joint reconstruction and image matching for spatio-temporal pulmonary PET/CT and cardiac SPECT/CT imaging, and (c) task-based reconstruction by iterative deep neural networks. An important part is to integrate routines for forward and backprojection from reconstruction packages like STIR and EMrecon for PET and NiftyRec for SPECT with ODL (http://github.com/odlgroup/odl), our Python based framework for reconstruction. Part of the research may include industrial (Elekta and Philips Healthcare) and clinical (Karolinska University Hospital) collaboration.
Announcement & instructions:
http://www.kth.se/en/om/work-at-kth/lediga-jobb/what:job/jobID:158920/type:job/where:4/apply:1

Postdoctoral fellow in Image Reconstruction/Deep Dictionary Learning (S-2017-1165)
Deadline: December 1, 2017
Brief description:

The position includes research & development of theory and algorithms that combine methods from machine learning with sparse signal processing for joint dictionary design and image reconstruction in tomography. A key element is to design dictionaries that not only yield sparse representation, but also contain discriminative information. Methods will be implemented in ODL (http://github.com/odlgroup/odl), our Python based framework for reconstruction which enables one to utilize the existing integration between ODL and TensorFlow. The research is part of a larger effort that aims to combine elements of variational regularization with machine learning for solving large scale inverse problems, see the arXiv-reports http://arxiv.org/abs/1707.06474 and http://arxiv.org/abs/1704.04058 and the blog-post at http://adler-j.github.io/2017/07/21/Learning-to-reconstruct.html for further details. Part of the research may include industrial (Elekta and Philips Healthcare) and clinical (Karolinska University Hospital) collaboration.
Announcement & instructions:
http://www.kth.se/en/om/work-at-kth/lediga-jobb/what:job/jobID:158923/type:job/where:4/apply:1




Best regards,
Ozan


--

Assoc. Prof. Ozan Öktem
Director, KTH Life Science Technology Platform
Web: http://ww.kth.se/lifescience


Department of Mathematics
KTH Royal Institute of Technology
SE-100 44 Stockholm, Sweden
E-mail: ozan@kth.se




Learned Primal-dual Reconstruction by Jonas Adler, Ozan Öktem

We propose a Learned Primal-Dual algorithm for tomographic reconstruction. The algorithm includes the (possibly non-linear) forward operator in a deep neural network inspired by unrolled proximal primal-dual optimization methods, but where the proximal operators have been replaced with convolutional neural networks. The algorithm is trained end-to-end, working directly from raw measured data and does not depend on any initial reconstruction such as FBP.
We evaluate the algorithm on low dose CT reconstruction using both analytic and human phantoms against classical reconstruction given by FBP and TV regularized reconstruction as well as deep learning based post-processing of a FBP reconstruction.
For the analytic data we demonstrate PSNR improvements of >10 dB when compared to both TV reconstruction and learned post-processing. For the human phantom we demonstrate a 6.6 dB improvement compared to TV and a 2.2 dB improvement as compared to learned post-processing. The proposed algorithm also improves upon the compared algorithms with respect to SSIM, and the evaluation time is approximately 600 ms for a 512 x 512 pixel dataset.

Solving ill-posed inverse problems using iterative deep neural networks by Jonas Adler, Ozan Öktem
We propose a partially learned approach for the solution of ill-posed inverse problems with not necessarily linear forward operators. The method builds on ideas from classical regularization theory and recent advances in deep learning to perform learning while making use of prior information about the inverse problem encoded in the forward operator, noise model and a regularizing functional. The method results in a gradient-like iterative scheme, where the "gradient" component is learned using a convolutional network that includes the gradients of the data discrepancy and regularizer as input in each iteration. We present results of such a partially learned gradient scheme on a non-linear tomographic inversion problem with simulated data from both the Shepp-Logan phantom as well as a head CT. The outcome is compared against FBP and TV reconstruction and the proposed method provides a 5.4 dB PSNR improvement over the TV reconstruction while being significantly faster, giving reconstructions of 512 x 512 volumes in about 0.4 seconds using a single GPU.
An implementation is here: https://github.com/adler-j/learned_gradient_tomography
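
For readers who want a feel for the structure of such a learned, unrolled scheme, here is a stripped-down sketch (my own simplification; the real implementation is in the repository above). The convolutional update network is replaced by a tiny hand-set linear map acting on the current iterate and the gradients of the data term and a prior, and the forward operator is a random matrix.

import numpy as np

rng = np.random.default_rng(0)
n, m, n_iter = 64, 32, 10
A = rng.normal(size=(m, n)) / np.sqrt(m)        # stand-in for the forward operator
x_true = rng.normal(size=n)
y = A @ x_true + 0.01 * rng.normal(size=m)      # simulated measured data

def grad_data(x):       # gradient of the data discrepancy 0.5 * ||A x - y||^2
    return A.T @ (A @ x - y)

def grad_prior(x):      # gradient of a simple quadratic regularizer 0.5 * ||x||^2
    return x

# hand-set stand-ins for the learned per-iteration update; in the actual method
# a convolutional network trained end-to-end maps the stacked channels to the update
theta = [np.array([0.0, -0.2, -0.01]) for _ in range(n_iter)]

x = np.zeros(n)
for k in range(n_iter):
    channels = np.stack([x, grad_data(x), grad_prior(x)])   # (3, n), like CNN input channels
    x = x + theta[k] @ channels                             # gradient-like learned update
    print(k, np.linalg.norm(A @ x - y))                     # data residual decreases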
 
 
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Wednesday, August 09, 2017

SPORCO: A Python package for standard and convolutional sparse representations - implementation -




Brendt just sent me the following:

Hi Igor,
I noticed that you maintain an extensive list of software tools for sparse representations and related problems. Could you please add a reference to SPORCO, which is a relatively new library providing algorithms for sparse coding and dictionary learning? It supports standard sparse representations as well as a variety of other problems, including ℓ1-TV and ℓ2-TV regularization and Robust PCA, but the major strength is in algorithms for convolutional sparse coding and dictionary learning (the form of sparse coding inspired by deconvolutional networks). 
A Matlab version is available at 

but development is now focused on the Python version, available on GitHub at

The Python version features an object-oriented design that allows the existing ADMM algorithms to be extended or modified with limited effort, as described in some detail in a paper presented at the recent SciPy conference.

Thanks,
Brendt

Thanks Brendt ! Let me add this to the Advanced Matrix Factorization Jungle page in the coming days. In the meantime, here is the paper:  



SParse Optimization Research COde (SPORCO) is an open-source Python package for solving optimization problems with sparsity-inducing regularization, consisting primarily of sparse coding and dictionary learning, for both standard and convolutional forms of sparse representation. In the current version, all optimization problems are solved within the Alternating Direction Method of Multipliers (ADMM) framework. SPORCO was developed for applications in signal and image processing, but is also expected to be useful for problems in computer vision, statistics, and machine learning.
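
As a flavor of the kind of problem SPORCO addresses, here is a minimal generic ADMM solver for basis pursuit denoising (my own illustration in plain NumPy, not SPORCO's API).

import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def bpdn_admm(D, s, lmbda, rho=1.0, n_iter=300):
    # solve min_x 0.5*||D x - s||^2 + lmbda*||x||_1 with an x/z splitting
    n = D.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    DtD, Dts = D.T @ D, D.T @ s
    L = np.linalg.cholesky(DtD + rho * np.eye(n))     # factor once, reuse every iteration
    for _ in range(n_iter):
        rhs = Dts + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        z = soft_threshold(x + u, lmbda / rho)
        u = u + x - z
    return z

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)                        # unit-norm dictionary atoms
x0 = np.zeros(256)
x0[rng.choice(256, 8, replace=False)] = rng.normal(size=8)
s = D @ x0 + 0.01 * rng.normal(size=64)
x_hat = bpdn_admm(D, s, lmbda=0.05)
print("nonzeros in the recovered code:", np.count_nonzero(x_hat))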

Sketched Subspace Clustering



Sketched Subspace Clustering by Panagiotis A. Traganitis, Georgios B. Giannakis
The immense amount of daily generated and communicated data presents unique challenges in their processing. Clustering, the grouping of data without the presence of ground-truth labels, is an important tool for drawing inferences from data. Subspace clustering (SC) is a relatively recent method that is able to successfully classify nonlinearly separable data in a multitude of settings. In spite of their high clustering accuracy, SC methods incur prohibitively high computational complexity when processing large volumes of high-dimensional data. Inspired by random sketching approaches for dimensionality reduction, the present paper introduces a randomized scheme for SC, termed Sketch-SC, tailored for large volumes of high-dimensional data. Sketch-SC accelerates the computationally heavy parts of state-of-the-art SC approaches by compressing the data matrix across both dimensions using random projections, thus enabling fast and accurate large-scale SC. Performance analysis as well as extensive numerical tests on real data corroborate the potential of Sketch-SC and its competitive performance relative to state-of-the-art scalable SC approaches.
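
The two-sided compression idea is simple to sketch. Here is a toy illustration (not the Sketch-SC algorithm itself) of shrinking a data matrix across both dimensions with random projections before any expensive clustering step; the low-rank structure of the data survives the sketch.

import numpy as np

rng = np.random.default_rng(0)
d, n, d_sk, n_sk = 2000, 5000, 100, 300
# data lying near a union of low-dimensional subspaces is the SC setting;
# a single 5-dimensional subspace keeps this toy example short
X = rng.normal(size=(d, 5)) @ rng.normal(size=(5, n))

S_left = rng.normal(size=(d_sk, d)) / np.sqrt(d_sk)    # compress the feature dimension
S_right = rng.normal(size=(n, n_sk)) / np.sqrt(n_sk)   # compress the sample dimension
X_sk = S_left @ X @ S_right                            # small (d_sk x n_sk) sketch

print(X.shape, "->", X_sk.shape)
print(np.round(np.linalg.svd(X_sk, compute_uv=False)[:8], 1))   # ~5 dominant singular values remain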








Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.

Tuesday, August 08, 2017

When is Network Lasso Accurate?, The Network Nullspace Property for Compressed Sensing of Big Data over Networks, Semi-Supervised Learning via Sparse Label Propagation




Alex just mentioned this to me:

I started to translate recovery conditions from compressed sensing into the graph signal setting. So far, I managed to relate the null space property and variants of the restricted isometry properties to the connectivity properties (topology) of networks. In particular, the conditions amount to the existence of certain network flows​. I would be happy if you have a look and share your opinion with me:
Greetings from Finland, Alex

Thanks Alex ! Here are the preprints:


A main workhorse for statistical learning and signal processing using sparse models is the least absolute shrinkage and selection operator (Lasso). The Lasso has been adapted recently for massive network-structured datasets, i.e., big data over networks. In particular, the network Lasso allows one to recover (or learn) graph signals from a small number of noisy signal samples by using the total variation semi-norm as a regularizer. Some work has been devoted to studying efficient and scalable implementations of the network Lasso. However, only little is known about the conditions on the underlying network structure which ensure a high accuracy of the network Lasso. By leveraging concepts of compressed sensing, we address this gap and derive precise conditions on the underlying network topology and sampling set which guarantee that the network Lasso delivers an accurate estimate of the entire underlying graph signal.


We adapt the nullspace property of compressed sensing for sparse vectors to semi-supervised learning of labels for network-structured datasets. In particular, we derive a sufficient condition, which we term the network nullspace property, for convex optimization methods to accurately learn labels which form smooth graph signals. The network nullspace property involves both the network topology and the sampling strategy and can be used to guide the design of efficient sampling strategies, i.e., the selection of those data points whose labels provide the most information for the learning task.


This work proposes a novel method for semi-supervised learning from partially labeled massive network-structured datasets, i.e., big data over networks. We model the underlying hypothesis, which relates data points to labels, as a graph signal, defined over some graph (network) structure intrinsic to the dataset. Following the key principle of supervised learning, i.e., similar inputs yield similar outputs, we require the graph signals induced by labels to have small total variation. Accordingly, we formulate the problem of learning the labels of data points as a non-smooth convex optimization problem which amounts to balancing between the empirical loss, i.e., the discrepancy with some partially available label information, and the smoothness quantified by the total variation of the learned graph signal. We solve this optimization problem by appealing to a recently proposed preconditioned variant of the popular primal-dual method by Pock and Chambolle, which results in a sparse label propagation algorithm. This learning algorithm allows for a highly scalable implementation as message passing over the underlying data graph. By applying concepts of compressed sensing to the learning problem, we are also able to provide a transparent sufficient condition on the underlying network structure such that accurate learning of the labels is possible. We also present an implementation of the message passing formulation that allows for a highly scalable implementation in big data frameworks.
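
To make the total variation machinery concrete, here is a small sketch (my own illustration, not the authors' code) of sparse label propagation on a toy two-community graph: minimize the graph total variation ||D x||_1 subject to two observed labels, using a Chambolle-Pock style primal-dual iteration.

import numpy as np

# toy graph: two 5-node cliques joined by a single bridge edge
edges = [(i, j) for i in range(5) for j in range(i + 1, 5)]
edges += [(i, j) for i in range(5, 10) for j in range(i + 1, 10)]
edges += [(4, 5)]
n, m = 10, len(edges)

D = np.zeros((m, n))                       # incidence ("graph gradient") matrix
for k, (i, j) in enumerate(edges):
    D[k, i], D[k, j] = 1.0, -1.0

labeled = {0: 1.0, 9: -1.0}                # only two observed labels
tau = sigma = 0.9 / np.linalg.norm(D, 2)   # step sizes with tau * sigma * ||D||^2 < 1

x = np.zeros(n); x_bar = x.copy(); p = np.zeros(m)
for _ in range(2000):
    p = np.clip(p + sigma * (D @ x_bar), -1.0, 1.0)   # dual step + projection onto the l_inf ball
    x_new = x - tau * (D.T @ p)                       # primal step on the TV term
    for i, y in labeled.items():                      # enforce the observed labels exactly
        x_new[i] = y
    x_bar = 2.0 * x_new - x                           # over-relaxation
    x = x_new

print(np.round(x, 2))   # the labels spread as a piecewise-constant signal over the two cliques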





Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
