Cognitive Science > Efficient Coding

Jump to: navigation, search

Contents

[edit] Overview

  • Random Variables and Matrix Algebra
  • Decorrelation Transformations
  • CCA-Methods and Blind Source Separation
  • Information Theory and ICA
  • Efficient Coding: Discrete Entropy and Rate-Distortion Theory

[edit] A computational theory of early vision

  • Explains the role of neurobiological findings for the purpose of perceptual inference or behavioral tasks
  • How does the visual pathway make inferences from the visual image?
  • How does brain use image information to compute these inferences?

[edit] Motivation

  • Unconscious visual inference
  • Collective phenomenon, self-organization

[edit] The idea of efficient coding

  • Early stages adapt to statistics of incoming signals
  • Learn efficient representation

[edit] Matlab code

X = rand(1,1000) figure(1) hist(X, 100) a = -1 b = 1 Y = a+(b-a)*X X = rand(1,1000); Y = a+(b-a)*X Y = -log(X); bins=(-1:100)/10; figure(1), hist(Y, bins) figure(1),semilogy(bins, hist(Y, bins)) S=cumsum(Y); figure81), plot(S(!:50),zeros(1,50),.)) X = -log(rand(10,10000)); for m=1:10,

Y=sum(X(1:m,:));
figure(1),hist(Y, 100)
mu = mean(Y),v=var(Y)
pause

end figure(1),semilogy(60:150,hist(Y,60:150)) X = randn(100, 10000); X1 = randn(100, 10000); X = X./X1; for m=10:10:100, Y=sum(X(1:m,:)); for m=10:10:100,

Y=sum(X(1:m,:));
figure(1),hist(Y,100)
mu=mean(Y),v=var(Y)
pause

end

[edit] Basic statistics

[edit] Random variable x: Function of random experiment which assigns a number to every possible outcome

[edit] Cumulative distribution function (cdf): F_x(threshold) = P({m : x(m) < threshold})

Grows to 0 for -> -inf Grows to 1 for -> inf Monotonous (threshold increased -> more elements in set) Always invertible

[edit] Uniform random variable: x ~ U(a, b)

x distributed uniformly over interval [a, b]

[edit] (Probability) density function: Derivate of cumulative distribution function

[edit] Generating a random variable by using the inverse of the cumulative distribution function

  • CDF: F(phi) = [(phi-a)/(b-a)]_0^1
  • [z]_0^1 = min(max(0, z), 1)
  • Inverted: G: [0, 1] => R
  • Inverted: G(u) = a + (b-a)*u

[edit] Generalized: Generating a random variable by using a function of a random variable

  • Random variable X
  • y = g(X)
  • F_y(phi) = P({g(x(xi)) < phi})
  • Outcome of x must lie in domain of g
  • Example: Linear transformation
    y = mx + c
    P({mx+b < theta})
    P({x < (theta-b)/m}) if m > 0
    We can now express the new CDF using the old CDF:
    F_y(theta) = F_x((theta-b)/m) if m > 0
  • Poisson distribution: For each phi, count number of events that are smaller than phi

[edit] Mean values

  • Expectation value: E[x] = <x> = \sum_x P(x)*x = \integral density(x)*x dx
    Values weighted by probability
    • E[ax+b] = a*E[x] + b
  • Expectation values of functions of random variables
    • Discrete set: {f_1(x), ..., f_n(x)}
    • Continuos set: f(t, x) = e^(itx)
    • E[f_k(x)]
    • e.g. f_k(x) = x^k
    • Moments E[x^k]
    • E[e^(itx)] = \integral density(x)*e^(itx) dx
      Characteristic function
    • g(t, x) = e^(tx) ) \sum_(k=0)^\infty ((tx)^k)/(k!)
  • Moment-generating function
    • E[e^(tx)] = \sum E[((tx)^k)/(k!)] = \sum_0^\infty (t^k)/(k!) E[x^k]
    • nth derivative, set to zero => only one term remains
    • Expectation value is linear function of its arguments
    • Continuos function can be obtained from a discrete number of moments and vice versa
    • g(0) = E[x^0] = 1 = \integral density(x) dx
  • Cumulant-generating function
    • h(t) = log g(t) = \sum \kappa_k t^k/k! = E[x]*t + Var[x](t/k)
    • First cumulant: Expectation value
    • Second cumulant: Variance
    • \kappa_k[ax+b] = a^k * \kappa_k[x] + kronecker_{1k}*b
    • First cumulant is shift-equivariant, remaining are shift-invariant
  • Variance
    • E[(x-E[x])^2] = E[x^2 - 2xE[x] + E[x]^2] = E[x^2] - E[2xE[x]] + E[E[x]^2] = E[x^2] - E[x]^2
    • Var[ax+b] = E[(ax+b)-E[ax+b]^2] = E[(ax+b-aE(x)-b)^2] = E[a^2(x-E[x])^2] = a^2*Var[x]

[edit] Central limit theorem

  • Sum over n random variables: Gaussian for n -> inf
  • Density of Gaussian: Parabola
  • Normalizing a random variable: Z = (Y-E[Y])/(sqrt(Var[Y]))
    Mean 0, variance 1
  • Cauchy distribution
    A alpha-stable distribution
    Sum independent distributions -> convergence
  • If the sum of independent identically distributed random variables has finite variance, then it will be approximately normally distributed (Gaussian distribution).
  • Gaussian distribution: Only distribution for which all cumulants for one and higher are zero.

[edit] Higher moments

  • Skewness: Skew[x] = \kappa_3 / \sigma^3 = E[(x-E[x])^3]/\sigma^3
  • Kurtosis: \kappa_4 / \sigma^4 = E[(x-E[x])^4]/\sigma^4

[edit] Edgeworth Expansion

  • Series approximating a probability distribution in terms of its cumulants
  • Improvement to the central limit theorem with controlled error
  • Uses Hermite polynomials
  • Conditions of central limit theorem can be relaxed
  • Does a number of random variables for every distribution with skew=0 and finite variance converge to a Gaussian distribution function?
    Counterexample:
    IID,
    Kurt(\sum_{k=1}^n x_k) = 1/n^2 * \sum_{k=1}^n kurt(x_k) if var[x_k]=const.
    Distribution from exponential power family (generalized gaussian): 1/z * exp{-0.5*|x_k/\sigma|^alpha_k}
    Useful: One parameter sufficient to tune between uniform, Gaussian and Laplacian distribution
    \sigma = 1
    \alpha_k -> 0 fast enough => kurt(x_k) ~ k^2 => kurt(\sum x_k) = 1 + \sum_{k=1}^{n-1} kurt(x_k) =/= 0 for all n

[edit] Multivariate distributions

  • Probability of intersections of events
  • Lower-dimensional distribution functions can be generated from higher-dimensional distribution functions by integrating out variables.
    = marginal distributions
    n different univariate marginals for n variables
  • Independence -> Joint distribution function can be written as a product of the marginals
  • y1 = x1 + x2; y2 = x1 - x2; dependent but not correlated.

[edit] Correlation

  • (Pairwise) correlation of two random variables: E[Y_1*Y2] = \integral \integral Y_1 Y2 density(Y1, Y2), dy1 dy2
  • In case of statistical independence
    • density(X1, X2) = density(X1)*density(X2)
    • E[X1, X2] = \itnegral \ integral x_1 * x_2 * density(x1)*density(x2) dx1 dx2 = E[X1]*E[X2]

[edit] Covariance

  • Cov[X1, X2] = E[X1, X2] - E[X1]*E[X2]
  • Statistical independence => covariance needs to be 0
  • Covariance 0 =/> statistical independence
    Only for Gaussian variables
  • Can be seen as scalar product on function space of densities; uncorrelated = orthogonal; angle = correlation coefficient

[edit] Homework problems I

[edit] y1 = x1 + x2; y2 = x1 - x2; why dependent (but not correlated)?

  • If statistically independent, then density(Y1|Y2) = density(Y1)
  • Y2 = 1 => X2=1, X1=0 => density(Y1=1|Y2=-1) = 1 =/= density(Y1)
  • Not independent: Product of marginals =/= joint density
  • density(y1) = 1-|y1-1|
  • density(y2) = 1- |y2|

[edit] Uniform distribution over sphere; linear mapping applied to a set of random variables

x = randn(D, 1) x = x/norm(x)

[edit] Matrix decomposition

For decorrelation and for reduction of dimensionality

[edit] Background: Rayleigh-Ritz

  • Rayleigh quotient: r(x) = (transpose(x)*M*x)/(transpose(x)*x)
    Column vector times row vector = scalar product
  • Denominator: Normalization
  • We ask for the input vector that gives the maximum output vector
  • max r(x) = max{\lambda_1, ... \lambda_n} = \lambda_1
    If eigenvalues ordered
  • min r(x) = \lambda_n
    If eigenvalues ordered

[edit] Background: Covariance matrix

  • Correlation: E[x_k x_j]
  • Covariance: Cov[x_k, x_j] = E[x_k x_j]-E[x_k]*E[x_j]
  • Covariance matrix: C2_kj = Cov[x_k, x_j]
  • Sample covariance matrix
    • X = (x_1, ..., X_N)
    • dim(RV) x num_samples
    • C = (1/N)*X*transpose(X) - mu * transpose(mu)
    • C_ij = (1/N)* \sum_{k=1}^N x_ik x_jk - \mu_i \mu_j
    • mu = mean(X, 2) = (1/N) * X * ones(N, 1)

[edit] Eigenvalue decomposition

  • Eigenvectors of a linear mapping: Those that stay in same one-dimensional subspace
  • For normal matrix M: M = U*D*transpose(U)
    MU = UD
    Compact matrix representation of m Eigenvalue equations
    Action of M on columns of U can be understood as a rescaling (by right multiplication) of the columns of the matrix
    Left multiplication: Rescaling of rows
  • SVD + same orthogonal matrix
  • Normal matrix: M*transpose(M) = transpose(M)*M
  • Always exists for symmetric matrices

[edit] Power method

  • Intuition: Direction of largest eigenvector will be amplified more by mapping than other directions
  • Decomposition into outer products of eigenvectors
  • cos(u,x) = |transpose(u)*M*(x/||x||)| >= |transpose(u)*(x/||x||)|
  • s = sum(M.*M)
  • [g, ind] = max(s)
  • u1 = normalize(M(:, ind))
  • Loop until output vector does not differ much from input vector
    old_u = 0
    while norm(u-old_u) > 0.0001:
    old_u = u
    u = M*u
    d = norm(u)
    u = u/d
    \lambda_1 = d * transpose(old_u) * u
  • After a component is known, subtract from matrix
    M_new = M - \lambda_1 * u_1 * transpose(u_1)
  • Iterate n times if matrix is of dimensionality n to determine all components
  • Slow convergence if eigenvectors not separated well

[edit] Singular value decomposition

  • Basics
    • Useful for understanding linear mappings
    • Matlab: [U, D, V] = svd(M)
    • M = U*D*transpose(V)
    • U, V orthogonal, D diagonal
      Orthogonal matrix ~ rotation in space around center of coordinate system, mirroring
    • Not entirely unique (signs can be changed)
    • For a uniform distribution, rotation does not change anything
    • If the diagonal elements are all the same, we will always have a spherical symmetric distribution
    • Linear mappings always map ellipsoids to ellipsoids
  • Types of matrices
    • Normal matrix: M*transpose(M) = transpose(M)*M => M = U*D*transpose(U)
    • Orthogonal matrix: M*transpose(M) = I = transpose(M)*M
    • Symmetric matrix: transpose(M) = M => M*transpose(M) = M^2 = transpose(M)*M
  • Relations
    • M*transpose(M) is symmetric because transpose(M*transpose(M)) = M*transpose(M)
    • M*transpose(M) = U * D_1 * transpose(U)
    • transpose(M)*M = V * D_2 * transpose(V)
    • M = U * D * transpose(V)
    • M * transpose(M) = U * D * transpose(V) * V * transpose(D) * transpose(U) = U * D * transpose(D) * transpose(U)
    • transpose(M) * M = V * D * transpose(U) * U * D * transpose(V) = V * transpose(D) * D *transpose(V)
    • D_1 = D * transpose(D)
    • D_2 = transpose(D) * D
  • Method
    • 1. Determine/estimate covariance matrix
    • 2. Obtain eigenvectors + eigenvalues from covariance matrix by decomposition

[edit] Principle component analysis

  • In the following:
    • x = x-E[x] => c_{x} = E[x*transpose(x)]
    • y = Mx
    • C_y = E[y*transpose(y)] = E[M*x*transpose(x)*transpose(M)] = M*E[x*transpose(x)]*transpose(M) = M * C_x * transpose(M)
    • M = transpose(w) with ||w|| = 1
    • \sigma_w^2 = transpose(w)*C_x w
    • Covariance matrix gives variance along arbitrary directions
    • Only method for orthogonl vectors
  • Motivation
    • Dimensionality reduction
    • Decorrelation
  • Dimensionality reduction
    • Keep subspace that preserves most variance of data
    • y = Sx
    • S in R^(mxn), m < n
    • S = (s_1, ..., s_m) and transpose(s_\delta) and transpose(s_j)*sk_k = \\delta_j
    • max_S trance{C_y} = max_s trace{S*C_x*transpose(T)}
    • trace{M} = \sum_i M_{ii}
  • Practically: [U, D, V] = svd(C_x)
    Take first M columns to determine subspace
  • Trace is invariant under permutations: Take first M columns to determine subspace
    trace(ABC) = trace(CAB)
    • S = Q * S
    • trace{Q S C_x transpose(T) transpose(Q)} = trace{S C_x transpose(S)}

[edit] Homework problems II

  • Map uniform distribution, see how shape changes
  • Compute principal components of natural images (with helper functions)
    http://redwood.berekeley.edu/bruno/sparsenet/
    Download raw images
    1. Extract patches from image (use im2patch from utils)
    2. x*transpose(x) - mu*transpose(mu)
    3. imagesc(img), svd
    4. dsp_patches I(U); disp_marginals
    Does random selection make a difference for for variante or not?
    Printouts!

[edit] Steerable filters

  • Knutsson & Granlund 1985
  • Freeman et al 1991
  • Perona 1995

[edit] Fourier basis

  • Basis of cyclo-stationary data
  • sin(x+delta_x) = alpha(delta_x)*sin(x) + beta(delta_x) * cos(x)
  • For non-cyclic data: Discrete cosine transform (DCT) is better as an approximation

[edit] Whitening transformation

[edit] Decorrelation transformations

[edit] Set of all possible decorrelation transformations: W = D*Q*D^(-1/2)*U

[edit] W orthogonal (W*W = 1): PCA is the only orthogonal transformation provided that Dii /= Dkk

[edit] W symmetric (W=W)

  • W = U*D^(-1/2)*U = W
  • ||Y-X||_FR0 = tr{(Y-X)(Y-X)}

[edit] W triangular: Whitening transformation

  • W_0 * C_x * W_0 = 1
  • Apply QR decomposition to W_0
  • W_0 = Q * R
  • Q * W_0 = R

[edit] Homework problems III

  • Apply different transformations to natural images, other data
  • Also try random orthogonal transformation (random rotation) — how can such a transformation be generated?
    Columns of a matrix => new basis vectors
    From Gaussian distribution => in general not orthogonal
    1. Draw random Gaussian matrix
    M = randn(n)
    2. Orthogonalize (such that is changed in the least way)
    Q = (M*M)^(-1/2) * M = inv(sqrt m(M*M))*M
    Q*Q = (M*M)^(-1/2)*M*M*(M*M)^(-1/2) = 1
    M*M = UDU => (M*M)^(-1/2) = U*D^(-1/2)*U

[edit] Generalized eigenvalue decomposition problem

  • S1, S2 both symmatric
  • Can we find a transformation such that both are diagonal when transformed? (Yes.)
  • Symmetric whitening: S1^(-1/2)
  • S1^(-1/2), post-multiplied

[edit] Blind source separation

[edit] The problem

  • Separation of a set of signals from a set of mixed signals, without the aid of information (or with very little information) about the nature of the signals.
  • Can we find a unique decorrelation transform based only on statistic criteria?
  • For linear superposition of signals (Cocktail party problem)
  • e.g. PCA, SVD, ICA, ...

[edit] Formalization

  • x: mixed signal
  • s: source signal
  • x(t) = A*s(t)
  • A: mixing transformation
  • Find W for y = Wx such that W=A

[edit] Solution

  • C(tau) = E[s_j(t)*s_k(t+tau)] = kronecker_jk * c_jk(tau)
  • tau = 0: C(0) = E[s_j(t) s_k(t)]
  • C(0) being diagonal => W = DQD^(-1/2)U
    DQ: Free
    D^(-1/2)U: Fixed
  • C(tau /= 0) also diagonal
    Only correlation across time for signals that belong to same component
  • 1. Whiten data with respect to zero-timelag covariance matrix
  • 2. PCA with withened (new) random variable C, in addition diagonal timelag-covariance matrix
  • SOBI algorithm, 1997

[edit] Overview

[edit] Eigenvalue Decomposition

  • Decomposes a matrix M into matrices U, D, U such that U*D*U = M with diagonal matrix D, orthogonal matrix U
  • Requires a symmetric matrix M

[edit] Singular Value Decomposition

  • = Eigenvalue decomposition of M*M and M*M
  • Decomposes a matrix M into matrices U, D, V such that U*D*V = M with diagonal matrix D, orthogonal matrices U, V.
  • Eigenvectors are rows of U

[edit] Principle Component Analysis

  • = Eigenvalue decomposition of covariance matrix
  • Covariance matrix: E(X*X) - mean(X)*mean(X)
  • Why does Eigenvalue decomposition of covariance matrix return principle components of image?
  • W = U

[edit] Whitening PCA

  • Length of basis vectors changed
  • W = D^(-1/2)*U

[edit] Symmetric whitening

  • W = U*D^(-1/2)*U = C_x^(-1/2)

[edit] Random whitening

  • W = A*D^(-1/2)*U with A*A = 1
  • Draw random gaussian matrix: A = randn(n);
  • Orthogonalize: A = (A*A)^(-1/2)*A

[edit] Generalized Diagonalization Transforms

[edit] Class of decorrelation transformations

  • Any decorrelation (diagonalization) transformation for A_x = U * D * U is of the form W = D*Q*D^(-1/2)*U
  • Q*Q = 1
  • D is diagonal

[edit] Simultaneously find one transformation matrix for two covariance matrices

  • Problem
    • Coefficient transformation matrix W for Y = W*X
    • W now changes two matrices of interest
    • A_y = W * A_x * W
    • B_y = W * B_x * W
    • Typical case: A is covariance matrix, B is timelag covariance matrix
    • Goal: Make two matrices A_y and B_y diagonal at the same time, with one transformation W
    • Different classes of algorithms specify how to generate A and B in different ways
  • Approach
    • 1. Apply general form of decorrelation transformations
      • A_y = D*Q*D_A^(-1/2)*U*(U*D_A*U)*U*D_A^(-1/2)*Q*D = D*D
      • We can still choose D and Q (with D being diagonal, Q orthogonal)
    • 2. Insert result into second condition
      • B_y = D*Q*D_A^(-1/2)*U*B_X*U*D_A^(-1/2)*Q*D = D*Q*V*D_B*V*Q*D
      • Set Q = V = D*D_B*D
  • Examples
    • AMUSE (Tong, 1990)
      • y(t) = Wx(t) with E[x(t)] = 0
      • A_x = E[x(t)*x(t)] = E[x_t*x_t] = C_A(0)
      • B_x = 0.5*(E[x(t)*x(t+tau)]+E[x(t+tau)*x(t)]) = 0.5*(C(tau)+C(tau))
      • Matrix + transposed matrix is symmetric
      • Unstable
    • Oriented PCA (Diamantaras 1992)
      "Supervised" (we need to say what counts as noise and what as a signal)
      • Principle
        • Instead of choosing the first n components with largest variance, we choose the first n components with highest signal-to-noise ratio
        • Plain PCA = special case where noise is isotropic (noise covariance proportional to identity matrix)
        • Here: Whitening of noise covariance normalizes noise; we use a space where noise /is/ isotropic
        • Use simultaneous diagonalization
        • A_x = C_n (noise covariance matrix)
        • B_x = C_s (signal covariance matrix)
        • x = s + n
        • y = Wx, W \in R^(mxn) with m < n (i.e. fat matrix for dimensionality reduction)
        • Objective: SNR ~ log(det(WC_sW)/det(WC_nW))
      • Noise covariance
        e.g. for fixation noise
        • Cannot be obtained from x
        • Knowledge about either signal or noise covariance necessary
      • Algorithm
        • W = P_m*Q*C_n^(-1/2)
          P_m is identity matrix up to the mth dimension, then zeros
        • SNR ~ log(det(Q*C_n^(-1/2)*C_s*C_n^(-1/2)*Q)) = log(det(Q*V*D_B*V*Q)) = log(det(P_m*D_B*P_m))
          Q = V
        • W = P_m * V * C_n^(-1/2)
      • Relation to Fishers linear discriminant
        • Fishers linear discriminant: Projects high-dimensional data onto a line, then maximizes distance between means of two classes while minimizing variance within each class (-> signal-to-interference ratio, similar to perceptron)
        • Oriented PCA: Generalization, more of a regression (analog) than classification (binary).
    • Slow feature analysis (Wiskott, Sejnowski 2002)
      • Principle
        • Method for learning invariant or slowly varying features from a vectorial input signal
        • Find representation that is invariant against changes that are less relevant to stimulus
        • Noise covariance stable in time
        • Solution equivalent to AMUSE algorithm
      • Derivation
        • C_n = E[(x(t)-x(t-tau))*(x(t)-x(t-tau))]
          x(t) - x(t-tau) ~ derivative
          C_n = E[x(t)x(t)-x(t)x(t-tau)-x(t-tau)x(t)+x(t-tau)x(t-tau)]
          = C_x(0) - (C_x(tau)+C_x(tau)) + C_x(0)
          = 2[C_x(0)-0.5*(C_x(tau)+C_x(tau))]
        • 1. W = D*V*C_x^(-1/2)(0)
        • 2. C_x^(-1/2)*C_n*C_x^(-1/2) = V*D_B*V
    • Canonical Correlation Analysis (CCA, by Hotelling, 1936)
      e.g. for stereo vision (find maximally correlated components)
      • Principle
        • Finds the two bases in which the correlation matrix between the variables is diagonal and the correlations on the diagonal are maximized
        • Can we find a transformation that explains the correlation of two neurons by one common input signal (=one large component, others zero)?
        • y_1 = W_1 * x_1
        • y_2 = W_2 * x_2
        • 1. Whitening
        • 2. SVD
      • Derivation
        • W_1 = U_1*C_x1^(-1/2)
        • W_2 = U_2*C_x2^(-1/2)
        • C_x1^(-1/2)*C_(x1x2)*C_x2^(-1/2) = U_1 * D * U_2
        • C_x1x2 = E[x1 x2] - E[x1]*E[x2]

[edit] Relation to neuroscience

  • Data analysis: What signals are transmitted by a population of neurons collectively
  • Bad performance of algorithms which need invariant representations
    e.g. face recognition: 60% in optimal conditions
  • Part of brain that is studied best: Early vision
    Could we understand how a radio works with our current methods?
  • Organisms adapted to environment; sensory stages adapted to statistics of incoming signals
    PCA applied to color space => red/green, blue/yellow color opponency
    Separation found in second- and higher-order statistics

[edit] Homework problems IV

[edit] Take image, use circshift by one pixel, compute differences; Generates noise covariance for fixation

[edit] What are good hypotheses to test?

See background papers on teaching web page

  • How do RGCs, LGN and primary visual cortex change the representation of visual information?
    • Decorrelation?
    • Whitening filters?
  • If the activity of retina/LGN/primary visual cortex is close to one of the decorrelation methods, then where do the differences stem from?
  • Can we model the processing through linear means? If not, which nonlinear transformations are necessary?
  • Is the most compressed representation the most useful one (i.e. the one that is actually used by the visual system)?

[edit] Possible objective functions of V1

[edit] Redundancy reduction (minimize multi-information)

Different neurons should encode for different aspects of the input.

  • Representation of natural images in form of pixel intensities constitutes a highly redundant code.
  • Example for redundancy reduction: Second-order correlations between neighboring pixels
  • Principle Component Analysis (PCA)
    Global (nonlocal), non-oriented basis functions
  • Whitening
    Squeezes variance such that it is proportional to identity matrix, is rotationally invariant
  • Independent Component Analysis (ICA)
    Find non-gaussian distribution (gradient descent); unique answer to redundancy reduction

[edit] Density estimation (maximize log-likelihood)

Learn a representation of the probabilities of different stimuli.

[edit] Efficient coding (compute rate-distortion curve)

Find most informative representation subject to limited coding resources. Assumes bottleneck in contrast to the first two options. Which aspects of the stimulus should be selected?

[edit] Other ideas?

No free lunch — important to think about environment

  • Sparseness
  • Slowness
  • Deep belief nets
  • Invariance properties: Steerable filters
  • Feature selection in a multi-task setting (what type of nonlinearities?)
    • Not the right nonlinearities (with maybe only a few parameters) for good generalization?
    • Independent of how you generate the changes ot learn invariances
  • Use of orientation field for 3D-shape inference
  • Active vision, learning of invariance with regard to own actions
    • At most as good as supervised, mostly reinforcement?
    • Same input, different action repertoires => different output

[edit] Orientation Selectivity in V1

[edit] Basics

  • Standard model: Orientation selectivity of simple cells (Hubel & Wiesel)
  • Unclear how large a part of the computations going on is described by reduction to orientation selective filter bands
  • Self-organized orientation maps (connectivity learned from statistics): Wimbauer et al, 1987
  • => The unsolved mystery of vision
  • Computational models: Redundancy reduction (Barlow), Efficient Coding
  • 10x more receptors than RGCs => bottleneck imposed by optic nerve

[edit] Quantitative role of orientation selectivity for redundancy reduction

  • Contradicting results in previous studies
  • Used different data sets, different performance measures (multi-information, log-likelihood, rate-distortion curves), different estimation methods
    If multi-information and log-likelihood do not match, you probably have overfitting
  • Goal: Comprehensive study that resolves inconsistencies
  • Understanding of the role of orientation selective filtering requires quantitatively testable theory

[edit] Results

  • Multi-information reduction smaller than 5% for both color and gray scale images
  • No clear advantage in redundancy reduction and density estimation
  • Suboptimal in terms of coding efficiency

[edit] Information theory

[edit] Multi-information

  • I[y] = \sum h[h[p_k(y_k)]]-h[p(y)]
  • Comparison of differential entropies
  • Cannot do better than entropy if joint distributions

[edit] For n states, we need log_2 n bits; log n represents resources required for representing n states

If all states are equally likely

[edit] On average, the coding cost is approx. p(x)*(1/(p(x))) = E[-log p(x)]

Choose long numbers for states that occur seldom, short numbers for those that occur often

[edit] E[-log p(x)] is upper bound, can be reached asymptotically

[edit] Log loss: -log p(x)

[edit] Discrete entropy

  • H[p(x)] = - \sum p(x)*log p(x)
  • Always positive

[edit] Differential entropy

  • P(x_k) =\integral _{a(x_k);b(x_k=}\rho-7d-8-k = ..., -3, -2, -1, 0, 1, 2, ... p dx = F((k+1)*\delta) - F(k*\delta)
  • H[p_\delta(k)] = - \integral \rho(x1)*log \rho(x) dx
  • Differential entropy can take negative values

[edit] Kullback-Leibler distance (KL divergence, information gain)

How many extra bits we need because of our ignorance regarding the true distribution?

[edit] PCA in information theory

  • PCA: Orthogonal transformation
  • Determinant = 1
  • Upper bound to entropy can be minimized through PCA
  • Orientation selectivity through higher-order redundancy reduction
    Olshausen & Field 1996
    Bell & Sejnowski 1997
  • ICA yields localized and orientation selective receptive fields