Cognitive Science > Efficient Coding
[edit] Overview
- Random Variables and Matrix Algebra
- Decorrelation Transformations
- CCA-Methods and Blind Source Separation
- Information Theory and ICA
- Efficient Coding: Discrete Entropy and Rate-Distortion Theory
[edit] A computational theory of early vision
- Explains the role of neurobiological findings for the purpose of perceptual inference or behavioral tasks
- How does the visual pathway make inferences from the visual image?
- How does brain use image information to compute these inferences?
[edit] Motivation
- Unconscious visual inference
- Collective phenomenon, self-organization
[edit] The idea of efficient coding
- Early stages adapt to statistics of incoming signals
- Learn efficient representation
[edit] Matlab code
X = rand(1,1000) figure(1) hist(X, 100) a = -1 b = 1 Y = a+(b-a)*X X = rand(1,1000); Y = a+(b-a)*X Y = -log(X); bins=(-1:100)/10; figure(1), hist(Y, bins) figure(1),semilogy(bins, hist(Y, bins)) S=cumsum(Y); figure81), plot(S(!:50),zeros(1,50),.)) X = -log(rand(10,10000)); for m=1:10,
Y=sum(X(1:m,:)); figure(1),hist(Y, 100) mu = mean(Y),v=var(Y) pause
end figure(1),semilogy(60:150,hist(Y,60:150)) X = randn(100, 10000); X1 = randn(100, 10000); X = X./X1; for m=10:10:100, Y=sum(X(1:m,:)); for m=10:10:100,
Y=sum(X(1:m,:)); figure(1),hist(Y,100) mu=mean(Y),v=var(Y) pause
end
[edit] Basic statistics
[edit] Random variable x: Function of random experiment which assigns a number to every possible outcome
[edit] Cumulative distribution function (cdf): F_x(threshold) = P({m : x(m) < threshold})
Grows to 0 for -> -inf Grows to 1 for -> inf Monotonous (threshold increased -> more elements in set) Always invertible
[edit] Uniform random variable: x ~ U(a, b)
x distributed uniformly over interval [a, b]
[edit] (Probability) density function: Derivate of cumulative distribution function
[edit] Generating a random variable by using the inverse of the cumulative distribution function
- CDF: F(phi) = [(phi-a)/(b-a)]_0^1
- [z]_0^1 = min(max(0, z), 1)
- Inverted: G: [0, 1] => R
- Inverted: G(u) = a + (b-a)*u
[edit] Generalized: Generating a random variable by using a function of a random variable
- Random variable X
- y = g(X)
- F_y(phi) = P({g(x(xi)) < phi})
- Outcome of x must lie in domain of g
- Example: Linear transformation
- y = mx + c
- P({mx+b < theta})
- P({x < (theta-b)/m}) if m > 0
- We can now express the new CDF using the old CDF:
- F_y(theta) = F_x((theta-b)/m) if m > 0
- Poisson distribution: For each phi, count number of events that are smaller than phi
[edit] Mean values
- Expectation value: E[x] = <x> = \sum_x P(x)*x = \integral density(x)*x dx
- Values weighted by probability
- E[ax+b] = a*E[x] + b
- Expectation values of functions of random variables
- Discrete set: {f_1(x), ..., f_n(x)}
- Continuos set: f(t, x) = e^(itx)
- E[f_k(x)]
- e.g. f_k(x) = x^k
- Moments E[x^k]
- E[e^(itx)] = \integral density(x)*e^(itx) dx
- Characteristic function
- g(t, x) = e^(tx) ) \sum_(k=0)^\infty ((tx)^k)/(k!)
- Moment-generating function
- E[e^(tx)] = \sum E[((tx)^k)/(k!)] = \sum_0^\infty (t^k)/(k!) E[x^k]
- nth derivative, set to zero => only one term remains
- Expectation value is linear function of its arguments
- Continuos function can be obtained from a discrete number of moments and vice versa
- g(0) = E[x^0] = 1 = \integral density(x) dx
- Cumulant-generating function
- h(t) = log g(t) = \sum \kappa_k t^k/k! = E[x]*t + Var[x](t/k)
- First cumulant: Expectation value
- Second cumulant: Variance
- \kappa_k[ax+b] = a^k * \kappa_k[x] + kronecker_{1k}*b
- First cumulant is shift-equivariant, remaining are shift-invariant
- Variance
- E[(x-E[x])^2] = E[x^2 - 2xE[x] + E[x]^2] = E[x^2] - E[2xE[x]] + E[E[x]^2] = E[x^2] - E[x]^2
- Var[ax+b] = E[(ax+b)-E[ax+b]^2] = E[(ax+b-aE(x)-b)^2] = E[a^2(x-E[x])^2] = a^2*Var[x]
[edit] Central limit theorem
- Sum over n random variables: Gaussian for n -> inf
- Density of Gaussian: Parabola
- Normalizing a random variable: Z = (Y-E[Y])/(sqrt(Var[Y]))
- Mean 0, variance 1
- Cauchy distribution
- A alpha-stable distribution
- Sum independent distributions -> convergence
- If the sum of independent identically distributed random variables has finite variance, then it will be approximately normally distributed (Gaussian distribution).
- Gaussian distribution: Only distribution for which all cumulants for one and higher are zero.
[edit] Higher moments
- Skewness: Skew[x] = \kappa_3 / \sigma^3 = E[(x-E[x])^3]/\sigma^3
- Kurtosis: \kappa_4 / \sigma^4 = E[(x-E[x])^4]/\sigma^4
[edit] Edgeworth Expansion
- Series approximating a probability distribution in terms of its cumulants
- Improvement to the central limit theorem with controlled error
- Uses Hermite polynomials
- Conditions of central limit theorem can be relaxed
- Does a number of random variables for every distribution with skew=0 and finite variance converge to a Gaussian distribution function?
- Counterexample:
- IID,
- Kurt(\sum_{k=1}^n x_k) = 1/n^2 * \sum_{k=1}^n kurt(x_k) if var[x_k]=const.
- Distribution from exponential power family (generalized gaussian): 1/z * exp{-0.5*|x_k/\sigma|^alpha_k}
- Useful: One parameter sufficient to tune between uniform, Gaussian and Laplacian distribution
- \sigma = 1
- \alpha_k -> 0 fast enough => kurt(x_k) ~ k^2 => kurt(\sum x_k) = 1 + \sum_{k=1}^{n-1} kurt(x_k) =/= 0 for all n
[edit] Multivariate distributions
- Probability of intersections of events
- Lower-dimensional distribution functions can be generated from higher-dimensional distribution functions by integrating out variables.
- = marginal distributions
- n different univariate marginals for n variables
- Independence -> Joint distribution function can be written as a product of the marginals
- y1 = x1 + x2; y2 = x1 - x2; dependent but not correlated.
[edit] Correlation
- (Pairwise) correlation of two random variables: E[Y_1*Y2] = \integral \integral Y_1 Y2 density(Y1, Y2), dy1 dy2
- In case of statistical independence
- density(X1, X2) = density(X1)*density(X2)
- E[X1, X2] = \itnegral \ integral x_1 * x_2 * density(x1)*density(x2) dx1 dx2 = E[X1]*E[X2]
[edit] Covariance
- Cov[X1, X2] = E[X1, X2] - E[X1]*E[X2]
- Statistical independence => covariance needs to be 0
- Covariance 0 =/> statistical independence
- Only for Gaussian variables
- Can be seen as scalar product on function space of densities; uncorrelated = orthogonal; angle = correlation coefficient
[edit] Homework problems I
[edit] y1 = x1 + x2; y2 = x1 - x2; why dependent (but not correlated)?
- If statistically independent, then density(Y1|Y2) = density(Y1)
- Y2 = 1 => X2=1, X1=0 => density(Y1=1|Y2=-1) = 1 =/= density(Y1)
- Not independent: Product of marginals =/= joint density
- density(y1) = 1-|y1-1|
- density(y2) = 1- |y2|
[edit] Uniform distribution over sphere; linear mapping applied to a set of random variables
x = randn(D, 1) x = x/norm(x)
[edit] Matrix decomposition
For decorrelation and for reduction of dimensionality
[edit] Background: Rayleigh-Ritz
- Rayleigh quotient: r(x) = (transpose(x)*M*x)/(transpose(x)*x)
- Column vector times row vector = scalar product
- Denominator: Normalization
- We ask for the input vector that gives the maximum output vector
- max r(x) = max{\lambda_1, ... \lambda_n} = \lambda_1
- If eigenvalues ordered
- min r(x) = \lambda_n
- If eigenvalues ordered
[edit] Background: Covariance matrix
- Correlation: E[x_k x_j]
- Covariance: Cov[x_k, x_j] = E[x_k x_j]-E[x_k]*E[x_j]
- Covariance matrix: C2_kj = Cov[x_k, x_j]
- Sample covariance matrix
- X = (x_1, ..., X_N)
- dim(RV) x num_samples
- C = (1/N)*X*transpose(X) - mu * transpose(mu)
- C_ij = (1/N)* \sum_{k=1}^N x_ik x_jk - \mu_i \mu_j
- mu = mean(X, 2) = (1/N) * X * ones(N, 1)
[edit] Eigenvalue decomposition
- Eigenvectors of a linear mapping: Those that stay in same one-dimensional subspace
- For normal matrix M: M = U*D*transpose(U)
- MU = UD
- Compact matrix representation of m Eigenvalue equations
- Action of M on columns of U can be understood as a rescaling (by right multiplication) of the columns of the matrix
- Left multiplication: Rescaling of rows
- SVD + same orthogonal matrix
- Normal matrix: M*transpose(M) = transpose(M)*M
- Always exists for symmetric matrices
[edit] Power method
- Intuition: Direction of largest eigenvector will be amplified more by mapping than other directions
- Decomposition into outer products of eigenvectors
- cos(u,x) = |transpose(u)*M*(x/||x||)| >= |transpose(u)*(x/||x||)|
- s = sum(M.*M)
- [g, ind] = max(s)
- u1 = normalize(M(:, ind))
- Loop until output vector does not differ much from input vector
- old_u = 0
- while norm(u-old_u) > 0.0001:
- old_u = u
- u = M*u
- d = norm(u)
- u = u/d
- \lambda_1 = d * transpose(old_u) * u
- After a component is known, subtract from matrix
- M_new = M - \lambda_1 * u_1 * transpose(u_1)
- Iterate n times if matrix is of dimensionality n to determine all components
- Slow convergence if eigenvectors not separated well
[edit] Singular value decomposition
- Basics
- Useful for understanding linear mappings
- Matlab: [U, D, V] = svd(M)
- M = U*D*transpose(V)
- U, V orthogonal, D diagonal
- Orthogonal matrix ~ rotation in space around center of coordinate system, mirroring
- Not entirely unique (signs can be changed)
- For a uniform distribution, rotation does not change anything
- If the diagonal elements are all the same, we will always have a spherical symmetric distribution
- Linear mappings always map ellipsoids to ellipsoids
- Types of matrices
- Normal matrix: M*transpose(M) = transpose(M)*M => M = U*D*transpose(U)
- Orthogonal matrix: M*transpose(M) = I = transpose(M)*M
- Symmetric matrix: transpose(M) = M => M*transpose(M) = M^2 = transpose(M)*M
- Relations
- M*transpose(M) is symmetric because transpose(M*transpose(M)) = M*transpose(M)
- M*transpose(M) = U * D_1 * transpose(U)
- transpose(M)*M = V * D_2 * transpose(V)
- M = U * D * transpose(V)
- M * transpose(M) = U * D * transpose(V) * V * transpose(D) * transpose(U) = U * D * transpose(D) * transpose(U)
- transpose(M) * M = V * D * transpose(U) * U * D * transpose(V) = V * transpose(D) * D *transpose(V)
- D_1 = D * transpose(D)
- D_2 = transpose(D) * D
- Method
- 1. Determine/estimate covariance matrix
- 2. Obtain eigenvectors + eigenvalues from covariance matrix by decomposition
[edit] Principle component analysis
- In the following:
- x = x-E[x] => c_{x} = E[x*transpose(x)]
- y = Mx
- C_y = E[y*transpose(y)] = E[M*x*transpose(x)*transpose(M)] = M*E[x*transpose(x)]*transpose(M) = M * C_x * transpose(M)
- M = transpose(w) with ||w|| = 1
- \sigma_w^2 = transpose(w)*C_x w
- Covariance matrix gives variance along arbitrary directions
- Only method for orthogonl vectors
- Motivation
- Dimensionality reduction
- Decorrelation
- Dimensionality reduction
- Keep subspace that preserves most variance of data
- y = Sx
- S in R^(mxn), m < n
- S = (s_1, ..., s_m) and transpose(s_\delta) and transpose(s_j)*sk_k = \\delta_j
- max_S trance{C_y} = max_s trace{S*C_x*transpose(T)}
- trace{M} = \sum_i M_{ii}
- Practically: [U, D, V] = svd(C_x)
- Take first M columns to determine subspace
- Trace is invariant under permutations: Take first M columns to determine subspace
- trace(ABC) = trace(CAB)
- S = Q * S
- trace{Q S C_x transpose(T) transpose(Q)} = trace{S C_x transpose(S)}
[edit] Homework problems II
- Map uniform distribution, see how shape changes
- Compute principal components of natural images (with helper functions)
- http://redwood.berekeley.edu/bruno/sparsenet/
- Download raw images
- 1. Extract patches from image (use im2patch from utils)
- 2. x*transpose(x) - mu*transpose(mu)
- 3. imagesc(img), svd
- 4. dsp_patches I(U); disp_marginals
- Does random selection make a difference for for variante or not?
- Printouts!
[edit] Steerable filters
- Knutsson & Granlund 1985
- Freeman et al 1991
- Perona 1995
[edit] Fourier basis
- Basis of cyclo-stationary data
- sin(x+delta_x) = alpha(delta_x)*sin(x) + beta(delta_x) * cos(x)
- For non-cyclic data: Discrete cosine transform (DCT) is better as an approximation
[edit] Whitening transformation
[edit] Decorrelation transformations
[edit] Set of all possible decorrelation transformations: W = D*Q*D^(-1/2)*U
[edit] W orthogonal (W*W = 1): PCA is the only orthogonal transformation provided that Dii /= Dkk
[edit] W symmetric (W=W)
- W = U*D^(-1/2)*U = W
- ||Y-X||_FR0 = tr{(Y-X)(Y-X)}
[edit] W triangular: Whitening transformation
- W_0 * C_x * W_0 = 1
- Apply QR decomposition to W_0
- W_0 = Q * R
- Q * W_0 = R
[edit] Homework problems III
- Apply different transformations to natural images, other data
- Also try random orthogonal transformation (random rotation) — how can such a transformation be generated?
- Columns of a matrix => new basis vectors
- From Gaussian distribution => in general not orthogonal
- 1. Draw random Gaussian matrix
- M = randn(n)
- 2. Orthogonalize (such that is changed in the least way)
- Q = (M*M)^(-1/2) * M = inv(sqrt m(M*M))*M
- Q*Q = (M*M)^(-1/2)*M*M*(M*M)^(-1/2) = 1
- M*M = UDU => (M*M)^(-1/2) = U*D^(-1/2)*U
[edit] Generalized eigenvalue decomposition problem
- S1, S2 both symmatric
- Can we find a transformation such that both are diagonal when transformed? (Yes.)
- Symmetric whitening: S1^(-1/2)
- S1^(-1/2), post-multiplied
[edit] Blind source separation
[edit] The problem
- Separation of a set of signals from a set of mixed signals, without the aid of information (or with very little information) about the nature of the signals.
- Can we find a unique decorrelation transform based only on statistic criteria?
- For linear superposition of signals (Cocktail party problem)
- e.g. PCA, SVD, ICA, ...
[edit] Formalization
- x: mixed signal
- s: source signal
- x(t) = A*s(t)
- A: mixing transformation
- Find W for y = Wx such that W=A
[edit] Solution
- C(tau) = E[s_j(t)*s_k(t+tau)] = kronecker_jk * c_jk(tau)
- tau = 0: C(0) = E[s_j(t) s_k(t)]
- C(0) being diagonal => W = DQD^(-1/2)U
- DQ: Free
- D^(-1/2)U: Fixed
- C(tau /= 0) also diagonal
- Only correlation across time for signals that belong to same component
- 1. Whiten data with respect to zero-timelag covariance matrix
- 2. PCA with withened (new) random variable C, in addition diagonal timelag-covariance matrix
- SOBI algorithm, 1997
[edit] Overview
[edit] Eigenvalue Decomposition
- Decomposes a matrix M into matrices U, D, U such that U*D*U = M with diagonal matrix D, orthogonal matrix U
- Requires a symmetric matrix M
[edit] Singular Value Decomposition
- = Eigenvalue decomposition of M*M and M*M
- Decomposes a matrix M into matrices U, D, V such that U*D*V = M with diagonal matrix D, orthogonal matrices U, V.
- Eigenvectors are rows of U
[edit] Principle Component Analysis
- = Eigenvalue decomposition of covariance matrix
- Covariance matrix: E(X*X) - mean(X)*mean(X)
- Why does Eigenvalue decomposition of covariance matrix return principle components of image?
- W = U
[edit] Whitening PCA
- Length of basis vectors changed
- W = D^(-1/2)*U
[edit] Symmetric whitening
- W = U*D^(-1/2)*U = C_x^(-1/2)
[edit] Random whitening
- W = A*D^(-1/2)*U with A*A = 1
- Draw random gaussian matrix: A = randn(n);
- Orthogonalize: A = (A*A)^(-1/2)*A
[edit] Generalized Diagonalization Transforms
[edit] Class of decorrelation transformations
- Any decorrelation (diagonalization) transformation for A_x = U * D * U is of the form W = D*Q*D^(-1/2)*U
- Q*Q = 1
- D is diagonal
[edit] Simultaneously find one transformation matrix for two covariance matrices
- Problem
- Coefficient transformation matrix W for Y = W*X
- W now changes two matrices of interest
- A_y = W * A_x * W
- B_y = W * B_x * W
- Typical case: A is covariance matrix, B is timelag covariance matrix
- Goal: Make two matrices A_y and B_y diagonal at the same time, with one transformation W
- Different classes of algorithms specify how to generate A and B in different ways
- Approach
- 1. Apply general form of decorrelation transformations
- A_y = D*Q*D_A^(-1/2)*U*(U*D_A*U)*U*D_A^(-1/2)*Q*D = D*D
- We can still choose D and Q (with D being diagonal, Q orthogonal)
- 2. Insert result into second condition
- B_y = D*Q*D_A^(-1/2)*U*B_X*U*D_A^(-1/2)*Q*D = D*Q*V*D_B*V*Q*D
- Set Q = V = D*D_B*D
- 1. Apply general form of decorrelation transformations
- Examples
- AMUSE (Tong, 1990)
- y(t) = Wx(t) with E[x(t)] = 0
- A_x = E[x(t)*x(t)] = E[x_t*x_t] = C_A(0)
- B_x = 0.5*(E[x(t)*x(t+tau)]+E[x(t+tau)*x(t)]) = 0.5*(C(tau)+C(tau))
- Matrix + transposed matrix is symmetric
- Unstable
-
Oriented PCA (Diamantaras 1992)
- "Supervised" (we need to say what counts as noise and what as a signal)
- Principle
- Instead of choosing the first n components with largest variance, we choose the first n components with highest signal-to-noise ratio
- Plain PCA = special case where noise is isotropic (noise covariance proportional to identity matrix)
- Here: Whitening of noise covariance normalizes noise; we use a space where noise /is/ isotropic
- Use simultaneous diagonalization
- A_x = C_n (noise covariance matrix)
- B_x = C_s (signal covariance matrix)
- x = s + n
- y = Wx, W \in R^(mxn) with m < n (i.e. fat matrix for dimensionality reduction)
- Objective: SNR ~ log(det(WC_sW)/det(WC_nW))
- Noise covariance
- e.g. for fixation noise
- Cannot be obtained from x
- Knowledge about either signal or noise covariance necessary
- Algorithm
- W = P_m*Q*C_n^(-1/2)
- P_m is identity matrix up to the mth dimension, then zeros
- SNR ~ log(det(Q*C_n^(-1/2)*C_s*C_n^(-1/2)*Q)) = log(det(Q*V*D_B*V*Q)) = log(det(P_m*D_B*P_m))
- Q = V
- W = P_m * V * C_n^(-1/2)
- W = P_m*Q*C_n^(-1/2)
- Relation to Fishers linear discriminant
- Fishers linear discriminant: Projects high-dimensional data onto a line, then maximizes distance between means of two classes while minimizing variance within each class (-> signal-to-interference ratio, similar to perceptron)
- Oriented PCA: Generalization, more of a regression (analog) than classification (binary).
- Slow feature analysis (Wiskott, Sejnowski 2002)
- Principle
- Method for learning invariant or slowly varying features from a vectorial input signal
- Find representation that is invariant against changes that are less relevant to stimulus
- Noise covariance stable in time
- Solution equivalent to AMUSE algorithm
- Derivation
- C_n = E[(x(t)-x(t-tau))*(x(t)-x(t-tau))]
- x(t) - x(t-tau) ~ derivative
- C_n = E[x(t)x(t)-x(t)x(t-tau)-x(t-tau)x(t)+x(t-tau)x(t-tau)]
- = C_x(0) - (C_x(tau)+C_x(tau)) + C_x(0)
- = 2[C_x(0)-0.5*(C_x(tau)+C_x(tau))]
- 1. W = D*V*C_x^(-1/2)(0)
- 2. C_x^(-1/2)*C_n*C_x^(-1/2) = V*D_B*V
- C_n = E[(x(t)-x(t-tau))*(x(t)-x(t-tau))]
- Principle
- Canonical Correlation Analysis (CCA, by Hotelling, 1936)
- e.g. for stereo vision (find maximally correlated components)
- Principle
- Finds the two bases in which the correlation matrix between the variables is diagonal and the correlations on the diagonal are maximized
- Can we find a transformation that explains the correlation of two neurons by one common input signal (=one large component, others zero)?
- y_1 = W_1 * x_1
- y_2 = W_2 * x_2
- 1. Whitening
- 2. SVD
- Derivation
- W_1 = U_1*C_x1^(-1/2)
- W_2 = U_2*C_x2^(-1/2)
- C_x1^(-1/2)*C_(x1x2)*C_x2^(-1/2) = U_1 * D * U_2
- C_x1x2 = E[x1 x2] - E[x1]*E[x2]
- AMUSE (Tong, 1990)
[edit] Relation to neuroscience
- Data analysis: What signals are transmitted by a population of neurons collectively
- Bad performance of algorithms which need invariant representations
- e.g. face recognition: 60% in optimal conditions
- Part of brain that is studied best: Early vision
- Could we understand how a radio works with our current methods?
- Organisms adapted to environment; sensory stages adapted to statistics of incoming signals
- PCA applied to color space => red/green, blue/yellow color opponency
- Separation found in second- and higher-order statistics
[edit] Homework problems IV
[edit] Take image, use circshift by one pixel, compute differences; Generates noise covariance for fixation
[edit] What are good hypotheses to test?
See background papers on teaching web page
- How do RGCs, LGN and primary visual cortex change the representation of visual information?
- Decorrelation?
- Whitening filters?
- If the activity of retina/LGN/primary visual cortex is close to one of the decorrelation methods, then where do the differences stem from?
- Can we model the processing through linear means? If not, which nonlinear transformations are necessary?
- Is the most compressed representation the most useful one (i.e. the one that is actually used by the visual system)?
[edit] Possible objective functions of V1
[edit] Redundancy reduction (minimize multi-information)
Different neurons should encode for different aspects of the input.
- Representation of natural images in form of pixel intensities constitutes a highly redundant code.
- Example for redundancy reduction: Second-order correlations between neighboring pixels
-
Principle Component Analysis (PCA)
- Global (nonlocal), non-oriented basis functions
- Whitening
- Squeezes variance such that it is proportional to identity matrix, is rotationally invariant
- Independent Component Analysis (ICA)
- Find non-gaussian distribution (gradient descent); unique answer to redundancy reduction
[edit] Density estimation (maximize log-likelihood)
Learn a representation of the probabilities of different stimuli.
[edit] Efficient coding (compute rate-distortion curve)
Find most informative representation subject to limited coding resources. Assumes bottleneck in contrast to the first two options. Which aspects of the stimulus should be selected?
[edit] Other ideas?
No free lunch — important to think about environment
- Sparseness
- Slowness
- Deep belief nets
- Invariance properties: Steerable filters
- Feature selection in a multi-task setting (what type of nonlinearities?)
- Not the right nonlinearities (with maybe only a few parameters) for good generalization?
- Independent of how you generate the changes ot learn invariances
- Use of orientation field for 3D-shape inference
- Active vision, learning of invariance with regard to own actions
- At most as good as supervised, mostly reinforcement?
- Same input, different action repertoires => different output
[edit] Orientation Selectivity in V1
[edit] Basics
- Standard model: Orientation selectivity of simple cells (Hubel & Wiesel)
- Unclear how large a part of the computations going on is described by reduction to orientation selective filter bands
- Self-organized orientation maps (connectivity learned from statistics): Wimbauer et al, 1987
- => The unsolved mystery of vision
- Computational models: Redundancy reduction (Barlow), Efficient Coding
- 10x more receptors than RGCs => bottleneck imposed by optic nerve
[edit] Quantitative role of orientation selectivity for redundancy reduction
- Contradicting results in previous studies
- Used different data sets, different performance measures (multi-information, log-likelihood, rate-distortion curves), different estimation methods
- If multi-information and log-likelihood do not match, you probably have overfitting
- Goal: Comprehensive study that resolves inconsistencies
- Understanding of the role of orientation selective filtering requires quantitatively testable theory
[edit] Results
- Multi-information reduction smaller than 5% for both color and gray scale images
- No clear advantage in redundancy reduction and density estimation
- Suboptimal in terms of coding efficiency
[edit] Information theory
[edit] Multi-information
- I[y] = \sum h[h[p_k(y_k)]]-h[p(y)]
- Comparison of differential entropies
- Cannot do better than entropy if joint distributions
[edit] For n states, we need log_2 n bits; log n represents resources required for representing n states
If all states are equally likely
[edit] On average, the coding cost is approx. p(x)*(1/(p(x))) = E[-log p(x)]
Choose long numbers for states that occur seldom, short numbers for those that occur often
[edit] E[-log p(x)] is upper bound, can be reached asymptotically
[edit] Log loss: -log p(x)
[edit] Discrete entropy
- H[p(x)] = - \sum p(x)*log p(x)
- Always positive
[edit] Differential entropy
- P(x_k) =\integral _{a(x_k);b(x_k=}\rho-7d-8-k = ..., -3, -2, -1, 0, 1, 2, ... p dx = F((k+1)*\delta) - F(k*\delta)
- H[p_\delta(k)] = - \integral \rho(x1)*log \rho(x) dx
- Differential entropy can take negative values
[edit] Kullback-Leibler distance (KL divergence, information gain)
How many extra bits we need because of our ignorance regarding the true distribution?
[edit] PCA in information theory
- PCA: Orthogonal transformation
- Determinant = 1
- Upper bound to entropy can be minimized through PCA
- Orientation selectivity through higher-order redundancy reduction
- Olshausen & Field 1996
- Bell & Sejnowski 1997
- ICA yields localized and orientation selective receptive fields