Paper summary by hlarochelle
**Summary**
Representation (or feature) learning with unsupervised learning has yet to yield the type of results that many believe to be achievable. For example, we’d like to unleash an unsupervised learning algorithm on all web images and then obtain a representation that captures the various factors of variation we know to be present (e.g. objects and people). One popular approach for this is to train a model that assumes a high-level vector representation with independent components. However, despite a large body of literature on such models, this so-called disentangling of the factors of variation still seems beyond our reach.
In this short paper, the authors propose an alternative to this approach. They propose that disentangling might be achievable by learning a representation whose dimensions are each separately **controllable**, i.e. each has an associated policy which changes the value of that dimension **while leaving the other dimensions fixed**.
Specifically, the authors propose to minimize the following objective:
$\mathop{\mathbb{E}}_s\left[\frac{1}{2}||s-g(f(s))||^2_2 \right] - \lambda \sum_k \mathbb{E}_{s}\left[\sum_a \pi_k(a|s) \log sel(s,a,k)\right]$
where
- $s$ is an agent’s state (e.g. an image frame) which encoder $f$ and decoder $g$ learn to autoencode
- $k$ iterates over all dimensions of the representation space (output of encoder)
- $a$ iterates over actions that the agent can take
- $\pi_k(a|s)$ is the policy that is meant to control the $k^{\rm th}$ dimension of the representation space, $f_k(s)$
- $sel(s,a,k)$ is the selectivity of $f_k(s)$ relative to other dimensions in the representation, at state $s$:
$sel(s,a,k) = \mathop{\mathbb{E}}_{s'\sim {\cal P}_{ss'}^a}\left[\frac{|f_k(s')-f_k(s)|}{\sum_{k'} |f_{k'}(s')-f_{k'}(s)| }\right]$
Here ${\cal P}_{ss'}^a$ is the conditional distribution over the next step state $s'$ given that you are at state $s$ and are taking action $a$ (i.e. the environment transition distribution). One can see that selectivity is higher when the change $|f_k(s')-f_k(s)|$ in dimension $k$ is much larger than the change $|f_{k'}(s')-f_{k'}(s)|$ in the other dimensions $k'$. A directed version of selectivity is also proposed (and I believe was used in the experiments), where the absolute value function is removed and $\log sel$ is replaced with $\log(1+sel)$ in the objective.
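To make the selectivity formula concrete, here is a minimal sketch that evaluates it for a single observed transition, i.e. a one-sample estimate of the expectation over next states. The function names and the small `eps` guard against division by zero are my own assumptions, not something taken from the paper. For the directed variant I assume the absolute value is dropped from the numerator only, so that the value stays in $[-1, 1]$; the summary above does not spell this out.

```python
def selectivity(f_s, f_next, k, eps=1e-8):
    """Undirected selectivity of feature k for one transition s -> s'.

    f_s, f_next: encoder outputs f(s) and f(s') as lists of floats.
    eps: assumed guard against a zero denominator (not from the paper).
    """
    deltas = [abs(b - a) for a, b in zip(f_s, f_next)]
    return deltas[k] / (sum(deltas) + eps)


def directed_selectivity(f_s, f_next, k, eps=1e-8):
    """Directed variant, assuming only the numerator loses its absolute value."""
    deltas = [b - a for a, b in zip(f_s, f_next)]
    return deltas[k] / (sum(abs(d) for d in deltas) + eps)


# Only feature 0 changes: selectivity of feature 0 is close to 1.
print(selectivity([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], k=0))
# Both features change equally: selectivity of feature 0 is close to 0.5.
print(selectivity([0.0, 0.0], [1.0, 1.0], k=0))
# Feature 0 decreases on its own: directed selectivity is close to -1.
print(directed_selectivity([0.0, 0.0], [-1.0, 0.0], k=0))
```

In a stochastic environment one would average this quantity over several sampled next states $s'$ for the same $s$ and $a$.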
The learning objective will thus encourage the discovery of a representation that is informative of the input (in that you can reconstruct it) and for which there exist policies that separately control these dimensions.
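As a rough illustration of how the two terms trade off, the sketch below evaluates the objective for a single state, given a reconstruction and precomputed tables of policy probabilities and selectivities. The name `icf_objective`, the table layout, and the clamping of selectivity at `eps` before taking the log are assumptions made for this example only.

```python
import math


def icf_objective(s, s_hat, policies, sel, lam=1.0, eps=1e-8):
    """Single-state value of: reconstruction error - lam * selectivity term.

    s, s_hat:        input state and its reconstruction g(f(s)), lists of floats
    policies[k][a]:  pi_k(a|s), probability of action a under policy k
    sel[k][a]:       sel(s, a, k), selectivity of feature k under action a
    eps:             assumed lower clamp so that log(0) never occurs
    """
    reconstruction = 0.5 * sum((x - y) ** 2 for x, y in zip(s, s_hat))
    selectivity_term = 0.0
    for pi_k, sel_k in zip(policies, sel):
        for p, v in zip(pi_k, sel_k):
            selectivity_term += p * math.log(max(v, eps))
    return reconstruction - lam * selectivity_term


s = [1.0, 2.0]
sel = [[1.0, 0.0], [0.0, 1.0]]  # action 0 moves only feature 0, action 1 only feature 1
matched = [[1.0, 0.0], [0.0, 1.0]]  # each policy picks "its" action
uniform = [[0.5, 0.5], [0.5, 0.5]]  # policies ignore which feature they control

print(icf_objective(s, s, matched, sel))
print(icf_objective(s, s, uniform, sel))
```

With perfect reconstruction, the matched policies give a lower objective than the uniform ones, which is the pressure toward one controllable dimension per policy. The expectation over $s$ would be approximated by averaging this value over sampled states.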
Algorithm 1 in the paper describes a learning procedure for optimizing this objective. In brief, for every update, a state $s$ is sampled, from which an update for the autoencoder part of the loss can be made. Then, iterating over each dimension $k$, the policy $\pi_k$ is used to reach a next state $s'$, and REINFORCE is used to get a gradient estimate of the selectivity part of the loss, which updates both the policy $\pi_k$ and the encoder $f$.
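The policy half of this procedure can be sketched on a toy problem. This is not Algorithm 1 itself: I assume a fixed, hand-written encoder (so there is no autoencoder or encoder update), a hypothetical unbounded 2D grid with four moves, one state-independent softmax policy per feature, and a smoothing constant `log_eps` inside the log so the reward stays finite. All names are illustrative.

```python
import math
import random

ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # right, left, up, down on a toy grid


def encode(state):
    # Assumed fixed encoder: the features are simply the (x, y) coordinates.
    return [float(state[0]), float(state[1])]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def selectivity(f_s, f_next, k, eps=1e-8):
    deltas = [abs(b - a) for a, b in zip(f_s, f_next)]
    return deltas[k] / (sum(deltas) + eps)


def train_policies(num_features=2, steps=500, lr=0.1, log_eps=1e-3, seed=0):
    rng = random.Random(seed)
    # One vector of action logits per feature (state-independent for brevity).
    logits = [[0.0] * len(ACTIONS) for _ in range(num_features)]
    for _ in range(steps):
        # Each update draws a fresh state, independently of the previous one.
        s = (rng.randint(-5, 5), rng.randint(-5, 5))
        for k in range(num_features):
            probs = softmax(logits[k])
            a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
            s_next = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
            # Reward is the (smoothed) log-selectivity of feature k.
            reward = math.log(selectivity(encode(s), encode(s_next), k) + log_eps)
            # REINFORCE: d log pi(a) / d logits = onehot(a) - probs.
            for i in range(len(ACTIONS)):
                grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
                logits[k][i] += lr * reward * grad_log_pi
    return [softmax(row) for row in logits]


policies = train_policies()
for k, probs in enumerate(policies):
    print(k, [round(p, 3) for p in probs])
```

In this toy setting the policy for feature 0 ends up concentrating on the horizontal moves and the policy for feature 1 on the vertical ones. The paper's procedure additionally backpropagates the same signal into the encoder, which this sketch deliberately leaves out.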
**My two cents**
I find this concept very appealing and thought-provoking. Intuitively, the idea that valuable features are those reflecting an aspect of our environment that we can control seems to me more sensible, and possibly less constraining, than an assumption of independent features. There is also an interesting analogy with an infant learning about the world by interacting with it.
The caveat is that unfortunately, this concept is currently fairly impractical, since it requires an interactive environment where an agent can perform actions, something we can’t easily have short of deploying a robot with sensors. Moreover, the proposed algorithm seems to assume that each state $s$ is sampled independently for each update, whereas a robot would observe a dependent stream of states.
Accordingly, the experiments in this short paper are mostly “proof of concept”, on simplistic synthetic environments. Yet they do a good job of illustrating the idea.
To me this means that there’s more interesting work worth doing in what seems to be a promising direction!

Good summary. I am also working on disentangling factors of variation, from a different direction. Can you shed some light on why, as discussed in the paper, we should not expect any form of disentanglement if we can replace $f$ and $g$ with $r \circ f$ and $r^{-1} \circ g$, where $r$ is a bijective function?

Good question. Actually the comment in the paper is that if you replace $f$ by $r \circ f$, i.e. $r(f(x))$, and $g$ by $g \circ r^{-1}$, i.e. $g(r^{-1}(h))$ (I think there's a typo in the paper for the latter), then $g(r^{-1}(r(f(x)))) = g(f(x))$ and thus you get the same reconstruction error. So that suggests that the reconstruction error objective by itself doesn't impose axis-aligned disentanglement, since any bijective function could entangle the representation while keeping the reconstruction the same.
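A tiny numeric check may make this concrete. Everything here is a made-up toy: a slightly lossy encoder `f`, an identity decoder `g`, and a bijection `r` that mixes the two coordinates. None of it comes from the paper.

```python
def f(x):
    # Hypothetical lossy encoder: shrinks each coordinate slightly.
    return (0.9 * x[0], 0.9 * x[1])


def g(h):
    # Hypothetical decoder: identity, so the reconstruction is imperfect.
    return (h[0], h[1])


def r(h):
    # A bijection that mixes both coordinates.
    return (h[0] + h[1], h[0] - h[1])


def r_inv(z):
    return ((z[0] + z[1]) / 2.0, (z[0] - z[1]) / 2.0)


def f_entangled(x):
    return r(f(x))


def g_entangled(z):
    return g(r_inv(z))


def reconstruction_error(x, encoder, decoder):
    x_hat = decoder(encoder(x))
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))


x = (1.0, 2.0)
err_original = reconstruction_error(x, f, g)
err_entangled = reconstruction_error(x, f_entangled, g_entangled)
print(err_original, err_entangled)

# Change only x[0] and see which representation coordinates move.
delta_plain = [b - a for a, b in zip(f((1.0, 2.0)), f((2.0, 2.0)))]
delta_mixed = [b - a for a, b in zip(f_entangled((1.0, 2.0)), f_entangled((2.0, 2.0)))]
print(delta_plain, delta_mixed)
```

The two reconstruction errors agree up to floating-point rounding, yet changing a single input factor moves one coordinate of `f(x)` and both coordinates of `r(f(x))`. Reconstruction alone cannot tell these two representations apart, which is why an extra term such as selectivity is needed.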
Hope this helps!