Abstract: We present a conditional generative model that maps low-dimensional
embeddings of multiple data modalities to a common latent space, thereby
capturing the semantic relationships between them. The modality-specific
embeddings are first extracted, and a constrained optimization procedure then
projects the two embedding spaces onto a common manifold. The individual
embeddings are reconstructed from this common latent space. However, to enable
independent conditional inference, i.e., separately recovering the
corresponding embeddings from the common latent representation, we employ a
proxy variable trick, wherein the single shared latent space is replaced by
separate latent spaces, one for each modality. We design an objective function
such that, during training, these separate spaces are forced to lie close to
each other by minimizing the distance between their probability distributions.
Experimental results demonstrate that the learned joint model generalizes to
concepts of double MNIST digits with additional color attributes, from both
textual and speech inputs.
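
One plausible instantiation of such an objective (a sketch under our own assumptions, not necessarily the exact formulation used in the paper) combines a per-modality reconstruction term with a penalty on the distance between the two modality-specific latent distributions; the symbols $q_m$, $p_m$, $z_m$, $x_m$, $D$, and $\lambda$ below are hypothetical and introduced only for illustration:

\[
\mathcal{L} \;=\; \sum_{m \in \{1,2\}} \mathbb{E}_{q_m(z_m \mid x_m)}\big[\log p_m(x_m \mid z_m)\big] \;-\; \lambda\, D\big(q_1(z_1 \mid x_1),\, q_2(z_2 \mid x_2)\big),
\]

where the first term encourages faithful reconstruction of each modality's embedding $x_m$ from its own latent variable $z_m$, $D$ (e.g., a KL divergence or another distributional distance) penalizes disagreement between the two latent distributions, and $\lambda$ controls the strength of the alignment.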