There is growing interest in artificially intelligent technology that surrounds us and makes decisions on our behalf. Such technology therefore needs to communicate with humans and to understand the natural language and non-verbal behaviour that may carry information about our complex physical world. Artificial agents today still have little knowledge of the physical space that surrounds us or of the objects and concepts within our attention. We still lack computational methods for understanding the context of human conversation that involves the objects and locations around us. Can we use multimodal cues from human perception of the real world as examples from which robots learn language? Can artificial agents and robots learn about the physical world by observing how humans interact with it, refer to it, and attend to it during their conversations? This PhD project focuses on combining spoken language and non-verbal behaviour extracted from multi-party dialogue in order to increase context awareness and spatial understanding in artificial agents.