Structured Vectors for Chinese Word Representations

Abstract—Word representations have been a key reason for the success of many NLP tasks. Much work has focused on improving how word representations are learned, and most approaches treat words as atomic units. In some languages such as Chinese, however, words cannot always be recognized correctly, which degrades the ability of word embeddings to capture semantic information. This paper addresses this shortcoming by proposing structured embeddings for word representations. Our method composes word embeddings from sub-word and atomic-unit embeddings. We build structured vectors for Chinese word representations based on this method and evaluate on SemEval-2012 Task 4: Measuring Chinese word similarity. The results show that our method is remarkably effective in capturing semantic information and outperforms the previous best performance by a large margin. The method extends naturally to other languages that lack a trivial word segmentation process.
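As a rough illustration of the idea of combining atomic-unit and sub-word embeddings, the sketch below composes a Chinese word vector from a word-level vector and its character vectors. The embedding tables, the averaging composition, and the equal weighting are assumptions for illustration only; the paper's actual composition function and training procedure may differ.

```python
import numpy as np

DIM = 4
rng = np.random.default_rng(0)

# Toy embedding tables; in practice these would be learned from a corpus.
word_emb = {"智能": rng.normal(size=DIM)}
char_emb = {"智": rng.normal(size=DIM), "能": rng.normal(size=DIM)}


def structured_vector(word):
    """Combine the atomic word vector with the mean of its sub-word
    (character) vectors. Equal weighting is an illustrative assumption."""
    chars = [char_emb[c] for c in word if c in char_emb]
    sub = np.mean(chars, axis=0) if chars else np.zeros(DIM)
    return 0.5 * (word_emb.get(word, np.zeros(DIM)) + sub)


v = structured_vector("智能")
print(v.shape)
```

Because the character vectors contribute even when the word itself was mis-segmented or unseen, the composed vector retains some semantic signal that a purely word-level embedding would lose.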