Across domains, the scale of data and complexity of models have both been
increasing greatly in the recent years. For many models of interest, tractable learning and inference without access to expensive computational resources have become challenging. In this thesis, we approach efficient learning and inference through the leverage of sparse structures inherent in the learning objective, which allows us to develop algorithms sublinear in the size of parameters without compromising the accuracy of models. In particular, we address the following three questions for each problem of interest: (a) how to formulate model estimation as an optimization problem with tractable sparse structure, (b) how to efficiently, i.e. in sublinear time, search, maintain, and utilize the sparse structures during training and inference, (c) how to guarantee fast convergence of our optimization algorithm despite its greedy nature? By answering these questions, we develop state-of-the-art algorithms in varied domains. Specifically, in the extreme classification domain, we utilizes primal and dual sparse structures to develop greedy algorithms of complexity sublinear in the number of classes, which obtain state-of-the-art accuracies on several benchmark data sets with one or two orders of magnitude speedup over existing algorithms. We also apply the primal-dual-sparse theory to develop a state-of-the-art trimming algorithm for Deep Neural Networks, which sparsifies neuron connections of a DNN with a task-dependent theoretical guarantee, which results in models of smaller storage cost and faster inference speed. When it comes to structured prediction problems (i.e. graphical models) with inter-dependent outputs, we propose decomposition methods that exploit sparse messages to decompose a structured learning problem of large output domains into factorwise learning modules amenable to sublineartime optimization methods, leading to practically much faster alternatives to existing learning algorithms. The decomposition technique is especially effective when combined with search data structures, such as those for Maximum Inner-Product Search (MIPS), to improve the learning efficiency jointly. Last but not the least, we design
novel convex estimators for a latent-variable model by reparameterizing it as a solution of sparse support in an exponentially high-dimensional space, and approximate it with a greedy algorithm, which yields the first polynomial-time approximation method for the Latent-Feature Models and Generalized Mixed Regression without restrictive data assumptions.