This paper presents an efficient decomposition scheme for the hardware-efficient realization of the discrete cosine transform (DCT) based on distributed arithmetic. We propose an efficient design for implementing cyclic convolution based on a group distributed arithmetic (GDA) technique, in which the read-only memory size is reduced relative to the existing GDA-based design. The proposed structure for DCT implementation, based on the new decomposition scheme and the proposed design of GDA-based cyclic convolution, involves significantly less area complexity than the existing one. For example, to implement the DCT of transform length $N = 17$, the proposed design needs a lookup table of 128 words, while the existing design for $N = 16$ requires a lookup table of 256 words. The synthesis results show that the proposed design involves significantly less area, gives higher throughput, and consumes less power than the existing designs of nearly the same or lower lengths.
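
For readers unfamiliar with distributed arithmetic, the following minimal Python sketch illustrates the generic technique only: an inner product with fixed coefficients is computed bit-serially from a lookup table of precomputed partial sums, so that no multiplier is needed. It is not the proposed GDA structure or decomposition scheme. The function names `build_lut` and `da_inner_product`, the integer coefficients, and the 4-bit word length are illustrative assumptions.

```python
def build_lut(coeffs):
    """Precompute every partial sum of the fixed coefficients.

    Entry `addr` holds the sum of coeffs[i] over all i whose bit is set
    in `addr`, so K coefficients need a table of 2**K words.
    """
    k = len(coeffs)
    return [
        sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
        for addr in range(1 << k)
    ]


def da_inner_product(lut, xs, bits):
    """Inner product of the fixed coefficients with the inputs `xs`.

    Each input is treated as a `bits`-bit two's-complement integer.
    One table lookup is made per bit position; the lookup for the sign
    bit is subtracted, all others are added with weight 2**b.
    """
    mask = (1 << bits) - 1
    words = [x & mask for x in xs]  # two's-complement encoding of each input
    acc = 0
    for b in range(bits):
        # Form the table address from bit b of every input word.
        addr = 0
        for i, w in enumerate(words):
            addr |= ((w >> b) & 1) << i
        term = lut[addr] << b
        acc += -term if b == bits - 1 else term
    return acc


# Hypothetical example values, chosen only for illustration.
coeffs = [3, -5, 7, 2]
xs = [1, -2, 3, -4]  # each fits in 4-bit two's complement (-8..7)
lut = build_lut(coeffs)
result = da_inner_product(lut, xs, bits=4)
assert result == sum(c * x for c, x in zip(coeffs, xs))
```

In this sketch the table grows as $2^K$ with the number of inputs $K$ that address it, which is why reducing the number of address bits per table directly reduces the read-only memory size.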