Continuous Galerkin (CG) algorithms are well established in the numerical methods community and have been widely used in CFD software, as they offer good single-node performance owing to their compact problem size per node when compared with discontinuous Galerkin (DG) methods [5]. However, they suffer from complex inter-node communication patterns, since any elements meeting at a vertex or an edge must communicate. Consequently, some degrees of freedom require a reduction across many nodes, particularly when the mesh is unstructured. Conversely, DG methods lead to a minimal pairwise communication pattern, since elements are coupled only through faces [6]. However, this comes at the expense of greater computational cost on each individual node. Achieving exascale performance therefore requires an approach that offers good single-node performance, including the use of accelerators and coprocessors, while minimizing the communication cost.
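The difference between the two communication patterns can be illustrated with a minimal sketch. It assumes a structured grid of hexahedral elements, which is a simplification made here for illustration only; on an unstructured mesh the number of elements meeting at a vertex is not fixed and can be considerably larger. The function name `neighbour_counts` and the idea of treating every element as if it were owned by a separate process (a worst case) are likewise assumptions of this sketch, not part of any particular solver.

```python
from itertools import product


def neighbour_counts(nx, ny, nz, i, j, k):
    """Count the neighbours of hexahedron (i, j, k) in an nx*ny*nz structured grid.

    Returns a pair (cg, dg):
      cg -- elements sharing at least one vertex, edge or face (CG-style coupling)
      dg -- elements sharing a face only (DG-style coupling)
    """
    cg = 0
    dg = 0
    # Visit the 3x3x3 block of index offsets around the element.
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        if (di, dj, dk) == (0, 0, 0):
            continue  # skip the element itself
        ni, nj, nk = i + di, j + dj, k + dk
        if not (0 <= ni < nx and 0 <= nj < ny and 0 <= nk < nz):
            continue  # offset falls outside the grid
        cg += 1  # any adjacent offset shares at least a vertex
        if abs(di) + abs(dj) + abs(dk) == 1:
            dg += 1  # exactly one index differs: a shared face
    return cg, dg


if __name__ == "__main__":
    print(neighbour_counts(3, 3, 3, 1, 1, 1))  # interior element -> (26, 6)
    print(neighbour_counts(3, 3, 3, 0, 0, 0))  # corner element   -> (7, 3)
```

Under these assumptions, an interior element is coupled to 26 neighbours through vertices, edges and faces, but to only 6 through faces, which gives a rough sense of why the face-only pattern is simpler to communicate.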