This study compares the performance of high-order discontinuous Galerkin finite elements on modern hardware. The main computational kernel is the matrix-free evaluation of differential operators by sum factorization, exemplified on the symmetric interior penalty discretization of the Laplacian as a proxy for complex applications in fluid dynamics. State-of-the-art implementations of these kernels stress both arithmetic and memory transfer. The implementation of SIMD vectorization and shared-memory parallelization is detailed. Computational results are presented for a dual-socket Intel Haswell system with 28 cores, a 64-core Intel Knights Landing system, and a 16-core IBM Power8 system. For moderate polynomial degrees between two and six, the Knights Landing machine is approximately twice as fast as the Haswell system. One core of Haswell is also considerably faster than a Power8 core. For our code, parallelism expressed through for loops performs better on medium to high core counts than task-based parallelism with dynamic scheduling according to dependency graphs, despite less memory transfer in the latter algorithm.
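
As a minimal illustration of the sum factorization idea named above, the following sketch applies a one-dimensional matrix along each direction of a two-dimensional tensor-product coefficient array and compares the result with the naive Kronecker-product evaluation. It is not the implementation studied here: the function names `apply_sum_factorized_2d` and `apply_kronecker_naive_2d`, the row-major storage, and the restriction to two dimensions without SIMD or threading are assumptions made for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): applies the 1D matrix S (n x n,
// row-major) along both directions of the coefficient array u (n x n,
// stored as u[i1 * n + i0]). This equals the product with the Kronecker
// matrix S (x) S, but needs O(n^3) instead of O(n^4) operations.
std::vector<double> apply_sum_factorized_2d(const std::vector<double> &S,
                                            const std::vector<double> &u,
                                            std::size_t n)
{
  assert(S.size() == n * n && u.size() == n * n);
  std::vector<double> tmp(n * n, 0.0), out(n * n, 0.0);

  // Direction 0: tmp(i1, q0) = sum_{i0} S(q0, i0) * u(i1, i0)
  for (std::size_t i1 = 0; i1 < n; ++i1)
    for (std::size_t q0 = 0; q0 < n; ++q0)
      for (std::size_t i0 = 0; i0 < n; ++i0)
        tmp[i1 * n + q0] += S[q0 * n + i0] * u[i1 * n + i0];

  // Direction 1: out(q1, q0) = sum_{i1} S(q1, i1) * tmp(i1, q0)
  for (std::size_t q1 = 0; q1 < n; ++q1)
    for (std::size_t i1 = 0; i1 < n; ++i1)
      for (std::size_t q0 = 0; q0 < n; ++q0)
        out[q1 * n + q0] += S[q1 * n + i1] * tmp[i1 * n + q0];

  return out;
}

// Reference evaluation with the full Kronecker product, O(n^4) operations:
// out(q1, q0) = sum_{i1, i0} S(q1, i1) * S(q0, i0) * u(i1, i0)
std::vector<double> apply_kronecker_naive_2d(const std::vector<double> &S,
                                             const std::vector<double> &u,
                                             std::size_t n)
{
  assert(S.size() == n * n && u.size() == n * n);
  std::vector<double> out(n * n, 0.0);
  for (std::size_t q1 = 0; q1 < n; ++q1)
    for (std::size_t q0 = 0; q0 < n; ++q0)
      for (std::size_t i1 = 0; i1 < n; ++i1)
        for (std::size_t i0 = 0; i0 < n; ++i0)
          out[q1 * n + q0] += S[q1 * n + i1] * S[q0 * n + i0] * u[i1 * n + i0];
  return out;
}
```

In three dimensions the same pattern uses three one-dimensional sweeps, so the cost per element scales as O(n^4) rather than O(n^6) for the naive product; this reduction is what makes the kernel sensitive to both arithmetic throughput and memory transfer.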