Regarding your problem: there isn’t any specific optimization for bigger datasets.

I ran TPC-H Q1 on tpch-40 and tpch-100. The execution times aren’t stellar (I guess this is one of the few queries that runs better on CPU): the response times are 1100 ms and 2400 ms respectively using 2 GPUs, while the CPU run is faster (1289 ms on tpch-100 using a few cores).

Maybe you are running into trouble when the queries run on GPU: depending on the DDL, tpch-100 can use over 20 GB of memory, so the query is probably falling back to CPU.
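To give a feeling for how much the DDL matters, here is a rough back-of-the-envelope sketch in Python. It only counts the seven lineitem columns that Q1 touches, and the byte widths in both layouts are my assumptions for illustration (they are not taken from your schema or from measured values), as is the rounded figure of about 6 million lineitem rows per scale-factor unit.

```python
# Rough estimate of the memory needed by the lineitem columns that TPC-H Q1 reads.
# All byte widths below are assumptions for illustration, not measured values.

LINEITEM_ROWS_PER_SF = 6_000_000  # approximate lineitem rows per scale-factor unit


def q1_footprint_gb(scale_factor, bytes_per_column):
    """Return the estimated size in GB (10^9 bytes) of the given columns."""
    rows = scale_factor * LINEITEM_ROWS_PER_SF
    bytes_per_row = sum(bytes_per_column.values())
    return rows * bytes_per_row / 1e9


# Hypothetical "wide" layout: 8-byte decimals and dates, 4-byte dictionary ids.
wide = {
    "l_quantity": 8, "l_extendedprice": 8, "l_discount": 8, "l_tax": 8,
    "l_returnflag": 4, "l_linestatus": 4, "l_shipdate": 8,
}

# Hypothetical "narrow" layout: 4-byte decimals and dates, 1-byte dictionary ids.
narrow = {
    "l_quantity": 4, "l_extendedprice": 4, "l_discount": 4, "l_tax": 4,
    "l_returnflag": 1, "l_linestatus": 1, "l_shipdate": 4,
}

if __name__ == "__main__":
    for name, layout in (("wide", wide), ("narrow", narrow)):
        print(f"{name}: {q1_footprint_gb(100, layout):.1f} GB at scale factor 100")
```

Under these assumptions the wide layout needs more than twice the memory of the narrow one for the same query, which could be the difference between fitting in GPU memory and falling back to CPU.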

Could you share the DDL of your table? My lineitem DDL is the following:

You can find out exactly where the query ran (GPU or CPU) and which step took the most time.

In the example, the query ran on two GPUs (two threads with launchGpuCode), and the bottleneck appears to be the first GPU thread, which took 2400 ms (this is odd, since the data is balanced between the two GPUs).