leftDf.top_k_join(k: Column, rightDf: DataFrame, joinExprs: Column, score: Column) joins each leftDf record with only the top-k records of rightDf, ranked by score, that satisfy the join condition joinExprs. The output schema of this operation is the joined schema of leftDf and rightDf plus two columns, (rank: Int, score: the type of score).
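
The semantics described above can be modelled with plain Scala collections. This is only a minimal sketch for a positive k and does not call Hivemall or Spark: the names `TopKJoinSketch`, `User`, `Item`, `Joined`, and `topKJoin` are hypothetical, and the sample data is made up for illustration.

```scala
// Plain-collection model of a top-k join; not the Hivemall implementation.
object TopKJoinSketch {
  // Hypothetical row types standing in for leftDf and rightDf records.
  case class User(userId: Int, group: String)
  case class Item(itemId: Int, group: String, weight: Double)
  // Output row: the joined records plus the (rank, score) columns.
  case class Joined(rank: Int, score: Double, user: User, item: Item)

  def topKJoin(
      k: Int,
      users: Seq[User],
      items: Seq[Item],
      joinCond: (User, Item) => Boolean,
      score: (User, Item) => Double): Seq[Joined] = {
    require(k > 0, "this sketch covers only a positive k")
    users.flatMap { u =>
      items
        .filter(i => joinCond(u, i))        // keep rows that satisfy the join condition
        .map(i => (score(u, i), i))         // compute the score for each candidate
        .sortBy { case (s, _) => -s }       // highest score first
        .take(k)                            // keep at most k rows per left record
        .zipWithIndex
        .map { case ((s, i), idx) => Joined(idx + 1, s, u, i) } // rank starts at 1
    }
  }

  // Made-up sample data.
  val users: Seq[User] = Seq(User(1, "a"), User(2, "b"))
  val items: Seq[Item] = Seq(
    Item(10, "a", 0.3), Item(11, "a", 0.9), Item(12, "a", 0.5),
    Item(20, "b", 0.1))

  def main(args: Array[String]): Unit = {
    val result = topKJoin(2, users, items, (u, i) => u.group == i.group, (_, i) => i.weight)
    result.foreach(println)
  }
}
```

With this sample data and k = 2, user 1 is joined with items 11 and 12 (ranks 1 and 2) and user 2 with item 20 (rank 1); item 10 never appears in the output.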

top_k_join is much more I/O-efficient than a regular join followed by a ranking operation, because it drops records that do not qualify and writes only the top-k records to disk during the join.

Caution

top_k_join is supported for DataFrames in Spark v2.1.0 or later.

The type of score must be ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, or DecimalType.

If k is less than 0, the order is reversed and top_k_join joins the tail-k records of rightDf, that is, the |k| records at the bottom of the ranking.
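
The effect of the sign of k can be sketched for the candidates of a single leftDf record. This is a plain-collection illustration, not Hivemall code: `SignedKSketch` and `selectByK` are hypothetical names, and the sketch assumes that for a negative k rank 1 goes to the lowest score, which the text above does not state explicitly.

```scala
// Illustrates how the sign of k picks the head or the tail of the ranking.
object SignedKSketch {
  // Hypothetical helper: given (score, record) candidates for one left row,
  // returns (rank, score, record) for the |k| selected records.
  def selectByK[A](k: Int, candidates: Seq[(Double, A)]): Seq[(Int, Double, A)] = {
    val ordered =
      if (k >= 0) candidates.sortBy { case (s, _) => -s } // descending: top-k
      else candidates.sortBy { case (s, _) => s }         // ascending: tail-k (assumed ranking)
    ordered
      .take(math.abs(k))
      .zipWithIndex
      .map { case ((s, a), idx) => (idx + 1, s, a) }
  }

  def main(args: Array[String]): Unit = {
    // Made-up scores for three right-side records.
    val candidates = Seq((0.3, "x"), (0.9, "y"), (0.5, "z"))
    println(selectByK(2, candidates))  // the two highest scores: y, then z
    println(selectByK(-2, candidates)) // the two lowest scores: x, then z
  }
}
```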