Can you just join your RDDs instead of pulling all of a single RDD down to the driver?

The problem is that if your data is 5 GB today, you may be able to increase your memory size enough to hold it, but what happens when the data grows to 10 GB, or even 100 GB? Pulling all of your data down to the driver node isn't going to scale.