First time trying to implement a clean production workflow using R, sparklyr, and AWS technologies (mainly ec2, s3, and Redshift). For simplicity sake, I'm hoping to keep everything inside R. I would like some advice around what architecture makes the most sense.

Essentially, the modeling workflow looks something like:

execute Redshift query to clean up/manipulate some raw data

unload result set into s3

load data into either HDFS or directly into memory

execute spark job fitting model using sparklyr

write results back to s3

What I'm struggling with is whether to go with

a standalone Spark ec2 cluster, where I think I'm reading data directly off of s3, OR

an EMR cluster with Spark installed, where I'm loading data into hadoop

From an automated production workflow perspective, it seems much cleaner to go with the former, since I can avoid writing a hadoop job using an R package to load data. However, I'm wondering if there are performance benefits I'm losing out on by not storing data in hadoop. Additionally, I'm not really sure what is actually happening in option 1- is data somehow being loaded straight from s3 into memory? Lastly, if I'm going with first option, is it overkill to still use an EMR cluster in order to avoid configuring Spark manually on a set of ec2 instances?

Sparklyr can read data directly from s3 into Spark memory, see ?sparklyr::read_csv for details. You would not gain much copying the data into HDFS first and then loading it into memory (you would still need to use a sparklyr::read function against HDFS or setup a hive metastore).