Loading objects only once

Hello all,

I am a new user of Spark, so please bear with me if this has been discussed before.

I am trying to run batch inference with Spark using pre-trained models from DL frameworks. Basically, I want to download a model (usually ~500 MB) onto each worker, load it, and run inference on images fetched from a source such as S3.
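
Something like the sketch below is what I mean (the model URL, the torchvision preprocessing, and the S3 path are placeholders I made up; I am assuming a PyTorch-style model). As written, predict reloads the ~500 MB model for every record, which is exactly the problem:

import io
import urllib.request

import torch
from PIL import Image
from pyspark.sql import SparkSession
from torchvision import transforms

MODEL_URL = "https://example.com/model.pt"   # placeholder URL
LOCAL_PATH = "/tmp/model.pt"

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

def predict(image_bytes):
    # Naive version: every record re-downloads and re-loads the ~500 MB
    # model, which is what I am trying to avoid.
    urllib.request.urlretrieve(MODEL_URL, LOCAL_PATH)
    model = torch.load(LOCAL_PATH)  # assumes a fully pickled model
    model.eval()
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    tensor = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        return model(tensor).tolist()

spark = SparkSession.builder.getOrCreate()
images = spark.sparkContext.binaryFiles("s3a://my-bucket/images/*.jpg")
results = images.mapValues(predict).collect()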

After a lot of trial and error, I moved the code into a separate module and created a static predict method that checks whether a class variable is set, loading the model if it is not. This approach does not sound thread-safe to me, so I wanted to reach out and see whether there are established patterns for achieving something like this.
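
Concretely, the pattern I ended up with looks roughly like this (a sketch rather than my exact code; the local model path is a placeholder):

import torch

class ModelHolder:
    # Caches the loaded model in a class variable so each Python worker
    # process loads it at most once.
    _model = None

    @staticmethod
    def get_model():
        # Lazy initialization: load on first use, reuse afterwards. This
        # check-then-set is not guarded by a lock, which is why I am
        # unsure about thread safety.
        if ModelHolder._model is None:
            model = torch.load("/tmp/model.pt")  # placeholder path
            model.eval()
            ModelHolder._model = model
        return ModelHolder._model

    @staticmethod
    def predict(tensor):
        with torch.no_grad():
            return ModelHolder.get_model()(tensor)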

Also, I would like to understand the executor -> task -> Python process mapping: does each task get mapped to a separate Python process? The reason I ask is that I want to use the mapPartitions method to load a batch of files and run inference on them, for which I need to load the object once per task; what I have in mind is sketched below. Any pointers would be appreciated.
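
For reference, the mapPartitions version I have in mind would look roughly like this, reusing ModelHolder and the preprocess transform from the sketches above (again, just a sketch under those assumptions):

def run_inference(partition):
    # Load (or fetch the cached) model once per task, then reuse it for
    # every image in the partition.
    model = ModelHolder.get_model()
    for path, image_bytes in partition:
        img = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        tensor = preprocess(img).unsqueeze(0)
        with torch.no_grad():
            yield path, model(tensor).tolist()

images = spark.sparkContext.binaryFiles("s3a://my-bucket/images/*.jpg")
predictions = images.mapPartitions(run_inference).collect()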
