
In the past, for the majority of multiclass/binary image classification problems, I used to feed images efficiently using ImageDataGenerator and .flow_from_directory in Keras, after organizing the images into a separate directory for each class. Therefore, I never bothered converting images to numpy.array prior to feeding them to the model, unless I had to; and of course the datasets were small enough that I could do it easily on my local machine.

However, CelebA is a multi-label image classification problem, with each image having 40 labels (attributes like Smiling, Eyeglasses, Young, etc.), meaning that I cannot organize the images into one subdirectory per class as I used to, so .flow_from_directory is off the table (as far as I know!). Still, I've managed to convert the images to numpy.array with a simple loop along the lines sketched below.
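A minimal sketch of such a conversion loop (the names images_path and image_names are placeholders, not from the original post; the 64x64 resize matches the generator in the answer below):

import os
import numpy as np
from skimage import io, transform

# Read every image, scale pixel values to [0, 1], resize to 64x64
# and stack the results into one big numpy array.
X = np.array([
    transform.resize(io.imread(os.path.join(images_path, fname)) / 255., (64, 64))
    for fname in image_names
])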

Well, it was not an impossible task. The CelebA dataset is large, though not super large compared to many other image datasets (>200K RGB images, 1.4 GB in total, each image ~8 KB). Yet it takes a surprisingly long time to convert these images to numpy arrays, and my machine even got stuck during the run of a small CNN model.

My computer specs: MacBook Pro (2015), Memory: 8 GB, Hard disk: 128 GB.

Even with more than 4 GB of free memory and 20 GB of free disk space, I could not manage this on my local machine.

UPDATE: It seems quite possible, and way more efficient, via the .flow_from_directory method of ImageDataGenerator in Keras. While that is not really an option for the multi-label classification at hand, I simply made some dummy subdirectories and it worked; the model runs much faster!! (A sketch of the workaround follows.)
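Roughly, the workaround looks like this (a sketch; the directory name celeba_dummy/ and the parameter values are assumptions):

from keras.preprocessing.image import ImageDataGenerator

# One dummy subdirectory per "class" under celeba_dummy/ is enough to
# satisfy .flow_from_directory; the real 40 attribute labels have to be
# supplied separately.
datagen = ImageDataGenerator(rescale=1./255)
train_gen = datagen.flow_from_directory(
    'celeba_dummy/',        # root folder containing the dummy subfolders
    target_size=(64, 64),
    batch_size=32,
    class_mode=None)        # don't infer labels from the dummy folders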

To My Questions (finally!!):

Maybe I am not doing the image-to-numpy conversion efficiently? Please advise on how I could construct the arrays more efficiently!

UPDATE: I found a very similar question on Stack Overflow from a year ago, yet the answers there do not seem to offer any better alternative.

Maybe it is what it is and I only need better hardware to do it locally!?

How does Keras manage to do the conversion efficiently under the hood?

For the time being (as we speak), I have sampled only 20% of the images to get at least a model prototype up and running, although the accuracy is not impressive!!

Sounds like you've solved your problem, but what was likely happening is that you were running out of memory. While they take up only 1.4 GB of space on your hard drive, they're compressed as images, whereas numpy stores the raw data. A generator or a Sequence is definitely the right way to do it.
– mpotma, Jun 13 '18 at 8:57
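A back-of-the-envelope check of the comment above (assuming the standard $178 \times 218$ aligned RGB crops and float64 arrays): one decoded image needs $178 \cdot 218 \cdot 3 \cdot 8 \approx 0.93$ MB, so $\sim$200K images would need roughly $186$ GB of RAM; even resized to $64 \times 64$ they still need $64 \cdot 64 \cdot 3 \cdot 8 \cdot 200{,}000 \approx 19.7$ GB, well beyond 8 GB of memory.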

Exactly! I've started liking generators; they are smart solutions to many big data issues. ;-)
– TwinPenguins, Jun 13 '18 at 10:31

While it's probably equally brand new to you, you should also check out the Keras Sequence, seen here: keras.io/utils/#sequence. They're thread-safe and provide more control. I'm expecting the generator will be phased out in favour of the Sequence.
– mpotma, Jun 13 '18 at 11:37

Thanks a lot. I had never thought of the Keras Sequence for this purpose. I will give it a go.
– TwinPenguins, Jun 13 '18 at 13:04
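For completeness, a minimal sketch of what such a Sequence could look like for this dataset (the class name is hypothetical; df and images_path follow the layout used in the answer below):

import os
import numpy as np
import skimage.transform
from skimage import io
from keras.utils import Sequence

class CelebASequence(Sequence):
    '''Thread-safe batch loader; df and images_path as in the answer below.'''
    def __init__(self, df, images_path, batch_size):
        self.df = df
        self.images_path = images_path
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch.
        return int(np.ceil(self.df.shape[0] / self.batch_size))

    def __getitem__(self, idx):
        chunk = self.df.iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        X = np.array([
            skimage.transform.resize(
                io.imread(os.path.join(self.images_path, fname)) / 255., (64, 64))
            for fname in chunk['Images']
        ])
        y = chunk.drop('Images', axis=1).values
        return X, y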

1 Answer

I ended up writing a Python generator, which actually works very well, for manually feeding the desired number of images chunk by chunk into my CNN model, like this:

import os
import numpy as np
import skimage.transform
from skimage import io

def image_batch_generator(df, images_path, batch_size):
    '''
    A generator that takes a dataframe (for the image names) and,
    given an image path, converts the images to numpy arrays
    batch by batch (chunk by chunk).
    "df": the "Attribute Annotations" text file from the CelebA dataset.
          It has one column for image names and another 40 (binary)
          attribute columns for each image.
    "images_path": the path to the CelebA image files.
    "batch_size": the batch size by which images are read chunk by chunk.
    '''
    L = df.shape[0]
    files = df['Images'].tolist()
    # The outer loop makes the generator infinite; Keras needs that.
    while True:
        batch_start = 0
        batch_end = batch_size
        while batch_start < L:
            limit = min(batch_end, L)
            # Read, normalize to [0, 1] and resize each image in the chunk.
            X = np.array([
                skimage.transform.resize(
                    io.imread(os.path.join(images_path, fname)) / 255., (64, 64))
                for fname in files[batch_start:limit]
            ])
            # The matching 40 attribute columns for the same images.
            y = df.loc[df['Images'].isin(files[batch_start:limit]), :] \
                  .drop(['Images'], axis=1).values
            yield (X, y)  # a tuple of two numpy arrays with batch_size samples
            batch_start += batch_size
            batch_end += batch_size
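A sketch of how the generator might then be consumed (attr_df, the path and the hyperparameters are placeholders; fit_generator was the Keras API at the time):

import numpy as np

batch_size = 64
steps = int(np.ceil(attr_df.shape[0] / batch_size))  # batches per epoch

gen = image_batch_generator(attr_df, 'img_align_celeba/', batch_size)
model.fit_generator(gen, steps_per_epoch=steps, epochs=5)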