How to use data generators in TensorFlow
Why?
Believe it or not, loading the entire dataset into memory is NOT the best idea.
With a small dataset it might work, but it is a waste of resources. Worse, with a huge dataset like ImageNet it won't work at all.
Python generators are lazy: they are iterables that produce the data on request, unlike regular lists, which keep all of the data in memory the whole time.
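To make the difference concrete, here is a minimal sketch in plain Python. The names `load_all` and `load_lazily` and the fake "loading" step are made up for illustration; a real loader would read files from disk instead of building strings.

```python
def load_all(n_items):
    """Eager: builds the whole list in memory before returning."""
    return [f"item_{i}" for i in range(n_items)]


def load_lazily(n_items, batch_size):
    """Lazy: yields one batch at a time, only when asked for it."""
    for start in range(0, n_items, batch_size):
        stop = min(start + batch_size, n_items)
        # only this batch exists in memory at this point
        yield [f"item_{i}" for i in range(start, stop)]


batches = load_lazily(n_items=10, batch_size=4)
first_batch = next(batches)  # nothing was built before this call
print(first_batch)           # ['item_0', 'item_1', 'item_2', 'item_3']
```

Calling `load_lazily` does no work by itself; each batch is built only when `next` (or a `for` loop) asks for it.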
TensorFlow's Keras API has a Sequence class (tensorflow.keras.utils.Sequence) that can be used for this purpose. See the Sequence class API reference.
Let's jump into it.
Working with images is a good example for this. Let's say you have pictures of objects that you need to localize.
Your features are the images and your labels are (x, y, w, h): the coordinates and the dimensions of the bounding box. The labels and the image file names are stored in a CSV file.
image_file | x | y | w | h |
file1.png | 10 | 20 | 50 | 50 |
... | ... | ... | ... | ... |
... | ... | ... | ... | ... |
Let's define an initializer. It takes the information needed to find the data:
- The CSV file
- The directory containing all of the images
It also takes the parameters that determine the output shape of each batch:
- The output size of each image
- The batch size
import os

import numpy as np
import pandas as pd
import matplotlib.image as mpimg
from tensorflow.keras.utils import Sequence


class DataGenerator(Sequence):
    def __init__(self, csv_file, base_dir, output_size, shuffle=False, batch_size=10):
        """
        Initializes a data generator object
        :param csv_file: file in which image names and numeric labels are stored
        :param base_dir: the directory in which all images are stored
        :param output_size: image output size after preprocessing
        :param shuffle: shuffle the data after each epoch
        :param batch_size: The size of each batch returned by __getitem__
        """
        self.df = pd.read_csv(csv_file)
        self.base_dir = base_dir
        self.output_size = output_size
        self.shuffle = shuffle
        self.batch_size = batch_size
        # sets self.indices before the first epoch starts
        self.on_epoch_end()
Now let's define some special methods, starting with the one called in the initializer: on_epoch_end().
As the name suggests, Keras calls it after each epoch.
We also call it in the initializer because the indices attribute has to be set before the first epoch begins. Otherwise we would get an error telling us that the object has no attribute "indices".
    def on_epoch_end(self):
        # start from the original row order, then reshuffle if requested
        self.indices = np.arange(len(self.df))
        if self.shuffle:
            np.random.shuffle(self.indices)
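The reshuffling can be tried in isolation. This is a small sketch, not part of the generator itself: the `make_indices` helper and the fixed seed are assumptions added here so the run is reproducible.

```python
import numpy as np


def make_indices(n_samples, shuffle, rng):
    """Return the order in which samples are visited during one epoch."""
    indices = np.arange(n_samples)
    if shuffle:
        rng.shuffle(indices)  # in place; the data itself is never moved
    return indices


rng = np.random.default_rng(seed=0)  # seeded only for reproducibility
for epoch in range(3):
    # every epoch visits the same 8 samples, in a different order
    print(epoch, make_indices(8, shuffle=True, rng=rng))
```

Shuffling an array of row numbers is cheap, and it leaves the DataFrame untouched, so labels and file names stay aligned.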
Now we need to define the length of the data. It is not the number of entries, as you might think, but the number of batches. Python's built-in len function needs to be able to read it, so we define the __len__ method.
    def __len__(self):
        # number of complete batches; a trailing partial batch is dropped
        return len(self.df) // self.batch_size
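Note that rounding down drops the last, incomplete batch whenever the number of rows is not a multiple of the batch size. That is convenient here, because every batch then has the same shape. The sketch below compares the two options; the function names are made up for this example.

```python
import math


def n_full_batches(n_samples, batch_size):
    """Number of complete batches; leftover samples are skipped."""
    return n_samples // batch_size


def n_batches_with_remainder(n_samples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)


print(n_full_batches(105, 10))            # 10 (5 samples are skipped)
print(n_batches_with_remainder(105, 10))  # 11 (the last batch has 5 samples)
```

If you keep the smaller last batch, __getitem__ has to return arrays sized to the number of indices it actually received rather than to the batch size.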
Now let's get serious. The fun part is the next method, __getitem__.
Python calls this method when the object is indexed, as in data_generator[0], and passes the index to it as a parameter. Here we call it idx.
Slicing, as in data_generator[1:3], also goes through __getitem__, but it passes a slice object, which the implementation below does not handle. Keras itself asks for batches by integer index.
In this method we load and preprocess the images.
It only runs when Keras asks for a batch, which is what saves memory.
You might think splitting this into multiple functions would be a good idea... and you'd be totally right.
The method should return one preprocessed batch of data.
    def __getitem__(self, idx):
        ## Initializing Batch
        # the trailing 1 in the shape is for single-channel images
        # if you want to use colored images you might want to set that to 3
        X = np.empty((self.batch_size, *self.output_size, 1))
        # one row of (x, y, w, h) per image, in the CSV column order
        y = np.empty((self.batch_size, 4))

        # get the indices of the requested batch
        indices = self.indices[idx*self.batch_size:(idx+1)*self.batch_size]

        for i, data_index in enumerate(indices):
            img_path = os.path.join(self.base_dir,
                                    self.df.iloc[data_index, 0])
            img = mpimg.imread(img_path)
            ## this is where you preprocess the image
            ## make sure to resize it to be self.output_size

            label = self.df.iloc[data_index, 1:].to_numpy()
            ## if you have any preprocessing for
            ## the labels too do it here

            # add the channel axis; assumes img is single-channel
            # and was resized to self.output_size above
            X[i,] = img.reshape(*self.output_size, 1)
            y[i] = label

        return X, y
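The preprocessing step is left open above. As one possible way to fill it in, here is a NumPy-only sketch that turns an image array into the (height, width, 1) shape the batch expects. The `preprocess_image` helper is hypothetical, and the nearest-neighbour resize and channel averaging are simplifications chosen to avoid extra dependencies; a real pipeline would more likely use an image library for resizing.

```python
import numpy as np


def preprocess_image(img, output_size):
    """Convert an (H, W) or (H, W, C) array to shape (*output_size, 1)."""
    img = np.asarray(img, dtype=np.float32)
    if img.ndim == 3:
        # average the first three channels (ignores alpha if present)
        img = img[..., :3].mean(axis=-1)
    out_h, out_w = output_size
    # nearest-neighbour resize: pick a source row/column for each output pixel
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    resized = img[rows][:, cols]
    return resized[..., np.newaxis]  # add the channel axis


fake_img = np.random.rand(300, 400, 3)  # stand-in for mpimg.imread(img_path)
print(preprocess_image(fake_img, (244, 244)).shape)  # (244, 244, 1)
```

Keep in mind that if you resize the images, the box coordinates in the labels need the same scaling, otherwise they no longer point at the object.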
Now you are ready to fit the model to this generator. You can also easily make a validation generator and validate your model against it: just create a new instance of the DataGenerator class with the validation CSV and base directory, and you're good to go. That's why I love OOP.
from tensorflow.keras.models import Sequential

model = Sequential([
    ## define the model's architecture
])

train_gen = DataGenerator("data.csv", "data", (244, 244), batch_size=20, shuffle=True)

## compile the model first of course

# now let's train the model
model.fit(train_gen, epochs=5)  # plus any other fit() arguments you need

# note you could also make a validation generator and pass it here like normal datasets

# in older TensorFlow versions you had to do this instead
# model.fit_generator(train_gen, ...)
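It can help to see roughly what happens to the generator during training. The sketch below is a simplified mental model in plain Python, not Keras's actual implementation: `ToyGenerator` and `toy_fit` are made-up stand-ins that only show which methods get called and when.

```python
class ToyGenerator:
    """Stand-in with the same three methods as the DataGenerator."""

    def __init__(self, n_samples, batch_size):
        self.n_samples = n_samples
        self.batch_size = batch_size
        self.calls = []  # records method calls so we can inspect them

    def __len__(self):
        return self.n_samples // self.batch_size

    def __getitem__(self, idx):
        self.calls.append(f"batch {idx}")
        start = idx * self.batch_size
        return list(range(start, start + self.batch_size))

    def on_epoch_end(self):
        self.calls.append("epoch end")


def toy_fit(generator, epochs):
    """Ask for one batch at a time, then signal the end of each epoch."""
    for _ in range(epochs):
        for idx in range(len(generator)):
            batch = generator[idx]  # a real loop would run a training step here
        generator.on_epoch_end()


gen = ToyGenerator(n_samples=6, batch_size=3)
toy_fit(gen, epochs=2)
print(gen.calls)
# ['batch 0', 'batch 1', 'epoch end', 'batch 0', 'batch 1', 'epoch end']
```

The point is that the training loop never sees the whole dataset, only whatever __getitem__ returns for the batch it is currently working on.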
The complete code
import os

import cv2
import numpy as np
import pandas as pd
import matplotlib.image as mpimg
from tensorflow.keras.models import Sequential
from tensorflow.keras.utils import Sequence


class DataGenerator(Sequence):
    def __init__(self, csv_file, base_dir, output_size, shuffle=False, batch_size=10):
        """
        Initializes a data generator object
        :param csv_file: file in which image names and numeric labels are stored
        :param base_dir: the directory in which all images are stored
        :param output_size: image output size after preprocessing
        :param shuffle: shuffle the data after each epoch
        :param batch_size: The size of each batch returned by __getitem__
        """
        self.df = pd.read_csv(csv_file)
        self.base_dir = base_dir
        self.output_size = output_size
        self.shuffle = shuffle
        self.batch_size = batch_size
        # sets self.indices before the first epoch starts
        self.on_epoch_end()

    def on_epoch_end(self):
        self.indices = np.arange(len(self.df))
        if self.shuffle:
            np.random.shuffle(self.indices)

    def __len__(self):
        # number of complete batches; a trailing partial batch is dropped
        return len(self.df) // self.batch_size
    def __getitem__(self, idx):
        ## Initializing Batch
        # the trailing 1 in the shape is for single-channel images
        # if you want to use colored images you might want to set that to 3
        X = np.empty((self.batch_size, *self.output_size, 1))
        # one row of (x, y, w, h) per image, in the CSV column order
        y = np.empty((self.batch_size, 4))

        # get the indices of the requested batch
        indices = self.indices[idx*self.batch_size:(idx+1)*self.batch_size]

        for i, data_index in enumerate(indices):
            img_path = os.path.join(self.base_dir,
                                    self.df.iloc[data_index, 0])
            img = mpimg.imread(img_path)
            img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)  # to reduce it to one channel to match the shape
            ## this is where you preprocess the image
            ## make sure to resize it to be self.output_size

            label = self.df.iloc[data_index, 1:].to_numpy()
            ## if you have any preprocessing for
            ## the labels too do it here

            # add the channel axis; assumes img was resized to self.output_size above
            X[i,] = img.reshape(*self.output_size, 1)
            y[i] = label

        return X, y


## Defining and training the model
model = Sequential([
    ## define the model's architecture
])

train_gen = DataGenerator("data.csv", "data", (244, 244), batch_size=20, shuffle=True)

## compile the model first of course

# now let's train the model
model.fit(train_gen, epochs=5)  # plus any other fit() arguments you need
And that's it. You've just created a dataset generator that loads the data into memory batch by batch instead of all at once.
I hope this was useful.