how to split up tf.data.Dataset into x_train, y_train, x_test, y_test for keras

Written by Aionlinecourse

In this article, we will discuss how to split up a tf.data.Dataset into x_train, y_train, x_test, y_test for Keras.


Solution:

The following approach does the separation for you. First, create a directory, say c:\train. Inside it, create one subdirectory per class. For example, if you have images of dogs and images of cats and want to build a classifier that distinguishes between them, create two subdirectories within the train directory: name one cats and the other dogs. Place all the cat images in the cats subdirectory and all the dog images in the dogs subdirectory. Now assume you want to use 75% of the images for training and 25% for validation. The code below creates a training set and a validation set.
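The class-per-subdirectory layout described above can be sketched in a few lines of Python (the location under a temporary directory is illustrative; the article itself uses c:\train):

```python
import os
import tempfile

# Illustrative location; substitute your own path such as c:\train
base = os.path.join(tempfile.mkdtemp(), "train")

# One subdirectory per class, as described above
for class_name in ("cats", "dogs"):
    os.makedirs(os.path.join(base, class_name))

print(sorted(os.listdir(base)))  # ['cats', 'dogs']
```

Your image files for each class then go inside the matching subdirectory.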

import tensorflow as tf

train_batch_size = 50  # set the training batch size you desire
valid_batch_size = 50  # set this so that 0.25 * total_samples / valid_batch_size is an integer
data_dir = r'c:\train'
img_size = 224  # set this to the desired image size you want to use

# Note: when validation_split is used, both calls must be given the same seed
# so the training and validation subsets do not overlap.
train_set = tf.keras.preprocessing.image_dataset_from_directory(
    directory=data_dir, labels='inferred', label_mode='categorical', class_names=None,
    color_mode='rgb', batch_size=train_batch_size, image_size=(img_size, img_size),
    shuffle=True, seed=123, validation_split=0.25, subset='training',
    interpolation='nearest', follow_links=False)
valid_set = tf.keras.preprocessing.image_dataset_from_directory(
    directory=data_dir, labels='inferred', label_mode='categorical', class_names=None,
    color_mode='rgb', batch_size=valid_batch_size, image_size=(img_size, img_size),
    shuffle=False, seed=123, validation_split=0.25, subset='validation',
    interpolation='nearest', follow_links=False)

With labels='inferred', the labels are taken from the names of the subdirectories; in the example they would be cats and dogs. With label_mode='categorical', the label data are one-hot vectors, so when you compile your model set loss='categorical_crossentropy'. Note that shuffle is set to True for the training set and False for the validation set. When you build your model, the top layer should have 2 nodes and the activation should be softmax.

When you use model.fit to train your model, it is desirable to go through your validation set exactly once per epoch. Say in the dog-cat example you have 1000 dog images and 1000 cat images, 2000 in total: 75% (1500) will be used for training and 500 for validation. If you set valid_batch_size=50, it will take 10 steps to go through all the validation images once per epoch. Similarly, with train_batch_size=50 it will take 30 steps to go through the training set. When you run model.fit, set steps_per_epoch=30 and validation_steps=10.

Personally, I prefer to use tf.keras.preprocessing.image.ImageDataGenerator for generating data sets. It is similar but more versatile: it allows you to specify a preprocessing function if you wish, and also allows you to rescale your image values. Typically you want to use 1/255 as the rescale value.
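The step counts above follow directly from the sample counts and batch sizes. A quick sketch of the arithmetic, using the hypothetical 2000-image dog/cat example from the text:

```python
# Hypothetical counts from the dog/cat example above
total_samples = 2000
train_fraction = 0.75

train_samples = int(total_samples * train_fraction)  # 1500
valid_samples = total_samples - train_samples        # 500

train_batch_size = 50
valid_batch_size = 50

# Steps needed to see every sample exactly once per epoch
steps_per_epoch = train_samples // train_batch_size   # 30
validation_steps = valid_samples // valid_batch_size  # 10
print(steps_per_epoch, validation_steps)  # 30 10
```

These are the values you would pass to model.fit as steps_per_epoch and validation_steps.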

If you just want to split up data that is already in arrays, you can use train_test_split from scikit-learn. The code below shows how to separate it into a training set, a validation set, and a test set. Assume you want 80% of the data for training, 10% for validation, and 10% for test, and that X is a NumPy array of images and y is the associated array of labels.

from sklearn.model_selection import train_test_split

# First split off 80% for training; the remaining 20% is split in half below
X_train, X_tv, y_train, y_tv = train_test_split(X, y, train_size=0.8, random_state=42)
# Split the remaining 20% evenly into 10% test and 10% validation
X_test, X_valid, y_test, y_valid = train_test_split(X_tv, y_tv, train_size=0.5, random_state=20)
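If scikit-learn is not available, the same two-stage 80/10/10 split can be sketched with a shuffled index list in plain Python (the sample count and seed here are illustrative):

```python
import random

# Illustrative data: 100 samples with dummy labels
X = list(range(100))
y = [i % 2 for i in X]

indices = list(range(len(X)))
random.Random(42).shuffle(indices)  # deterministic shuffle

n_train = int(0.8 * len(X))  # 80 samples for training
n_valid = int(0.1 * len(X))  # 10 for validation; the remaining 10 are for test

train_idx = indices[:n_train]
valid_idx = indices[n_train:n_train + n_valid]
test_idx = indices[n_train + n_valid:]

X_train = [X[i] for i in train_idx]
y_train = [y[i] for i in train_idx]
X_valid = [X[i] for i in valid_idx]
y_valid = [y[i] for i in valid_idx]
X_test = [X[i] for i in test_idx]
y_test = [y[i] for i in test_idx]

print(len(X_train), len(X_valid), len(X_test))  # 80 10 10
```

Every sample lands in exactly one of the three subsets, which is the same guarantee the two train_test_split calls provide.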

Thank you for reading the article.