⬅️ Day 2 – Introduction to Computer Vision
In the last chapter, we looked at how Computer Vision can be applied to image processing. You can check my GitHub repository for updates. Today we’ll see how we can improve the model further by using convolutions to detect features in an image.
What are Convolutions?
A convolution is simply a filter (or kernel) of weights. It multiplies a pixel and its neighbors by those weights and sums the results to get a new value for the pixel, thereby producing a new grid of pixel values. These filters are randomly initialized.
As an example, let’s take the ankle boot image from the Fashion MNIST dataset and look at its pixel values.

If you consider the highlighted pixels in the above image, you can see that the middle value is 192. Let’s say we want to perform a kernel convolution on this pixel, whose value is 192. We define a filter in the same 3 x 3 grid as shown below and multiply each pixel in the neighborhood by the value in the corresponding position in the filter grid.

Summing up the products gives 577, which becomes the new value of the pixel in the image. Repeating this process for each pixel in the image gives us the new, filtered image.
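To make this concrete, here is a minimal sketch of a single kernel convolution step in NumPy. The neighborhood and filter values below are hypothetical, chosen only to illustrate the multiply-and-sum operation (and so that the arithmetic lands on the 577 mentioned above); the actual image and filter values may differ.

import numpy as np

# A hypothetical 3 x 3 neighborhood of pixel values, centered on 192
neighborhood = np.array([[0,   64,  128],
                         [48,  192, 144],
                         [142, 226, 168]])

# A hypothetical 3 x 3 filter of weights
kernel = np.array([[-1.0, 0.0, -2.0],
                   [0.5,  4.5, -1.5],
                   [1.5,  2.0, -3.0]])

# Multiply each pixel by the weight in the same position, then sum the products
new_value = np.sum(neighborhood * kernel)
print(new_value)  # 577.0 – this becomes the new value of the center pixel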
A CNN generally consists of an input layer, an output layer, and hidden layers that include multiple convolutional layers, pooling layers, fully connected layers, and normalization layers.
Pooling
Pooling is the process of reducing the size of an image by summarizing regions of pixels. The image below depicts the idea behind max pooling. To perform max pooling we need a grid, which defines the pool size, and a stride.

In the example above, we have grouped the 16 pixels of a monochrome image into 2 x 2 arrays. Altogether there are four 2 x 2 arrays, or grids, which are called pools.
You can see that the maximum value in each pool is selected and assembled into a new image. This concept is known as max pooling.
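As a quick illustration, here is how max pooling with 2 x 2 pools and a stride of 2 could be expressed in NumPy. The 4 x 4 pixel values are hypothetical.

import numpy as np

# A hypothetical 4 x 4 monochrome image
image = np.array([[0,   64,  128, 32],
                  [48,  192, 144, 16],
                  [142, 226, 168, 0],
                  [255, 0,   64,  96]])

# Group the pixels into four 2 x 2 pools and keep the maximum of each
pooled = image.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[192 144]
#  [255 168]]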
Next, let’s apply what we’ve learned to the neural network we designed earlier for the Fashion MNIST dataset. Below is the code from our earlier lesson on computer vision.
import tensorflow as tf

data = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = data.load_data()

# Normalize pixel values to the range [0, 1]
training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(training_images, training_labels, epochs=10)
In order to convert this to a convolutional neural network, we add convolutional layers to the model definition.
To implement a convolutional layer, the tf.keras.layers.Conv2D layer type is used. Its key parameters are the number of convolutions (filters) in the layer, the size of each filter, and the activation function.
import tensorflow as tf

data = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = data.load_data()

# Reshape to add the channels dimension, then normalize to [0, 1]
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(training_images, training_labels, epochs=50)
model.evaluate(test_images, test_labels)

classifications = model.predict(test_images)
print(classifications[0])
print(test_labels[0])
Unlike in the earlier lesson, this time we reshape the image data first. The reason is that the convolutional layer’s input_shape parameter expects an additional third dimension describing the color depth of the image: 1 for grayscale images and 3 for color images stored as values of R, G, and B.
Therefore, prior to normalizing the images, we reshape the training and testing image arrays to have the extra dimension, as shown below.
training_images = training_images.reshape(60000, 28, 28, 1)
test_images = test_images.reshape(10000, 28, 28, 1)
In this example, we have used a convolutional layer as the input layer of the neural network:
tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1))
Here, 64 filters are randomly initialized, and over time the model learns the filter values that work best to match the input images to their labels. The size of each filter is indicated by (3, 3). Filter dimensions are typically odd numbers, and (3, 3) is the most common size.
You can also see that the input_shape parameter has a third dimension to indicate the color depth, as discussed earlier.
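If you want to confirm this, you can peek at the convolutional layer’s weights after the model above is defined. This is just a quick sketch; the kernel tensor holds one 3 x 3 grid per input channel for each of the 64 filters.

# Inspect the randomly initialized filters of the first convolutional layer
kernels, biases = model.layers[0].get_weights()
print(kernels.shape)  # (3, 3, 1, 64): 3 x 3 filters, 1 input channel, 64 filters
print(biases.shape)   # (64,): one bias per filter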
Here’s how we use the pooling layer in our neural network.
tf.keras.layers.MaxPooling2D(2, 2)
The above shows that we are splitting the image into 2 x 2 pools and picking the maximum value in each. The first argument sets the pool size to 2 x 2 and the second sets the stride to 2, so each dimension of the image is halved.
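You can verify the effect on the image size by passing a dummy tensor through a standalone pooling layer; this small sketch shows a 28 x 28 input being halved to 14 x 14.

import tensorflow as tf

# A dummy batch containing one 28 x 28 single-channel image
dummy = tf.zeros((1, 28, 28, 1))
pooled = tf.keras.layers.MaxPooling2D(2, 2)(dummy)
print(pooled.shape)  # (1, 14, 14, 1): each spatial dimension is halved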
You can see that using this neural network and training on the same data for 50 epochs yields a higher accuracy of 99.45%, compared to the earlier accuracy of 90.99%. It is clear that convolutional neural networks increase our ability to classify images. You can use model.summary() to further inspect the model.
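For instance, calling model.summary() prints each layer’s output shape. Tracing them by hand: a 3 x 3 convolution trims one pixel from each border (28 becomes 26), and each pooling layer halves the dimensions, rounding down.

model.summary()
# Expected output shapes, layer by layer:
# Conv2D       -> (None, 26, 26, 64)  # 28 - 2 border pixels lost to the 3 x 3 filter
# MaxPooling2D -> (None, 13, 13, 64)  # halved by the 2 x 2 pools
# Conv2D       -> (None, 11, 11, 64)
# MaxPooling2D -> (None, 5, 5, 64)    # 11 / 2 rounds down to 5
# Flatten      -> (None, 1600)        # 5 * 5 * 64 values
# Dense        -> (None, 128)
# Dense        -> (None, 10)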
In the next chapter let’s explore a dataset comprising color images and see how convolutions can identify features in them. Happy coding! 😃🔥