I’m no expert in AI/ML. I’m trying to self-learn and haven’t built anything significant, until now!
To learn something, I need a problem to apply and practice what we learned. I was missing that. I didn’t have a use case for anything related to AI/ML and I wanted a pet project for AI/ML.
The beginnings
Yeah! Let’s talk about that.
I started with this – Basic classification: Classify images of clothing. Yeah, why not. Jump into the deep end and figure out the basics as you progress. That has been my way of learning new things. This time, it did not work. My effort to learn ML and build a basic classification system spans 2-3 years 🤷🏾♂️. At the pace at which technology progresses, that’s a lifetime.
Anyways, back to the beginnings. I tried giving it a shot again after reading about the basics. Even enrolled in a Coursera course for ML. But I couldn’t see the course through as it got boring, and I failed to find applications for what was taught. Nothing worked. AI/ML started to look like taking on The Nihilanth; too difficult and complex, and try as I may, I couldn’t understand the basics.
December 2020 – Secret Santa 🎅🏻
Every year for Christmas, we play Secret Santa at the office. COVID-induced lockdowns and travel restrictions meant that we all stayed at home and ordered gifts for each other. When the time came to reveal the Secret Santas, our HR sent each of us a letter grid with the name of our Santa in it. This was mine.
You should know that I’m lazy. I won’t sit around and find who my Santa is by searching for a name from a list of all the employees, when I can write a program to do that for me! 💡
OCR using Tesseract
The problem looks like an OCR – optical character recognition – one. I decided to write a program using an OCR library and see if that works. Tesseract is what i used.
The program reads the image and calling pytesseract.image_to_string()
method looked very promising. Until I ran the program. This was my output for the above image.
moet AX 2k Poe Pos Cop yw FE I 38 Ged TOP Aes sd: Deod VON 4 OF FB Geo Pe os Ok yok ¥ ge VN Te 2 SE Ek I oe wse D ge -PJZR 5 sqgvwwYyeaA He t | POM Yee fcr Hee x se a RSW RSE eB eh See FEO I LF FF ewe ED Ke ti SS oF Le WR. ¥ Pg oe 2d Pow ey Poo Fee p QW Gx acoH asa BoC rev Gow 4 2 SC Te Beog Oo pew Peep Ok £256 Dex Goh 5 “ET ££ Bor Pp oa RIS Tp Pp Vouk Sb Ow k go Acw BSH Fat ¥ Woe Gow X42 Cn YOR Zoek OFb m te ko WoW a
🤬 That doesn’t look even close to what I had thought would be the output.
As it turns out, computers have difficulty understanding images (duh!). It’s difficult for a computer (program) to look at a random image and extract the text from it. We can make out shapes and boundaries. Programs can’t. They need help.
One of the first things we humans did was to introduce special fonts that computers can easily recognize. These are called OCR fonts (https://en.wikipedia.org/wiki/OCR-A), which looks like this.
The keen-eyed ones among you would recognize this as the same font used on credit cards.
So far, in my attempt to find my Secret Santa, I would say I got moderate success using Tesserat. Don’t get me wrong. The output is garbage. But by trying to find out why my program didn’t work, I read about and learned about OCR. Win-win!
What are my options now?
There’s only one option I could see; I could create an image classification model. I can follow the Basic classification tutorial and retrofit it for finding my Secret Santa!
Training
I have to train a model to identify each letter in the image.
I won’t train the model using the letters from the image on which I want to run and get the results. That is cheating. Training the model on the letters from the same image would cause the model not to work when correctly we have to run it against other similar images.
Data
To train the model, we need a bunch of training data. The training data should be images of individual letters, and this data has to be “labeled”. What’s labeling? It’s a way of tagging images so that when the model looks at it, it knows which letter each of these images looks like. The simplest way to tag an image for training is to drop it into a folder by that name.
Since we can’t use the image above, I had to create the data from all the other images that my HR had created for others. Called her up and got the whole set as a big PDF.
The Build
Step 1: Reading the PDF
Step 1 was to read the PDF and convert each slide into images. I used the pdf2image
library for this. The code for this can be found here readPDFAndGenerateImages.py
. I picked my slide and two other slides for testing, and all other slides were used for building the training data.
slidesToSkip = [2, 23, 34]
These test images were named TEST-XX.jpg
, and all other images were SLIDE-XX.jpg
Step 2: Generating letter-images
I now had to find a way to crop these images and generate individual letter images from them. Since all the images had the same dimensions, I could apply the same cropping logic to them.
All images have the exact dimensions
Crop images in steps to generate letter images.
I’ve used OpenCV for image manipulation, and you can find the code here 02-slicer.py. Once the program ran, I was left with 6970 images. 😐 Yup, just shy of 7k images. Yikes!
Step 3: Labeling the images
I could open each letter image and move them to the corresponding folder. But that’s too time-consuming. I wrote a very simple Nodejs-Express program, and built and hosted a static HTML page. The page loads one letter image at a time, and I can click on the corresponding letter button, which makes a backend API call to move the image to the right folder. It program looked like this.
Easy, right? Wrong!
I’m too lazy to sit and label all 6970 images. Time to outsource! Where can I find cheap labor? Ahh yes, kids! I have two at home. I can coerce them to do this for me. In exchange, I can pay them for time and effort.
I offered them ₹100/- each for helping me classify these images. The older one took a look at the app and showed no interest. The younger one, on the other had told me 100 is too less. Her rate was ₹ 500/- (wtf!). I wasn’t going to fall for that. We renegotiated and settled for ₹200/-. We shook hands and she got down to labeling the images.
She was impressed with the web app that I had made and helped me classify around 1500 images. After which she got bored and gave up. I guess the ₹200 was well spent.
Step 4: Training the model – Part 1
1500 images across 52 sets of alphabets are roughly 28-29 images per letter. Not a big training set. Let’s see where it gets us. Unfortunately I don’t have the code nor the results from this run. This was almost an year back. Had I known that I would be writing about it, I would have kept the code i used for training. If I remember correctly, I got really good accuracy and low rate of loss too. It almost looked like I was close to finding my Santa.
I tested it out, and the results were not what I had hoped for. It couldn’t recognize a
from b
and n
from m
.
Let’s debug!
But try as I may, I couldn’t find the problem. It was time to get some professional help. I pinged a colleague who works in our ML/data science department and asked him to take a look. I explained to him what I was trying to do and explained to him the problem that I was facing. He looked at it, and in a day’s time, he could identify what was going wrong.
The problem
Let’s take the letter a
. We are asking the model to look at a bunch of images and “learn” how the letter a
would look like. We can then ask this program to recognize any image of a
as a
(Image1). The issue is with these images itself. If you zoom into one of these images, you will notice that it’s not sharp. To put it differently, the region that forms a
is not just black, and the region around a
in the image is not all white. We have to cut down on noise! We have to “clean” all these images.
Image 1
Image 2
Step 5: Clean up
Let’s learn a few basics of images and how colors are represented. A pixel in an 8-bit image has 3 values or 3 channels – red, green, and blue. If the pixel value is [0,0,0]
it’s black, and [255,255,255]
represents white. Our image has a range of pixels whose value is between white and black. In other words, we have a bunch of grey pixels.
To clean this image, we will convert any pixel whose value is greater than 200 to 255 and anything less than 200 to 0. Apply this, and we get a sharper image.
greaterThanGrayPixels = np.where((
(image[:, :, 0] > 200) &
(image[:, :, 1] > 200) &
(image[:, :, 2] > 200)
))
image[greaterThanGrayPixels] = [255, 255, 255]
lessThanGrayPixels = np.where((
(image[:, :, 0] < 200) &
(image[:, :, 1] < 200) &
(image[:, :, 2] < 200)
))
image[lessThanGrayPixels] = [0, 0, 0]
This also opens up another possibility. We do not need a whole lot of images to train our model. We can generate the training data or training images. Thanks again to my friend who told me about this trick. We can now pick an image of **a**
and generate more images by shifting it a few pixels in multiple directions.
The code that does that can be found here [03-imageProcessing.py](https://github.com/jerrymannel/ml_findsanta/blob/main/03-imageProcessing.py)
.
Step 6: Training the model – Part 2
We now have a lot more images to train with, and each of these images are sharper too. Here are the code and the results from training the model on the new set of images.
model = Sequential([
layers.Rescaling(1./255, input_shape=(img_height, img_width, 3), name="rescaleImage"),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Dropout(0.2),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes, name="output")
], name='imageClassification')
Excerpt from [04-personalTrainer.ipynb](https://github.com/jerrymannel/ml_findsanta/blob/main/04-personalTrainer.ipynb)
Accuracy – 0.9975 (99.75%)
Step 7: Test
I picked an image from the set of slides that I had skipped and set aside for testing. The results looked very promising.
Success!
Finding Santa
The rest of the process was pretty straightforward. I have the list of colleagues that took part in Secret Santa, and the program to find their names can be summarised as,
- Generate the character matrix for the image
- For each of the names in the list, scan in both directions
- Each row
- Each column
- Diagonally starting from the top-left -> bottom-right
- Diagonally starting from the top-right -> bottom-left
Results
My slide was 02, and when I run it through the program, I got the answer as ARUSHI.
For slide 23, the result was KAVI.
But for slide 34, we get no results. So the model is not that perfect 🤓. The model read an “o” as an “a”.
The whole process took me close to 2 years. This meant that I had no idea that my Santa was ARUSHI till the next to next Christmas; mainly because the whole let’s-build-a-classification-model-to-find-my-secret-santa was a side project, and I was only doing it whenever I had time.
The learnings were worth it though! Keep hacking!
./J