Once upon a time, when we had simpler questions like “What is the problem?”, structured datasets were used to report numbers. Fast forward a few decades, and we face more complex questions, such as “Why is this problem happening?”, and with complex questions come complex, unstructured datasets. Neural networks come to the rescue, letting machines learn and resolve these complexities for us, scaling with each level of complexity.
One such complex problem is handwriting recognition: imagine writing on paper or a tablet and having it translated into typed text, with no retyping. Imagine not racking your brain to decipher a doctor’s handwriting. Imagine a child with dysgraphia, a condition that results in poor handwriting, no longer struggling in the classroom.
All of this is possible with the handwriting recognition tool, which classifies text from an image. The tool has a graphical user interface where, inside a canvas, a user can write any English word freehand, and the model at the backend recognizes the word. For this tool, a Multi-Layer Perceptron (MLP) classifier with the Adam solver and the logistic sigmoid activation function was used to achieve strong results.
Dataset
The dataset used was created by the National Institute of Standards and Technology (NIST). NIST Special Database 19 consists of roughly 0.7 million sample PNG images. The current model has been trained only on uppercase letters (A-Z). The following table shows the number of observations per character:
Table 1: Number of Observations per Character
A: 7,010 | Q: 2,566 | g: 3,839 | w: 2,699 |
B: 4,091 | R: 4,536 | h: 9,713 | x: 2,820 |
C: 2,792 | S: 23,827 | i: 2,788 | y: 5,088 |
D: 4,945 | T: 10,927 | j: 1,920 | z: 2,726 |
E: 5,420 | U: 14,146 | k: 2,562 | 0: 34,803 |
F: 10,203 | V: 4,951 | l: 16,937 | 1: 38,049 |
G: 2,575 | W: 5,026 | m: 2,634 | 2: 34,184 |
H: 3,271 | X: 2,731 | n: 12,856 | 3: 35,293 |
I: 13,179 | Y: 2,359 | o: 2,761 | 4: 33,432 |
J: 3,962 | Z: 2,698 | p: 2,401 | 5: 31,067 |
K: 2,473 | a: 11,196 | q: 3,115 | 6: 34,037 |
L: 5,390 | b: 5,551 | r: 15,934 | 7: 35,796 |
M: 10,027 | c: 11,315 | s: 2,698 | 8: 33,884 |
N: 9,149 | d: 11,421 | t: 20,793 | 9: 33,720 |
O: 28,680 | e: 28,299 | u: 2,837 |
P: 9,277 | f: 2,493 | v: 2,854 |
Preprocessing
Each character in the original dataset occupies a 128×128-pixel raster (Fig 1a). To avoid heavy computation, the images were reduced to 56×56 pixels (Fig 1b). Furthermore, the canvas was reduced to 28×28 pixels by removing the padding (Fig 1c), resulting in a 784-feature dataset. Each character was labelled sequentially from “A” to “Z”.
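The reduction above can be sketched in NumPy. This is a minimal illustration, not the paper’s actual script: it assumes nearest-neighbor downsampling for the 128×128 → 56×56 step and a fixed 14-pixel border crop for the padding removal (the characters are centered in SD 19 rasters); the function name is illustrative.

```python
import numpy as np

def preprocess(img_128):
    """Turn a 128x128 character raster into the 784-feature vector."""
    # 128x128 -> 56x56 via nearest-neighbor index mapping
    idx = np.arange(56) * 128 // 56
    small = img_128[np.ix_(idx, idx)]
    # 56x56 -> 28x28 by removing an assumed 14-pixel padding border
    cropped = small[14:42, 14:42]
    # flatten to the 784-feature configuration
    return cropped.reshape(-1)

# example: a synthetic raster with an inked block in the center
img = np.zeros((128, 128))
img[40:90, 40:90] = 1
features = preprocess(img)   # shape (784,)
```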
The package ‘tkinter’ was used to create the canvas-like user interface. Once a user writes a word on the canvas, the tool converts the image into a 2-D NumPy array and traverses the array column-wise, looking for a filled pixel that marks the beginning of a letter. For words, the tool continues traversing, looking for a column region with significant relative blank space that marks the beginning of the next character; this lets it differentiate a break within a letter from the gap before a second letter.
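The column-wise segmentation described above can be sketched as follows. This is an assumed reconstruction, not the tool’s actual code: the function name and the `min_gap` threshold (the width of a blank-column run that counts as “significant relative blank space”) are illustrative.

```python
import numpy as np

def split_letters(canvas, min_gap=4):
    """Split a binary canvas (H x W, nonzero = ink) into per-letter
    column spans. A run of at least min_gap blank columns separates
    two letters; narrower blank runs are treated as breaks inside
    one letter (e.g. the gap in a sloppy 'H')."""
    filled = canvas.any(axis=0)          # True where a column has ink
    spans, start, blanks = [], None, 0
    for x, has_ink in enumerate(filled):
        if has_ink:
            if start is None:
                start = x                # first inked column of a letter
            end, blanks = x, 0
        elif start is not None:
            blanks += 1
            if blanks >= min_gap:        # wide blank run: letter ended
                spans.append((start, end + 1))
                start = None
    if start is not None:                # flush the final letter
        spans.append((start, end + 1))
    return spans

# two "letters" separated by a wide gap, one with a narrow internal break
canvas = np.zeros((10, 30), dtype=int)
canvas[:, 2:8] = 1
canvas[:, 4] = 0                         # narrow break inside letter 1
canvas[:, 20:25] = 1
spans = split_letters(canvas, min_gap=4)
```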
Challenges
While designing and creating this tool, several challenges were faced, highlighted below:
#1: The original dataset demanded heavy computational power to hyper-tune the model across different combinations. To overcome this limitation of the personal computer used to build the model, the dataset was split into batches of five characters (i.e. letters A-E, F-J, etc.), and the model was trained and tested on these batches.
#2: An imbalance in the number of observations among the characters in the dataset (for example, roughly 10k images for some letters versus 2.5k for others) caused complications during the training and testing phases. To overcome this, a script was created that split the test and train data for each character individually and later merged them to ensure a well-balanced dataset.
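The balancing script described in #2 can be sketched as below. This is a minimal illustration under assumptions: the function name, the `per_class` cap, and the `test_frac` ratio are hypothetical parameters, not values from the paper.

```python
import numpy as np

def balanced_split(X, y, per_class=100, test_frac=0.2, seed=0):
    """Split train/test per character, then merge, so every class
    contributes the same number of samples (up to per_class)."""
    rng = np.random.default_rng(seed)
    tr_idx, te_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        idx = rng.permutation(idx)[:per_class]   # cap to equalize classes
        n_test = int(len(idx) * test_frac)
        te_idx.extend(idx[:n_test])              # per-class test split
        tr_idx.extend(idx[n_test:])              # per-class train split
    return X[tr_idx], y[tr_idx], X[te_idx], y[te_idx]

# imbalanced toy data: 200 samples of 'J', 100 of 'K'
X = np.arange(300).reshape(300, 1)
y = np.array(['J'] * 200 + ['K'] * 100)
Xtr, ytr, Xte, yte = balanced_split(X, y, per_class=100, test_frac=0.2)
```

After merging, each class contributes exactly `per_class` samples, so the imbalance never reaches the model.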
#3: The English alphabet contains letters that appear very similar to each other (e.g. B and P, D and O). During training and testing, the model struggled to classify these letters accurately. To minimize misclassification for these letters, the model was trained and hyper-tuned for them separately with a larger dataset.
Neural Network Model Configuration
For this tool, a Multi-Layer Perceptron (MLP) classifier has been trained using backpropagation to achieve significant results. Below is the configuration of the neural network:
- Hidden Layer Size: (100,100,100) i.e., 3 hidden layers with 100 neurons in each
- Activation Function: logistic sigmoid, returns f(x) = 1 / (1 + exp(-x))
- Solver for weight optimization: stochastic gradient-based optimizer (“Adam”)
- Early Stopping (to avoid overfitting): True
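The configuration above maps directly onto scikit-learn’s `MLPClassifier`; assuming that is the implementation used (the paper names the MLP classifier, Adam solver, and logistic sigmoid, which match its defaults and options), the model would be built as:

```python
from sklearn.neural_network import MLPClassifier

# Configuration listed above, expressed as MLPClassifier parameters
model = MLPClassifier(
    hidden_layer_sizes=(100, 100, 100),  # 3 hidden layers, 100 neurons each
    activation='logistic',               # sigmoid: f(x) = 1 / (1 + exp(-x))
    solver='adam',                       # stochastic gradient-based optimizer
    early_stopping=True,                 # hold out validation data, stop early
)
# model.fit(X_train, y_train) then expects X rows of 784 features each
```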
Figure 4: Model Architecture
Results
Table 2: Results Summary
Number of Samples (Test Set): | 74,491 |
Correctly Classified: | 71,227 |
Accuracy: | 95.6% |
Character | Attempts | Correctly Classified | Accuracy | Misclassified With |
A | 2,774 | 2,696 | 97.2% | ‘B’, ‘H’, ‘K’, ‘N’, ‘R’, ‘X’ |
B | 1,734 | 1,572 | 90.7% | ‘A’, ‘D’, ‘E’, ‘G’, ‘H’, ‘R’, ‘S’ |
C | 4,682 | 4,544 | 97.1% | ‘E’, ‘G’, ‘L’, ‘O’ |
D | 2,027 | 1,776 | 87.6% | ‘B’, ‘O’, ‘P’, ‘Q’ |
E | 2,288 | 2,115 | 92.4% | ‘B’, ‘C’, ‘F’, ‘G’, ‘K’, ‘S’ |
F | 233 | 210 | 90.4% | ‘E’, ‘P’, ‘T’ |
G | 1,152 | 1,050 | 91.1% | ‘B’, ‘C’, ‘E’, ‘O’, ‘Q’ |
H | 1,444 | 1,278 | 88.5% | ‘A’, ‘B’, ‘K’, ‘N’, ‘R’ |
I | 224 | 196 | 87.4% | ‘J’, ‘L’, ‘T’, ‘Z’ |
J | 1,699 | 1,589 | 93.5% | ‘I’, ‘T’, ‘Z’ |
K | 1,121 | 1,008 | 90.0% | ‘A’, ‘E’, ‘H’, ‘M’, ‘N’, ‘R’, ‘X’, ‘Y’ |
L | 2,317 | 2,255 | 97.3% | ‘C’, ‘I’ |
M | 2,467 | 2,337 | 94.7% | ‘K’, ‘N’, ‘W’ |
N | 3,802 | 3,606 | 94.9% | ‘A’, ‘H’, ‘K’, ‘M’, ‘R’ |
O | 11,565 | 11,338 | 98.0% | ‘C’, ‘D’, ‘G’, ‘Q’ |
P | 3,868 | 3,778 | 97.7% | ‘D’, ‘F’, ‘R’ |
Q | 1,162 | 1,009 | 86.8% | ‘D’, ‘G’, ‘O’ |
R | 2,313 | 2,144 | 92.7% | ‘A’, ‘B’, ‘H’, ‘K’, ‘N’, ‘P’ |
S | 9,684 | 9,481 | 97.9% | ‘B’, ‘E’ |
T | 4,499 | 4,430 | 98.5% | ‘F’, ‘I’, ‘J’ |
U | 5,802 | 5,658 | 97.5% | ‘V’, ‘W’ |
V | 836 | 810 | 96.8% | ‘U’, ‘W’, ‘Y’ |
W | 2,157 | 2,017 | 93.5% | ‘M’, ‘U’, ‘V’ |
X | 1,254 | 1,163 | 92.7% | ‘A’, ‘K’, ‘Y’ |
Y | 2,172 | 2,055 | 94.6% | ‘K’, ‘X’ |
Z | 1,215 | 1,162 | 95.6% | ‘I’, ‘J’ |
Future Expansion
While this tool serves as a base model for bridging the communication gap, more work remains to be done. Currently, the model can recognize letters and words; with proper expansion, it will be capable of processing phrases and paragraphs as well. Additionally, the UI/UX can be further developed so the tool can be leveraged by a wider audience.