Alt text is the piece of text that pops up on the screen when an image fails to load. When tagged with proper alt text, visually impaired persons can understand image content, being read out in a synthetic voice. Since the alt text descriptions have to be manually added, many pictures feature no alt text, greatly restricting the experience. To bridge this gap, in 2016, Facebook introduced automatic alternative text (AAT) technology. As the name suggests, AAT generates the description of photos on demand.
The researchers at Facebook have now brought in three significant changes to AAT:
- A ten-fold increase in the number of concepts AAT can detect and describe.
- More detailed image descriptions, giving information on the activities, landmarks, types of animals, etc.
- Including information about the location coordinates and the relative size of the object — an industry first.
Automatic Alternative Text
AAT is rooted in Facebook’s object recognition technology. This object recognition technology makes use of a neural network with billions of parameters trained on several examples.
In 2018, AAT bagged the Helen Keller Achievement Award (under the aegis of the American Foundation for the Blind) for its ‘exemplary role in creating accessible products or improving the accessibility of their already-popular products’.
The first version of AAT was developed using human-labelled data. The data was then used to train a deep convolutional neural network. The initial version of AAT could recognise about 100 basic concepts such as trees, mountains, people’s identity (using facial recognition model) etc.
However, the version was not scalable, necessitating a move away from the fully supervised learning model with human-labelled data.
In an earlier blog, Facebook explained how hashtags could be an ideal source for training data and making images more accessible. However, hashtags present two main challenges–they at times reference nonvisual concepts (for example #tbt) and they are very vague. Both these shortcomings could confuse deep learning models. To overcome this, new approaches including assessing multiple labels used per image, sorting through hashtag synonyms, and balancing between frequently used hashtags and rare ones were developed. The team also trained a large-scale hashtag prediction model for image recognition.
Leveraging this method of image recognition through hashtags, Facebook’s team has adopted a model trained on weakly supervised data, from public instagram images and their hashtags for AAT. This model is fine-tuned to utilise data obtained from different geographies to make it work better for everyone. Apart from this, concepts of gender, skin colour, and age were also applied to build a more accurate and inclusive model.
The new enhancement to AAT also applies transfer learning, a method of repurposing machine learning models for training on new tasks. This enabled the creation of a model that could identify more distinctive concepts such as monuments, type of food, and selfies.
The model was further trained on Faster R-CNN, a two-stage object detector, using an open-source platform for object detection and segmentation called Detectron2, for more details on parameters such as position and counts.
Greater Detail & Higher Accuracy
The improved AAT can recognise over 1,200 concepts. Facebook has only included concepts based on well-trained models that meet a high precision threshold to maintain a high level of accuracy.
The images would now include a ‘Detail Image Description’ panel that provides comprehensive descriptions of content including positional information (top/bottom/left/right) and relative prominence of objects (primary, secondary or minor).