The single largest direct human impact on marine ecosystems comes from the over-exploitation of resources by activities like fishing. Now, researchers have devised a technique to track populations of endangered marine life, such as humpback whales, by listening to their peculiar hour-long songs.

Audio analysis techniques developed by Google's Artificial Intelligence Perception team have been used before to generate non-speech captions for YouTube videos. Now, similar techniques are being applied to conservation.

Google, in association with the Pacific Islands Fisheries Science Center of the US National Oceanic and Atmospheric Administration (NOAA), has developed algorithms to identify humpback whale calls in 15 years of underwater recordings from a number of locations in the Pacific. This research provides new and important information about humpback whale presence, seasonality, daily calling behaviour, and population structure.

HARP

The audio data (9.2 terabytes) was collected over a period of 15 years using HARP (high-frequency acoustic recording package) devices.

Manually marking humpback whale calls is extremely time-consuming. That is why researchers at Google deployed supervised machine learning models to detect the whales, treating audio event detection as an image classification problem.

The images are spectrograms, in which the intensity of the sound is plotted on time-frequency axes.

Spectrograms of audio events found in the dataset, with time on the x-axis and frequency on the y-axis. Left: a humpback whale call. Center: narrow-band noise from an unknown source. Right: hard disk noise from the HARP.

To classify the images, a ResNet-50 was used, an architecture that has given reliable results in classifying non-speech audio.
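The step of turning audio into an image can be illustrated with a minimal NumPy sketch. It is not the pipeline used in the project: the function name `spectrogram`, the sample rate, the frame and hop sizes, and the synthetic frequency-modulated tone standing in for a whale call are all assumptions made for illustration.

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform.

    Returns an array of shape (frequency_bins, time_frames), i.e. frequency
    on the vertical axis and time on the horizontal axis.
    frame_len and hop are illustrative values, not taken from the article.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of each frame.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Hypothetical input: a 2-second tone whose frequency sweeps between
# 200 Hz and 400 Hz, so it traces an arc rather than a horizontal bar.
sample_rate = 2000  # Hz, assumed
t = np.arange(2 * sample_rate) / sample_rate
freq = 300 + 100 * np.sin(2 * np.pi * 0.5 * t)
phase = 2 * np.pi * np.cumsum(freq) / sample_rate
tone = np.sin(phase)

spec = spectrogram(tone)
peak_bins = spec.argmax(axis=0)  # strongest frequency bin in each frame
print(spec.shape, peak_bins.min(), peak_bins.max())
```

The resulting 2-D array is what an image classifier would receive. In this toy signal the strongest frequency bin moves from frame to frame, which is the arc-like shape that separates a modulated call from steady machine noise.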

A ResNet is a residual learning framework, introduced in 2015 to ease the training of deep networks. Each block of layers is modelled as a residual function learned with reference to the block's input, rather than as an unreferenced mapping. As a result, such networks can gain accuracy from considerably increased depth.

The idea behind using residual learning instead of simply stacking layer upon layer is to avoid problems such as delayed convergence caused by vanishing or exploding gradients. In the vanishing gradient problem, for example, each weight in the network is updated after every iteration in proportion to the gradient of the error function. When those gradients become vanishingly small, the weights barely change, which can bring the network's training to a complete halt. ResNets have also been good at tackling the degradation problem, in which training accuracy saturates and then degrades as depth increases.

Humpback calls have varied but sustained frequencies. If a frequency did not vary at all, the spectrogram would display a horizontal bar; the arcs mean that the signal has been modulated. The challenge with collecting humpback audio data is the noise that gets mixed in with it, such as noise from ship propellers and other equipment. This noise is an unvaried signal and appears as a horizontal bar on a spectrogram.

PCEN (per-channel energy normalization) is a technique used in far-field speech recognition tasks. It can be implemented as neural network layers and uses dynamic compression instead of logarithmic compression to spot keywords in distant or noisy acoustic environments. This technique suppresses the stationary narrow-band noise generated by machines, reducing the error rate by 24%.

A whale song is generally a structured, sequential audio signal that can last over 20 minutes.
A new song is also highly likely to begin within a few seconds. Feeding the model audio units over such large time windows gives it extra context that is useful for improving the precision of its predictions. The test set consists of 75-second audio clips, for which the model showed accuracy scores above 90%.

Unsupervised Learning For Similar Song Units

Whereas the supervised classifier uses labels to learn from the ResNet output, this approach works without them. Similarity is judged by the Euclidean distance between the ResNet output vectors of two audio units of similar time frames. This helps in distinguishing different humpback unit types from each other.

The distances are learned using the method from Unsupervised Learning Of Semantic Audio Representations. The basic idea is to exploit the correlation between closeness in time and closeness in meaning. Each training sample consists of three vectors, representing the sound of a humpback unit (the anchor), a similar unit (the positive) and noise (the negative). The model minimizes a loss that forces the Euclidean distance between anchor and negative to exceed the distance between anchor and positive. The nearest neighbours in the entire dataset are then retrieved using the Euclidean distance between embedding vectors.

The above plot summarises the model output with respect to time and location (Kona and Saipan). The results clearly show seasonal variation consistent with a known pattern in which humpback populations spend summers feeding near Alaska and then migrate to the vicinity of the Hawaiian Islands to breed and give birth.

These results can further be used to determine the effects of anthropogenic activity, and the success of this project calls for applying machine learning tools to a wider spectrum of environmental challenges.
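
For readers who want to see the anchor, positive and negative idea from the section on similar song units in concrete terms, here is a minimal NumPy sketch. It assumes a hinge-style triplet loss with a margin, which is one common formulation and not necessarily the exact loss used in the project. The function names `triplet_loss` and `nearest_neighbours`, the margin value and the two-dimensional toy vectors are all hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances (assumed form).

    The loss is zero once the anchor-negative distance exceeds the
    anchor-positive distance by at least `margin` (an illustrative value).
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def nearest_neighbours(query, embeddings, k=2):
    """Indices of the k embeddings closest to `query` in Euclidean distance."""
    distances = np.linalg.norm(embeddings - query, axis=1)
    return np.argsort(distances)[:k]

# Toy embedding vectors standing in for ResNet outputs (hypothetical values).
anchor = np.array([1.0, 0.0])     # a humpback unit
positive = np.array([0.9, 0.1])   # a similar unit
negative = np.array([-1.0, 0.5])  # noise

good = triplet_loss(anchor, positive, negative)  # well-separated triplet
bad = triplet_loss(anchor, negative, positive)   # roles swapped on purpose
neighbours = nearest_neighbours(anchor, np.stack([positive, negative, anchor]))
print(good, bad, neighbours)
```

A well-separated triplet contributes no loss, while the deliberately swapped one is penalised. Once the embeddings are trained, retrieval is a sort by distance to the query vector.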