Hands-On Guide To Differentiable Digital Signal Processing Using Neural Networks

Digital signal processors take waveforms such as voice, audio, video, and temperature readings and manipulate them mathematically. The key idea of DSP is to create complex, realistic signals by precisely controlling and tuning their many parameters.

Artificial Intelligence has become one of the most prominent fields in science and technology, and recent advances have made AI and Machine Learning familiar terms well beyond the research community. AI makes it possible for machines to learn from experience, using data to perform tasks more efficiently. The artificial neural network is loosely inspired by the structure of the human brain, which helps machines process information in a more human-like way. Even so, much remains to be understood about the optimal structure of neural networks and how they work.

Artificial neural networks (ANNs) are today a key tool for machine learning. A neural network consists of an input layer, an output layer, and one or more hidden layers in between, whose units transform the input into values that the output layer can use. ANNs are useful for finding patterns that are too numerous or too complex for programmers to extract by hand, so machines are trained to recognize them instead. Each network is composed of many nodes, which imitate the behavior of biological neurons in the human brain.

The neurons in a neural network are interconnected and constantly exchange information as the network processes data. The input nodes take in data, each node computes a weighted combination of its inputs, and the result, known as the node's activation value, is passed on to other neurons. This style of processing loosely imitates the human brain and has become the basis for algorithms that model complex patterns and prediction problems. A network with several such layers is also called a Multi-Layer Perceptron (MLP). Through its learned weights, the hidden layer emphasizes the important information in the inputs and suppresses what is redundant.
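The layered processing described above can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random weights, and the function name `mlp_forward` are assumptions made for the example, not part of any library used later in this article.

```python
import numpy as np

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a perceptron with one hidden layer."""
    # Hidden layer: weighted sum of the inputs followed by a tanh activation.
    hidden = np.tanh(x @ w_hidden + b_hidden)
    # Output layer: weighted sum of the hidden activations (no activation).
    return hidden @ w_out + b_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 examples, 3 input features
w_hidden = rng.normal(size=(3, 5))   # 3 inputs -> 5 hidden units
b_hidden = np.zeros(5)
w_out = rng.normal(size=(5, 2))      # 5 hidden units -> 2 outputs
b_out = np.zeros(2)

y = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
print(y.shape)  # (4, 2)
```

With random weights the outputs are meaningless; training is what turns these weights into something useful.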


The key to a model that makes accurate predictions is finding the weights that minimize the prediction error. This is achieved with the backpropagation algorithm, which computes how much each weight contributed to the error so that the weights can be adjusted accordingly. This is what makes an ANN a learning algorithm: by learning from its errors, the model improves automatically. Neural networks rely heavily on training data to learn and improve their accuracy over time. Once tuned, they become powerful tools for classifying and clustering data at high speed. Tasks such as speech recognition or image recognition can take minutes, compared with the hours needed when human experts identify items manually. Google's search algorithm is a well-known example of a system that uses neural networks. Many businesses use such technologies to solve complex problems like pattern recognition or facial recognition. Other applications include speech-to-text transcription, data analysis, handwriting recognition for check processing, weather prediction, and signal processing.
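To make the idea of learning from errors concrete, here is a minimal sketch of gradient descent on the simplest possible network, a single linear neuron, where the chain rule used by backpropagation reduces to one step. The function name, data, learning rate, and step count are all illustrative assumptions.

```python
def train_linear_neuron(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current weights.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = train_linear_neuron(xs, ys)
print(round(w, 3), round(b, 3))  # close to 2.0 and 1.0
```

In a multi-layer network the same update is applied to every weight, with backpropagation supplying the gradients layer by layer.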

What is Differentiable Digital Signal Processing?

Digital Signal Processing (DSP) can be called one of the backbones of modern society: it underpins telecommunications, transportation, audio, and many medical technologies. Digital signal processors take waveforms such as voice, audio, video, and temperature readings and manipulate them mathematically. The key idea of DSP is to create complex, realistic signals by precisely controlling and tuning their many parameters. For example, a collection of linear filters and sinusoidal oscillators can be used to create the sound of a realistic violin if the frequencies and responses are tuned in the right way.
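As a rough illustration of the sinusoidal-oscillator part of that idea, the sketch below sums a few sinusoids at integer multiples of a fundamental frequency (additive synthesis). The function name `harmonic_tone` and the harmonic amplitudes are hypothetical and hand-picked; they will not sound like a violin, which is exactly the tuning problem discussed next.

```python
import numpy as np

def harmonic_tone(f0_hz, harmonic_amps, duration_s=1.0, sample_rate=16000):
    """Sum sinusoids at integer multiples of f0_hz (additive synthesis)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    tone = np.zeros_like(t)
    for k, amp in enumerate(harmonic_amps, start=1):
        # The k-th harmonic oscillates at k times the fundamental frequency.
        tone += amp * np.sin(2.0 * np.pi * k * f0_hz * t)
    # Normalize so that the peak amplitude is 1.0.
    return tone / np.max(np.abs(tone))

# Hypothetical amplitudes that decay with the harmonic number.
tone = harmonic_tone(f0_hz=440.0, harmonic_amps=[1.0, 0.5, 0.33, 0.25])
print(tone.shape)  # (16000,)
```

Every amplitude and frequency here is a parameter that would have to be tuned over time to obtain a realistic instrument.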

However, it is difficult to control all of these parameters by hand, which is why the output of sound synthesizers often sounds unnatural and robotic. Differentiable Digital Signal Processing (DDSP) is an open-source library that uses a neural network to convert a user's input into complex DSP controls that produce more realistic signals. The input can be any form of audio, including features extracted from the audio itself. Because the DDSP units are differentiable, the neural network can be trained to adapt to a dataset through backpropagation. In the model used here, DDSP elements form the final output layer of an autoencoder. It combines a Harmonic Additive Synthesizer, which adds sinusoids at many different frequencies, with a Subtractive Noise Synthesizer, which shapes noise with time-varying filters. The signals from the two synthesizers are then combined and run through a reverberation module to produce the final audio waveform. The loss is computed by comparing spectrograms of the generated and source audio across six different frame sizes. Since all the components are differentiable, the network can be trained end-to-end with backpropagation and stochastic gradient descent.
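The spectrogram comparison can be sketched with NumPy as follows. This is a simplified illustration, not the DDSP library's loss: the six frame sizes, the Hann window, the 75% overlap, and the use of linear magnitudes only are assumptions, and the names `magnitude_spectrogram` and `multiscale_spectral_loss` are hypothetical.

```python
import numpy as np

def magnitude_spectrogram(audio, frame_size):
    """Magnitude STFT with a Hann window and 75% frame overlap."""
    hop = frame_size // 4
    window = np.hanning(frame_size)
    starts = range(0, len(audio) - frame_size + 1, hop)
    frames = np.stack([audio[s:s + frame_size] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multiscale_spectral_loss(target, generated,
                             frame_sizes=(2048, 1024, 512, 256, 128, 64)):
    """Sum of mean absolute spectrogram differences over several frame sizes."""
    loss = 0.0
    for size in frame_sizes:
        diff = (magnitude_spectrogram(target, size)
                - magnitude_spectrogram(generated, size))
        loss += np.mean(np.abs(diff))
    return loss

t = np.arange(16000) / 16000.0
target = np.sin(2.0 * np.pi * 440.0 * t)
same_pitch = np.sin(2.0 * np.pi * 440.0 * t)
other_pitch = np.sin(2.0 * np.pi * 660.0 * t)
loss_same = multiscale_spectral_loss(target, same_pitch)
loss_other = multiscale_spectral_loss(target, other_pitch)
print(loss_same)   # 0.0 for identical signals
print(loss_other)  # positive for a different pitch
```

Large frames resolve frequency well and small frames resolve timing well, so comparing at several sizes penalizes errors in both.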


Getting Started with Code

In this article, we will use the DDSP library and neural networks to build a speech-to-music converter. The model takes human speech audio as input and resynthesizes it as music that imitates a violin. We will also compare the original audio with the processed audio to observe the difference. The following code is an official implementation from the creators of DDSP and Magenta, whose official website can be accessed from the link here.

Installing The Library

The first step is to install the required library; you can use the following command to do so:

#Installing the Library
!pip install -qU ddsp==1.6.3

Importing Dependencies

Next, we import the dependencies and helper functions that will help build our model pipeline, and set the sample rate for our audio converter:

#Importing necessary dependencies
import copy
import os
import time
import crepe
import ddsp
import ddsp.training
from ddsp.colab.colab_utils import (
    auto_tune, get_tuning_factor, download,
    play, record, specplot, upload,
    DEFAULT_SAMPLE_RATE)
from ddsp.training.postprocessing import (
    detect_notes, fit_quantile_transform)
import gin
from google.colab import files
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pickle
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds
# Helper Functions
sample_rate = DEFAULT_SAMPLE_RATE  # 16000

Recording the Audio to be processed

Next, we provide the model with a 5-second speech recording as input, which the pipeline will then process.

#recording audio
record_or_upload = "Record"  # any other value switches to file upload
record_seconds = 5
if record_or_upload == "Record":
  audio = record(seconds=record_seconds)
else:
  # Otherwise upload an audio file and use the first one.
  filenames, audios = upload()
  audio = audios[0]
# Add a batch dimension if the audio is a flat vector.
if len(audio.shape) == 1:
  audio = audio[np.newaxis, :]
print('\nExtracting audio features...')
# Plot.
specplot(audio)
play(audio)
# Setup the session.
ddsp.spectral_ops.reset_crepe()
# Computing the speech audio features.
start_time = time.time()
audio_features = ddsp.training.metrics.compute_audio_features(audio)
audio_features['loudness_db'] = audio_features['loudness_db'].astype(np.float32)
audio_features_mod = None
print('Audio features took %.1f seconds' % (time.time() - start_time))
TRIM = -15

Now let’s also plot the extracted speech audio features:

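# Plot the Speech audio Features.
# Loudness, fundamental frequency (f0) and f0 confidence were extracted above.
fig, ax = plt.subplots(nrows=3,
                       ncols=1,
                       sharex=True,
                       figsize=(6, 8))
ax[0].plot(audio_features['loudness_db'][:TRIM])
ax[0].set_ylabel('loudness_db')
ax[1].plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax[1].set_ylabel('f0 [midi]')
ax[2].plot(audio_features['f0_confidence'][:TRIM])
ax[2].set_ylabel('f0 confidence')
_ = ax[2].set_xlabel('Time step [frame]')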

Output:

(Note: The audio clips can be played at the time of execution only.)

The output plays back your recorded 5-second speech clip, and the plot shows the extracted features: loudness, fundamental frequency, and pitch confidence.
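The fundamental-frequency panel is labelled in MIDI note numbers rather than Hz. The standard conversion is shown below as a small sketch; the function names are chosen for illustration, and the formula is the usual equal-temperament mapping with A4 = 440 Hz as MIDI note 69.

```python
import math

def hz_to_midi(frequency_hz):
    """Convert a frequency in Hz to a (fractional) MIDI note number."""
    # MIDI note 69 is A4 = 440 Hz, and each octave spans 12 notes.
    return 69.0 + 12.0 * math.log2(frequency_hz / 440.0)

def midi_to_hz(midi_note):
    """Inverse mapping from a MIDI note number back to Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69.0) / 12.0)

print(hz_to_midi(440.0))  # 69.0
print(hz_to_midi(880.0))  # 81.0, one octave higher
print(midi_to_hz(57.0))   # 220.0, one octave lower
```

Plotting pitch on this logarithmic scale makes equal musical intervals appear as equal vertical distances.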

Loading Audio Model To be Combined

Let us now combine our speech audio with a model of a monophonic instrument, so that the speech can be resynthesized into new audio. Here we load a pretrained violin model.

#Loading an audio model
model = 'Violin'
MODEL = model
# Iterate through directories until model directory is found
def find_model_dir(dir_name):
  for root, dirs, filenames in os.walk(dir_name):
    for filename in filenames:
      if filename.endswith(".gin") and not filename.startswith("."):
        model_dir = root
  return model_dir 
if model in ('Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone'):
  # Load Pretrained models.
  PRETRAINED_DIR = '/content/pretrained'
  !rm -r $PRETRAINED_DIR &> /dev/null
  !mkdir $PRETRAINED_DIR &> /dev/null
  GCS_CKPT_DIR = 'gs://ddsp/models/timbre_transfer_colab/2021-07-08'
  model_dir = os.path.join(GCS_CKPT_DIR, 'solo_%s_ckpt' % model.lower())
  !gsutil cp $model_dir/* $PRETRAINED_DIR &> /dev/null
  model_dir = PRETRAINED_DIR
  gin_file = os.path.join(model_dir, 'operative_config-0.gin')
else:
  # User models.
  UPLOAD_DIR = '/content/uploaded'
  !mkdir $UPLOAD_DIR
  uploaded_files = files.upload()
  for fnames in uploaded_files.keys():
    print("Unzipping... {}".format(fnames))
    !unzip -o "/content/$fnames" -d $UPLOAD_DIR &> /dev/null
  model_dir = find_model_dir(UPLOAD_DIR)
  gin_file = os.path.join(model_dir, 'operative_config-0.gin')
# Load the dataset statistics.
DATASET_STATS = None
dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
print(f'Loading dataset statistics from {dataset_stats_file}')
try:
  if tf.io.gfile.exists(dataset_stats_file):
    with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
      DATASET_STATS = pickle.load(f)
except Exception as err:
  print('Loading dataset statistics from pickle failed: {}.'.format(err))
# Parse the gin config that was saved with the model.
with gin.unlock_config():
  gin.parse_config_file(gin_file, skip_unknown=True)
# Ensure dimensions and sampling rates are equal
time_steps_train = gin.query_parameter('F0LoudnessPreprocessor.time_steps')
n_samples_train = gin.query_parameter('Harmonic.n_samples')
hop_size = int(n_samples_train / time_steps_train)
time_steps = int(audio.shape[1] / hop_size)
n_samples = time_steps * hop_size
gin_params = [
    'Harmonic.n_samples = {}'.format(n_samples),
    'FilteredNoise.n_samples = {}'.format(n_samples),
    'F0LoudnessPreprocessor.time_steps = {}'.format(time_steps),
    'oscillator_bank.use_angular_cumsum = True',
]
# Override the training-time lengths with those of the recorded clip.
with gin.unlock_config():
  gin.parse_config(gin_params)
# Trim all input audio vectors to correct lengths 
for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
  audio_features[key] = audio_features[key][:time_steps]
audio_features['audio'] = audio_features['audio'][:, :n_samples]
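# Locate the checkpoint (assumes a single 'ckpt-[iter]' checkpoint in the folder).
ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
ckpt_name = ckpt_files[0].split('.')[0]
ckpt = os.path.join(model_dir, ckpt_name)
# Set up the model just to predict audio given new conditioning
model = ddsp.training.models.Autoencoder()
model.restore(ckpt)
# Build a model by running a batch through it.
start_time = time.time()
_ = model(audio_features, training=False)
print('Restoring model took %.1f seconds' % (time.time() - start_time))

Creating our Resynthesized Audio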

Our last step is to run the features extracted from our original audio through the loaded violin model to create new, resynthesized audio:

#Resynthesize Audio
af = audio_features if audio_features_mod is None else audio_features_mod
# Run a batch of predictions.
start_time = time.time()
outputs = model(af, training=False)
audio_gen = model.get_audio_from_outputs(outputs)
print('Prediction took %.1f seconds' % (time.time() - start_time))
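# Play both clips and plot their spectrograms for comparison
print('Original')
play(audio)
print('Resynthesis')
play(audio_gen)
specplot(audio)
plt.title("Original")
specplot(audio_gen)
_ = plt.title("Resynthesis")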

The output consists of two audio clips: the first is our original recorded speech and the second is the synthesized audio.

With the help of the plots, we can clearly observe the difference between the two.

End Notes

In this article, we explored what a neural network is and what its potential uses are. We also built a speech-to-music converter using neural networks and the Differentiable Digital Signal Processing library. Finally, we processed, synthesized, and compared our audio to observe the difference between the clips. The implementation can be found as a Colab notebook, accessed using the link here.

Happy Learning!


Victor Dey
Victor is an aspiring Data Scientist & is a Master of Science in Data Science & Big Data Analytics. He is a Researcher, a Data Science Influencer and also an Ex-University Football Player. A keen learner of new developments in Data Science and Artificial Intelligence, he is committed to growing the Data Science community.
