In the developing field of science and technology, Artificial Intelligence plays a prominent role. Its recent advancements have made AI more popular than ever and turned Artificial Intelligence and Machine Learning into buzzwords among the masses. AI has made it possible for machines to learn from experience, using data and data processing to perform tasks more efficiently. Artificial neural networks are said to be inspired by the structure of the human brain, helping computers and machines think and process more like a human. With every discovery and development, there is even more to understand about the optimal structure of artificial neural networks and how they work.
Artificial neural networks (ANNs) are today a key tool for machine learning. A neural network consists of an input layer and an output layer, with hidden layers in between containing units that transform the input into something the output layer can use. Such networks are helpful for finding patterns that are too numerous and complex for programmers to extract by hand; instead, machines are trained to recognise the valuable patterns themselves. ANNs are composed of multiple nodes, which imitate the working of biological neurons in the human brain.
The neurons in a neural network are interconnected and constantly interact with each other as the network processes information. The input nodes take in data and perform operations on it; the results are then passed on to other neurons, and the output at each node is assigned an activation or node value. This processing technique imitates the human brain and has therefore become a basis for algorithms that model complex patterns and prediction problems. Such a network is also called a Multi-Layer Perceptron (MLP) because of its multiple layers. The hidden layers make the network more efficient by focusing, through the assigned weights, on the important information in the inputs and leaving out the redundant information.
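To make the layered structure above concrete, here is a minimal NumPy sketch of an MLP forward pass; the layer sizes and activation choice are illustrative assumptions, not taken from this article.

# A minimal multi-layer perceptron forward pass in NumPy (illustrative sizes).
import numpy as np

def relu(x):
    return np.maximum(0, x)

def mlp_forward(x, W1, b1, W2, b2):
    # Hidden layer weighs the inputs and applies a non-linear activation.
    hidden = relu(x @ W1 + b1)
    # Output layer turns the hidden representation into a prediction.
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                     # one sample with 4 input features
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden weights
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden -> output weights
print(mlp_forward(x, W1, b1, W2, b2))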
The key to creating a good model that produces accurate predictions is finding the optimal weight values that minimise the prediction error. This is achieved with the backpropagation algorithm, which makes the ANN a learning algorithm: by learning from its errors, the model improves automatically. Neural networks rely heavily on training data to learn and improve their accuracy over time. Once these learning algorithms are fine-tuned for accuracy, they become powerful computer science and artificial intelligence tools that can classify and cluster data at high velocity. Tasks such as speech or image recognition take minutes rather than the hours a human expert would need for manual identification. Google’s search algorithm is an example of a well-developed neural network. Many commercial applications use such technologies to solve complex problems like pattern recognition or facial recognition; other applications include speech-to-text transcription, data analysis, handwriting recognition for cheque processing, weather prediction, and signal processing.
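As a rough illustration of how learning from errors adjusts the weights, the toy example below fits a single linear neuron with gradient descent; the data, learning rate, and step count are made up purely for demonstration.

# Toy gradient descent on a single neuron: minimise the mean squared prediction error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.5                 # the "true" relationship the model should learn

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    pred = X[:, 0] * w + b
    error = pred - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2 * np.mean(error * X[:, 0])
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w                    # move the weights against the gradient
    b -= lr * grad_b
print(w, b)                             # should approach 3.0 and 0.5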
What is Differentiable Digital Signal Processing?
Digital signal processing (DSP) can be called one of the backbones of modern society, underpinning telecommunications, transportation, audio, and many medical technologies. Digital signal processors take waveforms such as voice, audio, video, or temperature readings and manipulate them mathematically. The key idea is to create complex, realistic signals by precisely controlling and tuning their many parameters. For example, a collection of linear filters and sinusoidal oscillators can produce the sound of a realistic violin if the frequencies and responses are tuned in the right way.
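To give a feel for the additive idea, the snippet below sums a few sinusoidal harmonics into a rough string-like tone with NumPy; the fundamental frequency, harmonic amplitudes, and sample rate are arbitrary illustrative choices, not the DDSP implementation.

# Additive synthesis sketch: sum a few harmonics of a 440 Hz fundamental.
import numpy as np

sample_rate = 16000
t = np.arange(0, 2.0, 1.0 / sample_rate)       # 2 seconds of samples
f0 = 440.0                                     # fundamental frequency (A4)
amplitudes = [1.0, 0.5, 0.3, 0.2, 0.1]         # decaying harmonic amplitudes (illustrative)

tone = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
           for k, a in enumerate(amplitudes))
tone /= np.abs(tone).max()                     # normalise to [-1, 1]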
However, it is difficult to control all of these parameters by hand, which is why the output of conventional sound synthesizers can sound unnatural and robotic. DDSP (Differentiable Digital Signal Processing) is an open-source library that uses a neural network to convert a user’s input into complex DSP controls that help produce more realistic signals. The input can be any form of audio, including features extracted from the audio itself. Since the DDSP units are differentiable, the neural network can be trained to adapt to a dataset through backpropagation. The DDSP elements form the final output layer of an autoencoder: a Harmonic Additive Synthesizer, which adds sinusoids at many different frequencies, is combined with a Subtractive Noise Synthesizer, which shapes noise with time-varying filters. The signal from each synthesizer is then combined and run through a reverberation module to produce the final audio waveform. The loss is computed by comparing spectrograms of the generated and source audio across six different frame sizes. Since all the components are differentiable, we can use backpropagation and stochastic gradient descent to train the network end-to-end.
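The multi-scale spectrogram comparison can be sketched with plain TensorFlow as below. This is a simplified stand-in for the library's spectral loss, not the DDSP implementation itself, and the six frame sizes used here are an assumption for illustration.

# Simplified multi-scale spectrogram (L1) loss between generated and target audio.
import tensorflow as tf

def multiscale_spectral_loss(audio_gen, audio_target,
                             fft_sizes=(2048, 1024, 512, 256, 128, 64)):
    loss = 0.0
    for size in fft_sizes:
        # Magnitude spectrograms at several time-frequency resolutions.
        s_gen = tf.abs(tf.signal.stft(audio_gen, frame_length=size,
                                      frame_step=size // 4, pad_end=True))
        s_tgt = tf.abs(tf.signal.stft(audio_target, frame_length=size,
                                      frame_step=size // 4, pad_end=True))
        loss += tf.reduce_mean(tf.abs(s_gen - s_tgt))
    return loss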

Getting Started with Code
This article will use the DDSP library and neural networks to create a speech-to-music audio converter. The model takes human speech audio as input and replaces the waveforms with music that imitates a violin. We will also compare the original audio with the processed audio to notice the sheer difference. The following code is an official implementation from the creators of DDSP and Magenta, whose official website can be accessed from the link here.
Installing The Library
The first step is to install the required library; you can use the following command to do so:
# Installing the library
!pip install -qU ddsp==1.6.3
Importing Dependencies
Next, we import the dependencies and helper functions that will help build our model pipeline, and set the sample rate for our audio converter:
# Importing necessary dependencies
import copy
import os
import time
import pickle

import crepe
import ddsp
import ddsp.training
from ddsp.colab.colab_utils import (
    auto_tune, get_tuning_factor, download,
    play, record, specplot, upload,
    DEFAULT_SAMPLE_RATE)
from ddsp.training.postprocessing import (
    detect_notes, fit_quantile_transform)
import gin
from google.colab import files
import librosa
import matplotlib.pyplot as plt
import numpy as np
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds

# Helper constants
sample_rate = DEFAULT_SAMPLE_RATE  # 16000
Recording the Audio to be processed
Next, we provide our model with a 5-second speech audio recording as input, which the model pipeline will then process.
# Recording audio
record_or_upload = "Record"
record_seconds = 5  # 5-second recording, as described above

if record_or_upload == "Record":
    audio = record(seconds=record_seconds)
else:
    filenames, audios = upload()
    audio = audios[0]
audio = audio[np.newaxis, :]

print('\nExtracting audio features...')

# Plot and play the recorded audio.
specplot(audio)
play(audio)

# Setup the session.
ddsp.spectral_ops.reset_crepe()

# Computing the speech audio features.
start_time = time.time()
audio_features = ddsp.training.metrics.compute_audio_features(audio)
audio_features['loudness_db'] = audio_features['loudness_db'].astype(np.float32)
audio_features_mod = None
print('Audio features took %.1f seconds' % (time.time() - start_time))

TRIM = -15
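If you want to verify what was extracted before moving on, a quick sanity check of the feature arrays can look like the sketch below; the keys are inferred from how they are used later in this walkthrough.

# Inspect the extracted feature arrays before further processing.
for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
    print(key, np.asarray(audio_features[key]).shape)
print('audio', audio_features['audio'].shape)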
Now let’s also plot the extracted speech audio features:
# Plot the speech audio features: fundamental frequency (f0) and loudness.
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(6, 8))
ax[0].plot(audio_features['loudness_db'][:TRIM])
ax[0].set_ylabel('loudness_db')
ax[1].plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax[1].set_ylabel('f0 [midi]')
ax[2].plot(audio_features['f0_confidence'][:TRIM])
ax[2].set_ylabel('f0 confidence')
_ = ax[2].set_xlabel('Time step [frame]')
Output:

Note: The above audio files can be played at the time of execution only.


The output will play your recorded 5-second speech audio clip, and the model also plots features such as loudness, fundamental frequency, and pitch confidence.
Loading the Audio Model to Be Combined
Let us now combine our speech audio with a monophonic instrument model so it can be resynthesized into new audio. Here we load a pretrained violin model.
# Loading an audio model
model = 'Violin'
MODEL = model

# Iterate through directories until the model directory (containing a .gin file) is found.
def find_model_dir(dir_name):
    for root, dirs, filenames in os.walk(dir_name):
        for filename in filenames:
            if filename.endswith(".gin") and not filename.startswith("."):
                model_dir = root
                break
    return model_dir

if model in ('Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone'):
    # Load pretrained models.
    PRETRAINED_DIR = '/content/pretrained'
    !rm -r $PRETRAINED_DIR &> /dev/null
    !mkdir $PRETRAINED_DIR &> /dev/null
    GCS_CKPT_DIR = 'gs://ddsp/models/timbre_transfer_colab/2021-07-08'
    model_dir = os.path.join(GCS_CKPT_DIR, 'solo_%s_ckpt' % model.lower())
    !gsutil cp $model_dir/* $PRETRAINED_DIR &> /dev/null
    model_dir = PRETRAINED_DIR
    gin_file = os.path.join(model_dir, 'operative_config-0.gin')
else:
    # User-uploaded models.
    UPLOAD_DIR = '/content/uploaded'
    !mkdir $UPLOAD_DIR
    uploaded_files = files.upload()
    for fnames in uploaded_files.keys():
        print("Unzipping... {}".format(fnames))
        !unzip -o "/content/$fnames" -d $UPLOAD_DIR &> /dev/null
    model_dir = find_model_dir(UPLOAD_DIR)
    gin_file = os.path.join(model_dir, 'operative_config-0.gin')

# Load the dataset statistics.
DATASET_STATS = None
dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
print(f'Loading dataset statistics from {dataset_stats_file}')
try:
    if tf.io.gfile.exists(dataset_stats_file):
        with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
            DATASET_STATS = pickle.load(f)
except Exception as err:
    print('Loading dataset statistics from pickle failed: {}.'.format(err))

# Parse the gin config and locate the checkpoint (needed before querying
# gin parameters and restoring the model below).
with gin.unlock_config():
    gin.parse_config_file(gin_file, skip_unknown=True)
ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
ckpt_name = ckpt_files[0].split('.')[0]
ckpt = os.path.join(model_dir, ckpt_name)

# Ensure dimensions and sampling rates are equal.
time_steps_train = gin.query_parameter('F0LoudnessPreprocessor.time_steps')
n_samples_train = gin.query_parameter('Harmonic.n_samples')
hop_size = int(n_samples_train / time_steps_train)
time_steps = int(audio.shape[1] / hop_size)
n_samples = time_steps * hop_size

gin_params = [
    'Harmonic.n_samples = {}'.format(n_samples),
    'FilteredNoise.n_samples = {}'.format(n_samples),
    'F0LoudnessPreprocessor.time_steps = {}'.format(time_steps),
    'oscillator_bank.use_angular_cumsum = True',
]
with gin.unlock_config():
    gin.parse_config(gin_params)

# Trim all input audio vectors to the correct lengths.
for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
    audio_features[key] = audio_features[key][:time_steps]
audio_features['audio'] = audio_features['audio'][:, :n_samples]

# Set up the model just to predict audio given new conditioning.
model = ddsp.training.models.Autoencoder()
model.restore(ckpt)

# Build the model by running a batch through it.
start_time = time.time()
_ = model(audio_features, training=False)
print('Restoring model took %.1f seconds' % (time.time() - start_time))
Creating our Resynthesized Audio
Our last step is to run the original speech audio through the loaded violin model to create a unique resynthesized audio clip:
# Resynthesize audio
af = audio_features if audio_features_mod is None else audio_features_mod

# Run a batch of predictions.
start_time = time.time()
outputs = model(af, training=False)
audio_gen = model.get_audio_from_outputs(outputs)
print('Prediction took %.1f seconds' % (time.time() - start_time))

# Play and plot both signals for comparison.
print('Original')
play(audio)
print('Resynthesis')
play(audio_gen)

specplot(audio)
plt.title("Original")
specplot(audio_gen)
_ = plt.title("Resynthesis")
The output comprises two audio clips: the first is our original recorded speech audio, and the second is the resynthesized audio.



We can now clearly observe the difference between the two with the help of the plotted spectrograms.
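If you want to keep the resynthesized clip, one possible way is to write it to a WAV file with the soundfile package; this step is an addition on our part and is not used in the official implementation.

# Optional: save the generated audio to disk (soundfile is an extra dependency here).
import soundfile as sf
sf.write('resynthesis.wav', np.squeeze(np.asarray(audio_gen)), sample_rate)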
End Notes
In this article, we explored what a neural network is and what its potential uses are. We also built a speech-to-music audio converter model using neural networks and the Differentiable Digital Signal Processing (DDSP) library. Finally, we processed, resynthesized, and compared our audio to notice the difference between the two. The following implementation can be found as a Colab notebook, accessed using the link here.
Happy Learning!