Differentiable Digital Signal Processing (DDSP) is an audio generation library that combines classical, interpretable DSP elements (such as oscillators, filters and synthesizers) with deep learning models. It was introduced by Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu and Adam Roberts in an ICLR paper.
Before going into the library’s details, let us have an overview of the concept of DSP.
What is DSP?
Digital Signal Processing (DSP) takes digitized signals such as audio, video, pressure or temperature as input and performs mathematical operations on them, e.g. adding, subtracting or multiplying the signals. Visit this page for a detailed understanding of DSP.
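As a minimal illustration of such operations (a plain NumPy sketch, not part of the DDSP library; the frequencies and amplitudes are arbitrary values chosen for demonstration), two digitized sine waves can be added, subtracted or multiplied sample by sample:

import numpy as np

sample_rate = 16000                       # samples per second
t = np.arange(sample_rate) / sample_rate  # one second of time stamps

# Two digitized signals: a 440 Hz and a 660 Hz sine wave
x1 = np.sin(2 * np.pi * 440 * t)
x2 = np.sin(2 * np.pi * 660 * t)

# Elementary DSP operations on the samples
mix  = 0.5 * x1 + 0.5 * x2   # addition (mixing)
diff = x1 - x2               # subtraction
am   = x1 * x2               # multiplication (amplitude modulation)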
Overview of DDSP
The DDSP library creates complex, realistic audio signals by controlling the parameters of simple, interpretable DSP components; e.g. by tuning the frequencies and responses of sinusoidal oscillators and linear filters, it can synthesize the sound of a realistic instrument such as a violin or flute.
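As a rough sketch of what controlling such a simple DSP element looks like in code (assuming ddsp 1.2.0; the frequencies, amplitudes and shapes below are illustrative values, not taken from the article), ddsp.core.oscillator_bank() turns per-sample frequency and amplitude envelopes into audio:

import numpy as np
import ddsp

sample_rate = 16000
n_samples = sample_rate  # one second of audio

# Per-sample frequency envelope: a pitch glide from 220 Hz to 440 Hz
frequencies = np.linspace(220.0, 440.0, n_samples, dtype=np.float32)
frequencies = frequencies[np.newaxis, :, np.newaxis]   # [batch, n_samples, n_oscillators]

# Constant amplitude envelope for the single oscillator
amplitudes = 0.5 * np.ones([1, n_samples, 1], dtype=np.float32)

# Differentiable sinusoidal oscillator bank from ddsp.core
audio = ddsp.core.oscillator_bank(frequencies, amplitudes, sample_rate=sample_rate)
print(audio.shape)  # (1, 16000)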
How does DDSP work?
Image source: Official documentation
Neural network models used for audio generation, such as WaveNet, generate waveforms one sample at a time. Unlike those models, DDSP passes parameters through known sound-synthesis algorithms. Since all the components in the above figure are differentiable, the model can be trained end-to-end using stochastic gradient descent and backpropagation.
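To illustrate what "differentiable" means here (a minimal sketch, not the paper's full autoencoder; the loss below is a placeholder chosen only for demonstration), a synthesis call can be wrapped in tf.GradientTape and a loss differentiated with respect to a synthesis parameter:

import tensorflow as tf
import ddsp

sample_rate = 16000
n_samples = sample_rate

# Trainable synthesis parameter: a constant 330 Hz fundamental frequency
f0 = tf.Variable(330.0)

with tf.GradientTape() as tape:
    freq_env = tf.ones([1, n_samples, 1]) * f0       # [batch, n_samples, n_oscillators]
    amp_env = tf.ones([1, n_samples, 1]) * 0.5
    audio = ddsp.core.oscillator_bank(freq_env, amp_env, sample_rate=sample_rate)
    loss = tf.reduce_mean(tf.square(audio))          # placeholder loss for illustration

# Gradient of the loss w.r.t. the oscillator frequency -- this is what makes
# end-to-end training with backpropagation possible
grad = tape.gradient(loss, f0)
print(grad)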
Practical implementation of DDSP
Image source: Official Colab Demo
Here’s a demonstration of timbre (tone quality) transfer using DDSP. The code has been implemented in Google Colab with Python 3.7.10 and ddsp 1.2.0. A step-wise explanation of the code follows:
- Install DDSP library
!pip install ddsp
- Import required libraries and modules.
import warnings
warnings.filterwarnings("ignore")

import copy
import os   # for interacting with the operating system
import time

import crepe
import ddsp
import ddsp.training
from ddsp.colab import colab_utils
from ddsp.colab.colab_utils import (
    auto_tune, detect_notes, fit_quantile_transform,
    get_tuning_factor, download, play, record, specplot,
    upload, DEFAULT_SAMPLE_RATE)
import gin
from google.colab import files
import librosa
import matplotlib.pyplot as plt
import numpy as np
import pickle
import tensorflow.compat.v2 as tf
import tensorflow_datasets as tfds
- Initialize signal sampling rate (default sampling rate of 16000 defined in ddsp.spectral_ops has been used here)
sample_rate = DEFAULT_SAMPLE_RATE
- Display options for the user to record an input audio signal or upload one. If recorded, provide an option of selecting the number of seconds for which recording is to be done.
# Allow .mp3 or .wav file extensions for the uploaded file
record_or_upload = "Upload (.mp3 or .wav)"  #@param ["Record", "Upload (.mp3 or .wav)"]

# The recording duration slider ranges from 1 to 10 seconds in steps of 1 second
record_seconds = 20  #@param {type:"number", min:1, max:10, step:1}
- Define actions to be performed based on the user’s selection of recording or uploading the audio.
# If the user selects 'Record', record audio from the browser using the record() method
if record_or_upload == "Record":
    audio = record(seconds=record_seconds)
# If the user selects 'Upload', load a .wav or .mp3 audio file from disk
# into the Colab notebook using the upload() method
else:
    filenames, audios = upload()
    # upload() returns the names of the uploaded files and their respective audio.
    # If the user uploads multiple files, select the first one from the 'audios' array
    audio = audios[0]

audio = audio[np.newaxis, :]
print('\nExtracting audio features...')
- Plot the spectrum of the audio signal using specplot() method
specplot(audio)
Create an HTML5 audio widget using play() method to play the audio file
play(audio)
Reset CREPE’s global state for re-building the model
ddsp.spectral_ops.reset_crepe()
- Record the start time of the feature computation
start_time = time.time()
Compute audio features
audio_features = ddsp.training.metrics.compute_audio_features(audio)
Cast the loudness (in decibels) to float32 and initialize the variable for modified features

audio_features['loudness_db'] = audio_features['loudness_db'].astype(np.float32)
audio_features_mod = None
Compute the time taken for calculating audio features by subtracting start time from the current time
print('Audio features took %.1f seconds' % (time.time() - start_time))
- Plot the computed features
TRIM = -15   # number of frames to trim from the end when plotting

fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, figsize=(6, 8))

# Plot the loudness of the audio
ax[0].plot(audio_features['loudness_db'][:TRIM])
ax[0].set_ylabel('loudness_db')

# Plot the fundamental frequency as MIDI notes
ax[1].plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax[1].set_ylabel('f0 [midi]')

# Plot the pitch-tracking confidence
ax[2].plot(audio_features['f0_confidence'][:TRIM])
ax[2].set_ylabel('f0 confidence')
_ = ax[2].set_xlabel('Time step [frame]')
Output:
The .mp3 audio file that we have used for the demonstration:
(Source of the audio file)
- Select the pretrained model of an instrument to be used.
model = 'Violin'  #@param ['Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone', 'Upload your own (checkpoint folder as .zip)']
MODEL = model
Define a function to find the selected model
def find_model_dir(dir_name):
    # Iterate through directories until the model directory
    # (the one containing a .gin config file) is found
    for root, dirs, filenames in os.walk(dir_name):
        for filename in filenames:
            if filename.endswith(".gin") and not filename.startswith("."):
                model_dir = root
                break
    return model_dir
- Select the model to be used.
if model in ('Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone'):
    # Pretrained models.
    PRETRAINED_DIR = '/content/pretrained'
    # Copy over from gs:// for faster loading.
    !rm -r $PRETRAINED_DIR &> /dev/null
    !mkdir $PRETRAINED_DIR &> /dev/null
    GCS_CKPT_DIR = 'gs://ddsp/models/timbre_transfer_colab/2021-01-06'
    model_dir = os.path.join(GCS_CKPT_DIR, 'solo_%s_ckpt' % model.lower())
    !gsutil cp $model_dir/* $PRETRAINED_DIR &> /dev/null
    model_dir = PRETRAINED_DIR
    gin_file = os.path.join(model_dir, 'operative_config-0.gin')
else:
    # User models.
    UPLOAD_DIR = '/content/uploaded'
    !mkdir $UPLOAD_DIR
    uploaded_files = files.upload()
    for fnames in uploaded_files.keys():
        print("Unzipping... {}".format(fnames))
        !unzip -o "/content/$fnames" -d $UPLOAD_DIR &> /dev/null
    model_dir = find_model_dir(UPLOAD_DIR)
    gin_file = os.path.join(model_dir, 'operative_config-0.gin')
- Load the dataset statistics file
DATASET_STATS = None
dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
print(f'Loading dataset statistics from {dataset_stats_file}')
try:
    # Load the dataset statistics file if it exists
    if tf.io.gfile.exists(dataset_stats_file):
        with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
            DATASET_STATS = pickle.load(f)
# Print a message if loading the file fails
except Exception as err:
    print('Loading dataset statistics from pickle failed: {}.'.format(err))
- Parse the gin config

# Temporarily unlock the config using a context manager
with gin.unlock_config():
    # Parse the config file using parse_config_file()
    gin.parse_config_file(gin_file, skip_unknown=True)
- Locate the checkpoint file

# For each file in the model directory, add it to 'ckpt_files' if it is a checkpoint file
ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
# Extract the name of the checkpoint file
ckpt_name = ckpt_files[0].split('.')[0]
# Add the checkpoint filename to the path of the model directory
ckpt = os.path.join(model_dir, ckpt_name)
- Check that dimensions and sampling rates are equal
# gin.query_parameter() returns the value currently bound to the binding key
# passed as its parameter, i.e. the parameter whose value we need to query

# Time steps used during training
time_steps_train = gin.query_parameter('F0LoudnessPreprocessor.time_steps')
# Number of audio samples used during training
n_samples_train = gin.query_parameter('Harmonic.n_samples')
# Number of samples between successive frames (the 'hop size')
hop_size = int(n_samples_train / time_steps_train)

# Compute the total time steps and number of samples for the input audio
time_steps = int(audio.shape[1] / hop_size)
n_samples = time_steps * hop_size
- Create a list of gin parameters
gin_params = [
    'Harmonic.n_samples = {}'.format(n_samples),
    'FilteredNoise.n_samples = {}'.format(n_samples),
    'F0LoudnessPreprocessor.time_steps = {}'.format(time_steps),
    'oscillator_bank.use_angular_cumsum = True',
]

# Parse the above gin parameters: first unlock the config,
# then parse the list of parameter bindings using parse_config()
with gin.unlock_config():
    gin.parse_config(gin_params)
- Trim the input vectors to correct lengths
# Trim each of the frequency, confidence and loudness arrays to 'time_steps' frames
for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
    audio_features[key] = audio_features[key][:time_steps]

# Trim the 'audio' vector to a length equal to the total number of samples
audio_features['audio'] = audio_features['audio'][:, :n_samples]
- Initialize the model just to predict audio
model = ddsp.training.models.Autoencoder()
Restore the model checkpoints
model.restore(ckpt)
- Build a model by running a batch of audio features through it.
# Record the start time
start_time = time.time()

# Build the model by running the computed audio features through it
_ = model(audio_features, training=False)

# Display the time taken to restore and build the model
print('Restoring model took %.1f seconds' % (time.time() - start_time))
Sample output: Restoring model took 2.0 seconds
- The pretrained models (Violin, Flute etc.) were not explicitly trained to perform timbre transfer, so they may sound unnatural if the input audio frequencies and loudness are very different from the training data (which will be true most of the time).
Create sliders for model conditioning
#@markdown ## Note Detection
#@markdown You can leave this at 1.0 for most cases
threshold = 1  #@param {type:"slider", min: 0.0, max:2.0, step:0.01}

#@markdown ## Automatic
ADJUST = True  #@param{type:"boolean"}

#@markdown Quiet parts without notes detected (dB)
quiet = 20  #@param {type:"slider", min: 0, max:60, step:1}

#@markdown Force pitch to nearest note (amount)
autotune = 0  #@param {type:"slider", min: 0.0, max:1.0, step:0.1}

#@markdown ## Manual
#@markdown Shift the pitch (octaves)
pitch_shift = 0  #@param {type:"slider", min:-2, max:2, step:1}

#@markdown Adjust the overall loudness (dB)
loudness_shift = 0  #@param {type:"slider", min:-20, max:20, step:1}

audio_features_mod = {k: v.copy() for k, v in audio_features.items()}
The sliders to modify conditioning appear in the colab as follows:
- Define a method to shift loudness
def shift_ld(audio_features, ld_shift=0.0):
    # Increment the loudness by ld_shift
    audio_features['loudness_db'] += ld_shift
    # Return the modified audio features
    return audio_features
- Define a method to shift frequency by a number of octaves
def shift_f0(audio_features, pitch_shift=0.0):
    # Multiply the frequency by 2^pitch_shift
    audio_features['f0_hz'] *= 2.0 ** (pitch_shift)
    # Clip the frequency to a valid range
    audio_features['f0_hz'] = np.clip(audio_features['f0_hz'],
                                      0.0,
                                      librosa.midi_to_hz(110.0))
    return audio_features
- Detect the sections of audio which are ‘on’, quantile-shift the loudness of the ‘on’ parts, turn down the ‘off’ parts and optionally auto-tune the pitch. The whole block below runs only when the ADJUST box is checked and the dataset statistics file was loaded.

mask_on = None

if ADJUST and DATASET_STATS is not None:
    # Store the note-on mask and note-on values of the 'on' sections
    mask_on, note_on_value = detect_notes(audio_features['loudness_db'],
                                          audio_features['f0_confidence'],
                                          threshold)

    if np.any(mask_on):
        # Quantile shift the parts with 'on' notes
        _, loudness_norm = colab_utils.fit_quantile_transform(
            audio_features['loudness_db'],
            mask_on,
            inv_quantile=DATASET_STATS['quantile_transform'])

        # Turn down the parts of the audio with 'off' notes
        mask_off = np.logical_not(mask_on)
        loudness_norm[mask_off] -= quiet * (1.0 - note_on_value[mask_off][:, np.newaxis])
        loudness_norm = np.reshape(loudness_norm, audio_features['loudness_db'].shape)

        # Update the loudness (in dB) to the normalized loudness
        audio_features_mod['loudness_db'] = loudness_norm

        # If 'autotune' is set using the slider widget
        if autotune:
            # Convert frequency (Hz) to MIDI notes
            f0_midi = np.array(ddsp.core.hz_to_midi(audio_features_mod['f0_hz']))
            # Get an offset, in cents, to the most consistent set of chromatic intervals
            tuning_factor = get_tuning_factor(f0_midi,
                                              audio_features_mod['f0_confidence'],
                                              mask_on)
            # Reduce the variance of the frequency from the chromatic or scale intervals
            f0_midi_at = auto_tune(f0_midi, tuning_factor, mask_on, amount=autotune)
            # Store the frequency in Hz by converting the MIDI notes back to Hz
            audio_features_mod['f0_hz'] = ddsp.core.midi_to_hz(f0_midi_at)

    # Display a message if no notes are detected
    else:
        print('\nSkipping auto-adjust (no notes detected or ADJUST box empty).')

# Display a message if the ADJUST box is not checked or no dataset statistics file is found
else:
    print('\nSkipping auto-adjust (box not checked or no dataset statistics found).')
- Perform manual shifts of loudness and frequency using the shift_ld() and shift_f0() methods defined above
audio_features_mod = shift_ld(audio_features_mod, loudness_shift) audio_features_mod = shift_f0(audio_features_mod, pitch_shift)
- Plot the features
# Check whether a note-on mask exists
has_mask = int(mask_on is not None)

# 3 subplots if 'has_mask' is 1 (True), else only 2 subplots (loudness and frequency)
n_plots = 3 if has_mask else 2

# Initialize the figure and axes
fig, axes = plt.subplots(nrows=n_plots, ncols=1,
                         sharex=True, figsize=(2 * n_plots, 8))

# Plot the note-on mask, if it exists
if has_mask:
    ax = axes[0]
    ax.plot(np.ones_like(mask_on[:TRIM]) * threshold, 'k:')
    ax.plot(note_on_value[:TRIM])
    ax.plot(mask_on[:TRIM])
    ax.set_ylabel('Note-on Mask')
    ax.set_xlabel('Time step [frame]')
    ax.legend(['Threshold', 'Likelihood', 'Mask'])

# Plot the original and adjusted loudness
ax = axes[0 + has_mask]
ax.plot(audio_features['loudness_db'][:TRIM])
ax.plot(audio_features_mod['loudness_db'][:TRIM])
ax.set_ylabel('loudness_db')
ax.legend(['Original', 'Adjusted'])

# Plot the original and adjusted frequencies
ax = axes[1 + has_mask]
ax.plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
ax.plot(librosa.hz_to_midi(audio_features_mod['f0_hz'][:TRIM]))
ax.set_ylabel('f0 [midi]')
_ = ax.legend(['Original', 'Adjusted'])
Output:
- Resynthesize the audio
Select the audio features to use: the modified features if they exist, else the original ones
af = audio_features if audio_features_mod is None else audio_features_mod
Run a batch of predictions
# Record the start time of the resynthesis
start_time = time.time()

# Run the model defined in step (17) on the selected audio features
outputs = model(af, training=False)
Extract the audio output from the outputs dictionary
audio_gen = model.get_audio_from_outputs(outputs)
Display the time taken for making predictions by computing the difference between the current time and the start time
print('Prediction took %.1f seconds' % (time.time() - start_time))
- Create HTML5 widgets for playing the original and resynthesized audio, and plot the spectrograms of both signals

print('Original')
play(audio)

print('Resynthesis')
play(audio_gen)

specplot(audio)
plt.title("Original")

specplot(audio_gen)
_ = plt.title("Resynthesis")
Output widgets:
Output plots:
Original audio:
Resynthesized audio (using ‘Violin’ model):
The Google Colab notebook of the above implementation is available here.
References
- Magenta documentation
- Research paper
- GitHub repository
- Colab tutorials of DDSP applications