Guide To Differentiable Digital Signal Processing (DDSP) Library with Python Code

Nikita Shiledarbaxi

Differentiable Digital Signal Processing (DDSP) is an audio generation library that combines classical, interpretable DSP elements (such as oscillators, filters and synthesizers) with deep learning models. It was introduced by Jesse Engel, Lamtharn (Hanoi) Hantrakul, Chenjie Gu and Adam Roberts in their ICLR 2020 paper.

Before going into the library’s details, let us have an overview of the concept of DSP.

What is DSP?

Digital Signal Processing (DSP) takes digitized signals such as audio, video, pressure or temperature as input and performs mathematical operations on them, e.g. adding, subtracting or multiplying signals. Visit this page for a detailed understanding of DSP.
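To make the idea concrete, here is a minimal sketch (plain NumPy, not part of the DDSP library) that treats two digitized sine tones as arrays and mixes and scales them, the kind of elementary operation DSP builds on. The signal parameters (440 Hz, 660 Hz, a 16 kHz sampling rate) are arbitrary illustrative choices.

 import numpy as np

 # Digitize one second of two pure tones at a 16 kHz sampling rate.
 sample_rate = 16000
 t = np.arange(sample_rate) / sample_rate
 tone_a = np.sin(2 * np.pi * 440.0 * t)   # 440 Hz sine
 tone_b = np.sin(2 * np.pi * 660.0 * t)   # 660 Hz sine

 # Elementary DSP operations: scaling (gain) and addition (mixing).
 mix = 0.5 * tone_a + 0.5 * tone_b

 # Multiplying two signals (ring modulation) is another basic operation.
 ring_mod = tone_a * tone_b

 print(mix.shape, ring_mod.shape)  # (16000,) (16000,)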

Overview of DDSP

The DDSP library creates complex, realistic audio signals by controlling the parameters of simple, interpretable DSP components. For example, by tuning the frequencies and responses of sinusoidal oscillators and linear filters, it can synthesize the sound of a realistic instrument such as a violin or a flute.
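As an illustration of driving an interpretable synthesizer with explicit parameters, here is a minimal sketch using the library's Harmonic synthesizer (the same processor whose Harmonic.n_samples parameter appears later in this article). The frame count, number of harmonics and constant 220 Hz pitch are illustrative assumptions, not values taken from the timbre-transfer demo.

 import numpy as np
 import ddsp

 n_frames, n_harmonics = 250, 30
 sample_rate, n_samples = 16000, 16000  # one second of audio

 # Time-varying synthesis parameters: overall amplitude, per-harmonic
 # distribution and fundamental frequency (all with a leading batch axis).
 amplitudes = np.ones([1, n_frames, 1], dtype=np.float32)
 harmonic_distribution = np.ones([1, n_frames, n_harmonics], dtype=np.float32)
 f0_hz = 220.0 * np.ones([1, n_frames, 1], dtype=np.float32)  # constant pitch

 # The Harmonic processor turns these controls into an audio waveform.
 harmonic_synth = ddsp.synths.Harmonic(n_samples=n_samples,
                                       sample_rate=sample_rate)
 audio = harmonic_synth(amplitudes, harmonic_distribution, f0_hz)
 print(audio.shape)  # (1, 16000)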

How does DDSP work?

Working of DDSP

Image source: Official documentation

Neural network models used for audio generation, such as WaveNet, generate waveforms one sample at a time. Unlike those models, DDSP passes the parameters predicted by a neural network through known sound-synthesis algorithms. All the components in the above figure are differentiable, so the model can be trained end-to-end using stochastic gradient descent and backpropagation.
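To see what "differentiable" buys us, here is a minimal, self-contained sketch (plain TensorFlow, not the DDSP model itself) in which gradients flow through a sinusoidal oscillator so that its amplitude parameter can be fitted to a target tone by gradient descent. The target amplitude, learning rate and step count are arbitrary illustrative choices.

 import tensorflow as tf

 sample_rate = 16000
 t = tf.range(sample_rate, dtype=tf.float32) / sample_rate

 def sine_synth(amplitude, freq_hz=440.0):
   # A tiny DSP element: a sinusoidal oscillator with interpretable controls.
   return amplitude * tf.sin(2.0 * 3.141592653589793 * freq_hz * t)

 target = sine_synth(tf.constant(0.8))  # the tone we want to match
 amplitude = tf.Variable(0.2)           # start from the wrong amplitude
 optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

 # Because the synthesizer is built from differentiable ops, gradients of the
 # loss with respect to its control parameter can be computed and applied.
 for _ in range(300):
   with tf.GradientTape() as tape:
     loss = tf.reduce_mean((sine_synth(amplitude) - target) ** 2)
   grads = tape.gradient(loss, [amplitude])
   optimizer.apply_gradients(zip(grads, [amplitude]))

 print(float(amplitude))  # converges towards 0.8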

Practical implementation of DDSP

DDSP practical implementation

Image source: Official Colab Demo

Here’s a demonstration of timbre (pitch/tone quality) transfer using DDSP. The code has been implemented in Google Colab with Python 3.7.10 and ddsp 1.2.0. A step-wise explanation of the code follows:

  1. Install the DDSP library.

!pip install ddsp

  2. Import the required libraries and modules.
 import warnings
 import copy
 import os  # for interacting with the operating system
 import time
 import crepe
 import ddsp
 import ddsp.training
 from ddsp.colab import colab_utils
 from ddsp.colab.colab_utils import (
     auto_tune, detect_notes, fit_quantile_transform,
     get_tuning_factor, download, play, record,
     specplot, upload, DEFAULT_SAMPLE_RATE)
 import gin
 from google.colab import files
 import librosa
 import matplotlib.pyplot as plt
 import numpy as np
 import pickle
 import tensorflow.compat.v2 as tf
 import tensorflow_datasets as tfds
  3. Initialize the signal sampling rate (the default sampling rate of 16000 Hz defined in ddsp.spectral_ops is used here).

sample_rate = DEFAULT_SAMPLE_RATE   

  4. Display options for the user to record an input audio signal or upload one. If recording, provide an option to select the number of seconds of recording.
 # Allow either recording from the browser or uploading a .mp3/.wav file
 record_or_upload = "Upload (.mp3 or .wav)"  #@param ["Record", "Upload (.mp3 or .wav)"]
 # The recording duration can range from 1 to 10 seconds, in steps of 1 second
 record_seconds = 10 #@param {type:"number", min:1, max:10, step:1}
  5. Define the actions to be performed based on the user’s selection of recording or uploading the audio.
 if record_or_upload == "Record":
   # Record audio from the browser using the record() method
   audio = record(seconds=record_seconds)
 else:
   # Load a .wav or .mp3 audio file from disk into the Colab notebook using upload();
   # it returns the names of the uploaded files and their respective audio arrays
   filenames, audios = upload()
   # If the user uploads multiple files, select the first one from the 'audios' array
   audio = audios[0]
 # Add a batch dimension
 audio = audio[np.newaxis, :]
 print('\nExtracting audio features...')
  6. Plot the spectrum of the audio signal, create an HTML5 widget to play it, and reset CREPE’s global state before re-building the model.
 # Plot the spectrogram of the input audio using specplot()
 specplot(audio)
 # Create an HTML5 audio widget using play() to play the audio file
 play(audio)
 # Reset CREPE's global state for re-building the model
 ddsp.spectral_ops.reset_crepe()
  7. Record the start time and compute the audio features.
 # Record the start time of feature extraction
 start_time = time.time()
 # Compute the audio features (loudness, fundamental frequency and its confidence)
 audio_features = ddsp.training.metrics.compute_audio_features(audio)
 # Store the loudness (in decibels) of the audio as float32
 audio_features['loudness_db'] = audio_features['loudness_db'].astype(np.float32)
 audio_features_mod = None
 # Time taken for computing audio features = current time - start time
 print('Audio features took %.1f seconds' % (time.time() - start_time))

  8. Plot the computed features.
 TRIM = -15  # number of trailing frames to drop when plotting
 fig, ax = plt.subplots(nrows=3, sharex=True, figsize=(6, 8))
 # Plot the loudness of the audio
 ax[0].plot(audio_features['loudness_db'][:TRIM])
 ax[0].set_ylabel('loudness_db')
 # Plot the fundamental frequency as MIDI notes
 ax[1].plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
 ax[1].set_ylabel('f0 [midi]')
 # Plot the confidence of the pitch estimate
 ax[2].plot(audio_features['f0_confidence'][:TRIM])
 ax[2].set_ylabel('f0 confidence')
 _ = ax[2].set_xlabel('Time step [frame]')



The .mp3 audio file that we have used for the demonstration:

(Source of the audio file)

  9. Select the pretrained model of an instrument to be used.
 model = 'Violin' #@param ['Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone', 'Upload your own (checkpoint folder as .zip)']
 MODEL = model 

Define a function to find the directory of the selected model:

 def find_model_dir(dir_name):
   # Iterate through directories until model directory is found
   for root, dirs, filenames in os.walk(dir_name):
     for filename in filenames:
       if filename.endswith(".gin") and not filename.startswith("."):
         model_dir = root
   return model_dir  
  10. Select the model to be used.
 if model in ('Violin', 'Flute', 'Flute2', 'Trumpet', 'Tenor_Saxophone'):
   # Pretrained models.
   PRETRAINED_DIR = '/content/pretrained'
   # Copy over from gs:// for faster loading.
   !rm -r $PRETRAINED_DIR &> /dev/null
   !mkdir $PRETRAINED_DIR &> /dev/null
   GCS_CKPT_DIR = 'gs://ddsp/models/timbre_transfer_colab/2021-01-06'
   model_dir = os.path.join(GCS_CKPT_DIR, 'solo_%s_ckpt' % model.lower())
   !gsutil cp $model_dir/* $PRETRAINED_DIR &> /dev/null
   model_dir = PRETRAINED_DIR
   gin_file = os.path.join(model_dir, 'operative_config-0.gin')
 else:
   # User models.
   UPLOAD_DIR = '/content/uploaded'
   !mkdir $UPLOAD_DIR
   uploaded_files = files.upload()
   for fnames in uploaded_files.keys():
     print("Unzipping... {}".format(fnames))
     !unzip -o "/content/$fnames" -d $UPLOAD_DIR &> /dev/null
   model_dir = find_model_dir(UPLOAD_DIR)
   gin_file = os.path.join(model_dir, 'operative_config-0.gin')
  11. Load the dataset statistics file.
 DATASET_STATS = None
 dataset_stats_file = os.path.join(model_dir, 'dataset_statistics.pkl')
 print(f'Loading dataset statistics from {dataset_stats_file}')
 try:
   # Load the dataset statistics file if it exists
   if tf.io.gfile.exists(dataset_stats_file):
     with tf.io.gfile.GFile(dataset_stats_file, 'rb') as f:
       DATASET_STATS = pickle.load(f)
 # Print a message if loading the file fails
 except Exception as err:
   print('Loading dataset statistics from pickle failed: {}.'.format(err))
  12. Parse the gin config.
 # First, unlock the config temporarily using a context manager
 with gin.unlock_config():
   # Parse the operative config file using parse_config_file()
   gin.parse_config_file(gin_file, skip_unknown=True)
  13. Store the checkpoint files.
 # Collect every file in the model directory whose name contains 'ckpt'
 ckpt_files = [f for f in tf.io.gfile.listdir(model_dir) if 'ckpt' in f]
 # Extract the name of the checkpoint file
 ckpt_name = ckpt_files[0].split('.')[0]
 # Join the checkpoint filename to the path of the model directory
 ckpt = os.path.join(model_dir, ckpt_name)
  14. Check that dimensions and sampling rates are equal.
 # gin.query_parameter() returns the value currently bound to the given binding key
 # Time steps used during training
 time_steps_train = gin.query_parameter('F0LoudnessPreprocessor.time_steps')
 # Number of audio samples used during training
 n_samples_train = gin.query_parameter('Harmonic.n_samples')
 # Number of samples between successive frames (the 'hop size')
 hop_size = int(n_samples_train / time_steps_train)
 # Compute the total time steps and number of samples for the input audio
 time_steps = int(audio.shape[1] / hop_size)
 n_samples = time_steps * hop_size
  15. Create a list of gin parameters and parse it.
 gin_params = [
     'Harmonic.n_samples = {}'.format(n_samples),
     'FilteredNoise.n_samples = {}'.format(n_samples),
     'F0LoudnessPreprocessor.time_steps = {}'.format(time_steps),
     'oscillator_bank.use_angular_cumsum = True',
 ]
 # Unlock the config and parse the list of parameter bindings using parse_config()
 with gin.unlock_config():
   gin.parse_config(gin_params)
  16. Trim the input vectors to the correct lengths.
 # Trim each of the frequency, confidence and loudness vectors to the time-step length
 for key in ['f0_hz', 'f0_confidence', 'loudness_db']:
   audio_features[key] = audio_features[key][:time_steps]
 # Trim the 'audio' vector to a length equal to the total number of samples
 audio_features['audio'] = audio_features['audio'][:, :n_samples]
  17. Initialize the model just to predict audio, then restore the model checkpoint.

 model = ddsp.training.models.Autoencoder()
 # Restore the model weights from the checkpoint
 model.restore(ckpt)

  18. Build the model by running a batch of audio features through it.
 # Record the start time of the model build
 start_time = time.time()
 # Build the model by running the computed features through it
 _ = model(audio_features, training=False)
 # Time taken to restore the model = current time - start time
 print('Restoring model took %.1f seconds' % (time.time() - start_time))

 Sample output: Restoring model took 2.0 seconds

  19. The pretrained models (Violin, Flute etc.) were not explicitly trained to perform timbre transfer, so they may sound unnatural if the input audio frequencies and loudness are very different from the training data (which will be true most of the time).

Create sliders for model conditioning

 #@markdown ## Note Detection
 #@markdown You can leave this at 1.0 for most cases
 threshold = 1 #@param {type:"slider", min: 0.0, max:2.0, step:0.01}
 #@markdown ## Automatic
 ADJUST = True #@param{type:"boolean"}
 #@markdown Quiet parts without notes detected (dB)
 quiet = 20 #@param {type:"slider", min: 0, max:60, step:1}
 #@markdown Force pitch to nearest note (amount)
 autotune = 0 #@param {type:"slider", min: 0.0, max:1.0, step:0.1}
 #@markdown ## Manual
 #@markdown Shift the pitch (octaves)
 pitch_shift =  0 #@param {type:"slider", min:-2, max:2, step:1}
 #@markdown Adjust the overall loudness (dB)
 loudness_shift = 0 #@param {type:"slider", min:-20, max:20, step:1}
 audio_features_mod = {k: v.copy() for k, v in audio_features.items()} 

The sliders for modifying the conditioning appear as form widgets in the Colab notebook.


  20. Define a method to shift the loudness.
 def shift_ld(audio_features, ld_shift=0.0):
 #Increment the loudness by ld_shift
   audio_features['loudness_db'] += ld_shift
 #Return modified audio features
   return audio_features  
  21. Define a method to shift the fundamental frequency by a number of octaves.
 def shift_f0(audio_features, pitch_shift=0.0):
   # Multiply the frequency by 2^pitch_shift
   audio_features['f0_hz'] *= 2.0 ** (pitch_shift)
   # Clip the frequency to the range [0 Hz, MIDI note 110 in Hz]
   audio_features['f0_hz'] = np.clip(audio_features['f0_hz'], 0.0,
                                     librosa.midi_to_hz(110.0))
   return audio_features
  22. Detect the sections of audio which are ‘on’ (notes present) and adjust the loudness and pitch accordingly.
 mask_on = None
 if ADJUST and DATASET_STATS is not None:
   # Store the mask and note-on likelihood of the 'on' sections
   mask_on, note_on_value = detect_notes(audio_features['loudness_db'],
                                         audio_features['f0_confidence'],
                                         threshold)
   if np.any(mask_on):
     # Quantile-shift the parts with 'on' notes
     _, loudness_norm = colab_utils.fit_quantile_transform(
         audio_features['loudness_db'],
         mask_on,
         inv_quantile=DATASET_STATS['quantile_transform'])
     # Turn down the parts of audio with 'off' notes
     mask_off = np.logical_not(mask_on)
     loudness_norm[mask_off] -= quiet * (1.0 - note_on_value[mask_off][:, np.newaxis])
     # Reshape the normalized loudness array
     loudness_norm = np.reshape(loudness_norm, audio_features['loudness_db'].shape)
     # Update the loudness (in dB) to the normalized loudness
     audio_features_mod['loudness_db'] = loudness_norm

     # If 'autotune' is set using the slider widget
     if autotune:
       # Convert frequency (Hz) to MIDI notes
       f0_midi = np.array(ddsp.core.hz_to_midi(audio_features_mod['f0_hz']))
       # Get an offset, in cents, to the most consistent set of chromatic intervals
       tuning_factor = get_tuning_factor(f0_midi,
                                         audio_features_mod['f0_confidence'],
                                         mask_on)
       # Reduce the variance of the frequency from the chromatic or scale intervals
       f0_midi_at = auto_tune(f0_midi, tuning_factor, mask_on, amount=autotune)
       # Store the frequency in Hz by converting MIDI notes back to Hz
       audio_features_mod['f0_hz'] = ddsp.core.midi_to_hz(f0_midi_at)
   else:
     # Display a message if no notes are detected
     print('\nSkipping auto-adjust (no notes detected or ADJUST box empty).')
 else:
   # Display a message if 'ADJUST' is not checked or no dataset statistics file is found
   print('\nSkipping auto-adjust (box not checked or no dataset statistics found).')
  23. Perform manual shifts of the loudness and frequency using the methods defined in steps 20 and 21.
 audio_features_mod = shift_ld(audio_features_mod, loudness_shift)
 audio_features_mod = shift_f0(audio_features_mod, pitch_shift) 
  24. Plot the features.
 # Check whether a note-on mask exists
 has_mask = int(mask_on is not None)
 # 3 subplots if 'has_mask' is 1 (True), else only 2 subplots (loudness and frequency)
 n_plots = 3 if has_mask else 2
 # Initialize the figure and axes
 fig, axes = plt.subplots(nrows=n_plots,
                          sharex=True,
                          figsize=(2*n_plots, 8))
 # Plot the mask of 'on' notes, if it exists
 if has_mask:
   ax = axes[0]
   ax.plot(np.ones_like(mask_on[:TRIM]) * threshold, 'k:')
   ax.plot(note_on_value[:TRIM])
   ax.plot(mask_on[:TRIM])
   ax.set_ylabel('Note-on Mask')
   ax.set_xlabel('Time step [frame]')
   ax.legend(['Threshold', 'Likelihood', 'Mask'])
 # Plot the original and adjusted loudness
 ax = axes[0 + has_mask]
 ax.plot(audio_features['loudness_db'][:TRIM])
 ax.plot(audio_features_mod['loudness_db'][:TRIM])
 ax.set_ylabel('loudness_db')
 ax.legend(['Original', 'Adjusted'])
 # Plot the original and adjusted frequencies
 ax = axes[1 + has_mask]
 ax.plot(librosa.hz_to_midi(audio_features['f0_hz'][:TRIM]))
 ax.plot(librosa.hz_to_midi(audio_features_mod['f0_hz'][:TRIM]))
 ax.set_ylabel('f0 [midi]')
 _ = ax.legend(['Original', 'Adjusted'])


  25. Resynthesize the audio.

Store the computed audio features first:

af = audio_features if audio_features_mod is None else audio_features_mod

Run a batch of predictions:

 # Record the start time of the prediction
 start_time = time.time()
 # Apply the model initialized in step 17 to the computed audio features
 outputs = model(af, training=False)

Extract the audio output from the outputs dictionary:

audio_gen = model.get_audio_from_outputs(outputs)

Display the time taken for making the predictions by computing the difference between the current time and the start time:

print('Prediction took %.1f seconds' % (time.time() - start_time))

  26. Create HTML5 widgets for playing the original and resynthesized audio, and plot the spectrograms of both signals, as sketched below.
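Here is one way this final cell could look, a minimal sketch assuming only the play() and specplot() helpers imported earlier and the audio and audio_gen arrays computed above:

 # Play the original and the resynthesized audio in HTML5 widgets
 print('Original')
 play(audio)
 print('Resynthesis')
 play(audio_gen)

 # Plot the spectrograms of both signals
 specplot(audio)
 plt.title("Original")
 specplot(audio_gen)
 _ = plt.title("Resynthesis")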

Output widgets:


Output plots:


Original audio:

Resynthesized audio (using ‘Violin’ model):

The Google Colab notebook of the above implementation is available here.

