TrainGenerator is a Streamlit-based web app that generates machine learning template code, covering data loading, preprocessing, model development, hyperparameter settings, and the other options needed to build a complete model. This open-source tool was created by Johannes Rieke, a machine learning engineer. It eases the work of data scientists and of non-technical people working in data science and machine learning. The generated code can be opened in a Google Colab notebook or downloaded in .py or .ipynb format.
TrainGenerator also lets users add their own custom templates. So far, only image classification templates have been released; object detection and other use cases are expected to follow.
The left sidebar of the web app contains the parameter settings. Under framework selection, the options are PyTorch and scikit-learn. For PyTorch, the available models are AlexNet, ResNet, VGG, and DenseNet, with an option to use a model pre-trained on ImageNet; for scikit-learn, they are support vectors, random forest, perceptron, k-nearest neighbours, and decision tree. The input data can be NumPy files or image files. Preprocessing options include resizing images to a size compatible with the model, centre-crop image augmentation, and scaling with the mean and standard deviation expected by the pre-trained model.
Next come the training options, including using a GPU if one is available and saving a model checkpoint. Hyperparameters include the loss function (CrossEntropyLoss or BCEWithLogitsLoss) and the optimizer (Adam, Adadelta, Adagrad, Adamax, RMSprop, or SGD); the learning rate, batch size, number of epochs, and how often progress is printed can be set manually. Lastly, metrics can be logged to TensorBoard or comet.ml, or not logged at all.
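To make the idea of template code generation concrete, here is a minimal sketch that fills a code template with options like the ones chosen in the sidebar. It uses Python's standard-library string.Template purely for illustration; the template text, the generate_code helper, and the option names are all hypothetical and are not taken from TrainGenerator's own templates.

```python
from string import Template

# Hypothetical template text, written for this sketch only.
# Each $placeholder is replaced by one of the user's choices.
TEMPLATE = Template(
    "from torch import nn, optim\n"
    "from torchvision import models\n"
    "\n"
    "model = models.$model(pretrained=$pretrained)\n"
    "loss_func = nn.$loss()\n"
    "optimizer = optim.$optimizer(model.parameters(), lr=$lr)\n"
)


def generate_code(options: dict) -> str:
    """Return training code with every placeholder filled from `options`."""
    # substitute() raises KeyError if an option is missing,
    # so an incomplete selection cannot produce half-filled code.
    return TEMPLATE.substitute(options)


# Hypothetical selections, mirroring the sidebar choices described above.
options = {
    "model": "resnet18",
    "pretrained": True,
    "loss": "CrossEntropyLoss",
    "optimizer": "Adam",
    "lr": 0.001,
}

code = generate_code(options)
print(code)
```

A custom template in this sketch would simply be another template string with its own placeholders, which is one plausible way to read the "add your own templates" feature.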
Code Snippet
There are two ways to use TrainGenerator: through the web app (as described above) or by running it locally:
git clone https://github.com/jrieke/traingenerator.git
cd traingenerator
pip install -r requirements.txt
streamlit run app/main.py
Code Generated for sklearn
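import numpy as np
import sklearn
import sklearn.preprocessing  # needed for sklearn.preprocessing.StandardScaler
import sklearn.utils  # needed for sklearn.utils.shuffle
from sklearn.tree import DecisionTreeClassifier
from torchvision import datasets, transforms
import urllib.request  # "import urllib" alone does not guarantee urllib.request is loaded
import zipfile
from tensorboardX import SummaryWriter
from datetime import datetime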
# Comment out this part to use your own data.
url = "https://github.com/jrieke/traingenerator/raw/main/data/fake-image-data.zip"
zip_path, _ = urllib.request.urlretrieve(url)
with zipfile.ZipFile(zip_path, "r") as f:
    f.extractall("data")
# Data insertion
train_data = "data/image-data"  # required
val_data = "data/image-data"    # optional
test_data = None                # optional
# Setting up logging.
experiment_id = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
writer = SummaryWriter(logdir=f"logs/{experiment_id}")
# preprocessing
# Setting up a scaler.
scaler = sklearn.preprocessing.StandardScaler()

def preprocess(data, name):
    if data is None:  # val/test can be empty
        return None

    # Reading image files to pytorch dataset
    transform = transforms.Compose([
        transforms.Resize(28),
        transforms.CenterCrop(28),
        transforms.ToTensor()
    ])
    data = datasets.ImageFolder(data, transform=transform)

    # Converting images to NumPy arrays.
    images_shape = (len(data), *data[0][0].shape)
    images = np.zeros(images_shape)
    labels = np.zeros(len(data))
    for i, (image, label) in enumerate(data):
        images[i] = image
        labels[i] = label
    images = images.reshape(len(images), -1)

    # Scaling to mean 0 and std 1.
    if name == "train":
        scaler.fit(images)
    images = scaler.transform(images)

    # Shuffling over the train set.
    if name == "train":
        images, labels = sklearn.utils.shuffle(images, labels)

    return [images, labels]
processed_train_data = preprocess(train_data, "train")
processed_val_data = preprocess(val_data, "val")
processed_test_data = preprocess(test_data, "test")

model = DecisionTreeClassifier()

def evaluate(data, name):
    if data is None:  # val/test can be empty
        return
    images, labels = data
    acc = model.score(images, labels)
    print(f"{name + ':':6} accuracy: {acc}")
    writer.add_scalar(f"{name}_accuracy", acc)
# Train on train_data.
model.fit(*processed_train_data)
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
# Evaluation
evaluate(processed_train_data, "train")
evaluate(processed_val_data, "val")
evaluate(processed_test_data, "test")
train: accuracy: 1.0
val: accuracy: 1.0
The complete notebook generated by TrainGenerator can be viewed here.
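The "Scaling to mean 0 and std 1" step in the preprocessing code fits the scaler on the training set only and then reuses those statistics for the validation and test sets. The dependency-free sketch below illustrates that idea on toy numbers. It is not scikit-learn's implementation: fit_scaler, transform, and the sample rows are hypothetical names and data made up for this example.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute per-feature mean and population std from the training rows."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # A constant feature has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Standardize rows with statistics that were computed elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy data: three training rows and one validation row, two features each.
train = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
val = [[2.0, 12.0]]

# Statistics come from the training rows only ...
means, stds = fit_scaler(train)

# ... and are then applied unchanged to both splits.
scaled_train = transform(train, means, stds)
scaled_val = transform(val, means, stds)

print(scaled_train)
print(scaled_val)
```

Fitting on the training rows alone keeps information from the validation and test sets out of the preprocessing, which is why the generated code calls scaler.fit only when name == "train".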
Code generated for PyTorch
import numpy as np
import torch
from torch import optim, nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models, datasets, transforms
from ignite.engine import Events, create_supervised_trainer, create_supervised_evaluator
from ignite.metrics import Accuracy, Loss
from datetime import datetime
from tensorboardX import SummaryWriter
from pathlib import Path
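# Data loading is the same as in the scikit-learn example.
# Note: batch_size, kwargs, device, lr, num_epochs, print_every and checkpoint_dir
# are used below, but their definitions are not shown in this excerpt.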
# preprocessing
def preprocess(data, name):
    if data is None:  # val/test can be empty
        return None

    # Reading image files to pytorch dataset.
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    dataset = datasets.ImageFolder(data, transform=transform)

    loader = DataLoader(dataset, batch_size=batch_size, shuffle=(name=="train"), **kwargs)
    return loader

train_loader = preprocess(train_data, "train")
val_loader = preprocess(val_data, "val")
test_loader = preprocess(test_data, "test")
# Setting up model, loss, optimizer.
model = models.resnet18(pretrained=True)
model = model.to(device)
loss_func = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
# Setting up pytorch-ignite trainer and evaluator.
trainer = create_supervised_trainer(
    model,
    optimizer,
    loss_func,
    device=device,
)
metrics = {
    "accuracy": Accuracy(),
    "loss": Loss(loss_func),
}
evaluator = create_supervised_evaluator(
    model, metrics=metrics, device=device
)

@trainer.on(Events.ITERATION_COMPLETED(every=print_every))
def log_batch(trainer):
    batch = (trainer.state.iteration - 1) % trainer.state.epoch_length + 1
    print(
        f"Epoch {trainer.state.epoch} / {num_epochs}, "
        f"batch {batch} / {trainer.state.epoch_length}: "
        f"loss: {trainer.state.output:.3f}"
    )

@trainer.on(Events.EPOCH_COMPLETED)
def log_epoch(trainer):
    print(f"Epoch {trainer.state.epoch} / {num_epochs} average results: ")

    def log_results(name, metrics, epoch):
        print(
            f"{name + ':':6} loss: {metrics['loss']:.3f}, "
            f"accuracy: {metrics['accuracy']:.3f}"
        )
        writer.add_scalar(f"{name}_loss", metrics["loss"], epoch)
        writer.add_scalar(f"{name}_accuracy", metrics["accuracy"], epoch)

    # Training data.
    evaluator.run(train_loader)
    log_results("train", evaluator.state.metrics, trainer.state.epoch)

    # Validation data.
    if val_loader:
        evaluator.run(val_loader)
        log_results("val", evaluator.state.metrics, trainer.state.epoch)

    # Testing data.
    if test_loader:
        evaluator.run(test_loader)
        log_results("test", evaluator.state.metrics, trainer.state.epoch)

    print()
    print("-" * 80)
    print()

# saving checkpoint
@trainer.on(Events.EPOCH_COMPLETED)
def checkpoint_model(trainer):
    torch.save(model, checkpoint_dir / f"model-epoch{trainer.state.epoch}.pt")

# Starting training.
trainer.run(train_loader, max_epochs=num_epochs)
Epoch 1 / 5, batch 1 / 1: loss: 8.112
Epoch 1 / 5 average results:
train: loss: 10.275, accuracy: 0.000
val: loss: 11.407, accuracy: 0.000
Epoch 2 / 5, batch 1 / 1: loss: 0.152
Epoch 2 / 5 average results:
train: loss: 7.251, accuracy: 0.000
val: loss: 10.479, accuracy: 0.000
Epoch 3 / 5, batch 1 / 1: loss: 0.185
Epoch 3 / 5 average results:
train: loss: 4.322, accuracy: 0.500
val: loss: 10.263, accuracy: 0.000
Epoch 4 / 5, batch 1 / 1: loss: 0.000
Epoch 4 / 5 average results:
train: loss: 2.429, accuracy: 0.500
val: loss: 9.824, accuracy: 0.000
Epoch 5 / 5, batch 1 / 1: loss: 0.000
Epoch 5 / 5 average results:
train: loss: 1.521, accuracy: 0.750
val: loss: 9.791, accuracy: 0.000
The complete notebook generated by TrainGenerator can be viewed here.
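The log_batch handler in the generated code turns the trainer's running iteration count into a batch number within the current epoch using (iteration - 1) % epoch_length + 1. The small worked example below shows that arithmetic on its own, assuming the iteration counter starts at 1 and keeps increasing across epochs. The helper name batch_in_epoch and the epoch length of 3 are hypothetical, chosen only for illustration.

```python
def batch_in_epoch(iteration: int, epoch_length: int) -> int:
    """Map a 1-based running iteration count to a 1-based batch index in its epoch."""
    # Shift to 0-based, wrap around at the epoch length, then shift back to 1-based.
    return (iteration - 1) % epoch_length + 1


# With 3 batches per epoch, iterations 1..7 span two full epochs
# plus the first batch of a third.
epoch_length = 3
positions = [batch_in_epoch(i, epoch_length) for i in range(1, 8)]
print(positions)  # [1, 2, 3, 1, 2, 3, 1]
```

In the training log above every line reads "batch 1 / 1" because the sample dataset fits into a single batch, so the epoch length is 1 and the formula always returns 1.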
Deployment using Heroku
After completing the installation and logging in to Heroku, run the following from inside the traingenerator directory:
heroku create
git push heroku main
heroku open
EndNotes
To contribute more templates, open a pull request against the GitHub repository. TrainGenerator is a simple, easy-to-use app for both technical and non-technical people. Its automatic code generation features come in very handy for large-scale production work.