The Largest CAD Dataset Released With 15M Designs

In an effort to automate industrial design, researchers from Princeton University and Columbia University have introduced SketchGraphs, a dataset of 15 million two-dimensional, real-world computer-aided design (CAD) sketches. To facilitate research in ML-aided design, they also released an open-source data processing pipeline.

Introduced during the International Conference on Machine Learning, SketchGraphs is intended as training data for machine learning models that assist humans in creating CAD models. In a recent paper, the researchers explained that each CAD sketch is represented as a geometric constraint graph, together with the sequence of lines and shapes in which the design was originally created. This makes it possible to train models that predict what will be drawn next.

Many existing CAD datasets represent shapes as voxels or meshes, which has allowed researchers to sample realistic 3D shapes. However, such shapes usually cannot be modified in parametric design settings and are therefore not preferred for engineering workflows. SketchGraphs, on the other hand, focuses on parametric modelling rather than on 3D shape modelling.

Left: Example of a sketch; Right: A portion of its geometric constraint graph.
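To make the representation concrete, here is a minimal sketch of a geometric constraint graph in Python: primitives are nodes, constraints are edges, and both are kept in creation order. The class names, primitive kinds, and parameter names are assumptions chosen for illustration; they are not the format used by the SketchGraphs pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class Primitive:
    kind: str     # e.g. "line", "circle", "arc" (illustrative labels)
    params: dict  # numeric parameters; the key names are hypothetical


@dataclass
class Constraint:
    kind: str       # e.g. "coincident", "tangent", "horizontal"
    members: tuple  # indices of the primitives this constraint refers to


@dataclass
class SketchGraph:
    primitives: list = field(default_factory=list)   # nodes, in creation order
    constraints: list = field(default_factory=list)  # edges, in creation order

    def add_primitive(self, kind, **params):
        self.primitives.append(Primitive(kind, params))
        return len(self.primitives) - 1  # index of the new node

    def add_constraint(self, kind, *members):
        for m in members:
            if not 0 <= m < len(self.primitives):
                raise IndexError(f"constraint refers to unknown primitive {m}")
        self.constraints.append(Constraint(kind, tuple(members)))

    def average_degree(self):
        # Each constraint counts once for every primitive it touches.
        if not self.primitives:
            return 0.0
        memberships = sum(len(c.members) for c in self.constraints)
        return memberships / len(self.primitives)


def build_example():
    # Two lines meeting at a corner, plus a circle touching the second line.
    sketch = SketchGraph()
    a = sketch.add_primitive("line", x1=0.0, y1=0.0, x2=1.0, y2=0.0)
    b = sketch.add_primitive("line", x1=1.0, y1=0.0, x2=1.0, y2=1.0)
    c = sketch.add_primitive("circle", cx=2.0, cy=1.0, r=1.0)
    sketch.add_constraint("horizontal", a)
    sketch.add_constraint("coincident", a, b)
    sketch.add_constraint("tangent", b, c)
    return sketch
```

Keeping both lists in insertion order preserves the construction sequence, which is what a model would need in order to predict the next primitive or constraint.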

The dataset can be used to train AI models directly for targeted applications, making design workflows easier for engineers. Further, by providing a set of rendering functions for sketches, the researchers aim to enable work on CAD inference from images.

The SketchGraphs Dataset For Creating CAD Models

CAD software such as AutoCAD, SolidWorks, and OnShape can be used to design anything from a simple machine part to an entire machine. The SketchGraphs dataset was obtained through the public API of the product development platform OnShape, drawing on public documents created between 2015 and 2020 and resulting in over 15 million sketches.

The researchers' main motivation for SketchGraphs is to capture how the geometry of a design is constructed. For each CAD sketch, they therefore extracted the ground-truth construction operations for both the geometric primitives and the constraints attached to them.

First, the researchers used OnShape’s API to gather the metadata of all public documents from 2015 to 2020, which yielded two million unique document IDs. These documents contain multiple PartStudios, each of which describes the design of an individual component of a CAD model. After extracting the 2D sketches from every PartStudio and omitting the non-sketch features, the researchers arrived at 15 million sketches.

Left: Histogram of sketch sizes. Middle: Number of constraints with respect to the numbers of primitives in the sketch. Right: Average node degree with respect to the number of primitives.

To be included in the dataset, a sketch had to contain at least one geometric primitive and one geometric constraint. As a result, the dataset ranges from sketches with large constraint graphs to simple ones built on a single shape.
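The inclusion rule can be expressed as a simple filter. The snippet below is only an illustration: it assumes each sketch is a dictionary with hypothetical `"primitives"` and `"constraints"` lists, which is not necessarily how the released pipeline stores its records.

```python
def is_valid_sketch(sketch):
    # Keep a sketch only if it has at least one primitive and one constraint.
    return len(sketch["primitives"]) >= 1 and len(sketch["constraints"]) >= 1


def filter_sketches(sketches):
    # Return the sketches that satisfy the inclusion rule, in original order.
    return [s for s in sketches if is_valid_sketch(s)]


# Hypothetical records used only to exercise the filter.
candidates = [
    {"id": "a", "primitives": ["line"], "constraints": ["horizontal"]},
    {"id": "b", "primitives": ["circle"], "constraints": []},
    {"id": "c", "primitives": [], "constraints": []},
]
kept_ids = [s["id"] for s in filter_sketches(candidates)]
```

A single constrained shape, like record `"a"`, passes the filter, which matches the observation that the dataset includes very simple sketches.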

Applications of SketchGraphs Dataset

The researchers also described several target applications for which they believe models trained on SketchGraphs can be beneficial. They further highlighted the largely unexplored field of design-focused machine learning applications, for which SketchGraphs can act as a testbed for future research.

The paper further demonstrated two use cases of the SketchGraphs dataset: Autoconstrain, which conditionally infers constraints, and unconditional generative modelling. The researchers provided initial benchmarks for both applications.

Take Autoconstrain as an example. The researchers suggest treating the primitives of the dataset sketches as the input and the ground-truth constraints as the prediction target. The task of Autoconstrain is then to predict a set of constraints for a given set of primitives. For this, the researchers proposed an auto-regressive model based on message passing networks.

Autoconstraining a sketch. Left: Original input sketch. Blue arrows: User modifications. Modification A: Dragging the top circle upwards; Modification B: Both enlarging the circle and dragging it to the right.
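One way to picture the auto-regressive framing is to turn a single sketch into a series of training examples, each asking for the next constraint given the primitives and the constraints predicted so far. The sketch below shows only this data framing, not the message passing model itself; the `STOP` token and the tuple layout are assumptions made for illustration.

```python
STOP = "<stop>"  # hypothetical end-of-sequence marker


def autoconstrain_examples(primitives, constraints):
    """Build (primitives, constraints_so_far, next_target) training triples.

    The primitives are the fixed input; the ground-truth constraints are
    revealed one at a time, so each step predicts the next constraint.
    """
    examples = []
    for i, target in enumerate(constraints):
        examples.append((primitives, list(constraints[:i]), target))
    # A final step teaches the model when to stop adding constraints.
    examples.append((primitives, list(constraints), STOP))
    return examples


# Hypothetical sketch: two lines and two constraints between them.
prims = ["line", "line"]
cons = [("coincident", (0, 1)), ("perpendicular", (0, 1))]
steps = autoconstrain_examples(prims, cons)
```

For a sketch with n constraints this yields n + 1 examples, the last of which has the stop marker as its target.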

To evaluate the Autoconstrain model, the researchers predicted edges on a test set and obtained an average edge precision of 0.74. They further demonstrated the inferred constraints by editing a sample sketch and inspecting the resulting solved state.
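As a rough illustration of the metric, edge precision can be read as the fraction of predicted constraint edges that also appear in the ground truth, averaged over sketches. The functions below follow that common definition; the exact definition used in the paper may differ, and the edge encoding as `(kind, members)` pairs is an assumption.

```python
def normalize_edge(kind, members):
    # Treat (0, 1) and (1, 0) as the same edge by sorting the member indices.
    return (kind, tuple(sorted(members)))


def edge_precision(predicted, truth):
    # Fraction of predicted edges that are present in the ground truth.
    predicted = {normalize_edge(k, m) for k, m in predicted}
    truth = {normalize_edge(k, m) for k, m in truth}
    if not predicted:
        return 0.0  # convention chosen here for an empty prediction
    return len(predicted & truth) / len(predicted)


def average_edge_precision(pairs):
    # Mean per-sketch precision over (predicted, truth) pairs.
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one sketch to average over")
    return sum(edge_precision(p, t) for p, t in pairs) / len(pairs)


# Hypothetical predictions for two sketches.
truth_1 = [("coincident", (0, 1)), ("tangent", (1, 2)), ("horizontal", (0,))]
pred_1 = [("coincident", (1, 0)), ("tangent", (1, 2)),
          ("parallel", (0, 1)), ("horizontal", (0,))]
truth_2 = [("vertical", (0,))]
pred_2 = [("vertical", (0,))]
score = average_edge_precision([(pred_1, truth_1), (pred_2, truth_2)])
```

In this toy case, three of the four predicted edges in the first sketch are correct and the second sketch is predicted perfectly, so the average is (0.75 + 1.0) / 2.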

Wrapping Up

Along with SketchGraphs, the large-scale dataset of CAD sketches, the researchers also introduced an open-source processing pipeline for ML-aided design. They believe that effectively training machine learning models to construct object designs has immense potential to make design workflows more efficient for engineers. And, “unsupervised learning on the SketchGraphs data will allow such possibilities for CAD designs,” stated the researchers.

Sejuti Das
Sejuti currently works as Associate Editor at Analytics India Magazine (AIM).
