ML Ops with Dagster: 5 Key Features for Developing a Continuous Training Pipeline
One of the key concerns of ML Ops, the practice of unifying ML development with ML operations, is continuous training: the automated training and deployment of models. Automating model training and deployment with pipelines shortens your team's model release cycle, so your deployed models capture new trends in the data sooner. It also makes deployment less error-prone.
Here at Thinking Machines, we’re big fans of Dagster, a task orchestrator for machine learning, analytics, and ETL. We’ve used Dagster to develop ETL pipelines for several clients, but today we’d like to share which Dagster features we found the most useful for building a continuous training pipeline for our ML models. In this blog post, we will cover:
- How our team structured the ML pipeline for a search app
- How Dagster works, and
- Which Dagster features enabled us to implement continuous training
We mentioned that implementing continuous training is a key concern of ours. Beyond that, there are some other key properties we want in our ML pipelines:
- The individual parts of our pipelines should be reusable across pipelines
- The same pipeline code should execute in both dev and prod
- We should be able to track metadata of our pipeline runs, to improve reproducibility and let us compare model versions
Before we get into the Dagster features we found especially useful, we’ll give a little bit of context on the search application we used Dagster for.
Where We Used Dagster for ML Ops
We built the ML components of a smart search app for a client to consolidate search functionality across different data sources. The app actually combines three different kinds of search features:
- Semantic search - this returns a link to the relevant FAQ doc based on a user’s query.
- Q&A search - this returns a specific section of the employee handbook related to a query (e.g. “How do I apply for vacation leave”)
- Entity search - this detects the person or company being searched for in a query and provides relevant info.
Each search query actually goes through all three models, and the search results are fed to an additional model that ranks them. The final results presented to the user are ordered based on what the ranker model considers to be most relevant.
Before we go into more detail on how we implemented continuous training for this search app, let’s go over some of the basics of Dagster.
Quick Dagster Crash Course
Feel free to skip this section if you’re already familiar with Dagster.
Dagster organizes work into units of computation called “ops”, which are usually Python functions decorated with @op. Multiple ops can be linked together to form a Dagster job, an executable graph of ops. Jobs are created by decorating Python functions with @job. You specify the dependencies between ops in a job by invoking the downstream ops on the output of upstream ops. An important property of Dagster jobs is that dependencies between ops are data dependencies, not just a specification of the order in which tasks are run.
In the example above, we create a job called serial, which consists of two ops, download_cereals and find_sugariest. Note: the function calls don’t actually execute the functions. Within the body of a job, function calls are how we specify the dependency structure of the ops within the job. We invoke find_sugariest on the output of download_cereals because we want find_sugariest to depend on the output of download_cereals.
In the UI of Dagit, the web app for working with Dagster, the job is rendered as a graph of its ops, with edges showing the data dependencies between them.
Since dependencies between Dagster ops are data dependencies, Dagster optionally allows you to define the data types for an op’s inputs and outputs. In the example below, the op specifies its input and output types in its decorator:
Dagster provides types for built-in Python primitives, and also allows users to define new custom types. Should a type check fail at runtime, Dagster will raise a dagster.core.errors.DagsterTypeCheckDidNotPass error, and the failure will also surface in the run view of the Dagit UI.
Now that we’ve covered some preliminaries on how to use Dagster, we can talk about our workflow and how we structured our ML pipelines. Remember, our goal for the project was to implement continuous training. We want to automate the training and deployment of the model in production.
Our workflow breaks down into three phases:
- POC: This is the phase where we mostly worked in notebooks while our data scientists finalized which models to use and their methodology for training and evaluating them.
- Dev: This is where we developed the pipelines for doing continuous training.
- Prod: This is the actual production environment where our model is served to users. This pipeline automatically runs training, evaluation, and deployment of the new models if they meet a certain threshold on our metrics.
Deploying an ML model in this context means moving a file with the model weights to a staging area, where it can be loaded by the prediction services.
Continuous Training With Dagster
Now that you know a bit more about our overall automation workflow, we’d like to tell you more about the Dagster features that we felt had the most impact on developing our continuous training pipeline.
Easily Port Training Code
At Thinking Machines, Python is our lingua franca, so it’s natural that our data scientists write their training and eval functions in Python. Since Dagster ops are just Python functions decorated with @op, our notebook code could basically be copy-pasted into the body of a Dagster op. After that, we only had to add arguments to the @op decorator, as well as any additional logging and calls to AssetMaterialization we needed (more on that later).
Since jobs are composed of ops, and ops are just decorated Python functions, this fulfills our condition that our ML pipeline components can be reused across pipelines. Later on, we’ll talk a little about the mechanisms Dagster provides, like runtime configuration and resources, that allow us to modify ops and jobs to run in different situations without needing to modify the op code itself.
Simplify Dependency Management with Repositories
Dagster allows you to group your jobs together into collections called repositories. Repositories execute in separate processes from each other, and communicate with Dagster system processes like the Dagit web interface via RPC. This means each repository can be loaded in its own Python environment, and can even have different Python versions. It also means that the repositories’ dependencies don’t need to be stored in the same Python environment as Dagit, and that when there are code changes Dagit can pick them up without needing to restart.
Since we might have multiple ML jobs (and separate ETL jobs) running in a single Dagster deployment, it’s important for us to be able to isolate dependencies between our ML jobs and avoid conflicts (e.g. Use Tensorflow version x for the jobs in repository a, and use Tensorflow version y for the jobs in repository b).
Run the Same Code in Different Environments with Graphs
Remember when we said that one of the properties we wanted from our ML pipeline was for the same pipeline code to execute in dev and prod? Dagster provides a graph abstraction that represents a graph of data transformations. Dagster graphs can be used to create jobs with different configuration and external dependencies while reusing the same graph of ops. Since our bigger ML Ops goal is unifying development with operations, we want to avoid having code changes or extra logic specific to a particular environment. With configs and resources, Dagster's graphs allow us to modify a common workflow of ops to suit different environments and scenarios.
To create a graph, you annotate a Python function with @graph. Similar to a job, within the function body you invoke ops on the output of other ops in order to define the dependencies between your ops in the graph.
In order to facilitate reusing jobs and ops in different situations, Dagster provides a system to configure them at runtime. This is a convenient way for people to specify what each op should do when the job executes without having to modify that op’s code. E.g. If you want to train on different datasets or do ad-hoc batch prediction jobs, Dagster makes it easy for you to specify that as op config. Being able to easily specify configuration at the op level also makes it easier to reuse our ops in multiple jobs.
When manually executing a job, you can provide the config as YAML directly in the Dagit UI.
You can create multiple jobs with different configs from the same graph by calling to_job and specifying a new config:
Dagster also provides a resource system where you can create jobs with different external dependencies configured. This lets you create jobs and swap out different implementations of resources depending on which environment you are running your jobs in.
Track Metadata with Asset Materializations
In order for us to be able to reproduce and compare our models, we need to be able to track metadata about our ML pipelines. This might be information such as links to the stored versions of trained models, which job run produced the model, hyperparameter settings, and evaluation metrics.
Dagster provides Asset Materializations as a way to track artifacts and associated metadata that will persist beyond the lifetime of the computations that produce them. Asset Materializations allow you to associate metadata with an executing Dagster op and a particular key.
In the example above, we create an AssetMaterialization and use it to track metadata: the model’s filename, best hyperparameters, and accuracy. This data will then be visible in the Dagit UI for each training run. You can check materializations over time to see if any models performed especially well or poorly during evaluation.
Catch Errors Sooner with Input Validation
Dagster optionally lets you specify the type of each op input and output. This made it easier for us to keep track of what our jobs were doing, allowing us to catch errors earlier and speeding up our development cycle. We generally found this easier to develop in and debug than Airflow, where only the execution order between individual tasks is explicit, and the relationships between the inputs and outputs of individual tasks are buried in the operator code. (For an extensive, though not unbiased, discussion of the differences between Airflow and Dagster, see this blog post.)
We believe Dagster is a good orchestrator for implementing continuous training for production ML systems. Dagster job code is similar to regular Python, so it's easy to port over notebook code. Dagster repositories make it easy to organize your jobs and avoid conflicting dependencies. A key feature for us is Asset Materializations, which make it easy to view metrics and other metadata from model training runs. Finally, Dagster provides a few "quality-of-life" features that we appreciated: the ability to easily run jobs (or subsets of them) with different configs makes development easier, as well as the move to prod, and Dagit gives non-engineers an easy-to-use UI where they can inspect runs and check data types.