A tutorial on building custom ML training workflow using Google Cloud Platform
– Quý Đinh –
Imagine a company named Rainbow imports boxes of flowers and need to classify them into species. For six months, they have some staff label the boxes manually. Now, they hire you to build a Machine Learning model to do the task.
With a small amount of labelled data as input and tons of experience working on Kaggle projects, you quickly develop a 95% accuracy using simple RandomForestClassifier from the popular scikit-learn library. Nice. Stakeholders approve and ask you when you could deploy that model to production.
Hmm, deploy a model from my laptop? …
In case you wonder, I hope this tutorial will help you understand one among some common and most simple approaches. The diagram below depicts how we will use Google Cloud Platform to do the job in a batch-processing manner.
Like many other self-taught data people, I am familiar with manipulating data and develop a model on my laptop.
However, when you’re solving real-world problems, your duty does not stop after you deliver a presentation. You will have to think about how to bring that solution to the production environment.
Over the last few months, I have tried to deploy multiple computing pipelines. They are different in their scopes and complexity, ranging from processing a dozen of MB to 400 GB data per run. In this article, I want to summarize and share what I learned.
The targeted audience
This post is for data analysts/scientists who want to deploy their local solution, especially those without a software engineering background.
You will need Cloud Dataproc to proceed. This product allows you to spin up a cluster of machines to run your computing job in a distributed manner. Please refer to this documentation if you don’t know what Dataproc is.
- Discuss the approach
- Step-by-step instructions to create the infrastructure and run the pipeline
- Explain codebase
- Introduce other extended components, including Big Data processing with Apache Spark, scheduler with Airflow, local development environment, unit testing
About writing codes
Instead of writing a long script to do everything, we break a pipeline into tasks and checkpoint interim data to disk. For example, after doing preprocess on train and test data, we dump both the data outputs and the transformer to Google Cloud Storage. We then load those objects as inputs for the next step.
This strategy has several purposes. First, for a long-running task, if a job fails at one of the last steps, we can re-run the pipeline from the nearest checkpoint rather than wasting time and resources restarting the whole pipeline. Second, it allows us to (1) debug more easily, (2) get alert when things break and (3) monitor interim outputs. Lastly, decoupled components can be understood more clearly, and easier to be replaced or extended later.
About computing resources
Normally for a small input size, we are fine with setting up a single virtual machine on the cloud. However, in some companies with mature cloud practice, the overhead of managing that VM is a type of cost that is difficult to justify. Especially when we have better options. For instance, Cloud Dataproc provides us with virtual machines that only live for the duration of one run, thereby free us from managing the machines. In this post, we explore Dataproc as our main engine for all the computing process.
Create a GCP project and enable necessary components
- 👉 Create a free GCP account with $300 credit by going to console.cloud.google.com. Beware that by following this tutorial, you might incur a cost of about $0.2–$0.5.
2. 👉 Click Billing at the left sidebar and initiate a billing account to be able to use the components used in this tutorial
3. 👉 Select Library, then search and enable the following API: Cloud Dataproc, Cloud Storage and Cloud Firestore.
4. 👉 Navigate to the Firestore either by scrolling the sidebar to the left or search from the top menu bar. When you arrive at the below screen, choose SELECT NATIVE MODE, then choose
us-east1 as the location.
Step 1: Launch terminal window
5. 👉 At the home page of your GCP project, select the command button to the right of your menubar. The CloudShell window then appears as you can see below:
6. 👉 Launch Cloud Shell Editor:
It’s recommended to use Cloud Shell to follow this tutorial. However, if you’re using Linux and want to use terminal on your local machine, make sure you first install the Google Cloud SDK and firebase CLI.
Step 2: Clone Github repo
7. 👉 In the Terminal window:
git clone https://github.com/dvquy13/gcp_ml_pipeline.git cd gcp_ml_pipeline
8. 👉 Select
File then open the file
9. 👉 Replace the values enclosed by <>. For the
GCP_PROJECT, you need to provide the
id of your GCP project. For the remaining, feel free to choose some random names for the global variables that identify your resources. The final output looks like this:
GCP_PROJECT='zinc-primer-230105' GCS_BUCKET=dvquys-tut-gcp-ml-pipeline DATA_LOCATION=us-east1 BQ_DATASET=tut_iris BQ_ORG_TABLE=F_ORIGINAL CLUSTER_NAME=iris-pred
10. 👉 Grant
execute permission to the folder scripts by running the command:
chmod +x -R ./scripts. Then, run
./scripts/00_import_data_to_bigquery.sh. Link to the script.
Step 3: Create Dataproc cluster and submit jobs
We use Makefile to orchestrate our actions. You can find it here: https://github.com/dvquy13/gcp_ml_pipeline/blob/master/Makefile.
️Now, run the following commands in sequence:
make create-dataproc-clusterThis command creates a Dataproc cluster. The
single-nodeflag indicates that this is a cluster containing only one machine.
n1-standard-1is the cheapest machine we can rent. To install Python packages, we supply the
Package your code, including your source code and other 3rd party libraries that you can not pre-install when creating the cluster (PyYAML for example). To submit a job to the cluster, we will send these codes to those machines via the
gcloud dataproc jobs submit pysparkcommand.
make submit-job ENV=dev MODULE=data_import TASK=query_train_predSubmit job cloning input data for training and predicting. The
makecommand allows you to use this interface to run on both local and development environments.
make submit-job ENV=dev MODULE=feature_engineer TASK=normalize
Prepare features. In this illustrative example, we choose to include only normalization in the pipeline. After learning the normalization parameters from the train data set, we save those configurations for later usage.
make submit-job ENV=dev MODULE=model TASK=fit
Train model. Here we build a pipeline consisting of 2 steps, Normalization and Logistic Regression. After that, we persist the fit pipeline.
make submit-job ENV=dev MODULE=predict TASK=batch_predict
Batch predict. This job demonstrates the process when you use your learned model to make predictions.
make submit-job ENV=dev MODULE=predict TASK=store_predictions
Store predictions. The reason we do not combine this with the above step is two-fold. First, writing to a database often takes time and requires several retries. Second, we write to a document database like Cloud Firestore because when other team uses, they typically retrieve one document per query. However, there are times when we want to inspect the whole batch of predictions (e.g. debugging, count number of documents scored more than 0.9). For this query pattern, we will better off using the persisted outputs from the previous step, stored as parquet files in Cloud Storage.
Delete Dataproc cluster. After the process finishes, delete the cluster so no further cost incurs.
You can see that your predictions are stored at Cloud Firestore by accessing its web console.
Along the way, you will see that the output data of each step is persisted in Cloud Storage. I use
parquet rather than
CSV as the serialization format because it can embed schema information (therefore you do not have to specify column types when reading) and reduce storage size. For more detail, please refer to this benchmark.
11. 👉 Finally, when you’re done exploring the results, you can delete all resources by running these commands:
./scripts/01_erase_resources.sh ./scripts/02_disable_resources.sh ./scripts/03_delete_project.sh
scripts/This directory contains some initial scripts, which are the steps to help you set things up. In practice, I also favor using script rather than user interfaces such as web console because it is self-documented and easy for others to follow the exact steps.
configs/Store all the arguments that need to be set initially.
.project_env is a file to store the global variables used to work with GCP. We also have the
runtime.yaml, where we use Anchor, Alias and Extension in YAML to define runtime parameters for multiple environments. Both of these files serve as a centralized config store so that we can easily look up and make changes, instead of finding the configs scattered elsewhere in the code.
MakefileOriginally Makefile is used to orchestrate the build process in C programming language. But it has done so well out of being just a shortcut so people start using it to facilitate ML model development. I have seen many tutorials using this tool, including the one that inspires me to design my Pyspark codebase.
In this small project, we also use Makefile to save us a lot of time. As you can see above in Step 3, I put there our frequently used commands so that I can easily type
make <something> to run a particular step.
iris_pred/: Source code.
main.py: is the interface to all tasks. This file parses the arguments to load config and get the job name, then call
analyze function in
entry_point.py from the appropriate module.
jobs/: contain tasks as modules. Inside
jobs, we have one module corresponding to a step in our pipeline. All these modules expose an
entry_point.py file where we unify the API to easily and consistently communicate with
As you can see in the snippet above, the class
Trainer expose a function
run. Each step in the process corresponds to a private function declared in the same class.
shared/: functions and classes to be reused across modules
io_handler.py, the class IOHandler applies the principle Composition Over Inheritance to ease the process of loading outputs from the previous step.
To completely build and operate a pipeline, there is still more to be considered.
Apache Spark for bigger data
In this tutorial, we rent one small machine from Dataproc and use pandas as our preprocessing engine, which perfectly handles the case of data fit into the memory of that machine. However, often data input in real-world situations will be much bigger, therefore require us to use a distributed computing framework for scalability. In that case, you can just switch to using Apache Spark. From version 1.3, Spark introduces its DataFrame API, which greatly bears resemblance to Pandas counterpart. After porting your code from Pandas to Spark, to be able to run jobs across multiple machines, you just need to create a bigger cluster with a master and multiple workers.
Apache Airflow for orchestration
Most of the batch job is not ad hoc. If it is, we should not even think about putting effort to standardize the process in the first place. Apache Airflow can play the role of both a scheduler and a monitor. It keeps metadata of each run and can send you alerts when things fail.
An alternative is Dataproc Workflows. This is a native solution offered by GCP, but I haven’t tried it myself so I will just leave the documentation here.
Because rarely our codes work the first time we write them, it’s very important to be able to quickly test without having to go through all the boilerplate steps from setting up variables to requesting cloud resources. My suggestion is that we should set up our local environment asap. We can install Apache Spark 2.4.3+ to act as our runner engine, and MongoDB to be our alternative for Cloud Firestore. Here in the code repo, you can still refer to some line containing what I call the “environment branching logic”, which enables you to switch between running the same code on both local and cloud environments.
Many people have already talked about unit testing, so I won’t go too detailed here. I also don’t do unit testing in this tutorial for the sake of simplicity. However, I strongly encourage you to add testing yourself. Whatever it takes, unit testing forces us to modularize our code and add a layer of alerting. This is very important because things in data science often break in silence.
Alternatives for batch processing
There are several ways to do batch processing on Google Cloud Platform. I choose Cloud Dataproc because I am familiar with Apache Spark and my workflow typically involves components in the Machine Learning/Data Science ecosystem.
Here is a summary of what you have learned in this tutorial:
- How to utilize different Google Cloud Platform components to build a batch job pipeline (whether it involves ML or not).
- A product named Google Cloud Dataproc, where you can both submit a light-weight job via single-node mode and easily scale to a cluster of computers.
- One approach to structurize ML pipeline codebase: Link to the repo.
- Some convenient components in model development, e.g. Makefile, runtime config, parquet persistence. This mostly helps people with little or no software engineering background.
Again, one of my main goals in writing this article is to receive feedback from the community, so I can do my job better. Please feel free me leave me comments, and I hope you guys enjoy this tutorial.