What I learned about deploying Machine Learning applications

A tutorial on building a custom ML training workflow using Google Cloud Platform

Quý Đinh

Imagine a company named Rainbow that imports boxes of flowers and needs to classify them by species. For six months, they have had staff label the boxes manually. Now they hire you to build a Machine Learning model to do the task.

Flowers recognized by a machine learning model. Source: https://hackernoon.com/top-5-machine-learning-projects-for-beginners-47b184e7837f

With a small amount of labelled data as input and tons of experience working on Kaggle projects, you quickly reach 95% accuracy using a simple RandomForestClassifier from the popular scikit-learn library. Nice. Stakeholders approve and ask when you can deploy the model to production.

Hmm, deploy a model from my laptop? …

If you are wondering how, I hope this tutorial will help you understand one of the most common and simplest approaches. The diagram below depicts how we will use Google Cloud Platform to do the job in a batch-processing manner.

I choose the Iris data set as our input to help you see how the approach works on a small-sized problem. All the code is in this repo: https://github.com/dvquy13/gcp_ml_pipeline

Introduction

When you're solving real-world problems, however, your duty does not stop after you deliver a presentation. You will have to think about how to bring that solution to the production environment.

Over the last few months, I have deployed multiple computing pipelines. They differ in scope and complexity, ranging from processing a dozen megabytes to 400 GB of data per run. In this article, I want to summarize and share what I learned.

The targeted audience

You will need Cloud Dataproc to proceed. This product allows you to spin up a cluster of machines to run your computing jobs in a distributed manner. Please refer to the documentation if you don't know what Dataproc is.

Agenda

  1. Step-by-step instructions to create the infrastructure and run the pipeline
  2. Explain the codebase
  3. Introduce other extended components, including Big Data processing with Apache Spark, scheduling with Airflow, the local development environment, and unit testing
  4. Summary

Approaches

About writing code

I break the pipeline into small, decoupled steps and persist the output of each one. This strategy serves several purposes. First, for a long-running task, if a job fails at one of the last steps, we can re-run the pipeline from the nearest checkpoint rather than waste time and resources restarting the whole pipeline. Second, it allows us to (1) debug more easily, (2) get alerted when things break and (3) monitor interim outputs. Lastly, decoupled components are easier to understand, and easier to replace or extend later.
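
To make the checkpoint idea concrete, here is a minimal sketch, assuming a PySpark job that persists each step's output to Cloud Storage as parquet (the bucket path and function names are hypothetical, not the actual repo code):

from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()
BUCKET = "gs://my-bucket/iris_pred"  # hypothetical bucket


def load_checkpoint(step: str) -> DataFrame:
    # Read the output persisted by a previous step.
    return spark.read.parquet(f"{BUCKET}/{step}/output.parquet")


def save_checkpoint(step: str, df: DataFrame) -> None:
    # Persist this step's output so later steps (or a re-run) can pick it up.
    df.write.mode("overwrite").parquet(f"{BUCKET}/{step}/output.parquet")


# Example: the feature-engineering step resumes from the data-import output,
# so a failed run does not need to redo the import.
raw = load_checkpoint("data_import")
features = raw.dropna()  # stand-in for the real feature logic
save_checkpoint("feature_engineer", features)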

About computing resources

Step-by-step instructions

Create a GCP project and enable necessary components

1. 👉 Go to the GCP Console (https://console.cloud.google.com) and create a new project

2. 👉 Click Billing in the left sidebar and set up a billing account so that you can use the components in this tutorial

3. 👉 Select Library, then search for and enable the following APIs: Cloud Dataproc, Cloud Storage and Cloud Firestore.

4. 👉 Navigate to Firestore, either by scrolling the left sidebar or by searching from the top menu bar. When you arrive at the setup screen, choose SELECT NATIVE MODE, then choose us-east1 as the location.

Environment setup

5. 👉 On the home page of your GCP project, select the Activate Cloud Shell button to the right of the menu bar. The Cloud Shell window then appears at the bottom of the screen:

6. 👉 Launch the Cloud Shell Editor.

It's recommended to use Cloud Shell to follow this tutorial. However, if you're using Linux and want to use the terminal on your local machine, make sure you first install the Google Cloud SDK and the Firebase CLI.

Step 2: Clone the GitHub repo

7. 👉 In the Terminal window:

git clone https://github.com/dvquy13/gcp_ml_pipeline.git
cd gcp_ml_pipeline

8. 👉 Select File, then open the file gcp_ml_pipeline/configs/.project_env.

9. 👉 Replace the values enclosed by <>. For GCP_PROJECT, you need to provide the id of your GCP project. For the rest, feel free to choose some names for the global variables that identify your resources. The final output looks like this:

GCP_PROJECT='zinc-primer-230105'
GCS_BUCKET=dvquys-tut-gcp-ml-pipeline
DATA_LOCATION=us-east1
BQ_DATASET=tut_iris
BQ_ORG_TABLE=F_ORIGINAL
CLUSTER_NAME=iris-pred

10. 👉 Grant execute permission to the scripts folder by running the command chmod +x -R ./scripts. Then, run ./scripts/00_import_data_to_bigquery.sh (link to the script).
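
Conceptually, this script loads the raw Iris data into the BigQuery table identified by BQ_DATASET and BQ_ORG_TABLE. As a hedged illustration only (the actual script in the repo may do this differently; the project id, bucket and file name below are hypothetical), the equivalent in Python with the BigQuery client looks roughly like this:

from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project")  # hypothetical project id
table_id = "your-gcp-project.tut_iris.F_ORIGINAL"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema of the small Iris file
)
uri = "gs://your-bucket/iris.csv"  # hypothetical location of the raw data

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")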

Step 3: Create Dataproc cluster and submit jobs

Now, run the following commands in sequence:

  1. make create-dataproc-cluster
    This command creates a Dataproc cluster. The single-node flag indicates that this is a cluster containing only one machine. n1-standard-1 is the cheapest machine we can rent. To install Python packages, we supply the metadata and initialization-actions params.
  2. make build
    Package your code, including your source code and third-party libraries that you cannot pre-install when creating the cluster (PyYAML, for example). To submit a job to the cluster, we send this package to those machines via the gcloud dataproc jobs submit pyspark command.
  3. make submit-job ENV=dev MODULE=data_import TASK=query_train_pred
    Submit the job that clones the input data for training and prediction. The submit-job make target gives you a single interface for running on both the local and the development environments.
  4. make submit-job ENV=dev MODULE=feature_engineer TASK=normalize
    Prepare features. In this illustrative example, we include only normalization in the pipeline. After learning the normalization parameters from the train data set, we save those configurations for later use.
  5. make submit-job ENV=dev MODULE=model TASK=fit
    Train the model. Here we build a pipeline consisting of 2 steps, Normalization and Logistic Regression, and then persist the fitted pipeline (see the sketch after this list).
  6. make submit-job ENV=dev MODULE=predict TASK=batch_predict
    Batch predict. This job demonstrates the process when you use your learned model to make predictions.
  7. make submit-job ENV=dev MODULE=predict TASK=store_predictions
    Store predictions. The reason we do not combine this with the step above is two-fold. First, writing to a database often takes time and requires several retries. Second, we write to a document database like Cloud Firestore because other teams that consume the predictions typically retrieve one document per query. However, there are times when we want to inspect the whole batch of predictions (e.g. for debugging, or to count how many documents scored above 0.9). For that query pattern, we are better off using the persisted outputs from the previous step, stored as parquet files in Cloud Storage.
  8. make delete-dataproc-cluster
    Delete the Dataproc cluster. After the process finishes, delete the cluster so that no further costs are incurred.
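
To make steps 5-7 more concrete, here is a hedged sketch of what fitting, batch predicting and storing could look like with Spark ML and the Firestore client. The column names, paths and collection name are hypothetical; the actual implementation lives in the repo.

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from google.cloud import firestore

spark = SparkSession.builder.getOrCreate()
BUCKET = "gs://my-bucket/iris_pred"  # hypothetical bucket

# fit: normalization + Logistic Regression, then persist the fitted pipeline.
train = spark.read.parquet(f"{BUCKET}/data_import/train.parquet")
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=features, outputCol="raw_features"),
    StandardScaler(inputCol="raw_features", outputCol="features"),  # stands in for the normalization step
    LogisticRegression(featuresCol="features", labelCol="label"),   # assumes the label is already indexed to numeric
])
model = pipeline.fit(train)
model.write().overwrite().save(f"{BUCKET}/model/pipeline")

# batch_predict: load the persisted pipeline and score the new rows.
to_score = spark.read.parquet(f"{BUCKET}/data_import/pred.parquet")
predictions = PipelineModel.load(f"{BUCKET}/model/pipeline").transform(to_score)
predictions.write.mode("overwrite").parquet(f"{BUCKET}/predict/predictions.parquet")

# store_predictions: push each prediction into Cloud Firestore, one document per row.
db = firestore.Client()
for row in predictions.select("id", "prediction").toLocalIterator():
    db.collection("iris_predictions").document(str(row["id"])).set(
        {"prediction": float(row["prediction"])}
    )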

Succeeded Dataproc jobs

You can see that your predictions are stored in Cloud Firestore by accessing its web console.

Firestore populated with predictions

Along the way, you will see that the output data of each step is persisted in Cloud Storage. I use parquet rather than CSV as the serialization format because it embeds schema information (so you do not have to specify column types when reading) and reduces storage size. For more detail, please refer to this benchmark.
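
As a quick illustration of the schema point (hypothetical paths; any Spark session will do):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(5.1, 3.5, "setosa")],
                           ["sepal_length", "sepal_width", "species"])

# Parquet embeds the schema, so a later step reads it back with the same types.
df.write.mode("overwrite").parquet("gs://my-bucket/demo.parquet")
spark.read.parquet("gs://my-bucket/demo.parquet").printSchema()

# With CSV you would need an extra inference pass (or a hand-written schema):
df.write.mode("overwrite").csv("gs://my-bucket/demo.csv", header=True)
spark.read.csv("gs://my-bucket/demo.csv", header=True, inferSchema=True).printSchema()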

Clean up

./scripts/01_erase_resources.sh
./scripts/02_disable_resources.sh
./scripts/03_delete_project.sh

Explain the codebase

configs/
Store all the arguments that need to be set initially. .project_env is a file to store the global variables used to work with GCP. We also have runtime.yaml, where we use Anchors, Aliases and Extensions in YAML to define runtime parameters for multiple environments. Both of these files serve as a centralized config store so that we can easily look up and change settings, instead of hunting for configs scattered elsewhere in the code.
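
To illustrate the anchor/alias idea (this is not the exact content of the repo's runtime.yaml), a base environment can be defined once and extended per environment, and PyYAML resolves the merge when loading:

import yaml

# Illustrative config only, with hypothetical keys.
RUNTIME_YAML = """
defaults: &defaults          # anchor: define the shared parameters once
  data_location: us-east1
  output_format: parquet

dev:
  <<: *defaults              # alias + merge: inherit everything from defaults
  cluster_name: iris-pred-dev

prod:
  <<: *defaults
  cluster_name: iris-pred
"""

config = yaml.safe_load(RUNTIME_YAML)
print(config["dev"]["data_location"])  # 'us-east1', inherited from defaults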

Makefile
Originally, Makefiles were used to orchestrate the build process of C programs, but they work so well as command shortcuts that people have started using them to facilitate ML model development. I have seen many tutorials using this tool, including the one that inspired me to design my PySpark codebase.

In this small project, the Makefile also saves us a lot of time. As you can see in Step 3 above, I put our frequently used commands there so that I can simply type make <something> to run a particular step.

iris_pred/: Source code.
– main.py: the interface to all tasks. This file parses the arguments to load the config and get the job name, then calls the analyze function in entry_point.py of the appropriate module.

– jobs/: contains the tasks as modules. Inside jobs, each module corresponds to one step in our pipeline. All these modules expose an entry_point.py file where we unify the API so that main.py can communicate with them easily and consistently.
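
A rough sketch of that dispatch pattern (the argument names and layout here are simplified guesses, not the exact repo code):

# main.py -- simplified sketch of the dispatcher idea.
import argparse
import importlib

import yaml


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", required=True)     # e.g. dev
    parser.add_argument("--module", required=True)  # e.g. model
    parser.add_argument("--task", required=True)    # e.g. fit
    args = parser.parse_args()

    with open("configs/runtime.yaml") as f:
        config = yaml.safe_load(f)[args.env]

    # Every job module exposes the same entry_point API.
    entry_point = importlib.import_module(f"jobs.{args.module}.entry_point")
    entry_point.analyze(task=args.task, config=config)


if __name__ == "__main__":
    main()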

iris_pred/jobs/model/train.py

In this file, the class Trainer exposes a run function. Each step in the process corresponds to a private method declared in the same class.
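
A minimal sketch of that structure (illustrative only; the private method names below are hypothetical):

class Trainer:
    def __init__(self, config, io_handler):
        self.config = config
        self.io = io_handler

    def run(self):
        # The public entry point: each step is a small private method.
        train_df = self._load_train_data()
        pipeline = self._build_pipeline()
        model = self._fit(pipeline, train_df)
        self._persist(model)

    def _load_train_data(self):
        return self.io.load_previous_output("feature_engineer")

    def _build_pipeline(self):
        ...  # assemble the normalization + Logistic Regression stages

    def _fit(self, pipeline, train_df):
        return pipeline.fit(train_df)

    def _persist(self, model):
        self.io.save_model(model)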

– shared/: functions and classes reused across modules
In io_handler.py, the class IOHandler applies the principle of Composition over Inheritance to make it easy to load the outputs of previous steps.
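
Roughly, the idea is that IOHandler holds a storage object instead of inheriting from one, so the storage backend can be swapped without touching the jobs (a hypothetical sketch, not the repo's exact classes):

class GCSParquetStore:
    # Knows how to read and write parquet on Cloud Storage.
    def __init__(self, spark, bucket):
        self.spark = spark
        self.bucket = bucket

    def read(self, path):
        return self.spark.read.parquet(f"{self.bucket}/{path}")

    def write(self, df, path):
        df.write.mode("overwrite").parquet(f"{self.bucket}/{path}")


class IOHandler:
    # Composed of a storage backend rather than inheriting from one.
    def __init__(self, store):
        self.store = store

    def load_previous_output(self, step):
        return self.store.read(f"{step}/output.parquet")

    def save_output(self, step, df):
        self.store.write(df, f"{step}/output.parquet")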

Further discussion

Apache Spark for bigger data

Apache Airflow for orchestration

Example of Airflow DAG. Source.
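
If the pipeline were scheduled with Airflow, each of the make commands from Step 3 could become one task in a DAG, roughly like this (a hypothetical DAG, not part of the repo):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="iris_pred_batch",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    commands = [
        "make create-dataproc-cluster",
        "make submit-job ENV=dev MODULE=data_import TASK=query_train_pred",
        "make submit-job ENV=dev MODULE=feature_engineer TASK=normalize",
        "make submit-job ENV=dev MODULE=model TASK=fit",
        "make submit-job ENV=dev MODULE=predict TASK=batch_predict",
        "make submit-job ENV=dev MODULE=predict TASK=store_predictions",
        "make delete-dataproc-cluster",
    ]
    tasks = [
        BashOperator(task_id=f"step_{i}", bash_command=cmd)
        for i, cmd in enumerate(commands)
    ]
    # Run the steps strictly in sequence: each one waits for the previous to succeed.
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream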

An alternative is Dataproc Workflows. This is a native solution offered by GCP, but I haven’t tried it myself so I will just leave the documentation here.

Local development

Unit testing

Alternatives for batch processing

https://cloud.google.com/solutions/data-lifecycle-cloud-platform#processing_large-scale_data

Summary

  1. How to utilize different Google Cloud Platform components to build a batch job pipeline (whether it involves ML or not).
  2. A product named Google Cloud Dataproc, where you can both submit a lightweight job via single-node mode and easily scale out to a cluster of machines.
  3. One approach to structuring an ML pipeline codebase: link to the repo.
  4. Some convenient components in model development, e.g. Makefile, runtime config, parquet persistence. This mostly helps people with little or no software engineering background.

Again, one of my main goals in writing this article is to receive feedback from the community so that I can do my job better. Please feel free to leave me comments, and I hope you enjoy this tutorial.

References

  1. https://developerzen.com/best-practices-writing-production-grade-pyspark-jobs-cb688ac4d20f
  2. https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d
  3. https://medium.com/@kinghuang/docker-compose-anchors-aliases-extensions-a1e4105d70bd