Setting up an AWS Environment for Data Science

Author: Micah Melling (micahmelling@gmail.com)

A secure, scalable cloud environment is necessary for modern data science. In this blog, we will leverage Amazon Web Services (AWS) for storage and compute in the cloud. The actual data engineering and machine learning will be performed in Python.

Before working through this tutorial, please keep the following important points in mind.

The author of this tutorial assumes no responsibility for any financial costs incurred. Likewise, the author assumes no legal responsibility for any errors or bugs in the tutorial text or code.

Create an AWS Account

You can create an AWS account at this link. Create an extra-secure password. I recommend stringing together three non-trivial words at random, throwing in at least 2-3 special characters, and changing at least 1-2 characters to uppercase. The password should be at least fourteen characters, which shouldn't be an issue using the previous criteria. Likewise, craft a unique password, one that you haven't used previously.

Since this tutorial uses capabilities that are not free-tier eligible, you will need to add a credit card. To underscore, this tutorial will incur charges.

The account you've created is your root account, which is extremely powerful: it has full access and control. We will almost never use the root account. Rather, we will create more specific accounts later in this tutorial. However, we will use the root account to set up billing alarms in the next section.

Set Up Billing Alarms

Since we will be incurring charges in this tutorial, we want to be alerted when our charges reach certain levels. This will keep us abreast of whether we are running unexpectedly hot or cold on cloud spending. If spending seems high, it might indicate we have misconfigured services. Conversely, if spending seems low, we perhaps over-budgeted and could adjust our expectations.

You can follow this tutorial to create billing alarms. Create multiple billing alarms for increasing levels of spend if you'd like.

Establish a Security Stance

Cloud security is important. For obvious reasons, we don't want our environment to be hacked or exposed. We can adopt the following security protocols to give us a strong security posture.

Create an Admin Account

As discussed previously, we need to set up an administrator account. We can do so by following this tutorial. To note, you do not want your admin account to have programmatic access. The account is quite powerful, and we don't want to risk its API keys being compromised, which could cause a lot of damage in a short period.

Develop User Groups

In the last section, you learned how to create a user group by following the prescribed tutorial. Using your administrator account, create the following IAM groups:

Do not assign any permissions to these groups just yet; they will be shells for now.

Create Data Science and Pulumi Users

Two sections ago, you learned how to create AWS users. We shall now create a few more, which generally correspond to the groups we generated in the last section:

For the nameless_data_scientist user, assign the following permission: arn:aws:iam::aws:policy/AWSCodeCommitFullAccess. This is a data science user, and we want them to have access to git repos. In later sections, we will give this user access to specific Secrets Manager secrets. No permissions will be assigned to other users at this time. Pretty obviously, the four Pulumi users will execute different types of infrastructure tasks.

Set up a Gmail Account for yagmail

One of our scripts will use yagmail to send an email. You can get going with yagmail by following this tutorial.
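For reference, here is a minimal sketch of sending an email with yagmail; the Gmail address and app password are placeholders for your own, and in later sections these values will come from the yagmail-credentials secret.

import yagmail

# placeholder credentials; in this tutorial, they ultimately live in Secrets Manager
yag = yagmail.SMTP("your_account@gmail.com", "your_app_password")
yag.send(
    to="recipient@example.com",
    subject="World Series Projections",
    contents="Attached are the latest predictions.",
)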

Get our Sample Application Code

We have two applications we will use in this tutorial, each supporting a different delivery mechanism. We can simply clone them both from GitHub.

$ git clone git@github.com:micahmelling/world-series-projections-ecs.git

$ git clone git@github.com:micahmelling/world-series-projections-batch.git

Our project involves training machine learning models to predict the winner of the World Series given data from only previous years. For instance, when predicting the probabilities for the 2020 World Series, we only use statistics from before 2020. A little bit of leakage occurs because our data source does not have Opening Day rosters, but this is a weakness we will accept.

The above repos have almost everything you need. The world-series-projections-ecs project expects a MySQL table to manage users' log-in information. We'll create that in a later section. Likewise, you will need to update the requirements.txt files with your private package information, which we will cover in a later section. Further, you'll need to update the infrastructure scripts with your appropriate AWS resources. I left AWS resource names in the Python scripts because those are fairly benign (and now deleted). However, infrastructure scripts can house more sensitive information (such as account numbers).

The world-series-projections-ecs app is meant to run on AWS Elastic Container Service (ECS). It's a simple REST service created in app.py (it's a Flask application). The user can simply POST a simple JSON object via Postman or curl, such as {"year": 2015}, to get World Series predictions for 2015. You can also log into a web UI and interact with the model that way. You can build and test the app locally using Docker.

$ docker build -t ws-ecs .

$ docker run -p 8000:8000 --env AWS_SECRET_ACCESS_KEY=YOUR_KEY --env AWS_ACCESS_KEY_ID=YOUR_KEY --rm ws-ecs

From a second terminal, send a post request.

$ curl -k --data '{"year": 2015}' --request POST --header "Content-Type: application/json" https://localhost:8000/predict

I also included cert files in the repo so that we can use HTTPS through all layers of our application. These are simply self-signed certs, and this is only a sample application that has already been axed, so I don't care if they are in the open. In production, a user will make a POST request to a domain name, which will then forward the traffic to a load balancer. This traffic will be covered by a formal SSL certificate. The load balancer will then communicate with copies of our application, which is where the self-signed certificate will kick in. This layer of traffic will only be going over AWS's internal network, so HTTPS is less important. That said, it's pretty easy just to use HTTPS at this layer as well for added protection against internal bad actors. Anyway, to generate your own self-signed certs, you can issue the following command:

$ openssl req -x509 -newkey rsa:4096 -nodes -out cert.pem -keyout key.pem -days 10000

The world-series-projections-batch repo is for a job that will run on a cron schedule. At a scheduled time, it will select a random year, generate the relevant predictions, and then email them. This is a bit of a contrived example, but it's still illustrative.

Get our Private Package Code

We can also clone the repository we need for our private Python library, which we will use to standardize how our programmatic accounts connect to AWS services. It also includes some utilities for interacting with databases.

$ git clone git@github.com:micahmelling/data_science_helpers.git

The code can easily be expanded as desired. For instance, you could create some nested logic for connecting to different types of databases.
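To give a flavor of the library, here is a hedged sketch of the kind of helpers it might contain; the function names, secret key names, and region are assumptions rather than the repo's actual contents.

import json

import boto3
import pymysql


def get_secret(secret_name, region_name="us-east-1"):
    """Retrieve a secret from AWS Secrets Manager and return it as a dict."""
    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def connect_to_mysql(secret_name):
    """Open a MySQL connection using credentials stored in Secrets Manager."""
    secret = get_secret(secret_name)
    return pymysql.connect(
        host=secret["host"],
        user=secret["username"],
        password=secret["password"],
        database=secret.get("database", ""),
    )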

Create a Secret in AWS Secrets Manager

At this stage, let's create a secret called yagmail-credentials with the following keys and appropriate values: username, password.

These are general credentials that will be valid across projects; they are not tied to any particular effort.
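You can create the secret in the console or with a quick boto3 call. Here is a sketch, assuming appropriate credentials are exported in your terminal and us-east-1 is your region; the username and password values are placeholders.

import json

import boto3

client = boto3.client("secretsmanager", region_name="us-east-1")
client.create_secret(
    Name="yagmail-credentials",
    SecretString=json.dumps({
        "username": "your_account@gmail.com",  # placeholder values
        "password": "your_app_password",
    }),
)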

Set Up Secret Rotation in Secrets Manager

We want to rotate all of our programmatic access keys. Recall that our password rotation for console access is managed by AWS, per our account settings. Fortunately, we can use automation to rotate our access keys in Secrets Manager. In fact, we could do this for many types of secrets. At any given time, a data scientist should ideally have only one set of AWS keys on their machine, tied to a specific project (and, preferably, exported only in a temporary terminal session). We could have the following script run early every morning to rotate all the project-specific access keys. Having data scientists update their keys daily isn't too onerous.

We could also take a different approach where we notify a user that keys have been changed.

The following script can accomplish our aim.
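Here is a minimal sketch of such a rotation script; the user-to-secret mapping is hypothetical, and a production version would add error handling, logging, and the notification logic discussed above.

import json

import boto3

iam = boto3.client("iam")
secrets = boto3.client("secretsmanager")

# hypothetical mapping of project users to the secrets holding their keys
USERS_TO_SECRETS = {
    "nameless_data_scientist": "nameless-data-scientist-aws-keys",
}


def rotate_access_keys():
    for user_name, secret_name in USERS_TO_SECRETS.items():
        # assumes each user has a single active key (IAM allows at most two per user)
        old_keys = iam.list_access_keys(UserName=user_name)["AccessKeyMetadata"]
        new_key = iam.create_access_key(UserName=user_name)["AccessKey"]
        secrets.put_secret_value(
            SecretId=secret_name,
            SecretString=json.dumps({
                "AWS_ACCESS_KEY_ID": new_key["AccessKeyId"],
                "AWS_SECRET_ACCESS_KEY": new_key["SecretAccessKey"],
            }),
        )
        # remove the previous key(s) once the new one is safely stored
        for key in old_keys:
            iam.delete_access_key(UserName=user_name, AccessKeyId=key["AccessKeyId"])


if __name__ == "__main__":
    rotate_access_keys()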

The previous scripts require some setup in MySQL, which can be done with the following commands.

Set Up CodeCommit Repo Access

We will use AWS CodeCommit as our remote git repository. When running a CI/CD pipeline in AWS, using CodeCommit is a little easier compared to an external service like GitHub. Likewise, using CodeCommit comes with many of the security benefits we have been building up throughout this tutorial.

You'll need to configure your machine to push and pull code from CodeCommit. Use SSH authentication, as it's more secure than HTTPS authentication. The latter can be breached if credentials are stolen; the former can only be breached if your private key is stolen.

Register Domain and Get SSL Certificate

Using your admin account, register a domain in Route 53. You'll also want to get a corresponding SSL certificate. Whenever you create an API endpoint, it will be a subdomain of your registered domain, protected by the SSL certificate.

Create and Attach IAM Policies

To get going with our infrastructure as code, we will need to give our Pulumi accounts permissions. Additionally, there are a couple of policies we want to create with our admin account via the console before we start using Pulumi.

First, let's create those custom IAM policies we will need.

Now, let's attach the following policies to each of our Pulumi groups and then assign the corresponding users to those groups.

I've found creating custom policies for Pulumi users a bit challenging. In several cases, the desired resource will be created but an error will still be thrown. For example, the following policy will create an S3 bucket, but Pulumi will still raise an error on the command line. This creates some confusion, especially when it comes to debugging. Overall, Pulumi is an outstanding yet imperfect tool.

Set up Pulumi to Manage Infrastructure as Code

We now need to configure our machine to use Pulumi. You can follow this tutorial to get the ball rolling. It's quite simple.

I also recommend setting up four CodeCommit repos for your Pulumi work, one for each account, which corresponds to a segment of work.

For each subtopic, such as creating a VPC, make a subdirectory in your git repo directory. Each subdirectory then becomes its own Pulumi project that we can use to create and delete resources. In each subdirectory, we can issue the following command to initiate a Pulumi project.

$ pulumi new aws-python

We could switch this design should we desire. For instance, in our project git repo, we might have a subdirectory called infrastructure, which would house our infrastructure scripts. We have to determine if we want to keep everything relevant to our project together or keep all of our infrastructure code together. This tutorial opts for the latter as the infrastructure code is generally not too relevant for the daily users of project git repos.

Further, we will need to set the appropriate AWS key environment variables when working with Pulumi. I will store these as Secrets Manager secrets in an appropriate account. When you are ready to work with a certain AWS Pulumi user, retrieve the relevant keys, export them as environment variables in your terminal, and then kill your terminal session when you are finished.

$ export AWS_ACCESS_KEY_ID=<AWS_ACCESS_KEY_ID>

$ export AWS_SECRET_ACCESS_KEY=<AWS_SECRET_ACCESS_KEY>

When you're ready to deploy your infrastructure, you simply need to issue the following from the command line:

$ pulumi up

If you want to destroy infrastructure, issue the following:

$ pulumi destroy

Virtual Private Cloud (VPC)

In the following sections, we will use our pulumi_admin account. We will create three VPCs: production, staging, and development. Production is self-explanatory. Staging is a testing environment that should be well-maintained and mimic production as closely as possible. Development is for experimentation. We can use the following function to create our three VPCs.
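Here is a pared-down sketch of such a function; the CIDR blocks are assumptions, and a production version would also create subnets, route tables, and NAT gateways.

import pulumi_aws as aws


def create_vpc(environment, cidr_block):
    """Create a tagged VPC with DNS support for the given environment."""
    vpc = aws.ec2.Vpc(
        f"{environment}-vpc",
        cidr_block=cidr_block,
        enable_dns_support=True,
        enable_dns_hostnames=True,
        tags={"Name": f"{environment}-vpc", "environment": environment},
    )
    # internet gateway so public subnets can reach the internet
    aws.ec2.InternetGateway(f"{environment}-internet-gateway", vpc_id=vpc.id)
    return vpc


create_vpc("prod", "10.0.0.0/16")
create_vpc("stage", "10.1.0.0/16")
create_vpc("dev", "10.2.0.0/16")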

RDS

RDS is AWS's service for managing relational databases, like MySQL. We will create three databases: prod, stage, and dev. These match our VPC environments above. Only prod applications will be able to access the prod database, per security group configurations. Right now, we are employing a shared database model, where we create databases that can be shared across projects. However, we could easily take this script and just create a database for every individual application (and fold it into the project initiation script below), which often happens in a microservices environment.
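Below is a hedged sketch of creating one such MySQL instance with Pulumi; the instance sizing, credentials handling, and networking arguments are assumptions you would adapt to your own VPCs.

import pulumi_aws as aws


def create_mysql_instance(environment, subnet_group_name, security_group_id, password):
    """Create a small MySQL instance for the given environment (prod, stage, or dev)."""
    return aws.rds.Instance(
        f"{environment}-mysql",
        engine="mysql",
        engine_version="8.0",
        instance_class="db.t3.micro",
        allocated_storage=20,
        username="admin_user",
        password=password,  # pull from Pulumi config or Secrets Manager in practice
        db_subnet_group_name=subnet_group_name,  # subnets inside the matching VPC
        vpc_security_group_ids=[security_group_id],
        skip_final_snapshot=True,
        tags={"environment": environment},
    )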

For our ECS application, we need a MySQL user and table to manage user log-ins. Now that we have our databases, we can create the resources we need in both staging and prod.

To create a hashed password, you can issue the following commands.

$ python3
>>> import hashlib
>>> password = "password"
>>> hashed_password = hashlib.sha256(password.encode()).hexdigest()

General S3 Buckets

We also need to create some general S3 buckets that will be used across projects. Specifically, we will need to create buckets to house CI/CD artifacts for CodeBuild and CodePipeline. We can leverage the following function.
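A minimal sketch of such a function follows; the bucket names are placeholders, and remember that S3 bucket names must be globally unique.

import pulumi_aws as aws


def create_artifact_bucket(bucket_name):
    """Create a private, versioned S3 bucket for CI/CD artifacts."""
    return aws.s3.Bucket(
        bucket_name,
        bucket=bucket_name,
        acl="private",
        versioning=aws.s3.BucketVersioningArgs(enabled=True),
    )


create_artifact_bucket("my-codebuild-artifacts-bucket")
create_artifact_bucket("my-codepipeline-artifacts-bucket")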

General Execution Roles

Likewise, we have some general execution roles we need to create.
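As an illustration, here is a hedged sketch of one such role: an ECS task execution role that lets tasks pull images from ECR and write logs. Other roles (for CodeBuild or Batch, say) follow the same pattern of a trust policy plus attached permissions.

import json

import pulumi_aws as aws

ecs_task_execution_role = aws.iam.Role(
    "ecs-task-execution-role",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)

aws.iam.RolePolicyAttachment(
    "ecs-task-execution-role-policy",
    role=ecs_task_execution_role.name,
    policy_arn="arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy",
)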

WAF IP Set

Another generic item we want to create is called a WAF IPSet. A WAF is a web application firewall; it sits on top of a load balancer and filters traffic based on criteria such as IP address. The IP set is a collection of IP addresses we want to allow to access something like a load balancer. This is an item we are likely to reuse, so we create it under the admin umbrella. To note, we use the wafregional module rather than the waf or wafv2 modules in Pulumi. Long story short, I've had issues getting the latter two to work properly. When using wafregional, you'll have to access those items in the console using the "WAF Classic" view in AWS Firewall Manager. The functionality with wafregional does what we want via Pulumi, but we don't get as many automated charts in the console when using the "classic" view. That said, I have also found wafregional to be a bit finicky. It works, unlike some of the other WAF modules. However, sometimes you just have to run "pulumi up" multiple times for it to work. I don't find the behavior to be predictable.
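A hedged sketch of creating the IP set with the wafregional module is below; the IP addresses are placeholders for your office, VPN, or NAT gateway EIPs.

import pulumi_aws as aws

# placeholder CIDR blocks for the addresses allowed to reach the load balancer
allowed_cidr_blocks = ["203.0.113.10/32", "203.0.113.20/32"]

allowed_ip_set = aws.wafregional.IpSet(
    "allowed-ip-set",
    ip_set_descriptors=[
        aws.wafregional.IpSetIpSetDescriptorArgs(type="IPV4", value=cidr)
        for cidr in allowed_cidr_blocks
    ],
)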

Security Groups

We'll also create a generic security group that only allows egress traffic.
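A minimal sketch, assuming you pass in the ID of a VPC created earlier:

import pulumi_aws as aws


def create_egress_only_security_group(environment, vpc_id):
    """Security group with no ingress rules and unrestricted egress."""
    return aws.ec2.SecurityGroup(
        f"{environment}-egress-only-security-group",
        vpc_id=vpc_id,
        egress=[aws.ec2.SecurityGroupEgressArgs(
            from_port=0,
            to_port=0,
            protocol="-1",  # all protocols
            cidr_blocks=["0.0.0.0/0"],
        )],
        tags={"environment": environment},
    )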

Batch Environment (Compute Environment and Job Queue)

The final "general" items we need to create are a Batch compute environment and job queue, which will give us the infrastructure we need to run jobs on a cron schedule. We will switch to the pulumi_batch_job account to perform these actions. A compute environment is the actual compute resources. A job queue is a logical separation. For example, we might have a small queue for quick ETL jobs and a large queue for re-training and running large models, each mapped to an appropriately-sized compute cluster.

Project Initiation Script for Development Items (S3 Buckets, CodeCommit Repo, Container Registry, Secrets, Accounts, Permissions)

For every data science project we create, we will need a series of AWS resources. We can use a script to create many of these in one fell swoop. Unfortunately, we aren't able to 100% automate this process, but we can get close. First, we don't create any project-specific users with console-only access, though this might be useful. The reason is that we cannot do the necessary MFA coordination via a script. That said, we can do most anything from the command line or via Python scripts that we could do from the console. A console account may not be necessary, but it might be nice to follow the execution of a CodePipeline via the console. We do, however, create a project-specific account with programmatic access. We then give specific users access to a Secrets Manager secret that includes the necessary API keys to use the programmatic account. However, Pulumi has a limitation that plain-old boto3 doesn't: we can't get the actual strings of the key and secret to automatically upload them. We can export those items, retrieve them via the command line, and then manually update the secret. Using the exported names in the script below, we can accomplish that with the following commands:

$ pulumi stack output secret_key --show-secrets

$ pulumi stack output access_key

This is a bit of a pain. Again, Pulumi is a great tool, but it's not perfect.
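For reference, here is a hedged sketch of the user-and-key portion of such a project-initiation script, with the exports named to match the commands above; the resource names are placeholders.

import pulumi
import pulumi_aws as aws

# project-specific programmatic account
project_user = aws.iam.User("world-series-projections-user")
project_access_key = aws.iam.AccessKey(
    "world-series-projections-access-key",
    user=project_user.name,
)

# exported so they can be retrieved via `pulumi stack output` and pasted into
# the project's Secrets Manager secret
pulumi.export("access_key", project_access_key.id)
pulumi.export("secret_key", project_access_key.secret)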

Initial ECR Images

We will seed each ECR repo with an initial image. These images will be updated via our CodePipeline in later sections. I generally like to do this so that I can test applications before setting up a CodePipeline. Likewise, knowing how to interact with ECR is generally beneficial. For the below process to work, you will need the Docker daemon running on your machine.

Build Docker image locally.

$ docker build -t world-series-ecs-app .

Create connection with ECR.

$ aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com

Push image to ECR.

$ docker tag <image id> <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<ecr repo name>

$ docker push <aws_account_id>.dkr.ecr.<region>.amazonaws.com/<ecr repo name>

ECS Applications

Alright, now it's time to actually deploy our REST services. The below script will create full-fledged endpoints for our application.

You'll need to deploy the private Python package first so that future processes can install it. You can then build the ECS app from world-series-projections-ecs. To note, when installing a custom, private package, you need to add some flags so that pip knows where to look (it defaults to pypi.org). For example, I deployed my package at the domain data-science-helpers.allstardatascience.com, so I need the following statement to install the package.

$ pip3 install --trusted-host data-science-helpers.allstardatascience.com --find-links https://data-science-helpers.allstardatascience.com/ micah_melling_ds_package==0.1.0

In a requirements.txt file, you can add those two flags atop the file and then list the package name and pinned version like any other package. One tip: give your package a unique name, one that is not already taken on PyPI.
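For example, the top of a requirements.txt might look like the following, where the domain and package pin mirror the command above.

--trusted-host data-science-helpers.allstardatascience.com
--find-links https://data-science-helpers.allstardatascience.com/
micah_melling_ds_package==0.1.0
# ...remaining dependencies listed as usual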

How exactly is our package "private"? We enforce this through IP rules. Only IP addresses listed in the argument ingress_cidr_blocks will be able to install the package. We will, obviously, want to include the IPs of the data scientists who will use the package. Likewise, we need to include the IP address(es) that will allow our CI/CD pipelines to install the package for our Docker images. In subsequent sections, we will see that we run our CI/CD pipeline, specifically the "build" step, in a private subnet. This allows us to run the process through a NAT Gateway with an Elastic IP (EIP) attached. The EIP is known, so we can easily add it to ingress_cidr_blocks. To note, we may have to create a new WAF IPSet to accommodate the EIP.

OK, back to the ECS application. Our code is all-inclusive: it includes multiple layers of security and multiple layers of logging. We might opt to break the script into smaller chunks. However, we need all of these components for a full-fledged app with all the desired bells and whistles, so, in many ways, it makes sense to create everything in one script. Likewise, as you might have noticed, I only have one mega function in each script. We could write smaller, more atomic functions. However, each line is basically synonymous with creating a new resource, so the structure is inherently "modular" and readable: one line performs one action and tells you what it is.

For each project, we want to use the following script to create two versions of our app: a staging one and a production one. This delineation will be more clear in our CI/CD pipeline. One more reminder: the WAF functionality can be funky. I oftentimes have to run "pulumi up" a few times to get the resources to create correctly. Alternatively, you could create the WAF items separately.

Batch Job Definition

If we want to run our process as a batch process that operates on a cron schedule, we need to create a Batch job definition. We can then kick off a job based on the job definition, which we will do in a later section.
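Here is a hedged sketch of a Fargate job definition; the image URI, role ARNs, and resource sizes are placeholders, and the command shown is the retraining override discussed next.

import json

import pulumi_aws as aws

retrain_job_definition = aws.batch.JobDefinition(
    "world-series-retrain-job-definition",
    type="container",
    platform_capabilities=["FARGATE"],
    container_properties=json.dumps({
        "image": "<aws_account_id>.dkr.ecr.<region>.amazonaws.com/<ecr repo name>:latest",
        "command": ["python3", "retrain.py"],  # overrides the image's default CMD
        "jobRoleArn": "<job role arn>",
        "executionRoleArn": "<execution role arn>",
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"},
        ],
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
    }),
)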

The command argument depends on your Dockerfile. In world-series-projections-batch, our ENTRYPOINT is "python3", so our command is simply the name of our script. In world-series-projections-ecs, we use CMD rather than ENTRYPOINT. Our default command is ./run.sh, which fires up a gunicorn webserver that can communicate with our Flask app. However, we can override the default CMD in our job definition. We do this so we can run a job to retrain our model on a cron schedule. In this case, our job definition command becomes something like "python3 retrain.py", which overrides the default command in our Docker container. We perform the model serving and model training in the same repository to adhere to the Twelve Factor App, which states "..admin processes should be run in an identical environment as the regular long-running processes of the app. They run against a release, using the same codebase and config as any process run against that release. Admin code must ship with application code to avoid synchronization issues." This, in my mind, applies to model training. If you aren't familiar with the Twelve Factor App, I recommend reading through the entire site. It's quite, quite good.

CodePipeline for ECS App

We will clearly want to be able to update our application. We, therefore, need to create a CI/CD pipeline. When we push a code update to our main git branch, the change will go through the pipeline and be pushed into production, as long as the code passes our tests. Likewise, when a file is uploaded to an S3 bucket, the pipeline will also kick off. We do this so that we can retrain our model as a batch process and release the fresh model into production on a schedule.

Let's talk a bit more about the retraining process. In general, you can take the following stances when retraining a machine learning model.

I also want to differentiate between retraining a model and releasing a wholly new model. Retraining a model simply involves taking the original model parameters and fitting them with new data. Let's say the original model is a Random Forest with max_depth of 12 and max_features set to "sqrt". None of that changes. Only the data going into these parameters changes. Releasing a whole new model might involve releasing a Random Forest with a different max_depth or switching to an XGBoost model. I would submit such a process should generally not be done automatically. We usually want to get to know the behavior of our models - their strengths, weaknesses, and idiosyncrasies. As an example, we might get a noticeably different probability calibration between an ExtraTrees and a LightGBM. We want to be prepared for such a change. In our design, unless changing the model type requires new preprocessing code, we can train an entirely new model on our data science workstation and manually upload it to the desired S3 location, which will release the new model into production. Subsequently, retrain.py should pick up the new model parameters for its runs. However, if our new model requires a preprocessing change, we might need to push the model change to S3, hit "reject" in the manual approval step, and then push the code change to CodeCommit. Some level of coordination cannot be avoided.

As mentioned, the design we have implemented is pretty straightforward: our model will retrain on whatever schedule we set. We could retrain multiple times a day if we wanted. As you might notice in the below pipeline, while we automatically deploy to staging, we require a manual approval for a production deployment. Therefore, our fresh models will not be automatically released into production. We could arrange such a situation if we wanted. For one, we could remove the manual approval step, but keep in mind this would also apply to any code changes. This is palatable if we have a robust and well-verified test suite. A second option is to have two separate release pipelines, one for code and another for models. In each pipeline, we would still pull from both S3 and CodeCommit to ensure each release captures the most recent code and model. If we desired, we could have a manual approval step for the code change pipeline but not for the model updates pipeline. We're simply running the same pipeline with a slight change in execution logic based on what kicked off the pipeline. In my opinion, the single pipeline with a deep test suite is preferred due to its simplicity. We can automatically test the exact same items we could manually. Likewise, by deploying to staging first, we confirm our updates actually can be successfully deployed onto ECS; this provides another confidence boost.

As a last note, viewing the execution of a CodePipeline is one of the main motivations for having project-specific console accounts or granting data science users view access to such resources. Data scientists may not create the CI/CD pipeline, but they will need to know if and when the runs succeed.

CodePipeline for Batch Job

Our CI/CD pipeline for a batch job is a bit different. This pipeline simply needs to update our Docker image upon a code change.

Test our REST API via Postman

We can use Postman to send a sample POST request to our service to validate it works. Our payload simply needs to have "year" as the key and any year between 1905 and 2020 as an integer for the value.
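If you prefer a script to Postman, a quick check with the requests library looks like this; the domain is a placeholder for your own subdomain.

import requests

response = requests.post(
    "https://world-series.yourdomain.com/predict",
    json={"year": 2015},
    timeout=30,
)
print(response.status_code, response.json())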

Test our Model UI

Our REST API also includes a simple little HTML user interface. We can simply go to the interface endpoint, enter our credentials, and verify we can interact with the application.

Run and Schedule our Batch Jobs

We want to run our batch job on a schedule, for which we can use CloudWatch events. You can follow this tutorial to learn how to do so.

Conclusion

In conclusion, we can see AWS is incredibly powerful. Putting together a platform for data science work involves many components. Thankfully, using Python and infrastructure as code, we can automate much of it.

Addendum: How I Deployed this Site

Finally, I used Pulumi to create this website! The script is below.
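A hedged sketch of a simple S3-hosted static site wired to a Route 53 record captures the idea; the bucket name, domain, and hosted zone are placeholders, and the real deployment may differ.

import pulumi_aws as aws

# bucket configured for static website hosting; the name must match the domain
site_bucket = aws.s3.Bucket(
    "blog-site-bucket",
    bucket="blog.yourdomain.com",
    acl="public-read",
    website=aws.s3.BucketWebsiteArgs(index_document="index.html"),
)

hosted_zone = aws.route53.get_zone(name="yourdomain.com")

aws.route53.Record(
    "blog-site-record",
    zone_id=hosted_zone.zone_id,
    name="blog.yourdomain.com",
    type="A",
    aliases=[aws.route53.RecordAliasArgs(
        name=site_bucket.website_domain,
        zone_id=site_bucket.hosted_zone_id,
        evaluate_target_health=False,
    )],
)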