11/05/17

Motivation

For a couple of years, I researched best-practices for model deployment at the time there wasn’t much available as such I’d like to share my learnings from this journey: A data scientist operating with minimal support, no data/ml engineers or devops team.

As you could imagine, there’s a decent amount of interaction with engineers during this phase so I wrote these notes to support arguments for my methodology from the perspective of cost-benefit trade-offs.

Distinctions in Architectural Paradigms:

Soa (service-oriented-architecture) over monolithic

PROs:

isolation: Prevents dependency and versioning issues. This is a big problem since ML packages evolve quickly and would cause problems in a monolithic structure e.g. pandas could make a some major changes to dataframes from one version to another e.g. changes/updates in pandas used in “model_for_fraud_detection” will have an impact on “model_monitoring_churn.”
reduces engineer time/involvement, scenarios which involve the model being updated (assumes preprocessing is done at the api level):
- tuning) DS
- retraining ) DS
- feature engineering NOT involving addition data inputs) DS
- feature engineering involving addition data inputs ) DEV
- data deprecation) DEV
- sudden failure in performance) Devops -> DS, may require a DEV
root-cause analysis: Easier to figure who owns the service and which engineers to contact should something break.
scalability: Could have one service interface with many clients though you could also do this if you invoke the function locally.
error handling: Should be delivered in the client response and should be done at the api level so that it’s scalable across requests agnostic to client.

SOA vs micro-services

This is the subtle distinction of wrapping the entire data-product within a service i.e. preprocessing, prediction, and response (SOA) or decompose them to smaller services which call each other (micro-services).

Much like best-practices used in software development, good design is flexible to change and reusable. Analogously the motivation for micro-services e.g. two separate models which use the same feature matrix. they would each call the same preprocessing service.

Though we went with SOA, I think that’s an initial step to micro-services.

Hosting the service on Lambda

Lambda vs EC2+ECS

Lambda)

PROS:

devops light and don’t have to worry about performance monitoring
scales automatically
server is self-managed

CONS:

takes a zip file restrained by specific AMI environment provisions.
limitations: 50mb, 5min timeout, cold start delay (minor nuisance), 2.7 standard libraries, 3.6 incomplete
- size can be combated by decomposition of functions. Lambda is great-way to build a true micro-services architecture.
can’t ssh into server, AWS Lambda does not provide any control of infrastructure & environment used to run the application code

ECS)

PROS:

takes a docker image so it allows for more flexibility: language, size, etc.
don’t have to downgrade code to python 2.7

CONS:

better optionality for configuring infrastructure. Since this an area that’s well outside my profession, it only serves as a distraction.
involves managing a server, scaling, etc.
this would also use flask frame-work which - for me - has been unreliable. Alternatively, could use AWS’s Chalice.

Given you’ve already paid the learning pains, lambda makes much more sense for a data scientist who’s looking for a hands-off approach. However, should things not workout due to the limitations e.g. can’t compress the the libraries to under 50mbs than ECS makes more sense. More arguments on ECS vs Lambda

Building the deployment zip

alt text

First, setup an environment so that you can easily REPL.

start in the directory of the model
Install docker
docker pull down the same AMI lambda’s using
```
docker pull amazonlinux:2016.09
```

run the shell script build_and_trim_weight.sh (takes time)

docker run -v $(pwd):/outputs -it amazonlinux:2016.09 \
/bin/bash /outputs/build_and_trim_weight.sh

it’ll output all the dependencies in a zip called : venv.zip
unzip this and all scripts
trim weight down to 50mb and test full deployment package with lambdaci docker image posted above.
```
$ docker run -v "$PWD":/var/task lambci/lambda:python2.7 test.handler
```

build_and_trim_weight.sh

	#!/bin/bash
	set -ex

	yum update -y

	yum install -y \
	    atlas-devel \
	    atlas-sse3-devel \
	    blas-devel \
	    gcc \
	    gcc-c++ \
	    lapack-devel \
	    python27-devel \
	    python27-virtualenv \
	    findutils \
	    sqlite-devel \
	    zip

	do_pip () {
	    pip install --upgrade pip wheel
	    pip install --use-wheel --no-binary numpy numpy
	    pip install --use-wheel --no-binary scipy scipy
	    pip install --use-wheel sklearn
	    pip install --use-wheel joblib && \
	    pip install --use-wheel pandas && \
	    pip install --use-wheel py-stringmatching && \
	    pip install --use-wheel pytz && \
	    pip install --use-wheel unidecode && \
	    pip install --use-wheel nltk && \
	    python -m nltk.downloader stopwords

	}

	strip_virtualenv () {
	    echo "venv original size $(du -sh $VIRTUAL_ENV | cut -f1)"
	    find $VIRTUAL_ENV/lib64/python2.7/site-packages/ -name "*.so" | xargs strip
	    echo "venv stripped size $(du -sh $VIRTUAL_ENV | cut -f1)"

	    pushd $VIRTUAL_ENV/lib64/python2.7/site-packages/ && zip -r -9 -q /outputs/venv.zip * ; popd
	    echo "site-packages compressed size $(du -sh /outputs/venv.zip | cut -f1)"

	    pushd $VIRTUAL_ENV && zip -r -q /outputs/full-venv.zip * ; popd
	    echo "venv compressed size $(du -sh /outputs/full-venv.zip | cut -f1)"
	}

	shared_libs () {
	    libdir="$VIRTUAL_ENV/lib64/python2.7/site-packages/lib/"
	    mkdir -p $VIRTUAL_ENV/lib64/python2.7/site-packages/lib || true
	    cp /usr/lib64/atlas/* $libdir
	    cp /usr/lib64/libquadmath.so.0 $libdir
	    cp /usr/lib64/libgfortran.so.3 $libdir
	}

	main () {
	    /usr/bin/virtualenv \
	        --python /usr/bin/python /sklearn_build \
	        --always-copy \
	        --no-site-packages
	    source /sklearn_build/bin/activate

	    do_pip

	    shared_libs

	    strip_virtualenv
	}
	main

The provision of Lambda is a 50mb limitation of the zip is the biggest hurdle. As mentioned earlier one could compress the libraries e.g. sklearn into binaries with the script above to aid in this.

File size trimming techniques

first just try moving in all the directories e.g. pandas and zipping if it’s smaller than 50mbs.
if that doesn’t work, try the script above to compress the libraries into binaries.
then remove everything with an ‘.a’ extension in the lib folder.
in venv remove docs e.g. pandas-0.20.2.dist-info and pip and anything you think you don’t need.
remove all .pyc files.
```
find . -type f -name '*.pyc' -delete
```
if all else fails, write it as multiple separate services that call eachother. If that’s not an option turn to EC2+ECS.

Main goal is to do away with stuff you don’t need within a library. As simple as this sounds it’s pretty difficult given time-contraints eg. NLTK’s a very interesting package as it requires more than the simple pypi package, there’s layers of directories of corpora. To avoid installing everything I shaved it down to a bear minimum of stopwords-english and wordnet while maintaining the original directory structure (found in “nltk_data”).

However, note the change in call path—

import nltk
nltk.data.path.append("/var/task/nltk_data")
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

Zipping

Zip up the entire contents of the directory. It’s very important to note this distinction: Zipping up the contents of a folder vs zipping up the folder.

zip.sh

#!/bin/bash

echo 'Creating deployment zip...'

git archive -o deploy.zip HEAD

echo 'Done.'

Modify lambda handler with–

lambda_handler.py

import ctypes
import json
import os

from flattened_data_points import FlattenedDataPoints
from predict import predict_if_budget_matches_itemization


#this portion of the code decompresses the libraries
for d, _, files in os.walk('lib'):
    for f in files:
        if f.endswith('.a'):
            continue

        ctypes.cdll.LoadLibrary(os.path.join(d, f))
        print("success", str(d), str(f))

def handler(event, context):
    print('Start matching expenses to budgets')
    print('Received the following input:\n{}'.format(json.dumps(event)))

    flattened_data_points = FlattenedDataPoints(event).all()
    scores = score_all_data_points(flattened_data_points)

    print('Scores:\n{}'.format(json.dumps(scores)))
    print('Finished')
    return scores

Making the end-point on AWS

1) steps to create the lambda function:

login to console and create the lambda function: https://us-west-2.console.aws.amazon.com/lambda/home?region=us-west-2#/functions?display=list
select runtime environment as python 2.7 and select blank function
add api gateway
deployment stage: test (end point we already made)
security: open
name the function
upload your code
existing role
Making an endpoint (api gateway)
- Services->api gateway
- Actions->create resource
  - Create resource
  - Create method (this is the endpoint)
write basic_lambda_function.py

import json

print('Loading function')


def lambda_handler(event, context):
    print("Received context: " + str(context))
    print("Received event: " + json.dumps(event, indent=2))
    
    return json.dumps(event, indent=2)

name: lambda_function.lambda_handler
create a trigger with an api gateway

2) create the api gateway/post endpoint to the lambda function

*it’s key to understand there’s a separate endpoint that needs to be created to expose the function publicly as opposed to within the aws infrastructure

CREATE METHOD:

create a GET and POST separately
make sure the region you select for the api gateway is the same as the lambda function
uncheck “lambda proxy configuration”
integration type: “lambda function”

CREATE stage

Exposing the API externally and invoking lambda function: Click “Deploy API”! *Crucial

if you’re getting this error there’s likely a problem with the endpoint configuration OR you didn’t Deploy (expose) the API.

{"message": "Internal server error"}'

create Resource
create POST Method with the same instructions as above.

After it’s created, test it internally by copy-pasting the data into request body.

Handler used to troubleshoot:

def lambda_handler(event, context):
    raise Exception('the sky is falling!')

*Latency issues:

Throttle up the RAM allocated for the servers. For whatever reason, it comes default only provisions for 128mbs, which is going to have a significant hit on latency. It goes up to 3gbs.

3) Versioning and fault tolerance

versioning, how to do it through cli though there’s an even easier approach through the console. Alternatively, could call the deployment package from an s3 bucket and change the path when there’s a new version.
fault tolerance, put up mirrors in various regions in case one goes down
security, allows for auth with new credentials.

end-point resources:

FAQs

Should you dockerize the function and send away to productionize it?

Sure! but keep everything together should anyone need to make minor adjustments and rebuild, we used a single git repo solely for this. In our case, both an engineer & data scientist own the api.

The repo

example repo lambda deployment

The production level code should live in a separate repo from all the code that lives in general data science folders (sandbox)

if there’s something an engineer can fix e.g. handling nulls in the preprocessing chain, they could do it themselves and rebuild the image within the repo
this holds even if a data scientist has to fix it as well

Having everything contained in one place is nice for–

rapid adjustments
discoverability
subsequent iterations
general information about the service within documentation: README.md

README.md

name of function and version
business function/question the product was built for
metrics of interest: business & data science
how monitor performance e.g. corresponding datadog dashboard or sql queries to monitor performance
instructions on how to build the docker image

Deployment

To create the deployment package, run the following command while in the root directory of the project:

$ ./build

This creates a new file called deploy.zip in the current directory, which can then be uploaded to Amazon Lambda.

testing locally before sending it up to lambda

Executing Locally

While in the root directory of the project with Docker running, run the command:

$ docker run -v "$PWD":/var/task lambci/lambda:python2.7 main.handler JSON_PAYLOAD
Replace JSON_PAYLOAD with a JSON formatted payload sent as a string. For example:

$ docker run -v "$PWD":/var/task lambci/lambda:python2.7 main.handler '{"budgets": [{"budget_price_budget": "140.0", "travel_vendor_receipts": "Southwest Airlines", "purchase_vendor_receipts": "Southwest Airlines", "budget_id": "309366", "start_datetime_trips": "2017-04-18 04:00:00", "generated_at_budget": "2017-04-08 14:55:39", "budget_type_budget": "flight", "actual_cost_budget": "131.98", "end_datetime_trips": "2017-04-19 04:00:00"}], "all_user_expense_lineitems": [{"expensed_amount_itemization": "24.0", "expense_type_name_itemization": "Taxi", "transaction_date_itemization": "2017-04-25 00:00:00", "itemization_id": "4659496", "expense_category_itemization": "taxi", "vendor_name_expense": "taxi", "expense_type_name_expense": "Taxi", "expense_category_expense": "taxi"}]}'

testing at api level

Alternatively, you can use a tool like Postman to send your JSON payload to the following address:

https://j4e8yp6cdi.execute-api.us-west-2.amazonaws.com/test/budget_expense_matcher/

Request Format

Request body:

{
    "budgets": [
        {
            "budget_price_budget": "140.0",
            "travel_vendor_receipts": "Southwest Airlines",
            "purchase_vendor_receipts": "Southwest Airlines",
            "budget_id": "309366",
            "budget_type_budget": "flight",
            "generated_at_budget": "2017-04-08 14:55:39",
            "start_datetime_trips": "2017-04-18 04:00:00",
            "actual_cost_budget": "131.98",
            "end_datetime_trips": "2017-04-19 04:00:00"
        }
    ],
    "all_user_expense_lineitems": [
        {
            "expensed_amount_itemization": "24.0",
            "expense_type_name_itemization": "Taxi",
            "transaction_date_itemization": "2017-04-25 00:00:00",
            "itemization_id": "4659496",
            "expense_category_itemization": "taxi",
            "vendor_name_expense": "taxi",
            "expense_type_name_expense": "Taxi",
            "expense_category_expense": "taxi"
        }
    ]
}
The response should look like the following:

[
    {
        "score": 0,
        "budget_id": "309366",
        "itemization_id": "4659496"
    }
]

Calling out using boto3

Function to call out to the api.

import boto3
import simplejson
import settings


def get_model_scores():
    client = boto3.client('lambda', region_name=settings.LAMBDA_REGION_NAME)
    payload = get_payload()
    response = client.invoke(
        FunctionName=settings.LAMBDA_FUNCTION_NAME,
        InvocationType='RequestResponse',
        Payload=bytes(simplejson.dumps(payload), 'utf8')
    )

    decoded_response_payload = simplejson.loads(response['Payload'].read().decode())
    
    return decoded_response_payload

Lessons learned from designing the API:

Error handling should be done at the api level not at the client level.
The service should complete it’s job rather than error out and fail on the job leaving the client uninformed.

More on the problem, the api was made very brittle, it would break if there was a null in any of the fields. This was causing a compounding effect in terms of the workflow it was suppose to augment.

Solution:

Using try/catch blocks wrap the errors in the response back to the client.
Break the calls into smaller chunks and log the input outputs to api.

Combating costs

Lambda functions have a call cost associated PER call so we’d send up a larger payload and have all the computation done at the api level.

Lambda deployment resources:

Monitoring in prod:

For storage we created an independent table which stores the scored results from the model in a relational DB. We stored score, threshold, and items matched. Definitely, envision how you’ll investigate FPs and FNs in the future and make sure the specifics are very accessible.

monitoring tools used –

sql
datadog

Motivation: Using datadog vs sql

Do you need real-time monitoring? If it’s no, simple SQL cronjob or not should be fine for periodic check-ups. It’s very simple to implement and a solid choice for an MVP.

We drafted initial tests (precision, recall across time) in SQL then later deployed real-time monitoring in datadog. Motivation being –

We’re already using the infrastructure and what the Devs here were comfortable with (mostly this).
Robust suite of tools for tracking performance:
- time-series for detecting trends e.g. a degradation of precision overtime.
- anomaly detection i.e. measuring the error past a certain threshold in real-time. See the easy configuration below.
Tracking at the event level. So within the front-end instrumentation you could embed a function to fire an event to datadog.
On determining the window size it looks back e.g. something sensible would be EWMA or median with a span of 20 days then have thresholds for anomalous points i.e. points that have a large distance from the EWMA point.
Real-time monitoring with dashboards, SQL could be ran ad-hoc or can be triggered by a cronjob sending an email report. In our case it was cool to see what would happen if we changed a threshold watch the trade-offs between precision/recall live.

Configuring the panel for Precision.

Datadog tools

ewma_3() Exponentially Weighted Moving Average with a span of 3
ewma_5() EWMA with a span of 5
ewma_10() EWMA with a span of 10
ewma_20() EWMA with a span of 20
median_3() Median filter, useful for reducing noise, with a span of 3
median_5() Median with a span of 5
median_7() Median with a span of 7
median_9() Median with a span of

Anomaly detection tools

alt text

trend_line(): shows model decay overtime assuming a negative slope
robust_trend(): * I’d recommend using this function as it’s more robust to outliers
piecewise_constant(): * shows a mean shift i.e. a sudden drastic change.

Logging

I’m a firm believer in logging at the api level but Logentries was far easier to parse than Cloudwatch and the setup was easy. Again, this is us leveraging infrastructure we already had.

Logentries: occurs at the client level
Cloudwatch lambda logs: occurs at the api level

Prototyping

alt text

Before prototyping you need to envision the end goal of your model. Think about how to measure it’s performance in prod:

Performance metrics:

Should be able to detect the model performance e.g. accuracy, precision, recall etc and how you’re going to measure this over time.
Find a business metric that isn’t too far up the funnel. e.g. revenue on the entire business would be too far upstream and it’s movement would be too noisy to show whether or not the performance of the model was the sole contributor.

A better metric would be “median time for a business operation to be completed.” which is the very reason why the data product was being built.

Transforming large data sets

To aid in feature engineering or training on a very large sets of data. I’ve found a post which does an exceptional job of spinning up an EMR cluster. Though it didn’t work for me without making a few tweaks first.

Motivation

Distinctions in Architectural Paradigms:

Soa (service-oriented-architecture) over monolithic

SOA vs micro-services

Hosting the service on Lambda

Lambda vs EC2+ECS

More on the Cons of using Lambda:

Building the deployment zip

File size trimming techniques

Zipping

lambda_handler.py

Making the end-point on AWS

1) steps to create the lambda function:

2) create the api gateway/post endpoint to the lambda function

*Latency issues:

3) Versioning and fault tolerance

end-point resources:

FAQs

Should you dockerize the function and send away to productionize it?

The repo

README.md

Calling out using boto3

Lessons learned from designing the API:

Combating costs

Lambda deployment resources:

Monitoring in prod:

monitoring tools used –

Motivation: Using datadog vs sql

Datadog tools

Anomaly detection tools

Logging

Prototyping

Performance metrics:

Transforming large data sets

Repo w EMR Scripts