Docker and Deployment Options

Before we talk about where to deploy a model, we need to talk about how to package one. The answer, almost universally, is Docker. Most of the server-side deployment options in this section are either wrappers around Docker or services that accept Docker images.

What Docker Actually Does

Isolation (sandboxed)

Each container runs in its own sandbox. Your model's dependencies don't collide with the web server's dependencies. No more "it installed fine but now nothing else works."

Key benefit: dependencies stay contained.

Consistency (reproducible)

What runs on your laptop runs the same way in production. The container image is the same artifact on every machine that runs it, so "it works on my machine" stops being an environment problem. Things outside the image, such as configuration, data, and hardware, can still differ.

Key benefit: dev equals prod.

Portability (run anywhere)

Any system with Docker installed can run the container, provided the image was built for that host's CPU architecture. Regardless of what else is installed on the host, the container brings its own environment: cloud, on-prem, or a colleague's laptop.

Key benefit: host-agnostic execution.

Versionability (immutable)

Container images are versioned and immutable: once built, an image's contents don't change. Rollbacks are as simple as switching which image version is running. Bad deploy at 3am? Point to the previous tag and you're done. This assumes every release gets its own tag and tags are never reused.

Key benefit: rollback in one command.
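To make the rollback idea concrete, here is a minimal sketch in plain Python. It models a deployment as nothing more than a pointer to an image tag, so rolling back means re-pointing at the previous tag. The `Deployment` class, the registry name, and the version numbers are all hypothetical; a real orchestrator tracks this state for you.

```python
from dataclasses import dataclass, field


@dataclass
class Deployment:
    """Hypothetical helper: tracks which image tag a service points at."""

    image: str
    history: list[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        # The running version is the most recently deployed tag.
        return f"{self.image}:{self.history[-1]}"

    def deploy(self, tag: str) -> str:
        # A new release is just a new tag appended to the history.
        self.history.append(tag)
        return self.current

    def rollback(self) -> str:
        # Rolling back drops the newest tag and re-points at the previous one.
        if len(self.history) < 2:
            raise RuntimeError("no previous tag to roll back to")
        self.history.pop()
        return self.current


service = Deployment(image="registry.example.com/churn-model")  # made-up name
service.deploy("1.4.0")
service.deploy("1.5.0")    # the bad 3am deploy
print(service.rollback())  # registry.example.com/churn-model:1.4.0
```

Nothing is rebuilt during the rollback. The old image already exists, which is why the operation is cheap.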

Learn the container model once and the rest follows: most deployment services in the ML ecosystem either wrap Docker or accept Docker images.

Once you have a container image, you have options for where to run it. The right choice depends on your traffic, latency requirements, team size, and budget.

Deployment Options

Option: Container (Docker + Kubernetes)
What it is: Run containers on a managed cluster. You define resources, auto-scaling rules, and health checks. The cluster handles the rest.
Examples: AWS ECS/EKS · GKE · AKS
Key trait: Full control
Best for: High-traffic, production-grade APIs that need auto-scaling and fine-grained control

Option: Serverless (Functions)
What it is: Your model runs only when invoked, so there is no idle cost. Scales automatically. Cold starts can add latency after periods of inactivity.
Examples: AWS Lambda · Azure Functions · GCF
Key trait: Pay per call
Best for: Low-traffic or bursty models where you don't want to pay for idle compute

Option: Managed ML (Platform)
What it is: The platform handles deployment, endpoints, A/B testing, and monitoring. You bring the model; it handles the infrastructure.
Examples: SageMaker · Vertex AI · Azure ML
Key trait: Full MLOps stack
Best for: Teams that want the full MLOps stack without building it themselves

Option: Model-as-a-Service (Hosted API)
What it is: You call an HTTP endpoint. No infrastructure, no containers. You don't run the model at all; someone else does.
Examples: HF Inference Endpoints · Replicate · OpenAI
Key trait: Zero ops
Best for: Prototyping, or when you don't want to run the model at all

Option: Edge (On-Device)
What it is: The model is compiled and bundled with the app. Inference runs on the device: phone, sensor, vehicle. No network round-trip.
Examples: TFLite · ONNX · Core ML
Key trait: No internet needed
Best for: Latency-critical or privacy-sensitive applications with no reliable internet
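The Container option mentions health checks. As a rough illustration of what the cluster is checking, here is a minimal sketch of a prediction service of the kind you would package into an image, using only Python's standard library. The `/healthz` and `/predict` paths, the JSON shape, and the placeholder `predict` function are assumptions made for this example, not a required convention.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(features: list[float]) -> float:
    # Placeholder model: a real service would call a loaded model here.
    return sum(features) / len(features)


class ModelHandler(BaseHTTPRequestHandler):
    def _send_json(self, status: int, body: dict) -> None:
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def do_GET(self) -> None:
        # An orchestrator polls an endpoint like this to decide whether
        # the container is healthy enough to receive traffic.
        if self.path == "/healthz":
            self._send_json(200, {"status": "ok"})
        else:
            self._send_json(404, {"error": "not found"})

    def do_POST(self) -> None:
        if self.path != "/predict":
            self._send_json(404, {"error": "not found"})
            return
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))["features"]
        self._send_json(200, {"prediction": predict(features)})

    def log_message(self, format: str, *args) -> None:
        pass  # keep the example quiet


def start_server(port: int = 0) -> ThreadingHTTPServer:
    # Port 0 asks the OS for any free port; inside a container you would
    # bind a fixed port on 0.0.0.0 instead.
    server = ThreadingHTTPServer(("127.0.0.1", port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_server()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
    conn.request("GET", "/healthz")
    response = conn.getresponse()
    print(response.status, response.read().decode())
    conn.close()
    server.shutdown()
    server.server_close()
```

The health endpoint deliberately does no model work. It only answers "is the process up and serving?", which is what the cluster needs in order to restart or replace a bad container.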

The right choice depends on your traffic pattern, latency budget, team infrastructure expertise, and data privacy requirements. Most orgs end up using two or three of these simultaneously for different models.
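At the other end of the spectrum, the Model-as-a-Service option reduces deployment to an HTTP call. The sketch below builds such a request with the standard library without sending it. The URL, the token, the payload shape, and the bearer-token header are illustrative assumptions; each provider documents its own schema and authentication.

```python
import json
import urllib.request


def build_prediction_request(url: str, token: str, inputs: dict) -> urllib.request.Request:
    # Hosted APIs generally take a JSON body plus an auth header; the exact
    # fields differ per provider, so treat this shape as a placeholder.
    body = json.dumps(inputs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical endpoint and token, for illustration only.
request = build_prediction_request(
    "https://api.example.com/v1/predict",
    token="YOUR_API_TOKEN",
    inputs={"text": "The delivery arrived two days late."},
)
print(request.get_method(), request.full_url)
# Sending it would look like: urllib.request.urlopen(request, timeout=10)
```

There is no image, server, or scaling rule anywhere in this code, which is the whole appeal and also the trade-off: latency, cost, and availability are in the provider's hands.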

✦

Serverless Has a Catch: Cold Starts

Serverless functions scale to zero when idle, which is great for your bill and terrible for latency-sensitive applications. When a function that hasn't been called recently receives a request, it incurs a cold start penalty: the platform must start a fresh instance, initialize the runtime, and load your code and model before the request can be processed. For a simple model, this might be 200–500ms. For a large deep learning model with heavy dependencies, it can be several seconds.

Rule of thumb: serverless works well for internal tools, low-frequency models, and batch inference. For real-time user-facing predictions where 100ms latency matters, keep the server warm.
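One common way to limit the damage is to load the model once per instance and reuse it on warm invocations. The sketch below simulates that pattern. The `(event, context)` signature follows the AWS Lambda Python handler convention; `load_model`, the 0.2 second sleep, and the averaging "model" are stand-ins invented for this example, not measurements.

```python
import time

_MODEL = None   # survives between invocations while the instance stays warm
LOAD_COUNT = 0  # instrumentation for this example only


def load_model():
    # Stand-in for an expensive load (deserializing weights, heavy imports).
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.2)  # simulated cost, not a measured figure
    return lambda features: sum(features) / len(features)


def get_model():
    # Load once per instance; warm invocations reuse the cached object.
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()
    return _MODEL


def handler(event, context=None):
    model = get_model()
    return {"prediction": model(event["features"])}


for label in ("cold", "warm"):
    start = time.perf_counter()
    result = handler({"features": [0.2, 0.4, 0.9]})
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{label}: {elapsed_ms:.0f} ms -> {result}")
```

Caching only removes the model-loading cost from warm requests. The first request after a scale-to-zero still pays for instance startup, which is why latency-critical endpoints are usually kept warm.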

Deployment Decision Tree

1. What are the latency requirements?

2. Does the model need internet connectivity?

3. Can data leave the device or local environment?

4. What is the expected traffic pattern?

5. What is your team's ML ops maturity?

The right deployment option depends on your constraints, not just your model.
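The five questions can be turned into a rough first-pass filter. The sketch below is one possible encoding, loosely based on the "best for" guidance in the options table. The thresholds, category names, and rule ordering are assumptions chosen for illustration, and a real decision would weigh cost and existing infrastructure as well.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    max_latency_ms: int          # question 1
    internet_available: bool     # question 2
    data_can_leave_device: bool  # question 3
    traffic: str                 # question 4: "steady", "bursty", or "low"
    mlops_maturity: str          # question 5: "none", "some", or "high"


def recommend(c: Constraints) -> str:
    # Hard constraints first: offline use or device-bound data rules out
    # every option that runs the model on someone's server.
    if not c.internet_available or not c.data_can_leave_device:
        return "Edge (on-device)"
    # With no ops capacity at all, let someone else run the model.
    if c.mlops_maturity == "none":
        return "Model-as-a-Service (hosted API)"
    # Tight latency budgets leave no room for cold starts.
    if c.max_latency_ms <= 100:
        if c.mlops_maturity == "high":
            return "Container (Docker + Kubernetes)"
        return "Managed ML platform"
    # Relaxed latency plus intermittent traffic favors pay-per-call.
    if c.traffic in ("bursty", "low"):
        return "Serverless functions"
    if c.mlops_maturity == "high":
        return "Container (Docker + Kubernetes)"
    return "Managed ML platform"


# Example: an internal reporting tool that is called a few times an hour.
internal_tool = Constraints(
    max_latency_ms=2000,
    internet_available=True,
    data_can_leave_device=True,
    traffic="bursty",
    mlops_maturity="some",
)
print(recommend(internal_tool))  # Serverless functions
```

The ordering matters more than the exact thresholds: constraints that eliminate options outright are checked before preferences that merely rank them.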

Checkpoint

A mobile health startup is deploying a model that predicts heart rate anomalies from wearable sensor data. The model must respond in under 50ms, the app must work without internet connectivity, and patient health data cannot leave the device. Which deployment option fits?
