Docker and Deployment Options

Before we talk about where to deploy a model, we need to talk about how to package one. The answer, almost universally, is Docker. Almost every other deployment option in this section is either a wrapper around Docker, or a service that accepts Docker images.

What Docker Actually Does

Isolation (sandboxed)
Each container runs in its own sandbox, so your model's dependencies don't collide with the web server's dependencies. No more "it installed fine but now nothing else works."
Key benefit: dependencies stay contained.

Consistency (reproducible)
What runs on your laptop runs the same way in production, because the container image is the same artifact on every machine that runs it. The phrase "it works on my machine" largely disappears.
Key benefit: dev equals prod.

Portability (run anywhere)
Any system with a container runtime can run the image, provided the image was built for the host's CPU architecture. Regardless of what else is installed on the host, the container brings its own environment: cloud, on-prem, or a colleague's laptop.
Key benefit: host-agnostic execution.

Versionability (immutable)
Container images are versioned and, once built, immutable. Rollbacks are as simple as switching which image version is running. Bad deploy at 3am? Point to the previous tag and you're done.
Key benefit: rollback in one command.
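
The rollback idea above can be sketched in a few lines of Python. This is a toy model of the bookkeeping, not a real orchestrator API: the `Deployment` class, its methods, and the `registry.example.com/...` image references are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Deployment:
    """Tracks which image reference a (hypothetical) service is running."""
    name: str
    history: List[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        """Return the image reference that is currently live."""
        if not self.history:
            raise RuntimeError("nothing has been deployed yet")
        return self.history[-1]

    def deploy(self, image_ref: str) -> None:
        """Point the service at a new image; older images stay untouched."""
        self.history.append(image_ref)

    def rollback(self) -> str:
        """Switch back to the previously running image and return it."""
        if len(self.history) < 2:
            raise RuntimeError("no previous image to roll back to")
        self.history.pop()
        return self.current


# Hypothetical image references, for illustration only.
svc = Deployment("anomaly-model")
svc.deploy("registry.example.com/anomaly-model:1.4.0")
svc.deploy("registry.example.com/anomaly-model:1.5.0")  # the bad 3am deploy
print(svc.rollback())  # prints "registry.example.com/anomaly-model:1.4.0"
```

Because the old image is never modified, rolling back only changes which reference the service points to; nothing has to be rebuilt.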

Because nearly every deployment option in the ML ecosystem either wraps Docker or accepts Docker images, learning the container model once carries over to the rest.

Once you have a container image, you have options for where to run it. The right choice depends on your traffic, latency requirements, team size, and budget.

Deployment Options
Option: Container (Docker + Kubernetes)
What it is: Run containers on a managed cluster. You define resources, auto-scaling rules, and health checks. The cluster handles the rest. Examples: AWS EKS (or ECS, AWS's own orchestrator) · GKE · AKS. Key trait: full control.
Best for: High-traffic, production-grade APIs that need auto-scaling and fine-grained control.

Option: Serverless (Functions)
What it is: Your model runs only when invoked, so there is no idle cost. Scales automatically. Cold starts can add latency after periods of inactivity. Examples: AWS Lambda · Azure Functions · Google Cloud Functions. Key trait: pay per call.
Best for: Low-traffic or bursty models where you don't want to pay for idle compute.

Option: Managed ML (Platform)
What it is: The platform handles deployment, endpoints, A/B testing, and monitoring. You bring the model; it handles the infrastructure. Examples: SageMaker · Vertex AI · Azure ML. Key trait: full MLOps stack.
Best for: Teams that want the full MLOps stack without building it themselves.

Option: Model-as-a-Service (Hosted API)
What it is: You call an HTTP endpoint. No infrastructure, no containers. You don't run the model at all; someone else does. Examples: HF Inference Endpoints · Replicate · OpenAI. Key trait: zero ops.
Best for: Prototyping, or when you don't want to run the model at all.

Option: Edge (On-Device)
What it is: The model is compiled and bundled with the app. Inference runs on the device (phone, sensor, vehicle), with no network round-trip. Examples: TFLite · ONNX Runtime · Core ML. Key trait: no internet needed.
Best for: Latency-critical or privacy-sensitive applications with no reliable internet.

The right choice depends on your traffic pattern, latency budget, team infrastructure expertise, and data privacy requirements. Most orgs end up using two or three of these simultaneously for different models.

Serverless Has a Catch: Cold Starts

Serverless functions scale to zero when idle, which is great for your bill and terrible for latency-sensitive applications. When a function that hasn't been called recently receives a request, it incurs a cold start penalty: the platform must start a fresh instance, initialize the runtime, and load your code and model before the request can be processed. For a simple model, this might be 200–500 ms. For a large deep learning model with heavy dependencies, it can be several seconds.
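
A common way to limit the damage is to load the model once per instance and reuse it on later invocations, so only the first (cold) request pays the load time. The sketch below simulates that pattern. It is a minimal illustration under stated assumptions: `handler`, `get_model`, `_load_model`, and the 50 ms simulated load are hypothetical, and real platforms define their own handler signatures.

```python
import time

_MODEL = None    # cached per process; reused while the instance stays warm
LOAD_COUNT = 0   # counts how often the expensive load actually ran


def _load_model():
    """Stand-in for an expensive load (heavy imports, reading weights)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    time.sleep(0.05)  # simulated load time; hypothetical value
    # Toy "model": returns the mean of the input features.
    return lambda features: sum(features) / len(features)


def get_model():
    """Load the model on first use, then return the cached instance."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_model()
    return _MODEL


def handler(event):
    """Hypothetical entry point; real platforms define their own signature."""
    start = time.perf_counter()
    prediction = get_model()(event["features"])
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {"prediction": prediction, "elapsed_ms": elapsed_ms}


cold = handler({"features": [1.0, 2.0, 3.0]})  # first call pays the load
warm = handler({"features": [1.0, 2.0, 3.0]})  # second call reuses the cache
print(f"cold: {cold['elapsed_ms']:.1f} ms, warm: {warm['elapsed_ms']:.1f} ms")
```

Caching only helps while the instance stays alive. Once the platform scales the function back to zero, the next request is cold again.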


Rule of thumb: serverless works well for internal tools, low-frequency models, and batch inference. For real-time user-facing predictions where 100ms latency matters, keep the server warm.

Deployment Decision Tree

1. What are the latency requirements?
2. Does the model need internet connectivity?
3. Can data leave the device or local environment?
4. What is the expected traffic pattern?
5. What is your team's ML ops maturity?

The right deployment option depends on your constraints, not just your model.
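
One way to make the five questions concrete is to encode them as a small function. The branch order and the 1000 ms cold-start tolerance below are assumptions made for illustration, loosely following the "best for" descriptions of each option. They are not the lesson's official decision logic, and `Constraints` and `suggest_deployment` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    max_latency_ms: float           # question 1: latency requirement
    needs_offline: bool             # question 2: must it work without internet?
    data_must_stay_on_device: bool  # question 3: is data barred from leaving?
    traffic: str                    # question 4: "steady", "bursty", or "low"
    mlops_maturity: str             # question 5: "low", "medium", or "high"


# Hypothetical threshold: latency budgets at or above this tolerate cold starts.
COLD_START_TOLERANCE_MS = 1000


def suggest_deployment(c: Constraints) -> str:
    """Return a suggested option; a rough heuristic, not a definitive rule."""
    # Hard constraints first: no network or no data egress rules out servers.
    if c.needs_offline or c.data_must_stay_on_device:
        return "Edge (on-device)"
    # Idle-heavy traffic with a relaxed latency budget suits pay-per-call.
    if c.traffic in ("bursty", "low") and c.max_latency_ms >= COLD_START_TOLERANCE_MS:
        return "Serverless functions"
    # Otherwise pick by how much infrastructure the team can operate.
    if c.mlops_maturity == "low":
        return "Model-as-a-Service (hosted API)"
    if c.mlops_maturity == "medium":
        return "Managed ML platform"
    return "Containers on a managed cluster"


# Example: an internal tool with bursty traffic and a relaxed latency budget.
internal_tool = Constraints(2000, False, False, "bursty", "medium")
print(suggest_deployment(internal_tool))  # prints "Serverless functions"
```

A real decision would also weigh cost, compliance, and existing infrastructure. The point of the sketch is only that hard constraints are checked before preferences.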
Checkpoint

A mobile health startup is deploying a model that predicts heart rate anomalies from wearable sensor data. The model must respond in under 50ms, the app must work without internet connectivity, and patient health data cannot leave the device. Which deployment option fits?