Docker and Deployment Options
Before we talk about where to deploy a model, we need to talk about how to package one. The answer, almost universally, is Docker.
Almost every deployment option in the ML ecosystem is either a wrapper around Docker or a service that accepts Docker images. Learn the container model once and the rest follows.
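To make "packaging a model" concrete, here is a minimal sketch of the kind of process a container image typically wraps: a small HTTP server that accepts JSON and returns a prediction. It uses only the Python standard library. The `predict` placeholder, the `handle_payload` and `serve` helpers, the `features` field, and port 8080 are all assumptions chosen for illustration, not part of any particular platform's contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model: returns the mean of the inputs.

    A real service would load a trained model and call it here.
    """
    return sum(features) / len(features)


def handle_payload(raw_body: bytes) -> bytes:
    """Decode a JSON request body, run the model, encode the response."""
    payload = json.loads(raw_body)
    score = predict(payload["features"])
    return json.dumps({"prediction": score}).encode("utf-8")


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly as many bytes as the client says it sent.
        length = int(self.headers.get("Content-Length", 0))
        body = handle_payload(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    # Bind to 0.0.0.0 so the port is reachable from outside the container.
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

In a container, the image's start command would call `serve()`. Because the image carries the code, the model, and the dependencies together, the same artifact can run unchanged on any platform that accepts Docker images.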
Once you have a container image, you have options for where to run it. The right choice depends on your traffic pattern, latency requirements, team size and infrastructure expertise, cost, and data privacy requirements. Most organizations end up using two or three of these at once for different models.
Serverless Has a Catch: Cold Starts
Serverless functions scale to zero when idle, which is great for your bill and terrible for latency-sensitive applications. When a function that hasn't been called recently receives a request, it incurs a cold start penalty: the runtime must be initialized, and the model loaded into memory, before the request can be processed. For a simple model, this might be 200–500 ms. For a large deep learning model with heavy dependencies, it can be several seconds.
Rule of thumb: serverless works well for internal tools, low-frequency models, and batch inference. For real-time user-facing predictions where 100ms latency matters, keep the server warm.
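The cold start effect can be sketched in a few lines. The example below simulates an expensive model load with `time.sleep` and caches the result at module level, so only the first invocation pays the cost. The `handler` entry point, `load_model`, and the 0.2 second delay are hypothetical stand-ins, not measurements from any real platform.

```python
import time

_MODEL = None  # module-level cache; survives while the instance stays warm


def load_model():
    """Stand-in for an expensive load (deserializing weights, importing libraries)."""
    time.sleep(0.2)  # simulated cost, an assumption rather than a measured number
    return lambda features: sum(features) / len(features)


def handler(features):
    """Hypothetical serverless entry point."""
    global _MODEL
    if _MODEL is None:  # true only on a cold start
        _MODEL = load_model()
    return _MODEL(features)


def timed_call(features):
    """Return the prediction and how long the call took, in seconds."""
    start = time.perf_counter()
    result = handler(features)
    return result, time.perf_counter() - start


_, cold = timed_call([1.0, 2.0, 3.0])  # pays the load cost
_, warm = timed_call([1.0, 2.0, 3.0])  # reuses the cached model
print(f"cold: {cold * 1000:.0f} ms, warm: {warm * 1000:.0f} ms")
```

Caching at module level helps only while the platform keeps the instance alive. Once it scales to zero, the next request starts cold again, which is why latency-sensitive services keep at least one instance warm.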
1. What are the latency requirements?
2. Does the model need internet connectivity?
3. Can data leave the device or local environment?
4. What is the expected traffic pattern?
5. What is your team's ML ops maturity?
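One way to apply the five questions above is to treat them as an ordered checklist, with hard constraints (connectivity, privacy) checked before soft ones (traffic, team maturity). The sketch below does that. The option labels, the `Requirements` fields, and the numeric thresholds are illustrative assumptions, not recommendations from any vendor; only the 100 ms cutoff echoes the rule of thumb in the text.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    max_latency_ms: float        # question 1
    needs_offline: bool          # question 2: must work without connectivity
    data_must_stay_local: bool   # question 3
    requests_per_hour: float     # question 4
    has_ops_team: bool           # question 5


def suggest_deployment(req: Requirements) -> str:
    """Map answers to a hypothetical deployment label."""
    # Connectivity and privacy are hard constraints, so check them first.
    if req.needs_offline or req.data_must_stay_local:
        return "on-device"
    # Who runs a warm server depends on the team's ops maturity.
    warm_option = (
        "always-on container service" if req.has_ops_team else "managed endpoint"
    )
    # Tight latency budgets rule out scale-to-zero because of cold starts.
    if req.max_latency_ms <= 100:
        return warm_option
    # Arbitrary cutoff for "low-frequency" traffic; tune it for your workload.
    if req.requests_per_hour < 100:
        return "serverless"
    return warm_option


internal_tool = Requirements(
    max_latency_ms=2000,
    needs_offline=False,
    data_must_stay_local=False,
    requests_per_hour=10,
    has_ops_team=False,
)
print(suggest_deployment(internal_tool))  # serverless
```

The order of the checks is the design choice that matters here: a constraint that cannot be negotiated should eliminate options before cost or convenience is considered.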
A mobile health startup is deploying a model that predicts heart rate anomalies from wearable sensor data. The model must respond in under 50ms, the app must work without internet connectivity, and patient health data cannot leave the device. Which deployment option fits?