Scalability, Cost, and Security

Three topics that are often treated as advanced — system scalability, operational cost, and security — are actually decisions you make on day one, whether you intend to or not. An architecture that can't scale costs you a rewrite later. A model that runs without cost controls costs you real money. A system built without security consideration costs you trust, compliance, and potentially your users' data. These aren't afterthoughts; they're design constraints.


Scalability: Serving More Requests

When traffic grows, you have two options:

Vertical scaling — give the existing machine more resources (bigger CPU, more RAM, a faster GPU). Simple, but has a ceiling — machines only get so big, and a single machine is a single point of failure.

Horizontal scaling: add more instances behind a load balancer. Traffic is distributed across many machines, so there is no practical ceiling and no single instance is a single point of failure (provided the load balancer itself is redundant).

Auto-scaling is horizontal scaling that responds dynamically: when traffic spikes, new instances spin up; when traffic subsides, instances are removed. Most managed platforms and orchestrators (AWS ECS, Kubernetes, SageMaker) support auto-scaling on metrics such as CPU utilization or request queue depth.
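To make the scaling rule concrete, here is a minimal sketch of a proportional (target-tracking) policy, similar in spirit to the formula the Kubernetes Horizontal Pod Autoscaler documents. The function name, the replica bounds, and the example numbers are assumptions for illustration, not a platform API.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=20):
    """Scale the replica count in proportion to how far the metric is from target.

    current_metric / target_metric could be average CPU percent or queue depth
    per instance. The min/max bounds are hypothetical safety limits.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike cannot scale the fleet without limit.
    return max(min_replicas, min(max_replicas, raw))


# Four instances at 90% CPU with a 60% target: scale out.
scale_out = desired_replicas(4, 90, 60)
# Four instances at 15% CPU with a 60% target: scale in.
scale_in = desired_replicas(4, 15, 60)
```

Real autoscalers add cooldown periods and tolerance bands on top of a rule like this so the fleet does not oscillate.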

One design decision that affects everything: batch vs. real-time inference. Batch inference processes many predictions at once (a nightly job scoring all customers for a churn model); real-time inference processes one request at a time as it arrives (a recommendation engine serving users live). Batch is usually cheaper: you can run it on off-peak compute and optimize for throughput. Real-time requires always-on infrastructure optimized for latency.
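The difference shows up directly in how you call the model. The sketch below contrasts the two call patterns; `toy_model` is a hypothetical stand-in for a real model, and the chunk size is an arbitrary assumption.

```python
from typing import Iterable, Iterator, List, Sequence

Features = Sequence[float]


def toy_model(batch: List[Features]) -> List[float]:
    """Hypothetical stand-in for a real model: scores a whole batch in one call."""
    return [sum(row) / len(row) for row in batch]


def score_batch(rows: Iterable[Features], chunk_size: int = 1000) -> Iterator[float]:
    """Batch path: stream rows in large chunks to maximize throughput."""
    chunk: List[Features] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield from toy_model(chunk)
            chunk = []
    if chunk:  # score the final partial chunk
        yield from toy_model(chunk)


def score_request(features: Features) -> float:
    """Real-time path: one request in, one prediction out, as fast as possible."""
    return toy_model([features])[0]
```

The batch function would typically run from a scheduler such as a nightly job, while the request function would sit behind an always-on HTTP endpoint.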


Cost: Where Your Inference Budget Goes

Right-sizing — match the infrastructure to the actual workload. A model that gets two requests per day doesn't need a dedicated GPU instance. A model serving a million requests per hour does.
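A back-of-the-envelope calculation makes the contrast concrete. This sketch assumes traffic is spread evenly over the hour and that one instance can sustain 50 requests per second; both numbers, along with the 70% utilization target, are hypothetical.

```python
import math


def size_fleet(requests_per_hour, per_instance_rps, target_utilization=0.7):
    """Return (instances needed, resulting utilization) for an average load.

    Assumes evenly spread traffic; real sizing should use peak load instead.
    """
    avg_rps = requests_per_hour / 3600
    usable_rps = per_instance_rps * target_utilization  # leave headroom
    instances = max(1, math.ceil(avg_rps / usable_rps))
    utilization = avg_rps / (instances * per_instance_rps)
    return instances, utilization


# A million requests per hour keeps a small fleet busy.
busy = size_fleet(1_000_000, per_instance_rps=50)
# Two requests per day leaves a dedicated instance almost entirely idle.
idle = size_fleet(2 / 24, per_instance_rps=50)
```

In the second case a single instance sits at a tiny fraction of a percent utilization, which is the signal to look at shared, on-demand, or batch infrastructure instead.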

Spot instances: cloud providers sell spare capacity at a steep discount (often 60–90% off on-demand prices). The catch: the provider can reclaim the instance on short notice (two minutes on AWS, and less on some other clouds). Great for training jobs, as long as you checkpoint frequently. Risky for latency-sensitive inference.
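"Checkpoint frequently" can be sketched as a training loop that saves its state at intervals and resumes from the last save. This is a minimal illustration using a JSON file; the state layout, the placeholder loss, and the `stop_at` parameter (which simulates an interruption) are all assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so an interruption mid-write
    never leaves a corrupt checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "loss": None}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every=100, stop_at=None):
    state = load_checkpoint(path)  # resume where the last run left off
    for step in range(state["step"], total_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated spot interruption: unsaved work is lost
        # Placeholder for a real training step.
        state = {"step": step + 1, "loss": 1.0 / (step + 1)}
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

With a checkpoint every 100 steps, an interruption at step 250 costs at most the 50 steps since the last save, and the next run picks up from step 200.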

Caching — if your model frequently receives the same or similar inputs, cache the predictions. A recommendation model that shows the same top-10 products to most anonymous users can serve those from a cache rather than running inference.
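Here is a minimal sketch of that idea: a small in-memory cache with a time-to-live, so cached predictions are refreshed periodically. The class, the 60-second TTL in the usage lines, and `run_inference_for_anonymous_user` are hypothetical; a production system would more likely use a shared cache such as Redis.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache values for ttl_seconds, then recompute on the next request."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the cache can be tested without sleeping
        self._store: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # fresh entry: skip inference entirely
        value = compute()  # miss or stale entry: run the model once
        self._store[key] = (now, value)
        return value


calls = {"n": 0}


def run_inference_for_anonymous_user():
    """Hypothetical expensive model call; counts how often it actually runs."""
    calls["n"] += 1
    return ["item-%d" % i for i in range(1, 11)]


cache = TTLCache(ttl_seconds=60)
top10 = cache.get_or_compute("anonymous", run_inference_for_anonymous_user)
top10_again = cache.get_or_compute("anonymous", run_inference_for_anonymous_user)
```

The second lookup returns the stored list without touching the model. The TTL controls the trade-off between compute savings and how stale a recommendation is allowed to get.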

Model compression: smaller models cost less to run and respond faster. Three techniques to explore:
1. Quantization: reduce the precision of weights from float32 to int8. Often 4× smaller with minimal accuracy loss.
2. Pruning: remove weights that contribute little to predictions. Reduces parameter count.
3. Distillation: train a small "student" model to mimic a large "teacher" model. The student is far cheaper to run.
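To see where the 4× figure for quantization comes from, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. Real frameworks use more refined schemes (per-channel scales, calibration data), so treat the function names and the example weights as illustrative assumptions.

```python
from array import array


def quantize_int8(weights):
    """Map floats onto integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage comparison: 4 bytes per float32 weight vs 1 byte per int8 weight.
float32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(q) * array("b").itemsize

# Rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The storage drops by a factor of four, and each restored weight differs from the original by at most half of `scale`. Whether that error is "minimal" for a given model is something you measure on a validation set.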


ML-Specific Security Threats

Standard security hygiene (encryption in transit and at rest, access control, regular audits) applies to ML systems just as to any software. But ML models face a class of attacks that general software does not:

Model inversion attacks — an adversary queries the model repeatedly to reconstruct its training data. If the model was trained on sensitive records (medical data, PII), this is a serious privacy concern.

Membership inference attacks — the adversary tries to determine whether a specific record was in the training set. This leaks information about who was included in your training data.

Adversarial examples: inputs crafted to cause the model to misclassify. The classic illustration is a stop sign with a few stickers on it: a human still reads it as "stop sign," but a computer vision model reads it as "speed limit." In autonomous systems, this is a safety concern.
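To show how little it can take, here is a minimal sketch of one well-known attack, the fast gradient sign method, applied to a tiny logistic regression classifier. The weights, the input, and the perturbation size `eps` are made-up values chosen for illustration; attacks on real image models follow the same idea with far more dimensions.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(w, b, x, y, eps):
    """Nudge every feature by eps in the direction that increases the loss."""
    p = predict_proba(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input x.
    grad = [(p - y) * wi for wi in w]
    sign = [(g > 0) - (g < 0) for g in grad]
    return [xi + eps * s for xi, s in zip(x, sign)]


w, b = [2.0, -1.0], 0.0   # hypothetical trained weights
x, y = [0.5, 0.2], 1      # an input the model classifies correctly as class 1

x_adv = fgsm(w, b, x, y, eps=0.4)
p_clean = predict_proba(w, b, x)     # above 0.5: predicted class 1
p_adv = predict_proba(w, b, x_adv)   # below 0.5: prediction flipped to class 0
```

Each feature moved by at most `eps`, yet the predicted class changed. Adversarial training works by generating inputs like `x_adv` during training and teaching the model to classify them correctly.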

These aren't theoretical. Defenses include differential privacy during training, rate-limiting and anomaly detection on API queries, and adversarial training. Most ML engineers won't implement these from scratch, but you should be able to recognize the threat and escalate to a security team that can.
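Of those defenses, rate-limiting is the one you are most likely to touch directly, because inversion and membership inference attacks both depend on sending many queries. Below is a minimal token-bucket sketch; the class and its parameters are assumptions for illustration, and in practice you would keep one bucket per API key, usually in an API gateway rather than in application code.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity`, then at most `rate_per_second` sustained."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_second
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1        # spend one token for this query
            return True
        return False                # over the limit: reject or delay the query


bucket = TokenBucket(rate_per_second=1, capacity=3)
burst = [bucket.allow() for _ in range(3)]
```

A client that exhausts its bucket has to wait for it to refill, which makes the thousands of queries an extraction attack needs slow and conspicuous. Logging rejected requests also gives anomaly detection something to work with.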

Duke Trust Lab

Play a Game: Are You Smarter than AI?

Adversarial examples and deceptive inputs can fool machine learning models. Can they fool you?


Checkpoint

A product team has a recommendation model that returns the same top-10 trending items to 80% of anonymous users. The current architecture runs full model inference for every request, and compute costs are high. What is the cheapest, lowest-risk optimization?
