Scalability, Cost, and Security

Three topics that are often treated as advanced — system scalability, operational cost, and security — are actually decisions you make on day one, whether you intend to or not. An architecture that can't scale costs you a rewrite later. A model that runs without cost controls costs you real money. A system built without security consideration costs you trust, compliance, and potentially your users' data. These aren't afterthoughts; they're design constraints.

Scalability: Serving More Requests

When traffic grows, you have two options:


  • Vertical scaling — give the existing machine more resources (bigger CPU, more RAM, a faster GPU). Simple, but has a ceiling — machines only get so big, and a single machine is a single point of failure.

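  • Horizontal scaling — add more instances behind a load balancer, which distributes traffic across many machines. There is no hard per-machine ceiling, and no single serving instance whose failure takes the system down, provided the service is stateless and the load balancer itself is redundant.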

Auto-scaling is horizontal scaling that responds dynamically: when traffic spikes, new instances spin up; when traffic subsides, instances are removed. Most managed and orchestration platforms (AWS ECS, Kubernetes, SageMaker) support auto-scaling on signals such as CPU utilization or request queue depth.
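To make the scaling decision concrete, here is a minimal sketch of a proportional rule: scale the instance count by the ratio of the observed metric to its target, then clamp the result. It follows the formula documented for the Kubernetes Horizontal Pod Autoscaler, but the function name, parameters, and bounds are illustrative assumptions, not any platform's API.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional rule: keep the per-instance metric near its target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Instances at 90% CPU with a 60% target means we need 1.5x as many.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a spike cannot scale without bound and a quiet period
    # cannot remove every instance.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, current_metric=90, target_metric=60))  # scale out: 6
print(desired_replicas(4, current_metric=15, target_metric=60))  # scale in: 1
```

Real autoscalers add cooldown periods and tolerance bands around the target so that small fluctuations do not cause constant churn; those are omitted here.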


One design decision that affects everything: batch vs. real-time inference. Batch inference processes many predictions at once (a nightly job scoring all customers for a churn model); real-time inference handles each request as it arrives (a recommendation engine serving users live). Batch is usually cheaper because you can run it on off-peak compute and optimize for throughput. Real-time requires always-on infrastructure optimized for latency.
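The difference shows up in how the same model gets called. The sketch below is a toy illustration only: churn_score stands in for a trained model, and its weights, the field names, and the chunk size are all hypothetical.

```python
from typing import Dict, Iterator, List


def churn_score(customer: dict) -> float:
    """Stand-in for a trained model (hypothetical weights)."""
    raw = 0.6 * customer["days_inactive"] / 30 - 0.2 * customer["orders_last_month"]
    return max(0.0, min(1.0, raw))


def chunks(rows: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield fixed-size slices so the job never holds more than one chunk of work."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def score_batch(rows: List[dict], chunk_size: int = 1000) -> Dict[int, float]:
    """Batch path: a scheduled job walks the whole table, chunk by chunk."""
    scores: Dict[int, float] = {}
    for chunk in chunks(rows, chunk_size):
        for customer in chunk:
            scores[customer["id"]] = churn_score(customer)
    return scores


def handle_request(customer: dict) -> float:
    """Real-time path: one prediction per incoming request, answered immediately."""
    return churn_score(customer)


customers = [{"id": i, "days_inactive": 3.0 * i, "orders_last_month": 1.0}
             for i in range(5)]
nightly_scores = score_batch(customers, chunk_size=2)  # run once, store results
live_score = handle_request(customers[4])              # run on every request
```

The batch path can run on whatever compute is cheapest at 2 a.m. and write its results to a table; the real-time path needs a server waiting for every call.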

Cost: Where Your Inference Budget Goes

  • Right-sizing — match the infrastructure to the actual workload. A model that gets two requests per day doesn't need a dedicated GPU instance. A model serving a million requests per hour does.

  • Spot instances — cloud providers sell spare capacity at a steep discount (often 60–90% off on-demand prices). The catch: the provider can reclaim the instance on short notice (two minutes on AWS, less on some other providers). Great for training jobs (just checkpoint frequently). Risky for latency-sensitive inference.

  • Caching — if your model frequently receives the same or similar inputs, cache the predictions. A recommendation model that shows the same top-10 products to most anonymous users can serve those from a cache rather than running inference.

  • Model compression — smaller models cost less to run and respond faster.
    Three techniques to explore:
    1. Quantization — reduce the precision of weights from float32 to int8. The weights become 4× smaller, often with minimal accuracy loss.
    2. Pruning — remove weights that contribute little to predictions. Reduces parameter count.
    3. Distillation — train a small "student" model to mimic a large "teacher" model. The student is far cheaper to run.
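The caching lever above can be sketched in a few lines. This is a minimal illustration, assuming anonymous users can share one cache entry and that results may be a few minutes stale; TTLCache, run_model, recommend, and the 300-second lifetime are all hypothetical names and values.

```python
import time


class TTLCache:
    """Tiny time-to-live cache: a value is reused until it expires."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                      # cache hit: no inference
        value = compute()                        # cache miss: run the model
        self._store[key] = (now + self.ttl, value)
        return value


stats = {"inference_calls": 0}


def run_model(segment: str) -> list:
    """Stand-in for expensive model inference."""
    stats["inference_calls"] += 1
    return [f"{segment}-item-{rank}" for rank in range(1, 11)]


cache = TTLCache(ttl_seconds=300)


def recommend(user_id=None) -> list:
    if user_id is None:
        # Anonymous users share one cached top-10 list.
        return cache.get_or_compute("anonymous", lambda: run_model("anonymous"))
    # Logged-in users still get a personalized prediction.
    return run_model(f"user-{user_id}")


for _ in range(1000):
    recommend()                                  # 1,000 anonymous requests
print(stats["inference_calls"])                  # the model ran once
```

The time-to-live is the knob to tune: a longer one saves more compute but serves staler recommendations.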
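To see where the 4× figure for quantization comes from, here is a small sketch of symmetric int8 quantization on a handful of made-up weights. It uses only the standard library; production frameworks use per-channel scales, calibration data, and other refinements not shown here, and the function names are illustrative.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.90, -0.33]       # hypothetical layer weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = len(array("f", weights).tobytes())  # 4 bytes per weight
int8_bytes = len(array("b", q).tobytes())        # 1 byte per weight
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"{fp32_bytes} bytes -> {int8_bytes} bytes, "
      f"worst rounding error {worst_error:.4f}")
```

Each weight is off by at most half a quantization step (scale / 2), which is why accuracy often survives; whether it does for a given model has to be measured.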

ML-Specific Security Threats

Standard security hygiene (encryption in transit and at rest, access control, regular audits) applies to ML systems just as to any software. But ML models face a class of attacks that general software does not:


  • Model inversion attacks — an adversary queries the model repeatedly to reconstruct its training data. If the model was trained on sensitive records (medical data, PII), this is a serious privacy concern.

  • Membership inference attacks — the adversary tries to determine whether a specific record was in the training set. This leaks information about who was included in your training data.

  • Adversarial examples — inputs crafted to cause the model to misclassify. The classic example is a stop sign with a few stickers on it: a human still reads it as a stop sign, but a computer vision model reads it as a speed limit sign. In autonomous systems, this is a safety concern.
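A toy version of the adversarial-example attack fits in a few lines. The sketch below applies a fast-gradient-sign style perturbation to a hand-written linear classifier; the weights, input, and epsilon are made-up values chosen so the flip is visible, and real attacks target far larger models.

```python
def score(w, b, x):
    """Linear decision function: positive means class +1, negative means class -1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x) -> int:
    return 1 if score(w, b, x) >= 0 else -1


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm_perturb(w, x, y, epsilon):
    """Move every feature by at most epsilon in the direction that raises the loss.

    For logistic loss on a linear model, the gradient with respect to x points
    along -y * w, so its sign is -y * sign(w).
    """
    return [xi - epsilon * y * sign(wi) for wi, xi in zip(w, x)]


w, b = [1.0, -2.0, 0.5], 0.0          # hypothetical trained weights
x, y = [0.3, -0.1, 0.2], 1            # an input the model classifies correctly

x_adv = fgsm_perturb(w, x, y, epsilon=0.2)

print(predict(w, b, x))               # original prediction
print(predict(w, b, x_adv))           # prediction after the small perturbation
```

No feature moved by more than 0.2, yet the predicted class changes, which is the core of the threat: the change is small by the attacker's chosen measure but decisive for the model.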

These aren't theoretical. Defenses include differential privacy during training (against model inversion and membership inference), rate-limiting and anomaly detection on API queries (against attacks that depend on large numbers of queries), and adversarial training (against adversarial examples). Most ML engineers won't implement these from scratch, but you should be able to recognize the threat and escalate to a security team that can.
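Of these defenses, rate-limiting is the simplest to picture. Here is a minimal sketch of a per-client sliding-window limiter, assuming each caller has a stable identifier; the class name, limits, and injected clock are illustrative, and a production system would usually rely on an API gateway rather than hand-rolled code.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float,
                 clock=time.monotonic):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)   # client_id -> timestamps of recent calls

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        hits = self._hits[client_id]
        # Forget calls that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False                  # over budget: reject (and perhaps log)
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
decisions = [limiter.allow("client-a") for _ in range(5)]
print(decisions)                          # the first three calls pass, the rest are refused
```

A limiter like this raises the cost of attacks that need thousands of queries; it does not stop a patient attacker who stays under the limit, which is why it is paired with anomaly detection.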

Checkpoint

A product team has a recommendation model that returns the same top-10 trending items to 80% of anonymous users. The current architecture runs full model inference for every request, and compute costs are high. What is the cheapest, lowest-risk optimization?