Building machine learning systems that actually stay working is harder than training a model. MLOps — the infrastructure, automation, and tooling around ML workflows — determines whether your models become technical debt or competitive advantage. On AWS, you face a fork in the road: Amazon's fully managed SageMaker Pipelines or a custom workflow orchestrated through ECS and Step Functions.
We've helped teams make this decision across multiple industries. The right choice depends on your team's engineering maturity, how much control you need, and whether vendor lock-in is acceptable. Let's walk through both approaches with real constraints.
What MLOps Actually Requires
Before comparing tools, understand what MLOps really involves. It's not just running training scripts. A complete MLOps system has distinct layers:
- Data Engineering: Pipelines that source, validate, and prepare training data at scale. This often involves Apache Spark, DuckDB, or Pandas — depending on volume.
- Model Training: Executing training jobs (hours to days), tracking hyperparameters and metrics, versioning code and model artifacts.
- Model Registry: Central registry of approved models, metadata about training conditions, and lineage back to training data.
- Deployment: Taking a model from registry and deploying it as an API, batch job, or inference container that stays available.
- Monitoring: In-production metrics on prediction latency, data drift, model performance decay, and prediction confidence.
- Orchestration: Gluing it all together — triggering retraining when performance drops, A/B testing new models, handling failed jobs.
Both SageMaker and custom ECS can handle all six. The difference is who owns the glue.
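To make the registry layer concrete, here is a minimal sketch of the metadata a registry entry typically carries. The field names are illustrative, not taken from any particular service:

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelRecord:
    """Illustrative model-registry entry: enough metadata to trace a
    deployed model back to the code and data that produced it."""
    name: str
    version: int
    artifact_uri: str          # e.g. s3://bucket/models/churn/3/model.tar.gz
    training_data_uri: str     # lineage back to the exact training set
    git_commit: str            # code version that produced the artifact
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)
    status: str = "pending"    # pending -> approved -> deployed


record = ModelRecord(
    name="churn-classifier",
    version=3,
    artifact_uri="s3://my-bucket/models/churn/3/model.tar.gz",
    training_data_uri="s3://my-bucket/datasets/churn/2024-05-01/",
    git_commit="a1b2c3d",
    hyperparameters={"max_depth": 6, "eta": 0.3},
    metrics={"auc": 0.91},
)

# A registry is then just a keyed store of these records
# (DynamoDB, Postgres, or SageMaker's managed registry).
registry = {(record.name, record.version): asdict(record)}
```

Whether SageMaker stores this for you or you store it yourself in DynamoDB, the shape of the problem is the same: every serving model must be traceable to data, code, and metrics.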
SageMaker Pipelines: The Managed Approach
SageMaker Pipelines is AWS's workflow orchestration layer for ML. It's a purpose-built, serverless pipeline service (distinct from Step Functions, despite the resemblance), with native integrations into SageMaker's processing, training, and model registry services.
What SageMaker Gives You
- Built-in Model Registry: Models are stored with metadata, versioning, and approval workflows. Each model knows its training parameters and metrics without extra engineering.
- Notebook-Native: Data scientists can build pipelines directly in SageMaker Studio using Python. No Docker containers required initially.
- Experiment Tracking: Training runs are automatically tracked through SageMaker Experiments, with metrics surfaced in CloudWatch. No need to run your own MLflow or Weights & Biases infrastructure.
- Endpoint Management: One click to deploy a model from the registry as a real-time inference endpoint with auto-scaling built in.
- Processing & Training Steps: Predefined step types for processing, training, tuning, and model registration. You write Python, SageMaker handles container orchestration.
A Simple SageMaker Pipeline Example
Here's what a basic two-stage pipeline looks like: data processing followed by model training.
```python
import sagemaker
from sagemaker.image_uris import retrieve
from sagemaker.processing import ScriptProcessor, ProcessingOutput
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name

# Processing step: clean and transform data
processor = ScriptProcessor(
    role=role,
    image_uri=retrieve("sklearn", region, version="0.23-1"),
    command=["python3"],
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
process_step = ProcessingStep(
    name="ProcessData",
    processor=processor,
    code="preprocess.py",
    job_arguments=["--input", "s3://my-bucket/raw-data"],
    # Declare an output channel so the training step can consume it
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
)

# Training step: train XGBoost model on the processed output
xgb = Estimator(
    image_uri=retrieve("xgboost", region, version="1.7-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts",
)
train_step = TrainingStep(
    name="TrainModel",
    estimator=xgb,
    # Wire the processing output's S3 URI into the training input channel
    inputs={
        "train": TrainingInput(
            s3_data=process_step.properties.ProcessingOutputConfig.Outputs[
                "train"
            ].S3Output.S3Uri
        )
    },
)

# Create and execute pipeline
pipeline = Pipeline(
    name="DataToModel",
    steps=[process_step, train_step],
    sagemaker_session=session,
)
pipeline.upsert(role_arn=role)
pipeline.start()
```
This is genuinely simple. One class per step; SageMaker compiles the steps into a pipeline definition and runs the workflow for you, with no state-machine JSON to hand-write. Data scientists can read and modify this.
The SageMaker Cost Reality
Hidden SageMaker Costs: Notebook instances incur hourly charges whenever they're running, which adds up quickly across a team. Processing and training jobs bill per second of compute, but each job also pays for cluster spin-up. An endpoint has a baseline cost even with zero requests, and auto-scaling minimums can exceed your actual traffic. It's common for teams to spend a large share of their ML budget on infrastructure they're not actively using. Always stop notebooks between sessions and set billing alerts on endpoint spend.
SageMaker's pricing is consumption-based, which sounds good until you realize notebook instances charge hourly even when idle (most teams leave them running overnight), real-time endpoints require at least one always-on instance (often hundreds of dollars a month for mid-size instance types), and even small batch transform jobs pay full instance spin-up overhead. GPU training at scale means instances that cost tens of dollars per hour.
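Back-of-the-envelope arithmetic makes the trade-off concrete. The rates below are illustrative placeholders, not current AWS prices; plug in your region's actual numbers:

```python
# Illustrative monthly cost comparison: an always-on SageMaker endpoint
# vs. on-demand Fargate batch tasks. All prices are assumed placeholders.
HOURS_PER_MONTH = 730

# An always-on real-time endpoint bills for every hour, idle or not.
endpoint_price_per_hour = 0.23            # assumed mid-size instance rate
endpoint_monthly = endpoint_price_per_hour * HOURS_PER_MONTH

# Fargate bills per vCPU-hour and GB-hour, only while tasks run.
fargate_vcpu_hour = 0.04048               # assumed per-vCPU-hour rate
fargate_gb_hour = 0.004445                # assumed per-GB-hour rate
task_vcpus, task_gb, task_hours = 4, 8, 1.0
runs_per_month = 60                       # e.g. two batch jobs per day
fargate_monthly = runs_per_month * task_hours * (
    task_vcpus * fargate_vcpu_hour + task_gb * fargate_gb_hour
)

print(f"Endpoint (always on): ${endpoint_monthly:,.2f}/month")
print(f"Fargate (60 runs):    ${fargate_monthly:,.2f}/month")
```

The pattern generalizes: idle-heavy, low-traffic serving favors on-demand batch; sustained high-traffic serving narrows the gap, since the endpoint's baseline is amortized over real requests.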
Custom ECS Workflows: The Engineering Approach
The alternative is building your own pipeline using ECS for task execution, Step Functions for orchestration, ECR for container images, and S3 for artifact storage. This is more work upfront but gives you complete control.
How It Works
- Container Everything: Your preprocessing, training, and evaluation code live in Docker images pushed to ECR. No SageMaker-specific code.
- Step Functions Orchestration: Define your workflow as a state machine that invokes ECS tasks, waits for them, and chains outputs to inputs.
- Model Registry: Build your own with S3 + DynamoDB or PostgreSQL. Track model versions, training parameters, metrics — as simple or complex as you need.
- Inference: Deploy containers as ECS tasks for batch scoring, or as Fargate/EKS services behind a load balancer for real-time APIs. You control scaling policies, request handling, and updates.
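The orchestration piece above can be sketched as an Amazon States Language definition, built here as a plain Python dict. The cluster, task definition, and subnet names are hypothetical placeholders:

```python
import json
from typing import Optional

# Sketch of a two-stage ML workflow in Amazon States Language:
# a preprocessing ECS task followed by a training ECS task, with retries.
# All ARNs and resource names below are placeholders.


def ecs_task_state(task_definition: str, next_state: Optional[str]) -> dict:
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::ecs:runTask.sync",  # wait for task exit
        "Parameters": {
            "LaunchType": "FARGATE",
            "Cluster": "ml-cluster",                       # placeholder
            "TaskDefinition": task_definition,
            "NetworkConfiguration": {
                "AwsvpcConfiguration": {"Subnets": ["subnet-placeholder"]}
            },
        },
        # Fine-grained retry control you own entirely
        "Retry": [
            {"ErrorEquals": ["States.TaskFailed"],
             "IntervalSeconds": 60, "MaxAttempts": 2, "BackoffRate": 2.0}
        ],
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "Comment": "Preprocess then train, each as a Fargate task",
    "StartAt": "Preprocess",
    "States": {
        "Preprocess": ecs_task_state("preprocess-task:1", "Train"),
        "Train": ecs_task_state("train-task:1", None),
    },
}

# json.dumps(definition) is what you would hand to Step Functions
# (e.g. via boto3's create_state_machine) along with an execution role.
print(json.dumps(definition, indent=2))
```

The `.sync` suffix on the resource ARN is what makes Step Functions block until the container exits, so step outputs can feed the next step's inputs.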
Why Teams Choose This
- No Vendor Lock-in: Your training code runs on any system — local laptops, other cloud providers, on-premise hardware. Migrate away from AWS without rewriting everything.
- Existing DevOps: If your team already manages ECS clusters for microservices, adding ML is just another workload type. One deployment pattern for everything.
- Fine-Grained Control: You decide CPU/memory allocation per task, retry policies, timeout handling, logging structure. Debug failures without AWS obfuscation.
- Cost Predictability: Fargate charges per second of vCPU and memory actually consumed (with a one-minute minimum), plus storage. No hidden baselines. A training job that runs 1 hour costs exactly 1 hour's compute.
- Containerized Tooling: Integrate arbitrary tools — if you need CUDA, Spark, or niche ML libraries, package them in your image. No constraints from managed services.
When SageMaker Pipelines Win
- Notebook-Heavy Teams: Your data scientists code in Jupyter, build locally, then scale up. SageMaker is built for this workflow. Jumping to Docker would slow them down.
- Rapid Experimentation: You're exploring models constantly. SageMaker's built-in experiment tracking and quick deployment reduce friction.
- AWS-First Strategy: You've committed to AWS and value tight integration with CloudWatch, IAM, and other services. Vendor lock-in is acceptable.
- Small Models: If you're training lightweight models (XGBoost, scikit-learn) on modest datasets, infrastructure overhead isn't the bottleneck.
- No Deployment Sophistication: If you just need to call a model occasionally via API, SageMaker endpoints are simpler than managing your own serving infrastructure.
When Custom ECS Workflows Win
- Engineering-Heavy Teams: You have DevOps/SRE talent. Building infrastructure is faster than learning SageMaker APIs.
- Existing Container Workloads: You already run services on ECS or Kubernetes. Adding ML is just another workload type in your existing orchestration.
- Complex Data Pipelines: You need Spark, Airflow, or custom ETL. ECS lets you schedule anything. SageMaker's processing is limited.
- Multi-Cloud Future: You may move to GCP or on-premise. Custom containers are portable. SageMaker is not.
- Cost-Sensitive: You're training models daily or running inference at volume. ECS Fargate scales to zero and charges per second.
- Strict Compliance: You need complete visibility into compute environments. Custom containers with your own VPC/security is clearer than SageMaker's shared infrastructure.
A Practical Decision Framework
- Team Composition: Is your team mostly data scientists, or do you have infrastructure engineers? Data scientists lean SageMaker. Engineers lean Custom ECS.
- Time Pressure: Do you need models in production this sprint? SageMaker is faster. Shipping quality infrastructure that scales for 2 years? Custom ECS.
- Cost Model: Calculate your expected compute spend. A SageMaker endpoint's always-on baseline (often hundreds of dollars a month) makes sense if your team can't run inference infrastructure. ECS makes sense if you're deploying to existing clusters.
- Integration Needs: Do you have existing CI/CD pipelines, Docker build processes, Kubernetes clusters? Custom ECS fits naturally. Starting from scratch? SageMaker's opinionated defaults help.
- Debugging Tolerance: Can your team operate without diving into container logs and Step Functions state machines? SageMaker's abstraction is fine. Need to debug at the system level? Custom ECS required.
- Feature Lock-in: Are you comfortable using SageMaker Feature Store, Model Monitor, and Clarify? Their ecosystem locks you in but might be worth it if they solve problems. Generic S3 + custom tooling is more flexible.
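One way to operationalize this framework is a simple weighted checklist. The questions and weights below are an illustrative sketch, not a prescription:

```python
# Illustrative decision checklist: +1 leans toward SageMaker Pipelines,
# -1 leans toward custom ECS. Factors and weights are assumptions.
FACTORS = {
    "team is mostly data scientists": +1,
    "need models in production this sprint": +1,
    "committed to AWS long term": +1,
    "existing ECS/Kubernetes clusters": -1,
    "high-volume, cost-sensitive inference": -1,
    "multi-cloud or on-premise on the roadmap": -1,
}


def recommend(answers):
    """answers: the set of factor strings that are true for your team."""
    score = sum(w for factor, w in FACTORS.items() if factor in answers)
    if score > 0:
        return "SageMaker Pipelines"
    if score < 0:
        return "Custom ECS workflows"
    return "Prototype both"


print(recommend({"team is mostly data scientists",
                 "committed to AWS long term",
                 "existing ECS/Kubernetes clusters"}))
# net score +1, so this team leans SageMaker Pipelines
```

Treat a near-zero score as a genuine signal: it means the hybrid approach below, or a week of prototyping each, is worth the time.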
The Hybrid Approach
Many mature teams use both. SageMaker for rapid prototyping and experimentation by data scientists. Custom ECS workflows for production ML workloads that need cost control and portability. A data scientist trains locally, validates on SageMaker, then engineering wraps the final model in a container for production.
This requires discipline — you need clear boundaries. But it's often the pragmatic middle ground that doesn't force you to choose between velocity and control.
Conclusion
SageMaker Pipelines and custom ECS workflows solve the same problem at different abstraction levels. SageMaker trades control for velocity, perfect if your team is data scientist-heavy and AWS-committed. Custom ECS trades setup time for flexibility, transparency, and cost control — better for engineering-driven teams with existing infrastructure.
The wrong choice won't break you, but the right one will save months of refactoring. Evaluate your team's strengths, your cost sensitivity, and your multi-year roadmap. If you're undecided, prototype both for a week. The tooling that feels natural to your team is usually right.