System Architecture
This page describes the high-level system design, the separation of concerns between components, and how the MLOps tools integrate with each other.
High-Level Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ User / Browser │
└─────────────────────────────┬────────────────────────────────────────────┘
│ HTTPS (port 443 / nginx TLS)
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Frontend (scene3d-ui) │
│ React + Three.js · Vite · nginx reverse proxy │
│ port 5173 / 443 │
└─────────────────────────────┬────────────────────────────────────────────┘
│ HTTP REST /api/*
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ API Gateway (ray-serve) │
│ FastAPI · Ray Serve ingress · port 8000 │
│ Auth · Upload · Job Management · Drift · Prometheus metrics │
└────────────────┬─────────────────────────────────────────────────────────┘
│ Ray RPC (object store, zero-copy)
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ GPU Model Worker (ray-serve) │
│ MASt3R · ALIKED · SuperPoint · COLMAP SfM · pycolmap │
│ 1 GPU, 4 CPUs │
└──────────────────────────────────────────────────────────────────────────┘
┌────────────────┐ ┌──────────────┐ ┌─────────────────┐ ┌──────────┐
│ MLflow │ │ Airflow │ │ Prometheus │ │ Grafana │
│ port 5000 │ │ port 8080 │ │ port 9090 │ │ port 3001│
│ Experiment │ │ Orchestrate │ │ Metrics scrape │ │Dashboard │
│ tracking │ │ DVC + DAGs │ │ + alerts │ │ │
└────────────────┘ └──────────────┘ └─────────────────┘ └──────────┘
┌────────────────────────────────────────────────────────────────────────┐
│ Docker Network: mlops_net │
│ All services communicate by container hostname on this bridge network │
└────────────────────────────────────────────────────────────────────────┘
Frontend
The frontend is a React single-page application built with Vite and styled with Tailwind CSS. The 3D viewer is implemented with Three.js.
Responsibilities
Render the upload form and stage tracker
Poll GET /jobs/{job_id} every few seconds to update the UI
Render the interactive 3D point cloud via Three.js
Display drift warnings and reconstruction statistics
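The polling behaviour can be sketched as follows. This is an illustrative Python version of what the browser does with fetch(); the injected `fetch_status` callable and the terminal state names ("completed", "failed") are assumptions, not the project's actual schema.

```python
import time
from typing import Callable, Dict

def poll_job(job_id: str,
             fetch_status: Callable[[str], Dict],
             interval_s: float = 3.0,
             max_polls: int = 100) -> Dict:
    """Poll a job-status endpoint until the job reaches a terminal state.

    `fetch_status` stands in for an HTTP GET to /jobs/{job_id}; the real
    frontend issues that request from the browser instead.
    """
    for _ in range(max_polls):
        status = fetch_status(job_id)
        if status.get("state") in ("completed", "failed"):
            return status
        time.sleep(interval_s)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```
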
Build
The frontend is built into static files at Docker image build time (npm run build).
These are served by an nginx container which also acts as a TLS-terminating
reverse proxy, forwarding /api/* requests to the ray-serve backend.
API Gateway and GPU Worker (Ray Serve)
The backend is a single Docker container running two Ray Serve deployments:
APIGateway (CPU-only, 2 cores)
Wraps a FastAPI application via @serve.ingress
Handles all HTTP traffic: authentication, file uploads, job state, drift checks
Delegates all inference to the GPU worker over Ray’s object store (zero HTTP overhead)
Exposes Prometheus metrics at GET /metrics
GPUModelWorker (1 GPU, 4 cores)
Loads the entire MASt3R pipeline once at container startup
Exposes reconstruct() and ping() methods callable via Ray RPC
Never reloads models between requests; weights stay permanently in VRAM
Handles the full pipeline: extraction → matching → COLMAP → PLY export
The two deployments communicate via Ray’s distributed object store, which avoids serialising large tensors over HTTP and enables near-zero-copy data transfer.
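The reference-passing pattern behind this can be illustrated with a toy in-process object store. This is not Ray's API: Ray additionally shares memory across processes via Plasma-style shared buffers, which this sketch does not attempt; it only shows how passing a small reference avoids copying the payload itself.

```python
import uuid

class ObjectStoreSketch:
    """Toy stand-in for a distributed object store.

    put() returns a small opaque reference; get() hands back the stored
    object itself, with no copy made. Illustrative only.
    """

    def __init__(self):
        self._store = {}

    def put(self, obj) -> str:
        ref = uuid.uuid4().hex  # a small handle, cheap to pass around
        self._store[ref] = obj
        return ref

    def get(self, ref: str):
        return self._store[ref]  # same object, not a copy
```
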
Concurrency
A semaphore limits the API gateway to MAX_CONCURRENT_JOBS = 1 running job at
a time, preventing GPU memory exhaustion.
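A minimal asyncio sketch of this guard, with a dummy workload standing in for the GPU call (function and variable names here are illustrative, not the gateway's actual code):

```python
import asyncio

async def gateway(jobs, max_concurrent=1):
    """Run jobs through a semaphore so at most `max_concurrent` of them
    are "on the GPU" at once. Tracks peak concurrency to demonstrate
    that the limit holds."""
    sem = asyncio.Semaphore(max_concurrent)
    stats = {"now": 0, "peak": 0}

    async def run(job):
        async with sem:
            stats["now"] += 1
            stats["peak"] = max(stats["peak"], stats["now"])
            await asyncio.sleep(0)  # stand-in for the reconstruction call
            stats["now"] -= 1
            return job

    results = await asyncio.gather(*(run(j) for j in jobs))
    return results, stats["peak"]
```

With max_concurrent=1, later requests queue on the semaphore rather than failing, which is why a burst of uploads cannot exhaust GPU memory.
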
Offline MLOps Pipeline
DVC defines the pipeline as a DAG of stages in dvc.yaml. Each stage
specifies its command, input dependencies (deps), and outputs (outs).
DVC tracks content hashes so only changed stages re-run.
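The skip logic reduces to "hash the dependencies, compare against the hash recorded in the lock file". A minimal sketch of that decision, noting that real DVC hashes files on disk (MD5 by default) while this illustration hashes in-memory bytes with SHA-256:

```python
import hashlib

def stage_needs_rerun(dep_contents, recorded_hash):
    """Return (needs_rerun, current_hash) for a pipeline stage.

    Mimics DVC's dependency check: hash every input, and re-run only if
    the combined hash differs from the one recorded after the last run.
    """
    h = hashlib.sha256()
    for content in dep_contents:
        h.update(content)
    current = h.hexdigest()
    return current != recorded_hash, current
```
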
MLflow is used for experiment tracking. Every dvc repro run creates a
parent MLflow run, with child runs logged by each script stage. Logged entities
include:
All pipeline configuration parameters (flattened from YAML)
Stage-level metrics (registration rate, mAA, etc.)
Artifacts (eval CSV, PLY file, config YAML, Git status)
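Flattening the nested YAML into dotted parameter keys is a small recursive transform; a sketch, with hypothetical key names rather than the project's actual config schema:

```python
def flatten_params(cfg, prefix=""):
    """Flatten a nested config dict into dotted keys suitable for
    mlflow.log_params, e.g. {"matcher": {"top_k": 5}} becomes
    {"matcher.top_k": 5}."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_params(value, name))
        else:
            flat[name] = value
    return flat
```
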
Airflow orchestrates the full DVC pipeline via the experiment_pipeline_dag
DAG, which:
Waits for required data files via FileSensor tasks.
Runs dvc repro inside an ephemeral Docker container (Docker-out-of-Docker).
Calls select_best_run.py to promote the best MLflow run to production.
Sends an email notification on success.
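At its core, the promotion step is "pick the run with the best metric". A sketch of that selection, assuming a plain list-of-dicts shape rather than MLflow's actual client API:

```python
def select_best_run(runs, metric="mAA"):
    """Return the run_id of the run with the highest value of `metric`.

    `runs` is a list of {"run_id": str, "metrics": dict}; runs missing
    the metric are treated as worst-possible. Illustrative only: the
    real script queries the MLflow tracking server.
    """
    best = max(runs, key=lambda r: r["metrics"].get(metric, float("-inf")))
    return best["run_id"]
```
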
Monitoring Stack
Prometheus scrapes metrics from:
The Ray Serve API gateway (/metrics, every 10 seconds)
MLflow health endpoint (every 30 seconds)
Airflow (every 15 seconds)
Node Exporter (host-level hardware metrics)
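Each scraped endpoint serves the Prometheus text exposition format. The real services generate this via a Prometheus client library; a hand-rolled renderer for counters, with a hypothetical metric name, looks like:

```python
def render_prometheus_metrics(counters):
    """Render counters in a minimal Prometheus text exposition format:
    a `# TYPE` line followed by `name value` for each metric."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```
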
Grafana provides dashboards powered by Prometheus data, including:
Reconstruction job throughput and latency
Registration rate over time
GPU utilisation (via node exporter)
Drift metric trends
Alertmanager receives alerts defined in monitoring/alert_rules.yml.
Relevant alert names include FeatureDriftDetected, InputBrightnessDrift,
InputContrastDrift, and PerformanceDecay. On firing, Alertmanager calls
an Airflow webhook to trigger experiment_pipeline_dag automatically.
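A drift check of the kind these alerts encode can be as simple as a sigma-threshold test of an input statistic (e.g. mean brightness) against the EDA baseline; the statistic and the 3-sigma threshold below are illustrative assumptions, not the project's actual rule:

```python
def check_drift(value, baseline_mean, baseline_std, n_sigma=3.0):
    """Return True if `value` lies more than n_sigma baseline standard
    deviations away from the baseline mean (a simple drift flag)."""
    if baseline_std == 0:
        return value != baseline_mean
    return abs(value - baseline_mean) > n_sigma * baseline_std
```
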
Data Flow
Offline (training/evaluation)
data/train/ (images)
└─► validate → eda_baselines → preprocess → prepare
└─► run_pipeline (MASt3R + COLMAP)
└─► evaluate (mAA → MLflow)
└─► select_best_run
└─► conf/best_config.yaml
Online (inference)
User ZIP upload
└─► Drift check (vs EDA baselines)
└─► GPUModelWorker.reconstruct()
├─► MASt3R matching
├─► COLMAP SfM
└─► PLY export + decimation
└─► /app/results/ → download by user
Networking
All services run on a single Docker bridge network named mlops_net.
Service hostnames (e.g., mlflow, airflow-apiserver, prometheus)
resolve directly within this network. This allows:
Airflow DockerOperator containers to reach mlflow:5000
Alertmanager to call airflow-apiserver:8080
Ray Serve to reach mlflow:5000 for metric logging
The network name is pinned in docker-compose.yaml:
networks:
default:
name: mlops_net
Security Boundaries
External traffic enters only through nginx (port 443) and the Ray Serve API (port 8000).
JWT tokens expire after 15 minutes and are signed with a secret loaded from Docker Secrets.
Database credentials are stored as Docker Secrets, not plaintext environment variables.
CI/CD includes Trivy image scanning and pip-audit dependency auditing on every push (see .github/workflows/security.yml).
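The token lifecycle (sign with a secret, reject after 15 minutes) can be sketched with the standard library. A real deployment would use a proper JWT library; the claim names and the HMAC-over-base64url encoding here are assumptions for illustration, not the service's actual token format:

```python
import base64
import hashlib
import hmac
import json
import time

def sign_token(payload, secret, ttl_s=15 * 60):
    """Sign a payload with an `exp` claim ttl_s seconds in the future.
    Token shape: base64url(JSON body) + "." + hex HMAC-SHA256 signature."""
    body = dict(payload, exp=int(time.time()) + ttl_s)
    raw = base64.urlsafe_b64encode(json.dumps(body).encode())
    sig = hmac.new(secret, raw, hashlib.sha256).hexdigest()
    return raw.decode() + "." + sig

def verify_token(token, secret, now=None):
    """Return the claims dict, or None if the signature is invalid or
    the token has expired."""
    raw, _, sig = token.partition(".")
    expected = hmac.new(secret, raw.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered, or signed with a different secret
    body = json.loads(base64.urlsafe_b64decode(raw.encode()))
    if (now if now is not None else time.time()) >= body["exp"]:
        return None  # past the 15-minute lifetime
    return body
```
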
For full security documentation see Security & Compliance.