System Architecture

This page describes the high-level system design, the separation of concerns between components, and how the MLOps tools integrate with each other.


High-Level Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                         User / Browser                                   │
└─────────────────────────────┬────────────────────────────────────────────┘
                              │  HTTPS (port 443 / nginx TLS)
                              ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                         Frontend (scene3d-ui)                            │
│           React + Three.js · Vite · nginx reverse proxy                  │
│                         port 5173 / 443                                  │
└─────────────────────────────┬────────────────────────────────────────────┘
                              │  HTTP REST  /api/*
                              ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                    API Gateway (ray-serve)                               │
│              FastAPI · Ray Serve ingress · port 8000                     │
│   Auth · Upload · Job Management · Drift · Prometheus metrics            │
└────────────────┬─────────────────────────────────────────────────────────┘
                 │  Ray RPC (object store, zero-copy)
                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│                   GPU Model Worker (ray-serve)                           │
│      MASt3R · ALIKED · SuperPoint · COLMAP SfM · pycolmap                │
│                      1 GPU, 4 CPUs                                       │
└──────────────────────────────────────────────────────────────────────────┘

┌────────────────┐   ┌──────────────┐   ┌─────────────────┐   ┌──────────┐
│    MLflow      │   │   Airflow    │   │   Prometheus    │   │  Grafana │
│  port 5000     │   │  port 8080   │   │   port 9090     │   │ port 3001│
│  Experiment    │   │  Orchestrate │   │  Metrics scrape │   │Dashboard │
│  tracking      │   │  DVC + DAGs  │   │  + alerts       │   │          │
└────────────────┘   └──────────────┘   └─────────────────┘   └──────────┘

┌────────────────────────────────────────────────────────────────────────┐
│                  Docker Network: mlops_net                             │
│  All services communicate by container hostname on this bridge network │
└────────────────────────────────────────────────────────────────────────┘

Frontend

The frontend is a React single-page application built with Vite and styled with Tailwind CSS. The 3D viewer is implemented with Three.js.

Responsibilities

  • Render the upload form and stage tracker

  • Poll GET /jobs/{job_id} every few seconds to update the UI

  • Render the interactive 3D point cloud via Three.js

  • Display drift warnings and reconstruction statistics

Build

The frontend is compiled to static assets at Docker image build time (npm run build). These assets are served by an nginx container that also acts as a TLS-terminating reverse proxy, forwarding /api/* requests to the ray-serve backend.


API Gateway and GPU Worker (Ray Serve)

The backend is a single Docker container running two Ray Serve deployments:

APIGateway (CPU-only, 2 cores)

  • Wraps a FastAPI application via @serve.ingress

  • Handles all HTTP traffic: authentication, file uploads, job state, drift checks

  • Delegates all inference to the GPU worker over Ray’s object store (zero HTTP overhead)

  • Exposes Prometheus metrics at GET /metrics

GPUModelWorker (1 GPU, 4 cores)

  • Loads the entire MASt3R pipeline once at container startup

  • Exposes reconstruct() and ping() methods callable via Ray RPC

  • Never reloads models between requests; weights stay resident in VRAM for the container's lifetime

  • Handles the full pipeline: extraction → matching → COLMAP → PLY export

The two deployments communicate via Ray’s distributed object store, which avoids serialising large tensors over HTTP and enables near-zero-copy data transfer.
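
The shape of this arrangement is sketched below: two deployments bound together, with the gateway holding a handle to the worker. Class names follow the description above, but the route path, method signatures, and return values are placeholders rather than the project's actual code, and the model-loading and pipeline internals are elided.

from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
class GPUModelWorker:
    def __init__(self):
        # The real worker loads MASt3R / ALIKED / SuperPoint / pycolmap once here;
        # weights then stay resident in VRAM for the container's lifetime (elided).
        pass

    def ping(self) -> str:
        return "ok"

    def reconstruct(self, job_id: str, image_dir: str) -> dict:
        # Real implementation: extraction -> matching -> COLMAP SfM -> PLY export (elided).
        return {"job_id": job_id, "status": "done"}

@serve.deployment(ray_actor_options={"num_cpus": 2})
@serve.ingress(app)
class APIGateway:
    def __init__(self, worker_handle):
        # DeploymentHandle: method calls travel over Ray's object store, not HTTP.
        self.worker = worker_handle

    @app.post("/api/jobs")
    async def create_job(self, job_id: str, image_dir: str) -> dict:
        result = await self.worker.reconstruct.remote(job_id, image_dir)
        return {"job_id": job_id, "result": result}

serve.run(APIGateway.bind(GPUModelWorker.bind()))  # Serve's HTTP proxy listens on port 8000 by default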

Concurrency

A semaphore limits the API gateway to MAX_CONCURRENT_JOBS = 1 running job at a time, preventing GPU memory exhaustion.
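
A minimal sketch of that guard, assuming an asyncio-based gateway (the handle call mirrors the Ray Serve sketch above; the function name is illustrative):

import asyncio

MAX_CONCURRENT_JOBS = 1
job_semaphore = asyncio.Semaphore(MAX_CONCURRENT_JOBS)

async def run_job(worker_handle, job_id: str, image_dir: str) -> dict:
    # Additional requests queue here instead of competing for GPU memory.
    async with job_semaphore:
        return await worker_handle.reconstruct.remote(job_id, image_dir)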


Offline MLOps Pipeline

DVC defines the pipeline as a DAG of stages in dvc.yaml. Each stage specifies its command, input dependencies (deps), and outputs (outs). DVC tracks content hashes so only changed stages re-run.
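
A representative stage entry, purely as illustration (the stage name, script, and paths are placeholders, not the project's actual dvc.yaml contents):

stages:
  preprocess:
    cmd: python src/preprocess.py          # placeholder command
    deps:
      - src/preprocess.py
      - data/train/
    outs:
      - data/processed/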

MLflow is used for experiment tracking. Every dvc repro run creates a parent MLflow run, with child runs logged by each script stage. Logged entities include:

  • All pipeline configuration parameters (flattened from YAML)

  • Stage-level metrics (registration rate, mAA, etc.)

  • Artifacts (eval CSV, PLY file, config YAML, Git status)
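
The nesting can be expressed roughly as follows (run names, keys, and values are placeholders; the real stages log the entities listed above):

import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")  # hostname resolves on the mlops_net network

# One parent run per `dvc repro`, one nested child run per script stage.
with mlflow.start_run(run_name="dvc_repro"):
    with mlflow.start_run(run_name="evaluate", nested=True):
        mlflow.log_params({"matching.model": "mast3r"})   # flattened YAML config (placeholder key)
        mlflow.log_metric("registration_rate", 0.93)      # placeholder value
        mlflow.log_artifact("outputs/eval.csv")           # placeholder path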

Airflow orchestrates the full DVC pipeline via the experiment_pipeline_dag DAG, which:

  1. Waits for required data files via FileSensor tasks.

  2. Runs dvc repro inside an ephemeral Docker container (Docker-out-of-Docker).

  3. Calls select_best_run.py to promote the best MLflow run to production.

  4. Sends an email notification on success.
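
A condensed sketch of such a DAG is shown below; the task IDs, image name, file paths, and email address are placeholders, and the imports assume Airflow 2.x with the Docker provider installed:

from datetime import datetime

from airflow import DAG
from airflow.operators.email import EmailOperator
from airflow.providers.docker.operators.docker import DockerOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="experiment_pipeline_dag",
    start_date=datetime(2024, 1, 1),
    schedule=None,          # triggered manually or by the Alertmanager webhook
    catchup=False,
) as dag:
    wait_for_data = FileSensor(
        task_id="wait_for_data",
        filepath="/opt/airflow/data/train",          # placeholder path
        poke_interval=60,
    )
    run_dvc = DockerOperator(
        task_id="dvc_repro",
        image="scene3d-pipeline:latest",             # placeholder image name
        command="dvc repro",
        network_mode="mlops_net",                    # so the container can reach mlflow:5000
    )
    select_best = DockerOperator(
        task_id="select_best_run",
        image="scene3d-pipeline:latest",             # placeholder image name
        command="python select_best_run.py",
        network_mode="mlops_net",
    )
    notify = EmailOperator(
        task_id="notify_success",
        to="team@example.com",                       # placeholder address
        subject="experiment_pipeline_dag succeeded",
        html_content="DVC pipeline finished and best run promoted.",
    )

    wait_for_data >> run_dvc >> select_best >> notify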


Monitoring Stack

Prometheus scrapes metrics from:

  • The Ray Serve API gateway (/metrics, every 10 seconds)

  • MLflow health endpoint (every 30 seconds)

  • Airflow (every 15 seconds)

  • Node Exporter (host-level hardware metrics)
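
In prometheus.yml terms these map to scrape jobs roughly like the following (job names, the gateway hostname, and the node-exporter target are assumptions; only the intervals and the ports listed above come from this page):

scrape_configs:
  - job_name: ray-serve-gateway
    scrape_interval: 10s
    metrics_path: /metrics
    static_configs:
      - targets: ["ray-serve:8000"]        # gateway hostname assumed

  - job_name: mlflow
    scrape_interval: 30s
    metrics_path: /health                  # health probe rather than a full metrics endpoint
    static_configs:
      - targets: ["mlflow:5000"]

  - job_name: airflow
    scrape_interval: 15s
    static_configs:
      - targets: ["airflow-apiserver:8080"]

  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]    # node_exporter default port; hostname assumed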

Grafana provides dashboards powered by Prometheus data, including:

  • Reconstruction job throughput and latency

  • Registration rate over time

  • GPU utilisation (via node exporter)

  • Drift metric trends

Alertmanager receives alerts defined in monitoring/alert_rules.yml. Relevant alert names include FeatureDriftDetected, InputBrightnessDrift, InputContrastDrift, and PerformanceDecay. On firing, Alertmanager calls an Airflow webhook to trigger experiment_pipeline_dag automatically.
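
As an illustration of the shape of those rules (the expression, threshold, and metric name below are placeholders, not the real values from monitoring/alert_rules.yml):

groups:
  - name: drift
    rules:
      - alert: FeatureDriftDetected
        expr: feature_drift_score > 0.3    # placeholder metric and threshold
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Feature drift detected on recent uploads"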


Data Flow

Offline (training/evaluation)

data/train/ (images)
  └─► validate → eda_baselines → preprocess → prepare
                                                  └─► run_pipeline (MASt3R + COLMAP)
                                                             └─► evaluate (mAA → MLflow)
                                                                      └─► select_best_run
                                                                               └─► conf/best_config.yaml

Online (inference)

User ZIP upload
  └─► Drift check (vs EDA baselines)
        └─► GPUModelWorker.reconstruct()
              ├─► MASt3R matching
              ├─► COLMAP SfM
              └─► PLY export + decimation
                    └─► /app/results/ → download by user

Networking

All services run on a single Docker bridge network named mlops_net. Service hostnames (e.g., mlflow, airflow-apiserver, prometheus) resolve directly within this network. This allows:

  • Airflow DockerOperator containers to reach mlflow:5000

  • Alertmanager to call airflow-apiserver:8080

  • Ray Serve to reach mlflow:5000 for metric logging

The network name is pinned in docker-compose.yaml:

networks:
  default:
    name: mlops_net

Security Boundaries

  • External traffic enters only through nginx (port 443) and the Ray Serve API (port 8000).

  • JWT tokens expire after 15 minutes and are signed with a secret loaded from Docker Secrets (see the sketch after this list).

  • Database credentials are stored as Docker Secrets, not plaintext environment variables.

  • CI/CD includes Trivy image scanning and pip-audit dependency auditing on every push (see .github/workflows/security.yml).
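
A minimal sketch of the token handling described in the second bullet, assuming PyJWT and a secret file named /run/secrets/jwt_secret (both assumptions; the claims are placeholders):

from datetime import datetime, timedelta, timezone

import jwt  # PyJWT (assumed library)

# Docker Secrets are mounted as files under /run/secrets/; the file name is a placeholder.
with open("/run/secrets/jwt_secret") as f:
    JWT_SECRET = f.read().strip()

def issue_token(username: str) -> str:
    payload = {
        "sub": username,
        "exp": datetime.now(timezone.utc) + timedelta(minutes=15),  # 15-minute expiry
    }
    return jwt.encode(payload, JWT_SECRET, algorithm="HS256")

def verify_token(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on expired or tampered tokens.
    return jwt.decode(token, JWT_SECRET, algorithms=["HS256"])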

For full security documentation see Security & Compliance.