System Architecture
===================

This page describes the high-level system design, the separation of concerns
between components, and how the MLOps tools integrate with each other.

----

High-Level Architecture
------------------------

.. code-block:: text

   ┌──────────────────────────────────────────────────────────────────────────┐
   │                              User / Browser                              │
   └─────────────────────────────┬────────────────────────────────────────────┘
                                 │ HTTPS (port 443 / nginx TLS)
                                 ▼
   ┌──────────────────────────────────────────────────────────────────────────┐
   │                          Frontend (scene3d-ui)                           │
   │              React + Three.js · Vite · nginx reverse proxy               │
   │                             port 5173 / 443                              │
   └─────────────────────────────┬────────────────────────────────────────────┘
                                 │ HTTP REST /api/*
                                 ▼
   ┌──────────────────────────────────────────────────────────────────────────┐
   │                         API Gateway (ray-serve)                          │
   │                 FastAPI · Ray Serve ingress · port 8000                  │
   │       Auth · Upload · Job Management · Drift · Prometheus metrics        │
   └─────────────────────────────┬────────────────────────────────────────────┘
                                 │ Ray RPC (object store, zero-copy)
                                 ▼
   ┌──────────────────────────────────────────────────────────────────────────┐
   │                       GPU Model Worker (ray-serve)                       │
   │           MASt3R · ALIKED · SuperPoint · COLMAP SfM · pycolmap           │
   │                              1 GPU, 4 CPUs                               │
   └──────────────────────────────────────────────────────────────────────────┘

   ┌────────────────┐  ┌──────────────┐  ┌─────────────────┐  ┌──────────┐
   │     MLflow     │  │   Airflow    │  │   Prometheus    │  │ Grafana  │
   │   port 5000    │  │  port 8080   │  │    port 9090    │  │ port 3001│
   │   Experiment   │  │ Orchestrate  │  │ Metrics scrape  │  │Dashboard │
   │    tracking    │  │  DVC + DAGs  │  │    + alerts     │  │          │
   └────────────────┘  └──────────────┘  └─────────────────┘  └──────────┘

   ┌──────────────────────────────────────────────────────────────────────────┐
   │                        Docker Network: mlops_net                         │
   │  All services communicate by container hostname on this bridge network   │
   └──────────────────────────────────────────────────────────────────────────┘

----

Frontend
---------

The frontend is a **React** single-page application built with **Vite** and
styled with **Tailwind CSS**. The 3D viewer is implemented with **Three.js**.

**Responsibilities**

- Render the upload form and stage tracker
- Poll ``GET /jobs/{job_id}`` every few seconds to update the UI
- Render the interactive 3D point cloud via Three.js
- Display drift warnings and reconstruction statistics

**Build**

The frontend is built into static files at Docker image build time
(``npm run build``). These are served by an **nginx** container which also
acts as a TLS-terminating reverse proxy, forwarding ``/api/*`` requests to the
``ray-serve`` backend.
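The status contract the UI polls is plain JSON over ``GET /jobs/{job_id}``, so
it can be exercised from any HTTP client. The sketch below shows the same
polling loop in Python rather than React; the ``status`` field name and its
terminal values are assumptions, and authentication headers are omitted.

.. code-block:: python

   # Illustrative sketch of the job-status polling contract the UI relies on.
   # The "status" field and its terminal values are assumptions.
   import time

   import requests

   # Direct Ray Serve port; in production the UI goes through nginx at /api/*.
   API_BASE = "http://localhost:8000"


   def wait_for_job(job_id: str, interval_s: float = 3.0) -> dict:
       """Poll the job endpoint every few seconds until it reaches a terminal state."""
       while True:
           job = requests.get(f"{API_BASE}/jobs/{job_id}", timeout=10).json()
           if job.get("status") in {"completed", "failed"}:  # assumed values
               return job
           time.sleep(interval_s)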
----

API Gateway and GPU Worker (Ray Serve)
---------------------------------------

The backend is a single Docker container running two **Ray Serve** deployments:

**APIGateway** (CPU-only, 2 cores)

- Wraps a FastAPI application via ``@serve.ingress``
- Handles all HTTP traffic: authentication, file uploads, job state, drift checks
- Delegates all inference to the GPU worker over Ray's object store (zero HTTP
  overhead)
- Exposes Prometheus metrics at ``GET /metrics``

**GPUModelWorker** (1 GPU, 4 cores)

- Loads the entire MASt3R pipeline once at container startup
- Exposes ``reconstruct()`` and ``ping()`` methods callable via Ray RPC
- Never reloads models between requests; weights stay permanently in VRAM
- Handles the full pipeline: extraction → matching → COLMAP → PLY export

The two deployments communicate via **Ray's distributed object store**, which
avoids serialising large tensors over HTTP and enables near-zero-copy data
transfer.

**Concurrency**

A semaphore limits the API gateway to ``MAX_CONCURRENT_JOBS = 1`` running job
at a time, preventing GPU memory exhaustion.

----

Offline MLOps Pipeline
-----------------------

**DVC** defines the pipeline as a DAG of stages in ``dvc.yaml``. Each stage
specifies its command, input dependencies (``deps``), and outputs (``outs``).
DVC tracks content hashes so only changed stages re-run.

**MLflow** is used for experiment tracking. Every ``dvc repro`` run creates a
parent MLflow run, with child runs logged by each script stage. Logged
entities include:

- All pipeline configuration parameters (flattened from YAML)
- Stage-level metrics (registration rate, mAA, etc.)
- Artifacts (eval CSV, PLY file, config YAML, Git status)

**Airflow** orchestrates the full DVC pipeline via the
``experiment_pipeline_dag`` DAG, which:

1. Waits for required data files via ``FileSensor`` tasks.
2. Runs ``dvc repro`` inside an ephemeral Docker container
   (Docker-out-of-Docker).
3. Calls ``select_best_run.py`` to promote the best MLflow run to production.
4. Sends an email notification on success.

----

Monitoring Stack
-----------------

**Prometheus** scrapes metrics from:

- The Ray Serve API gateway (``/metrics``, every 10 seconds)
- The MLflow health endpoint (every 30 seconds)
- Airflow (every 15 seconds)
- Node Exporter (host-level hardware metrics)

**Grafana** provides dashboards powered by Prometheus data, including:

- Reconstruction job throughput and latency
- Registration rate over time
- GPU utilisation (via Node Exporter)
- Drift metric trends

**Alertmanager** receives alerts defined in ``monitoring/alert_rules.yml``.
Relevant alert names include ``FeatureDriftDetected``, ``InputBrightnessDrift``,
``InputContrastDrift``, and ``PerformanceDecay``. On firing, Alertmanager calls
an Airflow webhook to trigger ``experiment_pipeline_dag`` automatically.

----

Data Flow
----------

**Offline (training/evaluation)**

.. code-block:: text

   data/train/ (images)
     └─► validate → eda_baselines → preprocess → prepare
           └─► run_pipeline (MASt3R + COLMAP)
                 └─► evaluate (mAA → MLflow)
                       └─► select_best_run
                             └─► conf/best_config.yaml

**Online (inference)**

.. code-block:: text

   User ZIP upload
     └─► Drift check (vs EDA baselines)
           └─► GPUModelWorker.reconstruct()
                 ├─► MASt3R matching
                 ├─► COLMAP SfM
                 └─► PLY export + decimation
                       └─► /app/results/ → download by user
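The online hand-off above is the gateway-to-worker call described in the Ray
Serve section. The snippet below is a minimal sketch of that layout, not the
project's actual code: a CPU-only FastAPI ingress that forwards work to the
GPU worker through a deployment handle, with a semaphore enforcing the
single-job limit. Route paths, method signatures, and the placeholder result
path are illustrative assumptions.

.. code-block:: python

   # Minimal sketch of the gateway/worker split (not the project's real code).
   # Calls made through the deployment handle travel via Ray's object store
   # rather than HTTP. Signatures and paths here are illustrative assumptions.
   import asyncio

   from fastapi import FastAPI
   from ray import serve

   app = FastAPI()

   MAX_CONCURRENT_JOBS = 1


   @serve.deployment(ray_actor_options={"num_gpus": 1, "num_cpus": 4})
   class GPUModelWorker:
       def __init__(self) -> None:
           # The real worker loads MASt3R / ALIKED / SuperPoint / COLMAP once
           # here, so weights stay in VRAM for the lifetime of the container.
           pass

       async def ping(self) -> str:
           return "ok"

       async def reconstruct(self, job_id: str, image_dir: str) -> str:
           # Extraction → matching → COLMAP SfM → PLY export would run here.
           return f"/app/results/{job_id}/points.ply"  # placeholder result path


   @serve.deployment(ray_actor_options={"num_cpus": 2})
   @serve.ingress(app)
   class APIGateway:
       def __init__(self, gpu_worker) -> None:
           self.gpu_worker = gpu_worker  # DeploymentHandle to the GPU worker
           self.jobs = asyncio.Semaphore(MAX_CONCURRENT_JOBS)

       @app.post("/jobs/{job_id}/reconstruct")
       async def reconstruct(self, job_id: str) -> dict:
           async with self.jobs:  # at most one reconstruction on the GPU
               ply_path = await self.gpu_worker.reconstruct.remote(
                   job_id, f"/app/uploads/{job_id}"
               )
           return {"job_id": job_id, "result": ply_path}


   # Bind the worker into the gateway; `serve run module:entrypoint` runs both.
   entrypoint = APIGateway.bind(GPUModelWorker.bind())

The handle call is what the "Ray RPC (object store, zero-copy)" arrow in the
architecture diagram refers to: arguments and results move through Ray's
object store instead of being serialised over HTTP.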
----

Networking
-----------

All services run on a single Docker bridge network named ``mlops_net``.
Service hostnames (e.g., ``mlflow``, ``airflow-apiserver``, ``prometheus``)
resolve directly within this network.

This allows:

- Airflow DockerOperator containers to reach ``mlflow:5000``
- Alertmanager to call ``airflow-apiserver:8080``
- Ray Serve to reach ``mlflow:5000`` for metric logging

The network name is pinned in ``docker-compose.yaml``:

.. code-block:: yaml

   networks:
     default:
       name: mlops_net

----

Security Boundaries
--------------------

- **External traffic** enters only through nginx (port 443) and the Ray Serve
  API (port 8000).
- **JWT tokens** expire after 15 minutes and are signed with a secret loaded
  from Docker Secrets.
- **Database credentials** are stored as Docker Secrets, not plaintext
  environment variables.
- **CI/CD** includes Trivy image scanning and ``pip-audit`` dependency auditing
  on every push (see ``.github/workflows/security.yml``).

For full security documentation see :doc:`security`.
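As a closing illustration of the JWT and Docker Secrets points above, the
sketch below loads a signing key from a mounted secret and issues a token that
expires after 15 minutes. The secret name (``jwt_secret``) and the use of
**PyJWT** are assumptions, not a description of the service's actual
implementation.

.. code-block:: python

   # Illustrative only: issue a short-lived JWT signed with a key read from a
   # Docker Secret. The secret name and the PyJWT dependency are assumptions.
   from datetime import datetime, timedelta, timezone
   from pathlib import Path

   import jwt  # PyJWT

   # Docker Secrets are mounted read-only under /run/secrets/<name>.
   JWT_SECRET = Path("/run/secrets/jwt_secret").read_text().strip()


   def issue_token(subject: str) -> str:
       """Create a token that expires 15 minutes after issue."""
       now = datetime.now(timezone.utc)
       claims = {"sub": subject, "iat": now, "exp": now + timedelta(minutes=15)}
       return jwt.encode(claims, JWT_SECRET, algorithm="HS256")

If PyJWT is indeed used, verification on the API side would call ``jwt.decode``
with the same secret, which rejects expired tokens by default.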