Pipeline Documentation
======================

The reconstruction pipeline transforms an unordered collection of images into a
sparse 3D point cloud and a set of camera poses. This page explains each stage
of both the **offline DVC training pipeline** and the **online inference
pipeline**.

----

Pipeline Overview
-----------------

.. code-block:: text

   ┌──────────────┐    ┌───────────────┐    ┌─────────────┐    ┌────────────────┐
   │   validate   │ →  │ eda_baselines │ →  │  preprocess │ →  │    prepare     │
   │  (data QC)   │    │ (EDA + stats) │    │  (images)   │    │ (input CSV)    │
   └──────────────┘    └───────────────┘    └─────────────┘    └────────────────┘
                                                                        │
                                                                        ▼
                                                               ┌────────────────┐
                                                               │  run_pipeline  │
                                                               │  (MASt3R +     │
                                                               │   COLMAP SfM)  │
                                                               └────────────────┘
                                                                        │
                                                                        ▼
                                                               ┌────────────────┐
                                                               │    evaluate    │
                                                               │  (mAA + MLflow)│
                                                               └────────────────┘

Each stage is defined in ``dvc.yaml`` and tracked by DVC. Metrics and artifacts
from each run are logged to MLflow.

----

Stage 1 — Data Validation
--------------------------

**DVC stage**: ``validate``

**Script**: ``scripts/validate_data.py``

**What it does**

Reads ``data/train_labels.csv`` and verifies that every image listed in the CSV
exists on disk under ``data/train/``. It reports:

- Total rows in the labels file
- Number of distinct images and scenes
- Missing files
- Duplicate image entries
- Malformed rotation matrices or translation vectors
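
The last check in the list can be sketched as follows. This is a minimal
illustration, not the code in ``scripts/validate_data.py``: it assumes an
IMC-style encoding in which a rotation matrix is stored as nine
semicolon-separated floats in row-major order and a translation as three. The
function names and the tolerance are hypothetical.

.. code-block:: python

   import numpy as np


   def parse_floats(field, expected):
       """Parse a semicolon-separated field; return None if it is malformed."""
       try:
           values = np.array([float(v) for v in field.split(";")])
       except (AttributeError, ValueError):
           return None
       if values.size != expected or not np.all(np.isfinite(values)):
           return None
       return values


   def is_valid_rotation(field, tol=1e-3):
       """True if the field encodes a proper rotation (orthonormal, det = +1)."""
       values = parse_floats(field, 9)
       if values is None:
           return False
       R = values.reshape(3, 3)
       orthonormal = np.allclose(R @ R.T, np.eye(3), atol=tol)
       return bool(orthonormal and abs(np.linalg.det(R) - 1.0) < tol)


   def is_valid_translation(field):
       """True if the field encodes three finite floats."""
       return parse_floats(field, 3) is not None

Checking the determinant as well as orthonormality rejects reflections, which
pass the ``R @ R.T`` test but are not valid camera rotations.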

**Outputs**

- ``data/validation/validation_report.json`` — full issue report
- ``data/validation/validation_metrics.json`` — DVC metric file with ``issue_count``
  and ``status_code``

**Acceptance threshold**

A ``status_code`` of ``0`` means all files are present and valid. A value of ``1``
means warnings exist (e.g., missing files) but the pipeline can continue.
A value of ``2`` indicates a critical error that halts downstream stages.
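
A downstream script could gate on the metric file along these lines. The helper
name and the choice to raise an exception are assumptions for illustration;
only the two metric keys and the meaning of the three codes come from the
description above.

.. code-block:: python

   import json
   from pathlib import Path


   def check_validation_status(metrics_path):
       """Return the status code, raising if validation reported a critical error."""
       metrics = json.loads(Path(metrics_path).read_text())
       status = int(metrics["status_code"])
       issues = metrics.get("issue_count", "?")
       if status == 2:
           raise RuntimeError(f"Validation failed with {issues} issue(s); halting.")
       if status == 1:
           print(f"Validation warnings: {issues} issue(s); continuing.")
       return status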

----

Stage 2 — Exploratory Data Analysis and Baselines
---------------------------------------------------

**DVC stage**: ``eda_baselines``

**Script**: ``scripts/eda_baselines.py``

**What it does**

Computes image statistics across the training dataset to establish the **drift
baseline**. These baselines are later used by the drift monitor to detect when
production images differ from training data. Statistics computed include:

- Image resolution distribution (width, height histograms)
- Pairwise image similarity matrix (using global descriptors)
- Sharpness distribution (Laplacian variance)
- Brightness and contrast statistics
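
The sharpness, brightness, and contrast statistics can be computed per image
roughly as follows. This sketch uses plain NumPy with a 4-neighbour Laplacian so
that it is self-contained; the actual script may use a different kernel or
library, and ``image_stats`` is a hypothetical name. Brightness is taken as the
mean grey level and contrast as its standard deviation, which is a common
convention but an assumption here.

.. code-block:: python

   import numpy as np


   def image_stats(gray):
       """Compute per-image statistics from a 2D greyscale array."""
       g = np.asarray(gray, dtype=np.float64)
       # 4-neighbour discrete Laplacian evaluated on the interior pixels.
       lap = (
           g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1]
       )
       return {
           "sharpness": float(lap.var()),  # low Laplacian variance suggests blur
           "brightness": float(g.mean()),
           "contrast": float(g.std()),
       }


   # A noisy image has far more high-frequency content than a flat one.
   rng = np.random.default_rng(0)
   noisy = rng.integers(0, 256, size=(64, 64))
   flat = np.full((64, 64), 128)
   print(image_stats(noisy)["sharpness"] > image_stats(flat)["sharpness"])

Aggregating these per-image values over the training set (for example as means
and histograms) would give the baseline that the drift monitor later compares
against.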

**Outputs**

- ``data/baselines/resolution_hist.png``
- ``data/baselines/similarity_matrix.png``
- ``data/baselines/sharpness_hist.png``
- ``data/baselines/eda_baselines.json`` — raw baseline statistics
- ``data/baselines/eda_metrics.json`` — DVC metric summary

----

Stage 3 — Image Preprocessing
-------------------------------

**DVC stage**: ``image_preprocess``

**Script**: ``scripts/image_processing.py``

**Config**: ``conf/preprocess.yaml``

**What it does**

Applies a configurable preprocessing pipeline to each training image:

- **Deblurring** — images with Laplacian variance below ``blurry_threshold`` are
  sharpened or excluded depending on configuration.
- **Orientation normalisation** — corrects image rotation based on EXIF metadata
  or a learned orientation estimator, so all images are upright before matching.

The preprocessing module is designed to be pluggable: only the steps listed in
``conf/preprocess.yaml`` are applied.
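
One common way to get this kind of pluggability is a registry keyed by step
name. The sketch below is an assumption about the design, not the module's
actual code: the step names, the ``STEPS`` registry, and the shape of the
config dictionary are all hypothetical, and images are represented by small
dictionaries instead of pixel arrays to keep the example short.

.. code-block:: python

   STEPS = {}


   def register(name):
       """Decorator that adds a preprocessing step to the registry."""
       def wrap(fn):
           STEPS[name] = fn
           return fn
       return wrap


   @register("deblur")
   def deblur(image, blurry_threshold=100.0):
       # Placeholder: a real step would sharpen or drop images below the threshold.
       return {**image, "deblurred": image["sharpness"] < blurry_threshold}


   @register("orientation")
   def orientation(image):
       # Placeholder: a real step would rotate the pixels; here only the angle is reset.
       return {**image, "rotation_deg": 0}


   def preprocess(image, config):
       """Apply only the steps listed in the config, in the listed order."""
       for step in config["steps"]:
           image = STEPS[step["name"]](image, **step.get("params", {}))
       return image


   # Only "deblur" is listed, so the orientation step is skipped.
   config = {"steps": [{"name": "deblur", "params": {"blurry_threshold": 50.0}}]}
   result = preprocess({"sharpness": 20.0, "rotation_deg": 90}, config)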

**Outputs**

- ``data/processed/images/`` — preprocessed image tree mirroring ``data/train/``
- ``data/processed/preprocess_report.json``
- ``data/processed/preprocess_metrics.json``

----

Stage 4 — Data Preparation
----------------------------

**DVC stage**: ``prepare``

**Script**: ``scripts/prepare_submission.py``

**What it does**

Reads the preprocessed image paths and ``data/train_labels.csv`` to build
``data/prepared/prepared_input.csv``. This CSV is in the IMC2025 submission format
with ``nan`` placeholder values for rotation and translation — these are populated
by the reconstruction stage.

Columns: ``image_id``, ``dataset``, ``scene``, ``image``, ``rotation_matrix``,
``translation_vector``.
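
Rows of the prepared file might be produced as in the sketch below. The column
names come from the list above; the ``image_id`` scheme, the sample dataset and
scene names, and the exact placeholder encoding (one ``nan`` per matrix or
vector element, separated by semicolons) are assumptions made for illustration.

.. code-block:: python

   import csv
   import io

   COLUMNS = ["image_id", "dataset", "scene", "image",
              "rotation_matrix", "translation_vector"]
   NAN_ROTATION = ";".join(["nan"] * 9)     # assumed placeholder for a 3x3 matrix
   NAN_TRANSLATION = ";".join(["nan"] * 3)  # assumed placeholder for a 3-vector


   def build_prepared_rows(images):
       """Turn (dataset, scene, image) tuples into rows with placeholder poses."""
       for dataset, scene, image in images:
           yield {
               "image_id": f"{dataset}_{image}",  # hypothetical ID scheme
               "dataset": dataset,
               "scene": scene,
               "image": image,
               "rotation_matrix": NAN_ROTATION,
               "translation_vector": NAN_TRANSLATION,
           }


   buffer = io.StringIO()
   writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
   writer.writeheader()
   writer.writerows(build_prepared_rows([("church", "scene_a", "img_001.png")]))
   print(buffer.getvalue())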

----

Stage 5 — Scene Reconstruction (Core Pipeline)
-----------------------------------------------

**DVC stage**: ``run_pipeline``

**Script**: ``scripts/reconstruct_scenes.py``

**Config**: ``conf/mast3r.yaml`` (or ``conf/best_config.yaml`` in production)

This is the main computational stage. It implements the full
``IMC2025Pipeline.run()`` loop.

Shortlist Generation
~~~~~~~~~~~~~~~~~~~~~

Before matching, the pipeline generates a **shortlist** of candidate image pairs.
Exhaustively matching all N(N−1)/2 pairs scales quadratically with the number of
images N and becomes impractical for large datasets, so the shortlist generator
selects the most promising pairs using an **ensemble** of global descriptor
retrievers:

1. **MASt3R-ASMK** — codebook-based retrieval that aggregates MASt3R's dense
   descriptors with ASMK (Aggregated Selective Match Kernels). This is the
   primary retriever.
2. **MASt3R-SPoC** — an alternative global descriptor from the MASt3R retrieval
   head.
3. **DINOv2** — a general-purpose vision transformer used as a secondary global
   descriptor for cross-domain robustness.
4. **ISC** — a descriptor trained specifically for image copy detection, effective
   for repeated structures.

Each retriever proposes its top-K most similar images per query. The union of all
proposals forms the final shortlist.
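
The union step can be sketched as follows, assuming each retriever has already
produced an N×N similarity matrix. The function names and the toy matrices are
hypothetical; the real generator may weight or cap the proposals differently.

.. code-block:: python

   import numpy as np


   def top_k_pairs(similarity, k):
       """Return the set of (i, j) pairs, i < j, from each image's top-k neighbours."""
       sim = np.array(similarity, dtype=np.float64)
       np.fill_diagonal(sim, -np.inf)  # never pair an image with itself
       pairs = set()
       for i, row in enumerate(sim):
           for j in np.argsort(-row)[:k]:
               pairs.add((min(i, int(j)), max(i, int(j))))
       return pairs


   def build_shortlist(similarities, k):
       """Union the top-k proposals of every retriever."""
       shortlist = set()
       for sim in similarities:
           shortlist |= top_k_pairs(sim, k)
       return sorted(shortlist)


   # Two toy retrievers over four images that disagree on the best neighbours.
   retriever_a = [[1, .9, .1, .1], [.9, 1, .1, .1], [.1, .1, 1, .9], [.1, .1, .9, 1]]
   retriever_b = [[1, .1, .9, .1], [.1, 1, .1, .9], [.9, .1, 1, .1], [.1, .9, .1, 1]]
   shortlist = build_shortlist([retriever_a, retriever_b], k=1)

Storing each pair as ``(min, max)`` keeps the shortlist free of mirrored
duplicates, so a pair proposed by several retrievers is matched only once.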

Feature Extraction and Matching
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For each pair in the shortlist, the pipeline runs matching via the **MASt3R Hybrid
Matcher** (``type: mast3r_hybrid``), which combines:

- **Dense matching** — MASt3R's end-to-end dense correspondence network operates
  at 512 px resolution and produces dense pixel-level matches.
- **Sparse matching** — two local feature detectors provide complementary keypoints:

  - **ALIKED** (with LightGlue) — a learned keypoint detector with
    up to 4096 keypoints per image at 1280 px resolution.
  - **MagicLeap SuperPoint** — a self-supervised learned detector and descriptor
    with up to 4096 keypoints per image at 1600 px resolution.

Dense and sparse matches are **fused** after each matcher has run independently,
so the combined match set covers more of each image pair.
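
A simple form of this fusion is sketched below. It is an assumption, not the
matcher's actual logic: it supposes every matcher returns an ``(N, 4)`` array of
``(x0, y0, x1, y1)`` coordinates that have already been rescaled from the
matcher's working resolution to original-image pixels, and it treats matches
that land on the same quantised pixels as duplicates. ``fuse_matches`` and
``pixel_tol`` are hypothetical names.

.. code-block:: python

   import numpy as np


   def fuse_matches(match_sets, pixel_tol=1.0):
       """Concatenate (N, 4) match arrays and drop near-duplicate rows."""
       arrays = [np.asarray(m, dtype=np.float64).reshape(-1, 4) for m in match_sets]
       merged = np.concatenate(arrays, axis=0)
       if merged.size == 0:
           return merged
       # Quantise to a grid of `pixel_tol` pixels so near-identical matches collide.
       keys = np.round(merged / pixel_tol).astype(np.int64)
       _, first_idx = np.unique(keys, axis=0, return_index=True)
       return merged[np.sort(first_idx)]  # keep first occurrences, in input order


   dense = [[10.2, 20.1, 30.0, 40.0], [50.0, 60.0, 70.0, 80.0]]
   sparse = [[10.0, 20.0, 30.3, 40.2], [5.0, 6.0, 7.0, 8.0]]
   fused = fuse_matches([dense, sparse])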

COLMAP Incremental SfM
~~~~~~~~~~~~~~~~~~~~~~~

Fused matches are imported into a **COLMAP** database. COLMAP's incremental
Structure-from-Motion mapper then:

1. Selects an initial image pair with many verified matches and enough baseline
   for stable triangulation.
2. Triangulates an initial 3D point set.
3. Registers remaining images one by one via PnP.
4. Runs bundle adjustment after each batch of registrations.
5. Filters outlier points by reprojection error.
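
The outlier filter in step 5 rests on the reprojection error, which can be
illustrated with a few lines of NumPy. This is a sketch of the idea under a
simple pinhole model, not COLMAP's implementation; the function names and the
4-pixel threshold are illustrative assumptions.

.. code-block:: python

   import numpy as np


   def reprojection_errors(points3d, observations, K, R, t):
       """Pixel distance between observed keypoints and projected 3D points.

       points3d: (N, 3) world points; observations: (N, 2) pixels;
       K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
       """
       cam = points3d @ R.T + t             # world frame -> camera frame
       proj = cam @ K.T                     # camera frame -> homogeneous pixels
       pixels = proj[:, :2] / proj[:, 2:3]  # perspective division
       return np.linalg.norm(pixels - observations, axis=1)


   def filter_outliers(points3d, observations, K, R, t, max_error_px=4.0):
       """Keep only points whose reprojection error is within the threshold."""
       keep = reprojection_errors(points3d, observations, K, R, t) <= max_error_px
       return points3d[keep], keep


   K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
   points = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]])
   observed = np.array([[320.0, 240.0], [450.0, 240.0]])  # second one is 30 px off
   inliers, mask = filter_outliers(points, observed, K, np.eye(3), np.zeros(3))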

Key COLMAP parameters (from config):

- ``mapper_min_model_size: 3`` — minimum images to form a valid reconstruction.
- ``mapper_max_num_models: 25`` — maximum number of disconnected sub-models.

**Outputs**

- ``data/reconstruction/eval_prediction.csv`` — IMC2025 format poses
- ``data/reconstruction/sparse_reconstruction.ply`` — point cloud
- ``data/reconstruction/reconstruction_metrics.json``

----

Stage 6 — Evaluation
---------------------

**DVC stage**: ``evaluate``

**Script**: ``scripts/evaluate.py``

**What it does**

Computes the **mAA (mean Average Accuracy)** metric, which is the primary quality
measure for the IMC2025 competition. For each error threshold, accuracy is the
fraction of camera poses whose error falls within that threshold; mAA averages
this accuracy over a set of angular and translation error thresholds.
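
The averaging can be sketched as follows. This is a simplified illustration
rather than the official IMC2025 scorer, which may differ in detail: it assumes
per-image rotation and translation errors are already available, pairs the two
threshold lists element by element, and uses made-up threshold values.

.. code-block:: python

   import numpy as np


   def mean_average_accuracy(rot_err_deg, trans_err, rot_thresholds, trans_thresholds):
       """Average, over paired thresholds, of the fraction of poses within both limits.

       Unregistered images should be passed with an error of np.inf.
       """
       rot_err = np.asarray(rot_err_deg, dtype=np.float64)
       tr_err = np.asarray(trans_err, dtype=np.float64)
       accuracies = [
           np.mean((rot_err <= r_th) & (tr_err <= t_th))
           for r_th, t_th in zip(rot_thresholds, trans_thresholds)
       ]
       return float(np.mean(accuracies))


   # Four images; the last one was not registered.
   maa = mean_average_accuracy(
       rot_err_deg=[0.5, 2.0, 8.0, np.inf],
       trans_err=[0.05, 0.3, 0.1, np.inf],
       rot_thresholds=[1.0, 5.0, 10.0],   # illustrative thresholds
       trans_thresholds=[0.1, 0.5, 1.0],
   )

In this example the per-threshold accuracies are 0.25, 0.5, and 0.75, so the
result is 0.5. Counting unregistered images as infinite error means a low
registration rate pulls the score down directly.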

It also computes:

- Per-dataset scores and mAA values
- Clusterness score (how well images cluster geometrically)
- Registration rate

All metrics are logged as a child MLflow run under the parent DVC run.

**Outputs**

- ``data/evaluation/metrics.json``
- ``data/evaluation/git_status.txt``

----

Online Inference Pipeline
--------------------------

The online pipeline (triggered via ``POST /upload``) mirrors the DVC pipeline but
runs directly without DVC:

1. ZIP extraction → temporary workspace
2. MASt3R hybrid matching on GPU worker (``GPUModelWorker.reconstruct()``)
3. COLMAP SfM in the same temporary workspace
4. PLY export via ``pycolmap.Reconstruction.export_PLY()``
5. Voxel downsampling (``utils/decimate.py``) to ≤500,000 points
6. Results persisted to ``/app/results/``
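
Step 5 can be approximated with the NumPy sketch below, which replaces the
points in each voxel by their centroid and enlarges the voxel until the point
budget is met. It is a minimal illustration, not the contents of
``utils/decimate.py``; the function names, starting voxel size, and growth
factor are assumptions.

.. code-block:: python

   import numpy as np


   def voxel_downsample(points, voxel_size):
       """Replace all points in each voxel by their centroid."""
       origin = points.min(axis=0)  # anchor the grid at the cloud's corner
       keys = np.floor((points - origin) / voxel_size).astype(np.int64)
       _, inverse, counts = np.unique(
           keys, axis=0, return_inverse=True, return_counts=True
       )
       inverse = inverse.reshape(-1)
       sums = np.zeros((counts.size, points.shape[1]))
       np.add.at(sums, inverse, points)  # accumulate the points of each voxel
       return sums / counts[:, None]


   def decimate(points, max_points, voxel_size=0.01, growth=1.5):
       """Grow the voxel size until the cloud has at most `max_points` points."""
       if max_points < 1:
           raise ValueError("max_points must be at least 1")
       points = np.asarray(points, dtype=np.float64)
       while len(points) > max_points:
           points = voxel_downsample(points, voxel_size)
           voxel_size *= growth
       return points

For the online pipeline this would be called with ``max_points=500_000``. A
cloud that is already within the budget is returned unchanged.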

The pipeline configuration is loaded from ``conf/best_config.yaml`` (if present)
with fallback to ``conf/mast3r.yaml``.
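
The fallback amounts to a file-existence check, sketched here with a
hypothetical helper name:

.. code-block:: python

   from pathlib import Path


   def resolve_config_path(conf_dir="conf"):
       """Prefer the promoted config; fall back to the default MASt3R config."""
       best = Path(conf_dir) / "best_config.yaml"
       return best if best.is_file() else Path(conf_dir) / "mast3r.yaml"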

----

Model Selection and Promotion
------------------------------

After each DVC experiment run, ``scripts/select_best_run.py`` queries MLflow for
the run with the highest ``mAA_overall`` metric in the
``scene_reconstruction_dvc`` experiment. It copies that run's configuration to
``conf/best_config.yaml``, which becomes the active production config on the
next ``ray-serve`` restart.
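
The selection and copy logic can be sketched independently of MLflow. In the
example below, ``runs`` is a stand-in for the query results: each record is a
plain dictionary with hypothetical ``run_id``, ``metrics``, and ``config_path``
keys. With MLflow itself, ``mlflow.search_runs`` with
``order_by=["metrics.mAA_overall DESC"]`` can return runs in the same order;
how ``scripts/select_best_run.py`` actually locates each run's config is not
shown here.

.. code-block:: python

   import shutil
   from pathlib import Path


   def promote_best_run(runs, target="conf/best_config.yaml", metric="mAA_overall"):
       """Copy the config of the highest-scoring run to the production path."""
       scored = [r for r in runs if metric in r["metrics"]]
       if not scored:
           raise ValueError(f"No run reports the metric {metric!r}")
       best = max(scored, key=lambda r: r["metrics"][metric])
       Path(target).parent.mkdir(parents=True, exist_ok=True)
       shutil.copyfile(best["config_path"], target)
       return best["run_id"]

Runs that never logged the metric (for example, failed runs) are skipped rather
than treated as scoring zero.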

----

Drift Monitoring and Retraining
---------------------------------

The ``DriftMonitor`` class (``scripts/drift_monitor.py``) compares production image
statistics to the baselines in ``data/baselines/eda_baselines.json``. It checks:

- Mean brightness
- Mean contrast
- Mean sharpness
- Aspect ratio

If any metric drifts beyond the configured threshold, an alert is raised. The
Airflow ``drift_detection_dag`` polls Prometheus every 30 minutes for the
``feature_drift_status`` metric. If drift is detected, it sends an email alert
to the configured ``SMTP_USER``. High-severity drift additionally triggers
``experiment_pipeline_dag`` automatically via the Alertmanager webhook.
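
The comparison against the baseline can be sketched as a relative-change test.
This is an assumption about how the threshold is applied, not the
``DriftMonitor`` implementation: the function name, the 20% threshold, and the
sample values are all illustrative.

.. code-block:: python

   def detect_drift(baseline, production, threshold=0.2):
       """Return the metrics whose relative change from the baseline exceeds the threshold."""
       drifted = {}
       for name, base_value in baseline.items():
           if name not in production:
               continue
           denom = abs(base_value) if base_value != 0 else 1.0
           change = abs(production[name] - base_value) / denom
           if change > threshold:
               drifted[name] = round(change, 3)
       return drifted


   # Made-up statistics: production images are much blurrier than the baseline.
   baseline = {"brightness": 120.0, "contrast": 50.0, "sharpness": 300.0, "aspect_ratio": 1.5}
   production = {"brightness": 118.0, "contrast": 52.0, "sharpness": 150.0, "aspect_ratio": 1.5}
   alerts = detect_drift(baseline, production)

Here only ``sharpness`` is flagged, with a relative change of 0.5. A non-empty
result is the kind of condition that would be exported as the
``feature_drift_status`` metric.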