Installation Guide
==================

This guide covers two installation paths: the recommended **Docker-based** setup and the
**native Python** setup for development.

----

Prerequisites
-------------

- **Docker** >= 24.0 and **Docker Compose** >= 2.20
- **NVIDIA GPU** with CUDA 12.6 support (required for inference)
- **NVIDIA Container Toolkit** installed on the host
- **Git** and **Git LFS**
- At least **16 GB GPU VRAM** and **32 GB system RAM** recommended
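
The list above can be partly checked before going further. The following is a minimal sketch, not a script shipped with the repository: it only tests whether the executables are on ``PATH``. The executable names (``git-lfs``, ``nvidia-smi``) are assumptions about a typical Linux host, and the sketch does not check versions, the Docker Compose plugin, or the NVIDIA Container Toolkit.

.. code-block:: python

   import shutil

   # Executable names implied by the prerequisites list (assumed, not taken
   # from the repository).
   REQUIRED_TOOLS = ("docker", "git", "git-lfs", "nvidia-smi")


   def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
       """Return the tools that cannot be found on PATH."""
       return [tool for tool in tools if which(tool) is None]

Calling ``missing_tools()`` returns an empty list when every executable is found. The ``which`` parameter exists only so the lookup can be replaced in tests.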

----

Step 1 — Clone the Repository
-------------------------------

.. code-block:: bash

   git clone https://github.com/your-org/MLOps-Project-ME22B214.git
   cd MLOps-Project-ME22B214
   git lfs pull          # Downloads pre-trained model weights

----

Step 2 — Download the Dataset
-------------------------------

.. code-block:: bash

   # Requires the Kaggle CLI (pip install kaggle) with API credentials configured
   kaggle competitions download -c image-matching-challenge-2025
   unzip image-matching-challenge-2025.zip -d data/
   mv data/image-matching-challenge-2025/* data/
   rm -r data/image-matching-challenge-2025

The ``data/`` directory should now contain ``train/``, ``test/``, ``train_labels.csv``,
and ``train_thresholds.csv``.
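
As a quick sanity check of that layout, a small sketch along these lines could report which expected entries are absent. The function name and the default ``data`` path are illustrative assumptions. It checks only that each entry exists, not its contents.

.. code-block:: python

   from pathlib import Path

   # Entries the guide says should exist under data/ after extraction.
   EXPECTED_ENTRIES = ("train", "test", "train_labels.csv", "train_thresholds.csv")


   def missing_entries(data_dir="data", expected=EXPECTED_ENTRIES):
       """Return the expected entries that are absent from data_dir."""
       root = Path(data_dir)
       return [name for name in expected if not (root / name).exists()]

Run from the repository root, ``missing_entries()`` should return an empty list once the dataset is in place.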

----

Docker Setup (Recommended)
---------------------------

This is the standard production-ready setup. All services run as Docker containers
on a shared network (``mlops_net``).

Step 3a — Configure the Environment File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run the interactive setup script to generate your ``.env`` file:

.. code-block:: bash

   ./setup_env.sh

The script will prompt you for the following values:

.. list-table::
   :widths: 30 70
   :header-rows: 1

   * - Variable
     - Description
   * - ``AIRFLOW_UID``
     - Your host user ID (run ``id -u`` to find it)
   * - ``DOCKER_GID``
     - Docker group ID (run ``getent group docker | cut -d: -f3``)
   * - ``FERNET_KEY``
     - Airflow encryption key (generate with ``python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"``)
   * - ``SMTP_USER``
     - Gmail address for Airflow email alerts
   * - ``SMTP_PASSWORD``
     - Gmail app password (not your regular password)
   * - ``HOST_PROJECT_ROOT``
     - Absolute path to this repository on the host machine
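
To confirm that the generated ``.env`` file contains every variable in the table, a check like the one below may help. It is a sketch under assumptions: that ``setup_env.sh`` writes plain ``KEY=VALUE`` lines, and that a Fernet key is recognisable as 32 bytes encoded in URL-safe base64. The helper names are hypothetical and do not come from the repository.

.. code-block:: python

   import base64

   REQUIRED_KEYS = (
       "AIRFLOW_UID", "DOCKER_GID", "FERNET_KEY",
       "SMTP_USER", "SMTP_PASSWORD", "HOST_PROJECT_ROOT",
   )


   def parse_env(text):
       """Parse KEY=VALUE lines, skipping blank lines and # comments."""
       values = {}
       for line in text.splitlines():
           line = line.strip()
           if not line or line.startswith("#") or "=" not in line:
               continue
           key, _, value = line.partition("=")
           values[key.strip()] = value.strip().strip("'\"")
       return values


   def is_fernet_key(value):
       """True if value decodes from URL-safe base64 to exactly 32 bytes."""
       try:
           return len(base64.urlsafe_b64decode(value.encode())) == 32
       except ValueError:
           return False


   def check_env(values):
       """Return a list of human-readable problems; empty means it looks fine."""
       problems = [
           f"{key} is missing or empty"
           for key in REQUIRED_KEYS
           if not values.get(key)
       ]
       if values.get("FERNET_KEY") and not is_fernet_key(values["FERNET_KEY"]):
           problems.append("FERNET_KEY is not a valid Fernet key")
       for key in ("AIRFLOW_UID", "DOCKER_GID"):
           if values.get(key) and not values[key].isdigit():
               problems.append(f"{key} should be numeric")
       return problems

A typical use would be ``check_env(parse_env(open(".env").read()))`` from the repository root. The check is structural only: it cannot tell whether the SMTP credentials or the project path are actually correct.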

Step 3b — Generate Docker Secrets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   ./generate_secrets.sh
   chmod 644 ./secrets/*

This creates two files under ``secrets/``:

- ``secrets/jwt_secret`` — used to sign API JWT tokens
- ``secrets/grafana_admin_password`` — used for the Grafana admin account

.. note::
   The ``secrets/`` directory is listed in ``.gitignore`` and will never be committed.

Step 3c — Generate TLS Certificates
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   ./generate-certs.sh

This creates self-signed certificates under ``certs/`` for nginx TLS termination.
For production, replace these with certificates from a trusted CA.

Step 3d — Clone External Dependencies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd extra/
   git clone https://github.com/jenicek/asmk
   git clone https://github.com/naver/croco
   cd ..

Step 3e — Launch the Full Stack
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   docker compose --profile inference up --build -d

This starts the following services:

.. list-table::
   :widths: 25 20 55
   :header-rows: 1

   * - Service
     - Port
     - Description
   * - ``scene3d-ui``
     - 5173, 443
     - React + Three.js frontend
   * - ``ray-serve``
     - 8000, 8265
     - FastAPI gateway + GPU inference worker
   * - ``mlflow``
     - 5000
     - Experiment tracking server
   * - ``airflow-apiserver``
     - 8080
     - Airflow web UI and REST API
   * - ``prometheus``
     - 9090
     - Metrics scraping
   * - ``grafana``
     - 3001
     - Monitoring dashboards
   * - ``postgres``
     - (internal)
     - Airflow metadata database

.. code-block:: bash

   # Verify all containers are healthy
   docker compose ps

Wait for the ``ray-serve`` health check to pass. This can take up to 5 minutes while
the MASt3R model weights are loaded into GPU memory.
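
Rather than re-running ``docker compose ps`` by hand, the wait can be scripted. The sketch below polls a URL until it answers with HTTP 200 or a deadline expires. It assumes the ``/ready`` endpoint on port 8000 (used in the verification section of this guide) is a suitable readiness signal. The 300-second deadline mirrors the "up to 5 minutes" figure and the 5-second interval is arbitrary.

.. code-block:: python

   import time
   import urllib.error
   import urllib.request


   def http_ok(url, timeout=5.0):
       """Return True if a GET request to url answers with HTTP 200."""
       try:
           with urllib.request.urlopen(url, timeout=timeout) as response:
               return response.status == 200
       except (urllib.error.URLError, OSError):
           # Connection refused, timeouts, and HTTP error codes all count
           # as "not ready yet".
           return False


   def wait_until_ready(url, deadline_s=300.0, interval_s=5.0,
                        probe=http_ok, sleep=time.sleep, clock=time.monotonic):
       """Poll url until probe succeeds or deadline_s seconds have elapsed."""
       start = clock()
       while True:
           if probe(url):
               return True
           if clock() - start >= deadline_s:
               return False
           sleep(interval_s)

For example, ``wait_until_ready("http://localhost:8000/ready")`` returns ``True`` as soon as the worker responds. The ``probe``, ``sleep``, and ``clock`` parameters are injection points for testing and can be ignored in normal use.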

----

Native Python Setup (Developer Mode)
--------------------------------------

Use this path if you need to develop or debug outside Docker.

Step 3a — Build ASMK
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   cd extra/
   git clone https://github.com/jenicek/asmk
   cd asmk/cython/
   cythonize *.pyx
   cd ..
   python -m build --no-isolation
   pip install dist/*.whl
   cd ../../

Step 3b — Build CroCo / DUSt3R Kernels
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DUSt3R relies on RoPE positional embeddings, which require compiled CUDA kernels:

.. code-block:: bash

   cd extra/
   git clone https://github.com/naver/croco.git
   cd croco/models/curope/
   python -m build --no-isolation
   pip install dist/*.whl
   cd ../../../../       # Back to the repository root

Step 3c — Build Remaining Packages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Build any additional packages under ``bundle/oss/`` as wheels by running
``python -m build --no-isolation`` in each package directory, then move the
resulting ``.whl`` files from each package's ``dist/`` folder into ``bundle/oss/``.

Step 3d — Create the Python Virtual Environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   pip install uv
   uv venv
   source .venv/bin/activate
   uv pip install -e .
   export LD_LIBRARY_PATH="$PWD/.venv/lib/python3.11/site-packages/torch/lib:$LD_LIBRARY_PATH"

The project requires **Python 3.11** exactly (``requires-python = "==3.11.*"``).
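
Because the version constraint is strict, it can be worth confirming which interpreter the virtual environment actually uses. A minimal sketch, with a hypothetical helper name:

.. code-block:: python

   import sys


   def is_supported_python(version_info=sys.version_info):
       """True when the interpreter is some 3.11.x release."""
       return tuple(version_info[:2]) == (3, 11)

Running ``python -c "import sys; print(sys.version_info[:2])"`` inside the activated environment gives the same information without a helper.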

----

Pre-trained Model Weights
--------------------------

Model weights are stored under ``extra/pretrained_models/`` via Git LFS. If you
need to download them manually:

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - Model
     - Download URL
   * - ALIKED ``aliked-n16.pth``
     - https://github.com/Shiaoming/ALIKED/raw/main/models/aliked-n16.pth
   * - ISC ``isc_ft_v107.pth.tar``
     - https://github.com/lyakaap/ISC21-Descriptor-Track-1st/releases/download/v1.0.1/isc_ft_v107.pth.tar
   * - MASt3R main weights
     - https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric.pth
   * - MASt3R retrieval weights
     - https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_trainingfree.pth
   * - MASt3R codebook
     - https://download.europe.naverlabs.com/ComputerVision/MASt3R/MASt3R_ViTLarge_BaseDecoder_512_catmlpdpt_metric_retrieval_codebook.pkl

----

Verifying the Installation
---------------------------

Once all services are running, verify the stack is healthy:

.. code-block:: bash

   # API health check
   curl http://localhost:8000/health

   # GPU worker readiness
   curl http://localhost:8000/ready

   # Obtain a JWT token
   curl -X POST http://localhost:8000/auth/token \
     -H "Content-Type: application/json" \
     -d '{"username": "admin", "password": "admin"}'

A successful ``/health`` response looks like:

.. code-block:: json

   {
     "status": "ok",
     "version": "2.0.0",
     "timestamp": 1714300000.0
   }
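
In a script, the response body can be checked rather than read by eye. The sketch below treats a response as healthy when it is a JSON object whose ``status`` field equals ``"ok"``, as in the sample above. Whether the API can report other status values is not documented here, so anything else is treated as unhealthy.

.. code-block:: python

   import json


   def is_healthy(body):
       """True if body is a JSON object with "status" equal to "ok"."""
       try:
           payload = json.loads(body)
       except json.JSONDecodeError:
           return False
       return isinstance(payload, dict) and payload.get("status") == "ok"

The function accepts the raw text or bytes returned by ``curl`` or any HTTP client.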

----

Troubleshooting Installation
------------------------------

**ray-serve container exits immediately**
   Check that the NVIDIA Container Toolkit is installed and that
   ``docker run --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi`` succeeds.

**Port conflicts**
   If ports 8000, 5000, or 8080 are in use on your host, edit the ``ports:``
   mappings in ``docker-compose.yaml`` before launching.

**Airflow DB migration fails**
   Confirm that ``postgres`` is healthy before ``airflow-init`` runs, and inspect
   its logs with ``docker compose logs postgres``.

**Git LFS quota exceeded**
   Download the model weights manually using the URLs listed under
   `Pre-trained Model Weights`_ and place them under ``extra/pretrained_models/``.
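
For the port-conflict case above, the host can be checked before launching the stack. This sketch attempts a TCP connection to each port on the loopback interface and reports the ones that accept it. The default port list repeats the three named in that entry. It only detects services that are reachable on ``127.0.0.1``, which is an assumption about how conflicting services are bound.

.. code-block:: python

   import socket

   # Ports named in the port-conflict entry of this guide.
   DEFAULT_PORTS = (8000, 5000, 8080)


   def port_in_use(port, host="127.0.0.1"):
       """True if something accepts TCP connections on host:port."""
       with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
           sock.settimeout(1.0)
           return sock.connect_ex((host, port)) == 0


   def busy_ports(ports=DEFAULT_PORTS, host="127.0.0.1"):
       """Return the subset of ports that are already taken."""
       return [port for port in ports if port_in_use(port, host)]

An empty list from ``busy_ports()`` suggests the default mappings can be used as they are.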