Frequently Asked Questions


General

What does this system actually do?

It takes a collection of photos of a scene (e.g., a building, a room, an archaeological site) and reconstructs a 3D point cloud from them. For each image it also estimates where the camera was positioned and which direction it was pointing.

What kind of images can I use?

The system accepts .jpg, .jpeg, .png, .tif, .tiff, .bmp, and .webp files, packaged into a single ZIP archive. Images should be taken with a real camera or phone; synthetic renders or heavily edited images may produce poor results.
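As a sketch, a valid upload archive can be assembled from the standard library alone; the directory and archive names here are illustrative:

```python
import zipfile
from pathlib import Path

# Extensions accepted by the upload endpoint
ALLOWED = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".webp"}

def package_images(photo_dir: str, archive: str) -> int:
    """Zip every supported image under photo_dir; returns the number packed."""
    count = 0
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(Path(photo_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in ALLOWED:
                zf.write(path, arcname=path.name)  # flatten directory structure
                count += 1
    return count
```

For example, `package_images("photos", "scene.zip")` produces a single ZIP ready for upload, silently skipping unsupported files.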

How many images do I need?

A minimum of 3 images is required to form a reconstruction. In practice, at least 10–20 overlapping images of a scene will give usable results. More images (50–200) generally improve accuracy and coverage.

Do the images need to be ordered?

No. The system handles unordered collections. However, every image must share some visual overlap with at least one other image in the set.


Installation & Setup

I get “CUDA error” when starting ray-serve. What should I do?

Ensure the NVIDIA Container Toolkit is installed on your host:

sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Then verify GPU access:

docker run --rm --gpus all nvidia/cuda:12.6.3-base-ubuntu22.04 nvidia-smi

The ``generate_secrets.sh`` script fails.

Ensure you have openssl installed (sudo apt install openssl). The script generates random secret strings using openssl rand.

I see “AIRFLOW_UID not set” warnings.

Add your user ID to .env:

echo "AIRFLOW_UID=$(id -u)" >> .env

Git LFS download fails.

You can download the model weights manually using the URLs listed in the Installation Guide. Place the files in extra/pretrained_models/.


Using the API

My JWT token keeps expiring mid-workflow.

Tokens are valid for 15 minutes. For long-running automation scripts, refresh the token proactively by calling POST /auth/token before each request, or increase JWT_EXPIRY_SECONDS in the environment configuration.
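One way to refresh proactively is a small wrapper that re-fetches the token shortly before the 15-minute expiry. This is a sketch: `fetch_token` stands in for whatever call your client makes to POST /auth/token.

```python
import time

class TokenCache:
    """Re-fetches a JWT shortly before it expires (15 minutes by default)."""

    def __init__(self, fetch_token, expiry_seconds=900, margin_seconds=60):
        self._fetch = fetch_token          # callable returning a fresh token string
        self._expiry = expiry_seconds
        self._margin = margin_seconds
        self._token = None
        self._obtained_at = 0.0

    def get(self) -> str:
        age = time.monotonic() - self._obtained_at
        if self._token is None or age > self._expiry - self._margin:
            self._token = self._fetch()    # e.g. POST /auth/token
            self._obtained_at = time.monotonic()
        return self._token
```

A script then sends `cache.get()` as the Bearer token on every request and never hits a mid-workflow expiry.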

I get HTTP 503 on ``/ready``.

The GPU worker is still loading. MASt3R model weights take 1–3 minutes to load into VRAM at container startup. Wait for the ray-serve healthcheck to pass before sending inference requests.
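The wait can be automated with a simple poll loop; in this sketch, `is_ready` stands in for a GET on /ready returning HTTP 200:

```python
import time

def wait_until_ready(is_ready, timeout_s=300.0, interval_s=5.0, sleep=time.sleep):
    """Polls is_ready() until it returns True; raises TimeoutError otherwise.

    Returns the number of seconds waited.
    """
    waited = 0.0
    while True:
        if is_ready():                 # e.g. GET /ready responded with HTTP 200
            return waited
        if waited >= timeout_s:
            raise TimeoutError(f"service not ready after {timeout_s:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

A 300-second timeout comfortably covers the 1–3 minutes of model loading.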

``POST /upload`` returns HTTP 413.

Your ZIP file exceeds the 500 MB default limit. Either reduce the dataset size or increase SCENE3D_MAX_UPLOAD_MB in docker-compose.yaml.
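You can also check the archive against the limit before uploading; this sketch assumes the same SCENE3D_MAX_UPLOAD_MB variable is visible to the client, defaulting to 500:

```python
import os

def upload_allowed(zip_path: str, limit_mb=None) -> bool:
    """True if the archive fits under the configured upload limit (in MB)."""
    if limit_mb is None:
        # Assumption: mirror the server-side SCENE3D_MAX_UPLOAD_MB setting
        limit_mb = int(os.environ.get("SCENE3D_MAX_UPLOAD_MB", "500"))
    size_mb = os.path.getsize(zip_path) / (1024 * 1024)
    return size_mb <= limit_mb
```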

How do I run multiple jobs in parallel?

The system is currently configured for one concurrent job (MAX_CONCURRENT_JOBS=1) to prevent GPU memory exhaustion. Additional uploads will be queued and processed in order.


Reconstruction Quality

My registration rate is below 50%. What went wrong?

Low registration rates are usually caused by one or more of:

  • Images without sufficient overlap (each image should share at least 20–30% of its view with neighbouring images).

  • Images that are too blurry (Laplacian variance below threshold).

  • Scenes with repetitive textures where feature matching produces false positives.

  • Very few images (fewer than 10 in a connected scene).
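The blur criterion above can be sketched as a Laplacian-variance check on a grayscale image; the threshold of 100 here is illustrative, not the pipeline's actual value:

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian; low values indicate blur."""
    g = gray.astype(np.float64)
    lap = (-4.0 * g[1:-1, 1:-1]
           + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def looks_blurry(gray: np.ndarray, threshold: float = 100.0) -> bool:
    return laplacian_variance(gray) < threshold
```

Sharp, textured images score high; flat or defocused images score near zero, so running this over a dataset before upload flags the frames likely to fail registration.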

Some images appear in the point cloud viewer but others don’t.

Images that appear are those successfully registered by COLMAP. Excluded images did not have enough verified matches to determine their pose. Check the registration_rate value in the Stats Table for the proportion registered.

The point cloud looks very sparse.

The displayed point cloud is voxel-downsampled to at most 500,000 points for browser performance. The original full-density PLY files are available for download and will be much denser.
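Voxel downsampling of the kind applied for the viewer can be sketched as averaging all points that fall into the same grid cell (the voxel size here is illustrative):

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel: float) -> np.ndarray:
    """Keeps one averaged point per occupied voxel of side length `voxel`."""
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    n = inverse.max() + 1
    sums = np.zeros((n, 3))
    counts = np.zeros(n)
    np.add.at(sums, inverse, points)   # accumulate points per voxel
    np.add.at(counts, inverse, 1.0)
    return sums / counts[:, None]
```

This is why the viewer's cloud is uniform but sparse: density is capped per unit of volume, while the downloadable PLY keeps every reconstructed point.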

I uploaded 200 images but got only 1 cluster with 30 images.

COLMAP may have produced multiple disconnected sub-models and selected only the largest. This can happen when images fall into groups with little overlap between them. Try ensuring all images share some visual context, or increase the number of images from each viewpoint.
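Whether your image set splits into disconnected groups can be checked by treating verified pairwise matches as edges of a graph; a minimal sketch:

```python
from collections import defaultdict

def overlap_clusters(n_images: int, matched_pairs) -> list:
    """Connected components of the image-overlap graph (edges = matched pairs),
    largest first."""
    adj = defaultdict(set)
    for a, b in matched_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, clusters = set(), []
    for start in range(n_images):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:                    # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return sorted(clusters, key=len, reverse=True)
```

If this returns several sizeable components, the reconstruction will likewise split into sub-models, and only the largest one is kept.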


MLOps / DVC / MLflow

How do I compare two experiment runs?

Open MLflow at http://localhost:5000, navigate to the scene_reconstruction_dvc experiment, and select multiple runs to compare. You can plot mAA_overall, registration_rate, and per-dataset metrics side by side.

How is the best config selected?

scripts/select_best_run.py queries the MLflow tracking server for the run with the highest mAA_overall metric in the experiment. It copies that run’s config YAML to conf/best_config.yaml. The ray-serve container reads this file at startup.
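The selection logic amounts to an argmax over runs by mAA_overall; a standalone sketch using plain dicts in place of the MLflow client objects:

```python
def select_best_run(runs: list, metric: str = "mAA_overall") -> dict:
    """Returns the run with the highest value of `metric`, skipping runs
    that never logged it (e.g. failed runs)."""
    scored = [r for r in runs if metric in r.get("metrics", {})]
    if not scored:
        raise ValueError(f"no runs logged metric {metric!r}")
    return max(scored, key=lambda r: r["metrics"][metric])
```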

DVC repro says “nothing changed”. How do I force a re-run?

dvc repro --force

Or force a specific stage (and its upstream dependencies):

dvc repro --force run_pipeline

Where are MLflow artifacts stored?

Artifacts are stored in the mlflow-artifacts Docker volume, mounted at /opt/mlflow/artifacts inside the mlflow container.


Monitoring & Alerts

I’m not receiving drift alert emails.

Check that your SMTP credentials are correctly set in .env (SMTP_USER, SMTP_PASSWORD, SMTP_MAIL_FROM). Verify the Airflow connection is active at http://localhost:8080/connection/list.

Grafana shows “No data” for most panels.

The ray-serve service must be running (inference Docker Compose profile) for Prometheus to scrape metrics. Start it with:

docker compose --profile inference up -d ray-serve

Alertmanager is firing but the Airflow retraining DAG is not triggered.

Check Alertmanager logs: docker compose logs alertmanager. Verify the Airflow API is reachable from within the mlops_net network:

docker compose exec alertmanager wget -qO- http://airflow-apiserver:8080/api/v2/monitor/health


Data & Drift

What does “drift detected” mean for my results?

It means the statistical properties of your uploaded images differ from the training dataset. The reconstruction will still run, but accuracy may be lower. High-severity drift triggers an automatic Airflow retraining job.
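Drift detection of this kind can be sketched as a comparison of summary statistics against the stored baseline; the statistic names and thresholds below are illustrative, not the monitor's actual configuration:

```python
def drift_severity(current: dict, baseline: dict,
                   warn: float = 0.1, high: float = 0.3) -> str:
    """Classifies drift by the largest relative shift of any shared statistic."""
    worst = 0.0
    for key, base in baseline.items():
        if key not in current or base == 0:
            continue
        worst = max(worst, abs(current[key] - base) / abs(base))
    if worst >= high:
        return "high"      # the severity level that triggers retraining
    if worst >= warn:
        return "warning"
    return "none"
```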

How do I update the drift baselines?

Re-run the eda_baselines DVC stage with your new dataset:

dvc repro eda_baselines --force

This regenerates data/baselines/eda_baselines.json, which the drift monitor uses as the new reference.