================================================================================
SUPPLEMENTARY PRD: ADDITIONAL AI MODELS (WHISPER + DISTILBERT + PYANNOTE + BGE-M3 + DONUT)
================================================================================
Project     : SaaS KYC, Python microservice extension
Supplements : Initial PRD (Phases 0–8), MICROSERVICE OK ✅
OS          : AlmaLinux 9.7 (adaptation confirmed from the previous PRD)
Python      : 3.11.13 (existing venv: /opt/kyc-service/venv/)
Active port : 20900
Service     : kyc-service.service (systemd, user: kyc)
Free RAM    : ~4.4 GB free (1.6 GB used of a 6 GB limit)
Written for : Cursor AI (autonomous step-by-step execution)
Version     : 2.0.0
================================================================================

STARTING STATE: WHAT ALREADY EXISTS (DO NOT TOUCH)
────────────────────────────────────────────────────────────────────────────────
/opt/kyc-service/
├── venv/                          ← Python 3.11.13, ACTIVE
├── models/auraface/               ← 408 MB, DO NOT MODIFY
├── models/hyperface/              ← 373 MB, DO NOT MODIFY
├── .paddlex/official_models/      ← PP-OCRv5 ~210 MB, DO NOT MODIFY
├── app/main.py                    ← FastAPI port 20900, WILL BE EXTENDED
├── app/ocr.py                     ← PaddleX OCR, DO NOT MODIFY
├── app/face_match.py              ← AuraFace, DO NOT MODIFY
├── app/utils.py                   ← helpers, WILL BE EXTENDED
├── requirements.txt               ← WILL BE EXTENDED
└── .env                           ← WILL BE EXTENDED

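NEW MODELS TO INSTALL (in this exact order)
────────────────────────────────────────────────────────────────────────────────
  #   Model                       Hugging Face ID                                   License      Size
  #1  Whisper Large-v3 Turbo      openai/whisper-large-v3-turbo                     MIT          ~1.5 GB
  #2  DistilBERT (SST-2)          distilbert-base-uncased-finetuned-sst-2-english   Apache 2.0   ~270 MB
  #3  Pyannote Diarization 3.1    pyannote/speaker-diarization-3.1                  MIT          ~800 MB
  #4  BGE-M3 + all-MiniLM-L6-v2   BAAI/bge-m3                                       MIT          ~2.3 GB
  #5  Donut base (CORD v2)        naver-clova-ix/donut-base-finetuned-cord-v2       MIT          ~2.0 GB

  The IDs for #2 and #5 are the fine-tuned checkpoints configured in .env (Step 9.3).
  The BGE-M3 size is that of its fp32 weights (567M parameters).

Estimated RAM after full installation:
  Existing KYC system    ~1.6 GB
  5 new models           ~6.9 GB
  ──────────────────────────────
  Total                  ~8.5 GB  (of 15 GB physical)

NOTE: this total exceeds the 6 GB limit stated in the header. If that limit is
enforced (for example by systemd), it would have to be raised before all models
can be loaded at once. The estimate also leaves out the two companion models
installed later: all-MiniLM-L6-v2 (~90 MB, Apache 2.0) and
cardiffnlp/twitter-roberta-base-sentiment-latest (~500 MB).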
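
Before installing anything, the memory actually available can be compared with the planned footprint. The sketch below is one possible way to do that by parsing /proc/meminfo-style text. The function names, the 1 GB safety margin and the sample figures are illustrative assumptions, not part of the existing service.

```python
def available_gb(meminfo_text: str) -> float:
    """Return MemAvailable from /proc/meminfo-style text, in GB (1 GB = 1024**2 kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])
            return kb / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def has_headroom(meminfo_text: str, needed_gb: float, margin_gb: float = 1.0) -> bool:
    """True if available memory covers the planned models plus a safety margin."""
    return available_gb(meminfo_text) >= needed_gb + margin_gb


# Made-up sample: 8 GB available
sample = "MemTotal:       16000000 kB\nMemAvailable:    8388608 kB\n"
print(f"MemAvailable: {available_gb(sample):.1f} GB")
print("Enough for 6 GB of models:", has_headroom(sample, needed_gb=6.0))
```

On the server, pass the contents of /proc/meminfo as meminfo_text and the total derived from the table above as needed_gb. MemAvailable describes the whole machine, so a per-service cap still has to be checked separately.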

MANDATORY RULES FOR CURSOR
────────────────────────────────────────────────────────────────────────────────
1. ALWAYS activate the venv before any pip or python command:
      source /opt/kyc-service/venv/bin/activate
2. ALWAYS run commands as root OR with sudo, except when working inside
   /opt/kyc-service/ (owned by the user kyc).
3. NEVER stop or restart kyc-service.service before Phase 9.
4. After each CHECKPOINT ✅, wait for validation before continuing.
5. Use dnf (AlmaLinux), NOT apt, for system packages.
6. If a dependency fails, print the FULL error and stop.
7. New modules go in /opt/kyc-service/app/, in the same directory layout.
8. The phrase "CHECKPOINT ✅" marks a MANDATORY verification.
================================================================================


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 9: ADDITIONAL ENVIRONMENT SETUP
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

STEP 9.1: Additional system dependencies (AlmaLinux 9.7)
─────────────────────────────────────────────────────────────────

    sudo dnf install -y \
        ffmpeg \
        ffmpeg-devel \
        libsndfile \
        libsndfile-devel \
        portaudio \
        portaudio-devel

    # Check that ffmpeg is available (required by Whisper)
    ffmpeg -version | head -1

CHECKPOINT ✅ : a line starting with "ffmpeg version" must be printed.

NOTE: ffmpeg is usually not available from the base AlmaLinux 9 repositories.
If the install above fails, enable CRB, EPEL and RPM Fusion first, then retry:
    sudo dnf install -y dnf-plugins-core epel-release
    sudo dnf config-manager --set-enabled crb
    sudo dnf install -y \
        https://mirrors.rpmfusion.org/free/el/rpmfusion-free-release-9.noarch.rpm
    sudo dnf install -y ffmpeg ffmpeg-devel


STEP 9.2: Check the state of the existing venv
──────────────────────────────────────────────

    source /opt/kyc-service/venv/bin/activate
    python --version
    pip --version

CHECKPOINT ✅ : Python 3.11.x and pip 24.x or later.

    # Check the packages already installed
    pip list | grep -E "fastapi|paddlepaddle|insightface|torch|huggingface"

CHECKPOINT ✅ : At least 5 result lines are printed.
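
The grep above only counts matching lines, so a missing package can go unnoticed if another name matches twice. As a minimal sketch, the same check can be made per package from Python with importlib.metadata. The distribution names in REQUIRED are assumptions inferred from the grep pattern (for example huggingface-hub); adjust them to what PRD 1 actually installed.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


# Assumed distribution names, derived from the grep pattern above
REQUIRED = ["fastapi", "paddlepaddle", "insightface", "torch", "huggingface-hub"]

for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<16} {version or 'MISSING'}")
```

Run it inside the activated venv; any line ending in MISSING means the checkpoint is not met for that package.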


STEP 9.3: Extend the .env file
─────────────────────────────────────

Append these lines to the END of /opt/kyc-service/.env
(do NOT overwrite the existing lines):

-------- LINES TO ADD to .env --------
# ── Whisper Turbo ─────────────────────────────────────────
# faster-whisper model name. The Transformers repo ID openai/whisper-large-v3-turbo
# is not in CTranslate2 format and cannot be loaded by app/whisper_stt.py.
WHISPER_MODEL=large-v3-turbo
WHISPER_DEVICE=cpu
WHISPER_COMPUTE_TYPE=int8
WHISPER_BEAM_SIZE=5
WHISPER_LANGUAGE=fr
WHISPER_TASK=transcribe

# ── DistilBERT Sentiment ───────────────────────────────────
DISTILBERT_MODEL=distilbert-base-uncased-finetuned-sst-2-english
SENTIMENT_BATCH_SIZE=32
SENTIMENT_MAX_LENGTH=512

# ── Pyannote Diarization ───────────────────────────────────
PYANNOTE_MODEL=pyannote/speaker-diarization-3.1
PYANNOTE_HF_TOKEN=METTRE_VOTRE_TOKEN_HUGGINGFACE_ICI
PYANNOTE_MIN_SPEAKERS=1
PYANNOTE_MAX_SPEAKERS=10

# ── BGE-M3 Embeddings ─────────────────────────────────────
BGE_MODEL=BAAI/bge-m3
MINIML_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_BATCH_SIZE=64

# ── Donut Document Understanding ──────────────────────────
DONUT_MODEL=naver-clova-ix/donut-base-finetuned-cord-v2
DONUT_MAX_LENGTH=768
-------- END OF LINES TO ADD --------

IMPORTANT NOTE: PYANNOTE_HF_TOKEN
    Pyannote 3.1 requires a Hugging Face token AND acceptance of the model's
    terms of use.
    MANDATORY steps before running Phase 12:
    1. Create an account on huggingface.co (if not already done)
    2. Go to https://huggingface.co/pyannote/speaker-diarization-3.1
    3. Click "Agree and access repository"
    4. Go to https://huggingface.co/pyannote/segmentation-3.0
    5. Click "Agree and access repository"
    6. Generate a token at https://huggingface.co/settings/tokens
    7. Replace METTRE_VOTRE_TOKEN_HUGGINGFACE_ICI with the real token in .env
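
A forgotten placeholder is easy to miss until the Pyannote download fails. The following sketch shows one way to check the token entry in .env text before going further. The helper names are hypothetical, and the check is purely local: it does not contact Hugging Face, so it cannot confirm that the token is valid or that the conditions were accepted.

```python
PLACEHOLDER = "METTRE_VOTRE_TOKEN_HUGGINGFACE_ICI"


def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def token_problem(env_text):
    """Return a description of the problem with PYANNOTE_HF_TOKEN, or None if it looks set."""
    token = parse_env(env_text).get("PYANNOTE_HF_TOKEN", "")
    if not token:
        return "PYANNOTE_HF_TOKEN is missing or empty"
    if token == PLACEHOLDER:
        return "PYANNOTE_HF_TOKEN still holds the placeholder"
    return None


sample = "# comment\nPYANNOTE_MODEL=pyannote/speaker-diarization-3.1\nPYANNOTE_HF_TOKEN=" + PLACEHOLDER
print(token_problem(sample))
```

On the server, the text to check is the content of /opt/kyc-service/.env.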


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 10: INSTALLING WHISPER LARGE-V3 TURBO
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

STEP 10.1: Install the Whisper dependencies
────────────────────────────────────────────────

    source /opt/kyc-service/venv/bin/activate

    # faster-whisper: CPU-optimised implementation built on CTranslate2,
    # typically much faster on CPU than the reference openai-whisper package
    pip install faster-whisper==1.1.1

    # Audio dependencies
    pip install soundfile==0.12.1
    pip install librosa==0.10.2
    pip install pydub==0.25.1
    pip install av==12.3.0

CHECKPOINT ✅ :
    python -c "import faster_whisper; print('faster-whisper OK', faster_whisper.__version__)"


STEP 10.2: Pre-download the Whisper Turbo model
──────────────────────────────────────────────────────

Create the script /opt/kyc-service/scripts/download_whisper_turbo.py
(create the scripts/ directory first if it does not exist yet):

-------- START OF SCRIPT download_whisper_turbo.py --------
"""
Downloads Whisper Large-v3 Turbo in the CTranslate2 format used by faster-whisper.
Source model : openai/whisper-large-v3-turbo
License      : MIT (commercial use allowed)
Size         : ~1.5 GB
Duration     : 5-15 minutes depending on the connection
"""
import os
import tempfile
from pathlib import Path

os.environ["HF_HOME"] = "/opt/kyc-service/models"

import numpy as np
import soundfile as sf
from faster_whisper import WhisperModel

MODEL_DIR = Path("/opt/kyc-service/models/whisper-turbo")
MODEL_DIR.mkdir(parents=True, exist_ok=True)

print("Downloading Whisper Large-v3 Turbo...")
print("Format: CTranslate2, loaded as int8 (CPU-optimised)")
print(f"Destination: {MODEL_DIR}")
print("This can take 5 to 15 minutes...\n")

# faster-whisper downloads weights that are already converted to CTranslate2;
# no conversion happens locally. compute_type="int8" quantises them at load time
# to reduce RAM use (this PRD budgets ~1.5 GB) with little loss of accuracy.
model = WhisperModel(
    "large-v3-turbo",
    device="cpu",
    compute_type="int8",
    download_root=str(MODEL_DIR),
)

print("\n✅ Whisper Large-v3 Turbo ready!")
print("Quick transcription test...")

# Quick test on 3 seconds of synthetic silence (48000 samples at 16 kHz)
audio_data = np.zeros(48000, dtype=np.float32)
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
    test_path = f.name
sf.write(test_path, audio_data, 16000)

try:
    segments, info = model.transcribe(test_path, language="fr")
    segments = list(segments)  # transcribe() is lazy; iterating runs the decoding
    print(f"✅ Test OK: language {info.language}, {len(segments)} segment(s)")
    print("   Model loaded in memory and working")
finally:
    os.unlink(test_path)
-------- END OF SCRIPT --------

    python /opt/kyc-service/scripts/download_whisper_turbo.py

CHECKPOINT ✅ : "✅ Whisper Large-v3 Turbo ready!" must be printed.
CHECKPOINT ✅ : du -sh /opt/kyc-service/models/whisper-turbo/ must report more than 800 MB.
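
If the size checkpoint has to be verified from Python rather than with du, a minimal sketch could look like the following. The helper name dir_size_mb is hypothetical. Symlinks are skipped because the Hugging Face cache layout links snapshot files to blobs, which would otherwise be counted twice.

```python
from pathlib import Path


def dir_size_mb(root):
    """Sum the sizes of regular files under root, in MB (1 MB = 1024**2 bytes)."""
    total = sum(
        p.stat().st_size
        for p in Path(root).rglob("*")
        if p.is_file() and not p.is_symlink()
    )
    return total / (1024 ** 2)
```

For example, dir_size_mb("/opt/kyc-service/models/whisper-turbo") should exceed 800 once the download has finished. The value can differ slightly from du, which reports disk blocks rather than file lengths.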


STEP 10.3: Create app/whisper_stt.py
───────────────────────────────────────

Create the file /opt/kyc-service/app/whisper_stt.py with this EXACT content:

-------- DÉBUT DU FICHIER app/whisper_stt.py --------
"""
Whisper Large-v3 Turbo module: speech-to-text
License  : MIT
Impl.    : faster-whisper (CTranslate2 int8, CPU-optimised)
Capacity : ~1.5 GB RAM | 99 languages
Speed    : the often-quoted "216× real time" is not a CPU figure; measure the
           real throughput on this server from duration_sec / processing_sec
"""
import os
import logging
import tempfile
from pathlib import Path
from typing import Optional

logger = logging.getLogger("kyc.whisper")

_whisper_model = None

MODELS_DIR   = Path("/opt/kyc-service/models/whisper-turbo")
WHISPER_ID   = os.getenv("WHISPER_MODEL",       "large-v3-turbo")
DEVICE       = os.getenv("WHISPER_DEVICE",      "cpu")
COMPUTE_TYPE = os.getenv("WHISPER_COMPUTE_TYPE", "int8")
BEAM_SIZE    = int(os.getenv("WHISPER_BEAM_SIZE", "5"))


def load_whisper() -> None:
    """
    Charge Whisper Turbo en mémoire — appeler UNE SEULE FOIS au startup FastAPI.
    Le modèle reste en mémoire entre les requêtes (aucun rechargement).
    """
    global _whisper_model
    if _whisper_model is not None:
        return

    logger.info("Chargement de Whisper Large-v3 Turbo (faster-whisper)...")
    from faster_whisper import WhisperModel

    _whisper_model = WhisperModel(
        WHISPER_ID,
        device=DEVICE,
        compute_type=COMPUTE_TYPE,
        download_root=str(MODELS_DIR),
    )
    logger.info("✅ Whisper Large-v3 Turbo chargé — prêt")


def is_whisper_ready() -> bool:
    return _whisper_model is not None


def transcribe_audio(
    audio_bytes: bytes,
    language:    Optional[str] = None,
    task:        str = "transcribe",
) -> dict:
    """
    Transcrit un fichier audio en texte.

    Args:
        audio_bytes : contenu binaire du fichier audio (wav, mp3, mp4, m4a, ogg...)
        language    : code langue ISO-639 ('fr', 'en', 'ar'...) ou None pour auto-détection
        task        : 'transcribe' (même langue) ou 'translate' (vers anglais)

    Returns:
        dict avec les clés :
            - success        : bool
            - text           : texte complet transcrit
            - segments       : liste de segments avec timestamps
            - language       : langue détectée
            - duration_sec   : durée audio en secondes
            - processing_sec : temps de traitement
            - error          : str | None
    """
    if _whisper_model is None:
        return {"success": False, "text": "", "segments": [],
                "language": None, "duration_sec": 0,
                "processing_sec": 0, "error": "Whisper non chargé"}

    import time
    start = time.time()
    tmp_path = None

    try:
        # Sauvegarder l'audio temporairement
        suffix = ".wav"
        with tempfile.NamedTemporaryFile(
            suffix=suffix, delete=False, dir="/tmp"
        ) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name

        # Transcription
        segments_gen, info = _whisper_model.transcribe(
            tmp_path,
            language=language,
            task=task,
            beam_size=BEAM_SIZE,
            word_timestamps=True,
            vad_filter=True,          # filtre les silences automatiquement
            vad_parameters={"min_silence_duration_ms": 500},
        )

        # Consommer le générateur
        segments_list = []
        full_text_parts = []
        for seg in segments_gen:
            segments_list.append({
                "start":  round(seg.start, 2),
                "end":    round(seg.end, 2),
                "text":   seg.text.strip(),
            })
            full_text_parts.append(seg.text.strip())

        elapsed = round(time.time() - start, 2)
        full_text = " ".join(full_text_parts)

        logger.info(
            f"Transcription OK — {len(segments_list)} segments, "
            f"durée audio {round(info.duration, 1)}s, "
            f"traitement {elapsed}s"
        )

        return {
            "success":        True,
            "text":           full_text,
            "segments":       segments_list,
            "language":       info.language,
            "duration_sec":   round(info.duration, 2),
            "processing_sec": elapsed,
            "error":          None,
        }

    except Exception as e:
        logger.error(f"Erreur Whisper : {e}")
        return {
            "success": False, "text": "", "segments": [],
            "language": None, "duration_sec": 0,
            "processing_sec": round(time.time() - start, 2),
            "error": str(e),
        }
    finally:
        if tmp_path and Path(tmp_path).exists():
            Path(tmp_path).unlink()
-------- FIN DU FICHIER app/whisper_stt.py --------
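
The dict returned by transcribe_audio carries both the audio duration and the processing time, so CPU throughput can be measured rather than assumed. A minimal sketch, where speed_factor is a hypothetical helper and the sample numbers are made up:

```python
def speed_factor(result):
    """Seconds of audio processed per wall-clock second; None if not computable."""
    if not result.get("success") or result.get("processing_sec", 0) <= 0:
        return None
    return result["duration_sec"] / result["processing_sec"]


# Made-up result: 60 s of audio transcribed in 12 s
sample = {"success": True, "duration_sec": 60.0, "processing_sec": 12.0}
print(speed_factor(sample))  # 5.0
```

A factor below 1 means transcription is slower than real time on this machine. Note that duration_sec is the full audio length, while vad_filter=True skips silent parts, so recordings with long silences will show a higher factor.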


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 11: INSTALLING DISTILBERT (SENTIMENT & NLP CLASSIFICATION)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

STEP 11.1: Install the DistilBERT dependencies
───────────────────────────────────────────────────

    source /opt/kyc-service/venv/bin/activate

    # transformers is already installed (required by HyperFace in PRD 1)
    # Check the version
    pip show transformers | grep Version

    # Run this only if the version is below 4.40.
    # With --upgrade, pip installs the latest release that satisfies the constraint.
    pip install "transformers>=4.40.0" --upgrade

CHECKPOINT ✅ : pip show transformers must report Version >= 4.40.0


STEP 11.2: Pre-download the DistilBERT models
──────────────────────────────────────────────────────

Create the script /opt/kyc-service/scripts/download_distilbert.py:

-------- START OF SCRIPT download_distilbert.py --------
"""
Downloads two sentiment models from Hugging Face:
1. distilbert-base-uncased-finetuned-sst-2-english
   → general sentiment (positive/negative), 268 MB, Apache 2.0
2. cardiffnlp/twitter-roberta-base-sentiment-latest
   → tweet / social-media sentiment (negative/neutral/positive), 499 MB
     Its model card lists its own license (CC-BY-4.0), so verify it before
     commercial use.
"""
import os
os.environ["HF_HOME"] = "/opt/kyc-service/models"

from transformers import pipeline

MODELS = [
    {
        "id":   "distilbert-base-uncased-finetuned-sst-2-english",
        "desc": "General sentiment (SST-2), 91.3% dev-set accuracy per the model card",
        "task": "sentiment-analysis",
    },
    {
        "id":   "cardiffnlp/twitter-roberta-base-sentiment-latest",
        "desc": "Social-media sentiment: negative/neutral/positive",
        "task": "sentiment-analysis",
    },
]

for m in MODELS:
    print(f"\nDownloading : {m['id']}")
    print(f"Description : {m['desc']}")
    pipe = pipeline(m["task"], model=m["id"], device=-1)
    # Quick test
    result = pipe("This product is absolutely amazing!")
    print(f"✅ OK, test: {result[0]['label']} ({result[0]['score']:.2%})")

print("\n✅ All DistilBERT/RoBERTa models are ready!")
-------- END OF SCRIPT --------

    python /opt/kyc-service/scripts/download_distilbert.py

CHECKPOINT ✅ : "✅ All DistilBERT/RoBERTa models are ready!" must be printed.
CHECKPOINT ✅ : Both sentiment tests must return a positive label (POSITIVE for
               DistilBERT, positive for RoBERTa) with a score above 90%.


STEP 11.3: Create app/sentiment.py
──────────────────────────────────────

Create the file /opt/kyc-service/app/sentiment.py with this EXACT content:

-------- DÉBUT DU FICHIER app/sentiment.py --------
"""
DistilBERT / RoBERTa module: sentiment analysis & NLP classification
License  : Apache 2.0 (DistilBERT SST-2); see the model card for the RoBERTa model
Models   : distilbert-base-uncased-finetuned-sst-2-english
           cardiffnlp/twitter-roberta-base-sentiment-latest
RAM      : ~270 MB (DistilBERT) + ~500 MB (RoBERTa), both loaded at startup
Speed    : ~1000 texts/second on CPU (batch) is an unverified estimate;
           benchmark on the target server
"""
import os
import logging
from typing import List, Optional

logger = logging.getLogger("kyc.sentiment")

_sentiment_general  = None    # DistilBERT SST-2
_sentiment_social   = None    # RoBERTa social media

HF_CACHE = "/opt/kyc-service/models"
os.environ["HF_HOME"] = HF_CACHE

DISTILBERT_MODEL_ID = os.getenv(
    "DISTILBERT_MODEL", "distilbert-base-uncased-finetuned-sst-2-english"
)
ROBERTA_MODEL_ID = "cardiffnlp/twitter-roberta-base-sentiment-latest"
MAX_LENGTH   = int(os.getenv("SENTIMENT_MAX_LENGTH",  "512"))
BATCH_SIZE   = int(os.getenv("SENTIMENT_BATCH_SIZE",  "32"))


def load_sentiment_models() -> None:
    """Charge les modèles de sentiment — appeler au startup FastAPI."""
    global _sentiment_general, _sentiment_social

    logger.info("Chargement de DistilBERT (sentiment général)...")
    from transformers import pipeline
    _sentiment_general = pipeline(
        "sentiment-analysis",
        model=DISTILBERT_MODEL_ID,
        device=-1,                  # CPU
        truncation=True,
        max_length=MAX_LENGTH,
    )

    logger.info("Chargement de RoBERTa (sentiment social media)...")
    _sentiment_social = pipeline(
        "sentiment-analysis",
        model=ROBERTA_MODEL_ID,
        device=-1,
        truncation=True,
        max_length=MAX_LENGTH,
    )
    logger.info("✅ Modèles de sentiment chargés")


def is_sentiment_ready() -> bool:
    return _sentiment_general is not None


def analyze_sentiment(
    texts:     List[str],
    mode:      str = "general",
) -> dict:
    """
    Analyse le sentiment d'une liste de textes.

    Args:
        texts : liste de textes à analyser (max 512 tokens chacun)
        mode  : 'general' (positif/négatif) ou 'social' (négatif/neutre/positif)

    Returns:
        dict avec :
            - success  : bool
            - results  : liste de {text, label, score, sentiment_fr}
            - summary  : {positive_pct, negative_pct, neutral_pct, dominant}
            - model    : modèle utilisé
            - error    : str | None
    """
    model = _sentiment_general if mode == "general" else _sentiment_social
    if model is None:
        return {"success": False, "results": [], "summary": {},
                "model": mode, "error": "Modèle non chargé"}

    # Table de traduction labels → français
    label_fr = {
        "POSITIVE":  "positif",
        "NEGATIVE":  "négatif",
        "NEUTRAL":   "neutre",
        "LABEL_0":   "négatif",
        "LABEL_1":   "neutre",
        "LABEL_2":   "positif",
    }

    try:
        # Traitement par batches pour performance
        raw_results = model(texts, batch_size=BATCH_SIZE, truncation=True)

        results = []
        pos = neg = neu = 0
        for text, res in zip(texts, raw_results):
            label    = res["label"].upper()
            score    = round(res["score"], 4)
            sent_fr  = label_fr.get(label, label.lower())

            if "positif" in sent_fr:  pos += 1
            elif "négatif" in sent_fr: neg += 1
            else:                      neu += 1

            results.append({
                "text":         text[:200] + "..." if len(text) > 200 else text,
                "label":        label,
                "score":        score,
                "sentiment_fr": sent_fr,
            })

        total = len(texts) or 1
        summary = {
            "positive_pct": round(pos / total * 100, 1),
            "negative_pct": round(neg / total * 100, 1),
            "neutral_pct":  round(neu / total * 100, 1),
            "dominant":     max(
                [("positif", pos), ("négatif", neg), ("neutre", neu)],
                key=lambda x: x[1]
            )[0],
            "total_analyzed": total,
        }

        logger.info(
            f"Sentiment analysé — {total} textes, "
            f"dominant : {summary['dominant']}"
        )
        return {
            "success": True, "results": results,
            "summary": summary, "model": mode, "error": None,
        }

    except Exception as e:
        logger.error(f"Erreur sentiment : {e}")
        return {"success": False, "results": [], "summary": {},
                "model": mode, "error": str(e)}
-------- FIN DU FICHIER app/sentiment.py --------
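
The summary step of analyze_sentiment is plain arithmetic, and a worked example makes its edge cases visible. The sketch below mirrors that step in isolation; summarize is a hypothetical stand-alone helper, not a function of the module. Unlike the module, which counts every label that is neither positive nor negative as neutral, it only counts exact "neutre" entries.

```python
def summarize(sentiments):
    """Percentages per class and the dominant class, as in the summary step."""
    total = len(sentiments) or 1
    counts = [
        ("positif", sentiments.count("positif")),
        ("négatif", sentiments.count("négatif")),
        ("neutre", sentiments.count("neutre")),
    ]
    pct = {name: round(n / total * 100, 1) for name, n in counts}
    # max() returns the first maximal entry, so ties resolve in list order
    dominant = max(counts, key=lambda item: item[1])[0]
    return pct, dominant


print(summarize(["positif", "négatif", "positif"]))
print(summarize(["négatif", "positif"]))  # tie between two classes
```

Two consequences follow from this logic: a tie is reported as "positif" whenever positive is among the tied classes, and an empty input also yields "positif" as dominant. Callers may want to treat both cases explicitly.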


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 12: INSTALLING PYANNOTE SPEAKER DIARIZATION 3.1
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

MANDATORY PREREQUISITE before this phase:
    → Complete the 7 Hugging Face steps (accepting the model conditions and
      generating a token) and set the token in .env (see Step 9.3)

STEP 12.1: Install the Pyannote dependencies
─────────────────────────────────────────────────

    source /opt/kyc-service/venv/bin/activate

    # Note the torch version already installed; torchaudio must match it
    pip show torch | grep Version

    pip install pyannote.audio==3.3.2
    pip install torchaudio==2.3.1

    # Verify the installation
    python -c "import pyannote.audio; print('pyannote OK', pyannote.audio.__version__)"

CHECKPOINT ✅ : "pyannote OK 3.3.x" must be printed.

NOTE: torchaudio must match the torch version already installed.
    Check: pip show torch | grep Version
    If torch == 2.3.1 → torchaudio == 2.3.1 (correct)
    If the versions do not match, install the torchaudio release that has the
    same version number as torch (X.Y.Z below) from the CPU wheel index:
        pip install torchaudio==X.Y.Z --index-url https://download.pytorch.org/whl/cpu

STEP 12.2: Pre-download Pyannote
────────────────────────────────────────

Create the script /opt/kyc-service/scripts/download_pyannote.py:

-------- DÉBUT DU SCRIPT download_pyannote.py --------
"""
Downloads Pyannote Speaker Diarization 3.1 from Hugging Face.
Model        : pyannote/speaker-diarization-3.1
License      : MIT (commercial use allowed)
Size         : ~800 MB (segmentation + embeddings)
PREREQUISITE : valid Hugging Face token in .env
               + conditions accepted on huggingface.co/pyannote/
"""
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv("/opt/kyc-service/.env")

HF_TOKEN = os.getenv("PYANNOTE_HF_TOKEN")
if not HF_TOKEN or HF_TOKEN == "METTRE_VOTRE_TOKEN_HUGGINGFACE_ICI":
    print("❌ ERREUR : Token HuggingFace manquant dans .env")
    print("   Renseigner PYANNOTE_HF_TOKEN dans /opt/kyc-service/.env")
    print("   Puis accepter les conditions sur :")
    print("   → https://huggingface.co/pyannote/speaker-diarization-3.1")
    print("   → https://huggingface.co/pyannote/segmentation-3.0")
    exit(1)

os.environ["HF_HOME"] = "/opt/kyc-service/models"

print(f"Token HF détecté : {HF_TOKEN[:8]}...")
print("Téléchargement de pyannote/speaker-diarization-3.1...")
print("Cela peut prendre 5-10 minutes (~800 MB)...\n")

from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,
)
pipeline.to(torch.device("cpu"))

print("✅ Pyannote Speaker Diarization 3.1 téléchargé et prêt !")
print("Test sur audio synthétique...")

# Test minimal avec audio de 5 secondes
import numpy as np
import soundfile as sf
import tempfile

audio = np.random.randn(80000).astype(np.float32) * 0.01
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
    sf.write(f.name, audio, 16000)
    test_path = f.name

try:
    result = pipeline(test_path)
    print("✅ Test OK: pipeline is functional")
except Exception as e:
    # The test signal is low-level noise without speech, so a failure here is not fatal
    print(f"⚠️  Minimal audio test: {e} (not fatal: the test signal contains no speech)")
finally:
    os.unlink(test_path)

print("\nPyannote est prêt pour la production.")
-------- FIN DU SCRIPT --------

    python /opt/kyc-service/scripts/download_pyannote.py

CHECKPOINT ✅ : "✅ Pyannote Speaker Diarization 3.1 téléchargé et prêt !" must be printed.
CHECKPOINT ✅ : du -sh /opt/kyc-service/models/ must show an increase of about 800 MB.


STEP 12.3: Create app/diarization.py
────────────────────────────────────────

Create the file /opt/kyc-service/app/diarization.py with this EXACT content:

-------- DÉBUT DU FICHIER app/diarization.py --------
"""
Pyannote Audio 3.1 module: speaker diarization (who spoke when)
License  : MIT
Purpose  : identify and separate the speakers in an audio recording
Combined : with Whisper, yields a transcript attributed per speaker
RAM      : ~800 MB
"""
import os
import logging
import tempfile
from pathlib import Path
from typing import Optional
from dotenv import load_dotenv

load_dotenv("/opt/kyc-service/.env")
logger = logging.getLogger("kyc.diarization")

_pyannote_pipeline = None
HF_TOKEN = os.getenv("PYANNOTE_HF_TOKEN")
HF_CACHE = "/opt/kyc-service/models"
os.environ["HF_HOME"] = HF_CACHE

MIN_SPEAKERS = int(os.getenv("PYANNOTE_MIN_SPEAKERS", "1"))
MAX_SPEAKERS = int(os.getenv("PYANNOTE_MAX_SPEAKERS", "10"))


def load_pyannote() -> None:
    """Charge Pyannote — appeler au startup FastAPI."""
    global _pyannote_pipeline
    if _pyannote_pipeline is not None:
        return

    if not HF_TOKEN or HF_TOKEN == "METTRE_VOTRE_TOKEN_HUGGINGFACE_ICI":
        logger.warning("⚠️  Token HuggingFace manquant — Pyannote désactivé")
        return

    logger.info("Chargement de Pyannote Speaker Diarization 3.1...")
    import torch
    from pyannote.audio import Pipeline

    _pyannote_pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=HF_TOKEN,
    )
    _pyannote_pipeline.to(torch.device("cpu"))
    logger.info("✅ Pyannote chargé et prêt")


def is_diarization_ready() -> bool:
    return _pyannote_pipeline is not None


def diarize_audio(
    audio_bytes:  bytes,
    num_speakers: Optional[int] = None,
) -> dict:
    """
    Identifie qui parle et quand dans un enregistrement audio.

    Args:
        audio_bytes  : contenu binaire du fichier audio
        num_speakers : nombre de locuteurs si connu, sinon auto-détection

    Returns:
        dict avec :
            - success      : bool
            - speakers     : liste des identifiants locuteurs uniques
            - segments     : liste de {start, end, speaker, duration}
            - num_speakers : nombre de locuteurs détectés
            - error        : str | None
    """
    if _pyannote_pipeline is None:
        return {"success": False, "speakers": [], "segments": [],
                "num_speakers": 0, "error": "Pyannote non chargé ou token manquant"}

    tmp_path = None
    try:
        with tempfile.NamedTemporaryFile(
            suffix=".wav", delete=False, dir="/tmp"
        ) as tmp:
            tmp.write(audio_bytes)
            tmp_path = tmp.name

        kwargs = {"min_speakers": MIN_SPEAKERS, "max_speakers": MAX_SPEAKERS}
        if num_speakers:
            kwargs = {"num_speakers": num_speakers}

        diarization = _pyannote_pipeline(tmp_path, **kwargs)

        segments = []
        speakers_seen = set()
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            speakers_seen.add(speaker)
            segments.append({
                "start":    round(turn.start, 2),
                "end":      round(turn.end, 2),
                "duration": round(turn.end - turn.start, 2),
                "speaker":  speaker,
            })

        logger.info(
            f"Diarisation OK — {len(speakers_seen)} locuteurs, "
            f"{len(segments)} segments"
        )

        return {
            "success":      True,
            "speakers":     sorted(list(speakers_seen)),
            "segments":     segments,
            "num_speakers": len(speakers_seen),
            "error":        None,
        }

    except Exception as e:
        logger.error(f"Erreur diarisation : {e}")
        return {"success": False, "speakers": [], "segments": [],
                "num_speakers": 0, "error": str(e)}
    finally:
        if tmp_path and Path(tmp_path).exists():
            Path(tmp_path).unlink()


def merge_transcript_with_speakers(
    transcript_segments: list,
    diarization_segments: list,
) -> list:
    """
    Fusionne la transcription Whisper avec la diarisation Pyannote.
    Associe chaque segment de texte à son locuteur en comparant les timestamps.

    Returns:
        liste de {start, end, speaker, text}
    """
    merged = []
    for t_seg in transcript_segments:
        # Pick the diarization turn with the largest time overlap
        best_speaker = "SPEAKER_UNKNOWN"
        best_overlap = 0.0

        for d_seg in diarization_segments:
            overlap_start = max(t_seg["start"], d_seg["start"])
            overlap_end   = min(t_seg["end"],   d_seg["end"])
            overlap       = max(0, overlap_end - overlap_start)
            if overlap > best_overlap:
                best_overlap  = overlap
                best_speaker  = d_seg["speaker"]

        merged.append({
            "start":   t_seg["start"],
            "end":     t_seg["end"],
            "speaker": best_speaker,
            "text":    t_seg["text"],
        })

    return merged
-------- FIN DU FICHIER app/diarization.py --------
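
The merge rule used by merge_transcript_with_speakers is easiest to follow on a small example. The sketch below restates the overlap logic as two stand-alone helpers (hypothetical names) and applies it to made-up timestamps: a transcript segment from 3 s to 6 s shares 1 s with the first speaker and 2 s with the second, so it is assigned to the second.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speaker(segment, turns, default="SPEAKER_UNKNOWN"):
    """Return the speaker whose turn overlaps the segment the most."""
    best, best_overlap = default, 0.0
    for turn in turns:
        shared = overlap(segment["start"], segment["end"], turn["start"], turn["end"])
        if shared > best_overlap:
            best, best_overlap = turn["speaker"], shared
    return best


# Made-up diarization turns and one transcript segment
turns = [
    {"start": 0.0, "end": 4.0, "speaker": "SPEAKER_00"},
    {"start": 4.0, "end": 9.0, "speaker": "SPEAKER_01"},
]
segment = {"start": 3.0, "end": 6.0, "text": "Bonjour"}
print(assign_speaker(segment, turns))  # SPEAKER_01
```

A segment that overlaps no turn at all keeps the default label, which is how SPEAKER_UNKNOWN can appear in the merged output.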


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 13: INSTALLING BGE-M3 / ALL-MINILM-L6 (SEMANTIC EMBEDDINGS)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

STEP 13.1: Install the embedding dependencies
───────────────────────────────────────────────────

    source /opt/kyc-service/venv/bin/activate

    pip install sentence-transformers==3.3.1
    pip install faiss-cpu==1.9.0.post1

    # Verify
    python -c "import sentence_transformers; print('sentence-transformers OK', sentence_transformers.__version__)"
    python -c "import faiss; print('faiss OK')"

CHECKPOINT ✅ : Both "OK" lines must be printed.


STEP 13.2: Pre-download the embedding models
───────────────────────────────────────────────────────

Create the script /opt/kyc-service/scripts/download_embeddings.py:

-------- DÉBUT DU SCRIPT download_embeddings.py --------
"""
Downloads two semantic embedding models:
1. all-MiniLM-L6-v2    → ultra-light, 22M params, ~90 MB, real time (Apache 2.0)
2. BAAI/bge-m3         → highest accuracy, multilingual, 567M params, ~2.3 GB in fp32 (MIT)
"""
import os
os.environ["HF_HOME"] = "/opt/kyc-service/models"

from sentence_transformers import SentenceTransformer
import numpy as np

MODELS = [
    {
        "id":   "sentence-transformers/all-MiniLM-L6-v2",
        "desc": "Ultra-light, real time: 22M params, ~90 MB",
    },
    {
        "id":   "BAAI/bge-m3",
        "desc": "Multilingual, high accuracy: 567M params, ~2.3 GB (fp32)",
    },
]

for m in MODELS:
    print(f"\nTéléchargement : {m['id']}")
    print(f"Description    : {m['desc']}")

    model = SentenceTransformer(m["id"])

    # Test d'embedding
    sentences = [
        "Bonjour, comment puis-je vous aider ?",
        "Hello, how can I help you?",
        "مرحبا، كيف يمكنني مساعدتك؟",
    ]
    embeddings = model.encode(sentences)
    sim = np.dot(embeddings[0], embeddings[1]) / (
        np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
    )
    print(f"✅ OK — dimension {embeddings.shape[1]}D, similarité FR/EN : {sim:.3f}")

print("\n✅ Tous les modèles d'embeddings sont prêts !")
-------- FIN DU SCRIPT --------

    python /opt/kyc-service/scripts/download_embeddings.py

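CHECKPOINT ✅ : "✅ Tous les modèles d'embeddings sont prêts !" must be printed.
CHECKPOINT ✅ : FR/EN similarity must be > 0.7 for BGE-M3 (semantically close sentences).
               all-MiniLM-L6-v2 is trained mainly on English text and may score
               lower on this cross-language pair; a lower value for that model
               alone does not indicate a failed install.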
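
The similarity printed by the script is the cosine of the angle between two embedding vectors: their dot product divided by the product of their norms. As a dependency-free illustration of that formula (the script itself uses NumPy), a minimal sketch with a hypothetical helper:

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Values close to 1 mean the two sentences point in nearly the same direction in embedding space, which is what the 0.7 threshold in the checkpoint tests.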


ÉTAPE 13.3 — Créer app/embeddings.py
──────────────────────────────────────

Créer le fichier /opt/kyc-service/app/embeddings.py avec ce contenu EXACT :

-------- DÉBUT DU FICHIER app/embeddings.py --------
"""
Semantic embeddings module: BGE-M3 + all-MiniLM-L6
Licenses : Apache-2.0 (all-MiniLM-L6-v2), MIT (BAAI/bge-m3)
Usage    : semantic search, text similarity, RAG index
Models   :
  - all-MiniLM-L6-v2 : very light (22M params, ~90 MB), real time, English-centric
  - BAAI/bge-m3       : high-accuracy multilingual (567M params, ~2.3 GB in fp32)
"""
import os
import logging
import numpy as np
from typing import List, Optional

logger = logging.getLogger("kyc.embeddings")

_model_mini = None   # all-MiniLM-L6-v2 — rapide, temps réel
_model_bge  = None   # BAAI/bge-m3 — précision maximale, multilingue

HF_CACHE = "/opt/kyc-service/models"
os.environ["HF_HOME"] = HF_CACHE

MINI_MODEL_ID = os.getenv("MINIML_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
BGE_MODEL_ID  = os.getenv("BGE_MODEL",    "BAAI/bge-m3")
BATCH_SIZE    = int(os.getenv("EMBEDDING_BATCH_SIZE", "64"))


def load_embedding_models() -> None:
    """Charge les modèles d'embeddings — appeler au startup FastAPI."""
    global _model_mini, _model_bge

    logger.info("Chargement de all-MiniLM-L6-v2...")
    from sentence_transformers import SentenceTransformer
    _model_mini = SentenceTransformer(MINI_MODEL_ID)

    logger.info("Chargement de BAAI/bge-m3...")
    _model_bge = SentenceTransformer(BGE_MODEL_ID)

    logger.info("✅ Modèles d'embeddings chargés")


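def is_embeddings_ready() -> bool:
    # /health reports this flag under "bge_m3", so both models must be loaded
    return _model_mini is not None and _model_bge is not None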


def encode_texts(
    texts:      List[str],
    model_type: str = "mini",
) -> dict:
    """
    Generates embedding vectors for a list of texts.

    Args:
        texts      : texts to encode
        model_type : 'mini' (fastest) or 'bge' (high-accuracy multilingual)

    Returns:
        dict with:
            - success    : bool
            - embeddings : list of N vectors, each a list of D floats (JSON-serialisable)
            - dimension  : vector size (384 for mini, 1024 for bge)
            - model      : model used
            - error      : str | None
    """
    # Reject unknown values instead of silently falling back to bge
    if model_type not in ("mini", "bge"):
        return {"success": False, "embeddings": [], "dimension": 0,
                "model": model_type, "error": "model_type doit être 'mini' ou 'bge'"}
    model = _model_mini if model_type == "mini" else _model_bge
    if model is None:
        return {"success": False, "embeddings": [], "dimension": 0,
                "model": model_type, "error": "Modèle non chargé"}
    try:
        embeddings = model.encode(
            texts,
            batch_size=BATCH_SIZE,
            show_progress_bar=False,
            normalize_embeddings=True,   # norme L2 = 1 → similarité cosinus = produit scalaire
        )
        return {
            "success":    True,
            "embeddings": embeddings.tolist(),
            "dimension":  embeddings.shape[1],
            "model":      model_type,
            "error":      None,
        }
    except Exception as e:
        logger.error(f"Erreur embeddings : {e}")
        return {"success": False, "embeddings": [], "dimension": 0,
                "model": model_type, "error": str(e)}


def semantic_similarity(
    text1: str,
    text2: str,
    model_type: str = "mini",
) -> dict:
    """
    Computes the semantic similarity of two texts.
    Returns the cosine similarity, which lies in [-1, 1]; for natural-language
    sentences it usually falls between 0 (unrelated) and 1 (same meaning).
    """
    result = encode_texts([text1, text2], model_type)
    if not result["success"]:
        return {"success": False, "score": 0.0, "error": result["error"]}

    embs = np.array(result["embeddings"])
    score = float(np.dot(embs[0], embs[1]))   # dot product sur vecteurs normalisés = cosine sim
    return {"success": True, "score": round(score, 4), "error": None}
-------- FIN DU FICHIER app/embeddings.py --------
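
The module's stated uses include semantic search and RAG. As a hedged sketch of how the vectors returned in `encode_texts(...)["embeddings"]` could be ranked, here is a brute-force top-k search in pure Python. The function `top_k` is a hypothetical helper, and it assumes the vectors are already L2-normalised (as they are with `normalize_embeddings=True`), so the dot product equals the cosine similarity:

```python
from typing import List, Sequence, Tuple


def top_k(
    query: Sequence[float],
    corpus: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (corpus index, score) pairs for the k best matches.

    Assumption: every vector is L2-normalised, so dot product == cosine.
    """
    scored = [
        (index, sum(q * c for q, c in zip(query, vector)))
        for index, vector in enumerate(corpus)
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Brute force is adequate for a few thousand vectors. For larger corpora, an inner-product index such as `faiss.IndexFlatIP` from the faiss-cpu package installed in step 13.1 would be the natural replacement.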


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 14 — INSTALLATION DONUT (COMPRÉHENSION DOCUMENTS STRUCTURÉS)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ÉTAPE 14.1 — Installer les dépendances Donut
─────────────────────────────────────────────

    source /opt/kyc-service/venv/bin/activate

    pip install "transformers[vision]>=4.40.0"
    pip install timm==1.0.11
    pip install Pillow==10.4.0   # déjà installé — pas de conflit

    # Vérifier
    python -c "from transformers import DonutProcessor; print('Donut OK')"

CHECKPOINT ✅ : "Donut OK" doit s'afficher.


ÉTAPE 14.2 — Pré-télécharger Donut
────────────────────────────────────

Créer le script /opt/kyc-service/scripts/download_donut.py :

-------- DÉBUT DU SCRIPT download_donut.py --------
"""
Downloads Donut fine-tuned on CORD-v2, a dataset of scanned receipts.
Model   : naver-clova-ix/donut-base-finetuned-cord-v2
Usage   : structured field extraction from receipts (invoices differ from the
          training data, so results on them may be weaker)
License : MIT
Size    : ~2 GB
"""
import os
os.environ["HF_HOME"] = "/opt/kyc-service/models"

print("Téléchargement de Donut (naver-clova-ix/donut-base-finetuned-cord-v2)...")
print("Cela peut prendre 10-20 minutes (~2 GB)...\n")

from transformers import DonutProcessor, VisionEncoderDecoderModel

MODEL_ID = "naver-clova-ix/donut-base-finetuned-cord-v2"

processor = DonutProcessor.from_pretrained(MODEL_ID)
model     = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
model.eval()

print("✅ Donut téléchargé et chargé !")
print(f"   Paramètres : {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")
print("\nDonut est prêt pour l'extraction de documents structurés.")
-------- FIN DU SCRIPT --------

    python /opt/kyc-service/scripts/download_donut.py

    # NOTE : ~2 GB à télécharger — peut durer 10-20 minutes. NE PAS INTERROMPRE.

CHECKPOINT ✅ : "✅ Donut téléchargé et chargé !" doit apparaître.
CHECKPOINT ✅ : du -sh /opt/kyc-service/models/ doit afficher une augmentation de ~2 GB.
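
For an automated run, the `du -sh` check can be reproduced in Python by measuring the directory before and after the download. This is a minimal sketch with a hypothetical helper `dir_size_bytes`. It skips symbolic links because the Hugging Face cache stores each file once under `blobs/` and exposes it through symlinks under `snapshots/`:

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of regular files under root, ignoring symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Calling `dir_size_bytes("/opt/kyc-service/models")` before and after the script and subtracting gives the growth in bytes. The value is the apparent file size, so it can differ slightly from the block usage that `du` reports.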


ÉTAPE 14.3 — Créer app/doc_understanding.py
─────────────────────────────────────────────

Créer le fichier /opt/kyc-service/app/doc_understanding.py avec ce contenu EXACT :

-------- DÉBUT DU FICHIER app/doc_understanding.py --------
"""
Donut module: document understanding (beyond OCR)
License  : MIT
Usage    : structured field extraction from financial documents
           (receipts first; invoices, purchase orders and forms with less certainty)
Model    : naver-clova-ix/donut-base-finetuned-cord-v2
RAM      : ~2 GB
Complements PaddleOCR (PP-OCRv5), which reads the raw text:
           Donut understands the structure (amount, date, supplier...)
"""
import os
import re
import json
import logging
from pathlib import Path
from typing import Optional
from PIL import Image

logger = logging.getLogger("kyc.donut")

_donut_processor = None
_donut_model     = None

HF_CACHE  = "/opt/kyc-service/models"
os.environ["HF_HOME"] = HF_CACHE
MODEL_ID  = os.getenv("DONUT_MODEL", "naver-clova-ix/donut-base-finetuned-cord-v2")
MAX_LEN   = int(os.getenv("DONUT_MAX_LENGTH", "768"))


def load_donut() -> None:
    """Charge Donut — appeler au startup FastAPI."""
    global _donut_processor, _donut_model

    logger.info("Chargement de Donut (document understanding)...")
    import torch
    from transformers import DonutProcessor, VisionEncoderDecoderModel

    _donut_processor = DonutProcessor.from_pretrained(MODEL_ID)
    _donut_model     = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
    _donut_model.eval()

    logger.info("✅ Donut chargé et prêt")


def is_donut_ready() -> bool:
    return _donut_model is not None


def extract_document_fields(pil_image: Image.Image) -> dict:
    """
    Extracts structured fields from a financial document (receipt, invoice...).
    Goes beyond OCR: identifies which value is the amount, which is the date, etc.

    Args:
        pil_image : PIL image of the document

    Returns:
        dict with:
            - success       : bool
            - fields        : dict of extracted fields (amount, date, items...)
            - raw_output    : Donut output as a string
            - document_type : fixed label "receipt/invoice" (no classification is performed)
            - error         : str | None
    """
    if _donut_model is None:
        return {"success": False, "fields": {}, "raw_output": "",
                "document_type": None, "error": "Donut non chargé"}

    import torch
    try:
        # Préparer l'image
        img_rgb = pil_image.convert("RGB")

        # Tokenizer avec le prompt de décodage Donut
        task_prompt = "<s_cord-v2>"
        decoder_input_ids = _donut_processor.tokenizer(
            task_prompt,
            add_special_tokens=False,
            return_tensors="pt",
        ).input_ids

        pixel_values = _donut_processor(img_rgb, return_tensors="pt").pixel_values

        # Inférence
        with torch.no_grad():
            outputs = _donut_model.generate(
                pixel_values,
                decoder_input_ids=decoder_input_ids,
                max_length=MAX_LEN,
                early_stopping=True,
                pad_token_id=_donut_processor.tokenizer.pad_token_id,
                eos_token_id=_donut_processor.tokenizer.eos_token_id,
                use_cache=True,
                num_beams=1,
                bad_words_ids=[[_donut_processor.tokenizer.unk_token_id]],
                return_dict_in_generate=True,
            )

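        # Decode the output, then strip the EOS/PAD tokens and the leading task token
        raw_sequence = _donut_processor.batch_decode(outputs.sequences)[0]
        raw_sequence = raw_sequence.replace(_donut_processor.tokenizer.eos_token, "")
        raw_sequence = raw_sequence.replace(_donut_processor.tokenizer.pad_token, "")
        raw_sequence = re.sub(r"<.*?>", "", raw_sequence, count=1).strip()
        fields = _donut_processor.token2json(raw_sequence)

        logger.info(f"Donut extraction OK — {len(fields)} champs extraits")

        return {
            "success":       True,
            "fields":        fields if isinstance(fields, dict) else {"data": fields},
            "raw_output":    raw_sequence,   # raw token sequence, before JSON conversion
            "document_type": "receipt/invoice",
            "error":         None,
        }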

    except Exception as e:
        logger.error(f"Erreur Donut : {e}")
        return {"success": False, "fields": {}, "raw_output": "",
                "document_type": None, "error": str(e)}
-------- FIN DU FICHIER app/doc_understanding.py --------
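
The `fields` value returned by `extract_document_fields` is a nested structure (dicts and lists). A consumer that wants flat key/value pairs, for example to store them in a table, could flatten it as sketched below. `flatten_fields` is a hypothetical helper, and the sample keys in the usage note are invented for illustration rather than taken from a real Donut output:

```python
from typing import Any, Dict


def flatten_fields(node: Any, prefix: str = "") -> Dict[str, str]:
    """Flatten nested dicts/lists into {"path.to[0].leaf": "value"}."""
    flat: Dict[str, str] = {}
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten_fields(value, path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            flat.update(flatten_fields(item, f"{prefix}[{index}]"))
    else:
        # Leaf value: keep it as text, exactly as the model produced it
        flat[prefix] = str(node)
    return flat
```

For instance, `flatten_fields({"menu": [{"nm": "Latte", "price": "4.50"}], "total": {"total_price": "4.50"}})` yields the keys `menu[0].nm`, `menu[0].price` and `total.total_price`.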


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 15 — EXTENSION DE app/main.py (NOUVEAUX ENDPOINTS)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ÉTAPE 15.1 — Mettre à jour le fichier app/main.py
───────────────────────────────────────────────────

IMPORTANT : NE PAS réécrire le fichier depuis zéro.
Effectuer les TROIS modifications suivantes sur le fichier existant.

── MODIFICATION A : Ajouter les imports (en haut du fichier, après les imports existants) ──

Localiser la ligne :
    from app.utils import bytes_to_pil, validate_image_size

AJOUTER après cette ligne :

-------- IMPORTS À AJOUTER --------
# Nouveaux modules (PRD complémentaire v2.0)
from app.whisper_stt      import load_whisper, transcribe_audio, is_whisper_ready
from app.sentiment        import load_sentiment_models, analyze_sentiment, is_sentiment_ready
from app.diarization      import load_pyannote, diarize_audio, is_diarization_ready, merge_transcript_with_speakers
from app.embeddings       import load_embedding_models, encode_texts, semantic_similarity, is_embeddings_ready
from app.doc_understanding import load_donut, extract_document_fields, is_donut_ready
-------- FIN DES IMPORTS --------


── MODIFICATION B : Mettre à jour le lifespan (chargement au démarrage) ──

Localiser dans le bloc @asynccontextmanager async def lifespan(app) :
    logger.info("→ Chargement AuraFace-v1 + HyperFace...")
    load_all_face_models()

AJOUTER après cette ligne :

-------- LIGNES À AJOUTER dans lifespan --------
    logger.info("→ Chargement Whisper Large-v3 Turbo...")
    load_whisper()

    logger.info("→ Chargement DistilBERT + RoBERTa (sentiment)...")
    load_sentiment_models()

    logger.info("→ Chargement Pyannote Diarization 3.1...")
    load_pyannote()

    logger.info("→ Chargement BGE-M3 + all-MiniLM (embeddings)...")
    load_embedding_models()

    logger.info("→ Chargement Donut (document understanding)...")
    load_donut()
-------- FIN DES LIGNES LIFESPAN --------


── MODIFICATION C : Update /health and add the 6 new endpoints ──

Localiser l'endpoint @app.get("/health") et remplacer son contenu par :

-------- ENDPOINT /health MIS À JOUR --------
@app.get("/health")
def health():
    """État complet de tous les modèles — utilisé par Laravel et monitoring."""
    return {
        "status": "ok",
        "port":   20900,
        "models": {
            # PRD 1 — KYC
            "paddleocr_v5":    is_ocr_ready(),
            "auraface_v1":     is_face_ready(),
            # PRD 2 — Additionnels
            "whisper_turbo":   is_whisper_ready(),
            "distilbert":      is_sentiment_ready(),
            "pyannote_3_1":    is_diarization_ready(),
            "bge_m3":          is_embeddings_ready(),
            "donut":           is_donut_ready(),
        }
    }
-------- FIN /health --------

AJOUTER les endpoints suivants à la FIN du fichier app/main.py, avant le bloc if __name__ == "__main__":

-------- NOUVEAUX ENDPOINTS À AJOUTER --------

# ═══════════════════════════════════════════════════════════════════════════
# WHISPER — TRANSCRIPTION AUDIO
# ═══════════════════════════════════════════════════════════════════════════

@app.post("/audio/transcribe")
async def endpoint_transcribe(
    file:     UploadFile = File(..., description="Fichier audio (wav, mp3, mp4, m4a, ogg)"),
    language: str        = "fr",
    task:     str        = "transcribe",
):
    """
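    Transcribes an audio file to text with Whisper Large-v3 Turbo.
    Supports 99 languages. Throughput depends on the hardware; the ~216× real-time
    figure should not be expected on CPU.

    `language` and `task` are plain `str` parameters, so FastAPI reads them from
    the query string, not from the multipart form.

    task='transcribe' → transcription in the source language
    task='translate'  → translation into English (needs Large-v3; the Turbo model
                        was not trained for translation)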
    """
    start    = time.time()
    contents = await file.read()

    if not validate_image_size(contents, max_mb=50):   # 50 MB max pour audio
        raise HTTPException(413, "Fichier audio trop volumineux (max 50 MB)")

    result  = transcribe_audio(contents, language=language or None, task=task)
    elapsed = round(time.time() - start, 2)

    if not result["success"]:
        raise HTTPException(422, result.get("error", "Erreur de transcription"))

    return JSONResponse({"success": True, "elapsed_sec": elapsed, **result})


@app.post("/audio/transcribe-and-diarize")
async def endpoint_transcribe_diarize(
    file:         UploadFile   = File(...),
    language:     str          = "fr",
    num_speakers: Optional[int] = None,
):
    """
    Pipeline COMPLET audio :
    1. Whisper transcrit l'audio → texte + timestamps
    2. Pyannote identifie qui parle → locuteur + timestamps
    3. Fusion → chaque phrase attribuée à son locuteur

    Idéal pour : comptes-rendus de réunion, analyse d'appels, interviews.
    """
    start    = time.time()
    contents = await file.read()

    if not validate_image_size(contents, max_mb=50):
        raise HTTPException(413, "Fichier audio trop volumineux (max 50 MB)")

    # Étape 1 — Transcription Whisper
    transcript = transcribe_audio(contents, language=language or None)
    if not transcript["success"]:
        raise HTTPException(422, f"Transcription échouée : {transcript.get('error')}")

    # Étape 2 — Diarisation Pyannote (si disponible)
    diarization_result = {"success": False, "segments": [], "error": "Pyannote non actif"}
    attributed_segments = transcript["segments"]

    if is_diarization_ready():
        diarization_result = diarize_audio(contents, num_speakers=num_speakers)
        if diarization_result["success"]:
            attributed_segments = merge_transcript_with_speakers(
                transcript["segments"],
                diarization_result["segments"],
            )

    elapsed = round(time.time() - start, 2)

    return JSONResponse({
        "success":              True,
        "elapsed_sec":          elapsed,
        "full_text":            transcript["text"],
        "language":             transcript["language"],
        "duration_sec":         transcript["duration_sec"],
        "num_speakers":         diarization_result.get("num_speakers", 0),
        "attributed_segments":  attributed_segments,
        "diarization_active":   diarization_result["success"],
    })


# ═══════════════════════════════════════════════════════════════════════════
# DISTILBERT — SENTIMENT & NLP
# ═══════════════════════════════════════════════════════════════════════════

@app.post("/nlp/sentiment")
async def endpoint_sentiment(request: Request):
    """
    Analyse le sentiment d'un ou plusieurs textes avec DistilBERT.

    Body JSON attendu :
    {
        "texts": ["texte 1", "texte 2", ...],
        "mode": "general"    // "general" ou "social"
    }

    mode='general' → DistilBERT SST-2 (positive/negative; trained on English: documents, emails)
    mode='social'  → RoBERTa twitter (negative/neutral/positive; social media)
    """
    body = await request.json()
    texts = body.get("texts", [])
    mode  = body.get("mode", "general")

    if not texts:
        raise HTTPException(422, "Le champ 'texts' est obligatoire et doit être non vide")
    if len(texts) > 1000:
        raise HTTPException(422, "Maximum 1000 textes par requête")

    result = analyze_sentiment(texts, mode=mode)
    if not result["success"]:
        raise HTTPException(500, result.get("error", "Erreur analyse sentiment"))

    return JSONResponse({"success": True, **result})


# ═══════════════════════════════════════════════════════════════════════════
# BGE-M3 — EMBEDDINGS SÉMANTIQUES
# ═══════════════════════════════════════════════════════════════════════════

@app.post("/nlp/embed")
async def endpoint_embed(request: Request):
    """
    Génère des vecteurs d'embeddings sémantiques pour des textes.

    Body JSON :
    {
        "texts":      ["texte 1", "texte 2"],
        "model_type": "mini"    // "mini" (rapide) ou "bge" (précis multilingue)
    }

    Utilisé pour : recherche sémantique, chatbot RAG, recommandations, clustering.
    """
    body       = await request.json()
    texts      = body.get("texts", [])
    model_type = body.get("model_type", "mini")

    if not texts:
        raise HTTPException(422, "Le champ 'texts' est obligatoire")
    if len(texts) > 500:
        raise HTTPException(422, "Maximum 500 textes par requête")

    result = encode_texts(texts, model_type=model_type)
    if not result["success"]:
        raise HTTPException(500, result.get("error", "Erreur embedding"))

    return JSONResponse({"success": True, **result})


@app.post("/nlp/similarity")
async def endpoint_similarity(request: Request):
    """
    Computes the semantic similarity of two texts (cosine score, typically 0–1 for natural language).

    Body JSON :
    {
        "text1":      "premier texte",
        "text2":      "deuxième texte",
        "model_type": "mini"
    }
    """
    body       = await request.json()
    text1      = body.get("text1", "")
    text2      = body.get("text2", "")
    model_type = body.get("model_type", "mini")

    if not text1 or not text2:
        raise HTTPException(422, "Les champs 'text1' et 'text2' sont obligatoires")

    result = semantic_similarity(text1, text2, model_type=model_type)
    if not result["success"]:
        raise HTTPException(500, result.get("error", "Erreur similarité"))

    return JSONResponse({"success": True, **result})


# ═══════════════════════════════════════════════════════════════════════════
# DONUT — DOCUMENT UNDERSTANDING (EXTRACTION DE CHAMPS)
# ═══════════════════════════════════════════════════════════════════════════

@app.post("/document/extract-fields")
async def endpoint_extract_fields(file: UploadFile = File(...)):
    """
    Extrait les champs structurés d'un document financier (facture, reçu...).
    COMPLÉMENTAIRE à /ocr : va au-delà du texte brut pour identifier
    les champs sémantiques (montant total, TVA, date, articles, fournisseur...).

    Input  : image du document (JPG/PNG, max 5 MB)
    Output : champs extraits en JSON structuré
    """
    start    = time.time()
    contents = await file.read()

    if not validate_image_size(contents):
        raise HTTPException(413, "Image trop volumineuse (max 5 MB)")

    pil_img = bytes_to_pil(contents)
    result  = extract_document_fields(pil_img)
    elapsed = round(time.time() - start, 2)

    if not result["success"]:
        raise HTTPException(422, result.get("error", "Erreur extraction champs"))

    return JSONResponse({"success": True, "elapsed_sec": elapsed, **result})

-------- FIN DES NOUVEAUX ENDPOINTS --------
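
Step 3 of the /audio/transcribe-and-diarize pipeline attributes each transcribed segment to a speaker. The real `merge_transcript_with_speakers` lives in app/diarization.py and is not reproduced here. The sketch below is only one plausible strategy (largest time overlap wins), and the field names `start`, `end`, `text` and `speaker` are assumptions about the segment format:

```python
from typing import Dict, List


def assign_speakers(
    transcript: List[Dict],
    turns: List[Dict],
    unknown: str = "UNKNOWN",
) -> List[Dict]:
    """Attach to each transcript segment the speaker whose turn overlaps it most.

    Hypothetical helper: segments and turns are dicts with "start"/"end" in
    seconds; turns also carry a "speaker" label.
    """
    merged = []
    for segment in transcript:
        best_speaker, best_overlap = unknown, 0.0
        for turn in turns:
            # Length of the intersection of the two time intervals
            overlap = min(segment["end"], turn["end"]) - max(segment["start"], turn["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = turn["speaker"], overlap
        merged.append({**segment, "speaker": best_speaker})
    return merged
```

A segment that overlaps no speaker turn keeps the `unknown` label, which makes gaps in the diarization visible instead of guessing.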

CHECKPOINT ✅ : Check that app/main.py contains the added imports at the top.
CHECKPOINT ✅ : Check that the 5 new loading functions are called in lifespan.
CHECKPOINT ✅ : Count the endpoints: the file must contain at least 10 (4 existing + 6 new).
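
The /nlp/sentiment and /nlp/embed endpoints read the raw JSON body and only check that `texts` is non-empty and not too long. A bare string such as `"texts": "hello"` passes both checks and is then treated as a sequence of characters. A stricter check could look like the following sketch. `validate_texts` is a hypothetical helper, not part of the existing code:

```python
from typing import Any, List


def validate_texts(body: Any, max_items: int) -> List[str]:
    """Return body["texts"] if it is a non-empty list of non-blank strings.

    Raises ValueError with a short message otherwise; an endpoint could map
    that to HTTP 422.
    """
    if not isinstance(body, dict):
        raise ValueError("body must be a JSON object")
    texts = body.get("texts")
    if not isinstance(texts, list) or not texts:
        raise ValueError("'texts' must be a non-empty list")
    if len(texts) > max_items:
        raise ValueError(f"at most {max_items} texts per request")
    if not all(isinstance(item, str) and item.strip() for item in texts):
        raise ValueError("every item in 'texts' must be a non-blank string")
    return texts
```

Inside an endpoint this would be called as `validate_texts(body, 1000)` or `validate_texts(body, 500)`, matching the limits already used above.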


ÉTAPE 15.2 — Ajouter l'import manquant dans main.py
──────────────────────────────────────────────────────

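The new /nlp/* endpoints use Request, and /audio/transcribe-and-diarize uses Optional.
Check that these imports exist:

    grep "from fastapi import" /opt/kyc-service/app/main.py

If "Request" is not in the list, extend the import:
    from fastapi import FastAPI, File, UploadFile, HTTPException, Request

And add to the existing imports, if missing:
    from typing import Optional

The new endpoints also call time.time() and JSONResponse. Both are expected to be
imported already for the PRD 1 endpoints, but confirm it with grep as well.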
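
The grep above can miss an import that is split across several lines. As an alternative sketch, the standard `ast` module can list exactly which names a file imports from a given module. `imported_names` is a hypothetical helper written for this check:

```python
import ast
from typing import Set


def imported_names(source: str, module: str) -> Set[str]:
    """Return the names imported via `from <module> import ...` in source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module == module:
            names.update(alias.name for alias in node.names)
    return names
```

Reading /opt/kyc-service/app/main.py into a string and checking `"Request" in imported_names(text, "fastapi")` gives a yes/no answer that does not depend on how the import line is formatted.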


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 16 — MISE À JOUR requirements.txt ET TESTS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ÉTAPE 16.1 — Ajouter au requirements.txt
──────────────────────────────────────────

Ajouter ces lignes à la FIN de /opt/kyc-service/requirements.txt :

-------- LIGNES À AJOUTER dans requirements.txt --------
# ── PRD Complémentaire v2.0 ───────────────────────────
# Whisper Turbo
faster-whisper==1.1.1
soundfile==0.12.1
librosa==0.10.2
pydub==0.25.1
av==12.3.0

# DistilBERT / RoBERTa
# (transformers déjà présent — vérifier version >= 4.40)

# Pyannote Diarization
pyannote.audio==3.3.2
torchaudio==2.3.1

# Embeddings
sentence-transformers==3.3.1
faiss-cpu==1.9.0.post1

# Donut
timm==1.0.11
-------- FIN DES LIGNES --------
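
After editing requirements.txt, it can help to confirm that the venv really contains the pinned versions. The sketch below uses only the standard library. `parse_pins` and `find_mismatches` are hypothetical helpers, and only `name==version` lines are considered:

```python
from importlib import metadata
from typing import Dict, Optional


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Extract {package: version} from 'name==version' lines, ignoring comments."""
    pins: Dict[str, str] = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {package: installed version or None} where it differs from the pin."""
    problems: Dict[str, Optional[str]] = {}
    for name, wanted in pins.items():
        try:
            installed: Optional[str] = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = installed
    return problems
```

Run inside the activated venv, `find_mismatches(parse_pins(open("/opt/kyc-service/requirements.txt").read()))` should return an empty dict when every pinned package is installed at the pinned version.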


ÉTAPE 16.2 — Créer les tests
──────────────────────────────

Créer le fichier /opt/kyc-service/tests/test_addons.py :

-------- DÉBUT DU FICHIER tests/test_addons.py --------
"""
Tests unitaires — Modules additionnels (PRD v2.0)
Lancer avec : python tests/test_addons.py
"""
import sys
sys.path.insert(0, "/opt/kyc-service")

import numpy as np


def test_whisper_loads():
    from app.whisper_stt import load_whisper, is_whisper_ready
    load_whisper()
    assert is_whisper_ready(), "Whisper devrait être chargé"
    print("✅ test_whisper_loads : PASS")


def test_whisper_transcribes_silence():
    from app.whisper_stt import load_whisper, transcribe_audio
    import soundfile as sf, tempfile, io
    load_whisper()
    audio = np.zeros(16000, dtype=np.float32)
    buf   = io.BytesIO()
    sf.write(buf, audio, 16000, format="wav")
    result = transcribe_audio(buf.getvalue(), language="fr")
    assert result["success"] is True, f"Erreur : {result.get('error')}"
    assert isinstance(result["text"], str)
    print(f"✅ test_whisper_transcribes_silence : PASS — texte : '{result['text']}'")


def test_sentiment_loads():
    from app.sentiment import load_sentiment_models, is_sentiment_ready
    load_sentiment_models()
    assert is_sentiment_ready()
    print("✅ test_sentiment_loads : PASS")


def test_sentiment_analyzes():
    from app.sentiment import load_sentiment_models, analyze_sentiment
    load_sentiment_models()
    result = analyze_sentiment(
        ["This product is amazing!", "This is terrible and I hate it."],
        mode="general"
    )
    assert result["success"] is True
    assert len(result["results"]) == 2
    assert result["results"][0]["sentiment_fr"] == "positif"
    assert result["results"][1]["sentiment_fr"] == "négatif"
    print(f"✅ test_sentiment_analyzes : PASS — résumé : {result['summary']}")


def test_embeddings_loads():
    from app.embeddings import load_embedding_models, is_embeddings_ready
    load_embedding_models()
    assert is_embeddings_ready()
    print("✅ test_embeddings_loads : PASS")


def test_embeddings_similarity():
    from app.embeddings import load_embedding_models, semantic_similarity
    load_embedding_models()
    result = semantic_similarity(
        "Comment puis-je vous aider ?",
        "How can I help you?",
        model_type="bge"
    )
    assert result["success"] is True
    assert result["score"] > 0.7, f"Score trop faible : {result['score']}"
    print(f"✅ test_embeddings_similarity : PASS — score : {result['score']}")


def test_donut_loads():
    from app.doc_understanding import load_donut, is_donut_ready
    load_donut()
    assert is_donut_ready()
    print("✅ test_donut_loads : PASS")


if __name__ == "__main__":
    print("Tests des modules additionnels PRD v2.0")
    print("=" * 50)
    test_whisper_loads()
    test_whisper_transcribes_silence()
    test_sentiment_loads()
    test_sentiment_analyzes()
    test_embeddings_loads()
    test_embeddings_similarity()
    test_donut_loads()
    print("\n🎉 Tous les tests additionnels passent !")
-------- FIN DU FICHIER tests/test_addons.py --------

    source /opt/kyc-service/venv/bin/activate
    cd /opt/kyc-service
    python tests/test_addons.py

CHECKPOINT ✅ : "🎉 Tous les tests additionnels passent !" doit apparaître.

NOTE : Le test Pyannote est exclu des tests unitaires car il nécessite un token
       HuggingFace valide. Il sera validé lors du test d'intégration en Phase 17.


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 17 — REDÉMARRAGE DU SERVICE ET VALIDATION FINALE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ÉTAPE 17.1 — Redémarrer kyc-service
──────────────────────────────────────

    sudo systemctl restart kyc-service

    # Attendre 90 secondes (tous les modèles se chargent au démarrage)
    sleep 90

    # Vérifier que le service est actif
    sudo systemctl status kyc-service

CHECKPOINT ✅ : "active (running)" doit apparaître dans le statut.
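
A fixed `sleep 90` is either too long or too short depending on the machine. As a hedged alternative, the sketch below polls /health until it answers with `"status": "ok"`. The function names, the 300-second timeout and the 5-second interval are assumptions chosen for illustration:

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def fetch_health(url: str = "http://127.0.0.1:20900/health") -> Dict:
    """GET the health endpoint and parse its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_ready(
    fetch: Callable[[], Dict] = fetch_health,
    timeout_sec: float = 300.0,
    interval_sec: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until fetch() reports status 'ok'; return False on timeout."""
    deadline = time.monotonic() + timeout_sec
    while time.monotonic() < deadline:
        try:
            if fetch().get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Service not listening yet, or the body is not valid JSON
            pass
        sleep(interval_sec)
    return False
```

Because the models are loaded in lifespan before the server accepts requests, the first successful response already implies that the startup sequence has finished.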

    # Show the most recent startup logs (without -f, which would follow the journal and never return)
    sudo journalctl -u kyc-service --no-pager -n 100

CHECKPOINT ✅ : The logs must show each model being loaded, for example:
    "Chargement de PaddleX OCR..."
    "→ Chargement AuraFace-v1 + HyperFace..."
    "→ Chargement Whisper Large-v3 Turbo..."
    "→ Chargement DistilBERT + RoBERTa (sentiment)..."
    "→ Chargement Pyannote Diarization 3.1..." (or a warning if the token is missing)
    "→ Chargement BGE-M3 + all-MiniLM (embeddings)..."
    "→ Chargement Donut (document understanding)..."
    "Tous les modèles sont prêts ✅"


ÉTAPE 17.2 — Validation du health check étendu
────────────────────────────────────────────────

    curl -s http://127.0.0.1:20900/health | python3 -m json.tool

CHECKPOINT ✅ : La réponse doit ressembler à :
    {
        "status": "ok",
        "port": 20900,
        "models": {
            "paddleocr_v5":  true,
            "auraface_v1":   true,
            "whisper_turbo": true,
            "distilbert":    true,
            "pyannote_3_1":  true,   ← false si token absent (acceptable)
            "bge_m3":        true,
            "donut":         true
        }
    }
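
The health payload can also be checked programmatically instead of by eye. In this sketch, `models_not_ready` is a hypothetical helper that lists the models reported as `false`, while tolerating Pyannote being absent as the annotation above allows:

```python
from typing import Dict, FrozenSet, List


def models_not_ready(
    health: Dict,
    optional: FrozenSet[str] = frozenset({"pyannote_3_1"}),
) -> List[str]:
    """Return the sorted names of required models that are not loaded."""
    models = health.get("models", {})
    return sorted(
        name for name, ready in models.items()
        if not ready and name not in optional
    )
```

An empty list means the checkpoint passes. Any name in the list identifies the model whose loading logs should be inspected.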


ÉTAPE 17.3 — Test des nouveaux endpoints
──────────────────────────────────────────

    # Sentiment test (DistilBERT SST-2 is trained on English, so use English sentences)
    curl -s -X POST http://127.0.0.1:20900/nlp/sentiment \
        -H "Content-Type: application/json" \
        -d '{"texts": ["This product is amazing!", "This is terrible and I hate it."], "mode": "general"}' \
        | python3 -m json.tool

CHECKPOINT ✅ : "positif" for the first sentence, "négatif" for the second.

    # Semantic similarity test (BGE-M3)
    curl -s -X POST http://127.0.0.1:20900/nlp/similarity \
        -H "Content-Type: application/json" \
        -d '{"text1": "Comment annuler mon abonnement ?", "text2": "Je veux résilier mon contrat", "model_type": "bge"}' \
        | python3 -m json.tool

CHECKPOINT ✅ : "score" must be > 0.7 (the two sentences express closely related intents).

    # Test Transcription audio (Whisper) — avec un fichier WAV de test
    # Générer un fichier WAV de test silencieux
    python3 -c "
import numpy as np, soundfile as sf
sf.write('/tmp/test_audio.wav', np.zeros(48000, dtype=np.float32), 16000)
print('Fichier /tmp/test_audio.wav créé')
"
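    # language is a plain `str` parameter of the endpoint, so FastAPI reads it from the query string
    curl -s -X POST "http://127.0.0.1:20900/audio/transcribe?language=fr" \
        -F "file=@/tmp/test_audio.wav" \
        | python3 -m json.tool

CHECKPOINT ✅ : "success": true must appear; "text" should be empty or nearly empty
               (Whisper can hallucinate a few words on pure silence).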

    # Field extraction test (Donut): use a photo or scan of a receipt
    curl -s -X POST http://127.0.0.1:20900/document/extract-fields \
        -F "file=@/chemin/vers/une/image.jpg" \
        | python3 -m json.tool

CHECKPOINT ✅ : "success": true and a non-empty "fields": {...}.
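
The JSON endpoints can also be exercised from Python without extra dependencies. The sketch below separates building the request from sending it, so the request can be inspected offline. `build_json_post` and `post_json` are hypothetical helper names:

```python
import json
import urllib.request
from typing import Any, Dict


def build_json_post(url: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a POST request whose body is the payload encoded as UTF-8 JSON."""
    data = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def post_json(url: str, payload: Dict[str, Any], timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_json_post(url, payload), timeout=timeout) as response:
        return json.load(response)
```

For example, `post_json("http://127.0.0.1:20900/nlp/similarity", {"text1": "...", "text2": "...", "model_type": "bge"})` returns the same dictionary that the curl command prints.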


ÉTAPE 17.4 — Vérifier la consommation mémoire
───────────────────────────────────────────────

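    # Check the RAM used by the service
    sudo systemctl status kyc-service | grep Memory

    # Or, per process (RSS in MB):
    ps aux | grep uvicorn | grep -v grep | awk '{print $6/1024 " MB"}'

CHECKPOINT ✅ : Expected consumption is between 6 and 9 GB, which assumes about 15 GB
               available to the service. The header of this PRD states a 6 GB limit
               with ~4.4 GB free: under that limit the full model set does not fit, so
               confirm the actual memory limit before accepting this checkpoint.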
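
The same figure can be read without `ps` from the `VmRSS` line of /proc/<pid>/status, which Linux reports in kB. A minimal parsing sketch, with a hypothetical helper name:

```python
def rss_mb_from_status(status_text: str) -> float:
    """Extract the resident set size in MB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:   123456 kB"
            return int(line.split()[1]) / 1024
    raise ValueError("VmRSS line not found")
```

If uvicorn runs several worker processes, each has its own status file and the values have to be added up.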


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PHASE 18 — RÉSUMÉ FINAL ET PASSAGE À LARAVEL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

À la fin de ce PRD complémentaire, voici l'état complet du microservice :

MODÈLES INSTALLÉS (TOUS) :
───────────────────────────
  ✅ PaddleOCR PP-OCRv5    /opt/kyc-service/.paddlex/         ~210 MB   (PRD 1)
  ✅ AuraFace-v1           /opt/kyc-service/models/auraface/  ~408 MB   (PRD 1)
  ⚡ HyperFace-10k-LDM    /opt/kyc-service/models/hyperface/ ~373 MB   (PRD 1 — optionnel)
  ✅ Whisper Turbo         /opt/kyc-service/models/whisper-t/ ~1.5 GB   (PRD 2)
  ✅ DistilBERT SST-2      HF cache auto                      ~270 MB   (PRD 2)
  ✅ RoBERTa Social        HF cache auto                      ~499 MB   (PRD 2)
  ✅ Pyannote 3.1          HF cache auto                      ~800 MB   (PRD 2)
  ✅ all-MiniLM-L6-v2      HF cache auto                      ~90 MB    (PRD 2)
  ✅ BAAI/bge-m3           HF cache auto                      ~2.3 GB   (PRD 2)
  ✅ Donut CORD-v2         HF cache auto                      ~2.0 GB   (PRD 2)
  ─────────────────────────────────────────────────────────────────────
  TOTAL                                                        ~8.4 GB

ENDPOINTS DISPONIBLES (COMPLETS) :
────────────────────────────────────
  GET  /health                        → état des 7 modèles
  POST /ocr                           → extraction texte document (PaddleOCR)
  POST /face-match                    → comparaison visages (AuraFace)
  POST /kyc/verify                    → pipeline KYC complet
  POST /audio/transcribe              → transcription audio (Whisper Turbo)
  POST /audio/transcribe-and-diarize  → transcription + attribution locuteur
  POST /nlp/sentiment                 → analyse sentiment (DistilBERT/RoBERTa)
  POST /nlp/embed                     → vecteurs d'embeddings (BGE-M3/MiniLM)
  POST /nlp/similarity                → similarité sémantique entre 2 textes
  POST /document/extract-fields       → extraction champs factures (Donut)

NOUVEAUX FICHIERS CRÉÉS :
──────────────────────────
  app/whisper_stt.py          ← Whisper Large-v3 Turbo
  app/sentiment.py            ← DistilBERT + RoBERTa
  app/diarization.py          ← Pyannote 3.1
  app/embeddings.py           ← BGE-M3 + all-MiniLM-L6
  app/doc_understanding.py    ← Donut CORD-v2
  tests/test_addons.py        ← Tests unitaires v2.0
  scripts/download_whisper_turbo.py
  scripts/download_distilbert.py
  scripts/download_pyannote.py
  scripts/download_embeddings.py
  scripts/download_donut.py

SAAS POSSIBLES SUR CE SERVEUR (même machine, port 20900) :
────────────────────────────────────────────────────────────
  1. SaaS KYC                    → /kyc/verify (existant)
  2. SaaS Transcription Audio    → /audio/transcribe
  3. SaaS Compte-Rendu Réunion   → /audio/transcribe-and-diarize
  4. SaaS Analyse Sentiment      → /nlp/sentiment
  5. SaaS Chatbot RAG            → /nlp/embed + moteur vectoriel
  6. SaaS Comptabilité Auto      → /document/extract-fields + /ocr

PROCHAINE ÉTAPE → INTÉGRATION LARAVEL :
────────────────────────────────────────
  Confirmer "MICROSERVICE V2 OK" pour démarrer la phase Laravel complète.
  Les services Laravel à créer seront :
    - KycService.php           (existant — déjà défini dans PRD 1)
    - TranscriptionService.php (Whisper + Pyannote)
    - SentimentService.php     (DistilBERT)
    - EmbeddingService.php     (BGE-M3 — base du chatbot RAG)
    - DocumentService.php      (Donut — extraction factures)

================================================================================
FIN DU PRD COMPLÉMENTAIRE v2.0 — MODULES AI ADDITIONNELS
================================================================================