Quality Gate - KB Ingestion Pipeline Module 12
===============================================
Auto-generated quiz from KB chunks + RAG accuracy evaluation.

Stories:
    12.01 - generate_quiz: Create Q&A pairs from random KB chunks via Gemini
    12.02 - evaluate_accuracy: Run quiz against RAG pipeline and measure hit rate
    12.03 - run_quality_gate: Full pipeline (generate → evaluate → report)

Usage:
    python3 -m core.kb.quality_gate hubspot
    python3 -m core.kb.quality_gate hubspot --questions 10 --threshold 0.9

Dependencies:
    from core.kb.qdrant_store import search_platform
    from core.kb.embedder import embed_text
    from core.rag_query import rag_query
    from core.kb.pg_store import get_connection
    import google.genai as genai

Gemini model:
    gemini-2.0-flash

Quiz prompt template:
    You are a QA engineer generating a quiz to test a retrieval system.
    Given the following knowledge-base chunk, generate ONE question whose answer
    is clearly contained within the chunk text.

    Requirements:
    - The question must be specific and answerable from the chunk alone.
    - The answer must be a short phrase or sentence (not a list) extracted or
      directly inferable from the chunk.
    - Output ONLY valid JSON in this exact format (no markdown, no extra text):
      {{"question": "", "answer": ""}}

    Chunk title: {title}
    Chunk text:
    {text}

_load_api_key()
    Load GEMINI_API_KEY from environment or secrets.env file.
    Secrets file: /mnt/e/genesis-system/config/secrets.env

_get_genai_client()
    Return a singleton google.genai Client.

_generate_question_for_chunk()
    Call Gemini to produce a {question, answer} pair for a single chunk.
    Returns None on any error so the caller can skip gracefully.

    Log messages:
        Gemini response missing question/answer keys: %s
        generate_question_for_chunk failed: %s

_text_overlap()
    Heuristic: return True if at least min_words consecutive words from `a`
    appear as a sub-sequence in `b` (case-insensitive).
    Used to decide whether a RAG result text contains the expected answer.
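The consecutive-word heuristic described for `_text_overlap` can be sketched as follows. This is a minimal illustration and not the module's own code: the name `text_overlap_sketch`, the default `min_words=3`, and the whitespace normalization are assumptions made for the example.

```python
def text_overlap_sketch(a: str, b: str, min_words: int = 3) -> bool:
    """Return True if `min_words` consecutive words of `a` occur in `b`.

    Hypothetical stand-in for `_text_overlap`; the default of 3 is an
    assumption, not a value taken from the module.
    """
    a_words = a.lower().split()
    # Collapse whitespace in `b` so a phrase can match across line breaks.
    b_norm = " ".join(b.lower().split())
    if min_words < 1 or len(a_words) < min_words:
        return False
    # Slide a window of `min_words` words over `a` and look for it in `b`.
    for i in range(len(a_words) - min_words + 1):
        phrase = " ".join(a_words[i:i + min_words])
        if phrase in b_norm:
            return True
    return False
```

Because the check is a plain substring test, a phrase can also match inside longer words of `b`; a stricter variant would compare token lists instead.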
generate_quiz()
    Generate quiz questions from ingested KB to test RAG accuracy.

    Steps:
      1. Get a broad set of chunks from Qdrant for the platform (up to 200).
      2. Randomly sample num_questions chunks (or all if fewer available).
      3. For each chunk, call Gemini to generate a question answerable from that chunk.
      4. Return list of dicts:
         {question, expected_answer, source_chunk_id, source_text, source_url}

    Chunks that fail Gemini generation are silently skipped.

    Args:
        platform: The platform to build the quiz for (e.g., "hubspot").
        num_questions: Desired number of quiz items (actual may be lower if
                       fewer chunks exist or Gemini fails for some).
        customer_id: Optional customer scope for multi-tenant isolation.

    Returns:
        List of quiz item dicts.

    Log messages:
        embed_text failed during quiz generation (%s), using discovery vector
        generate_quiz: no chunks found for platform=%s
        generate_quiz: generated %d/%d questions for platform=%s
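Steps 2 to 4 of `generate_quiz` (sample, generate, skip failures) can be sketched as below. This is an illustration under stated assumptions, not the module's implementation: `build_quiz_sketch`, the `make_question` callback (standing in for the Gemini call), and the `seed` parameter are hypothetical, and chunks are assumed to be dicts with `id`, `text`, and `source_url` keys.

```python
import random
from typing import Callable, Optional


def build_quiz_sketch(
    chunks: list[dict],
    num_questions: int,
    make_question: Callable[[dict], Optional[dict]],
    seed: Optional[int] = None,
) -> list[dict]:
    """Sample chunks and turn each into a quiz item, skipping failures."""
    rng = random.Random(seed)
    # Never ask for more chunks than exist (or fewer than zero).
    sample_size = max(0, min(num_questions, len(chunks)))
    quiz = []
    for chunk in rng.sample(chunks, sample_size):
        qa = make_question(chunk)  # hypothetical stand-in for the Gemini call
        if qa is None:
            continue  # generation failed: skip this chunk
        quiz.append({
            "question": qa["question"],
            "expected_answer": qa["answer"],
            "source_chunk_id": chunk.get("id", ""),
            "source_text": chunk.get("text", ""),
            "source_url": chunk.get("source_url", ""),
        })
    return quiz
```

Passing the question generator as a callback keeps the sampling logic testable without network access.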
evaluate_accuracy()
    Run quiz against the RAG pipeline and measure retrieval accuracy.

    For each quiz item:
      1. Call rag_query(question, platform, top_k=top_k).
      2. Mark as "correct" if any result's source_url matches the quiz source_url
         OR if any result's text has significant overlap with the expected_answer.

    Args:
        quiz: Output from generate_quiz().
        platform: Platform to evaluate.
        customer_id: Optional customer scope (reserved for future use).
        pass_threshold: Minimum accuracy to consider the gate passed (0.0–1.0).
        top_k: Number of RAG results to check per question.

    Returns:
        {
            "platform": str,
            "total_questions": int,
            "correct": int,
            "accuracy": float,
            "passed": bool,
            "threshold": float,
            "details": list[dict],
            "recommendations": list[str],
        }

    Keys of each "details" entry:
        question, expected_answer, source_url, found_in_top_k, top_result_score

    Log message:
        rag_query failed for question=%r: %s

    Recommendation messages:
        No quiz items provided - run generate_quiz() first.
        Accuracy {accuracy:.1%} is {gap:.1%} below the {pass_threshold:.1%} threshold.
        Low accuracy (<50%) suggests the platform KB may have very few indexed chunks. Re-run the ingestion pipeline to populate Qdrant.
        Consider increasing chunk overlap or reducing chunk size to improve retrieval recall.
        Review failed items in 'details' to identify poorly chunked or ambiguous content.
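The scoring rule documented for `evaluate_accuracy` (a hit when a result's `source_url` matches or its text contains the expected answer, then accuracy compared with the threshold) might look roughly like this. It is a sketch under assumptions: `score_quiz_sketch` and the `retrieve` callback are hypothetical, the default `top_k=5` is a guess, and a plain substring check stands in for the module's overlap heuristic.

```python
from typing import Callable


def score_quiz_sketch(
    quiz: list[dict],
    retrieve: Callable[[str, int], list[dict]],
    pass_threshold: float = 0.8,
    top_k: int = 5,  # assumed default, not taken from the module
) -> dict:
    """Count quiz items whose source is found among the retrieved results."""
    correct = 0
    for item in quiz:
        expected = item["expected_answer"].lower()
        url = item.get("source_url", "")
        results = retrieve(item["question"], top_k)  # stand-in for rag_query
        hit = any(
            (url and r.get("source_url", "") == url)
            # Substring match is a simplification of the overlap heuristic.
            or (expected and expected in r.get("text", "").lower())
            for r in results
        )
        if hit:
            correct += 1
    total = len(quiz)
    accuracy = correct / total if total else 0.0
    return {
        "total_questions": total,
        "correct": correct,
        "accuracy": round(accuracy, 4),
        "passed": accuracy >= pass_threshold,
        "threshold": pass_threshold,
    }
```

An empty quiz yields an accuracy of 0.0 and therefore fails any positive threshold, matching the documented behavior of reporting that no quiz items were provided.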
run_quality_gate()
    Full quality gate pipeline: generate quiz → evaluate → return combined report.

    This is an async function so it can be awaited from async contexts
    (e.g., an orchestrator or FastAPI endpoint), but internally all work is
    synchronous: no I/O concurrency is introduced here.

    Args:
        platform: Platform KB to evaluate.
        customer_id: Optional customer scope.
        num_questions: Number of quiz questions to generate.
        pass_threshold: Accuracy fraction required to pass (default 0.80).

    Returns:
        Combined report dict. If no chunks found, returns a minimal NO_DATA report.

    Log messages:
        run_quality_gate: starting for platform=%s, questions=%d, threshold=%.2f
        run_quality_gate: no chunks found for platform=%s - returning NO_DATA

    NO_DATA report message:
        No chunks found for platform '{platform}'.
        Ensure KB ingestion has been run before the quality gate.

    Report status values:
        PASSED, FAILED (NO_DATA when no chunks are found)

    Log message:
        run_quality_gate: platform=%s accuracy=%.1f%% status=%s

Command-line interface (python3 -m core.kb.quality_gate):
    KB Quality Gate - auto quiz + RAG accuracy evaluation

    platform        Platform to evaluate (e.g., hubspot)
    --questions     Number of quiz questions to generate (default: 20)
    --threshold     Accuracy pass threshold 0.0–1.0 (default: 0.80)
    --customer-id   Optional customer_id scope for multi-tenant isolation
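The command-line entry point can be sketched with `argparse` and `asyncio.run`, using the argument names and help strings recoverable from the file. The coroutine `stub_quality_gate` is a hypothetical placeholder for `run_quality_gate`, and `indent=2` is an assumption; this shows the wiring only, not the module's actual entry point.

```python
import argparse
import asyncio
import json
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented CLI arguments."""
    parser = argparse.ArgumentParser(
        description="KB Quality Gate - auto quiz + RAG accuracy evaluation"
    )
    parser.add_argument("platform", help="Platform to evaluate (e.g., hubspot)")
    parser.add_argument("--questions", type=int, default=20,
                        help="Number of quiz questions to generate (default: 20)")
    parser.add_argument("--threshold", type=float, default=0.80,
                        help="Accuracy pass threshold 0.0-1.0 (default: 0.80)")
    parser.add_argument("--customer-id", default=None,
                        help="Optional customer_id scope for multi-tenant isolation")
    return parser


async def stub_quality_gate(platform: str, customer_id: Optional[str] = None,
                            num_questions: int = 20,
                            pass_threshold: float = 0.80) -> dict:
    """Hypothetical placeholder for run_quality_gate; returns a NO_DATA report."""
    return {"platform": platform, "status": "NO_DATA", "quiz": []}


def main(argv: Optional[list[str]] = None) -> dict:
    """Parse arguments, run the (stubbed) gate, and print the report as JSON."""
    args = build_parser().parse_args(argv)
    report = asyncio.run(stub_quality_gate(
        platform=args.platform,
        customer_id=args.customer_id,
        num_questions=args.questions,
        pass_threshold=args.threshold,
    ))
    print(json.dumps(report, indent=2, default=str))
    return report
```

Calling `main(["hubspot", "--questions", "10", "--threshold", "0.9"])` corresponds to the second usage line in the module docstring.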