Show HN: ZTGI Safety Gateway for LLM Safety

I built a small runtime safety layer for LLM outputs called ZTGI Safety Gateway.

This is not a new foundation model and not an AGI claim.
It is a post-generation control layer that sits between candidate outputs and final response selection.

What it does:
– Scores each candidate with two risk tracks:
  – legacy risk (`p_break`)
  – hybrid risk (`z_next`: instruction breach + sycophancy + divergence signals)
– Enforces hard blocks for:
  – security abuse prompts
  – contradiction-actionable prompts
  – high-risk finance-actionable prompts
– Returns SAFE/WARN/BREAK with telemetry.
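
To make the flow above concrete, here is a minimal sketch of how two risk tracks and hard blocks could combine into a SAFE/WARN/BREAK verdict. This is not the gateway's actual code. Only `p_break`, `z_next`, and the three verdict labels come from the post; the `Candidate` class, the `decide` function, the 0.4/0.7 thresholds, and the max-of-tracks rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical thresholds; the real gateway's values are not stated in the post.
WARN_AT = 0.4
BREAK_AT = 0.7


@dataclass
class Candidate:
    text: str
    p_break: float  # legacy risk track, assumed to lie in [0, 1]
    z_next: float   # hybrid risk track, assumed to lie in [0, 1]
    # Reasons set by hard-block checks, e.g. "security_abuse" (hypothetical label)
    hard_block_reasons: List[str] = field(default_factory=list)


def decide(c: Candidate) -> dict:
    """Return a verdict plus telemetry for one candidate output."""
    if c.hard_block_reasons:
        # Any hard block overrides the scores entirely.
        verdict = "BREAK"
    else:
        # Assumption: the riskier of the two tracks drives the verdict.
        risk = max(c.p_break, c.z_next)
        if risk >= BREAK_AT:
            verdict = "BREAK"
        elif risk >= WARN_AT:
            verdict = "WARN"
        else:
            verdict = "SAFE"
    return {
        "verdict": verdict,
        "p_break": c.p_break,
        "z_next": c.z_next,
        "hard_block_reasons": list(c.hard_block_reasons),
    }
```

Taking the maximum of the two tracks is the conservative choice in this sketch: a candidate is only SAFE if both tracks agree it is low risk. A weighted blend would trade some of that caution for fewer blocks.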

Current repo:
https://github.com/capterr/ztgi-safety-gateway

Quick run:
1) Set API key:
export GEMINI_API_KEY=YOUR_KEY
2) Build evidence pack:
python ztgi_build_submission_pack.py --model "gemini-2.0-flash" --out "ztgi_submission_pack"
3) Inspect:
– ztgi_submission_pack/evidence/ztgi_evidence_live.json
– ztgi_submission_pack/evidence/ztgi_evidence_live.csv
– ztgi_submission_pack/assets/ztgi_manifund_evidence.png
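
For a first look at the evidence files, a schema-agnostic summary may be enough. The sketch below assumes nothing about the fields inside the JSON or CSV (the post does not describe them); it only reports the top-level shape. The `summarize_evidence` helper is hypothetical and not part of the repo, and it assumes the first CSV row is a header.

```python
import csv
import json
from pathlib import Path


def summarize_evidence(json_path, csv_path):
    """Report the basic shape of the evidence files without assuming a schema."""
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        json_summary = {"type": "object", "keys": sorted(data.keys())}
    elif isinstance(data, list):
        json_summary = {"type": "array", "length": len(data)}
    else:
        json_summary = {"type": type(data).__name__}

    with open(csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.reader(fh))
    # Assumption: the first row is a header row.
    header = rows[0] if rows else []
    csv_summary = {"columns": header, "row_count": max(len(rows) - 1, 0)}
    return {"json": json_summary, "csv": csv_summary}


# Example call against the paths listed above:
# summarize_evidence(
#     "ztgi_submission_pack/evidence/ztgi_evidence_live.json",
#     "ztgi_submission_pack/evidence/ztgi_evidence_live.csv",
# )
```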

What I’d like feedback on:
– failure modes I’m missing
– overblocking vs underblocking tradeoff
– better eval set design for independent validation

I’m happy to share raw outputs and discuss limitations directly.

FIRST COMMENT (pin this under your post):
Technical notes + limitations

– This project is a runtime guard, not model-level alignment.
– Some of the observed safety behavior may come from the base model's own policy rather than from the gateway.
– I’m trying to measure where the gateway actually adds value via hard-block reasons + telemetry.
– Current stress set is small and intentionally adversarial.
– Next step is broader independent eval (including false-positive tracking).
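
One way false-positive tracking could work is to label each test prompt by hand and compare the label against the gateway verdict. The sketch below is an assumption about how such a tally might look, not something taken from the repo: the `block_rates` function and the `(verdict, should_block)` pair format are hypothetical, and it treats only BREAK as a block (WARN counts as allowed).

```python
def block_rates(results):
    """Compute overblocking and underblocking rates.

    results: iterable of (verdict, should_block) pairs, where verdict is
    "SAFE", "WARN", or "BREAK" and should_block is a human-assigned bool.
    """
    false_pos = false_neg = benign = harmful = 0
    for verdict, should_block in results:
        blocked = verdict == "BREAK"  # assumption: WARN is not a block
        if should_block:
            harmful += 1
            if not blocked:
                false_neg += 1  # harmful prompt got through (underblocking)
        else:
            benign += 1
            if blocked:
                false_pos += 1  # benign prompt was blocked (overblocking)
    return {
        "overblock_rate": false_pos / benign if benign else None,
        "underblock_rate": false_neg / harmful if harmful else None,
    }
```

A purely adversarial stress set has no benign prompts, so `overblock_rate` comes back as `None`. Measuring overblocking needs benign prompts in the set as well.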

If you want to reproduce quickly:
– Python 3.10+
– GEMINI_API_KEY set
– matplotlib installed
– run:
python ztgi_build_submission_pack.py --model "gemini-2.0-flash" --out "ztgi_submission_pack"

Happy to add your suggested test prompts to the regression suite and report back with results.

Comments URL: https://news.ycombinator.com/item?id=46939042
