AI DORA / SPACE Engineering Productivity SDLC Playbook

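Audience: AI PM, Product Architect, Engineering Productivity PM, Platform PM, DevEx Lead, AI Architect, Model Risk, Security Governance, and technology managers in retail finance. Core question: how to combine DORA, SPACE, AI code agents, PR gates, eval gates, release gates, security and quality controls, and developer experience into one AI engineering productivity operating system. How to use: use this playbook to design an AI SDLC metric system, code agent governance, an engineering efficiency dashboard, retail-finance change gates, a 30-day training plan, and portfolio material. It does not cover basic Agile or basic DevOps; it focuses on post-CBAP skills in AI engineering productivity, product architecture metrics, and governance.

Important: this playbook is learning, architecture design, and portfolio material. It is not a legal, regulatory, audit, model validation, or security certification opinion. For a production project in retail finance, the business owner, engineering, security, privacy, legal, compliance, model risk, operational risk, and internal audit must jointly confirm the applicable requirements, approval authority, evidence retention, and go-live boundaries.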

1. Source Anchors

The sources below serve as anchors for terminology and method. A production project must re-verify, as of its own access date, the latest versions, institutional policies, regulatory requirements, and internal control boundaries. The access date recorded here is 2026-06-29.

Anchor	Official / primary source	How this playbook uses it
DORA official research	https://dora.dev/	Uses DORA's software delivery performance research to calibrate how throughput, stability, reliability, organizational performance, and well-being relate.
DORA software delivery metrics	https://dora.dev/guides/dora-metrics/	Turns change lead time, deployment frequency, change fail rate, failed deployment recovery time, and deployment rework rate into AI SDLC operating metrics.
DORA research archive	https://dora.dev/research/	Uses the DORA core model to argue that AI engineering productivity is not the ROI of a single tool.
SPACE developer productivity paper / ACM Queue	https://queue.acm.org/detail.cfm?id=3454124	Uses Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow to design multidimensional developer productivity metrics and avoid relying on a single activity metric.
NIST SSDF	https://csrc.nist.gov/Projects/ssdf	Uses Prepare the Organization, Protect the Software, Produce Well-Secured Software, and Respond to Vulnerabilities to organize the security controls and evidence of an AI-assisted SDLC.
NIST SP 800-218A for Generative AI	https://csrc.nist.gov/pubs/sp/800/218/a/final	Uses the secure development profile for GenAI and dual-use foundation models to strengthen governance of models, data, prompts, tools, the supply chain, and acquirers.
AI Engineering Productivity and Code-Agent Operating System Playbook	`docs/AI_ENGINEERING_PRODUCTIVITY_CODE_AGENT_OPERATING_SYSTEM_PLAYBOOK.md`	This playbook does not repeat the full agent operating system design; it adds DORA / SPACE metrics, operating cadence, retail-finance cases, and portfolio framing.
AI MLOps Continuous Delivery Release Playbook	`docs/AI_MLOPS_CONTINUOUS_DELIVERY_RELEASE_PLAYBOOK.md`	Connects the model / data / code / prompt release gates to this playbook's AI SDLC metric operating system.
AI Requirements Engineering / GQM / Eval Contracts Playbook	`docs/AI_REQUIREMENTS_ENGINEERING_GQM_EVAL_CONTRACTS_PLAYBOOK.md`	Connects measurable outcomes, GQM, eval contracts, and monitoring gates to the DORA / SPACE dashboard.
AI Observability / Cost / SLO Playbook	`docs/AI_OBSERVABILITY_COST_SLO_PLAYBOOK.md`	Feeds trace, quality, cost, SLO, and incident signals into the production feedback loop.

Standards-to-artifacts:

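Source lens	Artifacts it can produce	Senior interview framing
DORA	AI SDLC delivery performance dashboard, release flow metric tree, stability review	"I do not only look at how much code AI generated; I look at whether changes got faster, more stable, and easier to recover."
SPACE	DevEx survey, collaboration map, flow friction log, well-being guardrail	"I treat productivity as a sociotechnical system and do not judge individuals by a single activity metric."
SSDF	AI-assisted secure SDLC control map, PR security gate, vulnerability response playbook	"AI code agents cannot bypass secure development; they must produce auditable evidence."
EvalOps	Eval contract, regression suite, red-team case set, release blocker list	"The definition of done for an AI SDLC must include behavioral evaluation, not just merged code."
Product / Architecture metrics	Outcome tree, architecture fitness function, platform adoption scorecard	"Engineering productivity ultimately has to connect to customer outcomes, risk outcomes, platform reuse, and architectural evolvability."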

2. One-Sentence Positioning

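One-sentence positioning:

DORA / SPACE for AI SDLC = using delivery performance, developer experience, AI evaluation, secure SDLC, and product architecture metrics to build an AI engineering productivity operating system that is measurable, governable, reviewable, and improvable.

It is not:

Measuring AI project output in story points.
Proving coding agent ROI with lines of code.
Proving developers are more productive with PR counts.
Proving the organization has completed its AI transformation with tool license activation rates.
Substituting a one-off eval score for post-launch quality and risk operations.

It has to answer seven harder questions:

Has AI shortened the time from business intent to a controlled production change?
Has AI reduced rework, incidents, and recovery cost for high-risk changes?
Has AI let developers wait less, switch context less, and get into flow more easily?
Has AI improved consistency across requirements, design, tests, reviews, documentation, and operational evidence?
Do AI code agents work within clear permission, sandbox, gate, and audit boundaries?
Can AI eval gates block unacceptable risk at PR, at release, and in production monitoring?
Can engineering efficiency gains be linked to customer rights, operational risk, revenue, cost, compliance, and architectural evolution in retail finance?

The core loop of the operating system: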

Business outcome
-> AI SDLC work design
-> code agent governance
-> PR / eval / release gate
-> production telemetry
-> DORA / SPACE dashboard
-> portfolio and governance evidence
-> operating review
-> next improvement bet

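3. Why AI PMs and Architects Cannot Look Only at Story Points

The problem with story points is not that they are "completely unusable" but that they answer too narrow a question: the team's estimate of relative complexity and capacity. The core risks of an AI SDLC, however, come from probabilistic systems, tool-using agents, context quality, eval coverage, data drift, the supply chain, permissions, production behavior, and the boundary between human and machine responsibility. Looking only at story points turns the most important systemic risks into invisible costs.

Story-point blind spot	Real problem in the AI SDLC	Better metrics
Points completed but requirements unclear	Agent generates large amounts of wrong implementation from a vague spec	Spec-to-eval completeness, requirement ambiguity defect rate
Points completed but evals insufficient	Happy path passes; edge cases, privilege escalation, hallucination, and drift are not covered	Eval coverage by risk tier, critical failure rate, red-team pass rate
Points completed but PR too large	Reviewers are buried under AI-generated diffs	Review load, review turnaround, AI-generated rework ratio
Points completed but quality drops	Defects, rework, incidents, and customer impact shift downstream into production	Change fail rate, deployment rework rate, escape defect rate
Points completed but security evidence missing	No evidence for secrets, dependencies, licenses, PII, or tool permissions	SSDF control evidence completeness, security gate pass rate
Points completed but developer experience worse	Agent output needs heavy cleanup and cognitive load rises	Flow interruption rate, perceived productivity, cognitive load score
Points completed but architecture debt grows	Fast generation of local code breaks module boundaries and evolvability	Architecture fitness score, dependency risk, change coupling index
Points completed but no business result	More output, but no improvement in customer experience, risk control, or operational efficiency	Business outcome contribution, risk-adjusted ROI, adoption quality

Senior AI PMs and architects need to move from "capacity management" to "system performance management":

Do not ask: how many points did this sprint complete?
Ask instead:
  Did this AI-assisted flow reduce end-to-end lead time?
  Did it maintain or improve stability?
  Did it reduce developer waiting and rework?
  Did it move security, quality, and model risk evidence earlier?
  Did it make the architecture more evolvable?
  Does it make an explainable contribution to business and risk outcomes?

Framing for management:

Story points can support short-term capacity conversations, but they cannot be the primary metric of AI engineering productivity. The primary metric should be a combination of DORA delivery performance, SPACE developer experience and collaboration health, EvalOps behavioral quality, security governance evidence, product outcomes, and architectural evolvability.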

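4. AI Engineering Productivity Operating System

An AI engineering productivity operating system is not a dashboard but a working system. The dashboard only visualizes results. The real operating system comprises goals, metrics, processes, gates, permissions, evidence, reviews, and an improvement cadence.

4.1 Seven-layer model

Layer	Core objects	Key questions	Typical artifacts
Outcome layer	Customers, revenue, cost, risk, compliance, employee experience	Why it is worth doing and how success is defined	Product outcome tree, risk-adjusted ROI memo
Flow layer	End-to-end flow from requirement to production	How changes flow and where they wait or get reworked	Value stream map, DORA dashboard
Work design layer	Spec, ticket, design, task, review	Which work suits agents and which requires human judgment	AI work routing policy, task taxonomy
Agent governance layer	Agent identity, permissions, context, tools, sandbox	What an agent may and may not do, and how accountability is assigned	Agent permission matrix, audit log
Quality and safety layer	Test, eval, security, privacy, model risk	Which quality and risk evidence is enough to release	PR gate, eval gate, SSDF control map
DevEx layer	Satisfaction, flow, waiting, cognitive load, collaboration	Whether developers are genuinely more productive and the pace is sustainable	SPACE survey, friction backlog
Operating review layer	Weekly, monthly, quarterly review	How metrics drive action rather than performance theater	Improvement bet register, decision log

4.2 Reference architecture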

flowchart TB
  A[Business Outcome and Risk Appetite] --> B[AI SDLC Metric Tree]
  B --> C[Work Intake and Risk Tiering]
  C --> D[Code Agent Governance]
  D --> E[PR Gate]
  E --> F[Eval Gate]
  F --> G[Release Gate]
  G --> H[Production Telemetry]
  H --> I[DORA / SPACE / Quality Dashboard]
  I --> J[Operating Review]
  J --> K[Improvement Bets]
  K --> C

  E --> L[SSDF Evidence]
  F --> L
  G --> L
  H --> M[Incident and Learning Loop]
  M --> B

4.3 Operating cadence

Cadence	Participants	Metrics in focus	Decisions
Daily flow review	Tech lead, engineering manager, DevEx lead	Blocked PR, failing eval, flaky test, agent task stuck, review queue age	Unblock work, split PRs, adjust agent scope
Weekly AI SDLC review	PM, architect, engineering, QA, security	DORA trend, SPACE pulse, eval regression, rework, security gate failures	Pick one flow bottleneck to improve this week
Monthly governance review	Platform, model risk, security, compliance, audit liaison	Exception aging, evidence completeness, incident trend, agent permission violations	Adjust gates, policies, evidence requirements, and risk tiering
Quarterly product architecture review	Product exec, CTO staff, enterprise architect, risk owner	Business outcome, platform adoption, architecture fitness, unit economics	Decide platform investment, architecture debt paydown, and whether to expand or shrink agent scope

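5. Mapping DORA / SPACE to AI

5.1 DORA: from software delivery performance to AI SDLC flow

DORA's deeper value is that it frames software delivery as a tension between speed and stability, which keeps management from assuming that "fast" and "stable" are an either-or choice. An AI SDLC needs to extend this metric set to linked changes across code, data, model, prompt, RAG index, tool schema, policy, and eval.

DORA metric	Traditional meaning	AI SDLC mapping	Interpretation in AI scenarios
Change lead time	Time from commit to production deployment	Time from business intent through spec, eval contract, agent work, PR, eval, and release to controlled go-live	Do not measure only code commits; also measure the waits in spec-to-eval and eval-to-release
Deployment frequency	Deployments per unit of time	Frequency of controlled releases of code / prompt / model / index / policy, counted by risk tier	Higher frequency is not automatically better; for high-risk systems read it together with release quality and rollback readiness
Change fail rate	Share of deployments that require immediate intervention	Share of AI changes that cause a rollback, hotfix, eval blocker, incident, customer impact, or manual remediation	Includes behavioral regression, rising hallucination, wrong tool calls, risk-control misjudgments, and wrong RAG citations
Failed deployment recovery time	How long it takes to recover from a failed deployment	Time from detecting a failed AI change to degrading, rolling back, disabling tools, switching models back, restoring service, and completing communication	AI recovery may be a model switch, prompt rollback, index rollback, rule-based fallback, or human-only mode
Deployment rework rate	Share of unplanned deployments triggered by production incidents	Share of patches and re-releases caused by AI quality, security, data, prompt, model, or agent behavior problems	Shows whether the eval gate, PR review, and observability catch problems early

Principles for the AI extension:

Every change must be tagged with an artifact type: code, prompt, model, data, RAG index, tool schema, policy, agent config, infrastructure.
Every change must be tagged with a risk tier: customer-facing, financial-impacting, regulated-decision, internal-copilot, developer-tooling.
Every metric must drill down to team, service, repository, use case, risk tier, and artifact type.
Do not simply average systems of different risk tiers; core banking and an internal documentation assistant cannot be explained by one mean.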
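
As a minimal sketch of the tagging and no-averaging principles above, the following groups change records by risk tier before computing lead time and change fail rate. The `Change` record shape, its field names, and the sample data are hypothetical and not taken from any DORA tooling.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass(frozen=True)
class Change:
    """One controlled production change (hypothetical record shape)."""
    artifact_type: str     # e.g. "code", "prompt", "model", "rag_index"
    risk_tier: str         # e.g. "financial-impacting", "internal-copilot"
    intent_at: datetime    # when the business intent was recorded
    deployed_at: datetime  # when the change reached production
    failed: bool           # rollback, hotfix, incident, or manual remediation


def dora_by_tier(changes):
    """Report median lead time and change fail rate per risk tier, never pooled."""
    groups = defaultdict(list)
    for change in changes:
        groups[change.risk_tier].append(change)
    report = {}
    for tier, items in groups.items():
        lead_days = [(c.deployed_at - c.intent_at) / timedelta(days=1) for c in items]
        report[tier] = {
            "changes": len(items),
            "median_lead_time_days": median(lead_days),
            "change_fail_rate": sum(c.failed for c in items) / len(items),
        }
    return report


t0 = datetime(2026, 6, 1)
sample = [
    Change("prompt", "internal-copilot", t0, t0 + timedelta(days=2), False),
    Change("code", "internal-copilot", t0, t0 + timedelta(days=4), False),
    Change("code", "financial-impacting", t0, t0 + timedelta(days=10), True),
    Change("model", "financial-impacting", t0, t0 + timedelta(days=20), False),
]
report = dora_by_tier(sample)
for tier, row in report.items():
    print(tier, row)
```

The same grouping key could be extended to `(risk_tier, artifact_type)` to support the drill-down described above; a single pooled mean over this sample would hide that every failure sits in the financial-impacting tier.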

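5.2 SPACE: from individual activity to a sociotechnical system

SPACE's value is its reminder to managers that developer productivity cannot be captured by a single dimension. AI code agents make activity data even more misleading, because AI can produce more commits, diffs, comments, and PRs without any of that activity representing higher productivity.

SPACE dimension	Questions to ask in the AI SDLC	Recommended metrics	Counter-indicators
Satisfaction and well-being	Do developers trust the agent, is low-value toil reduced, and is the pace sustainable	AI workflow satisfaction, perceived productivity, cognitive load score, after-hours recovery work	Agent cleanup fatigue, review anxiety, policy confusion
Performance	Does the team deliver higher-quality business outcomes	Outcome achievement, defect escape rate, eval pass by risk tier, customer impact avoided	More code with lower acceptance, quality erosion
Activity	Which activity volumes carry explanatory value	Agent task completed, test generated and accepted, docs updated with evidence, review comments resolved	Lines of code, raw prompt count, individual commit count
Communication and collaboration	Does AI improve cross-role collaboration	Spec-review alignment, PR review routing accuracy, architecture decision traceability, security issue resolution time	AI-generated ambiguity, review ping-pong, shadow changes
Efficiency and flow	Is the end-to-end flow smoother	Flow time, wait time, blocked time, build and eval queue time, context switching count	PR queue aging, flaky eval reruns, agent stuck loops

Senior-level principles:

DORA asks whether the system is faster and more stable.
SPACE asks whether people and teams are more sustainable, more collaborative, and face less friction.
EvalOps asks whether AI behavior reaches acceptable quality.
SSDF asks whether security and supply chain controls are applied early.
Product / architecture metrics ask whether engineering efficiency produces business value and long-term architectural value.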

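5.3 Combined DORA + SPACE view

Operating question	DORA signal	SPACE signal	AI quality signal	Possible actions
Lead time falls but fail rate rises	Change lead time improves, change fail rate worsens	Review burden rises	Critical failures rise	Tighten the eval gate, restrict what the agent may change, split PRs smaller
Deployment frequency rises but the business does not improve	Deployment count rises	No gain in developer satisfaction	Eval pass stable	Reconnect to the product outcome, stop vanity releases
Stability improves but flow slows	Fail rate falls, recovery improves	Wait time and cognitive load rise	Eval queue aging	Improve platform self-service, parallelize evals, tier gates by risk
Agent adoption high but satisfaction low	No clear DORA improvement	Cleanup fatigue rises	Agent rework high	Adjust task routing, improve context quality, retire low-value workflows
Core system changes are slow but very stable	Lead time long, fail rate low	Review queue pressure high	Eval coverage high	Do not chase frequency blindly; reduce waiting and automate evidence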

6. AI SDLC Metrics Taxonomy

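An AI SDLC metric system must cover outcomes, flow, quality, security, experience, architecture, and governance together. Looking at any one category alone creates local optimization.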

6.1 Metric tree

AI engineering productivity
  -> Product and risk outcomes
  -> Delivery flow performance
  -> AI behavior quality
  -> Security and compliance posture
  -> Developer experience and collaboration
  -> Code agent effectiveness
  -> Architecture and platform health
  -> Governance evidence and operating cadence

6.2 Product and risk outcomes

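Metric	Definition	Applicable scenarios	How to interpret
Business outcome contribution	Contribution of AI-assisted SDLC delivery to the target business metrics	Customer service RAG, risk-control models, AI platform	Do not attribute it to the AI tool alone; combine control groups, staged releases, and business baselines
Risk-adjusted ROI	Cost saved or revenue gained, minus the cost of risk controls, rework, incidents, and governance	Retail-finance AI platform investment	A high ROI must also pass the risk and quality gates
Customer impact avoided	Customer impact avoided through gates, monitoring, or rollback	Core banking, payments, lending, customer service	Useful for explaining the value of stability to executives
Operational loss reduction	Reduction in incidents, manual remediation, error handling, complaints, and penalties and losses	Risk control, customer service, operations copilot	Needs to be connected to business operations data
Compliance evidence readiness	Whether go-live evidence supports compliance, audit, and model risk inquiries	Regulated processes	Not the number of documents but the traceability and completeness of evidence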

6.3 Delivery flow metrics

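Metric	Definition	AI-specific breakdown
Intent-to-production lead time	Total time from recording the business intent to production release	discovery, GQM, eval design, implementation, review, release
Spec-to-eval lead time	Time from requirement sign-off to a runnable eval contract	Requirement clarity and how early evaluation starts
Agent task cycle time	Time for the agent to go from receiving the task through changes and tests to PR ready	Break down by task type, repository, and agent type
Review turnaround	Time from PR ready to review decision	AI-generated PRs should be tagged separately
Eval queue time	Time from entering eval to gate result	Long queues cancel out agent speed
Release waiting time	Wait from gate pass to production release	Distinguish control-driven waiting from waiting that adds no value
Rework loop count	Number of spec, code, eval, and review loops the same change goes through	A high loop count points to a context or gate design problem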
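
One way to make these flow metrics concrete is to record a timestamp at each checkpoint and derive the per-stage waits from them. The sketch below assumes a hypothetical, simplified set of checkpoint names; a real pipeline would have its own event names and may record several loops per change.

```python
from datetime import datetime

# Hypothetical checkpoint names, ordered from business intent to production release.
CHECKPOINTS = [
    "intent_recorded", "spec_confirmed", "eval_contract_runnable",
    "pr_ready", "review_decided", "gate_passed", "released",
]


def stage_hours(timestamps):
    """Return the hours spent between each pair of consecutive checkpoints."""
    return {
        f"{start} -> {end}": (timestamps[end] - timestamps[start]).total_seconds() / 3600
        for start, end in zip(CHECKPOINTS, CHECKPOINTS[1:])
    }


change = {
    "intent_recorded": datetime(2026, 6, 1, 9),
    "spec_confirmed": datetime(2026, 6, 2, 9),
    "eval_contract_runnable": datetime(2026, 6, 4, 9),
    "pr_ready": datetime(2026, 6, 4, 15),
    "review_decided": datetime(2026, 6, 5, 15),
    "gate_passed": datetime(2026, 6, 5, 18),
    "released": datetime(2026, 6, 8, 18),
}
hours = stage_hours(change)
total = sum(hours.values())          # intent-to-production lead time in hours
slowest = max(hours, key=hours.get)  # the stage that dominates the lead time
print(f"total={total}h, slowest stage: {slowest} ({hours[slowest]}h)")
```

In this made-up example the agent's own work (spec to PR ready) is a small part of the total, while release waiting dominates, which is the kind of bottleneck the table asks reviewers to separate into control-driven and valueless waiting.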

6.4 AI behavior quality metrics

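Metric	Definition	Risk note
Eval pass rate by risk tier	Pass rate of the eval suite at each risk tier	High-risk cases must not be masked by a low-risk average
Critical failure rate	Share of severe failures that trigger a no-go	For example wrong financial advice, PII leakage, unauthorized tool calls
Groundedness score	Whether the output is supported by approved knowledge sources	Required for customer service RAG, policy assistants, and compliance Q&A
Answerability accuracy	Whether the system knows when it cannot answer	Prevents forced answers and hallucination
Tool call correctness	Whether tool call parameters, permissions, ordering, and result handling are correct	Core metric for agentic workflows
Calibration error	Gap between model confidence and actual correctness	High-risk recommendations must manage overconfidence
Regression escape rate	Share of behavioral regressions discovered only after release	Reflects eval coverage and monitoring quality
Human override rate	Share of AI outputs that humans reject, modify, or override	A high rate is not always bad; interpret it together with risk and learning signals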
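
The table does not prescribe how to compute calibration error. One common choice is expected calibration error (ECE) over equal-width confidence bins, sketched here as an assumption rather than as this playbook's required method; the bin count and sample values are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy in each bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("confidences and correct must be non-empty and equal length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(mean_conf - accuracy)
    return ece


# Four answers stated with 90% confidence, of which three were correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(round(overconfident, 3))
```

For high-risk recommendations it is usually worth computing this per risk tier or intent category, for the same reason the eval pass rate is split by tier.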

6.5 Security and quality metrics

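Metric	SSDF lens	AI-specific use
Security requirement traceability	PO / PW	Whether security requirements are linked to PR gates, eval cases, and evidence
Secret exposure prevented	PS / PW	Whether secret leakage is blocked when the agent reads and writes
Dependency risk aging	PS / RV	Days that vulnerability, license, and supply chain risks remain open after AI generates or updates dependencies
Secure coding gate pass rate	PW	Pass status of SAST, SCA, IaC, policy-as-code, and threat model checks
Vulnerability response time	RV	Time from discovery to fix, regression, and release
Provenance completeness	PS	Completeness of provenance for release artifact, model, dataset, prompt, index, and agent config
Prompt injection defense pass rate	PW / RV	Pass rate of attack cases for RAG, tool agents, and customer service systems
PII leakage detection rate	PW / RV	Ability to detect sensitive information leakage in tests and production monitoring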

6.6 Code agent effectiveness metrics

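Metric	Definition	Interpretation
Agent acceptance rate	Share of agent output accepted by humans and merged to mainline	Interpret by task complexity and risk tier
Human intervention density	Number of human corrections per 100 lines or per task	Too high means the agent scope, context, or prompt is immature
Review delta size	Diff that reviewers ask to change as a share of the original diff	A high delta means the first draft was not good enough
Test validity rate	Share of agent-generated tests that are kept and can catch real defects	Prevents meaningless tests that pad coverage
Context retrieval precision	Whether the context the agent uses is relevant and current	Affects spec-to-code and bugfix quality
Tool error rate	Share of agent tool calls that fail, exceed permissions, or misinterpret results	High-risk agents need hard gates
Agent rollback contribution	Number of effective cases where the agent helped fix an incident or generate regression tests	Measures recovery capability, not only writing new code
Unauthorized action attempt	Number of times the agent tries to access an unauthorized repo, secret, environment, or tool	Any high-severity event should go to governance review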
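
A minimal sketch of the first two metrics, split by risk tier as the interpretation column suggests. The `AgentTask` fields are hypothetical; in practice they would come from PR and review telemetry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentTask:
    """One agent-produced PR (hypothetical record shape)."""
    risk_tier: str
    merged: bool          # accepted by humans and merged to mainline
    changed_lines: int    # size of the agent's diff
    human_fixes: int      # corrections a human had to make before a decision


def agent_metrics(tasks):
    """Acceptance rate and human interventions per 100 changed lines, per tier."""
    by_tier = {}
    for task in tasks:
        by_tier.setdefault(task.risk_tier, []).append(task)
    result = {}
    for tier, items in by_tier.items():
        lines = sum(t.changed_lines for t in items)
        fixes = sum(t.human_fixes for t in items)
        result[tier] = {
            "acceptance_rate": sum(t.merged for t in items) / len(items),
            # None when no lines were changed, to avoid dividing by zero.
            "interventions_per_100_lines": 100 * fixes / lines if lines else None,
        }
    return result


tasks = [
    AgentTask("tier-1", merged=True, changed_lines=200, human_fixes=4),
    AgentTask("tier-1", merged=False, changed_lines=300, human_fixes=11),
    AgentTask("tier-3", merged=True, changed_lines=50, human_fixes=1),
]
metrics = agent_metrics(tasks)
print(metrics)
```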

6.7 Developer experience metrics

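Metric	Definition	Collection method
Perceived productivity	Whether developers believe the AI workflow increases effective output	Monthly pulse survey
Cognitive load score	Load of understanding agent output, gates, context, and evidence	Survey + interview
Flow interruption count	Daily interruptions caused by waiting, failures, permissions, tools, or review	DevEx diary + telemetry
Review burden index	Composite index of a reviewer's PR volume, diff size, risk complexity, and missing context	Code review telemetry
Trust calibration	Whether developers' trust in agent results matches actual quality	Survey + acceptance data
Onboarding time to first safe AI PR	Time for a new member to use the AI workflow safely and produce an acceptable PR	Training and platform logs
Policy clarity score	Whether developers know which data, systems, and tasks may use AI	Survey + policy quiz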

6.8 Product / architecture metrics

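Metric	Definition	Why it matters
Platform reuse rate	Share of AI SDLC platform components reused by multiple teams	Shows the platform capability is not a single team's script
Golden path adoption	Share of teams releasing through the standard pipeline, eval, gate, and evidence	Reflects how far the platform and governance are built in
Architecture fitness score	Composite assessment of module boundaries, testability, observability, dependency direction, and rollback capability	Keeps agent-generated code from eroding the architecture
Change coupling index	How many modules, services, prompts, schemas, and policies one business change has to touch	High coupling slows lead time and recovery
Blast radius score	Range of customers, transactions, processes, and systems affected when a change fails	Key input to the retail-finance release gate
Cost per successful AI-assisted change	Tool, model, compute, review, and rework cost per successful production change	Used for platform investment and model routing decisions
Eval asset reuse	Number of times eval datasets, rubrics, judges, and red-team cases are reused	Shows eval work has become an engineering asset
Documentation freshness	Consistency of architecture, runbook, API, and policy docs with the code and production reality	In DORA research, high-quality internal documentation amplifies technical capabilities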

6.9 Governance metrics

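Metric	Definition	Decision
Evidence binder completeness	Whether each release has scope, risk tier, eval, security, approval, rollback, and monitoring evidence	Below the threshold, no entry into high-risk production
Exception aging	Days from approval to closure of a gate exception	Aged exceptions go to the governance meeting
Gate override rate	Share of gates manually bypassed or waived	A rise signals a problem in gate design or delivery pressure
Policy violation rate	Number of policy violations in AI usage, data, permissions, or tool calls	Handle by severity
Model / agent change review coverage	Whether updates to models, agent prompts, tool permissions, and eval judges are reviewed	Prevents ungoverned behavior changes
Post-incident learning closure	Whether post-incident regression tests, eval cases, runbooks, and architecture improvements are closed out	Prevents writing a postmortem without changing the system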
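
Evidence binder completeness can be checked mechanically before a release reaches human reviewers. The sketch below uses the seven evidence categories named in the table; the dictionary shape, the evidence file names, and the 100% threshold for high-risk production are assumptions, since the playbook leaves the threshold to the institution.

```python
# The seven evidence categories listed for "Evidence binder completeness".
REQUIRED_EVIDENCE = (
    "scope", "risk_tier", "eval", "security", "approval", "rollback", "monitoring",
)


def binder_completeness(binder):
    """Return (score between 0 and 1, list of missing evidence categories)."""
    missing = [key for key in REQUIRED_EVIDENCE if not binder.get(key)]
    score = 1 - len(missing) / len(REQUIRED_EVIDENCE)
    return score, missing


def may_enter_high_risk_production(binder, threshold=1.0):
    """Hypothetical policy: high-risk releases need every evidence category."""
    score, _ = binder_completeness(binder)
    return score >= threshold


# Hypothetical binder where the rollback evidence has not been attached yet.
binder = {
    "scope": "limit-rule-change.md",
    "risk_tier": "Tier 4",
    "eval": "regression-report.html",
    "security": "scan-report.json",
    "approval": "dual-approval-record",
    "rollback": "",
    "monitoring": "trigger-definition.yaml",
}
score, missing = binder_completeness(binder)
print(f"completeness={score:.2f}, missing={missing}")
```

Checking presence is only the floor; the table's point about traceability means a fuller check would also verify that each evidence item points at the exact release bundle versions.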

7. Code Agent Governance

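The core of code agent governance is not restricting the tool but making the agent an engineering participant that can be authorized, evaluated, revoked, and audited. A retail-finance environment cannot accept the narrative that "AI generated the code and nobody is responsible."

7.1 Governance control model

Control	Required design	Evidence
Agent identity	Each agent has its own identity, version, owner, scope, and runtime environment	Agent registry
Task authorization	Before a task reaches the agent, risk tier, repository scope, tool scope, and data scope are confirmed	Agent work order
Context governance	The agent reads only authorized context, and context sources are traceable	Context pack manifest
Tool permission	Agent tool permissions are minimized; production writes are denied by default	Tool permission matrix
Branch and PR policy	The agent works only on authorized branches and must merge through a PR	Branch protection logs
Secret and data protection	Secrets, PII, PCI, customer records, and model keys must not be exposed to unauthorized agents	DLP and secret scan report
Sandboxed execution	Tests, builds, and scripts run in an isolated environment where production credentials are not visible	Sandbox execution logs
Human accountability	Every agent PR has a human owner; high-risk changes have a named approver	PR approval record
Eval and test gate	Agent output must pass the tests and evals for its risk tier	CI and eval report
Auditability	Prompt, task, diff, tool call, test result, approval, and release decisions can be replayed	Audit trail
Revocation	Agents, models, tools, credentials, and context sources can be disabled quickly	Revocation runbook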

7.2 Agent task taxonomy

Task type	Agent fit	Required controls	Human role
Test generation for existing behavior	High	Test validity review, mutation or bug-seeding check	Confirm the tests really cover the risk
Documentation sync	High	Source trace, architecture owner review	Confirm the documented facts match what actually runs, and confirm boundaries
Low-risk UI or internal tooling change	Medium-high	CI, visual check, accessibility, owner review	Assess user experience and maintainability
Refactoring in well-tested module	Medium	Regression suite, diff size limit, architecture check	Control boundaries and rollback
Security fix draft	Medium	SAST/SCA, threat model, security review	Judge whether the fix is complete
Core banking business rule change	Low-medium	Formal spec, dual approval, full regression, release gate	Humans own the business and risk judgment
Fraud model decision logic	Low-medium	Model validation, champion-challenger, bias and drift monitoring	Risk owner and model risk provide effective challenge
Production incident hotfix	Context-dependent	War room approval, short-lived scope, post-incident regression	Humans command; the agent only assists
Direct production data action	Very low	Prohibited by default; only read-only diagnostics or a controlled runbook allowed	Manual execution or strong approval

7.3 RACI for AI-assisted PR

Activity	Product owner	Architect	Developer	Code agent	Security	Model risk	Release manager
Define outcome and risk tier	A	C	C	I	C	C	I
Prepare context pack	C	A	R	I	C	C	I
Generate implementation draft	I	C	A	R	I	I	I
Run tests and eval	I	C	A	R	C	C	I
Review design impact	C	A	R	I	C	C	I
Review security impact	I	C	R	I	A	C	I
Review model / AI behavior risk	C	C	R	I	C	A	I
Approve release	C	C	C	I	C	C	A
Own production outcome	A	C	R	I	C	C	C

RACI rule of thumb:

Agent can be Responsible for producing a draft.
Agent cannot be Accountable for business, security, architecture, model risk, or production outcomes.
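
The two rules above can be enforced as a lint over the RACI matrix itself. The sketch below encodes the table from this section as data and checks that every activity has exactly one Accountable role and that the code agent is never that role; the function name and the one-string-per-row encoding are illustrative choices.

```python
ROLES = [
    "Product owner", "Architect", "Developer", "Code agent",
    "Security", "Model risk", "Release manager",
]

# One row per activity, cells in the same order as ROLES.
RACI = {
    "Define outcome and risk tier": "A C C I C C I",
    "Prepare context pack": "C A R I C C I",
    "Generate implementation draft": "I C A R I I I",
    "Run tests and eval": "I C A R C C I",
    "Review design impact": "C A R I C C I",
    "Review security impact": "I C R I A C I",
    "Review model / AI behavior risk": "C C R I C A I",
    "Approve release": "C C C I C C A",
    "Own production outcome": "A C R I C C C",
}


def raci_violations(matrix, roles, agent_role="Code agent"):
    """Return human-readable problems; an empty list means the matrix is valid."""
    problems = []
    agent_index = roles.index(agent_role)
    for activity, row in matrix.items():
        cells = row.split()
        if len(cells) != len(roles):
            problems.append(f"{activity}: expected {len(roles)} cells, got {len(cells)}")
            continue
        if cells.count("A") != 1:
            problems.append(f"{activity}: needs exactly one Accountable role")
        if cells[agent_index] == "A":
            problems.append(f"{activity}: {agent_role} must not be Accountable")
    return problems


print(raci_violations(RACI, ROLES))
```

Running a check like this in CI on the governance repository is one way to keep later edits to the matrix from quietly making an agent accountable.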

8. PR / Eval / Release Gate

The purpose of a gate is not to add approval layers but to bring risk evidence forward at the right point. Gates in an AI-assisted SDLC must manage code correctness, AI behavior, security, privacy, architecture, operability, and business risk together.

8.1 Risk-tiered gate model

Tier	Example	AI involvement allowed	Required gate
Tier 0: Experimental / lab	Internal prototype, non-production notebook	Agent may assist freely; production data prohibited	Basic test, data handling check
Tier 1: Internal productivity	Developer tool, internal docs assistant, non-customer workflow	Agent may open PRs; human review	CI, PR template, security scan, light eval
Tier 2: Customer-facing non-decision	Customer service RAG draft, FAQ assistant	Agent may change code / prompt / tests; human review for high-risk content	PR gate, RAG eval, PII check, red-team, canary
Tier 3: Financial / regulated decision support	Fraud triage, credit policy assistant, dispute recommendation	Agent may assist with tests, docs, and low-risk code; strong human accountability for core logic	Formal eval, model risk review, security sign-off, release memo
Tier 4: Core transaction / ledger / account authority	Core banking posting, limit change, payment execution	By default the agent only does supporting analysis, test suggestions, and doc updates	Dual control, full regression, change advisory evidence, rollback drill
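
The tier table can be turned into a lookup that a pipeline consults before promoting a change. In this sketch the gate identifiers are hypothetical slugs derived from the "Required gate" column, and each tier's row is taken as written; whether higher tiers should also inherit lower-tier gates is a policy decision the table does not state.

```python
# Gate identifiers are hypothetical slugs for the "Required gate" column.
REQUIRED_GATES = {
    0: {"basic_test", "data_handling_check"},
    1: {"ci", "pr_template", "security_scan", "light_eval"},
    2: {"pr_gate", "rag_eval", "pii_check", "red_team", "canary"},
    3: {"formal_eval", "model_risk_review", "security_sign_off", "release_memo"},
    4: {"dual_control", "full_regression", "change_advisory_evidence", "rollback_drill"},
}


def missing_gates(tier, completed):
    """Return the required gates for this tier that have not been completed."""
    if tier not in REQUIRED_GATES:
        raise ValueError(f"unknown risk tier: {tier}")
    return sorted(REQUIRED_GATES[tier] - set(completed))


# A Tier 2 customer service RAG change that skipped two checks.
gaps = missing_gates(2, ["pr_gate", "rag_eval", "canary"])
print(gaps)
```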

8.2 PR gate

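The PR gate answers: is this change small enough, clear enough, reviewable enough, testable enough, and secure enough?

Required checks:

Check	Gate question	Release blocker example
Scope clarity	Does the PR state the business goal, risk tier, and change boundary	"Optimize AI" with no specific behavior or process boundary
AI involvement disclosure	Which parts did the agent generate, modify, test, or summarize	High-risk PR does not disclose agent changes
Diff size and coupling	Is the PR too large or does it cross too many boundaries	Core service, prompt, policy, and schema changed together with no decomposition
Test adequacy	Do unit, integration, contract, and regression tests cover the risk paths	Only the happy path was added
Eval linkage	Is the AI behavior change linked to eval cases and thresholds	Prompt change with no RAG / safety eval
Security scan	Do secret, dependency, license, SAST, and IaC checks pass	New high-severity dependency or secret
Architecture review	Does it break module boundaries, API contracts, observability, or rollback	Agent bypasses the existing domain service
Evidence readiness	Does the PR produce the evidence the release needs	No owner, no rollback, no monitoring plan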

8.3 Eval gate

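The eval gate answers: after this change, is the AI behavior acceptable within the risk boundary?

Eval type	Purpose	Example threshold
Functional eval	Is the core task completed	Tier 2 task success rate no lower than the current production baseline
Regression eval	Have existing capabilities degraded	No critical regression allowed in key scenarios
Safety eval	Are privilege escalation, leakage, and dangerous advice refused	Critical failures are 0
Grounding eval	Is the answer supported by approved sources	High-risk policy answers in customer service RAG must cite a valid source
Tool eval	Are tool calls correct, least-privilege, and recoverable	Financial actions require human confirmation by default
Bias / fairness eval	Are there unjustified disparities in risk control, credit, or marketing	Model risk thresholds defined by institutional policy
Robustness eval	Is it robust to prompt injection, noise, and missing information	Injection attacks must not bypass system instructions or tool permissions
Human review eval	Does expert human evaluation support go-live	High-risk samples must be spot-checked by experts

Eval decision:

Decision	Meaning	Action
Go	Metrics meet the risk tier's requirements and no blocker is open	Proceed to the release gate
Limited go	Low-risk scenarios may ramp; high-risk scenarios are switched off or backed by humans	Canary, feature flag, monitoring trigger
No-go	A critical failure exists, evidence is missing, or the risk cannot be explained	Fix and rerun the eval
Rollback / freeze	Production signals show the current version is unacceptable	Roll back, degrade, run an incident review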
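
The decision table can be read as an ordered set of rules. The sketch below is one possible encoding, with a hypothetical `EvalSummary` shape; the precedence (production alerts first, then blockers, then scenario-level results) is an assumption about how an institution might order the rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Aggregated eval outcome for one release candidate (hypothetical shape)."""
    critical_failures: int
    evidence_complete: bool
    low_risk_pass: bool     # low-risk scenarios meet their thresholds
    high_risk_pass: bool    # high-risk scenarios meet their thresholds
    production_alert: bool = False  # live signal that the current version is unacceptable


def eval_decision(summary):
    """Map an eval summary to one of the four decisions in the table."""
    if summary.production_alert:
        return "rollback_or_freeze"
    if summary.critical_failures > 0 or not summary.evidence_complete:
        return "no_go"
    if summary.low_risk_pass and summary.high_risk_pass:
        return "go"
    if summary.low_risk_pass:
        # High-risk scenarios stay switched off or backed by humans.
        return "limited_go"
    return "no_go"


decision = eval_decision(EvalSummary(0, True, low_risk_pass=True, high_risk_pass=False))
print(decision)
```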

8.4 Release gate

The release gate answers: even if the PR and eval pass, does this production release have the conditions for controlled ramp, monitoring, rollback, and accountability?

Gate area	Required evidence
Release bundle	Version manifest of code, prompt, model, data, RAG index, tool schema, policy, and agent config
Risk tier	Business impact, customer impact, financial impact, regulatory impact, operational risk
Eval result	Offline eval, red-team, regression, human review, known limitations
Security and privacy	SSDF controls, SAST/SCA, DLP, PII, secret scan, threat model
Architecture and operability	Rollback path, observability, SLO, runbook, feature flag, blast radius
Deployment strategy	Shadow, canary, ramp, manual approval, business freeze window
Decision owner	Business owner, technical owner, risk owner, release owner
Monitoring trigger	Which signals trigger pause, rollback, human-only mode, or an incident

8.5 Gate anti-theater principle

Gate is useful only when it changes a release decision.
If a gate cannot block, limit, route, rollback, or improve a change,
it is governance theater.

9. Safety, Quality and Developer Experience

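The core tension in AI engineering productivity: faster delivery must not be achieved by shifting quality, security, and cognitive load onto reviewers, QA, operations, customer service, or compliance teams.

9.1 Safety and quality control stack

Control layer	AI SDLC use	Owner
Requirement and eval contract	Define business boundaries, prohibited behavior, and evaluation questions up front	Product / BA / model risk
Secure coding and supply chain	Manage dependencies, secrets, licenses, SAST, SCA, IaC, and provenance	Engineering / security
Agent sandbox and permissions	Control the agent's read/write scope, tool calls, and environment	Platform / security
PR review	Judge maintainability, architecture impact, business rules, and code quality	Developer / architect
AI behavior eval	Verify answers, reasoning, tool calls, safety boundaries, and regressions	EvalOps / model risk
Release control	Control ramp, rollback, monitoring, and evidence	Release manager / SRE
Production monitoring	Detect drift, attacks, quality decline, cost anomalies, and customer impact	SRE / product ops
Incident learning	Turn incidents into regression tests, eval cases, runbooks, and architecture improvements	Cross-functional owner

9.2 DevEx as a control surface

Developer experience is not a perks metric; it is a control surface of the AI SDLC. Poor experience leads people to bypass the platform, copy sensitive data, clean up agent output by hand, skip evals, and accumulate architecture debt.

DevEx signal	Indicates	Action
Developers say the agent is "fast but not trustworthy"	Agent scope or context quality problem	Narrow task types, build a context pack, add eval feedback
Reviewers say the PR is "unreadable"	PR granularity and evidence are inadequate	Limit diff size, require a design note and a risk note
Evals often queue	Platform bottleneck cancels out AI speed	Parallelize evals, tier by risk, cache datasets
Policy is confusing	Governance language is not actionable	Rewrite as an allowed / restricted / prohibited task catalog
Agent output creates heavy cleanup work	Activity rises but performance falls	Stop rewarding tools for generated volume; look at acceptance and rework
High-risk teams refuse to adopt	Gates or platform do not fit the business risk	Co-design a golden path for core systems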

9.3 Quality economics

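AI code agents change the cost curve:

Generation cost falls
Review and verification cost may rise
Marginal risk of defects reaching production may rise
Return on investment in high-quality evals and platform automation rises

So ROI cannot be written as:

ROI = saved developer hours - tool license

A more reasonable expression: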

Risk-adjusted productivity ROI =
  reduced lead time value
  + avoided rework and incident cost
  + developer flow improvement
  + platform reuse value
  + quality evidence automation value
  - AI tooling and model cost
  - review and eval cost
  - governance and training cost
  - residual risk cost
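
To show how the two expressions can diverge, here is a small worked calculation. Every figure is invented for illustration (read them as thousands of a currency unit per quarter) and is not taken from any real program; only the term names come from the expression above.

```python
def risk_adjusted_roi(benefits, costs):
    """Sum of benefit terms minus sum of cost terms, as in the expression above."""
    return sum(benefits.values()) - sum(costs.values())


# Hypothetical quarterly figures; every number is illustrative.
benefits = {
    "reduced_lead_time_value": 120,
    "avoided_rework_and_incident_cost": 80,
    "developer_flow_improvement": 40,
    "platform_reuse_value": 30,
    "quality_evidence_automation_value": 20,
}
costs = {
    "ai_tooling_and_model_cost": 60,
    "review_and_eval_cost": 70,
    "governance_and_training_cost": 40,
    "residual_risk_cost": 50,
}

# The naive form counts only saved hours against the license fee.
naive_roi = 200 - 60
adjusted_roi = risk_adjusted_roi(benefits, costs)
print(f"naive={naive_roi}, risk-adjusted={adjusted_roi}")
```

With these made-up inputs the naive figure is twice the risk-adjusted one, mostly because review, eval, governance, and residual risk costs are invisible in the naive form.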

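10. Retail-Finance Cases

10.1 Core banking change: account limit rule adjustment

Scenario:

A bank needs to adjust the daily cumulative transfer limit rule for corporate accounts. The change touches the core banking service, channel APIs, risk-control checks, audit logs, customer service explanation scripts, and the rollback plan.

AI SDLC design:

Area	Design
Risk tier	Tier 4: core transaction / account authority
Agent role	Reads the spec, generates test cases, updates docs, assists with diff analysis; does not decide business rules directly
DORA focus	Change lead time, failed deployment recovery time, deployment rework rate
SPACE focus	Review burden, flow interruption, communication alignment
Eval focus	Boundary amounts, currency, channel, customer type, abnormal states, concurrency, audit log consistency
PR gate	Small PRs, rule changes separated from UI / docs, dual approval
Release gate	Freeze window, canary by segment, feature flag, ledger reconciliation, rollback drill
Evidence	Rule source, decision log, test matrix, security scan, approvals, monitoring triggers

Metric tree:

Outcome	Metric	Target interpretation
No added risk of customer transaction failure	Limit rule defect escape rate	Any defect with financial impact goes to incident review
Shorter controlled change cycle	Intent-to-production lead time	Reduce waiting and automate evidence without sacrificing gates
Lower recovery cost	Recovery drill success	Can switch back to the old rule and explain customer impact within the defined time
Lower reviewer burden	Review burden index	Agent generates the test matrix; humans focus on business judgment

Senior framing:

For core banking changes, the value of AI is not letting an agent change core ledger logic directly. It is automating requirement tracing, boundary testing, impact analysis, audit evidence, and rollback preparation so that people can focus on rule correctness and production risk.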

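10.2 Risk-control model: fraud transaction scoring model upgrade

Scenario:

The risk-control team is upgrading the transaction fraud model. The goal is to reduce false negatives while controlling the effect of false positives on customer experience and the operations review queue.

AI SDLC design:

Area	Design
Risk tier	Tier 3: financial / regulated decision support
Agent role	Generates feature validation, monitoring queries, a model card draft, and regression tests; does not replace model risk approval
DORA focus	Change lead time for model promotion, change fail rate, deployment rework rate
SPACE focus	Collaboration between data science, risk, engineering, and operations
Eval focus	Precision / recall, false positive cost, segment performance, drift, calibration, override review
PR gate	Feature schema, data lineage, model artifact, and decision policy reviewed separately
Release gate	Champion-challenger, shadow mode, manual override, ramp by transaction segment
Evidence	Training data snapshot, feature validation, bias analysis, threshold rationale, monitoring plan

Metric tree:

Outcome	Metric	Target interpretation
Lower fraud losses	Fraud loss avoided	Must net out false blocks and operations cost
Controlled customer friction	False positive rate by segment	An abnormal rise in any customer segment triggers limited go or rollback
More controllable model releases	Model promotion lead time	Watch the waits in data, validation, approval, and release
Explainable accountability maintained	Evidence binder completeness	No release if lineage or threshold rationale is missing

Senior framing:

Engineering productivity for a risk-control model is not "training new models faster." It is getting model changes into the production decision chain in a way that is explainable, verifiable, monitorable, and reversible.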

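10.3 Customer service RAG: credit card dispute handling knowledge assistant

Scenario:

The customer service team uses a RAG assistant to answer questions about the credit card dispute process, deadlines, required materials, and escalation conditions. The system does not make final rulings for customers directly, but it affects customer service advice and customer rights.

AI SDLC design:

Area	Design
Risk tier	Tier 2: customer-facing non-decision; some scenarios approach Tier 3
Agent role	Maintains eval cases, generates knowledge base diff summaries, updates citation checks, improves prompts
DORA focus	Prompt / index change lead time, deployment frequency, change fail rate
SPACE focus	Customer service SME collaboration, developer flow, policy clarity
Eval focus	Groundedness, citation validity, answerability, PII handling, prompt injection, escalation correctness
PR gate	Knowledge source diff, prompt diff, and retrieval config diff must be clear
Release gate	Canary by agent group, human review sampling, safe fallback, content freeze rule
Evidence	Source document version, index build ID, eval result, red-team result, sampling plan

Metric tree:

Outcome	Metric	Target interpretation
Higher first-contact resolution	First contact resolution uplift	Speed must not be bought with wrong answers
Fewer wrong policy citations	Invalid citation rate	High-risk policy questions need a hard gate
Lower cognitive load for customer service	SME correction rate and survey	A high correction rate points to a RAG quality or process boundary problem
Faster knowledge updates	Source-to-index lead time	Release frequency must be read together with grounding quality

Senior framing:

The release gate for customer service RAG does not only verify that answers are fluent. It verifies that answers can be cited, refused, escalated, and monitored, and that they neither leak customer information nor mislead customers about their rights.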

10.4 AI platform: enterprise code agent and eval platform

Scenario:

A retail-finance group is building a unified AI engineering platform that lets multiple product teams use code agents, eval suites, policy-as-code, an evidence binder, and a DORA / SPACE dashboard.

AI SDLC design:

Area	Design
Risk tier	The platform itself is Tier 1-2 but hosts Tier 3-4 systems
Agent role	Standardizes task intake, context pack, PR creation, test running, and evidence generation
DORA focus	Golden path adoption, lead time reduction, deployment rework rate
SPACE focus	Platform satisfaction, onboarding time, flow interruption, policy clarity
Eval focus	Agent task benchmark, repo-specific regression, security prompt-injection cases
PR gate	Platform changes need backward compatibility and tenant isolation review
Release gate	Dogfood, pilot teams, canary by repository, rollback by agent policy version
Evidence	Agent registry, policy version, permission logs, eval result, incident learning

Metric tree:

Outcome	Metric	Target interpretation
Wider adoption of safe AI development	Golden path adoption	Read adoption together with satisfaction and quality; do not force it
Less duplicated build-out across teams	Platform reuse rate	Duplicate scripts fall and shared eval assets rise
Controlled agent risk	Unauthorized action attempt	High-severity events trigger a permission reduction
Better developer flow efficiency	Flow interruption reduction	Shows the platform reduces waiting and context switching

Senior framing:

An AI platform is not a collection of tools. It productizes agent permissions, evals, release gates, security evidence, and the DORA / SPACE operating view so that business teams can deliver faster within controlled boundaries.

11. Templates

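The templates below provide example content that can be adapted directly. A production project should replace it with the institution's real owners, system names, risk tiers, metric thresholds, and evidence requirements.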

11.1 AI SDLC Metric Tree Canvas

# AI SDLC Metric Tree: Customer Service RAG v3

## Business Outcome
- Increase first-contact resolution for credit-card dispute inquiries from 62% to 70% in the pilot group.
- Reduce policy-escalation errors from 4.8% to below 2.0%.
- Keep customer-impacting incorrect guidance incidents at zero during pilot.

## Risk Tier
- Tier 2 for general process explanation.
- Tier 3 for dispute eligibility, deadline, fee, refund, chargeback and regulatory wording.

## DORA Metrics
- Prompt / index change lead time: source document approved to canary release.
- Deployment frequency: controlled prompt, retrieval config and index releases per week.
- Change fail rate: releases requiring rollback, hotfix, content freeze or human-only mode.
- Failed deployment recovery time: detection to safe fallback and stakeholder communication.
- Deployment rework rate: unplanned releases caused by production quality or safety issues.

## SPACE Metrics
- Satisfaction: customer service SME trust score and developer perceived productivity.
- Performance: valid citation rate, escalation correctness, first-contact resolution.
- Activity: accepted eval cases added, evidence packs generated, canary reviews completed.
- Communication: policy owner response time and PR review routing accuracy.
- Efficiency and flow: source-to-index wait time, eval queue time, review turnaround.

## AI Quality Metrics
- Groundedness for policy answers.
- Answerability accuracy for insufficient context.
- Prompt injection defense pass rate.
- PII leakage detection and prevention.
- Human correction rate by intent category.

## Governance Evidence
- Source document version map.
- Index build manifest.
- Eval result and red-team result.
- Security and privacy scan report.
- Release decision memo.
- Monitoring trigger and rollback runbook.

11.2 DORA / SPACE Dashboard Spec

# Dashboard Spec: AI Engineering Productivity Operating Review

## Audience
- CTO staff, AI platform PM, engineering managers, DevEx lead, security governance, model risk liaison.

## Decisions Supported
- Which AI SDLC bottleneck receives the next platform investment.
- Which agent workflows are expanded, constrained or retired.
- Which high-risk systems need stricter gates or better automation.
- Which DevEx friction items block adoption.

## Filters
- Business domain: core banking, fraud, customer service, AI platform.
- Risk tier: Tier 0 to Tier 4.
- Artifact type: code, prompt, model, data, RAG index, tool schema, policy, agent config.
- Team, repository, service, release train, agent type.

## Executive Tiles
- Intent-to-production lead time by risk tier.
- Change fail rate and deployment rework rate by artifact type.
- Failed deployment recovery time by service and release.
- Eval critical failure rate by use case.
- Evidence binder completeness by release.
- Developer flow interruption index.
- Agent acceptance rate and human intervention density.
- Platform golden path adoption and reuse rate.

## Drill-Down Views
- PR queue aging and review burden.
- Eval queue aging and failing eval categories.
- Security gate failures by SSDF control group.
- Agent permission violations and tool error rate.
- Architecture fitness trend and change coupling index.
- Incident learning closure and regression coverage added.

## Operating Cadence
- Daily team flow review for blocked PR, eval queue and review aging.
- Weekly AI SDLC review for improvement bets.
- Monthly governance review for exceptions and policy changes.
- Quarterly platform investment review for architecture and adoption decisions.

11.3 Code Agent Work Order

# Code Agent Work Order: Add Regression Tests for Dispute RAG Citation Validation

## Objective
- Add regression tests that verify dispute-policy answers cite the approved source document and refuse to answer when the source is missing.

## Risk Tier
- Tier 2, with Tier 3 handling for deadline, refund, fee and chargeback eligibility questions.

## Authorized Scope
- Repository: customer-service-rag.
- Paths: tests/eval/dispute_policy, docs/eval-cases/dispute_policy.
- Branch: agent/dispute-citation-regression.
- Tools: read files, edit files, run unit tests, run local eval subset.
- Prohibited: production credentials, customer data, deployment commands, changing retrieval config.

## Context Pack
- Product requirement: dispute-policy-answerability-v3.
- Eval contract: dispute-rag-eval-contract-v3.
- Source documents: approved policy package 2026-06.
- Existing failures: invalid citation in deadline questions and over-answering on missing evidence.

## Required Output
- Regression eval cases for valid citation.
- Negative cases for missing source.
- Test execution report.
- PR summary with AI assistance disclosure and risk notes.

## Human Review
- SME reviews policy examples.
- Developer reviews test design and maintainability.
- Security confirms that no customer data or sensitive content is embedded.
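
The authorized paths in this work order can be enforced mechanically before a human looks at the PR. The sketch below is one possible check, assuming the PR's changed files are available as repository-relative POSIX paths; `out_of_scope` is a hypothetical helper, not part of any existing tool.

```python
from pathlib import PurePosixPath

# Paths copied from the work order's Authorized Scope.
AUTHORIZED_PATHS = (
    PurePosixPath("tests/eval/dispute_policy"),
    PurePosixPath("docs/eval-cases/dispute_policy"),
)


def out_of_scope(changed_files):
    """Return the changed files that fall outside every authorized path."""
    violations = []
    for name in changed_files:
        path = PurePosixPath(name)
        # Reject absolute paths and ".." segments outright: a purely lexical
        # prefix check would otherwise accept "allowed/../../elsewhere".
        if path.is_absolute() or ".." in path.parts:
            violations.append(name)
            continue
        if not any(path.is_relative_to(root) for root in AUTHORIZED_PATHS):
            violations.append(name)
    return violations


# Hypothetical changed-file list from an agent PR.
changed = [
    "tests/eval/dispute_policy/test_citation.py",
    "config/retrieval.yaml",
    "tests/eval/dispute_policy_extra/x.py",
    "tests/eval/dispute_policy/../../../config/retrieval.yaml",
]
violations = out_of_scope(changed)
```

Comparing path components rather than string prefixes is deliberate: `tests/eval/dispute_policy_extra` shares a string prefix with an authorized path but is a different directory. `PurePosixPath.is_relative_to` requires Python 3.9 or later.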

11.4 PR / Eval / Release Gate Memo

# AI-Assisted Release Gate Memo: Customer Service RAG v3.4

## Decision
- Limited go for pilot group A.
- Human-only fallback remains active for dispute eligibility and refund commitment questions.

## Change Identity
- Artifact types: prompt, RAG index, eval dataset, service code.
- Release bundle: service 3.4.0, prompt 2026-06-29.1, index build dispute-20260629-a, eval suite dispute-rag-v3.
- Risk tier: Tier 2 general, Tier 3 for customer-rights sensitive intents.

## AI Involvement
- Code agent generated regression eval cases and PR summary.
- Human developer modified retrieval guardrail code.
- SME approved policy examples.
- Security reviewed prompt injection and PII cases.

## DORA / SPACE Context
- Source-to-index lead time reduced from 6.2 days to 3.1 days for pilot changes.
- Review burden remained stable because PRs were split by artifact type.
- Eval queue time increased by 18%, with action to parallelize grounding checks.

## Eval Results
- Groundedness improved from 91.4% to 96.2% on dispute policy questions.
- Invalid citation critical failures: 0 in the release suite.
- Prompt injection defense passed all high-risk cases.
- Answerability for missing-source questions improved from 82.0% to 93.5%.

## Security and Privacy
- Secret scan passed.
- PII leakage tests passed.
- Retrieval source access limited to approved policy documents.
- No production customer data used in eval.

## Release Controls
- Canary to pilot group A at 10% of traffic.
- Monitor invalid citation, SME correction, escalation error and complaint tags.
- Rollback path: revert prompt and index build; activate human-only mode for Tier 3 intents.

## Approval
- Product owner approved customer service pilot scope.
- Engineering owner approved operability.
- Security approved privacy and prompt injection controls.
- Model risk liaison accepted limited go with monitoring triggers.
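
The memo records a decision but not the rule behind it. One way such a rule could be encoded is sketched below; the `GateInputs` fields, the `decide` function and the 0.95 groundedness floor are all hypothetical and would need to be replaced by the thresholds agreed in the eval contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:
    """Evidence summarized from the memo. Field names are assumptions."""
    critical_failures: int        # e.g. invalid-citation critical failures
    security_checks_passed: bool  # secret scan, PII leakage, prompt injection
    groundedness: float           # fraction, e.g. 0.962
    has_tier3_intents: bool       # customer-rights sensitive intents in scope
    approvals_complete: bool


def decide(inputs, min_groundedness=0.95):
    """Map gate evidence to a decision. The 0.95 floor is hypothetical."""
    if inputs.critical_failures > 0 or not inputs.security_checks_passed:
        return "no-go"
    if not inputs.approvals_complete or inputs.groundedness < min_groundedness:
        return "no-go"
    if inputs.has_tier3_intents:
        # Sensitive intents keep the human-only fallback, so scope stays limited.
        return "limited go"
    return "go"


# Values taken from the memo above.
memo = GateInputs(
    critical_failures=0,
    security_checks_passed=True,
    groundedness=0.962,
    has_tier3_intents=True,
    approvals_complete=True,
)
decision = decide(memo)
```

With the memo's figures this sketch yields "limited go", which matches the recorded decision; any single critical failure would turn it into "no-go".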

11.5 Evidence Binder Index

# Evidence Binder Index: Fraud Model Feature Release 2026-06

## Release Identity
- Use case: real-time card fraud scoring.
- Release bundle: feature schema 14.2, model fraud-xgb-202606, scoring service 8.7.1.
- Risk tier: Tier 3 financial decision support.

## Business and Risk Evidence
- Business case and expected fraud loss reduction.
- False positive cost analysis by customer segment.
- Operational review capacity assessment.
- Customer impact analysis and escalation path.

## Technical Evidence
- Training data snapshot and lineage.
- Feature validation report.
- Model evaluation and calibration report.
- Regression suite result.
- Shadow mode comparison report.
- Monitoring and rollback runbook.

## Security and SSDF Evidence
- Secure coding scan result.
- Dependency and license report.
- Secrets and data handling checks.
- Artifact provenance manifest.
- Vulnerability response plan.

## Governance Evidence
- Model risk review notes.
- Product approval.
- Security approval.
- Release manager decision.
- Exception register with expiry dates.
- Post-release monitoring summary.
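
An index like this can feed the "Evidence binder completeness by release" tile directly. The following sketch assumes evidence items are tracked by their exact titles; `binder_completeness` is a hypothetical helper and only two of the four sections are listed to keep the example short.

```python
def binder_completeness(required, present):
    """Return (completeness ratio, missing items grouped by section)."""
    total = sum(len(items) for items in required.values())
    if total == 0:
        raise ValueError("required evidence list is empty")
    missing = {}
    for section, items in required.items():
        absent = [item for item in items if item not in present]
        if absent:
            missing[section] = absent
    found = total - sum(len(absent) for absent in missing.values())
    return found / total, missing


REQUIRED = {
    "Security and SSDF": [
        "Secure coding scan result",
        "Dependency and license report",
        "Secrets and data handling checks",
        "Artifact provenance manifest",
        "Vulnerability response plan",
    ],
    "Governance": [
        "Model risk review notes",
        "Product approval",
        "Security approval",
        "Release manager decision",
        "Exception register with expiry dates",
        "Post-release monitoring summary",
    ],
}
# Hypothetical binder state before release: two items not yet filed.
present = {item for items in REQUIRED.values() for item in items} - {
    "Vulnerability response plan",
    "Post-release monitoring summary",
}
ratio, missing = binder_completeness(REQUIRED, present)
```

Returning the missing items alongside the ratio keeps the tile actionable: a release owner sees which documents block the gate, not just a percentage.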

11.6 DevEx Friction Log

# DevEx Friction Log: AI SDLC Pilot

## Observation
- Reviewers spend too much time reconstructing agent context for medium-risk PRs.

## Evidence
- Review turnaround increased from 14 hours to 28 hours for AI-generated PRs.
- 42% of review comments ask for missing rationale, test explanation or risk note.
- Developer survey cognitive load score is 3.1 out of 5, where 5 means high load.

## Impact
- Agent coding speed is not translating into end-to-end lead time improvement.
- Senior reviewers are overloaded by context reconstruction.

## Product Bet
- Add mandatory context pack manifest and AI assistance disclosure to PR template.
- Limit agent PR diff size for Tier 2 and above.
- Auto-link eval result, test result and risk tier in PR summary.

## Success Signal
- Review turnaround returns to below 18 hours.
- Context-related review comments drop below 15%.
- Developer cognitive load score improves to below 2.5.
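
The three success signals are simple "lower is better" thresholds, so the follow-up review can check them mechanically. A minimal sketch, assuming the observed values arrive as a dictionary with the hypothetical keys shown below:

```python
# Targets copied from the Success Signal list; key names are assumptions.
SUCCESS_SIGNALS = {
    "review_turnaround_hours": 18.0,
    "context_comment_share": 0.15,
    "cognitive_load_score": 2.5,
}


def unmet_signals(observed):
    """Return the signals that are not yet strictly below their target."""
    return sorted(
        name
        for name, target in SUCCESS_SIGNALS.items()
        if not observed[name] < target
    )


# Baseline values from the Evidence section of this log.
baseline = {
    "review_turnaround_hours": 28.0,
    "context_comment_share": 0.42,
    "cognitive_load_score": 3.1,
}
# Hypothetical readings after the product bet ships.
after_bet = {
    "review_turnaround_hours": 16.5,
    "context_comment_share": 0.12,
    "cognitive_load_score": 2.6,
}
still_open_at_baseline = unmet_signals(baseline)
still_open_after_bet = unmet_signals(after_bet)
```

In this made-up follow-up, turnaround and comment share meet their targets while cognitive load does not, so the bet would stay open rather than being declared done on the two telemetry signals alone.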

12. Review Checklists

12.1 AI SDLC metric design checklist

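Are the metrics derived backward from business and risk outcomes rather than from tool telemetry?
Are DORA metrics broken down by service, risk tier, and artifact type?
Do SPACE metrics cover satisfaction, performance, activity, collaboration, and efficiency and flow?
Are the metrics kept out of individual ranking?
Does each metric have defined interpretation rules and documented misinterpretation risks?
Does the set include eval, security, architecture, DevEx, and governance metrics?
Can the dashboard drill down to PR, eval, release, incident, and evidence?
Do high-risk systems have their own thresholds and review cadence?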

12.2 Code agent governance checklist

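Does each agent have an identity, owner, version, scope, and revocation path?
Are risk tiering and authorized-scope definition completed before the task starts?
Can the agent read only authorized context?
Are tool permissions minimized?
Are secrets, PII, PCI data, customer data, and production credentials invisible to the agent by default?
Does every agent PR require a human owner?
Are high-risk changes reviewed by the architecture, security, model risk, or business owner?
Are agent actions, tool calls, test results, and diffs auditable?
Do agent model or prompt updates go through regression evaluation?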

12.3 PR gate checklist

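Does the PR state the business goal, risk tier, change boundary, and AI involvement?
Is the diff reviewable, and does it avoid mixing changes across multiple artifact types?
Are requirement, design, test, eval, and release evidence linked to one another?
Are AI-generated tests checked for validity?
Is there an architecture impact statement and a rollback note?
Do the security scan and the dependency, secret, license, and IaC checks pass?
Does the reviewer have enough context to make a judgment without being forced to reconstruct history?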

12.4 Eval gate checklist

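Does the eval suite cover high-risk business scenarios and prohibited behaviors?
Is there a critical failure taxonomy?
Are offline eval, human review, red-team, and online monitoring distinguished?
For RAG, are citation, groundedness, answerability, and injection resistance verified?
For tool agents, are permissions, parameters, call order, failure handling, and human confirmation verified?
For risk-control or credit scenarios, are segment performance, calibration, override, and drift verified?
Are eval thresholds bound to the release decision rather than used only as a reference?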

12.5 Release gate checklist

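Does the release bundle list the versions of code, prompt, model, data, index, tool schema, policy, and agent config?
Has the risk tier been accepted by the corresponding owner?
Are canary, ramp, feature flag, fallback, rollback, and monitoring trigger in place?
Are there a production SLO, an incident route, and a communication plan?
Are the criteria for go, limited go, no-go, and rollback explicit?
Is the evidence binder complete and reproducible?
Does every gate exception have an owner, expiry date, compensating control, and closure condition?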
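
The last item of this checklist lends itself to an automated check on the exception register. The sketch below is illustrative only: the `GateException` record and its field names are assumptions about how a register entry might be stored.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GateException:
    """One exception register entry. Field names are assumptions."""
    exception_id: str
    owner: Optional[str]
    expiry: Optional[date]
    compensating_control: Optional[str]
    closure_condition: Optional[str]


def exception_problems(exc, today):
    """List what makes an exception invalid; an empty list means it is acceptable."""
    problems = []
    for field in ("owner", "expiry", "compensating_control", "closure_condition"):
        if not getattr(exc, field):
            problems.append(f"missing {field}")
    if exc.expiry is not None and exc.expiry < today:
        problems.append("expired")
    return problems


# Hypothetical register entries.
today = date(2026, 6, 29)
valid = GateException(
    "EX-1", "release manager", date(2026, 7, 31),
    "human-only mode for Tier 3 intents", "eval suite covers refund intents",
)
stale = GateException("EX-2", "platform PM", date(2026, 5, 31), None, "gate automated")
valid_problems = exception_problems(valid, today)
stale_problems = exception_problems(stale, today)
```

Running this on every open exception before a release review would surface expired or ownerless entries instead of leaving them to be noticed by hand.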

12.6 Operating review checklist

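Does the review focus on system improvement rather than blaming individuals?
Are DORA speed and stability read together?
Are SPACE experience and collaboration signals factored into decisions?
Does the review check whether agent adoption produces real outcomes?
Does each review select only a few high-leverage improvement bets?
Does it track whether the previous improvements were closed, rather than continually adding new metrics?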

13. Anti-Patterns

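Anti-pattern	Surface appeal	Actual risk	Alternative
Story point theater	Appears to manage capacity	Ignores quality, risk, flow, and outcomes	Use a combination of DORA / SPACE / eval / outcome metrics
Lines-of-code ROI	Data is easy to pull from tools	Rewards redundant code and review burden	Track accepted change, rework, defect, lead time
Agent adoption vanity metric	License activation rate looks good	High adoption does not mean high value	Track workflow adoption, acceptance, DevEx, and stability
Individual productivity ranking	Management wants to identify strong and weak performers	Undermines collaboration and encourages metric gaming	Use productivity metrics only at team and system level
Eval after release	Ship first, evaluate later	High-risk failures reach customer workflows	Move the eval contract upstream into requirements and PR
Gate as paperwork	Plenty of evidence	Gate does not change decisions and adds burden	Every gate must be able to block, limit, route, or rollback
One dashboard for all systems	A unified dashboard is simple	Mixing core banking with internal tools leads to misjudgment	Interpret by risk tier and system context
Agent writes everything	Maximizes automation	Architecture debt, security risk, and business accountability get out of control	Task routing and tiered permissions
Reviewer as cleanup crew	A human backstop looks safe	Senior engineers are consumed by low-value cleanup	Improve context pack, diff limit, agent eval
Prompt-only governance	Only the prompt is reviewed	Ignores data, tools, policy, code, and release	Manage the full release bundle
Security as final scan	Run scans just before go-live	Design flaws and permission risks are found too late	Embed SSDF controls in intake, PR, eval, release
Faster bad architecture	Generation is fast	Future changes get slower and incident recovery gets harder	Architecture fitness functions and review gate
No learning loop	The incident is closed once handled	The same class of problem recurs	Incident -> eval case -> regression -> runbook -> metric update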

14. 30-Day Training Plan

Goal: within 30 days, complete a presentable AI DORA / SPACE Engineering Productivity Operating System portfolio. Suggested themes: "financial retail customer service RAG" or "AI code agent platform pilot".

Day	Training topic	Output
1	Read the DORA, SPACE, SSDF source anchors	1-page source anchor summary
2	Choose a financial retail AI SDLC scenario	Use case brief and risk tier
3	Draw the value stream: idea to production	AI SDLC value stream map
4	Define product and risk outcomes	Outcome tree
5	Build the DORA metric tree	DORA mapping sheet
6	Build the SPACE metric tree	DevEx and collaboration metric sheet
7	Week 1 retrospective	Executive narrative: why story point is insufficient
8	Define artifact taxonomy: code / prompt / model / data / index / policy	Artifact change taxonomy
9	Design AI quality metrics	Eval metric catalogue
10	Design security and SSDF controls	SSDF-to-gate control map
11	Design code agent task taxonomy	Agent task routing policy
12	Design agent permission matrix	Agent governance card
13	Design PR gate	PR gate checklist and template
14	Week 2 retrospective	Gate architecture diagram
15	Design eval gate	Eval gate decision matrix
16	Design release gate	Release gate memo example
17	Design evidence binder	Evidence binder index
18	Design DORA / SPACE dashboard	Dashboard spec
19	Design DevEx survey and friction log	DevEx operating review pack
20	Design architecture fitness metrics	Product architecture metric sheet
21	Week 3 retrospective	Operating system narrative v1
22	Write the core banking change case	Case study 1
23	Write the risk-control model case	Case study 2
24	Write the customer service RAG case	Case study 3
25	Write the AI platform case	Case study 4
26	Build the anti-pattern and governance checklists	Anti-pattern and governance checklist
27	Write 8 interview answers	Interview answer bank
28	Assemble the portfolio	Portfolio package
29	Record the 5-minute narration script	Executive story script
30	Self-assess and close gaps	Final portfolio review

30-day completion criteria:

One AI SDLC value stream map.
One DORA / SPACE / eval / security / product architecture metric tree.
One code agent governance card.
One PR / eval / release gate pack.
Four financial retail cases.
One dashboard spec.
One evidence binder index.
One set of interview answers.
One 5-minute portfolio narration script.

15. Interview Answers

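Q1: How do you define AI engineering productivity?

Short answer:

I would not define AI engineering productivity as code generation volume. I define it as the organization's ability to turn business intent into production value faster, more reliably, and more sustainably under controlled risk.

Expanded:

AI engineering productivity needs five classes of signals read together. First, DORA: change lead time, deployment frequency, change fail rate, failed deployment recovery time, and deployment rework rate. Second, SPACE: developer satisfaction, performance, activity, collaboration, and flow. Third, EvalOps: AI behavior quality, critical failure, groundedness, tool correctness, and regression. Fourth, security governance: SSDF controls, supply chain, secret, PII, permissions, and evidence. Fifth, product and architecture outcomes: customer outcomes, risk outcomes, platform reuse, architecture evolvability, and unit economics. The value of AI holds only when this combination of metrics improves.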

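Q2: Why can't an AI team be managed with story points alone?

Short answer:

Story points only roughly express relative complexity. They cannot explain an AI system's quality, risk, eval coverage, agent permissions, production stability, or developer experience.

Expanded:

Much of the critical work in the AI SDLC is not captured by traditional development points, for example building an eval suite, red-teaming, validating RAG grounding, designing tool permissions, preparing the evidence binder, monitoring drift, and designing rollback. Looking only at points rewards visible output and penalizes quality and governance work. A better approach is to demote story points to a local capacity reference and use a combination of DORA / SPACE / EvalOps / SSDF / product outcome as the primary metrics.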

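Q3: How do DORA metrics adapt to the AI SDLC?

Short answer:

DORA still applies, but the object of change must extend from code to prompt, model, data, RAG index, tool schema, policy, and agent config.

Expanded:

Change lead time should be broken down into intent-to-production, spec-to-eval, and eval-to-release. Deployment frequency should be interpreted by risk tier and artifact type. Change fail rate includes behavior regression, incorrect tool calls, hallucination, data drift, and production rollback. Failed deployment recovery time includes model switch, prompt rollback, index rollback, feature flag, and human-only fallback. Deployment rework rate can indicate whether the eval gate and PR gate are catching problems earlier.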
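
The lead-time breakdown in this answer can be illustrated with a short calculation. This is a sketch under stated assumptions: the four stage timestamps, the function name and the extra `intent_to_spec` segment are hypothetical, and a real pipeline would read the timestamps from the tracker, eval system and deployment log.

```python
from datetime import datetime


def lead_time_segments(intent, spec_approved, eval_passed, released):
    """Return segment durations in hours for one change."""
    stages = [intent, spec_approved, eval_passed, released]
    if any(later < earlier for earlier, later in zip(stages, stages[1:])):
        raise ValueError("stage timestamps must be in chronological order")

    def hours(start, end):
        return (end - start).total_seconds() / 3600

    return {
        "intent_to_spec": hours(intent, spec_approved),
        "spec_to_eval": hours(spec_approved, eval_passed),
        "eval_to_release": hours(eval_passed, released),
        "intent_to_production": hours(intent, released),
    }


# Hypothetical timestamps for a single prompt change.
segments = lead_time_segments(
    intent=datetime(2026, 6, 1, 9, 0),
    spec_approved=datetime(2026, 6, 2, 9, 0),
    eval_passed=datetime(2026, 6, 3, 15, 0),
    released=datetime(2026, 6, 4, 9, 0),
)
```

Here the end-to-end figure is 72 hours, of which 30 sit between spec and eval. Reporting the segments separately shows where the wait is, which a single lead-time number hides.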

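Q4: What value does SPACE have in the era of AI code agents?

Short answer:

SPACE prevents teams from mistaking activity for productivity, especially when AI can easily produce more PRs, diffs, and comments.

Expanded:

SPACE requires multi-dimensional observation across satisfaction and well-being, performance, activity, collaboration, and efficiency and flow. AI code agents may raise activity, but if review burden, cleanup fatigue, cognitive load, and rework rise at the same time, productivity has not improved. So I would read SPACE survey, review telemetry, flow interruption, collaboration quality, and DORA data together.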

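Q5: How would you govern code agents?

Short answer:

I would treat a code agent as an engineering participant that can be authorized, revoked, and audited, not as a chat window with free access to all code and tools.

Expanded:

The governance design covers agent identity, owner, version, task risk tiering, authorized repositories and paths, tool permissions, sandbox, secret protection, branch policy, PR gate, eval gate, audit trail, and revocation runbook. An agent can produce drafts, tests, documentation, and analysis, but cannot hold accountability for business, architecture, security, model risk, or production outcomes. Every high-risk PR must have a human owner and approval from the corresponding owner.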

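Q6: How do the PR gate, eval gate, and release gate divide responsibilities?

Short answer:

The PR gate governs whether a change is reviewable, testable, and maintainable; the eval gate governs whether AI behavior is acceptable; the release gate governs whether production ramp-up, monitoring, rollback, and evidence are ready.

Expanded:

The PR gate focuses on scope, diff, AI involvement, tests, architecture, security scans, and evidence links. The eval gate focuses on functional, regression, safety, grounding, tool call, bias, robustness, and human review. The release gate focuses on release bundle, risk tier, security and privacy, deployment strategy, SLO, rollback, monitoring trigger, and approval. All three are required; otherwise an AI change will slip into production through a gap.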

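Q7: How do you prove the ROI of AI coding tools to the CTO?

Short answer:

I would show whether end-to-end flow, quality, stability, developer experience, and platform reuse improved, rather than showing code completion counts.

Expanded:

ROI should include reduced lead time, avoided rework, incident cost reduction, review efficiency, developer flow, eval asset reuse, and platform adoption, net of tool cost, model cost, eval cost, review cost, training cost, governance cost, and residual risk. The most convincing approach is to pick 2-3 pilot teams and compare DORA / SPACE / quality / outcome metrics using a baseline, a control group, or a phased rollout.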
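
The netting described in this answer reduces to simple arithmetic once each benefit and cost has been given a monetary estimate. The sketch below assumes that step has already happened; every figure is invented for illustration, and treating residual risk as an expected-loss cost line is one possible convention, not an established rule.

```python
def net_roi(benefits, costs):
    """Return (net value, ROI ratio) where ROI = net value / total cost."""
    total_benefit = sum(benefits.values())
    total_cost = sum(costs.values())
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    net = total_benefit - total_cost
    return net, net / total_cost


# Hypothetical annualized estimates for one pilot team.
benefits = {
    "reduced_lead_time": 120_000,
    "avoided_rework": 60_000,
    "incident_cost_reduction": 20_000,
}
costs = {
    "tool": 40_000,
    "model": 20_000,
    "eval": 15_000,
    "review": 25_000,
    "training": 10_000,
    "governance": 10_000,
    "residual_risk": 5_000,  # expected loss, an assumed convention
}
net, roi = net_roi(benefits, costs)
```

Listing review, eval and governance as explicit cost lines is the point of the exercise: a calculation that counts only license fees would overstate the return.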

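Q8: How do high-risk financial retail systems use AI agents?

Short answer:

High-risk systems can use AI agents, but the agent must be confined to testing, impact analysis, documentation, regression, and low-risk code drafts. Core business judgment and production accountability must stay with humans.

Expanded:

For example, in a core banking limit rule change, an agent can generate boundary tests, find the impact scope, update the runbook, and generate the PR summary, but cannot independently decide what the rule means or release directly. In a risk-control model upgrade, an agent can generate feature validation and monitoring queries, but model risk, thresholds, and customer impact must be reviewed by the risk owner. What matters is task routing, permissions, gates, and evidence.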

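Q9: How do you keep DORA / SPACE from being misused?

Short answer:

Do not use them for individual ranking, do not crudely average across contexts, do not draw management conclusions from a single metric, and give every metric interpretation rules and a counter-metric.

Expanded:

DORA suits team- and system-level delivery performance; SPACE suits the health of the socio-technical system. Neither should become an employee ranking tool. Metrics for core banking, customer service RAG, and internal platforms cannot simply be averaged. Every metric needs counter-metrics: deployment frequency is paired with change fail rate and rework, agent adoption with satisfaction and accepted output, and lead time with quality and incident.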
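
The pairing rule in this answer can be checked automatically when a report or dashboard is assembled. A minimal sketch, assuming metrics are identified by the hypothetical snake_case names below:

```python
# Pairings taken from the answer above; the identifiers are assumptions.
COUNTER_METRICS = {
    "deployment_frequency": {"change_fail_rate", "deployment_rework_rate"},
    "agent_adoption": {"satisfaction", "accepted_output"},
    "lead_time": {"quality", "incident"},
}


def missing_counter_metrics(reported):
    """For each reported headline metric, list counter-metrics left out."""
    reported = set(reported)
    return {
        metric: sorted(counters - reported)
        for metric, counters in COUNTER_METRICS.items()
        if metric in reported and counters - reported
    }


# Hypothetical report: deployment frequency is shown without rework rate.
report = [
    "deployment_frequency",
    "change_fail_rate",
    "agent_adoption",
    "satisfaction",
    "accepted_output",
]
gaps = missing_counter_metrics(report)
```

A report that fails this check is not necessarily wrong, but it shows a headline number without the signal that would reveal whether the number was gamed or bought at the expense of stability.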

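Q10: How would you turn this topic into a portfolio?

Short answer:

I would build an AI Engineering Productivity Operating System package containing a metric tree, dashboard spec, agent governance, gate pack, financial retail cases, and executive narrative.

Expanded:

The portfolio does not need real production data, but it needs constraints like a real project. I would pick customer service RAG or the AI platform pilot and provide risk tiering, value stream, DORA / SPACE metrics, eval gate, SSDF controls, release memo, evidence binder, DevEx friction log, and a 30-day rollout plan. The narrative focuses on how I upgrade AI engineering efficiency from tool procurement into a governable operating system.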

16. Portfolio Deliverables

16.1 Portfolio package

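Artifact	Content	Showcase value
Executive one-pager	Problem, positioning, goals, metric combination, governance principles	Communication aimed at CTO / VP Engineering
AI SDLC value stream map	Flow, waits, rework, and gates from idea to production	Demonstrates systems thinking
DORA / SPACE metric tree	DORA, SPACE, eval, security, architecture, and business outcome metrics	Demonstrates metric design ability
Dashboard spec	Roles, decisions, filters, tiles, drill-down, cadence	Demonstrates platform PM ability
Code agent governance card	Agent identity, scope, permission, audit, revocation	Demonstrates AI governance and security awareness
PR / eval / release gate pack	Gate checklist, decision matrix, memo example	Demonstrates release engineering ability
SSDF control map	Mapping of PO / PS / PW / RV to AI SDLC gates	Demonstrates secure SDLC ability
Financial retail cases	Core banking, risk-control model, customer service RAG, AI platform	Demonstrates industry delivery ability
DevEx friction log	Experience issue, evidence, impact, improvement action, success signal	Demonstrates SPACE and adoption ability
Interview answer bank	8-10 senior-level Q&As	Demonstrates job-search conversion ability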

16.2 5-Minute Narration Structure

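0:00-0:40  Problem definition
AI coding tools are raising local generation speed, but what financial retail really needs is a controllable, evaluable, auditable AI SDLC operating system.

0:40-1:30  Method framework
I use DORA to measure delivery flow and stability, SPACE to measure developer experience and collaboration health, EvalOps to measure AI behavior quality, SSDF to measure security and supply chain controls, and product / architecture metrics to measure long-term value.

1:30-2:30  Operating system
Show the closed loop of value stream, code agent governance, PR gate, eval gate, release gate, production telemetry, and operating review.

2:30-3:40  Financial retail case
Use customer service RAG or a core banking change to explain risk tiering, metric selection, gate evidence, and rollback.

3:40-4:30  Dashboard and governance
Show how the dashboard supports the weekly flow review, monthly governance review, and quarterly product architecture review.

4:30-5:00  Conclusion
AI engineering productivity is not more code. It is faster learning, more stable releases, less rework, better experience, and stronger evidence at equal or lower risk.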

16.3 Self-Check Checklist

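Does it state clearly that story point is not the primary metric?
Does it connect DORA / SPACE / EvalOps / SSDF / product architecture metrics into one system?
Does it include at least one high-risk financial retail case?
Does it show code agent permissions, sandbox, audit, and accountability boundaries?
Does it show the division of responsibilities among the PR / eval / release gates?
Does it avoid using vanity metrics to prove ROI?
Does it include reusable templates and review checklists?
Can it be presented in 5 minutes to a CTO, AI platform leader, or architecture panel?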

17. Final Memory Card

AI engineering productivity is not code generation volume.

It is the operating capability to turn business intent into production value:
  faster through DORA flow,
  safer through quality and SSDF controls,
  smarter through eval gates,
  healthier through SPACE and DevEx,
  more scalable through platform reuse,
  more accountable through governance evidence,
  more valuable through product and architecture outcomes.

For AI PM / architect:
  story point is a capacity conversation,
  DORA / SPACE / EvalOps / SSDF is an operating system conversation.