verify-ip 等逐 IP action 失败后不再进入自动回滚确认,改为保存 failed_stage 并暂停。
用户修复外部问题后输入 resume,会从失败 action 重新执行,而不是结束整个流程。 回滚从 workflow 中拆出,新增显式命令: chat:rollback [IP] CLI:rollback --checkpoint ... [--ip ...] [--stop-first|--no-stop-first] 旧 confirm approve/reject 只保留为旧 checkpoint 兼容入口,新流程不再推荐使用。 LangGraph workflow 已移除回滚确认 interrupt 节点,失败暂停和续跑走业务 checkpoint。 README、打包 README、run.sh --help、流程图、todo、提示词基线和测试都已同步。
This commit is contained in:
parent
9e10bf11cf
commit
badcce5d2d
35
README.md
35
README.md
@ -28,7 +28,7 @@ pam_deploy_graph/
|
||||
params_loader.py # 读取 JSON 或 config.txt 风格参数文件
|
||||
llm/ # LLM structured output 接口、真实 HTTP client、提示词、规则 fallback 和 guardrails
|
||||
graph.py # LangGraph StateGraph 集成入口
|
||||
langgraph_runtime.py # chat 人工确认点的 LangGraph interrupt 运行器
|
||||
langgraph_runtime.py # action 级 LangGraph 运行器
|
||||
mcp_client.py # MCP stdio/HTTP/SSE client、鉴权 token 和配置读取
|
||||
interactive.py # 常驻式 CLI 对话框,会话命令、确认和续跑
|
||||
cli.py # CLI 入口
|
||||
@ -64,8 +64,8 @@ packaging/
|
||||
- 实现 `config.txt.example` 风格和 JSON 风格参数读取。
|
||||
- 实现 fake 全局流程和完整部署流程,便于不触碰真实环境地验证 Agent 路由。
|
||||
- 实现逐 IP 处理骨架:升级、轮询、启动、校验、日志下载。
|
||||
- 实现单 IP 失败后的待回滚确认状态,不自动执行回滚。
|
||||
- 实现人工确认入口:`confirm --decision approve|reject` 只处理待确认回滚。
|
||||
- 实现单 IP 失败后暂停并保留失败 action,修复后 `resume` 会从失败 action 重试。
|
||||
- 回滚已从主 workflow 中拆出,改为 chat/CLI 的显式 `rollback` 命令;旧 `confirm` 入口仅作为兼容保留。
|
||||
- 实现 checkpoint 自动保存和 `resume` 续跑:全局步骤、成功 IP、单 IP 已完成 action 会跳过。
|
||||
- 实现 LLM structured output 骨架:意图识别、参数抽取、部署计划生成。
|
||||
- 实现 OpenAI-compatible 真实 LLM client,支持 `base_url` / `model` 配置,`api_key` 可为空。
|
||||
@ -73,12 +73,12 @@ packaging/
|
||||
- 增加规则 fallback `RuleBasedLlmClient`,用于本地开发和测试。
|
||||
- 增加 LLM 输出 guardrails,禁止计划中出现可执行脚本命令和非法 action。
|
||||
- 引入 `langgraph` 依赖,CLI/chat 执行流程统一通过 action 级 LangGraph runtime 调度。
|
||||
- chat/CLI 人工确认点已接入 LangGraph interrupt/checkpointer:运行到待回滚确认时暂停,`approve/reject` 通过 `Command(resume=...)` 恢复。
|
||||
- CLI/chat 执行流程统一通过 action 级 LangGraph runtime 调度;失败暂停状态写入业务 checkpoint,`resume` 会重新进入图并从断点继续。
|
||||
- 引入 MCP client adapter,可包装 SDK session、普通 callable、stdio server、HTTP/SSE server,并提供 JSON client 配置读取。
|
||||
- CLI/chat 支持 `--mcp-config` 直接加载 MCP server URL、鉴权和可选 tool 覆盖配置。
|
||||
- 本地已安装 `langgraph` 和 `mcp`,并完成 LangGraph fake 全局流程 smoke。
|
||||
- CLI `analyze` 输出已做敏感字段脱敏。
|
||||
- 增加 `chat` 常驻式 CLI 对话框,支持自然语言分析、参数设置、执行确认、回滚确认、状态查看、事件查看、checkpoint 选择和续跑。
|
||||
- 增加 `chat` 常驻式 CLI 对话框,支持自然语言分析、参数设置、执行确认、显式回滚、状态查看、事件查看、checkpoint 选择和续跑。
|
||||
- chat 在开发环境可选启用 `rich` / `prompt_toolkit`;PyInstaller 打包环境默认使用普通文本输入,避免交互兼容问题。
|
||||
- chat 执行前会归一化参数并展示实际写入脚本配置的值;`script_only` / `hybrid_node_mcp` 会提前检查 `ZIP_FILE_PATH` 是否存在。
|
||||
- chat 执行中会播报每个 action 的开始、完成或失败;action 执行失败会停在当前 checkpoint,不再误报 LangGraph 不可用。
|
||||
@ -90,7 +90,7 @@ packaging/
|
||||
- 支持通过 `--llm-action-analysis-prompt-file`、`PAM_LLM_ACTION_ANALYSIS_PROMPT_FILE` 或 chat 内 `llm config action_analysis_prompt_file=...` 自定义 action 审核提示词。
|
||||
- 增加统一运行日志,默认写入 `logs/pam_deploy_agent.log`,覆盖 CLI/chat、LLM 调用、action 路由、脚本/MCP 调用、LangGraph、checkpoint 等关键流程。
|
||||
- chat 支持 `llm test [文本]`,可用当前 LLM client 做一次轻量调用,确认真实 LLM 或规则 fallback 是否正常加载。
|
||||
- 添加基础测试,当前本地结果为 `59 passed, 2 skipped`。
|
||||
- 添加基础测试,当前本地结果为 `62 passed, 2 skipped`。
|
||||
|
||||
未完成:
|
||||
|
||||
@ -294,12 +294,12 @@ PAM> llm config action_analysis_prompt_file=prompts/action_review.txt
|
||||
PAM> mcp config mcp_client.example.json
|
||||
PAM> list checkpoints
|
||||
PAM> load checkpoint runtime/checkpoints/chat-demo.json
|
||||
PAM> approve
|
||||
PAM> rollback
|
||||
PAM> resume
|
||||
PAM> exit
|
||||
```
|
||||
|
||||
`chat` 默认仍要求在会话内显式输入 `run`,并确认参数、目标 IP 范围和最终执行后才会执行 action。输入 `你好`、`hello` 这类问候不会触发 LLM/结构化分析;需要分析部署需求时可直接描述部署任务,或显式使用 `analyze <需求>`。每个 action 完成后都会自动进入一次 LLM/规则审核,并播报审核开始/结束;如果审核建议停止或审核本身失败,流程会暂停并输出建议,等待用户决定是否 `resume`。`llm test [文本]` 可测试当前 LLM client 是否可用。`--analyze-actions` 仅控制详细审核结果是否写入 `events`。执行中可按 `Ctrl+C` 中断,chat 会保存当前 checkpoint 并把流程标记为 `user_interrupted`。`set KEY=VALUE` 和 `load params <路径>` 会把更新同步到当前运行 state、`config.txt` 和 checkpoint。`chat` 也支持 `--llm-base-url` / `--llm-api-key` / `--llm-model` / `--llm-action-analysis-prompt-file`、`--mcp-config` 和 `--analyze-actions`。
|
||||
`chat` 默认仍要求在会话内显式输入 `run`,并确认参数、目标 IP 范围和最终执行后才会执行 action。输入 `你好`、`hello` 这类问候不会触发 LLM/结构化分析;需要分析部署需求时可直接描述部署任务,或显式使用 `analyze <需求>`。每个 action 完成后都会自动进入一次 LLM/规则审核,并播报审核开始/结束;如果审核建议停止或审核本身失败,流程会暂停并输出建议,等待用户决定是否 `resume`。逐 IP action 失败时也会暂停,修复外部环境后输入 `resume` 会从失败 action 重试;如果确实需要回滚,使用 `rollback [IP]` 显式执行。`llm test [文本]` 可测试当前 LLM client 是否可用。`--analyze-actions` 仅控制详细审核结果是否写入 `events`。执行中可按 `Ctrl+C` 中断,chat 会保存当前 checkpoint 并把流程标记为 `user_interrupted`。`set KEY=VALUE` 和 `load params <路径>` 会把更新同步到当前运行 state、`config.txt` 和 checkpoint。`chat` 也支持 `--llm-base-url` / `--llm-api-key` / `--llm-model` / `--llm-action-analysis-prompt-file`、`--mcp-config` 和 `--analyze-actions`。
|
||||
|
||||
## 日志
|
||||
|
||||
@ -332,18 +332,23 @@ fake 完整部署流程验证:
|
||||
python -m pam_deploy_graph.cli run-deploy --config doc_scripts/config.txt.example --strategy fake --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
```
|
||||
|
||||
如果某个 IP 失败并进入待回滚确认,先查看输出中的 `confirmation`,再人工决定:
|
||||
如果某个 IP 失败,流程会保存 checkpoint 并暂停;修复外部环境后可直接续跑,Agent 会从失败 action 重试:
|
||||
|
||||
```bash
|
||||
python -m pam_deploy_graph.cli confirm --checkpoint runtime/checkpoints/demo.json --decision approve --confirm
|
||||
python -m pam_deploy_graph.cli resume --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
```
|
||||
|
||||
`confirm` 会通过 LangGraph interrupt resume 处理确认,并在确认后继续执行后续图节点;如果流程此前处于 `paused` 状态,`resume` 会先清理暂停标记,再从 checkpoint 继续执行。
|
||||
|
||||
拒绝回滚:
|
||||
如果需要回滚失败 IP,请显式执行 rollback。未传 `--ip` 时会使用当前失败 IP;执行完成后再用 `resume` 继续主流程。
|
||||
|
||||
```bash
|
||||
python -m pam_deploy_graph.cli confirm --checkpoint runtime/checkpoints/demo.json --decision reject --note "人工决定暂不回滚" --confirm
|
||||
python -m pam_deploy_graph.cli rollback --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
python -m pam_deploy_graph.cli resume --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
```
|
||||
|
||||
也可以指定 IP 和停机策略:
|
||||
|
||||
```bash
|
||||
python -m pam_deploy_graph.cli rollback --checkpoint runtime/checkpoints/demo.json --ip 192.168.1.10 --stop-first --note "人工决定回滚该 IP" --confirm
|
||||
```
|
||||
|
||||
checkpoint 用于断点续跑,会保存完整运行状态和参数。为了支持真实续跑,Agent 写入 checkpoint 时不会脱敏参数;请把 checkpoint 放在受控目录中。如果不传 `--checkpoint`,流程仍可运行,但不能跨进程 `resume`。
|
||||
@ -364,5 +369,5 @@ pytest -q
|
||||
|
||||
1. 接入真实 PAM_NODE MCP session,并用 `SessionMcpToolClient` 包装。
|
||||
2. 在测试环境中做 smoke:HOME 脚本 `get-token/get-node-url` + NODE MCP `get-online-ips`。
|
||||
3. 在测试环境验证真实脚本 action 的失败、回滚确认和续跑链路。
|
||||
3. 在测试环境验证真实脚本 action 的失败重试、显式回滚和续跑链路。
|
||||
4. 继续细化参数确认、IP 范围确认的交互式 UI 或上层编排。
|
||||
|
||||
@ -1,6 +1,6 @@
|
||||
# 当前整体逻辑结构流程图
|
||||
|
||||
本文描述当前 PAM 部署 Agent 的主要模块、运行路径、LLM 审核、人工确认点、热更新和断点续跑逻辑。
|
||||
本文描述当前 PAM 部署 Agent 的主要模块、运行路径、LLM 审核、失败重试、显式回滚、热更新和断点续跑逻辑。
|
||||
|
||||
## 模块结构
|
||||
|
||||
@ -27,7 +27,7 @@ flowchart TD
|
||||
REAL --> AGENT
|
||||
|
||||
LGR --> AGENT
|
||||
LGR --> LGCHECK[LangGraph InMemorySaver checkpointer/interrupt]
|
||||
LGR --> LGCHECK[LangGraph InMemorySaver checkpointer]
|
||||
AGENT --> ROUTER[ActionRouter]
|
||||
ROUTER --> SCRIPT[ScriptActionRunner]
|
||||
ROUTER --> MCP[McpActionRunner]
|
||||
@ -68,8 +68,8 @@ flowchart TD
|
||||
C --> D[build_action_backends 生成 action 路由表]
|
||||
D --> E[LangGraph entry 节点]
|
||||
|
||||
E --> F{是否存在 pending_confirmation}
|
||||
F -- 是 --> P[confirm interrupt 节点]
|
||||
E --> F{是否已暂停}
|
||||
F -- 是 --> R[render_report 输出报告]
|
||||
F -- 否 --> G[global_action 节点循环]
|
||||
|
||||
G --> G1[get-token]
|
||||
@ -131,7 +131,7 @@ flowchart TD
|
||||
- `--analyze-actions` 或 `llm action-analysis on` 只控制是否把详细审核结果写入 `events`。
|
||||
- 如果审核本身失败,也会生成“停止继续”的审核结果并暂停流程,避免黑盒继续执行。
|
||||
|
||||
## 失败、人工确认和续跑
|
||||
## 失败、显式回滚和续跑
|
||||
|
||||
```mermaid
|
||||
flowchart TD
|
||||
@ -142,23 +142,19 @@ flowchart TD
|
||||
C1 -- 否 --> G[保存 checkpoint 并暂停]
|
||||
B -- 是 --> D[记录 ip_state 为 FAILED]
|
||||
D --> E[download-log 尽力下载日志]
|
||||
E --> F[设置 pending_confirmation=rollback-ip:IP]
|
||||
E --> F[保存 failed_stage 和 failure_reason]
|
||||
F --> G[保存 checkpoint 并暂停]
|
||||
|
||||
G --> LG{是否来自 CLI/chat 图运行}
|
||||
LG -- 是 --> LGI[LangGraph interrupt 输出确认请求]
|
||||
LGI --> LGRS[approve/reject 通过 Command resume 恢复]
|
||||
LGRS --> H{用户决定}
|
||||
LG -- 否 --> H{用户决定}
|
||||
H -- approve --> I[confirm_pending 执行 rollback-ip]
|
||||
I --> J{rollback 是否成功}
|
||||
J -- 是 --> K[清空 pending_confirmation]
|
||||
J -- 否 --> L[保持 pending_confirmation,等待再次处理]
|
||||
H -- reject --> M[标记 REJECTED_BY_OPERATOR 并清空 pending_confirmation]
|
||||
|
||||
K --> N[resume 续跑]
|
||||
M --> N
|
||||
N --> O[跳过已完成全局步骤、成功 IP 和单 IP 已完成 action]
|
||||
G --> H{用户决定}
|
||||
H -- 修复后继续 --> I[resume 清理 paused]
|
||||
I --> J[next_ip_action 返回 failed_stage]
|
||||
J --> K[重试失败 action]
|
||||
H -- 需要回滚 --> L[rollback IP 显式执行 rollback-ip]
|
||||
L --> M{rollback 是否成功}
|
||||
M -- 是 --> N[标记 ROLLBACK_DONE]
|
||||
M -- 否 --> O[暂停为 rollback_failed]
|
||||
N --> P[resume 续跑]
|
||||
P --> Q[跳过已完成全局步骤、成功 IP、已回滚 IP 和单 IP 已完成 action]
|
||||
```
|
||||
|
||||
## 用户中断与热更新
|
||||
@ -184,12 +180,14 @@ flowchart TD
|
||||
|
||||
- `completed_global_steps`:全局阶段已经完成的 action 会跳过。
|
||||
- `ip_states[ip].status == SUCCESS`:成功 IP 会跳过。
|
||||
- `ip_states[ip].rollback_status == ROLLBACK_DONE`:已显式回滚的失败 IP 会跳过,继续后续目标。
|
||||
- `ip_states[ip].failed_stage`:失败 IP 未回滚时,`resume` 会从该 action 重试。
|
||||
- `ip_states[ip].completed_steps`:同一个 IP 已完成的 action 会跳过。
|
||||
- `pending_confirmation`:存在待确认事项时,部署流程不继续执行,必须先 `approve` 或 `reject`。
|
||||
- `paused` / `pause_reason`:流程可能因 LLM 审核阻断、用户中断、回滚失败等原因暂停;`resume` 会先清理暂停标记,再继续执行。
|
||||
- `pending_confirmation`:仅保留为旧 checkpoint/旧 confirm 入口的兼容字段,新失败流程不再自动设置。
|
||||
- `paused` / `pause_reason`:流程可能因 action 失败、LLM 审核阻断、用户中断、回滚失败等原因暂停;`resume` 会先清理暂停标记,再继续执行。
|
||||
- `review_context`:保存最近一次暂停时的审核建议、失败原因、IP 和阶段,供 chat/CLI 输出给用户。
|
||||
- CLI/chat 的运行调度由 `langgraph_runtime.py` 通过 action 级 LangGraph 节点执行;chat 和 CLI confirm 的确认点使用 LangGraph interrupt 和 InMemorySaver。
|
||||
- 跨进程续跑仍读取业务 checkpoint JSON;LangGraph checkpointer 负责单进程图恢复和 interrupt resume。
|
||||
- CLI/chat 的运行调度由 `langgraph_runtime.py` 通过 action 级 LangGraph 节点执行;失败暂停和续跑依赖业务 checkpoint JSON。
|
||||
- 跨进程续跑读取业务 checkpoint JSON;LangGraph checkpointer 负责单进程图状态保存。
|
||||
- checkpoint 为了真实续跑会保存完整参数,请放在受控目录中。
|
||||
|
||||
## 真实外部能力接入点
|
||||
|
||||
@ -11,14 +11,14 @@
|
||||
- [x] 增加参数确认和目标 IP 范围确认,不只在回滚阶段确认。
|
||||
- [x] 增加 LLM/MCP 配置热加载,例如 `llm config`、`mcp config`。
|
||||
- [x] 增加执行中 `Ctrl+C` 中断处理:保存 checkpoint、标记 `user_interrupted`,再由 `resume` 继续。
|
||||
- [x] 将 chat 的人工确认点接入 LangGraph interrupt/checkpointer;`run` 执行到回滚确认点后由 interrupt 暂停,`approve/reject` 通过 `Command(resume=...)` 恢复同一图线程。跨进程续跑仍保留业务 checkpoint JSON。
|
||||
- [x] 将 chat 执行接入 action 级 LangGraph runtime;逐 IP action 失败后保存 checkpoint 并暂停,`resume` 从失败 action 重试,`rollback [IP]` 作为显式命令单独执行。
|
||||
|
||||
## LLM action 后分析
|
||||
|
||||
- [x] 每次 action 完成后,可把 `action`、`backend`、`ok`、`values`、`stderr`、`error_summary` 和当前 `AgentState` 摘要交给 LLM 分析。
|
||||
- [x] LLM 输出结构化结果:是否异常、异常等级、可能原因、建议动作、是否需要人工确认。
|
||||
- [x] LLM 分析结果会影响流程是否继续:`should_continue=false` 时自动暂停,并把建议输出给用户。
|
||||
- [x] 本地保留规则兜底:exit code、`verify-ip SUCCESS=false`、pending confirmation 等硬规则优先于 LLM。
|
||||
- [x] 本地保留规则兜底:exit code、`verify-ip SUCCESS=false`、旧版 pending confirmation 等硬规则优先于 LLM。
|
||||
- [x] 对 LLM 输入做脱敏,禁止把 `CLIENT_SECRET`、token、Authorization、完整日志原文发送给模型。
|
||||
- [x] 每个 action 都会执行审核;`--analyze-actions` 或 `llm action-analysis on` 只控制是否把详细审核结果写入 `events`。
|
||||
- [x] 支持通过 `--llm-action-analysis-prompt-file`、环境变量或 chat 命令热加载自定义 action 审核提示词。
|
||||
|
||||
@ -72,6 +72,8 @@ cd pam-deploy-agent-linux-x86_64
|
||||
|
||||
- 每个 action 完成后都会自动执行一次 LLM/规则审核。
|
||||
- `--analyze-actions` 只控制是否把详细审核结果写入 `events`。
|
||||
- 逐 IP action 失败后会保存 checkpoint 并暂停;修复外部环境后通过 `resume` 从失败 action 重试。
|
||||
- 回滚不再属于主 workflow 自动分支;需要时使用 chat 内 `rollback [IP]` 或 CLI `rollback --checkpoint ...` 显式执行。
|
||||
- chat 支持执行中 `Ctrl+C` 中断后保存 checkpoint,再通过 `resume` 继续。
|
||||
- chat 支持 `set KEY=VALUE` 和 `load params <路径>` 热更新当前运行任务参数。
|
||||
- 支持通过 `--llm-action-analysis-prompt-file` 或 chat 内 `llm config action_analysis_prompt_file=...` 自定义 action 审核提示词。
|
||||
|
||||
@ -35,7 +35,7 @@ pam-deploy-agent-linux-x86_64/
|
||||
```
|
||||
|
||||
发布包默认使用普通文本输入,避免 PyInstaller 环境下 `prompt_toolkit` 兼容性问题;输出仍会在可用时使用 `rich` 做更清晰的文本展示。
|
||||
chat 内的失败回滚确认由 LangGraph interrupt 托管;执行停在确认点后,输入 `approve` 或 `reject [原因]` 会恢复同一个图线程继续处理。
|
||||
逐 IP action 失败后会保存 checkpoint 并暂停;修复外部环境后输入 `resume` 会从失败 action 重试。回滚不再属于主 workflow 自动分支,需要时在 chat 内输入 `rollback [IP]` 显式执行。
|
||||
chat 会在执行前归一化并展示实际写入脚本配置的参数;`script_only` / `hybrid_node_mcp` 会先检查 `ZIP_FILE_PATH` 是否存在,避免脚本运行后才用默认路径失败。执行过程中每个 action 都会输出开始、完成或失败状态;每个 action 完成后还会自动进入一次 LLM/规则审核,并播报审核开始和审核结果。
|
||||
|
||||
## 交互式使用
|
||||
@ -77,7 +77,7 @@ PAM> llm config action_analysis_prompt_file=prompts/action_review.txt
|
||||
PAM> mcp config mcp_client.example.json
|
||||
PAM> list checkpoints
|
||||
PAM> load checkpoint runtime/checkpoints/demo.json
|
||||
PAM> approve
|
||||
PAM> rollback
|
||||
PAM> resume
|
||||
PAM> exit
|
||||
```
|
||||
@ -124,18 +124,23 @@ PAM> exit
|
||||
--confirm
|
||||
```
|
||||
|
||||
处理失败后的回滚确认:
|
||||
失败后从断点重试:
|
||||
|
||||
```bash
|
||||
./run.sh confirm --checkpoint runtime/checkpoints/demo.json --decision approve --confirm
|
||||
./run.sh resume --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
```
|
||||
|
||||
`confirm` 会通过 LangGraph interrupt resume 处理确认,并在确认后继续执行后续图节点;进程中断或需要再次续跑时,再使用 `resume`。
|
||||
|
||||
拒绝回滚:
|
||||
需要回滚失败 IP 时显式执行 rollback,未指定 `--ip` 时会使用当前失败 IP:
|
||||
|
||||
```bash
|
||||
./run.sh confirm --checkpoint runtime/checkpoints/demo.json --decision reject --note "人工决定暂不回滚" --confirm
|
||||
./run.sh rollback --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
./run.sh resume --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
```
|
||||
|
||||
也可以指定 IP 和停机策略:
|
||||
|
||||
```bash
|
||||
./run.sh rollback --checkpoint runtime/checkpoints/demo.json --ip 192.168.1.10 --stop-first --note "人工决定回滚该 IP" --confirm
|
||||
```
|
||||
|
||||
## LLM 配置
|
||||
@ -242,4 +247,4 @@ MCP token 获取方式与 HOME 一致,默认按 `client_credentials` POST 到
|
||||
- 如果审核建议停止、审核本身失败,或用户在执行中按下 `Ctrl+C`,流程都会保存 checkpoint 并进入暂停状态;后续可使用 `resume` 继续。
|
||||
- `set KEY=VALUE` 和 `load params <路径>` 会热更新当前运行任务的参数,并回写运行中的 `config.txt` 和 checkpoint。
|
||||
- `checkpoint` 会保存完整运行参数,请放在受控目录。
|
||||
- `hybrid_node_mcp`、`resume`、`confirm` 如果需要执行 MCP action,请同时传入 `--mcp-config`。
|
||||
- `hybrid_node_mcp`、`resume`、`rollback` 如果需要执行 MCP action,请同时传入 `--mcp-config`。
|
||||
|
||||
@ -96,7 +96,8 @@ PAM 部署 Agent 解压即用包
|
||||
run-global 执行全局阶段:token、版本、上传、发布、Node URL、下载任务。
|
||||
run-deploy 执行完整部署流程:全局阶段 + 逐 IP 阶段。
|
||||
resume 从 checkpoint 继续执行。
|
||||
confirm 处理待人工确认事项,目前用于失败 IP 回滚确认。
|
||||
rollback 显式回滚失败 IP;不传 --ip 时使用当前失败 IP。
|
||||
confirm 兼容旧 checkpoint 的人工确认命令,新流程通常不需要使用。
|
||||
|
||||
通用参数:
|
||||
--config <路径>
|
||||
@ -110,7 +111,7 @@ PAM 部署 Agent 解压即用包
|
||||
hybrid_node_mcp PAM_HOME 走脚本,PAM_NODE 走 MCP。
|
||||
|
||||
--checkpoint <路径>
|
||||
checkpoint JSON 路径。用于断点续跑和人工确认恢复。
|
||||
checkpoint JSON 路径。用于断点续跑和显式回滚。
|
||||
示例:runtime/checkpoints/demo.json
|
||||
|
||||
--target-ip <IP>
|
||||
@ -119,7 +120,7 @@ PAM 部署 Agent 解压即用包
|
||||
--mcp-config <路径>
|
||||
MCP client JSON 配置文件。通常配置 server_url 和独立鉴权信息;
|
||||
Agent 会从 server list_tools 自动发现 tools。hybrid_node_mcp 策略、
|
||||
resume 或 confirm 需要执行 MCP action 时使用。
|
||||
resume 或 rollback 需要执行 MCP action 时使用。
|
||||
示例:mcp_client.example.json
|
||||
|
||||
--confirm
|
||||
@ -166,8 +167,10 @@ LLM 环境变量:
|
||||
|
||||
./run.sh run-deploy --config doc_scripts/config.txt.example --strategy fake --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
|
||||
./run.sh confirm --checkpoint runtime/checkpoints/demo.json --decision approve --confirm
|
||||
# 如果进程中断或需要再次续跑:
|
||||
# 失败暂停后,修复外部环境并从失败 action 重试:
|
||||
./run.sh resume --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
# 需要回滚失败 IP 时显式执行:
|
||||
./run.sh rollback --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
./run.sh resume --checkpoint runtime/checkpoints/demo.json --confirm
|
||||
|
||||
查看子命令原始参数:
|
||||
@ -179,10 +182,10 @@ LLM 环境变量:
|
||||
2. doc_scripts 只包含运行必需文件:deploy.sh、config.txt.example、PAM_AUTO_DEPLY_SKILL.md。
|
||||
3. prompts/action_review.txt 是当前默认 action 审核提示词基线,可复制后自行修改。
|
||||
4. mcp_client.example.json 是 MCP server URL + 独立鉴权配置示例,需要按真实 MCP server 修改。
|
||||
5. confirm 会通过 LangGraph interrupt resume 处理确认,并继续后续图节点;进程中断时再使用 resume。
|
||||
5. 逐 IP action 失败后会暂停;修复后用 resume 从失败 action 重试,需要回滚时用 rollback 显式执行。
|
||||
6. chat 会在执行前归一化并展示实际写入脚本配置的参数;script_only / hybrid_node_mcp 会先检查 ZIP_FILE_PATH 是否存在。
|
||||
7. chat 执行过程中会播报每个 action 的开始、完成或失败;普通问候不会触发 LLM/结构化分析。
|
||||
8. chat 内可使用 params、events、list checkpoints、load checkpoint、load params、llm config、llm test、mcp config 等命令。
|
||||
8. chat 内可使用 params、events、rollback、list checkpoints、load checkpoint、load params、llm config、llm test、mcp config 等命令。
|
||||
9. 日志默认写入 logs/pam_deploy_agent.log,并会脱敏 token、secret、api_key、Authorization 等字段。
|
||||
10. checkpoint 会保存完整运行参数,请放在受控目录。
|
||||
HELP_TEXT
|
||||
|
||||
@ -442,7 +442,7 @@ class PamDeployAgent:
|
||||
return state
|
||||
|
||||
def run_ip_flow(self, state: AgentState) -> AgentState:
|
||||
"""执行逐 IP 部署流程,失败时停在人工确认点。"""
|
||||
"""执行逐 IP 部署流程,失败时暂停在失败 action,等待修复后重试。"""
|
||||
logger.info(
|
||||
"逐 IP 流程开始 run_id=%s paused=%s target_ips=%s online_ips=%s",
|
||||
state.run_id,
|
||||
@ -473,10 +473,12 @@ class PamDeployAgent:
|
||||
if ip_state and ip_state.get("status") == "SUCCESS":
|
||||
continue
|
||||
if ip_state and ip_state.get("status") == "FAILED":
|
||||
if ip_state.get("rollback_status") == "PENDING_AGENT_CONFIRMATION":
|
||||
state.pending_confirmation = f"rollback-ip:{ip}"
|
||||
self._save_checkpoint(state)
|
||||
return None
|
||||
if ip_state.get("rollback_status") == "ROLLBACK_DONE":
|
||||
continue
|
||||
failed_stage = str(ip_state.get("failed_stage", ""))
|
||||
completed_steps = ip_state.setdefault("completed_steps", [])
|
||||
if failed_stage and failed_stage not in completed_steps:
|
||||
return ip, failed_stage
|
||||
continue
|
||||
if not ip_state:
|
||||
logger.info("初始化 IP 状态 run_id=%s ip=%s", state.run_id, ip)
|
||||
@ -504,7 +506,7 @@ class PamDeployAgent:
|
||||
return None
|
||||
|
||||
def run_ip_action(self, state: AgentState, ip: str, action: str) -> AgentState:
|
||||
"""执行一个单 IP action,并在失败时设置人工确认点。"""
|
||||
"""执行一个单 IP action;失败时暂停并保留该 action 供 resume 重试。"""
|
||||
ip_state = state.ip_states[ip]
|
||||
completed_steps = ip_state.setdefault("completed_steps", [])
|
||||
if action in completed_steps:
|
||||
@ -567,21 +569,25 @@ class PamDeployAgent:
|
||||
"message": result.error_summary or result.values.get("MESSAGE", "action 执行失败"),
|
||||
}
|
||||
)
|
||||
self._record_ip_failure(state, ip, action, result.error_summary or str(result.values))
|
||||
self.pause_state(
|
||||
state,
|
||||
reason="action_failed",
|
||||
review_context=self._review_context(action=action, analysis=analysis, result=result, ip=ip),
|
||||
)
|
||||
self._record_ip_failure(state, ip, action, result.error_summary or str(result.values))
|
||||
if action != "download-log":
|
||||
self._download_log_best_effort(state, ip)
|
||||
state.pending_confirmation = f"rollback-ip:{ip}"
|
||||
self._save_checkpoint(state)
|
||||
logger.info("IP action 失败并进入确认 run_id=%s ip=%s action=%s pending=%s", state.run_id, ip, action, state.pending_confirmation)
|
||||
logger.info("IP action 失败并暂停等待重试 run_id=%s ip=%s action=%s", state.run_id, ip, action)
|
||||
return state
|
||||
|
||||
self._apply_ip_result(ip_state, action, result.values)
|
||||
ip_state["status"] = "RUNNING"
|
||||
ip_state["failed_stage"] = ""
|
||||
ip_state["failure_reason"] = ""
|
||||
completed_steps.append(action)
|
||||
if state.last_failed_step == action:
|
||||
state.last_failed_step = ""
|
||||
self._emit_progress(
|
||||
{
|
||||
"type": "ACTION_DONE",
|
||||
@ -736,6 +742,105 @@ class PamDeployAgent:
|
||||
)
|
||||
return state
|
||||
|
||||
def rollback_ip(
|
||||
self,
|
||||
state: AgentState,
|
||||
ip: str,
|
||||
*,
|
||||
stop_first: bool | None = None,
|
||||
operator_note: str = "",
|
||||
) -> AgentState:
|
||||
"""显式执行单个 IP 的回滚;该动作不属于主 workflow 自动分支。"""
|
||||
if ip not in state.ip_states:
|
||||
raise ValueError(f"IP 不在当前运行状态中: {ip}")
|
||||
ip_state = state.ip_states[ip]
|
||||
actual_stop_first = bool(ip_state.get("rollback_stop_first", False)) if stop_first is None else stop_first
|
||||
backend = state.action_backends.get("rollback-ip", "script")
|
||||
logger.info(
|
||||
"显式回滚开始 run_id=%s ip=%s backend=%s stop_first=%s note_len=%s",
|
||||
state.run_id,
|
||||
ip,
|
||||
backend,
|
||||
actual_stop_first,
|
||||
len(operator_note),
|
||||
)
|
||||
self._emit_progress(
|
||||
{
|
||||
"type": "ACTION_START",
|
||||
"stage": "rollback-ip",
|
||||
"backend": backend,
|
||||
"ip": ip,
|
||||
}
|
||||
)
|
||||
try:
|
||||
result = self.router.run_action(
|
||||
state,
|
||||
"rollback-ip",
|
||||
ip=ip,
|
||||
stop_first=actual_stop_first,
|
||||
)
|
||||
except Exception as exc:
|
||||
logger.exception("显式回滚 action 调用异常 run_id=%s ip=%s backend=%s", state.run_id, ip, backend)
|
||||
result = ActionResult(
|
||||
action="rollback-ip",
|
||||
backend=backend,
|
||||
ok=False,
|
||||
error_summary=str(exc),
|
||||
)
|
||||
logger.info("显式回滚 action 返回 run_id=%s ip=%s result=%s", state.run_id, ip, _action_result_for_log(result))
|
||||
ip_state["rollback_status"] = "ROLLBACK_DONE" if result.ok else "ROLLBACK_FAILED"
|
||||
ip_state["rollback_stop_first"] = actual_stop_first
|
||||
state.events.append(
|
||||
{
|
||||
"type": "ACTION_DONE" if result.ok else "ACTION_FAIL",
|
||||
"stage": "rollback-ip",
|
||||
"backend": result.backend,
|
||||
"ip": ip,
|
||||
"message": result.error_summary or result.values.get("MESSAGE", "ok"),
|
||||
"operator_note": operator_note,
|
||||
}
|
||||
)
|
||||
self._append_action_analysis(state, "rollback-ip", result, ip=ip)
|
||||
if result.ok:
|
||||
state.pending_confirmation = ""
|
||||
state.last_success_step = "rollback-ip"
|
||||
state.last_failed_step = ""
|
||||
state.paused = False
|
||||
state.pause_reason = ""
|
||||
state.review_context = {}
|
||||
self._emit_progress(
|
||||
{
|
||||
"type": "ACTION_DONE",
|
||||
"stage": "rollback-ip",
|
||||
"backend": result.backend,
|
||||
"ip": ip,
|
||||
"message": result.values.get("MESSAGE", "ok"),
|
||||
}
|
||||
)
|
||||
else:
|
||||
state.last_failed_step = "rollback-ip"
|
||||
state.paused = True
|
||||
state.pause_reason = "rollback_failed"
|
||||
state.review_context = self._review_context(action="rollback-ip", analysis=None, result=result, ip=ip)
|
||||
self._emit_progress(
|
||||
{
|
||||
"type": "ACTION_FAIL",
|
||||
"stage": "rollback-ip",
|
||||
"backend": result.backend,
|
||||
"ip": ip,
|
||||
"message": result.error_summary or result.values.get("MESSAGE", "rollback 执行失败"),
|
||||
}
|
||||
)
|
||||
self._save_checkpoint(state)
|
||||
logger.info(
|
||||
"显式回滚结束 run_id=%s ip=%s rollback_status=%s paused=%s",
|
||||
state.run_id,
|
||||
ip,
|
||||
ip_state.get("rollback_status"),
|
||||
state.paused,
|
||||
)
|
||||
return state
|
||||
|
||||
def _emit_progress(self, payload: dict[str, Any]) -> None:
|
||||
"""向 CLI/chat 回调 action 执行进度,回调失败不影响主流程。"""
|
||||
if self.progress_callback is None:
|
||||
@ -800,7 +905,7 @@ class PamDeployAgent:
|
||||
ip_state["log_file"] = str(values.get("LOG_FILE", ""))
|
||||
|
||||
def _record_ip_failure(self, state: AgentState, ip: str, action: str, reason: str) -> None:
|
||||
"""记录单 IP 失败,并设置待回滚确认状态。"""
|
||||
"""记录单 IP 失败,并保留失败 action 供 resume 重试。"""
|
||||
ip_state = state.ip_states[ip]
|
||||
stop_first = action in ("start-ip", "verify-ip")
|
||||
logger.info(
|
||||
@ -816,20 +921,20 @@ class PamDeployAgent:
|
||||
"status": "FAILED",
|
||||
"failed_stage": action,
|
||||
"failure_reason": reason,
|
||||
"rollback_status": "PENDING_AGENT_CONFIRMATION",
|
||||
"rollback_status": ip_state.get("rollback_status") or "ROLLBACK_NOT_RUN",
|
||||
"rollback_stop_first": stop_first,
|
||||
}
|
||||
)
|
||||
state.last_failed_step = action
|
||||
state.events.append(
|
||||
{
|
||||
"type": "CONFIRMATION_REQUIRED",
|
||||
"stage": "rollback-ip",
|
||||
"ip": ip,
|
||||
"stop_first": stop_first,
|
||||
"message": f"{action} 执行失败,需要确认是否回滚",
|
||||
}
|
||||
)
|
||||
"type": "ACTION_RETRY_REQUIRED",
|
||||
"stage": action,
|
||||
"ip": ip,
|
||||
"stop_first": stop_first,
|
||||
"message": f"{action} 执行失败,流程已暂停;修复后 resume 将从该 action 重试,需回滚时请显式执行 rollback",
|
||||
}
|
||||
)
|
||||
|
||||
def _download_log_best_effort(self, state: AgentState, ip: str) -> None:
|
||||
"""失败后尽力下载日志,日志失败不覆盖原失败原因。"""
|
||||
|
||||
@ -128,6 +128,17 @@ def main() -> None:
|
||||
add_mcp_args(confirm)
|
||||
add_action_analysis_arg(confirm)
|
||||
|
||||
rollback = sub.add_parser("rollback")
|
||||
rollback.add_argument("--checkpoint", required=True)
|
||||
rollback.add_argument("--ip", help="要回滚的 IP;不传时使用当前失败 IP")
|
||||
rollback.add_argument("--stop-first", dest="stop_first", action="store_true", default=None, help="回滚前先停机")
|
||||
rollback.add_argument("--no-stop-first", dest="stop_first", action="store_false", default=None, help="回滚前不先停机")
|
||||
rollback.add_argument("--note", default="")
|
||||
rollback.add_argument("--confirm", action="store_true")
|
||||
add_llm_args(rollback)
|
||||
add_mcp_args(rollback)
|
||||
add_action_analysis_arg(rollback)
|
||||
|
||||
args = parser.parse_args()
|
||||
log_path = configure_logging()
|
||||
logger.info("CLI 启动 command=%s args=%s log_path=%s", args.command, json_for_log(vars(args)), log_path)
|
||||
@ -205,13 +216,23 @@ def main() -> None:
|
||||
logger.info("开始 confirm checkpoint=%s decision=%s note_len=%s", args.checkpoint, args.decision, len(args.note))
|
||||
state = load_agent_state(args.checkpoint)
|
||||
state.checkpoint_path = state.checkpoint_path or args.checkpoint
|
||||
runtime = LangGraphDeploymentRuntime(agent=agent, flow="deploy")
|
||||
first = runtime.start(state)
|
||||
if first.interrupted:
|
||||
result = runtime.resume(approved=args.decision == "approve", note=args.note)
|
||||
print_graph_result(agent, result)
|
||||
return
|
||||
print_graph_result(agent, first)
|
||||
if not state.pending_confirmation:
|
||||
raise SystemExit("当前 checkpoint 没有待确认事项;新流程请使用 resume 重试,或 rollback 显式回滚。")
|
||||
state = agent.confirm_pending(state, approved=args.decision == "approve", operator_note=args.note)
|
||||
print(agent.render_report(state))
|
||||
print_pause_payload(agent, state)
|
||||
return
|
||||
|
||||
if args.command == "rollback":
|
||||
logger.info("开始 rollback checkpoint=%s ip=%s stop_first=%s note_len=%s", args.checkpoint, args.ip, args.stop_first, len(args.note))
|
||||
state = load_agent_state(args.checkpoint)
|
||||
state.checkpoint_path = state.checkpoint_path or args.checkpoint
|
||||
ip = args.ip or _find_current_failed_ip(state)
|
||||
if not ip:
|
||||
raise SystemExit("未找到当前失败 IP,请传入 --ip。")
|
||||
state = agent.rollback_ip(state, ip, stop_first=args.stop_first, operator_note=args.note)
|
||||
print(agent.render_report(state))
|
||||
print_pause_payload(agent, state)
|
||||
return
|
||||
|
||||
logger.info("开始 run-deploy strategy=%s checkpoint=%s target_ips=%s", args.strategy, args.checkpoint, args.target_ip)
|
||||
@ -225,5 +246,16 @@ def main() -> None:
|
||||
print_graph_result(agent, result)
|
||||
|
||||
|
||||
def _find_current_failed_ip(state) -> str:
|
||||
"""从 checkpoint state 中找一个适合显式回滚的失败 IP。"""
|
||||
context_ip = str((state.review_context or {}).get("ip", ""))
|
||||
if context_ip and context_ip in state.ip_states:
|
||||
return context_ip
|
||||
for ip, ip_state in state.ip_states.items():
|
||||
if ip_state.get("status") == "FAILED":
|
||||
return ip
|
||||
return ""
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
||||
@ -13,7 +13,7 @@ def build_langgraph(agent: PamDeployAgent | None = None, flow: GraphFlow = "depl
|
||||
|
||||
输入 state 支持直接传 `params`,图内会先调用 `create_state`;CLI/chat
|
||||
默认使用 `LangGraphDeploymentRuntime`,该 runtime 直接接收 `AgentState`
|
||||
并支持 interrupt/checkpointer。
|
||||
并由业务 checkpoint 支撑断点续跑。
|
||||
"""
|
||||
try:
|
||||
from langgraph.graph import END, START, StateGraph
|
||||
|
||||
@ -41,9 +41,8 @@ COMMAND_HELP = """可用命令:
|
||||
mcp config <路径> 加载 MCP client JSON 配置
|
||||
run 创建部署任务并执行
|
||||
status 查看当前运行状态
|
||||
approve 确认待处理回滚
|
||||
reject [原因] 拒绝待处理回滚
|
||||
resume 从当前 checkpoint 续跑
|
||||
rollback [IP] 显式回滚指定 IP;不传 IP 时回滚当前失败 IP
|
||||
list checkpoints 列出 checkpoint 目录下的 JSON 文件
|
||||
load params <路径> 加载并热更新参数文件
|
||||
load checkpoint <路径> 加载指定 checkpoint
|
||||
@ -162,6 +161,9 @@ class InteractiveCliSession:
|
||||
if normalized == "resume":
|
||||
self._resume()
|
||||
return True
|
||||
if normalized == "rollback":
|
||||
self._rollback(rest.strip())
|
||||
return True
|
||||
if normalized == "status":
|
||||
self._status()
|
||||
return True
|
||||
@ -718,6 +720,61 @@ class InteractiveCliSession:
|
||||
if self.state.pending_confirmation:
|
||||
self._print_confirmation()
|
||||
|
||||
def _rollback(self, text: str) -> None:
|
||||
"""显式执行单 IP 回滚;主 workflow 不再自动触发回滚。"""
|
||||
if self.state is None:
|
||||
checkpoint = Path(self.checkpoint_path)
|
||||
if checkpoint.exists():
|
||||
logger.info("chat rollback 从 checkpoint 加载 path=%s", checkpoint)
|
||||
self.state = load_agent_state(checkpoint)
|
||||
self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint)
|
||||
else:
|
||||
self.output("当前没有可回滚的运行状态。")
|
||||
logger.info("chat rollback 无 state 且 checkpoint 不存在 path=%s", checkpoint)
|
||||
return
|
||||
try:
|
||||
ip, stop_first, note = _parse_rollback_args(text)
|
||||
except ValueError as exc:
|
||||
self.output(f"rollback 参数错误: {exc}")
|
||||
return
|
||||
ip = ip or _find_current_failed_ip(self.state)
|
||||
if not ip:
|
||||
self.output("未找到当前失败 IP,请使用 rollback <IP> 指定。")
|
||||
logger.info("chat rollback 未找到可回滚 IP run_id=%s", self.state.run_id)
|
||||
return
|
||||
logger.info(
|
||||
"chat rollback 开始 run_id=%s ip=%s stop_first=%s note_len=%s",
|
||||
self.state.run_id,
|
||||
ip,
|
||||
stop_first,
|
||||
len(note),
|
||||
)
|
||||
self.graph_runtime = None
|
||||
try:
|
||||
self.state = self.agent.rollback_ip(
|
||||
self.state,
|
||||
ip,
|
||||
stop_first=stop_first,
|
||||
operator_note=note,
|
||||
)
|
||||
except Exception as exc:
|
||||
logger.exception("chat rollback 执行失败 run_id=%s ip=%s", self.state.run_id, ip)
|
||||
self.output(f"rollback 执行失败: {exc}")
|
||||
self._print_pause_context()
|
||||
return
|
||||
self.output(self.agent.render_report(self.state))
|
||||
self._print_pause_context()
|
||||
self.output(f"checkpoint: {self.state.checkpoint_path or self.checkpoint_path}")
|
||||
if not self.state.paused:
|
||||
self.output("回滚已完成;如需继续主流程,输入 resume。")
|
||||
logger.info(
|
||||
"chat rollback 完成 run_id=%s ip=%s status=%s paused=%s",
|
||||
self.state.run_id,
|
||||
ip,
|
||||
self.state.ip_states.get(ip, {}).get("rollback_status"),
|
||||
self.state.paused,
|
||||
)
|
||||
|
||||
def _sync_params_to_state(self) -> None:
|
||||
"""若当前已有 state,则把热更新参数同步到 checkpoint/config。"""
|
||||
if self.state is None:
|
||||
@ -764,6 +821,12 @@ class InteractiveCliSession:
|
||||
self.output("输入 resume 可从当前 checkpoint 继续。")
|
||||
elif reason == "llm_review_blocked":
|
||||
self.output("请根据以上建议判断后续;如需继续,输入 resume。")
|
||||
elif reason == "action_failed":
|
||||
ip = context.get("ip")
|
||||
rollback_hint = f"rollback {ip}" if ip else "rollback <IP>"
|
||||
self.output(f"请修复失败原因后输入 resume 重试当前 action;如需回滚,输入 {rollback_hint}。")
|
||||
elif reason == "rollback_failed":
|
||||
self.output("请检查回滚失败原因;修复后可再次输入 rollback 重试,或人工处理后再 resume。")
|
||||
|
||||
def _on_progress(self, payload: dict[str, Any]) -> None:
|
||||
"""把 Agent action 进度转成 chat 可见输出。"""
|
||||
@ -895,6 +958,46 @@ def _parse_key_values(parts: list[str]) -> dict[str, str]:
|
||||
return values
|
||||
|
||||
|
||||
def _parse_rollback_args(text: str) -> tuple[str, bool | None, str]:
|
||||
"""解析 chat rollback 命令参数,返回 IP、停机覆盖值和备注。"""
|
||||
try:
|
||||
parts = shlex.split(text)
|
||||
except ValueError as exc:
|
||||
raise ValueError(str(exc)) from exc
|
||||
ip = ""
|
||||
stop_first: bool | None = None
|
||||
note_parts: list[str] = []
|
||||
index = 0
|
||||
while index < len(parts):
|
||||
part = parts[index]
|
||||
if part == "--stop-first":
|
||||
stop_first = True
|
||||
elif part == "--no-stop-first":
|
||||
stop_first = False
|
||||
elif part in ("--note", "-n"):
|
||||
index += 1
|
||||
if index >= len(parts):
|
||||
raise ValueError("--note 需要提供备注内容")
|
||||
note_parts.append(parts[index])
|
||||
elif not ip:
|
||||
ip = part
|
||||
else:
|
||||
note_parts.append(part)
|
||||
index += 1
|
||||
return ip, stop_first, " ".join(note_parts)
|
||||
|
||||
|
||||
def _find_current_failed_ip(state: AgentState) -> str:
|
||||
"""从当前 state 中找一个适合显式回滚的失败 IP。"""
|
||||
context_ip = str((state.review_context or {}).get("ip", ""))
|
||||
if context_ip and context_ip in state.ip_states:
|
||||
return context_ip
|
||||
for ip, ip_state in state.ip_states.items():
|
||||
if ip_state.get("status") == "FAILED":
|
||||
return ip
|
||||
return ""
|
||||
|
||||
|
||||
def _is_small_talk(text: str) -> bool:
|
||||
"""识别不应触发 LLM/结构化分析的简单寒暄。"""
|
||||
normalized = text.strip().lower()
|
||||
@ -979,9 +1082,9 @@ def _build_prompt_input(input_func: InputFunc) -> InputFunc:
|
||||
"mcp config",
|
||||
"run",
|
||||
"status",
|
||||
"approve",
|
||||
"reject",
|
||||
"resume",
|
||||
"rollback",
|
||||
"rollback --stop-first",
|
||||
"list checkpoints",
|
||||
"load params",
|
||||
"load checkpoint",
|
||||
|
||||
@ -1,4 +1,4 @@
|
||||
"""chat 人工确认点的 LangGraph interrupt 运行器。"""
|
||||
"""PAM 部署 Agent 的 action 级 LangGraph 运行器。"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
@ -27,7 +27,7 @@ class LangGraphRunResult:
|
||||
|
||||
|
||||
class LangGraphDeploymentRuntime:
|
||||
"""用 LangGraph 节点调度部署 action,并托管人工确认 interrupt。"""
|
||||
"""用 LangGraph 节点调度部署 action。"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
@ -51,11 +51,11 @@ class LangGraphDeploymentRuntime:
|
||||
|
||||
@property
|
||||
def waiting_confirmation(self) -> bool:
|
||||
"""返回当前 LangGraph 会话是否停在 interrupt 确认点。"""
|
||||
"""返回当前 LangGraph 会话是否停在旧版 interrupt 确认点。"""
|
||||
return self._waiting_confirmation
|
||||
|
||||
def start(self, state: AgentState) -> LangGraphRunResult:
|
||||
"""从给定 AgentState 开始执行,直到结束或遇到人工确认点。"""
|
||||
"""从给定 AgentState 开始执行,直到结束或业务状态暂停。"""
|
||||
self._waiting_confirmation = False
|
||||
logger.info(
|
||||
"LangGraph start run_id=%s thread_id=%s flow=%s paused=%s pending=%s",
|
||||
@ -68,7 +68,7 @@ class LangGraphDeploymentRuntime:
|
||||
return self._consume(self._graph.stream({"agent_state": state}, self._config()))
|
||||
|
||||
def resume(self, *, approved: bool, note: str = "") -> LangGraphRunResult:
|
||||
"""把人工确认结果交回 LangGraph,并继续执行。"""
|
||||
"""兼容旧版 LangGraph interrupt 确认恢复;新流程通常不使用。"""
|
||||
try:
|
||||
from langgraph.types import Command
|
||||
except ImportError as exc: # pragma: no cover - 依赖缺失时由调用方降级
|
||||
@ -83,7 +83,7 @@ class LangGraphDeploymentRuntime:
|
||||
return {"configurable": {"thread_id": self.thread_id}}
|
||||
|
||||
def _consume(self, chunks: Any) -> LangGraphRunResult:
|
||||
"""消费 LangGraph stream 输出,提取状态、报告和 interrupt 请求。"""
|
||||
"""消费 LangGraph stream 输出,提取状态、报告和旧版 interrupt 请求。"""
|
||||
result = LangGraphRunResult()
|
||||
for chunk in chunks:
|
||||
result.chunks.append(chunk)
|
||||
@ -120,7 +120,6 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
try:
|
||||
from langgraph.checkpoint.memory import InMemorySaver
|
||||
from langgraph.graph import END, START, StateGraph
|
||||
from langgraph.types import interrupt
|
||||
except ImportError as exc: # pragma: no cover - 依赖缺失时由调用方降级
|
||||
raise RuntimeError("未安装 langgraph,无法启用部署图。") from exc
|
||||
|
||||
@ -160,21 +159,6 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
agent.run_ip_action(agent_state, ip, action)
|
||||
return {"agent_state": agent_state, "current_ip": "", "current_ip_action": ""}
|
||||
|
||||
def confirm_node(state: dict[str, Any]) -> dict[str, Any]:
|
||||
"""把确认请求交给 LangGraph interrupt,并在恢复后执行确认动作。"""
|
||||
agent_state = state["agent_state"]
|
||||
request = agent.build_confirmation_request(agent_state)
|
||||
logger.info("LangGraph confirm_node interrupt run_id=%s request=%s", agent_state.run_id, json_for_log(request))
|
||||
decision = interrupt(request)
|
||||
approved, note = _parse_confirmation_decision(decision)
|
||||
logger.info("LangGraph confirm_node resume run_id=%s approved=%s note_len=%s", agent_state.run_id, approved, len(note))
|
||||
agent_state = agent.confirm_pending(
|
||||
agent_state,
|
||||
approved=approved,
|
||||
operator_note=note,
|
||||
)
|
||||
return {"agent_state": agent_state}
|
||||
|
||||
def report_node(state: dict[str, Any]) -> dict[str, Any]:
|
||||
"""渲染当前状态报告。"""
|
||||
agent_state = state["agent_state"]
|
||||
@ -185,11 +169,11 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
}
|
||||
|
||||
def route_entry(state: dict[str, Any]) -> str:
|
||||
"""从入口决定进入全局、IP、确认或报告节点。"""
|
||||
"""从入口决定进入全局、IP 或报告节点。"""
|
||||
agent_state = state["agent_state"]
|
||||
if agent_state.pending_confirmation:
|
||||
logger.info("LangGraph route_entry -> confirm run_id=%s", agent_state.run_id)
|
||||
return "confirm"
|
||||
logger.info("LangGraph route_entry -> report legacy_pending run_id=%s", agent_state.run_id)
|
||||
return "report"
|
||||
if agent.next_global_action(agent_state):
|
||||
logger.info("LangGraph route_entry -> global_action run_id=%s", agent_state.run_id)
|
||||
return "global_action"
|
||||
@ -212,11 +196,11 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
return "prepare_ip"
|
||||
|
||||
def route_after_prepare_ip(state: dict[str, Any]) -> str:
|
||||
"""IP 准备节点后进入确认、单 IP action 或报告。"""
|
||||
"""IP 准备节点后进入单 IP action 或报告。"""
|
||||
agent_state = state["agent_state"]
|
||||
if agent_state.pending_confirmation:
|
||||
logger.info("LangGraph route_after_prepare_ip -> confirm run_id=%s", agent_state.run_id)
|
||||
return "confirm"
|
||||
logger.info("LangGraph route_after_prepare_ip -> report legacy_pending run_id=%s", agent_state.run_id)
|
||||
return "report"
|
||||
if state.get("current_ip_action"):
|
||||
logger.info("LangGraph route_after_prepare_ip -> ip_action run_id=%s ip=%s action=%s", agent_state.run_id, state.get("current_ip"), state.get("current_ip_action"))
|
||||
return "ip_action"
|
||||
@ -228,7 +212,6 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
graph.add_node("global_action", global_action_node)
|
||||
graph.add_node("prepare_ip", prepare_ip_node)
|
||||
graph.add_node("ip_action", ip_action_node)
|
||||
graph.add_node("confirm", confirm_node)
|
||||
graph.add_node("report", report_node)
|
||||
|
||||
graph.add_edge(START, "entry")
|
||||
@ -236,7 +219,6 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
"entry",
|
||||
route_entry,
|
||||
{
|
||||
"confirm": "confirm",
|
||||
"global_action": "global_action",
|
||||
"prepare_ip": "prepare_ip",
|
||||
"report": "report",
|
||||
@ -254,10 +236,9 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
|
||||
graph.add_conditional_edges(
|
||||
"prepare_ip",
|
||||
route_after_prepare_ip,
|
||||
{"confirm": "confirm", "ip_action": "ip_action", "report": "report"},
|
||||
{"ip_action": "ip_action", "report": "report"},
|
||||
)
|
||||
graph.add_edge("ip_action", "prepare_ip")
|
||||
graph.add_edge("confirm", "entry")
|
||||
graph.add_edge("report", END)
|
||||
compiled = graph.compile(checkpointer=InMemorySaver())
|
||||
logger.info("LangGraph 部署图构建完成 flow=%s", flow)
|
||||
|
||||
@ -7,7 +7,7 @@ SYSTEM_PROMPT = """你是 PAM 智能部署 Agent 的结构化理解与规划组
|
||||
- 不生成 shell、PowerShell、bat、curl 等可执行命令。
|
||||
- 不回显密钥、token、CLIENT_SECRET、Authorization 等敏感值。
|
||||
- 只能在允许的 action 集合中选择部署动作。
|
||||
- 真实执行前必须保留人工确认点:参数确认、目标 IP 范围确认、失败回滚确认。
|
||||
- 真实执行前必须保留人工确认点:参数确认、目标 IP 范围确认;失败后应暂停,修复后 resume 重试,回滚只能由用户显式触发。
|
||||
"""
|
||||
|
||||
INTENT_PROMPT = """根据用户输入识别意图和执行偏好。
|
||||
@ -82,7 +82,7 @@ ACTION_ANALYSIS_PROMPT = """分析一次 PAM action 执行结果。
|
||||
|
||||
要求:
|
||||
- 必须明确给出 `should_continue`:没有问题时为 true;存在需要人工判断的问题时为 false。
|
||||
- 如果 exit_code 非 0、ok=false、verify-ip SUCCESS=false、出现 pending_confirmation,应标记异常。
|
||||
- 如果 exit_code 非 0、ok=false、verify-ip SUCCESS=false、出现 legacy pending_confirmation,应标记异常。
|
||||
- 主要依据结构化字段 `ok`、`exit_code`、`values`、`error_summary` 判断;只有输入里存在 `diagnostic_log` 时,才把它当作异常诊断上下文。
|
||||
- 脚本正常过程日志不会作为错误依据,不能因为日志来自 stderr 就判定异常。
|
||||
- 不要输出密钥、token、Authorization 或完整日志原文。
|
||||
|
||||
@ -14,7 +14,7 @@
|
||||
|
||||
要求:
|
||||
- 必须明确给出 `should_continue`:没有问题时为 true;存在需要人工判断的问题时为 false。
|
||||
- 如果 exit_code 非 0、ok=false、verify-ip SUCCESS=false、出现 pending_confirmation,应标记异常。
|
||||
- 如果 exit_code 非 0、ok=false、verify-ip SUCCESS=false、出现旧版 pending_confirmation,应标记异常。
|
||||
- 主要依据结构化字段 `ok`、`exit_code`、`values`、`error_summary` 判断;只有输入里存在 `diagnostic_log` 时,才把它当作异常诊断上下文。
|
||||
- 脚本正常过程日志不会作为错误依据,不能因为日志来自 stderr 就判定异常。
|
||||
- 不要输出密钥、token、Authorization 或完整日志原文。
|
||||
|
||||
@ -110,11 +110,46 @@ def test_run_deploy_flow_stops_on_verify_failure(tmp_path: Path):
|
||||
|
||||
agent.run_deploy_flow(state)
|
||||
|
||||
assert state.pending_confirmation == "rollback-ip:192.168.1.10"
|
||||
assert state.pending_confirmation == ""
|
||||
assert state.paused is True
|
||||
assert state.pause_reason == "action_failed"
|
||||
assert state.ip_states["192.168.1.10"]["status"] == "FAILED"
|
||||
assert state.ip_states["192.168.1.10"]["rollback_status"] == "PENDING_AGENT_CONFIRMATION"
|
||||
assert state.ip_states["192.168.1.10"]["failed_stage"] == "verify-ip"
|
||||
assert state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_NOT_RUN"
|
||||
assert "192.168.1.11" not in state.ip_states
|
||||
assert any(event["type"] == "CONFIRMATION_REQUIRED" for event in state.events)
|
||||
assert any(event["type"] == "ACTION_RETRY_REQUIRED" for event in state.events)
|
||||
|
||||
|
||||
def test_resume_retries_failed_ip_action_without_rollback(tmp_path: Path):
|
||||
fake = FakeActionRunner(
|
||||
{
|
||||
"verify-ip:192.168.1.10": {
|
||||
"ACTION": "verify-ip",
|
||||
"IP": "192.168.1.10",
|
||||
"SUCCESS": "false",
|
||||
"MESSAGE": "health check failed",
|
||||
}
|
||||
}
|
||||
)
|
||||
agent = PamDeployAgent(fake_runner=fake)
|
||||
state = agent.create_state(
|
||||
params=PARAMS,
|
||||
execution_strategy="fake",
|
||||
config_path=str(tmp_path / "config.txt"),
|
||||
)
|
||||
|
||||
agent.run_deploy_flow(state)
|
||||
fake.fixtures = {}
|
||||
agent.resume_state(state)
|
||||
agent.run_deploy_flow(state)
|
||||
|
||||
assert state.pending_confirmation == ""
|
||||
assert state.paused is False
|
||||
assert state.last_failed_step == ""
|
||||
assert state.ip_states["192.168.1.10"]["status"] == "SUCCESS"
|
||||
assert state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_NOT_RUN"
|
||||
assert state.ip_states["192.168.1.11"]["status"] == "SUCCESS"
|
||||
assert not any(call[0] == "rollback-ip" for call in fake.calls)
|
||||
|
||||
|
||||
def test_action_analysis_event_is_recorded_when_enabled(tmp_path: Path):
|
||||
@ -187,7 +222,7 @@ def test_action_review_failure_pauses_flow(tmp_path: Path):
|
||||
assert any(event["type"] == "ACTION_ANALYSIS_FAIL" for event in state.events)
|
||||
|
||||
|
||||
def test_confirm_pending_rollback_runs_rollback_and_resume_continues(tmp_path: Path):
|
||||
def test_explicit_rollback_runs_rollback_and_resume_continues(tmp_path: Path):
|
||||
fake = FakeActionRunner(
|
||||
{
|
||||
"verify-ip:192.168.1.10": {
|
||||
@ -206,18 +241,16 @@ def test_confirm_pending_rollback_runs_rollback_and_resume_continues(tmp_path: P
|
||||
)
|
||||
|
||||
agent.run_deploy_flow(state)
|
||||
request = agent.build_confirmation_request(state)
|
||||
agent.confirm_pending(state, approved=True)
|
||||
agent.rollback_ip(state, "192.168.1.10")
|
||||
agent.run_deploy_flow(state)
|
||||
|
||||
assert request["type"] == "rollback-ip"
|
||||
assert state.pending_confirmation == ""
|
||||
assert state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_DONE"
|
||||
assert state.ip_states["192.168.1.11"]["status"] == "SUCCESS"
|
||||
assert any(call[0] == "rollback-ip" for call in fake.calls)
|
||||
|
||||
|
||||
def test_failed_rollback_keeps_confirmation_pending(tmp_path: Path):
|
||||
def test_failed_explicit_rollback_pauses_without_confirmation(tmp_path: Path):
|
||||
fake = FakeActionRunner(
|
||||
{
|
||||
"verify-ip:192.168.1.10": {
|
||||
@ -242,9 +275,11 @@ def test_failed_rollback_keeps_confirmation_pending(tmp_path: Path):
|
||||
)
|
||||
|
||||
agent.run_deploy_flow(state)
|
||||
agent.confirm_pending(state, approved=True)
|
||||
agent.rollback_ip(state, "192.168.1.10")
|
||||
|
||||
assert state.pending_confirmation == "rollback-ip:192.168.1.10"
|
||||
assert state.pending_confirmation == ""
|
||||
assert state.paused is True
|
||||
assert state.pause_reason == "rollback_failed"
|
||||
assert state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_FAILED"
|
||||
|
||||
|
||||
|
||||
@ -59,6 +59,26 @@ class FakeTestableLlmClient:
|
||||
return LlmActionAnalysis(action=action)
|
||||
|
||||
|
||||
class FlakyVerifyRunner(FakeActionRunner):
|
||||
"""第一次 verify-ip 失败,后续恢复成功,用于覆盖断点重试。"""
|
||||
|
||||
def __init__(self) -> None:
|
||||
super().__init__()
|
||||
self.verify_calls = 0
|
||||
|
||||
def _fixture_for(self, action, kwargs):
|
||||
if action == "verify-ip" and kwargs.get("ip") == "192.168.1.10":
|
||||
self.verify_calls += 1
|
||||
if self.verify_calls == 1:
|
||||
return {
|
||||
"ACTION": "verify-ip",
|
||||
"IP": "192.168.1.10",
|
||||
"SUCCESS": "false",
|
||||
"MESSAGE": "health check failed",
|
||||
}
|
||||
return super()._fixture_for(action, kwargs)
|
||||
|
||||
|
||||
def run_session(session: InteractiveCliSession, inputs: list[str]) -> list[str]:
|
||||
output: list[str] = []
|
||||
iterator = iter(inputs)
|
||||
@ -165,7 +185,28 @@ def test_chat_action_failure_does_not_report_langgraph_unavailable(tmp_path: Pat
|
||||
assert not any("LangGraph 确认运行器不可用" in item for item in output)
|
||||
|
||||
|
||||
def test_chat_approve_then_resume_continues_after_failed_ip(tmp_path: Path):
|
||||
def test_chat_resume_retries_failed_ip_without_rollback(tmp_path: Path):
|
||||
fake = FlakyVerifyRunner()
|
||||
session = InteractiveCliSession(
|
||||
agent=PamDeployAgent(fake_runner=fake),
|
||||
params=PARAMS,
|
||||
strategy="fake",
|
||||
checkpoint_path=str(tmp_path / "checkpoint.json"),
|
||||
)
|
||||
|
||||
output = run_session(session, ["run", "yes", "yes", "yes", "resume", "exit"])
|
||||
|
||||
assert session.state is not None
|
||||
assert session.state.pending_confirmation == ""
|
||||
assert session.state.paused is False
|
||||
assert session.state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_NOT_RUN"
|
||||
assert session.state.ip_states["192.168.1.10"]["status"] == "SUCCESS"
|
||||
assert session.state.ip_states["192.168.1.11"]["status"] == "SUCCESS"
|
||||
assert not any(call[0] == "rollback-ip" for call in fake.calls)
|
||||
assert any("如需回滚,输入 rollback 192.168.1.10" in item for item in output)
|
||||
|
||||
|
||||
def test_chat_explicit_rollback_command_rolls_back_failed_ip(tmp_path: Path):
|
||||
fake = FakeActionRunner(
|
||||
{
|
||||
"verify-ip:192.168.1.10": {
|
||||
@ -183,12 +224,14 @@ def test_chat_approve_then_resume_continues_after_failed_ip(tmp_path: Path):
|
||||
checkpoint_path=str(tmp_path / "checkpoint.json"),
|
||||
)
|
||||
|
||||
run_session(session, ["run", "yes", "yes", "yes", "approve", "resume", "exit"])
|
||||
output = run_session(session, ["run", "yes", "yes", "yes", "rollback", "resume", "exit"])
|
||||
|
||||
assert session.state is not None
|
||||
assert session.state.pending_confirmation == ""
|
||||
assert session.state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_DONE"
|
||||
assert session.state.ip_states["192.168.1.11"]["status"] == "SUCCESS"
|
||||
assert any(call[0] == "rollback-ip" for call in fake.calls)
|
||||
assert any("回滚已完成;如需继续主流程,输入 resume" in item for item in output)
|
||||
|
||||
|
||||
def test_chat_params_events_and_checkpoint_commands(tmp_path: Path):
|
||||
|
||||
@ -17,7 +17,7 @@ PARAMS = {
|
||||
}
|
||||
|
||||
|
||||
def test_langgraph_runtime_interrupts_and_resumes_confirmation(tmp_path: Path):
|
||||
def test_langgraph_runtime_pauses_failure_and_resume_retries(tmp_path: Path):
|
||||
fake = FakeActionRunner(
|
||||
{
|
||||
"verify-ip:192.168.1.10": {
|
||||
@ -39,16 +39,22 @@ def test_langgraph_runtime_interrupts_and_resumes_confirmation(tmp_path: Path):
|
||||
|
||||
first = runtime.start(state)
|
||||
|
||||
assert first.interrupted is True
|
||||
assert runtime.waiting_confirmation is True
|
||||
assert first.confirmation["type"] == "rollback-ip"
|
||||
assert first.confirmation["ip"] == "192.168.1.10"
|
||||
assert first.interrupted is False
|
||||
assert runtime.waiting_confirmation is False
|
||||
assert first.confirmation == {}
|
||||
assert first.state is not None
|
||||
assert first.state.paused is True
|
||||
assert first.state.pending_confirmation == ""
|
||||
assert first.state.ip_states["192.168.1.10"]["failed_stage"] == "verify-ip"
|
||||
|
||||
second = runtime.resume(approved=True)
|
||||
fake.fixtures = {}
|
||||
agent.resume_state(first.state)
|
||||
second = runtime.start(first.state)
|
||||
|
||||
assert second.interrupted is False
|
||||
assert runtime.waiting_confirmation is False
|
||||
assert second.state is not None
|
||||
assert second.state.pending_confirmation == ""
|
||||
assert second.state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_DONE"
|
||||
assert second.state.paused is False
|
||||
assert second.state.ip_states["192.168.1.10"]["rollback_status"] == "ROLLBACK_NOT_RUN"
|
||||
assert second.state.ip_states["192.168.1.11"]["status"] == "SUCCESS"
|
||||
|
||||
Loading…
x
Reference in New Issue
Block a user