feat: 补充 Agent 运行日志并增加 LLM 测试命令

- 新增统一日志工具,支持日志文件路径和级别配置
- 记录 CLI/chat、Agent、LLM、action、MCP、LangGraph、checkpoint 等关键流程
- 对日志中的 token、secret、api_key、Authorization 等敏感信息做脱敏
- chat 新增 llm test 命令,用于验证当前 LLM client 是否正常加载
- 同步 README、打包文档和 run.sh 帮助说明
- 补充日志脱敏和 llm test 相关测试
This commit is contained in:
dark 2026-06-04 10:51:59 +08:00
parent 8d390aa416
commit d3f5c82d98
20 changed files with 1066 additions and 49 deletions

View File

@ -88,7 +88,9 @@ packaging/
- chat 支持执行中按 `Ctrl+C` 中断,保存 checkpoint 后再 `resume` - chat 支持执行中按 `Ctrl+C` 中断,保存 checkpoint 后再 `resume`
- chat 支持 `set KEY=VALUE``load params <路径>` 热更新当前运行参数,并同步回写运行中的 `config.txt` 与 checkpoint。 - chat 支持 `set KEY=VALUE``load params <路径>` 热更新当前运行参数,并同步回写运行中的 `config.txt` 与 checkpoint。
- 支持通过 `--llm-action-analysis-prompt-file``PAM_LLM_ACTION_ANALYSIS_PROMPT_FILE` 或 chat 内 `llm config action_analysis_prompt_file=...` 自定义 action 审核提示词。 - 支持通过 `--llm-action-analysis-prompt-file``PAM_LLM_ACTION_ANALYSIS_PROMPT_FILE` 或 chat 内 `llm config action_analysis_prompt_file=...` 自定义 action 审核提示词。
- 添加基础测试,当前本地结果为 `57 passed, 2 skipped` - 增加统一运行日志,默认写入 `logs/pam_deploy_agent.log`,覆盖 CLI/chat、LLM 调用、action 路由、脚本/MCP 调用、LangGraph、checkpoint 等关键流程。
- chat 支持 `llm test [文本]`,可用当前 LLM client 做一次轻量调用,确认真实 LLM 或规则 fallback 是否正常加载。
- 添加基础测试,当前本地结果为 `59 passed, 2 skipped`
未完成: 未完成:
@ -133,6 +135,12 @@ python -m pam_deploy_graph.cli analyze \
真实 LLM 调用位置在 `pam_deploy_graph/llm/openai_compatible.py`,提示词在 `pam_deploy_graph/llm/prompts.py`。发送给 LLM 的 `base_params` 会脱敏,`CLIENT_SECRET` 不会进入 prompt本地生成计划后仍会执行 guardrails 校验。 真实 LLM 调用位置在 `pam_deploy_graph/llm/openai_compatible.py`,提示词在 `pam_deploy_graph/llm/prompts.py`。发送给 LLM 的 `base_params` 会脱敏,`CLIENT_SECRET` 不会进入 prompt本地生成计划后仍会执行 guardrails 校验。
chat 内可以用当前 client 做一次轻量测试,确认真实 LLM 或规则 fallback 是否正常加载:
```text
PAM> llm test 请返回一次连通性测试结果
```
如果服务需要鉴权,再补充: 如果服务需要鉴权,再补充:
```bash ```bash
@ -280,6 +288,7 @@ PAM> run
PAM> status PAM> status
PAM> params PAM> params
PAM> events 5 PAM> events 5
PAM> llm test
PAM> llm action-analysis on PAM> llm action-analysis on
PAM> llm config action_analysis_prompt_file=prompts/action_review.txt PAM> llm config action_analysis_prompt_file=prompts/action_review.txt
PAM> mcp config mcp_client.example.json PAM> mcp config mcp_client.example.json
@ -290,7 +299,20 @@ PAM> resume
PAM> exit PAM> exit
``` ```
`chat` 默认仍要求在会话内显式输入 `run`,并确认参数、目标 IP 范围和最终执行后才会执行 action。输入 `你好``hello` 这类问候不会触发 LLM/结构化分析;需要分析部署需求时可直接描述部署任务,或显式使用 `analyze <需求>`。每个 action 完成后都会自动进入一次 LLM/规则审核,并播报审核开始/结束;如果审核建议停止或审核本身失败,流程会暂停并输出建议,等待用户决定是否 `resume``--analyze-actions` 仅控制详细审核结果是否写入 `events`。执行中可按 `Ctrl+C` 中断chat 会保存当前 checkpoint 并把流程标记为 `user_interrupted``set KEY=VALUE``load params <路径>` 会把更新同步到当前运行 state、`config.txt` 和 checkpoint。`chat` 也支持 `--llm-base-url` / `--llm-api-key` / `--llm-model` / `--llm-action-analysis-prompt-file``--mcp-config``--analyze-actions` `chat` 默认仍要求在会话内显式输入 `run`,并确认参数、目标 IP 范围和最终执行后才会执行 action。输入 `你好``hello` 这类问候不会触发 LLM/结构化分析;需要分析部署需求时可直接描述部署任务,或显式使用 `analyze <需求>`。每个 action 完成后都会自动进入一次 LLM/规则审核,并播报审核开始/结束;如果审核建议停止或审核本身失败,流程会暂停并输出建议,等待用户决定是否 `resume``llm test [文本]` 可测试当前 LLM client 是否可用。`--analyze-actions` 仅控制详细审核结果是否写入 `events`。执行中可按 `Ctrl+C` 中断chat 会保存当前 checkpoint 并把流程标记为 `user_interrupted``set KEY=VALUE``load params <路径>` 会把更新同步到当前运行 state、`config.txt` 和 checkpoint。`chat` 也支持 `--llm-base-url` / `--llm-api-key` / `--llm-model` / `--llm-action-analysis-prompt-file``--mcp-config``--analyze-actions`
## 日志
Agent 默认写入运行日志到 `logs/pam_deploy_agent.log`。日志覆盖 CLI/chat 输入、LLM 请求和响应摘要、action 路由、脚本/MCP 调用、LangGraph 节点、checkpoint 保存、暂停/续跑等关键流程。日志会递归脱敏 `CLIENT_SECRET``MCP_CLIENT_SECRET`、token、Authorization、api_key、password 等字段,并截断长文本。
可通过环境变量调整日志位置和级别:
```bash
export PAM_AGENT_LOG_FILE=logs/pam_deploy_agent.log
export PAM_AGENT_LOG_LEVEL=INFO
```
调试 LLM 或 MCP 调用时可临时设为 `DEBUG`,但仍建议把日志目录放在受控位置。
预演: 预演:

View File

@ -75,6 +75,9 @@ cd pam-deploy-agent-linux-x86_64
- chat 支持执行中 `Ctrl+C` 中断后保存 checkpoint再通过 `resume` 继续。 - chat 支持执行中 `Ctrl+C` 中断后保存 checkpoint再通过 `resume` 继续。
- chat 支持 `set KEY=VALUE``load params <路径>` 热更新当前运行任务参数。 - chat 支持 `set KEY=VALUE``load params <路径>` 热更新当前运行任务参数。
- 支持通过 `--llm-action-analysis-prompt-file` 或 chat 内 `llm config action_analysis_prompt_file=...` 自定义 action 审核提示词。 - 支持通过 `--llm-action-analysis-prompt-file` 或 chat 内 `llm config action_analysis_prompt_file=...` 自定义 action 审核提示词。
- chat 支持 `llm test [文本]` 测试当前 LLM client 是否正常加载。
- 默认运行日志写入 `logs/pam_deploy_agent.log`,可通过 `PAM_AGENT_LOG_FILE``PAM_AGENT_LOG_LEVEL` 调整。
- 日志会脱敏 token、secret、api_key、Authorization 等字段checkpoint 仍保存完整运行参数,请放在受控目录。
## 包大小评估 ## 包大小评估

View File

@ -71,6 +71,7 @@ PAM> run
PAM> status PAM> status
PAM> params PAM> params
PAM> events 5 PAM> events 5
PAM> llm test
PAM> llm action-analysis on PAM> llm action-analysis on
PAM> llm config action_analysis_prompt_file=prompts/action_review.txt PAM> llm config action_analysis_prompt_file=prompts/action_review.txt
PAM> mcp config mcp_client.example.json PAM> mcp config mcp_client.example.json
@ -176,10 +177,26 @@ chat 内也可以热加载 LLM
```text ```text
PAM> llm config base_url=https://your-llm.example.com/v1 api_key=your-api-key model=your-model-name PAM> llm config base_url=https://your-llm.example.com/v1 api_key=your-api-key model=your-model-name
PAM> llm config action_analysis_prompt_file=prompts/action_review.txt PAM> llm config action_analysis_prompt_file=prompts/action_review.txt
PAM> llm test 请返回一次连通性测试结果
PAM> llm action-analysis on PAM> llm action-analysis on
PAM> llm fallback PAM> llm fallback
``` ```
`llm test [文本]` 会使用当前 LLM client 做一次轻量意图识别调用,并输出 client 类型、intent、strategy 和 confidence便于确认真实 LLM 或规则 fallback 是否正常加载。
## 日志
Agent 默认写入运行日志到 `logs/pam_deploy_agent.log`。日志覆盖 chat/CLI 输入、LLM 请求和响应摘要、action 路由、脚本/MCP 调用、LangGraph 节点、checkpoint 保存、暂停/续跑等关键流程。
可通过环境变量调整日志位置和级别:
```bash
export PAM_AGENT_LOG_FILE=logs/pam_deploy_agent.log
export PAM_AGENT_LOG_LEVEL=INFO
```
日志会递归脱敏 `CLIENT_SECRET``MCP_CLIENT_SECRET`、token、Authorization、api_key、password 等字段并截断长文本。checkpoint 仍会保存完整运行参数,请放在受控目录。
## 策略说明 ## 策略说明
- `fake`:全部使用 fake runner不访问真实环境。 - `fake`:全部使用 fake runner不访问真实环境。
@ -221,6 +238,7 @@ MCP token 获取方式与 HOME 一致,默认按 `client_credentials` POST 到
- 执行真实 action 前请确认配置文件中的 `HOME_BASE_URL``CLIENT_ID``CLIENT_SECRET``AIRPORT_CODE``APP_NAME``MODULE_NAME``VERSION_NUMBER``ZIP_FILE_PATH` - 执行真实 action 前请确认配置文件中的 `HOME_BASE_URL``CLIENT_ID``CLIENT_SECRET``AIRPORT_CODE``APP_NAME``MODULE_NAME``VERSION_NUMBER``ZIP_FILE_PATH`
- `chat` 中输入 `你好``hello` 这类问候不会触发 LLM/结构化分析;需要分析部署需求时请直接描述部署任务,或显式使用 `analyze <需求>` - `chat` 中输入 `你好``hello` 这类问候不会触发 LLM/结构化分析;需要分析部署需求时请直接描述部署任务,或显式使用 `analyze <需求>`
- 每个 action 完成后都会自动执行一次 LLM/规则审核;`--analyze-actions``llm action-analysis on` 只控制是否把详细审核结果写入 `events` - 每个 action 完成后都会自动执行一次 LLM/规则审核;`--analyze-actions``llm action-analysis on` 只控制是否把详细审核结果写入 `events`
- `llm test [文本]` 可测试当前 LLM client 是否可用。
- 如果审核建议停止、审核本身失败,或用户在执行中按下 `Ctrl+C`,流程都会保存 checkpoint 并进入暂停状态;后续可使用 `resume` 继续。 - 如果审核建议停止、审核本身失败,或用户在执行中按下 `Ctrl+C`,流程都会保存 checkpoint 并进入暂停状态;后续可使用 `resume` 继续。
- `set KEY=VALUE``load params <路径>` 会热更新当前运行任务的参数,并回写运行中的 `config.txt` 和 checkpoint。 - `set KEY=VALUE``load params <路径>` 会热更新当前运行任务的参数,并回写运行中的 `config.txt` 和 checkpoint。
- `checkpoint` 会保存完整运行参数,请放在受控目录。 - `checkpoint` 会保存完整运行参数,请放在受控目录。

View File

@ -127,8 +127,8 @@ PAM 部署 Agent 解压即用包
chat 模式会在会话中要求输入 run并分别确认参数、目标范围和最终执行。 chat 模式会在会话中要求输入 run并分别确认参数、目标范围和最终执行。
--analyze-actions --analyze-actions
每个 action 完成后追加 LLM/规则诊断建议。诊断只作为辅助建议, 每个 action 完成后的 LLM/规则审核默认都会执行;该参数只控制
不会自动决定继续、回滚或修改参数 是否把详细审核结果写入 events。审核建议停止时流程会暂停
LLM 参数: LLM 参数:
--llm-base-url <URL> --llm-base-url <URL>
@ -140,10 +140,22 @@ LLM 参数:
--llm-model <模型名> --llm-model <模型名>
LLM 模型名称。也可通过环境变量 PAM_LLM_MODEL 提供。 LLM 模型名称。也可通过环境变量 PAM_LLM_MODEL 提供。
--llm-action-analysis-prompt-file <路径>
自定义 action 审核提示词文件。打包内置基线:
prompts/action_review.txt
LLM 环境变量: LLM 环境变量:
PAM_LLM_BASE_URL PAM_LLM_BASE_URL
PAM_LLM_API_KEY PAM_LLM_API_KEY
PAM_LLM_MODEL PAM_LLM_MODEL
PAM_LLM_ACTION_ANALYSIS_PROMPT_FILE
日志环境变量:
PAM_AGENT_LOG_FILE
运行日志路径,默认 logs/pam_deploy_agent.log。
PAM_AGENT_LOG_LEVEL
日志级别,默认 INFO。排查 LLM/MCP 时可临时设为 DEBUG。
示例: 示例:
./run.sh chat --config doc_scripts/config.txt.example --strategy fake --checkpoint runtime/checkpoints/demo.json ./run.sh chat --config doc_scripts/config.txt.example --strategy fake --checkpoint runtime/checkpoints/demo.json
@ -170,8 +182,9 @@ LLM 环境变量:
5. confirm 会通过 LangGraph interrupt resume 处理确认,并继续后续图节点;进程中断时再使用 resume。 5. confirm 会通过 LangGraph interrupt resume 处理确认,并继续后续图节点;进程中断时再使用 resume。
6. chat 会在执行前归一化并展示实际写入脚本配置的参数script_only / hybrid_node_mcp 会先检查 ZIP_FILE_PATH 是否存在。 6. chat 会在执行前归一化并展示实际写入脚本配置的参数script_only / hybrid_node_mcp 会先检查 ZIP_FILE_PATH 是否存在。
7. chat 执行过程中会播报每个 action 的开始、完成或失败;普通问候不会触发 LLM/结构化分析。 7. chat 执行过程中会播报每个 action 的开始、完成或失败;普通问候不会触发 LLM/结构化分析。
8. chat 内可使用 params、events、list checkpoints、load checkpoint、load params、llm config、mcp config 等命令。 8. chat 内可使用 params、events、list checkpoints、load checkpoint、load params、llm config、llm test、mcp config 等命令。
9. checkpoint 会保存完整运行参数,请放在受控目录。 9. 日志默认写入 logs/pam_deploy_agent.log并会脱敏 token、secret、api_key、Authorization 等字段。
10. checkpoint 会保存完整运行参数,请放在受控目录。
HELP_TEXT HELP_TEXT
} }

View File

@ -2,9 +2,14 @@
from __future__ import annotations from __future__ import annotations
import logging
from .constants import ALLOWED_ACTIONS, HOME_ACTIONS, NODE_ACTIONS from .constants import ALLOWED_ACTIONS, HOME_ACTIONS, NODE_ACTIONS
from .logging_utils import json_for_log
from .models import AgentState, BackendName, ExecutionStrategy, ActionResult from .models import AgentState, BackendName, ExecutionStrategy, ActionResult
logger = logging.getLogger(__name__)
def build_action_backends(strategy: ExecutionStrategy) -> dict[str, BackendName]: def build_action_backends(strategy: ExecutionStrategy) -> dict[str, BackendName]:
"""根据执行策略生成每个 action 对应的后端类型。""" """根据执行策略生成每个 action 对应的后端类型。"""
@ -33,6 +38,13 @@ class ActionRouter:
backend = state.action_backends.get(action) backend = state.action_backends.get(action)
if not backend: if not backend:
raise ValueError(f"action 未配置路由: {action}") raise ValueError(f"action 未配置路由: {action}")
logger.info(
"ActionRouter 路由 action run_id=%s action=%s backend=%s kwargs=%s",
state.run_id,
action,
backend,
json_for_log(kwargs),
)
if backend == "script": if backend == "script":
return self.script_runner.run( return self.script_runner.run(
action, action,
@ -48,6 +60,13 @@ class ActionRouter:
mcp_kwargs = dict(kwargs) mcp_kwargs = dict(kwargs)
hash_code = mcp_kwargs.pop("hash_code", None) or state.hash_code hash_code = mcp_kwargs.pop("hash_code", None) or state.hash_code
node_url = mcp_kwargs.pop("node_url", None) or state.node_url node_url = mcp_kwargs.pop("node_url", None) or state.node_url
logger.info(
"ActionRouter 调用 MCP action run_id=%s action=%s hash_code_present=%s node_url_present=%s",
state.run_id,
action,
bool(hash_code),
bool(node_url),
)
return self.mcp_runner.run( return self.mcp_runner.run(
action, action,
params=state.params, params=state.params,

View File

@ -8,6 +8,7 @@ from __future__ import annotations
import re import re
import time import time
import logging
from dataclasses import asdict from dataclasses import asdict
from pathlib import Path from pathlib import Path
from typing import Any, Callable from typing import Any, Callable
@ -18,11 +19,14 @@ from .config_writer import write_config
from .constants import DEFAULT_PARAMS, GLOBAL_ACTION_SEQUENCE, IP_ACTION_SEQUENCE, REQUIRED_PARAMS from .constants import DEFAULT_PARAMS, GLOBAL_ACTION_SEQUENCE, IP_ACTION_SEQUENCE, REQUIRED_PARAMS
from .fake_runner import FakeActionRunner from .fake_runner import FakeActionRunner
from .llm import LlmClient, RuleBasedLlmClient, validate_deploy_plan, validate_intent_result from .llm import LlmClient, RuleBasedLlmClient, validate_deploy_plan, validate_intent_result
from .logging_utils import configure_logging, json_for_log
from .mcp_runner import McpActionRunner from .mcp_runner import McpActionRunner
from .models import ActionResult, AgentState, ExecutionStrategy, LlmActionAnalysis, LlmDeployPlan, LlmIntentResult, LlmParamResult from .models import ActionResult, AgentState, ExecutionStrategy, LlmActionAnalysis, LlmDeployPlan, LlmIntentResult, LlmParamResult
from .script_runner import ScriptActionRunner, select_script_entry from .script_runner import ScriptActionRunner, select_script_entry
from .skill_policy import load_skill_policy from .skill_policy import load_skill_policy
logger = logging.getLogger(__name__)
REQUIRED_ACTION_VALUES = { REQUIRED_ACTION_VALUES = {
"upload-package": ("HASH_CODE",), "upload-package": ("HASH_CODE",),
"get-node-url": ("NODE_URL",), "get-node-url": ("NODE_URL",),
@ -44,6 +48,7 @@ class PamDeployAgent:
progress_callback: Callable[[dict[str, Any]], None] | None = None, progress_callback: Callable[[dict[str, Any]], None] | None = None,
) -> None: ) -> None:
"""初始化策略、脚本 runner、MCP runner、fake runner 和 LLM client。""" """初始化策略、脚本 runner、MCP runner、fake runner 和 LLM client。"""
self.log_path = configure_logging()
self.skill_policy = load_skill_policy(skill_path) self.skill_policy = load_skill_policy(skill_path)
self.script_base_dir = Path(script_base_dir) self.script_base_dir = Path(script_base_dir)
self.script_runner = ScriptActionRunner(self.script_base_dir) self.script_runner = ScriptActionRunner(self.script_base_dir)
@ -57,16 +62,42 @@ class PamDeployAgent:
mcp_runner=mcp_runner, mcp_runner=mcp_runner,
fake_runner=self.fake_runner, fake_runner=self.fake_runner,
) )
logger.info(
"Agent 初始化完成 skill=%s script_base_dir=%s llm_client=%s mcp_runner=%s action_analysis_events=%s",
skill_path,
self.script_base_dir,
type(self.llm_client).__name__,
type(self.mcp_runner).__name__ if self.mcp_runner else "",
self.action_analysis_enabled,
)
def understand_request(self, text: str) -> LlmIntentResult: def understand_request(self, text: str) -> LlmIntentResult:
"""调用 LLM 识别用户意图,并执行基础校验。""" """调用 LLM 识别用户意图,并执行基础校验。"""
logger.info("LLM 意图识别开始 client=%s text_len=%s", type(self.llm_client).__name__, len(text))
try:
result = self.llm_client.understand_request(text) result = self.llm_client.understand_request(text)
except Exception:
logger.exception("LLM 意图识别失败 client=%s", type(self.llm_client).__name__)
raise
validate_intent_result(result) validate_intent_result(result)
logger.info("LLM 意图识别完成 result=%s", json_for_log(asdict(result)))
return result return result
def extract_params(self, text: str, base_params: dict[str, Any] | None = None) -> LlmParamResult: def extract_params(self, text: str, base_params: dict[str, Any] | None = None) -> LlmParamResult:
"""从自然语言中抽取部署参数和控制参数。""" """从自然语言中抽取部署参数和控制参数。"""
return self.llm_client.extract_params(text, base_params) logger.info(
"LLM 参数抽取开始 client=%s text_len=%s base_params=%s",
type(self.llm_client).__name__,
len(text),
json_for_log(base_params or {}),
)
try:
result = self.llm_client.extract_params(text, base_params)
except Exception:
logger.exception("LLM 参数抽取失败 client=%s", type(self.llm_client).__name__)
raise
logger.info("LLM 参数抽取完成 result=%s", json_for_log(asdict(result)))
return result
def generate_plan( def generate_plan(
self, self,
@ -76,12 +107,25 @@ class PamDeployAgent:
strategy: ExecutionStrategy, strategy: ExecutionStrategy,
) -> LlmDeployPlan: ) -> LlmDeployPlan:
"""根据参数、意图和执行策略生成部署计划。""" """根据参数、意图和执行策略生成部署计划。"""
logger.info(
"LLM 计划生成开始 client=%s intent=%s strategy=%s params=%s",
type(self.llm_client).__name__,
intent,
strategy,
json_for_log(params),
)
try:
plan = self.llm_client.generate_plan(params=params, intent=intent, strategy=strategy) plan = self.llm_client.generate_plan(params=params, intent=intent, strategy=strategy)
except Exception:
logger.exception("LLM 计划生成失败 client=%s intent=%s strategy=%s", type(self.llm_client).__name__, intent, strategy)
raise
validate_deploy_plan(plan) validate_deploy_plan(plan)
logger.info("LLM 计划生成完成 result=%s", json_for_log(asdict(plan)))
return plan return plan
def analyze_request(self, text: str, base_params: dict[str, Any] | None = None) -> dict[str, Any]: def analyze_request(self, text: str, base_params: dict[str, Any] | None = None) -> dict[str, Any]:
"""完成意图识别、参数抽取和计划生成,供 analyze/chat 使用。""" """完成意图识别、参数抽取和计划生成,供 analyze/chat 使用。"""
logger.info("需求分析开始 text_len=%s base_params=%s", len(text), json_for_log(base_params or {}))
intent = self.understand_request(text) intent = self.understand_request(text)
params = self.extract_params(text, base_params) params = self.extract_params(text, base_params)
strategy = self._choose_strategy(intent.strategy_preference) strategy = self._choose_strategy(intent.strategy_preference)
@ -90,11 +134,19 @@ class PamDeployAgent:
intent=intent.intent, intent=intent.intent,
strategy=strategy, strategy=strategy,
) )
return { result = {
"intent": intent, "intent": intent,
"params": params, "params": params,
"plan": plan, "plan": plan,
} }
logger.info(
"需求分析完成 intent=%s strategy=%s missing=%s plan_actions=%s",
intent.intent,
strategy,
params.missing_required_params,
plan.planned_actions,
)
return result
def normalize_params(self, params: dict[str, Any]) -> dict[str, Any]: def normalize_params(self, params: dict[str, Any]) -> dict[str, Any]:
"""合并默认参数并校验必填参数是否齐全。""" """合并默认参数并校验必填参数是否齐全。"""
@ -124,6 +176,13 @@ class PamDeployAgent:
target_ips: list[str] | None = None, target_ips: list[str] | None = None,
) -> AgentState: ) -> AgentState:
"""创建一次运行所需的 AgentState并写入脚本配置文件。""" """创建一次运行所需的 AgentState并写入脚本配置文件。"""
logger.info(
"创建 AgentState 开始 strategy=%s checkpoint=%s target_ips=%s params=%s",
execution_strategy,
checkpoint_path or "",
target_ips or [],
json_for_log(params),
)
normalized = self.normalize_params(params) normalized = self.normalize_params(params)
actual_run_id = run_id or time.strftime("%Y%m%d_%H%M%S") actual_run_id = run_id or time.strftime("%Y%m%d_%H%M%S")
actual_script_entry = script_entry or select_script_entry() actual_script_entry = script_entry or select_script_entry()
@ -131,7 +190,7 @@ class PamDeployAgent:
actual_config_path = _absolute_path(config_path or runtime_dir / f"config_{actual_run_id}.txt") actual_config_path = _absolute_path(config_path or runtime_dir / f"config_{actual_run_id}.txt")
actual_trace_path = _absolute_path(trace_file_path or Path("logs") / f"api_trace_{actual_run_id}.log") actual_trace_path = _absolute_path(trace_file_path or Path("logs") / f"api_trace_{actual_run_id}.log")
write_config(normalized, actual_config_path) write_config(normalized, actual_config_path)
return AgentState( state = AgentState(
run_id=actual_run_id, run_id=actual_run_id,
params=normalized, params=normalized,
execution_strategy=execution_strategy, execution_strategy=execution_strategy,
@ -143,6 +202,15 @@ class PamDeployAgent:
checkpoint_path=checkpoint_path or "", checkpoint_path=checkpoint_path or "",
target_ips=target_ips or [], target_ips=target_ips or [],
) )
logger.info(
"创建 AgentState 完成 run_id=%s config=%s trace=%s script_entry=%s backends=%s",
state.run_id,
state.config_path,
state.trace_file_path,
state.script_entry,
json_for_log(state.action_backends),
)
return state
def pause_state( def pause_state(
self, self,
@ -152,6 +220,13 @@ class PamDeployAgent:
review_context: dict[str, Any] | None = None, review_context: dict[str, Any] | None = None,
) -> AgentState: ) -> AgentState:
"""将当前 state 标记为暂停,并持久化 checkpoint。""" """将当前 state 标记为暂停,并持久化 checkpoint。"""
logger.info(
"暂停 state run_id=%s reason=%s checkpoint=%s context=%s",
state.run_id,
reason,
state.checkpoint_path,
json_for_log(review_context or {}),
)
state.paused = True state.paused = True
state.pause_reason = reason state.pause_reason = reason
state.review_context = dict(review_context or {}) state.review_context = dict(review_context or {})
@ -160,6 +235,7 @@ class PamDeployAgent:
def resume_state(self, state: AgentState) -> AgentState: def resume_state(self, state: AgentState) -> AgentState:
"""清理暂停标记,允许后续继续执行。""" """清理暂停标记,允许后续继续执行。"""
logger.info("恢复 state run_id=%s previous_reason=%s checkpoint=%s", state.run_id, state.pause_reason, state.checkpoint_path)
state.paused = False state.paused = False
state.pause_reason = "" state.pause_reason = ""
state.review_context = {} state.review_context = {}
@ -168,12 +244,14 @@ class PamDeployAgent:
def update_state_params(self, state: AgentState, updates: dict[str, Any]) -> AgentState: def update_state_params(self, state: AgentState, updates: dict[str, Any]) -> AgentState:
"""热更新 state 中的参数,并回写 config 文件。""" """热更新 state 中的参数,并回写 config 文件。"""
logger.info("热更新 state 参数开始 run_id=%s updates=%s", state.run_id, json_for_log(updates))
merged = {**state.params, **updates} merged = {**state.params, **updates}
normalized = self.normalize_params(merged) normalized = self.normalize_params(merged)
state.params = normalized state.params = normalized
if state.config_path: if state.config_path:
write_config(normalized, state.config_path) write_config(normalized, state.config_path)
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info("热更新 state 参数完成 run_id=%s config=%s params=%s", state.run_id, state.config_path, json_for_log(state.params))
return state return state
def preview(self, params: dict[str, Any], strategy: ExecutionStrategy = "hybrid_node_mcp") -> str: def preview(self, params: dict[str, Any], strategy: ExecutionStrategy = "hybrid_node_mcp") -> str:
@ -209,12 +287,19 @@ class PamDeployAgent:
def run_global_flow(self, state: AgentState) -> AgentState: def run_global_flow(self, state: AgentState) -> AgentState:
"""执行全局部署阶段,并跳过 checkpoint 中已完成的步骤。""" """执行全局部署阶段,并跳过 checkpoint 中已完成的步骤。"""
logger.info(
"全局流程开始 run_id=%s paused=%s completed=%s",
state.run_id,
state.paused,
state.completed_global_steps,
)
if state.paused: if state.paused:
self._save_checkpoint(state) self._save_checkpoint(state)
return state return state
while True: while True:
action = self.next_global_action(state) action = self.next_global_action(state)
if action is None: if action is None:
logger.info("全局流程完成 run_id=%s completed=%s", state.run_id, state.completed_global_steps)
return state return state
self.run_global_action(state, action) self.run_global_action(state, action)
@ -231,6 +316,7 @@ class PamDeployAgent:
def run_global_action(self, state: AgentState, action: str) -> AgentState: def run_global_action(self, state: AgentState, action: str) -> AgentState:
"""执行一个全局 action并把结果写回 AgentState。""" """执行一个全局 action并把结果写回 AgentState。"""
if action in state.completed_global_steps: if action in state.completed_global_steps:
logger.info("跳过已完成全局 action run_id=%s action=%s", state.run_id, action)
return state return state
kwargs: dict[str, Any] = {} kwargs: dict[str, Any] = {}
if action == "publish-version": if action == "publish-version":
@ -240,16 +326,25 @@ class PamDeployAgent:
raise RuntimeError("publish-version 缺少 HASH_CODE请确认 upload-package 是否成功返回 HASH_CODE") raise RuntimeError("publish-version 缺少 HASH_CODE请确认 upload-package 是否成功返回 HASH_CODE")
kwargs["hash_code"] = state.hash_code kwargs["hash_code"] = state.hash_code
backend = state.action_backends.get(action, "script") backend = state.action_backends.get(action, "script")
logger.info(
"全局 action 开始 run_id=%s action=%s backend=%s kwargs=%s",
state.run_id,
action,
backend,
json_for_log(kwargs),
)
self._emit_progress({"type": "ACTION_START", "stage": action, "backend": backend}) self._emit_progress({"type": "ACTION_START", "stage": action, "backend": backend})
try: try:
result = self.router.run_action(state, action, **kwargs) result = self.router.run_action(state, action, **kwargs)
except Exception as exc: except Exception as exc:
logger.exception("全局 action 调用异常 run_id=%s action=%s backend=%s", state.run_id, action, backend)
result = ActionResult( result = ActionResult(
action=action, action=action,
backend=backend, backend=backend,
ok=False, ok=False,
error_summary=str(exc), error_summary=str(exc),
) )
logger.info("全局 action 返回 run_id=%s result=%s", state.run_id, _action_result_for_log(result))
state.events.append( state.events.append(
{ {
"type": "ACTION_DONE" if result.ok else "ACTION_FAIL", "type": "ACTION_DONE" if result.ok else "ACTION_FAIL",
@ -318,8 +413,10 @@ class PamDeployAgent:
reason="llm_review_blocked", reason="llm_review_blocked",
review_context=self._review_context(action=action, analysis=analysis, result=result), review_context=self._review_context(action=action, analysis=analysis, result=result),
) )
logger.info("全局 action 被 LLM 审核拦截 run_id=%s action=%s analysis=%s", state.run_id, action, json_for_log(asdict(analysis)))
return state return state
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info("全局 action 完成 run_id=%s action=%s completed=%s", state.run_id, action, state.completed_global_steps)
return state return state
def _missing_required_values(self, action: str, values: dict[str, Any]) -> list[str]: def _missing_required_values(self, action: str, values: dict[str, Any]) -> list[str]:
@ -329,21 +426,37 @@ class PamDeployAgent:
def run_deploy_flow(self, state: AgentState) -> AgentState: def run_deploy_flow(self, state: AgentState) -> AgentState:
"""执行完整部署流程:全局阶段后进入逐 IP 阶段。""" """执行完整部署流程:全局阶段后进入逐 IP 阶段。"""
logger.info(
"部署流程开始 run_id=%s paused=%s pending=%s strategy=%s",
state.run_id,
state.paused,
state.pending_confirmation,
state.execution_strategy,
)
if state.pending_confirmation or state.paused: if state.pending_confirmation or state.paused:
self._save_checkpoint(state) self._save_checkpoint(state)
return state return state
self.run_global_flow(state) self.run_global_flow(state)
self.run_ip_flow(state) self.run_ip_flow(state)
logger.info("部署流程结束 run_id=%s paused=%s pending=%s", state.run_id, state.paused, state.pending_confirmation)
return state return state
def run_ip_flow(self, state: AgentState) -> AgentState: def run_ip_flow(self, state: AgentState) -> AgentState:
"""执行逐 IP 部署流程,失败时停在人工确认点。""" """执行逐 IP 部署流程,失败时停在人工确认点。"""
logger.info(
"逐 IP 流程开始 run_id=%s paused=%s target_ips=%s online_ips=%s",
state.run_id,
state.paused,
state.target_ips,
state.online_ips,
)
if state.paused: if state.paused:
self._save_checkpoint(state) self._save_checkpoint(state)
return state return state
while True: while True:
work = self.next_ip_action(state) work = self.next_ip_action(state)
if work is None: if work is None:
logger.info("逐 IP 流程完成或等待确认 run_id=%s pending=%s ip_states=%s", state.run_id, state.pending_confirmation, json_for_log(state.ip_states))
return state return state
ip, action = work ip, action = work
self.run_ip_action(state, ip, action) self.run_ip_action(state, ip, action)
@ -354,6 +467,7 @@ class PamDeployAgent:
self._save_checkpoint(state) self._save_checkpoint(state)
return None return None
self._resolve_target_ips(state) self._resolve_target_ips(state)
logger.info("计算下一个 IP action run_id=%s target_ips=%s", state.run_id, state.target_ips)
for ip in state.target_ips: for ip in state.target_ips:
ip_state = state.ip_states.get(ip) ip_state = state.ip_states.get(ip)
if ip_state and ip_state.get("status") == "SUCCESS": if ip_state and ip_state.get("status") == "SUCCESS":
@ -365,6 +479,7 @@ class PamDeployAgent:
return None return None
continue continue
if not ip_state: if not ip_state:
logger.info("初始化 IP 状态 run_id=%s ip=%s", state.run_id, ip)
state.events.append({"type": "IP_START", "ip": ip, "message": "start"}) state.events.append({"type": "IP_START", "ip": ip, "message": "start"})
ip_state = { ip_state = {
"status": "RUNNING", "status": "RUNNING",
@ -385,6 +500,7 @@ class PamDeployAgent:
ip_state["status"] = "SUCCESS" ip_state["status"] = "SUCCESS"
state.events.append({"type": "IP_DONE", "ip": ip, "message": "success"}) state.events.append({"type": "IP_DONE", "ip": ip, "message": "success"})
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info("IP 部署完成 run_id=%s ip=%s", state.run_id, ip)
return None return None
def run_ip_action(self, state: AgentState, ip: str, action: str) -> AgentState: def run_ip_action(self, state: AgentState, ip: str, action: str) -> AgentState:
@ -392,7 +508,16 @@ class PamDeployAgent:
ip_state = state.ip_states[ip] ip_state = state.ip_states[ip]
completed_steps = ip_state.setdefault("completed_steps", []) completed_steps = ip_state.setdefault("completed_steps", [])
if action in completed_steps: if action in completed_steps:
logger.info("跳过已完成 IP action run_id=%s ip=%s action=%s", state.run_id, ip, action)
return state return state
logger.info(
"IP action 开始 run_id=%s ip=%s action=%s backend=%s ip_state=%s",
state.run_id,
ip,
action,
state.action_backends.get(action, ""),
json_for_log(ip_state),
)
self._emit_progress( self._emit_progress(
{ {
"type": "ACTION_START", "type": "ACTION_START",
@ -405,6 +530,7 @@ class PamDeployAgent:
try: try:
result = self.router.run_action(state, action, ip=ip) result = self.router.run_action(state, action, ip=ip)
except Exception as exc: except Exception as exc:
logger.exception("IP action 调用异常 run_id=%s ip=%s action=%s backend=%s", state.run_id, ip, action, backend)
result = ActionResult( result = ActionResult(
action=action, action=action,
backend=backend, backend=backend,
@ -412,6 +538,14 @@ class PamDeployAgent:
error_summary=str(exc), error_summary=str(exc),
) )
failed = (not result.ok) or self._business_failed(action, result.values) failed = (not result.ok) or self._business_failed(action, result.values)
logger.info(
"IP action 返回 run_id=%s ip=%s action=%s failed=%s result=%s",
state.run_id,
ip,
action,
failed,
_action_result_for_log(result),
)
state.events.append( state.events.append(
{ {
"type": "ACTION_FAIL" if failed else "ACTION_DONE", "type": "ACTION_FAIL" if failed else "ACTION_DONE",
@ -443,6 +577,7 @@ class PamDeployAgent:
self._download_log_best_effort(state, ip) self._download_log_best_effort(state, ip)
state.pending_confirmation = f"rollback-ip:{ip}" state.pending_confirmation = f"rollback-ip:{ip}"
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info("IP action 失败并进入确认 run_id=%s ip=%s action=%s pending=%s", state.run_id, ip, action, state.pending_confirmation)
return state return state
self._apply_ip_result(ip_state, action, result.values) self._apply_ip_result(ip_state, action, result.values)
@ -462,8 +597,10 @@ class PamDeployAgent:
reason="llm_review_blocked", reason="llm_review_blocked",
review_context=self._review_context(action=action, analysis=analysis, result=result, ip=ip), review_context=self._review_context(action=action, analysis=analysis, result=result, ip=ip),
) )
logger.info("IP action 被 LLM 审核拦截 run_id=%s ip=%s action=%s analysis=%s", state.run_id, ip, action, json_for_log(asdict(analysis)))
return state return state
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info("IP action 完成 run_id=%s ip=%s action=%s completed=%s", state.run_id, ip, action, completed_steps)
return state return state
def build_confirmation_request(self, state: AgentState) -> dict[str, Any]: def build_confirmation_request(self, state: AgentState) -> dict[str, Any]:
@ -497,6 +634,13 @@ class PamDeployAgent:
ip = request["ip"] ip = request["ip"]
ip_state = state.ip_states[ip] ip_state = state.ip_states[ip]
logger.info(
"人工确认开始 run_id=%s approved=%s request=%s note_len=%s",
state.run_id,
approved,
json_for_log(request),
len(operator_note),
)
if not approved: if not approved:
ip_state["rollback_status"] = "REJECTED_BY_OPERATOR" ip_state["rollback_status"] = "REJECTED_BY_OPERATOR"
state.events.append( state.events.append(
@ -512,6 +656,7 @@ class PamDeployAgent:
state.pause_reason = "" state.pause_reason = ""
state.review_context = {} state.review_context = {}
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info("人工确认拒绝回滚 run_id=%s ip=%s", state.run_id, ip)
return state return state
backend = state.action_backends.get("rollback-ip", "script") backend = state.action_backends.get("rollback-ip", "script")
@ -531,12 +676,14 @@ class PamDeployAgent:
stop_first=bool(ip_state.get("rollback_stop_first", False)), stop_first=bool(ip_state.get("rollback_stop_first", False)),
) )
except Exception as exc: except Exception as exc:
logger.exception("rollback action 调用异常 run_id=%s ip=%s backend=%s", state.run_id, ip, backend)
result = ActionResult( result = ActionResult(
action="rollback-ip", action="rollback-ip",
backend=backend, backend=backend,
ok=False, ok=False,
error_summary=str(exc), error_summary=str(exc),
) )
logger.info("rollback action 返回 run_id=%s ip=%s result=%s", state.run_id, ip, _action_result_for_log(result))
ip_state["rollback_status"] = "ROLLBACK_DONE" if result.ok else "ROLLBACK_FAILED" ip_state["rollback_status"] = "ROLLBACK_DONE" if result.ok else "ROLLBACK_FAILED"
state.events.append( state.events.append(
{ {
@ -579,6 +726,14 @@ class PamDeployAgent:
} }
) )
self._save_checkpoint(state) self._save_checkpoint(state)
logger.info(
"人工确认处理完成 run_id=%s ip=%s rollback_status=%s pending=%s paused=%s",
state.run_id,
ip,
ip_state.get("rollback_status"),
state.pending_confirmation,
state.paused,
)
return state return state
def _emit_progress(self, payload: dict[str, Any]) -> None: def _emit_progress(self, payload: dict[str, Any]) -> None:
@ -607,6 +762,7 @@ class PamDeployAgent:
"""根据在线 IP 和用户指定 IP 计算最终目标 IP。""" """根据在线 IP 和用户指定 IP 计算最终目标 IP。"""
if not state.target_ips: if not state.target_ips:
state.target_ips = state.online_ips.copy() state.target_ips = state.online_ips.copy()
logger.info("目标 IP 未指定,使用全部在线 IP run_id=%s target_ips=%s", state.run_id, state.target_ips)
return return
online = set(state.online_ips) online = set(state.online_ips)
requested = state.target_ips requested = state.target_ips
@ -621,6 +777,13 @@ class PamDeployAgent:
"target_ips": state.target_ips, "target_ips": state.target_ips,
} }
) )
logger.info(
"目标 IP 范围调整 run_id=%s requested=%s target_ips=%s missing=%s",
state.run_id,
requested,
state.target_ips,
missing,
)
def _business_failed(self, action: str, values: dict[str, Any]) -> bool: def _business_failed(self, action: str, values: dict[str, Any]) -> bool:
"""识别 exit code 之外的业务失败条件。""" """识别 exit code 之外的业务失败条件。"""
@ -640,6 +803,14 @@ class PamDeployAgent:
"""记录单 IP 失败,并设置待回滚确认状态。""" """记录单 IP 失败,并设置待回滚确认状态。"""
ip_state = state.ip_states[ip] ip_state = state.ip_states[ip]
stop_first = action in ("start-ip", "verify-ip") stop_first = action in ("start-ip", "verify-ip")
logger.info(
"记录 IP 失败 run_id=%s ip=%s action=%s reason=%s stop_first=%s",
state.run_id,
ip,
action,
reason,
stop_first,
)
ip_state.update( ip_state.update(
{ {
"status": "FAILED", "status": "FAILED",
@ -663,6 +834,7 @@ class PamDeployAgent:
def _download_log_best_effort(self, state: AgentState, ip: str) -> None: def _download_log_best_effort(self, state: AgentState, ip: str) -> None:
"""失败后尽力下载日志,日志失败不覆盖原失败原因。""" """失败后尽力下载日志,日志失败不覆盖原失败原因。"""
backend = state.action_backends.get("download-log", "script") backend = state.action_backends.get("download-log", "script")
logger.info("失败后尝试下载日志 run_id=%s ip=%s backend=%s", state.run_id, ip, backend)
self._emit_progress( self._emit_progress(
{ {
"type": "ACTION_START", "type": "ACTION_START",
@ -675,6 +847,7 @@ class PamDeployAgent:
try: try:
result = self.router.run_action(state, "download-log", ip=ip) result = self.router.run_action(state, "download-log", ip=ip)
except Exception as exc: except Exception as exc:
logger.exception("失败后下载日志 action 调用异常 run_id=%s ip=%s backend=%s", state.run_id, ip, backend)
result = ActionResult( result = ActionResult(
action="download-log", action="download-log",
backend=backend, backend=backend,
@ -702,6 +875,7 @@ class PamDeployAgent:
"message": result.values.get("MESSAGE", "已尽力下载日志"), "message": result.values.get("MESSAGE", "已尽力下载日志"),
} }
) )
logger.info("失败后下载日志完成 run_id=%s ip=%s result=%s", state.run_id, ip, _action_result_for_log(result))
else: else:
state.events.append( state.events.append(
{ {
@ -721,11 +895,21 @@ class PamDeployAgent:
"message": result.error_summary or "尽力下载日志失败", "message": result.error_summary or "尽力下载日志失败",
} }
) )
logger.info("失败后下载日志失败 run_id=%s ip=%s result=%s", state.run_id, ip, _action_result_for_log(result))
self._append_action_analysis(state, "download-log", result, ip=ip) self._append_action_analysis(state, "download-log", result, ip=ip)
def _save_checkpoint(self, state: AgentState) -> None: def _save_checkpoint(self, state: AgentState) -> None:
"""如果配置了 checkpoint 路径,则保存完整运行状态。""" """如果配置了 checkpoint 路径,则保存完整运行状态。"""
if state.checkpoint_path: if state.checkpoint_path:
logger.info(
"保存 checkpoint run_id=%s path=%s paused=%s pending=%s last_success=%s last_failed=%s",
state.run_id,
state.checkpoint_path,
state.paused,
state.pending_confirmation,
state.last_success_step,
state.last_failed_step,
)
save_checkpoint(state, state.checkpoint_path, redact=False) save_checkpoint(state, state.checkpoint_path, redact=False)
def _append_action_analysis( def _append_action_analysis(
@ -737,6 +921,13 @@ class PamDeployAgent:
ip: str | None = None, ip: str | None = None,
) -> Any: ) -> Any:
"""启用 action 后分析时,把诊断结果追加到 events。""" """启用 action 后分析时,把诊断结果追加到 events。"""
logger.info(
"LLM action 审核开始 run_id=%s action=%s ip=%s result=%s",
state.run_id,
action,
ip or "",
_action_result_for_log(result),
)
self._emit_progress( self._emit_progress(
{ {
"type": "ACTION_REVIEW_START", "type": "ACTION_REVIEW_START",
@ -752,6 +943,7 @@ class PamDeployAgent:
state_summary=self._state_summary_for_llm(state, ip=ip), state_summary=self._state_summary_for_llm(state, ip=ip),
) )
except Exception as exc: # pragma: no cover - 审核失败时也要显式暂停,避免黑盒继续执行 except Exception as exc: # pragma: no cover - 审核失败时也要显式暂停,避免黑盒继续执行
logger.exception("LLM action 审核失败 run_id=%s action=%s ip=%s", state.run_id, action, ip or "")
state.events.append( state.events.append(
{ {
"type": "ACTION_ANALYSIS_FAIL", "type": "ACTION_ANALYSIS_FAIL",
@ -795,6 +987,13 @@ class PamDeployAgent:
"should_continue": analysis.should_continue, "should_continue": analysis.should_continue,
} }
) )
logger.info(
"LLM action 审核完成 run_id=%s action=%s ip=%s analysis=%s",
state.run_id,
action,
ip or "",
json_for_log(payload),
)
return analysis return analysis
def _state_summary_for_llm(self, state: AgentState, *, ip: str | None = None) -> dict[str, Any]: def _state_summary_for_llm(self, state: AgentState, *, ip: str | None = None) -> dict[str, Any]:
@ -895,3 +1094,20 @@ def _normalize_local_file_path(path: str) -> str:
if value.is_absolute(): if value.is_absolute():
return str(value) return str(value)
return str(value.resolve()) return str(value.resolve())
def _action_result_for_log(result: ActionResult) -> str:
"""生成 action 结果日志摘要,避免写入完整 stdout/stderr。"""
return json_for_log(
{
"action": result.action,
"backend": result.backend,
"ok": result.ok,
"exit_code": result.exit_code,
"tool_name": result.tool_name,
"values": result.values,
"stderr": result.stderr,
"error_summary": result.error_summary,
},
max_text_len=1000,
)

View File

@ -4,6 +4,7 @@ from __future__ import annotations
import argparse import argparse
import json import json
import logging
from dataclasses import asdict from dataclasses import asdict
from .agent import PamDeployAgent from .agent import PamDeployAgent
@ -11,9 +12,12 @@ from .checkpoint_store import load_agent_state, redact_mapping
from .interactive import run_interactive_chat from .interactive import run_interactive_chat
from .langgraph_runtime import LangGraphDeploymentRuntime, LangGraphRunResult from .langgraph_runtime import LangGraphDeploymentRuntime, LangGraphRunResult
from .llm import build_llm_client from .llm import build_llm_client
from .logging_utils import configure_logging, json_for_log
from .mcp_factory import build_mcp_runner_from_config from .mcp_factory import build_mcp_runner_from_config
from .params_loader import load_params_file from .params_loader import load_params_file
logger = logging.getLogger(__name__)
def add_llm_args(parser: argparse.ArgumentParser) -> None: def add_llm_args(parser: argparse.ArgumentParser) -> None:
"""为子命令追加真实 LLM 配置参数。""" """为子命令追加真实 LLM 配置参数。"""
@ -125,7 +129,11 @@ def main() -> None:
add_action_analysis_arg(confirm) add_action_analysis_arg(confirm)
args = parser.parse_args() args = parser.parse_args()
log_path = configure_logging()
logger.info("CLI 启动 command=%s args=%s log_path=%s", args.command, json_for_log(vars(args)), log_path)
params = load_params_file(args.config) if getattr(args, "config", None) else {} params = load_params_file(args.config) if getattr(args, "config", None) else {}
if getattr(args, "config", None):
logger.info("参数文件已加载 command=%s config=%s params=%s", args.command, args.config, json_for_log(params))
llm_client = None llm_client = None
if args.command != "preview": if args.command != "preview":
llm_client = build_llm_client( llm_client = build_llm_client(
@ -136,7 +144,9 @@ def main() -> None:
) )
mcp_runner = None mcp_runner = None
if getattr(args, "mcp_config", None): if getattr(args, "mcp_config", None):
logger.info("开始加载 MCP 配置 path=%s", args.mcp_config)
mcp_runner = build_mcp_runner_from_config(args.mcp_config) mcp_runner = build_mcp_runner_from_config(args.mcp_config)
logger.info("MCP 配置加载完成 path=%s runner=%s", args.mcp_config, type(mcp_runner).__name__)
agent = PamDeployAgent( agent = PamDeployAgent(
llm_client=llm_client, llm_client=llm_client,
mcp_runner=mcp_runner, mcp_runner=mcp_runner,
@ -144,12 +154,15 @@ def main() -> None:
) )
if args.command == "analyze": if args.command == "analyze":
logger.info("开始执行 analyze text_len=%s", len(args.text))
result = agent.analyze_request(args.text, params) result = agent.analyze_request(args.text, params)
payload = redact_mapping({key: asdict(value) for key, value in result.items()}) payload = redact_mapping({key: asdict(value) for key, value in result.items()})
logger.info("analyze 完成 result=%s", json_for_log(payload))
print(json.dumps(payload, ensure_ascii=False, indent=2)) print(json.dumps(payload, ensure_ascii=False, indent=2))
return return
if args.command == "chat": if args.command == "chat":
logger.info("进入 chat 模式 strategy=%s checkpoint=%s target_ips=%s", args.strategy, args.checkpoint, args.target_ip)
run_interactive_chat( run_interactive_chat(
agent=agent, agent=agent,
params=params, params=params,
@ -160,11 +173,13 @@ def main() -> None:
return return
if args.command == "preview": if args.command == "preview":
logger.info("执行 preview strategy=%s", args.strategy)
print(agent.preview(params, args.strategy)) print(agent.preview(params, args.strategy))
return return
require_confirm(args) require_confirm(args)
if args.command == "run-global": if args.command == "run-global":
logger.info("开始 run-global strategy=%s checkpoint=%s", args.strategy, args.checkpoint)
state = agent.create_state( state = agent.create_state(
params=params, params=params,
execution_strategy=args.strategy, execution_strategy=args.strategy,
@ -177,6 +192,7 @@ def main() -> None:
return return
if args.command == "resume": if args.command == "resume":
logger.info("开始 resume checkpoint=%s", args.checkpoint)
state = load_agent_state(args.checkpoint) state = load_agent_state(args.checkpoint)
state.checkpoint_path = state.checkpoint_path or args.checkpoint state.checkpoint_path = state.checkpoint_path or args.checkpoint
if state.paused: if state.paused:
@ -186,6 +202,7 @@ def main() -> None:
return return
if args.command == "confirm": if args.command == "confirm":
logger.info("开始 confirm checkpoint=%s decision=%s note_len=%s", args.checkpoint, args.decision, len(args.note))
state = load_agent_state(args.checkpoint) state = load_agent_state(args.checkpoint)
state.checkpoint_path = state.checkpoint_path or args.checkpoint state.checkpoint_path = state.checkpoint_path or args.checkpoint
runtime = LangGraphDeploymentRuntime(agent=agent, flow="deploy") runtime = LangGraphDeploymentRuntime(agent=agent, flow="deploy")
@ -197,6 +214,7 @@ def main() -> None:
print_graph_result(agent, first) print_graph_result(agent, first)
return return
logger.info("开始 run-deploy strategy=%s checkpoint=%s target_ips=%s", args.strategy, args.checkpoint, args.target_ip)
state = agent.create_state( state = agent.create_state(
params=params, params=params,
execution_strategy=args.strategy, execution_strategy=args.strategy,

View File

@ -73,6 +73,12 @@ SENSITIVE_KEYS = {
"MCP_TOKEN", "MCP_TOKEN",
"TOKEN", "TOKEN",
"Authorization", "Authorization",
"authorization",
"access_token", "access_token",
"ACCESS_TOKEN", "ACCESS_TOKEN",
"api_key",
"API_KEY",
"PAM_LLM_API_KEY",
"password",
"PASSWORD",
} }

View File

@ -6,6 +6,7 @@ import time
import json import json
import shlex import shlex
import builtins import builtins
import logging
import os import os
import sys import sys
from dataclasses import asdict from dataclasses import asdict
@ -17,12 +18,14 @@ from .checkpoint_store import load_agent_state, redact_mapping
from .langgraph_runtime import LangGraphDeploymentRuntime, LangGraphRunResult from .langgraph_runtime import LangGraphDeploymentRuntime, LangGraphRunResult
from .llm import build_llm_client from .llm import build_llm_client
from .llm.rule_based import RuleBasedLlmClient from .llm.rule_based import RuleBasedLlmClient
from .logging_utils import configure_logging, json_for_log, redact_for_log
from .mcp_factory import build_mcp_runner_from_config from .mcp_factory import build_mcp_runner_from_config
from .models import AgentState, ExecutionStrategy from .models import AgentState, ExecutionStrategy
from .params_loader import load_params_file from .params_loader import load_params_file
InputFunc = Callable[[str], str] InputFunc = Callable[[str], str]
OutputFunc = Callable[[str], None] OutputFunc = Callable[[str], None]
logger = logging.getLogger(__name__)
COMMAND_HELP = """可用命令: COMMAND_HELP = """可用命令:
help 显示帮助 help 显示帮助
@ -32,6 +35,7 @@ COMMAND_HELP = """可用命令:
events [数量] 查看最近 action 事件默认 10 events [数量] 查看最近 action 事件默认 10
set KEY=VALUE 修改当前会话参数 set KEY=VALUE 修改当前会话参数
llm config KEY=VALUE 配置真实 LLM支持 base_url/api_key/model/action_analysis_prompt_file llm config KEY=VALUE 配置真实 LLM支持 base_url/api_key/model/action_analysis_prompt_file
llm test [文本] 测试当前 LLM client 是否可正常调用
llm fallback 切回本地规则 fallback llm fallback 切回本地规则 fallback
llm action-analysis on|off 开关 action 审核详情写入 events llm action-analysis on|off 开关 action 审核详情写入 events
mcp config <路径> 加载 MCP client JSON 配置 mcp config <路径> 加载 MCP client JSON 配置
@ -79,9 +83,20 @@ class InteractiveCliSession:
self.mcp_config_path: str = "" self.mcp_config_path: str = ""
self.graph_runtime: LangGraphDeploymentRuntime | None = None self.graph_runtime: LangGraphDeploymentRuntime | None = None
self.agent.progress_callback = self._on_progress self.agent.progress_callback = self._on_progress
self.log_path = configure_logging()
logger.info(
"chat 会话已初始化 strategy=%s checkpoint=%s target_ips=%s llm_client=%s log_path=%s params=%s",
self.strategy,
self.checkpoint_path,
self.target_ips,
type(self.agent.llm_client).__name__,
self.log_path,
json_for_log(self.params),
)
def run(self) -> None: def run(self) -> None:
"""启动 REPL 循环,直到用户 exit 或输入流结束。""" """启动 REPL 循环,直到用户 exit 或输入流结束。"""
logger.info("chat REPL 启动 checkpoint=%s", self.checkpoint_path)
self.output("PAM 部署 Agent 交互式会话") self.output("PAM 部署 Agent 交互式会话")
self.output("输入 help 查看命令,输入 exit 退出。") self.output("输入 help 查看命令,输入 exit 退出。")
self._load_existing_checkpoint_if_any() self._load_existing_checkpoint_if_any()
@ -89,9 +104,11 @@ class InteractiveCliSession:
try: try:
line = self.input("pam-deploy-agent> ") line = self.input("pam-deploy-agent> ")
except KeyboardInterrupt: except KeyboardInterrupt:
logger.info("chat 输入被用户中断")
self.output("已取消当前输入。输入 exit 退出,或继续输入命令。") self.output("已取消当前输入。输入 exit 退出,或继续输入命令。")
continue continue
except EOFError: except EOFError:
logger.info("chat 输入流结束")
self.output("bye") self.output("bye")
return return
if not self.handle_line(line): if not self.handle_line(line):
@ -105,8 +122,14 @@ class InteractiveCliSession:
command, _, rest = text.partition(" ") command, _, rest = text.partition(" ")
normalized = command.lower() normalized = command.lower()
logger.info(
"chat 收到输入 command=%s text=%s",
normalized,
redact_for_log(text, max_text_len=500),
)
if normalized in ("exit", "quit", "q"): if normalized in ("exit", "quit", "q"):
logger.info("chat 会话退出")
self.output("bye") self.output("bye")
return False return False
if normalized in ("help", "?"): if normalized in ("help", "?"):
@ -162,9 +185,11 @@ class InteractiveCliSession:
return True return True
if _is_small_talk(text): if _is_small_talk(text):
logger.info("chat 输入识别为寒暄,跳过结构化分析")
self.output("你好。可以输入 help 查看命令,或直接描述部署需求;执行前仍需输入 run 并确认。") self.output("你好。可以输入 help 查看命令,或直接描述部署需求;执行前仍需输入 run 并确认。")
return True return True
if not _looks_like_deploy_request(text): if not _looks_like_deploy_request(text):
logger.info("chat 输入未命中部署需求粗筛,跳过结构化分析")
self.output("我没有识别到明确的部署需求。可以输入 help 查看命令,或用 analyze <需求> 明确触发需求分析。") self.output("我没有识别到明确的部署需求。可以输入 help 查看命令,或用 analyze <需求> 明确触发需求分析。")
return True return True
@ -179,8 +204,10 @@ class InteractiveCliSession:
return return
try: try:
logger.info("chat 开始需求分析 text_len=%s base_params=%s", len(text), json_for_log(self.params))
result = self.agent.analyze_request(text, self.params) result = self.agent.analyze_request(text, self.params)
except Exception as exc: except Exception as exc:
logger.exception("chat 需求分析失败 text=%s", redact_for_log(text, max_text_len=500))
self.output(f"需求分析失败: {exc}") self.output(f"需求分析失败: {exc}")
return return
self.last_analysis = result self.last_analysis = result
@ -206,6 +233,13 @@ class InteractiveCliSession:
self.output("- target_ips: " + ", ".join(self.target_ips)) self.output("- target_ips: " + ", ".join(self.target_ips))
self.output("执行请输 run查看完整 JSON 可用一次性 analyze 命令。") self.output("执行请输 run查看完整 JSON 可用一次性 analyze 命令。")
self.output(_format_redacted_params(safe_payload["params"]["extracted_params"])) self.output(_format_redacted_params(safe_payload["params"]["extracted_params"]))
logger.info(
"chat 需求分析完成 intent=%s strategy=%s target_ips=%s result=%s",
intent_result.intent,
self.strategy,
self.target_ips,
json_for_log(safe_payload),
)
def _set_param(self, assignment: str) -> None: def _set_param(self, assignment: str) -> None:
"""处理 `set KEY=VALUE` 命令,更新当前会话参数。""" """处理 `set KEY=VALUE` 命令,更新当前会话参数。"""
@ -220,6 +254,7 @@ class InteractiveCliSession:
self.params[key] = value.strip() self.params[key] = value.strip()
self._sync_params_to_state() self._sync_params_to_state()
self.output(f"已设置 {key}") self.output(f"已设置 {key}")
logger.info("chat 参数已设置 key=%s params=%s", key, json_for_log(self.params))
def _show_params(self) -> None: def _show_params(self) -> None:
"""脱敏展示当前会话参数。""" """脱敏展示当前会话参数。"""
@ -241,13 +276,23 @@ class InteractiveCliSession:
def _configure_llm(self, text: str) -> None: def _configure_llm(self, text: str) -> None:
"""热加载 LLM 配置,或开关 action 后诊断。""" """热加载 LLM 配置,或开关 action 后诊断。"""
if not text: if not text:
self.output("格式llm config base_url=... api_key=... model=... action_analysis_prompt_file=... | llm fallback | llm action-analysis on|off") self.output("格式llm config base_url=... api_key=... model=... action_analysis_prompt_file=... | llm test [文本] | llm fallback | llm action-analysis on|off")
return return
try:
parts = shlex.split(text) parts = shlex.split(text)
except ValueError as exc:
logger.exception("chat LLM 命令解析失败 text=%s", redact_for_log(text, max_text_len=500))
self.output(f"LLM 命令解析失败: {exc}")
return
if parts[0] == "fallback": if parts[0] == "fallback":
self.agent.llm_client = RuleBasedLlmClient() self.agent.llm_client = RuleBasedLlmClient()
self.llm_config = {} self.llm_config = {}
self.output("已切回本地规则 LLM fallback。") self.output("已切回本地规则 LLM fallback。")
logger.info("chat LLM 已切回 fallback")
return
if parts[0] == "test":
test_text = text.partition("test")[2].strip()
self._test_llm(test_text)
return return
if parts[0] == "action-analysis": if parts[0] == "action-analysis":
if len(parts) < 2 or parts[1] not in ("on", "off"): if len(parts) < 2 or parts[1] not in ("on", "off"):
@ -255,6 +300,7 @@ class InteractiveCliSession:
return return
self.agent.action_analysis_enabled = parts[1] == "on" self.agent.action_analysis_enabled = parts[1] == "on"
self.output(f"action 审核详情写入 events 已{'开启' if self.agent.action_analysis_enabled else '关闭'}") self.output(f"action 审核详情写入 events 已{'开启' if self.agent.action_analysis_enabled else '关闭'}")
logger.info("chat action 审核事件写入开关=%s", self.agent.action_analysis_enabled)
return return
if parts[0] != "config": if parts[0] != "config":
self.output("未知 llm 命令。") self.output("未知 llm 命令。")
@ -262,6 +308,7 @@ class InteractiveCliSession:
updates = _parse_key_values(parts[1:]) updates = _parse_key_values(parts[1:])
self.llm_config.update(updates) self.llm_config.update(updates)
try: try:
logger.info("chat 开始加载 LLM 配置 updates=%s merged=%s", json_for_log(updates), json_for_log(self.llm_config))
self.agent.llm_client = build_llm_client( self.agent.llm_client = build_llm_client(
base_url=self.llm_config.get("base_url"), base_url=self.llm_config.get("base_url"),
api_key=self.llm_config.get("api_key"), api_key=self.llm_config.get("api_key"),
@ -269,12 +316,35 @@ class InteractiveCliSession:
action_analysis_prompt_path=self.llm_config.get("action_analysis_prompt_file"), action_analysis_prompt_path=self.llm_config.get("action_analysis_prompt_file"),
) )
except Exception as exc: except Exception as exc:
logger.exception("chat LLM 配置失败 config=%s", json_for_log(self.llm_config))
self.output(f"LLM 配置失败: {exc}") self.output(f"LLM 配置失败: {exc}")
return return
safe = {**self.llm_config} safe = {**self.llm_config}
if safe.get("api_key"): if safe.get("api_key"):
safe["api_key"] = "***" safe["api_key"] = "***"
self.output("LLM 配置已加载: " + json.dumps(safe, ensure_ascii=False)) self.output("LLM 配置已加载: " + json.dumps(safe, ensure_ascii=False))
logger.info("chat LLM 配置已加载 client=%s config=%s", type(self.agent.llm_client).__name__, json_for_log(self.llm_config))
def _test_llm(self, text: str) -> None:
"""使用当前 LLM client 做一次轻量调用,便于确认配置是否生效。"""
prompt = text or "请判断这是一次 LLM 连通性测试,并返回结构化意图。"
client_name = type(self.agent.llm_client).__name__
self.output(f"正在测试 LLM: {client_name}")
logger.info("chat LLM 测试开始 client=%s prompt=%s", client_name, redact_for_log(prompt, max_text_len=500))
try:
result = self.agent.understand_request(prompt)
except Exception as exc:
logger.exception("chat LLM 测试失败 client=%s", client_name)
self.output(f"LLM 测试失败: {exc}")
return
self.output("LLM 测试通过")
self.output(f"- client: {client_name}")
self.output(f"- intent: {result.intent}")
self.output(f"- strategy: {result.strategy_preference}")
self.output(f"- confidence: {result.confidence}")
if result.reasons:
self.output("- reasons: " + "; ".join(result.reasons))
logger.info("chat LLM 测试通过 client=%s result=%s", client_name, json_for_log(asdict(result)))
def _configure_mcp(self, text: str) -> None: def _configure_mcp(self, text: str) -> None:
"""热加载 MCP client 配置。""" """热加载 MCP client 配置。"""
@ -284,18 +354,22 @@ class InteractiveCliSession:
return return
path = path.strip().strip('"') path = path.strip().strip('"')
try: try:
logger.info("chat 开始加载 MCP 配置 path=%s", path)
runner = build_mcp_runner_from_config(path) runner = build_mcp_runner_from_config(path)
except Exception as exc: except Exception as exc:
logger.exception("chat MCP 配置失败 path=%s", path)
self.output(f"MCP 配置失败: {exc}") self.output(f"MCP 配置失败: {exc}")
return return
self.agent.mcp_runner = runner self.agent.mcp_runner = runner
self.agent.router.mcp_runner = runner self.agent.router.mcp_runner = runner
self.mcp_config_path = path self.mcp_config_path = path
self.output(f"MCP 配置已加载: {path}") self.output(f"MCP 配置已加载: {path}")
logger.info("chat MCP 配置已加载 path=%s runner=%s", path, type(runner).__name__)
def _list_checkpoints(self) -> None: def _list_checkpoints(self) -> None:
"""列出当前 checkpoint 目录下的 JSON 文件。""" """列出当前 checkpoint 目录下的 JSON 文件。"""
checkpoint_dir = Path(self.checkpoint_path).parent checkpoint_dir = Path(self.checkpoint_path).parent
logger.info("chat 查询 checkpoint 列表 dir=%s", checkpoint_dir)
if not checkpoint_dir.exists(): if not checkpoint_dir.exists():
self.output(f"checkpoint 目录不存在: {checkpoint_dir}") self.output(f"checkpoint 目录不存在: {checkpoint_dir}")
return return
@ -314,6 +388,7 @@ class InteractiveCliSession:
self.output("格式load checkpoint <路径>") self.output("格式load checkpoint <路径>")
return return
checkpoint = Path(path_text) checkpoint = Path(path_text)
logger.info("chat 开始加载 checkpoint path=%s", checkpoint)
if not checkpoint.exists(): if not checkpoint.exists():
self.output(f"checkpoint 不存在: {checkpoint}") self.output(f"checkpoint 不存在: {checkpoint}")
return return
@ -325,6 +400,14 @@ class InteractiveCliSession:
self.target_ips = list(self.state.target_ips) self.target_ips = list(self.state.target_ips)
self.graph_runtime = None self.graph_runtime = None
self.output(f"已加载 checkpoint: {checkpoint}") self.output(f"已加载 checkpoint: {checkpoint}")
logger.info(
"chat checkpoint 已加载 path=%s run_id=%s strategy=%s paused=%s pending=%s",
checkpoint,
self.state.run_id,
self.strategy,
self.state.paused,
self.state.pending_confirmation,
)
if self.state.pending_confirmation: if self.state.pending_confirmation:
self._print_confirmation() self._print_confirmation()
self._print_pause_context() self._print_pause_context()
@ -335,35 +418,43 @@ class InteractiveCliSession:
self.output("格式load params <路径>") self.output("格式load params <路径>")
return return
path = Path(path_text) path = Path(path_text)
logger.info("chat 开始加载参数文件 path=%s", path)
if not path.exists(): if not path.exists():
self.output(f"参数文件不存在: {path}") self.output(f"参数文件不存在: {path}")
return return
try: try:
updates = load_params_file(path) updates = load_params_file(path)
except Exception as exc: except Exception as exc:
logger.exception("chat 参数文件加载失败 path=%s", path)
self.output(f"参数文件加载失败: {exc}") self.output(f"参数文件加载失败: {exc}")
return return
self.params.update(updates) self.params.update(updates)
try: try:
self.params = self.agent.normalize_params(self.params) self.params = self.agent.normalize_params(self.params)
except ValueError as exc: except ValueError as exc:
logger.exception("chat 参数热更新归一化失败 path=%s updates=%s", path, json_for_log(updates))
self.output(f"参数热更新失败: {exc}") self.output(f"参数热更新失败: {exc}")
return return
self._sync_params_to_state() self._sync_params_to_state()
self.output(f"已加载参数文件: {path}") self.output(f"已加载参数文件: {path}")
self.output(_format_redacted_params(redact_mapping(self.params))) self.output(_format_redacted_params(redact_mapping(self.params)))
logger.info("chat 参数文件已加载 path=%s updates=%s params=%s", path, json_for_log(updates), json_for_log(self.params))
def _run_deploy(self) -> None: def _run_deploy(self) -> None:
"""在用户确认后创建状态并执行完整部署流程。""" """在用户确认后创建状态并执行完整部署流程。"""
logger.info("chat run 请求开始 strategy=%s checkpoint=%s target_ips=%s", self.strategy, self.checkpoint_path, self.target_ips)
if self.state and self.state.pending_confirmation: if self.state and self.state.pending_confirmation:
logger.info("chat run 命中待确认事项 pending=%s", self.state.pending_confirmation)
self._print_confirmation() self._print_confirmation()
return return
if not self._prepare_params_for_run(): if not self._prepare_params_for_run():
logger.info("chat run 参数准备失败")
return return
problems = self._validate_run_prerequisites(self.params) problems = self._validate_run_prerequisites(self.params)
if problems: if problems:
logger.info("chat run 前置检查失败 problems=%s", problems)
self.output("执行前检查未通过:") self.output("执行前检查未通过:")
for problem in problems: for problem in problems:
self.output(f"- {problem}") self.output(f"- {problem}")
@ -371,10 +462,12 @@ class InteractiveCliSession:
return return
if not self._confirm_params_and_scope(): if not self._confirm_params_and_scope():
logger.info("chat run 用户取消参数或目标范围确认")
self.output("已取消执行。") self.output("已取消执行。")
return return
if not self._ask_yes_no("即将执行真实 action确认执行请输入 yes: "): if not self._ask_yes_no("即将执行真实 action确认执行请输入 yes: "):
logger.info("chat run 用户取消最终执行确认")
self.output("已取消执行。") self.output("已取消执行。")
return return
@ -385,6 +478,14 @@ class InteractiveCliSession:
target_ips=self.target_ips, target_ips=self.target_ips,
) )
self.graph_runtime = None self.graph_runtime = None
logger.info(
"chat run state 已创建 run_id=%s strategy=%s checkpoint=%s config=%s target_ips=%s",
self.state.run_id,
self.state.execution_strategy,
self.state.checkpoint_path,
self.state.config_path,
self.state.target_ips,
)
self._execute_current_state() self._execute_current_state()
def _confirm_params_and_scope(self) -> bool: def _confirm_params_and_scope(self) -> bool:
@ -404,14 +505,19 @@ class InteractiveCliSession:
checkpoint = Path(self.checkpoint_path) checkpoint = Path(self.checkpoint_path)
if not checkpoint.exists(): if not checkpoint.exists():
self.output("当前没有可续跑的 checkpoint。") self.output("当前没有可续跑的 checkpoint。")
logger.info("chat resume 未找到 checkpoint path=%s", checkpoint)
return return
logger.info("chat resume 从 checkpoint 加载 path=%s", checkpoint)
self.state = load_agent_state(checkpoint) self.state = load_agent_state(checkpoint)
self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint) self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint)
if self.state.paused: if self.state.paused:
logger.info("chat resume 清理暂停状态 run_id=%s reason=%s", self.state.run_id, self.state.pause_reason)
self.state = self.agent.resume_state(self.state) self.state = self.agent.resume_state(self.state)
if self.graph_runtime and self.graph_runtime.waiting_confirmation: if self.graph_runtime and self.graph_runtime.waiting_confirmation:
logger.info("chat resume 停在 LangGraph 确认点 pending=%s", self.state.pending_confirmation if self.state else "")
self._print_confirmation() self._print_confirmation()
return return
logger.info("chat resume 开始执行 run_id=%s checkpoint=%s", self.state.run_id if self.state else "", self.checkpoint_path)
self._execute_current_state() self._execute_current_state()
def _execute_current_state(self) -> None: def _execute_current_state(self) -> None:
@ -419,10 +525,19 @@ class InteractiveCliSession:
if self.state is None: if self.state is None:
self.output("当前没有运行状态。") self.output("当前没有运行状态。")
return return
logger.info(
"chat 开始执行当前 state run_id=%s strategy=%s checkpoint=%s graph_runtime=%s waiting_confirmation=%s",
self.state.run_id,
self.state.execution_strategy,
self.state.checkpoint_path,
type(self.graph_runtime).__name__ if self.graph_runtime else "",
self.graph_runtime.waiting_confirmation if self.graph_runtime else False,
)
if self.graph_runtime is None or not self.graph_runtime.waiting_confirmation: if self.graph_runtime is None or not self.graph_runtime.waiting_confirmation:
try: try:
self.graph_runtime = LangGraphDeploymentRuntime(agent=self.agent) self.graph_runtime = LangGraphDeploymentRuntime(agent=self.agent)
except RuntimeError as exc: except RuntimeError as exc:
logger.exception("chat LangGraph runtime 不可用,降级本地执行 run_id=%s", self.state.run_id)
self.output(f"LangGraph 确认运行器不可用,降级为本地执行: {exc}") self.output(f"LangGraph 确认运行器不可用,降级为本地执行: {exc}")
self.graph_runtime = None self.graph_runtime = None
try: try:
@ -434,6 +549,7 @@ class InteractiveCliSession:
self._handle_execution_error(fallback_exc) self._handle_execution_error(fallback_exc)
return return
self._print_state_report_and_checkpoint() self._print_state_report_and_checkpoint()
logger.info("chat 本地执行完成 run_id=%s checkpoint=%s", self.state.run_id, self.state.checkpoint_path)
return return
try: try:
result = self.graph_runtime.start(self.state) result = self.graph_runtime.start(self.state)
@ -444,14 +560,22 @@ class InteractiveCliSession:
self._handle_execution_error(exc) self._handle_execution_error(exc)
return return
self._apply_graph_result(result) self._apply_graph_result(result)
logger.info(
"chat LangGraph 执行返回 interrupted=%s pending=%s checkpoint=%s",
result.interrupted,
self.state.pending_confirmation if self.state else "",
self.state.checkpoint_path if self.state else self.checkpoint_path,
)
def _prepare_params_for_run(self) -> bool: def _prepare_params_for_run(self) -> bool:
"""执行前归一化参数,确保确认值和实际写入脚本配置一致。""" """执行前归一化参数,确保确认值和实际写入脚本配置一致。"""
try: try:
self.params = self.agent.normalize_params(self.params) self.params = self.agent.normalize_params(self.params)
except ValueError as exc: except ValueError as exc:
logger.exception("chat 参数检查失败 params=%s", json_for_log(self.params))
self.output(f"参数检查失败: {exc}") self.output(f"参数检查失败: {exc}")
return False return False
logger.info("chat 参数检查通过 params=%s", json_for_log(self.params))
return True return True
def _validate_run_prerequisites(self, params: dict[str, Any]) -> list[str]: def _validate_run_prerequisites(self, params: dict[str, Any]) -> list[str]:
@ -473,6 +597,11 @@ class InteractiveCliSession:
def _handle_execution_error(self, exc: Exception) -> None: def _handle_execution_error(self, exc: Exception) -> None:
"""输出 action 执行失败后的可恢复提示,不再误报 LangGraph 不可用。""" """输出 action 执行失败后的可恢复提示,不再误报 LangGraph 不可用。"""
self.output(f"执行已停止: {exc}") self.output(f"执行已停止: {exc}")
logger.exception(
"chat 执行停止 run_id=%s checkpoint=%s",
self.state.run_id if self.state else "",
self.state.checkpoint_path if self.state else self.checkpoint_path,
)
if self.state is None: if self.state is None:
return return
if self.state.last_failed_step: if self.state.last_failed_step:
@ -487,8 +616,10 @@ class InteractiveCliSession:
"""处理执行中的用户中断,并保留断点。""" """处理执行中的用户中断,并保留断点。"""
if self.state is None: if self.state is None:
self.output("执行已中断。") self.output("执行已中断。")
logger.info("chat 执行中断时没有 state")
return return
self.graph_runtime = None self.graph_runtime = None
logger.info("chat 执行被用户中断 run_id=%s checkpoint=%s", self.state.run_id, self.state.checkpoint_path)
self.state = self.agent.pause_state( self.state = self.agent.pause_state(
self.state, self.state,
reason="user_interrupted", reason="user_interrupted",
@ -505,6 +636,13 @@ class InteractiveCliSession:
if self.state is None: if self.state is None:
self.output("当前没有运行状态。") self.output("当前没有运行状态。")
return return
logger.info(
"chat 应用 LangGraph 结果 run_id=%s interrupted=%s pending=%s paused=%s",
self.state.run_id,
result.interrupted,
self.state.pending_confirmation,
self.state.paused,
)
self.output(result.report or self.agent.render_report(self.state)) self.output(result.report or self.agent.render_report(self.state))
if result.interrupted and result.confirmation: if result.interrupted and result.confirmation:
self._print_confirmation_request(result.confirmation) self._print_confirmation_request(result.confirmation)
@ -539,25 +677,42 @@ class InteractiveCliSession:
if self.state is None: if self.state is None:
checkpoint = Path(self.checkpoint_path) checkpoint = Path(self.checkpoint_path)
if checkpoint.exists(): if checkpoint.exists():
logger.info("chat confirm 从 checkpoint 加载 path=%s", checkpoint)
self.state = load_agent_state(checkpoint) self.state = load_agent_state(checkpoint)
self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint) self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint)
else: else:
self.output("当前没有待确认任务。") self.output("当前没有待确认任务。")
logger.info("chat confirm 无 state 且 checkpoint 不存在 path=%s", checkpoint)
return return
if not self.state.pending_confirmation: if not self.state.pending_confirmation:
self.output("当前没有待确认任务。") self.output("当前没有待确认任务。")
logger.info("chat confirm 无待确认事项 run_id=%s", self.state.run_id)
return return
logger.info(
"chat confirm 开始 run_id=%s approved=%s pending=%s note_len=%s",
self.state.run_id,
approved,
self.state.pending_confirmation,
len(note),
)
if self.graph_runtime and self.graph_runtime.waiting_confirmation: if self.graph_runtime and self.graph_runtime.waiting_confirmation:
try: try:
result = self.graph_runtime.resume(approved=approved, note=note) result = self.graph_runtime.resume(approved=approved, note=note)
except RuntimeError as exc: except RuntimeError as exc:
logger.exception("chat LangGraph 确认恢复失败,降级本地确认 run_id=%s", self.state.run_id)
self.output(f"LangGraph 确认恢复失败,降级为本地确认: {exc}") self.output(f"LangGraph 确认恢复失败,降级为本地确认: {exc}")
else: else:
self._apply_graph_result(result) self._apply_graph_result(result)
return return
self.state = self.agent.confirm_pending(self.state, approved=approved, operator_note=note) self.state = self.agent.confirm_pending(self.state, approved=approved, operator_note=note)
logger.info(
"chat confirm 完成 run_id=%s pending=%s paused=%s",
self.state.run_id,
self.state.pending_confirmation,
self.state.paused,
)
self.output(self.agent.render_report(self.state)) self.output(self.agent.render_report(self.state))
self._print_pause_context() self._print_pause_context()
if self.state.pending_confirmation: if self.state.pending_confirmation:
@ -570,11 +725,19 @@ class InteractiveCliSession:
try: try:
self.state = self.agent.update_state_params(self.state, self.params) self.state = self.agent.update_state_params(self.state, self.params)
except ValueError as exc: except ValueError as exc:
logger.exception("chat 参数同步到 state 失败 run_id=%s params=%s", self.state.run_id, json_for_log(self.params))
self.output(f"参数同步到当前任务失败: {exc}") self.output(f"参数同步到当前任务失败: {exc}")
return return
self.params = dict(self.state.params) self.params = dict(self.state.params)
if self.target_ips: if self.target_ips:
self.state.target_ips = list(self.target_ips) self.state.target_ips = list(self.target_ips)
logger.info(
"chat 参数已同步到 state run_id=%s checkpoint=%s params=%s target_ips=%s",
self.state.run_id,
self.state.checkpoint_path,
json_for_log(self.params),
self.target_ips,
)
def _print_pause_context(self) -> None: def _print_pause_context(self) -> None:
"""输出暂停原因和审核建议,避免黑盒暂停。""" """输出暂停原因和审核建议,避免黑盒暂停。"""
@ -631,6 +794,7 @@ class InteractiveCliSession:
elif event_type == "ACTION_REVIEW_FAIL": elif event_type == "ACTION_REVIEW_FAIL":
detail = f": {message}" if message else "" detail = f": {message}" if message else ""
self.output(f"分析失败: {stage}{suffix}{detail}") self.output(f"分析失败: {stage}{suffix}{detail}")
logger.info("chat progress event=%s", json_for_log(payload))
def _print_confirmation(self) -> None: def _print_confirmation(self) -> None:
"""输出当前待人工确认事项。""" """输出当前待人工确认事项。"""
@ -666,6 +830,7 @@ class InteractiveCliSession:
checkpoint = Path(self.checkpoint_path) checkpoint = Path(self.checkpoint_path)
if not checkpoint.exists(): if not checkpoint.exists():
return return
logger.info("chat 启动时自动加载已有 checkpoint path=%s", checkpoint)
self.state = load_agent_state(checkpoint) self.state = load_agent_state(checkpoint)
self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint) self.state.checkpoint_path = self.state.checkpoint_path or str(checkpoint)
self.output(f"已加载 checkpoint: {checkpoint}") self.output(f"已加载 checkpoint: {checkpoint}")
@ -807,6 +972,7 @@ def _build_prompt_input(input_func: InputFunc) -> InputFunc:
"events", "events",
"set", "set",
"llm config", "llm config",
"llm test",
"llm fallback", "llm fallback",
"llm action-analysis on", "llm action-analysis on",
"llm action-analysis off", "llm action-analysis off",

View File

@ -2,14 +2,17 @@
from __future__ import annotations from __future__ import annotations
import logging
from dataclasses import dataclass, field from dataclasses import dataclass, field
from typing import Any, Literal from typing import Any, Literal
from uuid import uuid4 from uuid import uuid4
from .agent import PamDeployAgent from .agent import PamDeployAgent
from .logging_utils import json_for_log
from .models import AgentState from .models import AgentState
GraphFlow = Literal["global", "deploy"] GraphFlow = Literal["global", "deploy"]
logger = logging.getLogger(__name__)
@dataclass(slots=True) @dataclass(slots=True)
@ -39,6 +42,12 @@ class LangGraphDeploymentRuntime:
self.flow = flow self.flow = flow
self._waiting_confirmation = False self._waiting_confirmation = False
self._graph = build_deployment_graph(agent=self.agent, flow=self.flow) self._graph = build_deployment_graph(agent=self.agent, flow=self.flow)
logger.info(
"LangGraph runtime 初始化 thread_id=%s flow=%s agent=%s",
self.thread_id,
self.flow,
type(self.agent).__name__,
)
@property @property
def waiting_confirmation(self) -> bool: def waiting_confirmation(self) -> bool:
@ -48,6 +57,14 @@ class LangGraphDeploymentRuntime:
def start(self, state: AgentState) -> LangGraphRunResult: def start(self, state: AgentState) -> LangGraphRunResult:
"""从给定 AgentState 开始执行,直到结束或遇到人工确认点。""" """从给定 AgentState 开始执行,直到结束或遇到人工确认点。"""
self._waiting_confirmation = False self._waiting_confirmation = False
logger.info(
"LangGraph start run_id=%s thread_id=%s flow=%s paused=%s pending=%s",
state.run_id,
self.thread_id,
self.flow,
state.paused,
state.pending_confirmation,
)
return self._consume(self._graph.stream({"agent_state": state}, self._config())) return self._consume(self._graph.stream({"agent_state": state}, self._config()))
def resume(self, *, approved: bool, note: str = "") -> LangGraphRunResult: def resume(self, *, approved: bool, note: str = "") -> LangGraphRunResult:
@ -58,6 +75,7 @@ class LangGraphDeploymentRuntime:
raise RuntimeError("未安装 langgraph无法恢复 interrupt。") from exc raise RuntimeError("未安装 langgraph无法恢复 interrupt。") from exc
decision = {"approved": approved, "note": note} decision = {"approved": approved, "note": note}
logger.info("LangGraph resume thread_id=%s decision=%s note_len=%s", self.thread_id, approved, len(note))
return self._consume(self._graph.stream(Command(resume=decision), self._config())) return self._consume(self._graph.stream(Command(resume=decision), self._config()))
def _config(self) -> dict[str, Any]: def _config(self) -> dict[str, Any]:
@ -69,9 +87,11 @@ class LangGraphDeploymentRuntime:
result = LangGraphRunResult() result = LangGraphRunResult()
for chunk in chunks: for chunk in chunks:
result.chunks.append(chunk) result.chunks.append(chunk)
logger.info("LangGraph chunk=%s", json_for_log(chunk, max_text_len=1600))
if "__interrupt__" in chunk: if "__interrupt__" in chunk:
result.interrupted = True result.interrupted = True
result.confirmation = _extract_interrupt_value(chunk["__interrupt__"]) result.confirmation = _extract_interrupt_value(chunk["__interrupt__"])
logger.info("LangGraph interrupt thread_id=%s confirmation=%s", self.thread_id, json_for_log(result.confirmation))
continue continue
for value in chunk.values(): for value in chunk.values():
@ -83,11 +103,20 @@ class LangGraphDeploymentRuntime:
result.report = value["report"] result.report = value["report"]
self._waiting_confirmation = result.interrupted self._waiting_confirmation = result.interrupted
logger.info(
"LangGraph consume 完成 thread_id=%s interrupted=%s waiting=%s state_run_id=%s report_len=%s",
self.thread_id,
result.interrupted,
self._waiting_confirmation,
result.state.run_id if result.state else "",
len(result.report),
)
return result return result
def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy"): def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy"):
"""构建 action 级别的 LangGraph 部署图。""" """构建 action 级别的 LangGraph 部署图。"""
logger.info("开始构建 LangGraph 部署图 flow=%s", flow)
try: try:
from langgraph.checkpoint.memory import InMemorySaver from langgraph.checkpoint.memory import InMemorySaver
from langgraph.graph import END, START, StateGraph from langgraph.graph import END, START, StateGraph
@ -97,6 +126,8 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
def entry_node(state: dict[str, Any]) -> dict[str, Any]: def entry_node(state: dict[str, Any]) -> dict[str, Any]:
"""保留入口节点,便于统一路由已有 state 或恢复 state。""" """保留入口节点,便于统一路由已有 state 或恢复 state。"""
agent_state = state["agent_state"]
logger.info("LangGraph entry_node run_id=%s pending=%s paused=%s", agent_state.run_id, agent_state.pending_confirmation, agent_state.paused)
return {"agent_state": state["agent_state"]} return {"agent_state": state["agent_state"]}
def global_action_node(state: dict[str, Any]) -> dict[str, Any]: def global_action_node(state: dict[str, Any]) -> dict[str, Any]:
@ -104,6 +135,7 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
agent_state = state["agent_state"] agent_state = state["agent_state"]
action = agent.next_global_action(agent_state) action = agent.next_global_action(agent_state)
if action: if action:
logger.info("LangGraph global_action_node run_id=%s action=%s", agent_state.run_id, action)
agent.run_global_action(agent_state, action) agent.run_global_action(agent_state, action)
return {"agent_state": agent_state} return {"agent_state": agent_state}
@ -112,8 +144,10 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
agent_state = state["agent_state"] agent_state = state["agent_state"]
work = agent.next_ip_action(agent_state) work = agent.next_ip_action(agent_state)
if work is None: if work is None:
logger.info("LangGraph prepare_ip_node 无待执行 IP action run_id=%s", agent_state.run_id)
return {"agent_state": agent_state, "current_ip": "", "current_ip_action": ""} return {"agent_state": agent_state, "current_ip": "", "current_ip_action": ""}
ip, action = work ip, action = work
logger.info("LangGraph prepare_ip_node run_id=%s ip=%s action=%s", agent_state.run_id, ip, action)
return {"agent_state": agent_state, "current_ip": ip, "current_ip_action": action} return {"agent_state": agent_state, "current_ip": ip, "current_ip_action": action}
def ip_action_node(state: dict[str, Any]) -> dict[str, Any]: def ip_action_node(state: dict[str, Any]) -> dict[str, Any]:
@ -122,6 +156,7 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
ip = str(state.get("current_ip", "")) ip = str(state.get("current_ip", ""))
action = str(state.get("current_ip_action", "")) action = str(state.get("current_ip_action", ""))
if ip and action: if ip and action:
logger.info("LangGraph ip_action_node run_id=%s ip=%s action=%s", agent_state.run_id, ip, action)
agent.run_ip_action(agent_state, ip, action) agent.run_ip_action(agent_state, ip, action)
return {"agent_state": agent_state, "current_ip": "", "current_ip_action": ""} return {"agent_state": agent_state, "current_ip": "", "current_ip_action": ""}
@ -129,8 +164,10 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
"""把确认请求交给 LangGraph interrupt并在恢复后执行确认动作。""" """把确认请求交给 LangGraph interrupt并在恢复后执行确认动作。"""
agent_state = state["agent_state"] agent_state = state["agent_state"]
request = agent.build_confirmation_request(agent_state) request = agent.build_confirmation_request(agent_state)
logger.info("LangGraph confirm_node interrupt run_id=%s request=%s", agent_state.run_id, json_for_log(request))
decision = interrupt(request) decision = interrupt(request)
approved, note = _parse_confirmation_decision(decision) approved, note = _parse_confirmation_decision(decision)
logger.info("LangGraph confirm_node resume run_id=%s approved=%s note_len=%s", agent_state.run_id, approved, len(note))
agent_state = agent.confirm_pending( agent_state = agent.confirm_pending(
agent_state, agent_state,
approved=approved, approved=approved,
@ -140,6 +177,8 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
def report_node(state: dict[str, Any]) -> dict[str, Any]: def report_node(state: dict[str, Any]) -> dict[str, Any]:
"""渲染当前状态报告。""" """渲染当前状态报告。"""
agent_state = state["agent_state"]
logger.info("LangGraph report_node run_id=%s pending=%s paused=%s", agent_state.run_id, agent_state.pending_confirmation, agent_state.paused)
return { return {
"agent_state": state["agent_state"], "agent_state": state["agent_state"],
"report": agent.render_report(state["agent_state"]), "report": agent.render_report(state["agent_state"]),
@ -149,29 +188,39 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
"""从入口决定进入全局、IP、确认或报告节点。""" """从入口决定进入全局、IP、确认或报告节点。"""
agent_state = state["agent_state"] agent_state = state["agent_state"]
if agent_state.pending_confirmation: if agent_state.pending_confirmation:
logger.info("LangGraph route_entry -> confirm run_id=%s", agent_state.run_id)
return "confirm" return "confirm"
if agent.next_global_action(agent_state): if agent.next_global_action(agent_state):
logger.info("LangGraph route_entry -> global_action run_id=%s", agent_state.run_id)
return "global_action" return "global_action"
if flow == "global": if flow == "global":
logger.info("LangGraph route_entry -> report run_id=%s", agent_state.run_id)
return "report" return "report"
logger.info("LangGraph route_entry -> prepare_ip run_id=%s", agent_state.run_id)
return "prepare_ip" return "prepare_ip"
def route_after_global(state: dict[str, Any]) -> str: def route_after_global(state: dict[str, Any]) -> str:
"""全局 action 后继续全局循环或进入 IP 阶段。""" """全局 action 后继续全局循环或进入 IP 阶段。"""
agent_state = state["agent_state"] agent_state = state["agent_state"]
if agent.next_global_action(agent_state): if agent.next_global_action(agent_state):
logger.info("LangGraph route_after_global -> global_action run_id=%s", agent_state.run_id)
return "global_action" return "global_action"
if flow == "global": if flow == "global":
logger.info("LangGraph route_after_global -> report run_id=%s", agent_state.run_id)
return "report" return "report"
logger.info("LangGraph route_after_global -> prepare_ip run_id=%s", agent_state.run_id)
return "prepare_ip" return "prepare_ip"
def route_after_prepare_ip(state: dict[str, Any]) -> str: def route_after_prepare_ip(state: dict[str, Any]) -> str:
"""IP 准备节点后进入确认、单 IP action 或报告。""" """IP 准备节点后进入确认、单 IP action 或报告。"""
agent_state = state["agent_state"] agent_state = state["agent_state"]
if agent_state.pending_confirmation: if agent_state.pending_confirmation:
logger.info("LangGraph route_after_prepare_ip -> confirm run_id=%s", agent_state.run_id)
return "confirm" return "confirm"
if state.get("current_ip_action"): if state.get("current_ip_action"):
logger.info("LangGraph route_after_prepare_ip -> ip_action run_id=%s ip=%s action=%s", agent_state.run_id, state.get("current_ip"), state.get("current_ip_action"))
return "ip_action" return "ip_action"
logger.info("LangGraph route_after_prepare_ip -> report run_id=%s", agent_state.run_id)
return "report" return "report"
graph = StateGraph(dict) graph = StateGraph(dict)
@ -210,7 +259,9 @@ def build_deployment_graph(*, agent: PamDeployAgent, flow: GraphFlow = "deploy")
graph.add_edge("ip_action", "prepare_ip") graph.add_edge("ip_action", "prepare_ip")
graph.add_edge("confirm", "entry") graph.add_edge("confirm", "entry")
graph.add_edge("report", END) graph.add_edge("report", END)
return graph.compile(checkpointer=InMemorySaver()) compiled = graph.compile(checkpointer=InMemorySaver())
logger.info("LangGraph 部署图构建完成 flow=%s", flow)
return compiled
def _extract_interrupt_value(interrupts: Any) -> dict[str, Any]: def _extract_interrupt_value(interrupts: Any) -> dict[str, Any]:

View File

@ -3,11 +3,15 @@
from __future__ import annotations from __future__ import annotations
import os import os
import logging
from pam_deploy_graph.logging_utils import json_for_log
from .base import LlmClient from .base import LlmClient
from .openai_compatible import OpenAICompatibleLlmClient, load_prompt_text from .openai_compatible import OpenAICompatibleLlmClient, load_prompt_text
from .rule_based import RuleBasedLlmClient from .rule_based import RuleBasedLlmClient
logger = logging.getLogger(__name__)
def build_llm_client( def build_llm_client(
*, *,
@ -25,8 +29,24 @@ def build_llm_client(
if action_analysis_prompt_path is not None if action_analysis_prompt_path is not None
else os.getenv("PAM_LLM_ACTION_ANALYSIS_PROMPT_FILE", "") else os.getenv("PAM_LLM_ACTION_ANALYSIS_PROMPT_FILE", "")
) )
logger.info(
"构建 LLM client base_url=%s model=%s has_api_key=%s action_prompt_path=%s explicit=%s",
actual_base_url,
actual_model,
bool(actual_api_key),
actual_action_prompt_path,
json_for_log(
{
"base_url": base_url,
"api_key": api_key,
"model": model,
"action_analysis_prompt_path": action_analysis_prompt_path,
}
),
)
if not actual_base_url and not actual_api_key and not actual_model: if not actual_base_url and not actual_api_key and not actual_model:
logger.info("未配置真实 LLM使用 RuleBasedLlmClient fallback")
return RuleBasedLlmClient() return RuleBasedLlmClient()
missing = [] missing = []
@ -35,11 +55,14 @@ def build_llm_client(
if not actual_model: if not actual_model:
missing.append("model") missing.append("model")
if missing: if missing:
logger.info("LLM 配置不完整 missing=%s", missing)
raise ValueError(f"LLM 配置不完整,缺少: {', '.join(missing)}") raise ValueError(f"LLM 配置不完整,缺少: {', '.join(missing)}")
return OpenAICompatibleLlmClient( client = OpenAICompatibleLlmClient(
base_url=actual_base_url, base_url=actual_base_url,
api_key=actual_api_key, api_key=actual_api_key,
model=actual_model, model=actual_model,
action_analysis_prompt=load_prompt_text(actual_action_prompt_path), action_analysis_prompt=load_prompt_text(actual_action_prompt_path),
) )
logger.info("真实 LLM client 构建完成 client=%s model=%s has_api_key=%s", type(client).__name__, actual_model, bool(actual_api_key))
return client

View File

@ -7,6 +7,8 @@
from __future__ import annotations from __future__ import annotations
import json import json
import logging
import time
from pathlib import Path from pathlib import Path
import urllib.request import urllib.request
from collections.abc import Callable from collections.abc import Callable
@ -20,12 +22,14 @@ from pam_deploy_graph.constants import (
REQUIRED_PARAMS, REQUIRED_PARAMS,
SENSITIVE_KEYS, SENSITIVE_KEYS,
) )
from pam_deploy_graph.logging_utils import json_for_log, redact_for_log
from pam_deploy_graph.models import ExecutionStrategy, LlmDeployPlan, LlmIntentResult, LlmParamResult from pam_deploy_graph.models import ExecutionStrategy, LlmDeployPlan, LlmIntentResult, LlmParamResult
from pam_deploy_graph.models import ActionResult, LlmActionAnalysis from pam_deploy_graph.models import ActionResult, LlmActionAnalysis
from .prompts import ACTION_ANALYSIS_PROMPT, INTENT_PROMPT, PARAM_PROMPT, PLAN_PROMPT, SYSTEM_PROMPT from .prompts import ACTION_ANALYSIS_PROMPT, INTENT_PROMPT, PARAM_PROMPT, PLAN_PROMPT, SYSTEM_PROMPT
JsonTransport = Callable[[str, dict[str, str], dict[str, Any], float], dict[str, Any]] JsonTransport = Callable[[str, dict[str, str], dict[str, Any], float], dict[str, Any]]
logger = logging.getLogger(__name__)
class OpenAICompatibleLlmClient: class OpenAICompatibleLlmClient:
@ -54,10 +58,20 @@ class OpenAICompatibleLlmClient:
self.timeout_sec = timeout_sec self.timeout_sec = timeout_sec
self.temperature = temperature self.temperature = temperature
self.transport = transport or _default_transport self.transport = transport or _default_transport
logger.info(
"OpenAI-compatible LLM client 初始化 base_url=%s endpoint=%s model=%s has_api_key=%s timeout=%s temperature=%s custom_transport=%s",
self.base_url,
_chat_completions_url(self.base_url),
self.model,
bool(self.api_key),
self.timeout_sec,
self.temperature,
transport is not None,
)
def understand_request(self, text: str) -> LlmIntentResult: def understand_request(self, text: str) -> LlmIntentResult:
"""调用 LLM 识别用户意图。""" """调用 LLM 识别用户意图。"""
payload = self._complete_json(INTENT_PROMPT, {"user_text": text}) payload = self._complete_json("understand_request", INTENT_PROMPT, {"user_text": text})
return LlmIntentResult( return LlmIntentResult(
intent=_string(payload, "intent", "deploy"), # type: ignore[arg-type] intent=_string(payload, "intent", "deploy"), # type: ignore[arg-type]
mode_preference=_string(payload, "mode_preference", "未指定"), # type: ignore[arg-type] mode_preference=_string(payload, "mode_preference", "未指定"), # type: ignore[arg-type]
@ -73,6 +87,7 @@ class OpenAICompatibleLlmClient:
original_base = dict(base_params or {}) original_base = dict(base_params or {})
safe_base = _redact_sensitive(original_base) safe_base = _redact_sensitive(original_base)
payload = self._complete_json( payload = self._complete_json(
"extract_params",
PARAM_PROMPT, PARAM_PROMPT,
{ {
"user_text": text, "user_text": text,
@ -110,6 +125,7 @@ class OpenAICompatibleLlmClient:
) -> LlmDeployPlan: ) -> LlmDeployPlan:
"""调用 LLM 生成部署计划。""" """调用 LLM 生成部署计划。"""
payload = self._complete_json( payload = self._complete_json(
"generate_plan",
PLAN_PROMPT, PLAN_PROMPT,
{ {
"params": _redact_sensitive(params), "params": _redact_sensitive(params),
@ -138,6 +154,7 @@ class OpenAICompatibleLlmClient:
) -> LlmActionAnalysis: ) -> LlmActionAnalysis:
"""调用 LLM 分析 action 结果,返回结构化诊断建议。""" """调用 LLM 分析 action 结果,返回结构化诊断建议。"""
payload = self._complete_json( payload = self._complete_json(
"analyze_action_result",
self.action_analysis_prompt, self.action_analysis_prompt,
{ {
"action": action, "action": action,
@ -164,8 +181,10 @@ class OpenAICompatibleLlmClient:
notes=_string_list(payload.get("notes")), notes=_string_list(payload.get("notes")),
) )
def _complete_json(self, instruction: str, input_payload: dict[str, Any]) -> dict[str, Any]: def _complete_json(self, operation: str, instruction: str, input_payload: dict[str, Any]) -> dict[str, Any]:
"""发送 chat/completions 请求,并解析 JSON 对象响应。""" """发送 chat/completions 请求,并解析 JSON 对象响应。"""
started_at = time.perf_counter()
endpoint = _chat_completions_url(self.base_url)
request_payload = { request_payload = {
"model": self.model, "model": self.model,
"temperature": self.temperature, "temperature": self.temperature,
@ -183,16 +202,48 @@ class OpenAICompatibleLlmClient:
headers = {"Content-Type": "application/json"} headers = {"Content-Type": "application/json"}
if self.api_key: if self.api_key:
headers["Authorization"] = f"Bearer {self.api_key}" headers["Authorization"] = f"Bearer {self.api_key}"
logger.info(
"LLM 请求开始 operation=%s endpoint=%s model=%s timeout=%s has_api_key=%s input=%s",
operation,
endpoint,
self.model,
self.timeout_sec,
bool(self.api_key),
json_for_log(input_payload, max_text_len=1600),
)
try:
response = self.transport( response = self.transport(
_chat_completions_url(self.base_url), endpoint,
headers, headers,
request_payload, request_payload,
self.timeout_sec, self.timeout_sec,
) )
content = _message_content(response) content = _message_content(response)
logger.info(
"LLM 原始响应 operation=%s duration_ms=%s content=%s",
operation,
int((time.perf_counter() - started_at) * 1000),
redact_for_log(content, max_text_len=1600),
)
parsed = _loads_json_object(content) parsed = _loads_json_object(content)
if not isinstance(parsed, dict): if not isinstance(parsed, dict):
raise ValueError("LLM 响应必须是 JSON object") raise ValueError("LLM 响应必须是 JSON object")
except Exception:
logger.exception(
"LLM 请求失败 operation=%s endpoint=%s duration_ms=%s input=%s",
operation,
endpoint,
int((time.perf_counter() - started_at) * 1000),
json_for_log(input_payload, max_text_len=1600),
)
raise
logger.info(
"LLM 请求完成 operation=%s duration_ms=%s response_keys=%s response=%s",
operation,
int((time.perf_counter() - started_at) * 1000),
sorted(parsed.keys()),
json_for_log(parsed, max_text_len=1600),
)
return parsed return parsed

View File

@ -6,10 +6,13 @@
from __future__ import annotations from __future__ import annotations
import logging
import re import re
from dataclasses import asdict
from typing import Any from typing import Any
from pam_deploy_graph.constants import GLOBAL_ACTION_SEQUENCE, REQUIRED_PARAMS from pam_deploy_graph.constants import GLOBAL_ACTION_SEQUENCE, REQUIRED_PARAMS
from pam_deploy_graph.logging_utils import json_for_log, redact_for_log
from pam_deploy_graph.models import ( from pam_deploy_graph.models import (
ActionResult, ActionResult,
ExecutionStrategy, ExecutionStrategy,
@ -19,6 +22,8 @@ from pam_deploy_graph.models import (
LlmParamResult, LlmParamResult,
) )
logger = logging.getLogger(__name__)
KEY_ALIASES = { KEY_ALIASES = {
"home_base_url": "HOME_BASE_URL", "home_base_url": "HOME_BASE_URL",
"HOME_BASE_URL": "HOME_BASE_URL", "HOME_BASE_URL": "HOME_BASE_URL",
@ -50,6 +55,7 @@ class RuleBasedLlmClient:
def understand_request(self, text: str) -> LlmIntentResult: def understand_request(self, text: str) -> LlmIntentResult:
"""用关键词规则识别用户意图和执行策略偏好。""" """用关键词规则识别用户意图和执行策略偏好。"""
logger.info("规则 LLM 意图识别开始 text=%s", redact_for_log(text, max_text_len=800))
lowered = text.lower() lowered = text.lower()
reasons: list[str] = [] reasons: list[str] = []
intent = "deploy" intent = "deploy"
@ -82,16 +88,19 @@ class RuleBasedLlmClient:
if intent == "preview": if intent == "preview":
strategy_preference = strategy_preference if strategy_preference != "未指定" else "hybrid_node_mcp" strategy_preference = strategy_preference if strategy_preference != "未指定" else "hybrid_node_mcp"
return LlmIntentResult( result = LlmIntentResult(
intent=intent, # type: ignore[arg-type] intent=intent, # type: ignore[arg-type]
mode_preference=mode_preference, # type: ignore[arg-type] mode_preference=mode_preference, # type: ignore[arg-type]
strategy_preference=strategy_preference, # type: ignore[arg-type] strategy_preference=strategy_preference, # type: ignore[arg-type]
confidence=0.72 if intent != "deploy" else 0.6, confidence=0.72 if intent != "deploy" else 0.6,
reasons=reasons, reasons=reasons,
) )
logger.info("规则 LLM 意图识别完成 result=%s", json_for_log(asdict(result)))
return result
def extract_params(self, text: str, base_params: dict[str, Any] | None = None) -> LlmParamResult: def extract_params(self, text: str, base_params: dict[str, Any] | None = None) -> LlmParamResult:
"""从 key=value、中文短语和 IP 地址中抽取参数。""" """从 key=value、中文短语和 IP 地址中抽取参数。"""
logger.info("规则 LLM 参数抽取开始 text=%s base_params=%s", redact_for_log(text, max_text_len=800), json_for_log(base_params or {}))
params = dict(base_params or {}) params = dict(base_params or {})
params.update(self._extract_key_values(text)) params.update(self._extract_key_values(text))
params.update(self._extract_chinese_patterns(text)) params.update(self._extract_chinese_patterns(text))
@ -103,12 +112,14 @@ class RuleBasedLlmClient:
missing = [key for key in REQUIRED_PARAMS if not params.get(key)] missing = [key for key in REQUIRED_PARAMS if not params.get(key)]
sensitive = [key for key in ("CLIENT_SECRET", "CLIENT_ID") if params.get(key)] sensitive = [key for key in ("CLIENT_SECRET", "CLIENT_ID") if params.get(key)]
return LlmParamResult( result = LlmParamResult(
extracted_params=params, extracted_params=params,
extracted_control=control, extracted_control=control,
missing_required_params=missing, missing_required_params=missing,
sensitive_fields_present=sensitive, sensitive_fields_present=sensitive,
) )
logger.info("规则 LLM 参数抽取完成 result=%s", json_for_log(asdict(result)))
return result
def generate_plan( def generate_plan(
self, self,
@ -118,6 +129,7 @@ class RuleBasedLlmClient:
strategy: ExecutionStrategy, strategy: ExecutionStrategy,
) -> LlmDeployPlan: ) -> LlmDeployPlan:
"""生成确定性的部署计划和风险提示。""" """生成确定性的部署计划和风险提示。"""
logger.info("规则 LLM 计划生成开始 intent=%s strategy=%s params=%s", intent, strategy, json_for_log(params))
if strategy == "hybrid_node_mcp": if strategy == "hybrid_node_mcp":
strategy_text = "PAM_HOME 使用脚本 actionPAM_NODE 使用 MCP" strategy_text = "PAM_HOME 使用脚本 actionPAM_NODE 使用 MCP"
elif strategy == "script_only": elif strategy == "script_only":
@ -139,13 +151,15 @@ class RuleBasedLlmClient:
if strategy == "hybrid_node_mcp": if strategy == "hybrid_node_mcp":
risk_notes.append("PAM_HOME 当前没有 MCP 能力HOME 阶段仍会调用脚本 action。") risk_notes.append("PAM_HOME 当前没有 MCP 能力HOME 阶段仍会调用脚本 action。")
return LlmDeployPlan( result = LlmDeployPlan(
summary=summary, summary=summary,
risk_notes=risk_notes, risk_notes=risk_notes,
planned_actions=list(GLOBAL_ACTION_SEQUENCE), planned_actions=list(GLOBAL_ACTION_SEQUENCE),
requires_confirmation=intent in ("deploy", "query_node_ips", "rollback"), requires_confirmation=intent in ("deploy", "query_node_ips", "rollback"),
execution_strategy=strategy, execution_strategy=strategy,
) )
logger.info("规则 LLM 计划生成完成 result=%s", json_for_log(asdict(result)))
return result
def analyze_action_result( def analyze_action_result(
self, self,
@ -155,6 +169,23 @@ class RuleBasedLlmClient:
state_summary: dict[str, Any], state_summary: dict[str, Any],
) -> LlmActionAnalysis: ) -> LlmActionAnalysis:
"""用本地规则分析 action 结果,作为真实 LLM 不可用时的兜底。""" """用本地规则分析 action 结果,作为真实 LLM 不可用时的兜底。"""
logger.info(
"规则 LLM action 审核开始 action=%s result=%s state_summary=%s",
action,
json_for_log(
{
"backend": result.backend,
"ok": result.ok,
"exit_code": result.exit_code,
"tool_name": result.tool_name,
"values": result.values,
"stderr": result.stderr,
"error_summary": result.error_summary,
},
max_text_len=1000,
),
json_for_log(state_summary),
)
notes: list[str] = [] notes: list[str] = []
has_anomaly = not result.ok has_anomaly = not result.ok
severity = "info" severity = "info"
@ -197,7 +228,7 @@ class RuleBasedLlmClient:
notes.append("action 返回待人工确认标记。") notes.append("action 返回待人工确认标记。")
should_continue = False should_continue = False
return LlmActionAnalysis( analysis = LlmActionAnalysis(
action=action, action=action,
has_anomaly=has_anomaly, has_anomaly=has_anomaly,
severity=severity, # type: ignore[arg-type] severity=severity, # type: ignore[arg-type]
@ -207,6 +238,8 @@ class RuleBasedLlmClient:
should_continue=should_continue, should_continue=should_continue,
notes=notes, notes=notes,
) )
logger.info("规则 LLM action 审核完成 analysis=%s", json_for_log(asdict(analysis)))
return analysis
def _extract_key_values(self, text: str) -> dict[str, str]: def _extract_key_values(self, text: str) -> dict[str, str]:
"""抽取 KEY=VALUE 形式的参数。""" """抽取 KEY=VALUE 形式的参数。"""

View File

@ -0,0 +1,116 @@
"""Agent 运行日志配置和脱敏工具。"""
from __future__ import annotations
import json
import logging
import os
import re
from dataclasses import asdict, is_dataclass
from pathlib import Path
from typing import Any
from .constants import SENSITIVE_KEYS
DEFAULT_LOG_FILE = Path("logs") / "pam_deploy_agent.log"
LOG_FILE_ENV = "PAM_AGENT_LOG_FILE"
LOG_LEVEL_ENV = "PAM_AGENT_LOG_LEVEL"
_HANDLER_MARKER = "_pam_deploy_agent_handler"
_SENSITIVE_NAME_PARTS = ("secret", "token", "authorization", "api_key", "apikey", "password")
_ASSIGNMENT_PATTERN = re.compile(
r"(?i)\b(client_secret|mcp_client_secret|api_key|pam_llm_api_key|token|access_token|authorization|password)\b"
r"\s*([:=])\s*([^\s,;]+)"
)
_AUTH_BEARER_ASSIGNMENT_PATTERN = re.compile(r"(?i)\b(authorization)\b\s*([:=])\s*bearer\s+[^\s,;]+")
_BEARER_PATTERN = re.compile(r"(?i)(bearer\s+)[A-Za-z0-9._~+\-/=]+")
def configure_logging(
log_file: str | Path | None = None,
level: str | int | None = None,
) -> Path:
"""配置 Agent 文件日志;重复调用不会重复添加 handler。"""
actual_path = Path(log_file or os.getenv(LOG_FILE_ENV) or DEFAULT_LOG_FILE)
actual_path.parent.mkdir(parents=True, exist_ok=True)
actual_level = _resolve_level(level or os.getenv(LOG_LEVEL_ENV) or "INFO")
package_logger = logging.getLogger("pam_deploy_graph")
package_logger.setLevel(actual_level)
package_logger.propagate = False
marker = str(actual_path.resolve())
for handler in package_logger.handlers:
if getattr(handler, _HANDLER_MARKER, "") == marker:
handler.setLevel(actual_level)
return actual_path
handler = logging.FileHandler(actual_path, encoding="utf-8")
setattr(handler, _HANDLER_MARKER, marker)
handler.setLevel(actual_level)
handler.setFormatter(
logging.Formatter(
fmt="%(asctime)s %(levelname)s [%(name)s] %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
)
package_logger.addHandler(handler)
package_logger.info("日志已初始化 path=%s level=%s", actual_path, logging.getLevelName(actual_level))
return actual_path
def redact_for_log(value: Any, *, max_text_len: int = 1200) -> Any:
"""递归脱敏并截断日志对象避免把密钥、token 或完整长文本写入日志。"""
if is_dataclass(value) and not isinstance(value, type):
return redact_for_log(asdict(value), max_text_len=max_text_len)
if isinstance(value, dict):
redacted: dict[str, Any] = {}
for key, item in value.items():
text_key = str(key)
if _is_sensitive_key(text_key):
redacted[text_key] = "***"
else:
redacted[text_key] = redact_for_log(item, max_text_len=max_text_len)
return redacted
if isinstance(value, (list, tuple, set)):
return [redact_for_log(item, max_text_len=max_text_len) for item in value]
if isinstance(value, str):
return _truncate(_redact_string(value), max_text_len)
if value is None or isinstance(value, (bool, int, float)):
return value
return _truncate(_redact_string(str(value)), max_text_len)
def json_for_log(value: Any, *, max_text_len: int = 1200) -> str:
"""把对象脱敏后序列化成适合单行日志的 JSON 文本。"""
redacted = redact_for_log(value, max_text_len=max_text_len)
return json.dumps(redacted, ensure_ascii=False, default=str, sort_keys=True)
def _resolve_level(value: str | int) -> int:
"""解析日志级别字符串,非法值降级为 INFO。"""
if isinstance(value, int):
return value
resolved = getattr(logging, str(value).upper(), logging.INFO)
return resolved if isinstance(resolved, int) else logging.INFO
def _is_sensitive_key(key: str) -> bool:
"""判断字段名是否应脱敏。"""
if key in SENSITIVE_KEYS:
return True
normalized = key.lower().replace("-", "_")
return any(part in normalized for part in _SENSITIVE_NAME_PARTS)
def _truncate(value: str, limit: int) -> str:
"""截断过长字符串。"""
if len(value) <= limit:
return value
return value[:limit] + "...[已截断]"
def _redact_string(value: str) -> str:
"""脱敏字符串中的常见 KEY=VALUE 和 Bearer token 片段。"""
value = _AUTH_BEARER_ASSIGNMENT_PATTERN.sub(lambda match: f"{match.group(1)}{match.group(2)}***", value)
value = _ASSIGNMENT_PATTERN.sub(lambda match: f"{match.group(1)}{match.group(2)}***", value)
return _BEARER_PATTERN.sub(lambda match: f"{match.group(1)}***", value)

View File

@ -7,6 +7,7 @@ callable 或 SDK session 适配成这个接口,避免业务代码绑定具体
from __future__ import annotations from __future__ import annotations
import json import json
import logging
import time import time
import urllib.parse import urllib.parse
import urllib.request import urllib.request
@ -16,6 +17,10 @@ from dataclasses import dataclass, field
from pathlib import Path from pathlib import Path
from typing import Any from typing import Any
from .logging_utils import json_for_log
logger = logging.getLogger(__name__)
@dataclass(frozen=True) @dataclass(frozen=True)
class McpAuthConfig: class McpAuthConfig:
@ -111,10 +116,21 @@ class McpClientConfig:
def load_mcp_client_config(path: str | Path) -> McpClientConfig: def load_mcp_client_config(path: str | Path) -> McpClientConfig:
"""读取 MCP client JSON 配置文件。""" """读取 MCP client JSON 配置文件。"""
logger.info("读取 MCP client 配置 path=%s", path)
payload = json.loads(Path(path).read_text(encoding="utf-8")) payload = json.loads(Path(path).read_text(encoding="utf-8"))
if not isinstance(payload, dict): if not isinstance(payload, dict):
raise ValueError("MCP client 配置必须是 JSON object") raise ValueError("MCP client 配置必须是 JSON object")
return McpClientConfig.from_mapping(payload) config = McpClientConfig.from_mapping(payload)
logger.info(
"MCP client 配置读取完成 path=%s transport=%s server_url=%s command=%s has_auth=%s tool_names=%s",
path,
config.transport,
config.server_url,
config.command,
config.auth is not None,
json_for_log(config.tool_names),
)
return config
class FunctionMcpToolClient: class FunctionMcpToolClient:
@ -126,6 +142,7 @@ class FunctionMcpToolClient:
def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any: def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
"""调用底层函数并返回原始结果。""" """调用底层函数并返回原始结果。"""
logger.info("Function MCP tool 调用 tool=%s arguments=%s", tool_name, json_for_log(arguments))
return self.caller(tool_name, arguments) return self.caller(tool_name, arguments)
@ -147,13 +164,19 @@ class SessionMcpToolClient:
def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any: def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
"""调用 SDK session并把 SDK 返回值归一化。""" """调用 SDK session并把 SDK 返回值归一化。"""
logger.info("Session MCP tool 调用开始 tool=%s arguments=%s", tool_name, json_for_log(arguments))
result = self.session.call_tool(tool_name, arguments) result = self.session.call_tool(tool_name, arguments)
return normalize_mcp_sdk_result(result) normalized = normalize_mcp_sdk_result(result)
logger.info("Session MCP tool 调用完成 tool=%s result=%s", tool_name, json_for_log(normalized, max_text_len=1600))
return normalized
def list_tools(self) -> list[str]: def list_tools(self) -> list[str]:
"""从 SDK session 获取 tool 名称列表。""" """从 SDK session 获取 tool 名称列表。"""
logger.info("Session MCP list_tools 开始")
result = self.session.list_tools() result = self.session.list_tools()
return normalize_mcp_tool_list(result) tools = normalize_mcp_tool_list(result)
logger.info("Session MCP list_tools 完成 tools=%s", tools)
return tools
class StdioMcpToolClient: class StdioMcpToolClient:
@ -176,9 +199,19 @@ class StdioMcpToolClient:
self.env = env self.env = env
self.cwd = cwd or None self.cwd = cwd or None
self.timeout_seconds = timeout_seconds self.timeout_seconds = timeout_seconds
logger.info(
"stdio MCP client 初始化 command=%s args=%s cwd=%s env_keys=%s timeout=%s",
self.command,
self.args,
self.cwd or "",
sorted((self.env or {}).keys()),
self.timeout_seconds,
)
def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any: def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
"""创建一次 MCP stdio session调用 tool 后关闭 session。""" """创建一次 MCP stdio session调用 tool 后关闭 session。"""
started_at = time.perf_counter()
logger.info("stdio MCP tool 调用开始 tool=%s arguments=%s", tool_name, json_for_log(arguments))
try: try:
import anyio import anyio
from mcp import ClientSession from mcp import ClientSession
@ -203,10 +236,23 @@ class StdioMcpToolClient:
) )
return normalize_mcp_sdk_result(result) return normalize_mcp_sdk_result(result)
return anyio.run(call_once) try:
result = anyio.run(call_once)
except Exception:
logger.exception("stdio MCP tool 调用失败 tool=%s duration_ms=%s", tool_name, int((time.perf_counter() - started_at) * 1000))
raise
logger.info(
"stdio MCP tool 调用完成 tool=%s duration_ms=%s result=%s",
tool_name,
int((time.perf_counter() - started_at) * 1000),
json_for_log(result, max_text_len=1600),
)
return result
def list_tools(self) -> list[str]: def list_tools(self) -> list[str]:
"""创建一次 MCP stdio session读取 server 暴露的 tool 列表。""" """创建一次 MCP stdio session读取 server 暴露的 tool 列表。"""
started_at = time.perf_counter()
logger.info("stdio MCP list_tools 开始")
try: try:
import anyio import anyio
from mcp import ClientSession from mcp import ClientSession
@ -227,7 +273,13 @@ class StdioMcpToolClient:
result = await session.list_tools() result = await session.list_tools()
return normalize_mcp_tool_list(result) return normalize_mcp_tool_list(result)
return anyio.run(list_once) try:
tools = anyio.run(list_once)
except Exception:
logger.exception("stdio MCP list_tools 失败 duration_ms=%s", int((time.perf_counter() - started_at) * 1000))
raise
logger.info("stdio MCP list_tools 完成 duration_ms=%s tools=%s", int((time.perf_counter() - started_at) * 1000), tools)
return tools
class OAuthTokenProvider: class OAuthTokenProvider:
@ -243,6 +295,12 @@ class OAuthTokenProvider:
self.timeout_seconds = timeout_seconds self.timeout_seconds = timeout_seconds
self._token = "" self._token = ""
self._expires_at = 0.0 self._expires_at = 0.0
logger.info(
"MCP OAuth token provider 初始化 token_url=%s client_id=%s timeout=%s",
self.config.token_url,
self.config.client_id,
self.timeout_seconds,
)
def authorization_headers(self) -> dict[str, str]: def authorization_headers(self) -> dict[str, str]:
"""返回带 token 的请求头。""" """返回带 token 的请求头。"""
@ -255,7 +313,9 @@ class OAuthTokenProvider:
"""获取可用 token未过期时复用缓存。""" """获取可用 token未过期时复用缓存。"""
now = time.time() now = time.time()
if self._token and now < self._expires_at: if self._token and now < self._expires_at:
logger.info("MCP auth token 使用缓存 expires_in_sec=%s", int(self._expires_at - now))
return self._token return self._token
logger.info("MCP auth token 开始刷新 token_url=%s client_id=%s", self.config.token_url, self.config.client_id)
payload = { payload = {
"grant_type": self.config.grant_type, "grant_type": self.config.grant_type,
"client_id": self.config.client_id, "client_id": self.config.client_id,
@ -278,6 +338,7 @@ class OAuthTokenProvider:
expires_in = _safe_float(result.get(self.config.expires_in_field), 3600) expires_in = _safe_float(result.get(self.config.expires_in_field), 3600)
self._token = token self._token = token
self._expires_at = now + max(expires_in - 60, 1) self._expires_at = now + max(expires_in - 60, 1)
logger.info("MCP auth token 刷新完成 expires_in=%s cached_until=%s", expires_in, int(self._expires_at))
return token return token
@ -305,14 +366,23 @@ class HttpMcpToolClient:
self.auth_provider = auth_provider self.auth_provider = auth_provider
self.timeout_seconds = timeout_seconds self.timeout_seconds = timeout_seconds
self.sse_read_timeout_seconds = sse_read_timeout_seconds self.sse_read_timeout_seconds = sse_read_timeout_seconds
logger.info(
"HTTP MCP client 初始化 url=%s transport=%s has_auth=%s headers=%s timeout=%s sse_read_timeout=%s",
self.url,
self.transport,
self.auth_provider is not None,
json_for_log(self.headers),
self.timeout_seconds,
self.sse_read_timeout_seconds,
)
def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any: def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any:
"""连接 MCP server调用 tool 后关闭 session。""" """连接 MCP server调用 tool 后关闭 session。"""
return self._run_session(lambda session: session.call_tool(tool_name, arguments)) return self._run_session(lambda session: session.call_tool(tool_name, arguments), operation_name=f"call_tool:{tool_name}", arguments=arguments)
def list_tools(self) -> list[str]: def list_tools(self) -> list[str]:
"""连接 MCP server读取 server 暴露的 tool 名称。""" """连接 MCP server读取 server 暴露的 tool 名称。"""
return self._run_session(lambda session: session.list_tools(), normalize_tools=True) return self._run_session(lambda session: session.list_tools(), normalize_tools=True, operation_name="list_tools")
def _build_headers(self) -> dict[str, str]: def _build_headers(self) -> dict[str, str]:
"""合并静态 headers 和动态鉴权 token。""" """合并静态 headers 和动态鉴权 token。"""
@ -321,8 +391,23 @@ class HttpMcpToolClient:
headers.update(self.auth_provider.authorization_headers()) headers.update(self.auth_provider.authorization_headers())
return headers return headers
def _run_session(self, operation: Callable[[Any], Any], *, normalize_tools: bool = False) -> Any: def _run_session(
self,
operation: Callable[[Any], Any],
*,
normalize_tools: bool = False,
operation_name: str = "operation",
arguments: dict[str, Any] | None = None,
) -> Any:
"""创建一次 HTTP/SSE MCP session 并执行指定操作。""" """创建一次 HTTP/SSE MCP session 并执行指定操作。"""
started_at = time.perf_counter()
logger.info(
"HTTP MCP session 开始 operation=%s url=%s transport=%s arguments=%s",
operation_name,
self.url,
self.transport,
json_for_log(arguments or {}),
)
try: try:
import anyio import anyio
from mcp import ClientSession from mcp import ClientSession
@ -357,7 +442,24 @@ class HttpMcpToolClient:
result = await operation(session) result = await operation(session)
return normalize_mcp_tool_list(result) if normalize_tools else normalize_mcp_sdk_result(result) return normalize_mcp_tool_list(result) if normalize_tools else normalize_mcp_sdk_result(result)
return anyio.run(call_once) try:
result = anyio.run(call_once)
except Exception:
logger.exception(
"HTTP MCP session 失败 operation=%s url=%s transport=%s duration_ms=%s",
operation_name,
self.url,
self.transport,
int((time.perf_counter() - started_at) * 1000),
)
raise
logger.info(
"HTTP MCP session 完成 operation=%s duration_ms=%s result=%s",
operation_name,
int((time.perf_counter() - started_at) * 1000),
json_for_log(result, max_text_len=1600),
)
return result
def normalize_mcp_sdk_result(result: Any) -> Any: def normalize_mcp_sdk_result(result: Any) -> Any:

View File

@ -2,8 +2,10 @@
from __future__ import annotations from __future__ import annotations
import logging
from pathlib import Path from pathlib import Path
from .logging_utils import json_for_log
from .mcp_client import ( from .mcp_client import (
HttpMcpToolClient, HttpMcpToolClient,
McpClientConfig, McpClientConfig,
@ -13,16 +15,36 @@ from .mcp_client import (
) )
from .mcp_runner import McpActionRunner from .mcp_runner import McpActionRunner
logger = logging.getLogger(__name__)
def build_mcp_runner_from_config(path: str | Path) -> McpActionRunner: def build_mcp_runner_from_config(path: str | Path) -> McpActionRunner:
"""读取 MCP 配置文件,并构造可直接给 Agent 使用的 runner。""" """读取 MCP 配置文件,并构造可直接给 Agent 使用的 runner。"""
logger.info("开始构建 MCP runner config_path=%s", path)
config = load_mcp_client_config(path) config = load_mcp_client_config(path)
client = build_mcp_client(config) client = build_mcp_client(config)
return McpActionRunner(client=client, tool_names=config.tool_names or None) runner = McpActionRunner(client=client, tool_names=config.tool_names or None)
logger.info(
"MCP runner 构建完成 config_path=%s transport=%s server_url=%s client=%s tool_names=%s",
path,
config.transport,
config.server_url,
type(client).__name__,
json_for_log(config.tool_names),
)
return runner
def build_mcp_client(config: McpClientConfig): def build_mcp_client(config: McpClientConfig):
"""根据 transport 类型创建 MCP client。""" """根据 transport 类型创建 MCP client。"""
logger.info(
"开始构建 MCP client transport=%s server_url=%s command=%s has_auth=%s headers=%s",
config.transport,
config.server_url,
config.command,
config.auth is not None,
json_for_log(config.headers),
)
if config.transport == "stdio": if config.transport == "stdio":
return StdioMcpToolClient( return StdioMcpToolClient(
command=config.command, command=config.command,

View File

@ -2,11 +2,15 @@
from __future__ import annotations from __future__ import annotations
import logging
from typing import Any, Protocol from typing import Any, Protocol
from .logging_utils import json_for_log
from .models import ActionResult from .models import ActionResult
from .output_parser import parse_mcp_result from .output_parser import parse_mcp_result
logger = logging.getLogger(__name__)
class McpToolClient(Protocol): class McpToolClient(Protocol):
"""MCP 工具客户端需要实现的最小同步接口。""" """MCP 工具客户端需要实现的最小同步接口。"""
@ -46,6 +50,11 @@ class McpActionRunner:
self.client = client self.client = client
self.tool_names = tool_names or {} self.tool_names = tool_names or {}
self._discovered_tools: list[str] | None = None self._discovered_tools: list[str] | None = None
logger.info(
"MCP action runner 初始化 client=%s explicit_tool_names=%s",
type(client).__name__ if client else "",
json_for_log(self.tool_names),
)
def run( def run(
self, self,
@ -70,16 +79,34 @@ class McpActionRunner:
node_url=node_url, node_url=node_url,
stop_first=stop_first, stop_first=stop_first,
) )
logger.info(
"MCP action 调用开始 action=%s tool=%s arguments=%s",
action,
tool_name,
json_for_log(arguments),
)
try: try:
payload = self.client.call_tool(tool_name, arguments) payload = self.client.call_tool(tool_name, arguments)
except Exception as exc: # pragma: no cover - 防御性异常包装 except Exception as exc: # pragma: no cover - 防御性异常包装
logger.exception("MCP action 调用异常 action=%s tool=%s", action, tool_name)
return parse_mcp_result(action, {}, ok=False, tool_name=tool_name, error=str(exc)) return parse_mcp_result(action, {}, ok=False, tool_name=tool_name, error=str(exc))
return parse_mcp_result(action, payload, ok=True, tool_name=tool_name) logger.info("MCP action 原始返回 action=%s tool=%s payload=%s", action, tool_name, json_for_log(payload, max_text_len=1600))
result = parse_mcp_result(action, payload, ok=True, tool_name=tool_name)
logger.info(
"MCP action 解析完成 action=%s tool=%s ok=%s values=%s error=%s",
action,
tool_name,
result.ok,
json_for_log(result.values),
result.error_summary,
)
return result
def _resolve_tool_name(self, action: str) -> str: def _resolve_tool_name(self, action: str) -> str:
"""根据显式映射、server tools 自动发现和默认约定解析 tool name。""" """根据显式映射、server tools 自动发现和默认约定解析 tool name。"""
explicit = self.tool_names.get(action) explicit = self.tool_names.get(action)
if explicit: if explicit:
logger.info("MCP tool 使用显式映射 action=%s tool=%s", action, explicit)
return explicit return explicit
discovered = self._list_discovered_tools() discovered = self._list_discovered_tools()
@ -89,12 +116,14 @@ class McpActionRunner:
for candidate in candidates: for candidate in candidates:
matched = by_lower.get(candidate.lower()) matched = by_lower.get(candidate.lower())
if matched: if matched:
logger.info("MCP tool 自动匹配 action=%s tool=%s candidates=%s", action, matched, candidates)
return matched return matched
available = ", ".join(discovered) available = ", ".join(discovered)
raise ValueError(f"MCP server 未发现 action 对应 tool: {action}; 已发现: {available}") raise ValueError(f"MCP server 未发现 action 对应 tool: {action}; 已发现: {available}")
fallback = DEFAULT_NODE_MCP_TOOLS.get(action) fallback = DEFAULT_NODE_MCP_TOOLS.get(action)
if fallback: if fallback:
logger.info("MCP tool 使用默认约定 action=%s tool=%s", action, fallback)
return fallback return fallback
raise ValueError(f"action 未映射 MCP tool: {action}") raise ValueError(f"action 未映射 MCP tool: {action}")
@ -108,7 +137,9 @@ class McpActionRunner:
try: try:
self._discovered_tools = list(self.client.list_tools()) self._discovered_tools = list(self.client.list_tools())
except Exception: except Exception:
logger.exception("MCP tool 自动发现失败,使用默认 tool name 约定")
self._discovered_tools = [] self._discovered_tools = []
logger.info("MCP tool 自动发现完成 tools=%s", self._discovered_tools)
return self._discovered_tools return self._discovered_tools
def _build_arguments( def _build_arguments(

View File

@ -2,13 +2,18 @@
from __future__ import annotations from __future__ import annotations
import logging
import subprocess import subprocess
import time
from pathlib import Path from pathlib import Path
from typing import Any from typing import Any
from .logging_utils import json_for_log
from .models import ActionResult from .models import ActionResult
from .output_parser import parse_script_result from .output_parser import parse_script_result
logger = logging.getLogger(__name__)
class ScriptActionRunner: class ScriptActionRunner:
"""脚本 action runner负责构造命令、执行脚本并解析结果。""" """脚本 action runner负责构造命令、执行脚本并解析结果。"""
@ -40,6 +45,18 @@ class ScriptActionRunner:
stop_first=stop_first, stop_first=stop_first,
trace_file_path=trace_file_path, trace_file_path=trace_file_path,
) )
started_at = time.perf_counter()
logger.info(
"脚本 action 开始 action=%s command=%s cwd=%s config=%s ip=%s trace=%s timeout=%s",
action,
json_for_log(command),
self.script_base_dir,
config_path,
ip or "",
trace_file_path or "",
timeout_sec,
)
try:
completed = subprocess.run( completed = subprocess.run(
command, command,
cwd=str(self.script_base_dir), cwd=str(self.script_base_dir),
@ -48,7 +65,19 @@ class ScriptActionRunner:
timeout=timeout_sec, timeout=timeout_sec,
check=False, check=False,
) )
return parse_script_result( except Exception:
logger.exception("脚本 action 执行异常 action=%s command=%s cwd=%s", action, json_for_log(command), self.script_base_dir)
raise
duration_ms = int((time.perf_counter() - started_at) * 1000)
logger.info(
"脚本 action 结束 action=%s exit_code=%s duration_ms=%s stdout=%s stderr=%s",
action,
completed.returncode,
duration_ms,
json_for_log(completed.stdout, max_text_len=1200),
json_for_log(completed.stderr, max_text_len=1200),
)
result = parse_script_result(
action=action, action=action,
stdout=completed.stdout, stdout=completed.stdout,
stderr=completed.stderr, stderr=completed.stderr,
@ -56,6 +85,14 @@ class ScriptActionRunner:
backend="script", backend="script",
tool_name=script_entry, tool_name=script_entry,
) )
logger.info(
"脚本 action 解析完成 action=%s ok=%s values=%s error=%s",
action,
result.ok,
json_for_log(result.values),
result.error_summary,
)
return result
def build_command( def build_command(
self, self,

View File

@ -6,7 +6,7 @@ import pytest
from pam_deploy_graph.agent import PamDeployAgent from pam_deploy_graph.agent import PamDeployAgent
from pam_deploy_graph.fake_runner import FakeActionRunner from pam_deploy_graph.fake_runner import FakeActionRunner
from pam_deploy_graph.interactive import InteractiveCliSession, _build_prompt_input from pam_deploy_graph.interactive import InteractiveCliSession, _build_prompt_input
from pam_deploy_graph.models import LlmActionAnalysis from pam_deploy_graph.models import LlmActionAnalysis, LlmIntentResult
PARAMS = { PARAMS = {
@ -35,6 +35,30 @@ class BlockingReviewLlmClient:
) )
class FakeTestableLlmClient:
def __init__(self) -> None:
self.requests: list[str] = []
def understand_request(self, text: str) -> LlmIntentResult:
self.requests.append(text)
return LlmIntentResult(
intent="deploy",
mode_preference="MCP",
strategy_preference="hybrid_node_mcp",
confidence=0.91,
reasons=["test ok"],
)
def extract_params(self, text, base_params=None):
raise AssertionError("llm test should only call understand_request")
def generate_plan(self, *, params, intent, strategy):
raise AssertionError("llm test should only call understand_request")
def analyze_action_result(self, *, action, result, state_summary):
return LlmActionAnalysis(action=action)
def run_session(session: InteractiveCliSession, inputs: list[str]) -> list[str]: def run_session(session: InteractiveCliSession, inputs: list[str]) -> list[str]:
output: list[str] = [] output: list[str] = []
iterator = iter(inputs) iterator = iter(inputs)
@ -260,6 +284,24 @@ def test_chat_llm_review_block_message_is_visible(tmp_path: Path):
assert any("如需继续,输入 resume" in item for item in output) assert any("如需继续,输入 resume" in item for item in output)
def test_chat_llm_test_command_uses_current_client(tmp_path: Path):
llm = FakeTestableLlmClient()
session = InteractiveCliSession(
agent=PamDeployAgent(llm_client=llm),
params=PARAMS,
strategy="fake",
checkpoint_path=str(tmp_path / "checkpoint.json"),
)
output = run_session(session, ["llm test 检查模型", "exit"])
assert llm.requests == ["检查模型"]
assert any("正在测试 LLM: FakeTestableLlmClient" in item for item in output)
assert any("LLM 测试通过" in item for item in output)
assert any("- intent: deploy" in item for item in output)
assert any("- strategy: hybrid_node_mcp" in item for item in output)
def test_chat_can_hot_load_mcp_config(tmp_path: Path): def test_chat_can_hot_load_mcp_config(tmp_path: Path):
mcp_config = tmp_path / "mcp.json" mcp_config = tmp_path / "mcp.json"
mcp_config.write_text('{"transport": "stdio", "command": "python"}', encoding="utf-8") mcp_config.write_text('{"transport": "stdio", "command": "python"}', encoding="utf-8")

View File

@ -0,0 +1,28 @@
from pam_deploy_graph.logging_utils import json_for_log, redact_for_log
def test_redact_for_log_masks_sensitive_keys_and_inline_assignments():
payload = {
"CLIENT_SECRET": "home-secret",
"api_key": "llm-key",
"nested": {
"Authorization": "Bearer token-value",
"message": "CLIENT_SECRET=abc api_key:xyz Authorization=Bearer raw-token header Bearer plain-token",
},
}
redacted = redact_for_log(payload)
serialized = json_for_log(payload)
assert redacted["CLIENT_SECRET"] == "***"
assert redacted["api_key"] == "***"
assert redacted["nested"]["Authorization"] == "***"
assert "home-secret" not in serialized
assert "llm-key" not in serialized
assert "token-value" not in serialized
assert "CLIENT_SECRET=***" in serialized
assert "api_key:***" in serialized
assert "Authorization=***" in serialized
assert "Bearer ***" in serialized
assert "raw-token" not in serialized
assert "plain-token" not in serialized