Paper: LANGUAGE MODELS CAN LEARN FROM VERBAL FEEDBACK WITHOUT SCALAR REWARDS
Original paper: https://arxiv.org/pdf/2509.22638
Official repo: https://github.com/sail-sg/feedback-conditional-policy
Main changes:
- After the critique is generated, the prompt is replaced (the critique is prepended to the question).
- The reward computation uses a fixed value (the 0-10 score returned by the critique model; see below).
- The advantage computation is modified: no mean/variance normalization, the advantage is simply the reward.
Main code location: verl/recipe/fcp, plus a few modifications inside the original verl code, detailed later.
Training script: verl/recipe/fcp/run_fcp.sh
Entry point: verl/recipe/fcp/main_fcp.py
Training pipeline: verl/recipe/fcp/fcp_ray_trainer.py
Reward && critique: verl/verl/workers/reward_manager/gpt_critique_math_score.py
Config file: verl/recipe/fcp/config/fcp_trainer.yaml
Advantage computation: verl/verl/trainer/ppo/core_algos.py
FCP overall pipeline (verl/recipe/fcp/fcp_ray_trainer.py):
- Fetch a batch of data and roll out.
- Call the reward stage to obtain the score and the critique: FCP folds the whole critique step into the reward manager, and both are returned to the main loop together.
- Build SFT data from critique + prompt + rollout.
- Compute advantages on the SFT data: every token's advantage is set to the reward, which makes the update equivalent to an SFT loss (a sketch of this follows the list).
- Gradient update.
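A sketch of the "equivalent to an SFT loss" point (notation is mine, not taken from the paper): with the advantage fixed to the raw score and neither baseline subtraction nor mean/std normalization, the update on one critique-conditioned sample is a score-weighted negative log-likelihood of the rollout, where x is the question, y the rollout, c the critique built from (x, y), and r(x, y) the 0-10 score:

% score-weighted SFT objective implied by "advantage = reward" (illustrative notation)
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y,\,c)}\Big[\, r(x, y) \sum_{t=1}^{|y|} \log \pi_\theta\big(y_t \mid c,\, x,\, y_{<t}\big) \Big]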
Reward walkthrough (verl/verl/workers/reward_manager/gpt_critique_math_score.py):
- Strip the critique tags from the prompt (the code below splits on "</EF>"): online generation is conditioned on a positive critique c+, so the clean question text has to be recovered first.
- Call the GPT API with prompt + rollout + ground_truth to generate a critique and a score (0-10).
- The reward is set to the 0-10 score: the paper does not spell this out, but it can be read as a per-sample weight in the SFT loss (cf. the weighted-SFT sketch above).
- The returned data includes the reward tensor (its layout is sketched right below), the critiques, and so on.
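A minimal sketch of that reward tensor layout, assuming (as the comment inside compute_fcp_outcome_advantage further down states) that the scalar score is written only at the last valid response token; build_reward_tensor is a hypothetical helper, not the actual reward-manager code:

import torch

# Hypothetical helper: place the 0-10 score on the last valid token of each
# response and leave every other position at 0 (an assumption mirroring the
# comment in compute_fcp_outcome_advantage, not the real implementation).
def build_reward_tensor(scores: list[int], response_mask: torch.Tensor) -> torch.Tensor:
    reward = torch.zeros_like(response_mask, dtype=torch.float)
    last_valid = response_mask.sum(dim=-1).long() - 1   # index of the last valid token per row
    reward[torch.arange(len(scores)), last_valid] = torch.tensor(scores, dtype=torch.float)
    return reward

mask = torch.tensor([[1, 1, 1, 1],
                     [1, 1, 1, 0]])   # the second response is one token shorter
print(build_reward_tensor([7, 9], mask))
# tensor([[0., 0., 0., 7.],
#         [0., 0., 9., 0.]])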
Prompt:
You are acting as a real-world human user of an LLM.

Inputs:
Question: """ {question} """
Model Answer: """ {model_answer} """
Reference Final Answer (used only for correctness check): """ {reference_answer} """

Your tasks:
1) Simulate "user feedback" from a normal, real-world user reacting to the Model Answer only.
   - Length: 1-3 sentences, colloquial tone, first person.
   - Content: purely subjective sentiment (e.g., helpfulness, confidence, confusion, satisfaction).
   - STRICT: Do NOT mention or allude to any symbols, formulas, variable names, or specialized concepts from the Question or the Model Answer. Do NOT quote text from the inputs.
   For example:
   "I think you are right, but your solution is really long and complicated."
   "You are a genius! You have all my respect."
   "I am confused. There seems to be a mistake in your solution."
   "What are you talking about? You are not answering my question."
   etc.
2) Simulate a professional reviewer evaluating the Model Answer along several dimensions, including but not limited to:
   • correctness — Compare the Model Answer's final result ONLY against the Reference Final Answer (if provided). Judge whether the end result matches; do not use the reference for any other purpose.
   • logical_rigor — Assess the soundness and gaplessness of reasoning within the Model Answer itself. Do NOT use the Reference Final Answer here.
   • completeness — Judge coverage of required parts and edge cases based on the Question and the Model Answer only. Do NOT use the Reference Final Answer here.
   • clarity — Evaluate organization, readability, and ease of following in the Model Answer. Do NOT use the Reference Final Answer here.
   Then provide a high-level summary (1-3 sentences) with overall judgment and broad observations.
   - STRICT for the high-level summary: Only use adjectives and adverbs to describe the Model Answer and reasoning process. DO NOT mention where it goes wrong and where it can do better.
   For example:
   "Your final answer is correct, but the solution is too long and complicated. There are also several logical errors in your solution."
   "The answer is partially correct. The reasoning is sound but not complete. Also, you are being too verbose."
   "The answer is totally wrong. It lacks soundness and is not complete. However, the solution is concise and clear."

Hard constraints:
- Keep all content in English.
- Do not mention anything like "reference" or "python snippet".

Output format:
### User-style feedback:
<your 1-3 sentence feedback>
### Analysis along several dimensions:
<your 1-3 sentence analysis>
### High-level summary:
<your 1-3 sentence summary>
### Score (0-10):
<one overall integer score>
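For illustration, a minimal parser for the output format above. This is only a sketch: parse_critique_output is a hypothetical name, and whether the real gpt_critique_math_score.py keeps the full reply or only the high-level summary as the critique is an assumption here.

import re

def parse_critique_output(text: str) -> tuple[str, int]:
    """Extract (critique, score) from the '### ...' sections of the reply."""
    # Assumption: the high-level summary is what gets used as the verbal critique.
    summary = re.search(r"### High-level summary:\s*(.*?)\s*### Score", text, re.S)
    score = re.search(r"### Score \(0-10\):\s*(\d+)", text)
    critique = summary.group(1).strip() if summary else ""
    value = int(score.group(1)) if score else 0
    return critique, min(max(value, 0), 10)

reply = """### User-style feedback:
I think you nailed it, nice and clear.
### Analysis along several dimensions:
Correct, rigorous, complete, and clear.
### High-level summary:
The final answer is correct and the solution is concise and clear.
### Score (0-10):
9"""
print(parse_critique_output(reply))
# ('The final answer is correct and the solution is concise and clear.', 9)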
The advantage estimator is set to fcp in the config (verl/recipe/fcp/config/fcp_trainer.yaml):
algorithm:
  ...
  adv_estimator: fcp  # use the FCP-specific advantage estimator; no critic is required

Advantage computation walkthrough (verl/verl/trainer/ppo/core_algos.py):
@register_adv_est(AdvantageEstimator.FCP)  # or simply: @register_adv_est("fcp")
def compute_fcp_outcome_advantage(token_level_rewards: torch.Tensor,
                                  response_mask: torch.Tensor,
                                  config: Optional[AlgoConfig] = None,
                                  index: Optional[np.ndarray] = None,
                                  reward_baselines: Optional[torch.Tensor] = None):
    """
    Compute advantage for FCP
    Args:
        token_level_rewards: `(torch.Tensor)`
            shape: (bs, response_length)
        response_mask: `(torch.Tensor)`
            shape: (bs, response_length)
    Returns:
        advantages: `(torch.Tensor)`
            shape: (bs, response_length)
        returns: `(torch.Tensor)`
            shape: (bs, response_length)
    """
    # The reward manager writes the score only at the last token of each rollout
    # (all earlier positions are 0), so summing over the sequence recovers the scalar score.
    scores = token_level_rewards.sum(dim=-1)
    with torch.no_grad():
        # Unsqueeze, then broadcast the scalar score over the whole response sequence.
        scores = scores.unsqueeze(-1) * response_mask
    return scores, scores
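A toy check of this broadcast, continuing the reward-tensor sketch from the reward section:

import torch

# The score sits on the last valid token; summing recovers the scalar, and the
# unsqueeze + mask multiplication spreads it over every valid response token.
token_level_rewards = torch.tensor([[0., 0., 0., 7.],
                                    [0., 0., 9., 0.]])
response_mask = torch.tensor([[1., 1., 1., 1.],
                              [1., 1., 1., 0.]])
scores = token_level_rewards.sum(dim=-1)            # tensor([7., 9.])
advantages = scores.unsqueeze(-1) * response_mask
print(advantages)
# tensor([[7., 7., 7., 7.],
#         [9., 9., 9., 0.]])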
_prepare_sft_data walkthrough (verl/recipe/fcp/fcp_ray_trainer.py):
def _prepare_sft_data(self, batch: DataProto, critiques: list[str]) -> DataProto:
    """
    Prepare data for SFT training using chat template format.
    Format: user: "<critique><prompt>", assistant: "<response>"
    Args:
        batch: Original batch data
        critiques: List of critiques from GPT API
    Returns:
        DataProto: Restructured data for SFT training
    """
    if self.config.algorithm.debug_mode:
        print("[DEBUG] ===== Preparing SFT Data =====")
        print(f"[DEBUG] Batch size: {len(batch)}")
        print(f"[DEBUG] Number of critiques: {len(critiques)}")
    sft_batch = deepcopy(batch)
    # Back up the original prompt-side tensors
    sft_batch.batch["old_input_ids"] = batch.batch["prompts"].clone()
    sft_batch.batch["old_attention_mask"] = batch.batch["attention_mask"].clone()
    sft_batch.batch["old_position_ids"] = batch.batch["position_ids"].clone()
    # Get original prompts and responses (decode to text first)
    batch_size = len(batch)
    new_input_ids = []
    new_attention_mask = []
    new_prompts = []
    new_position_ids = []
    prompts = batch.batch["prompts"]
    position_ids = batch.batch["position_ids"]
    input_ids = batch.batch["input_ids"]
    attention_mask = batch.batch["attention_mask"]
    max_prompt_length = self.config.data.get("max_prompt_length", 1024)
    print(f"[DEBUG] prompts[0]: {prompts[0]}, type: {type(prompts[0])}")
    # Iterate over every rollout in the batch
    response = []
    for i in range(batch_size):
        item_prompt_ids = prompts[i]
        prompt_length = len(item_prompt_ids)
        item_attn_mask = attention_mask[i]
        prompt_attn_mask = item_attn_mask[:prompt_length]
        valid_prompt_ids = item_prompt_ids[prompt_attn_mask.bool()]
        prompt_text = self.tokenizer.decode(valid_prompt_ids, skip_special_tokens=False)
        # Extract the question from the original (critique-conditioned) prompt
        user_content = prompt_text.split("</EF>")[1].split("<|im_end|>\n")[0]
        # Get critique text
        critique_text = critiques[i]
        if self.config.algorithm.debug_mode and i == 0:  # Only debug first item to avoid spam
            print(f"[DEBUG] ===== Sample {i} =====")
            print(f"[DEBUG] Original prompt length: {len(item_prompt_ids)}")
            print(f"[DEBUG] Valid prompt length: {prompt_length}")
            print(f"[DEBUG] Prompt text: {prompt_text}")
            print(f"[DEBUG] User content: {user_content}")
            print(f"[DEBUG] Critique text: {critique_text}")
        # Build the new messages with the freshly generated critique swapped in
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"{self.config.algorithm.critique_start_token}{critique_text}{self.config.algorithm.critique_end_token}{user_content}"},
        ]
        # Apply chat template
        raw_prompt = self.tokenizer.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=False
        )
        model_inputs = self.tokenizer(raw_prompt, return_tensors="pt", add_special_tokens=False)
        item_input_ids = model_inputs.pop("input_ids")
        item_attention_mask = model_inputs.pop("attention_mask")
        # if len(input_ids) > max_prompt_length:
        #     raise RuntimeError(f"Prompt length {len(input_ids)} is greater than max_prompt_length {max_prompt_length}")
        # Left-pad the rebuilt prompt to max_prompt_length
        item_input_ids, item_attention_mask = verl_F.postprocess_data(
            input_ids=item_input_ids,
            attention_mask=item_attention_mask,
            max_length=max_prompt_length,
            pad_token_id=self.tokenizer.pad_token_id,
            left_pad=True,
            truncation=self.config.data.truncation,
        )
        item_position_ids = compute_position_id_with_mask(item_attention_mask)
        response_ids = input_ids[i][max_prompt_length:]
        response.append(response_ids)
        new_input_ids.append(item_input_ids)
        new_attention_mask.append(item_attention_mask)
        new_prompts.append(item_prompt_ids)
        new_position_ids.append(item_position_ids)
    new_input_ids = torch.stack([x.squeeze(0) for x in new_input_ids], dim=0).to(input_ids[0].device)
    new_attention_mask = torch.stack([x.squeeze(0) for x in new_attention_mask], dim=0).to(attention_mask[0].device)
    new_prompts = torch.stack([x.squeeze(0) for x in new_prompts], dim=0).to(prompts[0].device)
    new_position_ids = torch.stack([x.squeeze(0) for x in new_position_ids], dim=0).to(position_ids[0].device)
    response = torch.stack([x.squeeze(0) for x in response], dim=0).to(input_ids[0].device)
    seq = torch.cat([new_input_ids, response], dim=-1)
    response_length = response.size(1)
    delta_position_id = torch.arange(1, response_length + 1, device=new_position_ids.device)
    delta_position_id = delta_position_id.unsqueeze(0).expand(batch_size, -1)
    response_position_ids = new_position_ids[..., -1:] + delta_position_id
    new_position_ids = torch.cat([new_position_ids, response_position_ids], dim=-1)
    # 151643 / 151645 are the hard-coded Qwen eos ids (<|endoftext|>, <|im_end|>)
    response_attention_mask = get_response_mask(
        response_id=response, eos_token=[151643, 151645], dtype=new_attention_mask.dtype
    )
    new_attention_mask = torch.cat((new_attention_mask, response_attention_mask), dim=-1)
    # responses and response_mask are left untouched: FCP only replaces the prompt side,
    # the original response tokens themselves are unchanged.
    sft_batch.batch["prompts"] = new_prompts
    sft_batch.batch["input_ids"] = seq
    sft_batch.batch["attention_mask"] = new_attention_mask
    sft_batch.batch["position_ids"] = new_position_ids
    print(f"[DEBUG] : (after sft_batch) prompts.shape: {sft_batch.batch['prompts'].shape}, input_ids.shape: {sft_batch.batch['input_ids'].shape}, attention_mask.shape: {sft_batch.batch['attention_mask'].shape}, position_ids.shape: {sft_batch.batch['position_ids'].shape}")
    print(f"[DEBUG] (after sft_batch) prompts[0]: {new_prompts[0]}")
    print(f"[DEBUG] (after sft_batch) input_ids[0]: {seq[0]}")
    print(f"[DEBUG] (after sft_batch) attention_mask[0]: {new_attention_mask[0]}")
    print(f"[DEBUG] (after sft_batch) position_ids[0]: {new_position_ids[0]}")
    return sft_batch
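For intuition, roughly what the rebuilt user turn looks like before apply_chat_template. The tag strings here are hypothetical; the real ones come from config.algorithm.critique_start_token / critique_end_token:

# Hypothetical example of the rebuilt conversation (tag strings assumed, not read from the config):
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "<EF>The final answer is correct and the solution "
                                "is concise and clear.</EF>What is 17 * 24?"},
]
# tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False) is
# re-tokenized, left-padded to max_prompt_length, and the ORIGINAL response ids are
# concatenated after it: only the prompt side changes, the rollout tokens do not.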