policy_chosen_logps: Log probabilities of the policy model for the chosen responses. Shape: (batch_size,)
policy_rejected_logps: Log probabilities of the policy model for the rejected responses. Shape: (batch_size,)
reference_chosen_logps: Log probabilities of the reference model for the chosen responses. Shape: (batch_size,)
reference_rejected_logps: Log probabilities of the reference model for the rejected responses. Shape: (batch_size,)
beta: Temperature parameter for the DPO loss, typically something in the range of 0.1 to 0.5. We ignore the reference model as beta -> 0.
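For reference, here is a minimal sketch of how these arguments combine into the DPO loss of Rafailov et al. [2]. The function name `dpo_loss`, the type hints, and the detached reward bookkeeping are illustrative assumptions, not the post's original code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             reference_chosen_logps: torch.Tensor,
             reference_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Minimal DPO loss over a batch of (chosen, rejected) response pairs."""
    # Log-ratio of chosen vs. rejected under the policy and under the reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps

    # DPO objective: -log sigmoid(beta * (policy log-ratio - reference log-ratio)).
    logits = pi_logratios - ref_logratios
    losses = -F.logsigmoid(beta * logits)

    # Implicit rewards (detached, for logging only).
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps).detach()
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps).detach()
    return losses.mean(), chosen_rewards, rejected_rewards
```

In practice, each `*_logps` tensor is obtained by summing the per-token log probabilities of a response given its prompt under the corresponding model.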
[1] Ouyang L, Wu J, Jiang X, et al. Training language models to follow instructions with human feedback[J]. Advances in Neural Information Processing Systems, 2022, 35: 27730-27744.
[2] Rafailov R, Sharma A, Mitchell E, et al. Direct preference optimization: Your language model is secretly a reward model[J]. Advances in Neural Information Processing Systems, 2024, 36.
[3] Wu Z, Hu Y, Shi W, et al. Fine-grained human feedback gives better rewards for language model training[J]. Advances in Neural Information Processing Systems, 2024, 36.
[4] Rame A, Couairon G, Dancette C, et al. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards[J]. Advances in Neural Information Processing Systems, 2024, 36.
Author: 刘杰
Source: Zhihu, https://zhuanlan.zhihu.com/p/684965730