A Close Reading of the InstructGPT Paper
2023/02/28 20:36:10
With ChatGPT so popular, it is worth reading the InstructGPT paper closely. OpenAI published "Training language models to follow instructions with human feedback" on 2022/3/4; at 68 pages, it is a substantial piece of work. It represents a major improvement over the third-generation GPT: by training on human-corrected answers, GPT's responses are brought much closer to what people actually have in mind.

The InstructGPT method is laid out mainly in Fig 2 of the paper and consists of three steps:

• Step 1 - Collect demonstration data, and train a supervised policy
Our labelers provide demonstrations of the desired behavior on the input prompt distribution. We then fine-tune a pretrained GPT-3 model on this data using supervised learning.

• Step 2 - Collect comparison data, and train a reward model
We collect a dataset of comparisons between model outputs, where labelers indicate which output they prefer for a given input. We then train a reward model to predict the human-preferred output.

• Step 3 - Optimize a policy against the reward model using reinforcement learning
We use the output of the RM (reward model) as a scalar reward. We fine-tune the supervised policy to optimize this reward using the PPO (Proximal Policy Optimization) algorithm. (Sketches of the reward-model loss and of this reward shaping follow below.)
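
Step 1 is ordinary supervised fine-tuning on labeler demonstrations, so the distinctive piece of machinery is the reward model of Step 2. The paper trains it with a pairwise ranking loss over labeler comparisons: the preferred output should receive a higher scalar score than the rejected one. Below is a minimal PyTorch-style sketch of that loss; `reward_model` is a hypothetical stand-in for a GPT-3-based model with a scalar head, not the paper's actual code.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, preferred, rejected):
    """Pairwise ranking loss for the reward model (Step 2).

    reward_model(prompts, responses) is assumed to return one scalar
    score per (prompt, response) pair in the batch.
    """
    r_w = reward_model(prompts, preferred)  # scores of the labeler-preferred outputs
    r_l = reward_model(prompts, rejected)   # scores of the rejected outputs
    # Maximize log sigmoid(r_w - r_l): the preferred output should score higher,
    # which is the comparison loss described in the paper.
    return -F.logsigmoid(r_w - r_l).mean()
```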

Steps 2 and 3 can be iterated continuously; more comparison data is collected on the current best policy, which is used to train a new RM and then a new policy. In practice, most of our comparison data comes from our supervised policies, with some coming from our PPO policies.
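
For Step 3, the paper's RL objective uses the RM score as the reward but subtracts a per-token KL penalty against the frozen SFT policy, so the PPO policy does not drift too far from the supervised model (the full PPO-ptx variant also mixes in pretraining gradients). The sketch below shows only that reward shaping; the function name and the `beta` value are illustrative assumptions, not taken from the paper's code.

```python
def shaped_reward(rm_score, logprob_rl, logprob_sft, beta=0.02):
    """Reward handed to PPO for one sampled response (Step 3).

    rm_score    : reward-model score for the (prompt, response) pair
    logprob_rl  : summed token log-probs under the current RL policy
    logprob_sft : summed token log-probs under the frozen SFT policy
    beta        : KL-penalty coefficient (illustrative value)
    """
    kl_penalty = beta * (logprob_rl - logprob_sft)  # approximates beta * KL(RL || SFT)
    return rm_score - kl_penalty

# Example with plain numbers: a response the RL policy likes more than the
# SFT policy did gets its RM score reduced by the KL penalty.
# shaped_reward(1.3, logprob_rl=-42.0, logprob_sft=-45.0) == 1.3 - 0.02 * 3.0 == 1.24
```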

Earlier AI question-answering systems used public data directly, which produced biased answers; training on human-corrected answers really does improve the accuracy of the AI's responses. In fact, OpenAI's training data comes from the question-and-answer interactions of GPT's public users, which is also why Google asks its employees to use its own AI chat system every day. There are now many ChatGPT prompt compendiums online; they are basically built on the appendix material of this paper, extended and tested further, and are well worth consulting.