CMU-CS-24-121
Computer Science Department
School of Computer Science, Carnegie Mellon University

Fine-tuning Does Not Remove Language Model Capabilities
Suhas Kotha
M.S. Thesis
May 2024
Fine-tuned language models catastrophically forget tasks outside the fine-tuning distribution. On the flip side, fine-tuning is often used to remove unsafe behavior such as toxic content generation. Both this failure mode and this success presuppose that fine-tuning removes a capability from the model. We show that fine-tuning does not remove such capabilities, which is encouraging for reducing forgetting and discouraging for defending against jailbreaks.

Via synthetic experiments, we hypothesize that language models implicitly infer the task of the prompt and that fine-tuning skews this inference towards tasks in the fine-tuning distribution. To test this, we propose Conjugate Prompting, which artificially makes the task look farther from the fine-tuning distribution while requiring the same capability. We find that this recovers in-context learning abilities lost via instruction tuning and natural reasoning capability lost during code fine-tuning. More concerningly, conjugate prompting can recover harmful content generation suppressed by safety fine-tuning in chatbots like ChatGPT.

Can algorithms like fine-tuning and input defenses reliably remove unwanted behavior? We find that the best fine-tuning and input defenses cannot enforce one of the simplest, perfectly defined behaviors: do not output the word "purple". Both forgetting and jailbreaking demonstrate that fine-tuning currently does not fully remove or change model capabilities. We propose future directions on improving capabilities by investigating length generalization and on reliably removing capabilities via machine unlearning.

94 pages
Thesis Committee:
Srinivasan Seshan, Head, Computer Science Department