This is my step-by-step guide to replicating the fine-tuning of the example datasets using axolotl.
Last I checked, the bitsandbytes library copy was still needed and open-llama-3b was still problematic for quantizing, but hopefully those issues are solved at some point.
When I first wrote the post, I didn’t know that the fine-tuned LoRA adapter could be loaded directly in a frontend like text-generation-webui. I have since updated the text to account for that. Loading just the QLoRA adapter in the webui has performance side effects beyond the longer load time. This post should show how fast text inference was, in tokens/s with little context, when using the transformers library with the source model in f16 or quantized to 8-bit and 4-bit, and how fast I can run a merged q4_0 quantization.
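If you want to take a comparable tokens/s measurement yourself, a minimal sketch could look like the one below. The `tokens_per_second` helper and the `fake_generate` stand-in are hypothetical names made up for illustration, not part of transformers, axolotl, or text-generation-webui. In practice you would pass in a function that runs your own model and returns only the newly generated token ids.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]], runs: int = 3) -> float:
    """Average generation throughput (new tokens per second) over several runs.

    `generate` is any zero-argument callable that performs one generation
    and returns the newly generated token ids (hypothetical interface).
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate()
        total_seconds += time.perf_counter() - start
        total_tokens += len(new_tokens)
    return total_tokens / total_seconds


def fake_generate() -> list:
    """Stand-in for a real model call: pretends to emit 20 tokens in ~10 ms."""
    time.sleep(0.01)
    return list(range(20))


if __name__ == "__main__":
    print(f"{tokens_per_second(fake_generate):.1f} tokens/s")
```

Running the same helper against the f16, 8-bit, 4-bit, and merged q4_0 setups with an identical short prompt keeps the comparison fair, since only the model loading changes between runs.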
@InattentiveRaccoon
This is a great guide on fine-tuning with Axolotl! I have been trying to find GitHub projects for fine-tuning llama2 models, and there aren’t many complete examples. I was finally able to do it thanks to you!