Have you written anywhere in detail on how you gathered your dataset and trained the finetune? I have a few use cases that are like this, but I'm not sure where to start.
It’s fairly simple: I split the original text into chunks and then used some bigger models on OpenRouter to clean it up and provide translations to modern English (that part seemed to be pretty easy for an LLM).
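A rough sketch of that chunk-then-clean step, if it helps. The chunk size, model name, and prompt here are placeholders, not my exact settings; OpenRouter exposes an OpenAI-compatible API, so the call itself is just the standard chat completions client pointed at their base URL:

```python
# Sketch of the chunking step. The max_chars value is illustrative.
def chunk_text(text, max_chars=2000):
    """Split text into chunks of at most max_chars,
    breaking on paragraph boundaries where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Each chunk then goes to a larger model for cleanup + translation,
# e.g. (model name and prompt are placeholders):
#
#   from openai import OpenAI
#   client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=...)
#   resp = client.chat.completions.create(
#       model="anthropic/claude-3.5-sonnet",
#       messages=[{"role": "user", "content":
#           f"Fix OCR/whitespace errors, then translate to modern English:\n\n{chunk}"}],
#   )
```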
After that, I just trained a MiniLM2 model to classify the texts. I used that classifier in a reward function for reinforcement learning, with the system message set to a simple instruction to write in the prose of the IPJ.
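The reward plumbing is roughly this shape. The classifier here is a stub standing in for the finetuned MiniLM model (in the real version you'd run the text through the classifier and take the probability of the target-style label); the function signature is the one TRL's reward functions use, completions in, one float per completion out:

```python
# Stub standing in for the finetuned MiniLM style classifier.
def style_probability(text):
    # Real version: softmax(classifier(tokenize(text)))[target_style_label].
    # Stub heuristic so the shape is clear: archaic spellings score higher.
    return 1.0 if "thee" in text.lower() else 0.1

def style_reward(completions, **kwargs):
    """Reward function in the shape TRL's GRPOTrainer expects:
    a list of completions in, a list of floats out."""
    return [style_probability(c) for c in completions]
```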
I debated whether or not to use any SFT and decided not to. If the style turns out to be too hard to learn, you might need some seed/cold-start SFT data.
I’ll try to get my scripts up on GitHub for you to look at. It’s just a few short training scripts.
Thanks for the explanation! I'm learning and think this would be a good next project for me to try, especially since I have a real-world use case in mind with a similar amount of data available.
In particular, I'm not very familiar with reinforcement learning, and I'm not sure how you use the embeddings from MiniLM2 as a reward function. (Edit: maybe this is the Jaccard similarity?)
I'd really appreciate it if you were open to posting the scripts! I see a few snippets around and could probably cobble something together after a while, but it's cool to see something already working, to make sure I'm not getting too far off into left field.
You can ignore the Jaccard similarity field. That was just to monitor the text -> cleaned-text conversion, to make sure it didn’t stray too far from the original while it was fixing whitespace and OCR issues.
You can then just load the MiniLM model and train it on your data using a standard transformers classification pipeline. ChatGPT can zero-shot that part reasonably well if you give it this description.
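For reference, the classifier-training step looks roughly like this. The checkpoint name, hyperparameters, and label scheme are illustrative (swap in whichever MiniLM/MiniLMv2 variant you're using); label 1 = target style, label 0 = modern English:

```python
def make_style_dataset(style_texts, modern_texts):
    """Pair each text with a style label: 1 = target style, 0 = modern English."""
    rows = [{"text": t, "label": 1} for t in style_texts]
    rows += [{"text": t, "label": 0} for t in modern_texts]
    return rows

def train_classifier(rows):
    # Imports kept local so the helper above runs without these installed.
    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    # Placeholder checkpoint; use whichever MiniLM variant fits your setup.
    name = "microsoft/MiniLM-L12-H384-uncased"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    ds = Dataset.from_list(rows).map(
        lambda b: tok(b["text"], truncation=True,
                      padding="max_length", max_length=256),
        batched=True)
    trainer = Trainer(
        model=model,
        args=TrainingArguments("style-clf", num_train_epochs=3),
        train_dataset=ds)
    trainer.train()
    return trainer
```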
From there you should check out the GRPO trainer in TRL. It has taken me a bit of time to learn how to use it effectively. There are a TON of parameters in the configuration, and occasionally I have to hunt down arXiv papers to understand them.
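The basic wiring is small, though. This is just the rough shape (the policy model name, prompts, and hyperparameter values are placeholders; `GRPOConfig` has many more knobs than shown, which is where the docs and papers come in):

```python
def prompts_to_rows(prompts):
    """GRPOTrainer expects a dataset with a 'prompt' column."""
    return [{"prompt": p} for p in prompts]

def build_grpo_trainer(reward_fn, prompts):
    # Imports kept local so the helper above runs without trl installed.
    from datasets import Dataset
    from trl import GRPOConfig, GRPOTrainer

    config = GRPOConfig(
        output_dir="style-grpo",
        num_generations=8,          # completions sampled per prompt
        max_completion_length=256,
        learning_rate=1e-6,
    )
    return GRPOTrainer(
        model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder policy model
        reward_funcs=reward_fn,              # completions in, floats out
        args=config,
        train_dataset=Dataset.from_list(prompts_to_rows(prompts)),
    )
```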
I added my scripts to a GitHub repo in case you're interested in seeing how I did it. It's a bit messy, but fine as a reference. I might try training it again soon with some new ideas, and I'll polish it up more then.