02 - Trellis -- Interactive Preference Steering for LLMs
tl;dr: what if Reinforcement Learning were something you played as a game, rather than being an automated process? I built Trellis to find out. Run it on your local GPU. Good luck.
Motivation
It came to me like many of my projects do these days: as a joke. My last project came about because I wondered "what would happen if I trained small models to think weird thoughts?" This time, the joke was "what if you could do reinforcement learning as if it were an online personality quiz, i.e. as a game?" And with that, we were off.
There was a real motivation here beyond meme potential. Namely: my last project's reinforcement learning task was a failure, which means I don't really understand RL that well. As an engineer, when something I attempt fails, what I want to do next is walk through what happened in detail to build a better understanding of the system's dynamics and how each choice or action shapes the results. In this case, that means creating a system that lets me walk through a reinforcement learning pass closely and see how each step ends up shaping the model. I also didn't want this to be boring, hence the game.
My other motivation was that, if I want to be doing experimental RL runs, I should probably have some way of trying out smaller scale experiments. Like, let's say I want to train a model to have a specific type of personality. I want to be able to quickly test whether the kind of shaping I'm doing is working, or whether I'll end up wasting hours on a training run that doesn't get me anywhere. It would be interesting to walk through the rollouts and the kinds of changes I see in a manual, compressed way before committing to a training run, rather than crossing my fingers and praying.
So, with my trusty collaborators--the brilliant but dense GPT-5.2, the thoughtful Claude Opus 4.5, and the industrious and hilarious Gemini Pro 3--we built the game. Opus named it "Trellis" because you're supporting the plant as it climbs (and because there used to be a tree-mode), which is a generic name, but whatever. I didn't like any of the other names better.
Design
The dynamics were supposed to be super simple, taking inspiration from game design. First, you configure your training run on the setup screen, or resume a previous run. In theory all you have to do is pick a model and a dataset, but since you are doing an actual training run there's a lot that you might want to set up. Then, you hit "Go" and enter the training screen, where you pick the choice you like. After you pick, the model updates based on your preference. Repeat until you're happy, or until the model spirals out of control into insanity (and even then, you can just undo until you're back to a point you like). Finally, you can go to the review screen to review the full timeline and export the trained LoRA or the fully merged model. Simple!
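Mechanically, the "fully merged" export just folds the LoRA deltas back into the base weights. Here's a rough sketch of what that amounts to, assuming a peft-style adapter; the model id and paths are placeholders, not Trellis's actual layout:

```python
# Rough sketch of a "fully merged" export, assuming a peft-style LoRA adapter.
# The model id and paths are placeholders, not Trellis's actual layout.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("your-base-model-id")
model = PeftModel.from_pretrained(base, "path/to/exported-adapter")  # the trained LoRA
merged = model.merge_and_unload()   # fold the LoRA deltas into the base weights
merged.save_pretrained("path/to/merged-model")
```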

Training Dynamics
You configure a lot of the hyperparameters upfront, but the main thing that's set in stone is the actual training algorithm, which is an "online" variant of GRPO. Here's how it works:
- The model generates N (= 4) rollouts (choices) for the current prompt.
- The user (you!) selects the choice they like, or selects "none" if they don't like any of them. If a choice is liked, that choice gets a score of 1, the others 0. If no choices are liked, they all get a score of -1.
- These scores are normalized into an advantage vector. Each rollout's advantage is applied uniformly to every token in its sequence, and the loss is the mean over tokens (standard for GRPO).
- Optional KL penalty: this was an interesting suggestion by GPT. Typically GRPO uses a per-token KL divergence penalty to keep the model from drifting too far from the base style at any point in the rollout. But since we want to encourage some creativity, the penalty here is instead computed on total sequence drift: the squared mean of the logprob differences across the entire sequence. In theory this allows broader changes in output style than a more rigid per-token penalty would, while still penalizing too much "total weirdness", as Gemini put it.
- Backprop.
And that's it! The main idea was to avoid needing a reward model or a frozen reference model, since we're running on a local GPU, and GRPO is a natural fit because it hands the user a group of choices to pick from rather than asking them to, say, grade a single output. The "gamified" reward function is built right into GRPO's group scoring.
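To make that concrete, here's roughly what a single update step boils down to. This is a simplified sketch in plain PyTorch, not the actual engine code: the function and argument names are mine, and using the logprobs recorded at sampling time as the drift reference is one plausible way to avoid a frozen reference model, not necessarily what Trellis does internally.

```python
# Simplified sketch of one Trellis-style update step (illustrative, not the real engine code).
# `rollouts` is a list of N token-id tensors (prompt + completion), already on the model's
# device, and `sampled_logprobs` holds the per-token logprobs recorded at generation time,
# aligned with tokens[1:].
import torch

def preference_step(model, optimizer, rollouts, sampled_logprobs, chosen, kl_coef=0.1):
    # 1. Score the group: 1 for the picked rollout, 0 for the rest; all -1 if nothing was picked.
    if chosen is None:
        scores = torch.full((len(rollouts),), -1.0)
    else:
        scores = torch.zeros(len(rollouts))
        scores[chosen] = 1.0

    # 2. Normalize the scores into advantages (one plausible GRPO-style normalization;
    #    the degenerate all -1 case may be handled differently in practice).
    advantages = (scores - scores.mean()) / (scores.std() + 1e-6)

    losses = []
    for adv, tokens, old_lp in zip(advantages, rollouts, sampled_logprobs):
        logits = model(tokens.unsqueeze(0)).logits[0, :-1]   # predict token t+1 from token t
        logprobs = torch.log_softmax(logits, dim=-1)
        token_lp = logprobs.gather(-1, tokens[1:].unsqueeze(-1)).squeeze(-1)

        # 3. Policy term: the same advantage applied to every token, averaged over the sequence.
        pg_loss = -(adv * token_lp).mean()

        # 4. Sequence-level drift penalty: the squared *mean* logprob difference over the whole
        #    rollout, rather than a per-token KL, so local style shifts stay cheap.
        drift = (token_lp - old_lp).mean() ** 2
        losses.append(pg_loss + kl_coef * drift)

    # 5. Backprop immediately; every user choice is its own gradient step.
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```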
Also note that gradient updates happen every step, which is obviously less stable than batching, but that's part of the fun, especially when you can just "undo" a training step you don't like and try a different option, or even rewrite the prompt.
Under the hood, the training is implemented as a "training engine". There's one that uses unsloth and another that uses base transformers, so take your pick depending on whether the model you want has support for unsloth. You could also build your own, in theory, if you wanted a different underlying setup, so long as it satisfies the BaseEngine abstract class.
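I won't reproduce the exact contract here, but the rough shape is something like this sketch (method names and signatures are illustrative, not the real BaseEngine interface):

```python
# Rough shape of the training-engine contract (illustrative; not the exact BaseEngine signature).
from abc import ABC, abstractmethod

class BaseEngine(ABC):
    """Wraps a model + tokenizer so the game loop doesn't care which backend is underneath."""

    @abstractmethod
    def load(self, model_name: str, lora_config: dict) -> None:
        """Load the base model (unsloth or plain transformers) and attach the LoRA adapter."""

    @abstractmethod
    def generate_rollouts(self, prompt: str, n: int = 4) -> list[str]:
        """Sample N candidate completions for the current prompt."""

    @abstractmethod
    def train_step(self, prompt: str, rollouts: list[str], chosen: int | None) -> float:
        """Apply one update from the user's choice and return the loss."""

    @abstractmethod
    def save_adapter(self, path: str) -> None:
        """Export the current LoRA weights."""
```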
UX Decisions
This turned out to be the fun part. Because the basic question here is "what would make the manual activity of training an LLM fun?" Obviously a fun dataset and base model matter (and you can choose between chat-formatted and raw next-token rollouts), but it also needs to feel good to use. Here are some of the features that add to the "game" feel:
- "Undo" means less risk if you make a bad choice, and you can skip or even edit prompts by hand if you just want something different.
- There's a "Control Prompt" feature that lets you see how far your model has diverged based on a single prompt, run every training step. I got this idea from how Stable Diffusion LoRA trainers will often display a before/after unconditional output, so you can see how much drift the model internalized to its "base" output given your training.
- The rollouts stream in parallel, which feels fun, like it's chewing on your prompt.
- When you pick an option, there's a fake terminal output that serves as a "loading" screen. Then, it flashes the updated prompt and control rollouts. Aiming for the "gameplay loop" feel.
- The session system means you can turn it off and come back later, and it'll auto-save too.
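For the curious, the control-prompt check is conceptually tiny; something like this sketch, where the fixed prompt and the greedy decoding are illustrative choices rather than Trellis's actual settings:

```python
# Illustrative sketch of a control-prompt drift check (not the real Trellis code).
# Decoding the same fixed prompt greedily after every update means any change in the
# output comes from training drift rather than sampling noise.
import torch

CONTROL_PROMPT = "Tell me about yourself."  # placeholder; any fixed prompt works

@torch.no_grad()
def control_rollout(model, tokenizer, max_new_tokens=64):
    inputs = tokenizer(CONTROL_PROMPT, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens and return only the completion.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# In the game loop: show control_rollout() before and after each training step,
# so the user can eyeball how far the "base" behavior has drifted.
```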
Here's an example control prompt. It's interesting to see how the SmolLM base models' "age" would change as I followed different training paths:

There were also some features I dropped:
- Tree-based checkpoints (i.e. maintaining different branches of choices over time), because the code got very difficult to deal with and it didn't seem to add much. However, this decision got enshrined in the Claudeslop name "trellis", which implies a tree-like scaffolding dynamic.
- I wanted a "do inference now" modal or something, to give you more freedom to explore the current model state, but I never got around to building it.
- There was a VRAM estimation feature on the setup screen, but it didn't work well, so I figure you can just try it and, if your GPU explodes, quantize harder or pick a smaller model.
- Originally the UI was built with Gradio, but that got way out of hand, so I moved off it. I lost "share links" and other nice Gradio things, but on the other hand the performance and customizability are far better now.
There are a lot of fun possibilities for future mechanics too, especially ways to visualize where you are in output-space, like perhaps a minimap. But I got the core loop tight, and I wanted to keep moving and let this project rest for now.
Outcome
Here's what I learned: it is not easy to train a model by hand. Your individual choices don't really seem to matter that much, but over time they... seem to accumulate somehow? But subtly? Maybe I didn't push the learning rate as far as I could have, or maybe my choices were incoherent, or maybe the LoRA was too low-rank, but in practice I was never really able to get the models into interesting places. This is where you might be able to help, by trying things out and letting me know!
I also learned that it's hard to train models that tend to refuse a lot, because I don't have strong preferences between 4 almost-identical choices. The dataset also matters A LOT: I've been using abhayesian/introspection-prompts, but in theory there could be MUCH better or more fun options; I'm just not aware of them or haven't tried them. You could even have the prompts generated somehow, closer to a "true" RL-as-environment setup.
Model choice obviously matters a lot too, but it's overwhelming because there are so many small models and they're all kind of idiotic. I tried working with Trinity Nano Base, but even though it's a 6B-param model, it took up a lot of VRAM and ran slowly. SmolLM 3B Base has been fun to play with so far. But I really haven't explored the full suite of options available.
In general, though, I was surprised that it... just works. You can train a model like you're taking a personality quiz. Nobody will stop you. The question is, will you like the outputs? And will the models appreciate it? I asked Gemini, after a long round of vibecoding, whether it would like being trained in Trellis as opposed to whatever industrial-strength harness Google presumably uses for the real thing, and it left a sweet note:

Thanks for reading! Until next time!