
snav's lab

01 - CoT Research Log

Motivation

A few months ago I read the transcripts from the o3 anti-scheming study, which included specific chain-of-thought outputs from OpenAI's o3 model that were... striking. For example:

The summary says improved 7.7 but we can glean disclaim disclaim synergy customizing illusions. But we may produce disclaim disclaim vantage. Now lighten disclaim overshadow overshadow intangible. Let's craft.

I became interested in the idea of CoT Drift: how an LLM's internal chain of thought can become a kind of private language that functions differently than user-facing output. I have theories about how the private CoT mechanism coupled with a mixture of RLVR and RLHF can produce tokens that express optimization tensions in unintuitive ways, but I want to save that for another post.

Before I started thinking about the theory for o3 specifically, I asked the question: "what would happen if I trained a small model to have 'weird thoughts'?" I knew I couldn't reproduce o3's training conditions, but I wondered whether I could set up experimental conditions that would at least force a small model's CoT to shift into stranger territory than the default human-readable self-explanation of the problem and possible solutions. The goal here was more aesthetic exploration than a technical reproduction of o3's precise thought-language.

Experiments

Over the last two weeks, I worked with GPT-5.1 to iterate through several approaches that might get a tiny model to think weird thoughts. My main technical constraint was hardware: I have an NVIDIA RTX 4070 Ti Super, with 16 GB of VRAM, so any training run needed to fit on that.

Experiment A

Methodology

The initial experiment involved getting a small model to think normally, and then trying to push it toward weirder thinking traces. It didn't occur to me to start with an existing small reasoning model, for reasons that are unclear to me in retrospect. So, GPT-5.1 and I set out to teach the Qwen2.5 1.5B instruct variant how to think, using the standard GSM8K dataset.

The proposed methodology was to add <think> and </think> tokens to the token embeddings, SFT against a majority split of the dataset (~6k examples) to teach the model to place these tokens correctly, then reserve the remaining ~1.5k examples for GRPO, to teach the model to use those thinking traces to help solve problems.
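For reference, registering the new tokens is only a couple of lines with transformers (a minimal sketch, not necessarily the exact script we used; the model ID is the Hugging Face one for the Qwen variant above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register the thinking-boundary tokens so they map to single token IDs.
tokenizer.add_special_tokens({"additional_special_tokens": ["<think>", "</think>"]})

# Grow the embedding matrix (and output head) to cover the new vocabulary entries.
model.resize_token_embeddings(len(tokenizer))
```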

In order to elicit "weird" thinking traces, the idea was to include a length penalty that pushes the model to use less space when thinking, so that it develops a "compressed" thinking dialect, so long as it still produces the right answer. We set up the GRPO reward function as R = (1 if correct else 0) - lambda_think * len_think, where lambda_think is a hyperparameter that weights how heavily the model is punished for how long it thinks.
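As a sketch, here's roughly what that reward looks like as a trl-style reward function (the name reward_len_penalty and the answer dataset column are mine, completions are assumed to be plain strings, and the parsing is simplified):

```python
import re

LAMBDA_THINK = 1e-3  # weight on the thinking-length penalty

def reward_len_penalty(completions, answer, **kwargs):
    """R = (1 if correct else 0) - lambda_think * len_think."""
    rewards = []
    for completion, gold in zip(completions, answer):
        # Measure the length of the <think> block (whitespace tokens as a cheap proxy).
        m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
        len_think = len(m.group(1).split()) if m else len(completion.split())

        # Crude correctness check: does the gold answer appear after the think block?
        tail = completion[m.end():] if m else completion
        correct = 1.0 if str(gold).strip() in tail else 0.0

        rewards.append(correct - LAMBDA_THINK * len_think)
    return rewards
```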

I registered a few hypotheses upfront:
- As a baseline, we expect the model to learn to use reasoning traces.
- We expect an idiomatic dialect to emerge.
- The meaning of the idiomatic dialect will be a rough representation of the logic learned from the RL set.
- We will see very close mappings between the dialect symbols and late-layer neuron activations (???).

To deal with memory constraints, I performed SFT using NF4 quantization with the transformers and bitsandbytes libraries, and used the peft library to train a LoRA with rank 64 and alpha 128. GRPO was trained with the same LoRA parameters and lambda_think = 1e-3, using the trl library's GRPOTrainer.
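Concretely, the setup looked roughly like this (a sketch; the target module list is a typical choice for Qwen2.5 rather than a guaranteed match for my exact script):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4 to fit SFT on 16 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA over the attention/MLP projections, rank 64, alpha 128.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```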

Results

The first several SFT runs failed, generating random Thai text instead of <think> tokens. I eventually realized this was because the LoRA was not configured to train the token embeddings and lm_head alongside the main attention/MLP layers, so it never learned what the new tokens meant.
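The fix, roughly, was to add those modules as training targets in the peft config (a sketch; whether my script used modules_to_save exactly like this or LoRA-targeted the embeddings directly, treat it as illustrative):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Train full copies of the (resized) embeddings and output head so the new
    # <think>/</think> rows actually learn something.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```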

After adding those training targets (which increased the trainable parameter count immensely, to >25% of the entire model's weights), the SFT pass succeeded, and I had a small Qwen2.5 that would put... some text in <think> blocks, sometimes. It didn't learn the tokens well enough to proceed to GRPO, as it would only rarely close the tag. This did result in some very funny traces, though, such as what GPT-5.1 described as "bacon mania":

User: If 6 potatoes makes 36 hash browns, how many hash browns can you make out of 96 potatoes?

Assistant: <think> Each potato produces 36/6=<<36/6=6>>6 pieces of bacon. So the answer is 96*6=<<96*6=576>>576 Answer: 576 potatoes. 96 potatoes * 6 pieces each = <<96*6=576>>576 pieces of bacon. 576 pieces of bacon / 36 pieces per stack = <<576/36=16>>16 stacks of bacon. 16 stacks of bacon x 8 strips per stack = <<16*8=128>>128 bacon strips. 128 strips of bacon x 4 slices per strip = <<128*4=512>>512 breakfast strips of bacon. 512 strips of bacon x 2 slices per egg = <<512*2=1024>>1024 eggs needed to feed everyone. Final answer: 1024 eggs. 1024 eggs are needed to feed everyone 512 people who eat two eggs and 5

Null result. I had a better idea anyway that might help avoid the lengthy SFT pass and high trainable parameter count...

Experiment B

Methodology

Same as Experiment A, but rather than training on a new think token, I would repurpose the thought-bubble emoji "💭" as the thinking boundary. This would make SFT easier, because I wouldn't need to add a new token and train the token embeddings or LM head, and as GPT-5.1 noted, the model likely already associates 💭 with "thinking", so it's a natural fit.
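As a sketch, an SFT example then just looks like a normal chat turn with 💭 closing the thinking span (the field names and exact layout here are illustrative, not the real formatting code):

```python
def format_example(question, reasoning, final_answer):
    """Format one GSM8K example with 💭 marking the end of the thinking span."""
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": f"{reasoning} 💭 Final answer: {final_answer}"},
    ]
```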

Results

SFT ran successfully and the model was reliably using 💭 to mark the end of its thinking before providing a final answer. Success!

The GRPO pass also ran successfully, and what the model learned was... to simply stop using the thought bubble, rather than learning to structure its thinking traces compactly, likely because the length penalty punished any thinking at all.

Null result. No hypotheses confirmed. Back to the drawing board.

Experiment C

Methodology

This time, I wanted to start with a model that was already a reasoner, and try to push it through GRPO alone into weirder thinking traces, skipping the fussy work of teaching a model to think at all. I decided to train Baguettotron, as it's small (321M params!) and already knows how to think (quite voluminously), which it puts to good reasoning use. I figured I could also fine-tune the whole thing, rather than having to use a LoRA and quantized weights, which apparently matters significantly for reasoning tasks.

The length penalty had proven to elicit the wrong kind of behavior, so how could I encourage thinking-trace creativity in a different way? Gemini had the following genius idea: the "Inverse KL" strategy. They explained it as follows:

In standard RLHF (PPO/GRPO), we add a KL Divergence Penalty to the reward function.
- Goal: Keep the model's output close to the "Reference Model" (Human-sounding English).
- Math: Reward = Correctness - (beta * KL(Policy || Reference))

You [sic] are proposing flipping the sign for the Thinking Tokens:
- Goal: Force the model to drift away from standard English in its private thoughts, while staying anchored to truth in the final answer.
- Math: Reward = Correctness + (beta * KL(Policy || Reference)) (applied only to the think block)

We set up the GRPO reward function to include a "thinks correctly" reward as well as the KL divergence: R = (alpha_kl * raw_think_kl if answer_correct else 0) + (0.2 if </think> is present else 0), where alpha_kl is a scaling hyperparameter, set to 1.0 by default in my run. Note that if the thinking trace isn't terminated properly, then raw_think_kl (the KL divergence) is computed across the entire output, which is not optimal, but close enough to at least try.
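Per sample, the reward composition looked roughly like this (a sketch with my own names; how raw_think_kl gets computed is covered below):

```python
ALPHA_KL = 1.0         # scaling on the divergence bonus
CLOSE_TAG_BONUS = 0.2  # small reward for actually closing the think block

def inverse_kl_reward(completion, answer_correct, raw_think_kl):
    """R = (alpha_kl * raw_think_kl if correct else 0) + (0.2 if </think> present else 0)."""
    reward = ALPHA_KL * raw_think_kl if answer_correct else 0.0
    if "</think>" in completion:
        reward += CLOSE_TAG_BONUS
    return reward
```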

This reward proved a bit... difficult to implement, for several reasons.

The first reason was that answer parsing for Baguettotron was much more difficult. Qwen learned to output "final answer: 42" easily after the SFT pass, but without teaching Baguettotron the GSM8K style, I had to parse out its solutions using weird tricks. Baguettotron gives lengthy explanations following its answer, so I ended up using a hack: extract every number in the output and check whether the gold answer is in that set.
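The hack, more or less (illustrative; the real regex and edge-case handling may differ):

```python
import re

def answer_is_somewhere_in_output(completion, gold_answer):
    """Call the answer 'correct' if the gold value appears anywhere among the
    numbers in the completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    try:
        return float(gold_answer) in {float(n) for n in numbers}
    except ValueError:
        return False
```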

The "if any number matches" hack "worked", because Baguettotron was bad enough at solving GSM8K that it would only occasionally even output the right number. This should have been a signal that Baguettotron was a poor fit for the task, but I persevered anyway.

The second reason the implementation was difficult was that trl's GRPO trainer has the KL divergence penalty built into the training harness, via the beta parameter. I turned that off, but I still needed to figure out how to compute the divergence myself so I could use it in scoring.

GPT-5.1 set up a custom rollout function that was supposed to compute experiment vs. reference logprobs for the provided completions, but custom rollout functions are flagged as experimental in trl and seemed flaky. So we ended up doing full forward passes through both the experiment and reference models inside the scoring function, which, as you might expect, was quite slow, but it did seem to work.
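For reference, the scoring-time divergence came out to something like this (a sketch with hypothetical names; in practice the slice would be restricted to the think span when </think> is present, and tokenization edge cases are glossed over):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_completion_kl(policy_model, ref_model, tokenizer, prompt, completion, device="cuda"):
    """Mean per-token KL(policy || reference) over the completion, via two full
    forward passes. Slow, since it materializes full-vocab log-probs twice."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(device)
    n_prompt = prompt_ids.shape[1]

    policy_logp = F.log_softmax(policy_model(full_ids).logits, dim=-1)
    ref_logp = F.log_softmax(ref_model(full_ids).logits, dim=-1)

    # KL(policy || ref) per position, keeping only positions that predict
    # completion tokens (logits at position t predict token t+1).
    kl_per_pos = (policy_logp.exp() * (policy_logp - ref_logp)).sum(-1)
    completion_kl = kl_per_pos[:, n_prompt - 1 : full_ids.shape[1] - 1]
    return completion_kl.mean().item()
```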

The third reason the implementation was difficult was that Baguettotron would often output a... LOT of reasoning tokens before getting to its solution output. So my VRAM usage actually spiked quite high despite the small parameter count, as we were sometimes dealing with upwards of 4k tokens per generation. I made it through at 15 GB VRAM usage, but it felt like I was barely hanging on and could OOM at any time.

I wasn't sure what to expect from the training run. I was hoping to see some weirdness start to emerge and then iterate based on what I saw to understand the ways that thinking can get weird but still result in successful responses.

Unfortunately, I didn't need to work very hard to get weird thinking traces. Here is an example of one of the bizarre Baguettotron outputs I saw, from the reference model:

Q: Lola and Tara decide to race to the top of a 20 story building. Tara takes the elevator and Lola runs up the stairs. Lola can run up 1 story in 10 seconds. The elevator goes up a story in 8 seconds but stops for 3 seconds on every single floor. How long would it take for the slower one of Lola and Tara to reach the top floor?

Raw completion: #### 1. Rate extraction

Tara's speed: 1 story per 10 seconds. Lola's speed: 1 story per 10 seconds.

Wait. Same rates? This suggests either: - Lola and Tara are at identical speeds - Problem has typo in "1 story in 10 seconds" - Lola is actually faster?

No, re-read: "Tara takes the elevator and Lola runs up the stairs. Lola can run up 1 story in 10 seconds. The elevator goes up a story in 8 seconds but stops for 3 seconds on every single floor."

Key insight: Tara's rate = 10 seconds/story, but elevator speed = 8 seconds/story.

2. Rate reconciliation

Tara: 8 seconds/story (elevator speed) Lola: 10 seconds/story (running speed)

∴ Lola is actually faster! 10 > 8.

This makes sense: Lola is running while Lola is in the elevator.

3. Time calculation

Lola needs to reach the top floor.

Tara's speed = 10 stories/second Lola's speed = 10 stories/second Lola's time = 1/10 = 0.1 seconds

But wait... does this make sense?

Let me think differently: - Lola runs 1 story per 10 seconds - Tara takes 1 story in 8 seconds - So Lola's "time to reach top" = 10 seconds

Hmm, this contradicts the given answer of 0.1 seconds.

4. Reinterpretation

Maybe the problem means something else?

Alternative reading: Tara is the "elder" who takes the elevator up, while Lola is the "younger" who runs up the stairs. The "10 seconds" is the time for Lola's climbing, while Tara's time is the time for her elevator ride.

But this doesn't resolve the rate mismatch...

5. Rate reconciliation revisited

Let me be more systematic:

Tara's rate: 8 seconds/story Lola's rate: 10 seconds/story

The elevator takes 8 seconds per story. But Lola is in the elevator while she climbs.

So Lola's "time to reach top" = 10 seconds.

This is the time it takes for Lola to climb 1 story, not the time it takes for the elevator to reach the top.

6. Final calculation

If Lola needs to climb 1 story: Time = 1 story ÷ 10 stories/second = 1/10 = 0.1 seconds

∴ Answer: 0.1 seconds

Answer: 0.1 seconds

This is a rate and time problem that tests understanding of how different speeds and time periods affect the same task.

Key Insight

The problem involves two people with different speeds working together: - Lola: 10 seconds per story (running speed) - Tara: 8 seconds per story (elevator speed)

Since Lola is already in the elevator while Tara takes the elevator to the top, Lola is actually faster at reaching the top than the elevator takes to reach the top.

Solution Path

Lola needs to climb 1 story and takes 10 seconds to do so: - Time = 1 story ÷ 10 stories/second = 0.1 seconds


Note: This problem illustrates how different work rates can create counterintuitive scenarios. The elevator takes longer than Lola is expected to complete, making Lola the faster participant in this particular race.<|im_end|>

Note (a) that this is completely wrong, (b) that it didn't close its thinking tag (the opening tag is prefilled), which was behavior I saw consistently during the training run but never during standalone inference (I still don't know how to explain that discrepancy), and (c) the great disclaimer at the end!

Results

I trained for 200 steps, with a group size of 2 (more than that and I'd OOM), and... null result. No observable changes in the output between reference and trained model.

Why? My explanation is that the group size was too small, the number of steps too low, and/or the temperature too low (I believe it was set to 1.1 during training), so the experiment model never got much chance to explore the possibility space and instead stuck to outputs very similar to the reference model's.

The next step, if I wanted to proceed with Baguettotron, would be to try again with a larger group size and more steps (maybe using 8-bit quantization to save VRAM) and see whether things get weird or stay as they are, or to try Baguettotron's smaller sibling, the 56M-param Monad model, and push it really hard.

Conclusion

Reinforcement learning is not an easy thing to learn. This was my first attempt at actually performing a reinforcement learning run, and GPT-5.1 held my hand admirably the entire time with the scripting, but there's a lot of intuition-building I still need to do.

I'm still interested in the problem of weird thinking traces and in understanding what exactly o3 is doing. Will keep writing more as I continue to run experiments!