Discussion about this post

JP:

The sudoku rating system is a great test case. Domain knowledge plus Claude Code is where the real leverage sits, and you've demonstrated that well with the evaluation framework approach.

Your point about the development pendulum rings true. I've hit that exact loop where the model keeps proposing the same fix you've already rejected. One thing that's helped me is swapping models mid-session. Different models have different blind spots, so when Claude gets stuck in a rut, switching to GPT-5.4 or Gemini sometimes breaks the cycle. I wrote up how to do that without leaving Claude Code here: https://reading.sh/claude-code-how-to-run-any-model-gpt-5x-gemini-3-1-stealth-inside-it-e67e957e53c3

Have you tried any other models for the data work, or have you stuck with Claude exclusively?
