Source URL: https://www.tomtunguz.com/evolution-of-ai-judges-improving-evoblog/
Source: Tomasz Tunguz
Title: When One AI Grades Another’s Work
Feedly Summary: Since launching EvoBlog internally, I’ve wanted to improve it. One way of doing this is having an LLM judge the best posts rather than a static scoring system.
I appointed Gemini 2.5 to be that judge. This post is a result.
The initial system relied on a fixed scoring algorithm. It counted words, checked readability scores, & applied rigid style guidelines, which worked for basic quality control but missed the nuanced aspects of good writing.
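As a rough illustration, a static scorer along those lines might look like the sketch below. This is hypothetical, not EvoBlog's actual code; the target word range, the passive-voice regex, and the use of the textstat library for readability are all my assumptions.

```python
import re

import textstat  # pip install textstat

def static_score(post: str) -> float:
    """Score a draft on word count, readability, and one rigid style check."""
    words = len(post.split())
    word_score = 1.0 if 800 <= words <= 1200 else 0.5           # arbitrary target length
    readability = min(max(textstat.flesch_reading_ease(post), 0), 100) / 100
    # Rigid style rule: dock points for passive constructions like "was written by".
    passive_hits = len(re.findall(r"\b(?:is|was|were|been)\s+\w+ed\s+by\b", post))
    style_score = max(0.0, 1.0 - 0.1 * passive_hits)
    return round((word_score + readability + style_score) / 3, 3)
```

Checks like these are cheap and deterministic, but they can't answer the questions that matter.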
What makes one paragraph flow better than another? How do you measure authentic voice versus formulaic content?
EvoBlog now takes a different approach. Instead of static rules, an LLM evaluator scores each attempt across five dimensions: structure flow, opening hook, conclusion impact, data integration, & voice authenticity.
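A minimal sketch of what that judge call could look like, using the google-generativeai Python SDK. The model name, rubric prompt, 0-10 scale, and JSON format are my assumptions, not EvoBlog's implementation.

```python
import json
import os

import google.generativeai as genai  # pip install google-generativeai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.5-pro")  # model name is an assumption

DIMENSIONS = [
    "structure_flow", "opening_hook", "conclusion_impact",
    "data_integration", "voice_authenticity",
]

def judge_post(draft: str) -> dict[str, float]:
    """Ask the judge model to score a draft 0-10 on each dimension, returned as JSON."""
    prompt = (
        "Score the blog draft below from 0 to 10 on each of these dimensions: "
        + ", ".join(DIMENSIONS)
        + '. Respond with JSON only, e.g. {"structure_flow": 7, ...}.\n\n'
        + draft
    )
    reply = judge.generate_content(prompt).text
    # Strip the Markdown code fences the model sometimes wraps around JSON.
    cleaned = reply.strip().strip("`").removeprefix("json").strip()
    return json.loads(cleaned)
```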
The theory is that magic happens in the iterative refinement cycle.
After each generation round, the system analyzes what worked & what didn’t. Did the opening hook score poorly? The next iteration emphasizes stronger first paragraphs. Was data integration weak? The next round pushes for more supporting data.
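Here's a hedged sketch of that refinement loop, reusing judge_post from above; generate_draft is a hypothetical stand-in for whatever call produces the next draft.

```python
def refine(topic: str, iterations: int = 20) -> tuple[str, dict]:
    """Generate, judge, and refine drafts, keeping the best-scoring one."""
    best_draft, best_scores, feedback = "", {}, ""
    for _ in range(iterations):
        draft = generate_draft(topic, feedback)   # hypothetical generator helper
        scores = judge_post(draft)                # judge sketch above
        if not best_scores or sum(scores.values()) > sum(best_scores.values()):
            best_draft, best_scores = draft, scores
        # Feed the weakest dimension back into the next round's prompt.
        weakest = min(scores, key=scores.get)
        feedback = f"The previous draft scored lowest on {weakest}; strengthen it."
    return best_draft, best_scores
```

The design choice in this sketch is to feed only the weakest dimension back into the next prompt, rather than the full score breakdown.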
The LLM judge experiment yielded mixed results. The chart shows swings in performance across 20 iterations, with no clear convergence pattern. The best run achieved 81.7% similarity to my writing style, a 3.1 percentage point improvement over the initial 78.6%.
But the final iteration scored 75.4%, actually worse than where it started.
The LLM as judge sounds like a good idea. But the non-deterministic nature of both the generation & the grading doesn’t produce reliably better results.
Plus it’s expensive. Each 20-iteration run requires about 60 LLM calls, or about $1 per post. So, maybe not that expensive!
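For reference, the per-call arithmetic implied by those numbers (the even split of calls across iterations is my assumption):

```python
# Back-of-the-envelope from the post's numbers: ~60 LLM calls per 20-iteration run, ~$1 per post.
calls_per_run = 60                    # from the post
cost_per_post = 1.00                  # dollars, from the post
print(f"~${cost_per_post / calls_per_run:.3f} per call")  # ≈ $0.017 per call
```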
But for now, the AI judge isn’t all that effective. The verdict is in: AI judges need more training before they’re ready for court.
AI Summary and Description: Yes
Summary: The text discusses an internal initiative to use a large language model (LLM), specifically Gemini 2.5, to evaluate the quality of blog posts in a more dynamic and nuanced way than static scoring systems. The results of this experiment reveal both potential benefits and challenges, emphasizing the need for further training and refinement of AI judgment capabilities in content evaluation.
Detailed Description: The text outlines an experiment with EvoBlog that employs an LLM for assessing blog quality instead of relying on a fixed algorithm. Key points include:
– **Initial Scoring System**:
– Originally utilized a static scoring approach based on word count, readability, and rigid style guidelines.
– This method lacked the ability to assess the more subjective qualities of writing, such as flow and voice authenticity.
– **LLM Integration**:
– The new method involves the LLM scoring posts across five dimensions:
– Structure flow
– Opening hook
– Conclusion impact
– Data integration
– Voice authenticity
– The intent is for the LLM to enable a more nuanced understanding of quality through iterative learning and refinement.
– **Iterative Refinement Cycle**:
– After each iteration of blog post generation, the LLM analyzes performance and adjusts focus areas based on previous scores.
– Early outcomes showed variability, with performance fluctuating across iterations, demonstrating the non-deterministic nature of LLM-generated content.
– **Performance Results**:
– The most successful iteration achieved an 81.7% similarity to the author’s writing, suggesting an improvement.
– However, some iterations, including the final one, performed worse than the initial results, illustrating inconsistency.
– **Cost Considerations**:
– While the process can be seen as expensive, at about $1 per post for roughly 60 LLM calls across a 20-iteration run, this cost may not be prohibitive given the potential for improved content quality.
– **Overall Verdict**:
– The experiment indicates that while using an LLM as an evaluator has merit, current results suggest that such AI tools require further training and development to become reliable judges of written content quality.
This analysis highlights the challenges and considerations for security and compliance professionals working with AI systems, particularly around the importance of training and variability in AI outputs, which can have implications for automation in content creation and evaluation.