The Line of Perfect

· Jacob E. Dawson

I remember when GPT-2 first came out. Gwern was talking about it back in 2019, and while it seemed like a toy, it was definitely on the radar as transformers showed that there was another way to approach neural network architecture. Still, it was just a toy, and at the time I didn't pay too much attention to the cute AI software that could write quirky, almost plausible-sounding poems.

By 2020 GPT-3 had arrived, and with it much better writing, often indistinguishable from human writing unless you were paying attention. The first attempts at having language models write code and build frontend components began going viral on the net - I remember being blown away that I could tell a machine to write a React component with a button and an increment function - it was unreal. Still, they weren't turning out anything production-worthy, and there were other things to think about in 2020.

In early 2023 GPT-4 came out and things went bananas. For the first time, I think, a mainstream audience really started to see the abilities of AI improving in real time - suddenly certain careers (like copywriting) seemed on the cusp of being eradicated. These machines could write reasonable full-page components and simple applications, they could help debug issues, they could offer advice on improving functions. But they failed amusingly in some ways: they hallucinated API endpoints and became stuck halfway through edits, spitting out mangled code. They had a long way to go.

The releases continued. Claude arrived. Open-source models came onto the scene.
GPT-4.5. Claude Opus. Etc., etc.

In a few short years we've come from small models that could barely churn out a comprehensible sentence, to models that are being gatekept and embargoed by the US Government out of fear of cybersecurity nightmares only dreamed of in the gnarliest of science fiction.

Yet, for those of us using these models every day, professionally, prodding and poking at their jagged edges, they still aren't perfect. My own assessment of their abilities oscillates almost daily - one day it feels magical, and terrifying, and threatening. The next day a SOTA model seems to make rookie errors and take the hardest path to a simple solution. It doesn't help that behind the scenes the providers of frontier LLMs are silently quantizing, re-routing, and shifting compute. Some days you really are getting a dumber model, and when that magical day comes, you never know whether you've just been placed in the A cohort testing a new model.

So, the line of "perfect" is not one that will be clearly drawn in the sand, a threshold that we will knowingly step across, from before to after. It's more like a fuzzy border between good and better that we will wobble across uncertainly, and only by looking backwards at where we came from will we truly see how far we've come.

Where we're going? I'm not sure anyone knows.