> i tried running npm test and it was completely borked
> one problem, of course, was that the tests were entirely bullshit.
> whenever it would start getting lazy or confused i'd restart the session. often a failure would "demoralize" it or being sloppy once would cause sloppiness to stick. in particular i've noticed that being overwhelmed causes it to approach problems in a messy "throw anything against the wall" way. sometimes if too many newly un-skipped tests are causing failures and it got "demoralized", i'd just skip them again and have it focus on one or two at a time. with less noisy output and a permission to "really dig into what happened" (and often an explicit suggestion to remove things from the example until it no longer breaks), it would usually find the root cause.
> we got to a majority of passing tests but there were a bunch of bugs it just couldn't solve and would walk in circles. the code was also getting quite complicated. it seemed like a mess of different ideas and special cases thrown in. moreover, i knew it didn't fully work because i had new test cases that just would refuse to pass
> i tried to let it do that and it just failed miserably anyway, breaking tests and not being able to recover.
> it struggled at first but i reminded it to look at other emitters.
> at one point, newly added tests kept confusing claude. it would completely get stuck on them, failure after failure, fixing one thing and breaking another, trying to turn off those tests or change the expected outputs (despite me telling it to never do that!) and in general seeming aimless and distraught (in the descriptive sense).
> i had to git reset --hard multiple times in this mess.
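The "remove things from the example until it no longer breaks" tactic the author describes is essentially test-case shrinking (the idea behind delta debugging). A minimal sketch of the greedy version, where `shrink` and the `breaks` predicate are illustrative names, not anything from the post:

```python
# Greedy "shrink the failing example" loop: repeatedly try dropping one piece
# of the input and keep any drop after which the failure still reproduces.
# `breaks` stands in for whatever check you (or the agent) actually run.

def shrink(parts, breaks):
    """Return a smaller list of parts that still satisfies `breaks`."""
    changed = True
    while changed:
        changed = False
        for i in range(len(parts)):
            candidate = parts[:i] + parts[i + 1:]
            if candidate and breaks(candidate):
                parts = candidate  # this piece wasn't needed to reproduce
                changed = True
                break
    return parts

if __name__ == "__main__":
    # Toy repro: the "bug" triggers whenever both "a" and "c" are present.
    example = ["a", "b", "c", "d", "e"]
    print(shrink(example, lambda p: "a" in p and "c" in p))  # -> ['a', 'c']
```

The payoff is exactly what the author observed: a small repro means less noisy output, so the model (or a human) can actually see the root cause instead of flailing.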
A project that would have taken months got done in a weekend, by the author's own estimate.
I've experienced the same - contributing a very large PR to a golang project (without knowing or having worked with the language prior). I did it because I could talk through abstractions, was willing to go down dead ends (a 1:3 success ratio for every meaningful feature), and was OK with the fun of redoing. Once you are able to do this, you literally become a 10X engineer when measured by working output.
If this process of trying and discarding 2 out of every 3 approaches sounds distasteful, you will not truly discover the deeper joys of working with the SOTA LLMs.
> maybe my project is a toy (it is) or you think it's poor quality (it's not) but i'm able to do things in minutes that used to take days
Just consider what this will be like as it gets better. Remember, we've had working coding agents for less than a year.
People are excited not because it's fun to fight with the damn things. It's not! We're excited despite that!
I remember my old Nokia 6682. It was an early smartphone that ran S60 and I had a screen reader, basic IM client, and a few other apps including a web browser installed. It was awkward to use. It was frustrating. The connection was dog-slow. And it was cool as hell--a little slice of the future in my pocket.
I remember my Windows 98 (first edition) machine with JAWS for Windows 3.2, trying to use the early web; before they had the concept of the virtual cursor. Before any of this accessibility stuff was at all standardized, when we got what we could by scraping screen buffers and injecting into other processes. And damn it was so cool. So obviously the future that we put up with the jank.
Here we are again. Annoying to use? Often! Remarkable? Hell yeah!
Except this time we have people combing through every sentence to extract only the negative ones from a 40kb success story--I do at least hope you used an LLM for this.
Your counter to the idea that AI and LLMs will improve over time (as they massively have over the past few years) is picking one example of a technology that hasn't improved much in 50 years?
—
tldr; seems like a very nice experience, yeah