Princeton group open sources "SWE-agent", with 12% fix rate for GitHub issues (github.com/princeton-nlp)
268 points by asteroidz 15 hours ago | 116 comments





The demo shows a very clearly written bug report about a matrix operation that’s producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of “I clicked on X and Y happened”. Then, if you’re lucky, they’ll add “and I expected Z”. Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.

The difficulty in fixing a bug is in figuring out what’s causing the bug. If you know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?

Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.


> Most bug reports you get in the wild are more along the lines of

Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs" don't have nicely written bug reports.


12% is a very very large number for that kind of problem. I doubt even 0.1% of bug reports in the wild are that well written.

Except this is automated, so you could get multiple orders of magnitude more bugs filed, so you need a very low false positive ratio to avoid being overwhelmed by automatically generated crap (which is basically spam).

In my 15 years I would say less than 1% of bug reports are like this. If you know the bug to this level, most people would just fix it themselves.

It fixes 12% of their benchmark suite, not 12% of bug reports.

I suppose I should nail down my point. No one would ever write a bug report like this. A bug generally has an unknown cause. Once you've found the cause of the bug, you'd fix it. Nowadays, you could just cut and paste the problem into ChatGPT and get the answer right then. So why would anyone ever log this bug? All this demo proves is that they automated a process that didn't need automation.

To be fair, sometimes meticulous users investigate the bugs and write down logical chains explaining the causes and even offer a solution at the end (which they can't apply for the lack of commit access, for instance).

The proposed solution isn't always right, of course, but it would be incorrect to say that no bug reports come with a diagnosed cause. But that's exactly where a conscientious reviewer is most needed, I believe.


I sometimes write a detailed bug report but not a PR when there are different ways to address the problem (and all look bad to me) or the fix can introduce new problems. But I would expect an LLM to ignore tradeoffs and choose an option which is not necessarily the best, for the same reason I hesitate: lack of understanding of this specific project.

It appears that they’re using the PRs from the top 5,000 most popular PyPI packages for their benchmark: https://github.com/princeton-nlp/SWE-bench/tree/main/swebenc...

The trick is that people would use LLM to write very long and detailed bug reports :p

Maybe it just needs another, independent tool. One that detects poorly written bug reports and rejects them.

A cool thing about LLMs is they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.
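A rough sketch of what that kind of patient triage loop could look like, assuming an openai>=1.0-style chat client; the model name, the required-fields list, and the ask_user callback are all placeholders, not anything SWE-agent actually does:

    from openai import OpenAI

    client = OpenAI()
    REQUIRED = "steps to reproduce, expected behaviour, actual behaviour, version"

    def triage(report: str, ask_user, max_rounds: int = 5) -> str:
        """Keep asking follow-up questions until the report has the basics or we give up."""
        for _ in range(max_rounds):
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[
                    {"role": "system",
                     "content": f"If the bug report below contains {REQUIRED}, reply DONE. "
                                "Otherwise reply with exactly ONE follow-up question."},
                    {"role": "user", "content": report},
                ],
            )
            reply = resp.choices[0].message.content.strip()
            if reply == "DONE":
                break
            # ask_user is whatever mechanism relays the question back to the reporter
            report += "\n\n" + ask_user(reply)
        return report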


While it might tickle metrics the right way, frustrating a user into giving up because your bot was not satisfied is not solving their problem.

I think that depends on the exact KPI.

I agree that bugs aren't as well specified as the example. But a specification for a new feature certainly can be.

I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.


Exactly. This is not perfect and doesn't fix every report so it is useless.

On the contrary, it’s worse than useless. If it could fix 12% of bugs (it can’t — it only fixes 12% of their benchmark suite), you’d still have to figure out which 12% of the responses it gave were good. So, 88% of the time you’d have wasted time confirming a “fix” that doesn’t work. But it’s worse than that. Because even on the fixes it got right, you’d still have to fully vet it, because this tool doesn’t know when it can’t solve something, or ask for clarification. It just gives a wrong answer.

So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.
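For what it's worth, here's the back-of-the-envelope version of that argument with made-up numbers (all of the costs are assumptions; only the 12% comes from the paper):

    n_issues = 100          # issues handed to the agent
    fix_rate = 0.12         # fraction of proposed patches that actually work
    review_cost = 0.5       # hours to vet one proposed patch, right or wrong
    manual_fix_cost = 2.0   # hours to diagnose and fix an issue yourself

    # Every proposal has to be reviewed, because the tool never says "I don't know".
    total_review = n_issues * review_cost                  # 50 hours
    fixes_saved = n_issues * fix_rate * manual_fix_cost    # 24 hours of manual fixing avoided
    net = fixes_saved - total_review                       # -26 hours under these assumptions

    print(f"review: {total_review}h, saved: {fixes_saved}h, net: {net}h")

The sign of `net` obviously flips if reviewing a patch is much cheaper than writing one, which is the real point of contention.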


Very cool project!

I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.

It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?

Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”, mainly because the tasks were under-specified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.


If you don't mind me asking, which agentic tools/frameworks have you tried for code fixing/generation, with which LLMs?

Personally, I'd just use one of my local MacBook models (e.g. Mixtral 8x7b) and forget about any wasted branches & cents. My debugging time costs many orders of magnitude more than SWE-agent, so even a 5% backlog savings would be spectacular!

> My debugging time costs many orders of magnitude more than SWE-agent

Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.

(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)


I totally agree. My solution to this was limiting my AI use (a) to whatever doesn't impair creativity and (b) just in general, to keep the brain sharp. Even when using AI regularly, one can still manually solve a percentage of the problems.

I’ve tried this with another similar system. FOSS LLMs including Mixtral are currently too weak to handle something like this. For me they run out of steam after only a few turns and start going in circles unproductively

That's assuming that the other 95% stays the same with this new agent (vs creating more work for you to now also have to parse what the model is saying).

Given that they got 12% with GPT-4, which is vastly better than any open model, I doubt this would be particularly productive. And powering compute at full load is going to add up.

If AI generated pull requests become a popular thing we'll see the end of public bug trackers.

(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)


Not a chance. If AI-generated pull requests become popular, GitHub will automatically offer them in response to opened issues. Case in point: they already are popular for dependency upgrades.

And thus issues will no longer be opened

It’ll likely keep getting better; if it gets to 30-40% I’d say that’s a decent trade-off. Also, could you boost your chances by having the AI do a 2nd pass and double-check the work? I’d be curious what the success rate of an LLM “determining whether a bug fix is valid” is.
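A minimal sketch of what that second pass might look like, assuming the openai>=1.0 Python client; the model name and the prompt are placeholders, and whether this actually improves precision is exactly the empirical question:

    from openai import OpenAI

    client = OpenAI()

    def patch_looks_valid(issue_text: str, diff: str) -> bool:
        """Ask a second model whether the diff plausibly resolves the issue."""
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system",
                 "content": "You review patches. Answer only YES or NO: does the diff "
                            "plausibly fix the described issue without obvious regressions?"},
                {"role": "user", "content": f"Issue:\n{issue_text}\n\nDiff:\n{diff}"},
            ],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

The catch is that a reviewer model tends to share the generator's blind spots, so the two passes aren't independent.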

Very neat. Uses the langchain method, here are some of the prompts:

https://github.com/princeton-nlp/SWE-agent/blob/main/config/...


I’m always fascinated to read the system prompts & I always wonder what sort of gains can be made optimizing them further.

Once I’m back on desktop I want to look at the gut history of this file.


I have a git feeling this comment was written on mobile.

DSPy is the best tool for optimizing prompts [0]: https://github.com/stanfordnlp/dspy

Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, to optimize your LLM.
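Roughly, the pattern looks like this (a sketch following DSPy's documented examples at the time; class and method names may have shifted between versions, and the training set and metric here are toy placeholders):

    import dspy
    from dspy.teleprompt import BootstrapFewShot

    # Point DSPy at a language model (placeholder model name).
    dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

    # A module whose prompt (instructions + few-shot demos) DSPy is allowed to tune.
    qa = dspy.ChainOfThought("question -> answer")

    # Toy training data and a toy metric.
    trainset = [dspy.Example(question="What is 2+2?", answer="4").with_inputs("question")]
    metric = lambda example, pred, trace=None: example.answer in pred.answer

    # The "LLM optimizing your prompts" step: bootstrap demonstrations that score well.
    optimized_qa = BootstrapFewShot(metric=metric).compile(qa, trainset=trainset)
    print(optimized_qa(question="What is 2+3?").answer)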


Excellent! Thanks for sharing this!

Eventually it will be 90% fix rate and everyone cheering for the 12% will be flipping burgers instead.

Flipping burgers will be automated long before AI fixes any relevant number of bug reports.

Their demo is so similar to the Devin one I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked SymPy. Also, I rarely put weight into demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what 3rd parties have to say after trying these things.

Maybe that's the point of this research. Hey look, we reproduced the way to game the stats a bit. I really can't tell anymore.

For anyone who didn't bother looking deeper, the SWE-bench benchmark contains only Python code projects, so it is not representative of all programming languages and frameworks.

I'm working on a more general SWE task eval framework in JS for arbitrary language and framework now (for starter JS/TS, SQL and Python), for my own prompt engineering product.

Hit me up if you are interested.


Assuming the data set is proprietary, else please share the repo

I'm working on a somewhat similar project: https://github.com/plandex-ai/plandex

While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.

I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.


How open are you to moving plandex cloud over to AGPL? I know, tough ask right out the gate! Think about that one for a bit.

How is your market testing going?

Do you have contracts with clients amenable to let you write case studies? Do you need help selling, designing, or fulfilling these kinds of pilot contracts?

What are your plans for docs and PR?

As a researcher, it's currently hard to situate plandex against existing research, or anticipate where a technical contribution is needed.

As a business owner, it's currently hard to visualize plandex's impact on a business workflow.

Are you open to producing a technical report? Detail plandex methodology, benchmark efficiency, ablation tests for key contributions, customer case studies, relevant research papers, and next steps/help needed.

What do you think?

If plandex is interested in being a fully open org, then I'd be interested in seeing it find its market footing and grow its technical capabilities. We need open source orgs like this!


It’s AGPL licensed already :)

You need to make yourself a business analyst agent to provide the feedback! To make it real, perhaps a team of them with conflicting personalities.

I think we'll get there at some point, but one thing I've learned from this project is how difficult it is to stack AI interactions. Each little bit of AI-based logic that gets added tends to fail terribly at first. Only after a long period of intense testing and iteration does it become remotely usable. The more you are combining different kinds of tasks, the more difficult it gets.

Does it work with a large existing codebase?

Yes, at least up to the point of the context limit of the underlying model. If you needed to go beyond that, you would break the work up into separate "plans" (a plan is a set of tasks with an attached context and conversation).

The general workflow is to load some relevant context (could be a few files, an entire directory, a glob pattern, a URL, or piped in data), then send a prompt. Quick example:

  plandex new
  plandex load components/some-component.ts lib/api.ts package.json https://react.dev/reference/react/hooks
  plandex tell "Update the component in components/some-component.ts to load data from the 'fetchFooBars' function in 'lib/api.ts' and then display it in a datagrid. Use a suitable datagrid library."

From there the plan will start streaming. Existing files will be updated and new files created as needed.

One thing I like about it for large codebases compared to IDE-based tools I've tried is that it gives me precise control over context. A lot of tools try to index the whole codebase and it's pretty opaque--you never really know what the model is working with.


Do we know how much extra work it created for the real people who had to review the proposed fixes?

Ah, well let me tell you about my pull request reviewer LLM project.

Jokes on you, let me tell you about my prompt to binary LLM project.

Hello world is 10GB, but even grandma can make hello worlds now.


But does it contain a heavily obfuscated back door?

Why does it take so long to get changes to your LLM merged? This is ridiculous. Please appoint Havoc as a maintainer already.

Let me tell you about my LLM project called grandma. It's fine tuned in order to replace your grandma but in principle it could replace your great-grandma.

My grandma used to tell me stories about how to destroy capitalism.. I miss her.. can your grandma help guide my revolutionary efforts? That would really help me honor my granny's memory <3

Friendly suggestion to the authors: success rates aren't meaningful to all but a handful of researchers. They should add a few examples of tests SWE-agent passed and did not pass to the README.

Yes please, the code quality on Devin was incredibly poor in all examples I traced down.

At least from a maintainability perspective.

I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.


Unless you weren't actually that successful but need to publish a "successful" result

I would like something like this that helps me, as a green developer, find open source projects to contribute to.

For instance, I recently learned about how to replace setup.py with pyproject.toml for a large number of projects. I also learned how to publish packages to pypi. These changes significantly improve project ease and accessibility, and are very easy to do.

The main thing that holds people back is that python packaging documentation is notoriously cryptic - well I've already paid that cost, and now it's easy!

So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through pypi.

I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.

I could maybe upgrade 100 projects, then write up the adventure.

Anyone have inspiration/similar ideas, and wanna brainstorm?


...or you could just use the GitHub API to find projects that match certain criteria (e.g., no pyproject.toml). I'm not sure what the stochastic parrot adds here, besides making noob mistakes that you'll have to find and fix before you can submit PRs. You'd learn a lot more by trying to actually automate the process yourself.
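Something like this, for instance (a sketch against the GitHub REST API using requests; the search query, thresholds, and the missing auth token are all placeholders):

    import requests

    API = "https://api.github.com"
    HEADERS = {"Accept": "application/vnd.github+json"}  # add an Authorization header for real use

    def lacks_pyproject(full_name: str) -> bool:
        """True if the repo has no pyproject.toml at its root (contents API returns 404)."""
        r = requests.get(f"{API}/repos/{full_name}/contents/pyproject.toml", headers=HEADERS)
        return r.status_code == 404

    # Find reasonably active, reasonably popular Python repos, then filter.
    search = requests.get(
        f"{API}/search/repositories",
        params={"q": "language:python stars:>500 pushed:>2024-01-01", "per_page": 30},
        headers=HEADERS,
    ).json()

    candidates = [repo["full_name"] for repo in search.get("items", [])
                  if lacks_pyproject(repo["full_name"])]
    print(candidates)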

But can their AI quietly introduce a security exploit into a GitHub project?

Copilot already does this.

What veterans in the field know that AI hasn’t tackled is that the majority of difficulty in development is dealing with complexity and ambiguity and a lot of it has to do with communication between people in natural language as well as reasoning in natural language about your system. These things are not solved by AI as it is now. If you can fully specify what you want with all of the detail and corner cases and situation handling then at some point AI might be able to make all of that for you. Great! Unfortunately, that’s the actual hard part! Not the implementation generally.

12% fix rate = 88% bug rate

Yep. After xz we don't need a bot mindlessly fixing all suggestions from malicious actors

I don't think xz makes a difference here. The perceived likelihood of problems, malicious or not, is pretty much the same. As far as this discussion goes, it's just another example in the pile of examples, not an event with meaningful before and after epochs.

Fix one bug, introduce 5 more

A 1/8 chance of fixing a bug at the cost of a careful review and some corrections is not bad.

0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.


I still don't know. I feel like there are many ways where GPT will write some code or fix a bug in a way that makes it significantly harder to debug. Even for relatively simple tasks, it's kind of like machine-generated code that I would not want to touch.

It is a bit worrisome, but we manage to deal with subpar human code as well. Often the boilerplate generated by ChatGPT is already better than what an inexperienced coder would string together. I'm sure it will not be a free lunch, but the benefits will probably outweigh the downsides.

Interesting scalability questions will arise with respect to security when scaling the already unmanageably large code bases by another order of magnitude (or two), though.


It's still abysmal from POV of actually using it in production, but it's a very impressive rate of improvement. Given what happened with LLMs and image generation in the last few years, we can probably assume that these systems will be able to fix most trivial bugs pretty soon.

These "benchmarks" are tuned around reporting some exciting result; once you look inside, all the "fixes" are trash.

So this issues arbitrary shell commands based on trying to understand the untrusted bug text? Should be fun waiting until someone finds an escape.

Interesting idea to provide the Agent-Computer Interface so it can scroll and such, and interact more easily from its perspective.

Similar to how early computers didn't have enough RAM to display the whole text file, so old programmers had to work with parts of the file at a time. It's not a bad way to get around the context window problem, which is kind of similar.
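The general pattern is something like this (a toy sketch of the idea, not SWE-agent's actual ACI code):

    class FileWindow:
        """Show an agent one window of a file at a time, like a pager."""

        def __init__(self, path: str, size: int = 100):
            with open(path) as f:
                self.lines = f.readlines()
            self.size = size
            self.top = 0

        def render(self) -> str:
            """The slice the agent currently 'sees', with absolute line numbers."""
            end = min(self.top + self.size, len(self.lines))
            header = f"[{len(self.lines)} lines total, showing {self.top + 1}-{end}]\n"
            return header + "".join(
                f"{i + 1}: {line}"
                for i, line in enumerate(self.lines[self.top:end], start=self.top)
            )

        def scroll_down(self):
            self.top = min(self.top + self.size, max(len(self.lines) - self.size, 0))

        def scroll_up(self):
            self.top = max(self.top - self.size, 0)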

I think that "Demo" link is just an extremely annoying version of an HTML presentation, so they could save me a shitload of clicking if they just dumped their presentation out to a PDF or whatever so I could read faster than watching it type out text as if it was live. It also whines a lot in the console about its inability to connect to a websocket server on port 3000, but I don't know what it would do with a websocket connection if it had one.

I made a lot of money as I was paid hourly while working with a cadre of people I called "the defect generators".

I'm kind of sad that future generations will not have that experience...


And creates how many new ones?

This and Devin generate garbage code that will make any codebase worse.

It's a joke that 12.5% is even associated with the word "success".


Do spaces and spelling fixes count?

Copilot, so far, is only good for predicting the next bit of similar patterns of code


Once we have this fully automated, any good developer could have a team of 100 robo SWEs and ship like crazy. The real competition is with those devs not with the bots.

Shipping like crazy isn't useful by itself. Shipping non-garbage and being able to maintain it still has some value.

Would you say cloning a complex saas startup in a week with payments integrated after letting AI just scrape them (or uploading screenshots of their app) is creating value?

Before you've sold it to anyone, it will only create bills. Development is such a minuscule part of a successful startup.

Depends on how many security vulnerabilities are in that payments system.

Or, I suppose, depending on whose value. The consultants that'll have to be hired by the poor shmuck who paid for that will make a fortune auditing and cleaning up the code.


Not without more information.

If you are afraid that LLMs will replace you at your job, ask an LLM to write Rust code for reading a utf8 file character by character

Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e. it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive approach the LLM is taking. Mind you, the more popular the issue, the better the approach the LLM takes. So in other words, IMHO it's a glorified Stack Overflow. Just as there are engineers that copy-paste from SO without having any idea what the code does, there will be engineers that will just copy-paste from an LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a Senior SWE and above.


it does an ok job with this task:

    use std::fs::File;
    use std::io::{self, BufReader, Read};

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        // Open the file in read-only mode.
        let file = File::open(path)?;

        // Create a buffered reader to read the file more efficiently.
        let reader = BufReader::new(file);

        // `chars` method returns an iterator over the characters of the input.
        // Note that it returns a Result<(char, usize), io::Error>, where usize is the byte length of the char.
        for char_result in reader.chars() {
            match char_result {
                Ok(c) => print!("{}", c),
                Err(e) => return Err(e),
            }
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

Only problem is that the critical `chars` method doesn't actually exist. Rust's standard library has a `chars` method for strings, but not for Readers.

(Also, the comment about the iterator element type is inconsistent with the code following it. Based on the comment, `c` would be of type `(char, usize)`, but then trying to print it with {} would fail because tuples don't implement Display.)


good catch. feeding it the error output of rustc it then produces:

    use std::fs::File;
    use std::io::{self, Read};

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        let mut file = File::open(path)?;
        let mut contents = String::new();

        file.read_to_string(&mut contents)?;

        for c in contents.chars() {
            println!("{}", c);
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

But this doesn't read the file char-by-char, but uses buffering to read it into a string

What would you expect? There's no OS API for "read one character", except in say ASCII where 1 byte = 1 code point = 1 character. And it'd be hideously inefficient anyway. So you either loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries) or you read the whole thing into a single buffer and iterate the characters. This code does the latter. If this tool doesn't have the ability to respond by asking requirements questions, I'd consider either choice valid.

Of course, in real life, I do expect to get requirements questions back from an engineer when I assign a task. Seems more practical than anticipating everything up-front into the perfect specification/prompt. Why shouldn't I expect the same from an LLM-based tool? Are any of them set up to do that?


There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Reading individual UTF-8 codepoints is a trivial exercise if byte width getchar() were available, and portable C code to do so would be able to run on anything made after 1982. IIRC, they don't teach how to write portable C code in Comp Sci programs anymore and it's a shame.

Never read a file completely into memory at once unless there is zero chance of it being a huge file because this is an obvious DoS vector and waste of resources.


> There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Apologies for the imprecision: by OS API, I meant syscall, at least on POSIX systems. The functions you refer to are C stdio things. Note also they implement on top of read(2) one of the two options I mentioned: "loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries)".

btw, if we're being precise, getwchar gets a code point, and character might mean grapheme instead. Same is true for the `str::chars` call in the LLM's Rust snippet. The docstring for that method mentions this [1] because it was written in this century after people thought about this stuff a bit.

> portable C code to do so would be able to run on anything made after 1982.

Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#![derive(Debug)]`.

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://news.ycombinator.com/item?id=39910542

[3] https://news.ycombinator.com/item?id=39910542


Yes, the userland side is presented with POSIX APIs like ssize_t read(int fd, void* buf, size_t count). Calling that with count = 1 each time would be wasteful, but certainly libcs have been buffering this since at least the 1980s. I remember this was the case with Borland C/C++.

> Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#![derive(Debug)]`.

Duh. It doesn't really matter what Rust has when it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue about other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.


On errors, it exits with success.

A few notes:

- It should be generating `path: impl AsRef<Path>` to be properly generic.

- It's not setting a nonzero exit code on error.

- Edge case handling is a vital property for production-usable tools at scale. I'm wondering if it can yet special case situations such as creating a conditional compilation version for Linux that uses the splice syscall when the arguments are 2 file handles.


fwiw, the benchmark that matters really has nothing to do with authoring code.

the typing of code is the easy part even though it's a part a lot of folks are somewhat addicted to.

the things which have far more value are applying value judgements to requirements, correlating and incorporating sparse and inaccurate diagnostic information into a coherent debugging strategy, and so on. there will come a time when it can assist with these too, probably first on requirements distillation, but for more complex debugging tasks that's a novel problem solving area that we've yet to see substantial movement on.

so if you want to stave off the robots coming for you, get good at debugging hard problems, and learn to make really great use of tools that accelerate the typing out of solutions to baseline product requirements.


If we're being sticklers, this isn't reading character-by-character: it's performing a buffered read, which then gets iterated over.

here's what it does if I feed your nitpick into it verbatim:

    use std::fs::File;
    use std::io::{self, Read};
    use std::str::from_utf8;

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        // Open the file in read-only mode.
        let mut file = File::open(path)?;

        let mut buffer = [0; 4]; // UTF-8 character max byte length
        let mut temp_buffer = vec![];

        while let Ok(bytes_read) = file.read(&mut buffer[..1]) { // Read byte by byte
            if bytes_read == 0 { // End of file
                break;
            }

            temp_buffer.push(buffer[0]);

            if let Ok(s) = from_utf8(&temp_buffer) {
                if s.chars().count() == 1 { // Complete character read
                    print!("{}", s);
                    temp_buffer.clear(); // Clear temp buffer for next character
                }
            }
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

Unlike the original version, this version compiles and seems to basically work correctly. However, the design is misleading: `buffer` is declared as an array of 4 bytes but only the first byte is ever used. The code also has suboptimal performance and error handling, though that's not the end of the world.

all true, as I said in another fork of the thread, this comes down to part of what humans will still be valuable for in this loop: distilling poor requirements into better requirements.

The original prompt is a bit under-specified. (But hey, that certainly matches the real world!)

You're going to have to buffer at least a little, to figure out where the USV / grapheme boundary is, depending on our definition of "character". To me, a BufReader is appropriate here; it avoids lots of tiny reads to the kernel, which is probably the right behavior in a real case.

To me, "read character by character" vaguely implies something that's going to yield a stream of characters. (Again, for some definition there.)


I wouldn't say it's a nit. The file may be 10s of GB. Do you want to read it to a string?

The buffered read didn’t do that, it used the default buffered reader implementation. IIRC that implementation currently defaults to 8kb buffer windows which is a little too small to be efficient enough for high throughput, but substantially more performant than making a syscall per byte, and without spending too much memory.

I was talking about this:

    let mut file = File::open(path)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;

Yea the problem with that is the control group - grab any SWE and ask them the same thing. I don’t think most would pass. Unless you want to give an SWE time to learn… then it’s hardly fair. And I vaguely trust the LLM to be able to learn it too.

Also I just asked Claude and Gemini and they both provided an implementation that matches the “bytes to UTF-8” Rust docs. Assuming those are right, LLMs can do this (but I haven't tested the code).

https://doc.rust-lang.org/std/string/struct.String.html


I'm not afraid of LLMs replacing me because of their output quality. The problem is the proliferation of quantity-over-quality "churn out barely-working crap as fast as possible" culture that gives LLMs the advantage over real humans.

I'm kinda hoping that LLMs will get pushed into production use writing code before they have acceptable quality (because greed), and the result will be lots of crap that's so badly broken most of the time that there will be a massive pushback against said culture from the users. Maybe from the governments as well, after a few well-publicized infrastructure failures.

Hypothetically, which ticker symbols would you buy put contracts on, at what strike prices, and at what expiration dates? As far as I can tell, a lot of people are betting a lot of money that you are wrong, but actually I think you are right.

The most relevant companies focused on this aren't publicly traded. The ones that are publicly traded like MSFT have way too many other factors affecting their value - not to mention the fact that they'll make money on generative AI that has nothing to do with coding regardless of if an SWE-agent ever works.

Oh well you should hear the hype from CNBC and other places, they are strongly intimating that gen AI will replace SWEs on product development teams. I totally agree it’s not likely, but it’s starting to get baked into asset prices and I want to profit from that misunderstanding.

Ugh, I am not claiming that LLMs are not a great innovation. Just that they are not going to replace SWE jobs in our (maybe my) lifetime.

The way I see it, it's undetermined if Generative AI will be able to fully do a SWE job.

But, for most of the debates I've seen, I don't think the answer matters all too much.

Once we have models that can act as full senior SWEs.. the models can engineer the models. And then we've hit the recursive case.

Once models can engineer models better and faster than humans, all bets are off. It's the foggy future. It's the singularity.


The implicit assumption here is that a human "senior SWE" can engineer a model of the same quality that is capable of simulating him. Which is definitely not true with the best models that we have today - and they certainly can't simulate a senior SWE, so the actual bar is higher.

I'm not saying that the whole "robots building better robots" thing is a pipedream, but given where things are today, this is not something that's going to happen soon.


> Once we have models that can act as full senior SWEs.. the models can engineer the models.

This is such an extremely bullish case, I'm not sure why you'd think this is even remotely possible. A Google search is usually more valuable than ChatGPT. For example, the rust utf-8 example is already verbatim solved on reddit: https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...


In this thread: collective pant shitting.


Managers: time to fire 30% of our workforce

A Few Months Later: hiring 10x developers with 10 years of experience in fixing SWE-agent generated bug fixes



