Maybe we need a way to generate checksums during version creation (like file version history) and during test runs, which would be submitted alongside the code as a sort of proof of work that AI couldn’t easily recreate. It would make writing code harder for actual developers as well, but it might cut down on people quickly contributing code the LLMs shit out.
A lightweight plugin that runs in your IDE, maybe. Any time you write and test code, the plugin updates a validation file that records what you were doing and the results of your tests and debugging. You could then write an algorithm that assigns a confidence score to the validation file and either triggers manual review or lets obviously bespoke code through.
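For what it's worth, here is a rough sketch of what such a validation file could look like, assuming it is a hash-chained log where each entry's checksum covers the previous one. The names `append_event` and `verify_chain` and the event fields are made up for illustration; no existing plugin is implied.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body):
    # Hash a canonical JSON encoding so the same body always hashes the same.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(log, kind, payload):
    """Append an edit/test event whose hash also covers the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"kind": kind, "payload": payload, "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})
    return log


def verify_chain(log):
    """Return True if no entry was altered, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        body = {"kind": entry["kind"], "payload": entry["payload"], "prev": entry["prev"]}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append_event(log, "edit", {"file": "app.py", "lines_changed": 12})
append_event(log, "test_run", {"passed": 3, "failed": 1})
append_event(log, "test_run", {"passed": 4, "failed": 0})
print(verify_chain(log))
```

A chain like this only shows that the log wasn't edited after the fact. It says nothing about who or what produced the events, so the confidence-scoring step would still have to do all the real work.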
What exactly would you checksum? All intermediate states that weren’t committed, plus all test run parameters and outputs? If so, how would you use that to detect an LLM? The current agentic LLM tools also make several edits and run tests for the thing they’re writing, then edit more until the tests pass.
So the presence of test runs and intermediate states isn’t really indicative of a human writing code, and I’m skeptical that distinguishing between the steps a human would take and the steps an LLM would take is any easier or quicker than judging the end result.
This could, in theory, also be used by universities to validate submitted papers to weed out AI essays.