MSFT published a study at least two years ago in which they had (outsourced?) engineers build a language model trained on Wikipedia and other sources, then auto-read current Wikipedia pages and performed edits to "correct" them, with statistics on correctness and model performance.
Startled, I asked Wikimedia engineers on Libera IRC about it, and they refused to comment in any way, which was uncharacteristic. It appeared manipulative on MSFT's part, lawyer-like you might say: "what are you gonna do?" The writeup (by the outsourced engineers?) bragged of high "correction" scores as if it were a contest, in the style of other LLM papers and studies.
1/20 is lower than I would have guessed. It's interesting how heavily Wikipedia is weighted in LLM training corpora. Eventually there will be a feedback loop not unlike the one among meat popsicles.
How many covers are better than the original? Would we expect that to be possible with AI?
OCR is a process that usually incorporates some form of A.I., such as pattern matching and neural networks. So the corpora of scanned text in the Internet Archive, HathiTrust, and Google Books have already been run through AI enhancement.
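For instance, Tesseract 4+ recognizes text with an LSTM neural net. A minimal sketch of running a scanned page through it via the pytesseract wrapper (the file path is just a placeholder):

```python
# Sketch: OCR a scanned page with Tesseract's neural (LSTM) engine.
# Assumes: tesseract 4+ on PATH, plus `pip install pytesseract pillow`.
# "page.png" is a placeholder path to a scanned page image.
from PIL import Image
import pytesseract

def ocr_page(path: str) -> str:
    # --oem 1 forces the LSTM-based engine; --psm 3 is full automatic
    # page segmentation (the default, spelled out here for clarity).
    return pytesseract.image_to_string(
        Image.open(path),
        config="--oem 1 --psm 3",
    )

if __name__ == "__main__":
    print(ocr_page("page.png"))
```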
Wikipedians use A.I. as well, especially to identify and revert vandalism (the ClueBot line of bots), and similar heuristics are now incorporated into revision logging.
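ClueBot NG is actually a trained classifier, but even a crude heuristic version conveys the idea. A toy sketch of scoring an edit (thresholds, word list, and the pseudo-diff are all invented for illustration, not ClueBot's real logic):

```python
# Toy vandalism heuristic, much cruder than ClueBot NG, to show the shape
# of the signals: blanking, shouting/character runs, and bad words.
import re

BAD_WORDS = {"poop", "stupid", "lol"}  # placeholder word list

def vandalism_score(old_text: str, new_text: str) -> float:
    score = 0.0
    # Large removals (section blanking) are a classic vandalism signal.
    if len(new_text) < 0.5 * len(old_text):
        score += 0.5
    # Crude stand-in for a real diff of the added text.
    added = new_text[len(old_text):] if new_text.startswith(old_text) else new_text
    # Shouting and long repeated-character runs ("aaaaaaa").
    if re.search(r"(.)\1{6,}", added) or re.search(r"\b[A-Z]{8,}\b", added):
        score += 0.3
    # Known bad words in the added text.
    if any(w in added.lower() for w in BAD_WORDS):
        score += 0.4
    return min(score, 1.0)

# Example: an edit that blanks most of an article and adds shouting.
old = "The quick brown fox jumps over the lazy dog. " * 20
new = "THIS ARTICLE IS STUPIDDDDDDD"
print(vandalism_score(old, new))  # -> 1.0, i.e. a revert candidate
```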
If more of the headlines, photos, and article text published by news outlets are LLM-generated, then new additions to Wikipedia will already be summaries of LLM material.
Just before the paywall, the article mentions "starting with AI detection tools", so this is going to be wildly inaccurate from the start.
I'd like to see someone run LLM detection on Good and Featured Articles; what percentage would be flagged there?
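A rough sketch of how someone could run that experiment, using the standard MediaWiki API to pull Featured Article text; flag_as_ai() is a placeholder for whichever detector is under test (and is where all the real inaccuracy would live):

```python
# Sketch: fetch plain text of English Wikipedia Featured Articles via the
# MediaWiki API and run each through a placeholder AI detector.
import requests

API = "https://en.wikipedia.org/w/api.php"

def featured_article_titles(limit: int = 50) -> list[str]:
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Featured articles",
        "cmlimit": limit,
        "cmnamespace": 0,
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=30).json()
    return [m["title"] for m in data["query"]["categorymembers"]]

def plain_text(title: str) -> str:
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def flag_as_ai(text: str) -> bool:
    # Stand-in only: a real run would call the detector being evaluated.
    return False

if __name__ == "__main__":
    titles = featured_article_titles()
    flagged = sum(flag_as_ai(plain_text(t)) for t in titles)
    print(f"{flagged}/{len(titles)} Featured Articles flagged as AI-generated")
```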
AI content is overly verbose, a waste of time.