I would say though that you have to pick your battles. On your path you might vanquish shitty web sites by simply not reading them, but I think a real victory against enshittification involves breaking away from the web user interface, which is part of why there has been so much bot blocking lately. It's not that people are going to train an LLM on a shitty site, it's that you're going to get the LLM to read the shitty site for you; that is the real death of shitty sites. (e.g. an agent posts your question to Reddit on your behalf, then picks out and summarizes the best answer; instead of looking at Craigslist every day for farm animals, you get notified of the good ones; instead of seeing an ad on the bus for a Laurie Anderson concert whose tickets are all sold out, the system tells you the moment tickets go on sale...) But who knows? An HTML 5 parser is a good place to start towards that end.
I had similar motivations for developing my smart RSS reader and intelligent agent YOShInOn, which let me work on filtering and "deshittification" right away instead of facing a 1500 man-year project before getting to the stuff that matters to me.
For me the path not taken was developing a "reader" that works like archive.today. There was a summer during the pandemic when my son and I were driving back and forth to Buffalo a lot talking about missile defense systems, and I did a "spike" project to develop something that would archive, process, and deshittify web pages. I came to the conclusion that it was a long road to make something that was better instead of worse (every archiver I've seen is way slower than just browsing the page, and sorry, "slow X" advocates: slow = bad).
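The "deshittify" step of that spike can be sketched with nothing but the standard library. This is a hypothetical toy, not the actual spike code: a readability-style heuristic that keeps only the title, headings, and paragraph text and drops scripts, ads, and chrome.

```python
# Toy "deshittifier" (hypothetical sketch, stdlib only): keep text inside
# <title>, <h1>-<h3>, and <p>; everything else (scripts, ad divs, nav) is dropped.
from html.parser import HTMLParser

KEEP = {"title", "h1", "h2", "h3", "p"}

class Deshittifier(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a kept element
        self.chunks = []    # accumulated readable text

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

def deshittify(html):
    p = Deshittifier()
    p.feed(html)
    return "\n".join(p.chunks)

page = ("<html><head><title>News</title><script>ads()</script></head>"
        "<body><div class='ad'>BUY NOW</div><p>The actual story.</p></body></html>")
print(deshittify(page))  # the ad and the script are gone
```

A real reader needs much more than this (boilerplate scoring, main-content detection, image handling), which is exactly the "long road" problem.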
This winter I turned to the problem of end-to-end active learning for selecting articles and had great success with it. Sometimes it sends me to archive.today, but the goddamn CAPTCHAs never end when I use it from home, so the question of the archiver/reader has come back. Now that the thing has a workflow engine and could archive articles before I look at them, the archiver/reader looks appealing, if I can find something open source that is good enough.
There's a commercial product
https://www.getupnext.com/
which took the opposite approach of developing a reader first, and now they are working on recommendations. That's a path that makes sense because the reader cleans up the text, which in principle is good for NLP, but for what I am doing I think the signals in dirty text have enough value that I am doing just fine. Their issue is that they're unlikely to really develop content-based recommendation for a multi-user system, because collaborative filtering is such a good shortcut. Also, YOShInOn uses a TikTok-style interface, which for some reason has hardly been tried despite being a hit with StumbleUpon; this interface collects good negative samples and gets great results with a simple classifier... if you get a few thousand judgements.*
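The "simple classifier fed by swipe judgements" idea can be sketched in a few lines; this is a hypothetical illustration (not YOShInOn's actual code), using naive Bayes over bag-of-words with Laplace smoothing, where each swipe supplies a labeled example.

```python
# Toy swipe-trained classifier (hypothetical sketch): naive Bayes over
# bag-of-words, where every accept (1) / reject (0) judgement is a training example.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

class SwipeClassifier:
    def __init__(self):
        self.counts = {0: Counter(), 1: Counter()}  # token counts per class
        self.totals = {0: 0, 1: 0}                  # tokens per class
        self.docs = {0: 0, 1: 0}                    # judgements per class

    def judge(self, text, label):
        """Record one swipe: label 1 = liked it, 0 = rejected it."""
        toks = tokenize(text)
        self.counts[label].update(toks)
        self.totals[label] += len(toks)
        self.docs[label] += 1

    def score(self, text):
        """Log-odds of the 'accept' class; > 0 means 'show more like this'."""
        vocab = len(self.counts[0] | self.counts[1]) or 1
        logodds = math.log((self.docs[1] + 1) / (self.docs[0] + 1))
        for t in tokenize(text):
            p1 = (self.counts[1][t] + 1) / (self.totals[1] + vocab)
            p0 = (self.counts[0][t] + 1) / (self.totals[0] + vocab)
            logodds += math.log(p1 / p0)
        return logodds

clf = SwipeClassifier()
clf.judge("great deep dive on missile defense radar", 1)
clf.judge("sponsored listicle ten gadgets you need", 0)
print(clf.score("new radar defense analysis") > 0)  # True
```

The point of the swipe interface is the labels on the right-hand side of `judge`: every rejection is a real negative sample rather than an inferred one, which is why a classifier this simple works once you have a few thousand judgements.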