Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data.


I doubt they could. Their corpus almost certainly is mostly composed of copyrighted material they don't have a license for. It's an open question whether that's an issue for using it for model training, but it's obvious they wouldn't be allowed to distribute it as a corpus. That'd just be regular copyright infringement.

Maybe they could share a list of the content of their corpus. But that wouldn't be too helpful and makes it much easier for all affected parties to sue them for using their content in model training.


no, not the actual content, just the titles of the content. like "book title" by "author". the tool just simply can't be taken seriously by anyone until they release that information. this is the case for all these models. it's ridiculous, almost insulting.


They can't release it without admitting to copyright infringement.


They can't do it without getting sued for copyright infringement. That's not quite the same.


Uh..

That would almost be worse. All copyright holders would need to do is search a list of titles if I'm understanding your proposal correctly.

The idea is not to get sued.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: