I believe that if Mistral is serious about advancing in open source, they should...

wongarsu · on July 18, 2024

I doubt they could. Their corpus is almost certainly composed mostly of copyrighted material they don't have a license for. It's an open question whether that's an issue for model training, but it's obvious they wouldn't be allowed to distribute it as a corpus. That'd just be regular copyright infringement.

Maybe they could share a list of the contents of their corpus. But that wouldn't be too helpful, and it would make it much easier for all affected parties to sue them for using their content in model training.

gooob · on July 18, 2024

No, not the actual content, just the titles of the content, like "book title" by "author". The tool simply can't be taken seriously by anyone until they release that information. This is the case for all these models. It's ridiculous, almost insulting.

candiddevmike · on July 18, 2024

They can't release it without admitting to copyright infringement.

regularfry · on July 18, 2024

They can't do it without getting sued for copyright infringement. That's not quite the same.

bilbo0s · on July 18, 2024

Uh..

That would almost be worse. All copyright holders would need to do is search a list of titles if I'm understanding your proposal correctly.

The idea is not to get sued.