Zuckerberg should stop claiming Meta is open sourcing AI (they are even running TV ads) when they are only releasing the weights, and not the code. Only DeepSeek is real OSS AI.
1. Open datasets for pretraining, including the tooling used to label and maintain them
2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)
3. Open pretrained foundation-model weights, fine-tunes, etc.
Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.
These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.
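For illustration, the "train a model and generate data with it" workaround mentioned above is basically distillation. A minimal sketch, assuming the Hugging Face transformers pipeline API (the model and prompts here are just stand-ins):

    # Sketch of the "share generated data instead of the original corpus"
    # workaround; the model and prompts are illustrative stand-ins.
    from transformers import pipeline

    teacher = pipeline("text-generation", model="gpt2")

    prompts = [
        "Explain photosynthesis in one paragraph:",
        "Write a Python function that reverses a string:",
    ]

    # The continuations become the shareable "synthetic" dataset; the original
    # (possibly encumbered) corpus the teacher saw never leaves the lab.
    synthetic = [teacher(p, max_new_tokens=100)[0]["generated_text"] for p in prompts]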
I understand the reasoning, and I hope future legislation basically goes "if you can't produce the data, you can't charge more than this for it". LLM producers would then have to treat their product as a commodity, priced at the cost of compute plus some overhead.
It is pirated material, or material that breaks various terms of service. As I understand it, it is the kind of stuff you can find on Anna's Archive, plus a bunch of "artificial" training data generated by querying OpenAI's ChatGPT and other LLMs.
DeepSeek is definitely not real OSS. To be open source, you need to use a real open source license (like the ones OSI lists), and you need to share all pre- and post-training code, any tuning code, any evaluation code, everything related to safety/censorship/etc., and probably the full training data as well. Otherwise you can't reproduce their weights. Sharing weights is like sharing a compiled program.
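To stretch the analogy: released weights are just big tensors of numbers, inspectable and runnable but not rebuildable. A quick sketch with the safetensors library (the shard filename is hypothetical):

    # Poking at released weights -- like opening a binary in a hex editor.
    # You can inspect and execute them, but without the training code and
    # data you cannot reproduce them. The filename is hypothetical.
    from safetensors import safe_open

    with safe_open("model-00001-of-00163.safetensors", framework="pt") as f:
        for name in list(f.keys())[:5]:
            print(name, f.get_slice(name).get_shape())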
As far as I know, the only truly open source model that is competitive is OLMo 2 from AI2.
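And the weights themselves load like any other checkpoint; what makes OLMo 2 different is that the training data, code, and logs are published alongside them. A sketch with the standard transformers API (repo id is from memory, so double-check it):

    # Loading OLMo 2 like any other checkpoint (repo id from memory --
    # verify on huggingface.co/allenai before relying on it).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "allenai/OLMo-2-1124-7B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tok("Open source means", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    print(tok.decode(out[0], skip_special_tokens=True))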
Yes, releasing the training source code is like releasing the source code of the compiler used to compile and link the binary.
Let's say you took GCC, modified its sources, compiled your code with it, and released your binaries along with the modified GCC source code, claiming that your software is open source. Well, it wouldn't be.
Releasing training data is extremely hard, as licensing and redistribution rights for that data are difficult to tackle. And it is not clear what exactly the benefits of releasing it would be.
Come on... Meta has been refining PyTorch for nearly a decade. It basically contains everything you need to train LLMs, including the latest techniques. What more do you need? The part of the code that is specific to Meta's infrastructure?
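To be fair to this point, the core loop really is all stock PyTorch. A toy sketch (the "model" here is a made-up stand-in for a real Transformer stack):

    # Toy next-token training step built entirely from stock PyTorch parts.
    # The model is a made-up stand-in; real LLMs swap in a Transformer.
    import torch
    import torch.nn as nn

    vocab, dim = 32_000, 512
    model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    tokens = torch.randint(0, vocab, (8, 129))       # fake batch of token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift for next-token loss

    logits = model(inputs)                           # (8, 128, vocab)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    loss.backward()
    opt.step()
    opt.zero_grad()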
PyTorch had the "first thing that didn't suck" advantage, and now it has such a dominant market share that better alternatives struggle to emerge. Where it sucks (e.g. on macOS) there are popular alternatives. But it's hard to be enthusiastic about a DL framework in 2025 that does not have native high-performance quantization support, for example, or one where FSDP is crudely bolted onto the side. They say "usability above all else", but I consider such things major usability deficiencies that need to be addressed. And because PyTorch does not have to fight for market share, it'll be years before we see anything usable there.
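For a concrete example of the "bolted on" complaint: FSDP is a wrapper class around your model rather than something the core nn.Module API understands. A minimal sketch, assuming a multi-GPU launch via torchrun:

    # FSDP as a bolt-on: sharding comes from wrapping the model, not from
    # anything in the core nn.Module API. Assumes launch via
    # `torchrun --nproc_per_node=N script.py` on a machine with N GPUs.
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Linear(4096, 4096).cuda()
    model = FSDP(model)  # parameters are now sharded across ranks

    out = model(torch.randn(2, 4096, device="cuda"))
    out.sum().backward()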