Zuckerberg should stop claiming Meta is open sourcing AI (they are even running TV ads) when they are only releasing the weights, and not the code. Only DeepSeek is real OSS AI.
1. Open datasets for pretraining, including the tooling used to label and maintain them
2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)
3. Open pretrained foundation-model weights, fine-tunes, etc.
Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.
These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.
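For illustration, the "train a model and generate data with it" workaround mentioned above is basically distillation. A minimal sketch, assuming the Hugging Face transformers pipeline API (the model and prompts here are just stand-ins):

    # Sketch of the "share generated data instead of the original corpus"
    # workaround; the model and prompts are illustrative stand-ins.
    from transformers import pipeline

    teacher = pipeline("text-generation", model="gpt2")

    prompts = [
        "Explain photosynthesis in one paragraph:",
        "Write a Python function that reverses a string:",
    ]

    # The continuations become the shareable "synthetic" dataset; the original
    # (possibly encumbered) corpus the teacher saw never leaves the lab.
    synthetic = [teacher(p, max_new_tokens=100)[0]["generated_text"] for p in prompts]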
I understand the reasoning, and I hope future legislation basically goes "if you can't produce the data, you can't charge more than this for it". LLM producers would then have to treat their product as a commodity, priced at the cost of compute plus some overhead.
It is pirated material, or material that breaks various terms of service. As I understand it, it is the kind of stuff you can find on Anna's Archive, plus a bunch of "artificial" training data generated by querying OpenAI's ChatGPT and other LLMs.
DeepSeek is definitely not real OSS. To be open source, you need to use a real open source license (like the ones OSI lists), and you need to share all pre- and post-training code, any tuning code, any evaluation code, everything related to safety/censorship/etc., and probably the full training data as well. Otherwise you can't reproduce their weights. Sharing weights is like sharing a compiled program.
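To stretch the analogy: released weights are just big tensors of numbers, inspectable and runnable but not rebuildable. A quick sketch with the safetensors library (the shard filename is hypothetical):

    # Poking at released weights -- like opening a binary in a hex editor.
    # You can inspect and execute them, but without the training code and
    # data you cannot reproduce them. The filename is hypothetical.
    from safetensors import safe_open

    with safe_open("model-00001-of-00163.safetensors", framework="pt") as f:
        for name in list(f.keys())[:5]:
            print(name, f.get_slice(name).get_shape())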
As far as I know, the only truly open source model that is competitive is OLMo 2 from AI2.
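And the weights themselves load like any other checkpoint; what makes OLMo 2 different is that the training data, code, and logs are published alongside them. A sketch with the standard transformers API (repo id is from memory, so double-check it):

    # Loading OLMo 2 like any other checkpoint (repo id from memory --
    # verify on huggingface.co/allenai before relying on it).
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "allenai/OLMo-2-1124-7B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tok("Open source means", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    print(tok.decode(out[0], skip_special_tokens=True))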
Yes, releasing the training source code is like releasing the source code of the compiler used to compile and link the binary.
Let's say you took GCC, modified its sources, compiled your code with it, and released your binaries along with the modified GCC source code, claiming that your software is open source. Well, it wouldn't be.
Releasing training data is extremely hard, as licensing and redistribution rights for that data are difficult to tackle. And it is not clear what exactly the benefits of releasing it would be.
Come on... Meta has been refining PyTorch for nearly a decade. It basically contains everything you need to train LLMs, including the latest techniques. What more do you need? The part of the code that is specific to Meta's infrastructure?
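To be fair to this point, the core loop really is all stock PyTorch. A toy sketch (the "model" here is a made-up stand-in for a real Transformer stack):

    # Toy next-token training step built entirely from stock PyTorch parts.
    # The model is a made-up stand-in; real LLMs swap in a Transformer.
    import torch
    import torch.nn as nn

    vocab, dim = 32_000, 512
    model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    tokens = torch.randint(0, vocab, (8, 129))       # fake batch of token ids
    inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift for next-token loss

    logits = model(inputs)                           # (8, 128, vocab)
    loss = nn.functional.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
    loss.backward()
    opt.step()
    opt.zero_grad()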
PyTorch had the "first thing that didn't suck" advantage, and now it has such a dominant market share that better alternatives struggle to emerge. Where it sucks (e.g. on macOS) there are popular alternatives. But it's hard to be enthusiastic about a DL framework in 2025 that does not have native high-performance quantization support, for example, or one where FSDP is crudely bolted onto the side. They say "usability above all else", but I consider such things major usability deficiencies that need to be addressed. And because PyTorch does not have to fight for market share, it'll be years before we see anything usable there.
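For a concrete example of the "bolted on" complaint: FSDP is a wrapper class around your model rather than something the core nn.Module API understands. A minimal sketch, assuming a multi-GPU launch via torchrun:

    # FSDP as a bolt-on: sharding comes from wrapping the model, not from
    # anything in the core nn.Module API. Assumes launch via
    # `torchrun --nproc_per_node=N script.py` on a machine with N GPUs.
    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    model = torch.nn.Linear(4096, 4096).cuda()
    model = FSDP(model)  # parameters are now sharded across ranks

    out = model(torch.randn(2, 4096, device="cuda"))
    out.sum().backward()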