> It is interesting what the result will be (average saving on deduplication) if it is applied globally to a large-scale blob storage, such as Amazon S3 or Google Drive (we need metadata storage about chunks, and the chunks can be deduplicated).
Yes, this is truly promising, but beware of dragons. Under current legal doctrine, blobs need some form of chain of custody. You can’t just deliver chunks to whoever has a hash (unless you’re decentralized, and you can move this problem to your users). Why? Because this is how bittorrent works, and we all know the legal dangers there. Encryption helps against eavesdropping, but not against an adversary who already has the hash and simply wants to prove you are distributing pirated material or even CSAM. In some cases you may be able to get around this by shifting blame back onto the user. For instance, say you are re-syncing dangerous goods that you initially uploaded over Dropbox; then Dropbox can probably blame you, even though they are technically the ones distributing. But that requires Dropbox to be reasonably confident that “you” (i.e. the same legal entity) had those chunks in the first place.
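To make the "chain of custody" idea concrete, here is a minimal sketch of a deduplicating chunk store that only hands a chunk back to a user who previously uploaded the same bytes. Everything in it (the `ChunkStore` class, its method names, the use of SHA-256, the user strings) is a hypothetical illustration, not how Dropbox, S3, or any real service works.

```python
import hashlib
from collections import defaultdict


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept once,
    but we remember which users actually uploaded its bytes."""

    def __init__(self):
        self._chunks = {}                # sha256 hex digest -> chunk bytes
        self._owners = defaultdict(set)  # sha256 hex digest -> uploading users

    def put(self, user, data):
        """Store a chunk for `user`; identical bytes are deduplicated."""
        digest = hashlib.sha256(data).hexdigest()
        self._chunks.setdefault(digest, data)  # physical copy stored only once
        self._owners[digest].add(user)         # custody is recorded per user
        return digest

    def get(self, user, digest):
        """Serve a chunk only to a user who proved possession by uploading it."""
        if user not in self._owners.get(digest, ()):
            raise PermissionError("knowing the hash is not enough")
        return self._chunks[digest]

    def stored_chunks(self):
        """Number of physical chunks held after deduplication."""
        return len(self._chunks)


store = ChunkStore()
h_alice = store.put("alice", b"same bytes")
h_bob = store.put("bob", b"same bytes")  # deduplicated against alice's copy
```

Here both uploads map to one stored chunk, and a third user who merely learned the hash would be refused by `get`. The trade-off is that the store has to keep per-user metadata alongside the chunk index, which is part of the metadata cost the quoted question mentions.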
That's an interesting extension of the illegal numbers or coloured bits theories, but we don't really see it used that way in practice. When governments or media industry groups crack down on this stuff, they don't go after everybody that ever had those bits in memory. Maybe that's just for practical reasons, but we've never seen every router in between a buyer and seller get confiscated too, as if it had somehow been tainted. Honestly, this doesn't seem like more than a dystopian mental exercise.
I’m not suggesting the hashes themselves are illegal to possess, but that transferring the bytes corresponding to those hashes is problematic: if neither side is well trusted, that puts you at risk as a hoster of that content. This is indeed an issue with IPFS, for instance, where I believe the solutions are “pinning” content that has already been vetted by another party, or denylists of “bad bits”. I assume it’s similar for any other clearnet hosting. Btw, I make zero value judgments about all of that.
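As a rough illustration of the denylist approach, here is a sketch of a host that refuses to serve any chunk whose hash appears on a list of known-bad hashes. The function names and the in-memory `dict`/`set` are assumptions made for the example; this is not the actual IPFS denylist format or API.

```python
import hashlib


def chunk_id(data):
    """Content address of a chunk (SHA-256 hex digest, chosen for the sketch)."""
    return hashlib.sha256(data).hexdigest()


def serve(chunks, denylist, digest):
    """Return the chunk bytes, or None if the hash is denylisted or unknown."""
    if digest in denylist:
        return None  # blocked: the host never hands out these bytes
    return chunks.get(digest)


good = b"linux distro chunk"
bad = b"chunk reported as bad"
chunks = {chunk_id(good): good, chunk_id(bad): bad}

# Hypothetical list of bad hashes, e.g. obtained from a third party.
denylist = {chunk_id(bad)}

served_good = serve(chunks, denylist, chunk_id(good))
served_bad = serve(chunks, denylist, chunk_id(bad))
```

Note that the host only needs the hashes of the bad content to enforce this, not the content itself, which is what makes shared denylists workable.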
Off topic: I see downvotes on my parent comment, please let me know if I said something bad to help me improve.
Shared bytes could be construed in the opposite direction: if two or more of my users have the same chunk in their files, it is more likely to be some legal piece of data.
Files become piracy when there is evidence of intentional copyright infringement, for example when the chunk is part of a valid MPEG4 file and the MPEG4 file is titled "Wednesday_S2E4_FullHD_NetflixRip.MP4"
Re last para: probably because it's full of very certain, but also quite certainly wrong, statements along the lines of "Under current legal doctrine, blobs need some form of chain of custody." Citation needed.
It's not the illegalness I'm challenging, it's the problematicness. Maybe it is illegal to even think about those bit patterns. But I'm not aware of cases where people get _actually_ thrown in jail or fined for possessing or transmitting them. In all of the cases I know about there is intent involved.
It is hard to tell if this is what you are saying, but a common misconception about IPFS seems to be that you may end up hosting random unwanted files. This is untrue: you only end up hosting files you want.
Isn't the main use of bittorrent for ML and research data? Academic torrents is a wonderful resource and what every developer should be using if they need to provide their neural network weights, training data, etc.
How is there any legal problem using bittorrent? It's simply much more tailored for this problem than http. It doesn't make any sense to talk about 'Legal problems' for torrent protocols.
What planet have you been living on? Bittorrent is widely used to distribute copyrighted material - movies, TV shows, games, programs, porn... I'd imagine a large majority of bittorrent traffic worldwide is pirated material, with a small portion being datasets as you describe, and other legally-shared data like actual Linux distros, etc.
I suppose there could be many things happening on the internet that we are unaware of; however, torrents are very good and specifically tailored as a protocol for scientific data and ML.
It solves the link-rot issues that occur due to moving institutions, it allows huge storage for essentially free (ever tried to store 9 TB of training data or CERN data on Dropbox?), and it scales extremely beautifully.
It's really the absolute perfect solution for reproducible research in large data studies.
Torrents are no longer the main source of copyrighted material, at least for shows and movies. There are a bunch of illegal services that provide a Netflix-like experience on top of pirated content.
If you’re distributing CSAM on your blob storage, and someone lets you know, you should probably remove it. This is independent of whether you distribute chunks or the whole file.
I think for piracy/DMCA it’s enough to simply remove it. As for CSAM or more serious stuff, I don’t know if that’s enough. Does Section 230 cover that? Is there a difference between being a company and an individual?