Docker layers are content-addressable, so the hashes aren’t opaque: they’re derived directly from what’s inside. Two images that (mostly) share the same layers? No disk space wasted on the shared layers.
Sure you could implement a finer-grained deduplication or transfer mechanism, but I doubt this would scale as well. Many large image layers consist of lots and lots of small files. The overhead would be tremendous.
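To make the layer-level scheme concrete, here is a minimal sketch (Python, not Docker’s actual code) of a content-addressed layer store: a layer’s identity is just the SHA-256 of its bytes, so a layer shared by two images is stored exactly once. The store path and function names are made up for illustration.

    import hashlib
    import os

    STORE_DIR = "/var/lib/example-layer-store"  # hypothetical location

    def layer_digest(layer_bytes: bytes) -> str:
        # The digest *is* the layer's identity.
        return "sha256:" + hashlib.sha256(layer_bytes).hexdigest()

    def store_layer(layer_bytes: bytes) -> str:
        digest = layer_digest(layer_bytes)
        path = os.path.join(STORE_DIR, digest.replace(":", "_"))
        if not os.path.exists(path):  # already present -> zero extra disk used
            os.makedirs(STORE_DIR, exist_ok=True)
            with open(path, "wb") as f:
                f.write(layer_bytes)
        return digest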
The local storage is mostly a solved problem with hard links (see the sketch below). Any modern file system (i.e. not NTFS) can have arbitrarily many file paths that refer to the same underlying file, with no more overhead than ordinary files.
The comment to which you were replying mentioned both the excessive local disk usage and the excessive network transfer, and so your comment appeared to apply to both portions. This is why I started my comment by explicitly restricting it to the case of local disk usage.
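To illustrate the hard-link approach (a rough sketch, not how Docker’s storage drivers actually work): keep one canonical copy of each file under its content hash, and materialize every other occurrence as a hard link to it. BLOB_DIR and link_file are hypothetical names.

    import hashlib
    import os

    BLOB_DIR = "/var/lib/example-file-store"  # hypothetical blob directory

    def link_file(src_path: str, dest_path: str) -> None:
        with open(src_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        blob = os.path.join(BLOB_DIR, digest)
        os.makedirs(BLOB_DIR, exist_ok=True)
        if not os.path.exists(blob):
            os.link(src_path, blob)   # first occurrence becomes the canonical copy
        os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
        os.link(blob, dest_path)      # later occurrences are just extra names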
For hard links to work, you still need to know that the brand new layer you just downloaded is the same as something you already have, i.e. you need to run a deduplication step.
How? Well, the simplest way is to compute the digest of the content and look it up... oh wait :thinking:
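For illustration, a minimal sketch of that digest lookup (hypothetical names, not the real registry client API): the image manifest already lists each layer by digest, so deciding what to skip is a local existence check rather than a separate dedup pass.

    import os

    STORE_DIR = "/var/lib/example-layer-store"  # hypothetical local layer store

    def have_layer(digest: str) -> bool:
        return os.path.exists(os.path.join(STORE_DIR, digest.replace(":", "_")))

    def layers_to_fetch(manifest_layer_digests: list[str]) -> list[str]:
        # The digest comparison is the whole "deduplication step".
        return [d for d in manifest_layer_digests if not have_layer(d)]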
I’m not sure what point you’re trying to make. Are you assuming that a layer would be transferred in its entirety, even in cases where the majority of its contents are already available locally? The purpose of bringing up hard links was to point out that when de-duplication is done at per-file granularity rather than per-layer granularity, it doesn’t introduce a runtime overhead.
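A hedged sketch of what per-file granularity could look like (hypothetical, not an existing Docker/OCI mechanism): describe a layer as a map of paths to content digests, transfer only the digests that are missing locally, and lay out the tree with hard links so there is no runtime cost.

    import os

    BLOB_DIR = "/var/lib/example-file-store"  # hypothetical per-file blob store

    def missing_digests(file_manifest: dict[str, str]) -> set[str]:
        # Only these file contents would need to go over the network.
        return {
            digest
            for digest in file_manifest.values()
            if not os.path.exists(os.path.join(BLOB_DIR, digest))
        }

    def materialize(file_manifest: dict[str, str], rootfs: str) -> None:
        # Build the layer's file tree as hard links into the blob store.
        for path, digest in file_manifest.items():
            dest = os.path.join(rootfs, path.lstrip("/"))
            os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
            os.link(os.path.join(BLOB_DIR, digest), dest)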