If I have understood this correctly, can I equate the underlying principle of this "possibility of more efficient error correction" to spotting a typo in a text? That as words grow into sentences, and sentences into paragraphs on a specific topic, it becomes easier to guess from context which word was intended when a letter is written incorrectly?
Human languages have a lot of error correction built into them. Lempel-Ziv gets pretty close to optimal compression for most types of data. If you put the ASCII plaintext of almost any book into a zip file, the compression ratio would be roughly 3:1.
Almost all data sent electronically is first compressed down to near its entropy and then put into error-correcting and line codes to handle channel imperfections.
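If you want to check for yourself, here is a minimal sketch using Python's zlib (a DEFLATE implementation, i.e. LZ77 dictionary matching plus Huffman coding); "book.txt" is just a placeholder for any plaintext book you have lying around:

```python
import zlib

# Read any plaintext book; "book.txt" is a placeholder filename.
with open("book.txt", "rb") as f:
    text = f.read()

# DEFLATE = LZ77 dictionary matching followed by Huffman coding.
compressed = zlib.compress(text, level=9)

print(f"original:   {len(text):,} bytes")
print(f"compressed: {len(compressed):,} bytes")
print(f"ratio:      {len(text) / len(compressed):.2f}")
```

The exact ratio varies by book and compressor, but general-purpose tools usually land around 3:1 on English prose, still well short of Shannon's estimate of roughly one bit of entropy per character of English.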
I have no knowledge of this Lempel-Ziv, but shouldn't we make this Lempel-Ziv our language of choice? If all the data is so compressed, then surely we would be able to consume it faster.
The Lempel-Ziv source encoding (LZ for short) is able to asymptotically approach maximum efficiency because it adapts to the source. It does this by keeping a dictionary of past phrases, and referring back to that dictionary as it encodes new stuff. (There are several ways to do this, all with asymptotic optimality guarantees, so there are several flavors of LZ.)
The key is that the dictionary changes per-source.
This adaptive encoding would not really work for direct human interpretation, because you'd have to maintain that dictionary somehow in your head, separately for each source.
You'll note that our selection of acronyms and terms of art ("ROC curve", "tach", "JS") has something of this flavor, though -- and the lexicon is adaptive within each universe of discourse.
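To make the "dictionary of past phrases" idea concrete, here is a bare-bones LZ78-style encoder in Python. It is a sketch of the mechanism only (no bit-packing, no decoder), not any particular production codec:

```python
def lz78_encode(s: str):
    """Encode s as (dictionary index, next character) pairs; index 0 = empty prefix."""
    dictionary = {}   # phrase -> index, built adaptively as the source is read
    output = []
    phrase = ""
    for ch in s:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                     # keep extending a known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:                                     # flush a trailing known phrase
        output.append((dictionary[phrase], ""))
    return output

print(lz78_encode("abababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]
# Repetitive input grows longer dictionary entries, so later pairs stand in
# for longer chunks of the source. That growth is the adaptive part.
```

The decoder rebuilds the same dictionary from the pairs alone, so nothing has to be agreed on in advance; that is also exactly the per-source bookkeeping a human reader couldn't carry around in their head.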
I'm not familiar with the gzip algorithm, unfortunately. But imagine, if you're familiar with Vim, that each character in English represented some "idea" and that combining characters combined the ideas, just like composing commands in Vim. Then with four letters you could theoretically reach 26^4 concepts. Actually, this is similar to how Chinese works.
For example, the ideogram for "grass/straw/manuscript", when combined with "berry", makes "strawberry".
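(The arithmetic behind that count, for what it's worth:)

```python
from itertools import product
from string import ascii_lowercase

# Four positions, 26 symbols each.
print(26 ** 4)                                             # 456976
print(sum(1 for _ in product(ascii_lowercase, repeat=4)))  # same count, enumerated
```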
Why don't you familiarize yourself with it, then? You're basically trying to argue a point you know nothing about, and then getting snippy that people aren't "refuting" a point that you have provided no basis for. Until humans can store large amounts of arbitrary context in their heads, and look it up very rapidly, we will not be using any kind of table-based compression algorithm.
And that is exactly how English works. You take a basket, and a ball, and make basketball. Or grape, and soda, and make grape soda.
Language has evolved to be efficient, though. If you try to send more information than a channel can handle (i.e. exceed Shannon's limit), you will not get the message across. Life is noisy. Just because language is inefficient in a quiet room doesn't mean it isn't useful to be able to communicate on a windy day. It's possible that, with the emergence of text-based conversations and more quiet rooms, language is evolving to be less fault tolerant and more (what you would call) "efficient".
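For reference, the limit being invoked is channel capacity; for the textbook bandlimited Gaussian channel it takes the Shannon-Hartley form (quoted here only as a reminder of what "more than a channel can handle" means):

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right) \ \text{bits per second}
```

where B is the bandwidth in hertz and S/N the signal-to-noise ratio. The noisy-channel coding theorem says reliable communication is possible at any rate below C and impossible above it, which is the sense in which a noisy, windy-day channel forces redundancy.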
Language isn't wasteful. All of that extra information allows for error correction. You can infer what
*r* **u *k**?
means. Longer messages can be easier to decode, for reasons similar to those stated above in this thread. A lot of information is carried by the language outside of the character set, and that information is necessary for decoding. It asks a lot of the decoder to know everything necessary, but it allows for relatively robust communication.
That's a pretty atypical example; you've put the errors in exactly the right places. Imagine that string was
*r* *o* ***y
or
a** y** *k**
and you can see these are indecipherable; this is not a good example of "error-correctable language".
Error correction in language typically refers to understanding through context (i.e. other sentences), which would not be lost in this sort of compression.
They aren't indecipherable, though. Context is inseparable from language. How many three-letter words begin with "a", followed by three-letter words that begin with "y", followed by four-letter words that have a "k" in them? This is all well described in information theory, and the redundancy of language isn't invalidated just because I picked an obvious example.
There are many three-letter words that begin with A, followed by three-letter words that begin with Y, followed by four-letter words with a K in them. Here are some examples (with a rough counting sketch after them).
All you skim.
And you skip.
Ate yak skat.
Add yak skin.
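And a rough sketch of the counting behind it, matching the "a** y** *k**" mask from upthread against a system word list (the path is an assumption; any plain word list will do):

```python
# Count candidate words for each slot of the mask "a** y** *k**".
with open("/usr/share/dict/words") as f:   # assumed path; substitute any word list
    words = {w.strip().lower() for w in f if w.strip().isalpha()}

first  = sorted(w for w in words if len(w) == 3 and w[0] == "a")  # a**
second = sorted(w for w in words if len(w) == 3 and w[0] == "y")  # y**
third  = sorted(w for w in words if len(w) == 4 and w[1] == "k")  # *k**

print(len(first), len(second), len(third))
print("raw combinations:", len(first) * len(second) * len(third))
# The raw product is large, but grammar and context eliminate almost all of
# the combinations, which is exactly the redundancy under discussion.
```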
If you're telling me I can randomly hide two-thirds of the characters in any English sentence and you can still derive the meaning, I'm going to need a citation beyond "information theory" in general.
But this is tangential to the point. I'm saying the context is preserved with compression, hence the meaning is preserved. I'm saying we can compress the number of letters used and still get the same meaning.
Elements of Information Theory (2nd ed.), Cover and Thomas, Section 6.4. I also suggest following the authors' suggested reading at the end of the section.
Also, there is a difference between having information and having perfect information. English can't guarantee perfect recovery of arbitrary statements with two-thirds loss.
You need fault tolerance, and therefore redundancy, in your language. You don't want language to be like phone numbers, where missing a single digit changes the meaning completely. Not to mention that phone numbers are also hard for the human brain to remember. We are not computers.