If I have understood this correctly, can I equate the underlying principle of this "possibility of more efficient error correction" to spotting a typo in a text? That as words grow into sentences, and sentences into paragraphs on a specific topic, it becomes easier to guess from context which word was intended when a letter is written incorrectly?
Human languages have a lot of error correction built into them. Lempel-Ziv gets pretty close to optimal compression for most types of data. If you put the ASCII plaintext of almost any book into a zip file, the compression ratio would be roughly 3:1.
Almost all data sent electronically is first compressed down to near its entropy and then put into error-correcting and line codes to handle channel imperfections.
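If you want to check for yourself, here is a minimal sketch using Python's zlib (a DEFLATE implementation, i.e. LZ77 dictionary matching plus Huffman coding); "book.txt" is just a placeholder for any plaintext book you have lying around:

```python
import zlib

# Read any plaintext book; "book.txt" is a placeholder filename.
with open("book.txt", "rb") as f:
    text = f.read()

# DEFLATE = LZ77 dictionary matching followed by Huffman coding.
compressed = zlib.compress(text, level=9)

print(f"original:   {len(text):,} bytes")
print(f"compressed: {len(compressed):,} bytes")
print(f"ratio:      {len(text) / len(compressed):.2f}")
```

The exact ratio varies by book and compressor, but general-purpose tools usually land around 3:1 on English prose, still well short of Shannon's estimate of roughly one bit of entropy per character of English.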
I have no knowledge of this Lempel-Ziv, but shouldn't we make this Lempel-Ziv our language of choice? If all the data is so compressed, then surely we would be able to consume it faster.
The Lempel-Ziv source encoding (LZ for short) is able to asymptotically approach maximum efficiency because it adapts to the source. It does this by keeping a dictionary of past phrases, and referring back to that dictionary as it encodes new stuff. (There are several ways to do this, all with asymptotic optimality guarantees, so there are several flavors of LZ.)
The key is that the dictionary changes per-source.
This adaptive encoding would not really work for direct human interpretation, because you'd have to maintain that dictionary somehow in your head, separately for each source.
You'll note that our selection of acronyms and terms of art ("ROC curve", "tach", "JS") has something of this flavor, though -- and the lexicon is adaptive within each universe of discourse.
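To make the "dictionary of past phrases" idea concrete, here is a bare-bones LZ78-style encoder in Python. It is a sketch of the mechanism only (no bit-packing, no decoder), not any particular production codec:

```python
def lz78_encode(s: str):
    """Encode s as (dictionary index, next character) pairs; index 0 = empty prefix."""
    dictionary = {}   # phrase -> index, built adaptively as the source is read
    output = []
    phrase = ""
    for ch in s:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                     # keep extending a known phrase
        else:
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:                                     # flush a trailing known phrase
        output.append((dictionary[phrase], ""))
    return output

print(lz78_encode("abababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (5, 'b')]
# Repetitive input grows longer dictionary entries, so later pairs stand in
# for longer chunks of the source. That growth is the adaptive part.
```

The decoder rebuilds the same dictionary from the pairs alone, so nothing has to be agreed on in advance; that is also exactly the per-source bookkeeping a human reader couldn't carry around in their head.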
I'm not familiar with the gzip algorithm, unfortunately. But imagine, if you're familiar with Vim, that each character in English represented some "idea" and that combining characters combined the ideas, just like composing commands in Vim. Then with four letters you could theoretically reach 26^4 concepts. Actually, this is similar to how Chinese works.
For example, the ideogram for "grass/straw/manuscript", when combined with "berry", makes "strawberry".
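(The arithmetic behind that count, for what it's worth:)

```python
from itertools import product
from string import ascii_lowercase

# Four positions, 26 symbols each.
print(26 ** 4)                                             # 456976
print(sum(1 for _ in product(ascii_lowercase, repeat=4)))  # same count, enumerated
```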
Why don't you familiarize yourself with it, then? You're basically trying to argue a point you know nothing about, and then getting snippy that people aren't "refuting" a point that you have provided no basis for. Until humans can store large amounts of arbitrary context in their heads, and look it up very rapidly, we will not be using any kind of table-based compression algorithm.
And that is exactly how English works. You take a basket, and a ball, and make basketball. Or grape, and soda, and make grape soda.
Language has evolved to be efficient, though. If you try to send more information than a channel can handle (i.e. exceed Shannon's limit), you will not get the message across. Life is noisy. Just because language is inefficient in a quiet room doesn't mean it isn't useful to be able to communicate on a windy day. It's possible that, with the emergence of text-based conversations and more quiet rooms, language is evolving to be less fault tolerant and more (what you would call) "efficient".
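For reference, the limit being invoked is channel capacity; for the textbook bandlimited Gaussian channel it takes the Shannon-Hartley form (quoted here only as a reminder of what "more than a channel can handle" means):

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right) \ \text{bits per second}
```

where B is the bandwidth in hertz and S/N the signal-to-noise ratio. The noisy-channel coding theorem says reliable communication is possible at any rate below C and impossible above it, which is the sense in which a noisy, windy-day channel forces redundancy.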
Language isn't wasteful. All of that extra information allows for error correction. You can infer what
*r* **u *k**?
means. Longer messages can be easier to decode, for reasons similar to those stated above in this thread. A lot of information is carried by the language outside of the character set, and that information is necessary for decoding. It asks a lot of the decoder to know everything necessary, but it allows for relatively robust communication.
That's a pretty atypical example; you've put the errors in exactly the right places. Imagine that string was
*r* *o* ***y
or
a** y** *k**
and you can see these are indecipherable; this is not a good example of "error-correctable language".
Error correction in language typically refers to understanding through context (i.e. other sentences), which would not be lost in this sort of compression.
They aren't indecipherable, though. Context is inseparable from language. How many three-letter words begin with "a", followed by three-letter words that begin with "y", followed by four-letter words that have a "k" in them? This is all well described in information theory, and the redundancy of language isn't invalidated just because I picked an obvious example.
There are many three-letter words that begin with A, followed by three-letter words that begin with Y, followed by four-letter words with a K in them. Here are some examples (with a rough counting sketch after them).
All you skim.
And you skip.
Ate yak skat.
Add yak skin.
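And a rough sketch of the counting behind it, matching the "a** y** *k**" mask from upthread against a system word list (the path is an assumption; any plain word list will do):

```python
# Count candidate words for each slot of the mask "a** y** *k**".
with open("/usr/share/dict/words") as f:   # assumed path; substitute any word list
    words = {w.strip().lower() for w in f if w.strip().isalpha()}

first  = sorted(w for w in words if len(w) == 3 and w[0] == "a")  # a**
second = sorted(w for w in words if len(w) == 3 and w[0] == "y")  # y**
third  = sorted(w for w in words if len(w) == 4 and w[1] == "k")  # *k**

print(len(first), len(second), len(third))
print("raw combinations:", len(first) * len(second) * len(third))
# The raw product is large, but grammar and context eliminate almost all of
# the combinations, which is exactly the redundancy under discussion.
```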
If you're telling me I can randomly hide two-thirds of the characters in any English sentence and you can still derive the meaning, I'm going to need a citation beyond "information theory" in general.
But this is tangential to the point. I'm saying the context is preserved with compression, hence the meaning is preserved. I'm saying we can compress the number of letters used and still get the same meaning.
Elements of Information Theory (2nd ed.), Cover and Thomas, Section 6.4. I also suggest following the authors' suggested reading at the end of the section.
Also, there is a difference between having information and having perfect information. English can't guarantee perfect recovery of arbitrary statements with two-thirds loss.
You need fault tolerance, and therefore redundancy, in your language. You don't want language to be like phone numbers, where missing a single digit changes the meaning completely. Not to mention that phone numbers are also hard for the human brain to remember. We are not computers.