Categories: Uncategorized

13 Oct 2004

Common compression algorithms use statistical analysis of the data to find common patterns and give those patterns shorter representations.
And I’m thinking – if the data to be compressed is a block of text, couldn’t a dictionary be used to compress it? Give the compression program an indexed dictionary, and replace each word with its dictionary index. Unrecognized words can be left as they are, or perhaps given an ad-hoc index stored inside the compressed file.

This, naturally, will only be useful for natural language texts, and needs a dictionary for each language.
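Something like this minimal sketch, say – the three-word dictionary and the “?” escape for unrecognized words are just stand-ins for illustration:

    # Toy sketch: a shared, indexed word list; known words become indices,
    # unknown words are kept literal behind an escape marker.
    DICTIONARY = ["the", "book", "compression"]
    INDEX = {word: i for i, word in enumerate(DICTIONARY)}

    def compress(text):
        tokens = []
        for word in text.split():
            if word in INDEX:
                tokens.append(str(INDEX[word]))   # known word -> its index
            else:
                tokens.append("?" + word)         # unrecognized word, left as-is
        return " ".join(tokens)

    def decompress(data):
        words = []
        for token in data.split():
            if token.startswith("?"):
                words.append(token[1:])
            else:
                words.append(DICTIONARY[int(token)])
        return " ".join(words)

    print(compress("the book about compression"))              # 0 1 ?about 2
    print(decompress(compress("the book about compression")))  # round-trips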

Is this feasible? Is this already being done? Is this actually useful for anything?

11 comments

iod

13 October 2004 at 10:16

If there are, say, thousands of words in a language, and the average length of a word is less than 4 characters, then this would be quite wasteful, nay?

calanya

13 October 2004 at 11:26

It’s not better than general purpose compression algorithms.

Huffman encoding does something similar to this, except it constructs an optimal dictionary for the specific text, and then appends that dictionary to the compressed data.
There are algorithms with similar compression capabilities which don’t construct a dictionary.
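For illustration, a rough character-level sketch of that “build the dictionary from the specific text” idea (toy code, not any particular library’s implementation):

    # Build a Huffman prefix code from the character frequencies of one text.
    import heapq
    from collections import Counter

    def huffman_codes(text):
        # heap entries: (frequency, tie-breaker, {char: code-so-far})
        heap = [(freq, i, {ch: ""}) for i, (ch, freq) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            merged = {ch: "0" + code for ch, code in left.items()}
            merged.update({ch: "1" + code for ch, code in right.items()})
            heapq.heappush(heap, (f1 + f2, tie, merged))
            tie += 1
        return heap[0][2]

    text = "huffman builds its dictionary from this very sentence"
    codes = huffman_codes(text)
    encoded_bits = sum(len(codes[ch]) for ch in text)
    print(encoded_bits, "bits, versus", 8 * len(text), "bits uncompressed")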

yggdrasil

13 October 2004 at 21:33

Re: It’s not better than general purpose compression algorithms.

Doesn’t Huffman encoding build an optimal dictionary on a character-by-character basis? Or even a byte-by-byte basis, in these days of Unicode and wonder?

As Dubi said, most words are 4 letters or less – but 4 letters is still 8 bytes, which can be nicely compressed to several bits using a Huffman-like optimized tree.

calanya

14 October 2004 at 05:27

You’re right, in a way.

My little knowledge of data compression comes from a single recitation about Huffman coding.
Why not just ask Wikipedia?

passacaglio

14 October 2004 at 09:18

why should it?

The English language has 26 characters, and tens of thousands of words. The statistical variance of the words (as the atoms of this compression) is too big to be represented in a compressed form that produces smaller results. (Can you think of a valid passage that repeats words often enough?)

That “statistical variance” is called the language’s entropy.

yggdrasil

14 October 2004 at 10:28

Re: why should it?

What I thought of is not building an optimized tree based on a word’s frequency within the body of text, but rather having a pre-defined dictionary for a given language that is already indexed and optimized for frequency and probability of appearance in the English language as a whole. This dictionary is part of the compressing/decompressing program, and doesn’t have to be bundled along with the archive. So even if the word “book” only appears once, for instance, it may still be representable by one byte instead of 8.
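Something like this sketch, say – the tiny ranked word list and the one-byte-or-escape scheme are stand-ins I made up for illustration:

    # A fixed, frequency-ranked word list baked into the (de)compressor, so a
    # common word gets a one-byte index even if it appears only once in a text.
    RANKED_WORDS = ["the", "of", "and", "a", "book"]   # stand-in for a real ranked list

    def encode_word(word):
        try:
            rank = RANKED_WORDS.index(word)
        except ValueError:
            raw = word.encode("utf-8")
            return bytes([0xFF, len(raw)]) + raw        # escape marker + literal word
        if rank < 0xFE:
            return bytes([rank])                        # common word: one byte
        return bytes([0xFE]) + rank.to_bytes(2, "big")  # rarer ranked word: three bytes

    print(encode_word("book"))    # b'\x04' -- one byte, even if "book" appears only once
    print(encode_word("zebra"))   # not in the list, so kept literal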

passacaglio

14 October 2004 at 13:59

Because that makes it even worse

Let’s say there are 1,000,000 words in the English language. That’s about 20 bits for a uniform distribution, and let’s say 17 bits for a compressed one. Because of the large number of words, I don’t see it going much lower than that.

The 26-letter alphabet would take only 5 bits per letter, with entropy probably bringing that down to an average of 3 bits.

And so, an average 4-letter word (the most common words are even shorter, such as A, THE, ON, IN) would take an average of 17 bits in your system, and 12 in the regular system.
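A quick back-of-the-envelope check of those figures (the 1,000,000-word count, the 17 bits per word and the 3 bits per letter are my own guesses, same as above):

    import math

    bits_per_word_uniform   = math.log2(1_000_000)   # ~19.9 -> "about 20 bits"
    bits_per_word_global    = 17                      # guess for a global word dictionary
    bits_per_letter_uniform = math.log2(26)           # ~4.7 -> 5 bits
    bits_per_letter_entropy = 3                       # guess after letter-level compression
    avg_word_length = 4

    print(f"uniform word index:     ~{bits_per_word_uniform:.1f} bits per word")
    print(f"global word dictionary: ~{bits_per_word_global} bits per word")
    print(f"uniform letter code:    ~{bits_per_letter_uniform:.1f} bits per letter")
    print(f"letter-level coding:    ~{avg_word_length * bits_per_letter_entropy} bits per 4-letter word")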

Now, this is a rough estimate – letter (literal) compressors adapt to the text, while a global dictionary would always take the same amount, so when I write this poem:

ZOZOBRA ZOZOBRA ZOZOBRA
ALLEVIATE MY UNGUINOUS THEANTHROPIC BODY
WHEREAS I AM A STANDRIFT PAPULIFEROUS POLYHISTOR
CAUSING MAYHAM THROUGHOUT THE LAND.

The literal compression would be good; the letters are well dispersed. But the words are terrible.

yggdrasil

14 October 2004 at 14:48

Re: Because that makes it even worse

Point taken, though I still think the only reason your poem won’t compress well is that you misspelled “mayhem”. 🙂

passacaglio

14 October 2004 at 15:29

Re: Because that makes it even worse

Oh, I meant it was causing possible porkchops.

assafr

21 October 2004 at 15:16

You’re smarter than you think.

I believe the Lempel-Ziv algorithm works in a similar way to what you have described. Too bad – if you had thought of it 29 or so years ago, you would have been famous and possibly rich (there is a patent, but possibly only on the extension by Welch).

Most compressors use the Lempel-Ziv approach to create some kind of dictionary, then for a little extra punch, compress the keys of the dictionary using Huffman.
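As far as I know, that LZ-plus-Huffman combination is exactly what the DEFLATE format behind zlib and gzip does. A quick way to see it on plain text (the sample string here is just the post’s opening sentence, repeated):

    import zlib

    text = ("Common compression algorithms use statistical analysis of data "
            "to determine common patterns and give those patterns shorter "
            "representations in data. ") * 4

    compressed = zlib.compress(text.encode("utf-8"), 9)   # DEFLATE: LZ77 + Huffman
    print(len(text), "bytes in,", len(compressed), "bytes out")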

This is my very superficial knowledge, but if you want more information I can look it up and maybe get some more details.

– Assaf

Anonymous

1 November 2004 at 08:34

even more superficial

What you describe can be found in IBM’s DB2 compression method for relational tables.
I have no idea what exact algorithm it uses, or whether it’s usable on regular text files (it probably is), and frankly – I really don’t care 🙂
Anyway, it’s been there for a while.

And it works exactly as you described it.

More’s the pity for you for being late…

-Avshi.
