What Is File Entropy


Every computer file is a sequence of bytes, and each byte takes a value from 0 to 255. Information entropy is a statistical measure of how evenly those byte values are distributed in a file. It is calculated from the probability with which each byte value occurs.
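As a minimal sketch of this definition, the Shannon entropy of a file can be computed as H = -sum(p_i * log2(p_i)), where p_i is the share of bytes that have the value i. The function name byte_entropy is chosen here for illustration and does not come from any particular library. The result is in bits per byte and ranges from 0 (one value repeated) to 8 (all 256 values equally frequent).

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # how many times each byte value (0..255) occurs
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(byte_entropy(b"aaaaaaaa"))        # a single repeated value: prints 0.0
print(byte_entropy(bytes(range(256))))  # every value exactly once: prints 8.0
```

To apply it to a real file, read the file in binary mode and pass the resulting bytes object to the function.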

You can assess entropy visually with a histogram, which shows how often each byte value occurs in the file. Because entropy is reflected in the shape of the histogram, the histogram alone is often enough to guess what type of file we are looking at.
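A histogram of this kind is simply a count of occurrences for each of the 256 byte values. The sketch below builds such a count and prints it as text bars. The helper names byte_histogram and print_histogram are illustrative, and a real tool would more likely draw the bars with a plotting library.

```python
from collections import Counter


def byte_histogram(data: bytes) -> list[int]:
    """Return a 256-element list: entry i is how often byte value i occurs."""
    counts = Counter(data)
    return [counts.get(value, 0) for value in range(256)]


def print_histogram(data: bytes, width: int = 40) -> None:
    """Print one text bar for every byte value that occurs in data."""
    hist = byte_histogram(data)
    peak = max(hist)
    for value, count in enumerate(hist):
        if count:
            # Scale each bar relative to the most frequent byte value.
            bar = "#" * max(1, round(width * count / peak))
            print(f"{value:3d} {bar} {count}")


print_histogram(b"hello, world")
```

Byte values that never occur produce no bar, which corresponds to the empty regions of a histogram.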

For demonstration, let's take three files of different types and compare their histograms. The first is a text file (*.TXT). Its histogram is shown in the figure:

[Figure: histogram of the text file]

The text file contains only text. Each character is encoded with one or more bytes according to an encoding table. Although there are many encodings, the number of letters, digits, and punctuation marks actually used in a text is limited and usually far smaller than the 256 possible byte values. Therefore the first histogram is filled only in some regions, and some byte values do not occur at all.
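The limited alphabet of a text file is easy to check by counting distinct byte values. The sketch below uses a short English sentence encoded as ASCII in place of a real file, so the exact numbers are only illustrative.

```python
sample = "Any computer file is made up of bytes. A byte can take values from 0 to 255."
encoded = sample.encode("ascii")

used = sorted(set(encoded))  # distinct byte values that occur in the text
unused = 256 - len(used)     # byte values that never occur

print(f"{len(used)} of 256 byte values are used, {unused} are not")
print("lowest value:", used[0], "highest value:", used[-1])
```

Every value that occurs falls inside the printable ASCII range, which is why the histogram of such a file has large empty regions.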

The next file is in PDF format:

[Figure: histogram of the PDF file]

This file contains every possible byte value, because a PDF is encoded differently from a text file. It stores a lot of service information: formatting, fonts, images, and so on. Its histogram shows that some byte values occur with roughly equal probability, while others occur much more often. Hence the multiple sharp spikes, and the histogram as a whole looks rather "ragged", although it occupies the entire available width.

The last file is an archive compressed in 7Z format:

[Figure: histogram of the 7Z archive]

This histogram has two main features. First, all byte values occur in the archive with more or less equal probability, which gives the histogram a fairly flat top edge. Second, there is practically no free space above the bars, which indicates that such a file contains almost no redundancy. We can conclude that the archiver's algorithm re-encodes the contents of the file so that the byte values are distributed as uniformly as possible.
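The effect of compression on entropy can be reproduced with a small experiment. The sketch below makes several assumptions. It uses Python's standard lzma module, which implements the LZMA family of algorithms also used by 7Z but writes an .xz stream rather than a 7Z archive. It uses generated text with a made-up vocabulary in place of a real file. It estimates redundancy as 1 - H/8, which considers only the frequencies of single bytes.

```python
import lzma
import math
import random
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return sum(-(n / total) * math.log2(n / total)
               for n in Counter(data).values())


# Reproducible stand-in for a text file: 20,000 words drawn at random
# from a small, made-up vocabulary.
random.seed(0)
vocabulary = ["file", "byte", "entropy", "histogram", "text", "archive",
              "value", "table", "image", "font", "format", "data"]
text = " ".join(random.choice(vocabulary) for _ in range(20_000)).encode("ascii")

packed = lzma.compress(text)

for label, blob in (("text", text), ("compressed", packed)):
    h = byte_entropy(blob)
    redundancy = 1 - h / 8  # share of each byte's 8 bits that carries no information
    print(f"{label:>10}: {len(blob):6d} bytes, "
          f"entropy {h:.2f} bits/byte, redundancy {redundancy:.0%}")
```

The compressed data should come out much shorter and with an entropy close to the 8-bit maximum, which matches the flat histogram described above.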

Thus entropy in computer science, as in physics, is a measure of the disorder in a system, in this case the disorder in the distribution of bytes in a file. Entropy lets you judge how strongly a file is compressed and, indirectly, what type of file it is.
