
Compute Entropy Of Different File Extensions To Find Randomness Of Data?

I have different file types, including JPEG/JPG, MP3, GIF, MP4, FLV, M4V, EXE, ZIP, etc. Take the data in blocks, something like a 4K block size, and find the randomness. Generate ran

Solution 1:

+1, very interesting question. Here are a few untested ideas off the top of my head:

  1. How about using the correlation coefficient

    with uniformly random data of the same size (4K)? And/or apply an FFT first and then correlate ...

  2. Or compute statistical properties and infer from there somehow ...

    I know this is a vague description, but it might be worth digging into ...

  3. Use compression

    For example, compress your 4K data with Huffman coding and infer your coefficient from the ratio between the uncompressed and compressed sizes, maybe on a logarithmic scale ...

I think the third approach will be the easiest to implement and will give plausible results, as Huffman coding and entropy are closely related.
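A minimal sketch of the third idea in portable C++ (no VCL), under a few assumptions of mine: only the Huffman payload is counted (the code table is ignored), the score is taken as compressed size divided by raw size so that it lands in <0..1>, and a block with a single distinct byte value is scored as 0. The names `huffman_bits` and `huffman_score` are hypothetical helpers, not part of the original answer. The payload size is obtained without building the actual codes: every merge of the two lightest nodes adds one bit to each symbol below the merged node.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Number of payload bits a Huffman code would need for `data`
// (code table overhead ignored; hypothetical helper for illustration).
std::uint64_t huffman_bits(const unsigned char *data, std::size_t n)
{
    assert(data != nullptr || n == 0);

    // Histogram of byte values.
    std::uint64_t count[256] = {0};
    for (std::size_t i = 0; i < n; ++i) ++count[data[i]];

    // Min-heap holding the weight of every symbol that occurs.
    std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                        std::greater<std::uint64_t>> heap;
    for (int s = 0; s < 256; ++s)
        if (count[s] != 0) heap.push(count[s]);

    // Assumption: empty input or a single distinct symbol costs 0 bits.
    std::uint64_t bits = 0;
    while (heap.size() > 1) {
        std::uint64_t a = heap.top(); heap.pop();
        std::uint64_t b = heap.top(); heap.pop();
        bits += a + b;     // each symbol under the merged node gains one code bit
        heap.push(a + b);
    }
    return bits;
}

// Score in <0..1>: Huffman payload size relative to the raw size (8 bits per byte).
double huffman_score(const unsigned char *data, std::size_t n)
{
    if (n == 0) return 0.0;
    return static_cast<double>(huffman_bits(data, n)) / (8.0 * static_cast<double>(n));
}
```

Calling `huffman_score(block, 4096)` on each 4K block would then give a value near 0 for highly repetitive data and near 1 for data that Huffman coding cannot shrink.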

[edit1] using Shannon information entropy

Your suggestion is even better than Huffman encoding (even though the two are closely related). Shannon information entropy H with base 2 gives the average number of bits per word needed to represent your data (the limit that Huffman encoding approaches). So to get a score in <0..1>, just divide by the number of bits per word ...

Here is a small C++/VCL example of entropy computation on BYTEs:
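Written out for 8-bit words, the computation could look like the following. The symbols $n_i$ (number of occurrences of byte value $i$), $N$ (block size in bytes) and $p_i$ are my own notation, not the answer's:

```latex
p_i = \frac{n_i}{N}, \qquad
H = -\sum_{\substack{i=0 \\ p_i > 0}}^{255} p_i \log_2 p_i , \qquad
0 \le H \le 8, \qquad
\text{score} = \frac{H}{8} \in [0, 1]
```

H reaches 8 only when all 256 byte values are equally frequent, and it is 0 when the block contains a single byte value.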

//$$---- Form CPP ----
//---------------------------------------------------------------------------
#include <vcl.h>
#include <math.h>
#pragma hdrstop
#include "Unit1.h"
//---------------------------------------------------------------------------
#pragma package(smart_init)
#pragma resource "*.dfm"
TForm1 *Form1;
//---------------------------------------------------------------------------
char txt0[]="text Text bla bla bla ...";
char txt1[]="abcdefghijklmnopqrstuvwxy";
char txt2[]="AAAAAAAbbbbbccccddddeeeff";
//---------------------------------------------------------------------------
double entropy(BYTE *dat,int siz){
    if ((siz<=0)||(dat==NULL)) return 0.0;
    int i; double H=0.0,P[256],dp=1.0/siz,base=1.0/log(2.0);
    for (i=0;i<256;i++) P[i]=0.0;       // clear probabilities
    for (i=0;i<siz;i++) P[dat[i]]+=dp;  // probability of each BYTE value
    for (i=0;i<256;i++)
        {
        if (P[i]==0.0) continue;    // skip non existing items
        if (P[i]==1.0) return 0.0;  // this means only 1 item type is present in data
        H-=P[i]*log(P[i])*base;     // Shannon entropy (binary bits base)
        }
    return H;
    }
//---------------------------------------------------------------------------
__fastcall TForm1::TForm1(TComponent* Owner):TForm(Owner)
    {
    // note: sizeof() also counts the terminating '\0' of each string as one more symbol
    mm_log->Lines->Clear();
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt0,entropy((BYTE*)txt0,sizeof(txt0)),entropy((BYTE*)txt0,sizeof(txt0))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt1,entropy((BYTE*)txt1,sizeof(txt1)),entropy((BYTE*)txt1,sizeof(txt1))/8.0));
    mm_log->Lines->Add(AnsiString().sprintf("txt = \"%s\" , H = %.6lf , H/8 = %.6lf",txt2,entropy((BYTE*)txt2,sizeof(txt2)),entropy((BYTE*)txt2,sizeof(txt2))/8.0));
    }
//-------------------------------------------------------------------------

And results:

txt = "text Text bla bla bla ..." , H = 3.185667 , H/8 = 0.398208
txt = "abcdefghijklmnopqrstuvwxy" , H = 4.700440 , H/8 = 0.587555
txt = "AAAAAAAbbbbbccccddddeeeff" , H = 2.622901 , H/8 = 0.327863
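To connect this back to the question's 4K blocks, here is a rough sketch in portable C++ (no VCL) that scores every block of a buffer with H/8. It is only one possible arrangement: `entropy_bits` and `block_scores` are hypothetical names of mine, integer counts are used instead of accumulating probabilities, and the last block is allowed to be shorter than the block size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Shannon entropy of a byte buffer in bits per byte (range 0..8).
double entropy_bits(const unsigned char *dat, std::size_t siz)
{
    if (dat == nullptr || siz == 0) return 0.0;

    // Histogram of byte values.
    std::size_t cnt[256] = {0};
    for (std::size_t i = 0; i < siz; ++i) ++cnt[dat[i]];

    double H = 0.0;
    for (int i = 0; i < 256; ++i) {
        if (cnt[i] == 0) continue;   // skip byte values that do not occur
        double p = static_cast<double>(cnt[i]) / static_cast<double>(siz);
        H -= p * std::log2(p);
    }
    return H;
}

// Score (H/8, range 0..1) for each consecutive block of `block` bytes.
// Assumption: a trailing partial block is scored on its own, shorter length.
std::vector<double> block_scores(const std::vector<unsigned char> &data,
                                 std::size_t block = 4096)
{
    assert(block > 0);
    std::vector<double> scores;
    for (std::size_t off = 0; off < data.size(); off += block) {
        std::size_t len = std::min(block, data.size() - off);
        scores.push_back(entropy_bits(data.data() + off, len) / 8.0);
    }
    return scores;
}
```

The buffer could be filled from any of the file types in the question, for example by reading the file in binary mode into a `std::vector<unsigned char>`, and the per-block scores can then be compared across file types.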
