View Full Version : 100 MB of Wikipedia compressed to 1 KB, lossless compression
99.9999% compression ratio using ThinBasic.
Lossless compression.
100 MB of Wikipedia compressed to 1 KB.
Next test this week: compression ratio for 1 GB.
Theoretically, this new codec can compress GB, TB, PB into a few KB.
:)
ErosOlmi
16-07-2019, 22:36
Are you kidding us? Please produce some proof.
Hi Eros,
the background that made it possible is here: https://largestprimes.xyz
I am participating in the Prize for Compressing Human Knowledge with this codec.
They will test it for a month and evaluate whether we broke their record.
It seems their current compression record is near 15 MB for the 100 MB Wikipedia file, using the codecs phda9, decomp8, paq8.
Since I was already working with very large numbers for primes, it was easy to compress 100 million digits using ThinBasic.
:)
ErosOlmi
16-07-2019, 23:37
You mean this one?
https://en.wikipedia.org/wiki/Hutter_Prize
http://prize.hutter1.net/
ErosOlmi
17-07-2019, 00:13
This is my very little contribution to the challenge.
The attached script performs the following:
download the zipped file used for the challenge, if not already present in the current script directory
extract the included file into a 100 MB string buffer
compress it into a new string
report results .... very poor compared to current challenge results
Ciao
Eros
uses "ZLib"
Uses "File"
uses "console"
uses "inet"
printl "---------------------------------------------------------------"
printl "Challenge: https://en.wikipedia.org/wiki/Hutter_Prize"
printl " http://prize.hutter1.net/"
printl "---------------------------------------------------------------"
printl "download zipped file used for the challenge if not already present in current script directory"
printl "extract included file into a string buffer of 100MB"
printl "compress it into a new string"
printl "report results .... very poor compared to current challenge results"
printl "---------------------------------------------------------------"
PrintL
printl "Press any key to Start---" IN %CCOLOR_FYELLOW
WaitKey
string sUrlZipFile = "http://mattmahoney.net/dc/enwik8.zip"
string sLocalZipFileName = APP_SourcePath & "enwik8.zip"
printl "---Start downloading", sUrlZipFile
if FILE_Exists(sLocalZipFileName) Then
printl "---File already downloaded"
Else
printl " Downloading ..."
INET_UrlDownload(sUrlZipFile, sLocalZipFileName)
end if
printl " Local file name", sLocalZipFileName
PrintL
string sUncompressedFileName = "enwik8"
printl "---Extracting " & sUncompressedFileName & " to string"
printL " start", Time$
string sOriginal = ZLib_ExtractToString(sLocalZipFileName, "enwik8")
printL " end", Time$
printl " Extraction done. Size of string uncompressed:", LenF(sOriginal)
printl
string sCompress
printl "---Start compressing", Time$
sCompress = StrZip$(sOriginal)
printl " End compressing", Time$
printl " Len Original string.....", lenf(sOriginal)
printl " Len compressed string...", lenf(sCompress)
PrintL
printl "Press any key to end---" IN %CCOLOR_FYELLOW
WaitKey
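For readers without ThinBasic, the same compress-and-report steps can be sketched in Python. This is a rough equivalent, not the script above: it uses the standard zlib module (the same DEFLATE family as the ZLib module in the script) and a small repetitive in-memory sample instead of downloading the 100 MB enwik8 file, so the ratio it prints is far better than what enwik8 allows.

```python
import zlib

# Small stand-in for the 100 MB enwik8 buffer: repetitive XML-like text,
# so zlib has something to squeeze (real enwik8 is far less redundant).
original = b"<page><title>Example</title><text>Some wiki text.</text></page>\n" * 1000

# Level 9 = best compression, mirroring a maximum-effort StrZip$-style call.
compressed = zlib.compress(original, 9)

print("Len original string.....", len(original))
print("Len compressed string...", len(compressed))

# Lossless means the round trip must reproduce the exact original bytes.
assert zlib.decompress(compressed) == original
```

On real enwik8 this approach lands around 36 MB, which is why Eros calls the result "very poor" next to the sub-16 MB challenge entries.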
Hi Eros,
Yes, exactly, that is the prize.
:)
Thank you, Eros, for your experience & contribution.
ThinBasic is very powerful;
lots of commands to learn...
:)
DirectuX
18-07-2019, 15:27
99.9999% compression ratio using ThinBasic.
Lossless compression.
100 MB of Wikipedia compressed to 1 KB.
Next test this week: compression ratio for 1 GB.
Theoretically, this new codec can compress GB, TB, PB into a few KB.
:)
Hi Alberto,
how does the 1 kb (kilobits?) compare to the Shannon entropy of the 100 MB?
Hi,
good question,
they say their 100 MB data file, enwik8, is fairly uniform.
Their link "Information about the enwik8 data file" is:
http://mattmahoney.net/dc/textdata.html
You will find there detailed information about the data, statistics, and graphics of its distribution too:
"This competition ranks lossless data compression programs by the compressed size (including the size of the decompression program) of the first 10^9 bytes of the XML text dump of the English version of Wikipedia on Mar. 3, 2006."
enwik8: compressed size of the first 10^8 bytes of enwik9. This data is used for the Hutter Prize, and is also ranked here but has no effect on this ranking.
enwik9: compressed size of the first 10^9 bytes of enwiki-20060303-pages-articles.xml.
They have been benchmarking well-known codecs for years.
:D
Hi,
I wonder if you mean the certainty of the outcomes of the compressed files generated;
then the entropy is zero.
"Entropy is zero when one outcome is certain."
http://basicknowledge101.com/pdf/km/Entropy%20(information%20theory).pdf
2 shannons of entropy: information entropy is the log-base-2 of the number of possible outcomes; with two coins there are four outcomes, and the entropy is two bits.
Entropy is zero when one outcome is certain.
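Both quoted facts can be checked numerically. A minimal sketch in Python (the helper name shannon_entropy is mine):

```python
import math

def shannon_entropy(probs):
    # H = sum over outcomes of -p * log2(p), in bits;
    # outcomes with p == 0 contribute nothing to the sum.
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Two fair coins: four equally likely outcomes -> 2 bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# One certain outcome -> 0 bits.
print(shannon_entropy([1.0]))  # 0.0
```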
It is the first time I have read about Shannon entropy; it's good to learn something new every day.
thanks
:)
DirectuX
19-07-2019, 09:45
Hi,
I wonder if you mean the certainty of the outcomes of the compressed files generated;
then the entropy is zero.
:)
Hi,
I mean this : "Shannon entropy allows to estimate the average minimum number of bits needed to encode a string of symbols based on the alphabet size and the frequency of the symbols."
see https://www.shannonentropy.netmark.pl/
You would certainly be interested to know the number for the sample you wish to compress.
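That estimate is straightforward to compute from symbol frequencies. A hedged sketch (the function name entropy_bits_per_symbol is my own; this is the order-0 estimate, which ignores context between symbols):

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data: bytes) -> float:
    # Order-0 empirical Shannon entropy: bits per symbol,
    # computed from symbol frequencies alone.
    counts = Counter(data)
    n = len(data)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

sample = b"abracadabra"
h = entropy_bits_per_symbol(sample)
print(f"{h:.2f} bits/symbol -> at least {h * len(sample):.0f} bits for {len(sample)} symbols")
```

Running this over the actual enwik8 bytes would give the lower-bound number DirectuX is asking about: context-modelling compressors beat the order-0 estimate, but every lossless code is still bounded below by the true entropy of the source.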
Hi,
I am going one step beyond that. Science evolves.
It is already explained in the documentation and codec that I am sending to the contest this weekend.
For me, Shannon theory is obsolete, and the proof is the codec.
In a month, I'll be glad to explain it to you.
The first planes were made of wood, then metal, then plastic and fibers; every technological step created new possibilities that were not available in the previous version, and our frame of reference changes. You don't conceive of flying in a plane made of wood now, but that was the historical frame of reference of past people.
best
:D
DirectuX
19-07-2019, 22:47
Hi,
For me, Shannon theory is obsolete, and the proof is the codec.
In a month, I'll be glad to explain it to you.
:D
Okay! See you in a month :)
DirectuX
21-07-2019, 12:53
About data compression with prime numbers:
article : https://arxiv.org/pdf/physics/0511145.pdf
google patent : https://patents.google.com/patent/US6373986
DirectuX
20-09-2019, 18:39
Hello Alberto,
hope you're fine.
How did the contest turn out?