Path: archiver1.google.com!news2.google.com!news1.google.com!newsfeed.stanford.edu!newsfeeds.belnet.be!news.belnet.be!psinet-eu-nl!psiuk-p4!uknet!psiuk-p3!uknet!psiuk-n!news.pace.co.uk!nh.pace.co.uk!kbracey.cam.pace.co.uk%kbracey
From: Kevin Bracey <kevin.bra...@pace.co.uk>
Newsgroups: comp.sys.acorn.programmer
Subject: Re: Acorn's squeeze
Date: Thu, 13 Dec 2001 14:43:17 GMT
Organization: Posted on a server owned by Pace Micro Technology plc
Lines: 74
Message-ID: <3ed643e84a.kbracey@kbracey.cam.pace.co.uk>
References: <9vaa04$2ina$1@mimas.salford.ac.uk>
NNTP-Posting-Host: kbracey.cam.pace.co.uk
X-Trace: nh.pace.co.uk 1008255472 29864 136.170.129.213 (13 Dec 2001 14:57:52 GMT)
X-Complaints-To: newsmaster@news.cam.pace.co.uk
NNTP-Posting-Date: 13 Dec 2001 14:57:52 GMT
To: p.f.john...@salford.ac.uk
In-Reply-To: <9vaa04$2ina$1@mimas.salford.ac.uk>
X-Organization: Pace Micro Technology
User-Agent: Messenger-Pro/2.50a (MsgServe/1.50) (RISC-OS/3.71) NewsHound/1.38

In message <9vaa04$2in...@mimas.salford.ac.uk> Paul F. Johnson wrote:

> Hi,
> 
> Probably just me and my usual lack of info/knowledge/anything else (!), but
> having looked through the PRMs, I can't find anything on the algorhythm
> used in Acorn's squeeze program and was wondering if it ever was made
> public.

Don't think so. Here's a description I wrote, by reverse engineering it.

Summary of the compressed data format:

typedef struct {
    word decoded_size;
    word encoded_size;
    word tables_size;
    word nshorts;
    word nlongs;
    word bytestomove;
} comp_table;

 <-   encoded_size   -> <- tables_size ->
|-----encoded data-----|---dictionary----|-tbl-|--decomp code-|
<- ->
 bytestomove

 <-                             decoded_size                              ->
|-------------------------------decoded data--------------------------------|


The image is scanned for up to 1792 common 32-bit words (short entries) and
up to 1792 common top 24 bits of words (long entries). These form the
dictionary, which is stored compressed above the compressed data.

The dictionary consists of t->nshorts + t->nlongs entries. The entries are
represented by:

       00    WW XX YY    -> previous entry + YYXXWW (long)
       00    WW XX YY ZZ -> previous entry + ZZYYXXWW (short)
       01-09             -> n entries consecutive from the last one
       0A-5B             -> previous entry + (n-&0A)
       5C-AD XX          -> previous entry + ((n-&5C)<<8) + XX
       AE-FF XX YY       -> previous entry + ((n-&AE)<<16) + YYXX

Data is compressed two words at a time, and must be decompressed from the top
down. The compressed form starts with a single byte (at the highest memory
location), split into two nibbles. The least-significant nibble represents
the lower-addressed word, the most-significant nibble the higher. Further
bytes are read from decreasing memory addresses, first for the low word, then
the higher, depending on the contents of the nibble. Words encoding forms
are:

                   0  -> 00000000
      WW XX YY ZZ  1  -> WWXXYYZZ
               ZZ 2-8 -> short[((n-2) << 8) + ZZ]
            YY ZZ 9-F -> (long[((n-9) << 8) + ZZ] << 8) + YY

For example, (45 12 34 56 78 51) represents (12345678 short[345])

The encoding scheme is well suited to 32-bit oriented data, particularly ARM
instructions. Before compression, the data would normally be padded out to an
even number of words with zeros. This extra trailing data is harmless for
AIF.

A six-word header precedes the decompression code, giving details of the data
and dictionary sizes. It also gives an indication of when to shuffle the
final compressed data out of the way when decompressing in place.

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc                     Tel: +44 (0) 1223 518566
645 Newmarket Road                            Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom            WWW: http://www.pace.co.uk/