BCOS Compressed Native File Format Specification
This document describes the native file format used for compressed files. It is intended to provide simple but effective compression for native file formats only.
The method used to compress data involves finding strings of bytes (or "runs") that are repeated, and replacing the string of bytes with a reference to its duplicate.
Future versions of this specification are not guaranteed to be backward or forward compatible, although compatibility will be preserved where possible.
In general, this specification is not expected to change in the future.
The basic structure of the file is described in Figure 3-1. File Structure.
|End of file|
|Note: Not to scale.|
The generic file header is the native file format header defined in the BCOS Native File Format Specification. To comply with this specification, its file type field must be 0xC0000000.
The extended file header follows the generic file header, and is described in Table 3-1. Extended File Header Format.
|Offset||Size||Description|
|0x00000030||8 bytes||Uncompressed file size|
|0x00000038||4 bytes||Uncompressed file checksum|
|0x0000003C||4 bytes||Uncompressed file type|
This information serves three purposes. It is used to reconstruct the first 24 bytes of the original file's native file format header during decompression (the first 24 bytes of the original file are not included in the compressed data section). It allows software to determine the type of the original file without decompressing it first. It also allows decompression software to allocate a buffer of the correct size for the decompressed data. When a file is compressed, its checksum is copied "as is" into the compressed file's extended header, so if the file's checksum was incorrect before the file was compressed it will still be incorrect after the file is decompressed. However, if (and only if) the file's checksum has not been set, the code used to compress the file may, and should, generate a correct checksum before compressing the file.
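As an illustration, the three extended header fields could be read as sketched below. This is a minimal sketch, not part of the specification: the name parse_extended_header is hypothetical, and little-endian byte order is an assumption (it matches the offset example later in this document, but the extended header's byte order is not stated here).

```python
import struct

# Offset of the extended file header, taken from Table 3-1.
EXTENDED_HEADER_OFFSET = 0x30

# 8-byte size, 4-byte checksum, 4-byte type.
# ASSUMPTION: fields are little-endian; the text above does not say so.
EXTENDED_HEADER = struct.Struct("<QII")


def parse_extended_header(file_bytes):
    """Return the three extended header fields as a dictionary."""
    size, checksum, file_type = EXTENDED_HEADER.unpack_from(
        file_bytes, EXTENDED_HEADER_OFFSET
    )
    return {
        "uncompressed_size": size,
        "uncompressed_checksum": checksum,
        "uncompressed_type": file_type,
    }
```

A decompressor could use the "uncompressed_size" value to allocate its output buffer before processing any entries.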
The compressed data consists of a variable number of entries, and ends when both the end of the compressed file and the end of the uncompressed data have been reached. There are two types of entries: unmatched runs (where the data is embedded "as is" into the compressed data) and matched runs (where the data can be copied from somewhere else).
The entry for an unmatched run indicates how many bytes of data are inserted unchanged into the compressed data.
|Bit(s)||Description|
|7||Must be 0 to indicate run is unmatched|
|5 to 6||Number of extra size bytes|
|0 to 4||Size bits 0 to 4|
The size of the run is determined from the size bits in the initial byte, plus the bits from any extra size bytes, plus one. For example, if the initial byte is 0x65 then there are three extra size bytes following the initial byte, and if the extra size bytes are "0x56, 0x34, 0x12" then the size of the run would be "(0x65 & 0x1F) + (0x56 << 5) + (0x34 << 13) + (0x12 << 21) + 1". These bytes would be followed by the bytes of the unmatched run itself.
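The size calculation for an unmatched run can be sketched as follows. The function name decode_unmatched_header is hypothetical, and the sketch only decodes the entry's size bytes; it does not copy the run itself.

```python
def decode_unmatched_header(data, pos=0):
    """Decode the size of the unmatched run whose initial byte is data[pos].

    Returns (run_size, position_of_first_run_byte).
    """
    initial = data[pos]
    if initial & 0x80:
        raise ValueError("bit 7 is set: this entry is a matched run")

    extra_count = (initial >> 5) & 0x03   # bits 5 to 6
    size = initial & 0x1F                 # size bits 0 to 4

    # Each extra size byte supplies the next 8 size bits, starting at bit 5.
    for i in range(extra_count):
        size |= data[pos + 1 + i] << (5 + 8 * i)

    return size + 1, pos + 1 + extra_count
```

With the initial byte 0x65 and the extra size bytes 0x56, 0x34, 0x12 used in the formula above, this returns a run size of 38177478 and reports that the run's data starts four bytes after the start of the entry.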
The entry for a matched run indicates how many bytes of data are the same as the bytes at a specified offset.
|Bit(s)||Description|
|7||Must be 1 to indicate run is matched|
|5 to 6||Number of extra size bytes|
|4||Offset encoding (0 = literal, 1 = negative)|
|2 to 3||Number of offset bytes - 1|
|0 to 1||Size bits 0 to 1|
The size of the run is determined from the size bits in the initial byte, plus the bits from any extra size bytes, plus three. For example, if the initial byte is 0xE3 then there are three extra size bytes following the initial byte, and if the extra size bytes are "0x56, 0x34, 0x12" then the size of the matched run would be "(0xE3 & 0x03) + (0x56 << 2) + (0x34 << 10) + (0x12 << 18) + 3".
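The fields of a matched run's initial byte, and its size, can be sketched in the same way. The name decode_matched_header and the shape of its return value are hypothetical.

```python
def decode_matched_header(data, pos=0):
    """Decode the initial byte and size bytes of a matched run at data[pos].

    Returns (run_size, offset_byte_count, is_negative, position_of_offset).
    """
    initial = data[pos]
    if not initial & 0x80:
        raise ValueError("bit 7 is clear: this entry is an unmatched run")

    extra_count = (initial >> 5) & 0x03          # bits 5 to 6
    is_negative = bool(initial & 0x10)           # bit 4
    offset_byte_count = ((initial >> 2) & 0x03) + 1  # bits 2 to 3, plus one
    size = initial & 0x03                        # size bits 0 to 1

    # Each extra size byte supplies the next 8 size bits, starting at bit 2.
    for i in range(extra_count):
        size |= data[pos + 1 + i] << (2 + 8 * i)

    return size + 3, offset_byte_count, is_negative, pos + 1 + extra_count
```

For the initial byte 0xE3 followed by 0x56, 0x34, 0x12, this gives a run size of 4772190, one offset byte and a literal offset encoding.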
The extra size bytes (if any) or the initial byte (if there are no extra size bytes) are immediately followed by 1, 2, 3 or 4 bytes used to encode the offset of the matching run, least significant byte first. The number of offset bytes is determined by bits 2 to 3 in the initial byte. For example, if the initial byte is 0x8C (no extra size bytes and 4 offset bytes) and the next 4 bytes are "0x78, 0x56, 0x34, 0x12" then the offset would be "0x78 + (0x56 << 8) + (0x34 << 16) + (0x12 << 24)".
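Reading the offset bytes of a matched run can be sketched as below. The name decode_offset is hypothetical; the byte order follows the example above, where the first offset byte is the least significant.

```python
def decode_offset(data, pos=0):
    """Read the raw offset of the matched run whose initial byte is data[pos].

    Returns (offset, position_of_next_entry). The offset is returned as
    encoded; it is not yet converted from negative to literal form.
    """
    initial = data[pos]
    extra_count = (initial >> 5) & 0x03              # extra size bytes to skip
    offset_byte_count = ((initial >> 2) & 0x03) + 1  # 1, 2, 3 or 4

    start = pos + 1 + extra_count
    raw = data[start:start + offset_byte_count]
    if len(raw) != offset_byte_count:
        raise ValueError("compressed data ends inside the offset bytes")

    # First offset byte is the least significant, as in the example.
    return int.from_bytes(raw, "little"), start + offset_byte_count
```

For the bytes 0x8C, 0x78, 0x56, 0x34, 0x12 this yields the offset 0x12345678.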
If the offset is a literal offset (bit 4 in the initial byte is clear) then the offset is an offset from the beginning of the decompressed data and remains "as is". If the offset is a negative offset (bit 4 in the initial byte is set), then it is a displacement backwards from the current position in the decompressed data. Negative offsets can be converted into literal offsets using "literal_offset = (offset_for_next_byte_in_decompressed_data - 1) - negative_offset". Normally compression code chooses whichever encoding gives the shortest offset. For very large files (larger than 4 GiB), having two encodings also increases the range of offsets that can be reached: for example, in a 12 GiB file the offset may refer to somewhere in the first 4 GiB of the decompressed file or to somewhere in the 4 GiB before the current position in the decompressed file.
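Putting the entry formats together, a decompression loop for the compressed data section might look like the sketch below. It is illustrative only, and the names resolve_offset and decompress_entries are hypothetical. It makes several assumptions that this document does not settle: offsets index into the output buffer that the function is building (reconstruction of the first 24 header bytes is left to the caller), a matched run may overlap the bytes it is producing (so bytes are copied one at a time), and the caller checks the final length against the uncompressed file size.

```python
def resolve_offset(offset, is_negative, position):
    """Convert an encoded offset into a literal offset.

    `position` is the offset of the next byte in the decompressed data.
    """
    if is_negative:
        return (position - 1) - offset
    return offset


def decompress_entries(data, output=None):
    """Expand every entry in `data`, appending to `output` (a bytearray)."""
    out = bytearray() if output is None else output
    pos = 0
    while pos < len(data):
        initial = data[pos]
        matched = bool(initial & 0x80)
        extra_count = (initial >> 5) & 0x03
        low_bits = 2 if matched else 5       # size bits held in initial byte
        size = initial & ((1 << low_bits) - 1)
        for i in range(extra_count):
            size |= data[pos + 1 + i] << (low_bits + 8 * i)
        pos += 1 + extra_count

        if not matched:
            size += 1
            if pos + size > len(data):
                raise ValueError("compressed data ends inside unmatched run")
            out += data[pos:pos + size]      # bytes are stored "as is"
            pos += size
            continue

        size += 3
        count = ((initial >> 2) & 0x03) + 1
        raw = data[pos:pos + count]
        if len(raw) != count:
            raise ValueError("compressed data ends inside the offset bytes")
        pos += count
        src = resolve_offset(
            int.from_bytes(raw, "little"), bool(initial & 0x10), len(out)
        )
        if src < 0 or src >= len(out):
            raise ValueError("offset is outside the decompressed data")
        # ASSUMPTION: overlapping runs are allowed, so copy byte by byte.
        for i in range(size):
            out.append(out[src + i])
    return bytes(out)
```

For example, an unmatched run holding "abc", followed by a matched run with literal offset 0 and a matched run with negative offset 2 (each of size 3), expands to "abcabcabc".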