BCOS Native File Format Specification, Version 1.0
To improve consistency, all native file formats used by BCOS begin with a generic file header. The purpose of this generic file header is to identify the exact format of the file, and allow file corruption to be detected by software that does not understand the file's format.
This specification describes the generic file header only and forms the basis for all other native file formats. Depending on the file type, this generic header may be extended with a file type dependent header extension (see the specification for each specific file type for details of any such extension).
The general structure of a native file (shown in Figure 1-1. General File Structure) is always the same.
Figure 1-1. General File Structure (not to scale)

Region | Starting offset |
---|---|
End of file | |
Subfiles | |
Metadata | |
Main File Data | |
Extended Header (if any) | 0x00000030 |
Generic Header | 0x00000000 |
The metadata area contains data about the file itself (e.g. things like copyright owner, title, author, etc). The format used for the metadata is beyond the scope of this specification, and is defined in BCOS File Metadata Specification.
The main file may have subfiles included. These subfiles have the same format as normal files would, and are (mostly) just appended onto the end of the main file (with corresponding updates to values in the main file's header).
Note that subfiles themselves may have subfiles of their own. In theory this means it's possible for a single file to contain a tree of many files. However, excessive use of subfiles should be discouraged, and most software can simply ignore any subfiles. Subfiles are intended only for special cases (e.g. embedding pictures into word processor files).
Any changes made in future versions of this specification are guaranteed to be backward and forward compatible. This is necessary, because (if incompatible changes are made to the specification) there's no way to determine which version of this specification a file conforms to.
Software should assume that all files that comply with this specification (including older versions of this specification) already comply with the most recent version of this specification.
The format of the generic header is shown in Table 3-1. Generic File Header.
Offset | Size | Description |
---|---|---|
0x00000000 | 8 bytes | Total File size (see Section 3.1. Total File Size) |
0x00000008 | 8 bytes | Compliance string (see Section 3.2. Compliance String) |
0x00000010 | 4 bytes | Checksum (see Section 3.3. Checksum) |
0x00000014 | 4 bytes | File type (see Section 3.4. File Type) |
0x00000018 | 8 bytes | Main file size (see Section 3.5. Main File Size) |
0x00000020 | 4 bytes | Metadata area size (see Section 3.6. Metadata Size) |
0x00000024 | 2 bytes | Specification version (see Section 3.7. Specification Version) |
0x00000026 | 2 bytes | Subfile count (see Section 3.8. Subfile Count) |
0x00000028 | 8 bytes | Reserved (must be zero) |
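As an illustration of the layout in Table 3-1, the header can be mirrored by a C structure. This is only a sketch: the structure and member names are hypothetical, and reading a file directly through such a structure additionally assumes that the host's byte order matches the byte order used in the file, which is not stated in this section.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical in-memory view of the generic header (names are illustrative). */
struct bcos_generic_header {
    uint64_t total_file_size;   /* 0x00000000 */
    char     compliance[8];     /* 0x00000008: "BCOS_NFF", no terminating zero */
    uint32_t checksum;          /* 0x00000010: CRC-32C, or zero if not present */
    uint32_t file_type;         /* 0x00000014 */
    uint64_t main_file_size;    /* 0x00000018 */
    uint32_t metadata_size;     /* 0x00000020 */
    uint16_t spec_version;      /* 0x00000024 */
    uint16_t subfile_count;     /* 0x00000026 */
    uint64_t reserved;          /* 0x00000028: must be zero */
};

/* Compile-time checks that the compiler's layout agrees with Table 3-1. */
static_assert(offsetof(struct bcos_generic_header, compliance) == 0x08, "compliance offset");
static_assert(offsetof(struct bcos_generic_header, checksum) == 0x10, "checksum offset");
static_assert(offsetof(struct bcos_generic_header, file_type) == 0x14, "file type offset");
static_assert(offsetof(struct bcos_generic_header, main_file_size) == 0x18, "main size offset");
static_assert(offsetof(struct bcos_generic_header, metadata_size) == 0x20, "metadata size offset");
static_assert(offsetof(struct bcos_generic_header, spec_version) == 0x24, "version offset");
static_assert(offsetof(struct bcos_generic_header, subfile_count) == 0x26, "subfile count offset");
static_assert(offsetof(struct bcos_generic_header, reserved) == 0x28, "reserved offset");
static_assert(sizeof(struct bcos_generic_header) == 0x30, "header is 48 bytes");
```

The total of 0x30 bytes matches the offset at which any extended header begins.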
The total file size is included in the generic file header so that it's possible to easily detect if a file has become truncated or extended, and so that it's easier to work with files in RAM.
Note: As this field is a 64-bit field, no native file format will be able to support any file that is equal to or larger than 18,446,744,073,709,551,616 bytes (16 exbibytes). This is equal to the maximum file size for some modern file systems (NTFS, ZFS) and larger than the maximum volume size for other modern file systems (ext4, ReiserFS).
The compliance string must be the ASCII/UTF-8 string "BCOS_NFF", where the first byte (at offset 0x00000008) is 'B' or 0x42 and the last byte (at offset 0x0000000F) is 'F' or 0x46. There is no terminating zero. Note: There is no difference between ASCII and UTF-8 for these characters.
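A minimal sketch of the compliance string test, assuming the file (or at least its first 16 bytes) is already in a buffer. The function name `has_compliance_string` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if bytes 0x08..0x0F of the buffer are exactly "BCOS_NFF", else 0.
   The comparison covers 8 bytes only, because there is no terminating zero. */
static int has_compliance_string(const unsigned char *file, size_t len) {
    static const char expected[8] = {'B', 'C', 'O', 'S', '_', 'N', 'F', 'F'};

    if (len < 0x10) {
        return 0;   /* buffer too short to contain the field */
    }
    return memcmp(file + 0x08, expected, sizeof expected) == 0;
}
```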
The checksum field may contain a 32-bit CRC calculated as per Subsection 3.3.1. CRC Calculation, or may be set to zero to indicate that no CRC is present.
The checksum is calculated using a 32-bit CRC algorithm called "CRC-32C (Castagnoli)", which is the same CRC algorithm used by Intel's CRC32 instruction (so that the CPU's instruction can be used to improve performance, if the instruction is supported). See Listing 3-1. CRC Reference Implementation for the exact algorithm (expressed in C).
```c
#include <string.h>
#include <stdio.h>

// Input Data (for illustrative purposes)

unsigned char string[] = {"123456789"};

// CRC Lookup Table

unsigned int CRCtable[256];


// Code

static void generateCRCtable(void) {
    int i;
    unsigned int crc;

    for(i = 0; i < 256; i++) {
        crc = 0;
        if( (i & 1) != 0) crc ^= 0xF26B8303;
        if( (i & 2) != 0) crc ^= 0xE13B70F7;
        if( (i & 4) != 0) crc ^= 0xC79A971F;
        if( (i & 8) != 0) crc ^= 0x8AD958CF;
        if( (i & 16) != 0) crc ^= 0x105EC76F;
        if( (i & 32) != 0) crc ^= 0x20BD8EDE;
        if( (i & 64) != 0) crc ^= 0x417B1DBC;
        if( (i & 128) != 0) crc ^= 0x82F63B78;
        CRCtable[i] = crc;
    }
}


static unsigned int calculateCRC(unsigned char *string, int length) {
    unsigned int crc;

    crc = 0xFFFFFFFF;
    while(length > 0) {
        crc = (crc >> 8) ^ CRCtable[(crc & 0xFF) ^ *string];
        string = &string[1];
        length--;
    }
    crc ^= 0xFFFFFFFF;
    return(crc);
}


int main(void) {
    int length;

    generateCRCtable();

    length = strlen((char *)string);
    printf("In: '%s' (%d bytes)\n", string, length);
    printf("Out: 0x%X\n", calculateCRC(string, length));

    return 0;
}
```
When calculating the CRC, the first 3 fields in the generic header (the total file size field, the compliance string and the checksum field) are skipped. This means that the CRC is calculated starting from the file type field at offset 0x00000014 in the generic file header. Metadata and subfiles (everything up to the end of the file, as determined by the total file size field) are included in the CRC calculation.
If the calculated CRC is zero then it's substituted with the value 0xFFFFFFFF, so that compliance tests know that the CRC is present. This weakens the strength of the CRC slightly (as the value 0xFFFFFFFF in the checksum field may mean that the CRC is either 0xFFFFFFFF or 0x00000000); however, this is negligible, and much better than using a 31-bit CRC with a "present/not present" flag.
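The coverage and zero-substitution rules above can be sketched as follows, assuming the complete file image is held in memory. The names `crc32c` and `bcos_checksum_field` are hypothetical, and the CRC here is a bit-at-a-time variant of CRC-32C rather than the table-driven form, chosen only to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32C (reflected polynomial 0x82F63B78, initial value and
   final XOR of 0xFFFFFFFF). */
static uint32_t crc32c(const unsigned char *data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *data++;
        for (int bit = 0; bit < 8; bit++) {
            /* XOR in the polynomial only when the low bit is set. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Value to store in (or compare against) the checksum field for a file image
   of total_size bytes. The first 0x14 bytes (total file size, compliance
   string, checksum) are skipped. This sketch does not validate the header. */
static uint32_t bcos_checksum_field(const unsigned char *file, size_t total_size) {
    uint32_t crc;

    if (total_size < 0x14) {
        return 0;   /* nothing to cover; zero means "no CRC present" */
    }
    crc = crc32c(file + 0x14, total_size - 0x14);
    return (crc == 0) ? 0xFFFFFFFFu : crc;   /* zero is reserved for "no CRC" */
}
```

A verifier would compare the result with a non-zero checksum field, and skip the comparison when the field is zero.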
The file type field in the generic header is a 32-bit number that uniquely identifies the type of the file, and therefore identifies the format to be used for the rest of the file's data (including the file type dependent header extension, if any). These file types are also used by the OS (file systems) for non-native file formats, but for non-native file formats there is no header and the OS needs to use other, less reliable methods to determine the file format.
For a full description of file types and a list of currently defined file type numbers (including links to specific file format specifications for all native file formats), please refer to the BCOS File Type Specification.
This field contains the size of the main file (including the generic header, any extended header and the file's data). It does not include the main file's metadata.
Note that this field in the generic header can be used as an "offset to start of metadata" field.
This field contains the size of the metadata area. If this field is zero then there is no metadata (and no "end tag"), otherwise the metadata area starts immediately after the main file (and ends when an "end tag" is found).
Do not rely on this field to determine the offset of the metadata's "end tag". There may be padding between the metadata's "end tag" and the actual end of the metadata area (e.g. to align the start of the first subfile).
Note that if the "subfile count" field is zero then "main file size + metadata size" will equal "total file size"; and if the "subfile count" field is not zero then "main file size + metadata size" equals the offset of the first subfile.
The specification version field in the generic header is a 16-bit number that uniquely identifies the version of the file format specification that the file complies with. These numbers consist of 2 parts: an 8-bit major version number and an 8-bit minor version number. For example, the value 0x0102 would indicate version 1.2 of the native file format's specification.
The specification for each different native file format should contain a description of how software should handle the specification version for that native file format (similar to Chapter 2: Specification Change Policy above). For example, a native file format specification may guarantee that all changes to the specification will maintain backward compatibility, or maintain both backward and forward compatibility, or not maintain any compatibility at all; and should provide advice on whether software should (where possible) upgrade files to the newest version of the specification or use the oldest version of the specification that can be used for the file.
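Splitting the field into its two parts might look like this. The helper names are hypothetical, and the value is assumed to have already been read from offset 0x00000024 as a 16-bit number.

```c
#include <stdint.h>

/* Major version: the high 8 bits of the 16-bit specification version. */
static unsigned spec_major(uint16_t version) {
    return (unsigned)(version >> 8);
}

/* Minor version: the low 8 bits of the 16-bit specification version. */
static unsigned spec_minor(uint16_t version) {
    return (unsigned)(version & 0xFFu);
}
```

For the value 0x0102 these return 1 and 2 respectively, matching the version 1.2 example.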
This is the total number of subfiles that follow the main file's metadata area.
All subfiles must use native file formats (otherwise there's no way to determine where one subfile ends and the next starts).
The beginning of the first subfile is found by adding the metadata area's size to the main file's size. The beginning of the second subfile is found by adding the first subfile's total size to the offset of the start of the first subfile. In this way it's relatively easy to find all subfiles.
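The walk described above can be sketched as follows for a file image held in memory. The function names are hypothetical. The code ASSUMES that multi-byte header fields are stored little-endian, which this section does not state, and it adds bounds checks that are this sketch's own choice rather than requirements of the specification.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: header fields are little-endian (not stated in this section). */
static uint64_t read_le(const unsigned char *p, int bytes) {
    uint64_t value = 0;

    for (int i = bytes - 1; i >= 0; i--) {
        value = (value << 8) | p[i];
    }
    return value;
}

/* Stores the offsets of up to max subfiles in out[] and returns how many were
   stored, or -1 if the sizes in the headers do not fit inside the file. */
static int bcos_subfile_offsets(const unsigned char *file, size_t len,
                                uint64_t *out, int max) {
    uint64_t total, main_size, offset;
    int count, found = 0;

    if (len < 0x30) return -1;                       /* no complete header */
    total = read_le(file + 0x00, 8);
    main_size = read_le(file + 0x18, 8);
    count = (int)read_le(file + 0x26, 2);
    if (total > len || main_size > total) return -1;

    /* First subfile: main file size plus metadata area size. */
    offset = main_size + read_le(file + 0x20, 4);

    for (int i = 0; i < count; i++) {
        uint64_t sub_total;

        if (offset > total || total - offset < 0x30) return -1;
        /* Each subfile begins with its own total file size field. */
        sub_total = read_le(file + offset, 8);
        if (sub_total < 0x30 || sub_total > total - offset) return -1;
        if (found < max) out[found++] = offset;
        offset += sub_total;                         /* start of next subfile */
    }
    return found;
}
```

Because each subfile is itself a native file, the same function can be applied to a subfile's bytes to find subfiles nested inside it.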
Suggestions for detecting if a file uses a native file format and has not become corrupted include:
Of these checks, only the CRC check requires more than the header to be read.
Even though the first 3 fields are not included in the CRC calculation, there is no need to be concerned that they may have been modified. If any of these fields have been modified the file will still fail compliance testing.
Note: For compliance testing code (that includes code for calculating a file's CRC for the purpose of ensuring that a non-zero value in the checksum field matches), if a file has no CRC in its checksum field it may be convenient to set one.
The above procedure only checks the file itself is consistent. It is also possible to perform this check recursively to ensure individual subfiles are also consistent.
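As a rough illustration, header-only checks that follow from the field descriptions in this specification can be combined as below. This is a sketch, not the specification's own list of checks: the function names are hypothetical, fields are ASSUMED to be little-endian, and the CRC comparison is left out because it needs the whole file rather than just the header.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: header fields are little-endian (not stated in this section). */
static uint64_t read_le(const unsigned char *p, int bytes) {
    uint64_t value = 0;

    for (int i = bytes - 1; i >= 0; i--) {
        value = (value << 8) | p[i];
    }
    return value;
}

/* Returns 1 if the generic header is self-consistent, else 0. actual_len is
   the number of bytes the file (or subfile) really occupies. */
static int bcos_header_looks_valid(const unsigned char *file, uint64_t actual_len) {
    uint64_t total, main_size, meta_size, subfiles;

    if (actual_len < 0x30) return 0;                     /* no complete header */
    total = read_le(file + 0x00, 8);
    main_size = read_le(file + 0x18, 8);
    meta_size = read_le(file + 0x20, 4);
    subfiles = read_le(file + 0x26, 2);

    if (total != actual_len) return 0;                   /* truncated or extended */
    if (memcmp(file + 0x08, "BCOS_NFF", 8) != 0) return 0;
    if (read_le(file + 0x28, 8) != 0) return 0;          /* reserved must be zero */
    if (main_size < 0x30 || main_size > total) return 0; /* includes the header */
    if (meta_size > total - main_size) return 0;
    /* With no subfiles, main file plus metadata must fill the whole file. */
    if (subfiles == 0 && main_size + meta_size != total) return 0;
    return 1;
}
```

For a recursive check, the same function could be called on each subfile, passing that subfile's own total file size as `actual_len` once it is known to fit inside the parent.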