compact binary serialization format with built-in compression
# Hateno Specification

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in [RFC 2119][rfc2119].

## 0. Table of Contents

<!-- TOC start (generated with https://github.com/derlin/bitdowntoc) -->

- [1. Overview](#1-overview)
- [2. Data Representation](#2-data-representation)
  - [2.1 Numeric Encoding](#21-numeric-encoding)
  - [2.2 String Encoding](#22-string-encoding)
- [3. Type System](#3-type-system)
  - [3.1 Type Identifiers](#31-type-identifiers)
  - [3.2 Array-Compatible Types](#32-array-compatible-types)
  - [3.3 Map Key Compatible Types](#33-map-key-compatible-types)
- [4. Binary Encoding](#4-binary-encoding)
  - [4.1 Primitive Types](#41-primitive-types)
    - [4.1.1 Integer Types](#411-integer-types)
    - [4.1.2 Floating-Point Types](#412-floating-point-types)
    - [4.1.3 Boolean](#413-boolean)
  - [4.2 String Type](#42-string-type)
  - [4.3 Option Type](#43-option-type)
  - [4.4 List Type (Heterogeneous)](#44-list-type-heterogeneous)
  - [4.5 Map Type](#45-map-type)
  - [4.6 Array Type (Homogeneous)](#46-array-type-homogeneous)
  - [4.7 Timestamp Type](#47-timestamp-type)
  - [4.8 UUID Type](#48-uuid-type)
- [5. File Format](#5-file-format)
  - [5.1 File Structure](#51-file-structure)
  - [5.2 Header Format](#52-header-format)
    - [5.2.1 Magic Bytes](#521-magic-bytes)
      - [5.2.1.1 File Extension and MIME type](#5211-file-extension-and-mime-type)
    - [5.2.2 Version](#522-version)
    - [5.2.3 Flags](#523-flags)
    - [5.2.4 Compression Method](#524-compression-method)
    - [5.2.5 Payload Length](#525-payload-length)
  - [5.3 Payload](#53-payload)
- [6. Example](#6-example)
- [7. Conformance Requirements](#7-conformance-requirements)

<!-- TOC end -->

<!-- TOC --><a name="1-overview"></a>

## 1. Overview

Hateno is a binary serialization format designed to be simple to read and write.

<!-- TOC --><a name="2-data-representation"></a>

## 2. Data Representation

<!-- TOC --><a name="21-numeric-encoding"></a>

### 2.1 Numeric Encoding

- Unsigned integers: Standard binary representation
- Signed integers: Two's complement representation
- Floating-point: IEEE 754 standard (binary32 for f32, binary64 for f64)
- Endianness: Determined by file header flags (see Section 5.2.3)

<!-- TOC --><a name="22-string-encoding"></a>

### 2.2 String Encoding

All strings MUST be encoded as UTF-8. Alternate encoding schemes, such as Java's
[Modified UTF-8][mutf8], are explicitly NOT supported.

<!-- TOC --><a name="3-type-system"></a>

## 3. Type System

<!-- TOC --><a name="31-type-identifiers"></a>

### 3.1 Type Identifiers

| ID   | Type      | Description                   | Size (bytes) |
| ---- | --------- | ----------------------------- | ------------ |
| 0x00 | u8        | Unsigned 8-bit integer        | 1            |
| 0x01 | i8        | Signed 8-bit integer          | 1            |
| 0x02 | u16       | Unsigned 16-bit integer       | 2            |
| 0x03 | i16       | Signed 16-bit integer         | 2            |
| 0x04 | u32       | Unsigned 32-bit integer       | 4            |
| 0x05 | i32       | Signed 32-bit integer         | 4            |
| 0x06 | u64       | Unsigned 64-bit integer       | 8            |
| 0x07 | i64       | Signed 64-bit integer         | 8            |
| 0x08 | f32       | 32-bit IEEE 754 float         | 4            |
| 0x09 | f64       | 64-bit IEEE 754 float         | 8            |
| 0x0A | bool      | Boolean value                 | 1            |
| 0x0B | String    | UTF-8 encoded string          | Variable     |
| 0x0C | Option    | Optional value container      | Variable     |
| 0x0D | List      | Heterogeneous array           | Variable     |
| 0x0E | Map       | Key-value dictionary          | Variable     |
| 0x0F | Array     | Homogeneous typed array       | Variable     |
| 0x10 | Timestamp | Unix timestamp (milliseconds) | 8            |
| 0x11 | UUID      | 128-bit UUID                  | 16           |

**Note**: Type IDs 0x12-0xFF are reserved for future expansion.
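
The type table maps naturally onto a lookup structure. The following is a minimal Python sketch, not part of the specification: the names `TypeId`, `FIXED_SIZES`, and `parse_type_id` are hypothetical, and rejecting reserved IDs at parse time is an assumption about how a reader might treat the 0x12-0xFF range.

```python
from enum import IntEnum


class TypeId(IntEnum):
    """Type identifiers from the table in Section 3.1 (names are illustrative)."""
    U8 = 0x00
    I8 = 0x01
    U16 = 0x02
    I16 = 0x03
    U32 = 0x04
    I32 = 0x05
    U64 = 0x06
    I64 = 0x07
    F32 = 0x08
    F64 = 0x09
    BOOL = 0x0A
    STRING = 0x0B
    OPTION = 0x0C
    LIST = 0x0D
    MAP = 0x0E
    ARRAY = 0x0F
    TIMESTAMP = 0x10
    UUID = 0x11


# Value sizes in bytes for fixed-size types; variable-size types are omitted.
FIXED_SIZES = {
    TypeId.U8: 1, TypeId.I8: 1,
    TypeId.U16: 2, TypeId.I16: 2,
    TypeId.U32: 4, TypeId.I32: 4,
    TypeId.U64: 8, TypeId.I64: 8,
    TypeId.F32: 4, TypeId.F64: 8,
    TypeId.BOOL: 1,
    TypeId.TIMESTAMP: 8,
    TypeId.UUID: 16,
}


def parse_type_id(byte: int) -> TypeId:
    """Map a raw byte to a TypeId; reserved IDs raise ValueError (assumption)."""
    try:
        return TypeId(byte)
    except ValueError:
        raise ValueError(f"reserved or unknown type ID: 0x{byte:02X}") from None
```

A decoder could call `parse_type_id` on the first byte of every typed value and consult `FIXED_SIZES` to know how many bytes to read next for fixed-size types.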

<!-- TOC --><a name="32-array-compatible-types"></a>

### 3.2 Array-Compatible Types

The following types are valid as `Array` element types:

- All integer types: `u8`, `i8`, `u16`, `i16`, `u32`, `i32`, `u64`, `i64`
- All floating-point types: `f32`, `f64`
- Boolean type: `bool`

Sequences of complex types (`String`, `Option`, `List`, `Map`, `Array`,
`Timestamp`, `UUID`) MUST use the heterogeneous `List` type instead.

<!-- TOC --><a name="33-map-key-compatible-types"></a>

### 3.3 Map Key Compatible Types

All types are valid as `Map` key types except:

- `Option`
- `List`
- `Map`
- `Array`

<!-- TOC --><a name="4-binary-encoding"></a>

## 4. Binary Encoding

<!-- TOC --><a name="41-primitive-types"></a>

### 4.1 Primitive Types

<!-- TOC --><a name="411-integer-types"></a>

#### 4.1.1 Integer Types

```
[type_id: u8] [value: T]
```

Where `T` is the appropriately-sized integer in the file's endianness.

<!-- TOC --><a name="412-floating-point-types"></a>

#### 4.1.2 Floating-Point Types

```
[type_id: u8] [value: T]
```

Where `T` follows IEEE 754 encoding in the file's endianness.

<!-- TOC --><a name="413-boolean"></a>

#### 4.1.3 Boolean

```
[type_id: u8] [value: u8]
```

- `0x00`: false
- `0x01`: true
- All other values are invalid

<!-- TOC --><a name="42-string-type"></a>

### 4.2 String Type

```
[type_id: u8] [length: u32] [utf8_data: [u8; length]]
```

- `length`: Number of bytes in UTF-8 encoding
- `utf8_data`: Valid UTF-8 byte sequence
- Empty strings have length 0 and no data bytes

<!-- TOC --><a name="43-option-type"></a>

### 4.3 Option Type

```
[type_id: u8] [inner_type_id: u8] [discriminant: u8] [payload?]
```

Note that the inner type ID is preserved so that type information is
available even in case of a `None` value.

- `inner_type_id`: Type ID of the contained value
- `discriminant`:
  - `0x00`: None (no payload follows)
  - `0x01`: Some (payload follows)
- `payload`: Present only when discriminant is `0x01`

**Examples:**

- `Option<u32>::None`: `[0x0C] [0x04] [0x00]`
- `Option<u32>::Some(42)`: `[0x0C] [0x04] [0x01] [42, 0, 0, 0]` (little-endian)

<!-- TOC --><a name="44-list-type-heterogeneous"></a>

### 4.4 List Type (Heterogeneous)

```
[type_id: u8] [length: u32] [element_1] [element_2] ... [element_n]
```

- `length`: Number of elements
- Each element is a complete typed value (type_id + data)
- Elements may have different types

**Example:** List `[42u8, "hello", true]`

```
[0x0D]                                // List type
[3, 0, 0, 0]                          // 3 elements
[0x00] [42]                           // u8: 42
[0x0B] [5, 0, 0, 0] [h, e, l, l, o]   // String: "hello"
[0x0A] [1]                            // bool: true
```

<!-- TOC --><a name="45-map-type"></a>

### 4.5 Map Type

```
[type_id: u8] [length: u32] [pair_1] [pair_2] ... [pair_n]
```

Each key-value pair:

```
[key] [value]
```

- `length`: Number of key-value pairs
- Both `key` and `value` are complete typed values
- `key` MUST be a map key compatible type (see Section 3.3)
- Keys SHOULD be unique (behavior for duplicate keys is undefined)

**Example:** Map `{42u8: "answer", "pi": 3.14f32}`

```
[0x0E] [2, 0, 0, 0]                                         // Map with 2 pairs
[0x00] [42] [0x0B] [6, 0, 0, 0] [a, n, s, w, e, r]          // 42u8 -> "answer"
[0x0B] [2, 0, 0, 0] [p, i] [0x08] [0xC3, 0xF5, 0x48, 0x40]  // "pi" -> 3.14f32
```

<!-- TOC --><a name="46-array-type-homogeneous"></a>

### 4.6 Array Type (Homogeneous)

```
[type_id: u8] [length: u32] [element_type_id: u8] [element_1] ... [element_n]
```

- `length`: Number of elements
- `element_type_id`: Must be an array-compatible type (see Section 3.2)
- Elements are stored as raw values (no type_id prefix per element)

**Example:** `i32` array `[1, 2, 3]`

```
[0x0F]          // Array type
[3, 0, 0, 0]    // 3 elements
[0x05]          // element type: i32
[1, 0, 0, 0]    // 1
[2, 0, 0, 0]    // 2
[3, 0, 0, 0]    // 3
```

<!-- TOC --><a name="47-timestamp-type"></a>

### 4.7 Timestamp Type

```
[type_id: u8] [value: i64]
```

- `value`: Milliseconds since Unix epoch
- Negative values represent times before the epoch

<!-- TOC --><a name="48-uuid-type"></a>

### 4.8 UUID Type

```
[type_id: u8] [bytes: [u8; 16]]
```

- `bytes`: 16 bytes representing the UUID in **big-endian** byte order
- Follows [RFC 4122][rfc4122] standard binary representation
- Byte order is independent of file endianness flag

**Example:** UUID `550e8400-e29b-41d4-a716-446655440000`

```
[0x11]                                            // UUID type
[0x55, 0x0e, 0x84, 0x00, 0xe2, 0x9b, 0x41, 0xd4,  // UUID bytes
 0xa7, 0x16, 0x44, 0x66, 0x55, 0x44, 0x00, 0x00]  // (big-endian)
```

<!-- TOC --><a name="5-file-format"></a>

## 5. File Format

<!-- TOC --><a name="51-file-structure"></a>

### 5.1 File Structure

```
[header] [payload]
```

<!-- TOC --><a name="52-header-format"></a>

### 5.2 Header Format

```
[magic: [u8; 4]] [version: u8] [flags: u8] [compression: u8] [payload_length: u32]
```

<!-- TOC --><a name="521-magic-bytes"></a>

#### 5.2.1 Magic Bytes

Fixed 4-byte signature: `HTNO` (0x48, 0x54, 0x4e, 0x4f).

<!-- TOC --><a name="5211-file-extension-and-mime-type"></a>

##### 5.2.1.1 File Extension and MIME type

The recommended file extension is `.ht`. The recommended MIME type is
`application/x-hateno`.
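
Because the signature is fixed, content sniffing reduces to a 4-byte prefix comparison. A minimal Python sketch follows; the function names are hypothetical, and falling back to `application/octet-stream` for non-matching data is an assumption, not something this specification defines.

```python
MAGIC = b"HTNO"  # 0x48, 0x54, 0x4E, 0x4F (Section 5.2.1)
RECOMMENDED_EXTENSION = ".ht"
RECOMMENDED_MIME_TYPE = "application/x-hateno"


def has_hateno_magic(data: bytes) -> bool:
    """Return True if the buffer starts with the 4-byte HTNO signature."""
    return data[:4] == MAGIC


def sniff_mime_type(head: bytes) -> str:
    """Pick a MIME type from the leading bytes of a file.

    The generic fallback type is an assumption made for this sketch.
    """
    if has_hateno_magic(head):
        return RECOMMENDED_MIME_TYPE
    return "application/octet-stream"
```

Checking the magic bytes is more reliable than trusting the `.ht` extension alone, since the extension is only a recommendation.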

<!-- TOC --><a name="522-version"></a>

#### 5.2.2 Version

- `0x01`: 1.0 (current version)
- Future versions increment this value

<!-- TOC --><a name="523-flags"></a>

#### 5.2.3 Flags

8-bit flag field:

- Bit 0: Endianness (0 = little-endian, 1 = big-endian)
- Bits 1-7: Reserved (MUST be zero)

<!-- TOC --><a name="524-compression-method"></a>

#### 5.2.4 Compression Method

- `0x00`: No compression
- `0x01`: Gzip compression ([RFC 1952][rfc1952])
- `0x02`: Zlib compression ([RFC 1950][rfc1950])
- `0x03`: LZ4 compression
- `0x04-0xFF`: Reserved for future compression methods

<!-- TOC --><a name="525-payload-length"></a>

#### 5.2.5 Payload Length

- Length of payload in bytes (u32, in file's endianness)
- For compressed files: length of compressed data
- For uncompressed files: length of raw data

<!-- TOC --><a name="53-payload"></a>

### 5.3 Payload

The payload contains a single root typed value, typically a `Map`.

<!-- TOC --><a name="6-example"></a>

## 6. Example

Complete uncompressed little-endian file containing `{"test": 42i32}`:

```
[0x48, 0x54, 0x4e, 0x4f]            // Magic: "HTNO"
[0x01]                              // Version: 1.0
[0x00]                              // Flags: little-endian
[0x00]                              // Compression: none
[19, 0, 0, 0]                       // Payload length: 19 bytes

// Payload: Map with one entry
[0x0E]                              // Map type
[1, 0, 0, 0]                        // 1 key-value pair
[0x0B] [4, 0, 0, 0] [t, e, s, t]    // String key: "test"
[0x05] [42, 0, 0, 0]                // i32 value: 42
```

<!-- TOC --><a name="7-conformance-requirements"></a>

## 7. Conformance Requirements

- Implementations MUST reject files with invalid magic bytes
- Implementations MUST support at least version `0x01`
- Unknown compression methods SHOULD be rejected
- Invalid UTF-8 in strings MUST be rejected
- Boolean values other than 0x00/0x01 MUST be rejected
- Reserved flag bits MUST be zero in generated files

[rfc2119]: https://www.rfc-editor.org/rfc/rfc2119
[mutf8]: https://docs.oracle.com/en/java/javase/23/docs/api/java.base/java/io/DataInput.html#modified-utf-8
[rfc1950]: https://www.rfc-editor.org/rfc/rfc1950
[rfc1952]: https://www.rfc-editor.org/rfc/rfc1952
[rfc4122]: https://www.rfc-editor.org/rfc/rfc4122
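
The header-related requirements can be sketched as a small validating parser. This is an illustrative Python sketch under stated assumptions, not a reference implementation: `parse_header` and its returned dictionary are hypothetical, and rejecting unsupported versions and non-zero reserved flag bits on read is a stricter choice than the requirements strictly demand (they only constrain generated files for the flag bits).

```python
import struct

MAGIC = b"HTNO"
# 4 (magic) + 1 (version) + 1 (flags) + 1 (compression) + 4 (payload_length)
HEADER_SIZE = 11
KNOWN_COMPRESSION = {0x00, 0x01, 0x02, 0x03}  # none, gzip, zlib, LZ4


def parse_header(data: bytes) -> dict:
    """Parse and validate the 11-byte header described in Section 5.2."""
    if len(data) < HEADER_SIZE:
        raise ValueError("truncated header")

    # The first seven bytes are single-byte fields, so endianness is irrelevant.
    magic, version, flags, compression = struct.unpack_from("<4sBBB", data, 0)

    if magic != MAGIC:
        raise ValueError("invalid magic bytes")
    if version != 0x01:
        # Assumption: this sketch supports only version 1.0.
        raise ValueError(f"unsupported version: 0x{version:02X}")
    if flags & 0xFE:
        # Assumption: readers also reject non-zero reserved bits.
        raise ValueError("reserved flag bits are set")
    if compression not in KNOWN_COMPRESSION:
        raise ValueError(f"unknown compression method: 0x{compression:02X}")

    # Bit 0 of the flags selects the byte order of payload_length.
    big_endian = bool(flags & 0x01)
    byte_order = ">" if big_endian else "<"
    (payload_length,) = struct.unpack_from(byte_order + "I", data, 7)

    return {
        "version": version,
        "big_endian": big_endian,
        "compression": compression,
        "payload_length": payload_length,
    }
```

A caller would typically read the first 11 bytes of a file, pass them to `parse_header`, and then read exactly `payload_length` further bytes before decompressing and decoding the root value.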