% libxenctrl (libxc) Domain Image Format
% David Vrabel <<david.vrabel@citrix.com>>
  Andrew Cooper <<andrew.cooper3@citrix.com>>
  Wen Congyang <<wency@cn.fujitsu.com>>
  Yang Hongyang <<hongyang.yang@easystack.cn>>
% Revision 2

Introduction
============

Purpose
-------

The _domain save image_ is the context of a running domain used for
snapshots of a domain or for transferring domains between hosts during
migration.

There are a number of problems with the format of the domain save
image used in Xen 4.4 and earlier (the _legacy format_).

* Dependent on toolstack word size. A number of fields within the
  image are native types such as `unsigned long` which have different
  sizes between 32-bit and 64-bit toolstacks. This prevents domains
  from being migrated between hosts running 32-bit and 64-bit
  toolstacks.

* There is no header identifying the image.

* The image has no version information.

A new format that addresses the above is required.

ARM does not yet have a domain save image format specified and
the format described in this specification should be suitable.

Not Yet Included
----------------

The following features are not yet fully specified and will be
included in a future draft.

* Page data compression.

* ARM


Overview
========

The image format consists of two main sections:

* _Headers_
* _Records_

Headers
-------

There are two headers: the _image header_, and the _domain header_.
The image header describes the format of the image (version etc.).
The _domain header_ contains general information about the domain
(architecture, type etc.).

Records
-------

The main part of the format is a sequence of different _records_.
Each record type contains information about the domain context. At a
minimum there is an END record marking the end of the records section.


Fields
------

All the fields within the headers and records have a fixed width.

Fields are always aligned to their size.

Padding and reserved fields are set to zero on save and must be
ignored during restore.

Integer (numeric) fields in the image header are always in big-endian
byte order.

Integer fields in the domain header and in the records are in the
endianness described in the image header (which will typically be the
native ordering).

\clearpage

Headers
=======

Image Header
------------

The image header identifies an image as a Xen domain save image. It
includes the version of this specification that the image complies
with.

Tools supporting version _V_ of the specification shall always save
images using version _V_. Tools shall support restoring from version
_V_. If the previous Xen release produced version _V_ - 1 images,
tools shall support restoring from these. Tools may additionally
support restoring from earlier versions.

The marker field can be used to distinguish between legacy images and
those corresponding to this specification. Legacy images will have
one or more zero bits within the first 8 octets of the image.

Fields within the image header are always in _big-endian_ byte order,
regardless of the setting of the endianness bit.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | marker                                          |
    +-----------------------+-------------------------+
    | id                    | version                 |
    +-----------+-----------+-------------------------+
    | options   | (reserved)                          |
    +-----------+-------------------------------------+


--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
marker      0xFFFFFFFFFFFFFFFF.

id          0x58454E46 ("XENF" in ASCII).

version     0x00000003.
            The version of this specification.

options     bit 0: Endianness. 0 = little-endian, 1 = big-endian.

            bit 1-15: Reserved.
--------------------------------------------------------------------

The endianness shall be 0 (little-endian) for images generated on an
i386, x86_64, or arm host.

\clearpage

Domain Header
-------------

The domain header includes general properties of the domain.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-----------+-------------+
    | type                  | page_shift| (reserved)  |
    +-----------------------+-----------+-------------+
    | xen_major             | xen_minor               |
    +-----------------------+-------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
type        0x0000: Reserved.

            0x0001: x86 PV.

            0x0002: x86 HVM.

            0x0003 - 0xFFFFFFFF: Reserved.

page_shift  Size of a guest page as a power of two.

            i.e., page size = 2 ^page_shift^.

xen_major   The Xen major version when this image was saved.

xen_minor   The Xen minor version when this image was saved.
--------------------------------------------------------------------

The legacy stream conversion tool writes a `xen_major` version of 0, and sets
`xen_minor` to the version of itself.

\clearpage

Records
=======

A record has a record header, type specific data and a trailing
footer. If `body_length` is not a multiple of 8, the body is padded
with zeroes to align the end of the record on an 8 octet boundary.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | type                  | body_length             |
    +-----------+-----------+-------------------------+
    | body...                                         |
    ...
    |           | padding (0 to 7 octets)             |
    +-----------+-------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- -------------------------------------------------------
type        0x00000000: END

            0x00000001: PAGE_DATA

            0x00000002: X86_PV_INFO

            0x00000003: X86_PV_P2M_FRAMES

            0x00000004: X86_PV_VCPU_BASIC

            0x00000005: X86_PV_VCPU_EXTENDED

            0x00000006: X86_PV_VCPU_XSAVE

            0x00000007: SHARED_INFO

            0x00000008: X86_TSC_INFO

            0x00000009: HVM_CONTEXT

            0x0000000A: HVM_PARAMS

            0x0000000B: TOOLSTACK (deprecated)

            0x0000000C: X86_PV_VCPU_MSRS

            0x0000000D: VERIFY

            0x0000000E: CHECKPOINT

            0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)

            0x00000010 - 0x7FFFFFFF: Reserved for future _mandatory_
            records.

            0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
            records.

body_length Length in octets of the record body.

body        Content of the record.

padding     0 to 7 octets of zeros to pad the whole record to a multiple
            of 8 octets.
--------------------------------------------------------------------

Records may be _mandatory_ or _optional_. Optional records have bit
31 set in their type. Restoring an image that has an unrecognised or
unsupported mandatory record must fail. The contents of optional
records may be ignored during a restore.

The following sub-sections specify the record body format for each of
the record types.

\clearpage

END
----

An end record marks the end of the image, and shall be the final record
in the stream.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The end record contains no fields; its body_length is 0.
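
The header and record layouts specified so far can be read with a few fixed-size unpacking steps. The following is a minimal sketch under stated assumptions, not the libxc implementation: the function names (`read_exact`, `read_image_header`, `read_domain_header`, `read_record`, `is_optional`) are hypothetical, and error handling is reduced to raising exceptions.

```python
import io
import struct

MARKER = 0xFFFFFFFFFFFFFFFF
IMAGE_ID = 0x58454E46        # "XENF" in ASCII
REC_TYPE_END = 0x00000000
REC_OPTIONAL = 0x80000000    # bit 31 marks an optional record


def read_exact(stream, n):
    """Read exactly n octets or fail (hypothetical helper)."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated stream")
    return data


def read_image_header(stream):
    """Parse the 24-octet image header, which is always big-endian."""
    marker, ident, version, options = struct.unpack(
        ">QIIH6x", read_exact(stream, 24))
    if marker != MARKER or ident != IMAGE_ID:
        raise ValueError("not a Xen domain save image")
    # options bit 0: 0 = little-endian, 1 = big-endian.
    endian = ">" if options & 1 else "<"
    return version, endian


def read_domain_header(stream, endian):
    """Parse the 16-octet domain header in the stream's endianness."""
    return struct.unpack(endian + "IH2xII", read_exact(stream, 16))


def read_record(stream, endian):
    """Read one record and return (type, body) without the padding."""
    rec_type, body_length = struct.unpack(
        endian + "II", read_exact(stream, 8))
    padded_length = (body_length + 7) & ~7   # round up to 8 octets
    body = read_exact(stream, padded_length)[:body_length]
    return rec_type, body


def is_optional(rec_type):
    """Optional records have bit 31 set in their type."""
    return bool(rec_type & REC_OPTIONAL)
```

A restore loop built on this sketch would call `read_record` until it sees type 0x00000000 (END), and fail on any unrecognised type for which `is_optional` is false.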

\clearpage

PAGE_DATA
---------

The bulk of an image consists of many PAGE_DATA records containing the
memory contents.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | count (C)             | (reserved)              |
    +-----------------------+-------------------------+
    | pfn[0]                                          |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | pfn[C-1]                                        |
    +-------------------------------------------------+
    | page_data[0]...                                 |
    ...
    +-------------------------------------------------+
    | page_data[N-1]...                               |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- --------------------------------------------------------
count       Number of pages described in this record.

pfn         An array of count PFNs and their types.

            Bit 63-60: XEN_DOMCTL_PFINFO_* type (from
            `public/domctl.h` but shifted by 32 bits)

            Bit 59-52: Reserved.

            Bit 51-0: PFN.

page_data   page_size octets of uncompressed page contents for each
            page set as present in the pfn array.
--------------------------------------------------------------------

Note: Count is strictly > 0. N is less than or equal to C, and it is
possible for there to be no page_data in the record if all pfns are of
invalid types.

--------------------------------------------------------------------
PFINFO type   Value     Description
------------- --------- ------------------------------------------
NOTAB         0x0       Normal page.

L1TAB         0x1       L1 page table page.

L2TAB         0x2       L2 page table page.

L3TAB         0x3       L3 page table page.

L4TAB         0x4       L4 page table page.

              0x5-0x8   Reserved.

L1TAB_PIN     0x9       L1 page table page (pinned).

L2TAB_PIN     0xA       L2 page table page (pinned).

L3TAB_PIN     0xB       L3 page table page (pinned).

L4TAB_PIN     0xC       L4 page table page (pinned).

BROKEN        0xD       Broken page.

XALLOC        0xE       Allocate only.

XTAB          0xF       Invalid page.
--------------------------------------------------------------------

Table: XEN_DOMCTL_PFINFO_* Page Types.

PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
corresponding `page_data`.

The saver uses the `XTAB` type for PFNs that become invalid in the
guest's P2M table during a live migration[^2].

Restoring an image with unrecognised page types shall fail.

[^2]: In the legacy format, this is the list of unmapped PFNs in the
tail.

\clearpage

X86_PV_INFO
-----------

     0     1     2     3     4     5     6     7 octet
    +-----+-----+-----------+-------------------------+
    | w   | ptl | (reserved)                          |
    +-----+-----+-----------+-------------------------+

--------------------------------------------------------------------
Field            Description
---------------- ---------------------------------------------------
guest_width (w)  Guest width in octets (either 4 or 8).

pt_levels (ptl)  Number of page table levels (either 3 or 4).
--------------------------------------------------------------------

\clearpage

X86_PV_P2M_FRAMES
-----------------

     0     1     2     3     4     5     6     7 octet
    +-----+-----+-----+-----+-------------------------+
    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
    +-----+-----+-----+-----+-------------------------+
    | p2m_pfn[p2m frame containing pfn S]             |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | p2m_pfn[p2m frame containing pfn E]             |
    +-------------------------------------------------+

--------------------------------------------------------------------
Field         Description
------------- ---------------------------------------------------
p2m_start_pfn First pfn index in the p2m_pfn array.

p2m_end_pfn   Last pfn index in the p2m_pfn array.

p2m_pfn       Array of PFNs containing the guest's P2M table, for
              the PFN frames containing the PFN range S to E
              (inclusive).

--------------------------------------------------------------------

\clearpage

X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
----------------------------------------

The format of these records is identical. They are all binary blobs
of data which are accessed using specific pairs of domctl hypercalls.

     0     1     2     3     4     5     6     7 octet
    +-----------------------+-------------------------+
    | vcpu_id               | (reserved)              |
    +-----------------------+-------------------------+
    | context...                                      |
    ...
    +-------------------------------------------------+

---------------------------------------------------------------------
Field       Description
----------- ----------------------------------------------------
vcpu_id     The VCPU ID.

context     Binary data for this VCPU.
---------------------------------------------------------------------

---------------------------------------------------------------------
Record type             Accessor hypercalls
----------------------- ----------------------------------------
X86_PV_VCPU_BASIC       XEN_DOMCTL_{get,set}vcpucontext

X86_PV_VCPU_EXTENDED    XEN_DOMCTL_{get,set}\_ext_vcpucontext

X86_PV_VCPU_XSAVE       XEN_DOMCTL_{get,set}vcpuextstate

X86_PV_VCPU_MSRS        XEN_DOMCTL_{get,set}\_vcpu_msrs
---------------------------------------------------------------------

\clearpage

SHARED_INFO
-----------

The content of the Shared Info page.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | shared_info                                     |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- ---------------------------------------------------
shared_info Contents of the shared info page. This record
            should be exactly 1 page long.
--------------------------------------------------------------------

\clearpage

X86_TSC_INFO
------------

Domain TSC information, as accessed by the
XEN_DOMCTL_{get,set}tscinfo hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | mode                   | khz                    |
    +------------------------+------------------------+
    | nsec                                            |
    +------------------------+------------------------+
    | incarnation            | (reserved)             |
    +------------------------+------------------------+

--------------------------------------------------------------------
Field       Description
----------- ---------------------------------------------------
mode        TSC mode, TSC_MODE_* constant.

khz         TSC frequency, in kHz.

nsec        Elapsed time, in nanoseconds.

incarnation Incarnation.
--------------------------------------------------------------------

\clearpage

HVM_CONTEXT
-----------

HVM Domain context, as accessed by the
XEN_DOMCTL_{get,set}hvmcontext hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | hvm_ctx                                         |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- ---------------------------------------------------
hvm_ctx     The HVM Context blob from Xen.
--------------------------------------------------------------------

\clearpage

HVM_PARAMS
----------

HVM Domain parameters, as accessed by the
HVMOP_{get,set}\_param hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | count (C)              | (reserved)             |
    +------------------------+------------------------+
    | param[0].index                                  |
    +-------------------------------------------------+
    | param[0].value                                  |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | param[C-1].index                                |
    +-------------------------------------------------+
    | param[C-1].value                                |
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- ---------------------------------------------------
count       The number of parameters contained in this record.
            Each parameter in the record contains an index and
            value.

param index Parameter index.

param value Parameter value.
--------------------------------------------------------------------

\clearpage

TOOLSTACK (deprecated)
----------------------

> *This record was only present for transitionary purposes during
> development. It should not be used.*

An opaque blob provided by and supplied to the higher layers of the
toolstack (e.g., libxl) during save and restore.

     0     1     2     3     4     5     6     7 octet
    +------------------------+------------------------+
    | data                                            |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field       Description
----------- ---------------------------------------------------
data        Blob of toolstack-specific data.
--------------------------------------------------------------------

\clearpage

VERIFY
------

A verify record indicates that, while all memory has now been sent, the sender
shall send further memory records for debugging purposes.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The verify record contains no fields; its body_length is 0.

\clearpage

CHECKPOINT
----------

A checkpoint record indicates that all the preceding records in the stream
represent a consistent view of VM state.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The checkpoint record contains no fields; its body_length is 0.

If the stream is embedded in a higher level toolstack stream, the
CHECKPOINT record marks the end of the libxc portion of the stream
and the stream is handed back to the higher level for further
processing.

The higher level stream may then hand the stream back to libxc to
process another set of records for the next consistent VM state
snapshot. This next set of records may be terminated by another
CHECKPOINT record or an END record.

\clearpage

CHECKPOINT_DIRTY_PFN_LIST
-------------------------

A checkpoint dirty pfn list record is used to convey information about
dirty memory in the VM. It is an unordered list of PFNs. It is currently
only applicable in the backchannel of a checkpointed stream. It is only
used by COLO; for more detail, refer to README.colo.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | pfn[0]                                          |
    +-------------------------------------------------+
    ...
    +-------------------------------------------------+
    | pfn[C-1]                                        |
    +-------------------------------------------------+

The count of pfns is: record->length/sizeof(uint64_t).

\clearpage

STATIC_DATA_END
---------------

A static data end record marks the end of the static state, i.e. state which
is invariant of guest execution.


     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+

The static data end record contains no fields; its body_length is 0.

\clearpage

X86_CPUID_POLICY
----------------

CPUID policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | CPUID_policy                                    |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field        Description
------------ ---------------------------------------------------
CPUID_policy Array of xen_cpuid_leaf_t[]'s
--------------------------------------------------------------------

\clearpage

X86_MSR_POLICY
--------------

MSR policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
hypercall sub-ops.

     0     1     2     3     4     5     6     7 octet
    +-------------------------------------------------+
    | MSR_policy                                      |
    ...
    +-------------------------------------------------+

--------------------------------------------------------------------
Field      Description
---------- ---------------------------------------------------
MSR_policy Array of xen_msr_entry_t[]'s
--------------------------------------------------------------------

\clearpage


Layout
======

The set of valid records depends on the guest architecture and type. No
assumptions should be made about the ordering or interleaving of
independent records. Record dependencies are noted below.

Some records are used for signalling, and explicitly have zero length. All
other records contain data relevant to the migration. Data records with no
content should be elided on the source side, as their presence serves no
purpose, but results in extra work for the restore side.
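
The elision rule above can be sketched on the sending side as follows. This is an illustration only: `write_record` and `SIGNALLING_RECORDS` are hypothetical names, and the set holds only the zero-length signalling records whose type values appear in the record type table (END, VERIFY, CHECKPOINT).

```python
import io
import struct

# Zero-length signalling records with type values listed in the record
# type table: END, VERIFY, CHECKPOINT. Any further signalling record
# would have to be added here (assumption of this sketch).
SIGNALLING_RECORDS = {0x00000000, 0x0000000D, 0x0000000E}


def write_record(stream, rec_type, body=b"", endian="<"):
    """Write one record; return False if an empty data record was elided."""
    if not body and rec_type not in SIGNALLING_RECORDS:
        # A data record with no content serves no purpose, so skip it.
        return False
    stream.write(struct.pack(endian + "II", rec_type, len(body)))
    stream.write(body)
    # Pad with zeros so the record ends on an 8 octet boundary.
    stream.write(b"\0" * (-len(body) % 8))
    return True
```

Keeping the check in the one function that emits records means that callers producing variable-sized records do not each need their own empty-body test.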

x86 PV Guest
------------

A typical save record for an x86 PV guest image would look like:

* Image header
* Domain header
* Static data records:
    * X86_PV_INFO record
    * X86_{CPUID,MSR}_POLICY
    * STATIC_DATA_END
* X86_PV_P2M_FRAMES record
* Many PAGE_DATA records
* X86_TSC_INFO
* SHARED_INFO record
* VCPU context records for each online VCPU
    * X86_PV_VCPU_BASIC record
    * X86_PV_VCPU_EXTENDED record
    * X86_PV_VCPU_XSAVE record
    * X86_PV_VCPU_MSRS record
* END record

There are some strict ordering requirements. The following records must
be present in the following order as each of them depends on information
present in the preceding ones.

* X86_PV_INFO record
* X86_PV_P2M_FRAMES record
* PAGE_DATA records
* VCPU records

x86 HVM Guest
-------------

A typical save record for an x86 HVM guest image would look like:

* Image header
* Domain header
* Static data records:
    * X86_{CPUID,MSR}_POLICY
    * STATIC_DATA_END
* Many PAGE_DATA records
* X86_TSC_INFO
* HVM_PARAMS
* HVM_CONTEXT

HVM_PARAMS must precede HVM_CONTEXT, as certain parameters can affect
the validity of architectural state in the context.

Compatibility with older versions
=================================

v3 compat with v2
-----------------

A v3 stream is compatible with a v2 stream, but mandates the presence of a
STATIC_DATA_END record ahead of any memory/register content. This is to ease
the introduction of new static configuration records over time.

A v3-compatible receiver interpreting a v2 stream should infer the position of
STATIC_DATA_END based on finding the first X86_PV_P2M_FRAMES record (for PV
guests), or PAGE_DATA record (for HVM guests) and behave as if STATIC_DATA_END
had been sent.
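
The inference described above can be sketched as a filter over the sequence of record types. This is a sketch under stated assumptions: records are represented by their names rather than numeric type values, and both the function name `with_implied_static_data_end` and the `"pv"`/`"hvm"` guest type strings are hypothetical.

```python
def with_implied_static_data_end(record_names, guest_type):
    """Yield record names, inserting an implied STATIC_DATA_END for v2.

    guest_type is "pv" or "hvm" (hypothetical labels for the domain
    header's x86 PV and x86 HVM types).
    """
    # The first record of this kind marks the end of static data.
    trigger = "X86_PV_P2M_FRAMES" if guest_type == "pv" else "PAGE_DATA"
    static_data_ended = False
    for name in record_names:
        if name == "STATIC_DATA_END":
            # v3 stream: the sender marked the boundary explicitly.
            static_data_ended = True
        elif name == trigger and not static_data_ended:
            # v2 stream: behave as if STATIC_DATA_END had been sent.
            static_data_ended = True
            yield "STATIC_DATA_END"
        yield name
```

With this shape, the rest of the restore path can treat v2 and v3 streams the same way, because it always sees a STATIC_DATA_END before any memory or register content.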

Legacy Images (x86 only)
------------------------

Restoring legacy images from older tools shall be handled by
translating the legacy format image into this new format.

It shall not be possible to save in the legacy format.

There are two different legacy images depending on whether they were
generated by a 32-bit or a 64-bit toolstack. These shall be
distinguished by inspecting octets 4-7 in the image. If these are
zero then it is a 64-bit image.

Toolstack  Field                            Value
---------  -----                            -----
64-bit     Bit 31-63 of the p2m_size field  0 (since p2m_size < 2^32^)
32-bit     extended-info chunk ID (PV)      0xFFFFFFFF
32-bit     Chunk type (HVM)                 < 0
32-bit     Page count (HVM)                 > 0

Table: Possible values for octet 4-7 in legacy images

This assumes the presence of the extended-info chunk which was
introduced in Xen 3.0.


Future Extensions
=================

All changes to this specification should bump the revision number in
the title block.

All changes to the image or domain headers require the image version
to be increased.

The format may be extended by adding additional record types.

Extending an existing record type must be done by adding a new record
type. This allows old images with the old record to still be
restored.

The image header may only be extended by _appending_ additional
fields. In particular, the `marker`, `id` and `version` fields must
never change size or location.


Errata
======

1. For compatibility with older code, the receiving side of a stream should
   tolerate and ignore variable sized records with zero content. Xen releases
   between 4.6 and 4.8 could end up generating valid HVM_PARAMS or
   X86_PV_VCPU_{EXTENDED,XSAVE,MSRS} records with zero-length content.
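
On the receiving side, the tolerance described in erratum 1 amounts to a small check before a record is dispatched. The following is a minimal sketch, assuming the check is limited to the record types the erratum names; `MAY_BE_EMPTY` and `should_ignore` are hypothetical names.

```python
# Variable-sized record types named in erratum 1, using the values from
# the record type table. Limiting the check to these is an assumption.
MAY_BE_EMPTY = {
    0x00000005: "X86_PV_VCPU_EXTENDED",
    0x00000006: "X86_PV_VCPU_XSAVE",
    0x0000000A: "HVM_PARAMS",
    0x0000000C: "X86_PV_VCPU_MSRS",
}


def should_ignore(rec_type, body):
    """Return True for a zero-content record that should be skipped."""
    return rec_type in MAY_BE_EMPTY and len(body) == 0
```

A receiver would call `should_ignore` after reading each record and continue with the next record when it returns true, rather than treating the empty body as malformed.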