1% libxenctrl (libxc) Domain Image Format
2% David Vrabel <<david.vrabel@citrix.com>>
3  Andrew Cooper <<andrew.cooper3@citrix.com>>
4  Wen Congyang <<wency@cn.fujitsu.com>>
5  Yang Hongyang <<hongyang.yang@easystack.cn>>
6% Revision 2
7
8Introduction
9============
10
11Purpose
12-------
13
14The _domain save image_ is the context of a running domain used for
15snapshots of a domain or for transferring domains between hosts during
16migration.
17
18There are a number of problems with the format of the domain save
19image used in Xen 4.4 and earlier (the _legacy format_).
20
* Dependent on toolstack word size.  A number of fields within the
  image are native types such as `unsigned long` which have different
  sizes between 32-bit and 64-bit toolstacks.  This prevents domains
  from being migrated between hosts running 32-bit and 64-bit
  toolstacks.
26
27* There is no header identifying the image.
28
29* The image has no version information.
30
31A new format that addresses the above is required.
32
ARM does not yet have a domain save image format specified and
the format described in this specification should be suitable.
35
36Not Yet Included
37----------------
38
39The following features are not yet fully specified and will be
40included in a future draft.
41
42* Page data compression.
43
44* ARM
45
46
47Overview
48========
49
50The image format consists of two main sections:
51
52* _Headers_
53* _Records_
54
55Headers
56-------
57
58There are two headers: the _image header_, and the _domain header_.
59The image header describes the format of the image (version etc.).
60The _domain header_ contains general information about the domain
61(architecture, type etc.).
62
63Records
64-------
65
The main part of the format is a sequence of different _records_.
Each record type contains information about the domain context.  At a
minimum there is an END record marking the end of the records section.
69
70
71Fields
72------
73
74All the fields within the headers and records have a fixed width.
75
76Fields are always aligned to their size.
77
78Padding and reserved fields are set to zero on save and must be
79ignored during restore.
80
81Integer (numeric) fields in the image header are always in big-endian
82byte order.
83
84Integer fields in the domain header and in the records are in the
85endianness described in the image header (which will typically be the
86native ordering).
87
88\clearpage
89
90Headers
91=======
92
93Image Header
94------------
95
96The image header identifies an image as a Xen domain save image.  It
97includes the version of this specification that the image complies
98with.
99
Tools supporting version _V_ of the specification shall always save
images using version _V_.  Tools shall support restoring from version
_V_.  If the previous Xen release produced version _V_ - 1 images,
tools shall support restoring from these.  Tools may additionally
support restoring from earlier versions.
105
The marker field can be used to distinguish between legacy images and
those corresponding to this specification.  Legacy images will have
one or more zero bits within the first 8 octets of the image.
109
110Fields within the image header are always in _big-endian_ byte order,
111regardless of the setting of the endianness bit.
112
113     0     1     2     3     4     5     6     7 octet
114    +-------------------------------------------------+
115    | marker                                          |
116    +-----------------------+-------------------------+
117    | id                    | version                 |
118    +-----------+-----------+-------------------------+
119    | options   | (reserved)                          |
120    +-----------+-------------------------------------+
121
122
123--------------------------------------------------------------------
124Field       Description
125----------- --------------------------------------------------------
126marker      0xFFFFFFFFFFFFFFFF.
127
128id          0x58454E46 ("XENF" in ASCII).
129
130version     0x00000003.  The version of this specification.
131
132options     bit 0: Endianness.  0 = little-endian, 1 = big-endian.
133
134            bit 1-15: Reserved.
135--------------------------------------------------------------------
136
137The endianness shall be 0 (little-endian) for images generated on an
138i386, x86_64, or arm host.
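
As an illustration of the layout above, the following is a minimal
Python sketch of an image header check.  The names `IMAGE_HEADER` and
`parse_image_header` are hypothetical and not part of libxc; only the
field sizes and values come from the table.

```python
import struct

# Layout from the diagram: marker (8 octets), id (4), version (4),
# options (2), reserved (6).  The image header is always big-endian.
IMAGE_HEADER = struct.Struct(">QIIH6x")
MARKER = 0xFFFFFFFFFFFFFFFF
XENF_ID = 0x58454E46  # "XENF" in ASCII


def parse_image_header(buf):
    """Return (version, byte_order) for the rest of the stream.

    byte_order is a struct prefix: '<' for little-endian, '>' for big-endian.
    """
    marker, ident, version, options = IMAGE_HEADER.unpack_from(buf, 0)
    if marker != MARKER:
        raise ValueError("marker mismatch (possibly a legacy image)")
    if ident != XENF_ID:
        raise ValueError("id field is not 'XENF'")
    # Bit 0 of options selects the endianness of everything after this header.
    byte_order = ">" if options & 1 else "<"
    return version, byte_order


example = IMAGE_HEADER.pack(MARKER, XENF_ID, 3, 0)
print(parse_image_header(example))
```

The returned byte order is what a reader would then apply to the
domain header and to every record.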
139
140\clearpage
141
142Domain Header
143-------------
144
145The domain header includes general properties of the domain.
146
     0     1     2     3     4     5     6     7 octet
    +-----------------------+-----------+-------------+
    | type                  | page_shift| (reserved)  |
    +-----------------------+-----------+-------------+
    | xen_major             | xen_minor               |
    +-----------------------+-------------------------+
153
154--------------------------------------------------------------------
155Field       Description
156----------- --------------------------------------------------------
157type        0x0000: Reserved.
158
159            0x0001: x86 PV.
160
161            0x0002: x86 HVM.
162
163            0x0003 - 0xFFFFFFFF: Reserved.
164
165page_shift  Size of a guest page as a power of two.
166
167            i.e., page size = 2 ^page_shift^.
168
169xen_major   The Xen major version when this image was saved.
170
171xen_minor   The Xen minor version when this image was saved.
172--------------------------------------------------------------------
173
174The legacy stream conversion tool writes a `xen_major` version of 0, and sets
175`xen_minor` to the version of itself.
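
A reader for the domain header might look like the sketch below.  The
function name and the returned dictionary are illustrative choices;
`byte_order` is assumed to have been derived from the endianness bit of
the image header.

```python
import struct

# Domain types listed in the table; every other value is reserved.
DOMAIN_TYPES = {0x0001: "x86 PV", 0x0002: "x86 HVM"}


def parse_domain_header(buf, byte_order="<"):
    """Decode the 16-octet domain header.

    byte_order is '<' or '>' as indicated by the image header.
    """
    # type (4), page_shift (2), reserved (2), xen_major (4), xen_minor (4)
    dom_type, page_shift, _reserved, xen_major, xen_minor = struct.unpack_from(
        byte_order + "IHHII", buf, 0)
    if dom_type not in DOMAIN_TYPES:
        raise ValueError("reserved domain type 0x%x" % dom_type)
    return {
        "type": DOMAIN_TYPES[dom_type],
        "page_size": 1 << page_shift,  # page size = 2 ^ page_shift
        "xen_version": (xen_major, xen_minor),
    }


header = struct.pack("<IHHII", 0x0002, 12, 0, 4, 17)
print(parse_domain_header(header))
```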
176
177\clearpage
178
179Records
180=======
181
A record has a record header, type-specific data and trailing
padding.  If `body_length` is not a multiple of 8, the body is padded
with zeroes to align the end of the record on an 8 octet boundary.
185
186     0     1     2     3     4     5     6     7 octet
187    +-----------------------+-------------------------+
188    | type                  | body_length             |
189    +-----------+-----------+-------------------------+
190    | body...                                         |
191    ...
192    |           | padding (0 to 7 octets)             |
193    +-----------+-------------------------------------+
194
195--------------------------------------------------------------------
196Field        Description
197-----------  -------------------------------------------------------
198type         0x00000000: END
199
200             0x00000001: PAGE_DATA
201
202             0x00000002: X86_PV_INFO
203
204             0x00000003: X86_PV_P2M_FRAMES
205
206             0x00000004: X86_PV_VCPU_BASIC
207
208             0x00000005: X86_PV_VCPU_EXTENDED
209
210             0x00000006: X86_PV_VCPU_XSAVE
211
212             0x00000007: SHARED_INFO
213
214             0x00000008: X86_TSC_INFO
215
216             0x00000009: HVM_CONTEXT
217
218             0x0000000A: HVM_PARAMS
219
220             0x0000000B: TOOLSTACK (deprecated)
221
222             0x0000000C: X86_PV_VCPU_MSRS
223
224             0x0000000D: VERIFY
225
226             0x0000000E: CHECKPOINT
227
228             0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary)
229
             0x00000010: STATIC_DATA_END

             0x00000011: X86_CPUID_POLICY

             0x00000012: X86_MSR_POLICY

             0x00000013 - 0x7FFFFFFF: Reserved for future _mandatory_
             records.
232
233             0x80000000 - 0xFFFFFFFF: Reserved for future _optional_
234             records.
235
236body_length  Length in octets of the record body.
237
238body         Content of the record.
239
240padding      0 to 7 octets of zeros to pad the whole record to a multiple
241             of 8 octets.
242--------------------------------------------------------------------
243
Records may be _mandatory_ or _optional_.  Optional records have bit
31 set in their type.  Restoring an image that has an unrecognised or
unsupported mandatory record must fail.  The contents of optional
records may be ignored during a restore.
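
The framing rules above (8-octet header, body, zero padding, bit 31 for
optional records) can be sketched as a small generator.  `read_records`
and `is_optional` are hypothetical helper names, and the stream is
assumed to be positioned just after the domain header.

```python
import io
import struct

REC_TYPE_END = 0x00000000
OPTIONAL_BIT = 0x80000000  # bit 31 set: record is optional


def is_optional(rec_type):
    return bool(rec_type & OPTIONAL_BIT)


def read_records(stream, byte_order="<"):
    """Yield (type, body) pairs until the END record has been returned."""
    header = struct.Struct(byte_order + "II")  # type, body_length
    while True:
        raw = stream.read(header.size)
        if len(raw) != header.size:
            raise EOFError("truncated record header")
        rec_type, body_length = header.unpack(raw)
        # The header is 8 octets, so rounding the body up to a multiple
        # of 8 also aligns the end of the whole record.
        padded = (body_length + 7) & ~7
        body = stream.read(padded)
        if len(body) != padded:
            raise EOFError("truncated record body")
        yield rec_type, body[:body_length]
        if rec_type == REC_TYPE_END:
            return


data = (struct.pack("<II", 0x80000001, 3) + b"abc" + bytes(5)
        + struct.pack("<II", REC_TYPE_END, 0))
for rec_type, body in read_records(io.BytesIO(data)):
    print(hex(rec_type), is_optional(rec_type), body)
```

A restorer built on this would fail on any unrecognised type for which
`is_optional` is false, and could skip the body otherwise.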
248
249The following sub-sections specify the record body format for each of
250the record types.
251
252\clearpage
253
254END
255----
256
257An end record marks the end of the image, and shall be the final record
258in the stream.
259
260     0     1     2     3     4     5     6     7 octet
261    +-------------------------------------------------+
262
263The end record contains no fields; its body_length is 0.
264
265\clearpage
266
267PAGE_DATA
268---------
269
270The bulk of an image consists of many PAGE_DATA records containing the
271memory contents.
272
273     0     1     2     3     4     5     6     7 octet
274    +-----------------------+-------------------------+
275    | count (C)             | (reserved)              |
276    +-----------------------+-------------------------+
277    | pfn[0]                                          |
278    +-------------------------------------------------+
279    ...
280    +-------------------------------------------------+
281    | pfn[C-1]                                        |
282    +-------------------------------------------------+
283    | page_data[0]...                                 |
284    ...
285    +-------------------------------------------------+
286    | page_data[N-1]...                               |
287    ...
288    +-------------------------------------------------+
289
290--------------------------------------------------------------------
291Field       Description
292----------- --------------------------------------------------------
293count       Number of pages described in this record.
294
295pfn         An array of count PFNs and their types.
296
297            Bit 63-60: XEN_DOMCTL_PFINFO_* type (from
298            `public/domctl.h` but shifted by 32 bits)
299
300            Bit 59-52: Reserved.
301
302            Bit 51-0: PFN.
303
304page_data   page_size octets of uncompressed page contents for each
305            page set as present in the pfn array.
306--------------------------------------------------------------------
307
Note: count is strictly greater than 0.  N is less than or equal to C, and it
is possible for there to be no page_data in the record if all pfns are of
types that carry no page data.
310
311--------------------------------------------------------------------
312PFINFO type    Value      Description
313-------------  ---------  ------------------------------------------
314NOTAB          0x0        Normal page.
315
316L1TAB          0x1        L1 page table page.
317
318L2TAB          0x2        L2 page table page.
319
320L3TAB          0x3        L3 page table page.
321
322L4TAB          0x4        L4 page table page.
323
324               0x5-0x8    Reserved.
325
326L1TAB_PIN      0x9        L1 page table page (pinned).
327
328L2TAB_PIN      0xA        L2 page table page (pinned).
329
330L3TAB_PIN      0xB        L3 page table page (pinned).
331
332L4TAB_PIN      0xC        L4 page table page (pinned).
333
334BROKEN         0xD        Broken page.
335
336XALLOC         0xE        Allocate only.
337
338XTAB           0xF        Invalid page.
339--------------------------------------------------------------------
340
341Table: XEN_DOMCTL_PFINFO_* Page Types.
342
343PFNs with type `BROKEN`, `XALLOC`, or `XTAB` do not have any
344corresponding `page_data`.
345
346The saver uses the `XTAB` type for PFNs that become invalid in the
347guest's P2M table during a live migration[^2].
348
349Restoring an image with unrecognised page types shall fail.
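
The sketch below walks a PAGE_DATA body, splitting each `pfn` entry
into its type (bits 63-60) and PFN (bits 51-0) and consuming page
contents only for types that carry data.  The function name, the tuple
layout of the result and the default 4 KiB page size are assumptions
made for the example.

```python
import struct

PFN_MASK = (1 << 52) - 1            # bits 51-0
NO_DATA_TYPES = {0xD, 0xE, 0xF}     # BROKEN, XALLOC, XTAB
RESERVED_TYPES = {0x5, 0x6, 0x7, 0x8}


def parse_page_data(body, page_size=4096, byte_order="<"):
    """Return a list of (pfn, type, data) tuples; data is None if absent."""
    count, _reserved = struct.unpack_from(byte_order + "II", body, 0)
    if count == 0:
        raise ValueError("count must be greater than 0")
    entries = struct.unpack_from("%s%dQ" % (byte_order, count), body, 8)
    offset = 8 + 8 * count          # page contents follow the pfn array
    pages = []
    for entry in entries:
        ptype = entry >> 60         # bits 63-60
        pfn = entry & PFN_MASK
        if ptype in RESERVED_TYPES:
            raise ValueError("unrecognised page type 0x%x" % ptype)
        data = None
        if ptype not in NO_DATA_TYPES:
            data = body[offset:offset + page_size]
            if len(data) != page_size:
                raise ValueError("truncated page data")
            offset += page_size
        pages.append((pfn, ptype, data))
    return pages


body = struct.pack("<II2Q", 2, 0, 0x10, (0xF << 60) | 0x11) + b"\xaa" * 4096
for pfn, ptype, data in parse_page_data(body):
    print(hex(pfn), hex(ptype), None if data is None else len(data))
```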
350
351[^2]: In the legacy format, this is the list of unmapped PFNs in the
352tail.
353
354\clearpage
355
356X86_PV_INFO
357-----------
358
359     0     1     2     3     4     5     6     7 octet
360    +-----+-----+-----------+-------------------------+
361    | w   | ptl | (reserved)                          |
362    +-----+-----+-----------+-------------------------+
363
364--------------------------------------------------------------------
365Field            Description
366-----------      ---------------------------------------------------
367guest_width (w)  Guest width in octets (either 4 or 8).
368
369pt_levels (ptl)  Number of page table levels (either 3 or 4).
370--------------------------------------------------------------------
371
372\clearpage
373
374X86_PV_P2M_FRAMES
375-----------------
376
377     0     1     2     3     4     5     6     7 octet
378    +-----+-----+-----+-----+-------------------------+
379    | p2m_start_pfn (S)     | p2m_end_pfn (E)         |
380    +-----+-----+-----+-----+-------------------------+
381    | p2m_pfn[p2m frame containing pfn S]             |
382    +-------------------------------------------------+
383    ...
384    +-------------------------------------------------+
385    | p2m_pfn[p2m frame containing pfn E]             |
386    +-------------------------------------------------+
387
388--------------------------------------------------------------------
389Field            Description
390-------------    ---------------------------------------------------
391p2m_start_pfn    First pfn index in the p2m_pfn array.
392
393p2m_end_pfn      Last pfn index in the p2m_pfn array.
394
395p2m_pfn          Array of PFNs containing the guest's P2M table, for
396                 the PFN frames containing the PFN range S to E
397                 (inclusive).
398
399--------------------------------------------------------------------
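
The number of `p2m_pfn` entries is not stored explicitly.  The sketch
below derives it under the assumption that each P2M frame holds
page_size / guest_width entries, with guest_width taken from the
X86_PV_INFO record; the function name and the length check are
illustrative.

```python
import struct


def parse_p2m_frames(body, guest_width, page_shift=12, byte_order="<"):
    """Return (start_pfn, end_pfn, [p2m frame PFNs])."""
    start_pfn, end_pfn = struct.unpack_from(byte_order + "II", body, 0)
    if end_pfn < start_pfn:
        raise ValueError("p2m_end_pfn precedes p2m_start_pfn")
    # Assumption: one P2M frame holds page_size / guest_width entries.
    per_frame = (1 << page_shift) // guest_width
    # Frames containing pfn S through pfn E, inclusive.
    nframes = end_pfn // per_frame - start_pfn // per_frame + 1
    if len(body) != 8 + 8 * nframes:
        raise ValueError("record length does not match the PFN range")
    frames = struct.unpack_from("%s%dQ" % (byte_order, nframes), body, 8)
    return start_pfn, end_pfn, list(frames)


# 64-bit guest, 4 KiB pages: 512 entries per frame, so PFNs 0..1023
# need two frames.
body = struct.pack("<II2Q", 0, 1023, 0x100, 0x101)
print(parse_p2m_frames(body, guest_width=8))
```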
400
401\clearpage
402
403X86_PV_VCPU_BASIC, EXTENDED, XSAVE, MSRS
404----------------------------------------
405
The format of these records is identical.  They are all binary blobs
of data which are accessed using specific pairs of domctl hypercalls.
408
409     0     1     2     3     4     5     6     7 octet
410    +-----------------------+-------------------------+
411    | vcpu_id               | (reserved)              |
412    +-----------------------+-------------------------+
413    | context...                                      |
414    ...
415    +-------------------------------------------------+
416
417---------------------------------------------------------------------
418Field            Description
419-----------      ----------------------------------------------------
420vcpu_id          The VCPU ID.
421
422context          Binary data for this VCPU.
423---------------------------------------------------------------------
424
425---------------------------------------------------------------------
426Record type                  Accessor hypercalls
427-----------------------      ----------------------------------------
428X86_PV_VCPU_BASIC            XEN_DOMCTL_{get,set}vcpucontext
429
430X86_PV_VCPU_EXTENDED         XEN_DOMCTL_{get,set}\_ext_vcpucontext
431
432X86_PV_VCPU_XSAVE            XEN_DOMCTL_{get,set}vcpuextstate
433
434X86_PV_VCPU_MSRS             XEN_DOMCTL_{get,set}\_vcpu_msrs
435---------------------------------------------------------------------
436
437\clearpage
438
439SHARED_INFO
440-----------
441
442The content of the Shared Info page.
443
444     0     1     2     3     4     5     6     7 octet
445    +-------------------------------------------------+
446    | shared_info                                     |
447    ...
448    +-------------------------------------------------+
449
450--------------------------------------------------------------------
451Field            Description
452-----------      ---------------------------------------------------
453shared_info      Contents of the shared info page.  This record
454                 should be exactly 1 page long.
455--------------------------------------------------------------------
456
457\clearpage
458
459X86_TSC_INFO
460------------
461
462Domain TSC information, as accessed by the
463XEN_DOMCTL_{get,set}tscinfo hypercall sub-ops.
464
465     0     1     2     3     4     5     6     7 octet
466    +------------------------+------------------------+
467    | mode                   | khz                    |
468    +------------------------+------------------------+
469    | nsec                                            |
470    +------------------------+------------------------+
471    | incarnation            | (reserved)             |
472    +------------------------+------------------------+
473
474--------------------------------------------------------------------
475Field            Description
476-----------      ---------------------------------------------------
477mode             TSC mode, TSC_MODE_* constant.
478
479khz              TSC frequency, in kHz.
480
481nsec             Elapsed time, in nanoseconds.
482
483incarnation      Incarnation.
484--------------------------------------------------------------------
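
For illustration, the fixed 24-octet body above maps directly onto a
single `struct` format.  The function name and dictionary keys are
arbitrary choices for this sketch.

```python
import struct


def parse_tsc_info(body, byte_order="<"):
    """Decode an X86_TSC_INFO record body."""
    # mode (4), khz (4), nsec (8), incarnation (4), reserved (4)
    mode, khz, nsec, incarnation, _reserved = struct.unpack_from(
        byte_order + "IIQII", body, 0)
    return {"mode": mode, "khz": khz, "nsec": nsec, "incarnation": incarnation}


body = struct.pack("<IIQII", 0, 2400000, 123456789, 1, 0)
print(parse_tsc_info(body))
```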
485
486\clearpage
487
488HVM_CONTEXT
489-----------
490
491HVM Domain context, as accessed by the
492XEN_DOMCTL_{get,set}hvmcontext hypercall sub-ops.
493
494     0     1     2     3     4     5     6     7 octet
495    +-------------------------------------------------+
496    | hvm_ctx                                         |
497    ...
498    +-------------------------------------------------+
499
500--------------------------------------------------------------------
501Field            Description
502-----------      ---------------------------------------------------
503hvm_ctx          The HVM Context blob from Xen.
504--------------------------------------------------------------------
505
506\clearpage
507
508HVM_PARAMS
509----------
510
511HVM Domain parameters, as accessed by the
512HVMOP_{get,set}\_param hypercall sub-ops.
513
514     0     1     2     3     4     5     6     7 octet
515    +------------------------+------------------------+
516    | count (C)              | (reserved)             |
517    +------------------------+------------------------+
518    | param[0].index                                  |
519    +-------------------------------------------------+
520    | param[0].value                                  |
521    +-------------------------------------------------+
522    ...
523    +-------------------------------------------------+
524    | param[C-1].index                                |
525    +-------------------------------------------------+
526    | param[C-1].value                                |
527    +-------------------------------------------------+
528
529--------------------------------------------------------------------
530Field            Description
531-----------      ---------------------------------------------------
532count            The number of parameters contained in this record.
533                 Each parameter in the record contains an index and
534                 value.
535
536param index      Parameter index.
537
538param value      Parameter value.
539--------------------------------------------------------------------
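
A possible decoder for this record is sketched below.  It returns
(index, value) pairs and, as a defensive choice for this example,
treats an empty body as an empty parameter list.  The function name is
hypothetical.

```python
import struct


def parse_hvm_params(body, byte_order="<"):
    """Return a list of (index, value) pairs from an HVM_PARAMS body."""
    if not body:
        return []  # tolerate a record with zero-length content
    count, _reserved = struct.unpack_from(byte_order + "II", body, 0)
    if len(body) != 8 + 16 * count:
        raise ValueError("record length does not match count")
    pair = struct.Struct(byte_order + "QQ")  # index, value
    return [pair.unpack_from(body, 8 + 16 * i) for i in range(count)]


body = struct.pack("<II4Q", 2, 0, 1, 0xFEFF, 6, 0x1234)
print(parse_hvm_params(body))
```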
540
541\clearpage
542
543TOOLSTACK (deprecated)
544----------------------
545
> *This record was only present for transitional purposes during
>  development.  It should not be used.*
548
549An opaque blob provided by and supplied to the higher layers of the
550toolstack (e.g., libxl) during save and restore.
551
552     0     1     2     3     4     5     6     7 octet
553    +------------------------+------------------------+
554    | data                                            |
555    ...
556    +-------------------------------------------------+
557
558--------------------------------------------------------------------
559Field            Description
560-----------      ---------------------------------------------------
561data             Blob of toolstack-specific data.
562--------------------------------------------------------------------
563
564\clearpage
565
566VERIFY
567------
568
569A verify record indicates that, while all memory has now been sent, the sender
570shall send further memory records for debugging purposes.
571
572     0     1     2     3     4     5     6     7 octet
573    +-------------------------------------------------+
574
575The verify record contains no fields; its body_length is 0.
576
577\clearpage
578
579CHECKPOINT
580----------
581
582A checkpoint record indicates that all the preceding records in the stream
583represent a consistent view of VM state.
584
585     0     1     2     3     4     5     6     7 octet
586    +-------------------------------------------------+
587
The checkpoint record contains no fields; its body_length is 0.
589
590If the stream is embedded in a higher level toolstack stream, the
591CHECKPOINT record marks the end of the libxc portion of the stream
592and the stream is handed back to the higher level for further
593processing.
594
595The higher level stream may then hand the stream back to libxc to
596process another set of records for the next consistent VM state
597snapshot.  This next set of records may be terminated by another
598CHECKPOINT record or an END record.
599
600\clearpage
601
602CHECKPOINT_DIRTY_PFN_LIST
603-------------------------
604
A checkpoint dirty pfn list record is used to convey information about
dirty memory in the VM.  It is an unordered list of PFNs.  It is currently
only applicable in the backchannel of a checkpointed stream, and is only
used by COLO; see README.colo for more detail.
609
610     0     1     2     3     4     5     6     7 octet
611    +-------------------------------------------------+
612    | pfn[0]                                          |
613    +-------------------------------------------------+
614    ...
615    +-------------------------------------------------+
616    | pfn[C-1]                                        |
617    +-------------------------------------------------+
618
619The count of pfns is: record->length/sizeof(uint64_t).
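
Because the record carries no explicit count, a reader derives it from
the body length, as in this small sketch (the function name is
hypothetical):

```python
import struct


def parse_dirty_pfn_list(body, byte_order="<"):
    """Return the PFNs in a CHECKPOINT_DIRTY_PFN_LIST record body."""
    if len(body) % 8:
        raise ValueError("length is not a multiple of sizeof(uint64_t)")
    count = len(body) // 8  # record->length / sizeof(uint64_t)
    return list(struct.unpack("%s%dQ" % (byte_order, count), body))


print(parse_dirty_pfn_list(struct.pack("<3Q", 0x10, 0x42, 0x7)))
```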
620
621\clearpage
622
623STATIC_DATA_END
624---------------
625
A static data end record marks the end of the static state, i.e. state which
is invariant under guest execution.
628
629
630     0     1     2     3     4     5     6     7 octet
631    +-------------------------------------------------+
632
The static data end record contains no fields; its body_length is 0.
634
635\clearpage
636
637X86_CPUID_POLICY
638----------------
639
640CPUID policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
641hypercall sub-ops.
642
643     0     1     2     3     4     5     6     7 octet
644    +-------------------------------------------------+
645    | CPUID_policy                                    |
646    ...
647    +-------------------------------------------------+
648
649--------------------------------------------------------------------
650Field            Description
651------------     ---------------------------------------------------
652CPUID_policy     Array of xen_cpuid_leaf_t[]'s
653--------------------------------------------------------------------
654
655\clearpage
656
657X86_MSR_POLICY
658--------------
659
660MSR policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy
661hypercall sub-ops.
662
663     0     1     2     3     4     5     6     7 octet
664    +-------------------------------------------------+
665    | MSR_policy                                      |
666    ...
667    +-------------------------------------------------+
668
669--------------------------------------------------------------------
670Field            Description
671----------       ---------------------------------------------------
672MSR_policy       Array of xen_msr_entry_t[]'s
673--------------------------------------------------------------------
674
675\clearpage
676
677
678Layout
679======
680
681The set of valid records depends on the guest architecture and type.  No
682assumptions should be made about the ordering or interleaving of
683independent records.  Record dependencies are noted below.
684
685Some records are used for signalling, and explicitly have zero length.  All
686other records contain data relevant to the migration.  Data records with no
687content should be elided on the source side, as their presence serves no
688purpose, but results in extra work for the restore side.
689
690x86 PV Guest
691------------
692
693A typical save record for an x86 PV guest image would look like:
694
695* Image header
696* Domain header
697* Static data records:
698    * X86_PV_INFO record
699    * X86_{CPUID,MSR}_POLICY
700    * STATIC_DATA_END
701* X86_PV_P2M_FRAMES record
702* Many PAGE_DATA records
703* X86_TSC_INFO
704* SHARED_INFO record
705* VCPU context records for each online VCPU
706    * X86_PV_VCPU_BASIC record
707    * X86_PV_VCPU_EXTENDED record
708    * X86_PV_VCPU_XSAVE record
709    * X86_PV_VCPU_MSRS record
710* END record
711
712There are some strict ordering requirements.  The following records must
713be present in the following order as each of them depends on information
714present in the preceding ones.
715
716* X86_PV_INFO record
717* X86_PV_P2M_FRAMES record
718* PAGE_DATA records
719* VCPU records
720
721x86 HVM Guest
722-------------
723
724A typical save record for an x86 HVM guest image would look like:
725
726* Image header
727* Domain header
728* Static data records:
729    * X86_{CPUID,MSR}_POLICY
730    * STATIC_DATA_END
731* Many PAGE_DATA records
732* X86_TSC_INFO
733* HVM_PARAMS
734* HVM_CONTEXT
735
736HVM_PARAMS must precede HVM_CONTEXT, as certain parameters can affect
737the validity of architectural state in the context.
738
739Compatibility with older versions
740=================================
741
742v3 compat with v2
743-----------------
744
A v3 stream is compatible with a v2 stream, but mandates the presence of a
STATIC_DATA_END record ahead of any memory/register content.  This is to ease
the introduction of new static configuration records over time.
748
A v3-compatible receiver interpreting a v2 stream should infer the position of
STATIC_DATA_END based on finding the first X86_PV_P2M_FRAMES record (for PV
guests), or PAGE_DATA record (for HVM guests) and behave as if STATIC_DATA_END
had been sent.
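
One way to implement this inference is to wrap the record stream and
inject a synthetic STATIC_DATA_END, as sketched below.  The wrapper
name is hypothetical, and the numeric value used for STATIC_DATA_END
(0x10) is an assumption taken from the Xen sources rather than from
the text above.

```python
REC_TYPE_PAGE_DATA = 0x00000001
REC_TYPE_X86_PV_P2M_FRAMES = 0x00000003
REC_TYPE_STATIC_DATA_END = 0x00000010  # assumed value, see lead-in


def with_static_data_end(records, version, is_pv):
    """Yield (type, body) pairs, inferring STATIC_DATA_END for v2 streams.

    records is any iterable of (type, body) pairs in stream order.
    """
    trigger = REC_TYPE_X86_PV_P2M_FRAMES if is_pv else REC_TYPE_PAGE_DATA
    # v3 and later streams carry the record explicitly: nothing to infer.
    need_inference = version < 3
    for rec_type, body in records:
        if need_inference and rec_type == trigger:
            yield REC_TYPE_STATIC_DATA_END, b""
            need_inference = False
        yield rec_type, body


v2_hvm = [(0x1, b"page0"), (0x1, b"page1"), (0x0, b"")]
print([hex(t) for t, _ in with_static_data_end(v2_hvm, version=2, is_pv=False)])
```

With this wrapper, the rest of a restorer can treat v2 and v3 streams
uniformly.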
753
754Legacy Images (x86 only)
755------------------------
756
757Restoring legacy images from older tools shall be handled by
758translating the legacy format image into this new format.
759
760It shall not be possible to save in the legacy format.
761
762There are two different legacy images depending on whether they were
763generated by a 32-bit or a 64-bit toolstack. These shall be
764distinguished by inspecting octets 4-7 in the image.  If these are
765zero then it is a 64-bit image.
766
Toolstack  Field                             Value
---------  -----                             -----
64-bit     Bits 32-63 of the p2m_size field  0 (since p2m_size < 2^32^)
32-bit     extended-info chunk ID (PV)       0xFFFFFFFF
32-bit     Chunk type (HVM)                  < 0
32-bit     Page count (HVM)                  > 0

Table: Possible values for octets 4-7 in legacy images
775
776This assumes the presence of the extended-info chunk which was
777introduced in Xen 3.0.
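
Combining the marker check from the image header with the rule above
gives a simple classifier over the first 8 octets of an image.  This is
a sketch only; the function name and return strings are made up for the
example.

```python
def classify_image(first8):
    """Classify an image from its first 8 octets."""
    if len(first8) < 8:
        raise ValueError("need at least 8 octets")
    head = bytes(first8[:8])
    if head == b"\xff" * 8:
        return "new format"              # marker is 0xFFFFFFFFFFFFFFFF
    if head[4:8] == b"\x00" * 4:
        return "legacy, 64-bit toolstack"  # octets 4-7 are zero
    return "legacy, 32-bit toolstack"


print(classify_image(b"\x00\x00\x10\x00\x00\x00\x00\x00"))
print(classify_image(b"\x00\x00\x10\x00\xff\xff\xff\xff"))
```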
778
779
780Future Extensions
781=================
782
783All changes to this specification should bump the revision number in
784the title block.
785
786All changes to the image or domain headers require the image version
787to be increased.
788
789The format may be extended by adding additional record types.
790
791Extending an existing record type must be done by adding a new record
792type.  This allows old images with the old record to still be
793restored.
794
795The image header may only be extended by _appending_ additional
796fields.  In particular, the `marker`, `id` and `version` fields must
797never change size or location.
798
799
800Errata
801======
802
1. For compatibility with older code, the receiving side of a stream should
   tolerate and ignore variable-sized records with zero content.  Xen releases
   between 4.6 and 4.8 could end up generating valid HVM_PARAMS or
   X86_PV_VCPU_{EXTENDED,XSAVE,MSRS} records with zero-length content.
807