# Xen transport for 9pfs version 1

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is
very simple and describes a series of commands and responses. It is
completely independent from the communication channels; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports tcp and unix
sockets, fds, virtio and rdma.


### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However it is useful to know that each 9pfs request and response has
the following header:

    struct header {
        uint32_t size;
        uint8_t id;
        uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
  The size of the request or response.

- *id*
  The 9pfs request or response operation.

- *tag*
  Unique id that identifies a specific request/response pair. It is
  used to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.


## Rationale

This document describes a Xen based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by
the 9pfs server to receive commands from clients and send back
responses.

The transport protocol supports multiple rings up to the maximum
supported by the backend. The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design is to exploit
parallelism at the vCPU level and support multiple outstanding requests
simultaneously.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.


## Xenstore

The frontend and the backend connect via xenstore to exchange
information. The toolstack creates front and back nodes with state
[XenbusStateInitialising]. The protocol node name is **9pfs**.

Multiple rings are supported for each frontend and backend connection.

### Backend XenBus Nodes

Backend specific properties, written by the backend, read by the
frontend:

    versions
         Values:         <string>

         List of comma-separated protocol versions supported by the
         backend. For example "1,2,3". Currently the value is just "1",
         as there is only one version. N.B.: this is the version of the
         Xen transport protocol, not the version of 9pfs supported by
         the server.

    max-rings
         Values:         <uint32_t>

         The maximum supported number of rings per frontend.

    max-ring-page-order
         Values:         <uint32_t>

         The maximum supported size of a single ring in units of
         log2(machine pages), e.g. 1 == 2 pages, 2 == 4 pages, etc. It
         must be at least 1.

Backend configuration nodes, written by the toolstack, read by the
backend:

    path
         Values:         <string>

         Host filesystem path to share.

    tag
         Values:         <string>

         Alphanumeric tag that identifies the 9pfs share. The client
         needs to know the tag to be able to mount it.

    security-model
         Values:         "none"

         *none*: files are stored using the same credentials as they
         are created on the guest (no user ownership squash or remap).
         Only "none" is supported in this version of the protocol.
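
As an illustration of the nodes above, here is a minimal sketch of a
backend advertising its properties with libxenstore. The path
`backend/9pfs/1/0` and the advertised values are illustrative
assumptions only; the real backend path is created by the toolstack as
backend/9pfs/\<domid\>/\<devid\>:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    #include <xenstore.h>

    /* Illustrative backend path: the real one is created by the
     * toolstack as backend/9pfs/<domid>/<devid>. */
    #define BACKEND_PATH "backend/9pfs/1/0"

    static bool write_node(struct xs_handle *xs, const char *node,
                           const char *val)
    {
        char path[256];

        snprintf(path, sizeof(path), BACKEND_PATH "/%s", node);
        return xs_write(xs, XBT_NULL, path, val, strlen(val));
    }

    int main(void)
    {
        struct xs_handle *xs = xs_open(0);

        if (!xs)
            return 1;

        /* Example values: one protocol version, up to 4 rings, and
         * rings of at most 2^9 pages (2MB, the limit derived later in
         * this document). */
        write_node(xs, "versions", "1");
        write_node(xs, "max-rings", "4");
        write_node(xs, "max-ring-page-order", "9");

        xs_close(xs);
        return 0;
    }
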
### Frontend XenBus Nodes

    version
         Values:         <string>

         Protocol version, chosen among the ones supported by the
         backend (see **versions** under [Backend XenBus Nodes]).
         Currently the value must be "1".

    num-rings
         Values:         <uint32_t>

         Number of rings. It needs to be less than or equal to
         **max-rings**.

    event-channel-<num> (event-channel-0, event-channel-1, etc)
         Values:         <uint32_t>

         The identifier of the Xen event channel used to signal
         activity in the ring buffer. One for each ring.

    ring-ref<num> (ring-ref0, ring-ref1, etc)
         Values:         <uint32_t>

         The Xen grant reference granting permission for the backend to
         map a page with information to set up a shared ring. One for
         each ring.
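
The frontend counterpart of the previous sketch could publish its
nodes as follows. The path `device/9pfs/0`, the number of rings and the
event channel and grant reference values are all assumptions for the
sake of the example; in a real frontend they come from the guest's
event channel and grant table interfaces:

    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    #include <xenstore.h>

    #define NUM_RINGS 2

    /* Illustrative frontend path: the real one is created by the
     * toolstack as device/9pfs/<devid>. */
    #define FRONTEND_PATH "device/9pfs/0"

    /* Hypothetical per-ring resources, allocated beforehand through
     * the guest's event channel and grant table interfaces. */
    static const unsigned int evtchn[NUM_RINGS]   = { 10, 11 };
    static const unsigned int ring_ref[NUM_RINGS] = { 512, 513 };

    static bool write_node(struct xs_handle *xs, const char *node,
                           unsigned int val)
    {
        char path[256], value[16];

        snprintf(path, sizeof(path), FRONTEND_PATH "/%s", node);
        snprintf(value, sizeof(value), "%u", val);
        return xs_write(xs, XBT_NULL, path, value, strlen(value));
    }

    int main(void)
    {
        struct xs_handle *xs = xs_open(0);
        char node[32];
        int i;

        if (!xs)
            return 1;

        xs_write(xs, XBT_NULL, FRONTEND_PATH "/version", "1", 1);
        /* num-rings must be <= the backend's max-rings. */
        write_node(xs, "num-rings", NUM_RINGS);

        /* One event channel and one grant reference per ring. */
        for (i = 0; i < NUM_RINGS; i++) {
            snprintf(node, sizeof(node), "event-channel-%d", i);
            write_node(xs, node, evtchn[i]);
            snprintf(node, sizeof(node), "ring-ref%d", i);
            write_node(xs, node, ring_ref[i]);
        }

        xs_close(xs);
        return 0;
    }
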
### State Machine

Initialization:

    *Front*                              *Back*
    XenbusStateInitialising              XenbusStateInitialising
    - Query virtual device               - Query backend device
      properties.                          identification data.
    - Setup OS device instance.         - Publish backend features
    - Allocate and initialize the         and transport parameters
      request ring.                                  |
    - Publish transport parameters                   |
      that will be in effect during                  V
      this connection.                       XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                        - Query frontend transport
                                          parameters.
                                        - Connect to the request ring
                                          and event channel.
                                                     |
                                                     |
                                                     V
                                            XenbusStateConnected

    - Query backend device properties.
    - Finalize OS virtual device
      instance.
                 |
                 |
                 V
       XenbusStateConnected

Once frontend and backend are connected, they have a shared page per
ring, which is used to set up the ring, and an event channel per ring,
which is used to send notifications.

Shutdown:

    *Front*                              *Back*
    XenbusStateConnected                 XenbusStateConnected
                 |
                 |
                 V
       XenbusStateClosing

                                        - Unmap grants
                                        - Unbind evtchns
                                                     |
                                                     |
                                                     V
                                            XenbusStateClosing

    - Unbind evtchns
    - Free rings
    - Free data structures
                 |
                 |
                 V
       XenbusStateClosed

                                        - Free remaining data structures
                                                     |
                                                     |
                                                     V
                                            XenbusStateClosed


## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
        XEN_9PFS_RING_IDX in_cons, in_prod;
        /* keep the in and out index pairs on separate cache lines */
        uint8_t pad1[56];
        XEN_9PFS_RING_IDX out_cons, out_prod;
        uint8_t pad2[56];

        uint32_t ring_order;
        /* this is an array of (1 << ring_order) elements */
        grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references is of `(1 << ring_order)` elements. It cannot be greater
  than **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They
  are mapped contiguously in virtual memory. The first half of the
  pages is the **in** array, the second half is the **out** array. The
  array must have a power of two number of elements.
- **out** is an array used as circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how
  much data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**.
  **out_cons** is increased by the backend, after reading data from
  **out**.
- **in_cons**, **in_prod**
  Consumer and producer indices for responses. They keep track of how
  much data has been written by the backend to **in** and how much data
  has already been consumed by the frontend. **in_prod** is increased
  by the backend, after writing data to **in**. **in_cons** is
  increased by the frontend, after reading data from **in**.

The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8            64        68        72           128
    +---------+---------+-----//-----+---------+---------+-----//-----+
    | in_cons | in_prod |  padding   |out_cons |out_prod |  padding   |
    +---------+---------+-----//-----+---------+---------+-----//-----+

    128        132       136              4092      4096
    +----------+---------+---------+----//---+---------+
    |ring_order|  ref[0] |  ref[1] |         |  ref[N] |
    +----------+---------+---------+----//---+---------+

**N.B.** For one page, **ref[]** can hold at most 991 elements
((4096-132)/4), but given that the number of elements needs to be a
power of two, the actual maximum is 512. As 512 == (1 << 9), the
maximum possible **max-ring-page-order** value is 9.

The binary layout of the ring buffers follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |            out            |
    +------------//-------------+------------//-------------+
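
To make the layout above concrete, here is a minimal frontend-side
sketch of initializing one shared page. `alloc_page()` and
`grant_page()` are hypothetical stand-ins for the OS-specific page
allocator and grant table call (e.g. `gnttab_grant_foreign_access()` on
Linux); they are assumptions for illustration, not part of the
protocol:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SHIFT 12

    typedef uint32_t XEN_9PFS_RING_IDX;
    typedef uint32_t grant_ref_t;
    typedef uint16_t domid_t;

    struct xen_9pfs_intf {
        XEN_9PFS_RING_IDX in_cons, in_prod;
        uint8_t pad1[56];
        XEN_9PFS_RING_IDX out_cons, out_prod;
        uint8_t pad2[56];
        uint32_t ring_order;
        grant_ref_t ref[1];
    };

    /* Hypothetical helpers standing in for the OS-specific page
     * allocator and grant table interface. */
    extern void *alloc_page(void);
    extern grant_ref_t grant_page(void *page, domid_t backend_id);

    static struct xen_9pfs_intf *setup_ring(domid_t backend_id,
                                            uint32_t ring_order,
                                            grant_ref_t *intf_ref)
    {
        struct xen_9pfs_intf *intf = alloc_page();
        uint32_t i;

        memset(intf, 0, 1 << PAGE_SHIFT);   /* all indices start at 0 */
        intf->ring_order = ring_order;      /* <= max-ring-page-order */

        /* Grant the (1 << ring_order) data pages; the backend maps
         * them contiguously: first half in, second half out. ref[]
         * extends to the end of the page, as described above. */
        for (i = 0; i < (1u << ring_order); i++)
            intf->ref[i] = grant_page(alloc_page(), backend_id);

        /* Grant the interface page itself; this is the reference
         * published as ring-ref<num> on xenstore. */
        *intf_ref = grant_page(intf, backend_id);
        return intf;
    }
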
## Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. This protocol does not, because it
actually comes with two rings: the **in** ring and the **out** ring.
Each of them is unidirectional, and there is no static request size:
the producer writes opaque data to the ring. In [ring.h], on the other
hand, the two directions are combined in a single ring and the request
size is static and well-known. In this protocol:

    in -> backend to frontend only
    out-> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
that, leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless
it is already actively reading from the ring. The producer can figure
that out, without any additional fields in the protocol, by comparing
the indices at the beginning and the end of the function. This is
similar to what [ring.h] does.

## Ring Usage

The **in** and **out** arrays are used as circular buffers:

    0                                 sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                                 sizeof(array)
    +-----------------------------------+
    |  free  |    to consume    | free  |
    +-----------------------------------+
             ^                  ^
             cons               prod

The following functions are provided to read and write to an array.
They take already-masked indices and assume the caller has verified
that *len* bytes are actually available:

    /* XEN_9PFS_RING_SIZE == sizeof(in) == sizeof(out): a power of two */
    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    static inline void xen_9pfs_read(char *buf,
            XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
            uint8_t *h, size_t len)
    {
        if (*masked_cons < *masked_prod) {
            memcpy(h, buf + *masked_cons, len);
        } else {
            /* the data may wrap around the end of the array */
            if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
                memcpy(h, buf + *masked_cons,
                       XEN_9PFS_RING_SIZE - *masked_cons);
                memcpy(h + (XEN_9PFS_RING_SIZE - *masked_cons), buf,
                       len - (XEN_9PFS_RING_SIZE - *masked_cons));
            } else {
                memcpy(h, buf + *masked_cons, len);
            }
        }
        *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    static inline void xen_9pfs_write(char *buf,
            XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
            uint8_t *opaque, size_t len)
    {
        if (*masked_prod < *masked_cons) {
            memcpy(buf + *masked_prod, opaque, len);
        } else {
            /* the free space may wrap around the end of the array */
            if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
                memcpy(buf + *masked_prod, opaque,
                       XEN_9PFS_RING_SIZE - *masked_prod);
                memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod),
                       len - (XEN_9PFS_RING_SIZE - *masked_prod));
            } else {
                memcpy(buf + *masked_prod, opaque, len);
            }
        }
        *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes
to the array in the following way (a sketch follows at the end of this
section):

- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the
  circular buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via event channel

The consumer (the backend for **out**, the frontend for **in**) reads
from the array in the following way:

- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the
  circular buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via event channel

The producer takes care of writing only as many bytes as available in
the buffer up to *cons*. The consumer takes care of reading only as
many bytes as available in the buffer up to *prod*.
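
Here is a hedged sketch of the producer steps above for the **out**
ring, reusing `struct xen_9pfs_intf`, `MASK_XEN_9PFS_IDX` and
`xen_9pfs_write` from the previous sections. `xen_mb()`, `xen_wmb()`
and `notify_via_evtchn()` are hypothetical stand-ins for the platform's
memory barriers and event channel primitive:

    /* Hypothetical stand-ins for the platform's memory barriers and
     * event channel notification primitive. */
    extern void xen_mb(void);
    extern void xen_wmb(void);
    extern void notify_via_evtchn(int evtchn);

    /* Local copy of out_prod: the consumer never changes out_prod, so
     * the shared copy must always match this one. */
    static XEN_9PFS_RING_IDX local_out_prod;

    static int send_request(struct xen_9pfs_intf *intf, char *out,
                            uint8_t *req, size_t len, int evtchn)
    {
        XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;

        /* Read cons and prod from shared memory... */
        cons = intf->out_cons;
        prod = intf->out_prod;
        /* ...and order the reads before the checks and the writes. */
        xen_mb();

        /* Verify prod against the local copy. */
        if (prod != local_out_prod)
            return -1;
        /* Write only as many bytes as available up to cons. */
        if (XEN_9PFS_RING_SIZE - (prod - cons) < len)
            return -1;

        masked_prod = MASK_XEN_9PFS_IDX(prod);
        masked_cons = MASK_XEN_9PFS_IDX(cons);
        xen_9pfs_write(out, &masked_prod, &masked_cons, req, len);

        /* Make the payload visible before publishing the new prod. */
        xen_wmb();
        local_out_prod = prod + len;
        intf->out_prod = local_out_prod;

        /* Notify the other end via event channel. */
        notify_via_evtchn(evtchn);
        return 0;
    }

The indices are free-running 32-bit counters: they are only masked when
used to address the array, which is why `XEN_9PFS_RING_SIZE` must be a
power of two.
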
## Request/Response Workflow

The client chooses one of the available rings, then sends a request to
the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified
in the *size* field of the 9pfs header.


[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[XenbusStateInitialising]: https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[ring.h]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD