# Xen transport for 9pfs version 1

## Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very
simple and describes a series of commands and responses. It is
completely independent from the communication channels; in fact, many
clients and servers support multiple channels, usually called
"transports". For example, the Linux client supports TCP and Unix
sockets, file descriptors, virtio and RDMA.


### 9pfs protocol

This document won't cover the full 9pfs specification. Please refer to
this [paper] and this [website] for a detailed description of it.
However, it is useful to know that each 9pfs request and response has
the following header:

    struct header {
    	uint32_t size;
    	uint8_t id;
    	uint16_t tag;
    } __attribute__((packed));

    0         4  5    7
    +---------+--+----+
    |  size   |id|tag |
    +---------+--+----+

- *size*
The total size of the request or response, including the header itself.

- *id*
The 9pfs request or response operation.

- *tag*
Unique identifier of a specific request/response pair. It is used
to multiplex operations on a single channel.

It is possible to have multiple requests in-flight at any given time.

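As an illustration of how *tag* enables multiplexing, here is a minimal
client-side sketch of a dispatch table for in-flight requests. All names
(`MAX_TAGS`, `pending`, `dispatch_response`, the `complete` callback) are
hypothetical, not part of the protocol:

    #define MAX_TAGS 64

    struct pending {
    	int in_use;
    	void (*complete)(struct header *resp); /* hypothetical callback */
    };

    static struct pending pending[MAX_TAGS];

    /* Called when a response header has been read from the transport:
     * the tag identifies the outstanding request it belongs to. */
    static void dispatch_response(struct header *resp)
    {
    	if (resp->tag < MAX_TAGS && pending[resp->tag].in_use) {
    		pending[resp->tag].complete(resp);
    		pending[resp->tag].in_use = 0; /* the tag can now be reused */
    	}
    }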

## Rationale

This document describes a Xen based transport for 9pfs, in the
traditional PV frontend and backend format. The PV frontend is used by
the client to send commands to the server. The PV backend is used by the
9pfs server to receive commands from clients and send back responses.

The transport protocol supports multiple rings, up to the maximum
supported by the backend. The size of every ring is also configurable
and can span multiple pages, up to the maximum supported by the backend
(although it cannot be more than 2MB). The design aims to exploit
parallelism at the vCPU level and support multiple outstanding requests
simultaneously.

This document does not cover the 9pfs client/server design or
implementation, only the transport for it.


## Xenstore

The frontend and the backend connect via xenstore to exchange
information. The toolstack creates front and back nodes with state
[XenbusStateInitialising]. The protocol node name is **9pfs**.

Multiple rings are supported for each frontend and backend connection.

### Backend XenBus Nodes

Backend specific properties, written by the backend, read by the
frontend:

    versions
         Values:         <string>

         List of comma separated protocol versions supported by the backend.
         For example "1,2,3". Currently the value is just "1", as there is
         only one version. N.B.: this is the version of the Xen transport
         protocol, not the version of 9pfs supported by the server.

    max-rings
         Values:         <uint32_t>

         The maximum supported number of rings per frontend.

    max-ring-page-order
         Values:         <uint32_t>

         The maximum supported size of a memory allocation in units of
         log2(machine pages), e.g. 1 == 2 pages, 2 == 4 pages, etc. It
         must be at least 1.

Backend configuration nodes, written by the toolstack, read by the
backend:

    path
         Values:         <string>

         Host filesystem path to share.

    tag
         Values:         <string>

         Alphanumeric tag that identifies the 9pfs share. The client needs
         to know the tag to be able to mount it.

    security-model
         Values:         "none"

         *none*: files are stored using the same credentials as the guest
                 uses to create them (no user ownership squash or remap).
         Only "none" is supported in this version of the protocol.

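For example, assuming a Linux frontend with the 9p Xen transport, a
share exported with the tag "shared" could be mounted from the guest
with something like (mount point and options are illustrative):

    mount -t 9p -o trans=xen shared /mnt/shared
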
### Frontend XenBus Nodes

    version
         Values:         <string>

         Protocol version, chosen among the ones supported by the backend
         (see **versions** under [Backend XenBus Nodes]). Currently the
         value must be "1".

    num-rings
         Values:         <uint32_t>

         Number of rings. It must be less than or equal to **max-rings**.

    event-channel-<num> (event-channel-0, event-channel-1, etc)
         Values:         <uint32_t>

         The identifier of the Xen event channel used to signal activity
         in the ring buffer. One for each ring.

    ring-ref<num> (ring-ref0, ring-ref1, etc)
         Values:         <uint32_t>

         The Xen grant reference granting permission for the backend to
         map the page containing the information to set up a shared ring.
         One for each ring.

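As an illustration, a single-ring connection could look as follows in
xenstore (the exact paths depend on the toolstack; the domain ids,
device id, grant reference and event channel numbers below are just
examples):

    /local/domain/0/backend/9pfs/1/0/versions = "1"
    /local/domain/0/backend/9pfs/1/0/max-rings = "4"
    /local/domain/0/backend/9pfs/1/0/max-ring-page-order = "9"
    /local/domain/0/backend/9pfs/1/0/path = "/var/guests/share"
    /local/domain/0/backend/9pfs/1/0/tag = "shared"
    /local/domain/0/backend/9pfs/1/0/security-model = "none"

    /local/domain/1/device/9pfs/0/version = "1"
    /local/domain/1/device/9pfs/0/num-rings = "1"
    /local/domain/1/device/9pfs/0/event-channel-0 = "15"
    /local/domain/1/device/9pfs/0/ring-ref0 = "8"
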
### State Machine

Initialization:

    *Front*                               *Back*
    XenbusStateInitialising               XenbusStateInitialising
    - Query virtual device                - Query backend device
      properties.                           identification data.
    - Setup OS device instance.           - Publish backend features
    - Allocate and initialize the           and transport parameters
      request ring.                                      |
    - Publish transport parameters                       |
      that will be in effect during                      V
      this connection.                            XenbusStateInitWait
                 |
                 |
                 V
       XenbusStateInitialised

                                          - Query frontend transport parameters.
                                          - Connect to the request ring and
                                            event channel.
                                                         |
                                                         |
                                                         V
                                                 XenbusStateConnected

     - Query backend device properties.
     - Finalize OS virtual device
       instance.
                 |
                 |
                 V
        XenbusStateConnected

Once frontend and backend are connected, they have a shared page per
ring, which is used to set up that ring, and an event channel per ring,
which is used to send notifications.

Shutdown:

    *Front*                            *Back*
    XenbusStateConnected               XenbusStateConnected
                |
                |
                V
       XenbusStateClosing

                                       - Unmap grants
                                       - Unbind evtchns
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosing

    - Unbind evtchns
    - Free rings
    - Free data structures
               |
               |
               V
       XenbusStateClosed

                                       - Free remaining data structures
                                                 |
                                                 |
                                                 V
                                         XenbusStateClosed


## Ring Setup

The shared page has the following layout:

    typedef uint32_t XEN_9PFS_RING_IDX;

    struct xen_9pfs_intf {
    	XEN_9PFS_RING_IDX in_cons, in_prod;
    	uint8_t pad1[56];
    	XEN_9PFS_RING_IDX out_cons, out_prod;
    	uint8_t pad2[56];

    	uint32_t ring_order;
    	/* this is an array of (1 << ring_order) elements */
    	grant_ref_t ref[1];
    };

    /* not actually C compliant (ring_order changes from ring to ring) */
    struct ring_data {
        char in[((1 << ring_order) << PAGE_SHIFT) / 2];
        char out[((1 << ring_order) << PAGE_SHIFT) / 2];
    };

- **ring_order**
  It represents the order of the data ring. The following list of grant
  references has `(1 << ring_order)` elements. It cannot be greater than
  **max-ring-page-order**, as specified by the backend on XenBus.
- **ref[]**
  The list of grant references which will contain the actual data. They are
  mapped contiguously in virtual memory. The first half of the pages is the
  **in** array, the second half is the **out** array. The array must
  have a power of two number of elements (see the allocation sketch
  after this list).
- **out** is an array used as a circular buffer
  It contains client requests. The producer is the frontend, the
  consumer is the backend.
- **in** is an array used as a circular buffer
  It contains server responses. The producer is the backend, the
  consumer is the frontend.
- **out_cons**, **out_prod**
  Consumer and producer indices for client requests. They keep track of
  how much data has been written by the frontend to **out** and how much
  data has already been consumed by the backend. **out_prod** is
  increased by the frontend, after writing data to **out**. **out_cons**
  is increased by the backend, after reading data from **out**.
- **in_cons** and **in_prod**
  Consumer and producer indices for responses. They keep track of how
  much data has been written by the backend to **in** and how much data
  has already been consumed by the frontend. **in_prod** is increased by
  the backend, after writing data to **in**. **in_cons** is increased by
  the frontend, after reading data from **in**.

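The following is a sketch of how a frontend might initialize the shared
page, assuming a hypothetical `grant_page()` helper that grants one page
to the backend and returns the corresponding grant reference:

    static void xen_9pfs_fill_intf(struct xen_9pfs_intf *intf,
    		uint32_t ring_order, void *pages[])
    {
    	int i;

    	/* Both rings start out empty. */
    	intf->in_cons = intf->in_prod = 0;
    	intf->out_cons = intf->out_prod = 0;

    	/* Must not exceed max-ring-page-order advertised by the backend. */
    	intf->ring_order = ring_order;

    	/* Grant the (1 << ring_order) data pages to the backend. */
    	for (i = 0; i < (1 << ring_order); i++)
    		intf->ref[i] = grant_page(pages[i]); /* hypothetical helper */
    }
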
The binary layout of `struct xen_9pfs_intf` follows:

    0         4         8           64        68        72          128        132
    +---------+---------+-----//-----+---------+---------+-----//----+----------+
    | in_cons | in_prod |  padding   |out_cons |out_prod | padding   |ring_order|
    +---------+---------+-----//-----+---------+---------+-----//----+----------+

    132       136       140     4092      4096
    +---------+---------+----//---+---------+
    |  ref[0] |  ref[1] |         |  ref[N] |
    +---------+---------+----//---+---------+

**N.B.** For one page, the **ref[]** array can have at most 991 elements
((4096-132)/4), but given that the number of elements must be a power of
two, it can actually have at most 512 elements. As 512 == (1 << 9), the
maximum possible **max-ring-page-order** value is 9.

The binary layout of the ring buffers follows:

    0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
    +------------//-------------+------------//-------------+
    |            in             |           out             |
    +------------//-------------+------------//-------------+

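As a reference for the next sections, a sketch of how the ring size and
the two buffer pointers could be derived once the pages are mapped
contiguously (treating `ring_order` as a constant for simplicity;
`PAGE_SHIFT` is assumed to be provided by the environment):

    /* Each of "in" and "out" is half of the mapped data area. */
    #define XEN_9PFS_RING_SIZE (((1 << ring_order) << PAGE_SHIFT) / 2)

    static void setup_ring_pointers(char *data, char **in, char **out)
    {
    	*in  = data;                      /* responses: backend -> frontend */
    	*out = data + XEN_9PFS_RING_SIZE; /* requests: frontend -> backend */
    }
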
## Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage
their shared ring for communication. This protocol does not, because it
actually comes with two rings: the **in** ring and the **out** ring.
Each of them is mono-directional, and there is no static request size:
the producer writes opaque data to the ring. On the other hand, in
[ring.h] the two directions are combined, and the request size is
static and well-known. In this protocol:

    in  -> backend to frontend only
    out -> frontend to backend only

In the case of the **in** ring, the frontend is the consumer, and the
backend is the producer. Everything is the same but mirrored for the
**out** ring.

The producer, the backend in this case, never reads from the **in**
ring. In fact, the producer doesn't need any notifications unless the
ring is full. This version of the protocol doesn't take advantage of
this, leaving room for optimizations.

On the other hand, the consumer always requires notifications, unless
it is already actively reading from the ring. The producer can figure
this out, without any additional fields in the protocol, by comparing
the indexes at the beginning and the end of its write operation. This
is similar to what [ring.h] does.

## Ring Usage

The **in** and **out** arrays are used as circular buffers:

    0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
    +-----------------------------------+
    |to consume|    free    |to consume |
    +-----------------------------------+
               ^            ^
               prod         cons

    0                               sizeof(array)
    +-----------------------------------+
    |  free    | to consume |   free    |
    +-----------------------------------+
               ^            ^
               cons         prod

The following functions are provided to read and write to an array:

    #define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

    /* Copy len bytes out of the ring at buf, starting from *masked_cons,
     * wrapping around the end of the ring when necessary. */
    static inline void xen_9pfs_read(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *h, size_t len) {
    	if (*masked_cons < *masked_prod) {
    		/* The data is contiguous: a single copy is enough. */
    		memcpy(h, buf + *masked_cons, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
    			/* The data wraps: copy the tail, then the head. */
    			memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
    			memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
    		} else {
    			memcpy(h, buf + *masked_cons, len);
    		}
    	}
    	*masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
    }

    /* Copy len bytes into the ring at buf, starting from *masked_prod,
     * wrapping around the end of the ring when necessary. */
    static inline void xen_9pfs_write(char *buf,
    		XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
    		uint8_t *opaque, size_t len) {
    	if (*masked_prod < *masked_cons) {
    		/* The free space is contiguous: a single copy is enough. */
    		memcpy(buf + *masked_prod, opaque, len);
    	} else {
    		if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
    			/* The free space wraps: fill the tail, then the head. */
    			memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
    			memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod));
    		} else {
    			memcpy(buf + *masked_prod, opaque, len);
    		}
    	}
    	*masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
    }

The producer (the backend for **in**, the frontend for **out**) writes to the
array in the following way (see the sketch after this list):

- read *cons*, *prod* from shared memory
- general memory barrier
- verify *prod* against local copy (consumer shouldn't change it)
- write to array at position *prod* up to *cons*, wrapping around the circular
  buffer when necessary
- write memory barrier
- increase *prod*
- notify the other end via event channel

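The same steps expressed as a minimal C sketch, from the frontend's
point of view (sending on **out**). The `xen_mb()`/`xen_wmb()` barriers
and the `notify_evtchn()` helper are environment specific placeholders:

    static int xen_9pfs_send(struct xen_9pfs_intf *intf, char *out,
    		uint8_t *data, size_t len)
    {
    	XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;

    	cons = intf->out_cons;
    	prod = intf->out_prod;
    	xen_mb(); /* general barrier: read indexes before using the ring */

    	/* Write only as many bytes as available in the buffer up to cons. */
    	if (XEN_9PFS_RING_SIZE - (prod - cons) < len)
    		return -1; /* ring full: retry after the next notification */

    	masked_cons = MASK_XEN_9PFS_IDX(cons);
    	masked_prod = MASK_XEN_9PFS_IDX(prod);
    	xen_9pfs_write(out, &masked_prod, &masked_cons, data, len);

    	xen_wmb(); /* write barrier: publish the data before the new prod */
    	intf->out_prod = prod + len;

    	notify_evtchn(); /* hypothetical: signal the other end */
    	return 0;
    }
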
The consumer (the backend for **out**, the frontend for **in**) reads from the
array in the following way (see the sketch after this list):

- read *prod*, *cons* from shared memory
- read memory barrier
- verify *cons* against local copy (producer shouldn't change it)
- read from array at position *cons* up to *prod*, wrapping around the circular
  buffer when necessary
- general memory barrier
- increase *cons*
- notify the other end via event channel

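A matching sketch for the consumer side, here the frontend reading a
response header from **in** (again, `xen_rmb()`/`xen_mb()` and
`notify_evtchn()` are environment specific placeholders):

    static void xen_9pfs_recv(struct xen_9pfs_intf *intf, char *in)
    {
    	XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;
    	struct header h;

    	cons = intf->in_cons;
    	prod = intf->in_prod;
    	xen_rmb(); /* read barrier: read indexes before the ring contents */

    	/* Read only as many bytes as available in the buffer up to prod. */
    	if (prod - cons < sizeof(h))
    		return; /* not even a complete header to consume yet */

    	masked_cons = MASK_XEN_9PFS_IDX(cons);
    	masked_prod = MASK_XEN_9PFS_IDX(prod);
    	xen_9pfs_read(in, &masked_prod, &masked_cons, (uint8_t *)&h, sizeof(h));
    	/* ... then read the remaining h.size - sizeof(h) bytes ... */

    	xen_mb(); /* general barrier before publishing the new cons */
    	intf->in_cons = cons + h.size;

    	notify_evtchn(); /* hypothetical: tell the producer space is free */
    }
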
The producer takes care of writing only as many bytes as available in the buffer
up to *cons*. The consumer takes care of reading only as many bytes as available
in the buffer up to *prod*.


## Request/Response Workflow

The client chooses one of the available rings, then it sends a request
to the other end on the *out* array, following the producer workflow
described in [Ring Usage].

The server receives the notification and reads the request, following
the consumer workflow described in [Ring Usage]. The server knows how
much to read because it is specified in the *size* field of the 9pfs
header. The server processes the request and sends back a response on
the *in* array of the same ring, following the producer workflow as
usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the *in*
array. The client knows how much data to read because it is specified in
the *size* field of the 9pfs header.

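Putting it together, a client-side sketch of one request/response cycle
on a given ring, reusing the hypothetical `xen_9pfs_send()` from [Ring
Usage]; `alloc_tag()`, `wait_for_event()` and `wait_for_response()` are
placeholders for client-specific bookkeeping:

    static void do_request(struct xen_9pfs_intf *intf, char *out,
    		uint8_t *req, uint32_t len)
    {
    	struct header *h = (struct header *)req;

    	h->size = len;        /* total message size, header included */
    	h->tag = alloc_tag(); /* hypothetical: pick an unused tag */

    	/* Producer workflow: retry until the request fits in the ring. */
    	while (xen_9pfs_send(intf, out, req, len))
    		wait_for_event(); /* hypothetical: wait for an evtchn kick */

    	/* The response arrives on the "in" array of the same ring and is
    	 * matched to this request by its tag. */
    	wait_for_response(h->tag); /* hypothetical */
    }
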

[paper]: https://www.usenix.org/legacy/event/usenix05/tech/freenix/full_papers/hensbergen/hensbergen.pdf
[website]: https://github.com/chaos/diod/blob/master/protocol.md
[XenbusStateInitialising]: https://xenbits.xen.org/docs/unstable/hypercall/x86_64/include,public,io,xenbus.h.html
[ring.h]: https://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/public/io/ring.h;hb=HEAD