# Argo

## Introduction

Argo is an interdomain communication mechanism. It provides Xen hypervisor
primitives to transmit data between VMs, by performing data copies into
receive memory rings registered by domains. It does not require memory
sharing between VMs and does not use the grant tables or Xenstore.

Argo has requirements for performance isolation between domains, to prevent
negative performance impact from malicious or disruptive activity of other
domains, or even other VCPUs of the same domain operating other rings.

## Hypervisor-Mediated data eXchange (HMX)

This term refers to inter-VM communication protocols that share a key
architectural point: the hypervisor is responsible for performing the write
of data into the guest-accessible memory buffer, according to the agreed
transfer protocol. This structure gives the transport mechanism its strength:
the transmitting side of the communication is the hypervisor, which can be
trusted by the receiver, and the buffer is isolated from access by any other
potential sources outside the receiver.

The receiver can trust that the hypervisor will:

- Provide a protocol implementation adhering to hardware synchronization
requirements for concurrent access to system memory by communicating
components.
- Deliver data only from an approved source, enforcing policy for Mandatory
Access Control.
- Indicate the correct sender of the data.
- Transmit only the intended data, adhering to the access protocol of the data
structure in the buffer. If the memory region is being used as a ring (see the
sketch after this list), then:
    - Data writes will only occur within the ring region that is indicated as
    available for incoming data by the ring indexes.
    - The indicated length of data written will exactly match the length of
    data actually written.
    - The write for each piece of data will occur only once.
    - Data will be written sequentially in the order that it is sent.
- Issue notification of data delivered correctly.
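
To make these ring-access guarantees concrete, here is a minimal sketch in C
of a hypervisor-side copy that honours the ring indexes. All names here
(`struct ring`, `ring_copy_in`, the index fields) are hypothetical
illustrations, not the actual Argo implementation:

```
#include <errno.h>
#include <stdint.h>

/* Hypothetical receive ring: indexes always satisfy 0 <= ptr < len. */
struct ring {
    uint32_t tx_ptr;   /* producer index: next byte the sender writes */
    uint32_t rx_ptr;   /* consumer index: next byte the receiver reads */
    uint32_t len;      /* size of the data area, in bytes */
    uint8_t *data;     /* the guest-registered ring memory */
};

/*
 * Copy nbytes into the ring, writing only within the region the receiver
 * has marked as available via the indexes, exactly once and in order, and
 * never claiming more data than was actually written.
 */
static int ring_copy_in(struct ring *r, const void *buf, uint32_t nbytes)
{
    const uint8_t *src = buf;
    /*
     * Bytes currently occupied, then the free space. One byte is reserved
     * so that tx_ptr == rx_ptr unambiguously means "empty".
     */
    uint32_t used = (r->tx_ptr + r->len - r->rx_ptr) % r->len;
    uint32_t space = r->len - 1 - used;

    if ( nbytes > space )
        return -EAGAIN;    /* never overwrite unconsumed data */

    for ( uint32_t i = 0; i < nbytes; i++ )
    {
        r->data[r->tx_ptr] = src[i];
        r->tx_ptr = (r->tx_ptr + 1) % r->len;   /* wrap at the ring end */
    }
    return 0;
}
```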

This structure allows for augmentation by the hypervisor to identify the
sending entity within the source VM, and then provide the receiver with
assured context information about the data source. This enables the receiver
to make decisions based on fine-grained knowledge of the source of the data.

This structure is also of strong interest for nested virtualization:
transport via the hypervisor can enable construction of efficient
communications between VMs at different levels of nesting.

# Locking

Since Argo operates a data path between domains, sections of this code are
*hot* when the communication paths are in use. To encourage high performance, a
goal is to limit mutual exclusion to only where required and to enable
significant concurrency.

Avoidance of deadlock is essential, and since state that pertains to more than
one domain must frequently be updated, a locking protocol defines which locks
are needed and the order of their acquisition.

## Structure

The granular locking structure of Argo enables:

1. Performance isolation of guests
2. Avoidance of DoS of rings by domains that are not authorized to send to them
3. Deadlock-free teardown of state across multiple domains on domain destroy
4. Performance of guests using Argo with concurrent operation of rings

Argo uses three per-domain locks to protect three separate data structures.
Access to the ring_hash data structure is confined to domains that a
ring-registering domain has authorized to send data via the ring. The complete
set of Argo locks is (a sketch of their placement follows the list):

* Global : `L1_global_argo_rwlock`
* Per-domain: `rings_L2_rwlock`
* Per-domain: `send_L2_lock`
* Per-domain: `wildcard_L2_lock`
* Per-ring: `L3_lock`
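
A simplified sketch of where these locks live relative to the state they
protect (structure and field layout abbreviated for illustration; the actual
definitions in the source contain more members):

```
/* Global singleton: coordinates access to every domain's d->argo pointer. */
static rwlock_t L1_global_argo_rwlock;

/* Per-domain Argo state, reached via d->argo. */
struct argo_domain {
    rwlock_t rings_L2_rwlock;      /* protects ring_hash and its entries  */
    /* ... hashtable of struct argo_ring_info (ring_hash) ...             */

    spinlock_t send_L2_lock;       /* protects the send hashtable         */
    /* ... hashtable of struct argo_send_info (send_hash) ...             */

    spinlock_t wildcard_L2_lock;   /* protects the wildcard pending list  */
    /* ... list of pending wildcard-ring notifications ...                */
};

/* Per-ring state, stored in the domain's ring hashtable. */
struct argo_ring_info {
    spinlock_t L3_lock;            /* protects this ring's members        */
    /* ... ring frames, mappings, length, pending notifications ...       */
};
```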

## Protected State

The data structures being protected by the locks are all per-domain. The only
global Argo state is the `L1_global_argo_rwlock` used to coordinate access to
the data structures of other domains.

### State: Rings registered and owned by a domain

This includes the state to run that ring, such as memory frame numbers and
established mappings. Per-ring state is protected by its own lock, so that
multiple VCPUs of the same domain operating different rings do not inhibit the
performance of each other.

The per-domain ring state also includes the list of pending notifications for
other domains that are waiting for ring space availability.

### State: Partner rings for which this domain is the single allowed sender

This state, belonging to the permitted sender, is written to when a ring is
registered by another domain. The lock that protects this state is subject to
locking at arbitrary frequency by those foreign domains when they register
rings -- they do not need any permission granted by this domain in order to
register a ring to communicate with it -- so it must not inhibit the domain's
own ability to use its own rings, in order to protect them from DoS. For this
reason, this state is protected by its own lock.

### State: Pending notifications for wildcard rings registered by other domains

This data structure is needed when a domain is destroyed, to cancel the
outstanding space availability notifications about the wildcard rings of other
domains that this domain has queried.

Data is entered into this data structure by the domain that owns it, either by
a space-inhibited sendv or a notify operation.

Data is removed from this data structure in one of three cases: when space
becomes available in the destination ring and the notification is sent, when
the ring is torn down, or when the awaiting domain is destroyed.

In the case where a notification is sent, access to the data structure is
triggered by the ring owner domain, rather than the domain waiting for
notification. This data structure is protected by its own lock since doing so
entails less contention than the alternative of reusing an existing lock owned
by the domain.

## Hierarchical Locking Model and Protocol

The locking discipline within the Argo code is hierarchical and utilizes
reader/writer locks to enable increased concurrency when operations do not
conflict. None of the Argo locks are reentrant.

The hierarchy:

* There is a global rwlock (`L1`) to protect access to all of the per-domain
argo data structures.
* There is a rwlock per-domain (`rings_L2`) to protect the hashtable of the
per-ring data structures.
* There is a lock per ring (`L3`) to protect the per-ring data structure,
`struct argo_ring_info`.

There are two other per-domain L2 locks; their operation is similar and they
are described later.

The protocol to safely acquire write access to the per-ring data structure,
`struct argo_ring_info`, is:

1) Acquire a Read lock on L1.
2) Acquire a Read lock on L2.
3) Acquire L3.

An alternative valid sequence is:

1) Acquire a Read lock on L1.
2) Acquire a Write lock on L2.

This second sequence grants write access to _all_ of the `argo_ring_info`
structs belonging to the domain, but at the expense of less concurrency: no
other operation can access those structs while the locks are held, which will
inhibit operations on those rings until the locks are released.

Another alternative valid sequence is:

1) Acquire a Write lock on L1.

This grants write access to _all_ of the `argo_ring_info` structs belonging to
_all domains_, but again at the expense of far less concurrency: no other
operation can operate on Argo rings until the locks are released.
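
Expressed with the Xen rwlock and spinlock primitives, the three valid
sequences look as follows. This is a sketch, assuming the per-domain layout
described in this document; `d` is the ring-owning domain and `ring_info` one
of its rings:

```
/* Sequence 1: write access to a single ring's argo_ring_info. */
read_lock(&L1_global_argo_rwlock);          /* R(L1) */
read_lock(&d->argo->rings_L2_rwlock);       /* R(L2) */
spin_lock(&ring_info->L3_lock);             /* L3    */
/* ... modify this one ring's state ... */
spin_unlock(&ring_info->L3_lock);
read_unlock(&d->argo->rings_L2_rwlock);
read_unlock(&L1_global_argo_rwlock);

/* Sequence 2: write access to all of this domain's rings. */
read_lock(&L1_global_argo_rwlock);          /* R(L1) */
write_lock(&d->argo->rings_L2_rwlock);      /* W(L2) */
/* ... modify any argo_ring_info belonging to d ... */
write_unlock(&d->argo->rings_L2_rwlock);
read_unlock(&L1_global_argo_rwlock);

/* Sequence 3: write access to the Argo state of all domains. */
write_lock(&L1_global_argo_rwlock);         /* W(L1) */
/* ... teardown or other cross-domain modification ... */
write_unlock(&L1_global_argo_rwlock);
```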

## Lock Definitions

The full set of locks that are directly operated upon by the Argo code is
described in the following section.

### The global singleton lock:

* `L1_global_argo_rwlock`

The rationale for having a global lock is to be able to enforce system-wide
exclusion for a critical region and simplify the logic required to avoid
deadlock, for teardown of state across multiple domains when a domain is
destroyed.

The majority of operations take a read-lock on this lock, allowing concurrent
Argo operations by many domains.

The pointer `d->argo` on every domain is protected by this lock. A set of more
granular per-domain locks could be used to do that, but since domain start and
stop is expected to be a far less frequent operation than the other Argo
operations, acquiring a single read lock to enable access to all the argo
structs of all domains simplifies the protocol.

Points of write-locking on this lock:

* `argo_destroy`, where:
    * All of the domain's own rings are destroyed.
        * All of the notifications pending for other domains are cancelled.
    * All of the unicast partner rings owned by other domains for this domain
    to send to are destroyed.
        * All of the notifications pending on those rings are cancelled.
    * All of the notifications pending for this domain on wildcard rings owned
    by other domains are cancelled.
* `argo_soft_reset`, for similar teardown operations as `argo_destroy`.
* `argo_init`, where the `d->argo` pointer is first populated (see the sketch
after this list).
    * Since the write lock is taken here, all concurrent Argo operations are
    serialized around this single pointer write; this is the cost of the
    simpler single-global-lock approach.
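
As an illustration of the `argo_init` case, a sketch of how the pointer write
is serialized under the write lock. This is simplified: `argo_domain_init`,
which would initialize the locks, hashtables and lists, is a hypothetical
helper here, and further error handling is elided:

```
static int argo_init(struct domain *d)
{
    struct argo_domain *argo = xzalloc(struct argo_domain);

    if ( !argo )
        return -ENOMEM;

    argo_domain_init(argo);    /* set up locks, hashtables and lists */

    /*
     * Publishing d->argo is a single pointer write, but taking W(L1)
     * here serializes it against every concurrent Argo operation.
     */
    write_lock(&L1_global_argo_rwlock);
    d->argo = argo;
    write_unlock(&L1_global_argo_rwlock);

    return 0;
}
```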

Enforcing that the write lock is acquired on `L1_global_argo_rwlock` before
executing teardown ensures that no teardown operations act concurrently, and
that no other Argo operations happen concurrently with a teardown. The teardown
logic is then free to safely modify the Argo state across all domains without
having to acquire per-domain locks, and deadlock cannot occur.

### Per-Domain: Ring hash lock

`rings_L2_rwlock`

Protects: the per-domain ring hash table of `argo_ring_info` structs.

Holding a read lock on `rings_L2` protects the ring hash table and the elements
in the hash table `d->argo->ring_hash`, and the `node` and `id` fields in
struct `argo_ring_info` in the hash table.

Holding a write lock on `rings_L2` protects all of the elements of all the
struct `argo_ring_info` belonging to this domain.

To take `rings_L2` you must already have `R(L1)`. `W(L1)` implies `W(rings_L2)`
and `L3`.

Prerequisites:

* `R(L1_global_argo_rwlock)` must be acquired before taking either read or
write on `rings_L2_rwlock`.
* `W(L1_global_argo_rwlock)` implies `W(rings_L2_rwlock)`, so if
`W(L1_global_argo_rwlock)` is held, then `rings_L2_rwlock` does not need to be
acquired, and all the data structures that `rings_L2_rwlock` protects can be
accessed as if `W(rings_L2_rwlock)` were held.

Is accessed by the hypervisor on behalf of:

* The domain that registered the ring.
* Any domain that is allowed to send to the ring -- so that's the partner
domain, for unicast rings, or any domain, for wildcard rings.

### Send hash lock

`send_L2_lock`

Protects: the per-domain send hash table of `argo_send_info` structs.

Is accessed by the hypervisor on behalf of:

* Any domain that registers a ring that specifies the domain as the unicast
sender.
* The domain that has been allowed to send, as part of teardown when the domain
is being destroyed.

### Wildcard pending list lock

`wildcard_L2_lock`

Protects: the per-domain list of pending notifications to the domain from
wildcard rings owned by other domains.

Is accessed by the hypervisor on behalf of:

* The domain that issued a query to another domain about space availability in
one of its wildcard rings -- this can be done by attempting a send operation
when there is insufficient ring space available at the time.
* Any domain that this domain has issued such a query to about space
availability in one of their wildcard rings.

### Per-Ring locks:

* `L3_lock`

This lock protects the members of a `struct argo_ring_info`, which is the
primary state for a domain's own registered ring.

## Reasoning Model

A common model for reasoning about concurrent code focuses on accesses to
individual variables: if code touches this variable, see that it first acquires
the corresponding lock and then drops it afterwards. A challenge with this
model is in ensuring that the sequence of locks acquired within nested
functions, when operating on data from multiple domains with concurrent
operations, is safe from deadlock.

An alternative method that is better suited to the Argo software is to consider
the execution path: the full sequence of locks acquired, accesses performed,
and locks released, from entering an operation through to the completion of the
work.

An example code path for an operation:

`[entry] > -- [ take R(L1) ] -- [ take R(L2) ] -- loop [ take a L3 / drop L3 ]
-- [ drop R(L2) ] -- [ drop R(L1) ] -- > [exit]`
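
That path could be rendered in code as follows. This is a sketch: the
ring-iteration step is only indicated in a comment, but the lock acquisition
and release order matches the diagram above:

```
read_lock(&L1_global_argo_rwlock);            /* take R(L1) */
read_lock(&d->argo->rings_L2_rwlock);         /* take R(L2) */

/* loop: for each ring_info in d->argo->ring_hash (hypothetical iterator) */
{
    spin_lock(&ring_info->L3_lock);           /* take a L3 */
    /* ... operate on this ring ... */
    spin_unlock(&ring_info->L3_lock);         /* drop L3 */
}

read_unlock(&d->argo->rings_L2_rwlock);       /* drop R(L2) */
read_unlock(&L1_global_argo_rwlock);          /* drop R(L1) */
```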

If a function implements a section of the path, it is important to know not
only what variables the function itself operates upon, but also the locking
state that will already have been established at the point when the function is
invoked, since this will affect what data the function can access. For this
reason, comments in the code, or ASSERTs that explicitly check lock state,
communicate what the locking state is expected and intended to be when that
code is invoked. See the macros defined to support this for Argo later in this
document.

## Macros to Validate and Document Lock State

These macros encode the logic to verify that the locking has adhered to the
locking discipline.

e.g. On entry to logic that requires holding at least `R(rings_L2)`, this:

`ASSERT(LOCKING_Read_rings_L2(d));`

checks that the lock state is sufficient, validating that one of the following
must be true when executed:

`R(rings_L2) && R(L1)`
or: `W(rings_L2) && R(L1)`
or: `W(L1)`

The macros are defined thus:

```
#define LOCKING_Write_L1 (rw_is_write_locked(&L1_global_argo_rwlock))
/*
 * While LOCKING_Read_L1 will return true even if the lock is write-locked,
 * that's OK because everywhere that a Read lock is needed with these macros,
 * holding a Write lock there instead is OK too: we're checking that _at least_
 * the specified level of locks are held.
 */
#define LOCKING_Read_L1 (rw_is_locked(&L1_global_argo_rwlock))

#define LOCKING_Write_rings_L2(d) \
    ((LOCKING_Read_L1 && rw_is_write_locked(&(d)->argo->rings_L2_rwlock)) || \
     LOCKING_Write_L1)
/*
 * Skip checking LOCKING_Write_rings_L2(d) within this LOCKING_Read_rings_L2
 * definition because the first clause that is testing R(L1) && R(L2) will also
 * return true if R(L1) && W(L2) is true, because of the way that rw_is_locked
 * behaves. This results in a slightly shorter and faster implementation.
 */
#define LOCKING_Read_rings_L2(d) \
    ((LOCKING_Read_L1 && rw_is_locked(&(d)->argo->rings_L2_rwlock)) || \
     LOCKING_Write_L1)
/*
 * Skip checking LOCKING_Write_L1 within this LOCKING_L3 definition because
 * LOCKING_Write_rings_L2(d) will return true for that condition.
 */
#define LOCKING_L3(d, r) \
    ((LOCKING_Read_L1 && rw_is_locked(&(d)->argo->rings_L2_rwlock) \
      && spin_is_locked(&(r)->L3_lock)) || LOCKING_Write_rings_L2(d))

#define LOCKING_send_L2(d) \
    ((LOCKING_Read_L1 && spin_is_locked(&(d)->argo->send_L2_lock)) || \
     LOCKING_Write_L1)
```

Here is an example of a macro in use:

```
static void
notify_ring(const struct domain *d, struct argo_ring_info *ring_info,
            struct hlist_head *to_notify)
{
    uint32_t space;

    ASSERT(LOCKING_Read_rings_L2(d));

    spin_lock(&ring_info->L3_lock);

    if ( ring_info->len )
        space = ringbuf_payload_space(d, ring_info);
    else
        space = 0;

    spin_unlock(&ring_info->L3_lock);

    if ( space )
        pending_find(d, ring_info, space, to_notify);
}
```

In the above example, it can be seen that it is safe to acquire the `L3` lock
because _at least_ `R(rings_L2)` is already held, as documented and verified by
the macro.

## FAQ / Other Considerations

### Why not have a single per-domain lock?

Due to performance isolation / DoS avoidance: if there were a single per-domain
lock, acquiring this lock would stall operations on other active rings owned by
the domain. A malicious domain can loop registering and unregistering rings,
without any consent by the targeted domain, which would experience decreased
throughput due to the contention on the single per-domain lock. The granular
locking structure of Argo prevents this. It also allows concurrent operation of
different rings by multiple VCPUs of the same domain without contention,
avoiding negative application performance interaction.

## Rationale for Using a Singleton Global Lock: L1

### Teardown on domain destroy

The single global lock enables exclusive access to the argo data structures
across domains when a domain is destroyed. Every unicast ring for which the
dying domain is the authorized sender is torn down, and any pending
space-available notifications in other domains' wildcard rings are cancelled.
This requires gaining safe access to the data structures on each of the
domains involved.

The 'send hashtable' data structure is needed in order to perform the teardown
of rings when a domain is destroyed. To populate it, whenever a unicast ring is
registered, the lock that protects that data structure must be taken
exclusively.

There are granular per-domain locks which protect the per-domain data
structures. The global singleton L1 lock operates above the per-domain locks
and is used to obtain exclusive access to multiple domains' argo data
structures in the infrequent case where it is used -- for domain destroy --
whilst otherwise allowing concurrent access, via acquiring it with 'read'
access, for the majority of the time.

To perform the required state teardown on domain destruction, which can require
removing state from the data structures of multiple domains, a locking protocol
to obtain mutual exclusion and safe access to the state is required, without
deadlocking.

Using the single global lock avoids the need for sequencing the acquisition of
multiple individual per-domain locks (and lower level data structure locks) to
prevent deadlock: taking W(L1) grants access to all, and taking R(L1) ensures
that teardown of any domain will not interfere with any Argo hypercall
operation. It enables introducing granular locking without complex or
error-prone lock acquisition logic.

# Future Work

- Performance measurement and optimization
- Provide assurance of connection source context to destination
- Policy controls for reducing the duration of hypervisor mappings of
transmission rings, to improve resistance to data read attacks on
hypervisor memory