# Argo

## Introduction

Argo is an interdomain communication mechanism. It provides Xen hypervisor
primitives to transmit data between VMs, by performing data copies into
receive memory rings registered by domains. It does not require memory
sharing between VMs and does not use the grant tables or Xenstore.

Argo has requirements for performance isolation between domains, to prevent
negative performance impact from malicious or disruptive activity of other
domains, or even other VCPUs of the same domain operating other rings.

## Hypervisor-Mediated data eXchange (HMX)

This term refers to inter-VM communication protocols with this key
architectural point: the hypervisor is responsible for performing the
write of data into the guest-accessible memory buffer, in the manner
prescribed by the agreed transfer protocol. This gives the transport
mechanism its strength: the transmitting side of the communication is the
hypervisor, which can be trusted by the receiver, and the buffer is isolated
from access by any other potential sources outside the receiver.

The receiver can trust that the hypervisor will:

- Provide a protocol implementation adhering to hardware synchronization
requirements for concurrent access to system memory by communicating
components.
- Deliver data only from an approved source, enforcing policy for Mandatory
Access Control.
- Indicate the correct sender of the data.
- Transmit only the intended data, adhering to the access protocol of the data
structure in the buffer. If the memory region is being used as a ring, then:
    - Data writes will only occur within the ring region that is indicated as
      available for incoming data by the ring indexes.
    - The indicated length of data written will exactly match the length of
      data actually written.
    - The write for each piece of data will occur only once.
    - Data will be written sequentially in the order that it is sent.
- Issue notification of data delivered correctly.
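To make these ring guarantees concrete, the following is a minimal sketch of a
hypervisor-side copy into a circular receive ring. The `demo_ring` structure
and `demo_ring_copy` function are hypothetical simplifications invented for
this document; they are not the Argo ring format or the Argo implementation.

```
/*
 * Hypothetical simplified ring: the receiver consumes at rx_ptr and only
 * ever advances rx_ptr; the hypervisor produces at tx_ptr and is the only
 * writer of ring[] and tx_ptr. len is a power of two.
 */
struct demo_ring {
    uint32_t rx_ptr;
    uint32_t tx_ptr;
    uint32_t len;
    uint8_t ring[];
};

static int demo_ring_copy(struct demo_ring *r, const uint8_t *buf,
                          uint32_t msg_len)
{
    uint32_t tx = r->tx_ptr;
    uint32_t space = (r->rx_ptr - tx - 1) & (r->len - 1);
    uint32_t i;

    /* Write only within the region the ring indexes mark as available. */
    if ( msg_len > space )
        return -1;  /* sender must await a space-available notification */

    /* Each byte is written exactly once, sequentially, in send order. */
    for ( i = 0; i < msg_len; i++ )
        r->ring[(tx + i) & (r->len - 1)] = buf[i];

    /* Make the data visible before publishing the new producer index. */
    smp_wmb();
    r->tx_ptr = (tx + msg_len) & (r->len - 1);

    return 0;
}
```

The essential property is that only the hypervisor ever writes into the
buffer, so the receiver's trust in the transfer reduces to trust in the
hypervisor.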
This structure allows for augmentation by the hypervisor to identify the
sending entity within the source VM, and then provide the receiver with
assured context information about the data source. This enables the receiver
to make decisions based on fine-grained knowledge of the source of the data.

This structure is also of strong interest for nested virtualization:
transport via the hypervisor can enable construction of efficient
communications between VMs at different levels of nesting.

# Locking

Since Argo operates a data path between domains, sections of this code are
*hot* when the communication paths are in use. To encourage high performance, a
goal is to limit mutual exclusion to only where required and enable significant
concurrency.

Avoidance of deadlock is essential, and since state that pertains to more than
one domain must frequently be updated, a locking protocol defines which locks
are needed and the order of their acquisition.

## Structure

The granular locking structure of Argo enables:

1. Performance isolation of guests.
2. Avoidance of DoS of rings by domains that are not authorized to send to
them.
3. Deadlock-free teardown of state across multiple domains on domain destroy.
4. Performance of guests using Argo with concurrent operation of rings.

Argo uses three per-domain locks to protect three separate data structures.
Access to the ring_hash data structure is confined to domains that a
ring-registering domain has authorized to send data via the ring. The complete
set of Argo locks is:

* Global: `L1_global_argo_rwlock`
* Per-domain: `rings_L2_rwlock`
* Per-domain: `send_L2_lock`
* Per-domain: `wildcard_L2_lock`
* Per-ring: `L3_lock`

## Protected State

The data structures being protected by the locks are all per-domain. The only
global Argo state is the `L1_global_argo_rwlock`, used to coordinate access to
the data structures of other domains.

### State: Rings registered and owned by a domain

This includes the state to run that ring, such as memory frame numbers and
established mappings. Per-ring state is protected by its own lock, so that
multiple VCPUs of the same domain operating different rings do not inhibit the
performance of each other.

The per-domain ring state also includes the list of pending notifications for
other domains that are waiting for ring space availability.

### State: Partner rings for which this domain is the single allowed sender

This state, belonging to the permitted sender, is written to when a ring is
registered by another domain. The lock that protects this state is subject to
locking at arbitrary frequency by those foreign domains when registering rings
-- which do not need any permission granted by this domain in order to register
a ring to communicate with it -- so to protect the domain's own rings from DoS,
it must not inhibit the domain's ability to use them. For this reason, this
state is protected by its own lock.

### State: Pending notifications for wildcard rings registered by other domains

This data structure is needed when a domain is destroyed, to cancel the
outstanding space availability notifications about the wildcard rings of other
domains that this domain has queried.

Data is entered into this data structure by the domain that owns it, either by
a space-inhibited sendv or a notify operation.

Data is removed from this data structure in one of three cases: when space
becomes available in the destination ring and the notification is sent, when
the ring is torn down, or when the awaiting domain is destroyed.

In the case where a notification is sent, access to the data structure is
triggered by the ring owner domain, rather than the domain waiting for
notification. This data structure is protected by its own lock because that
entails less contention than the alternative of reusing an existing lock owned
by the domain.

## Hierarchical Locking Model and Protocol

The locking discipline within the Argo code is hierarchical and utilizes
reader/writer locks to enable increased concurrency when operations do not
conflict. None of the Argo locks are reentrant.

The hierarchy:

* There is a global rwlock (`L1`) to protect access to all of the per-domain
Argo data structures.
* There is a rwlock per domain (`rings_L2`) to protect the hashtable of the
per-ring data structures.
* There is a lock per ring (`L3`) to protect the per-ring data structure,
`struct argo_ring_info`.

There are two other per-domain L2 locks; their operation is similar and they
are described later.

The protocol to safely acquire write access to the per-ring data structure,
`struct argo_ring_info`, is:

1) Acquire a Read lock on L1.
2) Acquire a Read lock on L2.
3) Acquire L3.
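Expressed with the hypervisor's locking primitives, the sequence is sketched
below, where `d` and `ring_info` denote the domain and the ring's
`argo_ring_info`, assumed already located in the ring hash table:

```
read_lock(&L1_global_argo_rwlock);        /* 1) R(L1) */

if ( d->argo )                            /* d->argo is stable under L1 */
{
    read_lock(&d->argo->rings_L2_rwlock); /* 2) R(rings_L2) */

    spin_lock(&ring_info->L3_lock);       /* 3) L3 */

    /* ... safe to modify the members of *ring_info here ... */

    spin_unlock(&ring_info->L3_lock);
    read_unlock(&d->argo->rings_L2_rwlock);
}

read_unlock(&L1_global_argo_rwlock);
```

Locks are released in the reverse order of their acquisition.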
An alternative valid sequence is:

1) Acquire a Read lock on L1.
2) Acquire a Write lock on L2.

This second sequence grants write access to _all_ of the `argo_ring_info`
structs belonging to the domain, but at the expense of less concurrency: no
other operation can access those structs while the locks are held, which will
inhibit operations on those rings until the locks are released.

Another alternative valid sequence is:

1) Acquire a Write lock on L1.

This grants write access to _all_ of the `argo_ring_info` structs belonging to
_all domains_, but again at the expense of far less concurrency: no other
operation can operate on Argo rings until the locks are released.

## Lock Definitions

The full set of locks that are directly operated upon by the Argo code is
described in the following section.

### The global singleton lock:

* `L1_global_argo_rwlock`

The rationale for having a global lock is to be able to enforce system-wide
exclusion for a critical region and simplify the logic required to avoid
deadlock, for teardown of state across multiple domains when a domain is
destroyed.

The majority of operations take a read lock on this lock, allowing concurrent
Argo operations by many domains.

The pointer `d->argo` on every domain is protected by this lock. A set of more
granular per-domain locks could be used to do that, but since domain start and
stop is expected to be a far less frequent operation than the other Argo
operations, acquiring a single read lock to enable access to all the Argo
structs of all domains simplifies the protocol.

Points of write-locking on this lock:

* `argo_destroy`, where:
    * All of the domain's own rings are destroyed.
    * All of the notifications pending for other domains are cancelled.
    * All of the unicast partner rings owned by other domains for this domain
      to send to are destroyed.
    * All of the notifications pending on those rings are cancelled.
    * All of the notifications pending for this domain on wildcard rings owned
      by other domains are cancelled.
* `argo_soft_reset`, for similar teardown operations as `argo_destroy`.
* `argo_init`, where the `d->argo` pointer is first populated.
    * Since the write lock is taken here, all concurrent Argo operations are
      serialized around this single pointer write; this is the cost of the
      simpler single-global-lock approach.

Enforcing that the write lock is acquired on `L1_global_argo_rwlock` before
executing teardown ensures that no teardown operations act concurrently and no
other Argo operations happen concurrently with a teardown. The teardown logic
is free to safely modify the Argo state across all domains without having to
acquire per-domain locks, and deadlock cannot occur.
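A condensed sketch of the resulting teardown pattern, where
`teardown_all_argo_state` is a hypothetical stand-in for the cleanup steps
listed above:

```
static void example_argo_destroy(struct domain *d)
{
    write_lock(&L1_global_argo_rwlock);

    /*
     * W(L1) excludes every other Argo operation system-wide, so the Argo
     * state of this domain -- and of any partner domain -- may be modified
     * here without taking any L2 or L3 lock and without risk of deadlock.
     */
    if ( d->argo )
        teardown_all_argo_state(d);  /* hypothetical helper */

    write_unlock(&L1_global_argo_rwlock);
}
```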
### Per-Domain: Ring hash lock

`rings_L2_rwlock`

Protects: the per-domain ring hash table of `argo_ring_info` structs.

Holding a read lock on `rings_L2` protects the ring hash table and the elements
in the hash table `d->argo->ring_hash`, and the `node` and `id` fields in
struct `argo_ring_info` in the hash table.

Holding a write lock on `rings_L2` protects all of the elements of all the
struct `argo_ring_info` belonging to this domain.

Prerequisites:

* `R(L1_global_argo_rwlock)` must be acquired before taking either read or
write on `rings_L2_rwlock`.
* `W(L1_global_argo_rwlock)` implies `W(rings_L2_rwlock)` (and, transitively,
`L3`), so if `W(L1_global_argo_rwlock)` is held, then `rings_L2_rwlock` does
not need to be acquired, and all the data structures that `rings_L2_rwlock`
protects can be accessed as if `W(rings_L2_rwlock)` were held.

Is accessed by the hypervisor on behalf of:

* The domain that registered the ring.
* Any domain that is allowed to send to the ring: the partner domain for
unicast rings, or any domain for wildcard rings.

### Send hash lock

`send_L2_lock`

Protects: the per-domain send hash table of `argo_send_info` structs.

Is accessed by the hypervisor on behalf of:

* Any domain that registers a ring that specifies the domain as the unicast
sender.
* The domain that has been allowed to send, as part of teardown when that
domain is being destroyed.

### Wildcard pending list lock

`wildcard_L2_lock`

Protects: the per-domain list of pending notifications to the domain from
wildcard rings owned by other domains.

Is accessed by the hypervisor on behalf of:

* The domain that issued a query to another domain about space availability in
one of its wildcard rings -- this can be done by attempting a send operation
when there is insufficient ring space available at the time.
* Any domain that the domain has issued a query to about space availability in
one of their wildcard rings.

### Per-Ring locks:

* `L3_lock`

This lock protects the members of a `struct argo_ring_info`, which is the
primary state for a domain's own registered ring.

## Reasoning Model

A common model for reasoning about concurrent code focusses on accesses to
individual variables: if code touches this variable, see that it first acquires
the corresponding lock and then drops it afterwards. A challenge with this
model is in ensuring that the sequence of locks acquired within nested
functions, when operating on data from multiple domains with concurrent
operations, is safe from deadlock.

An alternative method that is better suited to the Argo software is to consider
the execution path: the full sequence of locks acquired, accesses performed,
and locks released, from entering an operation to the completion of the work.

An example code path for an operation:

`[entry] > -- [ take R(L1) ] -- [ take R(L2) ] -- loop [ take a L3 / drop L3 ]
-- [ drop R(L2) ] -- [ drop R(L1) ] -- > [exit]`

If a function implements a section of the path, it is important to know not
only what variables the function itself operates upon, but also the locking
state that will already have been established at the point when the function is
invoked, since this will affect what data the function can access. For this
reason, comments in the code, or ASSERTs that explicitly check lock state,
communicate what the locking state is expected and intended to be when that
code is invoked. See the macros defined to support this for Argo later in this
document.
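As an illustration, a function performing per-ring work across all of a
domain's rings could implement the example path above as sketched below.
`ARGO_HASHTABLE_SIZE` and `do_per_ring_work` are assumed names for this
document, and the hash chains are traversed manually using the `hlist` types
seen elsewhere in this document.

```
static void for_each_ring_do(struct domain *d)
{
    unsigned int i;

    read_lock(&L1_global_argo_rwlock);            /* take R(L1) */

    if ( d->argo )
    {
        read_lock(&d->argo->rings_L2_rwlock);     /* take R(L2) */

        for ( i = 0; i < ARGO_HASHTABLE_SIZE; i++ )
        {
            struct hlist_node *node;

            for ( node = d->argo->ring_hash[i].first; node;
                  node = node->next )
            {
                struct argo_ring_info *ring_info =
                    hlist_entry(node, struct argo_ring_info, node);

                spin_lock(&ring_info->L3_lock);   /* take a L3 */
                do_per_ring_work(d, ring_info);
                spin_unlock(&ring_info->L3_lock); /* drop L3 */
            }
        }

        read_unlock(&d->argo->rings_L2_rwlock);   /* drop R(L2) */
    }

    read_unlock(&L1_global_argo_rwlock);          /* drop R(L1) */
}
```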
## Macros to Validate and Document Lock State

These macros encode the logic to verify that the locking has adhered to the
locking discipline.

For example, on entry to logic that requires holding at least `R(rings_L2)`,
this:

`ASSERT(LOCKING_Read_rings_L2(d));`

checks that the lock state is sufficient, validating that one of the following
must be true when executed:

`R(rings_L2) && R(L1)`
or: `W(rings_L2) && R(L1)`
or: `W(L1)`

The macros are defined thus:

```
#define LOCKING_Write_L1 (rw_is_write_locked(&L1_global_argo_rwlock))
/*
 * While LOCKING_Read_L1 will return true even if the lock is write-locked,
 * that's OK because everywhere that a Read lock is needed with these macros,
 * holding a Write lock there instead is OK too: we're checking that _at least_
 * the specified level of locks are held.
 */
#define LOCKING_Read_L1 (rw_is_locked(&L1_global_argo_rwlock))

#define LOCKING_Write_rings_L2(d) \
    ((LOCKING_Read_L1 && rw_is_write_locked(&(d)->argo->rings_L2_rwlock)) || \
     LOCKING_Write_L1)
/*
 * Skip checking LOCKING_Write_rings_L2(d) within this LOCKING_Read_rings_L2
 * definition because the first clause that is testing R(L1) && R(L2) will also
 * return true if R(L1) && W(L2) is true, because of the way that rw_is_locked
 * behaves. This results in a slightly shorter and faster implementation.
 */
#define LOCKING_Read_rings_L2(d) \
    ((LOCKING_Read_L1 && rw_is_locked(&(d)->argo->rings_L2_rwlock)) || \
     LOCKING_Write_L1)
/*
 * Skip checking LOCKING_Write_L1 within this LOCKING_L3 definition because
 * LOCKING_Write_rings_L2(d) will return true for that condition.
 */
#define LOCKING_L3(d, r) \
    ((LOCKING_Read_L1 && rw_is_locked(&(d)->argo->rings_L2_rwlock) \
      && spin_is_locked(&(r)->L3_lock)) || LOCKING_Write_rings_L2(d))

#define LOCKING_send_L2(d) \
    ((LOCKING_Read_L1 && spin_is_locked(&(d)->argo->send_L2_lock)) || \
     LOCKING_Write_L1)
```

Here is an example of a macro in use:

```
static void
notify_ring(const struct domain *d, struct argo_ring_info *ring_info,
            struct hlist_head *to_notify)
{
    uint32_t space;

    ASSERT(LOCKING_Read_rings_L2(d));

    spin_lock(&ring_info->L3_lock);

    if ( ring_info->len )
        space = ringbuf_payload_space(d, ring_info);
    else
        space = 0;

    spin_unlock(&ring_info->L3_lock);

    if ( space )
        pending_find(d, ring_info, space, to_notify);
}
```

In the above example, it can be seen that it is safe to acquire the `L3` lock
because _at least_ `R(rings_L2)` is already held, as documented and verified by
the macro.

## FAQ / Other Considerations

### Why not have a single per-domain lock?

Due to performance isolation / DoS avoidance: with a single per-domain lock,
acquiring it would stall operations on other active rings owned by the domain.
A malicious domain can loop registering and unregistering rings, without any
consent by the targeted domain, which would experience decreased throughput
due to the contention on the single per-domain lock. The granular locking
structure of Argo prevents this. It also allows concurrent operation of
different rings by multiple VCPUs of the same domain without contention, to
avoid negative application performance interaction.
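Bringing the above together, the per-domain state and its granular locks can
be pictured as a structure of roughly the following shape. This is a sketch
for orientation only: apart from the lock names used throughout this document,
the field names and `ARGO_HASHTABLE_SIZE` are illustrative.

```
/* Sketch: per-domain Argo state, reached via d->argo while holding R(L1). */
struct argo_domain
{
    /* Rings registered and owned by this domain: argo_ring_info entries. */
    rwlock_t rings_L2_rwlock;
    struct hlist_head ring_hash[ARGO_HASHTABLE_SIZE];

    /*
     * Unicast rings owned by other domains that name this domain as the
     * single allowed sender: argo_send_info entries.
     */
    spinlock_t send_L2_lock;
    struct hlist_head send_hash[ARGO_HASHTABLE_SIZE];

    /*
     * Pending space-available notifications that this domain awaits on
     * wildcard rings owned by other domains.
     */
    spinlock_t wildcard_L2_lock;
    struct hlist_head wildcard_pend_list;
};
```

Each lock sits alongside the single structure it protects, so contention on
one -- for example, foreign domains taking `send_L2_lock` while registering
rings -- cannot stall operations that use the others.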
## Rationale for Using a Singleton Global Lock: L1

### Teardown on domain destroy

The single global lock enables exclusive access to the Argo data structures
across domains when a domain is destroyed. Every unicast ring for which the
dying domain is the authorized sender is torn down, and any pending
space-available notifications in other domains' wildcard rings are cancelled.
This requires gaining safe access to the data structures on each of the
domains involved.

The 'send hashtable' data structure is needed in order to perform the teardown
of rings when a domain is destroyed. To populate it, whenever a unicast ring is
registered, the lock that protects that data structure must be taken
exclusively.

There are granular per-domain locks which protect the per-domain data
structures. The global singleton L1 lock operates with-and-above the per-domain
locks and is used to obtain exclusive access to multiple domains' Argo data
structures in the infrequent case where it is needed -- for domain destroy --
whilst otherwise allowing concurrent access, via acquiring it with 'read'
access, for the majority of the time.

To perform the required state teardown on domain destruction, which can require
removing state from the data structures of multiple domains, a locking protocol
to obtain mutual exclusion and safe access to the state is required, without
deadlocking.

Using the single global lock avoids the need for sequencing the acquisition of
multiple individual per-domain locks (and lower-level data structure locks) to
prevent deadlock: taking W(L1) grants access to all, and taking R(L1) ensures
that teardown of any domain will not interfere with any Argo hypercall
operation. It enables introducing granular locking without complex or
error-prone lock acquisition logic.

# Future Work

- Performance measurement and optimization
- Provide assurance of connection source context to destination
- Policy controls for reducing the duration of hypervisor mappings of
transmission rings, to improve resistance to data read attacks on
hypervisor memory