Reliability, Availability, and Serviceability (RAS) Extensions
==============================================================

This document describes |TF-A| support for Arm Reliability, Availability, and
Serviceability (RAS) extensions. RAS is a mandatory extension for Armv8.2 and
later CPUs, and also an optional extension to the base Armv8.0 architecture.

In conjunction with the |EHF|, support for the RAS extension enables a
firmware-first paradigm for handling platform errors: exceptions resulting from
errors are routed to and handled in EL3. Said errors are Synchronous External
Abort (SEA), Asynchronous External Abort (signalled as SErrors), Fault Handling
and Error Recovery interrupts. The |EHF| document mentions various :ref:`error
handling use-cases <delegation-use-cases>`.

For the description of Arm RAS extensions, Standard Error Records, and the
precise definition of RAS terminology, please refer to the Arm Architecture
Reference Manual. The rest of this document assumes familiarity with the
architecture and terminology.

Overview
--------

As mentioned above, the RAS support in |TF-A| enables routing to and handling of
exceptions resulting from platform errors in EL3. It allows the platform to
define an External Abort handler, and to register RAS nodes and interrupts. The
RAS framework also provides `helpers`__ for accessing Standard Error Records as
introduced by the RAS extensions.

.. __: `Standard Error Record helpers`_

The build option ``RAS_EXTENSION``, when set to ``1``, includes the RAS
framework in the run time firmware; ``EL3_EXCEPTION_HANDLING`` and
``HANDLE_EA_EL3_FIRST`` must also be set to ``1``.
``RAS_TRAP_LOWER_EL_ERR_ACCESS`` controls the access to the RAS error record
registers from lower ELs.

.. _ras-figure:

.. image:: ../resources/diagrams/draw.io/ras.svg

See more on `Engaging the RAS framework`_.
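
To make the flow outlined in this overview more concrete, the following is a
minimal, self-contained sketch of the probe-then-handle pattern. It does not use
the |TF-A| headers: every ``demo_`` name is hypothetical and exists only for
illustration, the handler signature is simplified, and stopping at the first
record whose probe reports an error is an assumption made here rather than a
statement about the framework's behaviour.

.. code:: c

    #include <stddef.h>

    /* Hypothetical, simplified stand-in for an error record descriptor. */
    struct demo_record;

    typedef int (*demo_probe_t)(const struct demo_record *rec, int *probe_data);
    typedef int (*demo_handler_t)(const struct demo_record *rec, int probe_data);

    struct demo_record {
        demo_probe_t probe;     /* returns non-zero when an error is pending */
        demo_handler_t handler; /* invoked only after a positive probe */
        const int *status;      /* illustrative "error pending" flags */
        int num_idx;            /* number of flags covered by this record */
    };

    /* Records the index passed to the most recent handler call. */
    static int demo_last_handled = -1;

    /* Probe: report the first index whose status flag is set. */
    static int demo_probe(const struct demo_record *rec, int *probe_data)
    {
        for (int i = 0; i < rec->num_idx; i++) {
            if (rec->status[i] != 0) {
                *probe_data = i;
                return 1;
            }
        }
        return 0;
    }

    /* Handler: a real one would log, recover, or shut down. */
    static int demo_handler(const struct demo_record *rec, int probe_data)
    {
        (void)rec;
        demo_last_handled = probe_data;
        return 0;
    }

    /*
     * Walk the records in order and handle the first one whose probe
     * reports an error. Returns 1 if an error was handled, 0 otherwise.
     */
    static int demo_dispatch(const struct demo_record *recs, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            int probe_data = 0;

            if (recs[i].probe(&recs[i], &probe_data) != 0) {
                (void)recs[i].handler(&recs[i], probe_data);
                return 1;
            }
        }
        return 0;
    }

In this sketch the value written to ``probe_data`` by the probe is forwarded
unchanged to the handler, mirroring the role that parameter plays in the probe
and handler prototypes described later in this document.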

Platform APIs
-------------

The RAS framework allows the platform to define handlers for External Abort,
Uncontainable Errors, Double Fault, and errors arising from EL3 execution.
Please refer to :ref:`RAS Porting Guide <External Abort handling and RAS
Support>`.

Registering RAS error records
-----------------------------

RAS nodes are components in the system capable of signalling errors to PEs
through one of the notification mechanisms: SEAs, SErrors, or interrupts. RAS
nodes contain one or more error records, which are registers through which the
nodes advertise various properties of the signalled error. Arm recommends that
error records are implemented in the Standard Error Record format. The RAS
architecture allows for error records to be accessible via system or
memory-mapped registers.

The platform should enumerate the error records, providing for each of them:

- A handler to probe error records for errors;
- When the probing identifies an error, a handler to handle it;
- For a memory-mapped error record, its base address and size in KB; for a
  system register-accessed record, the start index of the record and the number
  of contiguous records from that index;
- Any node-specific auxiliary data.

With this information supplied, when the run time firmware is notified through
one of these mechanisms, the RAS framework can iterate through and probe error
records for errors, and invoke the appropriate handler to handle them.

The RAS framework provides macros to populate error record information. The
macros are versioned, and the latest version as of this writing is 1. These
macros create a structure of type ``struct err_record_info`` from their
arguments; the structure is later passed to the probe and error handlers.

For memory-mapped error records:

.. code:: c

    ERR_RECORD_MEMMAP_V1(base_addr, size_num_k, probe, handler, aux)

And, for system register ones:

.. code:: c

    ERR_RECORD_SYSREG_V1(idx_start, num_idx, probe, handler, aux)

The probe handler must have the following prototype:

.. code:: c

    typedef int (*err_record_probe_t)(const struct err_record_info *info,
                    int *probe_data);

The probe handler must return a non-zero value if an error was detected, or 0
otherwise. The ``probe_data`` output parameter can be used to pass any useful
information resulting from the probe to the error handler (see `below`__). For
example, it could return the index of the record.

.. __: `Standard Error Record helpers`_

The error handler must have the following prototype:

.. code:: c

    typedef int (*err_record_handler_t)(const struct err_record_info *info,
                    int probe_data, const struct err_handler_data *const data);

The ``data`` constant parameter describes the various properties of the error,
including the reason for the error, exception syndrome, and also ``flags``,
``cookie``, and ``handle`` parameters from the :ref:`top-level exception handler
<EL3 interrupts>`.

The platform is expected to populate an array using the macros above, and
register it with the RAS framework using the macro
``REGISTER_ERR_RECORD_INFO()``, passing it the name of the array describing the
records. Note that the macro must be used in the same file where the array is
defined.

Standard Error Record helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The |TF-A| RAS framework provides probe handlers for Standard Error Records, for
both memory-mapped and System Register accesses:

.. code:: c

    int ras_err_ser_probe_memmap(const struct err_record_info *info,
                    int *probe_data);

    int ras_err_ser_probe_sysreg(const struct err_record_info *info,
                    int *probe_data);

When the platform enumerates error records, for those records in the Standard
Error Record format, these helpers may be used instead of rolling out their own.
Both helpers above:

- Return a non-zero value when an error is detected in a Standard Error Record;
- Set ``probe_data`` to the index of the error record upon detecting an error.

Registering RAS interrupts
--------------------------

RAS nodes can signal errors to the PE by raising Fault Handling and/or Error
Recovery interrupts. For the firmware-first handling paradigm for interrupts to
work, the platform must set up and register with |EHF|. See `Interaction with
Exception Handling Framework`_.

For each RAS interrupt, the platform has to provide a structure of type
``struct ras_interrupt``:

- Interrupt number;
- The associated error record information (pointer to the corresponding
  ``struct err_record_info``);
- Optionally, a cookie.

The platform is expected to define an array of ``struct ras_interrupt``, and
register it with the RAS framework using the macro
``REGISTER_RAS_INTERRUPTS()``, passing it the name of the array. Note that the
macro must be used in the same file where the array is defined.

The array of ``struct ras_interrupt`` must be sorted in increasing order of
interrupt number. This allows for fast lookup of handlers in order to service
RAS interrupts.

Double-fault handling
---------------------

A Double Fault condition arises when an error is signalled to the PE while
handling of a previously signalled error is still underway. When a Double Fault
condition arises, the Arm RAS extensions only require the handler to perform an
orderly shutdown of the system, as recovery may be impossible.

The RAS extensions part of Armv8.4 introduced new architectural features to deal
with Double Fault conditions, specifically, the introduction of ``NMEA`` and
``EASE`` bits to the ``SCR_EL3`` register.
These were introduced to assist EL3 software which runs part of its entry/exit
routines with exceptions momentarily masked. In such systems, External
Aborts/SErrors are not handled immediately when they occur, but only after the
exceptions are unmasked again.

|TF-A|, for legacy reasons, executes the entirety of EL3 with all exceptions
unmasked. This means that all exceptions routed to EL3 are handled immediately.
|TF-A| is thus able to detect Double Fault conditions in software, without
needing the intended advantages of the Armv8.4 Double Fault architecture
extensions.

Double faults are fatal: they terminate at the platform double fault handler,
which doesn't return.

Engaging the RAS framework
--------------------------

Enabling RAS support is a platform choice constructed from three distinct, but
related, build options:

- ``RAS_EXTENSION=1`` includes the RAS framework in the run time firmware;

- ``EL3_EXCEPTION_HANDLING=1`` enables handling of exceptions at EL3. See
  `Interaction with Exception Handling Framework`_;

- ``HANDLE_EA_EL3_FIRST=1`` enables routing of External Aborts and SErrors to
  EL3.

The RAS support in |TF-A| introduces a default implementation of
``plat_ea_handler``, the External Abort handler in EL3. When ``RAS_EXTENSION``
is set to ``1``, it first calls the ``ras_ea_handler()`` function, which is the
top-level RAS exception handler. ``ras_ea_handler`` is responsible for iterating
through platform-supplied error records, probing them, and when an error is
identified, looking up and invoking the corresponding error handler.

Note that, if the platform chooses to override the ``plat_ea_handler`` function
and intends to use the RAS framework, it must explicitly call
``ras_ea_handler()`` from within.

Similarly, for RAS interrupts, the framework defines
``ras_interrupt_handler()``.
The RAS framework arranges for it to be invoked when a RAS interrupt is taken at
EL3. The function bisects the platform-supplied sorted array of interrupts to
look up the error record information associated with the interrupt number. The
error handler for that record is then invoked to handle the error.

Interaction with Exception Handling Framework
---------------------------------------------

As mentioned in earlier sections, the RAS framework interacts with the |EHF| to
arbitrate handling of RAS exceptions with others that are routed to EL3. This
means that the platform must partition a :ref:`priority level <Partitioning
priority levels>` for handling RAS exceptions. The platform must then define
the macro ``PLAT_RAS_PRI`` to the priority level used for RAS exceptions.
Platforms would typically want to allocate the highest secure priority for
RAS handling.

Handling of both :ref:`interrupt <interrupt-flow>` and :ref:`non-interrupt
<non-interrupt-flow>` exceptions follows the sequences outlined in the |EHF|
documentation. That is, for interrupts, the priority management is implicit; for
non-interrupt exceptions, it is explicit, using :ref:`EHF APIs
<Activating and Deactivating priorities>`.

--------------

*Copyright (c) 2018-2019, Arm Limited and Contributors. All rights reserved.*