<HTML>

<HEAD>
<TITLE>Berkeley SoftFloat Library Interface</TITLE>
</HEAD>

<BODY>

<H1>Berkeley SoftFloat Release 3a: Library Interface</H1>

<P>
John R. Hauser<BR>
2015 October 23<BR>
</P>


<H2>Contents</H2>

<BLOCKQUOTE>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<COL WIDTH=25>
<COL WIDTH=*>
<TR><TD COLSPAN=2>1. Introduction</TD></TR>
<TR><TD COLSPAN=2>2. Limitations</TD></TR>
<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
<TR>
    <TD></TD>
    <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
</TR>
<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
<TR>
    <TD></TD>
    <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
</TR>
<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
<TR><TD COLSPAN=2>8. Function Details</TD></TR>
<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
<TR><TD COLSPAN=2>9.
Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
</TABLE>
</BLOCKQUOTE>


<H2>1. Introduction</H2>

<P>
Berkeley SoftFloat is a software implementation of binary floating-point that
conforms to the IEEE Standard for Floating-Point Arithmetic.
The current release supports four binary formats: <NOBR>32-bit</NOBR>
single-precision, <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision.
The following functions are supported for each format:
<UL>
<LI>
addition, subtraction, multiplication, division, and square root;
<LI>
fused multiply-add as defined by the IEEE Standard, except for
<NOBR>80-bit</NOBR> double-extended-precision;
<LI>
remainder as defined by the IEEE Standard;
<LI>
round to integral value;
<LI>
comparisons;
<LI>
conversions to/from other supported formats; and
<LI>
conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
signed and unsigned.
</UL>
All operations required by the original 1985 version of the IEEE Floating-Point
Standard are implemented, except for conversions to and from decimal.
</P>

<P>
This document gives information about the types defined and the routines
implemented by SoftFloat.
It does not attempt to define or explain the IEEE Floating-Point Standard.
Information about the standard is available elsewhere.
</P>

<P>
The current version of SoftFloat is <NOBR>Release 3a</NOBR>.
The only difference between this version and the previous
<NOBR>Release 3</NOBR> is the replacement of the license text supplied by the
University of California.
</P>

<P>
The functional interface of SoftFloat <NOBR>Release 3</NOBR> and afterward
differs in many details from that of earlier releases.
For specifics of these differences, see <NOBR>section 9</NOBR> below,
<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
</P>


<H2>2. Limitations</H2>

<P>
SoftFloat assumes the computer has an addressable byte size of 8 or
<NOBR>16 bits</NOBR>.
(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
</P>

<P>
SoftFloat is written in C and is designed to work with other C code.
The C compiler used must conform at a minimum to the 1989 ANSI standard for the
C language (same as the 1990 ISO standard) and must in addition support basic
arithmetic on <NOBR>64-bit</NOBR> integers.
Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
starting with <NOBR>Release 3</NOBR>.
Since 1999, ISO standards for C have mandated compiler support for
<NOBR>64-bit</NOBR> integers.
A compiler conforming to the 1999 C Standard or later is recommended but not
strictly required.
</P>

<P>
Most operations not required by the original 1985 version of the IEEE
Floating-Point Standard but added in the 2008 version are not yet supported in
SoftFloat <NOBR>Release 3a</NOBR>.
</P>


<H2>3. Acknowledgments and License</H2>

<P>
The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
<NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
supplanting earlier releases.
The project to create <NOBR>Release 3</NOBR> (and <NOBR>now 3a</NOBR>) was done
in the employ of the University of California, Berkeley, within the Department
of Electrical Engineering and Computer Sciences, first for the Parallel
Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
The work was officially overseen by Prof. Krste Asanovic, with funding provided
by these sources:
<BLOCKQUOTE>
<TABLE>
<COL>
<COL WIDTH=10>
<COL>
<TR>
<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
<TD></TD>
<TD>
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
NVIDIA, Oracle, and Samsung.
</TD>
</TR>
<TR>
<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
<TD></TD>
<TD>
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
Oracle, and Samsung.
</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
The following applies to the whole of SoftFloat <NOBR>Release 3a</NOBR> as well
as to each source file individually.
</P>

<P>
Copyright 2011, 2012, 2013, 2014, 2015 The Regents of the University of
California.
All rights reserved.
</P>

<P>
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
<OL>

<LI>
<P>
Redistributions of source code must retain the above copyright notice, this
list of conditions, and the following disclaimer.
</P>

<LI>
<P>
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions, and the following disclaimer in the documentation and/or
other materials provided with the distribution.
</P>

<LI>
<P>
Neither the name of the University nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
</P>

</OL>
</P>

<P>
THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS “AS IS”,
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
DISCLAIMED.
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
</P>


<H2>4. Types and Functions</H2>

<P>
The types and functions of SoftFloat are declared in header file
<CODE>softfloat.h</CODE>.
</P>

<H3>4.1. Boolean and Integer Types</H3>

<P>
Header file <CODE>softfloat.h</CODE> depends on standard headers
<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
<CODE>bool</CODE> and several integer types.
These standard headers have been part of the ISO C Standard Library since 1999.
With any recent compiler, they are likely to be supported, even if the compiler
does not claim complete conformance to the ISO C Standard.
For older or nonstandard compilers, a port of SoftFloat may have substitutes
for these headers.
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
<CODE>&lt;stdint.h&gt;</CODE>:
<BLOCKQUOTE>
<PRE>
uint16_t
uint32_t
uint64_t
int32_t
int64_t
uint_fast8_t
uint_fast32_t
uint_fast64_t
</PRE>
</BLOCKQUOTE>
</P>


<H3>4.2. Floating-Point Types</H3>

<P>
The <CODE>softfloat.h</CODE> header defines four floating-point types:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>float32_t</CODE></TD>
<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
</TR>
<TR>
<TD><CODE>float64_t</CODE></TD>
<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
</TR>
<TR>
<TD><CODE>extFloat80_t&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
Motorola format)</TD>
</TR>
<TR>
<TD><CODE>float128_t</CODE></TD>
<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The non-extended types are each exactly the size specified:
<NOBR>32 bits</NOBR> for <CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for
<CODE>float64_t</CODE>, and <NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
Aside from these size requirements, the definitions of all these types may
differ for different ports of SoftFloat to specific systems.
A given port of SoftFloat may or may not define some of the floating-point
types as aliases for the C standard types <CODE>float</CODE>,
<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
</P>

<P>
Header file <CODE>softfloat.h</CODE> also defines a structure,
<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
at least these two fields (not necessarily in this order):
<BLOCKQUOTE>
<PRE>
uint16_t signExp;
uint64_t signif;
</PRE>
</BLOCKQUOTE>
Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
encoded exponent in the other <NOBR>15 bits</NOBR>.
Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
the floating-point value.
(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
in the most significant bit of the significand.)
</P>

<H3>4.3. Supported Floating-Point Functions</H3>

<P>
SoftFloat implements these arithmetic operations for its floating-point types:
<UL>
<LI>
conversions between any two floating-point formats;
<LI>
for each floating-point format, conversions to and from signed and unsigned
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
<LI>
for each format, the usual addition, subtraction, multiplication, division, and
square root operations;
<LI>
for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
operation defined by the IEEE Standard;
<LI>
for each format, the floating-point remainder operation defined by the IEEE
Standard;
<LI>
for each format, a “round to integer” operation that rounds to the
nearest integer value in the same format; and
<LI>
comparisons between two values in the same floating-point format.
</UL>
</P>

<P>
The following operations required by the 2008 IEEE Floating-Point Standard are
not supported in SoftFloat <NOBR>Release 3a</NOBR>:
<UL>
<LI>
<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
<LI>
conversions between floating-point formats and decimal or hexadecimal character
sequences;
<LI>
all “quiet-computation” operations (<B>copy</B>, <B>negate</B>,
<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
manipulation of the floating-point sign bit); and
<LI>
all “non-computational” operations other than <B>isSignaling</B>
(which is supported).
</UL>
</P>

<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>

<P>
Because the <NOBR>80-bit</NOBR> double-extended-precision format,
<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
floating-point numbers are encodable in this type in equivalent normalized and
denormalized forms.
Zeros and values in the subnormal range each have only a single possible
encoding, for which the leading significand bit must <NOBR>be 0</NOBR>.
For other finite values (outside the subnormal range), a unique normalized
representation, with leading significand bit set <NOBR>to 1</NOBR>, always
exists, and is considered the <I>canonical</I> representation of the value.
Any equivalent denormalized representations (having leading significand bit
<NOBR>of 0</NOBR>) are <I>non-canonical</I>.
Similarly, the leading significand bit is expected to <NOBR>be 1</NOBR> for
infinities and NaNs as well;
any infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
considered non-canonical.
In short, for an <CODE>extFloat80_t</CODE> representation to be canonical, the
leading significand bit must <NOBR>be 1</NOBR> unless it is required to
<NOBR>be 0</NOBR> because the encoded value is zero or a subnormal.
</P>

<P>
Functions are not guaranteed to operate as expected when inputs of type
<CODE>extFloat80_t</CODE> are non-canonical.
Assuming all of a function’s <CODE>extFloat80_t</CODE> inputs (if any)
are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
be canonical.
</P>

<H3>4.5. Conventions for Passing Arguments and Results</H3>

<P>
Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
cases passed as function arguments by value.
Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
is always returned directly as the function result.
Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
floating-point values has this simple signature:
<BLOCKQUOTE>
<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
The story is more complex when function inputs and outputs are
<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
For these types, SoftFloat always provides a function that passes these larger
values into or out of the function indirectly, via pointers.
For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
SoftFloat supplies this function:
<BLOCKQUOTE>
<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
</BLOCKQUOTE>
The first two arguments point to the values to be added, and the last argument
points to the location where the sum will be stored.
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
that the <NOBR>128-bit</NOBR> inputs and outputs are “in memory”,
pointed to by pointer arguments.
</P>

<P>
All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
At the same time, SoftFloat ports may also implement alternate versions of
these same functions that pass <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE> by value, like the smaller formats.
Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
SoftFloat port may also supply an equivalent function with this signature:
<BLOCKQUOTE>
<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
As a general rule, on computers where the machine word size is
<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
and <CODE>float128_t</CODE>, because passing such large types directly can have
significant extra cost.
On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
provided, because the cost of passing by value is then more reasonable.
Applications that must be portable across both classes of computers must use
the pointer-based functions, as these are always implemented.
However, if it is known that SoftFloat includes the by-value functions for all
platforms of interest, programmers can use whichever version they prefer.
</P>


<H2>5. Reserved Names</H2>

<P>
In addition to the variables and functions documented here, SoftFloat defines
some symbol names for its own private use.
These private names always begin with the prefix
‘<CODE>softfloat_</CODE>’.
When a program includes header <CODE>softfloat.h</CODE> or links with the
SoftFloat library, all names with prefix ‘<CODE>softfloat_</CODE>’
are reserved for possible use by SoftFloat.
Applications that use SoftFloat should not define their own names with this
prefix, and should reference only such names as are documented.
</P>


<H2>6. Mode Variables</H2>

<P>
The following variables control rounding mode, underflow detection, and the
<NOBR>80-bit</NOBR> extended format’s rounding precision:
<BLOCKQUOTE>
<CODE>softfloat_roundingMode</CODE><BR>
<CODE>softfloat_detectTininess</CODE><BR>
<CODE>extF80_roundingPrecision</CODE>
</BLOCKQUOTE>
These mode variables are covered in the next several subsections.
</P>

<H3>6.1. Rounding Mode</H3>

<P>
All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
implemented for all operations that require rounding.
The rounding mode is selected by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
</BLOCKQUOTE>
This variable may be set to one of the values
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>softfloat_round_near_even</CODE></TD>
<TD>round to nearest, with ties to even</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_near_maxMag&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_minMag</CODE></TD>
<TD>round to minimum magnitude (toward zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_min</CODE></TD>
<TD>round to minimum (down)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_max</CODE></TD>
<TD>round to maximum (up)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
Variable <CODE>softfloat_roundingMode</CODE> is initialized to
<CODE>softfloat_round_near_even</CODE>.
</P>

<H3>6.2.
Underflow Detection</H3>

<P>
In the terminology of the IEEE Standard, SoftFloat can detect tininess for
underflow either before or after rounding.
The choice is made by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
</BLOCKQUOTE>
which can be set to either
<BLOCKQUOTE>
<CODE>softfloat_tininess_beforeRounding</CODE><BR>
<CODE>softfloat_tininess_afterRounding</CODE>
</BLOCKQUOTE>
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.
The other option is provided for compatibility with some systems.
Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
always detects loss of accuracy for underflow as an inexact result.
</P>

<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>

<P>
For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
arithmetic operations is controlled by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
</BLOCKQUOTE>
The operations affected are:
<BLOCKQUOTE>
<CODE>extF80_add</CODE><BR>
<CODE>extF80_sub</CODE><BR>
<CODE>extF80_mul</CODE><BR>
<CODE>extF80_div</CODE><BR>
<CODE>extF80_sqrt</CODE>
</BLOCKQUOTE>
When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
double-extended-precision format, as occurs for other formats.
Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
<CODE>float64_t</CODE>), respectively.
When rounding to reduced precision, additional bits in the result significand
beyond the rounding point are set to zero.
The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
other than 32, 64, or 80 are not specified.
Operations other than the ones listed above are not affected by
<CODE>extF80_roundingPrecision</CODE>.
</P>


<H2>7. Exceptions and Exception Flags</H2>

<P>
All five exception flags required by the IEEE Floating-Point Standard are
implemented.
Each flag is stored as a separate bit in the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
</BLOCKQUOTE>
The positions of the exception flag bits within this variable are determined by
the bit masks
<BLOCKQUOTE>
<CODE>softfloat_flag_inexact</CODE><BR>
<CODE>softfloat_flag_underflow</CODE><BR>
<CODE>softfloat_flag_overflow</CODE><BR>
<CODE>softfloat_flag_infinite</CODE><BR>
<CODE>softfloat_flag_invalid</CODE>
</BLOCKQUOTE>
Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
meaning no exceptions.
</P>

<P>
An individual exception flag can be cleared with the statement
<BLOCKQUOTE>
<CODE>softfloat_exceptionFlags &amp;= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
</BLOCKQUOTE>
where <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
To raise a floating-point exception, function <CODE>softfloat_raise</CODE>
should normally be used.
</P>

<P>
When SoftFloat detects an exception other than <I>inexact</I>, it calls
<CODE>softfloat_raise</CODE>.
The default version of this function simply raises the corresponding exception
flags.
Particular ports of SoftFloat may support alternate behavior, such as exception
traps, by modifying the default <CODE>softfloat_raise</CODE>.
A program may also supply its own <CODE>softfloat_raise</CODE> function to
override the one from the SoftFloat library.
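</P>

<P>
The flag handling described in this section can be sketched in C as follows.
This is a minimal model under stated assumptions, not SoftFloat code: every
name with prefix <CODE>demo_</CODE> is hypothetical, and the numeric mask
values are chosen only for illustration so that the sketch compiles without
the library.
In a real program, the flags variable, the masks, and the raise function would
be the <CODE>softfloat_</CODE> names declared in <CODE>softfloat.h</CODE>.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the exception-flag masks.  The values are assumptions made
   for this sketch; real code should use the softfloat_flag_ names only. */
enum {
    demo_flag_inexact   =  1,
    demo_flag_underflow =  2,
    demo_flag_overflow  =  4,
    demo_flag_infinite  =  8,
    demo_flag_invalid   = 16
};

/* Stand-in for the global flags variable; all zeros means no exceptions. */
static uint_fast8_t demo_exceptionFlags = 0;

/* Models the default raise behavior described above: the given flags are
   simply ORed into the flags variable. */
void demo_raise( uint_fast8_t flags )
{
    demo_exceptionFlags |= flags;
}

/* Returns true if the given flag is currently raised. */
bool demo_testFlag( uint_fast8_t flag )
{
    return (demo_exceptionFlags & flag) != 0;
}

/* Clears one flag with the statement form shown in this section, leaving
   all other flags untouched. */
void demo_clearFlag( uint_fast8_t flag )
{
    demo_exceptionFlags &= (uint_fast8_t) ~flag;
}
```

A typical pattern is to clear the flags of interest before a computation, run
the computation, and then test the flags afterward.
Because the flags are sticky, a flag stays raised until the program clears it.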
</P>

<P>
Because inexact results occur frequently under most circumstances (and thus are
hardly exceptional), SoftFloat does not ordinarily call
<CODE>softfloat_raise</CODE> for <I>inexact</I> exceptions.
It does always raise the <I>inexact</I> exception flag as required.
</P>


<H2>8. Function Details</H2>

<P>
In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
a substitute for one of these abbreviations:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>f32</CODE></TD>
<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>f64</CODE></TD>
<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>extF80M&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
</TR>
<TR>
<TD><CODE>extF80</CODE></TD>
<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>f128M</CODE></TD>
<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
</TR>
<TR>
<TD><CODE>f128</CODE></TD>
<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The circumstances under which values of floating-point types
<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
value or indirectly via pointers were discussed earlier in
<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
</P>

<H3>8.1. Conversions from Integer to Floating-Point</H3>

<P>
All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
signed or unsigned, to a floating-point format are supported.
Functions performing these conversions have these names:
<BLOCKQUOTE>
<CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
</BLOCKQUOTE>
Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
double-precision and larger formats are always exact, and likewise conversions
from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are also
always exact.
</P>

<P>
Each conversion function takes one input of the appropriate type and generates
one output.
The following illustrates the signatures of these functions in cases when the
floating-point result is passed either by value or via pointers:
<BLOCKQUOTE>
<PRE>
float64_t i32_to_f64( int32_t <I>a</I> );
</PRE>
<PRE>
void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.2. Conversions from Floating-Point to Integer</H3>

<P>
Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
functions:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
</BLOCKQUOTE>
The functions have signatures as follows, depending on whether the
floating-point input is passed by value or via pointers:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t
 f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
the conversion.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
</P>

<P>
Conversions from floating-point to integer raise the <I>invalid</I> exception
if the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).
In such a circumstance, if the floating-point input is a NaN or if the
conversion is to an unsigned integer type, the largest positive integer is
returned;
otherwise, the largest integer with the same sign as the input is returned.
The functions that convert to integer types never raise the <I>overflow</I>
exception.
</P>

<P>
Note that, when converting to an unsigned integer type, if the <I>invalid</I>
exception is raised because the input floating-point value would round to a
negative integer, the value returned is the <EM>maximum positive unsigned
integer</EM>.
Zero is not returned when the <I>invalid</I> exception is raised, even when
zero is the closest integer to the original floating-point value.
</P>

<P>
Because languages such <NOBR>as C</NOBR> require that conversions to integers
be rounded toward zero, the following functions are provided for improved speed
and convenience:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
</BLOCKQUOTE>
These functions round only toward zero (to minimum magnitude).
The signatures for these functions are the same as above without the redundant
<CODE><I>roundingMode</I></CODE> argument:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.3. Conversions Among Floating-Point Types</H3>

<P>
Conversions between floating-point formats are done by functions with these
names:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
</BLOCKQUOTE>
All combinations of source and result type are supported where the source and
result are different formats.
There are four different styles of signature for these functions, depending on
whether the input and the output floating-point values are passed by value or
via pointers:
<BLOCKQUOTE>
<PRE>
float32_t f64_to_f32( float64_t <I>a</I> );
</PRE>
<PRE>
float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
</PRE>
<PRE>
void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
Conversions from a smaller to a larger floating-point format are always exact
and so require no rounding.
</P>

<H3>8.4. Basic Arithmetic Functions</H3>

<P>
The following basic arithmetic functions are provided:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
</BLOCKQUOTE>
Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root) which takes only one.
The operands and result are all of the same floating-point format.
Signatures for these functions take the following forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_add(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
float64_t f64_sqrt( float64_t <I>a</I> );
</PRE>
<PRE>
void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
location where the result is stored.
</P>

<P>
Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
(<CODE>extFloat80_t</CODE>) functions is affected by variable
<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
<NOBR>section 6.3</NOBR>,
<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
</P>

<H3>8.5. Fused Multiply-Add Functions</H3>

<P>
The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
multiply-add</I> operation that does a combined multiplication and addition
with only a single rounding.
SoftFloat implements fused multiply-add with functions
<BLOCKQUOTE>
<CODE><<I>float</I>>_mulAdd</CODE>
</BLOCKQUOTE>
Unlike other operations, fused multiply-add is supported only for the
non-extended formats, <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and
<CODE>float128_t</CODE>.
No fused multiply-add function is currently provided for the
<NOBR>80-bit</NOBR> double-extended-precision type, <CODE>extFloat80_t</CODE>.
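As a rough illustration of what a single rounding means, the following C sketch
contrasts a product that is rounded before the addition with one that is not.
It deliberately does not call SoftFloat: the helper names
<CODE>mul_then_add</CODE>, <CODE>fused_sketch</CODE>, and
<CODE>single_rounding_differs</CODE> are hypothetical, and the sketch assumes
that the compiler's <CODE>float</CODE> and <CODE>double</CODE> are the IEEE
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> binary formats.

```c
#include <stdbool.h>

/* Hypothetical helper: two roundings.  The product is first rounded to
   32-bit single precision, and then the sum is rounded again. */
static float mul_then_add( float a, float b, float c )
{
    volatile float product = a * b;  /* volatile forces the rounding to float */
    return product + c;
}

/* Hypothetical helper: a stand-in for a fused multiply-add on 32-bit
   operands.  The product of two floats is exact in double, so the product is
   never rounded to float.  Illustration only: in general the double addition
   can itself round, so this is not a correct fused multiply-add. */
static float fused_sketch( float a, float b, float c )
{
    return (float) ( (double) a * (double) b + (double) c );
}

/* (1 + 2^-13) * (1 - 2^-13) is exactly 1 - 2^-26, which is not representable
   in single precision and rounds to 1.  Returns true if the two helpers
   disagree for these operands. */
static bool single_rounding_differs( void )
{
    float a = 1.0f + 0x1p-13f;
    float b = 1.0f - 0x1p-13f;
    float c = -1.0f;
    return mul_then_add( a, b, c ) != fused_sketch( a, b, c );
}
```

For these operands the separately rounded computation loses the low-order part
of the product, while the unrounded product preserves it.
With SoftFloat, the single rounding is obtained for all operands by calling the
appropriate <CODE><<I>float</I>>_mulAdd</CODE> function, such as
<CODE>f32_mulAdd</CODE>, rather than by widening to a larger format as this
sketch does.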
</P>

<P>
Depending on whether floating-point values are passed by value or via pointers,
the fused multiply-add functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
</PRE>
<PRE>
void
 f128M_mulAdd(
     const float128_t *<I>aPtr</I>,
     const float128_t *<I>bPtr</I>,
     const float128_t *<I>cPtr</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
The functions compute
<NOBR>(<CODE><I>a</I></CODE> × <CODE><I>b</I></CODE>)
 + <CODE><I>c</I></CODE></NOBR>
with a single rounding.
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
If one of the multiplication operands <CODE><I>a</I></CODE> and
<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
the <I>invalid</I> exception even if operand <CODE><I>c</I></CODE> is a quiet
NaN.
</P>

<H3>8.6. Remainder Functions</H3>

<P>
For each format, SoftFloat implements the remainder operation defined by the
IEEE Floating-Point Standard.
The remainder functions have names
<BLOCKQUOTE>
<CODE><<I>float</I>>_rem</CODE>
</BLOCKQUOTE>
Each remainder operation takes two floating-point operands of the same format
and returns a result in the same format.
Depending on whether floating-point values are passed by value or via pointers,
the remainder functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_rem(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
When floating-point values are passed indirectly through pointers, arguments
<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<P>
The IEEE Standard remainder operation computes the value
<NOBR><CODE><I>a</I></CODE>
 − <I>n</I> × <CODE><I>b</I></CODE></NOBR>,
where <I>n</I> is the integer closest to
<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
If <NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR> is exactly
halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
<NOBR><CODE><I>a</I></CODE> ÷ <CODE><I>b</I></CODE></NOBR>.
The IEEE Standard’s remainder operation is always exact and so requires
no rounding.
</P>

<P>
Depending on the relative magnitudes of the operands, the remainder
functions can take considerably longer to execute than the other SoftFloat
functions.
This is inherent in the remainder operation itself and is not a flaw in the
SoftFloat implementation.
</P>

<H3>8.7. Round-to-Integer Functions</H3>

<P>
For each format, SoftFloat implements the round-to-integer operation specified
by the IEEE Floating-Point Standard.
These functions are named
<BLOCKQUOTE>
<CODE><<I>float</I>>_roundToInt</CODE>
</BLOCKQUOTE>
Each round-to-integer operation takes a single floating-point operand.
This operand is rounded to an integer according to a specified rounding mode,
and the resulting integer value is returned in the same floating-point format.
(Note that the result is not an integer type.)
</P>

<P>
The signatures of the round-to-integer functions are similar to those for
conversions to an integer type:
<BLOCKQUOTE>
<PRE>
float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
void
 f128M_roundToInt(
     const float128_t *<I>aPtr</I>,
     uint_fast8_t <I>roundingMode</I>,
     bool <I>exact</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
apply.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
When floating-point values are passed indirectly through pointers,
<CODE><I>aPtr</I></CODE> points to the input operand and
<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
</P>

<H3>8.8. Comparison Functions</H3>

<P>
For each format, the following floating-point comparison functions are
provided:
<BLOCKQUOTE>
<CODE><<I>float</I>>_eq</CODE><BR>
<CODE><<I>float</I>>_le</CODE><BR>
<CODE><<I>float</I>>_lt</CODE>
</BLOCKQUOTE>
Each comparison takes two operands of the same type and returns a Boolean.
The abbreviation <CODE>eq</CODE> stands for “equal” (=);
<CODE>le</CODE> stands for “less than or equal” (≤);
and <CODE>lt</CODE> stands for “less than” (<).
Depending on whether the floating-point operands are passed by value or via
pointers, the comparison functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
The usual greater-than (>), greater-than-or-equal (≥), and not-equal
(≠) comparisons are easily obtained from the functions provided.
The not-equal function is just the logical complement of the equal function.
The greater-than-or-equal function is identical to the less-than-or-equal
function with the arguments in reverse order, and likewise the greater-than
function is identical to the less-than function with the arguments reversed.
</P>

<P>
The IEEE Floating-Point Standard specifies that the less-than-or-equal and
less-than comparisons by default raise the <I>invalid</I> exception if either
operand is any kind of NaN.
Equality comparisons, on the other hand, are defined by default to raise the
<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
For completeness, SoftFloat provides these complementary functions:
<BLOCKQUOTE>
<CODE><<I>float</I>>_eq_signaling</CODE><BR>
<CODE><<I>float</I>>_le_quiet</CODE><BR>
<CODE><<I>float</I>>_lt_quiet</CODE>
</BLOCKQUOTE>
The <CODE>signaling</CODE> equality comparisons are identical to the default
equality comparisons except that the <I>invalid</I> exception is raised for any
NaN input, not just for signaling NaNs.
Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
default counterparts except that the <I>invalid</I> exception is not raised for
quiet NaNs.
</P>

<H3>8.9. 
Signaling NaN Test Functions</H3>

<P>
Functions for testing whether a floating-point value is a signaling NaN are
provided with these names:
<BLOCKQUOTE>
<CODE><<I>float</I>>_isSignalingNaN</CODE>
</BLOCKQUOTE>
The functions take one floating-point operand and return a Boolean indicating
whether the operand is a signaling NaN.
Accordingly, the functions have the forms
<BLOCKQUOTE>
<PRE>
bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.10. Raise-Exception Function</H3>

<P>
SoftFloat provides a single function for raising floating-point exceptions:
<BLOCKQUOTE>
<PRE>
void softfloat_raise( uint_fast8_t <I>exceptions</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
exceptions to raise.
(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
In addition to setting the specified exception flags in variable
<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raise</CODE>
function may cause a trap or abort appropriate for the current system.
</P>


<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>

<P>
Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
SoftFloat introduced numerous technical differences compared to earlier
releases.
</P>

<H3>9.1. Name Changes</H3>

<P>
The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
is that the names of most functions and variables have changed, even when the
behavior has not.
First, the floating-point types, the mode variables, the exception flags
variable, the function to raise exceptions, and various associated constants
have been renamed as follows:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>old name, Release 2:</TD>
<TD>new name, Release 3:</TD>
</TR>
<TR>
<TD><CODE>float32</CODE></TD>
<TD><CODE>float32_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float64</CODE></TD>
<TD><CODE>float64_t</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80</CODE></TD>
<TD><CODE>extFloat80_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float128</CODE></TD>
<TD><CODE>float128_t</CODE></TD>
</TR>
<TR>
<TD><CODE>float_rounding_mode</CODE></TD>
<TD><CODE>softfloat_roundingMode</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_nearest_even</CODE></TD>
<TD><CODE>softfloat_round_near_even</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_to_zero</CODE></TD>
<TD><CODE>softfloat_round_minMag</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_down</CODE></TD>
<TD><CODE>softfloat_round_min</CODE></TD>
</TR>
<TR>
<TD><CODE>float_round_up</CODE></TD>
<TD><CODE>softfloat_round_max</CODE></TD>
</TR>
<TR>
<TD><CODE>float_detect_tininess</CODE></TD>
<TD><CODE>softfloat_detectTininess</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_before_rounding </CODE></TD>
<TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>float_tininess_after_rounding</CODE></TD>
<TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
</TR>
<TR>
<TD><CODE>floatx80_rounding_precision</CODE></TD>
<TD><CODE>extF80_roundingPrecision</CODE></TD>
</TR>
<TR>
<TD><CODE>float_exception_flags</CODE></TD>
<TD><CODE>softfloat_exceptionFlags</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_inexact</CODE></TD>
<TD><CODE>softfloat_flag_inexact</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_underflow</CODE></TD>
<TD><CODE>softfloat_flag_underflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_overflow</CODE></TD>
<TD><CODE>softfloat_flag_overflow</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_divbyzero</CODE></TD>
<TD><CODE>softfloat_flag_infinite</CODE></TD>
</TR>
<TR>
<TD><CODE>float_flag_invalid</CODE></TD>
<TD><CODE>softfloat_flag_invalid</CODE></TD>
</TR>
<TR>
<TD><CODE>float_raise</CODE></TD>
<TD><CODE>softfloat_raise</CODE></TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE> </CODE></TD>
<TD>used in names in Release 3:</TD>
</TR>
<TR> <TD><CODE>int32</CODE></TD> <TD><CODE>i32</CODE></TD> </TR>
<TR> <TD><CODE>int64</CODE></TD> <TD><CODE>i64</CODE></TD> </TR>
<TR> <TD><CODE>float32</CODE></TD> <TD><CODE>f32</CODE></TD> </TR>
<TR> <TD><CODE>float64</CODE></TD> <TD><CODE>f64</CODE></TD> </TR>
<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD> </TR>
</TABLE>
</BLOCKQUOTE>
Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
is now <CODE>f32_add</CODE>.
Lastly, there have been a few other changes to function names:
<BLOCKQUOTE>
<TABLE>
<TR>
<TD>used in names in Release 2:<CODE> </CODE></TD>
<TD>used in names in Release 3:<CODE> </CODE></TD>
<TD>relevant functions:</TD>
</TR>
<TR>
<TD><CODE>_round_to_zero</CODE></TD>
<TD><CODE>_r_minMag</CODE></TD>
<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>round_to_int</CODE></TD>
<TD><CODE>roundToInt</CODE></TD>
<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
</TR>
<TR>
<TD><CODE>is_signaling_nan </CODE></TD>
<TD><CODE>isSignalingNaN</CODE></TD>
<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<H3>9.2. Changes to Function Arguments</H3>

<P>
Besides simple name changes, some operations were given a different interface
in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
standard types from header <CODE><stdint.h></CODE>, such as
<CODE>uint32_t</CODE>, whereas previously their types could be defined
differently for each port of SoftFloat, usually using traditional C types such
as <CODE>unsigned</CODE> <CODE>int</CODE>.
Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
standard type <CODE>bool</CODE> from <CODE><stdbool.h></CODE>, whereas
previously these were again passed as a port-specific type (usually
<CODE>int</CODE>).
</P>

<LI>
<P>
As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
values through pointers, meaning that functions take pointer arguments and then
read or write floating-point values at the locations indicated by the pointers.
In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
passed by value, regardless of their size.
</P>

<LI>
<P>
Functions that round to an integer have additional
<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
they did not have in <NOBR>Release 2</NOBR>.
Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
since <NOBR>Release 3</NOBR>.
For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
same global variable that affects the basic arithmetic operations (now called
<CODE>softfloat_roundingMode</CODE> but previously known as
<CODE>float_rounding_mode</CODE>).
Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
an exact integer value, and if the <I>invalid</I> exception was not raised by
the function, the <I>inexact</I> exception was always raised.
<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
case.
Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
effect as <NOBR>Release 2</NOBR> by passing variable
<CODE>softfloat_roundingMode</CODE> for argument
<CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
<CODE><I>exact</I></CODE>.
</P>

</UL>
</P>

<H3>9.3. 
Added Capabilities</H3>

<P>
With <NOBR>Release 3</NOBR>, some new features have been added that were not
present in <NOBR>Release 2</NOBR>:
<UL>

<LI>
<P>
A port of SoftFloat can now define any of the floating-point types
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
<CODE>float128_t</CODE> as aliases for C’s standard floating-point types
<CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
<CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
This potential convenience was not supported under <NOBR>Release 2</NOBR>.
</P>

<P>
(Note, however, that there may be a performance cost to defining
SoftFloat’s floating-point types this way, depending on the platform and
the applications using SoftFloat.
Ports of SoftFloat may choose to forgo the convenience in favor of better
speed.)
</P>

<P>
<LI>
Functions have been added for converting between the floating-point types and
unsigned integers.
<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
</P>

<P>
<LI>
A new, fifth rounding mode, <CODE>softfloat_round_near_maxMag</CODE> (round to
nearest, with ties to maximum magnitude, away from zero) is now supported for
all cases involving rounding.
</P>

<P>
<LI>
Fused multiply-add functions have been added for the non-extended formats,
<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and <CODE>float128_t</CODE>.
</P>

</UL>
</P>

<H3>9.4. Better Compatibility with the C Language</H3>

<P>
<NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
Standard’s rules for portability.
For example, older releases of SoftFloat employed type conversions in ways
that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior is more strictly regulated these days.
</P>

<H3>9.5. New Organization as a Library</H3>

<P>
Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
Previously, SoftFloat compiled into a single, monolithic object file containing
all the SoftFloat functions, with the consequence that a program linking with
SoftFloat would get every SoftFloat function in its binary file even if only a
few functions were actually used.
With SoftFloat in the form of a library, a program that is linked by a standard
linker will include only those functions of SoftFloat that it needs and no
others.
</P>

<H3>9.6. Optimization Gains (and Losses)</H3>

<P>
Individual SoftFloat functions have been variously improved in
<NOBR>Release 3</NOBR> compared to earlier releases.
In particular, better, faster algorithms have been deployed for the operations
of division, square root, and remainder.
For functions operating on the larger <NOBR>80-bit</NOBR> and
<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE>, code size has also generally been reduced.
</P>

<P>
However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
single object file, compilers could make optimizations across function calls
when one SoftFloat function calls another.
Now that the functions of SoftFloat are compiled separately and only afterward
linked together into a program, there is not usually the same opportunity to
optimize across function calls.
Some loss of speed has been observed due to this change.
</P>


<H2>10. 
Future Directions</H2>

<P>
The following improvements are anticipated for future releases of SoftFloat:
<UL>
<LI>
support for the common <NOBR>16-bit</NOBR> “half-precision”
floating-point format;
<LI>
more functions from the 2008 version of the IEEE Floating-Point Standard;
<LI>
consistent, defined behavior for non-canonical representations of extended
format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).

</UL>
</P>


<H2>11. Contact Information</H2>

<P>
At the time of this writing, the most up-to-date information about SoftFloat
and the latest release can be found at the Web page
<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>.
</P>


</BODY>