
<HTML>

<HEAD>
<TITLE>Berkeley SoftFloat Library Interface</TITLE>
</HEAD>

<BODY>

<H1>Berkeley SoftFloat Release 3a: Library Interface</H1>

<P>
John R. Hauser<BR>
2015 October 23<BR>
</P>


<H2>Contents</H2>

<BLOCKQUOTE>
<TABLE BORDER=0 CELLSPACING=0 CELLPADDING=0>
<COL WIDTH=25>
<COL WIDTH=*>
<TR><TD COLSPAN=2>1. Introduction</TD></TR>
<TR><TD COLSPAN=2>2. Limitations</TD></TR>
<TR><TD COLSPAN=2>3. Acknowledgments and License</TD></TR>
<TR><TD COLSPAN=2>4. Types and Functions</TD></TR>
<TR><TD></TD><TD>4.1. Boolean and Integer Types</TD></TR>
<TR><TD></TD><TD>4.2. Floating-Point Types</TD></TR>
<TR><TD></TD><TD>4.3. Supported Floating-Point Functions</TD></TR>
<TR>
  <TD></TD>
  <TD>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></TD>
</TR>
<TR><TD></TD><TD>4.5. Conventions for Passing Arguments and Results</TD></TR>
<TR><TD COLSPAN=2>5. Reserved Names</TD></TR>
<TR><TD COLSPAN=2>6. Mode Variables</TD></TR>
<TR><TD></TD><TD>6.1. Rounding Mode</TD></TR>
<TR><TD></TD><TD>6.2. Underflow Detection</TD></TR>
<TR>
  <TD></TD>
  <TD>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</TD>
</TR>
<TR><TD COLSPAN=2>7. Exceptions and Exception Flags</TD></TR>
<TR><TD COLSPAN=2>8. Function Details</TD></TR>
<TR><TD></TD><TD>8.1. Conversions from Integer to Floating-Point</TD></TR>
<TR><TD></TD><TD>8.2. Conversions from Floating-Point to Integer</TD></TR>
<TR><TD></TD><TD>8.3. Conversions Among Floating-Point Types</TD></TR>
<TR><TD></TD><TD>8.4. Basic Arithmetic Functions</TD></TR>
<TR><TD></TD><TD>8.5. Fused Multiply-Add Functions</TD></TR>
<TR><TD></TD><TD>8.6. Remainder Functions</TD></TR>
<TR><TD></TD><TD>8.7. Round-to-Integer Functions</TD></TR>
<TR><TD></TD><TD>8.8. Comparison Functions</TD></TR>
<TR><TD></TD><TD>8.9. Signaling NaN Test Functions</TD></TR>
<TR><TD></TD><TD>8.10. Raise-Exception Function</TD></TR>
<TR><TD COLSPAN=2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></TD></TR>
<TR><TD></TD><TD>9.1. Name Changes</TD></TR>
<TR><TD></TD><TD>9.2. Changes to Function Arguments</TD></TR>
<TR><TD></TD><TD>9.3. Added Capabilities</TD></TR>
<TR><TD></TD><TD>9.4. Better Compatibility with the C Language</TD></TR>
<TR><TD></TD><TD>9.5. New Organization as a Library</TD></TR>
<TR><TD></TD><TD>9.6. Optimization Gains (and Losses)</TD></TR>
<TR><TD COLSPAN=2>10. Future Directions</TD></TR>
<TR><TD COLSPAN=2>11. Contact Information</TD></TR>
</TABLE>
</BLOCKQUOTE>


<H2>1. Introduction</H2>

<P>
Berkeley SoftFloat is a software implementation of binary floating-point that
conforms to the IEEE Standard for Floating-Point Arithmetic.
The current release supports four binary formats:  <NOBR>32-bit</NOBR>
single-precision, <NOBR>64-bit</NOBR> double-precision, <NOBR>80-bit</NOBR>
double-extended-precision, and <NOBR>128-bit</NOBR> quadruple-precision.
The following functions are supported for each format:
<UL>
<LI>
addition, subtraction, multiplication, division, and square root;
<LI>
fused multiply-add as defined by the IEEE Standard, except for
<NOBR>80-bit</NOBR> double-extended-precision;
<LI>
remainder as defined by the IEEE Standard;
<LI>
round to integral value;
<LI>
comparisons;
<LI>
conversions to/from other supported formats; and
<LI>
conversions to/from <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers,
signed and unsigned.
</UL>
All operations required by the original 1985 version of the IEEE Floating-Point
Standard are implemented, except for conversions to and from decimal.
</P>

<P>
This document gives information about the types defined and the routines
implemented by SoftFloat.
It does not attempt to define or explain the IEEE Floating-Point Standard.
Information about the standard is available elsewhere.
</P>

<P>
The current version of SoftFloat is <NOBR>Release 3a</NOBR>.
The only difference between this version and the previous
<NOBR>Release 3</NOBR> is the replacement of the license text supplied by the
University of California.
</P>

<P>
The functional interface of SoftFloat <NOBR>Release 3</NOBR> and afterward
differs in many details from that of earlier releases.
For specifics of these differences, see <NOBR>section 9</NOBR> below,
<I>Changes from SoftFloat <NOBR>Release 2</NOBR></I>.
</P>


<H2>2. Limitations</H2>

<P>
SoftFloat assumes the computer has an addressable byte size of 8 or
<NOBR>16 bits</NOBR>.
(Nearly all computers in use today have <NOBR>8-bit</NOBR> bytes.)
</P>

<P>
SoftFloat is written in C and is designed to work with other C code.
The C compiler used must conform at a minimum to the 1989 ANSI standard for the
C language (same as the 1990 ISO standard) and must in addition support basic
arithmetic on <NOBR>64-bit</NOBR> integers.
Earlier releases of SoftFloat included implementations of <NOBR>32-bit</NOBR>
single-precision and <NOBR>64-bit</NOBR> double-precision floating-point that
did not require <NOBR>64-bit</NOBR> integers, but this option is not supported
starting with <NOBR>Release 3</NOBR>.
Since 1999, ISO standards for C have mandated compiler support for
<NOBR>64-bit</NOBR> integers.
A compiler conforming to the 1999 C Standard or later is recommended but not
strictly required.
</P>

<P>
Most operations not required by the original 1985 version of the IEEE
Floating-Point Standard but added in the 2008 version are not yet supported in
SoftFloat <NOBR>Release 3a</NOBR>.
</P>


<H2>3. Acknowledgments and License</H2>

<P>
The SoftFloat package was written by me, <NOBR>John R.</NOBR> Hauser.
<NOBR>Release 3</NOBR> of SoftFloat was a completely new implementation
supplanting earlier releases.
The project to create <NOBR>Release 3</NOBR> (and <NOBR>now 3a</NOBR>) was done
in the employ of the University of California, Berkeley, within the Department
of Electrical Engineering and Computer Sciences, first for the Parallel
Computing Laboratory (Par Lab) and then for the ASPIRE Lab.
The work was officially overseen by Prof. Krste Asanovic, with funding provided
by these sources:
<BLOCKQUOTE>
<TABLE>
<COL>
<COL WIDTH=10>
<COL>
<TR>
<TD VALIGN=TOP><NOBR>Par Lab:</NOBR></TD>
<TD></TD>
<TD>
Microsoft (Award #024263), Intel (Award #024894), and U.C. Discovery
(Award #DIG07-10227), with additional support from Par Lab affiliates Nokia,
NVIDIA, Oracle, and Samsung.
</TD>
</TR>
<TR>
<TD VALIGN=TOP><NOBR>ASPIRE Lab:</NOBR></TD>
<TD></TD>
<TD>
DARPA PERFECT program (Award #HR0011-12-2-0016), with additional support from
ASPIRE industrial sponsor Intel and ASPIRE affiliates Google, Nokia, NVIDIA,
Oracle, and Samsung.
</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
</P>

<P>
The following applies to the whole of SoftFloat <NOBR>Release 3a</NOBR> as well
as to each source file individually.
</P>

<P>
Copyright 2011, 2012, 2013, 2014, 2015 The Regents of the University of
California.
All rights reserved.
</P>

<P>
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
<OL>

<LI>
<P>
Redistributions of source code must retain the above copyright notice, this
list of conditions, and the following disclaimer.
</P>

<LI>
<P>
Redistributions in binary form must reproduce the above copyright notice, this
list of conditions, and the following disclaimer in the documentation and/or
other materials provided with the distribution.
</P>

<LI>
<P>
Neither the name of the University nor the names of its contributors may be
used to endorse or promote products derived from this software without specific
prior written permission.
</P>

</OL>
</P>

<P>
THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS &ldquo;AS IS&rdquo;,
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE, ARE
DISCLAIMED.
IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF
ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
</P>


<H2>4. Types and Functions</H2>

<P>
The types and functions of SoftFloat are declared in header file
<CODE>softfloat.h</CODE>.
</P>

<H3>4.1. Boolean and Integer Types</H3>

<P>
Header file <CODE>softfloat.h</CODE> depends on standard headers
<CODE>&lt;stdbool.h&gt;</CODE> and <CODE>&lt;stdint.h&gt;</CODE> to define type
<CODE>bool</CODE> and several integer types.
These standard headers have been part of the ISO C Standard Library since 1999.
With any recent compiler, they are likely to be supported, even if the compiler
does not claim complete conformance to the ISO C Standard.
For older or nonstandard compilers, a port of SoftFloat may have substitutes
for these headers.
Header <CODE>softfloat.h</CODE> depends only on the name <CODE>bool</CODE> from
<CODE>&lt;stdbool.h&gt;</CODE> and on these type names from
<CODE>&lt;stdint.h&gt;</CODE>:
<BLOCKQUOTE>
<PRE>
uint16_t
uint32_t
uint64_t
int32_t
int64_t
uint_fast8_t
uint_fast32_t
uint_fast64_t
</PRE>
</BLOCKQUOTE>
</P>


<H3>4.2. Floating-Point Types</H3>

<P>
The <CODE>softfloat.h</CODE> header defines four floating-point types:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>float32_t</CODE></TD>
<TD><NOBR>32-bit</NOBR> single-precision binary format</TD>
</TR>
<TR>
<TD><CODE>float64_t</CODE></TD>
<TD><NOBR>64-bit</NOBR> double-precision binary format</TD>
</TR>
<TR>
<TD><CODE>extFloat80_t&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD><NOBR>80-bit</NOBR> double-extended-precision binary format (old Intel or
Motorola format)</TD>
</TR>
<TR>
<TD><CODE>float128_t</CODE></TD>
<TD><NOBR>128-bit</NOBR> quadruple-precision binary format</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The non-extended types are each exactly the size specified:
<NOBR>32 bits</NOBR> for <CODE>float32_t</CODE>, <NOBR>64 bits</NOBR> for
<CODE>float64_t</CODE>, and <NOBR>128 bits</NOBR> for <CODE>float128_t</CODE>.
Aside from these size requirements, the definitions of all these types may
differ for different ports of SoftFloat to specific systems.
A given port of SoftFloat may or may not define some of the floating-point
types as aliases for the C standard types <CODE>float</CODE>,
<CODE>double</CODE>, and <CODE>long</CODE> <CODE>double</CODE>.
</P>

<P>
Header file <CODE>softfloat.h</CODE> also defines a structure,
<CODE>struct</CODE> <CODE>extFloat80M</CODE>, for the representation of
<NOBR>80-bit</NOBR> double-extended-precision floating-point values in memory.
This structure is the same size as type <CODE>extFloat80_t</CODE> and contains
at least these two fields (not necessarily in this order):
<BLOCKQUOTE>
<PRE>
uint16_t signExp;
uint64_t signif;
</PRE>
</BLOCKQUOTE>
Field <CODE>signExp</CODE> contains the sign and exponent of the floating-point
value, with the sign in the most significant bit (<NOBR>bit 15</NOBR>) and the
encoded exponent in the other <NOBR>15 bits</NOBR>.
Field <CODE>signif</CODE> is the complete <NOBR>64-bit</NOBR> significand of
the floating-point value.
(In the usual encoding for <NOBR>80-bit</NOBR> extended floating-point, the
leading <NOBR>1 bit</NOBR> of normalized numbers is not implicit but is stored
in the most significant bit of the significand.)
</P>
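
<P>
As a minimal sketch of the field layout just described, the following C
fragment pulls the sign, the encoded exponent, and the explicit leading
significand bit out of the two fields.
The names <CODE>demo_extFloat80M</CODE>, <CODE>demo_sign</CODE>,
<CODE>demo_exp</CODE>, and <CODE>demo_leadingBit</CODE> are hypothetical
stand-ins chosen so that the fragment compiles without SoftFloat; the real
<CODE>struct</CODE> <CODE>extFloat80M</CODE> may order or pad its fields
differently.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extFloat80M (field order is an assumption). */
struct demo_extFloat80M {
    uint16_t signExp;
    uint64_t signif;
};

/* The sign is bit 15 of signExp. */
bool demo_sign(const struct demo_extFloat80M *p)
{
    return (p->signExp >> 15) != 0;
}

/* The encoded exponent is the low 15 bits of signExp. */
uint16_t demo_exp(const struct demo_extFloat80M *p)
{
    return (uint16_t)(p->signExp & 0x7FFF);
}

/* The leading significand bit is stored explicitly in bit 63 of signif. */
bool demo_leadingBit(const struct demo_extFloat80M *p)
{
    return (p->signif >> 63) != 0;
}
```

<P>
For the usual encoding of <NOBR>-1.0</NOBR> (<CODE>signExp</CODE> of
<CODE>0xBFFF</CODE>, <CODE>signif</CODE> of <CODE>0x8000000000000000</CODE>),
these helpers report a set sign, an encoded exponent of <CODE>0x3FFF</CODE>,
and a leading bit <NOBR>of 1</NOBR>.
</P>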

<H3>4.3. Supported Floating-Point Functions</H3>

<P>
SoftFloat implements these arithmetic operations for its floating-point types:
<UL>
<LI>
conversions between any two floating-point formats;
<LI>
for each floating-point format, conversions to and from signed and unsigned
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> integers;
<LI>
for each format, the usual addition, subtraction, multiplication, division, and
square root operations;
<LI>
for each format except <CODE>extFloat80_t</CODE>, the fused multiply-add
operation defined by the IEEE Standard;
<LI>
for each format, the floating-point remainder operation defined by the IEEE
Standard;
<LI>
for each format, a &ldquo;round to integer&rdquo; operation that rounds to the
nearest integer value in the same format; and
<LI>
comparisons between two values in the same floating-point format.
</UL>
</P>

<P>
The following operations required by the 2008 IEEE Floating-Point Standard are
not supported in SoftFloat <NOBR>Release 3a</NOBR>:
<UL>
<LI>
<B>nextUp</B>, <B>nextDown</B>, <B>minNum</B>, <B>maxNum</B>, <B>minNumMag</B>,
<B>maxNumMag</B>, <B>scaleB</B>, and <B>logB</B>;
<LI>
conversions between floating-point formats and decimal or hexadecimal character
sequences;
<LI>
all &ldquo;quiet-computation&rdquo; operations (<B>copy</B>, <B>negate</B>,
<B>abs</B>, and <B>copySign</B>, which all involve only simple copying and/or
manipulation of the floating-point sign bit); and
<LI>
all &ldquo;non-computational&rdquo; operations other than <B>isSignaling</B>
(which is supported).
</UL>
</P>

<H3>4.4. Non-canonical Representations in <CODE>extFloat80_t</CODE></H3>

<P>
Because the <NOBR>80-bit</NOBR> double-extended-precision format,
<CODE>extFloat80_t</CODE>, stores an explicit leading significand bit, many
floating-point numbers are encodable in this type in equivalent normalized and
denormalized forms.
Zeros and values in the subnormal range have each only a single possible
encoding, for which the leading significand bit must <NOBR>be 0</NOBR>.
For other finite values (outside the subnormal range), a unique normalized
representation, with leading significand bit set <NOBR>to 1</NOBR>, always
exists, and is considered the <I>canonical</I> representation of the value.
Any equivalent denormalized representations (having leading significand bit
<NOBR>of 0</NOBR>) are <I>non-canonical</I>.
Similarly, the leading significand bit is expected to <NOBR>be 1</NOBR> for
infinities and NaNs as well;
any infinity or NaN with a leading significand bit <NOBR>of 0</NOBR> is again
considered non-canonical.
In short, for an <CODE>extFloat80_t</CODE> representation to be canonical, the
leading significand bit must <NOBR>be 1</NOBR> unless it is required to
<NOBR>be 0</NOBR> because the encoded value is zero or a subnormal.
</P>
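
<P>
One possible reading of this rule can be sketched as a predicate on the two
fields of the in-memory representation.
The function name <CODE>demo_extF80_isCanonical</CODE> is hypothetical and is
not part of SoftFloat.
The sketch also assumes that an exponent field of zero combined with a leading
significand bit <NOBR>of 1</NOBR> counts as non-canonical, a case the text
above does not spell out.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not a SoftFloat function.
 * Treats an encoding as canonical when:
 *   - the exponent field is 0 (zero or subnormal) and the leading bit is 0, or
 *   - the exponent field is nonzero (normal, infinity, NaN) and the leading
 *     bit is 1.
 * Classifying "exponent field 0, leading bit 1" as non-canonical is an
 * assumption of this sketch.
 */
bool demo_extF80_isCanonical(uint16_t signExp, uint64_t signif)
{
    uint16_t exp = (uint16_t)(signExp & 0x7FFF);
    bool leadingBit = (signif >> 63) != 0;
    return (exp == 0) ? !leadingBit : leadingBit;
}
```

<P>
A check of this kind could be applied to values read from external data before
they are handed to functions that expect canonical inputs.
</P>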

<P>
Functions are not guaranteed to operate as expected when inputs of type
<CODE>extFloat80_t</CODE> are non-canonical.
Assuming all of a function&rsquo;s <CODE>extFloat80_t</CODE> inputs (if any)
are canonical, function outputs of type <CODE>extFloat80_t</CODE> will always
be canonical.
</P>

<H3>4.5. Conventions for Passing Arguments and Results</H3>

<P>
Values that are at most <NOBR>64 bits</NOBR> in size (i.e., not the
<NOBR>80-bit</NOBR> or <NOBR>128-bit</NOBR> floating-point formats) are in all
cases passed as function arguments by value.
Likewise, when an output of a function is no more than <NOBR>64 bits</NOBR>, it
is always returned directly as the function result.
Thus, for example, the SoftFloat function for adding two <NOBR>64-bit</NOBR>
floating-point values has this simple signature:
<BLOCKQUOTE>
<CODE>float64_t f64_add( float64_t, float64_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
The story is more complex when function inputs and outputs are
<NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point.
For these types, SoftFloat always provides a function that passes these larger
values into or out of the function indirectly, via pointers.
For example, for adding two <NOBR>128-bit</NOBR> floating-point values,
SoftFloat supplies this function:
<BLOCKQUOTE>
<CODE>void f128M_add( const float128_t *, const float128_t *, float128_t * );</CODE>
</BLOCKQUOTE>
The first two arguments point to the values to be added, and the last argument
points to the location where the sum will be stored.
The <CODE>M</CODE> in the name <CODE>f128M_add</CODE> is mnemonic for the fact
that the <NOBR>128-bit</NOBR> inputs and outputs are &ldquo;in memory&rdquo;,
pointed to by pointer arguments.
</P>

<P>
All ports of SoftFloat implement these <I>pass-by-pointer</I> functions for
types <CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE>.
At the same time, SoftFloat ports may also implement alternate versions of
these same functions that pass <CODE>extFloat80_t</CODE> and
<CODE>float128_t</CODE> by value, like the smaller formats.
Thus, besides the function with name <CODE>f128M_add</CODE> shown above, a
SoftFloat port may also supply an equivalent function with this signature:
<BLOCKQUOTE>
<CODE>float128_t f128_add( float128_t, float128_t );</CODE>
</BLOCKQUOTE>
</P>

<P>
As a general rule, on computers where the machine word size is
<NOBR>32 bits</NOBR> or smaller, only the pass-by-pointer versions of functions
(e.g., <CODE>f128M_add</CODE>) are provided for types <CODE>extFloat80_t</CODE>
and <CODE>float128_t</CODE>, because passing such large types directly can have
significant extra cost.
On computers where the word size is <NOBR>64 bits</NOBR> or larger, both
function versions (<CODE>f128M_add</CODE> and <CODE>f128_add</CODE>) are
provided, because the cost of passing by value is then more reasonable.
Applications that must be portable across both classes of computers must use
the pointer-based functions, as these are always implemented.
However, if it is known that SoftFloat includes the by-value functions for all
platforms of interest, programmers can use whichever version they prefer.
</P>
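
<P>
The relationship between the two calling styles can be illustrated with a
stand-in type.
In the sketch below, <CODE>demo128_t</CODE>, <CODE>demo128M_add</CODE>, and
<CODE>demo128_add</CODE> are hypothetical names, and the arithmetic is plain
unsigned <NOBR>128-bit</NOBR> integer addition, chosen only so that the
fragment compiles without SoftFloat.
It is not how <CODE>f128M_add</CODE> or <CODE>f128_add</CODE> are implemented;
it shows only the shape of the two signatures and how a by-value function could
be layered on a pass-by-pointer one.
</P>

```c
#include <stdint.h>

/* Hypothetical 128-bit stand-in; this is NOT SoftFloat's float128_t. */
typedef struct { uint64_t lo, hi; } demo128_t;

/*
 * Pass-by-pointer style, mirroring the shape of f128M_add: operands and
 * result live in memory.  The body is placeholder integer addition.
 */
void demo128M_add(const demo128_t *aPtr, const demo128_t *bPtr,
                  demo128_t *destPtr)
{
    uint64_t lo = aPtr->lo + bPtr->lo;
    uint64_t carry = (lo < aPtr->lo) ? 1 : 0;   /* unsigned wraparound test */
    uint64_t hi = aPtr->hi + bPtr->hi + carry;
    /* Results are stored only after both inputs have been read, so in this
       sketch destPtr may point at one of the inputs. */
    destPtr->lo = lo;
    destPtr->hi = hi;
}

/* Pass-by-value style, mirroring the shape of f128_add: a thin wrapper. */
demo128_t demo128_add(demo128_t a, demo128_t b)
{
    demo128_t z;
    demo128M_add(&a, &b, &z);
    return z;
}
```

<P>
Code written against the pointer-based form keeps working on ports that omit
the by-value form, which is why portable applications are steered toward it.
</P>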


<H2>5. Reserved Names</H2>

<P>
In addition to the variables and functions documented here, SoftFloat defines
some symbol names for its own private use.
These private names always begin with the prefix
&lsquo;<CODE>softfloat_</CODE>&rsquo;.
When a program includes header <CODE>softfloat.h</CODE> or links with the
SoftFloat library, all names with prefix &lsquo;<CODE>softfloat_</CODE>&rsquo;
are reserved for possible use by SoftFloat.
Applications that use SoftFloat should not define their own names with this
prefix, and should reference only such names as are documented.
</P>


<H2>6. Mode Variables</H2>

<P>
The following variables control rounding mode, underflow detection, and the
<NOBR>80-bit</NOBR> extended format&rsquo;s rounding precision:
<BLOCKQUOTE>
<CODE>softfloat_roundingMode</CODE><BR>
<CODE>softfloat_detectTininess</CODE><BR>
<CODE>extF80_roundingPrecision</CODE>
</BLOCKQUOTE>
These mode variables are covered in the next several subsections.
</P>

<H3>6.1. Rounding Mode</H3>

<P>
All five rounding modes defined by the 2008 IEEE Floating-Point Standard are
implemented for all operations that require rounding.
The rounding mode is selected by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_roundingMode;</CODE>
</BLOCKQUOTE>
This variable may be set to one of the values
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>softfloat_round_near_even</CODE></TD>
<TD>round to nearest, with ties to even</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_near_maxMag&nbsp;&nbsp;</CODE></TD>
<TD>round to nearest, with ties to maximum magnitude (away from zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_minMag</CODE></TD>
<TD>round to minimum magnitude (toward zero)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_min</CODE></TD>
<TD>round to minimum (down)</TD>
</TR>
<TR>
<TD><CODE>softfloat_round_max</CODE></TD>
<TD>round to maximum (up)</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
Variable <CODE>softfloat_roundingMode</CODE> is initialized to
<CODE>softfloat_round_near_even</CODE>.
</P>
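
<P>
To make the five directions concrete, here is a small self-contained sketch
that rounds the exact value <I>q</I>/4 to an integer under each mode.
The enumeration <CODE>demo_round</CODE> and the function
<CODE>demo_roundQuarters</CODE> are hypothetical; they only mirror the meaning
of the SoftFloat constants listed above and do not use or reproduce the
library&rsquo;s own values.
</P>

```c
/* Hypothetical mode names that mirror the meanings in the table above. */
enum demo_round {
    DEMO_ROUND_NEAR_EVEN,    /* nearest, ties to even             */
    DEMO_ROUND_NEAR_MAXMAG,  /* nearest, ties away from zero      */
    DEMO_ROUND_MINMAG,       /* toward zero                       */
    DEMO_ROUND_MIN,          /* down (toward negative infinity)   */
    DEMO_ROUND_MAX           /* up (toward positive infinity)     */
};

/* Round the exact value q/4 to an integer in the given direction. */
long demo_roundQuarters(long q, enum demo_round mode)
{
    long fl = q / 4;    /* C99 division truncates toward zero */
    long rem = q % 4;   /* remainder has the sign of q        */
    if (rem < 0) {      /* adjust so fl = floor(q/4) and 0 <= rem < 4 */
        fl -= 1;
        rem += 4;
    }
    if (rem == 0) return fl;    /* exact: every mode agrees */
    switch (mode) {
    case DEMO_ROUND_MIN:
        return fl;
    case DEMO_ROUND_MAX:
        return fl + 1;
    case DEMO_ROUND_MINMAG:
        return (q < 0) ? fl + 1 : fl;
    case DEMO_ROUND_NEAR_EVEN:
        if (rem < 2) return fl;
        if (rem > 2) return fl + 1;
        return (fl % 2 == 0) ? fl : fl + 1;   /* tie: pick the even neighbor */
    case DEMO_ROUND_NEAR_MAXMAG:
        if (rem < 2) return fl;
        if (rem > 2) return fl + 1;
        return (q < 0) ? fl : fl + 1;         /* tie: pick larger magnitude */
    }
    return fl;  /* not reached for valid modes */
}
```

<P>
For instance, with <I>q</I>&nbsp;=&nbsp;10 (the value 2.5), ties-to-even gives
2 while ties-to-maximum-magnitude gives 3; with <I>q</I>&nbsp;=&nbsp;-9 (the
value -2.25), rounding down gives -3 and rounding toward zero gives -2.
</P>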

<H3>6.2. Underflow Detection</H3>

<P>
In the terminology of the IEEE Standard, SoftFloat can detect tininess for
underflow either before or after rounding.
The choice is made by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_detectTininess;</CODE>
</BLOCKQUOTE>
which can be set to either
<BLOCKQUOTE>
<CODE>softfloat_tininess_beforeRounding</CODE><BR>
<CODE>softfloat_tininess_afterRounding</CODE>
</BLOCKQUOTE>
Detecting tininess after rounding is better because it results in fewer
spurious underflow signals.
The other option is provided for compatibility with some systems.
Like most systems (and as required by the newer 2008 IEEE Standard), SoftFloat
always detects loss of accuracy for underflow as an inexact result.
</P>

<H3>6.3. Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</H3>

<P>
For <CODE>extFloat80_t</CODE> only, the rounding precision of the basic
arithmetic operations is controlled by the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t extF80_roundingPrecision;</CODE>
</BLOCKQUOTE>
The operations affected are:
<BLOCKQUOTE>
<CODE>extF80_add</CODE><BR>
<CODE>extF80_sub</CODE><BR>
<CODE>extF80_mul</CODE><BR>
<CODE>extF80_div</CODE><BR>
<CODE>extF80_sqrt</CODE>
</BLOCKQUOTE>
When <CODE>extF80_roundingPrecision</CODE> is set to its default value of 80,
these operations are rounded to the full precision of the <NOBR>80-bit</NOBR>
double-extended-precision format, as occurs for other formats.
Setting <CODE>extF80_roundingPrecision</CODE> to 32 or to 64 causes the
operations listed to be rounded to <NOBR>32-bit</NOBR> precision (equivalent to
<CODE>float32_t</CODE>) or to <NOBR>64-bit</NOBR> precision (equivalent to
<CODE>float64_t</CODE>), respectively.
When rounding to reduced precision, additional bits in the result significand
beyond the rounding point are set to zero.
The consequences of setting <CODE>extF80_roundingPrecision</CODE> to a value
other than 32, 64, or 80 are not specified.
Operations other than the ones listed above are not affected by
<CODE>extF80_roundingPrecision</CODE>.
</P>
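
<P>
The effect of reduced rounding precision on the <NOBR>64-bit</NOBR>
significand can be sketched as follows.
The sketch assumes that &ldquo;equivalent to <CODE>float32_t</CODE>&rdquo; and
&ldquo;equivalent to <CODE>float64_t</CODE>&rdquo; mean keeping the top 24 and
53 significand bits, respectively (the significand widths of those formats),
and it handles only round-to-nearest-even.
The function <CODE>demo_roundSignif</CODE> is hypothetical, and adjusting the
exponent after a carry is left to the caller.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: round a 64-bit significand to its top `keepBits` bits
 * (expected values: 24, 53, or 64) using round-to-nearest-even, and clear
 * every bit below the rounding point.  If rounding carries out of bit 63,
 * *carryPtr is set and the significand is renormalized to 0x8000...0; the
 * caller would then need to increment the exponent.
 */
uint64_t demo_roundSignif(uint64_t sig, unsigned keepBits, bool *carryPtr)
{
    *carryPtr = false;
    if (keepBits >= 64) return sig;             /* full precision: no change */

    unsigned dropBits = 64 - keepBits;
    uint64_t unit = (uint64_t)1 << dropBits;    /* weight of last kept bit   */
    uint64_t dropped = sig & (unit - 1);        /* bits below rounding point */
    uint64_t kept = sig - dropped;              /* low bits already zeroed   */
    uint64_t half = unit >> 1;
    bool lastKeptIsOdd = (kept & unit) != 0;

    if (dropped > half || (dropped == half && lastKeptIsOdd)) {
        kept += unit;
        if (kept == 0) {                        /* wrapped past bit 63 */
            *carryPtr = true;
            kept = (uint64_t)1 << 63;
        }
    }
    return kept;
}
```

<P>
With <CODE>keepBits</CODE> of 24, the significand
<CODE>0x8000018000000000</CODE> is exactly halfway between two
<NOBR>24-bit</NOBR> values and rounds to the even neighbor
<CODE>0x8000020000000000</CODE>, with all lower bits zero.
</P>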


<H2>7. Exceptions and Exception Flags</H2>

<P>
All five exception flags required by the IEEE Floating-Point Standard are
implemented.
Each flag is stored as a separate bit in the global variable
<BLOCKQUOTE>
<CODE>uint_fast8_t softfloat_exceptionFlags;</CODE>
</BLOCKQUOTE>
The positions of the exception flag bits within this variable are determined by
the bit masks
<BLOCKQUOTE>
<CODE>softfloat_flag_inexact</CODE><BR>
<CODE>softfloat_flag_underflow</CODE><BR>
<CODE>softfloat_flag_overflow</CODE><BR>
<CODE>softfloat_flag_infinite</CODE><BR>
<CODE>softfloat_flag_invalid</CODE>
</BLOCKQUOTE>
Variable <CODE>softfloat_exceptionFlags</CODE> is initialized to all zeros,
meaning no exceptions.
</P>

<P>
An individual exception flag can be cleared with the statement
<BLOCKQUOTE>
<CODE>softfloat_exceptionFlags &amp;= ~softfloat_flag_&lt;<I>exception</I>&gt;;</CODE>
</BLOCKQUOTE>
where <CODE>&lt;<I>exception</I>&gt;</CODE> is the appropriate name.
To raise a floating-point exception, function <CODE>softfloat_raise</CODE>
should normally be used.
</P>

<P>
When SoftFloat detects an exception other than <I>inexact</I>, it calls
<CODE>softfloat_raise</CODE>.
The default version of this function simply raises the corresponding exception
flags.
Particular ports of SoftFloat may support alternate behavior, such as exception
traps, by modifying the default <CODE>softfloat_raise</CODE>.
A program may also supply its own <CODE>softfloat_raise</CODE> function to
override the one from the SoftFloat library.
</P>

<P>
Because inexact results occur frequently under most circumstances (and thus are
hardly exceptional), SoftFloat does not ordinarily call
<CODE>softfloat_raise</CODE> for <I>inexact</I> exceptions.
It does always raise the <I>inexact</I> exception flag as required.
</P>
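
<P>
The flag handling described in this section amounts to ordinary bit-mask
manipulation, which the following self-contained sketch imitates.
Everything prefixed <CODE>demo_</CODE> is a hypothetical stand-in: the mask
values are placeholders (the real ones come from <CODE>softfloat.h</CODE>), and
<CODE>demo_raise</CODE> models only the default behavior of setting flags, not
any port-specific trapping.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder masks; real values are defined by softfloat.h. */
enum {
    demo_flag_inexact   = 1,
    demo_flag_underflow = 2,
    demo_flag_overflow  = 4,
    demo_flag_infinite  = 8,
    demo_flag_invalid   = 16
};

/* Stand-in for softfloat_exceptionFlags; starts with no flags set. */
uint_fast8_t demo_exceptionFlags = 0;

/* Models the default raise behavior: set the requested flag bits. */
void demo_raise(uint_fast8_t flags)
{
    demo_exceptionFlags |= flags;
}

/* Report whether any flag in `mask` is set, then clear those flags. */
bool demo_testAndClear(uint_fast8_t mask)
{
    bool wasSet = (demo_exceptionFlags & mask) != 0;
    demo_exceptionFlags &= (uint_fast8_t)~mask;
    return wasSet;
}
```

<P>
A common pattern is to clear the flags of interest, run a computation, and then
test those flags; because flags are never cleared implicitly, they accumulate
across operations until the program clears them.
</P>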


<H2>8. Function Details</H2>

<P>
In this section, <CODE>&lt;<I>float</I>&gt;</CODE> appears in function names as
a substitute for one of these abbreviations:
<BLOCKQUOTE>
<TABLE CELLSPACING=0 CELLPADDING=0>
<TR>
<TD><CODE>f32</CODE></TD>
<TD>indicates <CODE>float32_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>f64</CODE></TD>
<TD>indicates <CODE>float64_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>extF80M&nbsp;&nbsp;&nbsp;</CODE></TD>
<TD>indicates <CODE>extFloat80_t</CODE>, passed indirectly via pointers</TD>
</TR>
<TR>
<TD><CODE>extF80</CODE></TD>
<TD>indicates <CODE>extFloat80_t</CODE>, passed by value</TD>
</TR>
<TR>
<TD><CODE>f128M</CODE></TD>
<TD>indicates <CODE>float128_t</CODE>, passed indirectly via pointers</TD>
</TR>
<TR>
<TD><CODE>f128</CODE></TD>
<TD>indicates <CODE>float128_t</CODE>, passed by value</TD>
</TR>
</TABLE>
</BLOCKQUOTE>
The circumstances under which values of floating-point types
<CODE>extFloat80_t</CODE> and <CODE>float128_t</CODE> may be passed either by
value or indirectly via pointers were discussed earlier in
<NOBR>section 4.5</NOBR>, <I>Conventions for Passing Arguments and Results</I>.
</P>

<H3>8.1. Conversions from Integer to Floating-Point</H3>

<P>
All conversions from a <NOBR>32-bit</NOBR> or <NOBR>64-bit</NOBR> integer,
signed or unsigned, to a floating-point format are supported.
Functions performing these conversions have these names:
<BLOCKQUOTE>
<CODE>ui32_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>ui64_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>i32_to_&lt;<I>float</I>&gt;</CODE><BR>
<CODE>i64_to_&lt;<I>float</I>&gt;</CODE>
</BLOCKQUOTE>
Conversions from <NOBR>32-bit</NOBR> integers to <NOBR>64-bit</NOBR>
double-precision and larger formats are always exact, and likewise conversions
from <NOBR>64-bit</NOBR> integers to <NOBR>80-bit</NOBR>
double-extended-precision and <NOBR>128-bit</NOBR> quadruple-precision are also
always exact.
</P>
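
<P>
The exactness claim follows from significand widths: a <NOBR>32-bit</NOBR>
integer needs at most 32 significant bits, which fit in the <NOBR>53-bit</NOBR>
significand of double-precision but not always in the <NOBR>24-bit</NOBR>
significand of single-precision.
The sketch below checks this using the host&rsquo;s <CODE>float</CODE> and
<CODE>double</CODE> as stand-ins, on the assumption that they are the IEEE
<NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR> binary formats; it does not call
SoftFloat, and both function names are hypothetical.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `a` survives conversion to the host's float unchanged.
   Assumes float is IEEE binary32 (24-bit significand). */
bool demo_i32_exactInBinary32(int32_t a)
{
    volatile float f = (float)a;     /* volatile forces rounding to float */
    return (double)f == (double)a;   /* compare in double, which holds both */
}

/* True if `a` survives conversion to the host's double unchanged.
   Assumes double is IEEE binary64 (53-bit significand). */
bool demo_i32_exactInBinary64(int32_t a)
{
    volatile double d = (double)a;
    return (int64_t)d == (int64_t)a;
}
```

<P>
The value 16777217 (2<SUP>24</SUP>&nbsp;+&nbsp;1) is the smallest positive
integer that single-precision cannot hold exactly, whereas every
<NOBR>32-bit</NOBR> integer passes the double-precision check.
</P>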

<P>
Each conversion function takes one input of the appropriate type and generates
one output.
The following illustrates the signatures of these functions in cases when the
floating-point result is passed either by value or via pointers:
<BLOCKQUOTE>
<PRE>
float64_t i32_to_f64( int32_t <I>a</I> );
</PRE>
<PRE>
void i32_to_f128M( int32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.2. Conversions from Floating-Point to Integer</H3>

<P>
Conversions from a floating-point format to a <NOBR>32-bit</NOBR> or
<NOBR>64-bit</NOBR> integer, signed or unsigned, are supported with these
functions:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64</CODE>
</BLOCKQUOTE>
The functions have signatures as follows, depending on whether the
floating-point input is passed by value or via pointers:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t
 f128M_to_i32( const float128_t *<I>aPtr</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode for
the conversion.
The variable that usually indicates rounding mode,
<CODE>softfloat_roundingMode</CODE>, is ignored.
Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
exception flag is raised if the conversion is not exact.
If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
be raised;
otherwise, it will not be, even if the conversion is inexact.
</P>

<P>
Conversions from floating-point to integer raise the <I>invalid</I> exception
if the source value cannot be rounded to a representable integer of the desired
size (32 or 64 bits).
In such a circumstance, if the floating-point input is a NaN or if the
conversion is to an unsigned integer type, the largest positive integer is
returned;
otherwise, the largest integer with the same sign as the input is returned.
The functions that convert to integer types never raise the <I>overflow</I>
exception.
</P>

<P>
Note that, when converting to an unsigned integer type, if the <I>invalid</I>
exception is raised because the input floating-point value would round to a
negative integer, the value returned is the <EM>maximum positive unsigned
integer</EM>.
Zero is not returned when the <I>invalid</I> exception is raised, even when
zero is the closest integer to the original floating-point value.
</P>
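
<P>
The rules for the value returned on an <I>invalid</I> conversion can be
modeled with a short sketch.
It uses the host&rsquo;s <CODE>double</CODE> in place of
<CODE>float64_t</CODE>, rounds toward zero only, omits the
<CODE><I>exact</I></CODE> argument, and reports <I>invalid</I> through an
output parameter instead of the exception flags.
Both function names are hypothetical, and the code is a model of the documented
behavior, not SoftFloat&rsquo;s implementation.
</P>

```c
#include <math.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of a signed 32-bit conversion that rounds toward zero. */
int32_t demo_f64_to_i32_r_minMag(double a, bool *invalidPtr)
{
    *invalidPtr = false;
    if (isnan(a)) {                  /* NaN: largest positive integer */
        *invalidPtr = true;
        return INT32_MAX;
    }
    if (a >= 2147483648.0) {         /* too large: largest positive integer */
        *invalidPtr = true;
        return INT32_MAX;
    }
    if (a <= -2147483649.0) {        /* too negative: largest negative integer */
        *invalidPtr = true;
        return INT32_MIN;
    }
    return (int32_t)a;               /* in range; C truncates toward zero */
}

/* Model of an unsigned 32-bit conversion that rounds toward zero. */
uint32_t demo_f64_to_ui32_r_minMag(double a, bool *invalidPtr)
{
    *invalidPtr = false;
    /* NaN, too large, or rounding to a negative integer: all return the
       maximum unsigned value, never zero. */
    if (isnan(a) || a >= 4294967296.0 || a <= -1.0) {
        *invalidPtr = true;
        return UINT32_MAX;
    }
    return (uint32_t)a;              /* values in (-1, 0) truncate to 0 */
}
```

<P>
Under this model, converting -5.0 to an unsigned integer reports
<I>invalid</I> and yields the maximum unsigned value, while converting -0.5
yields 0 without <I>invalid</I>, because -0.5 rounds toward zero to a
representable integer.
</P>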

<P>
Because languages such <NOBR>as C</NOBR> require that conversions to integers
be rounded toward zero, the following functions are provided for improved speed
and convenience:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_ui32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_ui64_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i32_r_minMag</CODE><BR>
<CODE>&lt;<I>float</I>&gt;_to_i64_r_minMag</CODE>
</BLOCKQUOTE>
These functions round only toward zero (to minimum magnitude).
The signatures for these functions are the same as above without the redundant
<CODE><I>roundingMode</I></CODE> argument:
<BLOCKQUOTE>
<PRE>
int_fast32_t f64_to_i32_r_minMag( float64_t <I>a</I>, bool <I>exact</I> );
</PRE>
<PRE>
int_fast32_t f128M_to_i32_r_minMag( const float128_t *<I>aPtr</I>, bool <I>exact</I> );
</PRE>
</BLOCKQUOTE>
</P>

<H3>8.3. Conversions Among Floating-Point Types</H3>

<P>
Conversions between floating-point formats are done by functions with these
names:
<BLOCKQUOTE>
<CODE>&lt;<I>float</I>&gt;_to_&lt;<I>float</I>&gt;</CODE>
</BLOCKQUOTE>
All combinations of source and result type are supported where the source and
result are different formats.
There are four different styles of signature for these functions, depending on
whether the input and the output floating-point values are passed by value or
via pointers:
<BLOCKQUOTE>
<PRE>
float32_t f64_to_f32( float64_t <I>a</I> );
</PRE>
<PRE>
float32_t f128M_to_f32( const float128_t *<I>aPtr</I> );
</PRE>
<PRE>
void f32_to_f128M( float32_t <I>a</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
void extF80M_to_f128M( const extFloat80_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
</P>

<P>
Conversions from a smaller to a larger floating-point format are always exact
and so require no rounding.
</P>

832<H3>8.4. Basic Arithmetic Functions</H3>
833
834<P>
835The following basic arithmetic functions are provided:
836<BLOCKQUOTE>
837<CODE>&lt;<I>float</I>&gt;_add</CODE><BR>
838<CODE>&lt;<I>float</I>&gt;_sub</CODE><BR>
839<CODE>&lt;<I>float</I>&gt;_mul</CODE><BR>
840<CODE>&lt;<I>float</I>&gt;_div</CODE><BR>
841<CODE>&lt;<I>float</I>&gt;_sqrt</CODE>
842</BLOCKQUOTE>
Each floating-point operation takes two operands, except for <CODE>sqrt</CODE>
(square root), which takes only one.
845The operands and result are all of the same floating-point format.
846Signatures for these functions take the following forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_add( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_add(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
<PRE>
float64_t f64_sqrt( float64_t <I>a</I> );
</PRE>
<PRE>
void f128M_sqrt( const float128_t *<I>aPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
863When floating-point values are passed indirectly through pointers, arguments
864<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to the input
865operands, and the last argument, <CODE><I>destPtr</I></CODE>, points to the
866location where the result is stored.
867</P>
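<P>
The pointer-passing convention can be sketched as follows.
The helper <CODE>demo_f128M_rootSumSquares</CODE> is hypothetical and not part
of SoftFloat; the example assumes that the library header is available as
<CODE>softfloat.h</CODE> and that <CODE>f128M_mul</CODE> has the same form of
signature as <CODE>f128M_add</CODE> above.
</P>

```c
#include "softfloat.h"

/* Hypothetical helper, not part of SoftFloat: stores sqrt(a*a + b*b) at
   *destPtr.  Each of the four operations is rounded separately. */
void demo_f128M_rootSumSquares(
    const float128_t *aPtr, const float128_t *bPtr, float128_t *destPtr )
{
    float128_t aSquared, bSquared, sum;

    f128M_mul( aPtr, aPtr, &aSquared );       /* aSquared = a * a */
    f128M_mul( bPtr, bPtr, &bSquared );       /* bSquared = b * b */
    f128M_add( &aSquared, &bSquared, &sum );  /* sum = a*a + b*b  */
    f128M_sqrt( &sum, destPtr );              /* result = sqrt(sum) */
}
```

<P>
Intermediate values live in local variables, and only their addresses are
handed to the library, so no <NOBR>128-bit</NOBR> value is passed by value.
</P>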
868
869<P>
870Rounding of the <NOBR>80-bit</NOBR> double-extended-precision
871(<CODE>extFloat80_t</CODE>) functions is affected by variable
872<CODE>extF80_roundingPrecision</CODE>, as explained earlier in
873<NOBR>section 6.3</NOBR>,
874<I>Rounding Precision for the <NOBR>80-Bit</NOBR> Extended Format</I>.
875</P>
876
877<H3>8.5. Fused Multiply-Add Functions</H3>
878
879<P>
880The 2008 version of the IEEE Floating-Point Standard defines a <I>fused
881multiply-add</I> operation that does a combined multiplication and addition
882with only a single rounding.
883SoftFloat implements fused multiply-add with functions
884<BLOCKQUOTE>
885<CODE>&lt;<I>float</I>&gt;_mulAdd</CODE>
886</BLOCKQUOTE>
Unlike other operations, fused multiply-add is supported only for the
non-extended formats, <CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and
<CODE>float128_t</CODE>.
No fused multiply-add function is currently provided for the
<NOBR>80-bit</NOBR> double-extended-precision type, <CODE>extFloat80_t</CODE>.
892</P>
893
894<P>
895Depending on whether floating-point values are passed by value or via pointers,
896the fused multiply-add functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_mulAdd( float64_t <I>a</I>, float64_t <I>b</I>, float64_t <I>c</I> );
</PRE>
<PRE>
void
 f128M_mulAdd(
     const float128_t *<I>aPtr</I>,
     const float128_t *<I>bPtr</I>,
     const float128_t *<I>cPtr</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
911The functions compute
912<NOBR>(<CODE><I>a</I></CODE> &times; <CODE><I>b</I></CODE>)
913 + <CODE><I>c</I></CODE></NOBR>
914with a single rounding.
915When floating-point values are passed indirectly through pointers, arguments
916<CODE><I>aPtr</I></CODE>, <CODE><I>bPtr</I></CODE>, and
917<CODE><I>cPtr</I></CODE> point to operands <CODE><I>a</I></CODE>,
918<CODE><I>b</I></CODE>, and <CODE><I>c</I></CODE> respectively, and
919<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
920</P>
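<P>
The difference between one rounding and two can be seen with host
<NOBR>32-bit</NOBR> arithmetic, assuming the host's <CODE>float</CODE> and
<CODE>double</CODE> are the <NOBR>32-bit</NOBR> and <NOBR>64-bit</NOBR>
formats and the default round-to-nearest-even mode is in effect.
Both function names below are hypothetical.
The second function is only a stand-in that happens to be exact for the inputs
discussed afterward; it is not a general fused multiply-add and does not call
SoftFloat.
</P>

```c
/* Rounds twice: once after the multiplication and once after the addition. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;  /* volatile forces rounding to float here */
    return product + c;
}

/* Stand-in for a single-rounding result.  The product of two floats always
   fits exactly in a double; the sum is also exact for the inputs used in this
   illustration, so only the final cast rounds.  NOT a general fused
   multiply-add. */
float single_rounding_demo(float a, float b, float c)
{
    double exact = (double)a * (double)b + (double)c;
    return (float)exact;
}
```

<P>
With <CODE><I>a</I></CODE> = <CODE><I>b</I></CODE> = 1 + 2<SUP>-12</SUP> and
<CODE><I>c</I></CODE> = &minus;(1 + 2<SUP>-11</SUP>), the exact product is
1 + 2<SUP>-11</SUP> + 2<SUP>-24</SUP>.
Rounding the product first discards the 2<SUP>-24</SUP> term, so
<CODE>mul_then_add</CODE> returns zero, whereas the single-rounding result is
2<SUP>-24</SUP>.
</P>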
921
922<P>
923If one of the multiplication operands <CODE><I>a</I></CODE> and
924<CODE><I>b</I></CODE> is infinite and the other is zero, these functions raise
925the invalid exception even if operand <CODE><I>c</I></CODE> is a quiet NaN.
926</P>
927
928<H3>8.6. Remainder Functions</H3>
929
930<P>
931For each format, SoftFloat implements the remainder operation defined by the
932IEEE Floating-Point Standard.
933The remainder functions have names
934<BLOCKQUOTE>
935<CODE>&lt;<I>float</I>&gt;_rem</CODE>
936</BLOCKQUOTE>
937Each remainder operation takes two floating-point operands of the same format
938and returns a result in the same format.
939Depending on whether floating-point values are passed by value or via pointers,
940the remainder functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
float64_t f64_rem( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
void
 f128M_rem(
     const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I>, float128_t *<I>destPtr</I> );
</PRE>
</BLOCKQUOTE>
951When floating-point values are passed indirectly through pointers, arguments
952<CODE><I>aPtr</I></CODE> and <CODE><I>bPtr</I></CODE> point to operands
953<CODE><I>a</I></CODE> and <CODE><I>b</I></CODE> respectively, and
954<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
955</P>
956
957<P>
958The IEEE Standard remainder operation computes the value
959<NOBR><CODE><I>a</I></CODE>
960 &minus; <I>n</I> &times; <CODE><I>b</I></CODE></NOBR>,
961where <I>n</I> is the integer closest to
962<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
963If <NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR> is exactly
964halfway between two integers, <I>n</I> is the <EM>even</EM> integer closest to
965<NOBR><CODE><I>a</I></CODE> &divide; <CODE><I>b</I></CODE></NOBR>.
966The IEEE Standard&rsquo;s remainder operation is always exact and so requires
967no rounding.
968</P>
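<P>
The choice of <I>n</I> can be sketched with integer operands, for which all the
arithmetic stays exact.
The function names below are hypothetical and are not SoftFloat functions; the
sketch assumes <CODE><I>b</I></CODE> is nonzero and that the values are small
enough not to overflow.
</P>

```c
#include <stdint.h>

/* Integer nearest to a / b, with ties going to the even integer.
   Requires b != 0. */
int64_t nearest_even_quotient(int64_t a, int64_t b)
{
    int64_t q = a / b;                 /* quotient truncated toward zero */
    int64_t r = a % b;                 /* leftover, |r| < |b| */
    int64_t twice_r = 2 * (r < 0 ? -r : r);
    int64_t abs_b = b < 0 ? -b : b;

    /* Step q away from zero when the discarded fraction exceeds one half,
       or equals one half while q is odd (tie goes to the even neighbor). */
    if (twice_r > abs_b || (twice_r == abs_b && q % 2 != 0)) {
        q += ((a < 0) != (b < 0)) ? -1 : 1;
    }
    return q;
}

/* Remainder in the IEEE style: a - n * b with n chosen as above. */
int64_t ieee_style_rem(int64_t a, int64_t b)
{
    return a - nearest_even_quotient(a, b) * b;
}
```

<P>
For example, 5 &divide; 2 is exactly halfway between 2 and 3, so <I>n</I> is 2
and the remainder is 1; 7 &divide; 2 is halfway between 3 and 4, so <I>n</I>
is 4 and the remainder is &minus;1.
Unlike C&rsquo;s <CODE>%</CODE> operator, the result can be negative even when
both operands are positive.
</P>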
969
970<P>
971Depending on the relative magnitudes of the operands, the remainder
972functions can take considerably longer to execute than the other SoftFloat
973functions.
974This is inherent in the remainder operation itself and is not a flaw in the
975SoftFloat implementation.
976</P>
977
978<H3>8.7. Round-to-Integer Functions</H3>
979
980<P>
981For each format, SoftFloat implements the round-to-integer operation specified
982by the IEEE Floating-Point Standard.
983These functions are named
984<BLOCKQUOTE>
985<CODE>&lt;<I>float</I>&gt;_roundToInt</CODE>
986</BLOCKQUOTE>
987Each round-to-integer operation takes a single floating-point operand.
988This operand is rounded to an integer according to a specified rounding mode,
989and the resulting integer value is returned in the same floating-point format.
990(Note that the result is not an integer type.)
991</P>
992
993<P>
994The signatures of the round-to-integer functions are similar to those for
995conversions to an integer type:
<BLOCKQUOTE>
<PRE>
float64_t f64_roundToInt( float64_t <I>a</I>, uint_fast8_t <I>roundingMode</I>, bool <I>exact</I> );
</PRE>
<PRE>
void
 f128M_roundToInt(
     const float128_t *<I>aPtr</I>,
     uint_fast8_t <I>roundingMode</I>,
     bool <I>exact</I>,
     float128_t *<I>destPtr</I>
 );
</PRE>
</BLOCKQUOTE>
1010The <CODE><I>roundingMode</I></CODE> argument specifies the rounding mode to
1011apply.
1012The variable that usually indicates rounding mode,
1013<CODE>softfloat_roundingMode</CODE>, is ignored.
1014Argument <CODE><I>exact</I></CODE> determines whether the <I>inexact</I>
1015exception flag is raised if the conversion is not exact.
1016If <CODE><I>exact</I></CODE> is <CODE>true</CODE>, the <I>inexact</I> flag may
1017be raised;
1018otherwise, it will not be, even if the conversion is inexact.
1019When floating-point values are passed indirectly through pointers,
1020<CODE><I>aPtr</I></CODE> points to the input operand and
1021<CODE><I>destPtr</I></CODE> points to the location where the result is stored.
1022</P>
1023
1024<H3>8.8. Comparison Functions</H3>
1025
1026<P>
1027For each format, the following floating-point comparison functions are
1028provided:
1029<BLOCKQUOTE>
1030<CODE>&lt;<I>float</I>&gt;_eq</CODE><BR>
1031<CODE>&lt;<I>float</I>&gt;_le</CODE><BR>
1032<CODE>&lt;<I>float</I>&gt;_lt</CODE>
1033</BLOCKQUOTE>
1034Each comparison takes two operands of the same type and returns a Boolean.
1035The abbreviation <CODE>eq</CODE> stands for &ldquo;equal&rdquo; (=);
1036<CODE>le</CODE> stands for &ldquo;less than or equal&rdquo; (&le;);
1037and <CODE>lt</CODE> stands for &ldquo;less than&rdquo; (&lt;).
1038Depending on whether the floating-point operands are passed by value or via
1039pointers, the comparison functions have signatures of these forms:
<BLOCKQUOTE>
<PRE>
bool f64_eq( float64_t <I>a</I>, float64_t <I>b</I> );
</PRE>
<PRE>
bool f128M_eq( const float128_t *<I>aPtr</I>, const float128_t *<I>bPtr</I> );
</PRE>
</BLOCKQUOTE>
1048</P>
1049
1050<P>
1051The usual greater-than (&gt;), greater-than-or-equal (&ge;), and not-equal
1052(&ne;) comparisons are easily obtained from the functions provided.
1053The not-equal function is just the logical complement of the equal function.
1054The greater-than-or-equal function is identical to the less-than-or-equal
1055function with the arguments in reverse order, and likewise the greater-than
1056function is identical to the less-than function with the arguments reversed.
1057</P>
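<P>
The derivations just described might be written as below.
The <CODE>demo_</CODE> names and the <CODE>USE_SOFTFLOAT</CODE> macro are
hypothetical.
So that the sketch compiles without the library, it supplies stand-ins for
<CODE>f64_eq</CODE>, <CODE>f64_le</CODE>, and <CODE>f64_lt</CODE> that mimic
only the Boolean results using the host&rsquo;s <CODE>double</CODE> (assumed to
be the <NOBR>64-bit</NOBR> format) and raise no SoftFloat exception flags.
The stand-in definition of <CODE>float64_t</CODE> as a structure holding a
<NOBR>64-bit</NOBR> integer is likewise an assumption of this sketch.
</P>

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#ifdef USE_SOFTFLOAT            /* hypothetical macro for this sketch only */
#include "softfloat.h"
#else
/* Stand-ins so the sketch compiles without the library.  They reproduce only
   the Boolean results, using the host's double, and raise no flags. */
typedef struct { uint64_t v; } float64_t;

static double host_value(float64_t a)
{
    double d;
    memcpy(&d, &a.v, sizeof d);   /* reinterpret the 64 bits as a double */
    return d;
}

bool f64_eq(float64_t a, float64_t b) { return host_value(a) == host_value(b); }
bool f64_le(float64_t a, float64_t b) { return host_value(a) <= host_value(b); }
bool f64_lt(float64_t a, float64_t b) { return host_value(a) <  host_value(b); }
#endif

/* Derived comparisons, following the rules in the text. */
bool demo_f64_ne(float64_t a, float64_t b) { return !f64_eq(a, b); }
bool demo_f64_ge(float64_t a, float64_t b) { return f64_le(b, a); }
bool demo_f64_gt(float64_t a, float64_t b) { return f64_lt(b, a); }
```

<P>
Note that greater-than-or-equal is obtained by swapping the arguments of
less-than-or-equal, not by negating less-than: when either operand is a NaN,
both less-than and greater-than-or-equal are false.
</P>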
1058
1059<P>
1060The IEEE Floating-Point Standard specifies that the less-than-or-equal and
1061less-than comparisons by default raise the <I>invalid</I> exception if either
1062operand is any kind of NaN.
1063Equality comparisons, on the other hand, are defined by default to raise the
1064<I>invalid</I> exception only for signaling NaNs, not quiet NaNs.
1065For completeness, SoftFloat provides these complementary functions:
1066<BLOCKQUOTE>
1067<CODE>&lt;<I>float</I>&gt;_eq_signaling</CODE><BR>
1068<CODE>&lt;<I>float</I>&gt;_le_quiet</CODE><BR>
1069<CODE>&lt;<I>float</I>&gt;_lt_quiet</CODE>
1070</BLOCKQUOTE>
1071The <CODE>signaling</CODE> equality comparisons are identical to the default
1072equality comparisons except that the <I>invalid</I> exception is raised for any
1073NaN input, not just for signaling NaNs.
1074Similarly, the <CODE>quiet</CODE> comparison functions are identical to their
1075default counterparts except that the <I>invalid</I> exception is not raised for
1076quiet NaNs.
1077</P>
1078
1079<H3>8.9. Signaling NaN Test Functions</H3>
1080
1081<P>
1082Functions for testing whether a floating-point value is a signaling NaN are
1083provided with these names:
1084<BLOCKQUOTE>
1085<CODE>&lt;<I>float</I>&gt;_isSignalingNaN</CODE>
1086</BLOCKQUOTE>
1087The functions take one floating-point operand and return a Boolean indicating
1088whether the operand is a signaling NaN.
1089Accordingly, the functions have the forms
<BLOCKQUOTE>
<PRE>
bool f64_isSignalingNaN( float64_t <I>a</I> );
</PRE>
<PRE>
bool f128M_isSignalingNaN( const float128_t *<I>aPtr</I> );
</PRE>
</BLOCKQUOTE>
1098</P>
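<P>
For orientation only, the sketch below shows what such a test can look like at
the bit level for the <NOBR>64-bit</NOBR> format.
It assumes the common convention in which a NaN is signaling when the most
significant fraction bit is zero; some architectures use the opposite
convention, and the test a given SoftFloat port actually applies depends on how
that port is configured.
The function name is hypothetical.
</P>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper operating on the raw 64-bit encoding.  Assumes the
   convention "most significant fraction bit clear means signaling". */
bool demo_is_signaling_nan_bits64(uint64_t bits)
{
    /* Exponent field all ones and the most significant fraction bit clear. */
    bool exponent_max_quiet_bit_clear =
        (bits & UINT64_C(0x7FF8000000000000)) == UINT64_C(0x7FF0000000000000);

    /* Some lower fraction bit set, so the value is a NaN, not an infinity. */
    bool lower_fraction_nonzero =
        (bits & UINT64_C(0x0007FFFFFFFFFFFF)) != 0;

    return exponent_max_quiet_bit_clear && lower_fraction_nonzero;
}
```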
1099
1100<H3>8.10. Raise-Exception Function</H3>
1101
1102<P>
1103SoftFloat provides a single function for raising floating-point exceptions:
<BLOCKQUOTE>
<PRE>
void softfloat_raise( uint_fast8_t <I>exceptions</I> );
</PRE>
</BLOCKQUOTE>
1109The <CODE><I>exceptions</I></CODE> argument is a mask indicating the set of
1110exceptions to raise.
1111(See earlier section 7, <I>Exceptions and Exception Flags</I>.)
1112In addition to setting the specified exception flags in variable
1113<CODE>softfloat_exceptionFlags</CODE>, the <CODE>softfloat_raise</CODE>
1114function may cause a trap or abort appropriate for the current system.
1115</P>
1116
1117
1118<H2>9. Changes from SoftFloat <NOBR>Release 2</NOBR></H2>
1119
1120<P>
1121Apart from a change in the legal use license, <NOBR>Release 3</NOBR> of
1122SoftFloat introduced numerous technical differences compared to earlier
1123releases.
1124</P>
1125
1126<H3>9.1. Name Changes</H3>
1127
1128<P>
1129The most obvious and pervasive difference compared to <NOBR>Release 2</NOBR>
1130is that the names of most functions and variables have changed, even when the
1131behavior has not.
1132First, the floating-point types, the mode variables, the exception flags
1133variable, the function to raise exceptions, and various associated constants
1134have been renamed as follows:
1135<BLOCKQUOTE>
1136<TABLE>
1137<TR>
1138<TD>old name, Release 2:</TD>
1139<TD>new name, Release 3:</TD>
1140</TR>
1141<TR>
1142<TD><CODE>float32</CODE></TD>
1143<TD><CODE>float32_t</CODE></TD>
1144</TR>
1145<TR>
1146<TD><CODE>float64</CODE></TD>
1147<TD><CODE>float64_t</CODE></TD>
1148</TR>
1149<TR>
1150<TD><CODE>floatx80</CODE></TD>
1151<TD><CODE>extFloat80_t</CODE></TD>
1152</TR>
1153<TR>
1154<TD><CODE>float128</CODE></TD>
1155<TD><CODE>float128_t</CODE></TD>
1156</TR>
1157<TR>
1158<TD><CODE>float_rounding_mode</CODE></TD>
1159<TD><CODE>softfloat_roundingMode</CODE></TD>
1160</TR>
1161<TR>
1162<TD><CODE>float_round_nearest_even</CODE></TD>
1163<TD><CODE>softfloat_round_near_even</CODE></TD>
1164</TR>
1165<TR>
1166<TD><CODE>float_round_to_zero</CODE></TD>
1167<TD><CODE>softfloat_round_minMag</CODE></TD>
1168</TR>
1169<TR>
1170<TD><CODE>float_round_down</CODE></TD>
1171<TD><CODE>softfloat_round_min</CODE></TD>
1172</TR>
1173<TR>
1174<TD><CODE>float_round_up</CODE></TD>
1175<TD><CODE>softfloat_round_max</CODE></TD>
1176</TR>
1177<TR>
1178<TD><CODE>float_detect_tininess</CODE></TD>
1179<TD><CODE>softfloat_detectTininess</CODE></TD>
1180</TR>
1181<TR>
1182<TD><CODE>float_tininess_before_rounding&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
1183<TD><CODE>softfloat_tininess_beforeRounding</CODE></TD>
1184</TR>
1185<TR>
1186<TD><CODE>float_tininess_after_rounding</CODE></TD>
1187<TD><CODE>softfloat_tininess_afterRounding</CODE></TD>
1188</TR>
1189<TR>
1190<TD><CODE>floatx80_rounding_precision</CODE></TD>
1191<TD><CODE>extF80_roundingPrecision</CODE></TD>
1192</TR>
1193<TR>
1194<TD><CODE>float_exception_flags</CODE></TD>
1195<TD><CODE>softfloat_exceptionFlags</CODE></TD>
1196</TR>
1197<TR>
1198<TD><CODE>float_flag_inexact</CODE></TD>
1199<TD><CODE>softfloat_flag_inexact</CODE></TD>
1200</TR>
1201<TR>
1202<TD><CODE>float_flag_underflow</CODE></TD>
1203<TD><CODE>softfloat_flag_underflow</CODE></TD>
1204</TR>
1205<TR>
1206<TD><CODE>float_flag_overflow</CODE></TD>
1207<TD><CODE>softfloat_flag_overflow</CODE></TD>
1208</TR>
1209<TR>
1210<TD><CODE>float_flag_divbyzero</CODE></TD>
1211<TD><CODE>softfloat_flag_infinite</CODE></TD>
1212</TR>
1213<TR>
1214<TD><CODE>float_flag_invalid</CODE></TD>
1215<TD><CODE>softfloat_flag_invalid</CODE></TD>
1216</TR>
1217<TR>
1218<TD><CODE>float_raise</CODE></TD>
1219<TD><CODE>softfloat_raise</CODE></TD>
1220</TR>
1221</TABLE>
1222</BLOCKQUOTE>
1223</P>
1224
1225<P>
1226Furthermore, <NOBR>Release 3</NOBR> adopted the following new abbreviations for
1227function names:
1228<BLOCKQUOTE>
1229<TABLE>
1230<TR>
1231<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
1232<TD>used in names in Release 3:</TD>
1233</TR>
1234<TR> <TD><CODE>int32</CODE></TD>    <TD><CODE>i32</CODE></TD>    </TR>
1235<TR> <TD><CODE>int64</CODE></TD>    <TD><CODE>i64</CODE></TD>    </TR>
1236<TR> <TD><CODE>float32</CODE></TD>  <TD><CODE>f32</CODE></TD>    </TR>
1237<TR> <TD><CODE>float64</CODE></TD>  <TD><CODE>f64</CODE></TD>    </TR>
1238<TR> <TD><CODE>floatx80</CODE></TD> <TD><CODE>extF80</CODE></TD> </TR>
1239<TR> <TD><CODE>float128</CODE></TD> <TD><CODE>f128</CODE></TD>   </TR>
1240</TABLE>
1241</BLOCKQUOTE>
1242Thus, for example, the function to add two <NOBR>32-bit</NOBR> floating-point
1243numbers, previously called <CODE>float32_add</CODE> in <NOBR>Release 2</NOBR>,
1244is now <CODE>f32_add</CODE>.
1245Lastly, there have been a few other changes to function names:
1246<BLOCKQUOTE>
1247<TABLE>
1248<TR>
1249<TD>used in names in Release 2:<CODE>&nbsp;&nbsp;&nbsp;</CODE></TD>
1250<TD>used in names in Release 3:<CODE>&nbsp;&nbsp;&nbsp;</CODE></TD>
1251<TD>relevant functions:</TD>
1252</TR>
1253<TR>
1254<TD><CODE>_round_to_zero</CODE></TD>
1255<TD><CODE>_r_minMag</CODE></TD>
1256<TD>conversions from floating-point to integer (<NOBR>section 8.2</NOBR>)</TD>
1257</TR>
1258<TR>
1259<TD><CODE>round_to_int</CODE></TD>
1260<TD><CODE>roundToInt</CODE></TD>
1261<TD>round-to-integer functions (<NOBR>section 8.7</NOBR>)</TD>
1262</TR>
1263<TR>
1264<TD><CODE>is_signaling_nan&nbsp;&nbsp;&nbsp;&nbsp;</CODE></TD>
1265<TD><CODE>isSignalingNaN</CODE></TD>
1266<TD>signaling NaN test functions (<NOBR>section 8.9</NOBR>)</TD>
1267</TR>
1268</TABLE>
1269</BLOCKQUOTE>
1270</P>
1271
1272<H3>9.2. Changes to Function Arguments</H3>
1273
1274<P>
1275Besides simple name changes, some operations were given a different interface
1276in <NOBR>Release 3</NOBR> than they had in <NOBR>Release 2</NOBR>:
1277<UL>
1278
1279<LI>
1280<P>
1281Since <NOBR>Release 3</NOBR>, integer arguments and results of functions have
1282standard types from header <CODE>&lt;stdint.h&gt;</CODE>, such as
1283<CODE>uint32_t</CODE>, whereas previously their types could be defined
1284differently for each port of SoftFloat, usually using traditional C types such
1285as <CODE>unsigned</CODE> <CODE>int</CODE>.
1286Likewise, functions in <NOBR>Release 3</NOBR> and later pass Booleans as
1287standard type <CODE>bool</CODE> from <CODE>&lt;stdbool.h&gt;</CODE>, whereas
1288previously these were again passed as a port-specific type (usually
1289<CODE>int</CODE>).
1290</P>
1291
1292<LI>
1293<P>
1294As explained earlier in <NOBR>section 4.5</NOBR>, <I>Conventions for Passing
1295Arguments and Results</I>, SoftFloat functions in <NOBR>Release 3</NOBR> and
1296later may pass <NOBR>80-bit</NOBR> and <NOBR>128-bit</NOBR> floating-point
1297values through pointers, meaning that functions take pointer arguments and then
1298read or write floating-point values at the locations indicated by the pointers.
1299In <NOBR>Release 2</NOBR>, floating-point arguments and results were always
1300passed by value, regardless of their size.
1301</P>
1302
1303<LI>
1304<P>
1305Functions that round to an integer have additional
1306<CODE><I>roundingMode</I></CODE> and <CODE><I>exact</I></CODE> arguments that
1307they did not have in <NOBR>Release 2</NOBR>.
Refer to sections 8.2 <NOBR>and 8.7</NOBR> for descriptions of these functions
as they exist since <NOBR>Release 3</NOBR>.
1310For <NOBR>Release 2</NOBR>, the rounding mode, when needed, was taken from the
1311same global variable that affects the basic arithmetic operations (now called
1312<CODE>softfloat_roundingMode</CODE> but previously known as
1313<CODE>float_rounding_mode</CODE>).
1314Also, for <NOBR>Release 2</NOBR>, if the original floating-point input was not
1315an exact integer value, and if the <I>invalid</I> exception was not raised by
1316the function, the <I>inexact</I> exception was always raised.
1317<NOBR>Release 2</NOBR> had no option to suppress raising <I>inexact</I> in this
1318case.
1319Applications using SoftFloat <NOBR>Release 3</NOBR> or later can get the same
1320effect as <NOBR>Release 2</NOBR> by passing variable
1321<CODE>softfloat_roundingMode</CODE> for argument
1322<CODE><I>roundingMode</I></CODE> and <CODE>true</CODE> for argument
1323<CODE><I>exact</I></CODE>.
1324</P>
1325
1326</UL>
1327</P>
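<P>
The last point above can be sketched as a pair of thin wrappers.
The wrapper names are hypothetical and not part of SoftFloat; the example
assumes the library header is available as <CODE>softfloat.h</CODE> and that
<CODE>f64_to_i32</CODE> takes the <CODE><I>roundingMode</I></CODE> and
<CODE><I>exact</I></CODE> arguments in the order described in
<NOBR>section 8.2</NOBR>.
</P>

```c
#include <stdbool.h>
#include <stdint.h>
#include "softfloat.h"

/* Hypothetical wrappers reproducing the Release 2 behavior: the rounding
   mode comes from the global mode variable, and inexact is raised whenever
   the input is not already an integer value. */
static inline int_fast32_t demo_r2style_f64_to_i32( float64_t a )
{
    return f64_to_i32( a, softfloat_roundingMode, true );
}

static inline float64_t demo_r2style_f64_roundToInt( float64_t a )
{
    return f64_roundToInt( a, softfloat_roundingMode, true );
}
```

<P>
Code being ported from <NOBR>Release 2</NOBR> could call wrappers of this kind
where it previously relied on the implicit rounding mode.
</P>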
1328
1329<H3>9.3. Added Capabilities</H3>
1330
1331<P>
1332With <NOBR>Release 3</NOBR>, some new features have been added that were not
1333present in <NOBR>Release 2</NOBR>:
1334<UL>
1335
1336<LI>
1337<P>
1338A port of SoftFloat can now define any of the floating-point types
1339<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, <CODE>extFloat80_t</CODE>, and
1340<CODE>float128_t</CODE> as aliases for C&rsquo;s standard floating-point types
1341<CODE>float</CODE>, <CODE>double</CODE>, and <CODE>long</CODE>
1342<CODE>double</CODE>, using either <CODE>#define</CODE> or <CODE>typedef</CODE>.
1343This potential convenience was not supported under <NOBR>Release 2</NOBR>.
1344</P>
1345
1346<P>
1347(Note, however, that there may be a performance cost to defining
1348SoftFloat&rsquo;s floating-point types this way, depending on the platform and
1349the applications using SoftFloat.
1350Ports of SoftFloat may choose to forgo the convenience in favor of better
1351speed.)
1352</P>
1353
<LI>
<P>
1356Functions have been added for converting between the floating-point types and
1357unsigned integers.
1358<NOBR>Release 2</NOBR> supported only signed integers, not unsigned.
1359</P>
1360
<LI>
<P>
1363A new, fifth rounding mode, <CODE>softfloat_round_near_maxMag</CODE> (round to
1364nearest, with ties to maximum magnitude, away from zero) is now supported for
1365all cases involving rounding.
1366</P>
1367
<LI>
<P>
1370Fused multiply-add functions have been added for the non-extended formats,
1371<CODE>float32_t</CODE>, <CODE>float64_t</CODE>, and <CODE>float128_t</CODE>.
1372</P>
1373
1374</UL>
1375</P>
1376
1377<H3>9.4. Better Compatibility with the C Language</H3>
1378
1379<P>
1380<NOBR>Release 3</NOBR> of SoftFloat was written to conform better to the ISO C
1381Standard&rsquo;s rules for portability.
1382For example, older releases of SoftFloat employed type conversions in ways
1383that, while commonly practiced, are not fully defined by the C Standard.
Such problematic type conversions have generally been replaced by the use of
unions, whose behavior is now more strictly specified.
1386</P>
1387
1388<H3>9.5. New Organization as a Library</H3>
1389
1390<P>
Starting with <NOBR>Release 3</NOBR>, SoftFloat builds as a library.
1392Previously, SoftFloat compiled into a single, monolithic object file containing
1393all the SoftFloat functions, with the consequence that a program linking with
1394SoftFloat would get every SoftFloat function in its binary file even if only a
1395few functions were actually used.
1396With SoftFloat in the form of a library, a program that is linked by a standard
1397linker will include only those functions of SoftFloat that it needs and no
1398others.
1399</P>
1400
1401<H3>9.6. Optimization Gains (and Losses)</H3>
1402
1403<P>
1404Individual SoftFloat functions have been variously improved in
1405<NOBR>Release 3</NOBR> compared to earlier releases.
1406In particular, better, faster algorithms have been deployed for the operations
1407of division, square root, and remainder.
1408For functions operating on the larger <NOBR>80-bit</NOBR> and
1409<NOBR>128-bit</NOBR> formats, <CODE>extFloat80_t</CODE> and
1410<CODE>float128_t</CODE>, code size has also generally been reduced.
1411</P>
1412
1413<P>
1414However, because <NOBR>Release 2</NOBR> compiled all of SoftFloat together as a
1415single object file, compilers could make optimizations across function calls
1416when one SoftFloat function calls another.
1417Now that the functions of SoftFloat are compiled separately and only afterward
1418linked together into a program, there is not usually the same opportunity to
1419optimize across function calls.
1420Some loss of speed has been observed due to this change.
1421</P>
1422
1423
1424<H2>10. Future Directions</H2>
1425
1426<P>
1427The following improvements are anticipated for future releases of SoftFloat:
1428<UL>
1429<LI>
1430support for the common <NOBR>16-bit</NOBR> &ldquo;half-precision&rdquo;
1431floating-point format;
1432<LI>
1433more functions from the 2008 version of the IEEE Floating-Point Standard;
1434<LI>
1435consistent, defined behavior for non-canonical representations of extended
1436format <CODE>extFloat80_t</CODE> (discussed in <NOBR>section 4.4</NOBR>,
1437<I>Non-canonical Representations in <CODE>extFloat80_t</CODE></I>).
1438
1439</UL>
1440</P>
1441
1442
1443<H2>11. Contact Information</H2>
1444
1445<P>
1446At the time of this writing, the most up-to-date information about SoftFloat
1447and the latest release can be found at the Web page
1448<A HREF="http://www.jhauser.us/arithmetic/SoftFloat.html"><CODE>http://www.jhauser.us/arithmetic/SoftFloat.html</CODE></A>.
1449</P>
1450
1451
1452</BODY>
1453
1454