MicroZed Chronicles: Fixed and Floating Point Maths

Nov 13, 2024
Updated: Jun 25, 2025

A short time ago I hosted a webinar which looked at how we can implement mathematics within programmable logic. In this webinar we examined how we could create mathematics applications using RTL, HLS and MATLAB Simulink.

When creating the examples for the RTL, I used the VHDL fixed point package which is provided with VHDL 2008. This provides excellent features for working effectively with fixed point numbers and, of course, it is synthesisable. Some of the benefits of this package include:

Signed and unsigned (sfixed and ufixed) fixed point vectors.
Easy representation and quantisation of fixed point numbers into fixed point vectors.
The binary point resides between vector elements 0 and -1, easing the need to keep track of its position for alignment during operations.
Overflow, rounding and range management of operations are clearly defined.
Arithmetic and comparison operators.
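
As a minimal sketch of these points, the simulation-only example below declares signed and unsigned fixed point values, quantises real numbers into them and shows how the result ranges of the operators grow. The entity name, signal names and values are illustrative assumptions, not taken from the webinar.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;  -- VHDL-2008 fixed point package

-- Hypothetical, simulation-only entity used purely for illustration
entity fixed_pkg_sketch is
end entity fixed_pkg_sketch;

architecture sim of fixed_pkg_sketch is
  -- Index 0 is the least significant integer bit, index -1 the first
  -- fractional bit, so (3 downto -8) is 4 integer bits and 8 fractional bits
  constant gain   : sfixed(3 downto -8) := to_sfixed(2.403, 3, -8);  -- signed
  constant offset : sfixed(3 downto -8) := to_sfixed(-1.25, 3, -8);  -- signed
  constant ratio  : ufixed(0 downto -8) := to_ufixed(0.75, 0, -8);   -- unsigned
begin
  process
    -- Result ranges follow the package sizing rules:
    --   a + b : max(a'left, b'left) + 1 downto min(a'right, b'right)
    --   a * b : a'left + b'left + 1     downto a'right + b'right
    variable sum  : sfixed(4 downto -8);
    variable prod : sfixed(7 downto -16);
    variable trim : sfixed(3 downto -8);
  begin
    sum  := gain + offset;
    prod := gain * offset;
    -- resize brings the full-precision product back to the working format,
    -- rounding and saturating by default
    trim := resize(prod, trim);

    report "gain   = " & real'image(to_real(gain));
    report "offset = " & real'image(to_real(offset));
    report "ratio  = " & real'image(to_real(ratio));
    report "sum    = " & real'image(to_real(sum));
    report "prod   = " & real'image(to_real(prod));
    report "trim   = " & real'image(to_real(trim));

    -- The resized product should stay close to the ideal real-valued product
    assert abs(to_real(trim) - 2.403 * (-1.25)) < 0.01
      report "Resized product is further from the ideal value than expected"
      severity error;
    wait;
  end process;
end architecture sim;
```

Because the operators size their own results, the only place a precision decision has to be made explicitly in this sketch is the call to resize.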

As such, when I need to implement algorithms I use the fixed point package. During the webinar I also talked about the synthesisable floating point package, which is also provided by VHDL 2008.

Naturally, one of the questions arising was: what is the difference in resources when implementing the same equation using fixed and floating point? I did not have an immediate answer but thought it would make a pretty interesting blog.

The example we are going to look at is the implementation of a polynomial approximation to convert an ADC reading into a temperature value. This is commonly done when working with platinum resistance thermometers in industrial applications.

The exact equation to be implemented is y = 2E-09x^4 - 4E-07x^3 + 0.0011x^2 + 2.403x - 251.26, which was extracted by plotting the relationship. While we could implement the equation in its direct form, it would be very wasteful in resources, along with adding complexity and risk to the development.
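
One common way to reduce the multiplier count is Horner's rule. It is offered here as a sketch and is not necessarily the structure used in the listings in this blog. Writing the coefficient magnitudes as a to e (a notation assumed here to match the constant names in the fixed point listing), the polynomial can be nested so that each step is one multiplication by x followed by one addition or subtraction:

```latex
\begin{aligned}
y &= a x^4 - b x^3 + c x^2 + d x - e \\
  &= \bigl(\bigl((a x - b)\,x + c\bigr)\,x + d\bigr)\,x - e
\end{aligned}
```

Evaluated directly, the first line needs three multiplications to form the powers of x and four more to apply the coefficients. The nested form needs four multiplications in total, and it never has to hold x^3 or x^4 as an intermediate value.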

Using a fixed point number system, we need to quantise the coefficients with enough fractional bits to maintain precision and accuracy.
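
To get a feel for what that quantisation costs, the simulation-only sketch below converts each coefficient magnitude to sfixed(8 downto -32), the format used in the listing that follows, and reports the stored value and its relative error. The entity and procedure names are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.fixed_pkg.all;

-- Hypothetical, simulation-only entity: reports how closely each
-- coefficient is represented once quantised
entity quantisation_sketch is
end entity quantisation_sketch;

architecture sim of quantisation_sketch is

  -- Quantise one positive coefficient and report the stored value
  procedure report_quantisation (name : in string; ideal : in real) is
    -- 9 integer bits (including sign) and 32 fractional bits
    constant q      : sfixed(8 downto -32) := to_sfixed(ideal, 8, -32);
    constant stored : real := to_real(q);
  begin
    report name & ": ideal = " & real'image(ideal)
         & ", stored = " & real'image(stored)
         & ", relative error = " & real'image(abs(stored - ideal) / ideal);
  end procedure report_quantisation;

begin
  process
  begin
    report_quantisation("a", 2.0E-09);
    report_quantisation("b", 4.0E-07);
    report_quantisation("c", 0.0011);
    report_quantisation("d", 2.403);
    report_quantisation("e", 251.26);
    wait;
  end process;
end architecture sim;
```

With 32 fractional bits the least significant bit is worth 2^-32, roughly 2.33E-10, so the 2E-09 coefficient spans fewer than nine LSBs and can only be stored to within a few percent. The larger coefficients are represented far more closely.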

The code and a simple simulation showing its implementation can be seen below

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use ieee.fixed_pkg.all;
entity complex_example is port(
clk    : in std_logic; 
ip     : in std_logic_vector(7 downto 0);
op     : out std_logic_vector(8 downto 0));
end complex_example;

architecture Behavioral of complex_example is
signal power_a : sfixed(8 downto -32):=(others=>'0');
signal power_b : sfixed(8 downto -32):=(others=>'0');
signal power_c : sfixed(8 downto -32):=(others=>'0');
signal calc  : sfixed(8 downto -32) :=(others=>'0');
signal store : sfixed(8 downto 0) := (others =>'0');
constant a : sfixed(8 downto -32):= to_sfixed( 2.00E-09, 8,-32 );
constant b : sfixed(8 downto -32):= to_sfixed( 4.00E-07, 8,-32 );
constant c : sfixed(8 downto -32):= to_sfixed( 0.0011, 8,-32 ); 
constant d : sfixed(8 downto -32):= to_sfixed( 2.403, 8,-32 ); 
constant e : sfixed(8 downto -32):= to_sfixed( 251.26, 8,-32 ); 
type reg_array is array (9 downto 0) of sfixed(8 downto -32);
signal pipeline_reg : reg_array;

begin
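cvd : process(clk)
begin
  if rising_edge(clk) then
    -- Zero-extend the unsigned 8-bit input into a signed, integer-only value
    store <= to_sfixed('0' & ip, store);
    -- a * x^4: the product grows to full precision, then resize returns it
    -- to the working format
    power_a <= resize (arg => store * store * store * store * a,
                       size_res => power_a);
    -- b * x^3
    power_b <= resize (arg => store * store * store * b,
                       size_res => power_b);
    -- c * x^2
    power_c <= resize (arg => store * store * c,
                       size_res => power_c);
    -- Sum the terms; b and e are stored as magnitudes, hence the subtractions
    calc <= resize (arg => power_a - power_b + power_c + (store * d) - e,
                    size_res => calc);
    -- Ten-stage output delay line
    pipeline_reg <= pipeline_reg(pipeline_reg'high - 1 downto 0) & calc;
    -- Output only the integer bits (sign plus 8 bits)
    op <= to_slv(pipeline_reg(pipeline_reg'high)(8 downto 0));
  end if;
end process;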
end Behavioral;

For an input resistance of 109 ohms the temperature should be reported as 23.7C. We can see in the fixed point simulation below that the result is as expected, within an acceptable accuracy.
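
A minimal self-checking testbench for this case might look like the sketch below. The testbench name, the 10 ns clock period and the 40-cycle wait are assumptions. It applies a constant reading of 109 and, because the output port carries only the integer bits, compares the result against 23.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical testbench for the complex_example entity shown above
entity complex_example_tb is
end entity complex_example_tb;

architecture sim of complex_example_tb is
  constant CLK_PERIOD : time := 10 ns;  -- assumed; any period works in simulation
  signal clk  : std_logic := '0';
  signal ip   : std_logic_vector(7 downto 0) := (others => '0');
  signal op   : std_logic_vector(8 downto 0);
  signal done : boolean := false;
begin
  -- Free-running clock, stopped once the check has completed
  clk <= not clk after CLK_PERIOD / 2 when not done else '0';

  dut : entity work.complex_example
    port map (clk => clk, ip => ip, op => op);

  stimulus : process
  begin
    -- Constant input of 109, as in the example in the text
    ip <= std_logic_vector(to_unsigned(109, ip'length));
    -- Wait well beyond the latency of the arithmetic and the delay line
    for i in 1 to 40 loop
      wait until rising_edge(clk);
    end loop;
    -- The fractional bits are dropped at the port, so expect the integer part
    assert to_integer(signed(op)) = 23
      report "Unexpected integer temperature: "
           & integer'image(to_integer(signed(op)))
      severity error;
    report "Integer temperature output = "
         & integer'image(to_integer(signed(op)));
    done <= true;
    wait;
  end process;
end architecture sim;
```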

Implementing the same functionality using the floating point package is achieved in a similar way:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.FLOAT_pkg.ALL;  -- Use the floating-point package

entity FloatingPointPolynomial is
    Port (
        clk : in STD_LOGIC;
        x : in float32;  -- Input x as a 32-bit floating-point number
        y : out float32  -- Output y as a 32-bit floating-point number
    );
end FloatingPointPolynomial;
architecture Behavioral of FloatingPointPolynomial is
    -- Define constants for the polynomial coefficients
    constant a4 : float32 := TO_float(2.00E-09);
    constant a3 : float32 := TO_float(-4.00E-07);
    constant a2 : float32 := TO_float(0.0011);
    constant a1 : float32 := TO_float(2.403);
    constant a0 : float32 := TO_float(-251.26);
    signal x2, x3, x4 : float32;  -- Intermediate powers of x
    signal term4, term3, term2, term1 : float32;  -- Polynomial terms
    signal res : float32;
    type reg_array is array (9 downto 0) of float32;
    signal pipeline_reg : reg_array;    
begin
    process(clk)
    begin
        if rising_edge(clk) then
            -- Calculate powers of x
            x2 <= x * x;
            x3 <= x2 * x;
            x4 <= x3 * x;
            -- Calculate each term in the polynomial
            term4 <= a4 * x4;
            term3 <= a3 * x3;
            term2 <= a2 * x2;
            term1 <= a1 * x;
            -- Calculate final result
            res <= term4 + term3 + term2 + term1 + a0;
            pipeline_reg <= pipeline_reg(pipeline_reg'high - 1 downto 0) & res;
            y <= (pipeline_reg(pipeline_reg'high));
        end if;
    end process;
end Behavioral;

Again the simulation shows the expected result. As this is a floating point result, it also includes the fractional element.

Both fixed and floating point are therefore capable of implementing the algorithm defined.

To compare the resource utilisation, I decided to target both implementations at a K26 SoM.

Running the synthesis will identify the resources required for each implementation.

As expected, the fixed point implementation requires a much smaller logic footprint than the floating point implementation.

Fixed Point Implementation

Floating Point Implementation

It is not just logic footprint which we need to consider; we also need to consider timing performance. With that in mind, I set both designs to operate at 200 MHz and worked towards achieving baseline timing closure.

Achieving timing closure took significantly more effort on the floating point implementation than on the fixed point one, as would be expected. I had to go back through the design and implement pipelining in several key stages, though this is not unexpected as my initial code was written only to determine the footprint difference.
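
As an illustration of what such pipelining can look like, the sketch below is one registered multiply-then-add stage which forwards a delayed copy of x, so that the data stays aligned from stage to stage. It is not the pipelining actually applied to the design above. The entity name, the COEFF generic and the port names are hypothetical, and it assumes the polynomial is evaluated in nested form as repeated multiply-by-x-then-add-a-coefficient steps.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.float_pkg.all;

-- Hypothetical building block: acc_out = acc_in * x_in + COEFF,
-- available two clock cycles after the inputs are sampled
entity horner_stage is
  generic (
    COEFF : real := 0.0  -- coefficient added by this stage (assumed generic)
  );
  port (
    clk     : in  std_logic;
    x_in    : in  float32;
    acc_in  : in  float32;
    x_out   : out float32;  -- x_in delayed by two cycles, aligned with acc_out
    acc_out : out float32
  );
end entity horner_stage;

architecture rtl of horner_stage is
  constant c  : float32 := to_float(COEFF);
  signal prod : float32;  -- registered product
  signal x_d  : float32;  -- x delayed by one cycle
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- First register: multiply, and keep x alongside the product
      prod <= acc_in * x_in;
      x_d  <= x_in;
      -- Second register: add the coefficient and forward the matching x
      acc_out <= prod + c;
      x_out   <= x_d;
    end if;
  end process;
end architecture rtl;
```

Four of these stages chained together, with the first acc_in tied to the x^4 coefficient, would evaluate the whole polynomial with every intermediate result registered. Whether two registers per stage are enough for a given clock target is something only the timing report can answer.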

It is worth noting that the DSP58 within the Versal family supports floating point operation; however, float32 does not map directly onto the DSP. To leverage it we need to instantiate the DSP58 configured for FP32 operation, or use the floating point IP which is provided in Vivado's IP Integrator. We will examine these in a future blog.

Wrapping up, this blog has shown, as would be expected, a large difference in logic footprint when using the floating point libraries in VHDL.

I would recommend leveraging fixed point wherever possible and limiting floating point to where it is absolutely necessary.
