MicroZed Chronicles: Leveraging Performance.

Adam Taylor

One of the things I have been examining lately is how I can leverage higher-performing devices, such as the Spartan UltraScale+, to implement more compact FPGA designs.

There are several approaches we can take, all of which exploit the higher performance of the logic. For example, we can make AXI buses narrower and run them at a higher frequency, using the increased fabric performance to keep the same bandwidth: a 32-bit bus at 200 MHz carries the same 6.4 Gbit/s as a 64-bit bus at 100 MHz. This works well for the interfaces around our module, although the core of the module may require different techniques.

These approaches may be necessary to maintain the throughput of a processing core or the dynamic range of a filter, for example.

One technique that can be used in processing cores and filters is RAM pumping.

If you are not familiar with RAM pumping, it involves clocking the BRAM at an integer multiple of the clock frequency used by the rest of the processing core.

In the simplest implementation this might mean running the BRAM at double the rate; in others, we could run the clock faster, perhaps four times the processing-chain clock. The exact choice depends on the needs of the processing chain, the available clock rates, and of course the performance of the selected device.

If we are implementing BRAM pumping, the two clocks must be synchronous: the BRAM clock must be an integer multiple of the processing clock, with a fixed phase relationship to it.
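One way to make use of that fixed relationship is to derive a phase flag in the fast domain that marks which half of the slow clock period is in progress. The sketch below is an illustration only and is not part of the design presented later; the entity name phase_gen and its ports are hypothetical, and it assumes clk_1x and clk_2x come from the same clock source with aligned rising edges.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper, not part of the original design.
entity phase_gen is
  port (
    clk_1x : in  std_logic;   -- processing clock
    clk_2x : in  std_logic;   -- BRAM clock, assumed 2 x clk_1x and edge-aligned
    phase  : out std_logic    -- '0' in the first clk_2x cycle of a clk_1x period
  );
end entity phase_gen;

architecture rtl of phase_gen is
  signal tog_1x : std_logic := '0';
  signal tog_2x : std_logic := '0';
begin
  -- Toggle once per clk_1x period.
  process(clk_1x)
  begin
    if rising_edge(clk_1x) then
      tog_1x <= not tog_1x;
    end if;
  end process;

  -- Copy the toggle into the clk_2x domain.
  process(clk_2x)
  begin
    if rising_edge(clk_2x) then
      tog_2x <= tog_1x;
    end if;
  end process;

  -- The copy lags the toggle during the first clk_2x cycle after each
  -- clk_1x edge, so the two differ there and match in the second cycle.
  phase <= tog_1x xnor tog_2x;
end architecture rtl;
```

Logic clocked by clk_2x could then use phase to decide which access to perform, rather than relying on a free-running toggle whose alignment depends on when reset is released. The flag is only meaningful after the first clk_1x edge.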

The diagram below shows an example of the simplest double-pumping scheme, which performs one read and one write during a single clock cycle of the processing chain.

We can extend this functionality when using true dual-port RAMs to increase the number of BRAM accesses.

An example of this is storing coefficients within a BRAM and then reading them at a higher rate as part of a filter or processing algorithm.
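As a rough sketch of the coefficient case, the block below streams four coefficients from a small synchronous ROM clocked at twice the processing rate, so two coefficients become available in each processing-clock period. The entity name coeff_rom_2x, the port names, and the 16-bit width are assumptions made for illustration; the coefficient values are the ones used in the FIR example later in this article. A memory this small would normally map to LUTs, so the rom_style attribute is shown only to indicate where a block RAM request would go.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical coefficient store; names and widths are illustrative.
entity coeff_rom_2x is
  port (
    clk_2x : in  std_logic;            -- twice the processing clock
    rst    : in  std_logic;            -- synchronous reset
    coeff  : out signed(15 downto 0)   -- one coefficient per clk_2x cycle
  );
end entity coeff_rom_2x;

architecture rtl of coeff_rom_2x is
  type rom_t is array (0 to 3) of signed(15 downto 0);
  signal rom : rom_t := (to_signed(2, 16), to_signed(4, 16),
                         to_signed(4, 16), to_signed(2, 16));
  attribute rom_style : string;
  attribute rom_style of rom : signal is "block";
  signal idx : unsigned(1 downto 0) := (others => '0');
begin
  process(clk_2x)
  begin
    if rising_edge(clk_2x) then
      if rst = '1' then
        idx <= (others => '0');
      else
        idx <= idx + 1;                -- wraps from 3 back to 0
      end if;
      coeff <= rom(to_integer(idx));   -- registered (synchronous) read
    end if;
  end process;
end architecture rtl;
```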

BRAM in Spartan UltraScale+ devices is capable of clocking at up to 738 MHz (-2) or 516 MHz (-1). This performance means we can easily leverage BRAM to enable time-multiplexed processing, which in turn requires fewer logic resources.

Let’s wrap up with a simple FIR filter example. In this case, the design is intentionally kept simple to demonstrate the approach, using a four-tap FIR. While the tap values are defined as constants, we can use BRAM as the delay line to store the sample data.

In this example, we use an architecture that employs only two DSP elements combined with BRAM pumping to store the sample delay line. The BRAM is clocked at twice the rate of the calculation chain.

By doing this, we can retrieve two samples from the BRAM per calculation-chain clock, enabling two multiplications per cycle. Each filter output therefore takes three clock cycles to compute (one to accept the sample and two to process the four taps), but the overall resource usage is reduced.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir4_pumped_3cyc is
  generic (
    DATA_WIDTH  : integer := 16;
    COEFF_WIDTH : integer := 16;
    ACC_WIDTH   : integer := 32
  );
  port (
    clk_1x       : in  std_logic;  -- system clock
    clk_2x       : in  std_logic;  -- 2x clock for RAM
    rst          : in  std_logic;
    sample_in    : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    sample_valid : in  std_logic;
    sample_out   : out std_logic_vector(ACC_WIDTH-1 downto 0);
    out_valid    : out std_logic
  );
end entity fir4_pumped_3cyc;

architecture rtl of fir4_pumped_3cyc is

  ------------------------------------------------------------------
  -- Circular buffer addressing
  ------------------------------------------------------------------
  function minus_mod4(ptr : unsigned(1 downto 0); k : natural)
    return unsigned is
    variable idx : natural;
  begin
    idx := (to_integer(ptr) + 4 - (k mod 4)) mod 4;
    return to_unsigned(idx, 2);
  end function;

  constant ADDR_WIDTH_C : integer := 2;  -- 4 locations

------------------------------------------------------------------
-- Double-pumped RAM interface
------------------------------------------------------------------
  signal addr0  : std_logic_vector(ADDR_WIDTH_C-1 downto 0);
  signal addr1  : std_logic_vector(ADDR_WIDTH_C-1 downto 0);
  signal din0   : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal din1   : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal we0    : std_logic;
  signal we1    : std_logic;
  signal dout0  : std_logic_vector(DATA_WIDTH-1 downto 0);
  signal dout1  : std_logic_vector(DATA_WIDTH-1 downto 0);
  ------------------------------------------------------------------
  -- Circular buffer write pointer
  ------------------------------------------------------------------
  signal wr_ptr : unsigned(ADDR_WIDTH_C-1 downto 0) := (others => '0');
  ------------------------------------------------------------------
  -- FIR state + registers
  ------------------------------------------------------------------
  type state_t is (IDLE, STAGE1, STAGE2);
  signal state  : state_t := IDLE;
  signal x_n    : signed(DATA_WIDTH-1 downto 0) := (others => '0');
  signal acc    : signed(ACC_WIDTH-1 downto 0) := (others => '0');
  ------------------------------------------------------------------
  -- Coefficients
  ------------------------------------------------------------------
  constant H0 : signed(COEFF_WIDTH-1 downto 0) := to_signed(2, COEFF_WIDTH);
  constant H1 : signed(COEFF_WIDTH-1 downto 0) := to_signed(4, COEFF_WIDTH);
  constant H2 : signed(COEFF_WIDTH-1 downto 0) := to_signed(4, COEFF_WIDTH);
  constant H3 : signed(COEFF_WIDTH-1 downto 0) := to_signed(2, COEFF_WIDTH);

begin
  --------------------------------------------------------------------
  -- Instantiate simple double-pumped RAM as 4-sample delay line
  --------------------------------------------------------------------
  ram_i : entity work.dp_ram_simple
    generic map (
      ADDR_WIDTH => ADDR_WIDTH_C,
      DATA_WIDTH => DATA_WIDTH
    )
    port map (
      clk_2x => clk_2x,
      rst    => rst,
      addr0  => addr0,
      din0   => din0,
      we0    => we0,
      dout0  => dout0,
      addr1  => addr1,
      din1   => din1,
      we1    => we1,
      dout1  => dout1
    );

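  --------------------------------------------------------------------
  -- FIR control in clk_1x domain (3-cycle pipeline)
  -- Each state samples dout0/dout1 one clk_1x cycle after scheduling
  -- the access, so both RAM phases must have completed by that edge;
  -- check the clk_1x / clk_2x phase relationship in simulation.
  --------------------------------------------------------------------
  process(clk_1x)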
    variable base_idx          : unsigned(ADDR_WIDTH_C-1 downto 0);
    variable x_n1, x_n2, x_n3  : signed(DATA_WIDTH-1 downto 0);
    variable prod0, prod1      : signed(ACC_WIDTH-1 downto 0);
  begin
    if rising_edge(clk_1x) then
      if rst = '1' then
        state      <= IDLE;
        out_valid  <= '0';
        sample_out <= (others => '0');
        wr_ptr     <= (others => '0');
        x_n        <= (others => '0');
        acc        <= (others => '0');
        addr0      <= (others => '0');
        addr1      <= (others => '0');
        din0       <= (others => '0');
        din1       <= (others => '0');
        we0        <= '0';
        we1        <= '0';
      else
        -- defaults
        out_valid <= '0';
        we0       <= '0';
        we1       <= '0';
        din0      <= (others => '0');
        din1      <= (others => '0');
        case state is
          ------------------------------------------------------------
          -- IDLE: accept new sample, schedule write + x[n-1] read
          ------------------------------------------------------------
          when IDLE =>
            if sample_valid = '1' then
              -- latch x[n]
              x_n <= signed(sample_in);
              -- op0: write x[n] at wr_ptr
              addr0 <= std_logic_vector(wr_ptr);
              din0  <= sample_in;
              we0   <= '1';
              -- op1: read x[n-1] at wr_ptr-1
              base_idx := wr_ptr;
              addr1    <= std_logic_vector(minus_mod4(base_idx, 1));
              we1      <= '0';
              -- advance pointer for next sample
              wr_ptr   <= wr_ptr + 1;
              state    <= STAGE1;
            end if;
          ------------------------------------------------------------
          -- STAGE1:
          --   dout1 = x[n-1]
          --   MAC taps 0 & 1 (2 DSPs)
          --   schedule read of x[n-2], x[n-3]
          ------------------------------------------------------------
          when STAGE1 =>
            -- capture x[n-1]
            x_n1 := signed(dout1);
            -- taps 0 & 1
            prod0 := resize(x_n  * H0, ACC_WIDTH);
            prod1 := resize(x_n1 * H1, ACC_WIDTH);
            acc   <= prod0 + prod1;
            -- schedule op0/op1 for x[n-2], x[n-3]
            base_idx := wr_ptr - 1;  -- index of x[n]
            addr0    <= std_logic_vector(minus_mod4(base_idx, 2));
            addr1    <= std_logic_vector(minus_mod4(base_idx, 3));
            we0      <= '0';
            we1      <= '0';
            state    <= STAGE2;
          ------------------------------------------------------------
          -- STAGE2:
          --   dout0 = x[n-2], dout1 = x[n-3]
          --   MAC taps 2 & 3 (2 DSPs) and finish y[n]
          ------------------------------------------------------------
          when STAGE2 =>
            x_n2 := signed(dout0);
            x_n3 := signed(dout1);
            prod0 := resize(x_n2 * H2, ACC_WIDTH);
            prod1 := resize(x_n3 * H3, ACC_WIDTH);
            acc        <= acc + prod0 + prod1;
            sample_out <= std_logic_vector(acc + prod0 + prod1);
            out_valid  <= '1';
            state      <= IDLE;
        end case;
      end if;
    end if;
  end process;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity dp_ram_simple is
  generic (
    ADDR_WIDTH : integer := 4;
    DATA_WIDTH : integer := 16
  );
  port (
    clk_2x  : in  std_logic;
    rst     : in  std_logic;
    -- op0 (used in phase 0)
    addr0   : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    din0    : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    we0     : in  std_logic;  -- '1' = write, '0' = read
    dout0   : out std_logic_vector(DATA_WIDTH-1 downto 0);
    -- op1 (used in phase 1)
    addr1   : in  std_logic_vector(ADDR_WIDTH-1 downto 0);
    din1    : in  std_logic_vector(DATA_WIDTH-1 downto 0);
    we1     : in  std_logic;
    dout1   : out std_logic_vector(DATA_WIDTH-1 downto 0)
  );
end entity dp_ram_simple;

architecture rtl of dp_ram_simple is
  type ram_t is array (0 to (2**ADDR_WIDTH)-1) of
    std_logic_vector(DATA_WIDTH-1 downto 0);
  signal ram : ram_t := (others => (others => '0'));
  attribute ram_style : string;
  attribute ram_style of ram : signal is "block";
  signal phase     : std_logic := '0';  -- toggles at clk_2x
  signal dout0_reg : std_logic_vector(DATA_WIDTH-1 downto 0) := (others => '0');
  signal dout1_reg : std_logic_vector(DATA_WIDTH-1 downto 0) := (others => '0');
begin
  dout0 <= dout0_reg;
  dout1 <= dout1_reg;
  process(clk_2x)
    variable idx : integer range 0 to (2**ADDR_WIDTH-1);
  begin
    if rising_edge(clk_2x) then
      if rst = '1' then
        phase      <= '0';
        dout0_reg  <= (others => '0');
        dout1_reg  <= (others => '0');
      else
        phase <= not phase;
        if phase = '0' then
          -- phase 0: op0
          idx       := to_integer(unsigned(addr0));
          dout0_reg <= ram(idx);
          if we0 = '1' then
            ram(idx) <= din0;
          end if;
        else
          -- phase 1: op1
          idx       := to_integer(unsigned(addr1));
          dout1_reg <= ram(idx);
          if we1 = '1' then
            ram(idx) <= din1;
          end if;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;

Of course, we can run the clocks even faster, further reducing resource utilisation while still meeting system-level performance requirements.

Simulation clearly shows the double pumping of the BRAM during read accesses to the sample data, enabling two multiplications per clock cycle and resulting in a more compact implementation of the filter.
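For anyone wanting to reproduce a simulation, a minimal test bench along the following lines could be used. It is a sketch under stated assumptions, not the test bench used for this article: the entity name tb_fir4_pumped, the 4 ns clk_2x period, and the impulse stimulus are all choices made for illustration, and the two clocks are generated so that every clk_1x rising edge coincides with a clk_2x rising edge. It reports each output rather than checking against expected values.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tb_fir4_pumped is
end entity tb_fir4_pumped;

architecture sim of tb_fir4_pumped is
  constant T_2X : time := 4 ns;  -- assumed clk_2x period
  signal clk_1x       : std_logic := '1';
  signal clk_2x       : std_logic := '1';
  signal rst          : std_logic := '1';
  signal sample_in    : std_logic_vector(15 downto 0) := (others => '0');
  signal sample_valid : std_logic := '0';
  signal sample_out   : std_logic_vector(31 downto 0);
  signal out_valid    : std_logic;
  signal done         : boolean := false;
begin
  -- Both clocks start high, so their rising edges stay aligned.
  clk_2x <= not clk_2x after T_2X / 2 when not done else '0';
  clk_1x <= not clk_1x after T_2X when not done else '0';

  dut : entity work.fir4_pumped_3cyc
    port map (
      clk_1x       => clk_1x,
      clk_2x       => clk_2x,
      rst          => rst,
      sample_in    => sample_in,
      sample_valid => sample_valid,
      sample_out   => sample_out,
      out_valid    => out_valid
    );

  -- Apply an impulse followed by zeros, one sample every five clk_1x cycles.
  stim : process
  begin
    wait for 10 * T_2X;
    wait until rising_edge(clk_1x);
    rst <= '0';
    for i in 0 to 7 loop
      wait until rising_edge(clk_1x);
      if i = 0 then
        sample_in <= std_logic_vector(to_signed(1, 16));
      else
        sample_in <= (others => '0');
      end if;
      sample_valid <= '1';
      wait until rising_edge(clk_1x);
      sample_valid <= '0';
      for k in 1 to 3 loop
        wait until rising_edge(clk_1x);
      end loop;
    end loop;
    for k in 1 to 4 loop
      wait until rising_edge(clk_1x);
    end loop;
    done <= true;  -- stop the clocks so the simulation ends
    wait;
  end process stim;

  -- Print every output the filter flags as valid.
  monitor : process(clk_1x)
  begin
    if rising_edge(clk_1x) and out_valid = '1' then
      report "y = " & integer'image(to_integer(signed(sample_out)));
    end if;
  end process monitor;
end architecture sim;
```

With an impulse input, the reported values should trace out the filter coefficients, which makes it easy to spot whether each tap is reading the intended sample from the delay line.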

This is an area of interest for me, and I will be exploring it in more detail over several future articles and blogs.

Sponsored by AMD