MicroZed Chronicles: FIR Filter and Coding for Performance

Adam Taylor

FPGAs are great for implementing signal-processing functions such as FIR filters. The DSP elements, with their built-in multiply–accumulate capability, are ideally suited for this application. However, as with most things in FPGA design, the achievable performance depends heavily on how we architect the implementation.

At a basic level, a FIR filter consists of three main elements:

A delay line
Multipliers to apply the coefficients
An accumulator to sum the products

Exactly how we implement these elements can have a significant impact on the performance of the FIR filter.

Direct Form

In the direct-form implementation, shift registers are used to delay the input samples. At each delay stage, the sample is multiplied by a constant coefficient, and the outputs of all stages are then summed together.
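As a minimal behavioural sketch of this structure (plain Python, not the RTL used later in this post), the delay line, per-tap multiply and final sum can be written as follows. The function name `fir_direct` and the three-tap coefficient set are illustrative assumptions.

```python
def fir_direct(samples, coeffs):
    """Direct-form FIR: delay line -> per-tap multiply -> one sum over all taps."""
    delay_line = [0] * len(coeffs)  # delay_line[0] holds the newest sample
    outputs = []
    for x in samples:
        # Shift the delay line by one place and insert the new sample
        delay_line = [x] + delay_line[:-1]
        # Multiply every tap by its coefficient and sum all products at once
        outputs.append(sum(h * d for h, d in zip(coeffs, delay_line)))
    return outputs


# Feeding in an impulse reproduces the coefficients at the output
impulse_response = fir_direct([1, 0, 0, 0], [3, -2, 5])
print(impulse_response)  # [3, -2, 5, 0]
```

The single `sum` over every tap is the part to watch: in hardware it becomes one wide adder structure that has to settle within a clock cycle unless it is pipelined.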

Transposed Form

In the transposed form, all multipliers see the same input sample. The accumulator is implemented as a chain of adders, with registers inserted between them. This structure maps very well onto modern FPGA DSP slices, especially when we take advantage of their internal pipeline registers.
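A small behavioural sketch may help show why this is the same filter. Here each register holds a partial sum, and every new value depends on only one product and one neighbouring register. The names `fir_transposed` and `state` are illustrative assumptions, and the sketch assumes at least two taps.

```python
def fir_transposed(samples, coeffs):
    """Transposed-form FIR: one broadcast input, a chain of add-then-register stages."""
    n_taps = len(coeffs)
    state = [0] * (n_taps - 1)  # partial sums held in registers
    outputs = []
    for x in samples:
        # Output: newest product plus the partial sum registered last cycle
        outputs.append(coeffs[0] * x + state[0])
        # Each register takes its own product plus the OLD value of the next register
        new_state = [coeffs[k + 1] * x + state[k + 1] for k in range(n_taps - 2)]
        new_state.append(coeffs[-1] * x)  # the last register has no neighbour to add
        state = new_state
    return outputs


samples = [1, 2, 3, 4, 5]
coeffs = [3, -2, 5]
# Reference result straight from the convolution sum y[n] = sum_k h[k] * x[n-k]
reference = [
    sum(coeffs[k] * samples[n - k] for k in range(len(coeffs)) if n - k >= 0)
    for n in range(len(samples))
]
print(fir_transposed(samples, coeffs) == reference)  # True
```

Each stage performs a single multiply and a single add before the next register, which is the same shape as one multiply-add stage per DSP slice.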

Example Design

Let’s look at the difference in performance between these two architectures when implemented on an FPGA.

For this example, we target an Artix-7 device and implement a FIR filter sampled at 100 MHz, with:

Passband: below 25 MHz
Stopband: above 30 MHz

To generate the filter coefficients, I used TFilter, a website that allows us to create filters interactively online.
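Before writing any RTL, it can be worth sanity-checking the quantised coefficients. The sketch below evaluates the frequency response of the 21 integer coefficients used in the RTL later in this post at the two test-tone frequencies. It assumes the 100 MHz sample rate set by `CLK_FREQ_HZ` in the cocotb testbench; the helper name `gain_db` is hypothetical.

```python
import cmath
import math

# The 21 coefficients from the RTL in this post (symmetric, so linear phase)
COEFFS = [937, 2402, 1479, -1122, -1138, 1751, 1079, -3238, -1119, 10356,
          17504,
          10356, -1119, -3238, 1079, 1751, -1138, -1122, 1479, 2402, 937]
FS_HZ = 100e6  # assumption: sample rate taken from the cocotb testbench


def gain_db(freq_hz, coeffs=COEFFS, fs_hz=FS_HZ):
    """|H(f)| in dB relative to DC, with H(f) = sum_k h[k] * exp(-j*2*pi*f*k/fs)."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    h_f = sum(h * cmath.exp(-1j * w * k) for k, h in enumerate(coeffs))
    return 20.0 * math.log10(abs(h_f) / abs(sum(coeffs)))


pass_db = gain_db(10e6)  # passband test tone
stop_db = gain_db(30e6)  # stopband test tone
print(f"10 MHz: {pass_db:.1f} dB, 30 MHz: {stop_db:.1f} dB (relative to DC)")
```

If the stopband figure is not well below the passband figure, the coefficients or the assumed sample rate are worth re-checking before moving on to simulation.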

With the filter designed, we can move on to creating the RTL code. While we could use the Vivado FIR Compiler IP, in this case I want to show the architectural differences clearly, so we’ll look at the hand-written RTL.

Direct Form RTL and Testbench

The RTL for the direct-form FIR filter is shown below, followed by the cocotb testbench I used to test it, which applies two tones: one in the passband and one in the stopband.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_21tap is
  generic (
    DATA_WIDTH  : integer := 16;
    COEFF_WIDTH : integer := 16
  );
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    data_in  : in  signed(DATA_WIDTH-1 downto 0);
    data_out : out signed(15 downto 0)   -- 16-bit output
  );
end entity fir_21tap;

architecture rtl of fir_21tap is

  constant NUM_TAPS   : integer := 21;
  constant PROD_WIDTH : integer := DATA_WIDTH + COEFF_WIDTH;-- 16+16 = 32
  constant ACC_WIDTH  : integer := PROD_WIDTH + 5;   -- headroom for sum

  subtype sample_t is signed(DATA_WIDTH-1 downto 0);
  subtype coeff_t  is signed(COEFF_WIDTH-1 downto 0);
  type sample_array_t is array (0 to NUM_TAPS-1) of sample_t;
  type coeff_array_t  is array (0 to NUM_TAPS-1) of coeff_t;

  -- 21 coefficients (h[0] is for most recent sample, h[20] for oldest)
  constant COEFFS : coeff_array_t := (
    to_signed(   937, COEFF_WIDTH),
    to_signed(  2402, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( 17504, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed(  2402, COEFF_WIDTH),
    to_signed(   937, COEFF_WIDTH)
  );

  -- Shift register for samples
  signal x_reg : sample_array_t := (others => (others => '0'));

  -- Full-precision accumulator
  signal acc_reg : signed(ACC_WIDTH-1 downto 0) := (others => '0');

begin

  ------------------------------------------------------------------------
  -- 16-bit output: take the most significant 16 bits of the accumulator
  ------------------------------------------------------------------------
  data_out <= acc_reg(ACC_WIDTH-1 downto ACC_WIDTH-16);

  ------------------------------------------------------------------------
  -- Input sample shift register
  ------------------------------------------------------------------------
  shift_reg_proc : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        x_reg <= (others => (others => '0'));
      else
        x_reg(0) <= data_in;
        for i in 1 to NUM_TAPS-1 loop
          x_reg(i) <= x_reg(i-1);
        end loop;
      end if;
    end if;
  end process shift_reg_proc;

  ------------------------------------------------------------------------
  -- Multiply-accumulate
  ------------------------------------------------------------------------
  mac_proc : process (clk)
    variable sum  : signed(ACC_WIDTH-1 downto 0);
    variable prod : signed(PROD_WIDTH-1 downto 0);  -- 32 bits
  begin
    if rising_edge(clk) then
      if rst = '1' then
        acc_reg <= (others => '0');
      else
        sum := (others => '0');

        for i in 0 to NUM_TAPS-1 loop
          -- Multiply 16x16 -> 32 bits, then resize to accumulator width
          prod := x_reg(i) * COEFFS(i);               -- result is 32 bits
          sum  := sum + resize(prod, ACC_WIDTH);
        end loop;

        acc_reg <= sum;
      end if;
    end if;
  end process mac_proc;

end architecture rtl;

import math
import numpy as np

import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge, Timer


CLK_FREQ_HZ = 100e6      # 100 MHz sample clock
CLK_PERIOD_NS = 1e9 / CLK_FREQ_HZ  # 10 ns
NUM_TAPS = 21
DATA_WIDTH = 16


def gen_tone(freq_hz, fs_hz, n_samples, amplitude=0.8):
    """
    Generate a sine wave tone as 16-bit signed integers (Q1.15 style).

    freq_hz  : tone frequency
    fs_hz    : sample rate
    n_samples: number of samples
    amplitude: 0.0 .. 1.0 (scaled to full-scale 16-bit)
    """
    t = np.arange(n_samples) / fs_hz
    # Full-scale amplitude for signed 16-bit is 32767
    scale = int((2**(DATA_WIDTH - 1) - 1) * amplitude)

    samples = scale * np.sin(2.0 * math.pi * freq_hz * t)
    return np.round(samples).astype(np.int16)


async def apply_tone(dut, freq_hz, n_samples, label):
    """
    Apply a sine tone to the filter and capture the output.

    Returns: (input_samples, output_samples) as numpy arrays of int16
    """
    fs = CLK_FREQ_HZ
    in_samples = gen_tone(freq_hz, fs, n_samples, amplitude=0.8)

    out_samples = []

    dut._log.info(f"--- Applying {label} tone: {freq_hz/1e6:.2f} MHz ---")

    for i, sample in enumerate(in_samples):
        dut.data_in.value = int(sample)  # cocotb handles signed int
        await RisingEdge(dut.clk)
        # Read signed value from DUT (data_out is signed(15 downto 0))
        out_val = dut.data_out.value.signed_integer
        out_samples.append(out_val)

    # Convert to numpy int16 for convenience
    in_arr = np.array(in_samples, dtype=np.int16)
    out_arr = np.array(out_samples, dtype=np.int32)  # keep wider here

    # Ignore initial transient due to filter latency (~NUM_TAPS)
    steady_start = NUM_TAPS
    in_steady = in_arr[steady_start:]
    out_steady = out_arr[steady_start:]

    # Compute simple RMS to compare levels
    in_rms = math.sqrt(np.mean(in_steady.astype(np.float64)**2))
    out_rms = math.sqrt(np.mean(out_steady.astype(np.float64)**2))

    dut._log.info(
        f"{label} tone {freq_hz/1e6:.2f} MHz: "
        f"input RMS = {in_rms:.2f}, output RMS = {out_rms:.2f}"
    )

    return in_arr, out_arr


@cocotb.test()
async def fir_21tap_pass_stop_tones(dut):
    """
    Test the FIR with:
      - One tone in the passband (< 25 MHz, e.g. 10 MHz)
      - One tone in the stopband (>= 30 MHz, e.g. 30 MHz)

    Clock is 100 MHz (10 ns period).
    """

    # Start 100 MHz clock
    cocotb.start_soon(Clock(dut.clk, CLK_PERIOD_NS, units="ns").start())

    # Reset sequence
    dut.rst.value = 1
    dut.data_in.value = 0
    await Timer(5 * CLK_PERIOD_NS, units="ns")
    for _ in range(5):
        await RisingEdge(dut.clk)
    dut.rst.value = 0
    await RisingEdge(dut.clk)

    # Number of samples per tone (you can increase for better spectral resolution)
    N_SAMPLES = 1024

    # Choose two test frequencies relative to 100 MHz sample rate
    pass_freq_hz = 10e6   # 10 MHz, inside passband (< 25 MHz)
    stop_freq_hz = 30e6   # 30 MHz, at the stopband edge (>= 30 MHz)

    # Apply passband tone and capture output
    in_pass, out_pass = await apply_tone(
        dut, pass_freq_hz, N_SAMPLES, label="PASSBAND"
    )

    # Small gap between tones (optional)
    for _ in range(10):
        dut.data_in.value = 0
        await RisingEdge(dut.clk)

    # Apply stopband tone and capture output
    in_stop, out_stop = await apply_tone(
        dut, stop_freq_hz, N_SAMPLES, label="STOPBAND"
    )

    # Optional: compute RMS ratio between passband and stopband outputs
    # (ignoring transient at start)
    steady_start = NUM_TAPS
    out_pass_steady = out_pass[steady_start:]
    out_stop_steady = out_stop[steady_start:]

    pass_rms = math.sqrt(
        np.mean(out_pass_steady.astype(np.float64) ** 2)
    )
    stop_rms = math.sqrt(
        np.mean(out_stop_steady.astype(np.float64) ** 2)
    )

    dut._log.info(
        f"Final comparison: passband RMS = {pass_rms:.2f}, "
        f"stopband RMS = {stop_rms:.2f}, "
        f"ratio (stop/pass) = {stop_rms/pass_rms:.3f}"
    )

    # Example loose check: ensure stopband is attenuated by some factor.
    # You can tighten this once you know the expected filter response.
    assert stop_rms < pass_rms, (
        "Expected stopband tone to be attenuated relative to passband tone"
    )

The simulation results are shown below. In the time domain, we can clearly see the output signal passed in the passband and attenuated in the stopband.
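To check the simulated samples numerically rather than by eye, one option is a bit-accurate integer model of the direct-form RTL. The sketch below is an assumption-laden reference, not part of the original testbench: the name `fir_golden` is hypothetical, it models the 37-bit accumulator and the `acc_reg(36 downto 21)` output slice as an arithmetic right shift by 21, and it ignores the register latency of the RTL.

```python
COEFFS = [937, 2402, 1479, -1122, -1138, 1751, 1079, -3238, -1119, 10356,
          17504,
          10356, -1119, -3238, 1079, 1751, -1138, -1122, 1479, 2402, 937]
ACC_WIDTH = 37              # PROD_WIDTH (32) + 5 bits of headroom, as in the RTL
OUT_SHIFT = ACC_WIDTH - 16  # data_out takes accumulator bits 36 downto 21


def fir_golden(samples):
    """Integer model of the direct-form FIR output (latency not modelled)."""
    taps = [0] * len(COEFFS)
    outputs = []
    for x in samples:
        taps = [int(x)] + taps[:-1]
        acc = sum(h * d for h, d in zip(COEFFS, taps))
        # Python's >> floors negative values, which matches dropping the
        # low bits of a two's-complement accumulator
        outputs.append(acc >> OUT_SHIFT)
    return outputs


# A full-scale DC input settles once all 21 taps are filled
dc_out = fir_golden([32767] * 21)[-1]
print(dc_out)  # 629
```

In a testbench, the captured `data_out` values could be compared against this model after skipping the RTL's register stages. The small settled value also shows how much the top-16-bit slice scales the output down.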

However, when we implement this design in the FPGA, we quickly run into timing issues at the required clock rate.

Timing and FMAX

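We can estimate the maximum operating frequency of the design from the timing report using:

FMAX (MHz) = 1000 / (T − WNS)

Where:

T is the target clock period in ns
WNS is the worst negative slack in ns (a negative value when the design fails timing, which lengthens the achievable period)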

In this design, the long combinatorial accumulation path, in which all 21 products are summed within a single clock cycle, limits us to around 31 MHz. The issue is not the filter itself, but the way the RTL is architected: the accumulation is not optimised for the FPGA’s DSP and routing resources.
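As a quick worked instance of the FMAX formula, the sketch below plugs in a 10 ns clock constraint. The function name `fmax_mhz` and the slack values are illustrative assumptions, not figures taken from this design's timing report.

```python
def fmax_mhz(clock_period_ns, wns_ns):
    """Estimate FMAX in MHz from the clock period T and worst negative slack WNS (ns)."""
    return 1000.0 / (clock_period_ns - wns_ns)


# Hypothetical failing path: 10 ns constraint missed by 22 ns
print(fmax_mhz(10.0, -22.0))  # 31.25

# Hypothetical passing path: positive slack pushes the estimate above the constraint
print(fmax_mhz(10.0, 1.5))
```

A negative WNS adds to the period, so a path that misses a 10 ns constraint by 22 ns needs about 32 ns to settle, which gives an estimate of roughly 31 MHz.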

Re-Architecting for the FPGA

To optimise the filter for FPGA implementation, we need to:

Use the transposed form of the FIR
Correctly pipeline the DSP elements

That means enabling the internal pipeline registers so that the design maps to:

The A and B input registers of the DSP48
The P output register
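Enabling input registers changes latency, not the filter response. The behavioural sketch below illustrates this under stated assumptions: `input_regs` is a hypothetical parameter standing in for the DSP input register stages, and the three-tap coefficients are illustrative only.

```python
def transposed_fir(samples, coeffs, input_regs=0):
    """Transposed FIR with an optional number of register stages on the input."""
    pipe = [0] * input_regs            # models the input pipeline registers
    state = [0] * (len(coeffs) - 1)    # models the registered partial sums
    outputs = []
    for x in samples:
        if input_regs:
            pipe.append(x)
            x = pipe.pop(0)            # the sample that entered input_regs cycles ago
        outputs.append(coeffs[0] * x + state[0])
        state = [coeffs[k + 1] * x + state[k + 1]
                 for k in range(len(coeffs) - 2)] + [coeffs[-1] * x]
    return outputs


samples = [1, 2, 3, 4, 5, 6]
coeffs = [3, -2, 5]
plain = transposed_fir(samples, coeffs)
piped = transposed_fir(samples, coeffs, input_regs=2)
# Same values, two samples later
print(piped[2:] == plain[:-2])  # True
```

Because every multiplier sees the same delayed sample, the extra registers shorten the path into each multiplier while the output values are simply shifted in time.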

The revised RTL, in transposed form, is shown below. We can reuse the same cocotb testbench to verify that functional behaviour remains correct.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_21tap_2reg is
  generic (
    DATA_WIDTH  : integer := 16;
    COEFF_WIDTH : integer := 16
  );
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    data_in  : in  signed(DATA_WIDTH-1 downto 0);
    data_out : out signed(15 downto 0)   -- 16-bit registered output
  );
end entity fir_21tap_2reg;

architecture rtl of fir_21tap_2reg is

  constant NUM_TAPS   : integer := 21;
  constant PROD_WIDTH : integer := DATA_WIDTH + COEFF_WIDTH; -- 16+16 = 32
  constant ACC_WIDTH  : integer := PROD_WIDTH + 5;           -- headroom
  subtype coeff_t is signed(COEFF_WIDTH-1 downto 0);
  subtype acc_t   is signed(ACC_WIDTH-1 downto 0);
  type coeff_array_t is array (0 to NUM_TAPS-1) of coeff_t;
  type state_array_t is array (0 to NUM_TAPS-2) of acc_t;   -- N-1 states

  ------------------------------------------------------------------------
  -- 21 FIR coefficients h[0]..h[20]
  ------------------------------------------------------------------------
  constant COEFFS : coeff_array_t := (
    to_signed(   937, COEFF_WIDTH),  -- h0
    to_signed(  2402, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( 17504, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed(  2402, COEFF_WIDTH),
    to_signed(   937, COEFF_WIDTH)   -- h20
  );

  ------------------------------------------------------------------------
  -- State registers for transposed structure: s(0)..s(19)
  ------------------------------------------------------------------------
  signal state       : state_array_t := (others => (others => '0'));
  signal data_out_reg: signed(15 downto 0) := (others => '0');

begin

  data_out <= data_out_reg;

  ------------------------------------------------------------------------
  -- Transposed-form FIR
  --
  -- Equations (for N=21, indices 0..20):
  --   z_19[n]          = h20 * x[n]
  --   z_k[n]           = h(k+1) * x[n] + z_{k+1}[n-1],  k = 0..18
  --   y[n]             = h0 * x[n] + z_0[n-1]
  --
  -- Here:
  --   state(k) holds z_k[n]      (k = 0..19)
  --   data_out_reg holds y[n]
  ------------------------------------------------------------------------
  fir_proc : process (clk)
    variable x_ext  : signed(DATA_WIDTH-1 downto 0);
    variable prod   : signed(PROD_WIDTH-1 downto 0);
    variable acc    : acc_t;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state        <= (others => (others => '0'));
        data_out_reg <= (others => '0');

      else
        -- Local copy of the input sample (no width extension is needed)
        x_ext := data_in;

        ------------------------------------------------------------------
        -- Update last state: z_19[n] = h20 * x[n]
        ------------------------------------------------------------------
        prod := x_ext * COEFFS(NUM_TAPS-1);  -- h20 * x[n]
        acc  := resize(prod, ACC_WIDTH);
        state(NUM_TAPS-2) <= acc;           -- state(19)

        ------------------------------------------------------------------
        -- Update remaining states backward:
        --   state(k) <= h(k+1)*x[n] + state(k+1)(old)
        -- Note: state(k+1) on RHS is value from previous cycle (n-1),
        --       because signal reads see "old" value in this clock.
        ------------------------------------------------------------------
        for k in NUM_TAPS-3 downto 0 loop   -- k = 18 .. 0
          prod := x_ext * COEFFS(k+1);      -- h(k+1) * x[n]
          acc  := resize(prod, ACC_WIDTH) + state(k+1);
          state(k) <= acc;
        end loop;

        ------------------------------------------------------------------
        -- Output:
        --   y[n] = h0 * x[n] + z_0[n-1] = h0 * x[n] + state(0)(old)
        ------------------------------------------------------------------
        prod := x_ext * COEFFS(0);   -- h0 * x[n]
        acc  := resize(prod, ACC_WIDTH) + state(0);
        data_out_reg <= acc(ACC_WIDTH-1 downto ACC_WIDTH-16);  -- 16-bit

      end if;
    end if;
  end process;

end architecture rtl;

Once implemented, the updated design now meets timing on the target device at the required clock rate.

Conclusion

This example shows how writing RTL that is architected for the FPGA fabric is critical to achieving high performance. Two functionally identical FIR filters can behave very differently in implementation, depending on how well they map onto the device’s DSP and routing resources.

If you want to learn more about writing code for performance in Vivado, you might want to take a look at my webinar, AMD Vivado™ Design Suite Essentials: Key Techniques for Superior RTL Development.


Sponsored by AMD