MicroZed Chronicles: FIR Filter and Coding for Performance
- Adam Taylor
FPGAs are great for implementing signal-processing functions such as FIR filters. The DSP elements, with their built-in multiply–accumulate capability, are ideally suited for this application. However, as with most things in FPGA design, the achievable performance depends heavily on how we architect the implementation.
At a basic level, a FIR filter consists of three main elements:
A delay line
Multipliers to apply the coefficients
An accumulator to sum the products
Exactly how we implement these elements can have a significant impact on the performance of the FIR filter.
Direct Form
In the direct-form implementation, shift registers are used to delay the input samples. At each delay stage, the sample is multiplied by a constant coefficient, and the outputs of all stages are then summed together.
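The direct-form structure described above can be sketched as a small behavioural model. This is an illustrative Python sketch rather than the RTL used in this post; the function name `fir_direct` and the three-tap coefficient set are assumptions chosen only to keep the example short.

```python
from collections import deque


def fir_direct(samples, coeffs):
    """Direct-form FIR: a delay line feeds one multiplier per tap, then one sum."""
    # Delay line with the newest sample at index 0, initialised to zero.
    delay = deque([0] * len(coeffs), maxlen=len(coeffs))
    out = []
    for x in samples:
        delay.appendleft(x)  # shift the delay line by one sample
        # Multiply every delayed sample by its coefficient and sum the products.
        out.append(sum(h * d for h, d in zip(coeffs, delay)))
    return out


# An impulse at the input reproduces the coefficients at the output.
print(fir_direct([1, 0, 0, 0], [3, 2, 1]))  # prints [3, 2, 1, 0]
```

Note that the sum over all taps is a single expression here. In hardware, that is the long adder path that has to settle within one clock period.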
Transposed Form
In the transposed form, all multipliers see the same input sample. The accumulator is implemented as a chain of adders, with registers inserted between them. This structure maps very well onto modern FPGA DSP slices, especially when we take advantage of their internal pipeline registers.
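As a rough illustration of the transposed structure, the Python sketch below keeps one register between each pair of adders and feeds the current sample to every multiplier. The names `fir_transposed` and `fir_reference` are hypothetical helpers introduced for this example; the reference is a plain convolution sum used only to show that both forms compute the same result.

```python
def fir_transposed(samples, coeffs):
    """Transposed-form FIR: every multiplier sees the current input sample."""
    n_taps = len(coeffs)
    state = [0] * (n_taps - 1)  # registers sitting between the adders
    out = []
    for x in samples:
        products = [h * x for h in coeffs]  # all taps multiply the same input
        out.append(products[0] + (state[0] if state else 0))
        # Each register takes its tap's product plus the previous value of the
        # next register in the chain (the last register has no neighbour).
        state = [
            products[k + 1] + (state[k + 1] if k + 1 < n_taps - 1 else 0)
            for k in range(n_taps - 1)
        ]
    return out


def fir_reference(samples, coeffs):
    """Plain convolution sum y[n] = sum(h[k] * x[n-k]) for comparison."""
    return [
        sum(h * samples[n - k] for k, h in enumerate(coeffs) if n - k >= 0)
        for n in range(len(samples))
    ]


x = [5, -3, 7, 0, 2, -8, 1]
h = [3, 2, 1, -4]
print(fir_transposed(x, h) == fir_reference(x, h))  # prints True
```

Each register update involves only one multiply and one add, which is the property that lets the structure map onto a chain of DSP slices.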
Example Design
Let’s look at the difference in performance between these two architectures when implemented on an FPGA.
For this example, we target an Artix-7 device and implement a FIR filter sampled at 100 MHz, with:
Passband: below 25 MHz
Stopband: above 30 MHz
To generate the filter coefficients, I used TFilter, a website that allows us to create filters interactively online.

With the filter designed, we can move on to creating the RTL code. While we could use the Vivado FIR Compiler IP, in this case I want to show the architectural differences clearly, so we’ll look at the hand-written RTL.
Direct Form RTL and Testbench
The RTL for the direct-form FIR filter is shown below. To test this design, I used a cocotb testbench which applies two signals: one in the passband and one in the stopband.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_21tap is
  generic (
    DATA_WIDTH  : integer := 16;
    COEFF_WIDTH : integer := 16
  );
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    data_in  : in  signed(DATA_WIDTH-1 downto 0);
    data_out : out signed(15 downto 0)  -- 16-bit output
  );
end entity fir_21tap;

architecture rtl of fir_21tap is

  constant NUM_TAPS   : integer := 21;
  constant PROD_WIDTH : integer := DATA_WIDTH + COEFF_WIDTH;  -- 16+16 = 32
  constant ACC_WIDTH  : integer := PROD_WIDTH + 5;            -- headroom for sum

  subtype sample_t is signed(DATA_WIDTH-1 downto 0);
  subtype coeff_t  is signed(COEFF_WIDTH-1 downto 0);
  type sample_array_t is array (0 to NUM_TAPS-1) of sample_t;
  type coeff_array_t  is array (0 to NUM_TAPS-1) of coeff_t;

  -- 21 coefficients (h[0] is for most recent sample, h[20] for oldest)
  constant COEFFS : coeff_array_t := (
    to_signed(   937, COEFF_WIDTH),
    to_signed(  2402, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( 17504, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed(  2402, COEFF_WIDTH),
    to_signed(   937, COEFF_WIDTH)
  );

  -- Shift register for samples
  signal x_reg : sample_array_t := (others => (others => '0'));

  -- Full-precision accumulator
  signal acc_reg : signed(ACC_WIDTH-1 downto 0) := (others => '0');

begin

  ------------------------------------------------------------------------
  -- 16-bit output: take the most significant 16 bits of the accumulator
  ------------------------------------------------------------------------
  data_out <= acc_reg(ACC_WIDTH-1 downto ACC_WIDTH-16);

  ------------------------------------------------------------------------
  -- Input sample shift register
  ------------------------------------------------------------------------
  shift_reg_proc : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        x_reg <= (others => (others => '0'));
      else
        x_reg(0) <= data_in;
        for i in 1 to NUM_TAPS-1 loop
          x_reg(i) <= x_reg(i-1);
        end loop;
      end if;
    end if;
  end process shift_reg_proc;

  ------------------------------------------------------------------------
  -- Multiply-accumulate
  ------------------------------------------------------------------------
  mac_proc : process (clk)
    variable sum  : signed(ACC_WIDTH-1 downto 0);
    variable prod : signed(PROD_WIDTH-1 downto 0);  -- 32 bits
  begin
    if rising_edge(clk) then
      if rst = '1' then
        acc_reg <= (others => '0');
      else
        sum := (others => '0');
        for i in 0 to NUM_TAPS-1 loop
          -- Multiply 16x16 -> 32 bits, then resize to accumulator width
          prod := x_reg(i) * COEFFS(i);  -- result is 32 bits
          sum  := sum + resize(prod, ACC_WIDTH);
        end loop;
        acc_reg <= sum;
      end if;
    end if;
  end process mac_proc;

end architecture rtl;
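The `ACC_WIDTH` headroom in the listing above can be sanity-checked with a short calculation. The sketch below is an illustrative Python check, not part of the original design flow; `worst_case_acc_bits` is a hypothetical helper. It assumes the worst case is a full-scale negative input on every tap with the sign pattern that makes all products add constructively.

```python
COEFFS = [937, 2402, 1479, -1122, -1138, 1751, 1079, -3238, -1119, 10356,
          17504,
          10356, -1119, -3238, 1079, 1751, -1138, -1122, 1479, 2402, 937]
DATA_WIDTH = 16


def worst_case_acc_bits(coeffs, data_width):
    """Conservative count of signed bits needed for the largest sum of products."""
    max_sample = 2 ** (data_width - 1)  # magnitude of the most negative input
    worst = max_sample * sum(abs(h) for h in coeffs)
    return worst.bit_length() + 1  # +1 for the sign bit


needed = worst_case_acc_bits(COEFFS, DATA_WIDTH)
print(f"bits needed: {needed}, ACC_WIDTH in the RTL: {16 + 16 + 5}")
```

Under these assumptions the 37-bit accumulator has margin to spare. Because `data_out` takes the top 16 bits of that accumulator, the output is the full-precision sum shifted right by 21 bits, so the output samples occupy only part of the 16-bit range.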
import math

import numpy as np
import cocotb
from cocotb.clock import Clock
from cocotb.triggers import RisingEdge, Timer

CLK_FREQ_HZ = 100e6                # 100 MHz sample clock
CLK_PERIOD_NS = 1e9 / CLK_FREQ_HZ  # 10 ns
NUM_TAPS = 21
DATA_WIDTH = 16


def gen_tone(freq_hz, fs_hz, n_samples, amplitude=0.8):
    """
    Generate a sine wave tone as 16-bit signed integers (Q1.15 style).

    freq_hz  : tone frequency
    fs_hz    : sample rate
    n_samples: number of samples
    amplitude: 0.0 .. 1.0 (scaled to full-scale 16-bit)
    """
    t = np.arange(n_samples) / fs_hz
    # Full-scale amplitude for signed 16-bit is 32767
    scale = int((2**(DATA_WIDTH - 1) - 1) * amplitude)
    samples = scale * np.sin(2.0 * math.pi * freq_hz * t)
    return np.round(samples).astype(np.int16)


async def apply_tone(dut, freq_hz, n_samples, label):
    """
    Apply a sine tone to the filter and capture the output.

    Returns: (input_samples, output_samples) as numpy arrays
             (int16 inputs, int32 outputs)
    """
    fs = CLK_FREQ_HZ
    in_samples = gen_tone(freq_hz, fs, n_samples, amplitude=0.8)
    out_samples = []

    dut._log.info(f"--- Applying {label} tone: {freq_hz/1e6:.2f} MHz ---")

    for sample in in_samples:
        dut.data_in.value = int(sample)  # cocotb handles signed int
        await RisingEdge(dut.clk)
        # Read signed value from DUT (data_out is signed(15 downto 0))
        out_val = dut.data_out.value.signed_integer
        out_samples.append(out_val)

    # Convert to numpy arrays for convenience
    in_arr = np.array(in_samples, dtype=np.int16)
    out_arr = np.array(out_samples, dtype=np.int32)  # keep wider here

    # Ignore initial transient due to filter latency (~NUM_TAPS)
    steady_start = NUM_TAPS
    in_steady = in_arr[steady_start:]
    out_steady = out_arr[steady_start:]

    # Compute simple RMS to compare levels
    in_rms = math.sqrt(np.mean(in_steady.astype(np.float64)**2))
    out_rms = math.sqrt(np.mean(out_steady.astype(np.float64)**2))

    dut._log.info(
        f"{label} tone {freq_hz/1e6:.2f} MHz: "
        f"input RMS = {in_rms:.2f}, output RMS = {out_rms:.2f}"
    )
    return in_arr, out_arr


@cocotb.test()
async def fir_21tap_pass_stop_tones(dut):
    """
    Test the FIR with:
    - One tone in the passband (< 25 MHz, e.g. 10 MHz)
    - One tone at the stopband edge (30 MHz)
    Clock is 100 MHz (10 ns period).
    """
    # Start 100 MHz clock
    cocotb.start_soon(Clock(dut.clk, CLK_PERIOD_NS, units="ns").start())

    # Reset sequence
    dut.rst.value = 1
    dut.data_in.value = 0
    await Timer(5 * CLK_PERIOD_NS, units="ns")
    for _ in range(5):
        await RisingEdge(dut.clk)
    dut.rst.value = 0
    await RisingEdge(dut.clk)

    # Number of samples per tone (you can increase for better spectral resolution)
    N_SAMPLES = 1024

    # Choose two test frequencies relative to 100 MHz sample rate
    pass_freq_hz = 10e6  # 10 MHz, inside passband (< 25 MHz)
    stop_freq_hz = 30e6  # 30 MHz, at the stopband edge (stopband starts at 30 MHz)

    # Apply passband tone and capture output
    in_pass, out_pass = await apply_tone(
        dut, pass_freq_hz, N_SAMPLES, label="PASSBAND"
    )

    # Small gap between tones (optional)
    for _ in range(10):
        dut.data_in.value = 0
        await RisingEdge(dut.clk)

    # Apply stopband tone and capture output
    in_stop, out_stop = await apply_tone(
        dut, stop_freq_hz, N_SAMPLES, label="STOPBAND"
    )

    # Optional: compute RMS ratio between passband and stopband outputs
    # (ignoring transient at start)
    steady_start = NUM_TAPS
    out_pass_steady = out_pass[steady_start:]
    out_stop_steady = out_stop[steady_start:]

    pass_rms = math.sqrt(
        np.mean(out_pass_steady.astype(np.float64) ** 2)
    )
    stop_rms = math.sqrt(
        np.mean(out_stop_steady.astype(np.float64) ** 2)
    )

    dut._log.info(
        f"Final comparison: passband RMS = {pass_rms:.2f}, "
        f"stopband RMS = {stop_rms:.2f}, "
        f"ratio (stop/pass) = {stop_rms/pass_rms:.3f}"
    )

    # Example loose check: ensure stopband is attenuated by some factor.
    # You can tighten this once you know the expected filter response.
    assert stop_rms < pass_rms, (
        "Expected stopband tone to be attenuated relative to passband tone"
    )
The simulation results are shown below. In the time domain, we can clearly see the output signal passed in the passband and attenuated in the stopband.
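The attenuation seen in simulation can be cross-checked by evaluating the frequency response of the coefficients directly. The following is a minimal Python sketch, assuming the coefficient values from the RTL listing and the 100 MHz sample rate used by the testbench (`CLK_FREQ_HZ`). `gain_db` is a hypothetical helper name, and the gain is reported relative to the DC gain of the filter.

```python
import cmath
import math

COEFFS = [937, 2402, 1479, -1122, -1138, 1751, 1079, -3238, -1119, 10356,
          17504,
          10356, -1119, -3238, 1079, 1751, -1138, -1122, 1479, 2402, 937]
FS_HZ = 100e6  # assumption: sample rate taken from the testbench clock


def gain_db(freq_hz, coeffs=COEFFS, fs_hz=FS_HZ):
    """Magnitude of the FIR frequency response at freq_hz, in dB relative to DC."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    # H(e^jw) = sum over k of h[k] * e^(-j*w*k)
    h = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(coeffs))
    return 20.0 * math.log10(abs(h) / abs(sum(coeffs)))


# The two tone frequencies applied by the testbench.
for f in (10e6, 30e6):
    print(f"{f / 1e6:.0f} MHz: {gain_db(f):.1f} dB")
```

The same function could be used to turn the loose `stop_rms < pass_rms` assertion in the testbench into a check against an expected attenuation figure.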

However, when we implement this design in the FPGA, we quickly run into timing issues at the required clock rate.
Timing and FMAX
We can estimate the potential maximum operating frequency of the design using
FMAX (MHz) = 1000 / (T − WNS)
Where:
T is the clock period in ns
WNS is the worst negative slack in ns
In this design, the long combinatorial accumulation path limits us to around 31 MHz. The issue is not the filter itself, but the way the RTL is architected: the accumulation is not optimised for the FPGA’s DSP and routing resources.
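The FMAX estimate above is simple enough to script when comparing implementation runs. The sketch below is illustrative Python; `fmax_mhz` is a hypothetical helper, and the period and slack values are made-up numbers chosen to land near the figure quoted above, not values taken from the actual timing report.

```python
def fmax_mhz(period_ns, wns_ns):
    """Estimate FMAX in MHz from the clock period and worst negative slack (ns)."""
    return 1000.0 / (period_ns - wns_ns)


# Hypothetical example: a 10 ns constraint that fails timing by 22 ns.
print(f"{fmax_mhz(10.0, -22.0):.2f} MHz")  # prints 31.25 MHz

# Hypothetical example: the same constraint met with 2 ns of positive slack.
print(f"{fmax_mhz(10.0, 2.0):.2f} MHz")    # prints 125.00 MHz
```

A negative WNS lengthens the effective period, so the estimate drops below the constrained frequency. A positive WNS indicates how much faster the design could potentially be clocked.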

Re-Architecting for the FPGA
To optimise the filter for FPGA implementation, we need to:
Use the transposed form of the FIR
Correctly pipeline the DSP elements
That means enabling the internal pipeline registers so that the design maps to:
The A and B input registers of the DSP48
The P output register
The revised RTL, in transposed form, is shown below. We can reuse the same cocotb testbench to verify that functional behaviour remains correct.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir_21tap_2reg is
  generic (
    DATA_WIDTH  : integer := 16;
    COEFF_WIDTH : integer := 16
  );
  port (
    clk      : in  std_logic;
    rst      : in  std_logic;
    data_in  : in  signed(DATA_WIDTH-1 downto 0);
    data_out : out signed(15 downto 0)  -- 16-bit registered output
  );
end entity fir_21tap_2reg;

architecture rtl of fir_21tap_2reg is

  constant NUM_TAPS   : integer := 21;
  constant PROD_WIDTH : integer := DATA_WIDTH + COEFF_WIDTH;  -- 16+16 = 32
  constant ACC_WIDTH  : integer := PROD_WIDTH + 5;            -- headroom

  subtype coeff_t is signed(COEFF_WIDTH-1 downto 0);
  subtype acc_t   is signed(ACC_WIDTH-1 downto 0);
  type coeff_array_t is array (0 to NUM_TAPS-1) of coeff_t;
  type state_array_t is array (0 to NUM_TAPS-2) of acc_t;  -- N-1 states

  ------------------------------------------------------------------------
  -- 21 FIR coefficients h[0]..h[20]
  ------------------------------------------------------------------------
  constant COEFFS : coeff_array_t := (
    to_signed(   937, COEFF_WIDTH),  -- h0
    to_signed(  2402, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( 17504, COEFF_WIDTH),
    to_signed( 10356, COEFF_WIDTH),
    to_signed( -1119, COEFF_WIDTH),
    to_signed( -3238, COEFF_WIDTH),
    to_signed(  1079, COEFF_WIDTH),
    to_signed(  1751, COEFF_WIDTH),
    to_signed( -1138, COEFF_WIDTH),
    to_signed( -1122, COEFF_WIDTH),
    to_signed(  1479, COEFF_WIDTH),
    to_signed(  2402, COEFF_WIDTH),
    to_signed(   937, COEFF_WIDTH)   -- h20
  );

  ------------------------------------------------------------------------
  -- State registers for transposed structure: s(0)..s(19)
  ------------------------------------------------------------------------
  signal state        : state_array_t := (others => (others => '0'));
  signal data_out_reg : signed(15 downto 0) := (others => '0');

begin

  data_out <= data_out_reg;

  ------------------------------------------------------------------------
  -- Transposed-form FIR
  --
  -- Equations (for N=21, coefficient indices 0..20, state indices 0..19):
  --   z_19[n] = h20 * x[n]
  --   z_k[n]  = h(k+1) * x[n] + z_{k+1}[n-1],  k = 0..18
  --   y[n]    = h0 * x[n] + z_0[n-1]
  --
  -- Here:
  --   state(k) holds z_k[n] (k = 0..19)
  --   data_out_reg holds the top 16 bits of y[n]
  ------------------------------------------------------------------------
  fir_proc : process (clk)
    variable x_ext : signed(DATA_WIDTH-1 downto 0);
    variable prod  : signed(PROD_WIDTH-1 downto 0);
    variable acc   : acc_t;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state        <= (others => (others => '0'));
        data_out_reg <= (others => '0');
      else
        -- Capture the input sample once (it is widened when each product
        -- is resized to the accumulator width)
        x_ext := data_in;

        ------------------------------------------------------------------
        -- Update last state: z_19[n] = h20 * x[n]
        ------------------------------------------------------------------
        prod := x_ext * COEFFS(NUM_TAPS-1);  -- h20 * x[n]
        acc  := resize(prod, ACC_WIDTH);
        state(NUM_TAPS-2) <= acc;            -- state(19)

        ------------------------------------------------------------------
        -- Update remaining states backward:
        --   state(k) <= h(k+1)*x[n] + state(k+1)(old)
        -- Note: state(k+1) on RHS is value from previous cycle (n-1),
        -- because signal reads see "old" value in this clock.
        ------------------------------------------------------------------
        for k in NUM_TAPS-3 downto 0 loop    -- k = 18 .. 0
          prod := x_ext * COEFFS(k+1);       -- h(k+1) * x[n]
          acc  := resize(prod, ACC_WIDTH) + state(k+1);
          state(k) <= acc;
        end loop;

        ------------------------------------------------------------------
        -- Output:
        --   y[n] = h0 * x[n] + z_0[n-1] = h0 * x[n] + state(0)(old)
        ------------------------------------------------------------------
        prod := x_ext * COEFFS(0);           -- h0 * x[n]
        acc  := resize(prod, ACC_WIDTH) + state(0);
        data_out_reg <= acc(ACC_WIDTH-1 downto ACC_WIDTH-16);  -- 16-bit
      end if;
    end if;
  end process;

end architecture rtl;
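One way to check that both architectures remain functionally correct is to compare the simulated output against a bit-accurate reference. The sketch below is an illustrative Python model, not part of the original testbench; `wrap_signed` and `expected_outputs` are hypothetical helpers. It assumes the 37-bit accumulator and the top-16-bit output slice used in both listings.

```python
COEFFS = [937, 2402, 1479, -1122, -1138, 1751, 1079, -3238, -1119, 10356,
          17504,
          10356, -1119, -3238, 1079, 1751, -1138, -1122, 1479, 2402, 937]
ACC_WIDTH = 37               # PROD_WIDTH + 5 in the RTL
OUT_SHIFT = ACC_WIDTH - 16   # data_out takes the top 16 accumulator bits


def wrap_signed(value, bits):
    """Wrap an integer to a two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def expected_outputs(samples):
    """Ideal y[n] = sum(h[k] * x[n-k]), truncated the same way as the RTL."""
    out = []
    for n in range(len(samples)):
        acc = sum(h * samples[n - k] for k, h in enumerate(COEFFS) if n - k >= 0)
        acc = wrap_signed(acc, ACC_WIDTH)
        out.append(acc >> OUT_SHIFT)  # arithmetic shift keeps the sign
    return out


# Full-scale impulse: each output is one coefficient scaled and truncated.
print(expected_outputs([32767] + [0] * 20))
```

From my reading of the two listings, the direct-form version registers the samples before the accumulation and so lags the transposed version by one clock. A comparison in the testbench would therefore need to align the captured output with this reference by the pipeline latency of whichever design is under test.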
Once implemented, the updated design now meets timing on the target device at the required clock rate.

Conclusion
This example shows how writing RTL that is architected for the FPGA fabric is critical to achieving high performance. Two functionally identical FIR filters can behave very differently in implementation, depending on how well they map onto the device’s DSP and routing resources.
If you want to learn more about writing code for performance in Vivado, take a look at my webinar, AMD Vivado™ Design Suite Essentials: Key Techniques for Superior RTL Development.
Sponsored by AMD

