MicroZed Chronicles: Versal AI Engines
- Adam Taylor
Over the last few months I have been doing a lot of work with Versal devices and the AI Engines provided on devices like the XCVE2302 and the XCVC1902. The applications we have been developing using these AI Engines have mostly been DSP related rather than AI related.
So I thought it would be a good idea to have a look at the AI Engines provided by both the core and edge devices, examining what the AI Engines actually are, how we can use them for our applications, and of course how we develop solutions for them.
To get started, the AI Engines are very long instruction word (VLIW), single instruction, multiple data (SIMD) processors. These processors are designed to exploit both instruction-level and data-level parallelism to achieve high throughput.

In our Versal devices these AI Engine processors are called AIE tiles and are arranged in an array. Each of these tiles, along with the processor, contains tightly coupled memories and on-tile interconnects for streaming, configuration, and memory-mapped access. AIE tiles are able to link north, south, west, and east through the streaming interface. To enable data access with the wider Versal system (e.g., NoC and PL) there are dedicated interface tiles.

Typically, applications therefore connect multiple tiles together to implement the required functionality.
What are the different types of AI Engine?
Depending on the range and generation of the Versal device, several different AI Engine variants are provided.
• Versal Core & Premium Gen 1 – AI Engine.
• Versal Edge Gen 1 – AI Engine for ML.
• Versal Edge Gen 2 – AI Engine for ML, Version 2.
AIE provides wide SIMD operations on classic fixed-point types and full single-precision floating point. Vector lanes natively cover 8/16/32-bit integers (including complex forms) and real/complex FP32; as such, AIE excels at numerically sensitive DSP (FIR, FFT, beamforming) as well as fixed-point ML primitives.
AIE-ML (first-generation ML engines) shifts the centre of gravity toward inference. It adds ML-oriented data types, most notably bfloat16 (BF16), and low-precision integer vectors down to INT4 to leverage more efficient quantized CNN/transformer blocks while still composing DSP-style graphs. Critically, AIE-ML removes the native FP32 vector pipeline present in AIE; in its place, float support is provided via BF16-based emulation/accumulation in the API.
AIE-ML v2 (second-generation ML engines) builds on that with higher per-tile compute and better performance per watt for inference. It expands native types beyond AIE-ML: FP16 and FP8 are supported in the vector unit, along with microscaling (MX) formats MX9, MX6, and MX4, where lanes share an exponent for block-floating efficiency.
Why can we use AIE for DSP applications?
I mentioned that we had been using AI Engines for DSP applications. This is because the underlying mathematics of many AI and DSP algorithms is based on operations over large arrays.
These operations include dot products, finite impulse response (FIR) filters, fast Fourier transforms (FFT), and matrix multiplication. At the heart of these operations is the multiply–accumulate (MAC). To get the best performance from the MAC, the data should be held in tightly coupled memory.
The simplest MAC-based operation is the dot product: a MAC loop over two vectors of equal length. The MAC is critical for both DSP and AI; in neural networks, many operations compute a weighted sum plus a bias, e.g., dot(a, b) + bias.
For one-dimensional convolution, represented by the equation y[n] = Σₖ x[n−k]·h[k], the output is a dot product between a sliding window of the input x and the vector h, which is a predefined constant. A typical use case of one-dimensional convolution is FIR filtering.
How do we program an AIE?
To program an AIE we need to define not only the sequence executed by a particular AIE tile (the kernel) but also how data flows between the AIE tiles. Connecting AIE tile kernels together is called creating a graph and is based on a distributed model of computing proposed in 1974 by Gilles Kahn, called a Kahn Process Network (KPN). In a KPN, processes communicate only through FIFO channels and execute in parallel whenever their input data is available.
In Vitis, the AI Engine C++ ADF graph serves as the description of the pipeline. Kernels are written in standard C/C++, and the graph defines how samples move either as framed windows for deterministic staging or as continuous streams, while making interfaces explicit. When you build the graph, the AI Engine compiler elaborates the graph, selects and connects the kernels, configures tile DMAs, and emits the graph container, which is a libadf.a with metadata.
The Vitis system linker (v++ --link) then packages that container, any PL kernels, and the platform into a single device image (PDI). Verification is done in two passes: software emulation for very fast functional checks using file I/O, followed by aiesim for tile-accurate timing and back-pressure behaviour.
Vitis Analyzer closes the loop by exposing stalls and utilisation, so buffer sizes, window lengths, and NoC choices can be optimised for the application.
At run time, the processing system controls the graph through XRT/ADF, including start/stop, run-time parameter updates, and event profiling, which enables us to fine-tune behaviour without a rebuild.
This has given us a little insight into the AI Engine—the generations, capabilities, and how we create a solution. We will look at a hands-on approach to doing this in another article soon.
UK FPGA Conference
FPGA Horizons - October 7th 2025 - THE FPGA Conference, find out more here.
Workshops and Webinars:
If you enjoyed the blog why not take a look at the free webinars, workshops and training courses we have created over the years. Highlights include:
• Upcoming Webinars – Timing, RTL Creation, FPGA Math and Mixed Signal
• Professional PYNQ – learn how to use PYNQ in your developments
• Introduction to Vivado – learn how to use AMD Vivado
• Ultra96, MiniZed & ZU1 – three-day course looking at HW, SW and PetaLinux
• Arty Z7-20 – class looking at HW, SW and PetaLinux
• Mastering MicroBlaze – learn how to create MicroBlaze solutions
• HLS Hero Workshop – learn how to create High-Level Synthesis based solutions
• Perfecting Petalinux – learn how to create and work with the PetaLinux OS
Boards
Get an Adiuvo development board:
• Adiuvo Embedded System Development Board – embedded system development board
• Adiuvo Embedded System Tile – low-risk way to add an FPGA to your design
• SpaceWire CODEC – digital download, AXIS interfaces
• SpaceWire RMAP Initiator – digital download, AXIS & AXI4 interfaces
• SpaceWire RMAP Target – digital download, AXI4 and AXIS interfaces
Embedded System Book
Do you want to know more about designing embedded systems from scratch? Check out our book on creating embedded systems. This book will walk you through all the stages of requirements, architecture, component selection, schematics, layout, and FPGA / software design. We designed and manufactured the board at the heart of the book! The schematics and layout are available in Altium here. Learn more about the board (see previous blogs on Bring up, DDR validation, USB, Sensors) and view the schematics here.
Sponsored by AMD