
The StaccatoLab research program


Author: Kees van Berkel,

• part-time full professor of SAN (System Architecture and Networking),
• within the Department of Mathematics and Computer Science of Eindhoven University of Technology,
• email: c.h.v.berkel@tue.nl

Date: 2017 November 19

# StaccatoLab

StaccatoLab is a small-scale research program:

• aiming at large-scale (exascale) parallel computing;
• using radio astronomy algorithms as inspiration;
• assuming dataflow as a programming model;
• exploring innovative dataflow execution models;
• developing innovative dataflow programming tools;
• and targeting multi-FPGAs as accelerator technology.

If you are interested in contributing to StaccatoLab by means of an MSc project, first read the background information below and then consult the MSc assignments contributing to the StaccatoLab program.

The image below shows Hercules A, a supergiant elliptical galaxy located 2.1 billion light-years from Earth. It is a Hubble image superposed with a radio image taken by the VLA radio telescope. The optically invisible jets, one-and-a-half million light-years wide, dwarf the visible galaxy from which they emerge.

Next-generation radio telescopes push the envelope of High-Performance Computing. The workload for the SKA (the Square Kilometer Array, a radio telescope to become operational during the next decade) has been estimated to be in the exascale range (MPSoC'15). The SKA power consumption budget for compute is about 20 MW, corresponding to about 30 M€/year.
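The power budget and the annual cost can be related by simple arithmetic. The sketch below assumes that the machine draws its full budget around the clock for a whole year; the electricity price it prints is merely what the two quoted figures imply under that assumption, not a number taken from the SKA documents.

```python
# Relate the quoted 20 MW compute budget to the quoted ~30 M EUR per year.
# Assumption (not from the text): full power is drawn all year round.
HOURS_PER_YEAR = 365 * 24                    # 8760 hours

power_mw = 20.0                              # compute power budget in MW
annual_cost_eur = 30e6                       # quoted annual cost in EUR

energy_mwh = power_mw * HOURS_PER_YEAR       # energy per year in MWh
implied_price = annual_cost_eur / (energy_mwh * 1000)   # EUR per kWh

print(f"energy per year: {energy_mwh / 1000:.1f} GWh")
print(f"implied price:   {implied_price:.3f} EUR/kWh")
```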

# Exascale computing

Exascale $\approx$ 1 exaflop/s = $10^{18}$ Flop/s, that is, $10^{18}$ floating-point operations per second.

The graph below shows that the 20 fastest and the 20 greenest supercomputers from the TOP500 list (November 2017) fall short of the projected SKA workload and power targets by an order of magnitude, even when assuming 100% resource utilization.
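To make the power target concrete, the following sketch computes the energy efficiency a machine would need in order to sustain one exaflop/s within the 20 MW compute budget quoted for the SKA. It uses only the two figures from the text; the variable names are illustrative.

```python
# Energy efficiency needed to sustain 1 exaflop/s within a 20 MW budget.
target_flops = 1e18          # floating-point operations per second
power_budget_w = 20e6        # 20 MW, the SKA compute budget

required_gflops_per_watt = target_flops / power_budget_w / 1e9
print(f"required efficiency: {required_gflops_per_watt:.0f} GFlop/s per W")
```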

# FPGAs as accelerators

A typical supercomputer these days comprises many thousands of compute nodes, where each node contains a multi-core CPU. A node of a green supercomputer (above) also contains a programmable GPU (Graphics Processing Unit) as accelerator.

Our aim is to increase the throughput by 5x and reduce the power consumption (energy bill) by 10x by using a Field-Programmable Gate Array (FPGA) as accelerator.

The graph below shows rooflines for equivalent GPU and FPGA compute nodes. (Equivalent here means: same raw compute power, same memory bandwidth.) The markers show achievable performance for 2D FFT applied to large images. In this apples-to-apples comparison, FPGAs intrinsically outperform GPUs by 5x in throughput (MPSoC'16).
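For readers unfamiliar with rooflines, here is a minimal sketch of the standard roofline model: attainable performance is the lesser of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The node parameters below are hypothetical round numbers chosen for illustration; they are not the GPU or FPGA figures behind the graph.

```python
def roofline(intensity, peak_flops, mem_bandwidth):
    """Attainable Flop/s at a given arithmetic intensity (Flop per byte)."""
    return min(peak_flops, mem_bandwidth * intensity)

# Hypothetical node: 1 TFlop/s peak, 100 GB/s memory bandwidth.
PEAK = 1e12
BANDWIDTH = 100e9
ridge = PEAK / BANDWIDTH     # intensity at which the two roofs meet

for intensity in (0.5, 2.0, 10.0, 50.0):
    perf = roofline(intensity, PEAK, BANDWIDTH)
    bound = "memory" if intensity < ridge else "compute"
    print(f"{intensity:5.1f} Flop/byte -> {perf / 1e9:7.1f} GFlop/s ({bound}-bound)")
```

Because equivalent nodes share the same peak and bandwidth, they share the same roofline in this model; differences between markers then reflect how close each architecture gets to that roof.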

# Dataflow as programming model

Exascale ($10^{18}$ Flop/s) is equivalent to one billion scalar units operating in parallel, at a clock frequency of one GHz, with full utilization. How does one design and debug such giga-scale parallel programs?
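The equivalence is a one-line calculation, sketched below under the assumption that each scalar unit completes one operation per clock cycle. The `utilization` parameter is a hypothetical addition that shows how the unit count grows when full utilization is not reached.

```python
def units_needed(target_ops_per_s, clock_hz, utilization=1.0):
    """Scalar units needed if each unit completes one operation per clock cycle."""
    return target_ops_per_s / (clock_hz * utilization)

print(f"{units_needed(1e18, 1e9):.1e}")         # full utilization
print(f"{units_needed(1e18, 1e9, 0.5):.1e}")    # hypothetical 50% utilization
```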

Our premise: we need a programming model that supports quantitative reasoning about parallelism. That is, quantitative reasoning about schedules, throughputs, resource utilization, scaling, parameterization, and program transformations.

Dataflow is our programming model of choice. Our focus is on Static DataFlow (SDF), including multi-rate and cyclo-static dataflow. Where needed, judiciously and carefully chosen forms of dynamic dataflow and nondeterminism can be introduced (MPSoC'17).
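One example of the quantitative reasoning that static dataflow permits is solving the balance equations of a multi-rate graph: for every edge, the tokens produced per iteration must equal the tokens consumed. The sketch below computes the smallest integer firing counts (the repetition vector) for a connected graph. The function name and the edge encoding `(src, dst, produced, consumed)` are illustrative choices and are not part of the StaccatoLab tools.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(nodes, edges):
    """Smallest positive integer firing counts q such that
    q[src] * produced == q[dst] * consumed for every edge.
    Assumes a connected graph; edges are (src, dst, produced, consumed)."""
    rate = {nodes[0]: Fraction(1)}          # relative firing rates
    pending = list(edges)
    while pending:
        progress = False
        for edge in list(pending):
            src, dst, prod, cons = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates: no periodic schedule")
            else:
                continue                    # neither end known yet; retry later
            pending.remove(edge)
            progress = True
        if not progress:
            raise ValueError("graph is not connected")
    # Scale all rates by the least common multiple of their denominators.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {n: int(r * scale) for n, r in rate.items()}

# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
q = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
print(q)
```

In this example one iteration of the graph fires A three times, B twice, and C once, after which every edge holds as many tokens as before.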

Below is a sketch of a hierarchical dataflow graph for radio astronomy imaging ("Cotton & Schwab"), with a workload in the petascale-exascale range, depending on the choices for sub-algorithms and on the exact SKA dimensioning. Each node denotes a subgraph.

# StaccatoLab execution model and tooling

StaccatoLab is also the name of our dataflow execution model. It specifies how a dataflow program (graph) is executed, cycle by cycle. StaccatoLab execution is: Self-Timed, Clocked, Throttled, One-token-output, and Look-Ahead-Back-pressured.
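The precise StaccatoLab rules are not spelled out here, so the following is only a generic sketch of clocked, self-timed execution with back-pressure on a linear pipeline. It is not the StaccatoLab model itself: throttling and look-ahead are not modelled, and the function name, the FIFO capacity, and the `sink_period` parameter are all assumptions made for illustration.

```python
def simulate(num_nodes, capacity, cycles, sink_period=1):
    """Cycle-by-cycle run of a linear pipeline node0 -> node1 -> ... -> node(n-1).

    fifo[i] counts the tokens on the edge from node i to node i+1. In every
    clock cycle a node fires if, at the start of that cycle, its input edge
    holds a token and its output edge has free space (back-pressure). The
    last node (the sink) can only fire every `sink_period` cycles.
    Returns the number of firings per node.
    """
    fifo = [0] * (num_nodes - 1)
    fired = [0] * num_nodes
    for cycle in range(cycles):
        snapshot = list(fifo)        # all nodes decide on the same state
        for i in range(num_nodes):
            has_input = i == 0 or snapshot[i - 1] > 0
            has_space = i == num_nodes - 1 or snapshot[i] < capacity
            ready = i < num_nodes - 1 or cycle % sink_period == 0
            if has_input and has_space and ready:
                if i > 0:
                    fifo[i - 1] -= 1     # consume one input token
                if i < num_nodes - 1:
                    fifo[i] += 1         # produce one output token
                fired[i] += 1
    return fired

fast = simulate(num_nodes=3, capacity=2, cycles=10)
slow = simulate(num_nodes=3, capacity=2, cycles=20, sink_period=2)
print(fast, slow)
```

With an unconstrained sink every node settles at one firing per cycle once the pipeline has filled. With a sink that accepts a token only every second cycle, the bounded edges fill up and back-pressure slows the source to the same long-run rate, without any central scheduler.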

StaccatoLab is also the name of a set of tools to support programming, debugging, analysis, and optimization of dataflow programs. An alpha release of the tools has been used for the 2017 course VLSI programming (2IMN35). A Verilog backend (interfacing FPGA tool flows) is work in progress.

The graph below shows a still from an interactive StaccatoLab simulation of a Sobel filter, a well-known image processing function, followed by an example of an input image and a corresponding output image.
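For reference, the function computed by a Sobel filter can be written in a few lines of plain Python. This is a sequential sketch of the arithmetic only, using the standard 3x3 Sobel kernels; it is not the StaccatoLab dataflow program shown in the simulation, and the tiny test image is made up for illustration.

```python
from math import hypot

KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal gradient kernel
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical gradient kernel

def sobel(image):
    """Gradient magnitude of a grayscale image given as a list of rows.
    Border pixels are skipped, so the result is 2 smaller in each dimension."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += KX[dy][dx] * pixel
                    gy += KY[dy][dx] * pixel
            row.append(hypot(gx, gy))
        result.append(row)
    return result

# Tiny test image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]
edges = sobel(image)
for row in edges:
    print(row)
```

Each output pixel depends only on a 3x3 neighbourhood of input pixels, which is what makes the filter a natural fit for a streaming dataflow pipeline.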

# PYNQ as experimentation and demo platform

The PYNQ-Z1 board is the hardware platform for the PYNQ open-source framework, and can be programmed in Jupyter Notebook using Python. The board features the ZYNQ XC7Z020 FPGA. For the 2018 edition of the course VLSI Programming we are considering the PYNQ-Z1 FPGA board.

# Python and Jupyter

The dataflow programming language used by StaccatoLab is embedded in Python. The StaccatoLab tools are programmed in Python as well. Python is a widely used high-level programming language for general-purpose programming. Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.

A dataflow program is developed and documented as a Jupyter Notebook. So are the StaccatoLab tutorial, the StaccatoLab regression tests, the VLSI programming course notes, and the student assignments. StaccatoLab tools are invoked from the Jupyter notebook containing the dataflow program (graph). The Jupyter notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

# Scaling up?

PYNQ-Z1 contains a modest FPGA, capable of about 100 G operations/s. Scaling up to exascale is, well, ... far from obvious. Dataflow as programming model offers an attractive perspective on scaling up, because it supports:

• hierarchy: a dataflow graph node can represent a subgraph of nodes (recursively);
• repetition: a (sub)graph may contain regular, repetitive (2D) pipeline subgraphs;
• graph transformation: e.g. unfolding, may be used to increase parallelism;
• abstraction: a single dataflow token may denote e.g. a complete 16k$\times$16k image (edges carrying such compound tokens need to be mapped onto DRAMs);
• fault tolerance (?)
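Two numbers in this section can be put into perspective with a short calculation: the gap between a PYNQ-Z1 and exascale, and the size of a single compound token. The sketch assumes that 16k means 16384 and that each pixel is stored as a 4-byte value; neither assumption comes from the text.

```python
# Gap between one PYNQ-Z1 (about 100 G operations/s) and exascale.
pynq_ops_per_s = 100e9
exascale_ops_per_s = 1e18
scale_factor = exascale_ops_per_s / pynq_ops_per_s
print(f"scale-up factor: {scale_factor:.0e}")

# Size of one compound token: a complete 16k x 16k image.
# Assumptions: 16k = 16384 pixels per side, 4 bytes per pixel.
side = 16 * 1024
bytes_per_pixel = 4
token_bytes = side * side * bytes_per_pixel
print(f"token size: {token_bytes / 2**30:.0f} GiB")
```

Under these assumptions a single token is far larger than typical on-chip FPGA memory, which is consistent with the remark that such edges need to be mapped onto DRAMs.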

Scaling up StaccatoLab is a multi-year research program.
