$~$

MSc assignments contributing to

the StaccatoLab program

$~$

Author: Kees van Berkel,

  • part-time full professor of SAN (System Architecture and Networking),
  • within the department Computer Science and Mathematics of the Technical University Eindhoven,
  • email: c.h.v.berkel@tue.nl

Date: 2017 November 19

Introduction

  • As background and motivation for the assingments below: see the StaccatoLab research program.
  • All assignments include a pre-study phase (5 ects), assuming the Embedded Systems program.
  • Successful completion of the VLSI Programming course (2IMN35) is a considerable advantage, but not required.
  • Experience with Python programming is a considerable advantage, but not required.
  • The list of MSc assignments below may not be up to date. Feel free to contact me on the latest status.

Extend PYNQ with a dataflow interface (PYNQ = Python on Zynq)

Background

  • The PYNQ-Z1 board is the hardware platform for the PYNQ open-source framework, and can be programmed in Jupyter Notebook using Python. The board features the ZYNQ XC7Z020 FPGA.
  • Given: a converter from StaccatoLab dataflow graphs to an equivalent Verilog description. (Work in progress.)

Goal

  • a PYNQ extension that allows the user to logic-synthesize, load, run, and debug StaccatoLab dataflow graphs,
  • ... such that all user interaction is in dataflow terms (dataflow graph, nodes, edges, tokens, ..., rather than Verilog entities, FPGA resources, ... );
  • a demonstration of this extension on a video (image stream) dataflow program on a PYNQ-Z1 board.

Approach

  • pre-study: develop a video algorithm as dataflow graph (e.g. a video scaler or a Sobel filter);
  • pre-study: develop a Python script to create an FPGA image from a Verilog program;
  • explore how logic-synthesis results can be reinterpreted in dataflow terms;
  • explore how to interact with FPGA hardware in dataflow terms: run graph, set break-point, inspect node-fsm state, inspect token, show firing rates, ..;
  • design Python scripts as PYNQ extensions (invoked from Jupyter notebook) that offer such a dataflow interface;
  • .. same, for the video algorithm;
  • develop Jupyter-based instrumentation: FPGA break-point setting, FPGA state inspection, ...

Results

  • Python scripts for mapping, running, and instrumentation of dataflow graphs on a PYNQ-Z1;
  • a dataflow-on-PYNQ demonstrator, which can be used as basis for new assignments for the VLSI Programming course.

Mapping dataflow graphs on multiple FPGAs

Background

  • Large dataflow graphs may take advantage of multiple FPGAs, ultimately up to thousands of FPGAs.
  • Inter-FPGA communication of dataflow tokens is much more involved than intra-FPGA communication. There are multiple options for inter-FPGA communication, involving different resources, with different constraints, and with different performance impacts.

Goal

  • dataflow models of inter-FPGA traffic (latency, slack, throughput), for different options: FPGA-FPGA, FPGA-DRAM-FPGA, and FPGA-DRAM-network-DRAM-FPGA;
  • tool support for dataflow graph partitioning, user-directed and/or automated.

Approach

  • pre-study: design example dataflow graph(s);
  • pre-study: develop inter-FPGA traffic models, based both on theory and on experiments;
  • pre-study: define the dataflow graph partitioning problem (nb. graph cycles!);
  • analyze the performance impacts of a given dataflow-graph partitioning;
  • develop tool support for user-directed graph partitioning;
  • evaluate existing graph partitioning algorithms;
  • select/modify a graph partitioning algorithm and implement it in Python;
  • evaluate the partitioning algorithm on selected dataflow graphs.

Results

  • dataflow models for inter-FPGA traffic;
  • a study of relevant dataflow graph partitioning algorithms;
  • a tool for user-directed graph partitioning;
  • a tool that analyzes the performance impact of dataflow graph partitioning;
  • a tool that proposes a dataflow partitioning with minimal impact on the overall graph throughput;

StaccatoLab simulator acceleration

Background

  • The current StaccatoLab dataflow graph simulator is programmed in Python, but offers a throughput of only about 100k events (token productions) per cpu second (on a Mac notebook), for audio and image processing applications.
  • Running a dataflow graph on a large FPGA should easily reach 100M - 100G events/sec, but offers limited support for debugging.

Goal

  • support the development and debugging of large dataflow graphs,
  • .. by means of a simulator with a throughput of 1-10M events/sec on a laptop,
  • .. while retaining (most of) the current (Jupyter-based) debug support.

Approach

  • pre-study: analyze throughput limitations of the existing simulator;
  • pre-study: explore the potential of compiled-C code alternatives;
  • develop a compiled-C simulator for single-rate dataflow-graphs;
  • interface this simulator to current Python-based StaccatoLab graphs and the Jupyter interfaces;
  • evaluate throughput gains;
  • optimize and extend this simulator to cyclo-static dataflow graphs.

Results

  • a simulator extension offering a speed-up by one or two orders of magnitude.

Abstraction of dataflow sub-graphs as dataflow nodes

Background

  • Can the behavior of a dataflow subgraph be modeled by an "equivalent" single dataflow node? What kind of dataflow subgraphs can be modeled by an "equivalent" cyclo-static dataflow node?
  • A practical value of such abstraction could be in replacing a subgraph by an equivalent node and thereby speeding up dataflow simulation.
  • From an application viewpoint it may be sufficient to require "data equivalence": the node produces the same output data stream(s), given the same input streams.
  • When the subgraph operates in a more dynamic environment, "flow equivalence" may (or may not) be important.

Goal

  • a description of dataflow sub-graph categories ("patterns") that can be transformed into a CSDF node;
  • a tool that supports such abstraction transformation;
  • a demonstration on e.g. a deeply-pipelined 1D FFT as part of a 2D FFT.

Approach

  • pre-study: design a collection of a single-rate dataflow subgraphs with a data-equivalent CSDF node, and analyze their flow behaviors;
  • pre-study: study relevant literature on forms of program equivalence;
  • define flow equivalence, assuming the StaccatoLab execution model;
  • specify several classes of dataflow sub-graphs for which flow-equivalent nodes can be generated;
  • develop a tool that generates a StaccatoLab node description (finite state machine + output functions) for such a class of dataflow subgraphs.

Results

  • some theory + a practical tool + examples.

A dataflow graph for pulsar search

Background

  • A pulsar is a highly magnetized, rotating neutron star or white dwarf. It radiates a beam of electro-magnetic emission, much like a light house. Rotation frequencies ranging from milliseconds to seconds.
  • Searching pulsars is based on so-called coherent de-dispersion, and is extremely computationally intensive. See e.g. link.

Goal

  • a scalable dataflow graph for pulsar search;
  • .. optimized for implementation on (multiple) FPGA(s);
  • an "apple-to-apple" benchmark with existing GPU-based implementations. (Goal: 10x lower power consumption).

Approach

  • pre-study: create a reference solution for pulsar-search algorithm, hopefully with help from colleagues involved in link;
  • pre-study: design a dataflow graph for one (or more) kernels of the pulsar-search algorithm;
  • explore the limitations of GPU implementations and the potential of FPGA implementations (and vv);
  • design a scalable dataflow graph for pulsar search;
  • demonstrate this graph on an FPGA, using real radio-astronomy data;
  • perform benchmarking.

Results

  • a scalable dataflow graph for pulsar search;
  • benchmarking FPGA versus GPU for pulsar search.

A dataflow accelerator for molecular dynamics

Background

  • Molecular dynamics is a computer simulation method for studying the physical movements of atoms and molecules, and is thus a type of N-body simulation.
  • The Lennard-Jones potential is a mathematically simple model that approximates the interaction between a pair of neutral atoms or molecules.

Goal

  • acceleration of molecular dynamics (assuming the Lennard-Jones potential)
  • .. by means of an FPGA co-processor programmed by a dataflow-graph;
  • improved understanding the use of "dataflow-graph as accelerator" software architecture.

Approach

  • pre-study: create a reference solution for a molecular-dynamics simulator (preferably based on existing work);
  • pre-study: design a dataflow graph for one (or more) molecular-dynamics kernels;
  • analyze available (forms of) parallelism, both qualitatively and quantitatively;
  • define a suitable accelerator and specify the host-accelerator interface;
  • program this accelerator as a dataflow graph;
  • implement and demonstrate this accelerator on an FPGA board;
  • perform benchmarking.

Results

  • a simple dataflow-based molecular-dynamics simulator that can be scaled up to multiple (large) FPGAs;
  • a demonstrator on a medium-scaled FPGA;
  • a performance model (simulation throughput versus available resources);
  • a template for using a dataflow-graph as an accelerator.