 | |
Instructor: Herman Haverkort (office: HG 7.35)
Course dates: February to June 2012
(Registration for this course is now closed. Students who did not register are welcome to our similar course next year.) |
Update 15 June: added the slides of the last three lectures.
|
|
Update 17 February: included most of the lecture schedule, most of the literature that needs to be studied, the answers to the prerequisites test
(use username "io" and password "io" to access the file), and some "small print" at the bottom (policies regarding attendance, grading, deadlines etc.).
|
Traditionally, the running time of an algorithm is analysed in terms of the number
of computational steps it performs (comparisons, additions, etc.). However, when working
with huge amounts of data that do not fit in main memory, the number of disk accesses
is much more important for the running time than the number of computational steps.
Even when the data fit in main memory at once, a significant part of the running time
of an algorithm that works on large amounts of data is determined by the way in which
the algorithm exploits the cache: data that is cached can be accessed much faster
than data that resides in main memory. Therefore it is important to design algorithms
for large data sets in such a way, that the data needed next is always likely to be
data recently brought into cache.In this course we study so-called I/O-efficient
algorithms: algorithms and data structures that are designed in order to keep the number
of disk accesses or cache misses as small as possible. We discuss models for the design
and analysis of such algorithms, basic algorithms, data structures and paradigms, and
applications where I/O-efficiency makes the difference between algorithms that work and
algorithms that do not work in practice. Furthermore, we will see I/O-efficiency at work
in some simple experiments with huge data sets.
The course will make the students familiar with the basic concepts and techniques
for the design of I/O-efficient algorithms and data structures. The students will
learn to recognize I/O-bottlenecks in algorithms, learn basic techniques to alleviate
such bottlenecks, and will get an idea of the orders of magnitude of the parameters
involved. The students will practice finding, reading, interpreting and explaining
advanced literature on the topic. Furthermore the course offers opportunities
for students to practice their presentation skills.
The course is taught in a mixed form. Some lectures are given by the instructor;
most lectures are given by the students, who will also run a number of experiments.The students collaborate in triples to prepare their lectures. Each triple gets assigned a particular
topic and has to give a lecture of 2 x 45 minutes. Links to basic literature for the
lectures are provided by the instructor. The students are expected to look up additional
literature when necessary. The instructor will advise the students on the preparation of their talks. The students
will meet with the instructor at least two weeks before the lecture for advice on its
contents, and at least one week before the lecture to discuss (the draft version of) the
presentation. Each student is expected to spend about 40 hours on preparing the lecture.
For some general advice on how to study the material and on how to prepare your talk,
see How to give a seminar talk. The students will also collaborate in (and also across) triples to study possible solutions
for problems where I/O-efficiency is crucial. The problems studied are abstracted versions
of fundamental problems occurring in real-life applications. The students will do experiments
with different solutions and write short papers about them. Near the end of the course, the
students review each other's papers and present their results to each other.
There is a homework assignment at the end of course.
Registration for this course is now closed.
All meetings for this course---except one on 5 April---are on Fridays, from 15:45 to 17:30 in the lecture hall of Gemini-Zuid.
| date | time | room | topic | literature | lecturer(s) | PC time |
| Fri | 10 | Feb | 15:45 | Gem-Z | Introduction | Meyer: sections 1 (except 1.5), 2 (intro), 2.1.1, 3.1-3.2; Demaine: section 2; slides | Herman Haverkort | N/A |
| Sun | 12 | Feb | 23:59 | deadline for prerequisites test |
| Fri | 17 | Feb | 15:45 | Gem-Z | Basics: stacks, queues, sorting, caching policies | see 1 Feb; in addition Duartes; Sleator, section 4; slides/notes | Herman Haverkort | N/A |
| Fri | 24 | Feb | no meeting (carnival vacation) | N/A |
| Fri | 2 | Mar | 15:45 | Gem-Z | Challenges: graph traversals, permutations, list ranking | Meyer: sections 1.5, 4.1, 4.2; Zeh: sections 2, 3; slides/notes | Herman Haverkort | 27/2-4/3 Team ABM |
| Fri | 9 | Mar | 15:45 | Gem-Z | Breadth-first search | Meyer: sections 4.2, 4.3; and/or Zeh: section 6; slides directed BFS, slides undirected BFS | Team PQS | 5/3-11/3 Team ABM |
| Fri | 16 | Mar | 15:45 | Gem-Z | Ordered lists and search trees; report by team ABM; | Meyer: sections 2.1.2, 2.3 (intro), 2.3.1-2.3.4; Demaine: section 4.1 slides/notes | Herman Haverkort | 12/3-18/3 Team ABM |
| Fri | 23 | Mar | 15:45 | Gem-Z | Persistent B-trees and interval trees | Arge (2): sections 4, 6; slides persistent B-trees; slides internal interval trees; slides external interval trees | Team BNW | 19/3-25/3 Team MNW |
| Fri | 30 | Mar | no meeting (open door day) | 26/3-1/4 Team LMP |
| Thu | 5 | Apr | 15:45 | Aud 1 | Priority queues, shortest paths and DFS | Meyer: sections 4.4-7; and/or Zeh: sections 6.1, 7; slides DFS; slides SSSP | Team AT | 2/4-8/4 Team BNW |
| Fri | 13 | Apr | no meeting (mid-semester break) | 9/4-15/4 Team LMP |
| Fri | 20 | Apr | no meeting (mid-semester break) | 16/4-22/4 Team AT |
| Fri | 27 | Apr | 15:45 | Gem-Z | Priority queues, time-forward processing and plane sweeping | Meyer: sections 3.4, 3.5 or Zeh: section 4; Arge (1): sections 1-4; De Berg: section 2.1; slides | Team LMP | 23/4-29/4 Team AT |
| Fri | 4 | May | 15:45 | Gem-Z | Shortest paths on grids and planar graphs | Zeh: section 8; slides | Team HLS | 30/4-6/5 Team GGT |
| Fri | 11 | May | no meeting (lecturers ill) | 7/5-13/5 Team GGT |
| Fri | 18 | May | no meeting (university closed on the day after Ancension Day) | 14/5-20/5 Team PQS |
| Fri | 25 | May | 15:45 | Gem-Z | Operations on sparse matrices | Bender: sections 1-4, 7; slides | Team GGT | 21/5-27/5 Team HLS |
| Fri | 1 | Jun | 15:45 | Gem-Z | Data structures for points in the plane | Arge (2): section 8.2.1; Beckmann; Kamel; Van den Bercken; slides | Team MNW | 28/5-3/6 Team MNW |
| Fri | 8 | Jun | 15:45 | deadline almost-final (reviewable) reports on experiments with sorting & permutation, BFS, DFS, and time-forward processing | 4/6-10/6 Team BNW |
| 15:45 | Gem-Z | Topological sorting | Arge (3): sections 1-2; Ajwani; slides | Team ABM |
| Fri | 15 | Jun | 15:45 | deadline almost-final (reviewable) reports on experiments with connected components, matrix operations, search trees, and text search structures | 11/6-17/6 Team PQS |
| 15:45 | deadline reviews on experiments with sorting & permutation, BFS, DFS, and time-forward processing |
| 15:45 | Gem-Z | Discussing experiments with sorting & permutation, BFS, DFS, and time-forward processing | | Teams ABM, LMP, AT, GGT |
| Fri | 22 | Jun | 15:45 | deadline reviews on experiments with connected components, matrix operations, search trees, and text search structures | |
| 15:45 | Gem-Z | Discussing experiments with connected components, matrix operations, search trees, and text search structures | | Teams HLS, BNW, MNW, PQS |
| Sun | 8 | Jul | 23:59 | deadline homework assignment | |
At the end of each meeting, some time may be reserved to discuss the progress or results of early experimentation projects. Note that attending all meetings is mandatory. In case this is a problem, contact Herman Haverkort in advance. The teams and their assignments are as follows:
| team | members | lecture | experimentation project | computer time |
| ABM | Alex, Bauke, Martin | Topological sorting (8/6) | Sorting & Permutation | until 18/3 |
| AT | Ade, Teng | Depth-first search etc. (5/4) | Depth-first search | 16/4-29/4 |
| BNW | Bram, Nicolas, Wilco | Persistent B-trees etc. (23/3) | Matrix operations | 2/4-8/4; 4/6-10/6 |
| GGT | Gijs, Gijs, Twan | Sparse matrices (25/5) | Time-forward processing | 30/4-13/5 |
| HLS | Hoisun, Luc, Stef | Planar graphs (4/5) | Connected components | 21/5-27/5 |
| LMP | Lourens, Marlous, Pieter | Time-forward processing etc. (27/4) | Breadth-first search | 26/3-1/4; 9/4-15/4 |
| MNW | Marcel, Niels, Wouter | Data structures for points (11/5) | Search trees | 19/3-25/3; 28/5-3/6 |
| PQS | Paul, Quirijn, Sander | Breadth-first search (9/3) | Text search structures | 14/5-20/5; 11/6-17/6 |
[Ajwani]
D. Ajwani, A. Cosgaya-Lozana, and N. Zeh,
Engineering a topological sorting algorithm for massive graphs,
In International Workshop on Algorithm Engineering and Experiments (ALENEX) 2011, p139-150
Available from publisher;
local copy.[Arge (1)]
L. Arge,
The Buffer Tree: a technique for designing batched external data structures,
Algorithmica (2003) 37:1-24
Available from publisher;
local copy. [Arge (2)]
L. Arge,
External memory geometric data structures,
Lecture notes, 2005
Available from author's website;
local copy. [Arge (3)]
L. Arge and L. Toma,
Simplified external memory algorithms for planar DAGs.
In Proc. 9th Scandinavian Workshop on Algorithm Theory (SWAT) 2004, LNCS 3111:493-503
Available from publisher;
local copy. [Beckmann]
N. Beckmann and B. Seeger,
A revised R*-tree in comparison with related index structures,
In Proc. 35th ACM SIGMOD Int. Conf. on Management of Data (SIGMOD) 2009, p799-812
Available from publisher;
local copy. [Bender]
M. A. Bender, G. Stølting Brodal, R. Fagerberg, R. Jacob, and E. Vicari,
Optimal sparse matrix dense vector multiplication in the I/O-Model,
Theory of Computing Systems (2010) 47(4):934-962
Available from publisher;
local copy of authors' manuscript. [Van den Bercken]
J. van den Bercken and B. Seeger,
An Evaluation of Generic Bulk Loading Techniques,
In Proc. 27th Int. Conf. on Very Large Data Bases (VLDB 2001), p461-470
Available from publisher;
local copy. [De Berg]
M. de Berg, M. van Kreveld, M. Overmars, and O. Cheong (a.k.a. Schwarzkopf),
Computational geometry: algorithms and applications,
2nd edition, Springer, 2000.
[Demaine]
E. Demaine,
Cache-Oblivious Algorithms and Data Structures,
Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, 2002
Available from the author's website;
local copy. [Duartes]
G. Duartes,
What your computer does while you wait,
blog post, 30 Nov 2008; [Kamel]
I. Kamel and C. Faloutsos,
Hilbert R-tree: An improved R-tree using fractals,
In Proc. of 20th Int. Conf. on Very Large Data Bases (VLDB) 1994, p500–509
Available from publisher;
local copy. [Meyer]
U. Meyer, P. Sanders, and J. Sibeyn (eds.),
Algorithms for Memory Hierarchies,
Lecture Notes in Computer Science 2625, Springer, 2003.
The whole book can be downloaded from the publisher's website.
For now, the following parts are most relevant:
- Chapter 1: models and lower bounds (includes the permutation lower bound)
- Chapter 2: basic external memory data structures (includes stacks, queues, linked lists, B-trees)
- Chapter 3: a survey of techniques for designing I/O-efficient algorithms (includes scanning, sorting, time-forward processing, deterministic computation of maximal independent sets, list ranking, and Euler tours)
- Chapter 4: Elementary graph algorithms in external memory (includes BFS, DFS, and SSSP in directed and undirected graphs)
- Bibliography and index
[Sleator]
D. D. Sleator and R. E. Tarjan,
Amortized efficiency of list update and paging rules,
Communications of the ACM (1985) 28(2):202-208
Available from publisher;
local copy. [Zeh]
N. Zeh,
I/O-Efficient Graph Algorithms,
Lecture notes of EEF Summer School on Massive Data Sets, Aarhus, 2002
Available from author's website;
local copy.
Students coming from the TU/e computer science bachelor programme must have
completed Algorithms (2IL15) successfully.If you got your prior education elsewhere, we will need to check if you have the
proper prior knowledge to be able to complete the I/O-efficient algorithms course
successfully. We will expect that you have experience with the following topics:
incremental algorithms, divide and conquer, greedy algorithms, proofs with induction and
invariants, O-notation and recurrences, sorting, search trees, depth-first and
breadth-first search, minimum spanning trees, and shortest path algorithms.
Experience with randomized algorithms is recommended. To be able to assess whether there are serious deficiencies in prior knowledge,
students are asked to make a prerequisites test.
Here are the right answers to the test,
including recommendations for further reading in case you got them wrong.
Use username "io" and password "io" to access the file.
If you have insufficient prior experience with algorithms, you will first need
to take Algorithms (2IL15), Advanced Algorithms (2IL45), or Geometric Algorithms (2IL55);
to be able to take these courses you might need Data Structures (2IL05) first.
(All of these courses are taught in English.)
Your grade will essentially be determined by adding up grades for:
- your presentation
- your experiments (and your report(s) about them)
- your review of another group's report
- your participation in class (which basically boils down to: the number of questions you asked in class)
- your homework assignment
All of the above assignments (presentation, experiments, review, your presence in class, homework)
are mandatory. If you skip an assignment, your final grade will be a five or less. Here is the small print for this course:
- Attending all meetings is mandatory.
Absence without approval after prior notice: 1 point off your grade each time.
- The student is responsible for handing in assignments in time; the student is
responsible for taking proper precaution against unexpected circumstances such
as illness, broken laptops, network failures, other deadlines etc., by not doing things in the
very last moment. That means that I will not accept your late assignment if you
did not contact me at least three days before the deadline about the delay
and with a reasonable suggestion for a solution. Depending on the consequences
for the course as intended, handing in late may reduce your grade substantially.
- I will report any suspicions of fraud to the exam committee. If I can prove it
(for example: if I find a fragment of more than two sentences of your work in an
older source on the web while you did not indicate that it was a quote), your
final grade will be a 1, independent of any additional measures taken by the department.
|  |