Data Sets

In every data set, about half of the traces should be classified positive.

aXfYnZ (a12, a22, a32, a42)

Taken from Maruster, L., Weijters, A.J.M.M., van der Aalst, W.M.P. and van den Bosch, A. (2006) A rule-based approach for process discovery: dealing with noise and imbalance in process logs. Data Min. Knowl. Discov., 13, 67–87.

All MXML logs have been converted into XES logs, and the first classifier (the legacy MXML classifier) has been removed. As a result, the “concept:name” classifier is now the first classifier in every log.

If aXfYnZ.xes is the log to discover, then aXfYn50.xes is the log to classify, as this log is known to contain about 50% positive traces.
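The pairing rule above can be sketched as a small helper. This is only an illustration: the function name `classify_log_for` and the regular expression are assumptions derived from the aXfYnZ naming scheme, and the file name in the usage line is a made-up example of that scheme.

```python
import re

# Assumed pattern for the aXfYnZ scheme: "a<X>f<Y>n<Z>.xes" with numeric X, Y, Z.
_NAME = re.compile(r"^(a\d+f\d+)n\d+\.xes$")


def classify_log_for(discover_log: str) -> str:
    """Return the name of the n50 log that pairs with a discover log."""
    match = _NAME.match(discover_log)
    if match is None:
        raise ValueError(f"not an aXfYnZ log name: {discover_log!r}")
    # Keep the aXfY part and replace the noise level by 50.
    return f"{match.group(1)}n50.xes"


# Illustrative usage with a hypothetical file name.
print(classify_log_for("a12f0n05.xes"))
```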

PDC 2016, 2017, 2019

Taken from the respective PDC sites (2016, 2017, 2019). The training logs are the logs to discover, and the final test logs are the logs to classify.

PDC 2020

A first attempt at a data set for PDC 2020. It contains 192 discover event logs (named pdc_2020_ABCDEFG.xes) with different characteristics, all of which use the same base model:

A: Dependent tasks

Long-term dependencies.

0=No, 1=Yes.

If Yes, then all transitions that bypass the dependent tasks are disabled.

B: Loops

0=No, 1=Simple, 2=Complex.

If No, then all transitions that start a loop are disabled. If Simple, then all transitions that are a shortcut between the loop and the main flow are disabled.

C: OR constructs

0=No, 1=Yes.

If No, then all transitions that only take some inputs for an OR-join and all transitions that generate only some outputs for an OR-split are disabled.

D: Routing constructs

Invisible tasks.

0=No, 1=Yes.

If Yes, then some transitions are made invisible.

E: Optional tasks

0=No, 1=Yes.

If Yes, then some invisible transitions are added to allow skipping of some (visible) transitions.

F: Duplicate tasks

0=No, 1=Yes.

If Yes, then some transitions are relabeled to existing labels.

G: Noise

0=No, 1=Yes.

If Yes, then noise is introduced in approx. 1 out of 5 traces. Noise is introduced by deleting one event (40%), moving one event in the trace (20%), or copying one event in the trace (40%).
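A minimal sketch of such a noise step, assuming a trace is a plain list of activity names. The function names and the uniform choice of positions are assumptions for illustration; the text above only fixes the 1-in-5 rate and the 40/20/40 split, so this is not necessarily how the actual generator works.

```python
import random


def add_noise(trace, rng):
    """Return a copy of the trace with one noise operation applied."""
    noisy = list(trace)
    if not noisy:
        return noisy
    # Delete (40%), move (20%), or copy (40%) one event.
    operation = rng.choices(["delete", "move", "copy"], weights=[40, 20, 40])[0]
    index = rng.randrange(len(noisy))
    if operation == "delete":
        del noisy[index]
    elif operation == "move":
        event = noisy.pop(index)
        # Assumption: the target position is uniform, so the event may
        # occasionally land back where it started.
        noisy.insert(rng.randrange(len(noisy) + 1), event)
    else:
        # Assumption: "copying" inserts a duplicate at a uniform position.
        noisy.insert(rng.randrange(len(noisy) + 1), noisy[index])
    return noisy


def maybe_add_noise(trace, rng, probability=0.2):
    """Apply noise to roughly 1 out of 5 traces."""
    if rng.random() < probability:
        return add_noise(trace, rng)
    return list(trace)


rng = random.Random(2020)
print(maybe_add_noise(["a", "b", "c", "d"], rng))
```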

Every discover event log is generated randomly from the (updated) base model, and contains 1000 traces.
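The naming scheme can be decoded mechanically. The sketch below is an illustration only: the label strings and function names are hypothetical, and D is assumed to be binary like the other yes/no characteristics, which is consistent with the total of 2 x 3 x 2 x 2 x 2 x 2 x 2 = 192 logs.

```python
from itertools import product

# Allowed values per position, in the order A..G used in the file names.
CHARACTERISTICS = [
    ("dependent_tasks", (0, 1)),
    ("loops", (0, 1, 2)),  # 0=No, 1=Simple, 2=Complex
    ("or_constructs", (0, 1)),
    ("routing_constructs", (0, 1)),  # assumed binary
    ("optional_tasks", (0, 1)),
    ("duplicate_tasks", (0, 1)),
    ("noise", (0, 1)),
]


def decode(name):
    """Map a pdc_2020_ABCDEFG.xes file name to its seven settings."""
    prefix, suffix = "pdc_2020_", ".xes"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected file name: {name!r}")
    digits = name[len(prefix):-len(suffix)]
    if len(digits) != len(CHARACTERISTICS):
        raise ValueError(f"expected 7 digits in {name!r}")
    settings = {}
    for digit, (label, allowed) in zip(digits, CHARACTERISTICS):
        if digit not in "012" or int(digit) not in allowed:
            raise ValueError(f"invalid value {digit!r} for {label}")
        settings[label] = int(digit)
    return settings


def all_names():
    """Enumerate every file name the scheme allows."""
    value_sets = [allowed for _, allowed in CHARACTERISTICS]
    return [
        "pdc_2020_" + "".join(str(v) for v in combo) + ".xes"
        for combo in product(*value_sets)
    ]


print(len(all_names()))
print(decode("pdc_2020_1211111.xes"))
```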

Next to these 192 discover event logs, 192 matching classify event logs are generated and classified using the correct models, which yields 192 score logs. These event logs also contain 1000 traces each, of which approx. 1 out of 2 contains noise. Additional noise may be introduced to check whether the discovered model has correctly discovered dependent tasks.

New data set?

A new data set contains three types of logs: discover logs, classify logs, and score logs, such that for every discover log there are corresponding classify and score logs.

The model will be discovered from the discover log. The discovered model will then be used to classify the traces in the classify log as either positive or negative. For this, a boolean “pdc:isPos” attribute must be added to every trace, where TRUE corresponds to positive.
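A minimal sketch of the annotation step using only the Python standard library. It assumes the usual XES serialization, in which a boolean trace attribute is written as a `boolean` element with `key` and `value`, and the value is the lowercase string "true" or "false". The function `annotate` and the `is_positive` callback (which stands in for the discovered model) are hypothetical names.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace such as '{http://...}trace' down to 'trace'."""
    return tag.rsplit("}", 1)[-1]


def annotate(root, is_positive):
    """Add or update pdc:isPos on every trace; return the number of traces."""
    traces = [elem for elem in root.iter() if _local(elem.tag) == "trace"]
    for trace in traces:
        namespace = trace.tag[: len(trace.tag) - len("trace")]
        # Collect the activity names (concept:name) of the events in order.
        activities = [
            attr.get("value")
            for event in trace
            if _local(event.tag) == "event"
            for attr in event
            if _local(attr.tag) == "string" and attr.get("key") == "concept:name"
        ]
        verdict = "true" if is_positive(activities) else "false"
        existing = next(
            (c for c in trace
             if _local(c.tag) == "boolean" and c.get("key") == "pdc:isPos"),
            None,
        )
        if existing is None:
            existing = ET.Element(namespace + "boolean", {"key": "pdc:isPos"})
            # Trace attributes are placed before the events.
            trace.insert(0, existing)
        existing.set("value", verdict)
    return len(traces)


# Illustrative usage with a toy log and a toy "model".
log = ET.fromstring(
    '<log><trace><event><string key="concept:name" value="a"/></event>'
    '</trace></log>'
)
annotate(log, lambda activities: activities == ["a"])
print(ET.tostring(log, encoding="unicode"))
```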

The classified log is compared against the score log, which should contain the ground-truth classification. The traces in the score log should appear in the same order as the traces in the classify log.
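The comparison can be sketched as follows, relying on the fact that both logs list their traces in the same order. The confusion counts are one plausible way to summarize the result and are an assumption here, not an official scoring formula; `read_labels` and `compare` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip an XML namespace such as '{http://...}trace' down to 'trace'."""
    return tag.rsplit("}", 1)[-1]


def read_labels(root):
    """Return the pdc:isPos value of every trace, in document order."""
    labels = []
    for trace in root.iter():
        if _local(trace.tag) != "trace":
            continue
        label = None
        for child in trace:
            if _local(child.tag) == "boolean" and child.get("key") == "pdc:isPos":
                label = child.get("value", "").lower() == "true"
        if label is None:
            raise ValueError("trace without a pdc:isPos attribute")
        labels.append(label)
    return labels


def compare(classified, truth):
    """Count true/false positives/negatives for two aligned label lists."""
    if len(classified) != len(truth):
        raise ValueError("logs contain a different number of traces")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for predicted, actual in zip(classified, truth):
        key = ("t" if predicted == actual else "f") + ("p" if predicted else "n")
        counts[key] += 1
    return counts


# Illustrative usage with plain label lists instead of real logs.
print(compare([True, True, False, False], [True, False, False, True]))
```

For real files, the label lists would come from `read_labels(ET.parse(path).getroot())` applied to the classified log and to the score log.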