Hello experts on inductive mining,
here's a conundrum about fitness preservation in the Inductive Miner framework based on directly-follows graphs (IM-D), as presented by Sander Leemans in his Ph. D. thesis.
In section 6.1.1 of his thesis, dealing with the basic log-based IM, Leemans presents tHe following log L69:
<b, c, d, e>
<d, b, e>
<e, b>
<c, b>
<b, d, e, c>
To L69 he applies a parallel cut, and splits the log so that the non-atomic part of the cut is represented by log L71:
<c, d, e>
<d, e>
<e>
<c>
<d, e, c>
That log has a directly-follows graph (DFG71) in which activity d, although it does not occur at the end of a trace, is classified as an end node after the split, on account of having an outgoing connection to activity b in the parent graph.
Leemans states that no cut can be found for DFG71, and using IM he applies a fall-through, namely activityConcurrent. That fall-through takes out d and places it concurrently to the other activities. Moreover, when applying a parallel cut to the remaining part of the log containing c and e, empty traces are introduced into the log (to fill up traces to equal length, I guess).
With IM-D, which does not use the log at all, but only the DFG, we can do neither of these things.
- the information for the introduction of empty traces is just not present in the DFG. So even if we could adapt the activityConcurrent fall-through, we'd finally end up with a simple parallelism over c, d, e. That would not preserve fitness.
- as Leemans discusses later, the activityConcurrent fall-through is too expensive to adapt to IM-D. The next more specific fall-through that is part of IM-D is strictTauLoop. This fall-through will remove every edge from an end activity to a start activity. In our case, as every activity is both an end and a start activity, it will remove ALL edges between c, d and e, leaving us with three isolated activities, which in turn will be translated to a process tree with an XOR over these activities. This is not fitness preserving either.
I must be getting something wrong, because I trust Leemans' claim that fall-throughs preserve fitness. His thesis has been reviewed many times. But still I do not see how to apply the rules of the IM-D framework in a fitness preserving way to this example. As fitness preservation is said to be one of the main features of inductive mining, I feel bad about this. Can someone please enlighten me?
Sebastian
Comments
Are there perhaps alternative specifications of IM-D that differ from the thesis?
This explains why there is a sequence from d to e.
So my only remaining question would be: what explains the optionality of d in that sequence?
As I understand the thesis, an empty trace (epsilon) is only added to a sequence cluster if there is an arc skipping that cluster. That would seem to require three clusters (see the example in Figure 6.22). But here we have a sequence consisting of only two clusters.
I guess that the special cases for start and end nodes in the function SequenceDfgSplit (page 262) somehow deal with this, in a way that still eludes me.
I am sorry that knowledge comes so slowly and piecemeal, but Sander's thesis is not bedside reading ;-)
If a later cluster of a sequence contains a start node, then an epsilon (empty trace) is added to all previous clusters.
If a previous cluster of a sequence contains an exit node, then an epsilon is added to all later clusters.
Can someone confirm or correct my reading?
Assistant Processor (Lecturer) at Queensland University of Technology
Author of the visual Miner and Inductive Miner
That is correct: add the empty trace if there is an edge from somewhere before (including start) the cluster to somewhere after (including end) the cluster (last line before return in sequenceDfgSplit).
Assistant Processor (Lecturer) at Queensland University of Technology
Author of the visual Miner and Inductive Miner
thank you for so patiently and thoroughly answering my question. This has been most helpful!
Best Regards,
Sebastian
Assistant Processor (Lecturer) at Queensland University of Technology
Author of the visual Miner and Inductive Miner
L81 = {(a, b, c, d), (d, a, b), (a, d, c), (b, c, d)} and claim that it has no cut. But d is both a start and an end node, a is a start node and c is an end node, and d is bidirectionally connected to a and c. Further b (also a start and end node) is connected to a and c. So there is a parallel cut partitioning the vertices of the DFG into <{d},{a,b,c}>, isn't there? Or am I missing something?
I am trying to adapt L81 so that really no cut applies.
it probably was just a typo. For example, when one eliminates the edge from a to d, there really is no cut. One simply has to delete the a from the third trace.
Cheers,
Sebastian
Update: I did get something wrong. In a parallel cut, all parts should be fully interconnected, i. e. any two nodes a from set_i and b from set_j with i \= j are connected by edges (a,b) and (b, a). This is indeed not the case in the example.
But isn't that overly restrictive? It certainly looks reasonable to discover a parallel cut here, instead of relying on fall-through activityConcurrent, which would be expensive o implement in IM-D.
My idea is the following: We require the DFG to be connected. Then we only look at the edge set of the DFG. For any edge from a to b, if there also is an edge from b to a (the activities are connected in both directions) then the activities go into different components, otherwise their components are merged.
If any component remains that has both a start and an end activity, and none of those are connected bidirectionally to some node of another component that has both a start and end activity, then we have not found a cut. [This clause will distinguish between b being in the same component as a and c in our example, and b being in a component by itself, disconnected from d.]
Then we merge the components that don't contain a start and end event with an arbitrary component that does.
If there is only one component we also have not found a cut. Otherwise we have.
Did you ever think about something like that? Do you think the idea holds water?
Sebastian
Assistant Processor (Lecturer) at Queensland University of Technology
Author of the visual Miner and Inductive Miner
thanks, I see. Indeed my method would only find {a} {b} {c} {d} in your example. I guess there's a trade-off here, but I do understand now why the more restrictive approach would be considered better.
Thanks again for the private tutorial :-)
Sebastian
Assistant Processor (Lecturer) at Queensland University of Technology
Author of the visual Miner and Inductive Miner