To prevent spam users, you can only post on this forum after registration, which is by invitation. If you want to post on the forum, please send me a mail (h DOT m DOT w DOT verbeek AT tue DOT nl) and I'll send you an invitation in return for an account.

Clustering traces based on edit distance

nirmalpatelnirmalpatel Posts: 3
edited April 2017 in - Usage
Hi everyone,

Is there a ProM plugin that can cluster traces based on edit distance (also called Levenshtein distance)? I see that there are some papers [1] mentioning the approach, but I'm unable to find a plugin for the same.

Please tell me if you know any other plugin for clustering that is uses similar approach, that will also work.

I'm working with data that come from place where there is no underlying normative process model. Basically, its kids using educational software, and so things differ from teacher to teacher, and also from student to student so event log has quite a lot of variety. And so process models I get are always complicated. I'm hoping that clustering traces beforehand might help.

Thanks,
Nirmal

[1] Context Aware Trace Clustering: Towards Improving Process Mining Results

Post edited by nirmalpatel on

Comments

  • Hi Folks,

    Right now I've done a little arrangement to do clustering outside ProM. I'm trying out hierarchical clustering to cluster traces. I use R, and R's hclust() function that does hierarchical clustering requires user to provide a distance matrix that contains pairwise distances between observations. So I just plug in edit distance matrix of all traces and I am getting very interesting results. I haven't optimized code, so it takes a bit of time, but results are looking good to me.

    One problem with using centroid based algorithm (like K-Means) for trace clustering is deciding up on centroid, which is a bit difficult. Earlier I tried to make centroid as trace of length that occurs most in the cluster and events in trace were most common events in each position. But I think hierarchical clustering seems more sound to use, without any kind of tricks involved.

    - Nirmal
  • CongLiuCongLiu Posts: 84
    Hi Nirmal,

    Sorry, I do not know the answer. I am contacting one of my colleague to help you. Please be patient.

    Cong
  • bhompesbhompes Posts: 10
    Dear Nirmal,

    If I understand correctly, your question is twofold: you want to provide a certain similarity/distance measure yourself (e.g. Levenshtein distance), and would like to obtain the number of clusters automatically (i.e. not having to select the k for k-Means).

    At the moment, the easiest thing to do is to run the plug-in called "Cluster traces using Markov clustering". The default implementation uses the cosine similarity between profile vectors created from the traces using a selected set of 'perspectives'. The perspectives are the properties used to cluster upon, such as attribute values or control-flow constructs. Then, the Markov cluster algorithm (MCL) is used to cluster the traces. MCL is a clustering algorithm that does not take the number of clusters as input.

    The MCL clustering technique uses two parameters, expansion and inflation, which both affect clustering granularity. When a low level (fine-grained) clustering is of interest, expansion can be decreased and inflation increased, and vice-versa for when only high-level clustering is required.

    The technique used in the plug-in is described in our paper titled "Discovering Deviating Cases and Process Variants Using Trace Clustering".

    There's also some other topics on this forum talking about this technique here, here, and here.

    Would you prefer to use another similarity measure, you'd need to run the plug-in (and ProM) from code and implement the similarity measure. The code is prepared for this (interfaced).

    Let me know whether this plug-in suits your needs and if you need more help with the custom implementation.

    For the record, the TraceClustering package also contains the code for the plug-in called "Cluster traces over time using Markov clustering", which takes into account the changes over time in the similarity matrix, in order to let the user detect concept drift (see our paper titled "Detecting Changes in Process Behavior Using Comparative Case Clustering").

    Kind regards,

    Bart
    Bart Hompes - Eindhoven University of Technology
  • Thanks a lot Bart! I read first 4 sections of your paper, and changed parameters of clustering algorithm (I decreased expansion to 1 for start) and I'm seeing all the different variants now!

    I'll try and make more use of the MCL plugin, and see how results turn out. Hopefully, I'll be able to post some updates here later.

    - Nirmal


Sign In or Register to comment.