
Data pre-processing for process mining

Hi all,

I am writing my thesis in the area of process mining and would appreciate ideas on how to modify my event log for the software (I plan to use Disco, ProM, and one commercial product as well).

My issue is that my data set has several activities (Activity A, Activity B, Activity C, etc.), and every activity can have several statuses (for example, Activity A can have the statuses 'Active', 'Closed', 'Lost', etc.). In addition, I have a Case ID and a TimeStamp. The software usually asks to map the data to a unique Case ID, an Activity Type, and a TimeStamp. However, for a single Case ID my data looks like this: at the first TimeStamp, Activity A changes to status "Active"; at the second TimeStamp, Activity A changes to status "Active" and Activity B changes to status "20%"; at the third TimeStamp, Activity B changes to status "40%" and Activity C to status "Lost". Activities can change in parallel.

I was thinking of creating an Activity table and a Case table, where the Activity table would store the CaseID, the ActivityType (something like 'Activity A changed' or 'Activity A and B changed'), and a TimeStamp, and the Case table would contain the extension of every CaseID at a certain TimeStamp. But with this setup I got a message about duplicates.

I would be happy to receive advice on how to modify the data set for process analysis.

Best,
Olga

Comments

  • hverbeek Posts: 406
    edited August 26
    Hi Olga,

    Can you provide a hint to which questions you want to have answered for this data set? A lot depends on this.

    You also mention an activity "Activity A and B changed". Do you consider this to be a single event (for both A and B), or rather two events (one for A, one for B)?

    Kind regards,
    Eric.



  • opakhomo Posts: 8
    Hi Eric,

    It should be an analysis of CRM logs, and I would like to find patterns such as the average time per CaseID from opening till closing (this is defined by Activity A), the average number of stages per CaseID, or the average time between stages. For example, it is not good if big deals are created and closed within two weeks.

    You also mention an activity "Activity A and B changed". Do you consider this to be a single event (for both A and B), or rather two events (one for A, one for B)? 

    It is one event in an activity table and several events in a case table. 
  • hverbeek Posts: 406
    Hi Olga,

    I think I would go for the "naive" approach and use names like "Activity A Active" and "Activity B 20%" for events. I would also consider "Activity A and B changed" to be two separate events that happen to occur at the same time.

    I'm a bit in doubt about the "20%", as this could introduce many labels. If need be, replace these percentages with "changed" or something like it.

    Kind regards,
    Eric.
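The two suggestions above (status-qualified event names, and splitting a combined change into one event per activity) can be sketched as follows. This is a minimal illustration only; the input rows, column layout, and all names are invented, not taken from Olga's actual CRM export:

```python
# Invented snapshot-style input: each row records which activities
# changed status for a case at a given timestamp.
rows = [
    # (case_id, timestamp, {activity: new_status})
    ("case-1", "2019-03-01", {"Activity A": "Active"}),
    ("case-1", "2019-03-02", {"Activity A": "Active", "Activity B": "20%"}),
    ("case-1", "2019-03-03", {"Activity B": "40%", "Activity C": "Lost"}),
]

events = []
for case_id, ts, changes in rows:
    # One combined change ("Activity A and B changed") becomes
    # one event per activity, named "Activity + status".
    for activity, status in changes.items():
        events.append((case_id, f"{activity} {status}", ts))
```

Two activities changing at the same timestamp simply yield two rows with identical timestamps, which matches the single-table layout discussed below in the thread.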
  • opakhomo Posts: 8
    Hi Eric,

    So you suggest using only one table with CaseID, ActivityType, and TimeStamp?

    Best,
    Olga



  • hverbeek Posts: 406
    Hi Olga,

    Ideally it should be an event log (in XES format). But if it has to be tables, then it has to be a single table containing at least the three columns you mention. You need this table in CSV format to be able to import it into ProM.

    Kind regards,
    Eric.
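A minimal illustration of such a single three-column CSV table (the column names follow the thread; the values are made up):

```
CaseID,ActivityType,TimeStamp
case-1,Activity A Active,2019-03-01
case-1,Activity A Active,2019-03-02
case-1,Activity B 20%,2019-03-02
case-1,Activity C Lost,2019-03-03
```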
  • opakhomo Posts: 8
    Hi Eric,

    Thank you. Just one more question: is it possible to discover a process with several parallel activities in ProM? Because if I modify the data set like this, I will have several activities for one TimeStamp and CaseID.

    Just to clarify (sorry if this was unclear from the beginning): I do not have a normal event log with a timestamp for every event; I built this log myself. I used daily snapshots of the CRM system with information about all cases, and used the actual date of every daily snapshot as the TimeStamp. That means that if several activities happened on the same day, they all have the same timestamp.

    Best,
    Olga 
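The snapshot-to-event conversion Olga describes can be sketched like this. The snapshot structure and all names are invented for illustration, and this particular variant emits an event only when a status actually differs from the previous day's snapshot:

```python
# Invented daily snapshots: for each snapshot date, the status of
# every (case, activity) pair as seen in the CRM that day.
snapshots = {
    "2019-03-01": {("case-1", "Activity A"): "Active"},
    "2019-03-02": {("case-1", "Activity A"): "Active",
                   ("case-1", "Activity B"): "20%"},
    "2019-03-03": {("case-1", "Activity B"): "40%",
                   ("case-1", "Activity C"): "Lost"},
}

events = []
previous = {}
for day in sorted(snapshots):          # process snapshots in date order
    current = snapshots[day]
    for key, status in current.items():
        # Emit an event only if the status changed since yesterday;
        # the snapshot date serves as the event timestamp.
        if previous.get(key) != status:
            case_id, activity = key
            events.append((case_id, f"{activity} {status}", day))
    previous = current
```

All events derived from one snapshot share that snapshot's date, so parallel changes on the same day end up as multiple rows with one timestamp, exactly the situation Olga asks about.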
  • hverbeek Posts: 406
    Hi Olga,

    Yes, ProM is able to discover parallel activities. How this is done exactly depends on the discovery algorithm.

    As long as every row in the table has a timestamp, you should be fine. Problems may arise when some rows do not have a timestamp, because then the ordering of the events related to such a row is problematic: Where to place it? Rows with an identical timestamp will be treated in the order they appear in the table, so having rows with identical timestamps is not a problem.

    Kind regards,
    Eric.
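The tie-breaking behaviour Eric describes (rows with identical timestamps kept in table order) corresponds to a stable sort. A small sketch with invented rows, illustrating the general idea rather than ProM's internal implementation:

```python
# Invented event rows as they appear in the CSV table.
rows = [
    ("case-1", "Activity A Active", "2019-03-02"),
    ("case-1", "Activity B 20%", "2019-03-02"),
    ("case-1", "Activity A Active", "2019-03-01"),
]

# Python's sorted() is stable: the two rows sharing timestamp
# 2019-03-02 keep the relative order they had in the table.
ordered = sorted(rows, key=lambda row: row[2])
```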
  • opakhomo Posts: 8
    Hi Eric,


    Thanks for the answer. No, I think that I have TimeStamps for all the rows.
    I will try to build my process this week.

    The last question: could I also use a case table with data for every CaseID in ProM?
    I would like to keep in this case table the information about customer names, countries, regions, etc.

    Best,
    Olga