
DateTime format in XES and MXML event logs

JBuijs Posts: 912
edited December 2010 in XESame
Marco Montali e-mailed me today with the following question, which, I agree, might be of interest for the XES community.

The problem is twofold, and I think that both issues are of interest for the XES community.

The first issue is related to parsing logs containing timestamps in a format other than something like 2005-10-24T11:57:31.000+01:00.
In particular, both the XesXmlParser and the XMxmlParser are bound to this format, and I didn't find an easy way to set up a different one.
Looking into the APIs of OpenXES, I have found that both parsers rely on XsDateTimeConversion to parse timestamps. However, the XsDateTimeConversion object used for parsing is a protected member of the parsers, and therefore cannot be customized from outside the class hierarchy.
Indeed, XsDateTimeConversion understands only the format I have shown before (you can find this information at http://code.deckfour.org/xes/doc/org/deckfour/xes/util/XsDateTimeConversion.html).

In order to make it possible to customize the date format, I have therefore implemented an ugly patch: I subclass the parser and, inside the constructor, assign a subclass of XsDateTimeConversion that takes a date format as a parameter. You can find the implementation below.

However, there is still another issue, related to localization: by default, the SimpleDateFormat class is instantiated with the Locale of the running application. In the general case, though, the Locale of the user who generated the log can differ from the Locale of the user who is importing and analyzing it. If the two Locales differ, then setting up the right date format is not sufficient when timestamps contain not only numbers, but also textual information.

For example, Fabrizio was trying to parse a log where the months in the timestamps are represented with three letters ("Sep" for September). The parser didn't work, because the format was not recognized. Hence I implemented the patch described above, making it possible to pass a customized format to the parser (EEE MMM d HH:mm:ss z yyyy). But the parser still didn't work, because Fabrizio's Locale is "IT", and therefore the SimpleDateFormat didn't recognize "Sep" as a valid month (it expected "Set", because the Italian for "September" is "settembre").

Once Fabrizio recognized this problem, it was easy to fix: it suffices to pass the right Locale (Locale.ENGLISH) as a second parameter when constructing the SimpleDateFormat.
However, this means that the user who is importing the log MUST KNOW the Locale of the user who generated it. It would definitely be better to extract this information from the log automatically.
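
The locale effect described above can be reproduced with the JDK alone. The following is a minimal sketch, not the patch mentioned in the message: the class name LocaleAwareParsing, its helper methods, and the sample timestamp are made up for illustration, and only the pattern is taken from the text. It assumes a standard JDK that ships Italian locale data.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper class, for illustration only (not part of OpenXES).
public class LocaleAwareParsing {
    // Pattern quoted in the message above.
    public static final String PATTERN = "EEE MMM d HH:mm:ss z yyyy";

    /** Parses the text with an explicit locale instead of the JVM default. */
    public static Date parse(String text, Locale locale) throws ParseException {
        return new SimpleDateFormat(PATTERN, locale).parse(text);
    }

    /** Returns true if the text can be parsed under the given locale. */
    public static boolean parses(String text, Locale locale) {
        try {
            parse(text, locale);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Made-up sample value with English day and month abbreviations.
        String stamp = "Mon Sep 13 11:57:31 GMT 2010";
        System.out.println("ENGLISH: " + parses(stamp, Locale.ENGLISH));
        System.out.println("ITALIAN: " + parses(stamp, Locale.ITALIAN));
    }
}
```

With the English locale the sample value should parse, while the Italian locale is expected to reject the English day and month names. The point of the sketch is that the locale has to be supplied by whoever knows how the log was written; it cannot be inferred from the pattern.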

Thanking you in advance for the attention,

Cheers

Marco
Joos Buijs

Senior Data Scientist and process mining expert at APG (Dutch pension fund executor).
Previously Assistant Professor in Process Mining at Eindhoven University of Technology

Comments

  • JBuijs Posts: 912
    To which I replied:

    Dear Marco,

    I think that there is a single answer to your issues (if I understood them correctly).
    In XES (or MXML) timestamps should always be stored in a format like 2005-10-24T11:57:31.000+01:00, as you mentioned.
    It is the responsibility of the event log creator to do so.
    See the XES standard proposal, Section 1.2.2, page 4.
    By sticking to this standard for recording dates and times you don't encounter any of the problems you mentioned, since the format is the same in every event log.

    I hope this answers your questions.

    Kind regards,

    Joos
  • JBuijs Posts: 912
    To which Christian Gunther replied:

    Dear all,

    The short answer is: Joos is absolutely right :-)

    Both MXML and XES specify that the timestamp (MXML) and the date attribute type (XES) must be given as an xs:dateTime XML data type. This means there is no choice in how to represent a timestamp in your MXML or XES logs: you need to (re-)format the value in that standardized format. You can find the official definition of the xs:dateTime format here:

    http://www.w3.org/TR/xmlschema-2/#dateTime

    There are good reasons to limit the XES standard to one representation for each attribute type, and ease of parser implementation is just one, if not the most important one. But in my opinion, most importantly with the timestamp type, it is a necessity.

    In Nitro (http://fluxicon.com/nitro), we support a number of typical and not-so-typical timestamp formats and do our best to automatically detect them, and I can assure you that you don't want to go there. The number of ways people have come up with to represent a date and time in a string are legion, and oftentimes you cannot easily discern from the serialization itself what format has been used. So yes, automatically identifying a timestamp pattern is a great idea, but it just does not work reliably (at least without user interaction, and that is essential for a standard and library IMHO). (Not that I want to discourage you, though: If you find a method to reliably and non-interactively parse a timestamp in any given format, for any possible localization, I would be very interested to learn about how you did it).

    Best,
    Christian
  • JBuijs Posts: 912
    Marco then closed the conversation with this message:

    Dear all,
    thank you very much for the prompt answers...
    When I encountered this problem, I forgot that standards have two options: either capture all the possible cases, or fix one particular choice (or a restricted set of choices).

    The second approach, which is the one you have followed for timestamps, is definitely more feasible and maintainable, also because (putting it in logical terms) dealing with people requires following an open-world assumption, and therefore the possible cases usually tend to infinity: "the number of ways people have come up with to represent a date and time in a string are legion", as Christian said :-)!

    Cheers

    Marco
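
The advice in the replies above (have the log creator rewrite every timestamp as xs:dateTime before it ends up in the log) could be sketched with java.time as follows. This is an illustration under stated assumptions, not OpenXES code: the class XsDateTimeNormalizer and its methods are hypothetical, the source pattern and English locale are the ones from the first message, and the sample values are made up or copied from the thread.

```java
import java.time.OffsetDateTime;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class, for illustration only (not part of OpenXES).
public class XsDateTimeNormalizer {
    // Assumption: the log creator knows the source pattern and its locale.
    public static final DateTimeFormatter SOURCE =
        DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss z yyyy", Locale.ENGLISH);

    // xs:dateTime layout with milliseconds and a numeric UTC offset,
    // matching the example 2005-10-24T11:57:31.000+01:00 from the thread.
    public static final DateTimeFormatter XS_DATE_TIME =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSxxx", Locale.ROOT);

    /** Converts a timestamp in the source pattern to an xs:dateTime string. */
    public static String toXsDateTime(String raw) {
        return ZonedDateTime.parse(raw, SOURCE)
                            .toOffsetDateTime()
                            .format(XS_DATE_TIME);
    }

    /** Parses an xs:dateTime string with an offset and writes it back out. */
    public static String roundTrip(String xsDateTime) {
        return OffsetDateTime.parse(xsDateTime).format(XS_DATE_TIME);
    }

    public static void main(String[] args) {
        // Made-up source value in the non-standard format.
        System.out.println(toXsDateTime("Mon Sep 13 11:57:31 GMT 2010"));
        // Example value quoted in the thread.
        System.out.println(roundTrip("2005-10-24T11:57:31.000+01:00"));
    }
}
```

Doing the conversion once, where the original pattern and locale are known, means the importing side only ever has to handle the single standardized format.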