[MERGED] Log Operator - CPU Time, CPU Execution Time, Time, Execution Time

bugrasitemkarbugrasitemkar Member Posts: 4 Contributor I
edited November 2018 in Help
I am new to Rapidminer and already read posts about measuring modeling time. Only information I would be able to find is
cpu execution time and cpu time are measured in nanoseconds (this post is the one with the most information http://rapid-i.com/rapidforum/index.php/topic,5688.0.html)

I would like to measure modeling time for a decision tree classifier  Id3,

Set up log operator to log Id3 operator, and choose fields as cpu execution time, cpu time, execution time, time and apply count.

Posting an example result of log operator for id3 training with grid parameter optimization (that is where 2200 applications of models came from)

cpu execution time  |  cpu time    |    execution time  |  time  |  apply count
15625000                |  15625000 |    18                    |  19    |  2200

and results overview time is 1:09


My questions are:

+  Does cpu-execution time and cpu time measures only a single application of model (presumably last application) ? Because I get nearly same results for cpu execution time and cpu time when I run same model only for once.

+ cpu execution time and cpu time parts are sometimes equal and sometimes not, what may cause this change ? What is the difference between cpu execution time and cpu time ?

+ what is execution time and time stands for ? what is their units ?

Finally, is there any suggestion to measure modeling time (without I/o if possible) ? What is the correct practice of this task ?

Thanks for any help you will be able to provide.


Answers

  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    I will try to answer your questions.

    1) Yes, only the last iteration is taken there. You would need to add a log operator inside the Parameter Optimization to collect times for all applications.
    2) The difference I can see from looking at the code is that cpu-execution-time is the time from when the operator execution started to when it finished. This can include times when the cpu was not actually used for this operator because the OS scheduler took it away or it was waiting for IO. cpu-time on the other hand looks like a botched implementation to me - it basically logs the time from when the operator in question started to the time the "Log" operator queried. Very weird and I don't know if that is ever going to be useful. Both are presented in nanoseconds. Note that during my testing neither were reliable - both make use of a Java API that seems to behave inconsistent. I got 0 for both start and endtime multiple times as well as identical start and endtimes which is very weird considering the Java API claims this is nanosecond precise. Use with care!
    3) execution-time is basically your system clock when the operator finished - your system clock when the operator started. So the time elapsed in milliseconds according to your system clock. time is very similiar, the only difference I can see is that it depends on the time of the call because it uses the time of the call - the starting time of the operator. So potentially this is a bit more than execution-time. Both are in milliseconds. They seem to be more reliable than the previously mentioned onces, the Java API they use seems to work rather well.
    4) I guess actual cpu time used is best measured by execution-time. The other functions are too unreliable or depend on the time of the Log operator asking for the values. This is from a developer's point of view, so a consultant's point of view might differ there ;)


    Edit: modified with new findings

    Regards,
    Marco
  • Flos_Flos_ Member Posts: 7 Contributor II
    Hey everyone,

    I have a question concerning the Log Operator and especially the values CPU Time, CPU Execution Time, Time and Execution Time.
    There are already a few posts on this topic. Those with most information are probably

    [1] http://rapid-i.com/rapidforum/index.php/topic,8081.0.html

    and the one linked in it. What I can (probably) say for sure are the time units used:

    1) CPU Time (ns)
    2) CPU Execution Time (ns)
    3) Time (ms)
    4) Execution Time (ms)

    What I am missing is a exact definition of what exactly the values mentioned above express. I would be very pleased if someone could give these definitions or point me to some documentation where they are explained (as I could not find some). It would also be very helpful to know the name of the class implementing this functionality.



    Why am I actually asking this given there is already some explanation in the link above?

    It does not really meet my observations. I am "measuring" the execution time of a "Weight by Chi Squared Statistic" Operator on a machine with 8 virtual cores (4 Cores with HT) on some data set. Given the explanation from the linked topic "CPU Execution Time" should be longer than "CPU Time" which is not the case in the results you can see below. Second the "Time" and "Execution Time" should only differ a little according to the explanation in [1] which is also not the case.

    WeightByChiSquareCpuTime   8.2609375E10
    WeightByChiSquareCpuExecTime   3.734375E9
    WeightByChiSquareTime  85158.0
    WeightByChiSquareExecTime  4238.0

    I am also measuring the different time values for a "Optimize Selection (Weight Guided)" Operator.

    OptimizeSelectionCpuTime   7.8875E10
    OptimizeSelectionCpuExecTime   7.8859375E10
    OptimizeSelectionTime   80919.0
    OptimizeSelectionExecTime   80902.0

    These results better fit the explanation in [1], but still OptimizeSelectionCpuTime   >   OptimizeSelectionCpuExecTime.

    There is maybe one other observation that might be important for the explanation. The "Weight by Chi Squared Statistic" Operator seems to have a higher CPU load (more than one core is used) then the  "Optimize Selection (Weight Guided)" Operator (probably one one core used). I guess the big differences in the time values for the "Weight by Chi Squared Statistic" are somehow related to the fact the CPU Time used for a process executed on x cores without interruption for y time units should also be x times higher the CPU time executed on one core without interruption for y time units. Unfortunately the factor between WeightByChiSquareCpuTime and WeightByChiSquareCpuExecTime is approximately 22.

    In case it is important, I am using the version 5.3.015.

    I hope that someone can help me with my problem!

    Kind regards
  • bugrasitemkarbugrasitemkar Member Posts: 4 Contributor I
    Dear Marco,

    Could you be wrong about cpu-execution-time and cpu-time? I see like it should be other way around.

    Because in my experiments cpu-execution-time is always smaller than cpu-time indicating cpu-execution time tries to measure actual cpu time (isolated from other operations)

    And another question I want to ask is cpu-execution time is often 0.0 I know the log operator is not nanoseconds precise but I dont know how to interpret 0.0 values.

    Thanks for all help you will be able to provide

  • bugrasitemkarbugrasitemkar Member Posts: 4 Contributor I
    Dear Flos!,

    I also observed  as you said, cpu-execution-time  <= cpu-time indicating cpu-execution-time tries to measure isolated cpu time.
    I have posted same question under my post ( the link you provide above), hope someone helps  (Marco maybe :)).

    I have another question for you, do you encounter 0.0 values for cpu-execution-time?
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    I had a second look at the code and I'm a bit confused now myself ::)
    I think I have to retract my previous statement though, cpu-execution-time seems like the way to go. The other one actually seems to make no sense at all - the endtime is gathered once the log operator asks for it. Weird.

    Regarding your second question: The code makes use of Javas "getThreadCpuTime()" function. I just played around a bit and it seems to sometimes simply return 0 for whatever unknown reason. More often than not, start and endtime (which are queried once the Operator starts and once it finishes) were identical, despite the JavaDoc of said function to claim measurement was taken in nanoseconds. I'm not familiar with this particular piece of the Java API, but it looks a bit buggy to me. Because of that, it might even make sense to use "execution time" which is only measued in milliseconds but makes use of a seemingly more reliable Java API.

    Regards,
    Marco
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    as I have justed posted in the linked thread, I had a second look and I'm left a bit confused because some of the functions seem weird. Anyway, this is what I found:

    CPU Time:
    measued via the Java method getThreadCpuTime() which does not work on older machines. Claims to be nanosecond precise but during my testing it often returned 0 for both start and end or identical values which is weird for a nanosecond precise method. Furthermore endtime is the time when the Log operator asks. Opinion: Unreliable -> avoid.

    CPU Execution Time:
    measued via the Java method getThreadCpuTime() which does not work on older machines. Claims to be nanosecond precise but during my testing it also often returned 0 for both start and end or identical values which is weird for a nanosecond precise method. Endtime is actual timestamp when the operator is finished so this is more useful then CPU Time. Opinion: If it works very useful, however it is unreliable -> use with care.

    Execution Time:
    measued via Java method currentTimeMillis() which is very reliable but only millisecond precise. Stores start and endtime directly when the operator finishes. Opinion: Most reliable -> the way to go when measuring operator execution time.

    Time:
    measued via Java method currentTimeMillis() which is very reliable but only millisecond precise.  Endtime is taken when the Log operator asks for the value. Seems less useful than Execution Time. Opinion: Reliable but takes endtime only when the Log operator asks, not when the operator in question actually finished -> might come in handy in special circumstances.


    Regards,
    Marco
  • Flos_Flos_ Member Posts: 7 Contributor II
    Hi Marco, thanks for your reply and sorry I actually opened a new topic, but this one looked terminated to me.

    I was trying to find the class(es) implementing the Log Operator but I didn't. Can you maybe tell me the name?
  • Flos_Flos_ Member Posts: 7 Contributor II
    Thank you for the explanations!

    This explains why I am observing larger time values to the "CPU Time" and "Time" value. I am not querying just behind the operators of which I am measuring the execution time.

    But I do not really understand why "CPU Time" and "Time" should use the query time of the Log Operator as end time. This seems to be weird and a little unintuitive as you configure the operator of which you want to log values of, so you do not expect values to be taken from another "scope".

    It would be good if you add a description to the operator documentation :-).
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    whenever you want to find a class implementing an Operator, have a look at the OperatorsCore.xml. This file contains a mapping between the operator key (used as the name in RapidMiner Studio where underscores are simply replaced with whitespaces) and the implementation class.

    com.rapidminer.operator.visualization.ProcessLogOperator - The log operator
    com.rapidminer.operator.Operator - The parent class for all operators. See line 300++ for the log values.

    Regards,
    Marco
  • Flos_Flos_ Member Posts: 7 Contributor II
    Thanks!
  • bugrasitemkarbugrasitemkar Member Posts: 4 Contributor I
    Dear Marco,

    Thank you for you quick reply, you are a lifesaver and indeed a hero member :)

    I'll go on experimenting with execution time and will posts how it went.

    Regards
  • Flos_Flos_ Member Posts: 7 Contributor II
    Thank you Marco for telling me to look at the Operator Class. I was also studying the javadoc concerning currentTimeMillis() and getThreadCpuTime().

    As I already said in my previous post some documentation should be added to the Operator Help in the GUI. There are at least short sentences describing the different values in the source.

    After looking at the Operator Class source I come to the same conclusion as Marco concerning the meaning of the values. But I do not think that the implementation of getThreadCpuTime() is buggy but has (from a nanosecond precision point of view) an unexpected low resolution. And this resolution is probably dependent on the JVM and system in use. For me when testing with a simple Java Program the lowest value I was observing for the Cpu Time was always several tens of milliseconds or even more than 100 ms when directly measuring after the entrance of the Main Function. After performing some simple operations in a loop I still got the same value (so 0 ns difference which is as Marco said unlikely). Only after increasing the loop count I got higher values for the second measurement. Still it was only possible to observe some particular values and nothing in between two neighboring values.
    And considering the java doc for System.currentTimeMillis we might have a similar problem: "the granularity of the value depends on the underlying operating system and may be larger. For example, many operating systems measure time in units of tens of milliseconds."  (http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis()).

    I cannot tell which one is better for measuring short time periods, but in my opinion if your system supports getThreadCpuTime() and the execution time is not too short the CPU Execution Time should be used as it express how long the queried operator was using the CPU, which should exclude those times the Process / Thread got dispatched. This is definitely not the case for the "Execution Time" this will also take into acount the time another process was using the CPU. When you are comparing the execution time of two or more operators this might be misleading.

    When using CPU Time and/or CPU Execution Time one should be very careful which value is actually returned. When the system does not support to get the thread cpu time the "Execution Time" is returned instead of the "CPU Execution" time and "Time" is returned instead of "CPU Time". Rapidminer does not tell you that it was using a fallback but is doing this silently. So be careful.

    @Admins: I think it would be good if you could merge this topic and the topic by bugrasitemkar. So if anyone else is googling for it she or he will not have to reads through 2 topics.

    Kind regards
  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    I have merged both topis. The documentation could indeed be improved here. It's on our internal issue list :)

    Regards,
    Marco
Sign In or Register to comment.