-clogit- running *grossly* slower under v.13 MP2 vs. v. 12 IC

Mike Lacy

Join Date: Apr 2014

Posts: 2413
#1


11 Jun 2014, 18:23

Greetings,

Per the subject line, I have just found -clogit- runs about 30-40 times slower under Stata 13 MP2 than on another (somewhat inferior) machine running Stata 12 IC. Some example code is below, which took 10 sec under Stata 13 MP2, and 0.25 sec on the machine running Stata 12 IC. (Another example I compared had a 229 sec vs. 7 sec differential.) Both of these Stata copies are running on Windows 7, with updated 64 bit versions of Stata.

I understand that Stata MP has overhead costs vs. a single-processor version, but this difference seems clearly excessive. I tried -set processors 1- and it made no difference. So, I'm presuming the difference is v. 12 vs. v. 13, or else something very strange happened in the parallelization of the -clogit- code.

Can any of you out there verify an observation like this? And, can anyone explain it, presuming it's correct?

Regards, Mike

Code:

// Test code to run -clogit- for timing
clear
// Create some data appropriate to a clogit
set obs 200
set seed 8575
gen int id = _n
expand 2
sort id
gen byte tx = mod(_n, 2) ==0
gen fixed = 2 * runiform() if !tx
replace fixed = fixed[_n -1] if tx
gen x1 = runiform()
gen x2 = runiform()
gen ystar = fixed + 3 * x1 + 4 * x2 + rnormal(0,1)
quiet summ ystar, meanonly
gen byte y = (ystar > r(mean))
//
// Run some estimations. 5 reps was enough for me.
local reps = 5
timer clear 1
timer on 1
forval i = 1/5 {
    quiet clogit y x1 x2, group(id)
}
timer off 1
timer list 1

Richard Williams

Join Date: Apr 2014
Posts: 4983

16 Jun 2014, 15:50

I just have plain old Stata/SE (Win 7 64 bit). But if it is helpful, here are the results I get with Stata 13.1:

Code:

. // Test code to run -clogit- for timing
. clear

. // Create some data appropriate to a clogit
. set obs 200
obs was 0, now 200

. set seed 8575

. gen int id = _n

. expand 2
(200 observations created)

. sort id

. gen byte tx = mod(_n, 2) ==0

. gen fixed = 2 * runiform() if !tx
(200 missing values generated)

. replace fixed = fixed[_n -1] if tx
(200 real changes made)

. gen x1 = runiform()

. gen x2 = runiform()

. gen ystar = fixed + 3 * x1 + 4 * x2 + rnormal(0,1)

. quiet summ ystar, meanonly

. gen byte y = (ystar > r(mean))

. //
. // Run some estimations. 5 reps was enough for me.
. local reps = 5

. timer clear 1

. timer on 1

. forval i = 1/5 {
  2.     quiet clogit y x1 x2, group(id)
  3. }

. timer off 1

. timer list 1
   1:      0.85 /        1 =       0.8520

Here is the output from 12.1:

Code:

. // Test code to run -clogit- for timing
. clear

. // Create some data appropriate to a clogit
. set obs 200
obs was 0, now 200

. set seed 8575

. gen int id = _n

. expand 2
(200 observations created)

. sort id

. gen byte tx = mod(_n, 2) ==0

. gen fixed = 2 * runiform() if !tx
(200 missing values generated)

. replace fixed = fixed[_n -1] if tx
(200 real changes made)

. gen x1 = runiform()

. gen x2 = runiform()

. gen ystar = fixed + 3 * x1 + 4 * x2 + rnormal(0,1)

. quiet summ ystar, meanonly

. gen byte y = (ystar > r(mean))

. //
. // Run some estimations. 5 reps was enough for me.
. local reps = 5

. timer clear 1

. timer on 1

. forval i = 1/5 {
  2.     quiet clogit y x1 x2, group(id)
  3. }

. timer off 1

. timer list 1
   1:      2.20 /        1 =       2.1980

Looks to me like Stata 13.1 did better.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam


Mike Lacy

Join Date: Apr 2014

Posts: 2413
#3

17 Jun 2014, 08:56

Thanks for checking this.

Aha, it does seem like something is funny in the implementation of the code for the 2-processor version. Your results show that for 64-bit versions of SE, v. 13 is somewhat faster than v. 12. I'm finding that for 64-bit versions, v. 12 IC is *much* faster than v. 13 MP2. I'm relatively ignorant about the time costs of parallel code, but a factor of 30, as I found, seems odd. So, this seems to point to MP2 as the source of the problem, and it appears bad enough to me to suggest a real defect, not just an unavoidable overhead cost of parallel processing.

Regards, Mike
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#4

18 Jun 2014, 11:02

I ran Mike's code on a 64-bit Windows 7 machine. The results are:

Stata/MP2 (64-bit) 13.1: 0.53 seconds
Stata/MP2 (64-bit) 13.1 with -set processors 1-: 0.33 seconds
Stata/IC (64-bit) 13.1: 0.55 seconds

Stata/MP2 (64-bit) 12.1: 0.45 seconds
Stata/MP2 (64-bit) 12.1 with -set processors 1-: 0.25 seconds
Stata/IC (64-bit) 12.1: 0.31 seconds

I ran both Stata 13.1 and Stata 12.1, and I ran the MP2 and IC flavors of each.

My Stata/IC 12.1's timing is close to the 0.25 seconds Mike got on his "somewhat inferior machine", but my Stata/MP2 13.1's timing is far better than the 10 seconds Mike got on his "better" machine.

A machine with better hardware can often perform worse than an inferior machine. For example, the better machine may be a server with many users running concurrent jobs, while the inferior machine has only one user running one program. Hence it is important to run the programs on the same machine with a similar workload when benchmarking program performance.

It would be interesting if Mike could run Stata/IC 12.1 and Stata/IC 13.1 on his better machine and see what those times look like. And, it would also be interesting if Mike could run Stata/MP2 13.1 on his "inferior" machine and obtain timings.

--Hua
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#5

18 Jun 2014, 13:00

I do not see a difference in performance between 1 CPU and 2 CPUs in 64-bit Stata/MP 13.0 on an AMD CPU. I have modified the code to start and stop the timer inside the loop, which shouldn't matter, but is closer to how the timer was designed to be used.

Here is the timing for 5 repetitions as suggested.

1: 0.25 / 5 = 0.0502
2: 0.25 / 5 = 0.0496

Here is the timing for 500 repetitions:

1: 22.05 / 500 = 0.0441
2: 22.54 / 500 = 0.0451

Here is the code (see the attached benchmark.do):

When running Mike's original code, I do not see any notable difference between -set processors 1- and -set processors 2-. So I was surprised to see the differences that Hua has posted above.
I was also surprised not to see the 1.9-fold increase in -clogit- speed given in the Stata/MP report. I attribute this to the small sample size in the example, an issue that was already discussed about a year ago.

By increasing the data size to 20,000 from original 200 I get much more reasonable output (for 50 repetitions):

1: 54.78 / 50 = 1.0957
2: 41.21 / 50 = 0.8242

I've noticed that the first time Stata is started, the timing was quite different [larger] (think of accessing the ado-files for the first time...). After that, the timings were all identical and also small for all further sessions.
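For readers without access to the attachment, the comparison described in this post can be sketched roughly as follows. This is not the attached benchmark.do, only a minimal reconstruction under assumptions: it needs Stata/MP, it expects the test data from post #1 (y, x1, x2, id) to be in memory, and the macro names `reps` and `old_procs` are made up for illustration.

```stata
// Sketch: per-fit -clogit- timing with 1 vs. 2 processors (Stata/MP only).
// Assumes the test data from post #1 are already in memory.
local reps = 5                      // illustrative number of repetitions
local old_procs = c(processors)     // remember the current setting
timer clear
forvalues p = 1/2 {
    set processors `p'
    forvalues i = 1/`reps' {
        timer on `p'                // timer 1 = one processor, timer 2 = two
        quietly clogit y x1 x2, group(id)
        timer off `p'
    }
}
timer list                          // total time / number of fits = mean per fit
set processors `old_procs'          // restore the original setting
```

Because the timer is switched on and off inside the loop, -timer list- reports the accumulated time, the number of fits, and their ratio, which is the layout of the timings quoted in this post.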

Hope this helps, Sergiy
Attached Files

benchmark.do (771 Bytes, 1 view)
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#6

20 Jun 2014, 13:58

Sergiy, thanks for your efforts and observations here.

After communicating with Stata Tech Support (and learning there that as an MP licensee, I can also install SE and IC), I installed both of those (v. 13.1) on my machine, so that I can now provide timings on the same machine, all with Stata 13, for each of MP2, SE, and IC. I am still getting *much* better results with IC than with either MP2 or SE, even at large sample sizes. On my machine (Windows 7, Intel Xeon X3430 processor, 8 GB memory), Stata IC is about 30 times faster on smaller problems than either MP2 or SE, which perform comparably. On a larger problem, IC is only about 3X faster. Under any Stata version, run time scales linearly with the number of repetitions. (So much for hoping that multiple processors might help with a simulation.)

Like Sergiy, I found that changing -set processors- made no difference.
I would conclude here that, at least for this procedure, only one processor is being used on my machine, perhaps due to some adverse interaction among the underlying code, the processor, the cache, etc. Perhaps the second processor is never being used? However, this would not account for why SE performs much worse here than IC on my machine. And why would the SE/IC differential decrease for larger problems, since they both are using a single processor?

Anyway, it's looking to me like there is something funny in the relations between my machine and the code in MP2. This is something I'd like to fix. Any diagnostic ideas out there?

Regards, Mike

Code:

Timings in sec., clogit problem, all on Stata 13

                   N = 200    N = 20,000
5 reps     MP2      10.1        13.7
           SE       10.2        15.4
           IC        0.3         5.5
50 reps    MP2     102         137
           SE      102         153
           IC        2.6        55
Richard Williams

Join Date: Apr 2014

Posts: 4983
#7

20 Jun 2014, 14:19

You might try a different sort of program. Maybe there is something wildly quirky about clogit.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#8

20 Jun 2014, 15:28

Mike,

Would you please type -creturn list- in your Stata/MP 13 or Stata/SE 13 command windows, and post the results? In particular, what are the outputs if you type -di c(matsize)- in Stata/MP 13 and Stata/SE 13?
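For anyone following along, a short way to report only the settings relevant to this thread, rather than the full -creturn list- output, might look like the sketch below. It assumes a Stata 13-era release; the c() values used are built-in c-class results, and the labels are only illustrative.

```stata
// Sketch: show the flavor, processor, and matsize settings discussed here.
display "Stata version : " c(stata_version)
display "Flavor        : " c(flavor) "  (MP = " c(MP) ", SE = " c(SE) ")"
display "Processors    : " c(processors)
display "matsize       : " c(matsize) ///
    "  (allowed range " c(min_matsize) " to " c(max_matsize) ")"
```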

-Hua
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#9

21 Jun 2014, 08:35

Hua---Aha! Your thoughts hit the spot here, thanks.

c(matsize) = 11000 in my MP installation and 800 under IC. Experimenting with this showed something interesting and disconcerting to me:

When I -set matsize 800- in MP, the run time in MP came down to as good as IC's, i.e., under a second rather than 10 sec.! And increasing -matsize- in MP appeared to increase run time monotonically and roughly proportionally: the larger the -matsize-, the worse the performance for -clogit-. Note, by the way, that I have 8 GB of memory and was using a small data set, so one would not think the procedure was starved for memory.
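An experiment of this kind can be sketched as below. This is a rough illustration, not the exact commands used in the thread: the list of matsize values is arbitrary, the macro names are hypothetical, and it assumes Stata/SE or Stata/MP 13 (Stata/IC caps matsize at 800) with the test data from post #1 in memory.

```stata
// Sketch: time 5 -clogit- fits at several matsize settings.
// Assumes y, x1, x2, and id from post #1 are in memory (SE or MP flavor).
local old_matsize = c(matsize)          // remember the current setting
local t = 0
foreach m of numlist 400 800 2000 5000 11000 {
    local ++t                           // one timer per matsize value
    set matsize `m'
    timer clear `t'
    timer on `t'
    forvalues i = 1/5 {
        quietly clogit y x1 x2, group(id)
    }
    timer off `t'
    quietly timer list `t'              // leaves the elapsed time in r(t#)
    display "matsize = `m': " %8.2f r(t`t') " sec for 5 fits"
}
set matsize `old_matsize'               // restore the original setting
```

If run time grows with matsize as described above, the displayed times should increase down the list; the actual numbers will depend on the machine.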

So, Hua, is this a general lesson for us, i.e., that we should leave -matsize- at a small value and only use large values for it when an error occurs? Or is this an odd algorithmic or coding issue that only strikes -clogit-? And can you explain the nature of the issue here, so that we might anticipate other commands and situations for which a large -matsize- is a performance killer? All I can think is that something here is weirdly kicking Stata into using virtual memory when it does not need to.

And Rich--this goes to your point: I have casually compared a few other commands and not seen the same remarkably bad performance from MP, although when I used -xtlogit- to run the same conditional logit code under MP, I had the identical performance problems.

Regards, Mike
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#10

21 Jun 2014, 08:43

Interestingly, several months back I had run some analysis that required me to set a huge matrix size (in MP2), and I forgot to re-set it back down to default. Subsequently, I tried to run an -sem- that kept failing because it was "out of memory." When I conferred with tech support, they identified the large matrix size as the issue. But I recall that around that time, I was also experiencing significant slowing of -xtlogit- (a command I use frequently in my work) that simply mystified me. And I also recall that -xtlogit- resumed its usual good performance after the problem with -sem- pushed me to reset my matrix size back to factory default setting.
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#11

21 Jun 2014, 09:41

Mike,

I suspected so when I saw your timings on the same machine. I will have a detailed explanation for you and other interested parties Monday.

Have a nice weekend.

-Hua
Hua Peng (StataCorp)

StataCorp Employee

Join Date: Jun 2014

Posts: 346
#12

23 Jun 2014, 15:37

-matsize- in Stata's internal code is usually used for the size of temporary matrices created during computations.

If a user has -set matsize- to a large number, it can affect performance for two reasons.

First, it may waste memory if the user has -set matsize- much larger than what is needed. A single 11,000 by 11,000 matrix occupies almost 1 GB of memory! If matrices of such dimension are not needed for the model being estimated, it is better to -set matsize- to a lower number.
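The "almost 1 GB" figure can be checked with quick arithmetic. The sketch below assumes each matrix element is stored as an 8-byte double, which is an assumption here since the post does not state the element size; the macro names are illustrative.

```stata
// Sketch: memory needed for one matsize x matsize matrix of 8-byte doubles.
local m = 11000                         // the matsize reported in post #9
local bytes = `m'^2 * 8                 // 121,000,000 elements x 8 bytes
display %14.0fc `bytes' " bytes"        // 968,000,000 bytes
display %6.2f `bytes' / 2^30 " GiB"     // about 0.90 GiB
```

By the same arithmetic, a matsize of 800 gives 800^2 * 8 = 5,120,000 bytes, or roughly 5 MB per matrix.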

Second, a large matsize can affect performance due to how those temporary matrices are stored. For example, a 2 by 2 temporary matrix is stored as a contiguous block in the computer's main memory as:

a11, a12, ...matsize-2 number of unused entries...; a21, a22, ...

Remember that the temporary matrices are allocated with dimensions matsize x matsize, so when matsize is much larger than the needed size of a temporary matrix, there is a lot of wasted memory in between successive rows of the matrix.

The CPU does not access main memory directly. It transfers the data from main memory to the CPU's cache in blocks. Think about cache as a small block of high speed memory. When matsize is small, a11, a12, a21, and a22 can be transferred into the CPU's cache in a single trip to the main memory; afterward, the computation is done by accessing the cache only. On the other hand, if matsize is too big, it will require two trips to the main memory to transfer a11, a12, a21, and a22 to the cache; the first trip transfers a11 and a12, and the second trip transfers a21 and a22. Performance suffers because main memory access is typically much slower than cache access.

In the future, we want to do away with -set matsize- completely and allocate the temporary matrices on-the-fly as computations need, much as we eliminated -set memory-. That is a long term project, however.

For now, users should set an adequate matsize, not a huge matsize that is much larger than the estimation problem requires.
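In a do-file, this advice could be applied with a small guard before estimation, sketched below. The target of 800 is only an illustrative choice (it is the value under which the timings in this thread recovered), not a recommended default, and the macro name `target` is hypothetical; -set matsize- applies to Stata 13-era releases.

```stata
// Sketch: lower matsize before estimation if it was left at a large value.
local target = 800                      // illustrative value, adjust to the model
if c(matsize) > `target' {
    display "matsize is " c(matsize) "; lowering it to `target'"
    set matsize `target'
}
```

If a later command stops with a "matsize too small" error, the setting can be raised again just for that model.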