License core limitations and the future of Stata

Michael Droste

Join Date: Sep 2017

Posts: 24
#1

01 Sep 2022, 18:05

This is not quite a 'feature request' so I figured I'd post it here.

Stata-MP comes in a wide variety of license tiers. The number of processing cores that an instance of Stata will use is determined by the flavor of Stata-MP license you have, and the cost of a license grows quite a bit with the number of cores. A single-user business license costs about $1,000/year for a Stata-MP 4-core license, and $1,755 for a Stata-MP 12-core license. None of the universities that I've worked at or attended have offered licenses with more than 4 cores to students or staff.
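To put those two price points side by side, here is a back-of-the-envelope sketch in Python. It uses only the approximate figures quoted above ($1,000/year for 4 cores, $1,755/year for 12 cores); the variable names are made up for illustration, and actual pricing may differ.

```python
# Approximate single-user business prices quoted above (USD per year).
# Illustrative only; the names below are hypothetical.
prices = {4: 1000.0, 12: 1755.0}  # licensed cores -> annual price

# Average price per licensed core at each tier.
per_core = {cores: price / cores for cores, price in prices.items()}

# Incremental price of each extra core when moving from 4 to 12 cores.
extra_per_core = (prices[12] - prices[4]) / (12 - 4)

for cores, price in prices.items():
    print(f"{cores:>2}-core: ${price:,.0f}/year, ${per_core[cores]:.2f} per core")
print(f"Each core beyond 4 adds about ${extra_per_core:.2f}/year")
```

On these numbers the 12-core tier costs roughly 75% more in total than the 4-core tier, while the average price per core falls from $250 to about $146.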

Over time, this kind of hardware limitation has become more constraining for ordinary users. 13th gen Intel CPUs are launching next month: the i9 variant will have 24 cores; the i7 variant commonly found in workstations will have 16 cores; even the cheapest i5 variant will have 10. I understand why this sort of price discrimination came into play when Stata-MP was developed and these licenses could distinguish between enterprise (for lack of a better word) users and everyone else - but these days, I would wager most Stata-MP users have licenses that impose a binding constraint on the amount of computing resources they can use on their own computer.

I would wager that a very large chunk of the Stata community uses institutional / organization licenses, self included, and these users have little control over what their institution chooses for the license. There's little that I can do as a student to lobby Harvard to get me access to an 8-core license. On the other hand, I don't believe any other general-purpose programming language or statistical software constrains the use of one's own computer resources in this way. Julia and R can use as many or few cores or threads as I wish. As processor manufacturers continue to stack more cores on a chip and more people get used to setting up remote virtual machines for computationally intensive tasks, Stata effectively falls farther behind competing statistical software and general-purpose programming languages that do not have this constraint. Over time, users will migrate to those alternatives, and demand for Stata licenses will go down.

1 like
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

01 Sep 2022, 22:29

I think your economic analysis is correct, Michael, but under the assumption that you really cannot use the additional cores you have on your machine, if you do not have a multi core Stata flavour.

This assumption is incorrect. If you are a (moderately) advanced Stata user, you can use the additional cores. There is a user-contributed package, -parallel-, and I have written a note on how one can fire up multiple instances of Stata to employ the full resources of the machine one uses. This is what I do for simulations and bootstraps: I just fire up multiple instances of Stata and join the results afterwards.
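The "split the replications, run them independently, join the results" pattern can be sketched in a few lines. The example below is in Python rather than Stata, and it is not how -parallel- or the note mentioned above is implemented; the function names (`bootstrap_chunk`, `split_reps`, `parallel_bootstrap`) and the choice of the sample mean as the statistic are assumptions made purely for illustration.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def bootstrap_chunk(data, reps, seed):
    """Run `reps` bootstrap replications of the sample mean with its own RNG."""
    rng = random.Random(seed)
    n = len(data)
    # Each replication resamples the data with replacement and records the mean.
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(reps)]


def split_reps(total_reps, workers):
    """Divide the replications as evenly as possible across workers."""
    base, extra = divmod(total_reps, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]


def parallel_bootstrap(data, total_reps, workers, seed=12345):
    """Run independent chunks in separate processes, then join the results."""
    chunks = split_reps(total_reps, workers)
    seeds = [seed + i for i in range(workers)]  # a distinct seed per worker
    with ProcessPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(bootstrap_chunk, [data] * workers, chunks, seeds)
        # Joining step: concatenate the per-worker lists of estimates.
        return [estimate for part in parts for estimate in part]
```

A call such as `parallel_bootstrap(sample, total_reps=1000, workers=4)` would normally sit under an `if __name__ == "__main__":` guard in a script. Each worker only needs its own seed and its share of the replications, which is why this kind of task spreads across separate processes (or separate Stata instances) so easily.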
1 like
Michael Droste

Join Date: Sep 2017

Posts: 24
#3

02 Sep 2022, 00:40

Thanks for your reply, Joro!

I have used -parallel- since at least 2017 and I'm a big fan of that package. I would like to push back a bit against the claim that -parallel- does what I'm asking; I don't think it's a particularly close substitute.

-parallel- works well for certain tasks that operate on partitions of a dataset or on loops, as you mentioned. -parallel- will not speed up my logistic regression, my linear regression with a bajillion fixed effects, or (your favorite matrix operation here). -parallel- is also somewhat I/O-intensive by necessity; this is quite costly in some server environments.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#4

02 Sep 2022, 04:12

Now you touch on another topic: some tasks are parallelisable and some are not. In the course of writing my note I read the paper by the authors of -parallel- pretty carefully, and they come up with some very appropriate terminology: they call some tasks "embarrassingly parallelisable." Bootstraps and simulations would be examples of such embarrassingly parallelisable tasks.

Not that I disagree with anything you are saying, but your hopes regarding the gains from parallelisation might be too high, at least in the examples you are giving. I think you can actually test this: if you have a license for more cores, you can set Stata to use fewer of them and time how long it takes to fit a logistic regression with many fixed effects.
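One common way to reason about how much extra cores can help is Amdahl's law, which is not something measured in this thread: if a fraction p of a job can run in parallel, the speedup on n cores is at most 1 / ((1 - p) + p / n). The sketch below is a minimal Python illustration; the function name and the parallel fractions are hypothetical and say nothing about any particular Stata command.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions, chosen only for illustration.
for p in (0.50, 0.90, 0.99):
    row = ", ".join(f"{n} cores: {amdahl_speedup(p, n):.2f}x" for n in (2, 4, 8, 16))
    print(f"parallel fraction {p:.2f} -> {row}")
```

Under this bound, a job that is only half parallelisable can never run more than twice as fast however many cores are licensed, whereas a nearly fully parallel job keeps gaining as cores are added.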
daniel klein

Join Date: Mar 2014

Posts: 3846
#5

02 Sep 2022, 04:46

Originally posted by Joro Kolev

I think you can actually test this: if you have a license for more cores, you can set Stata to use fewer of them and time how long it takes to [...]

Such tests have been carried out and documented by StataCorp.
JanDitzen

Join Date: Jan 2015

Posts: 350
#6

02 Sep 2022, 06:48

Michael Droste, thanks for bringing this topic up! This has been on my mind for a while, and I think it is actually a really big issue: a huge disadvantage of Stata and something which I believe threatens Stata's future. I think Stata MP is way too expensive. You also have a point about how licenses are bought and by whom. It would be hard for me to demand an MP license for myself as well. Plus, I am unsure whether the benefits outweigh the price.

Out of desperation I coded multishell and later psimulate2, which are made to parallelise Monte Carlo simulations. I still use psimulate2, but in general I try to avoid running Monte Carlo simulations in Stata. Both programs work, but they are difficult to set up. Combined with the generally slow speed (yes, Mata is a bit better), Stata loses in comparison to Matlab or R. In addition, both Matlab and R offer an easy way to parallelise loops and easy support for servers and/or clusters.

The same applies to working with large datasets. At the moment I am looking into parallelising two of my commands. Both programs exist in other languages, and in both cases Stata is slower by an order of magnitude (it might be my poor programming skills, but running identical loops is already slow; my experience is that the bottleneck is often subscripting and the inversion of large (sparse) matrices [hello sparse matrices in Python, Matlab, R]). My solutions to the problem are 1) to parallelise it in a similar fashion as done for psimulate2 (despite some up-front costs) and 2) to outsource the computationally intensive tasks to Python/Julia/R. My problem with the second approach: since all three programs are free and faster, why code the program in Stata in the first place?

I think Stata has huge advantages in terms of user friendliness, documentation and rigour, and the quality of The Stata Journal is extremely good. However, the speed and multiprocessor capabilities make it either too expensive (not to mention the new subscription scheme) or too slow to be useful for the next generation of researchers or students working on "big data". Plus, there is not much demand in industry for students with Stata skills. Those points bring me back to the question at the end of the last paragraph...

So yes, I agree with Michael. Stata, please merge the MP version into a general version, independent of the number of cores and at a reasonable price.

ps: sorry if my post went a bit off topic.

Last edited by JanDitzen; 02 Sep 2022, 06:49. Reason: added ps
2 likes
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#7

02 Sep 2022, 07:46

I tried R some time around 2006. R is neither fast nor good for large databases, and on top of that it is a weird language that never managed to grow on me. The only thing thanks to which R still exists, in my view, is that it is free. Before Stata I used EViews (never was too good at it), TSP (got pretty good at it at some point), and tried RATS for fun. The point is that Stata is just like any other regression-based language, and switching between those is pretty easy. R is a completely different thing, and the things which are different about R from regression software just make me wonder "Who on Earth thought that this might be a good idea?"

Matlab beats Stata in speed by an unimpressive margin in my experience. In Kolev and Karapandza (2017), which is a very computationally intensive paper, I programmed everything in Stata, and then my coauthor reprogrammed everything in Matlab. There was much hype about the "immense power of Matlab": some sparse matrices, some parallelisations... And in the end, where my Stata code completed within 3 weeks, my coauthor's Matlab code completed in 1 week. Which is an advantage, but do you really want to revert to a matrix language for such a gain?

My point is that in my limited experience, and very unfortunately for all of us, Stata has grown to be the de facto standard in econometrics and statistics. Which gives them a lot of market power, as every basic course on economics would teach you. I agree with everything that OP says, I just think that the alternatives are not that great, and we might have to learn to live with Stata Corp behaving as a monopolist.

Kolev, Gueorgui I., and Rasa Karapandza. "Out-of-sample equity premium predictability and sample split–invariant inference." Journal of Banking & Finance 84 (2017): 188-201.

Last edited by Joro Kolev; 02 Sep 2022, 07:51.
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#8

02 Sep 2022, 12:16

I completely agree, Joro Kolev. So this is my opinion as someone who's currently learning R.

I do agree that Stata's MP flavors can be pretty expensive, and I also agree that Python and R (generally) can do a lot more (right now) than Stata can, particularly regarding machine learning, Bayesian stuff, and web scraping. But for better or worse, most researchers don't play those games. Your median researcher (most likely) simply needs a statistics and regression software to use. And on that, keep it real, Stata delivers in a much more straightforward way from beginning to end. I should also note that any difference that DOES exist isn't a curse of nature. Python and R have Google's causal impact package to do Bayesian Structural Time Series or Robust Synthetic Control analysis. Stata doesn't. Why not? Not nature! Someone just hasn't gotten around to programming principal component regression or universal singular value thresholding into Mata yet. But it's not like a property of physics that Stata can't do these, it just hasn't been done yet.

R and Python both demand you load separate libraries to do basic things like using ggplot (the graphing system, yes really), mutating (I think renaming or transforming) variables, and so on. You even need to load libraries to display your results cleanly (whereas Stata has collect). At least in Stata all this comes in one package (well, by default). R also (as far as I know) doesn't have the equivalent of macros we enjoy in Stata, where anything in R's eyes can be an object which is SUPER confusing. It is true that R can go wild with new estimators and can do pretty cool stuff with APIs (that Stata could honestly learn from), but in my experience so far, that's where the benefits end.

People talk a lot about how Python's and R's graphical systems are so much better than Stata's, but this is just a lie. Putting aside maps (where I do think R and Python MAY have an advantage, although I'm sure contributors like Asjad Naqvi would disagree), there's essentially no graphic that you can make in R that Stata can't do (barring weird exceptions).

Now I'm still learning R and Python, and both have their uses, but nothing I've seen from R (for the tasks that most researchers will need) has made me go "WOOOOWWWWW, Stata can't do ANYTHING like this". Even dataframes! For the longest time, R and Python had dataframes and Stata didn't, a fact which data scientists thundered from the heavens. Well now, wish granted! Stata has them.

I say this to say given the choice between Stata 18 (paying an upgrade price of a few 100 bucks for a personal license) and R, I'm still going with Stata, unless I learn enough R and I see that there's a night and day difference, which there simply isn't aside from cost and a few other features (shiny apps, etc).
Michael Droste

Join Date: Sep 2017

Posts: 24
#9

02 Sep 2022, 13:21

Thanks for this great discussion guys and for sharing your experiences. I'm not so sure that this is the right place to discuss the relative merits of Stata, R, Matlab, Julia, etc in any holistic sense. A lot of this is quite subjective, for starters, and I'm not surprised that users of the online community 'Statalist' are going to like Stata more than alternatives. Instead, I hope that we can focus this discussion on the license model for Stata-MP as a product. I'm sure folks at StataCorp have talked about this a lot amongst themselves, but I don't know how much user feedback they get. I'm sure it's very lucrative to have many tiers of Stata-MP. I still feel strongly this system is going to induce marginal folks to switch to other languages. I'm sure StataCorp spends a decent amount of time thinking about pricing and forecasting sales.

I'm unsure whether to open the can of worms that is benchmarking runtimes on routine tasks across languages/software. But a few people have mentioned it. I have also spent quite a bit of time on Monte Carlo sims and bootstraps in the past year. In my experience, very straightforward Julia implementations of a nonparametric bootstrap or Monte Carlo simulation are about two orders of magnitude faster than the exact same script in Stata. Matthieu Gomez has a repository on GitHub benchmarking R and Stata for some common data processing tasks, using the fastest available method in each language, and most (not all) tasks like reshapes, collapses, egens, etc. are 2-10x faster in R. This has changed a lot in the past 10 years, and it's probably due to the much larger user community of R developing really efficient packages (like data.table and the tidyverse, say what you will about the syntax). But much of this discussion would also seem to be outside the scope of this thread: Stata will never compete on speed against Julia. But a big chunk of the runtime difference on many tasks is simply due to the fact that Stata won't use your whole computer.

My original claim was that for tasks that can be parallelized efficiently (and I cited a few examples earlier), Stata-MP's licensing restrictions effectively make a lot of Stata tasks a few times slower than they would otherwise be. -parallel- is a cool program and useful for some tasks, but isn't a particularly good substitute for getting rid of this constraint: -parallel- has an enormous memory footprint for tasks like Monte Carlo simulations, and for other tasks (like matrix operations, logistic regression, etc - as well as pretty much anything that isn't either a loop with independent iterations or a task that can be easily performed by partitioning the data in a MapReduce-style way) it's often not feasible.

My suggestion is that this kind of constraint on processing power has become more likely to bind for individual users over time as home processors have increased the number of cores on them. A decade ago, Intel's third-gen Core desktops came in 2 and 4-core varieties. At that point, I could imagine selling Stata-MP licenses that allowed for more than 4 cores was a useful way for StataCorp to distinguish between me dinking around on my laptop with sysuse auto vs. a client with a Stata application server with 64 cores and a boatload of RAM. In 2012, nobody would have complained that Stata-MP licenses for 4+ cores were prohibitively expensive, because almost no users could leverage that. But it's 2022, and the same Intel Core line for consumers now features 12-24 cores depending on the model (I'm going to ignore the heterogeneity between 'P' and 'E' cores), and most consumers now have many more cores available on their PC than their system license allows for. And this is all without mentioning cloud computing environments like AWS EC2, which have been an absolute game-changer for those who are doing computationally intensive work with pretty much any other tool (R, Julia, Matlab, Python).

One medium-term fix would be to increase 'base' Stata-MP to something like 8 cores and maintain the price discrimination based on cores above that. I'm totally aware that may not be a popular proposal for the good folks at StataCorp who would realize that this mechanically reduces their revenue in the short-run.

Last edited by Michael Droste; 02 Sep 2022, 13:31.
2 likes
Michael Droste

Join Date: Sep 2017

Posts: 24
#10

02 Sep 2022, 13:27

Originally posted by daniel klein

Such tests have been carried out and documented by StataCorp.

Indeed - my use of logistic regression was not an accident.
1 like
Christopher Bratt

Join Date: May 2019

Posts: 144
#11

02 Sep 2022, 13:47

«The only thing thanks to which R still exists, in my view, is that it is free.»

Nobody who knows R would say that. It’s quite arrogant, and surprisingly uninformed. I don’t think one should turn an interesting discussion about cores into a religious war about software.

If Stata was generally easier to use than R (it’s not), if Stata was faster than R (it’s certainly not in SEM, where Stata is extremely slow), or if Stata gave what R gives, then I would not be using R.

But all this is off topic.
3 likes
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#12

02 Sep 2022, 23:56

Where theologists and gender studies experts see religious wars, I just see corporations maximising profits and charging us whatever they can for the product they sell. Intel does that, Microsoft does that, why should anybody expect Stata Corp not to do that? It is a corporation, as the name suggests; it maximises profits.

OP is right to point out that with the advancements in hardware, the core restrictions on Stata become more and more binding.
But first, has it occurred to anybody that they might have thought this through and concluded that the market can bear the pricing they impose? That does not mean that we cannot complain; who knows, maybe they will hear us... But they can also hear us and say "Yes, we hear you, but we thought this through and we think that you can take it."
Second, I take issue with the "powerful alternatives," in particular the bizarre R, built on and around an obsolete language, S, created in 1976...

The interesting discussion, to me at least, is what we can squeeze out of the hardware we can afford, using different types of software priced in different ways.

Christopher might be surprised to learn that I very much like R fans. R fans like Christopher (or fans of alternative software such as Julia, and Matlab, and Ox etc.), who are ready to migrate from Stata to R or never start using Stata at all, are what stops Stata Corp from charging us whatever they want. And it might be off topic, but I very much wish that TSP did not die out like it did. Competition is the key, competition is what brings profit maximising corporations to charge "reasonable" prices; not the whining of devoted users.
2 likes
Felix Bittmann

Join Date: Aug 2018

Posts: 689
#13

04 Sep 2022, 03:47

My take on this is that Stata simply follows the law of the market. As long as enough people or institutions are willing to pay for each core, well, it works. Capitalism, yay. I think that's OK. The only problem is: many public institutions have to work with lower funding and try to cut costs. And since there are free solutions like Python or R, why Stata? Maybe the usability is lower, whatever, it's free. I see this in my university as well. Stata must be careful due to path dependency. If too many institutions drop Stata, many students (read: future researchers, PhD students and professors) will never learn and love Stata. That's bad. This can kill a company in the long run. I hope they know how to be good dealers: get them started for low and after they are hooked... I have tried R, Python, SPSS and I am just hooked on Stata.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
2 likes