  • Stata is #1 - in the Harvard Dataverse Repository

    https://www.nature.com/articles/s41597-022-01143-6

    The most popular programming languages among the Harvard Dataverse repository users are Stata and R, as shown from the frequency of deposited code files in Fig. 1.
    Also:

    We find that 74% of R files failed to complete without error in the initial execution, while 56% failed when code cleaning was applied
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

  • #2
    You'd think, with all the talk about open-source software such as R and Python, that these would run the best most of the time and that the dated, proprietary software would exhibit the most problems... but nope! Stata outpaces them time and again, and surprise, surprise. For the money and for the capabilities, it's the best stats software around, in my opinion, by far.

    • #3
      I think the main claim for open-source software is not that it always works even when poorly used, but that debugging is a democratic possibility, since absolutely all of the code is downloadable for free and fully accessible.

      • #4
        I'm surprised at the high error rate (which, in fairness, is only computed for R, not Stata). I'm fairly compulsive about results being replicable if I post the code. If Stata code produces errors I bet the most likely cause would be a failure to install routines from SSC or elsewhere, and/or a failure to use version control. I wonder too if some code tells you what needs to be modified (e.g. file locations) but the tests aren't done with those required changes.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam
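        A minimal sketch of the kind of do-file preamble that guards against the failures mentioned in #4 (the package name, paths, and version number below are purely illustrative):

        Code:
            * hypothetical preamble: pin the Stata release and install any
            * user-written commands the do-file relies on before they are called
            version 18                      // interpret the do-file under Stata 18 behavior
            capture which estout            // is the SSC command already installed?
            if _rc ssc install estout       // if not, fetch it from SSC
            * keep file locations in one place so a reader only has to edit one line
            cd "C:/replication_package"     // the single path that must be changed
            use "data/analysis.dta", clear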

        • #5
          Originally posted by Richard Williams View Post
          I'm surprised at the high error rate (which, in fairness, is only computed for R, not Stata). I'm fairly compulsive about results being replicable if I post the code. If Stata code produces errors I bet the most likely cause would be a failure to install routines from SSC or elsewhere, and/or a failure to use version control. I wonder too if some code tells you what needs to be modified (e.g. file locations) but the tests aren't done with those required changes.
          My guess is that the large majority of these errors come down to precisely which versions of R and of the specific packages are being used. That is not so trivial in an environment such as R or Python, since there is no special emphasis on backward compatibility as there is in Stata.

          • #6
            Another possible cause of errors is the posted data set being different from the one used in the original analysis. For example, many of the data sets I work with contain "protected health information" that I cannot legally put into a public repository, so I have to "anonymize" the data in some way. I might do things like replace the actual patient identifier with consecutive numbers starting at 1, for example using -egen, group()-, but that might result in replacing a string variable with a numeric one. While I would try to make corresponding changes in the repository version of the code, I might miss some place, or introduce some new error in the process. Similarly, disguising dates by adding, say, jitter might break an -assert-ion somewhere. It's actually a lot of work to go carefully through all the code to make it compatible with a de-identified dataset: it's tedious and inherently error-prone, not particularly fun to do, and there is no external reward for doing it either.
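            A small made-up illustration of how that sort of de-identification can silently change a variable's type and break downstream code (the variable names are invented):

            Code:
                * patient_id is a string identifier in the original (private) data
                egen long patient_num = group(patient_id)   // consecutive integers 1, 2, ...
                drop patient_id
                rename patient_num patient_id               // same name, but now numeric
                * code written for the original string variable, for example
                *   keep if substr(patient_id, 1, 3) == "ABC"
                * now stops with a type mismatch error unless it is rewritten as well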

            • #7
              Taking Clyde's point in #6 in a different direction, we looked (if I recall correctly) at adding some options to add noise or obfuscation to dataex, but backed off for similar reasons: it is all too easy to mess up something that is important. So, we put responsibility back on the user to make up a dataset that is realistic where that matters and fake where that also matters.
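              For example, a realistic-but-fake extract might be built by hand and then shared with dataex; everything below is invented for illustration:

              Code:
                  * construct a small fake dataset that mimics the structure of the real one
                  clear
                  set seed 2718
                  set obs 8
                  generate long patient_id = 1000 + _n
                  generate float sbp = rnormal(130, 15)      // fake systolic blood pressure
                  generate byte treated = runiform() < 0.5
                  dataex                                     // paste the output into the post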

              • #8
                Richard Williams thank you very much for posting this - I have just read the article and found it pretty depressing (e.g., even journals that claim that they "verify" the code before publication only had a success rate of about 60%), so I think that something weird is going on (the actual data sets allegedly used were included in the repository).

                I also found it interesting that, of the 2,091 replication packages from the R world that were included, 620 included Stata code (e.g., do-files), so much of the work is clearly being done in more than one language. The reason was not clear, as that is not something the repository documents, but obvious possibilities include (1) different members of the team using different software or (2) different parts of the project being done in different software.

                Finally, I note that while I no longer do expert-witness work, I always had to be prepared to turn over my data and my do-files (and my log files); on the occasions when I actually did turn this material over, no questions ever arose about the code failing to work.

                • #9
                  I started computing by myself in 1973 with Fortran -- then usually called FORTRAN -- as just about everyone in my sub-sub-field was writing programs in that language. There were statistical packages, but none available locally was really programmable. Sharing code was a matter of finding a published listing of code in a journal or getting an author to send you a listing. Then you needed to get the code put on a deck of punched cards -- other places used paper tape. Despite Fortran being a language that was well supported (e.g. by agreed public standards, well-written compilers, and excellent textbooks), an early and painful discovery was that input of data and output of results were often highly idiosyncratic, and what worked at your university would need to be modified to work elsewhere, and vice versa.

                  Back to the present: I have much sympathy with project authors. There is a widely perceived responsibility to publish replicable code, but the presumption that that research should remain replicable into the indefinite future is more than I think most authors admit or aspire to.

                  I don't want this thread to be cited as snark about R by Stata people. Let's not forget that there is a long tail of code in any language that is likely to be poorly documented and difficult to reproduce by the highest standards.

                  • #10
                    Chang and Li reviewed the reproducibility of macroeconomics papers a few years ago. The results were not encouraging, though the situation seems to be improving.
                    Associate Professor of Finance and Economics
                    University of Illinois
                    www.julianreif.com
