
  • #31
    #1 User-defined functions

    I would interpret that differently, e.g. that I could define say a generalised floor function as floor(x) with one argument and y * floor(x/y) with two arguments and then use that anywhere where I can refer to a function at present.

    But that's another wish that in no sense detracts from your suggestion.
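
    For concreteness, here is a minimal sketch of how such a two-argument floor can be improvised in Mata today (mfloor() is a made-up name; the wish is to be able to call something like it wherever built-in functions are allowed, e.g. inside -generate-):

    Code:
    mata:
    real scalar mfloor(real scalar x, | real scalar y)
    {
        // one argument: ordinary floor; two arguments: generalised floor
        if (args() == 1) return(floor(x))
        return(y * floor(x / y))
    }
    mfloor(7.3)       // 7
    mfloor(7.3, 2)    // 6
    end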

    #3 Better documentation for graphics

    As with all these ideas, StataCorp decides; but immediately after version 8, there was an idea that the internals of graphics should get another manual. My back-of-the-envelope calculations are that such a task would lock up very senior developers for 2 person-years and be of interest or benefit to about 10 users. I would be one of those 10, but now make your guesses on what will happen. Worse, whatever is documented would have to be maintained. At present, StataCorp can change the underlying code and not worry about documenting what isn't documented. So, documenting that code might suddenly expose a need to make the syntax much more user-friendly.



    • #32
      This is a minor one, but please make "replace all in selection" the default behavior for the Edit-Replace function in the do-file editor when it is invoked with selected text. Not only is that the default behavior in every other program I work with that has a text-replace function, but it's a nuisance to have to check that box each time. Were it not for checking that box, replace operations in a selection could be accomplished easily without ever taking one's fingers off the keyboard or tabbing through the other check-box options, a modest gain in efficiency. (As a second best, provide some single-keystroke way of checking that box without having to use the mouse. As a third best, move it to the top of the list of check-boxes in the Replace dialog box so that it can be activated just by hitting Tab and = after entering the "Replace what" text.)



      • #33
        Originally posted by Nick Cox
        #1 User-defined functions

        I would interpret that differently, e.g. that I could define say a generalised floor function as floor(x) with one argument and y * floor(x/y) with two arguments and then use that anywhere where I can refer to a function at present.

        But that's another wish that in no sense detracts from your suggestion.

        #3 Better documentation for graphics

        As with all these ideas, StataCorp decides; but immediately after version 8, there was an idea that the internals of graphics should get another manual. My back-of-the-envelope calculations are that such a task would lock up very senior developers for 2 person-years and be of interest or benefit to about 10 users. I would be one of those 10, but now make your guesses on what will happen. Worse, whatever is documented would have to be maintained. At present, StataCorp can change the underlying code and not worry about documenting what isn't documented. So, documenting that code might suddenly expose a need to make the syntax much more user-friendly.

        Nick Cox definitely a good point. I guess it's more that I wish Stata had more modernized graphics capabilities that would make it easier to compete with other platforms that are either more specialized for data visualization (e.g., Tableau, D3.js, polestar.js, Trifacta, vega.js, voyager.js, etc.) or other statistical platforms that currently have more robust integration with these technologies (e.g., R, Python, etc.). I just started experimenting with JavaFX in the hope that I could potentially put something together with it, but so far I've had little luck. Once I wrap up one or two smaller things with brewscheme, I'll spend a bit more time on a Mata wrapper around the D3.js library that I started building, since it will be easy enough to take existing D3.js-based code and slide it over.



        • #34
          I'll add to my previous post about increased efficiency in Stata: these benchmarks (https://github.com/matthieugomez/benchmark-stata-r) demonstrate some of the issues I and others have raised. Consider the results below, which I pulled from the linked repository and which refer to a dataset with approximately 10 million observations:
          [Image: 1e7.png (benchmark timing results for the ~10-million-observation dataset)]

          Stata's performance is especially egregious in opening a CSV, reshaping, and merging, among other operations. Improvements in the performance of these routines would benefit the entire Stata community. These are far and away the most important items (for me) on the wishlist. Maybe it would be possible to implement -reshape-, -merge-, etc. in C/C++, as part of the Stata core binary, instead of the slower ado language?

          To give another example, in my experience, it's sometimes faster to export a dataset, open it in R, reshape the data, export it from R to a Stata-compatible format, then reload it into Stata, instead of reshaping the data in Stata. That's a damning indictment of Stata's performance, to say the least.
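
          To make that round trip concrete, here is a rough sketch (variable and file names are made up; it assumes Rscript and the R package haven are installed):

          Code:
          * export a wide dataset, reshape it in R, and reload the long version
          export delimited id inc2000 inc2001 using "wide.csv", replace
          shell Rscript -e "library(haven); d <- read.csv('wide.csv'); l <- reshape(d, direction = 'long', varying = c('inc2000', 'inc2001'), v.names = 'inc', timevar = 'year', times = 2000:2001, idvar = 'id'); write_dta(l, 'long.dta')"
          use "long.dta", clear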
          Last edited by Michael Anbar; 21 Dec 2015, 08:33.



          • #35
            Michael Anbar while the attempts at benchmarking are great, it's probably better to compare core functionality across the platforms instead of comparing user-written programs to native functions across the platforms.

            "R is faster than ten times faster than Stata to read .csv (using the data.table command fread vs the Stata commandsinsheet. However, when reading or saving data in proprietary format (.dta for Stata and .rds for R), Stata is more than ten times faster."

            data.table is not part of base R, and many of the comparisons made by Matthieu Gomez use user-written commands:

            # To run the script, download the relevant packages:
            # install.packages("data.table")
            # install.packages("tidyr")
            # install.packages("statar")
            # install.packages("biglm")
            If R itself were truly that much faster, reading its own binary format (.rds) would also be much faster. Although R provides a lot of nice functionality, the truth of the matter is that it is terribly inefficient with memory management and relies heavily on users to develop core functionality that has even moderate performance (it'd be better to compare read.csv() to import delimited). I suspect that there were also other applications bogging down the user's system. I regularly have to work with files that are 1 GB or larger and have never waited 3-4 minutes for Stata to load a CSV, Excel, or Stata data file.
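
            A more like-for-like comparison would time base R's read.csv() against Stata's native import. A minimal sketch of the Stata side, assuming the benchmark's 1e7.csv is in the working directory:

            Code:
            timer clear
            timer on 1
            import delimited using "1e7.csv", clear
            timer off 1
            timer list 1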



            • #36
              Originally posted by wbuchanan
              Michael Anbar while the attempts at benchmarking are great, it's probably better to compare core functionality across the platforms instead of comparing user-written programs to native functions across the platforms.

              data.table is not part of base R, and many of the comparisons made by Matthieu Gomez use user-written commands:
              I guess I don't understand your point here. Whether or not a command is part of base R or base Stata shouldn't make a difference. If a user-written command in R can significantly outperform the native, compiled functionality of Stata (e.g. in the case of -reshape-, although I'm not sure the R equivalent is user-written), or vice versa, that speaks volumes about the inefficiency of one or both programs.

              My point is simple: as others have mentioned as well (see my previous links), some of Stata's core functionality, e.g. -reshape- and -merge-, is painfully slow. This has been a complaint for a while now.

              Once I've loaded a dataset in memory, running -reshape- on the data can take significant amounts of time. In some cases, upwards of an hour (even on small datasets of around 1 GB). This is on my office workstation, which is a new, modern machine with minimal other processes running concurrently. I regularly work with datasets that are 20 GB+ in size, at which point such processing becomes nearly impossible in Stata, even though the dataset fits completely in memory (I'm working with 64 GB of RAM on my office machine).

              Maybe a version of -reshape- that is compiled rather than interpreted in the ado language, or that takes better advantage of multi-threaded machines, would improve this, but I don't have enough experience with C to say how easy that would be to implement.
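
              For reference, this is the kind of timing I mean; a self-contained sketch with made-up sizes and variable names:

              Code:
              clear
              set obs 1000000
              generate id = _n
              forvalues y = 2000/2003 {
                  generate inc`y' = runiform()
              }
              timer clear
              timer on 1
              reshape long inc, i(id) j(year)
              timer off 1
              timer list 1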



              • #37
                Michael Anbar the point is that the benchmark isn't providing a reasonable comparison. If we were to compare the performance of joins (merge) and unions (appends) in SQL, you'd find that all non-SQL solutions for those types of tasks typically perform orders of magnitude slower than their equivalents in statistical software packages. Similarly, if you were to compare performance on statistical computations, SQL platforms can be quite a bit slower. As you may be aware, R uses a very different model for providing core functionality. So there is going to be an inherent performance difference between a join where both data objects are already stored in RAM and a method that requires an I/O operation to perform the join. Whether this happens in low-level compiled code or in a higher-level scripting language is irrelevant when the comparison isn't equivalent. As you can see from the benchmark code:

                Code:
                # Excerpt from the benchmark script: both tables have already
                # been read into memory before the keyed join is timed; time()
                # is a helper defined elsewhere in that script, not base R
                DT <- readRDS(rdsfile)
                DT_merge <- readRDS("merge.rds")
                f <- function(){
                    setkey(DT, id1, id3)
                    setkey(DT_merge, id1, id3)
                    merge(DT, DT_merge, all.x = TRUE, all.y = FALSE)
                }
                out[length(out)+1] <- time(f())
                The user already loaded the data into memory before timing the join, which is far from comparable to reading a file from disk in order to perform the join. The data.table class is also not equivalent to a dataset in Stata; it is more analogous to a SQL table with a primary key constraint defined. So again, the performance estimates generated by the "benchmark" are being heavily skewed by the user in a way that is fundamentally non-comparable. Would you ever want to compare joins/unions of a well-indexed, parallelized, and partitioned table in an Oracle system with the performance of the same data in Stata/R? Of course not, because the comparison is so riddled with confounding that any results are essentially useless.

                If performance on data-munging tasks is as big an issue as it is for most people (which it very legitimately can be), then the answer is either making multiple datasets accessible in memory simultaneously or using better, lighter-weight technologies (e.g., Spark, Shark, Drill, etc.) that are designed for processing massive amounts of streaming data efficiently (the file I/O issues can then be mitigated a bit by processing a stream of the data rather than reading/representing the entire file in memory). My issue isn't so much with your request, which I see as completely valid and reasonable, but with using a benchmark that doesn't make reasonable comparisons.
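
                For what it's worth, the nearest Stata analogue of the timed join above unavoidably includes reading the using file from disk, which is exactly the asymmetry described here (a sketch only; file names follow the benchmark's conventions, and m:1 assumes id1/id3 uniquely identify rows in merge.dta):

                Code:
                use "1e7.dta", clear
                timer clear
                timer on 1
                * keep(master match) mirrors all.x = TRUE, all.y = FALSE above
                merge m:1 id1 id3 using "merge.dta", keep(master match) nogenerate
                timer off 1
                timer list 1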

                To give you an idea of just how awful the performance of R is, I've been waiting a bit more than 20 minutes for

                Code:
                setwd("~/Desktop")
                K <- 100
                set.seed(7779311)
                for (file in c("2e6", "1e7", "1e8")){
                    N <- as.integer(file)
                    DT <- as.data.frame(cbind(
                      id1 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
                      id2 = sample(sprintf("id%03d",1:K), N, TRUE),      # large groups (char)
                      id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
                      id4 = sample(K, N, TRUE),                          # large groups (int)
                      id5 = sample(K, N, TRUE),                          # large groups (int)
                      id6 = sample(N/K, N, TRUE),                        # small groups (int)
                      v1 =  sample(5, N, TRUE),                          # int in range [1,5]
                      v2 =  sample(1e6, N, TRUE),                        # int in range [1,1e6]
                      v3 =  sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
                    ), stringsAsFactors = FALSE)
                    
                    write.table(DT, paste0(file, ".csv"), row.names = TRUE, sep = "\t")
                    
                    if (file == "2e6"){
                        write.table(DT, "merge.csv", row.names = TRUE, sep = "\t")
                    }
                }
                to finish. I'm using Revolution R Open on a MacBook Pro with an i7 and 16 GB of RAM. It's just now wrapping up, but I'll try to put together some roughly comparable benchmarks (as comparable as possible, at least) and will post the results and code if you're interested.



                • #38
                  wbuchanan Thanks for clarifying that and adding more relevant benchmarks. I'm not sure issues of disk I/O vs. data in memory apply to the benchmark of -reshape-, though, correct? In both R and Stata, the data would already be loaded into memory. I definitely agree with the need for multiple datasets in memory, or further (any?) integration with distributed computing, or even something similar to NumPy's -memmap- (although I realize that once we start reading files from disk in pieces, we enter the purview of SAS and run into issues of disk I/O again).



                  • #39
                    Michael: Andrew Maurer gave a great talk on "Big Data in Stata" at the UK User Group meeting, September 2015 (abstract at http://www.stata.com/meeting/uk15/abstracts/), and he has a suite of fast* commands (ssc desc f). Perhaps some elements of this would be useful for you?



                    • #40
                      Originally posted by Stephen Jenkins
                      Michael: Andrew Maurer gave a great talk on "Big Data in Stata" at the UK User Group meeting, September 2015 (abstract at http://www.stata.com/meeting/uk15/abstracts/), and he has a suite of fast* commands (ssc desc f). Perhaps some elements of this would be useful for you?
                      Stephen Jenkins I've looked at a few of those commands, but I hadn't heard of -fastcollapse-. I can't get it to install from SSC right now (-ssc install fastcollapse- says it wasn't found), but I can at least get the code from https://ideas.repec.org/c/boc/bocode/s457939.html and explore it from there. Unfortunately, many of the commands I use quite frequently for processing large datasets, e.g. -reshape- and -merge-, either aren't multithreaded or are limited by Stata's lack of multi-dataset support (and thus limited by disk I/O).



                      • #41
                        Michael Anbar it can also be a bit painful performance-wise, but I've found SQL more and more useful for cases where the amount of data to manage is fairly large. I've been trying to put together a .dta file reader in Java off and on for a bit; it might then be possible to push some of this work off to the JVM using H2/HSQL or something similar. Unfortunately, there are memory constraints that could also present issues. It won't solve everything, since the Java API isn't thread-safe, but it could potentially help by passing off a call to do all of the more laborious munging work and then returning the results serially.



                        • #42
                          The svy: prefix supports some non-estimation commands -- e.g. svy: tabulate twoway -- but I wish it supported more. In particular, I wish it worked with the summarize command. I use summarize all the time, and it is sort of a pain to use things like svy: mean instead, since I have to use multiple commands to get the same information that I get with summarize. I wouldn't think it would be that hard to do. Here is how I am improvising in the meantime:

                          Code:
                          webuse nhanes2f, clear
                          quietly svy: mean weight height age
                          estat sd
                          sum weight height age [aw = finalwgt]
                          
                          . estat sd
                          
                          -------------------------------------
                                       |       Mean   Std. Dev.
                          -------------+-----------------------
                                weight |   71.90869    15.43333
                                height |   168.4625    9.702933
                                   age |   42.23732    15.50095
                          -------------------------------------
                          
                          . sum weight height age [aw = finalwgt]
                          
                              Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
                          -------------+-----------------------------------------------------------------
                                weight |  10,337   117023659    71.90869   15.43333      30.84     175.88
                                height |  10,337   117023659    168.4625   9.702933      135.5        200
                                   age |  10,337   117023659    42.23732   15.50095         20         74
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          Stata Version: 17.0 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam



                          • #43
                            Is there theory to support svy flavours of everything that summarize does? Quantiles? Kurtosis?



                            • #44
                              Originally posted by Nick Cox
                              Is there theory to support svy flavours of everything that summarize does? Quantiles? Kurtosis?
                              The svy: tabulate commands don't do things exactly the same way that tabulate does. I think svy: sum could only support those things that are legitimate. I assume the basic example I gave above is fine because I get the same results with summarize using aweights as I do with other commands using the svy: prefix. I am usually quite happy just getting means, standard deviations, and the min and max values.
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              Stata Version: 17.0 MP (2 processor)

                              EMAIL: [email protected]
                              WWW: https://www3.nd.edu/~rwilliam



                              • #45
                                This thread had a nice discussion of survey data combined with multiple imputation:

                                http://www.statalist.org/forums/foru...th-survey-data

                                As it notes, you can do something like mi estimate: svy: command ... But you can't do something like

                                svy: mi impute...

                                In other words, you can't easily incorporate the survey characteristics into the imputation of missing values.

                                I would like to see a command like the above come into existence. Barring that, I would at least like to see some good examples in the manual, or maybe a FAQ, on how to handle mi and svy together.

                                If I follow the thread, one approach is to do something like

                                mi impute logit var1 var2 var3 i.clustervar i.stratvar [pw = weightingvar]...

                                I wouldn't know that if I hadn't read the thread, and I still don't know whether that is really best. Anything Stata can do to make it easier, or at least to make clear what the options are, would be good.
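
                                For concreteness, a minimal end-to-end sketch of that workaround (all names are hypothetical, and var1 is assumed binary, as -mi impute logit- requires):

                                Code:
                                * hypothetical names throughout
                                svyset clustervar [pw = weightingvar], strata(stratvar)
                                mi set wide
                                mi register imputed var1
                                mi impute logit var1 var2 var3 i.clustervar i.stratvar [pw = weightingvar], add(20)
                                * the analysis step can then combine mi and svy:
                                mi estimate: svy: logit yvar var1 var2 var3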
                                -------------------------------------------
                                Richard Williams, Notre Dame Dept of Sociology
                                Stata Version: 17.0 MP (2 processor)

                                EMAIL: [email protected]
                                WWW: https://www3.nd.edu/~rwilliam

