  • #46
    Originally posted by Richard Williams:
    Better support for margins with multiple outcome commands. After commands like ologit and mlogit, you have to run a separate margins command for each outcome of the dependent variable. I'd like to have margins do it with one command.
    I'm coming late to the new forum, but: just a quick note that my combomarginsplot does provide some functionality to combine the multiple calls to marginsplot.
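    For context, a sketch of the repetition being described (the data set and outcome numbers here are purely illustrative): after -mlogit-, each outcome of the dependent variable currently needs its own -margins- call via the predict(outcome()) option.

    Code:
    sysuse auto, clear
    mlogit rep78 mpg weight
    * one margins call per outcome of rep78
    margins, dydx(mpg) predict(outcome(3))
    margins, dydx(mpg) predict(outcome(4))
    margins, dydx(mpg) predict(outcome(5))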



    • #47
      1. What I personally would love to have is a replace option for generate. It's a small thing in the grand scheme of things, but the lack of one keeps annoying me - especially when doing ad-hoc trials with .do-files and the like. Also, looping could be simpler in some cases.
      I agree completely with this one. And even more so for -egen-.

      2. Somewhat similar is the possibility to save empty data sets - yes, I know there's an add-on to do it and also easy ways to work around it, but it would make some things much more elegant to be able to do so from scratch.
      Not sure what you intend here that -save, emptyok- doesn't address.
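      For reference, a minimal sketch of the two points above (variable and file names are illustrative): the -capture drop- idiom that a replace option on -generate- would make unnecessary, and -save, emptyok- for writing an empty data set.

      Code:
      * today's workaround for the missing "replace" behavior on generate
      capture drop newvar
      generate newvar = price * 1.1

      * saving a zero-observation data set as a template
      clear
      set obs 0
      generate long id    = .
      generate str20 name = ""
      save results, emptyok replace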
      Last edited by Jeph Herrin; 07 Aug 2014, 08:42.



      • #48
        The single most important "feature" Stata 14 could add is better support for so-called Big Data(c). This means two things in particular:

        1. More and more I find myself working with very large databases - where analytic files are on the order of 100 GB, say - and in this environment Stata does not shine. In the long history of Little Data, the memory model employed by Stata had a clear and evident advantage. But no more - I find myself not only using SAS for data management, but (reluctantly) suggesting it to others who are working in the same space.

        2. Large data is typically maintained in relational databases, accessed via some flavor of SQL. Stata has some support for SQL, in that one can submit an SQL statement and retrieve the output, but this is not helpful if one wants to, e.g., use an index file to match IDs against the SQL database. Instead (and this often leads to problems of type 1 above), one must retrieve every record and do the matching locally in Stata, which means working with even larger datasets than one would otherwise need.
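        A sketch of the pattern being criticized (the DSN, table, and file names are illustrative): with -odbc- one typically pulls the full table from the server and then does the matching locally.

        Code:
        * retrieve every record, then match against the index file in Stata
        odbc load, exec("SELECT id, x, y FROM bigtable") dsn("mysource") clear
        merge 1:1 id using ids_of_interest, keep(match) nogenerate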

        J
        Last edited by Jeph Herrin; 07 Aug 2014, 08:52.



        • #49
          Originally posted by Nick Cox:

          You are presumably referring to ==, >, <, >=, <=, !=.

          The problem you identify isn't (to me) at all clear....
          I guess he is referring to precision control along the lines of http://docwiki.embarcadero.com/RADSt...s_%28Delphi%29 , meaning in short that the system should know what it is comparing (the types of the LHS and RHS) and know its own limitations (what it can and cannot compare).
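          A concrete instance of the precision issue under discussion (a sketch; the variable name is illustrative): a float variable compared against a double literal silently fails in Stata.

          Code:
          set obs 1
          gen x = 0.1              // stored as float by default
          count if x == 0.1        // 0 matches: the double literal != the float value
          count if x == float(0.1) // 1 match: literal rounded to float first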

          Best, Sergiy Radyakin



          • #50
            Originally posted by Nicholas Winter:
            I'm coming late to the new forum, but: just a quick note that my combomarginsplot does provide some functionality to combine the multiple calls to marginsplot.
            Combomarginsplot is a great command. I do hope that StataCorp doesn't conclude that, just because somebody has written a command, it no longer has to worry about the issue. For a user, it would be easier to add an option to margins than to learn a whole new command.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            Stata Version: 17.0 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam



            • #51
              Originally posted by Jeph Herrin:
              Large data is typically maintained in relational databases, accessed via some flavor of SQL. Stata has some support for SQL, in that one can submit an SQL statement and retrieve the output, but this is not helpful if one wants to, eg, use an index file to match IDs against the SQL database. Instead (and this often leads to problems of type 1 above) one must retrieve every record and do the matching locally in Stata, which means working with even larger datasets than one would need to otherwise.
              I would even add that my impression is that other tools like MonetDB (esp. linked to R) have shocking performance advantages for data retrieval, selection, manipulation and aggregation, let alone "MapReduce" operations on many cores (btw, even Hadoop will be 10 years old when Stata 14 comes out). I understand that Stata is fantastic for -regress- and -ivregress- and -teffects- and clustered standard errors and what not, but it is all the sadder if we waste time with -by- looping over the data all the time, spend time on disk I/O for preserving (rarely transparent to the user), do needless dances for merges, and wait for the data to be resorted again and again because there is no better index for the data, etc.

              I am obviously no database scientist, but I do see other tools (not only SQL, but also pandas in Python, e.g.) doing "merges, egens, collapses, ifs, bys" much, much more efficiently. I really wonder why StataCorp does not invest in this.



              • #52
                Native JDBC support analogous to ODBC. Although the Java API and/or C plugin capabilities for end-user development are great, some end users (myself in particular) have little to no experience in these languages. Depending on the data system, JDBC can also perform a bit better than ODBC.

                Support for multiple datasets residing in memory simultaneously. With sufficient ODBC/JDBC support, this could potentially be a matter of interfacing with whichever flavor of SQL end users prefer (maybe have a set sql [mysql, sqlite, postgre, oracle, sqlserver] [, permanently]) to make integration of Stata within the "big data" space a bit easier on the development team at Stata, while retaining flexibility.

                Using graph names as a method for creating multilayered graphs. For example, we can already use:

                Code:
                tw scatter x y || scatter abc z
                But having a way to use something like:

                Code:
                tw scatter x y, name(gr1)
                tw scatter abc z, name(gr2)
                tw scatter varx vary, name(gr3)
                // use gr1 as base, layer gr3 on top of that, and gr2 on top of all of them
                gr layer gr1 gr3 gr2
                It doesn't seem as though alpha transparency is something that will be coming up in the immediate future, but making it easier for end users to layer things without having to rewrite code could be a good stopgap that would also make it easier to create complex graphs.

                Better documentation of the graphics engine, to make it easier for the user/programmer community to add to the graphics capabilities.

                Better documentation/NetCourses for programming Mata.

                Latent class/transition analysis (categorical latent variables more generally) modeling.

                Native support for 3PL IRT (allowing a single fixed c parameter, or individually constrained/estimated c parameters, in the model).



                • #53
                  And one simpler thing, much less frustrating for StataCorp than my previous suggestions: a new case(,,…) function, or anything to replace nested cond(,,) functions. This should also replace the everyday use of repeated -replace if- and of the recode command, which are painfully slow, esp. on bigger data.
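                  To illustrate the request (a sketch; variable names and cutoffs are made up), this is the nested cond() a case() function would replace, alongside the slower -replace if- version:

                  Code:
                  * nested cond(): hard to read beyond a few levels
                  gen grp = cond(x < 10, 1, cond(x < 20, 2, cond(x < 30, 3, 4)))

                  * repeated replace if: each line rescans the whole data set
                  * (missing-value handling omitted for brevity)
                  gen grp2 = .
                  replace grp2 = 1 if x < 10
                  replace grp2 = 2 if x >= 10 & x < 20
                  replace grp2 = 3 if x >= 20 & x < 30
                  replace grp2 = 4 if x >= 30 & !missing(x)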



                  • #54
                    I would really like a few simple things.
                    1. A vertical select in the do file editor
                    2. A do-file editor that is divorced from the main package, so when Stata takes a nose dive, it doesn't take your code with it.

                    I basically want Crimson Editor to work with Stata.



                    • #55
                      +1 for Laszlo's suggestion of a case() function!



                      • #56
                        One other thing: could -margins- and (some) other factor-variable magic work with predefined interaction terms? Stata is lovely in that it takes care of interactions for us (though numerical derivatives are painfully slow sometimes), but repeated use of interaction terms in various models needlessly repeats the generation of those temporary variables, unless I am missing something. Why can't I save/define the interaction term and use it repeatedly, without fooling -margins- into thinking it's a separate variable?
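                        A sketch of the contrast being described (the model and variable names are illustrative): factor-variable notation lets -margins- do the right thing but rebuilds the interaction on every run, while a pre-generated product variable is reusable but breaks -margins-.

                        Code:
                        * factor-variable route: margins is correct, temp vars rebuilt each time
                        logit y c.age##c.income
                        margins, dydx(age)

                        * manual route: fast to reuse, but margins now treats age_inc as a
                        * separate variable and gets the marginal effect of age wrong
                        gen age_inc = age * income
                        logit y age income age_inc
                        margins, dydx(age)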



                        • #57
                          And one more: if StataCorp does not feel like their comparative advantage is in reinventing the wheel and writing up regex in C etc., they could invest in fast, easy, efficient back-and-forth with other environments. Python comes to mind, as it is free, and much of it a complement to Stata, not a substitute (unlike with R or Matlab or SAS etc.). It would be great to manipulate variables with essentially Python, esp. strings (and now with strL…).



                          • #58
                            I would like to see consistent application of date literals. For example, to import data between certain dates from a Haver database (an external data source), the "fin" option requires date literals in the desired frequency format, like this:

                            Code:
                            import haver gdph@usna, fin(2000q4, 2010q4) tvar(date)
                            I can't pass in date literals using "tq", like this:

                            Code:
                            import haver gdph@usna, fin(tq(2000q4), tq(2010q4)) tvar(date)
                            which would greatly simplify do files that use local macros with date literals as the parameters.

                            To work with a subset of the data in an "if" expression, I can use either of these expressions:

                            Code:
                            list if tin(2002q4, 2005q4)
                            list if date >= tq(2002q4) & date <= tq(2005q4)
                            but not this:

                            Code:
                            local start_date = yq(2002, 4)
                            local end_date   = yq(2005, 4)
                            list if tin(`start_date', `end_date')
                            or this:

                            Code:
                            list if date >= 2002q4 & date <= 2005q4
                            I don't understand the discrepancy here, especially the limitations on "tin": -tin()- already reads from tsset to find the name of the time variable, so ideally it would also read the frequency and interpret integers it is passed in the context of that frequency. For the Haver command, a separate option could be added that takes a date frequency, e.g. as in tsset, with numerics that are passed in interpreted in the context of that frequency (%tq, %tm, etc.).
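                            For what it's worth, a sketch of a workaround that does work with macros today (the dates are illustrative): expand the macros inside a plain comparison instead of inside tin().

                            Code:
                            local start_date = tq(2002q4)
                            local end_date   = tq(2005q4)
                            list if date >= `start_date' & date <= `end_date'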

                            Thank you,
                            Michael Anbar
                            Last edited by Michael Anbar; 15 Aug 2014, 12:20.



                            • #59
                              Richard Williams wrote "* Also, I understand that some other programs (e.g. MPLUS) let you specify auxiliary variables that help improve the handling of missing data."

                              To add auxiliary variables, all you need to do is get them into the covariance matrix you are analyzing. One way to do this is to add a line to the sem command, as shown below:

                              Code:
                              sem (bmi <- age numberchildren incomeln educ quickfood) ///
                                  (<- gender minority alienation), ///
                                  method(mlmv) standardized
                              estat eqgof

                              The (<- gender minority alienation) line adds three auxiliary variables.



                              • #60
                                Also, here is a thought, something different: as I know many people struggling with Stata on big data, and memory is *still* not unlimited, I could imagine an option for -do-, or at least a tool in the do-file editor, for the cases when the code ends with -keep- before a save (or a new -keep- option for save). For these cases, Stata would offer to quickly analyze which variables are loaded and used (-describe using- is very fast, one of the very few things Stata's limited indexing allows without holding the entire data set in memory), then generated and used, and finally kept. Even for varlists this is doable quickly. Everything that is not kept in the end could be dropped after its last use, or, if never used, not loaded (or merged, etc.). This might be a bit lazy as garbage collection goes, but it would save tremendous resources for teams working on big data sets who just cannot (or do not) keep track of specifying variables for each use and merge, or meticulously -drop- all the time (saving memory for the ensuing operations). Especially since one has to re-edit these -drop-s every time the code changes…
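                                For context, a sketch of the manual version of what this would automate (file and variable names are illustrative): Stata can already inspect a file's header and load only selected variables and observations.

                                Code:
                                * fast: reads only the header, not the data
                                describe using bigfile

                                * load only what the rest of the do-file will keep
                                use id income year using bigfile if year == 2010, clear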

                                Oh, while I'm at it, it is crazy that -save- cannot just save a subset of observations even in version 14, 30 years after the first release. -preserve-keep-save-restore- is needlessly costly with big data. Preparing clean extracts can be a nightmare.

                                Oh, and please allow merging/joining on variables having different names in the two datasets. It is a bit dangerous, but caveat emptor. As it stands, it is equally messy (equally dangerous?) to temporarily rename variables in either the master or the using data set - a rename one can forget to undo.
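                                A sketch of the rename dance being complained about (variable and file names are illustrative):

                                Code:
                                * master has person_id, the using file has id
                                rename person_id id
                                merge 1:1 id using other_data
                                rename id person_id   // easy to forget this step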

