
  • László Sándor
    replied
    Originally posted by David Muhlestein:
    • Use data but only import a subset of variables - ex: use xxx.dta, keep(x y z)
    Note that this is already possible with -use varlist using filename-. One of the first and biggest lessons of http://www.nber.org/stata/efficient/. What would be new is the ability to rename variables on the fly (more important for -merge-), which does not require the values to be sitting in memory, or any automatic selection of the variables that are actually used and kept (not dropped) later. I made a related suggestion on that.
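    For concreteness, a minimal sketch of that syntax (the filename and variable names are hypothetical):

    Code:
    * load only x, y, and z; the other variables never enter memory
    use x y z using mydata.dta, clear
    * an -if- qualifier can also restrict observations on the way in
    use x y z if x > 2 using mydata.dta, clear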



  • David Muhlestein
    replied
    I appreciate all the comments that have already been made. Some of these suggestions are duplicates, in which case consider them seconded; others haven't been mentioned yet. They are not listed in any order:
    • Interactive output (ex: hover over truncated variable name and see full variable name and see the label)
    • Longer variable names and macro names - it may be cumbersome for some output, but that's the risk of those who define them
    • Unicode support
    • Parallel processing for large datasets (ex: George Vega Yon's parallel command is a good starting point, but getting this officially supported would be great)
    • Update to the FAQ that provides comparisons of related commands and which ones are faster
    • Increased tab size (no more "too many values" errors)
    • Preserve labels and notes following collapse
    • Transparent option with graph output
    • Quietly graph and export images as pngs (see http://www.stata.com/statalist/archive/2012-04/msg00709.html)
    • Estimated time to completion of commands - have a minimum time that it doesn't show up (such as anything less than an estimated 2 minutes won't provide an estimate), but if I begin calculating something large I'd like to know if it's estimated to take 12 minutes or 12 hours; I know there are lots of challenges with this, but even a warning that suggests a command may take a long time is helpful
    • Built-in support for JSON data
    • Formal support for spmap or, preferably, improved mapping support that will allow for map layer creation
    • Support for larger data sets (everything Jeph Herrin and László said)
    • More intuitive way of bringing in relational databases - maybe storing multiple relational databases in memory
    • Save a subset of observations - ex: save xxx if x>2
    • Save a random sample of observations - ex: save xxx, sample(.1)
    • Use data but only import a subset of variables - ex: use xxx.dta, keep(x y z)
    • Currency or financial formatting - think of having dollar values on the y axis, or being able to output to an Excel file already formatted as currency
    • Better integration with Excel - it would be great to be able to assign formatting to cells and worksheets, and to insert Excel charts from Stata (such as inserting the data on worksheet 1 and then referencing that data to create a bar chart on worksheet 2); the majority of the business world runs on MS Office, and automating the output would save a lot of manual cleanup of generated spreadsheets
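    Some of the -save- wishes above can be approximated today with -preserve-/-restore-, at the cost of holding an extra copy in memory (filenames hypothetical):

    Code:
    * stand-in for -save xxx if x>2-
    preserve
    keep if x > 2
    save subset.dta, replace
    restore
    * stand-in for -save xxx, sample(.1)-: keep a 10% random sample
    preserve
    sample 10
    save sample10.dta, replace
    restore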



  • László Sándor
    replied
    Originally posted by László:
    Other little things:

    Multiple variables to absorb with -areg-.
    Multiple variables to cluster by/on. (Which can be very slow without a neat C implementation.)
    Detrending in -xtreg- or -areg-, i.e. actually allowing group-level trends/coefficients without blowing up -regress- with i.group##c.time. (There is a reason why -xtreg- and -areg- are orders of magnitude faster.)
    Note that -reghdfe- on SSC seems to go a long way on the first and last points. If it's still not as fast as (reasonably) possible, StataCorp should take this on and build the improved version in. If Sergio (Correia) came close to the efficiency frontier, all the more reason to incorporate this into version 14. Way too many processor cycles and PhD days are wasted waiting for these models to be estimated. (Or they are just never attempted unless a referee is adamant about another robustness check.)

    By the way, I am not sure I see the reason why -xtreg, fe- should be three times slower than -areg-, and even -areg- only half as fast as -_robust, absorb()-. Surely some flexibility is built into the more generic commands, but I don't think the extra parsing and eclass posting caused these speed differences (on 64 cores, so the more complex commands are not better parallelized). As panel methods are a major selling point of Stata, maybe -xtreg- and -areg- could be faster still, and offer multiple fixed effects. (And also multiple variables to cluster on.)

    Code:
    clear all
    set obs 100000000
    mata:
    idx = st_addvar("double",("x1","x2","x3","x4","x5","x6","x7","x8","x9","x10","x11","x12","x13","x14","x15","x16","x17","x18","x19"),1)
    V = J(0,0,.)
    st_view(V,.,idx)
    V[.,.] = runiform(100000000,19)
    end
    g long id = floor(_n/10)
    g byte time = mod(_n,10)
    timer on 1
    _regress x1 x2-x19, absorb(id)
    timer off 1
    timer on 2
    areg x1 x2-x19, absorb(id)
    timer off 2
    xtset id time
    timer on 3
    xtreg x1 x2-x19, fe
    timer off 3
    timer list
    exit
    Code:
    . timer list
       1:    103.89 /        1 =     103.8850
       2:    316.13 /        1 =     316.1350
       3:   1116.49 /        1 =    1116.4890



  • Michael Anbar
    replied
    I would also love to see Stata have a built-in driver for SQLite (http://en.wikipedia.org/wiki/Sqlite). This would be useful for institutions and organizations (like mine) that use SQLite for data storage and processing but would like to interface with Stata directly. Whether or not this driver was implemented through ODBC probably wouldn't make much difference to the end user. I started thinking about this because SAS has the PROC SQL procedure, which allows you to use SQL syntax with a dataset; that would be a great asset for Stata to have. PROC SQL is a slightly different issue than SQLite, but it would be convenient, at least for many of the people I work with, to have a quick interface in Stata for reading from SQLite databases. SQLite is a small C library, and many languages, e.g. Python, have it built in, so I doubt it would add much in terms of disk space.
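    In the meantime, SQLite can already be reached from Stata through -odbc- if a third-party SQLite ODBC driver is installed; a hedged sketch (the DSN and table names here are hypothetical):

    Code:
    * show the data sources the driver manager knows about
    odbc list
    * load an entire table from a configured SQLite DSN
    odbc load, table("patients") dsn("mysqlite") clear
    * or run SQL directly, PROC SQL-style
    odbc load, exec("SELECT id, age FROM patients WHERE age > 65") dsn("mysqlite") clear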



  • Michael Anbar
    replied
    Originally posted by László:

    Note that F(-1) or L(-1) does not work in expressions but does work in varlists. You can run -regress y F(-1).y-, but -g t = F(-1).y- indeed gives an "unknown function ()" error. Strange, unfortunate, and inconsistent to my eye.
    Yes, this is quite inconsistent, and I think it is something that StataCorp could work on to take market share away from RATS, R, and other programs that are consistent in how they approach time series. See my previous post on page 4 about how certain Stata commands accept date literals like 2000q4, while others require integers of the form tq(2000q4). This makes for needlessly verbose syntax.
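    To illustrate the inconsistency being described (variable names hypothetical; assumes the data are -tsset- with a quarterly variable qdate):

    Code:
    * -tin()- accepts date literals directly
    regress y x if tin(2000q1, 2000q4)
    * ordinary expressions need the tq() pseudofunction instead
    generate byte late = qdate >= tq(2000q4)
    * and lead operators work in varlists but not in expressions
    regress y F(-1).y      // works
    generate t = F(-1).y   // "unknown function" error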



  • Nick Cox
    replied
    Stas: findname (SJ) has functionality for finding variables based on their names and/or the contents of their variable and value labels.



  • skolenik
    replied
    lookfor is begging for a valuelabels option that would search the text of value labels along with the variable labels it searches now. More advanced search capabilities, such as regular expressions, would also be highly appreciated. In not-so-well-documented data sets with hundreds of variables, I *know* the variable gender *must* be there, but in the latest incarnation I faced, it was QB10, whose variable label contained the (truncated) question text "Because it is sometimes difficult to determine over the phone, I am asked" (the rest was cut off, continuing along the lines of "to verify if you are...") and whose category labels were "Male" and "Female". There is no way on Earth I could have found that in the data set by itself with the existing lookfor capabilities, although lookfor male, valuelabels would have found it.
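    Until such an option exists, one workaround is to loop over the variables and print the text of each attached value label, then search the log; a sketch:

    Code:
    * dump every value label's text so the log can be searched for "male"
    foreach v of varlist _all {
        local lbl : value label `v'
        if "`lbl'" != "" {
            display as text "`v' uses value label `lbl'"
            label list `lbl'
        }
    }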



  • Erika Kociolek
    replied
    I'd like to see a variable format for percents, similar to what is available for commas. I would also like a more straightforward way to add information to graphs that isn't necessarily what is being shown in the graph itself (see the link attached to this post). Having graph schemes that are cleaner and a bit more modern-looking would be helpful. I agree with the many comments about generating output that can be easily dropped into Word or other programs without too much reformatting.



  • Attaullah Shah
    replied
    Thanks, Maarten Buis; yes, I do use compress often. Can you please elaborate on what you specifically mean by "gain memory"? Are you talking about increasing the RAM?



  • Maarten Buis
    replied
    Attaullah Shah: if you work with such large datasets, running compress on all variables would be a good idea anyhow. That command is explicitly designed so that you don't lose any information but, where possible, gain memory. In large datasets that gain can be substantial, at no cost other than the time it takes to run compress.
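    A minimal illustration of what -compress- does, using a shipped example dataset:

    Code:
    sysuse auto, clear
    * store an integer-valued variable wastefully as a double
    generate double weight2 = weight
    * compress demotes each variable to the smallest lossless storage type
    compress
    describe weight2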



  • Attaullah Shah
    replied
    Rich Goldstein and Sarah Edgington, thanks for your responses. I am using Stata 13.1 SE; still, I have to delete observations or compress variables, otherwise Stata will warn that the current memory is not enough. I have 4 GB of RAM.
    Regards
    Attaullah Shah



  • Jeff Wooldridge
    replied
    My wish list for Stata 14 is pretty modest. I'd like all panel data commands to support cluster-robust variance matrix estimators. Currently, xthtaylor and xtivreg do not allow this. Thus, when one computes standard errors to compare them with, say, output from xtreg, the standard errors are not comparable. Sure, one can bootstrap to obtain cluster-robust inference, but there's no reason one should have to. The analytical formulas are simple. Someone at Stata could add these features in less than a day.

    Oh, and while the user-written command xtivreg2 allows clustering, it does not allow a random effects option. I was pleased that xtmixed now allows a cluster option, which makes it somewhat puzzling that some more basic commands, such as xtivreg, do not.



  • Sarah Edgington
    replied
    What version of Stata are you currently using? In recent versions there is no reason to use the set memory command because memory management is handled automatically.

    Memory for current versions of Stata is limited by what the operating system will provide. As for the maxvar and matsize considerations, I'm skeptical of any data management task that really involves more than 32,000 separate variables. If I actually had data with that many variables, I'd almost certainly want to work with it in smaller subsets to increase efficiency regardless of Stata's theoretical limits.



  • Attaullah Shah
    replied
    Rich Goldstein, what I meant by the 500 MB limit were the Stata settings set memory 50000, set maxvar 32000, and set matsize 11000.



  • Rich Goldstein
    replied
    Attaullah, I'm not sure where you got the 500 MB limit, but that is not Stata's limit; e.g., I have 32 GB of RAM and have often loaded files larger than 25 GB.

