Wish list for Stata 14

László Sándor replied

25 Sep 2014, 18:42
Small thing, but maybe easy to add: why doesn't -collapse- accept stubs for newvarnames? Right now you can either pass varlists, in which case each variable can appear only once (though you can still take means of some variables and sums of others, of course) and none is renamed during the collapse, or you can generate several aggregates of the same variable but must specify them one by one just to give each a name.
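A minimal sketch of a workaround under current syntax: build the newvar=var pairs in a loop so that each aggregate gets a stub-style name. The stubs mean_ and sum_ and the choice of the shipped auto dataset are assumptions made for illustration only.

```stata
* Illustration only: dataset, variable list, and stubs are arbitrary choices
sysuse auto, clear
local vars price mpg weight
local meanlist
local sumlist
foreach v of local vars {
    * accumulate newvar=var pairs, e.g. mean_price=price
    local meanlist `meanlist' mean_`v'=`v'
    local sumlist  `sumlist' sum_`v'=`v'
}
* each variable is aggregated twice, under two different names
collapse (mean) `meanlist' (sum) `sumlist', by(foreign)
describe
```

This keeps the one-by-one naming that -collapse- requires, but the list is generated rather than typed.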
Michael Anbar replied

23 Sep 2014, 17:33
Originally posted by László Sándor:

This might be related or unrelated, but there seem to be more and more features of Stata (factor variables, large-N small-T panels etc.) which would benefit greatly from sparse matrices in Mata. One wonders how hard it is to add.

Separate memory spaces came up before, but note that the huge costs of sorting and of preserving and restoring data with many covariates (esp. if unused in a line) or irrelevant observations come from the fact that the rest of the big data is also moved in memory needlessly.

I'll second this. The lack of sparse matrices is one of the major impediments (but far from the only one) to my colleagues and me using Mata for a wider array of tasks. Mata's requirement that all matrices be full matrices, regardless of how sparse they actually are, imposes needlessly high memory requirements for storing certain matrices and imposes upper bounds on matrix size that (for certain problems) are far below those of competitors that do support sparse matrices, e.g. MATLAB, Python, etc.
ben earnhart replied

22 Sep 2014, 15:52
I wish there were an option for -egen rowmean()- to compute a value only if there is at least a certain number of non-missing values. For example, when computing a scale with 15 items, I'd only like to compute it for those with ten or more valid values. It is not that big a deal to use -egen rowmiss()- or -egen rownonmiss()- and -replace- afterwards to clean things up, but it's a common enough situation that I'd like to do it with a single command instead of three.
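For reference, the multi-command workaround described above might look like the following sketch. The item names item1 to item15, the simulated data, and the names scale and nvalid are assumptions made for illustration.

```stata
* Illustration only: simulate 15 items with some missing values
clear
set obs 100
set seed 12345
forvalues i = 1/15 {
    gen item`i' = cond(runiform() < 0.3, ., ceil(5*runiform()))
}

* Row mean over whatever items are non-missing
egen double scale = rowmean(item1-item15)
* Number of non-missing items per observation
egen byte nvalid = rownonmiss(item1-item15)
* Keep the scale only where ten or more items are valid
replace scale = . if nvalid < 10
drop nvalid
```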
László Sándor replied

22 Sep 2014, 11:35
As Stata is sorting so much in the background, it'd be great to have a more flexible -sort-. Basically, if my command needs sorted data but was called only on a subsample, I would want an option not to sort the data that will never be used. If the sortedby local needs to carry around a second term to remember that the data is sorted on the given variables only in a subsample defined by another variable, so be it. As -sort- already allows [in], maybe the system local is already robust to this…

Or would you achieve this now in two steps?

Code:

sort `touse'
count if `touse'
sort `touse' sortvar in `=_N-`r(N)'+1'/`=_N'

It is still clumsy to use the end of my data rather than the beginning… So maybe it is not that costly to generate another tempvar:

Code:

gen byte `invtouse' = 1-`touse'
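A minimal sketch of that variant, so that the subsample sits at the start of the data. It uses the shipped auto dataset and ordinary (not temporary) variables named touse and invtouse purely for illustration; inside a program these would be tempvars.

```stata
* Illustration only: foreign cars play the role of the estimation sample
sysuse auto, clear
gen byte touse = foreign == 1
gen byte invtouse = 1 - touse

* Bring the subsample (invtouse == 0) to the top of the data
sort invtouse
count if touse
* Sort only the first r(N) observations on the variable of interest;
* this assumes r(N) > 0, since "in 1/0" would be an invalid range
sort invtouse price in 1/`r(N)'
```

Observations outside the range keep their positions, so only the subsample is ordered on price.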

By the way, if [in] is much, much faster than [if], why do most of our commands carry around "if `touse'" instead of quickly (?) sorting on `touse' and keeping track of the start and the end of the data to use with in?

Sorting on a binary variable should be much faster than O(_N log _N), of course, as the order is calculable (i.e. count if `touse' produces a sufficient statistic, r(N)) and can then be imposed. Why can't we specify for -sort- whether the sortvar is binary (or categorical) and not continuous?
Last edited by László Sándor; 22 Sep 2014, 12:04.
László Sándor replied

19 Sep 2014, 11:18
I am also annoyed by Stata raising an error (and thus crashing the job) when time-series operators are used on xt data that is no longer sorted on panelid and time. See the example below, with or without commenting out (the second) -xtset-.

I am not sure I see why the current behavior is preferred to re-sorting and proceeding. If that is sometimes too costly to be the default (i.e. users might want to be informed that their jobs are inefficient and slow), I would at least welcome a switch in -set- to turn such behavior on.

I see no apparent logic in which commands override previously xtset data, and I am a reasonably savvy user. I think it is bad if Stata complains about the sort order when the data should still be xtset.

Code:

clear all
cd ~/Downloads
set obs 1000
mata:
    // Store data directly with st_store()
    st_store(.,st_addvar("double",("x1","x2"),1),runiform(st_nobs(),2))
end
g long id = floor(_n/10)
g byte time = mod(_n,10)
xtset id time
g l1x1 = l.x1
sort x2
xtset
g l1x1_2 = l.x1
exit
Last edited by László Sándor; 19 Sep 2014, 12:02. Reason: giving an example
Rasool Bux replied

19 Sep 2014, 00:54
I wish that in Stata 14 relational data files could be opened with the -use- command, or that variables could be read from different files based on a key variable.
László Sándor replied

15 Sep 2014, 15:57
I know not all topics qualify as "wishes for Stata 14," even if all quirks and questions could potentially be relevant for an improvement, "fix," etc., but I still mention this here: I have a hard time understanding why Stata uses so much memory for -use- or -merge- operations that are restricted to a small subset of observations or variables. Loading two variables "in 1/1000" from my data of 150 million observations, with a total size of 40 GB (counting all the other variables), still takes minutes and (temporarily) many GB of RAM. Even without any indexing and database computer science wizardry (raised under this topic before), I consider this very poor form. Isn't this "fixable in Stata 14" even within Stata's current memory model and file format?
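For readers unfamiliar with the syntax in question, a call of the kind described looks like the sketch below. The small auto dataset saved to a temporary file merely stands in for the large file, so this shows the syntax only and does not reproduce the memory behavior.

```stata
* Illustration only: a small stand-in for the large dataset
sysuse auto, clear
tempfile big
save `big'

* Load two variables and a range of observations from the file on disk
use make price in 1/10 using `big', clear
describe, short
```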
Michael Anbar replied

12 Sep 2014, 14:56
I'll add another feature that would be very useful, although I've seen related requests before: an interactive debugger for Mata. I'm not just talking about -set trace-, -pause-, etc.; I mean the kind of debugger that modern programming environments provide, e.g. MATLAB, RStudio, various Python IDEs. The ability to set breakpoints in code and step through it line by line is SORELY missing from Stata and Mata, but especially Mata.

This is another feature that could significantly improve Stata's market share. I often need to write functionality in Stata that uses matrix operations, but the lack of real debugging tools means I almost have to use MATLAB, Python, etc. for such tasks (people who have used modern programming environments realize how primitive Stata's programming environment really is), at which point I usually end up doing all of my analysis in those languages instead of Stata.
Clyde Schechter replied

11 Sep 2014, 12:15
Thanks, Philip. This looks great. Can't wait to try it!
Philip Jones replied

11 Sep 2014, 11:39
Clyde's suggestion of a wrapper program was a good one.

I have done so and the code is below. Just save as -csti.ado- in your personal folder and it should work.

I re-arranged the default entry order for -csi- because for me it makes more sense to enter the numbers as:
EVENTS_group1 TOTAL_group1 EVENTS_group2 TOTAL_group2
rather than the default.

For example, if 23/130 patients died in the Intervention group while 13/127 patients died in the control group, you would type:

Code:

csti 23 130 13 127

All options for csi will be passed along and will work. All of the -csi- "r" results are returned.

There is no error checking.

Comments and improvements most welcome!

If people find it useful, I can make a short help file and upload to SSC. Just let me know.

I hope this is helpful for someone.

Phil

Code:

*! version 1.0.7 12sep2014 \ Philip M Jones, [email protected]
/* csti.ado: Wrapper program for csi to use total number of patients. */
/* Example: for 23 events in 130 patients in one group, 13 events in 127 patients in another group */
/* "csti 23 130 13 127" */
/* all options for -csi- will continue to work as they are passed along */
capture program drop csti
program define csti
    version 13
    syntax anything [, *]
    * input order: events1 total1 events2 total2
    tokenize `anything'
    local n1 `1'
    local N1 `2'
    local n2 `3'
    local N2 `4'
    * non-events = total - events
    local N_1 = `N1' - `n1'
    local N_2 = `N2' - `n2'
    * csi expects: cases1 cases2 noncases1 noncases2
    csi `n1' `n2' `N_1' `N_2', `options'
end
Phil Clayton replied

10 Sep 2014, 19:42
I agree, this would be useful.

Along with my previously documented list (http://www.stata.com/statalist/archi.../msg00083.html) I would like to suggest the removal of a feature: the ability to merge m:m. This is a very confusing and, as far as I can tell, totally useless feature. I am quite confident that users have gotten incorrect results without realising it by using this "feature". Users should be pointed to joinby instead. The m:m "feature" could be retained under version control for diehards and people trying to understand why they can't replicate earlier erroneous (!) analyses.
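To make the contrast concrete, here is a small sketch on made-up data (the datasets, the key id, and the variables x and y are all hypothetical). -joinby- forms every pairwise combination of observations within each value of the key.

```stata
* Illustration only: two toy datasets with a non-unique key
clear
input id x
1 10
1 11
2 20
end
tempfile left
save `left'

clear
input id y
1 100
1 101
2 200
end

* All pairwise combinations within id: 2 x 2 for id 1, 1 x 1 for id 2
joinby id using `left'
assert _N == 5
list, sepby(id)
```

On the same data, -merge m:m id- would instead pair observations within id by their order and return only 3 observations, which is rarely what is intended.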
Clyde Schechter replied

10 Sep 2014, 09:03
Re: Philip Jones' wish: +1. This problem can be solved by writing a wrapper program for csi. It comes up often enough to be annoying, but not often enough to motivate me to actually write that wrapper.
Andrew Lover replied

10 Sep 2014, 06:13
Philip, that's a fantastic suggestion! I always end up either using -expand- or adding frequency weights; it's admittedly a minor issue, but annoying nonetheless.

Philip Jones replied

10 Sep 2014, 06:06

My "wish" is in the context of medicine (although this "wish" would apply to any field), for binary outcomes such as mortality. In medical journals, this type of data is typically displayed in a Table as the number of events over the number of patients in the group, i.e. "n/N" (example: 23/130 patients died in the Intervention group while 13/130 patients died in the control group). I frequently need to verify P values, risk ratios, CIs, etc. For this, I frequently use the -csi- command.

Right now, to use Stata's contingency table commands (such as -csi-), I must calculate the number of non-events in each group as the difference between the denominator and the numerator. While this is not particularly difficult, it is prone to simple math errors, and requires more effort than I would like, especially for many outcomes.

I "wish" I could enter -csi-like commands using events and total numbers, rather than events and non-events.

Example (using the above numbers) of how I enter the command currently:

Code:

csi 23 13 107 117

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        23          13  |         36
        Noncases |       107         117  |        224
-----------------+------------------------+------------
           Total |       130         130  |        260
                 |                        |
            Risk |  .1769231          .1  |   .1384615
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0769231       |   -.0065187    .1603649 
      Risk ratio |         1.769231       |    .9374361    3.339083 
 Attr. frac. ex. |         .4347826       |   -.0667393    .7005166 
 Attr. frac. pop |         .2777778       |
                 +-------------------------------------------------
                               chi2(1) =     3.22  Pr>chi2 = 0.0726

Example of how I would like to enter the data: (I am inventing the option 'total' here, but it could be called anything):

Code:

csi 23 13 130 130, total

                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+------------
           Cases |        23          13  |         36
        Noncases |       107         117  |        224
-----------------+------------------------+------------
           Total |       130         130  |        260
                 |                        |
            Risk |  .1769231          .1  |   .1384615
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .0769231       |   -.0065187    .1603649 
      Risk ratio |         1.769231       |    .9374361    3.339083 
 Attr. frac. ex. |         .4347826       |   -.0667393    .7005166 
 Attr. frac. pop |         .2777778       |
                 +-------------------------------------------------
                               chi2(1) =     3.22  Pr>chi2 = 0.0726

Anybody else wish this? Or, am I missing something and can Stata already do this?
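Absent such an option, one way to avoid doing the subtraction by hand is to let Stata evaluate it inline when the command is parsed. This is only a sketch using the numbers from the example above; the local macro names are arbitrary.

```stata
* Inline expressions compute the non-events (130-23 = 107, 130-13 = 117)
csi 23 13 `=130-23' `=130-13'

* The same idea with the counts held in (arbitrarily named) local macros
local n1 23
local N1 130
local n2 13
local N2 130
csi `n1' `n2' `=`N1'-`n1'' `=`N2'-`n2''
```

Both calls are equivalent to typing csi 23 13 107 117.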
