  • Nick Cox
    replied
    Easier debugging is a common request. In abstraction, I imagine we all agree.

    But I disagree that existing error messages are "too general to be useful". That in turn is too exaggerated to be helpful. I benefit from error messages all the time.

    What is trickier here is to move towards the program being smart enough to tell you what you should have written. That is a very difficult, ultimately impossible, goal.

    Longer error messages would not necessarily be more helpful. If they appeared by default they would more often be irritating than helpful. Did you know that you can click on an error code to see a longer message?

    Knowing where an error occurred is indeed a key part of debugging. Did you know about -set trace-? A common complaint is that it produces far too much output. The common request that error messages be of the form

    error on line 19 of program foo
    called at line 42 of program bar
    ....
    called at line 666 of program myprog

    is, I understand, on StataCorp's long-term to do list in some form or another. I understand it's trickier than one imagines for reasons that depend on Stata's internals.
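    For what it's worth, the output from -set trace- can be kept manageable by limiting the tracing depth; a minimal sketch (the program name myprog is hypothetical):

    Code:
    set tracedepth 1    // trace only top-level calls
    set trace on
    myprog              // hypothetical program being debugged
    set trace off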

    P.S. Please see FAQ Advice Section 18.



  • Imed Limam
    replied
    Contributing to this very interesting debate, I hope that the error messages become more explicit and helpful so that debugging becomes easier. r(????) messages are too general to be useful, and do not point to where the error took place. Work on this part of Stata would save a great deal of effort and time.



  • Richard Williams
    replied
    Originally posted by Alan Neustadtl View Post
    How about factor variables for the left hand side of models. Something like:
    Code:
    logistic i.sex i.chd c.income
    Best,
    Alan
    I suppose that would be mildly advantageous if your response variable is coded 1, 2 rather than 0, 1. It would be a disaster if your response variable were coded 0 plus zillions of distinct nonzero values, because each of those nonzero values would get treated as a unique level rather than as 1. So I am inclined to think it wouldn't be a good idea, though I suppose it wouldn't matter that much.



  • Clyde Schechter
    replied
    I understand Joseph Coveney's response: if you are writing your own estimation procedure, you can have factor variables on the left side if you want to.

    But I don't get Alan's original question and his example. Just what would -logit i.sex i.chd c.income- mean? Logistic regression implies that the dependent variable is not only categorical, but specifically a dichotomy. And if you wrote -regress i.something i.predictor c.other_predictor-, what would you want regress to do? It seems to me that all of the built-in estimation commands uniquely determine whether their dependent variables are categorical or not. Perhaps the exception is Poisson which will accept (and use as continuous) a continuous outcome variable even though it is nominally (no pun intended) a procedure for estimating count variables.



  • Joseph Coveney
    replied
    Originally posted by Alan Neustadtl View Post
    How about factor variables for the left hand side of models. Something like:
    Code:
    logistic i.sex i.chd c.income
    Best,
    Alan
    Stata actually doesn't prohibit using factor variables on the left-hand side. It's your option, as the author of an estimation command, to have your command forbid factor variables there. It's done with the _fv_check_depvar command.
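    A minimal sketch of how that check appears in a user-written estimation command (the command and variable names are hypothetical):

    Code:
    program define myest, eclass
        syntax varlist(fv) [if] [in]
        gettoken depvar indepvars : varlist
        // exits with an error if depvar carries factor-variable operators
        _fv_check_depvar `depvar'
        // ... estimation logic here ...
    end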



  • Alan Neustadtl
    replied
    How about factor variables for the left hand side of models. Something like:
    Code:
    logistic i.sex i.chd c.income
    Best,
    Alan



  • László Sándor
    replied
    MathWorks is now marketing MapReduce on the desktop and MapReduce on Hadoop for Matlab. If only something like it were easy for StataCorp to do too. http://www.mathworks.com/discovery/m...ce-hadoop.html
    Last edited by László Sándor; 03 Oct 2014, 20:23.



  • Phil Bromiley
    replied
    I would like a Stata output procedure that, like outreg2, wrote to Word files. It would also be nice if we had an easy way to export formatted correlation matrices to Word.



  • László Sándor
    replied
    Originally posted by Sergiy Radyakin View Post
    Tabulation on half data takes half as much time as on full data (provided some uniformity assumption about how unique values are distributed across the data). So two processors should produce the list of unique values twice as fast (considering merging the two lists negligible compared to the task of looking through the data).
    Another suggestion for version 14, then (I *do* try to keep these posts on-topic), is to increase transparency on MP support. As someone who cajoled his institution into buying two 64-core licenses, I am embarrassed by how hit-and-miss the MP benefits are (or, on the hardware side, requesting eight 8-core chips on a compute node, and the corresponding memory). -tabulate- is indeed much faster than many of its alternatives, but I am still dismayed if it's not parallelized. Yet it is much, much faster than collapsing the relevant data (-preserve- and -restore- are costly on many systems and with larger data) or trying -egen-. I was lazy with -by: egen, mean()- last night and wasted nine hours without -egen- completing. I have no relevant comparison, but -tabulate- can only be faster, though parsing the resulting matrices is a bit clumsy.

    I have no relevant comparison of -levelsof- to -tabulate-.

    If -tabulate- is this much faster, I would like to see -tabulate, summarize()- also produce a matrix for later use. This goes back to an earlier wish on building in a fast version of -binscatter- (and its -fastxtile-).



  • Nick Cox
    replied
    I have a personal perspective on levelsof, as its original author (under different names, but that's not material). Naturally, as it is now an official command, its programming and documentation are totally the responsibility of StataCorp. Equally naturally, my personal reasons for writing it in the first place need not be identical to, or even relevant to, anyone's reasons for using it now.

    But in essence I see two main uses for levelsof, at least as intended.

    The first was as a display command to show which distinct values are present, and to show really concisely, more concisely even than tabulate, which values are, and sometimes which are not, present in the data.

    The second was to provide output of a returned value or equivalently a local macro including a list of those distinct levels for use in looping, especially when there is some irregularity to the distinct levels.
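    That second use is the familiar idiom (the variable name is hypothetical):

    Code:
    levelsof group_id, local(levels)
    foreach l of local levels {
        display "processing level `l'"
    }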

    With any variable with a large number of distinct levels, the benefits in using levelsof for either purpose are likely to be much diminished, to the point that it may be wondered why people are using levelsof at all. Sometimes a display of e.g. hundreds of distinct values can be useful, but not often. If the aim is to loop over the distinct values, there are likely to be better ways to do it, most notably statsby or using egen, group() to construct a looping variable.
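    A sketch of the egen, group() alternative for looping (variable names hypothetical):

    Code:
    egen long g = group(group_id)
    summarize g, meanonly
    forvalues i = 1/`r(max)' {
        // work on observations with g == `i'
    }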



  • Sergiy Radyakin
    replied
    Originally posted by daniel klein View Post
    Although I am not working with huge datasets and hence have never experienced trouble with levelsof, it is obvious that the command could be much faster if it were re-implemented in Mata. I have my own version of levelsof that requires Stata 9.2 (as does the original levelsof). In a fake dataset using 1,000 observations filled with random numbers (runiform()), it is almost 20 times faster. Of course, to put this in perspective, the absolute time for both is (far) below one second on my machine.

    I do not fully get what László wants do do with fvrevar.

    Best
    Daniel
    Daniel, your example is one which is not really levelsof's playground. While technically it works with anything, the optimization kicks in for categorical variables with a relatively small number of categories (relative to the size of the dataset). I have just spent 5 minutes on a totally non-optimized Mata version of levelsof and then half an hour on testing various cases. The only case where it can beat the original levelsof is your example of all-different values. Of course, that could be because of my inefficient approach (see attached log).



    This seems to come from the fact that levelsof sorts the whole dataset, while a fast implementation would sort only the unique values. (If you don't want to read the code, just notice the sortpreserve marker in the declaration.) This is quite obvious, and I don't think a command this useful and basic was overlooked by the developers. In fact, this mode is activated only if a much faster mode of getting the levels with the fast built-in command tabulate fails. It fails if there are too many levels (more than matsize). If your matsize is the default (I guess 400; it may depend on the Stata flavor), then by creating a dataset of 1,000 observations of all-different random numbers you are forcing levelsof into a very special case, where it has to do the job twice (first try the fast method and establish that it doesn't work, then fall back to the alternative).

    Another source of perceived slowness of levelsof is that it serializes the levels into a string (which is totally unnecessary in most of my tasks). String operations are slow. Moreover, getting each level back later would also be slow (with word() or foreach). It is not clear how your procedure reports the results (string, matrix?).

    Directly using
    Code:
    quietly tabulate x, matrow(X)
    should be a good alternative in many cases, when I know the variable is numeric and has few codes.
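    After that call the levels can be read back from the returned matrix; a minimal sketch (the variable name x is hypothetical):

    Code:
    quietly tabulate x, matrow(X)
    local nlev = r(r)               // number of distinct values
    forvalues i = 1/`nlev' {
        display el(X, `i', 1)       // the i-th distinct value of x
    }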

    I am pretty sure that if you can beat tabulate in the above task, StataCorp would be interested to know about your approach. I definitely am, and would like to see (at least a compiled version of) your routine.


    @Laszlo
    I also find -levelsof- woefully slow.
    Compared to what?

    What I find strange, however, is that tabulate is not parallelized. I don't see a reason why not.
    Code:
    . quietly tab occup in 1/22460000
    r; t=1.69 13:38:30
    
    . quietly tab occup in 1/22460000
    r; t=1.71 13:38:39
    
    . quietly tab occup
    r; t=3.40 13:38:45
    
    . quietly tab occup
    r; t=3.33 13:38:50
    Tabulation on half data takes half as much time as on full data (provided some uniformity assumption about how unique values are distributed across the data). So two processors should produce the list of unique values twice as fast (considering merging the two lists negligible compared to the task of looking through the data).

    It should be possible to write a plugin for that, but Stata plugins can't address the Stata dataset from multiple threads (at least they can't write; it's not so clear about reading). So this would not be without a few hurdles.

    Finally, it is sometimes possible to exploit some a priori knowledge about the data to determine the levels (for example, if you know the codes are 1, 2, 3 and the first three observations are 1, 2, 3, you don't need to look through the millions of observations that follow).

    Best regards, Sergiy Radyakin.
    Attached Files



  • daniel klein
    replied
    As interesting as these statistical issues might be, would it not be better to start a new thread and focus on the topic here?

    Perhaps there is no need to go as far as Sergiy suggested, but let's do all of us a favor and keep things where we and others can find them later.

    Best
    Daniel


    Finally, on this thread, may I humbly suggest splitting suggestions from wishes? Some suggestions are actually resolved quickly by other users pointing to already available functionality, but such suggestions really clutter this thread. I tend to think of a wish in this context as something that is not doable by the user in principle, but something that should be relatively easy to do for developers with access to the internals. For example, if the list of variables can be exposed so that the user can pick variable names from it, why not expose the list of globals? The rest ("I wish Stata did my job", or "I wish Stata were smart enough to understand what I want from it") I put into the group of dreams, which is not something worth discussing.

    But I think what could help is some weighting of features (easily done here in the forum with opinion polls), such as "which do you prefer: 3D charts or a Mata debugger?". Both are useful, and maybe even equivalent in man-hours, but the market imho will strongly signal the former, since the latter is interesting only to a few developers. With some other features it is less clear.

    Best regards, Sergiy Radyakin



  • László Sándor
    replied
    Originally posted by Richard Williams View Post
    I used and taught the missing data dummy approach for years. But then Allison showed that it was usually (but not always) worse than doing nothing at all, i.e. using listwise: http://www.amazon.com/Missing-Quanti.../dp/0761916725

    There was a discussion of this some years ago:

    http://www.stata.com/statalist/archi.../msg00024.html

    http://www.stata.com/statalist/archi.../msg00030.html
    These are great points, thanks. Let me note that I could use this when the data is inherently missing, not just unobserved. So it could be faster in Stata. (I am even OK with Stata raising a warning about this, though it rarely does about other dangerous practices.)

    I also wonder how this relates to event studies, where maybe the data is not inherently missing. I mean, if you observe only 2000-2007 for some treatment, then an analysis of outcomes on leads and lags of treatment would necessarily constrain you to a few years in the middle where all leads and lags exist. Are you saying that this approach cannot be sensibly extended to more years, when 2006 could still be used even if I can no longer control for, say, two leads? Then I can only use lags but no leads as controls for any of the other years either? Prudent, though a bit dispiriting. Always better than an invalid analysis, though. Thanks.



  • Andrew Lover
    replied
    Hi Alex,

    Do you know about -winbugsfromstata- (SSC)? It may be somewhat dated, but there's a bit on the web, along with a Stata Journal article (Volume 6, Number 4, pp. 530-549).

    Re: LASSO, while not comprehensive, check out -lars- (SSC) for least-angle methods.



  • Richard Williams
    replied
    I used and taught the missing data dummy approach for years. But then Allison showed that it was usually (but not always) worse than doing nothing at all, i.e. using listwise: http://www.amazon.com/Missing-Quanti.../dp/0761916725

    There was a discussion of this some years ago:

    http://www.stata.com/statalist/archi.../msg00024.html

    http://www.stata.com/statalist/archi.../msg00030.html

