Wish list for Stata 14

Brian Quistorff replied

19 Nov 2014, 16:18
My wish list includes mainly items for reproducing research:
You should be able to download (with -net-/-ssc-) old versions of modules easily. This would probably require the package format to include versions explicitly (e.g. in the pkg file not just as *! comments in the ados) and repositories (SSC) to save old versions.

Primary output files (dtas and gphs) should be able to be reproduced byte-for-byte. Primarily this requires being able to zero-out the timestamps and zero-out any junk padding.

Make PDF and PNG exporting of figures available on console Unix.

Shell commands should work in Windows batch-mode.

Built-in commands should almost always return data in s()/r(), but some return just to the screen.

Also, allow the Windows do-editor to automatically word wrap (this is a main reason why people I know use other editors).
Matthew White replied

17 Nov 2014, 12:07
Another Java wish: it would be great to be able to query whether Stata has or could load a specified class.

I have the following scenario in mind. Let's say I write a Java helper class named Helper and add it to SSC. I then write a new class named MyClass that uses Helper, adding that to SSC. The MyClass package comes with the Stata wrapper program -myclass- so that the user doesn't have to interact with Java in any way.

I don't want to distribute Helper with MyClass, because I don't want to have to update the MyClass package every time I update Helper. Yet if the user downloads MyClass without Helper and tries to run -myclass-, a ClassNotFoundException will be thrown and the user will see a Java stack trace. An error is inevitable here, but I'd like to issue an error message more informative to a non-Java user, something like "SSC package helper required; to install, type ssc install helper". If MyClass could query from Stata whether Helper is loaded, I could do that. I could have MyClass display an error, then return to Stata with a nonzero return code before the exception is thrown.
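
For illustration, here is a minimal plain-Java sketch of the kind of query described above. It probes for a class by name with Class.forName and turns a failure into the friendly message. The names ClassProbe, isLoadable and missingPackageMessage are hypothetical, and whether such a probe sees the same classes inside Stata depends on how Stata sets up its class loaders, which this thread does not establish.

```java
// Hypothetical helper for illustration only; not part of Stata's Java API.
final class ClassProbe {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            // 'false' asks the JVM not to run static initializers during the probe.
            Class.forName(className, false, ClassProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Class is missing, or it was found but could not be linked.
            return false;
        }
    }

    /**
     * Returns a user-facing message if the class is missing, or null if it is present.
     * The package name passed in (e.g. "helper") is an assumption of this example.
     */
    static String missingPackageMessage(String className, String sscPackage) {
        if (isLoadable(className)) {
            return null;
        }
        return "SSC package " + sscPackage + " required; to install, type ssc install " + sscPackage;
    }

    public static void main(String[] args) {
        // "Helper" stands in for the helper class from the scenario above.
        String message = missingPackageMessage("Helper", "helper");
        if (message != null) {
            System.out.println(message);
        }
    }
}
```

A wrapper along these lines could print the message and return a nonzero code before any code path touches Helper directly, so the user never sees the stack trace.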
Nick Cox replied

12 Nov 2014, 13:04
Matthieu:

Anyone wishing to follow your benchmark results would need to install user-written commands you use other than your own, which you name.

I spotted distinct and reg2hdfe; there may be others.

(Please register using your full name, Matthieu Gomez.)
Matthieu Gomez replied

12 Nov 2014, 12:31
I have never participated in Statalist, but I registered after reading this thread. Here is my wish list in order of importance:

- A faster "sort" and "by: egen sum". These functions could be made 10x faster, as shown by the performance of the R packages data.table and dplyr or the Python library pandas. I have attached some benchmarks. Since sort and by: sum are used across a wide range of commands, this would make Stata instantly better with large datasets.
- areg with an arbitrary number of fixed effects, multiple clusters, and IV. Basically what already exists in R with the package lfe (http://cran.r-project.org/web/packages/lfe/index.html)
- A faster csv reader. Again, the Python and R readers (pandas pd.read_csv and data.table fread) are 10x faster than Stata's import delimited.
- New functions gzipuse and gzipsave that would read and write gzip files using a named pipe (for reference: http://www.nber.org/stata/efficient/pipes.html, http://www.stata.com/statalist/archi.../msg00867.html, http://hsphsun3.harvard.edu/cgi-bin/...ticle-183.html)
- An option to preserve / restore in RAM.
- The command append should have an option to coerce variables with conflicting types. In particular, numeric in master vs string in using should coerce everything to strings.
- Remove merge m:m, or at least improve the documentation. The documentation in help merge just says "Many-to-many merge", which is meaningless, if not wrong. The documentation should explain what m:m does (at least as clearly as the former documentation of merge did) and redirect users to joinby. It would also be nice if a new version of joinby could have the same syntax as merge (including default options).
- A lighter set trace on. On error, this mode would print the last command, the stack of function calls leading to it, and all stored macros and scalars.
- No restriction on the length of variable names.

Benchmarks · Stata to R

http://www.princeton.edu

Last edited by Matthieu Gomez; 12 Nov 2014, 12:50.
László Sándor replied

06 Nov 2014, 08:45
I wonder if anyone would revisit -areg-. Though I would not be surprised if it had generated long discussions before, I am just not aware of them.

Basically, I am not sure what to think about the fact that -areg- can (and by default does) predict out of sample, but not the absorbed fixed effects. I understand why the latter is necessary if the absorbed variable is itself missing, but then even the other fitted values are hard to make sense of, unless one has some strong priors that the fixed effects have mean zero when the absorbvar is missing.

In other cases, when I fit the model on a subsample, but all variables are observed out of sample, even the absorbed one, I would love to be able to predict the values incl. "d", the fixed effects.

If it's algorithmically impractical to change the behavior of "predict, d" (i.e. the speed of -areg- comes from transforming only e(sample)), then I would revisit what "predict, xb" defaults to after areg, without a warning.

Though I understand, part of my concern is what "predict, xb" is useful for at all, without also running "predict, d".

By the way, if StataCorp adopts the code from -reghdfe-, its similar behavior might also be revisited.
Matthew White replied

31 Oct 2014, 16:38
I've found Stata 13's introduction of Java plug-ins very useful. Here are some notes from my experience...

From [P] java (my emphasis):

When a programmer is developing and testing a Java program, it is important to understand when
the JRE is loaded and its effect.

The JRE loads the first time that it is needed. That can happen if internal Stata functionality requires
Java or if Java is needed for some user-written command. Java’s classpath is set when the JRE is
loaded, and it cannot be modified afterward (that is, modifying the ado-path after the JRE has loaded
will not change the classpath). For the end user who is consuming a completed Java plugin, the
process of how Java plugins are loaded is not important because it happens transparently. However,
for the programmer who is modifying and testing code, it is very important to understand the process.

Assume you are implementing a Java method named mymethod(). You have compiled it, placed
the class or JAR file in the correct location, and call it for the first time using javacall. Perhaps it
executes correctly, but you want to make some modification. You edit the source code, compile it,
and copy it to the correct location. If you are using the same Stata session, your changes will not be
reflected when you call it again. To reload a Java plugin, Stata must be restarted.

When writing anything but the simplest Java classes, I find myself restarting Stata frequently, which is cumbersome and slows down development. Part of the reason for this is my profile.do, which takes several seconds to complete. This wait time is normally acceptable, but is less convenient when I'm restarting Stata relatively often. Even beyond my profile.do, reopening my do-file editor windows and resetting the working directory is a hassle.

With this in mind, it would be great to have a Stata command to reload the JRE.

Part of what my profile.do does is set my ado-path, which contains about 75 directories, as my ado-files are scattered across project directories and Git repositories. It also sets my PERSONAL system directory outside the default C:\ado\personal: I like keeping PERSONAL on Dropbox in order to facilitate ado-file consistency across machines. Yet all my calls to the adopath and sysdir commands seem to be processed after the JRE is loaded fairly early on, so javacall looks only in C:\ado for Java files. Again, a command to reload the JRE with the current ado-path would address this.

Less importantly, it would be nice to be able to get/set string scalars. Especially when working with difficult strings, locals sometimes aren't the right option, so I find myself returning values from Java to Stata locals, then using Mata to copy the locals to string scalars.

It'd also be nice to get/set stored results.

All in all, I'm very glad that Stata has this Java integration. A few changes would help the experience flow better.
László Sándor replied

15 Oct 2014, 13:34
Originally posted by László Sándor:

MathWorks is now marketing MapReduce on the desktop and MapReduce on Hadoop for Matlab. If only something like it would be easy to do for StataCorp too. http://www.mathworks.com/discovery/m...ce-hadoop.html

And by the way, Hadoop and Spark for Python are here too.
http://continuum.io/anaconda-cluster
If only Stata code (and licenses!) could work similarly on clusters or rented hardware like Amazon Web Services.
Nick Cox replied

08 Oct 2014, 03:40
I guess this hinges on the distinction between what a person filling in [filling out] a form sees and what the researcher uses. Given

Are you

0. a new learner?
1. an experienced learner?

there is just too much scope for somebody not familiar with Boolean logic to feel offended, confused or puzzled. Naturally, "don't do that then!" is one answer; that is, don't show numeric codes to someone taking a survey.
Maarten Buis replied

08 Oct 2014, 01:58
Originally posted by Steve Samuels:

I believe that the proper numbering for binary responses, especially "Yes/No" in a questionnaire should be 1 "Yes" 2 "No".

I don't understand that statement. I always read 0 as "false" and 1 as "true". So the first thing I do when I open a new dataset is change variables like sex (1 "male" 2 "female") into a more sensible (to me) variable female (0 "false, so male", 1 "true so female"). The 0=false and 1=true convention derives from Boolean logic. Where does the 1=yes, 2=no convention come from?
Joseph Coveney replied

07 Oct 2014, 22:43
Originally posted by Steve Samuels:

For a binary outcome Y with probability P, a 1-2 coding would destroy the simple theoretical relation E(Y) = P

I'm reluctant to continue the thread drift, but with Stata's factor variables you can have your 1-2 coding cake and eat it too:

Code:

clear *
set more off
set seed `=date("2014-10-08", "YMD")'
quietly set obs 200
generate byte response = floor((2 - 1 + 1) * runiform() + 1)
generate double predictor = runiform()
gsem (ib2.response <- c.predictor), logit nolog // <- Here
recode response (2 = 0)
logit response c.predictor, nolog // <- Confirmed here

Also, it's not uncommon in clinical trial data-collection ("case report") forms for patient eligibility to have responses to items in the top-half of the page coded 1 = Yes and 2 = No for inclusion criteria, and 2 = No [that is, left or first] and 1 = Yes for exclusion criteria in the bottom half of the page.

And, for some reason, I've always believed that the 1-2 coding (instead of 0-1 coding) harks back to the days when SAS was the exclusive software package used for clinical trial data. (Namely, because of PROC LOGISTIC's default coding for the response variable.)
Steve Samuels replied

07 Oct 2014, 19:49
I believe that the proper numbering for binary responses, especially "Yes/No" in a questionnaire should be 1 "Yes" 2 "No". This is natural phrasing in ordinary language, whereas 0 "No" 1 "Yes" is not. And, in writing questionnaires, the more natural and unsurprising the phrasing, the better. That said, I've never been tempted to use the 1-2 values in an analysis, on either side of a regression equation. For a binary outcome Y with probability \(P\), a 1-2 coding would destroy the simple theoretical relation

\[
E(Y) = P
\]
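
To spell out the step, as a worked sketch that takes \(P\) to be the probability of "Yes": with the 1 "Yes" 2 "No" coding,

\[
E(Y) = 1 \cdot P + 2 \cdot (1 - P) = 2 - P
\]

whereas with the 0 "No" 1 "Yes" coding,

\[
E(Y) = 0 \cdot (1 - P) + 1 \cdot P = P
\]

So the mean of a 1-2 coded response recovers \(P\) only after subtracting it from 2.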

Last edited by Steve Samuels; 07 Oct 2014, 20:02.
Joseph Coveney replied

07 Oct 2014, 19:18
Originally posted by Alan Neustadtl:

In many of the datasets I use the dichotomous measures are coded 1 and 2. In these situations Stata ends with an error.

Well, you could always use gsem in such cases.

Code:

sysuse auto
recode foreign (0=1) (1=2)
gsem (i.foreign <- c.displacement), logit nolog
Alan Neustadtl replied

07 Oct 2014, 15:54
Originally posted by Clyde Schechter:

But I don't get Alan's original question and his example. Just what would -logit i.sex i.chd c.income- mean? Logistic regression implies that the dependent variable is not only categorical, but specifically a dichotomy. And if you wrote -regress i.something i.predictor c.other_predictor-, what would you want regress to do? It seems to me that all of the built-in estimation commands uniquely determine whether their dependent variables are categorical or not. Perhaps the exception is Poisson which will accept (and use as continuous) a continuous outcome variable even though it is nominally (no pun intended) a procedure for estimating count variables.

While there may be no general prohibition in estimating models, Stata-written estimation routines like -logistic- do not allow factor variables on the LHS. So it strikes me as useful for the factor notation we have become used to on the right-hand side to be allowed on the left-hand side, generating an error if the DV is not dichotomous.

In many of the datasets I use the dichotomous measures are coded 1 and 2. In these situations Stata ends with an error. Consider the example below using data from the General Social Survey where the variable sex is coded with male=1 and female=2:

Code:

. logistic i.sex c.educ
depvar may not be a factor variable
r(198);

Best,
Alan
Richard Williams replied

07 Oct 2014, 07:30
Part of the problem is that the real error often isn't what Stata thinks is the error. For example, if you run this as a do file,

Code:

sysuse auto, clear
reg price i.foreign///
weight

Stata complains

Code:

. reg price i.foreign///
/ invalid name
r(198);

I've seen people spend hours trying to figure out what Stata is whining about, only to finally realize that they need a space after foreign.

Occasionally it is possible to suggest a clearer error message to StataCorp, which will make the change when asked.

Having said all that, I agree that it would be wonderful if there were something better than -set trace on-, which can often overwhelm you with its output.
Imed Limam replied

07 Oct 2014, 07:09
I agree, but there has to be a better middle ground. Being too concise by default is not helpful either. Thank you for pointing me to FAQ S. 18. Regards.