New features and improvements in version 1.3b include:

- Support for (some) complex variance estimators including Stata’s survey estimator (sample points, strata, survey weights etc.)
- Improvements to the numerical approximation. Thanks to the intervention of an anonymous reviewer, surveybias is now roughly seven times faster.
- A new analytical method for simple random samples that is even faster
- Convenience options for naming variables created by **surveybiasseries**
- Bug fixes and minor improvements


My colleagues and I are running several regression models on patient data in order to estimate health outcomes and resource consumption. The regression commands include (but are not necessarily limited to) regress, logit, nbreg, and glm, family(gamma). Each model has up to 20 covariates, and the patients are sometimes divided into named subgroups. Regressions are run either for all patients or for each subgroup separately.

Now to the problem: I want to export the regression output to some standard format (preferably .csv or similar) that contains all the data needed to predict the outcomes. Relevant data include the regression coefficients, the regression model (actually the formula for prediction), and the subgroup over which the regression is run. Also of interest are the number of data points in each regression, the mean of the dependent variable in each regression, and maybe the data type of each covariate and of the dependent variable, but these are less necessary. Basically, the export file should contain everything needed to assemble a prediction model in, e.g., Excel without adding any extra data.

My idea so far has been to use eststo to store the results from each regression, estadd to manually add the required metadata as named e() macros or scalars, and estout to write the export file, all from SSC. Maybe I will write wrapper programs around these commands with predefined options.

Many thanks for any suggestions on how to do this!
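A minimal sketch of the eststo / estadd / esttab workflow described above, assuming the estout package from SSC is installed. The auto dataset, the subgroup variable foreign, the model names m0 and m1, and the file name models.csv are stand-ins chosen for illustration, not part of the original question. It stores one model per subgroup, attaches the mean of the dependent variable and a subgroup label, and writes everything to one CSV file.

```stata
* Requires the estout package from SSC: ssc install estout
* Dataset, variable names and file name are illustrative assumptions
sysuse auto, clear
eststo clear

levelsof foreign, local(groups)
foreach g of local groups {
    * Fit the model for one subgroup
    quietly regress price mpg weight if foreign == `g'

    * Mean of the dependent variable over the estimation sample
    quietly summarize price if e(sample)
    local ymean = r(mean)

    * Attach metadata to the active estimates, then store them
    estadd scalar ymean = `ymean'
    estadd local subgroup "foreign_`g'"
    eststo m`g'
}

* One column per stored model; coefficients plus the added metadata rows
esttab using models.csv, replace csv plain not ///
    scalars(N ymean subgroup)
```

The same loop should work with logit, nbreg, or glm in place of regress, since estadd and eststo operate on whatever estimates are active. The prediction formula itself (link function and family) is not exported here; e(cmd) and similar macros could be added to the scalars() list after checking which ones each command returns.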

I am currently working on a count model (negative binomial) and trying to choose the best model based on AIC or BIC model fit tests.

The issue I am concerned about is that I have read articles saying that these tests should not be used with clustered or weighted data, which are common in survey work. Because the way I constructed my data differs somewhat from that situation, I am looking for help in deciding whether these tests (AIC, BIC) are appropriate.

The model draws on two types of datasets. The dependent variable was obtained from a surveillance database, so neither weights nor sampling matter for it. The independent variables, however, were obtained from Demographic and Health Survey (DHS) data, for which sample weights must be used. The goal of the analysis is to find statistically significant independent variables that explain the variance of the dependent variable (as usual).

Prior to running the regression, the dataset for the independent variables was prepared by collapsing (by region) with the sample weights provided in the DHS datasets. Thus, I do not have to use the "[iweight=weight]" option when running the regression (because the final dataset for the independent variables was already weighted when collapsing, and no weight is required for the dependent variable). The regression and test output for one of the models is shown below.

I was wondering whether it would be okay to use AIC or BIC for model comparison in this context.

Thank you.

Jungseok Lee

. xi: glm inc1000 i.q3RF1 i.age_grp*inc_type, fam(nb)
i.q3RF1           _Iq3RF1_1-3         (naturally coded; _Iq3RF1_1 omitted)
i.age_grp         _Iage_grp_1-5       (naturally coded; _Iage_grp_5 omitted)
i.age_~p*inc~pe   _IageXinc_t_#       (coded as above)
note: _IageXinc_t_1 omitted because of collinearity

Iteration 0:   log likelihood = -228.99003
Iteration 1:   log likelihood = -225.47151
Iteration 2:   log likelihood = -225.43393
Iteration 3:   log likelihood = -225.43391

Generalized linear models                          No. of obs      =        84
Optimization     : ML                              Residual df     =        73
                                                   Scale parameter =         1
Deviance         =  80.20500562                    (1/df) Deviance =  1.098699
Pearson          =  71.00797014                    (1/df) Pearson  =  .9727119

Variance function: V(u) = u+(1)u^2                 [Neg. Binomial]
Link function    : g(u) = ln(u)                    [Log]

                                                   AIC             =  5.629379
Log likelihood   = -225.4339137                    BIC             = -243.2446

------------------------------------------------------------------------------
             |                 OIM
     inc1000 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   _Iq3RF1_2 |  -.5920396   .3099836    -1.91   0.056    -1.199596    .0155171
   _Iq3RF1_3 |   .3788479   .3493202     1.08   0.278     -.305807    1.063503
 _Iage_grp_1 |   .9523037   .4453947     2.14   0.033      .079346    1.825261
 _Iage_grp_2 |  -4.379099   1.337118    -3.28   0.001    -6.999803   -1.758395
 _Iage_grp_3 |  -1.639636   .6600511    -2.48   0.013    -2.933312   -.3459597
 _Iage_grp_4 |  -3.685952   1.512576    -2.44   0.015    -6.650546    -.721358
    inc_type |  -3.278245   .6011499    -5.45   0.000    -4.456478   -2.100013
_IageXinc_~1 |  (omitted)
_IageXinc_~2 |   5.701549   1.390779     4.10   0.000     2.975672    8.427426
_IageXinc_~3 |   2.357005   .7785509     3.03   0.002      .831073    3.882936
_IageXinc_~4 |   3.255374   1.596704     2.04   0.041     .1258912    6.384857
       _cons |   4.277992   .5850281     7.31   0.000     3.131358    5.424626
------------------------------------------------------------------------------

. estat ic

-----------------------------------------------------------------------------
       Model |    Obs    ll(null)   ll(model)     df          AIC         BIC
-------------+---------------------------------------------------------------
           . |     84           .   -225.4339     11     472.8678    499.6068
-----------------------------------------------------------------------------
Note: N=Obs used in calculating BIC; see [R] BIC note
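
One point that may help when reading this output: the AIC and BIC in the glm header are on a different scale from those reported by estat ic, even though both come from the same fit. The sketch below reproduces all four numbers from the log likelihood, deviance, residual df, N, and number of parameters shown above. The scalar names are made up for illustration, and the formulas are the ones the printed values are consistent with.

```stata
* Values copied from the glm and estat ic output above
scalar ll  = -225.4339137
scalar nob = 84
scalar k   = 11
scalar dev = 80.20500562
scalar rdf = 73

* glm header: AIC is divided by N, BIC is deviance-based
* (about 5.6294 and -243.2446)
display "glm AIC      = " ((-2*ll + 2*k)/nob)
display "glm BIC      = " (dev - rdf*ln(nob))

* estat ic: likelihood-based AIC and BIC
* (about 472.8678 and 499.6068)
display "estat ic AIC = " (-2*ll + 2*k)
display "estat ic BIC = " (-2*ll + k*ln(nob))
```

Whichever version is used, models should be compared on the same criterion and the same estimation sample. This arithmetic does not by itself settle whether likelihood-based criteria are appropriate for data built from weighted survey aggregates.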


I am trying to import the NAMCS OPD files for 2003-2009 using the supplied .do files. I get an error telling me the data file is a Stata 13 file, but when I ran the NAMCS-supplied do-files in Stata 13, it told me the data files are not in Stata format.

Can anyone assist?

Thanks.

ET

I am trying to estimate the parameters of a CES function for a set of countries. I recognize that nonlinear estimation is tricky because of convergence issues, but the problem I've run into is that the estimated parameter values fall outside the bounds that make sense for the model (negative share parameters, for example). This suggests to me that the optimization routine didn't do a great job.

One helpful trick I've included is normalizing all the time series by the geometric mean within the corresponding country. That way, all variables are essentially unitless. Leon Ledesma and coauthors suggest this is the proper approach to estimation in a recent AER paper.

Does anyone know how to add constraints on the parameters or switch to a better optimization routine? I've copied my main commands below. The first line just ensures that I don't feed in missing values.

gen nomiss=1 if cgdpo!=. & ck!=. & emp!=. & eprod!=. & varpi!=. & theta_e!=.

tabmiss cgdpo ck emp eprod if nomiss==1

nl (cgdpo=ybar*(varpibar*((exp({zk=0}*year)*ck^{theta =.33}*emp^(1-{theta=.33}))/(exp({zk=0}*tbar)*kbar^{theta=.33}*emp^(1-{theta=.33})))^{nu=-2}+theta_ebar*((exp({ze=0}*year)*eprod)/(exp({ze=0}*tbar)*ebar))^{nu=-2})^(1/{nu=-2})) if nomiss==1 & country=="United States"
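
One common way to keep a share parameter inside (0, 1) with nl is to reparameterize it rather than constrain it: estimate an unbounded parameter and pass it through invlogit() inside the expression. The sketch below shows the idea on simulated Cobb-Douglas data, not on the CES specification above. The variable names k, l, y and the parameter name lt are assumptions made for illustration.

```stata
* Simulated data with a true share of 0.3 (illustrative only)
clear
set seed 12345
set obs 200
generate k = exp(rnormal())
generate l = exp(rnormal())
generate y = k^0.3 * l^0.7 * exp(0.05*rnormal())

* lt is unbounded; invlogit(lt) is the share and always lies in (0,1)
nl (y = k^invlogit({lt=0}) * l^(1 - invlogit({lt})))

* Recover the share from the estimated unbounded parameter
matrix b = e(b)
display "estimated share = " invlogit(b[1,1])
```

In the CES command, each occurrence of {theta=.33} could be replaced by invlogit({lt=...}) in the same way, and a parameter that must be positive could be written as exp({lp=...}). nlcom can supply a standard error for the transformed parameter, but the coefficient name it expects differs across Stata versions, so it is left out here.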


This may be less a Stata question and more a methods question, but I'm hoping someone can help. I have a sample of individuals over time and am regressing starting salary on a particular organizational feature (and a host of other variables) for individuals who change jobs. Some individuals do not vary on this feature (i.e., it is either always absent or always present within these individuals). I can't find guidance on whether I should drop or retain these individuals. They do seem to be influencing the results in a non-trivial way.

Do I need to make a qualitative judgement here, excluding those with no variation if I conclude that they are in some way different from those with variation?

Interestingly, if I retain them, the Hausman test suggests random effects (RE), but if I drop them, it suggests fixed effects (FE).
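
For what it is worth, here is a small sketch of how the individuals without within-person variation could be flagged and counted before deciding what to do with them. It uses simulated data, and id, feature, and salary are hypothetical names. In a fixed-effects regression the coefficient on the feature is identified only from individuals whose value changes; the others still contribute to the remaining coefficients and to the random-effects estimator.

```stata
* Simulated panel; id, feature and salary are hypothetical names
clear
set seed 1
set obs 300
generate id = ceil(_n/3)
generate feature = runiform() < 0.5
replace feature = 0 if id <= 30      // these people never have the feature
generate salary = 50 + 5*feature + rnormal()

* Flag individuals whose feature value changes at least once
bysort id (feature): generate byte varies = feature[1] != feature[_N]

* Count individuals (not observations) with and without variation
egen byte tag = tag(id)
tabulate varies if tag

* Within estimator: the feature effect comes from the varies == 1 group
xtset id
xtreg salary feature, fe
```

Comparing results with and without `if varies` then shows how much the non-varying group drives the other estimates.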

This is the basic command I am running:


steppedwedge, binomial detectabledifference incomplete(1) p1(0.8) m(5) rho(0.05) alpha(0.05)

I know the program is reading the design matrices correctly because the total number of observations recorded in the output is 240 for the 12 cluster design (12 clusters * 5 people per cluster m * 4 observations per person = 240) and 360 for the 18 cluster design.

The 18 cluster design should be able to detect a smaller difference (or have higher power if solving for power instead) compared to the 12 cluster design all else equal, but the results are exactly the same: 0.3882.

I've tried closing and reopening Stata and changing the order of runs just to make sure there is not a memory problem, but I get the same results. I also tried running a simpler example from the help file and then doubling the number of clusters in the example; it worked.

I think there is something about my design matrices that is tripping up the calculation. Any thoughts?

I'm using the user-written psmatch2 command. When I run the code to implement the match, a new set of variables is created, including _support and _weight. I have some questions about these new variables.

1. What is Stata doing when it identifies observations that are on support but their weights are missing?

2. How is _weight being calculated?

3. What does it mean to be on support?


Statalist itself started some time in August 1994, although the original postings are pretty hard to trace.

Happy birthday to us and happy birthday to you!

I, or rather the forum software, count 33,111 posts, getting close to 100 per day. (Apply your own discount for unintentional reposts, but that's still a lot of interest in Stata.)

The really interesting measure would be the fraction of threads regarded as resolved!

********** attempted loop ****************
. local var1 "sodas cakes cereal chips chocolate cookies juice sweetbread"
. local var2 "bottledwater chicken driedbeans icecream milk rice whitebread"
. local food "`var1' `var2'"
.
. *** label vars for display
.
. foreach v of local food {
  2. local lbl proper("`v'")
  3. disp `lbl'
  4. label var r`v' `lbl'
  5. }
Sodas
invalid syntax
r(198);

end of do-file

r(198);

******* now by hand for first element ********
. d rsodas

              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------------------------
rsodas          float   %9.0g

. local v sodas
. local lbl proper("`v'")
. disp "r`v'"
rsodas
. disp `lbl'
Sodas
. label var r`v' `lbl'
invalid syntax
r(198);
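
A likely cause, sketched below: without an equals sign, `local lbl proper("`v'")` stores the literal text proper("sodas") rather than the result of the function. `disp` then evaluates that text as an expression and prints Sodas, but `label var` receives the unevaluated text and rejects it. Using `local lbl = ...` and quoting the label should fix this. The empty variables are generated here only to make the example self-contained.

```stata
clear
set obs 1

local var1 "sodas cakes cereal chips chocolate cookies juice sweetbread"
local var2 "bottledwater chicken driedbeans icecream milk rice whitebread"
local food "`var1' `var2'"

foreach v of local food {
    * Stand-in for the existing r* variables
    generate r`v' = .

    * The equals sign makes Stata evaluate proper() before storing the result
    local lbl = proper("`v'")

    * Quote the label so it is passed as a string
    label variable r`v' "`lbl'"
}

describe
```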


I have written long commands and comments in a do-file. Fortunately, when I print the do-file, the long lines are automatically wrapped onto multiple lines (without my using /// partway through a line). However, when I log the commands and results (in SMCL format), the same does not happen, and many lines run past the margin.

Is there any way to address this (aside from using ///)?

Thanks,

Navid
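
One setting worth trying is `set linesize`, which controls the width at which Stata wraps output and echoed commands in the Results window and in logs. The sketch below sets it before opening the log. The width of 80, the log name wrapdemo, and the example regression are arbitrary choices for illustration, and whether this cures the printing problem depends on how the log is printed.

```stata
* Wrap output and echoed commands at 80 characters
set linesize 80

log using wrapdemo, replace smcl

sysuse auto, clear
* A deliberately long command line to see how it is wrapped in the log
regress price mpg weight length turn displacement gear_ratio headroom trunk foreign

log close
```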


Sorry for the unclear title, but I don't really know how to say it. I have two datasets from surveys conducted in different years, say 2000 and 2010. I want to test whether some preferences and attitudes change with age. For a simple example, the average kid likes hot chocolate but tends to prefer cappuccino when older. My strategy is to match observations across the two datasets and see how those preferences and attitudes changed. The idea is that people aged 15 in 2000 belong to the same cohort as people aged 25 in 2010. If the average person aged 15 in 2000 liked hot chocolate, and the average person aged 25 in 2010 liked cappuccino, I can conclude that drink preferences change with age. Anyway, my problem is about coding and organizing the data. How can I do this with Stata?

Thank you.
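
One possible way to organize this, sketched with simulated data: stack the two surveys in one dataset, compute each respondent's birth year as survey year minus age, and compare cohort averages across years. The variable names year, age, and likes_coffee are assumptions, and with real files the stacking step would be an `append using` rather than simulation.

```stata
* Simulated stand-in for two appended surveys (2000 and 2010)
clear
set seed 2024
set obs 400
generate year = cond(_n <= 200, 2000, 2010)
generate age = 15 + floor(40*runiform())
generate likes_coffee = runiform() < (age/80)

* People aged 15 in 2000 and 25 in 2010 share the same birth year
generate birthyear = year - age

* One row per birth cohort and survey year
collapse (mean) likes_coffee (count) n = likes_coffee, by(birthyear year)

* Put the two survey years side by side for each cohort
reshape wide likes_coffee n, i(birthyear) j(year)
generate change = likes_coffee2010 - likes_coffee2000

list birthyear likes_coffee2000 likes_coffee2010 change if !missing(change), noobs
```

This compares cohorts, not the same individuals, so any change mixes ageing with whatever else differed between the two survey years.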

     years   sex
  1.    20     M
  2.    29     F
  3.    21     M
  4.    25     F
  5.    24     M

Suppose I want to replace the value of years in observation 5 with 23 if the value of years in observation 4 is below 26. How do I write this in a do-file?

PS: The example is just for illustration; it does not have to make any sense.

Thank you very much.

Felipe
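
A minimal sketch using the example data above: explicit subscripts such as `years[4]` refer to a specific observation, and `in 5` restricts the change to observation 5.

```stata
clear
input years str1 sex
20 "M"
29 "F"
21 "M"
25 "F"
24 "M"
end

* Change observation 5 only, and only when observation 4 is below 26
replace years = 23 if years[4] < 26 in 5

list
```

Observation numbers depend on the current sort order, so a rule based on an identifier variable is usually safer than fixed positions in real data.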