What I'd like to do is use the first dataset to impute the state for each respondent in the second dataset. My idea is to run a multilevel multinomial logit model using the first dataset, with state as the DV, region as the second level, and race, sex, age, and education as the IVs. I could then use this model to come up with probabilities for each respondent in the second dataset being in each state, and assign the respondent using those probabilities.

However, there are two snags with this plan. First, I'm not sure what the syntax for a multilevel multinomial logit model is, since there isn't a convenient mixed-effects (-me-) command for it. And second, I don't know how to convert the predicted state probabilities for each respondent in the second dataset into state assignments that respect those probabilities. E.g., if there's a 1% chance a respondent is in AK, 3% in AL, 6% in AR, and so on, how can I get Stata to assign him to a state based on those probabilities? Thanks!
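For the second snag, the approach I have in mind (a sketch, assuming the predicted probabilities end up in variables p1-p51, one per state) is to draw a uniform random number and walk the cumulative distribution:

Code:

set seed 12345
gen double u = runiform()
gen double cum = 0
gen byte state_imp = .
forvalues j = 1/51 {
    replace cum = cum + p`j'
    replace state_imp = `j' if missing(state_imp) & u <= cum
}

Each respondent then gets state j with probability p_j; floating-point rounding can leave a few unassigned if the probabilities don't sum exactly to 1, so a final replace of any remaining missings would be a sensible safeguard.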

Apologies if this has been discussed before, but I couldn't find anything on it.

I'm using Stata 14.1, and having an issue with the command 'mean'. Specifically, the 95% confidence intervals it generates don't appear to be 95%. As an example, when I run mean on a variable in my dataset, it generates the following output:

. mean day

Mean estimation                     Number of obs   =        212

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
         day |   1659.731      326.08      1016.939    2302.523
--------------------------------------------------------------

However, when I manually calculate the 95% confidence interval based on the formula:

x̅ ± z(0.05/2) × S.E.

I get a confidence interval of 1020.626 to 2298.836. Similar, but not exactly the same.

Working backwards from the CI generated by Stata, it looks like it's using an implied significance level smaller than 0.05 (i.e., a critical value larger than 1.96) - and the exact value varies depending on the calculation/variable used.
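For what it's worth, the numbers are consistent with -mean- using the Student's t distribution with n − 1 degrees of freedom rather than the normal. A quick check using the reported mean and standard error:

Code:

display invttail(211, 0.025)                     // ~1.971, vs. invnormal(0.975) = 1.96
display 1659.731 - invttail(211, 0.025)*326.08   // ~1016.9, matching Stata's lower bound
display 1659.731 + invttail(211, 0.025)*326.08   // ~2302.5, matching Stata's upper bound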

Not a major issue, I would just like to understand why it is doing this (assuming it's deliberate).

The dates contained in one string variable have inconsistent formats, exemplified as follows

date                        |
1961-1965                   |
del 5-Ene-1930 - 8-Dec-1950 |
5-Ene-1931                  |

I proceeded by eliminating "del" in Excel (I would appreciate ideas on how to do this directly in Stata) and then by using
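(On the Excel step: the leading "del " can be stripped directly in Stata with subinstr():)

Code:

replace date = subinstr(date, "del ", "", .)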

Code:

split date, p(" - ")

date                        | date1      | date2      |
1961-1965                   | 1961       | 1965       |
del 5-Ene-1930 - 8-Dec-1950 | 5-Ene-1930 | 8-Dec-1950 |
5-Ene-1931                  | 5-Ene-1931 |            |

I proceeded by

Code:

split date1, p("-")

I then

Code:

egen new_date = concat(date11 date12 date13)

new_date |
1961     |
5Ene1930 |
5Ene1931 |
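One direction I'm considering from here (a sketch, assuming only "Ene" needs translating; other Spanish month abbreviations could be mapped the same way) is to convert the abbreviations to English and let daily() do the parsing, treating the bare years separately:

Code:

replace new_date = subinstr(new_date, "Ene", "Jan", .)
gen double start = daily(new_date, "DMY")            // parses e.g. "5Jan1930"
replace start = mdy(1, 1, real(new_date)) ///
    if missing(start) & !missing(real(new_date))     // bare years -> 1 Jan
format start %td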

I would appreciate suggestions in this regard.

So consider that we want to use the FGLS estimator to model heteroskedasticity. As Cameron and Trivedi (2010) show this can be done with weighted least squares using

I was wondering: what happens if we want to use weights in our estimation? For example, we may be using survey data that provides weights and we want to use them. If we do all the intermediate estimations using the survey weights, do we also need to include the survey weights in addition to the FGLS weights in the final estimation? If so, how? My thinking is that since the intermediate steps already use the survey weights, the predicted variance is affected by them, and so we may not need to accommodate the survey weights again in the final estimation.
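For reference, the unweighted FGLS-by-WLS recipe I have in mind looks like this (a sketch with hypothetical variables y, x1, x2):

Code:

regress y x1 x2
predict double e, residuals
gen double lne2 = ln(e^2)
regress lne2 x1 x2              // skedasticity regression
predict double lnvar, xb
gen double w = 1/exp(lnvar)     // weight = inverse of the predicted variance
regress y x1 x2 [aweight=w]

The question is whether, and how, survey weights should enter each of these steps.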

Help and thoughts would be appreciated, thanks!

Cameron, A. Colin and Pravin K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press.

I am trying to reorganize an administrative dataset into a panel form.

Basically, I need to transform a dataset structured like this

Country | Citizenship | 1999 | 2000 | 2001 | 2002 |
XXXX    | A           |   15 |   16 |   17 |   18 |
XXXX    | B           |   16 |   17 |   18 |   19 |
XXXX    | C           |   17 |   18 |   19 |   20 |
YYY     | A           |   20 |   19 |   18 |   17 |
YYY     | B           |   19 |   18 |   16 |   15 |
YYY     | C           |   18 |   17 |   16 |   15 |

into this

Country | Year | A  | B  | C  |
XXXX    | 1999 | 15 | 16 | 17 |
XXXX    | 2000 | 16 | 17 | 18 |
XXXX    | 2001 | 17 | 18 | 19 |
XXXX    | 2002 | 18 | 19 | 20 |
YYY     | 1999 | 20 | 19 | 18 |
YYY     | 2000 | 19 | 18 | 17 |
YYY     | 2001 | 18 | 16 | 16 |
YYY     | 2002 | 17 | 15 | 15 |

I tried to use the reshape command, but it is not working at all (no doubt I used it incorrectly).
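For reference, here is the reshape attempt I have in mind (assuming the year columns imported as v3-v6; substitute the actual names): go long on year, then wide on Citizenship.

Code:

rename (v3 v4 v5 v6) (y1999 y2000 y2001 y2002)
reshape long y, i(Country Citizenship) j(Year)
reshape wide y, i(Country Year) j(Citizenship) string
rename y* *                      // yA yB yC -> A B C
order Country Year A B C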

Any advice?

Thank you in advance.

V.

I have an imputed dataset consisting of continuous and binary variables, and I am fitting a conditional logistic regression model with independent variables associated with the recurrence of TB infection (recurrence being my dependent variable). I believe some of my variables are highly correlated, e.g. interruption of drug treatment and reaction to medication. When I search online for methods to detect collinearity and multicollinearity, papers suggest methods such as the VIF, the condition index, and/or treating an unexpected direction of association between the outcome and an explanatory variable as an important sign of collinearity (http://www.nature.com/bdj/journal/v199/n7/full/4812743a.html). Using the last recommendation I believe I have detected collinearity, but I cannot use the VIF or the condition index with multiply imputed data. Is there a better approach to assessing my conditional logistic regression model for collinear variables when working with multiply imputed data?
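One workaround I have been considering (my own assumption, not an established method, and with hypothetical variable names) is to run the usual diagnostics within each completed dataset and compare them across imputations, e.g. via an auxiliary OLS regression of the outcome on the predictors:

Code:

mi xeq 1/5: regress recurrence interrupt reaction age sex; estat vif

If the VIFs are stable and large across the completed datasets, that would seem to point to the same collinearity problem; I would welcome views on whether this is defensible.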

Many thanks for your help


Now I have to test with an OLS regression whether the amount of sales in quintile 1 is significantly different from that in quintile 5.

I know this can be done with a t test, but how do you set it up as an OLS regression?
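If I understand correctly, the two-sample t test maps onto a regression of the outcome on a group indicator (a sketch with hypothetical variables sales and quintile):

Code:

keep if inlist(quintile, 1, 5)
gen byte top = quintile == 5
regress sales top                // t statistic on -top- = pooled-variance t test
regress sales top, vce(robust)   // close to the unequal-variance version

The coefficient on the indicator is the difference in means between the two quintiles, and its t statistic tests whether that difference is zero.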


I want to count how many times a dummy variable equals one when an identifier takes a specific value.

In the end I want to compute the percentage share of observations for which the dummy equals one, within each value of the identifier.

I tried that code:

Code:

gen freq = _N if id==1
egen dummy = count(mpg) if mpg==1 & id==1
egen cdummy = sum(dummy) if id==1
gen share = dummy/freq if id==1

But it would be even better if I can compute the percentage share for different values of id so that I don't have to repeat the command for each identifier.
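One idea (assuming the variable, call it d here, is strictly 0/1): its mean within each id is exactly the share of ones, so a single -bysort- handles all values of the identifier at once:

Code:

bysort id: egen share = mean(d)   // mean of a 0/1 dummy = share of ones
bysort id: egen ones  = total(d)  // count of ones per id, if also needed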

Any suggestions?

Many thanks in advance.

Bene


I've been creating new variables in my dataset of panel data to show total days of follow-up by month of follow-up. I used the following code to do this successfully:

by month: egen followup= total(days)

I also needed to know the number of facilities monitored, and added this variable using the nvals() function (from the egenmore package on SSC):

by month: egen facilities=nvals(facility)

My question arose when I was trying to add further variables displaying only information regarding intervention and controls, but to all observations in the month (rather than interventions showing missing in the control column and vice versa). I initially tried the following code, which yielded missing values for controls:

by month: egen followup_int= total(days) if interv_site==1

by month: egen facilities_int=nvals(facility) if interv_site==1

I discovered a useful workaround for the total() function in Stata's FAQ pages:

by month: egen followup_int= total(days*(interv_site==1))

I'm wondering (a) if there's another way to tell Stata to apply the total (found using the if statement) to all observations (by month) without sort of tricking it as above, and (b) if there is a way to make Stata similarly fill all cells in the new column with the number of intervention facilities monitored by month.
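For (b), I wonder if the same multiply-by-an-indicator trick works for the distinct count after tagging one observation per facility within month, using egen's built-in tag() function (this assumes interv_site is constant within each facility):

Code:

egen tg = tag(month facility)    // marks one row per (month, facility) pair
by month: egen facilities_int = total(tg * (interv_site==1))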

I'm sure there are multiple ways of working around this, but thought I would ask the experts in case there's a simple command or option I'm not aware of. Thank you!

Julia

However, this caused some problems. The way I am considering now is to keep the years which feature the smallest change. For example, let's say we only have one 1950 value at .4. We have two values for 1951, .6 and .7. Since .6 is a smaller change from 1950, we keep the .6 observation and drop the .7 one. This seems to get more complicated when the year you are examining has alternate observations for the prior year. That is to say, what if 1950 also has three different estimates? Maybe there is some algorithm to run. Does anyone know of any techniques for this? Thanks.
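One greedy pass I have sketched (assuming variables year and value; ties keep both observations, so this is only a starting point): walk the years in order and, among duplicates, keep the estimate closest to the previously kept estimate.

Code:

sort year value
gen byte keepme = 0
local last = .
levelsof year, local(years)
foreach y of local years {
    tempvar d
    quietly gen double `d' = cond(`last' == ., 0, abs(value - `last')) ///
        if year == `y'
    quietly summarize `d', meanonly
    quietly replace keepme = 1 if year == `y' & `d' == r(min)
    quietly summarize value if year == `y' & keepme == 1, meanonly
    local last = r(mean)
    quietly drop `d'
}
keep if keepme

Whether a greedy pass like this is the right criterion when the prior year itself has multiple estimates is exactly what I am unsure about.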

Here is some example data

year | value
1950 | 0.4438
1951 | 0.2746
1952 | 0.215
1953 | 0.9189
1954 | 0.7192
1955 | 0.7332
1956 | 0.6545
1957 | 0.2492
1957 | 0.3382
1958 | 0.6456
1958 | 0.1853
1950 | 0.4664
1951 | 0.3202
1952 | 0.2473
1953 | 0.9355
1954 | 0.4428
1955 | 0.0049
1956 | 0.9164

Plotting the values, the difference looks minimal: the final payoff is always at or below the initial payoff, and there is only the slightest daylight between the two curves (plot attached).

But the test came back highly significant:

signrank initial = final

Wilcoxon signed-rank test

        sign |      obs   sum ranks    expected
-------------+---------------------------------
    positive |        0           0        59.5
    negative |       14         119        59.5
        zero |        1           1           1
-------------+---------------------------------
         all |       15         120         120

unadjusted variance      310.00
adjustment for ties        0.00
adjustment for zeros      -0.25
                      ----------
adjusted variance        309.75

Ho: initial = final
        z = -3.381
Prob > |z| = 0.0007

Is the significance of the test driven by the fact that one curve is almost always above the other?


I have 7 continuous variables of scores for class subjects:

1. score_math
2. score_science
3. score_english
4. score_history
5. score_spanish
6. score_reading
7. score_writing

And I want to create a new variable (student_segment) that takes discrete values 1 through 7 depending on which of the above 7 variables holds the max score (i.e., if the math score is the highest of the 7, student_segment is 1; 2 for science, 3 for English, and so on).
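For concreteness, one sketch using egen's rowmax(), breaking ties in favor of the first subject listed:

Code:

egen double best = rowmax(score_math score_science score_english ///
    score_history score_spanish score_reading score_writing)
gen byte student_segment = .
local i = 1
foreach s in math science english history spanish reading writing {
    replace student_segment = `i' if missing(student_segment) & score_`s' == best
    local ++i
}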

Any advice on the best way to do this is very much appreciated!

Data: Patient-level health data; patient characteristics and responses to a quality of life questionnaire

I want to produce multiple bar charts displaying the categorical distribution (proportion of patients in each category) for each item on my questionnaire, shown separately for patients in 4 different settings. The aim is to provide a visual summary of patient responses, comparing differences between settings of care; I want to show a lot of information on one page. NB: this is not for an academic paper - I'm reporting to health care teams on the data they have collected.

I'm attaching the graph I have produced using catplot (SSC).

This is how I want each bar graph to look - but I want multiple items/questions included, by setting (4 bar charts per question) - all within the same graph. Is this possible?
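For concreteness, I wonder whether something along these lines would work (a sketch with hypothetical variables item, setting, and response, after reshaping the questionnaire data long so there is one row per patient-item):

Code:

graph hbar (percent), over(response) ///
    by(item setting, compact note("")) ///
    name(summary, replace)

The by() option should give one small panel per item-by-setting combination on a single page, which is roughly the matrix layout I am after.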

I'm also attaching a crude mock-up of the graph I ideally want: a cut-and-paste of the graphs I have produced with catplot.

One final point: there is an example of a 'matrix of bar graphs' here: http://blog.stata.com/tag/sem/ - does anyone know how it was produced?

Thank you
