Reshape Error

Claudia Orlas

Join Date: Jan 2020

Posts: 13
#1


10 Jan 2020, 19:30

Hello everyone,

I am struggling with the following problem. I am running an analysis in Stata on a longitudinal database with two observation time points, so I currently have two observations for each individual. The data are in long format and I need to go from long to wide with the -reshape- command.

My command is:

reshape wide ptsd, i(record_id) j(cohort)

-ptsd: is my outcome variable
-record_id: is my unique identifier
-cohort: identifies whether the data is from 6 months==1 or 12 months==2 of follow up.

After running it I got the r(9) error message, which says: "There are variables other than a, b, record_id, cohort in your data. They must be constant within record_id because that is the only way they can fit into wide data without loss of information.

The variable or variables listed above are not constant within record_id. Perhaps the values are in error. Type reshape error for a list of the problem observations.

Either that, or the values vary because they should vary, in which case you must either add the variables to the list of xij variables, or drop them."

Comments and questions:
1. I have a long list of variables (more than 100) that in theory should be included in the command. Would Stata be able to handle that?
2. What can I do at this point?
3. I cannot show you the full output because of my center's rules on protected patient data. I can show only part of it, and only as a screenshot (sorry, I know that is not the standard here, but it was the best I could do).

I would appreciate any help. I need to fix this and move forward with the analysis as soon as I can.

Cheers,

C
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

10 Jan 2020, 19:50

In order to reshape this into wide layout you will need to add all of the variables that Stata is complaining about not being constant within record_id to the variable list where you currently have only ptsd listed. If there is a very large number of such variables, there are a few alternatives to consider:

1. Can you capture all of those variables with a wildcard? For example, maybe these variables are all contiguous within the data set and the whole list can be boiled down to flag___1-pcwclose or a few expressions like that.

2. Or maybe the list of variables that you don't have to do this for is short enough to list. For example, if the problem affects every variable in the data set except for variables called foobar, hello, and goodbye, then you could do this:
[code]
ds record_id cohort foobar hello goodbye, not
reshape wide `r(varlist)', i(record_id) j(cohort)
[/vofr]

3. Maybe you don't need all of these variables in your wide data set. Maybe you only need ptsd and status. Then
[code]
keep record_id cohort ptsd status
reshape wide ptsd status, i(record_id) j(cohort)
[/code]
will do the trick.

4. And I would guess that this is the most likely: you just shouldn't reshape wide in the first place. Why are you doing it? There are only a few things in Stata that work better (or even at all) in wide layout. So if you have a long layout data set you are usually best off leaving it that way; going wide usually just makes your life difficult or impossible. Can you identify specific things you need to do with your data that require a wide layout? If not, just forget about it. If so, my best guess is that you don't need all of the variables to do that, so that solution 3 could simplify the problem.
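As a minimal sketch of alternative 2, here is the -ds, not- approach on a tiny made-up data set. The data values and the variable names physfun and sex are invented for illustration; only record_id, cohort, and ptsd come from the question. It assumes sex is the one variable that is constant within record_id.

```stata
* Toy long-layout data (invented): 3 people, 2 follow-up waves each
clear
input str4 record_id byte cohort byte ptsd float physfun byte sex
"A001" 1 0 41.2 1
"A001" 2 1 38.5 1
"A002" 1 1 35.0 0
"A002" 2 1 36.1 0
"A003" 1 0 50.3 1
"A003" 2 0 52.8 1
end

* List every variable EXCEPT the identifiers and the constant variable
ds record_id cohort sex, not
local varying `r(varlist)'

* Reshape only the variables that vary within record_id
reshape wide `varying', i(record_id) j(cohort)

list, noobs
```

The wide data set should then hold one row per record_id, with ptsd1, ptsd2, physfun1, and physfun2 alongside the unchanged sex.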

Finally, assuming that you really do need to -reshape wide- and assuming that you really do need all your variables, there is the possibility that Stata will not be able to handle it, depending on the size of your data set as a whole, your available memory, and your flavor of Stata. If Stata fails to do the -reshape- because it runs out of memory while trying, then you will need to break your data set up into subsets of the variables, and then separately -reshape wide- each one (using code like 3.), and then put them all back together with a series of -merge 1:1 record_id- commands.
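The split-and-merge fallback could be sketched roughly as follows. This is an illustration under assumptions, not a tested recipe for the real data: the toy values and the variables pain and physfun are made up, and in practice each subset would contain many more variables.

```stata
* Toy long-layout data (invented values and variable names)
clear
input str4 record_id byte cohort byte ptsd byte status byte pain float physfun
"A001" 1 0 1 1 41.2
"A001" 2 1 1 0 38.5
"A002" 1 1 0 1 35.0
"A002" 2 1 1 1 36.1
"A003" 1 0 0 0 50.3
"A003" 2 0 0 0 52.8
end

tempfile long wide1
save `long'

* Subset 1: reshape a manageable group of variables and set it aside
use record_id cohort ptsd status using `long', clear
reshape wide ptsd status, i(record_id) j(cohort)
save `wide1'

* Subset 2: same again with the next group of variables
use record_id cohort pain physfun using `long', clear
reshape wide pain physfun, i(record_id) j(cohort)

* Put the pieces back together; every record_id should match
merge 1:1 record_id using `wide1', assert(match) nogenerate
```

The assert(match) option makes the merge stop with an error if any record_id appears in only one of the pieces, which is a cheap check that no subset lost observations.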
Nick Cox

Join Date: Mar 2014

Posts: 36058
#3

11 Jan 2020, 02:30

Clyde Schechter covers all the big points. (A glance at a QWERTY keyboard explains "vofr" as a typo for "code".)

Let's just add that comparing cohorts should be easy in long layout (you say "format", which makes sense, but I push a tip from Clyde a while back that "layout" is a good term, less overloaded with meanings, given file format, display format, and what not).

If you go

Code:

xtset record_id cohort

then you can use time series operators and subscripts alike. For example, change from cohort 1 to cohort 2 could be

Code:

gen change = D.foo

or

Code:

bysort record_id (cohort) : gen change = foo[2] - foo[1]

Easy comparison of cohort values is the most obvious reason for wanting wide, and you can do it in long.
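To make the two approaches concrete, here is a small sketch on invented data. The numeric record_id values and the foo values are made up; the subscript version sorts by cohort within each record_id so that foo[1] and foo[2] belong to the same person.

```stata
* Toy panel (invented): 2 people, 2 cohorts each
clear
input long record_id byte cohort float foo
1 1 40
1 2 45
2 1 52
2 2 50
end

xtset record_id cohort

* Time-series operator: difference from the previous cohort.
* Missing in cohort 1, because there is no earlier value to subtract.
gen change_d = D.foo

* Subscripts: the same difference, copied onto both rows of each person
bysort record_id (cohort): gen change_sub = foo[2] - foo[1]

list, sepby(record_id) noobs
```

The operator version leaves the cohort 1 rows missing, while the subscript version fills both rows, so which one is handier depends on whether later commands should count each person once or twice.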
Claudia Orlas

Join Date: Jan 2020

Posts: 13
#4

14 Jan 2020, 15:29

Hello there!

Thank you very much! After thinking about our best option, we have decided NOT to switch to wide form and will work with the database in long.
Moving on to declaring the data as panel data with xtset, I ran into a new problem:

-I have run this code:
xtset record_id cohort

-Stata now has produced this:
string variables not allowed in varlist;
record_id is a string variable
r(109)

Could you help me address this problem?

Thank you very much,

C
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#5

14 Jan 2020, 15:49

So you need to create a new variable that parallels record_id but is numeric.

Code:

egen numeric_record_id = group(record_id), label
xtset numeric_record_id cohort
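For anyone who wants to try this on something small first, here is a sketch with invented string identifiers. The ID values "A001" and "B017" and the ptsd values are made up for illustration.

```stata
* Toy data with a string identifier (invented values)
clear
input str4 record_id byte cohort byte ptsd
"A001" 1 0
"A001" 2 1
"B017" 1 1
"B017" 2 1
end

* xtset record_id cohort would fail here with r(109): record_id is a string

* Map each distinct string ID to an integer 1, 2, ...;
* the label option keeps the original strings as value labels
egen numeric_record_id = group(record_id), label

xtset numeric_record_id cohort
list, sepby(record_id) noobs
```

Because of the label option, listings still show the original identifiers even though the underlying variable is numeric.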
Claudia Orlas

Join Date: Jan 2020

Posts: 13
#6

15 Jan 2020, 09:17

Good morning.

Still working on the code. I had not initially planned on a time-series analysis and I am not an expert in it. I have already defined my time variable as a numeric variable (thanks, Clyde Schechter), but now I need to evaluate the change over time in my outcomes: ptsd (0=No, 1=Yes), pain (0=No, 1=Yes), no return to work (0=No, 1=Yes), physical functioning (SF-12 score, a continuous outcome), and mental functioning (SF-12 score, a continuous outcome). I have two time points defined by cohort (1=6 months, 2=12 months), and other covariates like age, sex, socioeconomic status, and health insurance. I want to start by estimating differences in each outcome across the categories of each covariate, for example:

-the difference of ptsd among age>50 y/o and age<50 y/o at 6 and 12 months (categorical outcome)
-the difference of ptsd among male vs female at 6 and 12 months (categorical outcome)
-the difference of SF-12 for physical functioning, male vs female at 6 and 12 months

Nick Cox, how can I use these variables with the time-series operators to estimate differences in those outcomes between the two time periods?

Thank you very much!

C
Nick Cox

Join Date: Mar 2014

Posts: 36058
#7

15 Jan 2020, 10:13

#3 still summarizes my suggestion. I don't see anything in #6 that changes the picture.
Claudia Orlas

Join Date: Jan 2020

Posts: 13
#8

16 Jan 2020, 11:46

Thank you very much for all your answers! The code has been working well in my analysis. Everything is so much easier in long format with the data declared as panel data.
There are some details that I did not mention before: I have a good proportion of missing data for my outcomes in both the 6-month and 12-month cohorts. In the end, I used multiple imputation including my outcome; even though I know there are some concerns about it, this approach is still possible.

Now I am at the final stage of this analysis and want to run my models for each outcome on my imputed data set, which is now declared as panel data (xtset). As I mentioned above, I have outcomes like ptsd, pain, and SF-12 (physical functioning). For my models, I do not want the outcome per se as the outcome variable, but rather the change in that outcome between my two time points (6 and 12 months). I assume that when I run my model after xtset, the model takes into account that there are two observations per individual in two periods. I don't want to miss that detail in my analysis. So, for example, for ptsd, my code is this one:

mi estimate: xtlogit ptsd age sex i.raceg i.edlev iss i.injcause i.comorb icu vent i.dischdispo

And this code works, no problem. My concern is: am I missing anything in the code? Do I have to include the variable "cohort" in the model, or any interaction, since my outcome should be the change in ptsd between cohort==1 and cohort==2?

Thank you once again for your help,

Claudia
Claudia Orlas

Join Date: Jan 2020

Posts: 13
#9

16 Jan 2020, 11:56

Sorry, I forgot to mention that I am working with random effects. So my model for ptsd is:

mi estimate: xtlogit ptsd age sex i.raceg i.edlev iss i.injcause i.comorb icu vent i.dischdispo, re nolog
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#10

16 Jan 2020, 12:41

And this code works, no problem. My concern is: am I missing anything in the code? Do I have to include the variable "cohort" in the model, or any interaction, since my outcome should be the change in ptsd between cohort==1 and cohort==2?

Yes, if you are trying to estimate change in PTSD between times 1 and 2, then you need to include the time variable (i.cohort) in the model. As for whether you also need any interactions, that depends. If you believe that the change from time 1 to time 2 itself depends on some of the other variables, then you need the interactions between cohort and those other variables. If, on the other hand, your hypothesis is that the distribution of changes in PTSD does not depend on the other variables, then no interactions are needed.
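A rough sketch of the two specifications, using simulated data rather than the real imputed data set: the variable id, the simulated effect sizes, and the choice of sex as the interacting covariate are all assumptions made only so the example runs on its own.

```stata
* Simulated panel (all values invented): 300 people, 2 cohorts each
clear
set seed 12345
set obs 300
gen long id = _n
gen byte sex = runiform() < 0.5
gen double u = rnormal()            // person-level random intercept
expand 2
bysort id: gen byte cohort = _n
gen byte ptsd = runiform() < invlogit(-0.5 + 0.4*(cohort == 2) + 0.3*sex + u)

xtset id cohort

* No interaction: one time effect, assumed the same for everyone.
* The coefficient on 2.cohort is the log odds ratio of ptsd at cohort 2 vs cohort 1.
xtlogit ptsd i.cohort i.sex, re nolog

* With interaction: the time effect is allowed to differ by sex
xtlogit ptsd i.cohort##i.sex, re nolog

* On the imputed data, the analogous command would take the mi prefix, e.g.
* mi estimate: xtlogit ptsd i.cohort age sex i.raceg i.edlev iss i.injcause i.comorb icu vent i.dischdispo, re nolog
```

In the interaction model, the 2.cohort#1.sex term is the part that answers whether the change between the two time points depends on the covariate.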