Regressions with 'long'/panel data: misleading test statistics?

Zach Goldberg

Join Date: Jul 2017

Posts: 184
#1

09 Jun 2021, 19:18

Greetings,

I'm running Stata 15.1 on macOS and am currently working with Pew panel data. I believe my question is very basic. I'd like to measure the relationship between a continuous independent variable and an ordinal dependent variable (note: there are other variables whose relationships I'm interested in, but I will use the current case as an example). One (the x or independent variable) was measured in the April 2020 wave of the survey, and the other (the dependent variable) was measured in the October 2020 wave. Because my dataset also consists of variables measured in other waves, I opted to reshape the data to 'long' format. However, I noticed that model test statistics are larger in regressions of data in 'long' than in 'wide' format:

Long Format

Code:

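. ologit AF_GOOD4 mhindex_meanZ, or

Iteration 0:   log likelihood = -27851.383
Iteration 1:   log likelihood =  -27270.49
Iteration 2:   log likelihood = -27268.927
Iteration 3:   log likelihood = -27268.927

Ordered logistic regression                     Number of obs     =     23,538
                                                LR chi2(1)        =    1164.91
                                                Prob > chi2       =     0.0000
Log likelihood = -27268.927                     Pseudo R2         =     0.0209

-------------------------------------------------------------------------------
     AF_GOOD4 | Odds Ratio   Std. Err.       z   P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
mhindex_meanZ |   1.547024     .019962   33.82   0.000      1.50839    1.586648
--------------+----------------------------------------------------------------
        /cut1 |  -.0222594    .0132795                    -.0482869     .003768
        /cut2 |   .9936055     .014874                     .9644529    1.022758
        /cut3 |   2.788631    .0274565                     2.734817    2.842444
-------------------------------------------------------------------------------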

Wide Format

Code:

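. ologit AF_GOOD4 mhindex6466_meanZ, or

Iteration 0:   log likelihood = -9283.7942
Iteration 1:   log likelihood = -9090.1634
Iteration 2:   log likelihood = -9089.6425
Iteration 3:   log likelihood = -9089.6424

Ordered logistic regression                     Number of obs     =      7,846
                                                LR chi2(1)        =     388.30
                                                Prob > chi2       =     0.0000
Log likelihood = -9089.6424                     Pseudo R2         =     0.0209

-----------------------------------------------------------------------------------
         AF_GOOD4 | Odds Ratio   Std. Err.       z   P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
mhindex6466_meanZ |   1.547043    .0345765   19.52   0.000     1.480737    1.616317
------------------+----------------------------------------------------------------
            /cut1 |  -.0222594    .0230008                    -.0673403    .0228214
            /cut2 |   .9936055    .0257626                     .9431117    1.044099
            /cut3 |   2.788631     .047556                     2.695422    2.881839
-----------------------------------------------------------------------------------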

Of course, this is not surprising given that 'long' format includes multiple rows (one per wave) for each respondent. But my question is whether the inflated test statistics can be trusted. As more control variables are added, it's possible that variables that remain significant in 'long' format are no longer significant in 'wide' format. I'm thus not sure how to approach this issue. Am I better off sticking to wide format? Is there a way to obtain 'adjusted' test statistics in long format? Or perhaps I'm perceiving a problem that really isn't a problem (?).

Any input you can provide will be much appreciated. Thank you!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

09 Jun 2021, 19:53

In your wide layout you have 7,846 observations, while in your long layout you have 23,538 observations, precisely three times as many.
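To see why that factor of three matters, assume for the moment that the long dataset is just the wide dataset with every row repeated three times (an assumption, not something the output proves). In that case the estimates do not change, the log likelihood and the LR chi2 are multiplied by 3, and the standard errors shrink by a factor of sqrt(3). The ratios from the two outputs in post #1 can be checked with display:

```stata
* Ratios of the long-format to the wide-format results in post #1

* number of observations
display 23538/7846
* log likelihood
display 27268.927/9089.6424
* LR chi2(1)
display 1164.91/388.30

* standard error of the odds ratio, wide over long
display .0345765/.019962
* z statistic, long over wide
display 33.82/19.52
* the factor expected under pure triplication
display sqrt(3)
```

The first ratio is exactly 3, the next two are 3 to within rounding, and the last two ratios are both close to sqrt(3), which is the pattern pure triplication would produce.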

I have a feeling that however you structured your long layout, you ended up with copies of AF_GOOD4 and mhindex_meanZ in more observations than they should have been in. Perhaps your original ologit command should be something similar to

Code:

ologit AF_GOOD4 mhindex_meanZ if year==2020, or

but without a better idea of your data, it's difficult to say.

Zach Goldberg

Join Date: Jul 2017
Posts: 184

09 Jun 2021, 20:52

William,

Thanks for the reply.

Some potentially relevant information I neglected to include: each variable title had a suffix or stub indicating the survey wave (e.g. 64, 66, 76) in which it was measured. The items constituting the index (mhindex_meanZ) that I'm using as my IV were measured in March (wave 64) and again in April (wave 66). Given the short intervening period, and the fact that not all panelists participated in both waves, I opted to take the average of the March and April measurements (i.e. I created an average index that includes panelists that either provided data in both or only one of the waves). I'm wondering whether this is the issue. The dependent variable was measured only in October (wave 76).

Here is sample data in wide form:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double caseid float(mhindex64 mhindex66) double AF_GOOD476
100260 1.25    1 1
100637  2.5    2 1
101472 1.75 1.75 2
101493    2  1.5 1
103094    3 3.25 .
103538  1.5  1.5 .
103611    2    2 .
104210  1.5 1.75 4
104368  1.5  1.5 3
104491 2.25    2 1
104689    2  2.5 1
104727 1.75    2 1
104937    1    1 .
106590  2.5 2.25 1
106960    . 1.75 .
107329 2.25 2.25 2
108035    1    1 1
108348    .    3 1
108435 1.75    2 .
109143    1  1.5 2
110550 1.25    1 1
111665 3.25    3 2
112238    2 2.25 2
112490 3.25 3.25 3
112613    1 1.25 1
112984 2.75    3 2
113248    .    1 3
113412 2.25 2.25 1
114058 3.25  3.5 2
114671    1 2.25 1
115295 1.25    1 1
115546 1.75 1.25 2
115706    .    . .
115807    1    1 .
116151    1    1 1
116264  1.5 2.75 1
116832 2.75    . .
116998    2  2.5 3
118110    1    1 1
118414  1.5    1 1
118847  1.5    2 .
118888  1.5  1.5 2
119121  2.5 3.25 2
119392 1.25    1 .
119548    3 1.75 1
120343 1.75    1 1
120873    1    1 3
121158    2 2.75 1
121582    2    2 1
122503    2  2.5 2
124561    3 1.75 2
125280 3.75 3.75 .
126131  1.5 1.25 3
126211    2  2.5 1
126570 1.25    1 1
127160    1    1 1
127250    .    . .
127284 2.25    2 .
127498 3.25 1.75 1
128285 2.25  1.5 3
128558    1    1 1
129622    .    1 .
131786    1 2.75 1
132246  1.5  1.5 1
132264  1.5  1.5 1
132478 2.75 2.25 3
132973  1.5  1.5 3
133435 1.25    . .
133550 2.75    2 2
133700 1.75  2.5 2
134129 2.25  1.5 .
135293 2.25    2 .
135751 1.25 1.25 2
135822    1    1 1
136046  2.5 2.75 2
136999 1.25    1 1
137139    1    1 2
138105    2 2.75 2
139905  2.5  1.5 2
140204    3 1.25 .
141319 1.25 1.25 4
141471    1    1 .
142398    2 1.25 .
143122  2.5 2.25 .
143915  1.5 1.75 .
144036    1    1 1
144120    .    . .
144429 1.25    1 1
145887    .    1 .
146434 2.75 2.75 1
147121  3.5    . .
147316    2  2.5 .
149866 2.25  2.5 .
150084 3.25 1.75 4
150280  1.5  1.5 .
152324 1.75  1.5 .
152957    . 2.25 .
153863    .  1.5 3
154750    1  1.5 1
155164    .    . .
end

If the above has too many missings to work with, here is also sample data consisting of panelists with complete responses (i.e. they provided measures in March AND April):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double caseid float(mhindex64 mhindex66) double AF_GOOD476
100260 1.25    1 1
100637  2.5    2 1
101472 1.75 1.75 2
101493    2  1.5 1
104210  1.5 1.75 4
104368  1.5  1.5 3
104491 2.25    2 1
104689    2  2.5 1
104727 1.75    2 1
106590  2.5 2.25 1
107329 2.25 2.25 2
108035    1    1 1
109143    1  1.5 2
110550 1.25    1 1
111665 3.25    3 2
112238    2 2.25 2
112490 3.25 3.25 3
112613    1 1.25 1
112984 2.75    3 2
113412 2.25 2.25 1
114058 3.25  3.5 2
114671    1 2.25 1
115295 1.25    1 1
115546 1.75 1.25 2
116151    1    1 1
116264  1.5 2.75 1
116998    2  2.5 3
118110    1    1 1
118414  1.5    1 1
118888  1.5  1.5 2
119121  2.5 3.25 2
119548    3 1.75 1
120343 1.75    1 1
120873    1    1 3
121158    2 2.75 1
121582    2    2 1
122503    2  2.5 2
124561    3 1.75 2
126131  1.5 1.25 3
126211    2  2.5 1
126570 1.25    1 1
127160    1    1 1
127498 3.25 1.75 1
128285 2.25  1.5 3
128558    1    1 1
131786    1 2.75 1
132246  1.5  1.5 1
132264  1.5  1.5 1
132478 2.75 2.25 3
132973  1.5  1.5 3
133550 2.75    2 2
133700 1.75  2.5 2
135751 1.25 1.25 2
135822    1    1 1
136046  2.5 2.75 2
136999 1.25    1 1
137139    1    1 2
138105    2 2.75 2
139905  2.5  1.5 2
141319 1.25 1.25 4
144036    1    1 1
144429 1.25    1 1
146434 2.75 2.75 1
150084 3.25 1.75 4
154750    1  1.5 1
155464 1.75    1 1
156615 1.75 1.75 3
157570  2.5 1.75 1
157730  1.5 1.25 1
159997 1.25 1.75 1
162341 2.25    1 3
162517 1.25    1 3
164283 2.25 2.25 3
164425    2 2.25 1
165452  1.5 1.25 1
166301 1.75 2.25 4
168336 2.25 2.75 2
169249  3.5    2 1
169864  1.5  1.5 1
170940 1.25 1.25 1
171264    3  2.5 2
174032    4  2.5 1
175363 1.75 1.75 1
176214 2.25    2 2
176308    2    2 3
176940 1.25 1.25 1
177572 1.75 1.75 2
178470 2.75  2.5 1
179691 1.75  2.5 1
179979    3    3 1
183145 1.25 1.75 1
185394 1.75 2.25 4
187312 3.25 2.75 1
188245    1  1.5 1
189589  2.5    2 3
190780  1.5 1.25 3
190961 3.25  3.5 1
191063 1.75  1.5 1
191336 3.25 2.75 3
192294 2.25    2 2
end

Note: caseid = panel ID (so you can try reshaping the data yourself).

My 'reshape' syntax was as follows:

Code:

reshape long mhindex, i(caseid) j(wave)

Note: I did not reshape AF_GOOD4 because it was measured only once.

Thanks again for your help!

William Lisowski

Join Date: Dec 2014
Posts: 10150

10 Jun 2021, 09:47

If we run the code you provided on the first 10 observations of your example data

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double caseid float(mhindex64 mhindex66) double AF_GOOD476
100260 1.25    1 1
100637  2.5    2 1
101472 1.75 1.75 2
101493    2  1.5 1
103094    3 3.25 .
103538  1.5  1.5 .
103611    2    2 .
104210  1.5 1.75 4
104368  1.5  1.5 3
104491 2.25    2 1
end
reshape long mhindex, i(caseid) j(wave)

we produce the 20 observations of reshaped data

Code:

. list, abbreviate(12) sepby(caseid)

     +--------------------------------------+
     | caseid   wave   mhindex   AF_GOOD476 |
     |--------------------------------------|
  1. | 100260     64      1.25            1 |
  2. | 100260     66         1            1 |
     |--------------------------------------|
  3. | 100637     64       2.5            1 |
  4. | 100637     66         2            1 |
     |--------------------------------------|
  5. | 101472     64      1.75            2 |
  6. | 101472     66      1.75            2 |
     |--------------------------------------|
  7. | 101493     64         2            1 |
  8. | 101493     66       1.5            1 |
     |--------------------------------------|
  9. | 103094     64         3            . |
 10. | 103094     66      3.25            . |
     |--------------------------------------|
 11. | 103538     64       1.5            . |
 12. | 103538     66       1.5            . |
     |--------------------------------------|
 13. | 103611     64         2            . |
 14. | 103611     66         2            . |
     |--------------------------------------|
 15. | 104210     64       1.5            4 |
 16. | 104210     66      1.75            4 |
     |--------------------------------------|
 17. | 104368     64       1.5            3 |
 18. | 104368     66       1.5            3 |
     |--------------------------------------|
 19. | 104491     64      2.25            1 |
 20. | 104491     66         2            1 |
     +--------------------------------------+

You tell us your independent variable is the average of the observations of mhindex in waves 64 and 66, but you don't tell us how you created that variable (mhindex6466_meanZ in the wide dataset and mhindex_meanZ in the long dataset).

I believe you made the following mistake.

Code:

generate mhindex6466 = (mhindex64+mhindex66)/2
reshape long mhindex, i(caseid) j(wave)
rename mhindex mhindex6466_meanZ
ologit AF_GOOD4  mhindex6466_meanZ, or

But here is the data the ologit command is run on.

Code:

. list, abbreviate(18) sepby(caseid)

     +------------------------------------------------+
     | caseid   wave   AF_GOOD476   mhindex6466_meanZ |
     |------------------------------------------------|
  1. | 100260     64            1                1.25 |
  2. | 100260     66            1                   1 |
  3. | 100260   6466            1               1.125 |
     |------------------------------------------------|
  4. | 100637     64            1                 2.5 |
  5. | 100637     66            1                   2 |
  6. | 100637   6466            1                2.25 |
     |------------------------------------------------|
  7. | 101472     64            2                1.75 |
  8. | 101472     66            2                1.75 |
  9. | 101472   6466            2                1.75 |
     |------------------------------------------------|
 10. | 101493     64            1                   2 |
 11. | 101493     66            1                 1.5 |
 12. | 101493   6466            1                1.75 |
     |------------------------------------------------|
 13. | 103094     64            .                   3 |
 14. | 103094     66            .                3.25 |
 15. | 103094   6466            .               3.125 |
     |------------------------------------------------|
 16. | 103538     64            .                 1.5 |
 17. | 103538     66            .                 1.5 |
 18. | 103538   6466            .                 1.5 |
     |------------------------------------------------|
 19. | 103611     64            .                   2 |
 20. | 103611     66            .                   2 |
 21. | 103611   6466            .                   2 |
     |------------------------------------------------|
 22. | 104210     64            4                 1.5 |
 23. | 104210     66            4                1.75 |
 24. | 104210   6466            4               1.625 |
     |------------------------------------------------|
 25. | 104368     64            3                 1.5 |
 26. | 104368     66            3                 1.5 |
 27. | 104368   6466            3                 1.5 |
     |------------------------------------------------|
 28. | 104491     64            1                2.25 |
 29. | 104491     66            1                   2 |
 30. | 104491   6466            1               2.125 |
     +------------------------------------------------+

Do you see? You have three times as many observations in the long dataset as you had in the wide dataset, which is exactly the problem with your results in post #1 that I pointed out in post #2.
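A quick check along these lines can catch the problem before any model is run. This is only a sketch on the first three rows of your example data, using the generate and reshape steps I guessed at above; rows_per_id is a name I made up for illustration.

```stata
clear
input double caseid float(mhindex64 mhindex66) double AF_GOOD476
100260 1.25    1 1
100637  2.5    2 1
101472 1.75 1.75 2
end

* the average shares the stub mhindex, so reshape treats 6466 as a third wave
generate mhindex6466 = (mhindex64+mhindex66)/2
reshape long mhindex, i(caseid) j(wave)

* count how many rows each panelist contributes after the reshape
bysort caseid: generate rows_per_id = _N
tabulate rows_per_id

* and see which values of wave the reshape created
tabulate wave
```

If rows_per_id is larger than 1 and the dependent variable was measured only once per panelist, the regression needs an if qualifier (or a differently built dataset) so that each panelist is counted once.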

You should have run the command

Code:

ologit AF_GOOD4  mhindex6466_meanZ if wave==6466, or

to limit your ologit to just those observations having the average value for mhindex (from wave "6466"), as I suggested in post #2.

William Lisowski

Join Date: Dec 2014
Posts: 10150

10 Jun 2021, 13:00

On further reflection, more changes to your code would be better.

Code:

generate mhmeanindex66 = (mhindex64+mhindex66)/2
reshape long mhindex mhmeanindex, i(caseid) j(wave)
list, abbreviate(18) sepby(caseid)
ologit AF_GOOD4 mhmeanindex, or

Now this is the data the ologit command will be run on.

Code:

. list, abbreviate(18) sepby(caseid)

     +----------------------------------------------------+
     | caseid   wave   mhindex   AF_GOOD476   mhmeanindex |
     |----------------------------------------------------|
  1. | 100260     64      1.25            1             . |
  2. | 100260     66         1            1         1.125 |
     |----------------------------------------------------|
  3. | 100637     64       2.5            1             . |
  4. | 100637     66         2            1          2.25 |
     |----------------------------------------------------|
  5. | 101472     64      1.75            2             . |
  6. | 101472     66      1.75            2          1.75 |
     |----------------------------------------------------|
  7. | 101493     64         2            1             . |
  8. | 101493     66       1.5            1          1.75 |
     |----------------------------------------------------|
  9. | 103094     64         3            .             . |
 10. | 103094     66      3.25            .         3.125 |
     |----------------------------------------------------|
 11. | 103538     64       1.5            .             . |
 12. | 103538     66       1.5            .           1.5 |
     |----------------------------------------------------|
 13. | 103611     64         2            .             . |
 14. | 103611     66         2            .             2 |
     |----------------------------------------------------|
 15. | 104210     64       1.5            4             . |
 16. | 104210     66      1.75            4         1.625 |
     |----------------------------------------------------|
 17. | 104368     64       1.5            3             . |
 18. | 104368     66       1.5            3           1.5 |
     |----------------------------------------------------|
 19. | 104491     64      2.25            1             . |
 20. | 104491     66         2            1         2.125 |
     +----------------------------------------------------+

As you can see, ologit will now use at most one observation for each value of caseid, rather than three, because mhmeanindex is nonmissing only in the wave 66 row.
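One caveat about the generate line: (mhindex64+mhindex66)/2 is missing whenever either wave is missing, whereas you described an average that keeps panelists who answered only one of the two waves. A minimal sketch of that version, assuming egen's rowmean() function is what you want here, on a few rows of your example data chosen to include missing values:

```stata
clear
input double caseid float(mhindex64 mhindex66) double AF_GOOD476
100260 1.25    1 1
106960    . 1.75 .
108348    .    3 1
116832 2.75    . .
115706    .    . .
end

* rowmean() averages the nonmissing values in each row;
* the result is missing only when both waves are missing
egen mhmeanindex66 = rowmean(mhindex64 mhindex66)

* the 66 suffix places the average in the wave 66 row after the reshape;
* there is no mhmeanindex64, so that row is filled with missing
reshape long mhindex mhmeanindex, i(caseid) j(wave)
list, abbreviate(18) sepby(caseid)
```

Whether a one-wave value should count as the "average" is a substantive choice about your index, not a coding question, so treat this as one option rather than a recommendation.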
