Problems with reshape

Adam Mitchell

Join Date: Oct 2021
Posts: 56

Problems with reshape

16 Apr 2024, 07:36

I have a dataset containing around 3000 individuals. Among these they have had samples taken at multiple times (1-9) but not everyone at everytime. My data is currently set up where each individual has multiple lines of data corresponding to these different sample times. See below.

The analysis I would want to do would require 1 row per person. I have tried to run a code to reshape and had 3 errors.

Code:

reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, i(LopNr) j(sampling_period)
variable sampling_period is string; specify option string
   Data are already wide. 

reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, i(LopNr) j(sampling_period)
variable sampling_period contains missing values

Can anyone advise how to handle this. Many thanks

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(LopNr OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa)
1237697   3.551   2.839   1.672   .575
1237697       .       .       .      .
1237697   2.577   1.547    .359   .099
1237697       .       .       .      .
1237697    4.07   4.036   3.403  2.039
1237697   1.195    .434    .054   -.05
1237697   3.787   3.859   3.164  2.008
1237697   3.289   3.169   2.529  1.349
1237697 2.53685 2.08185 1.39985 .67875
 307990   2.929   2.679   1.869   .887
 307990       .       .       .      .
 307990   3.388   2.172    .852   .234
 307990       .       .       .      .
 307990       .       .       .      .
 307990       .       .       .      .
 307990    1.25    .648    .201   .058
 307990   3.819   2.751   1.483   .496
 307990    .058    .022    .009   .003
1727694    1.46    .464    .152   .034
1727694   1.159    .438    .115   .024
1727694       .       .       .      .
1727694   1.773    .624    .163   .032
 178743   4.869   4.442   3.616  1.991
 178743       .       .       .      .
 178743   2.909   2.039   1.109   .389
 178743   3.007   2.894   2.007   .994
 178743   3.829   3.534   2.223   .874
 178743       .       .       .      .
 178743   3.027   1.807    .775   .194
 178743       .       .       .      .
 178743   3.294   2.382   1.164   .349
 727774    .574    .137    .042   .011
 727774       .       .       .      .
 727774    .297    .024   -.068  -.085
 727774       .       .       .      .
1473626    .338    .103    .036   .007
1473626   2.105   1.145    .307   .024
1473626   2.639   1.949   1.119   .339
1473626       .       .       .      .
1801642   3.774   3.618   3.263  1.855
1801642   3.682   2.983   2.039   .923
1801642    3.18    3.19     2.3   1.14
1801642       .       .       .      .
1801642  3.6643  3.3288  3.0208 1.9142
1801642   2.347    .824     .21   .039
1801642       .       .       .      .
1801642    3.91   3.135   1.808   .592
1801642   3.342     3.1   2.234   .881
2012813       .       .       .      .
2012813    3.55    3.48    2.51   1.11
2012813       .       .       .      .
1541383  3.7046  3.4827  3.5773 2.8569
1541383   3.642   3.412   3.322  2.482
1541383       .       .       .      .
1541383       .       .       .      .
1541383    3.63   3.416   3.111  2.104
1541383  2.8932  2.8812  2.3228 1.3494
1541383       .       .       .      .
1541383   4.102   4.119   4.085  3.525
1541383   2.682   2.794   2.691  2.015
 338556    3.19    3.02    2.53   1.26
 338556   1.317    .509    .096   .023
 338556  3.6907  3.8486  3.0278 1.7716
 338556   3.518   2.841   1.785   .685
1227967       .       .       .      .
1227967  2.0863  1.8294  1.3108   .658
1227967   2.647   1.697    .551   .136
1227967       .       .       .      .
1227967   2.792   2.858   2.336  1.269
1227967  3.7295  3.5983  3.2165 1.8078
1227967   3.135   3.005   2.085  1.055
1227967   4.098   3.791   2.823  1.322
1227967   3.258   2.958   2.258  1.263
 740119     3.1    2.76    1.83   .694
 740119   3.765   3.817   3.614  2.874
 740119  3.6311   3.551  2.9041 1.7918
 740119   3.965   3.617   2.576  1.151
 740119   2.642   2.817   2.489  1.521
 740119    .551    .173    .039   .008
 740119   3.508   2.676   1.605   .659
1631975     4.2    4.34   4.282  3.412
1631975       .       .       .      .
1631975   2.759    2.79   2.428  1.352
1631975       .       .       .      .
1631975  2.0784  1.9907  1.5099   .802
1631975   3.603   3.298   2.748  1.407
1631975       .       .       .      .
1631975    3.17    3.15    2.19    .88
1631975  3.5436  3.3568  3.0562 1.9475
1589223    3.83   4.123   3.686  2.391
1589223  3.7966  3.4984    2.66 1.3485
1589223   3.225   2.915   1.485    .49
1589223       .       .       .      .
1589223   3.477   2.611   1.342   .452
 471713  3.4473  3.4653  3.2446  2.697
 471713       .       .       .      .
 471713    1.22    .383    .093   .028
 471713    3.28    3.23    2.26    .91
1813011    1.46    .427    .106   .029
1813011    3.15    3.16    2.03   .784
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#2

16 Apr 2024, 09:46

Thank you for intending to show example data. But your example data doesn't contain the sampling period variable, which is, according to the error messages, part of the problem. So we can't really troubleshoot without that.

At a guess, the observations that have missing values for all four values of the OD_1_*_Elisa variables also have missing values for the sampling period variable. It is not legal to have missing values in the variable specified in the -j()- variable. The solution would be to drop those observations from the data set before trying to -reshape-.

More problematic would be if there are other observations that have missing values for the sampling period but do contain OD_1_*_Elisa variable values. In that case, dropping those observations will cause data loss. I would think, however, based on your data layout, that having a missing value for the sampling period variable when there is other data represents a data error in the first place, and the real solution would be to identify the correct value for the sampling period and put it in there.
1 like
Comment
Scott Merryman

Join Date: Mar 2014

Posts: 895
#3

16 Apr 2024, 09:47

You did not include the sampling period variable, but this works:

Code:

bys LopNr: gen sampleperiod = _n reshape wide OD*, i(Lop) j(sampleperiod)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#4

16 Apr 2024, 11:07

Re #3. Yes, that would work. But because the error message O.P. received says that sampleperiod is a string variable, that is not what his sampleperiod variable looks like. It is a string, and it has some missing values. And if the string consists of, for example, string representations of dates, things could be more complicated than just replacing it by a sequential integer. Also, given the nature of the data, it is hard to imagine that missing values for the sampleperiod variable don't represent data errors. Ultimately that may be traceable to errors in data management leading up to this data set. I think O.P. needs to look into this more deeply before adopting the solution in #3.
1 like
Comment

Adam Mitchell

Join Date: Oct 2021
Posts: 56

17 Apr 2024, 07:11

Thank you for the advice. I now understand from colleagues that we do have an error in the data, the missing sample periods will be corrected shortly so that we have complete information on that variable for j. However I still wanted to see if for the time being I could work with this data. I attempted the code suggested by Scott and it was successful in terms of allowing me to run a reshape command however I ended up with 669 columns per variable. I only have 9 sample periods. I then thought about your suggestion Clyde and decided to drop the missing sample periods but encountered a new error below which I am not sure how to handle.

Code:

reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, /// 
i(LopNr) j(sampling_period) string

string j variable sampling_period not allowed to contain spaces;
    encode string variable sampling_period into a numeric variable and
    use the new numeric variable as the j variable

encode sampling_period, gen (samplingperiod)
drop if samplingperiod== 0
drop if samplingperiod== 1
drop if samplingperiod== .

reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, /// 
i(LopNr) j(samplingperiod)
(j = 2 3 4 5 6 7 8 9 10)

values of variable samplingperiod not unique within LopNr
    Your data are currently long. You are performing a reshape wide. You
    specified i(LopNr) and j(samplingperiod). There are observations
    within i(LopNr) with the same value of j(samplingperiod). In the long
    data, variables i() and j() together must uniquely identify the
    observations.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#6

17 Apr 2024, 10:32

values of variable samplingperiod not unique within LopNr

So, this is another problem with your sampling period variable. In addition to having forbidden missing values, it turns out that the same LopNr can have more than one observation with the same value for sampling period. This is not permissible with -reshape-, but more important, it, too, indicates that there is something seriously wrong with the data.

In the situations where the same LopNr has multiple observations with the same value for sampling period, there are two generic possibilities:
These observations are complete duplicates; that is, they have the same values for all the OD_1_*_Elisa variables (and any other variables in your data set that you haven't shown.)

These observations contain contradictory values on some of the other variables.

In case 1, you can resolve the problem with -duplicates drop-. No information lost. BUT, you usually do not have situations like this when a data set has been constructed correctly. It is likely that the existence of these duplicate observations arose due to an error in the data management that produced the data set. It is a general rule of programming that where one error is found, others may lurk. So you should review carefully the data management that created this data set and look for where these duplicates crept in. When you find that, look for other data management errors in the same program. Clean up all your code, and re-generate the data set so that it doesn't contain the duplicates in the first place and fixes any other errors you uncovered.

In case 2, you have a more serious problem because the data are now seen to be self-contradictory. Apart from just changing the code that created this data set so as to not create duplicate observations, you have to figure out which of the duplicate observations, if any, contains the correct value for the variables that disagree, and assure that only observations with the correct values are included in the regenerated data set.

In short, at this point you are aware that your data set is incorrect. You cannot make "progress" with this data set. You need to fix it first.
Comment

Announcement

Problems with reshape

Comment

Comment

Comment

Comment

Comment