Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with reshape

    I have a dataset containing around 3000 individuals. Among these they have had samples taken at multiple times (1-9) but not everyone at everytime. My data is currently set up where each individual has multiple lines of data corresponding to these different sample times. See below.

    The analysis I would want to do would require 1 row per person. I have tried to run a code to reshape and had 3 errors.

    Code:
    reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, i(LopNr) j(sampling_period)
    variable sampling_period is string; specify option string
       Data are already wide. 
    
    reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, i(LopNr) j(sampling_period)
    variable sampling_period contains missing values
    Can anyone advise how to handle this. Many thanks

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double(LopNr OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa)
    1237697   3.551   2.839   1.672   .575
    1237697       .       .       .      .
    1237697   2.577   1.547    .359   .099
    1237697       .       .       .      .
    1237697    4.07   4.036   3.403  2.039
    1237697   1.195    .434    .054   -.05
    1237697   3.787   3.859   3.164  2.008
    1237697   3.289   3.169   2.529  1.349
    1237697 2.53685 2.08185 1.39985 .67875
     307990   2.929   2.679   1.869   .887
     307990       .       .       .      .
     307990   3.388   2.172    .852   .234
     307990       .       .       .      .
     307990       .       .       .      .
     307990       .       .       .      .
     307990    1.25    .648    .201   .058
     307990   3.819   2.751   1.483   .496
     307990    .058    .022    .009   .003
    1727694    1.46    .464    .152   .034
    1727694   1.159    .438    .115   .024
    1727694       .       .       .      .
    1727694   1.773    .624    .163   .032
     178743   4.869   4.442   3.616  1.991
     178743       .       .       .      .
     178743   2.909   2.039   1.109   .389
     178743   3.007   2.894   2.007   .994
     178743   3.829   3.534   2.223   .874
     178743       .       .       .      .
     178743   3.027   1.807    .775   .194
     178743       .       .       .      .
     178743   3.294   2.382   1.164   .349
     727774    .574    .137    .042   .011
     727774       .       .       .      .
     727774    .297    .024   -.068  -.085
     727774       .       .       .      .
    1473626    .338    .103    .036   .007
    1473626   2.105   1.145    .307   .024
    1473626   2.639   1.949   1.119   .339
    1473626       .       .       .      .
    1801642   3.774   3.618   3.263  1.855
    1801642   3.682   2.983   2.039   .923
    1801642    3.18    3.19     2.3   1.14
    1801642       .       .       .      .
    1801642  3.6643  3.3288  3.0208 1.9142
    1801642   2.347    .824     .21   .039
    1801642       .       .       .      .
    1801642    3.91   3.135   1.808   .592
    1801642   3.342     3.1   2.234   .881
    2012813       .       .       .      .
    2012813    3.55    3.48    2.51   1.11
    2012813       .       .       .      .
    1541383  3.7046  3.4827  3.5773 2.8569
    1541383   3.642   3.412   3.322  2.482
    1541383       .       .       .      .
    1541383       .       .       .      .
    1541383    3.63   3.416   3.111  2.104
    1541383  2.8932  2.8812  2.3228 1.3494
    1541383       .       .       .      .
    1541383   4.102   4.119   4.085  3.525
    1541383   2.682   2.794   2.691  2.015
     338556    3.19    3.02    2.53   1.26
     338556   1.317    .509    .096   .023
     338556  3.6907  3.8486  3.0278 1.7716
     338556   3.518   2.841   1.785   .685
    1227967       .       .       .      .
    1227967  2.0863  1.8294  1.3108   .658
    1227967   2.647   1.697    .551   .136
    1227967       .       .       .      .
    1227967   2.792   2.858   2.336  1.269
    1227967  3.7295  3.5983  3.2165 1.8078
    1227967   3.135   3.005   2.085  1.055
    1227967   4.098   3.791   2.823  1.322
    1227967   3.258   2.958   2.258  1.263
     740119     3.1    2.76    1.83   .694
     740119   3.765   3.817   3.614  2.874
     740119  3.6311   3.551  2.9041 1.7918
     740119   3.965   3.617   2.576  1.151
     740119   2.642   2.817   2.489  1.521
     740119    .551    .173    .039   .008
     740119   3.508   2.676   1.605   .659
    1631975     4.2    4.34   4.282  3.412
    1631975       .       .       .      .
    1631975   2.759    2.79   2.428  1.352
    1631975       .       .       .      .
    1631975  2.0784  1.9907  1.5099   .802
    1631975   3.603   3.298   2.748  1.407
    1631975       .       .       .      .
    1631975    3.17    3.15    2.19    .88
    1631975  3.5436  3.3568  3.0562 1.9475
    1589223    3.83   4.123   3.686  2.391
    1589223  3.7966  3.4984    2.66 1.3485
    1589223   3.225   2.915   1.485    .49
    1589223       .       .       .      .
    1589223   3.477   2.611   1.342   .452
     471713  3.4473  3.4653  3.2446  2.697
     471713       .       .       .      .
     471713    1.22    .383    .093   .028
     471713    3.28    3.23    2.26    .91
    1813011    1.46    .427    .106   .029
    1813011    3.15    3.16    2.03   .784
    end

  • #2
    Thank you for intending to show example data. But your example data doesn't contain the sampling period variable, which is, according to the error messages, part of the problem. So we can't really troubleshoot without that.

    At a guess, the observations that have missing values for all four values of the OD_1_*_Elisa variables also have missing values for the sampling period variable. It is not legal to have missing values in the variable specified in the -j()- variable. The solution would be to drop those observations from the data set before trying to -reshape-.

    More problematic would be if there are other observations that have missing values for the sampling period but do contain OD_1_*_Elisa variable values. In that case, dropping those observations will cause data loss. I would think, however, based on your data layout, that having a missing value for the sampling period variable when there is other data represents a data error in the first place, and the real solution would be to identify the correct value for the sampling period and put it in there.

    Comment


    • #3
      You did not include the sampling period variable, but this works:

      Code:
      bys LopNr: gen sampleperiod = _n
      reshape wide OD*, i(Lop) j(sampleperiod)

      Comment


      • #4
        Re #3. Yes, that would work. But because the error message O.P. received says that sampleperiod is a string variable, that is not what his sampleperiod variable looks like. It is a string, and it has some missing values. And if the string consists of, for example, string representations of dates, things could be more complicated than just replacing it by a sequential integer. Also, given the nature of the data, it is hard to imagine that missing values for the sampleperiod variable don't represent data errors. Ultimately that may be traceable to errors in data management leading up to this data set. I think O.P. needs to look into this more deeply before adopting the solution in #3.

        Comment


        • #5
          Thank you for the advice. I now understand from colleagues that we do have an error in the data, the missing sample periods will be corrected shortly so that we have complete information on that variable for j. However I still wanted to see if for the time being I could work with this data. I attempted the code suggested by Scott and it was successful in terms of allowing me to run a reshape command however I ended up with 669 columns per variable. I only have 9 sample periods. I then thought about your suggestion Clyde and decided to drop the missing sample periods but encountered a new error below which I am not sure how to handle.

          Code:
          reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, /// 
          i(LopNr) j(sampling_period) string
          
          string j variable sampling_period not allowed to contain spaces;
              encode string variable sampling_period into a numeric variable and
              use the new numeric variable as the j variable
          
          encode sampling_period, gen (samplingperiod)
          drop if samplingperiod== 0
          drop if samplingperiod== 1
          drop if samplingperiod== .
          
          reshape wide OD_1_50_Elisa OD_1_250_Elisa OD_1_1250_Elisa OD_1_6250_Elisa, /// 
          i(LopNr) j(samplingperiod)
          (j = 2 3 4 5 6 7 8 9 10)
          
          values of variable samplingperiod not unique within LopNr
              Your data are currently long. You are performing a reshape wide. You
              specified i(LopNr) and j(samplingperiod). There are observations
              within i(LopNr) with the same value of j(samplingperiod). In the long
              data, variables i() and j() together must uniquely identify the
              observations.

          Comment


          • #6
            values of variable samplingperiod not unique within LopNr
            So, this is another problem with your sampling period variable. In addition to having forbidden missing values, it turns out that the same LopNr can have more than one observation with the same value for sampling period. This is not permissible with -reshape-, but more important, it, too, indicates that there is something seriously wrong with the data.

            In the situations where the same LopNr has multiple observations with the same value for sampling period, there are two generic possibilities:
            1. These observations are complete duplicates; that is, they have the same values for all the OD_1_*_Elisa variables (and any other variables in your data set that you haven't shown.)
            2. These observations contain contradictory values on some of the other variables.
            In case 1, you can resolve the problem with -duplicates drop-. No information lost. BUT, you usually do not have situations like this when a data set has been constructed correctly. It is likely that the existence of these duplicate observations arose due to an error in the data management that produced the data set. It is a general rule of programming that where one error is found, others may lurk. So you should review carefully the data management that created this data set and look for where these duplicates crept in. When you find that, look for other data management errors in the same program. Clean up all your code, and re-generate the data set so that it doesn't contain the duplicates in the first place and fixes any other errors you uncovered.

            In case 2, you have a more serious problem because the data are now seen to be self-contradictory. Apart from just changing the code that created this data set so as to not create duplicate observations, you have to figure out which of the duplicate observations, if any, contains the correct value for the variables that disagree, and assure that only observations with the correct values are included in the regenerated data set.

            In short, at this point you are aware that your data set is incorrect. You cannot make "progress" with this data set. You need to fix it first.

            Comment

            Working...
            X