Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling-in variable's missing values with the average of few variable's observations

    HI

    I have the dataset counting 161 observations. The dataset looks as following:

    ID_2 NAME_2 innovation
    31462 Region 1 .3636364
    31463 Region 2 .25
    31464 Region 3 .4
    31465 Region 4 .1428571
    31466 Region 5 NA
    31467 Region 6 0
    31468 Region 7 0
    31469 Region 8 .3333333
    31470 Region 9 .4
    31471 Region 10 1
    31472 Region 11 NA
    .....

    I would like to replace missing values of varible innovation (NA) with the average value of few observations of variable innovation. Hypothetically, Region 5 has missing value of innovaiton. I would like to replace this missing value with the average innovation of Region 3, 7 and 10 (which are neigbouring regions of region 5).

    How can I do that?

    Thank you!
    Last edited by Mina Wu; 28 Jul 2015, 05:10.

  • #2
    Given your circumstances, such that there is no way to tell from the ID variables or the region variables which cases are neighbors, probably the best you can do is to:
    Code:
    replace innovation=(.4+0+1)/3
    which will be somewhat tedious,but with only 161 regions, not too bad.

    Comment


    • #3
      If you had a variable in your dataset to identify the neighbours of a given region, you could automate this. For example:

      Code:
      clear
      input ID_2 str10 NAME_2 str3 innovation
      31462 "Region 1" .3636364
      31463 "Region 2" .25
      31464 "Region 3" .4
      31465 "Region 4" .1428571
      31466 "Region 5" NA
      31467 "Region 6" 0
      31468 "Region 7" 0
      31469 "Region 8" .3333333
      31470 "Region 9" .4
      31471 "Region 10" 1
      31472 "Region 11" NA
      end
      
      * Get the region numbers
      destring NAME_2, ignore("Region ") gen(region)
      * Destring innovation
      destring innovation, ignore("NA") replace
      * Generate a neighbour variable
      gen neighbour5=inlist(region,3,7,10)
      * Hypothetically, generate some neighbours for region 11 as well
      gen neighbour11=inlist(region,2,4,6)
      
      * Get the regions
      levelsof region, loc(regions)
      * Loop over the regions
      foreach r in `regions' {
          * Check if there are missing values of innovation for given region
          capture assert innovation!=. if region==`r'
          * Replace missings with neighbour average
          if _rc {
              sum innovation if neighbour`r', meanonly
              replace innovation=r(mean) if mi(innovation)
          }
      }
      Jorge Eduardo Pérez Pérez
      www.jorgeperezperez.com

      Comment


      • #4
        ben earnhart and Jorge Eduardo Perez Perez thank you for your responses!

        Jorge Eduardo Perez Perez , I tried applying your suggestions. My innovation variable was destringed already. I create neigbours with no difficulty. Here are the commands for generating all the neigbours.

        gen neighbour4=inlist(region_sp,9,5,2)
        gen neighbour10=inlist(region_sp,8,6,7,110,104)
        gen neighbour11=inlist(region_sp,7,5,2,0,106)
        gen neighbour40=inlist(region_sp,25,26,43,42,44,24)
        gen neighbour41=inlist(region_sp,45,98,94,96,43)
        gen neighbour44=inlist(region_sp,24,25,40,42,43)
        gen neighbour55=inlist(region_sp,51,57,135,136)
        gen neighbour62=inlist(region_sp,136,60,63,19)
        gen neighbour74=inlist(region_sp,77,76,78,73,69,75)
        gen neighbour91=inlist(region_sp,95,97,94)
        gen neighbour109=inlist(region_sp,146,107,116,115,117)
        gen neighbour126=inlist(region_sp,130,129,128,131,125)


        I get the regions, as suggested. However, the loop does not work-I still have missing values(12 missing observations).. I do not get any reported error in Stata 13, when I run the loop- just this..

        foreach r in `regions' {
        2. capture assert innovation!=. if region==`r'
        3. if _rc {
        4. sum innovation if neighbour`r', meanonly
        5. replace innovation=r(mean) if mi( innovation)
        6. }
        7. }

        Do you know if there is any solution?

        Thank you so much!

        Comment


        • #5
          In the loop, "region" should be "region_sp"

          Code:
          levelsof region_sp, loc(regions)
          * Loop over the regions
          foreach r in `regions' {
              * Check if there are missing values of innovation for given region
              capture assert innovation!=. if region_sp==`r'
              * Replace missings with neighbour average
              if _rc {
                  sum innovation if neighbour`r', meanonly
                  replace innovation=r(mean) if mi(innovation)
              }
          }
          Let me know if you have any further problems after this.
          Jorge Eduardo Pérez Pérez
          www.jorgeperezperez.com

          Comment


          • #6
            Jorge Eduardo Perez Perez thanks again! it still does not work..

            Here are the details of what I do:

            *** neighbours
            gen neighbour4=inlist(region_sp,9,5,2)
            gen neighbour10=inlist(region_sp,8,6,7,110,104)
            gen neighbour11=inlist(region_sp,7,5,2,0,106)
            gen neighbour40=inlist(region_sp,25,26,43,42,44,24)
            gen neighbour41=inlist(region_sp,45,98,94,96,43)
            gen neighbour44=inlist(region_sp,24,25,40,42,43)
            gen neighbour55=inlist(region_sp,51,57,135,136)
            gen neighbour62=inlist(region_sp,136,60,63,19)
            gen neighbour74=inlist(region_sp,77,76,78,73,69,75)
            gen neighbour91=inlist(region_sp,95,97,94)
            gen neighbour109=inlist(region_sp,146,107,116,115,117)
            gen neighbour126=inlist(region_sp,130,129,128,131,125)
            ** get the regions
            levelsof region_sp, loc(regions)

            0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
            > 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 5
            > 9 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86
            > 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
            > 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
            > 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 15
            > 2 153 154 155 156 157 158 159 160
            region_sp takes values 0-160..

            *loop
            foreach r in `regions' {
            * Check if there are missing values of innovation for given region
            capture assert innovation!=. if region_sp==`r'
            * Replace missings with neighbour average
            if _rc {
            sum innovation if neighbour`r', meanonly
            replace innovation=r(mean) if mi(innovation)
            }
            }
            Stata 13 reports the loop, but nothing happens to the missing values of variable innovation?

            Thank you again very much!

            Comment


            • #7
              Are you running the code by pieces? If so, you need to run the levelsof command and the loop together, otherwise the local `regions' will not be recognized by foreach and nothing will happen.

              Jorge Eduardo Pérez Pérez
              www.jorgeperezperez.com

              Comment


              • #8
                Jorge Eduardo Perez Perez it worked now! This was more than helpful!! Many thanks!!

                Comment


                • #9
                  Hi Jorge Eduardo Perez Perez and other who could maybe help.

                  I had a freedom to ask a related question. I have a panel dataset. Below are three variables firm Id (1-3), year (2007-2012), and the variable X.
                  id year X
                  1 2007 0
                  1 2008 0
                  1 2009 0
                  1 2010 0
                  1 2011 0
                  1 2012 30
                  1 2013 30
                  1 2014 30
                  2 2007 0
                  2 2008 0
                  2 2009 50
                  2 2010 0
                  2 2011 50
                  2 2012 50
                  2 2013 50
                  2 2014 50
                  3 2007 66.66
                  3 2008 66.66
                  3 2009 66.66
                  3 2010 0
                  3 2011 0
                  3 2012 0
                  3 2013 0
                  3 2014 0
                  I would like to "correct" for the value of X in the year 2010 in the following manner: if the value of x in 2010 is zero, but the value of x in 2011 and 2009 are positive values-then i would like to replace that value zero in 2010 with the average of the value in 2009 and 2011. that would give me value 50 for firm 2. HOWEVER i do not want to make any changes to those zeros of variable X in 2010, if the value in 2007, 2008 and 2009 are zeros (firm 1) and if observations in 2011, 2012, 2013 and 2014 are 0 too(firm 3).

                  Similarly, i have firms with zero value of x for two/three consecutive years, but certain positive value of x after and prior these zeros, expressed in decimals (firm 4, table below). Is it possible to make a loop function/algorithm which would input the values prior and after the zero values into the cell with zero values, conditional that those value are the same and expressed in decimal. That means that I would NOT like to make any changes to the consecutive zero values if the years prior and after missing are the same, but expressed as round number (firm 5).
                  id year X
                  4 2007 25.31
                  4 2008 25.31
                  4 2009 25.31
                  4 2010 25.31
                  4 2011 0
                  4 2012 0
                  4 2013 25.31
                  4 2014 25.31
                  5 2007 25.31
                  5 2008 10
                  5 2009 10
                  5 2010 0
                  5 2011 0
                  5 2012 10
                  5 2013 10
                  5 2014 10
                  To sum it all up, the first loop/algorithm should refer to the year 2010. but the second, should refer to any two/three consecutive years having zeros, while prior/later values of, EXPRESSED IN DECIMALS are identical. Thank you so much in advance!

                  Comment

                  Working...
                  X