Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with missing values and the -egen var1=mean(var2), by(var3)- command

    Hello together,

    I use Stata 17 and have data that looks like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int year byte month float growlag6_1
    1982  1           .
    1982  2           .
    1982  3           .
    1982  4           .
    1982  5           .
    1982  6           .
    1982  7           .
    1982  8           .
    1982  9           .
    1982 10           .
    1982 11           .
    1982 12           .
    1983  1   -.8965766
    1983  2   -.8536214
    1983  3  -1.1194961
    1983  4   -.7394078
    1983  5           .
    1983  6           .
    1983  7           .
    1983  8           .
    1983  9           .
    1983 10           .
    1983 11           .
    1983 12   1.0059717
    1984  1     .461578
    1984  2 -.064775355
    1984  3   -.4874977
    1984  4   -.6169059
    1984  5           .
    1984  6           .
    1984  7           .
    1984  8           .
    1984  9           .
    1984 10           .
    1984 11           .
    1984 12  .021989904
    1985  1   .05275526
    1985  2    .3495794
    1985  3    .0751656
    1985  4   -.3113486
    1985  5           .
    1985  6           .
    1985  7           .
    1985  8           .
    1985  9           .
    1985 10           .
    1985 11           .
    1985 12    .3685137
    1986  1   .13611127
    1986  2  .021089263
    1986  3  -.09093564
    1986  4  -.10236227
    1986  5           .
    1986  6           .
    1986  7           .
    1986  8           .
    1986  9           .
    1986 10           .
    1986 11           .
    1986 12   -1.581145
    1987  1  -1.7825167
    1987  2  -1.7077404
    1987  3  -1.1667379
    1987  4  -1.1837677
    1987  5           .
    1987  6           .
    1987  7           .
    1987  8           .
    1987  9           .
    1987 10           .
    1987 11           .
    1987 12    .9539546
    1988  1    .3803727
    1988  2    .4067596
    1988  3  -.04841128
    1988  4     -.72443
    1988  5           .
    1988  6           .
    1988  7           .
    1988  8           .
    1988  9           .
    1988 10           .
    1988 11           .
    1988 12      .51243
    1989  1  -.03726343
    1989  2  -.32369575
    1989  3   .13528158
    1989  4    .1882625
    1989  5           .
    1989  6           .
    1989  7           .
    1989  8           .
    1989  9           .
    1989 10           .
    1989 11           .
    1989 12    .6406555
    1990  1    .5858638
    1990  2   1.0911701
    1990  3    1.127725
    1990  4   1.0045696
    end
    I want to create a new variable that takes the mean of growlag6_1 (December-April) per year and contains this mean value in its value slots for all 12 months of the given year. I tried to solve this by using
    Code:
     egen meanlag6=mean(growlag6_1), by(year)
    However, it seems like this command cannot handle the missing values because the resulting mean is incorrect. Do you have an idea how to solve this? Thank you!

    Best,
    Peter

  • #2
    This seems to be the correct approach to achive what you want.

    Can you please explain why you think that "this command cannot handle the missing values ", and why do you think that "the resulting mean is incorrect"?

    In other words how is the result from the command you show different from what you are expecting?

    Comment


    • #3
      Hi Joro,

      I have made a mistake in thinking. My problem here is not the command, but rather that I need the range (December(old year)-April(new year), which means that I cannot sort by year as this is a range over two years. This resulted in the wrong mean.

      However, I cannot figure out how to create a variable that can identify this range instead of year. Do you maybe have and idea? Many thanks and sorry for the initial confusion.

      Lastly, an example for more clarity: If we have in the dataset values for December 1984 and January, February, March, and April 1985, then I want to create a variable that takes the mean of these 5 values and puts them in its 12 slots of 1985.

      Comment


      • #4
        Sorry for posting twice. Apparently one cannot delete a post here.
        Last edited by Joro Kolev; 22 Jul 2022, 06:56. Reason: Bad internet.

        Comment


        • #5
          your question is not clear to me, but I think that the user-written -rangestat- command will do what you want; use -search- to find and download and then look at the help file

          Comment


          • #6
            This should be doing it, if I did not mess up something:

            Code:
            . sort year month
            
            . gen t = _n
            
            . gen mymean = .
            (100 missing values generated)
            
            . qui forvalues i = 1/`=_N' {
            . summ growlag6_1 if t>`i'-5 & t<=`i', meanonly
            . replace mymean = r(mean) in `i'
            . }
            
            . egen myaprilmean = mean(cond(month==4,mymean,.)), by(year)
            (12 missing values generated)

            Comment


            • #7
              The main issue seems to be the use of non-calendar years.

              Code:
              gen ncy = cond(month == 12, year + 1, year)
              will give a framework of years starting in December.

              Comment


              • #8
                I am not a big expert on -rangestat- but I have noticed that my students use it a lot. So here is a solution based on -rangestat-:

                Code:
                . rangestat (mean) growlag6_1, interval(t -4 0)
                
                . egen myaprilmean2 = mean(cond(month==4, growlag6_1_mean ,.)), by(year)
                (12 missing values generated)
                
                . compare myaprilmean myaprilmean2
                
                                                        ---------- Difference ----------
                                            Count       Minimum      Average     Maximum
                ------------------------------------------------------------------------
                myapril~n=myapril~2            88
                                       ----------
                Jointly defined                88             0            0           0
                Jointly missing                12
                                       ----------
                Total                         100

                Comment


                • #9
                  rangestat from SSC could be used but isn't really needed here.
                  Last edited by Nick Cox; 22 Jul 2022, 07:40.

                  Comment


                  • #10
                    Dear all,

                    Thank you very much for your answers. As Nick said, the main issue was the use of non-calendar years. Using
                    Code:
                     gen ncy = cond(month == 12, year + 1, year)
                    solved my issue as I am now able
                    to identify the correct range by using
                    Code:
                     egen mean_growlag6= mean(growlag6_1), by(ncy)
                    The -rangestat- command is also very useful to know. Thanks for this!

                    Comment


                    • #11
                      In #3 OP gave the following example as to what he wants to have calculated: " If we have in the dataset values for December 1984 and January, February, March, and April 1985, then I want to create a variable that takes the mean of these 5 values and puts them in its 12 slots of 1985."

                      The code OP shows in #10 does not accomplish the task he posed in #3. What the code in #10 accomplishes is to calculate the mean from December last year to November this year. And then to attach these averages to December last year to November this year.

                      The solution with sending months to another year to the original posed task in #3 would be something like:

                      Code:
                      . gen ncy = cond(month == 12, year + 1, year) if month<5 | month==12
                      (56 missing values generated)
                      
                      . egen mytempaprilmean3 = mean( growlag6_1), by(ncy)
                      (60 missing values generated)
                      
                      . egen myaprilmean3 = mean(cond(month<5, mytempaprilmean3, .)), by(year)
                      (12 missing values generated)
                      
                      . compare myaprilmean myaprilmean3
                      
                                                              ---------- Difference ----------
                                                  Count       Minimum      Average     Maximum
                      ------------------------------------------------------------------------
                      myapril~n=myapril~3            88
                                             ----------
                      Jointly defined                88             0            0           0
                      Jointly missing                12
                                             ----------
                      Total                         100
                      And then we can dispute which is the easiest. To me the solution with the loop in #6 was the fastest to write, then the solution with the -rangestat- (because I had to slow down to read the help file), and then the longest it took me to write the solution with the sending months to other years, because I got it wrong a couple of times, before I could get it right.

                      Comment

                      Working...
                      X