Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to tell Stata not to ignore missing values in bysort: egen mean command

    Dear Stata-Listers,

    I am posting a question for the first time, so I hope I'm doing it correctly.

    I have a panel data set and I want Stata to calculate the mean of a variable within a certain group. The group is identified by "country" and "period". "period" is a three-year period, i.e. there are always three year-observations in one "country period" group. I use the following code:

    Code:
    bysort country period: egen mean_ideology = mean(ideology)
    This works perfectly. It is only that sometimes "ideology" is missing for one or two years and with the code above Stata ignores these, i.e. mean_ideology is then calculated on the basis of the one or two non-missing observations. However, I want "mean_ideology" to be missing as soon as "ideology" is missing in one of the three years.

    What happens is:
    year country period ideology mean_ideology
    2007 Belgium 1 3 3
    2008 Belgium 1 3 3
    2009 Belgium 1 3 3
    2010 Belgium 2 2 2
    2011 Belgium 2 1 2
    2012 Belgium 2 3 2
    2013 Belgium 3 3 3
    2014 Belgium 3 . 3
    2015 Belgium 3 . 3
    What I want is:
    year country period ideology mean_ideology
    2007 Belgium 1 3 3
    2008 Belgium 1 3 3
    2009 Belgium 1 3 3
    2010 Belgium 2 2 2
    2011 Belgium 2 1 2
    2012 Belgium 2 3 2
    2013 Belgium 3 3 .
    2014 Belgium 3 . .
    2015 Belgium 3 . .
    I thought it might make sense to first replace "ideology" with "." if there is one other missing value for ideology within a "country period" group and then calculate the mean. Based on other Stata list questions and responses I have played around with:

    Code:
    bysort country period: replace ideology = . if
    ...?

    but I have been unsuccessful.

    I'd be very grateful for some help.
    Thank you in advance!
    Rike
    Last edited by Rike Schmid; 31 Aug 2020, 04:49.

  • #2
    Try
    Code:
    bysort country (period): replace ideology = ideology[_N]  if  missing(ideology[_N] )

    Comment


    • #3
      Another way would be:

      Code:
      egen nomiss = count(ideology), by(country period)
      
      egen mean_ideology = mean(ideology) if nomiss==3, by(country period)

      Comment


      • #4

        For the example data you could -- and for your desired goal perhaps even should -- avoid egen altogether


        Code:
        bysort country : gen wanted = (ideology[1] + ideology[2] + ideology[3]) / 3

        Comment


        • #5
          Thank you both so much for your help. Joro's first code worked though without the brackets around "period". I will try your way as well, Nick. Is there any particular reason to avoid egen in this case? I've found it very helpful so far.

          Comment


          • #6
            I had a mistake in my code. The correct version should be either
            Code:
            bysort country period (ideology): replace ideology = ideology[_N] if missing(ideology[_N] )
            or equivalently

            Code:
            sort country period ideology
            by country period: replace ideology = ideology[_N] if missing(ideology[_N] )
            And the mistake I made illustrates why Nick is not wrong to propose that the simplest possible solution be used.

            You have a really simple situation, what Nick suggests does the trick.

            The more general and complicated solutions you attempt, the higher are the chances that you commit an error, like me.

            The code in #3 is still correct.



            Originally posted by Rike Schmid View Post
            Thank you both so much for your help. Joro's first code worked though without the brackets around "period". I will try your way as well, Nick. Is there any particular reason to avoid egen in this case? I've found it very helpful so far.
            Last edited by Joro Kolev; 31 Aug 2020, 08:12.

            Comment


            • #7
              A reason to avoid egen is your starting point here: it doesn’t do what you want. But you could separately use it calculate means, count non-missing values and revise accordingly.

              Comment


              • #8
                Nick's code should also probably read
                Code:
                bysort country period: gen wanted = (ideology[1] + ideology[2] + ideology[3]) / 3
                he probably was mislead by the wrong thing I initially wrote.

                Comment


                • #9
                  1. No-manual would be more generally applicable.
                  Code:
                  bys country period (ideology): gen NoMiss = ideology[_N] !=.
                  by country period: egen Wanted = mean(ideology) if NoMiss
                  2. With all the caution of avoiding using egen for this case (as noted by Nick), I still believe in below `unrecommended' code:
                  Code:
                  bys country period (ideology): egen Wanted2 = mean(cond(ideology[_N] !=. , ideology, .))
                  3. Note that below code does not work.
                  Code:
                  bys country period (ideology): egen NotWanted = mean(ideology) if ideology[_N] !=.

                  Comment


                  • #10
                    #8 and #9 catch my intentions correctly and carefully!

                    Comment


                    • #11
                      And why the code in 3. is not working, is one of the mysteries of Stata.

                      In this thread here, where similarly weird things with -egen- were happening, Nick cited some source that we should not be using -egen- with explicit subscripting.
                      https://www.statalist.org/forums/for...ing-a-by-group



                      Originally posted by Romalpa Akzo View Post
                      1. No-manual would be more generally applicable.
                      Code:
                      bys country period (ideology): gen NoMiss = ideology[_N] !=.
                      by country period: egen Wanted = mean(ideology) if NoMiss
                      2. With all the caution of avoiding using egen for this case (as noted by Nick), I still believe in below `unrecommended' code:
                      Code:
                      bys country period (ideology): egen Wanted2 = mean(cond(ideology[_N] !=. , ideology, .))
                      3. Note that below code does not work.
                      Code:
                      bys country period (ideology): egen NotWanted = mean(ideology) if ideology[_N] !=.

                      Comment


                      • #12
                        Romalpa Akzo , I went to sleep yesterday thinking why your code in 3. is being rejected by Stata and what can be done about this, because 3. seems fairly general and fairly elegant way to achieve the task. I was also thinking of something I learnt recently here
                        https://www.statalist.org/forums/for...iour-explained
                        basically that Stata executes operations diligently left to right. So the two issues combined into this modification of the code in 3. which happens to be accepted by Stata:

                        Code:
                        . bys country period (ideology): egen oneline = mean(ideology[_N]*ideology/ideology[_N])
                        (3 missing values generated)
                        
                        . bys country period (ideology): egen NotWanted = mean(ideology) if ideology[_N] !=.
                        (9 missing values generated)
                        
                        . list, sep(0)
                        
                             +--------------------------------------------------------------------+
                             | year   country   period   ideology   mean_i~y   oneline   NotWan~d |
                             |--------------------------------------------------------------------|
                          1. | 2007   Belgium        1          3          3         3          . |
                          2. | 2009   Belgium        1          3          3         3          . |
                          3. | 2008   Belgium        1          3          3         3          . |
                          4. | 2011   Belgium        2          1          2         2          . |
                          5. | 2010   Belgium        2          2          2         2          . |
                          6. | 2012   Belgium        2          3          2         2          . |
                          7. | 2013   Belgium        3          3          3         .          . |
                          8. | 2014   Belgium        3          .          3         .          . |
                          9. | 2015   Belgium        3          .          3         .          . |
                             +--------------------------------------------------------------------+

                        Comment


                        • #13
                          Works for me. If it's not working, then your version of egen is looking at the last value of ideology in the dataset, not in each group;

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input int year str7 country byte(period ideology)
                          2007 "Belgium" 1 3
                          2008 "Belgium" 1 3
                          2009 "Belgium" 1 3
                          2010 "Belgium" 2 2
                          2011 "Belgium" 2 1
                          2012 "Belgium" 2 3
                          2013 "Belgium" 3 3
                          2014 "Belgium" 3 .
                          2015 "Belgium" 3 .
                          end
                          
                          . bys country period (ideology): egen NotWanted = mean(ideology) if ideology[_N] !=.
                          (3 missing values generated)
                          
                          
                          . list , sepby(period)
                          
                               +-----------------------------------------------+
                               | year   country   period   ideology   NotWan~d |
                               |-----------------------------------------------|
                            1. | 2009   Belgium        1          3          3 |
                            2. | 2007   Belgium        1          3          3 |
                            3. | 2008   Belgium        1          3          3 |
                               |-----------------------------------------------|
                            4. | 2011   Belgium        2          1          2 |
                            5. | 2010   Belgium        2          2          2 |
                            6. | 2012   Belgium        2          3          2 |
                               |-----------------------------------------------|
                            7. | 2013   Belgium        3          3          . |
                            8. | 2014   Belgium        3          .          . |
                            9. | 2015   Belgium        3          .          . |
                               +-----------------------------------------------+
                          
                          . which egen
                          C:\Program Files (x86)\Stata\ado\base\e\egen.ado
                          *! version 3.4.2  13apr2020
                          
                          .

                          Comment


                          • #14
                            Nick, what you are proposing is a good potential explanation (that egen looks at the last observation in the dataset and not in the group), but I still cannot see how this can explain that when I stick the object into the mean function it starts to work for me.

                            This works for me:
                            Code:
                            bys country period (ideology): egen oneline = mean(ideology[_N]*ideology/ideology[_N])
                            This does not work for me (generates all missings)
                            Code:
                            bys country period (ideology): egen NotWanted = mean(ideology) if ideology[_N] !=.
                            As far as I see both expressions are logically equivalent,
                            the first expression mechanically inserts the missing last observation in the mean function to annihilate the rest,
                            the second expression does the same "from outside" the mean, through the -if-.

                            Code:
                            . which egen
                            C:\Program Files (x86)\Stata15\ado\base\e\egen.ado
                            *! version 3.4.1  05jun2013
                            
                            . version
                            version 15.1

                            Comment


                            • #15
                              Joro: I've shown in #13 that the code there works as intended in an up-to-date Stata

                              egen has been revised within Stata 16, so I can only conclude that the bug you've experienced in your out-of-date version has now been removed.

                              That your other code works is good fortune, as the help for
                              egen does advise against building subscripts into what you feed to egen.

                              See
                              https://www.statalist.org/forums/help#version for the request to state your version of Stata if out-of-date. Your guess, and mine too, was presumably that version should not really matter for this question, but it turns out that it does.

                              In practice, it seems that an if qualifier is peeled off on parsing and then passed to the code for mean().
                              Last edited by Nick Cox; 01 Sep 2020, 03:46.

                              Comment

                              Working...
                              X