Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping observations if frequency of value of variable lower than x in panel data

    Dear Statalist user,

    I have a problem regarding my panel data that should be fairly easy to solve but I was unable to accomplish it. I found similar threats but was unable to find a fitting solution.
    (e.g. https://www.stata.com/statalist/arch.../msg00159.html)

    My data is structured as follows:
    personID_year_country_ income
    1_______1____1_______1
    2_______1____2_______2
    3_______1____3_______3
    1_______2____1_______4
    2_______2____2_______5
    3_______2____3_______6

    I want to analyze various variables (e.g. income) as grouped by other variables (e.g.countries), e.g. calculating the mean of income per country. In order to improve the analysis I want to focus on countries with more than 30 observations (personID) per year. My attempt was to drop countries if they had a lower frequency of 30 per year. Is there a way to drop values from a variable if their frequency is less than, as in this case, 30? Or maybe dropping is not the right approach and there is another much better solution to my problem? I guess the if condition could work as well but I would still need to be able to create a variable tagging countries with more/less than 30 observations per year.

    The codes I tried were accepted by Stata but didn`t provide the results I wished for. (There were still less than thirty observations in some values of the country variable in several years) As I thought they would work, I don`t know what I actually did, which is not really helpful so if someone might explain to me what`s wrong with my attempts I would be really happy as well:

    First try:
    bysort country (year): egen countrytotal= total(country)
    drop if countrytotal<30
    *Some observations were deleted but apparently not enough?

    Second try:
    foreach num of numlist 1/3 {
    egen countrytag`num'= tag(country) if year==`zahl'
    egen countrycount`num'= _N if countrytag`zahl'==1
    drop if countrycount`zahl'<50
    }
    * No observations at all were deleted

    As I read in the other thread, the assert command is a way for checking if a command worked as wished for. I used it, but maybe that one was wrong as well? Stata reported there was a number of contradictions in my observations.

    bysort country (year): gen countryfreq= _N
    assert countryfreq >30



    As you can see I`m rather confused after a some hours of coding without accomplishing what I aimed for. I hope I was clear enough to express what I`m aiming at? If not please let me know and I´ll try to clarify my goals. Any help would be highly appreciated, thank you all in advance and have a nice day!

  • #2
    The following untested code may point the way.
    Code:
    // how many observations in each year for each country?
    bysort country year: egen country_year_freq = _N
    // what is the least number of obsercations in a year for each country?
    bysort country: egen country_year_min = min(country_year_freq)
    drop if country_year_min<=30

    Comment


    • #3
      Let's look at your code.

      Code:
      bysort country (year): egen countrytotal= total(country)
      drop if countrytotal<30
      But country is a numeric variable with values like 1, 2, 3. So if country 3 occurs 10 times you get 30 for the total. Not what you want.


      Code:
      foreach num of numlist 1/3 {
          egen countrytag`num'= tag(country) if year==`zahl'
          egen countrycount`num'= _N if countrytag`zahl'==1
          drop if countrycount`zahl'<50
      }
      What is `zahl` here? You don't define it, so that is hard to discuss.

      Code:
       egen countrycount`num'= _N if countrytag`zahl'==1
      That's illegal, so rather than nothing happening, the code would have failed.

      Why post code you didn't run? That's a puzzle. Don't paraphrase, Show us code that you used.

      Backing up:

      I want to focus on countries with more than 30 observations (personID) per year.
      I think that means more than 30 observations in every year of the study. So code could be something like.

      Code:
      fillin country year 
      bysort country year : egen freq = total(!_fillin) 
      by country: egen minfreq = min(freq)
      drop if minfreq <= 30 
      drop if _fillin 
      drop _fillin
      Why the fillin? Suppose that country 42 is represented by 30 observations in 1984 and none in any other year. You don't want that passing the test, I imagine. So, you ensure that all country x year combinations are checked and take account if the answer is zero.

      Comment


      • #4
        Dear Nick and William,

        thank you very much for your fast replies! Excuse me for the posting a false code, I`m German speaking so I have to translate parts of my codes which I forgot above in the second part. The second code I used was (zahl means number in German):

        foreach num of numlist 1/3 {
        egen countrytag`num'= tag(country) if year==`num'
        egen countrycount`num'= _N if countrytag`num'==1
        drop if countrycount`num'<30
        }
        * No observations at all were deleted

        About Williams answer:

        // how many observations in each year for each country? bysort country year: egen country_year_freq = _N // what is the least number of obsercations in a year for each country? bysort country: egen country_year_min = min(country_year_freq) drop if country_year_min<=30
        //I transformed a part of the code to: gen country_year_freq= _N

        //Turned out to delete too many observations: Countries with more than 30 observations per year were deleted as well
        bysort country year: gen country_year_freq = _N
        bysort country: egen country_year_min = min(country_year_freq)
        drop if country_year_min<=30

        //Turned out to delete too little observations (difference: Note the parentheses around year): there were still Countries left with less than 30 observations if I checked them per year (e.g. bysort year: tab country if year==2015)
        bysort country (year): gen country_year_freq = _N
        bysort country: egen country_year_min = min(country_year_freq)
        drop if country_year_min<=30

        About Nick`s answer:
        fillin country year bysort country year : egen freq = total(!_fillin) by country: egen minfreq = min(freq) drop if minfreq <= 30 drop if _fillin drop _fillin
        Worked fine but dropped almost all observations as many countries do not include more than 30 observations in each year of the study. Most countries only include more than 30 observations in some of the years of the study.
        Unfortunately, I don`t want the countries to be checked whether they include more than 30 observations in each year of the study. The country variable changed a lot during the time the panel is running as did the number of observations for each country/year. That`s why, for now, I only want to check if the countries hold more than 30 observations in the years they were observed in the panel and to be dropped for all the years they were not including more than 30 observations, but otherwise keep them in my data in the years they included more than 30 observations. Maybe better said: I want to check whether the countries observed in year 2015 included more than 30 observations, the countries observed in 2014 included more than 30 observations, independently of whether it was the same countries that included more than 30 observations or not. So e.g. if country 42 represented more than 30 observations in 1984 but less in 1985 it should be kept in the data in 1984 but be kicked out in 1985.
        (Some countries were merged or split up during the years of the panel study, but for my descriptive analysis I still want to include them if they hold more than 30 observations to not loose too many observations, even if they were not observed during the whole time span of the panel study)

        I thought maybe the fillin command was unnecessary in this case, but then I end up with Williams code, right, which seemed to accomplish the same as the fillin approach. (Dropping all countries which were including less than 30 observations each and every year; instead of only dropping the countries for the year(s) the number of observations was below 30 but keeping them in the panel data for the years they included more than thirty observations.)

        I would highly appreciate any further suggestions. Thanks a lot for taking care and have a nice day!
        Last edited by Benedikt Walker; 02 Mar 2018, 02:57.

        Comment


        • #5
          Question solved, what I was looking for was this:

          bysort country year: gen country_year_freq = _N
          bysort country year: egen country_year_min = min(country_year_freq)
          drop if country_year_min<=30

          Thanks a lot for your help Nick and William, I really appreciate the community here!

          Comment


          • #6
            My apologies for misunderstanding what you wrote in post #1.

            My attempt was to drop countries if they had a lower frequency of 30 per year.
            I took that to mean you wanted to drop entire countries if in any year they had less than 30 observations. What you wanted was to drop any country/year combination if it had less than 30 observations. It wasn't until post #4 that you clarified your intention.

            Unfortunately, I don`t want the countries to be checked whether they include more than 30 observations in each year of the study.
            With that said, if you run the first two lines of your code from post #5, you will see that the second line is unnecessary - country_year_freq will be the same for every observation with the same country and year, so taking the minimum of those values adds nothing.

            And indeed, with this new understanding of your objective, all three lines could be reduced to
            Code:
            bysort country year: drop if _N<30
            because for every observation in a given country and year, Stata's automatic variable _N will be the number of observations in that country and year.
            Last edited by William Lisowski; 02 Mar 2018, 06:03.

            Comment

            Working...
            X