Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Balance an unbalanced dataset

    Hi all,

    I have a strongly unbalanced dataset of countries observed by year.
    I would like to balance it by retaining as many observations as possible. For instance, say the maximum time span is 46 periods. Still, only 10 out of 100 countries have 46 time periods, whereas if I only drop 2 time periods (going from 46 to 44) I can reach 20 more countries ending up with 30 countries observed in 44 time periods. Then, I would prefer the second choice. Say the minimum acceptable time periods are 28. Is there a way to make the process automatic ideally with a "persist" and "restore" dropping all observations having less than k time periods and counting the number of countries with k time periods then repeating the process... Of course, the time periods should be the same (so for instance if I have 2 countries with 3 time periods, if country A has 1980-1981-1982 and country B has 2020-2021-2022, then this is not a matching of time periods. Instead, time periods should ideally coincide).

    This is a snapshot of my code:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str4 person_ctrycode int earliest_publn_year
    "AD" 1983
    "AD" 1990
    "AD" 1998
    "AD" 2005
    "AD" 2006
    "AD" 2009
    "AD" 2013
    "AD" 2014
    "AD" 2015
    "AD" 2017
    "AD" 2018
    "AD" 2019
    "AD" 2020
    "AD" 2021
    "AE" 1978
    "AE" 1984
    "AE" 1989
    "AE" 1990
    "AE" 1991
    "AE" 1992
    "AE" 1993
    "AE" 1994
    "AE" 1996
    "AE" 1997
    "AE" 1998
    "AE" 2001
    "AE" 2003
    "AE" 2004
    "AE" 2005
    "AE" 2006
    "AE" 2007
    "AE" 2008
    "AE" 2009
    "AE" 2010
    "AE" 2011
    "AE" 2012
    "AE" 2013
    "AE" 2014
    "AE" 2015
    "AE" 2016
    "AE" 2017
    "AE" 2018
    "AE" 2019
    "AE" 2020
    "AE" 2021
    "AF" 2015
    "AF" 2018
    "AF" 2020
    "AG" 2010
    "AG" 2013
    "AI" 1982
    "AI" 1984
    "AI" 1986
    "AI" 1988
    "AI" 1990
    "AI" 1995
    "AI" 1996
    "AI" 2002
    "AI" 2008
    "AI" 2009
    "AI" 2010
    "AI" 2019
    "AI" 2020
    "AL" 1988
    "AL" 1993
    "AL" 2003
    "AL" 2009
    "AL" 2010
    "AL" 2011
    "AL" 2015
    "AL" 2017
    "AL" 2018
    "AM" 1995
    "AM" 2000
    "AM" 2004
    "AM" 2005
    "AM" 2006
    "AM" 2007
    "AM" 2008
    "AM" 2010
    "AM" 2011
    "AM" 2012
    "AM" 2013
    "AM" 2014
    "AM" 2015
    "AM" 2016
    "AM" 2017
    "AM" 2018
    "AM" 2019
    "AM" 2020
    "AN" 1979
    "AN" 1980
    "AN" 1981
    "AN" 1982
    "AN" 1983
    "AN" 1984
    "AN" 1985
    "AN" 1986
    "AN" 1987
    "AN" 1988
    end
    however, I think that state provides a default unbalanced dataset which might be more useful for an MWE.

    Thank you

  • #2
    Here is some code which may start you in a useful direction.
    Code:
    * create a list of spells - sequential years when a country has data
    sort person_ctrycode earliest_publn_year
    generate spell = sum( person_ctrycode!=person_ctrycode[_n-1] ///
                          | earliest_publn_year[_n] != earliest_publn_year[_n-1]+1 ///
                        )
    collapse (first) person_ctrycode ///
             (min) min_epy=earliest_publn_year (max) ///
             max_epy=earliest_publn_year ///
             , by(spell)
    
    * display list of spells         
    list, noobs sepby(person_ctrycode)
    
    * example of using spells information 
    forvalues fy = 1990/1995 {
        forvalues ty = 2000(-1)`fy' {
            quietly levelsof person_ctrycode if min_epy <= `fy' & max_epy>=`ty', local(pclist) clean
            local N : word count `pclist' 
            if `N'>0  display %9.0f `r(N)' " `fy' to `ty' `pclist' "
        }
    }
    Code:
    . * display list of spells         
    . list, noobs sepby(person_ctrycode)
    
      +--------------------------------------+
      | spell   person~e   min_epy   max_epy |
      |--------------------------------------|
      |     1         AD      1983      1983 |
      |     2         AD      1990      1990 |
      |     3         AD      1998      1998 |
      |     4         AD      2005      2006 |
      |     5         AD      2009      2009 |
      |     6         AD      2013      2015 |
      |     7         AD      2017      2021 |
      |--------------------------------------|
      |     8         AE      1978      1978 |
      |     9         AE      1984      1984 |
      |    10         AE      1989      1994 |
      |    11         AE      1996      1998 |
      |    12         AE      2001      2001 |
      |    13         AE      2003      2021 |
      |--------------------------------------|
      |    14         AF      2015      2015 |
      |    15         AF      2018      2018 |
      |    16         AF      2020      2020 |
      |--------------------------------------|
      |    17         AG      2010      2010 |
      |    18         AG      2013      2013 |
      |--------------------------------------|
      |    19         AI      1982      1982 |
      |    20         AI      1984      1984 |
      |    21         AI      1986      1986 |
      |    22         AI      1988      1988 |
      |    23         AI      1990      1990 |
      |    24         AI      1995      1996 |
      |    25         AI      2002      2002 |
      |    26         AI      2008      2010 |
      |    27         AI      2019      2020 |
      |--------------------------------------|
      |    28         AL      1988      1988 |
      |    29         AL      1993      1993 |
      |    30         AL      2003      2003 |
      |    31         AL      2009      2011 |
      |    32         AL      2015      2015 |
      |    33         AL      2017      2018 |
      |--------------------------------------|
      |    34         AM      1995      1995 |
      |    35         AM      2000      2000 |
      |    36         AM      2004      2008 |
      |    37         AM      2010      2020 |
      |--------------------------------------|
      |    38         AN      1979      1988 |
      +--------------------------------------+
    Code:
            1 1990 to 1994 AE 
            1 1990 to 1993 AE 
            1 1990 to 1992 AE 
            1 1990 to 1991 AE 
            3 1990 to 1990 AD AE AI 
            1 1991 to 1994 AE 
            1 1991 to 1993 AE 
            1 1991 to 1992 AE 
            1 1991 to 1991 AE 
            1 1992 to 1994 AE 
            1 1992 to 1993 AE 
            1 1992 to 1992 AE 
            1 1993 to 1994 AE 
            2 1993 to 1993 AE AL 
            1 1994 to 1994 AE 
            1 1995 to 1996 AI 
            2 1995 to 1995 AI AM

    Comment


    • #3
      William Lisowski Thank you

      Comment

      Working...
      X