Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Copying Variable Names That Are Missing within a Range

    Hi, all.

    I have dataset where country names (variable: Country) are missing rather randomly. I have to fill in the names for years 2000 until 2016, because I use these country names to then later recode to a numeric value. Here is an example of how the data looks like:
    Code:
    input float(ccode2 year) str57 Country
    630 2000 ""           
    630 2001 ""           
    630 2002 ""           
    630 2003 ""           
    630 2004 ""           
    630 2005 ""           
    630 2006 ""           
    630 2007 ""           
    630 2008 ""           
    630 2009 ""           
    630 2010 ""           
    630 2011 ""           
    630 2012 ""           
    630 2013 ""           
    630 2014 ""           
    630 2015 ""           
    630 2016 "Iran"       
    840 2000 "Philippines"
    840 2001 "Philippines"
    840 2002 "Philippines"
    840 2003 "Philippines"
    840 2004 "Philippines"
    840 2005 "Philippines"
    840 2006 "Philippines"
    840 2007 "Philippines"
    840 2008 "Philippines"
    840 2009 "Philippines"
    840 2010 "Philippines"
    840 2011 "Philippines"
    840 2012 "Philippines"
    840 2013 "Philippines"
    840 2014 "Philippines"
    840 2015 "Philippines"
    840 2016 "Philippines"
    750 2000 "India"      
    750 2001 "India"      
    750 2002 "India"      
    750 2003 "India"      
    750 2004 ""           
    750 2005 ""           
    750 2006 ""           
    750 2007 ""           
    750 2008 ""           
    750 2009 ""           
    750 2010 ""           
    750 2011 ""           
    750 2012 ""           
    750 2013 ""           
    750 2014 "India"      
    750 2015 "India"      
    750 2016 "India"      
    775 2000 "Myanmar"    
    775 2001 "Myanmar"    
    775 2002 "Myanmar"    
    775 2003 "Myanmar"    
    775 2004 "Myanmar"    
    775 2005 "Myanmar"    
    775 2006 "Myanmar"    
    775 2007 "Myanmar"    
    775 2008 "Myanmar"    
    775 2009 "Myanmar"    
    775 2010 "Myanmar"    
    775 2011 "Myanmar"    
    775 2012 ""           
    775 2013 "Myanmar"    
    775 2014 ""           
    775 2015 ""           
    775 2016 ""
    And so on. I went to the FAQ (https://www.stata.com/support/faqs/d...issing-values/) but my question is how to I copy a string name for years that are missing between 2000 and 2016? I'm not sure if gsort will work here based on what I'm reading.

    I played around with some code, such as
    Code:
    replace Country1 = Country1[_n+1] if inrange(year,2000,2016)
    and so one, but no avail. (I made another variable, Country1, because Country is a string). (As a FYI, I used
    Code:
    encode Country, generate(Country1)
    and it turned it into long). Thank you in advance for any help.

  • #2
    Well, how do you know which country's name to fill in in any observation? You don't say anything about that. I'm going to guess that the variable ccode2 determines this, that all observations with the same ccode2 are the same country. If that's right:

    Code:
    by ccode2 (Country), sort: assert Country == Country[_N] | missing(Country)
    by ccode2 (Country): replace Country = Country[_N]
    If there is some ccode2 where all of the values of Country are missing this will not help: we can't create the information out of thin air. If Country is not consistently identical within ccode2 (except where missing), the code will stop with an error message after the first command.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      Well, how do you know which country's name to fill in in any observation? You don't say anything about that. I'm going to guess that the variable ccode2 determines this, that all observations with the same ccode2 are the same country. If that's right:

      Code:
      by ccode2 (Country), sort: assert Country == Country[_N] | missing(Country)
      by ccode2 (Country): replace Country = Country[_N]
      If there is some ccode2 where all of the values of Country are missing this will not help: we can't create the information out of thin air. If Country is not consistently identical within ccode2 (except where missing), the code will stop with an error message after the first command.
      Hello Clyde,

      Thank you for your response. My apologies if I was not clear. I actually have to recode ccode2 later, because I found some errors within it. (Part of the reason why I found errors, is because of the blanks in the variable names for Country). What I'm trying to do is to fill in the rest of the blanks with the Country names. So, for example-starting from the top- Iran should be filled in for each year between 2000-2016. (Philippines, okay). But India is missing for 2004-2013. I have to fill it in. (Then, later, I have recode it to ccode2). As a FYI, there are duplicates (meaning, further down there may be other observations for India 2000-2016), but that's a separate issue I have to work with later.

      Comment


      • #4
        OK, so it seems that you are relying on the sort order of the data to determine which country an observation actually belongs to when Country is missing. That is, you assume that each Country occupies a consecutive block of 17 observations (corresponding to years 2000-2016) and you want to fill in Country with whatever name appears in that block.

        That's a fairly brittle approach, as, unless this is really rather virgin data, there are many commands to which they have been subjected that might have scrambled the original sort order. Nevertheless, assuming this is the case:

        Code:
        gen long obs_no = _n
        gen block = ceil(_n/17)
        by block (obs_no), sort: assert year == 1999 + _n
        by block (Country), sort: assert Country == Country[_N] | missing(Country)
        by block (Country): replace Country = Country[_N]
        sort obs_no

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          OK, so it seems that you are relying on the sort order of the data to determine which country an observation actually belongs to when Country is missing. That is, you assume that each Country occupies a consecutive block of 17 observations (corresponding to years 2000-2016) and you want to fill in Country with whatever name appears in that block.

          That's a fairly brittle approach, as, unless this is really rather virgin data, there are many commands to which they have been subjected that might have scrambled the original sort order. Nevertheless, assuming this is the case:

          Code:
          gen long obs_no = _n
          gen block = ceil(_n/17)
          by block (obs_no), sort: assert year == 1999 + _n
          by block (Country), sort: assert Country == Country[_N] | missing(Country)
          by block (Country): replace Country = Country[_N]
          sort obs_no
          Thank you, Clyde. Indeed. It is a brittle approach. At the same token, I never came across the ceil command before, so I learned something new.

          Comment

          Working...
          X