Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Add missing values

    Hello,

    I have the following dataset:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int year long gid float nactors2 double nactorsNEW int country_code float(longitude latitude)
    1999 136452 1 1 566 5.75 4.75
    2002 136452 . 1   .    .    .
    2003 136452 . 1   .    .    .
    2004 136452 1 1 566 5.75 4.75
    2005 136452 2 2 566 5.75 4.75
    2006 136452 2 2 566 5.75 4.75
    2007 136452 . 2   .    .    .
    2008 136452 2 2 566 5.75 4.75
    2009 136452 1 1 566 5.75 4.75
    2010 136452 . 1   .    .    .
    2011 136452 . 1   .    .    .
    2012 136452 1 1 566 5.75 4.75
    2013 136452 2 2 566 5.75 4.75
    2014 136452 1 1 566 5.75 4.75
    2015 136452 1 1 566 5.75 4.75
    2016 136452 1 1 566 5.75 4.75
    2017 136452 2 2 566 5.75 4.75
    2018 136452 . 2   .    .    .
    2019 136452 2 2 566 5.75 4.75
    2020 136452 2 2 566 5.75 4.75
    1997 136453 1 1 566 6.25 4.75
    1998 136453 2 2 566 6.25 4.75
    1999 136453 3 3 566 6.25 4.75
    2002 136453 1 1 566 6.25 4.75
    2003 136453 2 2 566 6.25 4.75
    2004 136453 1 1 566 6.25 4.75
    2005 136453 1 1 566 6.25 4.75
    2006 136453 2 2 566 6.25 4.75
    2007 136453 1 1 566 6.25 4.75
    2008 136453 4 4 566 6.25 4.75
    2009 136453 4 4 566 6.25 4.75
    2010 136453 2 2 566 6.25 4.75
    2011 136453 1 1 566 6.25 4.75
    2012 136453 1 1 566 6.25 4.75
    2013 136453 2 2 566 6.25 4.75
    2014 136453 2 2 566 6.25 4.75
    2015 136453 3 3 566 6.25 4.75
    2016 136453 4 4 566 6.25 4.75
    2017 136453 3 3 566 6.25 4.75
    2018 136453 1 1 566 6.25 4.75
    2019 136453 3 3 566 6.25 4.75
    2020 136453 8 8 566 6.25 4.75
    end
    My data changes by gid and year. However, there are some variables that are fixed by gid, that is, it does not change by year. In this variables, (country_code), longitude and latitude, there are some missing values as a consequence of a merge that I did. I always have a cell which contains the value of that missing.

    I would like to fill that missing values with the value of his corresponding gid (the value is always fixed). For example, in the second observation that I showed, there are missing values for the observation in which year is 2002 and gid 136452. I would like to complete the missing values of country_code, latitude and longitude with the ones corresponding to the gid 136452 for any other observation. I also have missing values for variable nactors2 but I do not care at all about it.

    If my question is not clear, please let me know.

    Diego.

  • #2
    Diego:
    a step-by-step approach (that can be made faster via -foreach-) is:
    Code:
    . bysort gid: replace country_code=country_code[1]
    
    . bysort gid: replace longitude=longitude[1]
    
    . bysort gid: replace latitude=latitude[1]
    Via -foreach-
    Code:
     foreach var of varlist country_code-latitude {
      bysort gid: replace `var'=`var'[1] if `var'==.
      }
    Last edited by Carlo Lazzaro; 12 Oct 2021, 02:33.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Thank you Carlo Lazzaro for your answer.

      I find a problem in the solution you suggested to me. It's true that it works for the sample I attached in my first comment. However, when I do with my entire sample, I find a problem in those observation which does not have the data for the first unit of observation. I put a observation in which it does not work. Specifically when gid is equal to 136452.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int year long gid float nactors2 byte _fillin double nactorsNEW2 int country_code float(longitude latitude)
      1997 136452 . 1 1   .    .    .
      1998 136452 . 1 1   .    .    .
      1999 136452 1 0 1 566 5.75 4.75
      2002 136452 . 1 1   .    .    .
      2003 136452 . 1 1   .    .    .
      2004 136452 1 0 1 566 5.75 4.75
      2005 136452 2 0 2 566 5.75 4.75
      2006 136452 2 0 2 566 5.75 4.75
      2007 136452 . 1 2   .    .    .
      2008 136452 2 0 2 566 5.75 4.75
      2009 136452 1 0 1 566 5.75 4.75
      2010 136452 . 1 1   .    .    .
      2011 136452 . 1 1   .    .    .
      2012 136452 1 0 1 566 5.75 4.75
      2013 136452 2 0 2 566 5.75 4.75
      2014 136452 1 0 1 566 5.75 4.75
      2015 136452 1 0 1 566 5.75 4.75
      2016 136452 1 0 1 566 5.75 4.75
      2017 136452 2 0 2 566 5.75 4.75
      2018 136452 . 1 2   .    .    .
      2019 136452 2 0 2 566 5.75 4.75
      2020 136452 2 0 2 566 5.75 4.75
      1997 136453 1 0 1 566 6.25 4.75
      1998 136453 2 0 2 566 6.25 4.75
      1999 136453 3 0 3 566 6.25 4.75
      2002 136453 1 0 1 566 6.25 4.75
      2003 136453 2 0 2 566 6.25 4.75
      2004 136453 1 0 1 566 6.25 4.75
      2005 136453 1 0 1 566 6.25 4.75
      2006 136453 2 0 2 566 6.25 4.75
      2007 136453 1 0 1 566 6.25 4.75
      2008 136453 4 0 4 566 6.25 4.75
      2009 136453 4 0 4 566 6.25 4.75
      2010 136453 2 0 2 566 6.25 4.75
      2011 136453 1 0 1 566 6.25 4.75
      2012 136453 1 0 1 566 6.25 4.75
      2013 136453 2 0 2 566 6.25 4.75
      2014 136453 2 0 2 566 6.25 4.75
      2015 136453 3 0 3 566 6.25 4.75
      2016 136453 4 0 4 566 6.25 4.75
      2017 136453 3 0 3 566 6.25 4.75
      2018 136453 1 0 1 566 6.25 4.75
      2019 136453 3 0 3 566 6.25 4.75
      2020 136453 8 0 8 566 6.25 4.75
      1997 140057 . 1 1   .    .    .
      1998 140057 1 0 1 566 8.25 7.25
      1999 140057 . 1 1   .    .    .
      2002 140057 . 1 1   .    .    .
      2003 140057 . 1 1   .    .    .
      2004 140057 . 1 1   .    .    .
      2005 140057 . 1 1   .    .    .
      2006 140057 . 1 1   .    .    .
      2007 140057 . 1 1   .    .    .
      2008 140057 1 0 1 566 8.25 7.25
      2009 140057 . 1 1   .    .    .
      2010 140057 . 1 1   .    .    .
      2011 140057 1 0 1 566 8.25 7.25
      2012 140057 1 0 1 566 8.25 7.25
      2013 140057 3 0 3 566 8.25 7.25
      2014 140057 1 0 1 566 8.25 7.25
      2015 140057 1 0 1 566 8.25 7.25
      2016 140057 . 1 1   .    .    .
      2017 140057 1 0 1 566 8.25 7.25
      2018 140057 1 0 1 566 8.25 7.25
      2019 140057 2 0 2 566 8.25 7.25
      2020 140057 2 0 2 566 8.25 7.25
      end

      Diego.

      Comment


      • #4
        I would add the following to Carlo's helpful code:

        Code:
        foreach var of varlist country_code-latitude {
          bysort gid (`var'): replace `var'=`var'[1] if `var'==.
        }
        This assumes that the variables are time invariant.

        Comment


        • #5
          Now, It works Andrew Musau !! Thank you. I do not understand very well why but it works. Thank you!

          Comment


          • #6
            For numeric variables, non-missing values are sorted before missing values. So in the code, we define groups by "gid" and within these groups, we ask Stata to sort by the variable that we want to replace missing values by enclosing it within parentheses. Therefore, the first observation within a group will be a non-missing value, provided that there is at least one non-missing value within that group. We do not care which observation is sorted first as the variable is time-invariant (i.e., any will do).

            Comment


            • #7
              I see it. Thank you for the explanation, Andrew Musau . It helps me a lot!.

              Comment


              • #8
                Dear Andrew Musau and Carlo Lazzaro . Sorry to disturb you but I am trying to use the code you provided to do exactly the same as I asked in this post: filling the missing values of cells with the ones it already exists (variables x_GID, y_GID ccode_GID and country_GID does not change by gid)

                I am using the following code (the same than Andrew provided):

                Code:
                 foreach var of varlist x_GID-ccode_GID {  
                bysort gid (`var'): replace `var'=`var'[1] if `var'==.
                }
                However, it works to fill columns x_GID and y_GID, but it does not works for columns country_GID and ccode_GID. I am not sure why because it has at least one observation no missing. It gives me the following error: "type mismatch". Can you help me?

                I attach an example of the dataset:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input long gid int year float(x_GID y_GID) str34 country_GID int ccode_GID float wanted
                62356 1997 37.75 -46.75 "South Africa" 216 0
                62356 1998     .      . ""               . 0
                62356 1999     .      . ""               . 0
                62356 2000     .      . ""               . 0
                62356 2001     .      . ""               . 0
                62356 2002     .      . ""               . 0
                62356 2003     .      . ""               . 0
                62356 2004     .      . ""               . 0
                62356 2005     .      . ""               . 0
                62356 2006     .      . ""               . 0
                62356 2007     .      . ""               . 0
                62356 2008     .      . ""               . 0
                62356 2009     .      . ""               . 0
                62356 2010     .      . ""               . 0
                62356 2011     .      . ""               . 0
                62356 2012     .      . ""               . 0
                62356 2013     .      . ""               . 0
                62356 2014     .      . ""               . 0
                62356 2015     .      . ""               . 0
                62356 2016     .      . ""               . 0
                62356 2017     .      . ""               . 0
                62357 1997 38.25 -46.75 "South Africa" 216 0
                62357 1998     .      . ""               . 0
                62357 1999     .      . ""               . 0
                62357 2000     .      . ""               . 0
                62357 2001     .      . ""               . 0
                62357 2002     .      . ""               . 0
                62357 2003     .      . ""               . 0
                62357 2004     .      . ""               . 0
                62357 2005     .      . ""               . 0
                62357 2006     .      . ""               . 0
                62357 2007     .      . ""               . 0
                62357 2008     .      . ""               . 0
                62357 2009     .      . ""               . 0
                62357 2010     .      . ""               . 0
                62357 2011     .      . ""               . 0
                62357 2012     .      . ""               . 0
                62357 2013     .      . ""               . 0
                62357 2014     .      . ""               . 0
                62357 2015     .      . ""               . 0
                62357 2016     .      . ""               . 0
                62357 2017     .      . ""               . 0
                80317 1997     .      . ""               . 0
                80317 1998     .      . ""               . 0
                80317 1999     .      . ""               . 0
                80317 2000     .      . ""               . 0
                80317 2001     .      . ""               . 0
                80317 2002     .      . ""               . 0
                80317 2003     .      . ""               . 0
                80317 2004     .      . ""               . 0
                80317 2005     .      . ""               . 0
                80317 2006     .      . ""               . 0
                80317 2007     .      . ""               . 0
                80317 2008     .      . ""               . 0
                80317 2009     .      . ""               . 0
                80317 2010     .      . ""               . 0
                80317 2011     .      . ""               . 0
                80317 2012 18.25 -34.25 "South Africa" 216 1
                80317 2013 18.25 -34.25 "South Africa" 216 1
                80317 2014 18.25 -34.25 "South Africa" 216 1
                80317 2015 18.25 -34.25 "South Africa" 216 1
                80317 2016 18.25 -34.25 "South Africa" 216 1
                80317 2017 18.25 -34.25 "South Africa" 216 1
                80318 1997 18.75 -34.25 "South Africa" 216 1
                80318 1998 18.75 -34.25 "South Africa" 216 1
                80318 1999     .      . ""               . 0
                80318 2000     .      . ""               . 0
                80318 2001     .      . ""               . 0
                80318 2002 18.75 -34.25 "South Africa" 216 1
                80318 2003 18.75 -34.25 "South Africa" 216 1
                80318 2004 18.75 -34.25 "South Africa" 216 1
                80318 2005     .      . ""               . 0
                80318 2006     .      . ""               . 0
                80318 2007 18.75 -34.25 "South Africa" 216 1
                80318 2008     .      . ""               . 0
                80318 2009 18.75 -34.25 "South Africa" 216 1
                80318 2010     .      . ""               . 0
                80318 2011     .      . ""               . 0
                80318 2012 18.75 -34.25 "South Africa" 216 3
                80318 2013 18.75 -34.25 "South Africa" 216 3
                80318 2014 18.75 -34.25 "South Africa" 216 1
                80318 2015 18.75 -34.25 "South Africa" 216 3
                80318 2016     .      . ""               . 0
                80318 2017 18.75 -34.25 "South Africa" 216 1
                80319 1997     .      . ""               . 0
                80319 1998     .      . ""               . 0
                80319 1999     .      . ""               . 0
                80319 2000     .      . ""               . 0
                80319 2001     .      . ""               . 0
                80319 2002     .      . ""               . 0
                80319 2003     .      . ""               . 0
                80319 2004     .      . ""               . 0
                80319 2005     .      . ""               . 0
                80319 2006     .      . ""               . 0
                80319 2007     .      . ""               . 0
                80319 2008     .      . ""               . 0
                80319 2009     .      . ""               . 0
                80319 2010     .      . ""               . 0
                80319 2011     .      . ""               . 0
                80319 2012 19.25 -34.25 "South Africa" 216 1
                end
                Thanks,

                Diego.

                Comment


                • #9
                  Sorting order is usually reversed for strings, such that nonmissing values are sorted last. So try replacing `var'[1] with `var'[_N] in the case of strings. For the type mismatch error, use

                  Code:
                  if missing(`var')
                  as the -missing()- function is not sensitive to variable type.
                  Last edited by Andrew Musau; 02 Nov 2021, 05:11.

                  Comment


                  • #10
                    Thank you Andrew Musau . If I am not wrong, what are you suggesting to me is that I should do two loops, one with no string variables and another one with string variables, right?

                    Best,

                    Diego.

                    Comment


                    • #11
                      Not necessary.

                      Code:
                      foreach var of varlist x_GID-ccode_GID {  
                          bysort gid (`var'): replace `var'=`var'[1] if missing(`var') & !missing(`var'[1])
                          bysort gid (`var'): replace `var'=`var'[_N] if missing(`var') & !missing(`var'[_N])
                      }

                      Comment


                      • #12
                        Thank you Andrew Musau . It works well!

                        Comment

                        Working...
                        X