Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Same if clause but different number of observations affected

    I am trying to fill in some missing data with a previous year's data for that country and flag the imputation method. These two statements (I think) have the same if clauses but operate on a different number of observations. Can someone explain and help me get this aligned?

    Thanks!


    // Use last year's rcvd
    sort country year
    by country: replace rcvd_imputation = "Lagged received" if rcvd_tot==. & rcvd_tot[_n-1]!=. & year !=latest_year & do_not_fill!=1
    sort country year
    by country: replace rcvd_tot = rcvd_tot[_n-1] if rcvd_tot==. & rcvd_tot[_n-1]!=. & year !=latest_year & do_not_fill!=1
    Attached Files

  • #2
    The -if- qualifiers do appear to be the same in both commands. BUT, if country and year do not uniquely identify observations in your data set, then when you -sort- the second time, you may be changing the order of the data, so that different numbers of observations satisfy the rcvd_tot==. & rcvd_tot[_n-1]!=. condition.

    So my first question is whether country and year identify unique observations. Running -isid country year- will tell you that. (You will get no output from this command if country and year identify unique observations, but an error message if they don't.)

    If country and year do identify unique observations, then it is not apparent to me what the cause of your problem is. I would suggest, in this case, that you post back with example data that reproduces your problem. Use the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    If country and year do not identify unique observations, then I see some different situations. One is that there is something wrong with your data. After all, to even speak of "lagged received" is not meaningful if there can be multiple observations of the same country in the preceding year--which one is the lagged value? So if your data are wrong, the solution is to review the data management that created the data set and fix it.

    If your data correctly contains multiple observations for (some) country and year then the situation splits into two cases. Case 1 is that you actually have data at a more frequent interval than the year, perhaps the quarter or month. In that case you should sort the data on country and the more refined quarterly or monthly (or whatever it is) date variable. That will make the sort order determinate and will resolve the problem. Case 2 is that you have multiple observations, without any more fine-grained time variable, but your data starts out sorted in such a way that the appropriate lagged observation immediately precedes each observation--you have arranged for this in some way that we need not discuss here. In that case, add the -stable- option to your -sort- commands (and use -by-, not -bysort-), and this order will be preserved, and your problem should, again, be resolved.

    If none of this resolves your problem, do post back with example data that reproduces the problem, using the -dataex- command, and further troubleshooting will be possible.

    Comment


    • #3
      Thanks for the thorough reply Clyde. isid confirmed that country and year uniquely identify observations.

      Below is a peice of the data using dataex and the variables in question. I should also mention latest year is defined in the do file as : scalar latest_year=2023.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str56 country str3 iso3 int year long rcvd_tot double do_not_fill
      "Afghanistan"         "AFG" 2023        . 0
      "Afghanistan"         "AFG" 2022  9741400 0
      "Afghanistan"         "AFG" 2021  8623222 0
      "Afghanistan"         "AFG" 2020 11732326 0
      "Afghanistan"         "AFG" 2019 14984900 0
      "Afghanistan"         "AFG" 2018 12400144 0
      "Afghanistan"         "AFG" 2017 12071477 0
      "Afghanistan"         "AFG" 2016 12189415 0
      "Afghanistan"         "AFG" 2015 10065058 0
      "Afghanistan"         "AFG" 2014  5480435 0
      "Afghanistan"         "AFG" 2013  8025242 0
      "Albania"             "ALB" 2023        . 0
      "Albania"             "ALB" 2022        . 0
      "Albania"             "ALB" 2021        . 0
      "Albania"             "ALB" 2020        . 0
      "Albania"             "ALB" 2019   803879 0
      "Albania"             "ALB" 2018        . 0
      "Albania"             "ALB" 2017        . 0
      "Albania"             "ALB" 2016        . 0
      "Albania"             "ALB" 2015        . 0
      "Albania"             "ALB" 2014        . 0
      "Albania"             "ALB" 2013        . 0
      "Algeria"             "DZA" 2023        . 0
      "Algeria"             "DZA" 2022        . 0
      "Algeria"             "DZA" 2021        . 0
      "Algeria"             "DZA" 2020   795000 0
      "Algeria"             "DZA" 2019  1715500 0
      "Algeria"             "DZA" 2018   391000 0
      "Algeria"             "DZA" 2017   810820 0
      "Algeria"             "DZA" 2016  1125000 0
      "Algeria"             "DZA" 2015        . 0
      "Algeria"             "DZA" 2014   440000 0
      "Algeria"             "DZA" 2013        . 0
      "American Samoa"      "ASM" 2023        . 0
      "American Samoa"      "ASM" 2022        . 0
      "American Samoa"      "ASM" 2021        . 0
      "American Samoa"      "ASM" 2020        . 0
      "American Samoa"      "ASM" 2019        . 0
      "American Samoa"      "ASM" 2018        . 0
      "American Samoa"      "ASM" 2017        . 0
      "American Samoa"      "ASM" 2016        . 0
      "American Samoa"      "ASM" 2015        . 0
      "American Samoa"      "ASM" 2014        . 0
      "American Samoa"      "ASM" 2013        . 0
      "Andorra"             "AND" 2023        . 0
      "Andorra"             "AND" 2022        . 0
      "Andorra"             "AND" 2021        . 0
      "Andorra"             "AND" 2020        . 0
      "Andorra"             "AND" 2019        . 0
      "Andorra"             "AND" 2018        . 0
      "Andorra"             "AND" 2017        . 0
      "Andorra"             "AND" 2016        . 0
      "Andorra"             "AND" 2015        . 0
      "Andorra"             "AND" 2014        . 0
      "Andorra"             "AND" 2013        . 0
      "Angola"              "AGO" 2023        . 0
      "Angola"              "AGO" 2022 10814702 0
      "Angola"              "AGO" 2021  7638394 0
      "Angola"              "AGO" 2020  7425228 0
      "Angola"              "AGO" 2019  5392720 0
      "Angola"              "AGO" 2018 38261289 0
      "Angola"              "AGO" 2017 29268526 0
      "Angola"              "AGO" 2016  9422370 0
      "Angola"              "AGO" 2015 14162000 0
      "Angola"              "AGO" 2014        . 0
      "Angola"              "AGO" 2013        . 0
      "Anguilla"            "AIA" 2023        . 0
      "Anguilla"            "AIA" 2022        . 0
      "Anguilla"            "AIA" 2021        . 0
      "Anguilla"            "AIA" 2020        . 0
      "Anguilla"            "AIA" 2019        . 0
      "Anguilla"            "AIA" 2018        . 0
      "Anguilla"            "AIA" 2017        . 0
      "Anguilla"            "AIA" 2016        . 0
      "Anguilla"            "AIA" 2015        . 0
      "Anguilla"            "AIA" 2014        . 0
      "Anguilla"            "AIA" 2013        . 0
      "Antigua and Barbuda" "ATG" 2023        . 0
      "Antigua and Barbuda" "ATG" 2022        . 0
      "Antigua and Barbuda" "ATG" 2021        . 0
      "Antigua and Barbuda" "ATG" 2020        . 0
      "Antigua and Barbuda" "ATG" 2019        . 0
      "Antigua and Barbuda" "ATG" 2018        . 0
      "Antigua and Barbuda" "ATG" 2017        . 0
      "Antigua and Barbuda" "ATG" 2016        . 0
      "Antigua and Barbuda" "ATG" 2015        . 0
      "Antigua and Barbuda" "ATG" 2014        . 0
      "Antigua and Barbuda" "ATG" 2013        . 0
      "Argentina"           "ARG" 2023        . 0
      "Argentina"           "ARG" 2022  1280770 0
      "Argentina"           "ARG" 2021  2003716 0
      "Argentina"           "ARG" 2020  1735158 0
      "Argentina"           "ARG" 2019  2532872 0
      "Argentina"           "ARG" 2018  1789153 0
      "Argentina"           "ARG" 2017  3276179 0
      "Argentina"           "ARG" 2016  3143397 0
      "Argentina"           "ARG" 2015  1940621 0
      "Argentina"           "ARG" 2014  1513702 0
      "Argentina"           "ARG" 2013  1586978 0
      "Armenia"             "ARM" 2023        . 0
      end

      P.S. I realize the code I pasted above in the picture does not exactly match what I typed out. I tried changing bysort to having the sort line first, followed by a by statement. This did not seem to change my problem with different number of observations being chnaged.

      Comment


      • #4
        These two statements (I think) have the same if clauses but operate on a different number of observations. Can someone explain and help me get this aligned?



        // Use last year's rcvd
        sort country year
        by country: replace rcvd_imputation = "Lagged received" if rcvd_tot==. & rcvd_tot[_n-1]!=. & year !=latest_year & do_not_fill!=1
        sort country year
        by country: replace rcvd_tot = rcvd_tot[_n-1] if rcvd_tot==. & rcvd_tot[_n-1]!=. & year !=latest_year & do_not_fill!=1
        The first condition is static whereas the second condition is dynamic. In the latter, the same variable that is being replaced also defines conditions for replacement. Each time a replacement occurs, Stata reevaluates the specified conditions. What was not true, e.g., "!missing(var[_n-1])" (consider 3rd observation below)


        Code:
        Nonmissing
        Missing
        Missing - FALSE
        now becomes true after replacement

        Code:
        Nonmissing
        Nonmissing
        Missing - TRUE

        So that explains the different number of observations replaced.

        Comment

        Working...
        X