  • dropping missing obs at state-year intersection based on a percentage of missing threshold

    Hello all! Stata/SE 16.1 user.

    I have been trying to work out how to do this for a while now. I have figured out how to identify my issue/patterns of missingness, and I am now looking for an efficient way to address them so that I can replicate different scenarios for sensitivity analysis. I want to eliminate state-years whose proportion of missing observations for a particular variable is above a certain threshold (and keep those below it). All of my variables are categorical. I have read the FAQ and I hope I am in line with its requirements.

    This is how I identified my pattern:

    foreach state in 1 2 5 6 8 11 12 15 16 17 19 20 22 24 26 28 29 30 32 33 34 35 38 39 45 46 47 48 49 56 {
        tab ADMYR wt1 if STFIPS==`state', nofreq row missing
    }

    Below is a good example of one state that I would like to address "wholesale" across all states in my analysis. The missingness across states is "random" in that different states have different patterns of relatively high/low missingness in different years (or no high missingness at all). I have already dropped the states whose missingness is too high for them to be included in the analysis at all (e.g., 90% or above).

    I would like to be able to systematically drop (or keep) state-years at different thresholds of missingness. In this one-state example, 2011 and 2012 would be dropped, and each other state would retain a different set of years (or all years, in many cases).

    My initial approach runs into a wall: because dividing a missing value by something (or something by a missing value) yields missing, I cannot create a percentage-of-missing variable that way. I do realize I may be going about all of this in a very clunky manner. I am trying to get better and faster, and I hope that the forthcoming lessons from you all will help to that end.

    I do appreciate your time in advance, and any and all help.

    ONE STATE EXAMPLE (row percentages of ZT1 by year of admission):

    Year of                         ZT1
    admission      0      1      2      3      4      .    Total
    2009       17.66  34.81  20.60  15.72  10.38   0.83   100.00
    2010       28.01  34.38  16.55  12.77   7.75   0.54   100.00
    2011       17.58  20.24   6.89   5.03   3.13  47.13   100.00
    2012       20.88  12.37   6.36   6.18   4.86  49.35   100.00
    2013       44.93  22.07  12.81  10.10   9.83   0.26   100.00
    2014       64.03  16.42   8.07   5.32   6.04   0.12   100.00
    2015       65.46  15.34   6.82   5.33   7.01   0.04   100.00
    2016       63.85  18.02   6.09   5.44   6.51   0.08   100.00
    2017       65.66  14.50   6.47   6.28   6.94   0.14   100.00
    2018       64.18  17.31   6.15   6.42   5.73   0.21   100.00
    2019       69.85  17.24   5.50   4.55   2.66   0.20   100.00
    2020       64.86  19.03   6.17   5.63   4.21   0.09   100.00
    Total      48.54  20.31   9.18   7.49   6.33   8.14   100.00
    On another note: I have read about the implications of dropping entire state-years from a data set in the context of D-in-D analysis, and about ways to address them, and I may need some additional advice there. I see there is quite a bit out there (and on this forum) on the topic. If anyone has specific considerations on those implications, given the original question and if there is enough context here, I would appreciate them. I realize that may be a separate topic, and apologies if I am reaching with this side note.

    Thanks again for tips, considerations and code.

    Cam Bigler

  • #2
    I don't really follow what you want to do. But it seems to have something to do with the proportion of missing observations of a variable called wt1 among observations sharing the same values of STFIPS and ADMYR. To get that proportion you can do this:
    Code:
    by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
    What you want to do with that result is unclear to me. But perhaps this will be a sufficient nudge in the right direction that you can finish it from there. If not, when posting back, in addition to explaining more clearly, please be sure to include example data from your Stata data set, and use the -dataex- command to do it. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.




    • #3
      Thanks much for the above, Clyde. I will get back with a better breakdown, question, goal, and data example. I will also try out the command to see if it gets me where I need to go. I appreciate the lesson on how to post more effectively - I am learning Stata and how to post here at the same time. I appreciate the patience.
      Cam



      • #4
        Does the data example below help? It only reports one category of ZT1, and there are six (inclusive of missing). ZT1 is how long a person must wait to be accepted into a program.

        I want to remove every state-year whose proportion of missing observations for the wt1 variable is >= .25; that is, remove the entire year for any state with 25% or more missing in that particular year (which state-year combinations those are varies inconsistently by state and across years). And I want to do that for all the states in my data set at once, in a way that lets me investigate sensitivity to the chosen threshold (right now >= .25): change the threshold easily, rerun my D-in-D model to check the treatment effect's sensitivity to it, and run Table 1s to compare the descriptives/demographics.

        I ran this code:
        Code:
        by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
        and it is difficult to interpret when I tab it.

        prop_missing |      Freq.     Percent        Cum.
        -------------+-----------------------------------
        .0003908 | 5,118 0.05 0.05
        .0003986 | 5,018 0.05 0.11
        .0004034 | 2,479 0.03 0.13
        .0004301 | 4,650 0.05 0.18
        .0004819 | 20,753 0.22 0.39
        .0005573 | 12,560 0.13 0.53
        .0006345 | 1,576 0.02 0.54
        .0007117 | 16,862 0.18 0.72
        .0007692 | 2,600 0.03 0.74
        .0007714 | 6,482 0.07 0.81
        .0007734 | 1,293 0.01 0.82
        .0007806 | 2,562 0.03 0.85
        .0007967 | 5,021 0.05 0.90
        .0008535 | 3,515 0.04 0.94
        .0008586 | 34,941 0.36 1.30
        .000877 | 34,207 0.36 1.66
        .000882 | 13,605 0.14 1.80
        .0008956 | 22,331 0.23 2.03
        .0008958 | 24,559 0.26 2.29
        .0009119 | 19,739 0.21 2.49
        .0009665 | 142,784 1.48 3.97
        .0009861 | 17,239 0.18 4.15
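
        One way to make that tabulation easier to read (a sketch only, assuming the prop_missing variable created above) is to tag a single observation per STFIPS-ADMYR cell with -egen, tag()-, so that -tab- shows one row per state-year rather than counting every underlying observation:

```
* tag exactly one observation in each state-year cell
egen sy_tag = tag(STFIPS ADMYR)

* one row per state-year instead of one per observation
tab prop_missing if sy_tag
```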



        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int ADMYR byte STFIPS float ZT1
        2009  6 0
        2009  6 0
        2009  6 0
        2009  8 1
        2009 19 1
        2009 19 0
        2009 31 0
        2010  6 0
        2010  6 1
        2010  8 0
        2010 34 1
        2011  6 0
        2011 24 1
        2011 26 3
        2012 24 0
        2012 25 1
        2012 26 0
        2012 39 1
        2012 39 3
        2013 31 .
        2014  4 4
        2014  6 2
        2014 25 .
        2014 29 1
        2014 48 0
        2014 48 0
        2015  6 0
        2015 29 0
        2015 34 0
        2016 34 0
        2016 34 1
        2016 46 2
        2017  6 0
        2017  8 0
        2017 25 1
        2017 39 2
        2018  4 0
        2018  4 .
        2018  6 0
        2018  6 0
        2018 12 .
        2018 24 0
        2018 39 0
        2019  4 .
        2019  6 1
        2019 26 0
        2019 26 0
        2019 34 2
        2019 42 2
        2020 45 0
        end
        label values ADMYR ADMYR
        label def ADMYR 2009 "2009", modify
        label def ADMYR 2010 "2010", modify
        label def ADMYR 2011 "2011", modify
        label def ADMYR 2012 "2012", modify
        label def ADMYR 2013 "2013", modify
        label def ADMYR 2014 "2014", modify
        label def ADMYR 2015 "2015", modify
        label def ADMYR 2016 "2016", modify
        label def ADMYR 2017 "2017", modify
        label def ADMYR 2018 "2018", modify
        label def ADMYR 2019 "2019", modify
        label def ADMYR 2020 "2020", modify
        label values STFIPS STFIPS
        label def STFIPS 4 "Arizona", modify
        label def STFIPS 6 "California", modify
        label def STFIPS 8 "Colorado", modify
        label def STFIPS 12 "Florida", modify
        label def STFIPS 19 "Iowa", modify
        label def STFIPS 24 "Maryland", modify
        label def STFIPS 25 "Massachusetts", modify
        label def STFIPS 26 "Michigan", modify
        label def STFIPS 29 "Missouri", modify
        label def STFIPS 31 "Nebraska", modify
        label def STFIPS 34 "New Jersey", modify
        label def STFIPS 39 "Ohio", modify
        label def STFIPS 42 "Pennsylvania", modify
        label def STFIPS 45 "South Carolina", modify
        label def STFIPS 46 "South Dakota", modify
        label def STFIPS 48 "Texas", modify

        Additional variables included for example purposes:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int ADMYR byte(STFIPS EDUC GENDER AGE) float ZT1
        2009  6 3 1  8 0
        2009  6 2 2  3 0
        2009  6 4 1  3 0
        2009  8 5 1  7 1
        2009 19 4 1  4 1
        2009 19 2 1  4 0
        2009 31 3 1  9 0
        2010  6 4 1  9 0
        2010  6 1 1  6 1
        2010  8 2 2  3 0
        2010 34 3 2 10 1
        2011  6 4 1 11 0
        2011 24 3 1  9 1
        2011 26 4 2  6 3
        2012 24 2 1  9 0
        2012 25 5 2  5 1
        2012 26 2 1 11 0
        2012 39 3 1  8 1
        2012 39 3 1  5 3
        2013 31 3 1  8 .
        2014  4 5 1  6 4
        2014  6 5 2  6 2
        2014 25 4 1  7 .
        2014 29 3 2  6 1
        2014 48 3 1  9 0
        2014 48 4 1  8 0
        2015  6 2 2  6 0
        2015 29 2 1 11 0
        2015 34 2 2  4 0
        2016 34 1 2  9 0
        2016 34 3 1  8 1
        2016 46 3 2  8 2
        2017  6 4 2  9 0
        2017  8 3 2  6 0
        2017 25 1 2  8 1
        2017 39 3 1  5 2
        2018  4 3 2  5 0
        2018  4 . 2  6 .
        2018  6 2 2 10 0
        2018  6 3 1  6 0
        2018 12 3 2  4 .
        2018 24 2 2  9 0
        2018 39 3 1 11 0
        2019  4 . 2  7 .
        2019  6 3 2  7 1
        2019 26 4 1  7 0
        2019 26 3 1 10 0
        2019 34 2 1  5 2
        2019 42 4 2  9 2
        2020 45 4 1 11 0
        end
        label values ADMYR ADMYR
        label def ADMYR 2009 "2009", modify
        label def ADMYR 2010 "2010", modify
        label def ADMYR 2011 "2011", modify
        label def ADMYR 2012 "2012", modify
        label def ADMYR 2013 "2013", modify
        label def ADMYR 2014 "2014", modify
        label def ADMYR 2015 "2015", modify
        label def ADMYR 2016 "2016", modify
        label def ADMYR 2017 "2017", modify
        label def ADMYR 2018 "2018", modify
        label def ADMYR 2019 "2019", modify
        label def ADMYR 2020 "2020", modify
        label values STFIPS STFIPS
        label def STFIPS 4 "Arizona", modify
        label def STFIPS 6 "California", modify
        label def STFIPS 8 "Colorado", modify
        label def STFIPS 12 "Florida", modify
        label def STFIPS 19 "Iowa", modify
        label def STFIPS 24 "Maryland", modify
        label def STFIPS 25 "Massachusetts", modify
        label def STFIPS 26 "Michigan", modify
        label def STFIPS 29 "Missouri", modify
        label def STFIPS 31 "Nebraska", modify
        label def STFIPS 34 "New Jersey", modify
        label def STFIPS 39 "Ohio", modify
        label def STFIPS 42 "Pennsylvania", modify
        label def STFIPS 45 "South Carolina", modify
        label def STFIPS 46 "South Dakota", modify
        label def STFIPS 48 "Texas", modify
        label values EDUC EDUC
        label def EDUC 1 "Less than one school grade, no schooling, nursery school, or kindergarten to Grade 8", modify
        label def EDUC 2 "Grades 9 to 11", modify
        label def EDUC 3 "Grade 12 (or GED)", modify
        label def EDUC 4 "1-3 years of college, university, or vocational school", modify
        label def EDUC 5 "4 years of college, university, BA/BS, some postgraduate study, or more", modify
        label values GENDER GENDER
        label def GENDER 1 "Male", modify
        label def GENDER 2 "Female", modify
        label values AGE AGE
        label def AGE 3 "18-20 years", modify
        label def AGE 4 "21-24 years", modify
        label def AGE 5 "25-29 years", modify
        label def AGE 6 "30-34 years", modify
        label def AGE 7 "35-39 years", modify
        label def AGE 8 "40-44 years", modify
        label def AGE 9 "45-49 years", modify
        label def AGE 10 "50-54 years", modify
        label def AGE 11 "55-64 years", modify
        Last edited by WILLIAM BIGLER; 30 Mar 2023, 11:27. Reason: added additional variables to the data ex example



        • #5
          I'm sorry, but I don't understand this explanation any more than I did the first. Perhaps somebody else following along grasps what you are trying to do and can help.

          I ran this code:
          by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
          and it is difficult to interpret when I tab it.

          prop_missing |      Freq.     Percent        Cum.
          -------------+-----------------------------------
          .0003908 | 5,118 0.05 0.05
          .0003986 | 5,018 0.05 0.11
          .0004034 | 2,479 0.03 0.13
          .0004301 | 4,650 0.05 0.18
          .0004819 | 20,753 0.22 0.39
          .0005573 | 12,560 0.13 0.53
          .0006345 | 1,576 0.02 0.54
          .0007117 | 16,862 0.18 0.72
          .0007692 | 2,600 0.03 0.74
          .0007714 | 6,482 0.07 0.81
          .0007734 | 1,293 0.01 0.82
          .0007806 | 2,562 0.03 0.85
          .0007967 | 5,021 0.05 0.90
          .0008535 | 3,515 0.04 0.94
          .0008586 | 34,941 0.36 1.30
          .000877 | 34,207 0.36 1.66
          .000882 | 13,605 0.14 1.80
          .0008956 | 22,331 0.23 2.03
          .0008958 | 24,559 0.26 2.29
          .0009119 | 19,739 0.21 2.49
          .0009665 | 142,784 1.48 3.97
          .0009861 | 17,239 0.18 4.15
          What this tells me is that there aren't any state-years that have anything even remotely close to a missing proportion of 0.25 for variable wt1. The largest proportion of missing values for any state-year shown here is under 0.001 (0.1%). Indeed, it looks to me as if the amount of missing data on this variable is so small that it is probably not worth even thinking about, let alone doing elaborate sensitivity analyses about. But, as I say, I don't really get what you're trying to say here, so...



          • #6
            I think I have done a bad job of communicating.

            I want to systematically eliminate state-year pairs that have 25% or more missing observations for ZT1 (that is the proportion I was writing as .25).

            Wyoming shows 2011-2012 with greater than 25% missing ZT1
            West Virginia shows 2009-2015 with greater than 25% missing ZT1

            I would like to be able to set a threshold, and remove state-years at different thresholds of ZT1 missingness, to test sensitivity and to compare descriptive stats, IV distributions, and demographics.

            Something like: if the percentage of missing ZT1 observations for STFIPS X in ADMYR Y is > 25%, drop STFIPS X in ADMYR Y.

            So for Wyoming, 2011 and 2012 would be dropped and for West Virginia 2009-2015 would be dropped and so on for whatever the specific ADMYR-STFIPS and percent missing observations for ZT1 happen to be across all other STFIPS-ADMYR's.

            STATE 1 EXAMPLE (STFIPS == Wyoming):

            Year of                         ZT1
            admission      0      1      2      3      4      .    Total
            2009       17.66  34.81  20.60  15.72  10.38   0.83   100.00
            2010       28.01  34.38  16.55  12.77   7.75   0.54   100.00
            2011       17.58  20.24   6.89   5.03   3.13  47.13   100.00
            2012       20.88  12.37   6.36   6.18   4.86  49.35   100.00
            2013       44.93  22.07  12.81  10.10   9.83   0.26   100.00
            2014       64.03  16.42   8.07   5.32   6.04   0.12   100.00
            2015       65.46  15.34   6.82   5.33   7.01   0.04   100.00
            2016       63.85  18.02   6.09   5.44   6.51   0.08   100.00
            2017       65.66  14.50   6.47   6.28   6.94   0.14   100.00
            2018       64.18  17.31   6.15   6.42   5.73   0.21   100.00
            2019       69.85  17.24   5.50   4.55   2.66   0.20   100.00
            2020       64.86  19.03   6.17   5.63   4.21   0.09   100.00
            Total      48.54  20.31   9.18   7.49   6.33   8.14   100.00

            SECOND STATE EXAMPLE (STFIPS == West Virginia):

            Year of                         ZT1
            admission      0      1      2      3      4      .    Total
            2009        0.79   0.07   0.00   0.00   0.00  99.14   100.00
            2010        3.33   0.00   0.00   0.00   0.00  96.67   100.00
            2011        4.82   0.10   0.00   0.00   0.00  95.08   100.00
            2012       14.25   0.96   0.36   0.54   0.00  83.89   100.00
            2013       11.89   1.84   0.50   0.33   1.47  83.97   100.00
            2014        6.58   2.40   0.25   0.38   0.57  89.81   100.00
            2015       12.17   1.87   0.09   0.09   0.15  85.63   100.00
            2016       95.53   1.47   1.23   0.84   0.56   0.37   100.00
            2017       93.30   2.20   1.58   1.30   1.01   0.62   100.00
            2018       84.09   6.58   3.15   2.70   2.04   1.43   100.00
            2019       80.59   7.14   5.09   2.94   1.80   2.45   100.00
            2020       84.02   8.10   3.02   2.21   1.55   1.10   100.00
            Total      38.45   1.92   0.93   0.70   0.58  57.42   100.00

            The code
            Code:
            by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
            shows that there are no values greater than 25% in the dataex example of only 50 rows (I think). The full data set has much more data, and the missing percentages get pretty high, as shown below:

            .2017285 | 9,835 0.08 77.07
            .2242924 | 103,053 0.84 77.91
            .238669 | 62,748 0.51 78.43
            .239028 | 9,547 0.08 78.51
            .2433147 | 6,432 0.05 78.56
            .2455545 | 9,729 0.08 78.64
            .2597129 | 88,902 0.73 79.36
            .2723651 | 81,578 0.67 80.03
            .2724451 | 77,979 0.64 80.67
            .2730405 | 80,109 0.65 81.32
            .273496 | 9,342 0.08 81.40
            .2760826 | 5,473 0.04 81.44
            .2760923 | 12,130 0.10 81.54
            .2762986 | 5,371 0.04 81.58
            .2819034 | 76,622 0.63 82.21
            .2857357 | 13,334 0.11 82.32
            .2868063 | 78,492 0.64 82.96
            .2899939 | 11,483 0.09 83.05
            .3044097 | 75,901 0.62 83.67
            .3144768 | 10,465 0.09 83.76
            .316069 | 82,849 0.68 84.44
            .3286325 | 40,206 0.33 84.76
            .3381175 | 13,238 0.11 84.87
            .3398874 | 83,816 0.68 85.56
            .3441848 | 7,833 0.06 85.62
            .3725114 | 15,621 0.13 85.75
            .3878949 | 57,959 0.47 86.22
            .3915403 | 80,334 0.66 86.88
            .3997055 | 14,263 0.12 86.99
            .4016083 | 46,757 0.38 87.38
            .4434658 | 11,126 0.09 87.47
            .4488189 | 33,782 0.28 87.74
            .4535597 | 39,933 0.33 88.07
            .4556516 | 11,917 0.10 88.17
            .4713427 | 4,990 0.04 88.21
            .4846365 | 79,474 0.65 88.86
            .4934896 | 4,608 0.04 88.89
            .497146 | 79,188 0.65 89.54
            .5016962 | 78,705 0.64 90.18
            .5019432 | 13,123 0.11 90.29
            .508274 | 9,971 0.08 90.37
            .5181758 | 1,513 0.01 90.39
            .5933018 | 14,780 0.12 90.51
            .6041512 | 13,490 0.11 90.62
            .6205587 | 84,013 0.69 91.30
            .6246219 | 83,308 0.68 91.98
            .6312428 | 75,006 0.61 92.60
            .6368122 | 73,981 0.60 93.20
            .6389152 | 9,513 0.08 93.28
            .6913791 | 14,082 0.12 93.39
            .7054709 | 10,437 0.09 93.48
            .7379734 | 1,954 0.02 93.49
            .7467188 | 11,505 0.09 93.59
            .7569011 | 19,852 0.16 93.75
            .7604423 | 19,895 0.16 93.91
            .8131564 | 8,665 0.07 93.98
            .8343295 | 143,707 1.17 95.16
            .8389221 | 1,670 0.01 95.17
            .8396819 | 2,389 0.02 95.19
            .8562691 | 3,270 0.03 95.22
            .8981153 | 3,661 0.03 95.25
            .9033204 | 7,499 0.06 95.31
            .9211974 | 7,817 0.06 95.37
            .9507979 | 3,008 0.02 95.40
            .9667388 | 3,698 0.03 95.43
            .9693791 | 8,295 0.07 95.49
            .9809983 | 9,736 0.08 95.57
            .9913983 | 8,603 0.07 95.64
            .9970852 | 48,030 0.39 96.04
            .9975887 | 12,856 0.10 96.14
            .9984533 | 57,540 0.47 96.61
            .9989596 | 58,630 0.48 97.09
            1 | 356,350 2.91 100.00
            ------------+-----------------------------------
            Total | 12,243,932 100.00





            • #7
              I am trying to avoid having to manually type the lines below for each state (Wyoming and WV as examples), mainly because that way I cannot easily switch the percent-missing-ZT1 threshold across all ADMYR-STFIPS intersections and then reevaluate and compare everything I want to:

              drop if ADMYR ==2011 | ADMYR ==2012 & STFIPS ==56
              drop if ADMYR ==2009-2015 (I know this does not work) & STFIPS ==54
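
              For reference, corrected one-off versions of those two lines (a sketch only; note that in Stata & binds more tightly than |, so the first line needs parentheses, and a contiguous year range can be written with -inrange()-):

```
* parentheses required: & binds more tightly than |
drop if (ADMYR == 2011 | ADMYR == 2012) & STFIPS == 56

* a contiguous year range, written with inrange()
drop if inrange(ADMYR, 2009, 2015) & STFIPS == 54
```

              Even corrected, this still has to be retyped state by state, which is exactly what a threshold-based approach would avoid.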



              • #8
                OK, much clearer, thanks.
                Code:
                local threshold 0.25
                by STFIPS ADMYR, sort: egen prop_missing_zt1 = mean(missing(ZT1))
                drop if prop_missing_zt1 >= `threshold'
                will do it. If you want to change the threshold from 25% to some other value just change the number at the end of the -local threshold- command accordingly.
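
                If the goal is to rerun everything at several thresholds, one possible extension (a sketch only; the model and table commands are placeholders to fill in) is to loop over thresholds with -preserve-/-restore- so that each pass starts from the full data set, assuming the prop_missing_zt1 variable created above:

```
foreach t of numlist 0.10 0.25 0.50 {
    preserve
    drop if prop_missing_zt1 >= `t'
    display "threshold `t': " _N " observations retained"
    * ... run the D-in-D model and descriptive tables here ...
    restore
}
```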



                • #9
                  HA! Cool! Thanks much! It worked, and I think this puts me much farther down the road - I have another analysis this directly ties to as well. This code is useful for me.
                  Appreciate you sticking with me, Clyde!

                  One other question - this drops these state-years completely out of the analysis, across all observations, correct? These dropped state-year pairs are "gone, period"? I hope so.

                  I believe I answered my own question...
                  Last edited by WILLIAM BIGLER; 31 Mar 2023, 11:33. Reason: I believe I answered my own question...



                  • #10
                    Yes, those state-years are completely gone from the data at that point. You will never see them again unless you read them in from a previous version of the data set.



                    • #11
                      Great! This was all very helpful, both from an analysis-advice standpoint and for learning how to communicate better here. Cheers
                      Cam

