  • dropping missing obs at state-year intersection based on a percentage of missing threshold

    Hello all! Stata/SE 16.1 user.

    I have been trying to work out how to do this for a while now. I have figured out how to identify my issue/patterns of missingness, and I am now looking for an efficient way to address them so that I can replicate different scenarios for sensitivity analysis. I want to eliminate state-years whose proportion of missing observations for a particular variable is above a certain threshold (and keep those below it). All of my variables are categorical. I have read the FAQ and I hope I am in line with its requirements.

    This is how I identified my pattern:

    foreach state in 1 2 5 6 8 11 12 15 16 17 19 20 22 24 26 28 29 30 32 33 34 35 38 39 45 46 47 48 49 56 {
        tab ADMYR wt1 if STFIPS==`state', nofreq row missing
    }

    Below is a good example of one state that I would like to address "wholesale" across all states in my analysis. The missingness across states is "random" in that different states have different patterns of relatively high/low missingness in different years (or no high missingness at all). I have already dropped the states whose missingness is too high for them to be included in the analysis at all (e.g., 90% or above).

    I would like to be able to systematically drop (or keep) state-years at different thresholds of missingness. In this one-state example, 2011 and 2012 would be dropped, and each other state would retain a different set of years (or all years, in many cases).

    My initial approach runs into a wall: because dividing a missing value by something (or something by a missing value) yields missing, I cannot create a percentage-of-missing variable that way. I do realize I may be going about all of this in a very clunky manner. I am trying to get better and faster, and I hope that the forthcoming lessons from you all will help to that end.

    I do appreciate your time in advance, and any and all help.

    ONE STATE EXAMPLE (row percentages of ZT1 by year of admission):

    Year of                         ZT1
    admission      0      1      2      3      4      .    Total
    2009       17.66  34.81  20.60  15.72  10.38   0.83   100.00
    2010       28.01  34.38  16.55  12.77   7.75   0.54   100.00
    2011       17.58  20.24   6.89   5.03   3.13  47.13   100.00
    2012       20.88  12.37   6.36   6.18   4.86  49.35   100.00
    2013       44.93  22.07  12.81  10.10   9.83   0.26   100.00
    2014       64.03  16.42   8.07   5.32   6.04   0.12   100.00
    2015       65.46  15.34   6.82   5.33   7.01   0.04   100.00
    2016       63.85  18.02   6.09   5.44   6.51   0.08   100.00
    2017       65.66  14.50   6.47   6.28   6.94   0.14   100.00
    2018       64.18  17.31   6.15   6.42   5.73   0.21   100.00
    2019       69.85  17.24   5.50   4.55   2.66   0.20   100.00
    2020       64.86  19.03   6.17   5.63   4.21   0.09   100.00
    Total      48.54  20.31   9.18   7.49   6.33   8.14   100.00
    On another note: I have read about the implications of dropping entire state-years from a data set in the context of D-in-D analysis, and about ways to address them, and I may need some additional advice there. I see there is quite a bit out there (and on this forum) on the topic. If anyone has specific considerations on those implications, given the original question and if there is enough context here, I would appreciate them. I realize that may be a separate topic, and apologies if I am reaching with this side note.

    Thanks again for tips, considerations and code.

    Cam Bigler

  • #2
    I don't really follow what you want to do. But it seems to have something to do with the proportion of missing observations of a variable called wt1 among observations sharing the same values of STFIPS and ADMYR. To get that proportion you can do this:
    Code:
    by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
    What you want to do with that result is unclear to me. But perhaps this will be a sufficient nudge in the right direction that you can finish it from there. If not, when posting back, in addition to explaining more clearly, please be sure to include example data from your Stata data set, and use the -dataex- command to do it. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.




    • #3
      Thanks much for the above, Clyde. I will get back with a better breakdown, question, goal, and data example. I will also try out the command to see if it gets me where I need to go. I appreciate the lesson on how to post more effectively - I am learning Stata and how to post here at the same time. I appreciate the patience.
      Cam



      • #4
        Does the data example below help? It only reports one category of ZT1, and there are six (inclusive of missing). ZT1 is how long a person must wait to be accepted into a program.

        I want to remove every state-year whose proportion of missing observations for the wt1 variable is >= .25; that is, remove the entire year for any state with 25% or more missing in that particular year (which state-year combinations those are varies inconsistently by state and across years). And I want to do that for all the states in my data set at once, in a way that lets me investigate sensitivity to the chosen threshold (right now >= .25): change the threshold easily, rerun my D-in-D model to check the treatment effect's sensitivity to it, and run Table 1s to compare the descriptives/demographics.

        I ran this code:
        Code:
        by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
        and it is difficult to interpret when I tab it.

        prop_missing |      Freq.     Percent        Cum.
        -------------+-----------------------------------
        .0003908 | 5,118 0.05 0.05
        .0003986 | 5,018 0.05 0.11
        .0004034 | 2,479 0.03 0.13
        .0004301 | 4,650 0.05 0.18
        .0004819 | 20,753 0.22 0.39
        .0005573 | 12,560 0.13 0.53
        .0006345 | 1,576 0.02 0.54
        .0007117 | 16,862 0.18 0.72
        .0007692 | 2,600 0.03 0.74
        .0007714 | 6,482 0.07 0.81
        .0007734 | 1,293 0.01 0.82
        .0007806 | 2,562 0.03 0.85
        .0007967 | 5,021 0.05 0.90
        .0008535 | 3,515 0.04 0.94
        .0008586 | 34,941 0.36 1.30
        .000877 | 34,207 0.36 1.66
        .000882 | 13,605 0.14 1.80
        .0008956 | 22,331 0.23 2.03
        .0008958 | 24,559 0.26 2.29
        .0009119 | 19,739 0.21 2.49
        .0009665 | 142,784 1.48 3.97
        .0009861 | 17,239 0.18 4.15
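
        One way to make that tabulation easier to read (a sketch only, assuming the prop_missing variable created above) is to tag a single observation per STFIPS-ADMYR cell with -egen, tag()-, so that -tab- shows one row per state-year rather than counting every underlying observation:

```
* tag exactly one observation in each state-year cell
egen sy_tag = tag(STFIPS ADMYR)

* one row per state-year instead of one per observation
tab prop_missing if sy_tag
```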



        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int ADMYR byte STFIPS float ZT1
        2009  6 0
        2009  6 0
        2009  6 0
        2009  8 1
        2009 19 1
        2009 19 0
        2009 31 0
        2010  6 0
        2010  6 1
        2010  8 0
        2010 34 1
        2011  6 0
        2011 24 1
        2011 26 3
        2012 24 0
        2012 25 1
        2012 26 0
        2012 39 1
        2012 39 3
        2013 31 .
        2014  4 4
        2014  6 2
        2014 25 .
        2014 29 1
        2014 48 0
        2014 48 0
        2015  6 0
        2015 29 0
        2015 34 0
        2016 34 0
        2016 34 1
        2016 46 2
        2017  6 0
        2017  8 0
        2017 25 1
        2017 39 2
        2018  4 0
        2018  4 .
        2018  6 0
        2018  6 0
        2018 12 .
        2018 24 0
        2018 39 0
        2019  4 .
        2019  6 1
        2019 26 0
        2019 26 0
        2019 34 2
        2019 42 2
        2020 45 0
        end
        label values ADMYR ADMYR
        label def ADMYR 2009 "2009", modify
        label def ADMYR 2010 "2010", modify
        label def ADMYR 2011 "2011", modify
        label def ADMYR 2012 "2012", modify
        label def ADMYR 2013 "2013", modify
        label def ADMYR 2014 "2014", modify
        label def ADMYR 2015 "2015", modify
        label def ADMYR 2016 "2016", modify
        label def ADMYR 2017 "2017", modify
        label def ADMYR 2018 "2018", modify
        label def ADMYR 2019 "2019", modify
        label def ADMYR 2020 "2020", modify
        label values STFIPS STFIPS
        label def STFIPS 4 "Arizona", modify
        label def STFIPS 6 "California", modify
        label def STFIPS 8 "Colorado", modify
        label def STFIPS 12 "Florida", modify
        label def STFIPS 19 "Iowa", modify
        label def STFIPS 24 "Maryland", modify
        label def STFIPS 25 "Massachusetts", modify
        label def STFIPS 26 "Michigan", modify
        label def STFIPS 29 "Missouri", modify
        label def STFIPS 31 "Nebraska", modify
        label def STFIPS 34 "New Jersey", modify
        label def STFIPS 39 "Ohio", modify
        label def STFIPS 42 "Pennsylvania", modify
        label def STFIPS 45 "South Carolina", modify
        label def STFIPS 46 "South Dakota", modify
        label def STFIPS 48 "Texas", modify

        Additional variables included for example purposes:

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int ADMYR byte(STFIPS EDUC GENDER AGE) float ZT1
        2009  6 3 1  8 0
        2009  6 2 2  3 0
        2009  6 4 1  3 0
        2009  8 5 1  7 1
        2009 19 4 1  4 1
        2009 19 2 1  4 0
        2009 31 3 1  9 0
        2010  6 4 1  9 0
        2010  6 1 1  6 1
        2010  8 2 2  3 0
        2010 34 3 2 10 1
        2011  6 4 1 11 0
        2011 24 3 1  9 1
        2011 26 4 2  6 3
        2012 24 2 1  9 0
        2012 25 5 2  5 1
        2012 26 2 1 11 0
        2012 39 3 1  8 1
        2012 39 3 1  5 3
        2013 31 3 1  8 .
        2014  4 5 1  6 4
        2014  6 5 2  6 2
        2014 25 4 1  7 .
        2014 29 3 2  6 1
        2014 48 3 1  9 0
        2014 48 4 1  8 0
        2015  6 2 2  6 0
        2015 29 2 1 11 0
        2015 34 2 2  4 0
        2016 34 1 2  9 0
        2016 34 3 1  8 1
        2016 46 3 2  8 2
        2017  6 4 2  9 0
        2017  8 3 2  6 0
        2017 25 1 2  8 1
        2017 39 3 1  5 2
        2018  4 3 2  5 0
        2018  4 . 2  6 .
        2018  6 2 2 10 0
        2018  6 3 1  6 0
        2018 12 3 2  4 .
        2018 24 2 2  9 0
        2018 39 3 1 11 0
        2019  4 . 2  7 .
        2019  6 3 2  7 1
        2019 26 4 1  7 0
        2019 26 3 1 10 0
        2019 34 2 1  5 2
        2019 42 4 2  9 2
        2020 45 4 1 11 0
        end
        label values ADMYR ADMYR
        label def ADMYR 2009 "2009", modify
        label def ADMYR 2010 "2010", modify
        label def ADMYR 2011 "2011", modify
        label def ADMYR 2012 "2012", modify
        label def ADMYR 2013 "2013", modify
        label def ADMYR 2014 "2014", modify
        label def ADMYR 2015 "2015", modify
        label def ADMYR 2016 "2016", modify
        label def ADMYR 2017 "2017", modify
        label def ADMYR 2018 "2018", modify
        label def ADMYR 2019 "2019", modify
        label def ADMYR 2020 "2020", modify
        label values STFIPS STFIPS
        label def STFIPS 4 "Arizona", modify
        label def STFIPS 6 "California", modify
        label def STFIPS 8 "Colorado", modify
        label def STFIPS 12 "Florida", modify
        label def STFIPS 19 "Iowa", modify
        label def STFIPS 24 "Maryland", modify
        label def STFIPS 25 "Massachusetts", modify
        label def STFIPS 26 "Michigan", modify
        label def STFIPS 29 "Missouri", modify
        label def STFIPS 31 "Nebraska", modify
        label def STFIPS 34 "New Jersey", modify
        label def STFIPS 39 "Ohio", modify
        label def STFIPS 42 "Pennsylvania", modify
        label def STFIPS 45 "South Carolina", modify
        label def STFIPS 46 "South Dakota", modify
        label def STFIPS 48 "Texas", modify
        label values EDUC EDUC
        label def EDUC 1 "Less than one school grade, no schooling, nursery school, or kindergarten to Grade 8", modify
        label def EDUC 2 "Grades 9 to 11", modify
        label def EDUC 3 "Grade 12 (or GED)", modify
        label def EDUC 4 "1-3 years of college, university, or vocational school", modify
        label def EDUC 5 "4 years of college, university, BA/BS, some postgraduate study, or more", modify
        label values GENDER GENDER
        label def GENDER 1 "Male", modify
        label def GENDER 2 "Female", modify
        label values AGE AGE
        label def AGE 3 "18-20 years", modify
        label def AGE 4 "21-24 years", modify
        label def AGE 5 "25-29 years", modify
        label def AGE 6 "30-34 years", modify
        label def AGE 7 "35-39 years", modify
        label def AGE 8 "40-44 years", modify
        label def AGE 9 "45-49 years", modify
        label def AGE 10 "50-54 years", modify
        label def AGE 11 "55-64 years", modify
        Last edited by WILLIAM BIGLER; 30 Mar 2023, 11:27. Reason: added additional variables to the data ex example



        • #5
          I'm sorry, but I don't understand this explanation any more than I did the first. Perhaps somebody else following along grasps what you are trying to do and can help.

          I ran this code:
          by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
          and it is difficult to interpret when I tab it.

          prop_missing |      Freq.     Percent        Cum.
          -------------+-----------------------------------
          .0003908 | 5,118 0.05 0.05
          .0003986 | 5,018 0.05 0.11
          .0004034 | 2,479 0.03 0.13
          .0004301 | 4,650 0.05 0.18
          .0004819 | 20,753 0.22 0.39
          .0005573 | 12,560 0.13 0.53
          .0006345 | 1,576 0.02 0.54
          .0007117 | 16,862 0.18 0.72
          .0007692 | 2,600 0.03 0.74
          .0007714 | 6,482 0.07 0.81
          .0007734 | 1,293 0.01 0.82
          .0007806 | 2,562 0.03 0.85
          .0007967 | 5,021 0.05 0.90
          .0008535 | 3,515 0.04 0.94
          .0008586 | 34,941 0.36 1.30
          .000877 | 34,207 0.36 1.66
          .000882 | 13,605 0.14 1.80
          .0008956 | 22,331 0.23 2.03
          .0008958 | 24,559 0.26 2.29
          .0009119 | 19,739 0.21 2.49
          .0009665 | 142,784 1.48 3.97
          .0009861 | 17,239 0.18 4.15
          What this tells me is that there aren't any state-years that have anything even remotely close to a missing proportion of 0.25 for variable wt1. The largest proportion of missing values for any state-year shown here is under 0.001 (0.1%). Indeed, it looks to me as if the amount of missing data on this variable is so small that it is probably not worth even thinking about, let alone doing elaborate sensitivity analyses about. But, as I say, I don't really get what you're trying to say here, so...



          • #6
            I think I have done a bad job of communicating.

            I want to systematically eliminate state-year pairs that have 25% or more missing observations for ZT1 (that is the proportion I was writing as .25).

            Wyoming shows 2011-2012 with greater than 25% missing ZT1
            West Virginia shows 2009-2015 with greater than 25% missing ZT1

            I would like to be able to set a threshold, and remove state-years at different thresholds of ZT1 missingness, to test sensitivity and to compare descriptive stats, IV distributions, and demographics.

            Something like: if the percentage of missing ZT1 observations for STFIPS X in ADMYR Y is > 25%, drop STFIPS X in ADMYR Y.

            So for Wyoming, 2011 and 2012 would be dropped and for West Virginia 2009-2015 would be dropped and so on for whatever the specific ADMYR-STFIPS and percent missing observations for ZT1 happen to be across all other STFIPS-ADMYR's.

            STATE 1 EXAMPLE (STFIPS == Wyoming):

            Year of                         ZT1
            admission      0      1      2      3      4      .    Total
            2009       17.66  34.81  20.60  15.72  10.38   0.83   100.00
            2010       28.01  34.38  16.55  12.77   7.75   0.54   100.00
            2011       17.58  20.24   6.89   5.03   3.13  47.13   100.00
            2012       20.88  12.37   6.36   6.18   4.86  49.35   100.00
            2013       44.93  22.07  12.81  10.10   9.83   0.26   100.00
            2014       64.03  16.42   8.07   5.32   6.04   0.12   100.00
            2015       65.46  15.34   6.82   5.33   7.01   0.04   100.00
            2016       63.85  18.02   6.09   5.44   6.51   0.08   100.00
            2017       65.66  14.50   6.47   6.28   6.94   0.14   100.00
            2018       64.18  17.31   6.15   6.42   5.73   0.21   100.00
            2019       69.85  17.24   5.50   4.55   2.66   0.20   100.00
            2020       64.86  19.03   6.17   5.63   4.21   0.09   100.00
            Total      48.54  20.31   9.18   7.49   6.33   8.14   100.00

            SECOND STATE EXAMPLE (STFIPS == West Virginia):

            Year of                         ZT1
            admission      0      1      2      3      4      .    Total
            2009        0.79   0.07   0.00   0.00   0.00  99.14   100.00
            2010        3.33   0.00   0.00   0.00   0.00  96.67   100.00
            2011        4.82   0.10   0.00   0.00   0.00  95.08   100.00
            2012       14.25   0.96   0.36   0.54   0.00  83.89   100.00
            2013       11.89   1.84   0.50   0.33   1.47  83.97   100.00
            2014        6.58   2.40   0.25   0.38   0.57  89.81   100.00
            2015       12.17   1.87   0.09   0.09   0.15  85.63   100.00
            2016       95.53   1.47   1.23   0.84   0.56   0.37   100.00
            2017       93.30   2.20   1.58   1.30   1.01   0.62   100.00
            2018       84.09   6.58   3.15   2.70   2.04   1.43   100.00
            2019       80.59   7.14   5.09   2.94   1.80   2.45   100.00
            2020       84.02   8.10   3.02   2.21   1.55   1.10   100.00
            Total      38.45   1.92   0.93   0.70   0.58  57.42   100.00

            The code
            Code:
            by STFIPS ADMYR, sort: egen prop_missing = mean(missing(wt1))
            shows that there are no values greater than 25% in the dataex example of only 50 rows (I think). The full data set has much more data, and the missing percentages get pretty high, as shown below:

            .2017285 | 9,835 0.08 77.07
            .2242924 | 103,053 0.84 77.91
            .238669 | 62,748 0.51 78.43
            .239028 | 9,547 0.08 78.51
            .2433147 | 6,432 0.05 78.56
            .2455545 | 9,729 0.08 78.64
            .2597129 | 88,902 0.73 79.36
            .2723651 | 81,578 0.67 80.03
            .2724451 | 77,979 0.64 80.67
            .2730405 | 80,109 0.65 81.32
            .273496 | 9,342 0.08 81.40
            .2760826 | 5,473 0.04 81.44
            .2760923 | 12,130 0.10 81.54
            .2762986 | 5,371 0.04 81.58
            .2819034 | 76,622 0.63 82.21
            .2857357 | 13,334 0.11 82.32
            .2868063 | 78,492 0.64 82.96
            .2899939 | 11,483 0.09 83.05
            .3044097 | 75,901 0.62 83.67
            .3144768 | 10,465 0.09 83.76
            .316069 | 82,849 0.68 84.44
            .3286325 | 40,206 0.33 84.76
            .3381175 | 13,238 0.11 84.87
            .3398874 | 83,816 0.68 85.56
            .3441848 | 7,833 0.06 85.62
            .3725114 | 15,621 0.13 85.75
            .3878949 | 57,959 0.47 86.22
            .3915403 | 80,334 0.66 86.88
            .3997055 | 14,263 0.12 86.99
            .4016083 | 46,757 0.38 87.38
            .4434658 | 11,126 0.09 87.47
            .4488189 | 33,782 0.28 87.74
            .4535597 | 39,933 0.33 88.07
            .4556516 | 11,917 0.10 88.17
            .4713427 | 4,990 0.04 88.21
            .4846365 | 79,474 0.65 88.86
            .4934896 | 4,608 0.04 88.89
            .497146 | 79,188 0.65 89.54
            .5016962 | 78,705 0.64 90.18
            .5019432 | 13,123 0.11 90.29
            .508274 | 9,971 0.08 90.37
            .5181758 | 1,513 0.01 90.39
            .5933018 | 14,780 0.12 90.51
            .6041512 | 13,490 0.11 90.62
            .6205587 | 84,013 0.69 91.30
            .6246219 | 83,308 0.68 91.98
            .6312428 | 75,006 0.61 92.60
            .6368122 | 73,981 0.60 93.20
            .6389152 | 9,513 0.08 93.28
            .6913791 | 14,082 0.12 93.39
            .7054709 | 10,437 0.09 93.48
            .7379734 | 1,954 0.02 93.49
            .7467188 | 11,505 0.09 93.59
            .7569011 | 19,852 0.16 93.75
            .7604423 | 19,895 0.16 93.91
            .8131564 | 8,665 0.07 93.98
            .8343295 | 143,707 1.17 95.16
            .8389221 | 1,670 0.01 95.17
            .8396819 | 2,389 0.02 95.19
            .8562691 | 3,270 0.03 95.22
            .8981153 | 3,661 0.03 95.25
            .9033204 | 7,499 0.06 95.31
            .9211974 | 7,817 0.06 95.37
            .9507979 | 3,008 0.02 95.40
            .9667388 | 3,698 0.03 95.43
            .9693791 | 8,295 0.07 95.49
            .9809983 | 9,736 0.08 95.57
            .9913983 | 8,603 0.07 95.64
            .9970852 | 48,030 0.39 96.04
            .9975887 | 12,856 0.10 96.14
            .9984533 | 57,540 0.47 96.61
            .9989596 | 58,630 0.48 97.09
            1 | 356,350 2.91 100.00
            ------------+-----------------------------------
            Total | 12,243,932 100.00





            • #7
              I am trying to avoid having to manually type the lines below for each state (Wyoming and WV as examples), mainly because that way I cannot easily switch the percent-missing-ZT1 threshold across all ADMYR-STFIPS intersections and then reevaluate and compare everything I want to:

              drop if ADMYR ==2011 | ADMYR ==2012 & STFIPS ==56
              drop if ADMYR ==2009-2015 (I know this does not work) & STFIPS ==54
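
              For reference, corrected one-off versions of those two lines (a sketch only; note that in Stata & binds more tightly than |, so the first line needs parentheses, and a contiguous year range can be written with -inrange()-):

```
* parentheses required: & binds more tightly than |
drop if (ADMYR == 2011 | ADMYR == 2012) & STFIPS == 56

* a contiguous year range, written with inrange()
drop if inrange(ADMYR, 2009, 2015) & STFIPS == 54
```

              Even corrected, this still has to be retyped state by state, which is exactly what a threshold-based approach would avoid.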



              • #8
                OK, much clearer, thanks.
                Code:
                local threshold 0.25
                by STFIPS ADMYR, sort: egen prop_missing_zt1 = mean(missing(ZT1))
                drop if prop_missing_zt1 >= `threshold'
                will do it. If you want to change the threshold from 25% to some other value just change the number at the end of the -local threshold- command accordingly.
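
                If the goal is to rerun everything at several thresholds, one possible extension (a sketch only; the model and table commands are placeholders to fill in) is to loop over thresholds with -preserve-/-restore- so that each pass starts from the full data set, assuming the prop_missing_zt1 variable created above:

```
foreach t of numlist 0.10 0.25 0.50 {
    preserve
    drop if prop_missing_zt1 >= `t'
    display "threshold `t': " _N " observations retained"
    * ... run the D-in-D model and descriptive tables here ...
    restore
}
```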



                • #9
                  HA! Cool! Thanks much! It worked, and I think this puts me much farther down the road - I have another analysis this directly ties to as well. This code is useful for me.
                  Appreciate you sticking with me, Clyde!

                  One other question - this drops these state-years completely out of the analysis, across all observations, correct? These dropped state-year pairs are "gone, period"? I hope so.

                  I believe I answered my own question...
                  Last edited by WILLIAM BIGLER; 31 Mar 2023, 11:33. Reason: I believe I answered my own question...



                  • #10
                    Yes, those state-years are completely gone from the data at that point. You will never see them again unless you read them in from a previous version of the data set.



                    • #11
                      Great! This was all very helpful, both from an analysis-advice standpoint and for learning how to communicate better here. Cheers
                      Cam

