Reassigning records with missing values to existing categories considering the existing distribution within the population

Raffaele Palladino

Join Date: Apr 2014

Posts: 23
#1

31 Mar 2020, 10:24

Dear Statalist,

I am struggling to reassign records that have a missing value on the categorical variable assigning each record to a province. I would like to semi-randomly reassign those records to existing provinces in proportion to the existing frequency of each province. Below is an example with 20 records already assigned to a province and three records with a missing province; freq reports the overall relative frequency of each province. I would like the three missing values to be reassigned semi-randomly according to those frequencies. I am using Stata 14 MP.

Thanks

clear
input float(province sex freq)
1 1 .15
1 1 .15
1 2 .15
2 1 .1
2 2 .1
3 2 .1
3 2 .1
4 2 .25
4 2 .25
4 1 .25
4 1 .25
4 1 .25
5 2 .1
5 1 .1
6 2 .1
6 1 .1
7 1 .05
8 1 .05
9 2 .1
9 2 .1
. 2 .
. 1 .
. 1 .
end
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

31 Mar 2020, 14:27

Your request doesn't quite make sense to me: You say there are "missing value[s] on the categorical variable that assigns each record to a province," which sounds like you want to replace missing values of the province variable, but then you go on to talk about assigning randomly according to the distribution "within each province." Also, I'm not certain what you mean by "semi" randomly. And I suspect people here who are more expert than me in modern methods of imputation would not bless your idea for handling missing data. That being said, here's a demonstration of a technique you might find of use, namely how to replace missing values with ones sampled with replacement and with equal probability from the observed distribution of nonmissing values.

Code:

sort province                      // missing values go to the end
count if !missing(province)        // r(N) is last nonmissing
gen int randspot = ceil(runiform() * r(N))
replace province = province[randspot] if missing(province)
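
A self-contained way to try this on the example data might look like the sketch below. It rebuilds only the province counts from the original post (the sex variable is left out). The seed value, the counts variable n, the flag was_missing, and the use of long instead of int are illustrative assumptions, not part of the code above. Because every nonmissing row is equally likely to be picked, a province holding a share p of the nonmissing rows is drawn with probability p, which is why the draw follows the observed distribution.

```stata
clear
* Province counts taken from the example in post #1; n is a hypothetical helper
input float(province n)
1 3
2 2
3 2
4 5
5 2
6 2
7 1
8 1
9 2
. 3
end
expand n                            // one row per record: 20 nonmissing + 3 missing
drop n

gen byte was_missing = missing(province)   // flag rows to be filled (assumed name)

sort province                       // missing values sort to the end
count if !missing(province)
local n_obs = r(N)                  // number of nonmissing rows, here 20

set seed 2020                       // arbitrary seed, only for reproducibility
* long rather than int, in case the data have more than 32,740 rows
gen long randspot = ceil(runiform() * `n_obs')
replace province = province[randspot] if was_missing

* Checks: nothing is left missing, and only existing provinces were used
assert !missing(province)
assert inrange(province, 1, 9)
tabulate province was_missing
```

The final tabulate shows which provinces the three formerly missing records landed in; a different seed gives a different draw.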
Raffaele Palladino

Join Date: Apr 2014

Posts: 23
#3

01 Apr 2020, 01:50

Thanks Mike. I did not consider MI as an option in this case given the small proportion of missing values (around 0.5% of the sample), which were due to miscoding.