
  • Reassigning records with missing values to existing categories considering the existing distribution within the population

    Dear Statalist,

    I am struggling to reassign records that have a missing value on the categorical variable assigning each record to a province. I would like to semi-randomly reassign these records to existing provinces in proportion to the existing frequency of each province. Below is an example in which 20 records are already assigned to a province; I also report the overall frequency for each province. I would like to semi-randomly reassign the three records with missing values according to those frequencies. I am using Stata 14 MP.

    Thanks



    clear
    input float(province sex freq)
    1 1 .15
    1 1 .15
    1 2 .15
    2 1 .1
    2 2 .1
    3 2 .1
    3 2 .1
    4 2 .25
    4 2 .25
    4 1 .25
    4 1 .25
    4 1 .25
    5 2 .1
    5 1 .1
    6 2 .1
    6 1 .1
    7 1 .05
    8 1 .05
    9 2 .1
    9 2 .1
    . 2 .
    . 1 .
    . 1 .
    end

  • #2
    Your request doesn't quite make sense to me: You say there are "missing value[s] on the categorical variable that assigns each record to a province," which sounds like you want to replace missing values of the province variable, but then you go on to talk about assigning randomly according to the distribution "within each province." Also, I'm not certain what you mean by "semi" randomly. And I suspect people here who are more expert than I am in modern methods of imputation would not bless your idea for handling missing data. That being said, here's a demonstration of a technique you might find useful, namely how to replace missing values with ones sampled with replacement and with equal probability from the observed distribution of nonmissing values.

    Code:
    sort province // missing values go to the end
    count if !missing(province) // r(N) is last nonmissing
    gen int randspot = ceil(runiform() * r(N))  
    replace province = province[randspot] if missing(province)
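
    As a rough check that this respects the existing frequencies: every nonmissing record is equally likely to be picked, so a province holding n_k of the N nonmissing records is drawn with probability n_k/N (for example, 5/20 = .25 for province 4 in the example data). The sketch below is only an illustration under assumptions of my own: the seed, the 20,000 extra records, and the variable name draw are arbitrary choices and not part of the original data.

    Code:
    clear
    input byte province
    1
    1
    1
    2
    2
    3
    3
    4
    4
    4
    4
    4
    5
    5
    6
    6
    7
    8
    9
    9
    end
    set seed 20240                      // arbitrary seed, only so the draws can be reproduced
    count                                // r(N) = 20 nonmissing records
    local n = r(N)
    set obs 20020                        // append 20,000 records with missing province
    gen int randspot = ceil(runiform() * `n')   // random row among the 20 observed records
    gen byte draw = province[randspot] if missing(province)
    tabulate draw                        // shares should be close to n_k/N

    With that many draws, the tabulated shares should sit near .15, .1, .1, .25, .1, .1, .05, .05, and .1, which is the freq column in #1.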



    • #3
      Thanks, Mike. I did not consider multiple imputation (MI) an option in this case, given the small proportion of missing values (around 0.5% of the sample), which were due to miscoding.
