Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing missing values with median (PANEL DATA)

    Hi, Statalist.

    I'm working on panel dataset where I want to replace the missing values in my columns with the median.
    The missing values should be replaced with the median on panel level .
    Any suggestions?


  • #2
    Hi Ole,

    Without being specific in your data I can only provide a general suggestion: (1) create the median for your variable (column) and then (2) replace missing values with this median.

    You say you want to do this on panel data, so presumably you want blank values over time to be replaced by the individual's median. Let's say that your time is year and individual is id, I would imagine something like the following should do the trick:

    Code:
    bysort id (year): egen var_median = pctile(var, p(50))
    bysort id (year): replace var = var_median if var==.
    Best,
    Rhys

    Comment


    • #3
      Many thanks, Williams. I do have about 150 variables where there is missing values. Is there a code to replace the missing values instead of generating a new variable for each of those?

      Comment


      • #4
        Ole:
        Rhys's helpful reply shows you how to do the trick.
        From a different perspective, replacing missing value with any location (or whatever) measure of the observed data is, in general, a very bad idea, as explained in https://www.lshtm.ac.uk/research/cen...s/missing-data as well as in many references that you can find in the -mi. section of Stata .pdf manual.
        Kind regards,
        Carlo
        (Stata 19.0)

        Comment


        • #5
          Originally posted by ole karlsen View Post
          Many thanks, Williams. I do have about 150 variables where there is missing values. Is there a code to replace the missing values instead of generating a new variable for each of those?
          Hi Ole,

          Notwithstanding Carlo's important point about replacing missing values with location measures, if you still wish to proceed then I recommend you use some kind of loop to obtain this, check out how to use foreach var of varlist (https://www.stata.com/manuals/pforeach.pdf), you can then always generate a "temp" var which you delete once the loop has replaced the median values for that variable.

          Best,
          Rhys

          Comment


          • #6
            Thanks, Carlo. I`m trying to predict firm growth and I'm going to use automated machine learning to predict this. The tool i'm going to use is TPOT (Tree-Based Pipeline Optimization Tool). TPOT will impute the missing values by implacing the median on column level. I think its a better idea to replace on panel level.

            Comment


            • #7
              Ole:
              I'm not familiar with automated machine learning, hence I cannot comment on this approach.
              My previous point was methodological: how can a researcher safely impute missing data without performing a previous diagnosis on their missingness mechanism(s) (missing completely at random; missing at random; missing non at random)?
              Kind regards,
              Carlo
              (Stata 19.0)

              Comment


              • #8

                This code will work.

                foreach v of var "variables" {
                bysort id (year): egen work = pctile(`v'), p(50)
                replace `v' = work if missing(`v')
                drop work
                }

                Comment

                Working...
                X