Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using Mata to fill in missing entries

    Let's say I have a dataset containing a thousand variables that have only one nonmissing entry per id. Up till now, I have used the following Stata code to fill in the missing entries, which are set to the variable's unique observation (by id).

    Code:
    forvalues i = 1/1000 {
        sort id var_`i'
        by id (var_`i'): gen newvar_`i' = var_`i'[_N]
    }
    Unfortunately, this procedure is quite slow, especially when (very) large datasets are employed. I was wondering whether it would make sense to rewrite this snippet in Mata; would there be a significant gain in speed?

  • #2
    I'm a bit confused by what your code does, you create a new variable equal to the largest value of a particular variable? Also, which part of the loop takes the longest? The sort or the generate command? You can time this using either the standard timer on/off functionality, or through -timeit- from ssc (e.g. timeit 1: sort id var_`i').

    Comment


    • #3
      Welcome to Statalist, Sebastian.

      Like Jesse, I don't understand how your original example works. Have you neglected to tell us that your variables are all strings? Because for numeric variables, missing values sort high, and the only nonmissing value will be var[1]. But for strings, null strings sort low and the only nonmissing value will indeed be in var[_N].

      Please take the tie to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. It would be particularly helpful to post a small hand-made example, perhaps with just a few variables and observations, showing the data before the process and how you expect it to look after the process. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

      Added in edit: it is not clear to me that this problem at all needs Mata. You might consider posting an expanded explanation of your problem, following the guidance in the FAQ, to the General Forum where it will get more attention.

      Comment


      • #4
        The typical "trick" in Stata to be able to target a specific observation using explicit subscripting is to use an observation index variable (whose value is the same as _n) and use summarize. Something like:
        Code:
        sysuse auto, clear
        gen long obs = _n
        sum obs if make == "Ford Fiesta", meanonly
        dis price[`r(min)']
        Of course this only works if what you are looking for occurs only once in the data. Anyway, this technique won't work here because the observation must be located within id groups. You can adapt the technique to work by groups, and the fastest solution will be to use a running sum to carry forward the observation index where the condition is true to the last observation within the id group.

        Code:
        * fake data, 3 id, 10 variables, obs between 500 and 1000
        clear
        set seed 13243
        set obs 3
        gen id = _n
        gen N = runiformint(500, 1000)
        expand N
        
        forvalues i=1/10 {
            bysort id: gen pick1 = runiformint(1, N)
            qui by id: gen v`i' = runiform() if pick1[1] == _n
            drop pick1
        }
        
        bysort id: gen long obs = _n
        by id: sum
        
        
        forvalues i=1/10 {
            by id: gen pick = sum(obs / !mi(v`i'))
            by id: gen vfill`i' = v`i'[pick[_N]]
            drop pick
        }

        Comment


        • #5
          Thank you for your answers. I forgot to mention that before I enter the loop I replace all missing values by zeros (these thousand variables are positive by construction). But as you suggest, this would be an equivalent but more efficient way:

          Code:
          forvalues i = 1/1000 {
              sort id var_`i'
              by id (var_`i'): gen newvar_`i' = var_`i'[1]
          }
          Thanks for your elaborate example, Robert. I will try to implement this later today and will post some benchmarking results.

          Comment


          • #6
            A quick follow-up: In my application (and using Stata 13 MP4) Robert's solution is about 5 times faster than what I proposed. Many thanks again!

            Comment

            Working...
            X