Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fill up values by first non-missing in group

    I would like to fill up values for a variable, say number, with the first (and only) non-missing number in the same group (captured by the group identifier id) such that

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id double number
    1 23
    1  .
    1  .
    2 12
    2  .
    3  5
    3  .
    3  .
    end
    will be completed to

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id double number
    1 23
    1 23
    1 23
    2 12
    2 12
    3  5
    3  5
    3  5
    end
    I do realize that this question has been asked and answered before. I have found some useful initial help here and here. I ended up with three different versions of code:

    Code:
    bysort id : replace number = number[_n-1] if missing(number) & _n > 1    
    bysort id : replace number = number[1]
    bysort id (number): replace number=number[_N]
    However, neither of these versions would produce the result that I am looking for (see table above). For some ids (say id=1 and id=3) it worked, for others it did not (I got only missing number values for id=2 despite the fact that the number was nonmissing for id=2 in the original data).

    Do you have some other suggestions I could try? What could be possible potential sources of error? What additional information could I look for / provide to make the error hunting feasible?

    Many thanks,
    Milan

  • #2
    You cite the FAQ http://www.stata.com/support/faqs/da...issing-values/ but what you say makes limited sense otherwise.

    Let's peel off first

    Code:
     bysort id (number): replace number=number[_N]
    That would sort the highest value observed in each block (which could be missing) to the end of each block and use that to overwrite all values in the same block, regardless of whether values were missing. That's not a good idea at all. It's not a solution for your stated problem.

    Code:
     bysort id: replace number=number[1]    
     bysort id : replace number = number[_n-1] if missing(number) & _n > 1
    are both solutions for your stated problem.

    You seem to be implying that they don't work for your example, but that's not so.

    Code:
    . clear
    
    . input byte id double number
    
               id      number
      1. 1 23
      2. 1  .
      3. 1  .
      4. 2 12
      5. 2  .
      6. 3  5
      7. 3  .
      8. 3  .
      9. end
    
    .
    . bysort id : replace number = number[_n-1] if missing(number) & _n > 1  
    (5 real changes made)
    
    . l, sepby(id)
    
         +-------------+
         | id   number |
         |-------------|
      1. |  1       23 |
      2. |  1       23 |
      3. |  1       23 |
         |-------------|
      4. |  2       12 |
      5. |  2       12 |
         |-------------|
      6. |  3        5 |
      7. |  3        5 |
      8. |  3        5 |
         +-------------+
    I can't see any reason to doubt a standard method for what it fairly claims to do, copy downwards. Naturally, that method won't copy upwards as well, but that's documented. See also mipolate (SSC).

    Comment


    • #3
      Sorry, I should have clarified:

      The above code is only pseudo-code that I wrote to generalise the problem. Should I upload some actual observations? I have tried the first version because it was recommended in the other linked thread. As to the other two versions, for my data they do not work properly.

      Thank you for mentioning mipolate (SSC). However, it would not work in my case because my id is a string variable.

      Copy downwards (with carryforward (SSC)) would not work because sometimes I have all values missing and


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id double number
      1 23
      1  .
      1  .
      2  .
      2  .
      3  5
      3  .
      3  .
      end
      would become

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id double number
      1 23
      1 23
      1 23
      2 23
      2 23
      3  5
      3  5
      3  5
      end
      (Probably should have included this case in the pseudo code.)

      Thank you for the help and sorry for all possible confusion.
      Last edited by Milan Quentel; 09 Jan 2017, 07:20.

      Comment


      • #4
        So, you're now claiming that the problem is not the examples you showed --- but the examples you didn't show us. Do show us at least one you don't understand.

        Also, it seems that your identifier is string. But that's irrelevant to the techniques mentioned here or used by mipolate (SSC). Applying within a framework by stringid: (or whatever) will work. Did you try it?

        I am not responsible for carryforward (SSC), but you would certainly get nonsense results if you ignored the identifiers. That makes no sense. Also, no program whatsoever can validly interpolate from nothing else whatsoever. Your example for id = 2 shows no information.

        Incidentally the code you attribute to http://www.stata.com/statalist/archi.../msg01239.html is there used for a completely different purpose.
        Last edited by Nick Cox; 09 Jan 2017, 07:45.

        Comment


        • #5
          So here is an example from my data where
          Code:
          bysort company : replace turnover = turnover[_n-1] if missing(turnover) & _n > 1
          does not work. Note that company is the id (in string format) and turnover is the number to be replicated.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str144 company double turnover str10 DMC_id
          "2016 GCG LIMITED" 1968878.5682201385 "P113682000"
          "2016 GCG LIMITED"                  . "P113682000"
          "2016 GCG LIMITED"                  . "P012939285"
          "2016 GCG LIMITED"                  . "P012939285"
          "2016 GCG LIMITED"                  . "P097765326"
          "2016 GCG LIMITED"                  . "P355973038"
          "2016 GCG LIMITED"                  . "P048664390"
          end
          is completed as

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str144 company double turnover str10 DMC_id
          "2016 GCG LIMITED"                  . "P113682000"
          "2016 GCG LIMITED"                  . "P012939285"
          "2016 GCG LIMITED"                  . "P097765326"
          "2016 GCG LIMITED" 1968878.5682201385 "P113682000"
          "2016 GCG LIMITED" 1968878.5682201385 "P048664390"
          "2016 GCG LIMITED" 1968878.5682201385 "P012939285"
          "2016 GCG LIMITED" 1968878.5682201385 "P355973038"
          end
          Have you / has anyone seen something like that before? What could be potential reasons?


          Thank you also, Nick Cox, for your suggestion to use mipolate together with by. I tried
          Code:
          bysort company : mipolate company turnover, gen(ly1)
          which fails because company is a string variable and
          Code:
          bysort company : mipolate turnover, gen(ly1)
          which fails because I need to specify at least two variables.

          I also tried encoding my string id variable and then using mipolate but that, too, produced rather weird results.

          Could you please explain how mipolate could be used together with the by option?

          Thank you for all your help.

          Comment


          • #6
            Your example works for me:

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str144 company double turnover str10 DMC_id
            "2016 GCG LIMITED" 1968878.5682201385 "P113682000"
            "2016 GCG LIMITED"                  . "P113682000"
            "2016 GCG LIMITED"                  . "P012939285"
            "2016 GCG LIMITED"                  . "P012939285"
            "2016 GCG LIMITED"                  . "P097765326"
            "2016 GCG LIMITED"                  . "P355973038"
            "2016 GCG LIMITED"                  . "P048664390"
            end
            
            bysort company : replace turnover = turnover[_n-1] if missing(turnover) 
            
            list 
            
                 +-------------------------------------------+
                 |          company    turnover       DMC_id |
                 |-------------------------------------------|
              1. | 2016 GCG LIMITED   1968878.6   P113682000 |
              2. | 2016 GCG LIMITED   1968878.6   P113682000 |
              3. | 2016 GCG LIMITED   1968878.6   P012939285 |
              4. | 2016 GCG LIMITED   1968878.6   P012939285 |
              5. | 2016 GCG LIMITED   1968878.6   P097765326 |
                 |-------------------------------------------|
              6. | 2016 GCG LIMITED   1968878.6   P355973038 |
              7. | 2016 GCG LIMITED   1968878.6   P048664390 |
                 +-------------------------------------------+
            Your attempts at using mipolate both fail because you don't meet its requirements. The variable named after mipolate is that you are interpolating and that must be numeric and can't be the identifier you use to define groups. You need a sequence variable even if it is artificial.

            This would work, for example,

            Code:
             
             * Example generated by -dataex-. To install: ssc install dataex clear input str144 company double turnover str10 DMC_id "2016 GCG LIMITED" 1968878.5682201385 "P113682000" "2016 GCG LIMITED"                  . "P113682000" "2016 GCG LIMITED"                  . "P012939285" "2016 GCG LIMITED"                  . "P012939285" "2016 GCG LIMITED"                  . "P097765326" "2016 GCG LIMITED"                  . "P355973038" "2016 GCG LIMITED"                  . "P048664390" end bysort company : gen seq = _n 
            
            by company: mipolate turnover seq, gen(turnover2) forward 
            
            list 
            
                 +-------------------------------------------------------------+
                 |          company    turnover       DMC_id   seq   turnover2 |
                 |-------------------------------------------------------------|
              1. | 2016 GCG LIMITED   1968878.6   P113682000     1   1968878.6 |
              2. | 2016 GCG LIMITED           .   P113682000     2   1968878.6 |
              3. | 2016 GCG LIMITED           .   P012939285     3   1968878.6 |
              4. | 2016 GCG LIMITED           .   P012939285     4   1968878.6 |
              5. | 2016 GCG LIMITED           .   P097765326     5   1968878.6 |
                 |-------------------------------------------------------------|
              6. | 2016 GCG LIMITED           .   P355973038     6   1968878.6 |
              7. | 2016 GCG LIMITED           .   P048664390     7   1968878.6 |
                 +-------------------------------------------------------------+




            Comment


            • #7
              I have no idea why the example works if taken on its own but doesn't work within my larger dataset.

              The idea with the sequence variable, bysort and mipolate, however, works perfectly fine. Thank you very much!

              Comment


              • #8
                Test for equality of strings that you think should be equal. Small differences of spelling or punctuation or hidden characters are easily fatal.

                Comment


                • #9
                  Thank you for the advice. Fortunately, I can guarantee that the identifying string is equal because of the way it was constructed.

                  One more question: What happens if the variable to be replicated (in my example turnover) is a string variable? Is there any way to deal with such a situation within the mipolate command?

                  Comment


                  • #10
                    There is a related solution, already documented.

                    http://www.statalist.org/forums/foru...-in-panel-data

                    See http://www.statalist.org/forums/foru...-interpolation for stripolate

                    Comment


                    • #11
                      Thank you, very much. Stripolate works perfectly.

                      Comment


                      • #12
                        Hello,

                        I am working with an unbalanced panel dataset and I used the -fillin- command to create a balanced version of it:

                        fillin timevar id

                        Now I need to replace the missing values (the only ones are those of _fillin==1), but after trying the following, I get the "different non-missing values of variable x" error message (r498):

                        local varlist Exporter Importer Comm distcap contig colony Exp_gdp Exp_pop Imp_gdp Imp_pop ldlock
                        foreach var in `varlist' {
                        mipolate `var' id, gen(n`var') groupwise
                        }


                        Each id value is unique for each Exporter-Importer-Comm possible combination. The remaining variables are dyadic, so they have unique values for each Exporter-Importer possibility, regardless of Commodity or timevar.
                        I haven“t used -mipolate- command before, so any comments would be really helpful.

                        Thanks!

                        Comment


                        • #13
                          See the help for mipolate (SSC):


                          groupwise specifies that non-missing values be copied to
                          missing values if, and only if, just one distinct
                          non-missing value occurs in each group. Thus a group of
                          values ., 42, ., . qualifies as 42 is not missing and is
                          the only non-missing value in the group. Hence the missing
                          values in the group will be replaced with 42 in the new
                          variable. By the same rules 42, ., 42, . qualifies but 42,
                          ., 43, . does not. Normally, but not necessarily, this
                          option is used in conjunction with by:, which is how groups
                          are specified; otherwise the (single) group is the entire
                          set of observations being used.
                          So, your data don't satisfy the criterion for this kind of interpolation.

                          What's most obvious is that you missed out a by() option so you are telling mipolate that there is just one distinct non-missing value for each variable in the entire dataset. That will typically be quite wrong.

                          I imagine that you need

                          Code:
                          by(Exporter Importer Comm)


                          Comment


                          • #14
                            Thank you so much!

                            Comment


                            • #15
                              The code that finally worked for my dataset is:
                              foreach var in `varlist' {
                              bysort id: mipolate `var' id, gen(n`var') nearest
                              replace `var'=n`var' if `var'==.
                              drop n`var'
                              }
                              Thank you again

                              Comment

                              Working...
                              X