Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Filling missing rows by groups for string and numeric

    I have a data file where there are several rows per person (ID) and both string and numeric variables with only one value per person. I would like to fill the missing rows by person ID.
    There are several similar topics on this forum which address this situation however, as far as I am aware, none address the case when there is no entry for one ID.

    A simplified version of the data are as follows:

    Code:
    clear
    input ID str8 varstr varnum
    1    "A"     .
    1   "" 4.2
    1   "" .
    1   "" .
    1   "" .
    2   "" .
    2   "G" .
    2   "" .
    2   "" .
    2   "" .
    3   "" 7.6
    3   "" .
    3   "" .
    3   "" .
    3   "" .
    end
    
    list, sepby(ID)
    Code:
         +----------------------+
         | ID   varstr   varnum |
         |----------------------|
      1. |  1        A        . |
      2. |  1               4.2 |
      3. |  1                 . |
      4. |  1                 . |
      5. |  1                 . |
         |----------------------|
      6. |  2                 . |
      7. |  2        G        . |
      8. |  2                 . |
      9. |  2                 . |
     10. |  2                 . |
         |----------------------|
     11. |  3               7.6 |
     12. |  3                 . |
     13. |  3                 . |
     14. |  3                 . |
     15. |  3                 . |
         +----------------------+
    As you can see, the single value for each ID is not always in the first row, ID 3 has no varstr value and ID 2 has no varnum value. When filling, these should be kept as missing.

    I have tried
    Code:
    bysort ID (varstr) : replace varstr= varstr[_N]
    however this only works when all IDs have exactly one non-missing value.

    Advice much appreciated.

  • #2
    What rule do you want to follow if there are two or more non-missing values?

    Comment


    • #3
      There is always a maximum of one non-missing value per ID (this has been checked).

      Comment


      • #4
        I am not clear what is puzzling you then. This works for your example to the extent that seems possible.

        Code:
        clear
        input ID str8 varstr varnum
        1    "A"     .
        1   "" 4.2
        1   "" .
        1   "" .
        1   "" .
        2   "" .
        2   "G" .
        2   "" .
        2   "" .
        2   "" .
        3   "" 7.6
        3   "" .
        3   "" .
        3   "" .
        3   "" .
        end
        bysort ID (varnum) : replace varnum = varnum[1] if missing(varnum) 
        bysort ID (varstr) : replace varstr = varstr[_N] if missing(varstr) 
        list, sepby(ID) 
        
             +----------------------+
             | ID   varstr   varnum |
             |----------------------|
          1. |  1        A      4.2 |
          2. |  1        A      4.2 |
          3. |  1        A      4.2 |
          4. |  1        A      4.2 |
          5. |  1        A      4.2 |
             |----------------------|
          6. |  2        G        . |
          7. |  2        G        . |
          8. |  2        G        . |
          9. |  2        G        . |
         10. |  2        G        . |
             |----------------------|
         11. |  3               7.6 |
         12. |  3               7.6 |
         13. |  3               7.6 |
         14. |  3               7.6 |
         15. |  3               7.6 |
             +----------------------+

        Comment


        • #5
          Yes you're right - it does work. Thank you!
          I'm not sure what I was doing before, my apologies.

          Comment


          • #6
            The number of distinct values in each group is for any variable is

            Code:
            egen tag = tag(whatevervar id)
            egen nvalues = total(tag), by(id)

            Comment

            Working...
            X