Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using the mode, and if no mode then most recent value in panel dataset

    Hi everyone,

    I've been looking through forums but I can't seem to find an answer to my question.

    I have a panel dataset that often has three waves (6, 7, 8). We need to collapse this to a single line per person, so we are needing to economise. Where people have been to multiple schools we are taking the mode
    Code:
    bysort id (wave): egen mode_school = mode(school_id)
    However, there are some people that have been to three different schools, or two different schools with one line missing. In this instance, we would like to take the most recent wave. This must be somewhat straight forward but I cannot think of how to do this without replacing the existing data. We also need all three waves per person to be filled with the most recent school not only one line which is the solution I seem to have via
    Code:
    bysort id (wave): replace mode_school = school_id if mode_school ==. & wave==8
    Could someone help me with this please? I've given an example of what I have (mode_school) and what I need (wanted_column)

    ID WAVE SCHOOL_ID MODE_SCHOOL WANTED_COLUMN
    1 6 12345 12345 12345
    1 7 12345 12345 12345
    1 8 12345 12345 12345
    2 6 . . 45678
    2 7 34567 . 45678
    2 8 45678 . 45678
    3 6 . . .
    3 7 . . .
    3 8 . . .
    4 6 12345 . 89078
    4 7 54321 . 89078
    4 8 89078 . 89078
    5 6 45678 45678 45678
    5 8 45678 45678 45678


  • #2
    I don't think I would use mode() at all, despite its zeroth author

    net describe egenodd, from(http://www.stata.com/users/njc)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input byte(id wave) long(school_id mode_school wanted_column)
    1 6 12345 12345 12345
    1 7 12345 12345 12345
    1 8 12345 12345 12345
    2 6     .     . 45678
    2 7 34567     . 45678
    2 8 45678     . 45678
    3 6     .     .     .
    3 7     .     .     .
    3 8     .     .     .
    4 6 12345     . 89078
    4 7 54321     . 89078
    4 8 89078     . 89078
    5 6 45678 45678 45678
    5 8 45678 45678 45678
    end
    
    * first guess: use school_id if consistent 
    bysort id (school_id) : gen wanted = school_id[1] if school_id[1] == school_id[_N]
    
    * second guess: use last known school_id 
    gen last = school_id
    bysort id (wave) : replace last = last[_n-1] if missing(last)
    bysort id (wave) : replace wanted = last[_N] if missing(wanted)
    list, sepby(id)
    
         +-------------------------------------------------------------+
         | id   wave   school~d   mode_s~l   wanted~n   wanted    last |
         |-------------------------------------------------------------|
      1. |  1      6      12345      12345      12345    12345   12345 |
      2. |  1      7      12345      12345      12345    12345   12345 |
      3. |  1      8      12345      12345      12345    12345   12345 |
         |-------------------------------------------------------------|
      4. |  2      6          .          .      45678    45678       . |
      5. |  2      7      34567          .      45678    45678   34567 |
      6. |  2      8      45678          .      45678    45678   45678 |
         |-------------------------------------------------------------|
      7. |  3      6          .          .          .        .       . |
      8. |  3      7          .          .          .        .       . |
      9. |  3      8          .          .          .        .       . |
         |-------------------------------------------------------------|
     10. |  4      6      12345          .      89078    89078   12345 |
     11. |  4      7      54321          .      89078    89078   54321 |
     12. |  4      8      89078          .      89078    89078   89078 |
         |-------------------------------------------------------------|
     13. |  5      6      45678      45678      45678    45678   45678 |
     14. |  5      8      45678      45678      45678    45678   45678 |
         +-------------------------------------------------------------+

    Comment


    • #3
      I really appreciate the swift response, thank you. I will give this a go but looks exactly what we need in the project - for multiple variables! The egen mode is excellent, it just is tricky when there is no clear min/max!

      Comment


      • #4
        Code:
        bysort id (wave): replace mode_school = school_id[_N] if missing(mode_school)
        Added: Crossed with #2 and #3.

        Comment


        • #5
          Hi both,

          Thanks for the swift response. I'm afraid neither solutions work.
          • With Nick's suggested code, I have tripled checked my code back and I have school ID's appearing which do not match any of the schools presented - i'm not sure where the schools are being pulled from but the wanted and last do not match at all.
          • Equally, Clyde's solution does not work when there are two different schools in wave 6&7, but one missing in wave 8, but it is OK if one is missing in wave 6 and there are two different schools in wave 7&8 (it takes wave 8). It does also work when all three schools are different.
          • I think the problem lies where wave 8 is missing, even if that is the most recent.
          Could you suggest anything else?

          Thanks in advance,

          Emily

          Comment


          • #6
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input byte(id wave) long(school_id mode_school wanted_column)
            1 6 12345 12345 12345
            1 7 12345 12345 12345
            1 8 12345 12345 12345
            2 6     .     . 45678
            2 7 34567     . 45678
            2 8 45678     . 45678
            3 6     .     .     .
            3 7     .     .     .
            3 8     .     .     .
            4 6 12345     . 89078
            4 7 54321     . 89078
            4 8 89078     . 89078
            5 6 45678 45678 45678
            5 8 45678 45678 45678
            6 6 12345     . 56789
            6 7 56789     . 56789
            6 8     .     . 56789
            end
            
            by id (wave), sort: gen last_school = school_id if _n == 1
            by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], ///
                school_id) if _n > 1
            by id (wave): replace mode_school = last_school[_N] if missing(mode_school)
            The originally posted example data didn't contain any instances where there were two different schools in waves 6 and 7 but missing in wave 8, so my testing did not catch the failure of my code in that use case. The above should do the trick. Note that I have extended the example with a new id 6 that illustrates the problematic use case.

            Comment


            • #7
              I have school ID's appearing which do not match any of the schools presented
              I don't think that should happen with my code, but to convince me otherwise, please show a data example and code for which this is true.

              last and wanted in my code are all calculated from values of school_id for the same id.

              Comment


              • #8
                Hi both,

                I have tried and it looks like the code is incorrect for Clyde's. I'm afraid I can't use the real version as this is on a secure data haven and I'm trying to communicate the general problem but I wonder if my comments below will elucidate more why this is not working? Thanks again in advance for your help.


                I should note:
                The version I put above was a simplified version, but I'm wondering if the problem is caused by the data type.
                • The ID is actually 8, sometimes 9 digits and stored as
                  Code:
                  long %10.0g
                • Wave is stored as a float
                  Code:
                  %9.0g
                • The school_id is 11 digits. It is a concat of two long variables, which turned into a string (to be unique), and I destringed this back into a
                  Code:
                  double %20.0f
                  . See below:
                Code:
                gen school_id = trim(strofreal(school, "%10.0f")) + trim(strofreal(la, "%10.0f"))
                
                replace school_id = "." if school_id == ".."
                destring school_id, gen(school_id1)
                drop school_id
                rename school_id1 school_id
                format school_pe %20.0f

                So it does look more like this (with errors):
                ID Wave School LA School_ID mode(egen) last_school (Clyde) Wanted(Nick) Last(Nick)
                1 6 11223344 123 11223344123 11223344123 11223344100 11223344100 11223344100
                1 7 11223344 123 11223344123 11223344123 11223344100 11223344100 11223344100
                1 8 11223344 123 11223344123 11223344123 11223344100 11223344100 11223344100
                2 6 44556677 123 44556677123 44556677123 44556677102 . 44556677102
                2 7 . . . 44556677123 44556677102 . .
                2 8 . . . 44556677123 44556677102 .
                3 6 12345678 100 12345678100 34582913300 3456212984 . 3456291284
                3 7 12345678 200 12345678200 34582913300 3456292984 . 3456292984
                3 8 12345678 300 12345678300 34582913300 3458291330 . 3458291330

                At the moment when I use both codes they seem wrong. Even when all schools are the same, the last 3 digits change. When they are completely different schools I'm getting completely random numbers. Generally speaking I can't see a pattern, but where schools are the same and there is a mode the first 8 digits are the same and the last 3 change. I wonder if this has anything to do with me concat'ing the variables. Sorry it is so difficult to comprehend - i've tried to include some basic examples so you can get the jist of what's going on here.

                Kind regards,

                Comment


                • #9
                  This apparently random problem in the low order digits is a precision issue. While you say your school id's are 8 or 9 digits, in the examples you show are actually 11 digits long. That is way too big to store in a long, and doing so rounds the low order digits to whatever is the nearest number that fits in that space in binary. You need to change all variables that hold, or potentially hold, these school id's to doubles. Then everything should work properly for both Nick's solution and mine. For example, my solution would now look like:

                  Code:
                  by id (wave), sort: gen double last_school = school_id if _n == 1
                  by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], ///
                      school_id) if _n > 1
                  by id (wave): replace mode_school = last_school[_N] if missing(mode_school)
                  (And the code that calculates mode_school, which preceded my solution, also should be -egen doublemode_school = mode(school_id)-.

                  I understand the problem that you cannot share the real data. For most problems on Statalist, the real data are not needed: what counts is the structure of the data. That structure includes data storage types. In the example you showed in #1, you oversimplified the problem by showing 5-digit school ID numbers. Admittedly this is a subtle point, and it would not necessarily be obvious at the time you are posing the question that the difference between 5 digits and 11 digits would be relevant. So it is understandable that we all went down this path. But, as a general rule, when showing made-up data, it is important to make it "look like" the real thing as best you can.

                  Comment


                  • #10
                    That is a precision problem. You have much longer identifiers than previously implied, which is fine, except that my code and Clyde's need to be amended to produce new variables using double as a storage type. Note that the problem is the same in both cases, that you are getting numbers that are close to, but typically not equal to, your identifiers. To hold very large integers exactly the default of float is not capacious enough,

                    Comment


                    • #11
                      This apparently random problem in the low order digits is a precision issue. While you say your school id's are 8 or 9 digits, in the examples you show are actually 11 digits long. That is way too big to store in a long, and doing so rounds the low order digits to whatever is the nearest number that fits in that space in binary. You need to change all variables that hold, or potentially hold, these school id's to doubles. Then everything should work properly for both Nick's solution and mine. For example, my solution would now look like:

                      Code:
                      by id (wave), sort: gen double last_school = school_id if _n == 1
                      by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], ///
                          school_id) if _n > 1
                      by id (wave): replace mode_school = last_school[_N] if missing(mode_school)
                      (And the code that calculates mode_school, which preceded my solution, also should be -egen doublemode_school = mode(school_id)-.

                      I understand the problem that you cannot share the real data. For most problems on Statalist, the real data are not needed: what counts is the structure of the data. That structure includes data storage types. In the example you showed in #1, you oversimplified the problem by showing 5-digit school ID numbers. Admittedly this is a subtle point, and it would not necessarily be obvious at the time you are posing the question that the difference between 5 digits and 11 digits would be relevant. So it is understandable that we all went down this path. But, as a general rule, when showing made-up data, it is important to make it "look like" the real thing as best you can.

                      Comment


                      • #12
                        Dear Nick, Clyde,

                        Yes my apologies. When I saw the last three digits were the issue I knew I should have posted the full length problem - noted for future here.

                        I've implemented Clyde's code and it works perfectly. I imagine Nick's would too with some amendments. Very helpful as I need to apply this code to many variables.

                        Very much appreciate both of your time.

                        Emily

                        Comment

                        Working...
                        X