Using the mode, and if no mode then most recent value in panel dataset

Emily Lowthian

Join Date: Aug 2023
Posts: 8

Using the mode, and if no mode then most recent value in panel dataset

08 Aug 2023, 12:23

Hi everyone,

I've been looking through forums but I can't seem to find an answer to my question.

I have a panel dataset that often has three waves (6, 7, 8). We need to collapse this to a single line per person, so we are needing to economise. Where people have been to multiple schools we are taking the mode

Code:

bysort id (wave): egen mode_school = mode(school_id)

However, there are some people that have been to three different schools, or two different schools with one line missing. In this instance, we would like to take the most recent wave. This must be somewhat straight forward but I cannot think of how to do this without replacing the existing data. We also need all three waves per person to be filled with the most recent school not only one line which is the solution I seem to have via

Code:

bysort id (wave): replace mode_school = school_id if mode_school ==. & wave==8

Could someone help me with this please? I've given an example of what I have (mode_school) and what I need (wanted_column)

ID	WAVE	SCHOOL_ID	MODE_SCHOOL	WANTED_COLUMN
1	6	12345	12345	12345
1	7	12345	12345	12345
1	8	12345	12345	12345
2	6	.	.	45678
2	7	34567	.	45678
2	8	45678	.	45678
3	6	.	.	.
3	7	.	.	.
3	8	.	.	.
4	6	12345	.	89078
4	7	54321	.	89078
4	8	89078	.	89078
5	6	45678	45678	45678
5	8	45678	45678	45678

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35777

08 Aug 2023, 12:52

I don't think I would use mode() at all, despite its zeroth author

net describe egenodd, from(http://www.stata.com/users/njc)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id wave) long(school_id mode_school wanted_column)
1 6 12345 12345 12345
1 7 12345 12345 12345
1 8 12345 12345 12345
2 6     .     . 45678
2 7 34567     . 45678
2 8 45678     . 45678
3 6     .     .     .
3 7     .     .     .
3 8     .     .     .
4 6 12345     . 89078
4 7 54321     . 89078
4 8 89078     . 89078
5 6 45678 45678 45678
5 8 45678 45678 45678
end

* first guess: use school_id if consistent 
bysort id (school_id) : gen wanted = school_id[1] if school_id[1] == school_id[_N]

* second guess: use last known school_id 
gen last = school_id
bysort id (wave) : replace last = last[_n-1] if missing(last)
bysort id (wave) : replace wanted = last[_N] if missing(wanted)
list, sepby(id)

     +-------------------------------------------------------------+
     | id   wave   school~d   mode_s~l   wanted~n   wanted    last |
     |-------------------------------------------------------------|
  1. |  1      6      12345      12345      12345    12345   12345 |
  2. |  1      7      12345      12345      12345    12345   12345 |
  3. |  1      8      12345      12345      12345    12345   12345 |
     |-------------------------------------------------------------|
  4. |  2      6          .          .      45678    45678       . |
  5. |  2      7      34567          .      45678    45678   34567 |
  6. |  2      8      45678          .      45678    45678   45678 |
     |-------------------------------------------------------------|
  7. |  3      6          .          .          .        .       . |
  8. |  3      7          .          .          .        .       . |
  9. |  3      8          .          .          .        .       . |
     |-------------------------------------------------------------|
 10. |  4      6      12345          .      89078    89078   12345 |
 11. |  4      7      54321          .      89078    89078   54321 |
 12. |  4      8      89078          .      89078    89078   89078 |
     |-------------------------------------------------------------|
 13. |  5      6      45678      45678      45678    45678   45678 |
 14. |  5      8      45678      45678      45678    45678   45678 |
     +-------------------------------------------------------------+

Comment

Emily Lowthian

Join Date: Aug 2023

Posts: 8
#3

08 Aug 2023, 12:57

I really appreciate the swift response, thank you. I will give this a go but looks exactly what we need in the project - for multiple variables! The egen mode is excellent, it just is tricky when there is no clear min/max!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#4

08 Aug 2023, 12:57

Code:

bysort id (wave): replace mode_school = school_id[_N] if missing(mode_school)

Added: Crossed with #2 and #3.
Comment
Emily Lowthian

Join Date: Aug 2023

Posts: 8
#5

08 Aug 2023, 13:43

Hi both,

Thanks for the swift response. I'm afraid neither solutions work.
With Nick's suggested code, I have tripled checked my code back and I have school ID's appearing which do not match any of the schools presented - i'm not sure where the schools are being pulled from but the wanted and last do not match at all.

Equally, Clyde's solution does not work when there are two different schools in wave 6&7, but one missing in wave 8, but it is OK if one is missing in wave 6 and there are two different schools in wave 7&8 (it takes wave 8). It does also work when all three schools are different.

I think the problem lies where wave 8 is missing, even if that is the most recent.

Could you suggest anything else?

Thanks in advance,

Emily
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30164

08 Aug 2023, 13:56

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(id wave) long(school_id mode_school wanted_column)
1 6 12345 12345 12345
1 7 12345 12345 12345
1 8 12345 12345 12345
2 6     .     . 45678
2 7 34567     . 45678
2 8 45678     . 45678
3 6     .     .     .
3 7     .     .     .
3 8     .     .     .
4 6 12345     . 89078
4 7 54321     . 89078
4 8 89078     . 89078
5 6 45678 45678 45678
5 8 45678 45678 45678
6 6 12345     . 56789
6 7 56789     . 56789
6 8     .     . 56789
end

by id (wave), sort: gen last_school = school_id if _n == 1
by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], ///
    school_id) if _n > 1
by id (wave): replace mode_school = last_school[_N] if missing(mode_school)

The originally posted example data didn't contain any instances where there were two different schools in waves 6 and 7 but missing in wave 8, so my testing did not catch the failure of my code in that use case. The above should do the trick. Note that I have extended the example with a new id 6 that illustrates the problematic use case.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35777
#7

08 Aug 2023, 17:26

I have school ID's appearing which do not match any of the schools presented

I don't think that should happen with my code, but to convince me otherwise, please show a data example and code for which this is true.

last and wanted in my code are all calculated from values of school_id for the same id.
Comment

Emily Lowthian

Join Date: Aug 2023
Posts: 8

09 Aug 2023, 11:46

Hi both,

I have tried and it looks like the code is incorrect for Clyde's. I'm afraid I can't use the real version as this is on a secure data haven and I'm trying to communicate the general problem but I wonder if my comments below will elucidate more why this is not working? Thanks again in advance for your help.

I should note:
The version I put above was a simplified version, but I'm wondering if the problem is caused by the data type.

The ID is actually 8, sometimes 9 digits and stored as
Code:
```
long %10.0g
```
Wave is stored as a float
Code:
```
%9.0g
```
The school_id is 11 digits. It is a concat of two long variables, which turned into a string (to be unique), and I destringed this back into a
Code:
```
double %20.0f
```
. See below:

Code:

gen school_id = trim(strofreal(school, "%10.0f")) + trim(strofreal(la, "%10.0f"))

replace school_id = "." if school_id == ".."
destring school_id, gen(school_id1)
drop school_id
rename school_id1 school_id
format school_pe %20.0f

So it does look more like this (with errors):

ID	Wave	School	LA	School_ID	mode(egen)	last_school (Clyde)	Wanted(Nick)	Last(Nick)
1	6	11223344	123	11223344123	11223344123	11223344100	11223344100	11223344100
1	7	11223344	123	11223344123	11223344123	11223344100	11223344100	11223344100
1	8	11223344	123	11223344123	11223344123	11223344100	11223344100	11223344100

2	6	44556677	123	44556677123	44556677123	44556677102	.	44556677102
2	7	.	.	.	44556677123	44556677102	.	.
2	8	.	.	.	44556677123	44556677102	.

3	6	12345678	100	12345678100	34582913300	3456212984	.	3456291284
3	7	12345678	200	12345678200	34582913300	3456292984	.	3456292984
3	8	12345678	300	12345678300	34582913300	3458291330	.	3458291330

At the moment when I use both codes they seem wrong. Even when all schools are the same, the last 3 digits change. When they are completely different schools I'm getting completely random numbers. Generally speaking I can't see a pattern, but where schools are the same and there is a mode the first 8 digits are the same and the last 3 change. I wonder if this has anything to do with me concat'ing the variables. Sorry it is so difficult to comprehend - i've tried to include some basic examples so you can get the jist of what's going on here.

Kind regards,

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#9

09 Aug 2023, 12:06

This apparently random problem in the low order digits is a precision issue. While you say your school id's are 8 or 9 digits, in the examples you show are actually 11 digits long. That is way too big to store in a long, and doing so rounds the low order digits to whatever is the nearest number that fits in that space in binary. You need to change all variables that hold, or potentially hold, these school id's to doubles. Then everything should work properly for both Nick's solution and mine. For example, my solution would now look like:

Code:

by id (wave), sort: gen double last_school = school_id if _n == 1 by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], /// school_id) if _n > 1 by id (wave): replace mode_school = last_school[_N] if missing(mode_school)

(And the code that calculates mode_school, which preceded my solution, also should be -egen doublemode_school = mode(school_id)-.

I understand the problem that you cannot share the real data. For most problems on Statalist, the real data are not needed: what counts is the structure of the data. That structure includes data storage types. In the example you showed in #1, you oversimplified the problem by showing 5-digit school ID numbers. Admittedly this is a subtle point, and it would not necessarily be obvious at the time you are posing the question that the difference between 5 digits and 11 digits would be relevant. So it is understandable that we all went down this path. But, as a general rule, when showing made-up data, it is important to make it "look like" the real thing as best you can.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35777
#10

09 Aug 2023, 12:17

That is a precision problem. You have much longer identifiers than previously implied, which is fine, except that my code and Clyde's need to be amended to produce new variables using double as a storage type. Note that the problem is the same in both cases, that you are getting numbers that are close to, but typically not equal to, your identifiers. To hold very large integers exactly the default of float is not capacious enough,
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#11

09 Aug 2023, 12:32

This apparently random problem in the low order digits is a precision issue. While you say your school id's are 8 or 9 digits, in the examples you show are actually 11 digits long. That is way too big to store in a long, and doing so rounds the low order digits to whatever is the nearest number that fits in that space in binary. You need to change all variables that hold, or potentially hold, these school id's to doubles. Then everything should work properly for both Nick's solution and mine. For example, my solution would now look like:

Code:

by id (wave), sort: gen double last_school = school_id if _n == 1 by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], /// school_id) if _n > 1 by id (wave): replace mode_school = last_school[_N] if missing(mode_school)

(And the code that calculates mode_school, which preceded my solution, also should be -egen doublemode_school = mode(school_id)-.

I understand the problem that you cannot share the real data. For most problems on Statalist, the real data are not needed: what counts is the structure of the data. That structure includes data storage types. In the example you showed in #1, you oversimplified the problem by showing 5-digit school ID numbers. Admittedly this is a subtle point, and it would not necessarily be obvious at the time you are posing the question that the difference between 5 digits and 11 digits would be relevant. So it is understandable that we all went down this path. But, as a general rule, when showing made-up data, it is important to make it "look like" the real thing as best you can.
Comment
Emily Lowthian

Join Date: Aug 2023

Posts: 8
#12

14 Aug 2023, 02:58

Dear Nick, Clyde,

Yes my apologies. When I saw the last three digits were the issue I knew I should have posted the full length problem - noted for future here.

I've implemented Clyde's code and it works perfectly. I imagine Nick's would too with some amendments. Very helpful as I need to apply this code to many variables.

Very much appreciate both of your time.

Emily
Comment

Announcement

Using the mode, and if no mode then most recent value in panel dataset

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment