Using the mode, and if no mode then most recent value in panel dataset - Continuation of Emily Lowthian's Thread

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#1

09 Aug 2023, 13:07

Something has gone awry with the Forum software in the above-captioned thread. It is apparently not possible to make additional posts to it. So I am resuming her thread here, with a response to her #8 at https://www.statalist.org/forums/for...-panel-dataset.

Emily has in fact identified the problem herself. It is a precision problem. The school ID variables she shows in her new (made-up) data example are 11 digits long. That is way too long to fit into either a float or a long. These variables need to be doubles. The "random" changes in the low-order digits she is getting arise from the fact that the code Nick and I have offered creates intermediate variables containing school IDs, and those variables are created as floats (the default, since no type was specified). Since these numbers are too large to be held exactly in a float, they are instead stored as the nearest binary value that will fit. The solution is to make sure that all of the variables are created as doubles (which hold integers exactly up to 2^53, roughly 16 decimal digits).

Code:

egen double mode_school = mode(school_id)
by id (wave), sort: gen double last_school = school_id if _n == 1
by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], ///
    school_id) if _n > 1
by id (wave): replace mode_school = last_school[_N] if missing(mode_school)

will resolve this problem with my code. Similar changes to Nick's code will resolve the problem with his.
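
To see the rounding in isolation, here is a minimal sketch. The value 12345678901 is a made-up 11-digit number, not one of Emily's identifiers, and the variable names id_float and id_double are hypothetical. It stores the same number once as a float and once as a double and compares the results.

```stata
clear
set obs 1

* Made-up 11-digit ID stored as a float (float is what -generate- uses
* when no type is given, unless the default was changed with -set type-)
gen float id_float = 12345678901

* The same value stored explicitly as a double
gen double id_double = 12345678901

* Show all digits rather than scientific notation
format id_float id_double %15.0f
list

* The double holds the value exactly; the float holds a nearby value
assert id_double == 12345678901
assert id_float != 12345678901
```

The listing should show id_double unchanged and id_float differing in its low-order digits, which is the same symptom as the "random" changes described above.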

I understand the problem with showing example data from confidential data sources. In general, the exact values of the variables are not needed for most of the problems encountered on Statalist. But the structure of the data is critically important. And data storage types are part of that. When this thread began, O.P. exhibited a data tableau showing 5-digit school IDs. Admittedly, when thinking about the particular question posed, it would not immediately jump to my mind that the difference between 5-digit and 11-digit IDs would be salient, let alone crucial. But it turned out to be so. So the general moral of the story is: if you have to make up data when posting an example, do your best to make the made-up data "look like" the real data.
Nick Cox

Join Date: Mar 2014

Posts: 35776
#2

09 Aug 2023, 14:35

The thread referred to can be seen in MS Edge at least but not clearly in Chrome or Firefox. I guess Emily Lowthian posted some material using HTML but not quite to the forum software's appreciation. If you do that you need to do it within designated tags. If not, then who knows.

For completeness, this was my reply, which echoes Clyde Schechter in #1.

That is a precision problem. You have much longer identifiers than previously implied, which is fine, except that my code and Clyde's need to be amended to produce new variables using double as a storage type. Note that the problem is the same in both cases: you are getting numbers that are close to, but typically not equal to, your identifiers. To hold very large integers exactly, the default of float is not capacious enough.
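
One way to check the amendment is to run it on made-up data with 11-digit identifiers and assert that the results are actual identifiers. The sketch below is only an illustration: the data values are invented, and computing the mode within each id through the by() option of egen is an assumption of this sketch about the intended grouping.

```stata
clear
* Made-up panel: id 1 has a clear modal school, id 2 has a tie and a missing value
input long id byte wave double school_id
1 1 12345678901
1 2 12345678901
1 3 12345678902
2 1 22345678901
2 2 22345678902
2 3 .
end

* Mode within id, stored as double; egen returns missing when modes are tied
egen double mode_school = mode(school_id), by(id)

* Carry the most recent nonmissing school forward within id, again as double
by id (wave), sort: gen double last_school = school_id if _n == 1
by id (wave): replace last_school = cond(missing(school_id), last_school[_n-1], ///
    school_id) if _n > 1

* Where there is no unique mode, fall back on the most recent school
by id (wave): replace mode_school = last_school[_N] if missing(mode_school)

format school_id mode_school last_school %15.0f
list, sepby(id)

* With doubles the results equal the identifiers exactly
assert mode_school == 12345678901 if id == 1
assert mode_school == 22345678902 if id == 2
```

Dropping the word double from the egen and gen lines and rerunning should make the assertions fail, because the new variables would then be floats.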
Nick Cox

Join Date: Mar 2014

Posts: 35776
#3

10 Aug 2023, 10:37

Thanks to sladmin the glitch in the thread referred to has been fixed if anyone wants to have a look.
Emily Lowthian

Join Date: Aug 2023

Posts: 8
#4

12 Sep 2023, 07:23

Thanks both, apologies