Sorting/Re-Order Observations Independently of Other Variables?

David Kapaon

Join Date: Nov 2019

Posts: 30
#1

Sorting/Re-Order Observations Independently of Other Variables?

17 Jun 2025, 16:58

Hi,
I have a longitudinal dataset that contains three waves of data from a cohort of older adults. Data was collected every 3 years. One of the questions in the study asks the adults to list the ages of each of their children at each wave. I've reshaped the data long to show a sample of the children within each adult/parent (ID). However, I also included the wide version too as an example.

Theoretically, the order of the children are supposed to be constant across the waves - (i.e. "Please list the age of child #1" should force the parent to list the same child's age in each wave. Same with child #2, and #3, etc.). Unfortunately, I have some suspicions that the roster of children may have gotten mixed up across the waves in certain scenarios since some ages do not increase by ~3 years over time (or alternatively just bad data too...). For example, some children gain 15 years between two waves, and others decrease in age.

(I have a feeling this almost certainly won't work...) BUT I've been asked (just to check) to sort the ages of the children from lowest to highest within each wave just to see if the ages increase by ~3 years each wave. Essentially, if by ignoring the "number" of the child, do the ages make more sense across the waves than how they currently are now - Is there a way to sort observations independently of each other though?
Best,
David

Wide Data:

Code:

clear input str70 ID byte wave_1_child_age1 int wave_2_child_age1 byte wave_3_child_age1 "CBBGF" 15 19 24 "DFRHQ" 16 9 21 "FSGLP" 19 15 26 "LVEKH" 22 26 28 "EDBXY" 24 28 31 "HMRXD" 30 24 36 end

Long Data:

Code:

clear input str70 ID byte(Child wave_1_child_age) int wave_2_child_age byte wave_3_child_age "CBBGF" 1 15 19 24 "DFRHQ" 1 16 9 21 "DFRHQ" 2 13 5 15 "DFRHQ" 3 11 . 10 "DFRHQ" 4 13 . 9 "DFRHQ" 5 4 . . "EDBXY" 1 24 28 31 "EDBXY" 2 20 25 29 "EDBXY" 3 15 23 26 "EDBXY" 4 11 14 18 "EDBXY" 5 8 12 15 "EDBXY" 6 6 10 13 "EDBXY" 7 4 11 11 "FSGLP" 1 19 15 26 "FSGLP" 2 18 15 25 "FSGLP" 3 14 3 21 "FSGLP" 4 13 . 18 "FSGLP" 5 11 . 16 "FSGLP" 6 7 . . "FSGLP" 7 2 . . "FSGLP" 8 13 . . "FSGLP" 9 11 . . "HMRXD" 1 30 24 36 "HMRXD" 2 27 21 32 "HMRXD" 3 25 19 28 "LVEKH" 1 22 26 28 "LVEKH" 2 14 19 21 "LVEKH" 3 7 12 14 end

Picture example in case too:
Tags: None
David Kapaon

Join Date: Nov 2019

Posts: 30
#2

17 Jun 2025, 17:08

Oops sorry - just clarifying the picture! columns are children's ages in a given wave
Sorry the confusion!
David
Attached Files
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30169

17 Jun 2025, 17:24

I think what you want to do is this:

Code:

//    VERIFY NO DUPLICATES ON COMBOINATION ONF ID AND Child
isid ID Child

//    GO LONG ON WAVE
reshape long wave_@_child_age, i(ID Child) j(wave)
rename wave__child_age child_age

//    SORT THE CHILD'S DATA IN AGE ORDER AND CALCULATE INTER-WAVE DIFFERENCES
by ID Child (child_age), sort: gen delta = child_age - child_age[_n-1]

tab delta

And, ideally, delta should always be 3, or at least close to that. In your example data, the results are nothing at all like that:

Code:

      delta |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |          1        2.38        2.38
          1 |          1        2.38        4.76
          2 |          4        9.52       14.29
          3 |          7       16.67       30.95
          4 |          9       21.43       52.38
          5 |          8       19.05       71.43
          6 |          4        9.52       80.95
          7 |          5       11.90       92.86
          8 |          2        4.76       97.62
         11 |          1        2.38      100.00
------------+-----------------------------------
      Total |         42      100.00

As you can see, the commonest values are three through five, which suggests that there is some tendency towards child ages growing by more or less three per wave. But it is only loosely so, and only about half of the time. And there are many instances of inter-wave age changes that are very far off the mark. Of course, you need to see how this plays out in your full data set.

Last edited by Clyde Schechter; 17 Jun 2025, 17:27.

Comment

Nils Enevoldsen

Join Date: Oct 2014

Posts: 296
#4

23 Jun 2025, 13:12

This is an interesting problem! Unless I'm mistaken, Clyde has illustrated the problem you described, rather than proposed the solution you requested. I have thought for a while about a solution, but the longer I think about it the more subtle it gets. The presence of missing data is what makes this problem hard/fun.

As a high-level algorithm, it seems like what you want to do it define some sort of [within-household, between-wave] goodness-of-fit metric, and then reassign child IDs to optimize that metric for each household. Crudely, this metric is something along the lines of "average age progression for each child is close to 3 years". But missing data makes this hard. For example, at an extreme you can cheat this metric entirely by assuming that there is zero overlap between identified children in each wave. ("In wave 1, they forgot about all their children except Alice and Bob. In wave 2, they forgot about all their children except Charlie and Dana. In wave 3, they forgot about all their children except Edward and Francis. And hence there are zero incongruities with reported ages!")

To illustrate some difficult cases, say you have a parent who reports ages 10/20/30 in wave 1, 23/33/43 in wave 2, and 16/26/36 in wave 3. Which of these scenarios is more likely?

Code:

Scenario A child: 1 2 3 wave 1 10 20 30 wave 2 43 23 33 // Data error for child 1 wave 3 16 26 36 Scenario B child: 1 2 3 4 wave 1 10 20 30 . wave 2 . 23 33 43 // Child 1 missing; child 4 present wave 3 16 26 36 . Scenario C child: 1 2 3 wave 1 10 20 30 wave 2 23 33 43 // Not plausible, but a naive algorithm might make this assignment wave 3 16 26 36

On the other hand, if the age gaps are narrower, like 1/2/3, 5/6/7, and 7/8/9, there are even more plausible possibilities:

Code:

Scenario A child: 1 2 3 wave 1 1 2 3 wave 2 7 5 6 // Data error for child 1 wave 3 7 8 9 Scenario B child: 1 2 3 4 wave 1 1 2 3 . wave 2 . 5 6 7 // Child 1 missing; child 4 present wave 3 7 8 9 . Scenario C child: 1 2 3 wave 1 1 2 3 wave 2 5 6 7 // Plausible (e.g. survey done late in wave 2) wave 3 7 8 9

I may be overthinking this and letting the perfect be the enemy of the good, but whatever solution is proposed needs to address in some kind of basic way the frequent missing data that your example data exhibits. I'm sorry I don't have a more complete answer!
Comment

Announcement