Problems merging cross-sectional data with longitudinal data

Janey Sewell

Join Date: Nov 2017

Posts: 12
#1

Problems merging cross-sectional data with longitudinal data

27 Nov 2017, 07:27

Hi,
I am a relatively inexperienced stata user and am hoping someone here might be able to help!

I am trying to merge dates from a cross-sectional dataset (this has around 1200 unique observations, each identified by a unique studyID) with a longitudinal dataset in which I have repeat observations (each observation is identified by studyID but in many cases studyID is used multiple times - every time more longitudinal data is collected studyID is entered with each new bit of longitudinal data being added as a row).

Both datasets are in long format and all participants in the longitudinal dataset are found in the cross-sectional dataset, but not all of those that are in the cross-sectional dataset are in the longitudinal dataset.

In the cross sectional dataset I have kept only the studyID and the 'dateattended' variable (which contains the date i need in DMY format) and i have called this 'studyIDanddateattended' (original i know!) and I have then sorted both datasets by studyID.

With the longitudinal dataset open as master, I have used the code:

merge m:1 studyID using studyIDanddateattended

However although this seems to merge on studyID, the new dateattended variable in the longitudinal dataset is not the same value as in the cross-sectional dataset, except for the first few observations.

I've tried lots of different ways of merging but can not seem to make it match up - if anyone has any advice i'd be very grateful.

Thanks so much, janey
Tags: None
Jannic Cutura

Join Date: Jul 2025

Posts: 125
#2

27 Nov 2017, 08:22

Hey janey,
welcome to Statalist.

Please provide an example using dataex (see help datasex) so we can see whats going wrong.

From what you described I understand you have a crosssectional dataset (where an observation (a row) is identified by studyID) and a stacked panel dataset (where an observation is identified with studyID and dateattended). Assuming your working directory is set to the folder where both your datasets are located (type pwd to check and use cd "path" to change the working directory to the correct one) you can use:

Code:

merge n:1 studyID using "name_of_stacked_panel_dataset"

now I am not sure I understand whats going wrong. If the date is a strange format ,simply reformat it:

Code:

format datevariable %td

to reformat the variablöe "datevariable" into human-readible format assuming it is stored in days (and an integer, not a string).
Comment
Janey Sewell

Join Date: Nov 2017

Posts: 12
#3

27 Nov 2017, 09:05

Thanks for your quick response...

The cross sectional dataset contains studyID (unique to each row) and dateattended
And the stacked panel dataset (i think this refers to my longitudinal dataset) contains studyID (each observation identified with studyID so it can appear on multiple rows) as well as lots of other variables

The date appears in the right format but just not the correct date corresponding to the studyID. I write the full location of the dataset each time to make sure that it doesn't use the wrong dataset

Using dataex, in the cross sectional data, studyID and dateattended appear like this:

[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input long studyID float dateattended
1 20172
2 20172
3 20172
4 20172
5 20172
6 20172
7 20172
8 20171
9 20172
10 20179

However only studyID numbers 5, 7 and 8 are found in the longitudinal dataset (the others did not complete longitudinal data)

With the longitudinal(stacked panel) dataset open i write:

merge m:1 studyID using studyIDanddateattended

and i get the following:

----------------------- copy starting from the next line -----------------------
[CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input long studyID float dateattended
1 20172
1 20172
1 20172
1 20172
1 20172
1 20172
1 20172
2 20172
2 20172
2 20172
3 20172
3 20172
3 20172
3 20172
3 20172
3 20172
3 20172
4 20172
4 20172
4 20172
5 20172
5 20172
5 20172
6 20172
6 20172
6 20172
6 20172
6 20172
6 20172
6 20172
7 20172
7 20172
7 20172
7 20172
7 20172
7 20172
8 20171
8 20171
8 20171
9 20172
9 20172
9 20172
9 20172
9 20172
9 20172
10 20179
10 20179
10 20179

In the above example ID 1 has been given ID 1's value from the cross-sectional data, however ID 1 in my longitudinal dataset actually corresponds to ID5 from the cross-sectional dataset, ID 2 in the longitudinal dataset corresponds to ID 7 in the cross-sectional data, and ID 3 in the longitudinal dataset should have ID8's value from the cross sectional data.

From doing the above i realise that the studyID is therefore not being read properly - I encoded studyID in both datasets but should I do something different for them to be able to read the same?

Hope this makes sense!
Thanks,

janey
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

27 Nov 2017, 09:53

In the above example ID 1 has been given ID 1's value from the cross-sectional data, however ID 1 in my longitudinal dataset actually corresponds to ID5 from the cross-sectional dataset, ID 2 in the longitudinal dataset corresponds to ID 7 in the cross-sectional data, and ID 3 in the longitudinal dataset should have ID8's value from the cross sectional data.

Well, you can't possibly expect the -merge- command to "know" that, can you?

It simply isn't possible to correctly -merge- two data sets correctly when the merge key variable doesn't correspond to the same entities in both data sets. The particular correspondence you mention here, 1 with 5, 2 with 7 and 3 with 8 does not suggest any simple systematic rule to convert from the ID system in one file to that in the other. So you are going to have to create a new data set that crosswalks the ID's from one file to those in the other. Then you will need to merge each of your original files to the crosswalk in order to finally get a sensible merge result.
Comment
Janey Sewell

Join Date: Nov 2017

Posts: 12
#5

27 Nov 2017, 10:15

Thanks, but I thought that if the studyID was unique in the first cross-sectional dataset and was used again in the second longitudinal dataset then it could be matched (i.e the studyID for (in the above example) ID 1 is AC001, ID 2 is AC002, ID 3 is AC003... ID 5 is AC005 and appears in both datasets but after the merge it has been given the value of AC001) . I am not sure why in dataex the AC001 has not been shown.

That is why I asked if I there was something I needed to do so that the studyID's could be identified correctly in each for the merge to be successful.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

27 Nov 2017, 10:37

This is pretty obscure. But a clue is that your -dataex- posts are, on their face, incomplete. You do not show all the way down to the end of the -dataex- output. And based on what you're saying in #5, I think the key information is probably in what you did not show.

My hypothesis is that the variable ID is a value-labeled variable. That is, had we seen the full -dataex- output we would have seen some -label define- commands there and those commands would have attached the value label "AC001" to the number 1, "AC002" to the number 2, etc.

If I'm on the right track here, then you have, for some reason, got two data sets with a variable ID, but the value labels are different in the two data sets. This would have arisen, for example, if you used the -encode- command to create the ID variable in each data set separately. When doing -encode- in the smaller data set, Stata would have no way to know about other values occurring in the larger data set, and -encode- always just creates consecutive integer values for the data it is presently working with. That is why it is dangerous to use -encode- in two different data sets separately unless you are sure that they have exactly the same values in them.

If I'm still on the right track, what you need to do is -decode- the ID variable to re-create the original string version of the ID variable. Then you can merge the two data sets on the string version of the ID variable and they will match up correctly. And consider dropping the version of the ID variable you have currently (after you -decode- it) so that you won't get into any more trouble with it.

There is the question of why you even want to have a numeric version of the ID variable anyway. For nearly all purposes, a string version works just as well and avoids issues of precision that sometimes arise. Perhaps the issue for you is that you need to use this ID variable as a panel identifier, and -xtset- requires a numeric panel variable. In that case, the encoded version you used would be suitable for that purpose, but it definitely cannot be used across different data sets.
Comment
Janey Sewell

Join Date: Nov 2017

Posts: 12
#7

27 Nov 2017, 10:47

You're definitely on the right track - thanks Clyde. Yes - i didn't show all the way down the dataex post as they were so long (and i didn't know how relevant!) but you're right there were label define comments at the bottom - sorry!
And yes, you're right, i did use encode separately in each dataset.

I'll give your suggestion a go now - if i understand correctly, decode the ID variable in each dataset and then merge on the ID variable.
Comment
Janey Sewell

Join Date: Nov 2017

Posts: 12
#8

27 Nov 2017, 10:52

Hurray - thanks very much for your help. So appreciated.
Comment

Announcement

Problems merging cross-sectional data with longitudinal data

Comment

Comment

Comment

Comment

Comment

Comment

Comment