Making a new variable or mini dataset that consists of information about the missings

Anna weel

Join Date: May 2023

Posts: 21
#1

15 Jun 2023, 12:00

Hi there!

I'm currently working with two datasets: a baseline and a follow-up in the same population. But the second data collection has lots of missings (67), which is 15% of the population to analyse. Every individual has their own number, which is the same in both datasets. I want either to make a new variable or to create a new mini dataset that only includes those missings, so I can find out whether they have something in common and explain the possible selection bias this creates. Does anyone have an idea how to tackle this?
Thank you in advance,
Anna
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#2

15 Jun 2023, 12:11

Your description leaves nearly everything to the imagination. Here's what I imagine. I imagine you have two separate data sets, one for baseline and one for follow-up. They are keyed by a common identifier variable. I also assume that the identifier variable uniquely identifies observations in both data sets. The follow-up data set contains no observations for 67 members of the population (which is different from there being an observation containing the identifier variable but missing values for all other variables.). To find those:

Code:

use baseline_data_set, clear
merge 1:1 identifier_variable using follow_up_data_set, keep(master) nogenerate
keep identifier_variable
save new_data_set, replace

The new data set created by this code will contain the identifiers of those participants who do not appear in the follow-up data set; omit the -keep- line to retain all of their baseline data as well.
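
If a new variable is preferred over a separate data set, one possible variant is to flag those participants in the baseline data instead. This is a minimal sketch that reuses the placeholder file and variable names from the code above; lost_to_followup is a hypothetical variable name.

```stata
* Sketch only: file and variable names are placeholders.
use baseline_data_set, clear
* Bring in nothing but the key from the follow-up file; keep all baseline observations
merge 1:1 identifier_variable using follow_up_data_set, ///
    keepusing(identifier_variable) keep(master match)
* _merge == 1 means the observation is in baseline only, i.e. absent from follow-up
generate byte lost_to_followup = (_merge == 1)
drop _merge
tabulate lost_to_followup
```

The tabulation gives a quick check that the number of flagged participants matches the number expected to be missing.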

Is that what you wanted? If not, please post back and show example data from both the baseline and follow-up data sets, and then show what the result you want to get would look like.

When showing example data, please be sure to use the -dataex- command, as this is the most effective and helpful way to show example data. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
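
As an illustration, a call along the following lines would produce a postable example of the first five observations. This is only a sketch: identifier_variable is the placeholder used above, and var1 and var2 are hypothetical variable names to be replaced with your own.

```stata
* Sketch only: replace the file and variable names with your own.
use baseline_data_set, clear
dataex identifier_variable var1 var2 in 1/5
```

The output can then be copied from the Results window and pasted directly into a post.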
Anna weel

Join Date: May 2023

Posts: 21
#3

16 Jun 2023, 04:09

Hi,
Thank you so much for your help. The previous code did not generate what I want, so here is my data, including only a few variables.
The first one is the baseline, which does not have any missings. The second dataset has 67 missings; # 100101 was not there for the second round of data collection.
What I would like is to create a new dataset consisting out of those 67 missing persons so I can try to find out what they had in common. How can I do this?
Dataset 1

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double pid byte b02 double b05 float(c016_cat c016tot)
1001001 0 14 0 4
1001002 0 10 0 6
1001003 0 14 0 9
1001004 0 14 0 7
1001005 0 11 1 24
end
label values b02 sex
label def sex 0 "Male", modify
label values c016_cat eat26lab
label def eat26lab 0 "no eating disorder", modify
label def eat26lab 1 "have eating disorder", modify

Dataset 2

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(pid b02 b05) float(c016_cat c016tot)
1001001 . . 0 0
1001002 0 12 0 3
1001003 0 15 0 8
1001004 0 16 0 12
1001005 0 12 0 7
end
label values b02 sex
label def sex 0 "Male", modify
label values c016_cat c016lab
label def c016lab 0 "no eating disorder", modify
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#4

16 Jun 2023, 08:34

I'm sorry, but I still don't understand what you want. Your example data has exactly all the same pid's in it--nothing is missing, so you have given me nothing to work with here.

The following code, a slight markup of what I provided in #2, creates a data set containing all and only the 67 people missing from dataset2 and all of their baseline information.

Code:

use baseline_data_set, clear
merge 1:1 pid using follow_up_data_set, keep(master) nogenerate

If that is not what you want, explain, or better still, show by example, how that is different from "a new dataset consisting out of those 67 missing persons so I can try to find out what they had in common?"
Anna weel

Join Date: May 2023

Posts: 21
#5

16 Jun 2023, 10:27

Hi there,

I want to create a new dataset that includes only those participants from the first dataset whose IDs are present in the second dataset but do not have any associated data. In other words, I want to identify the 67 participants whose IDs exist in both datasets but are missing data in the second dataset. If there is code to do the next four steps, this would be amazing.
Identify the participants with missing data in the second dataset: In the second dataset, locate the 67 participant IDs that exist but have no associated data.

Merge the two datasets: Use the "pid" variable to merge the first and second datasets. This process combines the two datasets based on matching participant IDs. As a result, you'll have a new dataset that includes all participants from the first dataset but only those who have IDs in the second dataset.

Filter out participants with missing data: In the merged dataset, filter out the participants whose associated data is missing (the 67 participants identified in step 1).

Create the new dataset: Once you've filtered out the participants with missing data, you'll have a new dataset that contains only the participants you're interested in—the ones who exist in the first dataset but have no associated data in the second dataset.

Thank you so much for your time and effort.
Clyde Schechter

Join Date: Apr 2014

Posts: 30168
#6

16 Jun 2023, 10:41

OK, this is clearer. I note that in your example data, there are no instances of a pid with no associated data. At most you have ID 1001001 who is missing data on b02 and b05, but that person still has associated data on c016_cat and c016tot. Taking you literally, then the results for your example data would be an empty data set, and that is what the following code produces:

Code:

use `dataset2', clear
ds pid, not
local non_id_vars `r(varlist)'
egen int nmcount = rownonmiss(`non_id_vars')
keep if nmcount == 0 // THESE HAVE NO ASSOCIATED DATA
keep pid
tempfile of_interest
save `of_interest'
use `dataset1', clear
merge 1:1 pid using `of_interest', keep(match) nogenerate

However, in #3, you mention pid 100101 (which I'm interpreting as a typo for 1001001) as an instance of what you are interested in. So I think what you really mean is not pids with no associated data, but pids with incomplete associated data. For that, the code is somewhat different:

Code:

use `dataset2', clear
ds pid, not
local non_id_vars `r(varlist)'
egen int mcount = rowmiss(`non_id_vars')
keep if mcount > 0 // THESE HAVE SOME MISSING ASSOCIATED DATA
rename (`non_id_vars') =_2
tempfile of_interest
save `of_interest'
use `dataset1', clear
merge 1:1 pid using `of_interest', keep(match) nogenerate
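
To look at what the participants with incomplete follow-up data have in common, one option is to carry a flag back to the baseline data and compare groups there. The following is a rough sketch, not a definitive analysis: it assumes `dataset1' and `dataset2' are local macros holding the two file names as in the code above, incomplete and flags are hypothetical names, and the variables compared are simply those from the example data in #3.

```stata
* Sketch only. incomplete and flags are hypothetical names.
use `dataset2', clear
ds pid, not
egen int mcount = rowmiss(`r(varlist)')
generate byte incomplete = (mcount > 0)   // 1 = some follow-up data missing
keep pid incomplete
tempfile flags
save `flags'

use `dataset1', clear
* Baseline pids absent from the follow-up file keep a missing flag
* and are therefore left out of the comparisons below
merge 1:1 pid using `flags', keep(master match) nogenerate

* Compare baseline characteristics by completeness of follow-up
tabulate b02 incomplete, column
tabstat b05 c016tot, by(incomplete) statistics(n mean sd)
```

Which comparisons are appropriate depends on the variables and the research question; the two commands at the end are only examples of a descriptive first look.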
Anna weel

Join Date: May 2023

Posts: 21
#7

20 Jun 2023, 02:39

Thank you so much.