Find the number of unique values of one variable for values in another variable

FitzGerald Blindman

Join Date: Sep 2023

Posts: 36
#1

Find the number of unique values of one variable for values in another variable

29 Aug 2024, 08:16

Hi folks, hope you are all doing well.

I have one variable with many values, each of which can appear in many observations (some appear in just one, some in more), and another variable that also has many values. Most values of the first variable are connected to only one value of the second variable, but some are connected to more than one. I want to mark the observations whose value of the first variable is connected to more than one value of the second variable.

I thought of using egen (bysort var1: egen var3 = function(var2)) to create a new variable for this, but I can't find the right function. Can you help me with this, please? Of course, answers don't have to be restricted to egen.

Thank you!

Fitz
FitzGerald Blindman

Join Date: Sep 2023

Posts: 36
#2

29 Aug 2024, 08:23

Maybe I should be more clear:

My data are about families. The first variable holds person IDs (id_parent), and each value appears in as many observations as that person has children. The second is the family ID (id_family): two observations have the same value of id_family if the two people have children together. So most people have only one value of id_family, because most people have children with only one partner. But if a person has children with more than one partner, then their ID in id_parent will be connected to more than one value of id_family.

I want to handle the people who have more than one id_family so that I keep them only in their first marriage.

Thanks again folks.
Bruce Weaver

Join Date: May 2014

Posts: 1129
#3

29 Aug 2024, 08:37

Hello FitzGerald Blindman. You'll make it much easier for members to help you if you use -dataex- to provide a small dataset showing what your data file looks like. See item 12.2 in the FAQ for more info.

Cheers,
Bruce

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 18.5 (Windows)

Mike Lacy

Join Date: Apr 2014
Posts: 2411

29 Aug 2024, 10:01

Per Bruce's suggestion, understanding your data and goal would be much easier with an example. I nevertheless decided to try something based on a data set I simulated to fit something like what I think (?) you describe in #1. There might well be an easy "first principles" solution, but the community-contributed command -distinct- occurred to me as a ready-made tool.

Code:

clear
set seed 6136
// A data set of 10 persons who are parents of 1, ..., 5 children
set obs 10
gen int id_parent = _n^2 // a nonconsecutive id
gen int nchild = runiformint(1,5)
expand nchild
// Each child observation might be linked to any one of 10
// families, i.e., 10 family partners.
gen int id_family = runiformint(1, 10)
// Done simulating data.
//
// id_parent will not be something simple in the original data. Making a temporary
// consecutive id from it makes it easy to loop over a potentially large number of
// id_parent values.  Relying on -levelsof id_parent- might exceed Stata's limits.
egen tempid = group(id_parent)
summ tempid , meanonly
local nperson = r(max)
//
// We need the -distinct- command
capture net install  dm0042_2, from("http://www.stata-journal.com/software/sj15-3")
//
gen int nfam = .
label var nfam "Number of distinct family partners for this person"
di "nperson = `nperson'"
forval i = 1/`nperson' {
   distinct id_family if tempid == `i'
   replace nfam = r(ndistinct) if tempid == `i'
}
// Check out what was done.
sort id_parent id_family
browse id_parent id_family nfam
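
For comparison, here is a minimal loop-free sketch of the "first principles" route mentioned above. It is not from the original posts: it relies only on official -egen- functions, and the names pair_tag, nfam_alt and multi_family, as well as the tiny typed-in dataset, are made up for illustration. The idea is to tag one observation per distinct (id_parent, id_family) pair and then sum those tags within each parent.

```stata
clear
// Hypothetical toy data: parent 1 has two families, parent 4 has one,
// and parent 9 has three.
input int id_parent int id_family
1 10
1 10
1 11
4 12
4 12
9 13
9 14
9 15
end
// Tag exactly one observation per distinct (id_parent, id_family) pair
egen byte pair_tag = tag(id_parent id_family)
// Summing the tags within a parent gives that parent's number of
// distinct families
egen int nfam_alt = total(pair_tag), by(id_parent)
// Flag every observation of a parent linked to more than one family
gen byte multi_family = nfam_alt > 1
list, sepby(id_parent)
```

With real data, observations with missing id_parent or id_family would need a separate decision, since by default tag() does not tag observations that have missing values in the listed variables.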
