
  • Find the number of unique values of one variable for values in another variable

    Hi folks, hope you are all doing well.

    I have one variable with many values, each of them can appear in many number of observation (some will appear in just one, some will appear more), and I have another variable which also have many values. Most of the values in the first variable will be connected to only one value in the second variable, but some of them will be connect to more than one value. I want to mark the observations with values in the first variable that get more than one value in the second variable.

    I thought to use the command egen (bysort var1: egen var3 = functin(var2)) to create new variable for this, but I don't find the right function. Can you help me with this please? of curse not just restricted to help based on the egen command.

    Thank you!

    Fitz

  • #2
    Maybe I should be more clear:

    My data are about families. The first variable holds the IDs of people (id_parent), and each value appears in as many observations as that person has children. The second variable holds the IDs of families (id_family): two people share a value of id_family if they have children together. So most people have only one value of id_family, because most people have children with only one partner. But if a person has children with more than one partner, then that person's ID in id_parent will be linked to more than one value of id_family.

    I want to handle the people who have more than one id_family so that I keep them only in their first marriage.

    Thanks again folks.



    • #3
      Hello FitzGerald Blindman. You'll make it much easier for members to help you if you use -dataex- to provide a small dataset showing what your data file looks like. See item 12.2 in the FAQ for more info.

      Cheers,
      Bruce
      --
      Bruce Weaver
      Version: Stata/MP 18.5 (Windows)



      • #4
        Per Bruce's suggestion, understanding your data and goal would be much easier with an example. I nevertheless decided to try something based on a data set I simulated to fit something like what I think (?) you describe in #1. There might well be an easy "first principles" solution, but the community-contributed command -distinct- occurred to me as a ready-made tool.
        Code:
        clear
        set seed 6136
        // A data set of 10 persons who are parents of 1, ..., 5 children
        set obs 10
        gen int id_parent = _n^2 // a nonconsecutive id
        gen int nchild = runiformint(1,5)
        expand nchild
        // Each child observation might be linked to any one of 10
        // families, i.e., 10 family partners.
        gen int id_family = runiformint(1, 10)
        // Done simulating data.
        //
        // id_parent will not be something simple in the original data. Making a temporary
        // consecutive id from it makes it easy to loop over a potentially large number of
        // id_parent values.  Relying on -levelsof id_parent- might exceed Stata's limits.
        egen tempid = group(id_parent)
        summ tempid , meanonly
        local nperson = r(max)
        //
        // We need the -distinct- command
        capture net install  dm0042_2, from("http://www.stata-journal.com/software/sj15-3")
        //
        gen int nfam = .
        label var nfam "Number of distinct family partners for this person"
        di "nperson = `nperson'"
        forval i = 1/`nperson' {
           distinct id_family if tempid == `i'
           replace nfam = r(ndistinct) if tempid == `i'
        }
        // Check out what was done.
        sort id_parent id_family
        browse id_parent id_family nfam
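
        For comparison, here is a minimal "first principles" sketch that avoids both the loop and the community-contributed command. It relies only on the official egen functions tag() and total(). The toy data and the variable names pairtag, nfam, and multifam are assumptions made for illustration; only id_parent and id_family come from the description in #2.

        ```stata
        clear
        // Hypothetical toy data: parents 1 and 9 have children in two families,
        // parent 4 has children in only one.
        input id_parent id_family
        1 10
        1 10
        1 11
        4 12
        9 13
        9 14
        9 14
        end
        // Tag exactly one observation per distinct (id_parent, id_family) pair.
        egen byte pairtag = tag(id_parent id_family)
        // Summing the tags within each parent counts that parent's distinct families.
        bysort id_parent: egen nfam = total(pairtag)
        // Flag every observation of a parent who is linked to more than one family.
        gen byte multifam = nfam > 1
        list id_parent id_family nfam multifam, sepby(id_parent)
        ```

        Note that tag() skips observations with missing values unless its missing option is specified, so missing id_family values are not counted as a family here. Keeping only the "first" family for flagged parents would additionally need some ordering variable (for example a date), which is not shown in this thread.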

