Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I know which variables uniquely identify each observation in the dataset from a total of 134 variables in my dataset?

    I have a dataset with 134 variables , 4,640 observations. The dataset is identified by more than one variable. How do I know which variables identify each observation in the dataset?
    My main issue is there is no information about the structure of the dataset. I only know that several variables identify the observations in the dataset.
    My plan is that if I know the combinations of these variables that identify the dataset, I reduce the identifying variables by collapsing to two important ones. Then, I will "xtset" the dataset using the two main variables and do some panel regression analysis.
    So far, to find these variables, I run a command, for example, "duplicates report x1 x2 x3" where x1 x2 or x3 , etc.. are the variables that I am guessing to identify the dataset until I find no duplicate in the output of the command.
    I also used "isid x1 x2 x3" to check whether any selected combination of variables identifies the dataset or not.
    The issue with this trial and error is, I have several variables (134 variables in total). Therefore, it is very cumbersome, and I cannot easily find the right combination of variables that identify the observations in the dataset.
    Please let me know if you have an easy way of solving this problem.
    Thank you very much, and looking forward for your suggestions.
    Last edited by tig som; 06 Sep 2022, 13:45.

  • #2
    Are there variable names and variable labels if you run describe? That could be a good start.

    duplicate reports may not be right thing to do. Same person appearing in both waves should have the same household ID and personal ID, but adding income (unbeknownst to us that it's income) to it will make them look like unique. So, making sure that they are unique probably will not help.

    If the data set is truly void of any information on their variables and structure, then it'd be better to contact the data source for details, or look for another data set. Don't make any assumptions, because if they are wrong, the results will do more harm than good.

    Comment


    • #3
      This answers the question in the title of the thread rather than the more specific question in the text of #1. If I get ideas on that, I will post further.


      An identifier variable has as many distinct values as the number of observations. distinct from the Stata Journal can find such variables.

      With the auto data, make is known to be a identifier. We construct another and also find with distinct that price has completely distinct values, or in other words each value is unique, occurring once only.



      Code:
      . sysuse auto, clear
      (1978 automobile data)
      
      . gen id = _n
      
      . distinct, min(74)
      
      -------------------------------------
                    |     total   distinct
      --------------+----------------------
               make |        74         74
              price |        74         74
                 id |        74         74
      -------------------------------------
      . It would be useful in such cases if distinct returned the variable list shown, which I will think about as an addition.

      Naturally isid has the role of identifying identifiers. Here is one way it could be used.


      Code:
      foreach v of varlist * {
           capture isid `v'
           if _rc == 0 local ids `ids' `v'
      }
      
      . di "`ids'"
      make price id
      Last edited by Nick Cox; 06 Sep 2022, 15:13.

      Comment


      • #4
        Is there a reason why isid does not give a generic success message, when the used variable(s) uniquely identify observations?
        It's only in the PDF manual where we learn isid only reports error messages:
        isid reports no error, so the two variables company and year uniquely identify the observations.
        .

        Perhaps an extra sentence in the basic help file would improve user-friendliness. At least I remember thinking the command simply didn't register, the very first time I happened to run isid on a unique email identifier.
        Last edited by Michiel De Vydt; 09 Sep 2022, 09:45.

        Comment


        • #5

          Thank you all. Thank you, Nick Cox. The command you gave me is perfect. Very helpful. However, in my case, a group of variables together identify the dataset. It is not just one variable that identifies the dataset. When I run the command you gave me above in a cross-sectional dataset, it automatically displays the variable that identifies the dataset.
          However, when I run the command in a panel dataset, it does not display the two variables that together identify the dataset.

          In my current dataset case, more than one variable identifies the dataset, and I am trying to find out those variables that together identify the dataset.
          Do you have some tips for that? Thank you very much!

          Comment


          • #6
            Originally posted by Michiel De Vydt View Post
            Is there a reason why isid does not give a generic success message, when the used variable(s) uniquely identify observations?
            It's only in the PDF manual where we learn isid only reports error messages: .

            Perhaps an extra sentence in the basic help file would improve user-friendliness. At least I remember thinking the command simply didn't register, the very first time I happened to run isid on a unique email identifier.
            Maybe it's just a matter of what one is accustomed to. I am used to the idea that assertions in Stata succeed quietly, so I was not surprised with this behaviour of isid. Programmatically this is a non-issue because one almost always captures errors in assertions, otherwise the code moves on, as you would want it to.

            Comment


            • #7
              Originally posted by tig som View Post
              Thank you all. Thank you, Nick Cox. The command you gave me is perfect. Very helpful. However, in my case, a group of variables together identify the dataset. It is not just one variable that identifies the dataset. When I run the command you gave me above in a cross-sectional dataset, it automatically displays the variable that identifies the dataset.
              However, when I run the command in a panel dataset, it does not display the two variables that together identify the dataset.

              In my current dataset case, more than one variable identifies the dataset, and I am trying to find out those variables that together identify the dataset.
              Do you have some tips for that? Thank you very much!
              The distinct command shown in #3 allows you to check if two (or more) variables jointly uniquely identify the data (use the option joint). So the solution might be to run nested loops going over all possible pairs of variables, and see which pairs do the job.

              Comment


              • #8
                Thank you Hemanshu Kumar. I run distinct, joint, and from the result, I see that the entire observations are jointly distinct. Thus, the dataset is identified by a group of variables. However, how do I know those variables that together identify the dataset?

                I found that the “foreach” command shown in #3, is very helpful in displaying which variables identify the dataset if the dataset can be identified by a single variable. But in my dataset, more than one variable together identifies the dataset. I am trying to find out these variables that together identify the dataset.
                I agree with your recommendation to run nested loops going over all possible pairs of variables and see which pairs do the job. Do you have any tips on how to run a nested loop?
                Thank you again.

                Comment


                • #9
                  Sorry, but if you are planning on running (panel) analyses, there must be something you know about the variables.* For example, you probably wouldn't put identifiers as predictor (right-hand-side, independent, ...) variables into a regression model, would you?

                  Otherwise, there are about 9,000 possible pairs of two variables or 2-tuples; you can brute-force your way through those. But then, there are about 400,000 possible 3-tuples, about 13,000,000 4-tuples, and so on. Nested loops are not feasible here.


                  * Edit: More broadly speaking, there must be something you know about the data. Where is it from? How was it created?
                  Last edited by daniel klein; 27 Sep 2022, 06:18.

                  Comment


                  • #10
                    Try this (I use isid instead of distinct, which does the job and has the advantage of being an inbuilt Stata command):

                    Code:
                    sysuse auto, clear
                    
                    unab vars: *
                    local numvars: list sizeof vars
                    
                    noi dis "The following pairs of variables jointly identify the data:"
                    
                    forval i = 1/`=`numvars'-1' {
                        local v1: word `i' of `vars'
                        forval j = `=`i'+1'/`numvars' {
                            local v2: word `j' of `vars'
                            capture isid `v1' `v2'
                            if _rc == 0 noi dis "`v1' and `v2'"
                        }
                    }
                    which produces:
                    Code:
                    The following pairs of variables jointly identify the data:
                    make and price
                    make and mpg
                    make and headroom
                    make and trunk
                    make and weight
                    make and length
                    make and turn
                    make and displacement
                    make and gear_ratio
                    make and foreign
                    price and mpg
                    price and headroom
                    price and trunk
                    price and weight
                    price and length
                    price and turn
                    price and displacement
                    price and gear_ratio
                    price and foreign
                    mpg and weight
                    Edit: #9 is right. Surely you know something about the structure of the dataset which helps you limit the number of variables you are cycling over? If so, you should replace the unab command in the code above with something like
                    Code:
                    unab vars: <put subset of vars here>
                    Last edited by Hemanshu Kumar; 27 Sep 2022, 06:36.

                    Comment


                    • #11
                      Thank you very much again, Hemanshu Kumar!!
                      This command is very informative. I still have a question.
                      The command works perfectly when only two variables identify the dataset.
                      However, in my dataset, more than two variables identify the dataset. To be precise, I don’t know the exact number of variables that together identify the dataset.

                      I tried to extend your command to check if three variables identify the dataset by adding "forvalue K …"as presented in the command below.


                      unab vars: *
                      local numvars: list sizeof vars

                      noi dis "The following pairs of variables jointly identify the data:"

                      forval i = 1/`=`numvars'-1' {
                      local v1: word `i' of `vars'
                      forval j = `=`i'+1'/`numvars' {
                      local v2: word `j' of `vars'
                      forval k = `=`j'+1'/`numvars' {
                      local v3: word `k' of `vars'

                      capture isid `v1' `v2' `v3'
                      if _rc == 0 noi dis "`v1' and `v2' and `v3'"
                      }
                      }
                      }


                      And still, three variables together do not identify the dataset. It seems more than three variables together identify the dataset.
                      As I said, the issue is I do not know the exact number of variables that identify the dataset. Adding v3, v4, v5,.. and running each command until I reach the exact number of variables that identify the dataset is a long process. Is there a shorter way for this, please?
                      Thank you again, and looking forward to your suggestion.

                      Comment


                      • #12
                        One last try from my side: You are approaching this the wrong way. Where does the data come from? Who created the data? What is the information in that data? How do you even know that there is exactly one combination of variables that identifies the observations?

                        Comment


                        • #13
                          I agree with #12; I am becoming more and more uncomfortable with this brute-force approach. There should be some information about the dataset that can help you significantly narrow down the choices, if not eliminate them altogether....

                          Comment


                          • #14
                            The question of why isid says nothing when it does is a good one. I explain Stata’s behaviour in such cases as following a principle that no news is good news.

                            Comment


                            • #15
                              The previous posts so far are absolutely correct. The best way to do this is to learn more about the dataset, either by reaching out to the creator of the data, finding the documentation, or by making inferences about the data by examining its structure. For instance, is there a column formatted like a date or a year? Do you know how many waves of data you should have, and if so, is there exactly one variable with that many levels?

                              It's an interesting theoretical problem though, isn't it? One could simplify the brute force approach in #10 by eliminating any variables containing non-integer numbers from the search, since these are almost certainly not non-integer unique identifiers. Same with negative values.

                              Comment

                              Working...
                              X