How do I know which variables uniquely identify each observation in the dataset from a total of 134 variables in my dataset?

tig som

Join Date: Sep 2022

Posts: 58
#1

How do I know which variables uniquely identify each observation in the dataset from a total of 134 variables in my dataset?

06 Sep 2022, 13:39

I have a dataset with 134 variables , 4,640 observations. The dataset is identified by more than one variable. How do I know which variables identify each observation in the dataset?
My main issue is there is no information about the structure of the dataset. I only know that several variables identify the observations in the dataset.
My plan is that if I know the combinations of these variables that identify the dataset, I reduce the identifying variables by collapsing to two important ones. Then, I will "xtset" the dataset using the two main variables and do some panel regression analysis.
So far, to find these variables, I run a command, for example, "duplicates report x1 x2 x3" where x1 x2 or x3 , etc.. are the variables that I am guessing to identify the dataset until I find no duplicate in the output of the command.
I also used "isid x1 x2 x3" to check whether any selected combination of variables identifies the dataset or not.
The issue with this trial and error is, I have several variables (134 variables in total). Therefore, it is very cumbersome, and I cannot easily find the right combination of variables that identify the observations in the dataset.
Please let me know if you have an easy way of solving this problem.
Thank you very much, and looking forward for your suggestions.

Last edited by tig som; 06 Sep 2022, 13:45.
Tags: None
Ken Chui

Join Date: Aug 2014

Posts: 1058
#2

06 Sep 2022, 14:39

Are there variable names and variable labels if you run describe? That could be a good start.

duplicate reports may not be right thing to do. Same person appearing in both waves should have the same household ID and personal ID, but adding income (unbeknownst to us that it's income) to it will make them look like unique. So, making sure that they are unique probably will not help.

If the data set is truly void of any information on their variables and structure, then it'd be better to contact the data source for details, or look for another data set. Don't make any assumptions, because if they are wrong, the results will do more harm than good.
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35755
#3

06 Sep 2022, 15:07

This answers the question in the title of the thread rather than the more specific question in the text of #1. If I get ideas on that, I will post further.

An identifier variable has as many distinct values as the number of observations. distinct from the Stata Journal can find such variables.

With the auto data, make is known to be a identifier. We construct another and also find with distinct that price has completely distinct values, or in other words each value is unique, occurring once only.

Code:

. sysuse auto, clear (1978 automobile data) . gen id = _n . distinct, min(74) ------------------------------------- | total distinct --------------+---------------------- make | 74 74 price | 74 74 id | 74 74 -------------------------------------

. It would be useful in such cases if distinct returned the variable list shown, which I will think about as an addition.

Naturally isid has the role of identifying identifiers. Here is one way it could be used.

Code:

foreach v of varlist * { capture isid `v' if _rc == 0 local ids `ids' `v' } . di "`ids'" make price id

Last edited by Nick Cox; 06 Sep 2022, 15:13.
Comment
Michiel De Vydt

Join Date: Mar 2020

Posts: 10
#4

09 Sep 2022, 09:32

Is there a reason why isid does not give a generic success message, when the used variable(s) uniquely identify observations?
It's only in the PDF manual where we learn isid only reports error messages:

isid reports no error, so the two variables company and year uniquely identify the observations.

.

Perhaps an extra sentence in the basic help file would improve user-friendliness. At least I remember thinking the command simply didn't register, the very first time I happened to run isid on a unique email identifier.

Last edited by Michiel De Vydt; 09 Sep 2022, 09:45.
Comment
tig som

Join Date: Sep 2022

Posts: 58
#5

27 Sep 2022, 04:11

Thank you all. Thank you, Nick Cox. The command you gave me is perfect. Very helpful. However, in my case, a group of variables together identify the dataset. It is not just one variable that identifies the dataset. When I run the command you gave me above in a cross-sectional dataset, it automatically displays the variable that identifies the dataset.
However, when I run the command in a panel dataset, it does not display the two variables that together identify the dataset.

In my current dataset case, more than one variable identifies the dataset, and I am trying to find out those variables that together identify the dataset.
Do you have some tips for that? Thank you very much!
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1448
#6

27 Sep 2022, 04:31

Originally posted by Michiel De Vydt View Post

Is there a reason why isid does not give a generic success message, when the used variable(s) uniquely identify observations?
It's only in the PDF manual where we learn isid only reports error messages: .

Perhaps an extra sentence in the basic help file would improve user-friendliness. At least I remember thinking the command simply didn't register, the very first time I happened to run isid on a unique email identifier.

Maybe it's just a matter of what one is accustomed to. I am used to the idea that assertions in Stata succeed quietly, so I was not surprised with this behaviour of isid. Programmatically this is a non-issue because one almost always captures errors in assertions, otherwise the code moves on, as you would want it to.
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1448
#7

27 Sep 2022, 04:47

Originally posted by tig som View Post

Thank you all. Thank you, Nick Cox. The command you gave me is perfect. Very helpful. However, in my case, a group of variables together identify the dataset. It is not just one variable that identifies the dataset. When I run the command you gave me above in a cross-sectional dataset, it automatically displays the variable that identifies the dataset.
However, when I run the command in a panel dataset, it does not display the two variables that together identify the dataset.

In my current dataset case, more than one variable identifies the dataset, and I am trying to find out those variables that together identify the dataset.
Do you have some tips for that? Thank you very much!

The distinct command shown in #3 allows you to check if two (or more) variables jointly uniquely identify the data (use the option joint). So the solution might be to run nested loops going over all possible pairs of variables, and see which pairs do the job.
Comment
tig som

Join Date: Sep 2022

Posts: 58
#8

27 Sep 2022, 05:39

Thank you Hemanshu Kumar. I run distinct, joint, and from the result, I see that the entire observations are jointly distinct. Thus, the dataset is identified by a group of variables. However, how do I know those variables that together identify the dataset?

I found that the “foreach” command shown in #3, is very helpful in displaying which variables identify the dataset if the dataset can be identified by a single variable. But in my dataset, more than one variable together identifies the dataset. I am trying to find out these variables that together identify the dataset.
I agree with your recommendation to run nested loops going over all possible pairs of variables and see which pairs do the job. Do you have any tips on how to run a nested loop?
Thank you again.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3872
#9

27 Sep 2022, 06:08

Sorry, but if you are planning on running (panel) analyses, there must be something you know about the variables.* For example, you probably wouldn't put identifiers as predictor (right-hand-side, independent, ...) variables into a regression model, would you?

Otherwise, there are about 9,000 possible pairs of two variables or 2-tuples; you can brute-force your way through those. But then, there are about 400,000 possible 3-tuples, about 13,000,000 4-tuples, and so on. Nested loops are not feasible here.

* Edit: More broadly speaking, there must be something you know about the data. Where is it from? How was it created?

Last edited by daniel klein; 27 Sep 2022, 06:18.
2 likes
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1448

#10

27 Sep 2022, 06:18

Try this (I use isid instead of distinct, which does the job and has the advantage of being an inbuilt Stata command):

Code:

sysuse auto, clear

unab vars: *
local numvars: list sizeof vars

noi dis "The following pairs of variables jointly identify the data:"

forval i = 1/`=`numvars'-1' {
    local v1: word `i' of `vars'
    forval j = `=`i'+1'/`numvars' {
        local v2: word `j' of `vars'
        capture isid `v1' `v2'
        if _rc == 0 noi dis "`v1' and `v2'"
    }
}

which produces:

Code:

The following pairs of variables jointly identify the data:
make and price
make and mpg
make and headroom
make and trunk
make and weight
make and length
make and turn
make and displacement
make and gear_ratio
make and foreign
price and mpg
price and headroom
price and trunk
price and weight
price and length
price and turn
price and displacement
price and gear_ratio
price and foreign
mpg and weight

Edit: #9 is right. Surely you know something about the structure of the dataset which helps you limit the number of variables you are cycling over? If so, you should replace the unab command in the code above with something like

Code:

unab vars: <put subset of vars here>

Last edited by Hemanshu Kumar; 27 Sep 2022, 06:36.

Comment

tig som

Join Date: Sep 2022

Posts: 58
#11

27 Sep 2022, 08:07

Thank you very much again, Hemanshu Kumar!!
This command is very informative. I still have a question.
The command works perfectly when only two variables identify the dataset.
However, in my dataset, more than two variables identify the dataset. To be precise, I don’t know the exact number of variables that together identify the dataset.

I tried to extend your command to check if three variables identify the dataset by adding "forvalue K …"as presented in the command below.

unab vars: *
local numvars: list sizeof vars

noi dis "The following pairs of variables jointly identify the data:"

forval i = 1/`=`numvars'-1' {
local v1: word `i' of `vars'
forval j = `=`i'+1'/`numvars' {
local v2: word `j' of `vars'
forval k = `=`j'+1'/`numvars' {
local v3: word `k' of `vars'

capture isid `v1' `v2' `v3'
if _rc == 0 noi dis "`v1' and `v2' and `v3'"
}
}
}

And still, three variables together do not identify the dataset. It seems more than three variables together identify the dataset.
As I said, the issue is I do not know the exact number of variables that identify the dataset. Adding v3, v4, v5,.. and running each command until I reach the exact number of variables that identify the dataset is a long process. Is there a shorter way for this, please?
Thank you again, and looking forward to your suggestion.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3872
#12

27 Sep 2022, 08:10

One last try from my side: You are approaching this the wrong way. Where does the data come from? Who created the data? What is the information in that data? How do you even know that there is exactly one combination of variables that identifies the observations?
1 like
Comment
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1448
#13

27 Sep 2022, 08:31

I agree with #12; I am becoming more and more uncomfortable with this brute-force approach. There should be some information about the dataset that can help you significantly narrow down the choices, if not eliminate them altogether....
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35755
#14

27 Sep 2022, 08:42

The question of why isid says nothing when it does is a good one. I explain Stata’s behaviour in such cases as following a principle that no news is good news.
1 like
Comment
Daniel Schaefer

Join Date: Mar 2020

Posts: 818
#15

27 Sep 2022, 09:48

The previous posts so far are absolutely correct. The best way to do this is to learn more about the dataset, either by reaching out to the creator of the data, finding the documentation, or by making inferences about the data by examining its structure. For instance, is there a column formatted like a date or a year? Do you know how many waves of data you should have, and if so, is there exactly one variable with that many levels?

It's an interesting theoretical problem though, isn't it? One could simplify the brute force approach in #10 by eliminating any variables containing non-integer numbers from the search, since these are almost certainly not non-integer unique identifiers. Same with negative values.
Comment

Announcement

How do I know which variables uniquely identify each observation in the dataset from a total of 134 variables in my dataset?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment