I am interested in understanding the "correspondence" between two variables in a dataset that I have to clean. Neither variable has missing values, but neither is a unique identifier.
For example (see the MWE below), if I could show that each value of var_a is always associated with the same value of var_b, then the two variables are redundant and I could simply drop one of them. I don't expect this to be the case for all pairs, but for a large number of them.
Conversely, I might have a surjection in one direction or the other (or some messier situation). I have tried the code below.
It gives me the result I want, assuming that every pairwise combination is either bijective or surjective (so that the last observation, 7-e, never occurs).
Code:
clear all
input var_a str16 var_b
1 a //var_a and var_b are bijective
1 a
2 b
3 c
1 a
2 b
3 c
4 d //var_a is surjective onto var_b
5 d
5 d
6 e //var_b is surjective onto var_a
6 f
6 f
// 7 e
end

bysort var_a: gen count_a = _N
bysort var_b: gen count_b = _N
gen cat = ""
replace cat = "bij" if count_a == count_b
replace cat = "a_onto_b" if count_a < count_b
replace cat = "b_onto_a" if count_a > count_b
1. Is there a simpler method to achieve this result?
2. How could I report more complicated cases? For example, including the last pair (7-e) means that the mapping from the domain (6-7) to (f-e) is no longer an onto mapping. That is typically the type of case I would like to report in my data if it exists (which my code does not do so far).
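For the second question, one possible direction is sketched here. Rather than comparing group sizes, it checks within each value of var_a whether var_b varies, and the reverse. The variable names a_multi, b_multi, and rel are hypothetical, and the short input block only re-creates the example pairs with 7-e included. Treat it as a sketch under those assumptions, not a definitive solution.

```stata
clear
input var_a str16 var_b
1 a
1 a
2 b
4 d
5 d
6 e
6 f
7 e
end

* a_multi = 1 if this value of var_a occurs with more than one value of var_b.
* Sorting var_b within var_a makes the first and last values differ
* exactly when var_b is not constant in the group.
bysort var_a (var_b): gen byte a_multi = var_b[1] != var_b[_N]

* b_multi = 1 if this value of var_b occurs with more than one value of var_a.
bysort var_b (var_a): gen byte b_multi = var_a[1] != var_a[_N]

* Label each observation by the behaviour of its own two values.
gen str12 rel = "one-to-one"
replace rel = "many-to-one"  if !a_multi & b_multi
replace rel = "one-to-many"  if a_multi & !b_multi
replace rel = "many-to-many" if a_multi & b_multi

tabulate rel
list var_a var_b rel, sepby(var_a)
```

With these pairs, the 6-e observation is labelled "many-to-many", because 6 occurs with both e and f while e occurs with both 6 and 7. The labels describe each observation's own values, so observations in the same tangled cluster (here 6-e, 6-f, and 7-e) can receive different labels.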
Thank you for your help
