Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    This problem reminds me of the way an attacker might de-anonymize data (or at least, find the original data) from summary statistics described in this YouTube video which was made in partnership with the US census. In a given data set, there might be different sets of categorical variables that uniquely identify the data. Can we recover a single, most plausible set of variables that uniquely identify the data without having some other meaning?

    Comment


    • #17
      Originally posted by daniel klein View Post
      One last try from my side: You are approaching this the wrong way. Where does the data come from? Who created the data? What is the information in that data? How do you even know that there is exactly one combination of variables that identifies the observations?
      Specifically, the OP could browse the data in Excel or in the Stata data browser. Surely there has to be something that looks like an ID? A long string of numbers that don't look like anything else (e.g. they're not likely values for income, or GDP, or whatever) could be an ID. A long string of alphanumeric characters is probably an ID (but you'll need to generate a numeric ID to use xtset). Are there no clues in the variable names - e.g. type id into the variable search bar? Surely not every variable is x1, x2, x3... right?

      If it is, you need to have words with the entity that originated the data - if its a statistical agency there are probably input statements that would have renamed and labeled variables, although some entities will only give you SAS input statements (many US government entities do this :-/)
      Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

      When presenting code or results, please use the code delimiters format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

      Comment


      • #18
        Drawing on the code provided in #10, I extend the brute force search solution to the power set of all of the variables. This will test every single combination of variables to see if they uniquely identify the data.

        This is for demonstration purposes only, and I absolutely agree with the consensus in this thread: you should use other means to clearly understand your data. Additionally, notice that the num_iterations is equal to (2^N) - 1, where N is the word count. This means that the complexity of this algorithm grows exponentially with respect to the number of variables you would like to test. In other words, this is a non polynomial (NP) algorithm, and is almost as bad as it can get in terms of processing time. Depending on the number of variables you have in your dataset, this could mean the algorithm will take a very long time to halt. Use at your own risk.

        Code:
        unab vars: *
        //local vars = "make price mpg rep78"
        
        display "the following set of variables jointly identify the data:"
        
        scalar num_iterations = 2^wordcount("`vars'") - 1
        local num_bits = wordcount("`vars'")
        forv itter = 1/`=num_iterations'{
            quietly inbase 2 `itter'
            local var_subset = ""
            local binary_inclusion = string(real(r(base)), "%0`num_bits'.0f")
            forv bit_index = 1/`num_bits'{
                local bit = substr("`binary_inclusion'", `bit_index', 1)
                if `bit' {
                    local var = word("`vars'", `bit_index')
                    local var_subset = "`var_subset' `var'"
                }
            }
            capture isid `var_subset'
            if _rc == 0 display "`var_subset'"
        }
        Thanks Hemanshu Kumar for providing the code that this monstrosity is based on.

        Edit: if anyone is curious to know how this works, I basically just count up in binary, and look at the bits. If the bit corresponding to a variable is 1, the variable is included in the subset, otherwise it is not.
        Last edited by Daniel Schaefer; 27 Sep 2022, 16:45.

        Comment


        • #19
          (post deleted due to error in code)
          Last edited by Hemanshu Kumar; 27 Sep 2022, 19:32.

          Comment


          • #20
            The code in #18 seems highly similar to the algorithm that Nick Cox implemented in tuples (SSC) almost 20 years ago. We have speeded things up since then, and everyone interested can download the command.

            I would like to add that whether one should use another approach is not a question; it is merely impossible to brute-force our way through all possible combinations of 134 variables. One thing we have seen during the COVID-19 pandemic is that most people are having a really hard time understanding what exponential growth really means. To put things into perspective, let us assume that Stata processes one iteration or possible combination in 1/100,000 seconds. Then, finding all 392,084 3-tuples would take less than 4 seconds. Finding all 12,840,741 4-tuples would take a bit longer than 2 minutes and finding all 5-tuples would still not even take an hour. That would be annoying but it is surely feasible. Moving on. Trying to find all 6-tuples would almost take a day and finding all 7-tuples would keep us busy for two weeks. Finally, finding all 9-tuples will take ten years and we will all be dead before finding all 10-tuples; there is a good chance of the sun exploding before we are able to get all the 20-tuples.

            Oh, in case it is not clear: the timings above add up. Finding all 3- and 4-tuples will take 4 seconds + about 2 minutes. Our tuples command has a min() and max() option. If we were confident that the combination we are looking for includes at least 2 and at most 4 variables, then there is a chance that a brute-force approach is feasible.
            Last edited by daniel klein; 28 Sep 2022, 03:14. Reason: Mixed up the number of 3-tuples and 4-tuples; you get those as comb(134, 3) and comb(134, 4)

            Comment


            • #21
              Here is some code that heavily uses the -tuples- command referenced in #20 (I saw Daniel's post just as I was about to post my solution!)

              The program search_id does the work.

              It starts with the smallest sized tuples, and proceeds to larger ones. It also allows you to specify the largest size of tuple you want to consider. It also allows you to specify that the search should stop at the first success.

              Beyond that, it improves upon the earlier method by excluding combinations that are supersets of identifying combinations (since e.g. if mpg and make jointly identify the data, then any larger combination involving both of those variables also will, and so there is no point looking through those).


              All this could potentially offer significant speed gains over the method in #18.

              Code:
              capture program drop search_id
              program define search_id, rclass
                  
                  syntax [varlist] , [MAXsize(numlist int >0 min=1 max=1)] [First]
                  * -max- specifies the largest size of variable combinations to look for
                  * -first- specifies that the search should stop at the first combination found
              
                  if "`varlist'" == "" local varlist *
              
                  unab vars: `varlist'
                  local numvars: list sizeof vars
              
                  if "`maxsize'" != "" {
                      if `maxsize' > `numvars' {
                          dis as error "The maximum size of the combination cannot exceed the number of variables specified."
                          exit
                      }
                  }
                  
                  local keepvars = ""
                  foreach var of local vars {
                      capture confirm string var `var'
                      if _rc == 0 local keepvars `keepvars' `var'
                          else {
                              capture assert int(`var') == `var'
                              if _rc == 0 local keepvars `keepvars' `var'
                                  else noi dis "Excluding variable `var' since it is neither string nor integer."
                          }
                  }
              
                  if "`maxsize'" == "" local maxlen: list sizeof keepvars
                      else local maxlen = min(`maxsize',`:list sizeof keepvars')
                      
                  local cond = ""
                  local all_combos = ""
                  
                  forval size = 1/`maxlen' {
                      tuples `keepvars', asis min(`size') max(`size') cond(`cond')
                      
                      local combos_of_size_`size' = ""
                      local found 0
                      forval t = 1/`ntuples' {
                          capture isid `tuple`t''
                          if _rc == 0 {
                              noi dis "`tuple`t''"
                              if "`first'" != "" {
                                  return local id_combos "`tuple`t''"
                                  exit
                              }
                              local combos_of_size_`size' `"`combos_of_size_`size'' "`tuple`t''" "'
                              local found 1
                          }        
                      }
                      
                      if `found' {
                          local num_combos: list sizeof id_combos
                          local add_to_cond = ""
                          foreach combo of local combos_of_size_`size' {
                              local elnum = 0
                              foreach el of local combo {
                                  local ++elnum
                                  local pos: list posof "`el'" in keepvars
                                  if `elnum' == 1 local add_bit `pos'
                                      else local add_bit `add_bit'&`pos'
                              }
                              local add_to_cond `add_to_cond' !(`add_bit')
                          }
                          local cond `cond' `add_to_cond'
                          local all_combos `"`all_combos' `combos_of_size_`size''"'
                      }
                  }
                  
                  return local id_combos `"`all_combos'"'
              end
              
              sysuse auto, clear
              search_id
              
              local combos = r(id_combos)
              if `"`combos'"' != "" {
                  noi dis "The combinations that identify the dataset are:"
                  foreach c of local combos {
                      noi dis `"`c'"'
                  }
              }
                  else noi dis "No combinations were found that identify the dataset."
              Last edited by Hemanshu Kumar; 28 Sep 2022, 04:00.

              Comment


              • #22
                Acknowledging the fact that no code improvement will ever get us even close to the size of the underlying problem of the thread, I have two minor suggestions concerning the code in #21:

                1. Clear the local macros tuple_# before every call to tuples. There is not much to gain here but having a couple of hundred thousand local macros in memory will slow down tuples.

                2. Remove the overhead from isid. Replace isid with the two lines

                Code:
                sort `tuple_`t''
                capture by `tuple_`t' : assert _N == 1 , fast


                Edit

                Deleted a third point claiming that conditionals() would slow tuples down. While conditionals() can slow tuples down (especially with algorithms other than the default) it usually does not.
                Last edited by daniel klein; 28 Sep 2022, 05:39.

                Comment


                • #23
                  daniel klein Thanks for the suggestions! And yes, I did not go down the road of updating keepvars for the reason you mention. I also had alternative, much more efficient, code that I had momentarily posted in #19, which I took down precisely because that too suffered from the same issue.

                  Here is an improved version of the code, based in part on the suggestions in #22.

                  Code:
                  capture program drop search_id
                  program define search_id, rclass
                      
                      syntax [varlist] , [MAXsize(numlist int >0 min=1 max=1)] [First]
                      * -max- specifies the largest size of variable combinations to look for
                      * -first- specifies that the search should stop at the first combination found
                  
                      if "`varlist'" == "" local varlist *
                  
                      unab vars: `varlist'
                      local numvars: list sizeof vars
                  
                      if "`maxsize'" != "" {
                          if `maxsize' > `numvars' {
                              dis as error "The maximum size of the combination cannot exceed the number of variables specified."
                              exit
                          }
                      }
                      
                      local keepvars = ""
                      foreach var of local vars {
                          capture confirm string var `var'
                          if _rc == 0 local keepvars `keepvars' `var'
                              else {
                                  capture assert int(`var') == `var'
                                  if _rc == 0 local keepvars `keepvars' `var'
                                      else dis "Excluding variable `var' since it is neither string nor integer."
                              }
                      }
                  
                      if "`maxsize'" == "" local maxlen: list sizeof keepvars
                          else local maxlen = min(`maxsize',`:list sizeof keepvars')
                          
                      local cond = ""
                      local all_combos = ""
                      local combo_count 0
                      
                      forval size = 1/`maxlen' {
                          macro drop _tuple*
                          tuples `keepvars', asis min(`size') max(`size') cond(`cond')
                          
                          local combos_of_size_`size' = ""
                          local found 0
                          
                          forval t = 1/`ntuples' {
                              sort `tuple`t''
                              capture by `tuple`t'' : assert _N == 1 , fast
                              if _rc == 0 {
                                  dis "Found identifying combo: `tuple`t''"
                                  local ++combo_count
                                  if "`first'" != "" {
                                      return local id_combos "`tuple`t''"
                                      exit
                                  }
                                  local combos_of_size_`size' `"`combos_of_size_`size'' "`tuple`t''" "'
                                  local found 1
                              }        
                          }
                          
                          if `found' {
                              local num_combos: list sizeof id_combos
                              local add_to_cond = ""
                              foreach combo of local combos_of_size_`size' {
                                  local elnum = 0
                                  foreach el of local combo {
                                      local ++elnum
                                      local pos: list posof "`el'" in keepvars
                                      if `elnum' == 1 local add_bit `pos'
                                          else local add_bit `add_bit'&`pos'
                                  }
                                  local add_to_cond `add_to_cond' !(`add_bit')
                              }
                              local cond `cond' `add_to_cond'
                              local all_combos `" `all_combos' `combos_of_size_`size'' "'
                          }
                      }
                      
                      if `combo_count' == 0 dis "No identifying combinations found in the specified variable list."
                      local all_combos: list clean all_combos    
                      return local id_combos `all_combos'
                      return scalar count = `combo_count'
                  end
                  One can then try:
                  Code:
                  . sysuse auto, clear
                  (1978 automobile data)
                  
                  . search_id
                  Excluding variable headroom since it is neither string nor integer.
                  Excluding variable gear_ratio since it is neither string nor integer.
                  Found identifying combo: price
                  Found identifying combo: make
                  Found identifying combo: mpg weight
                  
                  . search_id, max(1)
                  Excluding variable headroom since it is neither string nor integer.
                  Excluding variable gear_ratio since it is neither string nor integer.
                  Found identifying combo: price
                  Found identifying combo: make
                  
                  . sysuse nlsw88, clear
                  (NLSW, 1988 extract)
                  
                  . search_id, first
                  Excluding variable wage since it is neither string nor integer.
                  Excluding variable ttl_exp since it is neither string nor integer.
                  Excluding variable tenure since it is neither string nor integer.
                  Found identifying combo: idcode
                  
                  . search_id age race married
                  No identifying combinations found in the specified variable list.
                  Last edited by Hemanshu Kumar; 28 Sep 2022, 05:35.

                  Comment


                  • #24
                    daniel klein
                    3. Using the conditional() option is a good idea, conceptually. However, the option actually slows tuples down; the number of items (variables) in the list stays constant. An alternative approach might be to dynamically update keepvars, excluding variables that (together) identify the observations. However, that approach would not produce all possible combinations of variables that identify the observations and results would depend on the order in which variables are entered.
                    I realised that the conditional option does not give us speed gains on tuples, but at least thanks to that we are able to narrow down the number of tuples being tested as identifiers. Hopefully, for most cases this still results in speed gains relative to the brute force method in #18.

                    Comment


                    • #25
                      Thank you very much all. I do agree much information should have been provided in the data description. We will seek for detail information when we do further analysis using this data.
                      Thank you very much for all the guides on how to do the brutal force search using STATA codes. I very much appreciate your tips and am very grateful.

                      Just a quick question, what if non-string and non-integer variables are part of the identifier variables? For example, date variable with double storage type and %tc format.
                      I tried to replace


                      else dis "Excluding variable `var' since it is neither string nor integer."

                      By the following

                      else {
                      capture assert double(`var') == `var'
                      if _rc == 0 local keepvars `keepvars' `var'

                      However, this solution couldn’t work.

                      I hesitate to change the double variables to either integer or string because they are several in number, and I also want to avoid missing value issues during conversion. Especially one of the double variables is a variable that records time based on the day, month, year, hours, minutes, and second together. Splitting will give me more variables in the dataset, but I want to avoid adding more variables to the dataset.
                      In short, my question is, is there any way of using the brute force approach without eliminating any variables containing non-integer numbers from the search(particularly double variables)?
                      Sorry if my question is inappropriate. Thank you for the help so far, and I looking forward to your suggestion.
                      Last edited by tig som; 28 Sep 2022, 09:46.

                      Comment


                      • #26
                        tig som the check just makes sure that the variable is either string or an integer -- as the latter term is more commonly understood, i.e. it involves no decimal portion. So e.g. daily dates (stored in Stata's numerical format) should be fine.

                        Comment


                        • #27
                          Unless the OP needs to match the data against another dataset, why not create an ID variale:

                          Code:
                          gen id=_n
                          Searching for a unique combination of existing variables is likely to produce a random collection of variables that identify a record by sheer happenstance (such as income and age), rather than finding (for example) a meaningful "household_id", "person_in_household" combination.

                          Comment


                          • #28
                            The problem is that OP wants to preform a panel analysis, meaning that the data is hierarchically structured. So OP needs to identify which variable or set of variables identify the waves of the analysis so that OP can use xtset. Unfortunately, many pairs of variables jointly identify the data, and it is unclear how to identify the wave variable programatically.

                            This is why I suggested in #15 that if OP knows the number of waves, OP could look for a variable with exactly that many levels. OP should also eliminate any pair of variables that jointly identify the data, but which contain real or negative numbers, as these almost certainly are not unique identifiers for waves or individuals. I may have buried the lede somewhat on that one.

                            ------

                            Edit: I guess I've been assuming the data is in long format, but what if it is in wide? In that case, OP should look for repeated patterns in variables names...

                            Edit2: Looks like Hemanshu's code in #23 handles some or all of the possible eliminations anyway.
                            Last edited by Daniel Schaefer; 30 Sep 2022, 09:12.

                            Comment


                            • #29
                              Thank you for all the explanations!!

                              Comment

                              Working...
                              X