
  • Is there a way to know if an ID string variable is the same between two different datasets?

    Hi everyone,

    I have a quick question. I'd like to know whether there is a more efficient way than mine to check if the IDs of two datasets are different (possibly my way is wrong...):

    I have two datasets with the same identifier variable name: "id".

    So here are my steps:

    (1) I open the first dataset and type the command -isid id-.

    (2) I don't get any error messages, so everything's fine: the id variable uniquely identifies each observation.

    (3) I open dataset 2 and carry out exactly the same procedure as in (1) and (2). No error message either.

    (4) To see if the two IDs are identical, I run the following command. Let's imagine that my master data is "dataset_1" and my using dataset is "dataset_2".

    Code:
    merge 1:1 id using dataset_2
    I get this for example:
    Code:
    . merge 1:1 id using dataset_2
    
        Result                      Number of obs
        -----------------------------------------
        Not matched                     5,808,563
            from master                 2,579,289  (_merge==1)
            from using                  3,229,274  (_merge==2)
    
        Matched                                 0  (_merge==3)
        -----------------------------------------
    
    .
    Can I deduce from this that the two datasets have completely different IDs? Is there a better way of checking this?
    I ask because I need to compare one file with a set of other files. The other files are monthly and run from January 2021 to July 2023... so the procedure would be very energy- and time-consuming.

    Thanks a lot.
    Best,
    Michael
    Last edited by Michael Duarte Goncalves; 27 Oct 2023, 09:24.

  • #2
    I believe you can skip steps 1, 2, and 3, because merge 1:1 should raise an error if the id variable in either dataset doesn't uniquely identify the data. You can use a loop to attempt the merges one at a time, use -capture- to suppress any error, then check the return code (_rc) that -capture- sets to detect the error anyway and log it to the console before moving on to the next dataset. You should also be able to programmatically detect situations where none of the ids match and log that to the console as well. It might still take a while to run, but at least you won't have to do it manually.
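
    In outline, I have something like this in mind (just a sketch; the file names here are placeholders, not your actual files):

    Code:
    * Try each merge, trap any error with -capture-, check _rc, and log the result.
    foreach f in monthly_file_1.dta monthly_file_2.dta {
        use "master_file.dta", clear
        capture merge 1:1 id using "`f'"
        if _rc {
            display as error "[`f'] merge failed with return code " _rc
            continue
        }
        quietly count if _merge == 3
        display as text "[`f'] " r(N) " matched id values"
    }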

    Do you need help with the syntax for that?

    Edit: if you do need syntax help, could you please let us know whether you want to test whether it is possible to merge one set of files with another set of files, or whether you have a single file and want to see if a set of other files have ids in common with that one file?
    Last edited by Daniel Schaefer; 27 Oct 2023, 10:31.



    • #3
      Hi Daniel Schaefer.

      Thank you very much for your explanations given in #2.

      To answer your question, I need some help with the syntax please.
      I'm leaning towards the second option, i.e. I have a single file which I'd like to compare with a set of other files, to see if they have ids in common with that one file.
      Many thanks in advance!
      All the best,

      Michael



      • #4
        Hi again,

        I'm returning to you because I have a new question about post #3.

        In addition to #3, would it also be possible to check whether ID values are duplicated within the files, and delete the duplicates if necessary?

        I really appreciate any help you can provide.
        Thank you in advance,

        Michael



        • #5
          Hi Michael,

          The issue in #4 is that it isn't clear why the id values are duplicated. If the entire row is duplicated in the dataset, it might be appropriate to keep one and not the rest, but then I wonder why completely redundant data was included in the data file in the first place. It is more likely that the id appears more than once because the subject was measured more than once. But that raises the question: Which of the measurements do you want to keep? If you want to keep them all you need a one-to-many or many-to-many merge.
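
          If it does turn out that the rows are complete duplicates, the -duplicates- commands can inspect and remove them. A minimal sketch (dup_id is just a variable name I'm making up here):

          Code:
          duplicates report               // how many observations are exact copies of another row
          duplicates drop                 // keep one copy of each fully duplicated row
          * If only the id repeats, with different data, flag it for review rather than dropping:
          duplicates report id
          duplicates tag id, gen(dup_id)
          list if dup_id > 0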

          Regarding syntax: I simulate your problem by creating 5 datasets.

          Code:
          clear
          set obs 1000
          set seed 57492
          
          gen id = _n
          gen x1 = runiform()
          gen x2 = runiform()
          gen x3 = runiform()
          save "main_file.dta", replace
          
          drop x1 x2 x3
          gen x4 = runiform()
          gen x5 = runiform()
          gen x6 = runiform()
          save "comparison_file_1.dta", replace
          
          drop x4 x5 x6
          gen x7 = runiform()
          gen x8 = runiform()
          gen x9 = runiform()
          save "comparison_file_2.dta", replace
          
          drop x7 x8 x9
          replace id = id + _N
          gen x10 = runiform()
          gen x11 = runiform()
          gen x12 = runiform()
          save "comparison_file_3.dta", replace
          
          drop id x10 x11 x12
          set obs 1000
          egen id = seq(), from(1) to(10)
          gen x13 = runiform()
          gen x14 = runiform()
          gen x15 = runiform()
          save "comparison_file_4.dta", replace
          
          drop id x13 x14 x15
          set obs 1000
          gen id = _n
          gen x16 = runiform()
          gen x17 = runiform()
          gen x18 = runiform()
          save "comparison_file_5.dta", replace
          The "main" file is our starting comparison file. Comparison files 1, 2, and 5 are valid files which can be merged with a 1:1 merge command. Comparison file 3 can be merged, but has no ids in common with the main file. Comparison file 4 has the first 10 ids in common with the main file, but those ids are repeated so that they don't uniquely identify the data.

          Here is how I merge them together. I assume you don't want to simply check whether the files can be merged; you actually want to end up with a single merged dataset. Note that I don't try to delete any duplicate observations in comparison file 4, I just accept that the 1:1 merge raises an error and move on. If you need something else, please respond to my comments at the top of this post first.

          Code:
          use "main_file.dta", clear
          local files = "comparison_file_1.dta comparison_file_2.dta comparison_file_3.dta comparison_file_4.dta comparison_file_5.dta"
          foreach file of local files{
              display _newline
              display as text "[MERGE] [`file'] Attempting merge into master."
              capture noisily merge 1:1 id using "`file'"
              if _rc {
                  display as text "[WARNING] [`file'] Failed to merge! See merge error."
                  continue
              }
              quietly count if _merge == 3
              if !(r(N) > 0){
                  display as text "[WARNING] [`file'] no id values in common with main file."
              }
              drop if _merge == 2 // drop if not matched from using.
              drop _merge
          }
          ds
          Notice I drop if _merge == 2 just to clean up the dataset after each merge. You may want different behavior, but this seems like a useful step to me. The output looks like this:

          Code:
          [MERGE] [comparison_file_1.dta] Attempting merge into master.
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                             0
              Matched                             1,000  (_merge==3)
              -----------------------------------------
          (0 observations deleted)
          
          
          [MERGE] [comparison_file_2.dta] Attempting merge into master.
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                             0
              Matched                             1,000  (_merge==3)
              -----------------------------------------
          (0 observations deleted)
          
          
          [MERGE] [comparison_file_3.dta] Attempting merge into master.
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                         2,000
                  from master                     1,000  (_merge==1)
                  from using                      1,000  (_merge==2)
          
              Matched                                 0  (_merge==3)
              -----------------------------------------
          [WARNING] [comparison_file_3.dta] no id values in common with main file.
          (1,000 observations deleted)
          
          
          [MERGE] [comparison_file_4.dta] Attempting merge into master.
          variable id does not uniquely identify observations in the using data
          [WARNING] [comparison_file_4.dta] Failed to merge! See merge error.
          
          
          [MERGE] [comparison_file_5.dta] Attempting merge into master.
          
              Result                      Number of obs
              -----------------------------------------
              Not matched                             0
              Matched                             1,000  (_merge==3)
              -----------------------------------------
          (0 observations deleted)
          Warnings are printed to the console when either the file has no ids in common (comparison file 3) or when the ids don't uniquely identify the data (comparison file 4). Finally, I print out the variables included in the final dataset with the -ds- command.

          Code:
          id   x1   x2   x3   x4   x5   x6   x7   x8   x9   x10  x11  x12  x16  x17  x18
          As you can see, we are missing variables x13, x14, and x15 from comparison file 4. That's because it was never successfully merged. x10-x12 from comparison file 3 do appear (the file was technically merged) but should be all missing. The perfectionist in me wants to figure out how to detect this case and drop the related variables, but I need to run to another meeting in a few minutes, so my advice is just to check the summary statistics and clean up afterward.
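
          For that cleanup, one possibility is a quick pass that drops any variable left missing on every observation. This is just a sketch of what I would check, not something the code above does:

          Code:
          * Drop any variable that is missing on every observation after the merges.
          foreach v of varlist _all {
              quietly count if !missing(`v')
              if r(N) == 0 {
                  display as text "dropping `v' (all values missing)"
                  drop `v'
              }
          }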

          Suppose you really just want to check whether (or how many of) the ids match, without actually merging all of the data. Then you can just reload the original file at the beginning of each iteration of the loop, as sketched below.
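
          Something along these lines, assuming the same -local files- list as above (again, just a sketch):

          Code:
          foreach file of local files {
              use "main_file.dta", clear          // start from the main file on every pass
              capture merge 1:1 id using "`file'"
              if _rc {
                  display as text "[WARNING] [`file'] Failed to merge! See merge error."
                  continue
              }
              quietly count if _merge == 3
              display as text "[INFO] [`file'] " r(N) " id values in common with the main file."
          }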

          Hopefully, that's enough to get you started.



          • #6
            Hi Daniel Schaefer,

            #5: An incredible post that resolves all my problems! Thank you so much for this fantastic work.

            I'm going to try it with my several datafiles and see how it goes.
            Thanks again.

            All the best,

            Michael



            • #7
              Hi again,

              I tried your code Daniel Schaefer and it works like a charm! Many thanks.

              Coming back to #2, would it be possible to have a simple example concerning option 1, please:

              "[...] to test whether it is possible to merge a set of files with a set of files [...]".
              If this takes up too much of your time, I completely understand.
              Thank you all the same!

              Michael



              • #8
                In general, you can compare each item in one set to every item in a second set like this:

                Code:
                local set1 = "item1 item2 item3 item4 item5"
                local set2 = "item6 item7 item8 item9 item10"
                foreach s1_item in `set1'{
                    foreach s2_item in `set2'{
                        display "(`s1_item', `s2_item')"
                    }
                }
                In general, you can compare every item in a set to every other item in that same set (except itself and assuming order does not matter) like this:

                Code:
                local set = "item1 item2 item3 item4 item5 item6 item7 item8 item9 item10"
                forv word1 = 1/`=wordcount("`set'")'{
                    forv word2 = `word1'/`=wordcount("`set'")'{
                        if `word1' == `word2'{
                            continue
                        }
                        display "(" word("`set'", `word1') ", " word("`set'", `word2') ")"
                    }
                }
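                To turn the first pattern into the set-to-set merge check from #2, you could drop the reload-and-merge body from #5 into the inner loop, along these lines (a sketch; the file names are placeholders):

                Code:
                local set1 = "masterA.dta masterB.dta"
                local set2 = "using1.dta using2.dta"
                foreach s1_item in `set1'{
                    foreach s2_item in `set2'{
                        use "`s1_item'", clear
                        capture merge 1:1 id using "`s2_item'"
                        if _rc {
                            display as text "[WARNING] (`s1_item', `s2_item') could not be merged 1:1."
                            continue
                        }
                        quietly count if _merge == 3
                        display as text "(`s1_item', `s2_item'): " r(N) " matched id values"
                    }
                }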



                • #9
                  Hi Daniel,

                  Beautiful. Thank you so much.

                  Have a great day.
                  Michael

