Hi all,
I am in the process of trying to merge two large administrative datasets from two different jail facilities using matchit. The goal of this project is to identify individuals who have been booked into both jails during the period 2016-2022. There is no linking variable available for these two datasets, so I have to use a combination of inmate name, date of birth, sex, and race to match them. My problem is similar to that posed by Anne-Claire Jo in this thread (https://www.statalist.org/forums/for...and-using-data); however, I cannot remove duplicate observations in many circumstances.
Each dataset has its own unique inmate identifier (jacketid and jailid, respectively). In one file, the names are uniform within jailid. In the other, names and dates of birth may vary within jacketid (as a function of false information given upon arrest/booking). Once a fingerprint or retinal scan flags a "bad" name or date of birth as belonging to an existing jacketid, the whole record is subsumed under the existing id with a new observation including the "bad" information. According to the data managers, each combination of information must be treated as valid within a jacketid, and therefore I cannot get rid of duplicate person observations unless all have identical inmate information.
I have created matching variables that combine the name, dob, sex, and race values at two levels of matching strictness (loose versus strict). I have also created within-jacket info ids to capture how many distinct combinations of inmate information exist within each jacketid. Across the ~77,000 unique jacketids, a single jacketid can have as many as 13 (loose) or 20 (strict) distinct info values. Luckily, multiple info values occur in less than 3% (loose) and 10% (strict) of my cases.
I believe the best way to proceed is to remove duplicates within jacket-info ids (see nobs_part and nobs_full below) and then try to use matchit on the remaining, unique info variables. Due to the size of the dataset, I'm guessing I will need to break them down by alphabet.
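The dedup-then-match plan above can be sketched roughly as follows. This is only a sketch under assumptions: it assumes the combined example data from this post is in memory and that matchit is already installed, and the file names (jail1_unique, jail2_unique) and the numeric row ids (info1, info2) are hypothetical names introduced for illustration. It has not been run on the full data.

```stata
* Sketch only: assumes the combined example data is in memory.
* jail1_unique / jail2_unique and info1 / info2 are hypothetical names.

* Jail 1: keep one row per distinct strict-info string within jacketid
preserve
keep if dataset == 1
bysort jacketid fullinfo: keep if _n == 1
gen long info1 = _n                  // numeric row id for matchit
keep info1 jacketid fullinfo
save jail1_unique, replace
restore

* Jail 2: same idea, one row per distinct strict-info string within jailid
preserve
keep if dataset == 2
bysort jailid fullinfo: keep if _n == 1
gen long info2 = _n
keep info2 jailid fullinfo
rename fullinfo fullinfo2            // avoid a name clash with the master file
save jail2_unique, replace
restore

* Two-dataset form of matchit: master in memory, using file on disk
use jail1_unique, clear
matchit info1 fullinfo using jail2_unique.dta, idusing(info2) txtusing(fullinfo2)
```

The same steps would apply to the loose variable by swapping partinfo for fullinfo. After matching, jacketid and jailid could be recovered by merging the results back to the two saved files on info1 and info2.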
This leads me to my problem:
The matchit command hangs up on me every time I try to run it using two datasets. It gets through the indexing percentage count and then stalls out before even giving a percentage completed. In the past, I have been able to combine my two "matching" datasets and then run matchit on the single dataset using the program provided by Andrew (#4) in https://www.statalist.org/forums/for...and-one-column , and results have generated almost immediately. Unfortunately, I don't think that program will work here, because it requires me to make predetermined matches to run.
Is there a way to use a combined dataset (see my code below) and then run matchit, or will this problem require me to use two datasets? I'm not sure if @Julio Raffo or others might have some suggestions? (btw, is that the right way to tag someone?)
Thanks so much for your help!
~ Miranda
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte dataset long(jacketid jailid) float(pobs nobs_part nobs_full) int(bookdate reldate) str20 name int dob str1(sex race) str4 fname str7 mname str5 lname str2 sname str18 fullname_s str23 partinfo str32 fullinfo
1 123456 . 1 1 1 20457 20486 "SHMO, JOE R, JR" 6447 "M" "W" "JOE" "R" "SHMO" "JR" "JOE R SHMO JR" "J SHMO JR;26aug1977;M;W" "JOE R SHMO JR;26aug1977;M;W"
1 123456 . 2 1 1 20530 20570 "DUDE, BIG LIAR, JR" 6447 "M" "W" "BIG" "LIAR" "DUDE" "JR" "BIG LIAR DUDE JR" "B DUDE JR;26aug1977;M;W" "BIG LIAR DUDE JR;26aug1977;M;W"
1 123456 . 3 2 1 20651 20677 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
1 123456 . 4 1 1 20902 20904 "SHMO, JOE RICKIE" 6447 "M" "W" "JOE" "RICKIE" "SHMO" "" "JOE RICKIE SHMO" "J SHMO;26aug1977;M;W" "JOE RICKIE SHMO;26aug1977;M;W"
1 123456 . 5 1 1 20951 20977 "SHMO, JOE MICKIE, JR" 5373 "M" "W" "JOE" "MICKIE" "SHMO" "JR" "JOE MICKIE SHMO JR" "J SHMO JR;17sep1974;M;W" "JOE MICKIE SHMO JR;17sep1974;M;W"
1 234567 . 1 2 2 20479 20486 "DOE, JANE MARIE" 10952 "F" "A" "JANE" "MARIE" "DOE" "" "JANE MARIE DOE" "J DOE;26dec1989;F;A" "JANE MARIE DOE;26dec1989;F;A"
1 234567 . 2 1 1 20501 20545 "DOE, JANE MARIE" 10952 "F" "A" "JANE" "MARIE" "DOE" "" "JANE MARIE DOE" "J DOE;26dec1989;F;A" "JANE MARIE DOE;26dec1989;F;A"
1 234567 . 3 3 1 20626 20627 "DOE, JANE MARY" 10952 "F" "A" "JANE" "MARY" "DOE" "" "JANE MARY DOE" "J DOE;26dec1989;F;A" "JANE MARY DOE;26dec1989;F;A"
1 234567 . 4 1 1 20630 20644 "DO, JUNE MAY" 10587 "F" "A" "JUNE" "MAY" "DO" "" "JUNE MAY DO" "J DOE;26dec1988;F;A" "JUNE MAY DO;26dec1988;F;A"
1 234567 . 5 1 1 20646 20682 "NAME, NOT MY REAL" 10683 "F" "A" "NOT" "MY REAL" "NAME" "" "NOT MY REAL NAME" "N NAME;01apr1989;F;A" "NOT MY REAL NAME;01apr1989;F;A"
1 345678 . 1 3 1 22066 22068 "HARRY, TOM DICK" 4109 "M" "W" "TOM" "DICK" "HARRY" "" "TOM DICK HARRY" "T HARRY;02apr1971;M;W" "TOM DICK HARRY;02apr1971;M;W"
1 345678 . 2 2 1 22638 22638 "HARRY, TOM D" 4109 "M" "W" "TOM" "D" "HARRY" "" "TOM D HARRY" "T HARRY;02apr1971;M;W" "TOM D HARRY;02apr1971;M;W"
1 345678 . 3 1 1 22685 22686 "HARRY, T" 4109 "M" "W" "T" "" "HARRY" "" "T HARRY" "T HARRY;02apr1971;M;W" "T HARRY;02apr1971;M;W"
1 456789 . 1 1 1 22490 22495 "SMITH, JANE" 8528 "F" "B" "JANE" "" "SMITH" "" "JANE SMITH" "J SMITH;08may1983;F;B" "JANE SMITH;08may1983;F;B"
1 567890 . 1 1 1 22596 22598 "BUCK, GUY, SR" -251 "M" "B" "GUY" "" "BUCK" "SR" "GUY BUCK SR" "G BUCK SR;25apr1959;M;B" "GUY BUCK SR;25apr1959;M;B"
2 . 420864 1 1 1 22066 22066 "HARRY, T D" 4109 "M" "W" "T" "D" "HARRY" "" "T D HARRY" "T HARRY;02apr1971;M;W" "T D HARRY;02apr1971;M;W"
2 . 654321 1 3 3 20564 20588 "DOE, JANE" 10952 "F" "A" "JANE" "" "DOE" "" "JANE DOE" "J DOE;26dec1989;F;A" "JANE DOE;26dec1989;F;A"
2 . 654321 2 1 2 20612 20619 "DOE, JANE" 10952 "F" "A" "JANE" "" "DOE" "" "JANE DOE" "J DOE;26dec1989;F;A" "JANE DOE;26dec1989;F;A"
2 . 654321 3 2 1 20682 20689 "DOE, JANE" 10952 "F" "A" "JANE" "" "DOE" "" "JANE DOE" "J DOE;26dec1989;F;A" "JANE DOE;26dec1989;F;A"
2 . 654321 4 1 1 20696 20703 "DOE, JANE" 10587 "F" "A" "JANE" "" "DOE" "" "JANE DOE" "J DOE;26dec1988;F;A" "JANE DOE;26dec1988;F;A"
2 . 753175 1 6 6 20723 20749 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
2 . 753175 2 5 3 20911 20926 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
2 . 753175 3 3 2 20931 20933 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
2 . 753175 4 4 1 20944 20946 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
2 . 753175 5 1 5 21000 21053 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
2 . 753175 6 2 4 21189 21191 "SHMO, JOE R" 6447 "M" "W" "JOE" "R" "SHMO" "" "JOE R SHMO" "J SHMO;26aug1977;M;W" "JOE R SHMO;26aug1977;M;W"
2 . 864202 1 1 1 22713 22713 "SMITH, J D" 8528 "F" "B" "J" "D" "SMITH" "" "J D SMITH" "J SMITH;08may1983;F;B" "J D SMITH;08may1983;F;B"
2 . 987654 1 1 1 22150 22157 "DUDE, SOME" 5111 "M" "W" "SOME" "" "DUDE" "" "SOME DUDE" "S DUDE;29dec1973;M;W" "SOME DUDE;29dec1973;M;W"
end
format %td bookdate
format %td reldate
format %td dob
label var dataset "Dataset"
label var jacketid "Jail 1 Inmate ID"
label var jailid "Jail 2 Inmate ID"
label var pobs "Within-Inmate Observation"
label var nobs_part "Within-Name Observation (Loose)"
label var nobs_full "Within-Name Observation (Strict)"
label var bookdate "Date Booked into Jail"
label var reldate "Date Released from Jail"
label var name "Original Name Format"
label var dob "Inmate Date of Birth"
label var sex "Inmate Sex"
label var race "Inmate Race"
label var fname "First Name"
label var mname "Middle Name"
label var lname "Last Name"
label var sname "Name Suffix"
label var fullname_s "First + Middle + Last + Suffix"
label var partinfo "Loose Matching Var"
label var fullinfo "Strict Matching Var"
