I have a giant data set (~80K rows by ~32000 columns) of data on arrest histories of individuals. What I need to do is determine whether an individual gets rearrested after they're released from the their first incarceration. To do this, I need to determine the individual's first arrest that led to an incarceration. Ideally, what would happen is that I would look at a date variable and look at a corresponding variable that has a "verdict code" in it. If the verdict code indicated that the person was guilty, then I would know the person would be incarcerated and could look for the first rearrest after the end of the incarceration.
However, the structure of the data is a bit irregular and makes this task difficult in Stata. I'm an R user by default, but this has to be done in Stata. The main problem is that it's hard to match a column with a date to the column with the corresponding verdict code ("GY" for "guilty").
If this were R, I would do something like this:
where arrestDates is a vector created by unlist-ing all the columns containing arrestDates. I can't figure out an analogous way to do this in Stata
Below is a small snapshot of the dataset in Stata showing the first five arrest data columns and the first five verdict code columns. For example, the fourth row shows an instance of being arrested on three different charges on 04apr2000. One of the corresponding verdict codes to that date is "GY", so I'd consider 04apr2000 to indicate an incarceration. However, because of multiple arrests and multiple charges per arrest, the arrest and verdict code data is spread out over hundreds of columns. What I need is a way to match, say arrestDate3 to verdictCode3 in a way similar to the R code above.
However, the structure of the data is a bit irregular and makes this task difficult in Stata. I'm an R user by default, but this has to be done in Stata. The main problem is that it's hard to match a column with a date to the column with the corresponding verdict code ("GY" for "guilty").
If this were R, I would do something like this:
Code:
firstIncarcerationArrestDate <- min(arrestDates[verdictCodes == "GY"], na.rm = T)
Below is a small snapshot of the dataset in Stata showing the first five arrest data columns and the first five verdict code columns. For example, the fourth row shows an instance of being arrested on three different charges on 04apr2000. One of the corresponding verdict codes to that date is "GY", so I'd consider 04apr2000 to indicate an incarceration. However, because of multiple arrests and multiple charges per arrest, the arrest and verdict code data is spread out over hundreds of columns. What I need is a way to match, say arrestDate3 to verdictCode3 in a way similar to the R code above.
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input float(unifcrimhist_arrestdat_1 unifcrimhist_arrestdat_2 unifcrimhist_arrestdat_3 unifcrimhist_arrestdat_4) str8(unifcrimhist_verdictcd_1 unifcrimhist_verdictcd_2 unifcrimhist_verdictcd_3 unifcrimhist_verdictcd_4) 18234 . . 20020 "TM" "" "" "NO" 15557 20382 14656 20267 "TM" "" "43" "" 17470 . 17470 19107 "TM" "" "GY" "GC" 14704 . 14880 14704 "TM" "" "NO" "GY" 17139 . 17139 15884 "TM" "" "SI" "DM" 16692 15932 . . "TM" "32" "" "" 17704 17704 . . "TM" "GC" "" "" 17000 14988 . . "TM" "GY" "" "" 15456 15456 15456 15456 "TM" "GY" "GY" "SI" 16390 18349 . 18349 "TM" "NO" "" "SV" 18005 16547 18005 16111 "TM" "NO" "GY" "GY" 17059 17059 16921 17059 "TM" "NO" "GY" "GY" 14913 17347 17123 15022 "TM" "NO" "NO" "GY" 17932 17159 . 17932 "TM" "NP" "" "NO" 17601 18068 17959 17959 "TM" "SI" "NO" "NO" 16516 16635 17847 17847 "TM" "TM" "GY" "GY" 14986 . . . "TR" "" "" "" 16373 16636 16636 . "TR" "GY" "GY" "" end format %td unifcrimhist_arrestdat_1 format %td unifcrimhist_arrestdat_2 format %td unifcrimhist_arrestdat_3 format %td unifcrimhist_arrestdat_4
Comment