  • Matching values in corresponding columns

    I have a giant data set (~80K rows by ~32000 columns) of data on the arrest histories of individuals. I need to determine whether an individual is rearrested after being released from their first incarceration. To do this, I first need to identify the individual's first arrest that led to an incarceration. Ideally, I would look at a date variable and at a corresponding variable that holds a "verdict code". If the verdict code indicated that the person was guilty, I would know the person was incarcerated and could look for the first rearrest after the end of that incarceration.

    However, the structure of the data is a bit irregular and makes this task difficult in Stata. I'm an R user by default, but this has to be done in Stata. The main problem is that it's hard to match a column with a date to the column with the corresponding verdict code ("GY" for "guilty").

    If this were R, I would do something like this:
    Code:
     
     firstIncarcerationArrestDate <- min(arrestDates[verdictCodes == "GY"], na.rm = TRUE)
    where arrestDates is a vector created by unlist-ing all the columns containing arrest dates, and verdictCodes is the matching vector of verdict codes. I can't figure out an analogous way to do this in Stata.

    Below is a small snapshot of the dataset in Stata showing the first four arrest date columns and the first four verdict code columns. For example, the fourth row shows an instance of being arrested on three different charges on 04apr2000. One of the verdict codes corresponding to that date is "GY", so I'd consider 04apr2000 to indicate an incarceration. However, because of multiple arrests and multiple charges per arrest, the arrest and verdict code data are spread out over hundreds of columns. What I need is a way to match, say, arrestDate3 to verdictCode3 in a way similar to the R code above.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(unifcrimhist_arrestdat_1 unifcrimhist_arrestdat_2 unifcrimhist_arrestdat_3 unifcrimhist_arrestdat_4) str8(unifcrimhist_verdictcd_1 unifcrimhist_verdictcd_2 unifcrimhist_verdictcd_3 unifcrimhist_verdictcd_4)
    18234     .     . 20020 "TM" ""   ""   "NO"
    15557 20382 14656 20267 "TM" ""   "43" ""  
    17470     . 17470 19107 "TM" ""   "GY" "GC"
    14704     . 14880 14704 "TM" ""   "NO" "GY"
    17139     . 17139 15884 "TM" ""   "SI" "DM"
    16692 15932     .     . "TM" "32" ""   ""  
    17704 17704     .     . "TM" "GC" ""   ""  
    17000 14988     .     . "TM" "GY" ""   ""  
    15456 15456 15456 15456 "TM" "GY" "GY" "SI"
    16390 18349     . 18349 "TM" "NO" ""   "SV"
    18005 16547 18005 16111 "TM" "NO" "GY" "GY"
    17059 17059 16921 17059 "TM" "NO" "GY" "GY"
    14913 17347 17123 15022 "TM" "NO" "NO" "GY"
    17932 17159     . 17932 "TM" "NP" ""   "NO"
    17601 18068 17959 17959 "TM" "SI" "NO" "NO"
    16516 16635 17847 17847 "TM" "TM" "GY" "GY"
    14986     .     .     . "TR" ""   ""   ""  
    16373 16636 16636     . "TR" "GY" "GY" ""  
    end
    format %td unifcrimhist_arrestdat_1
    format %td unifcrimhist_arrestdat_2
    format %td unifcrimhist_arrestdat_3
    format %td unifcrimhist_arrestdat_4

  • #2
    Cross-posted at http://stackoverflow.com/questions/3...lumns-in-stata

    The FAQ Advice, which you were asked to read before posting, explains our policy on cross-posting:

    http://www.statalist.org/forums/help#crossposting

    8. May I cross-post to other forums?

    People posting on Statalist may also post the same question on other listservers or in web forums. There is absolutely no rule against doing that.

    But if you do post elsewhere, we ask that you provide cross-references in URL form to searchable archives. That way, people interested in your question can quickly check what has been said elsewhere and avoid posting similar comments. Being open about cross-posting saves everyone time.

    If your question was answered well elsewhere, please post a cross-reference to that answer on Statalist.


    • #3
      My apologies; I was explicitly asked to post here in addition.


      • #4
        That was good advice, regardless of who said it and where, but doesn't affect the point.

        More interestingly, and more importantly for you, the advice already given twice on SO is to reshape, and you say there that it's not possible. I can't see a reason for that anywhere.
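
        As an illustration of the reshape advice, here is a minimal sketch that assumes the -dataex- example from #1 is in memory. The names id, seq, and first_gy_arrest are hypothetical and do not come from the original data; the real data set may need additional stubs and a genuine person identifier.

        ```stata
        * Sketch only: assumes the -dataex- example from #1 is in memory.
        * id, seq, and first_gy_arrest are hypothetical names.
        gen long id = _n

        * Long layout: one observation per person per arrest/verdict slot,
        * so each date sits next to its own verdict code
        reshape long unifcrimhist_arrestdat_ unifcrimhist_verdictcd_, i(id) j(seq)

        * Earliest arrest date among slots whose verdict code is "GY".
        * cond() yields missing for all other slots, and egen's min() ignores missing.
        bysort id: egen first_gy_arrest = min(cond(unifcrimhist_verdictcd_ == "GY", unifcrimhist_arrestdat_, .))
        format %td first_gy_arrest
        ```

        This plays the role of the R one-liner in #1: after the reshape, the date and the verdict code for the same slot are in the same observation, so the match by suffix is automatic. Individuals with no "GY" code end up with a missing first_gy_arrest.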

        This is an excellent small example of why cross-posting should be explicit, to avoid repetition of advice.
        Last edited by Nick Cox; 22 Sep 2016, 15:50.


        • #5
          I can't see why -reshape- wouldn't be possible. Given the size of your data set, it will be slow, and it might push up against the limits of your computer's memory. If the latter is the problem, then break up the data set into smaller pieces (groups of observations, say 10K "rows" in each) and do -reshape- in those separately, and then -append- the results.
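
          The chunked approach described above might be sketched as follows. This is an outline under stated assumptions, not a tested recipe for the real data: the file name arrests_wide.dta, the identifier id, and the index seq are hypothetical, and only the two stubs visible in the -dataex- example in #1 are reshaped.

          ```stata
          * Sketch only: arrests_wide.dta, id, and seq are hypothetical names.
          use arrests_wide, clear
          gen long id = _n
          local N = _N
          tempfile wide long_all
          save `wide'

          * Reshape 10,000 observations at a time and accumulate the results
          local chunk 10000
          forvalues lo = 1(`chunk')`N' {
              local hi = min(`lo' + `chunk' - 1, `N')
              use in `lo'/`hi' using `wide', clear
              reshape long unifcrimhist_arrestdat_ unifcrimhist_verdictcd_, i(id) j(seq)
              * Drop empty slots so the long file stays manageable
              drop if missing(unifcrimhist_arrestdat_) & unifcrimhist_verdictcd_ == ""
              * From the second chunk on, add the chunks reshaped so far
              if `lo' > 1 append using `long_all'
              save `long_all', replace
          }
          ```

          Dropping the empty slots inside the loop is a design choice rather than a requirement; it keeps the appended file from carrying one observation for every unused arrest slot.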

          That said, it boggles my mind that there are 32000 "columns" (in Stata we call these variables; "rows" are called observations). Given that each arrest leads to a pair of variables, that comes out to data for up to 16,000 arrests per person. If a person were arrested daily, that could go on every day for almost 44 years! How is that even remotely possible?
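
          The arithmetic behind that estimate can be checked directly in Stata. The sketch assumes, as the paragraph above does, that every arrest occupies exactly one pair of variables.

          ```stata
          * 32000 variables in pairs (date + verdict code) -> arrests per person
          display 32000/2

          * One arrest per day, expressed in years (about 43.8)
          display 16000/365.25
          ```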
