Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting numbers from a string variable to put into a new variable

    Hello,

    A few of my colleagues reviewed police investigative case files and measured the victims' demographic characteristics for each case using a single string variable, victim_demo. They were instructed to identify the victim's race (B, W, O, or U), gender (M, F, O, or U), and DOB using commas to separate each attribute. For example, a white, male victim with a DOB of 1/1/1980 should have had a value of "W,M,1/1/1980". My hope was to split the variable based on the comma to create the 3 demographic variables. As you can see below, they did not follow those instructions. My solution (I'm open to better ones) is to 1) create a new variable with just the numbers from the victim_demo variable (e.g., row 1 below would have a value of "08071996"), 2) separate the DOB values from the age in years values into distinct variables, 3) convert the DOB values to age in years values, and then 4) combine the age variables to have a single measure of age in years. I think I can figure out the converting to age in years part, but I can't figure out how to pull any/all numbers from the victim_demo into a new variable, let's say victimage_temp. I found this website - https://stats.oarc.ucla.edu/stata/fa...r-expressions/ - and I think I need to use regex but I'm not familiar with it. If someone could help me out with the code I'd really appreciate it. If you also have the time to show an efficient way of turning that new variable into a single measure of age in years I would also really appreciate it but I can probably figure it out if not. Thank you for any help!


    Code:
    input str30 victim_demo
    "B/M/08/07/1996"
    "B/M unknown ages"
    "B/M/06/27/2000"
    "B/M Teen"
    "B/M 4-27-92 27y/o"
    "B/M  16 y/o. no further"
    "W/M  DOB is not listed"
    "B/M"
    "B/M  12-24-1995"
    "B/M  09-27-83  36y/o"
    "W, F, 27 years"
    end

  • #2
    Code:
    clear
    input str30 victim_demo
    "B/M/08/07/1996"
    "B/M unknown ages"
    "B/M/06/27/2000"
    "B/M Teen"
    "B/M 4-27-92 27y/o"
    "B/M  16 y/o. no further"
    "W/M  DOB is not listed"
    "B/M"
    "B/M  12-24-1995"
    "B/M  09-27-83  36y/o"
    "W, F, 27 years"
    end
    
    gen race=  substr(ustrregexra(victim_demo, "[^B|^W|^O|^U|^M|^F]", ""), 1, 1)
    gen gender=  substr(ustrregexra(victim_demo, "[^B|^W|^O|^U|^M|^F]", ""), 2, 1)
    gen age= real(ustrregexra(victim_demo,".*[\s](\d+)[\s+ y].*", "$1"))
    gen day= real(ustrregexra( victim_demo , ".*[\/|\s+]([0-9]{1,2})[\/|-]([0-9]{1,2})[\/|-]([0-9]{2,4}).*", "$2"))
    gen month= real(ustrregexra( victim_demo , ".*[\/|\s+]([0-9]{1,2})[\/|-]([0-9]{1,2})[\/|-]([0-9]{2,4}).*", "$1"))
    gen year= real(ustrregexra( victim_demo , ".*[\/|\s+]([0-9]{1,2})[\/|-]([0-9]{1,2})[\/|-]([0-9]{2,4}).*", "$3"))
    replace year= 1900+ year if inrange(year, 23, 99)
    gen dob= mdy(month, day, year)
    format dob %td
    gen age_today=  age(dob, td(`c(current_date)'))
    Res.:

    Code:
    . l, sep(0)
    
         +-------------------------------------------------------------------------------------------+
         |             victim_demo   race   gender   age   day   month   year         dob   age_to~y |
         |-------------------------------------------------------------------------------------------|
      1. |          B/M/08/07/1996      B        M     .     7       8   1996   07aug1996         26 |
      2. |        B/M unknown ages      B        M     .     .       .      .           .          . |
      3. |          B/M/06/27/2000      B        M     .    27       6   2000   27jun2000         22 |
      4. |                B/M Teen      B        M     .     .       .      .           .          . |
      5. |       B/M 4-27-92 27y/o      B        M    27    27       4   1992   27apr1992         30 |
      6. | B/M  16 y/o. no further      B        M    16     .       .      .           .          . |
      7. |  W/M  DOB is not listed      W        M     .     .       .      .           .          . |
      8. |                     B/M      B        M     .     .       .      .           .          . |
      9. |         B/M  12-24-1995      B        M     .    24      12   1995   24dec1995         27 |
     10. |    B/M  09-27-83  36y/o      B        M    36    27       9   1983   27sep1983         39 |
     11. |          W, F, 27 years      W        F    27     .       .      .           .          . |
         +-------------------------------------------------------------------------------------------+

    Comment


    • #3
      I don't understand it, but it worked! Thank you very much, Andrew Musau

      Comment

      Working...
      X