Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Destringing messy strings representing time of day

    I'm working with some messy survey data where respondents were able to enter in times with open-ended text. We want to convert these to numeric values as best we can, but as you can see in my data example, this is fairly difficult, especially since we have a lot of variables (10+) and observations (over 10k) in the actual dataset.

    My current approach involves two steps: (1) a loop for times when AM/PM is written in, and (2) a loop when only a number is written without am/pm. (One note is that we're aiming to convert the PM times to 24-hour times in order for ease of calculation later on.)

    A possible approach (which would involve many more lines of code) would be to try to capture all possible combinations of semi-colons, AM and PM written uppercase, lowercase, with and without periods, and so on, for each of the main time blocks (full, quarter, and half hours). So that'd look something like:

    Code:
    foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart {
    foreach i of numlist 1/11 {
    replace use_`var' = `i' if `var' == "`i':00 AM" & change == 1
    replace use_`var' = `i' if `var' == "`i':00 am" & change == 1
    replace use_`var' = `i' if `var' == "`i':00 A.M." & change == 1
    replace use_`var' = `i' if `var' == "`i':00 a.m." & change == 1
    replace use_`var' = `i' if `var' == "`i':00AM" & change == 1
    replace use_`var' = `i' if `var' == "`i':00am" & change == 1
    }
    }
    ...and so on.

    Instead, I've tried to use strpos() to improve efficiency and reduce the amount of code needed, but I'm running into an issue where the highest number in the loop will override the actual time value if it exists in the string in any position.

    You can see this in the first observation with 'f1shouldstart', where the code reads in the "5" in "450am" and converts it to PM time as per the very last loop. With what I want, that 450am value actually wouldn't get converted at all yet--I'm just trying to capture the most commonly entered times (which are the full hour, quarter past, quarter til, and half).

    If anyone has any fixes for this, or any completely alternate ways of dealing with the task, I'd be very appreciative.


    Code:
    Code:
    *Create vars to use 
    foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart {
        gen use_`var' = .
    }
    
    
    // Step 1: Replace vars when am/pm is written
    foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart {
    foreach i of numlist 1/11 {
            
        ******* AM *******
        
        * full hour
        replace use_`var' = `i' if strpos(`var', "`i'") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i' if strpos(`var', "`i':00") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i' if strpos(`var', "`i'") & strpos(`var', "`A'") & change == 1
        replace use_`var' = `i' if strpos(`var', "`i':00") & strpos(`var', "`A'") & change == 1
    
    
        *quarter past 
        replace use_`var' = `i'.25 if strpos(`var', "`i':15") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i'.25 if strpos(`var', "`i'15") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i'.25 if strpos(`var', "`i':15") & strpos(`var', "`A'") & change == 1
        replace use_`var' = `i'.25 if strpos(`var', "`i'15") & strpos(`var', "`A'") & change == 1
    
        *half hour 
        replace use_`var' = `i'.5 if strpos(`var', "`i':30") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i'.5 if strpos(`var', "`i'30") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i'.5 if strpos(`var', "`i':30") & strpos(`var', "`A'") & change == 1
        replace use_`var' = `i'.5 if strpos(`var', "`i'30") & strpos(`var', "`A'") & change == 1
    
        *quarter til 
        replace use_`var' = `i'.75 if strpos(`var', "`i':45") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i'.75 if strpos(`var', "`i'45") & strpos(`var', "`a'") & change == 1
        replace use_`var' = `i'.75 if strpos(`var', "`i':45") & strpos(`var', "`A'") & change == 1
        replace use_`var' = `i'.75 if strpos(`var', "`i'45") & strpos(`var', "`A'") & change == 1
    
        
        ******* PM *******
        
        * full hour
        replace use_`var' = `i'+12 if strpos(`var', "`i'") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12 if strpos(`var', "`i':00") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12 if strpos(`var', "`i'") & strpos(`var', "`P'") & change == 1
        replace use_`var' = `i'+12 if strpos(`var', "`i':00") & strpos(`var', "`P'") & change == 1
        
        *quarter past 
        replace use_`var' = `i'+12.25 if strpos(`var', "`i':15") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12.25 if strpos(`var', "`i'15") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12.25 if strpos(`var', "`i':15") & strpos(`var', "`P'") & change == 1
        replace use_`var' = `i'+12.25 if strpos(`var', "`i'15") & strpos(`var', "`P'") & change == 1
    
        *half hour 
        replace use_`var' = `i'+12.5 if strpos(`var', "`i':30") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12.5 if strpos(`var', "`i'30") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12.5 if strpos(`var', "`i':30") & strpos(`var', "`P'") & change == 1
        replace use_`var' = `i'+12.5 if strpos(`var', "`i'30") & strpos(`var', "`P'") & change == 1
    
        *quarter til 
        replace use_`var' = `i'+12.75 if strpos(`var', "`i':45") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12.75 if strpos(`var', "`i'45") & strpos(`var', "`p'") & change == 1
        replace use_`var' = `i'+12.75 if strpos(`var', "`i':45") & strpos(`var', "`P'") & change == 1
        replace use_`var' = `i'+12.75 if strpos(`var', "`i'45") & strpos(`var', "`P'") & change == 1    
    }
    }
    
    
    // Step 2: Replace vars when no am/pm specified
    * AM start (5-11)
    foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart {
    foreach i of numlist 5/11 {
        * full hour
        replace use_`var' = `i' if `var' == "`i'" & change == 1
        replace use_`var' = `i' if `var' == "`i':00" & change == 1
        replace use_`var' = `i' if `var' == "`i'00" & change == 1
    
        * quarter past 
        replace use_`var' = `i'.25 if `var' == "`i':15" & change == 1
        replace use_`var' = `i'.25 if `var' == "`i'15" & change == 1
    
        * half hour 
        replace use_`var' = `i'.5 if `var' == "`i':30" & change == 1
        replace use_`var' = `i'.5 if `var' == "`i'30" & change == 1
        
        * quarter til 
        replace use_`var' = `i'.75 if `var' == "`i':45" & change == 1
        replace use_`var' = `i'.75 if `var' == "`i'45" & change == 1
    }
    }
    
    
    * PM start (1-4)
    foreach var of varlist f1actualstart f1shouldstart f2actualstart f2shouldstart {
    foreach i of numlist 1/4 {
        * full hour
        replace use_`var' = `i'+12 if `var' == "`i'" & change == 1
        replace use_`var' = `i'+12 if `var' == "`i':00" & change == 1
        replace use_`var' = `i'+12 if `var' == "`i'00" & change == 1
    
        * quarter past 
        replace use_`var' = `i'+12.25 if `var' == "`i':15" & change == 1
        replace use_`var' = `i'+12.25 if `var' == "`i'15" & change == 1
    
        * half hour 
        replace use_`var' = `i'+12.5 if `var' == "`i':30" & change == 1
        replace use_`var' = `i'+12.5 if `var' == "`i'30" & change == 1
        
        * quarter til 
        replace use_`var' = `i'+12.75 if `var' == "`i':45" & change == 1
        replace use_`var' = `i'+12.75 if `var' == "`i'45" & change == 1
    }
    }



    Sample data:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 f1actualstart str9 f1shouldstart str6 f2actualstart str9 f2shouldstart byte change
    "7"      "450am"     "7"      "730  am"     1
    "4 P.M." "330pm"     "4 P.M." "630AM"     1
    "3am"    "4 AM"      "5am"    "545am"     1
    "4:00pm" "2:45pm"    "4:00pm" "2:45pm"    1
    "5:15am" "12am"      "5:15"   "1a"        1
    "7:30pm" "12:15pm"   "4:30pm" "12:15pm"   1
    "9pm"    "415"       "1"      "545"       1
    "1PM"    "7:30p "    "2PM"    "7:30p "    1
    "2a.m."  "7 a.m."    "4p"     "5P.M"      1
    "230pm"  "5:15 p.m." "230pm"  "5:15 p.m." 1
    "115"    "1245 PM"   "1230"   "700pm"     1
    " 4:45"  "4:15pm"    "530p"   "545p"      1
    "12"     "1230pm"    " 3:15"  " 3:27"     1
    end


  • #2
    I can't guarantee that this will cover everything in your entire data set, since a large corpus of free-form text can include irregularities that have not yet been contemplated! But this code covers everything in your example data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6 f1actualstart str9 f1shouldstart str6 f2actualstart str9 f2shouldstart byte change
    "7"      "450am"     "7"      "730  am"     1
    "4 P.M." "330pm"     "4 P.M." "630AM"     1
    "3am"    "4 AM"      "5am"    "545am"     1
    "4:00pm" "2:45pm"    "4:00pm" "2:45pm"    1
    "5:15am" "12am"      "5:15"   "1a"        1
    "7:30pm" "12:15pm"   "4:30pm" "12:15pm"   1
    "9pm"    "415"       "1"      "545"       1
    "1PM"    "7:30p "    "2PM"    "7:30p "    1
    "2a.m."  "7 a.m."    "4p"     "5P.M"      1
    "230pm"  "5:15 p.m." "230pm"  "5:15 p.m." 1
    "115"    "1245 PM"   "1230"   "700pm"     1
    " 4:45"  "4:15pm"    "530p"   "545p"      1
    "12"     "1230pm"    " 3:15"  " 3:27"     1
    end
    
    gen `c(obs_t)' obs_no = _n
    reshape long @start, i(obs_no) j(which) string
    
    gen edit = itrim(trim(ustrregexrf(start, "(\d{1,2})(\d\d)", "$1:$2")))
    replace edit = subinstr(edit, " ", "", .)
    replace edit = subinstr(edit, ".", "", .)
    replace edit = ustrregexrf(edit, "([APap])$", "$1m")
    replace edit = ustrregexrf(edit, "^(\d{1,2})$", "$1:00")
    replace edit = ustrregexrf(edit, "^(\d{1,2})([APap])", "$1:00$2")
    gen double time = clock(edit, "hm")
    format time %tcHH:MM_AM
    
    drop edit
    reshape wide @start @time, i(obs_no) j(which) string
    By the way, I suspect that somebody who really knows regular expressions well could reduce this to maybe 2 or 3 lines of code.

    Added: Note, that this code makes a blatant effort to read the mind of the responder in ambiguous cases. So, for example, if the original response is 700, this code translates that into 7:00AM. You may find it necessary to modify the code in situations like this if you can, from other information in your data set, infer when a numeric time between 00:00 and 12:59 with no AM/PM designation is AM or PM. If there is no such information, then you really have no choice but to guess AM or guess PM.
    Last edited by Clyde Schechter; 26 Mar 2024, 14:32.

    Comment


    • #3
      Thanks, Clyde. I definitely don't know regular expressions too well, so this is very helpful.

      And yes, my current rule (which I only hint at in the notes of my code in Step 2) is to treat ambiguous cases as AM if they are between 4:00 and 11:59, and PM if between 12:00 and 3:59. Which part of your code would need to be amended based on that cutoff (or if a different cutoff is decided on in the future)?

      An additional question: at the end, how can I create a new numeric variable based on the formatted 'time' variable? So 14:30PM would become 14.5; 6:15AM would become 6.25, and so on. My ultimate goal is to be able to add/subtract these different variables from each other.
      Last edited by Garrett Todd; 26 Mar 2024, 14:49.

      Comment


      • #4
        Code:
        gen `c(obs_t)' obs_no = _n
        reshape long @start, i(obs_no) j(which) string
        
        gen edit = itrim(trim(ustrregexrf(start, "(\d{1,2})(\d\d)", "$1:$2")))
        replace edit = subinstr(edit, " ", "", .)
        replace edit = subinstr(edit, ".", "", .)
        replace edit = ustrregexrf(edit, "([APap])$", "$1m")
        replace edit = ustrregexrf(edit, "^(\d{1,2})$", "$1:00")
        replace edit = ustrregexrf(edit, "^(\d{1,2})([APap])", "$1:00$2")
        gen double time = clock(edit, "hm")
        format time %tcHH:MM_AM
        
        replace time = time + msofhours(12) if inrange(time, tc(00:00), tc(03:59)) ///
            & !strpos(lower(start), "a")
            
        gen fractional_time = clockdiff_frac(tc(00:00), time, "h")
        
        drop edit
        reshape wide @start @time @fractional_time, i(obs_no) j(which) string
        ]

        Comment


        • #5
          Clyde Schechter. I have a follow-up question: if one wanted to ignore all the times where there was no version of AM/PM entered, how could one add code at the end that would replace those cases to ==.?
          Last edited by Dakota McAvoy; 27 Mar 2024, 13:36.

          Comment


          • #6
            Code:
            replace time = . if !strpos(lower(start), "a") & !strpos(lower(start), "p")

            Comment

            Working...
            X