Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • String Cleaning - Possibly automate the task

    I wish to retain information on a person's grades which is observed in the variable below. The text that i would like to retain in a clean format is as follows:
    1. Grade: A .....
    2. Single Letters that are followed by a + or - need to be applied a prefix of "Grade: "
    3. Letter preceded by 1 or 2 digits such as 14 As or 3 Bs also need to be applied a prefix of "Grade: "
    What I want to discard:
    1. Observation having 1, 2 or 3 digits only
    2. Sentences
    I hope this message is drafted clearly. If not then feel free to ask for more clarification



    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str31 grade_1
    "14 As, 7 Bs"                    
    "c"                              
    "I continue  study in BA  part 2"
    "B"                              
    "Grade:D"                        
    "200BSc"                        
    "Grade:B"                        
    "888"                            
    "A"                              
    "67"                            
    "As"                            
    "B Grade"                        
    "3bs"                            
    "B"                              
    "B+"                            
    "A"                              
    "B"                              
    "B"                              
    "Grade B"                        
    "5As,3Bs"                        
    "73"                            
    "2nd"                            
    "B"                              
    "B"                              
    "Grade:A"                        
    "Grade B+"                      
    "b"                              
    "D"                              
    "As"                            
    "Pass"                          
    "B"                              
    "5As"                            
    "B"                              
    "Master continue"                
    "B"                              
    "Grade: B"                      
    "D"                              
    "B+"                            
    "b"                              
    "A+"                            
    "888"                            
    "5As"                            
    "5As"                            
    "As"                            
    "A"                              
    "b"                              
    "B"                              
    "I am studying"                  
    "2.67"                          
    "497"                            
    "B Grade"                        
    "697"                            
    "Grade A"                        
    "B"                              
    "All B's"                        
    "distinction"                    
    "B Grade"                        
    "A"                              
    "B"                              
    "B Grades"                      
    "Grade C"                        
    "A"                              
    "c"                              
    "B"                              
    "3Bs"                            
    "5 As"                          
    "b"                              
    "C+"                            
    "A"                              
    "B"                              
    "D"                              
    "B"                              
    "A"                              
    "1100/627, Grade: C"            
    "B"                              
    "none"                          
    "A"                              
    "B"                              
    "B"                              
    "B"                              
    "B"                              
    "A"                              
    "C"                              
    "B"                              
    "b"                              
    "Grade B"                        
    "C"                              
    "B"                              
    "C"                              
    "4"                              
    "b"                              
    end
    Last edited by Fahad Mirza; 15 Dec 2020, 08:10.

  • #2
    William Lisowski I have created a new thread for my problem. Please feel free to let me know if you think the description requires further clarity.

    Comment


    • #3
      In record 1 there is two grades separated by comma. Assuming this structure in example code below. Also trying to keep the regex simple, thus using several simpler regex:
      Code:
      split grade_1, parse(",")
      
      keep if ustrregexm(grade_11, "[A-Fa-f]") | ustrregexm(grade_12, "[A-Fa-f]")
      
      foreach v of varlist grade_1? {
      
          generate is_`v' = ustrregexs(1) if ustrregexm(`v', "^([A-Fa-f][+]{0,1})$")
          replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "\d{0,1}\s*([A-Fa-f])s")
          replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "\s([A-Fa-f])[']s")
          replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "([A-Fa-f][+]{0,1})[\s:]+Grade$")
          replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "Grade[:\s]+([A-Fa-f][+]{0,1})$")
      }
      
      tab grade_1 if mi(is_grade_11) & mi(is_grade_12)
      Code:
      . tab grade_1 if mi(is_grade_11) & mi(is_grade_12)
      
                              grade_1 |      Freq.     Percent        Cum.
      --------------------------------+-----------------------------------
                               200BSc |          1       16.67       16.67
                                  2nd |          1       16.67       33.33
                        I am studying |          1       16.67       50.00
      I continue  study in BA  part 2 |          1       16.67       66.67
                          distinction |          1       16.67       83.33
                                 none |          1       16.67      100.00
      --------------------------------+-----------------------------------
                                Total |          6      100.00
      https://regex101.com/ can be used to explain the regular expressions.
      Last edited by Bjarte Aagnes; 15 Dec 2020, 08:59.

      Comment


      • #4
        Bjarte Aagnes thanks so much for this however, i think we may have crossed wires, so i am trying to retain the opposite of what you have tabulated. So in particular just need to keep information on Grades per observation and not their marks and enumerator collected sentences.

        Comment


        • #5
          The results are in the new variables:
          Code:
          -> tabulation of is_grade_11  
          
          is_grade_11 |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    A |         20       26.32       26.32
                   A+ |          1        1.32       27.63
                    B |         31       40.79       68.42
                   B+ |          3        3.95       72.37
                    C |          4        5.26       77.63
                   C+ |          1        1.32       78.95
                    D |          4        5.26       84.21
                    a |          2        2.63       86.84
                    b |          7        9.21       96.05
                    c |          2        2.63       98.68
                    e |          1        1.32      100.00
          ------------+-----------------------------------
                Total |         76      100.00
          
          -> tabulation of is_grade_12  
          
          is_grade_12 |      Freq.     Percent        Cum.
          ------------+-----------------------------------
                    B |          2       66.67       66.67
                    C |          1       33.33      100.00
          ------------+-----------------------------------
                Total |          3      100.00

          Comment


          • #6
            Not closed, thus adding code which may be closer to the wanted end result
            Code:
            split grade_1, parse(",")
            
            keep if ustrregexm(grade_11, "[A-Fa-f]") | ustrregexm(grade_12, "[A-Fa-f]")
            
            local rx1 ^([A-Fa-f][+]{0,1})$
            local rx2 (\d{0,1}\s*[A-Fa-f])s
            local rx3 \s([A-Fa-f])[']s
            local rx4 ([A-Fa-f][+]{0,1})[\s:]+Grade$
            local rx5 Grade[:\s]+([A-Fa-f][+]{0,1})$
            
            local rx61 (\d{0,1}) // capture group 1
            local rx62 ([A-Fa-f]) // capture group 2
            local rx6 `rx61'`rx62's
            
            foreach v of varlist grade_1? {
             
                generate is_`v' = ""
             
                forvalues i = 1/5 {
             
                    replace is_`v' = ustrregexs(1) if ustrregexm(`v', "`rx`i''")
                }
             
                replace  is_`v' = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(`v', "`rx6'")
            }
            
            generate grades = ///
                trim(itrim(upper("Grades: " + is_grade_11 + " " + is_grade_12))) ///
                  if !mi(is_grade_11 + is_grade_12)
            Code:
            . tab grades , mi
            
                     grades |      Freq.     Percent        Cum.
            ----------------+-----------------------------------
                            |          6        7.23        7.23
                GRADES: 3 B |          2        2.41        9.64
                GRADES: 5 A |          3        3.61       13.25
            GRADES: 5 A 3 B |          1        1.20       14.46
                  GRADES: A |         17       20.48       34.94
                GRADES: A B |          1        1.20       36.14
                 GRADES: A+ |          1        1.20       37.35
                  GRADES: B |         36       43.37       80.72
                 GRADES: B+ |          3        3.61       84.34
                  GRADES: C |          7        8.43       92.77
                 GRADES: C+ |          1        1.20       93.98
                  GRADES: D |          4        4.82       98.80
                  GRADES: E |          1        1.20      100.00
            ----------------+-----------------------------------
                      Total |         83      100.00
            
            . tab grade_1 if mi(grades)
            
                                    grade_1 |      Freq.     Percent        Cum.
            --------------------------------+-----------------------------------
                                     200BSc |          1       16.67       16.67
                                        2nd |          1       16.67       33.33
                              I am studying |          1       16.67       50.00
            I continue  study in BA  part 2 |          1       16.67       66.67
                                distinction |          1       16.67       83.33
                                       none |          1       16.67      100.00
            --------------------------------+-----------------------------------
                                      Total |          6      100.00

            Comment


            • #7
              I am so sorry i just got back online to view this and was unable to reply to your previous message however, this last code is what i needed and i am thankful to you for taking time out to go through this. Really appreciate it!. As a request, i know this may sound a bit amateur but I am not really familiar with the use of ustr or str in general. Will it be alright for you to guide me through how the code is working? if that is alright with you. Thank you.

              Originally posted by Bjarte Aagnes View Post
              Not closed, thus adding code which may be closer to the wanted end result
              Code:
              split grade_1, parse(",")
              
              keep if ustrregexm(grade_11, "[A-Fa-f]") | ustrregexm(grade_12, "[A-Fa-f]")
              
              local rx1 ^([A-Fa-f][+]{0,1})$
              local rx2 (\d{0,1}\s*[A-Fa-f])s
              local rx3 \s([A-Fa-f])[']s
              local rx4 ([A-Fa-f][+]{0,1})[\s:]+Grade$
              local rx5 Grade[:\s]+([A-Fa-f][+]{0,1})$
              
              local rx61 (\d{0,1}) // capture group 1
              local rx62 ([A-Fa-f]) // capture group 2
              local rx6 `rx61'`rx62's
              
              foreach v of varlist grade_1? {
              
              generate is_`v' = ""
              
              forvalues i = 1/5 {
              
              replace is_`v' = ustrregexs(1) if ustrregexm(`v', "`rx`i''")
              }
              
              replace is_`v' = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(`v', "`rx6'")
              }
              
              generate grades = ///
              trim(itrim(upper("Grades: " + is_grade_11 + " " + is_grade_12))) ///
               if !mi(is_grade_11 + is_grade_12)
              Code:
              . tab grades , mi
              
              grades | Freq. Percent Cum.
              ----------------+-----------------------------------
              | 6 7.23 7.23
              GRADES: 3 B | 2 2.41 9.64
              GRADES: 5 A | 3 3.61 13.25
              GRADES: 5 A 3 B | 1 1.20 14.46
              GRADES: A | 17 20.48 34.94
              GRADES: A B | 1 1.20 36.14
              GRADES: A+ | 1 1.20 37.35
              GRADES: B | 36 43.37 80.72
              GRADES: B+ | 3 3.61 84.34
              GRADES: C | 7 8.43 92.77
              GRADES: C+ | 1 1.20 93.98
              GRADES: D | 4 4.82 98.80
              GRADES: E | 1 1.20 100.00
              ----------------+-----------------------------------
              Total | 83 100.00
              
              . tab grade_1 if mi(grades)
              
              grade_1 | Freq. Percent Cum.
              --------------------------------+-----------------------------------
              200BSc | 1 16.67 16.67
              2nd | 1 16.67 33.33
              I am studying | 1 16.67 50.00
              I continue study in BA part 2 | 1 16.67 66.67
              distinction | 1 16.67 83.33
              none | 1 16.67 100.00
              --------------------------------+-----------------------------------
              Total | 6 100.00

              Comment


              • #8
                Most string functions are easy to understand from the -help string functions-, or -help itrim()-, -help upper()- etc. But, regular expressions is a formal language and need more explanation than given by -help ustrregexm()-, and more explanation than I will give here, but I can try to explain the most complicated expression:
                Code:
                replace is_`v' = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(`v', "`rx6'")
                The ustrregexm(S,RE) return 1 if regular expression match the string S.
                The local rx6 containt the regex (\d{0,1})([A-Fa-f])s and the local v contain the variable name of the string variable (grade_11, grade_12) to be searched for match. Thus, for `v' = grade_11 we have
                Code:
                replace is_grade_11 = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(grade_11, "(\d{0,1})([A-Fa-f])s")
                The ustrregexs(n) will return the sub-expression n from the previous ustrregexm() match. In the regex applied there are two sub-expression, also called capture groups, defined by "()" around the sub-expression.

                The "+" in the expression is the string concatenator operator.

                Below is a example. To use # comments within the regex a trick must be used to force the Stata parser to interpret embedded newlines as newlines.Ignore this technicalities and focus on the regex with comments
                Code:
                /* TO BE RUN FROM HERE TO END */
                
                clear
                set obs 1
                gen grade = "5As"
                
                #delim;  /* Technical: newlines `nl' must be inserted to allow # comments */
                
                local nl = cond( c(os)=="Windows", char(13) + char(10) , char(10) );
                
                scalar regexpattern =
                
                "(?x)    # SET flag UREGEX_COMMENTS: Allow white space and # comments   `nl'"
                
                + /* REG EXP PATTERN TO MATCH between double quotes - ignore `nl' */
                
                "
                (        #  START Capturing Group (sub-expression 1)                    `nl'  
                \d{0,1}  #1   match digit {zero or one time}                            `nl'
                )        #  END Capturing Group (sub-expression 1)                      `nl'
                (        #  START Capturing Group (sub-expression 2)                    `nl'  
                [A-Fa-f] #2   'word class/set' match a single chars A to F a to f       `nl'
                )        #  END Capturing Group (sub-expression 2)                      `nl'
                s        #0 match character "s"                                    
                "
                ;
                #delim cr
                
                gen new =   ustrregexs(1)   /// string that matched sub-expression 1: \d{0,1}     
                            +               /// concatenation                  
                            " "             /// a space character      
                            +               /// concatenation      
                            ustrregexs(2)   /// string that matched sub-expression 2: [A-Fa-f]      
                                            ///
                            if  ustrregexm(grade, regexpattern) // see explanation above
                                
                display grade
                display regexpattern
                display ustrregexs(1)
                display ustrregexs(2)
                display ustrregexs(0)
                
                /* END */
                The ICU regular expressions are documented at https://unicode-org.github.io/icu/us...gs/regexp.html
                The regular expressions can be explained using https://regex101.com/
                A easy introduction https://www.regular-expressions.info/book10mins.html
                Last edited by Bjarte Aagnes; 21 Dec 2020, 11:59.

                Comment


                • #9
                  Thank you for these useful details and for taking time out to explain the code, it was indeed useful.

                  Comment

                  Working...
                  X