String Cleaning - Possibly automate the task

Fahad Mirza

Join Date: Sep 2018
Posts: 263

String Cleaning - Possibly automate the task

15 Dec 2020, 08:06

I wish to retain information on a person's grades which is observed in the variable below. The text that i would like to retain in a clean format is as follows:

Grade: A .....
Single Letters that are followed by a + or - need to be applied a prefix of "Grade: "
Letter preceded by 1 or 2 digits such as 14 As or 3 Bs also need to be applied a prefix of "Grade: "

What I want to discard:

Observation having 1, 2 or 3 digits only
Sentences

I hope this message is drafted clearly. If not then feel free to ask for more clarification

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str31 grade_1
"14 As, 7 Bs"                    
"c"                              
"I continue  study in BA  part 2"
"B"                              
"Grade:D"                        
"200BSc"                        
"Grade:B"                        
"888"                            
"A"                              
"67"                            
"As"                            
"B Grade"                        
"3bs"                            
"B"                              
"B+"                            
"A"                              
"B"                              
"B"                              
"Grade B"                        
"5As,3Bs"                        
"73"                            
"2nd"                            
"B"                              
"B"                              
"Grade:A"                        
"Grade B+"                      
"b"                              
"D"                              
"As"                            
"Pass"                          
"B"                              
"5As"                            
"B"                              
"Master continue"                
"B"                              
"Grade: B"                      
"D"                              
"B+"                            
"b"                              
"A+"                            
"888"                            
"5As"                            
"5As"                            
"As"                            
"A"                              
"b"                              
"B"                              
"I am studying"                  
"2.67"                          
"497"                            
"B Grade"                        
"697"                            
"Grade A"                        
"B"                              
"All B's"                        
"distinction"                    
"B Grade"                        
"A"                              
"B"                              
"B Grades"                      
"Grade C"                        
"A"                              
"c"                              
"B"                              
"3Bs"                            
"5 As"                          
"b"                              
"C+"                            
"A"                              
"B"                              
"D"                              
"B"                              
"A"                              
"1100/627, Grade: C"            
"B"                              
"none"                          
"A"                              
"B"                              
"B"                              
"B"                              
"B"                              
"A"                              
"C"                              
"B"                              
"b"                              
"Grade B"                        
"C"                              
"B"                              
"C"                              
"4"                              
"b"                              
end

Last edited by Fahad Mirza; 15 Dec 2020, 08:10.

Tags: cleaning, grades, numbers, sentences, string

Fahad Mirza

Join Date: Sep 2018

Posts: 263
#2

15 Dec 2020, 08:07

William Lisowski I have created a new thread for my problem. Please feel free to let me know if you think the description requires further clarity.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

15 Dec 2020, 08:53

In record 1 there is two grades separated by comma. Assuming this structure in example code below. Also trying to keep the regex simple, thus using several simpler regex:

Code:

split grade_1, parse(",")

keep if ustrregexm(grade_11, "[A-Fa-f]") | ustrregexm(grade_12, "[A-Fa-f]")

foreach v of varlist grade_1? {

    generate is_`v' = ustrregexs(1) if ustrregexm(`v', "^([A-Fa-f][+]{0,1})$")
    replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "\d{0,1}\s*([A-Fa-f])s")
    replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "\s([A-Fa-f])[']s")
    replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "([A-Fa-f][+]{0,1})[\s:]+Grade$")
    replace is_`v'  = ustrregexs(1) if ustrregexm(`v', "Grade[:\s]+([A-Fa-f][+]{0,1})$")
}

tab grade_1 if mi(is_grade_11) & mi(is_grade_12)

Code:

. tab grade_1 if mi(is_grade_11) & mi(is_grade_12)

                        grade_1 |      Freq.     Percent        Cum.
--------------------------------+-----------------------------------
                         200BSc |          1       16.67       16.67
                            2nd |          1       16.67       33.33
                  I am studying |          1       16.67       50.00
I continue  study in BA  part 2 |          1       16.67       66.67
                    distinction |          1       16.67       83.33
                           none |          1       16.67      100.00
--------------------------------+-----------------------------------
                          Total |          6      100.00

https://regex101.com/ can be used to explain the regular expressions.

Last edited by Bjarte Aagnes; 15 Dec 2020, 08:59.

Comment

Fahad Mirza

Join Date: Sep 2018

Posts: 263
#4

15 Dec 2020, 09:04

Bjarte Aagnes thanks so much for this however, i think we may have crossed wires, so i am trying to retain the opposite of what you have tabulated. So in particular just need to keep information on Grades per observation and not their marks and enumerator collected sentences.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

15 Dec 2020, 09:14

The results are in the new variables:

Code:

-> tabulation of is_grade_11  

is_grade_11 |      Freq.     Percent        Cum.
------------+-----------------------------------
          A |         20       26.32       26.32
         A+ |          1        1.32       27.63
          B |         31       40.79       68.42
         B+ |          3        3.95       72.37
          C |          4        5.26       77.63
         C+ |          1        1.32       78.95
          D |          4        5.26       84.21
          a |          2        2.63       86.84
          b |          7        9.21       96.05
          c |          2        2.63       98.68
          e |          1        1.32      100.00
------------+-----------------------------------
      Total |         76      100.00

-> tabulation of is_grade_12  

is_grade_12 |      Freq.     Percent        Cum.
------------+-----------------------------------
          B |          2       66.67       66.67
          C |          1       33.33      100.00
------------+-----------------------------------
      Total |          3      100.00

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

20 Dec 2020, 09:51

Not closed, thus adding code which may be closer to the wanted end result

Code:

split grade_1, parse(",")

keep if ustrregexm(grade_11, "[A-Fa-f]") | ustrregexm(grade_12, "[A-Fa-f]")

local rx1 ^([A-Fa-f][+]{0,1})$
local rx2 (\d{0,1}\s*[A-Fa-f])s
local rx3 \s([A-Fa-f])[']s
local rx4 ([A-Fa-f][+]{0,1})[\s:]+Grade$
local rx5 Grade[:\s]+([A-Fa-f][+]{0,1})$

local rx61 (\d{0,1}) // capture group 1
local rx62 ([A-Fa-f]) // capture group 2
local rx6 `rx61'`rx62's

foreach v of varlist grade_1? {
 
    generate is_`v' = ""
 
    forvalues i = 1/5 {
 
        replace is_`v' = ustrregexs(1) if ustrregexm(`v', "`rx`i''")
    }
 
    replace  is_`v' = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(`v', "`rx6'")
}

generate grades = ///
    trim(itrim(upper("Grades: " + is_grade_11 + " " + is_grade_12))) ///
      if !mi(is_grade_11 + is_grade_12)

Code:

. tab grades , mi

         grades |      Freq.     Percent        Cum.
----------------+-----------------------------------
                |          6        7.23        7.23
    GRADES: 3 B |          2        2.41        9.64
    GRADES: 5 A |          3        3.61       13.25
GRADES: 5 A 3 B |          1        1.20       14.46
      GRADES: A |         17       20.48       34.94
    GRADES: A B |          1        1.20       36.14
     GRADES: A+ |          1        1.20       37.35
      GRADES: B |         36       43.37       80.72
     GRADES: B+ |          3        3.61       84.34
      GRADES: C |          7        8.43       92.77
     GRADES: C+ |          1        1.20       93.98
      GRADES: D |          4        4.82       98.80
      GRADES: E |          1        1.20      100.00
----------------+-----------------------------------
          Total |         83      100.00

. tab grade_1 if mi(grades)

                        grade_1 |      Freq.     Percent        Cum.
--------------------------------+-----------------------------------
                         200BSc |          1       16.67       16.67
                            2nd |          1       16.67       33.33
                  I am studying |          1       16.67       50.00
I continue  study in BA  part 2 |          1       16.67       66.67
                    distinction |          1       16.67       83.33
                           none |          1       16.67      100.00
--------------------------------+-----------------------------------
                          Total |          6      100.00

Comment

Fahad Mirza

Join Date: Sep 2018
Posts: 263

21 Dec 2020, 07:12

I am so sorry i just got back online to view this and was unable to reply to your previous message however, this last code is what i needed and i am thankful to you for taking time out to go through this. Really appreciate it!. As a request, i know this may sound a bit amateur but I am not really familiar with the use of ustr or str in general. Will it be alright for you to guide me through how the code is working? if that is alright with you. Thank you.

Originally posted by Bjarte Aagnes View Post

Not closed, thus adding code which may be closer to the wanted end result

Code:

split grade_1, parse(",")

keep if ustrregexm(grade_11, "[A-Fa-f]") | ustrregexm(grade_12, "[A-Fa-f]")

local rx1 ^([A-Fa-f][+]{0,1})$
local rx2 (\d{0,1}\s*[A-Fa-f])s
local rx3 \s([A-Fa-f])[']s
local rx4 ([A-Fa-f][+]{0,1})[\s:]+Grade$
local rx5 Grade[:\s]+([A-Fa-f][+]{0,1})$

local rx61 (\d{0,1}) // capture group 1
local rx62 ([A-Fa-f]) // capture group 2
local rx6 `rx61'`rx62's

foreach v of varlist grade_1? {

generate is_`v' = ""

forvalues i = 1/5 {

replace is_`v' = ustrregexs(1) if ustrregexm(`v', "`rx`i''")
}

replace is_`v' = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(`v', "`rx6'")
}

generate grades = ///
trim(itrim(upper("Grades: " + is_grade_11 + " " + is_grade_12))) ///
 if !mi(is_grade_11 + is_grade_12)

Code:

. tab grades , mi

grades | Freq. Percent Cum.
----------------+-----------------------------------
| 6 7.23 7.23
GRADES: 3 B | 2 2.41 9.64
GRADES: 5 A | 3 3.61 13.25
GRADES: 5 A 3 B | 1 1.20 14.46
GRADES: A | 17 20.48 34.94
GRADES: A B | 1 1.20 36.14
GRADES: A+ | 1 1.20 37.35
GRADES: B | 36 43.37 80.72
GRADES: B+ | 3 3.61 84.34
GRADES: C | 7 8.43 92.77
GRADES: C+ | 1 1.20 93.98
GRADES: D | 4 4.82 98.80
GRADES: E | 1 1.20 100.00
----------------+-----------------------------------
Total | 83 100.00

. tab grade_1 if mi(grades)

grade_1 | Freq. Percent Cum.
--------------------------------+-----------------------------------
200BSc | 1 16.67 16.67
2nd | 1 16.67 33.33
I am studying | 1 16.67 50.00
I continue study in BA part 2 | 1 16.67 66.67
distinction | 1 16.67 83.33
none | 1 16.67 100.00
--------------------------------+-----------------------------------
Total | 6 100.00

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 786

21 Dec 2020, 11:51

Most string functions are easy to understand from the -help string functions-, or -help itrim()-, -help upper()- etc. But, regular expressions is a formal language and need more explanation than given by -help ustrregexm()-, and more explanation than I will give here, but I can try to explain the most complicated expression:

Code:

replace is_`v' = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(`v', "`rx6'")

The ustrregexm(S,RE) return 1 if regular expression match the string S.
The local rx6 containt the regex (\d{0,1})([A-Fa-f])s and the local v contain the variable name of the string variable (grade_11, grade_12) to be searched for match. Thus, for `v' = grade_11 we have

Code:

replace is_grade_11 = ustrregexs(1) + " " + ustrregexs(2) if ustrregexm(grade_11, "(\d{0,1})([A-Fa-f])s")

The ustrregexs(n) will return the sub-expression n from the previous ustrregexm() match. In the regex applied there are two sub-expression, also called capture groups, defined by "()" around the sub-expression.

The "+" in the expression is the string concatenator operator.

Below is a example. To use # comments within the regex a trick must be used to force the Stata parser to interpret embedded newlines as newlines.Ignore this technicalities and focus on the regex with comments

Code:

/* TO BE RUN FROM HERE TO END */

clear
set obs 1
gen grade = "5As"

#delim;  /* Technical: newlines `nl' must be inserted to allow # comments */

local nl = cond( c(os)=="Windows", char(13) + char(10) , char(10) );

scalar regexpattern =

"(?x)    # SET flag UREGEX_COMMENTS: Allow white space and # comments   `nl'"

+ /* REG EXP PATTERN TO MATCH between double quotes - ignore `nl' */

"
(        #  START Capturing Group (sub-expression 1)                    `nl'  
\d{0,1}  #1   match digit {zero or one time}                            `nl'
)        #  END Capturing Group (sub-expression 1)                      `nl'
(        #  START Capturing Group (sub-expression 2)                    `nl'  
[A-Fa-f] #2   'word class/set' match a single chars A to F a to f       `nl'
)        #  END Capturing Group (sub-expression 2)                      `nl'
s        #0 match character "s"                                    
"
;
#delim cr

gen new =   ustrregexs(1)   /// string that matched sub-expression 1: \d{0,1}     
            +               /// concatenation                  
            " "             /// a space character      
            +               /// concatenation      
            ustrregexs(2)   /// string that matched sub-expression 2: [A-Fa-f]      
                            ///
            if  ustrregexm(grade, regexpattern) // see explanation above
                
display grade
display regexpattern
display ustrregexs(1)
display ustrregexs(2)
display ustrregexs(0)

/* END */

The ICU regular expressions are documented at https://unicode-org.github.io/icu/us...gs/regexp.html
The regular expressions can be explained using https://regex101.com/
A easy introduction https://www.regular-expressions.info/book10mins.html

Last edited by Bjarte Aagnes; 21 Dec 2020, 11:59.

Comment

Fahad Mirza

Join Date: Sep 2018

Posts: 263
#9

21 Dec 2020, 12:36

Thank you for these useful details and for taking time out to explain the code, it was indeed useful.
Comment

Announcement

String Cleaning - Possibly automate the task

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment