Extracting the first three consonants with a few exceptions

KJ Lee

Join Date: Feb 2016
Posts: 15

27 Feb 2016, 14:52

Hello Statalisters,

I need to create a new string variable that contains abbreviations of another string variable. The ultimate goal is to shorten the words in every observation. Question: How can I create this new variable using the following abbreviation rules?

1. Exclude prepositions of time/place (from, to, on, at, in, of, …)
2. Per word extract three letters

First letter:

If a word starts with a vowel, use this vowel

If a word starts with a consonant, use this consonant

Second and third letters: consonants only

3. No consecutively repeated consonants
Example:
Allowance --> alw, instead of all
Possible --> psb, instead of pss

4. Words with 4 letters containing fewer than 3 consonants get a shorter abbreviation:
Example:
Paid --> pd
Loan --> ln
However, Cash has 3 consonants, so it becomes csh

Further examples:
Novemberium --> nvm
Aquarium --> aqr
Aquaregia --> aqr

Note that aquarium and aquaregia both result in aqr. This inability to differentiate further is fine for two reasons:
1. The chance that this happens in my data seems small, and cleaning it manually when it does may be easier than writing a script to differentiate or flag aquarium’s aqr vs. aquaregia’s aqr.
2. It may even be good, because the abbreviation will be less sensitive to typos (aquarioum and aquarrim will both give me aqr).
Please correct me if this is an unnecessary tradeoff.

Info about the “raw” string variable: an observation may contain between 1 and 10 words.

Further illustration below:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str104 raw str38 abbreviated
"Additional Paid In Capital"                                                                               "adt pd cpt"                            
"Allowance for Derivative Assets /"                                                                        "alw drv ast"                          
"Allowance for Securities Held /"                                                                          "alw scr hld"                          
"Allowance for Uncollectible Accounts"                                                                     "alw unc acn"                          
"Allowence Syariah"                                                                                        "alw syr"                              
"ASSETS"                                                                                                   "ast"                                  
"Cash"                                                                                                     "csh"                                  
"Comparative Period of Difference in Restructuring Value of Transactions of Entities Under Common Control" "cmp prd dff rst vl trn ent udr cmn cnt"
"Defferd income /"                                                                                         "dfr inc"                              
"Mudharabah Saving"                                                                                        "mdh svn"                              
"Murabahah Recieveable"                                                                                    "mrb rcv"                              
"Musyarakah Financial"                                                                                     "msy fnc"                              
"Musyarakah financing"                                                                                     "msy fnc"                              
"Musyarakah Laoan"                                                                                         "msy ln"                               
"Musyarakah Loan"                                                                                          "msy ln"                                
"Net of Allowance for Possible Losses of Consumer Financing Receivables"                                   "nt alw psb cns fnc rcv"               
"Other Comprehensive Incomes"                                                                              "oth cmp inc"                          
"Other Equity"                                                                                             "oth eqt"                              
"Placement at Bank India"                                                                                  "plc bnk ind"                          
"Placement at Bank of Egypt"                                                                               "plc bnk egy"                           
"Prepaid Expenses"                                                                                         "prp exp"                              
"Retained Earnings"                                                                                        "rtn ern"                              
"Securities Issued"                                                                                        "scr isd"                              
"Syariah Financing Facility"                                                                               "syr fnc fcl"                          
"TOTAL LIABILITIES AND EQUITY"                                                                             "tl lbl eqt"                           
end

In case it is relevant: I’m using Stata MP 14 on a MacBook Air running OS X El Capitan.

I welcome anything you may have in mind.

Clyde Schechter

Join Date: Apr 2014
Posts: 30083

27 Feb 2016, 16:11

So the following code will produce the results you indicate with two exceptions:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str104 raw str38 abbreviated
"Additional Paid In Capital"                                                                               "adt pd cpt"                            
"Allowance for Derivative Assets /"                                                                        "alw drv ast"                          
"Allowance for Securities Held /"                                                                          "alw scr hld"                          
"Allowance for Uncollectible Accounts"                                                                     "alw unc acn"                          
"Allowence Syariah"                                                                                        "alw syr"                              
"ASSETS"                                                                                                   "ast"                                  
"Cash"                                                                                                     "csh"                                  
"Comparative Period of Difference in Restructuring Value of Transactions of Entities Under Common Control" "cmp prd dff rst vl trn ent udr cmn cnt"
"Defferd income /"                                                                                         "dfr inc"                              
"Mudharabah Saving"                                                                                        "mdh svn"                              
"Murabahah Recieveable"                                                                                    "mrb rcv"                              
"Musyarakah Financial"                                                                                     "msy fnc"                              
"Musyarakah financing"                                                                                     "msy fnc"                              
"Musyarakah Laoan"                                                                                         "msy ln"                               
"Musyarakah Loan"                                                                                          "msy ln"                                
"Net of Allowance for Possible Losses of Consumer Financing Receivables"                                   "nt alw psb cns fnc rcv"               
"Other Comprehensive Incomes"                                                                              "oth cmp inc"                          
"Other Equity"                                                                                             "oth eqt"                              
"Placement at Bank India"                                                                                  "plc bnk ind"                          
"Placement at Bank of Egypt"                                                                               "plc bnk egy"                           
"Prepaid Expenses"                                                                                         "prp exp"                              
"Retained Earnings"                                                                                        "rtn ern"                              
"Securities Issued"                                                                                        "scr isd"                              
"Syariah Financing Facility"                                                                               "syr fnc fcl"                          
"TOTAL LIABILITIES AND EQUITY"                                                                             "tl lbl eqt"                           
end

//    BREAK UP INTO SEPARATE WORDS
split raw, gen(word)

//    CHARACTERS NOT WANTED (EXCEPT INITIALLY)
local vowels a e i o u
local alphabet `c(alpha)'
local consonants: list alphabet - vowels
local punctuation / . , 
local exclude `vowels' `punctuation'

foreach v of varlist word* {
    replace `v' = trim(itrim(lower(`v'))) // SET TO LOWER CASE; TRIM
    replace `v' = "" if inlist(`v', "to", "on", "at", "in", "of", "for", "and") // SMALL WORDS
    gen initial_`v' = substr(`v', 1, 1) // SEPARATE FIRST CHARACTER
    replace initial_`v' = "" if initial_`v' == "/"
    replace `v' = substr(`v', 2, .) // FROM REST OF WORD
    foreach e of local exclude { // REMOVE VOWELS & PUNCTUATION
        replace `v' = subinstr(`v', "`e'", "", .)
    }
    replace `v' = initial_`v' + `v' // BRING BACK INITIALS
    drop initial_`v'

    // REPLACE GEMINATE CONSONANTS BY SINGLETONS
    foreach c of local consonants {
        replace `v' = subinstr(`v', "`c'`c'", "`c'", .)
    }
    replace `v' = substr(`v', 1, 3) // KEEP FIRST THREE REMAINING
}
egen abbr = concat(word*), punct(" ")
replace abbr = trim(itrim(abbr))

drop word*

list abbreviated abbr raw if abbreviated != abbr, clean

The two exceptions are in observations 8 and 16. Where you abbreviated Difference as dff, I get dfr, and where you omit any abbreviation for Losses, I have lss. In both of these cases it appears to me that you are not following your own rules, whereas I am.
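
One detail worth checking on the Losses case: a single pass of -subinstr()- shortens a run of three identical consonants to two, not one, which is why lss can survive in the output of the code above even though rule 3 asks for no consecutive repeats. Below is a minimal sketch, not part of the code above, that repeats the substitution until nothing changes. The variable names onepass and collapsed and the three test strings are made up for illustration.

```stata
* Hypothetical demo data: words with their non-initial vowels already removed
clear
input str12 w
"lsss"
"dffrnc"
"allwnc"
end

gen onepass = w
gen collapsed = w
foreach c in `c(alpha)' {
    * a single pass only shortens a run of three: "sss" -> "ss"
    quietly replace onepass = subinstr(onepass, "`c'`c'", "`c'", .)

    * repeat the substitution until no doubled letter is left
    quietly count if strpos(collapsed, "`c'`c'")
    while r(N) > 0 {
        quietly replace collapsed = subinstr(collapsed, "`c'`c'", "`c'", .)
        quietly count if strpos(collapsed, "`c'`c'")
    }
}
list, clean
```

With this change Losses would come out as ls rather than lss. Whether that is what the rules intend is for the original poster to decide.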

Notes:
1. This code does not deal with the possibility of embedded digits or punctuation characters other than / . or ,. It is not hard to expand -local punctuation...- to include these if need be. However, it is quite tricky if we encounter any ", ', or ` characters. (In fact, I have not yet been able to get the handling of these correct, and this code explicitly does not deal with them. I'm just hoping you don't have any.)

2. I have identified 7 short words that you want to skip over. You can see them in the -replace `v' = "" if inlist(...- command. When dealing with strings, -inlist()- takes a maximum of 10 arguments, so you only have room for two more. If there are more such words than that, you will need to join additional -inlist()- expressions with | (logical or) to cover them all.

3. It looks like you have either invented, or somewhere found, an interesting way of abbreviating words in ways that retain the "important" letters. I'm guessing you want to do this to try to match up various mistypings and misspellings of the words in "raw." If this is not your own algorithm, I would be interested in seeing the reference: this problem does come up with some frequency and it would be nice to use it myself and share it with others. If it is your own invention, I would appreciate some follow-up on how helpful it turns out to be--and if it works well, I encourage you to publish it.

4. In connection with the previous note, if this approach ultimately proves unsatisfactory, you might try using the -soundex()- function to abbreviate each word instead of the rules you have come upon. Soundex was originally developed for indexing surnames by sound, and was famously used to index US Census records, so that alternate spellings of the same surname are grouped together. It produces a code with 1 letter and 3 digits that encode the "important" letters in a name. Alternate spellings of the same name typically yield the same soundex code, and vice versa. Soundex was, as mentioned, optimized for use with surnames, but it is pretty serviceable for English vocabulary overall. (Though it fails miserably with words that are intentionally bizarre, such as generic names of pharmaceuticals and brand names of many commercial products.)

5. I half expect that when Robert Picard sees this post he will come up with a one-line solution using regular expressions.
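
For note 4, here is a minimal sketch of what a per-word -soundex()- abbreviation might look like. It assumes words are separated by spaces; the variable names nwords and sdx are made up, and the four example strings are taken from the data above.

```stata
clear
input str30 raw
"Allowance Syariah"
"Allowence Syariah"
"Musyarakah Loan"
"Musyarakah Laoan"
end

* find the largest number of words in any observation
gen nwords = wordcount(raw)
summarize nwords, meanonly
local maxwords = r(max)

* build a string of soundex codes, one code per word
gen sdx = ""
forvalues i = 1/`maxwords' {
    replace sdx = sdx + " " + soundex(word(raw, `i')) if `i' <= nwords
}
replace sdx = trim(sdx)
list raw sdx, clean
```

Misspelled pairs such as Allowance/Allowence and Loan/Laoan should end up with identical codes, because -soundex()- ignores vowels after the first letter.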

Robert Picard

Join Date: Mar 2014
Posts: 1536

27 Feb 2016, 17:03

A tall order, Clyde! But I think the combination of dropping common words, vowels, double consonants, etc. is a bit much for a one-liner. My solution is similar to yours:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str104 raw str38 abbreviated
"Additional Paid In Capital"                                                                               "adt pd cpt"                            
"Allowance for Derivative Assets /"                                                                        "alw drv ast"                          
"Allowance for Securities Held /"                                                                          "alw scr hld"                          
"Allowance for Uncollectible Accounts"                                                                     "alw unc acn"                          
"Allowence Syariah"                                                                                        "alw syr"                              
"ASSETS"                                                                                                   "ast"                                  
"Cash"                                                                                                     "csh"                                  
"Comparative Period of Difference in Restructuring Value of Transactions of Entities Under Common Control" "cmp prd dff rst vl trn ent udr cmn cnt"
"Defferd income /"                                                                                         "dfr inc"                              
"Mudharabah Saving"                                                                                        "mdh svn"                              
"Murabahah Recieveable"                                                                                    "mrb rcv"                              
"Musyarakah Financial"                                                                                     "msy fnc"                              
"Musyarakah financing"                                                                                     "msy fnc"                              
"Musyarakah Laoan"                                                                                         "msy ln"                               
"Musyarakah Loan"                                                                                          "msy ln"                                
"Net of Allowance for Possible Losses of Consumer Financing Receivables"                                   "nt alw psb cns fnc rcv"               
"Other Comprehensive Incomes"                                                                              "oth cmp inc"                          
"Other Equity"                                                                                             "oth eqt"                              
"Placement at Bank India"                                                                                  "plc bnk ind"                          
"Placement at Bank of Egypt"                                                                               "plc bnk egy"                           
"Prepaid Expenses"                                                                                         "prp exp"                              
"Retained Earnings"                                                                                        "rtn ern"                              
"Securities Issued"                                                                                        "scr isd"                              
"Syariah Financing Facility"                                                                               "syr fnc fcl"                          
"TOTAL LIABILITIES AND EQUITY"                                                                             "tl lbl eqt"                           
end

* remove specific small words
gen work = " " + lower(raw) + " "
foreach s in from to on at in of for and / {
    replace work = subinstr(work," `s' "," ",.)
}

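* remove vowels; strproper() capitalizes the first letter of each word, so
* the lowercase-only pattern [aeiou] leaves a leading vowel in place
replace work = lower(ustrregexra(strproper(work),"[aeiou]", ""))

* remove double consonants (two passes also handle runs of three or four)
foreach c in `c(alpha)' {
    replace work = subinstr(work,"`c'`c'","`c'", .)
    replace work = subinstr(work,"`c'`c'","`c'", .)
}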

* reduce to first 3 letters of each word
gen nwords = wordcount(work)
sum nwords, meanonly
* loop up to the largest word count, r(max); r(N) is the number of observations
local n = r(max)

gen clean = ""
forvalues i = 1/`n' {
    replace clean = clean + " " + substr(word(work,`i'),1,3)
}
replace clean = trim(clean)

* to install, type in Stata's command window: ssc install leftalign
leftalign
gen check = clean != abbreviated
list raw abbreviated clean check, noobs compress string(30)

And I get the same two problematic cases (observations 8 and 16):

Code:

  +----------------------------------------------------------------------------------------------------------------+
  | raw                                abbreviated                        clean                              check |
  |----------------------------------------------------------------------------------------------------------------|
  | Additional Paid In Capital         adt pd cpt                         adt pd cpt                             0 |
  | Allowance for Derivative Asset..   alw drv ast                        alw drv ast                            0 |
  | Allowance for Securities Held /    alw scr hld                        alw scr hld                            0 |
  | Allowance for Uncollectible Ac..   alw unc acn                        alw unc acn                            0 |
  | Allowence Syariah                  alw syr                            alw syr                                0 |
  |----------------------------------------------------------------------------------------------------------------|
  | ASSETS                             ast                                ast                                    0 |
  | Cash                               csh                                csh                                    0 |
  | Comparative Period of Differen..   cmp prd dff rst vl trn ent udr..   cmp prd dfr rst vl trn ent und..       1 |
  | Defferd income /                   dfr inc                            dfr inc                                0 |
  | Mudharabah Saving                  mdh svn                            mdh svn                                0 |
  |----------------------------------------------------------------------------------------------------------------|
  | Murabahah Recieveable              mrb rcv                            mrb rcv                                0 |
  | Musyarakah Financial               msy fnc                            msy fnc                                0 |
  | Musyarakah financing               msy fnc                            msy fnc                                0 |
  | Musyarakah Laoan                   msy ln                             msy ln                                 0 |
  | Musyarakah Loan                    msy ln                             msy ln                                 0 |
  |----------------------------------------------------------------------------------------------------------------|
  | Net of Allowance for Possible ..   nt alw psb cns fnc rcv             nt alw psb ls cns fnc rcv              1 |
  | Other Comprehensive Incomes        oth cmp inc                        oth cmp inc                            0 |
  | Other Equity                       oth eqt                            oth eqt                                0 |
  | Placement at Bank India            plc bnk ind                        plc bnk ind                            0 |
  | Placement at Bank of Egypt         plc bnk egy                        plc bnk egy                            0 |
  |----------------------------------------------------------------------------------------------------------------|
  | Prepaid Expenses                   prp exp                            prp exp                                0 |
  | Retained Earnings                  rtn ern                            rtn ern                                0 |
  | Securities Issued                  scr isd                            scr isd                                0 |
  | Syariah Financing Facility         syr fnc fcl                        syr fnc fcl                            0 |
  | TOTAL LIABILITIES AND EQUITY       tl lbl eqt                         tl lbl eqt                             0 |
  +----------------------------------------------------------------------------------------------------------------+
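
The vowel-removal line in the code above works because -strproper()- capitalizes the first letter of each word, so the case-sensitive pattern [aeiou] cannot touch a leading vowel. Here is a small sketch to see this in isolation. The variable names phrase and novowel are made up, and the second string is a hypothetical example added to show a caveat.

```stata
clear
input str20 phrase
"other equity"
"year-end assets"
end

* strproper() turns "other equity" into "Other Equity";
* only the remaining lowercase vowels are then deleted
gen novowel = lower(ustrregexra(strproper(phrase), "[aeiou]", ""))

* Caveat: strproper() also capitalizes a letter that follows any non-letter,
* so the "e" after the hyphen in "year-end" survives as well
list, clean
```

If the data contain hyphens or apostrophes inside words, it may be safer to strip that punctuation before this step.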

wbuchanan

Join Date: Mar 2014

Posts: 1362
#4

28 Feb 2016, 13:28

There are a few different issues here that span different domains. Removing prepositions is fairly common in the natural language processing world, where they are treated as a subset of "stop" words. The business logic for defining the abbreviations could take a while to implement if the strings are long or if there is a large number of observations. There could be more practical/efficient solutions for the end goal that the OP has in mind, so clarifying that goal would help others think through alternatives that accomplish the same, or a closely enough related, end goal.
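
As a concrete version of the stop-word idea, here is a sketch that keeps the list in one local macro, so it can grow without running into the argument limit of a single -inlist()-. The stop-word list is the one used in the earlier posts; the variable names (nwords, kept, w, stop) are made up for illustration, and the two strings come from the example data.

```stata
clear
input str80 raw
"Net of Allowance for Possible Losses"
"Placement at Bank of Egypt"
end

* stop words to drop; extend this list as needed
local stopwords from to on at in of for and

gen nwords = wordcount(raw)
summarize nwords, meanonly
local maxwords = r(max)

gen kept = ""
forvalues i = 1/`maxwords' {
    * i-th word of each observation, lower-cased ("" if there is none)
    gen w = lower(word(raw, `i'))
    gen byte stop = 0
    foreach s of local stopwords {
        replace stop = 1 if w == "`s'"
    }
    * append the word unless it is a stop word or empty
    replace kept = kept + " " + w if !stop & w != ""
    drop w stop
}
replace kept = trim(kept)
list raw kept, clean
```

Because each word is compared on its own, consecutive stop words are handled without relying on the surrounding spaces.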
KJ Lee

Join Date: Feb 2016

Posts: 15
#5

29 Feb 2016, 15:42

Mr. Schechter, Mr. Picard, and Mr. Buchanan: Thank you very much for your comments and suggestions. Since I'm a neophyte, it will take me a few days to digest all of the information you shared, but I will try both solutions and update here. Again, thank you very much for your consideration. I really appreciate it.