Hello Statalisters,
I need to create a new string variable that contains abbreviations of the words in another string variable. The ultimate goal is to shorten every word in every observation. Question: how can I create this new variable using the following abbreviation rules?
1. Exclude prepositions of time/place (from, to, on, at, in, of, …)
2. Extract three letters per word
First letter:
The word's first letter, whether it is a vowel or a consonant
Second and third letters: the next two consonants
3. No consecutively repeated consonants: a doubled consonant counts once
Example:
Allowance --> alw, instead of all
Possible --> psb, instead of pss
4. A four-letter word with fewer than three consonants gets a shorter abbreviation
Example:
Paid --> pd
Loan --> ln
However, Cash has three consonants, so it becomes csh
Further examples:
Novemberium --> nvm
Aquarium --> aqr
Aquaregia --> aqr
Note that aquarium and aquaregia both result in aqr. This inability to differentiate further is fine for two reasons:
1. The chance that this happens in my data seems small, and cleaning it manually when it does may be easier than writing a script to differentiate or flag aquarium’s aqr vs. aquaregia’s aqr.
2. It may be good, because the result will be less sensitive to typos (aquarioum and aquarrim both give me aqr).
Please correct me if this is an unnecessary trade-off.
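A minimal sketch of how rules 1 to 4 might be coded, offered as a starting point and not as a definitive solution. Several things here are assumptions that the post does not state: the working variable names (w, nwords), treating y as a consonant, discarding punctuation such as "/", reading "consecutively repeated" as a letter doubled within the word, and limiting the preposition list to the six words named in rule 1 (it would need extending). The sketch relies on ustrregexra(), which was introduced in Stata 14. The four sample rows are copied from the example data in this post.

```stata
* Sketch only: names, stop-word list, and punctuation handling are assumptions.
clear
input str40 raw
"Additional Paid In Capital"
"Placement at Bank of Egypt"
"Cash"
"Musyarakah Loan"
end

* Rule 1: prepositions to drop (only those listed in the post; extend as needed)
local stopwords from to on at in of

gen abbreviated = ""
gen long nwords = wordcount(raw)
summarize nwords, meanonly
local maxw = r(max)

forvalues j = 1/`maxw' {
    * j-th word, lowercased, with anything that is not a-z removed
    gen w = ustrregexra(lower(word(raw, `j')), "[^a-z]", "")

    * Rule 1: blank out listed prepositions
    foreach s of local stopwords {
        replace w = "" if w == "`s'"
    }

    * Rule 3: collapse a doubled consonant into one (ll -> l, ss -> s)
    foreach c in b c d f g h j k l m n p q r s t v w x y z {
        replace w = ustrregexra(w, "`c'+", "`c'")
    }

    * Rules 2 and 4: first letter, then up to two of the following consonants
    replace w = substr(w, 1, 1) + substr(ustrregexra(substr(w, 2, .), "[aeiou]", ""), 1, 2)

    * Append to the result; trim() removes the space left by dropped words
    replace abbreviated = trim(abbreviated + " " + w)
    drop w
}
drop nwords
list raw abbreviated, noobs
```

Under these assumptions the four rows should come out as adt pd cpt, plc bnk egy, csh, and msy ln. Looping over word positions keeps the data in wide form; reshaping to one word per row would be an alternative if the abbreviations also need to be inspected word by word.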
About the “raw” string variable: an observation may contain between 1 and 10 words.
Further illustration below:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str104 raw str38 abbreviated
"Additional Paid In Capital" "adt pd cpt"
"Allowance for Derivative Assets /" "alw drv ast"
"Allowance for Securities Held /" "alw scr hld"
"Allowance for Uncollectible Accounts" "alw unc acn"
"Allowence Syariah" "alw syr"
"ASSETS" "ast"
"Cash" "csh"
"Comparative Period of Difference in Restructuring Value of Transactions of Entities Under Common Control" "cmp prd dff rst vl trn ent udr cmn cnt"
"Defferd income /" "dfr inc"
"Mudharabah Saving" "mdh svn"
"Murabahah Recieveable" "mrb rcv"
"Musyarakah Financial" "msy fnc"
"Musyarakah financing" "msy fnc"
"Musyarakah Laoan" "msy ln"
"Musyarakah Loan" "msy ln"
"Net of Allowance for Possible Losses of Consumer Financing Receivables" "nt alw psb cns fnc rcv"
"Other Comprehensive Incomes" "oth cmp inc"
"Other Equity" "oth eqt"
"Placement at Bank India" "plc bnk ind"
"Placement at Bank of Egypt" "plc bnk egy"
"Prepaid Expenses" "prp exp"
"Retained Earnings" "rtn ern"
"Securities Issued" "scr isd"
"Syariah Financing Facility" "syr fnc fcl"
"TOTAL LIABILITIES AND EQUITY" "tl lbl eqt"
end
In case it is relevant: I’m using Stata/MP 14 on a MacBook Air running OS X El Capitan.
I welcome anything you may have in mind.