Re-elaborating strings

Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#1

25 Oct 2022, 01:54

Hi all,

I have a string variable that is made like this:

Code:

['Y10S 482/902' 'Y10S 482/901']

I am stripping the brackets and quotes from the string using

Code:

replace tech_class = ustrregexra(tech_class, "[\[\]']", "")

and then counting its words using

Code:

wordcount(tech_class)

The problem is that when I use wordcount(), the instance above is counted as 4 words (Y10S, 482/902, Y10S, 482/901), whereas it should be 2: Y10S 482/902 and Y10S 482/901. Is there a way to either transform the string like this:

Code:

['Y10S_482/902' 'Y10S_482/901']

or to tell Stata to consider Y10S 482/902 and Y10S 482/901 as two separate words (and not 4 words)?

In general, a space should separate words only where a number is followed by the next class code. So for instance, in this case:

Code:

['Y02B 30/13' 'Y02B 30/12' 'Y02E 60/14' 'Y02B 10/70']

the outcome should be something like:

Code:

['Y02B _30/13' 'Y02B _30/12' 'Y02E _60/14' 'Y02B _10/70']

Thank you

Last edited by Federico Nutarelli; 25 Oct 2022, 01:57.
Tags: data, panel, string, strings, Suggestion
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

25 Oct 2022, 08:29

The output of help wordcount() is very clear about what constitutes a "word" to Stata.

A word is a set of characters that starts and terminates with spaces, starts with the beginning of the string, or terminates with the end of the string.

So if you want to treat these paired tokens as a word, you're going to have to connect the two tokens with (one or more) non-space characters that aren't otherwise used in your data, so that you can remove them for presentation purposes.
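As a minimal sketch of that idea, the lines below apply wordcount() to literal strings before and after joining each pair with an underscore, then swap the underscore back to a space for display. The underscore is only an assumed connector; any character that never occurs in the data would do.

```stata
* Space-separated pairs: every space starts a new word, so this displays 4.
display wordcount("Y10S 482/902 Y10S 482/901")

* Pairs joined by an underscore: each pair is now one word, so this displays 2.
display wordcount("Y10S_482/902 Y10S_482/901")

* For presentation, take the second word and turn the connector back into a space.
* This displays: Y10S 482/901
display subinstr(word("Y10S_482/902 Y10S_482/901", 2), "_", " ", .)
```

The final argument `.` tells subinstr() to replace every occurrence of the connector rather than only the first.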

[Let me note that one could probably use a unicode nonprinting character but I don't want to go down that path: invisible characters are difficult to work with.]

Perhaps the following will help you along.

Code:

. input str60 tech_class

                                                       tech_class
  1. "['Y10S 482/902' 'Y10S 482/901']"
  2. "['Y02B 30/13' 'Y02B 30/12' 'Y02E 60/14' 'Y02B 10/70']"
  3. end

. 
. replace tech_class = ustrregexra(tech_class,"([A-Z]) +([0-9])","$1_$2")
(2 real changes made)

. replace tech_class = ustrregexra(tech_class, "[\[\]']", "")
(2 real changes made)

. generate wc = wordcount(tech_class)

. list

     +--------------------------------------------------+
     |                                   tech_class  wc |
     |--------------------------------------------------|
  1. |                    Y10S_482/902 Y10S_482/901   2 |
  2. | Y02B_30/13 Y02B_30/12 Y02E_60/14 Y02B_10/70    4 |
     +--------------------------------------------------+

.

Note that in the second observation we collapse sequences of two spaces between tokens to a single connecting underscore. This works because " +" in the pattern matches one or more spaces between the letter and the digit.
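If only the count is needed, a different technique from the regex replacement above may also work: count the single quotes and divide by two. This is just a sketch, and it assumes that every class is wrapped in exactly one pair of single quotes and that no class contains a quote itself. The variable names n_quotes and n_classes are made up for illustration.

```stata
clear
input str60 tech_class
"['Y10S 482/902' 'Y10S 482/901']"
"['Y02B 30/13' 'Y02B 30/12' 'Y02E 60/14' 'Y02B 10/70']"
end

* Number of single quotes = length lost when all quotes are removed.
generate n_quotes = strlen(tech_class) - strlen(subinstr(tech_class, "'", "", .))

* Assumption: each class contributes exactly two quotes.
generate n_classes = n_quotes / 2

list tech_class n_classes
```

This leaves tech_class untouched, so no connector character has to be chosen or removed later. With the two observations above, n_classes should be 2 and 4.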
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#3

26 Oct 2022, 07:09

Thank you. I was actually reading a post with a similar issue and a similar solution but was confused about how to use the regex in my case. Is there a Stata manual on regex in general?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

26 Oct 2022, 07:44

The Unicode regular expression functions introduced in Stata 14 support a much more powerful regular expression syntax than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his GitHub blog.
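To make the pattern from post #2 a little more concrete, here is a small sketch of how its pieces behave. "([A-Z])" captures one uppercase letter as group 1, " +" matches one or more spaces, and "([0-9])" captures one digit as group 2. In the replacement, "$1" and "$2" insert the captured groups. The variable names s, g0, g1 and g2 are made up for illustration.

```stata
* ustrregexrf() replaces only the first match.
* This displays: Y10S_482/902 Y10S 482/901
display ustrregexrf("Y10S 482/902 Y10S 482/901", "([A-Z]) +([0-9])", "$1_$2")

* ustrregexra() replaces every match.
* This displays: Y10S_482/902 Y10S_482/901
display ustrregexra("Y10S 482/902 Y10S 482/901", "([A-Z]) +([0-9])", "$1_$2")

* Inspect what the groups capture for the first match in a variable.
clear
input str40 s
"Y10S 482/902"
end

* ustrregexs(0) is the whole match; (1) and (2) are the capture groups.
* Here they should be "S 4", "S" and "4" respectively.
generate g0 = ustrregexs(0) if ustrregexm(s, "([A-Z]) +([0-9])")
generate g1 = ustrregexs(1) if ustrregexm(s, "([A-Z]) +([0-9])")
generate g2 = ustrregexs(2) if ustrregexm(s, "([A-Z]) +([0-9])")
list
```

Calling ustrregexm() in the if qualifier of the same command keeps each ustrregexs() result tied to the match for that observation.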