  • Re-elaborating strings

    Hi all,

    I have a string variable that is made like this:

    Code:
    ['Y10S 482/902' 'Y10S 482/901']
    I am replacing the string using

    Code:
    replace tech_class = ustrregexra(tech_class, "[\[\]']", "")
    and then counting its words using

    Code:
    wordcount(tech_class)
    The problem is that when I use wordcount(), it counts 4 words in the instance above (Y10S, 482/902, Y10S, 482/901), whereas there should be 2: Y10S 482/902 and Y10S 482/901. Is there a way to either transform the string like this:

    Code:
    ['Y10S_482/902' 'Y10S_482/901']
    or to tell Stata to consider Y10S 482/902 and Y10S 482/901 as two separate words (and not 4 words)?

    In general, a space should remain only where we move from numbers to letters, that is, between one code and the next. So, for instance, in this case:
    Code:
    ['Y02B  30/13' 'Y02B  30/12' 'Y02E  60/14' 'Y02B  10/70']
    the outcome should be something like:
    Code:
    ['Y02B _30/13' 'Y02B _30/12' 'Y02E _60/14' 'Y02B _10/70']
    Thank you
    Last edited by Federico Nutarelli; 25 Oct 2022, 01:57.

  • #2
    The output of help wordcount() is very clear about what constitutes a "word" to Stata.
    A word is a set of characters that starts and terminates with spaces, starts with the beginning of the string, or terminates with the end of the string.
    So if you want to treat these paired tokens as a word, you're going to have to connect the two tokens with (one or more) non-space characters that aren't otherwise used in your data, so that you can remove them later for presentation purposes.

    [Let me note that one could probably use a unicode nonprinting character but I don't want to go down that path: invisible characters are difficult to work with.]

    Perhaps the following will help you along.
    Code:
    . input str60 tech_class
    
                                                           tech_class
      1. "['Y10S 482/902' 'Y10S 482/901']"
      2. "['Y02B  30/13' 'Y02B  30/12' 'Y02E  60/14' 'Y02B  10/70']"
      3. end
    
    . 
    . replace tech_class = ustrregexra(tech_class,"([A-Z]) +([0-9])","$1_$2")
    (2 real changes made)
    
    . replace tech_class = ustrregexra(tech_class, "[\[\]']", "")
    (2 real changes made)
    
    . generate wc = wordcount(tech_class)
    
    . list
    
         +--------------------------------------------------+
         |                                  tech_class   wc |
         |--------------------------------------------------|
      1. |                   Y10S_482/902 Y10S_482/901    2 |
      2. | Y02B_30/13 Y02B_30/12 Y02E_60/14 Y02B_10/70    4 |
         +--------------------------------------------------+
    
    .
    Note that in the second observation we collapse sequences of two spaces between tokens to a single connecting underscore.
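
    If you later need the individual codes back with their space, something like the following might work. It is only a sketch: it assumes tech_class already holds the underscore-joined codes produced above and that "_" never occurs in the raw data; the variable name first_class is made up for illustration.

    ```stata
    * first_class is a hypothetical name; tech_class is assumed to look like
    * "Y10S_482/902 Y10S_482/901" after the two replace commands above.
    * word(s, 1) picks the first space-delimited token;
    * subinstr(..., "_", " ", .) then swaps every underscore for a space.
    generate first_class = subinstr(word(tech_class, 1), "_", " ", .)
    ```

    Applying the same subinstr() call to tech_class itself would restore the spaces in every code, but wordcount() would then split the paired tokens again, so it is probably best left for the final presentation step.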


    • #3
      Thank you. I was actually reading a post with a similar issue and a similar solution, but I was confused about how to use the regex in my case. Is there a Stata manual on regex in general?


      • #4

        The Unicode regular expression functions introduced in Stata 14 have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

        A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his GitHub blog.
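
        As a small illustration of those functions (my own sketch, not taken from the linked posts), the following splits a single code into its two parts with capture groups. The toy data, the pattern, and the variable names code, part1, and part2 are assumptions made for the example.

        ```stata
        clear
        input str20 code
        "Y10S 482/902"
        "Y02B  30/13"
        end

        * Assumed layout: letter, digits, letter, one or more spaces, digits/digits.
        local re "^([A-Z][0-9]+[A-Z]) +([0-9]+/[0-9]+)$"

        * ustrregexm() returns 1 on a match; ustrregexs(n) then returns the
        * n-th parenthesized group from that match.
        generate part1 = ustrregexs(1) if ustrregexm(code, "`re'")
        generate part2 = ustrregexs(2) if ustrregexm(code, "`re'")

        list
        ```

        Observations that do not match the pattern are left with empty strings in part1 and part2, which can be a quick way to spot codes that break the assumed layout.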
