  • Matching superscript in string variable

    The attached example dataset contains 14 rows of a single string variable called Year. In most rows, Year just contains the 4-digit year (e.g., "1992"), but in a few rows the year is followed by a superscript (e.g., "1992¹").

    I would like to flag the rows with a superscript. Here is how I try to do that:
    Code:
    use example, clear
    gen superscript = (substr(Year,-1,1) == "¹")
    list
    But this doesn't work. In every row the value of superscript is 0.
    What's a better way to do this?

    Thanks!
    Paul

  • #2
    Here's another symptom of the same problem. To avoid attaching a dataset, I first tried to create an example with code.
    Code:
    input str5 Year
    1992
    1994¹
    end
    list
    But the listed data looked weird in my output window. Instead of displaying the superscript 1, Stata displayed a "?" in a box, which I suspect means that it didn't recognize or couldn't display the superscript, even though I see superscripts in the example data that I attached to my previous post.

    If I understood how Stata represents superscripts in string variables, that would probably resolve both problems.

    Thanks again!
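
    A possible explanation for the boxed "?", offered as a sketch rather than a diagnosis: Stata 14 and later store strings as UTF-8, where "¹" (U+00B9) takes two bytes, so "1994¹" needs 6 bytes. If input cuts the value down to the 5 bytes a str5 can hold, half of the superscript is lost and the leftover byte is not a valid character. The code below mimics that cut with substr(); it assumes the do-file is saved as UTF-8, and the variable names are made up for illustration.

    ```stata
    clear
    set obs 1
    gen Year_full = "1994¹"
    * Keep only the first 5 bytes, which is all a str5 could store
    gen Year_short = substr(Year_full, 1, 5)
    * Byte length of the full value (4 digits plus a 2-byte superscript)
    gen n_bytes = strlen(Year_full)
    * Count invalid UTF-8 sequences left behind in each version
    gen n_invalid_full  = ustrinvalidcnt(Year_full)
    gen n_invalid_short = ustrinvalidcnt(Year_short)
    list
    ```

    If this is what happened, declaring the variable as str6 or wider should let the superscript survive.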



    • #3
      A regex might resolve it by stripping out all non-digit characters, which is how Stata sees those superscripts. I think you need a newer version of Stata for the ustrregexra() function to be available (the Unicode string functions arrived in Stata 14).

      Code:
      replace Year = ustrregexra(Year, "\D+", "")
      or
      Code:
      replace  Year = substr(Year, 1, 4)
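
      If the original values should be kept, the same regex can write to a new variable instead of replacing Year. This is only a sketch on made-up data: Year_clean and year_num are hypothetical names, and it assumes the only non-digit characters are the unwanted superscripts.

      ```stata
      clear
      input str6 Year
      "1992"
      "1994¹"
      end
      * Strip non-digit characters into a new variable, leaving Year untouched
      gen Year_clean = ustrregexra(Year, "\D+", "")
      * Convert the cleaned string to a numeric year
      destring Year_clean, generate(year_num)
      list
      ```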



      • #4
        Thanks! Your code replaces the Year variable. How would I generate a superscript variable that indicates whether a superscript is present in Year?



        • #5
          Code:
          gen has_superscript_one = usubstr(Year,-1,1) == "¹"
          Code:
          . di ustrlen("¹")
          1
          
          . di strlen("¹")
          2
          or, more generally:
          Code:
          gen has_superscript = ustrregexm(Year,"^\d{4}\p{No}+$")
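
          The two lengths above suggest why the original substr() comparison returned 0 in every row: substr() counts bytes, and "¹" is stored as two bytes in UTF-8, so a single trailing byte can never equal the whole character. A small sketch on made-up data, with hypothetical variable names:

          ```stata
          clear
          input str6 Year
          "1992"
          "1994¹"
          end
          * Last byte only: never equal to the two-byte "¹"
          gen byte_match = substr(Year, -1, 1) == "¹"
          * Last Unicode character
          gen char_match = usubstr(Year, -1, 1) == "¹"
          * Last two bytes: also matches, but only for two-byte characters
          gen two_bytes = substr(Year, -2, 2) == "¹"
          list
          ```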
          Last edited by Bjarte Aagnes; 09 Nov 2023, 13:12.



          • #6
            Originally posted by paulvonhippel
            Thanks! Your code replaces the Year variable. How would I generate a superscript variable that indicates whether a superscript is present in Year?
            Either of these will do it, provided the position and number of strange characters stay the same.

            Code:
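            * Option 1: flag any value that is not exactly 4 bytes long
            gen supyes = strlen(Year)!=4

            * Option 2: flag any value with bytes after position 4
            gen supyes2 = !missing(substr(Year, 5, .))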



            • #7
              My collaborator Yujin Kwon reports that this works too:
              Code:
              gen superscript = strpos(Year,"¹")>0
              I like her solution because it looks explicitly for the superscript rather than counting on the string length, which might vary for other reasons....



              • #8
                Originally posted by paulvonhippel
                My collaborator Yujin Kwon reports that this works too:
                Code:
                gen superscript = strpos(Year,"¹")>0
                I like her solution because it looks explicitly for the superscript rather than counting on the string length, which might vary for other reasons....
                Presuming the superscript is always 1, it will work. The choice of solution depends on the data at hand.
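
                One way to check which case applies before choosing, sketched on made-up data: tabulate whatever follows the four-digit year, then flag any character in the Unicode "other number" class (\p{No}, the class used in #5) wherever it occurs. The names trailing and any_sup are hypothetical, and \p{No} also covers characters such as fractions, so it is broader than superscripts alone.

                ```stata
                clear
                input str8 Year
                "1992"
                "1994¹"
                "1996²"
                end
                * Everything after the 4th character; empty for a plain year
                gen trailing = usubstr(Year, 5, .)
                tabulate trailing, missing
                * Flag any "other number" character anywhere in the string
                gen any_sup = ustrregexm(Year, "\p{No}")
                list
                ```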

