Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to keep only the last 9 characters

    I have searched and found things close, but not quite what I need. Perhaps that's because I need it spelled out for me. I have the following variable, but I want to leave only the last 9 digits of each observation within that variable. How do I get rid of the "TX-extra numbers-"? I am assuming somewhere is substr, but I cannot get it right. Thank you!

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str19 stateschoolidpublicschool201819
    "TX-108913-108913118"
    "TX-108902-108902006"
    "TX-043914-043914005"
    "†"                
    "TX-246909-246909014"
    end

  • #2
    Code:
    g only9 = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,9)

    Comment


    • #3
      Thank you George! I just tried

      gen TXID = substr(stateschoolidpublicschool201819,-9,9) and it worked. But now you have me questioning my -9 instead of -8. I believe it worked though at first glance.

      Comment


      • #4
        with -9, I think you get the dash since you get 10 figures.

        Comment


        • #5
          The syntax used in post #3 is different than that used in post #2.
          Code:
          gen TXID = substr(stateschoolidpublicschool201819,-9,9)
          starts with the 9th character from the last, and takes a total of 9 characters.
          Code:
          g only9 = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,9)
          computes the length of the string and subtracts 8, giving the position of the 9th character from the last, and takes a total of 9 characters.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str19 stateschoolidpublicschool201819
          "TX-108913-108913118"
          "TX-108902-108902006"
          "TX-043914-043914005"
          "†"                
          "TX-246909-246909014"
          end
          gen TXID1a = substr(stateschoolidpublicschool201819,-9,9)
          gen TXID2a = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,9)
          gen TXID1b = substr(stateschoolidpublicschool201819,-9,.)
          gen TXID2b = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,.)
          list, clean
          Code:
          . list, clean
          
                 stateschooli~201819      TXID1a      TXID2a      TXID1b      TXID2b  
            1.   TX-108913-108913118   108913118   108913118   108913118   108913118  
            2.   TX-108902-108902006   108902006   108902006   108902006   108902006  
            3.   TX-043914-043914005   043914005   043914005   043914005   043914005  
            4.                     †                                                  
            5.   TX-246909-246909014   246909014   246909014   246909014   246909014

          Comment


          • #6
            Thank you both so much. I learned a lot. I've been using Stata since 2007, but never regularly (hopefully until now, this girl needs a job) so I feel like I am a forever novice. Again, thank you so much for your time.

            Comment


            • #7
              Stephanie Leigh the usefulness of substr() will depend on how messy your actual data is. If you want to make sure you capture the last 9 digits and ignore situations where the string has other endings, you might prefer to use regular expressions:

              Code:
              clear
              input str19 stateschoolidpublicschool201819
              "TX-108913-108913118"
              "TX-108902-108902006"
              "TX-043914-043914005"
              "†"              
              "TX-246909-246909014"
              "TX-123456-BLAH"
              end
              
              gen wanted1 = substr(stateschoolidpublicschool201819,-9,.)
              gen wanted2 = ustrregexs(0) if ustrregexm(stateschoolidpublicschool201819,"[0-9]{9}$")
              which produces:

              Code:
              . li, noobs sep(0)
              
                +---------------------------------------------+
                | stateschooli~201819     wanted1     wanted2 |
                |---------------------------------------------|
                | TX-108913-108913118   108913118   108913118 |
                | TX-108902-108902006   108902006   108902006 |
                | TX-043914-043914005   043914005   043914005 |
                |                   †                         |
                | TX-246909-246909014   246909014   246909014 |
                |      TX-123456-BLAH   3456-BLAH             |
                +---------------------------------------------+

              Comment


              • #8
                Ok, I got what I needed (the last 9 from the id) However, I needed them in numeric format, so after creating TXID_n, I did TXID_n == real(TXID) and some of the real values have the last digit changed?? I have never run into this using real before. Please forgive me, I attempted dataex and kept getting error messages to try and show what I wanted to show.


                +---------------------------------------------------+
                | TXID TXID_n |
                |---------------------------------------------------|
                3527. | 077902041 77902040 |
                +---------------------------------------------------+

                Comment


                • #9
                  Your problem is one of precision. By default TXID_n is created as a float variable. A float variable can hold with full accuracy at most a 7-digit integer. You need to use a long variable (or a double) to store a 9-digit integer with full precision.
                  Code:
                  generate long TXID_n == real(TXID)
                  Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.
                  byte - 7 bits -127 100
                  int - 15 bits -32,767 32,740
                  long - 31 bits -2,147,483,647 2,147,483,620
                  float - 24 bits -16,777,216 16,777,216
                  double - 53 bits -9,007,199,254,740,992 9,007,199,254,740,992

                  Comment


                  • #10
                    This is an issue with precision. With 9 digits, you need at least a long variable. Just do
                    Code:
                    gen long TXID_n = real(TXID)
                    See also

                    Code:
                    help data_types

                    Comment


                    • #11
                      Code:
                      gen only9 = substr(stateschoolidpublicschool201819,-9,.)
                      destring only9 , force replace
                      format %20.0f only9

                      Comment

                      Working...
                      X