How to keep only the last 9 characters

Stephanie Leigh

Join Date: Jan 2021

Posts: 24
#1

How to keep only the last 9 characters

03 Oct 2022, 07:25

I have searched and found things close, but not quite what I need. Perhaps that's because I need it spelled out for me. I have the following variable, but I want to leave only the last 9 digits of each observation within that variable. How do I get rid of the "TX-extra numbers-"? I am assuming somewhere is substr, but I cannot get it right. Thank you!

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str19 stateschoolidpublicschool201819 "TX-108913-108913118" "TX-108902-108902006" "TX-043914-043914005" "†" "TX-246909-246909014" end
Tags: None

George Ford

Join Date: Aug 2014
Posts: 3148

03 Oct 2022, 07:34

Code:

g only9 = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,9)

Comment

Stephanie Leigh

Join Date: Jan 2021

Posts: 24
#3

03 Oct 2022, 07:48

Thank you George! I just tried

gen TXID = substr(stateschoolidpublicschool201819,-9,9) and it worked. But now you have me questioning my -9 instead of -8. I believe it worked though at first glance.
1 like
Comment
George Ford

Join Date: Aug 2014

Posts: 3148
#4

03 Oct 2022, 08:13

with -9, I think you get the dash since you get 10 figures.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

03 Oct 2022, 08:29

The syntax used in post #3 is different than that used in post #2.

Code:

gen TXID = substr(stateschoolidpublicschool201819,-9,9)

starts with the 9th character from the last, and takes a total of 9 characters.

Code:

g only9 = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,9)

computes the length of the string and subtracts 8, giving the position of the 9th character from the last, and takes a total of 9 characters.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str19 stateschoolidpublicschool201819
"TX-108913-108913118"
"TX-108902-108902006"
"TX-043914-043914005"
"†"                
"TX-246909-246909014"
end
gen TXID1a = substr(stateschoolidpublicschool201819,-9,9)
gen TXID2a = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,9)
gen TXID1b = substr(stateschoolidpublicschool201819,-9,.)
gen TXID2b = substr(stateschoolidpublicschool201819,strlen(stateschoolidpublicschool201819)-8,.)
list, clean

Code:

. list, clean

       stateschooli~201819      TXID1a      TXID2a      TXID1b      TXID2b  
  1.   TX-108913-108913118   108913118   108913118   108913118   108913118  
  2.   TX-108902-108902006   108902006   108902006   108902006   108902006  
  3.   TX-043914-043914005   043914005   043914005   043914005   043914005  
  4.                     †                                                  
  5.   TX-246909-246909014   246909014   246909014   246909014   246909014

Comment

Stephanie Leigh

Join Date: Jan 2021

Posts: 24
#6

03 Oct 2022, 08:34

Thank you both so much. I learned a lot. I've been using Stata since 2007, but never regularly (hopefully until now, this girl needs a job) so I feel like I am a forever novice. Again, thank you so much for your time.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1396

03 Oct 2022, 08:36

Stephanie Leigh the usefulness of substr() will depend on how messy your actual data is. If you want to make sure you capture the last 9 digits and ignore situations where the string has other endings, you might prefer to use regular expressions:

Code:

clear
input str19 stateschoolidpublicschool201819
"TX-108913-108913118"
"TX-108902-108902006"
"TX-043914-043914005"
"†"              
"TX-246909-246909014"
"TX-123456-BLAH"
end

gen wanted1 = substr(stateschoolidpublicschool201819,-9,.)
gen wanted2 = ustrregexs(0) if ustrregexm(stateschoolidpublicschool201819,"[0-9]{9}$")

which produces:

Code:

. li, noobs sep(0)

  +---------------------------------------------+
  | stateschooli~201819     wanted1     wanted2 |
  |---------------------------------------------|
  | TX-108913-108913118   108913118   108913118 |
  | TX-108902-108902006   108902006   108902006 |
  | TX-043914-043914005   043914005   043914005 |
  |                   †                         |
  | TX-246909-246909014   246909014   246909014 |
  |      TX-123456-BLAH   3456-BLAH             |
  +---------------------------------------------+

Comment

Stephanie Leigh

Join Date: Jan 2021

Posts: 24
#8

03 Oct 2022, 09:57

Ok, I got what I needed (the last 9 from the id) However, I needed them in numeric format, so after creating TXID_n, I did TXID_n == real(TXID) and some of the real values have the last digit changed?? I have never run into this using real before. Please forgive me, I attempted dataex and kept getting error messages to try and show what I wanted to show.

+---------------------------------------------------+
| TXID TXID_n |
|---------------------------------------------------|
3527. | 077902041 77902040 |
+---------------------------------------------------+
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

03 Oct 2022, 10:06

Your problem is one of precision. By default TXID_n is created as a float variable. A float variable can hold with full accuracy at most a 7-digit integer. You need to use a long variable (or a double) to store a 9-digit integer with full precision.

Code:

generate long TXID_n == real(TXID)

Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.

byte - 7 bits	-127	100
int - 15 bits	-32,767	32,740
long - 31 bits	-2,147,483,647	2,147,483,620
float - 24 bits	-16,777,216	16,777,216
double - 53 bits	-9,007,199,254,740,992	9,007,199,254,740,992

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1396
#10

03 Oct 2022, 10:07

This is an issue with precision. With 9 digits, you need at least a long variable. Just do

Code:

gen long TXID_n = real(TXID)

See also

Code:

help data_types
Comment

George Ford

Join Date: Aug 2014
Posts: 3148

#11

03 Oct 2022, 10:10

Code:

gen only9 = substr(stateschoolidpublicschool201819,-9,.)
destring only9 , force replace
format %20.0f only9

Announcement