Copying one variable into default frame

Girish Venkataraman

Join Date: Dec 2021
Posts: 281

28 Aug 2023, 19:41

Hello all,

I need to grab one column from an imported text file into my default frame which has the same number of observations in sequence as the column I want to import. I moved the txt into a new frame (phantm below), cleaned up the one variable called 'phantmx' and need to copy paste it into the default frame which contains all other variables. frame copy and frame append don't seem to be the right options. Would appreciate input on how to import besides brute copy paste. Perhaps I don't need frames at all? I have this code nested in the middle of cleaning my default frame (dataex-ed at the bottom here).

My code so far:

Code:

frame create phantm
frame phantm: import delimited "tp53long_phantm.txt", clear 
frame change phantm
gen phantmx = substr(phantm,1,6)  // misses the 1.3 in obs 8; may need ustrregexs() instead
destring phantmx, replace force

frame phantm below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int obs str16 phantm double phantmx
 1 " 0.543 ± 0.504"  .543
 1 ""                    .
 1 ""                    .
 2 " 0.636 ± 0.304"  .636
 2 ""                    .
 2 ""                    .
 3 " 0.889 ± 0.107"  .889
 3 ""                    .
 3 ""                    .
 4 " 1.025 ± 0.388" 1.025
 4 ""                    .
 4 ""                    .
 5 ""                    .
 5 " 1.38 ± 0.206"   1.38
 5 ""                    .
 6 " 1.267 ± 0.109" 1.267
 6 ""                    .
 6 ""                    .
 7 " 1.301 ± 0.168" 1.301
 7 ""                    .
 7 ""                    .
 8 " 1.3 ± 0.156"       .
 8 " 0.611 ± 0.349"  .611
 8 ""                    .
 9 " 1.644 ± 0.296" 1.644
 9 ""                    .
 9 ""                    .
10 " 0.84 ± 0.444"    .84
10 ""                    .
10 ""                    .
11 " 0.698 ± 1.021"  .698
11 ""                    .
11 ""                    .
12 " 1.235 ± 0.062" 1.235
12 ""                    .
12 ""                    .
13 " 1.102 ± 0.259" 1.102
13 ""                    .
13 ""                    .
14 ""                    .
14 " 1.52 ± 0.195"   1.52
14 ""                    .
15 ""                    .
15 " 1.102 ± 0.259" 1.102
15 ""                    .
16 " 0.952 ± 0.301"  .952
16 ""                    .
16 ""                    .
17 ""                    .
17 ""                    .
17 ""                    .
18 " 1.221 ± 0.111" 1.221
18 " 0.512 ± 0.034"  .512
18 ""                    .
19 " 1.241 ± 0.221" 1.241
19 ""                    .
19 ""                    .
20 " 0.15 ± 0.013"    .15
20 ""                    .
20 ""                    .
21 " 1.38 ± 0.206"   1.38
21 ""                    .
21 ""                    .
22 " 1.025 ± 0.388" 1.025
22 ""                    .
22 ""                    .
23 " 1.486 ± 0.433" 1.486
23 ""                    .
23 ""                    .
24 " 0.512 ± 0.166"  .512
24 ""                    .
24 ""                    .
25 " 1.05 ± 0.357"   1.05
25 ""                    .
25 ""                    .
26 " 1.217 ± 0.194" 1.217
26 ""                    .
26 ""                    .
27 ""                    .
27 ""                    .
27 ""                    .
28 " 0.877 ± 0.358"  .877
28 ""                    .
28 ""                    .
29 " 1.138 ± 0.301" 1.138
29 ""                    .
29 ""                    .
30 " 1.095 ± 0.264" 1.095
30 ""                    .
30 ""                    .
31 " 0.729 ± 0.439"  .729
31 ""                    .
31 ""                    .
32 " 1.351 ± 0.118" 1.351
32 " 1.328 ± 0.081" 1.328
32 ""                    .
33 " 1.106 ± 0.24"  1.106
33 ""                    .
33 ""                    .
34 " 0.358 ± 0.287"  .358
end

My default frame:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float obs str43 mut
 1 "TP53 c.842A>T, p.D281V"               
 1 ""                                     
 1 ""                                     
 2 "TP53 c.586C>T, p.R196*"               
 2 ""                                     
 2 ""                                     
 3 "TP53 c.376T>C, p.Y126H"               
 3 ""                                     
 3 ""                                     
 4 "TP53 c.524G>A, p.R175H"               
 4 ""                                     
 4 ""                                     
 5 "TP53 c.569del, p.P190Lfs*57"          
 5 "TP53 c.814G>A, p.V272M"               
 5 ""                                     
 6 "TP53 c.401T>C, p.F134S"               
 6 ""                                     
 6 ""                                     
 7 "TP53 c.658T>A, p.Y220N"               
 7 ""                                     
 7 ""                                     
 8 "TP53 c.392A>T, p.N131I"               
 8 "TP53 c.637C>T, p.R213*"               
 8 ""                                     
 9 "TP53 c.832C>G, p.P278A"               
 9 ""                                     
 9 ""                                     
10 "TP53 c.527G>T, p.C176F"               
10 "TP53 Loss - Equivocal"                
10 ""                                     
11 "TP53 c.1024C>T, p.R342*"              
11 ""                                     
11 ""                                     
12 "TP53 c.772G>A, p.E258K"               
12 "TP53 c.331_332dup, p.G112Wfs*12"      
12 ""                                     
13 "TP53 c.659A>G, p.Y220C"               
13 ""                                     
13 ""                                     
14 "TP53 c.375+1G>C, p.?"                 
14 "TP53 c.713G>A, p.C238Y"               
14 ""                                     
15 "TP53 c.636_639delinsCGG, p.R213Gfs*34"
15 "TP53 c.659A>G, p.Y220C"               
15 ""                                     
16 "TP53 c.742C>T, p.R248W"               
16 "TP53 Loss Equivocal"                  
16 ""                                     
17 "TP53 c.713_717delinsTGT, p.C238Lfs*2" 
17 ""                                     
17 ""                                     
18 "TP53 c.818G>A, p.R273H"               
18 "TP53 c.358A>G, p.K120E"               
18 ""                                     
19 "TP53 c.337T>G, p.F113V"               
19 "TP53 c.636del, p.R213Dfs*34"          
19 ""                                     
20 "TP53 c.493C>T, p.Q165*"               
20 ""                                     
20 ""                                     
21 "TP53 c.814G>A, p.V272M"               
21 ""                                     
21 ""                                     
22 "TP53 c.524G>A, p.R175H"               
22 ""                                     
22 ""                                     
23 "TP53 c.537T>G, p.H179Q"               
23 ""                                     
23 ""                                     
24 "TP53 c.830G>T, p.C277F"               
24 ""                                     
24 ""                                     
25 "TP53 c.730G>A, p.G244S"               
25 ""                                     
25 ""                                     
26 "TP53 c.584T>C, p.I195T"               
26 ""                                     
26 ""                                     
27 "TP53 c.993+1G>A, p.?"                 
27 "TP53 c.429del, p.Q144Sfs*26"          
27 ""                                     
28 "TP53 c.839G>A, p.R280K"               
28 ""                                     
28 ""                                     
29 "TP53 c.700T>A, p.Y234N"               
29 ""                                     
29 ""                                     
30 "TP53 c.715A>G, p.N239D"               
30 ""                                     
30 ""                                     
31 "TP53 c.730G>T, p.G244C"               
31 ""                                     
31 ""                                     
32 "TP53 c.503A>G, p.H168R"               
32 "TP53 c.643A>G, p.S215G"               
32 ""                                     
33 "TP53 c.797G>A, p.G266E"               
33 ""                                     
33 ""                                     
34 "TP53 c.725G>T, p.C242F"               
end

Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

28 Aug 2023, 20:36

Code:

frame phantm: gen `c(obs_t)' obs_no = _n
gen `c(obs_t)' obs_no = _n
frlink 1:1 obs_no, frame(phantm)
frget phantmx, from(phantm)

The key here is to create an obs_no variable in both frames to serve as the link between them.

That said, you have to be very confident of the data management that created these data sets to be sure that the observations match up in the order given. Usually there is some other way to create other variables that are natural to the data to do this. For example it looks like in both sets things run in groups of 3 (except for the last observation) and these groups are identified by obs and mut, respectively. If the sequencing of the observations within the groups of three is based on, say, chronological order, or something like that, it would be better to include that chronological 1/2/3 variable in both data sets and then link the frames with obs matching mut and the two frames' chronological variables matching.
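A minimal sketch of that kind of link, assuming the rows within each obs group are already in a meaningful order in both frames. The toy values and the variable names row and seq are made up for illustration and are not part of the original data.

```stata
clear all

* Toy stand-in for the default frame (values are hypothetical)
input float obs str8 mut
1 "mutA"
1 ""
2 "mutB"
2 "mutC"
end
* Record the current row order, then number rows within each obs group.
* Sorting on row inside the group keeps the original order deterministic.
gen long row = _n
bysort obs (row): gen byte seq = _n

* Toy stand-in for the phantm frame
frame create phantm
frame change phantm
input int obs double phantmx
1 .543
1 .
2 .636
2 .611
end
gen long row = _n
bysort obs (row): gen byte seq = _n
frame change default

* Link on the group identifier (obs) plus position within group (seq)
frlink 1:1 obs seq, frame(phantm)
frget phantmx, from(phantm)
list obs seq mut phantmx, sepby(obs)
```

If -frlink- reports unmatched observations here, that is a useful warning that the two frames do not line up the way you assumed.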

Finally, I noticed you use -destring, force- to create phantmx. The use of -force- options is dangerous and in most circumstances should be avoided. In the example data, there is one observation (the 22nd) where -substr(phantm, 1, 6)- does not turn out to be a number. In that case, phantm is 1.3 ± 0.156, and the first six characters are " 1.3 ±", and that ± is messing things up. By using -force-, the value of phantmx you calculate is missing value; but surely the correct answer is 1.3, right? Perhaps there are others like this in your full data set. Here's a better way to calculate phantmx that does not leave you with incomplete, incorrect values:

Code:

split phantm, gen(part) parse("±") destring
rename part1 phantmx
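One way to see what -force- silently discards is to count the rows where the string is non-empty but the number came out missing. The sketch below uses a few rows copied from the example data; the variable name forced is hypothetical.

```stata
clear
input str16 phantm
" 0.543 ± 0.504"
""
" 1.3 ± 0.156"
" 1.38 ± 0.206"
end

* Original approach: first six characters, then destring with force
gen forced = substr(phantm, 1, 6)
destring forced, replace force

* Non-empty strings that still ended up missing are the casualties of force
* (in the thread's data this flags the " 1.3 ± 0.156" row)
count if missing(forced) & trim(phantm) != ""
list phantm forced if missing(forced) & trim(phantm) != ""

* split-based approach: parse on the ± sign and convert without force
split phantm, gen(part) parse("±") destring
rename part1 phantmx
count if missing(phantmx) & trim(phantm) != ""
```

Running the same -count- after any string-to-number conversion is a cheap check that nothing was lost.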
Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#3

28 Aug 2023, 20:54

Thanks much, Clyde Schechter. I did have a seq variable for running observation numbers for both data frames. I had always wondered in some of the Statalist posts what the `c(obs_t)' did really. Is

Code:

gen obs_no = _n

vs.

Code:

gen `c(obs_t)' obs_no = _n

not the same?

I will read up on that -frlink- and -frget-. I kept looking up help for frame copy and frame put.

And I was wondering how you got the +- with the minus below the + in the parse field. Was it perhaps a copy/paste or an underline below +?
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#4

29 Aug 2023, 08:53

About `c(obs_t)':

-gen obs_no = _n-, because it does not specify a data storage type, will default to float. If the number of observations in the data set is sufficiently small (up to 16,777,216, which is 2^24) this will work just fine. But if the number is larger than that, a float does not have enough precision to hold every sequential ID number exactly, and you end up with some different observations having the same value of obs_no, which, evidently, defeats the purpose.

Now, you can do better by specifying -gen long obs_no = _n- or -gen double obs_no = _n-, as a long can handle 9 digits, and a double 16. But if your data set is not that large, this is wasteful of memory. The system variable c(obs_t) contains a storage type which is large enough to create a correct obs_no variable for the data currently in memory, without wasting memory. By using -gen `c(obs_t)' obs_no = _n-, you will neither waste memory, nor end up with an incorrect obs_no variable no matter the size of your data set. And, you don't even have to know how big the data set is (will be) at the time you write the code, because Stata evaluates `c(obs_t)' at the time you actually use it.
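A small sketch of the float problem, using made-up variable names. It relies on 16,777,216 (2^24) being the point beyond which a float can no longer represent every integer exactly, so the offset below is chosen to straddle that value without needing millions of observations.

```stata
clear
set obs 4

* Values 16,777,215 through 16,777,218, stored two ways
gen float as_float = 16777214 + _n
gen long  as_long  = 16777214 + _n
format as_float as_long %12.0f
list

* Adjacent rows that collapsed to the same value in the float version
count if as_float == as_float[_n-1] & _n > 1
count if as_long  == as_long[_n-1]  & _n > 1

* The storage type Stata suggests for an observation counter right now;
* with only 4 observations this should be a small integer type
display "`c(obs_t)'"
```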

Last edited by Clyde Schechter; 29 Aug 2023, 08:59.
Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#5

29 Aug 2023, 13:34

That made it pretty clear, Clyde Schechter . Will make it a point to use it going forward.