  • Copying one variable into default frame

    Hello all,

    I need to bring one column from an imported text file into my default frame, which has the same number of observations, in the same order, as the column I want to import. I imported the text file into a new frame (phantm, shown below), cleaned up the one variable I need, called 'phantmx', and now need to copy it into the default frame, which contains all my other variables. frame copy and frame append don't seem to be the right tools. I would appreciate input on how to do this other than by brute-force copy and paste. Perhaps I don't need frames at all? This code sits in the middle of the cleaning of my default frame (shown via dataex at the bottom).


    My code so far:

    Code:
    frame create phantm
    frame phantm: import delimited "tp53long_phantm.txt", clear 
    frame change phantm
    gen phantmx = substr(phantm,1,6)  // misses 1.3 in obs 8; may need ustrregexs() instead
    destring phantmx, replace force

    frame phantm below:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int obs str16 phantm double phantmx
     1 " 0.543 ± 0.504"  .543
     1 ""                    .
     1 ""                    .
     2 " 0.636 ± 0.304"  .636
     2 ""                    .
     2 ""                    .
     3 " 0.889 ± 0.107"  .889
     3 ""                    .
     3 ""                    .
     4 " 1.025 ± 0.388" 1.025
     4 ""                    .
     4 ""                    .
     5 ""                    .
     5 " 1.38 ± 0.206"   1.38
     5 ""                    .
     6 " 1.267 ± 0.109" 1.267
     6 ""                    .
     6 ""                    .
     7 " 1.301 ± 0.168" 1.301
     7 ""                    .
     7 ""                    .
     8 " 1.3 ± 0.156"       .
     8 " 0.611 ± 0.349"  .611
     8 ""                    .
     9 " 1.644 ± 0.296" 1.644
     9 ""                    .
     9 ""                    .
    10 " 0.84 ± 0.444"    .84
    10 ""                    .
    10 ""                    .
    11 " 0.698 ± 1.021"  .698
    11 ""                    .
    11 ""                    .
    12 " 1.235 ± 0.062" 1.235
    12 ""                    .
    12 ""                    .
    13 " 1.102 ± 0.259" 1.102
    13 ""                    .
    13 ""                    .
    14 ""                    .
    14 " 1.52 ± 0.195"   1.52
    14 ""                    .
    15 ""                    .
    15 " 1.102 ± 0.259" 1.102
    15 ""                    .
    16 " 0.952 ± 0.301"  .952
    16 ""                    .
    16 ""                    .
    17 ""                    .
    17 ""                    .
    17 ""                    .
    18 " 1.221 ± 0.111" 1.221
    18 " 0.512 ± 0.034"  .512
    18 ""                    .
    19 " 1.241 ± 0.221" 1.241
    19 ""                    .
    19 ""                    .
    20 " 0.15 ± 0.013"    .15
    20 ""                    .
    20 ""                    .
    21 " 1.38 ± 0.206"   1.38
    21 ""                    .
    21 ""                    .
    22 " 1.025 ± 0.388" 1.025
    22 ""                    .
    22 ""                    .
    23 " 1.486 ± 0.433" 1.486
    23 ""                    .
    23 ""                    .
    24 " 0.512 ± 0.166"  .512
    24 ""                    .
    24 ""                    .
    25 " 1.05 ± 0.357"   1.05
    25 ""                    .
    25 ""                    .
    26 " 1.217 ± 0.194" 1.217
    26 ""                    .
    26 ""                    .
    27 ""                    .
    27 ""                    .
    27 ""                    .
    28 " 0.877 ± 0.358"  .877
    28 ""                    .
    28 ""                    .
    29 " 1.138 ± 0.301" 1.138
    29 ""                    .
    29 ""                    .
    30 " 1.095 ± 0.264" 1.095
    30 ""                    .
    30 ""                    .
    31 " 0.729 ± 0.439"  .729
    31 ""                    .
    31 ""                    .
    32 " 1.351 ± 0.118" 1.351
    32 " 1.328 ± 0.081" 1.328
    32 ""                    .
    33 " 1.106 ± 0.24"  1.106
    33 ""                    .
    33 ""                    .
    34 " 0.358 ± 0.287"  .358
    end
    My default frame:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float obs str43 mut
     1 "TP53 c.842A>T, p.D281V"               
     1 ""                                     
     1 ""                                     
     2 "TP53 c.586C>T, p.R196*"               
     2 ""                                     
     2 ""                                     
     3 "TP53 c.376T>C, p.Y126H"               
     3 ""                                     
     3 ""                                     
     4 "TP53 c.524G>A, p.R175H"               
     4 ""                                     
     4 ""                                     
     5 "TP53 c.569del, p.P190Lfs*57"          
     5 "TP53 c.814G>A, p.V272M"               
     5 ""                                     
     6 "TP53 c.401T>C, p.F134S"               
     6 ""                                     
     6 ""                                     
     7 "TP53 c.658T>A, p.Y220N"               
     7 ""                                     
     7 ""                                     
     8 "TP53 c.392A>T, p.N131I"               
     8 "TP53 c.637C>T, p.R213*"               
     8 ""                                     
     9 "TP53 c.832C>G, p.P278A"               
     9 ""                                     
     9 ""                                     
    10 "TP53 c.527G>T, p.C176F"               
    10 "TP53 Loss - Equivocal"                
    10 ""                                     
    11 "TP53 c.1024C>T, p.R342*"              
    11 ""                                     
    11 ""                                     
    12 "TP53 c.772G>A, p.E258K"               
    12 "TP53 c.331_332dup, p.G112Wfs*12"      
    12 ""                                     
    13 "TP53 c.659A>G, p.Y220C"               
    13 ""                                     
    13 ""                                     
    14 "TP53 c.375+1G>C, p.?"                 
    14 "TP53 c.713G>A, p.C238Y"               
    14 ""                                     
    15 "TP53 c.636_639delinsCGG, p.R213Gfs*34"
    15 "TP53 c.659A>G, p.Y220C"               
    15 ""                                     
    16 "TP53 c.742C>T, p.R248W"               
    16 "TP53 Loss Equivocal"                  
    16 ""                                     
    17 "TP53 c.713_717delinsTGT, p.C238Lfs*2" 
    17 ""                                     
    17 ""                                     
    18 "TP53 c.818G>A, p.R273H"               
    18 "TP53 c.358A>G, p.K120E"               
    18 ""                                     
    19 "TP53 c.337T>G, p.F113V"               
    19 "TP53 c.636del, p.R213Dfs*34"          
    19 ""                                     
    20 "TP53 c.493C>T, p.Q165*"               
    20 ""                                     
    20 ""                                     
    21 "TP53 c.814G>A, p.V272M"               
    21 ""                                     
    21 ""                                     
    22 "TP53 c.524G>A, p.R175H"               
    22 ""                                     
    22 ""                                     
    23 "TP53 c.537T>G, p.H179Q"               
    23 ""                                     
    23 ""                                     
    24 "TP53 c.830G>T, p.C277F"               
    24 ""                                     
    24 ""                                     
    25 "TP53 c.730G>A, p.G244S"               
    25 ""                                     
    25 ""                                     
    26 "TP53 c.584T>C, p.I195T"               
    26 ""                                     
    26 ""                                     
    27 "TP53 c.993+1G>A, p.?"                 
    27 "TP53 c.429del, p.Q144Sfs*26"          
    27 ""                                     
    28 "TP53 c.839G>A, p.R280K"               
    28 ""                                     
    28 ""                                     
    29 "TP53 c.700T>A, p.Y234N"               
    29 ""                                     
    29 ""                                     
    30 "TP53 c.715A>G, p.N239D"               
    30 ""                                     
    30 ""                                     
    31 "TP53 c.730G>T, p.G244C"               
    31 ""                                     
    31 ""                                     
    32 "TP53 c.503A>G, p.H168R"               
    32 "TP53 c.643A>G, p.S215G"               
    32 ""                                     
    33 "TP53 c.797G>A, p.G266E"               
    33 ""                                     
    33 ""                                     
    34 "TP53 c.725G>T, p.C242F"               
    end

  • #2
    Code:
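    * Run this from the default frame (frame change default first, if needed)
    frame phantm: gen `c(obs_t)' obs_no = _n    // row number in the phantm frame
    gen `c(obs_t)' obs_no = _n                  // row number in the default frame
    frlink 1:1 obs_no, frame(phantm)            // link the two frames row by row
    frget phantmx, from(phantm)                 // copy phantmx across the link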
    The key here is to create an obs_no variable in both frames to serve as the link between them.
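
    A quick check after linking can catch rows that failed to match. This is only a sketch, and it assumes frlink was run without the generate() option, so that the link variable takes the frame's name, phantm. The check itself is an addition for illustration, not part of the answer above.
    Code:
    * The link variable holds the matching row number in phantm,
    * or missing where no match was found
    assert !missing(phantm)
    * Summarize the link (linked frame, match variables, unmatched counts)
    frlink describe phantm
    If the assert fails, the two frames do not line up one-to-one on obs_no and the positional match should not be trusted.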

    That said, you have to be very confident of the data management that created these data sets to be sure that the observations match up in the order given. Usually there is a way to link on variables that are natural to the data instead. For example, it looks like in both data sets the observations run in groups of 3 (except for the last group shown), and in both frames these groups are identified by obs. If the sequencing of the observations within the groups of three is based on, say, chronological order, or something like that, it would be better to include that chronological 1/2/3 variable in both data sets and then link the frames on obs together with that chronological variable.
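
    Linking on natural keys might look like the sketch below. It assumes both frames already contain obs plus a within-group order variable, hypothetically named seq here (no such variable appears in the posted examples), and that the default frame is the working frame.
    Code:
    * Confirm that (obs, seq) uniquely identifies rows in each frame
    frame phantm: isid obs seq
    isid obs seq
    * Link on the natural keys rather than on row position
    frlink 1:1 obs seq, frame(phantm)
    frget phantmx, from(phantm)
    The isid checks stop with an error if the key pair is not unique, which is the situation where a 1:1 link would be ambiguous.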

    Finally, I noticed you use -destring, force- to create phantmx. The use of -force- options is dangerous and in most circumstances should be avoided. In the example data, there is one observation (the 22nd) where -substr(phantm, 1, 6)- does not yield a number. There, phantm is " 1.3 ± 0.156", the first six characters are " 1.3 ±", and that ± is messing things up. With -force-, phantmx is silently set to missing for that observation; but surely the correct answer is 1.3, right? Perhaps there are others like this in your full data set. Here's a better way to calculate phantmx that does not leave you with missing or incorrect values:
    Code:
    split phantm, gen(part) parse("±") destring
    rename part1 phantmx
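
    Since the question mentions ustrregexs, a regular-expression route is sketched here as an alternative. The variable name phantmx_re and the pattern are assumptions for illustration. Note also that if phantmx still exists from the earlier substr() attempt, it would have to be dropped before the rename above can succeed.
    Code:
    * Capture the leading number (the part before the ±) with a Unicode regex
    gen double phantmx_re = real(ustrregexs(1)) if ustrregexm(phantm, "^\s*([0-9]+\.?[0-9]*)")
    * Every nonblank phantm should have produced a number
    assert !missing(phantmx_re) if ustrtrim(phantm) != ""
    The assert plays the role that -force- hides: it stops with an error if any nonblank string fails to parse.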


    • #3
      Thanks much, Clyde Schechter. I did have a seq variable for running observation numbers for both data frames. I had always wondered in some of the Statalist posts what the `c(obs_t)' did really. Is

      Code:
      gen obs_no = _n
      vs.
      Code:
      gen `c(obs_t)' obs_no = _n
      not the same?

      I will read up on that -frlink- and -frget-. I kept looking up help for frame copy and frame put.

      And I was wondering how you got the +- with the minus below the + in the parse field. Was it perhaps a copy/paste or an underline below +?


      • #4
        About `c(obs_t)':

        -gen obs_no = _n-, because it does not specify a storage type, will default to float. If the number of observations in the data set is sufficiently small (a float stores integers exactly only up to 16,777,216, which is 2^24) this will work just fine. But if the number is larger than that, a float does not have enough precision to hold all the digits of a sequential ID number and you end up with some different observations having the same value of obs_no, which, evidently, defeats the purpose.
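
        The precision limit can be seen directly with the float() function, which rounds its argument to float precision. A minimal sketch; the two test values are chosen for illustration:
        Code:
        * 16,777,217 is 2^24 + 1, which a float cannot represent exactly
        display %12.0f float(16777217)
        display %12.0f float(16777216)
        Both lines display 16777216, so two consecutive observation numbers stored as float would collide at that point.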

        Now, you can do better by specifying -gen long obs_no = _n- or -gen double obs_no = _n-, as a long can hold all integers of up to 9 digits, and a double all integers of up to 15. But if your data set is not that large, this is wasteful of memory. The c-class value c(obs_t) contains a storage type which is large enough to create a correct obs_no variable for the data currently in memory, without wasting memory. By using -gen `c(obs_t)' obs_no = _n-, you will neither waste memory, nor end up with an incorrect obs_no variable no matter the size of your data set. And, you don't even have to know how big the data set is (will be) at the time you write the code, because Stata evaluates `c(obs_t)' at the time you actually use it.
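
        To see what `c(obs_t)' expands to, something like the following could be run in a scratch session (it clears the data in memory). The observation counts are arbitrary; I would expect a smaller integer type for the first display than for the second, but that is worth verifying in your own Stata version.
        Code:
        clear
        set obs 50
        display "`c(obs_t)'"          // storage type chosen for 50 observations
        set obs 50000
        display "`c(obs_t)'"          // storage type chosen for 50,000 observations
        gen `c(obs_t)' obs_no = _n
        describe obs_no               // shows the storage type actually used
        isid obs_no                   // confirms the values are unique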
        Last edited by Clyde Schechter; 29 Aug 2023, 08:59.


        • #5
          That made it pretty clear, Clyde Schechter. Will make it a point to use it going forward.
