  • Stata removes all observations when an "if" condition is applied to very small values (such as 1.00E-30)

    Dear All,

    In genetics, the magnitude of a very small p-value is meaningful because it indicates how strong the evidence against the null hypothesis is. For this reason I entered values such as 1.00E-30 in Excel to represent 1.00 x 10^-30 for the p-value of the association between each SNP (genetic variant) and a phenotype (outcome), and then imported the data into Stata. An example of my data follows:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ID str16 SNP long P_value
     51 "rs1004058"        708
     52 "rs11704820"        59
     53 "rs131798"         727
     54 "rs131805"         523
     55 "rs140188"         717
     56 "rs140522"         114
     57 "rs146532334"      247
     58 "rs165722"         487
     59 "rs2904552"          7
     60 "rs3892097"        628
     61 "rs4680"           559
     62 "rs4822523"          7
     63 "rs5751777"          7
     64 "rs5751909"         14
     65 "rs60422751"       419
     66 "rs6151429"          7
     67 "rs61748567"       560
     68 "rs7290732"        297
     69 "rs73886794"       156
     70 "rs739296"         752
     71 "rs75303441"       249
     72 "rs8139070"        699
     73 "rs138730015"      712
     74 "rs139401390"      659
     75 "rs17830558"       163
     76 "rs9272729"        555
     77 "rs4664308 "       710
     78 "rs4927186"        304
     79 "rs12647735"       104
     80 "rs6997279"         35
     81 "chr19:48475266:I" 593
     82 "chr6:32525987:I"  513
     83 "rs73206603"       177
     84 "chr14:95189723:D" 385
     85 "rs117897666"      675
     86 "rs76262407"       246
     87 "rs10404821"       203
     88 "rs12472051"       385
     89 "rs73017308"       585
     90 "rs36025606"       368
     91 "rs11961816"       162
     92 "rs4977388"        272
     93 "rs141052170"      292
     94 "rs1143914"        317
     95 "rs7222331"         15
     96 "rs74796791"       499
     97 "rs158342"         635
     98 "rs81277664"       272
     99 "rs9942471"        269
    100 "rs768920"         685
    end
    label values P_value P_value
    label def P_value 7 "1.00E-20", modify
    label def P_value 14 "1.10E-03", modify
    label def P_value 15 "1.10E-07", modify
    label def P_value 35 "1.20E-07", modify
    label def P_value 59 "1.30E-12", modify
    label def P_value 104 "1.60E-04", modify
    label def P_value 114 "1.60E-32", modify
    label def P_value 156 "1.80E-95", modify
    label def P_value 162 "1.90E-06", modify
    label def P_value 163 "1.90E-08", modify
    label def P_value 177 "2.00E-04", modify
    label def P_value 203 "2.10E-06", modify
    label def P_value 246 "2.40E-06", modify
    label def P_value 247 "2.40E-12", modify
    label def P_value 249 "2.40E-52", modify
    label def P_value 269 "2.56E-11", modify
    label def P_value 272 "2.60E-06", modify
    label def P_value 292 "2.70E-06", modify
    label def P_value 297 "2.70E-91", modify
    label def P_value 304 "2.80E-05", modify
    label def P_value 317 "2.90E-06", modify
    label def P_value 368 "3.50E-06", modify
    label def P_value 385 "3.70E-06", modify
    label def P_value 419 "4.00E-30", modify
    label def P_value 487 "4.86E-02", modify
    label def P_value 499 "5.00E-06", modify
    label def P_value 513 "5.20E-06", modify
    label def P_value 523 "5.40E-128", modify
    label def P_value 555 "5.90E-27", modify
    label def P_value 559 "6.00E-03", modify
    label def P_value 560 "6.00E-04", modify
    label def P_value 585 "6.60E-06", modify
    label def P_value 593 "6.70E-06", modify
    label def P_value 628 "7.30E-01", modify
    label def P_value 635 "7.40E-06", modify
    label def P_value 659 "7.88E-09", modify
    label def P_value 675 "8.10E-03", modify
    label def P_value 685 "8.14E-06", modify
    label def P_value 699 "8.46E-08", modify
    label def P_value 708 "8.60E-17", modify
    label def P_value 710 "8.60E-29", modify
    label def P_value 712 "8.68E-07", modify
    label def P_value 717 "8.80E-13", modify
    label def P_value 727 "9.00E-19", modify
    label def P_value 752 "9.70E-12", modify
    I tried to remove observations with P_value >= 1.03E-05 (that is, 0.0000103). However, when I ran the following code, all observations were dropped instead of only those with P_value >= 1.03E-05.

    Code:
    drop if P_value >= 1.03E-05
    I have searched the resources I could find, but none of them gave me a solution. I would appreciate any suggestions for addressing this.

    Thank you. I look forward to your help.

  • #2
    You should look into the difference between values and value labels. What you see is a value label, but the expression

    drop if P_value >= 1.03E-05
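    will use the actual values. In your example the actual values are integers such as 7 and 752, all of which are larger than 1.03E-05, so every observation is dropped. See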

    Code:
    help label
    One way:

    Code:
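    * Copy the value labels (the text shown in the browser) into a string variable
    decode P_value, generate(pvallab)
    * Convert that string to a true numeric variable
    destring pvallab, replace
    * Compare against the real p-values
    drop if pvallab >= 1.03E-05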
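
    To see why the original command removed every observation, it can help to compare the stored codes with the decoded numbers before dropping anything. The following is a minimal sketch, assuming the variable P_value from the example above; pval_str and pval_num are hypothetical names introduced here for illustration.

    Code:
    * Stored codes: in the example these are integers such as 7 and 752
    summarize P_value
    * Every stored code is >= 1.03E-05, so this counts all observations
    count if P_value >= 1.03E-05
    * Build a true numeric copy of the labels (pval_str and pval_num are hypothetical names)
    decode P_value, generate(pval_str)
    destring pval_str, generate(pval_num)
    * Numeric missing values compare as larger than any number, so exclude them explicitly
    count if pval_num >= 1.03E-05 & !missing(pval_num)

    Running count before drop is a cheap way to confirm that the condition selects the observations you expect.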


    • #3
      Thank you so much, Andrew. It works now. One further question: my data already appeared to contain actual values such as 1.00E-20 (see my screenshot, Screenshot_Data_before_using dataex.png), so why did dataex automatically add labels with numbers such as 7?


      • #4
        "Just a further question is why my data was already in actual values with, for example, 1.00E-20"
        Your data wasn't in actual values. It just looks that way if you don't know what to look for. The whole purpose of having value labels is so that with discrete variables like, say, sex coded 1 = Male and 2 = Female, you can see, with your eyes, "Male" and "Female" instead of seeing 1 and 2 and having to remember or figure out which is which. But you can tell even from the screenshot that P_value is not in real values because it is showing up in blue. When Stata shows you actual numbers, they show up in black in the data browser. When you see blue, you are looking at value labels. -dataex- did not create those value labels: -dataex- just shows you what is already in your data set.
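
        A few commands make the same check without relying on the color shown in the browser. This is a minimal sketch, assuming the variable and its value label are both named P_value, as in the -dataex- output above.

        Code:
        * The "value label" column shows whether a label is attached to the variable
        describe P_value
        * Show the mapping from stored integers to the displayed text
        label list P_value
        * List a few observations with labels, then with the stored values
        list SNP P_value in 1/5
        list SNP P_value in 1/5, nolabel
        * Open the data browser showing stored values instead of labels
        browse, nolabel

        If the two listings differ, the variable holds codes with value labels rather than the numbers you see.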

        My best guess as to how this all went wrong is that this data was imported into Stata from some other data source and, for whatever reason, P_value was imported as a string variable instead of a number. (It is pretty common to see missing values represented as "NULL" in spreadsheets. The presence of the non-numeric characters N, U, and L in the variable would cause Stata to import that variable as a string, since NULL is not a number.) Then somebody who doesn't understand the difference between -encode- and -destring- made the mistake of -encode-ing the variable. That converted it from a string variable that looks, to human eyes, like numeric values, into a variable whose values are consecutive integers starting from 1, with value labels attached. Because the value labels are themselves representations of numbers, this tricks the unwary into thinking they are working with the real numeric values. This is just a misapplication of the -encode- command in a context where it should not be applied. There are, of course, other ways this situation could have arisen, but this is the most common.
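
        The difference between the two commands can be seen on a toy data set. This is a minimal sketch under assumed conditions: the variable names p_str, p_coded, and p_num and the four values are made up for illustration, and "NULL" stands in for whatever non-numeric marker the real source used.

        Code:
        clear
        input str8 p_str
        "1.00E-20"
        "7.30E-01"
        "NULL"
        "2.40E-06"
        end
        * -encode- numbers the distinct strings 1, 2, ... in sorted order and attaches the strings as value labels
        encode p_str, generate(p_coded)
        * Blank out the non-numeric marker so that -destring- can convert the rest
        replace p_str = "" if p_str == "NULL"
        * -destring- converts the text to the numbers it represents
        destring p_str, generate(p_num)
        * With labels, p_coded looks like the original text; without them it shows the integer codes
        list p_coded p_num
        list p_coded p_num, nolabel

        Only p_num can be used in a condition such as p_num >= 1.03E-05; p_coded would compare the integer codes.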


        • #5
          Thanks Clyde Schechter for your useful explanations. They really make sense, and now I'm able to understand them.
