  • Marking Incorrect gene name by looking up column of reference gene names

    Hello all:

    I have a reference list of 50,000 true gene names in the variable associatedgenename, and a dataset of about 150 entries, stored in long form in the variable test, in which some gene names are misspelt. How do I tag the rows with misspelt gene names by looking them up in associatedgenename?
    I played around with -levelsof-, but the code either halts at the first row or gives faulty results. In the example below, only "AiBG" should be tagged as 1 for being incorrect (A1BG is the correct form).

    levelsof test, local(testl)
    levelsof associatedgenename, local(gold)
    foreach v of local testl{
    gen incorrect = inlist(associatedgenename, `testl')
    recode incorrect (1=0) (0=1)
    }



    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str22 associatedgenename str4 test
    "5S_rRNA"     "7SK" 
    "5_8S_rRNA"   "AiBG"
    "7SK"         "A1CF"
    "A1BG"        "A2M" 
    "A1BG-AS1"    "7SK" 
    "A1CF"        "A1BG"
    "A2M"         ""    
    "A2M-AS1"     ""    
    "A2ML1"       ""    
    "A2ML1-AS1"   ""    
    "A2ML1-AS2"   ""    
    "A2MP1"       ""    
    "A3GALT2"     ""    
    "A4GALT"      ""    
    "A4GNT"       ""    
    "AA06"        ""    
    "AAAS"        ""    
    "AACS"        ""    
    "AACSP1"      ""    
    "AADAC"       ""    
    "AADACL2"     ""    
    "AADACL2-AS1" ""    
    "AADACL3"     ""    
    "AADACL4"     ""    
    "AADACP1"     ""    
    "AADAT"       ""    
    "AAED1"       ""    
    "AAGAB"       ""    
    "AAK1"        ""    
    end

  • #2
    You're probably looking for something along these lines.

    . 
    . version 17.0

    . 
    . clear *

    . 
    . quietly input str22 associatedgenename str4 test

    . 
    . *
    . * Begin here
    . *
    . frame put associatedgenename, into(TrueGenes)

    . 
    . quietly keep if !missing(test)

    . 
    . frlink m:1 test, frame(TrueGenes associatedgenename)
      (1 observation in frame default unmatched)

    . 
    . list test TrueGenes if missing(TrueGenes), noobs abbreviate(20)

      +------------------+
      | test   TrueGenes |
      |------------------|
      | AiBG           . |
      +------------------+

    . 
    . exit

    end of do-file


    .
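
If a 0/1 indicator is wanted rather than a listing, the same link can be turned into a flag. This is only a sketch under stated assumptions: the six-row toy data and the variable name incorrect are made up for illustration, and -frlink m:1- assumes that each reference name occurs only once in associatedgenename.

```stata
* Sketch only: the toy data and the variable name "incorrect" are assumptions
version 17.0
clear all
input str22 associatedgenename str4 test
"5S_rRNA" "7SK"
"7SK"     "AiBG"
"A1BG"    "A1CF"
"A1CF"    "A2M"
"A2M"     "A1BG"
"A2ML1"   ""
end

* Copy the reference names into their own frame
frame put associatedgenename, into(TrueGenes)

* Link each test name to the matching reference row, if there is one
frlink m:1 test, frame(TrueGenes associatedgenename)

* A nonmissing test name with no link is a misspelt one
generate byte incorrect = missing(TrueGenes) if !missing(test)
list test incorrect, noobs
```

Rows with an empty test value get a missing flag, so they are not counted as either correct or incorrect.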



    • #3
      Or alternatively:

      Code:
      . levelsof associatedgenename, local(gold)
      `"5S_rRNA"' `"5_8S_rRNA"' `"7SK"' `"A1BG"' `"A1BG-AS1"' `"A1CF"' `"A2M"' `"A2M-AS1"' `"A2ML1"' `"A
      > 2ML1-AS1"' `"A2ML1-AS2"' `"A2MP1"' `"A3GALT2"' `"A4GALT"' `"A4GNT"' `"AA06"' `"AAAS"' `"AACS"' `
      > "AACSP1"' `"AADAC"' `"AADACL2"' `"AADACL2-AS1"' `"AADACL3"' `"AADACL4"' `"AADACP1"' `"AADAT"' `"
      > AAED1"' `"AAGAB"' `"AAK1"'
      
      . gen correct = 0 if !missing(test)
      (23 missing values generated)
      
      . qui foreach l of local gold {
      . replace correct = 1 if test == "`l'"
      . }
      
      . gen incorrect = !correct if !missing(correct)
      (23 missing values generated)
      
      . list
      
           +-----------------------------------------+
           | associate~e   test   correct   incorr~t |
           |-----------------------------------------|
        1. |     5S_rRNA    7SK         1          0 |
        2. |   5_8S_rRNA   AiBG         0          1 |
        3. |         7SK   A1CF         1          0 |
        4. |        A1BG    A2M         1          0 |
        5. |    A1BG-AS1    7SK         1          0 |
           |-----------------------------------------|
        6. |        A1CF   A1BG         1          0 |
        7. |         A2M                .          . |
        8. |     A2M-AS1                .          . |
        9. |       A2ML1                .          . |
       10. |   A2ML1-AS1                .          . |
           |-----------------------------------------|
       11. |   A2ML1-AS2                .          . |
       12. |       A2MP1                .          . |
       13. |     A3GALT2                .          . |
       14. |      A4GALT                .          . |
       15. |       A4GNT                .          . |
           |-----------------------------------------|
       16. |        AA06                .          . |
       17. |        AAAS                .          . |
       18. |        AACS                .          . |
       19. |      AACSP1                .          . |
       20. |       AADAC                .          . |
           |-----------------------------------------|
       21. |     AADACL2                .          . |
       22. | AADACL2-AS1                .          . |
       23. |     AADACL3                .          . |
       24. |     AADACL4                .          . |
       25. |     AADACP1                .          . |
           |-----------------------------------------|
       26. |       AADAT                .          . |
       27. |       AAED1                .          . |
       28. |       AAGAB                .          . |
       29. |        AAK1                .          . |
           +-----------------------------------------+
      
      .



      • #4
        You can also try the user-written -inlist2-. It succeeds on the sample data; I am curious whether it will do the job on the full dataset.

        Code:
        . levelsof associatedgenename, local(gold) sep(,) clean
        5S_rRNA,5_8S_rRNA,7SK,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2ML1-AS1,A2ML1-AS2,A2MP1,A3GALT2,A4GAL
        > T,A4GNT,AA06,AAAS,AACS,AACSP1,AADAC,AADACL2,AADACL2-AS1,AADACL3,AADACL4,AADACP1,AADAT,AAED1,AAGA
        > B,AAK1
        
        . inlist2 test, values("`gold'") name(correct)
        
        . gen incorrect = !correct if !missing(correct)
        (23 missing values generated)
        
        . list
        
             +-----------------------------------------+
             | associate~e   test   correct   incorr~t |
             |-----------------------------------------|
          1. |     5S_rRNA    7SK         1          0 |
          2. |   5_8S_rRNA   AiBG         0          1 |
          3. |         7SK   A1CF         1          0 |
          4. |        A1BG    A2M         1          0 |
          5. |    A1BG-AS1    7SK         1          0 |
             |-----------------------------------------|
          6. |        A1CF   A1BG         1          0 |
          7. |         A2M                .          . |
          8. |     A2M-AS1                .          . |
          9. |       A2ML1                .          . |
         10. |   A2ML1-AS1                .          . |
             |-----------------------------------------|
         11. |   A2ML1-AS2                .          . |
         12. |       A2MP1                .          . |
         13. |     A3GALT2                .          . |
         14. |      A4GALT                .          . |
         15. |       A4GNT                .          . |
             |-----------------------------------------|
         16. |        AA06                .          . |
         17. |        AAAS                .          . |
         18. |        AACS                .          . |
         19. |      AACSP1                .          . |
         20. |       AADAC                .          . |
             |-----------------------------------------|
         21. |     AADACL2                .          . |
         22. | AADACL2-AS1                .          . |
         23. |     AADACL3                .          . |
         24. |     AADACL4                .          . |
         25. |     AADACP1                .          . |
             |-----------------------------------------|
         26. |       AADAT                .          . |
         27. |       AAED1                .          . |
         28. |       AAGAB                .          . |
         29. |        AAK1                .          . |
             +-----------------------------------------+
        
        .



        • #5
          Originally posted by Joro Kolev
          You can also try the user written -inlist2-. It seems to succeed on the sample data, I am curious whether it will do the job on the full dataset.

          Thanks very much for both versions. Both work on the small set of true genes, but on the full dataset the -levelsof- approach fails because there are too many levels; see the error below. I am using Stata/SE, and I am not sure what this means for the capabilities of my version.

          inlist2 test, values("`gold'") name(correct)
          macro substitution results in line that is too long
          The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters,
          which is calculated on the basis of set maxvar.

          You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP.

          The maximum line length is defined as 16 more than the maximum macro length, which is currently 645,200 characters. Each unit
          increase in set maxvar increases the length maximums by 129. The maximum value of set maxvar is 32,767. Thus, the maximum line
          length may be set up to 4,227,159 characters if you set maxvar to its largest value.
          r(920);
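
The figures in this message can be checked with a little arithmetic. The sketch below takes the 129-per-variable increment, the 16-character offset, and the 32,767 ceiling from the message itself; the current maxvar of 5,000 is an assumption (it is the Stata/SE default, but the post does not state it).

```stata
* Assumption: maxvar is currently 5,000 (the Stata/SE default); not stated in the post
local maxvar_now = 5000
local macro_now  = 645200    // current maximum macro length, from the message
local per_var    = 129       // growth per unit of maxvar, from the message
local maxvar_top = 32767     // largest allowed maxvar, from the message

* Maximum macro length once maxvar is raised to its ceiling
local macro_top = `macro_now' + `per_var' * (`maxvar_top' - `maxvar_now')

* The line-length limit is 16 more than the macro-length limit
local line_now = `macro_now' + 16
local line_top = `macro_top' + 16

display "line length limit now:             `line_now'"
display "line length limit at maxvar 32767: `line_top'"
```

Under that assumption the two results are 645216 and 4227159, matching the limits quoted in the message. Whether 50,000 gene names fit even under the larger limit depends on how long the names are, which is one reason to prefer an approach that does not put them in a macro.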



          • #6
            Originally posted by Joseph Coveney
            You're probably looking for something along these lines.

            That totally worked. Thanks!!
            I had never used -frames- much but this will inspire me to learn -frames- now.



            • #7
              The key to Joseph's solution is the merging. He does it using frames, but frames are not needed; the merge can also be done with two files. Here:

              Code:
              . save genes
              file genes.dta saved
              
              . drop test
              
              . save associatedgenename
              file associatedgenename.dta saved
              
              . use genes
              
              . drop associatedgenename
              
              . ren test associatedgenename
              
              . merge m:1 associatedgenename using "C:\StataWorkingDir\associatedgenename.dta"
              (variable associatedgenename was str4, now str22 to accommodate using data's values)
              
                  Result                      Number of obs
                  -----------------------------------------
                  Not matched                            49
                      from master                        24  (_merge==1)
                      from using                         25  (_merge==2)
              
                  Matched                                 5  (_merge==3)
                  -----------------------------------------
              
              . list if _merge==1 & !missing( associatedgenename )
              
                   +----------------------------+
                   | associ~e            _merge |
                   |----------------------------|
               29. |     AiBG   Master only (1) |
                   +----------------------------+
              
              .



              • #8
                Similarly, the second of the solutions I proposed (the one not using -inlist2-) does not depend on -levelsof- or on creating a macro. I used -levelsof- only to follow the OP's lead: since the OP was working on a -levelsof- solution, I presumed that his Stata could handle a macro with 50k levels.

                Here is the second solution that I proposed but without putting the levels in a macro:

                Code:
                . gen correct = 0 if !missing(test)
                (23 missing values generated)
                
                . count if !missing( associatedgenename )
                  29
                
                . qui forvalues i=1/`r(N)' {
                . replace correct = 1 if test == associatedgenename[`i'] & !missing(test)
                . }
                
                . gen incorrect = !correct if !missing(correct)
                (23 missing values generated)
                
                . list
                
                     +-----------------------------------------+
                     | associate~e   test   correct   incorr~t |
                     |-----------------------------------------|
                  1. |     5S_rRNA    7SK         1          0 |
                  2. |   5_8S_rRNA   AiBG         0          1 |
                  3. |         7SK   A1CF         1          0 |
                  4. |        A1BG    A2M         1          0 |
                  5. |    A1BG-AS1    7SK         1          0 |
                     |-----------------------------------------|
                  6. |        A1CF   A1BG         1          0 |
                  7. |         A2M                .          . |
                  8. |     A2M-AS1                .          . |
                  9. |       A2ML1                .          . |
                 10. |   A2ML1-AS1                .          . |
                     |-----------------------------------------|
                 11. |   A2ML1-AS2                .          . |
                 12. |       A2MP1                .          . |
                 13. |     A3GALT2                .          . |
                 14. |      A4GALT                .          . |
                 15. |       A4GNT                .          . |
                     |-----------------------------------------|
                 16. |        AA06                .          . |
                 17. |        AAAS                .          . |
                 18. |        AACS                .          . |
                 19. |      AACSP1                .          . |
                 20. |       AADAC                .          . |
                     |-----------------------------------------|
                 21. |     AADACL2                .          . |
                 22. | AADACL2-AS1                .          . |
                 23. |     AADACL3                .          . |
                 24. |     AADACL4                .          . |
                 25. |     AADACP1                .          . |
                     |-----------------------------------------|
                 26. |       AADAT                .          . |
                 27. |       AAED1                .          . |
                 28. |       AAGAB                .          . |
                 29. |        AAK1                .          . |
                     +-----------------------------------------+
                
                .



                • #9
                  Originally posted by Joro Kolev
                  The key to Joseph's solution is the merging. He does it using frames, but frames are not needed, the merging can be also done with two files. Here:
                  When I first drafted my reply, I used -merge- with a -tempfile- prepared from the dataset in between -preserve- and -restore-.

                  Before I posted it, I thought to myself, "But -merge- and jockeying a separate file are not needed; the join can be done more simply using frames."



                  • #10
                    I learnt quite a bit from this thread of replies, Joro Kolev and Joseph Coveney. Thanks, both. I will probably end up using a -preserve-/-restore- based solution, since the gene name check is mainly an assertion that the spellings in the existing data match the separate gold standard reference file before I export the file for further visualization.
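
A -preserve-/-restore- version of that check might look roughly like the sketch below. It is not code from the thread: the toy data, the tempfile name gold, and the variable name incorrect are assumptions, and the reference names are taken from a column of the same dataset rather than from a separate file. -merge m:1- also assumes that each reference name occurs only once.

```stata
* Sketch only: toy data, tempfile name "gold", and variable "incorrect" are assumptions
version 17.0
clear all
input str22 associatedgenename str4 test
"5S_rRNA" "7SK"
"7SK"     "AiBG"
"A1BG"    "A1CF"
"A1CF"    "A2M"
"A2M"     "A1BG"
"A2ML1"   ""
end

* Write the reference names to a temporary file, then return to the full data
tempfile gold
preserve
keep associatedgenename
rename associatedgenename test
save `gold'
restore

* Keep every row of the working data; _merge == 1 means no reference match
merge m:1 test using `gold', keep(master match)
generate byte incorrect = (_merge == 1) if !missing(test)
drop _merge

* Report misspellings without stopping the do-file
capture noisily assert incorrect != 1
display "assert return code: " _rc
list test if incorrect == 1, noobs
```

Dropping -capture noisily- makes the do-file stop at the first misspelling, which may be the preferred behaviour just before an export.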
