Marking Incorrect gene name by looking up column of reference gene names

Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#1

30 Jan 2023, 20:09

Hello all:

I have a reference list of 50,000 true gene names and a dataset of about 150 entries, stored in long form in the test variable, in which some gene names are misspelt. How do I tag the rows with misspelt gene names by looking them up in the associatedgenename variable, which holds the 50,000 reference names?
I played around with -levelsof-, but the code halts at the first row or gives something faulty. In the example below, only "AiBG" should be tagged as 1 for being incorrect (A1BG is the correct form).

levelsof test, local(testl)
levelsof associatedgenename, local(gold)
foreach v of local testl{
gen incorrect = inlist(associatedgenename, `testl')
recode incorrect (1=0) (0=1)
}

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str22 associatedgenename str4 test
"5S_rRNA"     "7SK"
"5_8S_rRNA"   "AiBG"
"7SK"         "A1CF"
"A1BG"        "A2M"
"A1BG-AS1"    "7SK"
"A1CF"        "A1BG"
"A2M"         ""
"A2M-AS1"     ""
"A2ML1"       ""
"A2ML1-AS1"   ""
"A2ML1-AS2"   ""
"A2MP1"       ""
"A3GALT2"     ""
"A4GALT"      ""
"A4GNT"       ""
"AA06"        ""
"AAAS"        ""
"AACS"        ""
"AACSP1"      ""
"AADAC"       ""
"AADACL2"     ""
"AADACL2-AS1" ""
"AADACL3"     ""
"AADACL4"     ""
"AADACP1"     ""
"AADAT"       ""
"AAED1"       ""
"AAGAB"       ""
"AAK1"        ""
end
Joseph Coveney

Join Date: Apr 2014

Posts: 4399
#2

30 Jan 2023, 22:05

You're probably looking for something along these lines.

.
. version 17.0

.
. clear *

.
. quietly input str22 associatedgenename str4 test

.
. *
. * Begin here
. *
. frame put associatedgenename, into(TrueGenes)

.
. quietly keep if !missing(test)

.
. frlink m:1 test, frame(TrueGenes associatedgenename)
  (1 observation in frame default unmatched)

.
. list test TrueGenes if missing(TrueGenes), noobs abbreviate(20)

  +------------------+
  | test   TrueGenes |
  |------------------|
  | AiBG           . |
  +------------------+

.
. exit

end of do-file

.
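
To tag the rows rather than list them, the same link variable can feed a flag. The following is a minimal sketch, assuming Stata 16 or later (frames) and a four-row toy subset of the example data; the flag name incorrect is taken from the original question, and the -duplicates drop- step is an added precaution in case the full reference list repeats a name.

```stata
* Sketch only: toy rows taken from the example in post #1
clear
input str22 associatedgenename str4 test
"7SK"  "7SK"
"A1BG" "AiBG"
"A1CF" "A1CF"
"A2M"  ""
end

* Copy the reference names into their own frame
frame put associatedgenename, into(TrueGenes)

* m:1 needs each reference name to appear only once in the linked frame
frame TrueGenes: duplicates drop associatedgenename, force

* Link each test name to its match in the reference frame
frlink m:1 test, frame(TrueGenes associatedgenename)

* A missing link means no reference name matched; blank rows stay missing
generate byte incorrect = missing(TrueGenes) if !missing(test)

list test incorrect if incorrect == 1, noobs
```

Unlike the listing above, this keeps the rows with an empty test value and leaves the flag missing for them.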

Joro Kolev

Join Date: Aug 2018
Posts: 3050

30 Jan 2023, 23:13

Or alternatively:

Code:

. levelsof associatedgenename, local(gold)
`"5S_rRNA"' `"5_8S_rRNA"' `"7SK"' `"A1BG"' `"A1BG-AS1"' `"A1CF"' `"A2M"' `"A2M-AS1"' `"A2ML1"' `"A
> 2ML1-AS1"' `"A2ML1-AS2"' `"A2MP1"' `"A3GALT2"' `"A4GALT"' `"A4GNT"' `"AA06"' `"AAAS"' `"AACS"' `
> "AACSP1"' `"AADAC"' `"AADACL2"' `"AADACL2-AS1"' `"AADACL3"' `"AADACL4"' `"AADACP1"' `"AADAT"' `"
> AAED1"' `"AAGAB"' `"AAK1"'

. gen correct = 0 if !missing(test)
(23 missing values generated)

. qui foreach l of local gold {
. replace correct = 1 if test == "`l'"
. }

. gen incorrect = !correct if !missing(correct)
(23 missing values generated)

. list

     +-----------------------------------------+
     | associate~e   test   correct   incorr~t |
     |-----------------------------------------|
  1. |     5S_rRNA    7SK         1          0 |
  2. |   5_8S_rRNA   AiBG         0          1 |
  3. |         7SK   A1CF         1          0 |
  4. |        A1BG    A2M         1          0 |
  5. |    A1BG-AS1    7SK         1          0 |
     |-----------------------------------------|
  6. |        A1CF   A1BG         1          0 |
  7. |         A2M                .          . |
  8. |     A2M-AS1                .          . |
  9. |       A2ML1                .          . |
 10. |   A2ML1-AS1                .          . |
     |-----------------------------------------|
 11. |   A2ML1-AS2                .          . |
 12. |       A2MP1                .          . |
 13. |     A3GALT2                .          . |
 14. |      A4GALT                .          . |
 15. |       A4GNT                .          . |
     |-----------------------------------------|
 16. |        AA06                .          . |
 17. |        AAAS                .          . |
 18. |        AACS                .          . |
 19. |      AACSP1                .          . |
 20. |       AADAC                .          . |
     |-----------------------------------------|
 21. |     AADACL2                .          . |
 22. | AADACL2-AS1                .          . |
 23. |     AADACL3                .          . |
 24. |     AADACL4                .          . |
 25. |     AADACP1                .          . |
     |-----------------------------------------|
 26. |       AADAT                .          . |
 27. |       AAED1                .          . |
 28. |       AAGAB                .          . |
 29. |        AAK1                .          . |
     +-----------------------------------------+

.


Joro Kolev

Join Date: Aug 2018
Posts: 3050

30 Jan 2023, 23:45

You can also try the user-written -inlist2-. It seems to succeed on the sample data; I am curious whether it will do the job on the full dataset.

Code:

. levelsof associatedgenename, local(gold) sep(,) clean
5S_rRNA,5_8S_rRNA,7SK,A1BG,A1BG-AS1,A1CF,A2M,A2M-AS1,A2ML1,A2ML1-AS1,A2ML1-AS2,A2MP1,A3GALT2,A4GAL
> T,A4GNT,AA06,AAAS,AACS,AACSP1,AADAC,AADACL2,AADACL2-AS1,AADACL3,AADACL4,AADACP1,AADAT,AAED1,AAGA
> B,AAK1

. inlist2 test, values("`gold'") name(correct)

. gen incorrect = !correct if !missing(correct)
(23 missing values generated)

. list

     +-----------------------------------------+
     | associate~e   test   correct   incorr~t |
     |-----------------------------------------|
  1. |     5S_rRNA    7SK         1          0 |
  2. |   5_8S_rRNA   AiBG         0          1 |
  3. |         7SK   A1CF         1          0 |
  4. |        A1BG    A2M         1          0 |
  5. |    A1BG-AS1    7SK         1          0 |
     |-----------------------------------------|
  6. |        A1CF   A1BG         1          0 |
  7. |         A2M                .          . |
  8. |     A2M-AS1                .          . |
  9. |       A2ML1                .          . |
 10. |   A2ML1-AS1                .          . |
     |-----------------------------------------|
 11. |   A2ML1-AS2                .          . |
 12. |       A2MP1                .          . |
 13. |     A3GALT2                .          . |
 14. |      A4GALT                .          . |
 15. |       A4GNT                .          . |
     |-----------------------------------------|
 16. |        AA06                .          . |
 17. |        AAAS                .          . |
 18. |        AACS                .          . |
 19. |      AACSP1                .          . |
 20. |       AADAC                .          . |
     |-----------------------------------------|
 21. |     AADACL2                .          . |
 22. | AADACL2-AS1                .          . |
 23. |     AADACL3                .          . |
 24. |     AADACL4                .          . |
 25. |     AADACP1                .          . |
     |-----------------------------------------|
 26. |       AADAT                .          . |
 27. |       AAED1                .          . |
 28. |       AAGAB                .          . |
 29. |        AAK1                .          . |
     +-----------------------------------------+

.


Girish Venkataraman

Join Date: Dec 2021
Posts: 281

31 Jan 2023, 05:25

Originally posted by Joro Kolev

You can also try the user-written -inlist2-. It seems to succeed on the sample data; I am curious whether it will do the job on the full dataset.

Thanks much for both versions. They both work on the small dataset of true genes. But with the full reference list, the local macro built by -levelsof- holds too many levels, and the -inlist2- call that expands it throws the error below. I am using Stata/SE, and I am not sure what this means for the capabilities of my version.

inlist2 test, values("`gold'") name(correct)
macro substitution results in line that is too long
The line resulting from substituting macros would be longer than allowed. The maximum allowed length is 645,216 characters,
which is calculated on the basis of set maxvar.

You can change that in Stata/SE and Stata/MP. What follows is relevant only if you are using Stata/SE or Stata/MP.

The maximum line length is defined as 16 more than the maximum macro length, which is currently 645,200 characters. Each unit
increase in set maxvar increases the length maximums by 129. The maximum value of set maxvar is 32,767. Thus, the maximum line
length may be set up to 4,227,159 characters if you set maxvar to its largest value.
r(920);


Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#6

31 Jan 2023, 05:27

Originally posted by Joseph Coveney

You're probably looking for something along these lines.

That totally worked. Thanks!!
I had never used -frames- much but this will inspire me to learn -frames- now.

Joro Kolev

Join Date: Aug 2018
Posts: 3050

31 Jan 2023, 20:48

The key to Joseph's solution is the merging. He does it using frames, but frames are not needed; the merging can also be done with two files. Here:

Code:

. save genes
file genes.dta saved

. drop test

. save associatedgenename
file associatedgenename.dta saved

. use genes

. drop associatedgenename

. ren test associatedgenename

. merge m:1 associatedgenename using "C:\StataWorkingDir\associatedgenename.dta"
(variable associatedgenename was str4, now str22 to accommodate using data's values)

    Result                      Number of obs
    -----------------------------------------
    Not matched                            49
        from master                        24  (_merge==1)
        from using                         25  (_merge==2)

    Matched                                 5  (_merge==3)
    -----------------------------------------

. list if _merge==1 & !missing( associatedgenename )

     +----------------------------+
     | associ~e            _merge |
     |----------------------------|
 29. |     AiBG   Master only (1) |
     +----------------------------+

.


Joro Kolev

Join Date: Aug 2018
Posts: 3050

31 Jan 2023, 21:16

Similarly, the second of the solutions I proposed (the one not using -inlist2-) does not depend on creating a macro or on -levelsof-. I used -levelsof- only to follow the lead of the OP; since the OP was working on a -levelsof- solution, I presumed that his Stata could handle a macro with 50k levels.

Here is the second solution that I proposed but without putting the levels in a macro:

Code:

. gen correct = 0 if !missing(test)
(23 missing values generated)

. count if !missing( associatedgenename )
  29

. qui forvalues i=1/`r(N)' {
. replace correct = 1 if test == associatedgenename[`i'] & !missing(test)
. }

. gen incorrect = !correct if !missing(correct)
(23 missing values generated)

. list

     +-----------------------------------------+
     | associate~e   test   correct   incorr~t |
     |-----------------------------------------|
  1. |     5S_rRNA    7SK         1          0 |
  2. |   5_8S_rRNA   AiBG         0          1 |
  3. |         7SK   A1CF         1          0 |
  4. |        A1BG    A2M         1          0 |
  5. |    A1BG-AS1    7SK         1          0 |
     |-----------------------------------------|
  6. |        A1CF   A1BG         1          0 |
  7. |         A2M                .          . |
  8. |     A2M-AS1                .          . |
  9. |       A2ML1                .          . |
 10. |   A2ML1-AS1                .          . |
     |-----------------------------------------|
 11. |   A2ML1-AS2                .          . |
 12. |       A2MP1                .          . |
 13. |     A3GALT2                .          . |
 14. |      A4GALT                .          . |
 15. |       A4GNT                .          . |
     |-----------------------------------------|
 16. |        AA06                .          . |
 17. |        AAAS                .          . |
 18. |        AACS                .          . |
 19. |      AACSP1                .          . |
 20. |       AADAC                .          . |
     |-----------------------------------------|
 21. |     AADACL2                .          . |
 22. | AADACL2-AS1                .          . |
 23. |     AADACL3                .          . |
 24. |     AADACL4                .          . |
 25. |     AADACP1                .          . |
     |-----------------------------------------|
 26. |       AADAT                .          . |
 27. |       AAED1                .          . |
 28. |       AAGAB                .          . |
 29. |        AAK1                .          . |
     +-----------------------------------------+

.


Joseph Coveney

Join Date: Apr 2014

Posts: 4399
#9

31 Jan 2023, 21:42

Originally posted by Joro Kolev

The key to Joseph's solution is the merging. He does it using frames, but frames are not needed; the merging can also be done with two files. Here:

When I first drafted my reply, I used -merge- with a -tempfile- prepared from the dataset in between -preserve- and -restore-.

Before I posted it, I thought to myself, "But -merge- and jockeying a separate file are not needed; the join can be done more simply using frames."
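
For comparison, the -merge- route with a -tempfile- between -preserve- and -restore- might look roughly like this. It is a sketch under assumptions, not the draft described above: the toy rows are a subset of the example data, and the names truegenes, reference and incorrect are chosen here for illustration. Run it as one do-file so that the local macro holding the temporary file name stays defined.

```stata
* Sketch only: toy rows taken from the example in post #1
clear
input str22 associatedgenename str4 test
"7SK"  "7SK"
"A1BG" "AiBG"
"A1CF" "A1CF"
"A2M"  ""
end

* Save the reference names to a temporary file, then bring the data back
tempfile truegenes
preserve
keep associatedgenename
drop if missing(associatedgenename)
duplicates drop
save `truegenes'
restore

* Set the reference column aside and merge on the name being checked
rename associatedgenename reference
generate str22 associatedgenename = test
merge m:1 associatedgenename using `truegenes', keep(master match)

* _merge == 1 means the name was not found in the reference list
generate byte incorrect = (_merge == 1) if !missing(test)

* Put the original column back in place
drop associatedgenename _merge
rename reference associatedgenename
```

The result is the same flag as with frames; the extra steps are the file handling and the renaming needed to line up the key variable.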
Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#10

01 Feb 2023, 07:08

I learnt quite a bit from this thread of replies, Joro Kolev and Joseph Coveney. Thanks, both. I will probably end up using a -preserve-/-restore- based solution, since the gene name check is mainly an assertion that the spellings in the existing data match the separate gold-standard reference file before I export the file for further visualization.
