Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Recoding SNP genotype data based on minor allele frequency

    Does anybody know of a way to recode SNP genotype data based on the minor allele frequency of that SNP? For example, if we have a column containing A/A, A/G, G/G genotypes and G is the minor allele, then we recode (or generate a new variable) such that A/A=0, A/G=1 and G/G=2. It's straightforward to just generate the a variable based on the levels of genotype, but what I can't tell is whether you can do this based on the frequency of the genotypes. In my understanding the convention is to have the homozygote of the common allele =0 and the homozygote of the rare allele =2 (and heterozygotes =1). So iterating over a bunch of SNPs based of different allele frequencies would be very convenient.

  • #2
    This might run slowly on huge genetic datasets, but maybe it will give you an idea. If you need to find out which of A G C T are in the string, check out -regexm-.

    Code:
    * setting up a dataset:
    clear
    set obs 100
    gen g="A/A" if _n<=60
    replace g="A/G" if _n>60 & _n<=90
    replace g="G/G" if _n>90
    tab g
    
    * counting starts here
    count if g=="A/A"
    local aa=r(N)
    count if g=="G/G"
    local gg=r(N)
    
    gen x=0
    if `aa' > `gg' {
        replace x=2 if g=="A/A"
        replace x=1 if g=="A/G" // need "G/A" line?
        }
    
    else if `aa' < `gg' {
        replace x=2 if g=="G/G"
        replace x=1 if g=="A/G" // need "G/A" line?
        }
    
    * are they ever equal?    If so, add code
    
    tab g x

    Comment


    • #3
      Laura's example made it really clear what was needed. Here's another way to do it:

      Code:
      * setting up a dataset:
      clear
      set obs 100
      gen g="A/A" if _n<=60
      replace g="A/G" if _n>60 & _n<=90
      replace g="G/G" if _n>90
      tab g
      
      * counting starts here
      bysort g : gen count = _N 
      bysort count g : gen x = _n == 1
      replace x = sum(x) - 1 
      tab g x

      Comment


      • #4
        much tidier, but it fails in this instance:

        Code:
        * setting up a dataset:
        clear
        set obs 100
        gen g="A/A" if _n<=10
        replace g="A/G" if _n>10 & _n<=60
        replace g="G/G" if _n>60
        tab g
        
        * counting starts here
        bysort g : gen count = _N
        bysort count g : gen x = _n == 1
        replace x = sum(x) - 1
        tab g x
        This results in A/G as 2 and G/G as 1.

        Comment


        • #5
          Reading this more carefully, I see that my code is missing the genetics here, as the heterozygote should always be in the middle. I stopped studying biology at age 15 in the last millennium, but let's try again.

          Code:
          * setting up a dataset:
          clear
          set obs 100
          gen g="A/A" if _n<=10
          replace g="A/G" if _n>10 & _n<=60
          replace g="G/G" if _n>60
          tab g
          
          * counting starts here
          gen x = 1 if g == "A/G"
          bysort g : gen count = cond(x == 1, -1, _N)
          sort count g
          replace x = cond(g == g[_N], 2, 0) if missing(x)
          tab g x
          The main trick is to sort the frequency of the more common homozygote to the end. Assigning -1 to the frequency of the heterozygote ensures that it cannot possibly be sorted to the end.

          Comment


          • #6
            nice!

            Comment

            Working...
            X