Recoding SNP genotype data based on minor allele frequency

Samir Kelada

Join Date: Jun 2014

Posts: 11
#1

Recoding SNP genotype data based on minor allele frequency

10 Jun 2015, 09:39

Does anybody know of a way to recode SNP genotype data based on the minor allele frequency of that SNP? For example, if we have a column containing A/A, A/G, G/G genotypes and G is the minor allele, then we recode (or generate a new variable) such that A/A=0, A/G=1 and G/G=2. It's straightforward to just generate the a variable based on the levels of genotype, but what I can't tell is whether you can do this based on the frequency of the genotypes. In my understanding the convention is to have the homozygote of the common allele =0 and the homozygote of the rare allele =2 (and heterozygotes =1). So iterating over a bunch of SNPs based of different allele frequencies would be very convenient.
Tags: None

Laura Gibbons

Join Date: Apr 2014
Posts: 25

11 Jun 2015, 14:00

This might run slowly on huge genetic datasets, but maybe it will give you an idea. If you need to find out which of A G C T are in the string, check out -regexm-.

Code:

* setting up a dataset:
clear
set obs 100
gen g="A/A" if _n<=60
replace g="A/G" if _n>60 & _n<=90
replace g="G/G" if _n>90
tab g

* counting starts here
count if g=="A/A"
local aa=r(N)
count if g=="G/G"
local gg=r(N)

gen x=0
if `aa' > `gg' {
    replace x=2 if g=="A/A"
    replace x=1 if g=="A/G" // need "G/A" line?
    }

else if `aa' < `gg' {
    replace x=2 if g=="G/G"
    replace x=1 if g=="A/G" // need "G/A" line?
    }

* are they ever equal?    If so, add code

tab g x

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

11 Jun 2015, 17:15

Laura's example made it really clear what was needed. Here's another way to do it:

Code:

* setting up a dataset:
clear
set obs 100
gen g="A/A" if _n<=60
replace g="A/G" if _n>60 & _n<=90
replace g="G/G" if _n>90
tab g

* counting starts here
bysort g : gen count = _N 
bysort count g : gen x = _n == 1
replace x = sum(x) - 1 
tab g x

Comment

Laura Gibbons

Join Date: Apr 2014
Posts: 25

12 Jun 2015, 11:00

much tidier, but it fails in this instance:

Code:

* setting up a dataset:
clear
set obs 100
gen g="A/A" if _n<=10
replace g="A/G" if _n>10 & _n<=60
replace g="G/G" if _n>60
tab g

* counting starts here
bysort g : gen count = _N
bysort count g : gen x = _n == 1
replace x = sum(x) - 1
tab g x

This results in A/G as 2 and G/G as 1.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

12 Jun 2015, 11:27

Reading this more carefully, I see that my code is missing the genetics here, as the heterozygote should always be in the middle. I stopped studying biology at age 15 in the last millennium, but let's try again.

Code:

* setting up a dataset: clear set obs 100 gen g="A/A" if _n<=10 replace g="A/G" if _n>10 & _n<=60 replace g="G/G" if _n>60 tab g * counting starts here gen x = 1 if g == "A/G" bysort g : gen count = cond(x == 1, -1, _N) sort count g replace x = cond(g == g[_N], 2, 0) if missing(x) tab g x

The main trick is to sort the frequency of the more common homozygote to the end. Assigning -1 to the frequency of the heterozygote ensures that it cannot possibly be sorted to the end.
Comment
Laura Gibbons

Join Date: Apr 2014

Posts: 25
#6

12 Jun 2015, 11:51

nice!
Comment

Announcement

Recoding SNP genotype data based on minor allele frequency

Comment

Comment

Comment

Comment

Comment