Automating recode of categories with small numbers

Sonia Lewis

Join Date: Jul 2018

Posts: 32
#1

08 Nov 2018, 04:11

Hi all,

I have 55 datasets with the same set of 60 variables. Most of them are categorical. Some datasets have small numbers in some of the categories for some of the variables, and I'm trying to find a way to recode based on the frequency in each category (e.g. if less than 10). So, in the example below, I would want to select categories 1 and 2 and recode rep78 to missing for those values. The problem is that it's not always the same categories in each dataset that have small numbers. Any ideas on how to do this?

Thanks,

Sonia

Code:

sysuse auto, clear
tab rep78

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
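For reference, the request above can be sketched as follows. This is only a minimal sketch, not a recommendation: the threshold of 10 comes from the post, while the macro name minfreq and the choice of rep78 and foreign as the variables to loop over are assumptions made for illustration. Because the rule is based on frequencies rather than on fixed category values, the same code could be run unchanged on each dataset.

```stata
sysuse auto, clear

* Assumed threshold: categories with fewer than 10 observations count as small
local minfreq 10

* Loop over the categorical variables of interest (illustrative choice)
foreach v of varlist rep78 foreign {
    tempvar n
    * Number of observations in each category of the current variable
    bysort `v': generate long `n' = _N
    * Set small categories to missing; values that are already missing are left alone
    replace `v' = . if `n' < `minfreq' & !missing(`v')
    drop `n'
}

tabulate rep78, missing
```

With the auto data this moves categories 1 and 2 of rep78 to missing and leaves foreign untouched, since both of its categories have at least 10 observations.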
daniel klein

Join Date: Mar 2014

Posts: 3847
#2

08 Nov 2018, 04:40

Instead of jumping into a solution, which should not be too hard, technically, let me ask why you want to do this. Why would you want (i) to recode small numbers to missing, and (ii) different categories missing in different datasets? This does not sound like a particularly good plan to me.

Best
Daniel

Last edited by daniel klein; 08 Nov 2018, 05:00.
Sonia Lewis

Join Date: Jul 2018

Posts: 32
#3

08 Nov 2018, 07:34

Hi Daniel,

Those are good questions.

(i) As an example, one of the variables with some categories having small numbers is religion. Some datasets have 10,000 Muslims and <10 Christians, whilst others have 10,000 Christians and <10 Buddhists. The values used for different religious categories also differ between datasets, so Christian may be coded as 1 in one dataset and 4 in another. If there were a way to recode all categories with small numbers as Other, that would be ok too, but there may only be one category with <10, so it would be difficult to choose which other category to combine it with. I thought of recoding categories with <25 as missing, as they represent less than 1% of the total data, so it shouldn't introduce bias. Otherwise, having categories with small numbers will cause problems of data sparsity in analyses.

(ii) I don't necessarily want different categories missing in different datasets, but it won't cause problems with the planned analyses.

Thanks,

Sonia
Nick Cox

Join Date: Mar 2014

Posts: 35681
#4

08 Nov 2018, 09:11

I am absolutely with daniel klein on this. I see no virtue in recoding different datasets differently just because the frequencies may end up very different.
#3 doesn't shift that view.

Your biggest problem is to reverse the different coding in different datasets to get string values, and then, when you have appended them, work out new labels. Sounds like a lot of work, but I guess you know that.
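A minimal sketch of that workflow, assuming each dataset carries value labels whose text is comparable across datasets. The two toy datasets, the label names relA and relB, and the variable name religion_str are made up for illustration; with real files the toy input blocks would be replaced by use commands.

```stata
* Toy dataset A: Christian is coded 1
clear
input byte religion
1
1
2
end
label define relA 1 "Christian" 2 "Muslim"
label values religion relA
decode religion, generate(religion_str)   // numeric codes -> label text
drop religion
tempfile a
save `a'

* Toy dataset B: Christian is coded 4
clear
input byte religion
4
1
end
label define relB 1 "Buddhist" 4 "Christian"
label values religion relB
decode religion, generate(religion_str)
drop religion

* Stack the datasets on the string version, then build one consistent coding
append using `a'
encode religion_str, generate(religion)   // codes follow alphabetical order of the text
tabulate religion
```

Here encode assigns Buddhist = 1, Christian = 2 and Muslim = 3 across the combined data. Spelling variants in the labels would still need to be harmonised by hand before the encode step.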
Sonia Lewis

Join Date: Jul 2018

Posts: 32
#5

11 Nov 2018, 00:44

Hi Nick and Daniel,

I have done as you suggested, and recoded based on value labels, by using decode to convert to a string variable, and then used code like this to create new categories:

Code:

clear
* str15 so that "other christian" (15 characters) is not truncated
input str15 religion
"Christian"
"Catholic"
"other christian"
"Presbytarian"
"Seventh day"
"Muslim"
"Islam"
"Buddhist"
"budhist"
"Buddhism"
"Hindu"
"Hinduism"
"No religion"
"other"
end

* Start everyone in Other (3), then move matching labels into the main groups
generate rel = 3
recode rel (3=1) if strmatch(religion, "*hr?st*") | strmatch(religion, "*atholic*")
recode rel (3=2) if strmatch(religion, "*uslim*") | strmatch(religion, "*slam*")
label define rel 1 "Christian" 2 "Muslim" 3 "Other"
label values rel rel
list, clean

This works fine, however, the problem still remains that when I try to loop a command over all the datasets using survey commands such as:

Code:

svy: proportion vacc, over(rel)

It doesn't work when there are small numbers in some cells, especially if I use a sub-population within the svy command.

Is there any way to restrict the command to run for categories with large enough numbers, rather than recoding them as missing? Or to recode them as "Other" if small numbers in that category?

Thanks,

Sonia
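One way the second option could look, as a sketch under stated assumptions rather than a tested solution for the survey data: it uses rep78 from the auto data in place of rel, a threshold of 10, and 99 as an arbitrary code for Other. The names minfreq, n_cat and rep_grp are hypothetical.

```stata
sysuse auto, clear

* Assumed threshold for a "small" category
local minfreq 10

* Number of observations in each category of rep78
bysort rep78: generate long n_cat = _N

* Copy the variable, then pool every small category into one Other group
generate byte rep_grp = rep78
replace rep_grp = 99 if n_cat < `minfreq' & !missing(rep78)   // 99 is an arbitrary code
label define rep_grp 99 "Other"
label values rep_grp rep_grp

* The pooled group can itself be small, for example when only one category is rare
count if rep_grp == 99
if r(N) < `minfreq' display as text "Pooled Other group is still small: " r(N)

tabulate rep_grp, missing
```

For the first option, a 0/1 indicator built from the same counts (for example n_cat >= 10) could probably be supplied to the subpop() option of svy so that estimation is restricted to the larger categories without dropping observations; whether that resolves the error depends on the survey design, which is not shown in the thread.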