  • Automating recode of categories with small numbers

    Hi all,

    I have 55 datasets with the same set of 60 variables, most of them categorical. In some datasets, some categories of some variables have small numbers of observations, and I'm trying to find a way to recode based on the frequency in each category (e.g. if fewer than 10). In the example below, that would mean selecting categories 1 and 2 and recoding rep78 to missing for those values. The problem is that the categories with small numbers are not always the same in each dataset. Any ideas on how to do this?

    Thanks,

    Sonia

    Code:
    sysuse auto, clear
    
    tab rep78
    
         Repair |
    Record 1978 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          2        2.90        2.90
              2 |          8       11.59       14.49
              3 |         30       43.48       57.97
              4 |         18       26.09       84.06
              5 |         11       15.94      100.00
    ------------+-----------------------------------
          Total |         69      100.00
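
    A minimal sketch of the frequency-based recode described above, using the same auto data. It assumes the variables are numeric; the threshold of 10 comes from the question, and n_cat is a hypothetical helper name. Because the counts are taken from whatever dataset is in memory, the categories affected can differ from one dataset to the next.

    ```stata
    sysuse auto, clear

    * Loop over the categorical variables of interest (two shown here)
    foreach v of varlist rep78 foreign {
        * Number of observations in each nonmissing category of `v'
        bysort `v': generate long n_cat = _N if !missing(`v')

        * Set categories with fewer than 10 observations to missing
        replace `v' = . if n_cat < 10
        drop n_cat
    }

    tabulate rep78, missing
    ```

    With the frequencies tabulated above, this sets categories 1 and 2 of rep78 to missing and leaves foreign unchanged. Note that bysort changes the sort order of the data.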

  • #2
    Instead of jumping into a solution, which should not be too hard technically, let me ask why you want to do this. Why would you want (i) to recode small numbers to missing, and (ii) to have different categories missing in different datasets? This does not sound like a particularly good plan to me.

    Best
    Daniel
    Last edited by daniel klein; 08 Nov 2018, 05:00.


    • #3
      Hi Daniel,

      Those are good questions.

      (i) As an example, one of the variables with small numbers in some categories is religion. Some datasets have 10,000 Muslims and <10 Christians, whilst others have 10,000 Christians and <10 Buddhists. The values used for the religious categories also differ between datasets, so Christian may be coded as 1 in one dataset and 4 in another. If there were a way to recode all categories with small numbers as Other, that would be ok too, but there may be only one category with <10, so it would be difficult to choose which other category to combine it with. I thought of recoding categories with <25 as missing, as they represent less than 1% of the total data, so it shouldn't introduce bias. Otherwise, having categories with small numbers will cause problems of data sparsity in analyses.

      (ii) I don't necessarily want different categories missing in different datasets, but it won't cause problems with the planned analyses.

      Thanks,

      Sonia


      • #4
        I am absolutely with daniel klein on this. I see no virtue in recoding different datasets differently just because the frequencies may end up very different.
        #3 doesn't shift that view.

        Your biggest problem is to reverse the different coding in different datasets to get string values, and then, once you have appended them, to work out new labels. Sounds like a lot of work, but I guess you know that.
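
        The decode-then-append workflow suggested here might look roughly like the sketch below. The two toy datasets, their codings, and the names relA, relB, religion_str and source are all hypothetical stand-ins for the real files; the point is only that decode turns each dataset's own numeric coding into strings, so the datasets can be appended and then given one common coding.

        ```stata
        * Toy dataset A: Christian coded 1, Muslim coded 2
        clear
        input byte religion
        1
        2
        1
        end
        label define relA 1 "Christian" 2 "Muslim"
        label values religion relA
        tempfile a
        save "`a'"

        * Toy dataset B: Buddhist coded 1, Christian coded 4
        clear
        input byte religion
        4
        1
        end
        label define relB 1 "Buddhist" 4 "Christian"
        label values religion relB
        tempfile b
        save "`b'"

        * Start from an empty file and append each decoded dataset to it
        clear
        tempfile combined
        save "`combined'", emptyok
        local i = 0
        foreach f in "`a'" "`b'" {
            local ++i
            use "`f'", clear
            decode religion, generate(religion_str)   // value labels -> strings
            drop religion
            generate byte source = `i'                // which dataset the row came from
            append using "`combined'"
            save "`combined'", replace
        }

        * One numeric coding with consistent labels across all datasets
        encode religion_str, generate(religion)
        tabulate religion source
        ```

        encode assigns codes in alphabetical order of the strings, so it only gives a clean result once spelling variants of the same category have been harmonised.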


        • #5
          Hi Nick and Daniel,

          I have done as you suggested, and recoded based on value labels, by using decode to convert to a string variable, and then used code like this to create new categories:

          Code:
          clear
          input str15 religion
          "Christian"
          "Catholic"
          "other christian"
          "Presbytarian"
          "Seventh day"
          "Muslim"
          "Islam"
          "Buddhist"
          "budhist"
          "Buddhism"
          "Hindu"
          "Hinduism"
          "No religion"
          "other"
          end
          generate rel = 3
          recode rel (3=1) if strmatch(religion, "*hr?st*") | strmatch(religion, "*atholic*")
          recode rel (3=2) if strmatch(religion, "*uslim*") | strmatch(religion, "*slam*")
          label define rel 1 "Christian" 2 "Muslim" 3 "Other"
          label values rel rel
          list, clean
          This works fine. However, the problem remains when I loop a command over all the datasets using survey commands such as:

          Code:
          svy: proportion vacc, over(rel)
          It doesn't work when there are small numbers in some cells, especially if I use a sub-population within the svy command.

          Is there any way to restrict the command to categories with large enough numbers, rather than recoding them as missing? Or to recode a category as "Other" if it has small numbers?

          Thanks,

          Sonia
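
          Both options raised in this post can be sketched with the auto data from #1. This is an illustration under assumptions, not a tested solution for the survey data: the threshold of 10, the code 9 for "Other", and the names n_cat, rep78_lumped and big_enough are all hypothetical, and the counts are unweighted.

          ```stata
          sysuse auto, clear

          * Unweighted number of observations in each nonmissing category
          bysort rep78: generate long n_cat = _N if !missing(rep78)

          * Option 1: lump small categories into a single "Other" code (9 is arbitrary)
          clonevar rep78_lumped = rep78
          replace rep78_lumped = 9 if n_cat < 10
          label define rep78_lumped 9 "Other (n < 10)"
          label values rep78_lumped rep78_lumped

          * Option 2: flag observations in categories that are large enough,
          * leaving the original variable untouched
          generate byte big_enough = n_cat >= 10 & !missing(rep78)

          drop n_cat
          tabulate rep78_lumped big_enough, missing
          ```

          A 0/1 flag such as big_enough is the kind of variable that the subpop() option of svy accepts, which would restrict estimation without recoding anything. If the analysis already uses a subpopulation, the counts would presumably need to be taken within that subpopulation rather than over the whole dataset.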
