  • Missing categories in newly created categorical variable

    Hi Statalist.

    I have found that, after creating a new categorical variable, not all categories appear (see below), so I have to rerun the code for the missing category. The code creates different religious pairings within couples, in which the respondent (religb) may be coded differently from their partner (p_religb). The code distinguishes three 'levels': both partners with the same religion, partners with different religions, and one partner with a religion and one without.

    To shorten the code, the long lists of values appear only in the first block; thereafter they are replaced by prot1, prot2 and relOther (Other religion).
    Code:
    gen faith = 1 if religb == p_religb & religb == 2070 // same Catholic    
    replace faith = 2 if religb == p_religb & inlist(religb, 2010, 2030, 2050, 2170, 2250, 2270, 2330, 2800)  // same Prot1 (hereafter listed as prot1)
    replace faith = 3 if religb == p_religb & inlist(religb, 2110, 2130, 2150, 2310, 2400)  // same Prot2 (hereafter listed as prot2)
    replace faith = 4 if religb == p_religb & inlist(religb, 1000, 3000, 4000, 5000, 6000) // same 'Other' relig (hereafter listed as relOther)
    
    replace faith = 5 if religb != p_religb & prot1 & prot1_p // diff Prot1
    replace faith = 6 if religb != p_religb & inrange(religb, 2010, 2800) & inrange(p_religb, 2010, 2800) // diff Prot = (prot1+prot2)
    replace faith = 7 if religb != p_religb & (religb == 2070 | p_religb == 2070) & (prot1 | prot1_p) // Cath + Prot1
    replace faith = 8 if religb != p_religb & (religb == 2070 | p_religb == 2070) & inrange(religb, 2010, 2800) & inrange(p_religb, 2010, 2800) // Cath + Prot
    
    replace faith = 9 if religb != p_religb & (religb == 2070 | p_religb == 2070) & (religb == 7000 | p_religb == 7000) // Cath + No relig
    replace faith = 10 if religb != p_religb & (prot1 | prot1_p) & (religb == 7000 | p_religb == 7000) // Prot1 + No relig
    replace faith = 11 if religb != p_religb & (relOther | relOther_p) & (religb == 7000 | p_religb == 7000) // Other relig + No relig
    replace faith = 12 if religb == p_religb & religb == 7000 & import == 1 & attend == 1 // both 'No relig'
    Any help identifying the problem would be appreciated.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(religb p_religb) float faith
    2070 2070  1
    2330 2010  6
    2310 2310 3
    7000 7000 12
    2330 2330  2
    2010 2010  2
    7000 2330 10
    7000 7000 12
    2070 7000  9
    2070 2110  8
    2070 2330  7
    2070 2070  1
    2330 2010  6
    2400 2400 3
    1000 7000 11
    2070 2010  7
    1000 1000  4
    3000 3000  4
    2330 2010  5
    2050 2330  5
    2010 2400  6
    end
    Last edited by Chris Boulis; 19 Aug 2020, 00:40.

  • #2
    (deleted, accidental duplication.)
    Last edited by Mike Lacy; 19 Aug 2020, 10:03.



    • #3
      I'm not sure I understand what you mean by "not all categories appear." I'm thinking you mean "I think there are observations within my data set such that, when I apply the series of replace commands, each of the possible values (1/12) should occur one or more times in the variable 'faith,' but I don't get all of those." If that's correct, I would say that what we need to help you would be an example of input data that should result in (say) faith == X but doesn't. So, if I understand you correctly, you'll want to supply some such data examples, including all the relevant input variables, which appear to be religb, p_religb, prot1, prot2, ... etc.

      If my understanding is not correct, you'd presumably want to describe your problem again.



      • #4
        Hi Mike Lacy, thank you for your reply. Yes, you are correct: I've written code to split my dataset into 12 categories, but I find that one or two of these are missing when I tabulate the variable. If I rerun the code for the 'missing' category and then tabulate, it is no longer missing. I therefore feel there is an issue in the way I've written my code, even though I believe it's consistent with prior advice by Nick Cox, e.g. https://www.statalist.org/forums/for...-cond-function.

        In this sample, I have ensured there are observations for each of the 12 categories. Note that while I've shortened the code in lines 5-13 for ease of readability on the forum, I run the full code on my PC.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int(religb p_religb) byte(prot1_p prot2_p relOther) float faith
        2070 2070 0 0 0  
        2070 2070 0 0 0  
        2070 2070 0 0 0  
        2070 2070 0 0 0  
        2030 2030 1 0 0  
        2010 2010 1 0 0  
        2010 2010 1 0 0  
        2130 2130 0 1 0  
        2130 2130 0 1 0  
        2400 2400 0 1 0  
        2400 2400 0 1 0  
        1000 1000 0 0 1  
        1000 1000 0 0 1  
        1000 1000 0 0 1  
        1000 1000 0 0 1  
        2000 2010 1 0 0  
        2000 2330 1 0 0  
        2000 2010 1 0 0  
        2000 2330 1 0 0  
        2330 2010 1 0 0  
        2010 2170 1 0 0  
        2010 2400 0 1 0  
        2250 2170 1 0 0  
        2250 2110 0 1 0  
        2330 2010 1 0 0  
        2010 2400 0 1 0  
        2250 2110 0 1 0  
        2330 2170 1 0 0  
        2070 2330 1 0 0  
        2070 2330 1 0 0  
        2070 2250 1 0 0  
        2070 2800 1 0 0  
        2070 2030 1 0 0  
        2010 2070 0 0 0  
        2010 2070 0 0 0  
        2010 2070 0 0 0  
        2070 2233 0 0 0  
        2070 2233 0 0 0  
        2070 7000 0 0 0  
        2070 7000 0 0 0  
        2070 7000 0 0 0  
        7000 2010 1 0 0 
        7000 2010 1 0 0 
        7000 2330 1 0 0 
        7000 2250 1 0 0 
        7000 2170 1 0 0 
        7000 2170 1 0 0 
        7000 2330 1 0 0 
        7000 6000 0 0 0 
        6000 7000 0 0 1 
        6000 7000 0 0 1 
        1000 7000 0 0 1 
        1000 7000 0 0 1 
        7000 7000 0 0 0 
        7000 7000 0 0 0 
        7000 7000 0 0 0 
        end
        N.B. Stata v.15.1




        • #5
          What I'd find useful in order to help you would be an example that shows the *desired* value of faith for each combination of input values, and the value that your code is actually producing. In the example you've given with -dataex-, all the faith values are 0 or 1, and I find it hard to connect that with your point about some of the 1/12 values being absent. So, instead of your 0/1 faith variable above, can you fill out your example with a "faith_wanted" and a "faith_mycode" variable? While I understand that *you* know what you want to get, we don't. Even one example like "I expected this observation to yield faith == 4, but it didn't" would help.

          Regarding your note, "if I rerun the code for the 'missing' category then tabulate,...": we don't have a way to know about your 'missing' category(ies), so it's hard to respond.



          • #6
            Thank you Mike Lacy for your responses above. I want to apologise for not responding, I must have figured out a solution, but even still I should have advised as such.

            I have experienced a similar issue - that is, when I tabulate a newly created variable (with 17 categories) some categories are missing (see below). Then, if I re-run the code for one of the missing categories, say category 12 and re-tabulate, I now see category 12 (and some others), but other categories are missing (see attached). I have included my code below and some sample data.

            Guidance on why all categories are not coming through is kindly appreciated.

            Code:
            gen byte rel4 = 1 if religb1 == religb2 & religb1 == 7000 & relimp1 == 0 & relimp2 == 0 & relat1 == 1 & relat2 == 1  
            replace rel4 = 2 if religb1 == religb2 & religb1 == 2070  
            replace rel4 = 3 if religb1 == religb2 & inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800)
            replace rel4 = 4 if religb1 == religb2 & inlist(religb1, 2050, 2110, 2130, 2150, 2310, 2400)  
            replace rel4 = 5 if religb1 == religb2 & religb1 == 1000
            replace rel4 = 6 if religb1 == religb2 & religb1 == 3000
            replace rel4 = 7 if religb1 == religb2 & religb1 == 4000
            
            replace rel4 = 8 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
            inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800))
            replace rel4 = 9 if religb1 != religb2 & inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) & ///
            inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)  
            replace rel4 = 10 if religb1 != religb2 & religb1 == 2070 & inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)
            replace rel4 = 11 if religb1 != religb2 & religb2 == 2070 & inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800)
            
            replace rel4 = 12 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (religb1 == 7000 | religb2 == 7000)  
            replace rel4 = 13 if religb1 != religb2 & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
            inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)) & (religb1 == 7000 | religb2 == 7000)
            replace rel4 = 14 if religb1 != religb2 & (religb1 == 2070 & religb2 == 7000)
            replace rel4 = 15 if religb1 != religb2 & (religb2 == 2070 & religb1 == 7000)
            replace rel4 = 16 if religb1 != religb2 & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) & religb2 == 7000)
            replace rel4 = 17 if religb1 != religb2 & (inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800) & religb1 == 7000)





            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long(id p_id) int(religb1 religb2)
            10  11  2070 2070
            12  13  2070 7000
            14  15  2030 2010
            16  17  2070 2070
            181  191  7000 7000
            182  192  7000 7000
            183  193  2030 7000
            184  194  2130 2130
            185  195  1000 1000
            186  196  1000 1000
            187  197  2330 2330
            188  198  3000 3000
            189  199  4000 4000
            111  121  2170 2250
            112  122  7000 2010
            113  123  2010 7000
            114  124  7000 2070
            115  125  2070 7000
            116  126  2010 2070
            117  127  2070 2010
            118  128  2010 2070
            119  129  2010 2250
            101  151  2070 2070
            102  153  2070 7000
            103  155  2030 2010
            104  157  7000 2070
            105  151  7000 2010
            106  152  2010 7000
            107  153  2070 7000
            108  154  2130 2130
            109  155  1000 1000
            286  156  1000 1000
            287  157  2330 2330
            288  158  3000 3000
            289  159  4000 4000
            end
            Note, due to privacy I have amended some of the information from the original data.
            Stata v.15.1.
            Last edited by Chris Boulis; 18 Dec 2020, 05:41.



            • #7
              I'd say this is too complicated for me to feel like I could get somewhere by eyeballing what you have presented. My suggestion for diagnosis, which you may well already have tried, would be to insert -tabulate- commands, perhaps with -if- qualifiers, very liberally throughout your code. This would enable tracking your changes, and just as importantly, might stimulate you to see some previously unnoticed logic error.
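
The tabulate-as-you-go diagnosis suggested above can be sketched as follows. This is only an illustration on made-up toy data (the five observations and the local macro `prot1` are assumptions, not part of the original posts); the two -replace- conditions are copied from the code in #6.

```stata
* Toy data: hypothetical values, not the poster's data
clear
input int(religb1 religb2)
2070 2010
2010 2070
2070 7000
7000 2070
2010 7000
end

* Hypothetical local macro holding the Prot1 codes used in #6
local prot1 2010, 2030, 2170, 2250, 2270, 2330, 2800

gen byte rel4 = .
replace rel4 = 8 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & ///
    (inlist(religb1, `prot1') | inlist(religb2, `prot1'))
tabulate rel4, missing

* Before the next -replace-, tabulate rel4 under that line's own condition
* to see which already-assigned categories it is about to overwrite
tabulate rel4 if religb1 != religb2 & religb1 == 2070 & inlist(religb2, `prot1'), missing
replace rel4 = 10 if religb1 != religb2 & religb1 == 2070 & inlist(religb2, `prot1')
tabulate rel4, missing
```

Tabulating under the condition of the upcoming -replace- shows directly whether that line would touch observations that already carry a category.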



              • #8
                Thank you Mike Lacy. The irony is that this variable takes categories already coded in three other variables (all slightly different), all of which work fine. I went through a process of checking my code against each of them before posting. I believe I am following the approach I learnt from Nick Cox (in #20 here). All categories appear in -tabulate- when only running the first 10 lines. Adding lines 11 to 14 resulted in category 8 being dropped from the output of -tabulate-. After adding lines 15 to 16, category 12 was also dropped. Finally, after adding line 17, category 13 was dropped from the -tabulate- output in addition to categories 8 and 12.

                I did notice one potential issue, but it depends on whether the value in "(# real changes made)" that appears after each line of code is run, e.g.
                Code:
                (8,941 real changes made)
                should be the same as that displayed in the "frequency" column of the -tabulate- output? e.g.
                Code:
                rel4 |      Freq.
                  13 |      5,458
                If so, then I identified two inconsistencies:
                • Category 8: (8,327 real changes made) Frequency: 4,238 (as noted above), was dropped from the output of -tabulate- after code from lines 11+ were added.
                • Category 13: (8,941 real changes made) Frequency: 5,458 (displayed above) changed to this new frequency value in line 16. After adding the last of the code (line 17), category 13 was dropped from the output of -tabulate-.
                I hope this helps. I appreciate any help/guidance. Kind regards, Chris
                Last edited by Chris Boulis; 18 Dec 2020, 18:58.



                • #9
                  Update: I tried running the last six lines of code separately and found that only the first three lines ran and showed in -tabulate- without issue. Adding lines 4-6 results in a category dropping from -tabulate-, as noted in #8.

                  I also checked each line of code using -count- with -if- statements and obtained the correct frequencies for each category, for example:
                  Code:
                  . count if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
                  > inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)) // cat8
                    8,327
                  
                  . count if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (religb1 == 7000 | religb2 == 7000)  // cat12
                    6,638
                  
                  . count if religb1 != religb2 & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
                  > inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)) & (religb1 == 7000 | religb2 == 7000)  // cat13
                    8,941
                  
                  . count if religb1 == 2070 & religb2 == 7000  // cat14
                    2,494
                  
                  . count if religb2 == 2070 & religb1 == 7000  // cat15
                    4,144
                  Output from -tabulate- after running all 17 lines of code:

                  [Attached image: tab_rel4_frequencies.png]


                  Help or guidance on how to identify the issue would be much appreciated.
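
One reading consistent with the counts above: the cat14 and cat15 counts (2,494 + 4,144) sum to the cat12 count (6,638), which suggests the later -replace- lines select the same observations as the earlier one and overwrite it. "(# real changes made)" reports what a line changed when it ran, not what survives later lines. The sketch below reproduces that pattern on made-up toy data (the four observations are assumptions); the conditions are taken from the code in #6.

```stata
* Toy data: hypothetical values, not the poster's data
clear
input int(religb1 religb2)
2070 7000
7000 2070
2070 7000
2070 2070
end

gen byte rel4 = .
replace rel4 = 12 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & ///
    (religb1 == 7000 | religb2 == 7000)

* How many already-coded observations would each later line overwrite?
count if !missing(rel4) & religb1 == 2070 & religb2 == 7000   // 2 on this toy data
count if !missing(rel4) & religb2 == 2070 & religb1 == 7000   // 1 on this toy data

replace rel4 = 14 if religb1 != religb2 & (religb1 == 2070 & religb2 == 7000)
replace rel4 = 15 if religb1 != religb2 & (religb2 == 2070 & religb1 == 7000)

* Category 12 no longer appears: all three of its observations were recoded
tabulate rel4, missing
```

If both the combined category and its directional splits are wanted, one option would be to store them in separate variables, or to add `& missing(rel4)` to later lines so they cannot overwrite earlier assignments.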
