Missing categories in newly created categorical variable

Chris Boulis

Join Date: Feb 2019
Posts: 368

19 Aug 2020, 00:33

Hi Statalist.

I have found that after creating a new categorical variable, not all categories appear (see below), requiring me to rerun the code for the missing category. The code creates different religious pairings within couples, including those in which the respondent (religb) is coded differently from their partner (p_religb). The code separates three 'levels': both partners have the same religion, the partners have different religions, and one partner has a religion while the other does not.

To reduce the long lists of values in the code, the full lists appear only in the first set of code; hereafter they are replaced by prot1, prot2 and relOther (Other religion).

Code:

gen faith = 1 if religb == p_religb & religb == 2070 // same Catholic    
replace faith = 2 if religb == p_religb & inlist(religb, 2010, 2030, 2050, 2170, 2250, 2270, 2330, 2800)  // same Prot1 (hereafter listed as prot1)
replace faith = 3 if religb == p_religb & inlist(religb, 2110, 2130, 2150, 2310, 2400)  // same Prot2 (hereafter listed as prot2)
replace faith = 4 if religb == p_religb & inlist(religb, 1000, 3000, 4000, 5000, 6000) // same 'Other' relig (hereafter listed as relOther)

replace faith = 5 if religb != p_religb & prot1 & prot1_p // diff Prot1
replace faith = 6 if religb != p_religb & inrange(religb, 2010, 2800) & inrange(p_religb, 2010, 2800) // diff Prot = (prot1+prot2)
replace faith = 7 if religb != p_religb & (religb == 2070 | p_religb == 2070) & (prot1 | prot1_p) // Cath + Prot1
replace faith = 8 if religb != p_religb & (religb == 2070 | p_religb == 2070) & inrange(religb, 2010, 2800) & inrange(p_religb, 2010, 2800) // Cath + Prot

replace faith = 9 if religb != p_religb & (religb == 2070 | p_religb == 2070) & (religb == 7000 | p_religb == 7000) // Cath + No relig
replace faith = 10 if religb != p_religb & (prot1 | prot1_p) & (religb == 7000 | p_religb == 7000) // Prot1 + No relig
replace faith = 11 if religb != p_religb & (relOther | relOther_p) & (religb == 7000 | p_religb == 7000) // Other relig + No relig
replace faith = 12 if religb == p_religb & religb == 7000 & import == 1 & attend == 1 // both 'No relig'

Help identifying the problem appreciated.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(religb p_religb) float faith
2070 2070  1
2330 2010  6
2310 2310 3
7000 7000 12
2330 2330  2
2010 2010  2
7000 2330 10
7000 7000 12
2070 7000  9
2070 2110  8
2070 2330  7
2070 2070  1
2330 2010  6
2400 2400 3
1000 7000 11
2070 2010  7
1000 1000  4
3000 3000  4
2330 2010  5
2050 2330  5
2010 2400  6
end

Last edited by Chris Boulis; 19 Aug 2020, 00:40.


Mike Lacy

Join Date: Apr 2014

Posts: 2426
#3

19 Aug 2020, 10:01

I'm not sure I understand what you mean by "not all categories appear." I'm thinking you mean "I think there are observations within my data set such that, when I apply the series of replace commands, each of the possible values (1/12) should occur one or more times in the variable 'faith,' but I don't get all of those." If that's correct, I would say that what we need to help you would be an example of input data that should result in (say) faith == X but doesn't. So, if I understand you correctly, you'll want to supply some such data examples, including all the relevant input variables, which appear to be religb, p_religb, prot1, prot2, ... etc.

If my understanding is not correct, you'd presumably want to describe your problem again.

Chris Boulis

Join Date: Feb 2019
Posts: 368

19 Aug 2020, 22:41

Hi Mike Lacy. Thank you for your reply. Yes, you are correct: I've written code to split my dataset into 12 categories, but I find that one or two of these are missing when I tabulate the variable. If I rerun the code for the 'missing' category and then tabulate, it is no longer missing. I therefore feel there is an issue in the way I've written my code, even though I believe it's consistent with prior advice by Nick Cox, e.g. https://www.statalist.org/forums/for...-cond-function.

In this sample, I have ensured there are observations for each of the 12 categories. Note, while I've shortened the code from lines 5-13 for ease of readability for those on the forum, I run the full code on my pc.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(religb p_religb) byte(prot1_p prot2_p relOther) float faith
2070 2070 0 0 0  
2070 2070 0 0 0  
2070 2070 0 0 0  
2070 2070 0 0 0  
2030 2030 1 0 0  
2010 2010 1 0 0  
2010 2010 1 0 0  
2130 2130 0 1 0  
2130 2130 0 1 0  
2400 2400 0 1 0  
2400 2400 0 1 0  
1000 1000 0 0 1  
1000 1000 0 0 1  
1000 1000 0 0 1  
1000 1000 0 0 1  
2000 2010 1 0 0  
2000 2330 1 0 0  
2000 2010 1 0 0  
2000 2330 1 0 0  
2330 2010 1 0 0  
2010 2170 1 0 0  
2010 2400 0 1 0  
2250 2170 1 0 0  
2250 2110 0 1 0  
2330 2010 1 0 0  
2010 2400 0 1 0  
2250 2110 0 1 0  
2330 2170 1 0 0  
2070 2330 1 0 0  
2070 2330 1 0 0  
2070 2250 1 0 0  
2070 2800 1 0 0  
2070 2030 1 0 0  
2010 2070 0 0 0  
2010 2070 0 0 0  
2010 2070 0 0 0  
2070 2233 0 0 0  
2070 2233 0 0 0  
2070 7000 0 0 0  
2070 7000 0 0 0  
2070 7000 0 0 0  
7000 2010 1 0 0 
7000 2010 1 0 0 
7000 2330 1 0 0 
7000 2250 1 0 0 
7000 2170 1 0 0 
7000 2170 1 0 0 
7000 2330 1 0 0 
7000 6000 0 0 0 
6000 7000 0 0 1 
6000 7000 0 0 1 
1000 7000 0 0 1 
1000 7000 0 0 1 
7000 7000 0 0 0 
7000 7000 0 0 0 
7000 7000 0 0 0 
end

N.B. Stata v.15.1


Mike Lacy

Join Date: Apr 2014

Posts: 2426
#5

20 Aug 2020, 09:20

What I'd find useful in order to help you would be an example that shows the *desired* value of faith for each combination of input values, and the value that your code is actually producing. In the example you've given with -dataex-, all the faith values are 0 or 1, and I find it hard to connect that with your point about some of the 1/12 values being absent. So, instead of your 0/1 faith variable above, can you fill out your example with a "faith_wanted" and a "faith_mycode" variable? While I understand that *you* know what you want to get, we don't. Even one example like "I expected this observation to yield faith == 4, but it didn't" would help.

Regarding your note about : "if I rerun the code for the 'missing' category then tabulate,... " We don't have a way to know about your 'missing' category(ies), so it's hard to respond.
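A minimal sketch of the kind of example being requested, in Stata. The names faith_wanted and faith_mycode come from the post above; the three observations and their desired values are made up for illustration and are not taken from the original data. Only two of the -replace- conditions from #1 are reproduced.

```stata
* Hypothetical toy data: the desired category is entered by hand
clear
input int(religb p_religb) byte faith_wanted
2070 2070 1
2070 7000 9
7000 7000 12
end

* Value actually produced by two of the commands from #1
* (category 12 is left out here because it also needs import and attend)
gen byte faith_mycode = 1 if religb == p_religb & religb == 2070
replace faith_mycode = 9 if religb != p_religb & (religb == 2070 | p_religb == 2070) & (religb == 7000 | p_religb == 7000)

* Show every observation where wanted and produced values disagree
list religb p_religb faith_wanted faith_mycode if faith_wanted != faith_mycode
```

A listing like this makes it possible to point at a specific observation and say which value was expected and which was obtained.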

Chris Boulis

Join Date: Feb 2019
Posts: 368

18 Dec 2020, 05:36

Thank you Mike Lacy for your responses above. I want to apologise for not responding, I must have figured out a solution, but even still I should have advised as such.

I have experienced a similar issue - that is, when I tabulate a newly created variable (with 17 categories) some categories are missing (see below). Then, if I re-run the code for one of the missing categories, say category 12 and re-tabulate, I now see category 12 (and some others), but other categories are missing (see attached). I have included my code below and some sample data.

Guidance on why all categories are not coming through is kindly appreciated.

Code:

gen byte rel4 = 1 if religb1 == religb2 & religb1 == 7000 & relimp1 == 0 & relimp2 == 0 & relat1 == 1 & relat2 == 1
replace rel4 = 2 if religb1 == religb2 & religb1 == 2070
replace rel4 = 3 if religb1 == religb2 & inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800)
replace rel4 = 4 if religb1 == religb2 & inlist(religb1, 2050, 2110, 2130, 2150, 2310, 2400)
replace rel4 = 5 if religb1 == religb2 & religb1 == 1000
replace rel4 = 6 if religb1 == religb2 & religb1 == 3000
replace rel4 = 7 if religb1 == religb2 & religb1 == 4000

replace rel4 = 8 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800))
replace rel4 = 9 if religb1 != religb2 & inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) & ///
inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)
replace rel4 = 10 if religb1 != religb2 & religb1 == 2070 & inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)
replace rel4 = 11 if religb1 != religb2 & religb2 == 2070 & inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800)

replace rel4 = 12 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (religb1 == 7000 | religb2 == 7000)
replace rel4 = 13 if religb1 != religb2 & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)) & (religb1 == 7000 | religb2 == 7000)
replace rel4 = 14 if religb1 != religb2 & (religb1 == 2070 & religb2 == 7000)
replace rel4 = 15 if religb1 != religb2 & (religb2 == 2070 & religb1 == 7000)
replace rel4 = 16 if religb1 != religb2 & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) & religb2 == 7000)
replace rel4 = 17 if religb1 != religb2 & (inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800) & religb1 == 7000)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id p_id) int(religb1 religb2)
10  11  2070 2070
12  13  2070 7000
14  15  2030 2010
16  17  2070 2070
181  191  7000 7000
182  192  7000 7000
183  193  2030 7000
184  194  2130 2130
185  195  1000 1000
186  196  1000 1000
187  197  2330 2330
188  198  3000 3000
189  199  4000 4000
111  121  2170 2250
112  122  7000 2010
113  123  2010 7000
114  124  7000 2070
115  125  2070 7000
116  126  2010 2070
117  127  2070 2010
118  128  2010 2070
119  129  2010 2250
101  151  2070 2070
102  153  2070 7000
103  155  2030 2010
104  157  7000 2070
105  151  7000 2010
106  152  2010 7000
107  153  2070 7000
108  154  2130 2130
109  155  1000 1000
286  156  1000 1000
287  157  2330 2330
288  158  3000 3000
289  159  4000 4000
end

Note, due to privacy I have amended some of the information from the original data.
Stata v.15.1.

Last edited by Chris Boulis; 18 Dec 2020, 05:41.


Mike Lacy

Join Date: Apr 2014

Posts: 2426
#7

18 Dec 2020, 08:48

I'd say this is too complicated for me to feel like I could get somewhere by eyeballing what you have presented. My suggestion for diagnosis, which you may well already have tried, would be to insert -tabulate- commands, perhaps with -if- qualifiers, very liberally throughout your code. This would enable tracking your changes, and just as importantly, might stimulate you to see some previously unnoticed logic error.
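As a rough sketch of what "inserting -tabulate- liberally" might look like, the snippet below runs three of the commands from #6 on made-up toy data (three couples, not from the original data set) and tabulates after each step, including a look at what the next command is about to touch.

```stata
* Toy data (made up): three couples
clear
input int(religb1 religb2)
2070 2070
2070 7000
7000 2070
end

gen byte rel4 = 2 if religb1 == religb2 & religb1 == 2070
tabulate rel4, missing

replace rel4 = 12 if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (religb1 == 7000 | religb2 == 7000)
tabulate rel4, missing

* Before the next command: which existing values would it overwrite?
tabulate rel4 if religb1 == 2070 & religb2 == 7000, missing

replace rel4 = 14 if religb1 != religb2 & (religb1 == 2070 & religb2 == 7000)
tabulate rel4, missing
```

The -missing- option keeps unassigned observations visible, and the -if- tabulation just before a -replace- shows whether that command will write over a category assigned earlier.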
Chris Boulis

Join Date: Feb 2019

Posts: 368
#8

18 Dec 2020, 18:55

Thank you Mike Lacy. The irony is that this variable takes categories already coded in three other variables (all slightly different), all of which work fine. I went through a process of checking my code against each of them before posting. I believe I am following the approach I learnt from Nick Cox (in #20 here). All categories appear in -tabulate- when only running the first 10 lines. Adding lines 11 to 14 resulted in category 8 being dropped from the output of -tabulate-. After adding lines 15 to 16, category 12 was also dropped. Finally, after adding line 17, in addition to categories 8 & 12, category 13 was dropped from the -tabulate- output.

I did notice one potential issue, but it depends on whether the value in "(# real changes made)" that appears after each line of code is run, e.g.

Code:

(8,941 real changes made)

should be the same as that displayed in the "frequency" column of the -tabulate- output? e.g.

Code:

rel4 | Freq.
  13 | 5,458

If so, then I identified two inconsistencies:
Category 8: (8,327 real changes made) Frequency: 4,238 (as noted above), was dropped from the output of -tabulate- after code from lines 11+ were added.

Category 13: (8,941 real changes made) Frequency: 5,458 (displayed above) changed to this new frequency value in line 16. After adding the last of the code (line 17), category 13 was dropped from the output of -tabulate-.
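On the question of whether the two numbers should agree: "(# real changes made)" counts the observations whose value was changed by that one command, while -tabulate- shows the variable as it stands after every later command has run. The two need only match if no later -replace- touches the same observations. A small sketch on made-up data (the variables x and cat are illustrative only):

```stata
* Toy data: four observations with x = 1, 2, 3, 4
clear
set obs 4
gen byte x = _n
gen byte cat = .

replace cat = 1 if x >= 2    // reports 3 real changes
replace cat = 2 if x >= 3    // reports 2 real changes, overwriting two of the 1s

tabulate cat, missing        // category 1 now has frequency 1, not 3
```

If the second condition covered every observation matched by the first, category 1 would not appear in the table at all.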

I hope this helps. I appreciate any help/guidance. Kind regards, Chris

Last edited by Chris Boulis; 18 Dec 2020, 18:58.

Chris Boulis

Join Date: Feb 2019
Posts: 368

19 Dec 2020, 19:01

Update: I tried running the last six lines of code separately, finding only the first three lines ran and showed in -tabulate- without issue, adding lines 4-6 results in a category dropping from -tabulate- as noted in #8.

I also checked each line of code using -count- with -if- statements and obtained the correct frequencies for each category, for example:

Code:

. count if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
> inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)) // cat8
  8,327

. count if religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (religb1 == 7000 | religb2 == 7000)  // cat12
  6,638

. count if religb1 != religb2 & (inlist(religb1, 2010, 2030, 2170, 2250, 2270, 2330, 2800) | ///
> inlist(religb2, 2010, 2030, 2170, 2250, 2270, 2330, 2800)) & (religb1 == 7000 | religb2 == 7000)  // cat13
  8,941

. count if religb1 == 2070 & religb2 == 7000  // cat14
  2,494

. count if religb2 == 2070 & religb1 == 7000  // cat15
  4,144

Output from -tabulate- after running all 17 lines of code:

[Image: tab_rel4_frequencies.png]

Help/guidance how to identify the issue is kindly appreciated.
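One thing the counts above suggest: the cat14 and cat15 counts (2,494 + 4,144) sum to exactly the cat12 count (6,638). That would be consistent with every observation that meets the category 12 condition also meeting the condition for category 14 or 15, in which case the later -replace- commands would overwrite all of category 12. The sketch below checks for that kind of overlap on made-up toy data; the indicator names c12, c14 and c15 are hypothetical, and the conditions are copied from the code in #6.

```stata
* Toy data (made up): Catholic (2070) / No religion (7000) couples
clear
input int(religb1 religb2)
2070 7000
7000 2070
2070 7000
end

* Indicators for the conditions used for categories 12, 14 and 15
gen byte c12 = religb1 != religb2 & (religb1 == 2070 | religb2 == 2070) & (religb1 == 7000 | religb2 == 7000)
gen byte c14 = religb1 != religb2 & (religb1 == 2070 & religb2 == 7000)
gen byte c15 = religb1 != religb2 & (religb2 == 2070 & religb1 == 7000)

* Category 12 observations that a later command would also match
count if c12 & (c14 | c15)

* Category 12 observations that would still be 12 at the end
count if c12 & !(c14 | c15)
```

The same pattern could be applied to category 8 against 10 and 11, and to category 13 against 16 and 17. If the second count is zero on the real data, the category cannot survive to the final -tabulate-, and the conditions would need to be made mutually exclusive or the broader categories dropped.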
