How to find mean of different categories in different variables?

Bhaskar Gupta

Join Date: Aug 2016

Posts: 5
#1

How to find mean of different categories in different variables?

10 Oct 2016, 04:59

Hi all,
My name is Bhaskar Gupta. I am not able to find out the means of different categories in different variables.
The variables and categories are as follows:

Variable 1:
Category 0: 108
Category1: 144
Category2: 56
Category3: 83
Category4: 9
Total: 400

Variable 2:
Category 0: 111
Category 1: 109
Category 2: 60
Category 3: 98
Category 4: 22
Total: 400

Variable 3:
Category 0: 109
Category 1: 166
Category 2: 78
Category 3: 31
Category 4: 16
Total: 400

I want to find out the means of different categories in different variables. For example:

For category 0, it should be equal to: 108+111+109/400 = 328/400= 0.82

Same for all the other categories.

I am not able to make the command for this.

I have tried a lot by myself, but everytime I tried I got 400/400=1

Kindly help me.

I will be very thankful.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35792

10 Oct 2016, 05:48

It is difficult to comment on your code, as you show none.

It is difficult to know exactly what is your data. Assuming that your "variables" are indeed Stata variables, I am led to guess at this structure. Knowing that your totals are all 400, then you won't need to calculate those.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte category int(variable1 variable2 variable3)
0 108 111 109
1 144 109 166
2  56  60  78
3  83  98  31
4   9  22  16
end

egen rowtotal = rowtotal(var*)

gen rowpr = rowtotal / 400

list

     +---------------------------------------------------------------+
     | category   variab~1   variab~2   variab~3   rowtotal    rowpr |
     |---------------------------------------------------------------|
  1. |        0        108        111        109        328      .82 |
  2. |        1        144        109        166        419   1.0475 |
  3. |        2         56         60         78        194     .485 |
  4. |        3         83         98         31        212      .53 |
  5. |        4          9         22         16         47    .1175 |
     +---------------------------------------------------------------+

Note that it is not Stataish to hold totals within the total set as extra observations. That way just lies confusion, as you would have to keep excluding them from other calculations.

We should not have to guess. There is guidance in the FAQ Advice on being explicit about your data and your code. Please read http://www.statalist.org/forums/help (especially #12) as requested.

If your data structure is different and you are still puzzled, please explain what it is.

Last edited by Nick Cox; 10 Oct 2016, 06:32.

Comment

Chris Larkin

Join Date: Apr 2016
Posts: 296

10 Oct 2016, 05:51

Hi Bhaskar,

I think you want to use egen's rowmean function for this problem. See below for a solution.

Code:

clear
set obs 5
local cmd generate
forvalues i = 1/5{
    `cmd' cats = "cat`=`i'-1'" in `i'
    if `i' < 4 gen var`i' = .
    local cmd replace
}

replace var1 = 108 in 1
replace var2 = 111 in 1
replace var3 = 109 in 1
replace var1 = 144 in 2
replace var2 = 109 in 2
replace var3 = 166 in 2
replace var1 = 56 in 3
replace var2 = 60 in 3
replace var3 = 78 in 3
replace var1 = 83 in 4
replace var2 = 98 in 4
replace var3 = 31 in 4
replace var1 = 9 in 5
replace var2 = 22 in 5
replace var3 = 15 in 5

egen rowtot = rowtotal(var1-var3)
egen rowmean = rowmean(var1-var3) // this is the function that produces your mean by categories

Do please use dataex (ssc install dataex) though when sharing data on Statalist. I've manually recreated your dataset, but you should not expect this sort of thing to happen and you can easily share your data for questions on Statalist by using dataex.

Comment

Bhaskar Gupta

Join Date: Aug 2016

Posts: 5
#4

11 Oct 2016, 13:41

Thanks Nick for your reply.
Comment
Bhaskar Gupta

Join Date: Aug 2016

Posts: 5
#5

11 Oct 2016, 15:26

Hi Nick and Chris,
Thanks for your replies. I am a Dental Public Health specialist from India. I am new to STATA. I have tried your commands in my data set, but was able to get answers.

This is my data which I got after using dataex command.

clear
input byte age str6 sex byte(cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647) float(cpirowtotal cpimean)
4 "male" 1 1 1 1 1 1 5 .0125
45 "female" 3 1 3 4 1 4 13 .0325
20 "female" 3 1 4 3 2 1 10 .025
24 "female" 4 1 3 1 1 3 10 .025
18 "female" 1 1 3 1 2 2 7 .0175
13 "female" 1 1 1 1 1 1 5 .0125
13 "female" 2 1 1 1 1 1 6 .015
33 "female" 3 2 2 1 1 1 8 .02
50 "female" 2 3 3 1 2 2 10 .025
45 "female" 0 2 2 2 2 2 8 .02
6 "female" 0 0 0 0 0 0 0 0
7 "male" 0 0 0 0 0 0 0 0
55 "female" 1 0 0 1 0 1 3 .0075
15 "female" 1 1 1 1 1 1 5 .0125
5 "female" 3 1 1 1 1 1 7 .0175
45 "female" 3 1 3 1 0 1 6 .015
55 "female" 4 3 2 1 0 1 9 .0225
40 "female" 1 3 4 1 1 3 9 .0225
20 "female" 0 1 1 1 1 1 4 .01
5 "male" 0 0 0 0 0 0 0 0
5 "female" 3 0 0 0 0 0 3 .0075
40 "male" 0 1 1 1 1 3 6 .015
60 "male" 1 0 0 0 0 0 1 .0025
12 "male" 0 1 1 1 1 1 4 .01
11 "female" 0 0 0 0 0 0 0 0
45 "male" 2 0 0 0 0 0 2 .005
18 "male" 0 2 2 2 2 2 8 .02
11 "male" 1 0 0 0 0 0 1 .0025
18 "female" 1 1 1 1 1 1 5 .0125
12 "male" 0 1 1 1 1 1 4 .01
5 "female" 1 0 0 0 0 0 1 .0025
32 "male" 1 1 1 1 1 1 5 .0125
10 "male" 1 1 1 1 1 1 5 .0125
5 "male" 0 1 1 1 1 1 4 .01
4 "male" 0 0 0 0 0 0 0 0
6 "female" 0 0 0 0 0 0 0 0
5 "female" 0 0 0 0 0 0 0 0
6 "female" 0 0 0 0 0 0 0 0
7 "male" 0 0 0 0 0 0 0 0
3 "male" 0 0 0 0 0 0 0 0
6 "female" 1 0 0 0 0 0 1 .0025
20 "female" 0 1 1 1 1 1 4 .01
7 "female" 0 0 0 0 0 0 0 0
6 "female" 1 0 0 0 0 0 1 .0025
40 "male" 1 1 2 2 2 2 8 .02
35 "female" 2 1 1 1 1 1 6 .015
60 "male" 2 1 1 1 2 2 8 .02
50 "female" 2 1 3 1 0 3 7 .0175
20 "female" 3 1 2 1 1 1 7 .0175
40 "female" 1 1 3 4 1 3 10 .025
12 "female" 1 1 1 1 1 1 5 .0125
14 "male" 0 1 2 2 1 1 5 .0125
8 "male" 1 0 0 0 0 0 1 .0025
11 "male" 1 1 1 1 1 1 5 .0125
9 "male" 0 1 1 1 1 1 4 .01
6 "female" 0 0 0 0 0 0 0 0
22 "male" 0 0 0 0 0 0 0 0
6 "male" 0 0 0 0 0 0 0 0
5 "female" 1 0 0 0 0 0 1 .0025
24 "female" 1 1 1 1 1 1 5 .0125
15 "female" 3 1 1 1 1 1 7 .0175
46 "female" 0 1 3 4 1 3 9 .0225
5 "female" 1 0 0 0 0 0 1 .0025
12 "female" 1 1 1 1 1 1 5 .0125
20 "female" 3 1 1 1 1 1 7 .0175
35 "female" 1 2 3 2 1 2 8 .02
45 "female" 0 1 2 4 1 2 8 .02
5 "male" 0 0 0 0 0 0 0 0
6 "female" 0 0 0 0 0 0 0 0
57 "female" 2 0 0 0 0 0 2 .005
68 "male" 2 2 3 2 1 2 9 .0225
13 "female" 2 2 2 2 2 2 10 .025
25 "female" 1 1 2 3 1 3 9 .0225
12 "female" 3 1 1 1 1 1 7 .0175
35 "female" 2 1 2 4 1 2 10 .025
12 "female" 2 2 2 2 2 2 10 .025
42 "female" 0 2 2 2 2 2 8 .02
38 "female" 0 0 0 0 0 0 0 0
5 "female" 3 0 0 0 0 0 3 .0075
35 "male" 0 1 2 2 1 2 6 .015
3 "female" 0 0 0 0 0 0 0 0
22 "female" 0 0 0 0 0 0 0 0
25 "male" 3 0 0 0 0 0 3 .0075
26 "male" 2 1 3 1 1 1 6 .015
24 "female" 0 1 3 2 1 2 6 .015
6 "male" 0 0 0 0 0 0 0 0
4 "male" 0 0 0 0 0 0 0 0
8 "male" 2 0 0 0 0 0 2 .005
16 "male" 0 2 2 2 2 2 8 .02
6 "male" 3 0 0 0 0 0 3 .0075
18 "male" 0 1 3 1 3 2 7 .0175
24 "male" 0 0 0 0 0 0 0 0
20 "male" 0 0 0 0 0 0 0 0
36 "male" 1 0 0 0 0 0 1 .0025
45 "female" 3 1 3 4 2 2 12 .03
55 "male" 0 2 3 1 0 1 4 .01
65 "male" 1 0 0 0 0 0 1 .0025
28 "female" 3 1 2 3 1 1 9 .0225
28 "female" 2 1 3 2 1 2 8 .02
20 "male" 1 1 2 3 1 1 7 .0175
end
label values cpi1617 cpi1
label def cpi1 0 "Healthy", modify
label def cpi1 1 "Bleeding", modify
label def cpi1 2 "Calculus", modify
label def cpi1 3 "Pocket 4-5mm", modify
label def cpi1 4 "Pocket >=6mm", modify
label values cpi11 cpi2
label def cpi2 0 "Healthy", modify
label def cpi2 1 "Bleeding", modify
label def cpi2 2 "Calculus", modify
label def cpi2 3 "Pocket 4-5mm", modify
label values cpi2627 cpi3
label def cpi3 0 "Healthy", modify
label def cpi3 1 "Bleeding", modify
label def cpi3 2 "Calculus", modify
label def cpi3 3 "Pocket 4-5mm", modify
label def cpi3 4 "Pocket >=6mm", modify
label values cpi3637 cpi4
label def cpi4 0 "Healthy", modify
label def cpi4 1 "Bleeding", modify
label def cpi4 2 "Calculus", modify
label def cpi4 3 "Pocket 4-5mm", modify
label def cpi4 4 "Pocket >=6mm", modify
label values cpi31 cpi5
label def cpi5 0 "Healthy", modify
label def cpi5 1 "Bleeding", modify
label def cpi5 2 "Calculus", modify
label def cpi5 3 "Pocket 4-5mm", modify
label values cpi4647 cpi6
label def cpi6 0 "Healthy", modify
label def cpi6 1 "Bleeding", modify
label def cpi6 2 "Calculus", modify
label def cpi6 3 "Pocket 4-5mm", modify
label def cpi6 4 "Pocket >=6mm", modify
[/CODE]

First of all, my problem is that I have to make 7 categories of different age groups.
The age groups are as follows:

Category 1: 2-5
Category 2: 6-12
Category 3: 13-15
Category 4: 16-34
Category 5: 35-44
Category 6: 45-64
Category 7: 65-84

I have used the following commands for this:

gen agecat=.
replace agecat=1 if ageinnum>=2 & ageinnum<=5
replace agecat=2 if ageinnum>=6 & ageinnum<=12
replace agecat=3 if ageinnum>=13 & ageinnum<=15
replace agecat=4 if ageinnum>=16 & ageinnum<=34
replace agecat=5 if ageinnum>=35 & ageinnum<=44
replace agecat=6 if ageinnum>=45 & ageinnum<=64
replace agecat=7 if ageinnum>=65 & ageinnum<=84

But total shown by STATA is 392, however the total sample is 400.

My second problem is regarding CPI (Community Periodontal Index) scores on different teeth, numbered as: (cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647)
The different categories on for each tooth number (cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647) is as follows:
0- healthy
1- bleeding
2- calculus
3- Pocket 4-5mm
4- Pocket >=6 mm

First of all, I have to separately find the mean of different categories in all the tooth numbers.
For example, adding 0 for all the teeth numbers separately, then taking their mean.
I have used the following command, but didn't get the answer.

egen cpirowtotal= rowtotal (cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647)
gen cpirowmean= cpirowtotal /400

Secondly, I have generate 4 new variables for 4 categories (0,1,2,3,4), taking all the teeth numbers all together.

I know regarding the complexity of my problem.

I hope that you understand my problem.

Thanks.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35792
#6

11 Oct 2016, 15:39

Thanks for using dataex (SSC).

Your example data include a variable age; your code for age categories in contrast uses a variable ageinnum and would not work with your example data; otherwise it looks good to me.

We can't check on 400 versus 392 as the sample data are just 100 observations.

I can't see that your variables coded 0 to 4 are variables for which a mean makes sense. Is it even an ordered scale with mutually exclusive categories? I infer that there is a Community Periodontal Index; in turn I am not any kind expert on dental health.

Others here may be able to help more.
Comment
Bhaskar Gupta

Join Date: Aug 2016

Posts: 5
#7

12 Oct 2016, 00:34

Thanks for your reply Nick.
In my data, age was a string variable. I changed it to a numeral variable as ageinnum, by using command: encode age, gen (ageinnum).

After that, ageinnum is divided into 7 age groups, but in that total comes as 392 instead of 400.

The different age groups are: 2-5, 6-12, 13-15, 16-34, 35-44, 45-64 and 65-84.

Secondly, for Community Petiodontal Index scores, codes: 0,1,2,3,4 are mutually exclusive categories. We have found out these CPI codes on particular teeth (cpi1617, cpi11, cpi2627, cpi3637, cpi31, cpi4647).

Now I have to find out the total mean of for the different cpi codes.

For example: code 0 in cpi1617 is for 107 subjects, code 0 in cpi11 is for 113 subjects, code 0 in cpi2627 is for 110 subjects,code 0 in cpi3637 is for 108 subjects, code 0 in cpi31 is for 122 subjects, code 0 in cpi4647 is for 111 subjects.
So total mean will be: 671/400=1.68.

The same has to be done for other cpi codes also.

Last edited by Bhaskar Gupta; 12 Oct 2016, 00:44.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35792

12 Oct 2016, 02:02

Not so. In your reply age was a byte variable. That's explicit in #5. That's a small confusion, but if you can be confused or inconsistent on small matters, the larger ones won't be easier.

Similarly you should phrase your question in terms of the example data. Manifestly we do not have your full data and cannot check calculations otherwise.

With your sample data I get (once a particular package is installed, first command below)

Code:

* do this once only 
ssc install tab_chi 

. tabm  cpi1617  cpi11 cpi2627 cpi3637 cpi31 cpi4647

           |                         values
  variable |   Healthy   Bleeding   Calculus  Pocket 4-  Pocket >= |     Total
-----------+-------------------------------------------------------+----------
   cpi1617 |        41         28         14         15          2 |       100 
     cpi11 |        41         46         10          3          0 |       100 
   cpi2627 |        41         24         17         16          2 |       100 
   cpi3637 |        40         37         13          4          6 |       100 
     cpi31 |        45         42         12          1          0 |       100 
   cpi4647 |        40         33         19          7          1 |       100 
-----------+-------------------------------------------------------+----------
     Total |       248        210         85         46         11 |       600

Perhaps the column totals are what you are after.

Comment

Bhaskar Gupta

Join Date: Aug 2016

Posts: 5
#9

21 Oct 2016, 11:07

Thanks Nick for your reply. Yes, it is true that I want column total to calculate the total number of patients, suffering from bleeding, calculus, Pocket (4 to 5 mm) and Pocket (6 mm or more).

Thanks for telling me about this command. I have no idea of this command before this.

I have some more queries regarding the same variables [Healthy Bleeding Calculus Pocket (4 to 5 mm) and Pocket (6mm or more)] find above in my previous post.

I want to inform you that here 600 are total different CPI values on different teeth, and not the total number of patients. The total number of patients are 400 in my data, and I am interested in total number of patients, and not in total different CPI values on different teeth.

The queries are as follows:

1. How can I generate a new variable only for Bleeding, which shows only total number of patients suffering from Bleeding gums, that is 210. The same I have to do for Calculus, Pocket (4 to 5

mm) and Pocket (6mm or more) separately. So, the total number of new variables will be 4 [Bleeding= 210; Calculus=85; Pocket (4 to 5 mm)= 46; and Pocket (6mm or more)= 11].

2. After that I have to generate a new variable which show the sum of bleeding, calculus, Pocket (4 to 5 mm), and Pocket (6mm or more) patients and that will be 210+85+46+11= 352.

3. After that, I have to generate a new variable for bleeding which shows patients suffering from bleeding gums (210), divided by 400, as 400 is my total sample size of patients. The answer

will be 0.525. The same I have to do for calculus. For example: new variable for calculus, 85/400= 0.213, and so on. Again, the new variables will be 4 here (210/400; 85/400; 46/400; 11/400)

4. Lastly, I have to generate two new variables separately only for bleeding gums (210), divided by 177, as 177 are the male patients (210/177= 1.19), and same for female patients (210/223=

0.94), as 223 are the female patients.

I think I have to generate a lot new variables and this work is slightly complex.

Kindly help me.

Last edited by Bhaskar Gupta; 21 Oct 2016, 11:10.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35792
#10

21 Oct 2016, 11:21

In general, it's best to ask one question at a time. If you want coding advice on an entire small project, you need to hire a programmer.

I think you need to step back from a lengthy series of small questions and learn about some more general features.

As emphasised in #6, we have no way to verifying your calculations because we do not have all the data.

First, if is your friend and inlist() and inrange() are your friends. Suppose I want to know how many people have condition 42 in any of a set of variables. Something like

Code:

count if inlist(42, a, b, c, d, e)

is short for

Code:

count if a == 42 | b == 42 | c == 42 | d == 42 | e == 42

Code:

help inlist()

gives the definition.

Second, you're asking about creating lots of new variables, but I can't see from a quick read that you need to do that.
Comment

Announcement