Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to find mean of different categories in different variables?

    Hi all,
    My name is Bhaskar Gupta. I am not able to find out the means of different categories in different variables.
    The variables and categories are as follows:

    Variable 1:
    Category 0: 108
    Category1: 144
    Category2: 56
    Category3: 83
    Category4: 9
    Total: 400

    Variable 2:
    Category 0: 111
    Category 1: 109
    Category 2: 60
    Category 3: 98
    Category 4: 22
    Total: 400

    Variable 3:
    Category 0: 109
    Category 1: 166
    Category 2: 78
    Category 3: 31
    Category 4: 16
    Total: 400

    I want to find out the means of different categories in different variables. For example:

    For category 0, it should be equal to: 108+111+109/400 = 328/400= 0.82

    Same for all the other categories.

    I am not able to make the command for this.

    I have tried a lot by myself, but everytime I tried I got 400/400=1

    Kindly help me.

    I will be very thankful.



  • #2
    It is difficult to comment on your code, as you show none.

    It is difficult to know exactly what is your data. Assuming that your "variables" are indeed Stata variables, I am led to guess at this structure. Knowing that your totals are all 400, then you won't need to calculate those.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte category int(variable1 variable2 variable3)
    0 108 111 109
    1 144 109 166
    2  56  60  78
    3  83  98  31
    4   9  22  16
    end
    
    egen rowtotal = rowtotal(var*)
    
    gen rowpr = rowtotal / 400
    
    list
    
         +---------------------------------------------------------------+
         | category   variab~1   variab~2   variab~3   rowtotal    rowpr |
         |---------------------------------------------------------------|
      1. |        0        108        111        109        328      .82 |
      2. |        1        144        109        166        419   1.0475 |
      3. |        2         56         60         78        194     .485 |
      4. |        3         83         98         31        212      .53 |
      5. |        4          9         22         16         47    .1175 |
         +---------------------------------------------------------------+
    Note that it is not Stataish to hold totals within the total set as extra observations. That way just lies confusion, as you would have to keep excluding them from other calculations.

    We should not have to guess. There is guidance in the FAQ Advice on being explicit about your data and your code. Please read http://www.statalist.org/forums/help (especially #12) as requested.

    If your data structure is different and you are still puzzled, please explain what it is.
    Last edited by Nick Cox; 10 Oct 2016, 06:32.

    Comment


    • #3
      Hi Bhaskar,

      I think you want to use egen's rowmean function for this problem. See below for a solution.

      Code:
      clear
      set obs 5
      local cmd generate
      forvalues i = 1/5{
          `cmd' cats = "cat`=`i'-1'" in `i'
          if `i' < 4 gen var`i' = .
          local cmd replace
      }
      
      replace var1 = 108 in 1
      replace var2 = 111 in 1
      replace var3 = 109 in 1
      replace var1 = 144 in 2
      replace var2 = 109 in 2
      replace var3 = 166 in 2
      replace var1 = 56 in 3
      replace var2 = 60 in 3
      replace var3 = 78 in 3
      replace var1 = 83 in 4
      replace var2 = 98 in 4
      replace var3 = 31 in 4
      replace var1 = 9 in 5
      replace var2 = 22 in 5
      replace var3 = 15 in 5
      
      egen rowtot = rowtotal(var1-var3)
      egen rowmean = rowmean(var1-var3) // this is the function that produces your mean by categories
      Do please use dataex (ssc install dataex) though when sharing data on Statalist. I've manually recreated your dataset, but you should not expect this sort of thing to happen and you can easily share your data for questions on Statalist by using dataex.

      Comment


      • #4
        Thanks Nick for your reply.

        Comment


        • #5
          Hi Nick and Chris,
          Thanks for your replies. I am a Dental Public Health specialist from India. I am new to STATA. I have tried your commands in my data set, but was able to get answers.

          This is my data which I got after using dataex command.

          clear
          input byte age str6 sex byte(cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647) float(cpirowtotal cpimean)
          4 "male" 1 1 1 1 1 1 5 .0125
          45 "female" 3 1 3 4 1 4 13 .0325
          20 "female" 3 1 4 3 2 1 10 .025
          24 "female" 4 1 3 1 1 3 10 .025
          18 "female" 1 1 3 1 2 2 7 .0175
          13 "female" 1 1 1 1 1 1 5 .0125
          13 "female" 2 1 1 1 1 1 6 .015
          33 "female" 3 2 2 1 1 1 8 .02
          50 "female" 2 3 3 1 2 2 10 .025
          45 "female" 0 2 2 2 2 2 8 .02
          6 "female" 0 0 0 0 0 0 0 0
          7 "male" 0 0 0 0 0 0 0 0
          55 "female" 1 0 0 1 0 1 3 .0075
          15 "female" 1 1 1 1 1 1 5 .0125
          5 "female" 3 1 1 1 1 1 7 .0175
          45 "female" 3 1 3 1 0 1 6 .015
          55 "female" 4 3 2 1 0 1 9 .0225
          40 "female" 1 3 4 1 1 3 9 .0225
          20 "female" 0 1 1 1 1 1 4 .01
          5 "male" 0 0 0 0 0 0 0 0
          5 "female" 3 0 0 0 0 0 3 .0075
          40 "male" 0 1 1 1 1 3 6 .015
          60 "male" 1 0 0 0 0 0 1 .0025
          12 "male" 0 1 1 1 1 1 4 .01
          11 "female" 0 0 0 0 0 0 0 0
          45 "male" 2 0 0 0 0 0 2 .005
          18 "male" 0 2 2 2 2 2 8 .02
          11 "male" 1 0 0 0 0 0 1 .0025
          18 "female" 1 1 1 1 1 1 5 .0125
          12 "male" 0 1 1 1 1 1 4 .01
          5 "female" 1 0 0 0 0 0 1 .0025
          32 "male" 1 1 1 1 1 1 5 .0125
          10 "male" 1 1 1 1 1 1 5 .0125
          5 "male" 0 1 1 1 1 1 4 .01
          4 "male" 0 0 0 0 0 0 0 0
          6 "female" 0 0 0 0 0 0 0 0
          5 "female" 0 0 0 0 0 0 0 0
          6 "female" 0 0 0 0 0 0 0 0
          7 "male" 0 0 0 0 0 0 0 0
          3 "male" 0 0 0 0 0 0 0 0
          6 "female" 1 0 0 0 0 0 1 .0025
          20 "female" 0 1 1 1 1 1 4 .01
          7 "female" 0 0 0 0 0 0 0 0
          6 "female" 1 0 0 0 0 0 1 .0025
          40 "male" 1 1 2 2 2 2 8 .02
          35 "female" 2 1 1 1 1 1 6 .015
          60 "male" 2 1 1 1 2 2 8 .02
          50 "female" 2 1 3 1 0 3 7 .0175
          20 "female" 3 1 2 1 1 1 7 .0175
          40 "female" 1 1 3 4 1 3 10 .025
          12 "female" 1 1 1 1 1 1 5 .0125
          14 "male" 0 1 2 2 1 1 5 .0125
          8 "male" 1 0 0 0 0 0 1 .0025
          11 "male" 1 1 1 1 1 1 5 .0125
          9 "male" 0 1 1 1 1 1 4 .01
          6 "female" 0 0 0 0 0 0 0 0
          22 "male" 0 0 0 0 0 0 0 0
          6 "male" 0 0 0 0 0 0 0 0
          5 "female" 1 0 0 0 0 0 1 .0025
          24 "female" 1 1 1 1 1 1 5 .0125
          15 "female" 3 1 1 1 1 1 7 .0175
          46 "female" 0 1 3 4 1 3 9 .0225
          5 "female" 1 0 0 0 0 0 1 .0025
          12 "female" 1 1 1 1 1 1 5 .0125
          20 "female" 3 1 1 1 1 1 7 .0175
          35 "female" 1 2 3 2 1 2 8 .02
          45 "female" 0 1 2 4 1 2 8 .02
          5 "male" 0 0 0 0 0 0 0 0
          6 "female" 0 0 0 0 0 0 0 0
          57 "female" 2 0 0 0 0 0 2 .005
          68 "male" 2 2 3 2 1 2 9 .0225
          13 "female" 2 2 2 2 2 2 10 .025
          25 "female" 1 1 2 3 1 3 9 .0225
          12 "female" 3 1 1 1 1 1 7 .0175
          35 "female" 2 1 2 4 1 2 10 .025
          12 "female" 2 2 2 2 2 2 10 .025
          42 "female" 0 2 2 2 2 2 8 .02
          38 "female" 0 0 0 0 0 0 0 0
          5 "female" 3 0 0 0 0 0 3 .0075
          35 "male" 0 1 2 2 1 2 6 .015
          3 "female" 0 0 0 0 0 0 0 0
          22 "female" 0 0 0 0 0 0 0 0
          25 "male" 3 0 0 0 0 0 3 .0075
          26 "male" 2 1 3 1 1 1 6 .015
          24 "female" 0 1 3 2 1 2 6 .015
          6 "male" 0 0 0 0 0 0 0 0
          4 "male" 0 0 0 0 0 0 0 0
          8 "male" 2 0 0 0 0 0 2 .005
          16 "male" 0 2 2 2 2 2 8 .02
          6 "male" 3 0 0 0 0 0 3 .0075
          18 "male" 0 1 3 1 3 2 7 .0175
          24 "male" 0 0 0 0 0 0 0 0
          20 "male" 0 0 0 0 0 0 0 0
          36 "male" 1 0 0 0 0 0 1 .0025
          45 "female" 3 1 3 4 2 2 12 .03
          55 "male" 0 2 3 1 0 1 4 .01
          65 "male" 1 0 0 0 0 0 1 .0025
          28 "female" 3 1 2 3 1 1 9 .0225
          28 "female" 2 1 3 2 1 2 8 .02
          20 "male" 1 1 2 3 1 1 7 .0175
          end
          label values cpi1617 cpi1
          label def cpi1 0 "Healthy", modify
          label def cpi1 1 "Bleeding", modify
          label def cpi1 2 "Calculus", modify
          label def cpi1 3 "Pocket 4-5mm", modify
          label def cpi1 4 "Pocket >=6mm", modify
          label values cpi11 cpi2
          label def cpi2 0 "Healthy", modify
          label def cpi2 1 "Bleeding", modify
          label def cpi2 2 "Calculus", modify
          label def cpi2 3 "Pocket 4-5mm", modify
          label values cpi2627 cpi3
          label def cpi3 0 "Healthy", modify
          label def cpi3 1 "Bleeding", modify
          label def cpi3 2 "Calculus", modify
          label def cpi3 3 "Pocket 4-5mm", modify
          label def cpi3 4 "Pocket >=6mm", modify
          label values cpi3637 cpi4
          label def cpi4 0 "Healthy", modify
          label def cpi4 1 "Bleeding", modify
          label def cpi4 2 "Calculus", modify
          label def cpi4 3 "Pocket 4-5mm", modify
          label def cpi4 4 "Pocket >=6mm", modify
          label values cpi31 cpi5
          label def cpi5 0 "Healthy", modify
          label def cpi5 1 "Bleeding", modify
          label def cpi5 2 "Calculus", modify
          label def cpi5 3 "Pocket 4-5mm", modify
          label values cpi4647 cpi6
          label def cpi6 0 "Healthy", modify
          label def cpi6 1 "Bleeding", modify
          label def cpi6 2 "Calculus", modify
          label def cpi6 3 "Pocket 4-5mm", modify
          label def cpi6 4 "Pocket >=6mm", modify
          [/CODE]

          First of all, my problem is that I have to make 7 categories of different age groups.
          The age groups are as follows:

          Category 1: 2-5
          Category 2: 6-12
          Category 3: 13-15
          Category 4: 16-34
          Category 5: 35-44
          Category 6: 45-64
          Category 7: 65-84

          I have used the following commands for this:

          gen agecat=.
          replace agecat=1 if ageinnum>=2 & ageinnum<=5
          replace agecat=2 if ageinnum>=6 & ageinnum<=12
          replace agecat=3 if ageinnum>=13 & ageinnum<=15
          replace agecat=4 if ageinnum>=16 & ageinnum<=34
          replace agecat=5 if ageinnum>=35 & ageinnum<=44
          replace agecat=6 if ageinnum>=45 & ageinnum<=64
          replace agecat=7 if ageinnum>=65 & ageinnum<=84

          But total shown by STATA is 392, however the total sample is 400.

          My second problem is regarding CPI (Community Periodontal Index) scores on different teeth, numbered as: (cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647)
          The different categories on for each tooth number (cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647) is as follows:
          0- healthy
          1- bleeding
          2- calculus
          3- Pocket 4-5mm
          4- Pocket >=6 mm

          First of all, I have to separately find the mean of different categories in all the tooth numbers.
          For example, adding 0 for all the teeth numbers separately, then taking their mean.
          I have used the following command, but didn't get the answer.

          egen cpirowtotal= rowtotal (cpi1617 cpi11 cpi2627 cpi3637 cpi31 cpi4647)
          gen cpirowmean= cpirowtotal /400

          Secondly, I have generate 4 new variables for 4 categories (0,1,2,3,4), taking all the teeth numbers all together.

          I know regarding the complexity of my problem.

          I hope that you understand my problem.

          Thanks.

          Comment


          • #6
            Thanks for using dataex (SSC).

            Your example data include a variable age; your code for age categories in contrast uses a variable ageinnum and would not work with your example data; otherwise it looks good to me.

            We can't check on 400 versus 392 as the sample data are just 100 observations.

            I can't see that your variables coded 0 to 4 are variables for which a mean makes sense. Is it even an ordered scale with mutually exclusive categories? I infer that there is a Community Periodontal Index; in turn I am not any kind expert on dental health.

            Others here may be able to help more.

            Comment


            • #7
              Thanks for your reply Nick.
              In my data, age was a string variable. I changed it to a numeral variable as ageinnum, by using command: encode age, gen (ageinnum).

              After that, ageinnum is divided into 7 age groups, but in that total comes as 392 instead of 400.

              The different age groups are: 2-5, 6-12, 13-15, 16-34, 35-44, 45-64 and 65-84.

              Secondly, for Community Petiodontal Index scores, codes: 0,1,2,3,4 are mutually exclusive categories. We have found out these CPI codes on particular teeth (cpi1617, cpi11, cpi2627, cpi3637, cpi31, cpi4647).

              Now I have to find out the total mean of for the different cpi codes.

              For example: code 0 in cpi1617 is for 107 subjects, code 0 in cpi11 is for 113 subjects, code 0 in cpi2627 is for 110 subjects,code 0 in cpi3637 is for 108 subjects, code 0 in cpi31 is for 122 subjects, code 0 in cpi4647 is for 111 subjects.
              So total mean will be: 671/400=1.68.

              The same has to be done for other cpi codes also.
              Last edited by Bhaskar Gupta; 12 Oct 2016, 00:44.

              Comment


              • #8
                Not so. In your reply age was a byte variable. That's explicit in #5. That's a small confusion, but if you can be confused or inconsistent on small matters, the larger ones won't be easier.

                Similarly you should phrase your question in terms of the example data. Manifestly we do not have your full data and cannot check calculations otherwise.

                With your sample data I get (once a particular package is installed, first command below)

                Code:
                * do this once only 
                ssc install tab_chi 
                
                . tabm  cpi1617  cpi11 cpi2627 cpi3637 cpi31 cpi4647
                
                           |                         values
                  variable |   Healthy   Bleeding   Calculus  Pocket 4-  Pocket >= |     Total
                -----------+-------------------------------------------------------+----------
                   cpi1617 |        41         28         14         15          2 |       100 
                     cpi11 |        41         46         10          3          0 |       100 
                   cpi2627 |        41         24         17         16          2 |       100 
                   cpi3637 |        40         37         13          4          6 |       100 
                     cpi31 |        45         42         12          1          0 |       100 
                   cpi4647 |        40         33         19          7          1 |       100 
                -----------+-------------------------------------------------------+----------
                     Total |       248        210         85         46         11 |       600
                Perhaps the column totals are what you are after.

                Comment


                • #9

                  Thanks Nick for your reply. Yes, it is true that I want column total to calculate the total number of patients, suffering from bleeding, calculus, Pocket (4 to 5 mm) and Pocket (6 mm or more).

                  Thanks for telling me about this command. I have no idea of this command before this.



                  I have some more queries regarding the same variables [Healthy Bleeding Calculus Pocket (4 to 5 mm) and Pocket (6mm or more)] find above in my previous post.

                  I want to inform you that here 600 are total different CPI values on different teeth, and not the total number of patients. The total number of patients are 400 in my data, and I am interested in total number of patients, and not in total different CPI values on different teeth.




                  The queries are as follows:

                  1. How can I generate a new variable only for Bleeding, which shows only total number of patients suffering from Bleeding gums, that is 210. The same I have to do for Calculus, Pocket (4 to 5

                  mm) and Pocket (6mm or more) separately. So, the total number of new variables will be 4 [Bleeding= 210; Calculus=85; Pocket (4 to 5 mm)= 46; and Pocket (6mm or more)= 11].



                  2. After that I have to generate a new variable which show the sum of bleeding, calculus, Pocket (4 to 5 mm), and Pocket (6mm or more) patients and that will be 210+85+46+11= 352.



                  3. After that, I have to generate a new variable for bleeding which shows patients suffering from bleeding gums (210), divided by 400, as 400 is my total sample size of patients. The answer

                  will be 0.525. The same I have to do for calculus. For example: new variable for calculus, 85/400= 0.213, and so on. Again, the new variables will be 4 here (210/400; 85/400; 46/400; 11/400)


                  4. Lastly, I have to generate two new variables separately only for bleeding gums (210), divided by 177, as 177 are the male patients (210/177= 1.19), and same for female patients (210/223=

                  0.94), as 223 are the female patients.


                  I think I have to generate a lot new variables and this work is slightly complex.

                  Kindly help me.








                  Last edited by Bhaskar Gupta; 21 Oct 2016, 11:10.

                  Comment


                  • #10
                    In general, it's best to ask one question at a time. If you want coding advice on an entire small project, you need to hire a programmer.

                    I think you need to step back from a lengthy series of small questions and learn about some more general features.

                    As emphasised in #6, we have no way to verifying your calculations because we do not have all the data.

                    First, if is your friend and inlist() and inrange() are your friends. Suppose I want to know how many people have condition 42 in any of a set of variables. Something like

                    Code:
                    count if inlist(42, a, b, c, d, e)
                    is short for

                    Code:
                    count if a == 42 | b == 42 | c == 42 | d == 42 | e == 42
                    Code:
                    help inlist()
                    gives the definition.

                    Second, you're asking about creating lots of new variables, but I can't see from a quick read that you need to do that.

                    Comment

                    Working...
                    X