Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Computing average age

    Hello everybody, I have a problem that is apparently simple but to me it seems complicated.
    I have demographic data about the age of people living in various cities from 1992 to 2019, and I want to compute the average age for each city in each year. The problem is that I have only variables in a range of ages, namely the number of inhabitants that have an age between 0-4, 10-14, 15-19 and so forth, like that

    :
    Click image for larger version

Name:	esempio dataset.png
Views:	1
Size:	40.6 KB
ID:	1603991





    I would like to compute the average age by municipality (Agliè, Airasca etc.) every year. Is there a solution? I was thinking that I could take the median age for each range (for example, in the range between 0 and 4 I could pretend that all people that are between 0 and 4 years old are actually 2 years old), but still I don't know then how to compute the average age (I would have simply the frequencies of the ages). Thank you in advance.
    Last edited by Riccardo Calabretta; 16 Apr 2021, 09:33.

  • #2
    In principle -egen, mean()- does what you want, but you need to show a sample of your data using -dataex- so that we can see how your age variable is encoded.

    Comment


    • #3
      I don't know exactly if this can be helpful, but here it is:

      Click image for larger version

Name:	subsample.png
Views:	1
Size:	17.7 KB
ID:	1604005

      Comment


      • #4
        That looks like an image to me, but it is helpful enough to suggest code. Your variables presumably run with suffixes 0_4 to 95_99 and then 100_ov. I will assume that each age group has medians 2 to 97 and then 102.

        So we need to look at each variable name -- ignore p_jan_ -- and then read the first age in each interval as implied by what precedes the next underscore.

        But, but, but: I can't copy your data to play and so none of this is tested.


        Code:
        gen num = 0
        gen den = 0
        
        foreach v of var p_jan* {
             * ignore p_jan_  so first 6 characters of the variable name ignored
             local first = substr("`v'", 7, .)
             local first = substr("`first'", 1, strpos("`first'", "_") - 1)
             local median = `first' + 2
        
             replace num = num + `median' * `v'
             replace den = den + `v'
        }
        
        gen mean = num / den
        Last edited by Nick Cox; 16 Apr 2021, 11:50.

        Comment


        • #5
          Originally posted by Nick Cox View Post
          That looks like an image to me, but it is helpful enough to suggest code. Your variables presumably run with suffixes 0_4 to 95_99 and then 100_ov. I will assume that each age group has medians 2 to 97 and then 102.

          So we need to look at each variable name -- ignore p_jan_ -- and then read the first age in each interval as implied by what precedes the next underscore.

          But, but, but: I can't copy your data to play and so none of this is tested.


          Code:
          gen num = 0
          gen den = 0
          
          foreach v of var p_jan* {
          * ignore p_jan_ so first 6 characters of the variable name ignored
          local first = substr("`v'", 7, .)
          local first = substr("`first'", 1, strpos("`first'", "_") - 1)
          local median = `first' + 2
          
          replace num = num + `median' * `v'
          replace den = den + `v'
          }
          
          gen mean = num / den
          Thank you for your help, but I don't understand, what do v, 7, ., 1 and first stand for in your code. I would like simply to assign to the variable "p_jan_0_4" the numeric value 2, to "p_jan_5_9" the value 7 and so forth, and in the end computing the average age by municipality having the frequencies of each median age 2,7,12,17, 21 etc.

          Comment


          • #6
            That is precisely what my code is intended to do. Did you try it?

            The statement puzzling you is commented

            Code:
             
             * ignore p_jan_ so first 6 characters of the variable name ignored local first = substr("`v'", 7, .)
            Say the first time around the loop the code is looking at the variable p_jan_0_4. (That doesn't have to be true. It could be looking at any other the variables that start p_jan_.)

            The code uses the function substr() to ignore the first 6 characters of the variable name which are p_jan_ by selecting character 7 onwards. See the help for substr() to learn more. about this function.

            Then the code is looking at 0_4

            We need to peel off the number before the _ which is what

            Code:
             
             local first = substr("`first'", 1, strpos("`first'", "_") - 1)
            is intended to do. The code looks for the position of the underscore and then subtracts 1. We can't just look for the first character because sometimes we need the first character as with 0 5, sometimes we need the first two characters as with 10 (5) 95 and sometimes we need the first three characters 100. Once we have found the lower limit of each bin we add 2 and then use that to build up the numerator and denominator of a weighted mean.
            I can't copy your data to play and so none of this is tested.

            Comment


            • #7
              The fact is that I'm not sure if I figured out in practice.

              For example for the variable p_jan_0_4 I should write

              Code:
               
               local first = substr(p_jan_0_4, 7, .) local first = substr(first, 1, strpos(first, _) - 1) local median = first + 2  replace num = num + `median' * `v' replace den = den + `v'
              I'm sure this is wrong but I don't know in practice how to do it. Can you provide me a concrete example for the variable p_jan_0_4? I didn't understand what to plug into the brackets

              Comment


              • #8
                Unfortunately your rewriting is not only not needed: it breaks my code.

                You need to work with the variable name and not its values, so that mistake broke the code.

                Also, the code is general and should work with all the relevant variables. It does not need writing out again as a series of particular cases.

                As said, this code should work so far as I can see, so what you need to do is try it with your data. It should not need rewriting,

                Also, it is the other way round. It's you who has in your dataset a concrete example of p_jan_0_4 and showed it in #3. You got as far as using dataex (thanks) but what you showed was an image.

                https://www.statalist.org/forums/help#stata explains. But in a nutshell once you have run dataex you need to copy and paste the results from Stata into this forum software. Do not copy the results to an image and insert the image. Experienced Stata users who answer questions do not want to copy numbers from an image and type them in one by one.

                This isn't an easy problem for Stata beginners, but you should (please) tend to trust my code more than your guesses at what it should be.
                Last edited by Nick Cox; 17 Apr 2021, 05:01.

                Comment


                • #9
                  Sorry here is my copypaste of dataex

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str20 id_code float year double(p_jan_0_4 p_jan_5_9 p_jan_10_14 p_jan_15_19)
                  "001001" 1991   .   .   .   .
                  "001001" 1992  85  99 111 141
                  "001001" 1993  99  93  97 123
                  "001001" 1994 100  89 102 109
                  "001001" 1995  99  98 100 107
                  "001001" 1996 108 100  98 116
                  "001001" 1997 106  99 107 112
                  "001001" 1998  99 102 100 101
                  "001001" 1999  91 108 101  99
                  "001001" 2000  87 114 103  94
                  "001001" 2001  86 109 104  92
                  "001001" 2002  93 104  96 103
                  "001001" 2003  99 100 100  98
                  "001001" 2004 108  99 105 105
                  "001001" 2005 109  98 117 115
                  "001001" 2006 104  97 115 116
                  "001001" 2007 102  97 107 103
                  "001001" 2008  94  95 104 112
                  "001001" 2009 100 100  97 106
                  "001001" 2010  97 101  88 105
                  "001001" 2011  99 100  93 110
                  "001001" 2012 106  98  92 109
                  "001001" 2013 113 105 104 107
                  "001001" 2014 122 106 110 104
                  "001001" 2015 114 113 109  99
                  "001001" 2016 111 114 106  93
                  "001001" 2017 116 119 102 100
                  "001001" 2018 108 121 108 108
                  "001001" 2019 103 121 110 107
                  "001002" 1991   .   .   .   .
                  "001002" 1992 173 230 254 269
                  "001002" 1993 182 199 238 271
                  "001002" 1994 166 181 240 251
                  "001002" 1995 155 181 235 246
                  "001002" 1996 173 166 221 248
                  "001002" 1997 170 180 217 253
                  "001002" 1998 180 176 205 248
                  "001002" 1999 174 172 199 252
                  "001002" 2000 176 154 203 248
                  "001002" 2001 186 178 190 237
                  "001002" 2002 194 163 190 223
                  "001002" 2003 202 174 174 215
                  "001002" 2004 192 174 163 204
                  "001002" 2005 209 182 159 201
                  "001002" 2006 197 184 168 190
                  "001002" 2007 199 195 175 190
                  "001002" 2008 202 199 173 182
                  "001002" 2009 213 199 170 167
                  "001002" 2010 212 216 168 155
                  "001002" 2011 212 207 172 161
                  "001002" 2012 220 217 191 169
                  "001002" 2013 206 221 203 165
                  "001002" 2014 184 222 207 169
                  "001002" 2015 180 222 221 170
                  "001002" 2016 166 213 224 177
                  "001002" 2017 150 214 210 188
                  "001002" 2018 155 183 205 190
                  "001002" 2019 148 166 209 195
                  "001003" 1991   .   .   .   .
                  "001003" 1992  22  16  21  22
                  "001003" 1993  21  13  18  29
                  "001003" 1994  19  18  16  26
                  "001003" 1995  22  20  19  25
                  "001003" 1996  22  15  19  19
                  "001003" 1997  21  22  15  19
                  "001003" 1998  24  22  10  19
                  "001003" 1999  22  24  14  19
                  "001003" 2000  19  25  21  19
                  "001003" 2001  23  25  20  18
                  "001003" 2002  24  23  24  16
                  "001003" 2003  29  22  22  18
                  "001003" 2004  30  17  24  21
                  "001003" 2005  25  20  24  22
                  "001003" 2006  23  26  23  19
                  "001003" 2007  21  26  22  24
                  "001003" 2008  19  27  21  22
                  "001003" 2009  18  27  17  23
                  "001003" 2010  17  21  19  22
                  "001003" 2011  14  21  25  22
                  "001003" 2012  14  20  23  22
                  "001003" 2013  15  18  25  21
                  "001003" 2014  17  18  28  17
                  "001003" 2015  19  20  24  19
                  "001003" 2016  18  17  22  26
                  "001003" 2017  18  16  20  25
                  "001003" 2018  16  15  18  27
                  "001003" 2019  13  18  18  28
                  "001004" 1991   .   .   .   .
                  "001004" 1992  65  67  72  89
                  "001004" 1993  66  70  68  81
                  "001004" 1994  61  67  67  75
                  "001004" 1995  58  69  55  81
                  "001004" 1996  61  63  63  76
                  "001004" 1997  62  61  67  72
                  "001004" 1998  68  62  69  70
                  "001004" 1999  64  66  73  72
                  "001004" 2000  65  60  84  60
                  "001004" 2001  68  64  72  73
                  "001004" 2002  63  67  72  75
                  "001004" 2003  60  71  64  82
                  end
                  I hope this will help more.

                  Comment


                  • #10
                    Thanks. That's it. So we introduce my code to your data. Here is everything at once.


                    Code:
                    clear
                    input str20 id_code float year double(p_jan_0_4 p_jan_5_9 p_jan_10_14 p_jan_15_19)
                    "001001" 1991   .   .   .   .
                    "001001" 1992  85  99 111 141
                    "001001" 1993  99  93  97 123
                    "001001" 1994 100  89 102 109
                    "001001" 1995  99  98 100 107
                    "001001" 1996 108 100  98 116
                    "001001" 1997 106  99 107 112
                    "001001" 1998  99 102 100 101
                    "001001" 1999  91 108 101  99
                    "001001" 2000  87 114 103  94
                    "001001" 2001  86 109 104  92
                    "001001" 2002  93 104  96 103
                    "001001" 2003  99 100 100  98
                    "001001" 2004 108  99 105 105
                    "001001" 2005 109  98 117 115
                    "001001" 2006 104  97 115 116
                    "001001" 2007 102  97 107 103
                    "001001" 2008  94  95 104 112
                    "001001" 2009 100 100  97 106
                    "001001" 2010  97 101  88 105
                    "001001" 2011  99 100  93 110
                    "001001" 2012 106  98  92 109
                    "001001" 2013 113 105 104 107
                    "001001" 2014 122 106 110 104
                    "001001" 2015 114 113 109  99
                    "001001" 2016 111 114 106  93
                    "001001" 2017 116 119 102 100
                    "001001" 2018 108 121 108 108
                    "001001" 2019 103 121 110 107
                    "001002" 1991   .   .   .   .
                    "001002" 1992 173 230 254 269
                    "001002" 1993 182 199 238 271
                    "001002" 1994 166 181 240 251
                    "001002" 1995 155 181 235 246
                    "001002" 1996 173 166 221 248
                    "001002" 1997 170 180 217 253
                    "001002" 1998 180 176 205 248
                    "001002" 1999 174 172 199 252
                    "001002" 2000 176 154 203 248
                    "001002" 2001 186 178 190 237
                    "001002" 2002 194 163 190 223
                    "001002" 2003 202 174 174 215
                    "001002" 2004 192 174 163 204
                    "001002" 2005 209 182 159 201
                    "001002" 2006 197 184 168 190
                    "001002" 2007 199 195 175 190
                    "001002" 2008 202 199 173 182
                    "001002" 2009 213 199 170 167
                    "001002" 2010 212 216 168 155
                    "001002" 2011 212 207 172 161
                    "001002" 2012 220 217 191 169
                    "001002" 2013 206 221 203 165
                    "001002" 2014 184 222 207 169
                    "001002" 2015 180 222 221 170
                    "001002" 2016 166 213 224 177
                    "001002" 2017 150 214 210 188
                    "001002" 2018 155 183 205 190
                    "001002" 2019 148 166 209 195
                    "001003" 1991   .   .   .   .
                    "001003" 1992  22  16  21  22
                    "001003" 1993  21  13  18  29
                    "001003" 1994  19  18  16  26
                    "001003" 1995  22  20  19  25
                    "001003" 1996  22  15  19  19
                    "001003" 1997  21  22  15  19
                    "001003" 1998  24  22  10  19
                    "001003" 1999  22  24  14  19
                    "001003" 2000  19  25  21  19
                    "001003" 2001  23  25  20  18
                    "001003" 2002  24  23  24  16
                    "001003" 2003  29  22  22  18
                    "001003" 2004  30  17  24  21
                    "001003" 2005  25  20  24  22
                    "001003" 2006  23  26  23  19
                    "001003" 2007  21  26  22  24
                    "001003" 2008  19  27  21  22
                    "001003" 2009  18  27  17  23
                    "001003" 2010  17  21  19  22
                    "001003" 2011  14  21  25  22
                    "001003" 2012  14  20  23  22
                    "001003" 2013  15  18  25  21
                    "001003" 2014  17  18  28  17
                    "001003" 2015  19  20  24  19
                    "001003" 2016  18  17  22  26
                    "001003" 2017  18  16  20  25
                    "001003" 2018  16  15  18  27
                    "001003" 2019  13  18  18  28
                    "001004" 1991   .   .   .   .
                    "001004" 1992  65  67  72  89
                    "001004" 1993  66  70  68  81
                    "001004" 1994  61  67  67  75
                    "001004" 1995  58  69  55  81
                    "001004" 1996  61  63  63  76
                    "001004" 1997  62  61  67  72
                    "001004" 1998  68  62  69  70
                    "001004" 1999  64  66  73  72
                    "001004" 2000  65  60  84  60
                    "001004" 2001  68  64  72  73
                    "001004" 2002  63  67  72  75
                    "001004" 2003  60  71  64  82
                    end
                    
                    gen num = 0
                    gen den = 0
                    
                    foreach v of var p_jan* {
                    * ignore p_jan_ so first 6 characters of the variable name ignored
                    local first = substr("`v'", 7, .)
                    local first = substr("`first'", 1, strpos("`first'", "_") - 1)
                    local median = `first' + 2
                    
                    replace num = num + `median' * `v'
                    replace den = den + `v'
                    }
                    
                    gen mean = num / den
                    
                    . su mean
                    
                        Variable |        Obs        Mean    Std. Dev.       Min        Max
                    -------------+---------------------------------------------------------
                            mean |         96    9.684446     .489373   8.593407   10.96104
                    The 4 missing values just match the missing values in the data example.. The mean ages here range from over 8 to under 11, which makes sense to me because in your example we are just looking at ages from 0 to 19. For your full dataset we expect, I don't know, perhaps nearer 32 or so for recent years and less earlier.
                    Last edited by Nick Cox; 17 Apr 2021, 05:31.

                    Comment


                    • #11
                      Originally posted by Nick Cox View Post
                      Thanks. That's it. So we introduce my code to your data. Here is everything at once.


                      Code:
                      clear
                      input str20 id_code float year double(p_jan_0_4 p_jan_5_9 p_jan_10_14 p_jan_15_19)
                      "001001" 1991 . . . .
                      "001001" 1992 85 99 111 141
                      "001001" 1993 99 93 97 123
                      "001001" 1994 100 89 102 109
                      "001001" 1995 99 98 100 107
                      "001001" 1996 108 100 98 116
                      "001001" 1997 106 99 107 112
                      "001001" 1998 99 102 100 101
                      "001001" 1999 91 108 101 99
                      "001001" 2000 87 114 103 94
                      "001001" 2001 86 109 104 92
                      "001001" 2002 93 104 96 103
                      "001001" 2003 99 100 100 98
                      "001001" 2004 108 99 105 105
                      "001001" 2005 109 98 117 115
                      "001001" 2006 104 97 115 116
                      "001001" 2007 102 97 107 103
                      "001001" 2008 94 95 104 112
                      "001001" 2009 100 100 97 106
                      "001001" 2010 97 101 88 105
                      "001001" 2011 99 100 93 110
                      "001001" 2012 106 98 92 109
                      "001001" 2013 113 105 104 107
                      "001001" 2014 122 106 110 104
                      "001001" 2015 114 113 109 99
                      "001001" 2016 111 114 106 93
                      "001001" 2017 116 119 102 100
                      "001001" 2018 108 121 108 108
                      "001001" 2019 103 121 110 107
                      "001002" 1991 . . . .
                      "001002" 1992 173 230 254 269
                      "001002" 1993 182 199 238 271
                      "001002" 1994 166 181 240 251
                      "001002" 1995 155 181 235 246
                      "001002" 1996 173 166 221 248
                      "001002" 1997 170 180 217 253
                      "001002" 1998 180 176 205 248
                      "001002" 1999 174 172 199 252
                      "001002" 2000 176 154 203 248
                      "001002" 2001 186 178 190 237
                      "001002" 2002 194 163 190 223
                      "001002" 2003 202 174 174 215
                      "001002" 2004 192 174 163 204
                      "001002" 2005 209 182 159 201
                      "001002" 2006 197 184 168 190
                      "001002" 2007 199 195 175 190
                      "001002" 2008 202 199 173 182
                      "001002" 2009 213 199 170 167
                      "001002" 2010 212 216 168 155
                      "001002" 2011 212 207 172 161
                      "001002" 2012 220 217 191 169
                      "001002" 2013 206 221 203 165
                      "001002" 2014 184 222 207 169
                      "001002" 2015 180 222 221 170
                      "001002" 2016 166 213 224 177
                      "001002" 2017 150 214 210 188
                      "001002" 2018 155 183 205 190
                      "001002" 2019 148 166 209 195
                      "001003" 1991 . . . .
                      "001003" 1992 22 16 21 22
                      "001003" 1993 21 13 18 29
                      "001003" 1994 19 18 16 26
                      "001003" 1995 22 20 19 25
                      "001003" 1996 22 15 19 19
                      "001003" 1997 21 22 15 19
                      "001003" 1998 24 22 10 19
                      "001003" 1999 22 24 14 19
                      "001003" 2000 19 25 21 19
                      "001003" 2001 23 25 20 18
                      "001003" 2002 24 23 24 16
                      "001003" 2003 29 22 22 18
                      "001003" 2004 30 17 24 21
                      "001003" 2005 25 20 24 22
                      "001003" 2006 23 26 23 19
                      "001003" 2007 21 26 22 24
                      "001003" 2008 19 27 21 22
                      "001003" 2009 18 27 17 23
                      "001003" 2010 17 21 19 22
                      "001003" 2011 14 21 25 22
                      "001003" 2012 14 20 23 22
                      "001003" 2013 15 18 25 21
                      "001003" 2014 17 18 28 17
                      "001003" 2015 19 20 24 19
                      "001003" 2016 18 17 22 26
                      "001003" 2017 18 16 20 25
                      "001003" 2018 16 15 18 27
                      "001003" 2019 13 18 18 28
                      "001004" 1991 . . . .
                      "001004" 1992 65 67 72 89
                      "001004" 1993 66 70 68 81
                      "001004" 1994 61 67 67 75
                      "001004" 1995 58 69 55 81
                      "001004" 1996 61 63 63 76
                      "001004" 1997 62 61 67 72
                      "001004" 1998 68 62 69 70
                      "001004" 1999 64 66 73 72
                      "001004" 2000 65 60 84 60
                      "001004" 2001 68 64 72 73
                      "001004" 2002 63 67 72 75
                      "001004" 2003 60 71 64 82
                      end
                      
                      gen num = 0
                      gen den = 0
                      
                      foreach v of var p_jan* {
                      * ignore p_jan_ so first 6 characters of the variable name ignored
                      local first = substr("`v'", 7, .)
                      local first = substr("`first'", 1, strpos("`first'", "_") - 1)
                      local median = `first' + 2
                      
                      replace num = num + `median' * `v'
                      replace den = den + `v'
                      }
                      
                      gen mean = num / den
                      
                      . su mean
                      
                      Variable | Obs Mean Std. Dev. Min Max
                      -------------+---------------------------------------------------------
                      mean | 96 9.684446 .489373 8.593407 10.96104
                      The 4 missing values just match the missing values in the data example.. The mean ages here range from over 8 to under 11, which makes sense to me because in your example we are just looking at ages from 0 to 19. For your full dataset we expect, I don't know, perhaps nearer 32 or so.
                      I still haven't understood completely your coding (I'm just a Stata beginner), but it worked! Thank you very much! And next time I'll pay attention to post my dataex properly.

                      Comment


                      • #12
                        It is a weighted mean you want, where the numerator is a sum of terms like


                        median age in interval * number of people

                        and the denominator is a sum of terms like

                        number of people.

                        So you initialise both numerator and denominator to zero and add a term each time. Once out of the loop, it is just division.


                        The tricky bit is that you have to get Stata to work with the name of each variable, which is an exercise in string manipulation. I used substr() and strpos()

                        Some reading that might help:

                        https://www.stata-journal.com/articl...article=dm0058 on functions

                        https://journals.sagepub.com/doi/pdf...36867X20976340 on loops (including local macros)


                        Somehow I thought this was Brazilian data, but I now guess it is Italian data, not that the solution is different. But I now guess your mean ages could be higher than i said. I Googled and it seems that numerator and denominator are the same words in Italian as in English. Probably from Latin either way.

                        Comment


                        • #13
                          Originally posted by Nick Cox View Post
                          It is a weighted mean you want, where the numerator is a sum of terms like


                          median age in interval * number of people

                          and the denominator is a sum of terms like

                          number of people.

                          So you initialise both numerator and denominator to zero and add a term each time. Once out of the loop, it is just division.


                          The tricky bit is that you have to get Stata to work with the name of each variable, which is an exercise in string manipulation. I used substr() and strpos()

                          Some reading that might help:

                          https://www.stata-journal.com/articl...article=dm0058 on functions

                          https://journals.sagepub.com/doi/pdf...36867X20976340 on loops (including local macros)


                          Somehow I thought this was Brazilian data, but I now guess it is Italian data, not that the solution is different. But I now guess your mean ages could be higher than i said. I Googled and it seems that numerator and denominator are the same words in Italian as in English. Probably from Latin either way.
                          Yes exactly, the etymology of numerator and denominator, as many other words in english, is latin. And yes of course the mean ages are higher, I just gave you a subsample because pasting all the sample was too long. Thank you again for your help anyway!

                          Comment

                          Working...
                          X