Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate the sum over years

    Hello!

    I want to generate a new variable which shows the sum of months (1-12) of unemployment from the time period 2001 to 2015 for each observation/person.
    The variable should show the total duration of months a person was unemployed between 2001 and 2015, for example from 2001 to 2015 person xy was unemployed for 37 months in total.

    Here an example of my data:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int syear long pid float arbeitslos_vorjahrmonate
    2001  201  .
    2002  201  .
    2003  201  .
    2004  201  .
    2005  201  .
    2006  201  .
    2007  201  .
    2008  201  .
    2009  201  .
    2001  203  1
    2002  203  .
    2003  203  .
    2004  203  3
    2005  203  9
    2006  203  .
    2007  203  .
    2008  203  .
    2009  203  .
    2001  601  .
    2002  601  .
    2003  601  .
    2004  601  .
    2001  602  .
    2002  602 10
    2003  602 12
    2004  602  .
    2005  602  .
    2006  602  .
    2007  602  .
    2008  602  .
    2009  602  .
    2010  602  .
    2011  602  .
    2012  602  .
    2013  602  .
    2001  603  .
    2002  603  .
    2003  603  .
    2004  603  .
    2005  603  .
    2006  603  .
    2007  603  .
    2008  603  .
    2009  603  .
    2010  603  .
    2011  603  .
    2012  603  .
    2013  603  .
    2001  604  .
    2002  604  .
    2003  604  .
    2004  604  .
    2005  604  .
    2006  604  .
    2007  604  .
    2008  604  .
    2009  604  .
    2010  604  .
    2011  604  .
    2001  605  .
    2002  605  .
    2003  605  .
    2004  605  .
    2005  605  .
    2006  605  .
    2007  605  .
    2008  605  .
    2009  605  .
    2010  605  .
    2011  605  .
    2012  605  .
    2013  605  .
    2001  901  .
    2002  901  .
    2003  901  .
    2004  901  .
    2005  901  .
    2006  901  .
    2007  901  .
    2008  901  .
    2009  901  .
    2010  901  .
    2011  901  .
    2012  901  .
    2013  901  .
    2014  901  .
    2015  901  .
    2001 1202  .
    2002 1202  .
    2003 1202  .
    2004 1202  .
    2005 1202  .
    2006 1202  .
    2007 1202  .
    2001 1501  .
    2002 1501  .
    2003 1501  .
    2004 1501  .
    2005 1501  .
    2006 1501  .
    end
    pid: personal identification number
    syear: survey year, restricted to 2001 - 2015
    arbeitslos_vorjahrmonate: shows the months a person was unemployed in the previous year,
    e.g. pid 203 was unemployed in 2001 one month, in 2004 three months and in 2005 nine months.
    So the new variable should show that pid 203 was in total 13 months unemployed from 2001-2015.
    I tried with this command:

    Code:
    bysort pid (syear): gen months=arbeitslos_vorjahrmonate if  syear >2000  & syear <2016
    bysort pid: egen months1=total(months),m
    The new variable months1 should show this result only once (not for every survey year) so that I can sum the results.
    However, when I'm browsing the results, the variable shows e.g. for pid 203 the 13 months in 2001, 2002, 2003, 2004... to 2015. So in every year, there are the 13 months which is wrong because I want this result only once and then sum the result of all persons.

    Does anyone have a clue about how to handle this?

    Thank you very much.

    Regards
    Tina

  • #2
    Hi Tina,

    What you've done is fine, but if as you say, you want to end up with a single statistic for the total number of months unemployed across all people then you either want to change your data or just produce a statistic.

    If you want to change your data (with no further information, this doesn't seem like the right approach - for what it's worth), you could do something like this

    Code:
    collapse (sum) arbeitslos_vorjahrmonate, by(pid) //and then take the sum of arbeitslos_vorjahrmonate
    If you just want the statistic without actually adding it to your dataset, then something like this will work (although is very inelegant)

    Code:
    tempvar sum
    gen `sum' = sum(arbeitslos_vorjahrmonate)
    li `sum' if _n==_N
    Also, just to make your data make more sense, i'd probably turn all the missings to zeros if that's what they represent (again, you're best placed to know whether that's true)


    Comment


    • #3
      Thank you very much for your quick answer.
      I still have problems to get the right results.

      The problem is that I have to control for other variables too, so if I'm using the code:
      Code:
      collapse (sum) arbeitslos_vorjahrmonate, by (pid)
      I cannot restrict the observations anymore.

      Also with your second code
      Code:
      tempvar sum
      gen sum = sum(arbeitslos_vorjahrmonate)
      li sum if _n==_N
      I cannot get the right results. Because if I:
      Code:
      browse syear pid arbeitslos_vorjahrmonate sum if syear >2000 & syear <2016
      I get a strange sum:
      Click image for larger version

Name:	Unbenannt.PNG
Views:	1
Size:	18.1 KB
ID:	1409358


      Do you maybe know how to solve this problem? Thank you!

      Comment


      • #4
        Tina:
        does the following code do the trick?
        Code:
        bysort pid: g sum_un=sum(arbeitslos_vorjahrmonate)
        Kind regards,
        Carlo
        (Stata 18.0 SE)

        Comment


        • #5
          Hi Carlo,

          thanks for your comment but it doesn't really work.
          Click image for larger version

Name:	Unbenannt1.PNG
Views:	1
Size:	35.1 KB
ID:	1409379


          It continously sums the months as you can see in the new column "sum_un".

          Kind regards,
          Tina

          Comment


          • #6
            It is not clear what it is that you seek. Please tell us what value you would like to have for sum_un in each of the sum_un cells highlighted in blue in your picture above.

            Comment


            • #7
              for example pid 6002 was unemployed in 2001 zero months, in 2002 10 months, in 2003 12 months, in 2004 12 months, in 2005 5 months, in 2008 2 months, in 2010 1 month, in 2011 11 months, in 2012 12 months, in 2013 12 months. the sum of these months would be 77 months.
              so pid 6002 was 77 months unemployed in total. I need only this overall sum of 77 months.
              then I have to look at all the other "pid" to get the sum of all "pid" so that I can see the average months of unemployment in total from the years 2001-2015 (observation period) for every "pid".

              Comment


              • #8
                Code:
                bys pid: egen wanted = total( arbeitslos_vorjahrmonate ) if inrange(syear,2001, 2015)
                bys pid: replace wanted= 0 if _n!=_N & inrange(syear,2001, 2015)

                Comment


                • #9
                  Hi Andrew,

                  Unfortunately, I get more zeros than the results.

                  Click image for larger version

Name:	Unbenannt.PNG
Views:	1
Size:	21.2 KB
ID:	1409492


                  Kind regards,
                  Tina

                  Comment


                  • #10
                    Hey everyone!
                    I tried again with several commands and now I get similar results to the ones I should get. But I won't browse anymore because then it makes no sense.
                    So overall I will just take these results and I hope that this will be fine.

                    Thanks for your help!

                    Regards,
                    Tina

                    Comment


                    • #11
                      I read your brief in #7 as follows: You want a total for the variable arbeitslos_vorjahrmonate for each pid during the years 2001-2015, and this total should appear only once for each pid. The easiest way to do this was illustrated by Chris in #2

                      Code:
                      preserve
                      keep if inrange(syear,2001, 2015)
                      collapse (sum) arbeitslos_vorjahrmonate, by(pid)
                      l
                      restore
                      However, you seem to want this variable within the dataset,You are correct that my code in #8 will run into problems if the last observed year for a given pid is outside the time range 2001-2015, so I would modify it as follows

                      Code:
                      bys pid (syear): egen total = total( arbeitslos_vorjahrmonate ) if inrange(syear,2001, 2015)
                      bys pid: egen wanted= max(total)
                      drop total
                      bys pid (syear): replace wanted= 0 if _n!=_N
                      You should get the same numbers as with collapse

                      Code:
                      bys pid (syear): gen lastob=_n==_N
                      list pid wanted if lastob

                      Comment

                      Working...
                      X