Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • 150x150 crosstab in stata, showing timeseries movement between categories

    hello,

    I'm a bit over my head here and I hope you can help me, or at least point me in the right direction.

    I got a massive dataset (5.8 mio observations per year, over 14 years), which deals with individuals' occupation over time. I need to sum up the changes in occupation in this timeperiod, so I can see from which occupation people go from and to, after periods of unemployment, like this:

    http://i.stack.imgur.com/ujr8R.png

    however, there are 150 categories, which means that I can't get stata to show the whole crosstable without linebreaks. I need to output the crosstab in .csv or some other format for further manipulation in GNU/R afterwards. So the first question is this:

    1) how to extract the crosstab of this size to an csv/xls-file?

    A solution would be to construct data in such a way that I can import it to SPSS, which is perfectly able to output such a huge crosstab to an excel-sheet.

    Now, my data is structured like this, with (.) being missings

    http://i.stack.imgur.com/Bh3To.png

    As you can see, this is a simple dataframe in a long format. I need to construct the data in such a way so I can create a crosstab that shows the movement over years. E.g. one person might be occupied with "A" in 1996, then with "B" in 1997, then again in "A" in 1998. This would mean that for these two years, he would be be counted twice, as he would go from A to B and then from B to A. So the idea is that it just counts the number of shifts between categories, no matter the year.

    I have created an example dataset:

    https://www.dropbox.com/s/ihxo2temqp...hange.csv?dl=0

    Code:
    import delimited "path\to\testforstackexhange.csv", delimiter(";") varnames(1) clear
    I hope my question is asked in a clear and precise way. I do not take lightly on the fact that you spend your time on this, so if I can improve my question in any way, please say so and I will do my utmost to refine my question so as not to waste your time. Thank you in advance.

    I also asked the question on stackoverflow:

    http://stackoverflow.com/questions/2...een-categories
    Last edited by EmilBeBri; 02 Mar 2015, 07:38.

  • #2
    Hello Emil,

    maybe it helps to construct a variable that shows the previous occupation_status - then you could export both variables to spss
    If I did not misunderstand you, the following syntax would solve your problem:

    by id: gen occ_prev = occupation_code[_n-1]

    greetings, Klaudia

    Comment


    • #3
      Hello Klaudia,
      than you for your reply! This works perfectly.

      It also opened up the wonderfull world of using "spells" or runs to determine different states. this stata journal article is quickly proving to be an invaluable ressource, since I also need to record shifts in states. stata-journal.com/sjpdf.html?articlenum=dm0029

      Do you think I should pose a separate question about how to make a 150x150 table with 300 mio observations in stata? Or would that be considered bad form?

      Comment


      • #4
        Hello Emil,

        If I understand your original question correctly, it was, that you wanted the crosstabs-content to be exported to excel, right?
        If this is so, you don't need to produce a crosstab, because you can produce a data set which you can export to excel.
        You can try the following code with test data that has only some categories so that you can display the crosstab in the output window, just to control if the syntax works ok.
        The other day I forgot to say, that the data must be sorted by the "by"-specification, which can be obtained by using "bysort" instead of "by"

        The solution is to collapse the data (aggregate it, for spss--users), and then to "turn it around" (reshape it), such that the cells of the data matrix correspond exactly to the cells of the crosstab.
        There are no row or column totals, but if you need, you can generate them after the reshape-command.

        Code:
        bysort id: gen occ_prev = occupation_code[_n-1]
        gen h1 = occ_prev
        gen h2 = occupation_code
        tab2 occ_prev occupation_code  // this is to control if the data set shows the same
        
        collapse (count) occ_prev occupation_code, by(h2 h1)
        keep if h1 != .
        drop occ_prev
        reshape wide occupation_code, i(h1) j(h2)
        I'm not sure if this solution is the most straightforward one, but it seems to work.

        Greetings, Klaudia

        Comment


        • #5
          Just a little appendum:
          Code:
          bysort id year:....
          - I did not look at your data example.

          Comment


          • #6
            Given your interest in spells, note also tsspell (SSC).

            Comment


            • #7
              Klaudia Erhardt :Seems like a good solution, thank you! I'm not quite there yet with the data so I can test your solution on the actual data, but it seems like an elegant solution, I will report back when I'm able to use it.

              I am aware of tsspel Nick, it is invaluable in the work I'm doing right now. Maybe you could help me with this predicament I'm in here, before I can utilize Klaudias solution - if you have the time of course!


              the thing is, the table has to consist of the changes in occupation only after periods of unemployment. Since I have one variable with with employment codes and I have a binary variable used for the spells ("unempl" where "employed"=0/"not employed"=1), like this:

              Code:
              clear
              input ///
              id year occ_code unempl
              1 1999 4 0
              1 2000 4 0
              1 2001 . 1
              1 2002 . 1
              1 2003 . 1
              1 2004 5 0
              1 2005 5 0
              1 2006 5 0
              1 2007 . 1
              2 1999 . 1
              2 2000 . 1
              2 2001 . 1
              2 2002 2 0
              2 2003 2 0
              2 2004 . 1
              2 2005 2 0
              2 2006 2 0
              2 2007 . 1
              3 1999 1 0
              3 2000 1 0
              3 2001 . 1
              3 2002 1 0
              3 2003 . 1
              3 2004 2 0
              3 2005 2 0
              3 2006 3 0
              3 2007 . 1
              4 1999 1 0
              4 2000 2 0
              4 2001 3 0
              4 2002 4 0
              4 2003 . 1
              4 2004 4 0
              4 2005 3 0
              4 2006 . 1
              4 2007 3 0
              end
              With Klaudias first response, I understood the idea of using bysort and [_n] to lag the employment code one year behind:

              Code:
              sort id year
              tsset id year
              by id: gen occ_prev = occ_code[_n-1]
              order id year occ_prev occ_code unempl
              Then, with the paneldata (which consists of everyone who ever had a job in Denmark since 1980), I could utilize Klaudias collapse solution. However, before this step, I need some way to count changes in employment *but only after* a state of unemployment. and this is where the spells come in, and tsspell is good starting point.

              so for instance, the panel/person with id = 1, should only be counted once in the final crosstable, because he only shifts from unempl = 0 to unempl = 1 and then back to unemployed = 0 one time during the period 1999-2007. Only the last job he had before he was unemployed and the first job after the spell of unemployment should be part of the final crosstable. with tsspel, I'm getting a little closer to this, but it is still causing me some headaches how I should handle this.

              Code:
              tsspell unempl
              I think the solution is to create some helper-variables based on conditions that decides wether an employmentcode is right after of right before an spell of unemployment, so I can delete the observations that doesn't come right before or right after a year (or several years) of a spell with unempl==1, and then use Klaudias reply to output it. The idea is to feed this table to a package in GNU/R that is based on the method of social network analysis, and treat the table as an adjacency matrix, in the terminology of SNA. in this way, I'll make it into a graph that shows social mobility in occupation after periods of unemployment.

              This is causing me some headaches, unexperienced as I am in the art of paneldata, and any help would be appreciated.

              Comment


              • #8
                Hi Emil,

                again, no guarantee that this is the most straightforward solution.
                I've generated an auxilliary variable "temp" that counts the no of unemployment spells and used this for the specification of how many records to go backward in the [_n-x] parts
                The rest is practically the same, except for the recode-command at the end.

                Me doing all your work you've got to pay me a beer if you'll ever come to Berlin

                greetings, Klaudia

                Code:
                sort id year
                gen temp=0
                replace temp = temp[_n-1]+1 if unempl==1
                
                by id: gen occ_prev = occ_code[_n-temp[_n-1]-1] if occ_code[_n-1]==. & occ_code != .   // this is the trick!
                drop temp
                gen h1 = occ_prev
                gen h2 = occ_code
                tab2 occ_prev occ_code  // this is to control if the data set shows the same
                
                collapse (count) occ_prev occ_code, by(h2 h1)
                keep if h1 != .
                drop occ_prev
                reshape wide occ_code, i(h1) j(h2)
                recode _all (. = 0)

                Comment


                • #9
                  Here's another way of doing this

                  Code:
                  clear
                  input id year occ_code unempl
                  1 1999 4 0
                  1 2000 4 0
                  1 2001 . 1
                  1 2002 . 1
                  1 2003 . 1
                  1 2004 5 0
                  1 2005 5 0
                  1 2006 5 0
                  1 2007 . 1
                  2 1999 . 1
                  2 2000 . 1
                  2 2001 . 1
                  2 2002 2 0
                  2 2003 2 0
                  2 2004 . 1
                  2 2005 2 0
                  2 2006 2 0
                  2 2007 . 1
                  3 1999 1 0
                  3 2000 1 0
                  3 2001 . 1
                  3 2002 1 0
                  3 2003 . 1
                  3 2004 2 0
                  3 2005 2 0
                  3 2006 3 0
                  3 2007 . 1
                  4 1999 1 0
                  4 2000 2 0
                  4 2001 3 0
                  4 2002 4 0
                  4 2003 . 1
                  4 2004 4 0
                  4 2005 3 0
                  4 2006 . 1
                  4 2007 3 0
                  5 1999 1 0
                  5 2000 2 0
                  5 2001 3 0
                  5 2002 4 0
                  5 2003 . 1
                  5 2004 4 0
                  5 2005 3 0
                  5 2006 . 1
                  5 2007 3 0
                  6 2005 15 0
                  6 2006 . 1
                  6 2007 16 0
                  end
                  
                  * carry over the last employment code
                  isid id year, sort
                  by id: replace occ_code = occ_code[_n-1] if mi(occ_code)
                  
                  * reduce to cases where unemployed goes to employed
                  by id: keep if unempl == 1 & unempl[_n+1] == 0 | unempl == 0  & unempl[_n-1] == 1
                  
                  * identify each transition
                  by id: gen transition = sum(unempl)
                  
                  * reshape to wide
                  bysort id transition (year): gen new_occ = occ_code[2]
                  by id transition: keep if _n == 1
                  list id year occ_code new_occ, sepby(id) noobs
                  
                  * Collapse by occ_code new_occ and put the count in a variable called occ_
                  collapse (count) occ_=year, by(occ_code new_occ)
                  
                  * Create a cross-tab
                  list, sepby(occ_code) noobs
                  reshape wide occ_, i(occ_code) j(new_occ)
                  mvencode _all, mv(0)
                  list

                  Comment


                  • #10
                    Hello Robert,

                    interesting solution. What puzzles me: using Emils Data from #7 and your syntax, the resulting table is:

                    occ_code occ_1 occ_2 occ_3 occ_4 occ_5
                    1 1 1 0 0 0
                    2 0 1 0 0 0
                    3 0 0 1 0 0
                    4 0 0 0 1 1
                    0 0 1 0 0 0

                    whereas using my syntax, the resulting table is:

                    h1 occ_code1 occ_code2 occ_code3 occ_code4 occ_code5
                    1 1 1 0 0 0
                    2 0 1 0 0 0
                    3 0 0 1 0 0
                    4 0 0 0 1 1

                    Your resulting data shows a transition from occ_code 0 to occ_2. But there is no occ_code 0.
                    What is the reason for it?

                    Also, for a principle, I always try to find solutions without touching original variables by replace commands or such.

                    Could you please drop a short word on the condition
                    if mi(occ_code)
                    What does it say? I never had to do with multiple imputation.

                    Thank you and greetings,
                    Klaudia

                    Comment


                    • #11
                      The occ_code at zero is due to id 2 that starts with an unemployment spell and therefore no prior occupation code. Since all missing values after the final reshape are bulk-replaced with zero, the missing occ_code got changed to zero as well. Presumably these should be dropped but that's really within the province of the one doing the research. This of course assumes that there are no occ_code at zero in the first place.

                      Since you are destroying the data anyway, there's no real harm in carrying over the value in place. You can certainly clone the original variable and do it on the clone if you want to check every step.

                      See help missing() for a description of the function. It's a safer way to identify missing values.

                      Comment


                      • #12
                        Yes, it is a question of personal taste if you use clones or not.
                        Sorry to have asked a stupid question - I typed help mi and got a description relating to multiple imputation (I don't like abbreviated stata commands either )

                        Comment


                        • #13
                          Thank you so much, both of you, it works, and although there are some things that needs to be smoothed out in the real-world data I use, the basic structure you provided has helped me immensely.


                          I'm writing my thesis and the complexity of the datamanipulation needed was really causing me distress, so thanks again. The internet is a wonderfull place full of generous people indeed. I will definitely buy you a beer in Berlin Klaudia! And you too Robert, if I ever get to Washington.

                          The logic of Klaudias solution appealed more to me, for some reason. Byt it is also very useful to se different solutions to the same problem, although I wil only utilize the one. But damn, I had to strain my head in order to understand the idea of using hardbrackets within hardbrackets!

                          Your solution Klaudia can be improved in one way though:in the first step you provided:

                          Code:
                          gen temp=0
                          replace temp = temp[_n-1]+1 if unempl==1
                          however, if the the last observation in a panel has unempl==1 and the first observation in the following panel also has unempl==1, this piece code would "keep counting" because it couldn't tell the difference . This can be seen in the example data when going from ID==1 to ID==2. Instead, I used Nick Cox's tsspell to generate the spell sequence like this:

                          Code:
                          tsspell unempl if unempl==1
                          rename _seq temp
                          drop _end _spell
                          and then proceeded as Klaudia suggested.

                          (I have copy-pasted your suggestion on stackoverflow Klaudia, so that others may use it there as well. Of course with credit and a link to this page. If you would like to write the answer there yourself with your own profile if you have one, so I can give the "best answer" flag there, if you are into that sort of thing, let me know)
                          Last edited by EmilBeBri; 12 Mar 2015, 02:50.

                          Comment


                          • #14
                            Hi Emil,

                            you are completely right about the bug in my syntax. (I just put it there to see if you only copy and paste or if you think it through )
                            It did not matter in the end because occ_prev was generated "by id:". But it is bad programming style anyway.

                            Instead of using tsspell, the bug could also be fixed by by-ing the generation of temp:
                            Code:
                            sort id year
                            by id: gen temp=0
                            by id: replace temp = temp[_n-1]+1 if unempl==1
                            This way the first observations of a case will have missing values when they are unemployment spells (see first observations of ID 2 in your data). It would be ok for your needs, but if you want to do it really sophisticatedly, you would type instead:

                            Code:
                            sort id year
                            by id: gen temp=cond(_n==1 & unempl==1, 1, 0)     /* temp is generated to be 1 if it is the first observation of a case and this is an unemployment spell. Otherwise temp is 0 */
                            by id: replace temp = temp[_n-1]+1 if unempl==1 & _n > 1   /* almost the same as above, but the first observation (already having value 1) stays untouched  */
                            If you don't know the "cond"-function it may cause another bit of strain for your head, but it is worth while, I find cond() very useful.

                            I feel pleased that you find my solution worth to be communicated, and appreciate if you credit me, and would like you to update the solution with what I wrote here. I am not posting anywhere else but stata list (still new), because I am a bit tight of time.

                            Wish you good success for your thesis,
                            greetings,
                            Klaudia

                            Comment


                            • #15
                              now this was nonsense -
                              Code:
                              by id: gen temp =0
                              !!! (I was a bit confused for having to do many things at the same time, sorry).
                              But the rest is ok. The first solution of #14 reads:

                              Code:
                              sort id year
                              gen temp=0
                              by id: replace temp = temp[_n-1]+1 if unempl==1
                              greetings
                              Klaudia

                              Comment

                              Working...
                              X