
  • Creating duration variables for Cox Model application

    Hello dear Stata users, I'm writing my master's thesis and using Stata for the econometric analysis, but the problem is that I have zero experience with the software.

    My data look like the table below. My problem is that I don't know how to create a variable that shows how long a company has held a given status, in this case Exporting Status = 1. Concretely, I need a variable that gives the number of consecutive years a company was able to export (regardless of the year it started) until there is a failure, that is, Exporting Status = 0.

    Can anyone help me? Thank you for your attention. Cheers
    Company's ID   Year   Exporting Status   Duration
    PT500000026    2008   1
    PT500000026    2009   1
    PT500000026    2010   1
    PT500000026    2011   1
    PT500000026    2012   1
    PT500000026    2013   0
    PT500000026    2014   0
    PT500000832    2008   0
    PT500000832    2009   0
    PT500000832    2010   1
    PT500000832    2011   0
    PT500000832    2012   0
    PT500000832    2013   1
    PT500000832    2014   1
    PT500001499    2008   1
    PT500001499    2009   0
    PT500001499    2010   0
    PT500001499    2011   1
    PT500001499    2012   1
    PT500001499    2013   1
    PT500001499    2014   1

  • #2
    There are some aspects of your question that you have left unspecified. It would have been helpful had you shown what you expect the duration variable to show in your example. To clarify, for PT500000026, do you want to start with Duration = 0 in 2008, or with Duration = 1? And in 2013 and beyond, do you want Duration to continue to show 5 (or 4 if you started with 0), or do you want it to have missing values for the years when the company no longer exports?

    For PT500000832, is duration to be 0 or missing in 2008 and 2009? In 2010 it presumably starts and there is 1 year. But then exporting stops in 2011 and 2012, resuming in 2013. When we get to 2013, is the 2010 year to be included in that value of duration (total duration of exports over the span of the data) or do we restart counting from 1 (or 0) again (duration of the most recent continuous spell of exporting)? Between 2010 and 2013, is duration supposed to be missing, or zero, or does the 2010 result carry forward through those years?

    Any of these can be readily coded in Stata, but without knowing what you need, I don't think it makes sense to show code at this point.

    Also, when posting example data, please use the -dataex- command. (-ssc install dataex-, read -help dataex- for simple instructions on its use.) Using -dataex- helps those who want to help you by making it possible for them to directly and quickly create an exact replica of your data example in Stata. When data are shown in a table such as yours in #1, it sometimes takes longer to transfer the data to Stata than to find a solution to the problem posed.
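
    As an illustration of the -dataex- advice above, a minimal sketch follows. The variable names (company_id, year, exporting_status) and the 21-row range are assumptions chosen to match the table in #1; substitute your own.

    ```stata
    * Install from SSC only if -dataex- is not already available in your Stata
    ssc install dataex

    * Emit a copy-and-paste data example for the first 21 observations
    * (variable names are assumed, not taken from the original dataset)
    dataex company_id year exporting_status in 1/21
    ```

    The output is an -input- block that can be pasted directly into a forum post and run by anyone who wants to reproduce the data.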



    • #3
      Thank you for your answer. Your questions raised some doubts of my own, so I had to discuss them with my professor, but I'm ready to answer them now.

      First of all, I want the duration variable to show the total duration of exports over the span of the data. So for PT500000026 the duration should be 5, for PT500000832 it should be 3, and for PT500001499 it should be 5. I've abandoned the idea of counting consecutive years of exporting, so now I want the total number of years exporting between 2008 and 2014. I think this means that the duration variable should show the same value in every year for the same Company ID. (For PT500000026 it would show 5 in every year, not the cumulative duration.)

      This raised another doubt concerning the layout of the data. For the specific case of Cox regression with time-varying covariates, what should the data layout be: wide or long? I believe my data are in long layout, but in most of the Cox regression examples I have seen, I think the data were in wide layout. Maybe I have to reshape the data for this specific survival analysis, but I'm not sure.

      Thank you so much for your time!



      • #4
        So your duration variable is actually the sum of the 0/1 values of exporting status.

        Code:
        by company_id, sort: egen duration = total(exporting_status)
        Do read the help file for -egen-: it is full of useful functions for calculating summary statistics for groups of observations within data.
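
        For comparison with the total computed above, here is a hedged sketch of the two other duration definitions raised in #2: a running (cumulative) count and the length of the current unbroken exporting run. The toy data are one firm copied from #1; the names cum_duration and spell_length are made up for illustration.

        ```stata
        * Toy data: one firm copied from the example in #1
        clear
        input str11 company_id int year byte exporting_status
        "PT500000832" 2008 0
        "PT500000832" 2009 0
        "PT500000832" 2010 1
        "PT500000832" 2011 0
        "PT500000832" 2012 0
        "PT500000832" 2013 1
        "PT500000832" 2014 1
        end

        * Running total of export years up to and including each year
        bysort company_id (year): gen cum_duration = sum(exporting_status)

        * Length of the current unbroken run of export years;
        * it resets to 0 in any non-exporting year
        by company_id (year): gen spell_length = exporting_status if _n == 1
        by company_id (year): replace spell_length = cond(exporting_status == 1, spell_length[_n-1] + 1, 0) if _n > 1

        list, sepby(company_id)
        ```

        The second -replace- relies on Stata updating observations in order within each company, so spell_length[_n-1] already holds the previous year's updated value.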

        If you are ultimately going to do a survival analysis with time varying covariates, then you will need your data to be in long layout. Be sure to read the entire manual section on the -stset- command. Getting the -stset- right is the key to survival analysis in Stata. Once you have that, the rest is quite easy.



        • #5
          Thank you so much for your help!



          • #6
            Hello again!

            I'm trying to -stset- my data, but it seems that there is some kind of error, as Stata tells me "multiple records at same instant - PROBABLE ERROR". This raised other doubts about the duration and failure/died variables.

            Looking at the example table below, for a successful -stset- should the duration variable be like "total duration" or like "cumulative duration"? And what about the failure/died variable? Should it be like "died in that year" or like "died in the period"?
            id           year  exporting status  total duration  cumulative duration  died in that year  died in the period
            PT500000026  2008  1                 5               1                    0                  1
            PT500000026  2009  1                 5               2                    0                  1
            PT500000026  2010  1                 5               3                    0                  1
            PT500000026  2011  1                 5               4                    0                  1
            PT500000026  2012  1                 5               5                    0                  1
            PT500000026  2013  0                 5               5                    1                  1
            PT500000026  2014  0                 5               5                    1                  1
            PT500000832  2008  0                 3               0 or missing?        1                  1
            PT500000832  2009  0                 3               0 or missing?        1                  1
            PT500000832  2010  1                 3               1                    0                  1
            PT500000832  2011  0                 3               1                    1                  1
            PT500000832  2012  0                 3               1                    1                  1
            PT500000832  2013  1                 3               2                    0                  1
            PT500000832  2014  1                 3               3                    0                  1
            PT500001499  2008  1                 5               1                    0                  1
            PT500001499  2009  0                 5               1                    1                  1
            PT500001499  2010  0                 5               1                    1                  1
            PT500001499  2011  1                 5               2                    0                  1
            PT500001499  2012  1                 5               3                    0                  1
            PT500001499  2013  1                 5               4                    0                  1
            PT500001499  2014  1                 5               5                    0                  1
            It's complicated because my data seem to be some kind of special survival data. I say this because many companies "die" (exporting_status=0) but later "resuscitate" by resuming their exporting activities. How can I fit this behaviour into a survival analysis?
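
            One way to make the "die and resuscitate" pattern visible in the data is to number each separate exporting spell within a company. The sketch below is only an illustration, not a recommended model: it assumes variables named id, year and exporting_status, and the names spell_start and spell are hypothetical.

            ```stata
            * Flag the first year of each exporting spell: the company exports now
            * and either this is its first record or it did not export last year
            bysort id (year): gen byte spell_start = exporting_status == 1 & (_n == 1 | exporting_status[_n-1] == 0)

            * Number the spells 1, 2, ... within company; missing in non-exporting years
            by id: gen spell = sum(spell_start) if exporting_status == 1

            list id year exporting_status spell, sepby(id)
            ```

            With spells numbered this way, a company that stops and later resumes exporting shows up as spell 1, then a gap, then spell 2, which makes it easier to decide whether to analyse only first spells or treat each spell as its own record.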

            Another doubt is whether I have to declare right away in -stset- that I have time-varying covariates, and so indicate the multiple-record ID variable.

            Thank you for your time! Best regards.



            • #7
              Hi Andre,

              Apart from Clyde's insightful remarks, I wish to remind you that your model deals with discrete time periods.

              I kindly suggest you take a look at the previous discussions on this theme.

              Best,

              Marcos



              • #8
                The correct way of setting up this data for a survival analysis depends on what kind of analysis you plan to do. If you do not have any time-varying predictors to model (other than time itself), then you should reduce your data to one record per id containing the "total duration". You should -stset- the data with your "total duration" as the analysis time, and create a failure variable to indicate whether that total duration represents the end of the run of export status = 1, or if it is censored (i.e. it's the last observation you have on that id and export status still == 1, so we don't know when that run will come to an end.)

                If, however, you will be using time-varying predictors in your analysis, then you have to keep multiple observations per person. Have a look at the -snapspan- command for getting the variables right for that.
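
                The single-record setup described in the first paragraph of this post could be sketched roughly as follows. This is one reading of that advice, not a definitive implementation: the variable names (id, year, exporting_status) are assumed, and defining failure as "not exporting in the last observed year" is an assumption about what the failure indicator should mean.

                ```stata
                * Total number of exporting years per company
                bysort id: egen total_duration = total(exporting_status)

                * Assumed failure rule: 1 if the company is not exporting in its last
                * observed year, 0 (censored) if it is still exporting then
                bysort id (year): gen byte failed = exporting_status[_N] == 0

                * Keep one record per company (the last year)
                by id: keep if _n == _N

                * Declare the survival data and inspect the result
                stset total_duration, failure(failed)
                stdescribe
                ```

                Companies that never exported have total_duration equal to 0 and would most likely be excluded by -stset-, so it is worth checking the counts that -stset- reports.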



                • #9
                  Originally posted by Clyde Schechter
                  The correct way of setting up this data for a survival analysis depends on what kind of analysis you plan to do. If you do not have any time-varying predictors to model (other than time itself), then you should reduce your data to one record per id containing the "total duration". You should -stset- the data with your "total duration" as the analysis time, and create a failure variable to indicate whether that total duration represents the end of the run of export status = 1, or if it is censored (i.e. it's the last observation you have on that id and export status still == 1, so we don't know when that run will come to an end.)

                  If, however, you will be using time-varying predictors in your analysis, then you have to keep multiple observations per person. Have a look at the -snapspan- command for getting the variables right for that.

                  Hello Mr. Schechter, sorry for bothering you again, but I'm really concerned about this project as I can't move on with it.

                  Yes, I will be using time-varying predictors. My ultimate goal is to know the influence that predictors like firms' "dimension", "age/experience", "productivity" or "capital intensity" have on export survival time.

                  So, I've been looking into the -snapspan- command to get my data right for -stset-, and I'm having a problem. I used "id" as my subject variable, "year" as my time variable (is this correct, or should I use "cumulative duration" instead?) and only "died in that year" as the event variable (is this correct, or should I declare other event variables?). The problem is that when I browsed the data after converting them to span data, I realized that all the data "jump" one year. The data for 2008 go missing and "jump" into 2009, the 2009 data "jump" to 2010, and so on. The worst problem is that the data for the last year, 2014, "jump" to nowhere and disappear. Clearly I'm not converting the data in the right way. How can I fix this?

                  Again, sorry for bothering you and thank you for all the help.



                  • #10
                    Originally posted by Marcos Almeida
                    Hi Andre,

                    Apart from Clyde's insightful remarks, I wish to remind you that your model deals with discrete time periods.

                    I kindly suggest you take a look at the previous discussions on this theme.

                    Best,

                    Marcos

                    Hello Mr. Almeida, and thank you for your comment.

                    I am aware of that: measuring time in years makes the time periods discrete. But your comment made me read some discussions on that theme, especially regarding Cox regression. Did I misunderstand what I read, or is the Cox model best suited to continuous time? Does this mean that I will not be able to use that regression with my data?

                    Thank you, and greetings from Portugal.



                    • #11
                      So, I've been looking into the -snapspan- command to get my data right for -stset-, and I'm having a problem. I used "id" as my subject variable, "year" as my time variable (is this correct, or should I use "cumulative duration" instead?) and only "died in that year" as the event variable (is this correct, or should I declare other event variables?). The problem is that when I browsed the data after converting them to span data, I realized that all the data "jump" one year. The data for 2008 go missing and "jump" into 2009, the 2009 data "jump" to 2010, and so on. The worst problem is that the data for the last year, 2014, "jump" to nowhere and disappear. Clearly I'm not converting the data in the right way. How can I fix this?

                      Describing in words what you did and what you got is rarely adequate. Please post the exact code you ran and the exact output you got from Stata. Use code delimiters to assure maximum readability.



                      • #12
                        Hello Andre,

                        Yes, you got it right.

                        Please take a look at this thread:


                        http://www.statalist.org/forums/foru...time-span-data
                        Best regards,

                        Marcos



                        • #13
                          Originally posted by Clyde Schechter
                          Describing in words what you did and what you got is rarely adequate. Please post the exact code you ran and the exact output you got from Stata. Use code delimiters to assure maximum readability.

                          This is what my original data look like:

                          [Image: original_data.jpg]

                          Then I ran -snapspan id year died, replace- to convert the data into span format, and the output data look like this:

                          [Image: output_data.jpg]

                          As you can see, all the data "jumped" one year and the data for 2014 disappeared. How can I solve this? Should I create a fictitious year 2007 with missing values for all variables before converting the data?



                          • #14
                            There is nothing to solve. This is how it is supposed to look.

                            What I would do differently, however, is use duration instead of year in the -snapspan- command, and also specify the -gen(time0)- option*. After that, run -stset- with duration as your time variable, and then run your analysis.

                            *The time0 variable isn't strictly necessary for analysis, but having it in the data makes it easier to see what's going on.
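
                            The suggestion above might look roughly like the following sketch. It assumes the variables are named id, duration and died, as in the earlier posts, and that duration takes a different value in every record of a given id.

                            ```stata
                            * Convert snapshot data to time-span data, using duration as the time
                            * variable and storing each span's start time in time0
                            snapspan id duration died, generate(time0) replace

                            * Declare multiple-record survival data; the time0() option is optional
                            * here and is included only to use the variable created above
                            stset duration, id(id) failure(died) time0(time0)
                            ```

                            If duration repeats within an id, -snapspan- stops with a "duplicate duration values" message instead of converting the data, so it helps to check for repeats before running it.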
                            Last edited by Clyde Schechter; 28 Sep 2016, 11:20.



                            • #15
                              Originally posted by Clyde Schechter
                              There is nothing to solve. This is how it is supposed to look.

                              What I would do differently, however, is use duration instead of year in the -snapspan- command, and also specify the -gen(time0)- option*. After that, run -stset- with duration as your time variable, and then run your analysis.

                              *The time0 variable isn't strictly necessary for analysis, but having it in the data makes it easier to see what's going on.

                              I did exactly what you recommended for the -snapspan- command, but I get an error.

                              Stata doesn't allow the conversion and says:

                              "14405 subjects have 70473 duplicate duration values
                              it is unclear which record to use at the specified time
                              perhaps
                              1. id is wrong and the records are not really for
                              the same subject, or
                              2. duration is wrong and one record occurs after the other"

                              I assume that the problem is in the duration variable. Is it because it has repeated values for some subjects/companies? For example, in the original data picture posted above, id==1 has the value 5 for duration three times, and id==3 has the value 1 three times. In contrast, id==2 has a different duration value in every year of the period 2008-2014.

                              If this is the problem, how can I solve it? I think this is the last issue to solve, and then hopefully my data will be set for the analysis.
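
                              The suspicion about repeated duration values can be checked directly before deciding on a fix. The sketch below only diagnoses the problem and does not resolve it. It assumes the variables id, year and duration from this thread; the name dup is hypothetical.

                              ```stata
                              * Count how many id/duration combinations occur more than once
                              duplicates report id duration

                              * Tag the offending records: dup > 0 means the same duration value
                              * appears in more than one record of that id
                              duplicates tag id duration, generate(dup)

                              * Inspect a few of them alongside the year
                              list id year duration dup if dup > 0 in 1/21, sepby(id)
                              ```

                              In the example data, the repeats come from non-exporting years in which the cumulative duration does not advance, which matches the pattern described for id==1 and id==3.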
                              Last edited by Andre Ferreira; 29 Sep 2016, 09:07.

