
  • Problems with duration / survival analysis data: transformation to time-span data?

    Hello everyone,

    I am conducting a survival analysis on unemployment microdata. I receive a monthly extract from a local unemployment database. So far I have data from only two months, November and December 2015, but I will receive more as time goes on. My task is to conduct a duration analysis of individuals' unemployment spells.

    My question is how I should correctly and most efficiently transform this kind of data into duration data on which I can use stset.

    The data is as follows [picture 1]:

    time: yearmonth when data received
    id: (date of birth)
    id2: combined id with BEGIN_REG EDUC FEMALE (in case people are born on the same day)
    FEMALE: gender
    age
    BEGIN_REG: start of registration at the unemployment office
    END_REG: end of registration (ie. leaving unemployment)
    evid_length: length of evidence at unempl. office, ie. how long is the person unemployed (in days)
    LAST_EMPL: place of last employment
    REASON: for leaving unemployment
    EDUC: education
    censor: censored data (if END_REG = .; then the person has not yet left unemployment).
    ... other indicators are not related to my question

    Each month the same person appears in the database again, and evid_length keeps increasing (as the length of his or her unemployment increases).

    My initial thought was to use snapspan:

    snapspan id2 time END_REG evid_length censor, gen(time0) replace

    But then I get the following error:

    14 subjects have 28 duplicate time values
    it is unclear which record to use at the specified time
    perhaps
    1. id2 is wrong and the records are not really for
    the same subject, or
    2. time is wrong and one record occurs after the other
    r(459);

    And I am not sure how to proceed. I tried several variations of the snapspan command, but I think this one is the most correct. I considered using evid_length as the timevar in snapspan, but I think that is incorrect and that it belongs in the varlist, since it is the variable that is recorded at each time and always changes.

    My other thought was to always eliminate the earlier observations and keep only the latest one:

    foreach id2 == `n'{
    drop if id2 == `n'& time==201512* evid_length > time==201511*evid_length
    }


    or

    drop if time==201511*id2==time==201512*id2 & time==201512* evid_length > time==201511*evid_length

    But I am not sure whether this would be the correct procedure for duration analysis, and those attempts were unsuccessful anyway. The loop didn't work and the drop command dropped only a few observations.
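
    A minimal sketch of the "keep only the latest record" idea, not a definitive solution. It assumes that the most recent snapshot of each person carries the full spell length in evid_length and that a missing END_REG means the spell is still ongoing (censored). The toy values and the indicator name exited are made up for illustration.

```stata
* Toy data (made-up values) mimicking two monthly snapshots
clear
input str1 id2 long time float evid_length float END_REG
"A" 201511 100 .
"A" 201512 131 .
"B" 201511  40 .
"B" 201512  49 20441
"C" 201512  10 .
end

* Keep only the most recent snapshot for each person
bysort id2 (time): keep if _n == _N

* Hypothetical exit indicator: 1 = left unemployment, 0 = still registered
generate byte exited = !missing(END_REG)

* Declare single-record survival data, duration measured in days
stset evid_length, failure(exited)
```

    Sorting by time within id2 and keeping the last row avoids comparing rows across months by hand, which is what the loop and the drop command above were trying to do.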


    This is my first time working with this type of data. I have read up on it, including the Stata manuals. I will be grateful for any kind of help or advice!


  • #2
    Hello Tomas,

    Welcome to the Stata Forum!

    It seems you need to do at least 2 tasks before the survival analysis:

    1. Generate and format the time variable. Please type:

    Code:
    . help datetime

    2. Perform the - stset -. For this you'll need to include the time variable (which you created and formatted) as the time-to-event variable, and select the "failure" variable (I didn't find any in the snapshot of your data set. I kindly ask you to use the CODE delimiters and present the information as suggested in FAQ #12). Please type:

    Code:
    . help stset

    Hopefully that helps.

    Best,

    Marcos



    • #3
      Tomas: you appear to have monthly survival time data, but I presume that individuals can move into and out of unemployment on any given day. Put differently, it seems to me that you have grouped (interval-censored) duration data, not continuous duration data on unemployment spells. If this is the case, I think you should be looking to use discrete time survival analysis methods (methods for interval-censored/grouped data). All of Stata's st commands are designed for use with continuous time data; ditto snapspan. See the "Survival Analysis Using Stata" webpages for some freely downloadable materials, including discussion of the distinction and examples: http://www.iser.essex.ac.uk/survival-analysis
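
      A rough sketch of what a discrete-time setup could look like, assuming one record per person with a duration in days and an exit indicator. The toy values, the variable names (exited, nmonths, month, event) and the 30-day interval width are all assumptions for illustration, not something taken from the data described above.

```stata
* Toy data (made-up values): one record per person
clear
input str1 id2 float evid_length byte exited
"A" 131 0
"B"  49 1
"C"  10 0
end

* Number of monthly intervals each person is observed in
* (30-day months are an assumption)
generate int nmonths = ceil(evid_length/30)

* Expand to person-month format: one row per person per interval at risk
expand nmonths
bysort id2: generate int month = _n

* The event is recorded only in the last interval of a completed spell
bysort id2 (month): generate byte event = (exited == 1) & (_n == _N)

* On real data, a discrete-time hazard model could then be fitted, e.g.
*   cloglog event i.month
* (not run here: the toy data are far too small)
```

      The person-period layout is what lets ordinary binary-outcome commands estimate a hazard for grouped durations.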



      • #4
        Thanks Marcos and Stephen for addressing this problem.

        I am getting the same error message for a dataset, in which the time variable 'days' is the number of days counted from a random date before the data collection date. The individual observations represent admission events to hospitals. Each patient has a unique identifier 'visit' and each admission has a unique identifier 'key'. Some patients will have multiple admissions; implying for a single value of 'visit' there may be more than one value of 'key' depending on the number of admissions this patient had. Moreover, the date from which 'days' is counted is randomly selected for each patient with the unique 'visit'.

        My assumption is that this is a snapshot dataset, as there is a single time variable 'days' that indicates the n-th day on which an admission took place. The variable 'key' changes with each new admission. So, to convert the data from snapshot to time-span form, I ran:

        snapspan visit days key

        This results in an error message:

        1291 subjects have 2617 duplicate days values
        it is unclear which record to use at the specified time
        perhaps
        1. visit is wrong and the records are not really for
        the same subject, or
        2. days is wrong and one record occurs after the other



        My thought is that subjects (patients) with more than one admission (that is, with more than one 'key', one per admission episode) should have more than one value of 'days', each counted from the single randomly assigned date before the study period and representing the time of that admission. This is how the dataset is structured, and it should not generate an error message.

        Anyway, tried the steps to drop "duplicate" values as mentioned in this Stata FAQ,
        http://www.stata.com/support/faqs/da...d-time-values/

        and got these:


        duplicates report visit days

        Duplicates in terms of visit days

        --------------------------------------
           copies | observations       surplus
        ----------+---------------------------
                1 |       913733             0
                2 |         2614          1307
                3 |            3             2
        --------------------------------------


        bysort visit days: assert _N == 1
        1308 contradictions in 915041 by-groups

        assertion is false


        duplicates tag visit days, gen(isdup)

        Duplicates in terms of visit days

        . edit if isdup

        . drop isdup


        However, even after dropping the duplicates (which is perhaps not the right thing to do), I still get the same error message on trying snapspan again:

        snapspan link days key

        1291 subjects have 2617 duplicate days values
        it is unclear which record to use at the specified time
        perhaps
        1. visit is wrong and the records are not really for
        the same subject, or
        2. days is wrong and one record occurs after the other




        Please advise on how to convert this dataset from snapshot to timespan.
        The subsequent task is to use stset, to perform time to event analyses.
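
        One thing worth checking in the steps above: "drop isdup" removes only the tag variable, not the duplicated observations, so the data passed to snapspan would be unchanged. The sketch below shows one possible way to actually remove them, assuming it is acceptable to keep a single record per patient-day. The toy values and the name days0 are made up, and whether same-day admissions should be discarded at all is a substantive decision this sketch does not settle.

```stata
* Toy data (made-up values): patient 1 has two admissions on day 10
clear
input long visit int days long key
1 10 101
1 10 102
1 25 103
2  7 201
end

* Tag duplicates so they can be inspected first
duplicates tag visit days, generate(isdup)
list if isdup

* Keep one record per visit-days pair (here: the lowest key; an arbitrary choice)
bysort visit days (key): keep if _n == 1
drop isdup

* Confirm that visit and days now uniquely identify the records
isid visit days

* snapspan no longer finds duplicate time values
snapspan visit days key, generate(days0)
```

        Running isid before snapspan gives a quick check that the duplicates are really gone.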

        Thank you very much for your help :-)
        Last edited by Parijat Joy; 13 Nov 2016, 18:23.
