Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Set up data for Cox survival analysis

    Dear Clyde Schechter Nick Cox

    I am trying to set up correctly my dataset to perform a survival analysis of bank-borrower relationship. Therefore, my failure event is getting a loan after the treatment starts (2020Q3). My dataset is at the loan level, but I want to perform this analysis at both bank-borrower and bank-borrower-quarter level. This is my code:
    Code:
    bys uniqueborrowerid bankid (numeric_quarter): gen failure = (cond(nr_loans,numeric_quarter>=242,.))
    tempvar minq maxq
    bys uniqueborrowerid bankid: egen `minq' = min(numeric_quarter) // when the relationship started
    bys uniqueborrowerid bankid: gen `maxq' = numeric_quarter if failure==1 // identifies when failure happens
    
    gen time = `maxq' - `minq' 
    egen relationid = group(uniqueborrowerID bankid)
    
    collapse (max) failure treat time (mean) nloans_bfaceli, by(relationid numeric_quarter)
    
    stset time, id(relationid) failure(failure==1)
    stcox i.treat, vce(cluster relationid)
    stcurve, survival at1(treat=0) at2(treat=1) title("Cox survival of bank-borrower relationship by treatment") xtitle("Analysis time (in quarters)") legend(pos(8) col(2) ring(0)) xlabel(#8)
    However, when running cox regression and plot the results for both groups (T and C), both lines converges to zero, which is clearly an error since relationships don't end, but my dataset does at 2023Q4. Then, my questions are:

    1. How can I deal with this problem (right censorship)? I've read that I can replace last quarter with 0 or missing to fix this, but I don't understand why.
    2. When generating 'time' variable, it seems that it is non missing only for those with failure == 1. Is ok to replace those missings with 0?


  • #2
    Your description of the data set is incomplete, so what I am about to say is based on some guesses about what the variables you have are and what they mean. As such, I may be off base, so take this for what it's worth.

    It appears you have, for each bank and borrower, a series of quarterly observations, the first of which constitutes the start of the banking relationship. It isn't at all clear what the rest of them represent.

    As for your observation that the time variable is non-missing only if failure == 1, that is because you defined it that way:
    Code:
    bys uniqueborrowerid bankid: gen `maxq' = numeric_quarter if failure==1 // identifies when failure happens
    gen time = `maxq' - `minq'
    The first line creates a value of `maxq' only if failure == 1. If failure != 1, `maxq' is missing and therefore the second command necessarily has time missing. Note that when you then -stset time-, all those observations with failure != 1, because they have a missing value for time, are excluded from the subsequent survival analysis calculations. (And no, replacing those missing values with 0 will not help you: observations with time <= 0 are also excluded.) This, in turn, explains why all your survival curves converge to zero: the only included observations all have failure == 1, which means that every borrower that is included meets your definition, as specified in the -failure()- option of -stset-, of having gotten a loan. So, of course the survival function, which is defined as the estimated proportion who have not yet experienced the events, ultimately drops to zero, because borrowers who never got a loan have been excluded from the analysis.

    What I can't do with the information you have provided is tell you how to fix this. Your definition of the failure variable strikes me as strange, and I wonder whether it means what you think it means. And you don't explain what your variable nr_loans means: I can think of at least two reasonable possibilities with different implications for how to use it in this problem. It would also be helpful to have example data to work with. Use the -dataex- command to show that in a response post if you want me to propose a fix for your problem.

    Comment


    • #3
      Thanks for your response Clyde!

      Here you go a piece of my data:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input strL uniqueborrowerid str11 bankid float(numeric_quarter nr_loans nloans_bfaceli treat)
      "       13EE1313JJDZEKDJ13EK"        "ZKJDJ"    250 1  0 0
      "       13EE1313JJDZEKDJ13EK"        "ZKJDJ"    253 1  0 0
      "       26KJJJJDJ26GJ13EZ13E"        "ZKJDJ"    253 1  0 0
      "       66JL13JJDKEEZJ13"            "ZKJLJ"    253 1  0 1
      "       66Z26KJJ13E13IEZEDE"         "ZKJDJ"    253 1  0 0
      "       66Z26KJJDIDJXTJDJ"           "ZKJDJ"    253 1  0 0
      "       E66K13JJ13EKGG13DT"          "ZKJDJ"    253 1  0 0
      "       Z66ZJJJDD26D26T13EJ"         "ZKJDJ"    253 1  0 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  228 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  228 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  228 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  231 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  232 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  233 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  233 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  234 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  234 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  234 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  234 1 12 0
      "131313131313131326D13IJ13E"   "ZKJ13EE"  235 1 12 0
      "13131326JJ13E13EK136613E13"   "ZKJJE"    252 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  249 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  249 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  249 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  249 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  250 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  251 1  0 1
      "13131326JJD13JEZDI"           "ZKJ13EJ"  252 1  0 1
      "13131326JJD13JKJ13EF"         "ZKJDE"    238 1  1 0
      "13131326JJD13JZJDE"           "ZKJ13EJ"  228 1  2 0
      "13131326JJD13JZJDE"           "ZKJ13EJ"  239 1  2 0
      "13131326JJDZ13JZJDE"          "ZKJ13EJ"  239 1  1 0
      "13131326JJJDL13E13E13EJ"      "ZKJ13EJ"  231 1  1 0
      "13131326JJJDL13E13E13EZ13E"   "ZKJ13EJ"  231 1  2 0
      "13131326JJJDL13E13E13EZ13E"   "ZKJ13EJ"  231 1  2 0
      "13131326JJJDL13EEJE"          "ZKJ13EJ"  250 1  0 1
      "13131366JJDGE66XDL"           "ZKJ13EJ"  247 1  0 0
      "13131366JJDGEK26D13E66"       "ZKJJE"    251 1  0 1
      "13131366JJDGEKZ13E13EL"       "ZKJ13EJ"  251 1  0 1
      "13131366JJDGEKZDG"            "ZKJJE"    255 1  0 1
      "13131366JJDGEZ2613E26"        "ZKJJZ"    238 1  1 0
      "13131366JJJDJ13EZJ13EF"       "ZKJD13"   248 1  0 0
      "13131366JJJDJT2613E26"        "ZKJ13EJ"  249 1  0 1
      "13131366JJJDJT2613E26"        "ZKJ13EJ"  251 1  0 1
      "13131366JJJDJT2613E26"        "ZKJ13EJ"  253 1  0 1
      "13131366JJJDJT26D13EDJ"       "ZKJDJ"    252 1  0 0
      "13131366JJJDJTG13EJ"          "ZKJ13EJ"  251 1  0 1
      "131313E13JJ13E2626DDXD13EZ"   "ZKJDJ"    250 1  0 0
      "131313E13JJ13E2626DJZJJ"      "ZKJJE"    253 1  0 1
      "131313E13JJ13E26GX13JDJ"      "ZKJJK"    252 1  0 0
      "131313E13JJDEG13B13E13E"      "ZKJJE"    237 1  1 0
      "131313E13JJDEG13EDJ13E13E13E" "ZKJLJ"    249 1  0 1
      "131313E13JJDEG13TJ13EZ"       "ZKJJE"    234 1  1 0
      "131313E13JJDEG2613EJE"        "ZKJD13"   243 1  0 0
      "131313E13JJDEG2613EJE"        "ZKJD13"   251 1  0 0
      "131313E13JJDEG2613EJZ"        "ZKJ13E26" 250 1  0 1
      "131313E13JJDEG2613EJZ"        "ZKJ13E26" 253 1  0 1
      "131313E13JJDEG26D13E13E13EK"  "ZKJJE"    232 1  1 0
      "131313E13JJDEG26D13E13E13EK"  "ZKJJE"    248 1  1 1
      "131313E13JJDEG26D13E13E13EK"  "ZKJJE"    255 1  1 1
      "131313E13JJDEG26KJJ"          "ZKJJE"    230 1  1 0
      "131313E13JJDEGDJJDE"          "ZKJJE"    240 1  1 0
      "131313E13JJDEGDJJDE"          "ZKJJE"    247 1  1 1
      "131313E13JJDEGDJKD26D"        "ZKJ13E26" 243 1  0 0
      "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  229 1  5 0
      "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  233 1  5 0
      "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  235 1  5 0
      "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  238 1  5 0
      "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  240 1  5 0
      "131313E13JJDEGF26DD13"        "ZKJJE"    234 1  1 0
      "131313E13JJDEGG66J13EZ"       "ZKJJL"    246 1  0 0
      "131313E13JJDEGG66J13EZ"       "ZKJJL"    250 1  0 0
      "131313E13JJDEZDJ13E13ED"      "ZKJJE"    230 1  1 0
      "131313E13JJJJ26EZDDD"         "ZKJ13EJ"  234 1  3 0
      "131313E13JJJJ26EZDDD"         "ZKJ13EJ"  239 1  3 0
      "131313E13JJJJ26EZDDD"         "ZKJ13EJ"  239 1  3 0
      end
      format %tq numeric_quarter
      nr_loans would be always 1 since it is an indicator if a loan exists (this dataset is at the loan level) and it is a starting point to collapse at any other level, while nloans_bfaceli is the number of loans taken by the same borrower prior 2020Q3 (when the first bank joined the program). I understand what is doing the code line you pointed out. My question is more about what the "correct" way to create the time variable is and how to set up this survival analysis. The failure variable I am trying to capture is the next loan they get from the bank after the first bank joined (i.e. if this borrower has taken any loan after 2020Q3). Therefore, the failure would be =1 for those people that took loans after the mentioned quarter, and =0 otherwise (am I defining this well?). How can I fix the set up and improve the survival curves?

      Thanks in advance!

      Comment


      • #4
        If I am understanding your response in #3 correctly, this data set simply does not contain the information needed for your problem. The event you want to track is issuance of a new loan in 2020Q3 or later. But there is no variable or combination of variables that enables you to identify the observations where that occurs. The variable nr_loans is useless altogether: it is actually a constant in this data set, so it is uninformative about any event. The variable nloans_bfaceli tells you how many loans the borrower had before 2020Q3, but that also gives no information about new loans issued on or after that date. And the time variable you need is the quarter in which the first post-2023Q3 loan was issued, so you can't calculate that until you have a variable that tells you in each observation whether such a loan was originated.

        What you need is some variable that is 1 in each observation where a new loan was originated by the observation's bank to its borrower in its quarter, and 0 otherwise. You don't have that, and you can't create it from the information available in the variables you show. Perhaps you have other data that contains the needed information and you can bring that information into play. Once you had that, you could proceed. In pseudocode it would look something like this:
        Code:
        //  MARK OBSERVATIONS HAVING A TARGETED LOAN ISSUED
        gen failure = (numeric_quarter >= 242) ///
            & (0/1 variable indicating whether a new loan originated for this bank and borrower this quarter)
        
        //  CALCULATE TIME VARIABLES
        by uniqueborrowerid bankid (numeric_quarter), sort: egen event_time = min(cond(failure, numeric_quarter, .)) // FAILED
        by uniqueborrowerid bankid (numeric_quarter): replace event_time = numeric_quarter[_N] ///
            if missing(event_time) // CENSORED AT END OF DATA IF NO LOAN ISSUED IN OR AFTER 2020Q3
        by uniqueborrowerid bankid (numeric_quarter): gen origin = max(numeric_quarter[1], 242) // LATER OF 2020Q3 & FIRST OBS
        
        //   COLLAPSE THE DATA SET
        collapse (first) time first_time (max) failure, by(uniqueborrowerid bankid)
        
        //   PREPARE FOR SURVIVAL ANALYSIS
        stset event_time, origin(time origin) failure(failure)
        Replace the italicized portion by the actual variable name or expression that evaluates to 1 when a new post 2020Q3 loan is issued and 0 otherwise.


        I also note that your example data set has a large number of duplicated observations. This may mean the data set is incorrect since there is only rarely any reason to have exact duplicate observations in a data set. (An innocent explanation is that your actual data set has additional variables not shown in #3 and those variables distinguish the duplicate observations. If this is the case, you need eliminate those variables and drop the resulting duplicate observations before you proceed with this.) This kind of incorrect data set typically results from errors in the data management that created it. While it is easy enough to "fix" the data set by dropping the duplicate observations, it's important to remember that where one error has been made in the creation of a data set, there is a high probability that others have as well, so the data set may have additional problems that have not (yet) surfaced. The better approach is to review all of the data management used to pull this data set together, starting from its original raw data sources and track down how the surplus duplicate observations crept into it, and then fix that problem, as well as any others you stumble upon in that process.

        Comment


        • #5
          I don't know why I was pinged here. Despite the wonderful name, I don't use Cox models and my only contribution to any such discussion is to request capitalisation of Cox if it is omitted. Ditto on Box-Cox transformation.

          Some years ago I asked the main man Sir David Cox [who was responsible for all this] a question using email. After he replied fully and courteously, as was characteristic, I slipped in a supplementary question about whether we were possibly related, which seemed unlikely given what I knew of my family tree. His family went back a long way in the Birmingham area and mine in a different part of Britain so we ended concluding that any relation was at best very distant. It's a common family name, any way.

          Comment


          • #6
            You are right, the data example is not the full set of variables I have, however those "duplicates" are not a problem. These loans are given in a daily frequency, therefore the numeric_quarter is just representing a way to aggregate them in a broader classification (quarters). I am not totally sure if I am understanding your questions, but flagging the loans after 2020Q3 might be able to do just using the numeric_quarter, no? Since all of the rows are loans, then those loans granted after that quarter are what I am defining as failure (i.e =1).
            In addition, I am not totally sure why you are generating an origin variable and why you are not declaring an ID variable when setting up the analysis?

            I am attaching a different piece of my dataset including the variable that indicates the exact day when they got the loan. Additional variables are the quarter when the bank joined the program (missing for those which never joined), and start_date that is the date when the loan was granted.
            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input strL uniqueborrowerid str11 bankid float(numeric_quarter numeric_quarter_accreditation) int start_date float treat
            "       13EE1313JJDZEKDJ13EK"        "ZKJDJ"    250   . 22879 0
            "       13EE1313JJDZEKDJ13EK"        "ZKJDJ"    253   . 23125 0
            "       26KJJJJDJ26GJ13EZ13E"        "ZKJDJ"    253   . 23114 0
            "       66JL13JJDKEEZJ13"            "ZKJLJ"    253 249 23142 1
            "       66Z26KJJ13E13IEZEDE"         "ZKJDJ"    253   . 23111 0
            "       66Z26KJJDIDJXTJDJ"           "ZKJDJ"    253   . 23111 0
            "       E66K13JJ13EKGG13DT"          "ZKJDJ"    253   . 23106 0
            "       Z66ZJJJDD26D26T13EJ"         "ZKJDJ"    253   . 23152 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  228   . 20878 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  228   . 20859 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  228   . 20864 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  231   . 21146 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  232   . 21243 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  233   . 21288 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  233   . 21340 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  234   . 21445 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  234   . 21427 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  234   . 21455 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  234   . 21448 0
            "131313131313131326D13IJ13E"   "ZKJ13EE"  235   . 21545 0
            "13131326JJ13E13EK136613E13"   "ZKJJE"    252 244 23044 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  249 249 22804 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  249 249 22824 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  249 249 22818 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  249 249 22811 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22838 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22865 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22908 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22879 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22915 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22901 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22872 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22851 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22894 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22858 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22887 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22844 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  250 249 22831 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22939 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22953 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 23003 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22974 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22988 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22981 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22995 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22960 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 23009 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22923 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22930 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22946 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  251 249 22967 1
            "13131326JJD13JEZDI"           "ZKJ13EJ"  252 249 23068 1
            "13131326JJD13JKJ13EF"         "ZKJDE"    238   . 21739 0
            "13131326JJD13JZJDE"           "ZKJ13EJ"  228 249 20825 0
            "13131326JJD13JZJDE"           "ZKJ13EJ"  239 249 21903 0
            "13131326JJDZ13JZJDE"          "ZKJ13EJ"  239 249 21903 0
            "13131326JJJDL13E13E13EJ"      "ZKJ13EJ"  231 249 21173 0
            "13131326JJJDL13E13E13EZ13E"   "ZKJ13EJ"  231 249 21173 0
            "13131326JJJDL13E13E13EZ13E"   "ZKJ13EJ"  231 249 21173 0
            "13131326JJJDL13EEJE"          "ZKJ13EJ"  250 249 22856 1
            "13131366JJDGE66XDL"           "ZKJ13EJ"  247 249 22634 0
            "13131366JJDGEK26D13E66"       "ZKJJE"    251 244 22960 1
            "13131366JJDGEKZ13E13EL"       "ZKJ13EJ"  251 249 22931 1
            "13131366JJDGEKZDG"            "ZKJJE"    255 244 23288 1
            "13131366JJDGEZ2613E26"        "ZKJJZ"    238 242 21787 0
            "13131366JJJDJ13EZJ13EF"       "ZKJD13"   248   . 22665 0
            "13131366JJJDJT2613E26"        "ZKJ13EJ"  249 249 22789 1
            "13131366JJJDJT2613E26"        "ZKJ13EJ"  251 249 22949 1
            "13131366JJJDJT2613E26"        "ZKJ13EJ"  253 249 23191 1
            "13131366JJJDJT26D13EDJ"       "ZKJDJ"    252   . 23055 0
            "13131366JJJDJTG13EJ"          "ZKJ13EJ"  251 249 22986 1
            "131313E13JJ13E2626DDXD13EZ"   "ZKJDJ"    250   . 22827 0
            "131313E13JJ13E2626DJZJJ"      "ZKJJE"    253 244 23140 1
            "131313E13JJ13E26GX13JDJ"      "ZKJJK"    252   . 23070 0
            "131313E13JJDEG13B13E13E"      "ZKJJE"    237 244 21728 0
            "131313E13JJDEG13EDJ13E13E13E" "ZKJLJ"    249 249 22762 1
            "131313E13JJDEG13TJ13EZ"       "ZKJJE"    234 244 21440 0
            "131313E13JJDEG2613EJE"        "ZKJD13"   243   . 22249 0
            "131313E13JJDEG2613EJE"        "ZKJD13"   251   . 22949 0
            "131313E13JJDEG2613EJZ"        "ZKJ13E26" 250 250 22909 1
            "131313E13JJDEG2613EJZ"        "ZKJ13E26" 253 250 23183 1
            "131313E13JJDEG26D13E13E13EK"  "ZKJJE"    232 244 21236 0
            "131313E13JJDEG26D13E13E13EK"  "ZKJJE"    248 244 22697 1
            "131313E13JJDEG26D13E13E13EK"  "ZKJJE"    255 244 23359 1
            "131313E13JJDEG26KJJ"          "ZKJJE"    230 244 21018 0
            "131313E13JJDEGDJJDE"          "ZKJJE"    240 244 22004 0
            "131313E13JJDEGDJJDE"          "ZKJJE"    247 244 22614 1
            "131313E13JJDEGDJKD26D"        "ZKJ13E26" 243 250 22249 0
            "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  229 249 20992 0
            "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  233 249 21301 0
            "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  235 249 21458 0
            "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  238 249 21782 0
            "131313E13JJDEGDJKD26D"        "ZKJ13EJ"  240 249 21974 0
            "131313E13JJDEGF26DD13"        "ZKJJE"    234 244 21425 0
            "131313E13JJDEGG66J13EZ"       "ZKJJL"    246   . 22516 0
            "131313E13JJDEGG66J13EZ"       "ZKJJL"    250   . 22881 0
            "131313E13JJDEZDJ13E13ED"      "ZKJJE"    230 244 21050 0
            "131313E13JJJJ26EZDDD"         "ZKJ13EJ"  234 249 21392 0
            "131313E13JJJJ26EZDDD"         "ZKJ13EJ"  239 249 21899 0
            "131313E13JJJJ26EZDDD"         "ZKJ13EJ"  239 249 21899 0
            end
            format %tq numeric_quarter
            format %tq numeric_quarter_accreditation
            format %td start_date

            Comment


            • #7
              First the simple question. I'm not using -stset-'s -id()- option because after -collapse-, there is only one observation per bank-borrower pair.

              Now, flagging the loans that occur during or after 2020q3 will identify the targeted loans that count as failure events. But that is not completely sufficient for a correct survival analysis. Suppose there is some bank-borrower pair where the earliest loan in the data set is in 2020q4. There is no information about this pair in or before 2020q3. Then there are two possibilities, and the difference between them is crucial for survival analysis.

              One possibility is that the borrower was a client of the bank during or before 2020q3, in which case the non-issuance of a loan in 2020q3 is an event that is relevant to the survival analysis. It's a failure that could have happened but didn't. The fact that there is no record for that quarter is a problem: this is why survival analysis should not be done from a data set that includes only records of failures. The non-failures are relevant and need to be in the data set.

              The other possibility is that the borrower was not a client of the bank at that time, so there was no possibility of a loan being issued then in any case. In other words, there was no possibility of a failure occuring in this record: it is neither a failure or a non-failure; it is not in universe for the analysis. This time the absence of a record for 2020q3 in the data set is not a problem (although coding the analysis would be simpler if it were there.)

              Because your data set was constructed to include records only for loans, it does not distinguish these two possibilities. Consequently it is not possible to do a correct survival analysis.

              As for the use of the -origin()- option in my proposed code, it is an attempt to deal with the issue just set out in the previous four paragraphs. In survival analysis, a key element is time at risk for the failure event. If you are not a client of the bank, you are not at risk for loan issuance. If it is before 2020q3, nobody is at risk for the failure event. If you are a client and it is 2020q3 or later, but the time is still before the end of the data collection, you are at risk for the event until it actually happens. If the time is later than the end of data collection, then you are no longer at risk for having the event recorded, although you are at risk for having the event itself (which would be a censored observation). Survival analysis is based on observing the failures that occur during time at risk and the failures that could have occurred (because the time was in an at-risk period) but did not. The -stset- command allows you to define the at-risk period by using the -origin()- option to designate its start, and the time variable to mark its end.

              Comment

              Working...
              X