Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Combining multiple observation, each with data that I want to keep

    Hello -

    I have a dataset with records of clients that have performed a specific action. This action can be performed multiple times and each time it is performed, this appears as a new record. I have a unique clientid variable and there is a variable, sequence, that tells me the position/order of that specific record (among existing records for that clientid).

    I have cleaned up my dataset to create new variables for date1, date2, date3, and date4. The first record (i.e., the record with sequence =1) will only have a date1; the record with sequence = 2 will only have a date 2, and so on. I dropped all records with a sequence > 4.

    Now, I want to combine observations (using clientid) so that my dataset becomes a dataset of clients and not records. For each existing variable (var1, var2, var3, var4..... var18), I would like to keep data from the most recent observation. The most recent observation would have the most recent createdate. (The last 2 sentences are nice to have, but not absolutely necessary. I would like one record that has all the variables in one row.... most importantly date1, date2, date3, date4).

    I am fairly new to stata and I know this involves some sort of bysort, egen code, _N.

    Please help

    ----

    A table for your understanding
    clientid var1 sequence date1 (mdy) date2 date3 date4 createdate
    111111 aaa 1 1/1/2021 . . . 1/4/2021
    222222 aaa 1 1/2/2021 . . . 1/4/2021
    222222 aaa 2 . 2/2/2021 . . 2/4/2021
    333333 bbb 1 1/3/2021 . . . 1/4/2021
    333333 bbb 2 . 2/3/2021 . . 2/4/2021
    333333 ccc 3 . . 3/3/2021 . 3/4/2021
    444444 ddd 1 1/4/2021 . . . 1/4/2021
    444444 ddd 2 . 2/4/2021 . . 2/4/2021
    444444 eee 3 . . 3/4/2021 . 3/4/2021
    444444 fff 4 . . . 4/4/2021 4/4/2021
    OUTPUT
    clientid var1 sequence date1 (mdy) date2 date3 date4 createdate
    111111 aaa 1 1/1/2021 . . . 1/4/2021
    222222 aaa 2 1/2/2021 2/2/2021 . . 2/4/2021
    333333 ccc 3 1/3/2021 2/3/2021 3/3/2021 . 3/4/2021
    444444 fff 4 1/4/2021 2/4/2021 3/4/2012 4/4/2021 4/4/2021
    Thanks for your help!
    Bri

  • #2
    Code:
    // VERIFY THAT DATA ARE AS ADVERTISED
    by clientid (var1), sort: assert var1[1] == var1[_N]
    
    // COLLAPSE THE DATA SET
    collapse  (lastnm) var1 date* (max) sequence createdate, by(clientid)
    Notes:

    1. For this code to work properly, createdate must be a proper Stata internal format numeric date variable. If it is not, you must first convert it to one.
    2. The first command verifies that the variable var1 takes on the same value in all observations for a given clientid--a necessary condition for handling it in -collapse-. The other variables are not subject to that condition because you explain in your question how you want to select which value to use.

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Thank you so much! I was (and still am) unclear about the first part of the code, but I was able to run part 2 without it.
      It ended up looking like this:

      Code:
      collapse  (lastnm) var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12 var13 var14 var15 (max) sequence createdate, by(clientid)
      I also did not seem to need the
      Code:
      var1 date*
      portion of the code. I am thinking this is related to part 1?

      Thanks again,
      Bri

      Comment


      • #4
        I also did not seem to need the Code:
        var1 date*

        portion of the code
        With the code you show, I cannot see how you could have ended up with anything like what you said you wanted. Your result will not contain any of the date variables. And it will contain a bunch of other variables, var2 through var15 that you did not even mention in #1.

        As for the first line of code I had suggested, in the data tableau shown in #1, which I now gather is only distantly related to your actual data, there was a variable called var1 which was supposed to also appear in the results you wanted. For all of the other variables you had clearly stated that their values would vary from one observation of the same clientid to the next but you also made it clear that in all cases you wanted the most recent non-missing value. But for var1 you said nothing about which value to select. I observed that, at least in what you showed, the value of var1 was always the same in all observations of any clientid. So I wrote the -collapse- command to assume that was true throughout your data set. But not knowing if that really is true, I wrote the first line of code to check that assumption. If the assumption were false, Stata would halt with an error message, rather than blundering on to execute a -collapse- command that might produce incorrect results.

        Comment

        Working...
        X