Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate a variable that combines string-observations based on two conditions (date constraint by group)

    Hi everyone,

    I struggle with a list of persons that can be uniquely identified by the variable "name". The persons are observed at different points in time ("date_str" / "date"). I want to generate a new variable that includes all values of the variable "classes" for a given person within the past 5 years. Exemplary data: see below.

    Code:
    clear
    input str25 name str15 date_str str15 classes
    "Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
    "Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
    "Lastname 1, First name 1" "May 3, 2006" "C12Q"
    "Lastname 1, First name 1" "July 8, 2009" "C08K"
    "Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
    "Lastname 2, First name 2" "May 20, 2003" "F22B"
    "Lastname 2, First name 2" "April 2, 2007" "G01N"
    end
    gen date = date(date_str, "MDY")
    order name date_str date classes

    For instance, in line 4 the new variable would have the value "C08K C12Q B65D G01N". In more general terms: I am trying to combine string-observations based on two conditions: (1) same "name" and (2) "date" has to be within the past 5 years of the focal date.

    This might be related to a previous post: www.statalist.org/forums/forum/general-stata-discussion/general/1295115-how-to-summarize-multiple-observations-per-id. However, I struggle with adapting this: It is not enough to look at the previous line. The evaluation of dates has to consider the whole group defined by "name". [Repeating classes - e.g., "F22B H04Q F22B" - are not an issue: I can discard them afterwards.]

    I'm thankful for any help or suggestions!
    Patrick

  • #2
    Well posed problem!

    Consider this:

    Code:
     
    clear
    input str25 name str15 date_str str15 classes
    "Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
    "Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
    "Lastname 1, First name 1" "May 3, 2006" "C12Q"
    "Lastname 1, First name 1" "July 8, 2009" "C08K"
    "Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
    "Lastname 2, First name 2" "May 20, 2003" "F22B"
    "Lastname 2, First name 2" "April 2, 2007" "G01N"
    end
    gen date = date(date_str, "MDY")
    order name date_str date classes
    
    gen wc = wordcount(classes) 
    expand wc 
    sort name date 
    bysort name date : gen count = _n 
    gen class = word(classes, count) 
    by name (date) : drop if (date[_N] - date) > (5 * 365.25) 
    bysort name class : drop if _n > 1 
    by name : gen wanted = class[1] if _n == 1 
    by name : replace wanted = wanted[_n-1] + " " + class if _n > 1 
    by name : keep if _n == _N 
    list 
    
    
    
         +-------------------------------------------------------------------+
      1. |                     name |      date_str |  date |   classes | wc |
         | Lastname 1, First name 1 | July 15, 2004 | 16267 | B65D G01N |  2 |
         |-------------------------------------------------------------------|
         |      count      |      class      |                   wanted      |
         |          2      |       G01N      |      B65D C08K C12Q G01N      |
         +-------------------------------------------------------------------+
    
         +-------------------------------------------------------------------+
      2. |                     name |      date_str |  date |   classes | wc |
         | Lastname 2, First name 2 | April 2, 2007 | 17258 |      G01N |  1 |
         |-------------------------------------------------------------------|
         |      count      |      class      |                   wanted      |
         |          1      |       G01N      |                F22B G01N      |
         +-------------------------------------------------------------------+

    Comment


    • #3
      Thank you a lot for the quick reply: That's already a big step forward!

      There's two things that separates this solution from "making my day":
      1. I do not want to change the original data structure: I want the new variable ("wanted") to have a value for each observation, i.e. for all 7 lines of the example (the original value of "classes" plus any value from other observations/lines that satisfies the two conditions "same name" and "within past 5 years".
      2. The result for "Lastname 1, Firstname 1" is not correct: The classes "C12Q" and "C08K" are from 2006 and 2009 (the ones after the focal date July 15, 2004). "Lastname 2, Firstname 2" is correct (wanted=="F22B" from 2003 and "G01N" from 2007, not "F16J B06B H04R" from 1999).

      Comment


      • #4
        1. Just merge back with the original.

        2. Sorry, but I have no idea of what the focal date is. I can't see an explanation either in #1 or in #3.

        I guessed at the last date and within 5 years previous to that date.

        How is the programmer supposed to know that July 15, 2004 is focal for the first person???

        Comment


        • #5
          Ah, OK, (2.) was a misunderstanding. I was confused by the dates in your solution (they show not the "last date" at which you estimated, but "July 15, 2004" for person 1). So, the solution is actually exactly what I want!

          With regards to (1.):
          I am still not sure how I can achieve a solution for every date - not only for the last date per person. I should have made that more clear in my description. The variable should have values in every line, for instance for the second-last date of person 1 ("focal date: May 3, 2006"), the solution would list the classes of years 2006, 2004 and 2003: "C12Q B65D G01N F22B H04Q F04C".

          Thank you a lot for your repeated replies.

          Comment


          • #6
            This is not efficient, so it might or might not work for you depending on the dataset size. It loops through all observations. I added example observations.

            Code:
            clear
            set more off
            
            *----- example data -----
            
            input str25 name str15 date_str str15 classes
            "Lastname 1, First name 1" "June 16, 2003" "F22B H04Q F04C"
            "Lastname 1, First name 1" "July 15, 2004" "B65D G01N"
            "Lastname 1, First name 1" "May 3, 2006" "C12Q"
            "Lastname 1, First name 1" "July 8, 2009" "C08K"
            "Lastname 1, First name 1" "Sep 13, 2018" "X56P"
            "Lastname 1, First name 1" "Sep 19, 2019" "X39P"
            "Lastname 2, First name 2" "April 5, 1999" "F16J B06B H04R"
            "Lastname 2, First name 2" "May 20, 2003" "F22B"
            "Lastname 2, First name 2" "April 2, 2007" "G01N"
            end
            gen date = date(date_str, "MDY")
            format %td date
            drop date_str
            order name date classes
            
            list
            
            *----- what you want -----
            
            gen near = .
            gen wanted = ""
            
            // check identifier and sort
            isid name date, sort
            quietly forvalues i = 1/`=_N' {
            
                // variable to hold concatenations
                gen allclass = classes
                
                // observations within the 5-year range
                replace near = inrange(date[`i'] - date, 1, `=5*365.25') ///
                    & name[`i'] == name
                
                // concatenate classes
                replace allclass = ///
                    trim(classes + " " + allclass[_n-1]) if near[_n-1]
                
                // desired variable
                replace wanted = allclass in `i'
                
                // drop for next loop
                drop allclass
            
            }
            Last edited by Roberto Ferrer; 24 Jul 2015, 07:52.
            You should:

            1. Read the FAQ carefully.

            2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

            3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

            4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

            Comment


            • #7
              Thank you both a lot. Solution works like a charm!

              Comment


              • #8
                Here's another approach derived from the one I suggested recently here

                Code:
                clear
                input str24 name int date str14 classes
                "Lastname 1, First name 1" 15872 "F22B H04Q F04C" 
                "Lastname 1, First name 1" 16267 "B65D G01N" 
                "Lastname 1, First name 1" 16924 "C12Q" 
                "Lastname 1, First name 1" 18086 "C08K" 
                "Lastname 2, First name 2" 14339 "F16J B06B H04R" 
                "Lastname 2, First name 2" 15845 "F22B" 
                "Lastname 2, First name 2" 17258 "G01N" 
                end
                format %td date
                
                sort name date classes
                gen s = classes
                local more 1
                local i 0
                local window = 365 * 5
                while `more' {
                    local `++i'
                    by name: replace s = s + " " + classes[_n-`i'] ///
                        if (date - date[_n-`i']) < `window'
                    qui count if (date - date[_n-`i']) < `window'
                    local more = r(N)
                }

                Comment


                • #9
                  Thank you! The solution also works.
                  I'm going through both codes right now to fully understand them

                  Comment


                  • #10
                    I worked too fast to adapt my example and introduced an error in the condition that ends the loop. Here's a revised version that I think takes care of the problem.

                    Code:
                    clear
                    input str24 name int date str14 classes
                    "Lastname 1, First name 1" 15872 "F22B H04Q F04C" 
                    "Lastname 1, First name 1" 16267 "B65D G01N" 
                    "Lastname 1, First name 1" 16924 "C12Q" 
                    "Lastname 1, First name 1" 18086 "C08K" 
                    "Lastname 2, First name 2" 14339 "F16J B06B H04R" 
                    "Lastname 2, First name 2" 15845 "F22B" 
                    "Lastname 2, First name 2" 17258 "G01N" 
                    end
                    format %td date
                    
                    sort name date classes
                    gen s = classes
                    local more 1
                    local i 0
                    local window = 365 * 5
                    while `more' {
                        local `++i'
                        by name: gen doit = (date - date[_n-`i']) < `window'
                        by name: replace s = s + " " + classes[_n-`i'] if doit
                        count if doit
                        local more = r(N)
                        drop doit
                    }

                    Comment

                    Working...
                    X