Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Analyzing Linked Columns

    Hello all,

    I've come across a data analysis issue and don't know how best to address the following problem.

    For simplicity, I have the following dataset:
    id event_date date1 value1 date2 value2 date3 value3 event_value
    1 1/1/2020 1/2/2020 1000 3/2/2020 2000 5/2/2020 3000
    2 2/1/2020 2/2/2019 1000 2/2/2020 2000 2/2/2021 3000
    3 3/1/2020 3/2/2018 1000 3/2/2019 2000 3/2/2020 3000

    I would like to identify the values for dates that are within 30 days of event_date. For example, in the table above, I would like code to generate an event_value column that would equal 1000 for id=1, 2000 for id=2, and 3000 for id=3.

    I know I could generate a counter column that would represent which date value fits within a 30-day window (i.e.
    Code:
    replace counter=1 if date1 <event_date+ 30 & date1 >event_date
    .

    However, in my real dataset, I have 50 date columns, and would prefer to automate a solution rather than run 50 replace commands. Secondly, I don't know an automated way to take take the results of the counter column and have it pull the specific value corresponding to the date (i.e., pull value2 if date2 is in desired range).

    Any thoughts?

    Thank you.

  • #2
    What do you want to do if there is more than one date that falls within 30 days of the event date? What if there is none?

    Also, when you way "within 30 days of the event date" does that mean:
    1. Equal to the event date or later up to 29 days.
    2. Equal to the date of the day after the event date or later up to 30 days.
    3. 1, but also including dates between the event date and the 29th day before (inclusive).
    4. 2, but also including dates between the day before the event date and earlier up to 29 days before.
    5. Between 14 days before and 15 days after the event date.
    6. Between 15 days before and 14 days after the event date.
    7. Something else?--Specify.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      What do you want to do if there is more than one date that falls within 30 days of the event date? What if there is none?

      Also, when you way "within 30 days of the event date" does that mean:
      1. Equal to the event date or later up to 29 days.
      2. Equal to the date of the day after the event date or later up to 30 days.
      3. 1, but also including dates between the event date and the 29th day before (inclusive).
      4. 2, but also including dates between the day before the event date and earlier up to 29 days before.
      5. Between 14 days before and 15 days after the event date.
      6. Between 15 days before and 14 days after the event date.
      7. Something else?--Specify.

      It would be within 30 days after the date, starting from the next day. So, using id 1 from above (start date 1/1/2020), this would include dates from 1/2-31. For any sequences where multiple dates fall within 30 days, it would default to the date closest to the event_date.

      Comment


      • #4
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input byte id str9(event_date date1) int value1 str9 date2 int value2 str9 date3 int value3
        1 "1/1/2020 " "1/2/2020 " 1000 "3/2/2020 " 2000 "5/2/2020 " 3000
        2 "2/1/2020 " "2/2/2019 " 1000 "2/2/2020 " 2000 "2/2/2021 " 3000
        3 "3/1/2020 " "3/2/2018 " 1000 "3/2/2019 " 2000 "3/2/2020 " 3000
        end
        
        foreach v of varlist *date* {
            gen _`v' = daily(`v', "MDY"), after(`v')
            assert missing(_`v') == missing(`v')
            format _`v' %td
            drop `v'
            rename _`v' `v'
        }
        
        reshape long date value, i(id)
        by id (_j), sort: assert _j == _n
        by id (_j), sort: egen index = min(cond(inrange(date-event_date, 1, 30), _j, .))
        by id (_j): gen event_value = value[index]
        Note: This code assumes, and verifies, that the date* and value* series of variables are numbered consecutively starting from 1 until however many there are. It will give break at the -assert- command if this is not true, and produce no results.

        The code also assumes that all of the date variables, including event_date are string variables. Because you gave a tabular display of the data, it is impossible to tell what those variables really are. So the first part of the code is transforming the date variables to Stata internal format numeric daily date variables. This is necessary for almost any kind of calculation with the dates. If in your actual data set these variables are already Stata internal format numeric daily date variables, then that part of the code (everything preceding the -reshape- command, can be omitted.

        In the future, to eliminate guesswork, which can waste everybody's time, when showing data examples, please use the -dataex- command to do so, as I have done here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        Comment


        • #5
          This is fantastic work, thank you Clyde.

          Comment

          Working...
          X