Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge: Two time-series datasets with different unit of analysis (days vs time periods)

    Hello forum,

    I have two time-series data of students that I need to merge. One contains the number of students’ misbehavior per day (BehaviorData-daily) and the other contains the time period and location where students were housed (HousedData). The problem with the HousedData is that only has time periods, not daily data (see below). I need to assign the misbehavior count to the corresponding date period. So if in my example below if the student misbehaved on April 9, it should be next to the April 7 to Apr 10.

    1131600265 4-Apr-16 6-Apr-16
    1131600265 7-Apr-16 10-Apr-16
    1131600265 11-Apr-16 13-Apr-16

    I guess there are two options here. One would be to generate daily observation of my HousedData and then merge. Another option would be to work first using the behavior data, and assign the time range (ex April 7 to Apr 10) to the specific date of the misbehavior. How can I do this using my strategies? Or perhaps there are another more efficient ways to do this? I would appreciate any help.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long ID int(startdt enddt) str2 room byte Ineed
             .     .     . ""   .
    1131600265 20548 20550 "a1" 6
    1131600265 20551 20554 "a1" 2
    1131600265 20555 20557 "a2" .
    1131600265 20558 20561 "a2" .
    1131600265 20562 20564 "a2" .
    1131600473 20583 20585 "a2" .
    1131600473 20586 20589 "a1" .
    1131600473 20590 20592 "a1" 1
    1131600473 20593 20596 "a1" 2
    1131600473 20597 20599 "a2" .
    1131600265 20548 20550 "a1" .
    1131600265 20551 20554 "a2" .
    1131600265 20555 20557 "a3" .
    1131600265 20558 20561 "a4" .
    1131600265 20562 20564 "a5" 1
    end
    format %tddd-Mon-YY startdt
    format %tddd-Mon-YY enddt

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long ID int behavior byte count
             .     . .
    1131600265 20548 3
    1131600265 20549 1
    1131600265 20550 2
    1131600265 20551 1
    1131600265 20552 1
    1131600473 20590 1
    1131600473 20593 1
    1131600473 20594 1
    1131600265 20563 1
    end
    format %tdnn/dd/CCYY behavior








  • #2
    Hello,

    If the post is confusing, please let me know. I often face this type of situation where I have to merge 2 panel data-sets with different time factors such as one contain days and the other discrete dates (two time factor per week, for example). I usually try to convert the datsets to a unique time factors for example convert the time factor to weeks for both datasets and then merge them. But in this specific case, one dataset have time range - 2 time factors per week (Monday and Thursday) . I would appreciate any ideas.

    Thank you in advance!
    Marvin

    Comment


    • #3
      my guess is that "rangejoin" from SSC will do what you want

      Comment


      • #4
        I agree that rangejoin could be preferable to the code posted below, which demonstrates the more general built-in joinby command. My code also demonstrates that the sample data has some problems that need to be overcome to produce the desired results, including the fact that one of the IDs appears twice in both datasets.
        Code:
        . // clean up the sample data
        . 
        . use `housing', clear
        
        . sort ID startdt enddt
        
        . list, sepby(ID)
        
             +---------------------------------------------------+
             |         ID     startdt       enddt   room   Ineed |
             |---------------------------------------------------|
          1. | 1131600265    4-Apr-16    6-Apr-16     a1       . |
          2. | 1131600265    4-Apr-16    6-Apr-16     a1       6 |
          3. | 1131600265    7-Apr-16   10-Apr-16     a2       . |
          4. | 1131600265    7-Apr-16   10-Apr-16     a1       2 |
          5. | 1131600265   11-Apr-16   13-Apr-16     a3       . |
          6. | 1131600265   11-Apr-16   13-Apr-16     a2       . |
          7. | 1131600265   14-Apr-16   17-Apr-16     a4       . |
          8. | 1131600265   14-Apr-16   17-Apr-16     a2       . |
          9. | 1131600265   18-Apr-16   20-Apr-16     a5       1 |
         10. | 1131600265   18-Apr-16   20-Apr-16     a2       . |
             |---------------------------------------------------|
         11. | 1131600473    9-May-16   11-May-16     a2       . |
         12. | 1131600473   12-May-16   15-May-16     a1       . |
         13. | 1131600473   16-May-16   18-May-16     a1       1 |
         14. | 1131600473   19-May-16   22-May-16     a1       2 |
         15. | 1131600473   23-May-16   25-May-16     a2       . |
             |---------------------------------------------------|
         16. |          .           .           .              . |
             +---------------------------------------------------+
        
        . drop if missing(ID)
        (1 observation deleted)
        
        . duplicates drop ID startdt enddt, force
        
        Duplicates in terms of ID startdt enddt
        
        (5 observations deleted)
        
        . drop Ineed
        
        . list, sepby(ID)
        
             +-------------------------------------------+
             |         ID     startdt       enddt   room |
             |-------------------------------------------|
          1. | 1131600265    4-Apr-16    6-Apr-16     a1 |
          2. | 1131600265    7-Apr-16   10-Apr-16     a2 |
          3. | 1131600265   11-Apr-16   13-Apr-16     a3 |
          4. | 1131600265   14-Apr-16   17-Apr-16     a4 |
          5. | 1131600265   18-Apr-16   20-Apr-16     a5 |
             |-------------------------------------------|
          6. | 1131600473    9-May-16   11-May-16     a2 |
          7. | 1131600473   12-May-16   15-May-16     a1 |
          8. | 1131600473   16-May-16   18-May-16     a1 |
          9. | 1131600473   19-May-16   22-May-16     a1 |
         10. | 1131600473   23-May-16   25-May-16     a2 |
             +-------------------------------------------+
        
        . save `housing', replace
        file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_34904.000001 saved
        
        . use `behave', clear
        
        . sort ID behavior
        
        . list, sepby(ID) 
        
             +--------------------------------+
             |         ID    behavior   count |
             |--------------------------------|
          1. | 1131600265    4/4/2016       3 |
          2. | 1131600265    4/5/2016       1 |
          3. | 1131600265    4/6/2016       2 |
          4. | 1131600265    4/7/2016       1 |
          5. | 1131600265    4/8/2016       1 |
          6. | 1131600265   4/19/2016       1 |
             |--------------------------------|
          7. | 1131600473   5/16/2016       1 |
          8. | 1131600473   5/19/2016       1 |
          9. | 1131600473   5/20/2016       1 |
             |--------------------------------|
         10. |          .           .       . |
             +--------------------------------+
        
        . drop if missing(ID)
        (1 observation deleted)
        
        . save `behave', replace
        file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_34904.000002 saved
        
        . 
        . // real work begins here
        . 
        . use `housing', clear
        
        . joinby ID using `behave', unmatched(master)
        
        . replace count = 0  if !inrange(behavior,startdt,enddt)
        (36 real changes made)
        
        . // list, clean nolabel
        . bysort ID startdt enddt: generate Ihave = sum(count)
        
        . // list, clean nolabel
        . bysort ID startdt enddt: keep if _n==_N
        (35 observations deleted)
        
        . drop _merge behavior count
        
        . list, clean
        
                       ID     startdt       enddt   room   Ihave  
          1.   1131600265    4-Apr-16    6-Apr-16     a1       6  
          2.   1131600265    7-Apr-16   10-Apr-16     a2       2  
          3.   1131600265   11-Apr-16   13-Apr-16     a3       0  
          4.   1131600265   14-Apr-16   17-Apr-16     a4       0  
          5.   1131600265   18-Apr-16   20-Apr-16     a5       1  
          6.   1131600473    9-May-16   11-May-16     a2       0  
          7.   1131600473   12-May-16   15-May-16     a1       0  
          8.   1131600473   16-May-16   18-May-16     a1       1  
          9.   1131600473   19-May-16   22-May-16     a1       2  
         10.   1131600473   23-May-16   25-May-16     a2       0  
        
        .

        Comment


        • #5
          Thank you so much to both of you. I will try this. I wasn't aware of the rangejoin and joinby commands, although I assumed there must be something out-there. You codes seems to work- but I will also explore the rangejoin command.

          Just so you know, this was only a sample of my data. In reality I do this to compare pre and post rate of misbehavior (the rooms determine the pre or post eras). So the sample shown was only the pre era. What I did was that I generated start and end date for pre and post eras . So each student had a unique start and end date for the pre and post period. Then I merge that with the "behavior" datset and just count the number of misbehavior based on the start and end dates. But these commands are better to do this. I will explore them.

          Thank you!

          Comment


          • #6
            William Lisowski I re do what you did using my real data and although it seems to work, it a little cumbersome since the date range are multiply by the number of events. Then you have to do more calculation to assign the event to the right date range and finally delete duplicates. If there are multiple time series per panel and many events, this will be a lot.

            Rich Goldstein I downloaded the ado command and I think it kind of does the same as the joinby command. If possible can you show me how you would do it.

            I thought there was an easier way to merge panel datasets that have different time units. I thought there was a command that merge the event to the corresponding time range directly. Perhpas the rangejoin does this and just don't know how to do it. Any ideas?

            Any help will be appreciate it!

            Comment


            • #7
              There's really not much to explain with respect to how to use rangejoin (from SSC) in this case. It boils down to a single line. I've removed the date formatting to make it easier to visualize what's happening by comparing to the input data.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long ID int behavior byte count
                       .     . .
              1131600265 20548 3
              1131600265 20549 1
              1131600265 20550 2
              1131600265 20551 1
              1131600265 20552 1
              1131600473 20590 1
              1131600473 20593 1
              1131600473 20594 1
              1131600265 20563 1
              end
              save "behavior_statalist.dta", replace
              
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long ID int(startdt enddt) str2 room byte Ineed
                       .     .     . ""   .
              1131600265 20548 20550 "a1" 6
              1131600265 20551 20554 "a1" 2
              1131600265 20555 20557 "a2" .
              1131600265 20558 20561 "a2" .
              1131600265 20562 20564 "a2" .
              1131600473 20583 20585 "a2" .
              1131600473 20586 20589 "a1" .
              1131600473 20590 20592 "a1" 1
              1131600473 20593 20596 "a1" 2
              1131600473 20597 20599 "a2" .
              1131600265 20548 20550 "a1" .
              1131600265 20551 20554 "a2" .
              1131600265 20555 20557 "a3" .
              1131600265 20558 20561 "a4" .
              1131600265 20562 20564 "a5" 1
              end
              
              rangejoin behavior startdt enddt using "behavior_statalist.dta", by(ID)
              
              list, sepby(ID)
              and the results
              Code:
                   +----------------------------------------------------------------+
                   |         ID   startdt   enddt   room   Ineed   behavior   count |
                   |----------------------------------------------------------------|
                1. |          .         .       .              .          .       . |
                   |----------------------------------------------------------------|
                2. | 1131600265     20548   20550     a1       6      20548       3 |
                3. | 1131600265     20548   20550     a1       6      20549       1 |
                4. | 1131600265     20548   20550     a1       6      20550       2 |
                5. | 1131600265     20551   20554     a1       2      20551       1 |
                6. | 1131600265     20551   20554     a1       2      20552       1 |
                7. | 1131600265     20555   20557     a2       .          .       . |
                8. | 1131600265     20558   20561     a2       .          .       . |
                9. | 1131600265     20562   20564     a2       .      20563       1 |
                   |----------------------------------------------------------------|
               10. | 1131600473     20583   20585     a2       .          .       . |
               11. | 1131600473     20586   20589     a1       .          .       . |
               12. | 1131600473     20590   20592     a1       1      20590       1 |
               13. | 1131600473     20593   20596     a1       2      20593       1 |
               14. | 1131600473     20593   20596     a1       2      20594       1 |
               15. | 1131600473     20597   20599     a2       .          .       . |
                   |----------------------------------------------------------------|
               16. | 1131600265     20548   20550     a1       .      20548       3 |
               17. | 1131600265     20548   20550     a1       .      20549       1 |
               18. | 1131600265     20548   20550     a1       .      20550       2 |
               19. | 1131600265     20551   20554     a2       .      20551       1 |
               20. | 1131600265     20551   20554     a2       .      20552       1 |
               21. | 1131600265     20555   20557     a3       .          .       . |
               22. | 1131600265     20558   20561     a4       .          .       . |
               23. | 1131600265     20562   20564     a5       1      20563       1 |
                   +----------------------------------------------------------------+

              Comment


              • #8
                Thanks Richard- I got it!

                Comment

                Working...
                X