Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Clean dummy variables according to month and day

    I have a household indicator #hhd#, month and day indicators #month# #day# and three dummy variables #mark_1# #mark_2# #mark_3#

    a household starts with all dummy variables at 0 and each dummy can be activated one at a time. Once a dummy changes from 0 to 1, it stays 1 throughout ever subsequent day for that household. #mark_1# must be activated before #mark_2#, #mark_2# must be activated before #mark_3#.

    The problem I am having is that no more than one dummy ought to be activated in any given month. So for example both #mark_1# and #mark_2# cannot both be 0 on January 1st and both be 1 on January 31st however they can both be 1 on February 1st.

    At the moment, I have some households where more than one dummy variable is being turned from 0 to 1 in the same month.

    For example

    Code:
    ssc install datex
    input long hhd float month float day double mark_1 double mark_2 mark_3
    1 1 1 0 0 0
    1 1 2 0 0 0
    1 1 3 0 0 0
    1 1 4 0 0 0
    1 1 5 0 0 0
    1 1 6 0 0 0
    1 1 7 0 0 0
    1 1 8 0 0 0
    1 1 9 0 0 0
    1 1 10 1 0 0
    1 1 11 1 0 0
    1 1 12 1 0 0
    1 1 13 1 0 0
    1 1 14 1 0 0
    1 1 15 1 0 0
    1 1 16 1 0 0
    1 1 17 1 0 0
    1 1 18 1 0 0
    1 1 19 1 0 0
    1 1 20 1 1 0
    1 1 21 1 1 0
    1 1 22 1 1 0
    1 1 23 1 1 0
    1 1 24 1 1 0
    1 1 25 1 1 0
    1 1 26 1 1 0
    1 1 27 1 1 0
    1 1 28 1 1 0
    1 1 29 1 1 0
    1 1 30 1 1 0
    1 1 31 1 1 0
    1 2 1 1 1 0
    1 2 2 1 1 0
    1 2 3 1 1 0
    1 2 4 1 1 0
    1 2 5 1 1 0
    1 2 6 1 1 0
    1 2 7 1 1 0
    1 2 8 1 1 0
    1 2 9 1 1 0
    1 2 10 1 1 0
    1 2 11 1 1 0
    1 2 12 1 1 0
    1 2 13 1 1 0
    1 2 14 1 1 0
    1 2 15 1 1 0
    1 2 16 1 1 0
    1 2 17 1 1 1
    1 2 18 1 1 1
    1 2 19 1 1 1
    1 2 20 1 1 1
    1 2 21 1 1 1
    1 2 22 1 1 1
    1 2 23 1 1 1
    1 2 24 1 1 1
    1 2 25 1 1 1
    1 2 26 1 1 1
    1 2 27 1 1 1
    1 2 28 1 1 1
    1 3 1 1 1 1
    1 3 2 1 1 1
    1 3 3 1 1 1
    1 3 4 1 1 1
    1 3 5 1 1 1
    1 3 6 1 1 1
    1 3 7 1 1 1
    1 3 8 1 1 1
    1 3 9 1 1 1
    1 3 10 1 1 1  
    1 3 11 1 1 1
    1 3 12 1 1 1
    1 3 13 1 1 1
    1 3 14 1 1 1
    1 3 15 1 1 1
    1 3 16 1 1 1
    1 3 17 1 1 1
    1 3 18 1 1 1
    1 3 19 1 1 1
    1 3 20 1 1 1
    1 3 21 1 1 1  
    1 3 22 1 1 1
    1 3 23 1 1 1
    1 3 24 1 1 1
    1 3 25 1 1 1
    1 3 26 1 1 1
    1 3 27 1 1 1
    1 3 28 1 1 1
    1 3 29 1 1 1
    1 3 30 1 1 1
    1 3 31 1 1 1
    end
    The desired outcome would be something such as this

    Code:
    input long hhd float month float day double mark_1 double mark_2 mark_3
    1 1 1 0 0 0
    1 1 2 0 0 0
    1 1 3 0 0 0
    1 1 4 0 0 0
    1 1 5 0 0 0
    1 1 6 0 0 0
    1 1 7 0 0 0
    1 1 8 0 0 0
    1 1 9 0 0 0
    1 1 10 1 0 0
    1 1 11 1 0 0
    1 1 12 1 0 0
    1 1 13 1 0 0
    1 1 14 1 0 0
    1 1 15 1 0 0
    1 1 16 1 0 0
    1 1 17 1 0 0
    1 1 18 1 0 0
    1 1 19 1 0 0
    1 1 20 1 0 0
    1 1 21 1 0 0
    1 1 22 1 0 0
    1 1 23 1 0 0
    1 1 24 1 0 0
    1 1 25 1 0 0
    1 1 26 1 0 0
    1 1 27 1 0 0
    1 1 28 1 0 0
    1 1 29 1 0 0
    1 1 30 1 0 0
    1 1 31 1 0 0
    1 2 1 1 0 0
    1 2 2 1 0 0
    1 2 3 1 0 0
    1 2 4 1 0 0
    1 2 5 1 0 0
    1 2 6 1 0 0
    1 2 7 1 0 0
    1 2 8 1 0 0
    1 2 9 1 0 0
    1 2 10 1 0 0
    1 2 11 1 0 0
    1 2 12 1 0 0
    1 2 13 1 0 0
    1 2 14 1 0 0
    1 2 15 1 0 0
    1 2 16 1 0 0
    1 2 17 1 1 0
    1 2 18 1 1 0
    1 2 19 1 1 0
    1 2 20 1 1 0
    1 2 21 1 1 0
    1 2 22 1 1 0
    1 2 23 1 1 0
    1 2 24 1 1 0
    1 2 25 1 1 0
    1 2 26 1 1 0
    1 2 27 1 1 0
    1 2 28 1 1 0
    1 3 1 1 1 0
    1 3 2 1 1 0
    1 3 3 1 1 0
    1 3 4 1 1 0
    1 3 5 1 1 0
    1 3 6 1 1 0
    1 3 7 1 1 0
    1 3 8 1 1 0
    1 3 9 1 1 0
    1 3 10 1 1 0  
    1 3 11 1 1 0
    1 3 12 1 1 0
    1 3 13 1 1 0
    1 3 14 1 1 0
    1 3 15 1 1 0
    1 3 16 1 1 0
    1 3 17 1 1 0
    1 3 18 1 1 0
    1 3 19 1 1 0
    1 3 20 1 1 0
    1 3 21 1 1 0  
    1 3 22 1 1 0
    1 3 23 1 1 0
    1 3 24 1 1 0
    1 3 25 1 1 0
    1 3 26 1 1 0
    1 3 27 1 1 0
    1 3 28 1 1 0
    1 3 29 1 1 0
    1 3 30 1 1 0
    1 3 31 1 1 0
    end
    I have approximately 200K households so looping over households is not ideal but I am wondering if there is a way that I can resolve this issue. If some sort of loop is possible, then it would be okay.

    Thank you

  • #2
    So, by default, we are all using version 16.0 now. The following code uses frames, which did not exist in older versions of Stata. If you are using an older version, please post back saying so and I'll try to translate this into code that will run on earlier versions.

    Code:
    clear*
    input long hhd float month float day double mark_1 double mark_2 mark_3
    1 1 1 0 0 0
    1 1 2 0 0 0
    1 1 3 0 0 0
    1 1 4 0 0 0
    1 1 5 0 0 0
    1 1 6 0 0 0
    1 1 7 0 0 0
    1 1 8 0 0 0
    1 1 9 0 0 0
    1 1 10 1 0 0
    1 1 11 1 0 0
    1 1 12 1 0 0
    1 1 13 1 0 0
    1 1 14 1 0 0
    1 1 15 1 0 0
    1 1 16 1 0 0
    1 1 17 1 0 0
    1 1 18 1 0 0
    1 1 19 1 0 0
    1 1 20 1 1 0
    1 1 21 1 1 0
    1 1 22 1 1 0
    1 1 23 1 1 0
    1 1 24 1 1 0
    1 1 25 1 1 0
    1 1 26 1 1 0
    1 1 27 1 1 0
    1 1 28 1 1 0
    1 1 29 1 1 0
    1 1 30 1 1 0
    1 1 31 1 1 0
    1 2 1 1 1 0
    1 2 2 1 1 0
    1 2 3 1 1 0
    1 2 4 1 1 0
    1 2 5 1 1 0
    1 2 6 1 1 0
    1 2 7 1 1 0
    1 2 8 1 1 0
    1 2 9 1 1 0
    1 2 10 1 1 0
    1 2 11 1 1 0
    1 2 12 1 1 0
    1 2 13 1 1 0
    1 2 14 1 1 0
    1 2 15 1 1 0
    1 2 16 1 1 0
    1 2 17 1 1 1
    1 2 18 1 1 1
    1 2 19 1 1 1
    1 2 20 1 1 1
    1 2 21 1 1 1
    1 2 22 1 1 1
    1 2 23 1 1 1
    1 2 24 1 1 1
    1 2 25 1 1 1
    1 2 26 1 1 1
    1 2 27 1 1 1
    1 2 28 1 1 1
    1 3 1 1 1 1
    1 3 2 1 1 1
    1 3 3 1 1 1
    1 3 4 1 1 1
    1 3 5 1 1 1
    1 3 6 1 1 1
    1 3 7 1 1 1
    1 3 8 1 1 1
    1 3 9 1 1 1
    1 3 10 1 1 1  
    1 3 11 1 1 1
    1 3 12 1 1 1
    1 3 13 1 1 1
    1 3 14 1 1 1
    1 3 15 1 1 1
    1 3 16 1 1 1
    1 3 17 1 1 1
    1 3 18 1 1 1
    1 3 19 1 1 1
    1 3 20 1 1 1
    1 3 21 1 1 1  
    1 3 22 1 1 1
    1 3 23 1 1 1
    1 3 24 1 1 1
    1 3 25 1 1 1
    1 3 26 1 1 1
    1 3 27 1 1 1
    1 3 28 1 1 1
    1 3 29 1 1 1
    1 3 30 1 1 1
    1 3 31 1 1 1
    end
    
    //  CREATE STATA INTERNAL FORMAT DAILY AND MONTHLY VARIABLES
    //  AS NO YEAR SPECIFIED, ASSUME 2019
    gen ddate = mdy(month, day, 2019)
    assert missing(ddate) == missing(month, day)
    format ddate %tdMon_dd
    
    gen mdate = mofd(ddate)
    format mdate %tmMon
    
    drop month day
    
    //  GO LONG
    reshape long mark_, i(hhd ddate) j(order)
    
    //  GET START DATE FOR EACH MARK
    by hhd order (ddate), sort: egen start = min(cond(mark_, ddate, .))
    format start %tdMon_dd
    
    //  SPLIT THE DATA INTO TWO PARTS
    //  THE DATE STRUCTURE
    frame put hhd ddate, into(date_structure)
    frame date_structure: by hhd ddate, sort: keep if _n == 1
    
    //   AND THE ACTIVATION DATES FOR THE MARKS
    keep hhd mdate start
    keep if mofd(start) == mdate
    
    
    //  RETAIN ONLY THE FIRST START DATE IN ANY MONTH
    by hhd mdate (start), sort: keep if _n == 1
    drop mdate
    by hhd start, sort: keep if _n == 1
    
    //  SET AN ORDER COUNTER AND GO WIDE
    by hhd (start), sort: gen int order = _n
    summ order, meanonly
    local n_orders = r(max)
    reshape wide start, i(hhd) j(order)
    
    //  LINK THE TWO PARTS OF THE DATA BACK TOGETHER
    frame change date_structure
    frlink m:1 hhd, frame(default)
    frget start*, from(default)
    
    //  NOW RE-CREATE THE MARKS
    forvalues i = 1/`n_orders' {
        gen mark_`i' = ddate >= start`i'
    }
    
    drop start* default
    frame drop default
    Given the large size of your data set I have tried to code things in ways that run fast.

    This code produces (almost) the results you desire on your example data. The discrepancy is that because in your example data the results never have mark_3 = 1, there is no mark_3 variable in the results. But if some hhd in your real data set does actually have mark_3 = 1 at some point, then you will get exactly the results you show as desired.

    Comment


    • #3
      Thank you so much for your example! I am using Stata 15.0 and I'm sorry I should have stated as much in my initial post. In due time, I'm sure that I'll find the frame syntax helpful as well but if you have any suggestions for earlier versions I would be very appreciative.

      Comment


      • #4
        This should run under version 15 and give you the same results:

        Code:
        clear*
        input long hhd float month float day double mark_1 double mark_2 mark_3
        1 1 1 0 0 0
        1 1 2 0 0 0
        1 1 3 0 0 0
        1 1 4 0 0 0
        1 1 5 0 0 0
        1 1 6 0 0 0
        1 1 7 0 0 0
        1 1 8 0 0 0
        1 1 9 0 0 0
        1 1 10 1 0 0
        1 1 11 1 0 0
        1 1 12 1 0 0
        1 1 13 1 0 0
        1 1 14 1 0 0
        1 1 15 1 0 0
        1 1 16 1 0 0
        1 1 17 1 0 0
        1 1 18 1 0 0
        1 1 19 1 0 0
        1 1 20 1 1 0
        1 1 21 1 1 0
        1 1 22 1 1 0
        1 1 23 1 1 0
        1 1 24 1 1 0
        1 1 25 1 1 0
        1 1 26 1 1 0
        1 1 27 1 1 0
        1 1 28 1 1 0
        1 1 29 1 1 0
        1 1 30 1 1 0
        1 1 31 1 1 0
        1 2 1 1 1 0
        1 2 2 1 1 0
        1 2 3 1 1 0
        1 2 4 1 1 0
        1 2 5 1 1 0
        1 2 6 1 1 0
        1 2 7 1 1 0
        1 2 8 1 1 0
        1 2 9 1 1 0
        1 2 10 1 1 0
        1 2 11 1 1 0
        1 2 12 1 1 0
        1 2 13 1 1 0
        1 2 14 1 1 0
        1 2 15 1 1 0
        1 2 16 1 1 0
        1 2 17 1 1 1
        1 2 18 1 1 1
        1 2 19 1 1 1
        1 2 20 1 1 1
        1 2 21 1 1 1
        1 2 22 1 1 1
        1 2 23 1 1 1
        1 2 24 1 1 1
        1 2 25 1 1 1
        1 2 26 1 1 1
        1 2 27 1 1 1
        1 2 28 1 1 1
        1 3 1 1 1 1
        1 3 2 1 1 1
        1 3 3 1 1 1
        1 3 4 1 1 1
        1 3 5 1 1 1
        1 3 6 1 1 1
        1 3 7 1 1 1
        1 3 8 1 1 1
        1 3 9 1 1 1
        1 3 10 1 1 1  
        1 3 11 1 1 1
        1 3 12 1 1 1
        1 3 13 1 1 1
        1 3 14 1 1 1
        1 3 15 1 1 1
        1 3 16 1 1 1
        1 3 17 1 1 1
        1 3 18 1 1 1
        1 3 19 1 1 1
        1 3 20 1 1 1
        1 3 21 1 1 1  
        1 3 22 1 1 1
        1 3 23 1 1 1
        1 3 24 1 1 1
        1 3 25 1 1 1
        1 3 26 1 1 1
        1 3 27 1 1 1
        1 3 28 1 1 1
        1 3 29 1 1 1
        1 3 30 1 1 1
        1 3 31 1 1 1
        end
        
        //  CREATE STATA INTERNAL FORMAT DAILY AND MONTHLY VARIABLES
        //  AS NO YEAR SPECIFIED, ASSUME 2019
        gen ddate = mdy(month, day, 2019)
        assert missing(ddate) == missing(month, day)
        format ddate %tdMon_dd
        
        gen mdate = mofd(ddate)
        format mdate %tmMon
        
        drop month day
        
        //  GO LONG
        reshape long mark_, i(hhd ddate) j(order)
        
        //  GET START DATE FOR EACH MARK
        by hhd order (ddate), sort: egen start = min(cond(mark_, ddate, .))
        format start %tdMon_dd
        
        //  SPLIT THE DATA INTO TWO PARTS
        //  THE DATE STRUCTURE
        preserve
        keep hhd ddate
        by hhd ddate, sort: keep if _n == 1
        tempfile date_structure
        save `date_structure'
        
        //   AND THE ACTIVATION DATES FOR THE MARKS
        restore
        keep hhd mdate start
        keep if mofd(start) == mdate
        
        
        //  RETAIN ONLY THE FIRST START DATE IN ANY MONTH
        by hhd mdate (start), sort: keep if _n == 1
        drop mdate
        by hhd start, sort: keep if _n == 1
        
        //  SET AN ORDER COUNTER AND GO WIDE
        by hhd (start), sort: gen int order = _n
        summ order, meanonly
        local n_orders = r(max)
        reshape wide start, i(hhd) j(order)
        
        //  LINK THE TWO PARTS OF THE DATA BACK TOGETHER
        merge 1:m hhd using `date_structure', assert(match) nogenerate
        
        //  NOW RE-CREATE THE MARKS
        forvalues i = 1/`n_orders' {
            gen mark_`i' = ddate >= start`i'
        }
        
        drop start*

        Comment

        Working...
        X