  • creating new variable with varied proportions of current variables

    Hi All,

    In my data, the three variables w_id, w_date, and value come from one dataset, and I want to use them to create new_value for another dataset that contains n_id and n_date. The problem is that the weeks in the two datasets don't line up, so I want to build the new value variable from proportions of the weeks in the first dataset (w_id, w_date, and value).

    For example, the first two weeks in w_id and w_date run from 18993 to 19000 with a value of 1.23, and from 19000 to 19007 with a value of 1.46. The first week in n_id and n_date runs from 18996 to 19003, which means it takes 4 days from the first week and 3 days from the second week of the w_date data. So I want new_value = (4/7)*1.23 + (3/7)*1.46. These proportions are not always 4/7 and 3/7; they change throughout the data.
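    To make the target concrete, that first overlapping week can be checked with plain arithmetic (shown here as a Stata -display-; nothing from the datasets is needed):
    Code:
    display (4/7)*1.23 + (3/7)*1.46   // about 1.329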


    Here is a sample of my data:
    Code:
    clear
    input int w_id int w_date int n_id int n_date float value
    1  18993  1  18996  1.23
    2  19000  2  19003  1.46
    3  19007  3  19010  1.69
    4  19014  4  19017  1.92
    5  19021  5  19024  2.15
    6  19028  6  19031  2.38
    7  19035  7  19038  2.61
    8  19042  8  19045  2.84
    9  19049  9  19052  3.07
    10  19056  10  19059  3.8
    11  19063  11  19066  3.03
    12  19070  12  19073  3.26
    13  19077  13  19080  3.49
    14  19084  14  19087  3.72
    15  19091  15  19094  3.95
    16  19098  16  19101  4.18
    17  19105  17  19108  4.41
    18  19112  18  19115  4.64
    19  19119  19  19122  4.87
    20  19126  20  19129  5.1
    21  19133  21  19136  4.83
    22  19140  22  19143  5.06
    23  19147  23  19150  5.29
    24  19154  24  19157  5.52
    25  19161  25  19164  5.75
    26  19168  26  19171  5.98
    27  19175  27  19185  6.21
    28  19182  28  19192  6.44
    29  19189  29  19199  6.67
    30  19196  30  19206  6.9
    31  19203  31  19213  7.13
    32  19210  32  19220  7.36
    33  19217  33  19227  7.59
    34  19224  34  19234  7.82
    35  19231  35  19241  8.05
    36  19238  36  19248  8.28
    37  19245  37  19255  8.51
    38  19252  38  19262  8.74
    39  19259  39  19269  8.97
    40  19266  40  19276  9.2
    41  19273  41  19283  9.43
    42  19280  42  19290  9.66
    43  19287  43  19297  9.89
    44  19294  44  19304  10.12
    45  19301  45  19311  10.35
    46  19308  46  19325  10.58
    end
    order value, before (n_id)

    I really appreciate your help.

  • #2
    Congratulations on providing an easy-to-use data example!

    I'm a bit confused about what you are trying to do. In your example, there are two instances where n_date skips a week, which means that your observations fall out of phase in the sense that w_date[_n+1] < n_date. Do w_id and n_id have to match when generating new_value, or should the values from the closest week be used? The following illustrates the issue:

    Code:
    clear
    input int w_id int w_date int n_id int n_date float value
    1  18993  1  18996  1.23
    2  19000  2  19003  1.46
    3  19007  3  19010  1.69
    4  19014  4  19017  1.92
    5  19021  5  19024  2.15
    6  19028  6  19031  2.38
    7  19035  7  19038  2.61
    8  19042  8  19045  2.84
    9  19049  9  19052  3.07
    10  19056  10  19059  3.8
    11  19063  11  19066  3.03
    12  19070  12  19073  3.26
    13  19077  13  19080  3.49
    14  19084  14  19087  3.72
    15  19091  15  19094  3.95
    16  19098  16  19101  4.18
    17  19105  17  19108  4.41
    18  19112  18  19115  4.64
    19  19119  19  19122  4.87
    20  19126  20  19129  5.1
    21  19133  21  19136  4.83
    22  19140  22  19143  5.06
    23  19147  23  19150  5.29
    24  19154  24  19157  5.52
    25  19161  25  19164  5.75
    26  19168  26  19171  5.98
    27  19175  27  19185  6.21
    28  19182  28  19192  6.44
    29  19189  29  19199  6.67
    30  19196  30  19206  6.9
    31  19203  31  19213  7.13
    32  19210  32  19220  7.36
    33  19217  33  19227  7.59
    34  19224  34  19234  7.82
    35  19231  35  19241  8.05
    36  19238  36  19248  8.28
    37  19245  37  19255  8.51
    38  19252  38  19262  8.74
    39  19259  39  19269  8.97
    40  19266  40  19276  9.2
    41  19273  41  19283  9.43
    42  19280  42  19290  9.66
    43  19287  43  19297  9.89
    44  19294  44  19304  10.12
    45  19301  45  19311  10.35
    46  19308  46  19325  10.58
    end
    format %td *_date
    order value, before (n_id)
    
    
    gen w1 = w_date[_n-1] - w_date      // days back to the previous w_date (a constant -7 here)
    gen w2 = n_date[_n-1] - n_date      // days back to the previous n_date (-14 where a week is skipped)
    
    gen dbefore = n_date - w_date       // days from the current w_date to the current n_date
    gen dafter = w_date - n_date[_n-1]  // days from the previous n_date to the current w_date
    gen range = dafter + dbefore        // total length of the current n interval
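    For instance, after running the block above, a quick listing (my addition, just to eyeball the generated variables) shows the 3-day/4-day offset between the two weekly series in the first few weeks:
    Code:
    list w_id w_date n_id n_date dbefore dafter range in 1/5, noobs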

    • #3
      Thanks, Robert, for your kind response.
      Let me try to explain with an example. Take the n_date interval from 27 June 2012 to 11 July 2012: I need to work out which w_date weeks the days in that range belong to. In this case three weeks of the w_date data are involved:

      24 June to 1 July: 4 days with value = 5.98  => (4/7)*5.98
      1 July to 8 July:  7 days with value = 6.21  => (7/7)*6.21
      8 July to 15 July: 3 days with value = 6.44  => (3/7)*6.44
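      Putting numbers on it (this assumes the three contributions are weighted by their share of the full 14-day span between the two n_dates, so that the weights sum to one):
      Code:
      display (4*5.98 + 7*6.21 + 3*6.44)/14   // about 6.19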

      I hope this makes more sense. Thanks for your help.

      • #4
        The following solution ignores the identifiers and calculates a new value based on the daily values around n_date. To get daily values, I expand the original weekly data and carry each value forward until a new week starts. I then merge the daily values with n_date and calculate a new value from a +/- 3 day window around n_date.

        Code:
        clear
        input int w_id int w_date int n_id int n_date float value
        1  18993  1  18996  1.23
        2  19000  2  19003  1.46
        3  19007  3  19010  1.69
        4  19014  4  19017  1.92
        5  19021  5  19024  2.15
        6  19028  6  19031  2.38
        7  19035  7  19038  2.61
        8  19042  8  19045  2.84
        9  19049  9  19052  3.07
        10  19056  10  19059  3.8
        11  19063  11  19066  3.03
        12  19070  12  19073  3.26
        13  19077  13  19080  3.49
        14  19084  14  19087  3.72
        15  19091  15  19094  3.95
        16  19098  16  19101  4.18
        17  19105  17  19108  4.41
        18  19112  18  19115  4.64
        19  19119  19  19122  4.87
        20  19126  20  19129  5.1
        21  19133  21  19136  4.83
        22  19140  22  19143  5.06
        23  19147  23  19150  5.29
        24  19154  24  19157  5.52
        25  19161  25  19164  5.75
        26  19168  26  19171  5.98
        27  19175  27  19185  6.21
        28  19182  28  19192  6.44
        29  19189  29  19199  6.67
        30  19196  30  19206  6.9
        31  19203  31  19213  7.13
        32  19210  32  19220  7.36
        33  19217  33  19227  7.59
        34  19224  34  19234  7.82
        35  19231  35  19241  8.05
        36  19238  36  19248  8.28
        37  19245  37  19255  8.51
        38  19252  38  19262  8.74
        39  19259  39  19269  8.97
        40  19266  40  19276  9.2
        41  19273  41  19283  9.43
        42  19280  42  19290  9.66
        43  19287  43  19297  9.89
        44  19294  44  19304  10.12
        45  19301  45  19311  10.35
        46  19308  46  19325  10.58
        end
        format %td *_date
        order value, before (n_id)
        tempfile main
        save "`main'"
        
        list, clean
        
        * verify assumptions about the data
        isid w_date
        isid n_date
        
        * special case to reach the last n_date
        local lastdate = n_date[_N]
        
        * expand to daily data for these variables
        keep w_id w_date value
        gen days = w_date[_n+1] - w_date
        replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
        expand days
        bysort w_date : gen n_date = w_date + _n - 1
        format %td *_date
        
        * merge back using n_date
        merge m:1 n_date using "`main'", assert(master match) nogen
        
        * calculate new value based on value on the days around n_date
        * -tsegen- is from SSC.
        tsset n_date
        tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
        
        * note that the same can be done this way except that there are
        * no forward values for the last observation
        gen new_values = (L1.value + L2.value + L3.value + value + ///
                        F1.value + F2.value + F3.value) / 7 if !mi(n_id)
                        
        * reduce to the original data
        keep if !mi(n_id)
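        To see what the window average produces, one could add a short listing after the code above (a check I added, not part of the original solution):
        Code:
        list n_id n_date value new_values new_values2 in 1/5, noobs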

        • #5
          Hi Robert,

          Thanks so much. It's way above my head. I ran it line by line to understand what happens. One thing I changed is that instead of
          Code:
          tempfile main
          save "`main'"
          I used:
          Code:
          save main, replace
          Honestly, I've never worked with macros, so I don't know whether this change hurts the code.

          Then I used this part:
          Code:
          list, clean
          
          * verify assumptions about the data
          isid w_date
          isid n_date
          
          * special case to reach the last n_date
          local lastdate = n_date[_N]
          
          * expand to daily data for these variables
          keep w_id w_date value
          gen days = w_date[_n+1] - w_date
          replace days = `lastdate' - w_date + 1 in l
          expand days
          bysort w_date : gen n_date = w_date + _n - 1
          format %td *_date
          
          * merge back using n_date
          merge m:1 n_date using main, assert(master match) nogen
          And then this one:

          Code:
          tsset n_date
          tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
          but I received the following errors:

          tsset n_date
          repeated time values in sample
          r(451);


          tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
          unrecognized command: tsegen
          r(199);


          I don't know why this happens. Is it because I saved main to my local folder?
          Thanks so much for your help.

          • #6
            No problem, I can break it down a bit. There's no harm in switching to

            Code:
            save main, replace
            Are you sure that the following part does not produce an error? If there are multiple observations per w_date or n_date, this approach will not work.

            Code:
            * verify assumptions about the data
            isid w_date
            isid n_date
            If there's no error, then the rest should work. The first step is to convert w_id, w_date, and value into daily observations. The only difficulty is that you need to extend these daily observations so that there's a value on the last n_date. So

            Code:
            * special case to reach the last n_date
            local lastdate = n_date[_N]
            does just that: it places the last value of n_date into the local macro lastdate. Note that, going only by the example you posted, I assume that the last n_date is later than the last w_date. If that's not the case, the code would have to be adjusted.
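            If macros are unfamiliar, here is a tiny standalone illustration of what that line stores (19325 is simply the last n_date from your example, typed in by hand):
            Code:
            local lastdate = 19325          // what -local lastdate = n_date[_N]- stores here
            display `lastdate'              // 19325
            display %td `lastdate'          // the same number displayed as a date, 28nov2012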

            Then the following code reduces the data to just the w_id, w_date, and value variables and creates daily observations. Rather than trying to explain each part, just put a list statement afterwards and see what is created. Essentially, the value on a given w_date is carried forward to each new day until a new value is observed.

            Code:
            * expand to daily data for these variables
            keep w_id w_date value
            gen days = w_date[_n+1] - w_date
            replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
            expand days
            bysort w_date : gen n_date = w_date + _n - 1
            format %td *_date
            list, sepby(w_id) noobs
            At this point, you should have one observation per day. You can check that there are no duplicates using

            Code:
            isid n_date
            This daily data can now be merged back with the original data. Because the daily data covers all days, you are sure to match all the original n_date values. You can change the merge to a 1:1 merge since you have unique days in memory and you know there are no duplicate n_date values in the original data.

            Code:
            * merge back using n_date
            merge 1:1 n_date using "`main'", assert(master match) nogen
            In order to use tsegen, you need to install it from SSC. You only need to do this once; to do so, type in the command window

            Code:
            ssc install tsegen
            It's hard to explain how you got multiple values of n_date; based on the data example, you should not have any. That's what the following message is indicating:

            Code:
            tsset n_date
            repeated time values in sample
            r(451);
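            Should that message come back, the built-in -duplicates- command is a quick way to track down the repeated dates (nothing extra to install):
            Code:
            duplicates report n_date         // how many n_date values occur more than once
            duplicates tag n_date, gen(dup)  // flag the offending observations
            list n_date if dup, noobs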
            I suggest that you run this updated example. Note that since I used a local macro, it's best to run the whole thing from a do-file. Just save everything in a file called "test.do" in the current directory and run it using

            Code:
            do test.do
            The updated code follows

            Code:
            clear
            input int w_id int w_date int n_id int n_date float value
            1  18993  1  18996  1.23
            2  19000  2  19003  1.46
            3  19007  3  19010  1.69
            4  19014  4  19017  1.92
            5  19021  5  19024  2.15
            6  19028  6  19031  2.38
            7  19035  7  19038  2.61
            8  19042  8  19045  2.84
            9  19049  9  19052  3.07
            10  19056  10  19059  3.8
            11  19063  11  19066  3.03
            12  19070  12  19073  3.26
            13  19077  13  19080  3.49
            14  19084  14  19087  3.72
            15  19091  15  19094  3.95
            16  19098  16  19101  4.18
            17  19105  17  19108  4.41
            18  19112  18  19115  4.64
            19  19119  19  19122  4.87
            20  19126  20  19129  5.1
            21  19133  21  19136  4.83
            22  19140  22  19143  5.06
            23  19147  23  19150  5.29
            24  19154  24  19157  5.52
            25  19161  25  19164  5.75
            26  19168  26  19171  5.98
            27  19175  27  19185  6.21
            28  19182  28  19192  6.44
            29  19189  29  19199  6.67
            30  19196  30  19206  6.9
            31  19203  31  19213  7.13
            32  19210  32  19220  7.36
            33  19217  33  19227  7.59
            34  19224  34  19234  7.82
            35  19231  35  19241  8.05
            36  19238  36  19248  8.28
            37  19245  37  19255  8.51
            38  19252  38  19262  8.74
            39  19259  39  19269  8.97
            40  19266  40  19276  9.2
            41  19273  41  19283  9.43
            42  19280  42  19290  9.66
            43  19287  43  19297  9.89
            44  19294  44  19304  10.12
            45  19301  45  19311  10.35
            46  19308  46  19325  10.58
            end
            format %td *_date
            order value, before (n_id)
            save main, replace
            list, clean
            
            * verify assumptions about the data
            isid w_date
            isid n_date
            
            * special case to reach the last n_date
            local lastdate = n_date[_N]
            
            * expand to daily data for these variables
            keep w_id w_date value
            gen days = w_date[_n+1] - w_date
            replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
            expand days
            bysort w_date : gen n_date = w_date + _n - 1
            format %td *_date
            
            list, sepby(w_id) noobs
            
            * merge back using n_date
            merge 1:1 n_date using "main", assert(master match)
            tab _merge
            
            * calculate new value based on value on the days around n_date
            * -tsegen- is from SSC.
            tsset n_date
            tsegen new_values2 = rowmean(L(0/3).value F(0/3).value) if !mi(n_id)
            
            * note that the same can be done this way except that there are
            * no forward values for the last observation
            gen new_values = (L1.value + L2.value + L3.value + value + ///
                            F1.value + F2.value + F3.value) / 7 if !mi(n_id)
                            
            * reduce to the original data
            keep if !mi(n_id)

            • #7
              Hi Robert,

              Thanks for the explanation. It's clearer now. The problem was the installation. Now I don't receive any errors, but the new_values are identical to the old values. For example, value and new_value in the first row are both 1.23, so the proportions were not applied to the values. Did you get the correct values? I am copy-pasting exactly what you have here, so our data are the same.

              Do you know where I am making a mistake?
              Thanks a lot.

              • #8
                My bad, I did not look at the new values and didn't realize that I was not calculating them the way you wanted. The setup can remain the same; I think all you need is to average the daily values over periods running from one n_date to the next. Something like:

                Code:
                clear
                input int w_id int w_date int n_id int n_date float value
                1  18993  1  18996  1.23
                2  19000  2  19003  1.46
                3  19007  3  19010  1.69
                4  19014  4  19017  1.92
                5  19021  5  19024  2.15
                6  19028  6  19031  2.38
                7  19035  7  19038  2.61
                8  19042  8  19045  2.84
                9  19049  9  19052  3.07
                10  19056  10  19059  3.8
                11  19063  11  19066  3.03
                12  19070  12  19073  3.26
                13  19077  13  19080  3.49
                14  19084  14  19087  3.72
                15  19091  15  19094  3.95
                16  19098  16  19101  4.18
                17  19105  17  19108  4.41
                18  19112  18  19115  4.64
                19  19119  19  19122  4.87
                20  19126  20  19129  5.1
                21  19133  21  19136  4.83
                22  19140  22  19143  5.06
                23  19147  23  19150  5.29
                24  19154  24  19157  5.52
                25  19161  25  19164  5.75
                26  19168  26  19171  5.98
                27  19175  27  19185  6.21
                28  19182  28  19192  6.44
                29  19189  29  19199  6.67
                30  19196  30  19206  6.9
                31  19203  31  19213  7.13
                32  19210  32  19220  7.36
                33  19217  33  19227  7.59
                34  19224  34  19234  7.82
                35  19231  35  19241  8.05
                36  19238  36  19248  8.28
                37  19245  37  19255  8.51
                38  19252  38  19262  8.74
                39  19259  39  19269  8.97
                40  19266  40  19276  9.2
                41  19273  41  19283  9.43
                42  19280  42  19290  9.66
                43  19287  43  19297  9.89
                44  19294  44  19304  10.12
                45  19301  45  19311  10.35
                46  19308  46  19325  10.58
                end
                format %td *_date
                order value, before (n_id)
                save main, replace
                list, clean
                
                * verify assumptions about the data
                isid w_date
                isid n_date
                
                * special case to reach the last n_date
                local lastdate = n_date[_N]
                
                * expand to daily data for these variables
                keep w_id w_date value
                gen days = w_date[_n+1] - w_date
                replace days = `lastdate' - w_date + 1 in l // to reach the last n_date
                expand days
                bysort w_date : gen n_date = w_date + _n - 1
                format %td *_date
                
                list, sepby(w_id) noobs
                
                * merge back using n_date
                merge 1:1 n_date using "main", assert(master match) nogen
                
                * calculate new value based on value on the days around n_date
                gen n_group = sum(!mi(n_id))
                bysort n_group (n_date): egen new_values = mean(value)
                
                * show a few examples
                list if inlist(n_group,1,2,26), sepby(n_group) noobs
                
                * reduce to original observations
                keep if !mi(n_id)
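                As a quick sanity check (my addition, assuming the code above runs as posted), the first new value should now reproduce the hand calculation from #1:
                Code:
                display (4/7)*1.23 + (3/7)*1.46                  // 1.3285714
                list n_id n_date value new_values in 1/2, noobs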

                • #9
                  No worries at all. It helped me even more, because I spent some time thinking it through and came up with a solution, a very basic one of course. I learned so much with your help. I really appreciate it. Have a wonderful day.
