
  • Weighting observations before collapsing the data

    Hi,

    I have data on a group of applicants (applicationid) for a job, their previous jobs (onet), how long they spent in each previous job (length), and the time between the last day of each job and the application date (elapsed = applicationdate - ndateto). I also have the O*NET characteristics of the previous jobs (char1, char2, and char3 are examples of those characteristics). I want to collapse the data so that I have one row per person, with a weighted average of each O*NET characteristic. In other words, I don't want to give equal weight to the previous jobs: I want to give each job a weight that is a function of the time the applicant spent in it (length) and of how long it has been since they left it (elapsed), such that the weights for each person sum to 1.

    Jobs held a long time ago should get lower weights, and jobs in which the applicant had longer tenure should get higher weights. The sum of the weights over each person's previous jobs should be 1.

    I'm not sure what the best approach is. I really appreciate your help.


    Code:
    clear
    input long applicationid float(applicationdate ndatefrom ndateto length elapsed) str10 onet float(char1 char2 char3)
    1 17265 14762 14823   61 2442 "39-1021.00" 4.05 3.94 3.89
    1 17265 15127 15188   61 2077 "39-1021.00" 4.05 3.94 3.89
    1 17265 15492 15553   61 1712 "39-1021.00" 4.05 3.94 3.89
    1 17265 15584 17136 1552  129 "15-1151.00" 3.79 3.34 4.27
    1 17265 15675 17136 1461  129 "33-9092.00" 3.33 2.61 3.18
    1 17265 16223 16284   61  981 "11-9151.00" 4.46 3.85  4.1
    1 17265 16588 16649   61  616 "11-9032.00" 4.15 4.01 4.46
    1 17265 16953 17014   61  251 "11-9032.00" 4.15 4.01 4.46
    1 17265 17014 17106   92  159 "25-2031.00" 3.97 4.07 4.25
    1 17265 17106 17167   61   98 "25-2031.00" 3.97 4.07 4.25
    1 17265 17226 17318   92    0 "25-2022.00" 4.34 4.37 4.47
    2 19395 16315 17410 1095 1985 "25-9041.00" 4.04 3.92 3.62
    2 19395 16680 16953  273 2442 "21-1011.00"  4.1  3.6 4.09
    2 19395 17379 18049  670 1346 "21-1012.00" 4.19 3.96 3.91
    2 19395 18171 18718  547  677 "41-2031.00" 3.95 3.69  3.6
    2 19395 18871 19510  639    0 "25-2021.00" 4.12 4.15 3.78
    3 17225 12631 13666 1035 3559 "25-9021.00" 4.35 4.06 4.15
    3 17225 13727 14031  304 3194 "25-2032.00" 4.31 4.29 4.08
    3 17225 14092 14762  670 2463 "25-2031.00" 3.97 4.07 4.25
    3 17225 14823 15492  669 1733 "11-9032.00" 4.15 4.01 4.46
    3 17225 15553 15857  304 1368 "27-3031.00" 3.88 3.79 3.92
    3 17225 16010 16649  639  576 "35-1012.00" 3.58 3.13  3.2
    3 17225 16680 16953  273  272 "25-9041.00" 4.04 3.92 3.62
    end
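
    For reference, here is a minimal sketch of the normalize-and-collapse step I have in mind, using a purely hypothetical placeholder weight (the right functional form is exactly what I'm asking about; elapsed is in days here, so the /365 is just an arbitrary rescaling):

    Code:
    * Sketch only: the weight below is a placeholder, not a proposal.
    gen wt = length*exp(-elapsed/365)              // hypothetical decaying weight
    by applicationid, sort: egen wtsum = total(wt)
    replace wt = wt/wtsum                          // weights now sum to 1 per person
    collapse (mean) char1 char2 char3 [pweight=wt], by(applicationid)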

  • #2
    I want to give each job a weight that is a function of the time the applicant spent in it (length) and of how long it has been since they left it (elapsed), such that the weights for each person sum to 1.
    1. If the only thing you are going to use these weights for is calculating a weighted average, it doesn't matter what the sum of the weights is. So don't worry about that part.

    2. The rest is too non-specific. What function of the time spent in the previous job and of the time elapsed since then? There are infinitely many possibilities.



    • #3
      Hi Clyde,

      Thanks for the quick response. I'm collapsing these characteristics to use them, along with other variables, to predict performance. (This dataset is a small part of the applicant data; I also have the performance of those applicants who were hired.)

      As for the function, I'm not sure what to use; I couldn't find anything similar in the literature. So, if the sum doesn't need to be 1, can I use a simple weight like length/elapsed?



      • #4
        As for the function, I'm not sure what to use; I couldn't find anything similar in the literature. So, if the sum doesn't need to be 1, can I use a simple weight like length/elapsed?
        Well, you can use pretty much anything you like. The issue is whether the weighting function serves your purposes and properly captures the way in which duration of employment, and elapsed time since then, influence your outcome. length/elapsed is one possibility and might be a reasonable starting point. I can imagine many others, though. I would be a bit wary of having elapsed time in the denominator of the weight: if the applicant is still employed in a previous job, then elapsed is zero and the weight blows up, which seems wrong. It might be better to multiply length by some function of elapsed that is monotonically decreasing and doesn't blow up at zero. For example, length*exp(-elapsed) would probably be better behaved when elapsed is at or near zero. On the other hand, it might decrease too rapidly, so some rescaling of elapsed might be required. Or maybe some entirely different functional form would better represent how length and elapsed influence the outcome you are studying.
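
        To make the comparison concrete, here is a quick sketch using the example data from #1, where elapsed is in days (the /365 rescaling is just one arbitrary illustration):

        Code:
        * Candidate weights discussed above; sketch only.
        gen w1 = length/elapsed           // missing (division by zero) when elapsed == 0
        gen w2 = length*exp(-elapsed)     // behaves at elapsed == 0, but with elapsed in
                                          // days it underflows to ~0 almost everywhere
        gen w3 = length*exp(-elapsed/365) // one arbitrary rescaling of elapsed
        summarize w1 w2 w3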

        If I were in this position myself, I think I would start with some structured graphical exploration of the data to get a sense of it. But I can't think of anything more specific to suggest.



        • #5
          Thanks so much, Clyde!
          Your advice is super helpful as always. I will do some graphical exploration to see what works the best. Have a wonderful day.



          • #6
            Hi again,

            In searching for an appropriate weight, I came up with this idea: to incorporate both length and elapsed time, how about defining a decay function of elapsed time, with the x axis running from 0 to 10 (0 for jobs the applicant still holds, 10 for the maximum time between the last day of a previous job and the application date)? Then I take the definite integral of that function over the range [elapsed, elapsed+length]. The result is the weight for that job, incorporating both how long ago the person held it and for how long they held it.

            The y axis ranges from 0 to 1, and the function is exp(-r*elapsed). So I have one free parameter to estimate: the rate of decay, r.
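
            Written out, with $e$ = elapsed and $\ell$ = length for a given previous job, the proposed weight is

            $$w = \int_{e}^{e+\ell} \exp(-rt)\,dt = \frac{\exp(-re) - \exp(-r(e+\ell))}{r}.$$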

            I was wondering whether there is a way for me to estimate the rate of decay using nonlinear least squares?

            I really appreciate your help. I converted length and elapsed to years to make the following data clearer:


            Code:
            clear
            input long applicationid float(applicationdate ndatefrom ndateto length elapsed) str10 onet float(char1 char2 char3)
            1 17265 14762 14823  .1671233  6.690411 "39-1021.00" 4.05 3.94 3.89
            1 17265 15127 15188  .1671233  5.690411 "39-1021.00" 4.05 3.94 3.89
            1 17265 15492 15553  .1671233  4.690411 "39-1021.00" 4.05 3.94 3.89
            1 17265 15584 17136 4.2520547  .3534247 "15-1151.00" 3.79 3.34 4.27
            1 17265 15675 17136   4.00274  .3534247 "33-9092.00" 3.33 2.61 3.18
            1 17265 16223 16284  .1671233  2.687671 "11-9151.00" 4.46 3.85  4.1
            1 17265 16588 16649  .1671233 1.6876712 "11-9032.00" 4.15 4.01 4.46
            1 17265 16953 17014  .1671233  .6876712 "11-9032.00" 4.15 4.01 4.46
            1 17265 17014 17106 .25205478  .4356164 "25-2031.00" 3.97 4.07 4.25
            1 17265 17106 17167  .1671233 .26849315 "25-2031.00" 3.97 4.07 4.25
            1 17265 17226 17318 .25205478         0 "25-2022.00" 4.34 4.37 4.47
            2 19395 16315 17410         3  5.438356 "25-9041.00" 4.04 3.92 3.62
            2 19395 16680 16953  .7479452  6.690411 "21-1011.00"  4.1  3.6 4.09
            2 19395 17379 18049 1.8356165  3.687671 "21-1012.00" 4.19 3.96 3.91
            2 19395 18171 18718   1.49863 1.8547945 "41-2031.00" 3.95 3.69  3.6
            2 19395 18871 19510  1.750685         0 "25-2021.00" 4.12 4.15 3.78
            3 17225 12631 13666  2.835616  9.750685 "25-9021.00" 4.35 4.06 4.15
            3 17225 13727 14031  .8328767  8.750685 "25-2032.00" 4.31 4.29 4.08
            3 17225 14092 14762 1.8356165  6.747945 "25-2031.00" 3.97 4.07 4.25
            3 17225 14823 15492 1.8328767 4.7479453 "11-9032.00" 4.15 4.01 4.46
            3 17225 15553 15857  .8328767  3.747945 "27-3031.00" 3.88 3.79 3.92
            3 17225 16010 16649  1.750685 1.5780822 "35-1012.00" 3.58 3.13  3.2
            3 17225 16680 16953  .7479452  .7452055 "25-9041.00" 4.04 3.92 3.62
            end



            • #7
              I don't understand what you mean when you say you would like to estimate the rate of decay parameter, r.

              Let's suppose for a moment that you actually had the "right" value of r and applied it. Now you use that and your exponential function to calculate a weight for each observation, and then you use those weights to calculate the weighted averages of the O*NET (whatever that is!) characteristics of the person. It's not stated in the thread so far how you will then use those weighted averages.

              If those weighted averages are to be used as predictors of some outcome variable for which you have data, then, in principle, it might be possible to find a value of r that maximizes the fit between predicted and observed outcomes by some criterion (whether least squares, maximum likelihood, or something else). It would be complicated, but conceivable. But you haven't told us about any such outcomes; all we know so far is that you want to compute some weighted averages, and that the weights should be decreasing in elapsed time and increasing in tenure. Based only on that information, I see no way to estimate a value of r.

              So you need to explain where this all leads.



              • #8
                Hi Clyde,

                Yes, you're right: what you described is exactly what I want to do! I want to compute the weights, multiply the previous job characteristics by them, and collapse so that I have one row per applicant. Then I will use the weighted averages of the job characteristics to predict job performance (I've included performance in the following data).

                I really appreciate your help with it. Thanks

                Code:
                clear
                input long applicationid float(applicationdate ndatefrom ndateto length elapsed) str10 onet float(char1 char2 char3 performance)
                1 17265 14762 14823  .1671233  6.690411 "39-1021.00" 4.05 3.94 3.89 1
                1 17265 15127 15188  .1671233  5.690411 "39-1021.00" 4.05 3.94 3.89 2
                1 17265 15492 15553  .1671233  4.690411 "39-1021.00" 4.05 3.94 3.89 3
                1 17265 15584 17136 4.2520547  .3534247 "15-1151.00" 3.79 3.34 4.27 4
                1 17265 15675 17136   4.00274  .3534247 "33-9092.00" 3.33 2.61 3.18 2
                1 17265 16223 16284  .1671233  2.687671 "11-9151.00" 4.46 3.85  4.1 1
                1 17265 16588 16649  .1671233 1.6876712 "11-9032.00" 4.15 4.01 4.46 1
                1 17265 16953 17014  .1671233  .6876712 "11-9032.00" 4.15 4.01 4.46 1
                1 17265 17014 17106 .25205478  .4356164 "25-2031.00" 3.97 4.07 4.25 2
                1 17265 17106 17167  .1671233 .26849315 "25-2031.00" 3.97 4.07 4.25 5
                1 17265 17226 17318 .25205478         0 "25-2022.00" 4.34 4.37 4.47 4
                2 19395 16315 17410         3  5.438356 "25-9041.00" 4.04 3.92 3.62 5
                2 19395 16680 16953  .7479452  6.690411 "21-1011.00"  4.1  3.6 4.09 5
                2 19395 17379 18049 1.8356165  3.687671 "21-1012.00" 4.19 3.96 3.91 5
                2 19395 18171 18718   1.49863 1.8547945 "41-2031.00" 3.95 3.69  3.6 2
                2 19395 18871 19510  1.750685         0 "25-2021.00" 4.12 4.15 3.78 3
                3 17225 12631 13666  2.835616  9.750685 "25-9021.00" 4.35 4.06 4.15 1
                3 17225 13727 14031  .8328767  8.750685 "25-2032.00" 4.31 4.29 4.08 4
                3 17225 14092 14762 1.8356165  6.747945 "25-2031.00" 3.97 4.07 4.25 1
                3 17225 14823 15492 1.8328767 4.7479453 "11-9032.00" 4.15 4.01 4.46 2
                3 17225 15553 15857  .8328767  3.747945 "27-3031.00" 3.88 3.79 3.92 3
                3 17225 16010 16649  1.750685 1.5780822 "35-1012.00" 3.58 3.13  3.2 4
                3 17225 16680 16953  .7479452  .7452055 "25-9041.00" 4.04 3.92 3.62 5
                end



                • #9
                  There's something I don't understand here. You want to take a weighted average over all observations for a given applicationid, and to identify weights (along the lines discussed earlier) such that the weighted average optimally predicts performance. But in these data, performance varies from observation to observation within the same applicationid. So I don't follow the logic: you would be calculating one (weighted) average assessment of the individual, to be applied at all times, but using it to predict something that varies over time. That doesn't make sense to me.

                  Fair warning: even assuming you come up with a response that clarifies this, now that I see the full extent of the problem, I probably will not be able to provide a complete solution. I think in the end this would require writing an estimation program using Stata's -ml- machinery. I have done that only on a small number of occasions, and not at all in the past decade; I'm not familiar with the modern syntax and have long forgotten the steps involved, so I won't attempt to go down that road. What I probably could do is give you code that evaluates some measure of fit between observed and predicted performance (once you clarify my question in the first paragraph) as a function of the parameter r, at selected values of r; one could then graph those results and visually identify an approximate best value of r. That might or might not be good enough for your purposes: given how loosely constrained your problem is, I suspect it would be. But that's up to you. Of course, it is also possible that somebody else on the Forum who knows maximization well and uses it regularly will step in and finish the job.



                  • #10
                    So sorry! The performance part was a dumb mistake: I made up some numbers and didn't notice that they should be the same for each applicant. I'm attaching the corrected data. I would really appreciate it if you shared the code you mentioned. Thanks again.

                    Code:
                    clear
                    input long applicationid float(applicationdate ndatefrom ndateto length elapsed) str10 onet float(char1 char2 char3 performance perf)
                    1 17265 14762 14823  .1671233  6.690411 "39-1021.00" 4.05 3.94 3.89 1 1
                    1 17265 15127 15188  .1671233  5.690411 "39-1021.00" 4.05 3.94 3.89 2 1
                    1 17265 15492 15553  .1671233  4.690411 "39-1021.00" 4.05 3.94 3.89 3 1
                    1 17265 15584 17136 4.2520547  .3534247 "15-1151.00" 3.79 3.34 4.27 4 1
                    1 17265 15675 17136   4.00274  .3534247 "33-9092.00" 3.33 2.61 3.18 2 1
                    1 17265 16223 16284  .1671233  2.687671 "11-9151.00" 4.46 3.85  4.1 1 1
                    1 17265 16588 16649  .1671233 1.6876712 "11-9032.00" 4.15 4.01 4.46 1 1
                    1 17265 16953 17014  .1671233  .6876712 "11-9032.00" 4.15 4.01 4.46 1 1
                    1 17265 17014 17106 .25205478  .4356164 "25-2031.00" 3.97 4.07 4.25 2 1
                    1 17265 17106 17167  .1671233 .26849315 "25-2031.00" 3.97 4.07 4.25 5 1
                    1 17265 17226 17318 .25205478         0 "25-2022.00" 4.34 4.37 4.47 4 1
                    2 19395 16315 17410         3  5.438356 "25-9041.00" 4.04 3.92 3.62 5 2
                    2 19395 16680 16953  .7479452  6.690411 "21-1011.00"  4.1  3.6 4.09 5 2
                    2 19395 17379 18049 1.8356165  3.687671 "21-1012.00" 4.19 3.96 3.91 5 2
                    2 19395 18171 18718   1.49863 1.8547945 "41-2031.00" 3.95 3.69  3.6 2 2
                    2 19395 18871 19510  1.750685         0 "25-2021.00" 4.12 4.15 3.78 3 2
                    3 17225 12631 13666  2.835616  9.750685 "25-9021.00" 4.35 4.06 4.15 1 3
                    3 17225 13727 14031  .8328767  8.750685 "25-2032.00" 4.31 4.29 4.08 4 3
                    3 17225 14092 14762 1.8356165  6.747945 "25-2031.00" 3.97 4.07 4.25 1 3
                    3 17225 14823 15492 1.8328767 4.7479453 "11-9032.00" 4.15 4.01 4.46 2 3
                    3 17225 15553 15857  .8328767  3.747945 "27-3031.00" 3.88 3.79 3.92 3 3
                    3 17225 16010 16649  1.750685 1.5780822 "35-1012.00" 3.58 3.13  3.2 4 3
                    3 17225 16680 16953  .7479452  .7452055 "25-9041.00" 4.04 3.92 3.62 5 3
                    end



                    • #11
                      So it would look something like this:

                      Code:
                      set more off
                      clear *
                      
                      capture program drop r_assess
                      program define r_assess, rclass sortpreserve
                          args r
                          assert `r' > 0
                          tempvar wt wtsum
                          // CALCULATE WEIGHTS: integral of exp(-r*t) over [elapsed, elapsed+length]
                          gen `wt' = (exp(-`r'*elapsed) - exp(-`r'*(elapsed+length)))/`r'
                          //    GET TOTAL WEIGHT FOR EACH APPLICANT
                          by applicationid, sort: egen `wtsum' = total(`wt')
                          //    CALCULATE WEIGHTED MEAN FOR EACH CHARACTERISTIC
                          foreach v of varlist /*char1*/ char2 /*char3*/ {
                              tempvar `v'
                              by applicationid, sort: egen ``v'' = total(`wt'*`v')  // WEIGHTED SUM
                              replace ``v'' = ``v''/`wtsum' // WEIGHTED MEAN = WEIGHTED SUM / TOTAL WEIGHT
                          }
                          regress performance /*`char1'*/ `char2' /*`char3'*/ if flag
                          return scalar rsq = e(r2)
                          exit
                      end
                      
                          
                      
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long applicationid float(applicationdate ndatefrom ndateto length elapsed) str10 onet float(char1 char2 char3 performance)
                      1 17265 14762 14823  .1671233  6.690411 "39-1021.00" 4.05 3.94 3.89 1
                      1 17265 15127 15188  .1671233  5.690411 "39-1021.00" 4.05 3.94 3.89 1
                      1 17265 15492 15553  .1671233  4.690411 "39-1021.00" 4.05 3.94 3.89 1
                      1 17265 15584 17136 4.2520547  .3534247 "15-1151.00" 3.79 3.34 4.27 1
                      1 17265 15675 17136   4.00274  .3534247 "33-9092.00" 3.33 2.61 3.18 1
                      1 17265 16223 16284  .1671233  2.687671 "11-9151.00" 4.46 3.85  4.1 1
                      1 17265 16588 16649  .1671233 1.6876712 "11-9032.00" 4.15 4.01 4.46 1
                      1 17265 16953 17014  .1671233  .6876712 "11-9032.00" 4.15 4.01 4.46 1
                      1 17265 17014 17106 .25205478  .4356164 "25-2031.00" 3.97 4.07 4.25 1
                      1 17265 17106 17167  .1671233 .26849315 "25-2031.00" 3.97 4.07 4.25 1
                      1 17265 17226 17318 .25205478         0 "25-2022.00" 4.34 4.37 4.47 1
                      2 19395 16315 17410         3  5.438356 "25-9041.00" 4.04 3.92 3.62 2
                      2 19395 16680 16953  .7479452  6.690411 "21-1011.00"  4.1  3.6 4.09 2
                      2 19395 17379 18049 1.8356165  3.687671 "21-1012.00" 4.19 3.96 3.91 2
                      2 19395 18171 18718   1.49863 1.8547945 "41-2031.00" 3.95 3.69  3.6 2
                      2 19395 18871 19510  1.750685         0 "25-2021.00" 4.12 4.15 3.78 2
                      3 17225 12631 13666  2.835616  9.750685 "25-9021.00" 4.35 4.06 4.15 3
                      3 17225 13727 14031  .8328767  8.750685 "25-2032.00" 4.31 4.29 4.08 3
                      3 17225 14092 14762 1.8356165  6.747945 "25-2031.00" 3.97 4.07 4.25 3
                      3 17225 14823 15492 1.8328767 4.7479453 "11-9032.00" 4.15 4.01 4.46 3
                      3 17225 15553 15857  .8328767  3.747945 "27-3031.00" 3.88 3.79 3.92 3
                      3 17225 16010 16649  1.750685 1.5780822 "35-1012.00" 3.58 3.13  3.2 3
                      3 17225 16680 16953  .7479452  .7452055 "25-9041.00" 4.04 3.92 3.62 3
                      end
                      
                      //    MARK A SINGLE RECORD FOR EACH APPLICATION ID
                      egen flag = tag(applicationid)
                      
                      //    VERIFY PERFORMANCE CONSISTENT WITHIN APPLICATION ID
                      by applicationid (performance), sort: assert performance[1] == performance[_N]
                      
                      //    SET UP A POSTFILE TO CAPTURE THE RESULTS
                      tempfile holding
                      capture postutil clear
                      postfile handle float r float rsq using `holding'
                      
                      //    CALCULATE R^2 AS A FUNCTION OF r OVER A RANGE OF r
                      //    SAVING THE RESULTS IN THE POSTFILE
                      foreach r of numlist 0.125(0.125)10 {
                          r_assess `r'
                          return list
                          post handle (`r') (r(rsq))
                      }
                      
                      postclose handle
                      
                      //    READ IN THE POSTFILE, LIST, AND GRAPH
                      use `holding', clear
                      list, noobs clean
                      graph twoway line rsq r, sort
                      Notes:

                      1. The program -r_assess- takes a single argument, the test value of the parameter r, calculates the weighted averages of char1, char2, and char3 for each applicationid using r in the formula you proposed, and then regresses the performance measure on those weighted averages.

                      2. Well, sort of. In your example data there are only three applicationids, so it isn't possible to run a regression with 3 predictors. In the code above I have "commented out" char1 and char3 so you can see how the program works; it just regresses on char2. In your real data you presumably have many more applicationids than predictors, so you won't need any of that /* */ business.

                      3. I modified your data input: I replaced the values of the variable performance with those of your new variable perf and dropped perf. It was just confusing me to have two closely named versions of the same variable.

                      4. If you run this, the graph suggests that R2 peaks with r approximately equal to 1.5. You could localize r more precisely with a finer-grained run over a narrow range of r, for example -foreach r of numlist 1(0.03125)2- (when I tried that, it suggested a peak near r = 1.3). This is just by way of illustration; with your actual variables and data, the results will probably differ. A sketch of such a refinement pass follows.
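
                      This assumes the applicant data (not the results file) is back in memory; the names handle2 and holding2 are arbitrary:

                      Code:
                      tempfile holding2
                      postfile handle2 float r float rsq using `holding2'
                      foreach r of numlist 1(0.03125)2 {
                          r_assess `r'
                          post handle2 (`r') (r(rsq))
                      }
                      postclose handle2
                      use `holding2', clear
                      list, noobs clean
                      graph twoway line rsq r, sort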

                      Added: 5. Note that I do not -collapse- the data here, because I am doing this repeatedly for different values of r, and each -collapse- would have to be preceded by a -preserve- and followed by a -restore-. While the time required wouldn't be noticeable in this small example data set, in a data set of appreciable size it would lead to massively degraded performance as Stata spends its time thrashing the hard drive. Instead, I identify a single observation for each applicationid with an indicator variable, flag, and use only those observations in the regressions. The effect is the same, but the performance should be (much) better.
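
                      Once a value of r is settled on, the final data reduction is a one-time -collapse- along these lines (sketch only; the r = 1.5 below is a placeholder for whatever value your search selects):

                      Code:
                      local r = 1.5   // placeholder value chosen from the graph
                      gen wt = (exp(-`r'*elapsed) - exp(-`r'*(elapsed+length)))/`r'
                      collapse (mean) char1 char2 char3 performance [pweight=wt], by(applicationid)
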
                      Last edited by Clyde Schechter; 12 Jan 2017, 17:53.



                      • #12
                        Thanks very much! I follow the logic, and it makes perfect sense. But I don't know why I receive the following error when I run the code, even when I copy-paste your whole code and data:
                        Code:
                        / is not a valid command name
                        Do you know why it happens?

                        Thanks



                        • #13
                          I don't know what to say about that. I just copied the code from #11 back into my Stata do-editor and it ran without any errors and produced the same results I got before.

                          It's also strange to see that message, because a command name is something Stata looks for at the beginning of a line, and no line in the code begins with a single / (the comment lines begin with //, which is fine). Did you perhaps make a mousing error when copying one of the lines starting with // and get only one of the / characters into your copy?

                          Which line of the code produces the error message? You might try hand-retyping that line in your do-file (being very careful to reproduce it exactly as it appears in #11; watch all the quotes, commas, etc.) and see if that solves it. I have sometimes found that copying code from certain web sites into Stata's do-file editor produces weird error messages like this (it has happened to me with this forum at times). I think some web sites embed non-printing control characters in the page; Stata recognizes them as a problem, but they don't display in the error message.

                          Another possible fix is to paste the code first into a plain text editor (not a word processor; something like Windows Notepad) and then copy it from there into the do-file editor. That sometimes "launders" out the non-printing characters that cause problems.

                          Finally, if all else fails, my do-file is attached here. (Attachments are discouraged on this Forum, for good reasons, but I think this is one circumstance where it is warranted.)

                          weighted_analysis.do



                          • #14
                            Clyde, I can't thank you enough. I ran your do-file and it worked perfectly. I truly appreciate that you took so much time to help me. It's so kind of you. Have a great night.
