Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Lookup - Similar coding (help)

    Hello everyone

    I'm not so sure that I'll describe the issue perfectly, but here it comes


    I'm having some troubles while trying to code something wich i think is similar to the "lookup" function in excel.


    The data panel below is quiet similar to the one i'm working on:
    cod_univ year aprov_4ef reg_4ef UF p rel_p goal_aprov
    1 2014 73.2 278 11 0 .25
    2 2014 75.6 129 11 .02 .288912
    3 2014 77.1 520 11 .039 .3239062
    4 2014 77.4 411 11 .059 .3587353
    5 2014 77.7 257 11 .078 .3899808
    6 2014 80.7 405 11 .098 .4210008
    7 2014 81 320 11 .118 .4501736
    8 2014 81.5 282 11 .137 .4762423
    9 2014 82.6 704 11 .157 .5020201
    10 2014 83 321 11 .176 .5249937
    11 2014 83.3 170 11 .196 .5476503
    12 2014 83.9 800 11 .216 .5688125
    13 2014 84 763 11 .235 .5875981
    14 2014 85.1 407 11 .275 .6232266
    15 2014 85.1 1525 11 .255 .6060533
    16 2014 85.4 199 11 .294 .6384193
    17 2014 85.8 1983 11 .314 .6532997

    'goal_aprov' is where i'm trying to use some egen argument in order to return the value I want. I need it to return me in 'rel_p' row, the nearst value of 'aprov_4ef' wich matches with the variable 'p'.

    This is what the first row should look like after the code i'm looking for:
    cod_univ year aprov_4ef reg_4ef UF p rel_p goal_aprov
    1 2014 73.2 278 11 0 .25 85.1

    The reason it happend is because 'rel_p' is pretty near to 'p' in the 15th row, so it returned the aprov_4ef value of the same row.


    What should I do to complete 'goal_aprov'?

    I don't know if it helps, but this is the equation of rel_p:

    rel_p = (3/2)*p^3 - (11/4)*p^2 + 2p + (1/4)



  • #2
    I think the following will do what you want:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte cod_univ int year float aprov_4ef int reg_4ef byte uf float(p rel_p)
     1 2014 73.2  278 11    0      .25
     2 2014 75.6  129 11  .02  .288912
     3 2014 77.1  520 11 .039 .3239062
     4 2014 77.4  411 11 .059 .3587353
     5 2014 77.7  257 11 .078 .3899808
     6 2014 80.7  405 11 .098 .4210008
     7 2014   81  320 11 .118 .4501736
     8 2014 81.5  282 11 .137 .4762423
     9 2014 82.6  704 11 .157 .5020201
    10 2014   83  321 11 .176 .5249937
    11 2014 83.3  170 11 .196 .5476503
    12 2014 83.9  800 11 .216 .5688125
    13 2014   84  763 11 .235 .5875981
    14 2014 85.1  407 11 .275 .6232266
    15 2014 85.1 1525 11 .255 .6060533
    16 2014 85.4  199 11 .294 .6384193
    17 2014 85.8 1983 11 .314 .6532997
    end
    
    isid cod_univ
    preserve
    
    //    CREATE ANOTHER FILE WITH JUST P AND APPROV_4EF
    keep cod_univ p aprov_4ef
    duplicates drop
    rename p p_match
    rename aprov_4ef goal_aprov
    rename cod_univ cod_univ_match
    tempfile matches
    save `matches'
    
    restore
    
    //    PAIR EACH OBSERVATION OF ORIGINAL DATA WITH
    //    ALL OBSERVATIONS OF THE MATCHES FILE
    cross using `matches'
    
    //    FOR EACH COD_UNIV, SELECT THE MATCH THAT HAS
    //    THE VALUE OF P_MATCH NEAREST TO REL_P.  IF THERE ARE
    //    TIES, BREAK THEM ARBITRARILY AND IRREPRODUCIBLY
    gen delta = abs(rel_p-p_match)
    by cod_univ (delta), sort: keep if _n == 1
    Notes: Your problem is partly ill-posed. You do not say what to do if there are multiple observations whose value of p are equally close to an observation of rel_p. The code above breaks those ties arbitrarily and irreproducibly--that is, if you run the code repeatedly, it will break the ties differently. It would be better to decide upon a rule for breaking ties and then implement that in the code. Bear in mind that -cross- pairs every observation in the data in memory with every observation in the -using- data set. If your original data set is very large, the result may be too large to fit in memory (and it will also take Stata a long time to create it). If you face that issue, post back for a possible workaround.

    The code assumes the cod_univ uniquely identifies observations in the data. If that is not the case in your real data, create a unique identifier by running -gen long obs_num = _n-, and replace cod_univ by obs_num throughout in the code.

    In the future, please use the -dataex- command to post example data, as I have done in this response. HTML tables such as the one you posted are often difficult to import to Stata, and even when importable, often leave key information about the data such as storage types, labels, etc., unspecified. Run -ssc install dataex- to get the -dataex- command, and read -help dataex- to see the simple instructions for using it. Using -dataex- enables those who want to help you to instantly create a completely faithful replica of your Stata example with a simple copy/paste operation.

    Comment


    • #3
      If I understand this correctly, a lookup strategy is almost doomed here. Most of the rel_p values have no near values of p to use as analog[ues] (or even to do something systematic like interpolate in some way.

      Your data example by the way does work as something that can be copied but using dataex (SSC) as we request would have been better.

      I did this.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte cod_univ int year float aprov_4ef int reg_4ef byte uf float(p rel_p) byte goal_aprov
       1 2014 73.2  278 11    0      .25 .
       2 2014 75.6  129 11  .02  .288912 .
       3 2014 77.1  520 11 .039 .3239062 .
       4 2014 77.4  411 11 .059 .3587353 .
       5 2014 77.7  257 11 .078 .3899808 .
       6 2014 80.7  405 11 .098 .4210008 .
       7 2014   81  320 11 .118 .4501736 .
       8 2014 81.5  282 11 .137 .4762423 .
       9 2014 82.6  704 11 .157 .5020201 .
      10 2014   83  321 11 .176 .5249937 .
      11 2014 83.3  170 11 .196 .5476503 .
      12 2014 83.9  800 11 .216 .5688125 .
      13 2014   84  763 11 .235 .5875981 .
      14 2014 85.1  407 11 .275 .6232266 .
      15 2014 85.1 1525 11 .255 .6060533 .
      16 2014 85.4  199 11 .294 .6384193 .
      17 2014 85.8 1983 11 .314 .6532997 .
      end
      
      . stack cod_univ year aprov_4ef  p  cod_univ year goal_aprov rel_p , into(cod_univ year y x) clear
      
      . sort x
      
      . list
      
           +--------------------------------------------+
           | _stack   cod_univ   year      y          x |
           |--------------------------------------------|
        1. |      1          1   2014   73.2          0 |
        2. |      1          2   2014   75.6        .02 |
        3. |      1          3   2014   77.1       .039 |
        4. |      1          4   2014   77.4       .059 |
        5. |      1          5   2014   77.7       .078 |
           |--------------------------------------------|
        6. |      1          6   2014   80.7       .098 |
        7. |      1          7   2014     81       .118 |
        8. |      1          8   2014   81.5       .137 |
        9. |      1          9   2014   82.6       .157 |
       10. |      1         10   2014     83       .176 |
           |--------------------------------------------|
       11. |      1         11   2014   83.3       .196 |
       12. |      1         12   2014   83.9       .216 |
       13. |      1         13   2014     84       .235 |
       14. |      2          1   2014      .        .25 |
       15. |      1         15   2014   85.1       .255 |
           |--------------------------------------------|
       16. |      1         14   2014   85.1       .275 |
       17. |      2          2   2014      .    .288912 |
       18. |      1         16   2014   85.4       .294 |
       19. |      1         17   2014   85.8       .314 |
       20. |      2          3   2014      .   .3239062 |
           |--------------------------------------------|
       21. |      2          4   2014      .   .3587353 |
       22. |      2          5   2014      .   .3899808 |
       23. |      2          6   2014      .   .4210008 |
       24. |      2          7   2014      .   .4501736 |
       25. |      2          8   2014      .   .4762423 |
           |--------------------------------------------|
       26. |      2          9   2014      .   .5020201 |
       27. |      2         10   2014      .   .5249937 |
       28. |      2         11   2014      .   .5476503 |
       29. |      2         12   2014      .   .5688125 |
       30. |      2         13   2014      .   .5875981 |
           |--------------------------------------------|
       31. |      2         15   2014      .   .6060533 |
       32. |      2         14   2014      .   .6232266 |
       33. |      2         16   2014      .   .6384193 |
       34. |      2         17   2014      .   .6532997 |
           +--------------------------------------------+
      .

      So, the example you choose is plausible as .25 is near to .255 but most of the unknowns would require some other method. Perhaps your real data fill in the gaps. If so, then I'd recommend interpolation followed by merging back with the original data.

      Comment


      • #4
        I do not use Excel and do not know what the "lookup" function does exactly. With that caveat, here's an approach that uses rangerun (from SSC). I assumed that you want to do this within groups of observations from the same year. With rangerun, you loop over each observation in the data. Each time, the data in memory is replaced with the observations that are in range, that is within the specified interval bounds. When you specify a prefix, scalars are created with the value of each variable for the current observation. Then the specified program (doit in this example) is run.

        The program first creates a variable with the absolute difference between the value of ret_p for the current observation with the value of p for all observations in memory. Then it sorts observations in decreasing order of diff breaking ties using the highest value of aprov_4ef. With rangerun, the results for the current observation come from the last observation in memory when the user-defined program terminates. So it's sufficient to create goal_aprov as a copy of aprov_4ef.

        Code:
        clear all
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte cod_univ int year float aprov_4ef int reg_4ef byte uf float(p rel_p)
         1 2014 73.2  278 11    0      .25
         2 2014 75.6  129 11  .02  .288912
         3 2014 77.1  520 11 .039 .3239062
         4 2014 77.4  411 11 .059 .3587353
         5 2014 77.7  257 11 .078 .3899808
         6 2014 80.7  405 11 .098 .4210008
         7 2014   81  320 11 .118 .4501736
         8 2014 81.5  282 11 .137 .4762423
         9 2014 82.6  704 11 .157 .5020201
        10 2014   83  321 11 .176 .5249937
        11 2014 83.3  170 11 .196 .5476503
        12 2014 83.9  800 11 .216 .5688125
        13 2014   84  763 11 .235 .5875981
        14 2014 85.1  407 11 .275 .6232266
        15 2014 85.1 1525 11 .255 .6060533
        16 2014 85.4  199 11 .294 .6384193
        17 2014 85.8 1983 11 .314 .6532997
        end
        
        program doit
            gen diff = abs(s_rel_p - p)
            gsort -diff aprov_4ef
            gen goal_aprov = aprov_4ef 
        end
        
        rangerun doit, interval(year 0 0) use(aprov_4ef p rel_p) sprefix(s_)
        list

        Comment


        • #5
          Just one comment on Robert Picard's solution. The use of -interval(year 0 0)- will limit the potential matches to observations having the same year. That is not a stated constraint in the original problem, and may or may not be wanted. Moreover, it is hard to infer whether such a constraint is wanted from the example given, because the example data has year == 2014 throughout.

          Comment

          Working...
          X