Lookup - Similar coding (help)

Daniel Galveas

Join Date: Jul 2017
Posts: 1

Lookup - Similar coding (help)

18 Jul 2017, 09:09

Hello everyone

I'm not so sure that I'll describe the issue perfectly, but here it comes

I'm having some troubles while trying to code something wich i think is similar to the "lookup" function in excel.

The data panel below is quiet similar to the one i'm working on:

cod_univ	year	aprov_4ef	reg_4ef	UF	p	rel_p	goal_aprov
1	2014	73.2	278	11	0	.25
2	2014	75.6	129	11	.02	.288912
3	2014	77.1	520	11	.039	.3239062
4	2014	77.4	411	11	.059	.3587353
5	2014	77.7	257	11	.078	.3899808
6	2014	80.7	405	11	.098	.4210008
7	2014	81	320	11	.118	.4501736
8	2014	81.5	282	11	.137	.4762423
9	2014	82.6	704	11	.157	.5020201
10	2014	83	321	11	.176	.5249937
11	2014	83.3	170	11	.196	.5476503
12	2014	83.9	800	11	.216	.5688125
13	2014	84	763	11	.235	.5875981
14	2014	85.1	407	11	.275	.6232266
15	2014	85.1	1525	11	.255	.6060533
16	2014	85.4	199	11	.294	.6384193
17	2014	85.8	1983	11	.314	.6532997

'goal_aprov' is where i'm trying to use some egen argument in order to return the value I want. I need it to return me in 'rel_p' row, the nearst value of 'aprov_4ef' wich matches with the variable 'p'.

This is what the first row should look like after the code i'm looking for:

cod_univ	year	aprov_4ef	reg_4ef	UF	p	rel_p	goal_aprov
1	2014	73.2	278	11	0	.25	85.1

The reason it happend is because 'rel_p' is pretty near to 'p' in the 15th row, so it returned the aprov_4ef value of the same row.

What should I do to complete 'goal_aprov'?

I don't know if it helps, but this is the equation of rel_p:

rel_p = (3/2)*p^3 - (11/4)*p^2 + 2p + (1/4)

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

18 Jul 2017, 09:33

I think the following will do what you want:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte cod_univ int year float aprov_4ef int reg_4ef byte uf float(p rel_p) 1 2014 73.2 278 11 0 .25 2 2014 75.6 129 11 .02 .288912 3 2014 77.1 520 11 .039 .3239062 4 2014 77.4 411 11 .059 .3587353 5 2014 77.7 257 11 .078 .3899808 6 2014 80.7 405 11 .098 .4210008 7 2014 81 320 11 .118 .4501736 8 2014 81.5 282 11 .137 .4762423 9 2014 82.6 704 11 .157 .5020201 10 2014 83 321 11 .176 .5249937 11 2014 83.3 170 11 .196 .5476503 12 2014 83.9 800 11 .216 .5688125 13 2014 84 763 11 .235 .5875981 14 2014 85.1 407 11 .275 .6232266 15 2014 85.1 1525 11 .255 .6060533 16 2014 85.4 199 11 .294 .6384193 17 2014 85.8 1983 11 .314 .6532997 end isid cod_univ preserve // CREATE ANOTHER FILE WITH JUST P AND APPROV_4EF keep cod_univ p aprov_4ef duplicates drop rename p p_match rename aprov_4ef goal_aprov rename cod_univ cod_univ_match tempfile matches save `matches' restore // PAIR EACH OBSERVATION OF ORIGINAL DATA WITH // ALL OBSERVATIONS OF THE MATCHES FILE cross using `matches' // FOR EACH COD_UNIV, SELECT THE MATCH THAT HAS // THE VALUE OF P_MATCH NEAREST TO REL_P. IF THERE ARE // TIES, BREAK THEM ARBITRARILY AND IRREPRODUCIBLY gen delta = abs(rel_p-p_match) by cod_univ (delta), sort: keep if _n == 1

Notes: Your problem is partly ill-posed. You do not say what to do if there are multiple observations whose value of p are equally close to an observation of rel_p. The code above breaks those ties arbitrarily and irreproducibly--that is, if you run the code repeatedly, it will break the ties differently. It would be better to decide upon a rule for breaking ties and then implement that in the code. Bear in mind that -cross- pairs every observation in the data in memory with every observation in the -using- data set. If your original data set is very large, the result may be too large to fit in memory (and it will also take Stata a long time to create it). If you face that issue, post back for a possible workaround.

The code assumes the cod_univ uniquely identifies observations in the data. If that is not the case in your real data, create a unique identifier by running -gen long obs_num = _n-, and replace cod_univ by obs_num throughout in the code.

In the future, please use the -dataex- command to post example data, as I have done in this response. HTML tables such as the one you posted are often difficult to import to Stata, and even when importable, often leave key information about the data such as storage types, labels, etc., unspecified. Run -ssc install dataex- to get the -dataex- command, and read -help dataex- to see the simple instructions for using it. Using -dataex- enables those who want to help you to instantly create a completely faithful replica of your Stata example with a simple copy/paste operation.
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35720

18 Jul 2017, 09:56

If I understand this correctly, a lookup strategy is almost doomed here. Most of the rel_p values have no near values of p to use as analog[ues] (or even to do something systematic like interpolate in some way.

Your data example by the way does work as something that can be copied but using dataex (SSC) as we request would have been better.

I did this.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte cod_univ int year float aprov_4ef int reg_4ef byte uf float(p rel_p) byte goal_aprov
 1 2014 73.2  278 11    0      .25 .
 2 2014 75.6  129 11  .02  .288912 .
 3 2014 77.1  520 11 .039 .3239062 .
 4 2014 77.4  411 11 .059 .3587353 .
 5 2014 77.7  257 11 .078 .3899808 .
 6 2014 80.7  405 11 .098 .4210008 .
 7 2014   81  320 11 .118 .4501736 .
 8 2014 81.5  282 11 .137 .4762423 .
 9 2014 82.6  704 11 .157 .5020201 .
10 2014   83  321 11 .176 .5249937 .
11 2014 83.3  170 11 .196 .5476503 .
12 2014 83.9  800 11 .216 .5688125 .
13 2014   84  763 11 .235 .5875981 .
14 2014 85.1  407 11 .275 .6232266 .
15 2014 85.1 1525 11 .255 .6060533 .
16 2014 85.4  199 11 .294 .6384193 .
17 2014 85.8 1983 11 .314 .6532997 .
end

. stack cod_univ year aprov_4ef  p  cod_univ year goal_aprov rel_p , into(cod_univ year y x) clear

. sort x

. list

     +--------------------------------------------+
     | _stack   cod_univ   year      y          x |
     |--------------------------------------------|
  1. |      1          1   2014   73.2          0 |
  2. |      1          2   2014   75.6        .02 |
  3. |      1          3   2014   77.1       .039 |
  4. |      1          4   2014   77.4       .059 |
  5. |      1          5   2014   77.7       .078 |
     |--------------------------------------------|
  6. |      1          6   2014   80.7       .098 |
  7. |      1          7   2014     81       .118 |
  8. |      1          8   2014   81.5       .137 |
  9. |      1          9   2014   82.6       .157 |
 10. |      1         10   2014     83       .176 |
     |--------------------------------------------|
 11. |      1         11   2014   83.3       .196 |
 12. |      1         12   2014   83.9       .216 |
 13. |      1         13   2014     84       .235 |
 14. |      2          1   2014      .        .25 |
 15. |      1         15   2014   85.1       .255 |
     |--------------------------------------------|
 16. |      1         14   2014   85.1       .275 |
 17. |      2          2   2014      .    .288912 |
 18. |      1         16   2014   85.4       .294 |
 19. |      1         17   2014   85.8       .314 |
 20. |      2          3   2014      .   .3239062 |
     |--------------------------------------------|
 21. |      2          4   2014      .   .3587353 |
 22. |      2          5   2014      .   .3899808 |
 23. |      2          6   2014      .   .4210008 |
 24. |      2          7   2014      .   .4501736 |
 25. |      2          8   2014      .   .4762423 |
     |--------------------------------------------|
 26. |      2          9   2014      .   .5020201 |
 27. |      2         10   2014      .   .5249937 |
 28. |      2         11   2014      .   .5476503 |
 29. |      2         12   2014      .   .5688125 |
 30. |      2         13   2014      .   .5875981 |
     |--------------------------------------------|
 31. |      2         15   2014      .   .6060533 |
 32. |      2         14   2014      .   .6232266 |
 33. |      2         16   2014      .   .6384193 |
 34. |      2         17   2014      .   .6532997 |
     +--------------------------------------------+

.

So, the example you choose is plausible as .25 is near to .255 but most of the unknowns would require some other method. Perhaps your real data fill in the gaps. If so, then I'd recommend interpolation followed by merging back with the original data.

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#4

18 Jul 2017, 10:16

I do not use Excel and do not know what the "lookup" function does exactly. With that caveat, here's an approach that uses rangerun (from SSC). I assumed that you want to do this within groups of observations from the same year. With rangerun, you loop over each observation in the data. Each time, the data in memory is replaced with the observations that are in range, that is within the specified interval bounds. When you specify a prefix, scalars are created with the value of each variable for the current observation. Then the specified program (doit in this example) is run.

The program first creates a variable with the absolute difference between the value of ret_p for the current observation with the value of p for all observations in memory. Then it sorts observations in decreasing order of diff breaking ties using the highest value of aprov_4ef. With rangerun, the results for the current observation come from the last observation in memory when the user-defined program terminates. So it's sufficient to create goal_aprov as a copy of aprov_4ef.

Code:

clear all * Example generated by -dataex-. To install: ssc install dataex clear input byte cod_univ int year float aprov_4ef int reg_4ef byte uf float(p rel_p) 1 2014 73.2 278 11 0 .25 2 2014 75.6 129 11 .02 .288912 3 2014 77.1 520 11 .039 .3239062 4 2014 77.4 411 11 .059 .3587353 5 2014 77.7 257 11 .078 .3899808 6 2014 80.7 405 11 .098 .4210008 7 2014 81 320 11 .118 .4501736 8 2014 81.5 282 11 .137 .4762423 9 2014 82.6 704 11 .157 .5020201 10 2014 83 321 11 .176 .5249937 11 2014 83.3 170 11 .196 .5476503 12 2014 83.9 800 11 .216 .5688125 13 2014 84 763 11 .235 .5875981 14 2014 85.1 407 11 .275 .6232266 15 2014 85.1 1525 11 .255 .6060533 16 2014 85.4 199 11 .294 .6384193 17 2014 85.8 1983 11 .314 .6532997 end program doit gen diff = abs(s_rel_p - p) gsort -diff aprov_4ef gen goal_aprov = aprov_4ef end rangerun doit, interval(year 0 0) use(aprov_4ef p rel_p) sprefix(s_) list
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

18 Jul 2017, 10:26

Just one comment on Robert Picard's solution. The use of -interval(year 0 0)- will limit the potential matches to observations having the same year. That is not a stated constraint in the original problem, and may or may not be wanted. Moreover, it is hard to infer whether such a constraint is wanted from the example given, because the example data has year == 2014 throughout.
Comment

Announcement

Lookup - Similar coding (help)

Comment

Comment

Comment

Comment