Last Occurrence Of...

Hans Schubert

Join Date: Jul 2021

Posts: 109
#1

Last Occurrence Of...

13 Jan 2023, 12:01

Good evening

I could do with some help here please.

Looking for some code for the following:

When dslrintervalcode28 is 2......
Look up to the most recent occurence of 1, (data is sorted by ddate) and put the value that is in the adjacent cell to the right (dslrintervalnmfp) in the wanted column

Reverse process needed when dslrintervalcode28 is 1....look back up to the most recent occurrence of 2 and repeat.

Thank you
Hans

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float ddate byte dslrintervalcode28 float(dslrintervalnmfp wanted) 21479 2 .1474 . 21675 2 .1182 . 21712 2 .1083 . 21740 1 .0504 . 21762 1 .073 . 21778 1 .0924 . 21789 1 .0913 . 21805 1 .0868 . 21821 1 .1059 .1083 21854 2 .1 . 21872 1 .1118 . 22070 2 .0674 . 22109 2 .0275 . 22149 2 .0025 .1118 22158 1 .0907 . 22200 2 .0109 . end format %td ddate

------------------ copy up to and including the previous line ------------------
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#2

13 Jan 2023, 12:36

The following code does what you say you want, but it does not agree with what you show in the wanted variable in your example:

Code:

drop wanted capture program drop one_ddate program define one_ddate keep if dslrintervalcode28 != dslrintervalcode28[_N] | _n == _N sort ddate gen wanted = dslrintervalnmfp[_N-1] in L keep in L exit end isid ddate, sort rangerun one_ddate, interval(ddate . 0)

Note: the code assumes, and verifies, that there is at most one observation in the data for any given date. The problem, as posed, would not make sense if that is untrue.
-rangerun- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

I do not understand how you arrived at the variable wanted in your example. Look at observation 4, dated 10jul2019. It has dslrintervalcode28 == 1. The immediately preceding observation has dslrintervalcode28 == 2, and so it is automatically the last such observation. It's corresponding value of dslrintervalnmfp is 0.1083. So my code gives that value as the value of wanted. But you show missing value for this observation. In fact, wherever you show a non-missing value for wanted, I show the same value you do. But you have almost all values of wanted as missing, yet I find values for them. So it seems there is some additional condition you wish to impose on the calculation restricting it to certain observations, but I cannot find it stated in your post, nor do I perceive what the rule might be.
Comment
Hans Schubert

Join Date: Jul 2021

Posts: 109
#3

13 Jan 2023, 15:18

Thanks, Clyde. That works fine. I only put a couple of examples in the initial wanted column on the first post (out of laziness!). All cells are required to be filled in that column. Thanks again.
Comment

Hans Schubert

Join Date: Jul 2021
Posts: 109

13 Jan 2023, 17:07

Another question if you dont mind...

As stated in #2 "the code assumes, and verifies, that there is at most one observation in the data for any given date."
Is it possible to recode this to take into account a group variable (animal) in the attached dataex, where there is more than one observation in the data for any given date?

Thanks

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str3 animal float ddate byte dslrintervalcode28 float(dslrintervalnmfp wanted)
"cat" 21479 2 .1474     .
"cat" 21675 2 .1182     .
"cat" 21712 2 .1083     .
"cat" 21740 1 .0504 .1083
"cat" 21762 1  .073 .1083
"cat" 21778 1 .0924 .1083
"cat" 21789 1 .0913 .1083
"cat" 21805 1 .0868 .1083
"cat" 21821 1 .1059 .1083
"cat" 21854 2    .1 .1059
"cat" 21872 1 .1118    .1
"cat" 22070 2 .0674 .1118
"cat" 22109 2 .0275 .1118
"cat" 22149 2 .0025 .1118
"cat" 22158 1 .0907 .0025
"cat" 22200 2 .0109 .0907
"dog" 21479 2 .1474     .
"dog" 21675 2 .1182     .
"dog" 21712 2 .1083     .
"dog" 21740 1 .0504 .1083
"dog" 21762 1  .073 .1083
"dog" 21778 1 .0924 .1083
"dog" 21789 1 .0913 .1083
"dog" 21805 1 .0868 .1083
"dog" 21821 1 .1059 .1083
"dog" 21854 2    .1 .1059
"dog" 21872 1 .1118    .1
"dog" 22070 2 .0674 .1118
"dog" 22109 2 .0275 .1118
"dog" 22149 2 .0025 .1118
"dog" 22158 1 .0907 .0025
"dog" 22200 2 .0109 .0907
end
format %td ddate

------------------ copy up to and including the previous line ------------------

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#5

13 Jan 2023, 17:14

No change is needed to program one_ddate. All you need to change are the last two lines of the code shown in #2:

Code:

isid animal ddate, sort rangerun one_ddate, interval(ddate . 0) by(animal)
1 like
Comment
Hans Schubert

Join Date: Jul 2021

Posts: 109
#6

13 Jan 2023, 17:25

Yep I tried that before I posted in #4. Realize the information is coming at you in snatches here....the actual dataset I'm working on has over 1 million rows, and 5 mins in on a high-spec laptop, I still have the spinning circle. Perhaps I'm asking Stata for a little too much here! Thanks.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#7

13 Jan 2023, 17:31

For an operation like this on a data set with a million observations, 5 minutes is nothing at all. Get yourself a novel to read or something. Your computer is going to spin this for a long time. Several hours I would imagine, possibly a day or longer.

I've rechecked the code, and I don't see anything that can be done to materially speed it up: the computation itself is inherently lengthy. And what you may not realize is that the computation time for a data set with N observations grows in proportion to N². (That actually applies separately to the observations for each animal, and then you add up those squares, not the square of the total number of observations for all animals.)
Comment
Hans Schubert

Join Date: Jul 2021

Posts: 109
#8

13 Jan 2023, 17:43

I need to run this every day, so unfortunately not feasible. There isn't much I haven't thrown at Stata in terms of big data at this stage, and so far it's handled it all very impressively! Will have to raise a white flag on this particular operation it seems. Thanks again for your wisdom, much appreciated.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35668

14 Jan 2023, 03:54

I can't follow all the details here, but this is (the start of) another approach. I include Clyde Schechter 's code to see that it produces the same results.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float ddate byte dslrintervalcode28 float(dslrintervalnmfp wanted)
21479 2 .1474     .
21675 2 .1182     .
21712 2 .1083     .
21740 1 .0504     .
21762 1  .073     .
21778 1 .0924     .
21789 1 .0913     .
21805 1 .0868     .
21821 1 .1059 .1083
21854 2    .1     .
21872 1 .1118     .
22070 2 .0674     .
22109 2 .0275     .
22149 2 .0025 .1118
22158 1 .0907     .
22200 2 .0109     .
end
format %td ddate

drop wanted

capture program drop one_ddate
program define one_ddate
    keep if dslrintervalcode28 != dslrintervalcode28[_N] | _n == _N
    sort ddate
    gen wanted = dslrintervalnmfp[_N-1] in L
    keep in L
    exit
end


isid ddate, sort
rangerun one_ddate, interval(ddate . 0)

gen WANTED = dslrintervalnmfp[_n-1] if dslrintervalcode28 != dslrintervalcode28[_n-1]  
replace WANTED = WANTED[_n-1] if missing(WANTED)

list 


     +---------------------------------------------------+
     |     ddate   dslri~28   dslrin~p   wanted   WANTED |
     |---------------------------------------------------|
  1. | 22oct2018          2      .1474        .        . |
  2. | 06may2019          2      .1182        .        . |
  3. | 12jun2019          2      .1083        .        . |
  4. | 10jul2019          1      .0504    .1083    .1083 |
  5. | 01aug2019          1       .073    .1083    .1083 |
     |---------------------------------------------------|
  6. | 17aug2019          1      .0924    .1083    .1083 |
  7. | 28aug2019          1      .0913    .1083    .1083 |
  8. | 13sep2019          1      .0868    .1083    .1083 |
  9. | 29sep2019          1      .1059    .1083    .1083 |
 10. | 01nov2019          2         .1    .1059    .1059 |
     |---------------------------------------------------|
 11. | 19nov2019          1      .1118       .1       .1 |
 12. | 04jun2020          2      .0674    .1118    .1118 |
 13. | 13jul2020          2      .0275    .1118    .1118 |
 14. | 22aug2020          2      .0025    .1118    .1118 |
 15. | 31aug2020          1      .0907    .0025    .0025 |
     |---------------------------------------------------|
 16. | 12oct2020          2      .0109    .0907    .0907 |
     +---------------------------------------------------+

In principle, this extends easily to separate calculations per animal.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#10

14 Jan 2023, 11:27

Wow! Nick's code is much better than mine. It will be orders of magnitude faster.
Comment
Hans Schubert

Join Date: Jul 2021

Posts: 109
#11

14 Jan 2023, 15:52

Brilliant, Nick. Thank you very much.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35668
#12

15 Jan 2023, 02:35

See also https://www.stata-journal.com/articl...article=dm0059 if interested in this thread.
1 like
Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

#13

16 Jan 2023, 04:22

Just write down Nick's solution per animal.

Code:

bys animal (ddate): gen W = dslrintervalnmfp[_n-1]
by animal: replace W = W[_n-1] if dslrintervalcode28 == dslrintervalcode28[_n-1]

Announcement