Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Last Occurrence Of...

    Good evening

    I could do with some help here please.

    Looking for some code for the following:

    When dslrintervalcode28 is 2......
    Look up to the most recent occurence of 1, (data is sorted by ddate) and put the value that is in the adjacent cell to the right (dslrintervalnmfp) in the wanted column

    Reverse process needed when dslrintervalcode28 is 1....look back up to the most recent occurrence of 2 and repeat.

    Thank you
    Hans

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float ddate byte dslrintervalcode28 float(dslrintervalnmfp wanted)
    21479 2 .1474     .
    21675 2 .1182     .
    21712 2 .1083     .
    21740 1 .0504     .
    21762 1  .073     .
    21778 1 .0924     .
    21789 1 .0913     .
    21805 1 .0868     .
    21821 1 .1059 .1083
    21854 2    .1     .
    21872 1 .1118     .
    22070 2 .0674     .
    22109 2 .0275     .
    22149 2 .0025 .1118
    22158 1 .0907     .
    22200 2 .0109     .
    end
    format %td ddate
    ------------------ copy up to and including the previous line ------------------

  • #2
    The following code does what you say you want, but it does not agree with what you show in the wanted variable in your example:

    Code:
    drop wanted
    
    capture program drop one_ddate
    program define one_ddate
        keep if dslrintervalcode28 != dslrintervalcode28[_N] | _n == _N
        sort ddate
        gen wanted = dslrintervalnmfp[_N-1] in L
        keep in L
        exit
    end
    
    
    isid ddate, sort
    rangerun one_ddate, interval(ddate . 0)
    Note: the code assumes, and verifies, that there is at most one observation in the data for any given date. The problem, as posed, would not make sense if that is untrue.
    -rangerun- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

    I do not understand how you arrived at the variable wanted in your example. Look at observation 4, dated 10jul2019. It has dslrintervalcode28 == 1. The immediately preceding observation has dslrintervalcode28 == 2, and so it is automatically the last such observation. It's corresponding value of dslrintervalnmfp is 0.1083. So my code gives that value as the value of wanted. But you show missing value for this observation. In fact, wherever you show a non-missing value for wanted, I show the same value you do. But you have almost all values of wanted as missing, yet I find values for them. So it seems there is some additional condition you wish to impose on the calculation restricting it to certain observations, but I cannot find it stated in your post, nor do I perceive what the rule might be.

    Comment


    • #3
      Thanks, Clyde. That works fine. I only put a couple of examples in the initial wanted column on the first post (out of laziness!). All cells are required to be filled in that column. Thanks again.

      Comment


      • #4
        Another question if you dont mind...

        As stated in #2 "the code assumes, and verifies, that there is at most one observation in the data for any given date."
        Is it possible to recode this to take into account a group variable (animal) in the attached dataex, where there is more than one observation in the data for any given date?

        Thanks

        ----------------------- copy starting from the next line -----------------------
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str3 animal float ddate byte dslrintervalcode28 float(dslrintervalnmfp wanted)
        "cat" 21479 2 .1474     .
        "cat" 21675 2 .1182     .
        "cat" 21712 2 .1083     .
        "cat" 21740 1 .0504 .1083
        "cat" 21762 1  .073 .1083
        "cat" 21778 1 .0924 .1083
        "cat" 21789 1 .0913 .1083
        "cat" 21805 1 .0868 .1083
        "cat" 21821 1 .1059 .1083
        "cat" 21854 2    .1 .1059
        "cat" 21872 1 .1118    .1
        "cat" 22070 2 .0674 .1118
        "cat" 22109 2 .0275 .1118
        "cat" 22149 2 .0025 .1118
        "cat" 22158 1 .0907 .0025
        "cat" 22200 2 .0109 .0907
        "dog" 21479 2 .1474     .
        "dog" 21675 2 .1182     .
        "dog" 21712 2 .1083     .
        "dog" 21740 1 .0504 .1083
        "dog" 21762 1  .073 .1083
        "dog" 21778 1 .0924 .1083
        "dog" 21789 1 .0913 .1083
        "dog" 21805 1 .0868 .1083
        "dog" 21821 1 .1059 .1083
        "dog" 21854 2    .1 .1059
        "dog" 21872 1 .1118    .1
        "dog" 22070 2 .0674 .1118
        "dog" 22109 2 .0275 .1118
        "dog" 22149 2 .0025 .1118
        "dog" 22158 1 .0907 .0025
        "dog" 22200 2 .0109 .0907
        end
        format %td ddate
        ------------------ copy up to and including the previous line ------------------

        Comment


        • #5
          No change is needed to program one_ddate. All you need to change are the last two lines of the code shown in #2:
          Code:
          isid animal ddate, sort
          rangerun one_ddate, interval(ddate . 0) by(animal)

          Comment


          • #6
            Yep I tried that before I posted in #4. Realize the information is coming at you in snatches here....the actual dataset I'm working on has over 1 million rows, and 5 mins in on a high-spec laptop, I still have the spinning circle. Perhaps I'm asking Stata for a little too much here! Thanks.

            Comment


            • #7
              For an operation like this on a data set with a million observations, 5 minutes is nothing at all. Get yourself a novel to read or something. Your computer is going to spin this for a long time. Several hours I would imagine, possibly a day or longer.

              I've rechecked the code, and I don't see anything that can be done to materially speed it up: the computation itself is inherently lengthy. And what you may not realize is that the computation time for a data set with N observations grows in proportion to N2. (That actually applies separately to the observations for each animal, and then you add up those squares, not the square of the total number of observations for all animals.)

              Comment


              • #8
                I need to run this every day, so unfortunately not feasible. There isn't much I haven't thrown at Stata in terms of big data at this stage, and so far it's handled it all very impressively! Will have to raise a white flag on this particular operation it seems. Thanks again for your wisdom, much appreciated.

                Comment


                • #9
                  I can't follow all the details here, but this is (the start of) another approach. I include Clyde Schechter 's code to see that it produces the same results.


                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input float ddate byte dslrintervalcode28 float(dslrintervalnmfp wanted)
                  21479 2 .1474     .
                  21675 2 .1182     .
                  21712 2 .1083     .
                  21740 1 .0504     .
                  21762 1  .073     .
                  21778 1 .0924     .
                  21789 1 .0913     .
                  21805 1 .0868     .
                  21821 1 .1059 .1083
                  21854 2    .1     .
                  21872 1 .1118     .
                  22070 2 .0674     .
                  22109 2 .0275     .
                  22149 2 .0025 .1118
                  22158 1 .0907     .
                  22200 2 .0109     .
                  end
                  format %td ddate
                  
                  drop wanted
                  
                  capture program drop one_ddate
                  program define one_ddate
                      keep if dslrintervalcode28 != dslrintervalcode28[_N] | _n == _N
                      sort ddate
                      gen wanted = dslrintervalnmfp[_N-1] in L
                      keep in L
                      exit
                  end
                  
                  
                  isid ddate, sort
                  rangerun one_ddate, interval(ddate . 0)
                  
                  gen WANTED = dslrintervalnmfp[_n-1] if dslrintervalcode28 != dslrintervalcode28[_n-1]  
                  replace WANTED = WANTED[_n-1] if missing(WANTED)
                  
                  list 
                  
                  
                       +---------------------------------------------------+
                       |     ddate   dslri~28   dslrin~p   wanted   WANTED |
                       |---------------------------------------------------|
                    1. | 22oct2018          2      .1474        .        . |
                    2. | 06may2019          2      .1182        .        . |
                    3. | 12jun2019          2      .1083        .        . |
                    4. | 10jul2019          1      .0504    .1083    .1083 |
                    5. | 01aug2019          1       .073    .1083    .1083 |
                       |---------------------------------------------------|
                    6. | 17aug2019          1      .0924    .1083    .1083 |
                    7. | 28aug2019          1      .0913    .1083    .1083 |
                    8. | 13sep2019          1      .0868    .1083    .1083 |
                    9. | 29sep2019          1      .1059    .1083    .1083 |
                   10. | 01nov2019          2         .1    .1059    .1059 |
                       |---------------------------------------------------|
                   11. | 19nov2019          1      .1118       .1       .1 |
                   12. | 04jun2020          2      .0674    .1118    .1118 |
                   13. | 13jul2020          2      .0275    .1118    .1118 |
                   14. | 22aug2020          2      .0025    .1118    .1118 |
                   15. | 31aug2020          1      .0907    .0025    .0025 |
                       |---------------------------------------------------|
                   16. | 12oct2020          2      .0109    .0907    .0907 |
                       +---------------------------------------------------+
                  In principle, this extends easily to separate calculations per animal.

                  Comment


                  • #10
                    Wow! Nick's code is much better than mine. It will be orders of magnitude faster.

                    Comment


                    • #11
                      Brilliant, Nick. Thank you very much.

                      Comment


                      • #12
                        See also https://www.stata-journal.com/articl...article=dm0059 if interested in this thread.

                        Comment


                        • #13
                          Just write down Nick's solution per animal.
                          Code:
                          bys animal (ddate): gen W = dslrintervalnmfp[_n-1]
                          by animal: replace W = W[_n-1] if dslrintervalcode28 == dslrintervalcode28[_n-1]

                          Comment

                          Working...
                          X