Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • twoway rarea: invert color when lines cross

    I am plotting actual and predicted rates of an event and want to emphasize times when actual exceeded predicted and vice versa. I've attached data and the code to create two graphs that aren't exactly what I want.

    In the first version, there is no change in shading color.
    In the second version, the shading color changes, but the shaded region doesn't exactly match the shape of the area between the lines.

    I'm probably not the first to want to do this--has anyone developed a method to make it work?



    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float dt double(actual predicted)
    21915  .3469639850779763 .28161612542707887
    21922 .33249606101521084  .2823371799191069
    21929 .31045343211196497 .28269158152852114
    21936  .2859349142080523  .2890576291187378
    21943  .2689252093502894  .3050486616517073
    21950  .2594151983932902  .3231018050827308
    21957 .23873048698868282 .33936473298506314
    21964 .21008050332694667 .33601787184519843
    21971 .20156126392584633  .3111580381318348
    21978 .22863544297280752 .28395028536352795
    21985  .2660472141141334 .26082748481494766
    21992 .29655574434967535 .25953957681502066
    21999 .31047023493071646  .2769737081770453
    22006 .30256011879759714 .29519669922777303
    22013 .31316367479107704  .3105154315846261
    22020 .33034560075873415  .3158314224579599
    22027  .3423353364338891 .31421121409660363
    22034  .3255514845029689 .31123034641138675
    22041 .33064063113656567 .30797980257571406
    22048  .3440255688837501  .3050228966236359
    22055  .3880894232632907 .30628983645919167
    22062 .45888253150535363   .313261414441176
    22069 .45787862165242454  .3214246624527872
    22076  .4258455243912937 .32874556554942996
    22083 .37023067043414676  .3313199710417261
    22090  .3380617836034539 .33069423801440934
    end
    format %td dt
    
    
    /* V1: no change in shading */
    twoway (rarea actual predicted dt, fcolor(gs12) lcolor(none)) ///
    (line actual dt, lpattern(dash) lwidth(thick)) ///
    (line predicted dt, lpattern(solid) lwidth(thick))
    
    
    /* V2: change in shading, but with gaps */
    gen pos=actual>predicted
    sort dt
    gen change=pos!=pos[_n-1]
    gen area=sum(change)
    
    
    twoway (rarea actual predicted dt if area==1, fcolor(gs12) ) ///
    (rarea actual predicted dt if area==2, fcolor(gs1)  ) ///
    (rarea actual predicted dt if area==3, fcolor(gs12)  ) ///
    (line actual dt, lpattern(dash) lwidth(thick)) ///
    (line predicted dt, lpattern(solid) lwidth(thick)), legend(off)

  • #2
    Thanks for the reproducible example. This calls for some cumulative thinking!

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float dt double(actual predicted)
    21915  .3469639850779763 .28161612542707887
    21922 .33249606101521084  .2823371799191069
    21929 .31045343211196497 .28269158152852114
    21936  .2859349142080523  .2890576291187378
    21943  .2689252093502894  .3050486616517073
    21950  .2594151983932902  .3231018050827308
    21957 .23873048698868282 .33936473298506314
    21964 .21008050332694667 .33601787184519843
    21971 .20156126392584633  .3111580381318348
    21978 .22863544297280752 .28395028536352795
    21985  .2660472141141334 .26082748481494766
    21992 .29655574434967535 .25953957681502066
    21999 .31047023493071646  .2769737081770453
    22006 .30256011879759714 .29519669922777303
    22013 .31316367479107704  .3105154315846261
    22020 .33034560075873415  .3158314224579599
    22027  .3423353364338891 .31421121409660363
    22034  .3255514845029689 .31123034641138675
    22041 .33064063113656567 .30797980257571406
    22048  .3440255688837501  .3050228966236359
    22055  .3880894232632907 .30628983645919167
    22062 .45888253150535363   .313261414441176
    22069 .45787862165242454  .3214246624527872
    22076  .4258455243912937 .32874556554942996
    22083 .37023067043414676  .3313199710417261
    22090  .3380617836034539 .33069423801440934
    end
    format %td dt
    
    gen pos=actual>predicted
    sort dt
    gen change=pos!=pos[_n-1]
    gen area=sum(change)
    
    
    foreach num of numlist 1 3{
       gen area`num'2= inlist(area, `num', 2)
    }
    
    twoway (rarea actual predicted dt if area12, fcolor(gs12) ) ///
    (rarea actual predicted dt if area32, fcolor(gs1)  ) ///
    (rarea actual predicted dt if area==3, fcolor(gs12)  ) ///
    (line actual dt, lpattern(dash) lwidth(thick) scheme(s1color)) ///
    (line predicted dt, lpattern(solid) lwidth(thick)), legend(off)
    Res.:
    Click image for larger version

Name:	Graph.png
Views:	1
Size:	45.9 KB
ID:	1571215

    Comment


    • #3
      Perfect! Thank you so much!

      Comment


      • #4
        Oddly, when the data change, this no longer works.


        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float dt double(actual predicted)
        21915   16.4374811007146 12.669995973242584
        21922 15.209215698156115 12.134958526934238
        21929 13.907829213824419 11.867893963453048
        21936   12.7005776427328  12.08547538629546
        21943 11.986846488872429  12.86748605031838
        21950 11.543464504572759 13.615336632867946
        21957 10.512740909196596 14.174203873434175
        21964  9.130806115842926 13.867261978811243
        21971   8.56977152699833 12.615959092485628
        21978  9.082512436140698 10.987438672710763
        21985  9.617240002668176  9.081461726364969
        21992  9.661117109195153  7.752512700865874
        21999  9.007911734175586  7.185171998332542
        22006  8.018395288172515  7.056341029123566
        22013  8.025481818493901  7.245594767048623
        22020  8.631218717704177  7.501548637539655
        22027   9.23797516646064    7.7366144535245
        22034  9.122894381636353  8.025253079442706
        22041  9.764694790762958  8.351740051090315
        22048 10.726059461218954  8.694168262015884
        22055 12.679512397406402  9.175223242547663
        22062   15.4328984301478  9.768697144361049
        22069 15.709410788804417 10.309160736336278
        22076 14.886763968403914 10.802491616185991
        22083  13.25036014179231 11.153512104321273
        22090 12.562518899104907 11.496164616807022
        end
        format %td dt
        gen pos=actual>predicted
        sort dt
        gen change=pos!=pos[_n-1]
        gen area=sum(change)
        
        
        foreach num of numlist 1 3{
           gen area`num'2= inlist(area, `num', 2)
        }
        
        twoway (rarea actual predicted dt if area12, fcolor(gs12) ) ///
        (rarea actual predicted dt if area32, fcolor(gs1)  ) ///
        (rarea actual predicted dt if area==3, fcolor(gs12)  ) ///
        (line actual dt, lpattern(dash) lwidth(thick) scheme(s1color)) ///
        (line predicted dt, lpattern(solid) lwidth(thick)), legend(off)
        Click image for larger version

Name:	Graph.png
Views:	1
Size:	197.4 KB
ID:	1571386

        Comment


        • #5
          The issue is missing dates. I will illustrate using areas 1 & 2 below.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float dt double(actual predicted) float(pos change area)
          21915   16.4374811007146 12.669995973242584 1 1 1
          21922 15.209215698156115 12.134958526934238 1 0 1
          21929 13.907829213824419 11.867893963453048 1 0 1
          21936   12.7005776427328  12.08547538629546 1 0 1
          21943 11.986846488872429  12.86748605031838 0 1 2
          21950 11.543464504572759 13.615336632867946 0 0 2
          21957 10.512740909196596 14.174203873434175 0 0 2
          21964  9.130806115842926 13.867261978811243 0 0 2
          21971   8.56977152699833 12.615959092485628 0 0 2
          21978  9.082512436140698 10.987438672710763 0 0 2
          21985  9.617240002668176  9.081461726364969 1 1 3
          21992  9.661117109195153  7.752512700865874 1 0 3
          21999  9.007911734175586  7.185171998332542 1 0 3
          22006  8.018395288172515  7.056341029123566 1 0 3
          22013  8.025481818493901  7.245594767048623 1 0 3
          22020  8.631218717704177  7.501548637539655 1 0 3
          22027   9.23797516646064    7.7366144535245 1 0 3
          22034  9.122894381636353  8.025253079442706 1 0 3
          22041  9.764694790762958  8.351740051090315 1 0 3
          22048 10.726059461218954  8.694168262015884 1 0 3
          22055 12.679512397406402  9.175223242547663 1 0 3
          22062   15.4328984301478  9.768697144361049 1 0 3
          22069 15.709410788804417 10.309160736336278 1 0 3
          22076 14.886763968403914 10.802491616185991 1 0 3
          22083  13.25036014179231 11.153512104321273 1 0 3
          22090 12.562518899104907 11.496164616807022 1 0 3
          end
          format %td dt
          
          twoway (rarea actual predicted dt if inlist(area, 1, 2), fcolor(gs12) ) ///
          (rarea actual predicted dt if inlist(area, 1), fcolor(gs1) ///
          xline(`=date("22jan2020","DMY")') xline(`=date("29jan2020","DMY")') ///
          xlab(21936 "end 1" 21943 "start 2" 21970 ) scheme(s1color))

          Below, I highlight the last observation date in area 1 and the first in area 2.

          Click image for larger version

Name:	Graph.png
Views:	1
Size:	35.9 KB
ID:	1571395



          So what Stata does is to create a mid-point and join the areas. This, you can do.


          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float dt double(actual predicted) float(pos change area)
          21915   16.4374811007146 12.669995973242584 1 1 1
          21922 15.209215698156115 12.134958526934238 1 0 1
          21929 13.907829213824419 11.867893963453048 1 0 1
          21936   12.7005776427328  12.08547538629546 1 0 1
          21943 11.986846488872429  12.86748605031838 0 1 2
          21950 11.543464504572759 13.615336632867946 0 0 2
          21957 10.512740909196596 14.174203873434175 0 0 2
          21964  9.130806115842926 13.867261978811243 0 0 2
          21971   8.56977152699833 12.615959092485628 0 0 2
          21978  9.082512436140698 10.987438672710763 0 0 2
          21985  9.617240002668176  9.081461726364969 1 1 3
          21992  9.661117109195153  7.752512700865874 1 0 3
          21999  9.007911734175586  7.185171998332542 1 0 3
          22006  8.018395288172515  7.056341029123566 1 0 3
          22013  8.025481818493901  7.245594767048623 1 0 3
          22020  8.631218717704177  7.501548637539655 1 0 3
          22027   9.23797516646064    7.7366144535245 1 0 3
          22034  9.122894381636353  8.025253079442706 1 0 3
          22041  9.764694790762958  8.351740051090315 1 0 3
          22048 10.726059461218954  8.694168262015884 1 0 3
          22055 12.679512397406402  9.175223242547663 1 0 3
          22062   15.4328984301478  9.768697144361049 1 0 3
          22069 15.709410788804417 10.309160736336278 1 0 3
          22076 14.886763968403914 10.802491616185991 1 0 3
          22083  13.25036014179231 11.153512104321273 1 0 3
          22090 12.562518899104907 11.496164616807022 1 0 3
          end
          format %td dt
          
          
          sort dt
          gen extra=2 if area!=area[_n+1]
          expand extra, g(new)
          sort dt extra new
          foreach var in dt actual predicted{
              replace `var'= (`var'+`var'[_n+1])/2 if new
          }
          replace new=2 if new
          expand new
          bys dt area new: replace area= area+1 if _n==_N & new
          replace area=area-1 if _n==_N    
           
          twoway (rarea actual predicted dt if inlist(area, 1, 2), fcolor(gs12) ) ///
          (rarea actual predicted dt if inlist(area, 1), fcolor(gs1) ///
          xline(`=date("22jan2020","DMY")') xline(`=date("29jan2020","DMY")') ///
          xlab(21936 "end 1" 21943 "start 2" 21970 ) scheme(s1color))
          Click image for larger version

Name:	Graph2.png
Views:	1
Size:	37.4 KB
ID:	1571396

          Comment


          • #6
            A graphical maxim is

            If a difference is of interest and importance, calculate it explicitly and show it directly.

            Nothing original or novel there. People have been looking at plotting (and mapping) residuals since the 19th century at least. Focusing on residuals as ways to improve explanations features strongly in John Herschel's and John Stuart Mill's books on scientific and logical method. (Herschel's name should be celebrated in every introductory account of statistics as plausibly inventing scatter plots; the name scatter plot is, however, Karl Pearson's, rather oddly given the latter's general tendency to reach for a Greek dictionary when inventing new terms.)

            That leads to suggestions like these:


            Code:
            gen difference = actual - predicted 
            
            * ssc install mycolours 
            mycolours 
            
            line actual predicted dt, fysize(60) name(G1, replace) ytitle(actual and predicted) legend(ring(0) pos(11) col(1)) xsc(off)
            
            line difference dt, color(black) fysize(40) name(G2, replace) 
            
            graph combine G1 G2, col(1) xcommon name(G3, replace)
            
            
            line actual predicted difference dt, lc("`ora'" "`blu'" "`bla'") name(G4, replace) yli(0, lw(thin)) legend(ring(0) pos(11) col(1))
            The two panel solution is quite popular in some quarters. It's fiddly to get it right for any particular case, let alone in general. For the data in #1 the last solution seems to work as well as the others. If an outcome and the difference were quite different in absolute value that would no longer be the case.

            Here is the last graph. There remains a great deal of scope to improve it.

            Click image for larger version

Name:	act_pre_diff.png
Views:	1
Size:	30.5 KB
ID:	1571444

            Comment


            • #7
              Andrew Musau: thank you for looking at this again. Unfortunately, I think it is not as simple as picking the midpoint, which you can see if you apply your new code to the old data. The new datapoint has to be at the point where the lines cross, which may not be the midpoint.

              Click image for larger version

Name:	Graph2.png
Views:	1
Size:	178.1 KB
ID:	1571521




              Nick Cox: Thank you! I appreciate both the theory and the solutions. The two-panel solution might work best for the rest of our data (6 sites, all with different baselines and residual patterns). It would also allow us to relabel the x-axis as "excess (reduced) visits" or something else that draws attention to the interpretation of the residual.

              Comment


              • #8
                Andrew Musau: thank you for looking at this again. Unfortunately, I think it is not as simple as picking the midpoint, which you can see if you apply your new code to the old data. The new datapoint has to be at the point where the lines cross, which may not be the midpoint
                You are right, the example in #5 was a special case. It should be possible to create observations between areas and interpolate, keeping the observation closest to the intersection point. But Nick's suggestion is more intuitive, so I will not bother programming this.

                Comment


                • #9
                  #8 should look something like this, but the code has not been thoroughly tested. First install mipolate from SSC by Nick Cox. Data in #1

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float dt double(actual predicted)
                  21915  .3469639850779763 .28161612542707887
                  21922 .33249606101521084  .2823371799191069
                  21929 .31045343211196497 .28269158152852114
                  21936  .2859349142080523  .2890576291187378
                  21943  .2689252093502894  .3050486616517073
                  21950  .2594151983932902  .3231018050827308
                  21957 .23873048698868282 .33936473298506314
                  21964 .21008050332694667 .33601787184519843
                  21971 .20156126392584633  .3111580381318348
                  21978 .22863544297280752 .28395028536352795
                  21985  .2660472141141334 .26082748481494766
                  21992 .29655574434967535 .25953957681502066
                  21999 .31047023493071646  .2769737081770453
                  22006 .30256011879759714 .29519669922777303
                  22013 .31316367479107704  .3105154315846261
                  22020 .33034560075873415  .3158314224579599
                  22027  .3423353364338891 .31421121409660363
                  22034  .3255514845029689 .31123034641138675
                  22041 .33064063113656567 .30797980257571406
                  22048  .3440255688837501  .3050228966236359
                  22055  .3880894232632907 .30628983645919167
                  22062 .45888253150535363   .313261414441176
                  22069 .45787862165242454  .3214246624527872
                  22076  .4258455243912937 .32874556554942996
                  22083 .37023067043414676  .3313199710417261
                  22090  .3380617836034539 .33069423801440934
                  end
                  format %td dt
                  
                  gen pos=actual>predicted
                  sort dt
                  gen change=pos!=pos[_n-1]
                  gen area=sum(change)
                  
                  gen points=100 if area!=area[_n+1]
                  expand points, g(new)
                  foreach var in actual predicted{
                      replace `var'= . if new
                  }
                  sort dt new
                  gen time=_n
                  foreach var in actual predicted{
                      mipolate `var' time, generate(m`var')
                  }
                  
                  foreach var in actual predicted{
                      replace `var' = m`var' if new
                  }
                  bys area new (dt): gen position=_n if new  
                  gen diff= abs(actual)- abs(predicted) if new
                  replace diff=abs(diff)
                  bys area new (diff): drop if _n>1 & new
                  sort time
                  replace dt=dt[_n-1] + (((dt[_n+1]- dt[_n-1])/100)*position) if new
                  replace dt= int(dt) if new
                  replace new=2 if new
                  expand new
                  bys dt area new: replace area= area+1 if _n==_N & new
                  replace area=area-1 if _n==_N
                   
                  twoway (rarea actual predicted dt if inlist(area,1,2), fcolor(gs12) ) ///
                  (rarea actual predicted dt if inlist(area,2,3), fcolor(gs1)  ) ///
                  (rarea actual predicted dt if area==3, fcolor(gs12)  ) ///
                  (line actual dt, lpattern(dash) lwidth(thick) scheme(s1color)) ///
                  (line predicted dt, lpattern(solid) lwidth(thick)), legend(off)
                  Click image for larger version

Name:	Graph.png
Views:	1
Size:	28.4 KB
ID:	1571526



                  Data in #4

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float dt double(actual predicted)
                  21915   16.4374811007146 12.669995973242584
                  21922 15.209215698156115 12.134958526934238
                  21929 13.907829213824419 11.867893963453048
                  21936   12.7005776427328  12.08547538629546
                  21943 11.986846488872429  12.86748605031838
                  21950 11.543464504572759 13.615336632867946
                  21957 10.512740909196596 14.174203873434175
                  21964  9.130806115842926 13.867261978811243
                  21971   8.56977152699833 12.615959092485628
                  21978  9.082512436140698 10.987438672710763
                  21985  9.617240002668176  9.081461726364969
                  21992  9.661117109195153  7.752512700865874
                  21999  9.007911734175586  7.185171998332542
                  22006  8.018395288172515  7.056341029123566
                  22013  8.025481818493901  7.245594767048623
                  22020  8.631218717704177  7.501548637539655
                  22027   9.23797516646064    7.7366144535245
                  22034  9.122894381636353  8.025253079442706
                  22041  9.764694790762958  8.351740051090315
                  22048 10.726059461218954  8.694168262015884
                  22055 12.679512397406402  9.175223242547663
                  22062   15.4328984301478  9.768697144361049
                  22069 15.709410788804417 10.309160736336278
                  22076 14.886763968403914 10.802491616185991
                  22083  13.25036014179231 11.153512104321273
                  22090 12.562518899104907 11.496164616807022
                  end
                  format %td dt
                  
                  gen pos=actual>predicted
                  sort dt
                  gen change=pos!=pos[_n-1]
                  gen area=sum(change)
                  
                  gen points=100 if area!=area[_n+1]
                  expand points, g(new)
                  foreach var in actual predicted{
                      replace `var'= . if new
                  }
                  sort dt new
                  gen time=_n
                  foreach var in actual predicted{
                      mipolate `var' time, generate(m`var')
                  }
                  
                  foreach var in actual predicted{
                      replace `var' = m`var' if new
                  }
                  bys area new (dt): gen position=_n if new  
                  gen diff= abs(actual)- abs(predicted) if new
                  replace diff=abs(diff)
                  bys area new (diff): drop if _n>1 & new
                  sort time
                  replace dt=dt[_n-1] + (((dt[_n+1]- dt[_n-1])/100)*position) if new
                  replace dt= int(dt) if new
                  replace new=2 if new
                  expand new
                  bys dt area new: replace area= area+1 if _n==_N & new
                  replace area=area-1 if _n==_N
                   
                  twoway (rarea actual predicted dt if inlist(area,1,2), fcolor(gs12) ) ///
                  (rarea actual predicted dt if inlist(area,2,3), fcolor(gs1)  ) ///
                  (rarea actual predicted dt if area==3, fcolor(gs12)  ) ///
                  (line actual dt, lpattern(dash) lwidth(thick) scheme(s1color)) ///
                  (line predicted dt, lpattern(solid) lwidth(thick)), legend(off)
                  Click image for larger version

Name:	Graph2.png
Views:	1
Size:	29.6 KB
ID:	1571527

                  Comment


                  • #10
                    Thanks for the mention of mipolate (SSC) but what it is used for here is just the default case of linear interpolation, for which the official command ipolate is fine. Indeed, the code for linear interpolation in mipolate is cloned from the official code.

                    As I understand it, the issue is getting cross-over points where predicted = observed. For that you could

                    calculate diff = observed - predicted
                    expand 2 if the sign of diff differs between data points
                    replace diff with 0 at the new data points
                    replace time with missing at the new data points
                    interpolate predicting time from diff at the new data points

                    It's not intuitive, perhaps, that the essence of the problem is interpolating time, but the point is that you know what the difference is at cross-over points, precisely zero, so the unknown is precisely when that occurs,
                    Last edited by Nick Cox; 05 Sep 2020, 02:04.

                    Comment


                    • #11
                      Just to close this off, in case someone else wants to do this, there is one more step, and one caution.
                      Caution: When interpolating, you have to limit to using the data points just before and after each crossover point, otherwise all crossover points get the same value for time
                      The additional step is that you also have to use your interpolated crossover time to interpolate actual and predicted at each crossover point.

                      Thanks again to Nick and Andrew for the help!

                      Comment

                      Working...
                      X