Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • distplot and right-censoring

    The -distplot- command written by Nick Cox has become my default for picturing cumulative distributions.
    Code:
    search distplot
    But does anyone know if it's possible with -distplot- to censor the CDF? An example would be where the data have a long right tail so that most of the graph is depicting the CDF near 1. In this case a clearer picture might emerge if the graph could be censored at some not-so-large value of the data, much like many familiar pictures of Kaplan-Meier curves in clinical studies involving, say, administrative censoring.

    The desired picture could be obtained by using the -cumul- command, e.g.
    Code:
    sysuse auto
    
    sort price
    cumul price, g(cprice)
    
    tw scatter cprice price if price<10000, s(i) c(l) ylab(0(.2)1)
    But using -distplot- would be preferable.

    Note that using "if price<10000" with -distplot- does not provide the desired result.

    It may be that this is not currently a feature of -distplot- but if anyone knows differently I'd be grateful to know. Thanks in advance.

  • #2
    Correct. distplot does not support thiis at present.

    Comment


    • #3
      I did this and it worked, at least in the visual presentation. Or I suppose you could preserve, drop>10000, and then do it.
      Code:
      distplot price
      distplot price if price<10000

      Comment


      • #4
        I made a minor edit to distplot to add a censor option, which should work fine at least for simple situations like the above. Use with caution.

        Code:
        *! 2.2.3 HK 15 Sep 2025 adds the censor option
        *! 2.2.2 NJC 15 Sep 2017                    
        * 2.2.1 NJC 29 Mar 2010                    
        * 2.2.0 NJC 18 Dec 2009                    
        * 2.1.0 NJC 17 Aug 2005                    
        * 2.0.2 NJC 12 Aug 2004                    
        * 2.0.1 NJC 29 Sep 2003 tscale -> trscale
        * 2.0.0 NJC 7 May 2003       
        * 1.6.1 NJC 7 Apr 2003       
        * 1.6.0 NJC 5 Dec 2002       
        * 1.5.0 NJC 24 Mar 1999        [STB-51: gr41]
        * 1.4.0 NJC 17 Mar 1998
        * 1.0.0 NJC 5 Mar 1998
        program distplot, sort 
            version 8.0
        
            * allobs undocumented 
            syntax varlist(numeric) [if] [in] [fweight aweight/] ///
            [ , BY(str asis) OVER(varname) FREQuency MIDpoint MISSing ///
            TRSCale(str) REVerse Reverse2(str) MSymbol(str)     ///
            ADDPLOT(str asis) PLOT(str asis) ///
            CENSOR(str asis) ALLOBS * ]
            
            tokenize `varlist'
            local nvars : word count `varlist'
        
            // error checks 
            if "`allobs'" != "" marksample touse, novarlist 
            else marksample touse  
            
            if "`missing'" == "" { 
                if "`over'" != "" markout `touse' `over', strok 
            }
            else if "`over'" == "" di as txt "missing only applies with over()" 
            
            if "`exp'" == "" local exp "1" 
            else {
                capture assert `exp' >= 0 if `touse'
                if _rc {
                    di as err "weight assumes negative values"
                    exit 402
                    }    
            }
        
            if "`over'" != "" {
                if `nvars' > 1 {
                    di as err "too many variables specified"
                    exit 103
                    }
            }
        
        
            if `"`by'"' != "" { 
                gettoken by opts : by, parse(",") 
                local byby by(`by' `opts') 
                local bylabel : variable label `by' 
            }    
            
            qui count if `touse'
            if r(N) == 0 {
                di as err "no observations satisfy conditions"
                exit 2000
            }
        
            if "`trscale'" != "" { 
                if !index("`trscale'","@") { 
                    di as err "trscale() does not contain @" 
                    exit 198 
                }
            } 
        
            if "`frequency'" != "" & "`midpoint'" != "" { 
                di as err "frequency and midpoint may not be combined"
                exit 198 
            }
        
            if "`reverse2'" != "" { 
                if "`reverse2'" != "ge" {
                    di as err "reverse() option invalid" 
                    exit 198 
                }     
                if "`midpoint'" != "" { 
                    di as txt "(reverse midpoint options assumed)" 
                    local reverse "reverse" 
                }
                else local ge "ge" 
            } 
            
            // we're in business 
            
            quietly {
                preserve
                keep if `touse'  
                tempvar wt id col P midP 
                gen double `wt' = `exp'
        
                if `nvars' > 1 { 
                    gen long `id' = _n 
                    tempname data datalbl 
                    local i = 1 
                    foreach v of var `varlist' {
                        local lbl`i' : variable label `v' 
                        if `"`lbl`i''"' == "" local lbl`i' "`v'" 
                        rename `v' `data'`i++'
                        local xlbl "`xlbl' `v'"
                    } 
                        reshape long `data', i(`id') j(`col') 
                        forval i = 1 / `nvars' {
                        label def `datalbl' `i' `"`lbl`i''"', modify
                        }
                    label val `col' `datalbl' 
                        local varlist "`data'"
                    label var `varlist' "`xlbl'"
                    if "`by'" != "" label var `by' `"`bylabel'"' 
                    local over "`col'"
                }
        
                sort `touse' `by' `over' `varlist'
                local over1 "by `touse' `by' `over' :"
                local over2 "by `touse' `by' `over' `varlist' :"
                
                if "`midpoint'" == "" { 
                    `over1' gen double `P' = sum(`wt') 
                    if "`frequency'" == "" `over1' replace `P' = `P' / `P'[_N]
                    if "`reverse'" != ""   `over1' replace `P' = `P'[_N] - `P'
                } 
                else {
                    `over2' gen double `P' = sum(`wt') 
                    `over2' replace `P' = `P'[_N] 
                    `over2' keep if _n == _N 
                    `over1' gen double `midP' = 0.5 * `P' + sum(`P'[_n-1]) 
                    `over1' replace `P' = sum(`P') 
                    `over1' replace `P' = `midP' / `P'[_N] 
                    if "`reverse'" != "" replace `P' = 1 - `P' 
                 } 
                
                if "`ge'" != "" replace `varlist' = -`varlist' 
                local ylbl = cond("`xlbl'" != "", "value", "`varlist'") 
            
                if "`midpoint'" != ""  {
                    local ytitle "Cumulative probability midpoint"     
                }
                else if "`frequency'" == "" {
                    local ytitle "Cumulative probability"
                }
                else local ytitle "Cumulative frequency" 
                
                if "`over'" != "" {
                    tempvar group
                    bysort `over' (`P') : gen byte `group' = _n == 1 
                    replace `group' = sum(`group')
                    local overlab : value label `over'
                    local j = 1 
                    forval i = 1 / `= `group'[_N]' {
                        tempvar P_i`i'
                         gen double `P_i`i'' = `P' if `group' == `i'
                                 local overval = `over'[`j']
                            if "`overlab'" != "" { 
                            local overval : label `overlab' `overval' 
                        }
                            label var `P_i`i'' `"`overval'"'
                        local Plist "`Plist' `P_i`i''"
                        count if `group' == `i'
                        local j = `j' + r(N)
                    }
                }
                else { 
                    local sort "sort" 
                    _crcslbl `P' `varlist' 
                    local Plist "`P'"
                }    
        
                if "`trscale'" != "" { 
                    foreach v of var `Plist' { 
                        local newv : subinstr local trscale "@" "`v'", all 
                        replace `v' = `newv' 
                    }
                } 
            }     
        
            if "`msymbol'" == "" local msymbol "oh dh th sh smplus x O D T S + X"
            if `"`censor'"' == `""' local ifcond
                else local ifcond if `censor'
                
            twoway line `Plist' `varlist' `ifcond', ytitle("`ytitle'") /// 
            `sort' ms(`msymbol') `byby' `options' ///
            || `plot' /// 
            || `addplot' /// 
            // blank 
        end
        With this, you would just do
        Code:
        distplot price, censor(price<10000) ylab(0(.2)1)

        Comment


        • #5
          Thanks very much Hemanshu and George.

          Comment


          • #6
            I'm not sure if #3 solves the problem correctly. The issue is that in #3, the distribution is computed only on the subset of observations which are uncensored, whereas what you want to do is compute the distribution on the full set, and then draw the graph only for the uncensored observations -- which is what #1 (with cumul and tw scatter) and #4 (with the modified distplot) do.

            Comment

            Working...
            X