
  • Programming & If Conditions

    Hello everybody,

    I am wondering what I have to change to be able to use if conditions for the following "programmed command":

    program iqr_ethn, rclass
    version 16
    syntax varlist(numeric) [if] [=exp]


    * Count occurrences for each unique value
    egen frequencies = count(`varlist') if `varlist'<., by(`varlist')

    * Compute Q1, Q3, IQR, and the threshold
    qui sum frequencies if `varlist' <. , detail
    local Q1 = r(p25)
    local Q3 = r(p75)
    local IQR = `Q3' - `Q1'
    local threshold = `Q3' + 3*`IQR'


    * Assign 1 to outliers (i.e. the ingroup) and 0 otherwise
    replace ingroup = 0 if ingroup==. & `varlist'<.
    replace ingroup = 1 if ingroup<. & frequencies > `threshold' & `varlist'<.

    if `threshold' == `Q3' {
    replace ingroup = 1 if ingroup == 0 & frequencies >= `Q3' & `varlist'<.

    }
    tab frequencies
    sum frequencies, detail
    drop frequencies
    tab `varlist' ingroup if `varlist'<.
    end

    The variable to which I want to apply the command is the following:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double ETHNIC
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    80
    end
    label values ETHNIC ETHNIC
    label def ETHNIC 80 "South+Latin America", modify
    Many thanks in advance for helping me!!!
    Last edited by Clara Eul; 20 Aug 2023, 07:09.

  • #2
    I am not especially clear that you need a program to do what I think you want to do, but here are some general and specific comments and some code that may help.

    Code:
    capture program drop iqr_ethn
    
    program iqr_ethn, sortpreserve
    version 16
    syntax varname(numeric) [if] [in], ingroup(str)
    
    marksample touse
    tempvar freq
    
    
    * Count occurrences for each distinct value
    bysort `touse' `varlist': gen long `freq' = _N if `touse'
    label var `freq' "frequencies"
    
    * Compute Q1, Q3, IQR, and the threshold
    qui sum `freq' if `touse' , detail
    
    * threshold = r(p75) + 3 * (r(p75) - r(p25)) 
    * assign 1 to outliers (i.e. the ingroup) and 0 otherwise
    gen `ingroup' = `touse' & (`freq' > r(p75) + 3 * (r(p75) - r(p25)))
    
    tab `freq' if `touse'
    
    sum `freq' if `touse', detail
    
    tab `varlist' `ingroup' if `touse'
    
    end
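
    As a sketch of how the program might then be called with if and in, assuming the data example from #1 is in memory and the program above has been defined. The new variable names ingroup_all, ingroup_sub and ingroup_in are purely illustrative.

    Code:
    * all observations with nonmissing ETHNIC
    iqr_ethn ETHNIC, ingroup(ingroup_all)

    * restrict with an if condition
    iqr_ethn ETHNIC if ETHNIC == 80, ingroup(ingroup_sub)

    * restrict with an in range
    iqr_ethn ETHNIC in 1/50, ingroup(ingroup_in)
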
    G1. The need to ignore missing values and also to take account of any if condition specified is so common that marksample has been provided as a command dedicated to doing both. (Support for in comes free too.)
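
    A minimal sketch of what marksample does, using the auto dataset shipped with Stata rather than your data. The program name demo_touse is hypothetical.

    Code:
    sysuse auto, clear

    capture program drop demo_touse
    program demo_touse
        version 16
        syntax varname(numeric) [if] [in]
        * touse is 1 for observations that satisfy any if or in
        * and are nonmissing on the variable named, and 0 otherwise
        marksample touse
        count if `touse'
    end

    * rep78 has missing values, which are excluded automatically
    demo_touse rep78
    demo_touse rep78 if foreign
    demo_touse rep78 in 1/20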

    G2. The method of taking returned results, putting them into local macros, and then taking them out again immediately is very common in user code, but usually needless indirection (and may result in loss of precision). Often you can and should just use the returned results directly.
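
    To illustrate with the auto dataset (the variable names high_a and high_b are made up for this sketch): both routes flag the same observations here, but the second skips the round trip through local macros.

    Code:
    sysuse auto, clear
    quietly summarize price, detail

    * indirect: copy results into local macros, then read them back out
    local Q1 = r(p25)
    local Q3 = r(p75)
    generate byte high_a = price > `Q3' + 3 * (`Q3' - `Q1') if !missing(price)

    * direct: use the returned results in the expression itself
    generate byte high_b = price > r(p75) + 3 * (r(p75) - r(p25)) if !missing(price)

    * the two agree for these data
    assert high_a == high_b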

    G3. In a program, use of egen is often inefficient: equivalent code using by: and generate usually gives Stata less work to do and runs faster too.
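
    A small check, again on the auto dataset, that the by: version gives the same frequencies as the egen call in your program. The variable names freq_egen and freq_by are illustrative only.

    Code:
    sysuse auto, clear

    * egen version, as in the original program
    egen freq_egen = count(rep78) if rep78 < ., by(rep78)

    * by: version; _N is the number of observations in each group
    bysort rep78: generate long freq_by = _N if rep78 < .

    * identical, including missing where rep78 is missing
    assert freq_egen == freq_by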

    S1. You declare your program rclass but return nothing. That's harmless, but it serves no purpose.

    S2. If you summarize all the data specified, then each frequency occurs that many times, so your quartiles are calculated across observations, not across the distinct (not "unique", please (*)) values of your variable. Be sure that is what you want.
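
    If quartiles across distinct values were what you wanted instead, one possible approach is to tag a single observation for each distinct value and summarize only those. This is a sketch on the auto dataset, not something the program above does; freq and tag are illustrative names.

    Code:
    sysuse auto, clear
    bysort rep78: generate long freq = _N if rep78 < .

    * across observations: each frequency is counted freq times
    summarize freq, detail

    * across distinct values: one tagged observation per value of rep78
    * (observations with missing rep78 are tagged 0)
    egen byte tag = tag(rep78)
    summarize freq if tag, detail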

    S3. Your threshold is upper quartile PLUS 3 IQR. The threshold will be equal to Q3 if (and only if) the IQR is identically 0. If that is true, the recipe has already calculated the threshold correctly.

    S4. The program is awkward in wiring in the variable name ingroup AND assuming that the variable exists before you run the program. (Otherwise the first replace statement would fail.) It's better practice to let the user specify the name of the variable and let the program produce it. If you want to combine the result of this calculation with some previous results, well, that's not clear but there are plenty of ways to do that.
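
    A sketch of one common pattern for this, shown on the auto dataset: take the new variable name as an option and check that it is not already in use before doing any work. The program name make_flag and the option name generate() are hypothetical choices for the illustration.

    Code:
    capture program drop make_flag
    program make_flag
        version 16
        syntax varname(numeric) [if] [in], GENerate(name)
        * exit with an error message if the name is already taken
        confirm new variable `generate'
        marksample touse
        generate byte `generate' = `touse'
    end

    sysuse auto, clear
    make_flag rep78 if foreign, generate(flag)

    * a second call with the same name fails rather than overwriting
    capture noisily make_flag rep78, generate(flag)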

    S5. Similarly, it's better not to wire in the variable name frequencies.

    S6. Perhaps you intend to support weights too, but I didn't add any code for that.

    The program above does run on your data example (so it's legal) but the results don't seem interesting.
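
    Presumably that is because the example contains 100 observations that are all 80: every frequency is 100, so both quartiles are 100, the IQR is 0, and no frequency can exceed the threshold. The sketch below recreates an equivalent dataset rather than reading in your example; freq and threshold are illustrative names.

    Code:
    clear
    set obs 100
    generate double ETHNIC = 80

    bysort ETHNIC: generate long freq = _N
    quietly summarize freq, detail

    * Q1, Q3 and the threshold Q3 + 3 * IQR are all 100
    display r(p25) " " r(p75) " " r(p75) + 3 * (r(p75) - r(p25))

    * keep the threshold in a scalar because count overwrites r()
    scalar threshold = r(p75) + 3 * (r(p75) - r(p25))

    * nothing is strictly above the threshold, so ingroup would be 0 throughout
    count if freq > threshold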

    (*) On why distinct is a better word please see Section 2 of https://journals.sagepub.com/doi/pdf...867X0800800408
