Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating variable and replacing the outlier as missing based on grouping and condition

    Dear All,
    I am working with firm level data. I want to generate a variable lsales = ln(sales) and want to replace the values in lsales as missing if the values in the variable lsales are higher than mean+3*standard deviation. While doing so, I want the summary stat of lsales by grouping state and sector and if the observation of lsales is more than 1000 in the individual group (state-sector) then find the outlier and replace it as missing. If the number of observation is less than 1000 in any of the group (state-sector) then I want to group lsales by the state only and find out the summary stat and replace the outlier with missing values. Please suggest to me how should I code this, as I am unable to visualise the loop for the condition.

    Thanks
    Prakash

  • #2
    Here is a possible approach. Basically, you can use a series of -bysort: egen- commands to create the count, mean, and SD for state-level and state-sector-level. Then, according to the condition of the group N > 1000, replace the extreme values.

    Code:
    sysuse nlsw88.dta, clear
    
    * Focus on the few we'll be working with:
    keep wage race married
    
    * Generage the log value:
    gen lwage = log(wage)
    gen lwage2 = lwage // for data check
    
    * Assuming race is state, married is sector. Compute n, mean, and sd:
    bysort race married: egen stateSectorN = count(lwage)
    bysort race married: egen stateSectorMean = mean(lwage)
    bysort race married: egen stateSectorSd = sd(lwage)
    
    bysort race: egen stateMean = mean(lwage)
    bysort race: egen stateSd = sd(lwage)
    
    * No need for these follow two, just for data check:
    gen stateSectorLimit = stateSectorMean + 3*stateSectorSd
    gen stateLimit = stateMean + 3*stateSd
    
    * Replace extreme values
    replace lwage = . if stateSectorN >= 1000 & lwage > (stateSectorMean + 3*stateSectorSd)
    replace lwage = . if stateSectorN  < 1000 & lwage > (stateMean + 3*stateSd)
    
    * Check data:
    list lwage lwage2 stateSectorN stateLimit stateSectorLimit if lwage == ., sepby(stateSectorN)
    Version: Stata 16
    Last edited by Ken Chui; 28 Apr 2021, 06:50.

    Comment


    • #3
      Thanks Chui, Now it looks very stupid, I was probably looking it from the point that I need to create a loop. Thanks a lot.

      Comment

      Working...
      X