
  • Collapsing the mean of a dummy variable vs. collapsing the sum and defining the percentage by hand

    Hi everyone,

    I'm working with a dataset that includes labor market statistics for Latin America and the Caribbean (LAC). I'm analyzing labor informality using a variable called informal_ss, which is equal to 1 if the employed person does not contribute to social security, and 0 if they do contribute. This variable is defined only for employed individuals aged 18–65.

    I collapsed the data in two ways:
    1. collapse (sum) of informal_ss and ocupado
    2. collapse (mean) of informal_ss (renamed as m_informal_ss)
    Then I calculated an informal rate manually (by_hand) as the ratio between the sum of informal_ss and the sum of ocupado. I expected this manual calculation to match the mean, i.e., by_hand ≈ m_informal_ss.

    However, I observe small differences between the two rates in several countries (e.g., a difference of over 2 percentage points for Mexico).

    My question is:
    Why do I get differences between the rate calculated via collapse (mean) and the manual division after collapse (sum)?

    Exact code:


    *Informal: social security definition: including self-employed as a share of employment
    gen informal_ss=1 if djubila==0
    replace informal_ss=0 if djubila==1


    expand 2, gen(dup)
    tab dup

    replace pais = "LAC" if dup == 1

    *ocupado informal_ss informalssxsmall informal_ssxself_employed informalssxdependentssmall

    collapse (sum) ocupado informal_ss (mean) m_ocupado=ocupado m_informal_ss=informal_ss [fw=round(pondera)] if ocupado==1 & edad>=18 & edad<66, by(pais)

    gen by_hand=informal_ss/ocupado


    foreach x in m_ocupado m_informal_ss by_hand {
        replace `x'=`x'*100
    }


    gen diff=(by_hand)-m_informal_ss

    Results:

    pais    ocupado  informal_ss  m_ocupado  m_informal_ss   by_hand       diff
    ARG    12511837      6526443        100       52.35327  52.16215  -.1911201
    BOL     4985197      3816835        100       78.24393  76.56338   -1.68055
    BRA    95186045     3.31e+07        100       34.82551  34.82551          0
    CHL     8763029      2401302        100       27.71941  27.40265  -.3167648
    COL    20614075     1.15e+07        100       56.49807  56.00306  -.4950142
    CRI     2039818       497447        100         24.393  24.38683  -.0061722
    DOM    17512152      9867675        100       57.15142  56.34758  -.8038368
    ECU     7287688      4797207        100       65.82619  65.82619          0
    GTM     5519407      4434942        100       80.38461  80.35178  -.0328293
    HND     3401972      2744132        100       80.66298  80.66298          0
    LAC   259991486     1.36e+08        100       52.66125  52.13443  -.5268211
    MEX    57109772     3.81e+07        100       69.11126  66.79506    -2.3162
    PAN     1610167       854046        100       53.04083  53.04083          0
    PER    15942907     1.22e+07        100       76.46815  76.46815          0
    PRY     3280010      2466798        100       75.30729  75.20702  -.1002655
    SLV     2656377      1772289        100       66.71828  66.71828          0
    URY     1571033       334665        100       21.30223  21.30223          0

  • #2
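    Check for observations where one variable is missing but the other is not. In your code, informal_ss is left missing whenever djubila is neither 0 nor 1, so (mean) drops those observations from its denominator while (sum) of ocupado still counts them. That would make by_hand smaller than m_informal_ss, which matches the sign of every nonzero diff in your results. Also note that you may need the collapse results stored as long or double to avoid rounding in the large sums.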
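
    A minimal sketch of that check, assuming the variable names from the original post (djubila, informal_ss, ocupado, edad, pondera, pais) and that it is run on the person-level data in place of the original collapse. The names in_sample, miss_informal and n_informal are made up for illustration. If missing values are the whole story, the recomputed diff should be zero up to rounding error.

```stata
* Flag the estimation sample used in the original collapse
gen byte in_sample = ocupado==1 & edad>=18 & edad<66

* Flag in-sample observations where informal_ss is missing
* (this happens whenever djubila is neither 0 nor 1)
gen byte miss_informal = missing(informal_ss)

* Weighted share of employed people with missing informal_ss, by country
tab pais miss_informal if in_sample [fw=round(pondera)], row

* Redo the collapse, adding the weighted count of nonmissing informal_ss
* so that the hand-made ratio uses the same denominator as (mean)
collapse (sum) ocupado informal_ss              ///
         (count) n_informal=informal_ss         ///
         (mean) m_informal_ss=informal_ss       ///
         [fw=round(pondera)] if in_sample, by(pais)

* Ratio over nonmissing cases only, in percent
gen double by_hand = 100 * informal_ss / n_informal
gen double diff    = by_hand - 100 * m_informal_ss

list pais ocupado n_informal by_hand m_informal_ss diff, clean noobs
```

    Countries where ocupado and n_informal differ are the ones with employed people whose djubila is missing. Whether to drop those people or keep them in the denominator is a substantive choice about how the informality rate is defined, not a coding issue.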
