Equalizing frequencies of all independent variables with dependent variable (including lower frequencies)

Sonwabile Mbuma

Join Date: Jun 2021
Posts: 24

03 Jul 2025, 14:23

Hi, could you please help? I used the code below to make the frequency of each independent variable equal to the frequency of the dependent variable, so that the sample is the same across variables.

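* Intercept-only regression: e(sample) marks observations with non-missing TB_HIV_Knowledge
quietly regress TB_HIV_Knowledge
* Keep only the observations used in that regression
gen byte keep = e(sample)
keep if keep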

I need further assistance. Is there anything more I can do besides dropping the variables with lower frequencies (I need all the variables listed below)? The code equalized Age, Sex, Residence, and Province to 39165 (the frequency of TB_HIV_Knowledge).
However, the variables that had lower frequencies still ended up with frequencies below 39165.

Variables with different sample sizes/frequencies:
Marital Status = 39157
Employment = 39116
Highest Education = 28119
Gross Income = 19569

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float TB_HIV_Knowledge long Class_Income int Age_Recode byte(Sex Residence_Type Region Marital_Status Employment)
1 .  9 1 2 1 1 4
1 .  7 2 2 1 2 4
1 .  2 1 2 1 2 4
3 .  1 1 2 1 2 1
1 1  1 2 2 1 2 3
1 .  4 2 2 1 1 3
1 . 10 2 2 1 1 1
1 . 10 1 2 1 1 1
1 1 10 1 2 1 2 4
2 .  4 2 2 1 2 4
2 1  4 1 2 1 2 4
2 .  3 2 2 1 2 1
1 1  7 2 2 1 2 4
1 .  3 1 2 1 2 1
1 .  1 2 2 1 2 3
1 .  6 2 2 1 2 4
1 1  5 1 2 1 2 4
2 1  2 2 2 1 1 4
2 .  8 1 2 1 1 4
2 . 10 2 2 1 1 1
1 1  2 1 2 1 2 4
1 1  5 1 2 1 1 4
1 1  4 2 2 1 1 4
1 1  1 1 2 1 2 4
1 .  1 2 2 1 2 3
1 1  7 1 2 1 2 2
1 .  4 1 2 2 3 1
1 .  4 2 2 1 2 1
1 .  4 1 2 2 3 1
1 1  6 1 2 1 2 4
1 1  5 2 2 1 2 4
1 .  2 2 2 1 2 4
3 1  9 1 2 1 1 4
3 1 10 2 2 1 1 1
2 .  1 2 2 1 2 3
1 .  8 1 2 1 1 1
1 2 10 1 2 1 1 4
1 .  5 2 2 2 2 1
1 1  4 1 1 1 2 4
1 1  2 1 1 1 2 4
1 1  2 1 1 1 2 4
2 1  2 1 1 1 2 4
2 1  2 1 1 1 2 4
1 .  3 1 1 1 2 1
1 1  9 2 1 1 2 1
3 1  3 2 1 1 2 4
1 1  9 1 1 1 1 1
1 .  9 2 1 1 1 1
1 1  5 2 1 1 1 4
1 2  5 1 1 1 1 4
1 2  2 1 1 1 2 4
1 2  1 1 1 1 2 4
1 1  9 1 1 1 1 5
1 1  8 2 1 1 1 4
1 2  2 1 1 1 2 4
1 2  6 1 1 1 2 5
1 .  8 2 1 1 1 1
1 2  7 1 1 1 1 4
1 2  1 2 1 1 2 1
1 1  1 2 1 1 2 3
1 2  8 2 1 1 1 4
1 2  5 1 1 1 1 4
1 1  8 2 1 1 2 1
1 .  2 2 1 1 2 1
1 1  2 2 1 1 2 1
1 1  9 2 1 1 2 4
1 .  9 1 1 1 1 4
1 1  1 2 1 1 2 1
1 1  7 2 1 1 3 4
1 .  1 1 1 1 2 1
1 2  5 2 1 1 2 4
2 .  4 1 1 1 2 1
1 .  3 2 1 1 2 1
1 .  5 1 1 1 2 1
1 1  4 2 1 1 2 4
1 2 10 2 2 1 3 1
3 1  4 1 2 1 1 4
1 1  4 1 2 1 2 4
1 2  4 1 2 1 1 4
1 2  4 2 2 1 1 4
2 1  1 1 2 1 2 4
1 2  5 1 2 1 2 4
1 2  5 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  5 1 2 1 2 4
1 1  4 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  2 1 2 1 2 4
1 1  2 1 2 1 2 4
1 1  5 1 2 1 2 4
1 1  4 1 2 1 2 4
1 1  4 1 2 1 2 4
1 1  4 1 2 1 2 4
1 1  4 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  3 1 2 1 2 4
1 1  2 1 2 1 2 4
end
label values TB_HIV_Knowledge TB_HIV_Knowledge
label def TB_HIV_Knowledge 1 "True", modify
label def TB_HIV_Knowledge 2 "False", modify
label def TB_HIV_Knowledge 3 "Do Not Know", modify
label values Class_Income Class_Income
label def Class_Income 1 "Poor", modify
label def Class_Income 2 "Working Class", modify
label values Age_Recode Age_Recode
label def Age_Recode 1 "15-19", modify
label def Age_Recode 2 "20-24", modify
label def Age_Recode 3 "25-29", modify
label def Age_Recode 4 "30-34", modify
label def Age_Recode 5 "35-39", modify
label def Age_Recode 6 "40-44", modify
label def Age_Recode 7 "45-49", modify
label def Age_Recode 8 "50-54", modify
label def Age_Recode 9 "55-59", modify
label def Age_Recode 10 "60+", modify
label values Sex sex_q
label def sex_q 1 "Male", modify
label def sex_q 2 "Female", modify
label values Residence_Type Residence_Type
label def Residence_Type 1 "Urban", modify
label def Residence_Type 2 "Rural", modify
label values Region province
label def province 1 "Western Cape", modify
label def province 2 "Eastern Cape", modify
label values Marital_Status Marital_Status
label def Marital_Status 1 "Married", modify
label def Marital_Status 2 "Never Married", modify
label def Marital_Status 3 "No longer Married", modify
label values Employment q1_7
label def q1_7 1 "Unemployed", modify
label def q1_7 2 "Sick/disabled and unable to work", modify
label def q1_7 3 "Student/pupil/learner", modify
label def q1_7 4 "Employed / Self Employed", modify
label def q1_7 5 "Other", modify

Andrew Musau

Join Date: Oct 2014

Posts: 10195
#2

03 Jul 2025, 17:22

You should read about listwise deletion to understand how observation counts are determined in the presence of multiple variables: https://en.wikipedia.org/wiki/Listwise_deletion. If a variable initially has many missing values, leading to fewer observations than some target number of observations, there is no way to increase its observation count to meet that target if the software uses listwise deletion.
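
As a rough illustration of this point, here is a minimal sketch on made-up data (the variables y, x1, x2 and their values are hypothetical, not taken from the survey). It shows that the number of complete cases can be no larger than the smallest per-variable count, and is often smaller because the gaps fall in different rows.

```stata
* Toy data (hypothetical values): y is complete, x1 and x2 have gaps
clear
input y x1 x2
1 1 .
2 . 3
3 2 1
4 3 .
5 4 2
6 5 4
7 . .
8 6 3
end

* Per-variable non-missing counts: 8, 6 and 5
count if !missing(y)
count if !missing(x1)
count if !missing(x2)

* Rows with no missing value on any of the three variables: 4
count if !missing(y, x1, x2)

* regress applies listwise deletion, so it uses exactly those rows
quietly regress y x1 x2
display e(N)
```

Here x2 has 5 non-missing values but only 4 rows are complete, because one of those 5 rows is missing x1. No command can raise the count for x2 above 5 short of filling in the missing values.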

Last edited by Andrew Musau; 03 Jul 2025, 17:26.

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1400

05 Jul 2025, 15:16

Could I ask why you are trying to do this?

I ask because, if the goal is only to run regress (or another estimation command), you don't need to drop observations with missing values: Stata excludes them automatically.

You can also use the Stata function missing() to help identify observations that have no missing values on a set of variables.

Consider this:

Code:

. sysuse nlsw88, clear
(NLSW, 1988 extract)

. misstable patterns tenure grade industry, freq

   Missing-value patterns
     (1 means complete)

              |   Pattern
    Frequency |  1  2  3
  ------------+-------------
        2,215 |  1  1  1
              |
           15 |  1  1  0
           14 |  1  0  1
            2 |  0  1  1
  ------------+-------------
        2,246 |

  Variables are  (1) grade  (2) industry  (3) tenure

. regress tenure grade i.industry

      Source |       SS           df       MS      Number of obs   =     2,215
-------------+----------------------------------   F(12, 2202)     =     15.61
       Model |  5265.47118        12  438.789265   Prob > F        =    0.0000
    Residual |  61902.5935     2,202  28.1119862   R-squared       =    0.0784
-------------+----------------------------------   Adj R-squared   =    0.0734
       Total |  67168.0647     2,214  30.3378793   Root MSE        =    5.3021

<omitted table for brevity>

. gen byte in_sample = e(sample)

. tab in_sample

  in_sample |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         31        1.38        1.38
          1 |      2,215       98.62      100.00
------------+-----------------------------------
      Total |      2,246      100.00

. gen byte non_missing = !missing(tenure, grade, industry)

. tab non_missing

non_missing |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         31        1.38        1.38
          1 |      2,215       98.62      100.00
------------+-----------------------------------
      Total |      2,246      100.00

. assert in_sample == non_missing
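
The same idea might be applied to the variables in the original question along these lines. This is only a sketch: it loads the first six rows of the -dataex- excerpt from post #1 so that it runs on its own, and the names n_miss and complete are made up for illustration. Because missing() needs a comma-separated list, the sketch uses egen's rowmiss() function, which accepts a varlist.

```stata
* First six rows of the -dataex- excerpt from post #1 (value labels omitted)
clear
input float TB_HIV_Knowledge long Class_Income int Age_Recode byte(Sex Residence_Type Region Marital_Status Employment)
1 .  9 1 2 1 1 4
1 .  7 2 2 1 2 4
1 .  2 1 2 1 2 4
3 .  1 1 2 1 2 1
1 1  1 2 2 1 2 3
1 .  4 2 2 1 1 3
end

local vars TB_HIV_Knowledge Class_Income Age_Recode Sex Residence_Type Region Marital_Status Employment

* Report how many observations each variable is missing
misstable summarize `vars'

* Count missing values per row, then flag complete cases without dropping anything
egen byte n_miss = rowmiss(`vars')
gen byte complete = n_miss == 0

* Restrict any tabulation to the common sample with an if qualifier
tab complete
tab Marital_Status if complete
```

Keeping the flag instead of dropping observations leaves the full data available for analyses that do not involve the sparser variables.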

Last edited by Hemanshu Kumar; 05 Jul 2025, 15:19.

Sonwabile Mbuma

Join Date: Jun 2021

Posts: 24
#4

Yesterday, 07:41

Thank you very much, this is very helpful.