
  • Equalizing frequencies of all independent variables with dependent variable (including lower frequencies)

    Hi, could you please help? I used the code below to make the frequencies of the independent variables equal to the frequency of the dependent variable, so that the sample is the same across variables.


    * With no regressors, e(sample) only flags observations where TB_HIV_Knowledge is non-missing
    qui regress TB_HIV_Knowledge
    gen byte keep=e(sample)
    keep if keep


    I need further assistance. Is there something more I can do besides dropping the observations behind the lower frequencies? (I need all of the variables listed below.) The code equalized Age, Sex, Residence, and Province at 39165, the frequency of TB_HIV_Knowledge.
    However, the variables that had lower frequencies ended up with frequencies below 39165.

    Variables with different (lower) frequencies:
    Marital Status = 39157
    Employment = 39116
    Highest Education = 28119
    Gross Income = 19569

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float TB_HIV_Knowledge long Class_Income int Age_Recode byte(Sex Residence_Type Region Marital_Status Employment)
    1 .  9 1 2 1 1 4
    1 .  7 2 2 1 2 4
    1 .  2 1 2 1 2 4
    3 .  1 1 2 1 2 1
    1 1  1 2 2 1 2 3
    1 .  4 2 2 1 1 3
    1 . 10 2 2 1 1 1
    1 . 10 1 2 1 1 1
    1 1 10 1 2 1 2 4
    2 .  4 2 2 1 2 4
    2 1  4 1 2 1 2 4
    2 .  3 2 2 1 2 1
    1 1  7 2 2 1 2 4
    1 .  3 1 2 1 2 1
    1 .  1 2 2 1 2 3
    1 .  6 2 2 1 2 4
    1 1  5 1 2 1 2 4
    2 1  2 2 2 1 1 4
    2 .  8 1 2 1 1 4
    2 . 10 2 2 1 1 1
    1 1  2 1 2 1 2 4
    1 1  5 1 2 1 1 4
    1 1  4 2 2 1 1 4
    1 1  1 1 2 1 2 4
    1 .  1 2 2 1 2 3
    1 1  7 1 2 1 2 2
    1 .  4 1 2 2 3 1
    1 .  4 2 2 1 2 1
    1 .  4 1 2 2 3 1
    1 1  6 1 2 1 2 4
    1 1  5 2 2 1 2 4
    1 .  2 2 2 1 2 4
    3 1  9 1 2 1 1 4
    3 1 10 2 2 1 1 1
    2 .  1 2 2 1 2 3
    1 .  8 1 2 1 1 1
    1 2 10 1 2 1 1 4
    1 .  5 2 2 2 2 1
    1 1  4 1 1 1 2 4
    1 1  2 1 1 1 2 4
    1 1  2 1 1 1 2 4
    2 1  2 1 1 1 2 4
    2 1  2 1 1 1 2 4
    1 .  3 1 1 1 2 1
    1 1  9 2 1 1 2 1
    3 1  3 2 1 1 2 4
    1 1  9 1 1 1 1 1
    1 .  9 2 1 1 1 1
    1 1  5 2 1 1 1 4
    1 2  5 1 1 1 1 4
    1 2  2 1 1 1 2 4
    1 2  1 1 1 1 2 4
    1 1  9 1 1 1 1 5
    1 1  8 2 1 1 1 4
    1 2  2 1 1 1 2 4
    1 2  6 1 1 1 2 5
    1 .  8 2 1 1 1 1
    1 2  7 1 1 1 1 4
    1 2  1 2 1 1 2 1
    1 1  1 2 1 1 2 3
    1 2  8 2 1 1 1 4
    1 2  5 1 1 1 1 4
    1 1  8 2 1 1 2 1
    1 .  2 2 1 1 2 1
    1 1  2 2 1 1 2 1
    1 1  9 2 1 1 2 4
    1 .  9 1 1 1 1 4
    1 1  1 2 1 1 2 1
    1 1  7 2 1 1 3 4
    1 .  1 1 1 1 2 1
    1 2  5 2 1 1 2 4
    2 .  4 1 1 1 2 1
    1 .  3 2 1 1 2 1
    1 .  5 1 1 1 2 1
    1 1  4 2 1 1 2 4
    1 2 10 2 2 1 3 1
    3 1  4 1 2 1 1 4
    1 1  4 1 2 1 2 4
    1 2  4 1 2 1 1 4
    1 2  4 2 2 1 1 4
    2 1  1 1 2 1 2 4
    1 2  5 1 2 1 2 4
    1 2  5 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  5 1 2 1 2 4
    1 1  4 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  2 1 2 1 2 4
    1 1  2 1 2 1 2 4
    1 1  5 1 2 1 2 4
    1 1  4 1 2 1 2 4
    1 1  4 1 2 1 2 4
    1 1  4 1 2 1 2 4
    1 1  4 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  3 1 2 1 2 4
    1 1  2 1 2 1 2 4
    end
    label values TB_HIV_Knowledge TB_HIV_Knowledge
    label def TB_HIV_Knowledge 1 "True", modify
    label def TB_HIV_Knowledge 2 "False", modify
    label def TB_HIV_Knowledge 3 "Do Not Know", modify
    label values Class_Income Class_Income
    label def Class_Income 1 "Poor", modify
    label def Class_Income 2 "Working Class", modify
    label values Age_Recode Age_Recode
    label def Age_Recode 1 "15-19", modify
    label def Age_Recode 2 "20-24", modify
    label def Age_Recode 3 "25-29", modify
    label def Age_Recode 4 "30-34", modify
    label def Age_Recode 5 "35-39", modify
    label def Age_Recode 6 "40-44", modify
    label def Age_Recode 7 "45-49", modify
    label def Age_Recode 8 "50-54", modify
    label def Age_Recode 9 "55-59", modify
    label def Age_Recode 10 "60+", modify
    label values Sex sex_q
    label def sex_q 1 "Male", modify
    label def sex_q 2 "Female", modify
    label values Residence_Type Residence_Type
    label def Residence_Type 1 "Urban", modify
    label def Residence_Type 2 "Rural", modify
    label values Region province
    label def province 1 "Western Cape", modify
    label def province 2 "Eastern Cape", modify
    label values Marital_Status Marital_Status
    label def Marital_Status 1 "Married", modify
    label def Marital_Status 2 "Never Married", modify
    label def Marital_Status 3 "No longer Married", modify
    label values Employment q1_7
    label def q1_7 1 "Unemployed", modify
    label def q1_7 2 "Sick/disabled and unable to work", modify
    label def q1_7 3 "Student/pupil/learner", modify
    label def q1_7 4 "Employed / Self Employed", modify
    label def q1_7 5 "Other", modify

  • #2
    You should read about listwise deletion to understand how observation counts are determined in the presence of multiple variables: https://en.wikipedia.org/wiki/Listwise_deletion. If a variable initially has many missing values, leading to fewer observations than some target number of observations, there is no way to increase its observation count to meet that target if the software uses listwise deletion.
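The effect of listwise deletion can be checked directly. The following is only a minimal sketch, assuming Stata's built-in nlsw88 example data rather than the poster's survey file; the variable name n_miss is hypothetical. It counts non-missing values one variable at a time and then counts the observations that are complete on all of them, which is the sample an estimation command would use.

```stata
sysuse nlsw88, clear

* Non-missing count for each variable on its own
foreach v of varlist tenure grade industry {
    quietly count if !missing(`v')
    display "`v'" _col(12) "non-missing: " r(N)
}

* Number of these variables that are missing in each observation
* (n_miss is a hypothetical name chosen for this sketch)
egen byte n_miss = rowmiss(tenure grade industry)

* Complete cases: what listwise deletion leaves for estimation
count if n_miss == 0
```

The complete-case count can never exceed the smallest single-variable count, so a variable with many missing values sets the ceiling for any analysis that includes it.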
    Last edited by Andrew Musau; 03 Jul 2025, 17:26.



    • #3
      Could I ask why you are trying to do this?

      I ask because if it is only to regress (or run another estimation command), you don't need to drop observations with missing values -- Stata will do this for you automatically.

      You can also use the Stata function missing() to help identify observations that have no missing values on a set of variables.

      Consider this:

      Code:
      . sysuse nlsw88, clear
      (NLSW, 1988 extract)
      
      . misstable patterns tenure grade industry, freq
      
         Missing-value patterns
           (1 means complete)
      
                    |   Pattern
          Frequency |  1  2  3
        ------------+-------------
              2,215 |  1  1  1
                    |
                 15 |  1  1  0
                 14 |  1  0  1
                  2 |  0  1  1
        ------------+-------------
              2,246 |
      
        Variables are  (1) grade  (2) industry  (3) tenure
      
      . regress tenure grade i.industry
      
            Source |       SS           df       MS      Number of obs   =     2,215
      -------------+----------------------------------   F(12, 2202)     =     15.61
             Model |  5265.47118        12  438.789265   Prob > F        =    0.0000
          Residual |  61902.5935     2,202  28.1119862   R-squared       =    0.0784
      -------------+----------------------------------   Adj R-squared   =    0.0734
             Total |  67168.0647     2,214  30.3378793   Root MSE        =    5.3021
      
      <omitted table for brevity>
      
      . gen byte in_sample = e(sample)
      
      . tab in_sample
      
        in_sample |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |         31        1.38        1.38
                1 |      2,215       98.62      100.00
      ------------+-----------------------------------
            Total |      2,246      100.00
      
      . gen byte non_missing = !missing(tenure, grade, industry)
      
      . tab non_missing
      
      non_missing |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |         31        1.38        1.38
                1 |      2,215       98.62      100.00
      ------------+-----------------------------------
            Total |      2,246      100.00
      
      . assert in_sample == non_missing
      Last edited by Hemanshu Kumar; 05 Jul 2025, 15:19.
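If the goal is to report descriptive statistics on the same sample as the model, one option is to flag complete cases and restrict each command with an if qualifier instead of dropping observations. This is a sketch under that assumption, again using the nlsw88 example data; the variable name complete is hypothetical.

```stata
sysuse nlsw88, clear

* Flag observations with no missing values on the analysis variables
* (complete is a hypothetical name chosen for this sketch)
gen byte complete = !missing(tenure, grade, industry)

* Descriptives restricted to the common sample
summarize tenure grade if complete
tabulate industry if complete

* The full dataset is untouched and still available for other analyses
count
```

With the data from #1, the same pattern would pass the poster's own variables to missing(), for example TB_HIV_Knowledge, Class_Income, Age_Recode, Sex, Residence_Type, Region, Marital_Status, and Employment.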



      • #4
        Thank you very much, this is very helpful.
