Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting mean from multiple variables with minimum number of those variables have value

    Dear Stata Masters,

    I have the following dataset

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str3 regnum int date str1 regimen byte(qolno f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 f12 f13 f14 f15 f16 f17 f18 f19 f20 f21 f22 f23 f24 f25 f26)
    "1"   21641 "1" 1 4 2 5 5 3 3 3 2 2 3 5 2 3 4 4 5 4 4 2 4 4 4 5 4 4 3
    "1"   21768 "1" 2 4 2 4 4 3 3 3 3 3 3 5 3 3 2 3 4 4 4 3 4 4 3 4 4 4 4
    "1"   21859 "1" 3 4 2 4 4 3 3 3 2 3 3 3 1 2 2 3 3 3 3 2 4 4 2 3 3 3 2
    "1"   21917 "1" 4 4 3 3 3 3 3 3 3 4 4 3 4 4 3 4 4 4 3 3 4 4 3 3 4 4 3
    "1"   22014 "1" 5 3 3 4 5 4 3 3 3 3 3 5 3 2 4 1 1 4 3 4 4 4 1 3 1 1 2
    "1"   22154 "1" 9 4 4 5 5 4 4 4 4 4 4 4 3 5 4 4 4 3 4 5 4 5 3 4 4 4 4
    "1"   22299 "1" 9 4 4 4 5 3 4 3 3 4 3 4 3 3 3 4 3 4 4 4 4 4 3 3 3 4 4
    "10"  21655 "2" 1 4 1 5 1 5 5 4 3 1 5 3 2 5 5 3 3 5 3 4 5 3 3 5 3 4 2
    "10"  21755 "2" 2 5 3 3 5 5 5 4 5 4 5 2 1 5 5 2 3 4 2 5 2 3 5 5 1 3 5
    "10"  21858 "2" 3 2 4 3 4 5 5 5 5 5 5 1 1 5 5 3 1 4 5 4 4 5 5 5 5 5 3
    "10"  21928 "2" 4 4 4 3 3 3 5 3 2 5 4 1 1 5 3 4 1 2 3 3 5 5 5 5 3 5 3
    "10"  22159 "2" 9 5 4 5 5 4 4 5 3 5 4 4 1 5 5 3 1 4 3 3 1 3 5 5 4 4 4
    "100" 21794 "1" 1 3 3 2 5 1 3 4 4 1 4 4 1 2 2 5 3 1 1 3 3 1 1 1 1 4 3
    "100" 21938 "1" 3 3 3 3 2 2 4 3 4 2 4 4 1 2 1 4 4 2 1 3 4 1 1 2 2 4 3
    "101" 21796 "1" 1 5 3 5 1 5 5 5 5 3 5 5 2 4 1 5 1 5 5 5 5 1 5 3 4 5 4
    "101" 21885 "1" 2 4 4 4 3 4 5 4 3 2 5 5 3 4 5 3 3 5 5 5 5 1 5 4 4 3 4
    "101" 22041 "1" 3 5 5 3 4 4 5 4 5 4 5 4 4 5 4 3 3 4 5 5 5 1 5 4 4 4 4
    "101" 22076 "1" 4 3 4 5 5 5 5 5 1 1 4 5 5 5 2 1 4 4 2 5 5 5 5 5 1 1 5
    "102" 21796 "1" 1 3 3 3 1 4 4 5 5 4 5 5 1 3 3 1 4 4 3 3 5 1 3 5 5 4 5
    "102" 21900 "1" 2 4 4 5 3 4 4 5 5 4 4 4 1 1 1 1 4 4 3 4 1 1 4 5 5 1 4
    "102" 22027 "1" 3 3 3 2 3 2 3 5 3 3 3 4 1 3 1 2 4 2 1 4 4 3 4 2 5 3 4
    "102" 22069 "1" 4 4 4 3 5 3 4 5 3 5 3 3 3 3 3 3 4 2 2 2 4 3 4 4 5 4 4
    "102" 22284 "1" 9 2 4 4 4 3 2 3 4 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 4
    "103" 21802 "1" 1 4 4 5 4 4 5 5 5 5 3 3 3 4 3 5 5 4 4 4 5 5 5 5 5 5 4
    "103" 21931 "1" 2 4 4 3 4 4 4 3 5 4 4 4 3 5 5 5 3 3 3 4 5 4 3 5 5 4 4
    "103" 22000 "1" 3 4 4 5 4 4 4 4 5 4 4 3 3 5 3 5 5 4 4 4 5 5 1 5 4 3 3
    "103" 22088 "1" 4 5 5 5 5 5 5 5 5 5 4 4 3 4 3 5 5 5 5 5 5 5 5 4 5 5 5
    "104" 21803 "1" 1 1 1 1 3 4 5 1 5 4 1 1 3 1 1 1 3 1 1 3 4 1 1 5 4 1 3
    "104" 21936 "1" 2 3 1 4 4 3 5 3 5 5 4 4 4 3 1 3 2 5 1 3 5 1 4 4 4 1 5
    "104" 22097 "1" 4 2 4 3 5 2 3 3 3 3 4 2 3 3 3 2 4 3 3 3 4 3 3 4 4 3 3
    "104" 22295 "1" 9 3 5 5 5 4 5 5 5 5 5 5 5 5 5 5 5 2 2 5 5 5 5 5 5 5 5
    "106" 21804 "1" 1 3 1 4 1 5 3 1 2 4 3 5 1 1 1 1 1 5 1 3 5 1 1 3 5 1 3
    "106" 21928 "1" 2 3 3 4 4 2 2 2 2 4 2 1 1 2 2 3 2 3 3 3 3 3 1 3 4 2 2
    "106" 22022 "1" 3 2 2 2 1 3 4 5 2 3 1 4 1 2 1 1 1 1 1 4 1 4 2 1 3 1 2
    "106" 22081 "1" 4 4 4 4 4 3 4 5 3 3 4 5 3 3 3 3 3 5 5 4 3 5 3 2 3 2 3
    end
    format %td date
    1 row = 1 survey with 26 questions. One regnum can have multiple survey.

    24 questions are categorized into 4 domains:
    Domain 1: f3 + f4 + f10 + f15 + f16 + f17 + f18
    Domain 2: f5 + f6 + f7 + f11 + f19 + f26
    Domain 3: f20 + f21 + f22
    Domain 4: f8 + f9 + f12 + f13 + f14 + f23 + f24 + f25

    I need to calculate the average of each domain with a condition that only 1 question could be empty. If more than 1 question is empty, then the domain must not be calculated.
    Then, one survey needs to be excluded if >=20% of questions are empty (>=21 questions answered)

    Does anyone could help me with these two questions?

    Thank you in advance.

    Vini


  • #2
    Because your example data does not contain missing values for any of the variables, I cannot claim to have thoroughly tested this code. But I believe it does what you ask.

    Code:
    local domain1 f3 f4 f10 f15 f16 f17 f18
    local domain2 f5 f6 f7 f11 f19 f26
    local domain3 f20 f21 f22
    local domain4 f8 f9 f12 f13 f14 f23 f24 f25
    
    capture program drop score_one_domain
    program define score_one_domain
        syntax varlist, gen(name)
        tempvar mcount
        egen `mcount' = rowmiss(`varlist')
        egen `gen' = rowtotal(`varlist') if `mcount' <= 1
        exit
    end
    
    egen nmcount = rownonmiss(f1-f26)
    drop if nmcount < 21
    
    forvalues i = 1/4 {
        score_one_domain `domain`i'', gen(domain_score`i')
    }
    Added: That said, it is odd to define the score as the sum of the items when you are allowing the number of items to vary (because 1 item might be missing). It is more usual, and more sensible, to use the mean of the items as the score in that circumstance--effectively imputing the mean of the other items for the missing response. If this is what you want (and the title of the post suggests that perhaps it is, notwithstanding what is written in the text), then replace -rowtotal- by -rowmean-.
    Last edited by Clyde Schechter; 08 Sep 2022, 09:49.

    Comment


    • #3
      Dear Clyde Schechter , thank you for this. It works! I just need to learn each of the step another time.
      In terms of the calculation, yes, I was thinking to add another step to calculate the mean by dividing the total of each domain with total questions with value in that domain. Glad to know that it can be done in one go using rowmean!

      Thanks again for this.

      Comment

      Working...
      X