Command to classify variable based on mean

Said Mohamed

Join Date: Aug 2022

Posts: 11
#1

11 Aug 2025, 04:08

Hello,

I have a dataset with several Yes/No variables representing whether participants know specific breast cancer risk factors: Family History, Contraceptive use, No Breast Feeding, Early Menarche, Late Menopause, High Fat Diet, and many more that I can't post in full. Each correct response is coded as “Yes” and each incorrect one as “No.” I want to:

1. Recode “Yes” as 1 and “No” as 0.

2. Create a cumulative awareness score for each participant.

3. Calculate the mean score across all participants.

4. Categorize participants as “Aware” if their score is at or above the mean, and “Unaware” if their score is below the mean.

Which Stata command can I use to do this? Thank you in advance for your help!
Nick Cox

Join Date: Mar 2014

Posts: 35782
#2

11 Aug 2025, 04:39

You should give a data example, even if fake and based on a few named variables and observations. See FAQ Advice #12.

On your #1

Code:

label define indicator 0 No 1 Yes
encode have, gen(want) label(indicator)

gives the flavour of a conversion. It's likely that you don't need to repeat the encode for each variable, but could employ a loop over variables.
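
A minimal sketch of such a loop, assuming the string variables are called item1 to item6 (hypothetical names, not taken from the original data) and contain exactly "Yes" and "No":

Code:

* Sketch only: item1-item6 are assumed names for the Yes/No string variables
label define indicator 0 "No" 1 "Yes"
foreach v of varlist item1-item6 {
    * noextend makes encode stop with an error if a value other than
    * "No" or "Yes" (for example "yes" or "NA") turns up
    encode `v', gen(n_`v') label(indicator) noextend
}

Defining the label before the loop matters: without it, encode would number the values alphabetically, so that No becomes 1 and Yes becomes 2.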

I can't get past your #2. Cumulative with respect to what and why? Across variables? Over time, given some other handle in your data?

#3 and #4 seem arbitrary.

Maarten Buis

Join Date: Mar 2014
Posts: 3467
#3

11 Aug 2025, 04:54

Code:

// create an example dataset (with missing values (NA))
clear all
set seed 12345
set obs 100

gen fam_hist          = cond(runiform()<.5,"yes",cond(runiform()<0.9, "no", "NA"))       
gen contr             = cond(runiform()<.5,"yes",cond(runiform()<0.9, "no", "NA"))
gen no_breast_feeding = cond(runiform()<.5,"yes",cond(runiform()<0.9, "no", "NA"))
gen early_menarche    = cond(runiform()<.5,"yes",cond(runiform()<0.9, "no", "NA"))
gen late_menopause    = cond(runiform()<.5,"yes",cond(runiform()<0.9, "no", "NA"))
gen high_fat          = cond(runiform()<.5,"yes",cond(runiform()<0.9, "no", "NA"))

label var fam_hist          "Family History"
label var contr             "Contraceptive use"
label var no_breast_feeding "No Breast Feeding"
label var early_menarche    "Early Menarche"
label var late_menopause    "Late Menopause"
label var high_fat          "High Fat Diet"

list in 1/10

// turn the string variable into indicator (dummy) variables
local aware_vars fam_hist contr no_breast_feeding  ///
                 early_menarche late_menopause high_fat          
                 
foreach var of local aware_vars {
    gen byte num`var':yesno_lb = 1  if `var' == "yes"
    replace  num`var'          = 0  if `var' == "no"
    replace  num`var'          = .a if `var' == "NA"
    label var num`var' `"`: variable label `var''"'
}                
label define yesno_lb 0 "no" 1 "yes" .a "not available"

// awareness score
egen aware = rowmean(num*)

// compute the mean
sum aware

// dummify that variable (probably a bad idea)
gen byte daware:yesno_lb =aware > r(mean) if !missing(aware)

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------


Chen Samulsion

Join Date: Jan 2018
Posts: 932
#4

11 Aug 2025, 04:56

You should post a data example to get a good answer. Suppose your variables are stored as strings and need to be encoded, i.e. converted from string to numeric variables.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str3(item1 item2 item3 item4 item5 item6)
"Yes" "No"  "Yes" "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "Yes" "No"  "No"  "Yes" "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "Yes" "No"  "No"  "Yes" "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "Yes"
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "Yes" "No"  "No"  "No"  "No" 
"Yes" "Yes" "No"  "No"  "No"  "Yes"
"Yes" "Yes" "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "Yes" "No" 
"No"  "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "Yes" "No"  "Yes" "Yes" "Yes"
"Yes" "Yes" "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "Yes" "No"  "Yes" "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "Yes" "Yes" "No"  "No"  "No" 
"No"  "Yes" "No"  "No"  "No"  "Yes"
"Yes" "Yes" "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "No"  "No"  "No"  "No"  "Yes"
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "Yes" "No"  "Yes" "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "Yes" "Yes" "Yes" "Yes" "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "No"  "No"  "No"  "No"  "No" 
"Yes" "Yes" "No"  "Yes" "No"  "No" 
"Yes" "Yes" "No"  "No"  "No"  "No" 
end

label variable item1 "Family History" 
label variable item2 "Contraceptive use" 
label variable item3 "No Breast Feeding" 
label variable item4 "Early Menarche" 
label variable item5 "Late Menopause" 
label variable item6 "High Fat Diet"

Code:

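* define the label first so that encode maps No to 0 and Yes to 1
* (otherwise encode numbers the values alphabetically: No = 1, Yes = 2)
label define yesno 0 "No" 1 "Yes"
encode item1, gen(eitem1) label(yesno)
encode item2, gen(eitem2) label(yesno)
encode item3, gen(eitem3) label(yesno)
encode item4, gen(eitem4) label(yesno)
encode item5, gen(eitem5) label(yesno)
encode item6, gen(eitem6) label(yesno)
* score = number of items answered Yes (coded 1)
egen score=anycount(eitem1-eitem6), values(1)
summarize score, meanonly
display "mean score across all participants = " r(mean)
* use the stored mean (1.56 in this example) rather than typing it in
gen awareness=cond(score>=r(mean),1,0)
label define awareness 1 Aware 0 Unaware
label values awareness awareness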


Clyde Schechter

Join Date: Apr 2014

Posts: 30169
#5

11 Aug 2025, 09:03

Another way to get 0/1 variables out of all your No/Yes string variables is with Daniel Klein's -encoder- package, available from SSC. That eliminates the need for looping or tediously writing lines of repetitious code. For example, if your data look like what Chen Samulsion suggests, it becomes a one-liner:

Code:

encoderall item*, setzero
Said Mohamed

Join Date: Aug 2022

Posts: 11
#6

11 Aug 2025, 09:24

Thank you. I am attaching the dataset here; please take a look. I want to categorize participants as “Aware” if their score is at or above the mean, and “Unaware” if their score is below the mean.
Attached Files

Untitled spreadsheet (2).xlsx (23.7 KB, 1 view)

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1479
#7

11 Aug 2025, 10:01

Consider this:

Code:

import excel using "Untitled spreadsheet (2).xlsx", clear cellrange(B1) firstrow
label define indicator 0 No 1 Yes

ds , has(type string)
foreach var in `r(varlist)' {
    encode `var', gen(n_`var') label(indicator)
}

egen byte score = rowtotal(n_*)
sum score, meanonly

gen byte aware = (score >= `r(mean)')
label var aware "Has awareness score of at least `r(mean)'"
label define awareness 0 "Unaware" 1 "Aware"
label values aware awareness

which will give you

Code:

. tab aware

        Has |
  awareness |
score of at |
      least |
13.59895833 |
     333333 |      Freq.     Percent        Cum.
------------+-----------------------------------
    Unaware |        162       42.19       42.19
      Aware |        222       57.81      100.00
------------+-----------------------------------
      Total |        384      100.00

Last edited by Hemanshu Kumar; 11 Aug 2025, 10:07.


Dirk Enzmann

Join Date: Apr 2014

Posts: 579
#8

11 Aug 2025, 18:32

Said Mohamed, in #6 you respond to the requests in #2 and #4 for a data example by attaching an Excel file. This is not what you should do.

A year ago you were already asked to read the Stata Forum FAQ before posting: "Please, read the FAQ of the Stata Forum thoroughly". Why don't you follow this excellent advice? If you did, you would come across #12 in the FAQ and would not have attached an Excel file. Why you should not do this, and how to show us your data instead, is explained there.

If your data are in Excel format only and you have problems importing them into Stata (to use dataex subsequently), show us the Stata commands you are using to import the Excel data. When showing us the Stata commands, please enclose them in code delimiters (which is also explained in the FAQ, #12).
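
As a rough sketch of that workflow, assuming a workbook called mydata.xlsx with variable names in its first row (the file name and layout are placeholders, not taken from this thread):

Code:

* Sketch only: the file name and sheet layout are assumptions
import excel using "mydata.xlsx", firstrow clear
* dataex ships with recent Stata releases; if it is missing: ssc install dataex
dataex in 1/20

The output of dataex can then be pasted into a post between code delimiters, so that others can recreate the example exactly.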
Said Mohamed

Join Date: Aug 2022

Posts: 11
#9

12 Aug 2025, 05:05

Thank you for the feedback, and I sincerely apologize for not following the forum guidelines more carefully. If I run into any issues, I’ll share the exact commands I'm using, properly formatted as explained in the FAQ. Thank you again for your patience and guidance.