Drop if identical answers for one observation

Dan Rebenich

Join Date: Nov 2023

Posts: 11
#1

28 Nov 2023, 09:50

Hello everyone,

I am currently working on replicating Schwartz's model of values. Right from the start it is advised to "exclude[] persons with more than 5 missing responses and those who gave the same answer to more than 16 value items" (Bilsky et al. 2011: 762). While I successfully managed to meet the first requirement by creating a variable that counts the number of missing values in varlist for each observation (egen X = rowmiss(varlist)), I cannot figure out how to check whether one person has given 17 or more identical answers to the 21 value items. All items are numeric.
Is there another egen function I am missing? Do you have a workaround?

Thank you so much in advance!
Dan

Nick Cox

Join Date: Mar 2014
Posts: 35757

28 Nov 2023, 10:35

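If anyone has given 17 identical numerical answers to 21 questions, that answer will be the median.

That checks out whether the repeated answer is the lowest value, the highest, or anything in between: once sorted, 17 identical values fill 17 consecutive positions out of 21, and any such run includes position 11, the median.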

So, count how many values are equal to the median. I used rowsort (Stata Journal) here but only to get a neat sandbox to play in. In this reduced example, choose your own threshold.

Code:

clear

set obs 10 

forval j = 1/10 { 
  gen y`j' = cond(`j' < _n, 7, runiformint(1, 10))
} 

rowsort y*, gen(Y1-Y10) 

* you start here 
egen Ymedian = rowmedian(Y*) 

gen eqmedian = 0 

forval j = 1/10 { 
   replace eqmedian = eqmedian + (Y`j' == Ymedian)
}

list Y* *median 
    +---------------------------------------------------------------------------------+
     | Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9   Y10   Ymedian   Ymedian   eqmedian |
     |---------------------------------------------------------------------------------|
  1. |  1    2    3    3    4    4    5    5    6     7         4         4          2 |
  2. |  2    3    4    4    4    6    6    6    6     7         5         5          0 |
  3. |  1    6    6    6    7    7    7    8    9    10         7         7          3 |
  4. |  1    2    3    4    5    5    6    7    7     7         5         5          2 |
  5. |  1    5    6    7    7    7    7    9    9    10         7         7          4 |
     |---------------------------------------------------------------------------------|
  6. |  1    2    6    6    7    7    7    7    7     7         7         7          6 |
  7. |  4    5    7    7    7    7    7    7    7    10         7         7          7 |
  8. |  1    7    7    7    7    7    7    7    7    10         7         7          8 |
  9. |  5    6    7    7    7    7    7    7    7     7         7         7          8 |
 10. |  4    7    7    7    7    7    7    7    7     7         7         7          9 |
     +---------------------------------------------------------------------------------+


Dan Rebenich

Join Date: Nov 2023

Posts: 11
#3

28 Nov 2023, 15:29

Thank you for your answer. However, I ran into two problems:

Code:

// Variables in question: i_crtiv, i_hlpplp ... 21 Items
egen i_median = rowmedian(i_*)
gen i_eqmedian = 0
foreach var of varlist i_* {
    replace i_eqmedian = i_eqmedian + (`var' == i_median)
}

First, rowmedian produces decimals (e.g. 3.5, 4.5 etc.). How can I prevent this?

Second, my i_eqmedian ends up 1 too high (e.g. 16 identical values, i_eqmedian == 17). Have I implemented your suggestion incorrectly? I tried adding a "-1" to the expression, which broke everything for a reason unknown to me.

Thank you in advance!
Dan

Last edited by Dan Rebenich; 28 Nov 2023, 15:31.
Nick Cox

Join Date: Mar 2014

Posts: 35757
#4

28 Nov 2023, 15:37

If the median of 21 values is 3.5 or 4.5 that must have been one of the original values, but I don’t think that invalidates the method. The question still is whether someone gave that answer. Note that if you have any missing values you need extra rules.

A real or realistic data example might help here.
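
As a small illustration of the point about missing values, here is a toy sandbox (the variables a1-a4, amed and nanswered are made up for this sketch and are not from the thread's data). It relies on egen's rowmedian() ignoring missing values, so a row with an even number of answers can get a median halfway between two answers, while a row with an odd number gets one of the answers actually given.

```stata
clear
input a1 a2 a3 a4
3 3 4 4
3 4 4 .
end

* median over the nonmissing answers in each row
egen amed = rowmedian(a1-a4)

* how many answers each row actually contains
egen nanswered = rownonmiss(a1-a4)

* row 1 has four answers (3 3 4 4), so amed is 3.5
* row 2 has three answers (3 4 4), so amed is 4
list
```

A count like nanswered is one way to spot rows where the median may not be one of the answers given.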
Dan Rebenich

Join Date: Nov 2023

Posts: 11
#5

28 Nov 2023, 16:04

Thank you, I figured it out. Had to name i_median and i_eqmedian differently in order not to mess up the "i_*" varlist. Well...
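
For anyone hitting the same problem, a minimal sketch of that fix in a deterministic sandbox. The item names i_q1-i_q21 and the new variable names med_items and n_eqmed are hypothetical; the only point is that the new variables must not match the i_* wildcard, otherwise the loop also compares the median with itself and the count comes out 1 too high.

```stata
clear
set obs 5

* toy stand-ins for the 21 items: observation n answers 3 on its first 4*n items
forval j = 1/21 {
    gen i_q`j' = cond(`j' <= 4*_n, 3, mod(`j', 6) + 1)
}

* names chosen so that they do NOT match i_*
egen med_items = rowmedian(i_*)
gen n_eqmed = 0

foreach v of varlist i_* {
    replace n_eqmed = n_eqmed + (`v' == med_items)
}

list med_items n_eqmed

* rule quoted in #1: same answer to more than 16 value items
drop if n_eqmed > 16
```

In this sandbox observations 4 and 5 give the answer 3 to 17 and 20 items respectively, so they are the ones dropped.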
Nick Cox

Join Date: Mar 2014

Posts: 35757
#6

29 Nov 2023, 07:48

I can't resist pointing out that this is easier in long layout. I dub Schechter's Law the generalization that in Stata it is usually easier to work in long layout than in wide layout. Here Clyde Schechter gets the credit not for discovering this (it is ancient Stata folklore) but for being its most energetic and articulate exponent. But the word layout here I do owe to Clyde as an alternative to overloaded terms like format and structure.

Most of this code is just a way to get a sandbox.

Code:

clear
set obs 100
set seed 2803
egen id = seq(), block(10)
gen y = cond(runiform() > id/9, 7, runiformint(1, 10))
bysort id y : gen freq = _N
tabstat freq, s(max) by(id)
bysort id (freq) : drop if freq[_N] > 7

The essentials are these:

0. Different answers for the same person are in different observations.

1. We get the frequencies of each answer for each person.

Code:

bysort id y : gen freq = _N

2. We drop according to some threshold e.g. 8, 9 or 10 identical answers out of 10 are not acceptable.

Code:

bysort id (freq) : drop if freq[_N] > 7
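
If the data start out in wide layout, as in #1, the same steps can be sketched with reshape long first. This is only a sketch under assumptions: the toy items i_a-i_d and the names id, item, freq and maxfreq are made up, and the threshold of 2 suits four toy items (with 21 items the quoted rule would be maxfreq > 16). Missing answers are kept out of the frequency count, and the drop is guarded because a missing maxfreq would otherwise count as greater than any number.

```stata
clear
input i_a i_b i_c i_d
1 1 1 1
1 2 3 4
2 2 2 .
end

gen long id = _n

* one observation per person and item; item holds the name suffix
reshape long i_, i(id) j(item) string

* frequency of each nonmissing answer within person
bysort id i_ : gen freq = _N if !missing(i_)

* highest frequency for each person, ignoring missing freq
egen maxfreq = max(freq), by(id)

* drop persons who gave the same answer more than 2 times
drop if maxfreq > 2 & !missing(maxfreq)

* back to wide layout
drop freq
reshape wide i_, i(id) j(item) string
list
```

Only the second person, who gave four different answers, survives here.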