  • Drop if identical answers for one observation

    Hello everyone,

    I am currently working on replicating Schwartz's model of values. Right from the start it is advised to "exclude[] persons with more than 5 missing responses and those who gave the same answer to more than 16 value items" (Bilsky et al. 2011: 762). I successfully met the first requirement by creating a variable that counts the number of missing values in varlist for each observation (egen X = rowmiss(varlist)), but I cannot figure out how to measure whether one person has given 17 identical answers to the 21 value items. All items are numerical.
    Is there another egen function I am missing? Do you have a workaround?

    Thank you so much in advance!
    Dan

  • #2
    If anyone has given 17 identical numerical answers to 21 questions, that answer will be the median, because it fills more than half of the sorted positions.

    That checks out for the answer being the lowest value and for the highest and so for any intermediate value.

    So, count how many values are equal to the median. I used rowsort (Stata Journal) here but only to get a neat sandbox to play in. In this reduced example, choose your own threshold.

    Code:
    clear
    
    set obs 10 
    
    forval j = 1/10 { 
      gen y`j' = cond(`j' < _n, 7, runiformint(1, 10))
    } 
    
    rowsort y*, gen(Y1-Y10) 
    
    * you start here 
    egen Ymedian = rowmedian(Y*) 
    
    gen eqmedian = 0 
    
    forval j = 1/10 { 
       replace eqmedian = eqmedian + (Y`j' == Ymedian)
    }
    
    list Y* *median 
         +---------------------------------------------------------------------------------+
         | Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9   Y10   Ymedian   Ymedian   eqmedian |
         |---------------------------------------------------------------------------------|
      1. |  1    2    3    3    4    4    5    5    6     7         4         4          2 |
      2. |  2    3    4    4    4    6    6    6    6     7         5         5          0 |
      3. |  1    6    6    6    7    7    7    8    9    10         7         7          3 |
      4. |  1    2    3    4    5    5    6    7    7     7         5         5          2 |
      5. |  1    5    6    7    7    7    7    9    9    10         7         7          4 |
         |---------------------------------------------------------------------------------|
      6. |  1    2    6    6    7    7    7    7    7     7         7         7          6 |
      7. |  4    5    7    7    7    7    7    7    7    10         7         7          7 |
      8. |  1    7    7    7    7    7    7    7    7    10         7         7          8 |
      9. |  5    6    7    7    7    7    7    7    7     7         7         7          8 |
     10. |  4    7    7    7    7    7    7    7    7     7         7         7          9 |
         +---------------------------------------------------------------------------------+
    



    • #3
      Thank you for your answer. However, I ran into two problems:

      Code:
      // Variables in question: i_crtiv, i_hlpplp ... 21 Items
      
      egen i_median = rowmedian(i_*)
      gen i_eqmedian = 0
      foreach variable of varlist i_* {
          replace i_eqmedian = i_eqmedian + (`variable' == i_median)
      }

      First, rowmedian produces decimals (e.g. 3.5, 4.5 etc.). How can I prevent this?

      Second, my i_eqmedian ends up 1 too high (e.g. 16 identical values, i_eqmedian == 17). Have I implemented your suggestion incorrectly? I tried adding a "-1" to the expression, but that broke everything for reasons I don't understand.

      Thank you in advance!
      Dan
      Last edited by Dan Rebenich; 28 Nov 2023, 15:31.



      • #4
        If the median of 21 values is 3.5 or 4.5 that must have been one of the original values, but I don’t think that invalidates the method. The question still is whether someone gave that answer. Note that if you have any missing values you need extra rules.

        A real or realistic data example might help here.
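
        A minimal sketch of one possible extra rule, on made-up data. The variables y1-y5, med, n_ok and n_eq are hypothetical stand-ins, and whether to compare the count against a fixed number or against the number of non-missing answers is a judgment call this sketch does not settle. The point it illustrates is that Stata treats missing == missing as true, so a row with all answers missing would otherwise score a full count.

        ```stata
        clear
        input y1 y2 y3 y4 y5
        3 3 3 3 .
        . . . . .
        1 2 4 . .
        end

        * rowmedian() ignores missing values; it is missing only if all are missing
        egen med = rowmedian(y1-y5)

        * number of answers actually given
        egen n_ok = rownonmiss(y1-y5)

        * count answers equal to the median, never counting a missing answer
        gen n_eq = 0
        foreach v of varlist y1-y5 {
            replace n_eq = n_eq + (`v' == med & !missing(`v'))
        }

        list
        ```

        Here n_eq can be judged against n_ok or against a fixed threshold, whichever rule is chosen.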



        • #5
          Thank you, I figured it out. Had to name i_median and i_eqmedian differently in order not to mess up the "i_*" varlist. Well...
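
          For anyone hitting the same problem, a minimal sketch of the renamed version on made-up data. The item names i_item1 to i_item21 and the new names med_items and n_eqmed are hypothetical; what matters is that neither new variable matches i_*. With the original names, the loop over i_* also visits i_median, which always equals itself, and that would plausibly account for a count that is one too high.

          ```stata
          clear
          set obs 5
          set seed 12345

          * hypothetical stand-ins for the 21 value items, all named i_*
          * the first observation answers 3 on items 1 to 17
          forvalues j = 1/21 {
              gen i_item`j' = cond(`j' <= 17 & _n == 1, 3, runiformint(1, 6))
          }

          * the new variables must not match i_*, or the loop would count them too
          egen med_items = rowmedian(i_*)
          gen n_eqmed = 0
          foreach v of varlist i_* {
              replace n_eqmed = n_eqmed + (`v' == med_items)
          }

          list med_items n_eqmed
          ```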



          • #6
            I can't resist pointing out that this is easier in long layout. I dub Schechter's Law the generalization that in Stata it is usually easier to work in long layout than in wide layout. Here Clyde Schechter gets the credit not for discovering this (it is ancient Stata folklore) but for being its most energetic and articulate exponent. But the word layout here I do owe to Clyde as an alternative to overloaded terms like format and structure.

            Most of this code is just a way to get a sandbox.

            Code:
            clear
            
            set obs 100 
            
            set seed 2803 
            
            egen id = seq(), block(10) 
            
            gen y = cond(runiform() > id/9, 7, runiformint(1, 10))
             
            bysort id y : gen freq = _N 
            
            tabstat freq, s(max) by(id)
            
            bysort id (freq) : drop if freq[_N] > 7

            The essentials are these:

            0. Different answers for the same person are in different observations.

            1. We get the frequencies of each answer for each person.

            Code:
            bysort id y : gen freq = _N

            2. We drop according to some threshold. Here, for example, 8, 9 or 10 identical answers out of 10 are not acceptable.

            Code:
            bysort id (freq) : drop if freq[_N] > 7
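
            As a sketch of how the same two steps might look for 21 items that start out in wide layout, again on made-up data. The identifier id and the item names i_item1 to i_item21 are assumptions; the threshold of more than 16 identical answers is the one quoted in #1.

            ```stata
            clear
            set obs 4
            set seed 2803
            gen id = _n

            * hypothetical wide data: person 1 answers 3 on items 1 to 17
            forvalues j = 1/21 {
                gen i_item`j' = cond(id == 1 & `j' <= 17, 3, runiformint(1, 6))
            }

            * one observation per person and item
            reshape long i_item, i(id) j(item)

            * missing answers would otherwise form a group of their own
            drop if missing(i_item)

            * frequency of each answer within each person
            bysort id i_item : gen freq = _N

            * drop every person whose most frequent answer occurs more than 16 times
            bysort id (freq) : drop if freq[_N] > 16
            ```

            If the item names end in text rather than numbers, as with i_crtiv and i_hlpplp, reshape long i_, i(id) j(item) string should do the same job, provided an identifier variable exists.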

