
  • Different kinds of missings

    Dear stata-community,

    I collected data on about 2000 participants via an online survey. I am now in the process of cleaning that data (with about 100 variables) and have come across an issue in coding missings. There are two possible scenarios of missings in my data set.

    1. someone skipped a question and thus did not answer that single question (see example ID 1)
    2. someone answered the first few questions and then stopped, thus discontinued the survey (see example ID 2)
    Both missings are coded as "." right now.
    A single observation could include both kinds of missings in different variables. (see example ID 3, here Var 2 was not answered and then the survey was discontinued after Var 3)

    Example
    ID  Var 1  Var 2  Var 3  Var 4  Var 5
    1   Yes    .      No     No     No
    2   Yes    .      .      .      .
    3   Yes    .      No     .      .

    I would now like to replace "." with "-1" for scenario 1 and with "-2" for scenario 2. Does anyone have an idea of how I could do that using a loop or some other form of automation?
    I would like to avoid having to go through all observations manually.

    Any help is greatly appreciated.
    Thanks in advance.
    Maike

  • #2
    • Reshaping to long, replacing the missings, and then reshaping back to wide may work.
    • Instead of coding them as -1 and -2, use Stata's extended missing values (.a to .z). There are 26 of them, which lets you record the reason a value is missing. Keeping them as missing values makes tabulation and analysis easier, because they cannot sneak in as real values. I am using .a for a one-instance missing and .b for consecutive missings.
    • Note that I made many assumptions about your data (variable names, formats, etc.), so the solution may not work exactly as written. If you need help that fits your data more closely, see section 12 of the FAQ (http://www.statalist.org/forums/help) on how to use dataex to post a data sample.
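
    To illustrate why extended missing values are convenient, here is a minimal sketch on made-up data. The variable v1 and the label name ynmiss are hypothetical and only serve the example; the point is that .a and .b can carry their own value labels while still being excluded from statistics.
    Code:
    clear
    input id v1
    1 1
    2 .a
    3 .b
    4 0
    end
    * hypothetical value label: extended missing codes can be labelled too
    label define ynmiss 0 "No" 1 "Yes" .a "skipped" .b "discontinued"
    label values v1 ynmiss
    * the missing option lists .a and .b as separate rows
    tabulate v1, missing
    * missing() is true for ., .a, ..., .z, so this counts both codes
    count if missing(v1)
    * summary statistics use only the nonmissing values
    summarize v1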

    Fake data:
    Code:
    clear
    input id v1 v2 v3 v4 v5
    1 1 . 0 0 0
    2 1 . . . .
    3 1 . 0 . .
    end
    Reshape to long, then use two replace commands. The first recodes the consecutive missings from the second position onward. The second takes care of the first missing in a consecutive run, which has not been recoded yet. Then reshape back to wide:
    Code:
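    * one row per respondent and question; qnum holds the question number
    reshape long v, i(id) j(qnum)

    * start by treating every missing as a one-instance missing
    replace v = .a if missing(v)
    * recode to .b any .a whose previous value (within id) is .a
    bysort id (qnum): replace v = .b if v == .a & (v[_n - 1] == .a)
    * recode to .b any remaining .a whose next value (within id) is .b
    bysort id (qnum): replace v = .b if v == .a & (v[_n + 1] == .b)

    reshape wide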
    Results:
    Code:
         +-----------------------------+
         | id   v1   v2   v3   v4   v5 |
         |-----------------------------|
      1. |  1    1   .a    0    0    0 |
      2. |  2    1   .b   .b   .b   .b |
      3. |  3    1   .a    0   .b   .b |
         +-----------------------------+
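
    One caveat with the consecutive-missing rule: two adjacent skipped questions in the middle of a completed survey would also end up as .b. If .b should mean only "everything after the last answered question", a possible alternative is to scan the variables from last to first in wide format. This is a sketch under the assumption that the questions are named v1 to v5 in survey order; in_tail is a hypothetical helper variable, and ID 4 is an extra made-up case.
    Code:
    clear
    input id v1 v2 v3 v4 v5
    1 1 . 0 0 0
    2 1 . . . .
    3 1 . 0 . .
    4 1 . . 0 0
    end
    * in_tail = 1 while no answer has been seen, scanning from the last question
    gen byte in_tail = 1
    forvalues q = 5(-1)1 {
        * the first answer found (from the right) ends the trailing run
        replace in_tail = 0 if !missing(v`q')
        replace v`q' = .b if v`q' == . & in_tail
        replace v`q' = .a if v`q' == . & !in_tail
    }
    drop in_tail
    list, noobs
    * if the numeric codes from the original question are still wanted:
    * mvencode v1-v5, mv(.a=-1 \ .b=-2)
    With this rule, IDs 1 to 3 get the same codes as in the results above, while ID 4 keeps .a for both skipped questions because answers follow them.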



    • #3
      Hi Maike, I'm curious about your reason for differentiating between a skipped question and a hard stop in the survey. I'm wondering what value that distinction adds, considering the data are missing either way. Or are you analyzing some measure of survey engagement or participation?



      • #4
        Thank you so much Ken. That was very helpful!

        Thank you for your comment Saskhi. You definitely have a valid point. The reason I thought I should distinguish between the two is that it could make a difference whether someone skips a single question (perhaps because they found it too personal) or is simply tired of answering questions and wants to get to the end fast. But I guess a flag variable for unfinished surveys could signal the same thing. I will have to think about it again.
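
        For reference, a flag along those lines could look roughly like this. It is only a sketch: it assumes the questions are v1 to v5, that v5 is the last question, and that discontinued answers are already coded .b as in #2. The names unfinished and n_missing are made up.
        Code:
        clear
        input id v1 v2 v3 v4 v5
        1 1 .a 0 0 0
        2 1 .b .b .b .b
        3 1 .a 0 .b .b
        end
        * 1 if the respondent dropped out before the last question
        gen byte unfinished = (v5 == .b)
        * number of unanswered questions per respondent (counts ., .a and .b)
        egen n_missing = rowmiss(v1-v5)
        list id unfinished n_missing, noobs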
