
  • Data Comparison

    Hi, I'm new to Stata and would appreciate some help with comparing data values.

    I have successfully imported data from REDCap into Stata and can see how to edit and browse it.
    The imported data come from questionnaires (mostly nominal) with >400 variables.
    To check the validity of the data entry, I have re-entered a select few questionnaires.
    I now wish to compare the 'duplicates' to the 'originals'.

    I can't see how the cf command works and am struggling to find an alternative.

    Any help is appreciated, thank you.
    Izzi


  • #2
    The cf command is for comparing datasets rather than two or more variables within a dataset.

    If you have two variables that should be identical, with the same values in each observation, then

    Code:
    assert a == b
    is a start, and no news is good news.

    If the assertion is declared false, then you need to look at the differences, for example


    Code:
    list a b if a != b 
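
    For readers wondering how cf would apply here: it compares the dataset in memory with a dataset saved on disk, observation by observation, so it fits only if the originals and the re-entered copies sit in two separate files with the same number of observations in the same order. A minimal sketch with made-up data (id, q1 and q2 are hypothetical names, not from the thread):

    ```stata
    * Toy "originals" dataset, saved to a temporary file
    clear
    input byte(id q1 q2)
    1 1 2
    2 3 4
    end
    tempfile originals
    save `originals'

    * Toy "re-entered" dataset, now in memory; q2 differs for id 2
    clear
    input byte(id q1 q2)
    1 1 2
    2 3 5
    end

    * Compare every variable in memory against the saved file
    capture noisily cf _all using `originals', verbose
    ```

    cf names the variables that mismatch and, with verbose, the observations involved. capture noisily keeps a do-file running, because cf exits with an error when it finds differences.
    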



    • #3
      I want to compare all the variables for two records.
      I have >400 variables.
      Is there a quicker way to do this without having to compare each variable individually?
      Thank you



      • #4
        Sorry, but I don't understand what you mean by comparing all the variables. Also, what is a "record"? In some jargon, it is another name for observation, case or row in the dataset, but I think you mean something else again.



        • #5
          Could you post an example of the data you have using -dataex-? If you are concerned about privacy, you can edit the identifiers -- we don't ask for real data, just data realistic enough to replicate your situation.

          Nick has given you good advice to directly compare variables. Without knowing more about your data structure, it's difficult to imagine whether something more efficient could be used.

          You wrote:
            The imported data is from questionnaires, (mostly nominal) with >400 'variables'
            to check the validity of the entry, I have repeated a select few questionnaire entries.
            I now wish to compare the 'duplicates' to the 'originals'.
            ...
            I want to compare all the variables for two records,
            I have >400 variables

          If you only had to compare a few variables, it would not take long to adapt Nick's code.

          I wonder if your data are laid out such that survey questions are observations (Stata terminology for rows) and survey respondents are variables (Stata terminology for columns). (I originally thought you had the transposed orientation.) If this is the case, the following might give you a place to start. It creates toy data with 3 respondents (p1 to p3) and 3 questions. For each variable, it compares the first and third observations and lists them only if they don't match. The loop runs over all respondent variables, using wildcard notation and exploiting the fact that their names all start with a common prefix.

          Code:
          clear
          input byte(question p1 p2 p3)
          1 1 1 4
          2 2 3 1
          3 1 3 4
          end
          
          foreach v of varlist p* {
            list question `v' if `v'[1]!=`v'[3] & inlist(_n, 1, 3)
          }
          Result

          Code:
               +---------------+
               | question   p2 |
               |---------------|
            1. |        1    1 |
            3. |        3    3 |
               +---------------+
          Finally, when asking for help with code, it saves everyone time and confusion if you help us to help you. Please provide a data example, and show us any relevant code you have tried (exactly) and the output that Stata gave you back. You are asked to do this in the FAQ, and it's for a good reason.

          edit: crossed with #4.

          Edit to add:

          I would also say, if my guess above is correct, it's certainly not the most efficient data layout with which to work. And, with specific reporting as above, the output may also be quite lengthy and difficult to read.
          Last edited by Leonardo Guizzetti; 17 Feb 2022, 06:30.



          • #6
            I think I see what this is about, very belatedly. You did mention duplicates, but the reference to cf, which is quite different, left me puzzled.

            Each observation should be duplicated, meaning identical observations should each occur twice.

            So

            Code:
            help duplicates 
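
            A minimal sketch of where that could lead, assuming one observation per questionnaire entry and an identifier shared by each original and its re-entry. The toy data and the names id, q1-q3, copies and mismatch are hypothetical, not from the thread.

            ```stata
            * Toy data: id 1 was re-entered identically, id 2 differs on q2
            clear
            input byte(id q1 q2 q3)
            1 1 2 3
            1 1 2 3
            2 4 5 6
            2 4 9 6
            end

            * copies = number of other observations identical on ALL variables
            duplicates tag, generate(copies)

            * Entries with no exact twin need a closer look
            list if copies == 0, sepby(id)

            * For those entries, find which variables disagree within each id
            foreach v of varlist q1-q3 {
                bysort id (`v'): generate byte mismatch = `v'[1] != `v'[_N]
                quietly count if mismatch
                if r(N) > 0 {
                    list id `v' if mismatch, sepby(id)
                }
                drop mismatch
            }
            ```

            With >400 variables the varlist in the loop would need to cover the real question variables, for example through a common prefix or a range. The loop lists only the variables where an original and its re-entry disagree, which should keep the output short if most entries match.
            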

