Data Comparison

Izzi Smith

Join Date: Feb 2022

Posts: 2
#1

17 Feb 2022, 05:30

Hi, I'm new to Stata and would appreciate some help with comparing data values.

I have successfully imported data from REDCap into Stata and can see how to edit and browse it.
The imported data are from questionnaires (mostly nominal), with >400 variables.
To check the validity of the data entry, I have repeated a select few questionnaire entries.
I now wish to compare the 'duplicates' to the 'originals'.

I can't work out how the cf command works and am struggling to find an alternative.

Any help is appreciated, thank you.
Izzi
Tags: data, data comparison, export, redcap
Nick Cox

Join Date: Mar 2014

Posts: 35734
#2

17 Feb 2022, 05:46

The cf command is for comparing datasets rather than two or more variables within a dataset.
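For reference, here is a minimal sketch of what cf does, assuming the originals and the repeated entries have been put into two separate datasets with the same observations in the same order. The toy values and the tempfile name originals are hypothetical. cf exits with an error if it finds any difference, so capture noisily is used to let the code continue.

```stata
* Toy "originals" dataset (hypothetical values), saved to a temporary file
clear
input byte(id q1 q2 q3)
1 1 2 3
2 2 2 1
end
tempfile originals
save `originals'

* Toy "repeated entries" dataset, now in memory; q2 differs for id 2
clear
input byte(id q1 q2 q3)
1 1 2 3
2 2 3 1
end

* cf compares the data in memory with the file on disk, observation by
* observation; both must have the same number of observations
capture noisily cf _all using `originals', verbose
display "cf return code: " _rc
```

A nonzero return code here means that at least one variable differs between the two datasets.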

If you have two variables that should be identical, with the same values in each observation, then

Code:

assert a == b

is a start, and no news is good news.

If the assertion is declared false, then you need to look at the differences, for example

Code:

list a b if a != b
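With many pairs of variables, the same two steps can be wrapped in a loop. This is only a sketch under an assumed layout: each original variable (q1, q2, q3 here) has its repeated entry stored in a variable with a _dup suffix. That naming convention and the toy values are hypothetical.

```stata
* Toy data: q1-q3 are the original entries, q1_dup-q3_dup the repeated ones
* (the _dup suffix is a hypothetical naming convention)
clear
input byte(id q1 q2 q3 q1_dup q2_dup q3_dup)
1 1 2 3 1 2 3
2 2 2 1 2 3 1
end

local nbad = 0
foreach v of varlist q1 q2 q3 {
    * capture traps the error that assert raises if any observation differs
    capture assert `v' == `v'_dup
    if _rc {
        local ++nbad
        display as text "Mismatch in `v':"
        list id `v' `v'_dup if `v' != `v'_dup
    }
}
display "`nbad' variable(s) with mismatches"
```

Only the variables that fail the assertion are listed, so no news is still good news.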
Izzi Smith

Join Date: Feb 2022

Posts: 2
#3

17 Feb 2022, 05:53

I want to compare all the variables for two records.
I have >400 variables.
Is there a quicker way to do this without having to compare each variable individually?
Thank you
Nick Cox

Join Date: Mar 2014

Posts: 35734
#4

17 Feb 2022, 06:22

Sorry, but I don't understand what you mean by comparing all the variables. Also, what is a "record"? In some jargon, it is another name for observation, case or row in the dataset, but I think you mean something else again.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#5

17 Feb 2022, 06:24

Could you post an example of the data you have using -dataex-? If you are concerned about privacy, you can edit the identifiers: we don't ask for real data, just data realistic enough to replicate your situation.

Nick has given you good advice to directly compare variables. Without knowing more about your data structure, it's difficult to imagine whether something more efficient could be used.

"The imported data is from questionnaires, (mostly nominal) with >400 'variables' to check the validity of the entry, I have repeated a select few questionnaire entries. I now wish to compare the 'duplicates' to the 'originals'. ... I want to compare all the variables for two records, I have >400 variables"

If you only had to compare a few variables, it should not take long to adapt Nick's code accordingly.

I wonder if your data are laid out such that survey questions are observations (Stata terminology for rows) and survey respondents are the variables (Stata terminology for columns). (I originally thought you had the transposed orientation.) If this is the case, the following might give you a place to start. It creates toy data with 3 respondents (p1 to p3) and 3 questions. It compares one variable's first and third observations, and lists them only if they don't match. It does this in a loop over all the respondent variables (here using wildcard notation, exploiting the fact that the respondents' variables all start with a common prefix).

Code:

clear
input byte(question p1 p2 p3)
1 1 1 4
2 2 3 1
3 1 3 4
end

foreach v of varlist p* {
    list question `v' if `v'[1]!=`v'[3] & inlist(_n, 1, 3)
}

Result

Code:

     +---------------+
     | question   p2 |
     |---------------|
  1. |        1    1 |
  3. |        3    3 |
     +---------------+
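For the transposed orientation (each questionnaire entry is an observation and each question is a variable), a similar loop can collect the names of the variables on which two rows disagree. This is a sketch only: the toy values are hypothetical, and it assumes the original and the repeated entry sit in rows 1 and 2.

```stata
* Toy data: one row per questionnaire entry; rows 1 and 2 are assumed to be
* the original and the repeated entry of the same questionnaire
clear
input byte(entry q1 q2 q3)
1 1 2 3
2 1 3 3
end

* Build a list of the variables whose values differ between rows 1 and 2
local differ
foreach v of varlist q1-q3 {
    if `v'[1] != `v'[2] {
        local differ `differ' `v'
    }
}
display "Variables that differ between rows 1 and 2: `differ'"
```

With >400 variables this prints one short line of names rather than one listing per variable, and the comparison works for string as well as numeric variables.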

Finally, when asking for help with code, it saves everyone time and confusion if you help us to help you. Please provide a data example, and show us any relevant code you have tried (exactly) and the output that Stata gave you back. You are asked to do this in the FAQ, and it's for a good reason.

edit: crossed with #4.

Edit to add:

I would also say, if my guess above is correct, it's certainly not the most efficient data layout with which to work. And, with specific reporting as above, the output may also be quite lengthy and difficult to read.

Last edited by Leonardo Guizzetti; 17 Feb 2022, 06:30.
Nick Cox

Join Date: Mar 2014

Posts: 35734
#6

17 Feb 2022, 13:02

I think I see what this is about, very belatedly. You did mention duplicates, but the reference to cf, which is quite different, left me puzzled.

Each observation should be duplicated, meaning identical observations should each occur twice.

So

Code:

help duplicates
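As a rough illustration of what that help file covers, the sketch below tags observations that are exact copies of each other on the question variables. The toy values are hypothetical, and it assumes identifiers such as entry are left out of the variable list, since they would differ between an original and its repeat.

```stata
* Toy data: entries 1 and 3 are the same questionnaire entered twice and agree;
* entries 2 and 4 are also a repeated pair but disagree on q2
clear
input byte(entry q1 q2 q3)
1 1 2 3
2 2 2 1
3 1 2 3
4 2 3 1
end

* Report how many observations are exact copies of each other on q1-q3
duplicates report q1-q3

* dup is 0 for observations with no identical copy, 1 if there is one copy
duplicates tag q1-q3, generate(dup)
list entry q1-q3 dup
```

Repeated entries that come back with dup equal to 0 are the ones where the two entries disagree somewhere, and those can then be inspected variable by variable.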