Creating string variables

christiana

Join Date: Jun 2014

Posts: 46
#1

18 Jul 2014, 05:04

Hi,
I have data on employment levels by industry and I want to enter them into Stata via the Data Editor. I am not sure which of two approaches is better: (a) enter the industry classifications (NACE) by name as a string variable and encode it later if I need to when actually using the dataset, or (b) number the NACE categories, enter the variable as a float with categories 1 to n, and create and attach value labels to these (otherwise meaningless) codes.

Lastly, can string variables be used in analysis as they are, or do they always have to be encoded beforehand?

Many thanks in advance
Tags: categorical, label, string
Nick Cox

Join Date: Mar 2014

Posts: 35725
#2

18 Jul 2014, 05:26

Numeric variables with value labels are almost always easier to enter and to check than string variables. In the latter case, small spelling or punctuation errors can lead to all sorts of small and large problems. Numeric variables with value labels can also appear in model commands as predictors (e.g. as factor variables). Some kinds of graphs won't accept string variables as defining an axis, but that is likely to bite much less often.

There is plenty of guidance on this in every introduction to Stata I know, including [U].
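
For illustration, here is a minimal sketch of the labelled-numeric route. The variable names (nace, employment), the codes 1 to 3, and the label text are all assumptions made up for the example, not the original poster's data.

```stata
* Sketch only: nace, employment, and the three categories are hypothetical.
clear
input byte nace float employment
1 120
2 340
3 95
2 310
end

* Define the value label once, then attach it to the numeric variable
label define nacelbl 1 "Agriculture" 2 "Manufacturing" 3 "Construction"
label values nace nacelbl

* A labelled numeric variable can be tabulated and used as a factor variable
tabulate nace
regress employment i.nace
```

With only four made-up observations the regression is meaningless; the point is only that i.nace is accepted because nace is numeric.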
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#3

18 Jul 2014, 08:57

It's not often that I disagree with Nick Cox, but

Numeric variables with value labels are almost always easier to enter and to check than string variables. In the latter case, small spelling or punctuation errors can lead to all sorts of small and large problems.

That's true, but there is also the problem of accurately, consistently translating the string value to the corresponding numeric value. Human data-enterers are prone to mistakes doing this, and the process of consulting a string-to-numeric crosswalk table also slows down the data entry process considerably. I have generally worked by the principle that data entry should be straight transcription: any transformations that need to be made should be made by a computer, not a human.

So my approach would be to enter the data in the exact form that you have them. Then carefully clean the classifications and names to eliminate the typos and other errors that occur. Then use -encode- to transform those to numeric variables with value labels. (I certainly agree with Nick that numeric variables with value labels are far more useful for analysis than string variables.)
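
A rough sketch of that enter, clean, then encode sequence might look like the following. The variable name industry and its values are invented for the example, and the cleaning step shown (trimming blanks and standardising case) covers only the simplest kind of inconsistency; real typos would still need to be found and fixed by hand.

```stata
* Sketch only: industry and its values are hypothetical.
clear
input str30 industry
"Manufacturing"
" manufacturing"
"Construction"
"Construction "
end

* Strip leading, trailing, and repeated internal blanks; standardise case
replace industry = proper(itrim(trim(industry)))

* Inspect the distinct values; remaining typos need manual -replace- fixes
tabulate industry

* Create a numeric variable with value labels from the cleaned strings
encode industry, generate(industry_n)
label list industry_n
```

Note that -encode- numbers the categories in alphabetical order of the strings, so the cleaning has to be finished before it is run.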
Nick Cox

Join Date: Mar 2014

Posts: 35725
#4

18 Jul 2014, 09:05

We can converge easily. My experience is that some users have more problems typing strings in correctly and consistently than one could possibly guess, and that's time-consuming too.

But Clyde is totally correct that typing in the wrong number but one that has a value label defined is an insidious error, and I've seen that often too.

I guess the overarching tacit advice should be spelled out too: Any data you care about (which should mean any data) should be checked at data entry stage, ideally by methods that maximise the independence of the check, e.g. a completely separate entry, if possible by a different person. Also, print out the data and compare with the original source.
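
One possible way to compare two independent entries is -cf-, which reports where the dataset in memory differs from a dataset on disk. The sketch below uses made-up data and assumes both entries hold the same records in the same order.

```stata
* Sketch only: both "entries" are hypothetical, with one deliberate typo.
clear
input byte nace float employment
1 120
2 340
3 95
end
tempfile second
save `second'

clear
input byte nace float employment
1 120
2 430
3 95
end

* -cf- exits with an error when it finds mismatches, so -capture noisily-
* lets a do-file continue while still showing the report
capture noisily cf _all using `second', verbose
```

The verbose option lists each mismatching observation, which can then be checked against the original source.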

Last edited by Nick Cox; 18 Jul 2014, 09:17.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

18 Jul 2014, 09:28

I guess the overarching tacit advice should be spelled out too: Any data you care about (which should mean any data) should be checked at data entry stage, ideally by methods that maximise the independence of the check, e.g. a completely separate entry, if possible by a different person. Also, print out the data and compare with the original source.

Couldn't agree more! Convergence achieved.
christiana

Join Date: Jun 2014

Posts: 46
#6

18 Jul 2014, 09:49

Many thanks to both of you for your valuable advice.