Creating variables from text

John Gustavsson

Join Date: Sep 2018

Posts: 14
#1

25 Dec 2019, 20:42

Hi,

I have some survey data. The raw data are stored as text; for example, a question about education has responses such as "undergraduate degree", "postgraduate degree", etc. I'd like to turn this into binary variables (one "undergraduate" variable, one "postgraduate" variable, etc.). I know this can be done manually, though it's quite time-consuming, so I was wondering whether the process could be automated in Stata.

I was thinking of generating a variable and then having it take the value 1 when the response matches a certain text (so the variable "undergraduate" takes the value 1 when the response is "Undergraduate degree"). However, when I tried this I got a "type mismatch" error.

Is there any way to get around this or should I just start translating the data into 1's and 0's manually?
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

25 Dec 2019, 21:13

So, first of all, there is probably no need for you to do this. If you create a numeric variable that is coded, say, 1 for undergraduate degree, 2 for postgraduate degree, etc., that will serve most purposes in Stata. If you need 0/1 indicator ("dummy") variables to use in regressions, you do not need to create them because Stata can do that for you "on the fly" using factor variable notation (-help fvvarlist-). So the part you really need is to create a single numeric variable. That you can accomplish with the -encode- command. At its simplest:

Code:

encode education, gen(education_numeric)
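
As a minimal sketch of how the encoded variable can then be used with factor variable notation, the following builds a small made-up dataset (the variable income and all values are hypothetical, purely for illustration), encodes the text variable, and lets Stata create the indicators on the fly in a regression:

```stata
* Toy data: education (text) and income are invented for illustration
clear
input str25 education income
"undergraduate degree" 40
"postgraduate degree"  55
"undergraduate degree" 42
"no degree"            30
"postgraduate degree"  60
"no degree"            28
end

* Create a labeled numeric variable from the text variable
encode education, gen(education_numeric)

* The i. prefix makes Stata expand the variable into indicators on the fly
regress income i.education_numeric
```

No separate 0/1 variables are created in the dataset here; the i. prefix handles the expansion inside the estimation command.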

Now, there are some additional potential complications here. Text variables like this often contain inconsistent capitalization (e.g., one observation says Undergraduate degree, another says undergraduate degree, and yet another says UNDERGRADUATE DEGREE). And sometimes when the values have more than one word, you get stray extra spaces, or left or right padding with spaces on an inconsistent basis. So you should first -tab- your variable to make sure that all of the values that mean the same thing are actually spelled the same way, and make corrections if they are not. The -upper()-, -lower()-, -proper()-, -trim()- and -itrim()- functions may prove useful here.
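
A rough sketch of that cleaning step, using made-up values that differ only in capitalization and spacing (the variable name education_clean is just an illustrative choice):

```stata
* Toy data with inconsistent capitalization and stray spaces
clear
input str30 education
"Undergraduate degree"
"undergraduate  degree"
" UNDERGRADUATE DEGREE "
"postgraduate degree"
end

* Inspect the raw values first
tabulate education

* Strip leading/trailing blanks, collapse repeated internal blanks,
* and convert everything to lower case
generate education_clean = lower(itrim(trim(education)))

* Check that variants of the same answer have now been merged
tabulate education_clean
```

Misspellings and genuinely different wordings would still need to be corrected by hand, for example with -replace-.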

Then, it may make sense to have the categories in some specific order. In this case, for example, you might want 1 to represent "no college" and 2 to be "some college but no degree" and 3 "undergraduate degree" and 4 "some graduate school" and 5 "graduate degree" or something like that. -encode-, of course, knows nothing about education levels, and it just, by default, makes the numeric order correspond to alphabetic order in the text variable. You can override that default with your preferred order by defining a value label in that order and passing its name to -encode-'s -label()- option.
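
A sketch of how the ordering could be imposed, assuming the text values match the label text exactly (the label name edlevel and the toy data are hypothetical):

```stata
* Toy data; values must match the label text exactly
clear
input str30 education
"undergraduate degree"
"no college"
"graduate degree"
"some college but no degree"
"some graduate school"
end

* Define the value label in the preferred order
* (/// continues a line in a do-file; it does not work interactively)
label define edlevel 1 "no college" 2 "some college but no degree" ///
    3 "undergraduate degree" 4 "some graduate school" 5 "graduate degree"

* Encode using that label; noextend stops with an error rather than
* silently adding codes for text values missing from the label
encode education, generate(education_numeric) label(edlevel) noextend

* Compare the labeled and the underlying numeric values
tabulate education_numeric
tabulate education_numeric, nolabel
```

The noextend option is a design choice here: it makes unexpected spellings surface as an error instead of becoming extra categories.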

In the event that you really do need those separate 0/1 variables for each level for some purpose where factor-variable notation is not supported, you can create those easily with -tab-'s -gen()- option.
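
For illustration, a sketch of both routes to explicit indicators on made-up data (the stub edu_ and the variable name undergraduate are arbitrary choices). On the original "type mismatch" error: one common cause is comparing a string variable with a number or an unquoted value, but that is only a guess, since the failing command was not shown.

```stata
* Toy data
clear
input str25 education
"undergraduate degree"
"postgraduate degree"
"undergraduate degree"
"no degree"
end

* One indicator per distinct value: edu_1, edu_2, ...
tabulate education, generate(edu_)

* A single indicator built by hand; the text value must be in double
* quotes and must match the stored text exactly, including case
generate byte undergraduate = (education == "undergraduate degree")

list
```

The indicators from -tabulate- are numbered in the sorted order of the values and carry variable labels showing which value each one represents, so -describe- is a quick way to check them.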

More generally, it sounds like there are a number of really fundamental commands here that you aren't yet acquainted with. Take some time out to read the Getting Started [GS] and User's Guide [U] volumes of the PDF documentation that comes with your Stata installation. They will give you a tour of Stata including the "bread and butter" commands that get used all the time in basic data management and statistical analysis. The time you invest doing that will be amply repaid.

-help label-
-help encode-
-help fvvarlist-
-help tab-
-help trim()-
-help itrim()-
-help upper()-
-help lower()-
-help proper()-

As you did not provide example data, I cannot give you advice more directly applicable to your data. If you need more specific guidance, post back using the -dataex- command to show example data. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.