Separating phrases in every single cell under a column and assigning each phrase a unique dummy variable.

anisha arya

Join Date: Jul 2024

Posts: 28
#1


30 Jul 2024, 15:03

Hi, I have a column of options chosen by survey participants, who could choose several options (out of 8 standard options) at once. For example, the cell for participant 1 contains "blue raspberry, mango peach", participant 2's cell is "banana blast, red raspberry, mango peach", participant 3's cell is "None of the above", and so on.
So how do I assign unique numbers (1, 2, 3, 4, 5, 6, 7, 8) to all eight options and then tally them for every participant's cell? I hope that makes sense; I am new to Stata. Please let me know, thanks in advance!
Tags: dummy variables, Generate, replace, string variables

Andrew Musau

Join Date: Oct 2014
Posts: 10180
#2

30 Jul 2024, 15:22

Your title and description differ. In your title, you refer to indicators or dummies, whereas in your description, you mention assigning numeric values to each option. The former may make more sense. Here is a way to create indicators.

Code:

clear
input float(participant_id) strL(responses)
1 "blue raspberry, mango peach"
2 "banana blast, red raspberry, mango peach"
3 "None of the above"
end

foreach fruit in raspberry mango peach banana {
    gen `fruit' = ustrregexm(lower(responses), "\b`fruit'\b")
}

Result:

Code:

. l

     +-----------------------------------------------------------------------------------------+
     | partic~d                                  responses   raspbe~y   mango   peach   banana |
     |-----------------------------------------------------------------------------------------|
  1. |        1                blue raspberry, mango peach          1       1       1        0 |
  2. |        2   banana blast, red raspberry, mango peach          1       1       1        1 |
  3. |        3                          None of the above          0       0       0        0 |
     +-----------------------------------------------------------------------------------------+


Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#3

30 Jul 2024, 15:32

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float id str40 responses
1 "blue raspberry, mango peach"
2 "banana blast, red raspberry, mango peach"
3 "None of the above"
end

split responses, gen(_response) parse(", ")
reshape long _response, i(id) j(_j)
encode _response, gen(response) label(response)
drop _response

levelsof response, local(responses)
foreach r of local responses {
    gen byte chose_`r' = (response == `r')
    label var chose_`r' "Selected `:label (response) `r''"
    by id (chose_`r'), sort: replace chose_`r' = chose_`r'[_N]
}
by id, sort: keep if _n == 1

In the future, when asking for help with code, always show example data. To come up with a solution, I created a demonstration data set based on what I imagine your data set looks like. It is consistent with your description, but other possibilities exist. And if I have imagined incorrectly, the code shown may not work, and both of us will have wasted our time. To avoid guesswork, show example data. And, as I have done above, use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Added: Crossed with #2. The solution there requires listing, and therefore, advance knowledge of, all of the response possibilities. I suppose if this is survey data and the item in question has a fixed response set containing only 8 choices, then this is a reasonable assumption. The code shown here will work with arbitrarily many responses and does not require knowing what they are ahead of time.

Last edited by Clyde Schechter; 30 Jul 2024, 15:36.

Andrew Musau

Join Date: Oct 2014
Posts: 10180
#4

30 Jul 2024, 15:36

Assuming that you are dealing with phrases, here is a modification to #2:

Code:

clear
input float(participant_id) strL(responses)
1 "blue raspberry, mango peach"
2 "banana blast, red raspberry, mango peach"
3 "None of the above"
end

replace responses= strtrim(stritrim(responses))
local keywords `" "blue raspberry" "red raspberry" "banana blast" "mango peach" "'
foreach k of local keywords{
    gen `=strtoname("`k'")' = ustrregexm(" " + lower(responses) + " ", "\b`k'\b")
}

Result:

Code:

. l

     +-------------------------------------------------------------------------------------------------+
     | partic~d                                  responses   blue_r~y   red_ra~y   banana~t   mango_~h |
     |-------------------------------------------------------------------------------------------------|
  1. |        1                blue raspberry, mango peach          1          0          0          1 |
  2. |        2   banana blast, red raspberry, mango peach          0          1          1          1 |
  3. |        3                          None of the above          0          0          0          0 |
     +-------------------------------------------------------------------------------------------------+

Splitting the data, reshaping long and then manipulating the data is an alternative option. See

Code:

help split
help reshape

Note: Crossed with #3, which illustrates the split-and-reshape alternative.
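
The original question also asks for a tally. What follows is only a minimal sketch, assuming the tally wanted is the number of listed options each participant chose, plus a total per option across participants. It reuses the phrase indicators from this post; the names n_chosen, indicators and v are illustrative and not from the thread, and str40 is used here in place of strL.

```stata
clear
input float participant_id str40 responses
1 "blue raspberry, mango peach"
2 "banana blast, red raspberry, mango peach"
3 "None of the above"
end

replace responses = strtrim(stritrim(responses))
local keywords `" "blue raspberry" "red raspberry" "banana blast" "mango peach" "'

* Build one 0/1 indicator per phrase and remember the variable names
local indicators
foreach k of local keywords {
    local v = strtoname("`k'")
    gen byte `v' = ustrregexm(lower(responses), "\b`k'\b")
    local indicators `indicators' `v'
}

* Per-participant tally: how many of the listed options were chosen
egen n_chosen = rowtotal(`indicators')
list participant_id responses n_chosen

* Per-option tally: how many participants chose each option
tabstat `indicators', statistics(sum)
```

Only the phrases listed in keywords are counted, so "None of the above" contributes nothing to n_chosen unless it is added to the list.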


anisha arya

Join Date: Jul 2024

Posts: 28
#5

30 Jul 2024, 16:28

Thank you so much both of you!

#3 worked perfectly, although I have a query about the resulting data. What exactly does the column _j signify? (Sorry, I am extremely new to this and am trying to help my project director with it!)

Thanks,
Anisha
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#6

30 Jul 2024, 18:30

The first thing the code does is break up the variable responses into separate variables, one for each separate response (separated by commas in the original responses variable). So you end up with a bunch of variables called _response1 _response2 _response3. (If, unlike in the example data, somebody chose more than 3 responses, then there would be more than 3 such variables. There would be as many as the largest number of responses anybody chose.)

The -reshape- command then reorganizes the data so that there is a single _response variable, and what was previously a single observation becomes 3 separate observations. Simply put, the _response values get arranged vertically instead of horizontally. While doing this, -reshape- preserves the information about which _response# variable each observation came from in the new variable _j. That variable _j is never actually used later in this code. It even loses its meaning later in the code, because the values of chose_# simply reflect whether the given string was chosen at all and do not depend on which _response# variable it came from, and in every case all but one of those 3 rows per id gets deleted, restoring the original one-observation-per-id organization.

So that's what _j is about. I really should have -drop-ped it soon after it was created by -reshape- because what remains of it at the end of the code is no longer meaningful, so it becomes a source of potential confusion. It won't confuse Stata, but a person looking at the results would rightly wonder what on earth that _j is, and it is pretty hard to figure that out without "running the code in your head" to see where it came from and what became of it. So you asked a very good question and highlighted a blemish in the code. I just forgot to eliminate it; I suggest you add -drop _j- to the end of the code.
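
To see what _j holds before it loses its meaning, the first two steps of #3 can be run on their own. This is a sketch on the same 3-row demonstration data, not a replacement for the full code; it stops after -reshape- and lists the long layout.

```stata
clear
input float id str40 responses
1 "blue raspberry, mango peach"
2 "banana blast, red raspberry, mango peach"
3 "None of the above"
end

* One variable per comma-separated phrase: _response1 _response2 _response3
split responses, gen(_response) parse(", ")

* Long layout: one row per id per phrase position; _j records that position
reshape long _response, i(id) j(_j)
list id _j _response, sepby(id)

* _j only says which _response# variable a row came from,
* so it can be dropped once it has been inspected
drop _j
```

In the full code from #3, the same -drop _j- can go right after -reshape- or at the very end; either way the final one-row-per-id data no longer carries the leftover variable.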
