split a string variable

Nicolas Rodriguez

Join Date: Jul 2016

Posts: 63
#1

18 Nov 2018, 15:34

Hello, I've got a multiple-response variable from a survey:
1 "Miradas lascivas (degeneradas)"
2 "Silbidos y otros sonidos (besos, jadeos, bocinazos)"
3 "Acoso verbal (aluciones al cuerpo y de tipo sexual)"
4 "Arcamiento intimidante (tocar cintura, hablar al oido,etc)"
5 "Agarrones (de senos, vulva, trasero, pene, besos a la fuerza)"
6 "Sentimiento de presion"
7 "Persecución (a pie o en medio de transporte)"
8 "Exhibicionismo"
9 "Violación"
10 "Nunca he sido acosada/o"
11 "Otro"
Because this is a multiple-response variable, a respondent can choose, for example, 1, 2 and 3; or just 1; or 1, 2, 5, 6 and 7; or 1 and 11; and so on.
My data are in Excel, and I want to split the responses so that I can create a frequency chart showing how many respondents answered 1, how many answered 2, and so on.
I hope you can guide me with this.
Kind regards
Clyde Schechter

Join Date: Apr 2014

Posts: 30124
#2

18 Nov 2018, 16:01

Your data is in Excel. You will, at some point, need to bring it into Stata, and there are various ways you might do that. I can think of at least 6 different ways your Stata data might be organized, all consistent with your explanation, and each would require a different solution. So first, get your data imported into Stata. Then show an example, using the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
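
As an illustration of the advice above, a minimal -dataex- call might look like the sketch below. The auto dataset that ships with Stata, the three variables, and the -in 1/5- range are stand-ins chosen for the example; they are not the poster's data.

```stata
* Load a dataset that ships with Stata (a stand-in for the poster's data)
sysuse auto, clear

* Show the first five observations of three variables in a form
* that others can copy and paste straight into Stata
dataex make price mpg in 1/5
```

The output is an -input- block wrapped in forum code delimiters, so a reader can recreate the example exactly, including storage types and value labels.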
Nicolas Rodriguez

#3

18 Nov 2018, 16:26

Thank you very much for your answer, Clyde.
I have already brought my data into Stata. My variable is named Eformas_acoso, and it is a string variable. For example, the first observation reads as follows:
Miradas lascivas (degeneradas), Silvidos y otros sonidos (besos, jadeos, bocinazos), Acoso suave ("halagos"), Acoso agresivo (alusiones al cuerpo y acto sexual), Acercamiento intimidante (tocar cintura, hablar al oido,etc), "Agarrones" (de senos, vulva, trasero, pene, besos a la fuerza), "Sentimientos de presion"( Presión de genitales sobre tu cuerpo), Persecución (a pie o en medio de transporte)
So I need to split each observation to get one variable for each possible response, and then find the frequency of each response.
Regards
Nicolas Rodriguez

#4

18 Nov 2018, 16:39

Also, using dataex I got this:
input str406 Eformas_acoso
data width (410 chars) exceeds max linesize. Try specifying fewer variables
Clyde Schechter

#5

18 Nov 2018, 17:41

OK, I can work with the description in this case. The major obstacle is splitting up this variable into the individual responses. As the responses are separated by commas (,), this would be straightforward, except that the responses also contain internal commas. So first we have to remove the internal commas--which is possible because we know the words that precede them. After that, it's a matter of -reshape long- to get all the responses into a single "vertical" variable and tabulate the response frequencies.

Code:

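* Words that are followed by a comma inside a single response
local precomma besos jadeos cintura senos vulva trasero pene
foreach p of local precomma {
    * Drop the internal comma after each of these words
    replace Eformas_acoso = subinstr(Eformas_acoso, "`p',", "`p'", .)
}

* Split on the remaining commas, which separate the responses
split Eformas_acoso, gen(resp) parse(",")
gen long obs_no = _n
* One row per respondent and response
reshape long resp, i(obs_no) j(_j)
tab resp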

Note: This code substantially changes the original data in ways that may prove cumbersome for other things you need to do. So you might want to precede this with -preserve- and then -restore- at the end.

Code is not tested, so there may be typos.
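
To see what the comma-removal step does in isolation, here is a minimal sketch on a single made-up string held in a local macro. The macro name s and the shortened response text are assumptions for illustration only; the real code operates on the variable Eformas_acoso.

```stata
* A made-up value with two internal commas and one separating comma
local s "Silvidos y otros sonidos (besos, jadeos, bocinazos), Exhibicionismo"

* Words known to be followed by an internal comma in this toy string
local precomma besos jadeos

foreach p of local precomma {
    * Replace "word," with "word" everywhere in the string
    local s = subinstr("`s'", "`p',", "`p'", .)
}

* Only the comma that separates the two responses should remain
display "`s'"
```

After the loop, the only comma left is the one between the two responses, which is what lets -split- with parse(",") cut in the right place.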
Nicolas Rodriguez

#6

19 Nov 2018, 06:38

Thank you very much, Clyde; it worked perfectly.
However, my tabulation gives me the following:

Code:

tab resp

                                   resp |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
 "Agarrones" (de senos vulva trasero .. |        289       10.89       10.89
 "Punteos"( Presión de genitales sobr.. |        225        8.48       19.37
 Acercamiento intimidante (tocar cint.. |        314       11.83       31.20
 Acoso agresivo (alusiones al cuerpo .. |          7        0.26       31.46
                Acoso suave ("halagos") |          8        0.30       31.76
 Acoso verbal ( aluciones al cuerpo y.. |        368       13.87       45.63
          Exhibicionismo o masturbación |        156        5.88       51.51
 Persecución (a pie o en medio de tra.. |        276       10.40       61.91
 Silvidos y otros sonidos (besos jade.. |        420       15.83       77.73
                              Violación |         26        0.98       78.71
                                   otro |         27        1.02       79.73
"Agarrones" (de senos vulva trasero p.. |         10        0.38       80.11
"Punteos"( Presión de genitales sobre.. |          1        0.04       80.14
Acercamiento intimidante (tocar cintu.. |          7        0.26       80.41
Acoso verbal ( aluciones al cuerpo y .. |         11        0.41       80.82
          Exhibicionismo o masturbación |          4        0.15       80.97
         Miradas lascivas (degeneradas) |        432       16.28       97.25
Persecución (a pie o en medio de tran.. |         10        0.38       97.63
Silvidos y otros sonidos (besos jadeo.. |         46        1.73       99.36
                              Violación |          2        0.08       99.43
                nunca he sido acosada/o |         14        0.53       99.96
                                   otro |          1        0.04      100.00
----------------------------------------+-----------------------------------
                                  Total |      2,654      100.00

.
end of do-file

In summary, it looks as if the answers have been broken into two groups. As you can see, after "otro" some of them are repeated.
I cannot see what is happening here, and I would really appreciate any comments.
Regards

Last edited by Nicolas Rodriguez; 19 Nov 2018, 06:41.
Clyde Schechter

#7

19 Nov 2018, 10:05

Well, it is hard to tell from the -tab- output because it does not show the full strings. But what I think is happening is that in some instances, there are different versions of the response that look the same to our eyes but are in fact different character strings. Looking carefully at the output, I notice that the results shown near the top of the table all begin with a blank space, whereas those at the bottom do not. So I think my code failed to consider that after splitting on the commas, responses that were listed first would not have a blank, but those that were listed later in the list would have a blank following the comma.

The way to fix this, if I have the right diagnosis, is to insert this command between the -reshape- and -tab- commands:

Code:

replace resp = trim(itrim(resp))
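
Putting the pieces of this thread together, the whole workflow can be sketched on toy data as follows. The three input rows are shortened, made-up values, and the index name resp_no is an assumption for this example; only the variable name Eformas_acoso and the commands themselves come from the posts above. The -preserve-/-restore- pair follows the earlier note about keeping the original data intact.

```stata
* Toy data: made-up, shortened values standing in for the survey variable
clear
input str120 Eformas_acoso
"Miradas lascivas (degeneradas), Silvidos y otros sonidos (besos, jadeos, bocinazos)"
"Silvidos y otros sonidos (besos, jadeos, bocinazos), Exhibicionismo"
"Miradas lascivas (degeneradas)"
end

preserve        // keep a copy of the original one-row-per-respondent data

* Remove commas that occur inside a response
local precomma besos jadeos
foreach p of local precomma {
    replace Eformas_acoso = subinstr(Eformas_acoso, "`p',", "`p'", .)
}

* Split on the remaining commas and go to one row per response
split Eformas_acoso, gen(resp) parse(",")
gen long obs_no = _n
reshape long resp, i(obs_no) j(resp_no)

* Strip the blank left after each comma so identical responses match
replace resp = trim(itrim(resp))

tab resp

restore         // return to the original data
```

Without the trim step, a response listed first and the same response listed later would be tabulated as two different strings, which is the duplication seen in the earlier output.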
Nicolas Rodriguez

#8

19 Nov 2018, 14:10

Thank you very much, Clyde. It worked perfectly.
Kind Regards