Creating dummy variables from several strings, variable length

Joern Grahl

Join Date: Sep 2017

Posts: 3
#1

29 Sep 2017, 03:53

Dear Statalist,

I am a long-time reader and would like to thank all of you for your support and for helping the community!
I have solved many problems just from reading the archives and the FAQ.
But this time I cannot find help in the archives or the FAQ, and I do not know how to proceed.

Question:

I have a dataset where every observation is a list of strings. This list has a variable length (some observations have 2 strings, some have 10).
Here is a simple example with three observations. The variable names are v1, v2, v3, v4.

1: "A", "B", na, na
2: "B", na, na, na
3: "C", "D", "A", "B"

I would like to convert this dataset into a different format. The variable names should be "A", "B", "C", and "D" (or whatever other strings occur in the data). Each observation should then be a set of indicator variables (0,1), indicating whether "A", or "B", or "C" or "D" occurs or not.

The example dataset would look like this:
Variable names are "A","B","C","D" (in this order)

1: 1, 1, 0, 0
2: 0, 1, 0, 0
3: 1, 1, 1, 1

I already checked tabulate and encode, but neither seems to work for my case.

Thank you very much.

Kind regards

Joern
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

29 Sep 2017, 06:32

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id str8(v1 v2 v3 v4)
1 "A" "B" ""  ""
2 "B" ""  ""  ""
3 "C" "D" ""  "A"
end
reshape long v, i(id) j(num)
list if id==1, clean
drop if v==""
drop num
generate val_ = 1
fillin id v
replace val_ = 0 if _fillin
drop _fillin
reshape wide val_, i(id) j(v) string
rename (val_*) (*)
list, clean

Code:

. list, clean

       id   A   B   C   D
  1.    1   1   1   0   0
  2.    2   0   1   0   0
  3.    3   1   0   1   1

Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
Joern Grahl

Join Date: Sep 2017

Posts: 3
#3

29 Sep 2017, 12:34

Dear William,

thank you for your working example! It is extremely helpful.

Thank you also for pointing me towards the Statalist FAQ and showing me what I could do better. I will consider these points next time. My apologies, and thank you for answering anyway!
Your idea has worked: I could transfer my data into the format that I need!

Kind regards

Jörn
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

29 Sep 2017, 13:04

Thank you Jörn, I'm glad my code worked for you. I was rushed when I wrote the answer and neglected to thank you for actually providing a clear enough explanation for me to be able to figure out what your data was like and what you wanted the results to look like. Since you've been reading Statalist for a while you know that doesn't always happen. My grumbling about the FAQ was really just because it took me a few tries to create the sample data, and this was before my first coffee of the day, when everything makes me grumble.

And because I was hurrying, I didn't include any explanation of technique. Yours was an interesting problem - which is why I wanted to tackle it before someone else beat me to it - with a solution that demonstrates a number of techniques, including the fillin command, a helpful technique in turning "not quite rectangular" data into rectangular data, as well as the always popular reshape command and the rename group command. If any of those are new to you, and their use is not clear from the help files for them, please reply with any further questions. It's worth having each of them in your Stata "tool kit".
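For readers who have not met fillin before, here is a minimal sketch of what that one step contributes. It reuses the three-row example data and the variable names from the code in #2. The count and list commands are additions for illustration only and are not part of the original solution.

```stata
* Sketch only: isolate the -fillin- step from the solution in #2.
clear
input float id str8(v1 v2 v3 v4)
1 "A" "B" ""  ""
2 "B" ""  ""  ""
3 "C" "D" ""  "A"
end

* Long layout: one row per (id, string) pair, with empty slots dropped.
reshape long v, i(id) j(num)
drop if v == ""
drop num
generate val_ = 1

* Only 6 of the 3 x 4 = 12 possible (id, v) combinations exist here.
count

* -fillin- adds the 6 missing combinations and flags them with _fillin == 1.
* The added rows have val_ missing, which #2 then replaces with 0.
fillin id v
count
list id v val_ _fillin, sepby(id)
```

After fillin the data are rectangular (every id paired with every string), which is what allows the later reshape wide to produce a complete set of 0/1 columns.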
Joern Grahl

Join Date: Sep 2017

Posts: 3
#5

30 Sep 2017, 13:11

Hello William, thank you for the follow-up! It is interesting to see that Stata has no built-in functionality for this problem (in fact I am quite happy about that, because otherwise I really should have found the solution myself). I did try fillin and tab, generate(...), but could not get them to work. Also, it still feels more natural to me to do such things outside of Stata, with a "normal" programming language like Python or Java. But once you _can_ do it with Stata, the entire workflow improves quite a bit (no more switching tools, fewer imports, fewer exports). Thanks again, and I hope you enjoyed your coffee :-)

Jörn
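For comparison, a loop-based route is also possible. The following is a minimal sketch, not the approach posted in #2. It assumes, as the reshape solution implicitly does, that every string is a legal Stata variable name and does not clash with an existing variable such as id. The local macro name levels and the loop variables s and var are chosen here for illustration.

```stata
* Sketch only: build the indicators with a loop instead of -fillin-.
clear
input float id str8(v1 v2 v3 v4)
1 "A" "B" ""  ""
2 "B" ""  ""  ""
3 "C" "D" ""  "A"
end

* Collect the distinct non-empty strings without altering the data in memory.
preserve
reshape long v, i(id) j(num)
drop if v == ""
levelsof v, local(levels) clean
restore

* One 0/1 indicator per distinct string; set to 1 if it occurs in any of v1-v4.
foreach s of local levels {
    generate byte `s' = 0
    foreach var of varlist v1-v4 {
        replace `s' = 1 if `var' == "`s'"
    }
}
drop v1-v4
list, clean
```

The result should match the listing in #2. The loop scales with the number of distinct strings times the number of source variables, so for large data the reshape and fillin route may be the more convenient one.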