  • Creating dummy variables from several strings, variable length

    Dear Statalist,

    I am a long-time reader and I would like to thank all of you for your support and for helping the community!
    I could solve many problems just from reading the archives and the FAQ.
    But this time, I cannot find help in the archives or the FAQ, and I do not know how to proceed.

    Question:

    I have a dataset where every observation is a list of strings. This list has a variable length (some observations have 2 strings, some have 10).
    Here is a simple example with three observations. The variable names are v1, v2, v3, v4.

    1: "A", "B", na, na
    2: "B", na, na, na
    3: "C", "D", "A", "B"


    I would like to convert this dataset into a different format. The variable names should be "A", "B", "C", and "D" (or whatever other strings occur in the data). Each observation should then be a set of indicator variables (0,1), indicating whether "A", or "B", or "C" or "D" occurs or not.

    The example dataset would look like this:

    Variable names are "A","B","C","D" (in this order)

    1: 1, 1, 0, 0
    2: 0, 1, 0, 0
    3: 1, 1, 1, 1


    I already checked tabulate and encode, but neither seems to work for my case.

    Thank you very much.

    Kind regards

    Joern

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float id str8(v1 v2 v3 v4)
    1 "A"  "B" ""  "" 
    2 "B"  ""  ""  "" 
    3 "C" "D" "" "A"
    end
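    * Long layout: one row per id and position (v1-v4 become v, indexed by num)
    reshape long v, i(id) j(num)
    list if id==1, clean
    * Keep only the strings that actually occur
    drop if v==""
    drop num
    * Flag the observed (id, v) pairs, then add the missing pairs as zeros
    generate val_ = 1
    fillin id v
    replace val_ = 0 if _fillin
    drop _fillin
    * Back to wide: one indicator per distinct string, then strip the val_ prefix
    reshape wide val_, i(id) j(v) string
    rename (val_*) (*)
    list, clean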
    Code:
    . list, clean
    
           id   A   B   C   D  
      1.    1   1   1   0   0  
      2.    2   0   1   0   0  
      3.    3   1   0   1   1
    Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. In particular, please read FAQ #12 and use dataex and CODE delimiters when posting to Statalist.

    The more you help others understand your problem, the more likely others are to be able to help you solve your problem.


    • #3
      Dear William,

      thank you for your working example! It is extremely helpful.

      Thank you also for pointing me towards the Statalist FAQ and showing me what I could do better. I will consider these points next time. My apologies, and thank you for answering anyway!
      Your idea has worked: I could transfer my data into the format that I need!

      Kind regards

      Jörn


      • #4
        Thank you Jörn, I'm glad my code worked for you. I was rushed when I wrote the answer and neglected to thank you for actually providing a clear enough explanation for me to be able to figure out what your data was like and what you wanted the results to look like. Since you've been reading Statalist for a while you know that doesn't always happen. My grumbling about the FAQ was really just because it took me a few tries to create the sample data, and this was before my first coffee of the day, when everything makes me grumble.

        And because I was hurrying, I didn't include any explanation of technique. Yours was an interesting problem - which is why I wanted to tackle it before someone else beat me to it - with a solution that demonstrates a number of techniques, including the fillin command, a helpful technique in turning "not quite rectangular" data into rectangular data, as well as the always popular reshape command and the rename group command. If any of those are new to you, and their use is not clear from the help files for them, please reply with any further questions. It's worth having each of them in your Stata "tool kit".
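
        As a minimal sketch of what fillin does on its own: the toy data below are assumed for illustration and are not the dataset from this thread. The command adds one row for every combination of id and v that does not yet exist and flags the added rows with _fillin, which is what makes the later reshape wide rectangular.

        ```stata
        * Toy long-format data (assumed): id 2 has no row for "A"
        clear
        input float id str8 v
        1 "A"
        1 "B"
        2 "B"
        end
        * Mark the (id, v) pairs that actually occur
        generate val_ = 1
        * Add the missing (id, v) pairs; added rows get _fillin == 1 and val_ missing
        fillin id v
        * Turn the added rows into explicit zeros
        replace val_ = 0 if _fillin
        list, clean
        ```

        After this step every id has a row for every string, so val_ can be reshaped wide without gaps.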


        • #5
          Hello William, thank you for the follow-up! It is interesting to see that Stata has no built-in functionality for this problem (in fact I am quite happy about that, because otherwise I really should have found the solution myself). I did try with fillin and with tab, generate(...), but it did not work. Also, it still feels more natural to me to do such things outside of Stata, with a "normal" programming language like Python or Java. But once you _can_ do it with Stata, the entire workflow improves quite a bit (no more switching tools, fewer imports, fewer exports). Thanks again and I hope you enjoyed your coffee :-) Jörn
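
          For comparison with the reshape and fillin approach in #2, the same indicators can probably also be built with a plain loop, closer to how one might write it in a general-purpose language. This is only a sketch under assumptions: the variables are named v1-v4 as in the original question, missing entries are empty strings, and every string is a legal Stata variable name that does not clash with an existing variable. The local macro names allvals and vals are made up for this example.

          ```stata
          * Example data from the original question (empty string = no entry)
          clear
          input float id str8(v1 v2 v3 v4)
          1 "A" "B" ""  ""
          2 "B" ""  ""  ""
          3 "C" "D" "A" "B"
          end

          * Collect the distinct non-empty strings found in v1-v4
          local allvals
          foreach x of varlist v1-v4 {
              quietly levelsof `x', local(vals) clean
              local allvals : list allvals | vals
          }

          * Create one 0/1 indicator per distinct string
          * (assumes each string is a valid, unused variable name)
          foreach s of local allvals {
              generate byte `s' = 0
              foreach x of varlist v1-v4 {
                  quietly replace `s' = 1 if `x' == "`s'"
              }
          }
          drop v1-v4
          list, clean
          ```

          Unlike the reshape route, this keeps the data wide throughout, at the cost of one pass over v1-v4 per distinct string.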
