  • Problem reshaping from long to wide

    Hi all,

    I have a dataset like this (see below), and what I want is a wide form (see the second table).
    ID  staff1  staff2  staff3  item
     4  Julie   Brian   Ben     a
     4  Julie   Brian   Ben     b
     6  Ben     Molly   Drew    f
     6  Ben     Molly   Drew    f
     6  Ben     Molly   Drew    g
     9  Julie   Ben     Brian   h
     9  Julie   Ben     Brian   a
     9  Molly   Julie   Brian   e
    11  Tiger   Ben     Ben     j
    ......

    ID  staff1  staff2  staff3  item1  amount1  item2  amount2  item3  amount3  ......
     4  J       B       B       a      1        b      1        .      .        ......
     6  B       M       D       f      2        g      1        .      .        ......
     9  J       B       B       h      1        a      1        e      1        ......
    11  T       B       B       j      1        .      .        .      .        ......
    First, for each ID, I want to check whether the combination of the staff variables is identical across all rows (the id 9 row with staff Molly, Julie, Brian is an example of a mismatch). Since I have approximately 200,000,000 observations (I may split the dataset into smaller pieces if that exceeds the limits of reshape), checking them one by one is not feasible. If the combinations do match, how can I get a result like the second table?

    Also, I would like to add a column for each item in the wide form giving the number of times that item occurs (the amount variables in the second table).

    I have no idea how to check the combination of "staff" variables or how to add the count columns. The reshape code I can think of is:

    Code:
    bys id: gen n_id = _n
    reshape wide item, i(id) j(n_id)
    Thanks very much!
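For readers who want to prototype the two steps above outside Stata, here is a minimal pandas sketch (pandas is an assumption, not part of the thread; column names are taken from the post's example):

```python
import pandas as pd

# Toy data mirroring the long-form table above.
df = pd.DataFrame({
    "id":     [4, 4, 6, 6, 6, 9, 9, 9, 11],
    "staff1": ["Julie", "Julie", "Ben", "Ben", "Ben", "Julie", "Julie", "Molly", "Tiger"],
    "staff2": ["Brian", "Brian", "Molly", "Molly", "Molly", "Ben", "Ben", "Julie", "Ben"],
    "staff3": ["Ben", "Ben", "Drew", "Drew", "Drew", "Brian", "Brian", "Brian", "Ben"],
    "item":   ["a", "b", "f", "f", "g", "h", "a", "e", "j"],
})

# 1) Flag IDs whose (staff1, staff2, staff3) combination varies across rows.
consistent = df.groupby("id")[["staff1", "staff2", "staff3"]].nunique().le(1).all(axis=1)
bad_ids = sorted(consistent.index[~consistent])

# 2) Count duplicate items per ID ("amount"), then reshape long -> wide.
counts = (df.groupby(["id", "item"], as_index=False).size()
            .rename(columns={"size": "amount"}))
counts["j"] = counts.groupby("id").cumcount() + 1   # becomes item1, item2, ...
wide = counts.pivot(index="id", columns="j", values=["item", "amount"])
```

Here `bad_ids` contains only id 9, the inconsistent case, and `wide` reproduces the second table (missing cells become NaN rather than `.`).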


  • #2
    Do you get the right result with the code you provided? You might need to add "amount" after item. Try it on a small dataset first (e.g. keep only the first 5,000 observations or so and save that as data_sample5k).

    The reshape will take ages, by the way. If you search around a bit you will find user-written commands that are a lot faster for relatively straightforward reshapes like these. I think Sergio Correia's ftools has a reshape command, but I'm not sure.



    • #3
      Thanks Jesse.

      Yes, I'm trying it on a small sample. My code does nothing with the "staff" variables or the extra count columns, so it can't produce the second table that I want.

      I searched a lot before posting the question but still found it confusing. ftools might be useful when I reshape the whole dataset. Thanks.



      • #4
        It's far from clear to me that you will be better off with a wide dataset anyway. However, note that

        1. dataex (SSC) is the method of choice to give us data examples (FAQ Section 12)

        2. contract can help.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte id str5(staff1 staff2 staff3) str1 item
         4 "Julie" "Brian" "Ben"   "a"
         4 "Julie" "Brian" "Ben"   "b"
         6 "Ben"   "Molly" "Drew"  "f"
         6 "Ben"   "Molly" "Drew"  "f"
         6 "Ben"   "Molly" "Drew"  "g"
         9 "Julie" "Ben"   "Brian" "h"
         9 "Julie" "Ben"   "Brian" "a"
         9 "Molly" "Julie" "Brian" "e"
        11 "Tiger" "Ben"   "Ben"   "j"
        end
        
        . contract *
        
        . l
        
             +----------------------------------------------+
             | id   staff1   staff2   staff3   item   _freq |
             |----------------------------------------------|
          1. |  4    Julie    Brian      Ben      a       1 |
          2. |  4    Julie    Brian      Ben      b       1 |
          3. |  6      Ben    Molly     Drew      f       2 |
          4. |  6      Ben    Molly     Drew      g       1 |
          5. |  9    Julie      Ben    Brian      a       1 |
             |----------------------------------------------|
          6. |  9    Julie      Ben    Brian      h       1 |
          7. |  9    Molly    Julie    Brian      e       1 |
          8. | 11    Tiger      Ben      Ben      j       1 |
             +----------------------------------------------+



        • #5
          Thanks Nick.

          I was hoping to extract this part of the data from my dataset in order to make it smaller. I actually have about 100 variables, and many of them, such as "year", "week", and "type" (see below), are identical within each value of "ID". If I could take the part shown in my original post out of the dataset, I could then use duplicates to drop the repeated information. In this way, the dataset could be split into two parts linked by "ID".

          Sorry that I didn't make it clear.

          Code:
          . input byte id str5(staff1 staff2 staff3) str1 item int year int period int week int day int location str4 type

            1.  4 "Julie" "Brian" "Ben"   "a" 2014 12 4 7 287 "F"
            2.  4 "Julie" "Brian" "Ben"   "b" 2014 12 4 7 287 "F"
            3.  6 "Ben"   "Molly" "Drew"  "f" 2015  2 4 7 126 "F"
            4.  6 "Ben"   "Molly" "Drew"  "f" 2015  2 4 7 126 "F"
            5.  9 "Julie" "Ben"   "Brian" "h" 2015  6 3 3 126 "L"
            6.  9 "Julie" "Ben"   "Brian" "a" 2015  6 3 3 126 "L"
            7.  9 "Julie" "Ben"   "Brian" "e" 2015  6 3 3 126 "L"
            8.  9 "Molly" "Julie" "Brian" "e" 2015  6 3 3 126 "L"
            9. 11 "Tiger" "Ben"   "Ben"   "j" 2014  3 4 6 287 "F"
           10. end
           
          . contract *
           
          . l
           
               +---------------------------------------------------------------------------------------------+
               | id   staff1   staff2   staff3   item   year   period   week   day   location   type   _freq |
               |---------------------------------------------------------------------------------------------|
            1. |  4    Julie    Brian      Ben      a   2014       12      4     7        287      F       1 |
            2. |  4    Julie    Brian      Ben      b   2014       12      4     7        287      F       1 |
             3. |  6      Ben    Molly     Drew      f   2015        2      4     7        126      F       2 |
            4. |  9    Julie      Ben    Brian      a   2015        6      3     3        126      L       1 |
            5. |  9    Julie      Ben    Brian      e   2015        6      3     3        126      L       1 |
               |---------------------------------------------------------------------------------------------|
            6. |  9    Julie      Ben    Brian      h   2015        6      3     3        126      L       1 |
            7. |  9    Molly    Julie    Brian      e   2015        6      3     3        126      L       1 |
            8. | 11    Tiger      Ben      Ben      j   2014        3      4     6        287      F       1 |
               +---------------------------------------------------------------------------------------------+
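The split being proposed (one id-level table of constants, one long detail table, linked by "ID") can be sketched in pandas as follows (pandas and the reduced column lists are illustrative assumptions; the real dataset has ~100 variables):

```python
import pandas as pd

# Toy data: year/week/type are constant within each id.
df = pd.DataFrame({
    "id":   [4, 4, 6, 6],
    "item": ["a", "b", "f", "f"],
    "year": [2014, 2014, 2015, 2015],
    "week": [4, 4, 4, 4],
    "type": ["F", "F", "F", "F"],
})

# One row per id for the constant variables (the "duplicates drop" step) ...
id_level = df[["id", "year", "week", "type"]].drop_duplicates()

# ... while the varying part stays long.
detail = df[["id", "item"]]

# Re-joining on id recovers the original layout, so no information is lost.
rejoined = detail.merge(id_level, on="id")
```

Whether this actually saves much space depends on how many variables really are constant within id, which is the point raised in the next reply.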



          • #6
            Trading observations for variables won't necessarily make the dataset much smaller -- or (even more important) easier to handle. Do you foresee calculations that you know will be easier? Do you have a strategy for analyses in wide layout?



            • #7
              Currently I'm still investigating the data, but it takes the computer a long time to respond to each step, and the sheer number of observations is overwhelming. I don't have much experience, so I thought it might be because of the huge size of the dataset. But I do think it will be easier to investigate if it is split into two pieces.



              • #8
                If presented with a problem like yours, with 200,000,000 observations, I would first extract a sensible random sample of, say, 1% of the IDs (or perhaps just 0.1%) to produce a smaller dataset, on which I would develop my understanding of the data and fully develop the analyses on this sample. Then, and only then, would I go about scaling the actual analysis up to the full dataset. You'll save a lot of time if you don't drag 200,000,000 observations down the paths that eventually prove to be unproductive.

                As Nick hints, you are probably making an error in attempting to reshape your data in the manner you describe. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data.
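Sampling whole IDs (rather than individual rows, so each sampled id keeps all of its observations) can be sketched in pandas like this (pandas, the toy data, the 1% fraction, and the seed are all illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy data standing in for the 200,000,000-row dataset.
df = pd.DataFrame({"id":   [4, 4, 6, 6, 6, 9, 9, 9, 11],
                   "item": list("abffghaej")})

rng = np.random.default_rng(12345)
ids = df["id"].unique()
n_keep = max(1, round(len(ids) * 0.01))          # 1% of IDs, at least one
keep = rng.choice(ids, size=n_keep, replace=False)

# Keep every row belonging to a sampled id.
sample = df[df["id"].isin(keep)]
```

The key point is that the sampling unit is the id, so within-id structure (like the staff combinations discussed above) survives intact in the sample.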



                • #9
                  Thanks William.
                  That helps a lot. I will reconsider how I investigate the data.
                  I've already investigated a sample, and now need to go through the whole dataset. I wasn't aware of the difference between using a long layout and a wide layout, but I will now think about it.

