Problem about reshaping from long to wide

Kate Jia

Join Date: Sep 2017
Posts: 6

Problem about reshaping from long to wide

22 Sep 2017, 01:24

Hi all,

I have a dataset like this (see below), and what I want is a wide form (see the second table).

ID	staff1	staff2	staff3	item
4	Julie	Brian	Ben	a
4	Julie	Brian	Ben	b
6	Ben	Molly	Drew	f
6	Ben	Molly	Drew	f
6	Ben	Molly	Drew	g
9	Julie	Ben	Brian	h
9	Julie	Ben	Brian	a
9	Molly	Julie	Brian	e
11	Tiger	Ben	Ben	j
......	......	......	......	......

ID	staff1	staff2	staff3	item1	amount1	item2	amount2	item3	amount3	......
4	J	B	B	a	1	b	1	.	.	......
6	B	M	D	f	2	g	1	.	.	......
9	J	B	B	h	1	a	1	e	1	......
11	T	B	B	j	1	.	.	.	.	......

First, I want to check for each ID, if the combination of "staff" is identical (example in Bold is not expected). Since I have approximately 200,000,000 observations (I may split the dataset into small pieces if breaking the limits of reshape), I can't check one by one which is very awkward. If so(hopefully), how can I get a result like the second table?

Also, I would like to add a column (the number of items) for each item in the wide form.

I have no idea how to check the combination of staff" and add the column. The reshape code I can think of is,

bys id: gen n_id=_n
reshape wide item, i(id) j( n_id)

Thanks very much!

Tags: None

Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

22 Sep 2017, 02:57

Do you get the right result when using the code you provided? You might need to add "amount" after item. Try it on a small dataset first (i.e. keep only the first 5000 observations or so and save this as data_sample5k).

The reshape will take ages by the way. I think if you google around a bit you will find some userwritten commands that are a lot faster for relatively straightforward reshapes like these. I think Sergio Correia's ftools has a reshape command, but I'm not sure.
Comment
Kate Jia

Join Date: Sep 2017

Posts: 6
#3

22 Sep 2017, 03:29

Thanks Jesse.

Yes, I'm trying on a small sample. My code has nothing to do with "staff" variables and add-on columns so it can't produce the second table that I want.

I've searched a lot before I posted the question but still found it confusing. Ftools might be useful when I do the reshape with the whole dataset. Thanks.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

22 Sep 2017, 03:39

It's far from clear to me that you will be better off with a wide dataset any way. However, note that

1. dataex (SSC) is the method of choice to give us data examples (FAQ Section 12)

2. contract can help.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str5(staff1 staff2 staff3) str1 item
 4 "Julie" "Brian" "Ben"   "a"
 4 "Julie" "Brian" "Ben"   "b"
 6 "Ben"   "Molly" "Drew"  "f"
 6 "Ben"   "Molly" "Drew"  "f"
 6 "Ben"   "Molly" "Drew"  "g"
 9 "Julie" "Ben"   "Brian" "h"
 9 "Julie" "Ben"   "Brian" "a"
 9 "Molly" "Julie" "Brian" "e"
11 "Tiger" "Ben"   "Ben"   "j"
end

. contract *

. l

     +----------------------------------------------+
     | id   staff1   staff2   staff3   item   _freq |
     |----------------------------------------------|
  1. |  4    Julie    Brian      Ben      a       1 |
  2. |  4    Julie    Brian      Ben      b       1 |
  3. |  6      Ben    Molly     Drew      f       2 |
  4. |  6      Ben    Molly     Drew      g       1 |
  5. |  9    Julie      Ben    Brian      a       1 |
     |----------------------------------------------|
  6. |  9    Julie      Ben    Brian      h       1 |
  7. |  9    Molly    Julie    Brian      e       1 |
  8. | 11    Tiger      Ben      Ben      j       1 |
     +----------------------------------------------+

Comment

Kate Jia

Join Date: Sep 2017
Posts: 6

22 Sep 2017, 04:13

Thanks Nick.

I was hoping to extract this part of data out of my dataset in order to make it smaller. Actually I have about 100 variables and many of them are like "year", "week" "type" (see below), etc. They are identical for each value of "ID". If I could take the part in my original post out of my dataset, then I could use "duplicates" to drop repeating information. In this way, the dataset could be split into two parts linked by "ID".

Sorry that I didn't make it clear.

Code:

. input byte id str5(staff1 staff2 staff3) str1 item int year int period int week int day int location str4 type
 
1. 4 "Julie" "Brian" "Ben" "a" 2014 12 4 7 287 "F"
2. 4 "Julie" "Brian" "Ben" "b" 2014 12 4 7 287 "F" 
3. 6 "Ben"   "Molly" "Drew"  "f" 2015 2 4 7 126 "F"
4. 6 "Ben" "Molly" "Drew" "f" 2015 2 4 7 126 "F"
5. 9 "Julie" "Ben" "Brian" "h" 2015 6 3 3 126 "L"
6. 9 "Julie" "Ben" "Brian" "a" 2015 6 3 3 126 "L"
7. 9 "Julie" "Ben" "Brian" "e" 2015 6 3 3 126 "L"
8. 9 "Molly" "Julie" "Brian" "e" 2015 6 3 3 126 "L"
9. 11 "Tiger" "Ben"   "Ben"   "j" 2014 3 4 6 287 "F"
10. end
 
. contract *
 
. l
 
     +---------------------------------------------------------------------------------------------+
     | id   staff1   staff2   staff3   item   year   period   week   day   location   type   _freq |
     |---------------------------------------------------------------------------------------------|
  1. |  4    Julie    Brian      Ben      a   2014       12      4     7        287      F       1 |
  2. |  4    Julie    Brian      Ben      b   2014       12      4     7        287      F       1 |
  3. |  6     Ben    Molly     Drew      f   2015        2      4     7        126      F       2 |
  4. |  9    Julie      Ben    Brian      a   2015        6      3     3        126      L       1 |
  5. |  9    Julie      Ben    Brian      e   2015        6      3     3        126      L       1 |
     |---------------------------------------------------------------------------------------------|
  6. |  9    Julie      Ben    Brian      h   2015        6      3     3        126      L       1 |
  7. |  9    Molly    Julie    Brian      e   2015        6      3     3        126      L       1 |
  8. | 11    Tiger      Ben      Ben      j   2014        3      4     6        287      F       1 |
     +---------------------------------------------------------------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35724
#6

22 Sep 2017, 04:26

Trading observations for variables won't necessarily make the dataset much smaller -- or (even more important) easier to handle. Do you foresee calculations that you know will be easier? Do you have a strategy for analyses in wide layout?
Comment
Kate Jia

Join Date: Sep 2017

Posts: 6
#7

22 Sep 2017, 05:16

Currently I'm still investigating the data, yet it takes too much time for the computer to respond to each step and countless observations look very overwhelming. I don't have much experience so I thought it might because of the huge size of the dataset. But I do think it will easier to investigate if being split into two pieces.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

22 Sep 2017, 05:55

If presented with a problem like yours, with 200,000,000 observations, I would first extract a sensible random sample of, say, 1% of the IDs (or perhaps just 0.1%) to produce a smaller dataset, on which I would develop my understanding of the data and fully develop the analyses on this sample. Then, and only then, would I go about scaling the actual analysis up to the full dataset. You'll save a lot of time if you don't drag 200,000,000 observations down the paths that eventually prove to be unproductive.

As Nick hints, you are probably making an error in attempting to reshape your data in the manner you describe. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data.
Comment
Kate Jia

Join Date: Sep 2017

Posts: 6
#9

22 Sep 2017, 06:45

Thanks William.
That helps a lot. I would reconsider my investigation on the data.
I've already done an investigation of a sample, and now need to go through the whole dataset. I had no idea of the difference between using a long layout and a wide layout, but now I would think about it.
Comment

Announcement

Problem about reshaping from long to wide

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment