reshape data?

River Huang

Join Date: Mar 2016
Posts: 1908

14 Feb 2019, 00:50

Dear All, I find this question here (https://bbs.pinggu.org/thread-6924503-1-1.html). The data is

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int year str41 TypeID
4 2010 "P2503,P2515,P2599"      
4 2011 "P2503 P2504 P2505"      
5 2008 "P2503,P2504,P2505,P2599"
5 2009 "P2504,P2515,P2599"      
end

and wish to reshape as

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id year P2503 P2504 P2505 P2515 P2599)
4 2010 1 . . 1 1
4 2011 1 1 1 . .
5 2008 1 1 1 . 1
5 2009 . 1 . 1 1
end

Any suggestions? It changes P* to be variable names, and for each id year, gives the value 1 if P* are included in TypeID.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Tags: None

Jorrit Gosens

Join Date: Jan 2015
Posts: 1019

14 Feb 2019, 01:02

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int year str41 TypeID
4 2010 "P2503,P2515,P2599"      
4 2011 "P2503 P2504 P2505"      
5 2008 "P2503,P2504,P2505,P2599"
5 2009 "P2504,P2515,P2599"      
end

local vars P2503 P2504 P2505 P2515 P2599

foreach v of local vars{
gen `v'=.
replace `v'=1 if strpos(TypeID, "`v'")
}

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

14 Feb 2019, 01:15

The solution of Jorrit Gosens depends on writing down the variables in advance. That works for a toy example, but is less practical for a bigger one.

Code:

clear
input long id int year str41 TypeID
4 2010 "P2503,P2515,P2599"      
4 2011 "P2503 P2504 P2505"      
5 2008 "P2503,P2504,P2505,P2599"
5 2009 "P2504,P2515,P2599"      
end

replace TypeID = subinstr(TypeID, ",", " ", .) 
gen N = wordcount(TypeID) 
expand N 
bysort id year : gen which = word(TypeID, _n) 

gen z = 1 
reshape wide z , i(id year) j(which) string 
drop N TypeID 
rename (z*) (*) 
list 

     +---------------------------------------------------+
     | id   year   P2503   P2504   P2505   P2515   P2599 |
     |---------------------------------------------------|
  1. |  4   2010       1       .       .       1       1 |
  2. |  4   2011       1       1       1       .       . |
  3. |  5   2008       1       1       1       .       1 |
  4. |  5   2009       .       1       .       1       1 |
     +---------------------------------------------------+

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#4

14 Feb 2019, 01:56

Dear @Jorrit Gosens and @Nick Cox, Thank you both for the helpful suggestions.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment

River Huang

Join Date: Mar 2016
Posts: 1908

14 Feb 2019, 02:22

Dear Nick, Can you have a look at this example?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int year str41 TypeID
4 2010 "P2503,P2515,P2599"      
4 2011 "P2503,P2515,P2599"      
4 2012 "P2503,P2515,P2599"      
4 2013 "P2503,P2515,P2599"      
4 2008 "P2503,P2599"            
4 2009 "P2503,P2599"            
4 2010 "P2503,P2599"            
5 2007 "P2503,P2504,P2505,P2599"
5 2008 "P2503,P2504,P2505,P2599"
5 2009 "P2503,P2504,P2505,P2599"
end

Something wrong?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

14 Feb 2019, 02:33

You have duplicated observations for id 4 year 2010. You tell us what would be right. Here I remove duplicates and take mentions to be correct otherwise.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int year str41 TypeID
4 2010 "P2503,P2515,P2599"      
4 2011 "P2503,P2515,P2599"      
4 2012 "P2503,P2515,P2599"      
4 2013 "P2503,P2515,P2599"      
4 2008 "P2503,P2599"            
4 2009 "P2503,P2599"            
4 2010 "P2503,P2599"            
5 2007 "P2503,P2504,P2505,P2599"
5 2008 "P2503,P2504,P2505,P2599"
5 2009 "P2503,P2504,P2505,P2599"
end

replace TypeID = subinstr(TypeID, ",", " ", .) 
bysort id year : replace TypeID = TypeID[_n-1] + " " + TypeID if _n > 1 
bysort id year : replace TypeID = TypeID[_N] if _N > 1 
bysort id year : keep if _n == _N 
gen N = wordcount(TypeID) 
expand N 
bysort id year : gen which = word(TypeID, _n) 
drop N TypeID 
duplicates drop id year which , force 
gen z = 1 
reshape wide z , i(id year) j(which) string 
rename (z*) (*) 
list , sepby(id) 


     +---------------------------------------------------+
     | id   year   P2503   P2504   P2505   P2515   P2599 |
     |---------------------------------------------------|
  1. |  4   2008       1       .       .       .       1 |
  2. |  4   2009       1       .       .       .       1 |
  3. |  4   2010       1       .       .       1       1 |
  4. |  4   2011       1       .       .       1       1 |
  5. |  4   2012       1       .       .       1       1 |
  6. |  4   2013       1       .       .       1       1 |
     |---------------------------------------------------|
  7. |  5   2007       1       1       1       .       1 |
  8. |  5   2008       1       1       1       .       1 |
  9. |  5   2009       1       1       1       .       1 |
     +---------------------------------------------------+

.

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#7

14 Feb 2019, 02:57

Dear Nick, Thanks for the reply. I found the problem is due to duplicates as well. I am asking the one who raised this question to see what happened?

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

14 Feb 2019, 07:22

The following approach postpones the decision on what to do about duplicates. It also replaces missing values in the results with zeroes, which may be preferable depending on what is to be done with the results.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long id int year str41 TypeID
4 2010 "P2503,P2515,P2599"      
4 2011 "P2503,P2515,P2599"      
4 2012 "P2503,P2515,P2599"      
4 2013 "P2503,P2515,P2599"      
4 2008 "P2503,P2599"            
4 2009 "P2503,P2599"            
4 2010 "P2503,P2599"            
5 2007 "P2503,P2504,P2505,P2599"
5 2008 "P2503,P2504,P2505,P2599"
5 2009 "P2503,P2504,P2505,P2599"
end

split TypeID, parse(,)
drop TypeID
gen seq = _n
reshape long TypeID, i(seq) j(num)
drop if TypeID==""
list if seq==1, clean noobs

drop num
generate one = 1
reshape wide one, i(seq id year) j(TypeID) string
rename (one*) (*)
list if seq==1, clean noobs

collapse (sum) P*, by(id year)
list, sepby(id) noobs

Code:

. list, sepby(id) noobs

  +---------------------------------------------------+
  | id   year   P2503   P2504   P2505   P2515   P2599 |
  |---------------------------------------------------|
  |  4   2008       1       0       0       0       1 |
  |  4   2009       1       0       0       0       1 |
  |  4   2010       2       0       0       1       2 |
  |  4   2011       1       0       0       1       1 |
  |  4   2012       1       0       0       1       1 |
  |  4   2013       1       0       0       1       1 |
  |---------------------------------------------------|
  |  5   2007       1       1       1       0       1 |
  |  5   2008       1       1       1       0       1 |
  |  5   2009       1       1       1       0       1 |
  +---------------------------------------------------+

Last edited by William Lisowski; 14 Feb 2019, 07:25.

Comment

River Huang

Join Date: Mar 2016

Posts: 1908
#9

14 Feb 2019, 16:32

Dear William, Thank you for this additional suggestion.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Comment

Announcement

reshape data?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment