Dealing with string categorical column, reshaping long dataset to wide

Vincent KH Lee

Join Date: Feb 2022

Posts: 26
#1

Dealing with string categorical column, reshaping long dataset to wide

17 Apr 2023, 01:32

Hi,

I am trying to convert some columns of a long dataset into wide format.

the data looks somewhat like below:

id date sale_amt discount_yn
1 01jan2013 50 y
1 02jan2013 60 n
2 01jan2013 30 y
2 02jan2013 20 n
3 01jan2013 40 y
3 02jan2013 15 n

I want to convert the discount_yn column into wide format, so that when I collapse the discount_yn column, I can retrieve the sum number of discounts each id received per month.

Ideally, I want the data to look somewhat like this:

id date sale_amt discount_y discount_n
1 01jan2013 50 1 0
1 02jan2013 60 0 1
2 01jan2013 30 1 0
2 02jan2013 20 0 1
3 01jan2013 40 1 0
3 02jan2013 15 0 1

Then,

collapse (sum) discount_y, by(id)

I thought the solution is to use 'reshape' command,

so I tried

reshape wide discount_yn, i(id) j(date)

but the result created columns discount_01jan2013 discount_02jan2013............. and so on.

As far as I can tell, the reshape command won't automatically separate the column discount_yn to discount_y and discount_n by itself, I think I am missing something here.

Could anyone provide some insights to help me get through this issue?

Thanks in advance.
Tags: None
Vincent KH Lee

Join Date: Feb 2022

Posts: 26
#2

17 Apr 2023, 02:24

I did manage to get the results with a slightly unsophisticated way.

Code:

gen discount_no = 1 if discount_yn == "Y" replace discount_no = 0 if discount_no == .

It did the job I was looking for, but I am not sure if this is the only and the most efficient way.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#3

17 Apr 2023, 11:55

This is definitely not a job for -reshape-. The question in my mind, though, is why you want to create separate discount_y and discount_n variables in the first place. The two variables will have exactly the same information, just coded differently. I can't think of anything you would need the second variable for. Converting from y/n to 1/0 coding is useful, yes. But that's easy enough. You've already discovered one way to do it. There are others, such as:

Code:

// METHOD 1 gen discount = (discount_yn == "y") // METHOD 2 label define yesno 0 "n" 1 "y" decode discount_yn, gen(discount) label(yesno)

Once you have the 0/1 variable discount, which is identical to your proposed discount_y, anything that you would have done with your discount_n can then be done with !discount or 1-discount.

Finally, for the purpose of just counting up the number of discounts each id received each month, you don't even need to do the numerical conversion. From your original data it would go as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte id float date byte sale_amt str1 discount_yn 1 19359 50 "y" 1 19360 60 "n" 2 19359 30 "y" 2 19360 20 "n" 3 19359 40 "y" 3 19360 15 "n" end format %td date gen mdate = mofd(date) format mdate %tm by id mdate (date), sort: egen num_discounts = total(discount_yn == "y")

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment
Vincent KH Lee

Join Date: Feb 2022

Posts: 26
#4

18 Apr 2023, 03:01

Thank you so much Clyde,

the final suggestion with the egen worked out perfectly.

I should really take a closer look at the egen command as it seems more powerful and useful than I thought, and will definitely try -dataex- in the future to replicate the dataset.
Comment

Announcement

Dealing with string categorical column, reshaping long dataset to wide

Comment

Comment

Comment