
  • Problem with reshape

    Hi all,
    I'm having a problem with reshape that I hope I won't have to fix manually. I have a dataset of daily (non-consecutive) observations for many companies (I can't currently say the exact number). For each of them I also have their daily stock repurchase quantity and their number of shares. When I reshape long to wide,
    Code:
    reshape wide amount s_o px_last, i(date) j(id) string
    I get the error message r(9), which says that there are observations within i(date) with the same value of j(id). The reason is simple: on a given date, one company may repurchase only once, while another may repurchase more than once.
    Below is an example of my dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 id int date double(amount s_o px_last)
    "secBP/"  16714    7938000 20965.041    634.5
    "secBT/A" 16714    1000000  8450.818      223
    "secDGE"  16714    1400000  2938.265      841
    "secDXNS" 16714 750000.062   1858.33  106.436
    "secHAS"  16714    2000000  1573.364   119.25
    "secHNS"  16714 325000.031   728.769      585
    "secIMB"  16714 300000.031   715.236  1412.33
    "secIMI"  16714  80000.008   347.035  489.429
    "secJDW"  16714    1000000   174.481      290
    "secNXT"  16714  50000.004   254.797     1360
    "secNXT"  16714 191033.016   254.797     1360
    "secNXT"  16714  50000.004   254.797     1360
    "secPMP"  16714 100000.008     9.901    188.5
    "secTNI"  16714  50000.004   289.311    588.5
    "secULVR" 16714 423410.031  2885.778 1316.667
    "secVOD"  16714   17500000     62976  214.651
    "secWMH"  16714 750000.062   387.766  381.485
    "secABP"  16715  50000.004   305.977      523
    "secAIE"  16715  37702.004   348.541   67.219
    "secBP/"  16715    6808000 20965.041    620.5
    "secBT/A" 16715    1000000  8450.818    219.5
    end
    format %tdDD/NN/CCYY date
    As you can see, the id "secNXT" has three repurchase transactions on the same day. This is not a problem, since I'm working with daily observations, so that:
    1) I can sum up all the transactions (column "amount") within each day;
    2) for the column "s_o", I need only the last observation of each day with multiple transactions;
    3) for the column "px_last", I need only one observation per day, which can be the last one, any other, or even the mean.

    Apologies that the dates appear as integers.

    Any suggestions on how to avoid doing this manually?

    Thank You

    SG


  • #2
    You should never even contemplate doing anything like this manually. Manual revisions to the data leave no audit trail: nobody will ever be able to know what was done or whether it was correct. Nor should you accept from others data manipulations that were done manually (unless the data set is so small and the manipulations so trivial that you can verify them on sight).

    On the positive side, there is always a way to do it properly in code.

    So the first step is to reduce the data to one observation per security per date, following steps 1) through 3) in your post. That's what the -collapse- command does.

    Then we can do the -reshape-, except there is a problem: some of the values of id include characters that are not legal in variable names. In particular, you cannot have a / character in a name. So first we have to convert every value of id into something suitable for inclusion in a variable name.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 id int date double(amount s_o px_last)
    "secBP/"  16714    7938000 20965.041    634.5
    "secBT/A" 16714    1000000  8450.818      223
    "secDGE"  16714    1400000  2938.265      841
    "secDXNS" 16714 750000.062   1858.33  106.436
    "secHAS"  16714    2000000  1573.364   119.25
    "secHNS"  16714 325000.031   728.769      585
    "secIMB"  16714 300000.031   715.236  1412.33
    "secIMI"  16714  80000.008   347.035  489.429
    "secJDW"  16714    1000000   174.481      290
    "secNXT"  16714  50000.004   254.797     1360
    "secNXT"  16714 191033.016   254.797     1360
    "secNXT"  16714  50000.004   254.797     1360
    "secPMP"  16714 100000.008     9.901    188.5
    "secTNI"  16714  50000.004   289.311    588.5
    "secULVR" 16714 423410.031  2885.778 1316.667
    "secVOD"  16714   17500000     62976  214.651
    "secWMH"  16714 750000.062   387.766  381.485
    "secABP"  16715  50000.004   305.977      523
    "secAIE"  16715  37702.004   348.541   67.219
    "secBP/"  16715    6808000 20965.041    620.5
    "secBT/A" 16715    1000000  8450.818    219.5
    end
    format %tdDD/NN/CCYY date
    
    // APPLY RULES 1) THROUGH 3)
    collapse (sum) amount (last) s_o px_last, by(date id)
    
    // CREATE A NEW ID VARIABLE THAT CAN BE PART OF A VARIABLE NAME
    clonevar new_id = id
    replace new_id = strtoname(id)
    // VERIFY THAT THE NEW NAMES CORRESPOND 1-1 WITH OLD NAMES
    by new_id (id), sort: assert id[1] == id[_N]
    assert length(new_id) <= 25  // longest stub is px_last (7 chars); variable names max 32
    drop id
    
    reshape wide amount s_o px_last, i(date) j(new_id) string
    Notes:

    1. The conversion of the original ids to new ones that can be part of variable names is a bit tricky. The strtoname() function replaces all inadmissible characters with underscore (_) characters. But it won't help if some id is long enough that, once appended after a stub such as amount, the resulting name exceeds Stata's 32-character limit on variable names. Also, if there were two securities, say secA/ and secA&, strtoname() would reduce both of them to secA_, so you would be conflating two different securities. That is why the code above contains commands to check that neither of these misfortunes has actually occurred. If either of the -assert- commands fails, you will need some -replace- statements to change the offending values of id to something that won't cause problems.
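    To make the collision risk concrete, here is a small sketch using the hypothetical ids secA/ and secA& from the note above:

    Code:
    * Hypothetical illustration of a strtoname() collision
    clear
    input str11 id
    "secA/"
    "secA&"
    end
    gen new_id = strtoname(id)
    * both rows now have new_id == "secA_", so two distinct securities collide;
    * the -assert- check from the main code catches this and stops with an error:
    by new_id (id), sort: assert id[1] == id[_N]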

    2. Having shown you how to do this, I will also caution you that this is probably not a wise thing to do. I don't know where you're taking this data next. If you just plan to export it to a spreadsheet for viewing, or plan to make some parallel graphs for different securities, then this is suitable. But if you are going to do more actual analysis, you will probably regret having done the -reshape-. Nearly all Stata analysis commands work better (some only work at all) with data in long layout. I fear your next post on Statalist will be a desire to use these results for some analysis, and the first step in the solution will be to -reshape- back to long. So think carefully about what you plan to do with this data, because there are very few things that are facilitated by going wide.

    Aside: don't apologize for the dates being integers. The reason -dataex- posts them that way is that this is how Stata represents dates internally. -dataex- output is not intended for human readability: it is for importation into Stata via copy/paste to the do-editor. Since you are, I imagine, surprised to learn this fact about dates, I strongly urge you to read the datetimes section of the PDF manuals that come with your Stata installation. In your line of work, you will undoubtedly need to do calculations and transformations with dates, and you must understand how Stata approaches these things. The time you invest will be well repaid. No, you will not remember everything from the reading: even the most experienced among us need to go to the help files or the manual to refresh ourselves on the details. But you will not be able to do effective panel data work with Stata if you do not master the fundamentals of how Stata handles dates and times.
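    As a quick check (not part of the solution), you can ask Stata to display such an integer with a date format; Stata date values count days since 01jan1960:

    Code:
    * 16714 is the first date value in the example data
    display %td 16714
    display %tdDD/NN/CCYY 16714

    The first line prints the default %td rendering of day 16714; the second uses the same display format applied in the thread's code.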



    • #3
      Thank you very much for your post, Clyde.
      First of all, thank you for the hint about using collapse and the strong recommendation to do nothing manually. Secondly, thanks for pointing me to the strtoname() function, which I wasn't aware of.
      Collapsing the data is indeed necessary to obtain daily measures. I also really appreciate your comment on the risks of reshaping the data. I'm trying to figure out the best way to approach this, since, as you correctly predicted, further analysis is needed. In fact, this dataset will be merged with another containing other daily data for the same new_id variables.

      I will also take your recommendation to read the manual more carefully.

      Thanks again

      SG
