
  • Problem with reshape

    Hi all,
    I'm having a problem with reshape that I hope I won't have to fix manually. I have a dataset of daily (non-consecutive) observations for many companies (I can't currently say the exact number). For each of them I also have their daily stock repurchase quantity and their number of shares. When I reshape long to wide,
    Code:
    reshape wide amount s_o px_last, i(date) j(id) string
    I get the error message r(9), which says that there are observations within i(date) with the same value of j(id). The reason is simple: on a given date, one company may repurchase only once, while another may repurchase more than once.
    Below is an example of my dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 id int date double(amount s_o px_last)
    "secBP/"  16714    7938000 20965.041    634.5
    "secBT/A" 16714    1000000  8450.818      223
    "secDGE"  16714    1400000  2938.265      841
    "secDXNS" 16714 750000.062   1858.33  106.436
    "secHAS"  16714    2000000  1573.364   119.25
    "secHNS"  16714 325000.031   728.769      585
    "secIMB"  16714 300000.031   715.236  1412.33
    "secIMI"  16714  80000.008   347.035  489.429
    "secJDW"  16714    1000000   174.481      290
    "secNXT"  16714  50000.004   254.797     1360
    "secNXT"  16714 191033.016   254.797     1360
    "secNXT"  16714  50000.004   254.797     1360
    "secPMP"  16714 100000.008     9.901    188.5
    "secTNI"  16714  50000.004   289.311    588.5
    "secULVR" 16714 423410.031  2885.778 1316.667
    "secVOD"  16714   17500000     62976  214.651
    "secWMH"  16714 750000.062   387.766  381.485
    "secABP"  16715  50000.004   305.977      523
    "secAIE"  16715  37702.004   348.541   67.219
    "secBP/"  16715    6808000 20965.041    620.5
    "secBT/A" 16715    1000000  8450.818    219.5
    end
    format %tdDD/NN/CCYY date
    As you can see, the id "secNXT" has three repurchase transactions on the same day. This is not a problem, since I'm working with daily observations, so that:
    1) I can sum up all the transactions (column "amount") within each day;
    2) for the column "s_o", I need only the last observation of each day with multiple transactions;
    3) for the column "px_last", I need only one observation per day, which can be the last one, any other, or even the mean.

    Apologies that the dates appear as integers.

    Any suggestions on how to avoid doing this manually?

    Thank You

    SG


  • #2
    You should never even contemplate doing anything like this manually. Manual revisions to the data leave no audit trail: nobody will ever be able to know what was done or whether it was correct. Nor should you accept from others data manipulations that were done manually (unless the data set is so small and the manipulations so trivial that you can verify them on sight).

    On the positive side, there is always a way to do it properly in code.

    So the first step is to reduce the data to one observation per security per date, following steps 1) through 3) in your post. That's what the -collapse- command does.

    Then we can do the -reshape-, except there is a problem: some of the values of id include characters that are not legal in variable names. In particular, you cannot have a / character in a name. So first we have to convert every value of id into something suitable for inclusion in a variable name.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 id int date double(amount s_o px_last)
    "secBP/"  16714    7938000 20965.041    634.5
    "secBT/A" 16714    1000000  8450.818      223
    "secDGE"  16714    1400000  2938.265      841
    "secDXNS" 16714 750000.062   1858.33  106.436
    "secHAS"  16714    2000000  1573.364   119.25
    "secHNS"  16714 325000.031   728.769      585
    "secIMB"  16714 300000.031   715.236  1412.33
    "secIMI"  16714  80000.008   347.035  489.429
    "secJDW"  16714    1000000   174.481      290
    "secNXT"  16714  50000.004   254.797     1360
    "secNXT"  16714 191033.016   254.797     1360
    "secNXT"  16714  50000.004   254.797     1360
    "secPMP"  16714 100000.008     9.901    188.5
    "secTNI"  16714  50000.004   289.311    588.5
    "secULVR" 16714 423410.031  2885.778 1316.667
    "secVOD"  16714   17500000     62976  214.651
    "secWMH"  16714 750000.062   387.766  381.485
    "secABP"  16715  50000.004   305.977      523
    "secAIE"  16715  37702.004   348.541   67.219
    "secBP/"  16715    6808000 20965.041    620.5
    "secBT/A" 16715    1000000  8450.818    219.5
    end
    format %tdDD/NN/CCYY date
    
    // APPLY RULES 1) THROUGH 3)
    collapse (sum) amount (last) s_o px_last, by(date id)
    
    // CREATE A NEW ID VARIABLE THAT CAN BE PART OF A VARIABLE NAME
    clonevar new_id = id
    replace new_id = strtoname(id)
    // VERIFY THAT THE NEW NAMES CORRESPOND 1-1 WITH OLD NAMES
    by new_id (id), sort: assert id[1] == id[_N]
    assert length(new_id) <= 25  // longest stub is px_last (7 chars); variable names max 32
    drop id
    
    reshape wide amount s_o px_last, i(date) j(new_id) string
    Notes:

    1. The conversion of the original ids to new ones that can be part of variable names is a bit tricky. The strtoname() function replaces all inadmissible characters with underscore (_) characters. But it won't help if some id is long enough that, once appended after a stub such as amount, the resulting name exceeds Stata's 32-character limit on variable names. Also, if there were two securities, say secA/ and secA&, strtoname() would reduce both of them to secA_, so you would be conflating two different securities. That is why the code above contains commands to check that neither of these misfortunes has actually occurred. If either of the -assert- commands fails, you will need some -replace- statements to change the offending values of id to something that won't cause problems.
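    To make the collision risk concrete, here is a small sketch using the hypothetical ids secA/ and secA& from the note above:

    Code:
    * Hypothetical illustration of a strtoname() collision
    clear
    input str11 id
    "secA/"
    "secA&"
    end
    gen new_id = strtoname(id)
    * both rows now have new_id == "secA_", so two distinct securities collide;
    * the -assert- check from the main code catches this and stops with an error:
    by new_id (id), sort: assert id[1] == id[_N]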

    2. Having shown you how to do this, I will also caution you that this is probably not a wise thing to do. I don't know where you're taking this data next. If you just plan to export it to a spreadsheet for viewing, or plan to make some parallel graphs for different securities, then this is suitable. But if you are going to do more actual analysis, you will probably regret having done the -reshape-. Nearly all Stata analysis commands work better (some only work at all) with data in long layout. I fear your next post on Statalist will be a desire to use these results for some analysis, and the first step in the solution will be to -reshape- back to long. So think carefully about what you plan to do with this data, because there are very few things that are facilitated by going wide.

    Aside: don't apologize for the dates being integers. The reason -dataex- posts them that way is that this is how Stata represents dates internally. -dataex- output is not intended for human readability: it is for importation into Stata via copy/paste to the do-editor. Since you are, I imagine, surprised to learn this fact about dates, I strongly urge you to read the datetimes section of the PDF manuals that come with your Stata installation. In your line of work, you will undoubtedly need to do calculations and transformations with dates, and you must understand how Stata approaches these things. The time you invest will be well repaid. No, you will not remember everything from the reading: even the most experienced among us need to go to the help files or the manual to refresh ourselves on the details. But you will not be able to do effective panel data work with Stata if you do not master the fundamentals of how Stata handles dates and times.
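    As a quick check (not part of the solution), you can ask Stata to display such an integer with a date format; Stata date values count days since 01jan1960:

    Code:
    * 16714 is the first date value in the example data
    display %td 16714
    display %tdDD/NN/CCYY 16714

    The first line prints the default %td rendering of day 16714; the second uses the same display format applied in the thread's code.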



    • #3
      Thank you very much for your post, Clyde.
      First of all, thank you for the hint about using collapse and the strong recommendation to do nothing manually. Secondly, thanks for pointing me to the strtoname() function, which I wasn't aware of.
      Collapsing the data is indeed necessary to obtain daily measures. I also really appreciate your comment on the risks of reshaping the data. I'm trying to figure out the best way to approach this, since, as you correctly predicted, further analysis is needed. In fact, this dataset will be merged with another containing other daily data for the same new_id variables.

      I will also take your recommendation to read the manual more carefully.

      Thanks again

      SG
