Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Screening/Cleaning data

    hi guys,

    I am working with the following data:
    . dataex

    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str6 Code_Lefted float date str48 Name double prices float return
    "130298" 13485 "ALBEMARLE"    9.25            .
    "130298" 13487 "ALBEMARLE"   9.375   .013513514
    "130298" 13489 "ALBEMARLE"   9.125   -.02666667
    "130298" 13493 "ALBEMARLE"     9.5    .04109589
    "130298" 13494 "ALBEMARLE"  9.5625   .006578947
    "130298" 13495 "ALBEMARLE"  9.4375  -.013071896
    "130298" 13501 "ALBEMARLE"  8.3125    -.1192053
    "130298" 13502 "ALBEMARLE"   9.125    .09774436
    "130298" 13503 "ALBEMARLE"  9.0625  -.006849315
    "130298" 13506 "ALBEMARLE"  8.9375  -.013793103
    "130298" 13508 "ALBEMARLE"  8.6875   -.02797203
    "130298" 13510 "ALBEMARLE"   8.875   .021582734
    "130298" 13514 "ALBEMARLE"  9.0625    .02112676
    "130298" 13516 "ALBEMARLE"  9.5625    .05517241
    "130298" 13520 "ALBEMARLE"  9.5625            0
    "130298" 13523 "ALBEMARLE"  9.4375  -.013071896
    "130298" 13524 "ALBEMARLE"  9.5625   .013245033
    "130298" 13527 "ALBEMARLE"     9.5  -.006535948
    "130298" 13531 "ALBEMARLE"   9.375  -.013157895
    "130298" 13538 "ALBEMARLE"  9.5625          .02
    "130298" 13541 "ALBEMARLE"  9.3125   -.02614379
    "130298" 13578 "ALBEMARLE"  9.8125    .05369128
    "130298" 13583 "ALBEMARLE"    9.75  -.006369427
    "130298" 13584 "ALBEMARLE"   9.875   .012820513
    "130298" 13586 "ALBEMARLE"      10   .012658228
    "130298" 13592 "ALBEMARLE"  9.4375      -.05625
    "130298" 13593 "ALBEMARLE"    9.25   -.01986755
    "130298" 13597 "ALBEMARLE"   9.375   .013513514
    "130298" 13604 "ALBEMARLE"  8.9375   -.04666667
    "130298" 13606 "ALBEMARLE"  8.9375            0
    "130298" 13608 "ALBEMARLE"       9   .006993007
    "130298" 13611 "ALBEMARLE"   9.125    .01388889
    "130298" 13613 "ALBEMARLE"    9.25    .01369863
    "130298" 13614 "ALBEMARLE"    9.25            0
    "130298" 13615 "ALBEMARLE"    9.25            0
    "130298" 13618 "ALBEMARLE"     9.5    .02702703
    "130298" 13620 "ALBEMARLE"   9.375  -.013157895
    "130298" 13621 "ALBEMARLE"       9         -.04
    "130298" 13622 "ALBEMARLE"       9            0
    "130298" 13625 "ALBEMARLE"    8.75   -.02777778
    "130298" 13628 "ALBEMARLE"       9    .02857143
    "130298" 13634 "ALBEMARLE"  8.8125  -.020833334
    "130298" 13635 "ALBEMARLE"   8.875   .007092198
    "130298" 13636 "ALBEMARLE"  9.3125    .04929578
    "130298" 13639 "ALBEMARLE"  9.3125            0
    "130298" 13640 "ALBEMARLE"    9.25   -.00671141
    "130298" 13643 "ALBEMARLE"    9.25            0
    "130298" 13646 "ALBEMARLE"  9.3125   .006756757
    "130298" 13647 "ALBEMARLE"  9.3125            0
    "130298" 13648 "ALBEMARLE"    9.25   -.00671141
    "130298" 13653 "ALBEMARLE"  9.1875  -.006756757
    "130298" 13660 "ALBEMARLE"    9.25   .006802721
    "130298" 13662 "ALBEMARLE"    9.25            0
    "130298" 13664 "ALBEMARLE"   9.375   .013513514
    "130298" 13669 "ALBEMARLE"   9.125   -.02666667
    "130298" 13671 "ALBEMARLE"  9.3125   .020547945
    "130298" 13674 "ALBEMARLE"  9.3125            0
    "130298" 13675 "ALBEMARLE"  9.3125            0
    "130298" 13676 "ALBEMARLE"  9.4375    .01342282
    "130298" 13677 "ALBEMARLE"  9.8125     .0397351
    "130298" 13678 "ALBEMARLE"  9.9375   .012738854
    "130298" 13682 "ALBEMARLE" 10.1875    .02515723
    "130298" 13683 "ALBEMARLE"  10.125  -.006134969
    "130298" 13684 "ALBEMARLE"   9.875   -.02469136
    "130298" 13688 "ALBEMARLE" 10.0625    .01898734
    "130298" 13690 "ALBEMARLE"   9.625   -.04347826
    "130298" 13691 "ALBEMARLE"      10    .03896104
    "130298" 13695 "ALBEMARLE" 10.5312       .05312
    "130298" 13696 "ALBEMARLE"      11    .04451535
    "130298" 13697 "ALBEMARLE"  11.375    .03409091
    "130298" 13698 "ALBEMARLE" 11.4375   .005494506
    "130298" 13704 "ALBEMARLE" 10.9375   -.04371585
    "130298" 13705 "ALBEMARLE" 11.0625    .01142857
    "130298" 13706 "ALBEMARLE" 11.0625            0
    "130298" 13711 "ALBEMARLE" 11.4687   .036718644
    "130298" 13717 "ALBEMARLE" 11.1562  -.027248075
    "130298" 13718 "ALBEMARLE" 11.3437   .016806798
    "130298" 13720 "ALBEMARLE"  11.125  -.019279424
    "130298" 13723 "ALBEMARLE" 11.2187   .008422472
    "130298" 13724 "ALBEMARLE"    11.5   .025074206
    "130298" 13726 "ALBEMARLE"      12    .04347826
    "130298" 13727 "ALBEMARLE" 12.0312        .0026
    "130298" 13730 "ALBEMARLE"  12.125   .007796396
    "130298" 13731 "ALBEMARLE" 11.9062   -.01804536
    "130298" 13732 "ALBEMARLE" 11.9687   .005249366
    "130298" 13734 "ALBEMARLE"  11.875  -.007828753
    "130298" 13737 "ALBEMARLE" 11.9062   .002627368
    "130298" 13738 "ALBEMARLE" 11.7812  -.010498731
    "130298" 13739 "ALBEMARLE" 11.9062   .010610125
    "130298" 13741 "ALBEMARLE" 11.7187  -.015748098
    "130298" 13746 "ALBEMARLE" 11.6875 -.0026624114
    "130298" 13751 "ALBEMARLE" 11.5937  -.008025669
    "130298" 13752 "ALBEMARLE" 11.6875   .008090601
    "130298" 13753 "ALBEMARLE"   11.75   .005347594
    "130298" 13755 "ALBEMARLE" 12.2812    .04520851
    "130298" 13759 "ALBEMARLE" 12.3125   .002548611
    "130298" 13760 "ALBEMARLE"      13    .05583756
    "130298" 13761 "ALBEMARLE" 12.7187   -.02163846
    "130298" 13765 "ALBEMARLE" 12.9062   .014742073
    "130298" 13769 "ALBEMARLE"    13.5     .0460089
    end
    format %tdDDNNCCYY date
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 2370016 observations
    Use the count() option to list more

    - it is not super evident from this data example. However I calculate daily returns by
    Code:
    gen return = (price-price[_n-1]) / price[_n-1]
    . The change in price.

    - this is data from datastream. this source " pads" the variable price, after a company (Name) is dead. thus the last valid price is repeated till the end of your chosen time period.

    - therefore I want do to the following: Delete all zero values for return from the end my sample untill the first non-zero return.

    - From reading up on some different statalist posts I thought I would try the following:
    Code:
    gen last=.
    Code:
     bys Name (date): replace last=cond(return>0 & return<. & last[_n-1]==.,_n,last[_n-1])
    Code:
    drop if last==.
    --> this drops 19500 observations. whilst when I go through it manually, for some companies( Name) it seems not be applied optimally. The zero`s in between valid datapoint are kept however that is pleasing to see.

    Where am I making a error in my thought process?
    Last edited by Wessel de Kroo; 23 Apr 2018, 09:45.

  • #2
    This is not correct. I have way more records/observations that need to be dropped if I truly want to : delete all zero values from the end of the sample until the first non-zero return.

    Comment


    • #3
      This is easier to do if you sort by reverse order of date:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str6 Code_Lefted float date str48 Name double prices float return
      "130298" 13485 "ALBEMARLE"    9.25            .
      "130298" 13487 "ALBEMARLE"   9.375   .013513514
      "130298" 13489 "ALBEMARLE"   9.125   -.02666667
      "130298" 13493 "ALBEMARLE"     9.5    .04109589
      "130298" 13494 "ALBEMARLE"  9.5625   .006578947
      "130298" 13495 "ALBEMARLE"  9.4375  -.013071896
      "130298" 13501 "ALBEMARLE"  8.3125    -.1192053
      "130298" 13502 "ALBEMARLE"   9.125    .09774436
      "130298" 13503 "ALBEMARLE"  9.0625  -.006849315
      "130298" 13506 "ALBEMARLE"  8.9375  -.013793103
      "130298" 13508 "ALBEMARLE"  8.6875   -.02797203
      "130298" 13510 "ALBEMARLE"   8.875   .021582734
      "130298" 13514 "ALBEMARLE"  9.0625    .02112676
      "130298" 13516 "ALBEMARLE"  9.5625    .05517241
      "130298" 13520 "ALBEMARLE"  9.5625            0
      "130298" 13521 "ALBEMARLE"  9.5625            0
      "130298" 13522 "ALBEMARLE"  9.5625            0
      "130298" 13523 "ALBEMARLE"  9.5625            0
      "130298" 13524 "ALBEMARLE"  9.5625            0
      end
      format date %tdDay_Mon_DD_CCYY
      
      isid Name date
      
      gsort Name -date
      by Name: gen tokeep = sum(return != 0)
      sort Name date
      replace tokeep = tokeep > 0

      Comment


      • #4
        hey,

        - reformatting and reversing the order of date. is a nice way.

        Replacing " first" into " last" in my code gave me similar results. I applied your code and it works better. see post #1 . does apply the data screen I wanted to apply in full.

        - one more thing that I thought was worth mentioning:

        Code:
        isid Name date
        gave me " variables Name date do not uniquely identify the observations". This implies I think that in my dataset some duplicates are present because it should be able to identify Name + date unique.

        So I am checking for this. thank you.

        - also:

        right now this identifies to what this " screen" applies to. by
        Code:
        drop if tokeep==0
        you do actually delete them from your data if wanted.

        Comment


        • #5
          yes after removing duplicates isid Name date works. nice for pointing that out Robert

          Comment


          • #6
            for whom it concerns:

            dropping duplicates, what I applied:

            sort Name date
            duplicates report Name date
            quietly by Name date: gen dup = cond(_N==1,0,_n)
            drop if dup>1

            Comment


            • #7
              If your data is not uniquely identified using
              Code:
              isid Name date
              then you have duplicates within Name date by-groups. The
              Code:
              gsort Name -date
              cannot possibly know in which order to put multiple returns on the same date so the order will be random within Name date by-groups. Perhaps you have another identifier that you can use instead of Name. What about this Code_Lefted variable? It's up to you to figure out what's going on here but until then, the code could generate different results at every run.

              There's a keep command you know so I was thinking more like
              Code:
              keep if tokeep

              Comment


              • #8
                That is true.

                Luckily for me my data is uniquely identified if I apply Name and Date. ( daily record per company name,thus for date).

                right yes the Code_Lefted here is actually a similar variable as to Name.

                As for the code I am running now which is in this thread in earlier posts, the same results are obtained.

                Thank you for the clarification on this topic

                Comment

                Working...
                X