Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting string variables based on variables that are string and values

    Please help me out. I would like to eventually create a new variable which is a proportion, called Raw∆k;t = Number of institutions buying k;t / (Number of institutions buying k;t + Number of institutions selling k;t ) where k is a ticker that is bought or sold during quarter t.
    The variables ticker, k and qdate, t are string variables. I have an ID variable that identifies each institution, order and q_status which are string variables. The observation in the variable q_status has either + or - or 0, as a position at the end of the quarter, meaning the status increased, decreased or remained the same. The number of institutions is essentially, the number of +, for each stock k during the quarter t. My problem is to identify a code that counts the number of +, i.e the q_status variable and then also counts the number of -, for each ticker during the quarter and place that value in a column within stata.

    What i tried to do is run the line of code below which counts the +, by qdate (which is the quarter but does not count for each stock k). My challenge is to count based on stock k and qdate t and place the result in a column. I trust that if i count the + and then - , i should be able to generate a code to add them and obtain the ratio. The intenton is to generate the ratio.

    Code:
    by qdate:count if q_status == "+"
    my sample data is as below

    Code:
    * Example generated by -dataex-. To install :ssc install dataex
    * dataex investor_id ticker order qdate q_status
    clear
    input int investor_id str3 ticker str1 order str6 qdate str1 q_status
      . ""    ""  ""       "" 
      . ""    ""  ""       "" 
     40 "FBK" "S" "2007q1" "-"
    108 "FBK" "B" "2007q1" "-"
     95 "FBK" "B" "2007q1" "+"
     72 "FBK" "B" "2007q1" "+"
    120 "IL " "B" "2007q1" "+"
     44 "IL " "B" "2007q1" "+"
     45 "IL " "S" "2007q1" "-"
     72 "IL " "S" "2007q1" "-"
    108 "IL " "B" "2007q1" "-"
     14 "IL " "B" "2007q1" "+"
    118 "NC"  "S" "2007q1" "-"
     14 "NC"  "B" "2007q1" "+"
    108 "NC"  "S" "2007q1" "-"
     41 "NMG" "B" "2007q1" "+"
     40 "NMG" "S" "2007q1" "-"
    108 "NMG" "B" "2007q1" "-"
     14 "NMG" "B" "2007q1" "+"
     72 "NMG" "B" "2007q1" "0"
     87 "NMG" "S" "2007q1" "-"
    end


  • #2
    Code:
    drop if missing(ticker, qdate, q_status)
    isid ticker qdate investor
    by ticker (qdate), sort: egen n_buying = total(q_status == "+")
    by ticker (qdate): egen n_selling = total(q_status == "-")
    gen ratio = n_buying/(n_buying + n_selling)
    As an aside, there is very little you can do with a date stored as a string variable. This particular task can be accomplished, but you won't get very far working with the data set this way. So I strongly urge you to convert that to a Stata internal format numeric date variable:
    Code:
    gen _qdate = quarterly(qdate, "YQ"), after(qdate)
    assert missing(_qdate) == missing(qdate)
    drop qdate
    rename _qdate qdate
    format qdate %tq

    Comment


    • #3
      Many thanks Prof Schechter. I note however that when i apply the 2 lines of the code
      Code:
       
       by ticker (qdate), sort: egen n_buying = total(q_status == "+")  by ticker (qdate): egen n_selling = total(q_status == "-")
      the output is a higher number of n-buying 206, and n-selling ie 197 than the actual number of 122 institutions that i have in the actual sample. At some point the n_buying for qdate quarter 1 to quarter 4 remains constant, like 157 while n_selling is 153. I will try to figure out why, possibly there is cumulation and will revert.
      Many thanks

      Comment


      • #4
        Duncan Mphande try this code, which might solve for situations where there are multiple observations for the same institution for a ticker in a particular quarter:

        Code:
        egen tag = tag(ticker qdate investor_id q_status)
        
        bys ticker (qdate): egen n_buying = total((q_status == "+")*tag)
        bys ticker (qdate): egen n_selling = total((q_status == "-")*tag)
        gen ratio = n_buying/(n_buying + n_selling)
        drop tag

        Comment


        • #5
          In the code offered in #2, the first command checks for the possibility of multiple observations for the same institution for a ticker in a particular quarter. So unless O.P. chose to ignore the error and just blunded on past it, that shouldn't be the problem he is encountering.

          If that is what happened, while the code in #4 will, in a sense, solve the problem, it risks sweeping a bigger problem under the rug. For the setting and data description given in #1, the data clearly should satisfy the condition that any investor appears only once in the data for any combination of sticker and quarter. If that is not the case, then the data are problematic. In particular, the same investor might be listed as both net buyer and net seller of the same ticker at the same time. And even if that never happens, the question remains why these extra observations are there. They represent an error in data management up to that point. It should be fixed, not swept under the rug in some rush to get incorrect results. That is why the -assert- command was placed at the head of the code: to verify that the data are suitable for the proposed analysis.

          If O.P. did ignore an error message that the -assert- command gave him, and blunder past it, I would simply remind him never to do that and urge him to thank Stata for warning him about his incorrect data, and go about repairing it. Otherwise, I would ask him to post an example of the data that is giving him troubling results so that troubleshooting can be done.

          Comment


          • #6
            Ah, I now realise I didn't carefully read the original post. Yes, that description implies that there should not be more than one observation for an institution-ticker-quarter, since "[t]he observation in the variable q_status has either + or - or 0, as a position at the end of the quarter".

            In that case, I completely agree with Clyde in #5, you should figure out why this does not happen (if it does not), rather than using my code in #4.

            Comment


            • #7
              Clyde Schechter Prof, in your #5 note, it is true that any investor ought to appear only once for any ticker and quarter combination. I looked into the data and could not find any situation where is listed both as a net buyer or net seller. What i find, is that if an investor trades in more than one ticker, which is not unusual, even during the same quarter, there will be more observations than number of investors. That is the likely explanation. That said, will have another go on data management. Many thanks.

              Comment


              • #8
                Prof Clyde Schechter in your #2 note, you advised that i needed to change the format of the variable qdate (which i had said was string variable),and when i followed your well guided line of code
                Code:
                gen _qdate = quarterly(qdate, "YQ"), after(qdate)
                My result was an error (type mismatch-attempting to combine a string and numeric subexpression in a logically impossible way). I have tried to get round this but have been challenged. what i realized is that the variable (qdate) is actually a float and not a string. My apologies for this error, and i guess i got the help based on the (incorrect) information i had submitted.
                Please help me correct this by converting the float variable to stata numeric date variable.

                Will greatly appreciate

                Comment


                • #9
                  Well, I got the idea that qdate is a string because in the data example you gave, that's what it was. So if you now tell me that it is a float, I can only surmise that your data example did not come from the actual data set you are working with. With qdate being a float, there are two possibilities, and without example data from the real data set, I cannot tell which applies.

                  1. qdate is already a proper Stata internal format quarterly date variable. The best way to tell if that is true is to run -format qdate %tq- and then -browse qdate-. If what you see in the browser looks like 2007q3 and the like, then you do in fact have it right and nothing further need be done.

                  2. If the results of 1. do not look right (in the sense that the years are way off from what they should be), then qdate is some other numeric representation that likely will get you into trouble if you try to do date calculations or chronological order with it as is. The steps needed to fix that will depend on the data themselves, so I would urge you to post a new -dataex- example from the real data set. You needn't show anything other than qdate itself: -dataex qdate- for this purpose.

                  Comment


                  • #10
                    Prof Clyde Schechter many thanks. In #9, i can confirm that i ran the data and the variable qdate was formatted to %tq (at least in my do file). I am not sure how when i attempted to post it using -dataex the format changed, but as it is now it is a proper Stata internal format. Thanks. I will proceed with my analysis and if am unable to make progress i will revert on the forum. I have found it so helpful. Many thanks .

                    Comment

                    Working...
                    X