standardization of a variable in a group

Duncan Mphande

Join Date: Jul 2022
Posts: 22

standardization of a variable in a group

13 Dec 2022, 11:43

Please help me standardize a variable ( i want a value MINUS mean /SD) a variable named rawbuy which i generate based on grouping data into quarters.

Here is my code that i used to generate qdate

Code:

gen date3 =date(date,"DMY")
format %td date3
format %tddd-Mon-YY date3
gen qdate= qofd(date3)
format qdate %tq

and here is the code that i use to generate the other variables such as n_buying, n_selling Qb and Qs and finally rawbuy. For e.g. n_buying, is count of +, for variable ticker during quarter, variable qdate

Code:

egen id3= group(ticker qdate)
sort id3
by id3: egen n_buying = total(q_status == "+")
by id3: egen n_selling = total(q_status == "-")

i then aggregate for each quarter, the total for all n_buying, for different tickers. same for n_seling. I then compute a ratio [ n_buying/ (n_buying + n_selling)], whihc i call rawbuy. I would like to standardize this rawbuy so that i can directly compare coefficients in my investigation

Code:

sort id3
by id3 : egen Qb = total(n_buying)
by id3 : egen Qs = total(n_selling)
by id3 : gen rawbuy = Qb/(Qb +Qs)

my sample data looks like this

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int date str3 ticker str6 qdate str1 q_status int id3 byte(n_buying n_selling Qb Qs) double rawbuy
17261 "fgm" "2007q2" "+"  27 4 1 20  5       .8
17261 "fgm" "2007q2" "+"  27 4 1 20  5       .8
17261 "lvi" "2007q2" "+"  84 2 0  4  0        1
17266 "nih" "2007q2" "+" 342 2 0  4  0        1
17283 "nmp" "2007q2" "-" 186 0 1  0  1        0
17289 "nih" "2007q2" "+" 342 2 0  4  0        1
17290 "fgm" "2007q2" "-"  27 4 1 20  5       .8
17304 "okm" "2007q2" "-" 393 0 2  0  4        0
17309 "pla" "2007q2" "+" 439 2 1  6  3 .6666667
17329 "okm" "2007q2" "-" 393 0 2  0  4        0
17332 "fgm" "2007q2" "+"  27 4 1 20  5       .8
17342 "lvi" "2007q2" "+"  84 2 0  4  0        1
17342 "nsk" "2007q2" "-" 239 0 1  0  1        0
17344 "nid" "2007q2" "+" 293 1 0  1  0        1
17352 "nsk" "2007q3" "-" 240 1 3  4 12      .25
17358 "nsk" "2007q3" "-" 240 1 3  4 12      .25
17371 "san" "2007q3" "+" 568 1 1  2  2       .5
17378 "nih" "2007q3" "+" 343 3 2 15 10       .6
17381 "fgm" "2007q3" "-"  28 1 5  6 30 .1666667
17381 "okm" "2007q3" "+" 394 2 0  4  0        1
17381 "okm" "2007q3" "+" 394 2 0  4  0        1
17388 "nih" "2007q3" "+" 343 3 2 15 10       .6
17392 "fgm" "2007q3" "+"  28 1 5  6 30 .1666667
17392 "fgm" "2007q3" "-"  28 1 5  6 30 .1666667
17398 "lvi" "2007q3" "-"  85 0 1  0  1        0
17398 "pla" "2007q3" "-" 440 1 2  3  6 .3333333
17406 "nsk" "2007q3" "-" 240 1 3  4 12      .25
17406 "nih" "2007q3" "+" 343 3 2 15 10       .6
17422 "fgm" "2007q3" "-"  28 1 5  6 30 .1666667
17436 "fgm" "2007q3" "-"  28 1 5  6 30 .1666667
end
format %tddd-Mon-YY date

Tags: data, regression, syntax

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

13 Dec 2022, 12:03

Code:

by id3, sort: egen rawbuy_standardized = std(rawbuy)

will give you a new variable that is rawbuy standardized to the mean and sd of all observations for the given id3.

Do bear in mind that the notion that you can validly compare effects by looking at standardized coefficients is mostly illusory. And what usually results from using standardized variables is obfuscation rather than clarification. I think that will be the case here. A variable defined the way rawbuy is will be quite intuitive and understandable to any numerate audience: it is the proportion of volume that is buy rather than sell. Standardizing it makes it completely uninterpretable, since only a person who knows the mean and standard deviation of this value for every id3 really can understand this new variable.

Added:

I do not understand what you have done with the variables n_buying, n_selling, Qb, and Qs. First of all, the values you show do not seem to correspond with the results one would get by applying the code you show. For example, looking at id3 27, there are precisely 3 observations with q_status == "+", yet your data shows n_buying == 4. Something is amiss here. Similarly Qb should be according to your code, the total of the values of n_buying for that id3: there are four observations and each has n_buying == 4 (by your data example, not by your code), and yet you show Qb = 20, not 16.

Moreover, since n_buying and n_selling are created so as to be constant within id3, the calculations of Qb and Qs are tantamount to just multiplying n_buying and n_selling by the number of observations for id3. But then when you calculate buy_ratio, since both Qb and Qs are the same multiple of n_buying and n_selling, you would get the exact same result by just calculating buy_ratio = n_buying/(n_buying + n_selling). So I'm wondering what the point of Qb and Qs are--maybe they are used later in a way where their common factor does not cancel out?

Last edited by Clyde Schechter; 13 Dec 2022, 12:21.
2 likes
Comment
Duncan Mphande

Join Date: Jul 2022

Posts: 22
#3

14 Dec 2022, 09:43

Many thanks Clyde Schechter for the code.

The first overall comment is insightful for me. It got me thinking whether what I am doing is exactly what the model I am using actually did. I have had to look at the data again and again following execution of each code. That the raw buy is for numerate audience is likely but that standardization will make it completely uninterpretable has got me thinking.

For your comments in #2; the observation in the first paragraph ( under Added) is correct. this is because i extracted the next 30 observations from my data set, since the first 30 observations were all showing transactions done in 1 quarter. Given that i was emphasising that aggregation or grouping is on quarterly basis, i wanted the sample data to show at least 2 different quarters.

The comment on the last paragraph regarding the buy ratio being exactly generated by either ratio = n_buying/(n_buying + n_selling) OR Qb = Qb/(Qb +Qs) has jolted me into looking at whether my counts have been done correctly. They seem not (since the are based both on id3), but my intention by generating Qb and Qs is to aggregate all counts of either n_buying or n_selling during the quarter, while by generating n_buying and n_selling is aimed at showing individual count for each investor in particular ticker during the quarter. Perhaps, for the aggregation of the quarter, i need a different id. I will look into it by checking individual counts first and then the quarter aggregates.

I sincerely thank you for your detailed insights once again.
Comment

Announcement

standardization of a variable in a group

Comment

Comment