The replace function: using several conditions, where one of the conditions start once another condition is met.

RJ Bremer

Join Date: May 2015

Posts: 7
#1

The replace function: using several conditions, where one of the conditions start once another condition is met.

23 May 2015, 12:23

Hi all,

Currently, I'm working on my thesis and I want to run some regressions on 24 Global Datastream Price and Total Return Indices.
Normally you would say that a return is invalid if the value is exactly equal to zero (i.e. say the market is closed on 25th and 26th of December then the index would not change).

But because some constituents are traded in a different currency, and therefore changes in currency rates will have an effect on the value of the index even if that market is closed on that day.
To mitigate this problem I downloaded information on the turnover of the indices - if a financial market was closed, the turnover would be missing.

Once again, normally you would say that the following function would do the trick:

Code:

(1): replace RI_RT =. if PI_RT ==0 | TURNOVERBYVOLUME ==.

But obviously I would not post this message if this would work.

The problem is that information on the turnover is available since 1986, while the indexation starts in 1973 (for most countries).
Hence, using the formula above would delete all information up to the first valid result of the turnover variable.

The data is sorted on date, so I tried to come up with a formula which would use

Code:

(2): replace RI_RT =. if PI_RT ==0

if there is no available information on turnover and the whole formula (1) once there is data available on the turnover.

I started out with generating a new variable "seq_num", which basically shows the row number and next I replace this value with "." if there was turnover data missing:

Code:

generate seq_num = _n replace seq_num =. if missing(TURNOVERBYVOLUME)

I would say that you have to build an if function that states that once the first valid observation is "found", (1) will be used and (2) otherwise.
Because I have to do this for 24 countries, I would prefer to find this transition automatically.

I just started using Stata two days ago, so I don't have any clue yet how to build something like that.
But I really hope that maybe one of you guys could help me.

With kind regards,

RJ Bremer
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#2

23 May 2015, 12:31

It doesn't seem like it should be very complicated, but I cannot figure out from your explanation how you would know, from the data available, which observations are eligible for your criterion (1) and which aren't. I am left with the impression it has something to do with dates, but you don't describe any date variables in your data, and your description of the chronology of events is inexact. Maybe if you show us a sample of the relevant data and explain the situation a little more clearly someone can help you. I'm willing to bet the solution is just a 1 or 2 -liner.
Comment
Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#3

23 May 2015, 13:15

I believe the problem is that you rely on -TURNOVERBYVOLUME- being equal to missing (.) to identify closed financial markets, but it's also missing (.) in years for which that variable is not available (1973-1985). You can identify the years for which that variable is not available (1973-1985). Say

Code:

replace TURNOVERBYVOLUME = .a if year <=1985

(I would prefer the opposite: system missing (.) for missing years and extended missing (.a) for closed markets, but it's your call.)

Then

Code:

replace RI_RT =. if PI_RT ==0 | TURNOVERBYVOLUME ==.

should work. But you could also do something more direct like

Code:

replace RI_RT =. if PI_RT ==0 | (TURNOVERBYVOLUME ==. & year >= 1986)

Or just generate an indicator variable (0s and 1s) based on -TURNOVERBYVOLUME- and set missing years as system missing. This variable can be 0 if market closed, 1 if market open, and . if there is no information. Then work with that.

See -help missing-.

Last edited by Roberto Ferrer; 23 May 2015, 13:20.

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
RJ Bremer

Join Date: May 2015

Posts: 7
#4

23 May 2015, 13:15

Dear Clyde, thank you for your reply. I see that I forgot some elements in the story indeed.

First a more detailed description of the variables in the dataset:

Code:

date: date with formattype 23/05/2015, and only weekdays. PI: value price index - an index consisting approximately of all stocks of a country RI: value total price index, which basically is a price index but where capital gains (dividends) are reinvested. TURNOVERBYVOLUME: turnover by volume PI_RT: simple return PI RI_RT: simple return RI

The problem:

I want to use the RI in my regression, but because of the reinvestments, returns are (almost) never equal to zero.
Hence, I use PI and TURNOVERBYVOLUME to find out if RT should be zero (because the market was closed).

To give you an example of which observations are eligible for criterion 1:

Date PI TURNOVERBYVOLUME RI PI_RT RI_RT

8/14/1995 765.81 . 2005.83 . .

8/15/1995 765.80 . 2006.09 -.0000131 .

Both days the market was closed (i.e. there was no turnover), but because of some valuta changes the PI changed from 765.81 tot 765.80.
Hence, in this case criterion 2 won't give us the correct value.

On the other hand criterion 1 gives the correct value of RI_RT, namely: "."

BUT turnover was not available before a certain date (this date is different for each country).

So to give you an example of which observations are eligible for criterion 2:

Date PI TURNOVERBYVOLUME RI PI_RT RI_RT

12/31/1984 240.35 . 449.48 . .

1/1/1995 240.64 . 450.10 .0012066 .

The market is only closed on the first day of the year. Hence, the value of RI_RT should be given.
But using criterion 1 we won't find a value; turnover is missing. And this holds for every value up to the first available observation of turnover.

The formula should use criterion 2 up to that first available observation. After that point criterion 1 should be used.

I hope that everything is clear now.

Last edited by RJ Bremer; 23 May 2015, 13:19.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

23 May 2015, 14:48

OK. Now I think I understand it. One stipulation: I take "date: date with formattype 23/05/2015" to mean that date is a numeric variable with a %tdMDY format. If it is just a string variable that looks like that, you need to first convert it to numeric with the date() function. You state that the date at which turnoverbyvolume first takes a non-missing value varies by country, but you do not mention a country variable. I won't even assume you have one explicitly in your data set. I am guessing, however, that what you have here is not a single time series, but is panel data, and that there is a panel identifier which, for lack of any explicit information about it, I will call firm_id. You have all of the variables you mentioned in addition to these.

Code:

by firm_id date, sort: gen tbv_count = sum(!missing(TURNOVERBYVOLUME)) replace RI_RT =. if PI_RT ==0 | (TURNOVERBYVOLUME ==. & tbv_count > 0) replace RI_RT =. if PI_RT ==0 & tbv_count == 0

Notes:
1. The first command generates a running count of non-missing values of TURNOVERBYVOLUME, so you get tbv_count = 0 during the initial period of non-availability, and > 0 thereafter.
2. The second command applies rule 1 if applicable.
3. The third command applies rule 2 if applicable.
4. If this really is not panel data, just a single time series, then omit firm_id from the by-prefix of the first command.

The second and third lines could be combined into a single command by conjoining the -if- qualifiers, but the resulting code would be less transparent, and I doubt the time savings in execution would be noticeable.
Comment
RJ Bremer

Join Date: May 2015

Posts: 7
#6

23 May 2015, 16:24

Thank you very much Roberto and Clyde!
I learned some valuable lessons of both of your reactions. However, in the end I think I will use the code of Clyde, since it is automated.

Clyde Schechter: You're totally right! I'm not that much of a programmer in my daily life, hence the lack in terminology, but I'm learning. I already converted it to a numeric variable using the date() function.
Furthermore, it is indeed panel data but I did not yet added the panel identifier. Since I will add this panel identifier in the near future I really want to thank you for the completeness of your answer. Also, the structure of the code is clear; I'll stick to the transparant version.

Have a nice day!

Kind regards,

RJ Bremer
Comment

Date	PI	TURNOVERBYVOLUME	RI	PI_RT	RI_RT
8/14/1995	765.81	.	2005.83	.	.
8/15/1995	765.80	.	2006.09	-.0000131	.

Date	PI	TURNOVERBYVOLUME	RI	PI_RT	RI_RT
12/31/1984	240.35	.	449.48	.	.
1/1/1995	240.64	.	450.10	.0012066	.

Announcement

The replace function: using several conditions, where one of the conditions start once another condition is met.

Comment

Comment

Comment

Comment

Comment