Counting observations in a certain time under a certain condition

Alexander Heinrich

Join Date: Mar 2024

Posts: 11
#1

27 Mar 2024, 14:09

Dear all,
I am currently trying to build a variable which counts how many nonviolent_protest_episodes occurred during the last 10 years when a certain condition is met (uturn==1). My data structure is country-year and looks as follows (here an example for Bolivia):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str32 country_name int year float(uturn nonviolent_protest_episodes)
"Bolivia" 2001 0 0
"Bolivia" 2002 0 0
"Bolivia" 2003 0 1
"Bolivia" 2004 0 1
"Bolivia" 2005 0 1
"Bolivia" 2006 1 0
"Bolivia" 2007 1 0
"Bolivia" 2008 1 0
"Bolivia" 2009 1 0
"Bolivia" 2010 1 0
"Bolivia" 2011 1 0
"Bolivia" 2012 1 0
"Bolivia" 2013 1 0
"Bolivia" 2014 1 0
"Bolivia" 2015 1 0
"Bolivia" 2016 1 0
"Bolivia" 2017 1 0
"Bolivia" 2018 1 0
"Bolivia" 2019 1 2
"Bolivia" 2020 1 0
"Bolivia" 2021 1 0
"Bolivia" 2022 1 0
"Bolivia" 2023 1 0
end

I tried to code something like:
egen u_turn_protest_count = total(nonviolent_protest_episodes) if uturn == 1 & inrange (year, year -9, year)
Stata returns an error saying that inrange is not found, which is to be expected since I do not specify the years. My problem is that U-Turns happen at different times and in different countries, and every time one happens I would like "to look 10 years back" and calculate how many nonviolent_protest_episodes occurred before the U-Turn started. Unfortunately, I have not found an answer in this forum or anywhere else on the internet. I use Stata 17.0.

I would greatly appreciate it if someone could help me out.

Many thanks in advance,
Alexander
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#2

27 Mar 2024, 14:15

the problem is the space between the "e" of inrange and the open parens - get rid of that and your code works; whether it gives you what you want is not entirely clear
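A minimal sketch of why the corrected command may still not do what is wanted, using a subset of the example data from #1. The point it illustrates: inrange(year, year - 9, year) compares each observation's year with itself, so it is true in every observation and does not define a moving window.

```stata
clear
input int year float(uturn nonviolent_protest_episodes)
2003 0 1
2004 0 1
2005 0 1
2006 1 0
2019 1 2
end

* The condition is true for every observation with a nonmissing year,
* so no observation fails it
count if !inrange(year, year - 9, year)

* Consequently the total runs over ALL observations with uturn == 1,
* not over the 10 years preceding each observation
egen u_turn_protest_count = total(nonviolent_protest_episodes) if uturn == 1 & inrange(year, year - 9, year)
list
```

Here the new variable holds the same value in every uturn == 1 observation and is missing elsewhere.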
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#3

27 Mar 2024, 14:15

The simplest solution is to first get the total of nonviolent protest episodes regardless of the value of uturn, and then replace the result with missing value when uturn is not 1.

Code:

rangestat (sum) nonviolent_protest_episodes, by(country_name) interval(year -9 0)
replace nonviolent_protest_episodes_sum = . if uturn != 1

-rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It is available from SSC.
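For anyone who wants to try this end to end, here is a sketch that runs the two commands on the Bolivia example from #1. It assumes -rangestat- is installed (the commented-out first line shows the usual SSC install command) and that the data have one observation per country and year.

```stata
* ssc install rangestat   // uncomment if -rangestat- is not yet installed
clear
input str32 country_name int year float(uturn nonviolent_protest_episodes)
"Bolivia" 2001 0 0
"Bolivia" 2002 0 0
"Bolivia" 2003 0 1
"Bolivia" 2004 0 1
"Bolivia" 2005 0 1
"Bolivia" 2006 1 0
"Bolivia" 2007 1 0
"Bolivia" 2008 1 0
"Bolivia" 2009 1 0
"Bolivia" 2010 1 0
"Bolivia" 2011 1 0
"Bolivia" 2012 1 0
"Bolivia" 2013 1 0
"Bolivia" 2014 1 0
"Bolivia" 2015 1 0
"Bolivia" 2016 1 0
"Bolivia" 2017 1 0
"Bolivia" 2018 1 0
"Bolivia" 2019 1 2
"Bolivia" 2020 1 0
"Bolivia" 2021 1 0
"Bolivia" 2022 1 0
"Bolivia" 2023 1 0
end

* Sum over the current year and the 9 years before it, within country
rangestat (sum) nonviolent_protest_episodes, by(country_name) interval(year -9 0)

* Keep the running total only where the condition holds
replace nonviolent_protest_episodes_sum = . if uturn != 1

list year uturn nonviolent_protest_episodes nonviolent_protest_episodes_sum, noobs
```

Note that interval(year -9 0) includes the current year, so the window covers 10 calendar years in total.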

Added: Crossed with #2.
Alexander Heinrich

Join Date: Mar 2024

Posts: 11
#4

27 Mar 2024, 14:44

Thank you very much Clyde Schechter. It worked perfectly and is indeed very simple! But what do you mean by "crossed with #2"? May I also ask a follow-up question? My data has a separate entry for each protest. For example, there were 2 different protests in Bolivia in 2019.
My dataset then lists
Country Year Protest ...
Bolivia 2019 1
Bolivia 2019 1
To get to the variable nonviolent_protest_episodes, I used the following command:
egen nonviolent_protest_episodes = total(NONVIOL), by(country_name year) // only non-violent
NONVIOL is one of many other variables in my dataset; it indicates whether the protest was non-violent or violent (dichotomous, nonviolent==1).
My problem is that when I use the code, I end up with a count of four (instead of two) for Bolivia 2019, which in the end leads to an incorrect value being calculated by the rangestat command. This is due to my data structure and the command above, which stores the yearly total of 2 in each of the two observations for 2019.
Do I have to collapse my dataset (but then I would lose information about the specific protest, e.g. whether it was successful) or is there another way to handle this issue?

Many thanks in advance.

Last edited by Alexander Heinrich; 27 Mar 2024, 15:41.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

27 Mar 2024, 16:24

By crossed with #2 I mean: my post showed up as #3 in the thread. When I first began writing my post, however, only your #1 had appeared. So, when I finally posted mine and saw there was now a #2 that I had not seen, I added to my post that it had crossed with #2. Put otherwise, #2 was written (by somebody else) while I was writing #3. The point of it is to indicate that I was not aware of #2 while writing #3.

I don't understand the rest of what you wrote in #4. In the data example shown in #1, there is only one observation for Bolivia for 2019. Now you state that there are two in your real data. Fine. If each of those observations includes 1 non-violent protest, when -egen- adds them up it will get 1 + 1 = 2, not 4. And -egen- neither adds nor removes observations from the data set: it just creates new variables in the existing observations. So there is something else going on in your data set that I am not understanding. I suggest that you post back showing all of the Bolivia observations in your -dataex-, and the relevant variables, so I can see what's going on here.
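A small sketch of the -egen- behaviour described here, using just the two Bolivia 2019 protest records mentioned in #4 (only the three variables needed are included; the real data set has more):

```stata
clear
input str32 country_name int year byte NONVIOL
"Bolivia" 2019 1
"Bolivia" 2019 1
end

* -egen- writes the group total into every observation of the group;
* it does not add or drop observations
egen nonviolent_protest_episodes = total(NONVIOL), by(country_name year)

count
list, noobs
```

Both observations end up carrying the yearly total, which matters for any later command that sums that variable across observations.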

Alexander Heinrich

Join Date: Mar 2024

Posts: 11
#6

27 Mar 2024, 16:53

Thanks for the clarification. Here is the dataex with all of the Bolivia observations and the relevant variables. I excluded years before 1996 because they are not of interest for this question.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str32 country_name int year float nonviolent_protest_episodes byte(uturn NONVIOL) double nonviolent_protest_episodes_sum
"Bolivia" 1996 0 0 0 .
"Bolivia" 1997 0 0 0 .
"Bolivia" 1998 0 0 0 .
"Bolivia" 1999 0 0 0 .
"Bolivia" 2000 0 0 0 .
"Bolivia" 2001 0 0 0 .
"Bolivia" 2002 0 0 0 .
"Bolivia" 2003 1 0 1 .
"Bolivia" 2004 1 0 1 .
"Bolivia" 2005 1 0 1 .
"Bolivia" 2006 0 1 0 3
"Bolivia" 2007 0 1 0 3
"Bolivia" 2008 0 1 0 3
"Bolivia" 2009 0 1 0 3
"Bolivia" 2010 0 1 0 3
"Bolivia" 2011 0 1 0 3
"Bolivia" 2012 0 1 0 3
"Bolivia" 2013 0 1 0 2
"Bolivia" 2014 0 1 0 1
"Bolivia" 2015 0 1 0 0
"Bolivia" 2016 0 1 0 0
"Bolivia" 2017 0 1 0 0
"Bolivia" 2018 0 1 0 0
"Bolivia" 2019 2 1 1 4
"Bolivia" 2019 2 1 1 4
"Bolivia" 2020 0 1 . 4
"Bolivia" 2021 0 1 . 4
"Bolivia" 2022 0 1 . 4
"Bolivia" 2023 0 1 . 4
end

For the last dataex I used the collapse command to simplify the dataset; that is why you could not see two observations for Bolivia for 2019. If I am interpreting it correctly, shouldn't nonviolent_protest_episodes_sum in the years 2020 to 2023 be 2 instead of 4? Also, I do not understand why nonviolent_protest_episodes_sum is 4 in 2019...

Last edited by Alexander Heinrich; 27 Mar 2024, 17:13.

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#7

27 Mar 2024, 17:19

I see what you mean now. It seems to me that you have some choices. One is to -collapse- your data set to one observation per country per year. If there is important data on other variables that differs across observations with the same country and year, data that you cannot throw away, then this is not a good option.

Another possibility would be to use NONVIOL rather than nonviolent_protest_episodes in the -rangestat- command. That will give a correct total.

Code:

rangestat (sum) NONVIOL, by(country_name) interval(year -9 0)
replace NONVIOL_sum = . if uturn != 1
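To make the difference concrete, here is a sketch on a condensed, protest-level version of the Bolivia data (only a few years are kept, so the rows are illustrative rather than the full data set). It assumes -rangestat- is installed from SSC. Summing the per-year total counts the two 2019 protests twice, while summing NONVIOL counts each protest once.

```stata
clear
input str32 country_name int year byte(uturn NONVIOL)
"Bolivia" 2005 0 1
"Bolivia" 2018 1 0
"Bolivia" 2019 1 1
"Bolivia" 2019 1 1
"Bolivia" 2020 1 0
end

* Per-year total, repeated in each of the two 2019 observations
egen nonviolent_protest_episodes = total(NONVIOL), by(country_name year)

* Rolling 10-year sums of both variables, for comparison
rangestat (sum) nonviolent_protest_episodes NONVIOL, by(country_name) interval(year -9 0)

replace nonviolent_protest_episodes_sum = . if uturn != 1
replace NONVIOL_sum = . if uturn != 1

list year uturn NONVIOL nonviolent_protest_episodes nonviolent_protest_episodes_sum NONVIOL_sum, noobs
```

In the 2019 and 2020 rows the first sum is 2 + 2 = 4 and the second is 1 + 1 = 2.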
Alexander Heinrich

Join Date: Mar 2024

Posts: 11
#8

27 Mar 2024, 17:21

Alright, thank you so much!
Alexander Heinrich

Join Date: Mar 2024

Posts: 11
#9

29 Mar 2024, 07:37

Clyde Schechter I'm sorry, but I have another question. I used the rangestat command to mark when the most recent protest occurred in the last 10 years when uturn == 1. The reason is that I want to calculate how many years lie between the most recent NONVIOL protest and uturn. Could you tell me how to get Stata to calculate the difference in years between each uturn and the last NONVIOL protest if it is within the ten-year threshold? This is what I have already coded:
rangestat (first) NONVIOL, by(country_name) interval(year -9 0)
replace NONVIOL_first = . if uturn != 1
I have already tried some commands, but they did not work.

Last edited by Alexander Heinrich; 29 Mar 2024, 08:10.

Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#10

29 Mar 2024, 08:38

This is a slightly different situation from your original request, and while it can be done with -rangestat-, in this case it is actually easier to do with native Stata commands:

Code:

by country_name (year), sort: gen last_nonviol = cond(NONVIOL==1, year, 0) if _n == 1
by country_name (year): replace last_nonviol = ///
    cond((NONVIOL==1) & (year > last_nonviol[_n-1]), year, last_nonviol[_n-1]) if _n > 1
gen years_since_last_nonviol = year - last_nonviol if uturn == 1
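A possible variant, offered as a sketch rather than as the method above: it keeps the running "most recent protest year" missing until a first protest has been seen (instead of starting it at 0) and returns a result only when the gap is at most 9 years, matching the interval(year -9 0) window used earlier in the thread. The variable names last_nv and years_since_nv and the "Country B" rows are made up for illustration.

```stata
clear
input str32 country_name int year byte(uturn NONVIOL)
"Bolivia"   2003 0 1
"Bolivia"   2004 0 1
"Bolivia"   2005 0 1
"Bolivia"   2006 1 0
"Bolivia"   2014 1 0
"Bolivia"   2015 1 0
"Bolivia"   2019 1 1
"Bolivia"   2023 1 0
"Country B" 2006 1 0
"Country B" 2007 1 0
end

* Year of the protest in protest years, missing otherwise
by country_name (year), sort: gen last_nv = year if NONVIOL == 1

* Carry the most recent protest year forward within country
by country_name (year): replace last_nv = last_nv[_n-1] if missing(last_nv)

* Gap in years, only under uturn and only inside the 10-year window;
* a missing last_nv makes the comparison false, so the result stays missing
gen years_since_nv = year - last_nv if uturn == 1 & (year - last_nv) <= 9

list, noobs sepby(country_name)
```

With a starting value of 0, a country with no earlier protest would get a gap equal to the calendar year itself; leaving the value missing avoids that.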

Alexander Heinrich

Join Date: Mar 2024

Posts: 11
#11

29 Mar 2024, 08:44

This makes a lot of sense, thank you so much again!