Identify overlapping data sets and assign cluster variable

Franz Langmann

Join Date: Mar 2016

Posts: 15
#1

19 Sep 2016, 04:05

I have a data set and want to identify overlapping observations. I define overlapping observations as those that refer to the same country and whose time ranges overlap. I have three variables for the clustering: 'countryid', which assigns an ID to each country an observation can refer to (e.g., countryid = 1 for US data), and 'startyear' and 'endyear', which define the time span an observation refers to.

Now I want to create a new variable 'overlap' which takes the same value for all observations from the same country with overlapping time spans. E.g., if observation1 is from the US for 1991-1998 and observation2 is also from the US for 1996-2000, the variable 'overlap' would have the same value for both observations. In contrast, if the second observation were from Italy, or its time span were 2001-2003, the two observations should get different values of the cluster variable. In summary, I want to define a cluster variable for all observations from the same country with overlapping time spans.

Can anyone help me to implement this in Stata?

Many thanks!
Tags: cluster, group, Grouped data
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#2

19 Sep 2016, 05:15

Post example dataset please
Franz Langmann

Join Date: Mar 2016

Posts: 15
#3

19 Sep 2016, 05:49

Sorry, but I couldn't find out how to upload the .dta file directly. Here is a link to the file:

https://cloud.web.de/ngcloud/externa...ranzlangmann89

This is a snippet of my full data set.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

20 Sep 2016, 10:25

Let's abstract from multiple countries for the sake of simplicity, because as I understand it, once you solve the problem for one country, you repeat the solution for each subsequent country.
Suppose in the above image colored bars represent different spells. You wrote: "I want to create a new variable 'overlap' which takes the same value of all observations from the same country and with overlapping time span". As you can see, while every bar overlaps with something else, the bars don't all overlap with each other, and if you have to assign a cluster number, you will have multiple solutions (just as there are many clustering algorithms). In the picture, if you are only allowed to pick one group for a country, you may have 7 different solutions (#2 and #3 are essentially the same). If you are allowed to pick multiple clusters, then what do you pick?

Please clarify what solution you wish to find, e.g. what is the value of the 'overlap' variable for each of the colored bars in the above picture?

Best, Sergiy
Franz Langmann

Join Date: Mar 2016

Posts: 15
#5

20 Sep 2016, 17:11

To be honest, your plot raised some issues I had not thought about so far. Many thanks for this helpful illustration!

Actually, for me the overlap variable should have the same value for all bars in your picture, as all the short time spans (green, blue, red, dark red) overlap with the orange bar. Only if there were an additional bar to the right of the orange one (without any overlap) should a different value of the overlap variable be assigned to that observation. I hope this makes my issue clearer.

Last edited by Franz Langmann; 20 Sep 2016, 17:13.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

20 Sep 2016, 19:00

Franz, in other words you want to combine the spells. Nominally, the command newspell by Hannes Kröger should do it. However, when I run it with the following syntax:

Code:

newspell combine, begin(start) end(finish) id(id) stype(state) snumber(snum)

it replies that "option opt() required"
and if I run it with the option added

Code:

newspell combine, begin(start) end(finish) id(id) stype(state) snumber(snum) opt(1)

it replies that "option opt() is not allowed".

This is almost surely because I didn't read the help file.
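
Independently of newspell, the combining rule described in #5 could be sketched in plain Stata with a sort and a running maximum. This is only a minimal sketch under assumptions: the variable names countryid, startyear and endyear follow the description in #1 (the file itself may name them differently), the toy data are made up for illustration, and spells that merely share a boundary year are treated as overlapping.

Code:

```stata
* Toy data (hypothetical values, not taken from the posted file)
clear
input countryid startyear endyear
1 1991 1998
1 1996 2000
1 2001 2003
2 1996 2000
end

* Order spells by start year within country
sort countryid startyear endyear

* Running maximum of the end years seen so far within each country;
* replace works through observations in order, so maxend[_n-1] is already updated
by countryid: generate maxend = endyear if _n == 1
by countryid: replace maxend = max(endyear, maxend[_n-1]) if _n > 1

* A new cluster starts with the first spell of a country, or with a spell
* that begins after every earlier spell of that country has ended
by countryid: generate byte newcluster = (_n == 1) | (startyear > maxend[_n-1])

* The running sum of the flags gives one cluster number per combined spell
generate long overlap = sum(newcluster)

drop maxend newcluster
list, sepby(overlap)
```

With the toy data, the two US spells 1991-1998 and 1996-2000 end up in the same cluster, while the US spell 2001-2003 and the spell for country 2 each get their own value. Because the comparison uses the running maximum rather than only the previous spell's end year, a long spell that covers several short ones keeps them all in one cluster, as in the picture in #4.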

Best, Sergiy Radyakin
Hannes Kröger

Join Date: Nov 2016

Posts: 1
#7

28 Nov 2016, 08:57

Dear Stata community,

this is my first post. My name is Hannes Kröger and, as Sergiy pointed out, I have written the newspell command.
If the question has already been answered or solved in a different way, this might be irrelevant now, but I would gladly check whether newspell can do what Franz asks for and whether it produces an error (and, if so, how to fix it). However, I do not have a sample dataset that identifies spells as required (ID, spellnr, spell-type).
If you can send me such a file, we can see if the problem can be solved.

best regards

Hannes