
  • Identify overlapping data sets and assign cluster variable

    I have a data set and want to identify overlapping observations. I define observations as overlapping when they refer to the same country and their time ranges overlap. I have three variables for the clustering: 'countryid', which assigns an ID to each country an observation may refer to (e.g., countryid = 1 for US data), and 'startyear' and 'endyear', which define the time span an observation refers to.

    Now I want to create a new variable 'overlap' that takes the same value for all observations from the same country with overlapping time spans. E.g., if observation 1 is from the US for 1991-1998 and observation 2 is also from the US for 1996-2000, 'overlap' would have the same value for both observations. In contrast, if the country for the second observation were Italy, or its time span were 2001-2003, the two values of 'overlap' should differ. In summary, I want to define a cluster variable for all observations from the same country with overlapping time spans.

    Can anyone help me implement this in Stata?

    Many thanks!

  • #2
    Please post an example dataset.


    • #3
      Sorry, I couldn't figure out how to upload the .dta file directly. Here is a link to the file:

      https://cloud.web.de/ngcloud/externa...ranzlangmann89

      This is a snippet of my full data set.


      • #4
        [Image: overlap.png]

        Let's abstract from multiple countries for the sake of simplicity because, as I understand it, once you solve the problem for one country, you repeat the solution for each of the others.
        Suppose the colored bars in the above image represent different spells. You wrote: "I want to create a new variable 'overlap' which takes the same value of all observations from the same country and with overlapping time span". As you can see, while every bar overlaps with something else, the bars don't all overlap with each other, so if you have to assign a cluster number, you will have multiple solutions (just as there are many clustering algorithms). In the picture, if you are only allowed to pick one group for a country, you may have 7 different solutions (#2 and #3 are essentially the same). If you are allowed to pick multiple clusters, then which do you pick?
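
        To make the point concrete, here is a minimal sketch with made-up data (the three spells and the variable names id, startyear and endyear are assumptions for illustration, not taken from the posted file). It tests every pair of spells with the usual interval condition, under which two spells overlap when each starts no later than the other ends:

```stata
* Hypothetical spells for a single country (illustration only)
clear
input id startyear endyear
1 1990 2000
2 1991 1993
3 1995 1997
end

* Pair every spell with every other spell
tempfile spells
save `spells'
rename (id startyear endyear) (id2 startyear2 endyear2)
cross using `spells'
keep if id < id2

* Two spells overlap if each one starts no later than the other ends
gen byte overlaps = (startyear <= endyear2) & (startyear2 <= endyear)
sort id id2
list id id2 overlaps
```

        Spells 2 and 3 each overlap with spell 1 but not with each other, so pairwise overlap alone does not pin down a unique grouping. Run the block from a do-file in one go, since the temporary file name is a local macro.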

        Please clarify what solution you wish to find, e.g., what is the value of the 'overlap' variable for each of the colored bars in the above picture?

        Best, Sergiy


        • #5
          To be honest, your plot raised some issues I had not considered so far. Many thanks for this helpful illustration!

          Actually, for me the overlap variable should have the same value for all bars in your picture, because all the short time spans (green, blue, red, dark red) overlap with the orange bar. Only if there were an additional bar to the right of the orange one (without overlap) should a different value of the overlap variable be assigned to that observation. I hope this makes my issue clearer.
          Last edited by Franz Langmann; 20 Sep 2016, 17:13.


          • #6
            Franz, in other words, you want to combine the spells. Nominally, the command newspell by Hannes Kröger should do it. However, when I run it with the following syntax:

            Code:
            newspell combine, begin(start) end(finish) id(id) stype(state) snumber(snum)
            it replies that "option opt() required"
            and if I run it with the option added
            Code:
            newspell combine, begin(start) end(finish) id(id) stype(state) snumber(snum) opt(1)
            it replies that "option opt() is not allowed".

            This is almost surely because I didn't read the help file.
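
            Independently of newspell, the combining step can be sketched by hand. The block below is only a sketch under stated assumptions: the data are made up, the end-year variable is assumed to be named endyear, there are no missing years, and spells that merely share a year (e.g. 1991-1998 and 1998-2000) are treated as overlapping. Within each country it sorts by start year, tracks the latest end year seen so far, and starts a new cluster whenever a spell begins after that running maximum:

```stata
* Hypothetical data (illustration only)
clear
input countryid startyear endyear
1 1991 1998
1 1996 2000
1 2001 2003
2 1996 2000
end

sort countryid startyear endyear

* Running maximum of endyear within country;
* replace works row by row, so runmax[_n-1] is already updated
by countryid: gen runmax = endyear
by countryid: replace runmax = max(runmax, runmax[_n-1]) if _n > 1

* A new cluster starts at the first spell of a country, or when a
* spell begins after every earlier spell of that country has ended
by countryid: gen byte newcluster = (_n == 1) | (startyear > runmax[_n-1])

* Cumulative count of cluster starts gives the cluster identifier
gen overlap = sum(newcluster)
list, sepby(countryid)
```

            With these made-up rows, the first two observations share one value of overlap, while the 2001-2003 spell and the second country each get their own. Because of the running maximum, a long spell keeps all the short spells it covers in the same cluster, which matches what Franz describes in #5.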

            Best, Sergiy Radyakin


            • #7
              Dear Stata community,

              This is my first post. My name is Hannes Kröger and, as Sergiy pointed out, I have written the newspell command.
              If the question has already been answered or solved in a different way, this might be irrelevant now, but I would gladly check whether newspell can do what Franz asks for and, if it produces an error, how to fix it. However, I do not have a sample dataset that identifies spells as required (ID, spellnr, spell-type).
              If you can send me such a file, we can see whether the problem can be solved.

              best regards

              Hannes
