  • Repeated time value error in Stata when declaring Panel Data

    Hi, I am working on a class project where my testable hypothesis is that "a player's position on the soccer field influences his overall rating." The data set I found was in SQL, so I had to convert it to Excel first and then to Stata. As you can see in the attached screenshot, there are variables like Player_fifa_api_id and year1, and categorical variables Attack_rate and Defense rate, coded as Medium, High, Low, etc. From the categorical variables I created player positions, such that a player with a high attack rate and a low defense rate would play at the Forward position. Because this data set shows the same player playing matches in different years, with varied attributes for every match, I created means of the variables and dropped the original variables, so that there would be no varied observations when declaring panel data. I still have the same issue, though: in the two highlighted rows (17 & 18), a player with the same player id played two matches in 2012, but in one of the matches he had a low attack rate and a medium defense rate, so he has varied observations for the same year. This data set has 65000 observations, so I want to know how I could fix this problem so that I have only one observation per player for each year, with no conflicting values within the year. I want to declare this data set as a panel with xtset player_fifa_api year1. I have already done a lot of data manipulation on this data set, and I am not sure whether I can test my hypothesis with it. Also, if anybody could suggest a good soccer data set that would make it easy to test my hypothesis (e.g. with an OLS regression), that would be helpful.

    Thanks!

  • #2
    First a disclaimer: I know nothing about soccer, or organized sports. So I will confine my remarks here to statistical approaches--some of which may make no sense in this particular context.

    First, ask yourself whether you really need to reduce to one observation per player per year. Why do you need to do that? The immediate answer is because you want to -xtset player_fifa_api year1-. But why do you want to do that? Why won't -xtset player_fifa_api- suit your purposes? The only thing you gain by declaring the time variable is the ability to use time-series operators like lag and lead, and to estimate models with autoregressive structure. If you aren't going to need to do those things, then your data set can be left alone, and you just -xtset player_fifa_api- and move on.
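
    A minimal sketch of that simpler route, assuming the panel variable is named player_fifa_api as in your -xtset- command. The names overall_rating and position are hypothetical stand-ins for your outcome and your (numeric) position variable, and the random-effects model is only an illustration, not a recommendation:
    Code:
    * Report how many player-year combinations occur more than once
    duplicates report player_fifa_api year1
    * Declare the panel variable only; repeated years within a player are then allowed
    xtset player_fifa_api
    * Hypothetical model: overall_rating and position are assumed variable names
    xtreg overall_rating i.position i.year1, re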

    Assuming that you will in fact need to include a time variable in your -xtset-, so you really do need to reduce to a single observation per player per year, there are several approaches you can take.

    1. You can review all of the variables in your data set other than player_fifa_api and year1, and make some decision rules about how to resolve conflicts among them to choose a common value you will retain. The common value might be a mean, or the maximum, or the minimum, or the median, or... If you decide on such things, the use of the numerous -egen- functions, or perhaps the -collapse- command will implement these for you readily.

    2. In connection with #1, for inherently categorical variables that cannot be assigned numeric codes in such a way that things like means, medians, max, and min make any sense (i.e. purely nominal level information), you might set up a hierarchy. I have at times done this with variables like race where the same person, on different occasions, is identified as two different races, and perhaps sometimes as multiracial. In situations like that one might agree that conflict between a specific race and multiracial would be resolved in favor of multiracial, and that conflicts between two specific races might be resolved in some particular way. This can be time-consuming and inevitably involves some degree of arbitrariness. Doing this well requires some knowledge of how the data got coded that way and what kinds of things are likely to be errors, and what is likely to be "real" in the population at hand.

    3. Another approach to eliminating multiple observations is to simply pick one at random:
    Code:
    set seed 1234 // OR YOUR FAVORITE SEED NUMBER
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()
    by player_fifa_api year1 (shuffle1 shuffle2), sort: keep if _n == 1
    drop shuffle1 shuffle2

    4. If you have information about how the data set was assembled in the first place, you might be able to rely on that. For example, if the data set combines records that originated in different sources, and if you know one source to be more reliable than the other, you might always choose to keep the observation that comes from the most reliable source.

    5. If you can refine the time on your variables, so that you know actual dates (or months, etc.) within years, then you might have reason to prefer the most recent observation, or the earliest, or something like that--again, which of these choices makes sense depends on your context.
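
    For approach 1, a rough sketch with -collapse- might look like the following. It assumes attack_rate is a string variable coded exactly "Low", "Medium", "High", and that overall_rating is numeric; both names are guesses on my part. The choice of the mean for the rating and the maximum for the attack level is an arbitrary decision rule used only for illustration:
    Code:
    * Map the (assumed) string codes to an ordered numeric scale
    gen byte attack_num = .
    replace attack_num = 1 if attack_rate == "Low"
    replace attack_num = 2 if attack_rate == "Medium"
    replace attack_num = 3 if attack_rate == "High"
    * One observation per player per year; variables not listed here are dropped
    collapse (mean) overall_rating (max) attack_num, by(player_fifa_api year1)
    * Player-year is now unique, so the time variable can be declared
    xtset player_fifa_api year1
    Note that -collapse- replaces the data in memory, so save your original data set first.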

    Added: In the future, please do not use screen shots to show data. This one was barely readable on my computer. Even when they are easily read, had I needed to try to import the data to Stata to try out some code on it, that would not have been possible. The helpful way to post example data is with the -dataex- command. Run -ssc install dataex- to install the command, and then run -help dataex- to read the simple instructions for using it. Use it whenever you post sample data so that those who want to help you can create a completely faithful replica of your example in Stata with just a simple copy/paste operation.
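
    As an illustration, something like the following would produce a postable extract. The variable names attack_rate and defense_rate are my assumptions about your data, and the range of 20 observations is arbitrary:
    Code:
    * Install once from SSC
    ssc install dataex
    * Show the first 20 observations of the relevant variables
    dataex player_fifa_api year1 attack_rate defense_rate in 1/20
    Then copy the resulting output from the Results window and paste it into your post.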
    Last edited by Clyde Schechter; 11 Apr 2017, 19:49.



    • #3
      I took the 3rd approach, and indeed the code worked. Thank you so much for your help.
