Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dummy variable across multiple variables

    Good afternoon everyone,

    I have a data set that lists lineups for soccer teams (variable names Player1 - Player11). Depending on where they are entered in the lineup corresponds to which variable they correspond to for that particular game. I am trying to run a fixed effects model for players and want to create a dummy variable for each of them. Unfortunately I only know how to do this for one specific variable (each player in Player 1 gets a dummy variable) rather than across all 11 (i.e if the same player appears as Player 2 and Player 5 in different games that still corresponds to the same dummy variable). I've looked online and in the manuals and found no way to do this easily.

    Is there an easy way to code this in Stata or will I have to create some sort of loop to get around this?

    Thanks for the help, it's much appreciated.

    Guy

  • #2
    I find your description unclear, borderline confusing. I can't figure out what the data you have actually looks like, nor what the indciator ("dummy") variable you want to calculate represents or how it relates to the data you have. Maybe somebody else grasps it and can respond directly. But if you don't get a quick response from somebody else, I suggest you post a short example of your data along with a clearer explanation of how you would calculate this indicator variable by hand, and showing what its values would actually be for the example data you show.

    To assure that the data example you show can be easily and faithfully converted into a Stata data set by whoever wants to help you, you should use the -dataex- command to post it. You get the -dataex- command by running -ssc install dataex-. The simple directions for using it are in -help dataex-.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      I find your description unclear, borderline confusing. I can't figure out what the data you have actually looks like, nor what the indciator ("dummy") variable you want to calculate represents or how it relates to the data you have. Maybe somebody else grasps it and can respond directly. But if you don't get a quick response from somebody else, I suggest you post a short example of your data along with a clearer explanation of how you would calculate this indicator variable by hand, and showing what its values would actually be for the example data you show.

      To assure that the data example you show can be easily and faithfully converted into a Stata data set by whoever wants to help you, you should use the -dataex- command to post it. You get the -dataex- command by running -ssc install dataex-. The simple directions for using it are in -help dataex-.
      No worries, I'll clarify some more. I don't have Stata at hand right now but I'll try to make it clear with Excel at the moment.

      Click image for larger version

Name:	StataData.png
Views:	1
Size:	114.1 KB
ID:	1362403


      Columns titled 1-11 contain the names of the starting eleven players for a soccer game. Each row contains a different game. Depending on how a game is recorded players can be in any of these 11 columns for a specific game.

      Taking David Atanga as an example, I want a dummy variable which is 0 if he is not in the game (not in columns 1-11) and 1 if he is in any of columns 1-11. I will then repeat this for all players in the dataset. David Atanga is in column 7 for that game but if he appears in 5 for another game I want this to be captured with the same dummy.

      So instead of creating a dummy over one variable I want to create them over eleven. Does this help clarify the issue?

      Comment


      • #4
        Well, I get the general idea now. Unfortunately, I can't give you a specific solution to your problem because the details depend on how this data gets imported into Stata. That's why we always want to see a data example gotten by using -dataex- with a real Stata data set.

        That said, I'm going to make assumptions about what your Stata data set will ultimately look like. I see that in addition to player names you have some numeric variables. I have no idea what they mean but will assume that they are irrelevant to the matter at hand. I will also assume that they will be imported as separate variables from the names. So, after running -import excel- and doing a little data cleaning, I assume you will have the following variables: game_number player1 number1 player2 number2 ...player11 number11. The variables player* will contain the names as string variables. The number* variables will contain those other numbers. And the game_number variable will correspond to the row in your spreadsheet.

        So there are a couple of difficulties remaining. You need an indicator variable for each player. You can't use the player's name as the variable name because embedded blanks are not permitted in variable names. Moreover, variable names are limited to 32 characters, and I can imagine that a player's name could be longer than that. While one could get around this by just creating indicators called indicator1, indicator2,... for how many different players there happen to be, you would need some crosswalk to tell you which indicator tracked which player. (OK, that could be built into the variable labels, but it's really inconvenient.) So I'm going to build the indicator names out of the player names, with the understanding that blanks will be replaced with underscores, and some truncation at the end may be necessary.

        Next, you need to create this for every single player, regardless of which positions he/she appears in. So this is just a specific reason beyond the general principle: almost everything in Stata is easier with data in long layout.

        So here we go:

        Code:
        // RESHAPE TO LONG LAYOUT
        reshape long player number, i(game_num) j(position)
        
        //    CREATE PLAYER INDICATOR VARIABLES
        levelsof player, local(players)
        foreach p of local players {
            local varname = substr(strtoname(`"`p'"'), 1, 25) // LEAVE ROOM FOR ingame_
            local varname ingame_`varname'
            by game_num, sort: egen `varname' = max(player == `"`p'"')
        }
        You will now have variables with names like ingame_David_Atanga coded 1 in every game he participated in, 0 elsewhere.

        Note: Not tested. Beware of typos, etc.

        I don't know what you will be doing next, but it is most likely that leaving the data in long layout will facilitate your next steps. In the event you really do need to go back to the wide layout, -reshape wide- will accomplish this and will bring along the new indicator variable.

        Comment


        • #5
          Thanks very much, I'll test this out later today. The numeric variables you see are just an index for specific players (David Atanga is 253699 for example).

          Comment


          • #6
            Originally posted by Clyde Schechter View Post

            I don't know what you will be doing next, but it is most likely that leaving the data in long layout will facilitate your next steps. In the event you really do need to go back to the wide layout, -reshape wide- will accomplish this and will bring along the new indicator variable.
            Thanks for your help so far, I can get the indicator variable in on the long layout. If I want to bring it back to the wide layout do I need to specify anything different from

            reshape long player number, i(game_num) j(position)

            The indicators are not constant within MatchID in the long view (since the game takes up 11 rows instead, one for each player (one 1, ten 0s on the indicator)). Is there a way to specify this under the reshape command?

            Comment


            • #7
              The indicators are not constant within MatchID in the long view (since the game takes up 11 rows instead, one for each player (one 1, ten 0s on the indicator)). Is there a way to specify this under the reshape command?
              I have non idea what you're trying to tell me here. What is MatchID? Is that what I called game_num? My code above was predicated on the notion that each game was a single observation ("row") in the original data set. Under that assumption, each of the new indicator variables is constant within all observations for a given value of game_num, and you should not include it in the varlist of your -reshape wide- command.

              Added: In fact, if all you have done is run the code from #4 and now are trying to get back to wide layout, just -reshape wide- with nothing else specified should do the trick.

              I can't reshape data that I can't see. Please use -dataex- (-ssc install dataex-) to post an example of the data you are working with if you need additional help with this.

              And, again, really think twice before -reshape wide-, because wide layout usually makes things harder for subsequent analyses.
              Last edited by Clyde Schechter; 02 Nov 2016, 10:25.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                I have non idea what you're trying to tell me here. What is MatchID? Is that what I called game_num? My code above was predicated on the notion that each game was a single observation ("row") in the original data set. Under that assumption, each of the new indicator variables is constant within all observations for a given value of game_num, and you should not include it in the varlist of your -reshape wide- command.

                Added: In fact, if all you have done is run the code from #4 and now are trying to get back to wide layout, just -reshape wide- with nothing else specified should do the trick.

                I can't reshape data that I can't see. Please use -dataex- (-ssc install dataex-) to post an example of the data you are working with if you need additional help with this.

                And, again, really think twice before -reshape wide-, because wide layout usually makes things harder for subsequent analyses.
                I'll post a very small snapshot of the data.

                Code:
                 * Example generated by -dataex-. To install: ssc install dataex clear input byte MatchID str18(Team OpTeam) long(Index1 Index2 Index3 Index4 Index5 Index6 Index7 Index8 Index9 Index10 Index11)
                1 "Grödig" "Rheindorf Altach" 228418 117698 111641 111783 216679 144540 228423 138974 238319 94836 4763
                2 "Rapid Vienne" "Ried" 43988 202439 15205 143245 80811 121873 103250 12363 91988 43108 184473
                3 "Trenkwalder Admira" "Sturm Graz" 111640 187509 239858 173746 193773 163509 223522 137035 159548 5001 184706
                4 "Red Bull Salzburg" "Mattersburg" 173632 94937 237759 17613 68816 232765 253699 217490 147210 116501 10971
                5 "Wolfsberger AC" "Austria Vienna" 117726 3611 84314 59436 41134 149245 97327 136399 59403 18423 52212 end
                The only variables we need to worry about is MatchID (which is your game_num) and the Index* variables. I tried your code with the player names on a sub-sample and it worked perfectly, even switching back to the wide layout. Is it possible to edit your code to work for the Index numbers instead of names? This would make it easier and reduce the risk on two different players with identical names (which I've realized now could be an issue)?

                I have one final issue. My full dataset contains some 64,000 games and the player count runs into the tens of thousands. This means that my Stata crashes when trying to produce so many indicator variables. Is there an easy way to say put Index numbers that appear 35 times or less into one indicator then have the rest work as normal? So that would look like;

                Indicator1 = 1 if count(Index*)<=35

                and all others produced as normal by your code. This would trim the number of indicator variables down to around 7.6k which would be more manageable on Stata.

                Thanks once again for your help, it's been extremely useful.

                Comment


                • #9
                  Switching from player to index is a pretty minor modification, accounting for Index being a numeric rather than a string variable. In a way, it actually makes it simpler.

                  I don't quite get your indicator1 concept: it seems to be a variable that smashes together the participation of all players who participated in 35 or fewer games. I'm not sure if I what I've done here is what you had in mind or not.

                  Code:
                   reshape long Index, i(MatchID) j(position)
                   
                   //    CREATE PLAYER INDICATOR VARIABLES
                  gen byte indicator1 = 0
                  levelsof Index, local(indices)
                  foreach i of local indices {
                      by MatchID, sort: egen in_game`i' = max(Index == `i')
                      summarize in_game`i', meanonly
                      if `r(sum)' <= 35 {
                          replace indicator1 = in_game_`i'  if Index == `i'
                          drop in_game`i'
                      }
                  }
                  reshape wide

                  Comment


                  • #10
                    Originally posted by Clyde Schechter View Post
                    Switching from player to index is a pretty minor modification, accounting for Index being a numeric rather than a string variable. In a way, it actually makes it simpler.

                    I don't quite get your indicator1 concept: it seems to be a variable that smashes together the participation of all players who participated in 35 or fewer games. I'm not sure if I what I've done here is what you had in mind or not.

                    Code:
                    reshape long Index, i(MatchID) j(position)
                    
                    // CREATE PLAYER INDICATOR VARIABLES
                    gen byte indicator1 = 0
                    levelsof Index, local(indices)
                    foreach i of local indices {
                    by MatchID, sort: egen in_game`i' = max(Index == `i')
                    summarize in_game`i', meanonly
                    if `r(sum)' <= 35 {
                    replace indicator1 = in_game_`i' if Index == `i'
                    drop in_game`i'
                    }
                    }
                    reshape wide
                    Thanks very much, this does what I'm looking for. I almost had it without the indicator1 variable but this makes it work.

                    Comment


                    • #11
                      Dear Clyde Schechter
                      I have a related query.
                      I have a panel of states representing whether or not the pair of states has adopted a policy provision. The data includes multiple policy provisions prov01, prov02, prov100, prov101 etc. These are indicator (dummy) variables = 1, if the provision is adopted and 0 otherwise at time t. I want to create a dummy variable:
                      PROV = 1, if the pair has adopted at least one provisions
                      = 0, if none of the provisions is adopted.
                      I am posting a short example showing how my PROV indicator be created from multiple indicators.

                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input str2(state_i state_k) int year byte(prov01 prov02 prov03 prov100 prov101 prov103 prov11 prov110 prov111 prov112 prov13 prov130 prov131 prov132 prov14 prov15 PROV)
                      "JK" "BR" 1990 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
                      "JK" "BR" 1991 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      "JK" "BR" 1992 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      "JK" "LT" 1990 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1
                      "JK" "LT" 1992 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      "JK" "TR" 1990 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
                      "JK" "TR" 1991 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
                      "JK" "TR" 1992 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      "ST" "AS" 1990 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      "ST" "AS" 1991 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1
                      "ST" "BR" 1990 1 1 1 1 0 1 0 0 0 1 1 0 0 1 0 1 1
                      "ST" "BR" 1992 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1
                      "ST" "LT" 1990 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
                      "ST" "LT" 1991 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1
                      "ST" "LT" 1992 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
                      end
                      Thanks,
                      (Ridwan)

                      Comment


                      • #12
                        Ridwan Sheikh Your desired dummy or indicator is the maximum across the original indicators. If there is any 1, the maximum will be 1, and otherwise if all values are 0 the maximum is 0. There is more on this general idea at https://www.stata.com/support/faqs/d...ble-recording/

                        In your case the twist is that you need something like

                        Code:
                        egen wanted = rowmax(prov*)
                        There are no missing values in your data example, but rowmax() will ignore missing values to the extent possible.

                        Comment


                        • #13
                          Thanks Nick Cox !
                          Yes , the desired dummy or indicator (PROV) is the maximum across the original indicators (prov*) with 1 being maximum (if number of provisions adopted >=1)and 0 minimum(if none of the provisions is a adopted by dyadic state pair).
                          The code you suggested worked fine in this case.
                          I also need to created another variable that gives the count of 1's across (prov*). How can i proceed with that ?
                          Thanks

                          Comment


                          • #14
                            It is documented in the same place. rowtotal() is an egen function akin to rowmax().

                            Comment


                            • #15
                              Thank you very much Nick Cox

                              Comment

                              Working...
                              X