Variables with multiple values that will be picked up when used in regression?

Curan Tahim

Join Date: Jan 2018

Posts: 24
#1

Variables with multiple values that will be picked up when used in regression?

11 Jan 2018, 10:30

I'll try to explain this as best as possible.

I'm using a database of information for movies scraped from IMDb. It's currently as a csv but I'll be importing it into Stata.

Each film has a list of genres. They are listed alphabetically so I can't simply use the first genre of each list. I was wondering if there was a way to create the genre variable in Stata so that the genre variable for each film will have each of its list 'attached' to it, and then when I use a series of dummy variables in my regression, every film with that genre attached to it will be accounted for.

i.e. y = a + b1action + e

where the action dummy would pick any film with at least 'action' as one of its genres.

I hope this makes sense. Not sure it's possible but thought I'd ask!

Thanks
Tags: cross-section, variables
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#2

11 Jan 2018, 10:33

Curan:
welcome to this forum.
As a temptative answer, I would take a look at -egen-, with -group- function.

Kind regards,
Carlo
(Stata 19.0)
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

11 Jan 2018, 10:51

I think Carlo and I have a different understanding of your question, because his advice isn't what comes to my mind.

But neither Carlo nor I have proved very good at telepathy so far, and both of us have poor track records when it comes to writing code for imaginary data.

If you want a helpful answer to your question, I think you need to show an example of your data after you import it to Stata. The useful way to do that is with the -dataex- command. Please read the Forum FAQ, especially #12, for instructions about using -dataex-.
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17707
#4

11 Jan 2018, 10:57

Clyde is right.
As an economist, I'm fatally poor at forecasting, not to say at thought reading!

Kind regards,
Carlo
(Stata 19.0)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

11 Jan 2018, 12:49

Cross-posted at https://www.reddit.com/r/stata/comme..._that_will_be/ (no answers there at the time of writing).

We ask that you tell us about cross-posting. This is explicit in the FAQ Advice everyone is asked to read before posting.

Like others I am not clear what you want here and can only echo the advice to provide a concrete data example.
Comment
Curan Tahim

Join Date: Jan 2018

Posts: 24
#6

12 Jan 2018, 11:17

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str67 movie_name long(genre1 genre2 genre3) "Avatar" 1 1 8 "Spectre " 1 1 17 "The Dark Knight Rises " 1 18 . "John Carter " 1 1 15 "Tangled " 2 2 3 end label values genre1 genre1 label def genre1 1 "Action", modify label def genre1 2 "Adventure", modify label values genre2 genre2 label def genre2 1 "Adventure", modify label def genre2 2 "Animation", modify label def genre2 18 "Thriller", modify label values genre3 genre3 label def genre3 3 "Comedy", modify label def genre3 8 "Fantasy", modify label def genre3 15 "Sci-Fi", modify label def genre3 17 "Thriller", modify

I hope I've done this correctly and it's clear! So basically what I have here is an example of how I currently have 8 variables for genre of each movie (genre1-genre8). I am looking for a way to collapse this into one genre variables, that would contain each genre 'tag' for a particular movie.

Essentially I am trying to avoid having to add 8 sets of dummies for genre1 through genre8 with 17 dummies for each genre, especially as most have only two genre tags! If there was a way to create one genre variable, which would allow observations to be picked up if they are tagged with the genre, I believe that would be a much neater solution?
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

12 Jan 2018, 11:28

So the first obstacle is that the numeric codes used in the genre variables are inconsistent with each other. In this case, best to go back to strings. Then we just concatenate the strings together:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str67 movie_name long(genre1 genre2 genre3)
"Avatar"                 1  1  8
"Spectre "               1  1 17
"The Dark Knight Rises " 1 18  .
"John Carter "           1  1 15
"Tangled "               2  2  3
end
label values genre1 genre1
label def genre1 1 "Action", modify
label def genre1 2 "Adventure", modify
label values genre2 genre2
label def genre2 1 "Adventure", modify
label def genre2 2 "Animation", modify
label def genre2 18 "Thriller", modify
label values genre3 genre3
label def genre3 3 "Comedy", modify
label def genre3 8 "Fantasy", modify
label def genre3 15 "Sci-Fi", modify
label def genre3 17 "Thriller", modify

//    SINCE GENRES HAVE INCONSISTENT LABELS, REVERT TO STRINGS
foreach v of varlist genre* {
    decode `v', gen(_`v')
    drop `v'
    rename _`v' `v'
}

//    NOW CONCATENATE THE STRINGS IN SORTED ORDER
egen all_genres = concat(genre*), punct(" ")

Now if you want to do something with all observations that include "Adventure" among their tags you can do:

Code:

whatever_command ... if strpos(all_genres, "Adventure")

A potential pitfall here is if the genres are themselves inconsistent in spelling or capitalization. That is, Stata will not recognize "adventure" as the same as "Adventure." Nor will it think "Advemture" is the same as "Adventure." Nor is "Adventure " a match for "Adventure". So you may want to carefully scrub the string versions of the genre variables to make sure they are all formatted alike. Functions that may be helpful here are -trim()-, -itrim()-, -upper()-, -lower()-, and -proper()-. Do consult the help files for those.

Comment

Curan Tahim

Join Date: Jan 2018

Posts: 24
#8

12 Jan 2018, 11:35

This is great, thank you! How will this work when I move onto econometric analysis? For example, if using i.genre in a regression, all observations will the relevant tags will be picked up, or will I need to use a different command with the concatenated variable?
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

12 Jan 2018, 11:39

Thanks for the example, which helps a lot.

You're going to get into a real mess if you try to compare different categorical variables with inconsistent value labels, which you have here, without tidying up first.

Also, combining to one genre variable sounds likely to be futile to me, although there are various ways it could be tried. In any case, you don't say how you think it could or should be done -- and most crucially how it would be used in modelling.

Here is some technique. To get your genre variables on a common basis, I convert them all to string and then encode them back on a common basis using multencode (SSC), which you must install before you can use.

This won't work cleanly if you have variations in spelling, including in leading or trailing or embedded spaces or in inconsistent use of upper and lower case.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str67 movie_name long(genre1 genre2 genre3)
"Avatar"                 1  1  8
"Spectre "               1  1 17
"The Dark Knight Rises " 1 18  .
"John Carter "           1  1 15
"Tangled "               2  2  3
end
label values genre1 genre1
label def genre1 1 "Action", modify
label def genre1 2 "Adventure", modify
label values genre2 genre2
label def genre2 1 "Adventure", modify
label def genre2 2 "Animation", modify
label def genre2 18 "Thriller", modify
label values genre3 genre3
label def genre3 3 "Comedy", modify
label def genre3 8 "Fantasy", modify
label def genre3 15 "Sci-Fi", modify
label def genre3 17 "Thriller", modify

forval j = 1/3 { 
    decode genre`j', gen(Genre`j') 
} 

multencode Genre?, generate(genre_1 genre_2 genre_3) label(Genre) 

label list Genre 
* note that there are 7 distinct labels 

forval j = 1/7 { 
     gen is`j' = inlist(`j', genre_1, genre_2, genre_3) 
} 

list movie is* 

     +------------------------------------------------------------------+
     |             movie_name   is1   is2   is3   is4   is5   is6   is7 |
     |------------------------------------------------------------------|
  1. |                 Avatar     1     1     0     0     1     0     0 |
  2. |               Spectre      1     1     0     0     0     0     1 |
  3. | The Dark Knight Rises      1     0     0     0     0     0     1 |
  4. |           John Carter      1     1     0     0     0     1     0 |
  5. |               Tangled      0     1     1     1     0     0     0 |
     +------------------------------------------------------------------+

egen is = concat(is*)

list movie is 

     +----------------------------------+
     |             movie_name        is |
     |----------------------------------|
  1. |                 Avatar   1100100 |
  2. |               Spectre    1100001 |
  3. | The Dark Knight Rises    1000001 |
  4. |           John Carter    1100010 |
  5. |               Tangled    0111000 |
     +----------------------------------+

egen allgenre = concat(genre_?)

list movie allgenre

     +-----------------------------------+
     |             movie_name   allgenre |
     |-----------------------------------|
  1. |                 Avatar        125 |
  2. |               Spectre         127 |
  3. | The Dark Knight Rises         17. |
  4. |           John Carter         126 |
  5. |               Tangled         234 |
     +-----------------------------------+

Comment

Curan Tahim

Join Date: Jan 2018

Posts: 24
#10

12 Jan 2018, 11:55

Thanks for the reply, Nick. As I mentioned previously, I was wondering how this would affect my modeling in a regression equation, i.e. would I still be looking to use i.allgenre in this case, or would I need to use a different command/form to obtain accurate coefficient estimates?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#11

12 Jan 2018, 12:06

I can't see that anyone can predict what will lead to "accurate parameter estimates" as that's entirely in a circle with what your aim in modelling is and what your picture is of any underlying process.

If this was my problem, I would use one indicator variable (you follow a multitude in using the unfortunate term "dummy") for each genre (and live with the loss of information on ordering). In terms of #7 that would mean the indicator variables is1 upwards.

Note: As often happens, Clyde and I were composing replies at the same time, but I think our advice is consistent when it overlaps.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#12

12 Jan 2018, 13:58

If you want to use specific genres as predictors in regression analyses, with this kind of data, I don't see any alternative to having an indicator variable for each separate genre. The code I showed in #7 would work if you wanted to count or tabulate or summarize things in a particular genre. But it won't allow you to regress against genres.

If you want to use just a couple of genres as predictors, say, for example, Adventure and Comedy, you could do this:

Code:

foreach x in Adventure Comedy { gen byte is_`x' = strpos(all_genres, "`x'") } regress some_outcome_variable i.is_Adventure i.is_Comedy perhaps_other_regressors

If you will want to use all of the genres in this way, and if the number of them is too large, for practical purposes, for you to list in a foreach statement, then you can do this:

Code:

local genres foreach v of varlist genre* { levelsof `v', local(glist) local genres: list genres | glist local genres: list uniq genres } foreach x of local genres { gen byte is_`x' = strpos(all_genres, "`x'") } regress....//etc.

This is basically the same idea, it's just that instead of manually listing out all the genres in -foreach-, we first build up a local macro that contains them by pulling them out of the genre* variables.
Comment
Curan Tahim

Join Date: Jan 2018

Posts: 24
#13

15 Jan 2018, 12:21

Thanks very much for this help, it has made it much easier to understand. Just a quick question regarding the presentation of data in graphs etc with the new is_x variables. I am unsure how to present the frequency of each genre tag in a bar graph for example, or even in a summary statistics table, due to the way these variables are now structured. Would you be able to advise on how to achieve this?

Thank you again
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

15 Jan 2018, 12:36

Well, -graph bar is_*- would give you a bar graph of all the different genres, heights proportional to the fraction of observations that mention that genre.

There are endless ways to construct tables to show data, so it is hard to give specific advice. But one thing I can say is that you should think about tables of these variables the same way you would think about a table of distinct attributes such as duration of the film, box office gross in first week, number of Oscar nominations, etc. Whatever you would do to build a table for those, you can do the same with the is_* variables. I don't think any of the built-in Stata commands do this sort of thing, but there are many user-written commands that tackle this in different ways.
Comment
Curan Tahim

Join Date: Jan 2018

Posts: 24
#15

02 Feb 2018, 08:31

I hope it is ok to revisit this thread. I am now looking to create an indicator that will pick up a subset of the genres into an is_Other variable. To do this I used:

Code:

gen byte is_Other = strpos(all_genres, "Animation" "Biography" "Crime" "Documentary" "Family" "Fantasy" "History" "Horror" "Music" "Musical" "Mystery" "SciFi" "Sport" "War" "Western")

This should create is_Other which is an indicator that excludes action, adventure, comedy, drama, romance, and thriller. However after creating this indicator, I checked its frequency using:

Code:

sum is_Other if is_Other!=0

This returned a result that only 80 variables were tagged under is_Other, which should be far far too low from the sample of 1206 observations.

Any ideas on where I'm going wrong here? Thanks
Comment

Announcement