  • Sorting a variable by two dummy variables

    I am working on a paper about discrimination in the labour market. I have three variables (among many others): race (black/white), callback, and advertisement. Race and callback are dummies; the ad variable assigns a number from 1 to 1200 to each job advertisement.

    The dataset looks like this:

    race call ad
    b 0 1
    w 0 1
    b 0 1
    w 0 1


    which means that for ad 1, nobody got called back.

    Then I have

    race call ad
    w 0 216
    b 1 216
    b 1 216
    w 0 216


    which means that for ad 216, 2 African-Americans got called back and 0 Whites

    or



    race call ad
    b 0 376
    w 1 376
    b 0 376
    w 1 376


    which means that for ad 376 2 whites got called back and no African-Americans

    I need to know the percentage of ads that favored African-Americans (like ad 216), treated the candidates equally (like ad 1), or favored whites (like ad 376, or other ads where 1 white got called back and no African-Americans). So I need to break down the advertisement variable to know how many ads called back 0 whites and 0 blacks, 1 black and 1 white, 2 whites and 0 blacks, and so on.

    I am relatively new to Stata so I've tried to use the summarize command, unsuccessfully


    sum ad if race=="w" & call==0 & race=="b" & call==1 says "no observations"

    I tried tabulate race call adid, row and it says "too many variables specified", so I switched to bysort ad: summarize race call, yet I got no results.





  • #2

    Code:
    sum ad if race=="w" & call==0 & race=="b" & call==1
    asks Stata to work on observations for which race is both white and black in the same observation, and similarly call is both 0 and 1 in the same observation. That is like asking whether a car is both foreign and domestic: it can't be both. No observation satisfies all four conditions, which explains the message you report.

    tabulate allows only one or two variables. See its help.

    It seems to me that you don't want the & operator there at all, as it yields no observations.
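
    To illustrate: each if condition is checked within one observation at a time, so callbacks for the two groups have to be counted separately. A minimal sketch, assuming race is a string variable coded "b"/"w" as in your example:

    Code:
    * each condition is evaluated observation by observation
    count if race == "w" & call == 1   // white applicants who were called back
    count if race == "b" & call == 1   // black applicants who were called back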

    You can count black and white calls with

    Code:
    egen n_calls = total(call), by(ad race)
    and ensure that you only look at one observation per advertisement and race by

    Code:
    egen tag = tag(ad race) 
    after that

    Code:
    tab ad race [fw=n_calls] if tag
    would seem a step in the right direction. Presumably if you have different numbers of black and white people you're going to adjust for that. One way to do that would be

    Code:
    egen mean_call = mean(call), by(ad race)
    and indeed

    Code:
    tabulate ad race, summarize(call)
    gets you there directly.
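
    If ad takes too many distinct values to tabulate conveniently, one possible alternative is to count callbacks per ad for each race and then cross-tabulate the two counts, using one observation per ad. This is only a sketch: it assumes race is a string coded "b"/"w", and calls_b, calls_w and ad_tag are variable names made up for illustration.

    Code:
    * callbacks per ad, separately for black and white applicants
    egen calls_b = total(call * (race == "b")), by(ad)
    egen calls_w = total(call * (race == "w")), by(ad)
    * flag one observation per ad so that each ad is counted once
    egen ad_tag = tag(ad)
    * number (and share) of ads for each combination of callback counts
    tabulate calls_b calls_w if ad_tag, cell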








    • #3
      Thank you very much. I've used the code you wrote but unfortunately when I reached the last step and used tabulate ad race, summarize(call)

      it said

      too many values



      • #4
        Evidently you have too many ads to tabulate.

        Thinking about this again:

        1. How are you going to model this? I am no expert here and others do have more authority and may have better advice, but with thousands of ads, I wonder about xtlogit.

        2. For a descriptive analysis I would produce a reduced dataset.

        Putting all that together, and leaving interpretation of the model an open question (your data are just a minimal sandbox in any case), that suggests

        Code:
        clear 
        input str1 race call ad
        b 0 1
        w 0 1
        b 0 1
        w 0 1
        w 0 216
        b 1 216
        b 1 216
        w 0 216
        b 0 376
        w 1 376
        b 0 376
        w 1 376
        end 
        
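        * numeric copy of race so that it can enter the model as a factor variable
        encode race, gen(nrace)
        * declare ad as the panel (group) identifier
        xtset ad
        xtlogit call i.nrace

        * one observation per race-call-ad combination, with its count in _freq
        contract race call ad
        * spread the counts into columns: first by call (0/1), then by race (b/w)
        reshape wide _freq, i(ad race) j(call)
        reshape wide _freq0 _freq1, i(ad) j(race) string
        * combinations that never occur come out missing; recode them to 0
        mvencode *, mv(0)

        * share of applicants of each race who were called back, per ad
        gen pcall_black = _freq1b / (_freq1b + _freq0b)
        gen pcall_white = _freq1w / (_freq1w + _freq0w)

        list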
        
             +-------------------------------------------------------------------+
             |  ad   _freq0b   _freq1b   _freq0w   _freq1w   pcall_~k   pcall_~e |
             |-------------------------------------------------------------------|
          1. |   1         2         0         2         0          0          0 |
          2. | 216         0         2         2         0          1          0 |
          3. | 376         2         0         0         2          0          1 |
             +-------------------------------------------------------------------+
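
        To get from this reduced dataset to the share of ads favoring each group, one possible next step is to compare the callback counts within each ad and tabulate the result. This is a sketch rather than a definitive answer: favor and its value label are names invented here, and "favoring" is assumed to mean simply that one group received more callbacks than the other for that ad.

        Code:
        * 1 if more whites than blacks were called back, -1 if fewer, 0 if equal
        gen byte favor = cond(_freq1w > _freq1b, 1, cond(_freq1w < _freq1b, -1, 0))
        label define favor -1 "favors blacks" 0 "equal treatment" 1 "favors whites"
        label values favor favor
        * frequencies and percentages of ads in each category
        tabulate favor

        With the three example ads, each category contains exactly one ad. The same commands apply unchanged to the full set of ads, since nothing here depends on how many there are.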



        • #5
          I am actually working on a replication so I'm going to stick to the model they used, a probit regression (which I have used for other data I am replicating)

          The table I am working on is not supposed to be a regression, it's merely the percentage of callbacks that favored whites or blacks.

          Your code works, but unfortunately I still don't have the numbers I need, and I still need to include all the 1200+ ads in the sample. I doubt the authors of the original paper went through every ad manually.
