  • Histogram including missing observations

    Hi,

    Is there a way to plot a histogram whose density is computed relative to all observations, including those with missing data, rather than only the non-missing ones?

    Here is the issue: I want to plot the distribution of retirement age for my treated and control groups. However, for a substantial portion of each group, I do not observe retirement in my data. My variable of interest is retire7age1_year. treat5 is a dummy equal to 1 for individuals in the treated group and 0 for those in the control group. I am using byhist because I am reweighting the control group, and byhist allows pweights.

    The following code compares the distribution of all observed retirements in the treated group vs. all observed retirements in the control group:

    Code:
    byhist retire7age1_year  [pw = treatage_weight], by(treat5) density discrete
    [Image: image_12032.png]

    However, I also want to count the missing observations (people who have not retired) in each group. I have worked around this in a rudimentary way: I create a new variable, retire7age1_year_exit, which equals 100 if the individual is missing a retirement year. I then save the tabulate results in a matrix, write the matrix to a text file, import that file as a new dataset, and plot the tabulated results by group, relabeling 100 as "Not Retired."

    Code:
    * Weighted two-way table of retirement age (100 = not retired) by group
    tab retire7age1_year_exit treat5 [aw = treatage_weight], matcell(treatcontrol_tab)
    
    * Row names are the ages observed in this particular dataset
    matrix rownames treatcontrol_tab = 49 50 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 71 100
    matrix list treatcontrol_tab
    
    * Append zero rows for the ages with no observations
    matrix missingages = (0,0\0,0\0,0)
    matrix rownames missingages = 51 52 70
    
    matrix treatcontrol_tab = treatcontrol_tab\ missingages
    * mat2txt is community-contributed; it only needs installing once
    ssc install mat2txt
    mat2txt, matrix(treatcontrol_tab) saving(retire7age1_year_exit) replace
    
    preserve
        import delimited using "retire7age1_year_exit.txt", clear
        rename v1 retireage
        rename c1 control_freq
        rename c2 treat_freq
        drop v4
        * Convert frequencies to proportions of each group's total, missing included
        egen control_tot = sum(control_freq)
        egen treat_tot = sum(treat_freq)
        gen treat_prop = treat_freq / treat_tot
        gen control_prop = control_freq / control_tot
    
        * "Zoomed in" graph: same proportions, but the not-retired bar is left out
        twoway (bar treat_prop retireage if retireage < 100, color(ebg)) ///
        (bar control_prop retireage if retireage < 100, fcolor(none)), ///
        legend(order(1 "Treat" 2 "Control")) ytitle("Proportion") xtitle("Retirement Age") graphregion(color(white)) xtick(49(1)71)
        graph export retire7age1_treatcontrol_reweight_nonretirescale.png, replace
    
        * Full graph: relabel the sentinel value 100 as "NR" (not retired)
        sort retireage
        tostring retireage, replace
        replace retireage = "NR" if retireage == "100"
    
        graph bar treat_prop control_prop, over(retireage) asyvars ///
        legend( label(1 "Treat") label(2 "Control")) bar(1, color(ebg)) bar(2, fcolor(erose) lcolor(maroon)) graphregion(color(white)) ///
        b1title("Retirement Age") ytitle("Proportion")  
        graph export retire7age1_treatcontrol_reweight_nonretire.png, replace
    restore


    This gets me what I want:
    [Image: image_12034.png]
    where NR = not retired.

    [Image: image_12033.png]

    which is a "zoomed in" version of the graph above, excluding NR individuals.

    However, it has a lot of steps and is very specific to the data/outcome variable I am using. Is there an easier way? This seems like something that a lot of people might want to do, but I can't seem to find a similar question on Statalist.
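
    A possibly shorter route, sketched here rather than tested on the original data: recode missing retirement ages to a sentinel value, use collapse to sum the weights within each age-by-group cell, and divide by each group's total weight. The toy data and the names retire_exit, w, w_tot, prop and xpos are assumptions made for illustration; only retire7age1_year, treat5 and treatage_weight come from the post above.

```stata
* Toy data standing in for the real dataset (assumption: values are made up)
clear
set seed 12345
set obs 200
gen byte treat5 = mod(_n, 2)
gen retire7age1_year = 55 + floor(10 * runiform())
replace retire7age1_year = . if runiform() < 0.3      // not (yet) retired
gen double treatage_weight = cond(treat5 == 1, 1, 0.5 + runiform())

* Sentinel value 100 marks "not retired", as in the post
gen retire_exit = cond(missing(retire7age1_year), 100, retire7age1_year)

* Weighted frequency per age-by-group cell, then share of the group total
collapse (sum) w = treatage_weight, by(treat5 retire_exit)
bysort treat5: egen double w_tot = total(w)
gen double prop = w / w_tot

* Plot position: put the "not retired" bar just right of the oldest toy age
gen xpos = cond(retire_exit == 100, 66, retire_exit)

twoway (bar prop xpos if treat5 == 1, color(ebg))        ///
       (bar prop xpos if treat5 == 0, fcolor(none)),     ///
       legend(order(1 "Treat" 2 "Control"))              ///
       xlabel(55(1)64 66 "NR") ytitle("Proportion") xtitle("Retirement Age")
```

    Adding "if retire_exit < 100" to both bar plots would give the zoomed-in version while keeping the proportions relative to the whole group, missing included. Ages with no observations simply have no bar, so no zero rows need to be appended, and nothing is hard-coded apart from the sentinel value and the axis labels.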
    Last edited by Maggie Shi; 25 Sep 2018, 08:08.

  • #2
    Hello, everyone,
    I have a similar problem to Maggie's. I want to create a histogram that takes missing values into account and displays them as a separate value of the variable, or at least includes them when calculating the percentages (the variable is nominal with four values). However, Maggie's solution doesn't seem to work for me, so I wonder whether there is a simpler solution.
    I am looking forward to answers and suggestions!
    Jane



    • #3
      If Jane Doe is your real name: commiserations on the poor jokes you must have heard ever since.

      If "Jane Doe" is not your real name: we do ask for real names here. Please read and act on http://www.statalist.org/forums/help#realnames and
      #3 of https://www.statalist.org/forums/help#adviceextras

      I don't recall #1. Perhaps I and maybe others looked at it and decided it was long and complicated and left someone else to work at it.

      In #2, "doesn't seem to work for me" is not a report that can be discussed without data, code or specific details. http://www.statalist.org/forums/help#stata gives crucial advice.

      If you want a histogram to show missings, and the density or other calculations to include them, then the missings must first be assigned a distinct non-missing value. You can then draw the histogram as usual. Here is a simple example.

      Code:
      sysuse auto, clear
      clonevar rep78_2 = rep78
      replace rep78_2 = 7 if rep78_2 == .
      label def rep78_2  7 missing
      label val rep78_2  rep78_2
      histogram rep78_2, discrete xla(1/5 7, valuelabel)
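
      A small variation on the example above, offered as a sketch: the percent option puts the bars on a percentage scale, and tabulate gives a quick check that the recoded missings enter the denominator. The last line is an alternative for nominal variables that relies on the missing option of graph bar, which keeps missing values as a category without any recoding; it assumes a Stata version in which graph bar accepts (percent) without a variable list.

```stata
sysuse auto, clear
clonevar rep78_2 = rep78
replace rep78_2 = 7 if missing(rep78_2)
label define rep78_2 7 "missing"
label values rep78_2 rep78_2

* Percentages here are out of all 74 cars, including the recoded missings
tabulate rep78_2
histogram rep78_2, discrete percent xlabel(1/5 7, valuelabel)

* Alternative without recoding: missing shown as a separate category
graph bar (percent), over(rep78) missing
```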



      • #4
        Hi Nick,
        I am sorry that my post and my name did not meet the requirements; I will fix both!
        Thank you very much for your reply and the little example.
        Again, sorry!
        Antje
