Descriptive statistics and significance of different groups

John Adler

Join Date: Apr 2017

Posts: 173
#1


24 Nov 2017, 08:33

Hello,

I have a large panel dataset across three waves which includes the health outcomes and behaviors of respondents. I would like to create a table of summary statistics. In particular, I would like to report what percentage of the sample reports each specific health outcome, i.e. diabetes, heart disease, stroke, etc. Following this, I would like to repeat this for the various ethnic groups in the survey, i.e. what percentage of whites report this disease, what percentage of blacks, what percentage of Hispanics, etc. I would like to see if certain ethnic groups have a higher probability of reporting these health outcomes than others, e.g. are whites more likely to report heart disease than blacks? Finally, I would like to be able to say whether these differences are statistically significant and at what level.

Any help would be greatly appreciated,

Kindest regards,

John
Tags: data, descriptive, hypothesis test, statistics, summary
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

24 Nov 2017, 12:43

Welcome to the Stata Forum/ Statalist,

Your question seems to be quite broad, and you didn’t provide data to work on.

Please read the FAQ, particularly the advice on sharing example data, which makes a query much easier to understand and answer.

The question is still confusing to me, because the main outcome is unclear.

That said, you will probably need - tabulate - and its options. I also suspect you will need - egen - somewhere in order to create variables for the comparisons.
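As a rough illustration of what - tabulate - can do here, the sketch below uses Stata's nhanes2 example dataset as a stand-in, since no data were posted. The variables diabetes, heartatk and race belong to that dataset, not to yours, and the survey weights are ignored for simplicity.

```stata
* Stand-in data: nhanes2 is a cross-section, not a three-wave panel
webuse nhanes2, clear

* Share of the whole sample reporting each outcome
tabulate diabetes
tabulate heartatk

* Share within each group (column percentages) plus a Pearson chi-square test
tabulate diabetes race, column chi2
tabulate heartatk race, column chi2
```

With panel data you would presumably add an if qualifier and do this one wave at a time, because repeated observations on the same respondent are not independent.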

I’m sorry, but I cannot go further than this, considering the lack of informative data.

Best regards,

Marcos

John Adler

Join Date: Apr 2017
Posts: 173

28 Nov 2017, 07:14

Dear Marcos,

Thank you for your reply, and yes, I should have been clearer. I have spoken to the data manager here, and although they will not give me permission to post the study data, I have replicated my analysis using auto.dta in order to provide greater clarity.
I have renamed variables in this dataset to mimic my own dataset. Some of the names make no sense for car data, but this matters less than the syntax, which I will be applying to my own dataset in time.

In this example, which is analogous to my own dataset, I would like to put together a table of summary statistics on the self-rated health of women.

In particular, I would like to:

1) Report the percentage of women that report poor self-rated health in my sample.

2) Report the percentage of women who do not have public insurance and report poor self-rated health.

3) Determine whether there is some relationship between self-rated health and holding public insurance for women in my sample.

Hence I load the data file:

Code:

 sysuse auto

To keep things simple, and representative of my own dataset, I recode number of repairs as a binary variable:

Code:

 
recode rep78  (0/2 = 0 "Good Health") (3/5 = 1 "Poor Health") (else=.), gen(selfratedhealth_bin) label(selfratedhealth_bin)

I then create a frequency table, as below. My interest here is only the women in the sample, not the men, so I first recode foreign to gender:

Code:

 
recode foreign (0 = 0 "Female") ( 1 = 1 "Male") (else=.), gen(gender) label(gender)

Code:

  tab selfratedhealth_bin gender, column row

Code:

 
 
 


+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+
 
  RECODE of |
      rep78 |
    (Repair |   RECODE of foreign
     Record |      (Car type)
      1978) |    Female       Male |     Total
------------+----------------------+----------
Good Health |        10          0 |        10
            |    100.00       0.00 |    100.00
            |     20.83       0.00 |     14.49
------------+----------------------+----------
Poor Health |        38         21 |        59
            |     64.41      35.59 |    100.00
            |     79.17     100.00 |     85.51
------------+----------------------+----------
      Total |        48         21 |        69
            |     69.57      30.43 |    100.00
            |    100.00     100.00 |    100.00

Results suggest that in this sample 20.83% of women had good self-rated health and 79.17% had poor self-rated health.
I would like to know whether the percentage of women reporting poor self-rated health varies with holding public insurance, so I create a binary public insurance variable:

Code:

recode price  (5079/15906 =1 "Does not have Public Insurance") (3291/4934 =0 "Has Public Insurance") (else=.), gen(public_insurance) label(public_insurance)
(74 differences between price and public_insurance)

Then I create another frequency table as below:

Code:

tab selfratedhealth_bin public_insurance if gender == 0, column row

This tells me that, among women, a larger percentage of those without public insurance reported poor self-rated health than of those with it (81.82% vs. 76.92%; the figure for all women is 79.17%).

Code:

 


+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+
 
  RECODE of |
      rep78 |
    (Repair |    RECODE of price
     Record |        (Price)
      1978) | Has Publi  Does not  |     Total
------------+----------------------+----------
Good Health |         6          4 |        10
            |     60.00      40.00 |    100.00
            |     23.08      18.18 |     20.83
------------+----------------------+----------
Poor Health |        20         18 |        38
            |     52.63      47.37 |    100.00
            |     76.92      81.82 |     79.17
------------+----------------------+----------
      Total |        26         22 |        48
            |     54.17      45.83 |    100.00
            |    100.00     100.00 |    100.00
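If only the within-group percentages are needed for the summary table, a more compact display may be enough. The following is a minimal sketch that repeats the recodes from this post so that it runs on its own; column and nofreq are standard - tabulate - options, and nothing else is new.

```stata
* Rebuild the example variables from auto.dta
sysuse auto, clear
recode rep78 (0/2 = 0 "Good Health") (3/5 = 1 "Poor Health") (else=.), gen(selfratedhealth_bin) label(selfratedhealth_bin)
recode foreign (0 = 0 "Female") (1 = 1 "Male") (else=.), gen(gender) label(gender)
recode price (5079/15906 = 1 "Does not have Public Insurance") (3291/4934 = 0 "Has Public Insurance") (else=.), gen(public_insurance) label(public_insurance)

* Column percentages only: share in poor health within each insurance group
tabulate selfratedhealth_bin public_insurance if gender == 0, column nofreq
```

The column percentages are the ones to compare, since each column is an insurance group.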

Spurred by these rough results, I would like to use a χ² (chi-square) test to determine whether there is some relationship between self-rated health and public insurance for women.

Code:

tab selfratedhealth_bin public_insurance if gender == 0, column row nokey chi2 lrchi2 V exact gamma taub

Code:


  RECODE of |
      rep78 |
    (Repair |    RECODE of price
     Record |        (Price)
      1978) | Has Publi  Does not  |     Total
------------+----------------------+----------
Good Health |         6          4 |        10
            |     60.00      40.00 |    100.00
            |     23.08      18.18 |     20.83
------------+----------------------+----------
Poor Health |        20         18 |        38
            |     52.63      47.37 |    100.00
            |     76.92      81.82 |     79.17
------------+----------------------+----------
      Total |        26         22 |        48
            |     54.17      45.83 |    100.00
            |    100.00     100.00 |    100.00
 
          Pearson chi2(1) =   0.1731   Pr = 0.677
 likelihood-ratio chi2(1) =   0.1743   Pr = 0.676
               Cramér's V =   0.0601
                    gamma =   0.1489  ASE = 0.353
          Kendall's tau-b =   0.0601  ASE = 0.143
           Fisher's exact =                 0.735
   1-sided Fisher's exact =                 0.479

The null hypothesis (H0) is that there is no relationship between self-rated health and public insurance among women. To reject it we need Pr < 0.05 (at 95% confidence).
I am a bit fuzzy on how best to read this. As the Pearson chi2 has Pr = 0.677, i.e. it is insignificant at 95% confidence, can I conclude that the differences I see in self-rated health by public insurance are not significant for women at the 95% level? If the Pr for the Pearson chi2 were < 0.05, could I conclude that the differences between women with and without this insurance are significant at the 95% level?
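As a sanity check on where these numbers come from, the Pearson statistic can be recomputed from the four cell counts alone. The sketch below re-enters the counts for women with - tabi - and then applies the usual shortcut formula for a 2x2 table, n(ad - bc)^2 divided by the product of the row and column totals; it should reproduce the 0.1731 and Pr = 0.677 reported above.

```stata
* Re-enter the 2x2 counts for women (rows: good/poor health; columns: has/does not have insurance)
tabi 6 4 \ 20 18, chi2 exact

* Pearson chi2 by hand: n(ad - bc)^2 / (r1 * r2 * c1 * c2)
display 48 * (6*18 - 4*20)^2 / (10 * 38 * 26 * 22)

* Upper-tail probability of a chi-square with 1 degree of freedom
display chi2tail(1, r(chi2))
```

Pr is the probability, under H0, of a chi-square statistic at least this large, so a value of 0.677 means the observed difference is well within what chance alone would produce in a sample of 48.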

In general, is my approach to considering the percentage reporting bad self-rated health by public insurance in women correct for my first summary table (later in the paper there will be a more quantitative analysis), or am I on another planet altogether?

I thank you for all your support,

Best regards,

Jonathan


Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#4

28 Nov 2017, 15:17

I gather the real data is much bigger, hence I’ll avoid remarking that Fisher’s exact test would be more appropriate for this tiny sample. Kendall’s tau-b is not informative, since you only have binary variables. This being an observational study, you will probably need to provide a ‘full’ model, adjusting for covariates and checking the pattern of the residual distribution and, possibly, endogeneity.
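To give an idea of what such a model could look like, here is a minimal sketch on the auto.dta stand-in from #3. The recodes are repeated so that it runs on its own, and mpg is only a placeholder covariate chosen for illustration; it has no counterpart in the real study, where you would use age, income and the like.

```stata
* Rebuild the example variables from auto.dta
sysuse auto, clear
recode rep78 (0/2 = 0 "Good Health") (3/5 = 1 "Poor Health") (else=.), gen(selfratedhealth_bin) label(selfratedhealth_bin)
recode foreign (0 = 0 "Female") (1 = 1 "Male") (else=.), gen(gender) label(gender)
recode price (5079/15906 = 1 "Does not have Public Insurance") (3291/4934 = 0 "Has Public Insurance") (else=.), gen(public_insurance) label(public_insurance)

* Unadjusted odds ratio for poor health by insurance status, women only
logistic selfratedhealth_bin i.public_insurance if gender == 0

* Same model adjusted for one placeholder covariate
logistic selfratedhealth_bin i.public_insurance c.mpg if gender == 0
```

With panel data, something along the lines of - xtlogit - or cluster-robust standard errors would presumably be needed as well, but that depends on the design.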

That being said, your approach strikes me as quite reasonable, so to speak.

Best regards,

Marcos