Seeking advice on how to analyse different households' appliance use in response to a change in weather

Ryan Long

Join Date: Feb 2019

Posts: 10
#1

13 Feb 2019, 00:19

Hi all,

To provide some context, here's a small sample of my data:

Code:

input float(haze age ac_behavior purifier_behavior) long housing float female
0 21 1 0 1 1
1 21 1 1 1 1
0 58 1 0 3 1
1 58 1 0 3 1
0 47 1 0 3 1
1 47 1 0 3 1
0 43 0 0 2 0
1 43 1 1 2 0
0 35 1 0 3 0
1 35 0 1 3 0
0 52 1 0 3 1
1 52 0 0 3 1
0 54 1 1 5 0
1 54 1 1 5 0
0 61 1 0 5 0
1 61 1 1 5 0
0 47 1 1 2 0
1 47 1 0 2 0
end

A brief explanation of the variables:

age is continuous

female = 1 if female, = 0 if male

ac_behavior = 1 if air-con is used, = 0 if unused

purifier_behavior = 1 if purifier is used, = 0 if unused

housing is a categorical variable (can take on labels from 1 to 6 according to housing type). Housing type is meant to be a proxy for household income level since income data is not available.

haze = 1 if hazy, = 0 if normal. Each participant will answer the questions on behavior twice, first responding to how they will use their appliances under normal weather conditions, and second to how they will use their appliances under hazy conditions. This means that while I have 622 observations in my dataset, there are only 311 participants in the study.

So my intention is first to investigate whether households' use of appliances is related to haze. As a start, I ran a logit regression with ac_behavior as the dependent variable and haze as a regressor. Here are the results. It tells me that, all other factors constant, households are 1.77 times more likely to use air-con under hazy conditions than under normal conditions, and the relation is statistically significant. Is this the right line of thought?
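
A note on reading the 1.77: -logistic- reports an odds ratio, which compares odds rather than probabilities, so "1.77 times more likely" holds only approximately, and only when the baseline probability is small. The arithmetic can be sketched as follows; the baseline probability of 0.50 is a made-up value for illustration, not a figure from the data.

```stata
* Hypothetical baseline: probability of using air-con under normal weather
scalar p0 = 0.50

* Odds under haze = odds ratio x baseline odds
scalar odds1 = 1.77 * p0 / (1 - p0)

* Convert the odds under haze back into a probability
scalar p1 = odds1 / (1 + odds1)

display "P(air-con | haze)      = " p1        // roughly 0.64
display "ratio of probabilities = " p1 / p0   // roughly 1.28, not 1.77
```

With a baseline of 0.50 the probability rises by a factor of about 1.28 even though the odds rise by a factor of 1.77, so the safer wording is "the odds of using air-con are 1.77 times as high under haze".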

Next, I wish to investigate whether households in different kinds of dwellings (and by proxy different income groups) respond to haze differently by comparing their appliance use with and without haze. I am not sure how I can go about doing that.

I am thinking about running another logit regression. For instance, a code like this:

Code:

logistic ac_behavior i.housing haze, r

Any advice on how to proceed would be much appreciated.

Last edited by Ryan Long; 13 Feb 2019, 00:25.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17739
#2

13 Feb 2019, 00:45

Ryan:
I think you're on the right track, but I would run a single regression instead (in the following toy example, let's set aside for a moment the Stata notes that appear above the output table):

Code:

input float(haze age ac_behavior purifier_behavior) long housing float female
0 21 1 0 1 1
1 21 1 1 1 1
0 58 1 0 3 1
1 58 1 0 3 1
0 47 1 0 3 1
1 47 1 0 3 1
0 43 0 0 2 0
1 43 1 1 2 0
0 35 1 0 3 0
1 35 0 1 3 0
0 52 1 0 3 1
1 52 0 0 3 1
0 54 1 1 5 0
1 54 1 1 5 0
0 61 1 0 5 0
1 61 1 1 5 0
0 47 1 1 2 0
1 47 1 0 2 0
end
egen id = seq(), b(2)
logistic ac_behavior i.haze i.housing c.age##c.age i.female, vce(cluster id)
note: 1.housing != 0 predicts success perfectly
      1.housing dropped and 2 obs not used

note: 5.housing != 0 predicts success perfectly
      5.housing dropped and 4 obs not used

note: 3.housing omitted because of collinearity

Logistic regression                             Number of obs     =         12
                                                Wald chi2(4)      =          .
                                                Prob > chi2       =          .
Log pseudolikelihood = -5.9349307               Pseudo R2         =     0.1205

                                     (Std. Err. adjusted for 6 clusters in id)
------------------------------------------------------------------------------
             |               Robust
 ac_behavior | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.haze |   .3605999   .7013676    -0.52   0.600     .0079696    16.31603
             |
     housing |
          1  |          1  (empty)
          2  |   .2266629   1.684043    -0.20   0.842     1.07e-07    478160.3
          3  |          1  (omitted)
          5  |          1  (empty)
             |
         age |   2.193264   7.296923     0.24   0.813     .0032297    1489.405
             |
 c.age#c.age |   .9935944   .0319737    -0.20   0.842     .9328625     1.05828
             |
    1.female |   .1377507   1.164656    -0.23   0.815     8.76e-09     2166901
       _cons |   5.04e-09   3.91e-07    -0.25   0.806     4.14e-75    6.12e+57
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

.

The -fvvarlist- notation for -age- (-c.age##c.age-, that is, age plus its square) aims at investigating possible turning points (but this does not seem to be the case with your data excerpt).
You should cluster the standard errors, since the same person answers two different questions (haze yes/no).
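
On the second question (whether households in different dwelling types respond to haze differently), one common approach is to interact -haze- with -housing-. What follows is only a minimal sketch: it assumes the full dataset is in memory and that -id- identifies the participant, and it will not estimate sensibly on the toy excerpt above, where most housing cells are too small.

```stata
* i.haze##i.housing expands to both main effects plus their interaction
logistic ac_behavior i.haze##i.housing c.age##c.age i.female, vce(cluster id)

* Joint Wald test of the interaction terms:
* does the effect of haze differ across housing types?
testparm i.haze#i.housing

* Change in the predicted probability of air-con use (hazy vs. normal),
* reported separately for each housing type
margins housing, dydx(haze)
```

The -margins- output is on the probability scale, which is often easier to report than a set of interaction odds ratios.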

Kind regards,
Carlo
(Stata 19.0)


Ryan Long

Join Date: Feb 2019

Posts: 10
#3

13 Feb 2019, 01:25

Hi Carlo,

Thanks for your quick reply. I am still trying to understand what's going on in this line of code. Could you help me out?

Code:

egen id = seq(), b(2)

I'm reading the manual but to no avail. How should I code this differently when I am dealing with my actual dataset which is much larger?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17739
#4

13 Feb 2019, 01:52

Ryan:
-seq()- is an -egen- function that allows you to create integer sequences (see -help egen- and the related entry in the Stata .pdf manual).
As far as your excerpt is concerned, since -id- was not present in the set of variables, to create two-observation clusters I asked Stata to give the same -id- to each 0/1 pair (that is, to each block composed of two observations or, in Stataish, -b(2)-, short for -block(2)-).
From the first pair of observations onwards, the integer sequence increases by 1.
If your dataset already includes an -id-, you should do nothing; conversely, if you need to create an -id- and each person has exactly two observations stored on consecutive rows, just replicate the -seq()- code reported above.
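
A small illustration of what -seq()- with -block(2)- produces, using the first rows of the excerpt. The final check is an addition for illustration; it relies on the assumption that every participant contributes exactly one normal and one hazy row.

```stata
clear
input float(haze age)
0 21
1 21
0 58
1 58
0 47
1 47
end

* block(2) repeats each integer twice, following the current row order:
* rows 1-2 get id 1, rows 3-4 get id 2, rows 5-6 get id 3
egen id = seq(), block(2)
list haze age id, sepby(id)

* Sanity check: each id should hold one normal (0) and one hazy (1) row
bysort id (haze): assert _N == 2 & haze[1] == 0 & haze[2] == 1
```

Because -seq()- follows row order only, it gives wrong clusters if the data have been sorted so that a person's two rows are no longer adjacent. An existing participant identifier, if there is one, is the safer choice for -vce(cluster ...)-.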

Kind regards,
Carlo
(Stata 19.0)
Ryan Long

Join Date: Feb 2019

Posts: 10
#5

13 Feb 2019, 02:16

My data does have a responseid that is unique for every participant, so I am using that instead of generating a new id. These are my results:

Now that I have run the regression, it appears that some of the variables are not statistically significant. Does that indicate that I should respecify the regression model by removing those variables that are non-significant? And how should I handle the fact that some of the dummy variables on housing have a statistically significant coefficient while others don't?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17739
#6

13 Feb 2019, 02:35

Ryan:
- your -c.age#c.age- should be -c.age##c.age-;
- an insignificant coefficient conveys the same amount of information as a significant one: hence, do not remove those variables, but try to explain in your research report why the results are so;
- it may well be that you have a different number of observations for each level of the categorical variable -housing- or, more substantively, that, ceteris paribus, higher income households do not pay much attention to whether they use appliances with or without haze;
- you can also test the joint significance of -housing- via:

Code:

testparm i.housing

Kind regards,
Carlo
(Stata 19.0)
Ryan Long

Join Date: Feb 2019

Posts: 10
#7

13 Feb 2019, 02:48

Thanks Carlo! You have been of great help. I have rectified the mistake.
Ryan Long

Join Date: Feb 2019

Posts: 10
#8

13 Feb 2019, 10:41

Sorry to bother everyone again. I have been tinkering with my model specification. I tried including a few more binary regressors that I thought might be helpful in explaining air-con use such as purifier_behavior (whether the household will use air purifier), dineout_behavior (whether the household will dine outside), and own_car (whether household owns a car). However, this radically changes my p values for the existing housing and haze variables, making them statistically insignificant.

This is my regression result without the newly introduced regressors:

This is my result after adding in the new variables:

My question is: is one of the model specifications preferable to the other? And is it correct to think that there might be multicollinearity at work here?
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17739
#9

13 Feb 2019, 11:01

Ryan:
- as a general comment, regression models should give a fair and true view of the data generating process. Hence, I would skim through the literature of your research field and see what others did in the past when presented with the same research goal;
- you can easily see whether quasi-extreme multicollinearity is at work via -estat vce, corr-. However, please note that the relevance of quasi-extreme multicollinearity as an issue per se is questioned on and off (http://www.hup.harvard.edu/results-l...&submit=Search) on this list.
As an aside, when adjusted for other predictors you might have a turning point for -age- in your second regression model.
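
The multicollinearity check can be sketched as below. Ryan's exact specification is not shown in the thread, so the variable list is an assumption assembled from the names mentioned in posts #5 and #8 (-responseid-, -purifier_behavior-, -dineout_behavior-, -own_car-); the -///- line continuation works in a do-file, not at the interactive prompt.

```stata
* Assumed specification: the earlier model plus the three new binary regressors
logistic ac_behavior i.haze i.housing c.age##c.age i.female ///
    i.purifier_behavior i.dineout_behavior i.own_car,       ///
    vce(cluster responseid)

* Correlation matrix of the coefficient estimates;
* entries close to +1 or -1 point to near-collinear regressors
estat vce, corr
```

High correlations between the estimates for -haze- or -housing- and those for the new regressors would be consistent with the change in p-values, but they would not by themselves show which specification is preferable.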

Kind regards,
Carlo
(Stata 19.0)
Ryan Long

Join Date: Feb 2019

Posts: 10
#10

13 Feb 2019, 11:13

I see. Thanks again for your input.