How to calculate correlation per ID

Stefan Tijink

Join Date: Nov 2022

Posts: 7
#1

How to calculate correlation per ID

02 Nov 2022, 10:30

Good day,

I am trying to calculate the correlation between temperature and national gas consumption for several weather stations. This would be an easy task if my data was structured with variables being date, national gas consumption, weather measured at station 1, weather measured at station 2, and so forth. My data set with gas consumption is simply structured as data and gas consumption as the variables, with the inputs being clear. However, my weather data is structured as follows, which continues for all the weather stations: (Note that gas consumption is not yet included here)

As a consequence, my different stations are not specified as variables, but rather as inputs of the variable "staid" or "staname".

I am quite stuck with how to merge these data sets and calculate the correlation for each weather station. Merging is difficult because in my weather data set every date occurs multiple times (for every station) rather than just once. I have looked into reshape to get the staid inputs as separate variables but I could not work this out due to a lack of Stata skills.

Is there anyone that could help me with this?

Many thanks!

Last edited by Stefan Tijink; 02 Nov 2022, 11:26.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29966
#2

02 Nov 2022, 11:59

I found #1 too confusing to provide an answer. Descriptions of data sets, no matter how elaborate, are usually not sufficient to enable others to provide specific advice. Please post back showing example data from both data sets. Accompany that by a description of what the salient variables in each data set mean. That will make it more likely that somebody can figure out what needs to be done.

And before posting back, please read the Forum FAQ for excellent advice about how to post in ways that maximize the chances that someone will be able to provide a timely and helpful response. Among the things you will learn there is that screenshots of data are not helpful. The useful way to show example data is with the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
2 likes
Comment

Stefan Tijink

Join Date: Nov 2022
Posts: 7

03 Nov 2022, 04:16

Thank you for the reply and apologies if it was unclear.

Short description and goal:
I have data on weather stations in 8 different countries, with for each weather station the temperature it has measured on a given date. In addition, I have a second dataset with monthly gas consumption for each of these eight countries. My goal is to find for each country the weather station with the highest correlation between its measured temperature and the gas consumption of the country. In order to find this, I have to calculate for each weather station the correlation with the gas consumption of the country it is located in and then select for each country the weather station with the highest correlation.

Dataset 1: Weather stations
This data set contains the variables staid (weather station id), staname (weather station name), cn (country code weather station is located in, e.g. AT in example below refers to Austria, total 8 different countries), date (date at which temperature is measured for given weather station) tg (temperature measured at given weather station at given date).

Example 1: dataex if staid == 11, count(10)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int staid str48 staname str4 cn long(date tg)
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190101   4
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190102 -19
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190103   3
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190104  30
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190105  41
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190106  23
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190107  17
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190108  25
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190109  14
11 "KREMSMUENSTER (TAWES)                       " "AT" 20190110  52
end

Example 2: dataex if staid == 12, count(10)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int staid str48 staname str4 cn long(date tg)
12 "GRAZ-UNIVERSITAET                           " "AT" 20190101 41
12 "GRAZ-UNIVERSITAET                           " "AT" 20190102 18
12 "GRAZ-UNIVERSITAET                           " "AT" 20190103 38
12 "GRAZ-UNIVERSITAET                           " "AT" 20190104 34
12 "GRAZ-UNIVERSITAET                           " "AT" 20190105 68
12 "GRAZ-UNIVERSITAET                           " "AT" 20190106 75
12 "GRAZ-UNIVERSITAET                           " "AT" 20190107 41
12 "GRAZ-UNIVERSITAET                           " "AT" 20190108 27
12 "GRAZ-UNIVERSITAET                           " "AT" 20190109 31
12 "GRAZ-UNIVERSITAET                           " "AT" 20190110 50
end

Dataset 2: Gas consumption
This data set contains data on monthly gas consumption for 8 different countries. This monthly data has been transformed to daily data, with each date taking on its monthly value. For example, if Jan 2019 has a value of 8000, all days between 1 Jan 2019 and 31 Jan 2019 have the value 8000. So in the example below, this means that Belgium had a gas consumption of 2479.15 throughout January 2019. The data set starts at January 2019 and runs until August 2022.

Example:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int Date double(Belgium Denmark Germany France Italy Netherlands Austria UK)
21550 2479.15 384 12995 6442 10598 5613.034 1310 9967
21551 2479.15 384 12995 6442 10598 5613.034 1310 9967
21552 2479.15 384 12995 6442 10598 5613.034 1310 9967
21553 2479.15 384 12995 6442 10598 5613.034 1310 9967
21554 2479.15 384 12995 6442 10598 5613.034 1310 9967
21555 2479.15 384 12995 6442 10598 5613.034 1310 9967
21556 2479.15 384 12995 6442 10598 5613.034 1310 9967
21557 2479.15 384 12995 6442 10598 5613.034 1310 9967
21558 2479.15 384 12995 6442 10598 5613.034 1310 9967
21559 2479.15 384 12995 6442 10598 5613.034 1310 9967
end
format %tdnn/dd/CCYY Date

Comment

Stefan Tijink

Join Date: Nov 2022

Posts: 7
#4

03 Nov 2022, 10:27

I have managed to make some progress.

First I amended the files such that there data variables were similar, and joined them together using

Code:

joinby using "....\Gas consumption.dta"

After which, the data has the following form:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int staid str48 staname str4 cn long(date tg) double(Belgium Denmark Germany France Italy Netherlands Austria UK) 11 "KREMSMUENSTER (TAWES) " "AT" 20190101 46 2479.15 384 12995 6442 10598 5613.034 1310 9967 11 "KREMSMUENSTER (TAWES) " "AT" 20190102 24 2479.15 384 12995 6442 10598 5613.034 1310 9967 11 "KREMSMUENSTER (TAWES) " "AT" 20190103 -13 2479.15 384 12995 6442 10598 5613.034 1310 9967

Second, I calculated the correlation using the following formula:

Code:

by staid, sort : correlate tg Austria if cn =="AT"

This formula can be repeated for each of the eight countries, after which Stata reports the correlation coefficients for each of the relevant weather stations. I have included an example picture of the output below:

Remaining problem:
The only problem remaining is selecting for each country the weather station with the highest correlation. My idea would be to save the correlation coefficient for each staid as part of a new variable "correlation", so that afterwards I can find the maximum.

So far, I have tried:

Code:

gen corr = . foreach i in `staid' { corr tg Austria if staid ==`i' & cn =="AT" replace corr = r(rho) if state ==`i' }

However, the part from foreach onwards seems to be doing nothing in Stata. Am I doing something wrong here? Or even: Is there a more simple approach for this problem?

Thank you in advance!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29966
#5

03 Nov 2022, 14:12

Well, although there is probably a way to complete your task from where you have gotten so far, I think it will be complicated and difficult. In my view, you have not made progress, you have retrogressed. You started with two data sets that had problems and incompatibilities and you have now created a new joined data set that has even bigger problems and is really quite unsuited to analysis in Stata.

Here are the problems and incompatibilities of the original data sets:
Country is a string variable in both data sets, but it is long names in one and 2-letter codes in another. This makes pairing them up very difficult. The approach you have taken so far depends on your personal knowledge of the correspondence between those, and you are basically writing that into the code. This is cumbersome, at best.

The date variable in the first data set is not a date variable at all. It is a number that, to human eyes, looks like a date. But it is not suitable for date calculations. The date in the second data set is a proper Stata date variable. But it won't match properly with the one in the first. So the date variable in the first data set needs to be transformed to a real Stata date variable.

The second data set is in wide layout, which is only occasionally useful for anything in Stata. To make it useful, and also make it suitable for combining with the first, it needs to be made long.

The code below does the necessary surgery on both data sets and combines them. The resulting data set is one that I suggest you save, as not only the correlations but anything else you want to do with this data would likely best start from there.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int staid str48 staname str4 cn long(date tg) 11 "KREMSMUENSTER (TAWES) " "AT" 20190101 4 11 "KREMSMUENSTER (TAWES) " "AT" 20190102 -19 11 "KREMSMUENSTER (TAWES) " "AT" 20190103 3 11 "KREMSMUENSTER (TAWES) " "AT" 20190104 30 11 "KREMSMUENSTER (TAWES) " "AT" 20190105 41 11 "KREMSMUENSTER (TAWES) " "AT" 20190106 23 11 "KREMSMUENSTER (TAWES) " "AT" 20190107 17 11 "KREMSMUENSTER (TAWES) " "AT" 20190108 25 11 "KREMSMUENSTER (TAWES) " "AT" 20190109 14 11 "KREMSMUENSTER (TAWES) " "AT" 20190110 52 12 "GRAZ-UNIVERSITAET " "AT" 20190101 41 12 "GRAZ-UNIVERSITAET " "AT" 20190102 18 12 "GRAZ-UNIVERSITAET " "AT" 20190103 38 12 "GRAZ-UNIVERSITAET " "AT" 20190104 34 12 "GRAZ-UNIVERSITAET " "AT" 20190105 68 12 "GRAZ-UNIVERSITAET " "AT" 20190106 75 12 "GRAZ-UNIVERSITAET " "AT" 20190107 41 12 "GRAZ-UNIVERSITAET " "AT" 20190108 27 12 "GRAZ-UNIVERSITAET " "AT" 20190109 31 12 "GRAZ-UNIVERSITAET " "AT" 20190110 50 end tempfile dataset1 save `dataset1' * Example generated by -dataex-. For more info, type help dataex clear input int Date double(Belgium Denmark Germany France Italy Netherlands Austria UK) 21550 2479.15 384 12995 6442 10598 5613.034 1310 9967 21551 2479.15 384 12995 6442 10598 5613.034 1310 9967 21552 2479.15 384 12995 6442 10598 5613.034 1310 9967 21553 2479.15 384 12995 6442 10598 5613.034 1310 9967 21554 2479.15 384 12995 6442 10598 5613.034 1310 9967 21555 2479.15 384 12995 6442 10598 5613.034 1310 9967 21556 2479.15 384 12995 6442 10598 5613.034 1310 9967 21557 2479.15 384 12995 6442 10598 5613.034 1310 9967 21558 2479.15 384 12995 6442 10598 5613.034 1310 9967 21559 2479.15 384 12995 6442 10598 5613.034 1310 9967 end format %tdnn/dd/CCYY Date tempfile dataset2 save `dataset2' // MAKE THE DATA SETS SUITABLE FOR USE AND COMBINING use `dataset1', clear tostring date, replace format(%8.0f) gen Date = daily(date, "YMD") assert missing(date) == missing(Date) drop date format Date %tdnn/dd/CCYY tempfile weather_stations save `weather_stations' use `dataset2', clear rename (Belgium-UK) gas_consumption= reshape long gas_consumption, i(Date) j(country) string kountry country, from(other) stuck rename _ISO3N_ n_country kountry n_country, from(iso3n) to(iso2c) drop country n_country rename _ISO2C_ cn tempfile gas_consumption save `gas_consumption' use `weather_stations', clear merge m:1 Date cn using `gas_consumption'

Note: -kountry- is written by Rafal Raciborski and is available from SSC.

Notice, by the way, that these data sets were combinable with -merge m:1-, not resorting to -joinby-. Consequently, this resulting data set does not contain large numbers of extraneous observations that pair observations that have different dates and countries. It only pairs gas consumption on a given date in a given country to other observations of the same country on that date.

Now, before we proceed to the correlations, I want to point out another problem in the gas consumption data. In the example you have shown, it appears that although there are daily entries, the numbers do not change from one date to the next. Consequently the correlations (at least using this example data) with tg (or anything else in the universe) are all going to be zero (or, as some programs will do it, missing value). Hopefully your real data set does not have this problem. The fact that you seem to have somehow gotten some non-zero non-missing correlations in your own work suggests that the real data set is better in this regard.

Turning now to the correlations, here's how I would approach it:

Code:

// CORRELATIONS OF GAS CONSUMPTION & OBSERVED TEMPERATURE BY STATION rangestat (corr) gas_consumption tg, by(cn staid) interval(Date . .) // FOR EACH COUNTRY PICK THE STATION WITH "HIGHEST" CORRELATION gen byte nm_corr = !missing(corr_x) by cn (nm_corr corr_x), sort: gen highest_corr_station = staid[_N] by cn (nm_corr corr_x): gen highest_correlation = corr_x[_N]

-rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It, too, is available from SSC.

I have put highest in scarequotes because you have not stated how you want to deal with the possibility that two or more stations in a country will tie for the highest correlation. The above code breaks the tie randomly, and the results will not be reproducible for these ties from one run of the code to the next. Once you decide how you would break those ties, it is usually quite simple to modify the code to implement your preferred selection.

Last edited by Clyde Schechter; 03 Nov 2022, 14:15.
Comment
Stefan Tijink

Join Date: Nov 2022

Posts: 7
#6

04 Nov 2022, 06:37

I applied your suggestions and it worked perfectly. I only had to make a small amendment as the correlations were negative, so I had to find the largest absolute value correlation. Fortunately, that was only a slight adjustment that was easily done.

Thank you so much for your assistance! I will definitely look further into the kountry and rangestat packages.
Comment

Announcement