Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to calculate correlation per ID

    Good day,

    I am trying to calculate the correlation between temperature and national gas consumption for several weather stations. This would be an easy task if my data was structured with variables being date, national gas consumption, weather measured at station 1, weather measured at station 2, and so forth. My data set with gas consumption is simply structured as data and gas consumption as the variables, with the inputs being clear. However, my weather data is structured as follows, which continues for all the weather stations: (Note that gas consumption is not yet included here)
    Click image for larger version

Name:	Knipsel.JPG
Views:	1
Size:	79.5 KB
ID:	1687752





    As a consequence, my different stations are not specified as variables, but rather as inputs of the variable "staid" or "staname".

    I am quite stuck with how to merge these data sets and calculate the correlation for each weather station. Merging is difficult because in my weather data set every date occurs multiple times (for every station) rather than just once. I have looked into reshape to get the staid inputs as separate variables but I could not work this out due to a lack of Stata skills.

    Is there anyone that could help me with this?

    Many thanks!


    Last edited by Stefan Tijink; 02 Nov 2022, 11:26.

  • #2
    I found #1 too confusing to provide an answer. Descriptions of data sets, no matter how elaborate, are usually not sufficient to enable others to provide specific advice. Please post back showing example data from both data sets. Accompany that by a description of what the salient variables in each data set mean. That will make it more likely that somebody can figure out what needs to be done.

    And before posting back, please read the Forum FAQ for excellent advice about how to post in ways that maximize the chances that someone will be able to provide a timely and helpful response. Among the things you will learn there is that screenshots of data are not helpful. The useful way to show example data is with the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Comment


    • #3
      Thank you for the reply and apologies if it was unclear.


      Short description and goal:
      I have data on weather stations in 8 different countries, with for each weather station the temperature it has measured on a given date. In addition, I have a second dataset with monthly gas consumption for each of these eight countries. My goal is to find for each country the weather station with the highest correlation between its measured temperature and the gas consumption of the country. In order to find this, I have to calculate for each weather station the correlation with the gas consumption of the country it is located in and then select for each country the weather station with the highest correlation.


      Dataset 1: Weather stations
      This data set contains the variables staid (weather station id), staname (weather station name), cn (country code weather station is located in, e.g. AT in example below refers to Austria, total 8 different countries), date (date at which temperature is measured for given weather station) tg (temperature measured at given weather station at given date).


      Example 1: dataex if staid == 11, count(10)
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int staid str48 staname str4 cn long(date tg)
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190101   4
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190102 -19
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190103   3
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190104  30
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190105  41
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190106  23
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190107  17
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190108  25
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190109  14
      11 "KREMSMUENSTER (TAWES)                       " "AT" 20190110  52
      end
      Example 2: dataex if staid == 12, count(10)
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int staid str48 staname str4 cn long(date tg)
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190101 41
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190102 18
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190103 38
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190104 34
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190105 68
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190106 75
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190107 41
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190108 27
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190109 31
      12 "GRAZ-UNIVERSITAET                           " "AT" 20190110 50
      end

      Dataset 2: Gas consumption
      This data set contains data on monthly gas consumption for 8 different countries. This monthly data has been transformed to daily data, with each date taking on its monthly value. For example, if Jan 2019 has a value of 8000, all days between 1 Jan 2019 and 31 Jan 2019 have the value 8000. So in the example below, this means that Belgium had a gas consumption of 2479.15 throughout January 2019. The data set starts at January 2019 and runs until August 2022.

      Example:
      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int Date double(Belgium Denmark Germany France Italy Netherlands Austria UK)
      21550 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21551 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21552 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21553 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21554 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21555 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21556 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21557 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21558 2479.15 384 12995 6442 10598 5613.034 1310 9967
      21559 2479.15 384 12995 6442 10598 5613.034 1310 9967
      end
      format %tdnn/dd/CCYY Date







      Comment


      • #4

        I have managed to make some progress.

        First I amended the files such that there data variables were similar, and joined them together using
        Code:
        joinby using "....\Gas consumption.dta"
        After which, the data has the following form:
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input int staid str48 staname str4 cn long(date tg) double(Belgium Denmark Germany France Italy Netherlands Austria UK)
        11 "KREMSMUENSTER (TAWES)                       " "AT" 20190101  46 2479.15 384 12995 6442 10598 5613.034 1310 9967
        11 "KREMSMUENSTER (TAWES)                       " "AT" 20190102  24 2479.15 384 12995 6442 10598 5613.034 1310 9967
        11 "KREMSMUENSTER (TAWES)                       " "AT" 20190103 -13 2479.15 384 12995 6442 10598 5613.034 1310 9967
        Second, I calculated the correlation using the following formula:
        Code:
        by staid, sort : correlate tg Austria if cn  =="AT"
        This formula can be repeated for each of the eight countries, after which Stata reports the correlation coefficients for each of the relevant weather stations. I have included an example picture of the output below:

        Click image for larger version

Name:	Knipsel.JPG
Views:	1
Size:	25.7 KB
ID:	1687910

        Remaining problem:
        The only problem remaining is selecting for each country the weather station with the highest correlation. My idea would be to save the correlation coefficient for each staid as part of a new variable "correlation", so that afterwards I can find the maximum.

        So far, I have tried:
        Code:
        gen corr = .
        foreach i in `staid' {
        corr tg Austria if staid ==`i' & cn =="AT"
        replace corr = r(rho) if state ==`i'
        }
        However, the part from foreach onwards seems to be doing nothing in Stata. Am I doing something wrong here? Or even: Is there a more simple approach for this problem?

        Thank you in advance!


        Comment


        • #5
          Well, although there is probably a way to complete your task from where you have gotten so far, I think it will be complicated and difficult. In my view, you have not made progress, you have retrogressed. You started with two data sets that had problems and incompatibilities and you have now created a new joined data set that has even bigger problems and is really quite unsuited to analysis in Stata.

          Here are the problems and incompatibilities of the original data sets:
          1. Country is a string variable in both data sets, but it is long names in one and 2-letter codes in another. This makes pairing them up very difficult. The approach you have taken so far depends on your personal knowledge of the correspondence between those, and you are basically writing that into the code. This is cumbersome, at best.
          2. The date variable in the first data set is not a date variable at all. It is a number that, to human eyes, looks like a date. But it is not suitable for date calculations. The date in the second data set is a proper Stata date variable. But it won't match properly with the one in the first. So the date variable in the first data set needs to be transformed to a real Stata date variable.
          3. The second data set is in wide layout, which is only occasionally useful for anything in Stata. To make it useful, and also make it suitable for combining with the first, it needs to be made long.
          The code below does the necessary surgery on both data sets and combines them. The resulting data set is one that I suggest you save, as not only the correlations but anything else you want to do with this data would likely best start from there.
          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input int staid str48 staname str4 cn long(date tg)
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190101   4
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190102 -19
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190103   3
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190104  30
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190105  41
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190106  23
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190107  17
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190108  25
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190109  14
          11 "KREMSMUENSTER (TAWES)                       " "AT" 20190110  52
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190101 41
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190102 18
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190103 38
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190104 34
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190105 68
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190106 75
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190107 41
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190108 27
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190109 31
          12 "GRAZ-UNIVERSITAET                           " "AT" 20190110 50
          end
          tempfile dataset1
          save `dataset1'
          
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input int Date double(Belgium Denmark Germany France Italy Netherlands Austria UK)
          21550 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21551 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21552 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21553 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21554 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21555 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21556 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21557 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21558 2479.15 384 12995 6442 10598 5613.034 1310 9967
          21559 2479.15 384 12995 6442 10598 5613.034 1310 9967
          end
          format %tdnn/dd/CCYY Date
          tempfile dataset2
          save `dataset2'
          
          //  MAKE THE DATA SETS SUITABLE FOR USE AND COMBINING
          use `dataset1', clear
          tostring date, replace format(%8.0f)
          gen Date = daily(date, "YMD")
          assert missing(date) == missing(Date)
          drop date
          format Date %tdnn/dd/CCYY
          tempfile weather_stations
          save `weather_stations'
          
          use `dataset2', clear
          rename (Belgium-UK) gas_consumption=
          reshape long gas_consumption, i(Date) j(country) string
          kountry country, from(other) stuck
          rename _ISO3N_ n_country
          kountry n_country, from(iso3n) to(iso2c)
          drop country n_country
          rename _ISO2C_ cn
          tempfile gas_consumption
          save `gas_consumption'
          
          use `weather_stations', clear
          merge m:1 Date cn using `gas_consumption'
          Note: -kountry- is written by Rafal Raciborski and is available from SSC.

          Notice, by the way, that these data sets were combinable with -merge m:1-, not resorting to -joinby-. Consequently, this resulting data set does not contain large numbers of extraneous observations that pair observations that have different dates and countries. It only pairs gas consumption on a given date in a given country to other observations of the same country on that date.

          Now, before we proceed to the correlations, I want to point out another problem in the gas consumption data. In the example you have shown, it appears that although there are daily entries, the numbers do not change from one date to the next. Consequently the correlations (at least using this example data) with tg (or anything else in the universe) are all going to be zero (or, as some programs will do it, missing value). Hopefully your real data set does not have this problem. The fact that you seem to have somehow gotten some non-zero non-missing correlations in your own work suggests that the real data set is better in this regard.

          Turning now to the correlations, here's how I would approach it:

          Code:
          // CORRELATIONS OF GAS CONSUMPTION & OBSERVED TEMPERATURE BY STATION
          rangestat (corr) gas_consumption tg, by(cn staid) interval(Date . .)
          
          //  FOR EACH COUNTRY PICK THE STATION WITH "HIGHEST" CORRELATION
          gen byte nm_corr = !missing(corr_x)
          by cn (nm_corr corr_x), sort: gen highest_corr_station = staid[_N]
          by cn (nm_corr corr_x): gen highest_correlation = corr_x[_N]
          -rangestat- is written by Robert Picard, Nick Cox, and Roberto Ferrer. It, too, is available from SSC.

          I have put highest in scarequotes because you have not stated how you want to deal with the possibility that two or more stations in a country will tie for the highest correlation. The above code breaks the tie randomly, and the results will not be reproducible for these ties from one run of the code to the next. Once you decide how you would break those ties, it is usually quite simple to modify the code to implement your preferred selection.
          Last edited by Clyde Schechter; 03 Nov 2022, 14:15.

          Comment


          • #6
            I applied your suggestions and it worked perfectly. I only had to make a small amendment as the correlations were negative, so I had to find the largest absolute value correlation. Fortunately, that was only a slight adjustment that was easily done.

            Thank you so much for your assistance! I will definitely look further into the kountry and rangestat packages.

            Comment

            Working...
            X