Tracking changes in population by geography over time

Daniel Sutcliffe

Join Date: Sep 2020

Posts: 77
#1

Tracking changes in population by geography over time

09 Aug 2021, 09:44

Hello,

In my dataset I have a number of small geographical areas (beginning with A in the sample data below) that fit within a larger geographical areas (designated by code 16C or 01K or 01H in the sample data below). There are around 2000 of these smaller geographical areas and around 100 of these larger geographical areas that they fit in to.

The smaller geographies can merge with other smaller geographies and cease to exist , so in the below sample, area A81015 ceased to exist at the end of one year, this designated by a dummy variable, taking a value of 1 where this has occurred ( variable: mis_1805) . This geographical area had a population of 2919 people. You can find which geographical area it merged to, by checking the last column which shows you the change in populations for these small geographical areas from one year to the next. You can see that area A81042 grew by 2956 people, and importantly, it belongs so the same wider geographical area as the one that ceased to exist (16C).

I am looking for a way to automate this process , so that if, for a given larger geographical area (16C, OR 01K OR 01H ) a smaller geographical area ceases to exist (identified by a dummy variable taking the value 1) and another area grows by similar amount, then a variable is created giving both observations, the code of the geographical area that one has merged in to. So in the below, an additional column would be created with code A81042 for both observations ( A81015 & A81042 ).

There is likely to be changes in population for the smaller geographical areas each year, so the amount by which one area grows does not need to be the exactly the same as the population for the area that ceased to exist. I have chosen a threshold of 90%, so in this example 2919 (old area)/2956 (other area) = 98%

The intention is not for this to provide a definitive answer, but automate the process somewhat prior to a manual inspection of the data.

I don't know if this is possible to do within Stata - and perhaps SQL may be a better place to do this.

CODE]
* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 gp_code str5 LastCCG float(mis_1805 mis_1805n mis_1805_diff)
"A81001" "16C" . . 5
"A81002" "16C" . . -44
"A81004" "16C" . . 129
"A81005" "16C" . . -7
"A81006" "16C" . . 17
"A81007" "16C" . . 10
"A81009" "16C" . . 1
"A81011" "16C" . . -9
"A81012" "16C" . . 12
"A81013" "16C" . . 18
"A81014" "16C" . . 5
"A81015" "16C" 1 2919 .
"A81016" "16C" . . -8
"A81017" "16C" . . -100
"A81018" "16C" . . 1
"A81019" "16C" . . -39
"A81020" "16C" . . 22
"A81021" "16C" . . 33
"A81022" "16C" . . 6
"A81023" "16C" . . -10
"A81025" "16C" . . -2
"A81026" "16C" . . 400
"A81027" "16C" . . 33
"A81029" "16C" . . 113
"A81030" "16C" . . 58
"A81031" "16C" . . 9
"A81032" "16C" . . 41
"A81034" "16C" . . 26
"A81035" "16C" . . 66
"A81036" "16C" . . -16
"A81037" "16C" . . 35
"A81038" "16C" . . 53
"A81039" "16C" . . 40
"A81040" "16C" . . -3
"A81041" "16C" . . 16
"A81042" "16C" . . 2956
"A81043" "16C" . . 18
"A81044" "16C" . . -19
"A81045" "16C" . . -15
"A81046" "16C" . . -14
"A81047" "16C" 1 259 .
"A81048" "16C" . . 24
"A81049" "16C" . . 390
"A81051" "16C" . . -59
"A81052" "16C" . . 38
"A81053" "16C" . . 3
"A81054" "16C" . . 4
"A81056" "16C" . . -12
"A81057" "16C" . . 35
"A81058" "16C" . . 47
"A81060" "16C" . . -10
"A81063" "16C" . . 0
"A81064" "16C" . . 86
"A81065" "16C" . . 0
"A81066" "16C" . . 7
"A81067" "16C" . . 8
"A81070" "16C" . . -21
"A81602" "16C" . . -7
"A81608" "16C" . . 84
"A81610" "16C" . . -3
"A81611" "16C" . . 28
"A81612" "16C" . . 21
"A81618" "16C" . . -6
"A81621" "16C" . . 9
"A81622" "16C" . . 13
"A81629" "16C" . . 14
"A81630" "16C" . . 13
"A81631" "16C" . . -8
"A81632" "16C" . . -8
"A81633" "16C" . . 8
"A81634" "16C" . . -28
"A82003" "01K" . . -15
"A82004" "01H" . . 4
"A82005" "01K" . . -23
"A82006" "01H" . . 1
"A82007" "01K" . . 11
"A82008" "01K" . . -1
"A82009" "01K" . . 21
"A82010" "01K" . . 0
"A82012" "01H" . . 14
"A82013" "01H" . . 5
"A82014" "01H" . . -4
"A82015" "01H" . . 1
"A82016" "01H" . . 41
"A82018" "01H" . . 8
"A82019" "01H" . . -6
"A82020" "01H" . . 29
"A82021" "01H" . . 20
"A82022" "01H" . . 4
"A82023" "01H" . . 24
"A82024" "01H" . . -2
"A82025" "01K" . . 18
"A82026" "01K" . . 0
"A82027" "01K" . . -20
"A82028" "01H" . . 6
"A82029" "01H" . . -5
"A82030" "01K" . . 2
"A82031" "01H" . . 1
"A82032" "01H" . . 14
"A82033" "01K" . . -10
end
[/CODE]
Tags: None
Maria Boutchkova

Join Date: May 2016

Posts: 103
#2

09 Aug 2021, 10:03

Hi Daniel,
thanks for using dataex.
I am assuming that there might be mutliple observations that satisfy your condition that 2919 (old area)/2956 (other area) is between .98 and 1 and my code will record them in the same variable separated by spaces.
I'd go:

Code:

gen list_merged_codes = "" qui levelsof gp_code if mis_1805 == 1, l(codes) clean foreach c of local codes { qui levelsof mis_1805n if gp_code == "`c'", l(pop_curr) clean di "`pop_curr'" replace list_merged_codes = list_merged_codes + " " + "`c'" if inrange(`pop_curr' / mis_1805_diff,.98,1) }
Comment
Daniel Sutcliffe

Join Date: Sep 2020

Posts: 77
#3

09 Aug 2021, 10:17

Thank you very much! This looks good. Could I check though, when looking at old area to other area, I need to make sure that both are in the same larger geographical area, 16C using the above example, but could take 1 of 100 or so values. Can the code be amended to take that in to account?
Comment

Maria Boutchkova

Join Date: May 2016
Posts: 103

09 Aug 2021, 10:43

Ah yes, I missed this additional condition. Here you go:

Code:

gen list_merged_codes = ""
qui levelsof gp_code if mis_1805 == 1, l(codes) clean
foreach c of local codes {
qui levelsof LastCCG if gp_code == "`c'", l(area) clean
qui levelsof mis_1805n if gp_code == "`c'", l(pop_curr) clean
replace list_merged_codes = list_merged_codes + " " + "`c'" if inrange(`pop_curr' / mis_1805_diff,.98,1) & LastCCG == "`area'"
}

Also note that this kind of loop may become slow for very large datasets. I would use mata or merging if I want things to work faster.

Comment

Daniel Sutcliffe

Join Date: Sep 2020

Posts: 77
#5

10 Aug 2021, 04:18

Thank you so much for this, it works great!. I amended it slightly so it only looks at populations >499, as there is some natural volatility:

qui levelsof gp_code if mis_1805 == 1 & mis_1805n >499 , l(codes) clean.

My dataset is quite small around 7000 observations. I was wondering, 1805 in the above relates to may 2018, and the next set of variables are 1806 relating to June 2018, and so on up until 2107 - July 2021, Is there a way I can loop the code to go through all sets of variables: they all follow the same naming convention so for 1806 they are:
mis_1806 mis_1806n mis_1806_diff If not, I can just copy and paste the code and change the names of the variables.
Comment
Maria Boutchkova

Join Date: May 2016

Posts: 103
#6

10 Aug 2021, 04:29

Of course, this is the power of looping! Perhaps try to do it yourself and share it here, I will help troubleshoot. The hint is that you may want to generate separate vars for each loop over the dates.
list_merged_codes_1805, list_merged_codes_1806, etc.
Comment
Daniel Sutcliffe

Join Date: Sep 2020

Posts: 77
#7

10 Aug 2021, 05:30

I'm afraid this has got the better of me ...anymore hints?
Comment

Maria Boutchkova

Join Date: May 2016
Posts: 103

10 Aug 2021, 05:50

First, we store all dates in a local. You can do this in excel and copy-paste in your do file. My geekiness compels me to use Stata.

Code:

local dates 
forv y = 18/21 {
forv m = 1/9 {
local dates `dates' `y'0`m'
}
forv m = 10/12 {
local dates `dates' `y'`m'
}
}
* to drop 1801 - 1804 and 2108 - 2112
forv i = 1801/1804 {
local dates: subinstr local dates "`i'" ""
}
forv i = 2108/2112 {
local dates: subinstr local dates "`i'" ""
}
* making sure I have saved in the local dates what I want
di "`dates'"

Next we put our original code in a loop over the dates and replace 1805 with the reference to the local macro containing each date:

Code:

foreach d of local dates {
gen list_merged_codes_`d' = ""
qui levelsof gp_code if mis_`d' == 1, l(codes) clean
foreach c of local codes {
qui levelsof LastCCG if gp_code == "`c'", l(area) clean
qui levelsof mis_`d'n if gp_code == "`c'", l(pop_curr) clean
replace list_merged_codes_`d' = list_merged_codes_`d' + " " + "`c'" if inrange(`pop_curr' / mis_`d'_diff,.98,1) & LastCCG == "`area'"
}
}

Comment

Daniel Sutcliffe

Join Date: Sep 2020

Posts: 77
#9

10 Aug 2021, 06:48

Thank you, this works perfectly!
Comment

Announcement