
  • Cumulative curve to show the percentage of searchers that can afford only a given percentage of available objects

    Hi everyone! Could someone maybe help me out?

    I think I have a relatively easy question; at least I hope so! As usual, I find it hard to explain exactly what I am looking for, but I hope it works out.

    Let's say I have, for each person searching, the maximum rental price they are willing to pay, and I also have the actual prices of all rental homes. I want to calculate for how many people only the cheapest 10% of houses are affordable, then for how many people only the cheapest 20% are affordable, and so forth. Preferably I would have a value for every percent of houses, so that I can show the cleanest and smoothest possible line in a graph. So 50% on the x-axis should correspond to the median price, not the average price. I then want to graph this cumulative line: for each percentage of the total objects (ordered by price), the percentage of searchers who cannot afford the remaining, more expensive objects.

    I hope the following example of a dataset helps:


    Code:
    clear
    input float(searcherid objectprice maxprice objectid)
     1  3 10  1
     2  5 12  2
     3  8 23  3
     4  1 21  4
     5 15 18  5
     6 25 17  6
     7 21  8  7
     8 13 30  8
     9 19 21  9
    10 30 19 10
    11 32 23 11
     .  . 15 12
     .  . 16 13
     .  . 19 14
     .  . 30 15
     .  . 31 16
    end
    Thank you in advance!

  • #2
    Your data example confuses me. I would assume from the variable names that maxprice is the maximum price that a given searcher is willing to pay, and the objectprice would be the actual selling price of the object. So I would expect that every searcherid would have a maxprice associated and every objectid an objectprice associated with it. But what I see in the data is that every searcherid has an object price and every objectid has a maxprice. So I really don't know what to make of that.

    In fact, having these data side-by-side in a single data set is really rather odd. It is, in this case, possible to work with it this way, but separate searcher and price data sets would be a more natural way to structure this data.
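    For what it's worth, a minimal sketch of that alternative structure (assuming the example data below are in memory; the file names searchers.dta and objects.dta are just placeholders) might look like this:

    Code:
    * sketch only: split the combined data into separate searcher and object files
    preserve
    keep searcherid maxprice
    drop if missing(searcherid)
    save searchers, replace
    restore
    keep objectid objectprice
    drop if missing(objectid)
    save objects, replace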

    Anyway, I'm going to work with the data as it is and I'm going to stick with my assumptions in the code below. That is, maxprice in a given observation is the maximum price that the searcherid in that observation is willing to pay. And that objectprice is the actual price one must pay to purchase the objectid in the same observation.

    Code:
    clear
    input float(searcherid objectprice maxprice objectid)
     1  3 10  1
     2  5 12  2
     3  8 23  3
     4  1 21  4
     5 15 18  5
     6 25 17  6
     7 21  8  7
     8 13 30  8
     9 19 21  9
    10 30 19 10
    11 32 23 11
     .  . 15 12
     .  . 16 13
     .  . 19 14
     .  . 30 15
     .  . 31 16
    end
    
    //    COUNT NUMBER OF DISTINCT OBJECTS
    isid objectid, sort
    count
    local n_objects `r(N)'
    
    //    FOR EACH SEARCHER CALCULATE HOW MANY OBJECTS ARE AFFORDABLE
    rangestat (count) n_affordable = objectid, interval(objectprice . maxprice)
    gen pct_affordable = 100*n_affordable/`n_objects'
    quantile pct_affordable
    To use this code you need the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, available from SSC.
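    If -rangestat- is not already installed, it can be installed from within Stata with:

    Code:
    ssc install rangestat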

    -quantile- is an official Stata command that will make the graph you want. You can customize the appearance of the plot considerably by taking advantage of some of its options. -help quantile-.
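    For example, an untested sketch of some options you might use (the titles here are just placeholders; -quantile- accepts marker options, rlopts(), and the usual twoway title options):

    Code:
    * hypothetical customization of the quantile plot
    quantile pct_affordable, ytitle("Percent of objects affordable") ///
        xtitle("Fraction of the data") rlopts(lpattern(dash)) msymbol(oh)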



    • #3
      Thank you so much for the prompt response! I had indeed accidentally switched the names of searcherid and objectid, which must have made my question harder to understand. However, the result for pct_affordable is not what I intended. The output of your code, after correcting the switched names, is shown below.

      Code:
      clear
      input float(object objectprice maxprice searcherid) double n_affordable float pct_affordable
       1  3 10  1  4    25
       2  5 12  2  4    25
       3  8 23  3  8    50
       4  1 21  4  8    50
       5 15 18  5  6  37.5
       6 25 17  6  6  37.5
       7 21  8  7  4    25
       8 13 30  8 10  62.5
       9 19 21  9  8    50
      10 30 19 10  7 43.75
      11 32 23 11  8    50
       .  . 15 12  .     .
       .  . 16 13  .     .
       .  . 19 14  .     .
       .  . 30 15  .     .
       .  . 31 16  .     .
      end
      The value of pct_affordable for the objectprice of 32 should actually be 100, since none of the searchers can afford this price. The value for the objectprice of 30 should actually be 87.5, since 14/16 of the searchers cannot afford it at this price. Objectprice 25, which is the median, should have a value of 81.25, since 13/16 of the searchers cannot afford this price. This would show a really skewed cumulative distribution. Is there a way to get the result I am describing here?

      Thank you in advance.



      • #4
        I see, I gave you the percent of objects each purchaser can afford, but you want the percent of purchasers for whom the object is unaffordable. Sorry for the misunderstanding. So that's a minor modification of the code:

        Code:
        // COUNT NUMBER OF DISTINCT SEARCHERS
        by searcherid, sort: assert _N == 1 if !missing(searcherid)
        count if !missing(searcherid)
        local n_searchers `r(N)'

        // FOR EACH OBJECT CALCULATE HOW MANY SEARCHERS CANNOT AFFORD IT
        rangestat (count) n_able_to_afford = searcherid, interval(maxprice objectprice .)
        replace n_able_to_afford = 0 if missing(n_able_to_afford)
        gen pct_unable_to_afford = 100*(`n_searchers' - n_able_to_afford)/`n_searchers'
        quantile pct_unable_to_afford
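        As a quick sanity check against the expectations in #3 (assuming the corrected example data are in memory when you run this), you can list the most expensive object directly:

        Code:
        * objectprice 32 exceeds every maxprice in the example, so it should show 100
        list objectprice pct_unable_to_afford if objectprice == 32, noobs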



        • #5
          Okay, again thank you so much! It works perfectly. I now realize I have one additional question. Is it also possible to make the calculations per year, per half-year, or per month? If so, can I then also show the multiple curves in one plot to demonstrate the changes per time period?

          The code below shows how I categorized my dates. There are 2 years, 2017 and 2018; there are 4 months per year in this example, and each month value is preceded by the year, leading to values such as 20173 for month 3 of 2017. Then assume each year consists of 2 halves of 2 months each. Given these data, I would like to calculate the percentages from above per month, per half-year, or per year. If this is possible, I could show whether and how the curve has become more skewed over time.

          Code:
          clear
          input float(object objectprice) str5(objectmonthid objecthalfid) str4 objectyearid float(maxprice searcherid) str6 searchermonthid str5 searcherhalfid str4 searcheryearid double n_affordable float pct_affordable
           1  3 "20171" "20171" "2017" 10  1 "20171" "20171" "2017"  4    25
           2  5 "20172" "20172" "2017" 12  2 "20174" "20172" "2017"  4    25
           3  8 "20171" "20171" "2017" 23  3 "20171" "20171" "2017"  8    50
           4  1 "20171" "20171" "2017" 21  4 "20172" "20171" "2017"  8    50
           5 15 "20172" "20172" "2017" 18  5 "20173" "20172" "2017"  6  37.5
           6 25 "20181" "20181" "2018" 17  6 "20181" "20181" "2018"  6  37.5
           7 21 "20181" "20181" "2018"  8  7 "20181" "20181" "2018"  4    25
           8 13 "20182" "20182" "2018" 30  8 "20183" "20182" "2018" 10  62.5
           9 19 "20181" "20181" "2018" 21  9 "20182" "20181" "2018"  8    50
          10 30 "20172" "20172" "2017" 19 10 "20173" "20172" "2017"  7 43.75
          11 32 "20182" "20182" "2018" 23 11 "20184" "20182" "2018"  8    50
           .  . "."     "."     "."    15 12 "20182" "20181" "2018"  .     .
           .  . "."     "."     "."    16 13 "20183" "20182" "2018"  .     .
           .  . "."     "."     "."    19 14 "20172" "20171" "2017"  .     .
           .  . "."     "."     "."    30 15 "20174" "20172" "2017"  .     .
           .  . "."     "."     "."    31 16 "20174" "20172" "2017"  .     .
          end
          (I hope I am not asking too much, but I know that right now I would come up with a hardcoded solution, which would be extremely inefficient.)



          • #6
            I don't quite understand how this is supposed to go. Each searcher has only a single date associated with it, and that is also true of objects. So, do you want to count a searcher as able/unable to afford an object only if the date of the object and the searcher are equal? In your real data, do you have multiple observations per object and searcher reflecting changes in price and willingness to pay over time? If so, a new data example with that would help.

            While awaiting your response, I'll give you some unsolicited advice about your date variables here. First, they are badly named, because they are not IDs of anything; they are just dates. At some point this can trip you up because, on its face, there is no reason a number like 20174 couldn't be an id of something. And, in fact, it appears you have already been tripped up by your own approach: for example, in observation 2 you have objectmonthid = "20172", which I interpret as February 2017, but the corresponding objecthalfid is also "20172", whereas February is in the first half of 2017. There are actually many such discrepancies in the data. Perhaps I'm just misunderstanding the meanings of these variables, but if I have them right, then these are errors.
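            For what it's worth, a quick check with official date functions confirms which half of the year February 2017 falls in:

            Code:
            * February 2017 is month 685 in Stata's monthly encoding and lies in 2017h1
            display %tm monthly("2017m2", "YM")
            display %th hofd(dofm(monthly("2017m2", "YM")))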

            Second, while I think you can get away with the way you have set up the months and half-years for this limited purpose, in general this way of representing dates is not very useful in Stata. You should be using real Stata internal format date variables, to ensure that you can sort things in the correct order and calculate with them. Here's how you can do this:

            Code:
            clear
            input float(object objectprice) str5(objectmonthid objecthalfid) str4 objectyearid float(maxprice searcherid) str6 searchermonthid str5 searcherhalfid str4 searcheryearid double n_affordable float pct_affordable
             1  3 "20171" "20171" "2017" 10  1 "20171" "20171" "2017"  4    25
             2  5 "20172" "20172" "2017" 12  2 "20174" "20172" "2017"  4    25
             3  8 "20171" "20171" "2017" 23  3 "20171" "20171" "2017"  8    50
             4  1 "20171" "20171" "2017" 21  4 "20172" "20171" "2017"  8    50
             5 15 "20172" "20172" "2017" 18  5 "20173" "20172" "2017"  6  37.5
             6 25 "20181" "20181" "2018" 17  6 "20181" "20181" "2018"  6  37.5
             7 21 "20181" "20181" "2018"  8  7 "20181" "20181" "2018"  4    25
             8 13 "20182" "20182" "2018" 30  8 "20183" "20182" "2018" 10  62.5
             9 19 "20181" "20181" "2018" 21  9 "20182" "20181" "2018"  8    50
            10 30 "20172" "20172" "2017" 19 10 "20173" "20172" "2017"  7 43.75
            11 32 "20182" "20182" "2018" 23 11 "20184" "20182" "2018"  8    50
             .  . "."     "."     "."    15 12 "20182" "20181" "2018"  .     .
             .  . "."     "."     "."    16 13 "20183" "20182" "2018"  .     .
             .  . "."     "."     "."    19 14 "20172" "20171" "2017"  .     .
             .  . "."     "."     "."    30 15 "20174" "20172" "2017"  .     .
             .  . "."     "."     "."    31 16 "20174" "20172" "2017"  .     .
            end
            
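             //    CONVERT THE STRING "YYYYM" CODES TO STATA MONTHLY, HALF-YEARLY, AND YEARLY DATE VARIABLES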
            foreach v of varlist *monthid {
                local newname: subinstr local v "monthid" "month"
                gen int `newname' = monthly(substr(`v', 1, 4) + "m" + substr(`v', 5, .), "YM")
                format `newname' %tm
                local newname2: subinstr local newname "month" "half"
                gen int `newname2' = hofd(dofm(`newname'))
                format `newname2' %th
                local newname3: subinstr local newname "month" "year"
                gen int `newname3' = year(dofm(`newname'))
            }
            drop *monthid *halfid *yearid
            I'll try to respond to your question if you post back with a new data example and an answer to my question.




            • #7
              First of all, you are absolutely correct that the date variables are wrong. I used some pseudo-code/pseudo-dates in the hope of making the example easier to understand, but of course that is detrimental to correct sorting and calculation. I also understand that they are not IDs; in this example they are definitely not unique.

              With regard to your question whether I have multiple observations per object and searcher: I do not. I indeed want to count a searcher as able/unable to afford an object only if the dates of the object and the searcher are equal. I have a clear reason for this: I want to see whether the new search prices correspond with the object prices within the same time period, that is, where the dates are equal. More specifically, I want to see whether the new search prices in a month correspond with the object prices in that month, or whether the new search prices in a year correspond with the object prices in that year. I do not keep track of changes in search prices. Using only months and years, I would want my data to look like the example below (it is still pretty much the same example, but I hope that the clarifications that dates should be equal and that there is one observation per object and per searcher make it clear again):

              Code:
              clear
              input float(object objectprice maxprice searcherid) int(objectmonth objectyear searchermonth searcheryear) float(pct_unaffordable_monthly pct_unaffordable_yearly)
               1  9 10  1 684 2017 684 2017     0     0
               4  1 21  4 684 2017 685 2017     0     0
               3 13 23  3 684 2017 684 2017    50 22.22
               2  5 12  2 685 2017 687 2017     0     0
               5 15 18  5 686 2017 686 2017   100 22.22
              10 30 19 10 687 2017 686 2017 33.33 77.77
               9 19 21  9 696 2018 697 2018   100 57.14
               6 25 17  6 696 2018 696 2018   100 85.71
               8 13 30  8 697 2018 698 2018    50 14.28
               7 21  8  7 698 2018 696 2018    50 57.14
              11 32 23 11 699 2018 699 2018   100   100
               .  . 31 16   .    . 687 2017     .     .
               .  . 30 15   .    . 687 2017     .     .
               .  . 19 14   .    . 685 2017     .     .
               .  . 15 12   .    . 697 2018     .     .
               .  . 16 13   .    . 698 2018     .     .
              end
              format %tm objectmonth
              format %tm searchermonth
              There are 3 objects in 2017m1, with objectprices of 1, 9 and 13. The max prices of the searchers in 2017m1 are 10 and 23. Thus pct_unaffordable_monthly is 0 for objectprice 1, 0 for 9, and 50 for 13. For 2017 the objectprices are 1, 5, 9, 13, 15 and 30, and the searcher max prices are 10, 12, 18, 19, 19, 21, 23, 30, 31. This leads to the yearly percentages shown above.

              If this is a really bad design of the variables and data that makes the problem difficult to solve, I would of course also be okay with hearing that.



              • #8
                So, the data organization needs to change, and the code changes with it. But the reorganization of the data is rather simple and is included in the code below.

                Code:
                clear
                input float(object objectprice maxprice searcherid) int(objectmonth objectyear searchermonth searcheryear) float(pct_unaffordable_monthly pct_unaffordable_yearly)
                 1  9 10  1 684 2017 684 2017     0     0
                 4  1 21  4 684 2017 685 2017     0     0
                 3 13 23  3 684 2017 684 2017    50 22.22
                 2  5 12  2 685 2017 687 2017     0     0
                 5 15 18  5 686 2017 686 2017   100 22.22
                10 30 19 10 687 2017 686 2017 33.33 77.77
                 9 19 21  9 696 2018 697 2018   100 57.14
                 6 25 17  6 696 2018 696 2018   100 85.71
                 8 13 30  8 697 2018 698 2018    50 14.28
                 7 21  8  7 698 2018 696 2018    50 57.14
                11 32 23 11 699 2018 699 2018   100   100
                 .  . 31 16   .    . 687 2017     .     .
                 .  . 30 15   .    . 687 2017     .     .
                 .  . 19 14   .    . 685 2017     .     .
                 .  . 15 12   .    . 697 2018     .     .
                 .  . 16 13   .    . 698 2018     .     .
                end
                format %tm objectmonth
                format %tm searchermonth
                
                //    REORGANIZE DATA INTO TIME SEGMENTS
                preserve
                keep object*
                drop if missing(objectprice)
                rename objectmonth month
                rename objectyear year
                assert year == year(dofm(month))
                tempfile objects
                save `objects'
                restore
                keep searcher* maxprice
                drop if missing(maxprice)
                rename searchermonth month
                rename searcheryear year
                assert year == year(dofm(month))
                joinby month using `objects'
                sort month
                isid searcherid object
                
                //    IDENTIFY UNAFFORDABLE MATCHUPS
                gen byte unaffordable = objectprice > maxprice
                //    COUNT SEARCHERS WHO CAN'T AFFORD EACH OBJECT & TOTAL SEARCHERS BY MONTH
                preserve
                collapse (count) n_searchers = searcherid (sum) unaffordable, by(object month)
                gen pct_unaffordable = 100*unaffordable/n_searchers
                levelsof month, local(months)
                local graphs 
                foreach m of local months {
                     quantile pct_unaffordable if month == `m', title(`: display %tmMon_YY `m'') ///
                        name(m`m', replace)
                    local graphs `graphs' m`m' 
                }
                graph combine `graphs', xcommon ycommon name(combined_monthly, replace)
                
                //    NOW DO IT BY YEARS
                restore
                collapse (count) n_searchers = searcherid (sum) unaffordable, by(object year)
                gen pct_unaffordable = 100*unaffordable/n_searchers
                levelsof year, local(years)
                local graphs 
                foreach m of local years {
                    quantile pct_unaffordable if year == `m', title(`m') ///
                        name(y`m', replace)
                    local graphs `graphs' y`m' 
                }
                graph combine `graphs', xcommon ycommon name(combined_yearly, replace)
                Now, this creates the graphs. The data underlying the monthly graphs are lost: they get clobbered by the yearly data. If you want to keep them, you can save both the monthly and yearly data sets in temporary or permanent data files by adding -save- commands in the appropriate places.
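                For example, a sketch of where those -save- commands could go (the file names are just placeholders):

                Code:
                * sketch only; file names are placeholders
                * right after the monthly -gen pct_unaffordable-, inside the preserve/restore block:
                save monthly_pct_unaffordable, replace
                * right after the yearly -gen pct_unaffordable-:
                save yearly_pct_unaffordable, replace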

                As applied to the example data, the monthly graphs are mostly just single points, but you will get something better with your real data. The yearly graphs have enough data points that you can get a feel for what they are really like, even just from the example.
