Ghost observations and problem with generalized PSM!

Kanika Dua

Join Date: Feb 2023

Posts: 33
#1

11 Jul 2023, 16:35

Hello all,

I am working on my dissertation, which aims to evaluate the impact of a scheme on the sex ratio at birth in India. I am using generalized propensity score matching together with a dose-response analysis. To conduct the matching, I used the following Stata command:

Code:

gpscore sexratio1991 , t(did) gpscore(pscore) predict(hat_treat) sigma(sd) cutpoints(phase) index(p50) nq_gps(5) t_transf(ln) detail

where sexratio1991 is the variable I want to match on and did is the treatment indicator. My unit of observation is the district (sdist in the dataset).

However, upon running this command, I get the following error:

could not calculate numerical derivatives
discontinuous region with missing values encountered
r(430);

I looked more closely at my dataset, which contains 651 districts (as reported by the count command). However, when I plot sexratio2011 against the district identifier, the axis extends beyond 900 districts. This is confusing, as I do not have more than 651 districts in my dataset. I have attached the graph to show what I mean. I suspect that the error is related to this, but I am not sure.

An extract of my dataset is as follows:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int sdist float sexratio2011
1 596.7742
2 642.8571
3 862.069
4 644.0678
5 1184.2106
6 781.25
7 414.6342
8 954.5455
9 906.9767
10 962.963
11 636.3636
12 909.0909
13 913.0435
14 1392.857
15 711.1111
16 970.5883
17 764.7059
18 977.2727
19 673.0769
20 1418.6046
21 567.56757
22 1343.75
23 1000
24 833.3333
25 1428.5714
26 1035.7142
27 777.7778
28 615.3846
29 909.0909
30 709.6774
31 1045.4546
32 675
33 607.1429
34 750
36 645.1613
37 586.2069
38 633.3333
39 600
40 833.3333
41 1588.2354
42 1333.3334
44 591.83673
45 1062.5
46 710.5263
47 1400
48 514.2857
49 685.7143
50 1419.355
51 956.5217
52 696.9697
53 583.3333
54 709.6774
55 640
56 589.7436
57 727.2727
58 1400
59 758.6207
60 606.0606
61 533.3333
62 900
63 1074.0741
64 566.6667
65 1361.111
66 1586.207
67 1071.4286
68 875
69 484.8485
70 909.0909
71 1147.0588
72 450
73 709.6774
74 916.6667
75 1343.75
76 675
77 1058.8235
78 1185.1852
79 681.8182
80 560.9756
82 852.9412
83 758.6207
84 743.5897
85 571.4286
86 621.6216
87 756.7568
88 542.8571
89 897.9592
99 936.1702
100 1086.9565
101 805.5555
102 898.3051
103 750
104 759.2593
105 918.3674
106 1203.3898
107 833.3333
108 844.4445
109 941.1765
110 647.0588
111 1000
112 814.8148
end

I would be really grateful if someone can help me with this issue.

Many thanks in advance!
Kanika

Andrew Musau

Join Date: Oct 2014

Posts: 10282
#2

11 Jul 2023, 19:18

count gives you the number of observations, not the values they take. For example, if I have 1, 2, 3 and 7, the count is four. If I were to plot these values, they would range from 1 to 7, as the axes in a line graph are continuous. So first run:

Code:

summarize sdist

to see the actual range. More importantly, the district identifier is a categorical variable and therefore a line graph does not make sense. If you want to represent the distribution of the continuous variable, start with a histogram.

Code:

histogram sexratio2011, freq

Here, you will see how many districts are within a certain range of values. Finally, you can create an identifier with consecutive integers starting at one using the -group()- function of egen.

Code:

egen district_id = group(sdist)
summarize district_id
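
As a minimal sketch of the point above, using the toy values 1, 2, 3 and 7 instead of the real data (the variable names are borrowed from this thread; the values are purely illustrative):

```stata
* Toy data: four identifiers with a gap between 3 and 7
clear
input int sdist
1
2
3
7
end

* count reports the number of observations, not the largest value
count

* summarize shows the range the identifier actually takes (min 1, max 7)
summarize sdist

* group() maps the distinct values to consecutive integers starting at 1
egen district_id = group(sdist)
summarize district_id
```

Here count returns 4 even though sdist runs up to 7, which is the same pattern as 651 districts with identifiers running beyond 900.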

Last edited by Andrew Musau; 11 Jul 2023, 19:29.
Kanika Dua

Join Date: Feb 2023

Posts: 33
#3

12 Jul 2023, 03:06

Dear Andrew,

Thank you very very much for this prompt response. I am grateful.

I just wanted to clarify that I plotted the line graph only to highlight the gap in the data. Nevertheless, your comment makes sense.

My issue is not yet resolved, so any leads would be highly appreciated.

Thank you,
Kanika
Andrew Musau

Join Date: Oct 2014

Posts: 10282
#4

12 Jul 2023, 05:39

I don't think that anyone can offer you better advice without seeing more. You need to show the result of

Code:

summarize sdist
Kanika Dua

Join Date: Feb 2023

Posts: 33
#5

12 Jul 2023, 06:19

Hello,

My apologies.

The results of the commands are:

. summarize sdist

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       sdist |        651    385.4608    249.8039          1        931

. summarize district_id

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
 district_id |        651         326    188.0718          1        651

district_id is the consecutive unique identifier for each district, created from sdist in my dataset.

Thank you,
Kanika
Andrew Musau

Join Date: Oct 2014

Posts: 10282
#6

12 Jul 2023, 06:28

Originally posted by Kanika Dua

The results of the commands are:

. summarize sdist

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       sdist |        651    385.4608    249.8039          1        931

So this is consistent with what I said in #2. You have 651 observations, but the values of "sdist" range from 1 to 931. Once you create the variable district_id, it has 651 observations with values ranging from 1 to 651. So use this second variable in place of "sdist" if you want a consecutively numbered identifier. In the end, as this is a categorical variable, it won't matter which variable you use.
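
If it helps to see exactly where the identifier skips values, here is a rough sketch on toy data. The variable gap is a hypothetical helper introduced only for this illustration; it is not part of the original dataset.

```stata
* Toy data: identifiers 1, 2, 3 and 7 (a jump from 3 to 7)
clear
input int sdist
1
2
3
7
end

* Difference between each identifier and the previous one after sorting
sort sdist
generate int gap = sdist - sdist[_n-1] if _n > 1

* List the identifiers that follow a jump of more than 1;
* !missing() is needed because missing values compare as larger than any number
list sdist gap if gap > 1 & !missing(gap)
```

On the real data, the same lines would show which district codes are absent between 1 and 931.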
Kanika Dua

Join Date: Feb 2023

Posts: 33
#7

12 Jul 2023, 06:53

Thank you, Andrew. I understand this point. However, my command does not include the variable sdist or district anywhere. I tried running the gpscore command after dropping sdist and using only the unique district identifier; however, it still gives me the same error.

I am reproducing the command and the error here for reference:

Code:

gpscore sexratio2011 , t(did) gpscore(pscore) predict(hat_treat) sigma(sd) cutpoints(phase) index(p50) nq_gps(5) t_transf(ln) detail

Error message:
could not calculate numerical derivatives
discontinuous region with missing values encountered
r(430);

Could the error have a cause other than the one explored above?

Please help.

Thank you,
Kanika
Andrew Musau

Join Date: Oct 2014

Posts: 10282
#8

12 Jul 2023, 07:06

OK, your issue is about the nonconvergence of your ML estimation. This has nothing to do with the values that your district identifier takes. I do not use that command and I am not familiar with how you should set up your model, but assuming you have specified it correctly, you can follow the usual diagnostics for an ML estimation with convergence problems. See, e.g., Clyde Schechter's advice in #8 of https://www.statalist.org/forums/for...nce-of-melogit.
Kanika Dua

Join Date: Feb 2023

Posts: 33
#9

13 Jul 2023, 03:29

Hello, for the future reference of anyone who runs into the same problem: the command worked once I stopped using the log transformation, that is, without the t_transf(ln) option. Thank you!
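
One possible explanation, which is an assumption and not something confirmed in this thread, is that the treatment variable contains zeros or negative values. In Stata, ln() returns missing for such values, which could produce the "missing values encountered" message. A minimal sketch of how one might check this, using made-up values for did and a hypothetical helper variable ln_did:

```stata
* Made-up treatment values, including zeros (illustrative only)
clear
input float did
0
0.5
1
2
0
end

* ln() of zero or a negative number is missing in Stata
generate double ln_did = ln(did)

* How many observations cannot be log-transformed?
count if did <= 0
count if missing(ln_did)

* Compare the raw and transformed variables
summarize did ln_did
```

If the second count is above zero on the real data, the log transformation drops or breaks those observations, and a different transformation (or none) may be more appropriate.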