  • Ghost observations and problem with generalized PSM!

    Hello all,

    I am working on my dissertation, which evaluates the impact of a scheme on the sex ratio at birth in India. I am using generalized propensity score matching together with a dose-response analysis. To conduct the matching, I used the following Stata command:

    Code:
    gpscore sexratio1991 , t(did) gpscore(pscore) predict(hat_treat) sigma(sd) cutpoints(phase) index(p50) nq_gps(5) t_transf(ln) detail
    where sexratio1991 is the variable I want to match on and did is the treatment indicator. My unit of observation is the district (sdist in the dataset).

    However, upon running this command I get the following error:
    could not calculate numerical derivatives
    discontinuous region with missing values encountered
    r(430);

    I looked more closely at my dataset, which contains 651 districts (as reported by the count command). However, when I plot sexratio2011 against the districts, the graph extends to more than 900 districts. This is confusing, as I do not have more than 651 districts in my dataset. I have attached the graph to show what I mean. I suspect that the issue is related to this, but I am not sure.

    My dataset is as follows:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int sdist float sexratio2011
    1 596.7742
    2 642.8571
    3 862.069
    4 644.0678
    5 1184.2106
    6 781.25
    7 414.6342
    8 954.5455
    9 906.9767
    10 962.963
    11 636.3636
    12 909.0909
    13 913.0435
    14 1392.857
    15 711.1111
    16 970.5883
    17 764.7059
    18 977.2727
    19 673.0769
    20 1418.6046
    21 567.56757
    22 1343.75
    23 1000
    24 833.3333
    25 1428.5714
    26 1035.7142
    27 777.7778
    28 615.3846
    29 909.0909
    30 709.6774
    31 1045.4546
    32 675
    33 607.1429
    34 750
    36 645.1613
    37 586.2069
    38 633.3333
    39 600
    40 833.3333
    41 1588.2354
    42 1333.3334
    44 591.83673
    45 1062.5
    46 710.5263
    47 1400
    48 514.2857
    49 685.7143
    50 1419.355
    51 956.5217
    52 696.9697
    53 583.3333
    54 709.6774
    55 640
    56 589.7436
    57 727.2727
    58 1400
    59 758.6207
    60 606.0606
    61 533.3333
    62 900
    63 1074.0741
    64 566.6667
    65 1361.111
    66 1586.207
    67 1071.4286
    68 875
    69 484.8485
    70 909.0909
    71 1147.0588
    72 450
    73 709.6774
    74 916.6667
    75 1343.75
    76 675
    77 1058.8235
    78 1185.1852
    79 681.8182
    80 560.9756
    82 852.9412
    83 758.6207
    84 743.5897
    85 571.4286
    86 621.6216
    87 756.7568
    88 542.8571
    89 897.9592
    99 936.1702
    100 1086.9565
    101 805.5555
    102 898.3051
    103 750
    104 759.2593
    105 918.3674
    106 1203.3898
    107 833.3333
    108 844.4445
    109 941.1765
    110 647.0588
    111 1000
    112 814.8148
    end

    I would be really grateful if someone can help me with this issue.

    Many thanks in advance!
    Kanika



  • #2
    count gives you the number of observations, not the values themselves. For example, if I have 1, 2, 3 and 7, the count is four. If I were to plot these values, they would range from 1 to 7, as the axes in a line graph are continuous. So first:

    Code:
    summarize sdist
    to see the actual range. More importantly, the district identifier is a categorical variable, and therefore a line graph does not make sense. If you want to represent the distribution of the continuous variable, start with a histogram.

    Code:
    histogram sexratio2011, freq
    Here, you will see how many districts are within a certain range of values. Finally, you can create an identifier with consecutive integers starting at one using the -group()- function of egen.

    Code:
    egen district_id = group(sdist)
    summarize district_id
    Last edited by Andrew Musau; 11 Jul 2023, 19:29.
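A minimal sketch of the point in #2, using the toy values 1, 2, 3 and 7 from the example. The variable names id and newid are hypothetical and only serve the illustration.

```stata
* Toy data: four observations whose identifier values have a gap
clear
input int id
1
2
3
7
end

* Number of observations: 4
count

* Range of the stored values: minimum 1, maximum 7
summarize id

* Recode to consecutive integers; the ordering of the original values is kept
egen newid = group(id)

* newid now runs from 1 to 4
summarize newid
```

The same contrast applies to sdist: count reports how many districts there are, while summarize reports which values the identifier takes.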



    • #3
      Dear Andrew,

      Thank you very very much for this prompt response. I am grateful.

      I just wanted to clarify that I plotted the line graph only to highlight the gap in the data. Nevertheless, your comment makes sense.

      My issue is not yet resolved, so any leads would be highly appreciated.

      Thank you,
      Kanika



      • #4
        I don't think that anyone can offer you better advice without seeing more. You need to show the result of

        Code:
        summarize sdist



        • #5
          Hello,

          My apologies.

          The results of the commands are:


          . summarize sdist

              Variable |        Obs        Mean    Std. dev.       Min        Max
          -------------+---------------------------------------------------------
                 sdist |        651    385.4608    249.8039          1        931


          . summarize district_id

              Variable |        Obs        Mean    Std. dev.       Min        Max
          -------------+---------------------------------------------------------
           district_id |        651         326    188.0718          1        651


          district_id is the unique identifier of each district, created from the original identifier (named sdist in my dataset).

          Thank you,
          Kanika



          • #6
            Originally posted by Kanika Dua
            The results of the commands are:


            . summarize sdist

                Variable |        Obs        Mean    Std. dev.       Min        Max
            -------------+---------------------------------------------------------
                   sdist |        651    385.4608    249.8039          1        931

            So this is consistent with what I said in #2. You have 651 observations, but the values of "sdist" range from 1 to 931. Once you create the variable district_id, it has 651 observations with values ranging from 1 to 651. So use this second variable in place of "sdist" if you want an identifier with consecutive numbering. In the end, as this is a categorical variable, it does not matter which of the two you use.
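To see where the numbering skips, one option is to sort on the identifier and compare each value with the previous one. This is only a sketch on four values taken from the -dataex- excerpt in #1 (33, 34, 36, 37); the variable gap is a hypothetical helper.

```stata
* Four identifier values from the excerpt in #1; 35 does not appear there
clear
input int sdist
33
34
36
37
end

sort sdist

* Number of skipped values immediately before each observation
* (missing for the first observation, which has no predecessor)
generate int gap = sdist - sdist[_n-1] - 1

* Missing values compare as larger than any number, so exclude them explicitly
list sdist gap if gap > 0 & !missing(gap)
```

Run on the full dataset, the listed gaps should add up to the difference between the maximum of sdist and the number of districts.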



            • #7
              Thank you, Andrew. I understand this point. However, my command does not include the variable sdist or district anywhere. I tried running the gpscore command after dropping sdist and keeping only the unique district identifier; however, it still gives me the same error.

              Reproducing the command and the error again here for reference:
              Code:
              gpscore sexratio2011 , t(did) gpscore(pscore) predict(hat_treat) sigma(sd) cutpoints(phase) index(p50) nq_gps(5) t_transf(ln) detail
              Error message:
              could not calculate numerical derivatives
              discontinuous region with missing values encountered
              r(430);

              Could the error be due to something other than the issue explored above?

              Please help.

              Thank you,
              Kanika



              • #8
                OK, your issue is about the nonconvergence of your ML estimation. This has nothing to do with the values that your district identifier takes. I do not use that command and I am not familiar with how you should set up your model, but assuming you have specified it correctly, you can follow the usual diagnostics for an ML estimation with convergence problems. See, e.g., Clyde Schechter's advice in #8 of https://www.statalist.org/forums/for...nce-of-melogit.



                • #9
                  Hello, for the future reference of anyone who may need it: the command worked once I stopped using the log transformation. Thank you!
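One possible explanation, offered as an assumption rather than a confirmed diagnosis: in Stata, ln() of zero or of a negative number is missing, so a log transformation of the treatment variable produces missing values wherever the treatment is not strictly positive. Whether this is what triggered r(430) inside gpscore was not verified in this thread. A quick check before choosing the transformation, on hypothetical toy values for did (ln_did is a hypothetical helper name):

```stata
* Hypothetical treatment values; the first one is zero
clear
input float did
0
0.5
1
2
end

* Observations for which the natural log is undefined
count if did <= 0

* ln() returns missing for those observations
generate double ln_did = ln(did)
count if missing(ln_did)
```

If the first count is nonzero on the real data, the log transformation cannot be applied to every observation, which would be consistent with the command running once it was dropped.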

