Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tracking a unique id over years

    I am trying to calculate teacher retention from year to year. I need to be able to figure out if a unique id is present in a consecutive year and if so, count that as teacher retention but if not count that as teacher turnover. I am able to figure out how to do this with 2 years of data but I am unable to figure out the code for multiple years of data (10 years). Would love any help I can get. I have not very proficient in Stata (a beginner) so please if you can suggest actual code that would be needed I would be very thankful.

  • #2
    It is impossible to respond to this in a helpful way without example data. There are various ways your data may be organized, and the code for handling your problem would differ accordingly. Please post back with example data, and use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    There is also some lack of clarity in the question itself. If a teacher is present in the data, say, in 2010 and 2012 but not in 2011, how is that situation to be handled? What if the "gap" in a teacher's presence is several years?

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      Thank you so much for your response. I am unfamiliar with dataex but I do have version 17 and it says it is already installed. I am working to figure out how to post sample data using dataex.

      For clarity: I apologize for my lack of clarity. We have defined teacher retention as a teacher being present from one year to the next. So if a teacher is in the data set in 2010 but not 2011, then the teacher was not retained. If said teacher shows back up in 2012, they would need to be present in the 2013 dataset to be considered retained for that school year. A gap would imply no retention. I need teacher retention by school year but over a 10 year span.

      Comment


      • #4
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str7 teacher_id str4 year
        ""        ""
        "100025"  ""
        "1000793" ""
        "1002613" ""
        "1002645" ""
        "1002909" ""
        "1004052" ""
        "1004594" ""
        "1005114" ""
        "1005358" ""
        "1005980" ""
        "1010484" ""
        "1018114" ""
        "1019154" ""
        "1019254" ""
        "1023597" ""
        "102561"  ""
        "1027372" ""
        "1031704" ""
        "1032235" ""
        "1032252" ""
        "1032678" ""
        "1033506" ""
        "1033565" ""
        "1033783" ""
        "1037587" ""
        "1038135" ""
        "1039397" ""
        "1039399" ""
        "1040094" ""
        "1040622" ""
        "1041433" ""
        "1042482" ""
        "1042483" ""
        "104283"  ""
        "1043032" ""
        "1044931" ""
        "104583"  ""
        "1047573" ""
        "104914"  ""
        "105638"  ""
        "105795"  ""
        "1058887" ""
        "1059703" ""
        "105973"  ""
        "1068590" ""
        "106917"  ""
        "106978"  ""
        "1072797" ""
        "1074702" ""
        "1076600" ""
        "1079395" ""
        "1080875" ""
        "108245"  ""
        "108556"  ""
        "1086"    ""
        "108880"  ""
        "1091"    ""
        "1093161" ""
        "1094796" ""
        "1095"    ""
        "1096869" ""
        "1098442" ""
        "1099863" ""
        "1100"    ""
        "110063"  ""
        "110244"  ""
        "1103763" ""
        "1104293" ""
        "110636"  ""
        "111150"  ""
        "1112414" ""
        "1112547" ""
        "1114542" ""
        "1114834" ""
        "1114922" ""
        "1115411" ""
        "1116258" ""
        "1116784" ""
        "1116820" ""
        "1117695" ""
        "1117850" ""
        "1117871" ""
        "1117936" ""
        "1118036" ""
        "1119681" ""
        "1119682" ""
        "1119683" ""
        "1120148" ""
        "1120825" ""
        "1120839" ""
        "1121260" ""
        "1122435" ""
        "1122853" ""
        "1123"    ""
        "112521"  ""
        "1126706" ""
        "1127708" ""
        "1128947" ""
        "1130"    ""
        end

        Comment


        • #5
          Well, something is wrong with your example data because the year variable is always missing.

          Nevertheless, the information about your data organization is sufficient to produce a solution with fair confidence.

          The first issue is that your year variable is a string. In order to know what the next year after a given year is, it has to be converted to numeric. That's what -destring- does.
          Code:
          destring year, replace

          Once we have that it's pretty simple.

          Code:
          by teacher_id (year), sort: gen byte retained = (year[_n+1] == year + 1)
          This will set retained = 1 in those years where the teacher is retained the following year and 0 otherwise.

          There is one limitation to this approach: every teacher will be marked as not retained in the final year for which he/she has data. So if your data set ends, say, in 2022, all teachers will be marked as not retained in 2022. If your data set does actually terminate at some fixed calendar year, this is easy to fix (demonstrated using 2022 as the final year for which there is any data):
          Code:
          by teacher_id (year), sort: gen byte retained = (year[_n+1] == year+1) if year < 2022
          This code will work like the previous one, except that it will give a missing value in 2022, which is appropriate since there is no way to know from the data what happened to the teacher in 2023.

          If, however, there is a different endpoint for each teacher, then it would be
          Code:
          by teacher_id (year), sort: gen byte retained = (year[_n+1] == year + 1 if _n < _N
          This will set retained to missing value in the final year of each teacher's data, again, appropriate as there is no way to know what happened next to that teacher.

          Comment


          • #6
            Thanks so much Clyde, You were very helpful.

            Comment


            • #7
              So I followed what Clyde said above. Everything worked except for the last year and then it seems something weird is happening with 2022 as well.

              Below is the code used to produced retention. Everything looked good except for the last couple of years. I don't know if I misunderstood your earlier suggestions perhaps

              by teacher_id (year), sort: gen byte retained = (year[_n+1] == year+1) if year < 2024

              egen tag = tag(teacher_id year)
              egen count = total(tag), by(year)
              tabdisp year, cell(count)

              rename count total

              egen count = total(retained), by(year)

              rename count totalretained
              rename total totalteachers

              destring totalretained, replace
              destring totalteachers, replace

              gen pp_retained = totalretained/totalteachers
              gen pp_retained2 = totalretained/totalteachers*100

              tabdisp year, cell(pp_retained2)



              Code:
              * Example generated by -dataex-. For more info, type help dataex
              clear
              input str7 teacher_id int year float pp_retained2 byte retained float(totalteachers totalretained)
              ""           .        . .    0    0
              "100025"  2015 81.47591 0 1992 1623
              "100025"     .        . .    0    0
              "1000793" 2022        0 0 1986    0
              "1000793"    .        . .    0    0
              "1001247" 2021 80.79332 1 1916 1548
              "1001247" 2022        0 0 1986    0
              "1002529" 2019 80.60095 1 1897 1529
              "1002529" 2020 82.08797 1 1887 1549
              "1002529" 2021 80.79332 0 1916 1548
              "1002613"    .        . .    0    0
              "1002645"    .        . .    0    0
              "1002909" 2022        0 0 1986    0
              "1002909"    .        . .    0    0
              "1004052" 2020 82.08797 1 1887 1549
              "1004052" 2021 80.79332 1 1916 1548
              "1004052" 2022        0 0 1986    0
              "1004052"    .        . .    0    0
              "1004594" 2022        0 0 1986    0
              "1004594"    .        . .    0    0
              "1004846" 2019 80.60095 1 1897 1529
              "1004846" 2020 82.08797 0 1887 1549
              "1005114" 2020 82.08797 1 1887 1549
              "1005114" 2021 80.79332 1 1916 1548
              "1005114" 2022        0 0 1986    0
              "1005114"    .        . .    0    0
              "1005358"    .        . .    0    0
              "1005980" 2020 82.08797 1 1887 1549
              "1005980" 2021 80.79332 1 1916 1548
              "1005980" 2022        0 0 1986    0
              "1005980"    .        . .    0    0
              "1009518" 2019 80.60095 1 1897 1529
              "1009518" 2020 82.08797 0 1887 1549
              "1010388" 2019 80.60095 0 1897 1529
              "1010484" 2022        0 0 1986    0
              "1010484"    .        . .    0    0
              "101086"  2015 81.47591 1 1992 1623
              "101086"  2016 80.71611 1 1955 1578
              "101086"  2017  79.9075 0 1946 1555
              "1011116" 2021 80.79332 1 1916 1548
              "1011116" 2022        0 0 1986    0
              "1012705" 2019 80.60095 0 1897 1529
              "1014851" 2020 82.08797 0 1887 1549
              "1015027" 2019 80.60095 0 1897 1529
              "1015280" 2019 80.60095 0 1897 1529
              "1018114" 2020 82.08797 1 1887 1549
              "1018114" 2021 80.79332 1 1916 1548
              "1018114" 2022        0 0 1986    0
              "1018114"    .        . .    0    0
              "1019154" 2020 82.08797 1 1887 1549
              "1019154" 2021 80.79332 1 1916 1548
              "1019154" 2022        0 0 1986    0
              "1019154"    .        . .    0    0
              "1019190" 2020 82.08797 1 1887 1549
              "1019190" 2021 80.79332 1 1916 1548
              "1019190" 2022        0 0 1986    0
              "1019254" 2020 82.08797 1 1887 1549
              "1019254" 2021 80.79332 1 1916 1548
              "1019254" 2022        0 0 1986    0
              "1019254"    .        . .    0    0
              "101976"  2015 81.47591 0 1992 1623
              "101976"  2017  79.9075 0 1946 1555
              "1019919" 2022        0 0 1986    0
              "1021750" 2020 82.08797 1 1887 1549
              "1021750" 2021 80.79332 1 1916 1548
              "1021750" 2022        0 0 1986    0
              "1023597" 2020 82.08797 1 1887 1549
              "1023597" 2021 80.79332 1 1916 1548
              "1023597" 2022        0 0 1986    0
              "1023597"    .        . .    0    0
              "102543"  2015 81.47591 0 1992 1623
              "102548"  2015 81.47591 1 1992 1623
              "102548"  2016 80.71611 1 1955 1578
              "102548"  2017  79.9075 1 1946 1555
              "102548"  2018 80.60166 1 1928 1554
              "102548"  2019 80.60095 1 1897 1529
              "102548"  2020 82.08797 1 1887 1549
              "102548"  2021 80.79332 0 1916 1548
              "102561"  2015 81.47591 1 1992 1623
              "102561"  2016 80.71611 0 1955 1578
              "102561"  2018 80.60166 0 1928 1554
              "102561"  2022        0 0 1986    0
              "102561"     .        . .    0    0
              "1027372" 2019 80.60095 1 1897 1529
              "1027372" 2020 82.08797 1 1887 1549
              "1027372" 2021 80.79332 1 1916 1548
              "1027372" 2022        0 0 1986    0
              "1027372"    .        . .    0    0
              "1027563" 2021 80.79332 0 1916 1548
              "103022"  2015 81.47591 1 1992 1623
              "103022"  2016 80.71611 0 1955 1578
              "103022"  2018 80.60166 0 1928 1554
              "1031704"    .        . .    0    0
              "1032235" 2020 82.08797 1 1887 1549
              "1032235" 2021 80.79332 1 1916 1548
              "1032235" 2022        0 0 1986    0
              "1032235"    .        . .    0    0
              "1032252"    .        . .    0    0
              "1032678" 2020 82.08797 1 1887 1549
              "1032678" 2021 80.79332 1 1916 1548
              end

              Comment


              • #8
                I was misled by the initial data example. As I noted, the year variable was always missing, and I misinterpreted that as meaning that there were sporadic observations scattered in the data where for whatever reason that data was missing, and you had just sorted the data in such a way that the observations that -dataex- put out were those.

                Looking at the example shown in #7, however, it is apparent that the missingness of year is not at all sporadic. It is fairly systematic, occurring with almost half of the different teacher_id's having one, but never more than one, such observation. It worries me that this happens, and I wonder if it is a results of errors in data management leading up to the creation of the data set. Or perhaps it has some specific meaning that you have not revealed, but that might be relevant here. I will leave it to you to ponder these possibilities.

                But for now, the code needs to be modified to account for the fact that this missingness of year is a very frequent occurrence.

                Code:
                gen byte missing_year = missing(year)
                by teacher_id missing_year (year), sort: ///
                    gen byte retained = year[_n+1] == year + 1 if !missing_year & _n < _N
                sort teacher_id year

                Comment

                Working...
                X