  • Using AI to facilitate data collating/correction

    I came here from the thread about Stata and AI, but my question is about the feasibility of using AI to facilitate data collating in some circumstances.

    The dataset I recently processed includes a variable for clients' country of birth, which was retrieved from an old database. That database allowed users to enter the country of birth manually rather than selecting it from a drop-down menu, which has led to many inaccurate (incorrect) country names: for example, Maori (supposed to be New Zealand), Wales (supposed to be United Kingdom), etc. That 5% of manually entered values is very random, and there are no rules to follow.

    As all the country names need to be recoded with value labels (e.g., Australia -> 1101, New Zealand -> 1201, Vietnam -> 5105, ...), those random country names cannot be coded.
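
    For concreteness, the recoding I have in mind is roughly the following (the variable and label names are just placeholders, and the real code list is much longer):

        * Sketch only: country_clean is a hypothetical string variable holding
        * the corrected country names.
        label define cob 1101 "Australia" 1201 "New Zealand" 5105 "Vietnam"
        encode country_clean, generate(country_code) label(cob) noextend
        * noextend makes -encode- stop if any name is not in the value label,
        * which is exactly where the random write-ins cause trouble.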

    Perhaps my expectation goes beyond Stata's specialised area, but after reading the above thread, I was wondering whether AI might help me to "guess" which actual country each of those random inputs refers to.

    Of course, maybe this question is not worth spending too much time on. I don't know whether there is a smarter way to do this and would appreciate any guidance.

  • #2
    I think you might actually want a fuzzy matching algorithm for this problem - not necessarily AI. There are a few recommendations in this other thread that might help you get started.
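
    For example, with the user-written -matchit- package from SSC (all the variable and file names below are hypothetical, so adjust them to your data):

        * Hypothetical setup: each record has an id and the written-in name in
        * country_raw; countries.dta holds the official list as code and country_name.
        ssc install matchit
        matchit id country_raw using countries.dta, idusing(code) txtusing(country_name)
        * Inspect the candidate pairs, best scores first.
        gsort -similscore
        list country_raw country_name similscore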

    • #3
      Originally posted by Daniel Schaefer View Post
      I think you might actually want a fuzzy matching algorithm for this problem - not necessarily AI. There are a few recommendations in this other thread that might help you get started.
      Thank you, Daniel. I'll definitely have a look.

      • #4
        I'm not sure fuzzy matching is the right solution here. I can't think of any fuzzy matching program that will assign a high match score between Maori and New Zealand or Wales and United Kingdom. Fuzzy matching would deal well with things like misspellings. But the "fuzzy matching" wanted here is semantic, not orthographic. If this were my situation, I would start by creating a new data set containing only the distinct written-in country names that do not match the drop-down list. If the list is sufficiently small, I would simply Google these names to find out where the stated countries are (or once were) and add that to a second variable in this data set and save the result in a crosswalk data set. Then I would merge the original data set with the crosswalk.

        If the list of names not on the drop-down list is too long to be handled manually, then perhaps an AI program could handle the job of creating the crosswalk.
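
        A rough sketch of the manual version of that workflow (the file and variable names are only placeholders):

            * Step 1: isolate the distinct write-ins that did not match the drop-down list.
            * (Assumes country_code is missing exactly for the records whose write-in
            *  did not match.)
            use main_data, clear
            keep if missing(country_code)
            keep country_raw
            duplicates drop
            save crosswalk, replace
            * Step 2: open crosswalk.dta, look up each name, record the correct country
            *         in a new variable (say, country_correct), and save it again.
            * Step 3: merge the completed crosswalk back onto the original data.
            use main_data, clear
            merge m:1 country_raw using crosswalk, keep(master match) nogenerate
            * country_correct now holds the resolved country for the problem records.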

        • #5
          Originally posted by Clyde Schechter View Post
          I'm not sure fuzzy matching is the right solution here. I can't think of any fuzzy matching program that will assign a high match score between Maori and New Zealand or Wales and United Kingdom. Fuzzy matching would deal well with things like misspellings. But the "fuzzy matching" wanted here is semantic, not orthographic. If this were my situation, I would start by creating a new data set containing only the distinct written-in country names that do not match the drop-down list. If the list is sufficiently small, I would simply Google these names to find out where the stated countries are (or once were) and add that to a second variable in this data set and save the result in a crosswalk data set. Then I would merge the original data set with the crosswalk.

          If the list of names not on the drop-down list is too long to be handled manually, then perhaps an AI program could handle the job of creating the crosswalk.
          Well spotted, Clyde! Daniel's advice was very good and led me to a new package (matchit), but your dissection of my needs was spot on. The matching I need here is semantic, not orthographic.

          This may be beyond the scope of Stata's specialisation (and probably beyond the scope of this forum). I just want to reiterate that data cleaning and pre-processing is indeed a very complex, cumbersome matter and often not achievable with a single piece of software. For example, I actually use Excel to shorten variable names (when they are more than 32 characters) before importing them into Stata.

          Thanks again for everyone's advice.

          • #6
            Sorry Shen, I'm afraid I did not read your post carefully enough. I'm glad Clyde was able to be of more help to you!

            • #7
              Actually, now that I think about it, a Word2Vec-style machine learning model should be very well suited to this kind of task. I bet you could prompt a large language model like ChatGPT with a unique list of all of the place names you can't match and ask for the country name that corresponds to each place name. Just keep in mind that the algorithm won't be perfect because (e.g.) different places can have the same name.
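
              Something along these lines would give you a clean list to paste into the prompt (variable names are hypothetical, mirroring the crosswalk idea above):

                  * Sketch: export the distinct unmatched write-ins to a text file.
                  preserve
                  keep if missing(country_code)
                  keep country_raw
                  duplicates drop
                  sort country_raw
                  export delimited country_raw using "unmatched_names.txt", novarnames replace
                  restore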
