Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cleaning data; country names are not the same

    Hi Statalist,

    I've taken from two different databank the oil production and oil prices of all the countries in the world. After I merge both excel files on Country and Year a lot of country names were different, e.g Egypte - Egypt, Arab Rep. . I want to end-up with there 160 countries of which if have the oil production and the oil price but due to the mismatch in Country names I can't drop the observation with the lesser merge.

    Can someone help me with cleaning up the country names so I end-up with 160 countries of which I've the oilprice and production?

    Best Thomas

  • #2
    This is a classic example of a problem easy to understand but hard to solve. Some tips within

    J-8-3 dm0039 . . . Stata tip 64: Cleaning up user-entered string variables
    . . . . . . . . . . . . . . . . . . . . . . . . J. Herrin and E. Poen
    Q3/08 SJ 8(3):444--445 (no commands)
    tip on how to clean up user-entered string variables

    Comment


    • #3
      Thank you Nick!

      I understand how to get rid of spaces and such within the variable Country. But if I want to replace Bolivia (Plur. State of) with just Bolivia.

      I've around 50 countries that vary only in the length of their name. Is it possible to construct a loop that will only keep the important part?

      Click image for larger version

Name:	Screen Shot 2018-05-18 at 11.18.45.png
Views:	1
Size:	78.7 KB
ID:	1444728

      As a small side note, I changed one observation manually to see if the changed country name would merge automatically with the other variables but this is not the case.. Will this happen after all the country names will be the same?

      best
      Thomas

      Comment


      • #4
        Please don't use screenshot images. This is explained (FAQ Advice #12).

        Working with first words will get you a long way. But watch out for multiple Congo, etc.

        In a Stata forum the word "merge" suggests mainly the merge command. You mean something else but I don't know what it is.

        Comment


        • #5
          How would Stata now what is important?

          You may want to look at the user written kountry command
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------

          Comment


          • #6
            Ah, the dirty side of the job...

            When the number of cases is large (lots of city names, can be 1000 to 36000), I end up writing some Python code to match names, using heavily dictionaries and regular expressions. In short, I try to "normalize" names (no spaces, no diacritics, lowercase), and to guess all possibilities (saint - st for instance). You can also use a "string distance" function here. And there is usually still some (or much) work to be done by hand, that I put in an additional "update" file into the program, to keep everything automatic. That's an incremental process: a first pass yields some unmatched data, I put that in Excel, translate some by hand, put back into the program and update the automatic part as much as possible, etc. Could probably be done in Stata, but I have used Python for this for a very long time.

            When there are not too many cases (say less than 200), my best advice is to match names in a spreadsheet, and use the resulting list as a translation table. You can do a first automatic pass to reduce the task. If you are going to repeat the computations regularly, with data from the same sources (hence same names), the time passed to build this table will be a valuable investment.

            Comment


            • #7
              Kountry doenst really have different naming conventions, I believe?
              I ahve been putting a file together with different country naming conventions, and a standard one. Plus some ctry codes for each. Use merge to get standard names for differently named countries in one set. If merge indicates the usng file doenst have one of your naming conventions, update the file.
              Attached is a txt file of a dataex (was too long to put in post), you can use that as a starter. Upload the text in attache file in the Stata command box and save as dta
              Attached Files
              Last edited by Jorrit Gosens; 18 May 2018, 03:53.

              Comment

              Working...
              X