Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Imported data get sorted "ascending" and not "descending"

    I'm trying to import an excel dataset of Belgian surnames into Stata (Source of dataset: Family names | Statbel (fgov.be), but I'm encountering issues when sorting data, probably regarding their format.


    import excel "C:$path\Family_names_2023.xlsx", sheet("top10000_2023") firstrow clear
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int Belgium str20 B int(C FlemishRegion) str19 E int(F WalloonRegion) str20 H int(I BrusselsCapitalRegion) str14 K int L
    1 "Peeters" 30716 1 "Peeters" 27459 1 "Dubois" 9096 1 "Diallo" 4798
    2 "Janssens" 28115 2 "Janssens" 24554 2 "Lambert" 8311 2 "Bah" 2682
    3 "Maes" 24392 3 "Maes" 21324 3 "Martin" 6578 3 "Barry" 1831
    4 "Jacobs" 18846 4 "Jacobs" 16244 4 "Dupont" 6126 4 "Sow" 1353
    5 "Mertens" 17725 5 "Mertens" 14632 5 "Simon" 5988 5 "Nguyen" 965


    So far so good, the problem is, when I try sorting by C, the order is ascending, so that I get:
    sort by C
    Example generated by -dataex-. For more info, type help dataex
    input int Belgium str20 B int(C FlemishRegion) str19 E int(F WalloonRegion) str20 H int(I BrusselsCapitalRegion) str14 K int L
    9994 "Bulckaen" 181 9918 "Van Steyvoort" 104 . "" . . "" .
    9994 "Davidts" 181 9918 "Werniers" 104 . "" . . "" .
    9994 "Dehez" 181 . "" . . "" . . "" .
    9994 "Van Boxem" 181 . "" . . "" . . "" .
    9994 "Scauflaire" 181 . "" . . "" . . "" .

    I tried to see whether applying absolute value works, nothing.
    Codebook also seems to show range and percentiles as I expect them to be.
    codebook C

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------
    C #
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------

    Type: Numeric (int)

    Range: [181,30716] Units: 1
    Unique values: 1,768 Missing .: 2/10,048

    Mean: 634.564
    Std. dev.: 1093.88

    Percentiles: 10% 25% 50% 75% 90%
    200 234 333 595 1219



    This problem seems present in the native excel data too and applies to all the numeric variables, but I would like to find a solution in Stata since I'm not familiar with excel at all.
    Thank you in advance

  • #2
    Code:
    gsort -C
    See

    Code:
    help gsort

    Comment


    • #3
      Originally posted by Andrew Musau View Post
      Code:
      gsort -C
      See

      Code:
      help gsort
      Thank you Andrew. I think I formulated my question in an ambiguous way (my bad!). More than only sorting (which gsort -C attains), I would like to transform my dataset to be so that sort C works as I intend

      Comment


      • #4
        What is C? Some sort of identifier? The rule for sorting numerical variables is from smallest to largest, so you need a new sort variable. You can create such a variable and if necessary, use the values of C as value labels. Below, I use labmask from the Stata Journal to achieve this.

        Code:
        *INSTALL LABMASK
        net install gr0034, from(http://www.stata-journal.com/software/sj8-2)

        Code:
        gen CR= -C
        egen newC= group(CR)
        labmask newC, values(C)
        drop CR
        sort newC

        Comment

        Working...
        X