Imported data get sorted "ascending" and not "descending"

Augusto Milani

Join Date: Jul 2022

Posts: 2
#1

Imported data get sorted "ascending" and not "descending"

04 Sep 2023, 04:39

I'm trying to import an excel dataset of Belgian surnames into Stata (Source of dataset: Family names | Statbel (fgov.be), but I'm encountering issues when sorting data, probably regarding their format.

import excel "C:$path\Family_names_2023.xlsx", sheet("top10000_2023") firstrow clear
* Example generated by -dataex-. For more info, type help dataex
clear
input int Belgium str20 B int(C FlemishRegion) str19 E int(F WalloonRegion) str20 H int(I BrusselsCapitalRegion) str14 K int L
1 "Peeters" 30716 1 "Peeters" 27459 1 "Dubois" 9096 1 "Diallo" 4798
2 "Janssens" 28115 2 "Janssens" 24554 2 "Lambert" 8311 2 "Bah" 2682
3 "Maes" 24392 3 "Maes" 21324 3 "Martin" 6578 3 "Barry" 1831
4 "Jacobs" 18846 4 "Jacobs" 16244 4 "Dupont" 6126 4 "Sow" 1353
5 "Mertens" 17725 5 "Mertens" 14632 5 "Simon" 5988 5 "Nguyen" 965

So far so good, the problem is, when I try sorting by C, the order is ascending, so that I get:
sort by C
Example generated by -dataex-. For more info, type help dataex
input int Belgium str20 B int(C FlemishRegion) str19 E int(F WalloonRegion) str20 H int(I BrusselsCapitalRegion) str14 K int L
9994 "Bulckaen" 181 9918 "Van Steyvoort" 104 . "" . . "" .
9994 "Davidts" 181 9918 "Werniers" 104 . "" . . "" .
9994 "Dehez" 181 . "" . . "" . . "" .
9994 "Van Boxem" 181 . "" . . "" . . "" .
9994 "Scauflaire" 181 . "" . . "" . . "" .

I tried to see whether applying absolute value works, nothing.
Codebook also seems to show range and percentiles as I expect them to be.
codebook C

-----------------------------------------------------------------------------------------------------------------------------------------------------------------
C #
-----------------------------------------------------------------------------------------------------------------------------------------------------------------

Type: Numeric (int)

Range: [181,30716] Units: 1
Unique values: 1,768 Missing .: 2/10,048

Mean: 634.564
Std. dev.: 1093.88

Percentiles: 10% 25% 50% 75% 90%
200 234 333 595 1219

This problem seems present in the native excel data too and applies to all the numeric variables, but I would like to find a solution in Stata since I'm not familiar with excel at all.
Thank you in advance
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10275
#2

04 Sep 2023, 05:01

Code:

gsort -C

See

Code:

help gsort
1 like
Comment
Augusto Milani

Join Date: Jul 2022

Posts: 2
#3

04 Sep 2023, 05:21

Originally posted by Andrew Musau View Post

Code:

gsort -C

See

Code:

help gsort

Thank you Andrew. I think I formulated my question in an ambiguous way (my bad!). More than only sorting (which gsort -C attains), I would like to transform my dataset to be so that sort C works as I intend
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10275
#4

04 Sep 2023, 05:36

What is C? Some sort of identifier? The rule for sorting numerical variables is from smallest to largest, so you need a new sort variable. You can create such a variable and if necessary, use the values of C as value labels. Below, I use labmask from the Stata Journal to achieve this.

Code:

*INSTALL LABMASK net install gr0034, from(http://www.stata-journal.com/software/sj8-2)

Code:

gen CR= -C egen newC= group(CR) labmask newC, values(C) drop CR sort newC
1 like
Comment

Announcement

Imported data get sorted "ascending" and not "descending"

Comment

Comment

Comment