  • Difficulty isolating a level of my variable with -levelsof- and -strpos()-

    Hello everyone. For a project, I have to clean my dataset by manually selecting some values of a variable called varname. To do this, I'm using the -levelsof- command with the condition that strpos(varname, "the text I want to match") is positive. However, for a reason I can't understand, even when the text passed to strpos() is pasted exactly, the command finds no match and the local that should hold the levels ends up empty. In other cases, as you will see, the code runs just fine. Please consider these two observations from my dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str1227 varname
    "Avez-vous un agrément pour votre activité secondaire ?"
    "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d’accueil ?  "
    end

    I run:

    Code:
    // This one doesn't work
    levelsof varname if strpos(varname, "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d'accueil ?  ") > 0, local(C5)
    local N_C5 = r(N)
    
    // This one works : 
    levelsof varname if strpos(varname, "Avez-vous un agrément pour votre activité secondaire ?") > 0, local(A23) 
    local N_A23 = r(N)
    What's strange is that when I replicated this example in Stata using only what is posted in this thread, the code worked fine, and I didn't even edit anything! But I'm 100% sure it doesn't work in my do-file, since I used -set trace on- to see where the code stops. Here are the results:

    Code:
    - levelsof varname if strpos(varname, "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de 
    > votre site d'accueil ?  ") > 0, local(C5)
      ------------------------------------------------------------------------------------------------------------ begin levelsof ---
      - version 15.0
      - syntax varname [if] [in] [, Separate(str) MISSing Local(name local) Clean MATROW(name) MATCELL(name) HEXadecimal ]
      - if ("`missing'" == "") {
      = if ("" == "") {
      - marksample touse, strok
      - }
      - else {
        marksample touse, strok novarlist
        }
      - if (`"`separate'"' == "") {
      = if (`""' == "") {
      - local separate " "
      - }
      - local typ : type `varlist'
      = local typ : type varname
      - if ("`typ'" == "strL" | substr("`typ'", 1, 3) == "str") {
      = if ("str1227" == "strL" | substr("str1227", 1, 3) == "str") {
      - NoHexadecimal `hexadecimal'
      = NoHexadecimal 
        -------------------------------------------------------------------------------------------- begin levelsof.NoHexadecimal ---
        - if (`"`0'"' == "") {
        = if (`""' == "") {
        - exit
        ---------------------------------------------------------------------------------------------- end levelsof.NoHexadecimal ---
      - NoMatrow `matrow'
      = NoMatrow 
        ------------------------------------------------------------------------------------------------- begin levelsof.NoMatrow ---
        - if (`"`0'"' == "") {
        = if (`""' == "") {
        - exit
        --------------------------------------------------------------------------------------------------- end levelsof.NoMatrow ---
      - }
      - mata: st_rclear()
      - if ("`typ'" == "strL") {
      = if ("str1227" == "strL") {
        LevelsOfStrL `varlist' if `touse', separate(`"`separate'"') `clean' matcell(`matcell')
        }
      - else if (substr("`typ'", 1, 3) == "str") {
      = else if (substr("str1227", 1, 3) == "str") {
      - local isclean = ("`clean'" != "")
      = local isclean = ("" != "")
      - mata: LevelsOfString("`varlist'", "`touse'", `"`separate'"', `isclean', "`matcell'")
      = mata: LevelsOfString("varname", "__000000", `" "', 0, "")
      - }
      - else {
        local isint = inlist("`typ'", "byte", "int", "long")
        mata: LevelsOfReal("`varlist'", "`touse'", `"`separate'"', `isint', 1, "`matrow'", "`matcell'", "`hexadecimal'" != "")
        if ("`usetab'" == "usetab") {
        cap LevelsOfTab `varlist' if `touse', `missing' separate(`"`separate'"') matrow(`matrow') matcell(`matcell') `hexadecimal'
        if (_rc) {
        mata: LevelsOfReal("`varlist'", "`touse'", `"`separate'"', `isint', 0, "`matrow'", "`matcell'", "`hexadecimal'" != "")
        }
        }
        }
      - if ("`local'" != "") {
      = if ("C5" != "") {
      - c_local `local' `"`r(levels)'"'
      = c_local C5 `""'
      - }
      - di as text `"`r(levels)'"'
      = di as text `""'
    
      -------------------------------------------------------------------------------------------------------------- end levelsof
    For the one that works:

    Code:
    - levelsof varname if strpos(varname, "Avez-vous un agrément pour votre activité secondaire ?") > 0 & strpos(varname, "Avez-vous 
    > un agrément pour votre activité secondaire ? [Autre]") == 0, local(A23)
      ------------------------------------------------------------------------------------------------------------ begin levelsof ---
      - version 15.0
      - syntax varname [if] [in] [, Separate(str) MISSing Local(name local) Clean MATROW(name) MATCELL(name) HEXadecimal ]
      - if ("`missing'" == "") {
      = if ("" == "") {
      - marksample touse, strok
      - }
      - else {
        marksample touse, strok novarlist
        }
      - if (`"`separate'"' == "") {
      = if (`""' == "") {
      - local separate " "
      - }
      - local typ : type `varlist'
      = local typ : type varname
      - if ("`typ'" == "strL" | substr("`typ'", 1, 3) == "str") {
      = if ("str1227" == "strL" | substr("str1227", 1, 3) == "str") {
      - NoHexadecimal `hexadecimal'
      = NoHexadecimal 
        -------------------------------------------------------------------------------------------- begin levelsof.NoHexadecimal ---
        - if (`"`0'"' == "") {
        = if (`""' == "") {
        - exit
        ---------------------------------------------------------------------------------------------- end levelsof.NoHexadecimal ---
      - NoMatrow `matrow'
      = NoMatrow 
        ------------------------------------------------------------------------------------------------- begin levelsof.NoMatrow ---
        - if (`"`0'"' == "") {
        = if (`""' == "") {
        - exit
        --------------------------------------------------------------------------------------------------- end levelsof.NoMatrow ---
      - }
      - mata: st_rclear()
      - if ("`typ'" == "strL") {
      = if ("str1227" == "strL") {
        LevelsOfStrL `varlist' if `touse', separate(`"`separate'"') `clean' matcell(`matcell')
        }
      - else if (substr("`typ'", 1, 3) == "str") {
      = else if (substr("str1227", 1, 3) == "str") {
      - local isclean = ("`clean'" != "")
      = local isclean = ("" != "")
      - mata: LevelsOfString("`varlist'", "`touse'", `"`separate'"', `isclean', "`matcell'")
      = mata: LevelsOfString("varname", "__000000", `" "', 0, "")
      - }
      - else {
        local isint = inlist("`typ'", "byte", "int", "long")
        mata: LevelsOfReal("`varlist'", "`touse'", `"`separate'"', `isint', 1, "`matrow'", "`matcell'", "`hexadecimal'" != "")
        if ("`usetab'" == "usetab") {
        cap LevelsOfTab `varlist' if `touse', `missing' separate(`"`separate'"') matrow(`matrow') matcell(`matcell') `hexadecimal'
        if (_rc) {
        mata: LevelsOfReal("`varlist'", "`touse'", `"`separate'"', `isint', 0, "`matrow'", "`matcell'", "`hexadecimal'" != "")
        }
        }
        }
      - if ("`local'" != "") {
      = if ("A23" != "") {
      - c_local `local' `"`r(levels)'"'
      = c_local A23 `"`"Avez-vous un agrément pour votre activité secondaire ?"'"'
      - }
      - di as text `"`r(levels)'"'
      = di as text `"`"Avez-vous un agrément pour votre activité secondaire ?"'"'
    `"Avez-vous un agrément pour votre activité secondaire ?"'
      -------------------------------------------------------------------------------------------------------------- end levelsof
    I wonder what the problem could be... Could anyone help me?

  • #2
    Well, -levelsof- has been around a long time, and it is one of the more frequently used commands, so I think if it has a bug, it would have been reported and fixed previously.

    Moreover, the trace output you show indicates that the -mata: LevelsOfString()- call that is passed is the same for both the working and non-working examples. So if there is a bug, it is within that Mata function, or perhaps there is a bug somewhere in -marksample- so that `touse' is wrongly calculated. But -marksample- is another heavily used command that is unlikely to have a bug that has never bitten until now.

    I think it is more likely that your data are not what you think they are. Where do those strings come from? If they were originally copied from somewhere else,* it is possible that the one that is not working properly contains some non-printing character that you and I cannot see but Stata does, and the presence of that character causes a mismatch with the string appearing in the -strpos()- expression. There are two easy ways to screen for this problem. First, calculate the length of the string and see if it is what you expect. If it is greater than you expect, then it almost certainly contains some non-printing character(s). Given that these are Unicode strings, however, it may be hard to know what the length should be. So the second thing you can do is run -chartab varname in 1- (or whatever the observation number is in your real data set), and you will get a list of all the characters along with an explanation of what they are and their hexadecimal codes. If there is something in there that shouldn't be, you'll see it. [Note: -chartab- is written by Robert Picard and is available from SSC.]

    *I have found that this problem of non-printing characters can arise with strings that are brought in from any kind of file other than a simple text file. I have experienced this with material copied from Word documents, PDFs (including the Stata PDFs!), and web pages of all sorts. Basically, any source that has to format the display of the text is likely to embed some kind of formatting codes, which do not print, into the material being displayed. What is also peculiar is that sometimes, when you then paste the material into another application, the codes get dropped. For example, when I paste text into the do-file editor, the non-printing characters are usually removed. But the Data Editor does not remove them, nor do the various -import- commands used in creating data sets. As a result, if I copy the value of a contaminated string from the data set into a -strpos()- expression in the do-file editor, the copied string may exclude the contaminating characters, which will lead to non-matching.
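
    A minimal sketch of the length screen described above, assuming the variable is called varname as in #1; the names n_bytes, n_chars and has_multibyte are hypothetical. In Stata, strlen() counts bytes while ustrlen() counts Unicode characters, so the two differ whenever a string holds at least one multibyte character. Legitimate accented letters such as é are multibyte too, so a flag here only means the observation is worth inspecting with -chartab-; it does not prove contamination.

    Code:
    * strlen() returns the length in bytes, ustrlen() the length in characters
    generate long n_bytes = strlen(varname)
    generate long n_chars = ustrlen(varname)

    * more bytes than characters means at least one non-ASCII character
    generate byte has_multibyte = n_bytes > n_chars
    list n_bytes n_chars if has_multibyte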

    • #3
      Dear Clyde: your experience on this matter is truly appreciated. I would never have thought of this lead on my own.

      The code:

      Code:
      display strlen("Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d'accueil ?  ")
      yields 114, and after counting the characters manually, unless I made a mistake, I reached the same number. I then ran:

      Code:
      chartab varname in 147
      and got this output:

      Code:
         decimal  hexadecimal   character |     frequency    unique name
      ------------------------------------+---------------------------------------------
              32       \u0020             |            15    SPACE
              45       \u002d       -     |             2    HYPHEN-MINUS
              63       \u003f       ?     |             1    QUESTION MARK
              86       \u0056       V     |             1    LATIN CAPITAL LETTER V
              97       \u0061       a     |             6    LATIN SMALL LETTER A
              99       \u0063       c     |             5    LATIN SMALL LETTER C
             100       \u0064       d     |             5    LATIN SMALL LETTER D
             101       \u0065       e     |            17    LATIN SMALL LETTER E
             102       \u0066       f     |             1    LATIN SMALL LETTER F
             104       \u0068       h     |             1    LATIN SMALL LETTER H
             105       \u0069       i     |             4    LATIN SMALL LETTER I
             108       \u006c       l     |             4    LATIN SMALL LETTER L
             110       \u006e       n     |             4    LATIN SMALL LETTER N
             111       \u006f       o     |             5    LATIN SMALL LETTER O
             112       \u0070       p     |             2    LATIN SMALL LETTER P
             114       \u0072       r     |            10    LATIN SMALL LETTER R
             115       \u0073       s     |            11    LATIN SMALL LETTER S
             116       \u0074       t     |            10    LATIN SMALL LETTER T
             117       \u0075       u     |             5    LATIN SMALL LETTER U
             118       \u0076       v     |             2    LATIN SMALL LETTER V
             160       \u00a0             |             2    NO-BREAK SPACE
           8,217       \u2019       ’     |             1    RIGHT SINGLE QUOTATION MARK
      ------------------------------------+---------------------------------------------
      
                                          freq. count   distinct
      ASCII characters              =             111         20
      Multibyte UTF-8 characters    =               3          2
      Unicode replacement character =               0          0
      Total Unicode characters      =             114         22
      But I'm afraid I can't interpret it correctly, as I'm not at all familiar with the different types of characters. It seems to me that the table looks fine though, doesn't it?

      As for the origin of those strings, I copied and pasted them from the Stata Data Browser of another dataset of mine. Ultimately, all my datasets come from Excel, but all I did was import them into Stata in the usual way. I have no idea where the Excel files come from though; I just received them.

      Should I try to rewrite the command character by character?
      Last edited by Thomas Brot; 01 Mar 2023, 13:58.
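
      On reading the -chartab- table in this post against the command as posted in #1, two rows are worth a second look. Decimal 160 is a NO-BREAK SPACE, not the ordinary SPACE (decimal 32), and decimal 8,217 is a RIGHT SINGLE QUOTATION MARK, not the straight apostrophe (decimal 39) that appears in the -strpos()- argument in #1. The pairs look alike on screen, but -strpos()- compares the underlying bytes. The sketch below uses short made-up fragments rather than the real data, and builds the special characters with uchar() from the decimal codes in the table.

      Code:
      * curly apostrophe (U+2019) in the haystack, straight apostrophe in the needle: displays 0
      display strpos("d" + uchar(8217) + "accueil", "d'accueil")

      * no-break space (U+00A0) in the haystack, ordinary space in the needle: displays 0
      display strpos("accueil" + uchar(160) + "?", "accueil ?")

      * a no-break space is one character but two bytes in UTF-8: displays 1, then 2
      display ustrlen(uchar(160))
      display strlen(uchar(160))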

      • #4
        I tried running

        Code:
        replace varname = "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d'accueil ?  " in 147
        levelsof varname if strpos(varname, "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d'accueil ?  ") > 0, local(C5)
        local N_C5 = r(N)
        with the first line copied and pasted exactly as I received it, and it produced the same problem. However, when I ran:

        Code:
        replace varname = "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d'accueil ?  " in 147
        levelsof varname if strpos(varname, "Votre structure a-t-elle des partenariats professionnels avec des structures en dehors de votre site d'accueil ?  ") > 0, local(C5)
        local N_C5 = r(N)
        after typing the code manually, it worked this time. So it seems it wasn't a bug in -levelsof-. This is going to be a very long night! I will have to rewrite dozens of lines manually. If anyone is more familiar with this issue, I would appreciate their help.
        Last edited by Thomas Brot; 01 Mar 2023, 14:50.
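
        A possible alternative to retyping every line, offered as a sketch rather than a tested fix: normalize the two look-alike characters reported by -chartab- in #3 in a copy of the variable, then match against that copy. It assumes the no-break spaces (decimal 160) and the curly apostrophe (decimal 8,217) are the only culprits, which should be checked with -chartab- on the other affected observations. The name varname_clean is hypothetical, and the search text is deliberately shortened: -strpos()- matches substrings, so leaving out the punctuation and trailing blanks avoids guessing where the special spaces sit, at the cost of also matching any other value that contains the same fragment.

        Code:
        * copy of varname with no-break spaces and curly apostrophes replaced
        * by their plain ASCII look-alikes (. means replace all occurrences)
        generate varname_clean = usubinstr(varname, uchar(160), " ", .)
        replace varname_clean = usubinstr(varname_clean, uchar(8217), "'", .)

        * select on the cleaned copy, but collect the levels of the original
        levelsof varname if strpos(varname_clean, "en dehors de votre site d'accueil") > 0, local(C5)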
