
  • Fix no room to add more variables because of width issue when splitting a long string

    Hi all,

    I have a single string that is extremely long. It has a length of about 600,000 characters and 90,000 words separated by single spaces.

    I want each word of this string to become its own observation, so that I end up with 90,000 observations, one per word of the initial long string.

    What would be the most efficient way to achieve this?

    I tried using the split command with a variety of separators in the parse option. The idea is to split the string by spaces or some other separator and then reshape it from wide to long. Two examples I tried include:

    Code:
    clear
    set maxvar 32767
    
    split text, parse(" and")
    split text, parse(" ")

    Naturally, no matter what separator I use to split the string, Stata returns a "no room to add more variables because of width" error. I understand that this is happening because my string is so long that Stata is reaching the maximum number of variables allowable per observation.

    Is there a workaround to this issue to get to my final objective of converting the single long string with 90,000 words into a dataset with 90,000 observations/words?

    FYI, I am attaching an example string that I tried splitting. I did not include the string in the code above due to its immense size.

    Regards,
    Tasneem

  • #2
    Just an update: I figured out some code that gives me what I want, i.e. 90,000 words contained in one string converted to 90,000 observations. However, it is excruciatingly slow since it creates and appends 90,000 temporary files. I am sure there is a better way of going about this.

    Code:
    use long_string, clear
    gen word_count = wordcount(text)
    
    * Word count gives 92,900 words so use this in the loop below
    
    forvalues i = 1/92900 {
        use long_string, clear
        gen word`i' = word(text,`i')
        keep word`i'
        rename word`i' text
        tempfile temp`i'
        quietly save "`temp`i''", replace
    }
    
    clear
    forvalues i = 1/92900 {
        append using "`temp`i''"
    }

    If the above code is the only way forward, I would appreciate it if someone could tell me how to feed the word count of the string into the loop without manually specifying 92,900 at the start of the loop (I'm thinking macros but can't seem to get it right).

    Thanks,
    Tasneem



    • #3
      If you are running version 16 or 17, you can do this with frames:
      Code:
      clear*
      
      use long_string
      
      gen long wc = wordcount(text)
      local wc = wc[1]
      
      frame create long_words str2045 word
      
      gen temp = ""
      forvalues i = 1/`wc' {
          quietly replace temp = word(text, `i')
          frame post long_words (temp[1])
      }
      
      frame change long_words
      compress
      des
      At the end of this code, the data set in frame long_words will contain a single variable, called word, that contains 92,900 observations, with one word from the long string in each.



      On my setup this ran in 66 seconds.



      • #4
        I initially started to write some code that would bite off one word at a time, but then I had a better idea. You can "recast" your problem as one of data export and re-import. The approach requires that you can identify word boundaries and insert your own word delimiters. Clearly that works here since -word()- uses spaces as word boundaries. You can then export the word list to a text file and import it back as a CSV file. Here's a proof of concept.

        Code:
        input strL(words)
        "one two three four five"
        end
        
        * create your own delimiter between words.
        replace words = ustrregexra(words, " ", ",", .)
        
        * note: must use a different delimiter than the comma used above or Stata will wrap output string in quotes.
        tempfile mywords
        export delimited words using `"`mywords'"', novarnames replace delim("$")
        
        import delimited using `"`mywords'"', clear varnames(nonames) delim(",")
        gen `c(obs_t)' rownum = _n
        reshape long v, i(rownum) j(wordnum)
        rename v word
        drop rownum
        list
        Result

        Code:
        . list
        
             +-----------------+
             | wordnum    word |
             |-----------------|
          1. |       1     one |
          2. |       2     two |
          3. |       3   three |
          4. |       4    four |
          5. |       5    five |
             +-----------------+
        Edit: Clyde's method is superior to mine and should be used. It's simple and efficient. My method won't work because -reshape- doesn't work with -strL-s and, despite having only ~93k words, I also get an error about too many variables for my version of Stata. What is puzzling is that the error appears even though I have Stata MP, which has a maximum number of variables of 120k.
        Last edited by Leonardo Guizzetti; 16 Aug 2022, 21:07.



        • #5
          I still liked the idea of exporting and importing, so I tried it again for fun. This version tweaks my earlier attempt and has none of its drawbacks. The delimiter is a new line, so on import each word is already on its own line.

          Code:
          use long_string, clear
          
          gen words = ustrtrim(text)
          di wordcount(words)
          
          * change the delimiter to a line break (char(13) is a carriage return)
          replace words = ustrregexra(words, " ", "`=char(13)'", .)
          
          tempname fh
          tempfile mywords
          file open `fh' using "`mywords'", write text replace
          file write `fh' (words[1])
          file close `fh'
          type `"`mywords'"', lines(10)
          
          import delimited word using "`mywords'", clear varnames(nonames) delim("!") stringcols(_all)
          list in 1/10
          This took <1 second on my system using your data as input.



          • #6
            Re #5: Wow! Yes, it took only 0.56 seconds on my setup as well. Very nice!



            • #7
              Thanks a lot for your inputs, Clyde and Leonardo.

              I tried all the suggestions above and indeed, Leonardo's final piece of code appears to be the most efficient; I got what I wanted in less than a second too.


              Cheers,
              Tasneem
              Last edited by Tasneem Mohammed; 16 Aug 2022, 21:57.



              • #8
                Let me improve slightly on Clyde Schechter's solution from post #3, while first acknowledging the elegance of Leonardo Guizzetti's solution from post #5. The latter wins the medal for thinking so far outside the box that you're in a different time zone than the box.

                For a situation where for some reason a looping solution is required, the code in post #3 suffers from having to dig progressively deeper into the text string to find successive words. By spending a little time to remove the first word from text after it has been extracted and posted to the long_words frame, we restrict the code to always finding the first word of what remains of the text string. Technically, this reduces the process from having a time roughly proportional to the square of the number of words to one having time roughly linear in the number of words. For a string as long as the one in question, the difference is substantial.
                Code:
                timer clear
                
                clear*
                
                use "~/Downloads/long_string"
                
                timer on 1
                
                gen long wc = wordcount(text)
                local wc = wc[1]
                
                frame create long_words str2045 word
                
                gen temp = ""
                forvalues i = 1/`wc' {
                    quietly replace temp = word(text, `i')
                    frame post long_words (temp[1])
                }
                
                frame change long_words
                compress
                des
                
                timer off 1
                
                frame change default
                frame drop long_words
                
                clear
                
                use "~/Downloads/long_string"
                
                timer on 2
                
                replace text = trim(text)
                frame create long_words str2045 word
                
                gen temp = ""
                while text!="" {
                    quietly replace temp = word(text, 1)
                    quietly replace text = substr(text,length(temp)+2,.)
                    frame post long_words (temp[1])
                }
                
                frame change long_words
                compress
                des
                
                timer off 2
                timer list
                Code:
                . timer list
                   1:     39.15 /        1 =      39.1520
                   2:      5.94 /        1 =       5.9430
                The initial 40% reduction in time compared to post #3 is perhaps due to running on this year's MacBook Air with the M2 version of Apple Silicon.
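
                One caveat on the second loop: it steps past exactly one space after each word (length(temp)+2), so it relies on the words being separated by single spaces, as described in #1. A minimal sketch of guarding against repeated spaces by collapsing them with stritrim() before the loop; the toy string below is made up for illustration and is not the attached data:
                Code:
                clear
                input strL text
                "  one   two three  "
                end
                
                * strip leading/trailing spaces, then collapse runs of internal spaces to one
                replace text = stritrim(strtrim(text))
                
                frame create long_words str2045 word
                
                gen temp = ""
                while text[1] != "" {
                    * take the first remaining word, then cut it (and one space) off the front
                    quietly replace temp = word(text, 1)
                    quietly replace text = substr(text, length(temp) + 2, .)
                    frame post long_words (temp[1])
                }
                
                frame long_words: list
                Without the stritrim() step, two or more double spaces in the string would gradually shift the cut point, and the loop could post a word twice.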



                • #9
                  I'd offer one small change to Leonardo Guizzetti's nice solution, intended for the benefit of those of us who can never recall how to use -file open-, -file write-, etc. correctly. <grin> That is to use -filewrite()- to save the file containing the words in a one-per-line layout. That's a function I can usually almost remember how to use. I'm presuming here that the file with long_string has just one observation.
                  Code:
                  gen b = filewrite("`mywords'", words)
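
                  For readers without the attachment, here is a hedged, self-contained sketch of how that line could slot into the approach from #5. The toy string and the variable name bytes are made up for illustration. As assumptions of mine rather than anything from #5, it uses subinstr() with char(10) (a line feed) in place of the ustrregexra()/char(13) call, and passes 1 as the third argument of filewrite() so that an existing file is overwritten. Like #5, it assumes no word contains the "!" used as a dummy delimiter on import.
                  Code:
                  clear
                  input strL text
                  "one two three four five"
                  end
                  
                  * put each word on its own line; char(10) is a line feed
                  gen words = subinstr(ustrtrim(text), " ", char(10), .)
                  
                  tempfile mywords
                  * filewrite() returns the number of bytes written; 1 = overwrite
                  gen long bytes = filewrite("`mywords'", words, 1)
                  
                  * "!" is a delimiter not expected in the text, so each line is one field
                  import delimited word using "`mywords'", clear varnames(nonames) delim("!") stringcols(_all)
                  list
                  Because filewrite() is evaluated once per observation, this only makes sense with a single observation in memory, which is the presumption stated above.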



                  • #10
                    That's a nice convenience function, Mike Lacy. I had forgotten that it exists, thanks.

