Manipulating Strings with ObjectPAL
© 2002 Al Breveleri


Previous Section: Part 1: General Considerations and Part 2: Searching In A String


3. Parsing Grammatically

3.1. Importing Delimited Text Data

Delimited text data is sometimes preferred over fixed-column text data because the records can vary in length, which usually makes the file shorter. All data is expressed as text, with numeric values written as their text representations.

The data is organized into records, which are marked with a record-ender character (or sequence). This marking allows the records to vary in length. Text data files practically always use one line per record. Otherwise the file is almost impossible to handle. All my examples assume that a line-ender is used as a record-ender.

The records are organized into fields, which are marked with a field separator character. The separator is most often a comma or a tab, but anything may be used.

A difficulty immediately arises if the data in any field happens to contain the field-separator character. For example, a comma or tab may appear in a string, or a comma may appear in a number. If this cannot be avoided, then a field delimiter character must be defined. Typically the quote (") character is used. Any text bracketed by a pair of delimiters is defined as one field of data, even if it contains separators.

Now a difficulty arises if any data happens to contain a field delimiter character. There is no good solution to this problem, but it is apparently much easier to avoid delimiters in data than to avoid separators in data. Go figure.

A delimited text data file may be specified as having no delimiter defined, as having a delimiter used to bracket all data fields, or as having a delimiter used to bracket only those fields that need it. I think it is best to be prepared for delimiters to appear around any field in any record.
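To make the separator/delimiter distinction concrete, here is a minimal sketch in Python rather than ObjectPAL, chosen only so that it can be run outside Paradox. It leans on Python's standard csv module, which is an assumption of this illustration and not part of the technique described in this article. Beware the naming clash: what csv calls the 'delimiter' is the separator in this article's terms, and what csv calls the 'quotechar' is the delimiter. The helper name parse_record is hypothetical.

```python
import csv


def parse_record(line, sep=",", dlm='"'):
    """Split one record line into a list of field strings.

    sep is the field separator and dlm is the field delimiter,
    in this article's terminology (csv names them differently).
    """
    # csv.reader expects an iterable of lines; hand it exactly one line.
    return next(csv.reader([line], delimiter=sep, quotechar=dlm))
```

Calling parse_record('1,"Smith, John",42') and parse_record('"1","Smith, John","42"') should both yield three fields, which illustrates why it pays to be prepared for delimiters around any field in any record.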

3.1.1. Reading Lines from a Text File Composed of Lines

It is possible to locate the beginning and end of the next line in the textstream input and then extract the fields directly from the file. However, this is excessively complicated, as every pattern would need to be constructed to stop at the end of a line. Where a file naturally breaks into lines that can be treated separately (as in a typical delimited text import), it is easier to read each line into a string variable for further processing, even though this means copying the data an extra time.

You could read the entire file into a string variable and then apply the breakApart() method, but there are some fussy detail difficulties in dealing with some files using "\n" as a line ender, while others may use "\r\n". The textstream.readLine() method irons out this difference. (readLine() actually reads through the "\n", then discards the trailing "\n" from the input, then discards any trailing "\r".)
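The line-ender handling just described can be modelled in a few lines. This is a minimal sketch in Python, not ObjectPAL, and it models only the behaviour as stated in the paragraph above; it has not been checked against Paradox itself. The function name split_lines is hypothetical.

```python
def split_lines(text):
    """Break text into lines the way readLine() is described above:
    read through the next "\n", drop it, then drop a trailing "\r".
    """
    lines = []
    pos = 0
    while pos < len(text):
        nl = text.find("\n", pos)
        if nl == -1:
            # Last line has no line ender: take the rest of the text.
            line, pos = text[pos:], len(text)
        else:
            # Take everything before the "\n" and step past it.
            line, pos = text[pos:nl], nl + 1
        if line.endswith("\r"):
            line = line[:-1]
        lines.append(line)
    return lines
```

With this model, "a\r\nb\nc" and "a\nb\nc" both come out as the same three lines, which is the point of letting readLine() do the work.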

If you need to have all lines available when processing each line, read the file into a 'lines' string array first. If the lines are independent then just process each line before reading the next -- no 'lines' array is necessary.

Listing 2: Reading a text file when you know none of the lines will be greater than 1023 chars long

; assuming gtsSRC has been opened globally
; as the input textstream
proc PROCESS_TEXT_1()
var
  psLINE  string
  ; ... other variables as needed to process line
endvar
  gtsSRC'home()
  while not gtsSRC'eof()
    gtsSRC'readLine(psLINE)
    ; ...
    ; ... process line now in psLINE
    ; ...
  endwhile
endproc

Listing 3: Reading a text file when you know none of the lines will be greater than 32767 chars long

; assuming gtsSRC has been opened globally
; as the input textstream
proc PROCESS_TEXT_2()
var
  psLINE                        string
  piANCHOR, piFNDBGN, piFNDEND  longint
  ; ...
endvar
  gtsSRC'home()
  piFNDEND = 1
  while true
    piANCHOR = piFNDEND
    piFNDBGN = piANCHOR
    ; At this point, the variables piANCHOR, piFNDBGN,
    ; and piFNDEND all point to the start of a line, where
    ; we want the next search to begin.
    if not gtsSRC'advMatch(piFNDBGN, piFNDEND, "(\r\n)|(\r)|(\n)") then
      ; If a line ender is found, advMatch sets piFNDBGN
      ; to point to the first char of the line ender and
      ; piFNDEND to point to the first char after it*.  If
      ; a line ender is not found, the next two statements
      ; set piFNDBGN and piFNDEND to point to the first
      ; char after the end of the file.
      piFNDBGN = size(gtsSRC)+1
      piFNDEND = piFNDBGN
    endif
    ; Now, piANCHOR, piFNDBGN, and piFNDEND can be
    ; compared to determine what was found, and what
    ; action should be taken:
    ; case                      piFNDBGN       piFNDEND       action
    ; ----------                ----------     ----------     ----------
    ; no text, no line ender    = piANCHOR     = piFNDBGN     quit (end of file)
    ; no text, but line ender   = piANCHOR     next line bgn  process empty line
    ; text but no line ender    curr line end  = piFNDBGN     process line
    ; text and line ender       curr line end  next line bgn  process line
    ; ----------                ----------     ----------     ----------
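    ; Quit if piFNDBGN=piANCHOR and piFNDEND=piFNDBGN (the
    ; first case above).  Testing piFNDEND=piANCHOR alone is
    ; sufficient, because piANCHOR <= piFNDBGN <= piFNDEND
    ; always holds at this point.
    if piFNDEND=piANCHOR then quitloop endif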
    if piFNDBGN=piANCHOR then
      psLINE = blank()
    else
      gtsSRC'setPosition(piANCHOR)
      gtsSRC'readChars(psLINE,piFNDBGN-piANCHOR)
    endif
    ; ...
    ; ... process line now in psLINE
    ; ...
  endwhile
endproc

*Editor's Note: The help file's description of the value of piFNDEND is wrong.

The following modification of the above copes with the fact that strings can be 2GB long but the textstream.readChars() method is restricted to 32KB per read.

Listing 4: Reading a text file in the general case, when you hope none of the lines will be greater than 2147483647 chars long

; assuming gtsSRC has been opened globally
; as the input textstream
proc PROCESS_TEXT_3()
var
  psLINE, psBFFR                string
  piANCHOR, piFNDBGN, piFNDEND  longint
  piREMAINING                   longint
  ; ...
endvar
  gtsSRC'home()
  piFNDEND = 1
  while true
    piANCHOR = piFNDEND
    piFNDBGN = piANCHOR
    if not gtsSRC'advMatch(piFNDBGN,piFNDEND, "(\r\n)|(\r)|(\n)") then
      piFNDBGN = size(gtsSRC)+1
      piFNDEND = piFNDBGN
    endif
    ; case                    piFNDBGN       piFNDEND       action
    ; ----------              ----------     ----------     ----------
    ; no text, no line ender  = piANCHOR     = piFNDBGN     quit (end of file)
    ; no text, but line ender = piANCHOR     next line bgn  process empty line
    ; text but no line ender  curr line end  = piFNDBGN     process line
    ; text and line ender     curr line end  next line bgn  process line
    ; ----------              ----------     ----------     ----------
    if piFNDEND=piANCHOR then quitloop endif
    if piFNDBGN=piANCHOR then
      psLINE = blank()
    else
      gtsSRC'setPosition(piANCHOR)
      ; The textstream.readChars() method is restricted to
      ; 32767 chars per read.  When there is no guarantee
      ; that all input lines will be shorter than that, we
      ; need to use a loop to read in the line 32767 chars
      ; at a time.
      piREMAINING = piFNDBGN-piANCHOR
      psLINE = blank()
      while piREMAINING>0
        gtsSRC'readChars(psBFFR,int(min(piREMAINING,32767)))
        psLINE = psLINE + psBFFR
        piREMAINING = piREMAINING-32767
      endwhile
    endif
    ; ...
    ; ... process line now in psLINE
    ; ...
  endwhile
endproc
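For readers who want to step through the anchor / found-begin / found-end logic of Listings 3 and 4 outside Paradox, here is a minimal sketch in Python, not ObjectPAL. It works on a string already in memory rather than on a textstream, and uses Python's re module in place of advMatch(); both are assumptions of this illustration. Positions are 0-based here, whereas the listings use 1-based positions. The name iter_lines and the chunk parameter are hypothetical.

```python
import re

# Same alternation as in the listings: try "\r\n" first, then lone "\r" or "\n".
LINE_ENDER = re.compile(r"(\r\n)|(\r)|(\n)")


def iter_lines(text, chunk=32767):
    """Yield each line of text, copying at most `chunk` chars per step."""
    end = 0
    while True:
        anchor = end
        m = LINE_ENDER.search(text, anchor)
        if m:
            begin, end = m.start(), m.end()
        else:
            # No line ender left: point both just past the end of the text.
            begin = end = len(text)
        if end == anchor:
            break  # no text and no line ender: end of input
        # Copy the line in pieces, as Listing 4 does with readChars().
        parts = []
        pos = anchor
        while pos < begin:
            step = min(begin - pos, chunk)
            parts.append(text[pos:pos + step])
            pos += step
        yield "".join(parts)
```

Passing a small chunk value, such as 2, is an easy way to exercise the inner loop without needing a 32KB line.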

3.1.2. Breaking the Fields Out of a Record Line

If the data has separators but no delimiters, then the separator character cannot appear in any data. The breakApart() method will extract the fields without further sophistication. This is a special case that doesn't come up often in general data entry, but can be reliable when you also control the export that produces the data.

Listing 5: Extracting fields from a record with separators and no delimiters. This code snippet is a candidate replacement for the 'process line now in psLINE' section above.

; assuming separator is in gsSEP
proc PROCESS_TEXT_2()
var
  ; ...
  pasFIELDS      array [] string
  II             longint
  ; ... other variables as needed to process line
endvar
  ; ...
  ; ... get next line into psLINE
  ; ...
    ; If the subject string ends with a separator
    ; character, breakApart() does not generate a
    ; corresponding final element after the separator.
    ; By appending a separator character to the end of the
    ; subject string, we force a final element.  If the
    ; subject string does not end with a separator
    ; character, appending a separator has no effect.
    breakApart(psLINE+gsSEP,pasFIELDS,gsSEP)
    for II from 1 to size(pasFIELDS)
      ; ...
      ; ... next datum is pasFIELDS[II]
      ; ...
    endfor
endproc
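The reason for appending a separator in Listing 5 is easier to see with a small model. The following is a minimal sketch in Python, not ObjectPAL. The function break_apart is a hypothetical stand-in that imitates only the one behaviour described in the listing's comments (no final element after a trailing separator). It is not a faithful reimplementation of the real breakApart() method.

```python
def break_apart(subject, sep):
    """Model of the behaviour described in Listing 5's comments:
    a trailing separator does not produce a final empty element.
    """
    parts = subject.split(sep)
    if parts and parts[-1] == "":
        parts.pop()  # drop the element that would follow a trailing separator
    return parts


def fields_of(line, sep):
    """Apply the trick from Listing 5: append one separator first."""
    return break_apart(line + sep, sep)
```

Under this model, break_apart("a,b,", ",") loses the empty third field, while fields_of("a,b,", ",") keeps it and fields_of("a,b", ",") is unaffected.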

Generally, 'delimited text data' has both delimiters and separators. The whole point of using delimiters is so separators can appear in string field data. Separator characters within delimited fields must be ignored. This means that the delimiters must be located first. Here is a technique using the string.breakApart() method.

Listing 6: Extracting fields from a record in the general case, with both separators and delimiters. This code snippet is a candidate replacement for the 'process line now in psLINE' section above.

; assuming separator is in gsSEP and delimiter in gsDLM
proc PROCESS_TEXT_3()
var
  ; ...
  pasTOKENS, pasFIELDS    array [] string
  II, JJ                  longint
  ; ... other variables as needed to process line
endvar
  ; ...
  ; ... get next line into psLINE
  ; ...
    psLINE'breakApart(pasTOKENS,gsDLM)
    ; Now, even numbered items in pasTOKENS were inside
    ; delimited fields, and odd numbered items were
    ; everything between delimited fields.
    for II from 1 to size(pasTOKENS) step 2
      ; process text outside quotes from pasTOKENS[II]
      breakApart(pasTOKENS[II]+gsSEP,pasFIELDS,gsSEP)
      for JJ from iif(II=1,1,2) to iif(II=size(pasTOKENS), size(pasFIELDS), size(pasFIELDS)-1)
        ; ...
        ; ... next datum is pasFIELDS[JJ] (not delimited)
        ; ...
      endfor
      ; Check for an odd number of items in pasTOKENS --
      ; this happens whenever the last field in a record is
      ; not delimited.
      if II<>size(pasTOKENS) then
        ; process text inside quotes
        ; from pasTOKENS[II+1]
        ; ...
        ; ... next datum is pasTOKENS[II+1] (was delimited)
        ; ...
      endif
    endfor
endproc
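The delimiter-first technique of Listing 6 can be tried out with the following minimal sketch in Python, not ObjectPAL. It uses Python's str.split() in place of breakApart(), which is an assumption of this illustration; the two do not necessarily treat edge cases identically. Because Python lists are 0-based, the tokens that were inside delimiters are the odd-indexed ones here, not the even-numbered ones as in the listing. The name split_record is hypothetical.

```python
def split_record(line, sep=",", dlm='"'):
    """Split a record line that may contain delimited fields."""
    fields = []
    tokens = line.split(dlm)
    last = len(tokens) - 1
    for i, token in enumerate(tokens):
        if i % 2 == 1:
            # Text that was inside a pair of delimiters: one whole field.
            fields.append(token)
            continue
        # Text outside delimiters: ordinary separator-split fields.
        pieces = token.split(sep)
        # The piece touching a neighbouring delimited field is just the
        # empty text between a separator and a delimiter, so skip it.
        start = 0 if i == 0 else 1
        stop = len(pieces) if i == last else len(pieces) - 1
        fields.extend(pieces[start:stop])
    return fields
```

For example, split_record('a,"b,c",d') gives three fields, with the comma inside the delimited field left alone. As noted earlier, a delimiter character inside the data itself is not handled.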

3.2. Finding SGML Tags in a Text File

Here, as opposed to the delimited text data case, it's a waste of time to consider the file in terms of lines. Even a single tag may cross a line boundary. It is best to search the text file for the tag location, then use textstream.setPosition() and textstream.readChars() to extract the tag.

3.2.1. Finding a Single Tag

Listing 7: Find the next '<XXX ...>' tag in an opened textstream after the current position.

; assuming gtsSRC has been opened globally
; as the input textstream
proc FIND_TAG()
var
  ; ...
  psTAGSTR   string   ; tag will be copied to this variable
  piBGNPSN, piENDPSN  longint
endvar
  ; ...
  ; start searching at the current position
  piBGNPSN = gtsSRC'position()
  if gtsSRC'advMatch(piBGNPSN,piENDPSN, "<XXX([ \t\r\n]+[^>]*)?>") then
    gtsSRC'setPosition(piBGNPSN)
    gtsSRC'readChars(psTAGSTR,piENDPSN-piBGNPSN)
    ; tag with attributes is now in psTAGSTR
    ; current file position is now first char after the tag
  else
    ; tag not found
  endif
endproc

The pattern is intended to match the 'XXX' tag with or without attributes. Here is how the pattern was built up:

"<XXX>"
    matches the single tag or opening tag 'XXX' without attributes.

"<XXX()?>"
    matches the 'XXX' tag with possibly something after the tag name.

"[ \t\r\n]+"
    matches whitespace -- at least one space, tab, return, or newline.

"[^>]*"
    matches attributes -- zero or more characters that are not '>'.

"<XXX([ \t\r\n]+[^>]*)?>"
    matches the 'XXX' tag with possibly (whitespace followed by attributes) after the tag name. The whitespace is necessary to avoid matching a tag with a longer name beginning with 'XXX'.
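The pattern can be exercised outside Paradox with the following minimal sketch in Python, not ObjectPAL. It assumes that this particular pattern means the same thing to Python's re module as it does to advMatch(), which holds for the constructs used here as far as I know but is not guaranteed in general. Positions are 0-based, and the name find_tag is hypothetical.

```python
import re

# The pattern built up above, unchanged.
TAG = re.compile(r"<XXX([ \t\r\n]+[^>]*)?>")


def find_tag(text, start=0):
    """Return (begin, end, tag_text) for the next 'XXX' tag, or None.

    end is the position of the first char after the tag.
    """
    m = TAG.search(text, start)
    if m is None:
        return None
    return m.start(), m.end(), m.group(0)
```

Trying it on "<XXXY>" shows the purpose of the mandatory whitespace: the longer tag name is not mistaken for 'XXX'.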

3.2.2. Finding a Tag Pair

Listing 8: Find the next '<XXX ...>...</XXX>' tag pair in an opened textstream after the current position.

; assuming gtsSRC has been opened globally
; as the input textstream
proc FIND_TAG_PAIR()
var
  ; ...
  psTAGSTR   string   ; gets tag pair and all enclosed text
  piBGNPSN, piTMPPSN, piENDPSN  longint
endvar
  ; ...
  ; start searching at the current position
  piBGNPSN = gtsSRC'position()
  if gtsSRC'advMatch(piBGNPSN,piTMPPSN, "<XXX([ \t\r\n]+[^>]*)?>") then
    if gtsSRC'advMatch(piTMPPSN,piENDPSN,"</XXX>") then
      gtsSRC'setPosition(piBGNPSN)
      gtsSRC'readChars(psTAGSTR,piENDPSN-piBGNPSN)
      ; Tag pair with attributes and enclosed text is
      ; now in psTAGSTR.  Current file position is now
      ; first char after the closing tag.
    else
      ; opening tag found, but no closing tag
    endif
  else
    ; tag not found
  endif
endproc

This proc will work properly only if tags of the specified name are never nested. When the proc encounters nested tag pairs, it incorrectly matches the next opening tag found and the next closing tag found, because those are the first it sees.
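The nesting problem is easy to reproduce. Here is a minimal sketch in Python, not ObjectPAL, that follows the same two-step search as Listing 8, using Python's re module in place of advMatch() (an assumption of this illustration). The name find_first_pair is hypothetical.

```python
import re

OPEN_TAG = re.compile(r"<XXX([ \t\r\n]+[^>]*)?>")
CLOSE_TAG = re.compile(r"</XXX>")


def find_first_pair(text, start=0):
    """Pair the next opening tag with the next closing tag after it.

    Returns the matched text, or None if either tag is missing.
    Correct only when 'XXX' tags are never nested.
    """
    opening = OPEN_TAG.search(text, start)
    if opening is None:
        return None
    closing = CLOSE_TAG.search(text, opening.end())
    if closing is None:
        return None
    return text[opening.start():closing.end()]
```

Given "<XXX a><XXX b>inner</XXX>tail</XXX>", this returns "<XXX a><XXX b>inner</XXX>", which pairs the outer opening tag with the inner closing tag and leaves the result unbalanced.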

3.2.3. Finding Balanced Tag Pairs

The easiest way I know to find balanced nested tag pairs is to find all the opening and closing tags first and list them by location. Furthermore, this is the fastest way I know of to accomplish this task.

Okay, it's the only way I know how to do it. It's probably close to the best technique, though.

Listing 9: Here's how to use a dynarray to find the first balanced '<XXX...>...</XXX>' tag pair after the current file position when tags of this type may be nested.

; assuming gtsSRC has been opened globally
; as the input textstream
proc FIND_BALANCED_TAG_PAIR()
var
  ; ...
  pdsTAGS                       dynarray [] string
  psBFFR, psTAGTXT              string
  piANCHOR, piFNDBGN, piFNDEND  longint
  piLEVEL                       longint
endvar
  ; ...
  ; preclear the list of opening and closing tags
  pdsTAGS'empty()
  ; record the search start position
  ; (current read position in this example)
  piANCHOR = gtsSRC'position()
  ; Find all the opening tags of the specified name.
  piFNDEND = piANCHOR
  while true
    piFNDBGN = piFNDEND
    ; Construction of the pattern is described
    ; in section 3.2.1.
    if not gtsSRC'advMatch(piFNDBGN,piFNDEND, "<XXX([ \t\r\n]+[^>]*)?>") then
      quitloop
    endif
    ; Record the opening tag text, without the '<' and '>'.
    gtsSRC'setPosition(piFNDBGN+1)
    gtsSRC'readChars(psBFFR,piFNDEND-piFNDBGN-2)
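    ; Use the tag beginning location as
    ; the dynarray index for this entry.
    ; The location is formatted to a fixed width of 10
    ; with leading zeros so that the string indexes
    ; sort in the same order as the file positions.
    pdsTAGS[format("w10,ez",piFNDBGN)] = psBFFR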
  endwhile
  ; Find all the closing tags of the specified name.
  piFNDEND = piANCHOR
  while true
    piFNDBGN = piFNDEND
    ; I trust this pattern is obvious.
    if not gtsSRC'advMatch(piFNDBGN,piFNDEND,"</XXX>") then
      quitloop
    endif
    ; For each closing tag, enter "/" in the dynarray.
    ; This facilitates discriminating between opening and
    ; closing tags.  Use the location of the closing tag's
    ; final '>' (piFNDEND-1) as the dynarray index for this
    ; entry.  Using piFNDEND itself would collide with the
    ; index of an opening tag that immediately follows the
    ; closing tag.
    pdsTAGS[format("w10,ez",piFNDEND-1)] = "/"
  endwhile
  ; Scan the dynarray.  The string variable 'psTAGTXT' will
  ; be blank until an opening tag is seen.  As soon as that
  ; happens, begin incrementing 'piLEVEL' for each opening
  ; tag and decrementing it for each closing tag.  When
  ; 'piLEVEL' becomes zero again, the matching closing
  ; tag has been found.
  psTAGTXT = blank()
  piLEVEL = 0
  foreach psBFFR in pdsTAGS
    if pdsTAGS[psBFFR]<>"/" then  ; opening tag
      if psTAGTXT=blank() then
        piFNDBGN = longint(psBFFR)
        psTAGTXT = pdsTAGS[psBFFR]
      endif
      piLEVEL = piLEVEL+1
    else                          ; closing tag
      if psTAGTXT<>blank() then
        piLEVEL = piLEVEL-1
        if piLEVEL<=0 then
          ; Convert the index back to the position of
          ; the first char after the closing tag.
          piFNDEND = longint(psBFFR)+1
          quitloop
        endif
      endif
    endif
  endforeach
  if psTAGTXT=blank() then
    ; ...
    ; ... no opening tag was found
    ; ...
  else
    if piLEVEL<>0 then
      ; ...
      ; ... ERROR -- no matching closing tag was found
      ; ...
    else
      ; piFNDBGN =
      ;  file position of first char in opening tag
      ; piFNDEND =
      ;  file position of first char after closing tag
      ; psTAGTXT =
      ;  text of opening tag and attributes without '<' '>'
      ; ...
      ; ... process tag pair
      ; ...
    endif
  endif
  ; ...
endproc
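The same list-then-scan idea can be tried outside Paradox with the following minimal sketch in Python, not ObjectPAL. It works on a string in memory, uses Python's re module in place of advMatch(), and uses a sorted list of (position, text) pairs in place of the dynarray; all three are assumptions of this illustration. Positions are 0-based, and the name find_balanced_pair is hypothetical.

```python
import re

OPEN_TAG = re.compile(r"<XXX([ \t\r\n]+[^>]*)?>")
CLOSE_TAG = re.compile(r"</XXX>")


def find_balanced_pair(text, start=0):
    """Return (begin, end, tag_text) for the first balanced pair, or None.

    begin is the position of the opening '<', end is the position of
    the first char after the closing tag, and tag_text is the opening
    tag without its '<' and '>'.
    """
    # Opening tags are listed by their beginning position ...
    events = [(m.start(), m.group(0)[1:-1])
              for m in OPEN_TAG.finditer(text, start)]
    # ... and closing tags by their ending position, marked with "/".
    events += [(m.end(), "/") for m in CLOSE_TAG.finditer(text, start)]
    # Sort by position.  If a closing tag ends exactly where an opening
    # tag begins, "/" sorts before any tag name, so the close comes first.
    events.sort()

    begin = tag_text = None
    level = 0
    for pos, what in events:
        if what != "/":            # opening tag
            if begin is None:
                begin, tag_text = pos, what
            level += 1
        elif begin is not None:    # closing tag after the first opening tag
            level -= 1
            if level == 0:
                return begin, pos, tag_text
    return None  # no opening tag, or no matching closing tag
```

On "<XXX a><XXX b>inner</XXX>tail</XXX>" this returns the whole string as the balanced pair, with "XXX a" as the opening tag text.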


Part 4: Replacing Parts and Part 5: Building Long Strings




 The information provided on this Web site is not in any way sponsored or endorsed by Corel Corporation.
 Paradox is a registered trademark of Corel Corporation.


 Modified: 15 May 2003


 Copyright © 2001- 2003 Paradox Community. All rights reserved. 
 Company and product names are trademarks or registered trademarks of their respective companies. 
 Authors hold the copyrights to their own works. Please contact the author of any article for details.