Manipulating Strings with ObjectPAL
© 2002 Al Breveleri

Previous Section: Part 1: General Considerations and Part 2: Searching In A String

3. Parsing Grammatically

3.1. Importing Delimited Text Data

Delimited text data is sometimes preferred over fixed column text data because the records can vary in length, which usually makes the file shorter.

Data is expressed as text, with numeric values written as their text representations.

The data is organized into records, which are marked with a record-ender character (or sequence). This marking allows the records to vary in length. Text data files practically always use one line per record. Otherwise the file is almost impossible to handle. All my examples assume that a line-ender is used as a record-ender.

The records are organized into fields, which are marked with a field separator character. The separator is most often a comma or a tab, but anything may be used.

A difficulty immediately arises if the data in any field happens to contain the field-separator character. For example, a comma or tab may appear in a string, or a comma may appear in a number. If this cannot be avoided, then a field delimiter character must be defined. Typically the quote (") character is used. Any text bracketed by a pair of delimiters is defined as one field of data, even if it contains separators.

Now a difficulty arises if any data happens to contain a field delimiter character. There is no good solution to this problem, but it is apparently much easier to avoid delimiters in data than to avoid separators in data. Go figure.

A delimited text data file may be specified as having no delimiter defined, as having a delimiter used to bracket all data fields, or as having a delimiter used to bracket only those fields that need it. I think it is best to be prepared for delimiters to appear around any field in any record.

3.1.1. Reading Lines from a Text File Composed of Lines

It is possible to locate the beginning and end of the next line in the textstream input and then extract the fields directly from the file. However, this is excessively complicated, as all the patterns would need to be constructed to stop at the end of a line.

Where a file naturally breaks into lines that can be treated separately (as in a typical delimited text import), it is easier to read each line into a string variable for further processing, even though this leads to moving the data an extra time.

You could read the entire file into a string variable and then apply the breakApart() method, but there are some fussy difficulties in dealing with some files that use "\n" as a line ender while others use "\r\n". The textstream.readLine() method irons out this difference. (readLine() actually reads through the "\n", then discards the trailing "\n" from the input, then discards any trailing "\r".)

If you need to have all lines available when processing each line, read the file into a 'lines' string array first. If the lines are independent then just process each line before reading the next -- no 'lines' array is necessary.

Listing 2: Reading a text file when you know none of the lines will be greater than 1023 chars long

    ; assuming gtsSRC has been opened globally
    ; as the input textstream
    proc PROCESS_TEXT_1()
    var
       psLINE string
       ; ... other variables as needed to process line
    endvar
       gtsSRC'home()
       while not gtsSRC'eof()
          gtsSRC'readLine(psLINE)
          ; ...
          ; ... process line now in psLINE
          ; ...
       endwhile
    endproc

Listing 3: Reading a text file when you know none of the lines will be greater than 32767 chars long

    ; assuming gtsSRC has been opened globally
    ; as the input textstream
    proc PROCESS_TEXT_2()
    var
       psLINE string
       piANCHOR, piFNDBGN, piFNDEND longint
       ; ...
    endvar
       gtsSRC'home()
       piFNDEND = 1
       while true
          piANCHOR = piFNDEND
          piFNDBGN = piANCHOR
          ; At this point, the variables piANCHOR, piFNDBGN,
          ; and piFNDEND all point to the start of a line, where
          ; we want the next search to begin.
          if not gtsSRC'advMatch(piFNDBGN, piFNDEND,
                "(\r\n)|(\r)|(\n)") then
             ; If a line ender is found, advMatch sets piFNDBGN
             ; to point to the first char of the line ender and
             ; piFNDEND to point to the first char after it. If
             ; a line ender is not found, the next two statements
             ; set piFNDBGN and piFNDEND to point to the first
             ; char after the end of the file.
             piFNDBGN = size(gtsSRC)+1
             piFNDEND = piFNDBGN
          endif
          ; Now, piANCHOR, piFNDBGN, and piFNDEND can be
          ; compared to determine what was found, and what
          ; action should be taken:
          ; case                      piFNDBGN       piFNDEND       action
          ; ----------                ----------     ----------     ----------
          ; no text, no line ender    = piANCHOR     = piFNDBGN     quit (end of file)
          ; no text, but line ender   = piANCHOR     next line bgn  process empty line
          ; text but no line ender    curr line end  = piFNDBGN     process line
          ; text and line ender       curr line end  next line bgn  process line
          ; ----------                ----------     ----------     ----------
          ; quit if piFNDBGN=piANCHOR and piFNDEND=piFNDBGN
          if piFNDEND=piANCHOR then
             quitloop
          endif
          if piFNDBGN=piANCHOR then
             psLINE = blank()
          else
             gtsSRC'setPosition(piANCHOR)
             gtsSRC'readChars(psLINE,piFNDBGN-piANCHOR)
          endif
          ; ...
          ; ... process line now in psLINE
          ; ...
       endwhile
    endproc

Following is a modification of the above to cope with the fact that strings can be 2GB long but the textstream.readChars() method is restricted to 32KB per read.

Listing 4: Reading a text file in the general case, when you hope none of the lines will be greater than 2147483647 chars long

    ; assuming gtsSRC has been opened globally
    ; as the input textstream
    proc PROCESS_TEXT_3()
    var
       psLINE, psBFFR string
       piANCHOR, piFNDBGN, piFNDEND longint
       piREMAINING longint
       ; ...
    endvar
       gtsSRC'home()
       piFNDEND = 1
       while true
          piANCHOR = piFNDEND
          piFNDBGN = piANCHOR
          if not gtsSRC'advMatch(piFNDBGN,piFNDEND,
                "(\r\n)|(\r)|(\n)") then
             piFNDBGN = size(gtsSRC)+1
             piFNDEND = piFNDBGN
          endif
          ; case                      piFNDBGN       piFNDEND       action
          ; ----------                ----------     ----------     ----------
          ; no text, no line ender    = piANCHOR     = piFNDBGN     quit (end of file)
          ; no text, but line ender   = piANCHOR     next line bgn  process empty line
          ; text but no line ender    curr line end  = piFNDBGN     process line
          ; text and line ender       curr line end  next line bgn  process line
          ; ----------                ----------     ----------     ----------
          if piFNDEND=piANCHOR then
             quitloop
          endif
          if piFNDBGN=piANCHOR then
             psLINE = blank()
          else
             gtsSRC'setPosition(piANCHOR)
             ; The textstream.readChars() method is restricted to
             ; 32767 chars per read. When there is no guarantee
             ; that all input lines will be shorter than that, we
             ; need to use a loop to read in the line 32767 chars
             ; at a time.
             piREMAINING = piFNDBGN-piANCHOR
             psLINE = blank()
             while piREMAINING>0
                gtsSRC'readChars(psBFFR,int(min(piREMAINING,32767)))
                psLINE = psLINE + psBFFR
                piREMAINING = piREMAINING-32767
             endwhile
          endif
          ; ...
          ; ... process line now in psLINE
          ; ...
       endwhile
    endproc

3.1.2. Breaking the Fields Out of a Record Line

If the data has separators but no delimiters, then the separator character cannot appear in any data. The breakApart() method will extract the fields without further sophistication. This is a special case that doesn't come up often in general data entry, but can be reliable when you also control the export that produces the data.

Listing 5: Extracting fields from a record with separators and no delimiters. This code snippet is a candidate replacement for the 'process line now in psLINE' section above.

    ; assuming separator is in gsSEP
    proc PROCESS_TEXT_2()
    var
       ; ...
       pasFIELDS array [] string
       II longint
       ; ... other variables as needed to process line
    endvar
       ; ...
       ; ... get next line into psLINE
       ; ...
       ; If the subject string ends with a separator
       ; character, breakApart() does not generate a
       ; corresponding final element after the separator.
       ; By appending a separator character to the end of the
       ; subject string, we force a final element. If the
       ; subject string does not end with a separator
       ; character, appending a separator has no effect.
       breakApart(psLINE+gsSEP,pasFIELDS,gsSEP)
       for II from 1 to size(pasFIELDS)
          ; ...
          ; ... next datum is pasFIELDS[II]
          ; ...
       endfor
    endproc

Generally, 'delimited text data' has both delimiters and separators. The whole point of using delimiters is so separators can appear in string field data.

Separator characters within delimited fields must be ignored. This means that the delimiters must be located first. Here is a technique using the string.breakApart() method.

Listing 6: Extracting fields from a record in the general case, with both separators and delimiters. This code snippet is a candidate replacement for the 'process line now in psLINE' section above.

    ; assuming separator is in gsSEP and delimiter in gsDLM
    proc PROCESS_TEXT_3()
    var
       ; ...
       pasTOKENS, pasFIELDS array [] string
       II, JJ longint
       ; ... other variables as needed to process line
    endvar
       ; ...
       ; ... get next line into psLINE
       ; ...
       psLINE'breakApart(pasTOKENS,gsDLM)
       ; Now, even numbered items in pasTOKENS were inside
       ; delimited fields, and odd numbered items were
       ; everything between delimited fields.
       for II from 1 to size(pasTOKENS) step 2
          ; process text outside quotes from pasTOKENS[II]
          breakApart(pasTOKENS[II]+gsSEP,pasFIELDS,gsSEP)
          for JJ from iif(II=1,1,2)
                to iif(II=size(pasTOKENS),
                   size(pasFIELDS), size(pasFIELDS)-1)
             ; ...
             ; ... next datum is pasFIELDS[JJ] (not delimited)
             ; ...
          endfor
          ; Check for an odd number of items in pasTOKENS --
          ; this happens whenever the last field in a record is
          ; not delimited.
          if II<>size(pasTOKENS) then
             ; process text inside quotes
             ; from pasTOKENS[II+1]
             ; ...
             ; ... next datum is pasTOKENS[II+1] (was delimited)
             ; ...
          endif
       endfor
    endproc

3.2. Finding SGML Tags in a Text File

Here, as opposed to the delimited text data case, it's a waste of time to consider the file in terms of lines. Even a single tag may cross a line boundary. It is best to search the text file for the tag location, then use textstream.setPosition() and textstream.readChars() to extract the tag.

3.2.1. Finding a Single Tag

Listing 7: Find the next '<XXX ...>' tag in an opened textstream after the current position.

    ; assuming gtsSRC has been opened globally
    ; as the input textstream
    proc FIND_TAG()
    var
       ; ...
       psTAGSTR string ; tag will be copied to this variable
       piBGNPSN, piENDPSN longint
    endvar
       ; ...
       ; start searching at the current position
       piBGNPSN = gtsSRC'position()
       if gtsSRC'advMatch(piBGNPSN,piENDPSN,
             "<XXX([ \t\r\n]+[^>]*)?>") then
          gtsSRC'setPosition(piBGNPSN)
          gtsSRC'readChars(psTAGSTR,piENDPSN-piBGNPSN)
          ; tag with attributes is now in psTAGSTR
          ; current file position is now first char after the tag
       else
          ; tag not found
       endif
    endproc

The pattern is intended to match the 'XXX' tag with or without attributes. Here is how the pattern was built up.
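A minimal sketch of one way to read that pattern piece by piece, assuming advMatch() gives [ ], +, *, ? and ( ) their customary pattern-matching meanings. The proc name BUILD_TAG_PATTERN and its variable names are hypothetical, introduced only for this illustration; they do not appear in the listings.

```
; Hypothetical helper: assembles the pattern used in
; Listing 7 from its component parts.
proc BUILD_TAG_PATTERN(const psNAME string) string
var
   psWHITE, psATTRS string
endvar
   ; At least one whitespace char must separate the tag
   ; name from any attributes. This is what is meant to
   ; keep a search for 'XXX' from matching '<XXXY>'.
   psWHITE = "[ \t\r\n]+"
   ; The attributes themselves: any run of chars that
   ; does not include the closing '>'.
   psATTRS = "[^>]*"
   ; Whitespace plus attributes are grouped and made
   ; optional with '?', so a bare '<XXX>' also qualifies.
   return "<" + psNAME + "(" + psWHITE + psATTRS + ")?>"
endproc
```

Under those assumptions, BUILD_TAG_PATTERN("XXX") yields the same pattern string that is written out literally in Listing 7, so it could be passed as the third argument of gtsSRC'advMatch() when the tag name is only known at run time.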
Listing 8: Find the next '<XXX ...>...</XXX>' tag pair in an opened textstream after the current position.

    ; assuming gtsSRC has been opened globally
    ; as the input textstream
    proc FIND_TAG_PAIR()
    var
       ; ...
       psTAGSTR string ; gets tag pair and all enclosed text
       piBGNPSN, piTMPPSN, piENDPSN longint
    endvar
       ; ...
       ; start searching at the current position
       piBGNPSN = gtsSRC'position()
       if gtsSRC'advMatch(piBGNPSN,piTMPPSN,
             "<XXX([ \t\r\n]+[^>]*)?>") then
          if gtsSRC'advMatch(piTMPPSN,piENDPSN,"</XXX>") then
             gtsSRC'setPosition(piBGNPSN)
             gtsSRC'readChars(psTAGSTR,piENDPSN-piBGNPSN)
             ; tag pair with attributes and enclosed text is
             ; now in psTAGSTR
             ; current file position is now first char after
             ; the closing tag
          else
             ; opening tag found, but closing tag not found
          endif
       else
          ; opening tag not found
       endif
    endproc

This proc will work properly only if tags of the specified name are never nested. When the proc encounters nested tag pairs, it incorrectly matches the next opening tag found and the next closing tag found, because those are the first it sees.

3.2.3. Finding Balanced Tag Pairs

The easiest way I know to find balanced nested tag pairs is to find all the opening and closing tags first and list them by location. Furthermore, this is the fastest way I know of to accomplish this task. Okay, it's the only way I know how to do it. It's probably close to the best technique, though.

Listing 9: Here's how to use a dynarray to find the first balanced '<XXX...>...</XXX>' tag pair after the current file position when tags of this type may be nested.

    ; assuming gtsSRC has been opened globally
    ; as the input textstream
    proc FIND_BALANCED_TAG_PAIR()
    var
       ; ...
       pdsTAGS dynarray [] string
       psBFFR, psTAGTXT string
       piANCHOR, piFNDBGN, piFNDEND longint
       piLEVEL longint
    endvar
       ; ...
       ; preclear the list of opening and closing tags
       pdsTAGS'empty()
       ; record the search start position
       ; (current read position in this example)
       piANCHOR = gtsSRC'position()
       ; Find all the opening tags of the specified name.
       piFNDEND = piANCHOR
       while true
          piFNDBGN = piFNDEND
          ; Construction of the pattern is described
          ; in section 3.2.1.
          if not gtsSRC'advMatch(piFNDBGN,piFNDEND,
                "<XXX([ \t\r\n]+[^>]*)?>") then
             quitloop
          endif
          ; Record the opening tag text, without the '<' and '>'.
          gtsSRC'setPosition(piFNDBGN+1)
          gtsSRC'readChars(psBFFR,piFNDEND-piFNDBGN-2)
          ; Use the tag beginning location as
          ; the dynarray index for this entry.
          pdsTAGS[format("w10,ez",piFNDBGN)] = psBFFR
       endwhile
       ; Find all the closing tags of the specified name.
       piFNDEND = piANCHOR
       while true
          piFNDBGN = piFNDEND
          ; I trust this pattern is obvious.
          if not gtsSRC'advMatch(piFNDBGN,piFNDEND,"</XXX>") then
             quitloop
          endif
          ; For each closing tag, enter "/" in the dynarray.
          ; This facilitates discriminating between opening and
          ; closing tags. Use the tag beginning location as the
          ; dynarray index for this entry too. The ending
          ; location would collide with the index of an opening
          ; tag that follows immediately, as in '</XXX><XXX>'.
          pdsTAGS[format("w10,ez",piFNDBGN)] = "/"
       endwhile
       ; Scan the dynarray. The string variable 'psTAGTXT' will
       ; be blank until an opening tag is seen. As soon as that
       ; happens, begin incrementing 'piLEVEL' for each opening
       ; tag and decrementing it for each closing tag. When
       ; 'piLEVEL' becomes zero again, the matching closing
       ; tag has been found.
       psTAGTXT = blank()
       piLEVEL = 0
       foreach psBFFR in pdsTAGS
          if pdsTAGS[psBFFR]<>"/" then
             ; opening tag
             if psTAGTXT=blank() then
                piFNDBGN = longint(psBFFR)
                psTAGTXT = pdsTAGS[psBFFR]
             endif
             piLEVEL = piLEVEL+1
          else
             ; closing tag
             if psTAGTXT<>blank() then
                piLEVEL = piLEVEL-1
                if piLEVEL<=0 then
                   ; The index is where the closing tag begins.
                   ; Add the length of '</XXX>' (6 chars) to
                   ; get the first char after it.
                   piFNDEND = longint(psBFFR)+6
                   quitloop
                endif
             endif
          endif
       endforeach
       if psTAGTXT=blank() then
          ; ...
          ; ... no opening tag was found
          ; ...
       else
          if piLEVEL<>0 then
             ; ...
             ; ... ERROR -- no matching closing tag was found
             ; ...
          else
             ; piFNDBGN =
             ;    file position of first char in opening tag
             ; piFNDEND =
             ;    file position of first char after closing tag
             ; psTAGTXT =
             ;    text of opening tag and attributes without '<' '>'
             ; ...
             ; ... process tag pair
             ; ...
          endif
       endif
       ; ...
    endproc

Next Section: Part 4: Replacing Parts and Part 5: Building Long Strings

The information provided on this Web site is not in any way sponsored or endorsed by Corel Corporation. Paradox is a registered trademark of Corel Corporation.

Modified: 15 May 2003

Copyright © 2001-2003 Paradox Community. All rights reserved. Company and product names are trademarks or registered trademarks of their respective companies. Authors hold the copyrights to their own works. Please contact the author of any article for details.