How-To Improve Large String Performance Using Buffers

From Pickwiki
Jump to navigationJump to search

Users >> Rex_Gozar >> How-To Improve Large String Performance Using Buffers

How-To Improve Large String Performance Using Buffers

2007-11-16 by Rex Gozar


Brief

Processes that create or manipulate large amounts of data (over 1MB) may be slow, impacting user response time and system performance. This performance degradation can be traced to Universe needing to reallocate and copy memory repeatedly as the amount of data grows. If you can allocate all the memory that you'll need up front, you can eliminate this bottleneck. This document discusses a coding strategy for doing this.

Summary

In processing a 12MB text file, execution time went from 83 seconds to 4 seconds by simply replacing VAR<-1> "append field" syntax with VAR[PTR,LENGTH] "substring assignment" syntax. Your mileage may vary.


*** Using VAR<-1> "append field" syntax ***
ITEM = ""
LOOP
   VALUE = some large amount of data
   ITEM<-1> = VALUE
WHILE (SOME.CONDITION) REPEAT


*** Preallocating memory and using VAR[PTR,LENGTH] "substring assignment" syntax ***
MAXBYTES = some calculated value
BUFFER = SPACE(MAXBYTES)            ;* allocates memory in one shot
BUFPTR = 0                          ;* always points to last byte in buffer
LOOP
   VALUE = some large amount of data
   BUFFER[(1 + BUFPTR), (1 + LEN(VALUE))] = @FM:VALUE
   BUFPTR += (1 + LEN(VALUE))
WHILE (SOME.CONDITION) REPEAT
BUFFER = BUFFER[2, (BUFPTR - 1)]    ;* strip first @FM


Discussion

In my business application, I work with vendor-supplied text files that contain fixed-width fields. These files can be several megabytes in size. I created a subroutine to convert the fixed-widths fields for each row to tab separated values.

Initially, I used "append field" syntax (i.e. VAR<-1> = VALUE) to build the resulting TSV file. Running it took over 83 seconds -- ouch! I put in display statements to see what was going on, and I noticed that as more rows were processed, the slower and slower it got.

Now for the theoretical stuff: When we create a variable within a program, Universe has to allocate memory to hold the contents of the variable. When we append data to the variable, Universe has to see if the new data will exceed the amount of memory that it allocated. When it does, it has to ask the system for a new chunk of memory, then copy the contents of the old chunk to the new plus the new data. Large character strings tend to trigger this memory reallocation and copy over and over, slowing down the entire process. To keep this from happening, we need to allocate all the memory we need when we first initialize the variable.

The revised program is shown below. I create a NEWBUF that is twice the size of the original ITEM. In my case, I figured that would be big enough to prevent memory reallocation. Elapsed run time went from 83 seconds to 4, over 20 times faster.


0001:       SUBROUTINE FIXED.TO.TSV(ITEM, TABMAP)
0002: * Take a flat file of fixed-width values and insert
0003: * tabs so we don't have to code field widths in all
0004: * the other programs.
0005: ******
0006:
0007:       * Set up a "macro" to get the system seconds
0008:       * and milliseconds for performance timing.
0009:       ***
0010:       EQU GET$TICKS LIT '(SYSTEM(99):((SYSTEM(12) * 1000) "R%3"))'
0011:
0012:       * Our TABMAP contains the field lengths (widths)
0013:       * for each tab delimited field.  Set the maximum
0014:       * attribute mark count (MAXAMC).
0015:       ***
0016:       MAXAMC = DCOUNT(TABMAP, @FM)
0017: ***
0018: * Get the starting ticks, since we also want to
0019: * include memory allocation in our timings.
0020: ***
0021:       START.TICKS = GET$TICKS
0022: ***
0023: * Initialize a variable in memory that's large enough
0024: * hold the existing data and the new tabs we'll be adding.
0025: ***
0026:       NEWBUF = ITEM:ITEM
0027:       NEWPTR = 0
0028: ***
0029: * Loop to remove each line for processing.
0030: ***
0031:       ITEM = ITEM
0032:       LOOP
0033:          LINE = REMOVE(ITEM, MORE.LINES)
0034:          GOSUB CONVERT.LINE
0035:          ***
0036:          * Note that the @FM is prepended to the
0037:          * line, since we are not using <-1>
0038:          * notation.
0039:          ***
0040:          NEWBUF[1+NEWPTR,1+LEN(LINE)] = @FM:LINE
0041:          NEWPTR += (1 + LEN(LINE))
0042:       WHILE MORE.LINES REPEAT
0043: ***
0044: * Strip off the leading @FM.
0045: ***
0046:       ITEM = NEWBUF[2,NEWPTR-1]
0047: ***
0048: * Show the elapsed time in milliseconds.
0049: ***
0050:       ELAPSED = GET$TICKS - START.TICKS
0051:       DISPLAY ELAPSED
0052:       RETURN
0053: *
0054: *
0055: CONVERT.LINE:
0056:       WORKBUF = ""
0057:       WORKPTR = 1
0058:       FOR AMC = 1 TO MAXAMC
0059:          FIELDLEN = TABMAP<AMC>
0060:          WORKBUF<AMC> = LINE[WORKPTR, FIELDLEN]
0061:          WORKPTR += FIELDLEN
0062:       NEXT AMC
0063:       LINE = CONVERT(@FM, CHAR(9), WORKBUF)
0064:       RETURN
0065:    END


Could this subroutine be faster? Sure it can, but the point is all I did was one change and now it runs in 1/20th of the time.


Recommendations

I only use this technique when performance can be noticeably increased. For example, building a savelist in a program may only be a few milliseconds faster using this technique, but who's going to notice? In many cases, readability and maintainability should take precedence over performance.

Always measure performance before and after optimizing your code. And only optimize the parts that take the longest to perform.

http://www.autopower.com/rgozar/pixel.gif