[Robelle] [SmugBook] [Index] [Prev] [Next]

Serial Dataset Scans: Good and Bad

A serial scan is an access method that reads an entire dataset in physical order to find what is wanted, rather than following a search key maintained for that dataset. MPE programmers are usually taught that IMAGE/SQL search keys are efficient and serial scans are to be avoided, but that is an over-simplification. There are situations where a serial scan is faster than a keyed search, and vice versa.

Good Serial Scans

Question: What are two ways to perform the following data retrieval?

We have a large detail dataset called Ord-Line with 2,307,685 records. We have a list of 162,770 Ord-Num key values of interest. We want to extract the Ord-Line records for those keyvalues and we want them sorted by Ord-Num. The Ord-Line record size is 308 bytes.

Answer: The traditional solution in IMAGE/SQL is to call the DBFIND intrinsic for each of the 162,770 Ord-Num key values, then repeatedly call DBGET mode-5 (directed access) to retrieve the Ord-Line detail entries on that key-value chain (261,230 in all). If the list of Ord-Num values is sorted, the Ord-Line records will also be sorted. An alternative solution is to call DBGET mode-2 (serial access) for each of the 2,308,685 detail entries (or use Suprtool's Get Command), select out the 261,230 records that match the Ord-Num table, then sort those records. That sounds like a lot of work.

Question: Which is the faster way to perform the retrieval?

Answer: The traditional method takes about 162,770 DBFIND disc reads and 261,230 DBGET disc reads, for a total of 424,000 disc reads. The reason why it takes about one disc read per record accessed is that the access is random and direct -- when you retrieve one record the chances that the next record retrieved is adjacent is vanishingly small.

If you do the serial scan with DBGET it takes 177,514 disc reads. The reason is that each disc read transfers a block containing 13 records. Suprtool uses a 50,000-byte buffer and reads 159 records of 308 bytes per disc read. If you use Suprtool to do the task, it will take only 14,500 total disc reads.

The Suprtool serial scan reduces disc reads by 96.6 percent.

Suprtool can do serial scans much faster than other tools and application programs because it bypasses the overhead of IMAGE/SQL by doing NOBUF/MR access to the underlying file of the dataset.

Bad Serial Scans

Question: Will a serial scan be faster than keyed access if you have 1 million entries in an IMAGE/SQL master dataset and you need to retrieve 1000 key values (which you have in a table). The entries are 400 bytes long and are packed 10 records per block in the master dataset.

Answer: Keyed access takes about 1000 disc reads, since master entries are stored randomly by hashing the key value. Serial access takes about 100,000 disc reads, since they are packed 10 per block. Serial access with Suprtool takes about 6,500 disc reads, since Suprtool reads 12 blocks per access. So even with Suprtool, the keyed access is much faster because we are reading a small portion of the dataset.

Question: You have deleted 90,000 old entries from a dataset with a capacity of 110,000. Why does the time to read the dataset serially remain the same?

Answer: Because IMAGE/SQL must scan the empty space in the dataset to find the remaining entries. There are no pointers around empty chunks in a dataset.

In a master dataset the entries are distributed randomly over the entire capacity and there is no way to find the next occupied entry after N except by looking at N+1, N+2, and so on, until you find it. The time to read a master dataset is proportional to the capacity, not to the number of entries.

In a detail dataset, the entries are stored chronologically adjacent to each other. As you add entries to a detail dataset, it moves up an EOF marker called the highwater mark. Serial scans must read all the space up to the highwater mark. When you delete an entry, it is put on a delete-chain for reuse on the next add operation, but the highwater mark is not necessarily reduced. The time to read a detail dataset is proportional to the highwater mark, not to the number of entries.

Moral: Use a dataset packing tool such as Detpack from Adager after you clean out a detail dataset. This will reduce the highwater mark by repacking the entries. For a master dataset, reduce the capacity if you won't need the empty space in the near future.

Question:After cleaning out the 90,000 detail entries, why does every user who runs the on-line application against this dataset have to wait a much longer time for the program to initialize itself?

Answer: The initialization section of the program calls DBGET with mode-2 (serial read) to initialize the field list (so that subsequent calls can use "*"). Each call takes a long time to find the first occupied entry in the dataset. This is because the entries were added to the dataset in chronological order and the oldest 90,000 records were deleted. Thus, the first 90,000 entries in the dataset are empty and have to be skipped by the first DBGET mode-2 call. Moral: use DBGET mode-1 to initialize field lists. The DBGET will fail, but the field list will be initialized anyway.

David Greer

[Robelle] [SmugBook] [Index] [Mpetips] [Prev] [Next]