Better Way To Process Long Streams Of Data
When sequentially processing a huge set of records in an SQL database, for example, what we developers usually do is this:
1. `SELECT COUNT(*)` on the records we will process and save that to a variable
2. `SELECT` again with a `LIMIT`/`OFFSET` (`ROWNUM` if you're using Oracle; argh..)
3. Repeat step 2 until the total of processed records matches the count from step 1 (sketched below)
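In a minimal Python sketch, that looks like the following; `count_records`, `fetch_batch`, and `process` are hypothetical stand-ins for the real database calls, not names from this post:

```python
BATCH_LIMIT = 100  # maximum records per SELECT

def process_with_count(count_records, fetch_batch, process):
    """Count-then-paginate: the three callables wrap the actual DB work."""
    total = count_records()        # step 1: SELECT COUNT(*)
    processed = 0
    while processed < total:       # step 3: repeat until the count is reached
        # step 2: paginated SELECT, e.g. LIMIT/OFFSET
        records = fetch_batch(processed, BATCH_LIMIT)
        for record in records:
            process(record)
        processed += len(records)
```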
I was almost doing that for a task today, and then I realized I could do something much smarter.
Basically, I discarded the notion of a total, since it may not be consistent due to concurrency issues. From there, I just needed some way to figure out when to stop processing.
This is something that applies when:
- You don’t have/need a progress bar
- You have concurrency issues (e.g. when records are added/removed from the database while processing)
So I just did this (pseudo-code):
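Here's a minimal Python sketch of that idea; as before, `fetch_batch` and `process` are hypothetical wrappers around the paginated `SELECT` and the per-record work:

```python
BATCH_LIMIT = 100  # maximum records per SELECT

def process_without_count(fetch_batch, process):
    """Keep fetching batches until one comes back smaller than the limit."""
    offset = 0
    last_page = False
    while not last_page:
        # Fetch the next batch of up to BATCH_LIMIT records
        records = fetch_batch(offset, BATCH_LIMIT)
        for record in records:
            process(record)
        offset += len(records)
        # A short batch means we just read the last "page"
        last_page = len(records) < BATCH_LIMIT
```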
When I get fewer records than the maximum I asked for, it means I'm at the last “page” of the set. Setting a flag lets me control when to stop looking for records:
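In the sketch above, that flag is the one-line comparison:

```python
last_page = len(records) < BATCH_LIMIT  # short batch => nothing left to fetch
```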
If the number of records happens to be exactly divisible by the batch limit, we'll simply issue one extra query that returns no results. E.g.: 1000 records using 100 as the batch limit.
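A quick way to see this, using an in-memory list as a stand-in for the database (illustrative only, not from the original post):

```python
data = list(range(1000))  # stand-in for 1000 database records
BATCH_LIMIT = 100

queries, offset, last_page = 0, 0, False
while not last_page:
    batch = data[offset:offset + BATCH_LIMIT]  # fake paginated SELECT
    queries += 1
    offset += len(batch)
    last_page = len(batch) < BATCH_LIMIT

print(queries)  # 11: ten full pages, plus one empty query that flips the flag
```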