Batching With Guava

October 14, 2019

Today we’re going to take a quick look at batching. Batching comes into play when you need to make a lot of requests, typically to some external resource like a database, a server, or an API.

The Two Extremes

Let’s say we’re making a productivity app that manages our emails, and we want to add a feature that deletes spam for us.

Without batching, we might have a snippet of code like this:

List<String> emailIds = //...
for (String id : emailIds) {
    try {
        emailServer.deleteEmail(id);
    } catch (Exception e) {/*...*/}
}

This will work correctly, but it’s very inefficient. Each call to the email server involves a network request, which can be slow.[1] Deleting tens of thousands of emails this way would take tens of thousands of requests.

Fortunately, our email server lets us delete multiple emails at once. So, we can just try to delete all of our emails at once:

List<String> emailIds = //...
try {
    emailServer.deleteEmails(emailIds);
} catch (Exception e) {/*...*/}

This is much more efficient than our original snippet, since we only need to make one call to our server. However, we now have issues if our request fails for any reason. In order to retry our request, we’d have to try to delete all of these emails again.

In addition to this, most APIs have limits on how many objects they can process per request. Our server might not even accept our request if we’re trying to delete too many emails at once.

Batching

Batching gives us a way to do something in between these two extremes. We’ll delete a batch of emails at a time.

The basic idea is to define some batch size and to use that number to determine how many emails we’ll send. That might look like this:

private static final int BATCH_SIZE = 10_000;
/* ... */
List<String> emailIds = //...
int start = 0;
int end = BATCH_SIZE;
while (start < emailIds.size()) {
    // This prevents an index out of bounds exception.
    if (end > emailIds.size()) {
        end = emailIds.size();
    }
    try {
        List<String> batch = emailIds.subList(start, end);
        emailServer.deleteEmails(batch);
        start = end;
        end += BATCH_SIZE;
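    } catch (Exception e) {
        // start only advances on success, so this handler must eventually
        // give up (or break); otherwise a batch that keeps failing is
        // retried forever.
        /*...*/
    }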
}

Now we’re deleting emails in batches (in this case, up to 10,000 at a time).
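To make the index arithmetic concrete, here is a minimal sketch that computes how many requests a given batch size leads to and which index ranges the loop visits. The class and method names (BatchBounds, batchCount, batchBounds) and the 25,000-email total are made up for illustration; only the batch size of 10,000 comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of the original post's code.
class BatchBounds {
    /** Number of requests needed: total / batchSize, rounded up. */
    static int batchCount(int total, int batchSize) {
        return (total + batchSize - 1) / batchSize;
    }

    /** The [start, end) index pairs that a batching loop would visit. */
    static List<int[]> batchBounds(int total, int batchSize) {
        List<int[]> bounds = new ArrayList<>();
        for (int start = 0; start < total; start += batchSize) {
            // Clamp the end index so the last batch never runs past the list.
            int end = Math.min(start + batchSize, total);
            bounds.add(new int[] {start, end});
        }
        return bounds;
    }

    public static void main(String[] args) {
        // Hypothetical total of 25,000 emails with the post's batch size.
        System.out.println(batchCount(25_000, 10_000) + " requests");
        for (int[] b : batchBounds(25_000, 10_000)) {
            System.out.println(b[0] + " to " + b[1]);
        }
    }
}
```

With these numbers we’d make 3 requests instead of 25,000, and the last batch is smaller than the others (5,000 emails rather than 10,000).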

This works fine for batching, but it involves manually managing the start and end indices. That isn’t wrong, but it leaves room for human error (such as having a bound be off by one).

It also adds in a decent amount of boilerplate, which makes it harder to see that our code is really about deleting emails.

Fortunately, it’s really easy to get around this with Guava’s partition method.[2] There’s an Iterable version (Iterables.partition) and a List version (Lists.partition). Here’s the List version:

private static final int BATCH_SIZE = 10_000;
/* ... */
List<String> emailIds = //...
List<List<String>> batches =
    Lists.partition(emailIds, BATCH_SIZE);
for (List<String> batch : batches) {
    try {
        emailServer.deleteEmails(batch);
    } catch (Exception e) {/*...*/}
}

Our batching is now much cleaner and less error-prone! It’s much easier to see that our code is trying to batch and delete some emails.
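Batching also shrinks the retry problem from earlier: when a request fails, only that batch needs to be sent again. The snippets in this post leave the catch blocks elided, so what follows is just one possible policy, sketched under assumptions. The EmailServer interface, the deleteInBatches method, and the maxAttempts parameter are all hypothetical, and the small partition helper is a plain-Java stand-in for Lists.partition so that the sketch runs without Guava on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch, not part of the original post's code.
class BatchedDelete {
    /** Hypothetical stand-in for the email server client used in this post. */
    interface EmailServer {
        void deleteEmails(List<String> ids) throws Exception;
    }

    /** Plain-Java stand-in for Lists.partition: consecutive sublists of at most size elements. */
    static <T> List<List<T>> partition(List<T> items, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += size) {
            batches.add(items.subList(start, Math.min(start + size, items.size())));
        }
        return batches;
    }

    /** Deletes batch by batch and returns the IDs whose batch still failed after maxAttempts tries. */
    static List<String> deleteInBatches(
            EmailServer server, List<String> ids, int batchSize, int maxAttempts) {
        List<String> failed = new ArrayList<>();
        for (List<String> batch : partition(ids, batchSize)) {
            boolean deleted = false;
            for (int attempt = 1; attempt <= maxAttempts && !deleted; attempt++) {
                try {
                    server.deleteEmails(batch);
                    deleted = true;
                } catch (Exception e) {
                    // Only this batch is retried; earlier batches are already done.
                }
            }
            if (!deleted) {
                // Give up on this batch, remember its IDs, and move on.
                failed.addAll(batch);
            }
        }
        return failed;
    }
}
```

Capping the number of attempts keeps a batch that always fails from blocking the rest, and returning the failed IDs lets the caller decide what to do with them (log them, retry later, and so on).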


  1. Additionally, lots of small requests are likely to be inefficient when it comes to the amount of data we need to send (e.g. because of request headers).
  2. There are alternative approaches as well, such as those described here, but I prefer using Guava for this.