The futurize package allows you to easily turn sequential code
into parallel code by piping the sequential code to the futurize()
function. Easy!
library(futurize)
plan(multisession)
library(tm)
data("crude")
m <- tm_map(crude, content_transformer(tolower)) |> futurize()
This vignette demonstrates how to use this approach to parallelize tm
functions such as tm_map().
The tm package provides a variety of text-mining methods. The
tm_map() function applies transformations to a corpus of text
documents, and TermDocumentMatrix() constructs document-term matrices.
When working with large corpora, these operations benefit greatly from
parallelization.
The tm_map() function applies a transformation to each document in
a corpus:
library(tm)
## Load the crude oil news corpus
data("crude")
## Convert all text to lowercase
m <- tm_map(crude, content_transformer(tolower))
Here tm_map() evaluates sequentially, but we can easily make it
evaluate in parallel by piping to futurize():
library(tm)
library(futurize)
plan(multisession)
data("crude")
m <- tm_map(crude, content_transformer(tolower)) |> futurize()
This will distribute the document transformations across the available parallel workers, given that we have set up parallel workers, e.g.
plan(multisession)
The built-in multisession backend parallelizes on your local
computer and works on all operating systems. There are [other
parallel backends] to choose from, including alternatives to
parallelize locally as well as distributed across remote machines,
e.g.
plan(future.mirai::mirai_multisession)
and
plan(future.batchtools::batchtools_slurm)
The following tm functions are supported by futurize():
tm_map()tm_index()TermDocumentMatrix()