Chapter 3.1 Loop Functions
Looping on the Command Line
Writing for and while loops is useful when programming but not particularly easy when working interactively on the command line. Multi-line expressions with curly braces are just not that easy to sort through when working on the command line. R has some functions which implement looping in a compact form to make your life easier.
- lapply(): Loop over a list and evaluate a function on each element
- sapply(): Same as lapply but try to simplify the result
- apply(): Apply a function over the margins of an array
- tapply(): Apply a function over subsets of a vector
- mapply(): Multivariate version of lapply
An auxiliary function split is also useful, particularly in conjunction with lapply.
Summary
- The loop functions in R are very powerful because they allow you to conduct a series of operations on data using a compact form.
- The operation of a loop function involves iterating over an R object (e.g. a list or vector or matrix), applying a function to each element of the object, and then collating the results and returning the collated results.
- Loop functions make heavy use of anonymous functions, which exist for the life of the loop function but are not stored anywhere.
- The split() function can be used to divide an R object into subsets determined by another variable, which can subsequently be looped over using loop functions.
apply()
R Documentation
Apply Functions Over Array Margins
The apply() function is used to evaluate a function (often an anonymous one) over the margins of an array. It is most often used to apply a function to the rows or columns of a matrix (which is just a 2-dimensional array). However, it can be used with general arrays, for example, to take the average of an array of matrices.
Using apply() is not really faster than writing a loop, but it works in one line and is highly compact.
> str(apply)
function (X, MARGIN, FUN, ...)
The arguments to apply() are
- X is an array
- MARGIN is an integer vector indicating which margins should be “retained”.
- FUN is a function to be applied
- ... is for other arguments to be passed to FUN
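To make the arguments concrete, here is a minimal sketch on a small matrix. The matrix x and its values are made up for illustration; only apply(), mean() and sum() come from the text.

```r
# A small 3 x 2 matrix with known values (hypothetical example data)
x <- matrix(1:6, nrow = 3, ncol = 2)

# MARGIN = 2 retains the columns: one mean per column
col_means <- apply(x, 2, mean)

# MARGIN = 1 retains the rows: one sum per row
row_sums <- apply(x, 1, sum)

print(col_means)  # 2 5
print(row_sums)   # 5 7 9
```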
You’ve probably noticed that the second argument is either a 1 or a 2, depending on whether we want row statistics or column statistics. What exactly is the second argument to apply()?
The MARGIN argument essentially indicates to apply() which dimension of the array you want to preserve or retain. So when taking the mean of each column, I specify
> apply(x, 2, mean)
because I want to collapse the first dimension (the rows) by taking the mean and I want to preserve the number of columns. Similarly, when I want the row means, I run
> apply(x, 1, mean)
because I want to collapse the columns (the second dimension) and preserve the number of rows (the first dimension).
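The same idea extends to general arrays. As a rough sketch of "taking the average of an array of matrices", the example below stacks two small matrices into a 3-dimensional array and retains the first two dimensions. The array a and its values are assumptions chosen so the result is easy to check by hand.

```r
# Hypothetical data: two 2 x 2 matrices stacked into a 2 x 2 x 2 array
a <- array(1:8, dim = c(2, 2, 2))

# Retain dimensions 1 and 2, collapse dimension 3:
# the result is the element-wise average of the two matrices
avg <- apply(a, c(1, 2), mean)

print(dim(avg))  # 2 2
print(avg)
```

Because MARGIN is c(1, 2), the result has the same shape as each individual matrix.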
Col/Row Sums and Means
For the special case of column/row sums and column/row means of matrices, we have some useful shortcuts.
- rowSums = apply(x, 1, sum)
- rowMeans = apply(x, 1, mean)
- colSums = apply(x, 2, sum)
- colMeans = apply(x, 2, mean)
The shortcut functions are heavily optimized and hence are much faster, but you probably won’t notice unless you’re using a large matrix. Another nice aspect of these functions is that they are a bit more descriptive. It’s arguably more clear to write colMeans(x) in your code than apply(x, 2, mean).
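A quick way to convince yourself that the shortcuts agree with their apply() counterparts is to compare them directly. This is only a sanity check on a made-up matrix, not a benchmark.

```r
# Hypothetical example matrix
x <- matrix(1:6, nrow = 3, ncol = 2)

# Each shortcut should give the same values as the equivalent apply() call
same_row_sums  <- isTRUE(all.equal(rowSums(x),  apply(x, 1, sum)))
same_row_means <- isTRUE(all.equal(rowMeans(x), apply(x, 1, mean)))
same_col_sums  <- isTRUE(all.equal(colSums(x),  apply(x, 2, sum)))
same_col_means <- isTRUE(all.equal(colMeans(x), apply(x, 2, mean)))

print(c(same_row_sums, same_row_means, same_col_sums, same_col_means))
```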
Description
Returns a vector or array or list of values obtained by applying a function to margins of an array or matrix.
Usage
apply(X, MARGIN, FUN, ...)
Arguments
X
an array, including a matrix.
MARGIN
a vector giving the subscripts which the function will be applied over. E.g., for a matrix 1 indicates rows, 2 indicates columns, c(1, 2) indicates rows and columns. Where X has named dimnames, it can be a character vector selecting dimension names.
FUN
the function to be applied: see ‘Details’. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted.
...
optional arguments to FUN.
Details
If X is not an array but an object of a class with a non-null dim value (such as a data frame), apply attempts to coerce it to an array via as.matrix if it is two-dimensional (e.g., a data frame) or via as.array.
FUN is found by a call to match.fun and typically is either a function or a symbol (e.g., a backquoted name) or a character string specifying a function to be searched for from the environment of the call to apply.
Arguments in ... cannot have the same name as any of the other arguments, and care may be needed to avoid partial matching to MARGIN or FUN. In general-purpose code it is good practice to name the first three arguments if ... is passed through: this both avoids partial matching to MARGIN or FUN and ensures that a sensible error message is given if arguments named X, MARGIN or FUN are passed through ....
Value
If each call to FUN returns a vector of length n, then apply returns an array of dimension c(n, dim(X)[MARGIN]) if n > 1. If n equals 1, apply returns a vector if MARGIN has length 1 and an array of dimension dim(X)[MARGIN] otherwise. If n is 0, the result has length 0 but not necessarily the ‘correct’ dimension.
If the calls to FUN return vectors of different lengths, apply returns a list of length prod(dim(X)[MARGIN]) with dim set to MARGIN if this has length greater than one.
In all cases the result is coerced by as.vector to one of the basic vector types before the dimensions are set, so that (for example) factor results will be coerced to a character array.
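The shape rule for vector-valued FUN is easy to trip over, so here is a small sketch. The matrix x is made up; range() is used only because it always returns a vector of length 2.

```r
# Hypothetical 3 x 2 matrix
x <- matrix(1:6, nrow = 3, ncol = 2)

# range() returns a vector of length n = 2 for every row, so the result
# has dimension c(n, dim(x)[1]) = c(2, 3): one column per row of x
r <- apply(x, 1, range)

print(dim(r))   # 2 3
print(r[, 1])   # 1 4 (min and max of the first row)
```

Note that the per-row results end up in columns, so the output looks transposed relative to the input.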
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
lapply and there, simplify2array; tapply, and convenience functions sweep and aggregate.
lapply()
R Documentation
Apply a Function over a List or Vector
The lapply() function does the following simple series of operations:
- it loops over a list, iterating over each element in that list
- it applies a function to each element of the list (a function that you specify)
- and returns a list (the l is for “list”).
This function takes three arguments: (1) a list X; (2) a function (or the name of a function) FUN; (3) other arguments via its ... argument. If X is not a list, it will be coerced to a list using as.list().
The body of the lapply() function can be seen here.
> lapply
function (X, FUN, ...)
{
FUN <- match.fun(FUN)
if (!is.vector(X) || is.object(X))
X <- as.list(X)
.Internal(lapply(X, FUN))
}
<bytecode: 0x7ff75e13fc00>
<environment: namespace:base>
Note that the actual looping is done internally in C code for efficiency reasons.
It’s important to remember that lapply() always returns a list, regardless of the class of the input.
Here’s an example of applying the mean() function to all elements of a list. If the original list has names, then the names will be preserved in the output.
> x <- list(a = 1:5, b = rnorm(10))
> lapply(x, mean)
$a
[1] 3
$b
Notice that here we are passing the mean() function as an argument to the lapply() function.
Functions in R can be used this way and can be passed back and forth as arguments just like any other object. When you pass a function to another function, you do not need to include the open and closed parentheses () like you do when you are calling a function.
You can use lapply() to evaluate a function multiple times each with a different argument.
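As a minimal sketch of this, the call below runs runif() four times, once for each element of the sequence 1:4. The seed is an assumption added only to make the draws reproducible.

```r
set.seed(1)  # assumed seed, only for reproducibility

x <- 1:4

# Each element of x is handed to runif() in turn,
# so we get 1, 2, 3 and 4 uniform random numbers
draws <- lapply(x, runif)

print(lengths(draws))  # 1 2 3 4
```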
When you pass a function to lapply(), lapply() takes elements of the list and passes them as the first argument of the function you are applying. In the above example, the first argument of runif() is n, and so the elements of the sequence 1:4 all got passed to the n argument of runif().
Functions that you pass to lapply() may have other arguments. For example, the runif() function has a min and max argument too. In the example above I used the default values for min and max. How would you be able to specify different values for that in the context of lapply()?
Here is where the ... argument to lapply() comes into play. Any arguments that you place in the ... argument will get passed down to the function being applied to the elements of the list.
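For instance, a sketch of passing min and max through the ... argument might look like this. The particular values 0 and 10 and the seed are assumptions chosen for illustration.

```r
set.seed(1)  # assumed seed, only for reproducibility

# min and max are not arguments of lapply(); they travel through ...
# and are passed on to runif() every time it is called
draws <- lapply(1:4, runif, min = 0, max = 10)

print(lengths(draws))        # 1 2 3 4
print(range(unlist(draws)))  # all values fall between 0 and 10
```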
The lapply() function and its friends make heavy use of anonymous functions. Anonymous functions have no names. These functions are generated “on the fly” as you are using lapply(). Once the call to lapply() is finished, the function disappears and does not appear in the workspace.
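Here is a small sketch of an anonymous function in action. The list of matrices and the argument name elt are made up for illustration; the function extracts the first column of each matrix and is never assigned a name.

```r
# Hypothetical list of two matrices
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

# The function is defined right inside the call to lapply();
# it exists only for the duration of the call
first_cols <- lapply(x, function(elt) elt[, 1])

print(first_cols$a)  # 1 2
print(first_cols$b)  # 1 2 3
```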
Description
lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array().
sapply(x, f, simplify = FALSE, USE.NAMES = FALSE) is the same as lapply(x, f).
vapply is similar to sapply, but has a pre-specified type of return value, so it can be safer (and sometimes faster) to use.
replicate is a wrapper for the common use of sapply for repeated evaluation of an expression (which will usually involve random number generation).
simplify2array() is the utility called from sapply() when simplify is not false and is similarly called from mapply().
Usage
lapply(X, FUN, ...)
sapply(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
replicate(n, expr, simplify = "array")
simplify2array(x, higher = TRUE)
Arguments
X
a vector (atomic or list) or an expression object. Other objects (including classed objects) will be coerced by base::as.list.
FUN
the function to be applied to each element of X: see ‘Details’. In the case of functions like +, %*%, the function name must be backquoted or quoted.
...
optional arguments to FUN.
simplify
logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For sapply it must be named and not abbreviated. The default value, TRUE, returns a vector or matrix if appropriate, whereas if simplify = "array" the result may be an array of “rank” (=length(dim(.))) one higher than the result of FUN(X[[i]]).
USE.NAMES
logical; if TRUE and if X is character, use X as names for the result unless it had names already. Since this argument follows ... its name cannot be abbreviated.
FUN.VALUE
a (generalized) vector; a template for the return value from FUN. See ‘Details’.
n
integer: the number of replications.
expr
the expression (a language object, usually a call) to evaluate repeatedly.
x
a list, typically returned from lapply().
higher
logical; if true, simplify2array() will produce a (“higher rank”) array when appropriate, whereas higher = FALSE would return a matrix (or vector) only. These two cases correspond to sapply(*, simplify = "array") or simplify = TRUE, respectively.
Details
FUN is found by a call to match.fun and typically is specified as a function or a symbol (e.g., a backquoted name) or a character string specifying a function to be searched for from the environment of the call to lapply.
Function FUN must be able to accept as input any of the elements of X. If the latter is an atomic vector, FUN will always be passed a length-one vector of the same type as X.
Arguments in ... cannot have the same name as any of the other arguments, and care may be needed to avoid partial matching to FUN. In general-purpose code it is good practice to name the first two arguments X and FUN if ... is passed through: this both avoids partial matching to FUN and ensures that a sensible error message is given if arguments named X or FUN are passed through ....
Simplification in sapply is only attempted if X has length greater than zero and if the return values from all elements of X are all of the same (positive) length. If the common length is one the result is a vector, and if greater than one is a matrix with a column corresponding to each element of X.
Simplification is always done in vapply. This function checks that all values of FUN are compatible with the FUN.VALUE, in that they must have the same length and type. (Types may be promoted to a higher type within the ordering logical < integer < double < complex, but not demoted.)
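A short sketch of the FUN.VALUE check follows. The list x is made up, and the tryCatch() wrapper is only there so the failing call can be shown without stopping the script.

```r
# Hypothetical input list
x <- list(a = 1:5, b = c(2, 4, 6))

# Every call to mean() returns one number, matching numeric(1)
means <- vapply(x, mean, FUN.VALUE = numeric(1))
print(means)  # a = 3, b = 4

# range() returns two numbers, which does not match numeric(1),
# so vapply() signals an error instead of silently changing shape
bad <- tryCatch(
  vapply(x, range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
print(bad)  # "error"
```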
Users of S4 classes should pass a list to lapply and vapply: the internal coercion is done by the as.list in the base namespace and not one defined by a user (e.g., by setting S4 methods on the base function).
lapply and vapply are primitive functions.
Value
For lapply, sapply(simplify = FALSE) and replicate(simplify = FALSE), a list.
For sapply(simplify = TRUE) and replicate(simplify = TRUE): if X has length zero or n = 0, an empty list. Otherwise an atomic vector or matrix or list of the same length as X (of length n for replicate). If simplification occurs, the output type is determined from the highest type of the return values in the hierarchy NULL < raw < logical < integer < double < complex < character < list < expression, after coercion of pairlists to lists.
vapply returns a vector or array of type matching the FUN.VALUE. If length(FUN.VALUE) == 1 a vector of the same length as X is returned, otherwise an array. If FUN.VALUE is not an array, the result is a matrix with length(FUN.VALUE) rows and length(X) columns, otherwise an array a with dim(a) == c(dim(FUN.VALUE), length(X)).
The (Dim)names of the array value are taken from the FUN.VALUE if it is named, otherwise from the result of the first function call. Column names of the matrix or more generally the names of the last dimension of the array value or names of the vector value are set from X as in sapply.
Note
sapply(*, simplify = FALSE, USE.NAMES = FALSE) is equivalent to lapply(*).
For historical reasons, the calls created by lapply are unevaluated, and code has been written (e.g., bquote) that relies on this. This means that the recorded call is always of the form FUN(X[[i]], ...), with i replaced by the current (integer or double) index. This is not normally a problem, but it can be if FUN uses sys.call or match.call or if it is a primitive function that makes use of the call. This means that it is often safer to call primitive functions with a wrapper, so that e.g. lapply(ll, function(x) is.numeric(x)) is required to ensure that method dispatch for is.numeric occurs correctly.
If expr is a function call, be aware of assumptions about where it is evaluated, and in particular what ... might refer to. You can pass additional named arguments to a function call as additional named arguments to replicate: see ‘Examples’.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
apply, tapply, mapply for applying a function to multiple arguments, and rapply for a recursive version of lapply(), eapply for applying a function to each entry in an environment.
mapply()
R Documentation
Apply a Function to Multiple List or Vector Arguments
The mapply() function is a multivariate apply of sorts which applies a function in parallel over a set of arguments. Recall that lapply() and friends only iterate over a single R object. What if you want to iterate over multiple R objects in parallel? This is what mapply() is for.
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
The arguments to mapply() are
- FUN is a function to apply
- ... contains R objects to apply over
- MoreArgs is a list of other arguments to FUN.
- SIMPLIFY indicates whether the result should be simplified
The mapply() function has a different argument order from lapply() because the function to apply comes first rather than the object to iterate over. The R objects over which we apply the function are given in the ... argument because we can apply over an arbitrary number of R objects.
For example, the following is tedious to type
list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
With mapply(), instead we can do
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
This passes the sequence 1:4 to the first argument of rep() and the sequence 4:1 to the second argument.
Vectorizing a Function
The mapply() function can be used to automatically “vectorize” a function. What this means is that it can be used to take a function that typically only takes single arguments and create a new function that can take vector arguments. This is often needed when you want to plot functions.
Here’s an example of a function that computes the sum of squares given some data, a mean parameter and a standard deviation.
The formula is
$$\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{\sigma^2}$$
> sumsq <- function(mu, sigma, x) {sum(((x - mu) / sigma)^2) }
This function takes a mean mu, a standard deviation sigma, and some data in a vector x.
In many statistical applications, we want to minimize the sum of squares to find the optimal mu and sigma. Before we do that, we may want to evaluate or plot the function for many different values of mu or sigma. However, passing a vector of mus or sigmas won’t work with this function because it’s not vectorized.
> x <- rnorm(100) ## Generate some data
> sumsq(1:10, 1:10, x) ## This is not what we want
[1] 110.2594
Note that the call to sumsq() only produced one value instead of 10 values.
However, we can do what we want to do by using mapply().
> mapply(sumsq, 1:10, 1:10, MoreArgs = list(x = x))
[1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998
[8] 100.3745 100.1685 100.0332
There’s even a function in R called Vectorize() that can automatically create a vectorized version of your function. So we could create a vsumsq() function that is fully vectorized as follows.
> vsumsq <- Vectorize(sumsq, c("mu", "sigma"))
> vsumsq(1:10, 1:10, x)
[1] 196.2289 121.4765 108.3981 104.0788 102.1975 101.2393 100.6998
[8] 100.3745 100.1685 100.0332
Description
mapply is a multivariate version of sapply. mapply applies FUN to the first elements of each ... argument, the second elements, the third elements, and so on. Arguments are recycled if necessary.
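The element-by-element pairing and the recycling can be seen in a tiny sketch. The adding function and the input values are assumptions made up for illustration.

```r
# The second argument has length 1, so it is recycled to match 1:4;
# FUN is called as f(1, 10), f(2, 10), f(3, 10), f(4, 10)
res <- mapply(function(x, y) x + y, 1:4, 10)

print(res)  # 11 12 13 14
```

Since every call returns a single number, the default SIMPLIFY = TRUE collapses the result to a plain vector rather than a list.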
Usage
mapply(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,
USE.NAMES = TRUE)
Arguments
FUN
function to apply, found via match.fun.
...
arguments to vectorize over (vectors or lists of strictly positive length, or all of zero length). See also ‘Details’.
MoreArgs
a list of other arguments to FUN.
SIMPLIFY
logical or character string; attempt to reduce the result to a vector, matrix or higher dimensional array; see the simplify argument of sapply.
USE.NAMES
logical; use names if the first ... argument has names, or if it is a character vector, use that character vector as the names.
Details
mapply calls FUN for the values of ... (re-cycled to the length of the longest, unless any have length zero), followed by the arguments given in MoreArgs. The arguments in the call will be named if ... or MoreArgs are named.
Arguments with classes in ... will be accepted, and their subsetting and length methods will be used.
Value
A list, or for SIMPLIFY = TRUE, a vector, array or list.
See Also
sapply, after which mapply() is modelled.
outer, which applies a vectorized function to all combinations of two arguments.
sapply()
The sapply() function behaves similarly to lapply(); the only real difference is in the return value.
sapply() will try to simplify the result of lapply() if possible. Essentially, sapply() calls lapply() on its input and then applies the following algorithm:
- If the result is a list where every element is length 1, then a vector is returned
- If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
- If it can’t figure things out, a list is returned
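The three cases above can be sketched as follows. The input lists and the doubling function are made up for illustration.

```r
# Hypothetical input list
x <- list(a = 1:4, b = 5:8)

# Case 1: every result has length 1, so we get a (named) vector
v <- sapply(x, mean)
print(v)  # a = 2.5, b = 6.5

# Case 2: every result has the same length (2), so we get a matrix
# with one column per element of x
m <- sapply(x, range)
print(dim(m))  # 2 2

# Case 3: results have different lengths, so sapply() gives up
# and returns a list, just like lapply()
l <- sapply(list(a = 1:2, b = 1:3), function(v) v * 2)
print(class(l))  # "list"
```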
split()
The split() function takes a vector or other object and splits it into groups determined by a factor or list of factors.
The arguments to split() are
> str(split)
function (x, f, drop = FALSE, ...)
where
- x is a vector (or list) or data frame
- f is a factor (or coerced to one) or a list of factors
- drop indicates whether empty factor levels should be dropped
The combination of split() and a function like lapply() or sapply() is a common paradigm in R.
The basic idea is that you can take a data structure, split it into subsets defined by another variable, and apply a function over those subsets.
The results of applying the function over the subsets are then collated and returned as an object. This sequence of operations is sometimes referred to as “map-reduce” in other contexts.
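A minimal sketch of this split-then-apply pattern is shown below. The scores vector and the grouping factor are made-up data.

```r
# Hypothetical data: six values belonging to three groups
scores <- c(10, 20, 30, 40, 50, 60)
group  <- factor(c("a", "b", "a", "b", "c", "c"))

# Step 1: split the vector into a list with one element per group
pieces <- split(scores, group)

# Step 2: apply a function to each piece and collate the results
group_means <- sapply(pieces, mean)

print(group_means)  # a = 20, b = 30, c = 55
```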
tapply()
R Documentation
Apply a Function Over a Ragged Array
tapply() is used to apply a function over subsets of a vector. It can be thought of as a combination of split() and sapply() for vectors only. I’ve been told that the “t” in tapply() refers to “table”, but that is unconfirmed.
> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
The arguments to tapply() are as follows:
- X is a vector
- INDEX is a factor or a list of factors (or else they are coerced to factors)
- FUN is a function to be applied
- ... contains other arguments to be passed to FUN
- simplify, should we simplify the result?
Given a vector of numbers, one simple operation is to take group means.
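As a small sketch of taking group means, the example below uses a made-up vector and a two-level factor built with gl().

```r
# Hypothetical data: the first three values belong to group 1,
# the last three to group 2
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)  # factor with values 1 1 1 2 2 2

# One mean per level of f
m <- tapply(x, f, mean)

print(m)  # group 1: 2, group 2: 20
```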
Description
Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
Usage
tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)
Arguments
X
an atomic object, typically a vector.
INDEX
a list of one or more factors, each of same length as X. The elements are coerced to factors by as.factor.
FUN
the function to be applied, or NULL. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted. If FUN is NULL, tapply returns a vector which can be used to subscript the multi-way array tapply normally produces.
...
optional arguments to FUN: see the Note section.
simplify
logical; if FALSE, tapply always returns an array of mode "list"; in other words, a list with a dim attribute. If TRUE (the default), then if FUN always returns a scalar, tapply returns an array with the mode of the scalar.
Value
If FUN is not NULL, it is passed to match.fun, and hence it can be a function or a symbol or character string naming a function.
When FUN is present, tapply calls FUN for each cell that has any data in it. If FUN returns a single atomic value for each such cell (e.g., functions mean or var) and when simplify is TRUE, tapply returns a multi-way array containing the values, and NA for the empty cells. The array has the same number of dimensions as INDEX has components; the number of levels in a dimension is the number of levels (nlevels()) in the corresponding component of INDEX. Note that if the return value has a class (e.g., an object of class "Date") the class is discarded.
Note that contrary to S, simplify = TRUE always returns an array, possibly 1-dimensional.
If FUN does not return a single atomic value, tapply returns an array of mode list whose components are the values of the individual calls to FUN, i.e., the result is a list with a dim attribute.
When there is an array answer, its dimnames are named by the names of INDEX and are based on the levels of the grouping factors (possibly after coercion).
For a list result, the elements corresponding to empty cells are NULL.
Note
Optional arguments to FUN supplied by the ... argument are not divided into cells. It is therefore inappropriate for FUN to expect additional arguments with the same length as X.
References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
See Also
the convenience functions by and aggregate (using tapply); apply, lapply with its versions sapply and mapply.