www.eamoncaddigan.net

Content and configuration for https://www.eamoncaddigan.net
git clone https://git.eamoncaddigan.net/www.eamoncaddigan.net.git
Log | Files | Refs | Submodules | README

index.md (5900B)


      1 ---
      2 title: "Indexing matrix elements in R"
      3 date: "2015-10-22"
      4 description: Why I prefer zero-based numbering.
      5 categories:
      6 - Programming
      7 - Data Science
      8 tags:
      9 - R
     10 ---
     11 
     12 I came to science with a background in engineering, but most of my scientist
     13 friends didn’t. I often have a hard time articulating why I’m so annoyed by
     14 one-based indexing--which R and MATLAB use, but most other programming
     15 languages don’t. Here’s a recent example that might help.
     16 
     17 For a simulation I’m running, I use the values in several of the columns of a
     18 data frame as indexes into separate vectors. Here’s an example of using indexes
     19 in one vector (instead of a column from a `data.frame`) to access the elements
     20 in another:
     21 
     22 ```r
     23 valueVector <- c(2, 5, 8, 3, 0)
     24 indexVector <- c(2, 3, 3, 1, 4, 5, 5)
     25 valueVector[indexVector]
     26 ```
     27 ```
     28 ## [1] 5 8 8 2 3 0 0
     29 ```
     30 
     31 R veterans will point out that you can use factors for this, and that's
     32 definitely true here. However, when the values in the small vector are changing
     33 often but the indices are relatively stable, I prefer this approach. Either
     34 strategy works.
     35 
     36 Unfortunately, things aren't so easy when the data is in a matrix (a 2D vector)
     37 and you want to access its elements using two index vectors (i.e., one indexing
     38 the matrix’s rows, and the second indexing its columns). R’s default behavior
     39 might not be what you expect:
     40 
     41 ```r
     42 valueMatrix <- matrix(LETTERS[1:15], ncol = 3)
     43 valueMatrix
     44 ```
     45 ```
     46 ##      [,1] [,2] [,3]
     47 ## [1,] "A"  "F"  "K" 
     48 ## [2,] "B"  "G"  "L" 
     49 ## [3,] "C"  "H"  "M" 
     50 ## [4,] "D"  "I"  "N" 
     51 ## [5,] "E"  "J"  "O"
     52 ```
     53 
     54 ```r
     55 rowIndices <- c(4, 2, 2, 4, 4, 4)
     56 colIndices <- c(2, 3, 3, 2, 3, 2)
     57 valueMatrix[rowIndices, colIndices]
     58 ```
     59 ```
     60 ##      [,1] [,2] [,3] [,4] [,5] [,6]
     61 ## [1,] "I"  "N"  "N"  "I"  "N"  "I" 
     62 ## [2,] "G"  "L"  "L"  "G"  "L"  "G" 
     63 ## [3,] "G"  "L"  "L"  "G"  "L"  "G" 
     64 ## [4,] "I"  "N"  "N"  "I"  "N"  "I" 
     65 ## [5,] "I"  "N"  "N"  "I"  "N"  "I" 
     66 ## [6,] "I"  "N"  "N"  "I"  "N"  "I"
     67 ```
     68 
     69 Instead of returning the six values associated with the six row/column pairs,
     70 it returns a 6 × 6 matrix with the elements of interest along the diagonal.
     71 When the number of elements is this small, it’s easy to wrap this in a call to
     72 `diag()`; that approach isn’t appropriate for long index vectors, since memory
     73 requirements are squared.
     74 
     75 Readers who’ve done data manipulation in C are probably familiar with pointer
     76 arithmetic. Zero-based numbering makes it easy to combine row and column
     77 indexes and make a one-dimensional array behave like a matrix: 
     78 
     79 ```c
     80 /* When the code has to deal with matrices of varying size, you can’t allocate 
     81    an array right away. */  
     82 double *valueMatrix;
     83 
     84 /* Stuff happened, and now you know how big your matrix will be (nrow x ncol), 
     85    so you allocate memory on the “heap” to store the data. */
     86 valueMatrix = (double *)malloc(sizeof(double) * nrow * ncol);
     87 
     88 /* This is how you’d access specific elements. */
     89 oldValueAtRowCol = valueMatrix[row + nrow*col];
     90 valueMatrix[row + nrow*col] = newValueAtRowCol; 
     91 
     92 /* Can’t forget this when you’re done; I don’t miss C. */
     93 free(valueMatrix);
     94 valueMatrix = 0;
     95 
     96 /* NB: All programs work this way, but most scripting languages take care of 
     97    this stuff for you. Progress! */
     98 ```
     99 
    100 To its credit, R makes this easy too; everything is stored as a one-dimensional
    101 vector behind the scenes, and can be accessed as such. R’s use of one-based
    102 indexing just makes it look a bit more awkward: 
    103 
    104 ```r
    105 valueMatrix[rowIndices + nrow(valueMatrix) * (colIndices - 1)]
    106 ```
    107 ```
    108 ## [1] "I" "L" "L" "I" "N" "I"
    109 ```
    110 
    111 I hope this is useful for anybody who wants to access matrix elements using
    112 index vectors. More importantly, maybe some scientists-turned-programmers can
    113 gain a bit of insight into why zero-based numbering make sense to those of us
    114 who cut our teeth on C (or languages like it).
    115 
    116 ---
    117 
    118 ### Update
    119 
    120 After I posted this, [Andrie de Vries](http://rfordummies.com/) sent [a
    121 tweet](https://twitter.com/RevoAndrie/status/657187336833388545) sharing an
    122 alternate syntax that's more R-like:
    123 
    124 ```r
    125 valueMatrix[cbind(rowIndices, colIndices)]
    126 ```
    127 ```
    128 ## [1] "I" "L" "L" "I" "N" "I"
    129 ```
    130 
    131 I definitely agree that it looks better, but there is a slight performance hit
    132 due to the call to `cbind()`. If you're repeatedly accessing a matrix with the
    133 same pair of indices, it might be worth it store the bound pair as a variable
    134 and reuse that. Here's the benchmark performance of each of the approaches I've
    135 discussed:
    136 
    137 
    138 ```r
    139 library("microbenchmark")
    140 
    141 rowIndices <- sample(1:5, 1e4, replace=TRUE)
    142 colIndices <- sample(1:3, 1e4, replace=TRUE)
    143 boundIndices <- cbind(rowIndices, colIndices)
    144 
    145 op <- microbenchmark(diag_matrix = diag(valueMatrix[rowIndices, colIndices]),
    146                      pointer_math = valueMatrix[rowIndices + nrow(valueMatrix) * (colIndices - 1)], 
    147                      array_indexing = valueMatrix[cbind(rowIndices, colIndices)], 
    148                      array_indexing_prebound = valueMatrix[boundIndices], 
    149                      times = 100)
    150 print(op)
    151 ```
    152 ```
    153 ## Unit: microseconds
    154 ##                     expr         min           lq         mean      median
    155 ##              diag_matrix 1692950.428 1984553.2750 2024360.4989 2031899.941
    156 ##             pointer_math     243.356     251.9710     265.7848     260.419
    157 ##           array_indexing     332.864     352.2625     373.6505     367.539
    158 ##  array_indexing_prebound     149.192     154.6295     165.4170     160.201
    159 ##            uq         max neval
    160 ##  2056263.5955 2201740.926   100
    161 ##      272.5130     331.070   100
    162 ##      387.1940     544.230   100
    163 ##      170.5105     277.498   100
    164 ```
    165 
    166 Indexing with the pre-bound pair is fastest, using arithmetic on the indexes is
    167 a close second, and calling `cbind()` inside the brackets is in third place.
    168 Creating the n × n matrix and extracting its diagonal is excessively slow (and
    169 uses up a lot of RAM), so don't ever do that.