R Database Connection Class in RS (R5 S5)

I'm learning R, and it's been interesting. The weird part is that I don't know stats. Yes, I know standard deviation, mean deviation, and easy stuff, but those are one-liners in the R tutorials. What I know in stats amounts to around 1/4 to 1/2 a page of the R tutorial. So to compensate, I have a stats text from the thrift store, so I'll eventually be less than a total ignoramus about what I'm reading.

One of the twists of what I'm messing with is that all our data is in a database. The normal mode of operation for R users is to load the entire data table into memory and do awesome reporting on it. Where I'm at, for better or worse, is more like a traditional web application with a database back end.

A problem with this is that MySQL lets me have only 16 connections going at the same time. I'm not sure if it's the driver, but I'll assume that. Since RStudio holds the connections, coding a lot of changes eventually leads to a lot of lost database connection handles, and eventually running out of connections.

What you must do is disconnect each connection. This is done in code like this:

  mydb = dbConnect(MySQL(), group='rstudio', dbname="thedb")
  rs <- dbSendQuery( mydb, query.string )
  data <- fetch( rs )

  ...
  
  dbClearResult(rs)
  dbDisconnect(mydb)

You have to bracket the code with a connect and a disconnect. There was some repetitive code, so I decided to make a small class that would abstract away the connection issues and slim down the code.

R has a weird object system. There are three different systems, called S3, S4, and RC (for Reference Classes). RC is the one most like traditional OO, and it feels a lot like Javascript's objects. JS's objects, in case you don't know, feel like an afterthought compared to Java's classes and objects. That doesn't mean JS's system doesn't work - it actually works well - it just lacks syntactic OOP features.

If you've done OOP in Perl 5, Python, C, or even ObjC, you'll feel right at home. OOP in R is added as a kind of minimal extension. It's almost as if the language designers of the above are thumbing their noses at C++ and Java and C#: "nyah nyah, we can do OOP too, and with only a tiny amount of extra code, and almost no new syntactic additions, because our native objects are so freaking powerful and our generic-programming-fu is so strong."

S3 is the object system that is used in R libraries. It allows the implementation of generic functions like plot() and print(). Through S3, plot() and print() and other generic methods can be implemented for new classes of data. Traditional OOP programmers can think of S3 as a system where new classes of data can implement popular interfaces.

These are the two articles I've studied:
http://www.agapow.net/programming/r/oop-in-r/reference-classes/
http://adv-r.had.co.nz/OO-essentials.html

My initial attempt was to use an S3 class. It didn't really work, because, as it was stated all over the place, though I didn't notice because I come out of trad OOP, is that R is a functional language, and S3 classes don't support mutable objects. None of the data structures you set up have state; they can't be changed.

A mindblowing aside

My mind was blown... and is still reeling. S3 isn't really about "objects". It's about "classes":. These are classes without traditional objects. The class describes an object, and defines functions for the object, but the object itself isn't this chunk of memory that can be manipulated by a function. Your code cannot alter the contents of an object it receives as an argument.

That's why I had to use Reference Classes (RC), which support mutable objects. RC classes also have constructors, initializers, and destructors. Here's the code for a class that creates an object that automatically connects to the database, and will automatically disconnect when the object goes out of scope.

Here's an example of the class in use, then the code for the class, followed by an explanation.

Code in use:

for (i in 1:20) {
  x <- function() {
    mydb <- db$new()
    mydb$query("select count(*) from foos")
  }
  print(paste(i," ",x()))
}

That demonstrates how, by encapsulating the connection, we reduce the line count to two lines of code. It also demonstrates automatic disconnection. Note that I don't call mydb$finalize() to disconnect. The class disconnects automatically.

I used a function to create a local scope for mydb. After x() is called, mydb should go out of scope, causing finalize() to be called.

Class source code:

library("DBI")
library("RMySQL")

db <- setRefClass("db",
  fields = c( connection="MySQLConnection" ),
  methods = c(
    initialize = function() {
      connection <<- dbConnect(MySQL(), group='rstudio', dbname="thedb")
    },
    finalize = function() {
      if (class(connection) == "MySQLConnection") {
        dbDisconnect( connection )
      }
    },
    query = function(sql) {
      if (class(connection)=="MySQLConnection")
      {
        rs <- dbSendQuery( connection, sql )
        data <- fetch(rs)
        dbClearResult(rs)
        data
      }
      else 
      {
        simpleError("No connection")
      }
    },
    execute = function(q) {
      dbSendQuery( connection, q)
    }    
  )
)

The new syntactic support for mutable objects is through the <<- operator, which does what you expect. It assigns a new value to the field.

Aside from that, the rest of the system is built upon the S4 OO system.

db is an object that contains the class. setRefClass() creates the class. setRefClass has at least three arguments. The first, "db" is the class name. The second, "fields" is a vector that describes the properties, which we will refer to as "fields". The third, "methods", is a list of functions, which will become the methods of the object.

Java programmers might miss the fact that there's no syntactic support for defining methods. In R, a function is data. The "methods" list is just a regular list. After each function definition, there's a comma; the comma is the separator between the name-value pairs in the list.

In RC classes, the function initialize() is called after the object is constructed. finalize() is called before the object is destroyed.

In RC classes, there's no explicit constructor required; there's a built-in constructor, new(). For this class, you just use obj <- db$new().

The typical usage of this class is:

x <- db$new()
data <- x$query( the.query.string )

Another mindblowing idea

I think this class is useful, and RC is cool... but I don't think I'll use RC as my go-to OO system. Introducing state makes things more difficult, and the fact that you can avoid creating state in R is a feature. If you think about it, most operations can be performed without maintaining state. R programmers say that "you can reason about" code written in a functional style, without state or variables. Once you introduce state, you (arguably) cannot reason about it - you should write code to interrogate the state of the object. That's more work, and more opportunities for bugs.

In the real world, however, you need variables, and you end up with some state. So, you end up with classes like the above, which try to reduce the number of lines of code, and make allocating and freeing the database connection automatic. It's a trade-off.

Addendum

After reading in the R-Inferno about immutable data, and pass by copy rather than pass by reference... I have to reiterate that introducing state is a bad thing to avoid if at all possible. The new language from Apple even makes this distinction with the introduction of "let", a keyword that allows assignment of a variable one time, and "struct", which is like an object, but that is always passed by copy, never passed by reference. These two things change the semantics of passing an object into a function so that you know the object is the same before and after a function. That makes programming with objects safer and more explicit.

This trend is also reflected in the "use const whenever possible" rule of thumb.

The rule of thumb would be to use "let" and "struct" whenever possible, and "var" and a classic object whenever necessary. Likewise, in R, use S3 when possible, then use S4 if S3 isn't good enough, and RC if necessary.