Book Review: R in a Nutshell

R is a statistical computing environment that is fully compliant with state-of-the-art buzzwords: free, open-source, cross-platform, interactive, graphics, objects, closures, higher-order functions, and more. It is supported by an impressive collection of user-supplied modules through CRAN, the “Comprehensive R Archive Network”. (Sound familiar?)

And now it has its own O’Reilly Nutshell book, R in a Nutshell, written by Joseph Adler. I am pleased to report that Adler has risen to the challenge of the highly-regarded “Nutshell” franchise. As is traditional for the series, this title mixes introduction, tutorial, and reference material in a style that is well suited to a reader who already has a background in programming, but is a new or occasional user of R.

The book’s flow was very effective for addressing the different points of view from which I approached it.

As a curious newcomer to R who wanted to get going quickly, I was well-served by Part 1, which provided an R kickstart. Chapter 1 covers the process of getting and installing R. It is short, to the point, and just works, addressing Windows, Mac OS X, and Linux/Unix with equal attention. Chapter 2, on the R user interface, introduces the range of options for interacting with R: the GUI (both the standard version and some enhanced alternatives), the interactive console, batch mode, and the RExcel package (which supports R inside a certain well-known spreadsheet). Chapter 3 uses a set of interactive examples to provide a quick tour of the R language and environment, establishing a task-oriented theme that carries through the rest of the book. The last chapter of Part 1 covers R packages. It summarizes the standard pre-loaded packages, introduces the tools to explore repositories and install additional packages, and concludes by explaining how to create new packages.

As a polyglot programmer who is always interested in seeing how a new language approaches programs and their construction, I enjoyed Part 2, which described the R language. This section begins with an overview in chapter 5, and then devotes a chapter each to R syntax, R objects, symbols and environments (central to understanding the dynamic nature of R), functions (including higher-order functions), and R’s own approach to object-oriented programming. This section closes in chapter 11, with a discussion of techniques and tips for improving performance.

As a busy professional with data sitting on my hard drive that I’d like to understand better, I appreciated Part 3, with its practical emphasis on using R to load, transform, and visualize data. Chapter 12 presents alternatives for loading, editing, and saving data, from the built-in data editor, through file I/O in a variety of formats, to a mature set of database access options. Chapter 13 illustrates a range of techniques for manipulating, organizing, cleaning, and sorting data, in preparation for presentation or more detailed analysis. Chapter 14 introduces the reader to the wealth of graphical presentation options built into the R environment. There are so many charting types and details that this chapter could have been overwhelming, but Adler keeps the interest high and the mood light by drawing on an engaging variety of data: toxic chemical levels, baseball statistics, the topography of Yosemite Valley, demographic data, and even turkey prices. Chapter 15 is devoted to lattice graphics, the R implementation of the “trellis graphics” technique for data visualization developed at Bell Labs. This chapter illustrates the power of lattice graphics by exploring the question of why more babies are born on weekdays than weekends.

As a non-statistician who still occasionally needs to do some number-crunching, I’m sure I’ll be returning to Part 4, with its detailed explanations and illustrations of analysis tools and techniques: almost two hundred pages’ worth. In chapters 16 through 20, Adler surveys topics in data analysis, probability, statistics, power tests, and regression modeling. As someone who has been offered too many medications and lost fortunes, I found much to enjoy in chapter 21, which uses a variety of spam-detection techniques to illustrate the concepts of classification. Chapter 22, on machine learning, discusses several of the data mining techniques that R supports. Chapter 23 covers time series analysis, which may be used to identify trends or periodic patterns in data. Finally, chapter 24 offers an overview of Bioconductor, an open-source project focused on genomic data.

The book closes with a detailed reference to the standard R packages.

This is an impressive piece of work. In a volume of this size (about 650 pages), navigation is crucial, and I found both the organization of the chapters and the index up to the task. I was able to follow the instructions and examples through the first several chapters of the book essentially without a hitch, and in the latter chapters the variety of illustrations and data sources added interest to what could have been very dull going.

I won’t claim perfection for this book. There were a couple of explanations that could have been clearer, and one or two odd turns of phrase or rough edits. Out of all the code examples that I tried, I found exactly one that didn’t seem to work without a minor correction. For a work of this size, that’s actually pretty amazing!

As a long-time O’Reilly reader, I see Joseph Adler’s R in a Nutshell as a welcome addition to the menagerie.

Artist, graph thyself!

Visual representations of graph structures have been important to programming from the beginning of the craft – earlier than that, if you count circuit diagrams. Most practicing programmers of my acquaintance have struggled with visual complexity as the size or amount of detail of such a diagram increases. So I was interested in an article titled “What is the Best Way to Represent Directionality in Network Visualizations?” that summarized research done at Eindhoven University of Technology.

The researchers compared a variety of techniques for representing the directionality of arcs: standard arrows, gradients (light-to-dark, dark-to-light, and green-to-red), and curved and tapered connections. Subjects were shown graphs rendered using the different techniques and asked questions about connections between nodes, with their speed and accuracy being analyzed.

According to the summary:

  • one should avoid standard arrows and curves,
  • tapered arcs (from wide to narrow) produced the best results,
  • dark-to-light value gradients were better than light-to-dark (no surprise there, as this is consistent with the “more intense to less intense” result of the tapering), and
  • there appears to be no advantage to combining techniques (I confess a bit of surprise over that conclusion).

But most fascinating to me is the fact that when presenting the results visually, the paper and article both violated the very conclusions just reached! The relative performance of the techniques was itself presented as a directed graph drawn with curved arcs and standard arrowheads! (See the diagram here, or in the article and linked paper.)

It was but a few moments’ work in OmniGraffle to recreate that diagram using the recommended technique (tapered arcs). In addition, I rearranged the nodes to reduce the arc crossings, while keeping the preference flow downward from higher to lower. Here’s the result:

directionality.jpg

I’m amazed that the researchers and author(s) of the summary ignored their own results when presenting those very results! Does this tell us something about the force of habit, or perhaps the limitations of existing tools, or something else entirely? I wonder.

Don’t return null; use a tail call

Why should an object-oriented programmer care about tail-call elimination? Isn’t that just another esoteric functional programming concept? Maybe not.

Back when dinosaurs roamed the earth, a common technique for performing a subroutine call worked this way:

  1. The caller stored the address at which it wished to resume execution in a known place (e.g. adjacent to the called routine’s code);
  2. The caller branched to the called routine’s entry address;
  3. Upon completion, the called routine branched indirectly to the stored return address.
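Here’s a toy simulation of that convention (in Java, ironically enough). Each routine gets exactly one slot for its caller’s return address, stored “adjacent to its code” — so watch what a recursive call does to it. The names and “addresses” are pure invention for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the old calling convention: a single fixed "return address"
// slot per routine. The addresses are made-up integers for illustration.
public class SingleSlotCall {
    static int savedReturn;                       // the routine's one and only return slot
    static List<Integer> returnedTo = new ArrayList<Integer>();

    static void callF(int returnAddress, int depth) {
        savedReturn = returnAddress;              // step 1: store caller's resume address
        if (depth > 0) {
            callF(100 + depth, depth - 1);        // a recursive call overwrites the slot...
        }
        returnedTo.add(savedReturn);              // step 3: "branch" to whatever is stored now
    }

    public static void main(String[] args) {
        callF(42, 2);
        System.out.println(returnedTo);
    }
}
```

Run it and every level of the call “returns” to the address the innermost call stored, because the outer addresses were silently overwritten — which is exactly why this convention precluded recursion.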

Of course, this technique precluded both recursive and re-entrant calls, but those were regarded as esoteric, theoretical concepts with little if any practical use (see the “Historical Footnote” below). Times change, but programmers are still living with constraints whose roots go back to the kinds of mechanism described above. We all know that a simple and correct recursive routine can still founder on the stack overflow reef. But I recently saw a blog post by Zachary D. Shaw that rubbed a bit more salt in that wound.

His post on Returning Null discusses an alternative to the common idiom of returning a null result as a way of saying “not found”. In his post, the caller passes itself as an argument (using a suitable interface type, of course), and the callee responds by invoking either a “found” or “not found” method on the caller. This models the interaction as an exchange of specific and appropriate messages, instead of encoding the “not found” case in an abnormal value which the caller must decode in order to determine what to do.
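The idea can be sketched in Java roughly like this — the interface and class names (LookupResponse, AddressBook) are my own invention, not Shaw’s, but the shape of the interaction is the point:

```java
import java.util.HashMap;
import java.util.Map;

// "Tell, don't return null": the caller passes itself in (via a callback
// interface), and the callee answers with a found/notFound message.
// All names here are my own sketch, not code from the original post.
interface LookupResponse {
    void found(String address);
    void notFound(String name);
}

class AddressBook {
    private final Map<String, String> entries = new HashMap<String, String>();

    void add(String name, String address) {
        entries.put(name, address);
    }

    // No null ever escapes this method; the two outcomes are two messages.
    void lookup(String name, LookupResponse caller) {
        String address = entries.get(name);
        if (address != null) {
            caller.found(address);
        } else {
            caller.notFound(name);
        }
    }
}
```

A caller implements LookupResponse and simply receives whichever message applies — no null check, no decoding of a sentinel value.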

I’m currently enjoying the book Growing Object-Oriented Software, Guided by Tests by Steve Freeman and Nat Pryce, of mock objects fame. That book reminded me of the value of separating the design concept of passing messages from the implementation detail of writing methods. It would be nice to be able to use such a technique without stack depth raising its ugly little head as an issue, however fleeting.

Simon Harris posted on this issue from a Ruby perspective, and Guy Steele’s Fortress blog has an elegant illustration in that language. Of course, Steele’s 1977 paper is the ultimate resource on the subject. Even the name of that paper has had its influence.


Historical Footnote: The November 1963 issue of Communications of the ACM contained a short piece entitled “Recursive programming in Fortran II”. Membership is required to download the PDF, but you can get a sense of the attitudes of the time just by reading the abstract on the linked page.

From Whose Perspective?

Java Posse Roundup 2010 is now history, and I’m still digesting and pondering. But one “Aha!” moment was worth posting quickly.

During a discussion on productivity and job satisfaction, a participant stated a view that I suspect many of us have shared: “If I can get to the office early, I can get my work done before the distractions and interruptions begin.” After a moment in which many of us nodded appreciatively, Diane Marsh replied, “But it’s all work.”

Earlier in the same session, several of us had mentioned “learning” or “helping others learn” as one of the joys of our craft. But it occurred to me that perhaps there was an implicit “…what I want to learn” tacked onto the end.

I am not defending pointless interruptions or feckless meetings. But I benefit from Diane’s reminder that the programmer’s equivalents of taking out the trash and washing the dishes are still valuable parts of the day. Flow is good. So is individual accomplishment. But so are balance and avoiding tunnel vision.

Why lists are not primitive

Yesterday’s SICP study group call indulged briefly in the “why LISP hasn’t caught on” theme. I suggested that some programmers seem to have a limited tolerance for abstraction, and cited the well-known Primitive Obsession anti-pattern as evidence. One of the other participants (I’m sorry that I don’t know all the voices yet) challenged that statement. After pointing out how often the humble list is used as a universal data structure, he asked whether that was Primitive Obsession in another disguise. The conversation moved on, but that question stuck in the back of my mind.

This morning it re-emerged, dragging the following conjecture.

Although the list is a fundamental structuring mechanism in LISP, it is far from “primitive” in the way that e.g. int is a primitive data type in Java. Although number theory is a well-respected branch of Mathematics, Ben Bitdiddle and Eva Lu Ator can write application code littered with int variables without ever appealing to the theoretical properties of integers: Peano’s Postulates and the like. After all, they’ve known how to count and do arithmetic since elementary school, right?

On the other hand, the LISP list is a fairly abstract and conceptual beast; list-based code normally uses recursion, either explicitly or hidden beneath the thinnest of veneers, such as map and reduce. And from there, it’s the tiniest of steps to begin reasoning about one’s code via full-blown Mathematical Induction. Furthermore, all of the conceptual properties of lists are lying there in plain sight every time a list appears. Therefore I’m encouraged to think of generalities that can re-use those tasty properties with great freedom.

In contrast, how many Java programmers routinely make the jump from

String personName;

to

public class PersonName {
    public final String firstName;
    public final String lastName;
    ...
}

to including (correctly-working!) methods for comparison and equality? (And I’m not just picking on Java; the same is true of many other “C-like” language communities.) Even though strings are comparable, and ordered composites of comparable types have an obvious default comparability, realizing that obvious comparison is tedious and error-prone. If my language trains me to think that certain things are too much trouble, then I’ll tend to stop thinking about them. And when offered a language that makes them freely available again, it will be all too easy to ask, “Who needs all that stuff?”
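To make the tedium concrete, here is the “obvious” comparison written out by hand — my sketch of the full ceremony, with hashCode thrown in because redefining equals demands it:

```java
// The hand-written ceremony behind the "obvious" ordered comparison of two
// comparable fields. A sketch of the tedium, not a style recommendation.
public class PersonName implements Comparable<PersonName> {
    public final String firstName;
    public final String lastName;

    public PersonName(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    // Compare by lastName, then firstName -- the "obvious" default order
    // for an ordered composite of comparable parts.
    public int compareTo(PersonName other) {
        int byLast = lastName.compareTo(other.lastName);
        return (byLast != 0) ? byLast : firstName.compareTo(other.firstName);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof PersonName)) return false;
        PersonName that = (PersonName) obj;
        return firstName.equals(that.firstName) && lastName.equals(that.lastName);
    }

    @Override
    public int hashCode() {
        return 31 * lastName.hashCode() + firstName.hashCode();
    }
}
```

Every line above is boilerplate that follows mechanically from the two fields — and every line is an opportunity for a copy-paste bug.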

Intermission: Google Squared

Google Squared is an impressive (but still beta) offering from The Google that seems to have some real potential for students and others who want to do quick-and-maybe-not-so-dirty research.

This post is also a bit of an experiment. I want to find out whether I can capture screen shots that are large enough to be (at least partly) readable without completely blowing up the page layout I’m currently using for this blog.

Given the subject matter of this blog, it’s no surprise that I thought of this query:

Google squared query for programming languages

which gave me the following result:

First query result

I admit to being surprised and impressed with the collection of values presented with no additional guidance. However, I want to refine the content a little. First, I’ll click on the [X] column to the left to remove Pascal, Fortran, Cobol, and Forth (none of which I’m currently using ;-); then I’ll use the “Add items” input field to include Java, Scala, Erlang, and Haskell. Some fairly interesting look-ahead kicked in as I began typing:

Typing J provided a hint for Jython

With all my new rows in place, I have this:

Updated query results with new rows

It’s interesting that “Appeared In” was not found for Java, but Google Squared is a beta. I’m impressed that it chose to put that column in, and did find values for the other languages!

The next hint of the underlying sophistication in G2 came when I decided to modify the columns. After removing “Influenced” and “Appeared In”, I started to add replacement columns, and was offered an interesting set of options.

List of proposed columns to add

Curiosity took over, and I picked “Typing Discipline” from the list. The resulting column confirmed that the term was being used correctly (e.g. with values of “duck,dynamic,strong” for Python and “static,strong,inferred” for Haskell).

Replacing that column with new ones for “tutorial” and “blogs” gave me some additional leads to follow:

Updated results with new columns

When I clicked on the tutorial column in the Scala row, G2 provided a pop-up with a link to Bill Venners’ presentation on Scala at last year’s QCon.

Pop-up with link to Bill's tutorial

I won’t burden this page (or you, kind reader) with any more screen shots from G2; perhaps by now you’ve seen enough that you’ve already left to try it out yourself or have decided it’s not (yet) for you.

There’s an obvious comparison begging to be done here; Wolfram Alpha wasn’t immediately successful for this particular search:

Wolfram|Alpha isn't sure what to do with your input.

…but let’s remember that both of these new tools are work in progress.

My overall impression is that it’s a very impressive start, but one with room for improvement. Several of my attempts to add other columns relevant to this blog (such as “DSL” or “parsing”) yielded “No value found” in most or all rows. Even the proposal from G2 of “Major Implementations” was only partially successful. There were multiple values for all languages except Java, Scala, and Erlang, all three of which got “No value found”.

It would be really interesting to be able to derive new columns from the contents of others. For example, counting the number of values in a list or doing arithmetic with numeric values would both be handy.

You may have noticed in the full-page screenshots the link at the top-right-hand corner that invited me to “Sign in to save your Square”. I did so, and plan to come back later to see if the results change over time.

I’m very interested in seeing how G2 grows from this not-so-humble beginning.

Jesse Tilly at Memphis JUG

This month’s Memphis Java Users’ Group meeting featured Jesse Tilly of IBM Rational Software, who spoke to us on static analysis. He will be doing a more product-intensive session, “What is IBM® Rational® Software Analyzer® Telling Me?”, at the upcoming IBM Rational Software Conference. (Don’t be misled by all those “circle-R”s; I just linked to the title from the conference web site.)

For our meeting, Jesse left the branding iron at home. He began with an overview of the history and benefits of static analysis. The major portion of the presentation offered a practical approach to analysis as part of a development project, including a detailed how-to on interpreting and using analysis results. Jesse finished with a return to history, drawing unexpected parallels with the analysis of Enigma traffic at Bletchley Park during WWII, the background for Alan Turing’s later theoretical work that led to the computers we program today.

Because Jesse had an early flight, our regular door-prize drawing followed his presentation. In our lightning talk segment, Matt Stine introduced Morph AppSpace, I presented on “Structured Functional Programming” (pdf here), and Walter Heger gave a quick look at jGears.


Recommended reading:

Encryption and cryptanalysis are deeply entwined with computing, whether in history (Codebreakers: The Inside Story of Bletchley Park) or in imagination (Cryptonomicon).

Two highly-respected tools for static analysis in Java are FindBugs and PMD; both web sites offer excellent documentation and other reference material.

BuilderBuilder: The Model in Haskell

This post describes the first model in Haskell for the BuilderBuilder task. We will develop the model incrementally until we have rough parity with the Java version.

I’m experimenting with ways to distinguish user input from system output in transcripts of interactive sessions. This time I’m trying color, using a medium blue for output. I will appreciate feedback on whether that works for you.

Step one: defining and using a type

The simplest possible Haskell version of our model for a Java field is:

data JField = JField String String

However, the PMNOPML (“Pay Me Now Or Pay Me Later”) principle says that we’ll regret it if we stop there. In fact, later comes quickly.

We can create an instance of JField in a source file:

field1 = JField "name1" "Type1"

To do the same in a ghci session, prefix each definition with let, as in:

*Main> let field1 = JField "name1" "Type1"

Step two: showing the data

Trying to look at the instance yields a fairly opaque error message.

*Main> field1

<interactive>:1:0:
    No instance for (Show JField)
      arising from a use of `print' at <interactive>:1:0-5
    Possible fix: add an instance declaration for (Show JField)
    In a stmt of a 'do' expression: print it

Remember that in Java the default definition of toString() returns something like com.localhost.builderbuilder.JFieldDTO@53f67e; that’s also obscure at first glance. Haskell just goes a bit further, complaining that we haven’t defined how to show a JField instance. We can ask for a default implementation by adding deriving Show to a data type definition:

data JField = JField String String deriving Show

After loading that change, we get back a string that resembles the field’s defining expression:

*Main> field1
JField "name1" "Type1"

Step three: referential transparency

Our first model represented a Java class by its package, class name, and enclosed fields. The Haskell equivalent is:

data JClass = JClass String String [JField] deriving Show

The square brackets mean “list of …”, so a JClass takes two strings and a list of JField values. I’ll say more about lists in a moment, but first let’s deal with referential transparency.

We can build a class incrementally:

field1 = JField "name1" "Type1"
field2 = JField "name2" "Type2"
class1 = JClass "com.sample.foo" "TestClass" [field1, field2]

or all at once:

class1 = JClass "com.sample.foo"
                "TestClass"
                [   JField "name1" "Type1" ,
                    JField "name2" "Type2"
                ]

and get the same result:

*Main> class1
JClass "com.sample.foo" "TestClass" [JField "name1" "Type1",JField "name2" "Type2"]

As mentioned previously, only one of those definitions of class1 can go in our program. To Haskell, name = expression is a permanent commitment. From that point forward, we can use name and expression interchangeably, because they are expected to mean the same thing. That expectation would break if we were allowed to give name another meaning later (in the same scope).

Consequently, we can define a class using previously defined fields, or we can just write everything in one definition, nesting the literal fields inside the class definition. As we’ll see later, this also has implications for how we write functions; a “pure” function and its definition are also interchangeable.

Step four: lists

The array is the most fundamental multiple-valued data structure in Java; the list plays a corresponding role in Haskell. In fact, lists are so important that there are a few syntactical short-cuts for dealing with lists.

  • Type notation: If t is any Haskell type, then [t] represents a list of values of that type.
  • Empty lists: Square brackets with no content, written as [], indicate a list of length zero.
  • Literal lists: Square brackets, enclosing a comma-separated sequence of values of the same type, represent a literal list.
  • Constructing lists: The : operator constructs a new list from its left argument (a single value) and right argument (a list of the same type).

For example, ["my","dog","has","fleas"] is a literal value that has type [String] and contains four strings. "my":["dog","has","fleas"] and "my":"dog":"has":"fleas":[] are equivalent expressions that compute the list instead of stating it as a literal value.

By representing the fields in a class with a list, we achieve two benefits:

  • The number of fields can vary from class to class.
  • The order of the fields is significant.

Step five: types and records

Given a JField, how do we get its name? Or its type? We can define functions:

fieldName (JField n _) = n
fieldType (JField _ t) = t

and do the same for the JClass data:

package   (JClass p _ _ ) = p
className (JClass _ n _ ) = n
fields    (JClass _ _ fs) = fs

but all that typing seems tiresome.

Before solving that problem, let’s note two other limitations of our current implementation:

  • Definitions using multiple String values leave us with the burden of remembering the meaning of each string.
  • The derived show method leaves us with a similar problem; it doesn’t help distinguish values of the same type.

If you suspect that I’m going to pull another rabbit out of Haskell’s hat, you’re right. In fact, two rabbits.

Type declarations

We can make our code more readable by defining synonyms that help us remember why we’re using a particular type. By adding these definitions:

type Name     = String
type JavaType = String
type Package  = String

we can rewrite our data definitions to be more informative:

data JField = JField Name JavaType deriving Show
data JClass = JClass Package Name [JField] deriving Show

Record syntax

The second rabbit is a technique to get Haskell to do even more work for us. We represent each component of a data type as a name with an explicit type—all in curly braces, separated by commas:

data JField = JField {
     fieldName :: Name ,
     fieldType :: JavaType
} deriving Show

data JClass = JClass {
     package   :: Package ,
     className :: Name ,
     fields    :: [JField]
} deriving Show

When we use this syntax, Haskell creates the accessor functions automagically, and enables a more explicit and flexible notation to create values. All of these definitions:

field1  = JField "name1" "Type1"
field1a = JField {fieldName = "name1", fieldType = "Type1"}
field1b = JField {fieldType = "Type1", fieldName = "name1"}

produce equivalent results:

*Main> field1
JField {fieldName = "name1", fieldType = "Type1"}
*Main> field1a
JField {fieldName = "name1", fieldType = "Type1"}
*Main> field1b
JField {fieldName = "name1", fieldType = "Type1"}

Step last: that’s it!

We have covered quite a bit of ground! The complete source code for the model appears at the end of this post. With both Java and Haskell behind us, we have most of the basic ideas we’ll need for the Erlang and Scala versions.


Recommended reading:

Real World Haskell, which is also available for the Amazon Kindle or on-line at the book’s web site. I really can’t say enough good things about this book.


The current BuilderBuilder model in Haskell, along with sample data, is this:

-- BuilderBuilder.hs

-- data declarations

type Name     = String
type JavaType = String
type Package  = String

data JField = JField {
     fieldName :: Name ,
     fieldType :: JavaType
} deriving Show

data JClass = JClass {
     package   :: Package ,
     className :: Name ,
     fields    :: [JField]
} deriving Show

-- sample data for demonstration and testing

field1  = JField "name1" "Type1"
field1a = JField {fieldName = "name1", fieldType = "Type1"}
field1b = JField {fieldType = "Type1", fieldName = "name1"}

field2 = JField "name2" "Type2"

class1 = JClass "com.sample.foo" "TestClass" [field1, field2]

studentDto = JClass {
    package   = "edu.bogusu.registration" ,
    className = "StudentDTO" ,
    fields    = [
        JField {
            fieldName = "id" ,
            fieldType = "String"
        },
        JField {
            fieldName = "firstName" ,
            fieldType = "String"
        },
        JField {
            fieldName = "lastName" ,
            fieldType = "String"
        },
        JField {
            fieldName = "hoursEarned" ,
            fieldType = "int"
        },
        JField {
            fieldName = "gpa" ,
            fieldType = "float"
        }
    ]
}

Updated 2009-05-09 to correct formatting and add category.

BuilderBuilder: Haskell Preliminaries

The next step in the BuilderBuilder project is to develop a model in Haskell that is analogous to the Java model in the previous post. This post will introduce just enough Haskell to get started; the next post will get into the BuilderBuilder model.

Environment:

I’m using GHC 6.10.1, obtained from the Haskell web site. There are a variety of platform-specific binaries; I used the classic configure/make/install process on OSX. (For Java programmers, make is what we used instead of ant back in the Jurassic era.) Consult the Haskell Implementations page for details on obtaining Haskell for your preferred platform.

The complete development environment consists of two windows: one running a text editor, and the other running ghci, the interactive Haskell shell that comes with GHC.

Haskell introduction:

Use your text editor to create a file named bb1.hs with this content:

-- bb1.hs

-- simplest possible data declarations

data JField = JField String String

-- sample data for demonstration and testing

field1 = JField "id" "String"

-- sample function

helloField :: JField -> String
helloField (JField n t) = "Hello, " ++ n ++ ", of type " ++ t

Then run ghci as follows, where user input is underlined:

your-prompt-here$ ghci
GHCi, version 6.10.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer ... linking ... done.
Loading package base ... linking ... done.
Prelude> :l bb1.hs
[1 of 1] Compiling Main             ( bb1.hs, interpreted )
Ok, modules loaded: Main.
*Main> helloField field1
"Hello, id, of type String"
*Main> 

We started ghci, told it to load our source file (the :l … line), and then invoked the helloField function on the sample field. Now let’s examine the Haskell features used in that code. The lines beginning with double-hyphens are comments, and will be ignored in the description.

Defining data types

Because Haskell emphasizes functions, it’s no surprise that the syntax for defining data types is very lightweight. The Java BuilderBuilder model represents a field with two strings, one for the name and one for the type. The simplest possible Haskell equivalent is:

data JField = JField String String

This defines a data type named JField. It has a constructor (also named JField) that takes two strings, distinguished only by the order in which they are written.

Defining values

The next line of code defines an instance of this type:

field1 = JField "id" "String"

The equal sign means “is defined as”. That statement defines field1 as the instance of JField constructed on the right-hand side. It is not declaring and initializing a mutable variable. Within the current scope, attempting to redefine field1 will produce an error. (More about scope later.)

Defining functions

Finally, we have a simple function that converts a JField to a String.

helloField :: JField -> String
helloField (JField n t) = "Hello, " ++ n ++ ", of type " ++ t

Everything in Haskell has a type, including functions. The double colon means “is of type”, so the type of helloField is function from JField to String.

The value of applying helloField to a JField containing strings n and t is defined by the expression on the right-hand side. Haskell regards strings as lists of characters; the ++ operator concatenates lists of any type. The names n and t are only meaningful within that definition, similar to the local variables in this Java fragment:

public static String helloField(IJField f) {
    String n = f.getName();
    String t = f.getType();
    return "Hello, " + n + ", of type" + t;
}

Type inference

Java requires that we explicitly declare the local variables as type String. But in Haskell, because JField is specified to have two String values, the compiler can infer the types of n and t. In fact, the entire first line of helloField is not necessary. The defining equation in the second line explicitly uses a JField on the left and constructs a String on the right. Therefore, the compiler can infer JField -> String as the type of the function. Haskell’s type inference allows us to write very compact code without giving up strong, static typing.

To see that in action, add the following line to the end of your bb1.hs file:

hiField (JField n _) = "Hi, " ++ n

(The underscore is a wild card, showing the presence of a second value but indicating that we don’t need it in this function.)

Reloading bb1.hs in ghci allows us to see type inference at work.

*Main> :l bb1.hs
[1 of 1] Compiling Main             ( bb1.hs, interpreted )
Ok, modules loaded: Main.
*Main> hiField field1
"Hi, id"
*Main> :type hiField
hiField :: JField -> [Char]

Note that ghci reports the result type as [Char] rather than String; as mentioned above, Haskell regards strings as lists of characters, and String is simply a synonym for [Char].

As we’ll see later in this series, Scala brings type inference to the JVM environment. Coming from the dynamic language side, the Diamondback Ruby research project is adding type inference to Ruby. So perhaps type inference is (finally) an idea whose time has come.

We’ll pick up more Haskell details along the way, but we have enough to start defining our first BuilderBuilder model. That will be the subject of the next post.


Updated 2009-05-09 to fix formatting.

BuilderBuilder: The Model in Java

This post will describe a tiny Java model for implementing the BuilderBuilder task. It is simple almost to the point of crudity, because the goal of the series is to compare languages and styles, not to produce production-ready sample code.

This post will focus on the parts of the overall data flow highlighted below:

[Figure: BuilderBuilder data flow, with the model-building portion highlighted]

The interfaces:

I’m using interfaces to hide implementation from the remainder of the code. The first version will use simple DTOs, but I want to leave other options (e.g. by reflection against existing DTO classes) open for later exploration.

This first model has two interfaces, one for a Java class:

package com.localhost.builderbuilder;

public interface IJClass {
    public String getPkg();
    public String getName();
    public IJField[] getFields();
}

and the other for a Java field:

package com.localhost.builderbuilder;

public interface IJField {
    public String getName();
    public String getType();
}

We all know that “the simplest thing that could possibly work” doesn’t mean “the stupidest thing that could possibly work”. The use of an array may cross that line, but it was a deliberate choice. Developers who moved to OOP from imperative programming are very familiar with arrays. We’ll be able to compare array processing against the FP style of list processing, and perhaps consider other OOP alternatives later on.
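To make that comparison concrete, here is a minimal, self-contained sketch of the imperative array-processing style (the declarations helper and the demo class are inventions for illustration, not part of the model):

```java
// A small demonstration of imperative array processing over the
// model's IJField[] type. IJField is repeated here so the snippet
// stands alone.
interface IJField {
    String getName();
    String getType();
}

public class ArrayStyleDemo {

    // Index-based loop: the style most imperative programmers reach for.
    static String declarations(IJField[] fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            sb.append("private final ")
              .append(fields[i].getType())
              .append(' ')
              .append(fields[i].getName())
              .append(";\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        IJField id = new IJField() {
            public String getName() { return "id"; }
            public String getType() { return "String"; }
        };
        System.out.print(declarations(new IJField[] { id }));
    }
}
```

Later on we can hold this loop up against a Haskell map or fold over a list of fields.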

First implementations:

In the spirit of eating our own dog food, the simple DTO implementations of those interfaces will contain their own Builder inner classes. Given that, there’s no surprise in the JFieldDTO code, which appears at the end of this post.

The JClassDTO class throws in one new wrinkle—instead of having a fields(IJField[] fields) method that accepts an entire field array, JClassDTO.Builder provides a field(IJField field) method that accepts one field at a time, accumulating them to be placed in an array by the instance() method. The complete code for JClassDTO is given at the end.

It remains to be seen whether this DTO-style implementation is throw-away code, but getting a first implementation in hand will allow us to start comparing data types and structures with the other language, and then move directly to the generation phase of the project. We can always come back and add features (and complexity ;-) ) at a later time.
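As a quick sanity check of the intended call pattern, here is a condensed, self-contained version of the two DTOs, driven from main by a hypothetical Customer model (the class and field names are invented for illustration):

```java
// Condensed copies of the post's builder classes, just enough to show
// the fluent call pattern in a single compilable file.
import java.util.ArrayList;
import java.util.List;

interface IJField { String getName(); String getType(); }

class JFieldDTO implements IJField {
    private final String name, type;
    private JFieldDTO(String name, String type) { this.name = name; this.type = type; }
    public String getName() { return name; }
    public String getType() { return type; }
    static Builder builder() { return new Builder(); }
    static class Builder {
        private String name, type;
        Builder name(String n) { this.name = n; return this; }
        Builder type(String t) { this.type = t; return this; }
        JFieldDTO instance() { return new JFieldDTO(name, type); }
    }
}

class JClassDTO {
    private final String pkg, name;
    private final IJField[] fields;
    private JClassDTO(String pkg, String name, IJField[] fields) {
        this.pkg = pkg; this.name = name; this.fields = fields;
    }
    public IJField[] getFields() { return fields; }
    static Builder builder() { return new Builder(); }
    static class Builder {
        private String pkg, name;
        private final List<JFieldDTO> fields = new ArrayList<JFieldDTO>();
        Builder pkg(String p) { this.pkg = p; return this; }
        Builder name(String n) { this.name = n; return this; }
        // One field at a time; the list is converted to an array in instance().
        Builder field(JFieldDTO f) { fields.add(f); return this; }
        JClassDTO instance() {
            return new JClassDTO(pkg, name, fields.toArray(new JFieldDTO[fields.size()]));
        }
    }
}

public class BuilderDemo {
    public static void main(String[] args) {
        JClassDTO customer = JClassDTO.builder()
            .pkg("com.localhost.example")
            .name("Customer")
            .field(JFieldDTO.builder().name("id").type("String").instance())
            .field(JFieldDTO.builder().name("email").type("String").instance())
            .instance();
        System.out.println(customer.getFields().length); // prints 2
    }
}
```

The full implementations at the end of this post are the same shape, just with the interfaces in their own files.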


The JFieldDTO implementation:

package com.localhost.builderbuilder;

public class JFieldDTO implements IJField {

    private final String name;
    private final String type;

    public static class Builder {
        
        private String name;
        private String type;

        private Builder() {
            // do nothing
        }

        public Builder name(String name) {
            this.name = name;
            return this;
        }

        public Builder type(String type) {
            this.type = type;
            return this;
        }

        public JFieldDTO instance() {
            return new JFieldDTO(name, type);
        }
    }

    public static Builder builder() {
        return new Builder();
    }

    private JFieldDTO(String name, String type) {
        this.name = name;
        this.type = type;
    }

    public String getName() {
        return name;
    }

    public String getType() {
        return type;
    }

}

The JClassDTO implementation:

package com.localhost.builderbuilder;

import java.util.ArrayList;
import java.util.List;

public class JClassDTO implements IJClass {

    private final String pkg;
    private final String name;
    private final IJField[] fields;

    public static class Builder {
        
        private String pkg;
        private String name;
        private List<JFieldDTO> fields;

        private Builder() {
            fields = new ArrayList<JFieldDTO>();
        }

        public Builder pkg(String pkg) {
            this.pkg = pkg;
            return this;
        }

        public Builder name(String name) {
            this.name = name;
            return this;
        }

        public Builder field(JFieldDTO field) {
            this.fields.add(field);
            return this;
        }

        public IJClass instance() {
            return new JClassDTO(
                pkg,
                name,
                fields.toArray(new JFieldDTO[fields.size()])
            );
        }

    }

    public static Builder builder() {
        return new Builder();
    }

    private JClassDTO(String pkg, String name, IJField[] fields) {
        this.pkg = pkg;
        this.name = name;
        this.fields = fields;
    }

    public String getPkg() {
        return pkg;
    }

    public String getName() {
        return name;
    }

    public IJField[] getFields() {
        return fields;
    }

}

Updated 2009-05-09 to fix some formatting and to add a category.
