Foreign Language Lessons, part 3

Posted in programming - Thursday, July 14, 2011

So, continuing on with the fun from last time, I went back to the my Lisp program and translated it into Erlang. Unlike the complete re-think I had to pull off in going from Ruby to Lisp, this was very straightforward.

Here’s the meat of the Erlang code:

-record(line, {content, indent}).
-record(node, {content, children}).

parse_text_line(Text) ->
    %% converts "    text" to (:content "text" :indent 4)
    Content = string:strip(string:strip(Text, left), left, $\t),
    Indent = length(Text) - length(Content),
    #line{content=string:strip(Content, right, $\n), indent=Indent}.

%% Split a list of lines in two, breaking on the first line whose indent is =< the minimum.
split_list(_MinIndent, List1, []) -> [lists:reverse(List1), []];
split_list(MinIndent, List1, [First|Rest]) when First#line.indent =< MinIndent ->
    [lists:reverse(List1), [First|Rest]];
split_list(MinIndent, List1, [First|Rest]) ->
    split_list(MinIndent, [First|List1], Rest).

%% Parse a list of 'line' record into a tree structure, based on indent
build_nodes([]) -> [];
build_nodes([First|Rest]) ->
    [ChildLines, SiblingLines] = split_list(First#line.indent, [], Rest),
    Children = build_nodes(ChildLines),
    Siblings = build_nodes(SiblingLines),
    Node = #node{content=First#line.content, children=Children},
    [Node|Siblings].

And the equivalent Lisp:

(defun parse-text-line (line)
  ;; converts "    text" to (:content "text" :indent 4)
  (let ((content (string-left-trim '(#\Space #\Tab) line)))
    (list :content content :indent (- (length line) (length content)))))

(defun split-list (criterion data-list)
  ;; break a list into two lists; the second begins with the first element that matches the criterion.
  (if data-list
      (if (funcall criterion (first data-list))
          (list () data-list)
          (let ((me (pop data-list)))
            (let ((chunks (split-list criterion data-list)))
              (push me (first chunks))
              chunks)))
      (list () ())))

(defun build-nodes (lines)
  (if lines
      (let* ((me (pop lines))
             ;; split off our children from the rest of the lines.
             ;; the first line with an indent <= ours is a sibling, not a child
             (chunks (split-list (lambda (x) (<= (getf x :indent) (getf me :indent))) lines)))
        (let ((children (build-nodes (pop chunks)))
              (siblings (build-nodes (pop chunks))))
          (let ((node (list :content (getf me :content))))
            (if children (setf (getf node :children) children))
            ;; (format t "~a~%" (getf node :content))
            (push node siblings))))
      nil))

Erlang has obvious Lisp roots, so most of the concepts translated 1-for-1. (I didn’t have any macros in my Lisp code, which might have made that hairier.) Most of it was changing dashes to underscores and moving parentheses around, but Erlang has some new features of interest.

Pattern Matching

Most of the Erlang functions take advantage of its pattern matching. This is a more extreme form of the method overloading that Java developers are familiar with. Erlang does allow varying numbers of parameters, as in

show_indented(Nodes) -> show_indented(Nodes, 0).
show_indented(Nodes, Indent) ->
    ...

where the one-parameter version just calls the two-parameter version with a default indent. But more interestingly, it lets you define functions by the parameter values, as we see here:

build_nodes([]) -> [];
build_nodes(Lines) ->
    ...

The first build_nodes function matches when it’s passed an empty array. If any other value is passed, it fails to match the first version and tries the second. The second matches any parameter, and assigns its value to the Lines variable. The really wacky case of this is in the definition of

main([[$\:|FuncName]|Filenames]) ->
    ...
main(Filenames) ->
    ...

The first one only matches a list whose first element is a string beginning with a ‘:’ (":my_function”), and handily assigns the rest of the string to the FuncName variable (FuncName = “my_function”) and all the other arguments to Filenames. No match, and it just assigns the whole argument list to Filenames. You could implement that functionality in any other language, but this expression is very concise and clear, very declarative.

The case clause works in a similar way:

case Node#node.children of
        [] -> io:fwrite("~s~n", [NodePath]);
        Children -> show_path(Children, NodePath)
    end

This says, “If node’s children are an empty list, write out the node path; otherwise, assign them to Children, and pass that to show_path() along with NodePath.

On the whole, as odd as Erlang’s syntax is, I find it friendlier than Lisp’s. The pattern matching is weird at first, but very elegant and powerful once you wrap your head around it. I’d be interested to know the reaction of someone new to programming. I suspect they’d be less thrown by the syntactic quirkiness - the arrows and the comma vs. semicolon vs. period for line endings.

Spawn!

But all this is not why I’m learning Erlang. If it were just a stylistic improvement over Lisp, that wouldn’t justify its existence. We’re here for crazy parallel processing hijinks. So ok, here, let me re-write this so that it spawns a new process for every node. When the node is done building its children and siblings, it sends a message with all its data to its parent node, and so on up and down the tree. Ready? Here we go…

build_nodes(Parent, Group, []) -> Parent ! {Group, []};
build_nodes(Parent, Group, Lines) ->
    [First|Rest] = Lines,
    %% split off our children from the rest of the Lines.
    %% the first line with an indent =< ours is a sibling, not a child
    Criterion = fun(X) -> X#line.indent =< First#line.indent end,
    [ChildLines, SiblingLines] = split_list(Criterion, Rest),
    spawn(?MODULE, build_nodes, [self(), children, ChildLines]),
    spawn(?MODULE, build_nodes, [self(), siblings, SiblingLines]),
    receive
        {children, Children} ->
            Node = #node{content=First#line.content, children=Children}
    end,
    receive
        {siblings, Siblings} ->
            io:fwrite( "[~s] ~p~n", [Node#node.content, self()]),
            Parent ! {Group, [Node|Siblings]}
    end.

Here’s the full thing, but that’s about it. Instead of calling build_nodes() directly, we spawn them off and wait for the results. Instead of just returning our list of nodes, we send it to our Parent. (“Group” is just to keep children and siblings separate.) If you look at the diff, you’ll see a couple more minor things, but that’s all there is to the functional change.

So Erlang is surprisingly useful as a general-purpose scripting language. Probably not the first tool I’d reach for, but it doesn’t suck. And all of its limitations and eccentricities pay off when you start spawing processes. They straightjacket you into designing something that’s easily parallelized. I’m imagining trying to do a similar conversion in Java, where I’d be building a shared data structure and having to manage synchronization, thread completion, ordering results, and all that. Bleah.

(I have to point out that this is actually a bad example - this is not the kind of processing you want to parallelize. It schleps around a fair amount of data, and does very little processing. You really want that the other way around.)