Ref: http://lists.canonical.org/pipermail/kragen-tol/2007-February/000847.html
novice's notes on learning OCaml
Kragen Javier Sitaker kragen at pobox.comThu Feb 22 03:37:01 EST 2007
- Previous message: installing Debian with debootstrap and LNX-BBC
- Next message: the presentation of "private" in programming languages
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I've just been learning OCaml for the first time; I've written OCaml programs before, but I didn't really learn enough of the language to do anything useful. I thought I would take some time to reflect on the experience of learning the language, since (barring brain damage) I won't have the opportunity to learn OCaml a second time. First, the positives. All Kinds Of Good Stuff ======================= OCaml has many advantages. Even the bytecode interpreter, running under PowerPC emulation on this Intel Mac, is much faster than Python's interpreter or any JavaScript interpreter. You can define new infix operations easily, or replace old ones (although you can't override the old ones for a particular type). Pattern matching makes all kinds of interpreter-like things a cinch. It's easy to use infix operations as values (merely by parenthesizing them). Implicit currying often makes programs briefer and sometimes even clearer. User-defined constructors display nicely by default, and just in general, tagged unions are nice to use. Pattern-matching is especially nice; consider this: let rec toyaplshow = ... function ... | Parenthesized e -> "(" ^ toyaplshow e ^ ")" | Binary (Atom _ as e1, f, e2) | Binary (Parenthesized _ as e1, f, e2) -> toyaplshow e1 ^ " " ^ f ^ " " ^ toyaplshow e2 | Binary (e, f, e2) -> toyaplshow (Parenthesized e) ^ " " ^ f ^ " " ^ toyaplshow e2 ;; Binary, Atom, and Parenthesized are three of the constructors for a certain abstract type; toyaplshow is a recursive function for displaying it as an expression; and "^" is string concatenation. In this case, I want to parenthesize the leftmost operand of a binary operation if it's not one of Atom or Parenthesized. Doing this in an object-oriented style, the code is about twice as long. class Expr: def needs_parentheses_sometimes(self): return True class Binary(Expr): def toyaplshow(self): left = self.left if left.needs_parentheses_sometimes(): left = Parenthesized(left) return "%s %s %s" % (left.toyaplshow(), self.op, self.right.toyaplshow()) class Atom(Expr): def needs_parentheses_sometimes(self): return False class Parenthesized(Expr): def needs_parentheses_sometimes(self): return False def toyaplshow(self): return "(%s)" % (self.expr,) In this case, there also happen to be more kinds of values than there are operations on them, so dividing the operations along class lines instead of along functional lines results in stronger coupling. Things can get far hairier; consider this example from the OCaml tutorial: let balance = function Black, z, Node (Red, y, Node (Red, x, a, b), c), d | Black, z, Node (Red, x, a, Node (Red, y, b, c)), d | Black, x, a, Node (Red, z, Node (Red, y, b, c), d) | Black, x, a, Node (Red, y, b, Node (Red, z, c, d)) -> Node (Red, y, Node (Black, x, a, b), Node (Black, z, c, d)) | a, b, c, d -> Node (a, b, c, d) In theory, you can get access to all of the Perl and Python libraries as well through FFIs to Perl and Python, but those FFIs don't seem to have been included in the OCaml package I got. As with Python and Perl, you get fairly full access to the POSIX API, but with some improvements; for example, select() takes lists of file-descriptor objects, gettimeofday() returns a float, and most of the system calls raise exceptions when they fail instead of returning an error code. I'm not sure all the changes are necessarily improvements; I might really prefer bitvectors for select() for efficiency, for example, and I'm not sure I can create a Unix.file_descr from an integer. But mostly they are. And there are optional and named arguments (which seem to mostly work), an object system that supports fully-static duck typing (at least as far as what methods your objects define), a profiler, a debugger, a native-code compiler that is reputed to generate good code, records with named fields, recursion, tail-call elimination, etc. No first-class continuations, but exceptions. And the compiler tells you if you left out a case in your pattern-matching case analysis. Some Bad Stuff Too ================== I wrote most of this while working through Jason Hickey's tutorial; these are my notes learning OCaml as an experienced programmer who doesn't know much OCaml. As programming language tutorials go, it's not too bad, although it's less than practical-minded; he doesn't mention I/O at all until chapter 8, or explain it in full until chapter 9. I'm not that impressed with the OCaml experience for newbies. Learning a new programming language is always somewhat frustrating; it usually takes several days of learning before you can do anything useful. But I've found OCaml particularly frustrating, despite its many advantages, and I thought I would write a bit about the frustrations in hopes that it might enable me to correct them later on. Some of these frustrations are deeply rooted in the nature of OCaml, and are not likely to go away without a completely new language, and will probably continue to bother me even when I know the language; others, such as badly-worded error messages, are quite superficial. I hope this doesn't attract flames; it's not intended to deprecate the OCaml language, which I am very impressed with. I'm a little worried that comparing OCaml unfavorably to déclassé languages like Perl, Python, and JavaScript will lead people to discount my opinion. I assure you, I know other programming languages well (C being my favorite low-level language and one I know well), but I find Python and JavaScript more practical than C most of the time. Insufficiently Helpful Error Messages ------------------------------------- Here's a typical newbie mistake. I mistakenly wrote the pattern "(a, b) :: rest" as "a, b :: rest", thinking that would parse as I intended (since, of course, [a, b; c, d] parses as [(a, b); (c, d)]). The error message from the ocaml toplevel said: This pattern matches values of type 'a * 'b list but is here used to match values of type 'c list Now, if you read the error message correctly, it says that I'm trying to match a list (which ocaml knew because another pattern-matching case was []) against a pattern for a tuple containing a list as a second member. But the type "'a * 'b list" is structurally ambiguous in the same way that the pattern is; I parsed it as ('a * 'b) list, which was what I intended, not 'a * ('b list), which is what the compiler meant. In short, I was communicating ambiguously to the compiler, which misinterpreted what I meant as something that didn't make sense; in order to explain how it interpreted my statement, it used an explanation that was ambiguous in exactly the same way, so that it appeared to me that the compiler had understood my intent, but for some mysterious reason did not think it was a good idea to carry out. Of course, anyone who knows the least little bit of OCaml can parse the type, and furthermore is not likely to seek explanations in "mysterious reasons". But as I was just beginning to learn OCaml, everything was somewhat mysterious to start with. Another example: function ([], []) -> 3 | (a1::as, b1::bs) -> 4 | _ -> 5 ;; ^ Syntax error: ')' expected, the highlighted '(' might be unmatched This pattern-match doesn't parse. Why? I couldn't tell. The error message doesn't help. (It turns out that "as" is a reserved word.) Even "assert", which you put in your program so as to give you a useful error message, gives you crap like this: Exception: Assert_failure ("toyapl.ml", 62, 4). It does tell you which assertion failed (eventually you learn that that means "line 62, column 4") but it would be nice if it included the string of the actual expression that failed. Most of my assertions are unit tests, usually assertions of equality between two things, an expected value and an actual value. I can compare these two with the "=" operator, but I cannot figure out how to declare an exception to tell me what they were: exception Test_failure of int * 'a * 'a ;; ^^ Unbound type parameter 'a I'd be satisfied with string representations; somehow the toplevel produces string representations of objects of arbitrary types, but I haven't yet learned how to do that. Another example that's not quite so egregious: let y = List.map ((+)2) up_to 2000000 ;; ^^^^^^^^ This function is applied to too many arguments, maybe you forgot a `;' In my experience so far, every time I have gotten this message, it hasn't been because I forgot a ';'; it's been because I neglected to include some parentheses. Sometimes Completely Enigmatic Error Messages --------------------------------------------- This expression has type toyaplexpr but is here used with type toyaplexpr This can happen when you redefine a type in the interactive toplevel, but some references to the old type remain. The enigmatic nature of this error message is such a well-known bug that it is described in many introductions to OCaml. # Unix.getpid () ;; Reference to undefined global `Unix' This means that it can't access the "Unix" library, although it doesn't bother to say why. It does not mean that the "Unix" library doesn't exist, or that you misspelled its name. It means you needed to launch the ocaml interpreter with "ocaml unix.cma" instead of "ocaml". Insufficiently Helpful Exceptions --------------------------------- # List.assoc "hi" ["ho", 3] ;; Exception: Not_found. Good thing "hi" was a constant --- it would be hard to tell what was "Not_found", because the Not_found exception doesn't take any arguments. Similarly: # int_of_string "31 ";; Exception: Failure "int_of_string". It would be helpful if the exception told what string it failed on. I think it may be impossible to add this in a backward-compatible way to OCaml today. Along the same lines, these exceptions fail to acquire any stack-trace information; you get the same error message of two or three words if the exception happened deep down the stack, in the middle of a big program. The default Python stack trace at least shows you each line of code in the exception stack, and I usually use cgitb, which also shows you a few lines of context and the values of all the variables on the line of code in question. Usually this is sufficient to fix the problem without further experimentation. The lack of a stack trace on exceptions is hardly a disadvantage when you're comparing to C (no exceptions) or C++ (no stack traces either), but it's a disadvantage comparing to Java, Lisp, or Python; even Perl records the source file and line number of origin by default, and a full stack trace if you use "croak" instead of just "die". Insufficient Introspection -------------------------- So what does List.assoc do, anyway? # List.assoc ;; - : 'a -> ('a * 'b) list -> 'b =Well, that's nice --- it tells you what the arguments and the return type are. Maybe from those you can figure out what the function is for. (In OCaml, that's not as far-fetched as it sounds; I can't think of any other useful functions that have the type signature above. But it takes some thought.) But if you type an analogous expression in R, you get the entire source code of the function. In Python, the default response is similarly unenlightening: >>> map But look, you can do this: >>> help(map) Help on built-in function map: map(...) map(function, sequence[, sequence, ...]) -> list Return a list of the results of applying the function to the items of the argument sequence(s). If more than one sequence is given, the function is called with an argument list consisting of the corresponding item of each sequence, substituting None for missing values when not all sequences have the same length. If the function is None, return a list of the items of the sequence (or a list of tuples if more than one sequence). How about the functions available in the List library? The List library isn't even a value that you can type as an expression: # List ;; Characters 0-4: List ;; ^^^^ Unbound constructor List Compare Python: >>> import string >>> dir(string) ['__builtins__', ... 'upper', 'uppercase', 'whitespace', 'zfill'] >>> dir() ['__builtins__', '__doc__', '__name__', 'string'] >>> dir(__builtins__) ['ArithmeticError', 'AssertionError', 'AttributeError', ... 'staticmethod', 'str', 'sum', 'super', 'tuple', 'type', 'unichr', 'unicode', 'vars', 'xrange', 'zip'] Python even has features like rlcompleter2 and IPython which allow you to type "string." and hit Tab to see what's in the "string" module. (What Python calls "modules", OCaml calls "libraries".) This would be a smaller problem if the OCaml documentation were complete and up-to-date, but see the section "Missing Documentation". Too Much Precedence ------------------- OCaml has a lot of syntax. Another typical newbie mistake: let simpleenv = ["+", fun x -> x; "-", (~-)] ;; Commonly, because "," binds tighter than ";", you can conveniently write lists of pairs conveniently like [1, "one"; 2, "two"]. In this case, though, the "fun" gobbles up everything to the right, as explained by the type the interpreter spits out: (string * ('a -> string * (int -> int))) list Which is to say, a list of pairs of strings and functions, where each function takes a value of some arbitrary type, discards it, and returns a pair of a string and a function from ints to ints. Somewhat Low-Level ------------------ The negative side of Ocaml's higher speed is that it's somewhat low-level; the intrinsic data structures are things like linked lists, fixed-size arrays, strings, and several different struct-like structures (tuples, constructors, and records). There are libraries that provide things like auto-growing string buffers, queues, sets, hash tables, balanced binary trees (in case you're a masochist), (Perl-incompatible) regular expressions, random number generation, large multidimensional numerical arrays, type-unsafe data structure marshaling, arbitrary-precision numbers, MD5, command-line parsing, and so on. However, it appears that because Ocaml's "module", functor, and object-orientation facilities are not widely enough used, the information about whether you're using, say, a hash table or a balanced binary tree, will be spread through all the code that accesses it. Lame Libraries -------------- It's not that the regexp library (in the Str module) is gratuitously Perl-incompatible, or that it gratuitously uses global state; given OCaml's age, that may be forgivable. It's just that the library is small. Some people see this as an advantage: It's not that there are an extraordinary number of tools in the toolbox. (No; in fact, the toolbox is much smaller than the usual toolboxes, the ones used by your friends that contain everything but the sink.) It's that the toolbox was carefully and very thoughtfully assembled by some very bright toolsmiths, distilling their many decades of experience, and designed, as all good toolkits are, for a very specific purpose: building fast, safe and solid programs that are oriented around separate compilation of functions that primarily do recursive manipulation of very complex data structures. Programs, like, say ... compilers. - Dwight VandenBerghe <dwight at pentasoft.com>, 1998-07-28, "Why ML/OCaml are good for writing compilers", posted to comp.compilers, currently online at http://flint.cs.yale.edu/cs421/case-for-ml.html I don't see it as an advantage, myself. I'm glad OCaml isn't multiple gigabytes like the Apple Developer Tools CD, but here I have listed some of the things I hoped were there, but that I can't find in the standard library shipped with the language. Most of the things listed below have Python versions preinstalled on this Mac in /System/Library/Frameworks/Python.framework (66MB), Perl versions in /System/Library/Perl (45MB), and Java versions in /System/Library/Frameworks/JavaVM.framework/Versions/1.5.0/Classes/classes.jar (21MB, and that includes something like seven independent implementations of the DOM.) ocaml-3.09.2.dmg is only 13MB; maybe there's an unmet need for an "OCaml distribution" that includes standard libraries that cover most of these things and weighs in at 25-50MB. - a more sophisticated unit-test framework than "assert" - basic algorithmic stuff: - priority queues (there's one on p.25 of the user's manual but none in /usr/local/lib/ocaml) - binary search - auto-resizing arrays (like OrderedCollection or Python's lists; the Buffer library has an implementation for strings) - CRC - random-shuffle and random-choice - parsing and production of common data formats: - SGML or HTML or XML - any parser at all would be a big improvement, but if it had some of the horrible standard interfaces like DOM, SAX, XPath, or even XSLT, I could pick it up and use it more easily. Amusingly, one of the best XQuery implementations is written in OCaml. - URLs - CSV - RFC-822 parsing - MIME - RSS - mbox - UTF-8 (!) - other character encodings - .ZIP files - networking things: - a socket interface that lets me open a TCP connection to a DNS name with a single function call - an HTTP client library (server would be nice too, but its absence is less surprising; but, hey, guys, I hear this WWW thing might catch on) - other things are also missing but that's less surprising (SSL/TLS, IMAP, FTP, SMTP, SNMP, LDAP, SOAP, XML-RPC, IRC, NNTP, POP) - signal-processing things: - the ability to read and write common image formats (JPEG, PNG, GIF, PBM) - the ability to lines, text, and other graphical objects on them, with alpha and antialiasing - the ability to read and write common audio formats (AU, AIFF) - other DSP stuff would be gravy but it's not surprising it's missing; the Bigarray library contains roughly the absolute minimum - I'd complain about the absence of GUI stuff, but actually I think it's included, but doesn't work due to the absence of C development tools on this Mac. (Although maybe it only works on X11.) - type-safe marshalling and unmarshalling, at least for some subset of data types (e.g. JSON, YAML) - some kind of RPC (I'm not asking for IIOP here!) - data compression and decompression tools - SHA-1 and other cryptographic tools (it's nice that it has MD5, but DES, AES, and RSA would be big pluses) - termcap - some kind of DBM or Berkeley DB interface - some kind of SQL interface (at least SQLite!) - some kind of date handling (although the Unix module does have gmtime, localtime, and mktime, it doesn't have strftime, and using mktime to print a calendar is kind of a pain) - file pathname manipulation (dirname, basename, relative-path resolution) - some kind of concurrency handling. I'm not saying I want threads --- I would never say that --- but threads, or Erlang-style processes with easy communication, or Twisted-style reactors, or something! Maybe there's something in the "vmthreads" directory, but it's hard to tell (see sections Missing Documentation and Insufficient Introspection for why.) - some kind of logging for debugging (more than just Printf.fprintf stderr) A lot of these things, maybe all of them, are available in other libraries you can download from various sites on the internet. Missing Documentation --------------------- The MacOS X man pages for things like "ocaml" suggest reading "The Objective Caml user's manual", but it does not seem to be included in the disk image from which I installed OCaml. Also, many of the .cm* files in /usr/local/lib/ocaml lack man pages. There are occasional errors in the documentation (for example, the Pervasives man page, for example, says that integers are 31 bits wide rather than 30; and the manual uses {< >} in an example a few sections before explaining what it is) but I don't think they're as important as the documentation that's missing altogether. Later, when I had internet access again, I downloaded "The Objective Caml system release 3.09 Documentation and user's manual", which I think is the user's manual referred to in the man pages; but I find that it refers me to the Camlp4 documentation, which I'm pretty sure I don't have, for information about pattern-matching on streams. Unfortunately I may not have internet access again for a while. Arbitrary Limits ---------------- These limits seem most unfortunate to me: # Sys.max_string_length ;; - : int = 16777211 # Sys.max_array_length ;; - : int = 4194303 I frequently deal with chunks of binary data bigger than 16 megabytes in size, and arrays of more than 4 million items; I'm typing this on a laptop with two gigabytes of RAM. And the standard file operations (seek_in, pos_in, etc.) use 30-bit numbers to represent file positions and sizes, with the consequence that every use of them adds a bug to your program whenever it deals with a file over 256MB in size. (Unless you're on a 64-bit platform, I think.) The problem with this is that 256MB is now 15 cents' worth of disk space, so I have a lot of files bigger than that. Too Much Early Binding ---------------------- I'm sure Xavier Leroy and company, too, wish seek_in and pos_in used 64-bit ints, and that you could tell what key was Not_found in an association list, and that you could change your program from storing stuff in an array to storing it in a hash-table easily when the keys turned out to be sparser than you expected. The module system makes the last item at least possible, but the others are not supported in a backwards-compatible way by the type system. Of course, a great deal of the Zen of OCaml is that lots of binding can happen at compile-time rather than at run-time, so your code can get type-checked and maybe run fast too. But the binding I'm complaining about is happening earlier, at edit-time; you have to actually change your source code in order to accommodate the use of Unix.LargeFile.lseek and its int64 arguments. It isn't obvious to me that, in principle, static typing requires these sorts of decisions to be accidentally strewn across your entire codebase, given the existence of type inference and parametric polymorphism in the form of functors. This is not a new problem --- it happens in C programs all the time, due to the ubiquitous explicit static typing. It's just that I had the impression that OCaml had perhaps conquered this problem, as have the languages I normally use day-to-day. Here's another, smaller-scale example. My unit tests largely consist of code like this: assert ([45; 50] = eval (Atom [45; 50])); Maybe at some point I would like to use floating-point for all my interpreter's values instead of integers. I could imagine a language where I could retain compatibility with these tests by making Atom into a function that does the necessary conversion for the old tests; but that language is not OCaml, because in OCaml, your functions cannot begin with capital letters. Clumsiness Working With Source Code Files ----------------------------------------- The unfortunate necessary duplication of code between .ml and .mli files (if you have .mli files) is well-documented elsewhere. But how about if you want to interactively test some code from your module to see why it fails a unit test? Here's how I load that code into a toplevel so I can test it: Beauty:~/devel/ocaml-tut kragen$ ocaml toyapl.ml Exception: Assert_failure ("toyapl.ml", 62, 4). Beauty:~/devel/ocaml-tut kragen$ ocaml Objective Caml version 3.09.2 # Toyapl.eval ;; Characters 0-11: Toyapl.eval ;; ^^^^^^^^^^^ Unbound value Toyapl.eval # Beauty:~/devel/ocaml-tut kragen$ ocamlc -g -c toyapl.ml Beauty:~/devel/ocaml-tut kragen$ ls toyapl.* toyapl.cmi toyapl.cmo toyapl.ml toyapl.ml~ Beauty:~/devel/ocaml-tut kragen$ ocaml toyapl.cmo Exception: Assert_failure ("toyapl.ml", 62, 4). Beauty:~/devel/ocaml-tut kragen$ # I edit the file, comment out the assert Beauty:~/devel/ocaml-tut kragen$ ocamlc -g -c toyapl.ml Beauty:~/devel/ocaml-tut kragen$ ocaml toyapl.cmo Objective Caml version 3.09.2 # Toyapl.eval ;; - : Toyapl.toyaplexpr -> Toyapl.toyaplval = The (faked) equivalent in Python: >>> import toyapl Traceback (most recent call last): File "toyapl.py", line 62, in test AssertionError: ([45], 45) >>> toyapl.eval Python checks to see if there was a bytecode file for toyapl.py, and if so whether it is up-to-date; if not, it recompiles it; then it imports the bytecode file. There is an exception during import (my unit tests normally run during import) but because I'm in an interactive toplevel, the module remains imported anyway so I can poke at it and see what was wrong. In a similar vein, consider this. Beauty:ocaml kragen$ cat > time.ml print_float (Unix.gettimeofday()) ;; print_string "\n" Beauty:ocaml kragen$ ocaml time.ml Reference to undefined global `Unix' Beauty:ocaml kragen$ (echo '#load "unix.cma" ;;'; cat time.ml) > time2.ml Beauty:ocaml kragen$ ocaml time2.ml 1170635455.04 Beauty:ocaml kragen$ ocamlc time2.ml File "time2.ml", line 1, characters 0-1: Syntax error The #load directive makes it work in the interpreter, but causes a syntax error in the compiler. I think the right solution is to run the program with "ocaml unix.cma time.ml" and compile it with "ocamlc unix.cma time.ml", but it's a non-obvious failure mode. This particular problem (in effect, the interpreter supports a superset of the language supported by the compiler, and I made the mistake of using one of the extensions) could be ameliorated by a more helpful error message. Syntax ------ I'm not going to belabor the point on syntax. Everybody knows there's a problem; I explained some aspects of the problem in some previous sections. This bites newbies like me particularly hard. Out-Of-Dateness --------------- ocamlopt generates PowerPC assembly code on this Intel MacBook, but it's January 2007. See also the complaints about libraries. Possible Improvements ===================== Probably some of these are bad ideas, as will be obvious to me after I have more experience. I'm not (yet) volunteering to do them. 1. A fatter distribution. Including the manual (1MB), ocamlnet (800K), and the ocamlnet manual (450K) would go a long way toward removing my own biggest frustrations. 2. A simpler syntax with fewer levels of precedence. Maybe writing OCaml in S-expressions would be going too far, or maybe not. Clearly this would make it effectively a different language. 3. No missing man pages. 4. No arbitrary limits on object size (other than physical address space, of course.) 5. Clearer wording for many error messages. 6. Reflection/introspection capabilities including the ability to interactively list the contents of modules. Full reflection could be difficult to reconcile with static type-checking, but I think it's possible; but even modest reflection capabilities would help a lot with interactive use. 7. Self-documentation --- the ability to attach doc strings to functions and types, and doc strings being present throughout the standard library. 8. Exceptions that accumulate a full stack trace. This is a little hairy to get to work in native code, but I don't think it's intractable.
- Previous message: installing Debian with debootstrap and LNX-BBC
- Next message: the presentation of "private" in programming languages
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Kragen-tol mailing list
Không có nhận xét nào:
Đăng nhận xét