Dec 28, 02:09 PM

I finally managed to build

This, as many things, was more involved than I had hoped at the outset it would be. I ended up having to wade into both a new language (F#) and a new ecosystem (specifically .NET, but I’d never really done any kind of “Windows-oriented” development before, either). What follows are some thoughts about that experience, not really tied together by any common thread other than being related somehow to achieving this one goal.

I have been spoiled by Rust.

This theme has two angles to it, the first of which is that I have come to believe that exceptions are Not My Favorite Error Handling Mechanism. You can’t tell, for example, just by lookng at a function’s signature what kind of monstrous error might come bubbling up the stack out of it. Well-documented libraries will generally have a section for this, but you have to trust the maintainer to get it right. Are they really sure they’ve trapped everything they haven’t explicitly listed? What if one of the dependencies changes?

In contrast, I think Rust’s Result type, combined with the ? operator, really hits the sweet spot. You’re guaranteed to know what kind of failure mode you might encounter, and you can’t fail to deal with it.

I don’t pretend to really understand the internals here, but it must also be the case that setting up for and trapping an exception is more involved (and less performant) than just dealing with a function’s return value.

In rust, I might have a sort of fallback value parser like this:

pub enum Val {
    Int(i64),
    Float(f64),
    Str(String),
    // ... maybe other variants
}

impl Val {
    /// Attempt to parse a string value several different ways.
    pub fn guess(s: &str) -> Val {
        match s.parse::<i64>() {
            Some(n) => Val::Int(n),
            Err(_) => match s.parse::<f64>() {
                Some(x) => Val::Float(f64),
                Err(_) => Val::Str(s.into()),
            }
        }
    }
}

Whereas in F# I’m pretty sure I have to do something like this:

type public Val =
    | Int of int64
    | Float of double
    | Str of string
    // ... maybe other variants

module public Val =
    // Attempt to parse a string value several different ways.
    let public guess (s: string) : Val =
        try Val.Int (int64 s)
        with | _ -> try Val.Float (double s)
                    with | _ -> Val.Str s

Those two try blocks have to be more fanfare at runtime than just the matching on return values. Right?

The other angle to my Rust-induced softness is that Cargo is just so convenient. Most projects need nothing more than a Cargo.toml file to which you add one line per dependency. The dotnet build (or msbuild) process is the polar opposite. For even a simple project with a couple of dependencies, you essentially need MonoDevelop1 to write a copule of screenfuls of XML for you. If you need to modify it, all the answers online all say something like, “open Visual Studio, navigate three menus deep, and check this box” because it’s essentially too byzantine to edit by hand without screwing up.

When it comes time to actually build, cargo build --release, or, if you’re cross-compiling, cargo build --release --target x86_64-unknown-linux-musl (or whatever) and you have a nice binary you put where you need.

In contrast, msbulid -Property:Config=Release populates the build target directory with not only an .exe file, but also a phalanx of .dlls that all need to end up in the right places on the target system. You need a different command, and to use a lot of non-default options, in order to embed all your dependencies in the executable so you can plop it on the target machine without needing to also put together a custom .msi. And if you want an actual standalone executable that doesn’t need the target machine to also have some compatible version of the .NET CLR installed, you need to use another tool entirely.

F# itself is actually pretty nice.

(Apart from the dependency/build system.)

It looks a lot like Scala and a little like Haskell (and more directly, I imagine, like OCaml, although I have no direct experience here). It has all the cool stuff you’d expect from a “functional first” language:

It also has a general “sequnce” collection which serves as an abstraction for iterative operations over lists, arrays, etc.

I did a chunk of this year’s Advent of Code in F#, and I definitely enjoyed it. I suspect the absence of any real dependency management or reliance on .NET functionality helped considerably.

F# has one huge wart: Most of its library-type functionality comes from the various .NET APIs, which are written to be used with C#, so any time you interact with any of this stuff, your code ends up looking like a weird Frankensteinian thing with a bunch of deep-OO-class-heirarchy-looking notation sitting in it like lima beans in your ice cream:

let nft = wbp.WorkbookStylesPart.Stylesheet.CellFormats.Descendants<CellFormat>() |>
          Seq.map (fun cft -> if isNull cft.NumberFormatId then 0u
                              else cft.NumberFormatId.Value) |>
          Seq.toArray

(This code just looks like a Subaru BRAT to me.)

Interleaved with this is the fact that, in F#, all .NET functions take a single value, which is a tuple that matches the arity of the function. This makes the callsites in your code look like they’ve been lifted from a mainstream OO language:

System.String.Join(", ", listItems)

This could be C# code,2 passing ", " and variable listItems as arguments to the System.String.Join function, except it’s F#, so you’re passing a single argument, which is a tuple.3

All of which isn’t really a criticism of F# so much as its status as a second-class citizen in the .NET ecosystem.4 I would happily work in F#, especially if I could avoid as much of the .NET API surface as possible.

F# has pretty unsatisfying editor support.

I am not a big fan of VS Code5. While it’s maybe the least terrible Electron app I’ve ever used, it’s still an Electron app; there are plenty of leaner editors I like better. Also, it’s evil, so nobody should use it. It has a powerful F#-supporting plugin in the form of Ionide, but the LSP depends on having the .NET SDK installed, which I’ve otherwise not needed (on Linux, with Mono at least), so I’d rather not pollute my sources.list with another third-party apt repo that doesn’t even really have support for Trixie6 anyway. So the ol’ standby nuclear option doesn’t really live up to its potential in this case.

Ideally, I’d use Helix, of course, but it uses the same LSP as Ionide, and also the Tree-Sitter F# grammar occasionally just fritzes out and gets stuck.

Obviously Visual Studio itself has great F# support, but

  1. Linux
  2. Fuck Visual Studio

It all but brings my work computer (a Windows machine) to its knees, so I use it even there as little as possible.

MonoDevelop is pretty close to being great. I use it to generate the reams of XML that msbuild requires to make sense of your projects, and it seems to have its own effective F# LSP built in. But it has some absolutely irritating autoindent behavior that crosses the line into dealbreaker territory when I’m not at my most patient.

Geany and Pluma obvs don’t do the LSP thing; Geany also doesn’t highlight quite right, and Pluma… just isn’t that great of an editor.

So I bounce around between editors as I get fed up with each’s deficiencies in turn, which probably isn’t the best way to work.

.NET is Ponderous AF

Every part of it is a massive, deep class hierarchy with all kinds of interfaces that are only ever implemented by one class. I think it’s safe to say that every criticism of OO that you’ve ever read online could be legitimately levelled at .NET. Also, the official documentation manages to be verbose and extensive without actually being all that helpful. (Maybe the idea here was to echo the ponderous, verbose, bureaucratic vibe of the libraries themselves.)

The Office OpenXML format seems awkward.

(Even for a big bog of XML tags.)

I am all for plain-text, human-readable formats. They generally aren’t as space/bandwidth/cycle-efficient as specialized binary formats, but having data that can be interpreted with eyeballs makes understanding, implementing, and debugging software that works with the format about 1.8 zillion times easier. I know that human eyes on XML is generally only for emergency situations and that XML isn’t really meant for human consumption, But OpenXML makes some weird choices. Here’s a representative row from a sample spreadsheet:

<x:c r="A6" t="s" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>103</x:v>
</x:c>

<x:c r="B6" s="2" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>44620</x:v>
</x:c>

<x:c r="D6" s="3" t="s" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>179</x:v>
</x:c>

<x:c r="G6" t="s" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>149</x:v>
</x:c>

<x:c r="H6" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>1.5</x:v>
</x:c>

<x:c r="I6" t="s" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>235</x:v>
</x:c>

<x:c r="J6" xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:v>71</x:v>
</x:c>

(Inter-tag whitespace inserted by me out of consideration for you.)

The x:c tag represents a single spreadsheet cell; the contained x:v tag represents the value in that cell. You observe that all the values here are numbers. The 1.5 is actually the decimal value 1.5, and the 71 is actually the integer 71. Four of those numbers, though, are references into a “shared string table”; finally, 44620 is actually a date, expressed in days since January 1, 1900, a totally normal reference to use. The strings you can recognize by the t="s" (type = string) property, so you know to go look that number up in the shared string table. To find out that the date is actually a date, though7, you need to follow the s="2" (style = 2) property to the styles table, and look up style 2:

<x:xf numFmtId="14" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1"
      xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main" />

In which the numFmtId property points to an entry in yet another table, this time of number formats… unless there isn’t one, in which case its value is a magic number you can find buried in the spec. The 14 here means mm-dd-yy. Also note the applyNumberFormat property, which I think is a boolean value indicating whether the formatting specified by the numFmtId should be applied to cells with this style. Which seems redundant given that you could just specify a style with the “General” number format (or probably no format at all).

Let’s look at a copule entries from the shared string table.

<x:si xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:t>counties touched list 10.15.2023 and before</x:t>
</x:si>

<x:si xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:t>municipalities touched 10.15.2</x:t>
</x:si>

<x:si xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:r>
    <x:t xml:space="preserve">counties</x:t>
  </x:r>
  <x:r>
    <x:rPr><x:sz val="16" /><x:color theme="1" /><x:rFont val="Calibri" />
      <x:family val="2" /><x:scheme val="minor" />
    </x:rPr>
    <x:t xml:space="preserve">not </x:t>
  </x:r>
  <x:r>
    <x:rPr><x:sz val="11" /><x:color theme="1" /><x:rFont val="Calibri" />
      <x:family val="2" /><x:scheme val="minor" />
    </x:rPr>
    <x:t>touched as of 10.15.2023</x:t>
  </x:r>
</x:si>

<x:si xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:t>Data as of 5.18.2023</x:t>
</x:si>

<x:si xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:t>emailed to let them know reports were overdue</x:t>
</x:si>"

(Again, all non-syntactic whitespace is my addition.)

Oh, look, style information can be embedded in entries in the shared string table as well as specified in a the document’s style table. I guess that’s… flexible? It just seems kind of chaotic.8

Finally, observe the number of times that XML namespace appears. One small mark in OpenXML’s favor: The actual file doesn’t look like this, this is just how the .NET OpenXML API prints these objects; I guess the authors wanted to make sure that every snippet printed is a valid XML document, or something. Speaking of the OpenXML SDK API:

The OpenXML SDK API is hardly an API at all.

The introductory overview in the documentation begins with

This content set provides documentation and guidance for the strongly-typed classes in the Open XML SDK for Office.

Okay, but each of these classes is essentially just a named generic XML object, and the whole API isn’t anything more than, “iterate through all the descendant elements and collect the ones you want”, which doesn’t seem to really get you much over just using a general-purpose XML-parsing library. For example, let’s say you want a handle for a particular worksheet in a given Excel spreadsheet file (because you want to read data from or write data to it, right?). Here’s how that goes:

let getWorksheetByName (doc: SpreadsheetDocument) (name: string) : Worksheet =
    // The `Workbook` is the "root element of the `WorkbookPart`". Why is
    // there this extra level of abstraction here?
    let wb = doc.WorkbookPart.Workbook
    // Now we iterate through all the descendants of the `Sheets` part of
    // the `Workbook`, considering only those of type `Sheet`.
    let sheet = wb.Sheets.Descendants<Sheet>() |>
                // And we look for the one whose `Name` property matches
                // the one we want. But first we have to check and make sure
                // each name isn't `null`, or we'll throw an exception!
                // Also, this will throw an exception if we don't find a
                // `Sheet` with a matching name.
                Seq.find (fun s -> if s.Name.HasValue
                                   then s.Name.Value = name
                                   else false)
    // Oh, but this `Sheet` that we found is only some metadata; the actual
    // _contents_ of the sheet are found elsewhere by looking up the
    // `WorksheetPart` with the `Sheet`'s ID.
    // (Also, here we have to cast to the `WorksheetPart` type because
    // `GetPartById()` returns an `OpenXmlPartContainer`, of which
    // `WorksheetPart` is only one subtype! Yes, of course this cast can
    // throw an exception.)
    let wsp = doc.WorkbookPart.GetPartById(sheet.Id.Value) :?> WorksheetPart
    // Again, why is there this extra level of abstraction with the
    // `WorksheetPart` wrapping a `Worksheet`?
    wsp.Worksheet

(My implementation was more involved because of exception handling.)

Doesn’t this seem like something an “API” should support directly? doc.GetSheetByName("Gay Space Communism")

And once you have your Worksheet, how do you extract the contents of, say, cell “F5”? ws.getCellByReference("F5"), maybe? No,

let getCellByRef (ws: Worksheet) (ref: string) : Cell =
    ws.Descendants<Cell>() |>
    Seq.find (fun c -> if c.CellReference.HasValue
                       then c.CellReference.Value = ref
                       else false)

you iterate through all the Worksheet’s descendants until you find the Cell that has the reference you want, of course!

But wait, what if you turned the reference into coordinates? “F5” is, what, row 4, column 5 in a zero-based coordinate system? Can’t you—

Nope! Don’t be fooled by the class heirarchy; that Worksheet is just a big bag of tags. You’ll rummage through it until you find what you want, and you’ll like it. And then to get the value from the cell, you need to first check if c.CellValue.HasValue = true and then get the c.CellValue.Value.Text string, and then to interpret that string of text (which is probably just some digits), you need to do all the stuff with the string table and the style table and maybe a table of numeric formats (but also maybe not) I complained about before.

Also, the documentation, for all the browser-column-inches given over to it, is not great guidance. You want access to the spreadsheet’s cells’ data. You can find this in the Cell class, which has a property of type CellValue, which has a four-way overloaded constructor, 20 properties, and 49 methods, out of which you could only ever possibly give a shit about the Text property. The Cell class can be found as a descendant of the Worksheet class, but nowhere in the documentation for Worksheet is this ever mentioned (because, of course, it’s not a direct descendant).9 Your only hope is finding what you need to do by slogging through some of the how-to examples, which are so long and involved that the first couple seem like jokes. Go ahead: Check out the sample code in the documentation for “Get a column heading in a spreadsheet document” and “Insert text into a cell in a spreadsheet document”. Again, these are interfaces you’d think something that calls itself an API would provide.

Microsoft.Office.Interop.Excel is even worse.

The interface is marginally more domain-specific, but using it fires up a headless Excel process to do the work, so you can only write software that targets machines with “Excel for the Desktop” installed. (Which is less common than it used to be because even fucking MS Office tries to be SaaS now.) It’s also more resource-intensive and slower.

There’s no actual “password protection”.

First, these really are just zipped xml files. 7zip will open then right up into a little directory tree.

Excel files can be “locked” and their worksheets can be “hidden” behind passwords. I originally undertook this exercise because at work we receive batches10 of reports as Excel files; these files each have over a dozen separate sheets, including a “summary” sheet hidden behind a password, with (hundreds of) formulae pulling in data from the other (visible) sheets. I sought to automate the “unlocking” and reading of the data from the summary sheets.

I originally got something working using the aforementioned Microsoft.Office.Interop.Excel API, because it seemed like the only option that explicitly had the functionality to unhide password-hidden sheets.11 This was pretty unsatisfying, because I was either going to have to process batches by running something manually on my machine, or else wade into the world of deploying Windows containers, which I am sure is a bloated, overcomplicated hellscape.12

I’m not sure how I stumbled across the OpenXML SDK .NET library (or why I didn’t stumble across it before when I kept running aground on Office.Interop), but after fooling around with it a little bit, I realized that I could just read hidden sheets like any other sheets; no password necessary. Evidently the “password protection” is just a flag somewhere in the file that Excel “respects”13 saying, essentially, “Pwease don’t show dis sheet, it’s secwet!” Nothing is actually “protected” in any meaningful sense of the word.

Endmatter

I’m not really sure if I have any summary remarks here. I’m definitely a Linux bigot, and have always been what I’d assumed was unfairly dismissive of Windows development. Well, now I know. F# was a pleasant surprise, but every other part of this experience was just as bad as I’d always caricatured it to be.


  1. Or Visual Studio, obvs. [return]
  2. I think; I’ve never written any C#, only read about it. It could certainly be Java code, and I’m pretty sure that C#’s inception was just Microsoft throwing a tantrum about Java’s success.

    [return]
  3. I suspect this is done this way because F# can’t (or at least has some trouble) currying functions from other CLR languages, and this whole tuple thing makes it impossible to even try.

    [return]
  4. Honestly, it’s a third class citizen. It’s clear from the .NET docs that C# is the favorite son, VB is the slow little brother that mommy and daddy love and still bring to social functions, even if he can be a little embarrassing. F# is… the other one. They keep it in a cage the basement and a social worker comes and feeds it.

    [return]
  5. And by this I mean VS Codium, which I touch with a 10-foot pole. [return]
  6. The current Debian “Testing” version. [return]
  7. Once you’ve looked at these a little bit and realized it’s been almost 45,000 days since 1900-01-01, you can start to guess which ones are dates.

    [return]
  8. You can probably argue that styles this way can be applied more granularly than styles applied at the cell level (you’ll notice that in the example, only the word not is being styled). To whit, two things: 1. This is a spreadsheet; it is for making calculations; why oh why do we need multiple styles in a single cell? 2. Why even have a shared style table? Or at least why not refer, at the sub-cell styling level, to entries in the style table?

    [return]
  9. Good luck figuring out exactly which of the Worksheet’s child elements you need to drill down into next on your way to the Cells. Also, good luck figuring out how to get there because the names of those children are names from the spec, not their corresponding names from the DocumentFormat.OpenXml.Spreadsheet class hierarchy.

    [return]
  10. Relatively large batches; more than 100 at a time. [return]
  11. I looked at several things, like readxl, excel.link, and calamine.

    [return]
  12. Also, the smallest available image seems to be upwards of a gibibyte in size, and there are some institutional limits on how much computing power I’m allowed to wallow in. (In constrast, I have deployed Alpine-based images that ended up being 7-8 megs.)

    [return]
  13. Not sure I should use the word “respect” about anything transpiring in this paragraph.

    [return]