Failure in a Binary Search

Recently, a program I wrote to process a bit of data stopped working with an error.

The error message said that, in one column, the program had found an integer where it expected a string.

When looking for data errors in a dataset too large to read manually, I use a binary search. This is an iterative process: I split the data in half and test both halves. When I find the half containing the problem data, I split it in half again, and keep going until I have a subset small enough to inspect by hand.
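Roughly, the search looks like this in code. This is just a sketch of the approach, not the exact script I used: load_and_check is a stand-in for whatever load step is failing, and the data is passed around as raw CSV lines.

```python
import polars as pl


def load_and_check(header: str, rows: list[str]) -> bool:
    """Return True if this subset of rows loads cleanly."""
    try:
        pl.read_csv(("\n".join([header] + rows)).encode())
        return True
    except Exception:
        return False


def narrow_down(header: str, rows: list[str]) -> list[str]:
    """Binary-search for the smallest chunk of rows that still fails."""
    while len(rows) > 2:
        mid = len(rows) // 2
        first, second = rows[:mid], rows[mid:]
        if not load_and_check(header, first):
            rows = first
        elif not load_and_check(header, second):
            rows = second
        else:
            # Both halves pass on their own -- the failure only shows up
            # when they are combined, which is exactly what happened here.
            break
    return rows
```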

What was interesting about this data was that both halves passed the tests individually – only together did I get an error.

By default, the library I use for working with data (polars, a data science library) infers the structure of the data as it imports it into a dataframe: it decides what each column is “supposed” to contain by looking at the first 100 rows of the table. The data I was working with was, at that time, 182 rows.

I was able to load the whole dataset once I increased the inference window to 182 rows – then the inference “sees” the column that holds both strings and integers, and chooses a data type that fits both. Split into two 91-row halves, either half loads on its own, because each half falls entirely within the default 100-row window – but the full table does not, unless the inference window is increased.
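In polars terms, the fix is a one-argument change. A minimal sketch – the file name here is made up, but infer_schema_length is the actual parameter:

```python
import polars as pl

# Default: polars only inspects the first 100 rows when inferring each
# column's type, so a column that changes character further down the file
# can make the load fail.
# df = pl.read_csv("data.csv")

# Widen the inference window to cover the whole file, so the mixed
# string/integer column is seen and given a type that fits both.
df = pl.read_csv("data.csv", infer_schema_length=182)

# Or scan every row, at the cost of a slower load.
df = pl.read_csv("data.csv", infer_schema_length=None)
```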

It was a good example of when you need to dig into the defaults, which I would call “imposed assumptions”: assumptions you didn’t make yourself, but which are silently made on your behalf.

USB Hard Drive Recovery

I’ve got a USB hard drive that is permanently plugged into my desktop, but recently it stopped working. It is formatted from the factory with NTFS, which is fine-ish, in that it does work under Linux without any intervention, but if it stops working, you have to know what to do. I didn’t.

When it stopped responding, I started poking around the Internet to see if I could find a fix. What I eventually found was that I needed ntfsfix. You point it at the drive's device node in /dev and it magically repairs whatever has gone wrong.

I am hoping that this will get me to a good place, but I think my future is going to include a Network Attached Storage device.

Google Cloud Storage and WordPress

I am trying out a system that puts my media files onto Google Cloud Storage, which is much cheaper than using increasingly huge VPS instances.

I am trying WP-Stateless, and I will post a photo as a test.

Well, it works – you can absolutely do this. However, it only works if you have a public-facing bucket. The flow I was hoping for was for a password-protected site, and that doesn't work when the media sit in a public bucket. I also explored a FUSE-based setup, using SSH as a virtual link between my filesystem and a non-public bucket.

That works OK in a small test, but once you add 40 GB of data the FUSE link just stops coping – it can't manage the throughput for even an ls, let alone work as a high-latency, local-ish filesystem. Oh well.