Failure in a Binary Search

Recently, a program I wrote to process a bit of data stopped working with an error.

The error message indicated that, in one column, the program had found an integer where it expected a string.

When looking for data errors in a dataset larger than I can read manually, I use a binary search. This is an iterative process: I split the data in half and test both halves. When I find the half containing the problem data, I split that half again, and repeat until I have a subset of the data small enough to study manually.
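The post describes this only in prose; here is a minimal sketch of the process in Python, where loads_ok() is a hypothetical stand-in for whatever check decides whether a chunk of the data loads cleanly.

```python
def loads_ok(rows) -> bool:
    """Hypothetical test: try to load/process this chunk of rows and report success."""
    ...


def find_bad_rows(rows, small_enough=10):
    """Narrow a failing dataset down to a chunk small enough to read by hand."""
    if len(rows) <= small_enough:
        return rows
    mid = len(rows) // 2
    first, second = rows[:mid], rows[mid:]
    if not loads_ok(first):
        return find_bad_rows(first, small_enough)
    if not loads_ok(second):
        return find_bad_rows(second, small_enough)
    # Both halves pass on their own: the failure only shows up when the
    # halves are combined, which is exactly what happened here.
    return rows
```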

What was interesting about this data was that both halves passed the tests individually – only together did I get an error.

By default, the library I use for working with data (polars, a data science library) infers the structure of the data as it imports it into a dataframe. The default is to infer what a column is “supposed” to contain from the first 100 rows of a table. The data I was working with was, at that time, 182 rows.
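This is not the post's actual loading code, just a sketch of how that kind of failure can show up, assuming (hypothetically) that the rows arrive as Python dicts and go through pl.from_dicts; the column names and values are made up.

```python
import polars as pl

# 182 made-up rows where "code" holds strings, except for one stray integer
# that only appears after row 100.
rows = [{"id": i, "code": f"A{i}"} for i in range(182)]
rows[150]["code"] = 150

try:
    # infer_schema_length defaults to 100, so polars decides the dtype of
    # "code" from the first 100 rows alone and never sees the integer.
    pl.from_dicts(rows)
except Exception as exc:  # exact exception type and message vary by polars version
    print(f"load failed: {exc}")
```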

I was able to load the whole dataset once I increased the inference length to 182 rows – then the inference “finds” the column containing both strings and integers, and chooses an appropriate data type for it. With only 91 rows in each half of the data, the whole (half-)table falls within the default inference window, so either half will load on its own – but not both together, unless the inference length is increased.
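Continuing the sketch above, the fix amounts to widening the inference window so it covers every row (as I understand it, infer_schema_length=None would scan everything, too); inference then sees both kinds of value and settles on a single dtype that can hold them.

```python
# Widen the inference window to cover all 182 rows, mirroring the fix above.
df = pl.from_dicts(rows, infer_schema_length=len(rows))
print(df.schema)  # the mixed column now gets a dtype that can represent both kinds of value
```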

It was a good example of when you need to dig into the defaults, which I would call “imposed assumptions”: assumptions that you didn’t make yourself, but that were silently made on your behalf.