
It's funny just how much the implementations described in the paper map to how modern search engines implement retrieval. The same is true for BLAST and other search engines.

(it's a very readable paper and I enjoy the frank expression of view, even if I have a vastly different perspective on how to accelerate problems like this)



There's a deep connection between what I do and text retrieval in general.

Take a look at the early work in IR in the 1940s and 1950s, at https://en.wikipedia.org/wiki/Information_retrieval#Timeline

1947, "Hans Peter Luhn (research engineer at IBM since 1941) began work on a mechanized punch card-based system for searching chemical compounds"

1950s, "invention of citation indexing (Eugene Garfield)" - Garfield's earlier work was with chemical structure data sets, and his PhD work was on the linguists of chemical compound names.

1950: "The term "information retrieval" was coined by Calvin Mooers." - that was presented at an American Chemical Society (ACS) meeting that year, and in the 1940s Mooers developed an early version of what is now called a connection table, hand-waved a substructure search algorithm which was implemented a few years later. (I'm a Mooers fanboy!)

Many of the early IR events were at ACS meetings - the concept of an "inverted index" was presented at one, as I recall.

This is because in the 1940s, chemical search was Big Data: more than 1 million records containing many structured search fields, and demand from many organizations for chemical record search.

So many of the core concepts are the same, though in cheminformatics we've switched to a lossy encoding of molecular features as a bitstring fingerprint, since we tend to look more at global similarity than local similarity, and there are a lot of possible features to encode.
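For readers outside the field, here's a toy Python version of the core arithmetic - illustrative only, not how chemfp actually stores or compares fingerprints:

    def tanimoto(fp_a: int, fp_b: int) -> float:
        # Tanimoto similarity = |A & B| / |A | B| over the set bits.
        # int.bit_count() needs Python >= 3.10; bin(x).count("1") works earlier.
        common = (fp_a & fp_b).bit_count()  # popcount of the intersection
        union = (fp_a | fp_b).bit_count()   # popcount of the union
        return common / union if union else 0.0

    # Two toy 8-bit "fingerprints": 3 bits in common, 5 in the union -> 0.6
    print(tanimoto(0b10110010, 0b10010110))

The real work is doing that popcount arithmetic billions of times, quickly.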

Thank you for writing that it was a very readable paper. I have received very little feedback of any sort about the publication, and have been worried that it was too verbose, pedantic, or turgid.


It's a bit verbose, and I really think it's several papers (the technical details of the package are one, the open source positioning is another). But it's readable - a person outside the field (say, a search engineer at Google) could sit down, read this, and immediately recognize what you were trying to achieve ("implement popcnt" used to be a popular interview question), and then suggest ways to get the output results faster by using a cluster :)


Indeed, it is several papers. There are two journals in my field - one I can't read because it's behind a paywall, and one that's expensive to publish in. I chose the latter, but couldn't afford multiple months of rent in order to publish several papers. :(

A blog post I wrote years ago used to be part of the "implement popcnt" literature - http://www.dalkescientific.com/writings/diary/archive/2008/0... . It's now outdated, and actual low-level programmers have done better, but it still gets mentioned in passing in posts like the one referenced on HN last year at https://news.ycombinator.com/item?id=20914479 .
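For anyone who hasn't run into that literature, here's the flavor of it - a textbook 64-bit SWAR popcount, not the code from the old post, and not what you'd use today now that CPUs have a POPCNT instruction:

    def popcount64(x: int) -> int:
        # Classic bit-parallel popcount: sum adjacent bit fields in place.
        x = x - ((x >> 1) & 0x5555555555555555)                         # 2-bit sums
        x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)  # 4-bit sums
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                         # 8-bit sums
        # Multiply adds all eight byte sums into the top byte.
        return ((x * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

    print(popcount64(0xFF))  # 8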


It's really extraordinary how tightly coupled modern innovation in scientific fields is to processor implementations. I suspect you and I share a keen interest in the path by which we got to this enviable situation.


> even if I have a vastly different perspective on how to accelerate problems like this

Do tell more!

After the initial optimization, I hinted to Andrew at an approach that I thought could get a further large speedup. Essentially, the idea was to "rotate" all the stored data 90 degrees, so that instead of counting the features present in each compound, you read lists of compounds that contain a given feature, storing the hits in some very fast custom data structure. He wasn't particularly interested, likely correctly realizing that it would be a lot of work for an uncertain amount of gain. The question wasn't really whether I could achieve a further speedup (although there was a question of how much), but rather (as alluded to in the paper) whether he would be able to increase sales enough to justify the additional development cost and the added complexity of the codebase.
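To make the "rotate 90 degrees" idea concrete, here is a rough Python sketch of that inverted layout - my reading of the idea, with hypothetical names throughout, and nothing like the very fast custom data structure it would actually need:

    from collections import defaultdict

    def build_index(fingerprints):
        # fingerprints: {compound_id: set of on-bit positions}.
        # Invert it into {feature bit: [compound ids containing that bit]}.
        index = defaultdict(list)
        for cid, bits in fingerprints.items():
            for bit in bits:
                index[bit].append(cid)
        return index

    def common_bit_counts(query_bits, index):
        # Walk only the posting lists for the query's on bits,
        # accumulating per-compound counts of shared features.
        counts = defaultdict(int)
        for bit in query_bits:
            for cid in index.get(bit, ()):
                counts[cid] += 1
        return counts

With each compound's total popcount precomputed, the Tanimoto falls out as common / (popcount(A) + popcount(B) - common), so a sparse query only touches the posting lists for its own bits instead of scanning every fingerprint.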


Towards the end of writing the paper, another paper came out, "RISC: rapid inverted-index based search of chemical fingerprints", https://doi.org/10.1021/acs.jcim.9b00069 , which does something along those lines.

It was close enough that I published the pre-print "RISC and dense fingerprints" at https://doi.org/10.26434/chemrxiv.8218517.v1 to examine its claims. I found that their RISC implementation was faster than chemfp at low bit densities (below roughly 5%, which includes the popular 2048-bit ECFP/Morgan fingerprints at smaller radii) and at uncommonly high similarity thresholds.

Otherwise, chemfp was faster.
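A rough back-of-envelope, with my own made-up constants, suggests why density is the deciding factor:

    def work_per_compound(n_bits, density):
        # Brute-force scan: AND + popcount over n_bits/64 words per compound.
        scan = n_bits // 64
        # Inverted index: each query bit touches a posting list holding a
        # `density` fraction of the archive; amortized per compound, the
        # expected posting entries are (n_bits * density) * density.
        postings = n_bits * density ** 2
        return scan, postings

    print(work_per_compound(2048, 0.05))  # (32, ~5.12): inverted index wins
    print(work_per_compound(2048, 0.25))  # (32, 128.0): popcount scan wins

The real constant factors (SIMD popcount throughput, cache behavior, posting-list overhead) shift the crossover, but the quadratic dependence on density is the point.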

So while there's certainly something to investigate there, I think it's better to focus that effort on truly sparse fingerprints and count fingerprints, rather than nominally dense bit fingerprints.

Just needs money and time. ;)

Plus, part of the focus was on making chemfp a really good baseline for these sorts of timing tests.



