I admit this is way over my head; I am still trying to grok it. This seems to require an existing model to start from: I am not sure how one would arrive at a model from scratch (I guess start from the same weights on all items?).
I think the point about A/B testing in production to confirm that a new model is working is really important, but it is just as important to do A/B/Control testing, where Control is either random recommendations (seeded to the context or user) or no recommendations. This helps not only with comparing A vs B, but also with validating that neither A nor B is performing worse than Control. What percentage of traffic goes to Control (1% or 5%) depends on traffic levels, but running a control at all usually takes some convincing.
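To make the "seeded to the context or user" part concrete, here is a minimal sketch of deterministic bucketing (the names and the 5% split are placeholders, not anyone's production setup); hashing the user id keeps the assignment stable across visits without storing any state:

    import hashlib

    def assign_bucket(user_id: str, experiment: str = "recs-v2",
                      control_pct: float = 0.05) -> str:
        """Deterministically assign a user to Control, A, or B.

        Hashing (experiment, user_id) makes the assignment stable across
        visits with no stored state; control_pct carves out the small
        Control slice (e.g. 1-5% of traffic).
        """
        h = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        x = int(h, 16) / 16**64  # roughly uniform float in [0, 1)
        if x < control_pct:
            return "control"  # random or no recommendations
        return "A" if x < control_pct + (1 - control_pct) / 2 else "B"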
I think one important technique is to pre-aggregate your data on a user-centered or item-centered basis. This can make it much more palatable to collect this data on a massive scale without having to store a log for every event.
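Something like this is what I have in mind (the event schema is made up); the aggregates are small enough to keep around long after the raw log is gone:

    from collections import defaultdict

    def aggregate_events(events):
        """Collapse a raw event stream into per-(user, item) counters.

        `events` is an iterable of (user_id, item_id, event_type) tuples;
        the output is compact enough to retain at scale even when the raw
        per-event log is too big to keep.
        """
        agg = defaultdict(lambda: defaultdict(int))
        for user_id, item_id, event_type in events:
            agg[(user_id, item_id)][event_type] += 1
        return agg

    # e.g. agg[("u1", "i9")] -> {"impression": 12, "click": 2, "add_to_cart": 1}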
Contextual bandits are one technique that attempts to deal with confounding factors and the bias introduced by the recommendations themselves. However, I think there is a major challenge in scaling them to large item counts.
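For reference, a toy sketch of the idea, epsilon-greedy with one linear model per item, which also hints at where the scaling pain comes from (a model and an argmax over every single item):

    import random
    import numpy as np

    class EpsilonGreedyBandit:
        """Toy epsilon-greedy contextual bandit: one linear model per item.

        Illustration only; with millions of items, the per-item models and
        the argmax over all of them become the scaling problem.
        """
        def __init__(self, item_ids, dim, epsilon=0.1, lr=0.01):
            self.weights = {i: np.zeros(dim) for i in item_ids}
            self.epsilon = epsilon
            self.lr = lr

        def choose(self, context: np.ndarray) -> str:
            if random.random() < self.epsilon:
                return random.choice(list(self.weights))  # explore
            return max(self.weights, key=lambda i: self.weights[i] @ context)

        def update(self, item_id: str, context: np.ndarray, reward: float):
            # Simple SGD step on squared error between predicted and observed reward.
            pred = self.weights[item_id] @ context
            self.weights[item_id] += self.lr * (reward - pred) * context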
I think the quality of the collected non-click data is also important: did the user actually scroll down to see the recommendations, or were they served but never looked at? Likewise, I think it's important to add depth to the "views" or "clicks" metric: if something was clicked, how long did the user spend viewing or interacting with the item? Did they click and immediately go back, or did they click and look at it for a while? Did they add the item to their cart? Or, if we are talking about articles, did they spend time reading? Item interest can be estimated much more closely than with bare views, clicks, and purchases. Of course, purchases (or more generally conversions) have direct business value, but an add-to-cart, for example, is a proxy for purchase probability and can improve the quality of the data used for training (and thus carry a higher proxy business value).
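A sketch of the kind of graded label I mean, with made-up weights (in practice you would tune or learn them):

    # Assumed weights for illustration only; tune or learn them in practice.
    EVENT_WEIGHTS = {
        "impression_seen": 0.0,  # actually scrolled into view
        "click_bounce":    0.1,  # clicked, immediately went back
        "click_dwell":     0.5,  # clicked and spent time on the item
        "add_to_cart":     0.8,  # strong proxy for purchase intent
        "purchase":        1.0,
    }

    def interest_score(events: list[str]) -> float:
        """Collapse a user's events on one item into a single graded label
        for training, instead of a bare click/no-click flag."""
        return max((EVENT_WEIGHTS.get(e, 0.0) for e in events), default=0.0)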
It’s probably impractical to train on control interactions only (and it is also difficult to keep the same user in the control group between visits).
The SNIPS normalization technique reminds me of the Mutual Information correction used when training co-occurrence (or association) models, where Mutual Information rewards pairs of items that are less likely to co-occur by chance.
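To make the analogy concrete, simplified sketches of both corrections (not production estimators):

    import math
    import numpy as np

    def snips(rewards, new_propensities, logging_propensities):
        """Self-normalized IPS: weight each logged reward by how much more
        (or less) the new policy would have shown that item, then normalize
        by the total weight instead of the sample count."""
        w = np.asarray(new_propensities) / np.asarray(logging_propensities)
        return float(np.sum(w * np.asarray(rewards)) / np.sum(w))

    def pmi(p_ab, p_a, p_b):
        """Pointwise mutual information for a co-occurrence pair: rewards
        pairs that co-occur more often than independence would predict."""
        return math.log(p_ab / (p_a * p_b))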
Re: existing model, for recsys, as long as the product already exists you have some baseline available, even if it's not very good. Anything from "alphabetical order" to "random order" to "most popular" (a reasonable starting point for a lot of cases) is a baseline model.
I agree that a randomized control is extremely valuable, but more as a way to collect unbiased data than as a way to validate that you're outperforming random: it's pretty difficult to do worse than random in most recommendation problems. A more palatable way to introduce some randomness is to show a random item in a specific position with some probability, rather than showing totally random items for a given user/session. This has the advantage of not ruining the experience for an unlucky user who gets a page of things totally unrelated to their interests.
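Roughly something like this (the slot position and epsilon are arbitrary here):

    import random

    def inject_exploration(ranked_items, candidate_pool, slot=3, epsilon=0.05):
        """With probability epsilon, replace the item at `slot` with a random
        candidate not already on the page. The rest of the page stays
        personalized, so no single user ever sees a fully random list."""
        items = list(ranked_items)
        if random.random() < epsilon and slot < len(items):
            pool = [c for c in candidate_pool if c not in items]
            if pool:
                items[slot] = random.choice(pool)
        return items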
Who remembers Model-Driven Architecture and code generation from UML?
Nothing can replace code, because code is design[1]. Low-code came about as a solution to the insane clickfest of no-code. And what is low-code? It’s code over a boilerplate-free, appropriately high level of abstraction.
This reminds me of the first chapter of the Clean Architecture book[2], pages 5 and 6, which shows a chart of engineering staff growing from tens to 1200 while the product's line count (a simple estimate of features) asymptotically stops growing: the lines of code barely increase between 300 staff and 1200 staff.
As companies grow and throw more staff at the problem, software architecture is often neglected, which dramatically slows development due to the massive overhead required to implement each feature.
Some companies decided that the answer is to optimize for hiring lots of junior engineers who write dumbed-down code full of boilerplate (e.g. Go).
The hard part is staying on top of the technical (architectural and design) debt to make sure that feature development is efficient. That is the hard job and the true value of a software architect, not writing design documents.
They were likely running their data on EBS volumes instead of locally attached (instance-store) SSDs, for ease of recovery: a failed instance does not lose the data on its attached EBS volumes. You can only run your DBs on local SSDs if you are prepared to lose a node’s data completely.
In fact, many instance types no longer have any ephemeral storage attached, and using EBS for both root and data volumes is now the default practice.
Some instance types support extremely fast EBS performance with io2 Block Express volumes, which use hardware acceleration and an optimized network protocol for EBS I/O and offer sub-millisecond latency. However, these are expensive and get even more so as you provision more IOPS.
Even if the rollout were atomic across the servers, you will still have old clients with cached old front ends talking to updated back ends. Depending on the importance of the changes in question, you can sometimes accept breakage or force a full UI refresh, but that should be a conscious decision. It’s better to support old clients at the same time as new clients, deprecate the old behavior, and remove it over time. Likewise, if there’s a critical change where you can’t risk new front ends breaking when talking to old back ends (what if you had to roll back?), you can often deploy support for the new changes first and activate the UI changes in a subsequent release or behind a feature flag.
I think it’s better to always ask your devs to be concerned about backwards compatibility (and sometimes forwards compatibility), and to add test suites where possible to catch unexpected incompatible changes.
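A cheap way to do that monitoring is a golden/contract test that feeds a payload captured from the previous release into the current parser. A minimal sketch, with hypothetical field names:

    import json

    # Hypothetical: a payload captured from the previous release, checked in
    # as a fixture so the test fails if new code stops understanding it.
    OLD_PAYLOAD = json.dumps({"id": 42, "name": "widget"})  # no "tags" field yet

    def parse_item(raw: str) -> dict:
        data = json.loads(raw)
        # Backwards compatibility: tolerate payloads written before "tags" existed.
        data.setdefault("tags", [])
        return data

    def test_new_code_reads_old_payload():
        item = parse_item(OLD_PAYLOAD)
        assert item["id"] == 42
        assert item["tags"] == []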
And think about what it’s like for humans as well: spreading a feature over several repos with separate PRs either makes a mockery of the review process (if the PRs have to be merged, repo by repo, before anything can be tested together) or significantly increases the cognitive overhead of reviewing the code.
False. Mongo never pretended to be a SQL database. But some dimwits insisted on using it for transactions, for whatever reason, and so it got transaction support much later in its life, and only for non-sharded clusters in the initial release. People who know what they are doing have been using MongoDB for reliable, horizontally scalable document storage basically since 3.4, with proper complex indexing.
Scylla! Yes, it will store and fetch your simple data very quickly with very good operational characteristics. Not so good for complex querying and indexing.
If you’re doing a restructuring of the company, i.e. mass layoffs, you’re allowed to do it regardless. In some states, under FMLA/PFMLA, a firing within 6 months is automatically presumed to be retaliation, and the onus is on the company to prove it wasn’t; the mass layoff is the cover, and large companies know it.
However, the fact that they cancelled her health insurance a week before she was due to return, and demanded she return on a certain date or be terminated despite a demonstrated disability, is pretty whack and might be hard to defend as company-wide restructuring.
Heavy mock usage comes from dogmatically following the flawed “most tests should be unit tests” prescription of the “testing pyramid,” along with strict adherence to never testing more than one class at a time. This necessitates heavy mocking, which is fragile, terrible to refactor, and produces lots of low-value tests. Sadly, AI these days will generate tons of those unit tests in the hands of people who don’t know better, all of it leading to the same false sense of security and killing development speed.
I get what you are saying, but you can have your cake and eat it too: fast, comprehensive tests that cover most of your codebase. Test through the domain and employ fakes at the boundaries.
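A minimal sketch of what I mean by a fake at the boundary (the order domain and repo interface here are made up):

    from dataclasses import dataclass

    @dataclass
    class Order:
        items: list

    class InMemoryOrderRepo:
        """Fake for the persistence boundary: same interface as the real repo
        (save/get) but backed by a dict, so tests exercise real domain logic
        quickly, without a database and without a pile of mocks."""
        def __init__(self):
            self._orders = {}

        def save(self, order_id, order):
            self._orders[order_id] = order

        def get(self, order_id):
            return self._orders.get(order_id)

    def place_order(repo, order_id, items):
        # Stand-in for real domain logic; in a real codebase this is the code
        # under test, reached through its public entry point.
        repo.save(order_id, Order(items=list(items)))

    def test_placing_an_order():
        repo = InMemoryOrderRepo()
        place_order(repo, order_id="o1", items=["sku-1"])
        assert repo.get("o1").items == ["sku-1"]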