I like the idea. I don’t like the specific implementation.
He was overengineering the problem, wanted a magic solution, and predictably he didn’t find one.
OK this is getting tedious and I’ll never get it all. My next attempt was to switch to fuzzier string matching via Levenshtein distance. We can compare how closely the address strings match. And if they’re lexicographically close, we can assume they match.
No, you can’t assume. Hell breaks loose when you pretend that you know what you don’t = when you assume.
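To make that concrete, here’s a minimal sketch (the addresses are invented for illustration) showing that edit distance can be *smaller* between two different addresses than between two spellings of the same one:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# Different houses, distance 1:
print(levenshtein("123 Main St", "128 Main St"))      # 1
# Same house, distance 4:
print(levenshtein("123 Main Street", "123 Main St"))  # 4
```

Whatever threshold you pick, one of those two pairs gets misclassified — closeness of spelling just isn’t the same thing as sameness of address.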
Whoa! I simply told the LLM the logic I wanted. It took fewer than 5 minutes to write and got 90%+ accuracy against our test suite on the first try!
90% accuracy can be great or awful depending on your goals, but at no point does he mention the scale of the problem, or how bad false positives/negatives would be.
I replaced 50 lines of code with a single LLM prompt
That’s fucking dumb. Use both.
Here’s what I think would be a better approach, if accuracy is a concern.
Conceptually (inside your head!), split all pairs of addresses into four categories:
dunno - your program didn’t test them yet.
same - your program tested them and determined them to be the same address.
different - your program tested them and determined them to be different addresses.
shit - your program tested them and gave up.
All pairs start in the “dunno” category. The job of the program is to accurately move as many of them as possible to the categories “same” and “different”, and as few of them as possible to the “shit” category.
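In code (Python here, purely as a sketch), those four categories could be a simple enum:

```python
from enum import Enum, auto

class Category(Enum):
    DUNNO = auto()      # your program didn't test the pair yet
    SAME = auto()       # tested: same address
    DIFFERENT = auto()  # tested: different addresses
    SHIT = auto()       # tested: the program gave up
```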
Based on that, here’s what I would do.
1. [Sanitisation starts] Unless the program ignores case, convert everything to lowercase.
2. Replace common abbreviations with the respective words, or vice versa. Common ones only; do not play catch’em all. Stick to St, Ter, Cir, Way, Pl, Blvd. If any of those strings is followed by a dot, remove it; and if it is not followed by a comma, add it. (Yes, you’ll need something a bit more complex for Saint vs. Street, but that’s fine.)
3. Check if the address is properly formatted. It should contain: a number, one or more words, a comma, one or more words, a comma, one 2-letter word, and a number. If it is, go to step 5; if it is not, go to step 4.
4. Apply some low-hanging approaches to safely fix the address. But don’t go overboard; if you still can’t fix it, move any pair that includes that address to the “shit” category. [Sanitisation ends]
5. If the string between the first and second commas (the city) is different, or if the string before the first comma starts with a different number, then move the pair to the “different” category.
6. Else, if the whole strings are identical, then move the pair to the “same” category.
7. Else, move the pair to the “shit” category.
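The steps above can be sketched roughly like this — the abbreviation map and format regex are deliberately simplistic stand-ins, and the “fix the address” step is left as a stub:

```python
import re

# A deliberately short abbreviation map -- common ones only, no catch'em all.
ABBREVIATIONS = {"st": "street", "ter": "terrace", "cir": "circle",
                 "pl": "place", "blvd": "boulevard"}

# Expected shape: number, word(s), comma, word(s), comma, 2-letter word, number.
FORMAT = re.compile(r"^\d+ [a-z ]+, [a-z ]+, [a-z]{2} \d+$")

def expand(word):
    # Drop a trailing dot, expand the abbreviation, keep a trailing comma.
    core = word.rstrip(".,")
    suffix = "," if word.endswith(",") else ""
    return ABBREVIATIONS.get(core, core) + suffix

def sanitise(addr):
    """Sanitisation. Returns the cleaned address, or None if it can't be fixed."""
    addr = " ".join(expand(w) for w in addr.lower().split())
    if FORMAT.match(addr):
        return addr
    # Low-hanging fix-ups would go here; this sketch just gives up.
    return None

def compare(a, b):
    """Classify a pair of addresses as 'same', 'different', or 'shit'."""
    a, b = sanitise(a), sanitise(b)
    if a is None or b is None:
        return "shit"
    # Different city, or different house number -> different addresses.
    if a.split(",")[1] != b.split(",")[1] or a.split()[0] != b.split()[0]:
        return "different"
    if a == b:
        return "same"
    return "shit"
```

Note what this buys you: every verdict of “same” or “different” is explainable and cheap, and the only pairs left over are the ones that genuinely need something smarter.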
Now run the program with a sizeable number of pairs of addresses, and check how many of them ended up in the “shit” category. Then use your judgment:
Is there some underlying pattern among a lot of those pairs in the “shit” category? If yes, can I easily fix step #4 to address them?
Based on the scale of my project, is it fine to manually review those pairs?
Now let’s say that you already fixed what you could reasonably fix, and manual review is out of the question. That’s when you plug in the chatbot.
Why am I suggesting that? Because the chatbot will sometimes output garbage, even for pairs that a simple routine would be able to accurately tell “they’re the same” or “they’re different”. So by using both, you’re increasing the accuracy of the whole testing routine. “90%+” might look like “wow such good very accuracy”, but it’s still one error in every 10 pairs, and that’s a fucking lot.
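A minimal sketch of that hybrid, where `compare` is the cheap rule-based check from before and `ask_llm` is a hypothetical stand-in for whichever chatbot call you’d actually use:

```python
def classify(pairs, compare, ask_llm):
    """Deterministic routine first; only undecided pairs reach the chatbot.

    `compare` returns 'same', 'different', or 'shit';
    `ask_llm` is a hypothetical wrapper around the chatbot call.
    """
    results = {}
    for a, b in pairs:
        verdict = compare(a, b)
        if verdict == "shit":      # the routine gave up: fall back to the LLM
            verdict = ask_llm(a, b)
        results[(a, b)] = verdict
    return results

# Toy demo with stubbed-out checks:
rules = lambda a, b: "same" if a == b else "shit"
llm = lambda a, b: "different"  # pretend the chatbot answered
out = classify([("x", "x"), ("x", "y")], rules, llm)
```

Only the pair the rules couldn’t decide ever reaches the LLM, so the chatbot’s error rate only applies to the leftover fraction, not to every pair.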
And that better exemplifies how you’re supposed to use LLMs (or text generators in general). You should see them as yet another tool at your disposal, not as a replacement for your current tools.