Varun Srinivasan
@v
Some thoughts on spam on Farcaster and how we tackle it. First question: what is spam? The naive answer is "automated activity", but this isn't right. Over 75% of the spam we find comes from real humans who have phones, wallets and X accounts. The best definition is "inauthentic activity". It's that feeling you get when you realize that someone who is following, liking or replying to you is doing it to benefit themselves and not because they're interested in you.
31 replies
117 recasts
415 reactions

Varun Srinivasan
@v
Spam is driven by people who want to get airdrops. How much can you earn if you set up a fake account on Twitter? Probably not a whole lot, and not in directly measurable dollars. If you do the same on Farcaster, you might earn 10 or even 100 dollars in airdrops. Spammers on Farcaster are very, very motivated. We see patterns like LLM spamming before they become commonplace on larger networks like X.
1 reply
2 recasts
129 reactions

Varun Srinivasan
@v
Spam also needs to be classified very, very quickly. If it isn't, a spammer will interact with a lot of users right after signing up, making them unhappy. We often have little more than a profile and a few casts when we need to make a decision. If we get this decision wrong, people get really unhappy: a spammer who isn't labelled will make existing users unhappy, and a new user who is incorrectly labelled will get frustrated and never come back.
2 replies
0 recast
52 reactions

Varun Srinivasan
@v
Our spam model puts accounts into one of four categories:
Level 0 - not enough information to make a decision
Level 1 - an authentic user that other users will like
Level 2 - a slightly inauthentic user that some users won't like
Level 3 - a very inauthentic user that almost all people will dislike
If we're certain that someone is spammy, their account goes into Level 3 and their activity is usually hidden under the "Show more" in conversations. In most cases, it's less clear. An account may be good for a while and suddenly turn spammy when a new airdrop launches. In this case Level 2 might be applied, which does something lighter like disqualifying you from boosts while still letting your replies appear.
Accounts are also re-evaluated by our model very often so that new information can be used to make a more accurate decision. We rank and re-rank roughly 4-5 accounts every minute.
3 replies
1 recast
48 reactions
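As a rough illustration of the four levels described above, here is a minimal Python sketch. The names `SpamLevel`, `Treatment` and `treatment_for` are hypothetical. The thread only describes what happens at Levels 2 and 3, so the handling of Levels 0 and 1, and boost eligibility at Level 3, are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import IntEnum


class SpamLevel(IntEnum):
    UNKNOWN = 0               # not enough information to make a decision
    AUTHENTIC = 1             # a user that other users will like
    SLIGHTLY_INAUTHENTIC = 2  # a user that some users won't like
    VERY_INAUTHENTIC = 3      # a user that almost all people will dislike


@dataclass(frozen=True)
class Treatment:
    replies_behind_show_more: bool
    eligible_for_boosts: bool


def treatment_for(level: SpamLevel) -> Treatment:
    if level == SpamLevel.VERY_INAUTHENTIC:
        # From the thread: activity is usually hidden under "Show more".
        # Assumption: such accounts are also not eligible for boosts.
        return Treatment(replies_behind_show_more=True, eligible_for_boosts=False)
    if level == SpamLevel.SLIGHTLY_INAUTHENTIC:
        # From the thread: no boosts, but replies still appear.
        return Treatment(replies_behind_show_more=False, eligible_for_boosts=False)
    # Assumption: Levels 0 and 1 are unrestricted; the thread does not say.
    return Treatment(replies_behind_show_more=False, eligible_for_boosts=True)


for level in SpamLevel:
    print(level.name, treatment_for(level))
```

Because accounts are re-evaluated often, a lookup like this would be applied to whatever level the model most recently assigned, not to a permanent label.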

Varun Srinivasan
@v
There are three parts to building a spam detection model:
1. Define signals, which can be calculated for each account. Ideally they have some correlation to spammy behavior (e.g. frequency of posting).
2. Label data, either through manual review, user reports or heuristics. The dataset must be large enough for the patterns in it to be statistically significant.
3. Train the model, by letting it process labelled data and figure out which combinations of signals are the best predictors.
@akshaan chose a type of model called a random forest, which is a collection of decision trees. Here's a good lecture on the basics of how a decision tree works: https://www.youtube.com/watch?v=a3ioGSwfVpE&list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS&index=29
1 reply
0 recast
42 reactions
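To make the three steps above concrete, here is a deliberately tiny random forest in plain Python: each tree is a one-split "stump" trained on a bootstrap sample with a random subset of signals, and the forest predicts by majority vote. This is a teaching sketch and not the model Farcaster runs. The three signals (posts per hour, share of duplicate text, variation in seconds between posts), the toy rows and all function names are made up for illustration.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Pick the one threshold split, on an allowed feature, with the fewest errors."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # every allowed feature was constant in this sample
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(rows, labels, n_trees=25, max_features=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]             # bootstrap sample
        feats = rng.sample(range(len(rows[0])), max_features)      # random signal subset
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_stump(sample_rows, sample_labels, feats))
    return forest


def predict(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(
        left if row[f] <= threshold else right
        for f, threshold, left, right in forest
    )
    return votes.most_common(1)[0][0]


# Steps 1 and 2: hypothetical signals per account, with hand-assigned labels.
# Columns: posts per hour, share of duplicate text, stddev of seconds between posts.
rows = [
    [0.3, 0.00, 5400.0], [1.0, 0.05, 3600.0], [0.5, 0.10, 7200.0], [2.0, 0.02, 4100.0],
    [30.0, 0.90, 4.0], [24.0, 0.80, 9.0], [40.0, 0.95, 2.0], [22.0, 0.75, 6.0],
]
labels = ["ok"] * 4 + ["spam"] * 4

# Step 3: train, then score two unseen toy accounts.
forest = train_forest(rows, labels)
print(predict(forest, [35.0, 0.85, 3.0]))
print(predict(forest, [0.8, 0.03, 5000.0]))
```

Real forests grow much deeper trees over far more data. The toy classes here are cleanly separable on every signal, which is exactly the situation a production spam dataset does not offer.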

Varun Srinivasan
@v
Random forests can identify very subtle patterns in data. For example, we once had a spam ring in country X that would fire up all their bots at the same time. Because we fed in country and time of posts as signals, the model quickly learned that accounts that posted frequently at 10pm in that country were spammy. But what's very interesting is that it otherwise ignored country as a predictor. If you posted from that same country but had a more human-like pattern of posting around the clock, it didn't rank you as likely to be a spammer. Forests can get very sophisticated and layer dozens of signals to find such patterns. They can be retrained periodically to adapt as spammers change their behavior.
3 replies
0 recast
34 reactions
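The country-and-time example above is an interaction between two signals, which is the kind of pattern a decision tree picks up naturally. Below is a small sketch of a single greedy tree (not a full forest) using Gini impurity, trained on made-up data. The two signals, `in_country_x` and `share_of_posts_near_10pm`, and every row are hypothetical and chosen only to reproduce the shape of the anecdote.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when all labels agree."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth=2):
    """Greedy tree: returns a label (leaf) or (feature, threshold, left, right)."""
    if depth == 0 or len(set(labels)) == 1:
        return Counter(labels).most_common(1)[0][0]
    best = None  # (weighted impurity, feature, threshold)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    if best is None:  # no signal varies, so there is nothing to split on
        return Counter(labels).most_common(1)[0][0]
    _, f, t = best
    left_idx = [i for i, r in enumerate(rows) if r[f] <= t]
    right_idx = [i for i, r in enumerate(rows) if r[f] > t]
    return (
        f, t,
        build_tree([rows[i] for i in left_idx], [labels[i] for i in left_idx], depth - 1),
        build_tree([rows[i] for i in right_idx], [labels[i] for i in right_idx], depth - 1),
    )


def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Columns: in_country_x (0 or 1), share_of_posts_near_10pm (0.0 to 1.0).
rows = [
    [1, 0.90], [1, 0.95], [1, 0.85],   # country X, bursts at 10pm
    [1, 0.05], [1, 0.10],              # country X, posts around the clock
    [0, 0.05], [0, 0.10], [0, 0.90],   # elsewhere
]
labels = ["spam", "spam", "spam", "ok", "ok", "ok", "ok", "ok"]

tree = build_tree(rows, labels)
print(predict(tree, [1, 0.92]))  # -> spam
print(predict(tree, [1, 0.08]))  # -> ok
print(predict(tree, [0, 0.92]))  # -> ok
```

On this toy data the tree splits on posting time first and consults country only inside the "mostly posts at 10pm" branch, so an account from country X with a spread-out posting pattern is never penalized for its country.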

Varun Srinivasan
@v
It's not always intuitive what the best signals are. When I worked on fraud at Coinbase - which is a similar problem - one of our best signals was screen resolution. It turned out that fraudsters used a virtual machine that had a very odd screen resolution that most normal computers would never have. We've found this to be true in Farcaster data as well. I'm going to be more cagey about what the actual signals are, because revealing them will cause spammers to change their behavior making them harder to detect.
5 replies
0 recast
55 reactions

Varun Srinivasan
@v
Commonly suggested signals like onchain data don't work very well. It turns out that there are a lot of users with little or no blockchain activity who are quite interesting on social networks. The opposite also tends to be true: there are people with ENS names and other onchain activity who are aggressive spammers and airdrop farmers. We recently tested some onchain signals and found a near-zero improvement in predictive power. This may change over time as more activity moves onchain, but as of today these signals are not very useful.
3 replies
2 recasts
43 reactions

Varun Srinivasan
@v
The signals that tend to do very well fall into one of three categories:
1. Graph-based - spammers often share similar patterns of activity which can be used to catch them
2. Behavioral - they also tend to do things a certain way, because they're being repetitive in their actions (e.g. posting at fixed intervals)
3. Textual - the content of their casts is often very predictive of their quality
7 replies
1 recast
47 reactions
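The thread deliberately does not reveal the real signals, so the following is only a guess at what one simple signal from each of the three categories could look like. All three functions and their names are hypothetical examples and should not be read as Farcaster's actual features.

```python
from statistics import mean, pstdev


def interval_regularity(timestamps):
    """Behavioral: coefficient of variation of the gaps between posts.

    Values near 0 mean posts arrive at almost fixed intervals.
    """
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None  # not enough activity to say anything
    return pstdev(gaps) / mean(gaps)


def duplicate_text_share(casts):
    """Textual: fraction of casts whose normalized text repeats an earlier cast."""
    if not casts:
        return 0.0
    seen, repeats = set(), 0
    for text in casts:
        key = " ".join(text.lower().split())  # ignore case and extra whitespace
        repeats += key in seen
        seen.add(key)
    return repeats / len(casts)


def follow_overlap(following_a, following_b):
    """Graph-based: Jaccard similarity of two accounts' follow sets."""
    a, b = set(following_a), set(following_b)
    return len(a & b) / len(a | b) if a | b else 0.0


print(interval_regularity([0, 600, 1200, 1800]))                  # -> 0.0
print(duplicate_text_share(["gm", "GM ", "great project!", "gm"]))  # -> 0.5
print(follow_overlap({1, 2, 3}, {2, 3, 4}))                        # -> 0.5
```

Each function returns a single number per account (or per pair of accounts), which is the form a tree-based model needs as input. Whether any of them is predictive would have to be tested against labelled data.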

Varun Srinivasan
@v
If you have any more questions about how spam works, please ask and I'll try to reply tomorrow (because it's getting late here).
9 replies
0 recast
25 reactions

YES2Crypto 🎩 🟪🟡
@yes2crypto.eth
Has much thought gone into a "vouch" type of system to put weights or add'l dimensions on the social graph? i.e., I click on someone else's user profile and have some checkboxes:
- [ ] Met IRL
- [ ] Spammy
- [ ] Based
or whatever makes sense. The higher one's reputation (if tracked), the more weight some of these carry.
1 reply
0 recast
2 reactions

Gökhan Turhan
@gokhan.eth
a blogpost that collates this thread would be beneficial esp. for teams trying to adapt antispam measures.
1 reply
0 recast
1 reaction

Nabil Abdellaoui
@randombishop
Thanks for providing all these details 🙏 I am curious about your training set: how do you label the "ground truth" into one of the 4 levels?
1 reply
0 recast
0 reaction

tim/vortac
@vortac
Just wanted to thank you for the broad explanations. Really interesting read and topic, learned a lot 🙏
0 reply
0 recast
1 reaction

aferg
@aaronrferguson.eth
Varun thank you for this super thread! :) Bookmarked so I can reference it occasionally!
0 reply
0 recast
1 reaction

slobo
@slobo.eth
would you say that random forests are akin to velocity models? when i used to work in fraud detection they were quite effective. we used 10 or so factors, like the percentage of users with gmail that come through now vs. an hour ago, a day ago, 7 days ago, 30 days ago. that was pretty effective in detecting new fraud rings
0 reply
0 recast
0 reaction

VIPER ツ
@vipernft
Too much ambiguity currently exists. Coming from a user perspective: I have no clue what my rating is nor what would need to change for the bot to change its perception. Coming from a people manager perspective: how will you inform people of what they have been rated and what specific actions need to occur for it to change to the desired rating?
1 reply
0 recast
6 reactions

Jahkay
@jahkay
Great thread and very informative. How often are accounts reevaluated?
0 reply
0 recast
0 reaction

AlPlanet Ⓜ️ 🎩 🎭
@alplanet
I appreciate you sharing this. 🎩 200 $degen I am interacting less and am fearful of being labeled falsely as a bot. I would fail your tests except for maybe the, not foolproof, human part (label data). I am a human but do "inauthentic activity" as that's required to participate in many things: "Quote cast to enter", "Cast this preformatted cast to enter". I am a human that "is doing it to benefit themselves and not because they're interested in you". I find this statement naive. Be skeptical of human users, they are here for themselves of course; my existence can't be entirely selfless. (I have to tip you to keep my degen allowance.) Thank you for your efforts. Please consider an undoxed human tag with a weight of certainty attached as a means to filter my feed.
0 reply
0 recast
0 reaction