My #1 match was an account[1] banned for "posting unsubstantive comments and repeatedly breaking the guidelines". Now, I may be biased, but I don't think that's accurate :v
I did the search on myself, trying to find kindred souls who share my passions: Gödel's theorem, viewing our current carbon-based civilization from the perspective of a silicon-based one (or of aliens), functional programming... But none of them are even close. This #1 match has some views I'm totally unfamiliar with, but I have an opportunity (which I appreciate) to understand other views.
In my opinion, this service doesn't have a good S/N ratio. It can give you irrelevant information.
I have to agree. Nothing stood out as being any way similar. It's hard to tell what their measure of similarity is here. This might be a case of let's just throw the data in, and see what comes out.
It caught users who share my style of dumping an ‘in-line rhetorical saying’ into their posts. It’s not terrible; honestly, you have to laugh at how predictable you are.
That is doubtful. We don't shadowban established accounts: we tell them that we're banning them, and why: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.... Shadowbanning on HN, at least for the last 7 years, has been reserved for spammers and serial trolls. It's possible that we made a mistake and neglected to tell you, but it's far more likely that we did tell you.
We don't ban people for criticizing PG, as anyone can easily see for themselves by using HN Search or looking at any recent thread from paulgraham.com.
If you're going to make a claim about why you think you were banned, you should provide a link so readers can make up their own minds. When it comes to "I was banned" stories, people say all kinds of things, most of which don't hold up against the actual record.
You banned me two days after I commented that YC can be as exploitative as, or more exploitative than, China's deals with African countries, and that the parent commenter, defending YC while condemning China, was acting out of racism and ethnocentrism. You can see how he brought a bunch of low-quality links, I refuted them, and you came along two days later (I only realized that now) and banned me.
Oh well, I suppose you will ban me again. BB cannot be criticized.
That doesn't link to a banned account. I assume you mean this one: https://news.ycombinator.com/item?id=26659584. We told you we were banning you in that very thread: https://news.ycombinator.com/item?id=26678745. Comments like "I must conclude you are a dumb person letting his latent racism to take over or you are aware you are acting on bad faith and you just dont care" are obviously against the rules here and have nothing to do with PG, YC, China, or any particular topic.
Moreover, we'd warned you and asked you many times to follow the site guidelines before that:
If you break the rules that often and ignore that many warnings, it's not surprising that you'd end up getting banned. This was not a shadowban and not because you criticized some particular person.
All this is a pity because you posted many interesting comments in the past and we would much rather have you as a contributing user. The sad truth, though, is that the harm you cause by breaking the site guidelines exceeds the good you contribute with the interesting comments—so I don't think we made the wrong call.
As Cardinal Richelieu apocryphally said, "If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him."
I don't doubt you can present a similar set of "warnings" for any undesirable who refuses to toe the corporate line you are ordered and paid to maintain. All those rules are ambiguous, opaque, arbitrary, and subjectively interpreted and enforced, but you know that.
The most ironic thing is that the measure you take against people who refuse to follow your faux-polite tone is censorship and virtual obliteration, something 1000 times worse. Most of the people I got personal with (in argument) here were being racist/classist/homophobic/ethnocentric, but since they can shield themselves behind rhetorical what-ifs and sing paeans to the geniuses of YC, they have carte blanche.
Give me the dysfunctional governments we all have, a million times over, before techno-fascists like you and your bosses, censoring anyone who does not suck up to them.
Not to contradict the Cardinal but your accounts have broken the site guidelines a lot more than the median commenter, and we really don't care about your views. Plenty of other commenters express similar views without getting banned. Actually we really, really, don't care. We're just trying to have an internet forum that doesn't suck.
It isn't about politeness, btw (let alone "faux" politeness) – you won't find that word, or that concept, in the site guidelines. It's about treating other people respectfully, and abstaining from garden-variety internet dreck. Let's not noble up the latter with self-flattering rhetoric.
I have 2 accounts at over 1k karma. I generally start a new one when I make major moves (across multiple state lines or between countries).
My accounts did not correlate, probably because they have been inactive at staggered intervals.
> We took usernames and respective comment histories from the past three years
However, putting the names in the Doppelganger search yielded very similar results and the comments of the users are from like-minded people. Well done.
> A reminder that BigQuery (as used in the query in this link) is the best way to play with Hacker News data; don't scrape HN data manually!
The `bigquery-public-data.hacker_news.full` table appears to be up to date with the most recent HN data as well (table last updated today).
However, I'm not 100% sure the query is correct for getting all links, as running the query on the full dataset returns the same results as running it on 2006-2015 only. And I value my sanity too much to fuss around with the regex.
In fact, considering that unknown third parties freely gather such similarity scores and correlate accounts across different sites, by now it's a given that one's alt accounts have to adhere to different stylistic choices.
I think the weakness of this technique is in the normalization of the vectors. The close match comments don't look like mine because the content of my comments has to be massively compressed. The close matches appear to have been massively expanded.
Or to put it another way [1], cosine similarity is not enough here. Magnitude also matters.
This is probably a case where traditional information retrieval methods should play some role. The data are not really big enough that pure cosine similarity is warranted. [3]
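The cosine-vs-magnitude point is easy to demonstrate. A minimal, self-contained sketch (toy vectors, not the actual embeddings): scaling a vector leaves its cosine similarity untouched, while Euclidean distance still sees the size difference.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes: scale-invariant.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Straight-line distance: sensitive to magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

terse = [1.0, 2.0, 3.0]       # a hypothetical "compressed" comment vector
verbose = [10.0, 20.0, 30.0]  # same direction, massively "expanded"

# Cosine similarity is blind to scale: these are a perfect match.
print(cosine_similarity(terse, verbose))   # 1.0
# Euclidean distance sees the magnitude gap that cosine ignores.
print(euclidean_distance(terse, verbose))
```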
[1]: a phrase that my actual Doppelganger must use. [2]
[2]: and also endnotes like these.
[3]: performative erudition is what is absent from all my matches.
If you get me and have always wondered why you never quite fit in, ask your GP about ADHD.
While folks are saying the people they got matched with share their comment style, keep in mind that, given the nature of the tool, they're looking for that. Also look for comments and opinions where you're not similar, to try to disprove the match.
I once made a Markov-chain IRC bot that people would still be convinced was smart today, because people discard the lines that make no sense when they're only looking to prove rather than to disprove.
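For anyone curious, the core of such a bot fits in a few lines. A minimal sketch (the corpus and function names are invented for illustration): build a bigram table, then random-walk it. Each adjacent word pair is plausible; the whole line usually isn't.

```python
import random

def build_chain(text):
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain = {}
    for current, nxt in zip(words, words[1:]):
        chain.setdefault(current, []).append(nxt)
    return chain

def babble(chain, start, length=10, seed=None):
    """Random-walk the chain: locally plausible, globally meaningless."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = chain.get(out[-1])
        if not followers:
            break
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the bot sounds smart because the bot repeats what the channel says"
chain = build_chain(corpus)
print(babble(chain, "the", seed=42))
```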
I identified three usernames in this table right away! tosh, todsacerdoti, and pjmlp. In fact, I like the stories posted by tosh and todsacerdoti quite often and I like the comments posted by pjmlp very often.
I'm probably late to the party, but is there a reader app people use for following specific people? Or are you just referring to whom you've favorited / following and have a good memory for names?
I'd be more interested to see a list of users who have commented the most on the same articles as me. Seems like a better way to measure interests, even if it's indirect. Of course, this wouldn't distinguish between doppelgangers and evil twins >:)
Some of my favorite HN threads remind me of bubble tracks in old school particle colliders. Just stacks of tangents that are hilariously off-topic but somehow, at times, still interesting and even informative.
One idea is to rank users more highly when their comment is closer to mine in the comment tree. Also to weight users more highly when we both comment on an article that has relatively few comments.
Yep, my doppels all seem to have participated in topics and threads that I ignored. We seem to have never interacted with each other, and I don't seem to have voted any of their comments.
If you encounter the error that "This user does not exist or does not have any activity.", check the case. They seem to be case-sensitive on this page.
(I checked with correct and incorrect spellings of
"dang", "pg", "TeMPOraL" and "_Microft".)
Hah, I’ve thought the same thing. I think I’d enjoy working with myself quite a bit. As a housemate, though, I doubt we’d cross boundaries often. Very thankful to be married to someone who truly complements me.
If you recognize your doppelganger, it's probably because you have an interest in the same topics. Your opinions and your posting style may be different—or even opposite—but from a certain perspective, you have more in common with each other than most.
I'm skeptical that this tool does a good job of identifying semantic meaning of a comment, but I bet it gets the topic right.
This is a nice case study on why it's good to predict distributions instead of raw values. Low-comment accounts should have a high degree of uncertainty, which should translate into weaker similarity scores if you compute the expected similarity of two accounts.
I’d love to see stats on how well this matches accounts to themselves if you split an account’s comments into two pseudo-accounts and tried to match them.
Well, the first comment I saw from my top doppelganger match began "I have been one of the toxic persons in a workplace. I was young and immature and an ungrateful arrogant prick..."
Reviewing additional posts, I don't think we're any more alike than a random pick.
The first account I found was banned, but the 2nd match says a lot of the same things I do. I would send them in my place in a debate if I could not make it.
I don't comment much; my next-to-last comment was six words, including the word "underpaid", and now my "Doppelgänger" matches are all accounts whose last comment contained "underpaid"...
It seems I have some overlap with qsort. Given the posts I have seen from that account, I take that as a compliment.
I do wonder, though, if the model is smart enough to correct for when you quote other people; otherwise it might be measuring who you interact with most.
BTW, I'm not sure of the privacy implications of this, maybe someone else can comment on this.
Is this in any way distinguishable from just picking random accounts? I can't discern any similarity between the supposedly similar accounts.
I'm pleased to see that while most people have doppelgangers with similarity above 0.990, my most similar doppelganger has a similarity of 0.975. I'm a unique individual!
> It compares the semantic meaning of your comment history with those of all other users, and finds the top ten users whose comment histories are most similar to yours.
So maybe it's comparing comments from the entire history of the account and not just the recent ones, which makes it hard for me to compare? Would it be possible for you to tweak it so that it only compares, let's say, the most recent 30 or 50 comments?
It is. I just picked that because it was an already provided option and I recognized that name. The few pages of comments I compared between the accounts didn't show much similarity, but OP also replied that it compares 3 years' worth of comments, so it's hard for me to judge.
Yeah, 3 years is a long period. I think by default you should compare only the past 30-60 comments, with options for comparing the past year and the past 3 years. If that works well, you could actually build a "dating app for the intellectually curious". Might be easier to do using Reddit data too.
The number of comments about karma diverging between people's matches suggests you could add a simple ensemble approach, with a karma heuristic on top of the vector similarity search, to help results.
If the data used isn't already reflecting Karma, it could be a useful metric for representing a whole bunch of things (quality of comments, participation, length of time), and make the vector similarity more meaningful.
It would be interesting to see if users perceive the results as higher quality if you added a simple Karma similarity filter (maybe 400 points or something based on the standard deviation of the average karma score), and then returned the closest matches filtered by that metric.
Vector similarity search as a service looks like a good market. Out of interest, what would the cost translate to for running something like this in practice using the API as a customer?
> Out of interest, what would the cost translate to for running something like this in practice using the API as a customer?
We offer usage-based billing at $0.10 per GB of memory hour. (https://www.pinecone.io/pricing/) For this app, our eng team knows for sure but I think the entire index is less than 1GB so it would be just $73/month if we keep it running 24/7.
Vector similarity search is new for most companies, so we want to make it very easy to try and test stuff out in production, without cost being a barrier. Even for larger volumes (40GB+) we offer volume and pre-commitment discounts.
PS: Possibly you could add a floor (say, less than 20 karma, to exclude shill or throwaway accounts), or ignore karma differences over a threshold (say 1000). I think applying simple filters based on common sense or domain knowledge can help vector similarity searches with sparse data or pollution. Just a random thought :)
The first two people in the list did not look like me much. But the third one (NeedMoreTea) was an interesting hit, commenting in a similar fashion and exploring similar topics, not necessarily from the same perspective. I am now immersed in his comment history.
Also, funnily, I really like tea and I drink ~ half a gallon a day.
My matches seemed pretty accurate in terms of general phrasing (I do a lot of 'I grew up in...' and 'I once knew someone who...'), but the subject matter was a bit concerning at times. Does the average account have so many judgmental downvote-heavy comments?
Many years ago, I built something similar for Drupal and its votingapi module. Just checked it: I was using Pearson's correlation coefficient between votes. It worked fast and was surprisingly accurate. You need access to voting history for that, of course.
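For reference, Pearson's correlation between two users' vote vectors needs nothing beyond the standard library. A minimal sketch with hypothetical vote data (the usernames and votes are made up):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vote vectors."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical votes (+1 / -1 / 0) three users cast on the same five items.
alice = [1, 1, -1, 0, 1]
bob   = [1, 1, -1, 1, 1]
carol = [-1, -1, 1, 0, -1]

print(pearson(alice, bob))    # high positive: similar tastes
print(pearson(alice, carol))  # -1.0: perfectly opposed voters
```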
Intriguing idea, but for me the criteria used to root out doppelgängers doesn’t lead to interesting results. My HN soul mates do not write like me, do not write about the same things, and do things I would never do, such as refer to Wikipedia articles.
Same. I don't really recognize anything in the top four for me.
Edit: I lied, top match 'asark' tends to share in my proclivity for the run-on sentence. Topically we're not really into the same things, so it has to be something along those lines.
I think currently it is a bit biased towards users with low comment count. Top similar user for me had 4 comments, top 2 only 1 comment. Then top 5 and top 6 again had 1 comment each.
Maybe the similarity score can be weighted by the number of comparisons somehow?
I wonder if there are simply more users with low comment numbers. But yes we considered only including users above some karma score, and still might do that in the future.
This depends on how the similarity is measured. If it is measured as % of agreement between comment texts then it's more likely to have better agreement with someone who has fewer comments rather than more.
For example I suspect that if we generated 1000 random users with random gibberish comments and varied their comment numbers from 1 to 10 or so, the top similarities would be biased towards low comment-count random users. This would be because having one randomly generated comment match your style is easier than having 2 randomly generated comments match the same style.
And if that's the case then the same issue would transfer to comparing real users.
@busymom0 suggested a great solution in this thread - only do comparisons based on "n" (like last 50) comments. This way every similarity would be measured using the same number of comments and users with low comment counts would be excluded automatically.
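The intuition in this subthread can be checked with a tiny seeded simulation under the "% of agreement" model described above (all numbers are made up): if each comment independently "matches" your style with some fixed probability, single-comment accounts routinely hit a perfect score by sheer luck, while ten-comment accounts almost never do, so the top of the leaderboard fills up with low-comment accounts.

```python
import random

rng = random.Random(1)
P_MATCH = 0.3  # chance a random comment "agrees" with your style by accident

def agreement_score(n_comments):
    """Fraction of a user's comments that happen to match, by pure chance."""
    return sum(rng.random() < P_MATCH for _ in range(n_comments)) / n_comments

# 500 random users each of 1-comment and 10-comment sizes.
one_comment = [agreement_score(1) for _ in range(500)]
ten_comment = [agreement_score(10) for _ in range(500)]

print(max(one_comment))  # 1.0: some single comment matched by pure luck
print(max(ten_comment))  # almost certainly well below 1.0
```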
I think the algorithm might have trouble with negation, questions, irony, jokes, etc. My top matches seemed inclined to comment on similar topics but with different opinions. Granted, my matches topped out at 0.972, so maybe I'm an outlier.
Yeah. I wonder why more people here aren’t mentioning that this could be used to help unmask throwaway accounts for people who usually post here with a different account.
A major reason for my first match seems to have been both accounts talking about licensing issues and specifically GPL. Other matches seem to have shared some superficially similar political perspectives on HN.
By the way, on a related question, I have an interest in being able to download all the things I've written on HN, but am not clever enough to hack together some tool to wget and parse my own user history.
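The official HN Firebase API makes this straightforward: a user's endpoint lists their submitted item ids (newest first), and each item endpoint returns its type and text. A minimal sketch with no third-party packages (`fetch_comments` and the other helper names are mine, not part of the API):

```python
import json
import urllib.request

API = "https://hacker-news.firebaseio.com/v0"

def user_url(username):
    # The user endpoint returns profile data, including a "submitted" id list.
    return f"{API}/user/{username}.json"

def item_url(item_id):
    # Each item (story, comment, ...) has its own endpoint.
    return f"{API}/item/{item_id}.json"

def fetch_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def fetch_comments(username, limit=30):
    """Yield the HTML text of the user's recent comments (hits the network)."""
    user = fetch_json(user_url(username))
    for item_id in user.get("submitted", [])[:limit]:
        item = fetch_json(item_url(item_id))
        if item and item.get("type") == "comment" and not item.get("deleted"):
            yield item.get("text", "")
```

Something like `list(fetch_comments("yourusername"))` would give you a local archive; note the `submitted` list includes stories too, which is why the sketch filters on type "comment".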
It seems to be case-sensitive, right? Maybe make it case-insensitive, or if that's not possible, at least mention it somewhere, e.g. by putting it into the label of the input box as "Username (case-sensitive)".
Apparently I'm rather similar to jacquesm [0], while shantly [1] is the top match of both of us. Associativity seems to hold up. Also, I'm in a way on the front page (?), so I'm happy ;)
All my matches seem to be contentious, argumentative people, and I can't say it's uncalled for. I don't think we would get along.
I'm wondering how matching people completely randomly, under the guise of a fake algorithm, would fare. Placebo matching, if you will. The results in terms of social interactions might be more fruitful.
Would you please stop posting unsubstantive comments (like this one and https://news.ycombinator.com/item?id=27572603) and specifically not cross into personal attack? We ban accounts that do those things, and we've had to ask you more than once not to.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful. Note this one: "Please don't sneer, including at the rest of the community."
I have several abandoned HN accounts because I switch to a new account after a while in order not to leave too much PII for doxxing. I don't care about karma at all.
This tool didn't offer any of them as doppelgangers, although (in theory) I should match my own style 100%.
That’s black box AI systems for you. Works in some cases, doesn’t work in others, fails hilariously occasionally; and people will make false accusations based on them.
Okay, the first one I got does actually write like me. It's funny because I didn't agree with what he says in his posts, but I totally recognize why it ranked highly similar to my writing.
Oh gawd, now I'm sounding exactly like that guy... (I just need to add some parens.)
I have bipolar disorder and can be "all over the map" as my ex used to say. I was pleased that my twins were similar, so maybe the model does a better job matching with varied and distinctive data - more signal to work with.
Kind of weird looking at mine. I've got an old account that I don't use any more. It didn't show up in the list of similar accounts, and the list for this account and my old account were completely different.
Would love to chat ([email protected]) if you’re interested in using APIs like this in production. NLUDB is supplying folks with equivalent APIs both as a SaaS and a private cloud install.
If you’re interested in rolling your own, a good place to start is the sentence-transformers Python package along with a KNN search service like Spotify’s Annoy.
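To show the shape of that pipeline without pulling in model weights or an index, here is a dependency-free sketch: a toy bag-of-words `embed` stands in for the sentence-transformers model, and a brute-force cosine scan stands in for Annoy's approximate nearest-neighbour lookup. All names and the corpus are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence-transformers model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def nearest(query, corpus, k=2):
    """Brute-force KNN; Annoy replaces this scan with an approximate index."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda name: cosine(q, embed(corpus[name])),
                    reverse=True)
    return ranked[:k]

corpus = {
    "alice": "monads are just monoids in the category of endofunctors",
    "bob":   "i deployed kubernetes to production and regret everything",
    "carol": "functional programming and category theory all the way down",
}
print(nearest("haskell monads and category theory", corpus))
```

With real embeddings you would swap `embed` for `SentenceTransformer(...).encode` and `nearest` for an `AnnoyIndex` query; the control flow stays the same.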
As a designer on HN, the top 5 matches didn’t resonate, as they were obviously developers who nerded out about code. My most relatable characteristic was that some of our comments are heavily downvoted. :)
I need the following users to go away, there isn’t room for all of us:
journalctl 0.994
thisisweirdok 0.994
veryworried 0.993
shantly 0.993
core-questions 0.993
not_a_cop75 0.993
lambda_obrien 0.993
LeoTinnitus 0.993
xwolfi 0.993
magashna 0.993
—————————-
Edit: On a more serious note, how can we use this to find echo chambers and homogenized news sources? Keep rolling with this idea, I think it’s important.
Seems it picked up on some phrases I use which are not too common among the masses. Beyond that there's little to no relation between me and my doppelgangers.
Most of the listed accounts for me have no activity in recent months, and the ones that do engage on different topics (they are quite technical; I am not).
Interesting. You seem like a closer match than the ones on my list. Looks like we're in different countries (US here), but we both comment on legal/tax matters. Are you a lawyer, by chance? I'm a former tax lawyer.
> We took usernames and respective comment histories from the past three years using the Hacker News API. Then we transformed them into vector embeddings using a pre-trained model, and loaded them into Pinecone.
These matches suffer from a very common problem in recommendation engines, which is a tendency toward matching with the long tail of randomness.
The problem is that your matching engine assumes perfect accuracy in vectorization, but actually, the vectorization is a sample from a distribution.
The distribution of the vectorization comes from the idea that it is trained on a mix of information and randomness, and this model is just one instance of the randomness. There are some sources of randomness in the model that you control, and others in the data.
To get better recommendations, you first need to get the distributions for each element of the vectorization. The simplest approximation would be a variance around the model output. Each item should have a variance determined by the amount of information available for that sample.
Now you have another problem, which is finding the closest vector. A p-test will fail you, because that tells you the probability that you came from the same distribution, which is going to be the random distribution for almost all pairs. You might instead ask for the probability that the two points come from a distribution that is distinct from the random distribution. You’d have to form that distribution, assess the probability of membership, and then the probability of rejection from randomness. You could also consider doing this for each vector element and returning a negative sum of log probabilities to represent the amount of information shared between vectors.
But ultimately, you need a truth set to test these methods against. This is easy. You just split each person into N people by randomly assigning their comments to one of N identities. This could also be used as an element of the variance.
An easy way to start is by bootstrapping the randomness. In this case, there is probably a random initialization vector in the model. You can bootstrap the distribution of each vector element w.r.t. the IV by running the model many times with random initialization vectors. You are still using a fixed point for the training data noise, but this is a start. To bootstrap the training noise variance, you can train many models using random selection of the data, and the same IV. A good heuristic split is to decide how many times you will run the model, x, and then do random selection by 1/sqrt(x). Ex. Run 100 times with 1/10th of the data in each run. Then you have two distributions for each vector element, one for the data randomness, and one for the model randomness, but the mean from the model randomness is the most informative. Now add the data variance / sqrt(x) to get a rough approximation of the true variance.
These are all hacky ways to get a decent improvement, and the formal methods will also give great improvement on top of that, but are easy to mess up and often quite expensive to bootstrap.
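The split-into-N-identities truth set suggested above can be sketched in a few lines. A toy example (a crude vocabulary-overlap score stands in for the real embedding similarity, and the comments are invented): a sensible matcher should score the two halves of one user closer to each other than to an unrelated user.

```python
import random
from collections import Counter

def split_identities(comments, n=2, seed=0):
    """Randomly split a user's comments into n pseudo-identities."""
    rng = random.Random(seed)
    shuffled = comments[:]
    rng.shuffle(shuffled)
    # Deal round-robin so every pseudo-identity gets a share.
    return [shuffled[i::n] for i in range(n)]

def profile(comments):
    """Toy profile: pooled word counts over all of an identity's comments."""
    counts = Counter()
    for c in comments:
        counts.update(c.lower().split())
    return counts

def overlap(a, b):
    """Crude similarity: shared-vocabulary fraction (stand-in for cosine)."""
    shared = set(a) & set(b)
    return len(shared) / max(len(set(a) | set(b)), 1)

alice = ["i love static types", "static types prevent bugs", "types types types"]
bob = ["dynamic languages are faster to write", "ship it and iterate"]

half_a, half_b = split_identities(alice, n=2, seed=0)
# A matching method passes this truth set if alice's halves score closer
# to each other than to an unrelated user.
print(overlap(profile(half_a), profile(half_b)))
print(overlap(profile(half_a), profile(bob)))
```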
From what I can tell, you're saying that when the model runs, it compresses the data into a crystallized, idiosyncratic set of weights, and running the model a bunch of times and averaging will smooth out the results.
Is that necessarily better? My matches felt like different people, and I like that crunchy recommendation butter more than the smooth version (unlike in real life).
I tried with proper case and it still didn't work.
Then I just mashed the button a few times and it said username found but error getting history, so I mashed the button a few more times and it eventually worked. So probably just the backend being overloaded.
I skimmed through my top matches and one of them appears to have attended the same undergrad as me, so that was interesting.
Do you retain any user data? Since most users are probably looking up their own username, it would be a simple task to match and log HN usernames and their respective IPs.
It probably depends on the accounts, and the algorithm. If the comparison algorithm uses some sort of distance calculation between 2 users to figure out how close they are, then you could have single directional relationships.
If my comments are on an island of weirdness, you may be the closest person to me while still being really far away. If your comments are relatively normal, you might have a lot of people around who are closer than me. That would make you my doppelganger, but not make me yours.
Edit: I just checked my doppelganger (barry-cotter), and I'm not even in his list :p. I've seen that username appear in a couple other comments under this post. I wonder if there are a few super normal users that a lot of people are closest to.
My matches have each other within their top matches, but none of them have me in their top ten.
I have a lower similarity score with my top match, compared to my matches' similarity scores with their tenth closest matches. So it is that island scenario that you described.
No. There's no reason for it to be symmetric; there can be an unlimited number of your closest neighbour's closest neighbours that are closer to him than you are. I mean, it's literally measuring distance between dots in n-dimensional space, and if you ask the same question about dots on a sheet of paper, the answer will be obvious.
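The asymmetry is easy to see with three dots on a line: A sits alone on its island, B and C sit close together. A's nearest neighbour is B, but B's nearest neighbour is C. A toy sketch (names and coordinates are made up):

```python
def nearest_neighbour(name, points):
    """Closest other point by absolute distance on a line."""
    return min((p for p in points if p != name),
               key=lambda p: abs(points[p] - points[name]))

points = {"A": 0.0, "B": 4.0, "C": 5.0}  # A is off on its own island

print(nearest_neighbour("A", points))  # B: A's doppelganger is B...
print(nearest_neighbour("B", points))  # C: ...but B's doppelganger is C
```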
1: https://news.ycombinator.com/threads?id=franciscrick1