Show HN: Find Your Hacker News Doppelgänger (streamlit.io)
246 points by gk1 on June 20, 2021 | 215 comments


My #1 match was an account[1] banned for "posting unsubstantive comments and repeatedly breaking the guidelines". Now, I may be biased, but I don't think that's accurate :v

1: https://news.ycombinator.com/threads?id=franciscrick1


I suspect most HN accounts are low-karma. Considering how often I see throwaways, they probably outnumber the rest by several orders of magnitude.

Relying solely on cosine similarity, every vector is likely to be surrounded by the vectors of low-karma accounts.

Or no matter which direction you travel from earth, you will almost certainly be surrounded by vacuum.


I don’t think it’s that random. Out of the top 5 accounts I was linked to, 4 were over 8k, and 2 were over 20k karma.


Look at dang's top five matches.


Aren’t the population statistics of the HN karma distribution known? Histogram, percentiles…


I ran the search on myself, trying to find similar souls who share my passion for Gödel's theorem, viewing the current carbon-based civilization from the perspective of a silicon-based civilization or of aliens, functional programming... But none of them are even close. The #1 match has some views I'm totally unfamiliar with, but it gave me an opportunity (which I appreciated) to understand other views.

In my opinion, this service doesn't have a good S/N ratio. It can give you irrelevant information.


perhaps your interests are really unique ;)


With high enough dimension, almost everyone is.


I think it is correct, as a Doppelgänger is supposed to be the evil version of oneself.



I got a bunch of empty accounts and one startup launch post. Not sure how that qualifies as a doppelgänger


I have to agree. Nothing stood out as being any way similar. It's hard to tell what their measure of similarity is here. This might be a case of let's just throw the data in, and see what comes out.


It caught users that use my style of dumping a ‘in line rhetorical saying’ in their posts. It’s not terrible, you’d have to laugh honestly at how predictable you are.

I’m gonna marry one of my doppelgängers.


I'm guessing they came down on the wrong side of belts vs. bots?


[flagged]


That is doubtful. We don't shadowban established accounts: we tell them that we're banning them and why: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que.... Shadowbanning on HN, at least for the last 7 years, has been reserved for spammers and serial trolls. It's possible that we made a mistake and neglected to tell you, but it's far more likely that we did tell you.

We don't ban people for criticizing PG, as anyone can easily see for themselves by using HN Search or looking at any recent thread from paulgraham.com.

If you're going to make a claim about why you think you were banned, you should provide a link so readers can make up their own minds. When it comes to "I was banned" stories, people say all kinds of things, most of which don't hold up against the actual record.


Here it is:

https://news.ycombinator.com/item?id=26656133

You banned me 2 days after the comment for saying that YC can be as exploitative as, or more exploitative than, China's deals with African countries, and for saying that the parent, defending YC but condemning China, was acting on a basis of racism and ethnocentrism. You can see how he brought a bunch of low-quality links, I refuted them, and you came 2 days later (I just realized that now) and banned me.

Oh well, I suppose you will ban me again. BB cannot be criticized.


That doesn't link to a banned account. I assume you mean this one: https://news.ycombinator.com/item?id=26659584. We told you we were banning you in that very thread: https://news.ycombinator.com/item?id=26678745. Comments like "I must conclude you are a dumb person letting his latent racism to take over or you are aware you are acting on bad faith and you just dont care" are obviously against the rules here and have nothing to do with PG, YC, China, or any particular topic.

Moreover, we'd warned you and asked you many times to follow the site guidelines before that:

https://news.ycombinator.com/item?id=26127670

https://news.ycombinator.com/item?id=25637111

https://news.ycombinator.com/item?id=25400449

https://news.ycombinator.com/item?id=24909805

https://news.ycombinator.com/item?id=24513786

https://news.ycombinator.com/item?id=23087338

https://news.ycombinator.com/item?id=18102477

If you break the rules that often and ignore that many warnings, it's not surprising that you'd end up getting banned. This was not a shadowban and not because you criticized some particular person.

All this is a pity because you posted many interesting comments in the past and we would much rather have you as a contributing user. The sad truth, though, is that the harm you cause by breaking the site guidelines exceeds the good you contribute with the interesting comments—so I don't think we made the wrong call.


As Cardinal Richelieu apocryphally said, "If you give me six lines written by the hand of the most honest of men, I will find something in them which will hang him."

I don't doubt you can present a similar set of "warnings" for any undesirable who refuses to toe the corporate line you are ordered and paid to maintain. All those rules are ambiguous, opaque, arbitrary and subjectively interpreted and enforced, but you know that.

The most ironic thing is that the measure you take with people who refuse to follow your faux-polite tone is censorship and virtual obliteration, something 1000 times worse. Most of the people I got personal (in argument) with here were being racist/classist/homophobic/ethnocentric, but since they can shield themselves behind rhetorical what-ifs and sing paeans to the geniuses of YC, they have carte blanche.

Give me a million times better the dysfunctional governments we all have before techno-fascists like you and your bosses, censoring anyone who does not suck up to them.


Not to contradict the Cardinal but your accounts have broken the site guidelines a lot more than the median commenter, and we really don't care about your views. Plenty of other commenters express similar views without getting banned. Actually we really, really, don't care. We're just trying to have an internet forum that doesn't suck.

It isn't about politeness, btw (let alone "faux" politeness) – you won't find that word, or that concept, in the site guidelines. It's about treating other people respectfully, and abstaining from garden-variety internet dreck. Let's not noble up the latter with self-flattering rhetoric.


A comment about an account being banned in the linked thread. https://news.ycombinator.com/item?id=26678745

Not sure it is applicable since the linked comment’s author does not appear to be banned.


Think of Minority Report and it'll all become clear.


I have 2 accounts at over 1k karma. I generally start a new one when I make major moves (across multiple state lines or between countries).

My accounts did not correlate, probably because they have been inactive at staggered intervals.

> We took usernames and respective comment histories from the past three years

However, putting the names in the Doppelganger search yielded very similar results and the comments of the users are from like-minded people. Well done.


For some users I searched, the most similar user is a throwaway account, which is somewhat eye-opening and unnerving.


I was just about to point that out! I tried a friend's account and the second match was a throwaway that I know for a fact is theirs.


That’s a really interesting and unintended use case…


How did you gather the comment histories? Would you mind sharing a copy?


See description at the bottom. We used the Hacker News API to pull data into BigQuery.

From there we ran them through an embedding model and indexed the embeddings in Pinecone.

The actual similarity search is done with Pinecone. (https://www.pinecone.io)
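
For anyone curious what the last step amounts to, here is a minimal sketch, assuming each user has already been reduced to a single embedding vector. The vectors are made-up toy values, and the brute-force `top_matches` helper is a hypothetical stand-in for illustration only, not Pinecone's API:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_user, embeddings, k=10):
    """Return the k users most similar to query_user, best first."""
    query = embeddings[query_user]
    scored = [
        (cosine(query, vector), name)
        for name, vector in embeddings.items()
        if name != query_user
    ]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:k]]

# Toy embeddings (hypothetical users, invented values).
embeddings = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.8, 0.2, 0.1],
    "carol": [0.0, 0.1, 0.9],
}
matches = top_matches("alice", embeddings, k=2)
```

A real index replaces the linear scan with an approximate nearest-neighbor structure, but the ranking idea is the same.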


Using Google BigQuery is one way. This comment might be of use:

https://news.ycombinator.com/item?id=25075318

> A reminder that BigQuery (as used in the query in this link) is the best way to play with Hacker News data; don't scrape HN data manually! The `bigquery-public-data.hacker_news.full` table appears to be up to date with the most recent HN data as well (table last updated today). However, I'm not 100% sure the query is correct for unilaterally getting all links, as running the query on the full dataset returns the same results as running it from 2006-2015. And I value my sanity enough to not fuss around with the regex.


Two of my matches were <username> and <username>2.


Huh, checked some of my older accounts and none of them matched each other. So I must be doing something right.


In fact, considering that unknown third-parties freely gather such similarity scores and correlate accounts, across different sites—by now it's a given that one's alt accounts have to adhere to different stylistic choices.

Innit?


Wow, this works seriously well at some level, then.


I think the weakness of this technique is in the normalization of the vectors. The close match comments don't look like mine because the content of my comments has to be massively compressed. The close matches appear to have been massively expanded.

Or to put it another way [1], cosign similarity is not enough here. Magnitude also matters here.

This is probably a case where traditional information retrieval methods should play some role. The data are not really big enough that a pure cosign similarity is warranted. [3]

[1]: a phrase that my actual Doppelganger must use. [2]

[2]: and also endnotes like these.

[3]: performative erudition is what is absent from all my matches.
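
To illustrate the point about magnitude, a minimal sketch with made-up toy vectors: cosine similarity gives the same score when a vector is scaled, so however much or little text went into a vector, only its direction counts. Whether the tool's embeddings behave this way is an assumption on my part.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_history = [1.0, 2.0, 0.5]
scaled_history = [10.0, 20.0, 5.0]   # same direction, 10x the magnitude
other_history = [2.0, 1.0, 0.5]      # different direction

same_direction = cosine(short_history, scaled_history)
different_direction = cosine(short_history, other_history)
```

Here `same_direction` comes out as 1 (up to floating-point rounding) even though one vector is ten times longer, while `different_direction` is lower.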


(cosine, not cosign)


Hey, they said it was _performative_ erudition


If you get me and have always wondered why you never quite fit in, ask your GP about ADHD.

While folks are saying the people they get match them in comment style, keep in mind that, due to the nature of the tool, they're looking for that. Also look for comments and opinions where you're not similar, to disprove the match.

I once made a Markov chain IRC bot that people would still be convinced was smart today, because people discard the lines that make no sense when they are only looking to prove rather than also disprove.


> GP

At first I thought you meant "grandparent" in the HN sense.


For those still confused: GP stands for General Practitioner. A medical doctor that isn't specialized.


I have a throwaway Doppel whose only comments are about diet coke.

Given that this is my 'pharmacology' alt account, it seems the author's pretrained word embeddings still associate Coca Cola with the old recipe =)

NLP is hard!


This is fun! Here are the results I got:

  | Username        | Similarity Score |
  |-----------------+------------------|
  | tosh            |            0.939 |
  | app4soft        |            0.931 |
  | beefhash        |            0.930 |
  | joseluisq       |            0.929 |
  | todsacerdoti    |            0.929 |
  | pjmlp           |            0.928 |
  | rbanffy         |            0.928 |
  | blattimwind     |            0.928 |
  | formerly_proven |            0.928 |
  | ducktective     |            0.928 |

I identified three usernames in this table right away! tosh, todsacerdoti, and pjmlp. In fact, I like the stories posted by tosh and todsacerdoti quite often and I like the comments posted by pjmlp very often.


I'm probably late to the party, but is there a reader app people use for following specific people? Or are you just referring to whom you've favorited / following and have a good memory for names?


The latter. I remember those three usernames.


Wow, u recognize ppl on hn? I only recognize ppl I know In real life, everybody else is kind of anonymous.


My number 1 doppelgänger is a Swedish person.

I’m British and I’ve been living in Sweden for 7 years now, and it’s only just occurring to me that this could be affecting the way I form comments.


are you using the "bork bork bork" extension to form comments perhaps? http://www.snert.com/Software/bork.html


This is fantastic! Thank you! :D (…schmerk de herdygerdy, bork, bork, bork…)


I think the poor service is swamped, but it isn't doing itself any favors by hammering the /healthz and /status endpoints about once or twice a second.


I'd be more interested to see a list of users who have commented the most on the same articles as me. Seems like a better way to measure interests, even if it's indirect. Of course, this wouldn't distinguish between doppelgangers and evil twins >:)


Comments are often replies to someone else's point on things that may not be related to the original post.


Some of my favorite HN threads remind me of bubble tracks in old school particle colliders. Just stacks of tangents that are hilariously off-topic but somehow, at times, still interesting and even informative.


One idea is to rank users more highly when their comment is closer to mine in the comment tree. Also to weight users more highly when we both comment on an article that has relatively few comments.
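
A rough sketch of the second half of that idea (weighting by how few comments an article has), with the comment-tree distance part left out. The `1 / len(commenters)` weight and the toy thread data are assumptions chosen for illustration:

```python
from collections import defaultdict

def co_comment_scores(me, threads):
    """threads maps a story id to the set of usernames who commented on it.

    Each shared story adds 1 / (number of commenters) to the other user's
    score, so overlap on a quiet story counts more than on a busy one.
    """
    scores = defaultdict(float)
    for commenters in threads.values():
        if me not in commenters:
            continue
        weight = 1.0 / len(commenters)
        for user in commenters:
            if user != me:
                scores[user] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data: one quiet story shared with u1, two busy stories shared with u2.
threads = {
    "story_a": {"me", "u1"},
    "story_b": {"me", "u2", "u3", "u4", "u5"},
    "story_c": {"me", "u2", "u3", "u4", "u5"},
    "story_d": {"u1", "u5"},
}
ranking = co_comment_scores("me", threads)
```

With this data u1 ranks first (0.5) ahead of u2 (0.2 + 0.2), even though u2 shares more stories with me.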


Yep, my doppels all seem to have participated in topics and threads that I ignored. We seem to have never interacted with each other, and I don't seem to have voted on any of their comments.


If you encounter the error that "This user does not exist or does not have any activity.", check the case. They seem to be case-sensitive on this page.

(I checked with correct and incorrect spellings of "dang", "pg", "TeMPOraL" and "_Microft".)


Yup, didn't work for me at first because it's case sensitive. Also, you're one of my doppelgangers, hello.


Hello, hello, nice to hear that - I'll make sure to look through your comment history later!


Didn't work when I fixed casing until I refreshed the page.


So my HN doppelgänger according to this tool is someone I don't hold in high regard, and even clashed with once. Cool...


Why not.

In the past I often had the impression I wouldn't get along with myself.

Now that I'm more chill and less confrontational, I think I would like to meet myself.


I have unintentionally trolled myself more than once due to people necroing old forum threads.

Yeah, I talk a lot of shit but I've got nothing on my old self.


Hah, I’ve thought the same thing. I think I’d enjoy working with myself quite a bit. As a housemate, though, I doubt we’d cross boundaries often. Very thankful to be married to someone who truly complements me.


Well, to meet your doppelgänger is an ill omen after all.


If you recognize your doppelganger, it's probably because you have an interest in the same topics. Your opinions and your posting style may be different—or even opposite—but from a certain perspective, you have more in common with each other than most.

I'm skeptical that this tool does a good job of identifying semantic meaning of a comment, but I bet it gets the topic right.


This is a nice case study on why it's good to predict distributions instead of raw values. Low-comment accounts should have a high degree of uncertainty, which should translate into weaker similarity scores, if you compute the expected similarity of two accounts.


Eep. My doppelganger is a banned user who posted too many gender flamewar-baiting comments. Let it not be so!


I get a blank screen with a link to the hosting website on the bottom right. Chrome on Android mobile.


I let them know, they’re trying to keep up with the load.

Edit: Should be better now.


same


I’d love to see stats on how well this matches accounts to themselves if you split an account’s comments into two pseudo-accounts and tried to match them.


That makes a lot of sense as control/sanity check. Then again, I don't think I always have the same style due to mood differences and whatnot.


If you split it randomly, the distributions over mood and style should balance out in the large.
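
A minimal sketch of that split-half check. The bag-of-words vectors are a deliberately crude stand-in for the tool's real embedding model (which I don't have), and the two toy histories are invented:

```python
import math
import random
from collections import Counter

def bag_of_words(comments):
    """Crude stand-in embedding: word counts over a list of comments."""
    return Counter(word for c in comments for word in c.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_half_accuracy(histories, seed=0):
    """Randomly split each history in two; count how often a user's first
    half is most similar to that same user's second half."""
    rng = random.Random(seed)
    first, second = {}, {}
    for user, comments in histories.items():
        shuffled = comments[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        first[user] = bag_of_words(shuffled[:mid])
        second[user] = bag_of_words(shuffled[mid:])
    hits = 0
    for user, vec in first.items():
        best = max(second, key=lambda other: cosine(vec, second[other]))
        hits += best == user
    return hits / len(histories)

histories = {
    "rustfan": ["rust borrow checker", "rust lifetimes borrow",
                "cargo rust build", "borrow checker lifetimes"],
    "taxlawyer": ["tax law filing", "tax deduction law",
                  "filing tax deadline", "law deduction filing"],
}
accuracy = split_half_accuracy(histories)
```

On real data the interesting number is how far this accuracy falls below 1.0, and how it changes with the number of comments per account.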


Well, the first comment I saw from my top doppelganger match began "I have been one of the toxic persons in a workplace. I was young and immature and an ungrateful arrogant prick..."

Reviewing additional posts I don't think we seem any more alike than a random pick.


The first account I found was banned, but the 2nd account found says a lot of the same things I do. I would send them in my place in a debate if I could not make it.

Well done.


I don't comment much. My next-to-last comment was 6 words, including the word "underpaid", and now my "Doppelgängers" are all accounts that had "underpaid" in their last comment...


It seems I have some overlap with qsort. Given the posts I have seen from that account, I take that as a compliment.

I do wonder, though, if the model is smart enough to correct for when you quote other people; otherwise it might be measuring who you interact with a lot.

BTW, I'm not sure of the privacy implications of this, maybe someone else can comment on this.


Ran it with my name and the top hit was nzeribe, which led me to this amazing thread: https://news.ycombinator.com/item?id=27087795

And here I was living my life thinking that Soft Cell's version of Tainted Love was the original...


Is this in any way distinguishable from just picking random accounts? I can find no discernible similarity between the supposedly similar accounts.


It works! My doppelgänger is my old account.


Doesn’t work on iPhone? Or just doesn’t work for me:(


The service is apparently overloaded. I'm falling into a connection timeout with every browser and it takes ages to even proceed to this.



Try now, it was updated to handle the load better.


I'm pleased to see that while most people have doppelgangers with similarity above 0.990, my most similar doppelganger has a similarity of 0.975. I'm a unique individual!


0.957 here. I'm twice as unique as you are!


0.939 here. I wonder who has the lowest top similarity score (pg has 0.88)!


0.849 for me



0.866. Feels lonely :(


.886 here, let's feel special instead :D


I tested with dang's username and I really don't see the similarities between the comments on:

https://news.ycombinator.com/threads?id=dang

and

https://news.ycombinator.com/threads?id=porphyrogene

or

https://news.ycombinator.com/threads?id=julianeon

Your site does state:

> It compares the semantic meaning of your comment history with those of all other users, and finds the top ten users whose comment histories are most similar to yours.

So maybe it's comparing comments from the entire history of the account and not just the recent ones, and is therefore hard for me to compare? Would it be possible for you to tweak it so that it only compares, let's say, the most recent 30 or 50 comments?


Isn't dang a bit of a special case? How many people comment about the rules of the site?


It is. I just picked that because it was an already provided option and I recognized the name. In the few pages of comments I compared between the accounts, I didn't see much that was similar, but OP also replied that it compares 3 years' worth of comments, so it's hard for me to notice.


Hello, doppelgänger!


Heh that's an interesting one. I turn up as your doppelgänger but you don't seem to turn up as mine.


It currently compares past 3 years of comments, so yeah the similarity may not be apparent from just the recent comments.

We did this so it would work even for the less active users. But I think your suggestion might work, too.


Yea, 3 years is a long period. I think by default you should compare only the past 30-60 comments and have options for comparing the past year and the past 3 years. If that works well, you could actually build a "dating app for the intellectually curious". Might be easier to do using Reddit data too.


Looks like the site crashed :( Gives "Please wait..."


Nice use of your service for a Show HN!

The number of comments about the Karma being divergent on people's matches suggests you could add a simple ensemble approach with a karma heuristic on top of the vector similarity search to help results.

If the data used isn't already reflecting Karma, it could be a useful metric for representing a whole bunch of things (quality of comments, participation, length of time), and make the vector similarity more meaningful.

It would be interesting to see if users perceive the results as higher quality if you added a simple Karma similarity filter (maybe 400 points or something based on the standard deviation of the average karma score), and then returned the closest matches filtered by that metric.

Vector similarity search as a service looks like a good market. Out of interest, what would the cost translate to for running something like this in practice using the API as a customer?


> Out of interest, what would the cost translate to for running something like this in practice using the API as a customer?

We offer usage-based billing at $0.10 per GB-hour of memory. (https://www.pinecone.io/pricing/) For this app, our eng team would know for sure, but I think the entire index is less than 1 GB, so it would be just $73/month if we kept it running 24/7.

Vector similarity search is new for most companies, so we want to make it very easy to try and test stuff out in production, without cost being a barrier. Even for larger volumes (40GB+) we offer volume and pre-commitment discounts.
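
The arithmetic behind the $73 figure, as a quick sketch. The 730 hours per month (365 × 24 / 12) is my assumption about how the month is counted, and 1 GB is the upper bound guessed above, not a measured index size:

```python
PRICE_PER_GB_HOUR = 0.10   # quoted usage-based price, USD
INDEX_SIZE_GB = 1.0        # assumed upper bound on the index size
HOURS_PER_MONTH = 730      # assumption: 365 * 24 / 12

# Cost of keeping the index in memory around the clock for one month.
monthly_cost = round(PRICE_PER_GB_HOUR * INDEX_SIZE_GB * HOURS_PER_MONTH, 2)
```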


PS Possibly you could add a floor (say less than 20 to exclude shill or throwaway accounts), or ignoring karma difference over a threshold (say 1000). I think applying simple filters based on common sense or domain knowledge can help vector similarity searches with sparse data or pollution. Just a random thought :)
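
Something like this is what I have in mind, as a hedged sketch. I'm reading "ignoring karma difference over a threshold" as dropping matches whose karma is too far from the query user's, and the `filter_by_karma` helper, its defaults, and the sample rows are all hypothetical:

```python
def filter_by_karma(matches, query_karma, floor=20, max_diff=1000):
    """Keep matches with at least `floor` karma whose karma is within
    `max_diff` of the query user's karma.

    matches: list of (username, similarity, karma), already sorted by
    similarity; the order is preserved.
    """
    return [
        (user, similarity, karma)
        for user, similarity, karma in matches
        if karma >= floor and abs(karma - query_karma) <= max_diff
    ]

# Invented example rows, not real accounts.
raw_matches = [
    ("throwaway1", 0.99, 3),      # below the karma floor
    ("veteran", 0.97, 25000),     # too far from the query user's karma
    ("peer", 0.95, 1500),
]
kept = filter_by_karma(raw_matches, query_karma=1200)
```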


The first two people in the list did not look like me much. But the third one (NeedMoreTea) was an interesting hit, commenting in a similar fashion and exploring similar topics, not necessarily from the same perspective. I am now immersed in his comment history.

Also, funnily, I really like tea and I drink ~ half a gallon a day.


My top 5 didn't post for over a year.

Corona probably got all of them :(


'There can be only one.'


My matches seemed pretty accurate in terms of general phrasing (I do a lot of 'I grew up in...' and 'I once knew someone who...'), but the subject matter was a bit concerning at times. Does the average account have so many judgmental downvote-heavy comments?


Looks like it's not loading at the moment. Really curious to try this out.

Edit: Nvm, just not loading in Safari


Try now, it was updated to handle the load better.


Many years ago, I built something similar for Drupal and its votingapi module. Just checked it, I was using Pearson's correlation coefficient between votes. It worked fast and was surprisingly accurate. You need access to voting history for that, of course.
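
From memory, the idea looks roughly like this in Python. This is a reconstruction under assumptions, not the original module code: votes are stored as +1/-1 per item id, and only items both users voted on are compared.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # no variation, correlation is undefined
    return cov / (spread_x * spread_y)

def vote_similarity(votes_a, votes_b):
    """Correlate two users' votes over the items they both voted on."""
    shared = sorted(votes_a.keys() & votes_b.keys())
    if len(shared) < 2:
        return 0.0  # not enough overlap to say anything
    return pearson([votes_a[i] for i in shared],
                   [votes_b[i] for i in shared])

# Toy vote histories: item id -> +1 (upvote) or -1 (downvote).
alice = {1: 1, 2: -1, 3: 1, 4: -1}
bob = {1: 1, 2: -1, 3: 1, 4: -1, 5: 1}
carol = {1: -1, 2: 1, 3: -1, 4: 1}

agree = vote_similarity(alice, bob)
disagree = vote_similarity(alice, carol)
```

A nice property compared with text similarity: users who always vote the opposite way get a strongly negative score instead of looking like a match.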


Intriguing idea, but for me the criteria used to root out doppelgängers doesn’t lead to interesting results. My HN soul mates do not write like me, do not write about the same things, and do things I would never do, such as refer to Wikipedia articles.


Same. I don't really recognize anything in the top four for me.

Edit: I lied, top match 'asark' tends to share in my proclivity for the run-on sentence. Topically we're not really into the same things, so it has to be something along those lines.


Are you sure? You, pseudolus, barry-cotter, and jbegley sure do post a lot of the same URLs...


I’ll take another look.

EDIT: those users are not even on the list of my doppelgängers.


I think currently it is a bit biased towards users with low comment count. Top similar user for me had 4 comments, top 2 only 1 comment. Then top 5 and top 6 again had 1 comment each.

Maybe the similarity score can be weighted by the number of comparisons somehow?


I wonder if there are simply more users with low comment numbers. But yes we considered only including users above some karma score, and still might do that in the future.


This depends on how the similarity is measured. If it is measured as % of agreement between comment texts then it's more likely to have better agreement with someone who has fewer comments rather than more.

For example I suspect that if we generated 1000 random users with random gibberish comments and varied their comment numbers from 1 to 10 or so, the top similarities would be biased towards low comment-count random users. This would be because having one randomly generated comment match your style is easier than having 2 randomly generated comments match the same style.

And if that's the case then the same issue would transfer to comparing real users.

@busymom0 suggested a great solution in this thread - only do comparisons based on "n" (like last 50) comments. This way every similarity would be measured using the same number of comments and users with low comment counts would be excluded automatically.
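
A minimal sketch of that filter, assuming each history is a list of comments ordered oldest to newest. The `last_n_histories` name and the toy data are mine:

```python
def last_n_histories(histories, n=50):
    """Keep only users with at least n comments, trimmed to their last n.

    histories: dict mapping username -> list of comments, oldest first.
    """
    return {
        user: comments[-n:]
        for user, comments in histories.items()
        if len(comments) >= n
    }

# Toy data with n=3 so the effect is easy to see.
histories = {
    "prolific": ["c1", "c2", "c3", "c4", "c5"],
    "regular": ["c1", "c2", "c3"],
    "one_off": ["c1"],
}
comparable = last_n_histories(histories, n=3)
```

Every remaining user is then compared on exactly n comments, and the one-comment accounts drop out before any similarity is computed.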


I think the algorithm might have trouble with negation, questions, irony, jokes, etc. My top matches seemed inclined to comment on similar topics but with different opinions. Granted, my matches topped out at 0.972, so maybe I'm an outlier.


Also this is the kind of service I would build if I wanted to figure out the ip of a user.


Yeah. I wonder why more people here aren’t mentioning that this could be used to help unmask throwaway accounts for people who usually post here with a different account.


Yep. I'm using a VPN right now, but I only looked up my latest account. I'll look up my old HN accounts later on a different IP/VPN service.


This worked relatively well for my account.

A major reason for my first match seems to have been both accounts talking about licensing issues and specifically GPL. Other matches seem to have shared some superficially similar political perspectives on HN.


By the way, on a related question, I have an interest in being able to download all the things I've written on HN, but am not clever enough to hack together some tool to wget and parse my own user history.

Has anyone seen a tool to do this?


Where there is a will, there is a way!

Really you should just start doing it and eventually you will get there; no such thing as not clever enough.


Interesting that all of mine seemed to be quite well established in terms of account age and karma.

  | Username        | Similarity Score |
  |-----------------+------------------|
  | boomboomsubban  | 0.974            |
  | tgsovlerkhgsel  | 0.973            |
  | fixermark       | 0.972            |
  | shadowgovt      | 0.971            |
  | stanferder      | 0.971            |
  | xvector         | 0.971            |
  | geofft          | 0.971            |
  | drdaeman        | 0.971            |
  | shkkmo          | 0.971            |
  | frombody        | 0.971            |


Shows my username as inactive or not valid


Same here. I've tried Firefox and Edge, same result in both.


It works for me when I enter your name. Try again?


It seemed to me that it is case-sensitive, right? Maybe make it not case-sensitive or if that's not possible at least mention it somewhere, e.g. by putting it into the label of the input box as "Username (case-sensitive)".


I got that as well, worked on third try


Yup, same.


I just found people who actually agree with me on HN... kinda cool.


>I just found people who actually agree with me on HN

You're still wrong though :D


get enough accounts and you're scaling being wrong :P


Apparently I'm rather similar to jacquesm [0], while shantly [1] is the top match of both of us. Associativity seems to hold up. Also, I'm in a way on the front page (?), so I'm happy ;)

[0] https://news.ycombinator.com/threads?id=jacquesm

[1] https://news.ycombinator.com/threads?id=shantly


I also had an extreme match for shantly - dunno why exactly though they seem like a reasonable poster.

  shantly 0.989
  asark 0.988
  intergalplan 0.988
  jakobegger 0.988
  gh-throw 0.988
  moshmosh 0.988


This works quite well for me, I see a list of people who all seem to have agreeable comments. All my similarity scores are in the range 0.988-0.986.


There are 9 people with over a .99 match. It's the writing style of a non-native English speaker who spends a lot of time on meme sites.


All my matches seem to be contentious, argumentative people, and I can't say it's uncalled for. I don't think we would get along.

I'm wondering how matching people completely randomly under the guise of a fake algorithm would fare. Placebo matching if you will. The results in terms of social interactions might be more fruitful.


Stuck on "Please wait..." for me. :(


Try now. We’ve upped the resources.


Mine resulted in amazing synchronicity, turns out someone had sent me a link to a blog post of theirs just this morning.


Finally we bring authorship attribution to expose secondary HN accounts. I've been waiting for this for years!


This is cool. I can definitely see similarities in sentence length and style compared to my closest matches.


Does this mean I need to travel the earth and time to find the others so I can defeat them in single combat, because there can be only one?

Seriously, if you want Highlanders, this is how you get Highlanders.

That said, after reading my doppelgangers' comment histories, I'd totally subscribe to their newsletters.


You are admitting you love echo chambers and bubbles.


Would you please stop posting unsubstantive comments (like this one and https://news.ycombinator.com/item?id=27572603) and specifically not cross into personal attack? We ban accounts that do those things, and we've had to ask you more than once not to.

If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful. Note this one: "Please don't sneer, including at the rest of the community."


Interesting! One of them is a banned account and another is a throwaway.

  Bucephalus355  0.944
  freehunter  0.943
  sn_master  0.943
  protomyth  0.943
  zapttt  0.942
  cmhnn   0.942
  1cvmask  0.942
  sabujp  0.942
  Jkvngt  0.942 (Banned)
  GoinginSircles 0.942 (Throw away)


Handy to find people using several identities, spammers with multiple accounts or to uncover throwaways.


How would you even discern which is which?

I have several abandoned HN accounts because I switch to a new account after a while in order to not leave too much PII for doxxing. I don't care about karma at all.

This tool didn't offer any of them as doppelgangers, although (in theory) I should match my own style 100%.


That’s black box AI systems for you. Works in some cases, doesn’t work in others, fails hilariously occasionally; and people will make false accusations based on them.


So:

bool AI(int id1, int id2) {
    return true;  // every pair of accounts is declared a match
}


Okay, the first one I got does actually write like me. It's funny because I didn't agree with what he says in his posts, but I totally recognize why it ranked highly similar to my writing.

Oh gawd, now I'm sounding exactly like that guy... (I just need to add some parens.)


I have bipolar disorder and can be "all over the map" as my ex used to say. I was pleased that my twins were similar, so maybe the model does a better job matching with varied and distinctive data - more signal to work with.


Kind of weird looking at mine. I've got an old account that I don't use any more. It didn't show up in the list of similar accounts, and the list for this account and my old account were completely different.


Can you share the tool and its database as well as hosting it? Is it small enough to do so?

I’d love to see / use your work but it feels weird to participate in allowing a third party to build up an (hn-username, ip-address) database.


Would love to chat ([email protected]) if you’re interested in using APIs like this in production. NLUDB is supplying folks with equivalent APIs both as a SaaS and a private cloud install.

If you’re interested in rolling your own, a good place to start is the sentence-transformers Python package along with a KNN search service like Spotify’s Annoy.

Pinecone, of course, looks awesome as well :-)
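
For anyone rolling their own, here is a minimal sketch of just the nearest-neighbour step in plain Python. The usernames and 3-dimensional vectors are made up for illustration. In a real build the vectors would come from an embedding model (for example via sentence-transformers), and the brute-force loop would be swapped for an approximate index such as Annoy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_matches(query_user, user_vectors, k=3):
    """Return the k users most similar to query_user, best first."""
    query = user_vectors[query_user]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in user_vectors.items()
        if other != query_user
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
user_vectors = {
    "alice": [0.9, 0.1, 0.0],
    "bob": [0.8, 0.2, 0.1],
    "carol": [0.0, 1.0, 0.0],
    "dave": [0.1, 0.0, 1.0],
}

matches = top_matches("alice", user_vectors, k=2)
for name, score in matches:
    print(f"{name} {score:.3f}")
```

Brute force is fine for a few thousand users. An approximate index only starts to matter once the user count gets large.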


We’ll write a longer how-it-works post soon, so you’ll be able to make your own version.


“This user does not exist or does not have any activity.”

:(

Edit: looks like it’s case sensitive, be warned people on mobile phones.

Edit2: the person I matched with the most seems a bit aggressive in their comments. I guess I can learn something from this.


As a designer on HN, I found the top 5 matches didn’t resonate, as they were obviously developers who nerded out about code. The most relatable characteristic was that some of our comments are heavily downvoted. :)


I need the following users to go away, there isn’t room for all of us:

journalctl 0.994

thisisweirdok 0.994

veryworried 0.993

shantly 0.993

core-questions 0.993

not_a_cop75 0.993

lambda_obrien 0.993

LeoTinnitus 0.993

xwolfi 0.993

magashna 0.993

—————————-

Edit: On a more serious note, how can we use this to find echo chambers and homogenized news sources? Keep rolling with this idea, I think it’s important.


Another signal for matching could be a user's 'Favorites'.


My #2 match seems to not exist. https://news.ycombinator.com/threads?id=danaliv


Seems it picked up on some phrases I use which are not too common among the masses. Beyond that there's little to no relation between me and my doppelgangers.


I broke it. It says I don’t exist or have any activity. :)


> It says I don’t exist

We were hoping you wouldn't have to find out this way.


Just what I needed on Sunday morning -- an existential crisis!


All my doppelgangers seem much nicer than I am :^)

Also one was a throwaway of a guy talking about his time on shrooms /shrug

Edit: ah, one is most certainly not very nice ¯\_(ツ)_/¯


Most of the listed accounts for me have no activity in recent months, and the ones that do engage on different topics (they are quite technical; I am not).


Hello my Doppelgänger, you are on place two for me :-D


Interesting. You seem like a closer match than the ones on my list. Looks like we're in different countries (US here), but we both comment on legal/tax matters. Are you a lawyer, by chance? I'm a former tax lawyer.


Hi, no, I'm not in the legal space; I'm a software developer.

But the rule-based system of law combined with the complexities of human society is something I consider interesting.


> Error: Received no response from server Code: 1ST

Not working for me.


Try again? The app is hosted by Streamlit which is still in beta.


So, we give this our username and it matches us to browser id/cookies/IP/whatever and sells the set to marketeers ... or?


From the "How it Works" section:

> We took usernames and respective comment histories from the past three years using the Hacker News API. Then we transformed them into vector embeddings using a pre-trained model, and loaded them into Pinecone.


definitely a neat way to find people that you may not have ever encountered before, as I certainly don't recognize anyone on my list:

hardwaresofton 0.97

g82918 0.969

karmakaze 0.969

pushpop 0.969

abqexpert 0.968

breischl 0.968

forgotmypw17 0.968

dan-robertson 0.968

anaerobicover 0.968

tgbugs 0.968

definitely some good stuff to spend some time looking into though. thanks for sharing


These matches suffer from a very common problem in recommendation engines, which is a tendency toward matching with the long tail of randomness.

The problem is that your matching engine assumes perfect accuracy in vectorization, but actually, the vectorization is a sample from a distribution.

The distribution of the vectorization comes from the idea that it is trained on a mix of information and randomness, and this model is just one instance of the randomness. There are some sources of randomness in the model that you control, and others in the data.

To get better recommendations, you first need to get the distributions for each element of the vectorization. The simplest approximation would be a variance around the model output. Each item should have a variance determined by the amount of information available for that sample.

Now you have another problem, which is finding the closest vector. A p-test will fail you because that is telling you the probability that you came from the same distribution, which is going to be the random distribution for almost all pairs. You might ask something like, the probability that the two points come from a distribution that is distinct from the random distribution. You’d have to form that distribution, assess probability of membership, and then probability of rejection from randomness. You could also consider doing this for each vector element and returning a negative sum of log probability to represent the amount of information shared between vectors.

But ultimately, you need a truth set to test these methods against. This is easy. You just split each person into N people by randomly assigning their comments to one of N identities. This could also be used as an element of the variance.
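
A rough sketch of that truth-set split, assuming comments are available as a dict of username to list of comment strings. The function names and the "user#i" labelling scheme are made up for illustration and are not part of the tool.

```python
import random


def split_into_identities(comments_by_user, n_parts=2, seed=0):
    """Randomly assign each user's comments to one of n_parts synthetic identities."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    synthetic = {}
    for user, comments in comments_by_user.items():
        buckets = [[] for _ in range(n_parts)]
        for comment in comments:
            buckets[rng.randrange(n_parts)].append(comment)
        for i, bucket in enumerate(buckets):
            if bucket:  # skip identities that received no comments
                synthetic[f"{user}#{i}"] = bucket
    return synthetic


def is_true_match(identity_a, identity_b):
    """Two distinct synthetic identities match if they were cut from the same user."""
    if identity_a == identity_b:
        return False
    return identity_a.rsplit("#", 1)[0] == identity_b.rsplit("#", 1)[0]


# Hypothetical data for illustration.
comments_by_user = {
    "alice": ["a1", "a2", "a3", "a4"],
    "bob": ["b1", "b2"],
}
identities = split_into_identities(comments_by_user, n_parts=2, seed=1)
```

A matching method can then be scored by how often an identity's top match satisfies is_true_match.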

An easy way to start is by bootstrapping the randomness. In this case, there is probably a random initialization vector in the model. You can bootstrap the distribution of each vector element w.r.t. the IV by running the model many times with random initialization vectors. You are still using a fixed point for the training data noise, but this is a start. To bootstrap the training noise variance, you can train many models using random selection of the data, and the same IV. A good heuristic split is to decide how many times you will run the model, x, and then do random selection by 1/sqrt(x). Ex. Run 100 times with 1/10th of the data in each run. Then you have two distributions for each vector element, one for the data randomness, and one for the model randomness, but the mean from the model randomness is the most informative. Now add the data variance / sqrt(x) to get a rough approximation of the true variance.
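
A much cheaper analogue of the data-side bootstrap can be sketched without retraining anything. This is not the retraining procedure described above. It assumes each user's vector is simply the mean of per-comment vectors (an assumption; how the tool actually aggregates is not stated) and resamples a user's own comments with replacement to estimate a per-element variance.

```python
import random
import statistics


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]


def bootstrap_user_vector(comment_vectors, n_runs=100, seed=0):
    """Resample comments with replacement; return per-element means and variances."""
    rng = random.Random(seed)
    n = len(comment_vectors)
    samples = []
    for _ in range(n_runs):
        resampled = [comment_vectors[rng.randrange(n)] for _ in range(n)]
        samples.append(mean_vector(resampled))
    dims = len(comment_vectors[0])
    means = [statistics.fmean([s[d] for s in samples]) for d in range(dims)]
    variances = [statistics.pvariance([s[d] for s in samples]) for d in range(dims)]
    return means, variances


# Hypothetical 1-dimensional comment vectors: same spread, different amounts of data.
sparse_user = [[0.0], [1.0]] * 2    # 4 comments
dense_user = [[0.0], [1.0]] * 200   # 400 comments

_, sparse_var = bootstrap_user_vector(sparse_user)
_, dense_var = bootstrap_user_vector(dense_user)
print(sparse_var[0] > dense_var[0])  # less data, wider uncertainty
```

The per-element variance shrinks as a user has more comments, which is the "variance determined by the amount of information available" idea in a very crude form.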

These are all hacky ways to get a decent improvement, and the formal methods will also give great improvement on top of that, but are easy to mess up and often quite expensive to bootstrap.


From what I can tell, you're saying that when the model runs, it compresses the data into a crystallized, idiosyncratic set of weights, and running the model a bunch of times + averaging them will smooth out the results.

Is that necessarily better? My matches felt like different people, and I like that crunchy recommendation butter more than the smooth version (unlike in real life).


Now, can we turn it into a matchmaking service?


I'll probably match with a 51 year old Austrian woman who acts like a 22 year old, or my sister.


I don't exist, according to the site


Same here. Maybe they only index people with a certain threshold number of comments or amount of karma.


Yeah, that’s kinda a downer.


Same.


It’s case sensitive, try again?


I tried with proper case and it still didn't work.

Then I just mashed the button a few times and it said username found but error getting history, so I mashed the button a few more times and it eventually worked. So it's probably just the backend being overloaded.

I skimmed through my top matches and one of them appears to have attended the same undergrad as me, so that was interesting.


I feel including only my last 3 years of activity would not characterize me sufficiently. Why not all of it?


I looked through the first half dozen and it seems pretty on point (similarity scores are high .98x's).


Is it intentional that this only works once and then the page needs to be reloaded to enter another name?


No, it might be resource constraints, trying to handle the traffic.


Nice idea.

Do you retain any user data? Since most users are probably looking up their own username, it would be a simple task to match and log HN usernames and their respective IPs.


First two with the highest similarity score to me were banned. Hope that's not predictive!


Hey all, if you see an error please try again! Might just be HN hug of (temporary) death.


My doppelganger stopped posting sometime in 2020.

Now I wonder what happened to an internet stranger.


It's bi-directional for me. I am my #1 match's #1 match.

Should be that way in general, right?


It probably depends on the accounts, and the algorithm. If the comparison algorithm uses some sort of distance calculation between 2 users to figure out how close they are, then you could have single directional relationships.

If my comments are on an island of weirdness, you may be the closest person to me while still being really far away. If your comments are relatively normal, you might have a lot of people around who are closer than me. That would make you my doppelganger, but not make me yours.

Edit: I just checked my doppelganger (barry-cotter), and I'm not even in his list :p. I've seen that username appear in a couple other comments under this post. I wonder if there are a few super normal users that a lot of people are closest to.


My matches have each other within their top matches, but none of them have me in their top ten.

I have a lower similarity score with my top match, compared to my matches' similarity scores with their tenth closest matches. So it is that island scenario that you described.


No. There's no reason for it to be symmetric: there can be an unlimited number of your closest neighbour's neighbours that are closer to him than you are. I mean, it's literally measuring distance between dots in n-dimensional space, and if you ask your question about dots on paper, the answer will be obvious to you.

FWIW, I'm not even on my #1's list.
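
The dots-on-paper version is easy to try. A small sketch with made-up 2-D points, using plain Euclidean distance (an assumption; the tool reports a similarity score, but the same asymmetry applies to any nearest-neighbour ranking):

```python
import math


def nearest_neighbor(name, points):
    """Return the name of the point closest to the given one (excluding itself)."""
    origin = points[name]
    others = (other for other in points if other != name)
    return min(others, key=lambda other: math.dist(origin, points[other]))


# Hypothetical layout: three users clustered together, one far away on an "island".
points = {
    "islander": (10.0, 0.0),
    "normal_a": (0.0, 0.0),
    "normal_b": (1.0, 0.0),
    "normal_c": (0.0, 2.0),
}

print(nearest_neighbor("islander", points))  # normal_b
print(nearest_neighbor("normal_b", points))  # normal_a, not islander
print(nearest_neighbor("normal_a", points))  # normal_b, so these two are mutual
```

The islander's closest match is normal_b, but normal_b has a closer neighbour of its own, so the relation only runs one way.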


The person you're a closest match to, could have a closer match to someone else. Same with their closest match, and so on.


Says username doesn’t exist


I see results for your username. Try again? I can’t share direct link.


It couldn’t find my account, huh? Oh it’s case sensitive. I’d change that.
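
One way the lookup could be made case-insensitive, as a sketch only. The function names and usernames here are hypothetical, and it assumes the app keeps a list of indexed usernames. Names that differ only by case are all returned rather than silently merged.

```python
def build_lookup(usernames):
    """Map a case-folded username to every indexed spelling of it."""
    lookup = {}
    for name in usernames:
        lookup.setdefault(name.casefold(), []).append(name)
    return lookup


def resolve(query, lookup):
    """Return the indexed usernames matching the query, ignoring case and stray spaces."""
    return lookup.get(query.strip().casefold(), [])


# Hypothetical indexed usernames.
lookup = build_lookup(["ExampleUser", "another_user"])

print(resolve("exampleuser", lookup))
print(resolve(" Another_User ", lookup))
print(resolve("nobody", lookup))
```

Stripping whitespace also helps with mobile keyboards that append a trailing space after autocomplete.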


So does this tool connect IP address with account name?


It says I have no comment history


It says I don’t exist. Dark.


Weird. It could not find me.


It's case-sensitive


How many of us are Animats?


All you have to do now is figure out people's sexual preference and you have a dating app


For some reason I think HN userbase might not be the most gender-balanced.


Have you heard of Grindr?


So? Just use data augmentation.


Codegrindr


There's a Tinder extension for VSCode: https://www.youtube.com/watch?v=bfd8RyAJh6c


"It's not her you're sexually attracted to, it's my code. Just face it, Dinesh, you're gay for my code. You're code gay."

https://www.youtube.com/watch?v=_7bkbv4u1tc


Leetcodegrindr


"You need to enable JavaScript to run this app."

I wish HN had a tag to identify JS-only submissions. And a filter feature to allow me to not show them at all.


My adblock prevents the site from loading, lol.



