⬅️ Reproducing Hacker News writing style fingerprinting

mtlynch 3 daysReload

>Well, the first problem I had, in order to do something like that, was to find an archive with Hacker News comments. Luckily there was one with apparently everything posted on HN from the start to 2023, for a huge 10GB of total data.

This is actually super easy. The data is available in BigQuery.[0] It's up to date, too. I tried the following query, and the latest comment was from yesterday.

    SELECT 
      id,
      text,
      `by` AS username,
      FORMAT_TIMESTAMP('%Y-%m-%dT%H:%M:%SZ', TIMESTAMP_SECONDS(time)) AS timestamp
    FROM 
      `bigquery-public-data.hacker_news.full`
    WHERE 
      type = 'comment'
      AND EXTRACT(YEAR FROM TIMESTAMP_SECONDS(time)) = 2025
    ORDER BY 
      time DESC
    LIMIT 
      100

https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1s...

Frieren 4 daysReload

It works for me. The accounts I used long time ago are there in high positions. I guess that my style is very distinctive.

But I also have seen some accounts that seem to be from other non-native English speakers. They may even have a Latin language as their native one (I just read some of their comments, and, at minimum, some of them seem to also be from the EU). So, I guess, that it is also grouping people by their native language other than English.

So, maybe, it is grouping many accounts by the shared bias of different native-languages. Probably, we make the same type of mistakes while using English.

My guess will be that native Indian or Chinese speakers accounts will also be grouped together, for the same reason. Even more so, as the language is more different to English and the bias probably stronger.

It would be cool that Australians, British, Canadians tried the tool. My guess is that the probability of them finding alt-accounts is higher as the populations is smaller and the writing more distinctive than Americans.

Thanks for sharing the projects. It is really interesting.

Also, do not trust the comments too much. There is an incentive to lie as to not acknowledge alt-accounts if they were created to remain hidden.

hammock 4 daysReload

The "analyze" feature works pretty well.

My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently that I would otherwise.

They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to")

My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided.

In case anyone cares.

xnorswap 4 daysReload

I wonder how much accuracy would be improved if expanding from single words to the most common pairs or n-tuples.

You would need more computation to hash, but I bet adding frequency of the top 50 word-pairs and top 20 most common 3-tuples would be a strong signal.

( The nothing the accuracy is already good of course. I am indeed user eterm. I think I've said on this account or that one before that I don't sync passwords, so they are simply different machines that I use. I try not to cross-contribute or double-vote. )

jedberg 3 daysReload

Maybe I talk too much on HN. :)

When I ran it, it gave me 20 random users, but when I do the analyze, it says my most common words are [they because then that but their the was them had], which is basically just the most common English words.

Probably would be good to exclude those most common words.