
I had to find a name for my side-project. This was the hardest part. Or so I imagined. The name TattleWeb just came to me. It was a mix of the word tattle, which means to tell on someone, like a tattletale, and the web, as in the Internet.

TattleWeb is a product that “tells on” the Web.

With the name now decided, I dived into figuring out how to get started. The obvious first step, I thought, was data collection.

I used to walk around LinkedIn’s office in Mountain View for my 1:1s with my team. On many occasions, we walked past Sergey Brin doing his 1:1s.

If you do 1:1s, you should check out the great document by the GitLab folks on conducting effective 1:1s.

LinkedIn’s office in Mountain View in those days was a couple of buildings in the middle of Googleplex. All around us were Google’s offices. Around many of our buildings were natural trails that snaked from one building to another, often crossing company campuses.

1:1s with your manager or colleagues would often happen as a walk-and-talk on these trails.

On one such occasion, Karthik and I walked past Sergey Brin and really wanted to take a selfie with him. I couldn’t think of any smart questions to ask him, and just asking for a selfie felt embarrassing. It felt like real-world spam.

As we went back and forth on our existential dilemma, I realized something.

What Sergey and Larry had done, what Reid Hoffman had done, and what perhaps countless others had done, fit a template.

Data | Pattern | Story

They organized data by discovering patterns. These patterns were woven into stories that greatly transformed existing narratives.

Google organized the world's information using PageRank and allowed all of us to find answers quickly.

LinkedIn organized professional identities and gave us a global water cooler.

A water cooler is where people gather during work and discuss “earned secrets”. The new ORM reduced latency by 10x. That cool feature raised accuracy by 14%. Or, that colleague who just quit to join a competitor.

The LinkedIn water cooler gives anyone with an internet connection and a device access to the hive mind of corporate hallways across the world.

These legendary founders created tools of leverage and made them accessible at scale.

Information symmetry is the ultimate leverage.

The data for Google and LinkedIn already existed. Websites and resumes.

The data for BuiltWith already existed.

A bunch of people got together and put this data to better use.

I told Karthik that TattleWeb was trying to do what Sergey, Larry, and Reid had already done, but at a much smaller scale. Surely Sergey wouldn’t mind a selfie with folks trying to walk in his footsteps. Maybe that is how we would introduce ourselves and get a selfie.

By the time we turned around to catch up with Sergey, he was gone.

We never saw Sergey on our walks again after that day. Maybe he overheard us and decided to avoid that trail. I will never know.

That weekend I decided to pick up the first item on our checklist - data collection.

I let my family know that I would disappear into the garage for the second half of Saturday, only to be seen again the next morning. I took with me a flask of chai and rusks. And I got started.

I listed the following methods for data collection:

  • Start with a list of 20 to 30 websites and crawl. Say you start with www.indiatimes.com. You then find all the links on indiatimes.com that point to other pages on the site and to external sites. You visit those links and repeat.
  • Get a large list of the most popular websites on the Internet.
  • Use the data provided by the Common Crawl project.

The first method is essentially what a crawler does. Search engines like Google use crawlers to index the web.

The original paper by Google’s founders: The Anatomy of a Large-Scale Hypertextual Web Search Engine (http://infolab.stanford.edu/~backrub/google.html)

Crawling can take on varying levels of complexity, and developing a crawler, though interesting, could become its own top-level side project.
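To make the crawl-and-repeat idea concrete, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The seed list, page limit, and the absence of politeness delays or robots.txt handling are all simplifications for illustration, not how a production crawler would be built.

```python
# Minimal breadth-first crawler sketch (illustrative only; a real crawler
# needs politeness delays, robots.txt handling, retries, and deduplication
# at a much larger scale).
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl(seed_urls, max_pages=100):
    seen = set(seed_urls)
    queue = deque(seed_urls)
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages[url] = resp.text

        # Find every link on the page and queue the ones we haven't seen yet.
        soup = BeautifulSoup(resp.text, "html.parser")
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Example: start from a small seed list, as described above.
# pages = crawl(["https://www.indiatimes.com"], max_pages=50)
```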

Approach 2 - list of top websites

To get a large list of popular websites, I searched for “Alexa internet rankings GitHub”.

Using GitHub in your search will show useful results. Useful, for tech folks, means code or data, and adding GitHub to your search will surface both.

The first result https://gist.github.com/chilts/7229605 seemed to have a link to a CSV with the top 1 million websites on the Internet.

OK. This can work.
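As a rough sketch, loading that list is just a CSV read. I’m assuming the Alexa-style format here, one “rank,domain” pair per line with no header row; adjust if the file you download differs.

```python
# Sketch: load the first N domains from an Alexa-style top-sites CSV.
# Assumes each row is "rank,domain" with no header row (adjust if needed).
import csv


def load_top_sites(path="top-1m.csv", limit=1000):
    domains = []
    with open(path, newline="") as f:
        for rank, domain in csv.reader(f):
            domains.append(domain)
            if len(domains) >= limit:
                break
    return domains


# e.g. top_sites = load_top_sites(limit=100)
```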

I could take the list of websites and fetch their content. What I needed was the HTML source, and from the HTML source I had to extract the <script> tags, specifically their src attributes.

Check out the Kaggle notebook for an example of how this can be done: https://www.kaggle.com/code/harshsinghal/alexa-top-1m-urls-get-js-libraries-used
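Here is a minimal sketch of that extraction step, again with requests and BeautifulSoup; the https:// prefix and the lack of error handling are simplifications.

```python
# Sketch: fetch a site's homepage and pull the src attribute from every
# <script> tag, which hints at the JavaScript libraries the site uses.
import requests
from bs4 import BeautifulSoup


def get_script_sources(domain):
    resp = requests.get(f"https://{domain}", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    return [tag["src"] for tag in soup.find_all("script", src=True)]


# e.g. get_script_sources("www.indiatimes.com") might return entries such as
# ".../jquery.min.js", revealing which libraries the page loads.
```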

I was about to start doing this when I decided to search for any datasets where crawled data was already available.

I came across Common Crawl (www.commoncrawl.org).