
For TattleWeb I decided not to process all the WARC files. I just wanted a sample, so I picked a few hundred WARC files at random and started processing them.
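Picking files at random can be sketched in a few lines of Python. This is only an illustration, not the exact code used for TattleWeb: the path names, the sample size of 300, and the fixed seed are all assumptions made up for the example.

```python
import random


def sample_warc_paths(all_paths, k, seed=42):
    """Return k paths chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the sample reproducible
    return rng.sample(list(all_paths), k)


# Hypothetical listing of WARC file paths; a real run would read these from an index file.
paths = [f"segment-{i:05d}.warc.gz" for i in range(1000)]

# "A few hundred" files; 300 is an arbitrary choice for this sketch.
sample = sample_warc_paths(paths, 300)
print(len(sample), "files selected")
```

Sampling without replacement means no file is processed twice, and the seed means the same sample can be drawn again later if a run has to be repeated.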

The output format was simple: store the URL and the name of each JS library found on that page.

photo of a dataframe
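To make the (URL, library) format concrete, here is a rough sketch of how one page could be turned into rows. It assumes a library "name" is simply the file name in a `<script src="...">` attribute, which may not match how TattleWeb actually names libraries. The helper names `ScriptSrcParser` and `rows_for_page` are hypothetical.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptSrcParser(HTMLParser):
    """Collect the src attribute of every <script> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:  # inline scripts have no src and are skipped
                self.sources.append(src)


def rows_for_page(url, html):
    """Return one (url, library_name) row per external script on the page."""
    parser = ScriptSrcParser()
    parser.feed(html)
    rows = []
    for src in parser.sources:
        # Assumption: the library name is the last path segment, query string dropped.
        name = posixpath.basename(urlparse(src).path)
        if name:
            rows.append((url, name))
    return rows


# Made-up page used only to show the shape of the output.
page = (
    '<html><head>'
    '<script src="https://cdn.example.com/js/jquery.min.js?v=3"></script>'
    '<script>var x = 1;</script>'
    '<script src="/static/app.js"></script>'
    '</head></html>'
)
rows = rows_for_page("https://example.com/", page)
for row in rows:
    print(row)
```

One row per URL and library pair keeps the output flat, so results from many WARC files can be appended to each other without any merging logic.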

After I put all the results together, I had a large number of rows.

Big data.

photo of a dataframe

Almost 25 million rows of results.

A total of 1,896,873 unique URLs and 1,704,754 unique JavaScript library names.
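Counts like these are cheap to compute once the rows are in (URL, library) form. The sketch below uses a handful of made-up rows, not the real data, and the `summarize` helper is hypothetical. On a pandas dataframe, `nunique()` on each column would give the same unique counts.

```python
from collections import Counter

# Tiny made-up sample in the (url, library) format; not the real results.
rows = [
    ("https://a.example/", "jquery.min.js"),
    ("https://a.example/", "app.js"),
    ("https://b.example/", "jquery.min.js"),
    ("https://c.example/", "jquery.min.js"),
]


def summarize(rows):
    """Return total rows, unique URLs, unique libraries, and the most common library."""
    urls = {url for url, _ in rows}
    libs = Counter(lib for _, lib in rows)
    return len(rows), len(urls), len(libs), libs.most_common(1)


total_rows, unique_urls, unique_libs, top = summarize(rows)
print(total_rows, "rows,", unique_urls, "unique URLs,", unique_libs, "unique libraries")
print("most common:", top)
```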

That is a lot of data to analyze and learn from.

You can play with this data too.

Go here: https://www.kaggle.com/harshsinghal/sample-urls-and-javascript-libraries-used

And don’t forget to upvote the notebooks on Kaggle if you like what you see.