For TattleWeb I decided not to process all the WARC files. I just wanted a sample. I chose a few hundred WARC files randomly and started processing them.
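Picking the sample can be sketched in a few lines. This is only an illustration, not the exact script I ran: the file names, the sample size of 300, and the `sample_warc_paths` helper are all made up for the example.

```python
import random


def sample_warc_paths(all_paths, k, seed=42):
    """Return k paths chosen uniformly at random, without replacement.

    A fixed seed makes the sample reproducible between runs.
    """
    rng = random.Random(seed)
    return rng.sample(list(all_paths), k)


# Hypothetical listing of available WARC files (placeholder names).
all_paths = [f"segment-{i:05d}.warc.gz" for i in range(1000)]

# "A few hundred" files; 300 is an assumed number for this sketch.
sample = sample_warc_paths(all_paths, 300)
print(len(sample), "files selected")
```

Sampling without replacement means no file gets processed twice, and the seed lets you get the same sample back if you need to rerun.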
The output format was simple: one row per finding, with the URL and the name of the JS library found on it.
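Here is a rough sketch of what producing those rows could look like. It assumes the library name is just the file name in a script tag's `src` attribute, which may not match how TattleWeb actually names libraries. The `ScriptSrcParser` class, the `library_rows` function, and the `url` and `js_lib` column names are hypothetical.

```python
import csv
import io
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptSrcParser(HTMLParser):
    """Collect the src attribute of every <script> tag that has one."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def library_rows(page_url, html):
    """Return (url, library_name) pairs for one page.

    Assumption: the library name is the last path segment of the script URL.
    """
    parser = ScriptSrcParser()
    parser.feed(html)
    rows = []
    for src in parser.sources:
        name = posixpath.basename(urlparse(src).path)
        if name:
            rows.append((page_url, name))
    return rows


# Made-up page for illustration; inline scripts have no src and are skipped.
html = (
    '<html><head>'
    '<script src="/static/jquery.min.js"></script>'
    '<script src="https://cdn.example.com/libs/d3.v5.js?x=1"></script>'
    '<script>var a = 1;</script>'
    '</head></html>'
)
rows = library_rows("http://example.com/", html)

# Write the rows as CSV, one (url, js_lib) pair per line.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["url", "js_lib"])
writer.writerows(rows)
print(buf.getvalue())
```

A page that loads several libraries produces several rows, which is why the row count ends up much larger than the URL count.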
After I put together all the results, I had a large number of rows.
Big data.
Almost 25 million rows of results.
In total, 1,896,873 unique URLs and 1,704,754 unique JavaScript library names.
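Counts like these can be reproduced with a single pass over the rows. The sketch below runs on a tiny made-up CSV rather than the real data, and the `url` and `js_lib` column names are assumptions, so check the actual headers in the dataset before using it.

```python
import csv
import io


def count_uniques(rows):
    """Return (total rows, unique URLs, unique library names)."""
    urls, libs = set(), set()
    total = 0
    for row in rows:
        total += 1
        urls.add(row["url"])
        libs.add(row["js_lib"])
    return total, len(urls), len(libs)


# Tiny stand-in for the real results file (hypothetical values).
sample_csv = io.StringIO(
    "url,js_lib\n"
    "http://a.example/,jquery.min.js\n"
    "http://a.example/,d3.v5.js\n"
    "http://b.example/,jquery.min.js\n"
    "http://b.example/,react.js\n"
)

total, n_urls, n_libs = count_uniques(csv.DictReader(sample_csv))
print(total, n_urls, n_libs)
```

With about 25 million rows, the two sets should still fit in memory on a typical machine, since they only hold the distinct values.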
That is a lot of data to analyze and learn from.
You can play with this data too.
Go here: https://www.kaggle.com/harshsinghal/sample-urls-and-javascript-libraries-used
And don’t forget to upvote the notebooks on Kaggle if you like what you see.