brave_chawla

How do you take a lot of text and find patterns in it? There are a lot of ways to do this.

Let's start with a solution that does not involve any technology or ML.

We can take the dataset below and stick it into a Google Sheet document.

screenshot of matching url_js and cdnsjs dataset

We can add a third column and call it segment or category.

This category is the name of the bucket into which a specific JavaScript library can be put.

An obvious question is: do we even know what these categories are? Every JavaScript library serves a purpose; it does one or two things specifically. Figuring out that purpose helps us identify the category the library belongs to.

I headed over to https://github.com/sorrycc/awesome-javascript and found some categorizations.

Some of these categories are:

  • Package Managers
  • Component Management
  • Loaders
  • Bundlers
  • Testing Frameworks
  • QA Tools
  • MVC Frameworks and Libraries
  • Templating Engines
  • Game Engines
  • Data Visualization
  • Utilities
  • UI

And the list goes on.

Back to our Google Sheet with the JavaScript library names and descriptions. With the above categories, you can go through each entry in the spreadsheet and type in the category you think the library belongs to.

This approach is called text labeling or data annotation.

The result of this exercise is labeled data.
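To make the spreadsheet exercise concrete, here is a minimal sketch in Python. The rows are illustrative only: the library names are real, but the descriptions and the category choices are assumptions made for this example, not the actual dataset, and `label_rows` is a hypothetical helper.

```python
import csv
import io

# Illustrative rows only; descriptions are made up for this sketch.
unlabeled_csv = """name,description
d3,Bring data to life with SVG Canvas and HTML
mocha,Simple flexible fun test framework
webpack,Bundles modules and assets for the browser
"""

# Labels a human annotator might type into the third column (assumed).
manual_labels = {
    "d3": "Data Visualization",
    "mocha": "Testing Frameworks",
    "webpack": "Bundlers",
}


def label_rows(csv_text, labels):
    """Return the rows with a 'category' column added (empty if not labeled yet)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["category"] = labels.get(row["name"], "")
    return rows


labeled = label_rows(unlabeled_csv, manual_labels)
for row in labeled:
    print(row["name"], "->", row["category"])
```

Each row now carries a name, a description, and a category, which is exactly the shape of the spreadsheet once the third column has been filled in.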

Data labeling is a very large industry today.

Human experts are labeling data across text, images, videos, and sound clips. These datasets are from domains such as medical imaging, chemistry, biology, product reviews, autonomous vehicles, food pictures, manufacturing defects, and so on.

Why do we need labeled data?

This question cuts to the heart of Machine Learning.

Let us do a quick 101 on the broad categories under Machine Learning.

Supervised and unsupervised learning are two different types of machine learning approaches.

They differ in how the models are trained and in the kind of training data they require.

Training data is the data fed to a Machine Learning algorithm so that it can learn to produce useful outputs.

Supervised machine learning learns the relationship between input and output from labeled training data, so that it can classify new data using these learned patterns.

Unsupervised machine learning, on the other hand, is useful for finding underlying patterns and relationships within unlabeled, raw data. This makes it particularly useful for exploratory data analysis, segmenting, or clustering of datasets.

In our spreadsheet exercise, before we started adding categories, our data was raw and unlabeled.

After we added category labels, our dataset became labeled.
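Once the rows are labeled, the supervised idea described above can be sketched in a few lines. This is a deliberately tiny word-overlap classifier written for illustration, not a production method; the example descriptions, the labels, and the helper names `train` and `predict` are all assumptions.

```python
from collections import Counter, defaultdict

# Assumed labeled examples: (description, category).
labeled_examples = [
    ("simple flexible test framework for node", "Testing Frameworks"),
    ("behavior driven test runner with assertions", "Testing Frameworks"),
    ("draw interactive charts and graphs from data", "Data Visualization"),
    ("svg charts for data driven documents", "Data Visualization"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    for description, label in examples:
        word_counts[label].update(tokenize(description))
    return word_counts


def predict(word_counts, description):
    """Pick the label whose training words overlap most with the description."""
    words = tokenize(description)
    scores = {
        label: sum(counts[word] for word in words)
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)


model = train(labeled_examples)
print(predict(model, "lightweight charts for time series data"))  # Data Visualization
print(predict(model, "fast unit test runner"))                    # Testing Frameworks
```

The point is only that the labels drive the learning: without the category column, `train` would have nothing to count against.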

It is no surprise that the data labeling industry has grown in line with the growth in AI in recent years. Artificial Intelligence models require large volumes of labeled training data so that neural networks can be trained to identify patterns that correlate with specific labels.

While it would be great to categorize the JavaScript libraries by hand, one by one, this quickly becomes tedious and time-consuming.

We will take the path of unsupervised learning instead, trying to find groups (segments, or clusters) of JavaScript libraries that have similar descriptions.
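As a rough intuition for what "libraries with similar descriptions" means, here is a minimal sketch that groups descriptions by word overlap without using any labels. It is a greedy threshold grouping on Jaccard similarity, chosen only because it fits in a few lines; it is not a topic model. The library names, descriptions, and the 0.3 threshold are assumptions.

```python
# Assumed, unlabeled descriptions keyed by placeholder library names.
descriptions = {
    "lib_a": "simple test framework for node",
    "lib_b": "test framework with assertions for node",
    "lib_c": "interactive charts from data",
    "lib_d": "svg charts from data",
}


def jaccard(text_a, text_b):
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def cluster(items, threshold=0.3):
    """Greedily put each item into the first group whose first member is similar enough."""
    groups = []  # each group is a list of library names
    for name, text in items.items():
        for group in groups:
            if jaccard(text, items[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([name])
    return groups


print(cluster(descriptions))
```

Note that no category names appear anywhere: the groups emerge from the text alone, and it is still up to a human to look at each group and decide what to call it.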

A popular approach to clustering text data is topic modeling. There are many excellent Python libraries out there to help you build topic models.

I will showcase BERTopic here, a very popular topic modeling library in Python.