In this post, we will provide you with more information on our highly accurate and detailed (400+ categories) website category checker. It uses machine learning for categorization and also provides an explanation for each categorization.
You can start using it (for free) here: https://www.websitecategorizationapi.com/website-url-category-check
Why the need for a website category check?
There are a lot of domains and a lot of URLs/subpages on the internet.
According to Verisign, there were 350.5 million registered domains in the first quarter of 2022. And according to Google’s Gary Illyes, there were already over 30 trillion URLs in 2015.
There are many situations where it is valuable to know the category of a website or URL. Imagine a company that wants to restrict employees from using gaming or shopping websites during working hours.
Or an advertiser that wants to place an ad for an industrial company but does not want it to be shown on irrelevant sites, e.g. a fashion blog (fashion blog visitors are unlikely to need industrial equipment).
What we need in these cases is the ability to assign websites to appropriate categories based on the content found on the website. E.g. www.cnn.com should be categorized as News, netflix.com as Television, apple.com as Technology & Computing and so on.
You may wonder how we choose the categories for classification.
The list of categories used for classification is also known as a taxonomy. There are many standard taxonomies already available; for marketing/ads and general content, the best known is that of the IAB (Interactive Advertising Bureau).
Website category check
Let us now turn to the website category check tool (at https://www.websitecategorizationapi.com/website-url-category-check) and categorize some well-known websites.
For the list of most popular domains we will use the Tranco list: https://tranco-list.eu/.
Let us take youtube.com as the first example and send it to our service. The Tier 1 category is Music and Audio. As mentioned before, the service also provides “explainability”, i.e. which words on the website contributed most to youtube.com being classified as Music and Audio:
Music, youtube and chill are some of the words most responsible for the resulting classification.
How about Tier 2 categorization of youtube.com, which is much more detailed?
The highest-confidence Tier 2 category is “Music TV”, but the classifier also returns other appropriate categories (see result below): Jazz, Rock Music, Alternative Music, Hip Hop Music, Gospel Music, Dance and Electronic Music, Reggae, Children’s Music, Concerts & Music Events, Classical Music, Video Game Genres, Video, Talk Radio.
All of these categories are relevant for YouTube, which demonstrates the accuracy and coverage of the Tier 2 classifier from https://www.websitecategorizationapi.com/website-url-category-check.
{ "category": "/Television/Music TV", "classification": [ { "category": "Music TV", "value": 0.433912273896154 }, { "category": "Jazz", "value": 0.11007760181847515 }, { "category": "Rock Music", "value": 0.08411351013124047 }, { "category": "Alternative Music", "value": 0.06501734127388545 }, { "category": "Hip Hop Music", "value": 0.04772999931369471 }, { "category": "Gospel Music", "value": 0.045166748875421756 }, { "category": "Dance and Electronic Music", "value": 0.041499008057419004 }, { "category": "Reggae", "value": 0.034923705866764965 }, { "category": "Children's Music", "value": 0.02785890941691067 }, { "category": "Concerts & Music Events", "value": 0.025525874724802066 }, { "category": "Classical Music", "value": 0.010679563164987417 }, { "category": "Video Game Genres", "value": 0.008006550753356649 }, { "category": "Country Music", "value": 0.0069512527265213064 }, { "category": "Classic Hits", "value": 0.004901352240189851 }, { "category": "Video", "value": 0.0048077257831409224 }, { "category": "Musicals", "value": 0.004730140943097158 }, { "category": "Awards Shows", "value": 0.004613502912886086 }, { "category": "Talk Radio", "value": 0.003770938986332557 }
The next website that we will analyze with the website category checker is netflix.com.
The Tier 1 categorization is Television, as expected. The explanation produced by the machine learning model:
Tv, shows, tvs, watch and netflix were the words that contributed most to the resulting classification of Television.
Next, the Tier 2 classifier for netflix.com produces many relevant categories:
Children’s TV, Animation TV, Holiday TV, Reality TV, Drama TV, Sports TV, Comedy TV, Science Fiction TV, Soap Opera TV, Action and Adventure Movies, World Movies.
{ "category": "/Television/Children's TV", "classification": [ { "category": "Children's TV", "value": 0.2591420036834231 }, { "category": "Animation TV", "value": 0.19785874588987387 }, { "category": "Holiday TV", "value": 0.16626209194848507 }, { "category": "Reality TV", "value": 0.08781978722036521 }, { "category": "Drama TV", "value": 0.03333617011931484 }, { "category": "Sports TV", "value": 0.028545918475148663 }, { "category": "Comedy TV", "value": 0.027605537429796594 }, { "category": "Science Fiction TV", "value": 0.023108636587018172 }, { "category": "Soap Opera TV", "value": 0.0194427945966532 }, { "category": "Action and Adventure Movies", "value": 0.01900383520321671 }, { "category": "World Movies", "value": 0.016158429871224265 },
Both accuracy and coverage are again excellent.
How does a website category checker work?
You may wonder what exact steps the website category checker takes to classify a given website.
The first step is to fetch the website content. The important requirement for the script that does this is to obtain the website content as a human visitor would see it. If, for example, some website elements are loaded dynamically via JavaScript, our script retrieves that text and other content as well.
Once the content has been fetched, it needs to be pre-processed. The steps taken, e.g. removal of stop words, lemmatization and others, need to be exactly the same as the ones used when preparing the training data set for the machine learning model that does the classification.
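The requirement that training and live pre-processing match can be met by routing both through one shared function. The following is a minimal sketch of that idea, not the service's actual code: the function name `preprocess` and the tiny stop-word list are assumptions made for illustration, and lemmatization is left out to keep the example free of external dependencies.

```python
import re
from typing import List

# Illustrative stop-word list (an assumption); a real pipeline would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "on", "is", "are", "for", "with"}


def preprocess(text: str) -> List[str]:
    """Lowercase the text, split it into alphanumeric tokens, and drop stop words.

    Lemmatization is intentionally omitted in this sketch.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# The same function is applied to training documents and to freshly fetched pages,
# so the classifier always sees input prepared in the same way.
training_tokens = preprocess("Watch the best TV shows on Netflix")
live_tokens = preprocess("Watch TV shows and movies in the app")
print(training_tokens)
print(live_tokens)
```

Keeping a single function (rather than two similar ones) is the simplest way to avoid the training and inference paths drifting apart over time.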
An important step concerns localization. Our machine learning model was trained on an English corpus of texts, so when the pipeline encounters a website, it first automatically detects the language of the content. If it is English, it passes the pre-processed text directly to the classifier.
How do we support many languages?
If the content is not in English, the text needs to be translated to English first. We use a highly accurate NMT (neural machine translation) service for translations, with high BLEU scores for its main translation engine.
Once the text is translated, it is sent to pre-processing and then to the main classifier.
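The language routing described above can be sketched as follows. This is only an illustration under stated assumptions: `to_english`, `fake_detect` and `fake_translate` are hypothetical names, and the two stand-in functions merely imitate a real language detector and a real NMT service so that the example can run on its own.

```python
from typing import Callable


def to_english(
    text: str,
    detect_language: Callable[[str], str],
    translate: Callable[[str, str], str],
) -> str:
    """Return the text unchanged if it is English, otherwise translate it to English."""
    language = detect_language(text)
    if language == "en":
        return text
    return translate(text, language)


# Stand-ins (assumptions) for a real language detector and a real translation service.
def fake_detect(text: str) -> str:
    return "de" if "und" in text.split() else "en"


def fake_translate(text: str, source_language: str) -> str:
    known = {"Filme und Serien": "movies and series"}
    return known[text]


print(to_english("movies and series", fake_detect, fake_translate))
print(to_english("Filme und Serien", fake_detect, fake_translate))
```

Passing the detector and translator in as parameters keeps the routing logic independent of any particular provider; the output of `to_english` would then go to pre-processing and on to the classifier.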
Assigning many categories to websites for better filtering and discoverability
The classifier outputs a list of categories with associated probabilities. E.g. a row in the results like this:
"category": "Music TV", "value": 0.433912273896154
means that the classifier assigns the website to the category Music TV with a probability of approximately 0.434.
Probabilities are useful because one can use them to assign more than one (main) category to a given website, by selecting all categories whose probability is higher than a given threshold (e.g. 0.2).
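As a small illustration of threshold-based selection, the sketch below applies two thresholds to an excerpt of the netflix.com result shown earlier (only the first four entries are included). The helper name `categories_above` is an assumption made for this example and is not part of the API.

```python
import json

# Excerpt of the Tier 2 result for netflix.com (first four entries only).
response = json.loads("""
{
  "category": "/Television/Children's TV",
  "classification": [
    {"category": "Children's TV", "value": 0.2591420036834231},
    {"category": "Animation TV", "value": 0.19785874588987387},
    {"category": "Holiday TV", "value": 0.16626209194848507},
    {"category": "Reality TV", "value": 0.08781978722036521}
  ]
}
""")


def categories_above(classification, threshold):
    """Return the names of all categories whose probability exceeds the threshold."""
    return [item["category"] for item in classification if item["value"] > threshold]


print(categories_above(response["classification"], 0.2))
print(categories_above(response["classification"], 0.15))
```

With a threshold of 0.2 only Children's TV is selected, while lowering it to 0.15 also admits Animation TV and Holiday TV, so the threshold directly controls how many categories a website ends up in.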
The API of our service returns the probabilities of all categories, so clients can do this on their own as well. This is useful because a single website can then appear in many categories, which improves filtering, discoverability and search on the platforms and apps where the categorizations are used.
Classification from offline database or real-time, full path classification
When you send a domain or URL to our website category checker, it is classified in real time, i.e. the content of the URL is fetched from the website and then classified.
Some services instead serve a classification that was done at some earlier time and is simply retrieved from an offline database.
We do have an offline database as well, but for a specific purpose: clients who need low latency, e.g. a result in milliseconds. That is not achievable with real-time classification, because fetching a website usually takes several seconds.
In this case, a better solution is to use an offline database of millions of already categorized domains, from which the category of a given domain is simply retrieved, which is much quicker.
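The trade-off between the two modes can be sketched as a lookup that answers from an in-memory table when it can. Everything here is illustrative: `lookup_category` and the tiny `offline_db` dictionary are hypothetical, the example categories are the ones mentioned earlier in this post, and the optional fallback to a live classifier is an assumption about how a client might combine the two modes, not a description of the service.

```python
from typing import Callable, Dict, Optional

# Tiny stand-in for an offline database of already categorized domains.
offline_db: Dict[str, str] = {
    "cnn.com": "News",
    "netflix.com": "Television",
    "apple.com": "Technology & Computing",
}


def lookup_category(
    domain: str,
    database: Dict[str, str],
    classify_live: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Return the stored category, falling back to a live classifier if one is given."""
    key = domain.strip().lower()
    if key.startswith("www."):
        key = key[4:]
    if key in database:
        return database[key]  # fast path: no website fetch needed
    if classify_live is not None:
        return classify_live(key)  # slow path: real-time classification
    return None


print(lookup_category("www.CNN.com", offline_db))
print(lookup_category("example.org", offline_db))
```

A dictionary lookup like this completes in well under a millisecond, whereas the slow path is bounded by the time it takes to fetch the page.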