{"id":609,"date":"2022-08-03T10:07:42","date_gmt":"2022-08-03T10:07:42","guid":{"rendered":"https:\/\/www.alpha-quantum.com\/blog\/?p=609"},"modified":"2022-08-03T10:07:42","modified_gmt":"2022-08-03T10:07:42","slug":"what-is-url-classification-and-where-do-we-need-it","status":"publish","type":"post","link":"https:\/\/www.alpha-quantum.com\/blog\/url-classification\/what-is-url-classification-and-where-do-we-need-it\/","title":{"rendered":"What is URL classification and where do we need it?"},"content":{"rendered":"<p>URL classification is the process of collecting the content and other metadata of a given URL and using this information to classify the URL according to a specific methodology and set of objectives.<\/p>\n<p>We have developed one of the most accurate URL classification APIs, with websites classified in over 1000 content categories. You can <a href=\"https:\/\/www.websitecategorizationapi.com\/demo_dashboard_iab\/\">try it out (for free) here<\/a>:<\/p>\n<p id=\"rqquJfK\"><a href=\"https:\/\/www.websitecategorizationapi.com\"><img loading=\"lazy\" width=\"1574\" height=\"1141\" class=\"alignnone size-full wp-image-590 \" src=\"https:\/\/www.alpha-quantum.com\/blog\/wp-content\/uploads\/2022\/06\/img_62bad0a2e3527.png\" alt=\"\" \/><\/a><\/p>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">URL categorization of websites\u200a\u2014\u200acybersecurity<\/strong><\/p>\n<p class=\"graf graf--p\">An important objective for URL categorization arises in the context of cybersecurity, where we want to protect our systems and sensitive information from digital attacks.<\/p>\n<p class=\"graf graf--p\">A typical example of such an attack is phishing: creating a counterfeit website that mimics a genuine website in order to obtain critical information from the user.<\/p>\n<p class=\"graf graf--p\">Phishing attacks can use a variety of methods, from link manipulation (making a malicious URL look like a legitimate one), 
filter evasion (using images of websites) and website forgery (injecting JavaScript or other code into a legitimate website) to social engineering and others.<\/p>\n<p class=\"graf graf--p\">When building a machine learning model for identifying malicious websites, a given URL can be classified into one of two classes: \u201cmalicious\/not safe\u201d or \u201csafe\u201d.<\/p>\n<p class=\"graf graf--p\">There are many other ways to classify a URL; one of them, content categorization, will be discussed in more detail in the second post.<\/p>\n<h3>Goals of this article<\/h3>\n<p class=\"graf graf--p\">This is part 1 of our multi-post article, which will focus on several topics:<\/p>\n<ul class=\"postList\">\n<li class=\"graf graf--li\">we will introduce you to the steps necessary to build a URL categorization service that can categorize websites as malicious \/ not malicious (safe)<\/li>\n<li class=\"graf graf--li\">we will discuss how one builds a machine learning model that can perform domain categorization based on domain content\/text<\/li>\n<li class=\"graf graf--li\">we will provide a free URL database with around 1 million domains classified into 21 content categories (IAB taxonomy), together with an in-depth analysis of the data set<\/li>\n<\/ul>\n<p>URL classification is usually performed on a large number of websites, so it is almost always automated. 
With the rise of machine learning and deep learning models in the last decade, this automation is now usually handled by a machine learning model.<\/p>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Training data set for URL categorization ML models<br \/>\n<\/strong><\/p>\n<p class=\"graf graf--p\">When building a supervised machine learning model, the first step is to obtain the necessary labelled data: the training data set on which the ML model is built.<\/p>\n<p class=\"graf graf--p\">In our case, we need a list of malicious URLs that have been manually checked and labelled as such.<\/p>\n<p class=\"graf graf--p\">One great source for this purpose can be found here: <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.unb.ca\/cic\/datasets\/url-2016.html\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/www.unb.ca\/cic\/datasets\/url-2016.html\">https:\/\/www.unb.ca\/cic\/datasets\/url-2016.html<\/a>.<\/p>\n<p class=\"graf graf--p\">We need two classes of URLs (\u201csafe\u201d and \u201cnot safe\u201d). For the safe class, the authors of the data set collected 35,300 URLs from Alexa top sites that can be considered safe.<\/p>\n<p class=\"graf graf--p\">For malicious URLs they selected the following types:<\/p>\n<ul class=\"postList\">\n<li class=\"graf graf--li\">Spam URLs from the WEBSPAM-UK2007 collection<\/li>\n<li class=\"graf graf--li\">Phishing URLs collected from the OpenPhish website<\/li>\n<li class=\"graf graf--li\">Malware URLs obtained from the DNS-BH project, which maintains a list of malware URLs<\/li>\n<li class=\"graf graf--li\">Defacement URLs: Alexa-ranked trusted websites that were hosting fraudulent or hidden URLs with malicious web pages<\/li>\n<\/ul>\n<p class=\"graf graf--p\">In addition, we collected malicious URLs using the free API from <a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.phishtank.com\" target=\"_blank\" rel=\"nofollow 
noopener\" data-href=\"https:\/\/www.phishtank.com\">https:\/\/www.phishtank.com<\/a>.<\/p>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Feature engineering<\/strong><\/p>\n<p class=\"graf graf--p\">Once we have an appropriate training data set, the next step is feature engineering, i.e. defining the features that will be computed from our data items and then used as input to the machine learning model.<\/p>\n<p class=\"graf graf--p\">We use features collected from a large number of research papers on this topic (for the full list, please refer to the Resources in the Appendix at the end of the post).<\/p>\n<p class=\"graf graf--p\">The URL features considered were:<\/p>\n<ul class=\"postList\">\n<li class=\"graf graf--li\">does the URL use an IP address (e.g. <a class=\"markup--anchor markup--li-anchor\" href=\"http:\/\/21.12.42.215\/index.html\" target=\"_blank\" rel=\"noopener\" data-href=\"http:\/\/21.12.42.215\/index.html\">http:\/\/21.12.42.215\/index.html<\/a>)<\/li>\n<li class=\"graf graf--li\">is the URL using a shortening service, e.g. bit.ly\/2aDMTE<\/li>\n<li class=\"graf graf--li\">number of dots in the URL<\/li>\n<li class=\"graf graf--li\">number of sensitive words in the URL (as defined by Garera et al.; see the references for more information)<\/li>\n<li class=\"graf graf--li\">number of days since registration of the domain<\/li>\n<li class=\"graf graf--li\">length of the URL<\/li>\n<li class=\"graf graf--li\">is the domain outside its expected location in the URL<\/li>\n<li class=\"graf graf--li\">is the favicon loaded from a domain different from the one shown in the URL address bar<\/li>\n<li class=\"graf graf--li\">presence of the @ symbol<\/li>\n<li class=\"graf graf--li\">existence of double-slash redirection using \/\/<\/li>\n<li class=\"graf graf--li\">prefix or suffix in the URL. 
These are used to make the domain look more like a legitimate website.<\/li>\n<li class=\"graf graf--li\">use of a non-standard port<\/li>\n<li class=\"graf graf--li\">is the URL using a subdomain, e.g. <a class=\"markup--anchor markup--li-anchor\" href=\"http:\/\/pay.domain.com\" target=\"_blank\" rel=\"noopener\" data-href=\"http:\/\/pay.domain.com\">http:\/\/pay.domain.com<\/a><\/li>\n<li class=\"graf graf--li\">are any objects (e.g. images, videos) loaded from a domain other than the one being visited<\/li>\n<li class=\"graf graf--li\">is the HTTP protocol being used<\/li>\n<li class=\"graf graf--li\">links used in tags<\/li>\n<li class=\"graf graf--li\">number of redirects<\/li>\n<li class=\"graf graf--li\">is there JavaScript that shows a fake URL in the status bar<\/li>\n<li class=\"graf graf--li\">is the URL using a pop-up window<\/li>\n<li class=\"graf graf--li\">presence of an IFrame<\/li>\n<li class=\"graf graf--li\">is the HTTPS protocol being used<\/li>\n<li class=\"graf graf--li\">OpenPageRank of the domain, using <a class=\"markup--anchor markup--li-anchor\" href=\"https:\/\/www.domcop.com\/openpagerank\/what-is-openpagerank\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/www.domcop.com\/openpagerank\/what-is-openpagerank\">https:\/\/www.domcop.com\/openpagerank\/what-is-openpagerank<\/a>, which measures the strength of the backlink profile. 
Legitimate domains tend to have a higher OpenPageRank.<\/li>\n<li class=\"graf graf--li\">is the URL indexed by the Google search engine<\/li>\n<\/ul>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Machine Learning Models for URL classification<\/strong><\/p>\n<p class=\"graf graf--p\">The next step is to consider which machine learning models are appropriate for a URL classification model based on these features.<\/p>\n<p class=\"graf graf--p\">Given the form of the features, one can consider the following ML models (though the list is by no means exhaustive):<\/p>\n<ul class=\"postList\">\n<li class=\"graf graf--li\">random forests<\/li>\n<li class=\"graf graf--li\">decision trees<\/li>\n<li class=\"graf graf--li\">logistic regression<\/li>\n<li class=\"graf graf--li\">AdaBoost<\/li>\n<li class=\"graf graf--li\">XGBoost (<a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/xgboost.readthedocs.io\/en\/stable\/\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/xgboost.readthedocs.io\/en\/stable\/\">https:\/\/xgboost.readthedocs.io\/en\/stable\/<\/a>)<\/li>\n<li class=\"graf graf--li\">neural networks<\/li>\n<\/ul>\n<p class=\"graf graf--p\">When training ML models, it is also useful to apply explainability methods such as LIME, partial dependence plots and SHAP (<a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/github.com\/slundberg\/shap\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/github.com\/slundberg\/shap\">https:\/\/github.com\/slundberg\/shap<\/a>) to better understand which features contribute most to the model's predictions. 
In our next post, we will do this in practice by building a URL classification model from scratch.<\/p>\n<h3 class=\"graf graf--p\">Offline databases or real-time, live full URL-path categorization<\/h3>\n<p class=\"graf graf--p\">In the example above, we considered a specific URL classification model that helps us determine whether a given URL is malicious. Such a service can operate in real time: it fetches the URL on the fly as soon as you submit it and sends it to the ML model, which computes the features discussed above and, based on these, predicts whether the URL is problematic.<\/p>\n<p class=\"graf graf--p\">It can, however, also work from an offline data set of malicious URLs, where the submitted URL is simply checked against an existing offline URL database. The advantage is that this is much faster, because you do not have to wait for the URL to be fetched and processed. For this reason, many applications use an offline URL database.<\/p>\n<p class=\"graf graf--p\">The advantage of real-time URL classification, however, is that you can classify URLs that have not been seen before, for example URLs published so recently that the service's classification bots have not yet checked them. This approach is also somewhat safer: a URL that was safe a week ago may have been hijacked in the meantime and no longer be safe.<\/p>\n<h3>Other use cases of URL categorization<\/h3>\n<p>URL categorization is not used only for analysing the safety of websites. Another very common use is to check the category of a URL based on its content.<\/p>\n<p>For example, a company may not want to allow employees to use shopping or gaming websites, so its filtering system needs to know the content category of each domain that may be visited. 
Of the two options available here, the most common is to use an offline content categorization database of domains (there are more than 350 million domains on the web): the domain of the URL requested by the user is quickly checked against this database to obtain its category.<\/p>\n<p>So if a user wants to visit Netflix, the content category returned by the database is TV, and the company blocks that category, the filtering system would block access.<\/p>\n<p>Offline content categorization databases can be used for other purposes as well. For example, a platform with millions of domains may want to provide its users with information about the content category of each domain.<\/p>\n<p class=\"graf graf--p\"><strong class=\"markup--strong markup--p-strong\">Appendix<\/strong><\/p>\n<p class=\"graf graf--p\">Resources:<\/p>\n<p class=\"graf graf--p\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/arxiv.org\/pdf\/2009.11116.pdf\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/arxiv.org\/pdf\/2009.11116.pdf\">https:\/\/arxiv.org\/pdf\/2009.11116.pdf<\/a><\/p>\n<p class=\"graf graf--p\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.igi-global.com\/article\/phishing-website-detection-with-semantic-features-based-on-machine-learning-classifiers\/297032\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/www.igi-global.com\/article\/phishing-website-detection-with-semantic-features-based-on-machine-learning-classifiers\/297032\">https:\/\/www.igi-global.com\/article\/phishing-website-detection-with-semantic-features-based-on-machine-learning-classifiers\/297032<\/a><\/p>\n<p class=\"graf graf--p\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.142.4092&amp;rep=rep1&amp;type=pdf\" target=\"_blank\" rel=\"nofollow noopener\" 
data-href=\"https:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.142.4092&amp;rep=rep1&amp;type=pdf\">https:\/\/citeseerx.ist.psu.edu\/viewdoc\/download?doi=10.1.1.142.4092&amp;rep=rep1&amp;type=pdf<\/a><\/p>\n<p class=\"graf graf--p\"><a class=\"markup--anchor markup--p-anchor\" href=\"https:\/\/www.ml.cmu.edu\/research\/dap-papers\/dap-guang-xiang.pdf\" target=\"_blank\" rel=\"nofollow noopener\" data-href=\"https:\/\/www.ml.cmu.edu\/research\/dap-papers\/dap-guang-xiang.pdf\">https:\/\/www.ml.cmu.edu\/research\/dap-papers\/dap-guang-xiang.pdf<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>URL classification is a process of collecting the contents and other meta data about given URL and using this to classify URL,&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[66],"tags":[],"_links":{"self":[{"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/posts\/609"}],"collection":[{"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/comments?post=609"}],"version-history":[{"count":4,"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/posts\/609\/revisions"}],"predecessor-version":[{"id":613,"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/posts\/609\/revisions\/613"}],"wp:attachment":[{"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/media?parent=609"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.alpha-quantum.com\/blog\/wp-json\/wp\/v2\/categories?post=609"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.alpha-quant
um.com\/blog\/wp-json\/wp\/v2\/tags?post=609"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
