{"id":23,"date":"2022-02-05T17:38:00","date_gmt":"2022-02-05T17:38:00","guid":{"rendered":"https:\/\/ghothi.co.uk\/?p=23"},"modified":"2025-05-28T14:15:47","modified_gmt":"2025-05-28T14:15:47","slug":"categorising-data-with-zero-shot-classification","status":"publish","type":"post","link":"https:\/\/ghothi.co.uk\/?p=23","title":{"rendered":"Categorising Data with Zero-Shot Classification"},"content":{"rendered":"\n<p>Recently, I needed to group about 1000 terms into a number of categories.<\/p>\n\n\n\n<p>Doing it manually was far too painful, so I turned to machine learning as it can be so good for these kinds of tasks.<\/p>\n\n\n\n<p>Specifically, I used the <code>torch<\/code> and <code>transformers<\/code> library from Hugging Face. Their <code>pipeline<\/code> function makes setting up a model and performing various tasks really simple. In this case, I used the <code>zero-shot-classification<\/code> pipeline, which lets us classify text into categories without needing a pre-trained model specific to our labels. In other words, someone else has already done the heavy lifting of training a model perfect for this use case.<\/p>\n\n\n\n<p>Given how much of a time-saving this was, I thought I&#8217;d record the process in case it&#8217;s useful for anyone else. This will be a simplification, but it should be simple enough to read between the lines.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Zero-Shot Classification Explained<\/h2>\n\n\n\n<p>Unlike traditional machine learning models that need to be trained on specific pre-defined categories, zero-shot classification can categorise text into labels it hasn&#8217;t seen before (i.e. data it wasn&#8217;t trained on). It&#8217;s similar to having a grad student organise your stuff without needing to give them examples first, they just understand how the info needs to be organised based on the categories you tell them. The below image probably speaks a thousand words:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img fetchpriority=\"high\" decoding=\"async\" width=\"748\" height=\"506\" src=\"https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification.png\" alt=\"\" class=\"wp-image-208\" srcset=\"https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification.png 748w, https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification-300x203.png 300w, https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification-500x338.png 500w, https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification-150x101.png 150w, https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification-400x271.png 400w, https:\/\/ghothi.co.uk\/wp-content\/uploads\/2022\/02\/few-shot-classification-200x135.png 200w\" sizes=\"(max-width: 748px) 100vw, 748px\" \/><\/figure>\n\n\n\n<p>This is really useful when you need to sort data into custom categories without having a huge dataset (or the time) to train a new model.<\/p>\n\n\n\n<p>This is all possible because of the &#8216;Transformers&#8217; architecture &#8211; these are just models that understand and generate human language by understanding the meaning, context and relationship between words. We&#8217;ll be using the Hugging Face library, which gives us easy access to these pre-trained models.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">The Goal<\/h2>\n\n\n\n<p>Categorise a list of about 1000 terms into predefined categories using zero-shot classification to save me having to do it manually.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setting Up the Code<\/h2>\n\n\n\n<p>First, import the necessary libraries and set up our environment:<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"import torch\nfrom transformers import pipeline\nimport pandas as pd\nimport csv\nimport io\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #81A1C1\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">torch<\/span><\/span>\n<span class=\"line\"><span style=\"color: #81A1C1\">from<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">transformers<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">pipeline<\/span><\/span>\n<span class=\"line\"><span style=\"color: #8FBCBB\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">pandas<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">pd<\/span><\/span>\n<span class=\"line\"><span style=\"color: #8FBCBB\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">csv<\/span><\/span>\n<span class=\"line\"><span style=\"color: #8FBCBB\">import<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #8FBCBB\">io<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Specify that you want to use the first GPU available in your PC rather than your CPU. GPUs can process many more tasks in parallel than CPUs which speeds things up massively, so there&#8217;s no reason to not do this.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"device = 0  # Use the first GPU\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9\">device<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #D8DEE9FF\">  # <\/span><span style=\"color: #D8DEE9\">Use<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">the<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">first<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">GPU<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Next, I loaded the CSV file containing the keywords I wanted to categorise. I used Python&#8217;s built-in <code>csv<\/code> module to read the file and then wrote it into an in-memory buffer using <code>io.StringIO<\/code>. This just ensures compatibility when loading the data into a pandas DataFrame, which makes data manipulation much easier.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"with open(&quot;\\\\text_to_cluster.csv&quot;, &quot;r&quot;, encoding=&quot;utf-8&quot;) as file:\n    reader = csv.reader(file)\n    csv_file = list(reader)\n\ncsv_buffer = io.StringIO()\nwriter = csv.writer(csv_buffer)\nwriter.writerows(csv_file)\n\ncsv_buffer.seek(0) # Move the cursor back to the beginning of the buffer\ndf = pd.read_csv(csv_buffer)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #81A1C1\">with<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">open<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #EBCB8B\">\\\\<\/span><span style=\"color: #A3BE8C\">text_to_cluster.csv<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">r<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">encoding<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">utf-8<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #D8DEE9FF\">) <\/span><span style=\"color: #81A1C1\">as<\/span><span style=\"color: #D8DEE9FF\"> file:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">reader<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">csv<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">reader<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">file<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">csv_file<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">list<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">reader<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">csv_buffer<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">io<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">StringIO<\/span><span style=\"color: #D8DEE9FF\">()<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">writer<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">csv<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">writer<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">csv_buffer<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">writer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">writerows<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">csv_file<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">csv_buffer<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">seek<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #D8DEE9FF\">) # <\/span><span style=\"color: #D8DEE9\">Move<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">the<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">cursor<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">back<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">to<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">the<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">beginning<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">of<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">the<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">buffer<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">df<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">pd<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">read_csv<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">csv_buffer<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>I then defined a batch size for processing the keywords. Processing in batches helps manage memory usage and ensures the model runs smoothly. The classifier is set up with the <code>facebook\/bart-large-mnli<\/code> model, which is a great model for a lot of different natural language processing tasks.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"batch_size = 16  # see how big you can make this number before OOM\nclassifier = pipeline(&quot;zero-shot-classification&quot;, model=&quot;facebook\/bart-large-mnli&quot;, device=device)\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9\">batch_size<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #B48EAD\">16<\/span><span style=\"color: #D8DEE9FF\">  # <\/span><span style=\"color: #D8DEE9\">see<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">how<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">big<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">you<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">can<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">make<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">this<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">number<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">before<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">OOM<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">classifier<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">pipeline<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">zero-shot-classification<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">model<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #A3BE8C\">facebook\/bart-large-mnli<\/span><span style=\"color: #ECEFF4\">&quot;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">device<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9\">device<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>I listed out the categories I wanted to classify my keywords into.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"labels = ['research methods', 'academic writing', 'statistics and data analysis', 'general science', 'psychology and social science',\n          'marketing and business', 'history']\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9\">labels<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> [<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">research methods<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">academic writing<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">statistics and data analysis<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">general science<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">psychology and social science<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">          <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">marketing and business<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">history<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">]<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>Finally, I iterated over the keywords in batches, feeding them into the classifier. The results were printed out, showing each keyword and its corresponding category. This simple loop does all the heavy lifting, classifying each term accurately and efficiently.<\/p>\n\n\n\n<div class=\"wp-block-kevinbatdorf-code-block-pro\" data-code-block-pro-font-family=\"Code-Pro-JetBrains-Mono\" style=\"font-size:.875rem;font-family:Code-Pro-JetBrains-Mono,ui-monospace,SFMono-Regular,Menlo,Monaco,Consolas,monospace;line-height:1.25rem;--cbp-tab-width:2;tab-size:var(--cbp-tab-width, 2)\"><span style=\"display:block;padding:16px 0 0 16px;margin-bottom:-1px;width:100%;text-align:left;background-color:#2e3440ff\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"54\" height=\"14\" viewBox=\"0 0 54 14\"><g fill=\"none\" fill-rule=\"evenodd\" transform=\"translate(1 1)\"><circle cx=\"6\" cy=\"6\" r=\"6\" fill=\"#FF5F56\" stroke=\"#E0443E\" stroke-width=\".5\"><\/circle><circle cx=\"26\" cy=\"6\" r=\"6\" fill=\"#FFBD2E\" stroke=\"#DEA123\" stroke-width=\".5\"><\/circle><circle cx=\"46\" cy=\"6\" r=\"6\" fill=\"#27C93F\" stroke=\"#1AAB29\" stroke-width=\".5\"><\/circle><\/g><\/svg><\/span><span role=\"button\" tabindex=\"0\" data-code=\"sequences = df['Keyword'].to_list()\n\nfor i in range(0, len(sequences), batch_size):\n    result = classifier(sequences[i:i+batch_size], labels, multi_label=False)\n    for i in result:\n        sequence = i['sequence'].rstrip('\\n')\n        match = i['labels'][0]\n        print(f'{sequence} | {match}')\" style=\"color:#d8dee9ff;display:none\" aria-label=\"Copy\" class=\"code-block-pro-copy-button\"><svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" style=\"width:24px;height:24px\" fill=\"none\" viewBox=\"0 0 24 24\" stroke=\"currentColor\" stroke-width=\"2\"><path class=\"with-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2m-6 9l2 2 4-4\"><\/path><path class=\"without-check\" stroke-linecap=\"round\" stroke-linejoin=\"round\" d=\"M9 5H7a2 2 0 00-2 2v12a2 2 0 002 2h10a2 2 0 002-2V7a2 2 0 00-2-2h-2M9 5a2 2 0 002 2h2a2 2 0 002-2M9 5a2 2 0 012-2h2a2 2 0 012 2\"><\/path><\/svg><\/span><pre class=\"shiki nord\" style=\"background-color: #2e3440ff\" tabindex=\"0\"><code><span class=\"line\"><span style=\"color: #D8DEE9\">sequences<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">df<\/span><span style=\"color: #D8DEE9FF\">[<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">Keyword<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">]<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">to_list<\/span><span style=\"color: #D8DEE9FF\">()<\/span><\/span>\n<span class=\"line\"><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9\">for<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">i<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">range<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">len<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">sequences<\/span><span style=\"color: #D8DEE9FF\">)<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">batch_size<\/span><span style=\"color: #D8DEE9FF\">):<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">result<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #88C0D0\">classifier<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">sequences<\/span><span style=\"color: #D8DEE9FF\">[<\/span><span style=\"color: #D8DEE9\">i<\/span><span style=\"color: #D8DEE9FF\">:<\/span><span style=\"color: #D8DEE9\">i<\/span><span style=\"color: #81A1C1\">+<\/span><span style=\"color: #D8DEE9\">batch_size<\/span><span style=\"color: #D8DEE9FF\">]<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">labels<\/span><span style=\"color: #ECEFF4\">,<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">multi_label<\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9\">False<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">    <\/span><span style=\"color: #D8DEE9\">for<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">i<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">in<\/span><span style=\"color: #D8DEE9FF\"> result<\/span><span style=\"color: #ECEFF4\">:<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #D8DEE9\">sequence<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">i<\/span><span style=\"color: #D8DEE9FF\">[<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">sequence<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">]<\/span><span style=\"color: #ECEFF4\">.<\/span><span style=\"color: #88C0D0\">rstrip<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #EBCB8B\">\\n<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #D8DEE9\">match<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #81A1C1\">=<\/span><span style=\"color: #D8DEE9FF\"> <\/span><span style=\"color: #D8DEE9\">i<\/span><span style=\"color: #D8DEE9FF\">[<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">labels<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">][<\/span><span style=\"color: #B48EAD\">0<\/span><span style=\"color: #D8DEE9FF\">]<\/span><\/span>\n<span class=\"line\"><span style=\"color: #D8DEE9FF\">        <\/span><span style=\"color: #88C0D0\">print<\/span><span style=\"color: #D8DEE9FF\">(<\/span><span style=\"color: #D8DEE9\">f<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #A3BE8C\">{sequence} | {match}<\/span><span style=\"color: #ECEFF4\">&#39;<\/span><span style=\"color: #D8DEE9FF\">)<\/span><\/span><\/code><\/pre><\/div>\n\n\n\n<p>And that&#8217;s it! In just a few lines of code, I managed to categorise over 1000 terms in a few minutes instead of a few days!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Recently, I needed to group about 1000 terms into a number of categories. Doing it manually was far too painful, so I turned to machine learning as it can be so good for these kinds of tasks. Specifically, I used the torch and transformers library from Hugging Face. Their pipeline function makes setting up a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":208,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6],"tags":[],"class_list":["post-23","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-ml-automation"],"_links":{"self":[{"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/23","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=23"}],"version-history":[{"count":9,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/23\/revisions"}],"predecessor-version":[{"id":211,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/posts\/23\/revisions\/211"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=\/wp\/v2\/media\/208"}],"wp:attachment":[{"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=23"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=23"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ghothi.co.uk\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=23"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}