{"id":1122,"date":"2024-07-30T15:42:55","date_gmt":"2024-07-30T14:42:55","guid":{"rendered":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/?p=1122"},"modified":"2024-08-09T09:31:39","modified_gmt":"2024-08-09T08:31:39","slug":"discus-library-joint-project","status":"publish","type":"post","link":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2024\/07\/30\/discus-library-joint-project\/","title":{"rendered":"DISCUS-Library Joint Project"},"content":{"rendered":"\n<p>The aim of this post is just to give a very quick overview of a joint project between the Library and the Data-Intensive Science Centre at the University of Sussex (DISCUS).<\/p>\n\n\n\n<p>This began with a successful proposal to the 2024 Development Studies Association (DSA) conference by Danny Millum, Paul Gilbert and Alice Corble, to run a panel entitled \u2018Decentring development thinking by engaging with archives\u2019.<\/p>\n\n\n\n<p>Danny and Alice, along with Tim Graves from the Systems Team, then decided to submit our own paper to the panel (which surprisingly enough was accepted!), entitled \u2018Camels and chatbots: an experiment in applying AI technology to the BLDS West African Economic Journals\u2019.<\/p>\n\n\n\n<p>This paper would \u2018draw on previous collaborative analysis of the British Library for Development Studies Legacy (BLDS) Collection, which involved using metadata from the collection to create a mapping tool to contrast its provenance with that of the main library collections at Sussex and use this to explore the potential for applying decolonial approaches to library discovery and research\u2019.<\/p>\n\n\n\n<p>This time though, the aim was to move from metadata to the data itself, inspired by yet another project (undertaken in collaboration with the University of Manchester) which was digitising another part of the BLDS collections, the rare West African Economic Journals.<\/p>\n\n\n\n<p>This provided a unique corpus of Global South-originating materials on which 
to explore the potential of a variety of AI tools, including chatbots, text and image analysis, and visualisation. Out of these journals we focused on the Camel Forum Working Papers from the Somali Academy of Sciences and Arts, hoping these would generate lenses on technological development discourse that offer a radical departure from traditional Global North analytical norms.<\/p>\n\n\n\n<p>So we had an overall idea and some materials to work with, but were still a bit vague about how we might deploy the myriad new AI tools becoming available. We basically took two main approaches:<\/p>\n\n\n\n<h3>1. <strong>CamelGPT<\/strong><\/h3>\n\n\n\n<p>The first approach was to create an LLM chatbot limited solely to the Camel Forum Papers. This has yielded various decolonising-adjacent possibilities, some very straightforward in that the papers are now available to Somali researchers to interrogate via CamelGPT.<\/p>\n\n\n\n<p>Others are less straightforward or less proven &#8211; we need researchers to try to break the model to see how accurate its superficially plausible responses are, and we\u2019d also like to find some way of comparing and contrasting the responses we are getting here with those from a comparable subset originating from the Global North.<\/p>\n\n\n\n<p>A further note &#8211; and many thanks to our digital humanities colleagues Jonathan Blaney and Marty Steer here &#8211; relates to issues of language. We\u2019d initially claimed that CamelGPT (and by extension ChatGPT) was \u2018language-agnostic\u2019 &#8211; that it would treat its contents equally no matter what language they were in, that we could ask it questions in any language, and that we could get replies in any language.<\/p>\n\n\n\n<p>However, this doesn\u2019t stand up to scrutiny. The Arabic corpus is smaller than the English and French corpora. 
So this increases the chances that if you ask CamelGPT a question in Arabic it could:<\/p>\n\n\n\n<ul><li>simply not understand your question<\/li><li>give you an answer in a faulty rendition of your language, on a scale of questionable -&gt; nonsensical<\/li><li>misunderstand your question and give you clearly incorrect or dubious answers<\/li><li>misunderstand your question and give you plausible but incorrect results that you don&#8217;t know are incorrect<\/li><\/ul>\n\n\n\n<p>In fact, there\u2019s already a growing corpus of evidence that chatbots are significantly less capable in <a href=\"https:\/\/www.wired.com\/story\/chatgpt-non-english-languages-ai-revolution\/\">languages other than English<\/a>.<\/p>\n\n\n\n<p>We also need to bear in mind that no training has occurred here. The instance of ChatGPT we are using was still trained using a standard corpus of text from the World Wide Web, and so still reflects many of the <a href=\"https:\/\/www.science.org\/doi\/full\/10.1126\/science.aal4230\">biases that are present in human language<\/a>. So, for instance, in some word vector models, &#8220;doctor minus man plus woman&#8221; yields &#8220;nurse.&#8221;<\/p>\n\n\n\n<p>Obviously, we\u2019d like to try and train up our own LLM with Global South biases &#8211; but we haven\u2019t YET been able to do this. We must hence acknowledge that the tools we are playing with here are likely to be conditioned by the algorithmic biases and oppressive logics that underpin Global North information spheres.<\/p>\n\n\n\n<h3>2. BERTopic<\/h3>\n\n\n\n<p>At the same time however DISCUS were looking for library projects that they could use as pilot projects. 
We therefore proposed the following to them:<\/p>\n\n\n\n<p><em>\u201cTo explore the potential of AI to surface knowledge from a digitised collection of rare West African Economic Journals: The Camel Forum Working Papers.<\/em><\/p>\n\n\n\n<p><em>We plan to experiment with a range of AI tools: text mining\/analysis, chatbots, image analysis and data visualisation. One outcome would be to develop an LLM chatbot to interrogate the corpus: the intention being to compare and contrast the responses this generates with that produced by generic ChatGPT.<\/em><\/p>\n\n\n\n<p><em>We anticipate that AI will offer a radical departure from traditional Global North analytical norms and want to test this hypothesis and problematise any outcomes.<\/em><\/p>\n\n\n\n<p><em>We have already been accepted to speak at SOAS in June to present our experiences up to that point in the project.\u201d<\/em><\/p>\n\n\n\n<p>DISCUS accepted and we were assigned to work with Dr Chloe Hopling under the watchful supervision of Professor Julie Weeds.<\/p>\n\n\n\n<p>We then shared the Camel Forum Working Papers, which had been scanned, PDF\u2019d and OCR\u2019d.<\/p>\n\n\n\n<p>Chloe then started work.<\/p>\n\n\n\n<p>Her first step, just to get a feel for manipulating this dataset, was to load it into Python and generate a frequency distribution showing the 20 most common words in the sample document provided.<\/p>\n\n\n\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcyS0d1nWFM1uJHOY50yV5qdTXsRqCb1H4qH_oXWOtEquxPCRmm9ayDAsdB3IKxj0vqlHUWGQk0B-6l8boLsutIglVtSoeeXAjyXR5nibxtK1Q0frMQAusglC03MDQ7FC8D_hNcCQrCXPeQ8iEsjj1C0Vc?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A bar chart titled 'Frequency of Words' shows the word 'camel' with the highest frequency, appearing around 110 times. The words 'project,' 'research,' 'milk,' 'camels,' and 'somali' have lower frequencies, ranging from about 30 to 60. 
The rest of the words, including 'study,' 'also,' 'mohamed,' 'ali,' 'one,' 'herd,' 'somalia,' 'survey,' 'projects,' 'carried,' 'disease,' 'animals,' 'hussein,' and 'proposed,' have frequencies between approximately 15 and 30. The x-axis represents the words, while the y-axis shows their frequency of occurrence.\" width=\"800\" height=\"548\" \/><\/figure>\n\n\n\n<p>Strangely enough \u2018camel\u2019 came out on top\u2026<\/p>\n\n\n\n<p>Next, she applied two approaches, stemming and lemmatization, which help group similar words in the document by normalising them to their roots:<\/p>\n\n\n\n<ol><li>Stemming removes affixes. It is computationally fast, but the stemmed word isn\u2019t always a real word, e.g. \u2018anim\u2019 as the root of \u2018animal\u2019 (though this can still be useful, depending on the application).<\/li><li>Lemmatization takes context into account and converts each word into its meaningful root (its lemma). This can be computationally slower, depending on the size of the body of text, but gives a real word as the root.<\/li><\/ol>\n\n\n\n<p>Below we can see the frequency distribution for these two normalisation approaches.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"550\" height=\"341\" src=\"https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=550%2C341&#038;ssl=1\" alt=\"A bar chart titled 'Stemming' displays word frequencies, with 'camel' being the most frequent word at around 140 occurrences. 'Project' follows with approximately 80 occurrences. Other words such as 'research,' 'studi' (likely a stemmed form of 'study'), 'milk,' 'somali,' and 'herd' have frequencies ranging from 40 to 60. Words including 'also,' 'moham,' 'propos,' 'ali,' 'one,' 'carri,' 'survey,' 'report,' 'somalia,' 'diseas,' 'anim,' 'work,' and 'differ' appear less frequently, with counts between roughly 15 and 30. 
The x-axis represents the words, while the y-axis indicates their frequency of occurrence.\" class=\"wp-image-1124\" srcset=\"https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?w=817&amp;ssl=1 817w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=300%2C186&amp;ssl=1 300w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=768%2C476&amp;ssl=1 768w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=100%2C62&amp;ssl=1 100w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=150%2C93&amp;ssl=1 150w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=200%2C124&amp;ssl=1 200w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=450%2C279&amp;ssl=1 450w, https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image-1.png?resize=600%2C372&amp;ssl=1 600w\" sizes=\"(max-width: 550px) 100vw, 550px\" data-recalc-dims=\"1\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" width=\"550\" height=\"354\" src=\"https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=550%2C354&#038;ssl=1\" alt=\"A bar chart titled 'Lemmatization' shows the word 'camel' as the most frequent, with a count around 140. The words 'project,' 'study,' 'research,' 'milk,' 'somali,' 'herd,' and 'also' have moderate frequencies, ranging from approximately 30 to 70. The words 'mohamed,' 'ali,' 'one,' 'survey,' 'report,' 'somalia,' 'animal,' 'disease,' 'carried,' 'field,' 'hussein,' and 'proposed' appear less frequently, with counts between roughly 15 to 30. 
The x-axis lists the words, and the y-axis indicates the frequency of each word's occurrence.\" class=\"wp-image-1123\" srcset=\"https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?w=850&amp;ssl=1 850w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=300%2C193&amp;ssl=1 300w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=768%2C494&amp;ssl=1 768w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=100%2C64&amp;ssl=1 100w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=150%2C97&amp;ssl=1 150w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=200%2C129&amp;ssl=1 200w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=450%2C290&amp;ssl=1 450w, https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/07\/image.png?resize=600%2C386&amp;ssl=1 600w\" sizes=\"(max-width: 550px) 100vw, 550px\" data-recalc-dims=\"1\" \/><\/figure>\n\n\n\n<p>BERTopic is a machine learning tool that helps us understand texts by automatically finding and grouping similar words.<\/p>\n\n\n\n<p>It identifies key themes and patterns.<\/p>\n\n\n\n<p>It has gone through all the 37 Camel Working Papers and identified the most common groupings of words, to create \u2018topics\u2019 of 4 words, and then assigned sentences to these topics.<\/p>\n\n\n\n<p>For those interested in a slightly more technical explanation, BERTopic captures the semantic meaning of each sentence by generating a numerical representation of the sentence. We call this process embedding. Using the sentence embeddings BERTopic can then cluster sentences with similar meaning and identify topics (clusters of sentences with similar meaning). 
BERTopic summarises these topics by selecting words from within the cluster that it deems to best represent that topic (Representation) and providing the sentences that best represent the meaning of the topic (Representative Docs).<\/p>\n\n\n\n<p>Here\u2019s how the original listing looks (note that Topic -1 is the label given to sentences with no topic).<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXd7edkUu9bEEEkbdpvdCSpJFke0VdnSwqNgDdR8TtQAf7ulPm8vljqXrFHTFw5iONKd9Bv9-kXUBIrMLUJJlWxneJjozWwzW7WU_FQqSXjhC0yuTw2mayASvb0bftDMAam2p0Unl_pDUD0Z3RQxiXfUU4I?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A table with 25 rows and four columns. The columns are labeled &quot;Topic Count,&quot; &quot;Name,&quot; &quot;Representation,&quot; and &quot;Representative Docs.&quot; Each row contains numerical data and text related to a specific topic, such as camels, milk, and pastoralism.\" \/><\/figure>\n\n\n\n<p>Next, Chloe refined this by reducing the number of outlier sentences (those labelled Topic -1) &#8211; each one was added to its nearest topic if it met a set threshold.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXcFugMOXl2qRLKdLHJHynWP3cOuBmtQe7P4fds40HqVcnc6JM1_r5Oza0SKtyyxAQ-lgNrd_gD0DbF0CxLe11TcIfAYjizDQ0RBs5fQb-g3M7uK-QSU5mMcPXX3sC-1G7hMEr-FJ7FYH9HJSPC5AKpdhlU?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A table with 25 rows and four columns containing data related to various topics. Columns are labeled &quot;Topic Count,&quot; &quot;Name,&quot; &quot;Representation,&quot; and &quot;Representative Docs.&quot; Information includes numbers, short phrases, and potentially code-like sequences. 
Topics seem to relate to camels, pastoralism, and potentially geographical regions.\" \/><\/figure>\n\n\n\n<p>Chloe also checked that the sentences being assigned to topics seemed reasonable by looking at some examples of representative sentences for a topic:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXd1nVdCgPlpq8vsDsaVRbf9Tho-eCOoPJdLKoFKdl3zXtctwrzF8iQQyUFVnUR_kM9TsD-iLPQoT8QlyjJO-UDrlOzy9JTwK5gkX4hn6iBoMWFrpPpL_-23cXt59Qc29ggeNaPVR9G2-Ide0Ah7immn3vzC?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"Text snippet with a topic label &quot;1_soshs_marketing_price_middlemen&quot; and representative documents discussing marketing costs, including variable costs, overhead costs, and profit margins. Details include references to local government, specific locations, and potential hidden costs.\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXer043j2y_UvUgjkoz0G7sgThOxyrD5UqfOOOJxNhIBgKwijGml-f90VAuBAmY5wQYenQcUplJVxGgbcVeOr3D6fcn-wJB7RRUDKIwNKZGV4xj1743E_i8JT38IreAfHXAI6nShBbneqyeTdNmZyf9GhME?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A text string from one of the groups, with the label 'milk processing products dairy'\" \/><\/figure>\n\n\n\n<p>We can now see the top topics, and their most frequently occurring words:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdzZwS-Sa3t443cSRisP6yM5DlI2FXFbX6ClQaJ1m1diwIR38-w4OBQ6zacG01ULkKPNuBjvCSdPzD9Jccml1y9UIG1KqXV7B8eVY2JjAOzijHRZYeFPTfPKOG2BKpSNbsTaqm2y_i8FaZ2_OMLHGyhNdgA?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"Bar charts showing the top words associated with ten different topics. 
Each topic has a unique color and includes words related to agriculture, livestock, trade, and geography.\" \/><\/figure>\n\n\n\n<p>It\u2019s also useful to try to reduce the number of topics where there is overlap. The image below shows how Chloe generated a cosine similarity score between 0 and 1 (1 being the most similar) for each pair of topics, allowing us to merge the most similar ones:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeXuwoMDfxSjlkfgh1sNWzFfPh_FXqnmcRGTBxwwojlz08xPzh18Al_aSDnTe2ovrPUguRhdMaO1c99NRRR9ZdD1QWu02ZoDWJMTbTrtU4xaHvYSdnnJv599acJJdeac-EvGYZik2zYl42xqyiLSn7sX5aP?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A similarity matrix visualized as a heatmap. The matrix displays the similarity scores between 280 topics, ranging from 0.3 to 1. The topics are listed on both the x and y axes. Warmer colors indicate higher similarity between topics.\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeh7hNvlI-fAzEYK5PMWRpvwjnTcLhKnwPbkNKRLBMuJl3Au0zPj2Z07G3_OMc1-b8_CDiFDIa00lPTCr3r0ZvA2gJF7aQpPrBL_zJc3jePSDz9ghNBxJSlHjLy94zhPe2G3BeZt1a_aN8qqJNIvWlAJrqD?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A screenshot of a table labelled suggested topics to merge, with two rows highlighted.\" \/><\/figure>\n\n\n\n<p>Chloe also produced some intertopic distance maps &#8211; basically the bigger the circle, the more frequently a topic occurs, and the closer the circles, the more similar the topics are. 
On the left you have a pre-merging map, and on the right a post-merging map.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdXrHnt5C6vkod0U-wpB3dGitp5iqSX2IlEAbgbUDVfzb3hC5ewpPGfe0eWtaJA56DEorQz4NCNVmtsSzYtWpgyVRL1DRNjz9xrlcq1RxMubmvbHMK48rJriVYHzBwQ0O5DJmV4w5ihUdnEuYSKOIPdArUS?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"An intertopic distance map showing the relationship between 280 topics. Topics are represented as circles with varying sizes, indicating topic prevalence. The closer the circles, the more similar the topics are. The map is divided into four quadrants by two axes labeled D1 and D2.\" \/><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdXxH-eCUiRIIOYCyUm4ttQStxOKvOCNduip0pKw7OydTAAizGglS3IEzgbDNXZ8euovTZV7RLLqjVg-H3UCKY49M1yirdXm_JwVWjXj5oLuixiyvQc4c6eAzBgrh9yvsxDrXwmmzMlgl-e_CpzTPvoCvA?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"An intertopic distance map showing the relationship between multiple topics. Topics are represented as circles with varying sizes, indicating topic prevalence. The closer the circles, the more similar the topics are. The map is divided into four quadrants by two axes labeled D1 and D2.\" \/><\/figure><\/div>\n\n\n\n<p>We can also look at each document &#8211; and see what the top topics for each document are:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXfS2B1aecycvo9GVp7ETJ1DxPQg1nRmyGTYnYWB3UgQUKe_7VPtYu-VlnkXlarMhEtSKBHAVCv3w7KtIOm1itMC8OqblJJGJcv-UmCoxeVzT4vkBTdJw9q300t3clKgJgYOEN85CFBxyVPqChc5j6QO6C1n?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A table with three columns and 16 rows. 
The first column contains file names, the second column contains a list of words labeled &quot;First,&quot; and the third column contains a list of words labeled &quot;Second.&quot; The content of the table appears to be data related to research, potentially in the field of animal science or agriculture.\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXdKhrqUR2-87atkeDvxXltWCP15m2gfqhpnt65ARbgtEQja8ZMHYM1vcxdmJnr6sdyhmXfQiFkXpa98bjwfZ5lhVeFuiA_l_WKY22Ac1I33k4y7UWWbEENIe4syfruR0xnCV0jl5463XiMDry5FoCCh0Vk1?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A table with 17 rows and four columns. The columns are labeled &quot;Third,&quot; &quot;% First,&quot; &quot;Second,&quot; and &quot;% Third.&quot; Each row contains a word or phrase in the first column, followed by numerical data in the remaining columns. The data appears to be related to research or analysis, potentially in the fields of language, agriculture, or social sciences.\" \/><\/figure>\n\n\n\n<p>And lastly we can use this to produce a document similarity map, which should show how similar the documents are, and group together those which share the most similar topics:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img src=\"https:\/\/lh7-rt.googleusercontent.com\/docsz\/AD_4nXeUfC7nFtB9XTZLRxaQlh3yHRf8p0SpjG_a6t2WWfiOi1R9dV2miqlHEYkjDMibRY-EZQSvuBDMvVekrYqwut-tlQtr0kkx0ZC1FtouBYDiFyWRzHuS4NTipMBeab2nYVeG-KE2cCMOwuucvApeDlg2fmKC?key=wNRqXLKdEgABrzM6yZcm1g\" alt=\"A scatter plot displaying data points with varying colors and sizes. The x-axis ranges from -20 to 25, and the y-axis from 6 to 18. 
A text box in the top left corner provides context about the data, including file name, first, second, and third ranked words, and their corresponding numerical values.\" \/><\/figure>\n\n\n\n<p>As an adjacent issue, Chloe also looked into how <strong>OCR noise might affect our findings.<\/strong><\/p>\n\n\n\n<p>BERTopic automatically identified sentences that seemed to contain mostly OCR noise, such as:<\/p>\n\n\n\n<p>\u201c&#8217;of th tr th co to SC Re si of Pr we ir ic lE sj Cc pl SC tl a p M BI BI BI Unknown &#8216;\u201d<\/p>\n\n\n\n<p>\u201cII I I f II II, Id, &#8216; Id 1 ii I 1), I &#8221; \/I 1)1 Jl, &#8220;&#8221; 1\/l, IIJ) 11111, 1 Ii pl. I &#8220;&#8221;\u2022<\/p>\n\n\n\n<p>11, 11 I, , Ill I I rl I \u2022 \/ 11111,11 1 I d Ill II rill\u2022 I&gt; ,y<\/p>\n\n\n\n<p>&#8221; &#8216; ,1s n lt,1 11 111 . 1q11Jr,,11y Il l I, I II Ill\u2022 l 1, Ii , IH 1 1 , , m, I 1 &#8216;\u201d<\/p>\n\n\n\n<p>and grouped these together into one topic, which makes up about 800 sentences (~4% of the total sentences) in the collection. There are probably more sentences in other topics that will have some (potentially minor) OCR noise, but these seem to be the ones with major OCR noise.<\/p>\n\n\n\n<p><strong>Conclusion and next steps<\/strong><\/p>\n\n\n\n<p>As we\u2019ve explained above, we\u2019ve used a Jupyter Notebook and the BERTopic tool to topic model the Camel Papers. 
With this approach the only text corpus the tool has been exposed to is these 37 working papers, and the topic clusters and themes that it is drawing out should be solely intrinsic to the papers themselves.<\/p>\n\n\n\n<p>Even from our quick whizz through here, where we\u2019re trying to show what we\u2019ve done in the best possible light, there are all manner of glitches and issues with these topics (for instance, some sets of words have been grouped together just because they are all French).<\/p>\n\n\n\n<p>So refining the process is definitely going to be one of the things we concentrate on next.<\/p>\n\n\n\n<p>Another is going to be trying to use it to compare different corpora. As part of a related project, we plan to digitise the radical Tricontinental journal produced by OSPAAL in Cuba, and our initial thoughts are that we might take these mid- to late-20th-century New Left anti-imperialist texts and compare them with the Internet Archive\u2019s repository of the original writings of Marx and Engels.<\/p>\n\n\n\n<p>Oh, and Tricontinental also has a wealth of images, which is another AI area to explore\u2026<\/p>\n\n\n\n<p>But for these and other projects to progress, we need our third wheel. Alongside collection librarians and digital humanists, we need to enlist the help of experts in the field, ideally researchers based in the Global South, to tell us which collections we should be prioritising, what questions to ask, and how valid the answers and the modelling we are producing are.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The aim of this post is just to give a very quick overview of a joint project between the Library and the Data-Intensive Science Centre at the University of Sussex (DISCUS). 
This began with a successful proposal to the 2024<span class=\"ellipsis\">&hellip;<\/span><\/p>\n<div class=\"read-more\"><a href=\"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2024\/07\/30\/discus-library-joint-project\/\">Read more &#8250;<\/a><\/div>\n<p><!-- end of .read-more --><\/p>\n","protected":false},"author":412,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"spay_email":""},"categories":[123513],"tags":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":1015,"url":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2024\/06\/05\/exploring-different-approaches-to-using-tricontinental-and-mujeres-in-your-research-from-a-library-perspective\/","url_meta":{"origin":1122,"position":0},"title":"Exploring different approaches to using Tricontinental and Mujeres in your research from a library perspective","date":"5 June 2024","format":false,"excerpt":"Reposted from the BLDS Legacy Collection Blog By Danny Millum A little belatedly we wanted to write up the details of the \u2018Exploring different approaches to using\u00a0Tricontinental\u00a0and\u00a0Mujeres\u00a0in your research from a library perspective\u2019 workshop, which took place on Monday 22 April in the Global Studies Resource Centre. 
It was organised\u2026","rel":"","context":"In &quot;Uncategorised&quot;","img":{"alt_text":"A slide from a presentation at the workshop","src":"https:\/\/i2.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2024\/05\/20240422_111846-scaled.jpg?fit=1200%2C900&ssl=1&resize=350%2C200","width":350,"height":200},"classes":[]},{"id":328,"url":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2021\/01\/28\/for-security-reasons-it-may-not-be-prudent-to-unfold-where-i-am-ghanas-1978-electoral-commissioners-letter-from-hiding-surfaces-in-the-blds-legacy-collect\/","url_meta":{"origin":1122,"position":1},"title":"\u2018For security reasons it may not be prudent to unfold where I am\u2019 \u2013 Ghana\u2019s 1978 electoral commissioner\u2019s letter from hiding surfaces in the BLDS Legacy collection","date":"28 January 2021","format":false,"excerpt":"By Danny Millum - BLDS Metadata and Discovery Officer Cataloguing on the BLDS Legacy Collection project has now reached Ghana, and we\u2019ve just unearthed a fascinating letter from a dramatic time in that country\u2019s political history. On 30 March 1978 the country\u2019s Supreme Military Council, led by Col. Ignatius Kutu\u2026","rel":"","context":"In &quot;BLDS (British Library for Development Studies)&quot;","img":{"alt_text":"Image of letter written on a typewriter from I.K. 
Abban to I.K Acheampong","src":"https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2021\/01\/Ghana-Annan-letter-.jpeg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":433,"url":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2021\/04\/30\/well-thats-a-lot-of-pamphlets\/","url_meta":{"origin":1122,"position":2},"title":"Well that's a lot of pamphlets....","date":"30 April 2021","format":false,"excerpt":"BLDS Legacy Collection By Caroline Marchant-Wallis - BLDS Metadata and Discovery Officer I was chatting to my Librarian mentor recently about how we approached starting the BLDS Legacy Collection project, and I realised it was a good question. What did we do? Having been caught up in the whirlwind of\u2026","rel":"","context":"In &quot;BLDS (British Library for Development Studies)&quot;","img":{"alt_text":"2 shelves on wall containing pamphlets above a card catalogue","src":"https:\/\/i1.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2021\/04\/Shelving-along-store-2-wall-right-hand-side-scaled.jpg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":1010,"url":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2024\/05\/17\/using-ai-to-explore-collections-at-the-university-of-sussex\/","url_meta":{"origin":1122,"position":3},"title":"Using AI to Explore Collections at the University of Sussex","date":"17 May 2024","format":false,"excerpt":"We're excited to share some groundbreaking work our systems librarian Tim Graves has been doing in collaboration with Danny Millum from our Collections team and DISCUS, the Data Intensive Science Center on campus. 
His focus has been on leveraging the latest advancements in artificial intelligence (AI) to unlock the hidden\u2026","rel":"","context":"In &quot;British Library of Development Studies&quot;","img":{"alt_text":"","src":"https:\/\/i1.wp.com\/img.youtube.com\/vi\/i7df_XcV3Mk\/0.jpg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":475,"url":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2021\/05\/14\/well-meet-again-or-how-i-gambled-away-vera-lynns-autograph-and-ended-up-in-a-zambian-jungle-with-a-bunch-of-hippies\/","url_meta":{"origin":1122,"position":4},"title":"We\u2019ll meet again \u2013 or how I gambled away Vera Lynn\u2019s autograph and ended up in a Zambian jungle with a bunch of hippies\u2026","date":"14 May 2021","format":false,"excerpt":"By Danny Millum - BLDS Metadata and Discovery Officer Normally when you tell your family \/ friends about what you do, unless you\u2019re a fireman or a nurse they just zone out (especially when your job title is Metadata Discovery Officer). But it really seems as if the BLDS was\u2026","rel":"","context":"In &quot;BLDS (British Library for Development Studies)&quot;","img":{"alt_text":"Black and white image of Danny's Great uncle in Burma in","src":"https:\/\/i0.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2021\/05\/Dannys-great-uncle-e1620922642918.png?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]},{"id":550,"url":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/2021\/10\/08\/public-opinion-and-conflict-in-south-asia\/","url_meta":{"origin":1122,"position":5},"title":"Public opinion and conflict in South Asia","date":"8 October 2021","format":false,"excerpt":"A brief look at the Bangladesh Liberation War through the holdings of the BLDS Legacy Collection One of the main aims of the British Library for Development Studies Legacy Collection (BLDS) project is outreach and promotion. 
As part of this we are assisting with some teaching sessions at Sussex in\u2026","rel":"","context":"In &quot;BLDS (British Library for Development Studies)&quot;","img":{"alt_text":"Front page of Pakistan News Vol. XXIII No.14 July 15, 1991","src":"https:\/\/i1.wp.com\/blogs.sussex.ac.uk\/librarycollections\/files\/2021\/10\/Pakistan-Pakistan-News-1.jpeg?resize=350%2C200&ssl=1","width":350,"height":200},"classes":[]}],"_links":{"self":[{"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/posts\/1122"}],"collection":[{"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/users\/412"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/comments?post=1122"}],"version-history":[{"count":3,"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/posts\/1122\/revisions"}],"predecessor-version":[{"id":1131,"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/posts\/1122\/revisions\/1131"}],"wp:attachment":[{"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/media?parent=1122"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/categories?post=1122"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.sussex.ac.uk\/librarycollections\/wp-json\/wp\/v2\/tags?post=1122"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}