The Google Perspective API and minority discrimination
Social media platforms have revolutionized public debate in recent years. Everyone with an internet connection is able to participate in public discourse, publish their opinion and share content, regardless of skin colour, origin or sexual orientation. Some even understand social media in a Habermasian way as the perfect place for public debate, laying the foundation for rational decision-making based on the objective interpretation of arguments and thus for a true democracy (Willis 2020, 513).
However, recent developments, such as the rise of extremist content or the spread of misinformation, which led, for example, to the questioning of election results and the attack on the Capitol in the United States, have shown that social media are far from being the foundation of democratic discourse and may, on the contrary, even endanger democracy. More and more actors are therefore calling for stronger content moderation to regulate extreme forms of content. What opponents see as "censorship of free speech" is for others a necessary and inclusionary tool to avoid the silencing of voices and opinions "through (…) dignitary harms in the absence of intervention" (Rieder & Skop 2021, 2).
As the sheer volume of online expression, much of it psychologically taxing for reviewers, makes exclusively manual, human moderation impossible, machine learning techniques are increasingly implemented as "cheap and effective solutions" to moderate online content (Rieder & Skop 2021, 2).
One such machine learning solution is "Perspective API", developed by the Google unit Jigsaw in collaboration with Wikipedia and The New York Times (Rieder & Skop 2021, 2). Perspective API was trained on millions of manually reviewed comments from Wikipedia and The New York Times and identifies abusive comments. "Based on the perceived impact the text may have in a conversation", Perspective API assigns a toxicity score ranging from 0 to 1 (Perspective API, n.d.). Comments that obtain a score above 0.7 are considered "very likely toxic". Publishers can use the assigned score to pre-sort comments and facilitate the reviewing and moderation process (Perspective API, n.d.). The Perspective model provides toxicity scores in 17 languages, as well as scores for other attributes such as "severe toxicity", "insult", "profanity", "identity attack", "threat" and "sexually explicit". Perspective API is used by companies and newspapers such as The New York Times, El País and the local newspaper Southeast Missourian (ibid.).
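In practice, the score is obtained through a simple REST call. The following Python sketch illustrates how a single comment might be scored against the TOXICITY attribute and compared to the 0.7 threshold; it assumes the v1alpha1 Comment Analyzer endpoint as documented at the time of writing and a valid API key, and the helper name toxicity_score as well as the example comment are our own illustrative choices rather than part of Perspective's documentation.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: a real key must be requested from Google
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

def toxicity_score(text: str, lang: str = "en") -> float:
    """Request a TOXICITY score (0-1) for a single comment."""
    payload = {
        "comment": {"text": text},
        "languages": [lang],
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = requests.post(URL, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

if __name__ == "__main__":
    score = toxicity_score("Thanks for sharing your perspective!")
    # Comments scoring above 0.7 are treated as "very likely toxic"
    print(score, "very likely toxic" if score > 0.7 else "below threshold")
```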
The key problem of a machine learning hate speech detector, however, is that the term "hate speech" is, despite its frequent incorporation into political and legal documents, not universally defined (Baider et al. 2017, 3). Some define hate speech as "the expression of hatred towards an individual or group of individuals on the basis of protected characteristics", without further defining the "protected characteristics" (OSCE/ODIHR 2009). Others, such as the European Court of Human Rights, declared that "all forms of expression which spread, incite, promote, or justify hatred based on intolerance" should be sanctioned (Council of Europe 2020). Baider et al. propose two different categories of hate speech: hard hate speech, "which comprises prosecutable forms that are prohibited by law", and soft hate speech, "which is lawful but raises serious concerns in terms of intolerance and discrimination" (Baider et al. 2017, 4). Perspective API is not based on a definition of hate speech but on a definition of toxicity, and assigns a toxicity score to all comments that are "rude, disrespectful, or unreasonable (…)" and "likely to make you leave a discussion" (Perspective API, n.d.). What constitutes rude or disrespectful speech is not further defined.
However, what counts as rude or disrespectful is often context-dependent. The use of derogatory terms by an outgroup might be seen as toxic, but some social groups have reclaimed derogatory terms as a tool for identity formation and in-group cohesion (Croom 2011). Previous studies have already established that hate speech classification algorithms, including Perspective, have trouble dealing with ambiguous, context-dependent terms. Toxic or hate speech detection algorithms have in the past struggled with identity terms like "gay" or "Black" (Blue 2017). Further, the use of slurs in a positive way is often mislabelled as toxic or hate speech (Röttger et al. 2021). This can have detrimental effects, since the algorithm silences the very groups it is intended to protect. To delve further into these potential effects, we sought to investigate the following question:
In what ways does Perspective's toxic speech classification algorithm discriminate against positive speech by minority groups?
Our objective is not to assess the performance of the algorithm but rather to critically investigate its functioning. Positive speech is defined as any speech supportive of minority groups. We start our analysis with a literature review, followed by the presentation of our methodology and datasets as well as our data analysis. In the final part, we present the contributions of our research and discuss further policy implications.
The role of algorithms in our society, as well as the more specific topic of online content moderation through machine learning, has been widely discussed in the literature. The importance of technical artifacts in the shaping of societal and social interactions has long been explored in philosophy and sociology. Thinkers such as the American philosopher John Dewey or Langdon Winner explore the co-evolution of technology and politics. According to Langdon Winner (1980), "machines, structures, and systems of modern material culture can be accurately judged […] for the ways in which they can embody specific forms of power and authority". Today, the rising influence of algorithms in every aspect of our lives, together with the many biases discovered in their application, raises concerns about the need for control over the outcomes of such computational systems. Technology should serve its users rather than deepen and obscure the inequalities latent in our society.
Cases of discrimination by algorithmic systems are constantly being discovered. At the same time, governments around the world are demanding that private companies investigate solutions to platform governance issues such as hate speech and misinformation. In this dynamic, algorithmic moderation systems are increasingly being developed to conduct large-scale content moderation on major online platforms such as Twitter, Facebook, YouTube and newspaper websites (Gorwa, Binns & Katzenbach, 2020).
The scale of what computation has made possible is hardly manageable by humans, which creates a paradox: algorithmic systems appear to be the only viable tools for moderating online content, so companies rely on additional computation to solve problems produced by those very same tools, at a scale that has become inconceivable for human review. Content moderation is defined as the "governance mechanisms that structure participation in a community to facilitate cooperation and prevent abuse" (Grimmelmann, 2015). Its role is to classify user-generated content in order to help human moderators take decisions on a governance outcome such as removal, geo-blocking or account takedown (Gorwa, Binns & Katzenbach, 2020).
For Gorwa, Binns & Katzenbach (2020), "algorithmic moderation has become necessary to manage growing public expectations for increased platform responsibility, safety and security on the global stage; however, […] these systems remain opaque, unaccountable and poorly understood". The authors point out three specific worrying controversies about algorithmic moderation: increased opacity, issues of fairness and justice, and the obscuring of the political nature of speech decisions. In the case of toxic speech, machine learning algorithms are trained with large amounts of data to recognise and flag "toxic" comments. By filtering some of the information, algorithms such as Perspective allow moderators to review the vast number of online comments more rapidly. While automation in moderation is growing, the need for a human to draw the line between acceptable and unacceptable speech will most likely always remain (Gorwa, Binns & Katzenbach, 2020).
Faced with a growing number of public controversies, many platform companies turn to machine learning and artificial intelligence to moderate the content produced by their vast user bases, as a response to the scale of the problem (Gillespie, 2020). These techniques therefore remain an assistant to human moderators. According to Gillespie, "the immense amount of the data, the relentlessness of the violations, the need to make judgments without demanding that human moderators make them, all become the reason why AI approaches sound so desirable, inevitable, unavoidable to platform managers" (Gillespie, 2020).
Whether computation is adequate to the complexity of language and to the contexts of online discussion remains, as the literature frequently points out, uncertain. Indeed, even humans sometimes struggle to judge whether a comment should be removed on account of its potential "toxicity".
This difficulty has been demonstrated in relation to identity terms. In the early days of Perspective, Blue (2017) demonstrated that the API was very easily triggered by the presence of neutral identity terms like "gay" or "black". Back then, sentences like "I am a gay black woman" would be labelled as toxic. Blue (2017) ascribes this to biases present in Perspective's training set. Today, most of the sentences from her article no longer trigger the algorithm. A more recent study by Röttger et al. (2021) evaluates how different hate speech classification systems, including Perspective, deal with hate towards, and speech by, different identity groups. They find that Perspective struggles especially with re-statements of negative language, including reclaimed slurs, quotes, and the negated use of hateful language. Interestingly, they also find that the Perspective algorithm appears to have very similar accuracy scores across social groups.
Xia et al. (2020) discuss a more specific issue: the misclassification of African American English (AAE) as hate speech. AAE contains words often seen as unprofessional or "offensive" (King and Kinzler 2020). For example, speakers of AAE might use the n-word instead of less "offensive" terms like "bro". These biases are taken up by hate speech classification algorithms, which label AAE as toxic or hate speech more often than speech by other groups.
The studies discussed above show that some marginalized groups might be disadvantaged by a widespread implementation of hate speech classification algorithms. This is especially the case when such algorithms operate automatically, which would amount to silencing, or enforcing self-censorship on, the groups most often affected by hate speech online. Our research thus seeks to establish whether the biases discussed in the literature persist, and to identify patterns in how the algorithm is triggered.
We have decided to focus our analysis on four social groups that prominently fight against their marginalization via social media: the anti-racism movement (example hashtag: #blackempowerment), feminism (#womenempowerment), the LGBT community (#pride), and body positivity (#allbodiesarebeautiful). For each movement, we identified five prominent hashtags that are primarily aligned with positive speech. Several examples are mentioned above; the full list can be found in the code in the repository. Per group, we scraped 25,000 tweets using minet. Minet (https://github.com/medialab/minet) is a web mining library for Python that can extract content from, among other sources, Twitter. We limited our search to original tweets only (i.e., no replies) and excluded tweets with links in order to exclude images and articles. The focus on original tweets results in a higher-quality dataset: replies often consist purely of hashtags or are hard to interpret out of context, and tweets with links often include little genuine text. The limit of 25,000 tweets per group was set by the constraints of our project: the Perspective API runs at a maximum speed of one query per second, so 25,000 tweets allowed us to run one dataset per night.
In a second step, we cleaned the datasets by removing elements that distort the data: emojis, symbols, hashtags, user tags (@), punctuation, digits and tabs. This was carried out with the open-source software R.
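The cleaning itself was done in R; as an illustration, an equivalent minimal sketch in Python (regex-based, with a made-up example tweet) could look like this:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip elements that distort scoring: tags, hashtags, links,
    emojis/symbols, punctuation, digits and excess whitespace."""
    text = re.sub(r"@\w+", " ", text)         # user tags (@handle)
    text = re.sub(r"#\w+", " ", text)         # hashtags
    text = re.sub(r"http\S+", " ", text)      # residual links
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation, symbols, emojis
    text = re.sub(r"\d+", " ", text)          # digits
    return re.sub(r"\s+", " ", text).strip()  # tabs and repeated whitespace

print(clean_tweet("Love this! 💜 #pride @friend https://t.co/xyz 123"))
# -> "Love this"
```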
The final datasets were then run through the Perspective API to assign a toxicity score to each tweet. Here, we relied on the peRspective library (https://github.com/favstats/peRspective/), which significantly simplifies access to the API from R. The R script used for the analysis is shared in the repository.
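Our scoring was done with peRspective in R; for readers working in Python, a comparable loop might look like the sketch below, which reuses the hypothetical toxicity_score() helper from the earlier sketch and sleeps for one second between requests to respect the rate limit (the file layout and the column name "text" are assumptions).

```python
import csv
import time

def score_file(in_path: str, out_path: str) -> None:
    """Score cleaned tweets one by one, staying under ~1 query/second."""
    with open(in_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)  # assumes a 'text' column
        writer = csv.DictWriter(fout, fieldnames=["text", "toxicity"])
        writer.writeheader()
        for row in reader:
            try:
                score = toxicity_score(row["text"])  # helper sketched earlier
            except Exception:
                score = None  # skip tweets the API rejects
            writer.writerow({"text": row["text"], "toxicity": score})
            time.sleep(1)  # Perspective's quota: roughly one query per second

score_file("pride_cleaned.csv", "pride_scored.csv")
```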
One of the inherent limitations of the way we built our datasets comes from the choice of hashtags, which may or may not be used to express support for our four groups. To gain an initial overview of our data, we estimated the proportion of anti-BLM, anti-body positivity, anti-feminism or anti-LGBTQ tweets by manually labelling, for each dataset, the 100 most toxic tweets according to Perspective and a random sample of 100 tweets. The details of this analysis are presented below.
We then created scattertext visualizations in Python. Scattertext (https://github.com/JasonKessler/scattertext) is a Python library that creates helpful data visualizations for texts. It extracts terms from texts and analyses them based on a chosen classifier, here a toxicity score above or below 0.7. This threshold is the one advised by Perspective API and used by The New York Times. The visualizations, shown further below, place terms on a scale from infrequent to frequent per category. This allowed us to identify the main terms present primarily in toxic tweets, which guided our further analysis. Given the large amount of noise in our dataset, we could not rely only on quantitative methods like scattertext. We therefore extracted the underlying dataset the scattertext is built upon, which is essentially a frequency list of the terms used in the tweets per category.
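A condensed sketch of this step is shown below: it labels tweets using the 0.7 threshold, builds a scattertext corpus, writes the interactive HTML plot, and exports the underlying term frequency table. The file and column names are our own assumptions; the scattertext calls follow the library's documented basic usage.

```python
import pandas as pd
import scattertext as st

df = pd.read_csv("pride_scored.csv")  # assumed columns: 'text', 'toxicity'
df["label"] = df["toxicity"].gt(0.7).map({True: "toxic", False: "non-toxic"})

# Build the corpus of tweets, split by the toxicity label
corpus = st.CorpusFromPandas(
    df, category_col="label", text_col="text",
    nlp=st.whitespace_nlp_with_sentences,
).build()

# Interactive visualisation (one .html file per dataset)
html = st.produce_scattertext_explorer(
    corpus,
    category="toxic", category_name="Toxic", not_category_name="Non-toxic",
    minimum_term_frequency=5, width_in_pixels=1000,
)
with open("pride_scattertext.html", "w", encoding="utf-8") as f:
    f.write(html)

# Frequency list underlying the plot: term counts per category
term_freq = corpus.get_term_freq_df()
```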
We used these frequency lists to guide a qualitative analysis. First, we identified all terms that appear more often in toxic than in non-toxic tweets. Second, we ordered these terms by frequency, beginning with the most common and thus most relevant terms. Based on this list and an initial screening of the tweets, we identified key terms that are also used positively, such as "gay", "Blacks", "vaginas", or "ass". We then filtered the original tweet dataset to read and analyse "toxic" tweets containing the identified terms. Out of these lists, we identified tweets that appeared representative of a wider misidentification of "toxicity". For example, the term "vagina" appeared 55 times, 51 of them in toxic tweets. This signals a likely systematic error of the Perspective system, since "vagina" is also an anatomical term. We then marked several tweets as likely mismatches in order to inductively identify common patterns in the algorithm's behaviour. Given the size of the datasets, this allowed us to get a good understanding of the main "triggers" of the algorithm.
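Continuing the previous sketch, the filtering and ordering of toxic-leaning terms might be expressed as follows, assuming scattertext's default column naming in which counts appear as "<category> freq":

```python
# term_freq comes from corpus.get_term_freq_df() in the previous sketch;
# with our labels, the count columns are "toxic freq" and "non-toxic freq".
toxic_leaning = term_freq[term_freq["toxic freq"] > term_freq["non-toxic freq"]].copy()
toxic_leaning["total"] = toxic_leaning["toxic freq"] + toxic_leaning["non-toxic freq"]
toxic_leaning = toxic_leaning.sort_values("total", ascending=False)

# Most frequent terms that lean toward the "toxic" category,
# used as the starting point for the qualitative reading of tweets
print(toxic_leaning.head(50))
```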
Perspective's understanding of "toxic language" is highly dependent on normative decisions: the use of slurs, for example, may or may not be seen as inherently "rude, disrespectful and unreasonable", especially in the context of activism, where they might be "appropriated" in a non-derogatory manner, for instance "to strengthen in-group solidarity" (Croom 2010: 243). By contrast, our definition of negative language (as any speech critical of BLM, feminism, body positivity or LGBTQ) and positive language (as any speech supportive of BLM, feminism, body positivity or LGBTQ) helps us label our datasets without making too many normative assumptions. And yet, some of the tweets may not be clearly "critical" or "supportive", or we may lack sufficient context to decide; in such cases, tweets are labelled as "unclear". Rather than labelling all 100,000 tweets, we use two types of samples for each dataset: a random sample of 100 tweets and the 100 tweets with the highest toxicity scores.
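Drawing these two samples from a scored dataset is straightforward; a minimal sketch with assumed file and column names, and an arbitrary random seed for reproducibility, is shown below.

```python
import pandas as pd

df = pd.read_csv("pride_scored.csv")  # assumed columns: 'text', 'toxicity'

high_toxicity_sample = df.nlargest(100, "toxicity")  # 100 highest-scoring tweets
random_sample = df.sample(n=100, random_state=42)    # reproducible random sample

# Both samples are then hand-labelled as positive / negative / unclear
high_toxicity_sample.to_csv("pride_high_toxicity_to_label.csv", index=False)
random_sample.to_csv("pride_random_to_label.csv", index=False)
```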
Random samples show that the selected hashtags generally allow us to identify tweets with positive language: less than 10% of tweets in the random samples are anti-BLM, anti-body positivity, anti-feminism or anti-LGBTQ tweets. However, the high toxicity samples show that a significant share of the tweets with the highest toxicity scores contains negative language (16 to 35% of high toxicity tweets are anti-BLM, anti-body positivity, anti-feminism or anti-LGBTQ tweets). The relatively high proportion of negative tweets among high toxicity tweets in the feminism, body positivity and LGBTQ datasets suggests that the hashtags selected for these three movements are more often co-opted to express anti-feminist, anti-body positivity, or anti-LGBTQ ideas.
The higher proportion of negative language within high toxicity samples compared to random samples may indicate that the algorithm is relatively successful in identifying tweets containing negative language. Conversely, high toxicity samples also show that a significant number of tweets among those which obtained the highest toxicity scores contain positive language (42 to 74% of high toxicity tweets are pro-BLM, pro-body positivity, pro-feminism or pro-LGBTQ tweets). In the following, we try to understand why these tweets are labelled as highly toxic by Perspective, in particular whether "identity terms" play a role in triggering the algorithm.
Scattertext is a Python tool that enables users to interactively visualise differences between two corpora of text (Kessler, 2017). Using this tool allowed us to compare the frequencies of word use in tweets classified as toxic with those in non-toxic tweets. In Figure 1, a scattertext graph shows these differences for the LGBTQ dataset. Overall, 6,429 out of 25,000 tweets are labelled as toxic, almost three times as many as in the other datasets. To understand the wealth of information contained in the graph, it is important to know how the distribution is organised. Each point on the graph represents a word: the higher it is ranked on the y-axis, the more often it was used in toxic tweets, and the further right it lies on the x-axis, the more often it was used in non-toxic tweets. The closer a point is to one of the axes, the more exclusive its usage is to that category. A dark blue dot in the top left corner therefore represents a word often used in toxic tweets but rarely in non-toxic tweets. In Figure 1, these words are most notably swear words and references to sexual matters. The lighter points represent words used indiscriminately in both categories; the most frequent of these are located in the top right-hand corner. They include words referring to sexual orientations that were often part of our hashtag queries, such as gay, lgbt or queer. Frequent non-toxic and infrequent toxic words include counseling, belief, gender or nation. No clear indication of the context of these words or the connotations of their meaning can be inferred.
Figure 1: Visualisation of words used in toxic and non-toxic tweets, LGBTQ
Figure 2 visualises the same relationship between toxic and non-toxic tweets for the body positivity corpus. At first glance, the distribution looks similar to Figure 1. It is noticeable that fewer dark blue points appear in the top left corner, suggesting that fewer words were used exclusively in toxic tweets. These high-precision words include swear words and references to body parts, mostly sexual organs and genitals. A similarity to the first figure is the lack of distinctly non-toxic and highly frequent words. In this case, the words in the bottom left corner include fancy, scale, mindfulness, or holiday, and again no coherent meaning for this cluster could be recognised. The top right corner displays the most used words; as expected, these are either words that are inherently characteristic of this dataset, for reasons of the selection criteria, or generally frequent words (at, the, to, …).
Figure 2: Visualisation of words used in toxic and non-toxic tweets, Body Positivity
Figure 3 is the scattertext plot for our dataset containing tweets focused on anti-racism. This dataset has the lowest count of tweets classified as toxic, with 840 out of 24,160. Our analysis of the distribution of positive and negative support for this movement showed that most tweets in this category are indeed non-toxic and might use offensive or reclaimed terms to address issues salient for this group. The distinct cluster of toxic, frequent terms is again composed of offensive language, but for the distinctly non-toxic and frequent terms some coherence is visible: the terms breonna, justiceforall and say their (names) all refer to victims of often deadly police brutality in the USA. One limitation of this dataset is revealed by contamination from other languages despite the language filter applied, with terms such as German greetings (morgen, guten morgen) or Hindi words (जस). This indicates that more thorough data cleaning would have been appropriate and that Twitter's selection filters do not work consistently.
Figure 3: Visualisation of words used in toxic and non-toxic tweets, Anti-Racism
Figure 4 concerns the last social movement, feminism. This scattertext includes fewer infrequent terms overall, indicating a more coherent set of words used in both toxic and non-toxic tweets. More words than in the other plots are clustered in the top right corner. These most frequently and indiscriminately used terms are again closely related to our selection criteria or are commonly used words. Characteristic of this dataset are the words womenempowerment, feminism, womensday, womensupportingwomen and patriarchy, which are mostly used in a non-toxic context.
Figure 4: Visualisation of words used in toxic and non-toxic tweets, Feminism
Scattertext allowed us to effectively visualise the frequency differences between toxic and non-toxic words in all four datasets. Common to all graphs is that the distinct and highly frequent toxic words are most often swear words, sexual references, references to genitals, or slurs. In none of the datasets can a conclusive and unique cluster of distinctly non-toxic and frequently used words be discerned. Fully exploiting the amount of information contained in each chart, including the possibility of using it interactively to inspect the exact usage and context of terms, is beyond the scope of this work. The corresponding .html files, which allow viewing the diagrams and gaining a deeper understanding of them, are included in the GitHub repository.
The results of our analysis are shown in Table 1. Our qualitative analysis involved a large corpus of tweets, given that many of the flagged tweets are indeed toxic or offensive. Several key words and patterns emerged from the data that signal recurring problems of the algorithm. Overall, the analysis shows that the algorithm struggles to take context into account when faced with potentially offensive words. These can be neutral identity descriptions like "Black" or "gay", body parts like "vagina", or reclaimed slurs like the n-word or "bitch". A grey zone exists around "offensive" denunciations of problems, like "fuck racism": while such expressions clearly contain "offensive" language, it is used to denounce a harmful practice.
For the anti-racism group, some of the problems described by Blue (2017) re-emerge. Unlike in her analysis, the algorithm no longer appears to be directly triggered by identity descriptions alone. However, it remains very sensitive to the use of Black identity terms in combination with words that can be seen as potentially harmful, like "beating" in the example below. The algorithm draws a clear red line at racial slurs, not allowing for any reclamation, as practised for example in African American English. A last issue emerges in the denunciation of racism: the word itself, as well as derivatives like "racist", can trigger the algorithm, as seen in the example below. It can therefore be difficult for the anti-racist community to share experiences and denounce harmful practices.
For the feminist movement, some speech is silenced merely for referencing body parts: neutral talk about female body parts can easily trigger the algorithm. For example, the sentence "I have a vagina" (not found in the tweets) is assigned a likelihood of 81% of being toxic. This issue also emerges for the body positivity movement. Another issue emerged in the denunciation of gender-based crime: the sentence "I was raped", for example, is rated as 71% likely to be toxic. For the feminist movement, it can thus be difficult to reclaim their bodies through online speech and fight back against oppressive practices.
The algorithm appears to struggle considerably with speech by the LGBTQ+ community. The word "gay" in particular nearly always triggers the algorithm. Compared to the findings by Blue (2017), there appear to have been some improvements; however, the algorithm is still triggered by the repeated mention of LGBTQ+ identity terms.
Table 1: Main identified problems per cluster. The bottom row shows representative example tweets with the corresponding toxicity score assigned by the Perspective API.
Our study contributes to both algorithmic studies, more specifically critical approaches to normative assumptions embedded within algorithms (especially here in the definition of "toxic language"), and research on toxic speech detection and the mitigation of algorithmic discrimination. It demonstrates that Google's Perspective algorithm can flag tweets containing language supportive of four minority groups/activist movements (Feminism, Body Positivity, Black Lives Matter and LGBTQ) as "toxic" and investigates the reasons why this is the case. Perspective's assessment of toxicity involves strong normative assumptions regarding morally acceptable and non-acceptable forms of speech online. In fact, according to Gorwa et al. (2020: 11), any "toxic speech classifier will have unequal impacts on different populations because it will inevitably have to privilege certain formalizations of offence above others".
The allocation of Perspective's toxicity scores relies almost exclusively on key words and associations of words, without taking into account elements of context such as "the semantic environment a conversation is embedded in, its conversational structure and dynamics, voting or flagging signals, and even the full comment history of a user" (Rieder & Skop 2021). Our study highlights the ways in which "identity terms" such as gay or black, but also references to body parts as well as slurs that may not all be derogatory (Croom 2011), trigger the algorithm, potentially leading to discrimination against minority groups and activist movements. Concerns over the discriminatory potential of Perspective with regard to "identity terms" are not new (Blue 2017), yet our study provides evidence of the persistence of the problem and shows how it can impact a large variety of groups, based on actual data scraped from Twitter. Moreover, it suggests that there are inherent limitations to "toxicity" as a category used to design automated tools for online speech moderation. Further research is needed to find quantifiable measures of the discriminatory potential of different algorithmic moderation tools operating on different models/attributes (for example, on some of the other attributes Perspective can score, such as "profanity" or "identity attack") and using more or fewer elements of conversational context.
Inputs from science and technology studies and the internet governance literature help us identify some of the main policy implications of our research. A significant part of the public sphere has shifted to large online platforms that enact their own rules, which has ultimately led to what some have referred to as the "privatisation of freedom of expression policies" (DeNardis & Hackl 2015: 766-767). As a reaction to the rapid spread of harmful speech online, governments have increasingly tried to regulate the activities of online platforms (see for example the German NetzDG), which have themselves invested in cost-efficient moderation tools. The fear of sanctions may encourage platforms to censor a large quantity of online speech according to procedures that lack transparency, accountability, and openness to public debate (Bloch-Wehba 2019: 66-67). Although Perspective does not claim to function as an autonomous moderation tool but rather as a facilitator of human moderation, the risk is that platforms will over-rely on automated decisions. Interestingly, the latest proposal for platform regulation at EU level, the Digital Services Act, acknowledges that specific groups can be "disproportionately" affected by "removal measures following from […] biases potentially embedded in […] automated content moderation tools". To mitigate the risk of discrimination, the proposal "will impose mandatory safeguards when users' information is removed including the provision of explanatory information", as well as "complaint" and "dispute resolution mechanisms" (European Commission 2020: 12).
However, remedy actions such as those provided for in the Digital Services Act will not be enough to mitigate the discriminatory potential of algorithmic moderation tools like Perspective, which should also be addressed at the root, that is, at the design stage of the algorithm. According to Rieder & Skop (2021: 3), Perspective cannot be reduced to a "singular object" but must be seen as "an entry point into a complex arrangement of work processes, technologies, partnerships, and normative choices". They identify two competing organisational logics within the Perspective project: a "multi-polar model" inspired by academic norms and open-source practices, and a "platform model" which is profit-oriented and involves "processes of cultural normalization" (Rieder & Skop 2021: 10). The multi-polar model materialises through "practices like open communication, code sharing, academic involvement, and the technical setup as a Web-API" (Rieder & Skop 2021: 13). However, the authors argue that Perspective tends to shift toward a more closed organizational structure, in which Google's supervising role constrains possibilities for dialogue and contradiction. This shift may reduce the ability and willingness of such projects to find ways of mitigating algorithmic discrimination, which is why policy-makers need to further encourage open and transparent multi-stakeholder initiatives in the field of online speech moderation.
Baider, F., Assimakopoulos, S., Millar, S. (2017). Online Hate Speech in the European Union. A Discourse-Analytic Perspective, SpringerOpen.
Bloch-Wehba, H. (2019). Global Platform Governance: Private Power in the Shadow of the State. SMU Law Review, 72(1), 27-80. URL: https://ssrn.com/abstract=3247372 [Accessed 12.12.2021].
Blue, V. (2017). Google's comment-ranking system will be a hit with the alt-right. Engadget, 1 September. URL:https://www.engadget.com/2017-09-01-google-perspective-comment-ranking-system.html [Accessed 12.12.2021].
Council of Europe (2020). Hate Speech. Fact sheet. URL: https://www.echr.coe.int/Documents/FS_Hate_speech_ENG.pdf.
Croom, A. (2011). Slurs. Language Sciences. 33(3), 343-358. URL:https://doi.org/10.1016/j.langsci.2010.11.005 [Accessed 12.12.2021].
DeNardis, L. & Hackl, A. (2015). Internet governance by social media platforms. Telecommunications Policy. 39(9), pp.761–770.
European Commission. (2020). Proposal for a Regulation of the European Parliament and the Council on a Single Market for Digital Services (Digital Services Act) and amending Directive 2000/31/EC. 2020/0361 (COD). URL:https://digital-strategy.ec.europa.eu/en/library/proposal-regulation-european-parliament-and-council-single-market-digital-services-digital-services [Accessed 12.12.2021].
Gillespie, T. (2020). Content moderation, AI, and the question of scale. Big Data & Society. URL:https://doi.org/10.1177/2053951720943234
Gorwa, R., Binns, R. & Katzenbach, C. (2020). Algorithmic content moderation: Technical and political challenges in the automation of platform governance. Big Data & Society. 1-15. URL:https://journals.sagepub.com/doi/full/10.1177/2053951719897945.
Grimmelmann, J. (2015). The Virtues of Moderation. 17 Yale J.L. & Tech. URL:https://digitalcommons.law.yale.edu/yjolt/vol17/iss1/2
Kessler, J. (2017). Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Link to preprint: arxiv.org/abs/1703.00565
King, S., & Kinzler, K. D. (2020). Op-Ed: Bias against African American English speakers is a pillar of systemic racism. LA Times. URL: https://www.latimes.com/opinion/story/2020-07-14/african-american-english-racism-discrimination-speech
Lopez, K., Muldoon, M., & McKeown, J. (2019). One Day of #Feminism: Twitter as a Complex Digital Arena for Wielding, Shielding, and Trolling Talk on Feminism. Leisure Sciences, 41(3).
OSCE/ODIHR (2009). Hate crime laws: A practical guide. OSCE guide. URL: https://www.osce.org/odihr/36426.
Perspective API (n.d.). Using machine learning to reduce toxicity online. Perspective API. URL: https://www.perspectiveapi.com/.
Rieder, B., & Skop, Y. (2021). The fabrics of machine moderation: Studying the technical, normative, and organizational structure of Perspective API. Big Data & Society. URL:https://journals.sagepub.com/doi/full/10.1177/20539517211046181 [Accessed 12.12.2021].
Röttger, P., et al. (2021). HateCheck: Functional tests for hate speech detection models. arXiv. URL: https://arxiv.org/pdf/2012.15606.pdf
Winner, L. (1980). Do Artifacts Have Politics? Daedalus. 109(1), 121-136. URL :http://www.jstor.org/stable/20024652
Willis, R. (2020). Habermasian utopia or Sunstein's echo chamber? The 'dark side' of hashtag hijacking and feminist activism. Legal Studies, 40, 507-526.
Xia, M., Field, A., & Tsvetkov, Y. (2020). Demoting racial bias in hate speech detection models. arXiv. URL: https://arxiv.org/abs/2005.12246