As part of the CATCH project, a benchmark dataset named EmoDIFT (EMOtion Detection In French Tweets) for multi-label emotion detection in French tweets has been published and is available for download.
The EmoDIFT dataset can be accessed at the following address: https://github.com/smnbrnrd/EmoDIFT
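As a minimal sketch of how the dataset could be accessed, the repository can be cloned and the annotated tweets loaded with pandas. The file name and column layout used below are assumptions for illustration only; the repository's README documents the actual structure.

```python
# Minimal sketch for loading the dataset after cloning the repository:
#   git clone https://github.com/smnbrnrd/EmoDIFT
# The file name below is a hypothetical placeholder, not the documented layout.
import pandas as pd

df = pd.read_csv("EmoDIFT/emodift_annotations.csv")  # hypothetical file name
print(df.shape)               # expected to contain the 2,215 annotated tweets
print(df.columns.tolist())    # inspect the actual column names
```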
This dataset is further described in [U. Malik, S. Bernard, A. Pauchet, C. Chatelain, R. Picot-Clémente and J. Cortinovis, "Pseudo-Labeling With Large Language Models for Multi-Label Emotion Classification of French Tweets," in IEEE Access, vol. 12, pp. 15902-15916, 2024, doi: 10.1109/ACCESS.2024.3354705.]
Description of the dataset:
The EmoDIFT dataset was created as part of the CATCH project, funded by the ANR and the Normandy region (AAP ANR RA-SIOMRI 2021). The aim is to design tools for automatically understanding written testimonies from a population affected by the health and environmental fallout of an industrial incident. The case study for this project is the fire at the Lubrizol plant in Rouen on 19 September 2019.
This dataset includes tweets collected through the API of the Twitter microblogging platform, over a period corresponding to the 15 months following the fire at the Lubrizol plant in Rouen, i.e. from 19/09/2019 to 31/12/2020. The dataset was then compiled in three steps (an illustrative collection sketch follows the list):
- the collection of 90,496 tweets containing the keyword « Lubrizol »
- a filtering step to retain the 12,508 tweets posted by the affected population
- the manual annotation of a random selection of 2,215 tweets
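As an illustration of the first step, a full-archive keyword search over the study period could be issued as sketched below. This is a hedged example using the tweepy client (search_all_tweets requires Twitter API v2 academic-research access and a bearer token), not the actual collection script used for the project; the language filter and page size are assumptions.

```python
# Illustrative sketch of the collection step: a full-archive keyword search
# over the 15 months following the fire. This is NOT the project's actual
# collection script; it assumes Twitter API v2 academic access via tweepy.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder token

tweets = []
# search_all_tweets is paginated; Paginator walks through all result pages.
for page in tweepy.Paginator(
    client.search_all_tweets,
    query="Lubrizol lang:fr",          # keyword used for the collection; lang filter is an assumption
    start_time="2019-09-19T00:00:00Z", # day of the fire
    end_time="2020-12-31T23:59:59Z",   # end of the study period
    max_results=500,
):
    tweets.extend(page.data or [])

print(f"Collected {len(tweets)} tweets")
```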
These 2,215 tweets were then labelled for multi-label emotion recognition tasks by human annotators. Each tweet was annotated by three different annotators, each instructed to read and evaluate the tweet with the goal of identifying up to three emotions from a predefined set of emotion labels. If the emotions expressed within the tweet belong to multiple emotion registers, the annotator is required to rank them in order of importance. This order is based on the predominance of the expressed emotions within the tweet. Specifically, the label associated with the dominant emotion is assigned a value of 1, while the second and third most dominant emotions are assigned values of 2 and 3, respectively. The ultimate goal of the annotation is to identify expressions of opinion, whether explicit ("I’m scared") or implicit ("Those are hydrocarbons in there, that’s dangerous"), taking only the semantic information into consideration.
The predefined emotion labels were selected from the second tier of Plutchik’s model of emotions. The following six emotions were used: anger, disgust, fear, surprise, sadness, and mistrust. In addition, irony was included as an optional label, since irony has been shown to be an important cue for emotion classification. Furthermore, the labels neutral and inexploitable were added to allow annotators to identify tweets that do not express any particular emotion or that are not relevant to the case study (for example, a tweet from a news outlet that was not filtered out in the pre-processing phase).
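To make the ranking scheme concrete, the sketch below lists the label set described above and shows a hypothetical per-annotator record in which the dominant emotion receives rank 1 and the second and third most dominant emotions receive ranks 2 and 3. The field names are illustrative only and are not the dataset's actual column names.

```python
# Label set described above, plus the optional and utility labels.
EMOTIONS = ["anger", "disgust", "fear", "surprise", "sadness", "mistrust"]
OPTIONAL = ["irony"]
UTILITY  = ["neutral", "inexploitable"]

# Hypothetical single-annotator record (field names are illustrative):
# rank 1 = dominant emotion, 2 = second most dominant, 3 = third.
annotation = {
    "tweet_id": 123456789,  # placeholder identifier
    "annotator": "A1",
    "ranks": {"fear": 1, "mistrust": 2, "anger": 3},
    "irony": False,
}
```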
The final step was to merge the three human annotations of each tweet, transforming the emotion ranks into a level of presence. The formula used for this purpose is not given here; the reader is referred to the article cited above for more details.