The tokenizer is able to identify hashtags and Twitter user names to the extent that these conform to the conventions used in Twitter. URLs and email addresses are not completely covered: the tokenizer counts on clear markers for these, and assuming that any sequence including periods is likely to be a URL proves unwise, given that spacing between normal words is often irregular. Moreover, actually checking the existence of a proposed URL was computationally infeasible for the amount of text we intended to process.
Finally, as the use of capitalization and diacritics is quite haphazard in the tweets, the tokenizer strips all words of diacritics and transforms them to lower case.

For those techniques where hyperparameters need to be selected, we used a leave-one-out strategy on the test material.
For each test author, we determined the optimal hyperparameter settings with regard to the classification of all other authors in the same part of the corpus, in effect using these as development material. In this way, we derived a classification score for each author without the system having any direct or indirect access to the actual gender of the author.
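The following sketch illustrates this protocol; it is a minimal reconstruction, not the authors' actual code. It assumes an authors-by-features NumPy matrix X, genders y coded as +1 and -1, a list grid of candidate hyperparameter settings, and uses scikit-learn's NuSVR as a stand-in for the learners described in Section 4.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import NuSVR

    def sign_accuracy(model, X, y):
        return float(np.mean(np.sign(model.predict(X)) == y))

    def dev_score(params, X_dev, y_dev):
        # Score one hyperparameter setting by cross-validation on the
        # development authors (all authors except the one being tested).
        scores = []
        for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev):
            model = NuSVR(kernel="rbf", **params).fit(X_dev[tr], y_dev[tr])
            scores.append(sign_accuracy(model, X_dev[te], y_dev[te]))
        return float(np.mean(scores))

    def leave_one_out_accuracy(X, y, grid):
        correct = 0
        for i in range(len(y)):                  # each author in turn
            dev = np.arange(len(y)) != i         # all other authors as dev material
            best = max(grid, key=lambda p: dev_score(p, X[dev], y[dev]))
            model = NuSVR(kernel="rbf", **best).fit(X[dev], y[dev])
            correct += np.sign(model.predict(X[i:i + 1]))[0] == y[i]
        return correct / len(y)                  # agreement with actual gender

    # Illustrative grid; the values actually searched were not preserved.
    grid = [{"nu": nu, "C": C} for nu in (0.3, 0.5, 0.7) for C in (1.0, 10.0)]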
We then measured for which percentage of the authors in the corpus this score was in agreement with the actual gender. These percentages are presented below in Section 5.

Profiling Strategies

In this section, we describe the strategies that we investigated for the gender recognition task.
As we approached the task from a machine learning viewpoint, we needed to select text features to be provided as input to the machine learning systems, as well as the machine learning systems that would use this input for classification.
We first describe the features we used (Section 4.1). Then we explain how we used the three selected machine learning systems to classify the authors (Section 4.2). The use of syntax or even higher-level features is for now impossible, as the language use on Twitter deviates too much from standard Dutch, and we have no tools to provide reliable analyses. However, even with purely lexical features, various options present themselves. Several errors could be traced back to the fact that the account had since moved on to another user. We could have used different dividing strategies, but chose balanced folds in order to give an equal chance to all machine learning techniques, including those that have trouble with unbalanced data.
If, in any application, unbalanced collections are expected, the effects of biases, and corrections for them, will have to be investigated.
Most of the feature types we explored rely on the tokenization described above. We will illustrate them with an example tweet (beginning "Hahaha ...").
Top function words: the most frequent function words (see Kestemont for an overview), as measured on our tweet collection. Of these, the example tweet contains ik, dat, heeft, op, een, voor, and het.
Then, we used a set of feature types based on token n-grams, with which we already had previous experience (Van Bael and van Halteren). For all feature types, we used only those features which were observed with at least 5 authors in our whole collection (for skip bigrams, at least 10 authors).
Unigrams: single tokens, similar to the top function words, but using all tokens instead of only a subset.
Bigrams: two adjacent tokens.
Trigrams: three adjacent tokens.
Skip bigrams: two tokens in the tweet, but not adjacent, without any restrictions on the gap size.
Finally, we included feature types based on character n-grams, following Kjell et al. We used the n-grams with n from 1 to 5, again only when the n-gram was observed with at least 5 authors.
However, we used two types of character n-grams. The first set is derived from the tokenizer output, and can be viewed as a kind of normalized character n-grams:
Normalized 1-grams: about … features.
Normalized 3-grams: about 36K features.
Normalized 4-grams: about …K features.
Normalized 5-grams: about …K features.
The second set of character n-grams is derived from the original tweets. This type of character n-gram has the clear advantage of not needing any preprocessing in the form of tokenization:
Original 1-grams: about … features.
Original 3-grams: about 77K features.
Original 4-grams: about …K features.
Original 5-grams: about …K features.
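As a concrete illustration of this feature inventory, the sketch below (our reconstruction, not the authors' code) collects token n-grams and normalized character n-grams that occur with at least 5 distinct authors; skip bigrams (threshold 10) and the original, untokenized character n-grams are omitted for brevity.

    from collections import defaultdict

    def token_ngrams(tokens, n):
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def char_ngrams(text, n):
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    def build_feature_set(author_tweets, min_authors=5):
        """author_tweets: dict mapping an author id to a list of tokenized tweets."""
        seen_by = defaultdict(set)
        for author, tweets in author_tweets.items():
            for tokens in tweets:
                for n in (1, 2, 3):                     # token uni-, bi-, trigrams
                    for gram in token_ngrams(tokens, n):
                        seen_by[("tok", n, gram)].add(author)
                text = " ".join(tokens)                 # normalized character n-grams
                for n in range(1, 6):
                    for gram in char_ngrams(text, n):
                        seen_by[("chr", n, gram)].add(author)
        return {feat for feat, who in seen_by.items() if len(who) >= min_authors}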
Again, we decided to explore more than one option, but here we preferred more focus and restricted ourselves to three systems. Our primary choice for classification was the use of Support Vector Machines. We chose Support Vector Regression (ν-SVR, to be exact) with an RBF kernel, as it had shown the best results in several earlier research projects. With these main choices, we performed a grid search for well-performing hyperparameters. The second classification system was Linguistic Profiling (LP; van Halteren), which was specifically designed for authorship recognition and profiling. Roughly speaking, it classifies on the basis of noticeable over- and underuse of specific features. Before being used in comparisons, all feature counts were normalized to counts per … words, and then transformed to Z-scores with regard to the average and standard deviation within each feature.
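A sketch of this normalization, under assumptions flagged here: counts is an authors-by-features count matrix, lengths gives each author's total word count, and the per-10,000-words rate is our own illustrative choice, as the exact rate was lost from the text.

    import numpy as np

    def normalize_features(counts, lengths, rate=10_000):
        per_words = counts / lengths[:, None] * rate   # counts per `rate` words
        mu = per_words.mean(axis=0)                    # average per feature
        sigma = per_words.std(axis=0)
        sigma[sigma == 0] = 1.0                        # guard against constant columns
        return (per_words - mu) / sigma                # Z-scores per feature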
For LP, the grid search investigated several further settings. The third classification system was the memory-based learner TiMBL. As the input features are numerical, we used IB1 with k equal to 5, so that we can derive a confidence value. The only hyperparameters we varied in the grid search are the metric (Numerical and Cosine distance) and the weighting (no weighting, information gain, gain ratio, chi-square, shared variance, and standard deviation).
However, the high dimensionality of our vectors presented us with a problem: for such high numbers of features, k-nn learning is known to be unlikely to yield useful results (Beyer et al.).
This meant that, if we still wanted to use k-nn, we would have to reduce the dimensionality of our feature vectors. We did this with principal component analysis (PCA): for each system, we provided the first N principal components, for various N.
In effect, this N is a further hyperparameter, which we varied from 1 to the total number of components (in principle equal to the number of authors), using a stepsize of 1 from 1 to 10, and then slowly increasing the stepsize up to a maximum of 20 for the higher ranges. Rather than using fixed hyperparameters, we let the control shell choose them automatically in a grid search procedure, based on development data.
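A sketch of this preprocessing step (an illustrative reconstruction): the vectors are reduced to their first N principal components, with N treated as one more hyperparameter chosen on the development data. The exact upper bounds of the search were lost, so the ranges below are assumptions.

    import numpy as np
    from sklearn.decomposition import PCA

    def first_n_components(X_dev, X_test, n):
        pca = PCA(n_components=n).fit(X_dev)   # components estimated on dev data
        return pca.transform(X_dev), pca.transform(X_test)

    # Stepsize 1 up to 10 components, then slowly increasing steps up to 20.
    candidate_n = list(range(1, 11)) + list(range(12, 100, 4)) + list(range(100, 641, 20))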
When running the underlying systems, the feature values were first scaled. As scaling is not possible when there are columns with constant values, such columns were removed first.
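The constant-column removal mentioned above, as a one-line helper (assuming a NumPy feature matrix X with authors as rows):

    import numpy as np

    def drop_constant_columns(X):
        """Zero-variance columns cannot be scaled to Z-scores, so drop them."""
        return X[:, X.std(axis=0) > 0]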
For each setting and author, the systems report both a selected class and a floating point score, which can be used as a confidence score.
In order to improve the robustness of the hyperparameter selection, the best three settings were chosen and used together for classifying the author in question.
Neither SVR nor LP treats the two classes fully symmetrically. For LP, this is by design: a model, called a profile, is constructed for each individual class, and the system determines for each author to which degree they are similar to the class profile. For SVR, one would expect symmetry, as both classes are modeled simultaneously and differ merely in the sign of the numeric class identifier.
However, we do observe different behaviour when reversing the signs. For this reason, we did all classification with SVR and LP twice, once building a male model and once a female model. For both models the control shell calculated a final score, starting with the three outputs for the best hyperparameter settings.
It normalized these by expressing them as the number of non-model class standard deviations over the threshold, which was set at the class separation value. The control shell then weighted each score by multiplying it by the class separation value on the development data for the settings in question, and derived the final score by averaging. It then chose the class for which the final score was highest. In this way, we also get two confidence values, viz. one for each of the two gender models.
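The sketch below shows this combination step as we read the description above; all names are illustrative, and each array holds one value per setting, for the three best hyperparameter settings of one class model.

    import numpy as np

    def final_model_score(raw, thresholds, nonmodel_std, separations):
        normalized = (raw - thresholds) / nonmodel_std   # std devs over the threshold
        weighted = normalized * separations              # weight by class separation
        return float(weighted.mean())                    # average into the final score

    def choose_class(female_score, male_score):
        # The class with the highest final score wins; the two scores double
        # as confidence values for the female and male models.
        return "F" if female_score > male_score else "M"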
Results

In this section, we will present the overall results of the gender recognition. We start with the accuracy of the various features and systems (Section 5.1). Then we will focus on the effect of preprocessing the input vectors with PCA (Section 5.2).
After this, we examine the classification of individual authors (Section 5.3). For the measurements with PCA, the number of principal components provided to the classification system is learned from the development data; we will return to this in Section 5.2 below.

Starting with the systems, we see that SVR using the original vectors consistently outperforms the other two.
For only one feature type, character trigrams, LP with PCA manages to reach a higher accuracy than SVR, but the difference is not statistically significant.
The usefulness of the confidence scores differs per system. For SVR and LP, these scores are rather varied, but TiMBL's confidence value consists of the proportion of selected-class cases among the nearest neighbours, which with k at 5 is practically always 0.6, 0.8, or 1.0.
The class separation value is a variant of Cohen's d (Cohen). Where Cohen assumes that the two distributions have the same standard deviation, we use the sum of the two, practically always different, standard deviations.
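In formula form (our notation; the text above gives only the verbal description), with mean scores μ_F, μ_M and standard deviations σ_F, σ_M for the female and male authors:

    sep = (μ_F − μ_M) / (σ_F + σ_M)

whereas Cohen's d divides the same difference by a single, shared standard deviation s, i.e. d = (μ_F − μ_M) / s.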
Table 1: Accuracy percentages for various feature types and techniques.

In fact, for all the token n-grams, it would seem that the further one moves away from the unigrams, the worse the accuracy gets. An explanation for this might be that recognition is mostly on the basis of the content of the tweet, and unigrams represent the content most clearly. Possibly, the other n-grams are just mirroring this quality of the unigrams, with the effectiveness of the mirror depending on how well unigrams are represented in the n-grams.
For the character n-grams, our first observation is that the normalized versions are always better than the original versions. This means that the content of the n-grams is more important than their form. This is in accordance with the hypothesis just suggested for the token n-grams, as normalization, too, brings the character n-grams closer to token unigrams. The best performing character n-grams (normalized 5-grams) will be most closely linked to the token unigrams, with some token bigrams thrown in, as well as a smidgen of the use of morphological processes.
However, we cannot conclude that what is wiped away by the normalization (use of diacritics, capitals, and spacing) holds no information for gender recognition. To test that, we would have to experiment with new feature types, modeling exactly the difference between the normalized and the original forms.
So far, the number of principal components was treated as just another hyperparameter to be selected. As a result, the systems' accuracy was partly dependent on the quality of the hyperparameter selection mechanism. In this section, we want to investigate how strong this dependency may have been.

Figure 1: Recognition accuracy as a function of the number of principal components provided to the systems, using token unigrams.
Figures 1, 2, and 3 show accuracy measurements for the token unigrams, token bigrams, and normalized character 5-grams, for all three systems, at various numbers of principal components. For the unigrams, SVR reaches its peak well before the full number of components and then falls off again. Interestingly, it is SVR that degrades at higher numbers of principal components, while TiMBL, said to need fewer dimensions, manages to hold on to the recognition quality.
LP peaks much earlier, but it does not manage to achieve good results with the numbers of principal components that were best for the other two systems. Furthermore, LP appears to suffer some kind of mathematical breakdown at higher numbers of components. Although LP performs worse than it could at fixed numbers of principal components, its more detailed confidence score allows a better hyperparameter selection, on average selecting around 9 principal components, whereas TiMBL chooses a wide range of numbers, generally far lower than is optimal.
We expect that the performance with TiMBL can be improved greatly with the development of a better hyperparameter selection mechanism. For the bigrams (Figure 2), we see much the same picture, although there are differences in the details. SVR now reaches its peak even earlier, TiMBL peaks a bit later, and LP just mirrors its behaviour with the unigrams. For the normalized character 5-grams, LP keeps its peak at 10 principal components, but now at an even lower accuracy than for the token n-grams. However, all systems are in principle able to reach about the same quality.
Even with an automatically selected number of components, LP already profits clearly. And TiMBL is currently underperforming, but might be a challenger to SVR when provided with a better hyperparameter selection mechanism.

Figure 2: Recognition accuracy as a function of the number of principal components provided to the systems, using token bigrams.

We will focus on the token n-grams and the normalized character 5-grams. As for systems, we will involve all five systems in the discussion.
However, our starting point will always be SVR with token unigrams, this being the best performing combination. We will only look at the final scores for each combination, and forgo the extra detail of the underlying separate male and female model scores (which we have for SVR and LP; see above).
One of the male authors turns out to write a kind of financial blog, which is an exception in the population we have in our corpus. The exception also leads to more varied classification by the different systems, yielding a wide range of scores. SVR tends to place him clearly in the male area with all the feature types, with unigrams at the extreme; SVR with PCA, on the other hand, is less convinced, and even classifies him as female on unigrams.
Figure 4 shows that the male population contains some more extreme exponents than the female population. The most obvious male is recognized with a resounding score; looking at his texts, we indeed see a prototypical young male Twitter user. From this point on in the discussion, we will present female confidence as positive numbers and male confidence as negative.
Figure 3: Recognition accuracy as a function of the number of principal components provided to the systems, using normalized character 5-grams.

All systems have no trouble recognizing him as a male, with the lowest scores, around 1, for the top function words. If we look at the rest of the top males (Table 2), we may see more varied topics, but the wide recognizability stays.
Unigrams are mostly mirrored closely by the character 5-grams, as could already be suspected from the content of these two feature types. For the other feature types, we see some variation, but most scores are found near the top of the lists. On the female side, everything is less extreme. The best recognizable female is not as focused as her male counterpart: there is much more variation in the topics, but most of it is clearly girl talk of the type described below.
In scores, too, we see far more variation; even the character 5-grams have ranks up to 40 for this top group. Another interesting group of authors is formed by the misclassified ones. Taking again SVR on unigrams as our starting point, this group contains 11 males and 16 females. We show the 5 most extreme cases.

Figure 4: Confidence scores for gender assignment with regard to the female and male profiles built by SVR on the basis of token unigrams. The dashed line represents the separation threshold, i.e. the class separation value; the dotted line represents exactly opposite scores for the two genders.

Table: Top ranking females in SVR on token unigrams, with ranks and scores for SVR with various feature types.

With one exception (one author is recognized as male when using trigrams), all feature types agree on the misclassification. This may support our hypothesis that all feature types are doing more or less the same. But it might also mean that the gender just influences all feature types to a similar degree.
In addition, the recognition is of course also influenced by our particular selection of authors, as we will see shortly. Apart from the general agreement on the final decision, the feature types vary widely in the scores assigned, but this also allows for both conclusions.
The male who is attributed the most female score is an interesting case. On re-examination, we see a clearly male first name and a clearly male profile photo. However, his Twitter network contains mostly female friends. This apparently colours not only the discussion topics, which might be expected, but also the general language use; a similar observation has been made by Bamman et al. The unigrams do not judge him to write in an extremely female way, but all other feature types do. Looking at his tweets, we also find an extreme number of misspellings (even for Twitter), which may possibly confuse the systems' models.

The most extreme misclassification is reserved for a female author. This turns out to be Judith Sargentini, a member of the European Parliament (identity disclosed with permission). Although clearly female, she is judged as rather strongly male, by TweetGenie as well. In this case, it would seem that the systems are thrown off by the political texts. An alternative hypothesis was that Sargentini does not write her own tweets, but assigns this task to a male press spokesperson; however, we received confirmation that she writes almost all her tweets herself (Sargentini, personal communication).
If we search for the word parlement (parliament) in our corpus, we find that it is used 40 times by Sargentini, by only two more female authors (each using it once), and by 21 male authors (with up to 9 uses each). Apparently, in our sample, politics is a male topic. We did a quick spot check with another misclassified author, a girl who plays soccer; here, the PCA version agrees with the original unigrams, and misclassifies her even more strongly.
In later research, when we will try to identify the various user types on Twitter, we will certainly have another look at this phenomenon.

A question that remains is what kind of information the feature types are actually capturing. Are they mostly targeting the content of the tweets, i.e. the topics under discussion, or rather the style in which they are written? In this section, we will attempt to get closer to the answer to this question.
Again, we take the token unigrams as a starting point. However, looking at SVR is not an option here: because of the way in which SVR does its classification (hyperplane separation in a transformed version of the vector space), it is impossible to determine which features do the most work. Instead, we will just look at the distribution of the various features over the female and male texts. Figure 5 shows all token unigrams. The ones used more by women are plotted in green, those used more by men in red.
The position in the plot represents the relative number of men and women who used the token at least once somewhere in their tweets. However, for classification, it is more important how often the token is used by each gender. We represent this quality by the class separation value that we described in Section 4. As the separation value and the percentages are generally correlated, the bigger tokens are found further away from the diagonal, while the area close to the diagonal contains mostly unimportant and therefore unreadable tokens.
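The quantities behind Figure 5 can be sketched as follows (our reconstruction): for one token, the percentage of women and of men using it at least once (the plot position) and the class separation value (reflected in the font size). The arrays rates_f and rates_m are assumed per-author usage rates for that token.

    import numpy as np

    def token_plot_stats(rates_f, rates_m):
        pct_f = 100.0 * float(np.mean(rates_f > 0))   # % of women using the token
        pct_m = 100.0 * float(np.mean(rates_m > 0))   # % of men using the token
        sep = (rates_f.mean() - rates_m.mean()) / (rates_f.std() + rates_m.std())
        return pct_f, pct_m, sep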
On the female side, we see a representation of the world of the prototypical young female Twitter user. There are emotion words, including some more negative emotions, such as haat (hate) and pijn (pain). Next we see personal care, with nagels (nails), nagellak (nail polish), makeup, mascara, and krullen (curls). Clearly, shopping is also important, as is watching soaps on television (gtst). The age is reconfirmed by the endearingly high presence of mama and papa.
As for style, the only real factor is echt (really). The word haar may be the pronoun her, but just as well the noun hair, and in both cases it is actually more related to the female authors.

Figure 5: Percentages of use of tokens by female and male authors. The font size of the words indicates to which degree they differentiate between the genders, when also taking into account the relative frequencies of occurrence.