Self-plagiarism and the problems of the Similarity Index

July 16th, 2020, by Guy Harris

DMC is often asked to paraphrase a paper’s text when questions of plagiarism and self-plagiarism arise. But the way plagiarism is currently identified is simplistic, based on detecting similarity between texts, and rife with problems.

One common service is iThenticate, which most authors will know. An iThenticate report tags passages that match text in previous papers as ‘similar’, and then uses these tagged passages to calculate a Similarity Index. A score below 20% is considered preferable.

A paper we recently edited was given a Similarity Index of 37%. In the abstract alone, however, iThenticate had tagged the following text as identical to text from other studies:


     cohort data
     all-cause and cause-specific mortality
     person-years of follow-up
     in the analysis
     95% CI: 1

Yes, even ‘the’ and ‘a’ were tagged. In fact, such ubiquitous terms accounted for all but one of the tagged items in the abstract: of 25 tagged items, only 1 could be considered genuinely ‘similar’ to text from a previous study, and thus potentially plagiarized. Yet all contributed to iThenticate’s Similarity Index.

Adding to the absurdity, iThenticate listed 12 source papers in which fewer than 10 words in the whole paper were similar. These also contributed to the Similarity Index.
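
To illustrate how such trivial matches can inflate a score, here is a minimal Python sketch. It is a toy model only: we do not know how iThenticate actually calculates its Similarity Index. The function `similarity_index`, the `min_words` threshold, and the abstract length of 250 words are all assumptions made for illustration. The tagged items are the five quoted from the abstract above.

```python
def similarity_index(tagged_items, total_words, min_words=1):
    """Percentage of a text's words that fall inside tagged matches.

    Toy model, not iThenticate's method: every tagged item counts
    toward the score unless it is shorter than `min_words` words.
    """
    counted = sum(
        len(item.split())
        for item in tagged_items
        if len(item.split()) >= min_words
    )
    return 100.0 * counted / total_words


# The five tagged items quoted from the abstract above.
tagged = [
    "cohort data",
    "all-cause and cause-specific mortality",
    "person-years of follow-up",
    "in the analysis",
    "95% CI: 1",
]

TOTAL_WORDS = 250  # hypothetical abstract length, not taken from the paper

# Counting every tagged item, however short and generic.
naive = similarity_index(tagged, TOTAL_WORDS)

# Ignoring matches of fewer than 5 words (an arbitrary threshold).
filtered = similarity_index(tagged, TOTAL_WORDS, min_words=5)

print(f"all matches counted: {naive:.1f}%")
print(f"short matches ignored: {filtered:.1f}%")
```

Under these assumptions, the five generic phrases alone add several percentage points to the score, and all of that contribution disappears once very short matches are ignored.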

As a final insult, iThenticate particularly punishes authors who publish frequently in a narrow field of science. iThenticate will invariably tag terminology and expressions in these authors’ papers as similar to those in their previous papers. But similarity of terminology and expression among papers in the same field is unavoidable. Indeed, we frequently find ourselves having to change the same sentence in multiple ways across multiple papers, simply to avoid ‘similarity’. This does not serve the goal of minimizing plagiarism. We have even had to resort to replacing good expressions with awkward and wordy ones, simply because all the good expressions had already been used in previous papers!

Preventing plagiarism is a necessary goal, but the simple identification of similarity of text is an inefficient and inferior way to achieve it.