HyPlag is a web-based tool to assist users in efficiently examining academic documents for suspicious text and citation similarities, which may point to potential plagiarism. The algorithms used by HyPlag are based on the Citation-based Plagiarism Detection concept, a novel approach developed by the Information Science Group at the University of Konstanz.

Bibliographic coupling denotes the number of references two documents have in common in their bibliographies.

Bibliographic coupling

Schematic depiction of Bibliographic Coupling

In the figure, Doc A and Doc B are two documents that both cite the same three documents [1], [2], and [3]. Thus, Doc A and Doc B have three references in common within their respective reference lists and hence a “coupling strength” of 3.

Bibliographic coupling (BC) strength is a single value and a coarse measure of global document similarity: it considers only the reference lists found in academic texts and does not take into account the position or order of the citations within the full text of the documents.

Bibliographic coupling strength alone is not a sufficient indicator of potential plagiarism, and it does not allow pinpointing potentially plagiarized text segments.
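As a minimal sketch of the measure described above, coupling strength can be computed as the size of the intersection of two reference lists. The function name and the reference identifiers are illustrative assumptions, not part of HyPlag.

```python
def coupling_strength(refs_a, refs_b):
    """Count the distinct references that occur in both reference lists."""
    return len(set(refs_a) & set(refs_b))


# Hypothetical reference lists: both documents cite [1], [2] and [3].
doc_a_refs = ["[1]", "[2]", "[3]", "[4]"]
doc_b_refs = ["[3]", "[1]", "[5]", "[2]"]

print(coupling_strength(doc_a_refs, doc_b_refs))  # 3
```

Because sets discard order and position, the sketch also shows why this value cannot point to specific text segments.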

For more information please refer to this publication.

The Citation Chunking algorithms are a set of heuristic procedures that aim to identify matching citation patterns even if the order of the matching citations differs between the two documents.

We derived several strategies for forming citation chunks by observing behaviors of plagiarists and modeling the resulting citation patterns.


The procedure chunks both documents. Chunking means that matching citations are grouped into chunks: a matching citation is added to the chunk currently under construction if n ≤ 1 or 1 < n ≤ s non-matching citations separate it from the last preceding matching citation. Here, s denotes the number of citations in the chunk currently under construction.

Once chunks have been formed for both documents, the order of citations within a chunk is disregarded and each chunk of the first document is compared with each chunk of the second document. The chunk pairs having the highest number of matching citations are permanently related to each other and considered a match. If multiple chunks in the documents share the same number of matching citations, all combinations of chunks with equally many matching citations are stored.
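The two steps above can be sketched as follows. This is one possible reading, not HyPlag's implementation: it assumes each document is given as a list of citation identifiers in text order, that s counts only the matching citations already in the chunk, and that the best partner is chosen per chunk of the first document. All names are hypothetical.

```python
from collections import Counter


def form_chunks(citations, shared):
    """Group matching citations into chunks using the n <= 1 or n <= s rule."""
    chunks, current = [], []
    gap = 0  # n: non-matching citations since the last matching one
    for c in citations:
        if c not in shared:
            gap += 1
            continue
        # s is taken as len(current); assumption: only matching citations count
        if current and (gap <= 1 or gap <= len(current)):
            current.append(c)
        else:
            if current:
                chunks.append(current)
            current = [c]
        gap = 0
    if current:
        chunks.append(current)
    return chunks


def overlap(chunk_a, chunk_b):
    """Number of citations two chunks share, ignoring order."""
    return sum((Counter(chunk_a) & Counter(chunk_b)).values())


def match_chunks(chunks_a, chunks_b):
    """For each chunk of A, keep every chunk of B with the highest overlap."""
    matches = []
    for i, ca in enumerate(chunks_a):
        scores = [overlap(ca, cb) for cb in chunks_b]
        best = max(scores, default=0)
        if best > 0:
            matches.extend((i, j, best) for j, s in enumerate(scores) if s == best)
    return matches


doc_a = ["a", "b", "x", "c", "y", "z", "w", "v", "d"]
doc_b = ["c", "q", "r", "b", "a", "d"]
shared = set(doc_a) & set(doc_b)

chunks_a = form_chunks(doc_a, shared)  # [['a', 'b', 'c'], ['d']]
chunks_b = form_chunks(doc_b, shared)  # [['c'], ['b', 'a', 'd']]
print(match_chunks(chunks_a, chunks_b))  # [(0, 1, 2), (1, 1, 1)]
```

In doc_a, the citation "d" starts a new chunk because four non-matching citations precede it while the chunk under construction holds only three citations.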

The chunking strategy aims to uncover potential cases in which text segments or logical structures have been taken over from or influenced by another text. It tolerates sporadic non-shared citations that may have been inserted to make the resulting text appear more “genuine”. It can also detect potential cases of concealed shake&paste plagiarism by allowing an increasing number of non-shared citations within a chunk once a certain number of shared citations have been included. Shake&paste plagiarism refers to interweaving text segments (including their citations) from different sources.

For more information please refer to this publication.

The Greedy Citation Tiling (GCT) algorithm identifies all individually longest citation patterns that consist entirely of matching citations in the exact same order.

Individually longest patterns are sequences of matching citations in the same order that cannot be extended to the left or right without encountering a citation that is not shared by both documents under comparison. Such individually longest matches are called citation tiles.

The figure below illustrates the formation of citation tiles, which are numbered with I, II and III.


Citation tile formation

We designed the GCT algorithm primarily to identify shake&paste plagiarism, including cases in which the text has been paraphrased. Shake&paste plagiarism refers to interweaving text segments (including their citations) from different sources. Finding many or long matching citation tiles is a strong indicator of potential plagiarism.
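The tiling idea can be sketched with a simplified greedy loop in the style of Greedy String Tiling: find the longest run of identical, not yet used citations, record it as a tile, and repeat. This is an unoptimized illustration under the assumption that each document is a list of citation identifiers in text order. The function name and the (start in A, start in B, length) tuple layout are hypothetical.

```python
def greedy_citation_tiles(a, b, min_len=1):
    """Return tiles as (start_in_a, start_in_b, length), longest first."""
    used_a = [False] * len(a)
    used_b = [False] * len(b)
    tiles = []
    while True:
        best_len, candidates = 0, []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                # extend the run while citations match and are still unused
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not used_a[i + k] and not used_b[j + k]):
                    k += 1
                if k > best_len:
                    best_len, candidates = k, [(i, j)]
                elif k == best_len and k > 0:
                    candidates.append((i, j))
        if best_len < min_len:
            break
        for i, j in candidates:
            # skip runs that overlap a tile placed earlier in this round
            if any(used_a[i:i + best_len]) or any(used_b[j:j + best_len]):
                continue
            for k in range(best_len):
                used_a[i + k] = used_b[j + k] = True
            tiles.append((i, j, best_len))
    return tiles


doc_a = ["c1", "c2", "c3", "x", "c4", "c5"]
doc_b = ["c4", "c5", "y", "c1", "c2", "c3"]
print(greedy_citation_tiles(doc_a, doc_b))  # [(0, 3, 3), (4, 0, 2)]
```

The example finds two tiles even though the copied blocks appear in swapped order in the second document, which is the situation shake&paste plagiarism produces.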

For more information please refer to this publication.

The Longest Common Citation Sequence (LCCS) is the longest sequence of citations that match in both documents in the same order, where the sequence may be interrupted by non-matching citations. The LCCS measure is the number of citations in that sequence. Each document pair has either exactly one or no LCCS.

The following figure illustrates the LCCS measure:


LCCS can identify potential cases of plagiarism in which sections of the text have been copied without changes or with only slight alterations in the order of citations. This can be the case for copy&paste plagiarism concealed by basic rewording, e.g. through synonym replacements. If significant reordering within plagiarized text segments took place (shake&paste plagiarism), or if a different citation style is applied that permutes the sequence of citations, the LCCS approach is not suitable.
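A minimal sketch of the LCCS computation, using the standard dynamic-programming solution for the longest common subsequence. It assumes each document is a list of citation identifiers in text order. The function name and the sample data are hypothetical.

```python
def lccs(a, b):
    """Return one longest common subsequence of two citation sequences."""
    n, m = len(a), len(b)
    # dp[i][j] = LCCS length of the suffixes a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # walk the table from the start to recover one sequence of that length
    seq, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            seq.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return seq


doc_a = ["c1", "x", "c2", "c3", "c4"]
doc_b = ["c1", "c2", "y", "c4", "c3"]
print(len(lccs(doc_a, doc_b)))  # 3
```

In this example the last two shared citations are swapped in the second document, so only one of them can be part of the sequence. This illustrates why heavy reordering reduces the LCCS value.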

For more information please refer to this publication.