Bill On Business

Online Business and the Search Industry

Duplicate Content

Detect Duplicate Content – Think Like a Search Engine

 

A client’s copywriter recently went to great lengths to show how he could reword copy to avoid duplicate content problems merely by moving paragraphs and sentences around.

 

The copywriter was certain that any search system could be outsmarted this way: tracking shifted text positions and altered syntax would, in theory, be so complex at scale that no search engine would have the resources to cope.

 

Without going into the depths of duplicate content systems, scalable search systems, or whether he is right that you can blag it, it is worth thinking like a search engine for a minute while keeping things simple.

 

Rather than use search engine results or search systems to examine his plan, I resorted to the suitably unassuming MS Word.

 

By using an MS Word macro to calculate word frequency and the total unique word count, you can quickly see shortcuts for measuring document uniqueness without stopping by MIT for 4 years.

 

I compared two copies of the same document, one of them with word order shifted, paragraphs moved, sentences inverted and so on. I kept scrambling the format until Mr Copywriter was good and happy.

 

The macro produced identical results for both documents:

1171 unique words per document, and the 18 keywords listed below occurring with exactly the same counts and in the same frequency order in both documents.

 

For two 3854-word documents to share exactly the same unique word count and identical keyword counts is statistically very unlikely unless one is a copy of the other. So, without a single search system to hand, you can see how easy it is to detect document duplication.

 

Count   Keyword
83      search
65      seo
55      company-name-removed
51      in
49      project
48      company-brand-removed
41      we
33      team
32      at
29      as
29      will
28      delivery
28      this
27      on
26      our
26      site
25      with
21      performance
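
The same check can be sketched outside Word. The snippet below is a minimal illustration in Python, not the macro used for the figures above: it builds a word-frequency profile for each document and compares the profiles. The names `EXCLUDES`, `word_profile` and `same_profile` are made up for this example, the exclusion list simply mirrors the one in the macro further down, and the sample sentences are invented. Because a frequency profile ignores word order, moving sentences and paragraphs around leaves it unchanged.

```python
import re
from collections import Counter

# Assumed stop list, mirroring the Excludes string in the Word macro.
EXCLUDES = {"the", "a", "of", "is", "to", "for", "by", "be", "and", "are"}


def word_profile(text):
    """Count each lowercased word in text, skipping the excluded words."""
    words = re.findall(r"[a-z][a-z'-]*", text.lower())
    return Counter(w for w in words if w not in EXCLUDES)


def same_profile(doc_a, doc_b):
    """True when both documents use the same words the same number of times."""
    return word_profile(doc_a) == word_profile(doc_b)


# Invented sample text: the second version only reorders the first.
original = ("Search engines compare documents. "
            "Our team will deliver the project on time. Search is hard.")
reordered = ("Search is hard. On time our team will deliver the project. "
             "Documents compare search engines.")

print(len(word_profile(original)))          # number of unique words
print(word_profile(original).most_common(3))  # top keywords by count
print(same_profile(original, reordered))    # reordering does not change the profile
```

An exact match is the crudest possible test. A real system would presumably tolerate small differences, but even this shortcut is enough to catch copy that has only been shuffled.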

 

The Macro Used: Thanks to Allen Wyatt’s Word Tips – Count Unique Word Occurrences

http://word.tips.net/Pages/T001833_Generating_a_Count_of_Word_Occurrences.html

 

Sub FindWords()
    Dim sResponse As String
    Dim iCount As Integer

    ' Input different words until the user clicks cancel
    Do
        ' Identify the word to count
        sResponse = InputBox( _
          Prompt:="What word do you want to count?", _
          Title:="Count Words", Default:="")

        If sResponse > "" Then
            ' Set the counter to zero for each loop
            iCount = 0
            Application.ScreenUpdating = False
            With Selection
                .HomeKey Unit:=wdStory
                With .Find
                    .ClearFormatting
                    .Text = sResponse
                    ' Loop until Word can no longer
                    ' find the search string and
                    ' count each instance
                    Do While .Execute
                        iCount = iCount + 1
                        Selection.MoveRight
                    Loop
                End With
                ' Show the number of occurrences
                MsgBox sResponse & " appears " & iCount & " times"
            End With
            Application.ScreenUpdating = True
        End If
    Loop While sResponse <> ""
End Sub

If you want to determine all the unique words in a document, along with how many times each of them appears in the document, then a different approach is needed. The following VBA macro will do just that.

Sub WordFrequency()
    Const maxwords = 9000          'Maximum unique words allowed
    Dim SingleWord As String       'Raw word pulled from doc
    Dim Words(maxwords) As String  'Array to hold unique words
    Dim Freq(maxwords) As Integer  'Frequency counter for unique words
    Dim WordNum As Integer         'Number of unique words
    Dim ByFreq As Boolean          'Flag for sorting order
    Dim ttlwds As Long             'Total words in the document
    Dim Excludes As String         'Words to be excluded
    Dim Found As Boolean           'Temporary flag
    Dim j As Integer, k As Integer 'Temporary variables
    Dim l As Integer, Temp As Integer
    Dim ans As String              'How user wants to sort results
    Dim tword As String            'Temporary word holder used when swapping
    Dim aword As Range             'Current word in the document
    Dim tmpName As String          'Template name for the results document

    ' Set up excluded words
    Excludes = "[the][a][of][is][to][for][by][be][and][are]"

    ' Find out how to sort
    ByFreq = True
    ans = InputBox("Sort by WORD or by FREQ?", "Sort order", "WORD")
    If ans = "" Then End
    If UCase(ans) = "WORD" Then
        ByFreq = False
    End If

    Selection.HomeKey Unit:=wdStory
    System.Cursor = wdCursorWait
    WordNum = 0
    ttlwds = ActiveDocument.Words.Count

    ' Control the repeat
    For Each aword In ActiveDocument.Words
        SingleWord = Trim(LCase(aword))
        'Out of range? Skips punctuation and digits. Note that this
        'comparison also skips any longer word starting with "z".
        If SingleWord < "a" Or SingleWord > "z" Then
            SingleWord = ""
        End If
        'On exclude list?
        If InStr(Excludes, "[" & SingleWord & "]") Then
            SingleWord = ""
        End If
        If Len(SingleWord) > 0 Then
            Found = False
            For j = 1 To WordNum
                If Words(j) = SingleWord Then
                    Freq(j) = Freq(j) + 1
                    Found = True
                    Exit For
                End If
            Next j
            If Not Found Then
                WordNum = WordNum + 1
                Words(WordNum) = SingleWord
                Freq(WordNum) = 1
            End If
            If WordNum > maxwords - 1 Then
                j = MsgBox("Too many words.", vbOKOnly)
                Exit For
            End If
        End If
        ttlwds = ttlwds - 1
        StatusBar = "Remaining: " & ttlwds & ", Unique: " & WordNum
    Next aword

    ' Now sort it by word or by frequency (selection sort)
    For j = 1 To WordNum - 1
        k = j
        For l = j + 1 To WordNum
            If (Not ByFreq And Words(l) < Words(k)) _
              Or (ByFreq And Freq(l) > Freq(k)) Then k = l
        Next l
        If k <> j Then
            tword = Words(j)
            Words(j) = Words(k)
            Words(k) = tword
            Temp = Freq(j)
            Freq(j) = Freq(k)
            Freq(k) = Temp
        End If
        StatusBar = "Sorting: " & WordNum - j
    Next j

    ' Now write out the results to a new document
    tmpName = ActiveDocument.AttachedTemplate.FullName
    Documents.Add Template:=tmpName, NewTemplate:=False
    Selection.ParagraphFormat.TabStops.ClearAll
    With Selection
        For j = 1 To WordNum
            .TypeText Text:=Trim(Str(Freq(j))) _
              & vbTab & Words(j) & vbCrLf
        Next j
    End With
    System.Cursor = wdCursorNormal
    j = MsgBox("There were " & Trim(Str(WordNum)) & _
      " different words ", vbOKOnly, "Finished")
End Sub

December 10, 2008 - Posted by billonbusiness | Search Engines
