Duplicate Content
Detect Duplicate Content – Think Like a Search Engine
A client’s copywriter recently went to great lengths to show how he could reword copy to avoid duplicate content problems merely by moving paragraphs and sentences around.
The copywriter was certain that any search system could be outsmarted this way: tracking shifted text positions and altered syntax would, in theory, be so complex on a large scale that no search engine would have the resources to cope.
Without going into the depths of duplicate content systems, scalable search systems, or whether he is right that you can blag it, it is worth thinking like a search engine for a minute while keeping things simple.
Rather than use search engine results or search systems to examine his plan, I resorted to the suitably unassuming MS Word.
By using an MS Word macro to calculate word frequency and the total unique word count, you can quickly see shortcuts to calculating document uniqueness without stopping by MIT for 4 years.
I compared two documents with identical content, one of them with word order shifted, paragraphs moved, sentences inverted and so on. I crazied the format up until Mr Copywriter was good and happy.
The macro produced identical results for both documents.
Each document had 1171 unique words, and the 18 keywords listed below occurred in exactly the same frequency order in both documents.
For two 3854-word documents to share exactly the same number of unique words and identical counts for every keyword is statistically very unlikely unless one is a rearrangement of the other. So without a single search system to hand, you can see how easy it is to detect document duplication.
Count Keyword
83 search
65 seo
55 company-name-removed
51 in
49 project
48 company-brand-removed
41 we
33 team
32 at
29 as
29 will
28 delivery
28 this
27 on
26 our
26 site
25 with
21 performance
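The same check can be sketched outside Word. The snippet below is a minimal illustration in Python, not the macro used for the figures above: the sample text and the names `word_profile` and `reverse_sentences` are made up for the example, and the tokenising rule is an assumption. It reorders the sentences of a short text and shows that the word counts come out the same.

```python
import re
from collections import Counter


def word_profile(text):
    """Count how often each lowercase word occurs in text (assumed tokeniser)."""
    return Counter(re.findall(r"[a-z0-9'-]+", text.lower()))


def reverse_sentences(text):
    """Return text with its sentences in reverse order (a crude 'rewording')."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(reversed(sentences))


# Hypothetical sample copy, invented for this illustration.
original = (
    "Our team will deliver the search project on time. "
    "The project improves site performance. "
    "Search is what our team does best."
)
reworded = reverse_sentences(original)

profile_a = word_profile(original)
profile_b = word_profile(reworded)

print("Unique words:", len(profile_a), "vs", len(profile_b))
print("Top keywords:", profile_a.most_common(3))
print("Identical profiles:", profile_a == profile_b)
```

Because only the order of the text changes, the unique word count and every keyword frequency stay the same, which is exactly the signal the Word macro exposed.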
The Macro Used: Thanks to Allen Wyatt’s Word Tips – Count Unique Word Occurrences
http://word.tips.net/Pages/T001833_Generating_a_Count_of_Word_Occurrences.html
Sub FindWords()
    Dim sResponse As String
    Dim iCount As Integer
    ' Input different words until the user clicks Cancel
    Do
        ' Identify the word to count
        sResponse = InputBox( _
            Prompt:="What word do you want to count?", _
            Title:="Count Words", Default:="")
        If sResponse > "" Then
            ' Set the counter to zero for each loop
            iCount = 0
            Application.ScreenUpdating = False
            With Selection
                .HomeKey Unit:=wdStory
                With .Find
                    .ClearFormatting
                    .Text = sResponse
                    ' Loop until Word can no longer
                    ' find the search string and
                    ' count each instance
                    Do While .Execute
                        iCount = iCount + 1
                        Selection.MoveRight
                    Loop
                End With
                ' Show the number of occurrences
                MsgBox sResponse & " appears " & iCount & " times"
            End With
            Application.ScreenUpdating = True
        End If
    Loop While sResponse <> ""
End Sub
If you want to determine all the unique words in a document, along with how many times each of them appears in the document, then a different approach is needed. The following VBA macro will do just that.
Sub WordFrequency()
    Const maxwords = 9000           'Maximum unique words allowed
    Dim SingleWord As String        'Raw word pulled from doc
    Dim Words(maxwords) As String   'Array to hold unique words
    Dim Freq(maxwords) As Integer   'Frequency counter for unique words
    Dim WordNum As Integer          'Number of unique words
    Dim ByFreq As Boolean           'Flag for sorting order
    Dim ttlwds As Long              'Total words in the document
    Dim Excludes As String          'Words to be excluded
    Dim Found As Boolean            'Temporary flag
    Dim j As Integer, k As Integer  'Temporary variables
    Dim l As Integer, Temp As Integer
    Dim ans As String               'How user wants to sort results
    Dim tword As String             'Temporary word holder used when sorting
    Dim aword As Range              'Current word in the document
    Dim tmpName As String           'Template used for the results document
    ' Set up excluded words
    Excludes = "[the][a][of][is][to][for][by][be][and][are]"
    ' Find out how to sort
    ByFreq = True
    ans = InputBox("Sort by WORD or by FREQ?", "Sort order", "WORD")
    If ans = "" Then Exit Sub
    If UCase(ans) = "WORD" Then
        ByFreq = False
    End If
    Selection.HomeKey Unit:=wdStory
    System.Cursor = wdCursorWait
    WordNum = 0
    ttlwds = ActiveDocument.Words.Count
    ' Control the repeat
    For Each aword In ActiveDocument.Words
        SingleWord = Trim(LCase(aword))
        ' Out of range? Only the first character is tested, so that
        ' words such as "zoo" are not discarded for sorting after "z"
        If Left(SingleWord, 1) < "a" Or Left(SingleWord, 1) > "z" Then
            SingleWord = ""
        End If
        ' On exclude list?
        If InStr(Excludes, "[" & SingleWord & "]") Then
            SingleWord = ""
        End If
        If Len(SingleWord) > 0 Then
            Found = False
            For j = 1 To WordNum
                If Words(j) = SingleWord Then
                    Freq(j) = Freq(j) + 1
                    Found = True
                    Exit For
                End If
            Next j
            If Not Found Then
                WordNum = WordNum + 1
                Words(WordNum) = SingleWord
                Freq(WordNum) = 1
            End If
            If WordNum > maxwords - 1 Then
                j = MsgBox("Too many words.", vbOKOnly)
                Exit For
            End If
        End If
        ttlwds = ttlwds - 1
        StatusBar = "Remaining: " & ttlwds & ", Unique: " & WordNum
    Next aword
    ' Now sort it into word order (or frequency order)
    For j = 1 To WordNum - 1
        k = j
        For l = j + 1 To WordNum
            If (Not ByFreq And Words(l) < Words(k)) _
                Or (ByFreq And Freq(l) > Freq(k)) Then k = l
        Next l
        If k <> j Then
            tword = Words(j)
            Words(j) = Words(k)
            Words(k) = tword
            Temp = Freq(j)
            Freq(j) = Freq(k)
            Freq(k) = Temp
        End If
        StatusBar = "Sorting: " & WordNum - j
    Next j
    ' Now write out the results
    tmpName = ActiveDocument.AttachedTemplate.FullName
    Documents.Add Template:=tmpName, NewTemplate:=False
    Selection.ParagraphFormat.TabStops.ClearAll
    With Selection
        For j = 1 To WordNum
            .TypeText Text:=Trim(Str(Freq(j))) _
                & vbTab & Words(j) & vbCrLf
        Next j
    End With
    System.Cursor = wdCursorNormal
    j = MsgBox("There were " & Trim(Str(WordNum)) & _
        " different words ", vbOKOnly, "Finished")
End Sub