PART 1
The final part of Project 1 takes the form of an "authorship attribution" system, that is, a system which attempts to determine who wrote a given document, based on analysis of the language used and style of that document. We will break the system down into multiple parts, to make it clearer what the different moving parts are, and make it easier for you to test your system.
The first step in our authorship attribution system will be to take a document, separate it out into its component words, and construct/return a dictionary of word frequencies. As we are focused on the English language, we will assume that "words" are separated by whitespace, in the form of spaces (' '), tabs ('\t') and newline characters ('\n').
We will also do something slightly unconventional in considering each "standalone" non-alphabetic character (i.e. any character other than whitespace, or upper- or lower-case alphabetic characters) to be a single word. For example, given the document 'Dynamic-typed variables, Python; really?!!', the component words, in sequence, would be 'Dynamic-typed' (noting that '-' here is not considered to be a word despite being non-alphabetic, as it is surrounded by alphabetic characters), 'variables', ',', 'Python', ';', 'really', '?', '!', '!'. Note here that, in the case of the document starting with 'Dynamic--typed', the breakdown into words would instead be 'Dynamic', '-', '-', and 'typed', as both of the hyphens neighbour a non-alphabetic character. Note also that case should be preserved in the output (i.e. if a word is upper case in the original, it should remain in upper case).
Write a function authattr_worddict(doc) that takes a single string argument doc and returns a dictionary (dict) of words contained in doc (as defined above), with the frequency of each word as an int. Note that, as the output is a dict, the order of those words may not correspond exactly to that indicated below, and that the testing will accept any word ordering within the dictionary.
Here are some example calls to your authattr_worddict function:

authattr_worddict('Dynamic-typed variables, Python; really?!!')
{'Dynamic-typed': 1, 'Python': 1, 'really': 1, '!': 2, 'variables': 1, '?': 1, ',': 1, ';': 1}
authattr_worddict('')
{}
authattr_worddict("Truly, rooly, rooly, indisputably 'tis ..... Gr00vy")
{"'": 1, 'vy': 1, '.': 5, '0': 2, 'tis': 1, 'rooly': 2, 'Truly': 1, 'indisputably': 1, 'Gr': 1, ',': 3}
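The tokenisation rule described above can be sketched as follows. This is a minimal illustration rather than the reference solution, and it rests on two assumptions: that str.split() (which splits on any whitespace, not only spaces, tabs and newlines) is an acceptable way to separate words, and that str.isalpha() (which also accepts non-ASCII letters) is an acceptable test for "alphabetic".

```python
def authattr_worddict(doc):
    """Return a dict mapping each word in doc to its frequency (sketch)."""
    counts = {}
    # Assumption: any whitespace separates words.
    for chunk in doc.split():
        word = ''
        for i, ch in enumerate(chunk):
            if ch.isalpha():
                word += ch
            elif (0 < i < len(chunk) - 1
                  and chunk[i - 1].isalpha()
                  and chunk[i + 1].isalpha()):
                # Non-alphabetic character with a letter on both sides:
                # it stays inside the current word (e.g. 'Dynamic-typed').
                word += ch
            else:
                # Standalone non-alphabetic character: close off the word
                # built so far, then count the character as its own word.
                if word:
                    counts[word] = counts.get(word, 0) + 1
                    word = ''
                counts[ch] = counts.get(ch, 0) + 1
        if word:
            counts[word] = counts.get(word, 0) + 1
    return counts
```

Because case is never altered and the counts are accumulated in a plain dict, 'One' and 'one' are treated as different words, as the task requires.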
PART 2
The next step in our authorship attribution system will be to take two dictionaries of word counts and calculate the similarity between them. We will do this by:
1. ranking the two sets of words in descending order of frequency; and
2. for each word, calculating the absolute difference between its rank in the two rankings.
If a word is found in one ranking but not the other, we will set its ranking in the other to the value maxrank (provided as part of the function call). In the case of a tie in the word frequency ranking (due to multiple words having the same frequency), we will assign all tied items the same value, namely the average of the positions they jointly occupy. That is, if n items are tied starting at position r, each is assigned the rank (r + (r + 1) + ... + (r + n - 1)) / n = r + (n - 1) / 2.
For example, if two items were tied for second, we would assign each of them the rank (2 + 3) / 2 = 2.5. The ranking of the next item would then be 4 rather than 3, as two places in the ranking have been taken. For example, if the first dictionary was {'a': 10, 'b': 5, 'c': 5, 'd': 2, 'e': 2, 'f': 2, 'g': 1} (i.e. 'a' occurs 10 times, 'b' 5 times, etc.), then the corresponding ranking would be:
word    frequency    ranking
'a'     10           1
'b'     5            2.5
'c'     5            2.5
'd'     2            5
'e'     2            5
'f'     2            5
'g'     1            7
Note that 'd', 'e' and 'f' are assigned a ranking of 5 because they are all tied for fourth (three items precede them), and (4 + 5 + 6) / 3 = 5.
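The tie-handling rule can be sketched in code as below. The helper name rank_with_ties is hypothetical (it is not part of the assignment), and the sketch assumes that a tied group's rank is the average of the positions the group occupies, which agrees with the table above.

```python
def rank_with_ties(dictfreq):
    """Map each word to its rank, averaging positions over ties (sketch)."""
    # Words in descending order of frequency; order within a tie is irrelevant.
    ordered = sorted(dictfreq, key=dictfreq.get, reverse=True)
    ranks = {}
    start = 0
    while start < len(ordered):
        # Extend end until it passes the last word sharing this frequency.
        end = start
        while (end < len(ordered)
               and dictfreq[ordered[end]] == dictfreq[ordered[start]]):
            end += 1
        # The tied words occupy 1-based positions start+1 .. end;
        # the average of those positions is (first + last) / 2.
        shared = (start + 1 + end) / 2
        for word in ordered[start:end]:
            ranks[word] = shared
        start = end
    return ranks
```

Called on the first dictionary above, this yields 1.0 for 'a', 2.5 for 'b' and 'c', 5.0 for 'd', 'e' and 'f', and 7.0 for 'g'.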
If the second ranking were:
word    frequency    ranking
'b'     27           1
'h'     22           2
'a'     11           3.5
'i'     11           3.5
'j'     5            5
Then the combined ranking would be:
word    ranking 1    ranking 2
'a'     1            3.5
'b'     2.5          1
'c'     2.5          maxrank
'd'     5            maxrank
'e'     5            maxrank
'f'     5            maxrank
'g'     7            maxrank
'h'     maxrank      2
'i'     maxrank      3.5
'j'     maxrank      5
The final step is to calculate the "out-of-place" distance between the two rankings, by calculating the total absolute difference between the respective rankings for each word contained in the union of the rankings ... which is just a complicated way of saying, for each row in the table above calculate the absolute difference between the two ranking values (e.g. for 'a', |1 - 3.5| = 2.5), and sum up across all the rows. Assuming that maxrank is equal to 10, the value for the case above would be: 2.5 + 1.5 + 7.5 + 5 + 5 + 5 + 3 + 8 + 6.5 + 5 = 49.
Write a function authattr_oop(dictfreq1, dictfreq2, maxrank) that takes three arguments:
• dictfreq1: a dictionary of words, with the (positive integer) frequency of each
• dictfreq2: a second dictionary of words, with the (positive integer) frequency of each
• maxrank: the positive int value to set the ranking to in the case that the word isn't in the dictionary of words in question
and returns a float out-of-place distance between the two (where the smaller the number, the more similar the two rankings are).
Here are some example calls to your authattr_oop function:

authattr_oop({'a': 10, 'b': 5, 'c': 5, 'd': 2, 'e': 2, 'f': 2, 'g': 1}, {'b': 27, 'h': 22, 'a': 11, 'i': 11, 'j': 5}, 10)
49.0
authattr_oop({'a': 5000, 'b': 4000, 'c': 3000}, {'a': 5, 'b': 4, 'c':3}, 100)
0.0
authattr_oop({'a': 5000, 'b': 4000, 'c': 3000}, {'d': 5, 'e': 4, 'f':3}, 100)
588.0
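One possible shape for authattr_oop is sketched below; it is an illustration under stated assumptions, not the reference implementation. The inner helper _ranks is a hypothetical name, and it assumes that the rank of a tied group equals the number of words with a strictly higher frequency plus (n + 1) / 2, where n is the size of the group (which is the same as averaging the positions the group occupies).

```python
from collections import Counter


def authattr_oop(dictfreq1, dictfreq2, maxrank):
    """Return the out-of-place distance between two frequency dicts (sketch)."""

    def _ranks(dictfreq):
        # Number of words sharing each distinct frequency.
        tied = Counter(dictfreq.values())
        rank_of_freq = {}
        higher = 0  # words with a strictly higher frequency seen so far
        for freq in sorted(tied, reverse=True):
            n = tied[freq]
            # Average of positions higher+1 .. higher+n.
            rank_of_freq[freq] = higher + (n + 1) / 2
            higher += n
        return {word: rank_of_freq[freq] for word, freq in dictfreq.items()}

    rank1 = _ranks(dictfreq1)
    rank2 = _ranks(dictfreq2)
    total = 0.0
    # Every word in the union contributes; a missing word gets maxrank.
    for word in set(rank1) | set(rank2):
        total += abs(rank1.get(word, maxrank) - rank2.get(word, maxrank))
    return total
```

Since every rank is a multiple of 0.5, the floating-point sum is exact for inputs of this kind, so the result can be compared directly against the expected values in the examples above.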
PART 3
The final step in our authorship attribution system will be to perform authorship attribution based on a selection of sample documents from a range of authors, and a document of unknown origin.
You will be given a selection of sample documents from a range of authors (from which we will learn our word frequency dictionaries), and a document of unknown origin. Given these, you need to return a list of authors in ascending order of out-of-place distance between the document of unknown origin and the combined set of documents from each of the authors. You should do this according to the following steps:
1. compute a single dictionary of word frequencies for each author based on the combined set of documents from that author (provided in the form of a list of strings)
2. compute a dictionary of word frequencies for the document of unknown origin
3. compare the document of unknown origin with the combined works of each author, based on the out-of-place distance metric
4. calculate and return a ranking of authors, from most similar (smallest distance) to least similar (greatest distance), resolving any ties in the ranking based on an alphabetic sort
You have been provided with reference implementations of the functions authattr_worddict and authattr_oop from the preceding questions in order to complete this question, and should make use of these in your solution. These are provided via the from hidden_lib import authattr_worddict, authattr_oop statement, which must not be removed from the header of your code for these functions to work.
Write a function authattr_authorpred(authordict, unknown, maxrank) that takes three arguments:
• authordict: a dictionary of authors (each of which is a str), associated with a non-empty list of documents (each of which is a str)
• unknown: a str containing the document of unknown origin
• maxrank: the positive int value to set maxrank to in the call to authattr_oop
and returns a list of (author, oop) tuples, where author is the name of an author from authordict, and oop is the out-of-place distance between unknown and the combined works of author, in the form of a float.
For example:

authattr_authorpred({'tim': ['One One was a racehorse; Two Two was one too', 'How much wood could a woodchuck chuck'], 'einstein': ['Unthinking respect for authority is the greatest enemy of truth.', 'Not everything that can be counted counts, and not everything that counts can be counted.']}, 'She sells sea shells on the seashore', 20)
[('tim', 287.0), ('einstein', 290.0)]
authattr_authorpred({'Beatles': ['Hey Jude', 'The Fool on the Hill', "A Hard Day's Night", "Yesterday"], 'Rolling Stones': ["(I Can't Get No) Satisfation", 'Ruby Tuesday', 'Paint it Black']}, 'Eleanor Rigby', 15)
[('Beatles', 129.0), ('Rolling Stones', 129.0)]
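The four steps listed above can be sketched as follows. Since hidden_lib is not available outside the assignment environment, the block defines compact stand-ins for authattr_worddict and authattr_oop so that it runs on its own; these stand-ins are assumptions written for illustration (the word splitter, for instance, treats only ASCII letters as alphabetic) and would be replaced by the provided import in practice.

```python
import re
from collections import Counter

# Stand-ins for `from hidden_lib import authattr_worddict, authattr_oop`.
# Assumption: a word is a run of ASCII letters, optionally joined by single
# non-alphabetic characters that have a letter on both sides.
_WORD = re.compile(r"[A-Za-z]+(?:[^A-Za-z\s][A-Za-z]+)*|[^A-Za-z\s]")


def authattr_worddict(doc):
    return dict(Counter(_WORD.findall(doc)))


def authattr_oop(dictfreq1, dictfreq2, maxrank):
    def ranks(freqs):
        out, higher = {}, 0
        tied = Counter(freqs.values())
        for freq in sorted(tied, reverse=True):
            shared = higher + (tied[freq] + 1) / 2  # average tied position
            out.update({w: shared for w, f in freqs.items() if f == freq})
            higher += tied[freq]
        return out

    r1, r2 = ranks(dictfreq1), ranks(dictfreq2)
    return float(sum(abs(r1.get(w, maxrank) - r2.get(w, maxrank))
                     for w in set(r1) | set(r2)))


def authattr_authorpred(authordict, unknown, maxrank):
    """Rank authors by out-of-place distance to the unknown document (sketch)."""
    # Step 2: word frequencies for the document of unknown origin.
    unknown_freq = authattr_worddict(unknown)
    results = []
    for author, docs in authordict.items():
        # Step 1: sum the per-document counts into one dictionary per author.
        combined = Counter()
        for doc in docs:
            combined.update(authattr_worddict(doc))
        # Step 3: distance between the author's combined works and unknown.
        results.append((author, authattr_oop(dict(combined), unknown_freq, maxrank)))
    # Step 4: ascending distance, ties broken alphabetically by author name.
    return sorted(results, key=lambda pair: (pair[1], pair[0]))
```

Sorting on the tuple (distance, author) handles the tie-breaking rule in one pass: in the second example above both authors score 129.0, so 'Beatles' is listed before 'Rolling Stones'.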
