In this lab we practice basic operations on lists, strings, and dictionaries, as well as fundamental programming concepts such as loops, if statements, and functions. The lab is designed to be simple, but despite its simplicity it also demonstrates a technique that lies at the foundation of many algorithms which allow machines to understand human language.

Note that machines do not understand language the way we do. Instead, they collect statistics that allow them to make inferences which are right most of the time, even when the machine lacks real understanding. The most basic statistic is plain word counting, i.e. recording which words appear in a document and how often. There are four parts in this lab:
- Tokenization - processing lines and recognizing words;
- Counting - aggregating statistics for the frequencies of the different words;
- Printing - printing the final results;
- Completed Program - assembling a complete program using the pieces above.
**Tokenization (Processing Lines and Recognizing Words)**

In most languages it is easy to separate words by just looking for the spaces in between. There are exceptions, of course, but as long as we stick with English (or Swedish) this is mostly true. Try these two lines in the Python shell:
="Simple is better than complex"
textprint(text.split())
['Simple', 'is', 'better', 'than', 'complex']
It looks as if Python already knows how to separate text into words. The method split simply splits a string into a list of words wherever they are separated by one or more spaces. Unfortunately, this solution would not get us very far:
```python
text = 'In the face of ambiguity, refuse the temptation to guess.'
print(text.split())
```
```
['In', 'the', 'face', 'of', 'ambiguity,', 'refuse', 'the', 'temptation', 'to', 'guess.']
```
Do you see the problem? Punctuation symbols such as commas and periods are usually glued to the preceding token without a space, even though we do not consider them part of the word. The method split does not know anything about punctuation, and that is the problem. We can do something better, but let's first clarify what we count as a word:
- a word cannot contain white spaces; white spaces are used to detect word boundaries but are otherwise ignored;
- a word is any sequence of one or more letters, i.e. symbols from the alphabet;
- a word is any sequence of digits, e.g. the symbols "0"…"9";
- any other symbol not covered above counts as a single word containing only that symbol.

Your first task is to define a function called tokenize which takes a complete document as a list of text lines and produces a list of tokens. For example:
```python
def tokenize(lines):
    words = []
    for line in lines:
        start = 0
        while start < len(line):
            # skip the whitespace before the next token
            while start < len(line) and line[start].isspace():
                start = start + 1
            # identify the character type
            if start < len(line) and line[start].isalpha():
                # a word: a maximal run of letters, stored in lower case
                end = start
                while end < len(line) and line[end].isalpha():
                    end = end + 1
                words.append(line[start:end].lower())
                start = end
            elif start < len(line) and line[start].isdigit():
                # a number: a maximal run of digits
                end = start
                while end < len(line) and line[end].isdigit():
                    end = end + 1
                words.append(line[start:end])
                start = end
            elif start < len(line):
                # any other symbol becomes a token on its own
                words.append(line[start])
                start = start + 1
    return words
```
= ['"They had 16 rolls of duct tape, 2 bags of clothes pins,',
document '130 hampsters from the cancer labs down the hall, and',
'at least 500 pounds of grape jello and unknown amounts of chopped liver"',
'said the source on a recent Geraldo interview.']
tokenize(document)
['"',
'they',
'had',
'16',
'rolls',
'of',
'duct',
'tape',
',',
'2',
'bags',
'of',
'clothes',
'pins',
',',
'130',
'hampsters',
'from',
'the',
'cancer',
'labs',
'down',
'the',
'hall',
',',
'and',
'at',
'least',
'500',
'pounds',
'of',
'grape',
'jello',
'and',
'unknown',
'amounts',
'of',
'chopped',
'liver',
'"',
'said',
'the',
'source',
'on',
'a',
'recent',
'geraldo',
'interview',
'.']
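As a side note, the same three rules can also be expressed with a regular expression. The following is only a sketch of that idea, not part of the lab code: the name tokenize_re is our own, and the character classes are an approximation of isalpha/isdigit that is close enough for English text.

```python
import re

def tokenize_re(lines):
    # A run of letters, a run of digits, or any single non-space symbol.
    # [^\W\d_]+ approximates str.isalpha(), \d+ approximates str.isdigit().
    pattern = re.compile(r'[^\W\d_]+|\d+|\S')
    words = []
    for line in lines:
        for token in pattern.findall(line):
            words.append(token.lower())  # lower() leaves digits and symbols unchanged
    return words
```

On the example document above this should produce the same token list as tokenize, but the hand-written loop makes the word rules much easier to see, which is the point of the exercise.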
The second part of the lab is to implement a function which takes a list of words and counts how often each word appears. In addition, the function takes a list of stop words. These are words which are not interesting and should be ignored while counting.
```python
def countWords(words, stopwords):
    frequencies = {}
    for word in words:
        if word in stopwords:
            continue
        if word not in frequencies:
            frequencies[word] = 1
        else:
            frequencies[word] += 1
    return frequencies
```
```python
countWords(['it', 'is', 'a', 'book'], ['a', 'is', 'it'])
```
```
{'book': 1}
```
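As an aside, the standard library already offers collections.Counter for exactly this kind of counting. The variant below is only our own sketch for comparison with the explicit loop above (countWords_counter is a made-up name):

```python
from collections import Counter

def countWords_counter(words, stopwords):
    # Counter is a dict subclass, so the result can be used anywhere
    # a plain frequency dictionary is expected.
    stop = set(stopwords)  # set membership tests are O(1)
    return Counter(word for word in words if word not in stop)
```

Called on the same example, countWords_counter(['it', 'is', 'a', 'book'], ['a', 'is', 'it']) gives Counter({'book': 1}).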
The last missing piece is to be able to print the collected statistics. We already have a way to construct a dictionary where the keys are the words and the values are the counts. We just have to iterate through the entries and print the data. Precisely for that the dictionary type has a method called items.
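For example, items returns the entries as (key, value) pairs, which is exactly what we can iterate over and sort:

```python
frequencies = {'word': 30, 'count': 11, 'text': 9}
print(list(frequencies.items()))   # [('word', 30), ('count', 11), ('text', 9)]
```

The function printTopMost below uses this to sort the entries by decreasing count and print the first n of them.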
```python
def printTopMost(frequencies, n):
    # sort the (word, count) pairs in decreasing order of count
    sorted_items = sorted(frequencies.items(), key=lambda x: -x[1])
    for i, (word, freq) in enumerate(sorted_items):
        if i >= n:
            break
        # word left-justified in 20 columns, count right-justified in 5
        print(f"{word.ljust(20)}{str(freq).rjust(5)}")
```
```python
printTopMost({'text': 9, 'word': 30, 'fiction': 6, 'count': 11, 'counting': 7, 'novel': 6}, 3)
```
```
word                   30
count                  11
text                    9
```
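If the dictionary is large and n is small, sorting everything is more work than necessary; heapq.nlargest from the standard library picks out the top n directly. The variant below is only a sketch to illustrate that alternative (printTopMost_heapq is a made-up name), not a required part of the lab:

```python
import heapq

def printTopMost_heapq(frequencies, n):
    # nlargest keeps only the n largest items instead of sorting all of them
    for word, freq in heapq.nlargest(n, frequencies.items(), key=lambda item: item[1]):
        print(f"{word.ljust(20)}{str(freq).rjust(5)}")
```

When counts are tied, the order of the tied words may differ slightly from the sorted version above.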
It is time to piece the different parts into a complete working program. Start a new module called topmost (file topmost.py) and import the module wordfreq from it (wordfreq.py should contain the functions tokenize, countWords, and printTopMost defined above).
```python
import wordfreq
import sys
import urllib.request

def main():
    if len(sys.argv) != 4:
        print("Usage: python3 topmost.py <stopwords_file> <input_file> <top_n>")
        sys.exit(1)

    stopwords_file = sys.argv[1]
    input_file = sys.argv[2]
    top_n = int(float(sys.argv[3]))  # handles if passed as 20. instead of 20

    # Read stopwords
    f1 = open(stopwords_file, "r")
    stopwords = [line.strip() for line in f1]
    f1.close()

    # Read input file
    if input_file.startswith("http://") or input_file.startswith("https://"):
        response = urllib.request.urlopen(input_file)
        lines = response.read().decode("utf8").splitlines()
    else:
        with open(input_file, "r", encoding="utf-8") as f2:
            lines = f2.readlines()

    tokens = wordfreq.tokenize(lines)
    word_counts = wordfreq.countWords(tokens, stopwords)
    wordfreq.printTopMost(word_counts, top_n)

# Call main
main()
```
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 31
     28     wordfreq.printTopMost(word_counts, top_n)
     30 # Call main
---> 31 main()

Cell In[20], line 11, in main()
      9 stopwords_file = sys.argv[1]
     10 input_file = sys.argv[2]
---> 11 top_n = int(float(sys.argv[3]))  # handles if passed as 20. instead of 20
     13 # Read stopwords
     14 f1 = open(stopwords_file, "r")

ValueError: could not convert string to float: '--HistoryManager.hist_file=:memory:'
```
This traceback appears because the cell calls main() directly inside the notebook: there sys.argv holds the notebook kernel's own command-line arguments (such as --HistoryManager.hist_file=:memory:) rather than the three arguments our program expects. Running topmost.py as its own process works as intended. Test the program:
```python
import subprocess

result = subprocess.run(
    ["python", "topmost.py", "eng_stopwords.txt", "examples/article1.txt", "20"],
    capture_output=True,
    text=True
)

# Print output
print("Output:")
print(result.stdout)

# Print any errors
if result.stderr:
    print("Error output:")
    print(result.stderr)
```
```
Output:
word                   30
words                  21
count                  11
text                    9
000                     9
counting                7
fiction                 6
novel                   6
rules                   5
length                  5
used                    4
usually                 4
details                 4
software                4
sources                 4
·                       4
processing              4
segmentation            4
rule                    4
novels                  4
```