PyFreq: Python implementation of word frequency count
Background
Python Implementation
Based on our brief discussion of Python so far, you should have all the tools needed to implement the Word Count program in Python, with the following caveats:
- We haven’t discussed sorting collections in Python, so the output may be printed in any order
- Likewise, don’t implement any command line argument parsing for now and assume the program will always read from standard input.
- A dict is not a sequence, so slicing (the `[i:j]` syntax to get the items at index `i` up to `j` from a sequence) won't work directly on a dictionary to limit how many lines of output are printed. For now, just print all counted words; don't limit the output to 10 or any other fixed value.
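As a rough illustration of these caveats, the sketch below counts words with a dict and prints every entry without sorting or slicing. The sample list `words` and the variable names are assumptions made for the example, not the code from class; the real program reads its words from standard input.

```python
# Hypothetical sample data; the real program reads words from standard input.
words = ["four", "two", "four", "one", "two",
         "four", "three", "three", "three", "four"]

counts = {}
for word in words:
    # dict.get returns the default (0) for a word that has not been seen yet
    counts[word] = counts.get(word, 0) + 1

# Print every counted word as "count word"; no sorting and no limit
for word, count in counts.items():
    print(count, word)
```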
The code we wrote in class is mostly complete, though you will need to make a few changes:
- Modify the output format to match

      $ ./pyfreq.py < numbers
      3 three
      1 one
      2 two
      4 four

  but remember, line order doesn't matter and, depending on your Python version, you may get a different order each time you run the program.
- Strip punctuation marks (`!`, `?`, `.`, `,`, `:`) from words so that, for example, given the input

      this line has punctuation marks!
      this line has punctuation marks.
      this line does not have punctuation marks

  the program counts the word `marks` three times, instead of counting `marks!` once, `marks.` once and `marks` once.
  - Hint: see `string.strip()` (the `strip()` method of string objects)
  - Hint: In the `string` module there is a constant named `punctuation`

        >>> import string
        >>> print(string.punctuation)
        !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
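The two hints can be combined roughly as follows. This is only a sketch: the helper name `strip_punctuation` is hypothetical, and it assumes that stripping characters from the ends of each word is enough, since `str.strip()` leaves punctuation inside a word (such as the apostrophe in `don't`) untouched.

```python
import string

def strip_punctuation(word):
    # str.strip removes the given characters from both ends of the word only
    return word.strip(string.punctuation)

line = "this line has punctuation marks!"
cleaned = [strip_punctuation(w) for w in line.split()]
print(cleaned)  # ['this', 'line', 'has', 'punctuation', 'marks']
```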
Considerations
- The tests will check for the existence of two files: `pyfreq.py` and `wordlist.py`
- The tests will check that `pyfreq.py` imports `wordlist` as a module
- The tests will check that the `wordlist.py` module contains test code that will read words from standard input and print the raw list obtained to standard output. The test code should run if `wordlist.py` is invoked from the command line, but not if it is imported by another file (Hint: just like we did in class).

      $ echo "this is a list of words" > some_file
      $ python3 wordlist.py < some_file
      ['this', 'is', 'a', 'list', 'of', 'words']
      $ echo 'this, is. a? list: of (words)' > words_with_punct
      $ python3 wordlist.py < words_with_punct
      ['this', 'is', 'a', 'list', 'of', 'words']

- The tests will check that `pyfreq.py` can be executed by name:
  - its execution permission bit should be set
  - it should contain a valid 'shebang' line specifying the Python interpreter (read the "Portability" section of the linked Wikipedia article to gain insight as to why we prefer `/usr/bin/env python3` instead of `/usr/bin/python3`)
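One common way to meet the "run from the command line, but not on import" requirement is a `__name__` guard. The layout below is a minimal sketch, not necessarily the code written in class: the function name `read_words` is hypothetical, and punctuation stripping is left out to keep the example short.

```python
#!/usr/bin/env python3
import sys

def read_words(stream):
    """Return a flat list of whitespace-separated words read from stream."""
    words = []
    for line in stream:
        words.extend(line.split())
    return words

if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported
    print(read_words(sys.stdin))
```

When another file imports this module, `__name__` is set to the module's name rather than `"__main__"`, so the guarded block is skipped and only the function definition is made available.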
Submission
The source files should exist in their own git repository. If you change to the directory containing your source files and run `ls -a`, you should see a directory named `.git`. If not, run `git init` to initialize a git repository in the current directory. You should only run `git init` once for each new project.
Push your git repository to the remote at `git@ece2524.ece.vt.edu:USER/pyfreq.git` where `USER` is your git user name.
If you have initialized a new repo but have not added a remote yet:
$ git remote add origin git@ece2524.ece.vt.edu:USER/pyfreq.git
where `USER` is your git user name.
If you have already added a remote named `origin`, but the URL is incorrect, replace `add` with `set-url` in the above command. You can always check which remotes you have added by running `git remote -v`.
Remember, if this is the first time pushing to a new remote you need to specify a destination branch (usually `master`). Using the `-u` option will save this default destination for future pushes.
$ git push -u origin master
Testing
Feature repo path: features/pyfreq
The following features will be tested using cucumber:
@part1
Feature: Word Frequency Utility
Background:
Given I am working from a clean git clone to "pyfreq"
And I cd to "pyfreq"
And a file named "fox.txt" with:
"""
the quick brown fox jumped over the lazy cow.
but the cow jumped over the moon!
what does the fox say?
"""
And a file named "numbers" with:
"""
four two four one
two four three three
three four
"""
Then a file named "pyfreq.py" should exist
And a file named "wordlist.py" should exist
Scenario:
When I run the shell command "./pyfreq.py < fox.txt"
Then its stdout should contain exactly 13 lines
And its stdout should contain a line matching /5\s*the/
And its stdout should contain a line matching /2\s*cow/
And its stdout should contain a line matching /1\s*say/
Scenario: Strip Punctuation
When I run the shell command "./pyfreq.py < numbers"
Then its stdout should contain exactly 4 lines
And its stdout should contain a line matching /4\s+four$/
And its stdout should contain a line matching /3\s+three$/
And its stdout should contain a line matching /2\s+two$/
And its stdout should contain a line matching /1\s+one$/
@part1
Feature: Modular Design
Background:
Given I am working from a clean git clone to "pyfreq"
And I cd to "pyfreq"
And a file named "some_words" with:
"""
this is a
list of words
"""
And a file named "words_with_punctuation" with:
"""
this, is. a; list? of: words!
"""
Scenario:
Then a file named "pyfreq.py" should exist
And the file "pyfreq.py" should contain a line matching /(import|from) wordlist/
Scenario: Split words
Then a file named "wordlist.py" should exist
When I run the shell command "python3 wordlist.py < some_words"
Then its stdout should contain "['this', 'is', 'a', 'list', 'of', 'words']"
Scenario: Strip punctuation
Then a file named "wordlist.py" should exist
When I run the shell command "python3 wordlist.py < words_with_punctuation"
Then its stdout should contain "['this', 'is', 'a', 'list', 'of', 'words']"
@part1
Feature: executable command
Background:
Given I am working from a clean git clone to "pyfreq"
And I cd to "pyfreq"
Scenario: does it execute?
Then the file "pyfreq.py" should have a valid shebang
And the file "pyfreq.py" should be executable
You can run the tests manually with
$ cucumber /usr/share/features/pyfreq
when logged in to your shell account. This command assumes your current working directory is your project directory.
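To get a feel for what the line-matching steps check before running cucumber, the patterns can be approximated with Python's `re` module. This is only a sketch: the string `output` is a hypothetical capture of `./pyfreq.py < numbers`, and the real tests use cucumber's own matching, which may differ in details.

```python
import re

# Hypothetical captured output of "./pyfreq.py < numbers"
output = "3 three\n1 one\n2 two\n4 four\n"
lines = output.splitlines()

# The same patterns as in the scenario that runs against "numbers"
patterns = [r"4\s+four$", r"3\s+three$", r"2\s+two$", r"1\s+one$"]

# re.search looks for the pattern anywhere in a line, so the line order
# and any leading whitespace do not matter
all_match = all(any(re.search(p, line) for line in lines) for p in patterns)
print(len(lines) == 4 and all_match)  # True
```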