PyFreq: Python implementation of word frequency count

Last modified April 18, 2014

Background

More Python

Python Implementation

Based on our brief discussion of Python so far, you should have all the tools needed to implement the Word Count program in Python, with the following caveats:

We haven’t discussed sorting collections in Python, so the output may be printed in any order.
Likewise, don’t implement any command line argument parsing for now; assume the program will always read from standard input.
A dict is not a sequence, so slicing (the [i:j] syntax that selects items i up to, but not including, j from a sequence) won’t work directly on a dictionary to limit how many lines of output are printed. For now, just print all counted words; don’t limit the output to 10 or any other fixed value.
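
To make the last caveat concrete, here is a minimal sketch. It is not the code we wrote in class. It counts words in a plain dict and shows that slice syntax is rejected by the dict itself. The function name `count_words` and the sample input are made up for illustration.

```python
def count_words(words):
    """Return a dict mapping each word to the number of times it occurs."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# Hypothetical sample input, standing in for words read from standard input.
counts = count_words("four two four one".split())

# A dict is a mapping, not a sequence, so slice syntax is rejected.
# The exact exception depends on the Python version, so catch both.
try:
    counts[0:2]
    slicing_worked = True
except (TypeError, KeyError):
    slicing_worked = False

# For this assignment, simply print every counted word.
for word, count in counts.items():
    print(count, word)
```

Printing every item of `counts.items()` sidesteps the slicing problem entirely, which is all this part of the assignment asks for.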

The code we wrote in class is mostly complete, though you will need to make a few changes:

Modify the output format to match
```
$ ./pyfreq.py < numbers
3 three
1 one
2 two
4 four
```
but remember, line order doesn’t matter and in fact you will probably get a different order each time you run the program.
Strip punctuation marks (!, ?, ., ,, :) from words so that, for example, given input
```
this line has punctuation marks!
this line has punctuation marks.
this line does not have punctuation marks
```
the program counts the word marks three times, instead of counting marks! once, marks. once and marks once.
- Hint: see the string method str.strip()
- Hint: In the string module there is a constant named punctuation
```
>>> import string
>>> print(string.punctuation)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
```
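
The two hints can be combined as in the sketch below. This is one possible approach under stated assumptions, not a required solution; the helper name `strip_punctuation` is hypothetical. Note that `str.strip()` only removes characters from the two ends of a string, so punctuation inside a word (such as the apostrophe in don't) is left alone.

```python
import string


def strip_punctuation(word):
    """Remove punctuation characters from both ends of a word."""
    return word.strip(string.punctuation)


# Show the effect on a few sample words.
for raw in ["marks!", "marks.", "marks", "(words)", "don't"]:
    print(repr(raw), "->", repr(strip_punctuation(raw)))
```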

Considerations

The tests will check for the existence of two files: pyfreq.py and wordlist.py
The tests will check that pyfreq.py imports wordlist as a module
The tests will check that the wordlist.py module contains test code that will read words from standard input and print the raw list obtained to standard output. The test code should run if wordlist.py is invoked from the command line, but not if it is imported by another file (Hint: just like we did in class).
```
$ echo "this is a list of words" > some_file
$ python3 wordlist.py < some_file
['this', 'is', 'a', 'list', 'of', 'words']
$ echo 'this, is. a? list: of (words)' > words_with_punct
$ python3 wordlist.py < words_with_punct
['this', 'is', 'a', 'list', 'of', 'words']
```
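
A minimal sketch of how such a module might be laid out, assuming a hypothetical function named `read_words` (the tests do not require any particular function name). The `if __name__ == "__main__":` guard is what makes the test code run only when the file is invoked from the command line.

```python
import string
import sys


def read_words(stream):
    """Return a list of the words in a text stream, punctuation stripped."""
    words = []
    for line in stream:
        for raw in line.split():
            word = raw.strip(string.punctuation)
            if word:  # skip tokens that were nothing but punctuation
                words.append(word)
    return words


if __name__ == "__main__":
    # Runs only when this file is executed directly, not when it is imported.
    print(read_words(sys.stdin))
```

When another file imports this module, `__name__` is the module's name rather than `"__main__"`, so nothing is read from standard input and nothing is printed.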
The tests will check that pyfreq.py can be executed by name:
- its execution permission bit should be set
- it should contain a valid ‘shebang’ line specifying the Python interpreter (read the “Portability” section of the linked Wikipedia article to gain insight as to why we prefer /usr/bin/env python3 instead of /usr/bin/python3)
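
If you want to check these two conditions yourself before pushing, something like the sketch below may help. It is a hypothetical self-check, not the logic the cucumber tests actually use; the helper names `has_shebang` and `is_executable` are made up, and the demonstration works on a throwaway file rather than your real pyfreq.py.

```python
import os
import stat
import tempfile


def has_shebang(path):
    """Return True if the first line of the file starts with '#!'."""
    with open(path, "rb") as f:
        return f.readline().startswith(b"#!")


def is_executable(path):
    """Return True if the owner's execute permission bit is set."""
    return bool(os.stat(path).st_mode & stat.S_IXUSR)


# Demonstrate on a throwaway script in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "demo.py")
    with open(script, "w") as f:
        f.write("#!/usr/bin/env python3\nprint('hello')\n")

    # Same effect as running `chmod u+x demo.py` in the shell.
    os.chmod(script, os.stat(script).st_mode | stat.S_IXUSR)

    executable_ok = is_executable(script)
    shebang_ok = has_shebang(script)
    print("executable:", executable_ok, "shebang:", shebang_ok)
```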

Submission

The source files should exist in their own git repository. If you change to the directory containing your source files and run ls -a, you should see a directory named .git. If not, run git init to initialize a git repository in the current directory. You should only run git init once for each new project.

Push your git repository to the remote at git@ece2524.ece.vt.edu:USER/pyfreq.git where USER is your git user name.

If you have initialized a new repo but have not added a remote yet:

$ git remote add origin git@ece2524.ece.vt.edu:USER/pyfreq.git

where USER is your git user name.

If you have already added a remote named origin, but the URL is incorrect, replace add with set-url in the above command. You can always check the remotes you have added by running git remote -v.

Remember, if this is the first time pushing to a new remote you need to specify a destination branch (usually `master`). Using the `-u` option will save this default destination for future pushes.

$ git push -u origin master

Testing

Feature repo path: features/pyfreq

The following features will be tested using cucumber:

@part1 
Feature: Word Frequency Utility
  
  Background:
    Given I am working from a clean git clone to "pyfreq"
    And I cd to "pyfreq"
    And a file named "fox.txt" with:
    """
    the quick brown fox jumped over the lazy cow.
    but the cow jumped over the moon!
    what does the fox say?
    
    """
    And a file named "numbers" with:
    """
    four two four one
    two four three three
    three four
    
    """
    Then a file named "pyfreq.py" should exist
    And a file named "wordlist.py" should exist 

  Scenario: 
    When I run the shell command "./pyfreq.py < fox.txt"
    Then its stdout should contain exactly 13 lines
    And its stdout should contain a line matching /5\s*the/
    And its stdout should contain a line matching /2\s*cow/
    And its stdout should contain a line matching /1\s*say/

  Scenario: Strip Punctuation
    When I run the shell command "./pyfreq.py < numbers"
    Then its stdout should contain exactly 4 lines
    And its stdout should contain a line matching /4\s+four$/
    And its stdout should contain a line matching /3\s+three$/
    And its stdout should contain a line matching /2\s+two$/
    And its stdout should contain a line matching /1\s+one$/

@part1
Feature: Modular Design

  Background:
    Given I am working from a clean git clone to "pyfreq"
    And I cd to "pyfreq"
    And a file named "some_words" with:
    """
    this is a
    list of words
    
    """
    And a file named "words_with_punctuation" with:
    """
    this, is. a; list? of: words!

    """

    Scenario:
      Then a file named "pyfreq.py" should exist
      And the file "pyfreq.py" should contain a line matching /(import|from) wordlist/

    Scenario: Split words
      Then a file named "wordlist.py" should exist
      When I run the shell command "python3 wordlist.py < some_words"
      Then its stdout should contain "['this', 'is', 'a', 'list', 'of', 'words']"

    Scenario: Strip punctuation
      Then a file named "wordlist.py" should exist
      When I run the shell command "python3 wordlist.py < words_with_punctuation"
      Then its stdout should contain "['this', 'is', 'a', 'list', 'of', 'words']"

@part1
Feature: executable command 

  Background:
    Given I am working from a clean git clone to "pyfreq"
    And I cd to "pyfreq"

  Scenario: does it execute?
    Then the file "pyfreq.py" should have a valid shebang
    And the file "pyfreq.py" should be executable

You can run the tests manually with

$ cucumber /usr/share/features/pyfreq

when logged in to your shell account. This command assumes your current working directory is your project directory.