
Assignment 1: Preliminaries

In this assignment, you will:

Submission summary:

NB: This assignment may seem very long.

The purpose is to give you enough time to become accustomed to a range of tools which are unavoidable in most computational work. Go through it gradually, one part at a time (and split those between multiple days if needed). Post any questions to the Discussion Board on Canvas! Before posting, consult the Asking Questions on the Discussion Board Guidelines.

Part 1: The IMDB review dataset

(NB: Complete as you do the readings assigned “for April 8.” There is no significance to their being listed in the April 8 cell; you can read them any time, for example right now.)

We will be using the IMDB review dataset (Maas et al. 2011).

  1. Download the dataset from here.
  2. On Linux/Mac, use the file you just downloaded. On Windows, download the dataset from Canvas–>Files instead, but still follow the link and read what’s there. The dataset is an archived folder of the TAR.GZ type, and you need special software to extract its contents. Windows does not have that software by default, but it can handle .zip, so I created a ZIP version for you and uploaded it to Canvas.
  3. Unpack the dataset by double-clicking on it or by using a utility appropriate for your OS.
  4. Read the README file which comes with the dataset.
  5. In a text file, answer the following questions about the dataset:
    1. How many movie reviews does it contain?
    2. How is the dataset divided? (Here, talk about how many reviews are in each folder and what each folder represents, in your own words. Do not copy the text from the README file.)
    3. Why is it divided in this way? (Make sure to give a thoughtful answer here, at least a paragraph! You may not yet know everything about this, but answer the best you can, based on what you learned in the first couple weeks of class.) (NB: Complete closer to the due date!)
    4. Why is a citation to the ACL paper by Maas et al. included in the README file and in the dataset description on the website? (What is the relationship of the paper and of the dataset? Thoughtful, paragraph-length answer here.)
    5. Why is there a reference to Potts’s paper?
    6. Would you say this README file qualifies as a “data statement” (see the Bender and Friedman paper, which was assigned earlier)? If yes, point to the specific portions of the file and map them to the corresponding definitions from Bender and Friedman’s paper. If no, explain what a data statement could look like for such a dataset or why the concept does not apply here. You can of course argue against data statements here if you like! It is up to you; what counts is the depth and quality of the argument.
  6. Submit your text file to Canvas, in the appropriate area associated with Assignment 1.
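
If you would rather not hunt for unpacking software, Python's standard library can read TAR.GZ archives directly (you will install Python in Part 3). The following is only a minimal sketch and not part of the assignment. The function names are made up for illustration, and the counting helper assumes that each review is stored as a separate .txt file, which you should verify against the README rather than take from here.

```python
import tarfile
from collections import Counter
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir using only the standard library."""
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction mode.
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)


def count_text_files(root):
    """Count .txt files per folder below root (ASSUMPTION: one review per .txt file)."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*.txt"):
        # Key each count by the folder path relative to root, e.g. "a/b".
        counts[path.parent.relative_to(root).as_posix()] += 1
    return dict(counts)
```

To try it, call extract_archive with the path to the archive you downloaded and a destination folder, then call count_text_files on the unpacked folder. Any other .txt files shipped with the dataset are counted under their own folders too, so compare the numbers with what the README says before relying on them.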

Part 2: Git and Your GitHub repository (NB: Complete what you can now and the rest after April 8.)

In this part of the assignment, you will create a GitHub repository for your code in this class.

  1. Create an account on https://github.com/
  2. Create a new repository; make it private. Call it whatever you like, but you will use it for this class. The screenshot below shows what creating a repository looks like on GitHub:

    [Screenshot]

  3. After you’ve created the repository, note its https:// address:

    [Screenshot]

  4. Go to Settings and add olzama and yuanheTian as “collaborators” (NB: this counts as your submission for this task):

    [Screenshot]

  5. Now, install git on your machine. Click on the “Latest Source Release 2.31.0” button here.
  6. Ask some question about git or leave a comment about it in the Assignment 1 area on Canvas. It can be anything.

Part 3: Python and Visual Studio Code. (NB: Complete what you can now. You should be able to do everything after April 8.)

In this part of the assignment, you will start learning how to program in python and how to use an Integrated Development Environment.

  1. Download and install python (For now, just install the latest version (3.9.2) by clicking the most visible button “Download Python 3.9.2” here)
  2. Download and install Visual Studio Code, a free IDE. Use all the default settings during the installation.
  3. Install support for python by clicking on “Python” here:

    [Screenshot]

  4. Configure git for VS Code following these steps. In particular, open the terminal (cmd on Windows; type cmd in the search field to the right of the Start button) and run the two commands as indicated in the steps, with your GitHub user name and email. Hit Enter after each command; there should be no message.

  5. Now, clone the repository you’ve created in Part 2 into VS Code. Click on “clone a repository” and then enter the https:// address of your repository. Then choose a folder for the local copy of the repository to go to. It can be any folder on your computer, such as one dedicated to this class.

    [Screenshot]

  6. Locate the local copy of your repository on your computer. (Navigate to the folder which you chose when cloning.) This is what it looks like in my Finder (I use a Mac):

    [Screenshot]

  7. Add a python file to your repository. You can do it any way you like, including from within VS Code. It is crucial that the file name has your UW netID in it!!!

    [Screenshot]

  8. Write a program in python which prints a statement, such as “Hello, world!” (or whatever you like).
  9. Find the Source Control menu in the left panel (it’s the one with the number “1” in the picture). Then find the small icon which is for “staging changes” and click on it. You will then see something like this:

    [Screenshot]

  10. Enter a meaningful commit message. Then click on the “check mark” to commit the staged changes.

    [Screenshot]

  11. Now click on the “…”, find the command “Push”, and click it.
  12. Give it a few minutes, and check that your python file can be found not only in the local copy but also in the remote repository. (NB: This will count as your submission for this task.) Check once more that the file name has your UW netID in it!!! If not, rename it so it does!

    [Screenshot]

  13. Write YOUR FULL NAME in your README file using the GitHub website. Click on the README file, click on edit, write something, then click on “Commit changes”.

    [Screenshot] [Screenshot]

  14. Now go to your VS Code, to the Source Control pane, find “Pull” under “…” and pull the changes into your local copy of the repository. Make sure that you now see the updated README!

  15. Leave a question or comment about VS Code in the Assignment 1 discussion area on Canvas, if you haven’t already. (The purpose is to show that you got at least some experience with the software. Otherwise, it can be about anything.)
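
For step 8 above, the program can be as small as the sketch below. The function name greeting is just an illustration, and the file name is up to you as long as it contains your UW netID (for example, a hypothetical hello_yournetid.py).

```python
def greeting(name="world"):
    """Build the message to print; change the default to whatever you like."""
    return f"Hello, {name}!"


if __name__ == "__main__":
    # This part runs only when the file is executed directly.
    print(greeting())
```

Putting the message in a function is not required for this assignment; it simply makes the program easy to extend later.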

Part 4: Command line and remote servers. (NB: Complete after April 8)

It is important to be able to connect to a remote server and to be able to copy files between that server and your machine. It is also important to be able to run things like python or git via the command line (rather than in an IDE such as VS Code or in a GUI such as github.com).

  1. Open a terminal on your Linux/Mac or Windows 10 machine (if you have an earlier version of Windows, you will need additional instructions, so contact Olga and Yuanhe ahead of time).
  2. Connect to the patas cluster (where you should have created an account last week!):

    ssh your-NetID@patas.ling.washington.edu

    (It will ask you whether you should add patas to trusted hosts; type yes.)

  3. Check that you have proper access to patas. You should see something like: your-user-name@patas:~$ or simply bash-4.2$. If you don’t yet have access, let Olga know. Sometimes there are delays in how the accounts are created and set up. If you see bash-4.2$ or similar but would like to see your username and current directory instead, try typing the following in the terminal and pressing Enter:

    echo "PS1='\u@patas:\w\$ '" >> ~/.bash_profile; source ~/.bash_profile

    That will change what you see in the prompt.

  4. Clone your git repository into your home directory on patas:

    git clone your-repo-address

    (It will ask you for your GitHub username and password. There may be some error messages; ignore them. Just make sure you type your password correctly.)

  5. Navigate into your repository folder on patas. Execute the python program and observe it printing whatever it prints.

    [Screenshot]

  6. Copy your python program to /dropbox/20-21/471 (the class folder). It is crucial that the file name has your UW netID in it!!! Otherwise you may overwrite other people’s homework, plus we won’t know the program is yours and you won’t get credit.
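
The copy in step 6 is normally a single command typed in the terminal. To illustrate why the file name matters, here is a hedged Python sketch of a copy that refuses to proceed if the name lacks your netID or if a file with the same name is already in the destination. The function name copy_without_overwrite and its arguments are hypothetical, not something provided on patas.

```python
import shutil
from pathlib import Path


def copy_without_overwrite(source, dest_dir, net_id):
    """Copy source into dest_dir, but never replace an existing file."""
    source = Path(source)
    if net_id not in source.name:
        raise ValueError(f"{source.name} does not contain the netID {net_id!r}")
    target = Path(dest_dir) / source.name
    # Mode "xb" creates the file and fails with FileExistsError if it already exists.
    with open(source, "rb") as src, open(target, "xb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

A call would pass your program's path, the class folder /dropbox/20-21/471, and your netID. If it raises FileExistsError, someone (possibly you) has already put a file with that name there.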

You are now DONE with Assignment 1! Don’t forget to submit the file for Part 1 to Canvas. In Assignment 2, you will already be writing programs and running them on the IMDB review data!
