ENGR 120 Quiz 5-PDFs, Word, Pandas

Created by

Thistle Dristle

Cards (43)

Extract text from page:
step 1. Identify which page you're interested in by index. Syntax: page = my_pdf.pages[ n ]
step 2. Extract the text from that page. Syntax: page_text = page.extract_text( )
step 3. Print. Syntax: print(page_text)
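The three steps above can be put together in a short sketch. The helper name extract_page_text and the file name example.pdf are made up for illustration, and the pypdf import is placed inside main() only so the helper can be tried without a PDF on hand.

```python
def extract_page_text(my_pdf, n):
    """Return the text of page n (pages are indexed from 0)."""
    page = my_pdf.pages[n]            # step 1: pick the page by index
    page_text = page.extract_text()   # step 2: extract its text as a string
    return page_text


def main(filepath="example.pdf"):     # hypothetical file name
    from pypdf import PdfReader       # needs: %pip install pypdf
    my_pdf = PdfReader(filepath)      # turn the PDF into a Python object
    print(extract_page_text(my_pdf, 0))  # step 3: print the first page
```

Calling main() with the path to a real PDF should print the text of its first page.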
To determine how many pages are in a PDF, what method would you use?
my_pdf.get_num_pages()
We can also extract text from a slice of pages as follows:
– Identify the desired slice
– Use an accumulator loop to extract and combine text from each page (with each page separated by a new line)
Similar approach to extract text from all pages:
– Identify that you want to use all pages (don't use an index)
– Use an accumulator loop to extract and combine text from each page (with each page separated by a new line)
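A minimal sketch of the accumulator loop described in the two cards above, assuming my_pdf was created with PdfReader. The function names are made up for illustration.

```python
def extract_slice_text(my_pdf, start, stop):
    """Combine the text of pages start up to (not including) stop."""
    all_text = ""                               # accumulator starts empty
    for page in my_pdf.pages[start:stop]:       # the desired slice
        all_text += page.extract_text() + "\n"  # add page text, then a new line
    return all_text


def extract_all_text(my_pdf):
    """Same loop, but over every page (no index or slice)."""
    all_text = ""
    for page in my_pdf.pages:
        all_text += page.extract_text() + "\n"
    return all_text
```

For example, extract_slice_text(my_pdf, 0, 3) would combine the first three pages.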
To use code within the pypdf library, we need to both install it and import it
%pip install pypdf
from pypdf import PdfReader
For Python to “read” the contents of the file using PdfReader
this essentially means that we’ll turn the PDF into something Python can understand
True or False? When using PdfReader, the primary component of a PDF is a paragraph, and when using Document, the primary component of a Word doc is a page.
False, the logic is reversed: with PdfReader the primary component is a page, and with Document it is a paragraph.
Imagine I’ve already extracted text from a PDF and I’m ready to analyze it. What result will the following command give? len(my_text)
The number of characters because you haven’t split up the text into a list of words.
In order to use the python-docx library to read a Word doc we need to
first install it (%pip install python-docx), then import it (from docx import Document)
Syntax for Python to "read" the contents of a file using Document
my_doc = Document(my_filepath)
Word docs have more structure than PDFs and are easier for Python to read and convert to a string.
For Document, the primary "component" of the Word doc is a paragraph. Thus, we extract text on a paragraph-by-paragraph basis
Method to determine the number of paragraphs in your Word doc.
len(my_doc.paragraphs)
python-docx starts a new paragraph at every paragraph break (each press of Enter in Word), much like the new line sequence \n separates lines in a string
Extracting text from a paragraph
Step 1. Identify which paragraph you're interested in via index.
Syntax: para = my_doc.paragraphs[n]
Step 2. Extract text from that paragraph. Syntax: text = para.text
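The two steps above, sketched as a small helper. The names extract_paragraph_text and example.docx are made up for illustration, and the docx import is placed inside main() only so the helper can be tried without a Word file on hand.

```python
def extract_paragraph_text(my_doc, n):
    """Return the text of paragraph n (paragraphs are indexed from 0)."""
    para = my_doc.paragraphs[n]   # step 1: pick the paragraph by index
    text = para.text              # step 2: .text is a property, so no parentheses
    return text


def main(my_filepath="example.docx"):   # hypothetical file name
    from docx import Document           # needs: %pip install python-docx
    my_doc = Document(my_filepath)
    print(len(my_doc.paragraphs))       # number of paragraphs in the doc
    print(extract_paragraph_text(my_doc, 0))
```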
.text property
Extracts the text of the paragraph you're interested in and gives it to you as a string. text is a property, not a method, so it is written without parentheses. It's a variable owned by an object, where the object here is para.
In this example, Python reads the first line as the first paragraph.
Extract text from slice of paragraphs
Step 1. Identify the desired slice (line 2)
Step 2. Use an accumulator loop to extract and combine text from each paragraph (separating paragraphs with a new line)
Extract text from ALL paragraphs
Step 1. Identify that you want to use all paragraphs (don't use an index)
Step 2. Use an accumulator loop to extract and combine text from each paragraph (separating paragraphs with a new line)
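A minimal sketch of the paragraph accumulator loop from the two cards above. The function name is made up. It accepts any list of paragraphs, so the same loop covers both the slice and the "all paragraphs" case.

```python
def combine_paragraphs(paragraphs):
    """Join the text of each paragraph, separating paragraphs with a new line."""
    all_text = ""                      # accumulator starts empty
    for para in paragraphs:
        all_text += para.text + "\n"   # add paragraph text, then a new line
    return all_text
```

Assuming my_doc = Document(my_filepath), a slice would be combine_paragraphs(my_doc.paragraphs[2:5]) and the whole document would be combine_paragraphs(my_doc.paragraphs).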
Once you've extracted the text from a PDF or Word doc, how would you obtain the number of characters?
len(my_string)
Once you've extracted the text from a PDF or Word doc, how would you obtain the number of words?
split method
word_list = my_str.split( )
len(word_list)
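A quick illustration of the difference between counting characters and counting words, using a made-up string in place of text extracted from a file:

```python
my_string = "the quick brown fox\njumps over the lazy dog"  # made-up text

num_chars = len(my_string)      # counts every character, including spaces and \n
word_list = my_string.split()   # split on whitespace to get a list of words
num_words = len(word_list)      # counts items in the list, i.e. words

print(num_chars, num_words)
```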
Checklist for Word frequency analysis function:
Initialize a dictionary
Make the string lower case
Split the string into a list of words
Iterate over each word in the list --> add the word and its count to the dictionary
Return dictionary
Calling word frequency function
output = function_name(input)
Number of unique words is
simply the length of the dictionary
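One way the checklist, the function call, and the unique-word count could look in code. The function name word_frequency and the input sentence are made up for illustration.

```python
def word_frequency(my_string):
    word_counts = {}                   # initialize a dictionary
    my_string = my_string.lower()      # make the string lower case
    word_list = my_string.split()      # split the string into a list of words
    for word in word_list:             # iterate over each word in the list
        if word in word_counts:
            word_counts[word] += 1     # seen before: increase its count
        else:
            word_counts[word] = 1      # first time: add it with a count of 1
    return word_counts                 # return dictionary


output = word_frequency("The cat saw the other cat")  # calling the function
num_unique = len(output)                              # number of unique words
print(output, num_unique)
```

Note that this sketch does not strip punctuation, so "cat" and "cat." would be counted as different words.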
Dataframes are
a data type provided by the Pandas library (a table of rows and columns)
How do you create dataframes?
By reading in an external file
What does CSV file type stand for
comma separated values
When importing Pandas, what is the syntax including the abbreviation?
import pandas as pd
The Pandas function read_csv( ) does two things in one command:
Reads the entire CSV file
Converts content to a Pandas dataframe
Generic syntax: my_df = pd.read_csv(filepath)
*note* make sure the data frame variable name is meaningful and not generic
By default, head( ) shows the first 5 rows and tail( ) shows the last 5 rows
or you can provide a specific number in the parentheses
If you want to see the whole dataframe, you simply type its name
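A small sketch of read_csv, head, and tail. To keep it runnable without a file on disk, it reads made-up CSV text through io.StringIO. With a real file you would pass the filepath instead. The column names follow the ones used elsewhere in these cards; the values are placeholders, not real data.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a real file (placeholder values).
csv_text = """state,totalPop,hispPop
A,100,10
B,200,5
C,300,90
D,400,4
E,500,50
F,600,6
G,700,70
"""

states_df = pd.read_csv(io.StringIO(csv_text))  # reads the CSV and builds a dataframe

first_five = states_df.head()    # default: first 5 rows
last_two = states_df.tail(2)     # specific number: last 2 rows
print(first_five)
print(last_two)
print(states_df)                 # in a notebook, typing states_df alone also works
```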
Three approaches to see a specific column:
Attribute approach: used if the column name is simple. Syntax: my_df.column_name
Label Approach: used if column name includes symbols or spaces. Syntax: my_df['column name']
Loc Approach. my_df.loc[: , ['column name']]
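The three approaches side by side on a tiny made-up dataframe (the column names and values are placeholders for illustration):

```python
import pandas as pd

# Made-up dataframe: one simple column name and one that contains a space.
my_df = pd.DataFrame({"state": ["A", "B", "C"], "total pop": [100, 200, 300]})

by_attribute = my_df.state             # attribute approach: simple names only
by_label = my_df["total pop"]          # label approach: works with spaces/symbols
by_loc = my_df.loc[:, ["total pop"]]   # loc approach: all rows, listed columns

print(by_attribute)
print(by_label)
print(by_loc)
```

With the list form shown here, the loc approach gives back a one-column dataframe, while the other two give back a single column (a Series).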
Subsetting: viewing a portion (a subset) of the dataframe. We can subset a data frame according to variables (columns), observations (rows), or both.
Write code that will allow you to view South Carolina's data (observations):
states_df[states_df.state == 'South Carolina']
General syntax: df_name[df_name.column_name == 'value of interest']
Write code that will allow you to determine the three states with the largest Hispanic population:
states_df.nlargest(3, 'hispPop')
How to select and order the top or bottom entries of observations (rows). n indicates the desired set size
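A sketch of both row-subsetting cards above on a made-up states_df. The numbers are placeholders, not real population figures, and nsmallest is shown as the counterpart for "bottom" entries.

```python
import pandas as pd

# Made-up placeholder data, not real population figures.
states_df = pd.DataFrame({
    "state": ["South Carolina", "State B", "State C", "State D"],
    "totalPop": [1000, 5000, 300, 4000],
    "hispPop": [60, 2000, 5, 1100],
})

sc_rows = states_df[states_df.state == "South Carolina"]  # rows where the test is True
top_two = states_df.nlargest(2, "hispPop")                 # 2 largest, ordered high to low
bottom_two = states_df.nsmallest(2, "hispPop")             # 2 smallest, ordered low to high

print(sc_rows)
print(top_two)
print(bottom_two)
```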
Combining math + Boolean expressions
Write code that selects states with less than 2% Hispanic population.
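Step 1. Compute the fraction: (states_df.hispPop / states_df.totalPop)
Step 2. Turn it into a Boolean expression: (states_df.hispPop / states_df.totalPop) < 0.02
Step 3. Put the Boolean expression inside the square brackets: states_df[(states_df.hispPop / states_df.totalPop) < 0.02]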
Boolean and/or operators with Pandas
and: df_name [ (criteria 1) & (criteria 2) ]
or: df_name [ (criteria 1) | (criteria 2) ]
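A sketch combining the math + Boolean steps with the & and | operators, again on made-up placeholder numbers. The extra totalPop thresholds are arbitrary and chosen only to show the operators.

```python
import pandas as pd

# Made-up placeholder data, not real population figures.
states_df = pd.DataFrame({
    "state": ["South Carolina", "State B", "State C", "State D"],
    "totalPop": [1000, 5000, 300, 4000],
    "hispPop": [60, 2000, 5, 40],
})

frac = states_df.hispPop / states_df.totalPop   # math, computed row by row
low_share = states_df[frac < 0.02]              # Boolean expression as the filter

# Each criterion needs its own parentheses when combined.
both = states_df[(frac < 0.02) & (states_df.totalPop > 500)]      # and
either = states_df[(frac < 0.02) | (states_df.totalPop > 4500)]   # or

print(low_share)
print(both)
print(either)
```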
What method can be used to subset a random sample of rows?
.sample ( ) method
General syntax: df_name.sample (n = x)
Subsetting variables (columns) 
General syntax: df_name [['column 1', 'column 2', 'column 3', etc.]]
*double square brackets on both sides*
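A short sketch of the last two cards, random row sampling and column subsetting, on made-up placeholder data:

```python
import pandas as pd

# Made-up placeholder data, not real population figures.
states_df = pd.DataFrame({
    "state": ["South Carolina", "State B", "State C", "State D"],
    "totalPop": [1000, 5000, 300, 4000],
    "hispPop": [60, 2000, 5, 40],
})

random_rows = states_df.sample(n=2)             # 2 random rows; differs run to run
two_columns = states_df[["state", "hispPop"]]   # double brackets: a list of columns

print(random_rows)
print(two_columns)
```

The inner pair of brackets is an ordinary Python list of column names, and the outer pair is the subsetting operation, which is why both are needed.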