ENGR 120 Quiz 5-PDFs, Word, Pandas

Cards (43)

  • Extract text from a page:
    Step 1. Identify which page you're interested in by index. Syntax: page = my_pdf.pages[n]
    Step 2. Extract the text from that page. Syntax: page_text = page.extract_text()
    Step 3. Print. Syntax: print(page_text)
  • To determine how many pages are in a PDF, what method would you use?
    my_pdf.get_num_pages()
  • We can also extract text from a slice of pages as follows:
    – Identify the desired slice
    – Use an accumulator loop to extract and combine text from each page (with each page separated by a new line)
  • Similar approach to extract text from all pages:
    – Identify that you want to use all pages (don't use an index)
    – Use an accumulator loop to extract and combine text from each page (with each page separated by a new line)
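    The accumulator loop from the two cards above can be sketched as follows. This is a minimal sketch rather than the course's official solution: the function name combine_page_text, the file name report.pdf, and the DemoPage stand-in class are assumptions made for illustration. DemoPage only mimics the extract_text() method so the loop can run without a real PDF.

```python
def combine_page_text(pages):
    """Accumulate the text of every page in `pages`, one page per line."""
    all_text = ""                                # accumulator starts empty
    for page in pages:
        all_text += page.extract_text() + "\n"   # new line after each page
    return all_text


# Intended use with pypdf (the file name is hypothetical):
#   from pypdf import PdfReader
#   my_pdf = PdfReader("report.pdf")
#   slice_text = combine_page_text(my_pdf.pages[1:4])  # pages at index 1, 2, 3
#   full_text = combine_page_text(my_pdf.pages)        # all pages, no index


class DemoPage:
    """Hypothetical stand-in that mimics a pypdf page for this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo_pages = [DemoPage("Page one"), DemoPage("Page two"), DemoPage("Page three")]
slice_text = combine_page_text(demo_pages[0:2])   # first two pages only
full_text = combine_page_text(demo_pages)         # every page
print(full_text)
```

    The same function handles both cases because a slice and the full pages collection can each be looped over one page at a time.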
  • To use code within the pypdf library, we need to both install it and import it

    %pip install pypdf
    from pypdf import PdfReader
  • For Python to “read” the contents of the file, we pass the file to PdfReader. This essentially means that
    we turn the PDF into something Python can understand. Syntax: my_pdf = PdfReader(my_filepath)
  • True or False? When using PdfReader, the primary component of a PDF is a paragraph, and when using Document, the primary component of a Word doc is a page.
    False, the logic is reversed: PdfReader works page by page, and Document works paragraph by paragraph.
  • Imagine I’ve already extracted text from a PDF and I’m ready to analyze it. What result will the following command give? len(my_text)
    The number of characters because you haven’t split up the text into a list of words.
  • In order to use the python-docx library to read a Word doc, we need to
    first install it (%pip install python-docx), then import it (from docx import Document)
  • Syntax for Python to "read" the contents of a file using Document
    my_doc = Document(my_filepath)
  • Word docs have more structure than PDFs and are easier for Python to read and convert to a string.
  • For Document, the primary "component" of the Word doc is a paragraph. Thus, we extract text on a paragraph-by-paragraph basis
  • Method to determine the number of paragraphs in your Word doc.
    len(my_doc.paragraphs)
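  • Python considers something a paragraph every time it sees a paragraph break (each press of Enter in Word), so a one-line title also counts as a paragraph. When we combine paragraphs into one string, we separate them with the new line sequence: \n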
  • Extracting text from a paragraph
    Step 1. Identify which paragraph you're interested in via index.
    Syntax: para = my_doc.paragraphs[n]
    Step 2. Extract text from that paragraph. Syntax: text = para.text
  • .text property
    Extracts the text of the paragraph you're interested in and returns it as a string. text is a property, not a method, so it takes no parentheses. It's a variable owned by an object, where the object here is para.
  • In this example, Python reads the first line as the first paragraph.
  • Extract text from slice of paragraphs
    Step 1. Identify the desired slice
    Step 2. Use an accumulator loop to extract and combine text from each paragraph (separating paragraphs with a new line)
  • Extract text from ALL paragraphs
    Step 1. Identify that you want to use all paragraphs (don't use an index)
    Step 2. Use an accumulator loop to extract and combine text from each paragraph (separating paragraphs with a new line)
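    The paragraph accumulator loop described above might look like the sketch below. The function name combine_paragraph_text and the file name essay.docx are assumptions, and the demo uses SimpleNamespace objects as hypothetical stand-ins for python-docx paragraphs (they only carry a .text value) so the loop runs without a real Word doc.

```python
from types import SimpleNamespace


def combine_paragraph_text(paragraphs):
    """Accumulate the text of each paragraph, separated by new lines."""
    all_text = ""                        # accumulator starts empty
    for para in paragraphs:
        all_text += para.text + "\n"     # .text is a property: no parentheses
    return all_text


# Intended use with python-docx (the file name is hypothetical):
#   from docx import Document
#   my_doc = Document("essay.docx")
#   slice_text = combine_paragraph_text(my_doc.paragraphs[0:3])  # a slice
#   full_text = combine_paragraph_text(my_doc.paragraphs)        # all paragraphs

# Stand-in paragraphs for the demo (not real python-docx objects).
demo_paragraphs = [
    SimpleNamespace(text="Title"),
    SimpleNamespace(text="First body paragraph."),
    SimpleNamespace(text="Second body paragraph."),
]
num_paragraphs = len(demo_paragraphs)
slice_text = combine_paragraph_text(demo_paragraphs[1:3])
full_text = combine_paragraph_text(demo_paragraphs)
print(full_text)
```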
  • Once you've extracted the text from a PDF or Word doc, how would you obtain the number of characters?
    len(my_string)
  • Once you've extracted the text from a PDF or Word doc, how would you obtain the number of words?
    split method
    word_list = my_str.split( )
    len(word_list)
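    A short sketch of the difference between counting characters and counting words. The sample string is made up for illustration; in practice my_string would hold the text extracted from a PDF or Word doc.

```python
# Made-up sample text standing in for extracted document text.
my_string = "The quick brown fox\njumps over the lazy dog"

num_chars = len(my_string)       # counts every character, spaces and \n included
word_list = my_string.split()    # split() with no argument splits on any whitespace
num_words = len(word_list)

print(num_chars)   # 43
print(num_words)   # 9
```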
  • Checklist for Word frequency analysis function:
    1. Initialize a dictionary
    2. Make the string lower case
    3. Split the string into a list of words
    4. Iterate over each word in the list --> add the word and its count to the dictionary
    5. Return dictionary
  • Calling word frequency function
    output = function_name(input)
  • Number of unique words is
    simply the length of the dictionary
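    The checklist above could be turned into a function along these lines. This is one possible sketch: the function name word_frequency and the sample sentence are assumptions, and punctuation is not stripped, so "cat" and "cat." would be counted as different words.

```python
def word_frequency(text):
    """Return a dictionary mapping each word in `text` to its count."""
    counts = {}                      # 1. initialize a dictionary
    text = text.lower()              # 2. make the string lower case
    words = text.split()             # 3. split the string into a list of words
    for word in words:               # 4. iterate over each word
        if word in counts:
            counts[word] += 1        # seen before: add one to its count
        else:
            counts[word] = 1         # first time: start its count at 1
    return counts                    # 5. return the dictionary


# Calling the function: output = function_name(input)
freq = word_frequency("The cat saw the other cat")
num_unique = len(freq)               # number of unique words

print(freq)         # {'the': 2, 'cat': 2, 'saw': 1, 'other': 1}
print(num_unique)   # 4
```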
  • Dataframes are
    a Python data type, provided by the Pandas library, that stores data as a table of rows and columns
  • How do you create dataframes?
    By reading in an external file
  • What does CSV file type stand for
    comma separated values
  • When importing Pandas, what is the syntax including the abbreviation?
    import pandas as pd
  • The Pandas function read_csv() does two things in one command:
    1. Reads the entire CSV file
    2. Converts the content to a Pandas dataframe
    Generic syntax: my_df = pd.read_csv(filepath)
    *note* make sure the dataframe variable name is meaningful and not generic
  • By default, head() shows the first 5 rows and tail() shows the last 5 rows,
    or you can provide a specific number in the parentheses
  • If you want to see the whole dataframe, you simply type its name
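    A minimal sketch of reading a CSV and previewing it. To keep the example self-contained, the CSV content is held in a string and wrapped in io.StringIO, which stands in for a file path. The column names follow the states example used in these cards, but the numbers are made-up toy values, not real census figures.

```python
import io

import pandas as pd

# Toy CSV content (made-up numbers) standing in for an external file.
csv_text = """state,totalPop,hispPop
South Carolina,5000,300
Texas,3000,900
Vermont,1000,10
Florida,2000,700
Ohio,4000,160
Maine,1500,30
"""

# Normally: states_df = pd.read_csv(filepath)
states_df = pd.read_csv(io.StringIO(csv_text))

first_five = states_df.head()    # first 5 rows by default
last_two = states_df.tail(2)     # last 2 rows

print(first_five)
print(last_two)
```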
  • Three approaches to see a specific column:
    1. Attribute approach: used if the column name is simple. Syntax: my_df.column_name
    2. Label approach: used if the column name includes symbols or spaces. Syntax: my_df['column name']
    3. Loc approach. Syntax: my_df.loc[:, ['column name']]
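    The three approaches can be compared side by side in the sketch below. The dataframe and its column names ("state" and "total pop") are hypothetical, with made-up values. Note that the loc form written with a list of column names returns a dataframe, while the other two return a single column (a Series).

```python
import pandas as pd

# Hypothetical dataframe with one simple name and one name with a space.
my_df = pd.DataFrame({
    "state": ["South Carolina", "Texas", "Vermont"],
    "total pop": [5000, 3000, 1000],          # made-up values
})

by_attribute = my_df.state                    # simple column name only
by_label = my_df["total pop"]                 # works with spaces or symbols
by_loc = my_df.loc[:, ["total pop"]]          # all rows, listed columns

print(type(by_attribute).__name__)   # Series
print(type(by_label).__name__)       # Series
print(type(by_loc).__name__)         # DataFrame
```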
  • Subsetting: viewing a portion (a subset) of the dataframe. We can subset a data frame according to variables (columns), observations (rows), or both.
  • Write code that will allow you to view South Carolina's data (observations):
    states_df[states_df.state == 'South Carolina']
    General syntax: df_name[df_name.column_name == 'value of interest']
  • Write code that will allow you to determine the three states with the largest Hispanic population:
    states_df.nlargest(3, 'hispPop')
  • How to select and order the top or bottom entries of observations (rows): df_name.nlargest(n, 'column name') or df_name.nsmallest(n, 'column name'), where n indicates the desired set size
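    A small sketch of row subsetting and of selecting the top or bottom entries. The dataframe reuses the column names from the cards above, but its values are made-up toy numbers rather than real census data.

```python
import pandas as pd

# Toy data (made-up numbers) with the column names used in the cards.
states_df = pd.DataFrame({
    "state": ["South Carolina", "Texas", "Vermont", "Florida"],
    "totalPop": [5000, 3000, 1000, 2000],
    "hispPop": [300, 900, 10, 700],
})

# Rows where the state column equals 'South Carolina'.
sc_rows = states_df[states_df.state == "South Carolina"]

# Three rows with the largest hispPop, ordered largest first.
top_three = states_df.nlargest(3, "hispPop")

# Two rows with the smallest hispPop, ordered smallest first.
bottom_two = states_df.nsmallest(2, "hispPop")

print(sc_rows)
print(top_three)
print(bottom_two)
```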
  • Combining math + Boolean expressions
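    Write code that selects states with less than 2% Hispanic population.
    Step 1. Compute the share: (states_df.hispPop / states_df.totalPop)
    Step 2. Turn it into a Boolean expression: (states_df.hispPop / states_df.totalPop) < 0.02
    Step 3. Use that expression to subset: states_df[(states_df.hispPop / states_df.totalPop) < 0.02]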
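    The steps above, applied to a toy dataframe. The values are made up for illustration, and the variable names share and low_share are assumptions. The math produces one number per row, and the comparison turns those numbers into True/False values that pick the rows to keep.

```python
import pandas as pd

# Toy data (made-up numbers) with the column names used in the cards.
states_df = pd.DataFrame({
    "state": ["South Carolina", "Texas", "Vermont", "Florida"],
    "totalPop": [5000, 3000, 1000, 2000],
    "hispPop": [300, 900, 10, 700],
})

share = states_df.hispPop / states_df.totalPop    # math: one share per row
is_low = share < 0.02                             # Boolean: True where under 2%
low_share = states_df[is_low]                     # keep only the True rows

print(low_share)
```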
  • Boolean and/or operators with Pandas
    and: df_name[(criteria 1) & (criteria 2)]
    or: df_name[(criteria 1) | (criteria 2)]
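    A sketch of the and/or syntax on a toy dataframe (made-up values; the thresholds are arbitrary). The parentheses around each criterion matter because & and | bind more tightly than comparison operators such as > and <.

```python
import pandas as pd

# Toy data (made-up numbers) with the column names used in the cards.
states_df = pd.DataFrame({
    "state": ["South Carolina", "Texas", "Vermont", "Florida"],
    "totalPop": [5000, 3000, 1000, 2000],
    "hispPop": [300, 900, 10, 700],
})

# and: both criteria must be True for a row to be kept.
both = states_df[(states_df.totalPop > 1500) & (states_df.hispPop < 500)]

# or: a row is kept if at least one criterion is True.
either = states_df[(states_df.totalPop > 4000) | (states_df.hispPop > 800)]

print(both)
print(either)
```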
  • What method can be used to subset a random sample of rows?
    .sample() method
    General syntax: df_name.sample(n=x)
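    A brief sketch of random sampling on a toy dataframe (made-up values). The random_state argument is optional and is an addition not mentioned in the card; it is included here only so the same rows come back on every run.

```python
import pandas as pd

# Toy data (made-up numbers) with the column names used in the cards.
states_df = pd.DataFrame({
    "state": ["South Carolina", "Texas", "Vermont", "Florida"],
    "totalPop": [5000, 3000, 1000, 2000],
    "hispPop": [300, 900, 10, 700],
})

# Two randomly chosen rows; random_state fixes the random choice.
random_rows = states_df.sample(n=2, random_state=0)

print(random_rows)
```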
  • Subsetting variables (columns)

    General syntax: df_name[['column 1', 'column 2', 'column 3', etc.]]
    *double square brackets on both sides*
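    A sketch of column subsetting on a toy dataframe (made-up values). The inner square brackets build a list of column names and the outer pair does the subsetting, which is why both pairs are needed.

```python
import pandas as pd

# Toy data (made-up numbers) with the column names used in the cards.
states_df = pd.DataFrame({
    "state": ["South Carolina", "Texas", "Vermont", "Florida"],
    "totalPop": [5000, 3000, 1000, 2000],
    "hispPop": [300, 900, 10, 700],
})

# Keep only the listed columns, in the order given.
subset_df = states_df[["state", "hispPop"]]

print(subset_df)
```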