module 2

Cards (104)

  • Computers store characters as numbers. Every character used by a computer corresponds to a unique number, and vice versa.
  • Some of these characters are called whitespaces, while others are named control characters, because their purpose is to control input/output devices.
  • This system has led to a need to introduce a universal and widely accepted standard implemented by (almost) all computers and operating systems all over the world.

    The one named ASCII (short for American Standard Code for Information Interchange) is the most widely used, and you can assume that nearly all modern devices use that code.
  • The word internationalization is commonly shortened to I18N.
    Why? Look carefully – there is an I at the front of the word, next there are 18 different letters, and an N at the end.
    Despite the slightly humorous origin, the term is officially used in many documents and standards.
  • Software I18N is standard practice nowadays. Each program has to be written in a way that enables it to be used all around the world, among different cultures, languages, and alphabets.
  • A classic form of ASCII code uses eight bits for each sign. Eight bits mean 256 different characters. The first 128 are used for the standard Latin alphabet (both upper-case and lower-case characters). Is it possible to push all the other national characters used around the world into the remaining 128 locations?
    No. It isn't.
  • A code point is a number which represents a character. For example, 32 is the code point of the space character in ASCII encoding. We can say that standard ASCII code consists of 128 code points.
  • Can you set the higher half of the code points differently for different languages? Yes, you can. Such a concept is called a code page.
  • A code page is a standard for using the upper 128 code points to store specific national characters.
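    A minimal sketch of the code page idea: the same byte from the upper half is read through two different code pages. The choice of byte (0xE9) and of the two code pages (latin-1 and cp1251) is only an illustrative assumption.

```python
import unicodedata

# One byte value from the upper half (0xE9 = 233), chosen for illustration.
raw = bytes([0xE9])

western = raw.decode("latin-1")   # ISO-8859-1, a Western European code page
cyrillic = raw.decode("cp1251")   # Windows-1251, a Cyrillic code page

# The same number maps to two different characters.
print(unicodedata.name(western))   # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(cyrillic))  # CYRILLIC SMALL LETTER SHORT I
```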
  • Unicode assigns unique (unambiguous) characters (letters, hyphens, ideograms, etc.) to more than a million code points.
  • There is more than one standard describing the techniques used to implement Unicode in actual computers and computer storage systems. The most general of them is UCS-4.
    The name comes from Universal Character Set.
  • Fortunately, there are smarter forms of encoding Unicode texts.
    One of the most commonly used is UTF-8.
    The name is derived from Unicode Transformation Format.
  • UTF-8 uses as many bits for each of the code points as it really needs to represent them.
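    A short sketch comparing how many bytes a few characters occupy. It assumes Python's "utf-32-be" codec as a stand-in for the fixed-width UCS-4 form; the sample characters are arbitrary.

```python
# Sample characters from different Unicode ranges (escapes keep the source ASCII-only):
# Latin capital A, e with acute, euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
ucs4_sizes = [len(ch.encode("utf-32-be")) for ch in samples]  # fixed width, no BOM

print(utf8_sizes)  # [1, 2, 3, 4]
print(ucs4_sizes)  # [4, 4, 4, 4]
```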
  • Computers store characters as numbers. There is more than one possible way of encoding characters, but only some of them gained worldwide popularity and are commonly used in IT: these are ASCII (used mainly to encode the Latin alphabet and some of its derivatives) and UNICODE (able to encode virtually all alphabets being used by humans).
  • A number corresponding to a particular character is called a codepoint.
  • UNICODE uses different ways of encoding when it comes to storing the characters using files or computer memory: two of them are UCS-4 and UTF-8 (the latter is the most common as it wastes less memory space).
  • BOM (Byte Order Mark) is a special combination of bits announcing the encoding used by a file's content (e.g. UCS-4 or UTF-8).
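    A small sketch of what a BOM looks like in practice, assuming Python's "utf-8-sig" codec (UTF-8 with a leading BOM); the sample text is arbitrary.

```python
import codecs

text = "abc"

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # UTF-8 preceded by a BOM

print(plain)                                 # b'abc'
print(with_bom)                              # b'\xef\xbb\xbfabc'
print(with_bom.startswith(codecs.BOM_UTF8))  # True
print(with_bom.decode("utf-8-sig") == text)  # True: the BOM is stripped on decoding
```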
  • In general, strings can be:
    • concatenated (joined)
    • replicated.
  • The ability to use the same operator against completely different kinds of data (like numbers vs. strings) is called overloading.
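    For illustration, the same + and * operators applied to numbers and to strings (the values are arbitrary):

```python
number_sum = 2 + 3       # addition
joined = "ab" + "cd"     # concatenation
product = 2 * 3          # multiplication
repeated = "ab" * 3      # replication
also_repeated = 3 * "ab" # the order of the operands does not matter here

print(number_sum, joined)               # 5 abcd
print(product, repeated, also_repeated) # 6 ababab ababab
```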
  • If you want to know a specific character's ASCII/UNICODE code point value, you can use a function named ord() (as in ordinal).
  • If you know the code point (number) and want to get the corresponding character, you can use a function named chr().
    The function takes a code point and returns its character.
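    A quick sketch of ord() and chr() working as a pair (the characters are chosen arbitrarily):

```python
code_a = ord("a")        # code point of a lower-case letter
code_space = ord(" ")    # code point of the space character
letter = chr(97)         # back from a code point to a character
round_trip = chr(ord("x"))

print(code_a, code_space)  # 97 32
print(letter, round_trip)  # a x
```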
  • The in operator shouldn't surprise you when applied to strings – it simply checks if its left argument (a string) can be found anywhere within the right argument (another string).
    The result of the check is simply True or False.
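    A few sample checks with in and not in, using an illustrative string named alphabet:

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

has_f = "f" in alphabet          # a single character that is present
has_upper_f = "F" in alphabet    # the check is case-sensitive
has_ghi = "ghi" in alphabet      # a longer substring
no_digit = "1" not in alphabet   # the negated form

print(has_f, has_upper_f, has_ghi, no_digit)  # True False True True
```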
  • We've also told you that Python's strings are immutable. This is a very important feature. What does it mean?
    This primarily means that the similarity of strings and lists is limited. Not everything you can do with a list may be done with a string.
  • Python strings don't have the append() method, and Python
    doesn't allow you to use the del instruction to remove individual characters from a string.
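    A minimal sketch of what immutability means in practice. Both attempts to modify the string in place fail, so a new string is built instead; the variable name word is illustrative.

```python
word = "python"

try:
    del word[0]          # strings do not support item deletion
except TypeError as error:
    print("del failed:", error)

try:
    word.append("!")     # strings have no append() method
except AttributeError as error:
    print("append failed:", error)

# The usual workaround: build a new string and rebind the name.
word = word[1:] + "!"
print(word)  # ython!
```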
  • A function named max() finds the maximum element of the sequence.
  • The index() method (it's a method, not a function) searches the sequence from the beginning, in order to find the first element of the value specified in its argument.
  • The list() function takes its argument (a string) and creates a new list containing all the string's characters, one per list element.
  • The count() method counts all occurrences of the element inside the sequence.
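    The four tools above, sketched on one arbitrary sample string:

```python
text = "abracadabra"

largest = max(text)            # character with the highest code point
first_c = text.index("c")      # position of the first "c"
letters = list("abc")          # one list element per character
how_many_a = text.count("a")   # number of occurrences of "a"

print(largest)     # r
print(first_c)     # 4
print(letters)     # ['a', 'b', 'c']
print(how_many_a)  # 5
```

    Note that index() raises a ValueError when the element is not found.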
  • Python strings are immutable sequences and can be indexed, sliced, and iterated like any other sequence, as well as being subject to the in and not in operators. There are two kinds of strings in Python:
    • one-line strings, which cannot cross line boundaries – we denote them using either apostrophes ('string') or quotes ("string")
    • multi-line strings, which occupy more than one line of source code, delimited by trigraphs of apostrophes ('''string''') or quotes ("""string""").
  • The length of a string is determined by the len() function. The escape character (\) is not counted.
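    A short sketch showing that an escape sequence counts as one character (the sample strings are arbitrary):

```python
plain_length = len("abc")        # three ordinary characters
newlines_length = len("\n\n")    # two newline characters, backslashes not counted
backslash_length = len("\\")     # one backslash character

multi = """ab
cd"""
multi_length = len(multi)        # the line break counts as one character

print(plain_length, newlines_length, backslash_length, multi_length)  # 3 2 1 5
```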
  • Strings can be concatenated using the + operator, and replicated using the * operator.
  • Some other functions that can be applied to strings are:
    • list() – creates a list consisting of all the string's characters;
    • max() – finds the character with the maximal codepoint;
    • min() – finds the character with the minimal codepoint.
  • What is the length of the following string assuming there are no whitespaces between the quotes?
    1
  • What is the expected output of the following code?
    s = 'yesteryears'
    the_list = list(s)
    print(the_list[3:6])

    ['t', 'e', 'r']
  • What is the expected output of the following code?
    for ch in "abc":
        print(chr(ord(ch) + 1), end='')
    bcd
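  • The capitalize() method does exactly what it says – it creates a new string filled with characters taken from the source string, converting the first character to upper-case (if it is a letter) and all the remaining letters to lower-case.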
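    A brief sketch of capitalize(): the first character is upper-cased if it is a letter, the remaining letters are lower-cased, and the source string stays untouched. The sample strings are arbitrary.

```python
source = "aBcD"
capitalized = source.capitalize()
starts_with_digit = "123abC".capitalize()

print(capitalized)        # Abcd
print(source)             # aBcD (the original is unchanged)
print(starts_with_digit)  # 123abc
```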
  • The two-parameter variant of center() pads the string with the character given as the second argument, instead of a space.
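    Both variants of center() side by side; the brackets are added only to make the padding visible, and the width of 10 is arbitrary.

```python
with_spaces = "[" + "beta".center(10) + "]"
with_stars = "[" + "beta".center(10, "*") + "]"
too_narrow = "beta".center(2)   # a width smaller than the string changes nothing

print(with_spaces)  # [   beta   ]
print(with_stars)   # [***beta***]
print(too_narrow)   # beta
```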
  • The endswith() method checks if the given string ends with the specified argument and returns True or False, depending on the check result.
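    Two sample calls to endswith() on an arbitrary string:

```python
ends_with_on = "epsilon".endswith("on")
ends_with_eps = "epsilon".endswith("eps")

print(ends_with_on)   # True
print(ends_with_eps)  # False
```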
  • The find() method is similar to index(), which you already know – it looks for a substring and returns the index of the first occurrence of this substring, but:
    • it's safer – it doesn't generate an error for an argument containing a non-existent substring (it returns -1 then)
    • it works with strings only – don't try to apply it to any other sequence.
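    A minimal sketch contrasting find() with index() for a substring that is present and one that is not (the sample string is arbitrary):

```python
text = "theta"

found = text.find("eta")    # index of the first occurrence
missing = text.find("ha")   # no such substring: -1, no exception

try:
    text.index("ha")        # index() raises an error instead
    index_raised = False
except ValueError:
    index_raised = True

print(found, missing, index_raised)  # 2 -1 True
```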
  • Note: don't use find() if you only want to check if a single character occurs within a string – the in operator will be significantly faster.