“Chapter 5. Use Sequences, Sets, Dictionaries, and Text Files” in “Introduction to Computer Programming with Python”
Chapter 5 Use Sequences, Sets, Dictionaries, and Text Files
Chapter 5 details how compound data types and files can be used in programming to solve problems. Data need to be structured and organized to represent certain kinds of information and to make problem solving, information processing, and computing possible and more efficient. In addition to integer, float, and complex numbers, Python provides compound data types to represent more complicated information. These compound data types include strings, lists, tuples, sets, and dictionaries, as well as files that can be used to store a large volume of data in the long term (after the computer is shut off).
Learning Objectives
After completing this chapter, you should be able to
- • explain sequences.
- • explain strings and the methods and functions that can be applied to them.
- • construct and format strings with the f prefix and the format method.
- • discuss lists and tuples and the differences between the two.
- • properly use the methods and functions of lists and tuples.
- • explain sets and dictionaries and discuss the methods and functions that can be used on them.
- • explain files and discuss the differences between text files and binary files and the methods and functions available for manipulating files.
- • use strings, lists, tuples, sets, dictionaries, and files in problem solving and system design and development with Python.
5.1 Strings
The string is one of the most important data types for information representation and processing. A string is a sequence of characters. In the early days of modern computers, those characters came from the ASCII table; today they are Unicode characters, commonly encoded in UTF-8 (8-bit Unicode Transformation Format), which covers the characters of virtually all written human languages.
Because strings are sequences of characters, the characters are ordered and indexed. We can access and manipulate individual characters through these indexes, starting from 0, as shown in the following example:
>>> name = "John Doe"
>>> name[0]
'J'
>>> name[3]
'n'
To construct a string from another data type, you use built-in function str(), as shown in the following example:
>>> tax_rate = 0.16
>>> tax_string = str(tax_rate)
>>> tax_string
'0.16'
>>> type(tax_string)
<class 'str'>
Methods of Built-In Class str
As is the case with some other object-oriented programming languages, string is a built-in class but is named str in Python. The str class has many powerful methods built into it, as detailed below with coding samples.
s.capitalize()
This returns a copy of string s with its first character converted to upper case and all remaining characters converted to lower case. Please note that the characters in string s itself remain unchanged. This is the same for all string methods: because strings are immutable, no string method alters the content of the original string. Rather, the method builds and returns a new string.
>>> s = "intro to programming with python"
>>> s_capitalized = s.capitalize()
>>> s_capitalized
'Intro to programming with python'
>>> s
'intro to programming with python'
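The point about string methods returning new strings can be sketched as follows. This is a minimal illustration; the variable names original and changed are chosen here only for the example.

```python
# String methods never modify the string they are called on.
original = "intro to programming with python"
changed = original.capitalize()   # a new string is returned

print(original)   # the original is untouched
print(changed)

# To "update" a variable, rebind the name to the returned string.
original = original.capitalize()
print(original == changed)
```

Rebinding the name is the usual idiom whenever the changed text should replace the old one.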
s.casefold()
Returns a copy of string s converted to lower case for caseless comparison. It is similar to lower() but more aggressive; for example, the German letter ß is converted to ss.
>>> s_capitalized
'Intro to programming with python'
>>> s_capitalized.casefold()
'intro to programming with python'
s.center(space)
Returns the string centred within the given space. Note how the extra whitespace is divided when it cannot be split evenly.
>>> s = "hello"
>>> s.center(10)
'  hello   '
s.count(sub)
Returns the number of times a specified value occurs in a string.
>>> s = "intro to programming with python"
>>> s.count('i')
3
>>> s.count('in')
2
s.encode()
Returns the string encoded as a bytes object, using UTF-8 by default. ASCII characters keep their one-byte values, while characters outside the standard ASCII table are encoded as multibyte sequences. In the example below, there are Chinese characters in the string assigned to variable cs.
>>> cs = "Python is not a big snake (蟒蛇)"
>>> print(cs.encode())
b'Python is not a big snake (\xe8\x9f\x92\xe8\x9b\x87)'
Please note that the b in b'Python is not a big snake (\xe8\x9f\x92\xe8\x9b\x87)' indicates that the result is a bytes object, in which each non-ASCII character is shown as a sequence of escaped bytes.
s.endswith(sub)
Returns True if the string ends with the specified value, such as a question mark.
>>> cs = "Is Python an animal?"
>>> print(cs.endswith('?'))
True
s.expandtabs(ts)
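Returns a copy of the string in which each tab character is replaced by enough spaces to reach the next tab stop, with tab stops set every ts columns (ts is an integer; the default is 8).
>>> cs = "Is\t Python\t an\t animal?"
>>> cs
'Is\t Python\t an\t animal?'
>>> print(cs)
Is       Python  an      animal?
>>> print(cs.expandtabs(10))
Is         Python    an        animal?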
s.find(sub)
Searches the string for a substring and returns the index at which it first occurs, or -1 if it is not found.
>>> s= 'intro to programming with python'
>>> s.find("ro")
3
s.format(*args, **kwargs)
Formats the values given in the positional arguments *args and/or the keyword arguments **kwargs into string s, according to the formatting specs given in s.
This is very useful in constructing complex strings.
>>> "Hello {0}, you are {1:5.2f} years old.".format("Python", 23.5)
'Hello Python, you are 23.50 years old.'
Please note that s.format(**mapping) can be used with a Python dictionary: each named placeholder in s is replaced by the value stored under the key of the same name.
>>> point = {'x':9,'y':-10} # point is a dictionary
>>> print('{x} {y}'.format(**point))
9 -10
Please note that ** has converted the dictionary point into a list of keyword arguments. This formatting can also be done by directly using keyword arguments:
>>> print('{x} {y}'.format(x=9,y=-10))
9 -10
s.format_map(mapping)
Similar to format(**mapping) above. The only difference is that this one takes a dictionary without operator **.
>>> point = {'x':9,'y':-10}
>>> print('{x} {y}'.format_map(point))
9 -10
s.index(sub)
Searches the string for a substring and returns the position of the substring. Raises a ValueError if there is no such substring.
Note that this makes index a poor choice for merely testing whether one string is a substring of another; the in operator is better suited to that.
>>> s = 'intro to programming with python'
>>> s
'intro to programming with python'
>>> s.index("ing")
17
>>> s.index('w')
21
>>> s.index('z')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: substring not found
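The difference between testing for a substring and locating it can be sketched as follows. The variable names are chosen here only for illustration.

```python
s = "intro to programming with python"

# The in operator gives a plain True/False answer.
has_ing = "ing" in s
has_z = "z" in s

# find() returns -1 when the substring is missing ...
pos_find = s.find("z")

# ... whereas index() raises ValueError, which must be handled.
try:
    pos_index = s.index("z")
except ValueError:
    pos_index = None

print(has_ing, has_z, pos_find, pos_index)
```

Using in keeps the test readable, while find and index are for cases where the position itself is needed.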
s.isalnum()
Returns True if all characters in the string are alphanumeric.
>>> "98765".isalnum()
True
>>> "98765abcde".isalnum()
True
>>> "98765<>abcde".isalnum()
False
s.isalpha()
Returns True if all characters in the string are in the alphabet, including Unicode characters.
>>> "abcde".isalpha()
True
>>> "abcTde".isalpha()
True
>>> "abc35Tde".isalpha()
False
>>> "abc他Tde".isalpha()
True
s.isdecimal()
Returns True if all characters in the string are decimals.
>>> "1235".isdecimal()
True
>>> "1235.65".isdecimal()
False
>>> "1235.65e".isdecimal()
False
s.isdigit()
Returns True if all characters in the string are digits.
>>> "123565".isdigit()
True
>>> "1235.65".isdigit()
False
>>> "1235y65".isdigit()
False
s.isidentifier()
Returns True if the string is an identifier by Python’s definition.
>>> "w1235y65".isidentifier()
True
>>> "9t1235y65".isidentifier()
False
>>> "w1235_y65".isidentifier()
True
s.islower()
Returns True if all cased characters in the string are lower case and there is at least one cased character.
>>> "w1235y65".islower()
True
>>> "W1235y65".islower()
False
>>> "w1235_y65".islower()
True
s.isnumeric()
Returns True if all characters in the string are numeric.
>>> "123565".isnumeric()
True
>>> "1235.65".isnumeric()
False
>>> "123565nine".isnumeric()
False
s.isprintable()
Returns True if all characters in the string are printable.
>>> "123565nine".isprintable()
True
>>> "123565 all printable".isprintable()
True
>>> "123565 all printable<>!@#$%^&**".isprintable()
True
s.isspace()
Returns True if all characters in the string are whitespace.
>>> " ".isspace()
True
>>> " \t ".isspace()
True
>>> " \t \n".isspace()
True
>>> " \t m\n".isspace()
False
s.istitle()
Returns True if the string follows the rules of a title—that is, the first letter of each word is upper case, while the rest are not.
>>> "Python Is a Great Language".istitle()
False
>>> "Python Is A Great Language".istitle()
True
s.isupper()
Returns True if all characters in the string are upper case.
>>> "THIS IS ALL UPPER".isupper()
True
>>> "THIS IS ALL UPPER with some lower".isupper()
False
sep.join(iterable)
Joins the elements of an iterable with the separator. The iterable can be a list, tuple, string, dictionary, or set. Note that each element of the iterable must be a string. An integer or other number will raise a TypeError. For a dictionary, the keys are joined; for a set, the order of the joined elements is arbitrary.
>>> "-".join([" for", " programming!"])
' for- programming!'
>>> "&".join([" for", " programming!"])
' for& programming!'
>>> "%".join([" for", " programming!"])
' for% programming!'
>>> "%".join(" for programming!")
' %f%o%r% %p%r%o%g%r%a%m%m%i%n%g%!'
>>> "%".join(('a', '2', '3'))
'a%2%3'
>>> "%".join({'a', '2', '3'})
'3%a%2'
>>> "%".join({'a':'mnmn', '2':'987', '3':'43322'})
'a%2%3'
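Because join accepts only strings, numbers have to be converted first. A minimal sketch, using a made-up list named scores:

```python
scores = [88, 92.5, 79]   # hypothetical data, not strings

# ", ".join(scores) would raise TypeError, so convert each item with str().
joined = ", ".join(str(x) for x in scores)
print(joined)
```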
s.ljust(sl)
Returns a left-justified version of the string within the given size of space.
>>> "Python Is A Great Language".ljust(30)
'Python Is A Great Language    '
s.lower()
Converts a string into lower case.
>>> "Python Is A Great Language".lower()
'python is a great language'
s.lstrip()
Returns a copy of the string with leading whitespace removed.
>>> " Python Is A Great Language ".lstrip()
'Python Is A Great Language '
s.maketrans(dict)
s.maketrans(s1, s2)
Return a translation table to be used in translations.
In s.maketrans(dict), key-value pairs of dict provide mapping for translation; in the case of s.maketrans(s1, s2), chars in s1 are mapped to chars in s2 one by one.
>>> "Python Is A Great Language".maketrans({'a':'b', 'c':'d'})
{97: 'b', 99: 'd'}
>>> "Python Is A Great Language".maketrans('ab', 'cd')
{97: 99, 98: 100}
s.partition(sub)
Returns a tuple where the string is divided into three parts with sub in the middle.
>>> "Python Is A Great Language".partition('A')
('Python Is ', 'A', ' Great Language')
s.replace(s1, s2)
Returns a copy of the string in which every occurrence of substring s1 is replaced with s2.
>>> "Python Is A Great Language".replace('Great', 'Powerful')
'Python Is A Powerful Language'
s.rfind(sub)
Searches the string from the right for a substring and returns the highest index at which it is found, or -1 if it is not found.
>>> "Python Is A Great Language".rfind('g')
24
s.rindex(sub)
Searches the string from the right for a substring and returns the highest index at which it is found. Raises a ValueError if there is no such substring.
>>> "Python Is A Great Language".rindex('g')
24
s.rjust(sl)
Returns a right-justified version of the string within the given size of space.
>>> "Python Is A Great Language".rjust(35)
'         Python Is A Great Language'
s.rpartition(sub)
Returns a tuple where the string is divided into three parts at the substring found from the right.
>>> "Python Is A Great Language".rpartition('g')
('Python Is A Great Langua', 'g', 'e')
s.rsplit(sep)
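Splits the string at the specified separator, working from the right, and returns a list. The result differs from that of split only when a maximum number of splits is given.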
>>> "Python Is A Great Language".rsplit('g')
['Python Is A Great Lan', 'ua', 'e']
s.rstrip()
Returns a right-trimmed version of the string.
>>> "Python Is A Great Language ".rstrip()
'Python Is A Great Language'
s.split(sep)
Splits the string at the specified separator and returns a list.
>>> "Python Is A Great Language".split('g')
['Python Is A Great Lan', 'ua', 'e']
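The two methods give different results once a maximum number of splits is passed as the second argument. A minimal sketch, using a made-up string named path:

```python
path = "usr/local/lib/python"   # hypothetical example string

# With at most one split, split() cuts at the leftmost separator ...
left_cut = path.split("/", 1)
# ... while rsplit() cuts at the rightmost one.
right_cut = path.rsplit("/", 1)

print(left_cut)
print(right_cut)
```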
s.splitlines()
Splits the string at line breaks and returns a list.
>>> "Python Is A Great Language.\n I love it.".splitlines()
['Python Is A Great Language.', ' I love it.']
s.startswith(ss)
Returns True if the string starts with the specified value.
>>> "Python Is A Great Language".startswith('g')
False
>>> "Python Is A Great Language".startswith('P')
True
s.strip()
Returns a copy of the string with leading and trailing whitespace removed.
>>> " Python Is A Great Language ".strip()
'Python Is A Great Language'
s.swapcase()
Swaps cases, so lower case becomes upper case, and vice versa.
>>> "Python Is A Great Language".swapcase()
'pYTHON iS a gREAT lANGUAGE'
s.title()
Converts the first character of each word to upper case and the remaining characters to lower case.
>>> 'pYTHON iS a gREAT lANGUAGE'.title()
'Python Is A Great Language'
s.translate(table)
Returns a string translated from s using a translation table created with the maketrans() method.
>>> table = "".maketrans("ab", 'cd')
>>> print("Python Is A Great Language".translate(table))
Python Is A Grect Lcngucge
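maketrans also accepts an optional third string listing characters to delete. The sketch below calls it on the class str, which is equivalent to calling it on any string; the sample text is made up for illustration.

```python
# Map a -> c and b -> d, and delete every ! and ?.
table = str.maketrans("ab", "cd", "!?")
result = "a big banana!?".translate(table)
print(result)
```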
s.upper()
Converts a string into upper case.
>>> "Python Is A Great Language".upper()
'PYTHON IS A GREAT LANGUAGE'
s.zfill(sl)
Pads the string on the left with 0s until it reaches the specified length sl.
>>> "Python Is A Great Language".zfill(39)
'0000000000000Python Is A Great Language'
Built-In Functions and Operators for Strings
In addition to the string methods that you can use to manipulate strings, there are built-in functions and operators. The following are some examples.
Use operator + to join strings together
>>> "Python is a good language " + "for first-time programming learners."
'Python is a good language for first-time programming learners.'
Use operator * to duplicate a string
>>> "Python! "*3
'Python! Python! Python! '
Use built-in function len(s) to find out the length of a string
>>> p_string = "Python is a good language " + "for first-time programming learners."
>>> len(p_string)
62
Use operator [i:j] to slice a string
>>> p_string[0:5] # slice begins at index 0 till index 5 but excluding 5
'Pytho'
>>> p_string[5:25] # slice begins at index 5 till index 25 but excluding 25
'n is a good language'
>>> p_string[:16] # when the starting index point is missing, 0 is assumed
'Python is a good'
>>> p_string[6:] # when the ending index is missing, the slice runs to the end of the string
' is a good language for first-time programming learners.'
>>> p_string[:] # when both indexes are missing, the entire string is copied
'Python is a good language for first-time programming learners.'
>>> p_string[:-1] # a negative index counts from the end, so this drops the last character
'Python is a good language for first-time programming learners'
Table 5-1 summarizes the operators and built-in functions you can use to manipulate strings.
Operators and built-in functions on string | Operation |
---|---|
s[n] | Get the character at nth position (n is an integer) |
s[start:end] | Get a slice of the string. Negative indexes can also be used to count from the end. |
s1 + s2 | Concatenate two strings together |
s * n | Duplicate s n times |
s1 in s | Test if s1 is a substring of s |
len(s) | Get the length of string s |
print(s) | Print string s |
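The operations listed in Table 5-1 can be tried on a short string. The following is a minimal sketch; the variable names are chosen here only for illustration.

```python
s = "Python"

first = s[0]          # character at index 0
last = s[-1]          # negative index counts from the end
middle = s[1:4]       # slice from index 1 up to, but excluding, 4
tail = s[-3:]         # the last three characters
joined = s + " 3"     # concatenation
doubled = s * 2       # duplication
found = "tho" in s    # substring test
length = len(s)       # length

print(first, last, middle, tail, joined, doubled, found, length)
```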
In addition to the ways discussed above to construct and manipulate strings, Python also provides some methods for constructing and formatting strings nicely.
Constructing and Formatting Strings
Because text is made of strings, and text is very important in representing data, information, and knowledge, there is a need to convert various data objects into well-formatted strings. For this purpose, Python has provided programmers with a very powerful means of formatting strings that may consist of various types of data such as literals, integer numbers, float numbers with different precisions, compound data, and even user-defined objects.
In 2.1, we saw how we could use f/F, r/R, u/U, and b/B to provide some direction on string formation and representation. We also saw that prefixing f or F to a string allows us to conveniently embed expressions into the string with {} and have the expressions be automatically evaluated. In the following, you will see two more ways of formatting strings.
Formatting with %-Led Placeholders
Let’s begin with an example to explain how %-led placeholders are used to format and construct strings:
In [ ]: |
|
Out [ ]: | n has a value of 8, and d has a value of 5.68900 |
In the example above, the string before the last percentage sign % is called a formatting string. %3d is a %-led placeholder for an integer that will take a space at least 3 characters wide, whereas %9.5f is a %-led placeholder for a float number, where 9 specifies the minimum total width the float number will take (including the decimal point) and 5 specifies the number of digits after the decimal point. The values in the tuple behind the last percentage sign are converted and placed into their corresponding placeholders. In the example, the value of n is formatted as an integer and placed into the first placeholder, whereas the value of d is formatted as a float number and placed into the second placeholder.
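As a minimal sketch of the kind of statement described above, the following assumes that n is 8 and d is 5.689 (values chosen here to be consistent with the output shown). The variable name line is only for illustration.

```python
n, d = 8, 5.689   # assumed values

# %3d pads the integer to width 3; %9.5f pads the float to width 9
# and shows 5 digits after the decimal point.
line = "n has a value of %3d, and d has a value of %9.5f" % (n, d)
print(line)
```

Printing repr(line) instead makes the padding spaces in front of each number easier to see.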
You can also use named placeholders, as shown in the next example, where the course and language (in the parentheses) are the names of the placeholders.
In [ ]: |
|
Out [ ]: | comp218 - introduction to programming in Python |
Note that when named placeholders are used, you will need to use a dictionary instead of a tuple behind the last percentage sign.
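A statement along the following lines would produce the output shown for named placeholders. This is only a sketch: the dictionary name info and the exact values are assumptions made for illustration.

```python
# Keys of the dictionary match the names inside the parentheses.
info = {"course": "comp218", "language": "Python"}   # assumed values

line = "%(course)s - introduction to programming in %(language)s" % info
print(line)
```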
The general format of a %-led placeholder is as follows:
%[flags][width][.precision]type
or the following, if you like to use named placeholders:
%[(name)][flags][width][.precision]type
The flags may use one or a combination of the characters in Table 5-2.
Flag | Meaning |
---|---|
# | Used with o, x, or X, this specifies that the formatted value is preceded with 0o, 0x, or 0X, respectively. |
0 | The conversion result will be zero-padded for numeric values. |
- | The converted value is left-adjusted. |
' ' (a space) | If no sign (e.g., a minus sign) is going to be written, a blank space is inserted before the value. |
+ | A sign character (+ or -, depending on whether the converted value is positive or negative) will precede the converted value. |
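The effect of each flag can be seen in a short sketch such as the following. The values and variable names are made up for illustration; square brackets are added around some results only to make the padding visible.

```python
prefixed = "%#o %#x %#X" % (8, 255, 255)   # alternate form with prefixes
zero_padded = "%05d" % 42                  # pad with zeros to width 5
left_adjusted = "[%-5d]" % 42              # left-adjust within width 5
space_sign = "[% d]" % 42                  # blank where the sign would go
plus_sign = "%+d %+d" % (42, -42)          # always show the sign

print(prefixed, zero_padded, left_adjusted, space_sign, plus_sign)
```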
The width is the total width of space held for the corresponding value, and precision is the number of digits that the decimal portion will take if the value is a float number. The type can be one of the types shown in Table 5-3.
Conversion | Meaning |
---|---|
d, i, or u | Signed integer decimal. A positive value is printed without a plus sign unless the + flag is used. |
o | Signed octal. |
X or x | Signed hexadecimal (upper case or lower case). |
E or e | Floating-point exponential format (upper case or lower case). |
F or f | Floating-point decimal format. |
G or g | Same as E or e if the exponent is less than -4 or not less than the precision; F or f otherwise. |
c | Single character (accepts an integer or a single-character string). |
r | String (converts any Python object using repr(), which calls __repr__(), instead of str() or __str__()). In a class definition, you need to implement the dunder method __repr__ to control what repr() returns for objects of the defined class. |
s | String (converts any Python object using str()). We will see the difference between %r and %s when we define the __repr__ method for a user-defined class. |
%% | No argument is converted; the result is a single % character. This applies only when the string is actually formatted with the % operator. |
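The conversion types can be compared side by side in a sketch like the one below. The numbers are arbitrary and the variable names are chosen only for illustration.

```python
integers = "%d %o %x %X" % (255, 255, 255, 255)   # decimal, octal, hex
exponential = "%e" % 12345.678                    # exponential format
fixed = "%.2f" % 3.14159                          # two decimal digits
general = "%g %g" % (0.00001234, 1234567.0)       # picks e-format here
chars = "%c%c" % (80, "y")                        # 80 is the code for P
repr_vs_str = "%r vs %s" % ("hi", "hi")           # repr keeps the quotes
percent = "%d%%" % 50                             # %% gives a literal %

print(integers, exponential, fixed, general, chars, repr_vs_str, percent)
```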
Formatting Strings with the format Method
Compared to the two methods we have seen so far, a more formal way of string formatting in Python is using the format method, as shown in the following example:
>>> s = "{0} is the first integer; {1} is the second integer".format(88, 99)
>>> s
'88 is the first integer; 99 is the second integer'
The {} in the above example is also called a placeholder or replacement field. You can index the placeholders with integer numbers starting from 0, corresponding to the positions of values. You can also name the placeholders, in which case dictionary or keywords arguments need to be used within the format method call. In the example above, if we switch the indices (0 and 1), 99 will be placed as the first integer and 88 will be placed as the second integer, as shown below:
>>> s = "{1} is the first integer; {0} is the second integer".format(88, 99)
>>> print(s)
99 is the first integer; 88 is the second integer
The general form of the replacement field is as follows:
{[field_name][!conversion][:format_spec]}
As mentioned before, each item inside [] is optional; a placeholder can be as simple as an empty {}, as shown in the following example:
>>> 'X: {}; Y: {}'.format(3, 5)
'X: 3; Y: 5'
In the general form of the replacement field above, field name is something that can be used to identify the object within the arguments of the format method. It can be an integer to identify the position of the object, a name if keyword arguments are used, or a dot notation referring to any attribute of the object, as shown in the following example:
>>> c = 23 - 35j
>>> ('The complex number {0} has a real part {0.real} and an imaginary part {0.imag}.').format(c)
'The complex number (23-35j) has a real part 23.0 and an imaginary part -35.0.'
In this string formatting example, the first placeholder is {0}, in which integer 0 indicates that the value of the first argument of the format method call will be placed here; the second placeholder is {0.real}, which indicates that the value of the attribute real of the first argument of the format method call will be converted and inserted in that location; and the third placeholder is {0.imag}, which indicates that the value of the attribute imag of the first argument of the format method call will be converted and inserted in that location. It is up to the programmer to use the right attribute names or to compose the right reference to a valid object or value within the arguments of the format method call.
Please note that conversion in the general form of the replacement field above is led by an exclamation mark !, which is followed by a letter: r, s, or a. The combination !r converts the value with repr(), which for a string keeps the quotation marks and shows escape sequences such as \t and \n literally; !s converts the value with str(); and !a converts the value with ascii(), which works like repr() but also escapes all non-ASCII characters, as shown in the following examples:
>>> print('{!r} is displayed as a raw string'.format('\t is not tab, \n is not newline'))
'\t is not tab, \n is not newline' is displayed as a raw string
>>> print('{!s} is not displayed as a raw string'.format('\t is a tab, \n is a new line'))
         is a tab, 
 is a new line is not displayed as a raw string
>>> print('{!s} is displayed in Chinese'.format('Python is not 大蟒蛇.'))
Python is not 大蟒蛇. is displayed in Chinese
>>> print('{!a} is displayed as an ASCII string'.format('Python is not 大蟒蛇.'))
'Python is not \u5927\u87d2\u86c7.' is displayed as an ASCII string
Please note the difference between the two outputs using !s and !a in particular.
It may also have been noted that with !r, the quotation marks surrounding the argument remain in the output, whereas with !s, the quotation marks have disappeared from the output. This is true when the argument for the !r is a string.
When the argument for !r is not a string, especially when it is a complicated object, o, the !r will cause the placeholder to be replaced with the result of repr(o), which in turn calls the dunder method __repr__() defined for the object's class. You will learn how to define and use Python dunder methods later in Chapter 7.
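The following sketch illustrates the difference for a user-defined object. The class Point and its two dunder methods are hypothetical, written here only to show which method each conversion calls.

```python
class Point:
    """A hypothetical class used only to illustrate !r and !s."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __repr__(self):          # used by !r and repr()
        return f"Point({self.x}, {self.y})"

    def __str__(self):           # used by !s, str(), and plain {}
        return f"({self.x}, {self.y})"


p = Point(9, -10)
with_r = "{!r}".format(p)
with_s = "{!s}".format(p)
plain = "{}".format(p)
print(with_r, with_s, plain)
```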
In string formatting with the format method, the formatting specification is led by a colon :, which is followed by formatting instructions, including the following:
- 1. justification or alignment: > for right justification, < for left justification, ^ for centre justification
- 2. with/without sign for numbers: + for always showing the sign, - for showing the sign only for negative numbers, and ' ' for showing a whitespace when the number is positive
- 3. the minimum number of character positions allocated to the data, such as in {:6d}, where 6 specifies that the integer number takes at least 6 character positions
- 4. the number of decimal digits for float numbers, such as in {:6.2f}, in which the 2 specifies that the float number will be rounded to 2 digits after the decimal point
- 5. data type conversion indicates what data will be converted and inserted into the placeholder; the types of data include
- a. s for string
- b. d for integer
- c. f for float number
- d. x or X for hex number
- e. o for octal number
- f. b for binary number
- g. #x, #X, #o, and #b to prefix the numbers 0x, 0X, 0o, and 0b, respectively
The following example shows how the data type conversions work:
>>> '{:+12.8f}, {:+f}, {:#b}, {:#X}'.format(2.71828182, -3.14, 78, 127)
' +2.71828182, -3.140000, 0b1001110, 0X7F'
If you wish the output of a placeholder to be left, right, or centre justified within the given space, <, >, or ^ can be used to lead the format spec, as shown in the following example:
>>> '{:<+22.8f}, {:+f}, {:#b}, {:#X}'.format(2.71828182, -3.14, 78, 127)
'+2.71828182           , -3.140000, 0b1001110, 0X7F'
If you want the extra space to be filled with a special character, such as #, you can put the character between the colon and <, >, or ^, as shown below:
>>> '{:#^+22.8f}, {:+f}, {:#b}, {:#X}'.format(2.71828182, -3.14, 78, 127)
'#####+2.71828182######, -3.140000, 0b1001110, 0X7F'
By this point, we have learned three ways of constructing and formatting strings: the first one is to use the f/F prefix, the second is to use a %-led placeholder, and the last is to use the format method.
Among the three, the first one is the most compact and good for simple string construction without any fancy formatting. The expression within each {} will be evaluated, and the value will be converted into a string with which the placeholder is replaced as is.
Both the second and the third way can be used to construct and format more complex strings from various objects. The difference between the two is that the second, using a %-led placeholder, is more casual, whereas the third is more formal and the code more readable.
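The three approaches can be compared on the same data. In the sketch below, the names and values are made up for illustration; note that an f-string accepts a format specification after a colon in the same way as the format method does.

```python
name, age = "Python", 23.5   # hypothetical values

with_f_prefix = f"Hello {name}, you are {age:5.2f} years old."
with_percent = "Hello %s, you are %5.2f years old." % (name, age)
with_format = "Hello {0}, you are {1:5.2f} years old.".format(name, age)

print(with_f_prefix)
print(with_f_prefix == with_percent == with_format)
```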
Regular Expressions
Information processing and text manipulation are important uses for modern computers, and regular expressions, called “REs” or “regexes” for short, were developed as a powerful way to manipulate text. Many modern programming languages have special libraries for searching and manipulating text using regular expressions. In Python, a standard module called re was developed for that purpose.
To correctly use the re module, we first must understand what regular expressions are and how to construct a regular expression that genuinely defines the strings we want to find and/or manipulate within a text because almost all functions/methods of the re module are based on such defined regular expressions.
What is a regular expression? A regular expression is a pattern that describes certain text or literal strings. Examples of some useful patterns include telephone numbers, email addresses, URLs, and many others.
To be able to correctly define a regular expression precisely describing the strings we want to find and manipulate, we must first understand and remember the rules of regular expressions, as well as the special characters and sequences that have special meanings in the re module. Since regular expressions are strings themselves, they should be quoted with single or double quotation marks. For simplicity, however, we may omit some quotation marks in our discussion when the meaning is clear from the context.
Plain literals such as a, b, c,…z, A, B, C,…Z, and numeric digits such as 0, 1, 2,…9 can be directly used in a regular expression to construct a pattern, such as Python, Foo, Canada. Some symbols in the ASCII table have been given special meanings in re. These symbols, called metacharacters, are shown in Table 5-4.
Symbols | Meaning | Example |
---|---|---|
. | Match any character except \n, a new line, in a string. | t..t will match test, text,… |
^ | Prefixed to a pattern to match the regex that follows only if it is at the beginning of the string being searched. | ^Hello will only match Hello when it is at the start of the string |
$ | Affixed to a pattern to match the preceding regex if it is at the end of a string. | mpeg$ will only match mpeg when it is at the end of a text |
| | Match either the regex on the left or the regex on the right. | Wang|Wong will match either Wang or Wong |
\ | Form an escape sequence such as \d, \s,… with special meaning. Table 5-5 lists all the escape sequences defined in the re module. Also used to escape the metacharacters in this table back to their original meanings. | \d will match any single decimal digit; \D is the negation of \d, meaning it will match any single character that is not a decimal digit |
[…] | Define a set/class of characters. | [xyz] will match either x, y, or z. W[ao]ng is the same as Wang|Wong |
[^…] | Define a set of characters excluded from the pattern. Inside and at the beginning of [], ^ is used as negation | [^A-Z\s] will match everything else except upper case letters and whitespace |
[…x-y…] | Within [], define a range of characters from x to y | [0-9], [a-zA-Z] |
(…) | Match enclosed regex and save as subgroup for later use. | (B|blah\s)+ will only match the second blah in “blah, blah and blah” and save it |
? | This and the rest in this table are called quantifiers. When ? is affixed to a preceding regex (character or group), it matches 0 or 1 occurrence of that regex. ? can also be affixed to another quantifier, as in +? or *?, to make it nongreedy, meaning it will match as few occurrences as possible. | mpe?g will match mpg or mpeg |
* | Affixed to pattern meaning to match 0 or more (greedy) occurrences of preceding regular expression. Greedy means that it will match as many as possible. | =* will match 0 or more consecutive =s |
+ | Affixed to a pattern to match 1 or more occurrences of the preceding regular expression. | =+ will match 1 or more consecutive =s |
{n} | Affixed to a pattern to match exactly n occurrences of the preceding regex. | [0-9]{3} will match exactly 3 consecutive digits, like an area code, for example |
{m,n} | Affixed to a pattern to match from m to n occurrences of the preceding regex (written without a space after the comma). | [0-9]{5,11} will match sequences of decimal digits that are 5 to 11 digits in length |
Escape sequence | Special meaning in re | Example |
---|---|---|
\d | Match any decimal digit 0-9. | Img\d+\.jpg will match Img1.jpg, Img23.jpg, and so on |
\D | Opposite of \d, meaning match any character that is not a decimal digit. | [\D] will match everything but decimal digits |
\w | Match any alphanumeric character, A-Z, a-z, 0-9, or the underscore _. | [_a-zA-Z]\w* will match all legitimate ASCII identifiers in Python |
\W | Opposite of \w, meaning match any character that is not alphanumeric or an underscore. | [\W] will match everything but alphanumeric characters and the underscore |
\n | Match a new line whitespace. | \.\n will match all periods that end a paragraph |
\t | Match a tab whitespace. | re.findall(r'\t', py_scripts) will find all the tabs in the py_scripts |
\r | Match a return/enter whitespace. | re.findall(r'\r', article) will find all the return/enter whitespaces in the article. |
\v | Match a vertical feed whitespace. | re.findall(r'\v', article) will find all the vertical feed whitespaces in the article |
\f | Match a feed whitespace. | re.findall(r'\f', article) will find all the feed whitespaces in the article |
\s | Match any of the whitespaces above. | re.findall(r'\s', article) will find all the whitespaces in the article |
\S | Opposite of \s, \S matches any character which is not a whitespace character. | re.findall(r'\S', article) will find everything except whitespaces in the article |
\N | N is an integer > 0. \1 refers to the first subgroup saved with (…). | In r'\b\w*(\w)\w*\1', \1 refers to the first found alphanumeric characters that appear more than once in a word |
\b | Match any word boundary: the left boundary if \b is at the left of the pattern, the right boundary if \b is at the right side of the pattern | \bthe\b will match the if it is not part of other words |
\B | Opposite of \b. | \bthe\B will match the if it is at the beginning of other words |
\. \\ \+ \* | Match a special symbol ., \, +, * respectively. | \d+\*\d+ will match multiplications of two integers in a text |
\A | Match at the start of a string, same as ^. | \AHello will match Hello if Hello is at the beginning of the string |
\Z | Match at the end of a string, same as $. | \.com\Z will match .com if it is at the end of the string |
The above are the basic rules for constructing regular expressions or regex patterns. Using these rules, we can write regular expressions to define most string patterns we are interested in.
The following are some examples of regex patterns:
- 780-\d{7}, pattern for telephone numbers in Edmonton, Alberta
- \$\d+\.\d{2}, pattern for currency representations in accounting
- [A-Z]{3}-\d{3}, pattern for licence plate numbers
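As a quick check, the three patterns above can be tried with re.fullmatch, which is described later in this section. The following is only a sketch; the sample strings are made up for illustration.

```python
import re

# The three example patterns from the text, written as raw strings.
phone = r'780-\d{7}'
money = r'\$\d+\.\d{2}'
plate = r'[A-Z]{3}-\d{3}'

# The sample strings below are invented for this illustration.
phone_ok = re.fullmatch(phone, '780-1234567') is not None
money_ok = re.fullmatch(money, '$19.99') is not None
plate_ok = re.fullmatch(plate, 'ABC-123') is not None
plate_bad = re.fullmatch(plate, 'AB-1234') is None   # only two letters, so no match

print(phone_ok, money_ok, plate_ok, plate_bad)
```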
The re module is also empowered with the following extension rules, which all begin with a question mark ? within a pair of parentheses. Although surrounded by a pair of parentheses, an extension rule, except (?P<name>…), does not create a new group.
(?aiLmsux)
Here, ? is followed by one or more letters from the set a, i, L, m, s, u, and x, setting the corresponding flags for the re engine. (?a) sets re.A, meaning ASCII-only matching; (?i) sets re.I, meaning ignore case when matching; (?L) sets re.L, meaning locale dependent; (?m) sets re.M, meaning multiple lines; (?s) sets re.S, meaning dot matches all characters including newline; (?u) sets re.U, meaning Unicode matching; (?x) sets re.X, meaning verbose matching. These flags are defined in the re module. The details can be found by running help(re).
The flags can be used at the beginning of a regular expression in place of passing the optional flag arguments to re functions or methods of pattern object.
(?aiLmsux-imsx:…)
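Sets the flags listed before the hyphen and removes the flags listed after it, for the enclosed part of the pattern only. Only i, m, s, and x can follow the hyphen. For example, (?i-s:…) turns on case-insensitive matching and turns off dot-matches-all within the group.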
(?:…)
Is a noncapturing version of regular parentheses, meaning the match cannot be retrieved or referenced later.
(?P<name>…)
Makes the substring matched by the group accessible by name.
(?P=name)
Matches the text matched earlier by given name.
(?#…)
Is a comment; ignored.
(?=…)
Matches if … matches next but does not consume the string being searched, which means that the current position in the string remains unchanged. This is called a lookahead assertion.
John (?=Doe) will match John only if it is followed by Doe.
(?!…)
Matches if … does not match next.
Jon (?!Doe) will match Jon only if it is not followed by Doe.
(?<=…)
Matches if preceded by … (must be fixed length).
(?<=John) Doe will find a match in John Doe because there is John before Doe.
(?<!…)
Matches if not preceded by … (must be fixed length).
(?<!John) Doe will find a match in Joe Doe because there is no John before Doe.
(?(id)yes pattern | no pattern)
(?(name)yes pattern | no pattern)
Match yes pattern if the group with id or name is matched; match no pattern otherwise.
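Some of the extension rules above can be tried out as follows. This is a minimal sketch: the sample strings and the group name word are chosen only for illustration, and re.search is described later in this section.

```python
import re

# Lookahead: 'John ' is matched only when 'Doe' follows; 'Doe' itself is not consumed.
ahead = re.search(r'John (?=Doe)', 'John Doe')

# Negative lookbehind: ' Doe' is matched only when it is not preceded by 'John'.
behind = re.search(r'(?<!John) Doe', 'Joe Doe')
no_behind = re.search(r'(?<!John) Doe', 'John Doe')

# Named group and a reference to it: find a word that is immediately repeated.
repeated = re.search(r'(?P<word>\b\w+) (?P=word)', 'it is is here')

# Inline flag: (?i) at the start of the pattern ignores case.
ignore_case = re.search(r'(?i)python', 'I like PYTHON')

print(repr(ahead.group()), repr(behind.group()), no_behind)
print(repeated.group('word'), ignore_case.group())
```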
To do text manipulation and information processing using regular expressions in Python, we need to use a module in the standard Python library called re. As with other modules, we need to import it before using it, as shown below:
>>> import re
Using the dir(re) statement, you can find out what names are defined in the module, but you will need to use help(re) to find out the core functions and methods you can use from the re module.
The following are functions defined in the re module:
re.compile(pattern, flags=0)
Compile a pattern into a pattern object and return the compiled pattern object for more effective uses later.
>>> import re
>>> pobj = re.compile(r'780-?\d{3}-?\d{4}') # raw string, so that \d reaches the re engine unchanged
>>> pobj.findall('780-9381396, 7804311508, 18663016227') # findall method of pattern object
['780-9381396', '7804311508']
>>> b = re.compile(r'\d+\.\d*')
>>> b.match('32.23') # match method of pattern object
<re.Match object; span=(0, 5), match='32.23'>
re.match(pattern, string, flags=0)
Match a regular expression pattern to the beginning of a string. Return None if no match is found.
>>> r = re.match(r'\d+\.\d*', '123.89float')
>>> r
<re.Match object; span=(0, 6), match='123.89'>
re.fullmatch(pattern, string, flags=0)
Match a regular expression pattern to all of a string. Return None if no match is found.
>>> r = re.fullmatch(r'\d+\.\d*', '123.89')
# this will match
>>> r = re.fullmatch(r'\d+\.\d*', '123.89float')
# this will not match
re.search(pattern, string, flags=0)
Search a string for the presence of a pattern; return the first match object. Return None if no match is found.
>>> r = re.search(r'\d+\.\d+', 'real 123.89')
>>> r
<re.Match object; span=(5, 11), match='123.89'>
re.sub(pattern, replacing, string, count=0, flags=0)
Substitute occurrences of a pattern found in a string with replacing and return the resulting string.
>>> re.sub('t', 'T', 'Python is great.')
'PyThon is greaT.'
re.subn(pattern, replacing, string, count=0, flags=0)
Same as sub, but also return the number of substitutions made.
>>> re.subn('t', 'T', 'Python is great.')
('PyThon is greaT.', 2)
re.split(pattern, string, maxsplit=0, flags=0)
Split a string by the occurrences of a pattern and return a list of substrings cut by the pattern.
>>> re.split(r'\W+', 'Python is great.') # \W is nonalphanumeric so it will get a list of words
['Python', 'is', 'great', '']
re.findall(pattern, string, flags=0)
Find all occurrences of a pattern in a string and return a list of matches.
>>> re.findall('t', 'Python is great.')
['t', 't']
re.finditer(pattern, string, flags=0)
Return an iterator yielding a match object for each match.
>>> re.finditer('t', 'Python is great.')
<callable_iterator object at 0x00000198FE0F5FC8>
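The iterator itself is not very informative when printed. A minimal sketch of how it is typically used is to loop over it and read each match object, for example with the start() and group() methods of match objects:

```python
import re

# Each item yielded by finditer is a match object.
positions = [(m.start(), m.group()) for m in re.finditer('t', 'Python is great.')]
print(positions)
```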
re.purge()
Clear the regular expression cache.
>>> re.purge()
>>>
re.escape(pattern)
Escape the special characters in a string by putting a backslash before each of them, so that the string can be used as a literal pattern.
>>> print(re.escape('1800.941.7896'))
1800\.941\.7896
Suppose we want to write a program to check if a name given by a user is a legitimate Python identifier. We can define a regex pattern for a Python identifier as shown below:
idPatt = r'\b_{0,2}[A-Za-z](_?[A-Za-z0-9])*_{0,2}\b'
Before using the re module, we need to import it, as shown below:
import re
The next step will be to get an input from the user and test it:
name = input('Give me a name and I will tell you if it is a Python identifier: ')
Trim the whitespace at the beginning and the end of the name just in case:
name = name.strip() # this will strip the whitespaces
Then, we do the real test:
if re.fullmatch(idPatt, name) is not None: # fullmatch requires the whole name to fit the pattern
    print('Congratulations! It is!')
else:
    print('Sorry, it is not.')
The complete code of the program is shown in the code section of Table 5-6.
The problem | In this case study, we will write a program to check if a name given by a user is a legitimate Python identifier. |
The analysis and design | Steps: Step 1: Import re module before using it Step 2: Define a regex pattern for Python identifiers Step 3: Get an input from the user, and Step 4: Test it with an if-else statement |
The code |
|
The result | Give me a name and I will tell you if it is a Python identifier:A2 Congratulations! It is! |
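The four steps in Table 5-6 can be put together as in the following sketch. To keep it easy to test, the check is wrapped in a helper function, looks_like_identifier, which is a hypothetical name introduced here, and a few sample names stand in for the user's input.

```python
import re

# Step 2: the pattern from the text, as a raw string so \b means a word boundary.
ID_PATT = r'\b_{0,2}[A-Za-z](_?[A-Za-z0-9])*_{0,2}\b'

def looks_like_identifier(name):
    """Return True if the whole stripped name fits ID_PATT (hypothetical helper)."""
    return re.fullmatch(ID_PATT, name.strip()) is not None

# Steps 3 and 4: sample names are used here in place of input().
results = {}
for candidate in ['A2', '__init__', '2A', 'my name']:
    results[candidate] = looks_like_identifier(candidate)
    print(candidate, results[candidate])
```

Note that this pattern is stricter than Python itself; for instance, it rejects a single underscore, which Python accepts as a name. The built-in string method isidentifier() applies Python's own rules for identifiers.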
5.2 Lists
The list is an important compound data type in Python and in almost all programming languages, though not many programming languages have list as a built-in data type.
In previous sections, you saw a few program examples with a list involved. In the following, we explain the operators and functions that can be used on lists.
list(iterable)
To construct a list from an iterable such as a sequence or call to range().
>>> l1 = list("test")
>>> l1
['t', 'e', 's', 't']
>>> l2 = list((1,2,3,4))
>>> l2
[1, 2, 3, 4]
>>> l5 = list(range(13, 26))
>>> l5
[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]
In addition, you can also create a list by directly putting items in a pair of square brackets, as shown below:
>>> l6 = ['Jon', 'John', 'Jonathan', 'Jim', 'James']
>>> l6
['Jon', 'John', 'Jonathan', 'Jim', 'James']
l[nth]
To get the element at index nth of list l. Indexes start from 0, so students[3] in the example below is the fourth element.
>>> students = ['John', 'Mary', 'Terry', 'Smith', 'Chris']
>>> students[3]
'Smith'
l[start:end]
To get a slice/sublist of l, including the members from a start position till right before the end position.
>>> students = ['John', 'Mary', 'Terry', 'Smith', 'Chris']
>>> students[1:3]
['Mary', 'Terry']
l[start:end:step]
To get a slice/sublist of l, including members from a start position to right before an end position with set step.
>>> l6 = ['Jon', 'John', 'Jonathan', 'Jim', 'James']
>>> l6[:5:3] # start from 0 till 5 with step set as 3.
['Jon', 'Jim']
l[n] = e
To replace element at n with e.
>>> print(students)
['John', 'Mary', 'Terry', 'Smith', 'Chris']
>>> students[2] = 'Cindy'
>>> print(students)
['John', 'Mary', 'Cindy', 'Smith', 'Chris']
l1 + l2
To concatenate list l2 to l1, but without changing l1. As such, if you want to keep the result of concatenation, you will need to assign the result to a new variable.
>>> teachers = ['Jeffery', 'Clover', 'David']
>>> students + teachers
['John', 'Mary', 'Terry', 'Smith', 'Chris', 'Jeffery', 'Clover', 'David']
>>> teachers
['Jeffery', 'Clover', 'David']
>>> class_members = students + teachers
>>> class_members
['John', 'Mary', 'Terry', 'Smith', 'Chris', 'Jeffery', 'Clover', 'David']
l * n
n * l
To duplicate list l n times but without changing l.
>>> students[1:3]
['Mary', 'Terry']
>>> students[1:3] * 2
['Mary', 'Terry', 'Mary', 'Terry']
>>> 2*students[1:3]
['Mary', 'Terry', 'Mary', 'Terry']
e in l
To test if e is in list l. Only the top-level elements of l are checked: if e is only part of a compound element of l, such as a nested list, a tuple, or an instance of a class, it is not considered to be in the list.
>>> teachers
['Jeffery', 'Clover', 'David']
>>> 'Clover' in teachers
True
>>> l0 = [1, 2, 3, [4, 5], 6] # 4 and 5 are members of a sublist of l0
>>> 5 in l0 # so that 5 is not considered as part of list l0
False
len(l)
To get the number of elements in the list l.
>>> students
['John', 'Mary', 'Terry', 'Smith', 'Chris']
>>> len(students)
5
print(l)
To print list l. Note that the list will be printed recursively, and each item is shown in its repr() form, so complex objects such as instances of a user-defined class may not be printed the way you expected unless you have defined the __repr__() method for the class.
>>> print(teachers)
['Jeffery', 'Clover', 'David']
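The following sketch shows how objects inside a list are displayed. The Teacher class is hypothetical and defined only for this illustration; it has both a __str__() and a __repr__() method so that the two forms can be told apart.

```python
class Teacher:
    """Hypothetical class used only for this illustration."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return 'Teacher ' + self.name

    def __repr__(self):
        return 'Teacher(' + repr(self.name) + ')'

staff = [Teacher('David')]
as_item = str(staff)      # items inside a list are shown with repr()
alone = str(staff[0])     # a single object is shown with str()
print(as_item)
print(alone)
```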
In addition, there are also built-in methods for list objects, as detailed below.
l.append(e)
To append element e to list l.
>>> l = ['T', 'h']
>>> l.append('e')
>>> l
['T', 'h', 'e']
l.clear()
To remove all items from list l.
>>> l1 = list("test")
>>> l1
['t', 'e', 's', 't']
>>> l1.clear()
>>> l1 # l1 became an empty list
[]
l.copy()
To return a shallow copy of list l. The copy is a new list, but its items refer to the same objects as the items of l. For simple items such as numbers and strings, this makes no practical difference; for compound items such as nested lists, the actual objects are not copied, so both lists share them.
>>> l7 = l6.copy() # from above we know that items in l6 are all simple strings
>>> l7
['Jon', 'John', 'Jonathan', 'Jim', 'James']
>>> l7[3] = 'Joe' # change the value of l7[3]
>>> l7 # it shows l7 has been changed
['Jon', 'John', 'Jonathan', 'Joe', 'James']
>>> l6 # it shows l6 remains the same
['Jon', 'John', 'Jonathan', 'Jim', 'James']
Now suppose we have
>>> l8 = [[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], ['Jon', 'John', 'Jonathan', 'Joe', 'James'], 100]
>>> l9 = l8.copy()
>>> l9 # l9 has the same items as l8
[[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], ['Jon', 'John', 'Jonathan', 'Joe', 'James'], 100]
>>> l9[0][0] = 1000 # make change to the internal value of list l9[0], that is, l9[0][0] to 1000
>>> l9 # l9 has been changed
[[1000, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], ['Jon', 'John', 'Jonathan', 'Joe', 'James'], 100]
>>> l8 # l8 has been changed as well
[[1000, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25], ['Jon', 'John', 'Jonathan', 'Joe', 'James'], 100]
As can be seen, if you make changes to a piece of compound data (a list as the first item of l9 copied from l8), the change also occurs in the original list, and vice versa.
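If fully independent copies are needed, the deepcopy function of the standard copy module copies the nested objects as well. A minimal sketch, using a shortened version of the list above:

```python
import copy

l8 = [[13, 14, 15], ['Jon', 'John'], 100]
shallow = l8.copy()          # shares the inner lists with l8
deep = copy.deepcopy(l8)     # copies the inner lists as well

l8[0][0] = 1000
print(shallow[0][0])   # the shallow copy sees the change
print(deep[0][0])      # the deep copy does not
```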
l.index(e, start = 0, stop = 9223372036854775807)
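To return the index of the first occurrence of element e, searching from a start position up to, but not including, a stop position. The default stop value, 9223372036854775807, is the largest index possible on a 64-bit system, so by default the whole list is searched. A ValueError is raised if e is not found.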
>>> l6.index('Jim')
3
l.pop()
To remove and return an item from the end of list l.
>>> l.pop()
'e'
l.pop(i)
To remove and return the item at index i of list l. In the example below, l.pop(2) removes the item at index 2, that is, the third item; the items after it then move one position to the left.
>>> l = [1, 3, 2, 6, 5, 7]
>>> l.pop(2)
2
>>> l
[1, 3, 6, 5, 7]
>>> l.pop(2)
6
l.reverse()
To reverse the list.
>>> l.reverse()
>>> l
[7, 5, 3, 1]
l.sort()
To sort the list in ascending order by default. To sort in descending order, use l.sort(reverse = True).
>>> l.sort()
>>> l
[1, 3, 5, 7]
l.extend(l0)
To extend list l by appending the items of list l0 to the end of list l. Unlike l + l0, which builds a new list and leaves l unchanged, extend changes l itself; it has the same effect as l += l0.
>>> l = list(range(5))
>>> l
[0, 1, 2, 3, 4]
>>> l0 = list(range(5, 11))
>>> l0
[5, 6, 7, 8, 9, 10]
>>> l.extend(l0)
>>> l
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
l.insert(i, e)
To insert e before index i of existing list l.
>>> l = list(range(10))
>>> l
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> l.insert(5, 13)
>>> l
[0, 1, 2, 3, 4, 13, 5, 6, 7, 8, 9]
l.remove(e)
To remove first occurrence of e in the list.
>>> l
[0, 1, 2, 3, 4, 13, 5, 6, 7, 8, 9]
>>> l.remove(5)
>>> l # 5 has been removed from l
[0, 1, 2, 3, 4, 13, 6, 7, 8, 9]
l.count(e)
To search the list and return the number of occurrences of e.
>>> l
[0, 1, 2, 3, 4, 13, 6, 7, 8, 9]
>>> l.count(6)
1
As you can see, elements in lists can be changed or mutated. You can insert, delete, replace, expand, and reorder all the elements in a list.
Lists are a very important data model in programming and problem solving. First, lists can be used as collections of data. Each member of the collection can be as simple as a number or a string and as complex as another list or any other compound data type, or even an object. Many functions and methods, as discussed above, have been made available for accessing and manipulating lists and their members.
Suppose we want to develop a management system for a company, for example. Within the system, we need to represent information on its employees. We can use a list containing the name, birthdate, department, start date at the company, and level of employment to represent information on each employee, then use another list to represent a collection of employees. This is illustrated as follows:
# this defines a nested list or two-dimensional array
employees = [['Kevin Smith', 19560323, 'Sale', 20100621, 3],
['Paul Davina', 19860323, 'HR', 20120621, 5],
['Jim Carri', 19690323, 'Design', 20120625, 2],
['John Wong', 19580323, 'Customer Service', 20110323, 3],
['Keri Lam', 19760323, 'Sale', 20130522, 5]]
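With such a nested list, each employee record is reached with one index and each field with a second index. The sketch below uses a shortened copy of the list; the position names NAME, DEPARTMENT, and so on, and the appended record are made up for illustration.

```python
employees = [['Kevin Smith', 19560323, 'Sale', 20100621, 3],
             ['Paul Davina', 19860323, 'HR', 20120621, 5],
             ['Keri Lam', 19760323, 'Sale', 20130522, 5]]

# Positions of the fields within each record (illustrative names).
NAME, BIRTHDATE, DEPARTMENT, START_DATE, LEVEL = range(5)

first_name = employees[0][NAME]   # first record, name field

# Collect the names of everyone in the 'Sale' department.
sale_staff = []
for record in employees:
    if record[DEPARTMENT] == 'Sale':
        sale_staff.append(record[NAME])

# A new employee joins; this record is invented for the example.
employees.append(['Jane Roe', 19900101, 'HR', 20200101, 1])

print(first_name, sale_staff, len(employees))
```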
Moreover, lists can be used to represent trees, which are an important data structure in programming and problem solving.
5.3 Tuples
Unlike a list, a tuple is an immutable object, which means that once created, the internal structure of a tuple cannot be changed. Hence, most methods you have seen for lists are not available for tuples, except for the following two.
t.count(e)
To count and return the number of occurrences of a specified value in a tuple.
>>> t = (3, 6, 5, 7, 5, 9)
>>> t.count(5)
2
t.index(e, start = 0, stop = 9223372036854775807)
To search the tuple for a specified value e and return the index of the first occurrence of the value. Remember that just like a list, a tuple can have duplicate values as well.
>>> t.index(6)
1
>>> t.index(7)
3
>>> t0 = tuple("same as list, tuple")
>>> t0
('s', 'a', 'm', 'e', ' ', 'a', 's', ' ', 'l', 'i', 's', 't', ',', ' ', 't', 'u', 'p', 'l', 'e')
>>> t0.index('l') # it only returns the index of the first l
8
>>> t0.index('l', 9) # to get the index of the next occurrence
17
As well, compared to lists, fewer operators and built-in functions can be used on tuples, as shown below.
tuple(iterable)
To construct a tuple from an iterable such as another sequence or a call to range(), a built-in function.
>>> l1 = [1, 2, 3]
>>> t0 = tuple(l1)
>>> t0
(1, 2, 3)
This would be the same as the following:
>>> t0 = (1, 2, 3)
>>> t1 = tuple('tuple')
>>> t1
('t', 'u', 'p', 'l', 'e')
>>> tuple(range(7))
(0, 1, 2, 3, 4, 5, 6)
t[n]
To get the element at index n of a tuple, with indexes starting from 0.
>>> teachers = ('Jeffery', 'Clover', 'David')
>>> teachers[2]
'David'
Please note that because a tuple is an immutable sequence, making changes to its members will generate an error, as shown below:
>>> teachers[1] = 'Chris'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
t[i:j]
To get a slice of tuple t including elements from point i to the one right before point j.
>>> teachers[0:2]
('Jeffery', 'Clover')
>>> print(teachers)
('Jeffery', 'Clover', 'David')
t1 + t2
To concatenate tuple t2 to t1.
>>> students = tuple(students)
>>> print(students)
('John', 'Mary', 'Terry', 'Smith', 'Chris')
>>> students + teachers
('John', 'Mary', 'Terry', 'Smith', 'Chris', 'Jeffery', 'Clover', 'David')
t * n
To duplicate tuple t n times.
>>> teachers * 2
('Jeffery', 'Clover', 'David', 'Jeffery', 'Clover', 'David')
e in t
To test if e is an element of tuple t.
>>> teachers
('Jeffery', 'Clover', 'David')
>>> 'David' in teachers
True
len(t)
To get the number of elements in the tuple t.
>>> len(teachers * 2)
6
print(t)
To print tuple t. Again, print prints the tuple recursively, but the expected result can only be achieved if __repr__() has been defined for every object at all levels.
>>> print(students)
('John', 'Mary', 'Terry', 'Smith', 'Chris')
Again, although we can build new tuples by concatenating and duplicating existing ones, we cannot make any change to the existing elements of a tuple as we did with lists, because tuples are immutable. As a result, the tuple is not a suitable data structure for representing the group of employees in the example presented at the end of the previous section because employees may come and go.
5.4 Sets
As in mathematics, a set is a collection of unindexed and unordered elements. For sets, Python has very few operators and built-in functions that we can use.
set(s)
To construct a set from s, which can be a list, tuple, or string.
>>> students = ['Cindy', 'Smith', 'John', 'Chris', 'Mary']
>>> students = set(students)
>>> students
{'Cindy', 'Smith', 'John', 'Chris', 'Mary'}
>>> numbers = set(range(10))
>>> numbers
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
e in s
To test if e is a member of set s.
>>> students
{'Cindy', 'Smith', 'John', 'Chris', 'Mary'}
>>> 'Chris' in students
True
len(s)
To get the total number of elements in the set.
>>> numbers
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> len(numbers)
10
However, there are a good number of methods defined for sets.
s.add(m)
To add an element to the set s.
>>> s = set([3])
>>> s
{3}
>>> s.add(5)
>>> s
{3, 5}
s.clear()
To remove all the elements from the set s.
>>> s.clear()
>>> s
set()
s.copy()
To make and return a copy of the set s.
>>> s = {3, 5} # create s again because it was cleared above
>>> s
{3, 5}
>>> s1 = s.copy()
>>> s1
{3, 5}
s.difference(s1,…)
To make and return a set containing only members of s that other sets in the arguments don’t have—that is, the difference between two or more sets.
>>> s1
{3, 5}
>>> s2 = {5, 7}
>>> s1.difference(s2)
{3}
>>> s3={3,7}
>>> s1.difference(s2,s3) # returns an empty set
set()
s.difference_update(*sx)
To remove the items in set s that are also included in another, specified set.
>>> s1 = {2 * i for i in range(15)}
>>> s2 = {3 * i for i in range(15)}
>>> s1
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s1.difference_update(s2)
>>> s1
{2, 4, 8, 10, 14, 16, 20, 22, 26, 28}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
s.discard(m)
To remove the specified item from the set if it is present. No error is raised if the item is not in the set.
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s2.discard(18)
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 21, 24, 27, 30}
s.intersection(*sx)
To return a set that is the intersection of set s and the other specified sets.
>>> s1 = {2 * i for i in range(15)}
>>> s2 = {3 * i for i in range(15)}
>>> s1
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s1.intersection(s2)
{0, 6, 12, 18, 24}
s.intersection_update(*sx)
To remove the items in this set that are not present in another, specified set or sets.
>>> s1
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s1.intersection_update(s2)
>>> s1
{0, 6, 12, 18, 24}
s.isdisjoint(sx)
To check whether two sets are disjoint: it returns True if the two sets have no common member and False otherwise.
>>> s1
{0, 6, 12, 18, 24}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s1.isdisjoint(s2)
False
s.issubset(sx)
To check and return whether another set contains this set or not.
>>> s1 = {2 * i for i in range(15)}
>>> s2 = {3 * i for i in range(15)}
>>> s1
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s1.issubset(s2)
False
s.issuperset(sx)
To check and return whether this set contains another set or not.
>>> s1.issuperset(s2)
False
s.pop()
To remove and return an arbitrary element from the set.
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s2.pop()
0
>>> s2
{33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
s.remove(m)
To remove the specified element. Unlike discard, it raises a KeyError if the element is not in the set.
>>> s2
{33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s2.remove(18)
>>> s2
{33, 3, 36, 6, 39, 9, 42, 12, 15, 21, 24, 27, 30}
s.symmetric_difference(sx)
To construct and return a set with elements in either set s or another set but not both. These are called set symmetric differences (“I have you do not; you have I do not”).
>>> s1 = {2 * i for i in range(15)}
>>> s2 = {3 * i for i in range(15)}
>>> s1
{0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
>>> s2
{0, 33, 3, 36, 6, 39, 9, 42, 12, 15, 18, 21, 24, 27, 30}
>>> s1.symmetric_difference(s2)
{2, 3, 4, 8, 9, 10, 14, 15, 16, 20, 21, 22, 26, 27, 28, 30, 33, 36, 39, 42}
s.symmetric_difference_update(sx)
To update set s in place so that it contains the symmetric difference of s and another set.
>>> s1.symmetric_difference_update(s2)
>>> s1
{2, 3, 4, 8, 9, 10, 14, 15, 16, 20, 21, 22, 26, 27, 28, 30, 33, 36, 39, 42}
s.union(sx)
To return a set containing the union of sets.
>>> s2 = {3 * i for i in range(5)}
>>> s1 = {2 * i for i in range(5)}
>>> s1
{0, 2, 4, 6, 8}
>>> s2
{0, 3, 6, 9, 12}
>>> s1.union(s2)
{0, 2, 3, 4, 6, 8, 9, 12}
s.update(sx)
To update the set by adding members from other sets.
>>> s1
{0, 2, 4, 6, 8}
>>> s2
{0, 3, 6, 9, 12}
>>> s1.update(s2)
>>> s1
{0, 2, 3, 4, 6, 8, 9, 12}
5.5 Dictionaries
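A dictionary is a collection of key and value pairs enclosed in curly brackets. Like a set, a dictionary is mutable: pairs can be added, changed, and removed after the dictionary is created. The keys, however, must be unique and must be immutable objects such as numbers or strings. There are very few operators and built-in functions that can be used on dictionaries, as shown below.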
dict(**kwarg)
To construct a dictionary from a series of keyword arguments.
>>> dt = dict(one = 1, two = 2, three = 3)
>>> dt
{'one': 1, 'two': 2, 'three': 3}
dict(mapping, **kwarg)
To construct a dictionary from mapping. If keyword arguments are present, they will be added to the dictionary constructed from the mapping.
>>> d1 = dict(zip(['one', 'two', 'three'], [1, 2, 3]))
>>> d2 = dict(zip([1, 2, 3], ['one', 'two', 'three']))
>>> d1
{'one': 1, 'two': 2, 'three': 3}
>>> d2
{1:'one', 2:'two', 3:'three'}
dict(iterable, **kwarg)
To construct a dictionary from an iterable of key-value pairs. If keyword arguments are present, they will be added to the dictionary constructed from the iterable.
>>> d3 = dict([('two', 2), ('one', 1), ('three', 3)])
>>> d3
{'two': 2, 'one': 1, 'three': 3}
list(dt)
To return a list of all the keys used in the dictionary dt.
>>> d3
{'two': 2, 'one': 1, 'three': 3}
>>> list(d3)
['two', 'one', 'three']
dt[k]
To get the value of key k from dictionary dt.
>>> dt = {1:'One', 2:'Two', 3:'Three'}
>>> dt[1]
'One'
dt[k] = V
To set dt[k] to value V.
>>> d3
{'two': 2, 'one': 1, 'three': 3}
>>> d3['two']
2
>>> d3['two'] = bin(2)
>>> d3['two']
'0b10'
del dt[k]
To remove dt[k] from dt.
>>> d3
{'two':'0b10', 'one': 1, 'three': 3}
>>> del d3['two']
>>> d3
{'one': 1, 'three': 3}
k in dt
To test if dt has a key k.
>>> d3
{'one': 1, 'three': 3}
>>> 'two' in d3
False
k not in dt
To test if dt does not have key k. Same as not k in dt.
>>> 'two' not in d3
True
>>> not 'two' in d3
True
iter(dt)
To return an iterator over the keys of dt. Same as iter(dt.keys()).
>>> iter(d3)
<dict_keyiterator object at 0x00000198FE0EFEF8>
>>> list(iter(d3))
['one', 'three']
len(dt)
To get the total number of key-value pairs in the dictionary.
>>> dt
{1:'One', 2:'Two', 3:'Three'}
>>> len(dt)
3
reversed(dt)
To return a reverse iterator over the keys of the dictionary. Same effect as reversed(dt.keys()). This is new in Python 3.8.
>>> dt = {1:'One', 2:'Two', 3:'Three'} # keys: 1, 2, 3
>>> rk = reversed(dt) # reversed iterator over the keys in rk
>>> for k in rk:
...     print(k)
...
3
2
1
Note that in the output above, the keys in rk are 3, 2, 1.
The following built-in methods of the dictionary class can be used to manipulate dictionaries.
d.clear()
To remove all the elements from the dictionary.
>>> dt = {1:'One', 2:'Two', 3:'Three'}
>>> dt
{1:'One', 2:'Two', 3:'Three'}
>>> dt.clear()
>>> dt
{}
d.copy()
To make and return a copy of the dictionary.
>>> dt = {1:'One', 2:'Two', 3:'Three'}
>>> dx = dt.copy()
>>> dx
{1:'One', 2:'Two', 3:'Three'}
dict.fromkeys(keys, value)
To make a dictionary from a list of keys, with every key set to the same value. If value is omitted, every key is set to None.
>>> keys = ['Edmonton','Calgary','Toronto']
>>> weather = dict.fromkeys(keys, 'Sunny')
>>> print(weather)
{'Edmonton': 'Sunny', 'Calgary': 'Sunny', 'Toronto': 'Sunny'}
d.get(k)
To return the value of the specified key. Unlike dt[k], it returns None instead of raising a KeyError if the key is not in the dictionary.
>>> d3 =dict([('two', 2), ('one', 1), ('three', 3)])
>>> d3.get('two')
2
d.items()
To return a view object containing a tuple for each key-value pair. Use list(d.items()) if an actual list is needed.
>>> d3.items()
dict_items([('two', 2), ('one', 1), ('three', 3)])
d.keys()
To return a view object containing the dictionary’s keys.
>>> d3.keys()
dict_keys(['two', 'one', 'three'])
d.values()
To return a view object containing all the values in the dictionary.
>>> d3.values()
dict_values([2, 1, 3])
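The objects returned by keys(), values(), and items() are not independent lists: they follow later changes to the dictionary. A minimal sketch, in which the key 'four' is added only for illustration:

```python
d3 = dict([('two', 2), ('one', 1), ('three', 3)])
keys = d3.keys()          # obtained before the dictionary is changed

d3['four'] = 4            # add a new key-value pair afterward

key_list = list(keys)     # the new key shows up here as well
print(key_list)
```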
d.pop(k)
To remove the element with the specified key and return its value. Note that the removed item will no longer exist in the dictionary.
>>> d3
{'two': 2, 'one': 1, 'three': 3}
>>> d3.pop('two')
2
>>> d3
{'one': 1, 'three': 3}
d.popitem()
To remove and return the last inserted item of the dictionary, as a key and value pair.
>>> d3
{'two': 2, 'one': 1, 'three': 3}
>>> d3.popitem()
('three', 3)
d.setdefault(key, value)
To insert a key-value pair into the dictionary if the key is not in the dictionary; return the value of the key if the key exists in the dictionary.
>>> d3
{'two': 2, 'one': 1}
>>> d3.setdefault('three', 3)
3
>>> d3.setdefault('two', 'II')
2
>>> d3
{'two': 2, 'one': 1, 'three': 3}
d.update(dx)
To update the dictionary with the specified key-value pairs in dx.
>>> d3
{'two': 2, 'one': 1, 'three': 3}
>>> d2
{1:'one', 2:'two', 3:'three'}
>>> d3.update(d2)
>>> d3
{'two': 2, 'one': 1, 'three': 3, 1:'one', 2:'two', 3:'three'}
5.6 List, Set, and Dictionary Comprehension
Lists, sets, and dictionaries are important data models for programmers to structure and organize data with. Python provides list, set, and dictionary comprehension to construct lists, sets, and dictionaries in a concise and efficient way. The essential idea of list, set, and dictionary comprehension is the use of a for loop with or without conditions.
List Comprehension
The following is an example that constructs a list of odd numbers from 1 to 100:
In [ ]: |
|
Out [ ]: | [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99] |
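One comprehension that yields the list above is sketched below. It assumes the item expression 2 * i + 1 and the range 0 to 49, which is only one of several possible ways to write it.

```python
# The expression before 'for' is evaluated once for each i in range(50).
odds = [2 * i + 1 for i in range(50)]
print(odds)
```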
In the example, the expression before for represents the items of the list; the for loop will run through the item expression in each iteration. This list can also be generated using the for loop with an if clause, as shown below:
In [ ]: |
|
Out [ ]: | [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, 35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99] |
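A sketch of the if-clause form, assuming the condition i % 2 == 1 is used to keep only the odd numbers:

```python
# Each i from 1 to 99 is kept only when the condition after 'if' is true.
odds = [i for i in range(1, 100) if i % 2 == 1]
print(odds)
```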
In the example above, the list item expression will be evaluated in every iteration of the for loop only if the condition of the if clause is met. In fact, we can put any condition on the iteration variable i. For example, assume we have a Boolean function isPrime(N) that can test if number N is prime or not; then the following statement will produce a list of prime numbers in the given range:
primes = [i for i in range(1000) if isPrime(i)]
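The function isPrime is only assumed to exist in the text. To make the statement runnable, a simple version based on trial division might look like the following sketch; it is one possible implementation, not necessarily the most efficient one.

```python
def isPrime(N):
    """Return True if N is a prime number (simple trial division)."""
    if N < 2:
        return False
    d = 2
    while d * d <= N:        # divisors need to be tried only up to the square root of N
        if N % d == 0:
            return False
        d += 1
    return True

primes = [i for i in range(1000) if isPrime(i)]
print(len(primes), primes[:10])
```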
Please note that the list item expression before for can be anything whose value is a legitimate list item, as shown in the example below:
In [ ]: |
|
Out [ ]: | ['0 is even', '1 is odd', '2 is even', '3 is odd', '4 is even', '5 is odd', '6 is even', '7 is odd', '8 is even', '9 is odd'] |
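A comprehension that produces the list above could be written as in this sketch, where the item expression is a string built with a conditional expression:

```python
# The item expression builds a string for each i; the inner 'if ... else' picks the word.
labels = [str(i) + ' is ' + ('even' if i % 2 == 0 else 'odd') for i in range(10)]
print(labels)
```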
For item expressions involving two variables or more, nested for statements can be used. For example, the following statement will generate a list of combinations of some years and months:
In [ ]: |
|
Out [ ]: | ['201501', '201502', '201503', '201504', '201505', '201506', '201507', '201508', '201509', '201510', '201511', '201512', '201601', '201602', '201603', '201604', '201605', '201606', '201607', '201608', '201609', '201610', '201611', '201612', '201701', '201702', '201703', '201704', '201705', '201706', '201707', '201708', '201709', '201710', '201711', '201712', '201801', '201802', '201803', '201804', '201805', '201806', '201807', '201808', '201809', '201810', '201811', '201812', '201901', '201902', '201903', '201904', '201905', '201906', '201907', '201908', '201909', '201910', '201911', '201912'] |
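A sketch with two nested for clauses that generates such year and month combinations is given below; the ranges and the f-string format are assumptions made for this illustration.

```python
# The second 'for' runs through all twelve months for every year of the first 'for'.
year_months = [f'{year}{month:02d}'
               for year in range(2015, 2020)
               for month in range(1, 13)]
print(year_months)
```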
Set Comprehension
A set is a collection of unique unordered items. With that in mind, set comprehension is similar to list comprehension, except that the items are enclosed in curly brackets, as shown in the example below:
In [ ]: |
|
Out [ ]: | {'K', 'M', 'G', 'T', 'C', 'O', 'L', 'D', 'S', 'I', 'B', 'N', 'A', 'F', 'W', 'H', 'P', 'X', 'J', 'Z', 'E', 'R', 'U', 'Y', 'Q', 'V'} |
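A set of the 26 capital letters can be built as in the sketch below, which assumes the built-in function chr is used to turn character codes into letters. The second statement illustrates that repeated items are kept only once; its sample string is made up.

```python
# chr(65) is 'A' and chr(90) is 'Z'.
letters = {chr(code) for code in range(65, 91)}

# 'l' and 'o' occur more than once in the string, but each appears once in the set.
seen = {c.upper() for c in 'Hello World' if c.isalpha()}

print(len(letters), sorted(seen))
```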
What would happen if items generated from the iteration were duplicated? No worries! The implementation of set comprehension can take care of that. If we want to find out all the unique words contained in a web document, we can simply use set comprehension to get them, as shown below:
In [ ]: |
|
Out [ ]: | 969 |
The example above took a web document at https://scis.athabascau.ca, pulled all the unique words used in the document into a set, and printed the number of unique words used, which is 969.
As you can see, we could get the unique words in a document very easily by using set comprehension. How would we find out the ratio between the number of unique words and the total number of words used?
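One way to approach the question above is sketched here on a short made-up string rather than a downloaded web page, so that the numbers are easy to verify by hand:

```python
text = 'the quick brown fox jumps over the lazy dog the end'

words = text.split()                      # all the words, duplicates included
unique_words = {w.lower() for w in words} # set comprehension keeps each word once

ratio = len(unique_words) / len(words)
print(len(unique_words), len(words), ratio)
```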
Dictionary Comprehension
Dictionary comprehension is very similar to set comprehension, except that we need to add a key and colon before each item to make a dictionary item, as shown in the following:
In [ ]: |
|
Out [ ]: | {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'} |
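A dictionary comprehension producing the dictionary above might look like the following sketch; the list name month_names and the use of enumerate are assumptions made for illustration.

```python
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# The key and the value are separated by a colon before the 'for' clause.
months = {i + 1: name for i, name in enumerate(month_names)}
print(months)
```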
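A dictionary comprehension can also be written with nested for clauses, but the result is different: the nested loops run through every combination of key and value, and a later value assigned to the same key replaces the earlier one, so every key ends up with the last value, as shown below: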
In [ ]: |
|
Out [ ]: | {1: 'Dec', 2: 'Dec', 3: 'Dec', 4: 'Dec', 5: 'Dec', 6: 'Dec', 7: 'Dec', 8: 'Dec', 9: 'Dec', 10: 'Dec', 11: 'Dec', 12: 'Dec'} |
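The following sketch reproduces this behaviour and contrasts it with pairing the keys and values by using the built-in zip function. The variable names are illustrative.

```python
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Nested for clauses: every key is assigned each name in turn, and the last one stays.
nested = {k: name for k in range(1, 13) for name in month_names}

# zip pairs the first key with the first name, the second with the second, and so on.
paired = {k: name for k, name in zip(range(1, 13), month_names)}

print(nested[1], paired[1])
```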
5.7 Text Files
For people and smart beings in general, part of intelligence is being able to remember things. Memory is an important part of that.
Computers have two types of memory. The first is RAM, which is volatile, expensive, and of relatively lower capacity but provides high-speed access. The variables we have been talking about so far are using RAM to hold data and running programs that are also inside RAM. If the computer is turned off, both data and programs will disappear from RAM. RAM is also called internal memory.
The second type of memory that modern computers have is persistent memory, such as a hard drive, flash memory card, or solid-state hard disk. This type of memory is also called external memory. For this type of memory to be useful, it must be part of a file system managed by an OS such as Windows, iOS, or Linux. Within a file system, files are used for saving data and programs to external memory, and the files are organized into hierarchical directories or folders. In a file system, a file can be located through a path from the root to the file within a tree-like structure.
Opening and Closing a File
To use a file, the first step is to open the file using the built-in function open. The following will open a file for writing:
f = open("./mypoem.txt", 'w') # open file mypoem.txt in the current working directory for writing
The statement above opened the file mypoem.txt in the current working directory and returned a stream, which was then assigned to variable f, often called a file handle.
The general syntax of the open statement is as follows:
open(file, mode = 'r', buffering = -1, encoding = None, errors = None, newline = None, closefd = True, opener = None)
The statement opens a file and returns a stream or file object; it will raise OSError upon failure. In the statement, file is a text or byte string referring to the file to be opened. It may include the path to the actual file if the file is not in the current working directory.
Apart from the first argument for the name of the file to be opened, all other arguments have default values, which means that these arguments are optional. If no value is supplied, the default value will be used.
The second argument is called mode. It is optional, with a default value of r, which means “open a text file for reading.” This argument tells the program what to do with the file after opening it. Because reading from a text file is not always what you want to do with a file, you will often need to supply this argument explicitly. All available values for the mode argument are shown in Table 5-7.
Mode argument | Access or accesses |
---|---|
r | Open for reading only; file must already exist. |
r+ | Open for both reading and writing; file must already exist. |
w | Open for writing only; file may or may not exist. If not, a new file will be created and ready for writing; if file already exists, content will be overwritten. |
w+ | Open for writing and reading; file may or may not exist. |
a | Open for appending, same as w, but will not overwrite the existing content. |
a+ | Open for both appending and reading. |
x | Create a new file for writing. A FileExistsError will be raised if the file already exists. This helps prevent accidentally overwriting an existing file. |
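The following minimal sketch contrasts the w, a, and x modes from the table. The scratch directory and the file name modes_demo.txt are hypothetical and exist only so that the example does not touch real files.

```python
import os
import tempfile

# A throwaway directory keeps the sketch away from real files.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'modes_demo.txt')  # hypothetical file name

with open(path, 'w') as f:   # creates the file, or overwrites its content
    f.write('first\n')

with open(path, 'a') as f:   # keeps the existing content and adds to the end
    f.write('second\n')

try:
    open(path, 'x')          # refuses to open a file that already exists
    x_failed = False
except FileExistsError:
    x_failed = True

with open(path, 'r') as f:   # read back what the two writes produced
    content = f.read()

print(repr(content), x_failed)
```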
The file to be opened or created can be either a text file, referred to as t, or a binary file containing raw bytes, referred to as b. To explicitly specify whether the file is a text or binary file, t or b can be used in combination with each of the values in Table 5-7. The default file type is text, so you do not have to use t if the file is a text file. So in the example above, the statement is equivalent to the following, in which t is used to explicitly indicate that it is a text file:
f = open("./mypoem.txt", 'wt') # open text file mypoem.txt in the current working directory
In both examples, we assigned the file stream returned from the open statement to a variable f, which is called the file handle. After the file is opened, all operations on the file, such as read, write, append, or even close, must be invoked through the file handle. Every file must be closed with the f.close() method after use unless the file is opened within a with statement, in which case a context manager will take over access to the file and close it when the job is done, as in the sample code below:
In [ ]: |
|
Out [ ]: | Yet it was plain she struggled, and that salt |
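The original code cell is not shown above. A minimal sketch of reading a line inside a with statement might look like the following; the scratch path is hypothetical, and the file is created first only so that the example can run on its own.

```python
import os
import tempfile

# Hypothetical scratch copy of the poem file.
poem_path = os.path.join(tempfile.mkdtemp(), 'mypoem.txt')
with open(poem_path, 'w') as f:
    f.write('Yet it was plain she struggled, and that salt\n')

# The context manager closes the file when the block ends,
# even if an exception is raised inside it.
with open(poem_path, 'r') as f:
    first_line = f.readline()

print(first_line, end='')
closed_after_with = f.closed   # True: no explicit f.close() was needed
```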
When not using the with statement, you will need to use the following instead:
In [ ]: |
|
Out [ ]: | Yet it was plain she struggled, and that salt |
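Without a with statement, the file has to be closed explicitly. The sketch below is one possible form, again using a hypothetical scratch path; wrapping the work in try and finally is an added precaution so that the file is closed even if reading fails.

```python
import os
import tempfile

# Hypothetical scratch copy of the poem file.
poem_path = os.path.join(tempfile.mkdtemp(), 'mypoem.txt')
f = open(poem_path, 'w')
f.write('Yet it was plain she struggled, and that salt\n')
f.close()

f = open(poem_path, 'r')
try:
    first_line = f.readline()
finally:
    f.close()                  # always runs, so the file is never left open

print(first_line, end='')
```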
The third argument is called buffering. It takes an optional integer used to specify the buffering policy of the file operation. Passing 0 turns off buffering, but this is only allowed for binary files. Passing 1 selects line buffering, which is only usable for text files. Passing an integer greater than 1 specifies the size in bytes of a fixed-size chunk buffer. When no buffering argument is provided, the default value −1 is used, which means that binary files are buffered in fixed-size chunks, and so are most text files; only interactive text files (those for which isatty() returns True, such as a terminal) use line buffering, in which data are flushed from the buffer (a portion of internal memory that holds the data temporarily) to the underlying file or device after each line is written.
The fourth argument is encoding, which specifies how the data in the file are encoded. This argument only makes sense for text files. The default value is platform dependent. If the data in a file are not in the default encoding, you can specify the encoding actually used, provided that it is supported by Python. On many platforms, UTF-8 is the default encoding.
The next optional argument is errors, which takes a string if provided. The string specifies how encoding errors should be handled. Again, this argument only makes sense if the file is a text file. The same is true for the optional newline argument, which controls how universal newlines work in a text file. The optional newline argument can take the following:
- • None
- • '' (an empty string)
- • \n
- • \r
- • \r\n
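As a small illustration of the encoding and newline arguments, the sketch below writes a file containing both kinds of line endings and reads it back twice. The scratch path is hypothetical. With newline=None, line endings are translated to \n on reading; with newline='', they are returned untranslated.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')  # hypothetical

# newline='' on writing means the characters are stored exactly as given.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('line one\r\nline two\n')

# Default behaviour: every line ending is translated to '\n' on reading.
with open(path, 'r', encoding='utf-8', newline=None) as f:
    translated = f.read()

# newline='' on reading: line endings are returned as stored.
with open(path, 'r', encoding='utf-8', newline='') as f:
    untranslated = f.read()

print(repr(translated))
print(repr(untranslated))
```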
Once a file is opened, a list of methods can be used to operate on the file object, as detailed below.
Write or Append to a File
New data can be added to a file in two different manners. The first one is to overwrite everything already in the file and place the new data at the beginning of the file, and the second one is to keep the data already in the file and append the new data to the end of the existing data. The write methods are the same for both, but the file must be opened with a different mode depending on the operation. That is, mode w or x must be used to open a file for writing the data from the beginning of the file, and mode a must be used to append new data, as you will see shortly in the examples.
There are two methods for writing to a file. The first one is f.write(string), which writes string to the file referred to by f. The second method is f.writelines(sequence), in which the sequence is any iterable object, such as a list or tuple, but often a list of strings. The following is an example of opening or creating a file (if the file does not exist yet) for writing data from the beginning of the file using the write(string) method:
>>> f = open("./mypoem.txt", "w") # open file in the current working directory for writing
>>> f.write("\nYou may write me down in history") # add \n to write on a new line
>>> f.flush() # to flush the data out to the actual file
The resulting file will read,
You may write me down in history
If you could write only this one line of your poem before you had to close the file and shut down the computer, you would most likely want to continue from where you had stopped the next time you came back to the poem. To do so, you need to append the new lines to the file. This is done by opening the file in a (append) mode, as shown below:
>>> f = open("./mypoem.txt", "a") # open file in the current working directory for appending
>>> f.write("\nWith your bitter, twisted lies") # add \n to write on a new line
>>> f.flush() # to flush the data out to the actual file
The file is extended with one more line of poem:
You may write me down in history
With your bitter, twisted lies
Note that the write method will only write a string to the file. As such, anything that is not a string must be converted into a string before being written to the file, as shown in the following example:
>>> f.write(f'\n{3.1415926}')
Also, because of buffering, the data you write to a file will not immediately show up in the actual file until you close the file or use the flush() method to flush the data in the buffer out to the file, as shown below:
>>> f.flush()
The write(string) method can only write one string to a file each time. To write multiple strings to a file, the writelines(sequence) method is used. However, keep in mind that writelines() does not automatically write one string on each line. You will still need to add \n at the beginning of the string if you want it to be on the next line or at the end of the string if you don’t want anything behind the string on the same line.
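The behaviour of writelines() described above can be checked with a short sketch. The scratch path is hypothetical; the two strings are the poem lines used earlier in this section.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'writelines_demo.txt')  # hypothetical
lines = ['You may write me down in history',
         'With your bitter, twisted lies']

# Without explicit newlines, the strings are written back to back.
with open(path, 'w') as f:
    f.writelines(lines)
with open(path) as f:
    joined = f.read()

# Adding '\n' to each string puts every string on its own line.
with open(path, 'w') as f:
    f.writelines(line + '\n' for line in lines)
with open(path) as f:
    separate = f.read()

print(repr(joined))
print(repr(separate))
```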
Recall the example of printing a 9 × 9 multiplication table. Now we can write the table to a text file so that you can print it out whenever you want. This is shown in the two examples below:
"""This first code sample is using the write method."""
f = open('./my9x9table.txt', 'w')
for i in range(1, 10):
    for j in range(1, i + 1):
        f.write('{:1d} x {:1d} = {:2d} '.format(j, i, i * j))
    f.write('\n')
f.close()
The output of the program is in the file my9x9table.txt:
1 x 1 = 1
1 x 2 = 2 2 x 2 = 4
1 x 3 = 3 2 x 3 = 6 3 x 3 = 9
1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16
1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25
To use the writelines(sequence) method, we need to store the results in a list first; each item of the list will be printed on one line. The code is shown as follows:
"""This code sample is using the writelines method."""
table = []
for i in range(1, 10):
    newline = ''
    for j in range(1, i + 1):
        newline += '{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
    newline += '\n'
    table.append(newline)
f = open('./my9x9table0.txt', 'w')
f.writelines(table)
f.close()
The result is the same as in the my9x9table.txt text shown above.
Reading from a File
To read from a file, three methods can be used. These methods are read([size]), readline([size]), and readlines([sizehint]).
Use the read([size]) method to read the entire file and return the entire contents as a single string or, if the optional size argument is given, to read the specified number of characters (or bytes, if the file is opened in binary mode) and return them as a single string. The following example shows how the 9 × 9 multiplication table is read using the read([size]) method:
In [ ]: |
|
Out [ ]: | 1 x 1 = 1 1 x 2 = 2 2 x 2 = 4 1 x 3 = 3 2 x 3 = 6 3 x 3 = 9 1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 1 x 6 = 6 2 x 6 = 12 3 x 6 = 18 4 x 6 = 24 5 x 6 = 30 6 x 6 = 36 1 x 7 = 7 2 x 7 = 14 3 x 7 = 21 4 x 7 = 28 5 x 7 = 35 6 x 7 = 42 7 x 7 = 49 1 x 8 = 8 2 x 8 = 16 3 x 8 = 24 4 x 8 = 32 5 x 8 = 40 6 x 8 = 48 7 x 8 = 56 8 x 8 = 64 1 x 9 = 9 2 x 9 = 18 3 x 9 = 27 4 x 9 = 36 5 x 9 = 45 6 x 9 = 54 7 x 9 = 63 8 x 9 = 72 9 x 9 = 81 |
If size is given, only that number of bytes will be read, as shown in the next example:
In [ ]: |
|
Out [ ]: | 1 x 1 = 1 1 x 2 = 2 2 x 2 = 4 1 x 3 = 3 2 x 3 = 6 3 x 3 = 9 1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16 1 |
Because the given size is so small, only a small portion of the multiplication table has been read from the file.
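The two uses of read() described above can be sketched as follows. The scratch path is hypothetical, the table is rebuilt first so that the example runs on its own, and the size of 20 is an arbitrary value chosen for illustration.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r') as f:
    whole = f.read()       # the entire file as one string

with open(path, 'r') as f:
    part = f.read(20)      # at most 20 characters from the start

print(whole, end='')
print(repr(part))
```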
Our next method for reading data from a file is readline([size]). This method will read and return one entire line from the file if the optional size argument is not provided or if the integer value is equal to or greater than the size of the line. If the provided size is smaller than the actual size of the line being read, then only part of that line, equal to the size in bytes, will be read and returned, as shown in the following example:
In [ ]: |
|
Out [ ]: | 1 x |
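A minimal sketch of readline() with a size smaller than the line is given below. The scratch path and the size of 3 are assumptions; the second call shows that the next readline() continues with the rest of the same line.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r') as f:
    partial = f.readline(3)   # only the first 3 characters of line 1
    rest = f.readline()       # the remainder of that same line

print(repr(partial))
print(repr(rest))
```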
To use this method to read all the lines of the 9 × 9 multiplication table in the file shown in the previous examples, we need to put it in a loop and read line by line until the end of the file. Python file objects have no dedicated end-of-file test, but readline() returns an empty string only when the end of the file has been reached (a blank line in the file is returned as '\n', not as an empty string). We can therefore use an empty string to signify the end of the file. The revised code is shown below:
In [ ]: |
|
Out [ ]: | 1 x 1 = 1 1 x 2 = 2 2 x 2 = 4 1 x 3 = 3 2 x 3 = 6 3 x 3 = 9 1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 1 x 6 = 6 2 x 6 = 12 3 x 6 = 18 4 x 6 = 24 5 x 6 = 30 6 x 6 = 36 1 x 7 = 7 2 x 7 = 14 3 x 7 = 21 4 x 7 = 28 5 x 7 = 35 6 x 7 = 42 7 x 7 = 49 1 x 8 = 8 2 x 8 = 16 3 x 8 = 24 4 x 8 = 32 5 x 8 = 40 6 x 8 = 48 7 x 8 = 56 8 x 8 = 64 1 x 9 = 9 2 x 9 = 18 3 x 9 = 27 4 x 9 = 36 5 x 9 = 45 6 x 9 = 54 7 x 9 = 63 8 x 9 = 72 9 x 9 = 81 |
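The loop described above might be sketched as follows; the scratch path and the list lines_read are hypothetical names introduced for illustration.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

lines_read = []
f = open(path, 'r')
line = f.readline()
while line != '':             # an empty string marks the end of the file
    lines_read.append(line)
    print(line, end='')
    line = f.readline()
f.close()
```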
The code above does not look so neat. In fact, since the text file is treated as an iterator in Python, with one item for each line, the above code can be simply written as follows:
f = open('./my9x9table0.txt', 'r')
for ln in f:
    print(ln, end = '')
f.close()
The output is the same as above.
Using a context manager through the with statement, the code can be further simplified as follows:
with open('./my9x9table0.txt', 'r') as f:
    for ln in f:
        print(ln, end='') # keyword argument end is set empty because ln already has newline in it
Considering the fact that a text file is an iterator, the built-in function next(iterator) can be used to iterate the file line by line. However, it would raise a StopIteration error if it reached the end of the file. The following example shows how to use next(iterator) to read and print the entire multiplication table:
In [ ]: |
|
Out [ ]: | 1 x 1 = 1 1 x 2 = 2 2 x 2 = 4 1 x 3 = 3 2 x 3 = 6 3 x 3 = 9 1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 1 x 6 = 6 2 x 6 = 12 3 x 6 = 18 4 x 6 = 24 5 x 6 = 30 6 x 6 = 36 1 x 7 = 7 2 x 7 = 14 3 x 7 = 21 4 x 7 = 28 5 x 7 = 35 6 x 7 = 42 7 x 7 = 49 1 x 8 = 8 2 x 8 = 16 3 x 8 = 24 4 x 8 = 32 5 x 8 = 40 6 x 8 = 48 7 x 8 = 56 8 x 8 = 64 1 x 9 = 9 2 x 9 = 18 3 x 9 = 27 4 x 9 = 36 5 x 9 = 45 6 x 9 = 54 7 x 9 = 63 8 x 9 = 72 9 x 9 = 81 |
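A possible form of the next(iterator) loop is sketched below. The scratch path and the counter line_count are assumptions; the StopIteration exception is caught to end the loop cleanly.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

line_count = 0
with open(path, 'r') as f:
    while True:
        try:
            ln = next(f)          # the next line of the file
        except StopIteration:     # raised once the file is exhausted
            break
        print(ln, end='')
        line_count += 1
```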
The third method for reading data from a file is readlines([sizehint]), where the optional sizehint, if provided, should be an integer hinting at the amount of data to be read. As the name implies, it reads multiple lines into a Python list until the end of the file, or as much as defined by sizehint, if the argument is provided. For example, if the total amount of data in the first n lines is less than sizehint, but the total for the first n + 1 lines is greater than sizehint, then the method will read n + 1 lines. So it always reads whole lines rather than partial ones, in contrast to the readline([size]) method.
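The sketch below calls readlines() with and without a hint. The scratch path and the hint of 20 are assumptions chosen for illustration; with a hint, reading stops once the running total of characters passes the hint, but never in the middle of a line.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r') as f:
    all_lines = f.readlines()      # every line, as a list of strings

with open(path, 'r') as f:
    some_lines = f.readlines(20)   # whole lines only, roughly 20 characters

print(len(all_lines), len(some_lines))
```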
Sometimes, we might like to read from a particular portion of a file, just like we want to start reading a book from a specific page. How can we do that in Python?
Imagine there is a pointer indicating where the reading will start in a file. In Python, several methods can be used to adjust the pointer.
The first method is f.tell(), which returns the current position of the pointer, measured as the number of bytes from the beginning of the file, as shown in the following example:
In [ ]: |
|
Out [ ]: | 0: 1 x 1 = 1 15: 1 x 2 = 2 2 x 2 = 4 43: 1 x 3 = 3 2 x 3 = 6 3 x 3 = 9 84: 1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16 138: 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 205: 1 x 6 = 6 2 x 6 = 12 3 x 6 = 18 4 x 6 = 24 5 x 6 = 30 6 x 6 = 36 285: 1 x 7 = 7 2 x 7 = 14 3 x 7 = 21 4 x 7 = 28 5 x 7 = 35 6 x 7 = 42 7 x 7 = 49 378: 1 x 8 = 8 2 x 8 = 16 3 x 8 = 24 4 x 8 = 32 5 x 8 = 40 6 x 8 = 48 7 x 8 = 56 8 x 8 = 64 484: 1 x 9 = 9 2 x 9 = 18 3 x 9 = 27 4 x 9 = 36 5 x 9 = 45 6 x 9 = 54 7 x 9 = 63 8 x 9 = 72 9 x 9 = 81 603: |
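As a minimal sketch, the starting position of each line can be recorded by calling f.tell() before every readline(). The scratch path and the list offsets are hypothetical, and the positions printed depend on how the file was written (spacing and line endings), so they will not necessarily match the numbers shown above.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

offsets = []                       # where each line starts
with open(path, 'r') as f:
    while True:
        position = f.tell()        # pointer position before reading
        line = f.readline()
        if line == '':             # end of file
            break
        offsets.append(position)
        print(f'{position}: {line}', end='')
    print(f'{f.tell()}:')          # position at the end of the file
```

Note that readline() is used here rather than a for loop over the file, because f.tell() cannot be called on a text file while it is being iterated with next().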
The output above shows where each line starts. For example, the end of the file is at 603. So with this information, the code using the readline([size]) method to read the multiplication table previously given can be easily revised to the following:
f = open('./my9x9table0.txt', 'r')
while f.tell() != 603: # we use the read location to identify if it has reached the end of the file
    line = f.readline()
    print(line, end = '')
f.close()
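What if we want to read from a specific point in the file? To do that, we need to move the pointer to that point. That brings us to the second method of adjusting the pointer, which is f.seek(offset, start), in which offset is how far to move the pointer relative to the reference point given by start. The default value for start is 0, which means the beginning of the file; 1 means the current position of the pointer, and 2 means the end of the file. For text files, when start is 0, the offset should be 0 or a value previously returned by f.tell(), and only an offset of 0 is allowed when start is 1 or 2.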
So suppose we want to read the line of the multiplication table 138 bytes from the beginning of the file. We would have to move the pointer to 138 first. From 0 at the beginning, the offset would be 138 as well. The code is shown below:
In [ ]: |
|
Out [ ]: | 138: 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 |
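A sketch of jumping straight to the fifth line is shown below. Instead of hard-coding 138, it looks up the position with f.tell() first, since the exact value depends on how the file was written; the scratch path and variable names are assumptions.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r') as f:
    for _ in range(4):             # skip the first four lines
        f.readline()
    fifth_start = f.tell()         # position where line 5 begins

with open(path, 'r') as f:
    f.seek(fifth_start)            # offset counted from the beginning
    fifth_line = f.readline()

print(f'{fifth_start}: {fifth_line}', end='')
```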
Update Existing Content of a Text File
In a text file, how do we replace an existing line with something else? To achieve this, we need to take the following steps:
- 1. Open the file in r+ mode
- 2. Find out the position of the line
- 3. Move the file pointer to that position using f.seek(offset, start)
- 4. Write whatever you want to the file, which will replace the original content on that line
An example of such an update is shown below:
In [ ]: |
|
Out [ ]: | 138: 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 |
The updated content of the file is shown below:
1 x 1 = 1
1 x 2 = 2 2 x 2 = 4
1 x 3 = 3 2 x 3 = 6 3 x 3 = 9
1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16
1 x 5 = 5 2 x 5 = 1update this line
= 20 5 x 5 = 25
Note that if there is less new content than the original, only part of the original is replaced.
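The four steps listed above can be sketched as follows. The scratch path is hypothetical, and the spacing of the table in this sketch may differ from the file shown above, so the exact point at which the original text resumes may differ as well.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r+') as f:        # step 1: open for reading and writing
    for _ in range(4):
        f.readline()
    position = f.tell()            # step 2: where line 5 starts
    f.seek(position)               # step 3: move the pointer there
    f.write('update this line')    # step 4: overwrite in place

with open(path, 'r') as f:
    updated_lines = f.readlines()

print(updated_lines[4], end='')
```

Because the new text is shorter than the original line, the rest of line 5 is left as it was and the number of lines in the file does not change.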
We can also replace a single original line with multiple lines, as shown below:
In [ ]: |
|
Out [ ]: | 138: 1 x 5 = 5 2 x 5 = 10 3 x 5 = 15 4 x 5 = 20 5 x 5 = 25 |
The updated file is shown below:
1 x 1 = 1
1 x 2 = 2 2 x 2 = 4
1 x 3 = 3 2 x 3 = 6 3 x 3 = 9
1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16
1 x 5 = 5 2 x 5 = 1we write a line at the current position
we write another line below
we add third line below the two lines already written
4 3 x 7 = 21 4 x 7 = 28 5 x 7 = 35 6 x 7 = 42 7 x 7 = 49
1 x 8 = 8 2 x 8 = 16 3 x 8 = 24 4 x 8 = 32 5 x 8 = 40 6 x 8 = 48 7 x 8 = 56 8 x 8 = 64
1 x 9 = 9 2 x 9 = 18 3 x 9 = 27 4 x 9 = 36 5 x 9 = 45 6 x 9 = 54 7 x 9 = 63 8 x 9 = 72 9 x 9 = 81
However, if the total size of the new written data is longer than the line being replaced, part of the line or lines under will be overwritten. Therefore, if you want to replace a specific line of the text file exactly, you will need to write just enough to cover the existing data—no more, no less.
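When the replacement text does not have the same length as the original line, a different technique from in-place overwriting is often easier: read all lines into a list, replace the item, and write the list back. The sketch below illustrates that alternative; it is not the method used in the examples above, and the scratch path and replacement text are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r') as f:
    lines = f.readlines()          # the whole file as a list of lines

lines[4] = 'this text replaces line 5, whatever its length\n'

with open(path, 'w') as f:         # rewrite the file from the list
    f.writelines(lines)

with open(path, 'r') as f:
    result = f.readlines()

print(''.join(result), end='')
```

This approach reads the whole file into memory, so it suits small files such as the ones used in this chapter.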
Deleting Portion of a Text File
To delete a portion of existing data from a file, you will need to use the f.truncate([size]) method. If the optional size argument is given, the file will be cut down to that size in bytes. Otherwise, the file will be truncated at the current file position. A file freshly opened in w or w+ mode has already been emptied by the open call itself, whereas a file freshly opened in a or a+ mode has its position at the end of the file, so calling f.truncate() without an argument removes nothing. Please note that to truncate a file to a given size greater than 0, the file must be opened in a mode that allows writing but does not empty the file on opening, such as a, a+, or r+.
f = open('./my9x9table0.txt', 'a')
f.truncate(399)
f.close()
The resulting content of the file is shown below:
1 x 1 = 1
1 x 2 = 2 2 x 2 = 4
1 x 3 = 3 2 x 3 = 6 3 x 3 = 9
1 x 4 = 4 2 x 4 = 8 3 x 4 = 12 4 x 4 = 16
1 x 5 = 5 2 x 5 = 1we write a line at the current position
we write another line below
we add third line below the two lines already written
4 3 x 7 = 21 4 x 7 = 28 5 x 7 = 35 6 x 7 = 42 7 x 7 = 49
1 x 8 = 8 2 x 8 = 16 3 x 8 = 24 4 x 8 = 32 5 x 8 = 40 6 x 8 = 48 7 x 8 = 56
Please note that only part of the content in the file is left.
If the file is opened in w or w+ mode, the file will be truncated to a size of 0 regardless.
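As a minimal sketch of f.truncate(size), the example below keeps only the first three lines of the table. The cut position is found with f.tell() rather than hard-coded, and the scratch path and variable names are assumptions.

```python
import os
import tempfile

# Hypothetical scratch file holding the 9 x 9 table.
path = os.path.join(tempfile.mkdtemp(), 'my9x9table0.txt')
with open(path, 'w') as f:
    for i in range(1, 10):
        row = ''.join('{:1d} x {:1d} = {:2d} '.format(j, i, i * j)
                      for j in range(1, i + 1))
        f.write(row + '\n')

with open(path, 'r') as f:
    for _ in range(3):
        f.readline()
    cut = f.tell()                 # position just after line 3

with open(path, 'a') as f:         # a mode does not empty the file on opening
    f.truncate(cut)                # everything after line 3 is removed

with open(path, 'r') as f:
    remaining = f.readlines()

print(''.join(remaining), end='')
```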
With all we have learned so far, we are ready to design and code a program to analyze an article stored in a text file. The program is shown in Table 5-8.
The problem | Different people have different styles when writing articles. These styles may include the words, phrases, and even sentences used most often in their writing. In this case study, we will develop a program that analyzes an article stored as a text file to create a list of the words used most often in the article. |
The analysis and design | To analyze the article, we need to read the file into memory, build a list of words used in the article, then count how many times each word appears in the article. Because the file needs to be read line by line, we will read, analyze, and count the words in each line. How can we store the result containing the unique words and the number of times each word appeared in the article? Recall what we learned about dictionaries: each unique word can be used as a key, and the number of times the word appears can be the value. We then just need to determine the words used most often in the article. The best way to do that would be to sort the items of the dictionary based on the values (instead of keys) in descending order, then take the first 10 items as the result. The algorithm is as follows: 1. Prepare by creating a list of punctuation marks and a list of nonessential words that can be ignored. 2. Initialize by setting the word counter to 0 and w_dict = {}. 3. Read the next line from the article into memory. 4. Build a list of all words within the line: a. replace all the punctuation marks with whitespace; b. split the line at whitespace to build a list of words; c. remove all nonessential words from the list. 5. Add the length of the list to the word counter, then for each word in the list: a. if the word is already in w_dict, increase its value by 1; b. otherwise, add the dictionary item (word: 1) to w_dict. 6. Repeat steps 3–5 until there are no remaining lines in the file. 7. Sort w_dict based on the values in descending order. 8. Print out the first 10 items to show the words used most often by the article’s author. Please note that this is just an example of programming with text files and Python dictionaries. The underlying theory about writing style may not be sound. The words that appear most often in an article may also be relevant to the topics covered by the article. |
The code |
|
The result | The article has 926 words in total The number of unique words used is 467 word "will" used 20 times. word "be" used 13 times. word "NATO" used 13 times. word "https" used 13 times. word "research" used 10 times. word "interoperability" used 9 times. word "as" used 9 times. word "Information" used 8 times. word "standards" used 7 times. word "military" used 7 times. word "Interoperability" used 7 times. word "for" used 7 times. word "www" used 7 times. word "such" used 6 times. word "This" used 5 times. word "this" used 4 times. word "Canada" used 4 times. word "essay" used 4 times. word "exchange" used 4 times. word "MIP" used 4 times. |
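The code cell of the table is not reproduced above. The following is a minimal sketch of the algorithm described in the analysis and design, not the original program. The file name article.txt, its two-line sample content, and the set of ignored words are all hypothetical; matching ignored words case-insensitively is also an assumption of this sketch.

```python
import os
import string
import tempfile

# Hypothetical sample article, written to a scratch file.
article_path = os.path.join(tempfile.mkdtemp(), 'article.txt')
with open(article_path, 'w') as f:
    f.write('The cat sat on the mat.\nThe dog sat, too; the cat left.\n')

ignored = {'the', 'on', 'a', 'an', 'too'}   # hypothetical nonessential words
word_count = 0
w_dict = {}

with open(article_path, 'r') as f:
    for line in f:                            # read line by line
        for mark in string.punctuation:       # punctuation -> whitespace
            line = line.replace(mark, ' ')
        words = [w for w in line.split() if w.lower() not in ignored]
        word_count += len(words)
        for w in words:                       # count each word
            w_dict[w] = w_dict.get(w, 0) + 1

# Sort the items by value, largest first, and keep the first 10.
ranked = sorted(w_dict.items(), key=lambda item: item[1], reverse=True)

print(f'The article has {word_count} words in total')
print(f'The number of unique words used is {len(w_dict)}')
for word, times in ranked[:10]:
    print(f'word "{word}" used {times} times.')
```

Here w_dict.get(w, 0) + 1 combines the two cases of step 5: it adds a new word with a count of 1 or increases the count of a word already in the dictionary.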
Chapter Summary
- • Computer intelligence is achieved by computing and information processing.
- • Both computing and information processing involve the manipulation of data.
- • Simple types of data such as integers, floats, characters, and bools are the fundamental elements for complex data types.
- • Unlike other languages, such as C, Python doesn’t have a separate data type for single characters. In Python, a single character is a string whose length is 1.
- • Python has two constants, True and False, defined as values of type bool. However, Python also treats 0, None, and empty string, empty list, empty tuple, empty set, and empty dictionary as False and treats all other values/objects as True.
- • Python has some special data values/constants that don’t belong to any ordinary data type. These special values/constants include None, NotImplemented, Ellipsis, and __debug__.
- • Strings, lists, tuples, sets, and dictionaries are the compound data types in Python.
- • A string is a sequence of characters (ASCII, Unicode, or another encoding standard).
- • A list is a sequence of data in a pair of square brackets [], such as [1, 2, 3, 4, 5].
- • A tuple is a sequence of data in a pair of parentheses (), such as (1, 2, 3, 4, 5).
- • The difference between a list and a tuple is that a list is mutable whereas a tuple is immutable, which means that once created, the data member of a tuple cannot be changed.
- • Characters in a string and members of a list or a tuple are indexed from 0 to n − 1, where n is the length of the string, list, or tuple.
- • The character at place j of string s can be accessed using s[j].
- • Similarly, a data member at place j of a list or tuple x can be accessed using x[j].
- • Strings, lists, and tuples are collectively called sequences.
- • A slice of sequence (string/list/tuple) x can be taken using x[i:j], in which i specifies where the slice starts and j specifies that the slice should end right before location j.
- • A number of operators and functions are available for constructing and manipulating strings, lists, or tuples.
- • String, list, and tuple objects also have a number of methods available for constructing and manipulating strings, lists, or tuples.
- • Some operators, functions, and methods are common for strings, lists, and tuples.
- • A set is a collection of unique data enclosed by a pair of curly brackets {}, such as {1, 2, 3, 5}.
- • Members of a set s are unordered, which means that they cannot be accessed using the notation s[j], for example.
- • A set has some special functions and methods that differ from those of other compound data types in Python.
- • A dictionary is a collection of key: value pairs enclosed by a pair of curly brackets {}, such as {'one':1, 'two':2, 'three':3, 'five':5}.
- • Members of a dictionary d are not indexed by position, which means that in a dictionary, there is no such thing as a member at location j, for example.
- • However, the value of a dictionary d can be accessed using the key associated with the value with the notation d[k], which refers to the value whose associated key is k.
- • Some special methods are defined for the operations of dictionaries.
- • Files are important for storing data and information permanently.
- • Files include text files and binary files.
- • The basic operations of files include create, read, write, append, and expand.
- • A new file can be created when opened with the w or x flag, if the file doesn’t already exist. Opening a file with the w flag will overwrite the existing content of the file.
- • To prevent data already in a file from being overwritten, open a file with the a flag or x flag. The a flag will open the file for appending new data to the end of the existing content of the file, while the x flag will not open the file if it already exists.
- • Open a file with the t flag to indicate that the file is a text file.
- • Open a file with the b flag to indicate that the file is a binary file.
- • After reading or writing a file, use the close() file object method to close the file.
Exercises
- 1. Mentally run the following code blocks and write down the output of each code block.
- a.
course = 'comp218 - introduction to programming in Python'
print(f'The length of \'{course}\' is {len(course)}')
- b.
course = 'comp218 - introduction to programming in Python'
print(f'The length of \'{course[10:22]}\' is {len(course[10:22])}')
- c.
ls = list(range(9))
print(ls[2:5])
- d.
asc = {chr(c) for c in range(ord('A'), ord('Z')+1)}
print(asc)
- e.
l0 = [i*2+1 for i in range(10)]
print(l0[2])
- f.
combo = [year + str(month+1) for year in ['2015', '2016'] for month in range(6)]
print(combo)
- g.
s0 = 'Python '
s1 = 'is my language!'
print(s0+s1)
Projects
- 1. Write a program that reads a text from a user, then counts and displays how many words and how many alphanumeric letters are in the text.
- 2. Write a program that
- a. reads a series of numbers that are separated by whitespace and uses a new line to end the input, then converts the numbers in the input string and puts them into a list.
- b. sorts the numbers in the list in descending order, using the sort() list object method.
- c. sorts the numbers in the list in descending order, using the Python built-in function sorted().
- d. sorts the numbers in the list in ascending order using your own code, without using the sort() method or sorted() function.
- 3. Sorting is a very important operation in computing and information processing because it is much easier to find a particular item (a number or a word) from a large collection of items if the items have been sorted in some manner. In computer science, many algorithms have been developed, among which selection sort, bubble sort, insertion sort, merge sort, quick sort, and heap sort are the fundamental ones. For this project, search the internet for articles about these sorting algorithms. Choose one to sort a list of integers.
- 4. Every course offered at universities has a course number and a title. For this project, write an application that uses a dictionary to save course information, allows users to add a course into the dictionary, and allows a user to get the title of a course for a given course number. The application should perform the following functions:
- a. Get a course number and name from a user and add an item, with the course number as key and the name as value, to the dictionary if the course doesn’t already exist in the dictionary.
- b. Get a course number from a user, then find out the name of the course.
- c. Display a list of all the courses in the dictionary showing the course numbers as well as names.
- d. Quit the application.
Hint: You will need a top-level while loop, which displays a menu showing the four options then acts accordingly.
- 5. This project is about text analysis. Find a news article on the internet, analyze the content, and generate and display some statistical data from the article. The detailed requirements are as follows:
- a. Find a news article on the internet and save it as a text file on your computer.
- b. Have your program build a list of words in the article while reading the news content from the file.
- c. Generate and display the following statistics of the article:
- i. the total number of words in the article
- ii. a list of unique words
- iii. the frequency of each unique word in the article
- iv. a short list of words that represent the essence of the article
- v. a table with the above data nicely presented
- 6. Cryptography is the study of theory and technology for the protection of confidential documents in transmission or storage. It involves both encryption and decryption. In any cryptographic scheme, encryption is the process of converting plaintext to ciphertext according to a given algorithm using an encryption key, whereas decryption is the process of converting encrypted text (ciphertext) back to plaintext according to a given algorithm using a decryption key. If the encryption key and decryption key are the same in a cryptographic scheme, the scheme is a symmetric cryptographic scheme; if the two keys are different, the scheme is an asymmetrical cryptographic scheme.
Among the many cryptographic schemes, substitution is a classic one, though the scheme is prone to frequency analysis attack. Write a program that can
- a. automatically generate a substitution key and add it to a key list stored in a file.
- b. display the substitution keys in the file.
- c. allow the user to choose a key in the list and encrypt some text taken from the user.
- d. allow the user to choose a key to decrypt an encrypted text taken from the user.
- e. allow the user to choose a key, encrypt the content of a text file, and save the encrypted content into a different file.
- f. allow a user to choose a key, decrypt the encrypted content in a file, and display the plaintext decrypted content.