
This time we're going through sequence data types, and all the kickass things you can do with them. Actually, you've kind of already seen a sequence type: strings. Strings are a sequence of characters. Technically, in Python, strings are a special case and different from other sequence data types, but you can still do a lot of sequencey things to them. Once you've seen a few, I'll introduce the tuple data type.

Indexing

The most basic feature of sequences is the ability to access a specific item inside them. This is done like this:

>>> "Hello"[3]
'l'

A critical property of indexes is that they start at 0. "Hello"[0] is 'H'. "Hello"[1] is 'e'. This is actually pretty common in computing.

You can also, of course, index with a variable.

>>> i = 1
>>> "Hello"[i]
'e'

Exercise: make a program that asks the user for a string, and then a number, and prints out the character at that position in the string. (For extra fun, do it in one line.)

print(input("enter a string:")[int(input("enter a number:"))])

This might look hard to parse - indeed, it's bad enough that a serious programmer might do it on multiple lines just for readability's sake - so I'll dissect it for ya. Assuming I enter blah and 3, the code can be parsed like this:

  1. input("enter a string:") is replaced with the string entered, leaving: print('blah'[int(input("enter a number:"))])
  2. input("enter a number:") is replaced with the next string entered, leaving: print('blah'[int('3')])
  3. int('3') is replaced with 3, leaving: print('blah'[3])
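
If the one-liner still makes your eyes cross, here's a rough sketch of the same logic spread over several lines. I've hard-coded 'blah' and '3' in place of the two input calls so it runs without typing anything, and the variable names are just ones I made up for illustration.

```python
text = 'blah'            # stands in for input("enter a string:")
position = '3'           # stands in for input("enter a number:")
index = int(position)    # '3' (a string) becomes 3 (an int)
character = text[index]  # same as 'blah'[3]
print(character)         # h
```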

Negative index

You might've already thought to try this and figured out how it works, but what happens if you run "Hello"[-1]?

Negative indices start from the end. Note that this means they are not subject to zero-indexing, since -0 is the same as 0. 0 is the first element, 1 is the second element, -1 is the last element, -2 is the second-last. If you find this confusing, you're not alone :)
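
Here's a small sketch that might make negative indices less confusing. It just shows that a negative index and the matching "counted from the front" index land on the same character (the variable word is only an example value).

```python
word = "Hello"
print(word[-1])             # o  (last character)
print(word[len(word) - 1])  # o  (same character, counted from the front)
print(word[-2])             # l  (second-last character)
print(word[-len(word)])     # H  (the most negative valid index is the first character)
```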

If you try to access an element that doesn't exist, such as 'blah'[10], you'll get an error. This is a good time to introduce the handy len function that can help you avoid this:

>>> len("hi")
2

Exercise: modify the string indexing program so that it won't crash if the user asks for an invalid index, but print a message instead.
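
One possible solution, as a sketch: I've hard-coded the two values that would normally come from input so the snippet runs on its own, and I'm assuming the user typed something int can convert (a non-numeric entry would still crash). The message text is my own invention.

```python
string = "blah"   # stands in for input("enter a string:")
index = 10        # stands in for int(input("enter a number:"))

# valid indices run from -len(string) up to len(string) - 1
if index < len(string) and index >= -len(string):
    result = string[index]
else:
    result = "there's no character at that position"
print(result)
```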

Slicing

Another super cool feature of sequences is slicing: the ability to index a range of elements at a time.

>>> 'pizza'[1:3]
'iz'

This gives us a string that starts at position 1 and stops just before position 3, so we get characters #1 and #2. (A slice always runs from the start position, inclusive, up to but not including the end position.)

If you omit one or both numbers of the slice, it goes to the beginning or end:

>>> 'pizza'[2:]
'zza'
>>> 'pizza'[:2]
'pi'

Note that 'pizza'[:-1] is 'pizz' while 'pizza'[:] is 'pizza'. An omitted start position is the same as 0, but an omitted end position is not the same as -1. -1 is the last item, so slicing up to -1 cuts it out.
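
A few more slices to experiment with, as a quick sketch. The last line shows something I haven't mentioned yet: unlike plain indexing, a slice doesn't raise an error when a position is past the end of the string.

```python
word = 'pizza'
print(word[:])            # pizza
print(word[0:len(word)])  # pizza (what the omitted positions default to)
print(word[:-1])          # pizz
print(word[-3:])          # zza
print(word[1:100])        # izza (out-of-range slice positions are tolerated)
```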

Slice step

Okay, this is a rather obscure feature, but I might as well demonstrate it while I'm talking about this. You can have a third number inside the slice brackets, which specifies the "step" size:

>>> 'abcabcabcabcabc'[::3]
'aaaaa'

This slices from the beginning (because start position is omitted) to the end (because end position is omitted), selecting only every third character. You can think of the step size as defaulting to 1.
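
Here's a short sketch of a few other step values. The negative step isn't something I covered above, but it's a common trick: with a step of -1 and both positions omitted, the slice walks backward through the whole string.

```python
word = 'pizza'
print(word[::2])   # pza   (positions 0, 2, 4)
print(word[1::2])  # iz    (positions 1, 3)
print(word[::-1])  # azzip (every character, back to front)
```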

Iteration

Jargon: iterate: to loop with a sequence and do something with each element inside it. It can be used with either "on" or "over" as a preposition.

Exercise: use your knowledge of loops and indexing to write a program that gets a string from the user and then prints out each character inside it on its own line. (I'm about to introduce an easier way of doing this, but I want you to see how it can be done without it.)

string = input("give me a string:")
index = 0
while index < len(string):
    print(string[index])
    index += 1

Note that I couldn't put the input("give me a string:") that defines the variable string inside the condition, like while index < len(input("give me a string:")):. If you tried to solve this yourself before looking at the solution (which you should have), you probably ran into this. The reason it doesn't work is that a while loop's condition is re-evaluated every time it's checked, since the loop has to know when to stop. So on every pass it would ask: is index < len(input("give me a string:"))? And to answer that, it would execute input("give me a string:") again, which means the user would be asked to enter a new string after every iteration of the loop. The solution is to execute input("give me a string:") once at the beginning and store the result in string, so that the condition only ever compares index against the length of that stored value.

for

One of the most important keywords related to sequences: for is an alternate loop construction that makes iterating on a sequence much easier:

for letter in input("enter a word:"):
    print(letter)

Note that with for, the expression that tells it the sequence to be iterated (in this case, the result of input("enter a word:")) is only evaluated once, and then it just internally runs the loop with letter set to each character in that string. So with for it's safe to put the input in the for line.

In general, in Python you should never have to iterate in the fashion I had you come up with before I told you about this, but many other languages require it (C, JavaScript in some situations), so it's a very good problem to have solved.
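
To see that the two styles really do the same work, here's a sketch that runs both over the same word and compares what they produce. Instead of printing, each loop builds up a string (manual and automatic are names I picked for this demo), which makes the results easy to compare.

```python
word = "hey"

# the manual way: keep an index and advance it ourselves
manual = ''
index = 0
while index < len(word):
    manual += word[index] + '\n'
    index += 1

# the for way: Python hands us each character in turn
automatic = ''
for letter in word:
    automatic += letter + '\n'

print(manual == automatic)  # True
```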

Another problem you can solve now: make a program that gets a string from the user, and then a letter, and determines whether the letter is in the string.

string = input("give me a string:")
char_to_find = input("give me a single character:")
found = False
for char in string:
    if char == char_to_find:
        found = True
if found:
    print(char_to_find, 'is in', string)
else:
    print(char_to_find, 'is not in', string)

in

Yes, the problem I just made you solve was another unnecessary one :P You can use in outside of the context of for to test whether something is inside a sequence:

>>> 'e' in 'Hello'
True
>>> 'x' in 'Hello'
False

Well isn't that neat! I just wanted you to solve this problem the hard way as an intellectual exercise, and because many other languages don't have this keyword or anything equivalent to it. (C doesn't; Go only has it for strings, but not for other sequence types.)

Additionally, on strings, in works with multi-character substrings. Check this out:

>>> 'He' in 'Hello'
True
>>> 'eH' in 'Hello'
False

Testing whether a multi-character string is inside of another string manually is a nightmare compared to this. (If you want, take a stab at it.)
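
If you'd like something to compare your attempt against, here's one way it could be done, as a sketch rather than the definitive approach. It uses slicing to compare each same-length chunk of the big string against the one we're looking for. The names haystack, needle, and start are mine.

```python
haystack = 'Hello'
needle = 'll'

found = False
start = 0
# try every starting position where needle could still fit
while start + len(needle) <= len(haystack) and not found:
    if haystack[start:start + len(needle)] == needle:
        found = True
    start += 1

print(found)  # True, the same answer as: needle in haystack
```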

not in works the way you'd expect: x not in y is True exactly when x in y is False. That's slightly surprising, since by the usual rules you'd expect to have to write not (x in y) (which does also work). After all, x not > y is a syntax error. Basically, it's like not in is an operator in its own right.
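
A quick sketch showing that both spellings give the same answer:

```python
print('x' not in 'Hello')    # True
print(not ('x' in 'Hello'))  # True
print('e' not in 'Hello')    # False
```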

break and continue

Now that we're iterating on stuff, it's a very good time to introduce two handy keywords used in loops: the break statement, which exits the loop immediately even if its condition is still true, and continue, which skips the rest of the current iteration, and continues from the top of the loop. Here's a demo of both:

number = 0
while number < 10:
    number += 1
    if number == 5: # skip 5 for no reason
        continue
    print("the next number is", number)
    if input("want to see another? (y/n)") == 'n':
        break
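
Both keywords work in for loops too. Here's a sketch that doesn't need any user input: it copies characters into kept (a name I made up), skipping every 'l' and stopping for good at the first space.

```python
kept = ''
for char in 'hello world':
    if char == 'l':  # skip every 'l' and move on to the next character
        continue
    if char == ' ':  # stop the whole loop at the first space
        break
    kept += char
print(kept)  # heo
```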

Tuples

Tuples are a more general sequence data type. They store an arbitrary number of arbitrary values. The syntax for tuple literals is to enclose them in parentheses and separate elements by commas:

>>> nums = (6, 1, 4)
>>> nums
(6, 1, 4)
>>> nums[0]
6
>>> for num in nums: print(num)
...
6
1
4
>>> greetings = ("Hi", "Hello", "Good day", "Salutations")
>>> greetings[2]
'Good day'
>>> for greeting in greetings: print(greeting)
...
Hi
Hello
Good day
Salutations
>>> print(greetings[:2])
('Hi', 'Hello')

As you can see, tuples are subject to indexing, slicing, and the rest of the bag the same way strings are, but they aren't limited to holding strings; they can hold ints, floats, strings, Booleans, or any other type of value.

Warning! Declaring a tuple with only a single element isn't done the way you might expect! nums = (5) does not make a tuple; since parentheses are also used to group parts of mathematical or logical expressions, that statement would just set nums to 5. Python only interprets parentheses as enclosing a tuple if there's at least one comma inside (or if there's nothing inside). To set nums to a one-element tuple, you could do nums = (5,) - a trailing comma is always permitted in a tuple, and here it's what makes it one. Actually, you can even just write nums = 5,.
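
Here's a small sketch you can run to see the difference for yourself. The variable names are only for illustration.

```python
not_a_tuple = (5)       # just the int 5 in parentheses
one_element = (5,)      # the comma makes it a tuple
also_one_element = 5,   # the parentheses are optional
empty = ()              # empty parentheses make an empty tuple

print(not_a_tuple)                      # 5
print(one_element)                      # (5,)
print(one_element == also_one_element)  # True
print(len(one_element), len(empty))     # 1 0
```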

You can also add tuples together:

>>> nums = (1, 2, 3)
>>> more_nums = (4, 5, 6)
>>> nums + more_nums
(1, 2, 3, 4, 5, 6)

Something I struggled with when learning Python was trying to add a single element to a tuple like: nums += 5. This would raise a TypeError, saying can only concatenate tuple (not "int") to tuple. Remember, since var1 += var2 is shorthand for var1 = var1 + var2, nums += 5 is saying nums = nums + 5. To add something to a tuple, the new addend has to itself be made into a tuple, like: nums += (5,).
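
As a sketch, the same one-element-tuple trick works at either end:

```python
nums = (1, 2, 3)
nums += (5,)        # append one element
nums = (0,) + nums  # add one element at the front
print(nums)         # (0, 1, 2, 3, 5)
```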

There is one difference in the way the in operator works: with "real" sequences, like tuples, in only tests if one of the members of the sequence after in is equal to the element before in. With strings, in does "in a row" checking rather than "is a member" checking, so "he" in 'hello' evaluates to True, but with tuples, ('h', 'e') in ('h', 'e', 'l', 'l', 'o') or (5, 3) in (5, 3, 6) evaluates to False, because none of the members of the tuple on the right is the tuple on the left. The reason for this behavior is that, as you may have guessed, you can have a tuple of tuples:

Nested tuples

>>> high_scores = (("Alice", 1260), ("Bob", 1135), ("Carl", 1390))
>>> for score in high_scores:
...   print(score[0], 'scored', score[1])
...
Alice scored 1260
Bob scored 1135
Carl scored 1390
>>> high_scores[1][0] # demonstrating double-indexing: high_scores[1] is ('Bob', 1135)
'Bob'

Isn't that cool! Each element in high_scores is a tuple that holds a name in position 0 and a score in position 1. ('Alice', 1260) in high_scores would evaluate to True. (Strings don't have the concept of nested sequences in the way tuples do, which is why strings get the "in a row" behavior for in instead of "is a member".)
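
Here's a sketch that puts the two behaviors of in side by side. The letters tuple is just an example value.

```python
letters = ('h', 'e', 'l', 'l', 'o')
high_scores = (("Alice", 1260), ("Bob", 1135), ("Carl", 1390))

print('he' in 'hello')                 # True: strings check "in a row"
print(('h', 'e') in letters)           # False: no single member is ('h', 'e')
print('h' in letters)                  # True: 'h' is a member
print(("Alice", 1260) in high_scores)  # True: that exact tuple is a member
print("Alice" in high_scores)          # False: in doesn't look inside the members
```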

This is also a good time to introduce a couple of minor features about line breaks.

Line continuation

When you need to break a statement across multiple lines, you're allowed to do so anywhere inside an unclosed pair of parentheses (the same goes for square brackets and curly braces):

names = (
    'Alice',
    'Bob',
    'Carl',
    'Dana',
    'Elijah',
    'Fiona',
)
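
The parentheses don't have to be a tuple's, either. As a sketch, wrapping any long expression in parentheses lets you break it across lines the same way:

```python
sentence = ("The " + "quick " + "brown " +
    "fox " + "jumps " + "over " +
    "the " + "lazy " + "dog")
print(sentence)  # The quick brown fox jumps over the lazy dog
```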

But if the line break isn't inside brackets of some kind, you need to use a backslash at the end of the line:

# This will raise a syntax error:
#sentence = "The " + "quick " + "brown " + "fox " +
#   "jumps"

# This works:
sentence = "The " + "quick " + "brown " + \
    "fox " + "jumps " + "over " + \
    "the " + "lazy " + "dog"

You can also put two string literals together without the +, and it will be assumed:

>>> print("hello" "friend")
hellofriend

I don't recommend using this though. I find it less clear than using + and it's at most 2 characters shorter, and most other languages don't have it, so it's not a good habit to build. (It also only works on string literals, not string variables.) I honestly wish it wasn't in the language. It made me have to include this section to explain it, which costs both my time and yours.

Inline blocks

So far, we've always put the block of an if, while, or similar keyword indented under the condition, but if it's only one line, you can actually do this:

>>> if True: print("logic has not been broken")
logic has not been broken

You can't nest them, though, even if they could theoretically all be on one line:

>>> for letter in "hi": if letter != 'h': print(letter)
  File "<stdin>", line 1
    for letter in "hi": if letter != 'h': print(letter)
                         ^
SyntaxError: invalid syntax

The most common time I use inline blocks is with break and continue.
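
For example, something along these lines (a sketch with made-up numbers): it adds up the values in a tuple, ignoring negatives and stopping at the first zero.

```python
total = 0
for number in (3, -1, 4, 0, 5):
    if number < 0: continue  # ignore negative numbers
    if number == 0: break    # a zero means stop
    total += number
print(total)  # 7
```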

Semicolons

You should also be aware of semicolons. You can put multiple unrelated statements on one line by using a semicolon:

>>> a = 5; print(a)
5

You generally shouldn't, though, because it's less readable to have multiple, semantically distinct instructions on one line.

Triple-quoted strings

Another thing I'll talk about while we're on the topic of line continuations: Triple-quoted strings, enclosed on both sides with """ or ''', are allowed to span multiple lines without a backslash.

message = """Incoming transmission:

Hi, I hacked Yujiri's website and replaced his original example string with this!

Plz don't point this out to him. I'm wondering how long it'll be before he notices.
Also, I don't want him to plug his security hole :P"""

These are often used when you need to store a big message in a string, like help text for a command-line tool.
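
Here's a sketch of what that might look like, for a hypothetical tool called greet that I made up for this example. It also shows that the line breaks you type inside the triple quotes become ordinary '\n' characters in the string.

```python
help_text = """usage: greet NAME

Prints a friendly greeting."""

# the same string written on one line with explicit newlines
same_text = "usage: greet NAME\n\nPrints a friendly greeting."
print(help_text == same_text)  # True
```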

Multiple assignment

Quick trick: it's possible to assign two variables to the same value in one line without a semicolon:

>>> a = b = 5
>>> print('a is', a, 'and b is also', b)
a is 5 and b is also 5

This feature isn't useful very often, but I should mention it.

Convention: capitalizing variable names

By convention, variable names in Python are all-lowercase, but there's an exception. Constants (variables that are meant to never change) are often written in all caps. I'm saying this because I'm going to use it in the following project, and don't want it to look weird.

Censorship simulator!

With that, you're ready to write a much more fun program than you did in the last chapter.

The government hires you to write a program to automatically censor messages that contain politically unacceptable speech. There's a predefined set of words you're searching for. Your program must accept multi-line input (a blank line signals the end of the message) and then output: the message with all lines containing dirty words removed; and some metadata for the overseer of the censorship department, including an account of which unacceptable words were found in the message (all listed on one line, without the parentheses you get when you print a tuple, but with commas placed appropriately), and the total number of characters that were removed.

Additionally, the filter should impose a message length limit of 280 characters (including the newlines, including the one at the end of the last non-blank line). If a message is longer than that, it should be cut off and terminated with "..." so that exactly the first 280 characters (including the added ...) of the message are outputted. In either case, a blank line should be printed between the message and the statistics.

This is supposed to be a fairly difficult assignment for someone with no programming experience outside of these three lessons. Give it some time. When I learned Python from the book that taught me, some of the end-of-chapter projects took me a few hours, but if you can solve a problem of this caliber on your own, then you're really catching on.

WORDS = ('free', 'liberty', 'tyrant', 'tyranny', 'oppress', 'rebel', 'revolt', 'revolution')
caught = ()
removed = 0
message = ''

while True:
    line = input() + '\n' # re-add the newline (which input leaves out) so we count it as a character
    # empty line signals end of input
    if line == '\n': break
    # since this is after we break out if the line was empty,
    # everything after this in the loop is dealing with a non-empty line.
    # next step: search for dirty words.
    censor = False
    for word in WORDS:
        if word in line:
            censor = True
            if word not in caught: caught += (word,)
    if censor:
        removed += len(line)
    else:
        message += line

# truncate the message
if len(message) > 280:
    # we have to re-add the newline that we cut off so there will still be
    # a blank line between it and the statistics. this means cutting off
    # 4 characters, not 3.
    print(message[:276] + '...\n')
    removed += len(message) - 280
else:
    print(message)

# statistics
words_caught_str = ''
for word in caught:
    words_caught_str += word
    if caught[-1] != word: # avoid putting a comma after the last word.
        words_caught_str += ', '
print("words found:", words_caught_str)
print("characters removed:", removed)
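
If the 276 in the truncation step looks suspicious, here's a quick sketch to check the arithmetic. It uses a made-up 301-character message (300 x's plus a newline) and confirms that keeping 276 characters and adding '...' and a newline comes out to exactly 280.

```python
message = 'x' * 300 + '\n'           # a made-up message that's too long
truncated = message[:276] + '...\n'  # 276 kept + 3 dots + 1 newline
print(len(truncated))                # 280
```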