Why Learn Python for Data Processing

Python is one of the most popular languages in the world. It’s used in a lot of different fields, like web services, automation, data science, managing computer infrastructure, and artificial intelligence and machine learning.

Its readable and concise syntax makes it a great option for teaching students their first programming language, but under the façade of an easy and amicable language, there’s a huge amount of power.

Python is easy to learn and to use, sure, but it’s also capable of fantastic feats in demanding environments like video games, banking services, healthcare, or state-of-the-art scientific research. 

Python is particularly well regarded for data processing, for several reasons.

A quick but insightful way of showing the power of Python for data processing is to present how easy it is to work with common files. Let’s imagine that we have a text file with one number per line, and we want to calculate their average and store it in a new file.

example.txt
---
5
4
3
7

We can read the file using a with clause, which will automatically close the file when the block is finished. The file is opened as text by default, which allows us to read it line by line by iterating through it.

with open('example.txt') as file:
    numbers = [int(line) for line in file]

The list numbers is built by processing the file line by line and transforming each line into an integer, since the file is read as text. This structure, with a loop between brackets, is called a list comprehension in Python and allows us to generate lists in an easy and readable way.
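As a small illustration of what the list comprehension does, the sketch below compares it with an explicit loop. The list lines is a stand-in for the contents of example.txt, introduced here only so the snippet runs without the file.

```python
# Stand-in for the lines of example.txt, as they would be read from the file.
lines = ['5\n', '4\n', '3\n', '7\n']

# List comprehension, as used in the text. int() ignores the trailing newline.
numbers = [int(line) for line in lines]

# Equivalent explicit loop, shown for comparison.
numbers_loop = []
for line in lines:
    numbers_loop.append(int(line))

print(numbers)       # [5, 4, 3, 7]
print(numbers_loop)  # [5, 4, 3, 7]
```

Both forms produce the same list; the comprehension simply expresses it in a single line.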

The average can be calculated by adding all numbers and dividing by the number of them, the length of the list.

average = sum(numbers) / len(numbers)

Finally, we store the result in a new file. We use the same with clause, but this time opening the file in write mode by adding 'w'. The file is also written in text format.

with open('result.txt', 'w') as file:
    file.write(f'Average: {average}')

The f-string works as a template string: the variable average, placed between curly brackets, is replaced by its value.

And that’s it. Five lines of code that read one file and write another, transform the input from text to integers, and perform the calculation. All the code is very easy to follow.
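For convenience, the steps above can be put together into one runnable sketch. To avoid touching real files, this version works in a temporary directory and recreates the sample input itself; the tempfile and pathlib usage is an addition for illustration and not part of the five lines discussed.

```python
import tempfile
from pathlib import Path

# Work in a temporary directory so the sketch does not touch real files.
workdir = Path(tempfile.mkdtemp())
input_path = workdir / 'example.txt'
output_path = workdir / 'result.txt'

# Recreate the sample input from the text.
input_path.write_text('5\n4\n3\n7\n')

# Read the numbers, one per line.
with open(input_path) as file:
    numbers = [int(line) for line in file]

# Calculate the average.
average = sum(numbers) / len(numbers)

# Store the result in a new file.
with open(output_path, 'w') as file:
    file.write(f'Average: {average}')

print(output_path.read_text())  # Average: 4.75
```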

This method of reading from a text input, performing some calculations, and dumping the results into a text output is very useful in data processing, as several steps can be chained in order to build complex pipelines. The intermediate results are stored, so in case of an error only the required steps need to be repeated, instead of starting from the beginning. Because reading and writing are so easy, it’s simple to add save points that avoid processing the same data multiple times.
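One possible way to sketch this idea is a small helper that skips a step when its output file already exists. The helper run_step, the file names, and the two example steps are all hypothetical, chosen only to illustrate the save-point approach.

```python
import tempfile
from pathlib import Path


def run_step(output_path, compute):
    """Return the saved result if it exists; otherwise compute and save it."""
    if output_path.exists():
        return output_path.read_text()
    result = compute()
    output_path.write_text(result)
    return result


# Temporary directory with the sample input from the text.
workdir = Path(tempfile.mkdtemp())
raw_path = workdir / 'raw.txt'
raw_path.write_text('5\n4\n3\n7\n')


def double_numbers():
    # Step 1 (hypothetical): double each number, one result per line.
    values = [int(line) * 2 for line in raw_path.read_text().splitlines()]
    return '\n'.join(str(value) for value in values)


doubled_text = run_step(workdir / 'doubled.txt', double_numbers)


def average_doubled():
    # Step 2 (hypothetical): average the output of step 1.
    values = [int(line) for line in doubled_text.splitlines()]
    return str(sum(values) / len(values))


average_text = run_step(workdir / 'average.txt', average_doubled)
print(average_text)  # 9.5
```

If the second step fails, rerunning the script reuses doubled.txt instead of recomputing it. A real pipeline would also need a way to invalidate stale intermediate files, which this sketch leaves out.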

This barely scratches the surface, of course. Python’s standard library includes a module to read and write CSV files, and there are a lot of third-party options to read and write other formats like HTML, PDF, or even Word or Excel.
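As a brief sketch of the standard library csv module, the same average can be calculated from CSV data. The column names and the in-memory io.StringIO objects used in place of real files are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-in for a CSV file; the column names are made up.
data = io.StringIO('name,value\na,5\nb,4\nc,3\nd,7\n')

# DictReader maps each row to a dictionary keyed by the header row.
reader = csv.DictReader(data)
values = [int(row['value']) for row in reader]
average = sum(values) / len(values)

# Write the result as CSV into another in-memory buffer.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(['metric', 'result'])
writer.writerow(['average', average])

print(output.getvalue())
```

With real files, the csv documentation recommends opening them with newline='' so that line endings are handled by the module.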

There are also modules that present the information not only as text but also as different kinds of graphics, like the useful Matplotlib, and powerful data manipulation modules like Pandas to crunch the numbers and obtain insightful results.
