# Data science is at the service of mathematics

So far, mathematics has made an enormous contribution to data science, machine learning and the entire field of AI. Now the tables have partially turned: mathematics is no longer just a means to an end for data science. Data science is also at the service of mathematics, in three areas:

• Topological data analysis investigates what can be learned about the geometric shape of a data set from a sample of its data points.
• In coding and number theory, data science helps to shed light on certain number-theoretic problems, to uncover connections or to find clues for proofs of conjectures, i.e. unproven mathematical statements.
• Symbolic computation, finally, deals with understanding and transforming formulas. ML is now able to solve even more extensive and complex mathematical tasks, for example symbolic integration or differential equations.

Data science can be used practically for studying relationships in coding and number theory – area number two. This article demonstrates the exploratory and programmatic approach with Python and points out hidden pitfalls.

In data science, data is usually analyzed in an exploratory way, i.e. in a step-by-step process: conclusions are drawn from the results, which in turn lead to new analyses. This is how you work your way toward a result, one step at a time. As a scripting language, Python is very well suited for this way of working.

Exploratory work requires tools that enable both the creation and execution of the Python scripts and the representation of the data as text output and graphics. Jupyter notebooks have established themselves here. They offer a web interface in which you can both program and display analysis results. Professional IDEs such as Visual Studio Code and PyCharm are also available for Python.

The info box "Trees as an example from today's everyday research" offers an insight into how data science can be incorporated into everyday mathematical research.

For so-called goal-means trees, decision trees and hierarchical organization charts, exploratory data analysis and visualization techniques have led to the surprising insight that these trees have to be supplemented by undesired side effects, unsafe variants and informal networks.

In mathematics, a tree is usually thought of as a graph `G=(V,E)` of nodes `V` and edges `E` with the peculiarity that each pair of nodes is connected by exactly one path and all paths have only one node in common: the root. The mathematician Lothar Collatz combined number theory and graph theory and constructed a tree: he connected a node labeled with a natural number `n` to the node `n/2` if `n` is even, and otherwise to the node `3n+1`, and continued this game indefinitely.

The Collatz conjecture says that, no matter which number `n` a node is labeled with, a path leads from it to the root node labeled `1`. With ML and data science, mathematicians were able to build trees with exorbitantly high numbers and identify a pattern: the Collatz tree has the structure of Hilbert's Hotel and owes this to the Skolem-Noether theorem.
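The rule described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the function names `collatz_step` and `collatz_path` and the `max_steps` safety limit are assumptions made for this example.

```python
def collatz_step(n: int) -> int:
    """Return n/2 for even n, otherwise 3n+1."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


def collatz_path(n: int, max_steps: int = 10_000) -> list[int]:
    """Follow the Collatz rule from n until the root 1 is reached.

    max_steps is a safety limit (an assumption of this sketch), because
    the conjecture that every path ends at 1 is unproven.
    """
    if n < 1:
        raise ValueError("n must be a natural number >= 1")
    path = [n]
    while n != 1:
        if len(path) > max_steps:
            raise RuntimeError("step limit reached before hitting 1")
        n = collatz_step(n)
        path.append(n)
    return path


print(collatz_path(6))            # [6, 3, 10, 5, 16, 8, 4, 2, 1]
print(len(collatz_path(27)) - 1)  # 111 steps from 27 down to 1
```

Because Python integers are not limited to 64 bits, the same function also works for much larger start values such as `10**18 + 7`.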

With the help of Mathematica, a program package for symbolic computing, it can be shown that the asymptotic density in the Collatz tree is determined by the fact that every further iteration of the two inverse Collatz functions `n -> 2n` and `n -> (n-1)/3` shows a higher periodicity. While retaining the branches, a binary tree is created in which no node is labeled with a number divisible by `2` or `3`. The iterative pruning of this binary tree can be thought of in terms of the Hilbert's Hotel paradox, in which a hotel with an infinite number of rooms wants to accommodate an infinite number of guests. The rooms of the hotel are repeatedly occupied by descendants from an even higher inverse Collatz iteration. A Python program can be used to confirm that this principle still works in the thousandth iteration, even with exorbitantly high numbers. The associated library can be found on GitHub.
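The two inverse functions can be used to grow the Collatz tree from the root outward. The following is a minimal sketch under stated assumptions, not the library mentioned above: the names `collatz_predecessors` and `collatz_tree_levels` are hypothetical, and the pruning to nodes not divisible by 2 or 3 is not reproduced here.

```python
def collatz_predecessors(n: int) -> list[int]:
    """Return the nodes that reach n in exactly one Collatz step."""
    preds = [2 * n]                  # inverse of n -> n/2
    # Inverse of n -> 3n+1: (n-1)/3 must be an odd natural number,
    # which is the case exactly when n leaves remainder 4 modulo 6.
    if n % 6 == 4:
        preds.append((n - 1) // 3)
    return preds


def collatz_tree_levels(depth: int) -> list[list[int]]:
    """Breadth-first levels of the inverse Collatz tree rooted at 1."""
    levels = [[1]]
    seen = {1}
    for _ in range(depth):
        next_level = []
        for node in levels[-1]:
            for pred in collatz_predecessors(node):
                if pred not in seen:  # skips the trivial cycle 4 -> 1
                    seen.add(pred)
                    next_level.append(pred)
        levels.append(next_level)
    return levels


for level in collatz_tree_levels(7):
    print(level)
# The last level printed is [128, 21, 20, 3].
```

Each level contains the numbers that need exactly that many Collatz steps to reach `1`, so the level sizes give a first impression of how quickly the tree branches.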

Linguistics professor Noam Chomsky found that all sentences in every language share a common grammatical tree structure. Thanks to supervised learning with CoreNLP, spaCy or UDPipe, treebanks (text corpora) with universal dependencies can be used to identify sentence structures in news texts in different languages. If you convert the trees into a semantic-political network, surprisingly meaningful follow-up questions can be generated that journalists have not yet thought of. This data science research into syntactic tree analyses is currently used to extract semantic relationships from Dutch newspaper articles. This report provides the first results.

Jan Kleinnijenhuis (Vrije Universiteit Amsterdam), Alissa Kleinnijenhuis (Stanford University), Mustafa G. Aydogan (University of California San Francisco)

In contrast to the programming of classic apps, data science applications do not focus on processing data row by row, but rather column by column. Set-oriented and data-oriented languages such as SQL and R are widespread. Functional programming, which allows a function to be applied to a whole collection of records, is also of great use in data science. This style of programming avoids implementing the same loops over and over again. Out of the box, Python offers only functional elements. However, extensions like NumPy and Pandas also introduce set-oriented features into the language. This combination is one of the secrets of Python's success.
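The difference between an explicit loop and the functional style can be illustrated with the standard library alone. This is a small sketch with made-up example data, and NumPy and Pandas are deliberately left out so that it runs without extra packages.

```python
from functools import reduce

numbers = range(1, 11)

# Loop version: explicit iteration and a mutable accumulator.
squares_loop = []
for n in numbers:
    if n % 2 == 1:
        squares_loop.append(n * n)

# Functional version: filter and map apply a function to the whole collection.
squares_functional = list(
    map(lambda n: n * n, filter(lambda n: n % 2 == 1, numbers))
)

# reduce folds the collection into a single value.
total = reduce(lambda acc, n: acc + n, squares_functional, 0)

print(squares_functional)  # [1, 9, 25, 49, 81]
print(total)               # 165
```

Both versions compute the same result. The functional one states what should happen to the collection and leaves the iteration itself to the language.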

In addition, Python allows the processing of arbitrarily large integers, and in relatively large quantities. The size of an integer is limited only by the underlying hardware, in practice the available memory. The language itself imposes no fixed upper bound on integer arithmetic. This is of great advantage for questions of coding and number theory, since some phenomena can only be observed with very large numbers. Research fields relating to divisibility, prime numbers, prime powers and elliptic curves benefit from this.
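A short sketch shows what this looks like in practice. The choice of the Mersenne prime 2^521 - 1 and of base 3 for the Fermat check are just illustrative assumptions.

```python
# 2**521 - 1 is a known Mersenne prime, far beyond any 64-bit integer.
p = 2**521 - 1

print(p.bit_length())    # 521
print(len(str(p)))       # 157 decimal digits

# Fermat's little theorem: a**(p-1) % p == 1 for prime p and a not divisible by p.
# The three-argument pow performs modular exponentiation efficiently.
print(pow(3, p - 1, p))  # 1
```

One pitfall is worth knowing: since Python 3.11, converting between `int` and `str` is limited to 4300 digits by default. The limit can be raised with `sys.set_int_max_str_digits`, and the arithmetic itself is not affected.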

Despite its set orientation, Python is not as far removed from the underlying hardware as SQL, for example. This makes Python a popular language for hackers in addition to its use in the data science environment. In contrast to Java, Python is scriptable and therefore allows an exploratory approach. At the same time, the language offers access to the low-level APIs of the machine it runs on. The variant CPython, which is implemented in C, is used by default.

To keep Python scriptable, the code is not translated to C, but interpreted. However, Python can call into C at any time. In addition, the Cython framework is available, which makes it possible to integrate individual code fragments developed in C into Python. Cython thus adds extensions to the language and allows you to compile and use your own C code. Frameworks such as TensorFlow and PyTorch also enable the training of machine learning models on graphics processing units (GPUs for short). In contrast to CPUs, GPUs enable highly parallel data processing. Overall, Python's closeness to the hardware makes it possible to implement very high-performance algorithms.
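Closeness to C also carries a pitfall for number theory. The sketch below uses the standard `ctypes` module, which exposes C data types to Python, and the values are chosen purely for illustration. It shows that fixed-width C integers silently wrap around, while Python integers do not.

```python
import ctypes

big = 2**32 + 5

# Python int: arbitrary precision, no overflow.
print(big)                          # 4294967301

# C unsigned 32-bit integer: ctypes does no overflow checking,
# so the value wraps modulo 2**32.
print(ctypes.c_uint32(big).value)   # 5

# C signed 32-bit integer: 2**31 wraps to the most negative value.
print(ctypes.c_int32(2**31).value)  # -2147483648
```

Whenever data leaves pure Python and enters fixed-width types, for example in C extensions, it is worth checking that the numbers involved still fit.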

There are countless free packages and libraries available for Python that have been developed and optimized for scientific purposes. The areas of application include statistics, geometric data processing, visualization, analysis, algebra, machine learning and quantum computing. Programmers can conveniently download and manage modules via the central repositories PyPI and Conda. Examples of popular libraries are Pandas, NumPy, TensorFlow, Scikit-learn, SymPy, and Cirq.