Analysis of the IT market in the Czech Republic - an example of automated data mining and data processing

Reference or Study

The basic task of these assignments is to collect unstructured data, process it, evaluate it and present the result in a simple form, ideally using graphs.

Introduction

For HELIsmile's in-house needs, we decided to analyse the IT market. The reason was simple: for our offers, we needed to know how popular individual IT technologies and "buzzwords" are, and at the same time we could find out how strong the market demand for "data" is. But how to do this?

Job advertisements for "IT systems development" jobs will undoubtedly provide a good picture of market demand. The technologies in demand will appear in them in large volumes. No one will recruit employees "for stock" for technologies for which they do not have customers and demand. The technologies mentioned in the advertisements should therefore correlate well with the popularity of each technology in practice.

How we did it

Thanks to our extensive previous experience in data analysis, we had already developed some of the necessary tools in Python, R and C#. So let's get into the analysis.

We chose www.Jobs.cz as our data source. It is reportedly the largest and best-known job portal in the Czech Republic (https://video.aktualne.cz/nejvetsi-karierni-portal-v-cr-jobscz-slavi-20-let/r~f9f923a6f4b111e5b8100025900fea04/ ). Of course, the portal does not publish the data we need and will not hand it over to us (or only for a lot of money). So we had to help ourselves.

 

Web crawler

The first task was to download all the advertisements for local processing. In Python, this should be a piece of cake: go through all the "headline" (listing) pages of ads at https://www.jobs.cz/prace/is-it-vyvoj-aplikaci-a-systemu/, follow the URL of each ad, download it and then search the ads for keywords. We have fewer than thirty keywords; besides IT technologies, we are interested in more general words such as "Programmer", "Admin" (and their variants), "Data", "Big data", "Linux" and "Windows", and then in individual programming technologies (see the evaluation).
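The link-following step can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not our actual crawler: the "/detail/" URL fragment used to recognise ad links and the names AdLinkParser and extract_ad_links are hypothetical, and the real markup of the portal may differ. The listing HTML is passed in as a string, so the download itself is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

LISTING_URL = "https://www.jobs.cz/prace/is-it-vyvoj-aplikaci-a-systemu/"


class AdLinkParser(HTMLParser):
    """Collect absolute URLs of <a> links that look like ad detail pages."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical URL fragment that identifies an ad
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            url = urljoin(self.base_url, href)  # resolve relative links
            if url not in self.links:           # keep each ad only once
                self.links.append(url)


def extract_ad_links(listing_html, base_url=LISTING_URL, marker="/detail/"):
    """Return the ad URLs found on one listing page, in page order."""
    parser = AdLinkParser(base_url, marker)
    parser.feed(listing_html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = (
        '<a href="/detail/1">Ad 1</a>'
        '<a href="/about">About us</a>'
        '<a href="/detail/1">Ad 1 again</a>'
    )
    print(extract_ad_links(sample))
```

Each returned URL would then be downloaded and stored locally, one file per ad.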

Everything went like clockwork: I wrote the program for parsing the listing pages and downloaded the individual ads. But there was a small problem. Our simple robot downloaded the ad pages, but some of them were empty. Nowadays it is not enough to download only the HTML code; some "enlightened" companies render the text of the ad "spectacularly" only afterwards, using client-side JavaScript. That meant some extra programming: hooking up a browser engine, letting it interpret such a page and downloading the result.
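One way to spot such pages is to measure how much visible text the raw HTML contains and to send only the nearly empty ones through a browser engine. The sketch below shows that check only; the 200-character threshold and the function names are assumptions chosen for illustration, and the browser rendering itself is not shown.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that is not inside <script> or <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the visible text of a page with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def needs_browser_rendering(html, min_chars=200):
    """Guess that a page is filled in by JavaScript if it has little text.

    min_chars is an arbitrary illustrative threshold, not a measured value.
    """
    return len(visible_text(html)) < min_chars
```

Pages flagged this way would be loaded again through the browser engine and the rendered HTML saved instead of the original download.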

 

Analysis

The downloaded ads (about two thousand of them) were then cleaned of everything unnecessary: tags such as style and script (otherwise we would get a lot of false positives for C and R, which are quite popular IT technologies), the company suffix "s.r.o." (thanks to which R became one of the most popular languages in the first version), and so on. Of course, the results should be taken with a large grain of salt; they really only reflect the popularity of "words" (labels) in IT development advertisements. Moreover, machine text analysis can produce unexpected errors: for example, Basic can also appear as "basic knowledge" in an English advertisement (for simplicity, the search is case-insensitive).
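The label search can be illustrated with a short sketch. It assumes the ad has already been reduced to plain text, uses only a handful of example labels rather than our full list, and counts how many ads mention each label at least once; the names LABELS, labels_in_ad and count_labels are made up for this example and are not our production code.

```python
import re
from collections import Counter

# Illustrative subset of labels; the real list has fewer than thirty entries.
LABELS = ["SQL", "Java", "JavaScript", "C#", "C", "R", "Python", "Basic"]

# The company suffix "s.r.o." would otherwise be counted as the language R.
NOISE = re.compile(r"\bs\.\s*r\.\s*o\.", re.IGNORECASE)


def label_pattern(label):
    """Match the label only as a standalone token, ignoring case.

    The lookarounds forbid a letter, digit, '+' or '#' directly next to the
    label, so "C" does not match inside "C#" and "Java" not inside "JavaScript".
    """
    return re.compile(
        r"(?<![\w+#])" + re.escape(label) + r"(?![\w+#])", re.IGNORECASE
    )


PATTERNS = {label: label_pattern(label) for label in LABELS}


def labels_in_ad(text):
    """Return the set of labels mentioned at least once in one ad."""
    text = NOISE.sub(" ", text)
    return {label for label, pattern in PATTERNS.items() if pattern.search(text)}


def count_labels(ads):
    """Count, for each label, the number of ads that mention it."""
    counts = Counter()
    for text in ads:
        counts.update(labels_in_ad(text))
    return counts
```

Because the search ignores case, "basic knowledge" still counts as a hit for Basic, which is exactly the kind of false positive described above.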

The second, quite simple task was to add the county based on the name of the municipality. N/A in the county column means the lookup failed: this label covers advertisements for jobs abroad, advertisements that state only the county, and so on. Given how few there are (18 out of 2163 in total, several of them for foreign countries), I did not pursue the invalid counties any further.
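In essence this is a table lookup with a fallback value. The sketch below assumes the ad location is a string such as "Praha - Nové Město" or "Brno, Veveří" and uses a tiny hand-written table; a real run would need a full list of municipalities. The function name region_for and the location format are assumptions made for illustration.

```python
import re

# Tiny illustrative table; a complete municipality list would be needed in practice.
MUNICIPALITY_TO_REGION = {
    "praha": "Hlavní město Praha",
    "brno": "Jihomoravský kraj",
    "ostrava": "Moravskoslezský kraj",
    "olomouc": "Olomoucký kraj",
    "zlín": "Zlínský kraj",
}

# A district usually follows the municipality after " - " or ",".
# Hyphens without surrounding spaces (as in Frýdek-Místek) are left alone.
SEPARATOR = re.compile(r"\s+[-–]\s+|\s*,\s*")


def region_for(location):
    """Return the county for an ad location, or "N/A" if it is unknown."""
    municipality = SEPARATOR.split(location.strip(), maxsplit=1)[0]
    return MUNICIPALITY_TO_REGION.get(municipality.casefold(), "N/A")
```

Anything that falls through to "N/A" (foreign cities, ads that state only a county) can then be counted and reviewed by hand.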

 

Results

The first surprise was the overwhelming dominance of Prague: 60% of all job offers come from the capital (tab "Advertisements_by_region"). Our Moravia lags far behind: Brno has 18% and Ostrava, "transformed" by IT (https://www.seznamzpravy.cz/clanek/promena-ostravska-it-misto-tezkeho-prumyslu-a-zamestnanci-z-ciziny-36394), only 6%. There is not much to say about the remaining regions, which share only about 15%. Out of Moravian patriotism, I have nevertheless also included two Moravian regions: the Olomouc region, where we are based, and the neighbouring Zlín region.

The technologies actually used and in demand are not such a big surprise. Given the prevalence of database applications in custom development, the most popular technology "label" is "SQL" (the "Labels_overall" and "Labels_by_county" tabs), which is completely independent of the programming language used (SQL can be required of Java, C#, PHP, or even Python programmers).

The representation of database servers may come as a slight surprise. Commercial Oracle leads (we also searched for it via the PL/SQL abbreviation), followed by commercial MS SQL and only then by freely available databases (PostgreSQL, MySQL/MariaDB).

Among programming languages, Java, JavaScript and C# are mentioned most frequently. This is probably related to application development for banks, which have relatively high demand (see the tab "Companies by advertisements").

On the other hand, the popularity of Python was a surprise (at least for me); I did not expect it to overtake HTML and PHP. There were even a few mentions of the statistical programming language R, which we use, but as expected it is marginal. R is far more popular in the world of academic computing.

To test our interface to R, I also tried some simple hypothesis testing to see whether there are significant differences in the representation of each technology across counties (the "Hypotheses" tab). I used a standard χ² test and an "exact" Monte Carlo method with 50,000 repetitions. The first table shows the test of differences across all counties; the second compares selected pairs of counties, namely the Moravian ones with Prague. The darker the colour, the more significant the difference. Of course, the results are not corrected for multiple comparisons, so they are full of false positives. The whole "Hypotheses" tab is therefore more a curiosity and a test of our interface to R.
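For readers who want to try the idea without R, the test can be sketched in plain Python. This is an illustrative re-implementation, not the code behind the "Hypotheses" tab. It computes the Pearson χ² statistic for a contingency table (rows are counties, columns are "ad mentions the label" and "ad does not") and estimates a Monte Carlo p-value by shuffling the column assignment of the individual ads, which keeps the row and column totals fixed. It assumes no row or column total is zero; the function names are hypothetical.

```python
import random


def chi2_statistic(table):
    """Pearson chi-square statistic of a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def monte_carlo_p_value(table, repetitions=50_000, seed=0):
    """Estimate the p-value by permuting column labels of single observations."""
    rng = random.Random(seed)
    observed_stat = chi2_statistic(table)
    n_cols = len(table[0])
    # One entry per ad: the row index and the column index of its cell.
    rows = [i for i, row in enumerate(table) for count in row for _ in range(count)]
    cols = [j for row in table for j, count in enumerate(row) for _ in range(count)]
    hits = 0
    for _ in range(repetitions):
        rng.shuffle(cols)  # breaks any row/column association, keeps the totals
        simulated = [[0] * n_cols for _ in table]
        for i, j in zip(rows, cols):
            simulated[i][j] += 1
        # Small tolerance so that ties with the observed value count as hits.
        if chi2_statistic(simulated) >= observed_stat - 1e-9:
            hits += 1
    # Adding 1 to both counts keeps the estimate strictly above zero.
    return (hits + 1) / (repetitions + 1)


if __name__ == "__main__":
    # Invented counts for illustration only, not data from the analysis.
    example = [[10, 20], [20, 10]]
    print(chi2_statistic(example))
    print(monte_carlo_p_value(example, repetitions=2_000))
```

In pure Python, 50,000 repetitions over a couple of thousand ads take a while, so a smaller number of repetitions is more practical for a first try.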

There is also a WordCloud tab, included for interest, which shows the popularity of each technology graphically (without overly general words such as SQL, data, etc.). I hear it's very popular in marketing...

 

Conclusion

All the analysis tools are in place, so it is no problem to run them again in a few months and monitor changes in IT trends. A hint of this can be seen on the "Tags_overall" tab; the first data I collected is from 31 October 2019.
