In NOMAD we used several technologies, some quite new. I reevaluated the situation periodically during the project, and sometimes changed my mind, but once one has committed resources to something, changes are costly and not always worth doing. Still, after some time I think it is good to do a public evaluation.
Here I want to evaluate, from my personal point of view, how well the technologies used in the NOMAD Archive and NOMAD analytics (for which I was the main responsible) worked in these projects. This is not an evaluation of the general worth of each technology, but just of its application in this specific case.
Thus, in no particular order:
Files & File system
Files and the file system are an old technology, but still relevant today. For something that wants to ensure future access to the data, also from all kinds of new technologies, they are still a good choice. Object storage is a simpler and better scaling option, and if one can live with its constraints it is worth the trade-off. I think that striving to keep the source of truth at the file system level has been a good choice.
The meta info is an abstract representation; in files it must use a concrete format. We defined a representation using the following generic file formats:
- hdf5 is a binary format for scientific data, specialized for multidimensional arrays. It supports parallel reads and writes and several programming languages, and has been maintained for many years. It is supported by several visualization programs. For these reasons we chose it, but it turned out to have a big drawback for our use: it is not well supported by Hadoop and other big data infrastructure. Integrating it was difficult, because the C library has a single global lock for basically all operations, even on different files. There is a Java implementation of the subset of hdf5 used by NetCDF, but we were using hdf5 features beyond that subset in the output library. For this reason parallel access using multiple threads from the JVM gets sequentialized, which breaks the parallelization. Had I known this beforehand, I would not have chosen it; it made the use of big data technologies more difficult.
- Parquet is a columnar format that is supported by most modern big data storage. It is immutable, and writing it is a bit annoying, because one basically has to have a full record and cannot add to it. Still, it seems one of the most future proof choices to enable all sorts of big data software to efficiently access the data.
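The effect of the single global lock can be illustrated with a toy sketch in plain Python threading (this is a model of the behaviour, not the actual HDF5 C library or its API): even when threads operate on different files, one shared lock means at most one operation runs at a time.

```python
import threading

# One global lock shared by all operations, as in the HDF5 C library
# (toy model only): every operation must hold it, even for different files.
global_lock = threading.Lock()

max_concurrent = 0
current = 0
counter_lock = threading.Lock()

def hdf5_like_operation(filename):
    """Simulated library call; the work on `filename` happens under the global lock."""
    global max_concurrent, current
    with global_lock:                     # all files share this one lock
        with counter_lock:
            current += 1
            max_concurrent = max(max_concurrent, current)
        # ... simulated I/O on `filename` would happen here ...
        with counter_lock:
            current -= 1

threads = [threading.Thread(target=hdf5_like_operation, args=(f"file{i}.h5",))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With the global lock, observed concurrency never exceeds 1,
# regardless of how many threads or files are involved.
print(max_concurrent)  # → 1
```

This is exactly why multiple JVM threads calling into the library get sequentialized: the parallelism of the surrounding framework is lost at the library boundary.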
Scala & Python
For big data analysis, Hadoop represents the first and still largest open source ecosystem. Hadoop was developed in Java. Later Scala, an alternative JVM language, gained popularity thanks to Apache Spark and Apache Flink. Scala is more complex but less verbose, needing less boilerplate code than Java, and it still integrates seamlessly with Java and Java libraries. That is the reason for Scala's success, and also the reason I chose Scala for the core of our infrastructure.
For the parsers, i.e. the parts that extract the data from each different simulation program, we chose to use Python instead. That part has to be written by scientists who understand the simulation program and the meta info, and who can map the input and output of the program to the meta info structure. Most of these scientists already knew Python, and especially given the very ambitious goal of supporting 30 different simulation codes in the first year, without the infrastructure in place, minimizing their effort was crucial.
As Python is used for part of the system, a natural question is: why not use it for everything? Indeed, asked this question now, my answer would probably be to use just Python 3.
Exactly because of the popularity of Python in the scientific field (connected with its dynamic nature, focus on readability, usefulness for scripts, and ability to interface with C/C++ reasonably well), a large effort has been made to better support big data analysis with it. When we started, this was not the case; even now, using it with Spark or Flink has some limitations, as Python runs in an external process. Furthermore, the transition from Python 2 to Python 3 was still in progress, with important scientific packages not yet available for Python 3, further reducing the attractiveness of Python.
Using two different languages creates a barrier, an interface, and makes it more difficult for contributors to understand and change the whole system. Still, that is not only bad: the interface was based on the meta info, and thus it forced us to have a code independent meta info and to extract the information from it. Given the central role of the meta info, being forced, even under pressure to deliver, to maintain a clean language independent interface was a good thing.
Scala allowed rapid development and compact code, very important when developing quickly, and the transition to production was easy thanks to the clean packaging. Still, it definitely limits the people who can contribute. It also attracts people who want to experiment with languages, so some libraries try (in my opinion) too hard to use the more exotic features; sometimes using the plain Java library seemed much safer, and kinder toward other contributors who had to understand the whole system, not just one library. So on the whole I would evaluate the choice of Scala for our project as arguable: not bad, but not clearly good either.
Python on the other hand, despite its limitations, has definitely been a good choice.
nodejs & express
Software in the end has to be deployed in production. Here we settled on Kubernetes, which is (I think) still the best provider independent solution to ship a complex system. I discuss the details here, but on the whole I think it was a good decision.
For some of our uses, serverless architectures like OpenWhisk are becoming interesting. We did not explore this option because it came later, and maintaining the infrastructure already took quite a bit of our time, so we do not add anything lightly. In any case the open source options can typically run on top of Kubernetes, so Kubernetes is still needed.
flink and spark
Flink and Spark generalize, standardize, and simplify batch and stream processing, providing the tools for it. In theory this is nice and useful, and we did create some prototypes. We did not have time to make a production ready solution because:
- The use of HDF5 and its single lock breaks the parallelization in default setups of Hadoop, Flink, or Spark.
- Non standard installs are more work to perform
- We worked on supporting parquet, but did not optimize things for it
- With BBDC, the group of Alexander Alexdrandescu developed a prototype to use the meta info and piggyback on Spark SQL to access the parquet files
- In general all this is nice for programmers, especially those using Scala, but only for trusted developers. That is good, but a solution for any user on the internet had a higher priority, and for that one has to build an interface around it, as access to the whole cluster cannot be given freely. Kubernetes can help if one manages to create clusters on demand, but that is a heavy solution, and thus not for all users.
So while potentially useful neither of these systems really proved their worth in production.
beaker & jupyter
We wanted to provide notebooks to allow real analysis to any user. We developed the Container Manager for that purpose.
There is always a risk of choosing the wrong technology, especially in a fast moving domain, or of facing a large porting effort if a technology changes much. In this case Beaker was not a good choice; luckily our approach was largely independent of the notebook provider, and we also support Jupyter, which seems a better choice. See the Container Manager for further discussion.
Elasticsearch allows one to have efficient indexes of JSON documents. We used it to build the important indexes and provide search functionality to all users. One has to be a bit careful about controlling access, especially with the free version, but on the whole it was a good choice that solved a practical problem (fast queries).
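As an illustration, searches over such indexed JSON documents are themselves expressed as JSON. A minimal sketch of a query body, built as a Python dict (the field names `formula` and `n_atoms` are made up for illustration, not the actual NOMAD mapping):

```python
# Minimal Elasticsearch-style boolean query body (sketch only; the field
# names are hypothetical). The same structure would be sent as JSON to the
# search endpoint of the index.
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"formula": "Si2"}},            # exact keyword match
                {"range": {"n_atoms": {"lte": 10}}},     # numeric filter
            ]
        }
    },
    "size": 10,  # return at most 10 matching documents
}
```

The appeal is that arbitrary combinations of filters stay declarative JSON, so the same search interface can be exposed to all users without giving them code execution on the cluster.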
We use Postgres as an efficient open source relational database. Version 9.6 improved support for upserts and JSON, and it is our current choice for the rawdatadb: the place where we store information about the raw data archives. As an index Elasticsearch is superior, but to store data with full ACID (Atomicity, Consistency, Isolation, Durability) guarantees using an open source SQL database, Postgres has always delivered.
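The upsert feature is worth a small sketch: an insert that updates the existing row instead of failing on a duplicate key. The `INSERT ... ON CONFLICT DO UPDATE` clause shown here is the Postgres syntax, which SQLite (3.24+) also supports, so the example runs with Python's built-in sqlite3; the table and columns are made up for illustration:

```python
import sqlite3

# In-memory database standing in for Postgres; the ON CONFLICT clause is the
# same upsert syntax that Postgres 9.6 introduced. Table/columns are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE archive (path TEXT PRIMARY KEY, size INTEGER)")
conn.execute("INSERT INTO archive VALUES ('a.zip', 100)")

# A second insert with the same primary key updates the row instead of
# raising a uniqueness violation:
conn.execute("""
    INSERT INTO archive VALUES ('a.zip', 250)
    ON CONFLICT(path) DO UPDATE SET size = excluded.size
""")

row = conn.execute("SELECT size FROM archive WHERE path = 'a.zip'").fetchone()
print(row)  # → (250,)
conn.close()
```

For a service that repeatedly re-registers information about the same raw data archives, this removes the usual select-then-insert-or-update dance and keeps the operation atomic.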
Postgres needs a server, is centralized, and its performance for large databases can be improved with better hardware. Sometimes one just wants to organize relatively little data, something that can be handled by a single computer, query it, and finally store it. This is the case for user generated databases. Here sqlite is perfect: lightweight, and a future proof data format.
Redis is a fast distributed key-value store. We use it for sessions and extra user info. Setup, performance, and use have always been satisfactory, and it is simple to use.
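A toy model of the pattern involved, set-a-key-with-expiry and get-it-back, in plain Python (this is not the Redis API; the real thing adds persistence, distribution, and much more, but the session use case reduces to this shape):

```python
import time

class ToySessionStore:
    """Dict-based sketch of the set-with-TTL / get pattern used for sessions."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expiry timestamp)
        self._clock = clock  # injectable clock, for deterministic testing

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:   # expired: behave as if absent
            del self._data[key]
            return None
        return value

# Usage, with a fake clock so the expiry is deterministic:
now = [0.0]
store = ToySessionStore(clock=lambda: now[0])
store.set("session:42", {"user": "alice"}, ttl_seconds=30)
print(store.get("session:42"))   # → {'user': 'alice'}
now[0] = 31.0
print(store.get("session:42"))   # → None
```

Automatic expiry is what makes a key-value store such a natural fit for sessions: stale logins clean themselves up without any bookkeeping in the application.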
MongoDB was one of the first large open source non SQL databases. We use it (for historical reasons) to store information about the user generated notebooks in the notebooks-mongo service. Others use it more heavily and store most of their data in it. We could just as well have used Redis for our purposes; indeed, to reduce the maintenance burden I would consolidate on Redis.
RabbitMQ is used as a queue system to generate the normalized data.
We evaluated Kafka, which scales much further if one has really many events. Back then the setup was a bit complicated, and Kafka focused on handling a lot of relatively small events (log processing, and handling spikes). It did not cope well with large events, where one event can take much longer than another or fail, and one wants to maximize the parallel execution: the bookkeeping was basically up to the user.
On the other hand, with RabbitMQ one can easily have a queue where each single event has to be acknowledged, and the handling of failures, retries, etc. is customizable. Thus RabbitMQ reduced the code we had to write, and was a better choice for our use case.
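The acknowledge-or-retry behaviour that made RabbitMQ attractive can be sketched as a toy in-process queue (plain Python, not the RabbitMQ or pika API; real brokers do this across processes and machines):

```python
from collections import deque

def process_with_ack(events, handler, max_retries=3):
    """Toy model of a work queue where each event must be acknowledged;
    a failed event is re-queued until max_retries is exhausted."""
    queue = deque((event, 0) for event in events)
    done, dead = [], []
    while queue:
        event, attempts = queue.popleft()
        try:
            handler(event)
            done.append(event)                       # explicit "ack"
        except Exception:
            if attempts + 1 < max_retries:
                queue.append((event, attempts + 1))  # re-queue for retry
            else:
                dead.append(event)                   # give up: dead-letter
    return done, dead

# Usage: event "b" fails once (a transient error), then succeeds on retry.
failures = {"b": 1}
def handler(event):
    if failures.get(event, 0) > 0:
        failures[event] -= 1
        raise RuntimeError("transient failure")

done, dead = process_with_ack(["a", "b", "c"], handler)
print(done, dead)  # → ['a', 'c', 'b'] []
```

With a broker providing exactly this per-event ack and redelivery, a slow or crashed normalization job simply gets redelivered, and none of that bookkeeping ends up in our code.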
helm is a tool to simplify the deployment of multiple services. We used it to deploy some of the standard software. The main issue was incompatible changes in the charts: changing the name or type of parameters. We had some issues with it, but on the whole it was useful. It would make sense to use it for our own deployment as well, but due to some reservations about its usage by CSC we delayed doing so.
nexus is a repository for jars and a docker registry. It is one of the few open source secure docker registries. That was the main reason we used it, but it was a single point of failure, and to simplify management we switched to gitlab (which added docker support, and is managed by MPCDF). So it was useful, but we replaced it all the same.
prometheus is a tool to monitor the various services. We needed better monitoring and used prometheus, but due to lack of time we never really took advantage of it; a pity, as we did not get the benefits we should have out of it.
gitlab provides git based collaboration for projects, a bit like github, but in a way that can be run on premises. MPCDF has been running it for the NOMAD project, and it has been nice, steadily improving during this time.