NOMAD post mortem

At the moment this document is not linked from anywhere on the site; it can be reached only through a direct link.

NOMAD as a center of excellence is finished, and I think that it is time to look back at what we have learnt from it. We did a lot of nice things, as can be seen on the official site and in the things I wrote about it.

I think that it is very instructive to evaluate critically the whole project and see what worked, but also (especially) what didn’t work, and where the problems lie. Any large or challenging project has some issues, that is part of the game, and that is especially so when it is both large and challenging. I know it first hand because I have been involved in several of them: working at Uptime, working on Cp2k, on the Tango library, and in general in the D language community, and working at Nokia and on Qt.

Now, on the things that did not work so well there are bound to be different opinions on what did not work, and especially on whose fault it was. I have tried to be as objective as I could, and waited some time before writing this, to put things in perspective. Still, I was involved, and thus this is inevitably my own view of the issues.

There are many things involved in a large software project, and many books discussing software management. For people in another field, I find that Paul Ford’s What is Code explains what code and software are, and talks about his experience with software projects.

Technical issues

NOMAD was a challenging project: its goal was to apply Big Data analysis to materials science, in particular to theoretical materials science. The idea was to use computations performed by the community, and to support as many simulation programs as possible. This greatly increases the difficulty of the problem, because one cannot rely on knowing the purpose or the organization of the calculations, or handle just the small subset one would use in high-throughput calculations, but should really try to extract all information from input and output files.

All the extracted information has to be organized in a common and uniform way. It turns out that a large part of the analysis is actually preparing the data for it. Simplifying that part is an excellent way to improve analysis, and the development of the NOMAD meta info is aimed exactly at that.

Meta info is something new, and a topic important enough to have its own space. But we were not concentrating just on new things: that is neither possible nor healthy. You want to stand on the shoulders of giants to achieve more, and to reuse the work of others as much as possible.

Big data and big data analysis have been a hot topic for a while. This makes the field interesting and exciting, but it also means that there are many players, people trying to pump up their technology, and a lot of marketing. Definitely a lot of these technologies will fail; indeed the field has already changed since the start of the project. Choosing what to build on is not easy, and sometimes one is bound to choose wrongly.

In the technical evaluation page I try to evaluate the technologies chosen, as I feel this is a good point to review them. See also the infrastructure description for an overview of the services we provide.

Management and HR issues

NOMAD was a challenging project, but in my opinion, and for me personally, the biggest issues were not technical but management related.

I did think quite a bit about this. All the people involved are intelligent, actually wanted the project to succeed, and were not willfully malicious (despite harboring some doubts at times, I still think this is the best working hypothesis).

Still, some of the mistakes made looked to me like a car crash in slow motion, and were a big source of frustration for me. Sometimes it looked like a satirical show in which you do not know if you should laugh or cry.

What happened

Ankit Kariryaa

There are way too many issues to list, but the short story is that in the first year and a bit, more than 90% of the infrastructure work in WP1 (NOMAD Archive) and WP4 (Analytics) was done by me and Ankit (supported by me), but we were badly overworked.

Several issues were connected with the large burden of managing all the parser developers and testing the parsers (30 of the 40 parsers were due after nine months). Administrative tasks, like setting up and giving access to the repositories and the machines, took precious time away from development. We did try to streamline things as much as possible, writing wiki pages and pointing people to them, and adding CI tests, but it remained a burden. The supercomputing centers had little experience with big data and cloud computing. This project was also a chance for them to learn more about it. To ensure that we could develop, we got some virtual machines (but managed like real machines: provisioning one took around a day). This left us quite some freedom on what to do, but also shifted the administration, installation and maintenance of most things to us. This was a non-negligible extra burden.

It quickly became clear that developer time was one of the most precious assets, thus having maintainable code and a maintainable system was crucial. This clashed a bit with the need to generate normalized data: new sources of data were coming in, parsers were being developed, and parsed data had to be delivered right away.

This in turn created an issue with respect to computing resources. The initial vision was to have enough computing resources allocatable when the need arose. Unfortunately it turned out that initially there would not be many easily provisioned machines: we just had 3 VMs coming from a 20-core machine (40 counting hyperthreading, if my memory serves me right) shared with all other work packages. New hardware would be bought for the project, but choosing it and setting it up took quite a while (and included a false fire alarm that triggered the fire suppression system, which flooded the rooms with gas and damaged a lot of hard drives through noise/vibrations and pressure changes). The solution was to use one of the computing clusters to perform the parsing, as a normal user of its batch system. This implied some difficult choices; in particular it cut off the Hadoop environment, Flink and Spark, because it was not possible to run them that way (this might have partially changed now, but especially for storage it is still difficult). Furthermore, the storage of our VMs (GPFS, now IBM Spectrum Scale) was not available on the cluster, which meant days to synchronize data between the two systems. The solution was to run the RabbitMQ-based queuing system on the VMs, and just the client workers on the cluster, as Java (Scala) applications calling Python when needed. This worked, but took quite some extra development time and babysitting.
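To illustrate the split described above (a central queue on one side, workers that pull parsing tasks on the other), here is a minimal in-process sketch. It is not the NOMAD code: it uses Python's standard `queue` and `threading` modules as a stand-in for RabbitMQ and the cluster workers, and `parse_archive` and `run_workers` are hypothetical names invented for the example.

```python
import queue
import threading


def parse_archive(archive_id: str) -> str:
    """Hypothetical placeholder for the real parsing step."""
    return f"parsed:{archive_id}"


def run_workers(archive_ids, n_workers=3):
    """Feed tasks through a shared queue and collect the workers' results."""
    tasks: queue.Queue = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            archive_id = tasks.get()
            if archive_id is None:  # sentinel: no more work for this worker
                tasks.task_done()
                return
            try:
                out = parse_archive(archive_id)
                with lock:
                    results[archive_id] = out
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for archive_id in archive_ids:
        tasks.put(archive_id)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```

The point of the pattern is that workers only need to reach the queue, not each other, which is what made it possible to run them as ordinary batch jobs.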

ScaDS Dresden/Leipzig is a collaboration of supercomputing centers that was trying to provide Hadoop/Spark/Flink installations as Software as a Service (SaaS). This would have helped us reduce our administration burden, but unfortunately they were considered “competitors” of the Berlin Big Data Center (BBDC) that we were collaborating with.

The people involved in NOMAD were smart, and somehow a solution could be worked out; the biggest issue was actually the interaction with Matthias.

Matthias, the director of the FHI Theory department and coordinator of the whole NOMAD project, is intelligent, and a successful manager of scientific projects. Still, I think, he had little experience with software management, he knew that he was going out of his comfort zone with this high profile project, and thus he was afraid of failure. Due to this fear he trusted neither himself nor others on technical aspects, but he was confident of his management ability. His management uses confrontation and pressure to force one to react, and he is willing to distort reality, create internal competition and isolate people to reach his goals.

This forceful management style can be successful, but it needs a firm grasp of the whole situation; with neither one’s own understanding nor trust in someone who has that understanding, it is not sustainable. Furthermore, failure adds frustration and easily lures one toward a more ruthless and manipulative approach, making matters worse.

Matthias misjudged the amount of work involved by an order of magnitude, and was reluctant to add computer scientists close to him, or to create a group that he could not manage or understand. Furthermore, I think that he interpreted my “I could use computer scientists” as a way to highlight how much work I was doing, and not as an objective statement (I always try to be as objective as possible, and I do not like to brag).

Also, such an approach is quite at odds with my style: I have experience with open source projects, where one does not have much coercive power. Indeed, to keep people working who often do things just in their free time, motivation is key. Thus sharing the vision and keeping people motivated is crucial, and having as much openness and inclusion in discussions as possible, and tapping also the silent contributors, is the best goal. It is difficult at times: a big issue is overcoming the reluctance to speak up, a reluctance that only grows when one’s (sometimes only perceived) position grows, and that might hide issues, and thus requires constant attention. Confrontation, when well done, avoids this by forcing issues into the open.

The disconnect between the actual work needed and the amount of work he perceived as needed created issues quite early. His idea was that the whole thing was “to just write some script to extract the data” and this was reflected in the plan to quickly write the support for 30 codes (and corresponding milestones).

I was supposed to just work on a sample parser for others to copy. The need to have a firm basis as starting data (which I solved by creating raw data archives in BagIt format, named using a recursive checksum depending on the content of the bag), to work on a common parsing infrastructure and the whole meta info, and to communicate with parser developers, had not been considered at all.
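The idea of naming an archive by a recursive checksum of its content can be sketched as follows. This is only an illustration under my own assumptions, not the scheme actually used in NOMAD: the hash function (SHA-512), the `R` prefix, the name length, and the function names `file_digest`, `tree_digest` and `archive_name` are all hypothetical choices made for the example.

```python
import base64
import hashlib
from pathlib import Path


def file_digest(path: Path) -> bytes:
    """SHA-512 digest of one file, read in chunks to bound memory use."""
    h = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def tree_digest(root: Path) -> bytes:
    """Recursive checksum: hash the sorted (kind, name, child digest) entries."""
    h = hashlib.sha512()
    for entry in sorted(root.iterdir(), key=lambda p: p.name):
        if entry.is_dir():
            kind, child = b"d", tree_digest(entry)
        else:
            kind, child = b"f", file_digest(entry)
        name = entry.name.encode("utf-8")
        # Length-prefix the name so entry boundaries are unambiguous.
        h.update(kind + len(name).to_bytes(4, "big") + name + child)
    return h.digest()


def archive_name(root: Path, prefix: str = "R", length: int = 28) -> str:
    """Hypothetical naming scheme: prefix + URL-safe base64 of the tree digest."""
    encoded = base64.urlsafe_b64encode(tree_digest(root)).decode("ascii")
    return prefix + encoded[:length]
```

With such a scheme two uploads with identical content get the same name wherever they are stored, and any change to a file or to the directory layout yields a different name, which is what makes the raw data a stable basis for everything derived from it.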

The need for these tasks had to be painfully explained several times. I have an archive with all my mails, but I am going just with the broad scope, from memory. He would write emails or make statements that would make me so angry and frustrated that I would not work effectively for days. I would then slowly write down and explain things like the technical debt that was crucial to keep under control in this project, especially given that we wanted to quickly involve 10+ people to write parsers, because in the end he was the boss, and I saw it as my duty to explain things, so that he could choose, even if I disagreed. Matthias would often ... This would create code that could lock us into a wrong or suboptimal approach. The raw data and the meta info were ways to create a clean interface, allowing people to code, but leaving us enough freedom to change and optimize things along the way.

Still, despite it being very stressful, we managed to reach all the external milestones.

Later Matthias sort of conceded that “Parsing the data was more work than we [read “I”] thought”.

Ankit did excellent work in this period. After around a year Ankit left, partly because he had planned from the beginning to do a PhD, partly because, while he did enjoy working with me, he was afraid of the boundary conditions (being overworked and Matthias’s management style). I found it very funny (and telling of how much he was willing to distort things to fit his narrative) when Matthias later tried to hint that it was my fault that Ankit (and later others) left.

Knowing how badly we needed more people, I asked to post an announcement on LinkedIn (which I paid for with my credit card, and got reimbursed), but still we had a short period in which I had to leave for the adoption, and then Ankit left and I was alone, so clearly there were problems.

Matthias had the habit of giving me tasks that would have been better done by someone else with my help, and of actively stopping me from being “unfocused” by helping others, not understanding that helping others to work acts as a multiplier and guarantees that the knowledge of the system is spread across multiple people. Documentation is important, but in software, and even more so in software under development, a lot of the knowledge is in the people. Walling people off really hurts the project, and losing people hurts even more.

Alfonso https://alfonsosastre.blog/about/

After several time-consuming interviews we found Alfonso. After Alfonso’s arrival things got even worse, because I think Matthias’s conclusion was not that we really did need the people, but that clearly I was the problem, so I was shifted away, and Alfonso was put in charge.

The result was totally predictable: the work was too much even for me, knowing the whole system, so clearly for someone who had to learn it, it was overwhelming. Thus we lost a capable programmer, and a lot of his work was done under so much pressure, and without really consulting me, that it consists of unmaintainable bash scripts that will have to be replaced.

With Yazid, he started the same way, but I was really fed up with the situation. The most ironic thing was that Matthias was taking all these decisions that clearly damaged the project and setting priorities without asking me, and then the problems still came to me for fixing, only without my having been involved in the decisions, while I was being blamed for obvious mismanagement. Thus I spoke clearly with Matthias.

In the end some damage was already done, and I hope we do not lose Yazid too, so I said the following to Matthias.

About Yazid being the one responsible: I am fine with it; as I said several times, I do not like to manage. I take issue with the claim that I am not able to. I think that comes from a lack of trust and a misconception of software management.

Structuring development is useful and some structure is needed, but a lot of structuring does not guarantee success; what it does is give better blame protection if things go bad, and that is an important and useful reason to do it.

Still, I very much prefer design and programming to managing, and I think my position should be that of an architect, meaning that I should go around and meddle in various things, do “help desk”, act as a joker to help out when one is stuck, and do pair programming.

If one manages to motivate people, share the vision, and be genuinely useful to them, one automatically gets involved in the difficult parts. That is very important, because at that point many details are fixed, and computers are much less forgiving when combining pieces developed independently.

This might seem being unfocused, but is one of the most important tasks for a successful large software project, and reduces greatly the need of detailed managing.

This does not mean that it is little work, because good programmers often do not like to interact when they are stuck or deciding the architecture. The best way is to show them that it is good for them too, and that needs a detailed knowledge of the code.

My experience in managing software projects is a bit skewed toward open source projects, where people do not have to work on something, and one has little power to force people, but I have applied it very successfully also in the private sector.

I hope that now the importance of such an effort, which ideally goes across the whole project, also involving the encyclopedia, is better appreciated, and ideally officially acknowledged. We lost a lot of things because I was not able to do it properly.

I managed a bit on a small scale, … but it should have been done much better. Involving the users is very important to understand better what they want, and maybe it even makes them become contributors.

Matthias is not stupid, I hope he understood.

You know, management can definitely kill a project, but at most it can support success; the practical work and programming is the only thing that determines success in the end, and I am ok with focusing especially on that side.

That is why in our discussions I tried to bring up the practical things to do or to plan, rather than discussing management.

I think that the choice to go with OpenShift is a good one, and simplifies our choices.

Short term for the storage I might set up

https://github.com/minio/minio/tree/master/docs/shared-backend

to export GPFS also as S3. This does not need any work on your side, but it is for your knowledge.

With best wishes
Fawzi
(I did not cc Stefan Heinzel, but you can share it with him if you think it relevant)

========

for me it has been extremely frustrating

basic management

large project

why smart person

Dear Aga, I saw that you scaled the pay according to the reduction from 40 to 32 hours. My previous contract was for 30 hours, and I do not feel this is fair; if anything I have more experience than previously, so I should get more.

I do not like to speak badly or focus on past mistakes; I prefer to look at new challenges. Still, I think that this might be due to things that (I guess) Matthias said, and I am not willing to be a scapegoat in this. So here is my side of the problems in NOMAD.

In NOMAD we did many nice things, things I am proud of, but we also had serious issues. Any large or challenging project has some issues, that is part of the game, and that is especially so when it is both large and challenging. I know it first hand because I have been involved in several of them: working at Uptime, working on Cp2k, on the Tango library, and in general in the D language community, and working at Nokia and on Qt.

Matthias, the director of the FHI Theory department and coordinator of the whole NOMAD project, is intelligent, and a successful manager of scientific projects. Still, I think, he had little experience with software management, he knew that he was going out of his comfort zone with this high profile project, and thus he was afraid of failure. Due to this fear he trusted neither himself nor others on technical aspects, but he was confident of his management ability. His management uses confrontation and pressure to force one to react, and he is willing to distort reality, create internal competition and isolate people to reach his goals.

This forceful management style can be successful, but it needs a firm grasp of the whole situation; with neither one’s own understanding nor trust in someone who has that understanding, it is not sustainable. Furthermore, failure adds frustration and easily lures one toward a more ruthless and manipulative approach, making matters worse.

Also, such an approach is quite at odds with my style: I have experience with open source projects, where one does not have much coercive power. Indeed, to keep people working who often do things just in their free time, motivation is key. Thus sharing the vision and keeping people motivated is crucial, and having as much openness and inclusion in discussions as possible, and tapping also the silent contributors, is the best goal. It is difficult at times: a big issue is overcoming the reluctance to speak up, a reluctance that only grows when one’s (sometimes only perceived) position grows, and that might hide issues, and thus requires constant attention. Confrontation, when well done, avoids this by forcing issues into the open.

Matthias misjudged the amount of work involved by an order of magnitude, and was reluctant to add computer scientists close to him, or to create a group that he could not manage or understand. Also, he was used to PhD work, where one can work alone, quite independently from others. An infrastructure, especially when adding 30 parsers coded by different people, needs quite some work to ensure that one does not later need to modify all the parsers, or get locked into a bad design.

I worked initially with Ankit Kariryaa https://bscc.spatial-cognition.de/kariryaa who did excellent work, both for the archive and the analytics infrastructure, but we were overworked. Also, Matthias had an extremely aggressive management style, requiring things immediately, and being demeaning. He would write emails or make statements that would make me so angry and frustrated that I would not work effectively for days. I would then slowly write down and explain things like the technical debt that was crucial to keep under control in this project, especially given that we wanted to quickly involve 10+ people to write parsers, because in the end he was the boss, and I saw it as my duty to explain things, so that he could choose, even if I disagreed. I resorted to trying to find out the real deadline (with the help of Luca Ghiringhelli, who tried throughout NOMAD to be a moderating force with Matthias, something that was hard also for him), and to work toward a maintainable system until the deadline was close, so as to take as few shortcuts as possible. This sometimes meant missing Matthias’s deadline (but not the real one), but keeping the system manageable and the technical debt under control was a priority for me, something that I did try to explain several times to Matthias in long emails.

When Ankit decided to leave, to do a PhD and because he was afraid of Matthias’s aggressive management, I finally convinced Matthias that we needed more computer scientists. I posted the announcement on LinkedIn and paid with my credit card (and got reimbursed), to ensure that we would get people.

After many interviews and lots of extra lost time we found Alfonso Sastre https://alfonsosastre.blog, and Yazid Hamdi https://yazid.xyz https://www.linkedin.com/in/yazidhamdi (vetted by Matthias).

In the meantime, I was alone as responsible for the whole infrastructure, and also had to leave for a short while for the adoption of my second son David. Thus updating the normalized data with the new parsers, and ... After Alfonso’s arrival things got even worse, because I think Matthias’s conclusion was not that we really did need extra people, but that clearly I was the problem, so I was shifted away (I worked on the Repository and sped up its response from 20+ seconds to sub-second using Elasticsearch), and Alfonso was put in charge, with Angelo Ziletti https://www.linkedin.com/in/angeloziletti/ angelo.ziletti@gmail.com overseeing him.

In that period also Dr. Danilo Simoes Brambila d.s.brambila@gmail.com helped with the parsing.

I helped even though I had been told to work just on the repository, but still the result was totally predictable: the work had been too much even for me, knowing the whole system, so clearly for someone who had to learn it, it was overwhelming. Thus we lost a capable programmer, and a lot of his work was done under so much pressure (he was not able to push back and insist on sustainability, and sometimes accept missing a deadline), and without really consulting me. So the system began, for example, to rely on non-maintainable copy-pasted bash scripts living outside git, which we had to remove. Obviously Matthias tried to shift the blame to Alfonso, but I think that also that was clearly a management issue, one that wasted a lot of time and energy: not only the time to help and explain things, but also misguided contributions that had to be redone.

After that Matthias was very worried that I would leave (and I thought about it). In the end I cared a lot about the project, I had invested a lot in it, I really shared the vision of making simulation data more accessible, I liked the people involved in it, and I had also left a job that I liked and that paid better for it, so I stayed.

With Yazid (who started later, when Alfonso had already left), he started the same way, but I was really fed up with the situation. Yazid was much more focused on the management aspect of software development, so he quickly understood the situation, despite Matthias trying to shift the blame for problems to me. We also managed to get another person working on the infrastructure: Arvid Ihrig, who was a very good programmer and helped with the repository. Yazid took a more active role in confronting Matthias. This was good for me and Arvid, but I had strictly tried to avoid speaking of the bad/stupid behaviour of Matthias with others, complaining almost exclusively to him, to avoid an us-vs-him approach that would not be healthy (I strongly believe that anybody has the right to some respect, and one should strive to discuss the issues, not the persons). With Yazid the thing was out in the open. I am not sure if it was good: it made ranting against Matthias a usual procedure. This was deserved; still, he was the boss, and his behavior normally had some reason, so I worried that it would not be sustainable. Indeed Yazid left.

We looked for new people, but with less motivation (still costing quite some time), and finally found Markus Schneider. Markus (who seems to have completely disappeared from the FHI site) was not an excellent programmer like Arvid, but was someone that Matthias could convince to work on the infrastructure. Any help was appreciated, but the main issue was another one: Matthias had tried to “sell” the NOMAD repository to Chinese universities, and failed, because the code base was bad (that is another long story), so all his requests focused on fixing the repository. The archive (on which the center of excellence was based, and for which I was still in charge of writing reports) had basically nobody working on it (I was assigned to analytics), but was not finished. His answer to the fact that we would for sure encounter problems was “I see them as one thing”; pointing out that they were two code bases did not help.

I convinced Arvid and Markus to make a last attempt to focus on the project issues, but the discussion with Matthias basically did not take place at all. Markus Schneider actually got ill from the toxic environment created by Matthias and stopped coming to work, and Arvid also decided to leave. The illness of Markus was for me the sign that it was too much: I could not go out and try to convince people to join such a toxic environment, and the willingness to do heroic coding marathons to fix things was not there anymore.

I find it very ironic that Matthias thought that I was not good at managing, while he was making obvious software management mistakes: driving away good people, and ignoring (despite my attempts to explain it) that, especially during heavy development, documentation has limits, and that the code in some way belongs to the people who develop it, not to the manager, something that the open source idea actually expresses; the best way to ensure its future is to grow the number of people developing it. His callousness and forceful management style severely affected the project. I still get angry thinking about how much more could have been done with slightly better management that avoided driving away so many people. I tried to explain it to him several times, and at the end I could not avoid sort of telling him “told you so”, something that made him quite angry. Still, the problems we had were to be expected, actually even deserved.

I guess that a useful trait for being a successful manager is to be able to shift the blame to others, and I think that sadly this is what is at work here. I am not perfect, and I am sure that, just like all the people involved here, he also got scarred by some of the problems, but I also think that there are objective criteria, understandable by any good software manager, that show how much damage (I am sure mostly unintended) was done by Matthias. It is possible that Matthias learned the lessons of NOMAD (he isn’t stupid), and now there is less pressure on the people working on NOMAD related things, but actually I am not willing to find out, so when my contract finished, I did not want to work with him (and I guess that after my “I told you so” the feeling was mutual). Indeed, several people told me that they did not know how I could resist for so long (the answer is that I can withstand pressure, and I had invested a lot in NOMAD); even a PI told me that probably the best thing was to cut NOMAD loose.

There is more, but this is the gist of the whole thing; I can gladly discuss it if needed.