NOMAD post mortem

At the moment this document is not linked from anywhere in the site; it can be reached only through a direct link.

NOMAD as a center of excellence is finished, and I think that it is time to look back at what we have learnt from it. We did a lot of nice things, as can be seen on the official site and in the things I wrote about it.

I think that it is very instructive to evaluate the whole project critically and see what worked, but also (especially) what didn’t work, and where the problems lie. Any large or challenging project has some issues; that is part of the game, and it is especially so when the project is both large and challenging. I know it first hand because I have been involved in several of them: working at Uptime, working on Cp2k, on the Tango library, and in general in the D language community, and working at Nokia and on Qt.

Now, on the things that did not work so well there are bound to be different opinions, and especially on whose fault it was. I have tried to be as objective as I could, and waited some time before writing this, to put things in perspective. Still, I was involved, and thus this is incontestably my view of the issues.

There are many things involved in a large software project, and many books discussing software management. For people in another field, I find that Paul Ford’s What is Code explains well what code and software are, and talks about his experience with software projects.

Technical issues

NOMAD was a challenging project: its goal was to apply Big Data analysis to material science, in particular to theoretical material science. The idea was to use computations performed by the community, and to support as many simulation programs as possible. This greatly increases the difficulty of the problem, because one cannot rely on knowing the purpose or the organization of the calculations, or handle just the small subset one would use in high throughput calculations, but should really try to extract all information from input and output files.

All the extracted information has to be organized in a common and uniform way. It turns out that a large part of the analysis is actually preparing the data for it. Simplifying that part is an excellent way to improve analysis, and the development of the NOMAD meta info is aimed exactly at that.
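
As a rough illustration of what organizing data "in a common and uniform way" can mean in practice, here is a minimal sketch in Python. The code names, the key names and the mapping table are all hypothetical and are not taken from the actual NOMAD meta info; the sketch only shows the idea of translating code-specific output keys into shared names and units, so that later analysis can treat all programs alike.

```python
# Hypothetical sketch: map code-specific keys to shared names and units.
# None of these names are taken from the real NOMAD meta info.

HARTREE_TO_EV = 27.211386245988  # CODATA 2018 value
RYDBERG_TO_EV = HARTREE_TO_EV / 2

# For each (invented) simulation code: native key -> (common name, factor to eV)
MAPPINGS = {
    "code_a": {"E_tot_Ha": ("energy_total_ev", HARTREE_TO_EV)},
    "code_b": {"total energy [Ry]": ("energy_total_ev", RYDBERG_TO_EV)},
}


def normalize(code_name, raw_values):
    """Translate one code's raw key/value pairs into the common names."""
    mapping = MAPPINGS[code_name]
    normalized = {}
    for key, value in raw_values.items():
        if key not in mapping:
            continue  # unknown keys are skipped in this sketch
        common_name, factor = mapping[key]
        normalized[common_name] = value * factor
    return normalized


# The same physical quantity, reported differently by two programs.
a = normalize("code_a", {"E_tot_Ha": -1.0})
b = normalize("code_b", {"total energy [Ry]": -2.0, "unrelated": 3})
```

After this step both results carry the same name and the same unit, so the analysis code no longer needs to know which program produced the value.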

Meta info is something new, and a topic important enough to have its own space. But we were not concentrating just on new things: that is neither possible nor healthy. You want to stand on the shoulders of giants to achieve more; you want to reuse the work of others as much as possible.

Big data and big data analysis have been a hot topic for a while. This makes the field interesting and exciting, but it also means that there are many players, people trying to pump up their technology, and a lot of marketing. Many of these technologies will definitely fail; indeed the field has already changed since the start of the project. Choosing what to build on is not easy, and sometimes one is bound to choose wrongly.

In the technical evaluation page I try to evaluate the technologies chosen, as I feel this is a good point to review them. See also the infrastructure description for an overview of the services we provide.

Management and HR issues

NOMAD was a challenging project, but in my opinion, and for me personally, the biggest issues have not been technical but management related.

The problem

I did think quite a bit about this. All the people involved are intelligent, actually wanted the project to succeed, and were not willfully malicious (despite harboring some doubts at times, I still think this is the best working hypothesis).

Still, some of the mistakes made looked to me like a car crash in slow motion, and have been a big source of frustration for me.

Matthias, the director of the FHI Theory department and coordinator of the whole NOMAD project, is intelligent, and a successful manager of scientific projects. Still, I think he had little experience with software management, he knew that he was going out of his comfort zone with this high profile project, and thus he was afraid of failure. Due to this fear he trusted neither himself nor others on technical aspects, but he was confident of his management ability. His management style uses confrontation and pressure to force one to react, and he is willing to distort reality, create internal competition and isolate people to reach his goals.

This forceful management style can be successful, but it needs a firm grasp of the whole situation; with neither one’s own understanding nor trust in someone who has that understanding, it is not sustainable. Furthermore, failure adds frustration and easily lures one toward a more ruthless and manipulative approach, making matters worse.

Also, such an approach is quite at odds with my style: I have experience with open source projects, where one does not have much coercive power. Indeed, to keep people working who often do things just in their free time, motivation is key. Thus sharing the vision and keeping people motivated is crucial, and the best goal is to have as much openness and inclusion in discussions as possible, tapping also the silent contributors. It is difficult at times: a big issue is overcoming the reticence to speak up, a reticence that only grows as one’s (sometimes only perceived) position grows, and that might hide issues and thus requires constant attention. Confrontation, when well done, avoids this by forcing issues into the open.

What happened

Ankit

There are way too many issues to list, but the short story is that in the first year and a bit, more than 90% of the infrastructure work in WP1 (NOMAD Archive) and WP4 (Analytics) was done by me and by Ankit (supported by me), and we were way overworked.

Several issues were connected with the large burden of managing all parser developers and testing the parsers (30 of the 40 parsers were due after nine months). Administrative tasks, like setting up and giving access to the repositories and the machines, took precious time away from development. We did try to streamline things as much as possible, writing wiki pages and pointing people to them, and adding CI tests, but it remained a burden. This was compounded by the issue of generating normalized data: new sources of data were coming in, and parsed data had to be delivered right away.

The supercomputing centers had little experience with big data and cloud computing. This project was also a chance for them to learn more about it. To ensure that we could develop, we got some virtual machines (but managed like real machines, with provisioning one taking around a day). This left us quite some freedom on what to do, but also shifted the administration, installation and maintenance of most things to us. This was a non-negligible extra burden.

Data generation, cluster.

Computer resources

Still, despite it being very stressful, we managed to reach all the external milestones. The biggest issues, besides Matthias’s management, were connected with the large burden of managing all parser developers and testing the parsers (30 of the 40 parsers were due after nine months).

Matthias misjudged the amount of work involved by an order of magnitude, and was reluctant to add computer scientists close to him, or to create a group that he could not manage or understand. Furthermore, I think that he interpreted my “I could use computer scientists” as a way to highlight how much work I was doing, and not as a real, objective statement (I always try to be as objective as possible, and I do not like to brag).

This misjudgment of the amount of work “to just write some script to extract the data” was highlighted by the plan to quickly write the support for 30 codes, by the absence from the plan of any work on a common parsing infrastructure, and by the discussions we needed before we could actually work on that rather than simply write a sample parser for others to copy. Also, the whole meta info was something that had not been considered. The later “Parsing the data was more work than we [read “I”] thought” is all the concession he gave.

BBDC

After around a year Ankit left, partly because he had planned from the beginning to do a PhD, and partly because, while he did enjoy working with me, he was afraid of the boundary conditions (being overworked, and Matthias’s management style). I found it very funny (and telling of how much he was willing to distort things to fit his narrative) when Matthias later tried to hint that it was my fault that Ankit (and later others) left.

Knowing how badly we needed more people, I asked to post an announcement on LinkedIn (which I paid for with my credit card, and got reimbursed). Still, there was a short period in which I had to leave for the adoption, and then Ankit left and I was alone, so clearly there were problems.

Matthias had the habit of giving me tasks that would have been better done by someone else with my help, and of actively stopping me from being “unfocused” by helping others. He did not understand that helping others to work acts as a multiplier and ensures that knowledge of the system is spread across multiple people. Documentation is important, but in software, and even more in software under development, a lot of the knowledge is in the people. Walling people off really hurts the project, and losing people hurts it even more.

Alfonso

After Alfonso’s arrival things got even worse, because I think Matthias’s conclusion was not that we really did need the people, but that clearly I was the problem, so I was shifted away, and Alfonso was put in charge.

The result was totally predictable: the work was too much even for me, knowing the whole system, so for someone who first had to learn it, it was clearly overwhelming. Thus we lost a capable programmer, and a lot of his work was done under so much pressure, and without really consulting me, that it consists of unmaintainable bash scripts that will have to be replaced.

With Yazid, Matthias started the same way, but I was really fed up with the situation. The most ironic thing was that he was taking all these decisions that clearly damaged the project and setting priorities without asking me, and then the problems still came to me for fixing, only without my having been involved in the decisions, and with me being blamed for obvious mismanagement. Thus I spoke clearly with Matthias.

In the end some damage was already done, and I hope we do not also lose Yazid, so I said the following to Matthias.

About Yazid being the one responsible, I am fine with it; as I said several times, I do not like to manage. I take issue with the saying that I am not able to; I think that comes from a lack of trust and a misconception of software management.

Structuring development is useful, and some structure is needed, but much structuring does not guarantee success. What it does give is better blame protection if things go bad, and that is an important and useful reason to do it.

Still, I very much prefer design and programming to managing, and I think my position should be that of an architect, meaning that I should go around and meddle in various things, do “help desk”, act as a joker to help out when someone is stuck, and do pair programming.

If one manages to motivate people, share the vision, and be genuinely useful to them, one automatically gets involved in the difficult parts. That is very important, because at that point many details are fixed, and computers are much less forgiving when combining pieces developed independently.

This might seem being unfocused, but is one of the most important tasks for a successful large software project, and reduces greatly the need of detailed managing.

This does not mean that it is little work, because good programmers often do not like to interact when they are stuck or deciding the architecture. The best way is to show them that it is good for them too, and that needs a detailed knowledge of the code.

My experience in managing software projects is a bit skewed toward open source projects, where people do not have to work on something and one has little power to force them, but I have applied it very successfully in the private sector too.

I hope that now the importance of such an effort, which ideally goes across the whole project, also involving the encyclopedia, is better appreciated, and ideally officially acknowledged. We lost a lot of things because I was not able to do it properly.

I managed a bit on a small scale,… but it should have been done much better. Involving the users is very important to understand better what they want, and maybe even to make them become contributors.

Matthias is not stupid; I hope he understood.

You know, management can definitely kill a project, but at most it can support success. The practical work and programming is the only thing that determines success in the end, and I am ok to focus especially on that side.

That is why in our discussions I tried to bring up the practical things to do or to plan, rather than discussing management.

I think that the choice to go with OpenShift is a good one, and simplifies our choices.

Short term for the storage I might set up

https://github.com/minio/minio/tree/master/docs/shared-backend

to export GPFS also as S3. This does not need any work on your side, but it is for your knowledge.

With best wishes
Fawzi
(I did not cc Stefan Heinzel, but you can share it with him if you think it relevant)

========

for me it has been extremely frustrating

basic management

large project

why smart person