I spent a while trying to reclaim disk space from an on-premises Nexus Repository Manager instance, and got stuck when delete requests took a surprisingly long time.
The Nexus instance is an old version, running on an old OS, old hardware and an old JVM, simply because we haven’t got round to upgrading it yet. So this is not me complaining about Sonatype; instead, I want to leave a search-engine-friendly explanation of the problem I found, in case anyone else experiences it. It has been raised as a bug in Sonatype’s JIRA, but that isn’t publicly accessible (yet; I’m not sure if it will be). I don’t think there are any security concerns around the bug, so I’ve repeated the report here verbatim in the hope it helps someone else.
tl;dr: deleting any component takes more than 15 minutes; deleting all the assets belonging to it takes less than a second.
We run Nexus Repository Manager OSS on-premises, with local disk storage. On finding disk space was about to run out, I set about deleting unused artifacts. We do not distinguish between SNAPSHOT and RELEASE artifacts, because we are (mostly) continuously deploying new versions, so I didn’t feel the existing cleanup tasks fit the bill. I also was not able to confirm it would be safe to delete all artifacts under a published-at or last-accessed-at policy; the fear is that some projects may not have been built in a while, but still need a given library available. I opted for removing versions of artifacts that I knew were safe to delete (via special knowledge, nothing systematic).
I tried issuing HTTP DELETE requests with a component ID, and also clicking “Delete Component” in the Web UI. In both cases the response would not return for many minutes (if at all) and the Nexus instance would experience high sustained CPU load, sometimes to the point where it could not serve artifacts. In some cases, shortly after issuing the DELETE, a GET with the same ID would return nothing, which suggests there is some lingering task in the delete that blocks the HTTP response but still actually gets the job done. In some cases the server would not recover without a service restart.
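For reference, the component delete looked roughly like the sketch below. This is an illustration rather than our exact script: the hostname and credentials are placeholders, and it assumes the Nexus 3 REST API under /service/rest/v1/ (older 3.x releases exposed similar endpoints under /service/rest/beta/), driven here with Python’s requests library.

```python
# Rough sketch of the slow path: deleting a whole component by ID.
# NEXUS_URL and AUTH are placeholders; the endpoint path may differ
# on older Nexus 3 versions (beta vs v1).
import requests

NEXUS_URL = "https://nexus.example.com"     # placeholder hostname
AUTH = ("admin", "not-the-real-password")   # placeholder credentials

def delete_component(component_id: str) -> None:
    # REST equivalent of the Web UI's "Delete Component" action.
    # On our instance this request would not return for 15+ minutes,
    # if it returned at all.
    resp = requests.delete(
        f"{NEXUS_URL}/service/rest/v1/components/{component_id}",
        auth=AUTH,
    )
    resp.raise_for_status()
```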
During delete-component requests, there would be a series of active threads with names like “Thread-10723”.
We use one default blobstore for all repositories; we proxy several public repositories (Central, jcenter, npm), have one private Docker repository with Amazon ECR, and several private repositories of type maven2 or npm. I only attempted to delete components from a maven2 repository. The blobstore was approximately 180 GB, and the same behaviour occurred on components with both a very small and a very large number of versions (from 1 to 4k). Components usually had around 8-10 assets attached to them (jars, checksums, etc.) and no individual asset was larger than 10 MB. I wasn’t able to update to the latest version to find out whether the problem still exists on recent releases.
Unlike deleting components, deleting the assets directly is very fast. Since I could traverse search results to get the asset IDs via the components I wanted to delete, it wasn’t difficult to script, and I have been able to reclaim the space.
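A minimal sketch of that workaround is below. Again, this is illustrative rather than the exact script we ran: it assumes the v1 search and assets endpoints of the Nexus 3 REST API (they may sit under /service/rest/beta/ on versions around 3.16), and the URL, credentials and Maven coordinates are placeholders.

```python
# Sketch of the workaround: page through the search API for the component's
# versions, collect the asset IDs, and delete each asset directly.
import requests

NEXUS_URL = "https://nexus.example.com"     # placeholder hostname
AUTH = ("admin", "not-the-real-password")   # placeholder credentials

def delete_assets(repository: str, group: str, name: str,
                  version: str | None = None) -> None:
    params = {"repository": repository, "group": group, "name": name}
    if version is not None:
        params["version"] = version
    while True:
        page = requests.get(
            f"{NEXUS_URL}/service/rest/v1/search", params=params, auth=AUTH
        )
        page.raise_for_status()
        body = page.json()
        for component in body["items"]:
            for asset in component["assets"]:
                # Deleting an individual asset returns in well under a second,
                # unlike deleting the component that owns it.
                requests.delete(
                    f"{NEXUS_URL}/service/rest/v1/assets/{asset['id']}",
                    auth=AUTH,
                ).raise_for_status()
        token = body.get("continuationToken")
        if not token:
            return
        params["continuationToken"] = token

# Example: remove one version of a library we knew was safe to delete.
# delete_assets("releases", "com.example", "obsolete-library", "1.2.3")
```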
Because the setup uses nothing up-to-date (hardware, OS, JVM, Nexus version), I’m not sure how much value this bug report has. My motivation for raising it is to leave a breadcrumb for other poor souls who find themselves needing to clear space fast from an opaque blobstore, and are unsure how to work around the problem that deleting a single component takes more than 15 minutes. The answer appears to be to delete the underlying assets instead.
Environment: Nexus version 3.16.2, Linux (Ubuntu Trusty 14.04) virtual machine, ext4 filesystem, spinning disk, 2x Intel Westmere CPUs at 2.6 GHz, 2 GB Java heap, 4 GB MaxDirectMemorySize, 12 GB of addressable memory.
For several years I’ve been a fan of the analogy of Mechanical Sympathy as it applies to software development*. I’ve decided to make the analogy more concrete by finding out what pressing the pedals has been doing all these years I’ve been driving. Out of my way, Jackie Stewart: I’m taking a class in basic car maintenance.
Last night was the first class in an eight-week beginner’s Car Maintenance course at Glasgow Clyde College. By the end of it I should understand how to look after our family cars and perform basic repairs. I’m hoping that will translate into being a better driver as well; not that I’ll be drag racing the local teenagers, but it might mean detecting problems sooner, driving more safely, and possibly extending the life of an expensive and depreciating asset.
The workshop we’ll learn in was pretty impressive given it’s on a college campus; it would not have looked out of place in a big-brand repair shop. The instructor, Gordon, was particularly keen to insist that there are no silly questions: we’re there to learn, so it’s important to ask. I kind of enjoyed the freedom of deciding that “yup, I am a complete novice, I don’t care about looking stupid, if I’m clueless I’m asking”.
I’ll be better placed to recommend the course by the end of the eight weeks, but so far I’m pretty excited and optimistic.
* Footnote: I think I’m able to claim (with no hipster irony at all, no sir, honest) that I liked the analogy of “Mechanical Sympathy” in software before it was cool, having been introduced to it at a software conference way back in 2011.
tl;dr: usability testing isn’t just for customer-facing websites; it works for internal developer tools too.
Recently I was working on setting up an internal tool[0] for myself and fellow developers to test Logstash filters. We are currently upgrading the ELK stack we use for centralised logging, and I wanted to increase confidence in the Logstash filters we deployed. Previously, our Logstash filter configuration ran to about 900 lines of code. New changes could not be tested without being deployed. There was nothing to catch a regression except a human who happened to notice.
I predicted that if I wanted to increase confidence, it was not enough to make it possible to test changes to filters; it had to be easy. This, I felt, would be crucial to maintaining test coverage over time: my fellow developers must be able to discover how to add tests, and find the process pain-free enough to maintain the habit.
To help me with this, I considered how I would get the feedback I needed. Pair programming and code review can be immensely useful, but they tend to focus on implementation. What I needed was feedback on how discoverable the tool was. I wanted to know: “can a developer, tasked with changing the Logstash filter configuration, discover all they need without any prior knowledge of the tool?”
Enter usability testing.
One of the first hits when searching for “usability testing” contains this description:
"Usability testing is a method used to evaluate how easy a website is to use. The tests take place with real users to measure how ‘usable' or ‘intuitive' a website is and how easy it is for users to reach their goals"
Replace “website” with “developer tool” and the principle remains the same. I might not be designing the website our paying customers use, but I still have users. Users who have a goal. With that mindset, I decided to conduct a usability test with some of my intended users, to observe how easy they found it to reach their goal.
Conducting the Usability Test
My test subjects were my two colleagues, Adam and Gary[1]. I ran the test individually with each of them, using video chat, which also allowed them to share their screen for me to observe. My process was as follows:
That was basically it. I didn’t give any further direction, didn’t introduce them to the tool, where they could get it, how it worked, or anything. I just shut up, listened, and took notes.
Then I got to learn which parts were easy, such as: discovering the documentation; finding existing test cases; how to execute test cases. Even confirming that the name of the git repository is discoverable is useful — after all, naming is not easy.
I also got to learn what was hard: the biggest challenge was setting up test data and expected output.
I only intervened a couple of times, when some aspect of the tooling had been misunderstood and they had strayed off the path far enough that I wasn’t gaining any insight into what I was trying to test. For example, at one point while trying to find an example input, Adam began querying ElasticSearch and working out how to form the right query. But at that point the log is the output: it has already been transformed by the Logstash filter and by ElasticSearch, so it wouldn’t serve as a representative example of input. I felt there was no point in letting Adam continue down that path; it was obvious the tooling needed to improve, and we could talk later about potential improvements.
Both Adam and Gary were able to complete the task, which is a good indication that several aspects of the tooling were discoverable. The interventions that were required indicated that other aspects needed to be improved.
With the task complete, and Adam and Gary both introduced to the tool, they could never be “new users” again. The old adage “You only get one chance to make a first impression” applies here[2]. Since I couldn’t use them as test subjects in the same way again, I took the opportunity to describe features or documentation they had missed, and to ask their opinion on how it could be improved and made more useful and discoverable. Those changes can then be tested in future, when I find more blank canvases (eager developers) to run the usability test with.
Lessons learned
So if you’re working on internal tools or APIs, don’t forget that while you may not have direct customers, you (hopefully) have users. Watching them try to interact with what you produce is illuminating, insightful and motivating. I encourage you to try it.
[0] why this is an internal tool, and how I implemented it, is a whole ‘nuther blog post.
[1] names unchanged to condemn the guilty.
[2] not until I can get my hands on a Neuralyzer
Originally published on devblog.timgroup.com
ABOUT GRUNDLEFLECK
Graham "Grundlefleck" Allan is a Software Developer living in Scotland. His only credentials as an authority on software are that he has a beard. Most of the time.