Introduction to Open Source Workflows

Update: The section about licenses has been expanded to elaborate on more of the differences and clauses in the licenses I mention.

Note: This is written specifically for OSS development, but much of it, if not all, may be of use for teams in closed source software development.

When I finally "got" inheritance as a programming concept, I began using it as a big, giant hammer on absolutely everything I touched, especially on my "advanced" Java calculator applet. As this was about 1-2 months into my programming career, I hadn't heard about good design or best practices. I didn't even understand how exceptions and try/catch worked properly: I had just figured out that if I attempted to do something which would crash the program, I could wrap it in a try/catch and write some error messages. How did I figure this out? Well, I had found a program which reverse engineered jar files and gave back (relatively) sensible source code. I then took a similar Java applet and reverse engineered it back to source code, and there I saw the use of try/catch. What I didn't know was that the variable names, like aa_ and similar, were minified names. Consequently, most of my variables used a similar naming scheme.

So there I sat, with a reverse engineered program, some vague understanding of exceptions, no concept of good program design, a Java book and an idea of how inheritance worked. And I thought: Why not make everything inherit from a single class? That made sense, because many of my calculator parts would need the same set of methods to calculate things. And of course, creating those values and putting them in protected variables within the constructor made sense, so of course every single class constructor called super[1] as the first thing. Did I mention that I did this in Notepad and ran javac *.java manually from the command line?

My ideas about good design and idiomatic code have changed slightly since then. However, when I come across a new concept, I always amaze myself at how bad I am at using that concept properly at first: You would be scared[2]. But honestly, who hasn't been there at some point? A combination of experience and learning is needed to learn X properly, and experience means you'll have to play around with it. And it's sort of like that for open source contribution and development as well: You need experience to get good at it.

You've Got Most of it Already

The good thing about open source contribution and development is that one can be relatively good at it without having done it at all. That may sound a bit strange at first, but it's actually fairly evident. Most of what you need for open source development is experience with programming and the ability to communicate with people. If you are able to combine those skills, then you end up with open source development. Of course, it's not as easy as being good at programming and being good at communicating: You need to be good at communicating with programmers. And that's where we end up with all these constructs you may not necessarily need in a solo closed-source project[3]: Licensing, contribution "policies", issues, version control, todos, changelogs and documentation.

If you notice, all of those things are there to improve communication between developers, not to directly improve how the code runs on a machine. In a sense, those tools won't change the semantics of your code, but they improve the usability of the code. The usability of the code is likely to improve the code itself in the future, due to contributions and the ease of reading the code for other humans.

I think it's worthwhile to take a deeper look into all of these constructs, and I would like to share what kind of choices I have made and what kind of tools I use to ease development for myself, both open source and closed source.

Who Can Use What?

"GNU - A new fragrance by RMS" by Jeff McNeill, CC-BY-SA 2.0

Note: This section shouldn't be considered legal advice. Consider it more like my (perhaps misguided) choice of licenses based on what I call "freedom without 'abuse'".

Whenever you show the world what you've written, you hold all rights to it by default. Consequently, no one can use what you've created without your explicit permission. That's why we have licenses: By attaching a license to our code, we give everyone permission to use that software within what the license allows. Therefore, I find it quite important that you pick a license and consider what license to use. As this blog post is not really about which license you pick (just that you pick one), I'll just quickly go through my choices and rationales. Most of them are based upon a combination of pragmatism and freedom.

Eclipse Public License

The Eclipse Public License is the default license I use for code I write. That's not really a choice I made myself, since this is the default license Leiningen generates whenever you run lein new. However, I came to like it after looking at it. Whether the rationale for that is my objective look at the EPL or cognitive dissonance is another matter.

The EPL grants permission to modify, use (also commercially), redistribute and sublicense modifications. But I can't be held liable for anything, people cannot use trademarks without explicit permission, and derivations of the work must be open sourced. Essentially, it's something in between the public domain, where everything is permitted, and the GPL, where closed source software cannot use it.

Creative Commons

The EPL is great for code, but for things like text (like this blog post) and artistic work, it is not that suitable. I tend to use the Creative Commons Attribution-ShareAlike 3.0 Unported license for these kinds of things. It makes it possible to share the work and alter it, without any restriction on commercial usage. The only things needed are attribution and, as the name says, that altered versions are shared under the same license.

The main difference from the EPL is, well, there's "nothing to hide" here. There's no source code which, when given to e.g. your company's competitors, may disclose your company's secrets. Consequently, it's kind of stupid to assign a software license to this kind of work.

Others

I actually tend to use other licenses from time to time. Whenever I create a generator which takes some input and creates some output, I explicitly say that the output is in the public domain, whereas the source code for generating that output is not[4]. I also tend to use the MIT license whenever I want attribution for an algorithm or data structure, where the code is more of a proof of concept rather than a library.

If you decide to use the MIT license yourself, be aware that the MIT license does not have any patent clause: If some startup contributes code that uses a patent they own to your project, and then gets acquired by some big company, that company may now be able to sue users of your project. The EPL protects against this, so take this into consideration when picking a license for your project.

Version Control

Version control is important, and solves several problems: You are able to revert changes if you figure out that they won't work within the project, multiple people can work on the project concurrently, and you have the history of the project itself at your hands. If a bug was introduced into the project and you know that it wasn't there at a certain point in time, you can do a binary search through the history to find out where it was introduced. But, most importantly for this blog post: People can work concurrently on different problems, without causing trouble for others. They can even work on different problems themselves without causing solution X to interfere with solution Y.
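To make the binary search idea a bit more concrete, here is a minimal sketch in Python. Git automates this kind of search with its git bisect command; the code below is not how Git implements it, just an illustration of the principle. The commit names, the is_bad check and the function name first_bad_commit are all made up for the example, and the sketch assumes that the history is linear, that the first commit is known to be good, that the last one is known to be bad, and that the bug stays once it has appeared.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad(commit) is true.

    Assumptions (hypothetical setup, see the text above):
    - commits are ordered from oldest to newest,
    - commits[0] is known to be good and commits[-1] is known to be bad,
    - once the bug appears, every later commit has it too.
    """
    good, bad = 0, len(commits) - 1
    # Invariant: commits[good] is good and commits[bad] is bad.
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the bug was introduced at mid or earlier
        else:
            good = mid  # the bug was introduced after mid
    return commits[bad]


# A made-up history of eight commits where the bug sneaked in at "f6".
history = ["a1", "b2", "c3", "d4", "e5", "f6", "g7", "h8"]
tested = []


def is_bad(commit: str) -> bool:
    """Stand-in for checking out a commit and running the test suite."""
    tested.append(commit)
    return history.index(commit) >= history.index("f6")


culprit = first_bad_commit(history, is_bad)
print(culprit, "found after testing", len(tested), "commits")
```

The point is that the number of commits you have to check grows with the logarithm of the length of the history, so even a project with thousands of commits only needs a dozen or so test runs.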

The main version control systems used today are Mercurial and Git. I prefer Git, mostly because of GitHub (more on that in a second) and the Magit package in Emacs. Magit makes it possible to use Git within Emacs and speeds up Git usage tremendously if you're an Emacs user. However, there are similar tools for Mercurial, and Bitbucket is able to host Mercurial as well as Git repositories. I would suspect that at this point it's just a matter of taste and what the community you're in uses by default.

GitHub is a web-based hosting service for Git projects which handles issues, pull requests, repository rights, releases and discussion around new features and bugs. I also use its gh-pages feature to host documentation. Some people prefer to move discussions and questions to IRC channels and/or mailing lists, or even to completely different issue tracker systems (like JIRA). However, many people keep the main discussion on GitHub, at least for smaller projects.

Don't be a Jerk: Reply to People

Licensing and tools are the easier things, as those are a one-time only thing (albeit with a learning curve) which usually won't change within a single project. An entirely different beast is to keep the project alive: You need to keep track of bugs, be open to enhancements, preferably ensure the new enhancements are properly documented, release new versions and so on.

For me, it's very, very important to take bug reports and issues seriously. For my own projects, I attempt to reply to issues and pull requests within one day (unless I am extremely busy). If the pull request is trivial, I merge it immediately. If it is harder to figure out whether it's correct or not, I put it on my list of things to do when I have some spare time.

I believe that not responding properly to people's questions, bug reports, feature requests and pull requests when you have the time is disgraceful: People have already invested their time in your project, and ignoring them for some time when they might have spent hours working on a bug fix or a feature implementation just expresses that you believe your hours are worth more than theirs.

That being said, it's completely fine to say *"Well, this looks good, but unfortunately I'm unable to have a closer look before this weekend."*, and it's also completely fine to say *"Thanks for the pull request, I really appreciate that you took the time to implement this. However, I believe that this goes against the vision of this project. I recommend that you set up an enhancement request before implementing it, as I don't want to waste your time."*. The point here is that you communicate with them, and tell them that you value their contribution.

It is also completely fine to get bored of a project, deprecate it, or simply want to focus on other projects. There are ways to turn off pull requests and/or the issue tracker, and putting a big "This library is no longer maintained" on top of the readme file is not much work. Even better, if there are other contributors in the project, ask them if they want to take over. You'll do people a big favour, and you show that you value their time.

Documentation

"Empty Library Shelves" by jvoss, CC-BY-SA 2.0

One of the reasons I wanted to write this blog post was that I am sometimes furious that documentation is lacking in projects. Suddenly I have to dig into the source code and simulate a computer in my head to understand what's going on. That's not good if I just want to understand what a library actually does. Again, it is about valuing people's time.

It's not only documentation on how the library works which is lacking. If you want other people to contribute to the project, it is vital that they understand how the project is designed, and how the different components interact with each other. Draw some figures, and tell them where to start in order to understand the internals. Lacking this is a little bit more excusable than lacking other forms of documentation, but don't expect much help if people don't know where to start.

Another great by-product of documentation is the fact that you get a better overview of the source code yourself. It's thus easier to detect poorly designed components, bugs may be easier to understand and enhancements may be implemented in a better way because contributors may have a better understanding of the system/project/library as a whole.

Create your own War Stories

I think I have talked more than enough about open source workflows now, and if there's anything more I can tell you, it must be that experience is the best way to learn more. Find an interesting project in your favourite programming language and contribute to it. If you cannot find an interesting project in the language which you would consider valuable, create one yourself. If the currently existing tools differ from your intended usage, or if they are, simply put, poorly designed, consider making your own or contributing to them.

Being good at warfare requires theory, but you cannot become excellent without being in a war.

(In Hindsight)

Hey, I was supposed to write something about my open source Clojure workflow, then something happened. I'm not sure how I got from there to an introduction to my open source workflow in general, but oh well. If you've gotten this far, then I'm almost certain it must have been somewhat valuable to you.



I would like to thank Phil Hagelberg (technomancy) for clarifying how the licenses differ, and what I should mention about the licenses I wrote about. Without that advice, I would have forgotten to write about critical differences between the licenses. No, this is still not considered legal advice, but without Phil's help, it could have easily turned into bad advice.


[1] For people not familiar with Java: a super call within a constructor calls one of the parent class's constructors, selected based upon the types of the arguments given.

[2] I tried to learn Erlang while taking what I will now assume is one of the most time consuming courses in the Computer Science/Engineering department at NTNU (Subsymbolic Methods in AI): A rather terrifying experience, if I may say so myself. The result after a semester is a monster of a program: genetica. The program itself started out without any qualified file names (not starting with genetica_) and heavily abused anonymous functions, a very Clojure-ish way of programming. In the beginning it had no gen_servers, nor did it use any other OTP modules or supervisors at all, and I tried to move over to that while using rebar for application management. As a result, half of the code is idiomatic Clojure-style code, whereas the other half is idiomatic Erlang, and it's filled with funny bugs and errors.

[3] I would argue that all these things, except licensing, are very valuable if you're going to spend more than one week away from a solo project. After some time, it is as if some other person wrote it, and you will do yourself a big favour if you document the code properly.

[4] Actually, I would love to get some feedback on whether this is necessary or not. I strongly suspect that it's not needed, but since I am not sure, I just slap in a sentence stating that it is in the public domain.
