Basic bioinformatics using the command line
As sequencing technologies continue to advance and become more accessible, the need for effective tools and pipelines to analyse the vast amounts of data they produce grows as well.
There are numerous freely available online tools for analysing bacterial sequence data, many of which are showcased in the SEQAFRICA courses “Introduction to WGS in AMR surveillance” and “WGS workflow – From isolate to Analysis” available on the SEQAFRICA courses page. However, command-line tools offer a more efficient approach to genomic data analysis: they speed up processing by automating repetitive tasks and let you chain many steps into a single workflow.
Starting down the path of bioinformatics and the command line can feel overwhelming, especially without a strong background in computing or computer science. To help you get started, we’ve compiled a list of freely available introductory courses, along with a list of sites that offer in-person or online courses for a fee. We’ve also included links to useful tools and guides. We highly recommend exploring online university courses, which provide structured learning paths.
Learning the command line is like learning a new language—it can be challenging at first, but don’t be discouraged! Remember, "Google is your friend!" If you run into a problem, chances are someone else has encountered it before, and a quick Google search often leads to helpful solutions.
For further guidance, this paper outlines ten simple rules to help you begin your command-line journey:
Brandies and Hogg, 2021: Ten simple rules for getting started with command-line bioinformatics. PLOS Computational Biology.
The first step
The first step to working with the command line is to locate the terminal on your computer and learn how to navigate it. For this, we recommend The Unix Shell from Software Carpentry. This free site begins with a “Setup” section, showing users how to open a Unix shell across different operating systems. Users can then download the necessary data to complete various lessons on the command line, covering essential topics such as navigation, directories, pipes, loops, and scripts. While the site does not specifically introduce bioinformatics, it provides a step-by-step introduction to using a Unix shell.
A similar alternative is Unix Tutorials for beginners. This tutorial consists of eight lessons featuring written instructions and picture guides. It covers the key topics needed to work with Unix, though it provides only limited guidance on setting up a Unix shell.
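To give a feel for the topics these tutorials cover, here is a minimal sketch that touches navigation, directories, pipes, and loops. The file names and sequences are made up for illustration and are not taken from either tutorial. The sketch assumes a Bash-like shell with the standard mktemp, printf, and grep utilities.

```shell
# Hypothetical example: all file names and sequences are invented.

# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"            # navigation: move into the new directory
mkdir sequences          # directories: create a sub-directory
pwd                      # print the current location

# Create two tiny FASTA files (header lines start with ">").
printf '>seq1\nACGT\n>seq2\nGGCC\n' > sequences/sample_A.fasta
printf '>seq1\nTTAA\n' > sequences/sample_B.fasta

ls sequences             # list what is in the directory

# Pipes: the output of one command becomes the input of the next.
# Here: concatenate all files, then count the header lines.
total=$(cat sequences/*.fasta | grep -c '^>')
echo "Total sequences: $total"

# Loops: run the same command once per file.
for fasta in sequences/*.fasta; do
    n=$(grep -c '^>' "$fasta")
    echo "$fasta contains $n sequence(s)"
done

# Return to the previous directory and remove the throwaway one.
cd - > /dev/null
rm -rf "$workdir"
```

Saving these lines in a file and running it with bash turns them into a script, the last of the topics listed above.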
Beginning to work with computer science
Beginning with command-line bioinformatics is like learning a new language and way of thinking. Understanding how to utilize the computational power available on your PC or a high-performance computing (HPC) system, as well as familiarizing yourself with bioinformatics terminology, is essential. This video covers ten rules for getting started with the command line, explaining important computational terms and concepts that biologists or microbiologists should know when beginning their journey into command-line bioinformatics.
Recommended Coursera courses
Coursera offers courses from leading universities, providing a flexible learning platform where students can watch lectures, take quizzes, and access course material online at any time. You can either take individual courses or enroll in a Coursera Specialization. By enrolling in the Bioinformatics Specialization, you will follow a structured series of bioinformatics courses with a natural progression in difficulty and depth. The courses recommended by SEQAFRICA are free, though certificates upon completion are available for a fee. To receive a certificate for completing a Specialization, you will also need to complete a hands-on project.
If cost is a concern, Coursera offers scholarships and financial aid to help cover the cost of certificates, whether for individual courses or an entire Specialization. You can find more information on eligibility and the application process here.
- Biology Meets Programming: Bioinformatics for Beginners (Coursera course): University of California, San Diego, introduces programming in Python within a bioinformatics context, with no prior coding experience required. The course uses an interactive textbook and exercises to guide students through solving biological problems. The course provider recommends starting with this beginner-level course and then continuing your bioinformatics training by progressing through their Bioinformatics Specialization.
- Bioinformatics Specialization (Coursera Specialization): University of California, San Diego, provides this comprehensive Bioinformatics Specialization for a fee, which can be covered by a Coursera scholarship. The specialization consists of seven courses that introduce students to key topics such as sequencing, DNA replication, and molecular evolution from a bioinformatics perspective. If you opt for the "Honors Track," you will engage in additional exercises that apply the course material to real computational challenges. We recommend this specialization if you need a refresher or are new to bioinformatics and would like to start using command-line tools.
- Genomic Data Science Specialization (Coursera Specialization): Johns Hopkins University offers this Specialization for a fee, which can be covered by a Coursera scholarship. The Genomic Data Science Specialization provides a comprehensive introduction to working with next-generation sequencing (NGS) data using the command line. It introduces students to the Unix environment and teaches them how to use software like R and Python to manage large biological datasets. The Specialization includes six courses, and we especially recommend Course 4: Command Line Tools for Genomic Data Science. This Specialization is ideal for students already familiar with bioinformatics who want to develop practical skills in using command-line tools for data analysis, as the focus is on hands-on bioinformatics rather than theoretical concepts.
- The Unix Workbench (Coursera course): Johns Hopkins University introduces students to using Unix and working in a command-line interface. The course is designed for students with no prior programming experience and does not focus on bioinformatics but rather provides a solid foundation for using Unix in various contexts. Students are also introduced to GitHub and bash scripting, making it an excellent starting point for anyone looking to build command-line skills.
Other courses
Some of the courses on these sites are freely available, while others may have participation fees.
- NGS Academy - Part of the Africa CDC Pathogen Genomics initiative. This site offers a range of courses focused on pathogen surveillance and NGS. The site has both live online courses and recordings of past sessions. The site also provides links to relevant sessions from other organizations covering NGS workflows.
- H3ABioNet - The Pan African Bioinformatics Network for H3Africa regularly provides bioinformatics courses and has other resources available on its site. The site offers training both online and in local classrooms across Africa. Live sessions require students to participate consistently over a period of time to complete the course.
Tools and resources
- Rosalind - A free bioinformatics learning platform that guides users through installing, setting up, and using Python. It provides exercises that accompany the book Bioinformatics Algorithms: An Active Learning Approach by Phillip Compeau & Pavel Pevzner, which is freely available in a non-interactive version.
- Explain shell - A site where you can enter a command line and have it broken down into its separate commands and arguments, each with an explanation.
- Command-line cheat sheets - Downloadable and printable single-page sheets containing commonly used commands for the command line.
- Command-line cheat sheet for Linux - A downloadable and printable single-page sheet containing the most used commands for Linux.