In April, I wrote a blog to celebrate Earth Day titled "Who's Going to Save the Polar Bears? Environmentalists, Politicians, or Engineers?" This blog was the first in a series of "green" blogs about the enterprise networking industry and how we can be more intelligent with energy consumption via technology innovation.
As I slide into the second blog of this continuing series, I hope to give you a “behind the scenes” view of what the Office of the CTO at Extreme Networks is doing to research possible innovation in this arena. We decided to take an engineering, data-driven approach to discover possible ways to conserve energy for our customers. Our engineers have built a metrics collector and have begun collecting data.
In preparation for this project, we talked to interested customers who use a mix of cloud and on-premises network management systems. Our goal is to retrieve quality metrics from a wide spectrum of customer scenarios while enabling maximum flexibility for data collection and deployment options. So, we built a dedicated, on-premises data collection engine to run onsite for a few weeks at some of our most supportive customer locations. We are capturing live data from production networks rather than from our engineering labs. And what is the end game for this data-driven project? Lessons learned from the collected data can then be funneled into our mainstream product portfolio to reduce energy consumption. More specifically, we expect to derive multiple benefits from this project.
Allow me to outline our overall approach for the data collection. We delivered the collector to customers as a VMware OVA. This method solves a few challenges. First, the data can be collected and stored on-premises using a single virtual machine (VM), which enables us to collect data from non-cloud-enabled switches and Wi-Fi controllers. Second, all the required solution components are pre-installed on a single VM, with no external dependencies. Third, it makes installation easy at customer locations: most customers already have a VMware infrastructure and the technical expertise to operate a VM.
As depicted in Figure 1, the collector regularly queries data over SSH or SNMP from switches running either the Switch Engine (EXOS) or Fabric Engine (VOSS) operating system. To extract data from the two supported Wi-Fi platforms (ExtremeCloud IQ Controller and WiNG controller), the collector uses the Wi-Fi solutions' representational state transfer (REST) APIs. The data is stored locally in a time-series database (TSDB). After the data collection phase ends, we can export each database and securely move the data to our research lab. The data from all participating customers will be stored in a central DB and used to run multiple analytical models using both machine learning techniques and manual discovery methods. While validating our initial assumptions, we anticipate gaining many new and valuable insights.
Figure 1 – The metrics collector
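To make the Wi-Fi side of Figure 1 a bit more concrete, here is a minimal Go sketch of polling a controller's REST API. The endpoint path, bearer-token authentication, and response shape are placeholders of my own for illustration, not the actual ExtremeCloud IQ Controller or WiNG API.

```go
// Poll a (hypothetical) controller endpoint for per-AP power readings.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// apStats mirrors an assumed JSON response shape.
type apStats struct {
	Serial string  `json:"serial"`
	Watts  float64 `json:"watts"`
}

func main() {
	client := &http.Client{Timeout: 15 * time.Second}

	req, err := http.NewRequest("GET", "https://controller.example.com/api/v1/ap-power", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer <token>") // auth details vary per platform

	resp, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("unexpected status: %s", resp.Status)
	}

	var stats []apStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		log.Fatal(err)
	}
	for _, s := range stats {
		fmt.Printf("AP %s draws %.1f W\n", s.Serial, s.Watts)
	}
}
```

A production collector would layer retries and per-platform response parsing on top of this basic request-and-decode cycle.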
The collector VM also provides a few dashboards locally so customers can inspect some of the data during the collection process. This onsite visibility provides immediate value during the early phases of this project.
So what data are we collecting and why? In short, we are gathering a broad set of wired and wireless metrics related to power consumption from the real-world production environments of customers.
You might now be asking for further details on the inner workings of this data-collection project. In this blog, I will focus primarily on the technical stack of the collector; as we progress, I will discuss the backend technology in future blog posts. Here is a quick breakdown of the collector components:
The collector is written in Golang because the language has a small footprint and provides a reliable concurrency mechanism called goroutines. Let's say we want to collect metrics from a switch every 5 minutes. For EXOS switches, we decided to use secure shell (SSH) as the communication protocol and run debug commands to grab the metrics. This process can sometimes take longer than a minute per switch. Now apply that to a customer network with 100 switches, and you understand why concurrency is mandatory: collecting data synchronously, the 5-minute interval would already be exhausted after only five or six switches. Goroutines allow us to open multiple SSH sessions simultaneously and thus increase data collection efficiency, as the sketch below illustrates.
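Here is a compact, self-contained sketch of such a collection loop. The switch addresses, credentials, and the "show power" command are placeholders; the real collector runs a set of EXOS debug commands and parses their output.

```go
// A minimal sketch of the concurrent collection loop described above.
package main

import (
	"log"
	"sync"
	"time"

	"golang.org/x/crypto/ssh"
)

// collectSwitch opens an SSH session to one switch and runs a single
// (hypothetical) command.
func collectSwitch(addr string, cfg *ssh.ClientConfig) error {
	client, err := ssh.Dial("tcp", addr, cfg)
	if err != nil {
		return err
	}
	defer client.Close()

	session, err := client.NewSession()
	if err != nil {
		return err
	}
	defer session.Close()

	out, err := session.CombinedOutput("show power") // placeholder command
	if err != nil {
		return err
	}
	log.Printf("%s: collected %d bytes", addr, len(out))
	return nil
}

func main() {
	switches := []string{"10.0.0.1:22", "10.0.0.2:22"} // ... scale to 100+
	cfg := &ssh.ClientConfig{
		User:            "admin",
		Auth:            []ssh.AuthMethod{ssh.Password("secret")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable in a sketch, not in production
		Timeout:         30 * time.Second,
	}

	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		var wg sync.WaitGroup
		for _, addr := range switches {
			wg.Add(1)
			go func(addr string) { // one goroutine per switch
				defer wg.Done()
				if err := collectSwitch(addr, cfg); err != nil {
					log.Printf("%s: %v", addr, err)
				}
			}(addr)
		}
		wg.Wait() // every switch finishes well within the interval
		<-ticker.C
	}
}
```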
We are collecting metrics at regular intervals (for example, every 5 minutes), which adds up to thousands of metrics per minute on a typical customer network. Therefore, we require a DB that efficiently handles storing and querying time-series data. So, we ended up going with TimescaleDB, a time-series database built as an extension of PostgreSQL.
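As a rough illustration of how collected samples could land in TimescaleDB, here is a short Go sketch using the standard database/sql package. The table layout, column names, and connection string are assumptions for the example, not our actual schema.

```go
// Store and aggregate power samples in TimescaleDB via plain PostgreSQL SQL.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // TimescaleDB speaks the PostgreSQL wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgres://collector:secret@localhost/metrics?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// An ordinary table...
	if _, err := db.Exec(`CREATE TABLE IF NOT EXISTS power_metrics (
		time   TIMESTAMPTZ NOT NULL,
		device TEXT        NOT NULL,
		watts  DOUBLE PRECISION)`); err != nil {
		log.Fatal(err)
	}
	// ...turned into a hypertable, which TimescaleDB partitions by time.
	if _, err := db.Exec(
		`SELECT create_hypertable('power_metrics', 'time', if_not_exists => TRUE)`); err != nil {
		log.Fatal(err)
	}

	// One sample per device per collection interval.
	if _, err := db.Exec(
		`INSERT INTO power_metrics (time, device, watts) VALUES ($1, $2, $3)`,
		time.Now(), "switch-10.0.0.1", 87.5); err != nil {
		log.Fatal(err)
	}

	// time_bucket() is the aggregation primitive behind views like the
	// right-hand panel in Figure 2.
	rows, err := db.Query(`SELECT time_bucket('1 hour', time), device, avg(watts)
		FROM power_metrics GROUP BY 1, 2 ORDER BY 1`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var bucket time.Time
		var device string
		var avg float64
		if err := rows.Scan(&bucket, &device, &avg); err != nil {
			log.Fatal(err)
		}
		log.Printf("%s %s %.1f W avg", bucket.Format(time.RFC3339), device, avg)
	}
}
```

Because TimescaleDB is packaged as a PostgreSQL extension, any standard PostgreSQL driver works unchanged.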
Because we wanted to provide our supportive customers with immediate value from this project, we decided to add Grafana to the OVA. Grafana is a multi-platform, open-source analytics and interactive visualization web application. It has built-in support for TimescaleDB and allows for the visualization of time-series data in very powerful and customizable ways. As seen in Figure 2, the chart displays time-series data in the left panel and the same data in aggregated form in the right panel. The right panel uses a standard table visualization with an override on the kWh column that applies the LCD gauge cell display mode with custom, percentage-based thresholds for coloring.
Figure 2 – Grafana visualization dashboard
In addition to displaying metrics, Grafana can also be used to work with application logs in a more powerful and visual way if we add Loki to the tech stack. Loki is a log aggregation system designed to store and query logs from all your applications and infrastructure. Our main application is the collector, so we configured its Docker container to use the Loki Docker logging driver to forward logs to the Loki service. Figure 3 shows a simple example of how we can query all "ERROR" logs, filtered to the "collector" Docker container, within the last 7 days. Grafana charts the occurrence frequency of those ERROR logs over time and displays the detailed log lines.
Figure 3 – Loki, log file aggregation
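As a side note, the query shown in Figure 3 can also be issued directly against Loki's HTTP API, which is handy for scripting. Here is a small Go sketch; the Loki address and the container_name label are assumptions based on a default Loki Docker driver setup.

```go
// Query Loki's query_range endpoint for ERROR lines from the collector
// container over the last 7 days.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"time"
)

func main() {
	params := url.Values{}
	params.Set("query", `{container_name="collector"} |= "ERROR"`) // LogQL
	params.Set("start", fmt.Sprint(time.Now().Add(-7*24*time.Hour).UnixNano()))
	params.Set("end", fmt.Sprint(time.Now().UnixNano()))

	resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + params.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // raw JSON; Grafana renders the same data as in Figure 3
}
```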
We can even have Grafana display the "context" of each ERROR by showing the logs that occurred immediately before and after it. This is very impressive, and I highly recommend it, especially once an application grows more complex and consists of more than a few containers!
Every component of our tech stack runs as a Docker container, which keeps the components isolated and the deployment reproducible. We use Docker Compose to build, start, and stop the application, the Docker network, and all Docker volumes. It still amazes me how easy it is to upgrade the application after it has been deployed at a customer site: simply run docker compose pull to get the latest version of the images and then docker compose up -d to restart only those containers that have a newer image. That's it.
We have begun our data collection effort. Do you want to participate? If you are an Extreme customer, you can find more information on the project's landing page.
You can help us understand energy use, lower your own energy consumption, and help save the polar bears!
I'm looking forward to my next blog in this series, where we will start looking at the collected data and see which insights we can derive. We invite you to join us on this journey over the next few months to learn from our results. Stay tuned.