Active network troubleshooting will always be a necessity for broadband and enterprise network operators. Issues can arise that require a network operations center (NOC) team to investigate closely. Having a recording of network traffic at the point of the problem, in the form of network packet captures (“pcaps”), is one of the best tools at an operator’s disposal when troubleshooting these customer issues. What are some of the best ways that operator network teams can gather and record network packets? Since captures can come from many different places and apply to many different teams, what is the best way to work with them and get the most use out of them?
For operators, it’s all about the data
Whether done through automated tools that catch events, or through direct interaction, the ability to investigate and resolve issues remotely - whenever possible - saves operators huge amounts of resources and ensures a quality end-user experience. Naturally, this ability is only as good as the data available to the operations team. When it comes to remote troubleshooting and management, large-scale telemetry combined with machine learning alert and resolution systems can eliminate the bulk of trouble tickets. When something goes wrong and does require the eyes of an analyst, there are many tools available to help actively manage the network from the core all the way to Wi-Fi and the end-user.
Packet captures let you get the details
For operators, network packet captures present an opportunity to gather useful intelligence that can’t be gained through typical network management systems or protocols. This data is useful because it is a record of everything that happened on the network in a given period of time, and contains all of the details that would either be missed by “big data” telemetry processing or left out of a simple alert system.
Some examples include:
- Intermittent connectivity issues that occur sporadically or at a particular time every day can be better analyzed by looking at the network data that was transmitted just before and just after the outage.
- Frequent problems with application performance experienced by an end-user, particularly with streaming services and the UI applications that enable them, can come down to very specific or esoteric issues at the packet level.
- End-user security issues can sometimes only be identified, investigated, and resolved by looking directly at the packets and using historical information to find attacks that may have happened before or are ongoing in other areas of the network.
- When you need to escalate an issue to your vendor, packet captures contain some of the best evidence you need to demonstrate and explain the issue.
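As an illustration of the first example above, here is a minimal sketch of pulling out only the packets recorded shortly before and after a known outage time. It assumes a classic little-endian, microsecond-resolution pcap file (not pcapng); the function names `packets_around` and `build_sample_pcap` are hypothetical helpers written for this example, not part of any product.

```python
import struct

# Classic pcap layout: a 24-byte global header, then one 16-byte record header per packet.
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")  # magic, major, minor, thiszone, sigfigs, snaplen, linktype
PCAP_RECORD_HEADER = struct.Struct("<IIII")     # ts_sec, ts_usec, incl_len, orig_len
PCAP_MAGIC_LE = 0xA1B2C3D4


def packets_around(pcap_bytes, event_time, window_seconds):
    """Return (timestamp, raw_bytes) pairs within +/- window_seconds of event_time."""
    magic = struct.unpack_from("<I", pcap_bytes, 0)[0]
    if magic != PCAP_MAGIC_LE:
        raise ValueError("only little-endian, microsecond-resolution pcap is handled here")
    offset = PCAP_GLOBAL_HEADER.size
    selected = []
    while offset + PCAP_RECORD_HEADER.size <= len(pcap_bytes):
        ts_sec, ts_usec, incl_len, _orig_len = PCAP_RECORD_HEADER.unpack_from(pcap_bytes, offset)
        offset += PCAP_RECORD_HEADER.size
        data = pcap_bytes[offset:offset + incl_len]
        offset += incl_len
        timestamp = ts_sec + ts_usec / 1_000_000
        if abs(timestamp - event_time) <= window_seconds:
            selected.append((timestamp, data))
    return selected


def build_sample_pcap(timestamps):
    """Build a tiny in-memory pcap with one dummy 4-byte packet per timestamp (demo only)."""
    out = PCAP_GLOBAL_HEADER.pack(PCAP_MAGIC_LE, 2, 4, 0, 0, 65535, 1)
    for ts in timestamps:
        payload = b"\x00\x01\x02\x03"
        out += PCAP_RECORD_HEADER.pack(int(ts), 0, len(payload), len(payload)) + payload
    return out


if __name__ == "__main__":
    sample = build_sample_pcap([100, 160, 175, 185, 300])
    for timestamp, data in packets_around(sample, event_time=180, window_seconds=10):
        print(timestamp, data.hex())
```

In practice an analyst would more likely apply a time filter in a capture analysis tool; the point of the sketch is only that a capture preserves per-packet timestamps, which is what makes this kind of before-and-after analysis possible.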
Gathering packet captures in an operator environment
Operators have a much larger - and more heterogeneous - network footprint to handle than a typical enterprise. Furthermore, each part of the network is often handled by an entirely different department, whether it’s the network core, interfaces with wholesale networks, enterprise broadband customers with SD-WAN or cloud services offered by the operator, or residential broadband subscribers themselves, with both retail and operator-provided Customer Premises Equipment (CPE).
This means that the method for gathering packet captures will be different depending on the use case. Here are a few examples.
Using high-speed packet capture
For parts of the network where data is crossing on high-speed links (10G+), maintaining constant packet capture can be resource intensive and not always practical. In these cases, adding a network tap capable of sourcing packets at these speeds is the way to go. There are several products dedicated to this specific use case, given just how hardware intensive it can be. Fortunately, companies like Counterflow provide appliances that can handle recording at these speeds, summarize important data, and send the resulting pcaps to a central location or analysis system.
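To give a rough sense of why constant capture at these speeds is so demanding, here is a back-of-the-envelope sketch of the storage a link can generate. The function name `capture_storage_bytes` and the utilization figures are illustrative assumptions; the estimate ignores per-packet file overhead, so real capture files would be somewhat larger.

```python
def capture_storage_bytes(link_gbps, utilization, seconds):
    """Approximate bytes of packet data a link produces over a period of time.

    link_gbps   -- nominal link speed in gigabits per second
    utilization -- average fraction of the link in use, from 0.0 to 1.0
    seconds     -- length of the capture window
    """
    bits_per_second = link_gbps * 1_000_000_000 * utilization
    return bits_per_second / 8 * seconds


if __name__ == "__main__":
    one_hour = capture_storage_bytes(link_gbps=10, utilization=1.0, seconds=3600)
    print(f"{one_hour / 1e12:.2f} TB per hour at 10G line rate")
```

At full line rate a single 10G link works out to about 4.5 TB per hour, which is why dedicated appliances that summarize traffic and retain only the captures of interest are usually preferred over recording everything.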
Packet capture in the cloud
To troubleshoot services that are cloud dependent and deployed in web-service environments, packet captures need to be gathered on virtual interfaces in the cloud-network infrastructure. This is tricky, as access to packet-level network data isn’t always available in a containerized or virtual server.
One of the ways that this is done in the field is through traffic mirroring, which turns a virtual interface into a tunnel specifically to sink packet data from another interface. In Amazon Web Services deployments, this is done through a feature called VPC traffic mirroring. With the ability to record packet captures in a cloud environment, operators can then take the resulting files and send them to a central location where they can be accessed by analysts.
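With AWS VPC traffic mirroring, mirrored packets arrive at the target wrapped in VXLAN over UDP port 4789, so a collector typically needs to strip that outer header before saving the original frames. Here is a minimal sketch of that step, assuming the UDP payload has already been extracted by whatever receives the traffic; `strip_vxlan` is a hypothetical helper name, not an AWS or CloudShark API.

```python
import struct

VXLAN_UDP_PORT = 4789     # standard VXLAN port, used by mirrored traffic
VXLAN_HEADER_LEN = 8      # flags (1 byte), reserved (3), VNI (3), reserved (1)
VNI_VALID_FLAG = 0x08     # set when the VNI field is meaningful


def strip_vxlan(udp_payload):
    """Split a VXLAN UDP payload into (vni, inner_ethernet_frame)."""
    if len(udp_payload) < VXLAN_HEADER_LEN:
        raise ValueError("payload too short for a VXLAN header")
    flags, vni_and_reserved = struct.unpack_from("!B3xI", udp_payload, 0)
    if not flags & VNI_VALID_FLAG:
        raise ValueError("VNI flag not set; not a valid VXLAN header")
    vni = vni_and_reserved >> 8  # top 24 bits are the VNI, low 8 bits are reserved
    return vni, udp_payload[VXLAN_HEADER_LEN:]


if __name__ == "__main__":
    # Demo payload: a VXLAN header with VNI 12345 followed by placeholder frame bytes.
    demo = bytes([VNI_VALID_FLAG, 0, 0, 0]) + (12345 << 8).to_bytes(4, "big") + b"inner-frame"
    vni, frame = strip_vxlan(demo)
    print(vni, frame)
```

The VNI identifies which mirror session a packet came from, which is useful when several sessions feed one collector and the resulting captures need to be stored separately.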
Packet capture in the CPE
A lot can be learned about end-user performance and experience issues by monitoring the network gateway or Wi-Fi router itself. For enterprise use cases, high-end managed Wi-Fi systems from companies like Meraki, Cradlepoint, and Mist by Juniper have native packet capture built into their products that can be initiated and collected through their management systems.
For residential gateways, many do include native packet capture in their products, but the open-source router operating system known as OpenWRT in particular has a great implementation for gathering pcaps. In addition to the ability to activate them via the UI, OpenWRT can automatically send the pcaps to a system such as CloudShark for storage, organization, and analysis.
Using TR-069/USP to manage packet captures
Many operators use standardized network management protocols like the Broadband Forum’s TR-069 protocol, and its successor, the User Services Platform (TR-369), to manage their CPE base. These protocols work much like enterprise network management protocols like SNMP or NETCONF, but are built specifically for broadband/home CPE use cases.
These protocols use a “data model” to describe the capabilities of a managed device, and to describe the components, commands, and KPIs used to manage configurations, monitor and optimize network links and applications, and manage firmware upgrades. The primary data model today is known as “Device:2”.
Within version 2.13 and later of Device:2 is the PacketCaptureDiagnostics object/command, which can be used to automatically initiate a packet capture on a capable device. Moreover, it can be used to automatically retrieve those captures as part of the process, rather than storing them locally on the CPE.
Here’s how it works. The Device.PacketCaptureDiagnostics. command object (Device.PacketCaptureDiagnostics() in USP) has an argument called “FileTarget”. This can be any URL capable of receiving the resulting packet capture file, and the URL can include the HTTP query arguments used by web application APIs. This makes it a good fit for integration with CloudShark, which has a versatile upload API for receiving a capture into an operator’s CloudShark Enterprise system. CloudShark returns a unique URL for the uploaded packet capture, and that URL is stored in the .Results. table (or returned in an OperationComplete notification in USP).
Putting it all together with CloudShark
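As a rough sketch of how a FileTarget value might be assembled, the example below builds an upload URL with query arguments and places it in a set of command arguments. Several details here are assumptions to check against the CloudShark API documentation and the Device:2 data model: the host name and API token are placeholders, the `/api/v1/<token>/upload` path and the `filename` and `additional_tags` query arguments are assumed, and “Interface” and “Duration” are assumed argument names (only “FileTarget” is named above).

```python
from urllib.parse import quote, urlencode, urlunsplit


def build_file_target(host, api_token, filename, tags):
    """Build an upload URL suitable for use as a FileTarget value.

    The path and query argument names are assumptions for illustration.
    """
    query = urlencode({"filename": filename, "additional_tags": ",".join(tags)})
    path = f"/api/v1/{quote(api_token, safe='')}/upload"
    return urlunsplit(("https", host, path, query, ""))


def build_capture_args(interface, duration_seconds, file_target):
    """Assemble input arguments for the packet capture command.

    Argument names other than FileTarget are assumptions for illustration.
    """
    return {
        "Interface": interface,
        "Duration": str(duration_seconds),
        "FileTarget": file_target,
    }


if __name__ == "__main__":
    target = build_file_target(
        host="cloudshark.example.net",   # placeholder host
        api_token="TOKEN123",            # placeholder API token
        filename="cpe-0001.pcap",
        tags=["ticket-42", "wifi"],
    )
    print(build_capture_args("Device.IP.Interface.1", 30, target))
```

Encoding a ticket number or customer identifier into the tags at upload time means the capture arrives already organized, without an analyst having to label it by hand.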
What should you do with all of these captures once you have them, and what is the best way to work with them? As we mentioned, operators are in an interesting position given that network responsibilities often span many different teams and have points that can use packet captures distributed all over the world.
CloudShark Enterprise was built specifically with these use cases in mind. Here are some best practices available to operators who use CloudShark with these packet capture methods:
- Storage and record-keeping - By taking packet captures from any location and collecting them in one system, network operations teams make sure that valuable troubleshooting information is never lost. Captures can be tagged, sorted, and organized in accordance with a particular issue or customer, and annotated with saved notes that make it easy to keep track of work that has already been done.
- Intra-team and cross-team collaboration - CloudShark allows NOC and SOC team members to work together on packet capture data by sharing views, profiles, filters, and notes directly via a URL. These direct links can be used with trouble ticket systems, security reports, or in interactions with customers. Moreover, team members of all levels of expertise can collaborate through the use of profiles designed to make packet data more meaningful.
- Collaboration with vendors - The storing, tagging, and sharing features of CloudShark make it ideal for working with vendors when vendor escalation is required. A marked-up packet capture shared with links to specific filters, streams, and views lets your team show the evidence of an issue and the work that needs to be done to resolve it.
- Easy access to expert tools - Network and security troubleshooting professionals make use of advanced tools like Wireshark, Zeek, and Suricata IDS. By having these tools built into CloudShark, operators can standardize their use and save the time it takes to install, configure, and train users on them. Moreover, having these deployed as part of a web application eliminates the need to install them on local workstations or individual servers, helping adhere to company software policy.
Ultimately, packet captures are a fundamental part of network troubleshooting and a great asset to network operators of all tiers. What are some of the ways you use captures now? Let us know!