Troubleshooting Common Issues in Embedded Systems and Network Programming

Aniruddha Amit Dutta
5 min read · May 22, 2024

In embedded systems and network programming, many challenges can arise while developing and maintaining software. They range from memory management issues and synchronization problems to incorrect configurations and network-related errors. In this article, we will explore a series of common issues encountered in embedded and network systems, along with their respective solutions. By understanding these problems and their fixes, developers can enhance the stability, efficiency, and reliability of their systems.

Challenges faced:

1. An object created by a user on the node was saved in both the database and RAM; after a reboot, the object was not restored in RAM.

Solution: Code a replay function that runs on reboot, reads the values from the database, and sets the appropriate values in RAM.

2. Object parameters set by the user on the chipset were not read back from the database and reapplied to the chipset after a reboot.

Solution: Code a replay function that runs on reboot, reads the values from the database, and sets them on the chipset.
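
The replay idea from items 1 and 2 could look roughly like the sketch below. The record layout, the callback type, and the names replay_records and apply_to_ram are all hypothetical, and the array of records stands in for whatever the real database read returns.

```c
#include <stddef.h>

#define RAM_TABLE_SIZE 4 /* hypothetical size of the in-RAM object table */

/* Hypothetical persisted record: one object and its saved value. */
struct db_record {
    size_t object_id;
    int value;
};

/* Applies one record to its target (RAM or chipset). Returns 0 on success. */
typedef int (*apply_fn)(const struct db_record *rec, void *ctx);

/* Walks every record read from the database and pushes it to the target.
 * Returns the number of records applied successfully. */
size_t replay_records(const struct db_record *records, size_t count,
                      apply_fn apply, void *ctx)
{
    size_t applied = 0;

    if (records == NULL || apply == NULL) {
        return 0;
    }
    for (size_t i = 0; i < count; i++) {
        if (apply(&records[i], ctx) == 0) {
            applied++;
        }
    }
    return applied;
}

/* Example target: restore the value into a RAM table passed through ctx. */
int apply_to_ram(const struct db_record *rec, void *ctx)
{
    int *ram_table = ctx;

    if (ram_table == NULL || rec->object_id >= RAM_TABLE_SIZE) {
        return -1; /* unknown object: skip it rather than corrupt memory */
    }
    ram_table[rec->object_id] = rec->value;
    return 0;
}
```

A chipset replay would use the same loop with a different callback that calls the chipset driver instead of writing to a table.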

3. The name of the log file in which the existing logs were being recorded was changed, so those logs were no longer getting printed.

4. A uint8 variable was used to store a uint32 value. The value was truncated, leaving the uint8 variable set to 0, so the part of the code guarded by an if statement on that variable was not executed.

5. The node crashed because a pointer to an object was NULL and the code had no NULL check before accessing that object's member variables.

Solution: Add appropriate NULL checks, with corresponding log statements, to avoid the crash.
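
A minimal sketch of such a check is shown below. The port_config structure and the get_port_speed function are hypothetical and only serve to show the pattern: validate, log, and return an error instead of dereferencing.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical object type used only for this example. */
struct port_config {
    int port_id;
    int speed_mbps;
};

/* Reads the speed from cfg into *speed_out.
 * Returns 0 on success, -1 if either pointer is NULL. */
int get_port_speed(const struct port_config *cfg, int *speed_out)
{
    if (cfg == NULL || speed_out == NULL) {
        /* Log and bail out instead of crashing on the dereference below. */
        fprintf(stderr, "%s: NULL argument, skipping\n", __func__);
        return -1;
    }
    *speed_out = cfg->speed_mbps;
    return 0;
}
```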

6. Memory allocated for a message in the message queue was not deallocated after the message was received at the destination. Heap memory kept filling up, and the node eventually ran out of memory.
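
One way to avoid the leak in item 6 is to give every message a single owner: the sender allocates it, and the receiver frees it once processing is done. The sketch below assumes a simple heap-allocated message; the struct and function names are hypothetical, and the real queue API is not shown.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical heap-allocated message passed through the queue. */
struct message {
    size_t len;
    char *payload;
};

/* Sender side: allocate the message and copy the payload into it. */
struct message *message_create(const char *data, size_t len)
{
    struct message *m;

    if (data == NULL && len > 0) {
        return NULL;
    }
    m = malloc(sizeof *m);
    if (m == NULL) {
        return NULL;
    }
    m->payload = malloc(len > 0 ? len : 1);
    if (m->payload == NULL) {
        free(m);
        return NULL;
    }
    if (len > 0) {
        memcpy(m->payload, data, len);
    }
    m->len = len;
    return m;
}

/* Releases both allocations. Safe to call with NULL. */
void message_destroy(struct message *m)
{
    if (m != NULL) {
        free(m->payload);
        free(m);
    }
}

/* Receiver side: process the message, then free it on every path.
 * Returns the number of payload bytes handled. */
size_t handle_message(struct message *m)
{
    size_t handled = (m != NULL) ? m->len : 0;

    /* ... real processing of m->payload would go here ... */
    message_destroy(m);
    return handled;
}
```

Running the daemon under a tool such as valgrind on a development host is a common way to confirm that the receive path no longer leaks.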

7. The correct bitmask was not applied while reading data from a hexadecimal packet, so the value in the packet was not read correctly.
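
As an illustration of item 7, the sketch below extracts two fields from one header byte. The layout (top 3 bits for priority, low 5 bits for type) is an assumption made for the example and does not come from any particular protocol.

```c
#include <stdint.h>

/* Hypothetical layout of one header byte:
 * bits 7..5 = priority, bits 4..0 = message type. */
#define PRIORITY_MASK  0xE0u
#define PRIORITY_SHIFT 5
#define TYPE_MASK      0x1Fu

/* Mask first, then shift, so neighbouring bits cannot leak into the field. */
uint8_t get_priority(uint8_t header)
{
    return (uint8_t)((header & PRIORITY_MASK) >> PRIORITY_SHIFT);
}

uint8_t get_type(uint8_t header)
{
    return (uint8_t)(header & TYPE_MASK);
}
```

For the byte 0xA3 (binary 1010 0011), get_priority returns 5 and get_type returns 3. A wrong mask such as 0xF0 would pull one bit of the type into the priority field.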

8. The stdc++ flag was not set in the makefile, so the build failed with compilation errors.

9. During cross-compilation, so many objects were being linked into the executable file that the Global Offset Table could not reference all of them (R_MIPS_GOT_PAGE error).

Solution: Remove objects that are not required from the makefile.

10. Two threads were accessing global memory simultaneously because no semaphore was being used, which led to race conditions.

Solution: Use a semaphore or mutex lock while accessing or modifying global memory.
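
A minimal sketch of the mutex approach with POSIX threads follows. The shared counter and the function names are hypothetical; the point is that every read-modify-write of the global happens between lock and unlock.

```c
#include <pthread.h>
#include <stddef.h>

/* Global state shared by both threads, protected by counter_lock. */
static long shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each worker increments the shared counter `iterations` times. */
static void *worker(void *arg)
{
    long iterations = *(const long *)arg;

    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++; /* critical section: one thread at a time */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two workers in parallel and returns the final counter value,
 * or -1 if a thread could not be created. */
long run_two_workers(long iterations)
{
    pthread_t a, b;

    shared_counter = 0;
    if (pthread_create(&a, NULL, worker, &iterations) != 0) {
        return -1;
    }
    if (pthread_create(&b, NULL, worker, &iterations) != 0) {
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

With the lock in place, run_two_workers(100000) returns 200000 on every run. Without it, the two increments can interleave and the total usually comes out lower. On older toolchains the program needs to be built with -pthread.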

11. The RabbitMQ queue for alarms was filling up quickly with messages from different network devices, so some alarms were not reported to the node UI.

Solution: Create multiple queues to handle different alarms and events.

12. During packet creation for traffic, the priority bit flag was not set correctly, so traffic was not getting through.

13. The data on the node UI was displayed to the user from database 2, which was a replica of the original database 1. The rsync call was not getting executed when the data was being written to database 1, so some data was missing for the user in the UI.

14. Alarms were being reported for both the raise and clear conditions within milliseconds of each other. We added a threshold function so that an alarm is not reported on the UI if it is raised and cleared for the same network element within a span of 300 milliseconds.
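
The threshold function from item 14 could be sketched as below. This is one possible design, not the original implementation: the raise is held back until it has lasted 300 ms, and a clear that arrives earlier drops the pair silently. The alarm_state structure and function names are hypothetical, with one state kept per network element and timestamps supplied by the caller in milliseconds.

```c
#include <stdbool.h>
#include <stdint.h>

#define ALARM_HOLD_MS 300u /* threshold taken from the 300 ms rule above */

/* Hypothetical per-network-element alarm state. */
struct alarm_state {
    bool raised;
    uint64_t raised_at_ms;
};

/* Record the raise, but do not report it to the UI yet. */
void alarm_raise(struct alarm_state *s, uint64_t now_ms)
{
    s->raised = true;
    s->raised_at_ms = now_ms;
}

/* True once the alarm has stayed raised for the whole hold time,
 * which is when it is safe to show it on the UI. */
bool alarm_is_reportable(const struct alarm_state *s, uint64_t now_ms)
{
    return s->raised && (now_ms - s->raised_at_ms) >= ALARM_HOLD_MS;
}

/* Clear the alarm. Returns true if the raise had become reportable,
 * meaning the UI also needs the clear event; false means the whole
 * raise/clear pair was a short flap and can be dropped. */
bool alarm_clear(struct alarm_state *s, uint64_t now_ms)
{
    bool was_reportable = alarm_is_reportable(s, now_ms);

    s->raised = false;
    return was_reportable;
}
```

The trade-off of this design is that every genuine alarm reaches the UI up to 300 ms late, which is usually acceptable for operator-facing displays.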

15. If one network element, say A, goes down and a bulk operation is performed on network elements A, B, and C, how do you sync data across all the network elements?

16. A network element hung because of the large number of log statements in the code.

Solution: Add trace statements or logs that can be enabled dynamically for debugging.
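
A simple way to make logs dynamically switchable is a runtime log level that is checked before any formatting happens. The sketch below is an assumption about how this might be structured; the level names and the functions log_set_level and log_msg are hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical verbosity levels, lowest to highest. */
enum log_level { LVL_ERROR = 0, LVL_INFO = 1, LVL_DEBUG = 2 };

/* Only errors are printed by default, so normal operation stays quiet. */
static enum log_level current_level = LVL_ERROR;

/* Called at runtime (for example from a CLI or signal handler hook)
 * to raise or lower verbosity without rebuilding. */
void log_set_level(enum log_level level)
{
    current_level = level;
}

/* Prints the message only if its level is enabled.
 * Returns the number of characters written, or 0 if it was filtered out. */
int log_msg(enum log_level level, const char *fmt, ...)
{
    va_list args;
    int written;

    if (level > current_level) {
        return 0; /* filtered: no formatting or I/O cost */
    }
    va_start(args, fmt);
    written = vfprintf(stderr, fmt, args);
    va_end(args);
    return written;
}
```

In normal operation only LVL_ERROR messages are written; calling log_set_level(LVL_DEBUG) while reproducing an issue turns the detailed traces on.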

17. Network element A was continuously sending packets over TCP to network element B, which could not process them and kept storing them in its queue. Network element B then hung.

Solution: Use the command netstat -nalp | grep <daemon_name> | grep tcp to check on which port the packets are not getting processed. The tcpdump command is another option.

18. A deadlock occurred in a process where thread A had locked a resource using a mutex and did not release the lock, so threads B and C waited indefinitely for the resource to be released.

Solution: Use thread apply all bt in gdb to check which threads are waiting for which semaphore or mutex to be released. Also check the logs: if the logs of certain threads are no longer getting printed, those threads may be stuck in the deadlock.

19. A new value was added to the database and the node was brought up from the newly compiled build, but the old database was not deleted. The node went into a continuous reboot because it could not find the new value in the database.

Solution: Remove the old database and let the newly compiled build create it afresh.

20. A network element was assigned the IP address of the gateway, so all other network elements in that subnet were no longer accessible to the user.

21. htonl and ntohl were not used when passing a parameter from the host to the chipset, so the parameter value was not read correctly on the chipset.

Solution: Make sure to use htonl and ntohl to avoid little-endian and big-endian issues.
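
The sketch below shows the pattern for a 32-bit parameter, assuming the chipset expects network (big-endian) byte order. The helper names put_u32 and get_u32 are hypothetical; htonl and ntohl are the standard POSIX functions from arpa/inet.h.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Writes a 32-bit host value into buf in network (big-endian) byte order. */
void put_u32(uint8_t *buf, uint32_t host_value)
{
    uint32_t net_value = htonl(host_value);

    /* memcpy avoids alignment problems on strict-alignment CPUs. */
    memcpy(buf, &net_value, sizeof net_value);
}

/* Reads a 32-bit network-order value from buf and returns it in host order. */
uint32_t get_u32(const uint8_t *buf)
{
    uint32_t net_value;

    memcpy(&net_value, buf, sizeof net_value);
    return ntohl(net_value);
}
```

With these helpers, put_u32(buf, 0x11223344) always stores the bytes 11 22 33 44 in that order, whether the host CPU is little-endian or big-endian.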

22. A lot of unused (deprecated) code was present in multiple files and had not been removed. This caused issues when developing new features.

Solution: Remove unused code, or add comments stating that this part of the code is not in use.

23. A blocking call was not used for an atomic operation, which led to a mismatch between two databases.

Solution: Write a periodic sync function, or make the call blocking so that changes can be reverted from the database if the operation is not successful.

Some good practices to follow:

  1. Always ask the debugging team to specify the software version in which the bug occurred before debugging.
  2. While debugging a live issue on a network element, always check whether relevant alarms are present on the node.
  3. Before coding, make sure to write down the edge cases to be handled.

Addressing and resolving issues in embedded systems and network programming is critical to ensuring smooth and reliable operation. By implementing the solutions discussed, such as coding replay functions, adding NULL checks, managing memory effectively, and ensuring proper synchronization, developers can mitigate common pitfalls and improve overall system performance.

I would love to hear what kind of challenges you faced during your coding journey.

Aniruddha Amit Dutta

Software Engineer @Tejas Networks | Writer | Cloud and DevOps Enthusiast