Shrawan Baral|Computer Engineer|AI Enthusiast|Cybersecurity|Web Developer

Design Patterns for Fun

shrawan baral — Sat, 22 Jul 2023 04:36:29 GMT

Howdy Everyone! 🤠

I always try to find a pattern while learning and implementing something. Patterns make operation, structure and behavior predictable. For instance, if you know how academic writing is structured, you can easily predict what is likely to come next in a paragraph or paper.

Design patterns specifically are reusable solutions to common software design problems. They provide a structured and organized approach to designing software, making it more maintainable, scalable, and flexible.

These design patterns are commonly classified into three main groups based on purpose and application:

Creational Design Patterns: They focus on object creation mechanisms, providing ways to create objects in a manner suitable for a specific situation, which helps decouple the system from the details of object creation. For instance: Factory Method, Abstract Factory, Singleton, Builder, and Prototype patterns.
Structural Design Patterns: They deal with object composition, providing ways to organize classes and objects into larger structures while keeping the structure of the system simple. For instance: Adapter, Decorator, Facade, Composite, and Proxy patterns.
Behavioral Design Patterns: They focus on how objects interact and communicate with each other by defining communication patterns between classes. For instance: Command, Observer, Strategy, and Iterator patterns.

Exploring Facade, Proxy and Command Patterns

A. Facade Pattern

We have been implementing facades knowingly/unknowingly all the time. The Facade pattern provides a unified interface to a complex subsystem, making it easier for clients to interact with the system by encapsulating the complexity and intricacies of the subsystem behind a simple API, shielding the client from the internal details.

Let us suppose we have a multimedia system with an audio player and a video player. Each component may have its own methods and configuration, but we want to keep that complexity confined within the components themselves.

class AudioPlayer:
    def play_audio(self):
        print("Playing audio...")

class VideoPlayer:
    def play_video(self):
        print("Playing video...")

class MultimediaFacade:
    def __init__(self):
        self.audio_player = AudioPlayer()
        self.video_player = VideoPlayer()

    def play(self):
        self.audio_player.play_audio()
        self.video_player.play_video()

my_multimedia = MultimediaFacade()

# Turning my multimedia on 
my_multimedia.play()

# Playing audio...
# Playing video...

From the above code, we can see how the details of starting the multimedia system are hidden behind the play() method of the MultimediaFacade class. MultimediaFacade acts as a facade for the multimedia system, wrapping the individual components AudioPlayer and VideoPlayer.

B. Proxy Pattern

The proxy pattern is a tool for controlling access to objects and adding functionality without altering their core behavior. It is a structural design pattern in which one object acts as a surrogate or placeholder for another, managing access to the real object and performing extra work before or after delegating to it.

It is especially useful when direct access to the real object is not desirable, when we need extra features such as logging, access control, caching, or lazy initialization, or when we want to avoid heavy resource utilization.

Let's consider a scenario: We have a Resource object that represents a heavy computational task in which we want to optimize the resource usage by caching its results.

from abc import ABC, abstractmethod

class AbstractResource(ABC):
    @abstractmethod
    def execute_task(self, task_name):
        pass

class ConcreteResource(AbstractResource):
    def execute_task(self, task_name):
        print(f"Executing task '{task_name}'...")
        # Simulate heavy computational task here
        return f"Result of '{task_name}' task"

class ProxyResource(AbstractResource):
    def __init__(self):
        self._real_resource = ConcreteResource()
        self._cache = {}

    def execute_task(self, task_name):
        if task_name in self._cache:
            print(f"Fetching cached result for task '{task_name}'...")
            return self._cache[task_name]
        else:
            result = self._real_resource.execute_task(task_name)
            self._cache[task_name] = result
            return result


if __name__ == "__main__":
    resource_proxy = ProxyResource()

    # First execution of the task (not in cache)
    task1_result = resource_proxy.execute_task("Task 1")
    print(task1_result)

    # Second execution of the same task (cached)
    task1_result_cached = resource_proxy.execute_task("Task 1")
    print(task1_result_cached)

    # Another task execution (not in cache)
    task2_result = resource_proxy.execute_task("Task 2")
    print(task2_result)

The ProxyResource class exposes the same execute_task() method as the real resource, so clients can use it in place of ConcreteResource. It holds a reference to the real resource and maintains a cache of the results of executed tasks.

Note: Cache logic is implemented directly in execute_task() here only for simplicity. For a cleaner implementation, the cache hit-or-miss handling can be encapsulated in its own Cache class.
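One way the note above could look in practice is sketched below. ResultCache, SlowResource and CachingProxy are hypothetical names invented for this illustration (they are not part of the listing above), and SlowResource is only a stand-in for ConcreteResource that counts how often it really runs.

```python
class ResultCache:
    """Hypothetical helper that owns the hit/miss logic."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        # Return the stored value on a hit; otherwise compute and remember it.
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = compute(key)
        self._store[key] = value
        return value


class SlowResource:
    """Stand-in for ConcreteResource; counts how often it really runs."""

    def __init__(self):
        self.calls = 0

    def execute_task(self, task_name):
        self.calls += 1
        return f"Result of '{task_name}' task"


class CachingProxy:
    def __init__(self, resource, cache):
        self._resource = resource
        self._cache = cache

    def execute_task(self, task_name):
        # The proxy only delegates; caching details live in ResultCache.
        return self._cache.get_or_compute(task_name, self._resource.execute_task)


resource = SlowResource()
cache = ResultCache()
proxy = CachingProxy(resource, cache)

first = proxy.execute_task("Task 1")   # miss: runs the real resource
second = proxy.execute_task("Task 1")  # hit: served from the cache
```

With this split, the proxy stays a thin delegator, and the cache policy (size limits, expiry and so on) can change without touching it.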

C. Command Pattern

The Command pattern is a behavioral design pattern that encapsulates a request as an object, allowing us to parameterize clients with different requests and to decouple the sender of a request from its receiver. By decoupling I mean separation of concerns: the sender does not need to know how the request is carried out. Because requests are objects, we can also implement undo/redo functionality, deferred execution, and logging of commands.

For a more in-depth explanation of design patterns, see the references below.

But for now:

Let's create a simple calculator application with the Calculator class, which supports basic operations like addition and subtraction as command objects.

from abc import ABC, abstractmethod

class Command(ABC):
    @abstractmethod
    def execute(self):
        pass

class AddCommand(Command):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def execute(self):
        return self.x + self.y

class SubtractCommand(Command):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def execute(self):
        return self.x - self.y

class Calculator:
    def __init__(self):
        self.command = None

    def set_command(self,command):
        self.command = command

    def execute_command(self):
        return self.command.execute()

calculator = Calculator()

add_command = AddCommand(20, 5)
subtract_command = SubtractCommand(4, 2)

calculator.set_command(add_command)
result_add = calculator.execute_command()

print(result_add)

calculator.set_command(subtract_command)
result_subtract = calculator.execute_command()

print(result_subtract)

The Calculator is first given a Command object through set_command(), and then a single execute_command() method runs whatever command is currently set. The output of the calculator therefore depends on the command passed to the Calculator instance: 25 for the AddCommand and 2 for the SubtractCommand above. Thus, we are now able to parameterize clients with different requests.
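The undo functionality mentioned earlier can be sketched with the same idea. This is a minimal illustration under my own assumptions, not the calculator from the listing above: here each command carries a single amount, the Calculator keeps a running value, and the undo() method and history list are additions made for this example.

```python
from abc import ABC, abstractmethod


class Command(ABC):
    @abstractmethod
    def execute(self, value):
        """Apply the command to value and return the new value."""

    @abstractmethod
    def undo(self, value):
        """Reverse the effect of execute() on value."""


class AddCommand(Command):
    def __init__(self, amount):
        self.amount = amount

    def execute(self, value):
        return value + self.amount

    def undo(self, value):
        return value - self.amount


class SubtractCommand(Command):
    def __init__(self, amount):
        self.amount = amount

    def execute(self, value):
        return value - self.amount

    def undo(self, value):
        return value + self.amount


class Calculator:
    def __init__(self):
        self.value = 0
        self._history = []  # executed commands, most recent last

    def run(self, command):
        self.value = command.execute(self.value)
        self._history.append(command)
        return self.value

    def undo_last(self):
        # Nothing to undo: leave the value unchanged.
        if not self._history:
            return self.value
        command = self._history.pop()
        self.value = command.undo(self.value)
        return self.value


calculator = Calculator()
after_add = calculator.run(AddCommand(20))           # 0 + 20
after_subtract = calculator.run(SubtractCommand(5))  # 20 - 5
after_undo = calculator.undo_last()                  # back to 20
```

Because every request is an object that knows how to reverse itself, the calculator does not need to know which operation it is undoing.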

Design Patterns: The Fun Part

So far we have learned about 3 different design patterns, but you'll be amazed to know that the core reason for this blog is yet to come.

Problem: I have to hit a 3rd party API (say, the Google search API) and respond to the client after processing the result. I want to do it in a way that is manageable and reliable, and adding a new search API to our core should not hurt the overall code structure.

Let's define an implementation class of our 3rd party API.

from abc import ABC, abstractmethod
import requests
import os

class API(ABC):
    @abstractmethod
    def search(self, query):
        pass

class GenericSearchAPI(API):
    def __init__(self,url):
        self.url = url
    def search(self,query):
        pass

class GoogleAPI(GenericSearchAPI):
    def __init__(self,*args,**kwargs):
        self._access_key = os.getenv('access_key')
        super().__init__(*args, **kwargs)  # super() already binds self
    def search(self, query):
        res = requests.get(f"{self.url}/?search={query}&access_key={self._access_key}").json()
        return res

We want a domain language for our search domain, our ubiquitous language. Suppose we are working in a finance company: we would like to call our search engine Finance Lookup (I just made it up, pardon me) rather than Google API. In the future we might also implement a local DB search, so we can't call a GoogleAPI instance directly and treat it as our holistic Finance Lookup engine.

Using our knowledge of Facade:

from abc import ABC, abstractmethod
import requests
import os

class API(ABC):
    @abstractmethod
    def search(self, query):
        pass

class GenericSearchAPI(API):
    def __init__(self,url):
        self.url = url
    def search(self,query):
        pass

class GoogleAPI(GenericSearchAPI):
    def __init__(self,*args,**kwargs):
        self._access_key = os.getenv('access_key')
        super().__init__(*args, **kwargs)  # super() already binds self
    def search(self, query):
        res = requests.get(f"{self.url}/?search={query}&access_key={self._access_key}").json()
        return res

class FinanceLookup:
    def __init__(self):
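        # GoogleAPI requires a url; this one is a placeholder, not a real endpoint
        self.google_search = GoogleAPI("https://example.com/search")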

    def lookup(self,keyword):
        results = self.google_search.search(keyword)
        return results

finance_lookup_engine = FinanceLookup()

keyword = "finance"
# Lookup keyword
results = finance_lookup_engine.lookup(keyword)

But Hold On.

If only the real world were like our imaginary one, the 3rd party API would let us hit it continuously. In reality, we are throttled to a specific number of requests per unit of time. It is better to cache our responses so that we avoid hitting the 3rd party API repeatedly and causing delayed or missing responses.

Let's use our knowledge of the proxy pattern to either limit how often users can reach the API through our system or reply from a cache when the exact same query has been seen before.

Using our knowledge of Proxy:

from abc import ABC, abstractmethod
import requests
import os

class API(ABC):
    @abstractmethod
    def search(self, query):
        pass

class GenericSearchAPI(API):
    def __init__(self,url):
        self.url = url
    def search(self,query):
        pass

class GoogleAPI(GenericSearchAPI):
    def __init__(self,*args,**kwargs):
        self._access_key = os.getenv('access_key')
        super().__init__(*args, **kwargs)  # super() already binds self
    def search(self, query):
        res = requests.get(f"{self.url}/?search={query}&access_key={self._access_key}").json()
        return res

class GoogleAPIProxy:
    def __init__(self):
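        # GoogleAPI requires a url; this one is a placeholder, not a real endpoint
        self._real_api = GoogleAPI("https://example.com/search")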
        self._cache = {}

    def search(self, query):
        if query in self._cache:
            return self._cache[query]
        else:
            result = self._real_api.search(query)
            self._cache[query] = result
            return result

class FinanceLookup:
    def __init__(self):
        self.google_search = GoogleAPIProxy()

    def lookup(self,keyword):
        results = self.google_search.search(keyword)
        return results

finance_lookup_engine = FinanceLookup()

keyword = "finance"
# Lookup keyword
results = finance_lookup_engine.lookup(keyword)

In the above code, GoogleAPIProxy is a proxy to GoogleAPI which searches the cache before hitting the actual API, to prevent excessive API hits.
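The other use of a proxy mentioned earlier, limiting how often the upstream API is hit, could be sketched like this. RateLimitedProxy, FakeSearchAPI and the injectable clock are all assumptions made for this illustration; the fake API stands in for GoogleAPI so the example runs without network access.

```python
import time


class FakeSearchAPI:
    """Stand-in for GoogleAPI so the sketch runs without network access."""

    def __init__(self):
        self.calls = 0

    def search(self, query):
        self.calls += 1
        return {"query": query}


class RateLimitedProxy:
    """Hypothetical proxy: at most max_calls upstream requests per period seconds."""

    def __init__(self, api, max_calls, period, clock=time.monotonic):
        self._api = api
        self._max_calls = max_calls
        self._period = period
        self._clock = clock
        self._timestamps = []

    def search(self, query):
        now = self._clock()
        # Keep only the calls that are still inside the current window.
        self._timestamps = [t for t in self._timestamps if now - t < self._period]
        if len(self._timestamps) >= self._max_calls:
            raise RuntimeError("Rate limit reached; try again later")
        self._timestamps.append(now)
        return self._api.search(query)


fake_now = [0.0]  # a controllable clock keeps the demo deterministic
api = FakeSearchAPI()
proxy = RateLimitedProxy(api, max_calls=2, period=60, clock=lambda: fake_now[0])

proxy.search("finance")
proxy.search("stocks")
try:
    proxy.search("bonds")  # third call inside the same window
    blocked = False
except RuntimeError:
    blocked = True

fake_now[0] = 61.0  # pretend a minute has passed
result_after_wait = proxy.search("bonds")
```

Since both proxies expose search(), a caching proxy could wrap a rate-limited one (or the other way round) without the caller noticing.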

Looks good till now right?

We have one more problem to solve. As I said earlier, we may have different search engines stacked into our Finance Lookup. How can we allow searching with another search engine without modifying the underlying code structure?

Our Finance Lookup is tightly coupled to our GoogleAPIProxy. Let us give our users a choice of which search engine to use.

from abc import ABC, abstractmethod
import requests
import os

class API(ABC):
    @abstractmethod
    def search(self, query):
        pass

class GenericSearchAPI(API):
    def __init__(self,url):
        self.url = url
    def search(self,query):
        pass

class GoogleAPI(GenericSearchAPI):
    def __init__(self,*args,**kwargs):
        self._access_key = os.getenv('access_key')
        super().__init__(*args, **kwargs)  # super() already binds self
    def search(self, query):
        res = requests.get(f"{self.url}/?search={query}&access_key={self._access_key}").json()
        return res

class DuckDuckGOAPI(GenericSearchAPI):
    def __init__(self,*args,**kwargs):
        self._app_name = os.getenv('app_name')
        self._public_key = os.getenv('public_key')
        super().__init__(*args, **kwargs)  # super() already binds self
    def search(self, query):
        res = requests.get(f"{self.url}/?search={query}&app_name={self._app_name}&public_key={self._public_key}").json()
        return res

class SearchAPIProxy:
    def __init__(self,api):
        self._real_api = api
        self._cache = {}

    def search(self, query):
        if query in self._cache:
            return self._cache[query]
        else:
            result = self._real_api.search(query)
            self._cache[query] = result
            return result


class SearchCommand(ABC):
    @abstractmethod
    def execute(self, query):
        pass

class GoogleSearchCommand(SearchCommand):
    def __init__(self, api):
        self.api = api

    def execute(self, query):
        return self.api.search(query)

class DuckDuckGoSearchCommand(SearchCommand):
    def __init__(self, api):
        self.api = api

    def execute(self, query):
        return self.api.search(query)

class FinanceLookup:
    def __init__(self):
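        # Both APIs require a url; these are placeholders, not real endpoints
        self.google_search = SearchAPIProxy(GoogleAPI("https://example.com/google"))
        self.duckduckgo_search = SearchAPIProxy(DuckDuckGOAPI("https://example.com/duckduckgo"))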
        self.current_command = None

    def set_command(self, api_choice):
        if api_choice == "google":
            self.current_command = GoogleSearchCommand(self.google_search)
        elif api_choice == "duckduckgo":
            self.current_command = DuckDuckGoSearchCommand(self.duckduckgo_search)
        else:
            raise ValueError("Invalid API choice")

    def lookup(self,keyword):
        results = self.current_command.execute(keyword)
        return results

finance_lookup_engine = FinanceLookup()

api_choice = "google"

# Set the command based on the chosen API
finance_lookup_engine.set_command(api_choice)

# Lookup keyword using the selected API
keyword = "finance"
results = finance_lookup_engine.lookup(keyword)
print(results)

Phew !! A lot of code.

Let me summarize the changes I have made:

Since proxy behavior is needed for both 3rd party APIs, the proxy is no longer Google-specific: each API instance is supplied to the proxy at initialization. We now have a SearchAPIProxy class.
The commands that execute the APIs are separated into GoogleSearchCommand and DuckDuckGoSearchCommand.
Each command now executes the concrete implementation of its search API.
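As a possible next step, the if/elif chain in set_command() could be replaced by a lookup table, so that adding an engine does not require editing FinanceLookup at all. The sketch below is only one way to do it: the register() method, the single SearchCommand class and FakeAPI are assumptions made for this illustration, with FakeAPI standing in for the real proxied APIs.

```python
class FakeAPI:
    """Stand-in for a proxied search API; returns a string instead of calling out."""

    def __init__(self, name):
        self.name = name

    def search(self, query):
        return f"{self.name} results for {query!r}"


class SearchCommand:
    """One command class is enough when every engine exposes search()."""

    def __init__(self, api):
        self.api = api

    def execute(self, query):
        return self.api.search(query)


class FinanceLookup:
    def __init__(self):
        self._commands = {}  # engine name -> command
        self.current_command = None

    def register(self, name, api):
        self._commands[name] = SearchCommand(api)

    def set_command(self, api_choice):
        try:
            self.current_command = self._commands[api_choice]
        except KeyError:
            raise ValueError(f"Invalid API choice: {api_choice}") from None

    def lookup(self, keyword):
        if self.current_command is None:
            raise RuntimeError("Call set_command() before lookup()")
        return self.current_command.execute(keyword)


engine = FinanceLookup()
engine.register("google", FakeAPI("google"))
engine.register("duckduckgo", FakeAPI("duckduckgo"))

engine.set_command("duckduckgo")
results = engine.lookup("finance")
```

A new engine then only needs one register() call at start-up.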

Well, that's done for now. I did it just to demonstrate how some patterns can be used to keep code readable and open to future modification.

Harnessing the Power of Design Patterns:

By incorporating design patterns into our coding practices, we can:

Improve Code Quality
Reduce Code Duplication
Enhance Collaboration
Boost Flexibility and Adaptability
Ensure Best Practices

BUT...

It's important to remember that design patterns are not one-size-fits-all solutions. Careful consideration must be given to the specific problem and context when choosing and implementing design patterns. Applying design patterns where they are not necessary or appropriate can lead to unnecessary complexity and reduced maintainability.

Wrapping up

Design patterns are powerful tools that can help create elegant, efficient, and scalable solutions to real-world problems. By understanding and applying design patterns, we can elevate our software designs and build robust and innovative software systems.

You have reached the end of this blog. Many thanks for having the patience to follow along with me. I hope it helped in some way.

References:

https://refactoring.guru/design-patterns

Concurrency Challenges: Locks, Race Conditions, and the Python Global Interpreter Lock (GIL)

shrawan baral — Tue, 11 Jul 2023 00:32:30 GMT

Concurrency is the ability of a system to make progress on multiple tasks or processes during overlapping time periods. The tasks may truly run at the same time (parallelism) or simply be interleaved, which gives the illusion of parallelism. Either way, concurrency can improve overall system performance through increased throughput, improved response times and better resource utilization.

There are different models and approaches to concurrency:

Multiprocessing: Multiprocessing is a programming technique that involves executing multiple processes simultaneously to achieve concurrency. It allows programs to utilize multiple processors or cores on a computer to execute tasks in parallel. Each process operates independently and has its own memory space, program counter, and stack and leverages inter-process communication mechanisms, such as pipes, queues, shared memory, or sockets to communicate. It is suitable for CPU-bound tasks.
Multithreading: A thread is a lightweight unit of execution within a process. Threads of the same process share its memory space and resources, so they can communicate and cooperate more efficiently than separate processes. Multithreading is a programming technique that involves creating and managing multiple threads within a single process. Each thread executes its own sequence of instructions, either in parallel on different CPU cores or interleaved in time slices. However, because memory is shared, proper synchronization mechanisms, like locks or semaphores, are needed to ensure thread safety and prevent race conditions.
Asynchronous Programming: Asynchronous programming revolves around non-blocking operations, where a task can initiate an operation and proceed to other tasks without waiting for the operation to complete. Instead, the task receives a notification or a callback when the operation finishes, allowing it to continue execution or perform additional work. It is typically used for I/O-bound operations (reading from or writing to a file, networking), where waiting for I/O can be time-consuming. It leverages techniques like coroutines and event loops to efficiently manage concurrent execution. (I will explain in more detail how asynchronous programming works in future blogs.)

Concurrency challenges:

Race Condition: Understanding Concurrent Access Challenges

In concurrent programming, a race condition occurs when multiple threads or processes access a shared resource concurrently, at least one of them writes to it, and the result depends on the order in which the accesses happen to be scheduled.

Understanding Critical Regions

A critical region refers to a section of code or a shared resource that must be accessed atomically or exclusively. It is a portion of code where race conditions can occur if not properly synchronized. Critical regions are typically associated with shared data structures or variables that are accessed and modified by multiple threads or processes.

Race Condition and Critical Region

Thread 1               Thread 2
-----------           -----------
   Read X
                        Read X
   Modify X
                        Modify X
   Write X
                        Write X

In the diagram above, two threads, Thread 1 and Thread 2, are concurrently accessing and modifying a shared variable X. The steps involved in a race condition are as follows:

Thread 1 and Thread 2 both read the same value of X.
Thread 1 modifies its copy of the value based on its computation.
Before Thread 1 writes the modified value back to X, Thread 2 also modifies its own copy, which is still based on the old value.
Thread 1 writes its modified value back to X.
Finally, Thread 2 writes its modified value back to X, overwriting the modification made by Thread 1.

The outcome of the race condition depends on the specific interleaving and timing of the thread execution. As a result, the final value of X may not reflect the intended or expected result.
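The lost update can be reproduced without real threads by writing out one unlucky interleaving by hand. This is a deterministic simulation for illustration only; the variable names are made up, and both "threads" are assumed to be incrementing X by one.

```python
x = 10  # shared variable

# Both "threads" read X before either one writes.
thread1_local = x
thread2_local = x

# Each modifies its private copy, unaware of the other.
thread1_local += 1
thread2_local += 1

# Both write back; the second write overwrites the first.
x = thread1_local
x = thread2_local

lost_update_result = x  # 11, although two increments were intended
```

Two increments were performed, but only one of them survives in X.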

So, what's the solution?

Preventing Race Conditions: Critical Region and Synchronization

To prevent race conditions, critical sections need to be properly synchronized. Synchronization mechanisms, such as locks, semaphores, or mutexes, can be used to ensure exclusive access to critical regions. These mechanisms allow only one thread or process to access the critical section at a time, preventing concurrent modifications and maintaining data integrity.

By acquiring a lock or semaphore before entering a critical section, a thread can ensure that other threads are prevented from accessing or modifying the shared resource until it has finished its operation. Once the thread completes its operation, it releases the lock or semaphore, allowing other threads to enter the critical section.

Thread 1               Thread 2
-----------           -----------
   Acquire lock
   Read X
                        Wait for lock
   Modify X
   Write X
   Release lock
                        Acquire lock
                        Read X
                        Modify X
                        Write X
                        Release lock

With proper synchronization, as shown in the modified diagram above, only one thread can access the critical section at a time. This ensures that modifications to X occur sequentially, avoiding race conditions and producing the desired outcome.
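In Python, a critical section like this can be guarded with threading.Lock. The sketch below is a minimal illustration; the Counter class and the thread and iteration counts are arbitrary choices for this example.

```python
import threading


class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            self.value += 1


def worker(counter, times):
    for _ in range(times):
        counter.increment()


counter = Counter()
threads = [threading.Thread(target=worker, args=(counter, 10_000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value = counter.value  # 4 threads x 10,000 increments
```

The with statement releases the lock even if the body raises, which avoids the "never released" problem for this simple case.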

Note:

Locks are a broad topic, and the above explanation provides a general overview of their usage. There are various types of locks, including semaphores and mutexes, which serve similar purposes but differ in usage and implementation details. Advanced topics like single-phase and two-phase locking go deeper into the subject. I will delve into these topics and explain how databases leverage locks to ensure atomicity and consistency in transactions in future blogs.

So locks solve the problem, right? Umm. Not really.

You see, locks put the other threads into a waiting state. What if the thread that acquired the lock never releases it, even though it has finished its job?

Situation: Suppose each thread holds one lock while waiting to acquire another lock that is held by another thread. All the threads involved are blocked, unable to proceed, and stuck in a circular dependency. This condition is called deadlock.

That's why synchronizing your critical regions (I hope you got that 😉) carefully is crucial: it maintains data consistency and prevents race conditions, but done carelessly it leads to deadlocks or performance bottlenecks.

Don't worry, we have several measures to prevent deadlocks, like lock ordering, lock timeouts, and avoidance of nested locks.
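Two of these measures, lock ordering and lock timeouts, can be sketched with the standard threading module. The worker function and the choice of id() as the ordering key are assumptions made for this illustration; any fixed global order would do.

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
completed = []


def acquire_in_order(first, second):
    # Sort by id() so every thread takes the two locks in the same global order.
    return sorted((first, second), key=id)


def worker(name, first, second):
    low, high = acquire_in_order(first, second)
    with low:
        with high:
            completed.append(name)


# The two workers name the locks in opposite order, which could deadlock
# if each simply acquired "its" first lock and then waited for the other.
t1 = threading.Thread(target=worker, args=("t1", lock_a, lock_b))
t2 = threading.Thread(target=worker, args=("t2", lock_b, lock_a))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)

# Lock timeout: acquire() gives up and returns False instead of waiting forever.
lock_a.acquire()
timed_out = not lock_a.acquire(timeout=0.1)  # already held, so this fails
lock_a.release()
```

A thread whose acquire() times out can back off, release what it already holds, and retry, instead of staying stuck.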

Take Rest: Got lost in the blog

Let's summarize, repeat after me: locks are one of the ways of letting only one thread at a time into a critical region to prevent race conditions, that's all.

Note: If you are interested, research thread-safe data structures, atomic operations, immutable data, and message passing as other methods for thread safety. What's thread safety? Basically, it means code behaves correctly even when multiple threads run it at the same time; letting a single thread run at a time is just one way to get there.

Global Interpreter Lock

The global interpreter lock, or GIL, is a mechanism of the CPython interpreter, the reference implementation of Python. It is a type of mutex (lock) that ensures that only one thread executes Python bytecode at any given time.

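The GIL provides thread safety for CPython's own internals.

Benefit: only one thread can execute Python bytecode at a time, which prevents race conditions on the interpreter's internal state, such as object reference counts.

Kind of Worry: The GIL can act as a bottleneck for CPU-bound tasks, inhibiting the full utilization of multiple CPU cores.

Note: Why are there locks in the Python library though we have the GIL? Be aware that the GIL only protects the execution of individual bytecode instructions. A statement such as x += 1 compiles to several instructions, and a thread switch can happen between them, so your own shared data still needs locks.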
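To see why the GIL alone does not protect your own data, you can disassemble a simple increment with the standard dis module. The sketch below only checks that loading and storing the global are separate instructions; exact opcode names and counts vary between CPython versions, so treat the listing as illustrative.

```python
import dis

counter = 0


def increment():
    global counter
    counter += 1  # looks atomic, but compiles to several instructions


opnames = [instruction.opname for instruction in dis.get_instructions(increment)]

# The read and the write of the global are distinct steps, so another
# thread can run in between them.
has_separate_load = any(name.startswith("LOAD_GLOBAL") for name in opnames)
has_separate_store = any(name.startswith("STORE_GLOBAL") for name in opnames)
```

Calling dis.dis(increment) prints the full instruction listing if you want to inspect it yourself.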

Birds Eye View of Workings of the GIL

The GIL is implemented as a mutex, a mutual exclusion object that permits only one thread to possess it at any given time. When a thread seeks to acquire the GIL, it checks its availability. If the GIL is available, the thread acquires the lock and proceeds to execute bytecode. However, if the GIL is held by another thread, the current thread is suspended and placed in a waiting state until the GIL becomes available. Below is the general idea behind working of GIL:

Main Thread:
┌───────────────────────┐
│   Acquire the GIL     │
│                       │
│   Execute Python      │
│   Bytecode (hold GIL) │
│                       │
│   Release GIL         │
│                       │
└───────────────────────┘
        ▲
        │
New Thread:
        │
   GIL unavailable
   ──────────────────┐
        ▼            │
  Wait for GIL        │
  to become available │
        ▼            │
─────────────────────┘
        │
GIL becomes available
and acquired by the thread
        ▼
┌───────────────────────┐
│   Execute Python      │
│   Bytecode (hold GIL) │
│                       │
│   Release GIL         │
│                       │
└───────────────────────┘
        ▲
        │
   Repeat steps for
   other threads

Mitigating the GIL's Impact

While the GIL may restrict the parallelism of CPU-bound tasks, Python provides strategies to alleviate its impact and achieve concurrent execution in specific scenarios:

I/O-Bound Tasks: The GIL's impact on I/O-bound tasks is relatively lower since threads often spend time waiting for external operations to complete. In such cases, leveraging threads can still enhance performance by overlapping I/O operations using multithreading or asyncio.
Multi-Process Execution: Python's multiprocessing module enables the utilization of multiple processes for genuine parallelism. Each process possesses its own Python interpreter with an independent GIL, facilitating parallel execution across multiple CPU cores.
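The I/O-bound case can be sketched with a thread pool from the standard library. Here time.sleep() stands in for a blocking network call, and the function name and timings are made up for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(item):
    # time.sleep() stands in for a blocking network call; CPython releases
    # the GIL while a thread is blocked like this.
    time.sleep(0.2)
    return item * 2


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_download, [1, 2, 3, 4]))
elapsed = time.perf_counter() - start  # well under the 0.8 s a serial loop needs
```

For CPU-bound work, swapping in concurrent.futures.ProcessPoolExecutor gives each worker its own interpreter and GIL, at the cost of pickling arguments and results between processes.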

Note: C extensions can release the GIL while they run code that does not touch Python objects, and subinterpreters provide distinct Python interpreters within a single process. Since Python 3.12, each subinterpreter can have its own GIL.

Conclusion

Understanding locks, race conditions, and the Global Interpreter Lock (GIL) is crucial for writing efficient and thread-safe Python code. Locks manage shared resources, preventing race conditions and ensuring data consistency. While the GIL limits parallelism by allowing only one thread to execute Python bytecode at a time, alternative approaches like multi-process execution and sub-interpreters can achieve parallelism. By utilizing effective synchronization, understanding the GIL's limitations, and employing appropriate concurrency strategies, developers can create robust and efficient Python applications.

Looking for No GIL at all:

Please refer to other Python implementations such as Jython and IronPython. Note that PyPy, although it is an alternative implementation, has a GIL of its own.

References:

Python Wiki - Global Interpreter Lock (GIL): https://wiki.python.org/moin/GlobalInterpreterLock
Understanding the Python GIL: https://realpython.com/python-gil/
Locks and Semaphores in Python: https://docs.python.org/3/library/threading.html#locks
Tanenbaum, A. S. (2014). Modern Operating Systems. (4th ed.). Pearson.

Garbage Collection in Python

shrawan baral — Wed, 21 Jun 2023 05:56:12 GMT

Garbage collection, but before that let's do some quick dictionary work:

schedule = {
            "morning":"apple",
            "arvo":"orange",
            "evening":"apple"
            }

The above dictionary shows my dietary schedule. It's interesting to see that some fruits repeat during the day. To find out how many times I have to eat a given fruit, I have to count how many times it is mentioned in the schedule.

from collections import Counter
schedule = {
            "morning":"apple",
            "arvo":"orange",
            "evening":"apple"
            }

fruit_count = Counter(schedule.values())
print(fruit_count) #Counter({'apple': 2, 'orange': 1})

The above shows the fruit counts in my schedule. The result is the "reference count" of each fruit in our schedule: the number of times the fruit has been referenced.

Let's talk global.

So, until now you know what reference count is.

Do you know that Python has various objects created at runtime that are readily available to you?

They are imported from the builtins module. (Out of scope for now)

I brought this topic up to explain scope: every object has a scope of existence. Objects provided by builtins exist until the program exits, and global objects are visible throughout the module (program) in which they are defined. You can access the global objects via globals().

globals()
{
  '__name__': '__main__',
  '__builtins__': <module 'builtins' (built-in)>,
  '__doc__': None,
  '__file__': '',
  '__package__': None,
  '__loader__': <class '_frozen_importlib.BuiltinImporter'>,
  '__spec__': None,
  '__annotations__': {},
}

If you create a new object in global scope then it can be viewed using globals.

egg = 20

print(globals()) 
{
  '__name__': '__main__',
  'egg': 20,
  '__builtins__': <module 'builtins' (built-in)>,
  '__doc__': None,
  '__file__': '',
  '__package__': None,
  '__loader__': <class '_frozen_importlib.BuiltinImporter'>,
  '__spec__': None,
  '__annotations__': {},
}

This code first creates the integer object 20 and binds the global name egg to it. Then it prints the globals dictionary, which contains all of the global variables defined in the current module. In this case, it contains the variable egg.

The key egg is associated with the value 20. This means that the integer object 20 is still in use.

egg = 20
ball = egg
print(globals()) 
{
  '__name__': '__main__',
  'egg': 20,
  'ball':20,
  '__builtins__': <module 'builtins' (built-in)>,
  '__doc__': None,
  '__file__': '',
  '__package__': None,
  '__loader__': <class '_frozen_importlib.BuiltinImporter'>,
  '__spec__': None,
  '__annotations__': {},
}

We see that 20 is referenced by both egg and ball, so by our earlier logic the reference count of the object 20 is 2. (In a real scenario the count is not exactly 2: CPython caches small integers and already references them internally, so the count will be different.)

For every new reference to an object, CPython increments the object's reference count, and it decrements the count whenever a reference to the object is deleted.

That is

del egg
print(globals())
{
  '__name__': '__main__',
  'ball':20,
  '__builtins__': <module 'builtins' (built-in)>,
  '__doc__': None,
  '__file__': '',
  '__package__': None,
  '__loader__': <class '_frozen_importlib.BuiltinImporter'>,
  '__spec__': None,
  '__annotations__': {},
}

Here, del egg deletes the reference to the object 20, and the reference count of the object 20 is decremented.

(You know now that del does not delete the object but the reference to it.)

So do we have objects with a reference count of 0? What do we do with them?

Yes, we do, and that is the signal Python uses to finally delete them.

Reference counts are not kept in a dictionary that you can access directly; CPython stores the count inside each object. However, you can read it using the sys.getrefcount() function.

For nerds, the following code will print the reference count of the object egg:

import sys

class Oval:
    pass

egg = Oval()

print(sys.getrefcount(egg)) #2

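This code prints the number of references currently pointing to the object egg. In CPython the result is typically 2 rather than 1, because passing egg to sys.getrefcount() creates a temporary extra reference.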
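The increment and decrement described earlier can be observed by comparing counts before and after adding a second name. This is a CPython-specific sketch: the absolute numbers may vary between versions, so it only looks at the differences, and count_changes() is a helper name made up for this example.

```python
import sys


class Oval:
    pass


def count_changes():
    egg = Oval()
    before = sys.getrefcount(egg)
    ball = egg                         # a second name for the same object
    with_alias = sys.getrefcount(egg)
    del ball                           # removes the name, not the object
    after_del = sys.getrefcount(egg)
    return before, with_alias, after_del


before, with_alias, after_del = count_changes()
increase = with_alias - before  # one more reference while ball exists
```

After del ball the count returns to its starting value, while the object itself stays alive because egg still refers to it.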

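How does `gc` handle an object whose reference count reaches 0?

When the reference count reaches 0, the object is no longer in use and can be deleted; CPython frees it right away. An object that is still referenced from the globals or builtins dictionaries has a non-zero count, so it is not deleted. The gc module handles the remaining case: groups of objects that reference each other in a cycle and therefore never reach a count of 0 on their own.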
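In CPython, an object is freed as soon as its reference count drops to zero, while objects caught in a reference cycle need the cyclic collector from the gc module. The sketch below shows both cases using weak references, which do not keep an object alive. The Node class is made up for this illustration, and the behavior shown is specific to CPython.

```python
import gc
import weakref


class Node:
    def __init__(self):
        self.partner = None


# Case 1: no cycle. The object is freed as soon as its last reference goes away.
single = Node()
single_ref = weakref.ref(single)
del single
freed_immediately = single_ref() is None

# Case 2: a reference cycle keeps both counts above zero.
gc.disable()  # keep the automatic collector out of the way for the demo
a, b = Node(), Node()
a.partner, b.partner = b, a
a_ref = weakref.ref(a)
del a, b
alive_after_del = a_ref() is not None

gc.collect()  # the cyclic collector finds and frees the unreachable pair
freed_after_collect = a_ref() is None
gc.enable()
```

Normally you do not call gc.collect() yourself; the collector runs automatically, and it is disabled here only so the demo is repeatable.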

Python's garbage collector is a powerful tool for managing memory. It automatically deletes objects that are no longer in use, which helps to keep memory usage low. This frees programmers from having to worry about memory management, which can be a complex and error-prone task.

Pass by Assignment in Python: What you need to know

shrawan baral — Wed, 21 Jun 2023 00:06:35 GMT

In Python, arguments to a function are passed by assignment. This means that when you call a function, each argument is assigned to a variable in the function's scope. The variable in the function's scope then points to the same object as the variable that was passed in.

For example, consider the following code:

def spam(eggs):
  print(eggs)

eggs = 35

spam(eggs) #35

When the spam() function is called, the object that the global eggs refers to, 35, is assigned to the parameter eggs in the function's scope. The parameter then points to the same object as the variable eggs in the global scope. So when the print() statement in spam() runs, it prints the value of that shared object.

How is it even different from Pass by reference and Pass by value?

Every variable in Python holds a reference to an object, so when you pass an object as an argument to a function, you are actually passing a reference to the object.

Then why is it not pass-by-reference?

It's because the function does not receive an alias for the caller's variable. It receives the reference itself, assigned to a new name in the function's scope. When we mutate the object through that name, the changes are visible in the calling scope, because both names point to the same object. Rebinding the local name to something else, however, never affects the caller's variable. So the behavior looks similar to pass-by-reference, but that is not exactly how Python achieves it.

Let's see the following code:

bucket_of_eggs = [1,2,3]

def spam(bucket_of_eggs):
  bucket_of_eggs.append(4)

print(bucket_of_eggs) #[1,2,3]
spam(bucket_of_eggs)

print(bucket_of_eggs) #[1,2,3,4]

When the spam() function is called, the list is passed as an argument. The append() method is then called on the list in the function's scope. This appends the value 4 to the list. When the spam() function returns, the changes that were made to the list in the function's scope are reflected in the list in the calling scope. This means that the output of the print() statement will be [1, 2, 3, 4].
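A quick way to confirm that both scopes see the same list is to compare identities. In this sketch the parameter is renamed to bucket and spam() returns id(bucket); both changes are only for illustration.

```python
def spam(bucket):
    bucket.append(4)      # mutates the object the caller passed in
    return id(bucket)     # identity of the object seen inside the function

bucket_of_eggs = [1, 2, 3]
inner_id = spam(bucket_of_eggs)

print(inner_id == id(bucket_of_eggs))  # True
print(bucket_of_eggs)                  # [1, 2, 3, 4]
```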

But hold on: a function can change the caller's object this way only when that object is mutable.

The following code may clarify further:

def spam(eggs):
  eggs = eggs + 2
  print(eggs)

eggs = 35

spam(eggs) #37
print(eggs) #35

When the spam() function is called, the object 35 is assigned to the parameter eggs in the function's scope, so the parameter points to the same object as the variable eggs in the global scope. But when we add two more eggs, a new object 37 is created and bound to the name in the function's scope, breaking the initial reference to the global object. So the print() statement inside spam() prints the object that eggs points to in the function scope, and the original object is left unchanged. This feels similar to pass-by-value, but it is not.

Pass by assignment is an important concept to understand in Python. It can be a bit confusing at first, but it is essential for understanding how Python functions work. Once you understand pass by assignment, you will be able to write more effective and efficient Python code.
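The same rule holds for mutable objects: what matters is whether the function rebinds its local name or mutates the object. A minimal sketch, with the hypothetical helper names rebind() and mutate():

```python
def rebind(bucket):
    bucket = bucket + [4]   # builds a new list and rebinds the local name only
    return bucket

def mutate(bucket):
    bucket.append(4)        # changes the caller's list in place
    return bucket

eggs = [1, 2, 3]

new_bucket = rebind(eggs)
print(eggs)                  # [1, 2, 3]
print(new_bucket is eggs)    # False

same_bucket = mutate(eggs)
print(eggs)                  # [1, 2, 3, 4]
print(same_bucket is eggs)   # True
```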

OpenCV Basics

shrawan baral — Thu, 18 Aug 2022 05:18:16 GMT

Welcome to the OpenCV with Python series on computer vision. We will explore OpenCV from the basics to advanced topics; this article was written as part of my own learning process. The series will differ from my usual formal writing style, and I will try to improve each article gradually through updates. Without any further intro, let me introduce OpenCV.

import cv2

img = cv2.imread('cat.jpeg')
print("Image is of type: ",type(img))

Image is of type:  <class 'numpy.ndarray'>

This shows that after reading an image, OpenCV returns it as a NumPy array by default, which makes it much easier to work with.

In a notebook the image can be displayed using matplotlib, but if we want to display the full image in its own window we can use OpenCV's imshow method.

cv2.imshow('cat',img)
cv2.waitKey(0)
cv2.destroyAllWindows()

In the following code I will be using matplotlib.

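import matplotlib.pyplot as plt
# OpenCV stores channels in BGR order; convert to RGB so matplotlib shows true colors
plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))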

Let's view the shape of this numpy array(image)

print("Image shape: ",img.shape)

Image shape:  (1280, 900, 3)

The image is 1280 pixels high and 900 pixels wide, with 3 channels (Blue, Green and Red) per pixel. That is, the image is made up of 1280 x 900 = 1,152,000 pixels, and the array holds 1280 x 900 x 3 = 3,456,000 intensity values in total.

OpenCV provides a method to save the numpy array (image) in any format we specify, as follows:
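The arithmetic can be checked in a few lines. NumPy reports an image's shape as (rows, columns, channels), so this sketch unpacks the tuple printed above; the variable names are illustrative.

```python
shape = (1280, 900, 3)              # the img.shape printed above
height, width, channels = shape     # rows, columns, channels

pixels = height * width             # one pixel per (row, column) position
values = pixels * channels          # one intensity value per pixel per channel

print(pixels)  # 1152000
print(values)  # 3456000
```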

cv2.imwrite('cat.png',img)

Understanding channels

So far we have explored how to read, write and view images. Now we will look into what channels are.

Understanding Black and white images: These are images whose pixels are either black or white, more specifically light intensity 0 for black and 255 for white.

Understanding Gray Scale images: These are images whose pixels store intensities anywhere between 0 and 255. Values near 255 are lighter and values near 0 are darker, giving different shades of grey.

Understanding Channels: Channels are nothing other than colors. If we say Red channel then we mean the color red. Saying a pixel in the Red channel has intensity 0 means there is no red at all, and an intensity of 255 means the brightest, most saturated red.

Interestingly, we can use combination of 3 colors to generate other colors.

By using combinations of Red, Blue and Green we get a variety of colors.

We can read images in grayscale or in color.

Note: OpenCV reads images in BGR (Blue, Green, Red) order, not RGB order.

Reading images in Gray Scale in OpenCV

grey_img = cv2.imread("./cat.png",cv2.IMREAD_GRAYSCALE)
# cmap='gray' is needed, otherwise matplotlib applies its default color map
plt.imshow(grey_img, cmap='gray')

Or we can convert an image we have already read into another color space:

converted_grey = cv2.cvtColor(img,cv2.COLOR_BGR2GRAY)
# cmap='gray' is needed, otherwise matplotlib applies its default color map
plt.imshow(converted_grey, cmap='gray')
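For intuition, OpenCV's documentation describes the BGR-to-gray conversion as a weighted sum, Y = 0.299 R + 0.587 G + 0.114 B. Here is a minimal plain-Python sketch of that formula for a single pixel. The helper name bgr_to_gray is hypothetical, and cv2 uses its own optimized implementation, so its results may differ slightly through rounding.

```python
def bgr_to_gray(pixel):
    """Weighted sum of one (blue, green, red) pixel, rounded to an integer."""
    b, g, r = pixel                     # OpenCV channel order is B, G, R
    return round(0.299 * r + 0.587 * g + 0.114 * b)

print(bgr_to_gray((255, 255, 255)))  # 255 (white stays white)
print(bgr_to_gray((0, 0, 0)))        # 0   (black stays black)
print(bgr_to_gray((0, 0, 255)))      # 76  (pure red becomes a dark grey)
```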

To access the individual channel from the BGR image, we can proceed by:

b,g,r = img[:,:,0],img[:,:,1],img[:,:,2]
# place the three single-channel images side by side
concated = cv2.hconcat([b,g,r])
plt.imshow(concated, cmap='gray')

In the next article we will explore some augmentation methods available in OpenCV.

Towards creating an Intrusion Detection system

shrawan baral — Wed, 10 Aug 2022 04:09:31 GMT

Abstract

Intrusions in a network are the anomalous or undesirable activities that compromise the CIA triad of security (confidentiality, integrity and availability), with the intention of stealing or altering data and systems or, as the attacker desires, bringing a service down. With our heavy dependence on internet technologies, we are living in an age of internetworked digital life, but along with such advancements we are at constant risk in terms of security and privacy. The attacks include DoS, DDoS, phishing and many more; they are varied, always changing and increasingly sophisticated. Even today, DoS remains a major issue for big companies that run their business online. On the defensive side, we must be smart enough to deal with changing patterns and detect them as early as possible.

This article introduces a machine learning approach to detecting those intrusive patterns, taking the CICIDS2017 dataset as a reference. We first capture the network traffic flow in the form of .pcap files, then process it into a well-structured format using CICFlowMeter. We try to infer the patterns that separate benign flows from intrusive ones. The analysis suggests that machine learning approaches could serve as one layer of security rather than wholly replacing signature-based IDS.

Keywords: IDS (Intrusion Detection System), IPS (Intrusion Prevention System), FP (False Positive), FN (False Negative), Benign, Intrusions, Nftables

Introduction

Using only a rule-based system, which analyzes the network flow and detects anomalies or intrusions from signatures stored in a database, is not an effective way to catch new types of intrusion early, because the database does not yet contain them. We are living in a digital age on which our entire business is heavily dependent. Along with the perks come risks such as DoS, DDoS, botnets and many undiscovered attacks. So we want a system that learns patterns from network flows and discriminates between them as desired.

For that purpose we build a machine learning model based on CICIDS2017 and use a flow meter to listen and dump packets into a structured form. For training, we use SMOTE to balance the imbalanced dataset, apply PCA for dimensionality reduction and finally use a random forest classifier to create a model from the data. Evaluated on the test data, the model achieved a very strong result, which implies that it can classify unseen benign and intrusive network flows pretty well. This brings us to implementing a system that creates firewall rules based on these predictions in order to block intrusive flows from harming the system. It processes the captured flows in batches and outputs the corresponding predictions.

Objectives

Apply Machine learning approach to build an intrusion detection system

Problem statement

We are heavily dependent on internet-based services. With the rise of digital networking we benefit, but we are also at risk of network-based attacks. These attacks are costly for organizations and individuals in terms of data, privacy and service. The signature-based methods used previously are simple and direct in architecture, but they need to maintain huge datasets, and it isn't possible to include every attack pattern known to date. Detecting a new type of attack is also impossible until a network flow has been verified as an attack and added to the signature database. We require an intelligent agent that learns what a benign network flow actually looks like, detects flows that deviate from this profile and marks them as intrusive or suspicious as early as possible.

Literature Review

For the purpose of intrusion detection we need to create a baseline of what traffic is normal and what is not. Further, we require a dataset that represents knowledge about the features (domain knowledge) that are valuable for the study. Searching for such datasets brings us to KDD99Cup, NSL KDD, DEFCON-8, CICIDS2017, etc. Selecting among them didn't pose much of a challenge, because we wanted the most recent dataset with entries that capture new types of attack. So CICIDS2017 was chosen for the analysis. This dataset consists of about 80% benign traffic and 20% attacks of different types. The original creators [4] of the dataset used a random forest regressor and a weighted average method for feature selection, whereas other authors [5] used principal component analysis for dimensionality reduction.

Methodology

The methodology follows an ML workflow in which Principal Component Analysis and a Random Forest Classifier are used. The implementation code is linked under the title "Intrusion detection (Binary classification) with PCA and RandomForest". The dataset was downloaded from the CICIDS2017 website and converted into a parquet file for fast reading of the data entries.

For implementing the IPS, we design the system so that each flow classified as intrusive is blocked by a firewall rule. The rules are created automatically with netfilter nftables on a Linux-based operating system, as block rules in the input, output or forward chain. Deciding what kind of rule to generate for each flow is the larger concern.
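As a rough illustration of the rule-generation step, the sketch below turns a batch of predictions into nft commands that drop traffic by source address. Everything specific here is an assumption made for the example and is not taken from the project: the helper name block_rules, the "intrusion"/"benign" labels, and an existing inet table named filter with a chain named input. The commands are only built and printed, never executed.

```python
import ipaddress

def block_rules(flows, table="filter", chain="input"):
    """Build nft commands that drop traffic from sources predicted as intrusive.

    flows: iterable of (source_ip, label) pairs; labels are assumed to be
    "intrusion" or "benign". The table and chain names are assumptions too.
    """
    commands = []
    seen = set()
    for src_ip, label in flows:
        if label != "intrusion":
            continue
        ip = ipaddress.IPv4Address(src_ip)   # raises ValueError on malformed input
        if ip in seen:                       # one rule per source address
            continue
        seen.add(ip)
        commands.append(["nft", "add", "rule", "inet", table, chain,
                         "ip", "saddr", str(ip), "drop"])
    return commands

# Hypothetical batch of predictions (documentation-range addresses)
predictions = [
    ("198.51.100.7", "intrusion"),
    ("192.0.2.10", "benign"),
    ("198.51.100.7", "intrusion"),
]

for cmd in block_rules(predictions):
    print(" ".join(cmd))   # nft add rule inet filter input ip saddr 198.51.100.7 drop
```

Returning argument lists rather than shell strings keeps the addresses validated and makes it easy to review a batch before anything is applied.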

Conclusion

From the above analysis, we could create a logically modeled solution for separating benign patterns from intrusive ones, based on features extracted from the captured network packets using a flow meter.

We could see that, by applying principal component analysis with the number of components set to 10 (so that dimensionality reduction does not discard too much of the information in the data) and then a random forest classifier with the hyperparameters of the highest-scoring cross-validated model, we obtained an F1 score of nearly 0.99 and a ROC AUC of 0.98. This suggests that the model learns the hidden patterns in the data and is clear about what separates benign flows from intrusions.

The study doesn't end here. Rather, this kind of approach raises the question of what happens if we study a larger set of features, and which features actually matter in determining the result.

Further, this study relies on the flow meter, its data structuring and the bidirectional flows it captures. Each input to the model is treated as an independent flow, but sophisticated attacks are sometimes made up of connected flows, which creates ambiguity when the same connection is sometimes benign and sometimes anomalous.

While creating an IPS, the major factors we consider are False Positives and False Negatives. Both high FP and high FN are unwanted, but they can be handled according to the purpose. This is another reason to use this type of model alongside other types of IDS. Controlling benign flows labeled as intrusions (FP) and intrusions labeled as benign (FN) can be pretty straightforward on low-traffic networks, where one can manually set rules to prevent benign flows from being mislabeled as intrusions. At large scale, however, creating manual rules to revert predictions is not an efficient approach.

In further analysis we will look at other methods of extracting valuable features, such as CNNs and other neural networks, and study what kind of IPS needs to be built for such a system. In following articles we will discuss approaches to creating varied types of rules for different kinds of attacks and block types, such as blocking a specific port, a single IP or an entire network. This implementation will target the network layer and IPv4 networks only. We will also discuss the efficiency such a system needs in order to process huge volumes of network traffic and classify flows as benign or intrusive.
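To make the FP/FN trade-off concrete, here is a small sketch of how precision, recall and F1 follow from the counts of true positives, false positives and false negatives, treating "intrusion" as the positive class. The helper name ips_metrics and the counts are invented for illustration and are not results from this study.

```python
def ips_metrics(tp, fp, fn):
    """Precision, recall and F1 with 'intrusion' as the positive class."""
    precision = tp / (tp + fp)   # share of blocked flows that really were intrusions
    recall = tp / (tp + fn)      # share of real intrusions that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 90 intrusions caught, 10 benign flows blocked, 30 intrusions missed
precision, recall, f1 = ips_metrics(tp=90, fp=10, fn=30)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.9 0.75 0.818
```

A high FP count lowers precision (legitimate users get blocked), while a high FN count lowers recall (attacks get through), which is why the acceptable balance depends on the purpose of the deployment.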

The entire source code can be found at: https://github.com/azwyane/NGuard

References

https://www.unb.ca/cic/research/applications.html
http://www.unb.ca/cic/datasets/IDS2017.html
https://wiki.nftables.org/wiki-nftables/index.php/Main_Page
I. Sharafaldin, A. H. Lashkari, and A. A. Ghorbani, “Toward generating a new intrusion detection dataset and intrusion traffic characterization,” in ICISSP 2018 - Proceedings of the 4th International Conference on Information Systems Security and Privacy, 2018, vol. 2018-January, pp. 108–116, doi:10.5220/0006639801080116.
R. Abdulhammed, H. Musafer, A. Alessa, M. Faezipour, and A. Abuzneid, “Features dimensionality reduction approaches for machine learning based network intrusion detection,” Electronics (Switzerland), vol. 8, no. 3, Mar. 2019, doi:10.3390/electronics8030322.