In this article I will demonstrate how to build a multi-threaded web crawler that can process HTTPS web pages.

Photo by Anoir Chafik on Unsplash
  1. The web crawler will utilize multiple threads.
  2. It will be able to crawl all the pages of a given website.
  3. It will be able to report back any links that return 2XX or 4XX status codes.
  4. It will take in the domain name from the command line.
  5. It will avoid the cyclic traversal of links.
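The requirements above can be sketched roughly as follows. This is only a minimal illustration, not the article's implementation: `FAKE_SITE`, `fetch` and `crawl` are hypothetical names, and the "website" is an in-memory dictionary standing in for real HTTPS requests. It shows two of the ideas from the list: fetching pages on multiple threads and using a visited set to avoid cyclic traversal.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "website": page -> (status code, outgoing links).
# A real crawler would download each page over HTTPS instead.
FAKE_SITE = {
    "/": (200, ["/about", "/blog"]),
    "/about": (200, ["/", "/missing"]),
    "/blog": (200, ["/", "/about"]),  # links back to earlier pages: a cycle
    "/missing": (404, []),
}


def fetch(url):
    """Stand-in for an HTTP request; returns (status code, links on the page)."""
    return FAKE_SITE.get(url, (404, []))


def crawl(start, max_workers=4):
    """Crawl level by level, fetching each level's pages on a thread pool."""
    visited = {start}  # every URL already scheduled; prevents cyclic traversal
    report = {}        # url -> status code
    frontier = [start]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            results = list(pool.map(fetch, frontier))  # parallel fetches
            next_frontier = []
            for url, (status, links) in zip(frontier, results):
                report[url] = status
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return report


print(crawl("/"))
```

Only the fetches run on worker threads; the visited set is updated on the main thread, so this sketch needs no lock. Reading the start URL from the command line (for example via `sys.argv`) is left out here.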

So how will the crawler work?

So let's get into it by importing the libraries we will need to build this crawler.

Alright before proceeding…

There are two types of data we use for analytics.

  • ORGANIC DATA: Data which is collected organically over a period of time, such as financial or stock market exchange data, the viewing data that the Netflix algorithm collects to improve our recommendations, or the data collected by web trackers to personalize our ads. These few examples produce a humongous amount of data, which led to the term BIG DATA. …

Photo by Jamie Haughton on Unsplash

If you are new to Python iterators and generators, please refer to these articles; otherwise you are good to go.

To start, let's talk about what lazy evaluation is.

Lazy evaluation is an evaluation strategy which delays the evaluation of an expression until its value is needed, and which also avoids repeated evaluations.

Let's compare strict evaluation vs. lazy evaluation.

Problem: Given a list and a positive integer n, write a function that splits the list into n groups.
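One possible way to contrast the two strategies on this problem is sketched below. The function names `split_strict` and `split_lazy` are made up for illustration, and the sketch assumes "n groups" means n contiguous groups of nearly equal size; the article's own solution may differ.

```python
def split_strict(items, n):
    """Strict: build all n groups up front and return them as a list."""
    size, extra = divmod(len(items), n)
    groups, start = [], 0
    for i in range(n):
        # The first `extra` groups get one additional element.
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def split_lazy(items, n):
    """Lazy: yield one group at a time, only when the caller asks for it."""
    size, extra = divmod(len(items), n)
    start = 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        yield items[start:end]
        start = end


numbers = [1, 2, 3, 4, 5, 6, 7]
print(split_strict(numbers, 3))   # all groups computed immediately
lazy_groups = split_lazy(numbers, 3)
print(next(lazy_groups))          # only the first group has been computed
```

The strict version does all the slicing before returning, while the generator version does no work until `next()` (or a for loop) requests a group.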

Photo by Andre Mouton on Unsplash

This is part 3 of my series, in which I will discuss some applications of iterators and generators.

  1. Reading the lines of a file without using a for loop.

We use the next() function here to manually consume an iterator.

with open("/etc/something/something.some_extension") as f:
    try:
        while True:
            # a file object is an iterable (and its own iterator)
            line = next(f)
            print(line, end="")
    except StopIteration:
        pass  # raised by next() once the file has no more lines

2. Suppose there is a custom container object which internally holds an iterable, and you want to make this container iterable.

Note: when we use a for loop the __iter__() method of the iterable is invoked. …
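A minimal sketch of the container idea above, assuming a hypothetical `Playlist` class that wraps a list: defining `__iter__()` to delegate to the inner iterable is enough to make the container work in a for loop.

```python
class Playlist:
    """Hypothetical container that internally holds a list of songs."""

    def __init__(self, songs):
        self._songs = list(songs)

    def __iter__(self):
        # Delegate iteration to the internal list's iterator.
        return iter(self._songs)


playlist = Playlist(["intro", "verse", "outro"])
for song in playlist:  # the for loop calls playlist.__iter__()
    print(song)
```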

Photo by Agê Barros on Unsplash

First, let's understand what a temp or temporary file is.

  1. A temporary file is a file that a program creates to serve a temporary purpose. It may be created for various reasons, such as a temporary backup, when a program is manipulating more data than the system can hold at once, or to break up larger chunks of data into pieces that are easier to process.
  2. Temp files usually have the extension “.tmp”, but they are program dependent (i.e. different programs create different temp files).

The most common examples of temp files are:

  1. Temp Internet Files: These files contain cached info on frequent and recent pages so that…
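As a small illustration of the points above, here is a sketch that uses Python's standard tempfile module. The helper names `roundtrip` and `named_temp_path` are made up for this example, and it is an assumption that the article goes on to use this module.

```python
import os
import tempfile


def roundtrip(data: bytes) -> bytes:
    """Write data to an anonymous temp file and read it back."""
    with tempfile.TemporaryFile() as tmp:
        tmp.write(data)
        tmp.seek(0)  # rewind before reading
        return tmp.read()
    # the file is removed automatically when the with block exits


def named_temp_path() -> str:
    """Create a named temp file with a .tmp suffix and return its old path."""
    with tempfile.NamedTemporaryFile(suffix=".tmp") as tmp:
        path = tmp.name
    return path  # by now the file has been deleted


print(roundtrip(b"scratch data"))
print(os.path.exists(named_temp_path()))
```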

This is a continuation of part 1.

Photo by Emile Cudelou on Unsplash
  1. Python generators are a simple way to create an iterator.
  2. A generator is a function that returns an object (an iterator) which we can iterate over, one value at a time.
  3. Creating a generator in Python is as easy as writing a function, but instead of using the return statement, we use the yield statement.
  4. The difference between a return statement and a yield statement is that return terminates the function entirely, whereas yield pauses the function, saving all its state, and later continues from there on successive calls.
  1. Generator functions consist of one or…
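To make the pause-and-resume behaviour of yield concrete, here is a small sketch using a hypothetical `countdown` generator (the name and example are not from the article).

```python
def countdown(n):
    """Yield n, n-1, ..., 1, pausing after each value."""
    while n > 0:
        yield n   # pause here and hand n to the caller
        n -= 1    # execution resumes here on the next request


gen = countdown(3)
first = next(gen)   # runs the body up to the first yield
rest = list(gen)    # resumes repeatedly until the function ends
print(first, rest)
```

Calling `countdown(3)` runs none of the function body; it only creates the generator object. The local variable `n` survives between calls because the function's state is saved at each yield.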

Photo by Kote Puerto on Unsplash

Definition: A memory-mapped file object maps a normal file object into memory. This allows us to modify a file object’s content directly in memory.

  1. Memory-mapped file objects behave like both bytearray and file objects. Hence they support all the operations that can be performed on a bytearray, such as indexing, slicing, assigning to a slice, or using the re module to search through the file.
  2. They also support all the operations that can be performed on a file object, such as reading and writing data starting at the current position, or using seek() to move the current position elsewhere.
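The two points above can be tried out with a short sketch like the following. The function name `mmap_demo` and the use of a temporary file are assumptions made for illustration; the calls themselves come from Python's standard mmap, re and tempfile modules.

```python
import mmap
import re
import tempfile


def mmap_demo():
    with tempfile.TemporaryFile() as f:
        f.write(b"hello world")
        f.flush()  # make sure the bytes are in the file before mapping it
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            first = mm[0:5]                   # slicing, like a bytearray
            mm[0:5] = b"HELLO"                # slice assignment (same length)
            pos = re.search(rb"world", mm).start()  # regex search over the map
            mm.seek(0)                        # file-like positioning
            head = mm.read(5)                 # file-like read
            mm.flush()                        # write changes back to the file
        f.seek(0)
        return first, pos, head, f.read()


print(mmap_demo())
```

Note that a slice assignment must not change the size of the mapping, which is why the replacement has the same length as the slice.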

The memory mapped file object is…

Photo by Marek Szturc on Unsplash

What are partial functions? Understanding functools.partial and its applications and use cases.

functools.partial(func, *args, **kwargs) returns a new partial object which, when called, will behave like func called with the positional arguments (*args) and keyword arguments (**kwargs).
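A short sketch of the behaviour described above, using a hypothetical `power` function: some arguments are frozen by partial, and the rest are supplied when the partial object is called.

```python
from functools import partial


def power(base, exponent):
    return base ** exponent


square = partial(power, exponent=2)  # freeze a keyword argument
two_to_the = partial(power, 2)       # freeze the first positional argument

print(square(5))       # same as power(5, exponent=2)
print(two_to_the(10))  # same as power(2, 10)
print(square.func, square.args, square.keywords)  # what was frozen
```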

Photo by Ugur Arpaci on Unsplash

Generator functions provide one-way communication, i.e. we can retrieve information from a generator using next(), but we cannot interact with it or affect its execution while it is running.

First, let’s understand generator.send().

It is used to send a value to a generator that has just yielded.

def double_inputs():
    while True:
        x = yield
        yield x * 2

gen = double_inputs()
next(gen)  # run up to the first yield
print(gen.send(10))  # 10 goes into the x variable --> 20
next(gen)  # run up to the next yield
print(gen.send(6))  # --> 12
next(gen)  # run up to the next yield
print(gen.send(45))  # goes into x again --> 90
next(gen)  # run up to the…
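As a variation on the example above, a single yield can both receive a value and hand back a result, which removes the extra next() call between sends. The sketch below uses a hypothetical `running_average` generator to show this; it is an illustration, not code from the article.

```python
def running_average():
    """Receive numbers via send() and yield the average seen so far."""
    total, count, average = 0.0, 0, None
    while True:
        value = yield average  # hand back the average, wait for a new value
        total += value
        count += 1
        average = total / count


avg = running_average()
next(avg)          # prime the generator: run up to the first yield
a = avg.send(10)   # average of [10]
b = avg.send(20)   # average of [10, 20]
print(a, b)
```

The initial next() is still required, because a value can only be sent to a generator that is already paused at a yield.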

Photo by Robert Thiemann on Unsplash
  1. StringIO and BytesIO are classes that manipulate string and bytes data in memory.
  2. StringIO is used for string data and BytesIO is used for binary data.
  3. These classes create file-like objects that operate on string data (or bytes data, in the case of BytesIO).
  4. The StringIO and BytesIO classes are most useful in scenarios where you need to mimic a normal file.

In this case the data won't be kept in memory (RAM) after it’s written to the file:

with open("test.bin", "wb") as f:
    f.write(b"Hello world")
    f.write(b"Hello world")
    f.write(b"Hello world")
    f.write(b"Hello world")
    f.write(b"Hello world")

In this case instead of writing contents to a file, it is written…
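For comparison with the file-based version above, here is a minimal sketch of the in-memory case, using io.BytesIO and io.StringIO from the standard library. The variable names are chosen only for illustration.

```python
import io

# Same writes as before, but into an in-memory buffer instead of test.bin.
buffer = io.BytesIO()
for _ in range(5):
    buffer.write(b"Hello world")
data = buffer.getvalue()  # everything written so far, as a bytes object

# StringIO works the same way for text and can be read like a normal file.
text = io.StringIO("line one\nline two\n")
lines = [line.rstrip("\n") for line in text]

print(len(data), lines)
```

Because both objects expose the usual file methods (write, read, seek, iteration), they can be passed to code that expects an open file.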

Siddharth Kshirsagar

Data Scientist, Pythonista, Algorithms lover
