Jupyter's `nbconvert` is a well-established tool created and maintained by the core Jupyter developers. There is no need to reinvent the wheel, hence we will be using `nbconvert`'s Python API to convert Jupyter Notebooks to Markdown documents. You can read more about `nbconvert`'s functions in the official documentation.
We will be using an `Exporter`, namely the `MarkdownExporter`, which can read a Python notebook and extract the main body (text) and resources (images, etc.). Let's first see the basics of how it works and then write a thin wrapper function around it.
from nbconvert import MarkdownExporter

m = MarkdownExporter()
body, resources = m.from_filename('../samples/test-notebook.ipynb')
All notebook exporters return a tuple containing the body and the resources of the document. For instance, the matplotlib image from our test notebook was stored as `output_4_1.png`:
resources['outputs'].keys()
It is also important to know that, so far, the notebook's Markdown representation only exists as a Python object; no files have been written to disk.
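As a quick sanity check (just a sketch, using the `body` and `resources` returned above), we can confirm these are plain in-memory Python objects:
# body is the Markdown text, resources a dict of auxiliary data (images, etc.)
type(body), type(resources)  # (str, dict)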
def nb2md_draft(notebook: str):
    """
    Paper-thin wrapper around nbconvert.MarkdownExporter. This function takes the path to a Jupyter
    notebook and passes it to `MarkdownExporter().from_filename`, which returns the body and resources
    of the document.
    """
    m = MarkdownExporter()
    body, resources = m.from_filename(notebook)
    return body, resources
This is a very basic notebook-to-Markdown converter. It will be greatly improved, with features added via preprocessors further down in this module; hence this `nb2md()` function is not the one that will be exported and is marked as a draft.
b, r = nb2md_draft('../samples/test-notebook.ipynb')
import logging

init_logger('converter', logging.DEBUG)
logger = logging.getLogger('converter')
logger.debug('debug message')
logger.info('info message')
logger.warning('warn message')
logger.error('error message')
logger.critical('critical message')
log_stream.getvalue()
assert log_stream.getvalue() == '\
debug message\n\
info message\n\
warn message\n\
error message\n\
critical message\n'
For the rest of this notebook, I will append the output of `log_stream.getvalue()` to a list `log_list` to test the output of the different operations that I perform, as that makes testing clearer (less parsing). For readability I will delete such cells from the rest of this document. Because `.getvalue()` returns the entire accumulated stream rather than just the new messages, I will run this little bit of code every time so that I append only the last message to `log_list` instead of the whole stream.
old = "abc"
new = "abcdef"
new[len(old):]
log_list = []
log_list.append(log_stream.getvalue())
assert log_list[-1] == '\
debug message\n\
info message\n\
warn message\n\
error message\n\
critical message\n'
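Putting the slicing idea together with `log_list`, the little bit of code run after each operation looks roughly like this (a sketch, assuming the `log_stream` and `log_list` defined above):
# everything captured so far is the concatenation of the stored deltas
previous = ''.join(log_list)
# append only the portion of the stream logged since the last append
log_list.append(log_stream.getvalue()[len(previous):])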
We use the `FilesWriter` object to write the resulting Markdown file to our laptop's storage. We can set the `build_directory` attribute (see more Writer options) to indicate where we would like to store our notebook and the auxiliary files (images, etc.). The `FilesWriter` is "aggressive", meaning it will overwrite whatever files exist if there is a directory or filename clash. Lastly, it is also possible to write a custom Writer, such as a `MediumWriter` that renders the document and then uploads it to Medium, but because I am learning I'd rather see every step in the pipeline.
from nbconvert.writers import FilesWriter

f = FilesWriter(build_directory = 'Rendered/')
Conveniently, the `write()` method of `FilesWriter` returns the output path.
f.write(output = body,
resources = resources,
notebook_name = 'test-notebook')
WriteMarkdown(body, resources, filename = 'test-notebook')
assert log_list[-1] == 'Markdown document written to ../samples/test-notebook/test-notebook.md\n'
WriteMarkdown(body, resources, dir_path= 'Docs', filename= 'test-notebook')
assert log_list[-1] == 'Markdown document written to Docs/test-notebook.md\n'
WriteMarkdown(body, resources, dir_path= 'Docs/Attempt1', filename= 'test-notebook')
assert log_list[-1] == 'Markdown document written to Docs/Attempt1/test-notebook.md\n'
WriteMarkdown(body, resources, dir_path= '../Docs', filename= 'test-notebook')
assert log_list[-1] == 'Markdown document written to ../Docs/test-notebook.md\n'
from nbconvert.preprocessors import RegexRemovePreprocessor

m = MarkdownExporter()
m.register_preprocessor(RegexRemovePreprocessor(patterns = [r'^#\s*hide-cell']), enabled = True);
Funnily enough, the `RegexRemovePreprocessor` only hides cells that have the tag AND that do not produce an output. For example:

#hide-cell
a = 1

would be removed, but:

#hide-cell
a = 1
print(a) # or simply a

would not be removed.
The standard preprocessors aren't really useful for what I want to do. The `RegexRemovePreprocessor` only removes cells if they have no output in addition to matching the pattern(s) specified, and the `ClearOutputPreprocessor` removes all outputs from a notebook. Hence I am just going to write a custom preprocessor that is able to hide a cell's source, a cell's output, or the whole cell based on pattern matching performed on the cell's source. After some investigation I realised the best way to achieve this was using cell tags, though I do not like Jupyter's current tag environment: you have to use the GUI entirely to add tags to a cell, navigating to the top sidebar, then the View section, then the Cell Toolbar sub-section, and finally clicking on Tags, which enables an extra chunky section added to all your cells, even those you may not want to add tags to. Hence I've gone for an implementation that allows for both the use of tags and the use of text/regex-based tagging in the custom `HidePreprocessor` written below.
`nbconvert` uses `traitlets` where I would normally expect an `__init__()` method. Luckily it is quite intuitive to work with `traitlets`, although I do not fully grasp the pros and cons of using it.
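For illustration, here is a minimal sketch (not the actual `HidePreprocessor`) of how a traitlets-based preprocessor is declared: configurable attributes are class-level traits rather than `__init__` arguments, and traitlets generates the constructor for us.
from traitlets import Unicode
from nbconvert.preprocessors import Preprocessor

class DemoPreprocessor(Preprocessor):
    # declared as a trait, `mode` is configurable without an __init__
    mode = Unicode('source').tag(config=True)

    def preprocess_cell(self, cell, resources, index):
        # a real preprocessor would inspect cell.source here and, for example,
        # add a 'hide-source' tag to cell.metadata when a pattern matches
        return cell, resources

# traitlets generates the constructor: DemoPreprocessor(mode='output')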
from nbconvert.preprocessors import TagRemovePreprocessor

m = MarkdownExporter()
m.register_preprocessor(HidePreprocessor(mode = 'source'), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'output'), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'cell'), enabled = True)
m.register_preprocessor(TagRemovePreprocessor(
remove_input_tags = ('hide-source',),
remove_all_outputs_tags = ('hide-output',),
remove_cell_tags = ('hide-cell',),
enabled = True)
)
The file `test-hiding.ipynb` contains 4 cells printing the string 'My name is Jack'. The first one has no tags added. The second one has the `#hide-source` tag, which results in only the output string being present in the Markdown document. The third cell has the `#hide-output` tag added to it, which results in only the cell source ("the code") being present in the Markdown document. The last cell has the `#hide-cell` tag, which removes the whole cell (source and output) altogether.
b, r = m.from_filename('../samples/test-hiding.ipynb')
print(b)
Tags are in cells 3, 1 and 5:
import re

assert re.findall('#([0-9])', log_list[-1]) == ['3', '1', '5']
Above has been an exploration of how to implement hiding cell sources, cell outputs and entire cells based on text-based tags. These features will be added to the main `nb2md()` function at the end of this module.
I like syntax highlighting in Medium articles, and this is only available (to my knowledge) via GitHub Gists. We will be making our own preprocessor to upload the source code of cells that start with the special tag `# gist`. Creating a POST request to submit a GitHub Gist is easy enough; here we have simply translated the GitHub API call into a Python request.
The only thing needed to submit GET/POST requests via the GitHub API is a GitHub token. In the same way as with the Medium tokens, we can have the environment variable declared in our `~/.bashrc` or `~/.zshrc` files. The documentation to create a token can be found on this page.
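As a sketch of what such a request looks like (illustrative names, not necessarily the exact signature of `upload_gist`, and assuming the token lives in a `GITHUB_TOKEN` environment variable):
import os
import requests

def upload_gist_sketch(gistname: str, content: str, description: str = '', public: bool = False):
    # POST /gists, authenticated with a personal access token
    r = requests.post(
        'https://api.github.com/gists',
        headers = {'Authorization': f"token {os.environ['GITHUB_TOKEN']}"},
        json = {
            'description': description,
            'public': public,
            'files': {gistname: {'content': content}},
        },
    )
    r.raise_for_status()
    return r.json()['html_url']  # URL of the newly created gist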
upload_gist('ghapitest', gistcontent = '../CONTRIBUTING.md', description = "nb2medium-test")
I wish to have a gister that acts like a magic function but without being a magic function; instead, it's just a set of instructions that are sent to the parser like so:
# gist description: My python program gistname: script.py public: false upload: source
a = 1
b = 2
c = a*b
where the `public`, `description` and `upload` flags are optional.
The `upload` flag exists to enable the user to also upload the output of a command as a gist (text file or HTML table/pandas DataFrame). The user can specify what to upload by specifying `upload: source`, `upload: output` or `upload: both`.
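One way such a header line could be parsed (an illustrative sketch, not necessarily how the `GisterPreprocessor` below does it) is with a single regex over the known flag names:
import re

header = '# gist description: My python program gistname: script.py public: false upload: source'
# capture each flag's value up to the next flag name or the end of the line
flags = dict(re.findall(r'(description|gistname|public|upload):\s*(.*?)(?=\s+\w+:|$)', header))
flags  # {'description': 'My python program', 'gistname': 'script.py', 'public': 'false', 'upload': 'source'}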
import json
import bs4
import pandas as pd

# notebook with an example dataframe: grab the HTML table from the first cell's output
source = json.load(open("../samples/test-gist-output-df.ipynb"))['cells'][0]['outputs'][0]['data']['text/html']
soup = bs4.BeautifulSoup(''.join(source), 'lxml')
table = soup.find_all('table')
df = pd.read_html(str(table), index_col = 0)[0]
df
We may choose to remove the index column when exporting to CSV by running `df.to_csv(index = False)`.
Now that we know how to upload a gist to GitHub and how to recover an HTML table as a pandas DataFrame, we can incorporate these methods into our own `GisterPreprocessor`.

Note: the output of the gisted cell will be removed, as the code cell is turned into a Markdown cell.
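The conversion itself is conceptually simple; a sketch of the idea, with `gist_url` standing in for the URL returned by the upload and `cell` for a notebook cell node:
# turn the code cell into a markdown cell whose body is the gist URL
# (Medium renders gist links as embedded, syntax-highlighted snippets)
cell['cell_type'] = 'markdown'
cell['source'] = gist_url  # e.g. 'https://gist.github.com/...'
cell.pop('outputs', None)  # markdown cells carry no outputs
cell.pop('execution_count', None)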
m = MarkdownExporter()
m.register_preprocessor(GisterPreprocessor(), enabled = True)
b, r = m.from_filename('../samples/test-gist.ipynb')
print(b)
assert log_list[-1] == "\
Detected gist tag in cell 1 with arguments: description, gistname; uploading...\n\
Gist script.py from cell 1 succesfully uploaded!\n"
b, r = m.from_filename('../samples/test-gist-output-df.ipynb')
print(b)
b, r = m.from_filename('../samples/test-gist-output-print.ipynb')
print(b)
b, r = m.from_filename('../samples/test-gist-multi-mode-output.ipynb')
print(b)
As we have noticed before, when we use an `nbconvert` exporter on a Jupyter notebook, it extracts the images from that notebook (e.g. plots) and stores them locally. We now need to take those images, upload them to Medium and replace each image with its URL. It is very similar to what we have done with code cells.

Notice in the cell below how the cells that contain an image have an entry such as `cell['outputs']...['data']['image/png']`. We can detect the presence of such a file and upload the image to Medium via the Python Medium API we have written.
demonb = json.load(open('../samples/test-notebook.ipynb'))
print(
demonb['cells'][13]['outputs'][0].keys(), '\n',
demonb['cells'][13]['outputs'][0]['data'].keys(), '\n'
)
demonb['cells'][13]['outputs'][0]['data']['image/png'][:200]
I am going to make a bold assumption: if in a given cell the user outputs an image, the user doesn't want anything else to be outputted (e.g. text or values).
The representation of images in raw Jupyter Notebooks is not actually that of a valid image file; the image is represented with base64 ASCII characters. We need to use `binascii.a2b_base64` to turn the ASCII characters into binary ones, which results in a valid image that can then be uploaded to Medium. I figured this out by exploring how images are extracted in `nbconvert`'s `ExtractOutputPreprocessor`.
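A quick sketch of the decoding step, reusing the `demonb` dictionary loaded above:
import binascii

ascii_data = demonb['cells'][13]['outputs'][0]['data']['image/png']
png_bytes = binascii.a2b_base64(ascii_data)  # decode base64 ASCII into raw bytes
png_bytes[:8] == b'\x89PNG\r\n\x1a\n'  # True: the PNG magic number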
m = MarkdownExporter()
m.register_preprocessor(ImagePreprocessor(), enabled = True)
As we would expect, nothing happens when we have a notebook containing online images in its Markdown cells; Medium can access these directly from the internet.
b, r = m.from_filename('../samples/test-md-online-image.ipynb')
The previous notebook had no plots and no local images
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('0', '0')]
But! If our preprocessor finds offline images, it will upload them to Medium
b, r = m.from_filename('../samples/test-md-offline-image.ipynb')
The prior document had 0 plots and 1 local image
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('0', '1')]
And finally, if it detects plots such as those coming out of matplotlib, the plot will be uploaded without ever writing the image to disk 😮. In future implementations, more image types can be handled, such as plotly interactive plots.
b, r = m.from_filename('../samples/test-matplotlib.ipynb')
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('1', '0')]
b, r = m.from_filename('../samples/test-multi-matplotlib.ipynb')
2 plots and 0 local images
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('2', '0')]
b, r = m.from_filename('../samples/test-seaborn.ipynb')
Only one seaborn plot and 0 local images
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('1', '0')]
Given all the work we have done, making the final function that wraps everything together is easy! We combine all the tools we have built in this module in the `uploader` module.