Jupyter's `nbconvert` is a well-established tool created and maintained by the core Jupyter developers. There is no need to reinvent the wheel, hence we will be using `nbconvert`'s Python API to convert Jupyter Notebooks to Markdown documents. You can read more about `nbconvert`'s functions in the official documentation.
We will be using an `Exporter`, namely the `MarkdownExporter`, which can read a Python notebook and extract the main body (text) and resources (images, etc.). Let's first see the basics of how it works and then write a thin wrapper function around it.
from nbconvert import MarkdownExporter

m = MarkdownExporter()
body, resources = m.from_filename('../samples/test-notebook.ipynb')
All notebook exporters return a tuple containing the body and the resources of the document. For instance, the matplotlib image from our test notebook was stored as `output_4_1.png`:
resources['outputs'].keys()
It is also important to know that, so far, the notebook's Markdown representation only exists as a Python object; no files have been written to disk.
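As a quick sanity check (just a sketch, using the `body` and `resources` returned above), we can confirm these are plain in-memory Python objects:
# body is the Markdown text, resources a dict of auxiliary data (images, etc.)
type(body), type(resources)  # (str, dict)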
def nb2md_draft(notebook: str):
    """
    Paper-thin wrapper around nbconvert.MarkdownExporter. This function takes the path to a Jupyter
    notebook and passes it to `MarkdownExporter().from_filename`, which returns the body and resources
    of the document.
    """
    m = MarkdownExporter()
    body, resources = m.from_filename(notebook)
    return body, resources
This is a very basic notebook-to-Markdown converter. It will be greatly improved, with features added via preprocessors further down in this module; hence this `nb2md()` function is not the one that will be exported and is marked as a draft.
b, r = nb2md_draft('../samples/test-notebook.ipynb')
import logging

init_logger('converter', logging.DEBUG)
logger = logging.getLogger('converter')
logger.debug('debug message')
logger.info('info message')
logger.warning('warn message')
logger.error('error message')
logger.critical('critical message')
log_stream.getvalue()
assert log_stream.getvalue() == '\
debug message\n\
info message\n\
warn message\n\
error message\n\
critical message\n'
For the rest of this notebook, I will append the output of `log_stream.getvalue()` to a list `log_list` to test the output of the different operations that I perform, as that makes testing clearer (less parsing). For readability I will delete such cells from the rest of this document. Because `.getvalue()` returns the entire accumulated stream rather than just the new messages, I will run this little bit of code every time so that I append only the last message to `log_list` instead of the whole stream.
old = "abc"
new = "abcdef"
new[len(old):]
log_list = []
log_list.append(log_stream.getvalue())
assert log_list[-1] == '\
debug message\n\
info message\n\
warn message\n\
error message\n\
critical message\n'
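Putting the slicing idea together with `log_list`, the little bit of code run after each operation looks roughly like this (a sketch, assuming the `log_stream` and `log_list` defined above):
# everything captured so far is the concatenation of the stored deltas
previous = ''.join(log_list)
# append only the portion of the stream logged since the last append
log_list.append(log_stream.getvalue()[len(previous):])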
We use the `FilesWriter` object to write the resulting Markdown file to our laptop's storage. We can set the `build_directory` attribute (see more Writer options) to indicate where we would like to store our notebook and the auxiliary files (images, etc.). The `FilesWriter` is "aggressive", meaning it will overwrite whatever files exist if there is a directory or filename clash. Lastly, it is also possible to write a custom Writer, such as a `MediumWriter` that renders the document and then uploads it to Medium, but because I am learning I'd rather see every step in the pipeline.
from nbconvert.writers import FilesWriter

f = FilesWriter(build_directory = 'Rendered/')
Conveniently, the `write()` method of `FilesWriter` returns the output path.
f.write(output = body,
resources = resources,
notebook_name = 'test-notebook')
WriteMarkdown(body, resources, filename = 'test-notebook')
assert log_list[-1] == 'Markdown document written to ../samples/test-notebook/test-notebook.md\n'
WriteMarkdown(body, resources, dir_path= 'Docs', filename= 'test-notebook')
assert log_list[-1] == 'Markdown document written to Docs/test-notebook.md\n'
WriteMarkdown(body, resources, dir_path= 'Docs/Attempt1', filename= 'test-notebook')
assert log_list[-1] == 'Markdown document written to Docs/Attempt1/test-notebook.md\n'
WriteMarkdown(body, resources, dir_path= '../Docs', filename= 'test-notebook')
assert log_list[-1] == 'Markdown document written to ../Docs/test-notebook.md\n'
from nbconvert.preprocessors import RegexRemovePreprocessor

m = MarkdownExporter()
m.register_preprocessor(RegexRemovePreprocessor(patterns = [r'^#\s*hide-cell']), enabled = True);
Funnily enough, the `RegexRemovePreprocessor` only hides cells that have the tag AND that do not produce an output. For example:

#hide-cell
a = 1

would be removed, but:

#hide-cell
a = 1
print(a) # or simply a

would not be removed.
The standard preprocessors aren't really useful for what I want to do. The `RegexRemovePreprocessor` only removes cells if they have no output in addition to matching the pattern(s) specified, and the `ClearOutputPreprocessor` removes all outputs from a notebook. Hence I am just going to write a custom preprocessor that is able to hide a cell's source, a cell's output, or the whole cell based on pattern matching performed on the cell's source. After some investigation I realised the best way to achieve this was using cell tags, though I do not like Jupyter's current tag environment: you have to use the GUI entirely to add tags to a cell, navigating to the top sidebar, then the View section, then the Cell Toolbar sub-section, and finally clicking on Tags, which enables an extra chunky section added to all your cells, even those you may not want to add tags to. Hence I've gone for an implementation that allows for both the use of tags and the use of text/regex-based tagging in the custom `HidePreprocessor` written below.
`nbconvert` uses `traitlets` where I would normally expect an `__init__()` method. Luckily it is quite intuitive to work with `traitlets`, although I do not fully grasp the pros and cons of using it.
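For illustration, here is a minimal sketch (not the actual `HidePreprocessor`) of how a traitlets-based preprocessor is declared: configurable attributes are class-level traits rather than `__init__` arguments, and traitlets generates the constructor for us.
from traitlets import Unicode
from nbconvert.preprocessors import Preprocessor

class DemoPreprocessor(Preprocessor):
    # declared as a trait, `mode` is configurable without an __init__
    mode = Unicode('source').tag(config=True)

    def preprocess_cell(self, cell, resources, index):
        # a real preprocessor would inspect cell.source here and, for example,
        # add a 'hide-source' tag to cell.metadata when a pattern matches
        return cell, resources

# traitlets generates the constructor: DemoPreprocessor(mode='output')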
from nbconvert.preprocessors import TagRemovePreprocessor

m = MarkdownExporter()
m.register_preprocessor(HidePreprocessor(mode = 'source'), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'output'), enabled = True)
m.register_preprocessor(HidePreprocessor(mode = 'cell'), enabled = True)
m.register_preprocessor(TagRemovePreprocessor(
remove_input_tags = ('hide-source',),
remove_all_outputs_tags = ('hide-output',),
remove_cell_tags = ('hide-cell',),
enabled = True)
)
The file `test-hiding.ipynb` contains 4 cells printing the string 'My name is Jack'. The first one has no tags added. The second one has the `#hide-source` tag, which results in only the output string being present in the Markdown document. The third cell has the `#hide-output` tag added to it, which results in only the cell source ("the code") being present in the Markdown document. The last cell has the `#hide-cell` tag, which removes the whole cell (source and output) altogether.
b, r = m.from_filename('../samples/test-hiding.ipynb')
print(b)
Tags are in cells 3, 1 and 5:
import re

assert re.findall('#([0-9])', log_list[-1]) == ['3', '1', '5']
Above has been an exploration of how to implement hiding cell sources, cell outputs and entire cells based on text-based tags. These features will be added to the main `nb2md()` function at the end of this module.
I like syntax highlighting in Medium articles, and this is only available (to my knowledge) via GitHub Gists. We will be making our own preprocessor to upload the source code of cells that start with the special tag `# gist`. Creating a POST request to submit a GitHub Gist is easy enough; here we have simply translated the GitHub API call into a Python request.
The only thing needed to submit GET/POST requests via the GitHub API is a GitHub token. In the same way as with the Medium tokens, we can have the environment variable declared in our `~/.bashrc` or `~/.zshrc` files. The documentation to create a token can be found on this page.
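As a sketch of what such a request looks like (illustrative names, not necessarily the exact signature of `upload_gist`, and assuming the token lives in a `GITHUB_TOKEN` environment variable):
import os
import requests

def upload_gist_sketch(gistname: str, content: str, description: str = '', public: bool = False):
    # POST /gists, authenticated with a personal access token
    r = requests.post(
        'https://api.github.com/gists',
        headers = {'Authorization': f"token {os.environ['GITHUB_TOKEN']}"},
        json = {
            'description': description,
            'public': public,
            'files': {gistname: {'content': content}},
        },
    )
    r.raise_for_status()
    return r.json()['html_url']  # URL of the newly created gist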
upload_gist('ghapitest', gistcontent = '../CONTRIBUTING.md', description = "nb2medium-test")
I wish to have a gister that acts like a magic function but without being a magic function; instead, it's just a set of instructions that are sent to the parser like so:
# gist description: My python program gistname: script.py public: false upload: source
a = 1
b = 2
c = a*b
where the `public`, `description` and `upload` flags are optional.
The `upload` flag exists to enable the user to also upload the output of a command as a gist (text file or HTML table/pandas DataFrame). The user can specify what to upload by specifying `upload: source`, `upload: output` or `upload: both`.
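One way such a header line could be parsed (an illustrative sketch, not necessarily how the `GisterPreprocessor` below does it) is with a single regex over the known flag names:
import re

header = '# gist description: My python program gistname: script.py public: false upload: source'
# capture each flag's value up to the next flag name or the end of the line
flags = dict(re.findall(r'(description|gistname|public|upload):\s*(.*?)(?=\s+\w+:|$)', header))
flags  # {'description': 'My python program', 'gistname': 'script.py', 'public': 'false', 'upload': 'source'}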
import json
import bs4
import pandas as pd

# notebook with an example dataframe: grab the HTML table from the first cell's output
source = json.load(open("../samples/test-gist-output-df.ipynb"))['cells'][0]['outputs'][0]['data']['text/html']
soup = bs4.BeautifulSoup(''.join(source), 'lxml')
table = soup.find_all('table')
df = pd.read_html(str(table), index_col = 0)[0]
df
We may choose to remove the index column when exporting to CSV by running `df.to_csv(index = False)`.
Now that we know how to upload a gist to GitHub and how to recover an HTML table as a pandas DataFrame, we can incorporate these methods into our own `GisterPreprocessor`.

Note: the output of the gisted cell will be removed, as the code cell is turned into a Markdown cell.
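The conversion itself is conceptually simple; a sketch of the idea, with `gist_url` standing in for the URL returned by the upload and `cell` for a notebook cell node:
# turn the code cell into a markdown cell whose body is the gist URL
# (Medium renders gist links as embedded, syntax-highlighted snippets)
cell['cell_type'] = 'markdown'
cell['source'] = gist_url  # e.g. 'https://gist.github.com/...'
cell.pop('outputs', None)  # markdown cells carry no outputs
cell.pop('execution_count', None)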
m = MarkdownExporter()
m.register_preprocessor(GisterPreprocessor(), enabled = True)
b, r = m.from_filename('../samples/test-gist.ipynb')
print(b)
assert log_list[-1] == "\
Detected gist tag in cell 1 with arguments: description, gistname; uploading...\n\
Gist script.py from cell 1 succesfully uploaded!\n"
b, r = m.from_filename('../samples/test-gist-output-df.ipynb')
print(b)
b, r = m.from_filename('../samples/test-gist-output-print.ipynb')
print(b)
b, r = m.from_filename('../samples/test-gist-multi-mode-output.ipynb')
print(b)
As we have noticed before, when we use an `nbconvert` exporter on a Jupyter notebook, it extracts the images from that notebook (e.g. plots) and stores them locally. We now need to take those images, upload them to Medium and replace each image with its URL. It is very similar to what we have done with code cells.

Notice in the cell below how the cells that contain an image have an entry such as `cell['outputs']...['data']['image/png']`. We can detect the presence of such a file and upload the image to Medium via the Python Medium API we have written.
demonb = json.load(open('../samples/test-notebook.ipynb'))
print(
demonb['cells'][13]['outputs'][0].keys(), '\n',
demonb['cells'][13]['outputs'][0]['data'].keys(), '\n'
)
demonb['cells'][13]['outputs'][0]['data']['image/png'][:200]
I am going to make a bold assumption: if in a given cell the user outputs an image, the user doesn't want anything else to be outputted (e.g. text or values).
The representation of images in raw Jupyter Notebooks is not actually that of a valid image file; the image is represented with base64 ASCII characters. We need to use `binascii.a2b_base64` to turn the ASCII characters into binary ones, which results in a valid image that can then be uploaded to Medium. I figured this out by exploring how images are extracted in `nbconvert`'s `ExtractOutputPreprocessor`.
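A quick sketch of the decoding step, reusing the `demonb` dictionary loaded above:
import binascii

ascii_data = demonb['cells'][13]['outputs'][0]['data']['image/png']
png_bytes = binascii.a2b_base64(ascii_data)  # decode base64 ASCII into raw bytes
png_bytes[:8] == b'\x89PNG\r\n\x1a\n'  # True: the PNG magic number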
m = MarkdownExporter()
m.register_preprocessor(ImagePreprocessor(), enabled = True)
As we would expect, nothing happens when we have a notebook containing online images in its Markdown cells; Medium can access these directly from the internet.
b, r = m.from_filename('../samples/test-md-online-image.ipynb')
The previous notebook had no plots and no local images
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('0', '0')]
But! If our preprocessor finds offline images, it will upload them to Medium
b, r = m.from_filename('../samples/test-md-offline-image.ipynb')
The prior document had 0 plots and 1 local image
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('0', '1')]
And finally, if it detects plots such as those coming out of matplotlib, the plot will be uploaded without ever writing the image to disk 😮. In future implementations, more image types can be handled, such as plotly interactive plots.
b, r = m.from_filename('../samples/test-matplotlib.ipynb')
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('1', '0')]
b, r = m.from_filename('../samples/test-multi-matplotlib.ipynb')
2 plots and 0 local images
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('2', '0')]
b, r = m.from_filename('../samples/test-seaborn.ipynb')
Only one seaborn plot and 0 local images
assert re.findall('Detected ([0-9]) plots and ([0-9]) local images', log_list[-1]) == [('1', '0')]
Given all the work we have done, making the final function that wraps everything together is easy! We combine all the tools we have built in this module in the `uploader` module.