# DOCX to Markdown Converter
This Python script converts DOCX files to Markdown format, preserving formatting such as headings, bold, italic, underline, strikethrough, and highlight. It also extracts images from the DOCX file and saves them in an `images` directory.
## Features
- Converts DOCX to Markdown format
- Preserves text formatting (headings, bold, italic, underline, strikethrough, highlight)
- Extracts images and saves them in an `images` directory
- Processes tables and converts them to Markdown format
- Command-line interface for specifying input and output files
## Requirements
- Python 3.x
- python-docx library
Install the required dependencies with:
```bash
pip install python-docx
```
## Usage
```bash
python docx_to_md.py <input.docx> [output_directory]
```
### Examples
```bash
# Convert a DOCX file to Markdown (output defaults to the current directory)
python docx_to_md.py document.docx

# Convert a DOCX file to Markdown with a specific output directory
python docx_to_md.py document.docx /path/to/output/directory
```
The output Markdown file will have the same name as the input DOCX file, but with a `.md` extension.
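
The command-line handling described above can be sketched roughly as follows. This is only an illustration, not the script's actual code: the function names `parse_args` and `output_path` are hypothetical, and it assumes the optional second argument is a directory that defaults to the current one.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse <input.docx> and the optional output directory (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Convert a DOCX file to Markdown.")
    parser.add_argument("input", type=Path, help="path to the input .docx file")
    parser.add_argument(
        "output_directory",
        type=Path,
        nargs="?",            # optional positional argument
        default=Path("."),    # assumed default: the current directory
        help="directory for the .md file",
    )
    return parser.parse_args(argv)


def output_path(input_file: Path, output_directory: Path) -> Path:
    """Keep the input's base name, swap the extension for .md."""
    return output_directory / input_file.with_suffix(".md").name
```

With these helpers, `document.docx` and no second argument would map to `document.md` in the current directory.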
## How It Works
1. The script reads the DOCX file using the `python-docx` library
2. It extracts all images from the document and saves them in an `images` subdirectory
3. It processes paragraphs, preserving formatting:
   - Headings are converted to Markdown headings (`#`, `##`, `###`, etc.)
   - Bold text is wrapped in `**`
   - Italic text is wrapped in `*`
   - Underlined text is wrapped in `*`, so it renders the same as italic
   - Strikethrough text is wrapped in `~~`
   - Highlighted text is wrapped in `**`, so it renders the same as bold
4. Tables are converted to Markdown table format
5. The output is written to a Markdown file in the output directory
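
The formatting and table steps above can be sketched in plain Python as shown below. The helper names (`wrap_run`, `heading_line`, `table_to_markdown`) are hypothetical and not taken from the script. The sketch also makes a few assumptions the steps do not spell out: combined styles are simply nested, whitespace-only runs are left unmarked, and the first table row is treated as the header.

```python
def wrap_run(text, bold=False, italic=False, underline=False,
             strike=False, highlight=False):
    """Wrap one run of text in the markers from the list above."""
    if not text.strip():
        return text                # assumption: skip whitespace-only runs
    if strike:
        text = f"~~{text}~~"
    if italic or underline:        # both map to '*'
        text = f"*{text}*"
    if bold or highlight:          # both map to '**'
        text = f"**{text}**"
    return text


def heading_line(text, level):
    """Turn a heading of the given level (1, 2, 3, ...) into '#'-style Markdown."""
    return "#" * level + " " + text


def table_to_markdown(rows):
    """Convert a list of rows (each a list of cell strings) to a Markdown table."""
    if not rows:
        return ""

    def fmt(cells):
        # Escape pipes and flatten line breaks so each row stays on one line.
        cleaned = (c.replace("|", "\\|").replace("\n", " ") for c in cells)
        return "| " + " | ".join(cleaned) + " |"

    header, *body = rows           # assumption: first row is the header
    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

With `python-docx`, the flags passed to `wrap_run` would typically come from run attributes such as `run.bold`, `run.italic`, `run.underline`, `run.font.strike`, and `run.font.highlight_color`, and the rows from `cell.text` for each cell in `table.rows`. How the script itself reads them is not shown here.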
## Output Structure
The script creates the following structure:
```
output.md        # The main Markdown file
images/          # Directory containing extracted images
    image_1.png
    image_2.png
    ...
```
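
As a rough sketch of how the `images` directory might be populated, the hypothetical helper below writes a sequence of image payloads using the `image_<n>` naming shown above. It assumes each image arrives as a `(bytes, extension)` pair; how the script actually obtains those bytes from the DOCX file is not covered here.

```python
from pathlib import Path


def save_images(blobs, output_directory):
    """Write each image to images/image_<n>.<ext>, numbering from 1.

    `blobs` is an iterable of (data, extension) pairs, e.g. (b"...", "png").
    Returns the list of paths that were written.
    """
    images_dir = Path(output_directory) / "images"
    images_dir.mkdir(parents=True, exist_ok=True)   # create images/ if missing

    saved = []
    for index, (data, extension) in enumerate(blobs, start=1):
        target = images_dir / f"image_{index}.{extension}"
        target.write_bytes(data)
        saved.append(target)
    return saved
```

With `python-docx`, one common way to get the image bytes is to iterate over `document.part.rels.values()` and read `rel.target_part.blob` for relationships whose `reltype` contains `"image"`. That is an assumption about the approach, not a description of this script.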
## License
This project is licensed under the MIT License.