Analysis, in general, is the process of exploring a topic by breaking it down into smaller sub-topics to gain insights and a better understanding. Financial analysis is the process of evaluating a business and its associated resources, such as projects, budgets, and other finance-related information. Financial analysis performed on a particular business helps us determine its performance and suitability. It also tells us whether the entity is profitable enough to provide a return on investment to its investors, and whether it is stable and solvent. In other words, it is the process of determining the financial strengths and weaknesses of an entity by establishing strategic relationships between the items of its financial statements, such as the balance sheet and the profit and loss account.
Financial analysis is used to evaluate economic trends, set financial policies, build long-term plans for business activities, and identify projects or companies for investment. It is done by crunching financial numbers and data, and it is conducted in both corporate finance and investment finance settings. There are two ways of conducting a financial analysis: internal analysis and external analysis. When conducted internally, the analysis can help managers and business heads make key decisions from the generated insights, or review historical trends to see what has worked for them in the past. When conducted externally, the analysis helps potential investors explore and choose the best investment opportunities or policies.
Such analytical research also helps top management determine the success or failure of the company's operations, appraise an individual's performance and contribution to the organization, check whether the firm's resources are being used efficiently, and evaluate the internal control system. Investors who have put their money into the firm's shares are always keen on its performance, earnings and future profitability. Financial analysis helps them estimate the probability of the firm's bankruptcy or business failure. Once aware of a probable failure, investors can take preventive measures to avoid or minimize financial losses.
Investing in stocks requires careful analysis of financial data to determine a company's true worth. This can be time-consuming and somewhat cumbersome. An easier way to gauge a company's performance is to look at its financial ratios, most of which are freely available on the internet. Data analysts and market researchers can use one such library, SimFin, to explore financial data.
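As a simple illustration of what such financial ratios look like (the figures below are invented for the example, not taken from any real company), two of the most common ratios can be computed directly from balance-sheet and income-statement items:

```python
# Illustrative figures only -- not real company data.
current_assets = 500_000       # cash, receivables, inventory, etc.
current_liabilities = 250_000
net_income = 120_000
shares_outstanding = 100_000
share_price = 18.0

# Current Ratio: ability to cover short-term obligations.
current_ratio = current_assets / current_liabilities

# P/E Ratio: share price relative to earnings per share.
eps = net_income / shares_outstanding
pe_ratio = share_price / eps

print(current_ratio)  # 2.0
print(pe_ratio)       # 15.0
```

Libraries like SimFin save us from computing these by hand for thousands of companies.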
What is Simfin?
SimFin is an open, crowdsourced platform for exploring and researching fundamental financial data. All the data is free, easily accessible, readily available and user-generated. SimFin aims to make financial data freely available to private investors, researchers, and students, particularly those who cannot afford to spend large sums per year on data. The data collection process is automated and combined with machine learning technologies, and SimFin also keeps a check on data quality by manually updating and expanding its datasets. SimFin makes it very easy to obtain and use financial and stock market data in Python. The easiest way of working with the entire datasets is the Python API for SimFin: it automatically downloads share prices and fundamental data from the SimFin server, saves the data to disk for future use, and loads the data into Pandas DataFrames. It features data for more than 2,000 companies, standardised statements, and a shareable stock screener with 70+ company ratios.
Using Simfin for Exploratory Data Analysis
We'll be using SimFin's Python API and exploring one of the many datasets it contains. Our goal will be to use SimFin's stock screener for NetNet stocks, better known as Net Net Working Capital stocks. These focus on current assets, taking cash and cash equivalents at full value, reducing accounts receivable for doubtful accounts, and reducing inventories to liquidation values. These are essentially stocks where the current assets have been further discounted before subtracting all liabilities, preferred shares, and off-balance-sheet liabilities. We will work with the hypothesis that a given valuation ratio or another financial signal can be used to predict future stock returns, and investigate whether there is a relationship between some financial signals and future stock returns.
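The NetNet valuation described above can be sketched in a few lines. The discount factors below follow Benjamin Graham's classic NetNet formula (100% of cash, 75% of receivables, 50% of inventory); the balance-sheet figures are invented purely for illustration:

```python
# Invented balance-sheet figures for illustration only.
cash = 40_000_000              # cash and equivalents, taken at full value
receivables = 20_000_000       # discounted for doubtful accounts
inventory = 30_000_000         # discounted to estimated liquidation value
total_liabilities = 50_000_000

# Graham-style discounts: 75% of receivables, 50% of inventory.
netnet_value = (cash
                + 0.75 * receivables
                + 0.50 * inventory
                - total_liabilities)

market_cap = 18_000_000

# P/NetNet below 1 means the stock trades at a discount
# to its estimated liquidation value.
p_netnet = market_cap / netnet_value
print(netnet_value)  # 20000000.0
print(p_netnet)      # 0.9
```

SimFin computes this ratio for us as the P/NetNet signal, which we screen on below.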
The following is recreated from SimFin's official tutorial. You can find the link to their GitHub page here.
Getting Started with EDA
To get started, we'll first install the SimFin library; you can use the following code.
!pip install simfin
Next, we will import the dependencies along with the SimFin Python API:
%matplotlib inline

import pandas as pd
import seaborn as sns
import statsmodels.api as sm

# Import the main functionality from the SimFin Python API.
import simfin as sf

# Import names used for easy access to SimFin's data-columns.
from simfin.names import *

Check the SimFin version:

# Version of the SimFin Python API.
sf.__version__
Configuring the data-directory path and API key
# SimFin data-directory.
sf.set_data_dir('~/simfin_data/')
Setting the key to use the free data:
# SimFin load API key or use free data.
sf.load_api_key(path='~/simfin_api_key.txt', default_key='free')

# Seaborn set plotting style.
sns.set_style("whitegrid")
Accessing the Data Hub
We will now use a StockHub object to easily load and process the US stock market financial data with these settings:
# We are interested in the US stock-market.
market = 'us'

# Add this date-offset to the fundamental data such as Income
# Statements, because the data is typically made available to the
# public 1, 2 or even 3 months after the Report Date.
offset = pd.DateOffset(days=60)

# Refresh the fundamental datasets (Income Statements etc.)
# every 30 days.
refresh_days = 30

# Refresh the dataset with share prices every 10 days.
refresh_days_shareprices = 10

Accessing the StockHub:

%%time
hub = sf.StockHub(market=market, offset=offset,
                  refresh_days=refresh_days,
                  refresh_days_shareprices=refresh_days_shareprices)
Using Data Signals
First, we calculate financial signals for all the stocks, such as the Current Ratio, Debt Ratio, Net Profit Margin, Return on Assets, etc. These are calculated using and accessing data from the financial reports: Income Statements, Balance Sheets and Cash-Flow Statements, which are automatically downloaded and loaded by the data hub.
Defining the signals to be fetched and extracting the data:
df_fin_signals = hub.fin_signals(variant='daily')
Now we will calculate growth signals for the stocks, such as Earnings Growth, FCF Growth, Sales Growth, etc. These are also calculated using data from the financial reports: Income Statements, Balance Sheets and Cash-Flow Statements, which are automatically downloaded and loaded by the data hub.
Fetching the growth signals:

df_growth_signals = hub.growth_signals(variant='daily')
Now we will calculate valuation signals for the stocks, such as P/E, P/Sales, etc. Again, these are calculated from the share prices and data from the financial reports. Because the data-hub has already loaded the required datasets in the function-calls above, the data is merely reused here, and the data-hub can proceed directly to computing the signals.
Fetching the valuation signals:
df_val_signals = hub.val_signals(variant='daily')

Combining all the signals into a single DataFrame:

# Combine the DataFrames.
dfs = [df_fin_signals, df_growth_signals, df_val_signals]
df_signals = pd.concat(dfs, axis=1)

Show the resultant DataFrame:

df_signals.dropna(how='all').head()

Evaluating & Handling the Missing Data
Some of the data signals have a lot of missing data, which can cause problems in the statistical analysis. Let us first see the fraction of each signal column that is missing:
# Remove all rows with only NaN values.
df = df_signals.dropna(how='all').reset_index(drop=True)

# For each column, show the fraction of the rows that are NaN.
(df.isnull().sum() / len(df)).sort_values(ascending=False)
We will get the following output:
R&D / Revenue                        0.616803
R&D / Gross Profit                   0.616652
Return on Research Capital           0.616652
Dividend Yield                       0.430889
Inventory Turnover                   0.287232
Debt Ratio                           0.259329
Net Acquisitions / Total Assets      0.256498
Sales Growth                         0.141619
Sales Growth YOY                     0.136864
FCF Growth                           0.133579
Assets Growth                        0.133579
Earnings Growth                      0.133579
FCF Growth YOY                       0.127126
Earnings Growth YOY                  0.127109
Assets Growth YOY                    0.127109
Interest Coverage                    0.100014
Sales Growth QOQ                     0.046013
CapEx / (Depr + Amor)                0.038795
FCF Growth QOQ                       0.033883
Earnings Growth QOQ                  0.033883
Assets Growth QOQ                    0.033883
P/Sales                              0.023100
Gross Profit Margin                  0.019524
Log Revenue                          0.017117
Asset Turnover                       0.017060
Net Profit Margin                    0.017060
P/Cash                               0.016489
Price to Book Value                  0.015124
Earnings Yield                       0.015124
FCF Yield                            0.015124
Market-Cap                           0.015124
P/E                                  0.015124
P/FCF                                0.015124
P/NCAV                               0.015124
P/NetNet                             0.015124
Quick Ratio                          0.010466
Return on Assets                     0.009083
Return on Equity                     0.009083
Share Buyback / FCF                  0.009083
Dividends / FCF                      0.009083
Current Ratio                        0.009083
(Dividends + Share Buyback) / FCF    0.009083
dtype: float64
As you can see from the list above, the Dividend Yield has missing data for nearly half of its rows, while the Debt Ratio has missing data for about a quarter of its rows. Let us remove all signals that have more than 25% missing data:
# List of the columns before removing any.
columns_before = df_signals.columns

# Threshold for how many rows must be non-NaN in each column,
# otherwise the column is removed.
thresh = 0.75 * len(df_signals.dropna(how='all'))

# Remove all columns which don't have sufficient data.
df_signals = df_signals.dropna(axis='columns', thresh=thresh)

# List of the columns after the removal.
columns_after = df_signals.columns

# Show the columns that were removed.
columns_before.difference(columns_after)

We get the output:

Index(['Debt Ratio', 'Dividend Yield', 'Inventory Turnover',
       'Net Acquisitions / Total Assets', 'R&D / Gross Profit',
       'R&D / Revenue', 'Return on Research Capital'],
      dtype='object')
Calling the Screener for NetNet Stocks
We want to test whether the NetNet investment strategy might work, so we will only use the signal-rows with a P/NetNet ratio between zero and one, indicating the stock is trading at a discount to its estimated NetNet liquidation value. We create this stock-screener by making a boolean mask as follows:
mask_netnet = (df_signals[P_NETNET] > 0) \
            & (df_signals[P_NETNET] < 1)
Rows that satisfy the screener condition have a value of True, and rows that do not meet the condition have a value of False.
mask_netnet.head()
Output:
Ticker  Date
A       2007-01-03    False
        2007-01-04    False
        2007-01-05    False
        2007-01-08    False
        2007-01-09    False
Name: P/NetNet, dtype: bool
Let us plot how many different stocks have traded at NetNet discounts in the past:
mask_netnet.groupby(DATE).sum().plot(grid=True);

We can also show all the tickers that have traded at NetNet discounts at some point:
tickers_netnet = mask_netnet[mask_netnet].reset_index()[TICKER].unique()
print('Number of unique tickers:', len(tickers_netnet))
print(tickers_netnet)
Output :
Number of unique tickers: 139 ['AAMC' 'ABIO' 'ABUS' 'ACER' 'ACET' 'ACRX' 'ADIL' 'ADVM' 'ADXS' 'AEHR' 'AGLE' 'AKER' 'ALPN' 'ALRN' 'ALT' 'AMSC' 'APRE' 'APVO' 'ASPN' 'ASTC' 'ATOM' 'AVEO' 'AVGR' 'AWRE' 'BLCM' 'BLFS' 'BLUE' 'BPMC' 'BRMK' 'BSPM' 'BTAI' 'CALA' 'CCXI' 'CFMS' 'CGA' 'CLBS' 'CLRB' 'COGT' 'CRDF' 'CRSP' 'CRVS' 'CSLT' 'CTIC' 'CTMX' 'CVSI' 'CYCC' 'CYIG' 'EFOI' 'ENPH' 'EPZM' 'EVH' 'EVLO' 'FBRX' 'FNJN' 'FORD' 'FORM' 'FRD' 'FTEK' 'FTK' 'GBT' 'GENC' 'GEVO' 'GLYC' 'GTHX' 'HGEN' 'HROW' 'HTGM' 'IBRX' 'INFI' 'IPDN' 'ISEE' 'JAGX' 'KKR' 'KPTI' 'KURA' 'LEDS' 'LGL' 'LUMO' 'MBOT' 'MEIP' 'MGNI' 'MICT' 'MIST' 'MN' 'MNKD' 'MRSN' 'MRTX' 'MTEM' 'MTSI' 'MYO' 'NHTC' 'NPTN' 'NSPR' 'NTIP' 'NTRA' 'NURO' 'NVTA' 'ONCT' 'ONVO' 'OOMA' 'PACB' 'PBSV' 'PDEX' 'PTE' 'PTN' 'PTON' 'RCUS' 'RGLS' 'RGNX' 'RIGL' 'RKDA' 'RSLS' 'RWLK' 'SCND' 'SINT' 'SMIT' 'SMSI' 'SOHU' 'SPRT' 'SRPT' 'SRRA' 'STOK' 'SURF' 'SYRS' 'THMO' 'TNDM' 'TRUP' 'UBX' 'VIDE' 'VIVE' 'VKTX' 'VOXX' 'VSTM' 'VTVT' 'VYGR' 'WSTL' 'XBIT' 'XPO' 'ZYNE']
Let us now compare the P/NetNet ratio to 1-day stock returns:
# Name of the new column for the returns.
TOTAL_RETURN_1D = 'Total Return 1-Day'

# Calculate 1-day stock-returns.
df_returns_1d = hub.returns(name=TOTAL_RETURN_1D,
                            bdays=1, future=True, annualized=False)

# Show the 1-day stock-returns.
df_returns_1d.loc[mask_netnet]

# Limit the daily returns between -10% and +10%.
df_returns_1d = df_returns_1d.clip(-0.1, 0.1)

# Combine the signals and stock-returns.
# We are only using the rows which are NetNet discounts.
dfs = [df_signals.loc[mask_netnet], df_returns_1d.loc[mask_netnet]]
df_sig_rets = pd.concat(dfs, axis=1)

We can then create a scatter-plot of the P/NetNet ratio versus the daily stock-returns:

df_sig_rets.plot(kind='scatter', grid=True, x=P_NETNET, y=TOTAL_RETURN_1D);

Using colours to distinguish the dots for different tickers:
sns.scatterplot(x=P_NETNET, y=TOTAL_RETURN_1D, hue=TICKER, data=df_sig_rets.reset_index(), legend=False);
We can do the same for 1-3 year returns as well.
# Name of the new column for the returns.
TOTAL_RETURN_1_3Y = 'Total Return 1-3 Years'

# Calculate the mean log-returns for all 1-3 year periods.
df_returns_1_3y = \
    hub.mean_log_returns(name=TOTAL_RETURN_1_3Y,
                         future=True, annualized=True,
                         min_years=1, max_years=3)

# Combine the signals and stock-returns.
# We are only using the rows which are NetNet discounts.
dfs = [df_signals.loc[mask_netnet], df_returns_1_3y.loc[mask_netnet]]
df_sig_rets = pd.concat(dfs, axis=1)

Plotting the graph:

sns.scatterplot(x=P_NETNET, y=TOTAL_RETURN_1_3Y, hue=TICKER,
                data=df_sig_rets.reset_index(), legend=False);

Removing Outliers from Data
A common method for removing outliers is the so-called Winsorization of the data. It basically limits or clips the data between, e.g., the 5% and 95% quantiles of the data.
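Independent of SimFin's own sf.winsorize helper used below, the idea can be reproduced with plain pandas by clipping a column at its 5% and 95% quantiles (a minimal sketch with made-up data):

```python
import pandas as pd

# A small series with an obvious outlier at the end.
s = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])

# Compute the 5% and 95% quantiles of the data.
lower = s.quantile(0.05)
upper = s.quantile(0.95)

# Clip all values to lie within those bounds.
s_winsorized = s.clip(lower, upper)

print(s_winsorized.max() < 100.0)  # True: the outlier has been clipped
```

This limits the influence of extreme values without discarding the rows entirely.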
# Select all columns except for the P/NetNet ratio.
columns = df_sig_rets.columns.drop(P_NETNET)

# Winsorize all the other signals and stock-returns.
df_sig_rets2 = sf.winsorize(df_sig_rets, columns=columns)

# Winsorize all the other signals and stock-returns.
# Instead of clipping values beyond the bounds, set them to NaN.
df_sig_rets = sf.winsorize(df_sig_rets, columns=columns, clip=False)
Generating a Correlation
Linear correlation (aka. Pearson correlation) is a simple statistical measure of how two variables change together. We will study the linear correlation between the signals and stock-returns to roughly assess which signals might be the best predictors for stock-returns. We will also study the linear correlation between the signals themselves to assess whether some of the signals seem redundant and can be removed.
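For intuition, here is a toy example (unrelated to the SimFin data): Pearson correlation is +1 when one variable is an exactly increasing linear function of the other, and closer to 0 when there is no clean linear relationship.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = 2 * x + 3                    # perfect positive linear relationship
z = pd.Series([5, 3, 8, 1, 9])   # no clean linear relationship

print(x.corr(y))  # 1.0 (up to floating-point precision)
print(x.corr(z))  # some value strictly between -1 and 1
```

The same `corr()` method is applied to the full signals DataFrame below.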
# Calculate the correlation between all signals and stock-returns.
df_corr = df_sig_rets.corr()

# Show how the signals are correlated with the stock-returns.
# Sorted to show the strongest absolute correlations first.
df_corr_returns = df_corr[TOTAL_RETURN_1_3Y].abs().sort_values(ascending=False)
df_corr_returns
Output :
Total Return 1-3 Years               1.000000
Current Ratio                        0.309601
Quick Ratio                          0.293966
Log Revenue                          0.220491
Market-Cap                           0.217647
P/FCF                                0.198681
Interest Coverage                    0.196469
Share Buyback / FCF                  0.194303
Earnings Yield                       0.177254
(Dividends + Share Buyback) / FCF    0.169556
FCF Yield                            0.166700
Return on Equity                     0.154312
Earnings Growth YOY                  0.148643
Assets Growth QOQ                    0.136451
Price to Book Value                  0.091554
FCF Growth YOY                       0.090232
Assets Growth                        0.089341
FCF Growth                           0.084630
Asset Turnover                       0.084033
Dividends / FCF                      0.068013
Gross Profit Margin                  0.062543
Assets Growth YOY                    0.062482
Return on Assets                     0.057145
Sales Growth                         0.054878
Earnings Growth QOQ                  0.042634
Sales Growth YOY                     0.041231
Net Profit Margin                    0.040865
FCF Growth QOQ                       0.037546
CapEx / (Depr + Amor)                0.020960
P/NCAV                               0.020314
Earnings Growth                      0.015977
P/NetNet                             0.014188
Sales Growth QOQ                     0.008207
P/Cash                               0.006348
P/Sales                              0.005994
P/E                                  0.004789
Name: Total Return 1-3 Years, dtype: float64
We can also create a correlation-matrix between all the signals and the 1-3 year average stock-returns.
df_corr.round(2).style.background_gradient(cmap='seismic', axis=None)

Even for a small number of signals, the correlation matrix can easily get confusing. Instead, we can show only the larger correlations:
# Only show the large (absolute) correlation values.
df_corr2 = df_corr[df_corr.abs() > 0.7]

# Round correlation values to 2 digits.
df_corr2 = df_corr2.round(2)

# Transform the table to give a better overview.
df_corr2 = df_corr2.stack()

# Remove all values that are 1.0.
df_corr2 = df_corr2[df_corr2 != 1.0]

# Show the result. Use a DataFrame for pretty printing.
pd.DataFrame(df_corr2, columns=['Correlation'])

Plotting the Final Output
Let us make scatter plots of the most important signals versus the stock returns.
# Plot these signals on the x-axis.
x_vars = [GROSS_PROFIT_MARGIN, P_NETNET]

# Plot the stock-returns on the y-axis.
y_vars = [TOTAL_RETURN_1_3Y]

# Create the plots.
g = sns.PairGrid(df_sig_rets.reset_index(), height=4,
                 x_vars=x_vars, y_vars=y_vars, hue=TICKER)
g.map(sns.scatterplot);

The colours in these plots represent different stock tickers. The Gross Profit Margin was found above to be one of the most important signals for predicting future stock returns over the 1-3 year period. The second scatter plot is for the P/NetNet signal, which had a lower R-squared value than the Gross Profit Margin.
EndNotes
This article implemented a basic financial analysis using SimFin and explored the library and its uses. You can perform more such operations on the data and explore SimFin even further. You can check my implementation in the Colab notebook here.
Happy Learning!